We’re upgrading our smartest model.
The new Claude Opus 4.6 improves on its predecessor’s coding skills. It plans more carefully, sustains agentic tasks for longer, operates more reliably in large codebases, and has stronger code review and debugging skills to catch its own mistakes. And, in a first for our Opus-class models, Opus 4.6 offers a 1M token context window in beta.
Opus 4.6 can also apply its improved abilities to a variety of everyday work tasks: running financial analyses, doing research, and using and creating documents, spreadsheets, and presentations. In Cowork, where Claude can multitask autonomously, Opus 4.6 can put all these skills to work on your behalf.
The model’s performance is state-of-the-art on several evaluations. For instance, it achieves the highest score on the agentic coding evaluation Terminal-Bench 2.0 and leads all other frontier models on Humanity’s Last Exam, a complex multidisciplinary reasoning test. On GDPval-AA—an evaluation of performance on economically valuable knowledge work tasks in finance, legal, and other domains1—Opus 4.6 outperforms the industry’s next-best model (OpenAI’s GPT-5.2) by around 144 Elo points,2 and its own predecessor (Claude Opus 4.5) by 190 points. Opus 4.6 also performs better than any other model on BrowseComp, which measures a model’s ability to locate hard-to-find information online.
As we show in our extensive system card, Opus 4.6 also has an overall safety profile as good as, or better than, any other frontier model in the industry, with low rates of misaligned behavior across safety evaluations.
In Claude Code, you can now assemble agent teams to work on tasks together. On the API, Claude can use compaction to summarize its own context and perform longer-running tasks without bumping up against limits. We’re also introducing adaptive thinking, where the model picks up on contextual cues about how much to use its extended thinking, and new effort controls that give developers more control over intelligence, speed, and cost.
We’ve made substantial upgrades to Claude in Excel, and we’re releasing Claude in PowerPoint as a research preview. Together, these make Claude far more capable for everyday work.
Claude Opus 4.6 is available today on claude.ai, our API, and all major cloud platforms. If you’re a developer, use claude-opus-4-6 via the Claude API. Pricing remains the same at $5/$25 per million input/output tokens; for full details, see our pricing page.
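For developers getting started, a minimal request looks like the sketch below, using the Anthropic Python SDK; the prompt text is just an illustration.

```python
# Minimal sketch: one request to Claude Opus 4.6 via the Anthropic Python SDK
# (pip install anthropic). Assumes ANTHROPIC_API_KEY is set in the environment.
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-6",  # the model ID named above
    max_tokens=1024,
    messages=[{"role": "user", "content": "Review this function for edge cases."}],
)
print(message.content[0].text)
```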
We cover the model, our new product updates, our evaluations, and our extensive safety testing in depth below.
First impressions
We build Claude with Claude. Our engineers write code with Claude Code every day, and each new model first gets tested on our own work. With Opus 4.6, we’ve found that the model brings more focus to the most difficult parts of a task without being told to, moves quickly through the more straightforward parts, handles ambiguous problems with better judgment, and stays productive over longer sessions.
Opus 4.6 often thinks more deeply and revisits its reasoning more carefully before settling on a solution. This produces better results on harder problems, but can add cost and latency on simpler ones. If you find that the model is overthinking a given task, we recommend dialing effort down from its default setting (high) to medium. You can control this easily with the /effort parameter, as in the example below.
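For instance, inside an interactive Claude Code session (the exact command syntax here is our assumption; check the Claude Code docs for specifics):

```
/effort medium
```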
Here are some of the things our Early Access partners told us about Claude Opus 4.6, including its propensity to work autonomously without hand-holding, its success where previous models failed, and its effect on how teams work:
Claude Opus 4.6 is the strongest model Anthropic has shipped. It takes complicated requests and actually follows through, breaking them into concrete steps, executing, and producing polished work even when the task is ambitious. For Notion users, it feels less like a tool and more like a capable collaborator.
Early testing shows Claude Opus 4.6 delivering on the complex, multi-step coding work developers face every day—especially agentic workflows that demand planning and tool calling. This starts unlocking long-horizon tasks at the frontier.
Claude Opus 4.6 is a big leap for agentic planning. It breaks complex tasks into independent subtasks, runs tools and subagents in parallel, and identifies blockers with real precision.
Claude Opus 4.6 is the best model we have tested yet. Its reasoning and planning capabilities have been exceptional at powering our AI Teammates. It is also a fantastic coding model – its ability to navigate a large codebase and identify the right changes to make is state-of-the-art.
Claude Opus 4.6 reasons through complex problems at a level we haven’t seen before. It considers edge cases that other models miss and consistently lands on more elegant, well-considered solutions. We’re particularly impressed with Opus 4.6 in Devin Review, where it’s increased our bug catching rates.
Claude Opus 4.6 feels noticeably better than Opus 4.5 in Windsurf, especially on tasks that require careful exploration like debugging and understanding unfamiliar codebases. We’ve noticed Opus 4.6 thinks longer, which pays off when deeper reasoning is required.
Claude Opus 4.6 represents a meaningful leap in long-context performance. In our testing, we saw it handle much larger bodies of data with a level of consistency that strengthens how we design and deploy complex research workflows. Progress in this area gives us more powerful building blocks to deliver truly expert-grade systems professionals can trust.
Across 40 cybersecurity investigations, Claude Opus 4.6 produced the best results 38 out of 40 times in a blind rating against Claude 4.5 models. Each model ran end to end on the same agentic harness with up to 9 subagents and 100+ tool calls.
Claude Opus 4.6 is the new frontier on long-running tasks in our internal benchmarks and testing. It has also been highly effective at reviewing code.
Claude Opus 4.6 achieved the highest BigLaw Bench score of any Claude model at 90.2%. With 40% perfect scores and 84% above 0.8, it’s remarkably capable at legal reasoning.
Claude Opus 4.6 autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories. It handled both product and organizational decisions while synthesizing context across multiple domains, and it knew when to escalate to a human.
Claude Opus 4.6 is an uplift in design quality. It works beautifully with our design systems and it’s more autonomous, which is core to Lovable’s values. People should be creating things that matter, not micromanaging AI.
Claude Opus 4.6 excels at high-reasoning tasks like multi-source analysis across legal, financial, and technical content. Box’s eval showed a 10% lift in performance, reaching 68% vs. a 58% baseline, and near-perfect scores in technical domains.
Claude Opus 4.6 generates complex, interactive apps and prototypes in Figma Make with an impressive creative range. The model translates detailed designs and multi-layered tasks into code on the first try, making it a strong starting point for teams to explore and build ideas.
Claude Opus 4.6 is the best Anthropic model we’ve tested. It understands intent with minimal prompting and went above and beyond, exploring and creating details I didn’t even know I wanted until I saw them. It felt like I was working with the model, not waiting on it.
Both hands-on testing and evals show Claude Opus 4.6 is a meaningful improvement for design systems and large codebases, use cases that drive enormous enterprise value. It also one-shotted a fully functional physics engine, handling a large multi-scope task in a single pass.
Claude Opus 4.6 is the biggest leap I’ve seen in months. I’m more comfortable giving it a sequence of tasks across the stack and letting it run. It’s smart enough to use subagents for the individual pieces.
Claude Opus 4.6 handled a multi-million-line codebase migration like a senior engineer. It planned up front, adapted its strategy as it learned, and finished in half the time.
We only ship models in v0 when developers will genuinely feel the difference. Claude Opus 4.6 passed that bar with ease. Its frontier-level reasoning, especially with edge cases, helps v0 deliver on our number-one goal: letting anyone elevate their ideas from prototype to production.
The performance jump with Claude Opus 4.6 feels almost unbelievable. Real-world tasks that were difficult for Opus [4.5] suddenly became easy. This feels like a watershed moment for spreadsheet agents on Shortcut.
Evaluating Claude Opus 4.6
Across agentic coding, computer use, tool use, search, and finance, Opus 4.6 is an industry-leading model, often by a wide margin. The table below shows how Claude Opus 4.6 compares to our previous models and to other industry models on a variety of benchmarks.

Opus 4.6 is much better at retrieving relevant information from large sets of documents. This extends to long-context tasks, where it holds and tracks information over hundreds of thousands of tokens with less drift, and picks up buried details that even Opus 4.5 would miss.
A common complaint about AI models is “context rot,” where performance degrades as conversations exceed a certain number of tokens. Opus 4.6 performs markedly better than its predecessors: on the 8-needle 1M variant of MRCR v2—a needle-in-a-haystack benchmark that tests a model’s ability to retrieve information “hidden” in vast amounts of text—Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%. This is a qualitative shift in how much context a model can actually use while maintaining peak performance.
All in all, Opus 4.6 is better at finding information across long contexts, better at reasoning after absorbing that information, and has substantially stronger expert-level reasoning abilities in general.
Finally, the charts below show how Claude Opus 4.6 performs on a variety of benchmarks that assess its software engineering skills, multilingual coding ability, long-term coherence, cybersecurity capabilities, and life sciences knowledge.
A step forward on safety
These intelligence gains don’t come at the cost of safety. On our automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall, it’s just as well-aligned as its predecessor, Claude Opus 4.5, which was our most-aligned frontier model to date. Opus 4.6 also shows the lowest rate of over-refusals—where the model declines to answer benign queries—of any recent Claude model.

For Claude Opus 4.6, we ran the most comprehensive set of safety evaluations of any of our models, applying many different tests for the first time and upgrading several that we’ve used before. We included new evaluations for user wellbeing, more complex tests of the model’s ability to refuse potentially dangerous requests, and updated evaluations of the model’s ability to surreptitiously perform harmful actions. We also experimented with new methods from interpretability, the science of the inner workings of AI models, to begin to understand why the model behaves in certain ways—and, ultimately, to catch problems that standard testing might miss.
A detailed description of all capability and safety evaluations is available in the Claude Opus 4.6 system card.
We’ve also applied new safeguards in areas where Opus 4.6 shows particular strengths that could be put to dangerous as well as beneficial uses. Specifically, since the model shows enhanced cybersecurity abilities, we’ve developed six new cybersecurity probes—methods of detecting harmful responses—to help us track different kinds of potential misuse.
We’re also accelerating the cyberdefensive uses of the model, using it to help find and patch vulnerabilities in open-source software (as we describe in our recent cybersecurity blog post). We think it’s critical that cyberdefenders use AI models like Claude to help level the playing field. Cybersecurity moves fast, and we’ll be adjusting and updating our safeguards as we learn more about potential threats; in the near future, we may institute real-time interventions to block abuse.
Product and API updates
We’ve made substantial updates across Claude, Claude Code, and the Claude Developer Platform to let Opus 4.6 perform at its best.
Claude Developer Platform
On the API, we’re giving developers finer control over model effort and more flexibility for long-running agents. To do so, we’re introducing the following features (a request sketch combining them follows this list):
- Adaptive thinking. Previously, developers had only a binary choice between enabling and disabling extended thinking. Now, with adaptive thinking, Claude can decide when deeper reasoning would be helpful. At the default effort level (high), the model uses extended thinking when useful, but developers can adjust the effort level to make it more or less selective.
- Effort. There are now four effort levels to choose from: low, medium, high (default), and max. We encourage developers to experiment with different options to find what works best.
- Context compaction (beta). Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.
- 1M token context (beta). Opus 4.6 is our first Opus-class model with a 1M token context window. Premium pricing applies for prompts exceeding 200k tokens ($10/$37.50 per million input/output tokens).
- 128k output tokens. Opus 4.6 supports outputs of up to 128k tokens, which lets Claude complete larger-output tasks without breaking them into multiple requests.
- US-only inference. For workloads that must run in the US, US-only inference is available at 1.1× token pricing.
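Taken together, a long-running agent request might look like the following sketch. The beta flag names and the effort field below are placeholders rather than confirmed API identifiers; check the Claude API documentation for the exact spellings.

```python
# Illustrative sketch only: the beta flag names and the "effort" field are
# placeholders for the features described above, not confirmed identifiers.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=128000,  # Opus 4.6 supports outputs of up to 128k tokens
    # Hypothetical beta flags for context compaction and the 1M token window:
    betas=["context-compaction", "context-1m"],
    # Hypothetical effort control; "high" is the default, at which adaptive
    # thinking decides on its own when extended thinking is worthwhile.
    extra_body={"effort": "high"},
    messages=[
        {
            "role": "user",
            "content": "Migrate the billing module and summarize every change.",
        }
    ],
)
print(response.content[0].text)
```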
Product updates
Across Claude and Claude Code, we’ve added features that let knowledge workers and developers tackle harder tasks with more of the tools they use every day.
We’ve introduced agent teams in Claude Code as a research preview. You can now spin up multiple agents that work in parallel as a team and coordinate autonomously—ideal for tasks that split into independent, read-heavy work like codebase reviews. You can take over any subagent directly using Shift+Up/Down or tmux.
Claude now also works better with the office tools you already use. Claude in Excel handles long-running and harder tasks with improved performance: it can plan before acting, ingest unstructured data and infer the right structure without guidance, and handle multi-step changes in a single pass. Pair that with Claude in PowerPoint, and you can first process and structure your data in Excel, then bring it to life visually in PowerPoint. Claude reads your layouts, fonts, and slide masters to stay on brand, whether you’re building from a template or generating a full deck from an outline. Claude in PowerPoint is now available in research preview for Max, Team, and Enterprise plans.
