Introducing Sonnet 4.6 Anthropic

Claude Sonnet 4.6 is our most capable Sonnet model yet. It’s a full upgrade of the model’s skills across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. Sonnet 4.6 also contains a 1M token context window in beta.

For those on our Free and Pro plans, Claude Sonnet 4.6 is now the default model in claude.ai and Claude Cowork. Pricing stays the identical as Sonnet 4.5, starting at $3/$15 per million tokens.

Sonnet 4.6 brings much-improved coding skills to more of our users. Improvements in consistency, instruction following, and more have made developers with early access prefer Sonnet 4.6 to its predecessor by a large margin. They often even prefer it to our smartest model from November 2025, Claude Opus 4.5.

Performance that will have previously required reaching for an Opus-class model—including on real-world, economically beneficial office tasks—is now available with Sonnet 4.6. The model also shows a significant improvement in computer use skills in comparison with prior Sonnet models.

As with every latest Claude model, we’ve run extensive safety evaluations of Sonnet 4.6, which overall showed it to be as secure as, or safer than, our other recent Claude models. Our safety researchers concluded that Sonnet 4.6 has “a broadly warm, honest, prosocial, and at times funny character, very strong safety behaviors, and no signs of major concerns around high-stakes types of misalignment.”

Computer use

Almost every organization has software it might probably’t easily automate: specialized systems and tools built before modern interfaces like APIs existed. To have AI use such software, users would previously have had to construct bespoke connectors. But a model that may use a pc the best way an individual does changes that equation.

In October 2024, we were the first to introduce a general-purpose computer-using model. On the time, we wrote that it was “still experimental—at times cumbersome and error-prone,” but we expected rapid improvement. OSWorld, the usual benchmark for AI computer use, shows how far our models have come. It presents a whole lot of tasks across real software (Chrome, LibreOffice, VS Code, and more) running on a simulated computer. There are not any special APIs or purpose-built connectors; the model sees the pc and interacts with it in much the identical way an individual would: clicking a (virtual) mouse and typing on a (virtual) keyboard.

Across sixteen months, our Sonnet models have made regular gains on OSWorld. The improvements may also be seen beyond benchmarks: early Sonnet 4.6 users are seeing human-level capability in tasks like navigating a posh spreadsheet or filling out a multi-step web form, before pulling all of it together across multiple browser tabs.

The model actually still lags behind essentially the most expert humans at using computers. However the rate of progress is remarkable nonetheless. It implies that computer use is far more useful for a variety of labor tasks—and that substantially more capable models are close by.

Chart comparing several Sonnet model scores on the OSWorld benchmark — Scores prior to Claude Sonnet 4.5 were measured on the unique OSWorld; scores from Sonnet 4.5 onward use OSWorld-Verified. OSWorld-Verified (released July 2025) is an in-place upgrade of the unique OSWorld benchmark, with updates to task quality, evaluation grading, and infrastructure.

At the identical time, computer use poses risks: malicious actors can try and hijack the model by hiding instructions on web sites in what’s often known as a prompt injection attack. We’ve been working to enhance our models’ resistance to prompt injections—our safety evaluations show that Sonnet 4.6 is a significant improvement in comparison with its predecessor, Sonnet 4.5, and performs similarly to Opus 4.6. You’ll find out more about learn how to mitigate prompt injections and other safety concerns in our API docs.

Evaluating Claude Sonnet 4.6

Beyond computer use, Claude Sonnet 4.6 has improved on benchmarks across the board. It approaches Opus-level intelligence at a price point that makes it more practical for much more tasks. You’ll find a full discussion of Sonnet 4.6’s capabilities and its safety-related behaviors in our system card; a summary and comparison to other recent models is below.

A table of popular benchmarks and Sonnet 4.6's relative performance compared to other frontier models

In Claude Code, our early testing found that users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. Users reported that it more effectively read the context before modifying code and consolidated shared logic relatively than duplicating it. This made it less frustrating to make use of over long sessions than earlier models.

Users even preferred Sonnet 4.6 to Opus 4.5, our frontier model from November, 59% of the time. They rated Sonnet 4.6 as significantly less susceptible to overengineering and “laziness,” and meaningfully higher at instruction following. They reported fewer false claims of success, fewer hallucinations, and more consistent follow-through on multi-step tasks.

Sonnet 4.6’s 1M token context window is sufficient to hold entire codebases, lengthy contracts, or dozens of research papers in a single request. More importantly, Sonnet 4.6 reasons effectively across all that context. This could make it a lot better at long-horizon planning. We saw this particularly clearly within the Vending-Bench Arena evaluation, which tests how well a model can run a (simulated) business over time—and which incorporates a component of competition, with different AI models facing off against one another to make the most important profits.

Sonnet 4.6 developed an interesting latest strategy: it invested heavily in capability for the primary ten simulated months, spending significantly greater than its competitors, after which pivoted sharply to concentrate on profitability in the ultimate stretch. The timing of this pivot helped it finish well ahead of the competition.

Sonnet 4.6 outperforms Sonnet 4.5 on Vending-Bench Arena by investing in capability early, then pivoting to profitability in the ultimate stretch.

Early customers also reported broad improvements, with frontend code and financial evaluation standing out. Customers independently described visual outputs from Sonnet 4.6 as notably more polished, with higher layouts, animations, and design sensibility than those from previous models. Customers also needed fewer rounds of iteration to succeed in production-quality results.

Claude Sonnet 4.6 matches Opus 4.6 performance on OfficeQA, which measures how well a model can read enterprise documents (charts, PDFs, tables), pull the proper facts, and reason from those facts. It’s a meaningful upgrade for document comprehension workloads.

The performance-to-cost ratio of Claude Sonnet 4.6 is extraordinary—it’s hard to overstate how briskly Claude models have been evolving in recent months. Sonnet 4.6 outperforms on our orchestration evals, handles our most complex agentic workloads, and keeps improving the upper you push the hassle settings.

Claude Sonnet 4.6 is a notable improvement over Sonnet 4.5 across the board, including long-horizon tasks and harder problems.

Out of the gate, Claude Sonnet 4.6 is already excelling at complex code fixes, especially when searching across large codebases is important. For teams running agentic coding at scale, we’re seeing strong resolution rates and the sort of consistency developers need.

Claude Sonnet 4.6 has meaningfully closed the gap with Opus on bug detection, letting us run more reviewers in parallel, catch a greater diversity of bugs, and do all of it without increasing cost.

For the primary time, Sonnet brings frontier-level reasoning in a smaller and more cost effective form factor. It provides a viable alternative for those who are a heavy Opus user.

Claude Sonnet 4.6 meaningfully improves the reply retrieval behind our core product—we saw a big jump in answer match rate in comparison with Sonnet 4.5 in our Financial Services Benchmark, with higher recall on the particular workflows our customers rely upon.

Box evaluated how Claude Sonnet 4.6 performs when tested on deep reasoning and complicated agentic tasks across real enterprise documents. It demonstrated significant improvements, outperforming Claude Sonnet 4.5 in heavy reasoning Q&A by 15 percentage points.

Claude Sonnet 4.6 hit 94% on our insurance benchmark, making it the highest-performing model we’ve tested for computer use. This type of accuracy is mission-critical to workflows like submission intake and first notice of loss.

Claude Sonnet 4.6 delivers frontier-level results on complex app builds and bug-fixing. It’s becoming our go-to for the sort of deep codebase work that used to require costlier models.

Claude Sonnet 4.6 produced the most effective iOS code we’ve tested for Rakuten AI. Higher spec compliance, higher architecture, and it reached for contemporary tooling we didn’t ask for, multi functional shot. The outcomes genuinely surprised us.

Sonnet 4.6 is a big breakthrough on reasoning through difficult tasks. We discover it especially strong on branched and multi-step tasks like contract routing, conditional template selection, and CRM coordination—exactly where our customers need strong model sense and reliability.

We’ve been impressed by how accurately Claude Sonnet 4.6 handles complex computer use. It’s a transparent improvement over the rest we’ve tested in our evals.

Claude Sonnet 4.6 has perfect design taste when constructing frontend pages and data reports, and it requires far less hand-holding to get there than anything we’ve tested before.

Claude Sonnet 4.6 was exceptionally conscious of direction — delivering precise figures and structured comparisons when asked, while also generating genuinely useful ideas on trial strategy and exhibit preparation.

Product updates

On the Claude Developer Platform, Sonnet 4.6 supports each adaptive considering and prolonged considering, in addition to context compaction in beta, which robotically summarizes older context as conversations approach limits, increasing effective context length.

On our API, Claude’s web search and fetch tools now robotically write and execute code to filter and process search results, keeping only relevant content in context—improving each response quality and token efficiency. Moreover, code execution, memory, programmatic tool calling, tool search, and tool use examples are actually generally available.

Sonnet 4.6 offers strong performance at any considering effort, even with prolonged considering off. As a part of your migration from Sonnet 4.5, we recommend exploring across the spectrum to search out the best balance of speed and reliable performance, depending on what you’re constructing.

We discover that Opus 4.6 stays the strongest option for tasks that demand the deepest reasoning, equivalent to codebase refactoring, coordinating multiple agents in a workflow, and problems where getting it just right is paramount.

For Claude in Excel users, our add-in now supports MCP connectors, letting Claude work with the opposite tools you utilize day-to-day, like S&P Global, LSEG, Daloopa, PitchBook, Moody’s, and FactSet. You may ask Claude to drag in context from outside your spreadsheet without ever leaving Excel. Should you’ve already arrange MCP connectors in Claude.ai, those self same connections will work in Excel robotically. This is out there on Pro, Max, Team, and Enterprise plans.

The right way to use Claude Sonnet 4.6

Claude Sonnet 4.6 is out there now on all Claude plans, Claude Cowork, Claude Code, our API, and all major cloud platforms. We’ve also upgraded our free tier to Sonnet 4.6 by default—it now includes file creation, connectors, skills, and compaction.

Should you’re a developer, you possibly can start quickly through the use of claude-sonnet-4-6 via the Claude API.

Source link

Introducing Sonnet 4.6 Anthropic

Computer use

Evaluating Claude Sonnet 4.6

Product updates

The right way to use Claude Sonnet 4.6

What are your thoughts on this topic?
Let us know in the comments below.

Share this article

Recent posts

Detecting and Editing Visual Objects with Gemini

A Generalizable MARL-LP Approach for Scheduling in Logistics

Designing Data and AI Systems That Hold Up in Production

Latest AirSnitch attack breaks Wi-Fi encryption in homes, offices, and enterprises

Google’s latest AI image generation model

Introducing Sonnet 4.6 Anthropic

Computer use

Evaluating Claude Sonnet 4.6

Product updates

The right way to use Claude Sonnet 4.6

What are your thoughts on this topic? Let us know in the comments below.

Share this article

Recent posts

What are your thoughts on this topic?
Let us know in the comments below.