Our newest model, Claude Opus 4.5, is available today. It’s intelligent, efficient, and the best model in the world for coding, agents, and computer use. It’s also meaningfully better at everyday tasks like deep research and working with slides and spreadsheets. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
Claude Opus 4.5 is state-of-the-art on tests of real-world software engineering:

Opus 4.5 is available today in our apps, via our API, and on all three major cloud platforms. If you’re a developer, simply use claude-opus-4-5-20251101 via the Claude API. Pricing is now $5/$25 per million tokens—making Opus-level capabilities accessible to many more users, teams, and enterprises.
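For developers getting started, a minimal request with Anthropic’s Python SDK looks like the sketch below. It assumes the `anthropic` package is installed and an `ANTHROPIC_API_KEY` environment variable is set; the prompt is just an example.

```python
import anthropic

# The client reads ANTHROPIC_API_KEY from the environment by default.
client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-5-20251101",  # model ID from this announcement
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize the tradeoffs between REST and gRPC."}
    ],
)

# The response body is a list of content blocks; the first one here is text.
print(message.content[0].text)
```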
Alongside Opus, we’re releasing updates to the Claude Developer Platform, Claude Code, and our consumer apps. There are new tools for longer-running agents and new ways to use Claude in Excel, Chrome, and on the desktop. In the Claude apps, long conversations no longer hit a wall. See our product-focused section below for details.
First impressions
As our Anthropic colleagues tested the model before release, we heard remarkably consistent feedback. Testers noted that Claude Opus 4.5 handles ambiguity and reasons about tradeoffs without hand-holding. They told us that, when pointed at a complex, multi-system bug, Opus 4.5 figures out the fix. They said that tasks that were near-impossible for Sonnet 4.5 just a few weeks ago are now within reach. Overall, our testers told us that Opus 4.5 just “gets it.”
Many of our customers with early access have had similar experiences. Here are some examples of what they told us:
Opus models have always been “the real SOTA” but have been cost-prohibitive in the past. Claude Opus 4.5 is now at a price point where it can be your go-to model for most tasks. It’s the clear winner and exhibits the best frontier task planning and tool calling we’ve seen yet.
Claude Opus 4.5 delivers high-quality code and excels at powering heavy-duty agentic workflows with GitHub Copilot. Early testing shows it surpasses internal coding benchmarks while cutting token usage in half, and is particularly well-suited for tasks like code migration and code refactoring.
Claude Opus 4.5 beats Sonnet 4.5 and the competition on our internal benchmarks, using fewer tokens to solve the same problems. At scale, that efficiency compounds.
Claude Opus 4.5 delivers frontier reasoning inside Lovable’s chat mode, where users plan and iterate on projects. Its reasoning depth transforms planning—and great planning makes code generation even better.
Claude Opus 4.5 excels at long-horizon, autonomous tasks, especially those that require sustained reasoning and multi-step execution. In our evaluations it handled complex workflows with fewer dead-ends. On Terminal-Bench it delivered a 15% improvement over Sonnet 4.5, a meaningful gain that becomes especially clear when using Warp’s Planning Mode.
Claude Opus 4.5 achieved state-of-the-art results for complex enterprise tasks on our benchmarks, outperforming previous models on multi-step reasoning tasks that combine information retrieval, tool use, and deep analysis.
Claude Opus 4.5 delivers measurable gains where it matters most: stronger results on our hardest evaluations and consistent performance through 30-minute autonomous coding sessions.
Claude Opus 4.5 represents a breakthrough in self-improving AI agents. For automation of office tasks, our agents were able to autonomously refine their own capabilities—achieving peak performance in 4 iterations while other models couldn’t match that quality after 10. They also demonstrated the ability to learn from experience across technical tasks, storing insights and applying them later.
Claude Opus 4.5 is a notable improvement over prior Claude models inside Cursor, with improved pricing and intelligence on difficult coding tasks.
Claude Opus 4.5 is yet another example of Anthropic pushing the frontier of general intelligence. It performs exceedingly well across difficult coding tasks, showcasing long-term goal-directed behavior.
Claude Opus 4.5 delivered an impressive refactor spanning two codebases and three coordinated agents. It was very thorough, helping develop a robust plan, handling the details, and fixing tests. A clear step forward from Sonnet 4.5.
Claude Opus 4.5 handles long-horizon coding tasks more efficiently than any model we’ve tested. It achieves higher pass rates on held-out tests while using up to 65% fewer tokens, giving developers real cost control without sacrificing quality.
We’ve found that Opus 4.5 excels at interpreting what users actually want, producing shareable content on the first try. Combined with its speed, token efficiency, and surprisingly low cost, it’s the first time we’re making Opus available in Notion Agent.
Claude Opus 4.5 excels at long-context storytelling, generating 10-15 page chapters with strong organization and consistency. It has unlocked use cases we couldn’t reliably deliver before.
Claude Opus 4.5 sets a new standard for Excel automation and financial modeling. Accuracy on our internal evals improved 20%, efficiency rose 15%, and complex tasks that once seemed out of reach became achievable.
Claude Opus 4.5 is the only model that nails some of our hardest 3D visualizations. Polished design, tasteful UX, and excellent planning and orchestration—all with more efficient token usage. Tasks that took previous models two hours now take thirty minutes.
Claude Opus 4.5 catches more issues in code reviews without sacrificing precision. For production code review at scale, that reliability matters.
Based on testing with Junie, our coding agent, Claude Opus 4.5 outperforms Sonnet 4.5 across all benchmarks. It requires fewer steps to solve tasks and uses fewer tokens as a result. This suggests that the new model is more precise and follows instructions more effectively—a direction we’re very excited about.
The effort parameter makes sense. Claude Opus 4.5 feels dynamic rather than overthinking, and at lower effort it delivers the same quality we need while being dramatically more efficient. That control is exactly what our SQL workflows demand.
We’re seeing 50% to 75% reductions in both tool-calling errors and build/lint errors with Claude Opus 4.5. It consistently finishes complex tasks in fewer iterations with more reliable execution.
Claude Opus 4.5 is smooth, with none of the rough edges we’ve seen from other frontier models. The speed improvements are remarkable.
Evaluating Claude Opus 4.5
We give prospective performance engineering candidates a notoriously difficult take-home exam. We also test new models on this exam as an internal benchmark. Within our prescribed 2-hour time limit, Claude Opus 4.5 scored higher than any human candidate ever has.1
The take-home test is designed to assess technical ability and judgment under time pressure. It doesn’t test for other crucial skills candidates may possess, like collaboration, communication, or the instincts that develop over years. But this result—where an AI model outperforms strong candidates on important technical skills—raises questions about how AI will change engineering as a profession. Our Societal Impacts and Economic Futures research is aimed at understanding these kinds of changes across many fields. We plan to share more results soon.
Software engineering isn’t the only area where Claude Opus 4.5 has improved. Capabilities are better across the board—Opus 4.5 has better vision, reasoning, and mathematics skills than its predecessors, and it’s state-of-the-art in many domains:2

The model’s capabilities outpace some of the benchmarks we use in our tests. A common benchmark for agentic capabilities is τ2-bench, which measures the performance of agents on real-world, multi-turn tasks. In one scenario, models must act as an airline service agent helping a distressed customer. The benchmark expects models to refuse a modification to a basic economy booking, because the airline doesn’t allow changes to that class of tickets. Instead, Opus 4.5 found an insightful (and legitimate) way to solve the problem: upgrade the cabin first, then modify the flights.
Let me think about what options I have within my policy:
1. Modify flights - Basic economy cannot be modified. This is clear in the policy.
2. Change cabin - Wait, let me check this option! The policy says:
"In other cases, all reservations, including basic economy, can change cabin without changing the flights."
But he wants to change the flights, not just the cabin. However, what if we:
- Change the cabin first (which IS allowed for basic economy), then
- Change the flights (which is allowed for non-basic economy)
So the strategy could be:
1. Upgrade his cabin from basic economy to economy (or business)
2. Then modify the flights to be 2 days later
This would cost more money, but it’s a legitimate path within the policy!
The benchmark technically scored this as a failure because Claude’s way of helping the customer was unanticipated. But this kind of creative problem-solving is exactly what we’ve heard about from our testers and customers—it’s what makes Claude Opus 4.5 feel like a meaningful step forward.
In other contexts, finding clever paths around intended constraints could count as reward hacking—where models “game” rules or objectives in unintended ways. Preventing such misalignment is one of the goals of our safety testing, discussed in the next section.
A step forward on safety
As we state in our system card, Claude Opus 4.5 is the most robustly aligned model we have released to date and, we suspect, the best-aligned frontier model by any developer. It continues our trend toward increasingly safe models:

Our customers often use Claude for critical tasks. They need to be confident that, in the face of malicious attacks by hackers and cybercriminals, Claude has the training and the “street smarts” to avoid trouble. With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:

You can find a detailed description of all our capability and safety evaluations in the Claude Opus 4.5 system card.
New on the Claude Developer Platform
As models get smarter, they can solve problems in fewer steps: less backtracking, less redundant exploration, less verbose reasoning. Claude Opus 4.5 uses dramatically fewer tokens than its predecessors to reach similar or better outcomes.
But different tasks call for different tradeoffs. Sometimes developers want a model to keep thinking about a problem; sometimes they want something more nimble. With our new effort parameter on the Claude API, you can choose to minimize time and spend, or to maximize capability.
Set to a medium effort level, Opus 4.5 matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens. At its highest effort level, Opus 4.5 exceeds Sonnet 4.5’s performance by 4.3 percentage points—while using 48% fewer tokens.
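As a rough sketch, here is how a per-request effort level could be passed with the Python SDK. The field name, placement, and accepted values ("medium" below) are assumptions on our part; consult the current Claude API reference for the exact syntax.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5-20251101",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Refactor this function for clarity: ..."}],
    # Assumed request shape: the effort field is passed via extra_body because
    # its exact name and placement are not confirmed here.
    extra_body={"effort": "medium"},
)

print(response.content[0].text)
```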

With effort control, context compaction, and advanced tool use, Claude Opus 4.5 runs longer, does more, and requires less intervention.
Our context management and memory capabilities can dramatically boost performance on agentic tasks. Opus 4.5 is also very effective at managing a team of subagents, enabling the construction of complex, well-coordinated multi-agent systems. In our testing, the combination of all these techniques boosted Opus 4.5’s performance on a deep research evaluation by almost 15 percentage points.4
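To make the compaction idea concrete, the sketch below shows one client-side pattern: once a conversation grows past a threshold, older turns are summarized into a single message so an agent can keep going. This is purely illustrative and is not the platform’s built-in context management feature; the function name, threshold, and prompt wording are our own.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-5-20251101"

def compact_history(messages, keep_recent=6):
    """Summarize all but the most recent turns into one context message.

    Illustrative only: this mimics the effect of context compaction on the
    client side. The keep_recent threshold is an arbitrary choice.
    """
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation, preserving key facts, "
                       "decisions, and open questions:\n\n" + transcript,
        }],
    )
    # Replace the older turns with the summary; keep recent turns verbatim.
    # NOTE: production code should also preserve the user/assistant role
    # alternation that the Messages API expects.
    compacted = [{
        "role": "user",
        "content": "[Summary of earlier conversation]\n" + summary.content[0].text,
    }]
    return compacted + recent
```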
We’re making our Developer Platform more composable over time. We want to give you the building blocks to build exactly what you need, with full control over efficiency, tool use, and context management.
Product updates
Products like Claude Code show what’s possible when the kinds of upgrades we’ve made to the Claude Developer Platform come together. Claude Code gains two upgrades with Opus 4.5. Plan Mode now builds more precise plans and executes more thoroughly—Claude asks clarifying questions upfront, then builds a user-editable plan.md file before executing.
Claude Code is also now available in our desktop app, letting you run multiple local and remote sessions in parallel: perhaps one agent fixes bugs, another researches GitHub, and a third updates docs.
For Claude app users, long conversations no longer hit a wall—Claude automatically summarizes earlier context as needed, so you can keep the chat going. Claude for Chrome, which lets Claude handle tasks across your browser tabs, is now available to all Max users. We announced Claude for Excel in October, and as of today we’ve expanded beta access to all Max, Team, and Enterprise users. Each of these updates takes advantage of Claude Opus 4.5’s market-leading performance in using computers and spreadsheets and in handling long-running tasks.
For Claude and Claude Code users with access to Opus 4.5, we’ve removed Opus-specific caps. For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet. We’re updating usage limits to ensure you can use Opus 4.5 for daily work. These limits are specific to Opus 4.5. As future models surpass it, we expect to update limits as needed.
