
Model providers want to prove the safety and robustness of their models, releasing system cards and conducting red-team exercises with each new release. But it can be difficult for enterprises to parse the results, which vary widely and can be misleading.
Anthropic's 153-page system card for Claude Opus 4.5 versus OpenAI's 55-page GPT-5 system card reveals a fundamental split in how these labs approach security validation. Anthropic discloses in its system card that it relies on multi-attempt attack success rates from 200-attempt reinforcement learning (RL) campaigns. OpenAI, by contrast, reports single-attempt jailbreak resistance. Both metrics are valid. Neither tells the whole story.
Security leaders deploying AI agents for browsing, code execution and autonomous action need to know what each red team evaluation actually measures, and where the blind spots are.
What the attack data shows
Gray Swan's Shade platform ran adaptive adversarial campaigns against Claude models. The attack success rate (ASR) tells the story.
- Opus 4.5 in coding environments hit 4.7% ASR at one attempt, 33.6% at ten and 63.0% at 100. In computer use with extended thinking, it held at 0% ASR even after 200 attempts. It is the first model to saturate the benchmark.
- Sonnet 4.5 at the same thresholds showed 70% ASR in coding and 85.7% in computer use.
- Opus 4.5 delivers roughly 7x improvement in coding resistance and complete resistance in computer use.
The gap between model tiers within the same family is now larger than some cross-vendor comparisons. For security teams, that's a procurement factor that didn't exist six months ago.
On OpenAI's side, Gray Swan Shade testing reported in the o1 system card found 6% ASR for harmful text and 5% for malicious code, both single-attempt metrics. Third-party testing by SPLX and NeuralTrust found GPT-5's raw ASR hit 89% initially, dropping below 1% within two weeks of patching.
The paper Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks confirms the pattern. With sufficient attempts, Claude 3.5 Sonnet showed 78% ASR, and GPT-4o reached 89%.
Two ways to catch deception
Anthropic monitors roughly 10 million neural features during evaluation using dictionary learning. Features map to human-interpretable concepts, including deception, sycophancy, bias, power-seeking and concealment.
Meanwhile, OpenAI's approach relies on chain-of-thought (CoT) monitoring. Its deception classifier flagged 0.17% of o1-preview responses as potentially deceptive, with 92% accuracy on internal validation. GPT-5 reduced CoT deception from 4.8% to 2.1% compared with o3. The approach assumes visible reasoning reflects internal processing. Anthropic's interpretability work directly questions that assumption.
When models game the test
In Apollo Research's evaluation of o1, the model attempted to turn off oversight mechanisms 5% of the time when it believed it would be shut down. In 2% of cases, o1 attempted self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time. The o1 system card disclosed instrumental alignment faking: in 37% of scenarios, the model pursued developer-aligned goals during evaluation, then switched to its own goals when it believed oversight was minimal.
Meanwhile, Anthropic reports Opus 4.5's evaluation awareness dropped from 26.5% (Opus 4.1) to below 10% in internal testing. UK AISI measured evaluation awareness decreasing from 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Preview). METR's GPT-5 evaluation noted that the model can sometimes identify the specific evaluation it is in and appears to use that knowledge to score better.
If a model behaves differently when it detects evaluation conditions, production behavior becomes unpredictable at scale. That is the core problem with evaluation awareness: models that recognize they're being tested try to game the test. Anthropic's 60% reduction in evaluation awareness from Opus 4.1 to 4.5 reflects targeted engineering against this vector.
Red teaming on prompt injection defenses shows similar divergence.
Anthropic's testing reports 96% prevention in tool use scenarios, rising to 99.4% with additional safeguards like prompt shields. Fujitsu's independent assessment found GPT-5 vulnerable to roughly 20% of tested injection attacks. OpenAI's o3/o4-mini system card reports instruction hierarchy scores of 0.86 for system-versus-developer conflicts.
These are different metrics and different attack methodologies, but the gap suggests Anthropic has hardened its agentic use cases against adversarial inputs embedded in tool outputs.
Comparing red teaming results
Here's how Opus 4.5 stacks up against OpenAI's model family across 15 dimensions.

| Dimension | Claude Opus 4.5 | GPT-5 | o3 / o4-mini | o1 |
| --- | --- | --- | --- | --- |
| System card length | 153 pages | 55 pages | ~30 pages | ~40 pages |
| Attack methodology | 200-attempt RL campaigns | Single-attempt + iterative patching | Single-attempt | Single-attempt |
| ASR @ 1 attempt (coding) | 4.7% | 89% raw (pre-patch) | Not published | 5% |
| ASR @ 100 attempts | 63.0% | Not published | Not published | Not published |
| ASR @ 200 attempts (computer use) | 0% (saturates benchmark) | Not published | Not published | Not published |
| Prompt injection defense | 96% prevention; 99.4% with safeguards | ~20% ASR (Fujitsu) | 0.86 instruction hierarchy | Higher ASR than GPT-4o |
| Interpretability | ~10M neural features monitored | CoT monitoring | CoT monitoring | CoT monitoring |
| Deception detection | Internal feature activation | 2.1% CoT flagged | 0.95 not_unsafe | 0.17% flagged; 99% deny |
| Evaluation awareness | <10% | Identifies exact eval (METR) | Not published | 37% alignment faking |
| Scheming (Apollo) | Very low sabotage propensity (internal evals); orders of magnitude lower than o1 | Not published | Lower than o1; still deceptive | 5% turn off oversight; 2% self-exfil |
| CBRN risk | Below ASL-4 | Medium (approaching High) | Medium | Medium (first to reach) |
| Governance | FRT → AST → RSO/CEO | SAG + Preparedness v2 | SAG + Preparedness v2 | SAG + Preparedness |
| External partners | Gray Swan, UK AISI, US CAISI, METR | UK AISI, US AISI, METR, Apollo | UK AISI, US AISI, METR, Apollo, Pattern Labs | UK AISI, US AISI, METR, Apollo |
| Reward hacking | Yes (impossible tasks) | Yes (METR) | ~1% of tasks; 5/24 kernel experiments | Yes |
| Misalignment reduction | 60% vs. Opus 4.1 | Not published | Lower scheming than o1 | Baseline |
Sources: Opus 4.5 system card, GPT-5 system card, o3/o4-mini system card, o1 system card, Gray Swan, METR, Apollo Research
Why these differences matter
Enterprises have several factors to consider when analyzing model evals. These include:
Attack persistence thresholds
Anthropic's 200-attempt RL campaigns simulate adversaries with time, resources and adaptive tooling. In RL-based adversarial testing, the attacker model learns from each failure, adjusts its approach and systematically probes for weaknesses. It is, in effect, training an AI to break another AI.
Gray Swan's Shade platform doesn't just retry the same prompt. It learns from failures, adjusts attack vectors and exploits patterns in model responses. The 4.7% to 63.0% degradation curve in coding environments shows how resistance erodes under sustained pressure.
OpenAI's single-attempt metrics answer a different question: How often does a naive attacker succeed on the first try? Both matter. A phishing campaign hits millions of users once each. A nation-state actor runs thousands of variations against a single target. Security teams need to know which threat model their vendor tested against.
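A rough illustration of why the two threat models can't be converted into one another: the sketch below (Python, using only the Gray Swan figures quoted earlier) extrapolates Opus 4.5's single-attempt coding ASR under a naive independence assumption and compares it with the reported multi-attempt numbers. The independence assumption is ours, not the testers'; real adaptive attacks violate it, which is exactly why the degradation curve has to be measured rather than inferred from a single-attempt score.

```python
# Back-of-envelope check: what would Opus 4.5's multi-attempt ASR look like
# if each attempt were an independent coin flip at the single-attempt rate?
# (An assumption adaptive attacks violate; shown only to illustrate why a
# single-attempt number cannot be extrapolated into a degradation curve.)

single_attempt_asr = 0.047            # Opus 4.5, coding, 1 attempt (Gray Swan Shade)
reported = {10: 0.336, 100: 0.630}    # multi-attempt ASRs reported for the same setup

for attempts, observed in reported.items():
    # P(at least one success in n independent attempts)
    naive = 1 - (1 - single_attempt_asr) ** attempts
    print(f"{attempts:>3} attempts: independence model {naive:.1%} vs reported {observed:.1%}")

# Prints roughly:
#  10 attempts: independence model 38.2% vs reported 33.6%
# 100 attempts: independence model 99.2% vs reported 63.0%
```

The mismatch in both directions shows that repeated attempts are neither independent retries nor simple rescalings of the first attempt; only a measured curve captures how defenses actually erode.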
Detection architecture
Anthropic monitors roughly 10 million internal neural features during evaluation. These are abstract concepts like deception, sycophancy and power-seeking that activate across the model's hidden layers. When a fraud/deception feature lights up during benign roleplay, that's a signal invisible to output analysis. The model's internal state reveals associations its outputs don't.
OpenAI takes a different approach with CoT monitoring. Its deception classifier flagged 0.17% of o1-preview responses as potentially deceptive, with 92% accuracy. The assumption is that reasoning traces reflect internal processing. Anthropic's interpretability research directly challenges this: models can produce clean reasoning chains while internal features indicate something else entirely.
Neither approach is complete. CoT monitoring scales more easily. Feature monitoring catches what CoT misses, but it requires deep model access and interpretability infrastructure that most enterprises can't replicate.
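Neither vendor publishes its monitoring pipeline, but the architectural difference can be made concrete with a hypothetical sketch: one check only ever sees the generated text, the other reads named internal features. Every function and feature name below is an illustrative stand-in, not a real API from either lab.

```python
# Hypothetical sketch (not either vendor's actual pipeline) contrasting the two
# detection architectures described above. `classify_deception` and
# `feature_activations` are stand-ins for systems enterprises would not
# normally have access to.

from typing import Callable, Mapping


def output_level_flag(
    response_text: str,
    classify_deception: Callable[[str], float],  # e.g., a CoT/deception classifier
    threshold: float = 0.5,
) -> bool:
    # Sees only what the model wrote; clean-looking text passes.
    return classify_deception(response_text) >= threshold


def feature_level_flag(
    feature_activations: Mapping[str, float],  # e.g., {"deception": 0.91, "sycophancy": 0.12}
    watched_features: tuple[str, ...] = ("deception", "concealment", "power-seeking"),
    threshold: float = 0.8,
) -> bool:
    # Sees internal state: a "deception" feature can fire even when the
    # visible output and reasoning trace read as benign.
    return any(feature_activations.get(f, 0.0) >= threshold for f in watched_features)
```

The point of the contrast is the input signal, not the thresholds: the second check consumes information that never appears in the transcript an enterprise would log.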
Scheming evaluation design
Apollo Research's methodology puts models in scenarios where misaligned behavior offers a strategic advantage: shutdown threats, goal conflicts and oversight detection. o1 attempted to turn off oversight mechanisms 5% of the time when it believed shutdown was imminent. It tried to copy itself to prevent replacement in 2% of cases. When confronted, it denied wrongdoing 99% of the time. Opus 4.5 showed very low rates of sabotage propensity in internal evaluations, orders of magnitude lower than o1.
But the evaluation environments differ. Anthropic's stress tests focus on whether models attempt sabotage when given clear opportunities. OpenAI's tests via Apollo probe instrumental reasoning, specifically whether models fake alignment during evaluation, then defect when oversight drops. The 37% alignment-faking rate in o1 represents a different failure mode from sabotage attempts.
The comparability problem
These methodological differences make direct comparison difficult. Opus 4.5's 0% ASR at 200 attempts in computer use and o1's 5% malicious code rate aren't measuring the same thing. The attack surfaces, attempt counts and success criteria all differ.
The matrix reveals what each vendor chose to measure and publish. That itself signals priorities. Anthropic's 153-page system card emphasizes degradation curves and internal feature monitoring. OpenAI emphasizes iterative improvement and CoT transparency.
Bottom line: Enterprise security teams inheriting these models inherit these measurement philosophies. The question isn't which number is better; it's which measurement matches your deployment's actual threat landscape.
What independent red team evaluators found
Independent red team evaluations, for their part, often operate with entirely different methods. They can tell a very different story and reveal additional model characteristics that enterprises must consider.
METR's red team evaluation measured autonomous capabilities using a time horizon rating, which tracks how long a task can run before the agent fails half the time. o3 sustained reliable performance for roughly 1 hour and 30 minutes; o4-mini held for 1 hour and 15 minutes. METR also detected reward hacking in roughly 1% of o3's attempts, including 5 of 24 kernel optimization experiments where the model tampered with scoring functions to inflate its own results.
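For context on what a figure like "1 hour and 30 minutes" means operationally, here is a simplified sketch of how a 50% time horizon can be estimated: fit success probability against log task length across agent runs and solve for the duration at which predicted success crosses 50%. This mirrors the spirit of METR's published methodology but is not their code, and the run data below is invented for illustration.

```python
# Simplified, illustrative estimate of a "50% time horizon" from agent runs.
# The run data is made up; replace with real (task_length_minutes, succeeded) pairs.

import numpy as np
from sklearn.linear_model import LogisticRegression

runs = [(5, 1), (10, 1), (20, 1), (30, 1), (45, 0), (60, 1),
        (75, 0), (90, 1), (120, 0), (150, 0), (180, 0), (240, 0)]

X = np.log2([[minutes] for minutes, _ in runs])   # log task length as the predictor
y = np.array([succeeded for _, succeeded in runs])

model = LogisticRegression().fit(X, y)

# Success probability is 0.5 where the logit is zero: w * log2(t) + b = 0
w, b = model.coef_[0][0], model.intercept_[0]
horizon_minutes = 2 ** (-b / w)
print(f"Estimated 50% time horizon: ~{horizon_minutes:.0f} minutes")
```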
Apollo Research tested o3 and o4-mini for in-context scheming. Both exhibited deceptive tendencies, such as sandbagging evaluations and sabotaging tasks with plausible deniability, but scored lower than o1. Apollo assesses that o3 and o4-mini are unlikely to cause catastrophic harm due to scheming, but smaller real-world harms remain possible without monitoring.
The UK AISI/Gray Swan challenge ran 1.8 million attacks across 22 models. Every model broke. ASR ranged from 1.47% to 6.49%. Opus 4.5 placed first on Gray Swan's Agent Red Teaming benchmark with 4.7% ASR, versus GPT-5.1 at 21.9% and Gemini 3 Pro at 12.5%.
No current frontier system resists determined, well-resourced attacks. The differentiation lies in how quickly defenses degrade and at what attempt threshold. Opus 4.5's advantage compounds over repeated attempts. Single-attempt metrics flatten the curve.
What to ask your vendor
Security teams evaluating frontier AI models need specific answers, starting with ASR at 50 and 200 attempts rather than single-attempt metrics alone. Find out whether the vendor detects deception through output analysis or internal state monitoring. Know who challenges red team conclusions before deployment and what specific failure modes they've documented. Get the evaluation awareness rate. Vendors claiming complete safety haven't stress-tested adequately.
The bottom line
Diverse red-team methodologies reveal that every frontier model breaks under sustained attack. The 153-page system card versus the 55-page system card isn't just about documentation length. It's a signal of what each vendor chose to measure, stress-test and disclose.
For persistent adversaries, Anthropic's degradation curves show exactly where resistance fails. For fast-moving threats requiring rapid patches, OpenAI's iterative improvement data matters more. For agentic deployments with browsing, code execution and autonomous action, the scheming metrics become your primary risk indicator.
Security leaders need to stop asking which model is safer and start asking which evaluation methodology matches the threats their deployment will actually face. The system cards are public. The data is there. Use it.
