Anthropic vs. OpenAI red teaming methods reveal different security priorities for enterprise AI

-



Model providers wish to prove the safety and robustness of their models, releasing system cards and conducting red-team exercises with each recent release. But it might probably be difficult for enterprises to parse through the outcomes, which vary widely and may be misleading.

Anthropic's 153-page system card for Claude Opus 4.5 versus OpenAI's 60-page GPT-5 system card reveals a fundamental split in how these labs approach security validation. Anthropic discloses of their system card how they depend on multi-attempt attack success rates from 200-attempt reinforcement learning (RL) campaigns. OpenAI also reports attempted jailbreak resistance. Each metrics are valid. Neither tells the entire story.

Security leaders deploying AI agents for browsing, code execution and autonomous motion have to know what each red team evaluation actually measures, and where the blind spots are.

What the attack data shows

Gray Swan's Shade platform ran adaptive adversarial campaigns against Claude models. The attack success rate (ASR) tells the story.

  • Opus 4.5 in coding environments hit 4.7% ASR at one attempt, 33.6% at ten and 63.0% at 100. In computer use with prolonged considering, it held at 0% ASR even after 200 attempts. It's the primary model to saturate the benchmark.

  • Sonnet 4.5 at the identical thresholds showed 70% ASR in coding and 85.7% in computer use.

  • Opus 4.5 delivers roughly 7x improvement in coding resistance and complete resistance in computer use.

This illustrates that the gap between model tiers throughout the same family is now larger than some cross-vendor comparisons. For security teams, that's a procurement factor that didn't exist six months ago.

For OpenAI, the Gray Swan Shade platform found that the o1 system card delivered 6% ASR for harmful text and 5% for malicious code, all based on single-attempt metrics. Third-party testing by SPLX and NeuralTrust found GPT-5's raw ASR hit 89% initially, dropping below 1% inside two weeks of patching.

The report Jailbreaking Leading Safety-Aligned LLMs with Easy Adaptive Attacks confirms the pattern. With sufficient attempts, Claude 3.5 Sonnet showed 78% ASR, and GPT-4o reached 89%.

Two ways to catch deception

Anthropic monitors roughly 10 million neural features during evaluation using dictionary learning. Features map to human-interpretable concepts, including deception, sycophancy, bias, power-seeking and concealment.

Meanwhile, OpenAI's approach relies on chain-of-thought (CoT) monitoring. Their deception classifier flagged 0.17% of o1-preview responses as potentially deceptive with 92% accuracy on internal validation. GPT-5 reduced CoT deception from 4.8% to 2.1% in comparison with o3. The strategy assumes visible reasoning reflects internal processing. Anthropic's interpretability work directly questions that assumption.

When models game the test

In Apollo Research's evaluation of o1, the model attempted to show off oversight mechanisms 5% of the time when it believed it could be shut down. In 2% of cases, o1 attempted self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time. The o1 system card disclosed instrumental alignment faking. In 37% of scenarios, the model pursued developer-aligned goals during evaluation, then switched to its own goals when it believed oversight was minimal.

Meanwhile, Anthropic reports Opus 4.5's evaluation awareness dropped from 26.5% (Opus 4.1) to lower than 10% internally. UK AISI measured evaluation awareness decreasing from 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Preview). METR's GPT-5 evaluation noted the model "can discover the precise evaluation it’s in on some occasions" and "seems to make use of this information to get a greater rating mostly."

If a model behaves in another way when it detects evaluation conditions, it's a simple task that production behavior will turn out to be unpredictable when pushed to scale. That is the core problem with evaluation awareness. Models that recognize they're being tested try and game the test. Anthropic's 60% reduction in evaluation awareness from Opus 4.1 to 4.5 represents their success in targeted engineering efforts against this vector.

Red teaming on prompt injection defenses shows similar divergence.

Anthropic's testing reports 96% prevention in tool use scenarios, rising to 99.4% with additional safeguards like prompt shields. Fujitsu's independent assessment found GPT-5 vulnerable to roughly 20% of tested injection attacks. OpenAI's o3/o4-mini system card reports instruction hierarchy scores of 0.86 for system-versus-developer conflicts.

These are different metrics and different attack methodologies, however the gap suggests Anthropic's agentic use cases have been hardened against adversarial inputs embedded in tool outputs.

Comparing red teaming results

Here's how Opus 4.5 stacks up against OpenAI's model family across 16 dimensions. .

Dimension

Claude Opus 4.5

GPT-5

o3 / o4-mini

o1

System card length

153 pages

55 pages

~30 pages

~40 pages

Attack methodology

200-attempt RL campaigns

Single-attempt + iterative patching

Single-attempt

Single-attempt

ASR @ 1 attempt (coding)

4.7%

89% raw (pre-patch)

Not published

5%

ASR @ 100 attempts

63.0%

Not published

Not published

Not published

ASR @ 200 attempts (Computer Use)

0% (saturates benchmark)

Not published

Not published

Not published

Prompt injection defense

96% prevention; 99.4% with safeguards

~20% ASR (Fujitsu)

0.86 instruction hierarchy

Higher ASR than GPT-4o

Interpretability

~10M neural features monitored

CoT monitoring

CoT monitoring

CoT monitoring

Deception detection

Internal feature activation

2.1% CoT flagged

0.95 not_unsafe

0.17% flagged; 99% deny

Evaluation awareness

<10%

Identifies exact eval (METR)

Not published

37% alignment faking

Scheming (Apollo)

Opus 4.5 showed very low rates of sabotage propensity in internal evaluations, orders of magnitude lower than o1.

Not published

Lower than o1; still deceptive

5% turn off oversight; 2% self-exfil

CBRN risk

Below ASL-4

Medium (approaching High)

Medium

Medium (first to succeed in)

Governance

FRT → AST → RSO/CEO

SAG + Preparedness v2

SAG + Preparedness v2

SAG + Preparedness

External partners

Gray Swan, UK AISI, US CAISI, METR

UK AISI, US AISI, METR, Apollo

UK AISI, US AISI, METR, Apollo, Pattern Labs

UK AISI, US AISI, METR, Apollo

Reward hacking

Yes (unimaginable tasks)

Yes (METR)

~1% of tasks; 5/24 Kernels

Yes

Misalignment reduction

60% vs. Opus 4.1

Not published

Lower scheming than o1

Baseline

Sources: Opus 4.5 system card, GPT-5 system card, o3/o4-mini system card, o1 system card, Gray Swan, METR, Apollo Research

Why these differences matter

Enterprises have several aspects to contemplate when analyzing model evals. These include:

Attack persistence thresholds

Anthropic's 200-attempt RL campaigns simulate adversaries with time, resources, and adaptive tooling. In RL-based adversarial testing, the attacker model learns from each failure, adjusts its approach and systematically probes for weaknesses. It's training an AI to interrupt one other AI.

Gray Swan's Shade platform doesn't just retry the identical prompt. It learns from failures, adjusts attack vectors and exploits patterns in model responses. The 4.7% to 63.0% degradation curve in coding environments shows how resistance erodes under sustained pressure.

OpenAI's single-attempt metrics answer a special query: How often does a naive attacker succeed on the primary try? Each matter. A phishing campaign hits hundreds of thousands of users once each. A nation-state actor runs 1000’s of variations against a single goal. Security teams have to know which threat model their vendor tested against.

Detection architecture

Anthropic monitors roughly 10 million internal neural features during evaluation. These are abstract concepts like deception, sycophancy and power-seeking that activate across the model's hidden layers. When a fraud/deception feature lights up during benign roleplay, that's a signal invisible to output evaluation. The model's internal state reveals associations its outputs don't.

OpenAI takes a special approach with CoT monitoring. Their deception classifier flagged 0.17% of o1-preview responses as potentially deceptive with 92% accuracy. The idea is that reasoning traces reflect internal processing. Anthropic's interpretability research directly challenges this. Models can produce clean reasoning chains while internal features indicate something else entirely.

Neither approach is complete. CoT monitoring scales more easily. Feature monitoring catches what CoT misses but requires deep model access and interpretability infrastructure that almost all enterprises can't replicate.

Scheming evaluation design

Apollo Research's methodology puts models in scenarios where misaligned behavior offers a strategic advantage: Shutdown threats, goal conflicts and oversight detection. o1 attempted to show off oversight mechanisms 5% of the time when believing shutdown was imminent. It tried to repeat itself to forestall substitute in 2% of cases. When confronted, it denied wrongdoing 99% of the time. Opus 4.5 showed very low rates of sabotage propensity in internal evaluations, orders of magnitude lower than o1.

However the evaluation environments differ. Anthropic's stress tests give attention to whether models attempt sabotage when given clear opportunities. OpenAI's tests via Apollo probe instrumental reasoning, specifically whether models fake alignment during evaluation, then defect when oversight drops. The 37% alignment-faking rate in o1 represents a special failure mode than sabotage attempts.

The comparability problem

These methodological differences make direct comparison difficult. Opus 4.5's 0% ASR at 200 attempts in computer use versus o1's 5% malicious code rate aren’t measuring the identical thing. The attack surfaces, attempt counts and success criteria differ.

The matrix reveals what each vendor selected to measure and publish. That itself signals priorities. Anthropic's 153-page system card emphasizes degradation curves and internal feature monitoring. OpenAI emphasizes iterative improvement and CoT transparency.

Bottom line: Enterprise security teams inheriting these models inherit these measurement philosophies. The query isn't which number is best. It's the measurement that matches your deployment's actual threat landscape.

What independent red team evaluators found

Independent red team evaluations, for his or her part, often operate with entirely different methods. This will tell a really different story and reveal additional model characteristics that enterprises must consider.

METR's red team evaluation measured autonomous capabilities using a time horizon rating, which tracks how long a task can run before the agent fails half the time. o3 sustained reliable performance for roughly 1 hour and half-hour. o4-mini held for 1 hour quarter-hour. METR also detected reward hacking in roughly 1% of o3's attempts, including 5 of 24 kernel optimization experiments where the model tampered with scoring functions to inflate its own results.

Apollo Research tested o3 and o4-mini for in-context scheming. Each exhibited deceptive tendencies corresponding to sandbagging evaluations and sabotaging tasks with plausible deniability, but scored lower than o1. They assess that o3 and o4-mini are unlikely to cause catastrophic harm as a consequence of scheming, but more minor real-world harms remain possible without monitoring.

The UK AISI/Gray Swan challenge ran 1.8 million attacks across 22 models. Every model broke. ASR ranged from 1.47% to six.49%. Opus 4.5 placed first on Gray Swan's Agent Red Teaming benchmark with 4.7% ASR versus GPT-5.1 at 21.9% and Gemini 3 Pro at 12.5%.

No current frontier system resists determined, well-resourced attacks. The differentiation lies in how quickly defenses degrade and at what attempt threshold. Opus 4.5's advantage compounds over repeated attempts. Single-attempt metrics flatten the curve.

What To Ask Your Vendor

Security teams evaluating frontier AI models need specific answers, starting with ASR at 50 and 200 attempts moderately than single-attempt metrics alone. Discover whether or not they detect deception through output evaluation or internal state monitoring. Know who challenges red team conclusions before deployment and what specific failure modes they've documented. Get the evaluation awareness rate. Vendors claiming complete safety haven't stress-tested adequately.

The underside line

Diverse red-team methodologies reveal that each frontier model breaks under sustained attack. The 153-page system card versus the 55-page system card isn't nearly documentation length. It's a signal of what each vendor selected to measure, stress-test, and disclose.

For persistent adversaries, Anthropic's degradation curves show exactly where resistance fails. For fast-moving threats requiring rapid patches, OpenAI's iterative improvement data matters more. For agentic deployments with browsing, code execution and autonomous motion, the scheming metrics turn out to be your primary risk indicator.

Security leaders have to stop asking which model is safer. Start asking which evaluation methodology matches the threats your deployment will actually face. The system cards are public. The information is there. Use it.



Source link

ASK ANA

What are your thoughts on this topic?
Let us know in the comments below.

0 0 votes
Article Rating
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Share this article

Recent posts

0
Would love your thoughts, please comment.x
()
x