Ayhan Sebin
Saurabh Jha
Rohan Arora
Daby Sow
Mert Cemri
Melissa Pan
Ion Stoica
ITBench HF Space
ITBench HF Dataset
MAST HF Dataset
ITBench Github
MAST Github
IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation: tasks involving incident triage, logs/metrics queries, and Kubernetes actions in long-horizon tool loops.
Benchmarks typically reduce performance to a single number, telling you whether an agent failed but never why. To resolve this black-box problem, we applied MAST (Multi-Agent System Failure Taxonomy), an emerging practice for diagnosing agentic reliability. By leveraging MAST to investigate ITBench—the industry benchmark for SRE, Security, and FinOps automation—we turned raw execution traces into structured failure signatures, revealing exactly what broke and how to fix it. We annotated 310 ITBench SRE traces across three distinct model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B.
Key Findings:
- Frontier models like Gemini-3-Flash fail cleanly (2.6 failure modes/trace), typically hitting isolated bottlenecks like verification. Large open models like GPT-OSS-120B suffer from cascading failure modes (5.3 failure modes/trace): a single reasoning mismatch early in the run poisons the context, resulting in compounding hallucinations.
- Across all models, the strongest predictor of failure is FM-3.3 (Incorrect Verification). Agents consistently “declare victory” without checking ground truth.
- Kimi-K2 struggles to recognize when a task is complete. It exhibits a huge spike in Premature Termination (+46%) and Unaware of Termination Conditions (+43%), often quitting just before solving the issue or looping indefinitely.
Takeaways from our evaluation for building agents:
- For Frontier Models like Gemini: Externalize Verification. Never let the LLM grade its own homework. Require hard tool evidence before exit.
- Put termination + loop control outside the model: Termination issues are common killers (FM-1.5). Add explicit stop conditions and loop detectors for repeated tool calls/actions, or implement Finite State Machines (see the sketch after this list).
- Force clarify-or-read-only when inputs are ambiguous: Clarification failures (FM-2.2) are a serious failure driver for smaller models. Make ambiguity a first-class branch in your agent graph.
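To make the second takeaway concrete, here is a minimal sketch of a controller that keeps termination and loop control outside the model. It is illustrative only: `run_llm_step`, `is_resolved`, and the budget values are hypothetical placeholders, not part of ITBench or MAST.

```python
from collections import Counter

MAX_STEPS = 30    # hard step budget (hypothetical value)
MAX_REPEATS = 3   # identical tool calls allowed before forcing a stop

def run_agent(task, run_llm_step, is_resolved):
    """Drive the agent loop; termination lives in the harness, not in the model.

    `run_llm_step(task, history)` is assumed to return a
    (tool_name, tool_args, tool_result) tuple, and `is_resolved(task)` is an
    external, tool-based success check.
    """
    history, seen_calls = [], Counter()
    for _ in range(MAX_STEPS):
        tool_name, tool_args, tool_result = run_llm_step(task, history)
        history.append((tool_name, tool_args, tool_result))

        # Loop detector: abort on repeated identical tool calls (FM-1.3 guard).
        signature = (tool_name, str(tool_args))
        seen_calls[signature] += 1
        if seen_calls[signature] > MAX_REPEATS:
            return {"status": "aborted", "reason": "repeated tool call", "history": history}

        # Explicit stop condition checked outside the LLM (FM-1.5 guard).
        if is_resolved(task):
            return {"status": "success", "history": history}

    return {"status": "aborted", "reason": "step budget exhausted", "history": history}
```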
If you’re building agents for enterprise IT workflows, this is the kind of evaluation you want: not just “did it pass?”, but “what broke, where, and which intervention has the most leverage?”
The “Black Box” Problem of Agent Benchmarks
Benchmarks like ITBench have become the standard for measuring agentic performance in high-stakes IT automation tasks. In ITBench, agents act as Site Reliability Engineers (SREs) or Security Analysts tasked with diagnosing Kubernetes outages, patching vulnerabilities, or managing cloud costs in production environments.
These benchmarks use success rate as the primary metric to gauge agents. However, this metric is insufficient for engineering robust systems. Knowing that an agentic system achieves a 14% success rate on ITBench tells us that it failed, but not why: Did it fail because it forgot the context? Because it hallucinated a command? Or because it simply didn’t terminate?
Without a comprehensive approach to diagnosing these failures, developers are left guessing, often resorting to blind prompt tweaks that solve one problem only to create another.
As a new standard for investigating the failure modes of complex agentic systems, we developed MAST (Multi-Agent System Failure Taxonomy). MAST surfaces deeper insights and opens up the opaque evaluation of these benchmarks. Derived from a rigorous analysis of over 1,600 traces across seven different frameworks, MAST provides a standardized taxonomy for agent failures.
MAST converts unstructured execution logs into structured “failure vectors” based on 14 distinct patterns across three key categories:
- FC1: System Design Issues (The “Skeleton”)
- Failures here stem from the agent’s architecture and role definition.
- Examples: FM-1.3 Step Repetition (looping), FM-1.4 Lack of Conversation History (memory leaks), FM-1.5 Unaware of Termination (failing to stop).
- FC2: Inter-Agent Misalignment (The “Communication”)
- Failures arising at runtime from how agents communicate with each other or with the environment.
- Examples: FM-2.2 Fail to Ask for Clarification (assuming instead of asking), FM-2.3 Task Derailment (going off-topic).
- FC3: Task Verification (The “Quality Control”)
- Failures in quality assurance of the agents’ output.
- Examples: FM-3.1 Premature Termination (giving up too soon), FM-3.3 Incorrect Verification (hallucinating success).
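To ground the “failure vector” idea, here is a minimal sketch of how an annotated trace can be encoded as a 14-dimensional binary vector. The FM codes follow MAST’s grouping into the three categories above; the helper function is a hypothetical illustration, not the official annotation tooling.

```python
# The 14 MAST failure modes, grouped by category (FC1 / FC2 / FC3).
FAILURE_MODES = [
    "FM-1.1", "FM-1.2", "FM-1.3", "FM-1.4", "FM-1.5",            # system design
    "FM-2.1", "FM-2.2", "FM-2.3", "FM-2.4", "FM-2.5", "FM-2.6",  # inter-agent misalignment
    "FM-3.1", "FM-3.2", "FM-3.3",                                 # task verification
]

def to_failure_vector(annotated_modes):
    """Convert a set of annotated failure-mode codes into a binary vector."""
    observed = set(annotated_modes)
    return [1 if fm in observed else 0 for fm in FAILURE_MODES]

# Example: a failed trace annotated with step repetition and incorrect verification.
vector = to_failure_vector({"FM-1.3", "FM-3.3"})
```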
The Experiment: Diagnosing ITBench Agents
We stress-test the idea of using MAST to make agent evaluations actionable, and to gain insight into failure modes, by applying it to ITBench, a popular evaluation suite for IT automation tasks across SRE, Security/Compliance, and FinOps.
We annotated 310 ITBench SRE execution traces produced by an SRE agent built with Codex in realistic environments. These traces capture natural language interactions between agents and their tools across three models representing different capability tiers: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. This lets us look past simple success metrics and investigate the distinct failure signatures driving these results. We report recall rather than F1, because the agent by design emits at most 3-5 outputs and SREs prefer recall over the F1 score (a short sketch of this computation follows the per-model numbers below).
- Gemini-3-Flash: 100 traces (75.5% Mean Recall)
- Kimi-K2: 105 traces (28.6% Mean Recall)
- GPT-OSS-120B: 105 traces (12.4% Mean Recall)
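For reference, “mean recall” here is the fraction of ground-truth findings that appear among the agent’s few reported outputs, averaged over traces. A minimal sketch of that computation, with hypothetical service names in the example:

```python
def trace_recall(predicted, ground_truth):
    """Recall for one trace: fraction of ground-truth items the agent surfaced."""
    if not ground_truth:
        return 1.0
    hits = sum(1 for item in ground_truth if item in set(predicted))
    return hits / len(ground_truth)

def mean_recall(traces):
    """Average per-trace recall; `traces` is a list of (predicted, ground_truth) pairs."""
    return sum(trace_recall(p, g) for p, g in traces) / len(traces)

# Example: the agent names two of the three faulty components in a trace -> 0.666...
print(mean_recall([(["checkout-svc", "db-pod"], ["checkout-svc", "db-pod", "node-3"])]))
```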
Below, we detail the findings from this diagnostic evaluation.
Finding 1: Stronger models like Gemini-3-Flash show surgical, isolated failure modes per trace, whereas open-source Kimi-K2 and GPT-OSS-120B show compounding failure patterns
When we examine the failed traces, a clear hierarchy of complexity becomes apparent across the three models, measured by the number of distinct failure modes observed per failed run.
- Gemini-3-Flash: 2.6 failure modes per failed trace
- Kimi-K2: 4.7 failure modes per failed trace
- GPT-OSS-120B: 5.3 failure modes per failed trace
This disparity in failure mode density reveals a fundamental difference in how these systems break down. Gemini-3-Flash exhibits a surgical failure profile. Even in unsuccessful runs, it maintains high internal coherence and typically fails due to a single isolated failure, such as an incorrect verification step. These failures are precise and much easier to diagnose.
On the other end of the spectrum, GPT-OSS-120B suffers from cascading collapse. In these traces, we observe that errors tend to compound over time. A small reasoning mismatch early in the process often leads to a deviation from the task specification, which in turn triggers a complete derailment of the agent. Kimi-K2 represents the middle ground: its failures are more frequent and complex than the frontier model’s, but do not reach the systemic instability seen in the 120B open-weights model.
The significance of this finding is that a higher success rate usually comes with isolated failures. Systems that fail with fewer simultaneous problems are far more predictable and easier to improve through targeted engineering interventions.
Finding 2: “Non-Fatal” vs. “Fatal” Failures
Perhaps the most critical insight from MAST is the distinction between failures the system can tolerate and failures that are fatal to the downstream task. By comparing the distribution of failure modes in successful traces vs. failed traces, we can separate the benign failure modes from the fatal ones.
The “Non-Fatal” (Benign) Flaws
Across all three models, certain failure modes appear frequently even in runs that ultimately succeed. These are often structural frictions rather than terminal bugs.
- FM-1.3 Step Repetition: This mode is present in over 90 percent of successful Kimi-K2 runs. In the SRE domain, iteration is often a necessity: an agent might query the same metric multiple times to confirm whether a service is stabilizing or a fix has taken effect. Gemini-3-Flash actually shows less repetition in its failed traces, suggesting that it sometimes fails because it doesn’t iterate enough.
- FM-1.1 Disobey Task Specification: Agents frequently deviate from strict tool formatting or sequential instructions yet still manage to identify the correct root cause.
This separation is where MAST proves its value. It allows us to ignore benign failures, like the repetition that often occurs in troubleshooting, and focus instead on the fatal failures that killed a run.
The “Fatal” Flaws
Certain behaviors strongly separate success from failure. When these modes appear, the probability of a successful outcome drops precipitously. The most prominent example is FM-3.3 (Incorrect Verification): this mode shows a 52 percent increase in failed Gemini-3-Flash traces compared to its successful ones. Other prominent failure modes are FM-1.5 (Unaware of Termination Conditions) and FM-2.6 (Reasoning-Action Mismatch).
If these occur, the run is likely dead; this guides practitioners toward robust context management strategies across the agents in the system and across multiple turns of interaction.
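The fatal/non-fatal split itself comes from a simple comparison: a failure mode’s prevalence in failed traces versus successful ones. A minimal sketch under an assumed data layout (each trace carries a set of annotated failure modes):

```python
def prevalence(traces, fm):
    """Fraction of traces in which failure mode `fm` was annotated."""
    return sum(fm in t["failure_modes"] for t in traces) / max(len(traces), 1)

def rank_fatal_modes(successful, failed, modes):
    """Sort modes by how much more often they appear in failed vs. successful traces."""
    deltas = {fm: prevalence(failed, fm) - prevalence(successful, fm) for fm in modes}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# Modes with a large positive delta (e.g., FM-3.3 here) are candidates for "fatal";
# modes common in both buckets (e.g., FM-1.3) are likely benign structural friction.
```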
Case Study: Gemini-3-Flash (Decisive but Overconfident)
Gemini-3-Flash is highly efficient, but its primary bottleneck is its tendency to assume success without rigorous proof. Its failure signature is dominated by a large delta in verification errors. It often identifies the correct signals but terminates before cross-referencing them against the ground truth. To fix this, developers should implement an external verification gate. By requiring tool-based evidence, such as a cleared alert or a healthy metric threshold, before allowing the agent to exit, we can mitigate this model’s inherent overconfidence.
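A minimal sketch of such a verification gate, assuming a hypothetical `query_alertmanager` tool wrapper that returns the alerts still firing for an incident:

```python
def verification_gate(incident, agent_claims_done, query_alertmanager):
    """Only allow the run to end when tool-based evidence confirms resolution.

    `query_alertmanager(incident)` is a hypothetical tool wrapper returning the
    list of alerts still firing for this incident; the LLM's own claim of
    success (the FM-3.3 risk) is never trusted on its own.
    """
    if not agent_claims_done:
        return False  # agent has not even claimed completion yet
    still_firing = query_alertmanager(incident)
    # Exit only when the alert has actually cleared; otherwise send the agent
    # back into the investigation loop.
    return len(still_firing) == 0
```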
- Fix: To improve Gemini-3-Flash on ITBench, prompt engineering won’t help much. In the experiments shown in our NeurIPS 2025 paper, manual interventions like prompt engineering for memory-related failures yielded only around 15.6% performance improvement. In contrast, in a previous blog post on MAST, we showed that introducing new agents, such as a Summarizer Agent that reminds the other agents of what has happened and repeatedly augments their state (fixing FM-1.4), or introducing context management mechanisms (such as a stricter state machine to enforce termination and fix FM-1.5), can yield up to a 53% performance improvement, because these tackle more fundamental issues with the system.
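A minimal sketch of the state-machine idea mentioned above, with hypothetical phase names and per-phase step budgets; the harness, not the model, decides when a phase (and the run) ends:

```python
from enum import Enum, auto

class Phase(Enum):
    INVESTIGATE = auto()
    PROPOSE_FIX = auto()
    VERIFY = auto()
    DONE = auto()

# Hypothetical per-phase step budgets enforced by the harness.
PHASE_BUDGET = {Phase.INVESTIGATE: 15, Phase.PROPOSE_FIX: 5, Phase.VERIFY: 5}

def next_phase(phase, steps_in_phase, evidence_ok):
    """Advance the state machine: forced transitions cap looping (FM-1.5 / FM-3.1)."""
    if phase == Phase.VERIFY and evidence_ok:
        return Phase.DONE
    if steps_in_phase >= PHASE_BUDGET.get(phase, 0):
        # Budget exhausted: force the next phase rather than letting the model loop.
        order = [Phase.INVESTIGATE, Phase.PROPOSE_FIX, Phase.VERIFY, Phase.DONE]
        return order[min(order.index(phase) + 1, len(order) - 1)]
    return phase
```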
Case Study: Kimi-K2
While termination confusion (FM-3.1 and FM-1.5) is the prevalent failure mode for Kimi-K2, its failed trajectories are defined by a pervasive Action-Reasoning Mismatch (FM-2.6), present in a staggering 92% of its failures.
- The Execution Gap: While parts of its internal reasoning are often accurate, it suffers from a 92 percent failure prevalence of FM-2.6 (Action-Reasoning Mismatch). It frequently identifies the correct next step but then executes a redundant or irrelevant command.
- The Meta-Loop Trap: Roughly 25 percent of failed traces involve FM-2.3 (Task Derailment). When a tool call returns a minor error, the agent often abandons the primary incident to enter a cycle of debugging its own investigation scripts.
Kimi-K2 is a good example of an overthinking model: its reasoning chains are often long and thorough, yet it still fails at execution.
Case Study: GPT-OSS-120B
GPT-OSS-120B exhibits the most unstable failure signature of the cohort, with an average of 5.3 distinct failure modes per failed trace, indicating a fundamental inability to maintain internal state.
- Lack of Conversation History (FM-1.4): This is a fatal flaw unique to the 120B model. It loses conversation history in 24% of traces, whereas Gemini-3-Flash exhibited zero memory loss and Kimi-K2 only 7%. As SRE traces grow in length, GPT-OSS-120B effectively “forgets” the alerts it was originally triaging, leading to total task derailment.
- Reasoning Disconnect (FM-2.6): A staggering 94% of traces show a decoupling of reasoning and action. The model is almost 3x more likely than Gemini-3-Flash (31%) to describe a correct plan and then execute a completely unrelated or redundant tool call.
A different (and more useful) way to read the plots: “fatal” vs. “non-fatal”
In summary, MAST lets you split failure modes into two buckets:
Recoverable / structural (show up even in successful traces)
These are failures that are not fatal; the system can recover from them and still complete the task successfully.
- FM-1.3 Step repetition
- FM-3.3 Incorrect verification (an important nuance: the system does verify; it just verifies poorly)
- FM-2.6 Reasoning–action mismatch (often present, but not always decisive)
Fatal / decisive (strongly related to failed traces)
These are failures from which the system typically cannot recover.
- FM-1.5 Unaware of termination conditions
- FM-3.1 Premature termination
- FM-1.4 Lack of conversation history
- FM-2.3 Task derailment (rare but extremely diagnostic when it appears)
- FM-2.2 Fail to ask for clarification (especially for Granite/Llama regimes)
This is the “richer understanding” piece: two models can have the same success rate on a small slice, yet fail for entirely different reasons, requiring different fixes.
Conclusion
MAST is a tool that inspects agentic system traces to identify fine-grained failure types that support system development and debugging. In this blog, we show that by applying MAST to ITBench, we move from generic observations (“open models struggle”) to a concrete engineering roadmap that helps improve the performance of agentic systems built on these models, e.g.:
- For Gemini-3-Flash: Verification failure (FM-3.3) is the most common fatal failure for surgical models. Never allow an agent to self-terminate; require hard, tool-mediated evidence (e.g., AlertManager clearance or K8s state changes) before a run is considered successful.
- For Kimi-K2: Use a deterministic state machine to fix the model’s frequent struggle with recognizing task completion. This model’s reasoning chains can be too long and struggle to terminate, so it would benefit significantly from tighter control over when to end.
- For GPT-OSS-120B: Systemic collapse occurs when minor reasoning mismatches (FM-2.6) poison the task history. Implement aggressive context hygiene and early error detection to ensure that small misalignments don’t compound into total derailment (see the sketch below).
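A minimal sketch of the context-hygiene idea for the last point: the original alert and a rolling summary are re-injected into every turn so the model cannot silently drop them (mitigating FM-1.4). The helper and its arguments are hypothetical.

```python
def build_turn_context(original_alert, rolling_summary, recent_steps, max_recent=5):
    """Re-anchor every turn on the original incident plus a compact summary.

    `original_alert` is the alert the run started from, `rolling_summary` is a
    short text maintained by a summarizer step, and only the last few raw tool
    results are kept verbatim to bound context growth.
    """
    return "\n\n".join([
        f"ORIGINAL ALERT (do not lose track of this): {original_alert}",
        f"INVESTIGATION SO FAR: {rolling_summary}",
        "RECENT STEPS:\n" + "\n".join(recent_steps[-max_recent:]),
    ])
```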





