Conversational voice agents present a distinct evaluation challenge: they must simultaneously satisfy two objectives — accuracy (completing the user's task correctly and faithfully) and conversational experience (doing so naturally, concisely, and in a way appropriate for spoken interaction). These objectives are deeply intertwined: mishearing a confirmation code renders perfect LLM reasoning meaningless, a wall of options overwhelms a caller who cannot skim spoken output, and delayed responses can pass every accuracy check while remaining unusable in practice. Existing frameworks treat these as separate concerns — evaluating task success or conversational dynamics, but not both.
We introduce EVA, an end-to-end evaluation framework for conversational voice agents that evaluates complete, multi-turn spoken conversations using a realistic bot-to-bot architecture. EVA produces two high-level scores, EVA-A (Accuracy) and EVA-X (Experience), and is designed to surface failures along each dimension. EVA is the first framework to jointly score task success and conversational experience. We release EVA with an initial airline dataset of 50 scenarios covering flight rebooking, cancellation handling, vouchers, and more — the first in a planned series of domains.
We also provide benchmark results for 20 cascade and audio-native systems, including speech-to-speech models and large audio language models. Our biggest finding is a consistent Accuracy–Experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa.
🌐 Website – Explore the full framework, early results, and a demo.
💻 GitHub – Dive into the code, dataset, and judge prompts.
Background and Motivation
The field currently lacks a framework that evaluates the overall quality of voice agent interactions, as most existing efforts assess individual components in isolation. For instance, AudioBench, SD-Eval, VoxEval, Kimi-Eval, VoiceBench, and VoxDialogue evaluate core speech understanding capabilities — transcription, paralinguistics, acoustic cues — but remain confined to single-turn, non-interactive settings. Meanwhile, EmergentTTS and SHEET assess perceived speech quality using subjective listening tests (e.g., Mean Opinion Score). Beyond speech perception, FD-Bench, Talking Turns, and Full-Duplex-Bench provide deeper analyses of conversational dynamics — interruptions, backchanneling, turn-taking — yet evaluate these in isolation from task-oriented tool use, leaving the connection between dialogue quality and agentic capability unexamined. More recent efforts, notably VoiceAgentBench and CAVA, take steps towards evaluating the agentic capabilities of commercial voice agent systems, including tool-calling and complex instruction-following. Nevertheless, these voice-agentic capabilities are not evaluated within the complete conversational workflows that voice agents must navigate in practice: from initial user request through multi-step tool orchestration to final task resolution.
The lack of frameworks that jointly capture accuracy and experience underscores the need for one that treats voice agent quality as an integrated whole. This means evaluating not only whether the task succeeded, but whether the agent communicated accurately, concisely, and naturally throughout, and surfacing how these dimensions trade off against each other in realistic deployment conditions.
EVA
The Framework
End-to-end evaluation reveals interaction dynamics that are not apparent at the component level: whether the agent interrupts users during natural pauses in speech, whether it recovers gracefully when a user corrects a transcription error, or whether high latency disrupts the conversational flow enough to prompt users to repeat themselves or abandon the task entirely.
EVA simulates multi-turn spoken conversations over live audio in which the agent must invoke appropriate tools, adhere to task-specific policies, and reach a deterministically verifiable end state. EVA evaluates voice agents using a bot-to-bot audio architecture composed of five core components:
- User Simulator — A conversational AI configured with a specific goal and persona that plays the role of a caller. It operates in audio using high-quality TTS models, ensuring the evaluation captures representative speech-understanding challenges in natural-sounding conversational speech and realistic turn-taking dynamics.
- Voice Agent — The voice agent being evaluated, built with Pipecat, an open-source Python framework for real-time voice applications. EVA supports both cascade architectures (STT → LLM → TTS) and audio-native models (S2S or S2T → TTS).
- Tool Executor — The engine that provides deterministic, reproducible tool responses via custom Python functions. It dynamically queries and modifies a predefined per-scenario database.
- Validators — A set of validation metrics that check that conversations are complete and that the user simulator faithfully reproduced the intended behavior and speech, with no human annotation required. Any conversation that fails this validation step is regenerated, ensuring that only valid, accurately executed conversations enter evaluation. This stands in contrast to approaches that depend on post-hoc human labeling to identify simulator errors.
- Metrics Suite — A collection of metrics that evaluates the voice agent using the conversation recording, transcript, and tool call logs.
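The Tool Executor's design can be illustrated with a minimal sketch. Everything here — the function names, the `bookings` database schema, and the return shapes — is hypothetical, not EVA's actual API; it only shows how tools closing over a per-scenario database copy yield deterministic, reproducible responses:

```python
import copy

def make_tool_executor(scenario_db: dict):
    """Build tools that read and write a private copy of the scenario
    database, so every evaluation run starts from the same state."""
    db = copy.deepcopy(scenario_db)  # isolate runs from one another

    def lookup_booking(confirmation_code: str) -> dict:
        # Deterministic read: same inputs always yield the same response.
        return db["bookings"].get(confirmation_code, {"error": "not_found"})

    def cancel_booking(confirmation_code: str) -> dict:
        # Deterministic write: mutates the per-scenario database in place.
        booking = db["bookings"].get(confirmation_code)
        if booking is None:
            return {"error": "not_found"}
        booking["status"] = "cancelled"
        return {"ok": True, "confirmation_code": confirmation_code}

    # Return the live database (for end-state checks) and the tool registry.
    return db, {"lookup_booking": lookup_booking, "cancel_booking": cancel_booking}
```

After the conversation ends, the returned `db` is the actual end state that can be compared against the scenario's ground truth.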
Data
Each test case (scenario) in our framework is an evaluation record, structured to make tests reproducible:
- User Goal — What the caller is trying to accomplish. Includes a highly specific user objective with an explicit decision tree that guides the user simulator through the conversation, leaving no ambiguity about the intended outcome.
- User Persona — How the caller should behave — their speaking style, patience level, and personality traits.
- Scenario Database — The backend data the agent’s tools will query.
- Ground Truth — The expected final state of the scenario database after a successful conversation.
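The four fields above can be sketched as a small Python dataclass. The field names and the example scenario below are illustrative assumptions, not EVA's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One EVA-style test case (illustrative field names)."""
    user_goal: str      # specific objective plus the decision tree for the simulator
    user_persona: str   # speaking style, patience level, personality traits
    scenario_db: dict   # backend data the agent's tools will query
    ground_truth: dict  # expected final database state after success

# A hypothetical rebooking scenario.
rebooking = Scenario(
    user_goal="Rebook cancelled flight AC101 onto the next same-day departure.",
    user_persona="Polite but hurried business traveller; answers briefly.",
    scenario_db={"bookings": {"ABC123": {"flight": "AC101", "status": "cancelled"}}},
    ground_truth={"bookings": {"ABC123": {"flight": "AC205", "status": "confirmed"}}},
)
```

Because the database and ground truth are part of the record, the same scenario can be replayed against any agent and scored deterministically.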
We release EVA with a synthetic airline dataset of 50 scenarios, spanning IRROPS rebooking, voluntary itinerary changes, cancellations, same-day standby, and compensation vouchers. Scenarios are designed to test temporal reasoning, policy-following, constraint satisfaction, and named-entity handling.
Evaluation Methodology
EVA evaluates voice agents across two primary dimensions: EVA-A for accuracy and EVA-X for experience. EVA also features a set of diagnostic metrics. Unlike the primary metrics, these are not used directly to compare or rank models — rather, they provide granular insight into why a model scores the way it does, helping identify and understand specific failure modes (e.g., ASR, speech synthesis, etc.). We report pass@k (the probability that at least one of k runs succeeds) and pass^k (the probability that all k runs succeed) across three trials per scenario (k = 3), capturing both peak performance and behavioral consistency.
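With three trials and k = 3, pass@3 reduces to "did any trial succeed?" and pass^3 to "did every trial succeed?". A general sketch using the standard combinatorial estimators (the function names are ours) might look like:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled runs succeeds), given c successes in n trials."""
    if n - c < k:  # too few failures to draw an all-failure sample of size k
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k sampled runs succeed), given c successes in n trials."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# With n = k = 3 (EVA's setting): one success out of three gives
# pass@3 = 1.0 but pass^3 = 0.0 — peak ability without consistency.
```

The gap between the two estimators is exactly the calibration gap discussed in the findings: an agent can pass@3 almost every scenario while pass^3 collapses.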
EVA uses two evaluation methods: deterministic code-based metrics, which compute scores directly from structured data and are fast; and LLM-as-Judge metrics, which use Large Language Models (LLMs) to evaluate qualitative aspects of the conversation, or Large Audio Language Models (LALMs) to evaluate speech directly. Each judge-based metric uses the model that performs best on a curated evaluation dataset for that specific metric.
EVA-A: Accuracy
Task completion alone is a necessary but insufficient measure of accuracy. An agent can reach the correct end state while fabricating a policy detail, misreading a confirmation code aloud, or hallucinating a flight number mid-conversation. These failures are invisible to a binary pass/fail check but directly harm users. EVA-A therefore measures three dimensions of accuracy:
- Task Completion [Deterministic]. Measures whether the agent correctly accomplished the task by comparing the expected end state of the scenario database against the actual end state after the conversation.
- Faithfulness [LLM-as-Judge]. Measures whether the agent's responses were grounded in its instructions, policies, user inputs, and tool call results — flagging fabrications, misrepresentations, policy violations, and hallucinations.
- Agent Speech Fidelity [LALM-as-Judge]. Measures whether the speech system faithfully reproduced the intended text in spoken audio, with particular focus on entities that are critical to get right in a voice context, such as confirmation codes, flight numbers, and dollar amounts. This is the only metric in any end-to-end voice agent benchmark that evaluates the quality of the agent's own spoken output at the audio level.
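The deterministic Task Completion check can be sketched as a recursive comparison of the expected and actual database states. The helper below is a hypothetical illustration, not EVA's implementation:

```python
def db_state_diff(expected, actual, path=""):
    """Recursively collect the paths where the actual end state of the
    scenario database diverges from the ground truth."""
    diffs = []
    if isinstance(expected, dict) and isinstance(actual, dict):
        for key in sorted(expected.keys() | actual.keys()):
            diffs += db_state_diff(expected.get(key), actual.get(key), f"{path}/{key}")
    elif expected != actual:
        diffs.append((path, expected, actual))
    return diffs

def task_completed(expected: dict, actual: dict) -> bool:
    """Binary pass/fail: the task counts as completed only when the
    end states match exactly."""
    return not db_state_diff(expected, actual)
```

A diff-based check like this also doubles as a diagnostic: the paths it returns show exactly which field (a seat, a flight number, a voucher amount) the agent got wrong.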
EVA-X: Experience
Turn-taking timing matters, but it tells only part of the story. An agent can have perfect timing while overwhelming a caller with a wall of spoken options they cannot skim, or repeatedly asking for information already given. These failures degrade the experience without ever involving a mistimed response. EVA-X therefore measures three dimensions of experience:
- Conciseness [LLM-as-Judge]. Measures whether the agent's responses were appropriately brief and focused for spoken delivery, since phone users cannot skim, re-read, or scroll back through long responses.
- Conversation Progression [LLM-as-Judge]. Measures whether the agent moved the conversation forward effectively — avoiding repetition, retaining context across turns, and driving toward task completion without stalling.
- Turn-Taking [LLM-as-Judge]. Measures whether the agent spoke at the right time — neither interrupting the user nor introducing excessive silence once the user finishes speaking.
Findings
We evaluated 20 systems — proprietary and open-source, cascade and audio-native — and find a consistent accuracy–experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa — a tradeoff invisible to benchmarks that score only task completion. No single configuration dominates both axes, confirming that accuracy and experience must be measured jointly.
Moreover, we identified named entity transcription as a dominant failure mode. A single misheard character can cascade into an authentication failure and a full conversation breakdown. Multi-step workflows also break agents in predictable ways: rebooking a flight while preserving ancillary services — seats, baggage — is the dominant complexity breaker across all configurations. Finally, we observed that additional calibration is required for real-world use cases. The gap between pass@3 and pass^3 is substantial across all configurations. Even agents that can complete a task often cannot do so consistently, which is critical for real-world success.
View the early results here.
Limitations
EVA-Bench is designed to provide rigorous, end-to-end evaluation of conversational voice agents, but several limitations are important to acknowledge across the framework, data, and metrics dimensions:
- Framework: The user simulator relies on a single commercial provider whose voice characteristics may systematically favor certain ASR systems, and the bot-to-bot pipeline — including audio format conversions and real-time audio interfaces — may not fully represent production deployments. Also, full reproduction requires commercial API access, and latency measurements will vary across providers and infrastructure.
- Data: The current release covers 50 English-language scenarios in a single domain; results may not generalize to other use cases, languages, or accents.
- Metrics: LLM-as-Judge models carry inherent biases and may favor certain response styles independent of quality, with additional risk of systematic bias when the evaluated and judge models share a provider. While we validate our judges against labeled datasets and report accuracy measurements on our website, these alignment scores do not eliminate systematic bias entirely. Moreover, task completion is measured as binary, which does not capture partial credit and may understate the relative quality of systems that fail gracefully versus catastrophically.
What’s Next
On the evaluation side, we plan to add prosodic quality assessment (pronunciation, rhythm, expressiveness) — currently an open problem after finding very low alignment between LALM-as-Judge and human judgments. We also plan robustness testing under noisy conditions, diverse accents, multilingual users, and varied speaker behaviors, alongside affect-aware evaluation of how agents respond to user distress. In terms of data, we are developing additional domain datasets — each with distinct policy structures, named entity profiles, and conversational dynamics — and more complex scenarios involving compound requests, multi-step follow-ups, and longer conversational memory. On the tooling front, we will release a results and error analysis application that automatically identifies errors per metric and model, surfaces representative examples for exploration, and generates structured summaries of each model's strengths and weaknesses. Finally, we intend to expand the leaderboard continuously to provide an up-to-date assessment of voice agent capabilities across the field.
View more details about limitations and our upcoming roadmap here.
Getting Started
Visit our GitHub to use the framework!
Acknowledgements
Core contributors include Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, and Hari Subramani.
We also thank Lindsay Brin, Akshay Kalkunte, Joseph Marinier, Jishnu Nair, and Aman Tiwari for their careful data review and thoughtful contributions to the framework, and Fanny Riols, Anil Madamala, Sridhar Nemala, and Srinivas Sunkara for their guidance, leadership, and support throughout. We also extend our thanks to the PAVA and CLAE ServiceNow teams, whose prior work on evaluations and voice agents provided valuable inspiration for this project.
Citation
@misc{eva-2026,
title={A New Framework for Evaluation of Voice Agents (EVA)},
author={Bogavelli, Tara and Gauthier Melançon, Gabrielle and Stankiewicz, Katrina and Bamgbose, Oluwanifemi and Nguyen, Hoang and Mehndiratta, Raghav and Subramani, Hari},
year={2026},
url={https://github.com/ServiceNow/EVA-Bench}
}



