A Unified and Diverse Benchmark for Speculative Decoding

Speculative Decoding (SD) has emerged as a critical technique for accelerating LLM inference.
SD uses a lightweight draft model to speculate multiple future tokens, which are then verified in parallel by the target model. This way, SD can significantly improve throughput while preserving the exact output distribution of the target model.
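The draft-then-verify step can be sketched with the standard speculative sampling acceptance rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q). This is a toy illustration, not the code of any inference engine; the names `verify_draft` and `sample` and the dictionary-based distributions are assumptions made for readability.

```python
import random


def sample(weights, rng):
    """Draw one token from a dict mapping token -> non-negative weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Return the tokens emitted by one draft-then-verify step.

    draft_tokens: tokens proposed by the draft model.
    q_probs[i]:   draft distribution (token -> prob) at position i.
    p_probs[i]:   target distribution at position i. It has one extra entry,
                  used for the bonus token when every draft is accepted.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        emitted.append(sample(residual, rng))
        return emitted
    # All drafts accepted: the target model contributes one bonus token.
    emitted.append(sample(p_probs[len(draft_tokens)], rng))
    return emitted
```

Under this convention, the number of tokens returned per step (accepted drafts plus one token from the target model) is the acceptance length, which ranges from 1 to the draft length plus 1.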

Despite rapid progress in SD algorithms, their evaluation remains fragmented and often unrepresentative of real-world data and serving conditions.
In practice, SD speculation quality and inference speedups are inherently data-dependent, serving-regime–dependent, and system-dependent.
Yet most existing benchmarks rely on small prompt sets, limited semantic diversity, short input sequence lengths, batch size one, or high-level inference stacks that don’t reflect production environments.

To address these gaps, we introduce SPEED-Bench: a unified benchmark designed to evaluate SD across diverse semantic domains and realistic serving regimes, using production-grade inference engines.

What is SPEED-Bench?

SD must be evaluated from two perspectives.

On one hand, draft quality depends on the semantic domain and entropy of the input text.
On the other hand, real-world speedups depend on batch size, input sequence length (ISL), and system constraints, which determine whether inference is memory-bound or compute-bound.

SPEED-Bench therefore introduces a benchmarking ecosystem for SD.
It combines two purpose-built dataset splits and a unified measurement framework, each designed to capture a distinct aspect of SD behavior:

A “Qualitative” data split, optimized for semantic diversity and designed to measure speculation quality (drafter accuracy) across domains.
A “Throughput” data split, constructed to evaluate system-level speedups across various input sequence lengths and high concurrency.
A unified measurement framework, integrated with production inference engines, that standardizes evaluation across systems.

Together, these components enable practitioners and researchers to investigate SD behavior that is commonly masked by existing benchmarks.

Figure 1 provides a high-level overview of the SPEED-Bench ecosystem.

**Figure 1**: **Overview of the SPEED-Bench ecosystem.** **(Left)** Curation of the Qualitative split, utilizing a custom selection algorithm on prompt embeddings to maximise semantic diversity across categories. **(Middle)** Construction of the Throughput Split, where data is aggregated and processed into fixed Input Sequence Length (ISL) buckets (1k-32k) across three domain difficulties, supporting large batch sizes (up to 512 per ISL and difficulty). **(Right)** The unified measurement framework used to report standard SD metrics and speedups.

The Qualitative split: semantic coverage and draft accuracy

The goal of the Qualitative split is to measure speculative decoding quality, specifically conditional acceptance rates (ARs) and acceptance lengths (ALs), across a wide range of semantic domains.

SpecBench introduced the first unified SD benchmark across diverse application scenarios, such as multi-turn conversation, translation, and mathematical reasoning, by aggregating instances from widely used datasets into a unified testing environment. Nevertheless, despite being a major step toward standardized evaluations, it has critical limitations regarding scale and diversity. Most categories contain as few as 10 samples with short mean input lengths (< 100 tokens), which may fail to stress modern drafters. Moreover, some of its categories lack structural diversity, such as the multilingual category, which consists entirely of German-to-English translation prompts.

While extensive evaluation across numerous datasets is theoretically possible, it’s tedious, impractical for rapid experimentation, and hinders direct comparisons between different research groups releasing SD algorithms and models. Instead of relying on exhaustive evaluations across disparate datasets, we curate a compact yet highly representative subset designed to maximise semantic diversity.
We aggregate data from 18 publicly available sources and organize it into 11 categories, including Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning, and QA.

Each category contains 80 samples, resulting in a total of 880 prompts.
Unlike prior benchmarks, which often suffer from low intra-category diversity, the SPEED-Bench Qualitative split explicitly prioritizes semantic diversity.

To achieve this, each candidate prompt is embedded into a dense vector space using a pretrained text embedder (openai/text-embedding-3-small).
We then apply a selection algorithm that minimizes average pairwise cosine similarity within each category.
This ensures that the chosen samples span the semantic space as widely as possible, reducing redundancy and increasing evaluation fidelity.
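The selection step can be illustrated with a simple greedy heuristic. The post does not spell out the actual selection algorithm, so the sketch below is an assumption: it seeds with the candidate least similar to all others and then repeatedly adds the candidate with the lowest summed cosine similarity to the prompts already picked. The function names are hypothetical, and the embeddings are plain Python sequences rather than real embedder output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_diverse(vectors, k):
    """Greedily pick k indices whose mutual cosine similarity is low."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Seed with the candidate that is least similar to all others overall.
    chosen = [min(range(n), key=lambda i: sum(sim[i]))]
    while len(chosen) < k:
        remaining = [i for i in range(n) if i not in chosen]
        # Add the candidate with the lowest summed similarity to the picks so far.
        chosen.append(min(remaining, key=lambda i: sum(sim[i][j] for j in chosen)))
    return chosen


def mean_pairwise_similarity(vectors, idx):
    """Average cosine similarity over all distinct pairs in idx."""
    pairs = [(i, j) for a, i in enumerate(idx) for j in idx[a + 1:]]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
```

A greedy pass like this does not guarantee the global minimum of average pairwise similarity, but it is cheap and makes the objective concrete: `mean_pairwise_similarity` is the quantity being driven down within each category.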

The effectiveness of this approach is shown in Figure 2, which compares average semantic similarity across SPEED-Bench and SpecBench.

**Figure 2:** Comparison of average semantic similarity between samples **(lower is better)**. SPEED-Bench achieves lower similarity than both random selection and SpecBench across all categories.

This semantic diversity is critical for exposing domain-dependent behavior in SD, such as the strong contrast between low-entropy domains (e.g., Coding, Math) and high-entropy domains (e.g., Roleplay, Writing).

The Throughput split: realistic serving workloads

While the Qualitative split captures draft accuracy, it’s insufficient for evaluating system-level speedups.

We evaluate system-level speedups using two metrics: Throughput (Output TPS), the total number of tokens generated per second across all concurrent requests, and User TPS, the per-request token generation rate. User TPS acts as a proxy for end-user latency.
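As a rough sketch of how the two metrics differ, the snippet below computes both from per-request timing records. The `RequestRecord` fields and the choice to measure User TPS after the first token (excluding prefill time) are assumptions for illustration; the framework's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start: float        # time the request was sent (seconds)
    first_token: float  # time the first output token arrived
    end: float          # time the last output token arrived
    output_tokens: int  # number of generated tokens


def output_tps(records):
    """Aggregate throughput: all generated tokens over the wall-clock window."""
    window = max(r.end for r in records) - min(r.start for r in records)
    return sum(r.output_tokens for r in records) / window


def user_tps(record):
    """Per-request generation rate, measured from the first token onward."""
    return (record.output_tokens - 1) / (record.end - record.first_token)
```

Raising concurrency typically increases Output TPS while lowering User TPS, which is why the two are plotted against each other as a Pareto curve.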

In production environments, models are served under high concurrency and a wide range of input sequence lengths, which are often much longer than the short ISL samples used in many SD benchmarks.
As batch size increases, inference often transitions from a memory-bound regime to a compute-bound regime, fundamentally changing the cost-benefit trade-offs of speculative decoding.

The Throughput split is designed specifically to capture this behavior.
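A back-of-the-envelope roofline estimate shows why batch size matters for SD. The sketch below is heavily simplified and every number in it is an assumption: it models a dense model whose decode step reads each weight once and performs 2 FLOPs per parameter per token, ignores KV-cache traffic and attention cost, and uses hypothetical hardware figures as defaults.

```python
def decode_regime(batch_size, draft_length=0, bytes_per_param=2.0,
                  peak_flops=1.0e15, mem_bandwidth=3.0e12):
    """Classify one decode step as memory-bound or compute-bound.

    With speculative decoding, the target model verifies draft_length + 1
    tokens per sequence in a single step, so the effective number of tokens
    processed per weight read grows by that factor.
    """
    tokens_per_step = batch_size * (draft_length + 1)
    # Arithmetic intensity in FLOPs per byte of weights read.
    intensity = 2.0 * tokens_per_step / bytes_per_param
    # Ridge point of the (hypothetical) accelerator.
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity > ridge else "memory-bound"
```

With these placeholder numbers, a batch of 128 is memory-bound without speculation but compute-bound with a draft length of 3. That is the intuition behind speculation being nearly free at small batch sizes and increasingly costly at large ones.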

We construct fixed ISL buckets ranging from 1k to 32k tokens, reflecting the growing importance of long-context applications such as coding assistants and retrieval-augmented generation.
For each ISL bucket, prompts are aggregated into three coarse difficulty categories corresponding to low-, mixed-, and high-entropy domains.

Each ISL bucket contains 1,536 prompts (512 per difficulty category), providing sufficient volume to construct stable throughput Pareto curves across a wide range of batch sizes (Figure 3).

**Figure 3:** Throughput as a function of user TPS, comparing DL=1,3 on the Throughput Split (2k). Target model is GPT-OSS 120B with EAGLE3, measured on vLLM. Points represent BS from 2 to 512.

To ensure deterministic prefill cost, prompts are either truncated or padded in a controlled manner, while preserving their semantic content.

Importantly, SPEED-Bench avoids the use of random token inputs for throughput benchmarking.
As we show later, random tokens can severely distort acceptance behavior, expert routing in MoE models, and throughput measurements, resulting in overly optimistic conclusions.

A unified measurement framework

Benchmarking SD across inference engines presents a subtle but critical challenge.

Different engines may apply different chat templates, handle BOS tokens differently, or tokenize inputs inconsistently.
These differences can silently alter the drafted sequence, making cross-engine comparisons unreliable.

SPEED-Bench introduces a lightweight measurement framework that handles tokenization and prompt formatting externally.
Inference engines receive pre-tokenized sequences, ensuring that all systems process identical inputs.

The framework integrates with production-grade engines: TensorRT-LLM, vLLM, and SGLang.
It captures fine-grained timing information from streaming responses to compute acceptance behavior, step latency, user-level tokens-per-second, and overall throughput.
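One way such metrics could be derived from a streaming response is sketched below. It rests on an assumption not confirmed by the post: that every streamed chunk after the first corresponds to exactly one verification step, so the number of tokens in a chunk equals that step's acceptance length. The function name and the input format are hypothetical.

```python
from collections import Counter


def summarize_stream(request_start, chunks):
    """Summarize one streamed response.

    chunks: list of (arrival_time_seconds, num_tokens), in arrival order.
    The first chunk is treated as the prefill result. Each later chunk is
    assumed to be one speculative decoding step.
    """
    times = [t for t, _ in chunks]
    step_times = [b - a for a, b in zip(times, times[1:])]
    step_sizes = [n for _, n in chunks[1:]]
    decode_tokens = sum(step_sizes)
    return {
        "ttft": times[0] - request_start,
        "mean_step_time": sum(step_times) / len(step_times),
        "acceptance_length_histogram": dict(Counter(step_sizes)),
        "mean_acceptance_length": decode_tokens / len(step_sizes),
        "tokens_per_second": decode_tokens / (times[-1] - times[0]),
    }
```

If an engine coalesces several steps into one chunk or splits a step across chunks, this mapping breaks down, which is one reason a shared measurement layer is useful.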

Below is an example of running our measurement framework with Llama 3.3 70B Instruct as the target model and EAGLE3 as the draft model on the Qualitative split of SPEED-Bench, using TensorRT-LLM with a batch size of 32 (8×H100 GPUs):

Example output of the measurement framework

bash-5.2$ mpirun -n 1 --oversubscribe python3 run.py --model_dir meta-llama/Llama-3.3-70B-Instruct --tokenizer meta-llama/Llama-3.3-70B-Instruct --draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B --dataset speed --dataset_path data/speed/qualitative --tp_size 8 --ep_size 1 --draft_length 3 --output_length 4096 --engine TRTLLM --concurrency 32 --show_progress
...
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc1
...
Running requests (concurrency=32): 100%|██████████| 880/880 [02:58<00:00,  4.93it/s]
...
Acceptance Length Histogram
{1: 57385, 2: 36968, 3: 24441, 4: 61182}
Conditional acceptance rate

1 1.0
2 0.681151931368627
3 0.6984444208791836
4 0.7145509968116043

    Acceptance Rate Results     

┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Category        ┃ Average AR ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ coding          │     3.0001 │
│ humanities      │     2.3699 │
│ math            │     2.4710 │
│ multilingual    │     1.7277 │
│ qa              │     2.3184 │
│ rag             │     2.7502 │
│ reasoning       │     2.6142 │
│ roleplay        │     2.0407 │
│ stem            │     2.4306 │
│ summarization   │     2.6026 │
│ writing         │     2.6364 │
├─────────────────┼────────────┤
│ Overall Average │     2.4511 │
└─────────────────┴────────────┘
Output TPS 2518.1464055829188
Output TPS/gpu 314.76830069786485
E2E Request Time {'min': '0.1031', 'max': '51.3442', 'mean': '4.7313', 'std': '4.8872', 'quantiles': {'0.25': '1.6972', '0.5': '3.5459', '0.75': '6.1715'}}
TTFT Time {'min': '0.0528', 'max': '0.8010', 'mean': '0.1217', 'std': '0.1133', 'quantiles': {'0.25': '0.0841', '0.5': '0.0982', '0.75': '0.1139'}}
Request Generation Step Time {'min': '0.0000', 'max': '0.8010', 'mean': '0.0299', 'std': '0.0173', 'quantiles': {'0.25': '0.0259', '0.5': '0.0267', '0.75': '0.0280'}}
Request Generation Tokens Per Second {'min': '35.1116', 'max': '142.9547', 'mean': '85.3162', 'std': '17.4823', 'quantiles': {'0.25': '75.0608', '0.5': '84.9756', '0.75': '97.0815'}}
Number of Output Tokens {'min': '3.0000', 'max': '4096.0000', 'mean': '394.8787', 'std': '452.7451', 'quantiles': {'0.25': '123.0000', '0.5': '281.0000', '0.75': '518.2500'}}

This design isolates the effects of SD algorithms and system optimizations from preprocessing artifacts.
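The conditional acceptance rates in the example output can be reproduced from the acceptance length histogram alone: the rate at position k is the fraction of steps that reached at least length k among those that reached at least k - 1. The function name below is an assumption; the histogram values are copied from the output above.

```python
def conditional_acceptance_rates(histogram):
    """Return P(AL >= k | AL >= k - 1) for each length k in the histogram."""
    rates = {}
    for k in sorted(histogram):
        reached_k = sum(c for n, c in histogram.items() if n >= k)
        reached_prev = sum(c for n, c in histogram.items() if n >= k - 1)
        rates[k] = reached_k / reached_prev
    return rates


# Histogram from the example run (draft length 3, so lengths range from 1 to 4).
hist = {1: 57385, 2: 36968, 3: 24441, 4: 61182}
rates = conditional_acceptance_rates(hist)
for k, rate in rates.items():
    print(k, rate)
```

For this histogram, the rates at positions 2, 3, and 4 come out to roughly 0.681, 0.698, and 0.715, matching the reported values. Position 1 is always 1.0 because every step emits at least one token.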

Insights from SPEED-Bench

Domain-dependent accuracy and speedups

The table below reports average acceptance lengths and speedups across domains and models at a practical batch size (32) and a draft length of three.

The results confirm that SD acceptance length is highly domain-dependent.
Low-entropy domains such as Coding and Math consistently yield higher acceptance lengths, while high-entropy tasks such as Roleplay and Writing are harder to speculate on.

The table also highlights differences between speculation methods.
Lightweight approaches such as N-Gram speculation can lead to net slowdowns at moderate batch sizes. We further see that native MTP heads achieve significantly larger ALs than post-trained alternatives like EAGLE3, highlighting the benefit of co-training the base model and drafter from scratch.

| Domain | Llama 3.3 70B with N-Gram (TensorRT-LLM) | GPT-OSS 120B with EAGLE3 (TensorRT-LLM) | Qwen3-Next with MTP (SGLang) |
| --- | --- | --- | --- |
| Coding | 1.54 | 2.46 | 3.34 |
| Math | 1.43 | 2.46 | 3.13 |
| Roleplay | 1.15 | 1.87 | 2.09 |
| Writing | 1.33 | 1.98 | 2.46 |
| Mean AL | 1.41 | 2.25 | 2.81 |
| Mean Speedup | 0.88x | 1.34x | 1.20x |

Full results on the Qualitative split of all categories and different models are in our paper.

Vocabulary pruning reveals long-tail failures

SPEED-Bench can also help expose side effects of aggressive system optimizations.

Vocabulary pruning is used in EAGLE3 to reduce the computational cost of the final projection layer.
While effective on narrow domains, this optimization can degrade acceptance length on the “long tail” of user inputs.
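The mechanism can be illustrated with a small sketch. A drafter restricted to a pruned vocabulary cannot propose any token outside it, so the share of target-model tokens that fall inside the pruned set bounds how often a draft can match. The frequency-based pruning rule and both function names below are assumptions for illustration, not EAGLE3's actual implementation.

```python
from collections import Counter


def build_pruned_vocab(training_token_ids, keep):
    """Keep the `keep` most frequent token ids seen in drafter training data."""
    return {tok for tok, _ in Counter(training_token_ids).most_common(keep)}


def draftable_fraction(target_token_ids, pruned_vocab):
    """Fraction of target-model tokens the drafter is able to propose at all."""
    hits = sum(1 for tok in target_token_ids if tok in pruned_vocab)
    return hits / len(target_token_ids)
```

On text that resembles the drafter's training distribution this fraction stays close to 1, while on multilingual or otherwise rare token sequences it can drop sharply, consistent with the category-level gaps described here.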

Figure 4 shows acceptance length across domains when using full vs. pruned vocabularies for GPT-OSS 120B with EAGLE3.
The impact is minimal in Coding and Math, but substantial in Multilingual, RAG, and Summarization categories.

**Figure 4:** Average AL across selected categories using GPT-OSS 120B and EAGLE3 drafters (full vs. pruned vocabulary), DL=3.

These effects are largely invisible in low-diversity benchmarks, underscoring the importance of broad semantic coverage in the evaluation data.

Random tokens overestimate throughput

A common practice in inference benchmarking is to use random tokens to simulate prompt load.
While this may be sufficient for autoregressive decoding, the approach is fundamentally flawed for SD algorithms, and even for mixture-of-experts (MoE) models without speculation, as shown below.

Random tokens trigger two failure modes that skew measurements:

Trivial Response: The model identifies noise and defaults to predictable acknowledgments, artificially inflating ALs.

Example output (Base: GPT-OSS 120B, Drafter: EAGLE3, Draft Length: 3, Average AL: 3.44):

It looks like you’ve pasted a very long block of mixed‑language text that doesn’t form a clear question or request. I’m happy to help, but I need a bit more guidance.
Could you let me know what you’d like to do with this text? For example: …

Topic Latching: The model anchors to specific keywords within the noise and hallucinates a coherent response, typically leading to lower ALs.

Example output (Base: GPT-OSS 120B, Drafter: EAGLE3, Draft Length: 3, Average AL: 1.877):

Below is an expanded, production‑ready roadmap that takes you from the very first Unity install all the way to a complete, polished 2‑D platformer (player, camera, enemies, collectibles, UI, audio, level loading, and a final build).
Everything is broken into bite‑size tasks, each with the exact actions you need to perform and ready‑to‑copy C# snippets.
…

Figure 5 compares throughput measured using random tokens vs. SPEED-Bench workloads.
When SD is enabled, random tokens overestimate throughput by roughly 23%.

**Figure 5:** Throughput as a function of user TPS, comparing random input tokens to the Throughput Split (8k). Target model is GPT-OSS 120B with EAGLE3 drafter, measured on TensorRT-LLM. DL=3. Points represent BS from 1 to 128.

Random inputs also fail to trigger realistic expert routing in MoE models, resulting in inaccurate throughput measurements even in non-speculative settings, as presented in Figure 6.

**Figure 6:** Number of unique experts activated as a function of layer index, comparing random input tokens to the Throughput Split (8k). Target model is GPT-OSS 120B, BS=32.
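The quantity plotted in Figure 6 is straightforward to compute once routing decisions are available. The sketch below assumes a hypothetical nested-list format, `routing[layer][token]`, holding the expert ids selected for each token in a batch. How those decisions are extracted from a particular engine is outside its scope.

```python
def unique_experts_per_layer(routing):
    """Count distinct experts activated in each MoE layer.

    routing[layer][token] is the list of expert ids chosen for that token.
    """
    return [
        len({expert for token_experts in layer for expert in token_experts})
        for layer in routing
    ]
```

More unique experts per layer means more expert weights must be read from memory in each step, so an input distribution that shifts this count also shifts the measured throughput.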

Start using SPEED-Bench

SPEED-Bench is released to establish a unified standard for evaluating SD in both research and production settings.

It enables practitioners to analyze draft accuracy across diverse domains, measure throughput under realistic serving regimes, and compare inference engines using identical workloads.

The dataset and measurement framework are openly available and designed to integrate directly with existing SD implementations.

Resources

We hope SPEED-Bench helps drive more rigorous, realistic, and deployment-aware evaluation of speculative decoding!
