⭐ In this blog post, we’ll explore dynamic speculative decoding, a novel method developed by Intel Labs and Hugging Face that accelerates text generation by up to 2.7x, depending on the task. This method is the default operational mode for assisted generation starting from Transformers🤗 release 4.45.0 ⭐
Speculative Decoding
Speculative decoding is a popular technique for speeding up the inference of large language models while preserving their accuracy. As shown in the figure below, speculative decoding works by splitting the generative process into two stages. In the first stage, a fast but less accurate draft model (AKA assistant) autoregressively generates a sequence of tokens. In the second stage, a large but more accurate target model conducts parallelized verification over the generated draft tokens. This process allows the target model to produce multiple tokens in a single forward pass and thus speeds up autoregressive decoding. The success of speculative decoding largely hinges on the speculation lookahead (SL), i.e. the number of tokens produced by the draft model in each iteration. In practice, the SL is either a static value or based on heuristics, neither of which is ideal for squeezing out maximum performance during inference.

Speculative decoding iteration.
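To make the two stages described above concrete, here is a minimal, illustrative sketch of a single speculative iteration with greedy decoding. This is not the Transformers implementation: draft_model and target_model stand in for any two causal LMs that return logits, batch size 1 is assumed, and KV caching is ignored.

import torch

def speculative_iteration(draft_model, target_model, input_ids, lookahead=5):
    # Stage 1: the draft (assistant) model autoregressively proposes `lookahead` tokens.
    ids = input_ids
    for _ in range(lookahead):
        next_token = draft_model(ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_token], dim=-1)
    draft_tokens = ids[:, input_ids.shape[1]:]

    # Stage 2: the target model verifies all draft tokens in a single forward pass.
    target_logits = target_model(ids).logits
    target_tokens = target_logits[:, input_ids.shape[1] - 1:-1, :].argmax(dim=-1)

    # Keep the longest prefix of draft tokens the target model agrees with,
    # plus one "bonus" token predicted by the target model itself.
    num_accepted = int((target_tokens == draft_tokens).int().cumprod(dim=-1).sum())
    bonus = target_logits[:, input_ids.shape[1] - 1 + num_accepted, :].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, draft_tokens[:, :num_accepted], bonus], dim=-1)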
Dynamic Speculative Decoding
Transformers🤗 offers two distinct methods to determine the schedule for adjusting the number of draft (assistant) tokens during inference. The straightforward method, based on Leviathan et al., uses a static value for the speculation lookahead and generates a constant number of candidate tokens at each speculative iteration. Alternatively, a heuristic-based approach adjusts the number of candidate tokens for the next iteration based on the acceptance rate of the current iteration: if all speculative tokens are correct, the number of candidate tokens increases; otherwise, it decreases. A rough sketch of this update rule is shown below.
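For intuition only, the heuristic schedule can be sketched as follows; the exact increment and decrement step sizes here are illustrative assumptions rather than the precise Transformers internals.

def update_num_assistant_tokens(num_assistant_tokens, num_matches, num_candidates):
    # Heuristic schedule: if every candidate token was accepted, speculate more
    # aggressively next time; otherwise back off (but never below one token).
    if num_matches == num_candidates:
        return num_assistant_tokens + 2
    return max(1, num_assistant_tokens - 1)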
We anticipate that an enhanced optimization strategy for managing the number of generated draft tokens could squeeze out further latency reductions. To test this thesis, we utilize an oracle that determines the optimal speculation lookahead value for every speculative iteration. The oracle employs the draft model to autoregressively generate tokens until a discrepancy arises between the predicted tokens of the draft and target models. This process is repeated for every speculative iteration, ultimately identifying the optimal (maximum) number of draft tokens accepted per iteration. The draft/target token mismatch is identified using the rejection sampling algorithm, introduced by Leviathan et al., with zero temperature. This oracle realizes the full potential of speculative decoding by generating the maximum number of valid draft tokens at each step and minimizing the number of calls to both the draft and target models.
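For illustration, one way to compute such an oracle value for a single iteration is sketched below. This is a conceptual sketch only (greedy decoding, batch size 1, no KV caching, and the max_new_tokens cap is an arbitrary safeguard), not the measurement code used for the figures.

import torch

def oracle_lookahead(draft_model, target_model, input_ids, max_new_tokens=128):
    # Run the draft model token by token until its greedy prediction first
    # disagrees with the target model's greedy prediction at the same position.
    ids = input_ids
    for lookahead in range(max_new_tokens):
        draft_token = draft_model(ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        target_token = target_model(ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if not torch.equal(draft_token, target_token):
            return lookahead  # optimal number of draft tokens for this iteration
        ids = torch.cat([ids, draft_token], dim=-1)
    return max_new_tokens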
The left figure below illustrates the oracle and static speculation lookahead values across the speculative iterations of a code generation example from the MBPP dataset. A high variance in oracle speculation lookahead values (orange bars) is observed.
The static speculation lookahead (blue bars), where the number of generated draft tokens is fixed at 5, performs 38 target forward passes and 192 draft forward passes, whereas the oracle speculation lookahead performs only 27 target forward passes and 129 draft forward passes – a significant reduction. The right figure shows the oracle and static speculation lookahead across the entire Alpaca dataset.

Oracle and static speculation lookahead (SL) values on one MBPP example.

Average oracle speculation lookahead for the whole Alpaca dataset.
Both figures show significant variability in oracle speculation lookahead values, suggesting that a static speculation lookahead may be suboptimal.
To get closer to the oracle and gain additional speedup, we developed a straightforward method to dynamically adjust the speculation lookahead value at each iteration. After generating each draft token, we determine whether the draft model should continue generating the next token or switch to the target model for verification. This decision relies on the assistant model’s confidence in its prediction, estimated by the softmax of the logits. If the assistant model’s confidence in the current token prediction falls below a predefined threshold, known as the assistant_confidence_threshold, it halts the token generation process for that iteration, even if the maximum number of speculative tokens, num_assistant_tokens, has not been reached. Once halted, the draft tokens generated during the current iteration are sent to the target model for verification.
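The stopping rule itself can be sketched as follows. This is a simplified illustration rather than the actual Transformers implementation: draft_model is a placeholder for any causal LM returning logits, batch size 1 is assumed, and KV caching is ignored.

import torch

def generate_draft_tokens(draft_model, input_ids,
                          num_assistant_tokens=20,
                          assistant_confidence_threshold=0.4):
    # Draft tokens are produced one at a time; drafting stops early as soon as
    # the assistant's softmax confidence in its prediction drops below the threshold.
    ids = input_ids
    for _ in range(num_assistant_tokens):
        probs = torch.softmax(draft_model(ids).logits[:, -1, :], dim=-1)
        confidence, next_token = probs.max(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_token], dim=-1)
        if confidence.item() < assistant_confidence_threshold:
            break  # hand the drafted tokens to the target model for verification
    return ids[:, input_ids.shape[1]:]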
Benchmarking
We benchmarked the dynamic approach against the heuristic approach across a range of tasks and model pairings. The dynamic approach showed better performance in all tests.
Notably, using the dynamic approach with Llama3.2-1B as the assistant for Llama3.1-8B, we observe speedups of up to 1.52x, whereas the heuristic approach showed no significant speedups with the same setup. Another observation is that codegen-6B-mono yields a slowdown with the heuristic approach, whereas the dynamic approach shows a speedup.
| Target model | Draft (Assistant) model | Task | Speedup – heuristic | Speedup – dynamic |
|---|---|---|---|---|
| facebook/opt-6.7b | facebook/opt-125m | summarization | 1.82x | 2.71x |
| facebook/opt-6.7b | facebook/opt-125m | open-ended generation | 1.23x | 1.59x |
| Salesforce/codegen-6B-mono | Salesforce/codegen-350M-mono | code generation (python) | 0.89x | 1.09x |
| google/flan-t5-xl | google/flan-t5-small | summarization | 1.18x | 1.31x |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | summarization | 1.00x | 1.52x |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | open-ended generation | 1.00x | 1.18x |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | code generation (python) | 1.09x | 1.15x |
Code
Dynamic speculation has been integrated into release 4.45.0 of the Hugging Face Transformers library and now serves as the default operation mode for assisted decoding. To use assisted generation with dynamic speculation, no code changes are required; just execute the code as you normally would:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
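# Decode and print the generated text.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))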
The default dynamic speculation lookahead parameters reflect optimal values but can be adjusted to improve performance for specific model pairs or datasets by using the following code:
assistant_model.generation_config.assistant_confidence_threshold=0.4
assistant_model.generation_config.num_assistant_tokens_schedule='constant'
assistant_model.generation_config.num_assistant_tokens=20
To revert to the heuristic or constant (as in Leviathan et al.) approaches, set num_assistant_tokens_schedule to 'heuristic' or 'constant' respectively, set assistant_confidence_threshold=0, and set num_assistant_tokens=5, as follows:
assistant_model.generation_config.num_assistant_tokens_schedule='heuristic'
assistant_model.generation_config.assistant_confidence_threshold=0
assistant_model.generation_config.num_assistant_tokens=5
What’s next?
We introduced a faster strategy for assisted generation called dynamic speculative decoding, which outperforms heuristic-based methods as well as methods that draw a constant number of candidate tokens.
In an upcoming blog post, we’ll show a new method for assisted generation: mix any target model with any assistant model! This will open the door to accelerating countless models on the Hugging Face Hub that lack sufficiently small assistant variants. For example, Phi 3, Gemma 2, CodeLlama, and many more will be eligible for speculative decoding. Stay tuned!
References
Citation
@article{mamou2024accelerating,
  title={Accelerating Speculative Decoding using Dynamic Speculation Length},
  author={Mamou, Jonathan and Pereg, Oren and Korat, Daniel and Berchansky, Moshe and Timor, Nadav and Wasserblat, Moshe and Schwartz, Roy},
  journal={arXiv preprint arXiv:2405.04304},
  year={2024}
}
