Self-speculative decoding, proposed in
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
is a novel approach to text generation. It combines the strengths of speculative decoding with early
exiting from a large language model (LLM). This method allows for efficient generation
by using the same model's early layers for drafting tokens, and its later layers for verification.
This technique not only speeds up text generation, but it also achieves significant
memory savings and reduces computational latency. In order to obtain an end-to-end speedup, the
output of the earlier layers needs to be close enough to that of the last layer. This is achieved by a
training recipe which, as described in the paper, can be applied during pretraining,
and also while fine-tuning on a specific domain. Self-speculative decoding is
especially effective for real-world applications, enabling deployment on smaller GPUs and lowering
the overall hardware footprint needed for large-scale inference.
In this blog post, we explore the concept of self-speculative decoding, its implementation,
and practical applications using the 🤗 transformers library. You'll learn about the
technical underpinnings, including early exit layers, unembedding, and training modifications.
To ground these concepts in practice, we provide code examples, benchmark comparisons with traditional
speculative decoding, and insights into performance trade-offs.
Dive straight into the following Hugging Face artifacts to learn more about the method
and try it out yourself:
- Hugging Face Paper Discussion Forum
- LayerSkip Model Collections
- Colab Notebook showcasing the in-depth working of self-speculative decoding
Speculative Decoding and Self-Speculative Decoding

Illustration of LayerSkip inference on facebook/layerskip-llama2-7B
(Llama2 7B continually pretrained with the LayerSkip recipe).
Traditional speculative decoding uses two models:
a smaller one (draft model) to generate a sequence of draft tokens, and a larger one
(verification model) to verify the draft's accuracy. The smaller model performs a significant
portion of the generation, while the larger model refines the results. This increases text
generation speed, since the larger model verifies full sequences at once instead of generating
one token at a time.
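As a point of reference, this is roughly how two-model speculative decoding (assisted generation) looks in 🤗 transformers. The model pairing mirrors one of the benchmark rows further below; the prompt, dtype, and device choices are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "meta-llama/Llama-2-7b-hf"
assistant_checkpoint = "TinyLlama/TinyLlama_v1.1"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to("cuda")
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer("Alice and Bob", return_tensors="pt").to("cuda")
# The small model drafts candidate tokens; the large model verifies them in a single forward pass.
outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```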
In self-speculative decoding, the authors build on this concept, but use the early layers of a
large model to generate draft tokens that are then verified by the model's deeper layers.
This "self" aspect of speculative decoding, which requires specific training,
allows the model to perform both drafting and verification. This, in turn, improves speed and
reduces computational costs compared to traditional speculative decoding.
Usage with transformers
In order to enable early-exit self-speculative decoding in the
🤗 transformers library, we
just need to add the `assistant_early_exit` argument to the `generate()` function.
Here is a simple code snippet showcasing the functionality:
```bash
pip install transformers
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

early_exit_layer = 4
prompt = "Alice and Bob"
checkpoint = "facebook/layerskip-llama2-7B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
model = AutoModelForCausalLM.from_pretrained(checkpoint).to("cuda")

# Draft with the first `early_exit_layer` layers and verify with the full model.
outputs = model.generate(**inputs, assistant_early_exit=early_exit_layer)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```
Note: While the
`assistant_early_exit` argument can potentially enable early-exit
self-speculative decoding for any decoder-only transformer, the logits from the intermediate layers
cannot be unembedded (the process of decoding through the LM head, described later in the blog post)
unless the model is specifically trained for that. You will also only obtain a speedup for
a checkpoint that was trained in such a way as to increase the accuracy of the earlier layers.
The LayerSkip paper proposes a training recipe to achieve that
(namely, applying early exit loss, and progressively increasing layer dropout rates). A collection
of Llama2, Llama3, and Code Llama checkpoints that have been continually pretrained with the
LayerSkip training recipe is provided here.
Benchmarking
We ran an extensive list of benchmarks to measure the speedup of LayerSkip's self-speculative
decoding with respect to autoregressive decoding on various models. We also compare self-speculative
decoding (based on early exiting) with standard speculative decoding techniques. To reproduce the
results, you may find the code here
and the command to run each experiment in this
spreadsheet.
All the experiments were run on a single 80 GB A100 GPU, except for the Llama2 70B experiments, which
ran on a node of 8 A100 GPUs.
Llama3.2 1B
| Model Variant | Model Size (B) | Assistant Model | Assistant Model Size (B) | Task | Total Number of Parameters (B) | FLOPs/Input (G) | Time/Input (s) | FLOPs/Output (G) | Time/Output (s) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| facebook/layerskip-llama3.2-1B | 1 | Early Exit @ Layer 4 | N/A | summarization | 1 | 1195.28 | 9.96 | 2147.7 | 17.9 | 1.80 |
Llama3 8B
| Model Variant | Model Size (B) | Assistant Model | Assistant Model Size (B) | Task | Total Number of Parameters (B) | FLOPs/Input (G) | Time/Input (s) | FLOPs/Output (G) | Time/Output (s) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3-8B | 8 | meta-llama/Llama-3.2-1B | 1 | summarization | 9 | 1872.46 | 19.04 | 2859.35 | 29.08 | 1.53 |
| meta-llama/Meta-Llama-3-8B | 8 | meta-llama/Llama-3.2-3B | 3 | summarization | 11 | 2814.82 | 28.63 | 2825.36 | 28.73 | 1.00 |
| facebook/layerskip-llama3-8B | 8 | Early Exit @ Layer 4 | N/A | summarization | 8 | 1949.02 | 15.75 | 3571.81 | 28.87 | 1.83 |
Llama2 70B
| Model Variant | Model Size (B) | Assistant Model | Assistant Model Size (B) | Task | Total Number of Parameters (B) | FLOPs/Input (G) | Time/Input (s) | FLOPs/Output (G) | Time/Output (s) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-2-70b-hf | 70 | meta-llama/Llama-2-13b-hf | 13 | summarization | 83 | 5036.54 | 46.3 | 12289.01 | 112.97 | 2.44 |
| meta-llama/Llama-2-70b-hf | 70 | meta-llama/Llama-2-7b-hf | 7 | summarization | 77 | 4357.55 | 40.06 | 12324.19 | 113.3 | 2.83 |
| meta-llama/Llama-2-70b-hf | 70 | TinyLlama/TinyLlama_v1.1 | 1 | summarization | 71 | 4356.21 | 40.05 | 12363.22 | 113.66 | 2.84 |
| facebook/layerskip-llama2-70B | 70 | Early Exit @ Layer 10 | N/A | summarization | 70 | 6012.04 | 54.96 | 1283.34 | 113.2 | 2.06 |
Llama2 13B
| Model Variant | Model Size (B) | Assistant Model | Assistant Model Size (B) | Task | Total Number of Parameters (B) | FLOPs/Input (G) | Time/Input (s) | FLOPs/Output (G) | Time/Output (s) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-2-13b-hf | 13 | meta-llama/Llama-2-7b-hf | 7 | summarization | 20 | 3557.07 | 27.79 | 4088.48 | 31.94 | 1.15 |
| meta-llama/Llama-2-13b-hf | 13 | TinyLlama/TinyLlama_v1.1 | 1 | summarization | 14 | 2901.92 | 22.67 | 4190.42 | 32.74 | 1.44 |
| meta-llama/Llama-2-13b-hf | 13 | apple/OpenELM-270M | 0.27 | summarization | 13.27 | 2883.33 | 22.53 | 4521.12 | 35.32 | 1.57 |
| meta-llama/Llama-2-13b-hf | 13 | apple/OpenELM-450M | 0.45 | summarization | 13.45 | 3267.69 | 25.53 | 4321.75 | 33.76 | 1.32 |
| facebook/layerskip-llama2-13B | 13 | Early Exit @ Layer 4 | N/A | summarization | 13 | 4238.45 | 33.11 | 4217.78 | 32.95 | 0.995 |
| facebook/layerskip-llama2-13B | 13 | Early Exit @ Layer 8 | N/A | summarization | 13 | 2459.61 | 19.22 | 4294.98 | 33.55 | 1.746 |
Llama2 7B
| Model Variant | Model Size (B) | Assistant Model | Assistant Model Size (B) | Task | Total Number of Parameters (B) | FLOPs/Input (G) | Time/Input (s) | FLOPs/Output (G) | Time/Output (s) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-2-7b-hf | 7 | TinyLlama/TinyLlama_v1.1 | 1 | summarization | 8 | 2771.54 | 21.65 | 3368.48 | 26.32 | 1.22 |
| meta-llama/Llama-2-7b-hf | 7 | apple/OpenELM-270M | 0.27 | summarization | 7.27 | 2607.82 | 20.37 | 4221.14 | 32.98 | 1.62 |
| meta-llama/Llama-2-7b-hf | 7 | apple/OpenELM-450M | 0.45 | summarization | 7.45 | 3324.68 | 25.97 | 4178.66 | 32.65 | 1.26 |
| facebook/layerskip-llama2-7B | 7 | Early Exit @ Layer 4 | N/A | summarization | 7 | 2548.4 | 19.91 | 3306.73 | 25.83 | 1.297 |
Some observations we can make from the results:
- As seen in the Total Number of Parameters column, self-speculative decoding consumes less memory, since it doesn't require a separate draft model and the weights of the draft-stage layers are re-used.
- For all model sizes and generations except Llama2 70B, early-exit self-speculative decoding is faster than regular two-model speculative decoding.
There could be different reasons for the relatively limited speedup of self-speculative decoding
on Llama2 70B compared to the other models, e.g., the LayerSkip checkpoint of Llama2 70B was continually
pretrained with fewer tokens (328 M tokens for Llama2 70B compared to 52 B tokens for Llama2 7B).
This is an area to investigate in future research. Nevertheless,
self-speculative decoding for 70B is significantly faster than autoregressive decoding.
Early Exit and Unembedding
One key technique in self-speculative decoding is early exit, where the generation process can halt
at a pre-specified layer. To accomplish this, we unembed the outputs of these layers by projecting
them through the language model (LM) head to predict the next token. This allows the model to skip
subsequent layers and improve inference time.
Unembedding can be performed at any transformer layer, turning early exit into an efficient
token-prediction mechanism. A natural question arises: how can the LM head be adapted to unembed
outputs from earlier layers when it was initially trained to work with the final layer only? This
is where the training modifications come into play.
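To make this concrete, here is a minimal sketch of unembedding an intermediate layer with a LayerSkip checkpoint. It assumes Llama-style module names (`model.model.norm`, `model.lm_head`) and simply applies the final norm and the shared LM head to a hidden state taken a few layers in; for checkpoints not trained with the LayerSkip recipe, the resulting predictions will generally be poor:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

inputs = tokenizer("Alice and Bob", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

exit_layer = 4
# hidden_states[0] is the embedding output, so index `exit_layer` is that decoder layer's output.
hidden = outputs.hidden_states[exit_layer][:, -1, :]
# "Unembed": apply the final norm and the LM head to the intermediate representation.
early_logits = model.lm_head(model.model.norm(hidden))
print(tokenizer.decode(early_logits.argmax(dim=-1)))
```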
Training Modifications: Layer Dropout and Early Exit Loss
In the training phase, we introduce layer dropout,
which allows the model to skip certain layers during training. The dropout rate increases
progressively for deeper layers, making the model less reliant on its later layers, as well as
enhancing the model's generalization and speeding up training.
In addition to layer dropout, an early exit loss is applied to ensure the LM head learns to
unembed different layers. The total loss function for training the model with early exits is
given by a summation of the normalized losses from each exit (intermediate layer). This technique enables
efficient training by distributing the learning task across all layers.
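Schematically (using notation assumed here rather than taken verbatim from the paper), the objective can be written as a weighted sum of per-exit cross-entropy losses, where the unembedded output of each layer is compared against the target tokens and the per-layer scales are normalized to sum to one:

$$
J = \sum_{l=1}^{L} \tilde{e}_l \, J_{\text{CE}}\big(\mathrm{LMHead}(h_l),\, Y\big),
\qquad
\tilde{e}_l = \frac{e_l}{\sum_{i=1}^{L} e_i}
$$

Here \\( h_l \\) is the hidden state at layer \\( l \\), \\( Y \\) are the target tokens, and \\( e_l \\) is a per-layer scale (in the paper, the scales follow a curriculum that emphasizes earlier layers over time).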
Self-Drafting and Self-Verification
Once training is complete, we can apply self-speculative decoding during inference.
The process
begins with self-drafting, where tokens are generated by exiting early from some intermediate
layer. The number of speculative tokens defines how many draft tokens are produced during
this stage, while the layer we exit at defines how large and accurate the draft stage is.
Both parameters can be specified at inference time based on a
trade-off between speed and accuracy of the draft stage.
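For example, both knobs can be set directly in a `generate()` call. This is a sketch: `num_assistant_tokens` is the 🤗 transformers generation argument that controls how many draft tokens are proposed per speculation round, and depending on your library version its schedule may adapt the number dynamically:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to("cuda")
inputs = tokenizer("Alice and Bob", return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    assistant_early_exit=4,     # which layer the self-drafting stage exits from
    num_assistant_tokens=5,     # how many draft tokens are proposed per round
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```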
The next stage is self-verification, where the full model is used to verify the draft tokens.
The verification stage reuses the portion of the cache produced by the draft stage. If the draft tokens align
with the verified tokens, they are added to the final output. This results in better use of the
memory bandwidth of our system, since generating a sequence of tokens
with the full model is far more expensive than verifying a draft, as long as several of the tokens match.
In the self-verification stage, only the remaining layers are computed for verification, because
the results of the early layers are cached during the drafting phase.
Optimizations: Shared Weights, Shared KV Cache, and Shared Compute
Self-speculative decoding benefits significantly from cache reuse, particularly of the KV cache,
which stores the key-value pairs computed during the drafting stage. This cache allows the model to skip
redundant calculations, as both the draft and verification stages use the same early layers.
Additionally, the exit query cache stores the query vector of the exit layer, allowing
verification to continue seamlessly from the draft stage.
Compared to traditional two-model speculative decoding, early-exit self-speculative decoding can
benefit from the following savings:
- Shared Weights: reuses the weights of the first layers (up to the exit layer) for both drafting and verification.
- Shared KV Cache: reuses the key-value pairs of the first layers for both drafting and verification.
- Shared Compute: reuses the compute of the first layers by using an
Exit Query Cache that saves only the query vector of the exit layer, so that the
verification process won't need to recompute the layers before the exit layer.
The combination of the KV and exit query caches, known as the KVQ cache, reduces memory
overhead and improves inference latency.
So far, the 🤗 transformers library has implemented the first optimization (Shared Weights) in this
pull request. As the number of models
that use this method increases, we'll consider implementing the additional optimizations.
Feel free to open a PR if you're interested!
How Early Can We Exit?
The early exit layer of the draft stage is a hyperparameter that we can tune or modify during inference:
- The earlier we exit, the faster draft tokens are generated, but the less accurate they will be.
- The later we exit, the more accurate the draft tokens are, but the slower their generation will be.
We wrote a script
to sweep across different early exit layers and measure the tokens per second on A100 GPUs
(a minimal sketch of such a sweep is shown after the figures below).
In the figures below we plot the tokens per second versus the early exit layer for different
Llama models, for both LayerSkip and baseline checkpoints (you can view the full logs
here).
Llama3.2 1B
Llama3 8B
Code Llama 34B
Code Llama 7B
Llama2 70B
Llama2 13B
Llama2 7B
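Below is a minimal sketch of such a sweep, not the benchmarking script used for the numbers above; the prompt, dtype, and `max_new_tokens` are illustrative:

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to("cuda")
inputs = tokenizer("Alice and Bob", return_tensors="pt").to("cuda")

num_layers = model.config.num_hidden_layers
for exit_layer in range(1, num_layers):
    start = time.perf_counter()
    outputs = model.generate(
        **inputs,
        assistant_early_exit=exit_layer,
        max_new_tokens=128,
        do_sample=False,
    )
    elapsed = time.perf_counter() - start
    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"exit layer {exit_layer:2d}: {new_tokens / elapsed:.1f} tokens/s")
```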
We can observe the following:
- For the baseline checkpoints that have not been pretrained or continually pretrained with the
LayerSkip training recipe, early exit self-speculative decoding is slower than autoregressive decoding.
This is because, during the training of most LLMs, the earlier layers are not encouraged to learn to predict
the output, and hence generating tokens using the earlier layers will have a very low acceptance rate.
- On the other hand, for the Llama checkpoints that were continually pretrained with the LayerSkip
recipe, early exit self-speculative decoding achieves a higher speedup than autoregressive decoding for
at least a subset of the layers.
- For most models, except Llama3.2 1B, we notice a regular pattern as we traverse across the
layers: speedup starts low for the first few layers, increases gradually to a sweet spot, and
then decreases again.
- The early exit layer sweet spot is where we have the optimal trade-off between high accuracy of
predictions and low overhead of generating tokens. This sweet spot depends on the model,
and may also depend on the prompt or the domain of the prompt.
These observations present intriguing opportunities for further experimentation and exploration.
We encourage readers to build upon these ideas, test variations, and pursue their own research.
Such efforts can lead to valuable insights and contribute meaningfully to the field.
Conclusion
LayerSkip leverages the synergy between early exit, layer dropout, and cache reuse to create a fast
and efficient text generation pipeline. By training the model to unembed the outputs of different
layers and by optimizing the verification process with caches, this approach strikes a balance between
speed and accuracy. As a result, it significantly improves inference time in large language models
while maintaining high-quality outputs. It also reduces memory compared to traditional speculative
decoding techniques, since a single model is used as both the draft and verification model.
Self-speculation is an exciting field where the same LLM can create draft tokens and correct itself. Other
self-speculation approaches include:
- Draft & Verify: where the draft stage involves
skipping pre-determined attention and feed-forward layers.
- MagicDec: where the draft stage uses a subset of the KV cache,
which is useful for long-context inputs.
- Jacobi Decoding and Lookahead Decoding:
where the draft stage is a series of "guess tokens" that could be either random or obtained from an n-gram lookup table.