Faster Decoding with Any Assistant Model




TL;DR: Many LLMs such as gemma-2-9b and Mixtral-8x22B-Instruct-v0.1 lack a much smaller version to use for assisted generation. In this blog post, we present Universal Assisted Generation: a method developed by Intel Labs and Hugging Face which extends assisted generation to work with a small language model from any model family 🤯. As a result, it is now possible to speed up inference from any decoder or Mixture of Experts model by 1.5x-2.0x with almost zero overhead 🔥🔥🔥. Let's dive in!



Introduction

Nowadays, the strongest open weight LLMs typically have billions to hundreds of billions of parameters (hello Llama-3.1-405B 👋), and deploying these beasts in production environments poses a range of engineering challenges. One such challenge is that generating text from these large models is slow, which has prompted the community to develop a wide range of techniques to speed up the decoding process. Assisted generation, also known as speculative decoding, is a very popular and practical approach for accelerating LLM inference without accuracy loss. In this blog post, we take a look at how assisted generation works and share our research to extend it to any of the 140,000 language models on the Hugging Face Hub 🚀!



Assisted Generation

The core idea behind assisted generation involves using a pair of models, referred to as the target and assistant models. The assistant model is a smaller, more efficient version of the target model; for example, you can use Llama-3.2-1B as the assistant model for the larger Llama-3.1-70B target model.
Assisted generation is an iterative process. In each cycle, the assistant model generates a sequence of tokens autoregressively, one at a time. The target model then verifies all the assistant tokens in the sequence in a single forward pass. The speedup is achieved by confirming multiple tokens in each forward pass of the target model, rather than producing only one token at a time. For a more detailed explanation, see the original blog post. Combined with the recently introduced Dynamic Speculation strategy, assisted generation accelerates text generation by 1.5x-3x, depending on the task and the models used.
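To make the mechanism concrete, here is a minimal sketch of the draft-and-verify loop. This is illustrative pseudocode rather than the actual 🤗 Transformers implementation: `draft_tokens` and `verify` are hypothetical placeholder helpers.

```python
# Illustrative sketch of the assisted generation loop (hypothetical helpers,
# not the actual implementation in 🤗 Transformers).
def assisted_generate(target_model, assistant_model, input_ids, max_new_tokens, num_draft_tokens=5):
    prompt_len = len(input_ids)
    while len(input_ids) - prompt_len < max_new_tokens:
        # 1. The small assistant drafts a few candidate tokens autoregressively (cheap).
        candidates = draft_tokens(assistant_model, input_ids, num_draft_tokens)

        # 2. The large target model scores all candidates in a single forward pass,
        #    keeping the longest prefix it agrees with plus one token of its own.
        accepted = verify(target_model, input_ids, candidates)

        # 3. Every accepted token is a target-quality token obtained without a
        #    dedicated target forward pass, which is where the speedup comes from.
        input_ids = input_ids + accepted
    return input_ids
```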

The remarkable speedups offered by assisted generation come with a significant drawback: the target and assistant models must share the same tokenizer, meaning they need to be from the same model family. However, many widely used models lack smaller versions that are both compact and accurate enough to deliver substantial latency reductions. Based on our experience, meaningful speedups are typically seen when the assistant model is at least 50-100 times smaller than the target one. For instance, CodeLlama-13b lacks a smaller version, and gemma-2-9b only has a 2b variant, which is still not small/fast enough to achieve significant performance improvements.



Universal Assisted Generation

In order to mitigate this pain point, Intel Labs, together with our friends at Hugging Face, has developed Universal Assisted Generation (UAG). UAG makes it possible to pick any pair of target and assistant models, regardless of their tokenizers. For example, gemma-2-9b can be used as the target model, with the tiny vicuna-68m as the assistant.

The main idea behind the method we propose is two-way tokenizer translation. Once the assistant model completes a generation iteration, the assistant tokens are converted to text, which is then tokenized using the target model's tokenizer to produce target tokens. After the verification step, the target tokens are similarly converted back into the assistant token format, and the result is appended to the assistant model's context before the next iteration begins.
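In code, the round trip looks roughly like the sketch below: decode the token IDs to text with one tokenizer, then re-encode the text with the other. This is a simplified illustration of the idea, not the code used inside 🤗 Transformers, and it glosses over the alignment details described next.

```python
from transformers import AutoTokenizer

# Simplified illustration of the two-way tokenizer translation.
assistant_tok = AutoTokenizer.from_pretrained("double7/vicuna-68m")
target_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")

# Assistant -> target: decode the drafted assistant tokens to text,
# then tokenize that text with the target tokenizer.
assistant_ids = assistant_tok("The quick brown fox", add_special_tokens=False).input_ids
draft_text = assistant_tok.decode(assistant_ids, skip_special_tokens=True)
target_ids = target_tok(draft_text, add_special_tokens=False).input_ids

# Target -> assistant: after verification, the accepted target tokens are
# translated back the same way and appended to the assistant's context.
verified_text = target_tok.decode(target_ids, skip_special_tokens=True)
new_assistant_ids = assistant_tok(verified_text, add_special_tokens=False).input_ids
```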

Since the assistant and target tokenizers use different vocabularies, the discrepancies between them must be handled explicitly. To accurately re-encode the newly generated assistant tokens, it is necessary to prepend a context window consisting of several previous tokens. This entire sequence is then re-encoded into the target token format and aligned with the most recent target tokens to pinpoint the exact location where the newly generated tokens should be appended. This process is illustrated in the video below.
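The sketch below shows one way this alignment can be done: re-encode a small window of recent assistant tokens, then find the overlap with the target tokens that already exist, so that only the genuinely new target tokens are appended. The function and its matching logic are simplified assumptions for illustration, not the actual 🤗 Transformers code.

```python
def reencode_new_tokens(assistant_tok, target_tok, assistant_ids, prev_target_ids, context=10):
    """Simplified illustration of re-encoding newly drafted assistant tokens
    into target tokens and aligning them with the existing target context."""
    # Re-encode a window that includes a few previous tokens, so the new tokens
    # are tokenized consistently with their surrounding text.
    window_text = assistant_tok.decode(assistant_ids[-context:], skip_special_tokens=True)
    reencoded = target_tok(window_text, add_special_tokens=False).input_ids

    # Find the longest suffix of the existing target tokens that matches a prefix
    # of the re-encoded window; everything after that point is new.
    max_overlap = min(len(reencoded), len(prev_target_ids))
    for overlap in range(max_overlap, 0, -1):
        if list(prev_target_ids[-overlap:]) == list(reencoded[:overlap]):
            return reencoded[overlap:]
    return reencoded
```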

While not shown in the video above, token re-encoding from target to assistant tokens follows a similar process. However, mismatched tokens must be discarded from the assistant model's key-value (KV) cache to ensure data integrity.
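As a minimal sketch, rolling the cache back can be pictured as truncating each layer's key/value tensors at the last position both models agree on. This assumes the legacy tuple-of-tensors cache layout with shape [batch, heads, seq_len, head_dim]; the cache classes in recent 🤗 Transformers releases provide equivalent functionality.

```python
def rollback_kv_cache(past_key_values, num_matching_tokens):
    # Keep only the cache entries for tokens both models agree on;
    # entries for rejected tokens are discarded.
    return tuple(
        (k[:, :, :num_matching_tokens, :], v[:, :, :num_matching_tokens, :])
        for k, v in past_key_values
    )
```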



Benchmarks

The table below shows the latency improvements observed for target models when paired with assistant models that use different tokenizers.

| Target model | Assistant model | Dataset | Task | Speedup |
|---|---|---|---|---|
| codellama/CodeLlama-13b-Instruct-hf | bigcode/tiny_starcoder_py | openai/humaneval | code generation | 1.90x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | double7/vicuna-68m | cnn_dailymail | summarization | 1.52x |
| google/gemma-2-9b | double7/vicuna-68m | cnn_dailymail | summarization | 1.76x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| meta-llama/Llama-3.1-70B | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| microsoft/Phi-3-medium-128k-instruct | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.91x |

Note that the target models above do not have small variants (under 1 billion parameters) that are suitable for acceleration using standard assisted generation.

Each experiment was conducted on 100 randomly selected examples.
Experiments with the Llama and Mixtral target models used 2 and 4 A100 GPUs, respectively. All other experiments ran on a single A6000 GPU.



Code

Universal assisted generation has been integrated into release 4.46.0 of 🤗 Transformers.

To use it, pass tokenizer and assistant_tokenizer to generate():

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "Alice and Bob"
>>> checkpoint = "google/gemma-2-9b"
>>> assistant_checkpoint = "double7/vicuna-68m"

>>> assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']



Future Directions

While passing do_sample=True with standard assisted generation uses the speculative sampling algorithm (Algorithm 1 from the paper), UAG currently supports multinomial sampling only. In multinomial sampling, if the target model does not sample the same token as the assistant, the token is automatically rejected, which is not the case with speculative sampling (the two rules are contrasted in the sketch below). In practice, this means that UAG with do_sample=True will have a lower throughput than the case where the assistant shares the target's tokenizer. In the future, we plan to add support for speculative sampling with UAG.
In addition, we intend to integrate UAG into 🤗 Transformers pipelines for a more concise and streamlined usage.
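The difference between the two verification rules can be summarized with the hypothetical sketch below (not the actual 🤗 Transformers code): multinomial verification accepts a drafted token only on an exact match, whereas speculative sampling accepts it probabilistically based on the ratio of the two models' probabilities.

```python
import torch

def accept_multinomial(target_token: int, assistant_token: int) -> bool:
    # UAG with do_sample=True today: the drafted token is kept only if the
    # target model samples exactly the same token.
    return target_token == assistant_token

def accept_speculative(p_target: torch.Tensor, q_assistant: torch.Tensor, assistant_token: int) -> bool:
    # Speculative sampling (Algorithm 1): accept with probability
    # min(1, p_target(token) / q_assistant(token)), which preserves the
    # target model's output distribution.
    ratio = (p_target[assistant_token] / q_assistant[assistant_token]).item()
    return torch.rand(1).item() < min(ratio, 1.0)
```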



