As model sizes grow, Generative AI implementations require significant inference resources. This not only increases the cost per generation, but also increases the power consumption used to serve such requests.
Inference optimizations for text generation are essential for reducing latency, infrastructure costs, and power consumption. This will result in an improved user experience and increased efficiency in text generation tasks.
Assisted decoding is a popular method for speeding up text generation. We adapted and optimized it for Intel Gaudi, which delivers performance similar to Nvidia H100 GPUs, as shown in a previous post, while its price is in the same ballpark as Nvidia A100 80GB GPUs. This work is now part of Optimum Habana, which extends various Hugging Face libraries such as Transformers and Diffusers so that your AI workflows are fully optimized for Intel Gaudi processors.
Speculative Sampling – Assisted Decoding
Speculative sampling is a technique used to speed up text generation. It works by using a small draft model to generate K candidate tokens, which are then evaluated by the target model. If a draft token is rejected, the target model generates the next token instead, and the process repeats. Speculative sampling improves the speed of text generation while achieving the same sampling quality as autoregressive sampling with the target model. The technique allows us to specify a draft model when generating text and has been shown to provide speedups of about 2x for large transformer-based models. Overall, it speeds up text generation and improves performance on Intel Gaudi processors.
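To make the accept/reject logic concrete, below is a minimal, self-contained sketch of one speculative sampling round over toy categorical distributions. It is plain NumPy, not the Optimum Habana implementation: `draft_probs` and `target_probs` are placeholders for the two models' next-token distributions, and `K` is the number of draft tokens proposed per round.

```python
# Illustrative sketch of speculative sampling over toy distributions.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def draft_probs(prefix):
    # Stand-in for the small draft model's next-token distribution.
    logits = np.cos(np.arange(VOCAB) + len(prefix))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def target_probs(prefix):
    # Stand-in for the large target model's next-token distribution.
    logits = np.sin(np.arange(VOCAB) + 0.5 * len(prefix))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speculative_step(prefix, K=4):
    """One round: draft K tokens, then accept/reject them against the target."""
    draft_tokens, q, ctx = [], [], list(prefix)
    for _ in range(K):
        p_draft = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=p_draft))
        draft_tokens.append(tok)
        q.append(p_draft)
        ctx.append(tok)
    # A real implementation scores all K+1 positions with one target forward pass.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs(prefix + accepted)
        if rng.random() < min(1.0, p[tok] / q[i][tok]):
            accepted.append(tok)                     # draft token accepted as-is
        else:
            # Rejected: resample from the residual distribution max(0, p - q).
            residual = np.maximum(p - q[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return prefix + accepted                 # stop this round after a rejection
    # All K draft tokens accepted: sample one bonus token from the target.
    p = target_probs(prefix + accepted)
    accepted.append(int(rng.choice(VOCAB, p=p)))
    return prefix + accepted

tokens = [0]
for _ in range(5):
    tokens = speculative_step(tokens)
print(tokens)
```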
However, the draft model and the target model have different sizes, each represented in its own KV cache, so the challenge is to apply separate optimization strategies concurrently. For this post, we assume a quantized model and leverage KV caching together with speculative sampling. Each model keeps its own KV cache: the draft model generates K tokens, which are then evaluated by the target model; when a draft token is rejected, the target model generates the next token itself; the draft model then proposes the next K tokens, and the process repeats.
Note that the authors of [2] prove that the target distribution is recovered when performing speculative sampling – this guarantees the same sampling quality as autoregressive sampling on the target model itself. Therefore, speculative sampling is not worthwhile only when the draft model is not small enough relative to the target model, or when its acceptance rate is not high enough to benefit from its smaller size.
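To make this trade-off concrete, here is a rough back-of-the-envelope model in the spirit of the analysis in [2]. Let $\alpha$ be the average probability that a draft token is accepted (assumed independent across positions), $K$ the number of draft tokens per round, and $c$ the cost of one draft forward pass relative to one target forward pass:

$$
\mathbb{E}[\text{tokens per round}] = \frac{1 - \alpha^{K+1}}{1 - \alpha},
\qquad
\text{speedup} \approx \frac{1 - \alpha^{K+1}}{(1 - \alpha)\,(Kc + 1)}.
$$

Speculative sampling pays off only when this ratio exceeds 1, i.e. when the draft model is cheap enough (small $c$) and accurate enough (high $\alpha$).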
There is a method similar to speculative sampling, known as Assisted Generation, which was developed independently around the same time [3]. Its author integrated it into Hugging Face Transformers, and the .generate() call now has an optional assistant_model parameter to enable it.
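For reference, here is what that looks like with the Transformers `.generate()` API. The checkpoints below are placeholders; any causal LM paired with a smaller draft model that shares its tokenizer will do.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-1.3b"            # target model (placeholder)
assistant_checkpoint = "facebook/opt-125m"  # smaller draft model from the same family

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

inputs = tokenizer("Speculative sampling speeds up text generation by", return_tensors="pt")
# Passing assistant_model enables assisted generation.
outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```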
Usage & Experiments
Using Assisted Generation is simple. An example is provided here.
As expected, the --assistant_model parameter is used to specify the draft model. The draft model generates K tokens, which are then evaluated by the target model; when a draft token is rejected, the target model generates the next token, the draft model proposes the next K tokens, and the process repeats. The acceptance rate of the draft model depends partly on the input text. Typically, we have seen speed-ups of about 2x for large transformer-based models.
Conclusion
Accelerating text generation on Gaudi with assisted generation is now supported and easy to use. It can be used to improve performance on Intel Gaudi processors. The method is based on speculative sampling, which has been shown to be effective at improving performance for large transformer-based models.
[1] N. Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need,” Nov. 2019. arXiv:1911.02150.
[2] C. Chen, S. Borgeaud, G. Irving, J.B. Lespiau, L. Sifre, and J. Jumper, “Accelerating Large Language Model Decoding with Speculative Sampling,” Feb. 2023. arXiv:2302.01318.
[3] J. Gante, “Assisted Generation: a new direction toward low-latency text generation,” May 2023, https://huggingface.co/blog/assisted-generation.
