Update (02/2024): Performance has improved even further! Check our updated benchmarks.
In a previous post on the Hugging Face blog, we introduced AWS Inferentia2, the second-generation AWS Inferentia accelerator, and explained how you could use optimum-neuron to quickly deploy Hugging Face models for standard text and vision tasks on AWS Inferentia 2 instances.
In a further step of integration with the AWS Neuron SDK, it is now possible to use 🤗 optimum-neuron to deploy LLM models for text generation on AWS Inferentia2.
And what better model could we choose for that demonstration than Llama 2, one of the most popular models on the Hugging Face hub.
Set up 🤗 optimum-neuron on your Inferentia2 instance
Our recommendation is to use the Hugging Face Neuron Deep Learning AMI (DLAMI). The DLAMI comes with all required libraries pre-packaged for you, including Optimum Neuron, Neuron Drivers, Transformers, Datasets, and Accelerate.
Alternatively, you can use the Hugging Face Neuron SDK DLC to deploy on Amazon SageMaker.
Note: stay tuned for an upcoming post dedicated to SageMaker deployment.
Finally, these components can be installed manually on a fresh Inferentia2 instance following the optimum-neuron installation instructions.
Export the Llama 2 model to Neuron
As explained in the optimum-neuron documentation, models need to be compiled and exported to a serialized format before running them on Neuron devices.
Fortunately, 🤗 optimum-neuron offers a very simple API to export standard 🤗 transformers models to the Neuron format.
>>> from optimum.neuron import NeuronModelForCausalLM
>>> compiler_args = {"num_cores": 24, "auto_cast_type": 'fp16'}
>>> input_shapes = {"batch_size": 1, "sequence_length": 2048}
>>> model = NeuronModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
export=True,
**compiler_args,
**input_shapes)
This deserves a little explanation:
- using compiler_args, we specify on how many cores we want the model to be deployed (each neuron device has two cores), and with which precision (here float16),
- using input_shapes, we set the static input and output dimensions of the model. All model compilers require static shapes, and neuron makes no exception. Note that the sequence_length not only constrains the length of the input context, but also the length of the KV cache, and thus, the output length.
Depending on your choice of parameters and Inferentia host, this can take from a few minutes to more than an hour.
Fortunately, you will only need to do this once because you can save your model and reload it later.
>>> model.save_pretrained("a_local_path_for_compiled_neuron_model")
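Reloading the compiled model later is just another from_pretrained call, this time without the export flag (a minimal sketch, reusing the local path from above):
>>> from optimum.neuron import NeuronModelForCausalLM
>>> # No export=True here: the pre-compiled Neuron artifacts are loaded directly, skipping recompilation
>>> model = NeuronModelForCausalLM.from_pretrained("a_local_path_for_compiled_neuron_model")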
Even better, you can push it to the Hugging Face hub.
>>> model.push_to_hub(
"a_local_path_for_compiled_neuron_model",
repository_id="aws-neuron/Llama-2-7b-hf-neuron-latency")
Generate Text using Llama 2 on AWS Inferentia2
Once your model has been exported, you can generate text using the transformers library, as has been described in detail in this previous post.
>>> from optimum.neuron import NeuronModelForCausalLM
>>> from transformers import AutoTokenizer
>>> model = NeuronModelForCausalLM.from_pretrained('aws-neuron/Llama-2-7b-hf-neuron-latency')
>>> tokenizer = AutoTokenizer.from_pretrained("aws-neuron/Llama-2-7b-hf-neuron-latency")
>>> inputs = tokenizer("What's deep-learning ?", return_tensors="pt")
>>> outputs = model.generate(**inputs,
max_new_tokens=128,
do_sample=True,
temperature=0.9,
top_k=50,
top_p=0.9)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['What is deep-learning ?\nThe term “deep-learning” refers to a type of machine-learning
that aims to model high-level abstractions of the data in the form of a hierarchy of multiple
layers of increasingly complex processing nodes.']
Note: when passing multiple input prompts to a model, the resulting token sequences should be padded to the left with an end-of-stream token.
The tokenizers saved with the exported models are configured accordingly.
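For illustration, here is a minimal sketch of batched generation with two prompts; it assumes a model that was exported with a batch_size of at least 2 (unlike the single-request export shown earlier), and reuses the model and tokenizer loaded above:
>>> # Hypothetical batched call: only valid for a model compiled with batch_size >= 2
>>> prompts = ["What is deep-learning ?", "What is AWS Inferentia2 ?"]
>>> # The tokenizer saved with the exported model already pads on the left with the EOS token
>>> inputs = tokenizer(prompts, return_tensors="pt", padding=True)
>>> outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)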
The following generation strategies are supported:
- greedy search,
- multinomial sampling with top-k and top-p (with temperature).
Most logits pre-processing/filters (such as repetition penalty) are supported.
All-in-one with optimum-neuron pipelines
For those who like to keep it simple, there is an even simpler way to use an LLM model on AWS Inferentia 2 using optimum-neuron pipelines.
Using them is as simple as:
>>> from optimum.neuron import pipeline
>>> p = pipeline('text-generation', 'aws-neuron/Llama-2-7b-hf-neuron-budget')
>>> p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)
[{'generated_text': 'My favorite place on earth is the ocean. It is where I feel most
at peace. I love to travel and see new places. I have a'}]
Benchmarks
But how efficient is text-generation on Inferentia2? Let's find out!
We have uploaded pre-compiled versions of the Llama 2 7B and 13B models with different configurations on the hub.
Note: all models are compiled with a maximum sequence length of 2048.
The llama2 7B “budget” model is meant to be deployed on an inf2.xlarge instance that has only one neuron device, and enough cpu memory to load the model.
All other models are compiled to use the full extent of cores available on the inf2.48xlarge instance.
Note: please refer to the inferentia2 product page for details on the available instances.
We created two “latency” oriented configurations for the llama2 7B and llama2 13B models that can serve only one request at a time, but at full speed.
We also created two “throughput” oriented configurations to serve up to 4 requests in parallel.
To evaluate the models, we generate tokens up to a total sequence length of 1024, starting from 256 input tokens (i.e. we generate 256, 512 and 768 tokens).
Note: the “budget” model numbers are reported but not included in the graphs for better readability.
Encoding time
The encoding time is the time required to process the input tokens and generate the first output token.
It’s an important metric, because it corresponds to the latency directly perceived by the user when streaming generated tokens.
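As a rough way to reproduce this measurement yourself (a sketch only, not the exact protocol behind the numbers reported below), you can time a generation capped at a single new token:
>>> import time
>>> # Placeholder prompt: repeat a short phrase to reach the desired context size
>>> inputs = tokenizer("What is deep-learning ? " * 32, return_tensors="pt")
>>> start = time.perf_counter()
>>> # With max_new_tokens=1, the elapsed time is dominated by the processing of the input tokens
>>> outputs = model.generate(**inputs, max_new_tokens=1)
>>> encoding_time = time.perf_counter() - start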
We test the encoding time for increasing context sizes, 256 input tokens corresponding roughly to a typical Q/A usage,
while 768 is more typical of a Retrieval Augmented Generation (RAG) use-case.
The “budget” model (Llama2 7B-B) is deployed on an inf2.xlarge instance while other models are deployed on an inf2.48xlarge instance.
Encoding time is expressed in seconds.
| input tokens | Llama2 7B-L | Llama2 7B-T | Llama2 13B-L | Llama2 13B-T | Llama2 7B-B |
|---|---|---|---|---|---|
| 256 | 0.5 | 0.9 | 0.6 | 1.8 | 0.3 |
| 512 | 0.7 | 1.6 | 1.1 | 3.0 | 0.4 |
| 768 | 1.1 | 3.3 | 1.7 | 5.2 | 0.5 |
We can see that all deployed models exhibit excellent response times, even for long contexts.
End-to-end Latency
The end-to-end latency corresponds to the total time to reach a sequence length of 1024 tokens.
It therefore includes the encoding and generation time.
The “budget” model (Llama2 7B-B) is deployed on an inf2.xlarge instance while other models are deployed on an inf2.48xlarge instance.
Latency is expressed in seconds.
| new tokens | Llama2 7B-L | Llama2 7B-T | Llama2 13B-L | Llama2 13B-T | Llama2 7B-B |
|---|---|---|---|---|---|
| 256 | 2.3 | 2.7 | 3.5 | 4.1 | 15.9 |
| 512 | 4.4 | 5.3 | 6.9 | 7.8 | 31.7 |
| 768 | 6.2 | 7.7 | 10.2 | 11.1 | 47.3 |
All models deployed on the high-end instance exhibit a good latency, even those actually configured to optimize throughput.
The latency of the “budget” model is significantly higher, but still okay.
Throughput
We adopt the same convention as other benchmarks to evaluate the throughput: we count both input and output tokens, summed over all requests in the batch.
In other words, we divide batch_size * sequence_length by the end-to-end latency to obtain the number of generated tokens per second.
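As a quick sanity check, here is a sketch of that computation for the Llama2 7B-T configuration at 256 new tokens, assuming the full batch of 4 parallel requests and taking the rounded latency from the table above (so the result only approximately matches the figure reported below):
>>> batch_size, input_tokens, new_tokens = 4, 256, 256
>>> end_to_end_latency = 2.7  # seconds, rounded value from the end-to-end latency table
>>> # Tokens per second: input + output tokens, summed over the batch, divided by latency
>>> batch_size * (input_tokens + new_tokens) / end_to_end_latency  # ≈ 758, close to the 750 reported below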
The “budget” model (Llama2 7B-B) is deployed on an inf2.xlarge instance while other models are deployed on an inf2.48xlarge instance.
Throughput is expressed in tokens/second.
| new tokens | Llama2 7B-L | Llama2 7B-T | Llama2 13B-L | Llama2 13B-T | Llama2 7B-B |
|---|---|---|---|---|---|
| 256 | 227 | 750 | 145 | 504 | 32 |
| 512 | 177 | 579 | 111 | 394 | 24 |
| 768 | 164 | 529 | 101 | 370 | 22 |
Again, the models deployed on the high-end instance have a very good throughput, even those optimized for latency.
The “budget” model has a much lower throughput, but still okay for a streaming use-case, considering that an average reader reads around 5 words per second.
Conclusion
We have illustrated how easy it is to deploy llama2 models from the Hugging Face hub on
AWS Inferentia2 using 🤗 optimum-neuron.
The deployed models exhibit excellent performance by way of encoding time, latency and throughput.
Interestingly, the deployed models' latency is not too sensitive to the batch size, which opens the way for their deployment on inference endpoints
serving multiple requests in parallel.
There is still plenty of room for improvement though:
- in the current implementation, the only way to augment the throughput is to increase the batch size, but it is currently limited by the device memory. Alternative options such as pipelining are currently being integrated,
- the static sequence length limits the model's ability to encode long contexts. It would be interesting to see if attention sinks might be a valid option to address this.



