Large Language Models (LLMs) have revolutionized natural language processing and are increasingly deployed to solve complex problems at scale. Achieving optimal performance with these models is notoriously challenging because of their unique and intense computational demands. Optimized LLM performance is incredibly valuable for end users looking for a snappy and responsive experience, as well as for scaled deployments where improved throughput translates to dollars saved.
That's where the Optimum-NVIDIA inference library comes in. Available on Hugging Face, Optimum-NVIDIA dramatically accelerates LLM inference on the NVIDIA platform through an extremely simple API.
By changing just a single line of code, you can unlock up to 28x faster inference and 1,200 tokens/second on the NVIDIA platform.
Optimum-NVIDIA is the first Hugging Face inference library to benefit from the new float8 format supported on the NVIDIA Ada Lovelace and Hopper architectures.
FP8, together with the advanced compilation capabilities of NVIDIA TensorRT-LLM software, dramatically accelerates LLM inference.
How to Run
You can start running LLaMA with blazingly fast inference speeds in just 3 lines of code with a pipeline from Optimum-NVIDIA.
If you already set up a pipeline from Hugging Face's transformers library to run LLaMA, you just need to modify a single line of code to unlock peak performance!
- from transformers.pipelines import pipeline
+ from optimum.nvidia.pipelines import pipeline
# everything else is the same as in transformers!
pipe = pipeline('text-generation', 'meta-llama/Llama-2-7b-chat-hf', use_fp8=True)
pipe("Describe a real-world application of AI in sustainable energy.")
You can also enable FP8 quantization with a single flag, which allows you to run a bigger model on a single GPU at faster speeds, without sacrificing accuracy.
The flag shown in this example uses a predefined calibration strategy by default, though you can provide your own calibration dataset and customized tokenization to tailor the quantization to your use case.
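As a rough illustration of what that customization could look like, here is a minimal sketch that builds a small, domain-specific calibration set and passes it to from_pretrained. The calibration_dataset keyword and the list-of-encodings format are assumptions made purely for illustration, not the library's confirmed interface; check the Optimum-NVIDIA documentation for the exact calibration API.
from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

# A handful of prompts representative of your workload, used to calibrate FP8 ranges.
calibration_texts = [
    "Summarize the maintenance schedule for an offshore wind turbine.",
    "Explain how smart grids balance electricity supply and demand.",
]
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# NOTE: `calibration_dataset` is a hypothetical keyword shown for illustration only;
# refer to the Optimum-NVIDIA documentation for the exact calibration interface.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,
    calibration_dataset=[tokenizer(text, return_tensors="pt") for text in calibration_texts],
)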
The pipeline interface is great for getting up and running quickly, but power users who want fine-grained control over setting sampling parameters can use the Model API.
- from transformers import AutoModelForCausalLM
+ from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-13b-chat-hf",
+ use_fp8=True,
)
model_inputs = tokenizer(
["How is autonomous vehicle technology transforming the future of transportation and urban planning?"],
return_tensors="pt"
).to("cuda")
generated_ids, generated_length = model.generate(
**model_inputs,
top_k=40,
top_p=0.7,
repetition_penalty=10,
)
tokenizer.batch_decode(generated_ids[0], skip_special_tokens=True)
For more details, check out our documentation.
Performance Evaluation
When evaluating the performance of an LLM, we consider two metrics: First Token Latency and Throughput.
First Token Latency (also known as Time to First Token or prefill latency) measures how long you wait from the time you enter your prompt to the time you begin receiving your output, so this metric can tell you how responsive the model will feel.
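If you want a feel for this metric on your own hardware, a rough way to approximate it is to time a generation capped at a single new token. The snippet below is a minimal sketch that reuses the model and model_inputs from the Model API example above and assumes generate accepts the standard max_new_tokens argument; it is not the harness used for the numbers reported here.
import time

# Rough approximation of First Token Latency: time a generation capped at one new token.
# Reuses `model` and `model_inputs` from the Model API example above; assumes `generate`
# accepts the standard `max_new_tokens` argument.
start = time.perf_counter()
model.generate(**model_inputs, max_new_tokens=1)
first_token_latency_s = time.perf_counter() - start
print(f"First Token Latency: {first_token_latency_s * 1000:.1f} ms")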
Optimum-NVIDIA delivers up to 3.3x faster First Token Latency compared to stock transformers:
Throughput, on the other hand, measures how fast the model can generate tokens and is particularly relevant when you want to batch generations together.
While there are a few ways to calculate throughput, we adopted a standard method: dividing the total sequence length, with both input and output tokens summed over all batches, by the end-to-end latency, which gives tokens per second.
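Concretely, the bookkeeping looks like the following minimal sketch; the helper name and the example numbers are our own illustration, not output from the benchmark.
# Minimal sketch of the throughput metric described above: total tokens
# (input + output, summed over all batches) divided by the end-to-end latency.
def throughput_tokens_per_second(input_lengths, output_lengths, end_to_end_latency_s):
    total_tokens = sum(input_lengths) + sum(output_lengths)
    return total_tokens / end_to_end_latency_s

# Example: four batched requests with 128 input and 256 output tokens each,
# completed in 2.5 seconds of wall-clock time.
tps = throughput_tokens_per_second([128] * 4, [256] * 4, 2.5)
print(f"{tps:.0f} tokens/second")  # -> 614 tokens/second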
Optimum-NVIDIA delivers up to 28x higher throughput compared to stock transformers:
Initial evaluations of the recently announced NVIDIA H200 Tensor Core GPU show up to an additional 2x boost in throughput for LLaMA models compared to an NVIDIA H100 Tensor Core GPU.
As H200 GPUs become more available, we'll share performance data for Optimum-NVIDIA running on them.
Next steps
Optimum-NVIDIA currently provides peak performance for the LLaMAForCausalLM architecture + task, so any LLaMA-based model, including fine-tuned versions, should work with Optimum-NVIDIA out of the box today.
We're actively expanding support to include other text-generation model architectures and tasks, all from within Hugging Face.
We continue to push the boundaries of performance and plan to incorporate cutting-edge optimization techniques like In-Flight Batching to improve throughput when streaming prompts, and INT4 quantization to run even larger models on a single GPU.
Give it a try: we're releasing the Optimum-NVIDIA repository with instructions on how to get started. Please share your feedback with us! 🤗
