Large Language Models Out-of-the-Box Acceleration with AMD GPU




Earlier this year, AMD and Hugging Face announced a partnership to accelerate AI models during AMD's AI Day event. We have been hard at work to bring this vision to reality, and to make it easy for the Hugging Face community to run the latest AI models on AMD hardware with the best possible performance.

AMD is powering some of the most powerful supercomputers in the world, including the fastest European one, LUMI, which operates over 10,000 MI250X AMD GPUs. At this event, AMD revealed their latest generation of server GPUs, the AMD Instinct™ MI300 series accelerators, which will soon become generally available.

In this blog post, we provide an update on our progress towards providing great out-of-the-box support for AMD GPUs, and on improving the interoperability for the latest server-grade AMD Instinct GPUs.



Out-of-the-box Acceleration

Can you spot any AMD-specific code changes below? Don't strain your eyes: there are none compared to running on NVIDIA GPUs 🤗.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "01-ai/Yi-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# On AMD GPUs, PyTorch's ROCm build transparently maps the "cuda" device
# to the underlying ROCm devices, so no code change is needed.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inp = tokenizer(["Today I am in Paris and"], padding=True, return_tensors="pt").to("cuda")
res = model.generate(**inp, max_new_tokens=30)

print(tokenizer.batch_decode(res))

One of the major aspects we have been working on is the ability to run Hugging Face Transformers models without any code change. We now support all Transformers models and tasks on AMD Instinct GPUs. And our collaboration is not stopping here, as we explore out-of-the-box support for Diffusers models and other libraries, as well as other AMD GPUs.

Achieving this milestone has been a significant effort and collaboration between our teams and companies. To maintain support and performance for the Hugging Face community, we have built integrated testing of Hugging Face open source libraries on AMD Instinct GPUs in our datacenters, and were able to minimize the carbon impact of these new workloads by working with Verne Global to deploy the AMD Instinct servers in Iceland.

On top of native support, another major aspect of our collaboration is to provide integration for the latest innovations and features available on AMD GPUs. Through the collaboration of the Hugging Face team, AMD engineers, and open source community members, we are happy to announce support for the latest acceleration features, including Flash Attention 2, GPTQ quantization, and Paged Attention.

We are very excited to make these cutting-edge acceleration tools available and easy to use for Hugging Face users, and to offer maintained support and performance with direct integration in our new continuous integration and development pipeline for AMD Instinct GPUs.
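As an example, here is a minimal sketch of loading a model with Flash Attention 2 enabled. The model id is simply reused from the snippet above, and depending on your transformers version the flag may be `use_flash_attention_2=True` instead of `attn_implementation`; nothing here is AMD-specific.

import torch
from transformers import AutoModelForCausalLM

# Example model, reused from the snippet above; Flash Attention 2 is enabled
# exactly the same way as on NVIDIA GPUs.
model_id = "01-ai/Yi-6B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")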

One AMD Instinct MI250 GPU with 128 GB of High Bandwidth Memory has two distinct ROCm devices (GPU 0 and 1), each of them having 64 GB of High Bandwidth Memory.

The two devices of an MI250 as displayed by `rocm-smi`

This means that with only one MI250 GPU card, we have two PyTorch devices that can very easily be used with tensor and data parallelism to achieve higher throughputs and lower latencies.
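As a quick check, here is a minimal sketch (assuming a PyTorch ROCm build on a machine with a single MI250) showing how the two devices are exposed through the usual `torch.cuda` API:

import torch

# One MI250 card is exposed as two devices, one per Graphics Compute Die.
print(torch.cuda.device_count())  # expected: 2 on a machine with a single MI250

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024**3:.0f} GiB")  # ~64 GiB each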

In the rest of this blog post, we report performance results for the two phases involved in text generation with large language models, using the following metrics (a minimal measurement sketch follows the list):

  • Prefill latency: The time it takes for the model to compute the representation of the user-provided input or prompt (also known as “Time To First Token”).
  • Decoding per-token latency: The time it takes to generate each new token in an autoregressive manner after the prefill step.
  • Decoding throughput: The number of tokens generated per second during the decoding phase.
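The following is a rough, hypothetical sketch of how these metrics can be measured with Transformers (model and prompt reused from above); the published numbers below were obtained with optimum-benchmark, which handles warmup, repetitions, and more careful timing.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-6B"  # example model, reused from above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("Today I am in Paris and", return_tensors="pt").to("cuda")

# Prefill latency: one forward pass over the prompt ("time to first token").
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model(**inputs)
torch.cuda.synchronize()
prefill_latency = time.perf_counter() - start

# Decoding: generate new tokens, then derive per-token latency and throughput.
new_tokens = 100
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
total_time = time.perf_counter() - start

decode_time = total_time - prefill_latency  # generate() includes the prefill pass
print(f"prefill latency: {prefill_latency * 1000:.1f} ms")
print(f"decode latency per token: {decode_time / new_tokens * 1000:.1f} ms")
print(f"decode throughput: {new_tokens / decode_time:.1f} tokens/s")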

Using optimum-benchmark and running inference benchmarks on an MI250 and an A100 GPU, with and without optimizations, we get the following results:

Inference benchmarks using Transformers and PEFT libraries. FA2 stands for “Flash Attention 2”, TP for “Tensor Parallelism”, DDP for “Distributed Data Parallel”.

In the plots above, we can see how performant the MI250 is, especially for production settings where requests are processed in large batches, delivering more than 2.33x more tokens (decode throughput) and taking half the time to the first token (prefill latency) compared to an A100 card.

Running training benchmarks as seen below, one MI250 card fits larger batches of training samples and reaches higher training throughput.

Training benchmarks using the Transformers library at the maximum batch size (power of two) that can fit on a given card
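Since training also requires no AMD-specific code change, here is a minimal fine-tuning sketch using the Trainer API; the model, dataset, and hyperparameters are purely illustrative and are not the benchmark configuration used above.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Illustrative model and dataset choices only.
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)
dataset = dataset.filter(lambda example: len(example["input_ids"]) > 1)  # drop empty rows

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()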



Production Solutions

Another important focus of our collaboration is to build support for Hugging Face production solutions, starting with Text Generation Inference (TGI). TGI provides an end-to-end solution to deploy large language models for inference at scale.

Initially, TGI was mostly geared towards NVIDIA GPUs, leveraging most of the recent optimizations made for post-Ampere architectures, such as Flash Attention v1 and v2, GPTQ weight quantization, and Paged Attention.

Today, we are happy to announce initial support for AMD Instinct MI210 and MI250 GPUs in TGI, leveraging all the great open-source work detailed above, integrated in a complete end-to-end solution that is ready to be deployed.

Performance-wise, we spent a lot of time benchmarking Text Generation Inference on AMD Instinct GPUs to validate and identify where we should focus our optimization efforts. With the support of AMD GPU engineers, we have been able to achieve performance matching what TGI was already offering.

In this context, and with the long-term relationship we are building between AMD and Hugging Face, we have been integrating and testing the AMD GeMM Tuner tool, which allows us to tune the GeMM (matrix multiplication) kernels used in TGI to find the best setup for increased performance. The GeMM Tuner tool is expected to be released as part of PyTorch in an upcoming release so that everybody can benefit from it.

With all of the above in place, we are thrilled to show the very first performance numbers demonstrating the latest AMD technologies, putting Text Generation Inference on AMD GPUs at the forefront of efficient inferencing solutions for the Llama model family.

TGI latency results for Llama 34B, comparing one AMD Instinct MI250 against an A100-SXM4-80GB. As explained above, one MI250 corresponds to two PyTorch devices.

TGI latency results for Llama 70B, comparing two AMD Instinct MI250 against two A100-SXM4-80GB (using tensor parallelism).

Missing bars for A100 correspond to out-of-memory errors: Llama 70B weighs 138 GB in float16 (roughly 69 billion parameters at 2 bytes per parameter), and enough free memory is necessary for intermediate activations, the KV cache buffer (>5 GB for 2048 sequence length, batch size 8), the CUDA context, etc. The Instinct MI250 GPU has 128 GB of global memory while an A100 has 80 GB, which explains the ability to run larger workloads (longer sequences, larger batches) on the MI250.
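A back-of-the-envelope sketch of these two numbers, using the publicly documented Llama 2 70B architecture (80 layers, grouped-query attention with 8 KV heads, head dimension 128):

BYTES_FP16 = 2

# Model weights: ~69 billion parameters in float16.
n_params = 69e9
weights_gb = n_params * BYTES_FP16 / 1e9
print(f"weights: ~{weights_gb:.0f} GB")  # ~138 GB

# KV cache: 2 (keys and values) * layers * batch * seq_len * kv_heads * head_dim * 2 bytes.
layers, kv_heads, head_dim = 80, 8, 128
batch, seq_len = 8, 2048
kv_cache_gb = 2 * layers * batch * seq_len * kv_heads * head_dim * BYTES_FP16 / 1e9
print(f"KV cache: ~{kv_cache_gb:.1f} GB")  # ~5.4 GB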

Text Generation Inference is ready to be deployed in production on AMD Instinct GPUs through the docker image ghcr.io/huggingface/text-generation-inference:1.2-rocm. Make sure to refer to the documentation regarding the supported features and their limitations.
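Once a TGI server is running from that image, it can be queried over its REST API; the host and port below (localhost:8080) are illustrative assumptions.

import requests

# Query a running TGI server's /generate endpoint.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Today I am in Paris and",
        "parameters": {"max_new_tokens": 30},
    },
)
print(response.json()["generated_text"])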



What’s next?

We hope this blog post got you as excited as we are at Hugging Face about this partnership with AMD. Of course, this is just the very beginning of our journey, and we look forward to enabling more use cases on more AMD hardware.

In the coming months, we will be working on bringing more support and validation for AMD Radeon GPUs, the same GPUs you can put in your own desktop for local usage, lowering the accessibility barrier and paving the way for even more versatility for our users.

Of course, we will soon be working on performance optimization for the MI300 lineup, ensuring that both our open source libraries and our production solutions deliver the latest innovations at the level of stability we always aim for at Hugging Face.

Another area of focus for us will be AMD Ryzen AI technology, powering the latest generation of AMD laptop CPUs and allowing AI to run at the edge, on the device. At a time when coding assistants, image generation tools, and personal assistants are becoming increasingly broadly available, it is important to offer solutions that can meet privacy needs while leveraging these powerful tools. In this context, Ryzen AI-compatible models are already being made available on the Hugging Face Hub, and we are working closely with AMD to bring more of them in the coming months.


