Hugging Face on AMD Instinct MI300 GPU



Join the next Hugging Cast on June 6th to ask questions to the post authors, watch a live demo deploying Llama 3 on MI300X on Azure, plus a bonus demo deploying models locally on a Ryzen AI PC!

Register at https://streamyard.com/watch/iMZUvJnmz8BV



Introduction

At Hugging Face we want to make it easy to build AI with open models and open source, whichever framework, cloud and stack you want to use.
A key component is the ability to deploy AI models on a flexible choice of hardware.
Through our collaboration with AMD, for about a year now, we have been investing in multiple different accelerators such as AMD Instinct™ and Radeon™ GPUs, EPYC™ and Ryzen™ CPUs and Ryzen AI NPUs, helping ensure there will always be a device to run the largest AI community on the AMD fleet.
Today we are delighted to announce that Hugging Face and AMD have been hard at work together to enable the latest generation of AMD GPU servers, namely AMD Instinct MI300, to have first-class citizen integration in the overall Hugging Face platform.
From prototyping in your local environment to running models in production on Azure ND MI300X v5 VMs, you don’t need to make any code change when using transformers, text-generation-inference and other libraries, or when you use Hugging Face products and solutions – we want to make it super easy to use AMD MI300 on Hugging Face and get the best performance.
Let’s dive in!



Open-Source and production enablement



Maintaining support for AMD Instinct GPUs in Transformers and text-generation-inference

With so many things happening at once in AI, it was absolutely necessary to make sure the MI300 line-up is properly tested and monitored in the long run.
To achieve this, we have been working closely with the infrastructure team here at Hugging Face to make sure we have robust building blocks available for whoever needs to enable continuous integration and deployment (CI/CD), and to be able to do so painlessly and without impacting the workflows already in place.

To enable this, we worked together with the AMD and Microsoft Azure teams to leverage the recently introduced Azure ND MI300X v5 as the building block targeting MI300.
Within a few hours, our infrastructure team was able to deploy, set up and get everything up and running for us to get our hands on the MI300!

We also moved away from our old infrastructure to a managed Kubernetes cluster that takes care of scheduling all the GitHub workflows Hugging Face collaborators would like to run on hardware-specific pods.
This migration now allows us to run the exact same CI/CD pipeline on a wide range of hardware platforms, abstracted away from the developer.
We were able to get the CI/CD up and running on the Azure MI300X VM within a couple of days without much effort.

As a result, transformers and text-generation-inference are now being tested regularly on both the previous generation of AMD Instinct GPUs, namely the MI250, and the latest MI300.
In practice, there are tens of thousands of unit tests regularly validating the state of these repositories, ensuring the correctness and robustness of the integration in the long run.



Improving performance for production AI workloads



Inferencing performance

As mentioned in the introduction, we have been working on enabling the new AMD Instinct MI300 GPUs to efficiently run inference workloads through our open-source inference solution, text-generation-inference (TGI).
TGI can be seen as three different components:
– A transport layer, mostly HTTP, exposing and receiving API requests from clients
– A scheduling layer, making sure these requests are potentially batched together (i.e. continuous batching) to increase the computational density on the hardware without impacting the user experience
– A modeling layer, taking care of running the actual computations on the device, leveraging highly optimized routines involved in the model
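
To make these layers concrete, here is a minimal sketch of querying a running TGI server from Python with the huggingface_hub InferenceClient; the endpoint URL is a placeholder and assumes a TGI instance is already serving a model.

# Minimal sketch: querying a running TGI server from Python.
# The URL is a placeholder for wherever your TGI instance is listening.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# The request goes through TGI's transport layer; the scheduler may batch it with
# other in-flight requests before the modeling layer runs the actual computation.
output = client.text_generation(
    "What is the capital of France?",
    max_new_tokens=64,
)
print(output)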

Here, with the help of AMD engineers, we focused on this last component, the modeling layer, to effectively set up, run and optimize the workload for serving models such as the Meta Llama family. Specifically, we focused on:
– Flash Attention v2
– Paged Attention
– GPTQ/AWQ compression techniques
– PyTorch integration of ROCm TunableOp
– Integration of optimized fused kernels

Most of these have been around for quite a while now: FlashAttention v2, PagedAttention and GPTQ/AWQ compression methods (especially their optimized routines/kernels). We won’t detail these three here and invite you to visit their original implementation pages to learn more.
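
As an illustration of the compression techniques listed above, here is a hedged sketch of loading a GPTQ-quantized model with transformers; the repository id is a hypothetical placeholder, and the GPTQ kernels (e.g. an optimum/auto-gptq build for ROCm) are assumed to be installed.

# Hypothetical sketch: loading a GPTQ-quantized checkpoint with transformers.
# "my-org/llama-3-70b-gptq" is a placeholder repository id, not a real model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "my-org/llama-3-70b-gptq"  # placeholder: any GPTQ-quantized causal LM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # spread the quantized weights across the available GPUs
)

inputs = tokenizer("Hello, MI300!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))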

Still, with a completely new hardware platform and new SDK releases, it was important to carefully validate, profile and optimize every piece to make sure users get all the power of this new platform.

Last but not least, as part of this TGI release, we are integrating the recently released AMD TunableOp, part of PyTorch 2.3.
TunableOp provides a versatile mechanism that searches for the most efficient way to execute general matrix multiplications (i.e. GEMMs) with respect to their shapes and data type.
TunableOp is integrated in PyTorch and is still in active development but, as you will see below, it makes it possible to improve the performance of GEMM operations without significantly impacting the user experience.
Specifically, we gain an 8-10% latency speedup using TunableOp for small input sequences, corresponding to the decoding phase of autoregressive generation.

In fact, when a new TGI instance is created, we launch an initial warmup step that runs some dummy payloads and makes sure the model and its memory are allocated and ready to go.

With TunableOp, we let the GEMM routine tuner spend some time searching for the most optimal setup with respect to the parameters the user provided to TGI, such as sequence length, maximum batch size, etc.
When the warmup phase is done, we disable the tuner and leverage the optimized routines for the rest of the server’s life.
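
For readers who want to experiment with TunableOp outside of TGI, below is a minimal sketch assuming a ROCm build of PyTorch 2.3 or later; the environment variables are the documented way to control the tuner, and the results file name is arbitrary.

# Minimal TunableOp sketch (assumes a ROCm build of PyTorch >= 2.3).
# The environment variables must be set before the first GEMM is executed.
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"   # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"    # let it benchmark candidate GEMM kernels
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"  # arbitrary results file

import torch

# Warmup: run matmuls with the shapes and dtypes the real workload will see
# (here, a skinny decode-like GEMM) so the tuner selects kernels for those cases.
a = torch.randn(1, 8192, device="cuda", dtype=torch.float16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
for _ in range(5):
    torch.matmul(a, b)
torch.cuda.synchronize()

# On later runs, the saved results file is picked up again; setting
# PYTORCH_TUNABLEOP_TUNING=0 then reuses the tuned kernels without re-tuning.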

As said previously, we ran all our benchmarks on Azure ND MI300X v5, recently introduced at Microsoft Build, which integrates eight AMD Instinct MI300X GPUs, against the previous-generation MI250. On a Meta Llama 3 70B deployment, we observe a 2x-3x speedup in time-to-first-token latency (also called prefill), and a 2x speedup in latency in the subsequent autoregressive decoding phase.


TGI latency results for Meta Llama 3 70B, comparing AMD Instinct MI300X on an Azure VM against the previous generation AMD Instinct MI250



Model fine-tuning performance

Hugging Face libraries can also be used to fine-tune models.
We use the Transformers and PEFT libraries to fine-tune Llama 3 70B using low-rank adapters (LoRA). To handle the parallelism over several devices, we leverage DeepSpeed ZeRO-3 through the Accelerate library.

On Llama 3 70B, our workload consists of batches of 448 tokens, with a batch size of 2. Using low-rank adapters, the model’s original 70,570,090,496 parameters are frozen, and we instead train an additional subset of 16,384,000 parameters.
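
For reference, a minimal sketch of this LoRA setup with transformers and peft is shown below; the rank and target modules are illustrative rather than the exact values used in our benchmark, and the multi-device sharding (DeepSpeed ZeRO-3 through Accelerate) is handled by the launcher configuration, not by this snippet.

# Minimal LoRA sketch with transformers + peft; hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (illustrative)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # base weights are frozen, only the adapters train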

From our comparison on Llama 3 70B, we are able to train about 2x faster on an Azure VM powered by MI300X, compared to an HPC server using the previous-generation AMD Instinct MI250.

PEFT fine-tuning results on MI300X vs MI250

Furthermore, as the MI300X benefits from its 192 GB of HBM3 memory (compared to 128 GB for the MI250), we manage to fully load and fine-tune Meta Llama 3 70B on a single device, while an MI250 GPU would not be able to fit the ~140 GB model in full on a single device, in either float16 or bfloat16.
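
A quick back-of-the-envelope check makes the single-device claim concrete: in 16-bit precision, the frozen weights alone already exceed an MI250’s 128 GB but fit within an MI300X’s 192 GB.

# Rough memory estimate for the Llama 3 70B weights in 16-bit precision.
num_params = 70_570_090_496   # frozen parameter count reported above
bytes_per_param = 2           # float16 / bfloat16

weights_gb = num_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~141 GB: above 128 GB (MI250), below 192 GB (MI300X)
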
Since it is always important to be able to reproduce and challenge a benchmark, we are releasing a companion GitHub repository containing all the artifacts and source code we used to collect the performance numbers showcased in this blog.



What’s next?

We have a lot of exciting features in the pipeline for these new AMD Instinct MI300 GPUs.
One of the major areas we will be investing a lot of effort into in the coming weeks is minifloats (i.e. float8 and lower).
These data layouts have the inherent benefit of compressing data in a non-uniform way, alleviating some of the issues faced with integers.

In scenarios like LLM inference, this would halve the size of the key-value cache commonly used in LLMs.
Later on, combining a float8-stored key-value cache with float8/float8 matrix multiplications would bring additional performance benefits along with a reduced memory footprint.



Conclusion

As you can see, AMD MI300 brings a significant performance boost on AI workloads, covering end-to-end use cases from training to inference.
We at Hugging Face are very excited to see what the community and enterprises will be able to achieve with this new hardware and these integrations.
We are eager to hear from you and to help with your use cases.

Make sure to stop by the optimum-amd and text-generation-inference GitHub repositories to get the latest performance optimizations for AMD GPUs!


