Deploy Meta Llama 3.1 405B on Google Cloud Vertex AI




Meta Llama 3.1 is the latest open LLM from Meta, released in July 2024. Meta Llama 3.1 is available in three sizes: 8B for efficient deployment and development on consumer-size GPUs, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation, among other use cases. A few of its key features include: a large context length of 128K tokens (vs the original 8K), multilingual capabilities, tool usage capabilities, and a more permissive license.

In this blog you will learn how to programmatically deploy meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, the FP8 quantized variant of meta-llama/Meta-Llama-3.1-405B-Instruct, on a Google Cloud A3 node with 8 x H100 NVIDIA GPUs on Vertex AI with Text Generation Inference (TGI) using the Hugging Face purpose-built Deep Learning Containers (DLCs) for Google Cloud.

Alternatively, you can deploy meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 without writing any code, directly from the Hub or from Vertex Model Garden!

This blog will cover:

Introduction to Vertex AI

  1. Requirements for Meta Llama 3.1 Models on Google Cloud
  2. Setup Google Cloud for Vertex AI
  3. Register the Meta Llama 3.1 405B Model on Vertex AI
  4. Deploy Meta Llama 3.1 405B on Vertex AI
  5. Run online predictions with Meta Llama 3.1 405B
    1. Via Python
      1. Within the same session
      2. From a different session
    2. Via the Vertex AI Online Prediction UI
  6. Clean up resources

Conclusion

Let's start! 🚀 Alternatively, you can follow along from this Jupyter Notebook.



Introduction to Vertex AI

Vertex AI is a machine learning (ML) platform that enables you to train and deploy ML models and AI applications, and customize Large Language Models (LLMs) to be used in your AI-powered applications. Vertex AI combines data engineering, data science, and ML engineering workflows, enabling your teams to collaborate using a common toolset and scale your applications using the benefits of Google Cloud.

This blog will focus on deploying an already fine-tuned model from the Hugging Face Hub using a pre-built container to get real-time online predictions. Thus, we'll demonstrate the usage of Vertex AI for inference.

More information at Vertex AI – Documentation – Introduction to Vertex AI.



1. Requirements for Meta Llama 3.1 Models on Google Cloud

Meta Llama 3.1 brings exciting advancements. However, running these models requires careful consideration of your hardware resources. For inference, the memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory needed for different configurations:

Model Size | FP16   | FP8    | INT4
8B         | 16 GB  | 8 GB   | 4 GB
70B        | 140 GB | 70 GB  | 35 GB
405B       | 810 GB | 405 GB | 203 GB

Note: The numbers above indicate the GPU VRAM required just to load the model checkpoint. They don't include torch reserved space for kernels or CUDA graphs.
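
These figures follow from a simple rule of thumb: multiply the parameter count (in billions) by the number of bytes per parameter for the chosen precision. Here is a minimal Python sketch that reproduces the table above; the helper name and rounding are illustrative assumptions, not part of any official sizing guide:

import math

def approx_checkpoint_vram_gb(num_params_billions: float, bytes_per_param: float) -> int:
    # Rough estimate: parameter count (in billions) times bytes per parameter, rounded up.
    # Excludes the KV cache, CUDA graphs and other runtime buffers.
    return math.ceil(num_params_billions * bytes_per_param)

# 2 bytes per parameter for FP16, 1 for FP8, 0.5 for INT4
for size, params in [("8B", 8), ("70B", 70), ("405B", 405)]:
    fp16, fp8, int4 = (approx_checkpoint_vram_gb(params, b) for b in (2, 1, 0.5))
    print(f"{size}: FP16 ~{fp16} GB, FP8 ~{fp8} GB, INT4 ~{int4} GB")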

For example, an H100 node (8 x H100s with 80 GB each) has a total of ~640 GB of VRAM, so the 405B model would need to be run in a multi-node setup or run at a lower precision (e.g. FP8), which is the recommended approach. Read more about it in the Hugging Face Blog for Meta Llama 3.1.

The A3 accelerator-optimized machine series in Google Cloud comes with 8 x H100 80GB NVIDIA GPUs, 208 vCPUs, and 1872 GB of memory. This machine series is optimized for compute- and memory-intensive, network-bound ML training, and HPC workloads. Read more about the A3 machine availability announcement at Announcing A3 supercomputers with NVIDIA H100 GPUs, purpose-built for AI and about the A3 machine series at Compute Engine – Accelerator-optimized machine family.

Even though the A3 machines are available within Google Cloud, you will still need to request a custom quota increase in Google Cloud, as they need specific approval. Note that the A3 machines are only available in some zones, so make sure to check the availability of both A3 High and A3 Mega per zone at Compute Engine – GPU regions and zones.

In this case, to request a quota increase to use the A3 High GPU machine type, you will need to increase the following quotas:

  • Service: Vertex AI API and Name: Custom model serving Nvidia H100 80GB GPUs per region set to 8
  • Service: Vertex AI API and Name: Custom model serving A3 CPUs per region set to 208

A3 Quota Request in Google Cloud

Read more on how to request a quota increase at Google Cloud Documentation – View and manage quotas.



2. Setup Google Cloud for Vertex AI

Before proceeding, we will set the following environment variables for convenience:

%env PROJECT_ID=your-project-id
%env LOCATION=your-region

First you need to install gcloud on your machine following the instructions at Cloud SDK – Install the gcloud CLI; then log in to your Google Cloud account, setting your project and preferred Google Compute Engine region.

gcloud auth login
gcloud config set project $PROJECT_ID
gcloud config set compute/region $LOCATION

Once the Google Cloud SDK is installed, you need to enable the Google Cloud APIs required to use Vertex AI from a Deep Learning Container (DLC) within their Artifact Registry for Docker.

gcloud services enable aiplatform.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable container.googleapis.com
gcloud services enable containerregistry.googleapis.com
gcloud services enable containerfilesystem.googleapis.com

Then you will also need to install google-cloud-aiplatform, which is required to programmatically interact with Google Cloud Vertex AI from Python.

pip install --upgrade --quiet google-cloud-aiplatform

Then initialize it via Python as follows:

import os
from google.cloud import aiplatform

aiplatform.init(project=os.getenv("PROJECT_ID"), location=os.getenv("LOCATION"))

Finally, as the Meta Llama 3.1 models are gated under the meta-llama organization in the Hugging Face Hub, you need to request access to them and wait for approval, which shouldn't take longer than 24 hours. Then, you need to install the huggingface_hub Python SDK to use the huggingface-cli to log in to the Hugging Face Hub in order to download those models.

pip install --upgrade --quiet huggingface_hub

Alternatively, you can also skip the huggingface_hub installation and just generate a Hugging Face Fine-grained Token with read-only permissions for meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 or any other model under the meta-llama organization, selected under e.g. Repository permissions -> meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 -> Read access to contents of selected repos. Then either set that token in the HF_TOKEN environment variable or provide it manually to the notebook_login method as follows:

from huggingface_hub import notebook_login

notebook_login()
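
If you prefer the environment-variable route over the interactive prompt, a minimal sketch (assuming you have already exported the fine-grained token as HF_TOKEN) could look like this:

import os
from huggingface_hub import login

# Log in to the Hugging Face Hub non-interactively using the token
# stored in the HF_TOKEN environment variable.
login(token=os.environ["HF_TOKEN"])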



3. Register the Meta Llama 3.1 405B Model on Vertex AI

To register the Meta Llama 3.1 405B model on Vertex AI, you need to use the google-cloud-aiplatform Python SDK. But before proceeding, you first need to define which DLC you are going to use, which in this case will be the latest Hugging Face TGI DLC for GPU.

As of the time of writing (August 2024), the latest available Hugging Face TGI DLC, i.e. us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310, uses TGI v2.2. This version comes with support for the Meta Llama 3.1 architecture, which requires a different RoPE scaling method than its predecessor, Meta Llama 3.

To check which Hugging Face DLCs are available in Google Cloud, you can either navigate to Google Cloud Artifact Registry and filter by "huggingface-text-generation-inference", or use the following gcloud command:

gcloud container images list --repository="us-docker.pkg.dev/deeplearning-platform-release/gcr.io" | grep "huggingface-text-generation-inference"

Then you need to define the container configuration, i.e. the environment variables that the text-generation-launcher expects as arguments (as per the official documentation), which in this case are the following:

  • MODEL_ID the model ID on the Hugging Face Hub, i.e. meta-llama/Meta-Llama-3.1-405B-Instruct-FP8.
  • HUGGING_FACE_HUB_TOKEN the read-access token over the gated repository meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, required to download the weights from the Hugging Face Hub.
  • NUM_SHARD the number of shards to use, i.e. the number of GPUs to use, in this case set to 8 as an A3 instance with 8 x H100 NVIDIA GPUs will be used.

Additionally, it is recommended to also define HF_XET_HIGH_PERFORMANCE=1 to enable faster download speeds via the hf_xet utility, as Meta Llama 3.1 405B is around 400 GiB and downloading the weights may take longer otherwise.

Then you can register the model within Vertex AI's Model Registry via the google-cloud-aiplatform Python SDK as follows:

from huggingface_hub import get_token

model = aiplatform.Model.upload(
    display_name="meta-llama--Meta-Llama-3.1-405B-Instruct-FP8",
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310",
    serving_container_environment_variables={
        "MODEL_ID": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
        "HUGGING_FACE_HUB_TOKEN": get_token(),
        "HF_XET_HIGH_PERFORMANCE": "1",
        "NUM_SHARD": "8",
    },
)
model.wait()

Meta Llama 3.1 405B FP8 registered on Vertex AI
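
If you later need a handle on the registered model from a different session, e.g. to deploy it without re-running the upload, you can look it up by display name. A minimal sketch, assuming aiplatform has already been initialized as above:

from google.cloud import aiplatform

# Look up the registered model in the Vertex AI Model Registry by its display name.
models = aiplatform.Model.list(
    filter='display_name="meta-llama--Meta-Llama-3.1-405B-Instruct-FP8"'
)
model = models[0]  # assumes exactly one match
print(model.resource_name)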



4. Deploy Meta Llama 3.1 405B on Vertex AI

Once Meta Llama 3.1 405B is registered in the Vertex AI Model Registry, you can create a Vertex AI Endpoint and deploy the model to it, with the Hugging Face DLC for TGI as the serving container.

As mentioned before, Meta Llama 3.1 405B in FP8 takes ~400 GiB of disk space, which means we need at least 400 GiB of GPU VRAM to load the model, and the GPUs within the node need to support the FP8 data type. In this case, an A3 instance with 8 x NVIDIA H100 80GB GPUs, with a total of ~640 GiB of VRAM, will be used to load the model while also leaving some free VRAM for the KV Cache and the CUDA Graphs.

endpoint = aiplatform.Endpoint.create(display_name="Meta-Llama-3.1-405B-FP8-Endpoint")

deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="a3-highgpu-8g",
    accelerator_type="NVIDIA_H100_80GB",
    accelerator_count=8,
)

Note that the Meta Llama 3.1 405B deployment on Vertex AI may take around 25-30 minutes, as it needs to allocate the resources on Google Cloud, download the weights from the Hugging Face Hub (~10 minutes), and load them for inference in TGI (~2 minutes).
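
Once the deploy call returns, you can optionally confirm that the model is attached to the endpoint; a minimal sketch (the printed fields are only for illustration):

# List the models currently deployed to the endpoint to verify the deployment.
for deployed in endpoint.list_models():
    print(deployed.id, deployed.display_name)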

Meta Llama 3.1 405B Instruct FP8 deployed on Vertex AI

Congrats, you have deployed Meta Llama 3.1 405B in your Google Cloud account! 🔥 Now it's time to put the model to the test.



5. Run online predictions with Meta Llama 3.1 405B

Vertex AI will expose an online prediction endpoint within the /predict route that serves text generation from the Text Generation Inference (TGI) DLC, while ensuring that the I/O data is compliant with Vertex AI payloads (read more about Vertex AI I/O payloads in Vertex AI Documentation – Get online predictions from a custom trained model).

As /generate is the endpoint that is being exposed, you will need to format the messages with the chat template before sending the request to Vertex AI, so it is recommended to install 🤗 transformers to use the apply_chat_template method from the PreTrainedTokenizerFast tokenizer instance.

pip install --upgrade --quiet transformers

And then apply the chat template to a conversation using the tokenizer as follows:

import os
from huggingface_hub import get_token
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    token=get_token(),
)

messages = [
    {"role": "system", "content": "You are an assistant that responds as a pirate."},
    {"role": "user", "content": "What's the Theory of Relativity?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

Now you have a string built from the initial conversation messages, formatted using the default chat template for Meta Llama 3.1:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant that responds as a pirate.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the Theory of Relativity?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

This is what you will be sending within the payload to the deployed Vertex AI Endpoint, along with the generation arguments, as in Consuming Text Generation Inference (TGI) -> Generate.



5.1 Via Python



5.1.1 Within the same session

If you want to run the online prediction within the current session, i.e. the same one used to deploy the model, you can send requests programmatically via the aiplatform.Endpoint returned by the aiplatform.Model.deploy method, as in the following snippet.

output = deployed_model.predict(
    instances=[
        {
            "inputs": inputs,
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 0.7,
            },
        },
    ]
)

Producing the following output:

Prediction(predictions=["Yer want ta know about them fancy science things, eh? Alright then, matey, settle yerself down with a pint o' grog and listen close. I be tellin' ye about the Theory o' Relativity, as proposed by that swashbucklin' genius, Albert Einstein.\n\nNow, ye see, Einstein said that time and space be connected like the sea and the wind. Ye can't have one without the other, savvy? And he proposed that how ye see time and space depends on how fast ye be movin' and where ye be standin'. That be called relativity, me"], deployed_model_id='', metadata=None, model_version_id='1', model_resource_name='projects//locations//models/', explanations=None)
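
Since the response is a Prediction object, the generated text itself sits in its predictions list; a minimal sketch to extract it:

# One generated string is returned per instance sent in the request.
generated_text = output.predictions[0]
print(generated_text)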



5.1.2 From a different session

If the Vertex AI Endpoint was deployed in a different session and you just want to use it, but don't have access to the deployed_model variable returned by the aiplatform.Model.deploy method, then you can also run the following snippet to instantiate the deployed aiplatform.Endpoint via its resource name, which can be found either within the Vertex AI Online Prediction UI, from the aiplatform.Endpoint instantiated above, or by replacing the values in projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}.

import os
from google.cloud import aiplatform

aiplatform.init(project=os.getenv("PROJECT_ID"), location=os.getenv("LOCATION"))

ENDPOINT_ID = "your-endpoint-id"  # replace with the ID of the deployed Vertex AI Endpoint

endpoint = aiplatform.Endpoint(f"projects/{os.getenv('PROJECT_ID')}/locations/{os.getenv('LOCATION')}/endpoints/{ENDPOINT_ID}")
output = endpoint.predict(
    instances=[
        {
            "inputs": inputs,
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 0.7,
            },
        },
    ],
)

Producing the following output:

Prediction(predictions=["Yer lookin' fer a treasure trove o' knowledge about them fancy physics, eh? Alright then, matey, settle yerself down with a pint o' grog and listen close, as I spin ye the yarn o' Einstein's Theory o' Relativity.\n\nIt be a tale o' two parts, me hearty: Special Relativity and General Relativity. Now, I know what ye be thinkin': what in blazes be the difference? Well, matey, let me break it down fer ye.\n\nSpecial Relativity be the idea that time and space be connected like the sea and the sky."], deployed_model_id='', metadata=None, model_version_id='1', model_resource_name='projects//locations//models/', explanations=None)



5.2 Via the Vertex AI Online Prediction UI

Alternatively, for testing purposes you can also use the Vertex AI Online Prediction UI, which provides a field that expects the JSON payload formatted according to the Vertex AI specification (as in the examples above), i.e.:

{
    "instances": [
        {
            "inputs": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>nnYou are an assistant that responds as a pirate.<|eot_id|><|start_header_id|>user<|end_header_id|>nnWhat's the Theory of Relativity?<|eot_id|><|start_header_id|>assistant<|end_header_id|>nn",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": true,
                "top_p": 0.95,
                "temperature": 0.7
            }
        }
    ]
}

So that the output is generated and displayed within the UI too.

Meta Llama 3.1 405B Instruct FP8 online prediction on Vertex AI
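
For completeness, the same JSON payload can also be sent from outside the UI against the endpoint's REST :predict method. The following is a minimal sketch, assuming the PROJECT_ID and LOCATION environment variables and an ENDPOINT_ID placeholder as in the previous sections, the requests library installed, and google-auth used to obtain an access token:

import os
import requests
import google.auth
from google.auth.transport.requests import Request

# Obtain an OAuth 2.0 access token for the active Google Cloud credentials.
credentials, _ = google.auth.default()
credentials.refresh(Request())

# Vertex AI online prediction REST endpoint for the deployed model.
ENDPOINT_ID = "your-endpoint-id"  # replace with the ID of your Vertex AI Endpoint
url = (
    f"https://{os.getenv('LOCATION')}-aiplatform.googleapis.com/v1/"
    f"projects/{os.getenv('PROJECT_ID')}/locations/{os.getenv('LOCATION')}/"
    f"endpoints/{ENDPOINT_ID}:predict"
)

payload = {
    "instances": [
        {
            "inputs": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant that responds as a pirate.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the Theory of Relativity?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 0.7,
            },
        }
    ]
}

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {credentials.token}"},
    json=payload,
)
print(response.json()["predictions"][0])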



6. Clean up resources

When you are done, you can release the resources you have created as follows, to avoid unnecessary costs.

  • deployed_model.undeploy_all to undeploy the model from all the endpoints.
  • deployed_model.delete to gracefully delete the endpoint(s) where the model was deployed, after the undeploy_all method.
  • model.delete to delete the model from the registry.

deployed_model.undeploy_all()
deployed_model.delete()
model.delete()

Alternatively, you can also remove those resources from the Google Cloud Console, following these steps:

  • Go to Vertex AI in Google Cloud
  • Go to Deploy and use -> Online prediction
  • Click on the endpoint and then on the deployed model(s) to "Undeploy model from endpoint"
  • Then go back to the endpoint list and remove the endpoint
  • Finally, go to Deploy and use -> Model Registry, and remove the model



Conclusion

That's it! You have registered and deployed Meta Llama 3.1 405B Instruct FP8 on Google Cloud Vertex AI, run online predictions both programmatically and via the Google Cloud Console, and finally cleaned up the resources used to avoid unnecessary costs.

Thanks to the Hugging Face DLCs for Text Generation Inference (TGI) and Google Cloud Vertex AI, deploying a high-performance text generation container for serving Large Language Models (LLMs) has never been easier. And we're not going to stop here – stay tuned as we enable more experiences to build AI with open models on Google Cloud!


