Welcome Llama 3 – Meta’s latest open LLM



Meta’s Llama 3, the next iteration of the open-access Llama family, is now released and available on Hugging Face. It’s great to see Meta continuing its commitment to open AI, and we’re excited to fully support the launch with comprehensive integration in the Hugging Face ecosystem.

Llama 3 is available in two sizes: 8B for efficient deployment and development on consumer-size GPUs, and 70B for large-scale AI native applications. Both come in base and instruction-tuned variants. In addition to the 4 models, a new version of Llama Guard was fine-tuned on Llama 3 8B and released as Llama Guard 2 (a safety fine-tune).

We’ve collaborated with Meta to ensure the best integration into the Hugging Face ecosystem. You can find all 5 open-access models (2 base models, 2 fine-tuned, and Llama Guard) on the Hub. Among the features and integrations being released, we have:



Table of contents



What’s new with Llama 3?

The Llama 3 release introduces 4 new open LLM models by Meta, based on the Llama 2 architecture. They come in two sizes: 8B and 70B parameters, each with base (pre-trained) and instruct-tuned versions. All variants can run on various types of consumer hardware and have a context length of 8K tokens.

In addition to these 4 base models, Llama Guard 2 was also released. Fine-tuned on Llama 3 8B, it’s the latest iteration in the Llama Guard family. Llama Guard 2, built for production use cases, is designed to classify LLM inputs (prompts) as well as LLM responses in order to detect content that would be considered unsafe according to a risk taxonomy.

A big change in Llama 3 compared to Llama 2 is the use of a new tokenizer that expands the vocabulary size to 128,256 (from 32K tokens in the previous version). This larger vocabulary can encode text more efficiently (both for input and output) and potentially yields stronger multilingual capabilities. This comes at a cost, though: the embedding input and output matrices are larger, which accounts for a good portion of the parameter count increase of the small model: it goes from 7B in Llama 2 to 8B in Llama 3. In addition, the 8B version of the model now uses Grouped-Query Attention (GQA), which is an efficient representation that should help with longer contexts.
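If you want to verify the new vocabulary size yourself, here is a minimal sketch (it assumes you have accepted the license for the gated meta-llama repository and are logged in via huggingface-cli login):

from transformers import AutoTokenizer

# All Llama 3 variants share the same tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(len(tokenizer))  # 128256, compared to ~32K for Llama 2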

The Llama 3 models were trained on ~8x more data, over 15 trillion tokens, drawn from a new mix of publicly available online data, using two clusters with 24,000 GPUs. We don’t know the precise details of the training mix, and we can only guess that larger and more careful data curation was a big factor in the improved performance. Llama 3 Instruct has been optimized for dialogue applications and was trained on over 10 million human-annotated data samples with a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct policy optimization (DPO).

Regarding the licensing terms, Llama 3 comes with a permissive license that allows redistribution, fine-tuning, and derivative works. The requirement for explicit attribution is new in the Llama 3 license and was not present in Llama 2. Derived models, for example, need to include “Llama 3” at the beginning of their name, and you also need to mention “Built with Meta Llama 3” in derivative works or services. For full details, please make sure to read the official license.



Llama 3 evaluation

Here, you can see a list of models and their Open LLM Leaderboard scores. This is not a comprehensive list, and we encourage you to check out the full leaderboard. Note that the Open LLM Leaderboard is especially useful for evaluating pre-trained models, as there are other benchmarks specific to conversational models.



How to prompt Llama 3

The base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. They are also a great foundation for fine-tuning on your own use cases. The Instruct versions use the following conversation structure:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>

This format has to be reproduced exactly for effective use. We’ll later show how easy it is to reproduce the instruct prompt with the chat template available in transformers.
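As a quick preview, here is a minimal sketch of how the chat template reproduces this structure (assuming you have access to the gated Instruct checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# add_generation_prompt=True appends the assistant header so the model knows it should reply
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)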



Demo

You can chat with Llama 3 70B Instruct on Hugging Chat! Try it here: https://huggingface.co/chat/models/meta-llama/Meta-Llama-3-70B-instruct



Using 🤗 Transformers

With Transformers release 4.40, you can use Llama 3 and leverage all the tools within the Hugging Face ecosystem, such as:

  • training and inference scripts and examples
  • safe file format (safetensors)
  • integrations with tools such as bitsandbytes (4-bit quantization), PEFT (parameter-efficient fine-tuning), and Flash Attention 2
  • utilities and helpers to run generation with the model
  • mechanisms to export the models for deployment

In addition, Llama 3 models are compatible with torch.compile() with CUDA graphs, giving them a ~4x speedup at inference time!
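The exact flags and speedups depend on your transformers and PyTorch versions, but a rough sketch of compiling the model with a static KV cache looks like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# A static KV cache lets torch.compile capture CUDA graphs for the decoding loop
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The key to a good open LLM release is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The first few generations will be slow while compilation happens; the speedup applies to subsequent calls.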

To use Llama 3 models with transformers, make sure to install a recent version of transformers:

pip install --upgrade transformers

The following snippet shows how to use Llama-3-8b-instruct with transformers. It requires about 16 GB of RAM, which fits consumer GPUs such as the 3090 or 4090.

from transformers import pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)

Arrrr, me hearty! Me name be Captain Chat, the scurviest pirate chatbot to ever sail the Seven Seas! Me be here to swab the decks o’ yer mind with me trusty responses, savvy? I be able to hoist the Jolly Roger and set sail fer a swashbucklin’ good time, matey! So, what be bringin’ ye to those fair waters?

A few details:

  • We loaded the model in bfloat16. This is the type used by the original checkpoint published by Meta, so it’s the recommended way to run to ensure the best precision or to conduct evaluations. For real-world use, it’s also safe to use float16, which may be faster depending on your hardware.
  • Assistant responses may end with the special token <|eot_id|>, but we must also stop generation if the regular EOS token is found. We can stop generation early by providing a list of terminators in the eos_token_id parameter.
  • We used the default sampling parameters (temperature and top_p) taken from the original Meta codebase. We haven’t had time to conduct extensive tests yet, so feel free to explore!

You can also automatically quantize the model, loading it in 8-bit or even 4-bit mode. 4-bit loading takes about 7 GB of memory to run, making it compatible with a lot of consumer cards and all the GPUs in Google Colab. This is how you’d load the generation pipeline in 4-bit:

# Reuses the `pipeline` import, `torch`, and `model_id` from the snippet above
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
)

For more details on using the models with transformers, please check the model cards.



Inference Integrations

In this section, we’ll go through different approaches to running inference with the Llama 3 models. Before using these models, make sure you have requested access to one of the models in the official Meta Llama 3 repositories.



Integration with Inference Endpoints

You can deploy Llama 3 on Hugging Face’s Inference Endpoints, which uses Text Generation Inference as the backend. Text Generation Inference is a production-ready inference container developed by Hugging Face to enable easy deployment of large language models. It has features such as continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs, and production-ready logging and tracing.

To deploy Llama 3, go to the model page and click on the Deploy -> Inference Endpoints widget. You can learn more about deploying LLMs with Hugging Face Inference Endpoints in a previous blog post. Inference Endpoints supports the Messages API through Text Generation Inference, which allows you to switch from a closed model to an open one by simply changing the URL.

from openai import OpenAI


client = OpenAI(
    base_url="" + "/v1/",  
    api_key="",  
)
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500
)


for message in chat_completion:
    print(message.choices[0].delta.content, end="")
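If you prefer to self-host instead of using a managed endpoint, the same Text Generation Inference container can be run locally. A rough sketch (the image tag and resource flags are assumptions you may need to adjust for your setup):

docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=<your token> \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct

With this running, the snippet above works unchanged by pointing base_url at http://localhost:8080/v1/.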



Integration with Google Cloud

You can deploy Llama 3 on Google Cloud through Vertex AI or Google Kubernetes Engine (GKE), using Text Generation Inference.

To deploy the Llama 3 model from Hugging Face, go to the model page and click on Deploy -> Google Cloud. This will bring you to the Google Cloud Console, where you can 1-click deploy Llama 3 on Vertex AI or GKE.



Integration with Amazon SageMaker

You can deploy and train Llama 3 on Amazon SageMaker through AWS JumpStart or using the Hugging Face LLM Container.

To deploy the Llama 3 model from Hugging Face, go to the model page and click on Deploy -> Amazon SageMaker. This will display a code snippet you can copy and execute in your environment. Amazon SageMaker will then create a dedicated inference endpoint you can use to send requests.
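For reference, deploying through the Hugging Face LLM container from the SageMaker Python SDK looks roughly like the sketch below. Treat it as an illustration rather than the exact snippet from the model page: the container version, instance type, and placeholder token are assumptions.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes you are running inside SageMaker

# Hugging Face LLM container (TGI backend); pick a version that supports Llama 3
llm_image = get_huggingface_llm_image_uri("huggingface", version="2.0.0")

model = HuggingFaceModel(
    image_uri=llm_image,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
        "HUGGING_FACE_HUB_TOKEN": "<your token>",  # required for the gated repo
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
)

print(predictor.predict({"inputs": "Why is open-source software important?"}))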



Fine-tuning with 🤗 TRL

Training LLMs can be technically and computationally challenging. In this section, we’ll look at the tools available in the Hugging Face ecosystem to efficiently train Llama 3 on consumer-size GPUs. Below is an example command to fine-tune Llama 3 on the No Robots dataset. We use 4-bit quantization, and QLoRA and TRL’s SFTTrainer will automatically format the dataset into chatml format. Let’s start!

First, install the latest version of 🤗 TRL.

pip install -U transformers trl accelerate

If you just want to chat with the model in the terminal, you can use the chat command of the TRL CLI (for more information, see the docs):

trl chat \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--device cuda \
--eos_tokens "<|end_of_text|>,<|eot_id|>"

You can also use the TRL CLI for supervised fine-tuning (SFT) of Llama 3 on your own, custom dataset. Use the trl sft command and pass your training arguments as CLI arguments. Make sure you are logged in and have access to the Llama 3 checkpoint. You can do this with huggingface-cli login.

trl sft \
--model_name_or_path meta-llama/Meta-Llama-3-8B \
--dataset_name HuggingFaceH4/no_robots \
--learning_rate 0.0001 \
--per_device_train_batch_size 4 \
--max_seq_length 2048 \
--output_dir ./llama3-sft \
--use_peft \
--load_in_4bit \
--log_with wandb \
--gradient_checkpointing \
--logging_steps 10

This will run the fine-tuning from your terminal and takes about 4 hours to train on a single A10G, but it can be easily parallelized by tweaking --num_processes to the number of GPUs you have available.

Note: You can also replace the CLI arguments with a YAML file. Learn more about the TRL CLI here.
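If you prefer to run the same fine-tune from Python instead of the CLI, a rough sketch using SFTTrainer is shown below. Argument names vary across TRL versions, so treat this as a starting point rather than a drop-in script:

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B"
dataset = load_dataset("HuggingFaceH4/no_robots")

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# 4-bit quantization (QLoRA-style) so the 8B model fits on a consumer GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],  # conversational "messages" format, handled by SFTTrainer
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./llama3-sft",
        per_device_train_batch_size=4,
        learning_rate=1e-4,
        gradient_checkpointing=True,
        logging_steps=10,
        num_train_epochs=1,
        bf16=True,
    ),
)
trainer.train()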



Additional Resources



Acknowledgments

Releasing such models with support and evaluations in the ecosystem wouldn’t be possible without the contributions of many community members.

Thanks to the Meta Team for releasing Llama 3 and making it available to the open-source AI community!


