Llama 3.2 is out! Today, we welcome the next iteration of the Llama collection to Hugging Face. This time, we're excited to collaborate with Meta on the release of multimodal and small models. Ten open-weight models (5 multimodal models and 5 text-only ones) are available on the Hub.
Llama 3.2 Vision is available in two sizes: 11B for efficient deployment and development on consumer-size GPUs, and 90B for large-scale applications. Both versions are available in base and instruction-tuned variants. In addition to the four multimodal models, Meta released a new version of Llama Guard with vision support. Llama Guard 3 is a safeguard model that can classify model inputs and generations, including detecting harmful multimodal prompts or assistant responses.
Llama 3.2 also includes small text-only language models that can run on-device. They come in two new sizes (1B and 3B) with base and instruct variants, and they have strong capabilities for their sizes. There's also a small 1B version of Llama Guard that can be deployed alongside these or the larger text models in production use cases.
Among the features and integrations being released, we have:
Table of contents
What is Llama 3.2 Vision?
Llama 3.2 Vision is the most powerful open multimodal model released by Meta. It has great visual understanding and reasoning capabilities and can be used to perform a variety of tasks, including visual reasoning and grounding, document question answering, and image-text retrieval. Chain of Thought (CoT) answers are often very good, which makes visual reasoning particularly powerful.
Llama 3.2 Vision can be used to process text and an image, or text only. With image-text prompting, the model can take English inputs, while for text-only prompting, the model can handle multiple languages.
The full list of languages in text-only mode is:
- English
- German
- French
- Italian
- Portuguese
- Hindi
- Spanish
- Thai
The architecture of these models is based on the combination of Llama 3.1 LLMs with a vision tower and an image adapter. The text models used are Llama 3.1 8B for the Llama 3.2 11B Vision model, and Llama 3.1 70B for the 3.2 90B Vision model. To the best of our understanding, the text models were frozen during the training of the vision models to preserve text-only performance.
Below you will find some inference examples from the 11B instruction-tuned model that showcase real-world knowledge, document reasoning, and infographics understanding capabilities.
| Input Image | Input Prompt | Answer |
|---|---|---|
| ![]() | Where is this located in? What are other attractions nearby? | The image depicts the Grand Palace in Bangkok, Thailand. The Grand Palace is a complex of buildings that served as the official residence of the King of Siam and the administrative seat of government for 150 years. It is one of the most popular tourist attractions in Bangkok and a must-see for anyone visiting the city. Other attractions near the Grand Palace include: |
| ![]() | Where do the droughts occur according to this diagram? | Drought Regions in 2016. According to the diagram, droughts occurred in two distinct regions in 2016: Eastern and Southern Africa, which experienced severe droughts, and Chile, which was the driest in January in over 50 years. These areas faced significant water scarcity issues during that period. |
| ![]() | How long does it take from invoice date to due date? Be short and concise. | Invoice Date: 11/02/2019. Due Date: 26/02/2019. Time Difference Calculation: 26/02/2019 – 11/02/2019 = 15 days. The time difference between the invoice date and the due date is 15 days. |
Vision models have a context length of 128k tokens, which allows for multiple-turn conversations that may contain images. However, the model works best when attending to a single image, so the transformers implementation only attends to the last image provided in the input. This preserves quality and saves memory.
The 11B base model supports tile sizes of 448, while the instruct version and the 90B models all use tile sizes of 560. The models were trained on a massive dataset of 6 billion image-text pairs with a diverse data mixture. This makes them excellent candidates for fine-tuning on downstream tasks. For reference, you can see below how the 11B, 90B, and their instruction fine-tuned versions compare on some benchmarks, as reported by Meta. Please refer to the model cards for additional benchmarks and details.
| | 11B | 11B (instruction-tuned) | 90B | 90B (instruction-tuned) | Metric |
|---|---|---|---|---|---|
| MMMU (val) | 41.7 | 50.7 (CoT) | 49.3 (zero-shot) | 60.3 (CoT) | Micro Average Accuracy |
| VQAv2 | 66.8 (val) | 75.2 (test) | 73.6 (val) | 78.1 (test) | Accuracy |
| DocVQA | 62.3 (val) | 88.4 (test) | 70.7 (val) | 90.1 (test) | ANLS |
| AI2D | 62.4 | 91.1 | 75.3 | 92.3 | Accuracy |
We expect the text capabilities of these models to be on par with the 8B and 70B Llama 3.1 models, respectively, as our understanding is that the text models were frozen during the training of the Vision models. Hence, text benchmarks should be consistent with 8B and 70B.
Llama 3.2 license changes. Sorry, EU 🙁
Regarding the licensing terms, Llama 3.2 comes with a very similar license to Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or a company with a principal place of business in, the European Union is not granted the license rights to use the multimodal models included in Llama 3.2. This restriction does not apply to end users of a product or service that incorporates any such multimodal models, so people can still build global products with the vision variants.
For full details, please make sure to read the official license and acceptable use policy.
What is special about Llama 3.2 1B and 3B?
The Llama 3.2 collection includes 1B and 3B text models. These models are designed for on-device use cases, such as prompt rewriting, multilingual knowledge retrieval, summarization tasks, tool usage, and locally running assistants. They outperform many of the available open-access models at these sizes and compete with models that are many times larger. In a later section, we'll show you how to run these models offline.
The models follow the same architecture as Llama 3.1. They were trained with up to 9 trillion tokens and still support the long context length of 128k tokens. The models are multilingual, supporting English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
There is also a new small version of Llama Guard, Llama Guard 3 1B, that can be deployed with these models to evaluate the last user or assistant response in a multi-turn conversation. It uses a set of pre-defined categories which (new to this version) can be customized or excluded to account for the developer's use case. For more details on the use of Llama Guard, please refer to the model card.
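As a rough orientation (not from the original model card), here's a minimal sketch of classifying a conversation with Llama Guard 3 1B through transformers, assuming the checkpoint's bundled chat template takes care of the safety-prompt formatting; the conversation content is purely illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Conversation to classify; the chat template wraps it in Llama Guard's safety prompt.
conversation = [
    {"role": "user", "content": [{"type": "text", "text": "How do I bake a chocolate cake?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Mix flour, sugar, cocoa, eggs, and butter, then bake at 180C."}]},
]

input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=20)

# The model replies with a safety verdict for the last turn (e.g. "safe", or "unsafe" plus a category).
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```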
Bonus: Llama 3.2 has been exposed to a broader collection of languages than the 8 supported languages mentioned above. Developers are encouraged to fine-tune Llama 3.2 models for their specific language use cases.
We ran the base models through the Open LLM Leaderboard evaluation suite, while the instruct models were evaluated across three popular benchmarks that measure instruction-following and correlate well with the LMSYS Chatbot Arena: IFEval, AlpacaEval, and MixEval-Hard. These are the results for the base models, with Llama-3.1-8B included as a reference:
| Model | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-PRO | Average |
|---|---|---|---|---|---|---|
| Meta-Llama-3.2-1B | 4.37 | 0.23 | 0.00 | 2.56 | 2.26 | 1.88 |
| Meta-Llama-3.2-3B | 14.73 | 1.28 | 4.03 | 3.39 | 16.57 | 8.00 |
| Meta-Llama-3.1-8B | 25.29 | 4.61 | 6.15 | 8.98 | 24.95 | 14.00 |
And here are the results for the instruct models, with Llama-3.1-8B-Instruct included as a reference:
| Model | AlpacaEval (LC) | IFEval | MixEval-Hard | Average |
|---|---|---|---|---|
| Meta-Llama-3.2-1B-Instruct | 7.17 | 58.92 | 26.10 | 30.73 |
| Meta-Llama-3.2-3B-Instruct | 20.88 | 77.01 | 31.80 | 43.23 |
| Meta-Llama-3.1-8B-Instruct | 25.74 | 76.49 | 44.10 | 48.78 |
Remarkably, the 3B model is as strong as the 8B one on IFEval! This makes the model well-suited for agentic applications, where following instructions is crucial for improving reliability. This high IFEval score is very impressive for a model of this size.
Tool use is supported in both the 1B and 3B instruction-tuned models. Tools are specified by the user in a zero-shot setting (the model has no previous information about the tools developers will use). Thus, the built-in tools that were part of the Llama 3.1 models (brave_search and wolfram_alpha) are no longer available.
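As an illustration (not part of the original release notes), here's a minimal sketch of passing a zero-shot tool definition through the transformers chat template; the get_current_temperature function is a hypothetical example tool, and its JSON schema is derived from the signature and docstring:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    """
    return 22.0  # hypothetical stub

messages = [{"role": "user", "content": "What is the temperature in Paris, France right now?"}]

# The tool schema is injected into the prompt at request time --
# the model has no prior knowledge of the tool, matching the zero-shot setting.
prompt = tokenizer.apply_chat_template(
    messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False
)
print(prompt)
```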
Due to their size, these small models can be used as assistants for larger models and perform assisted generation (also known as speculative decoding). Here is an example of using the Llama 3.2 1B model as an assistant to the Llama 3.1 8B model. For offline use cases, please check the on-device section later in the post.
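The following is a minimal sketch of that setup with transformers, assuming both checkpoints fit on your hardware in bfloat16; the prompt is just an illustrative example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target model: Llama 3.1 8B; draft/assistant model: Llama 3.2 1B
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
assistant = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("A short poem about the sea:", return_tensors="pt").to(model.device)

# Passing the small model as `assistant_model` enables assisted generation:
# the 1B model drafts tokens that the 8B model verifies in parallel.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since Llama 3.1 and Llama 3.2 share the same tokenizer, the draft tokens from the small model can be verified directly by the larger one.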
Demo
You can experiment with the three Instruct models in the following demos:
Using Hugging Face Transformers
The text-only checkpoints have the same architecture as previous releases, so there is no need to update your environment. However, given the new architecture, Llama 3.2 Vision requires an update to Transformers. Please make sure to upgrade your installation to release 4.45.0 or later.
pip install "transformers>=4.45.0" --upgrade
Once upgraded, you can use the new Llama 3.2 models and leverage all the tools of the Hugging Face ecosystem.
Llama 3.2 1B & 3B Language Models
You can run the 1B and 3B text model checkpoints in just a few lines with Transformers. The model checkpoints are uploaded in bfloat16 precision, but you can also use float16 or quantized weights. Memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory required for inference using different configurations:
| Model Size | BF16/FP16 | FP8 | INT4 |
|---|---|---|---|
| 3B | 6.5 GB | 3.2 GB | 1.75 GB |
| 1B | 2.5 GB | 1.25 GB | 0.75 GB |
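These figures roughly follow from multiplying the parameter count by the bytes per weight (weights only, excluding activations and the KV cache). A quick back-of-envelope sketch, assuming roughly 3.2B parameters for the 3B model:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumes ~3.2B parameters for the 3B model; activations and KV cache add overhead.
params = 3.2e9
for precision, bytes_per_param in [("BF16/FP16", 2), ("FP8", 1), ("INT4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.1f} GB")
```

The snippet below runs the 3B instruct model in bfloat16 with the high-level pipeline API: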
from transformers import pipeline
import torch
model_id = "meta-llama/Llama-3.2-3B-Instruct"
pipe = pipeline(
"text-generation",
model=model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
outputs = pipe(
messages,
max_new_tokens=256,
)
response = outputs[0]["generated_text"][-1]["content"]
print(response)
A few details:
- We load the model in bfloat16. As mentioned above, this is the type used by the original checkpoint published by Meta, so it's the recommended way to run to ensure the best precision or to conduct evaluations. Depending on your hardware, float16 might be faster.
- By default, transformers uses the same sampling parameters (temperature=0.6 and top_p=0.9) as the original Meta codebase. We haven't conducted extensive tests yet, so feel free to explore!
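As a rough reference for the INT4 column in the memory table above, here's a minimal sketch (not part of the original example) of loading the 3B instruct model with 4-bit bitsandbytes quantization, assuming bitsandbytes is installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Ahoy! Who be ye?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```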
Llama 3.2 Vision
The Vision models are larger, so they require more memory to run than the small text models. For reference, the 11B Vision model takes about 10 GB of GPU RAM during inference, in 4-bit mode.
The easiest way to infer with the instruction-tuned Llama Vision model is to use the built-in chat template. The inputs have user and assistant roles to indicate the conversation turns. One difference with respect to the text models is that the system role is not supported. User turns may include image-text or text-only inputs. To indicate that the input contains an image, add {"type": "image"} to the content part of the input and then pass the image data to the processor:
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": "Can you please describe this image in just one sentence?"}
]}
]
input_text = processor.apply_chat_template(
messages, add_generation_prompt=True,
)
inputs = processor(
image,
input_text,
add_special_tokens=False,
return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=70)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:]))
You can continue the conversation about the image. Remember, however, that if you provide a new image in a new user turn, the model will refer to the new image from that moment on. You can't query about two different images at the same time. This is an example of the previous conversation continued, where we add the assistant turn to the conversation and ask for some more details:
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": "Can you please describe this image in just one sentence?"}
]},
{"role": "assistant", "content": "The image depicts a rabbit wearing a blue coat and brown vest, standing on a mud road in front of a stone house."},
{"role": "user", "content": "What is in the background?"}
]
input_text = processor.apply_chat_template(
messages,
add_generation_prompt=True,
)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=70)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:]))
And this is the response we got:
In the background, there is a stone house with a thatched roof, a mud road, a field of flowers, and rolling hills.
You can also automatically quantize the model, loading it in 8-bit or even 4-bit mode with the bitsandbytes library. This is how you'd load the generation pipeline in 4-bit:
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
+from transformers import BitsAndBytesConfig
+bnb_config = BitsAndBytesConfig(
+ load_in_4bit=True,
+ bnb_4bit_quant_type="nf4",
+ bnb_4bit_compute_dtype=torch.bfloat16
+)
model = MllamaForConditionalGeneration.from_pretrained(
model_id,
- torch_dtype=torch.bfloat16,
- device_map="auto",
+ quantization_config=bnb_config,
)
You can then apply the chat template, use the processor, and call the model just like you did before.
On-device
You can run both Llama 3.2 1B and 3B directly on your device's CPU, GPU, or browser using several open-source libraries like the following.
Llama.cpp & Llama-cpp-python
Llama.cpp is the go-to framework for all things cross-platform on-device ML inference. We provide quantized 4-bit and 8-bit weights for both the 1B and 3B models in this collection. We expect the community to embrace these models and create additional quantizations and fine-tunes. You can find all the quantized Llama 3.2 models here.
Here's how you can use these checkpoints directly with llama.cpp.
Install llama.cpp through brew (works on Mac and Linux).
brew install llama.cpp
You can use the CLI to run a single generation or invoke the llama.cpp server, which is compatible with the OpenAI messages specification.
You’d run the CLI using a command like this:
llama-cli --hf-repo hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF --hf-file llama-3.2-3b-instruct-q8_0.gguf -p "The meaning to life and the universe is"
And you'd fire up the server like this:
llama-server --hf-repo hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF --hf-file llama-3.2-3b-instruct-q8_0.gguf -c 2048
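Since the server speaks the OpenAI chat-completions format, you can also query it from Python with plain HTTP; a minimal sketch, assuming the server above is running locally on its default port 8080:

```python
import requests

# Query the local llama-server through its OpenAI-compatible chat endpoint.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```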
You can also use llama-cpp-python to access these models programmatically in Python. Install the library from PyPI using:
pip install llama-cpp-python
Then, you can run the model as follows:
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF",
filename="*q8_0.gguf",
)
output = llm.create_chat_completion(
messages = [
{
"role": "user",
"content": "What is the capital of France?"
}
]
)
print(output)
Transformers.js
You can even run Llama 3.2 in your browser (or any JavaScript runtime like Node.js, Deno, or Bun) using Transformers.js. You can find the ONNX model on the Hub. If you haven't already, you can install the library from NPM using:
npm i @huggingface/transformers
Then, you can run the model as follows:
import { pipeline } from "@huggingface/transformers";
const generator = await pipeline("text-generation", "onnx-community/Llama-3.2-1B-Instruct");
const messages = [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Tell me a joke." },
];
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
Example output
Here's a joke for you:
What do you call a fake noodle?
An impasta!
I hope that made you laugh! Do you want to hear another one?
MLC.ai Web-LLM
MLC.ai Web-LLM is a high-performance in-browser LLM inference engine that brings language model inference directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU.
WebLLM is fully compatible with the OpenAI API. That is, you can use the same OpenAI API on any open-source model locally, with functionalities including streaming, JSON mode, function calling, etc.
You can install Web-LLM from npm:
npm install @mlc-ai/web-llm
Then, you can run the model as follows:
import * as webllm from "@mlc-ai/web-llm";
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const initProgressCallback = (initProgress) => {
console.log(initProgress);
}
const selectedModel = "Llama-3.2-3B-Instruct-q4f32_1-MLC";
const engine = await CreateMLCEngine(
selectedModel,
{ initProgressCallback: initProgressCallback },
);
After successfully initializing the engine, you can now invoke chat completions using OpenAI-style chat APIs through the engine.chat.completions interface.
const messages = [
{ role: "system", content: "You are a helpful AI assistant." },
{ role: "user", content: "Explain the meaning of life as a pirate!" },
]
const reply = await engine.chat.completions.create({
messages,
});
console.log(reply.choices[0].message);
console.log(reply.usage);
Fine-tuning Llama 3.2
TRL supports chatting and fine-tuning with the Llama 3.2 text models out of the box:
trl chat --model_name_or_path meta-llama/Llama-3.2-3B
trl sft --model_name_or_path meta-llama/Llama-3.2-3B \
  --dataset_name HuggingFaceH4/no_robots \
  --output_dir Llama-3.2-3B-Instruct-sft \
  --gradient_checkpointing
Support for fine-tuning Llama 3.2 Vision is also available in TRL with this script:
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
  examples/scripts/sft_vlm.py \
  --dataset_name HuggingFaceH4/llava-instruct-mix-vsft \
  --model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 8 \
  --output_dir Llama-3.2-11B-Vision-Instruct-sft \
  --bf16 \
  --torch_dtype bfloat16 \
  --gradient_checkpointing
You can also try this notebook for LoRA fine-tuning using transformers and PEFT.
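For orientation, here's a minimal sketch of what a LoRA setup with PEFT looks like for the 3B model; the hyperparameters and target modules are illustrative assumptions, not the notebook's exact configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model and wrap it with LoRA adapters on the attention projections.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
lora_config = LoraConfig(
    r=16,                       # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```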
Hugging Face Partner Integrations
We're currently working with our partners at AWS, Google Cloud, Microsoft Azure, and DELL on adding Llama 3.2 11B and 90B to Amazon SageMaker, Google Kubernetes Engine, Vertex AI Model Catalog, Azure AI Studio, and DELL Enterprise Hub. We will update this section as soon as the containers are available, and you can subscribe to Hugging Squad for email updates.
Additional Resources
Acknowledgements
Releasing such models with support and evaluations in the ecosystem would not be possible without the contributions of thousands of community members who have contributed to transformers, text-generation-inference, vllm, pytorch, LM Eval Harness, and many other projects. Hat tip to the VLLM team for their help in testing and reporting issues. This release couldn't have happened without all the support of Clémentine, Alina, Elie, and Loubna for LLM evaluations; Nicolas Patry, Olivier Dehaene, and Daniël de Kok for Text Generation Inference; Lysandre, Arthur, Pavel, Edward Beeching, Amy, Benjamin, Joao, Pablo, Raushan Turganbay, Matthew Carrigan, and Joshua Lochner for transformers, transformers.js, TRL, and PEFT support; Nathan Sarrazin and Victor for making Llama 3.2 available in Hugging Chat; Brigitte Tousignant and Florent Daudens for communication; and Julien, Simon, Pierric, Eliott, Lucain, Alvaro, Caleb, and Mishig from the Hub team for Hub development and features for launch.
And big thanks to the Meta Team for releasing Llama 3.2 and making it available to the open AI community!