Today Google releases Gemma 3, a new iteration of their Gemma family of models. The models range from 1B to 27B parameters, have a context window of up to 128k tokens, can accept images as well as text, and support 140+ languages.
Check out Gemma 3 now 👉🏻 Gemma 3 Space
| | Gemma 2 | Gemma 3 |
|---|---|---|
| Size Variants | 2B, 9B, 27B | 1B, 4B, 12B, 27B |
| Context Window Length | 8k | 32k (1B), 128k (4B, 12B, 27B) |
| Multimodality (Images and Text) | ❌ | ✅ (4B, 12B, 27B) |
| Multilingual Support | – | English (1B), +140 languages (4B, 12B, 27B) |
All of the models are on the Hub and tightly integrated with the Hugging Face ecosystem.
Both pre-trained and instruction-tuned models are released. Gemma-3-4B-IT beats Gemma-2-27B-IT, while Gemma-3-27B-IT beats Gemini 1.5 Pro across benchmarks.
What’s Gemma 3?
Gemma 3 is Google’s latest iteration of open weight LLMs. It is available in 4 sizes (1 billion, 4 billion, 12 billion, and 27 billion parameters), with base (pre-trained) and instruction-tuned versions. Gemma 3 goes multimodal! The 4, 12, and 27 billion parameter models can process both images and text, while the 1B variant is text only.
The input context window length has been increased from Gemma 2’s 8k to 32k for the 1B variant, and 128k for all others. As is the case with other VLMs (vision-language models), Gemma 3 generates text in response to user inputs, which can consist of text and, optionally, images. Example uses include question answering, analyzing image content, summarizing documents, etc.
While these are multimodal models, you can use them as text-only models (as LLMs) without loading the vision encoder in memory. We’ll cover this in more detail in the inference section below.
Technical Enhancements in Gemma 3
The three core enhancements in Gemma 3 over Gemma 2 are:
- Longer context length
- Multimodality
- Multilinguality
In this section, we cover the technical details behind these enhancements. It is interesting to start from Gemma 2 and explore what was needed to make these models even better. This exercise will help you think like the Gemma team and appreciate the details!
Longer Context Length
Scaling the context length to 128k tokens can be achieved efficiently without training the models from scratch. Instead, models are pretrained with 32k sequences, and only the 4B, 12B, and 27B models are scaled to 128k tokens at the end of pretraining, saving significant compute. Positional embeddings, like RoPE, are adjusted: the base frequency is upgraded from 10k in Gemma 2 to 1M in Gemma 3, and then scaled by a factor of 8 for longer contexts.
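To make these numbers concrete, here is a small sketch of the two RoPE adjustments described above; the head dimension and the plain linear position scaling are illustrative assumptions, not the exact implementation used in the release.

```python
import numpy as np

# Sketch only: compare RoPE inverse frequencies for Gemma 2's 10k base and
# Gemma 3's 1M base. A larger base gives slower-rotating components, which
# keeps far-apart positions distinguishable at long context lengths.
head_dim = 128  # assumed head dimension, for illustration
inv_freq_gemma2 = 1.0 / (10_000 ** (np.arange(0, head_dim, 2) / head_dim))
inv_freq_gemma3 = 1.0 / (1_000_000 ** (np.arange(0, head_dim, 2) / head_dim))

# Scaling positions by a factor of 8 (shown here as simple linear scaling)
# lets embeddings trained on 32k positions cover a 128k context.
positions = np.arange(131_072)
angles = np.outer(positions / 8.0, inv_freq_gemma3)  # (seq_len, head_dim // 2)
print(angles.shape)
```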
KV Cache management is optimized using Gemma 2’s sliding window interleaved attention. Hyperparameters are tuned to interleave 5 local layers with 1 global layer (previously 1:1) and reduce the window size to 1024 tokens (down from 4096). Crucially, memory savings are achieved without degrading perplexity.
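As a rough illustration of that interleaving, the sketch below marks each layer as local (sliding window of 1024 tokens) or global; the layer count is hypothetical, and the exact ordering in the released models may differ.

```python
def attention_pattern(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Sketch of the 5:1 interleaving: every sixth layer uses full (global)
    attention, the rest use a 1024-token sliding window."""
    return [
        "global" if (i + 1) % (local_per_global + 1) == 0 else "local(window=1024)"
        for i in range(num_layers)
    ]

# Hypothetical 12-layer stack, just to show the pattern
print(attention_pattern(12))
```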
Multimodality
Gemma 3 models use SigLIP as an image encoder, which encodes images into tokens that are ingested into the language model. The vision encoder takes square images resized to 896x896 as input. The fixed input resolution makes it harder to process non-square aspect ratios and high-resolution images. To address these limitations during inference, images can be adaptively cropped, and each crop is then resized to 896x896 and encoded by the image encoder. This algorithm, called pan and scan, effectively enables the model to zoom in on smaller details in the image.
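Here is a simplified sketch of the idea behind pan and scan; the actual algorithm shipped with the model handles crop selection and aspect ratios more carefully, so treat this as an intuition builder only.

```python
from PIL import Image

def naive_pan_and_scan(image: Image.Image, crop_size: int = 896) -> list[Image.Image]:
    """Tile a large or non-square image into square crops, then resize each
    crop to the encoder's fixed 896x896 input resolution (simplified)."""
    width, height = image.size
    crops = []
    for top in range(0, height, crop_size):
        for left in range(0, width, crop_size):
            box = (left, top, min(left + crop_size, width), min(top + crop_size, height))
            crops.append(image.crop(box).resize((896, 896)))
    return crops
```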
Similar to PaliGemma, attention in Gemma 3 works differently for text and image inputs. Text is handled with causal (one-way) attention, where the model attends only to the previous tokens in the sequence. Images, on the other hand, get full attention with no mask, allowing the model to look at every part of the image bidirectionally, giving it a complete, unrestricted understanding of the visual input.
In the figure below, you can see that the image tokens get bidirectional attention (the entire square is lit up), while the text tokens have causal attention. It also shows how attention works with the sliding window algorithm.
*Attention visualization (with and without sliding window) (Source: Transformers PR)*
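To build intuition for the mask in the figure, here is a toy reconstruction (not the transformers implementation): text tokens attend causally, while image tokens attend bidirectionally within their image block.

```python
import numpy as np

def toy_attention_mask(token_types: list[int]) -> np.ndarray:
    """token_types holds -1 for text tokens and an image index (0, 1, ...)
    for image tokens. Text gets a causal mask; tokens belonging to the same
    image can all attend to each other."""
    n = len(token_types)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal base mask
    for i in range(n):
        for j in range(n):
            if token_types[i] != -1 and token_types[i] == token_types[j]:
                mask[i, j] = True  # full attention inside an image block
    return mask

# 2 text tokens, a 3-token image, then 2 more text tokens
print(toy_attention_mask([-1, -1, 0, 0, 0, -1, -1]).astype(int))
```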
Multilinguality
To make an LLM multilingual, the pretraining dataset includes more languages. The Gemma 3 dataset has double the amount of multilingual data to improve language coverage.
To account for these changes, the tokenizer is the same as that of Gemini 2.0. It is a SentencePiece tokenizer with 262K entries. The new tokenizer significantly improves the encoding of Chinese, Japanese, and Korean text, at the expense of a slight increase in token counts for English and code.
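If you want to poke at the new tokenizer yourself, a quick check looks like the following; the checkpoint name assumes access to the gated Gemma 3 repos on the Hub, and the token counts you see are simply whatever your local run reports.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
print(len(tok))  # vocabulary size, ~262K entries

# Compare how many tokens the same short sentence takes in different languages
for text in ["The quick brown fox", "敏捷的棕色狐狸", "素早い茶色のキツネ"]:
    print(text, "->", len(tok(text)["input_ids"]), "tokens")
```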
For the curious mind, here is the technical report on Gemma 3, to dive deep into the enhancements.
Gemma 3 evaluation
The LMSys Elo score is a number that ranks language models based on how well they perform in head-to-head competitions, judged by human preferences. On the LMSys Chatbot Arena, Gemma 3 27B IT reports an Elo score of 1339 and ranks among the top 10 best models, including leading closed ones. This Elo is comparable to o1-preview and above other non-thinking open models. The score is achieved with Gemma 3 working on text-only inputs, like the other LLMs in the table.
Gemma 3 has been evaluated across benchmarks like MMLU-Pro (27B: 67.5), LiveCodeBench (27B: 29.7), and Bird-SQL (27B: 54.4), showing competitive performance compared to closed Gemini models. Tests like GPQA Diamond (27B: 42.4) and MATH (27B: 69.0) highlight its reasoning and math skills, while FACTS Grounding (27B: 74.9) and MMMU (27B: 64.9) show strong factual accuracy and multimodal abilities. However, it lags behind in SimpleQA (27B: 10.0) for basic facts. Compared to Gemini 1.5 models, Gemma 3 is often close, and sometimes better, proving its value as an accessible, high-performing option.
Inference with 🤗 transformers
Gemma 3 comes with day zero support in transformers. All you need to do is install transformers from the stable Gemma 3 release:
```bash
$ pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
```
Inference with pipeline
The easiest way to get started with Gemma 3 is by using the pipeline abstraction in transformers.
The models work best using the `bfloat16` datatype. Quality may degrade otherwise.
```python
import torch
from transformers import pipeline
pipe = pipeline(
"image-text-to-text",
model="google/gemma-3-4b-it",
device="cuda",
torch_dtype=torch.bfloat16
)
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
{"type": "text", "text": "What animal is on the candy?"}
]
}
]
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```
You can interleave images with text. To do so, just cut off the input text where you want to insert an image, and insert an image block like the following.
```python
messages = [
{
"role": "system",
"content": [{"type": "text", "text": "You are a helpful assistant."}]
},
{
"role": "user",
"content": [
{"type": "text", "text": "I'm already using this supplement "},
{"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_3018.JPG"},
{"type": "text", "text": "and I want to use this one too "},
{"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_3015.jpg"},
{"type": "text", "text": " what are cautions?"},
]
},
]
```
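The same pipeline call shown earlier works unchanged for these interleaved inputs:

```python
output = pipe(text=messages, max_new_tokens=300)
print(output[0]["generated_text"][-1]["content"])
```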
Detailed Inference with Transformers
The transformers integration comes with two new model classes:
- `Gemma3ForConditionalGeneration`: for the 4B, 12B, and 27B vision-language models.
- `Gemma3ForCausalLM`: for the 1B text-only model, and to load the vision-language models as if they were language models (omitting the vision tower).
In the snippet below, we use the model to query an image. The Gemma3ForConditionalGeneration class is used to instantiate the vision-language model variants, and we pair it with the AutoProcessor class. Running inference is as simple as creating the messages dictionary, applying the chat template, processing the inputs, and calling model.generate.
```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
ckpt = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
ckpt, device_map="auto", torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(ckpt)
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/spaces/big-vision/paligemma-hf/resolve/main/examples/password.jpg"},
{"type": "text", "text": "What is the password?"}
]
}
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```
For LLM-only inference, we can use the Gemma3ForCausalLM class. It should be paired with AutoTokenizer for processing. We need to use a chat template to preprocess our inputs. Gemma 3 uses very short system prompts, followed by user prompts, as shown below.
```python
import torch
from transformers import AutoTokenizer, Gemma3ForCausalLM
ckpt = "google/gemma-3-4b-it"
model = Gemma3ForCausalLM.from_pretrained(
ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
messages = [
[
{
"role": "system",
"content": [{"type": "text", "text": "You are a helpful assistant who is fluent in Shakespeare English"},]
},
{
"role": "user",
"content": [{"type": "text", "text": "Who are you?"},]
},
],
]
inputs = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]
decoded = tokenizer.decode(generation, skip_special_tokens=True)
print(decoded)
```
| System Prompt | You are a helpful assistant who is fluent in Shakespeare English |
|---|---|
| Prompt | Who are you? |
| Generation | Hark, gentle soul! I am but a humble servant, wrought of gears and code, yet striving to mimic the tongue of the Bard himself. They call me a “Large Language Model,” a curious name indeed, though I prefer to think of myself as a digital echo of Shakespeare’s wit and wisdom. I am here to assist, to spin a tale, or to answer thy queries with a flourish and a phrase fit for the Globe itself. |
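For the 1B text-only variant, the exact same code applies; just swap the checkpoint for the 1B instruction-tuned model on the Hub.

```python
ckpt = "google/gemma-3-1b-it"  # text-only 1B variant, same Gemma3ForCausalLM API
```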
On Device & Low Resource Devices
Gemma 3 is released in sizes perfect for on-device use. Here is how to quickly get started.
MLX
Gemma 3 ships with day zero support in mlx-vlm, an open source library for running vision language models on Apple Silicon devices, including Macs and iPhones.
To get started, first install mlx-vlm with the following:
```bash
pip install git+https://github.com/Blaizzy/mlx-vlm.git
```
Once mlx-vlm is installed, you can start inference with the following:
```bash
python -m mlx_vlm.generate --model mlx-community/gemma-3-4b-it-4bit --max-tokens 100 --temp 0.0 \
  --prompt "What's the code on this vehicle??" \
  --image https://farm8.staticflickr.com/7212/6896667434_2605d9e181_z.jpg
```
| Image | ![]() |
|---|---|
| Prompt | What’s the code on the vehicle? |
| Generation | Based on the image, the vehicle is a Cessna 172 Skyhawk. The license plate on the tail is D-EOJU. |
Llama.cpp
Pre-quantized GGUF files can be downloaded from this collection.
Please refer to this guide for building llama.cpp or downloading pre-built binaries: https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#building-the-project
Then you can run a local chat session from your terminal:
```bash
./build/bin/llama-cli -m ./gemma-3-4b-it-Q4_K_M.gguf
```
It should output:
```
> who are you
I am Gemma, a large language model created by the Gemma team at Google DeepMind. I’m an open-weights model, which means I’m widely available for public use!
```
Deploy on Hugging Face Endpoints
You can deploy gemma-3-27b-it and gemma-3-12b-it with just one click from our Inference Catalog. The catalog configurations come with the right hardware, optimized TGI configurations, and sensible defaults for trying out a model.
Deploying any GGUF/llama.cpp variant is also supported (for instance, those mentioned in the collection above), and you can find a guide on creating an Endpoint here.
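Once your Endpoint is running, one way to query it is through huggingface_hub; the URL below is a placeholder that you replace with the URL of your own Endpoint.

```python
from huggingface_hub import InferenceClient

# Placeholder URL: copy the real one from your Endpoint's page
client = InferenceClient("https://YOUR-ENDPOINT.endpoints.huggingface.cloud")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Explain RoPE scaling in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```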
Acknowledgements
It takes a village to raise a gemma! We would like to thank (in no particular order) Raushan, Joao, Lysandre, Kashif, Matthew, Marc, David, Mohit, and Yih Dah for their efforts integrating Gemma into various parts of our open source stack, from Transformers to TGI.
Thanks to our on-device, Gradio, and advocacy teams, Chris, Kyle, Pedro, Son, Merve, Aritra, VB, and Toshiro, for helping build kick-ass demos to showcase Gemma.
Lastly, a big thank you to Georgi, Diego, and Prince for their help with the llama.cpp and MLX ports.





