Gemma 3n was announced as a preview during Google I/O. The on-device community got really excited, because it is a model designed from the ground up to run locally on your hardware. On top of that, it’s natively multimodal, supporting image, text, audio, and video inputs 🤯
Today, Gemma 3n is finally available in the most used open source libraries. This includes transformers & timm, MLX, llama.cpp (text inputs), transformers.js, ollama, Google AI Edge, and others.
This post quickly goes through practical snippets that show how to use the model with these libraries, and how easy it is to fine-tune it for other domains.
Models released today
Here is the Gemma 3n Release Collection
Two model sizes have been released today, with two variants (base and instruct) each. The model names follow a non-standard nomenclature: they’re called gemma-3n-E2B and gemma-3n-E4B. The E preceding the parameter count stands for Effective. Their actual parameter counts are 5B and 8B, respectively, but thanks to improvements in memory efficiency, they only need as much VRAM (GPU memory) as 2B and 4B models.
These models, therefore, behave like 2B and 4B models in terms of hardware requirements, but they punch above 2B/4B in terms of quality. The E2B model can run in as little as 2GB of GPU RAM, while E4B can run with just 3GB of GPU RAM.
Details of the models
In addition to the language decoder, Gemma 3n uses an audio encoder and a vision encoder. We highlight their main features below, and describe how they’ve been added to transformers and timm, as they’re the reference for other implementations.
- Vision Encoder (MobileNet-V5). Gemma 3n uses a new version of MobileNet: MobileNet-v5-300, which has been added to the new version of timm released today (see the sketch after this list if you want to inspect it on its own).
  - Features 300M parameters.
  - Supports resolutions of 256x256, 512x512, and 768x768.
  - Achieves 60 FPS on a Google Pixel, outperforming ViT Giant while using 3x fewer parameters.
- Audio Encoder:
  - Based on the Universal Speech Model (USM).
  - Processes audio in 160ms chunks.
  - Enables speech-to-text and translation functionalities (e.g., English to Spanish/French).
- Gemma 3n Architecture and Language Model. The architecture itself has been added to the new version of transformers released today. This implementation delegates to timm for image encoding, so there is a single reference implementation of the MobileNet architecture.
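If you want to poke at the vision encoder on its own, you can instantiate it directly through timm. Here is a minimal sketch, which assumes the new release exposes the encoder under a MobileNet-V5 model name (use timm.list_models to find the exact identifier on your install):

import timm
import torch

# Look up whatever MobileNet-V5 variants this timm release ships.
# The exact identifier is an assumption here; adjust to what this prints.
candidates = timm.list_models("*mobilenetv5*")
print(candidates)

# Create the first match as a headless feature extractor (num_classes=0).
encoder = timm.create_model(candidates[0], pretrained=False, num_classes=0)
encoder.eval()

# Push a dummy 768x768 image through it to inspect the embedding size.
with torch.no_grad():
    features = encoder(torch.randn(1, 3, 768, 768))
print(features.shape)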
Architecture Highlights
- MatFormer Architecture:
- A nested transformer design, similar to Matryoshka embeddings, which allows different subsets of layers to be extracted as if they were individual models.
- E2B and E4B were trained together, configuring E2B as a sub-model of E4B.
- Users can “mix and match” layers, depending on their hardware characteristics and memory budget.
- Per-Layer Embeddings (PLE): Reduces accelerator memory usage by offloading embeddings to the CPU. This is the reason why the E2B model, while having 5B real parameters, takes about as much GPU memory as a 2B-parameter model.
- KV Cache Sharing: Accelerates long-context processing for audio and video, achieving 2x faster prefill compared to Gemma 3 4B.
Performance & Benchmarks:
- LMArena Score: E4B is the first sub-10B model to achieve a score of 1300+.
- MMLU Scores: Gemma 3n shows competitive performance across various sizes (E4B, E2B, and several other Mix-n-Match configurations).
- Multilingual Support: Supports 140 languages for text and 35 languages for multimodal interactions.
Demo Space
The easiest way to vibe check the model is with the dedicated Hugging Face Space for the model. You can try out different prompts, with different modalities, here.
Inference with transformers
Install the latest versions of timm (for the vision encoder) and transformers to run inference, or if you want to fine-tune the model.
pip install -U -q timm
pip install -U -q transformers
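Gemma 3n support lands in the versions of both libraries released today, so if something fails to load, a quick sanity check is to print what your environment actually picked up:

import timm
import transformers

# Gemma 3n needs the transformers and timm releases that shipped alongside it;
# if loading fails, confirm these are recent enough (upgrade with the commands above).
print("transformers:", transformers.__version__)
print("timm:", timm.__version__)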
Inference with pipeline
The easiest way to start using Gemma 3n is with the pipeline abstraction in transformers:
import torch
from transformers import pipeline
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E4B-it",
    device="cuda",
    torch_dtype=torch.bfloat16,
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image"},
        ],
    }
]
output = pipe(text=messages, max_new_tokens=32)
print(output[0]["generated_text"][-1]["content"])
Output:
The image shows a futuristic, sleek aircraft soaring through the sky. It's designed with a distinctive, almost alien aesthetic, featuring a wide body and large
Detailed inference with transformers
Initialize the model and the processor from the Hub, and write the model_generation function that takes care of processing the prompts and running inference on the model.
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "google/gemma-3n-E4B-it"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).to(device)
def model_generation(model, messages):
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    input_len = inputs["input_ids"].shape[-1]
    inputs = inputs.to(model.device, dtype=model.dtype)

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=32, disable_compile=False)
        generation = generation[:, input_len:]

    decoded = processor.batch_decode(generation, skip_special_tokens=True)
    print(decoded[0])
Since the model supports all modalities as inputs, here is a brief overview of how to use each of them via transformers.
Text only
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"}
        ],
    }
]
model_generation(model, messages)
Output:
The capital of France is **Paris**.
Interleaved with Audio
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English:"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
        ],
    }
]
model_generation(model, messages)
Output:
Send a text to Mike. I will be home late tomorrow.
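The content list can also mix modalities in a single turn. A minimal sketch, reusing the demo image and audio from above (that the instruct model handles both in one prompt is an assumption worth verifying on your task):

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
            {"type": "text", "text": "Describe the image, then transcribe the audio."},
        ],
    }
]

model_generation(model, messages)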
Interleaved with Image/Video
Videos are supported as a collection of image frames (see the sketch after the example below).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
model_generation(model, messages)
Output:
The image shows a futuristic, sleek, white airplane against a backdrop of a clear blue sky transitioning into a cloudy, hazy landscape below. The airplane is tilted at
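Since videos are handled as frames, a minimal sketch for describing a short clip is to sample a handful of frames yourself (for example with ffmpeg or OpenCV) and pass them as multiple image entries in a single turn. The frame files below are hypothetical placeholders:

# Hypothetical local frames extracted from a video clip beforehand.
frame_paths = ["frame_00.jpg", "frame_01.jpg", "frame_02.jpg", "frame_03.jpg"]

messages = [
    {
        "role": "user",
        "content": [
            # One image entry per sampled frame, followed by the question.
            *[{"type": "image", "image": path} for path in frame_paths],
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]

model_generation(model, messages)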
Inference with MLX
Gemma 3n comes with day 0 support for MLX across all 3 modalities. Make sure to upgrade your mlx-vlm installation:
pip install -U mlx-vlm
Start with vision:
python -m mlx_vlm.generate --model google/gemma-3n-E4B-it --max-tokens 100 --temperature 0.5 --prompt "Describe this image in detail." --image https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg
And audio:
python -m mlx_vlm.generate --model google/gemma-3n-E4B-it --max-tokens 100 --temperature 0.0 --prompt "Transcribe the following speech segment in English:" --audio https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/audio-samples/jfk.wav
Inference with llama.cpp
In addition to MLX, Gemma 3n (text only) works out of the box with llama.cpp. Make sure to install llama.cpp / Ollama from source.
Check out the installation instructions for llama.cpp here: https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md
You can run it as:
llama-server -hf ggml-org/gemma-3n-E4B-it-GGUF:Q8_0
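Once llama-server is running, it exposes an OpenAI-compatible HTTP API. A minimal sketch of querying it from Python, assuming the default host and port (127.0.0.1:8080):

import requests

# Default llama-server endpoint; adjust if you passed --host/--port.
url = "http://127.0.0.1:8080/v1/chat/completions"

response = requests.post(
    url,
    json={
        "messages": [
            {"role": "user", "content": "Write a haiku about on-device AI."}
        ],
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["message"]["content"])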
Inference with Transformers.js and ONNXRuntime
Finally, we’re also releasing ONNX weights for the gemma-3n-E2B-it model variant, enabling flexible deployment across diverse runtimes and platforms. For JavaScript developers, Gemma 3n has been integrated into Transformers.js and is available as of version 3.6.0.
For more information on how to run the model with these libraries, check out the usage section in the model card.
Fine-tune in a Free Google Colab
Given the size of the model, it’s pretty convenient to fine-tune it for specific downstream tasks across modalities. To make it easier for you to fine-tune the model, we’ve created a simple notebook that allows you to experiment on a free Google Colab!
We also provide a dedicated notebook for fine-tuning on audio tasks, so you can easily adapt the model to your speech datasets and benchmarks!
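The notebooks are the reference for the full setup, but to give a rough idea of the kind of parameter-efficient configuration involved, here is a minimal LoRA sketch with peft, reusing the model loaded in the transformers section above. The rank and target module names are illustrative assumptions, not the notebooks' exact settings; check them against model.named_modules():

from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup; the target modules follow the usual attention
# projection names and should be verified for Gemma 3n before training.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable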
Hugging Face Gemma Recipes
With this release, we also introduce the Hugging Face Gemma Recipes repository. There you will find notebooks and scripts to run and fine-tune the models.
We’d love for you to use the Gemma family of models and add more recipes to the repository! Feel free to open issues and create pull requests.
Conclusion
We’re always excited to host Google and their Gemma family of models. We hope the community will come together and make the most of these models. Multimodal, small, and highly capable: that makes for a great model release!
If you want to discuss the models in more detail, go ahead and start a discussion right below this blog post. We will be more than happy to help!
A huge thanks to Arthur, Cyril, Raushan, Lysandre, and everyone at Hugging Face who took care of the integration and made it available to the community!

