Gemma 3n was announced as a preview during Google I/O. The on-device community got really excited, because it is a model designed from the ground up to run locally on your hardware. On top of that, it’s natively multimodal, supporting image, text, audio, and video inputs 🤯
Today, Gemma 3n is finally available in the most used open source libraries. This includes transformers & timm, MLX, llama.cpp (text inputs), transformers.js, ollama, Google AI Edge, and others.
This post quickly goes through practical snippets that show how to use the model with these libraries, and how easy it is to fine-tune it for other domains.
Models released today
Here is the Gemma 3n Release Collection
Two model sizes have been released today, with two variants (base and instruct) each. The model names follow a non-standard nomenclature: they’re called gemma-3n-E2B and gemma-3n-E4B. The E preceding the parameter count stands for Effective. Their actual parameter counts are 5B and 8B, respectively, but thanks to improvements in memory efficiency, they only need as much VRAM (GPU memory) as 2B and 4B models.
These models, therefore, behave like 2B and 4B models in terms of hardware requirements, but they punch above 2B/4B in terms of quality. The E2B model can run in as little as 2GB of GPU RAM, while E4B can run with just 3GB of GPU RAM.
Details of the models
In addition to the language decoder, Gemma 3n uses an audio encoder and a vision encoder. We highlight their main features below, and describe how they’ve been added to transformers and timm, as they’re the reference for other implementations.
- Vision Encoder (MobileNet-V5). Gemma 3n uses a new version of MobileNet: MobileNet-v5-300, which has been added to the new version of timm released today (see the sketch after this list if you want to inspect it on its own).
  - Features 300M parameters.
  - Supports resolutions of 256x256, 512x512, and 768x768.
  - Achieves 60 FPS on a Google Pixel, outperforming ViT Giant while using 3x fewer parameters.
- Audio Encoder:
  - Based on the Universal Speech Model (USM).
  - Processes audio in 160ms chunks.
  - Enables speech-to-text and translation functionalities (e.g., English to Spanish/French).
- Gemma 3n Architecture and Language Model. The architecture itself has been added to the new version of transformers released today. This implementation delegates to timm for image encoding, so there is a single reference implementation of the MobileNet architecture.
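If you want to poke at the vision encoder on its own, you can instantiate it directly through timm. Here is a minimal sketch, which assumes the new release exposes the encoder under a MobileNet-V5 model name (use timm.list_models to find the exact identifier on your install):

import timm
import torch

# Look up whatever MobileNet-V5 variants this timm release ships.
# The exact identifier is an assumption here; adjust to what this prints.
candidates = timm.list_models("*mobilenetv5*")
print(candidates)

# Create the first match as a headless feature extractor (num_classes=0).
encoder = timm.create_model(candidates[0], pretrained=False, num_classes=0)
encoder.eval()

# Push a dummy 768x768 image through it to inspect the embedding size.
with torch.no_grad():
    features = encoder(torch.randn(1, 3, 768, 768))
print(features.shape)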
Architecture Highlights
- MatFormer Architecture:
- A nested transformer design, similar to Matryoshka embeddings, which allows different subsets of layers to be extracted as if they were individual models.
- E2B and E4B were trained together, configuring E2B as a sub-model of E4B.
- Users can “mix and match” layers, depending on their hardware characteristics and memory budget.
- Per-Layer Embeddings (PLE): Reduces accelerator memory usage by offloading embeddings to the CPU. This is the reason why the E2B model, while having 5B real parameters, takes about as much GPU memory as a 2B-parameter model.
- KV Cache Sharing: Accelerates long-context processing for audio and video, achieving 2x faster prefill compared to Gemma 3 4B.
Performance & Benchmarks:
- LMArena Score: E4B is the first sub-10B model to achieve a score of 1300+.
- MMLU Scores: Gemma 3n shows competitive performance across various sizes (E4B, E2B, and several other Mix-n-Match configurations).
- Multilingual Support: Supports 140 languages for text and 35 languages for multimodal interactions.
Demo Space
The easiest way to vibe check the model is with the dedicated Hugging Face Space for the model. You can try out different prompts, with different modalities, here.
Inference with transformers
Install the latest versions of timm (for the vision encoder) and transformers to run inference, or if you want to fine-tune the model.
pip install -U -q timm
pip install -U -q transformers
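Gemma 3n support lands in the versions of both libraries released today, so if something fails to load, a quick sanity check is to print what your environment actually picked up:

import timm
import transformers

# Gemma 3n needs the transformers and timm releases that shipped alongside it;
# if loading fails, confirm these are recent enough (upgrade with the commands above).
print("transformers:", transformers.__version__)
print("timm:", timm.__version__)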
Inference with pipeline
The easiest way to start using Gemma 3n is with the pipeline abstraction in transformers:
import torch
from transformers import pipeline
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E4B-it",
    device="cuda",
    torch_dtype=torch.bfloat16,
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image"},
        ],
    }
]
output = pipe(text=messages, max_new_tokens=32)
print(output[0]["generated_text"][-1]["content"])
Output:
The image shows a futuristic, sleek aircraft soaring through the sky. It's designed with a distinctive, almost alien aesthetic, featuring a wide body and large
Detailed inference with transformers
Initialize the model and the processor from the Hub, and write the model_generation function that takes care of processing the prompts and running inference on the model.
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "google/gemma-3n-E4B-it"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).to(device)
def model_generation(model, messages):
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    input_len = inputs["input_ids"].shape[-1]
    inputs = inputs.to(model.device, dtype=model.dtype)

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=32, disable_compile=False)
        generation = generation[:, input_len:]

    decoded = processor.batch_decode(generation, skip_special_tokens=True)
    print(decoded[0])
Since the model supports all modalities as inputs, here is a brief overview of how to use each of them via transformers.
Text only
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"}
        ],
    }
]
model_generation(model, messages)
Output:
The capital of France is **Paris**.
Interleaved with Audio
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English:"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
        ],
    }
]
model_generation(model, messages)
Output:
Send a text to Mike. I will be home late tomorrow.
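The content list can also mix modalities in a single turn. A minimal sketch, reusing the demo image and audio from above (that the instruct model handles both in one prompt is an assumption worth verifying on your task):

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
            {"type": "text", "text": "Describe the image, then transcribe the audio."},
        ],
    }
]

model_generation(model, messages)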
Interleaved with Image/Video
Videos are supported as a collection of image frames (see the sketch after the example below).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
model_generation(model, messages)
Output:
The image shows a futuristic, sleek, white airplane against a backdrop of a clear blue sky transitioning into a cloudy, hazy landscape below. The airplane is tilted at
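Since videos are handled as frames, a minimal sketch for describing a short clip is to sample a handful of frames yourself (for example with ffmpeg or OpenCV) and pass them as multiple image entries in a single turn. The frame files below are hypothetical placeholders:

# Hypothetical local frames extracted from a video clip beforehand.
frame_paths = ["frame_00.jpg", "frame_01.jpg", "frame_02.jpg", "frame_03.jpg"]

messages = [
    {
        "role": "user",
        "content": [
            # One image entry per sampled frame, followed by the question.
            *[{"type": "image", "image": path} for path in frame_paths],
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]

model_generation(model, messages)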
Inference with MLX
Gemma 3n comes with day 0 support for MLX across all 3 modalities. Make sure to upgrade your mlx-vlm installation:
pip install -U mlx-vlm
Start with vision:
python -m mlx_vlm.generate --model google/gemma-3n-E4B-it --max-tokens 100 --temperature 0.5 --prompt "Describe this image in detail." --image https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg
And audio:
python -m mlx_vlm.generate --model google/gemma-3n-E4B-it --max-tokens 100 --temperature 0.0 --prompt "Transcribe the following speech segment in English:" --audio https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/audio-samples/jfk.wav
Inference with llama.cpp
In addition to MLX, Gemma 3n (text only) works out of the box with llama.cpp. Make sure to install llama.cpp / Ollama from source.
Check out the installation instructions for llama.cpp here: https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md
You can run it as:
llama-server -hf ggml-org/gemma-3n-E4B-it-GGUF:Q8_0
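Once llama-server is running, it exposes an OpenAI-compatible HTTP API. A minimal sketch of querying it from Python, assuming the default host and port (127.0.0.1:8080):

import requests

# Default llama-server endpoint; adjust if you passed --host/--port.
url = "http://127.0.0.1:8080/v1/chat/completions"

response = requests.post(
    url,
    json={
        "messages": [
            {"role": "user", "content": "Write a haiku about on-device AI."}
        ],
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["message"]["content"])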
Inference with Transformers.js and ONNXRuntime
Finally, we’re also releasing ONNX weights for the gemma-3n-E2B-it model variant, enabling flexible deployment across diverse runtimes and platforms. For JavaScript developers, Gemma 3n has been integrated into Transformers.js and is available as of version 3.6.0.
For more information on how to run the model with these libraries, check out the usage section in the model card.
Fine-tune in a Free Google Colab
Given the size of the model, it’s pretty convenient to fine-tune it for specific downstream tasks across modalities. To make it easier for you to fine-tune the model, we’ve created a simple notebook that allows you to experiment on a free Google Colab!
We also provide a dedicated notebook for fine-tuning on audio tasks, so you can easily adapt the model to your speech datasets and benchmarks!
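The notebooks are the reference for the full setup, but to give a rough idea of the kind of parameter-efficient configuration involved, here is a minimal LoRA sketch with peft, reusing the model loaded in the transformers section above. The rank and target module names are illustrative assumptions, not the notebooks' exact settings; check them against model.named_modules():

from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup; the target modules follow the usual attention
# projection names and should be verified for Gemma 3n before training.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable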
Hugging Face Gemma Recipes
With this release, we also introduce the Hugging Face Gemma Recipes repository. There you will find notebooks and scripts to run and fine-tune the models.
We’d love for you to use the Gemma family of models and add more recipes to the repository! Feel free to open issues and create pull requests.
Conclusion
We’re always excited to host Google and their Gemma family of models. We hope the community will come together and make the most of these models. Multimodal, small, and highly capable: that makes for a great model release!
If you want to discuss the models in more detail, go ahead and start a discussion right below this blog post. We will be more than happy to help!
A huge thanks to Arthur, Cyril, Raushan, Lysandre, and everyone at Hugging Face who took care of the integration and made it available to the community!

