The first Gemma model launched early last year and has since grown into a thriving Gemmaverse of over 160 million collective downloads. This ecosystem includes our family of over a dozen specialized models for everything from safeguarding to medical applications and, most inspiringly, the countless innovations from the community. From innovators like Roboflow building enterprise computer vision to the Institute of Science Tokyo creating highly capable Japanese Gemma variants, your work has shown us the path forward.
Building on this incredible momentum, we're excited to announce the full release of Gemma 3n. While last month's preview offered a glimpse, today unlocks the full power of this mobile-first architecture. Gemma 3n is designed for the developer community that helped shape Gemma. It's supported by your favorite tools including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and many others, enabling you to fine-tune and deploy for your specific on-device applications with ease. This post is the developer deep dive: we'll explore some of the innovations behind Gemma 3n, share new benchmark results, and show you how to start building today.
What's new in Gemma 3n?
Gemma 3n represents a major advancement for on-device AI, bringing powerful multimodal capabilities to edge devices with performance previously only seen in last year's cloud-based frontier models.
Achieving this leap in on-device performance required rethinking the model from the ground up. The foundation is Gemma 3n's unique mobile-first architecture, and it all starts with MatFormer.
MatFormer: One model, many sizes
At the core of Gemma 3n is the MatFormer (🪆Matryoshka Transformer) architecture, a novel nested transformer built for elastic inference. Think of it like Matryoshka dolls: a larger model contains smaller, fully functional versions of itself. This approach extends the concept of Matryoshka Representation Learning from just embeddings to all transformer components.
During the MatFormer training of the 4B effective parameter (E4B) model, a 2B effective parameter (E2B) sub-model is simultaneously optimized within it, as shown in the figure above. This provides developers two powerful capabilities and use cases today:
1: Pre-extracted models: You can directly download and use either the main E4B model for the highest capabilities, or the standalone E2B sub-model which we have already extracted for you, offering up to 2x faster inference.
2: Custom sizes with Mix-n-Match: For more granular control tailored to specific hardware constraints, you can create a spectrum of custom-sized models between E2B and E4B using a technique we call Mix-n-Match. This technique allows you to precisely slice the E4B model's parameters, primarily by adjusting the feed forward network hidden dimension per layer (from 8192 to 16384) and selectively skipping some layers (a conceptual sketch follows the figure below). We're releasing the MatFormer Lab, a tool that shows how to retrieve these optimal models, which were identified by evaluating various settings on benchmarks like MMLU.

MMLU scores for the pre-trained Gemma 3n checkpoints at different model sizes (using Mix-n-Match)
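To make the Mix-n-Match idea concrete, here is a purely conceptual sketch of what a custom slice might look like and the rough parameter count it implies. The config format, layer count, and model dimension below are hypothetical illustrations, not the MatFormer Lab API.

```python
# Hypothetical Mix-n-Match slice: choose a per-layer feed-forward hidden dimension
# between the E2B (8192) and E4B (16384) settings, and optionally skip some layers.
custom_slice = {
    "ffn_hidden_dim_per_layer": [16384] * 12 + [8192] * 18,  # assumed 30-layer model
    "skip_layers": {24, 27},                                  # layers dropped entirely
}

def approx_ffn_params(cfg: dict, d_model: int = 2048) -> int:
    """Rough FFN parameter count implied by a slice (illustrative math only)."""
    active_dims = [
        d_ff
        for layer, d_ff in enumerate(cfg["ffn_hidden_dim_per_layer"])
        if layer not in cfg["skip_layers"]
    ]
    # A gated feed-forward block has roughly 3 * d_model * d_ff parameters per layer.
    return sum(3 * d_model * d_ff for d_ff in active_dims)

print(f"~{approx_ffn_params(custom_slice) / 1e9:.2f}B feed-forward parameters in this slice")
```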
Looking ahead, the MatFormer architecture also paves the way for elastic execution. While not part of today's launched implementations, this capability allows a single deployed E4B model to dynamically switch between E4B and E2B inference paths on the fly, enabling real-time optimization of performance and memory usage based on the current task and device load.
Per-Layer Embeddings (PLE): Unlocking more memory efficiency
Gemma 3n models incorporate Per-Layer Embeddings (PLE). This innovation is tailored for on-device deployment, as it dramatically improves model quality without increasing the high-speed memory footprint required on your device's accelerator (GPU/TPU).
While the Gemma 3n E2B and E4B models have a total parameter count of 5B and 8B respectively, PLE allows a significant portion of these parameters (the embeddings associated with each layer) to be loaded and computed efficiently on the CPU. This means only the core transformer weights (roughly 2B for E2B and 4B for E4B) need to sit in the typically more constrained accelerator memory (VRAM).

With Per-Layer Embeddings, you can use Gemma 3n E2B while only having ~2B parameters loaded in your accelerator.
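A quick back-of-the-envelope calculation shows what this split means in practice. The byte-per-parameter assumption below (16-bit weights) is illustrative; actual footprints depend on quantization.

```python
# Illustrative memory math for the PLE split described above: total parameters vs.
# the core transformer weights that must live in accelerator memory, with the
# per-layer embeddings offloaded to CPU. Assumes 16-bit weights (2 bytes/param).
BYTES_PER_PARAM = 2

models = {
    "E2B": {"total_params": 5e9, "core_on_accelerator": 2e9},
    "E4B": {"total_params": 8e9, "core_on_accelerator": 4e9},
}

for name, p in models.items():
    offloaded = p["total_params"] - p["core_on_accelerator"]
    vram_gb = p["core_on_accelerator"] * BYTES_PER_PARAM / 1e9
    print(f"{name}: ~{offloaded / 1e9:.0f}B params offloaded via PLE, "
          f"~{vram_gb:.0f} GB of accelerator memory for core weights at 16-bit")
```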
KV Cache sharing: Faster long-context processing
Processing long inputs, such as the sequences derived from audio and video streams, is essential for many advanced on-device multimodal applications. Gemma 3n introduces KV Cache Sharing, a feature designed to significantly speed up time-to-first-token for streaming response applications.
KV Cache Sharing optimizes how the model handles the initial input processing stage (known as the "prefill" phase). The keys and values of the middle layer from local and global attention are directly shared with all the top layers, delivering a notable 2x improvement in prefill performance compared to Gemma 3 4B. This means the model can ingest and understand lengthy prompt sequences much faster than before.
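The toy sketch below illustrates the idea only; it is not the Gemma 3n implementation, and the layer count and tensor shapes are made up. Layers above a chosen sharing point simply reuse the keys and values produced by the middle layer instead of computing their own during prefill.

```python
# Conceptual illustration of KV cache sharing during prefill (not the real model):
# lower layers compute and store their own keys/values, while upper layers reuse
# references to the middle layer's cache, cutting prefill work for those layers.
import torch

num_layers, share_from = 12, 6             # hypothetical layer count and sharing point
seq_len, num_heads, head_dim = 128, 8, 64  # made-up prefill shapes

kv_cache = []
for layer in range(num_layers):
    if layer < share_from:
        # Stand-ins for the real key/value projections of this layer.
        k = torch.randn(seq_len, num_heads, head_dim)
        v = torch.randn(seq_len, num_heads, head_dim)
        kv_cache.append((k, v))
    else:
        # Top layers share the middle layer's keys and values directly.
        kv_cache.append(kv_cache[share_from - 1])
```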
Audio understanding: Introducing speech to text and translation
Gemma 3n uses an advanced audio encoder based on the Universal Speech Model (USM). The encoder generates a token for every 160ms of audio (about 6 tokens per second), which are then integrated as input to the language model, providing a granular representation of the sound context.
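The token rate translates directly into sequence length: one token per 160ms works out to roughly 6 tokens per second, or about 187 tokens for a 30-second clip.

```python
# Simple arithmetic from the paragraph above: one audio token per 160 ms of input.
def audio_tokens(duration_seconds: float, ms_per_token: float = 160.0) -> int:
    return int(duration_seconds * 1000 / ms_per_token)

print(audio_tokens(1.0))   # 6 tokens for one second of audio
print(audio_tokens(30.0))  # 187 tokens for a 30-second clip
```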
This integrated audio capability unlocks key features for on-device development, including:
- Automatic Speech Recognition (ASR): Enable high-quality speech-to-text transcription directly on the device.
- Automatic Speech Translation (AST): Translate spoken language into text in another language.
We have observed particularly strong AST results for translation between English and Spanish, French, Italian, and Portuguese, offering great potential for developers targeting applications in these languages. For tasks like speech translation, leveraging Chain-of-Thought prompting can significantly enhance results. Here’s an example:
user
Transcribe the following speech segment in Spanish, then translate it into English:
model
At launch time, the Gemma 3n encoder is implemented to process audio clips up to 30 seconds. However, this is not a fundamental limitation. The underlying audio encoder is a streaming encoder, capable of processing arbitrarily long audio with additional long-form audio training. Follow-up implementations will unlock low-latency, long streaming applications.
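Until then, one practical workaround is to chunk longer recordings into clips under the 30-second limit and process them sequentially. The sketch below assumes 16 kHz mono audio in a NumPy array, and transcribe_chunk is a hypothetical callable wrapping your Gemma 3n ASR call.

```python
# Split a long recording into <=30-second chunks and transcribe each in turn.
# Assumes 16 kHz mono PCM audio; `transcribe_chunk` is a hypothetical stand-in
# for whatever Gemma 3n inference path (e.g. a pipeline call) you are using.
import numpy as np

SAMPLE_RATE = 16_000
MAX_CLIP_SECONDS = 30

def split_audio(waveform: np.ndarray, sample_rate: int = SAMPLE_RATE) -> list[np.ndarray]:
    chunk_len = MAX_CLIP_SECONDS * sample_rate
    return [waveform[i:i + chunk_len] for i in range(0, len(waveform), chunk_len)]

def transcribe_long_audio(waveform: np.ndarray, transcribe_chunk) -> str:
    # Concatenate per-chunk transcripts; boundary words may need light cleanup.
    return " ".join(transcribe_chunk(chunk) for chunk in split_audio(waveform))
```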
MobileNet-V5: New state-of-the-art vision encoder
Alongside its integrated audio capabilities, Gemma 3n features a new, highly efficient vision encoder, MobileNet-V5-300M, delivering state-of-the-art performance for multimodal tasks on edge devices.
Designed for flexibility and power on constrained hardware, MobileNet-V5 gives developers:
- Multiple input resolutions: Natively supports resolutions of 256×256, 512×512, and 768×768 pixels, allowing you to balance performance and detail for your specific applications.
- Broad visual understanding: Co-trained on extensive multimodal datasets, it excels at a wide range of image and video comprehension tasks.
- High throughput: Processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences.
This level of performance is achieved with multiple architectural innovations, including:
- A sophisticated foundation of MobileNet-V4 blocks (including Universal Inverted Bottlenecks and Mobile MQA).
- A significantly scaled up architecture, featuring a hybrid, deep pyramid model that is 10x larger than the largest MobileNet-V4 variant.
- A novel Multi-Scale Fusion VLM adapter that enhances the quality of tokens for greater accuracy and efficiency.
Benefiting from novel architectural designs and advanced distillation techniques, MobileNet-V5-300M substantially outperforms the baseline SoViT in Gemma 3 (trained with SigLip, no distillation). On a Google Pixel Edge TPU, it delivers a 13x speedup with quantization (6.5x without), requires 46% fewer parameters, and has a 4x smaller memory footprint, all while providing significantly higher accuracy on vision-language tasks.
We're excited to share more about the work behind this model. Look out for our upcoming MobileNet-V5 technical report, which will deep dive into the model architecture, data scaling strategies, and advanced distillation techniques.
Making Gemma 3n accessible from day one has been a priority. We're proud to partner with many incredible open source developers to ensure broad support across popular tools and platforms, including contributions from teams behind AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM.
But this ecosystem is just the beginning. The true power of this technology is in what you'll build with it. That's why we're launching the Gemma 3n Impact Challenge. Your mission: use Gemma 3n's unique on-device, offline, and multimodal capabilities to build a product for a better world. With $150,000 in prizes, we're looking for a compelling video story and a "wow" factor demo that shows real-world impact. Join the challenge and help build a better future.
Get started with Gemma 3n today
Ready to explore the potential of Gemma 3n today? Here's how:
- Experiment directly: Use Google AI Studio to try Gemma 3n in just a few clicks. Gemma models can also be deployed directly to Cloud Run from AI Studio.
- Learn & integrate: Dive into our comprehensive documentation to quickly integrate Gemma into your projects, or start with our inference and fine-tuning guides; a minimal local-inference sketch follows below.
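For a quick local test, here is a minimal sketch using the Hugging Face Transformers pipeline API. The checkpoint ID and image URL are assumptions for illustration; check the Gemma 3n model cards on Hugging Face for the exact IDs and recommended usage.

```python
# Minimal multimodal inference sketch with the Transformers pipeline API.
# The checkpoint ID below is an assumption; E4B and E2B variants are both released.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E2B-it",  # assumed Hugging Face model ID
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # the model's reply
```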
