Bringing AI Closer to the Edge and On-Device with Gemma 4



The Gemmaverse expands with the launch of the newest Gemma 4 multimodal and multilingual models, designed to scale across the full spectrum of deployments, from NVIDIA Blackwell in the data center to Jetson at the edge. These models are suited to meet the growing demand for local deployment for AI development and prototyping, secure on-prem requirements, cost efficiency, and latency-sensitive use cases. The latest generation improves both efficiency and accuracy, making these general-purpose models well-suited for a wide range of common tasks:

  • Reasoning: Strong performance on complex problem-solving tasks.
  • Coding: Code generation and debugging for developer workflows.
  • Agents: Native support for structured tool use (function calling). 
  • Vision, video, and audio capability: Enables rich multimodal interactions for use cases such as object recognition, automated speech recognition (ASR), document and video intelligence, and more.
  • Interleaved multimodal input: Freely mix text and pictures in any order inside a single prompt. 
  • Multilingual: Out-of-the-box support for over 35 languages, and pre-trained on over 140 languages. 
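As a concrete illustration of the structured tool-use support above, here is a minimal sketch of an OpenAI-style function-calling request body, the format accepted by the OpenAI-compatible endpoints in vLLM and Ollama. The tool schema and model id are illustrative assumptions, not an official Gemma 4 API:

```python
import json

# Hypothetical tool schema in the common OpenAI-style "tools" format;
# the function name and fields below are illustrative.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body a client would POST to an OpenAI-compatible
# /v1/chat/completions endpoint (as served by vLLM or Ollama).
request_body = {
    "model": "gemma-4-26b-a4b",  # illustrative model id
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [get_weather],
    "tool_choice": "auto",
}

print(json.dumps(request_body)[:60])
```

With `tool_choice` set to `auto`, the model decides whether to answer directly or emit a structured call to the declared tool.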

The release includes four models, including Gemma’s first MoE model; each fits on a single NVIDIA H100 GPU and supports over 140 languages. The 31B and 26B-A4B variants are high-performing reasoning models suited to both local and data center environments. The E4B and E2B are the latest edition of the on-device and mobile-focused designs first launched with Gemma 3n.

| Model Name | Architecture Type | Total Parameters | Active or Effective Parameters | Input Context Length (Tokens) | Sliding Window (Tokens) | Modalities |
|---|---|---|---|---|---|---|
| Gemma-4-31B | Dense Transformer | 31B | — | 256K | 1024 | |
| Gemma-4-26B-A4B | MoE – 128 Experts | 26B | 3.8B | 256K | — | |
| Gemma-4-E4B | Dense Transformer | 7.9B with embeddings | 4.5B effective | 128K | 512 | Text, Audio, Vision, Video |
| Gemma-4-E2B | Dense Transformer | 5.1B with embeddings | 2.3B effective | 128K | 512 | Text, Audio, Vision, Video |

Table 1. Overview of the Gemma 4 model family, summarizing architecture types, parameter sizes, effective parameters, supported context lengths, and available modalities to help developers select the right model for data center, edge, and on-device deployments.
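The single-H100 claim is easy to sanity-check with a back-of-envelope estimate: BF16 stores 2 bytes per parameter, so even the largest model's weights sit well under an H100's 80 GB (KV cache and activations add overhead on top). A quick sketch:

```python
def bf16_weight_gib(params_billion: float) -> float:
    """Approximate weight memory in GiB at BF16 (2 bytes per parameter)."""
    return params_billion * 1e9 * 2 / 2**30

# Parameter counts taken from Table 1 above.
for name, params in [("Gemma-4-31B", 31), ("Gemma-4-26B-A4B", 26),
                     ("Gemma-4-E4B", 7.9), ("Gemma-4-E2B", 5.1)]:
    print(f"{name}: ~{bf16_weight_gib(params):.1f} GiB of BF16 weights")

# Gemma-4-31B comes to roughly 57.7 GiB of weights, comfortably under
# 80 GB before KV cache and activation overhead.
```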

Each model is available on Hugging Face with BF16 checkpoints, and an NVFP4 quantized checkpoint for Gemma-4-31B will be available soon for NVIDIA Blackwell developers.

Run intelligent workloads on-device 

As AI workflows and agents become more integrated into everyday applications, the ability to run these models beyond traditional data center environments is becoming critical. The NVIDIA suite of client and edge systems, from RTX GPUs and DGX Spark to Jetson Nano, gives developers the flexibility to manage cost and latency while supporting security requirements for highly regulated industries such as healthcare and finance.

We collaborated with vLLM, Ollama, and llama.cpp to deliver the best local deployment experience for each of the Gemma 4 models. Unsloth also provides day-one support with optimized and quantized models for efficient local deployment via Unsloth Studio.

Check out the RTX AI Garage blog post to get started with Gemma 4 on RTX GPUs and DGX Spark.

| | DGX Spark | Jetson | RTX / RTX PRO |
|---|---|---|---|
| Use Case | AI research and prototyping | Edge AI and robotics | Desktop apps and Windows development |
| Key Highlights | A preinstalled NVIDIA AI software stack and 128 GB of unified memory power local prototyping, fine-tuning, and fully local OpenClaw workflows | Near-zero latency due to architecture features such as conditional parameter loading and per-layer embeddings, which can be cached for faster inference and reduced memory use (more info) | Optimized performance for local inference for hobbyists, creators, and professionals |
| Getting Started Guide | DGX Spark Playbooks for vLLM, Ollama, Unsloth, and llama.cpp deployment guides; NeMo Automodel for fine-tuning on Spark guide | Jetson AI Lab for tutorials and custom Gemma containers | RTX AI Garage for Ollama and llama.cpp guides. RTX PRO owners can use vLLM as well. |

Table 2. Comparison of local deployment options across NVIDIA platforms, highlighting primary use cases, key capabilities, and recommended getting-started resources for DGX Spark, Jetson, and RTX / RTX PRO systems running Gemma 4 models.

Build secure agentic AI workflows with DGX Spark

AI developers and enthusiasts benefit from the GB10 Grace Blackwell Superchip paired with 128 GB of unified memory in DGX Spark, providing the resources needed to run Gemma 4 31B with BF16 model weights. Combined with DGX Linux OS and the full NVIDIA software stack, developers can efficiently prototype and build agentic AI workflows with Gemma 4 while maintaining private, secure on-device execution.

The vLLM inference engine is designed to run LLMs efficiently, maximizing throughput while minimizing memory usage. Using vLLM high-throughput LLM serving on DGX Spark provides a high-performance platform for the largest Gemma 4 models; the vLLM for Inference DGX Spark playbook provides the details to get vLLM running with Gemma 4 on your DGX Spark. Alternatively, get started with Gemma 4 using Ollama or llama.cpp. Users can further fine-tune the models on DGX Spark with NeMo Automodel.
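vLLM's memory efficiency comes largely from PagedAttention, which allocates the KV cache in fixed-size blocks as tokens are generated instead of reserving the full context window per sequence up front. The core idea can be sketched in a few lines (the block size and token counts below are illustrative, not vLLM's actual configuration):

```python
BLOCK_TOKENS = 16  # tokens per KV-cache block (illustrative)

def blocks_needed(seq_len: int) -> int:
    """KV-cache blocks required for a sequence under paged allocation."""
    return -(-seq_len // BLOCK_TOKENS)  # ceiling division

# Naive allocation reserves the full context window for every sequence;
# paged allocation grows with the tokens actually in the sequence.
context_window = 256_000   # Gemma 4 31B context length from Table 1
actual_tokens = 1_200      # a typical short chat exchange

naive = blocks_needed(context_window)
paged = blocks_needed(actual_tokens)
print(paged, naive)  # 75 blocks vs 16000 blocks
```

The freed blocks are what let vLLM batch many more concurrent sequences on the same GPU memory.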

Power physical AI agents with Jetson  

Modern physical AI agents are evolving rapidly with Gemma 4 models that integrate audio, multimodal perception, and deep reasoning capabilities. These advanced models enable robotics systems to move beyond simple task execution, allowing them to understand speech, interpret visual context, and reason intelligently before taking action. On NVIDIA Jetson, developers can run Gemma 4 inference at the edge using llama.cpp and vLLM. Jetson Orin Nano supports the Gemma 4 E2B and E4B variants, enabling multimodal inference on small, embedded, and power-constrained systems, with the same model family scaling across the Jetson platform up to Jetson Thor.

This supports scalable deployment across robotics, smart machines, and industrial automation use cases that depend on low-latency performance and on-device intelligence.

Jetson developers can check out the tutorial and download the container to get started from the Jetson AI Lab.

Production-ready deployment with NVIDIA NIM

Enterprise developers can try the Gemma 4 31B model for free using an NVIDIA-hosted NIM API available in the NVIDIA API catalog for prototyping. For production deployment, they can use prepackaged and optimized NIM microservices for secure, self-hosted deployment with an NVIDIA Enterprise License.

Day 0 fine-tuning with NeMo Framework 

Developers can customize Gemma 4 with their own domain data using the NVIDIA NeMo framework, specifically the NeMo Automodel library, which combines native PyTorch ease of use with optimized performance. Using this fine-tuning recipe for Gemma 4, developers can apply techniques such as supervised fine-tuning (SFT) and memory-efficient LoRA to perform day-0 fine-tuning directly from Hugging Face model checkpoints without the need for conversion.
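LoRA's memory efficiency comes from freezing the pretrained weights and training only a low-rank update to each weight matrix. A minimal NumPy sketch of the forward pass (the hidden size, rank, and scaling factor are illustrative, not the recipe's actual hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                           # hidden size and LoRA rank (illustrative)
alpha = 16                              # LoRA scaling factor

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = x W^T + (alpha/r) * x A^T B^T; only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
# With B zero-initialized, the adapter starts as an exact no-op,
# so fine-tuning begins from the pretrained model's behavior.
assert np.allclose(lora_forward(x), x @ W.T)

# Trainable parameters shrink from d*d to 2*d*r per adapted matrix:
print(d * d, 2 * d * r)  # 262144 vs 8192
```

This is why LoRA fits on memory-constrained hardware: optimizer state is kept only for the small A and B matrices, not the full weight.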

Start today 

No matter which NVIDIA GPU you are using, Gemma 4 is supported across the entire NVIDIA AI platform and is available under the commercial-friendly Apache 2.0 license. From Blackwell, with NVFP4 quantized checkpoints coming soon, to Jetson platforms, developers can quickly start deploying these high-accuracy multimodal models, with the flexibility to meet their speed, security, and cost requirements.

Try Gemma on Hugging Face, or test Gemma 4 31B for free using NVIDIA APIs at build.nvidia.com.


