Frontier multimodal intelligence on device




The Gemma 4 family of multimodal models by Google DeepMind is out on Hugging Face, with support for your favorite agents, inference engines, and fine-tuning libraries 🤗

These models are the real deal: truly open with Apache 2.0 licenses, top quality with Pareto-frontier arena scores, multimodal including audio, and available in sizes you can use everywhere, including on-device. Gemma 4 builds on advances from previous families and makes them click together. In our tests with pre-release checkpoints we have been impressed by their capabilities, to the extent that we struggled to find good fine-tuning examples because they’re so good out of the box.

We collaborated with Google and the community to make them available everywhere: transformers, llama.cpp, MLX, WebGPU, Rust; you name it. This blog post will show you how to build with your favorite tools, so let us know what you think!




Like Gemma-3n, Gemma 4 supports image, text, and audio inputs, and generates text responses. The text decoder is based on the Gemma architecture with support for long context windows. The image encoder is similar to the one from Gemma 3, but with two key improvements: variable aspect ratios, and a configurable number of image tokens so you can find your sweet spot between speed, memory, and quality. All models support image (or video) and text inputs, while the small variants (E2B and E4B) support audio as well.
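As a back-of-the-envelope illustration (our own arithmetic, not an official API), the configurable image token budget trades image fidelity against the share of the context window left for text:

```python
# Image token budgets supported by the Gemma 4 vision encoder.
BUDGETS = (70, 140, 280, 560, 1120)

def remaining_text_tokens(context_window, n_images, image_budget):
    """Rough number of tokens left for text after encoding n_images at a
    given budget (ignores special/delimiter tokens, so treat as an estimate)."""
    if image_budget not in BUDGETS:
        raise ValueError(f"budget must be one of {BUDGETS}")
    return context_window - n_images * image_budget

# 20 images at the highest-fidelity setting inside a 128k window:
print(remaining_text_tokens(128_000, 20, 1120))  # 105600
```

Dropping to the smallest budget (70 tokens per image) frees up most of the window again, which is the kind of speed/memory/quality trade-off the encoder is designed for.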

Gemma 4 comes in four sizes, all with base and instruction-tuned checkpoints:

| Model | Parameter Size | Context Window | Checkpoints |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective, 5.1B with embeddings | 128k | base, IT |
| Gemma 4 E4B | 4.5B effective, 8B with embeddings | 128k | base, IT |
| Gemma 4 31B | 31B dense model | 256k | base, IT |
| Gemma 4 26B A4B | mixture-of-experts with 4B activated / 26B total parameters | 256k | base, IT |



Overview of Capabilities and Architecture

Gemma 4 leverages several architecture components used in previous Gemma versions and other open models, and leaves out complex or inconclusive features such as AltUp. The result is a combination designed to be highly compatible across libraries and devices, that can efficiently support long context and agentic use cases, while being ideal for quantization.

With this feature mix (and the undisclosed training data and recipe), the 31B dense model achieves an estimated LMArena score (text only) of 1452, while the 26B MoE reaches 1441 with just 4B active parameters 🤯. To put this in context, these scores are roughly the same as the recent GLM-5 or Kimi K2.5, but with ~30 times fewer parameters. As we’ll see, multimodal performance is roughly as good as text generation, at least in informal and subjective tests.

These are the main architecture characteristics of Gemma 4:

  • Alternating local sliding-window and global full-context attention layers. Smaller dense models use sliding windows of 512 tokens while larger models use 1024 tokens.
  • Dual RoPE configurations: standard RoPE for sliding layers, proportional RoPE for global layers, to enable longer context.
  • Per-Layer Embeddings (PLE): a second embedding table that feeds a small residual signal into every decoder layer.
  • Shared KV Cache: the last N layers of the model reuse key-value states from earlier layers, eliminating redundant KV projections.
  • Vision encoder: uses learned 2D positions and multidimensional RoPE. It preserves the original aspect ratios and can encode images at a few different token budgets (70, 140, 280, 560, 1120).
  • Audio encoder: USM-style conformer with the same base architecture as the one in Gemma-3n.



Per-Layer Embeddings (PLE)

One of the most distinctive features in smaller Gemma 4 models is Per-Layer Embeddings (PLE), which was introduced previously in Gemma-3n. In a typical transformer, each token gets a single embedding vector at input, and that same initial representation is what the residual stream builds on across all layers, forcing the embedding to frontload everything the model might need. PLE adds a parallel, lower-dimensional conditioning pathway alongside the main residual stream. For each token, it produces a small dedicated vector for each layer by combining two signals: a token-identity component (from an embedding lookup) and a context-aware component (from a learned projection of the main embeddings). Each decoder layer then uses its corresponding vector to modulate the hidden states via a lightweight residual block after attention and feed-forward. This gives each layer its own channel to receive token-specific information only when it becomes relevant, rather than requiring everything to be packed into a single upfront embedding. Because the PLE dimension is much smaller than the main hidden size, this adds meaningful per-layer specialization at modest parameter cost. For multimodal inputs (images, audio, video), PLE is computed before soft tokens are merged into the embedding sequence, since PLE relies on token IDs that are lost once multimodal features replace the placeholders. Multimodal positions use the pad token ID, effectively receiving neutral per-layer signals.
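To make the mechanism concrete, here is a minimal, illustrative sketch of the PLE pathway in NumPy. All names and dimensions are ours, and the real implementation differs in details (normalization, gating, where the residual lands); this only shows the two-component per-layer vector and the per-layer residual injection.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, ple_dim, n_layers, seq = 100, 64, 8, 4, 5

# Standard input embedding table (feeds the main residual stream).
embed = rng.normal(size=(vocab, hidden))
# Per-layer embedding table: one small vector per token per layer.
ple_table = rng.normal(size=(vocab, n_layers, ple_dim))
# Learned projection from the main embedding into the PLE space.
ple_proj = rng.normal(size=(hidden, n_layers, ple_dim)) * 0.01
# Per-layer up-projection back to the hidden size.
ple_up = rng.normal(size=(n_layers, ple_dim, hidden)) * 0.01

token_ids = rng.integers(0, vocab, size=seq)
h = embed[token_ids]  # (seq, hidden): the main residual stream

# Token-identity component + context-aware component, for every layer at once.
ple = ple_table[token_ids] + np.einsum("sh,hld->sld", h, ple_proj)  # (seq, n_layers, ple_dim)

for layer in range(n_layers):
    # ... attention and feed-forward would update h here ...
    # Lightweight residual conditioning from this layer's dedicated PLE vector.
    h = h + ple[:, layer] @ ple_up[layer]

print(h.shape)  # (5, 64)
```

Note how the PLE table costs `vocab × n_layers × ple_dim` parameters, far less than scaling the main embedding, which is why it is attractive for the on-device variants.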



Shared KV Cache

The shared KV cache is an efficiency optimization that reduces both compute and memory during inference. The last num_kv_shared_layers layers of the model don’t compute their own key and value projections. Instead, they reuse the K and V tensors from the last non-shared layer of the same attention type (sliding or full).

In practice, this has a minimal impact on quality while being far more efficient (in terms of both memory and compute) for long-context generation and on-device use.
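The reuse pattern can be sketched in a few lines of Python. The layer layout below is made up for illustration, not the exact configuration of any checkpoint; the point is only the rule "each shared layer reuses the KV of the last non-shared layer with the same attention type":

```python
def kv_sharing_map(layer_types, num_kv_shared_layers):
    """For each of the last `num_kv_shared_layers` layers, find the source
    layer whose K/V tensors it reuses: the last non-shared layer of the
    same attention type ("sliding" or "full")."""
    n = len(layer_types)
    first_shared = n - num_kv_shared_layers
    # Record the last non-shared layer index for each attention type.
    last_of_type = {}
    for i in range(first_shared):
        last_of_type[layer_types[i]] = i
    return {i: last_of_type[layer_types[i]] for i in range(first_shared, n)}

# Hypothetical 8-layer pattern alternating sliding/full, last 4 layers shared.
pattern = ["sliding", "full"] * 4
print(kv_sharing_map(pattern, 4))
# {4: 2, 5: 3, 6: 2, 7: 3}
```

In this toy layout, only layers 0–3 populate the KV cache, so the cache (and the KV projection compute) is halved.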



Multimodal Capabilities

We saw in our tests that Gemma 4 supports comprehensive multimodal capabilities out of the box. We don’t know the training mix, but we had success using it for tasks such as OCR, speech-to-text, object detection, and pointing. It also supports text-only and multimodal function calling, reasoning, and code completion and correction.

Here, we show a few inference examples across different model sizes. You can run them conveniently with this notebook. We encourage you to try the demos and share your results below this post!



Object Detection and Pointing



GUI detection

We test Gemma 4 on GUI element detection and pointing across different sizes, with the following image and text prompt: “What is the bounding box for the ‘view recipe’ element in the image?”

Image

With this prompt, the model natively responds in JSON format with the detected bounding boxes – no need for specific instructions or grammar-constrained generation. We found the coordinates refer to an image size of 1000×1000, relative to the input dimensions.
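To overlay boxes on the original image, the 1000-normalized coordinates need rescaling to pixel space. A minimal sketch, assuming the `[y_min, x_min, y_max, x_max]` ordering used by earlier Gemma releases (verify the order against your own outputs):

```python
import json

def to_pixel_boxes(model_json, width, height):
    """Convert boxes normalized to a 1000x1000 grid into pixel coordinates.
    Assumes box_2d = [y_min, x_min, y_max, x_max]; check this ordering
    against your own model outputs before relying on it."""
    boxes = []
    for det in json.loads(model_json):
        y0, x0, y1, x1 = det["box_2d"]
        boxes.append({
            "label": det["label"],
            # (x0, y0, x1, y1) in pixels, ready for PIL's ImageDraw.rectangle.
            "xyxy": (x0 / 1000 * width, y0 / 1000 * height,
                     x1 / 1000 * width, y1 / 1000 * height),
        })
    return boxes

raw = '[{"box_2d": [171, 75, 245, 308], "label": "view recipe element"}]'
print(to_pixel_boxes(raw, width=1920, height=1080))
```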

We visualize the outputs below for your convenience. We parse the bounding boxes from the returned JSON: `[{"box_2d": [171, 75, 245, 308], "label": "view recipe element"}]`



Object Detection

We test the models on detecting everyday objects; here we ask them to detect the bike and compare outputs across models. As in the previous case, we parse the bounding box from the JSON and translate it to image-space coordinates.



Multimodal Considering and Function Calling

We asked Gemma 4 to write HTML code to reconstruct a page we made with Gemini 3. Below you’ll find the code to do this; we enable thinking and let each model generate up to 4000 new tokens so the output isn’t truncated.

Gemini generated website (reference) and the reproduced image (shown below).
Inference code
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/landing_page.png",
            },
            {"type": "text", "text": "Write HTML code for this page."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=4000)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])
Outputs from E2B, E4B, 31B, and 26B A4B (images).



Video Understanding

Smaller Gemma 4 models can take in videos with audio, while larger ones take in videos without audio. While the models are not explicitly post-trained on videos, they can understand videos both with and without audio. The models are particularly strong at audio understanding.

Inference code
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"},
            {"type": "text", "text": "What is happening in the video? What is the song about?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    load_audio_from_video=True, 
).to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])
Model Output
E2B (hallucinated for the audio) Based on the video provided, here is an analysis: What is happening in this video? The video captures a live musical performance, likely at a large outdoor festival or concert. * Setting: The scene is a large stage set up outdoors, indicated by the extensive lighting rig, large speakers, and the audience visible in the foreground. * Performers: There are several musicians on stage. The focus appears to be a lead singer/guitarist in a blue shirt and white pants who is actively performing, playing an acoustic or electric guitar. There are also other musicians, including drummers and bassists, visible in the background. * Atmosphere: The atmosphere is energetic and dramatic, characterized by bright stage lights, intense blue and white spotlights, and a large amount of stage fog/smoke, which adds to the concert mood. * Audience: The foreground is filled with the backs of audience members, showing they are close to the stage and engaged in watching the performance. What is the song about? The lyrics provided below the video are: > “I don’t care ’bout street > From what I cherish > Could it be time to protest
E4B ✅ The video captures a live performance of a band on a stage with bright lights and smoke effects, surrounded by an audience. The band performs and sings to the crowd. The song appears to be about the struggles and disillusionment of modern life, specifically the feeling of being stuck and the difficulty of finding one’s place in a chaotic and overwhelming world. Lyrics mention themes of the “concrete street,” “broken promises,” “storm of the day,” and the difficulty of finding oneself amidst societal pressures.
26/A4B (no audio) ✅ This video shows a live music performance at an outdoor concert or festival at night. A solo male artist, wearing a blue shirt and white pants, is the central focus. He is playing an acoustic guitar and singing into a microphone on a large, lit stage. The performance is accompanied by a band, including a drummer and other musicians visible in the background and on large video screens. The stage is filled with atmospheric blue lighting, smoke, and bright spotlights. The video captures different perspectives, including shots from the crowd looking up at the stage and close-up views of the performers projected onto large screens. The crowd is visible in the foreground, watching the show. One of the screens also displays the “Estrella Damm” brand name.
31B (no audio) ✅ A musician is performing on stage during a concert, playing an acoustic guitar and singing into a microphone. He is wearing a blue shirt and white pants. The stage is filled with atmospheric smoke and illuminated by bright blue and white lights. Other band members, including a drummer and a keyboardist, are also visible on stage. The video shows the musician from the perspective of the audience, with some shots focusing on him and others showing the large screen on the side of the stage, which displays close-ups of the performer and the crowd.



Captioning

We’ve tested all models on captioning. All checkpoints perform very well and accurately capture nuance in complex scenarios.
Here’s the image we use, with the prompt “Write single detailed caption for this image.”.

image

Inference code
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/bird.png"},
            {"type": "text", "text": "Write single detailed caption for this image."},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])
Model Output
E2B ✅ A medium shot captures a weathered seagull perched atop a stone pedestal in what appears to be a bustling European square, with a grand, classical-style building featuring ornate columns and architectural details dominating the right side of the frame. In the background, a low, light-colored building stretches across the middle ground, flanked by some green foliage, and hints of other structures, including a terracotta-roofed building, are visible. In the foreground, part of a modern bus stop or shelter is visible, with signage indicating routes, and the overall scene is bathed in bright daylight under a pale sky.
E4B ✅ A solitary seagull perches atop a gray stone pedestal in a bustling European piazza, with the historical grandeur of a classical building dominating the right side of the frame, contrasting with the more modern, low-rise architecture in the background under a bright, hazy sky.
26B/A4B ✅ A full shot shows a white and brown seagull perched on a gray pillar in Venice, Italy. The seagull faces left, its beak pointed towards the edge of the frame. Behind it is a light gray, two-story building with several dark rectangular windows. To the right of the seagull is a large, white building with columns and intricate carvings. The sky is a clear, light blue. The lighting is bright and even, suggesting it is daytime.
31B ✅ A high-angle, medium shot shows a white and gray seagull perched on a gray concrete pylon in Venice, Italy. The seagull is facing left, its body angled slightly towards the camera. Its white head, neck, and chest contrast with its gray wings and back. The seagull’s orange beak is slightly open. The pylon is a light gray concrete cylinder with a weathered texture. In the background, there are several buildings. To the left is a long, low, white building with a series of small, rectangular windows. In the center is a cluster of green trees. To the right is a red building with a tiled roof and several small, rectangular windows. Further right is a large, ornate, white stone building with several columns and sculptures. The sky is a pale blue, and the light is bright and clear. In the lower-left corner, there is a yellow bus with the word “Ferrovia” printed on it in black letters. There are several people walking in the background, but they are small and out of focus.



Audio Query Answering

These models are trained to answer questions about speech in audio. Music and non-speech sounds were not part of the training data.

Inference code
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"},
            {"type": "text", "text": "Can you describe this audio in detail?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)

print(processor.decode(output[0], skip_special_tokens=True))
Model Output
E2B This audio is a personal reflection. The speaker is talking about their final farewell address to the nation, which they delivered in Chicago. They express gratitude for the conversations they have had with the American people, noting that whether or not they always agreed, these interactions in various settings like living rooms, schools, farms, factory floors, diners, and military outposts have been what has kept them going.
E4B The audio is a speech excerpt where a speaker is delivering a farewell address to the nation from Chicago. The speaker reflects on their time in office, expressing gratitude for the conversations they had with the American people across various settings like living rooms, schools, farms, factories, diners, and military outposts. The tone is reflective and appreciative, highlighting the importance of those interactions in their political journey.

Here is an example if you want to do transcription:

Inference code
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"},
            {"type": "text", "text": "Transcribe the audio?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)

print(processor.decode(output[0], skip_special_tokens=True))
Model Output
E2B This week I traveled to Chicago to deliver my final farewell address to the nation following in the tradition of presidents before me It was an opportunity to say thank you whether we have seen eye to eye or rarely agreed at all my conversations with you the American people in living rooms and schools at farms and on factory floors at diners and on distant military outposts all these conversations are what have kept me honest
E4B This week I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we have seen eye to eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, all these conversations are what have kept me honest.



Multimodal Function Calling

We test the model by asking it to get the weather in the place shown in the image.

Inference code
import re 
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Gets the current weather for a specific location.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"},
            },
            "required": ["city"],
        },
    },
}
tools = [WEATHER_TOOL]
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/thailand.jpg"},
        {"type": "text", "text": "What is the city in this image? Check the weather there right now."},
    ]},
]
inputs = processor.apply_chat_template(
    messages,
    tools=[WEATHER_TOOL],
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=1000)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])
Model Output
E2B 1. Analyze the Request: The user has provided an image and asked two questions: “What is the city in this image?” and “Check the weather there right now.” 2. Analyze the Image: The image displays a large, ornate, tall structure, characteristic of temple architecture, likely in Southeast Asia (given the style). It appears to be a famous landmark. 3. Identify the Landmark/Location (Internal Knowledge/Visual Cues): The structure strongly resembles the chedis/stupas found in places like Bangkok, Thailand (e.g., Wat Arun or similar structures). 4. Determine the City: Based on the visual evidence, the location is highly likely to be Bangkok, Thailand. 5. Check Available Tools: The available tool is get_weather(city: STRING). 6. Formulate the Plan: Step 1: Identify the city from the image (Bangkok). Step 2: Use the get_weather tool with the identified city. 7. Construct the Tool Call: get_weather(city="Bangkok") 8. Generate Response: The model should first state the identified city and then call the tool to get the weather. (Self-correction: The model must call the tool in the correct format if it needs external information.) call:get_weather{city:Bangkok}
E4B 1. Analyze the Request: The user has provided an image and asked two things: a. Identify the city in the image. b. Check the weather in that city right now. 2. Analyze the Image: The image shows a very distinctive, ornate tower/spire structure, characteristic of Thai architecture (specifically, a chedi or prang). 3. Determine City Identification (Internal Knowledge/Visual Recognition): Based on the architectural style, the structure strongly resembles prominent temples in Thailand, such as Wat Arun in Bangkok. 4. Evaluate Tool Availability: I have one available tool: get_weather(city: STRING). This tool requires a city name as input. 5. Formulate a Plan: a. I need the city name to use the get_weather tool. b. I need to first identify the city from the image. c. Since my internal knowledge suggests Bangkok, I will proceed with that assumption, but I need to acknowledge that I cannot definitively prove the location from the image alone. 6. Execute Step 1 (City Identification): Assume the city is Bangkok based on the visual evidence of the architecture. 7. Execute Step 2 (Tool Call Generation): Use the get_weather tool with “Bangkok” as the city argument. 8. Construct the Response: Generate the appropriate tool call in JSON format. call:get_weather
31B The user wants to know the city in the image and the current weather there. 1. Analyze the image: The image shows ornate, spire-like structures (prangs) characteristic of Thai temple architecture. Specifically, these look like the prangs at Wat Phra Kaew (Temple of the Emerald Buddha) or Wat Pho in Bangkok, Thailand. The style is very distinctive of Bangkok’s royal temple complexes. 2. Identify the city: Based on the architectural style, the city is Bangkok. 3. Find the weather: I need to call the get_weather tool with city="Bangkok". call:get_weather{city:Bangkok}
26/A4B The image shows several tall, ornate towers (prangs) typical of Thai Buddhist temple architecture. Specifically, these look like the structures within the Grand Palace or Wat Phra Kaew in Bangkok, Thailand. The style is unmistakably Thai; these are very iconic towers. 2. Confirm the location: The image shows elements of the Grand Palace complex in Bangkok. 3. Get weather information: Once the city is identified as Bangkok, I will use the get_weather tool with city='Bangkok'. call:get_weather{city:Bangkok}

Gemma 4 comes with day-0 support for many open-source inference engines, and is ideal for tool calling and agents! We also release ONNX checkpoints that can run on many hardware backends, enabling use cases on edge devices or in the browser!



transformers

Gemma 4 comes with first-class transformers support from the get-go 🤗. This integration allows using the model with other libraries like bitsandbytes, PEFT, and TRL. Make sure to install the latest version of transformers.

pip install -U transformers

The easiest way to run inference with the small Gemma 4 models is through the any-to-any pipeline. You can initialize it as follows.

from transformers import pipeline
pipe = pipeline("any-to-any", model="google/gemma-4-e2b-it")

You can then pass in images and text as follows.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/thailand.jpg",
            },
            {"type": "text", "text": "Do you have travel advice going to here?"},
        ],
    }
]
output = pipe(messages, max_new_tokens=100, return_full_text=False)
output[0]["generated_text"]

When running inference on videos, you can include the audio track using the load_audio_from_video argument.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/rockets.mp4",
            },
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]
pipe(messages, load_audio_from_video=True)

Going a level lower, you can load Gemma 4 using the AutoModelForMultimodalLM class, which is especially useful for fine-tuning. The built-in chat template takes care of formatting the inputs appropriately; please make sure to use it to prevent the subtle mistakes that come from constructing the prompt manually.

Inference code
from transformers import AutoModelForMultimodalLM, AutoProcessor
model = AutoModelForMultimodalLM.from_pretrained("google/gemma-4-E2B-it", device_map="auto")
processor = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/rockets.mp4",
            },
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)



Llama.cpp

Gemma 4 models come with image+text support in llama.cpp from the get-go! This unlocks using Gemma 4 with all your favorite local apps: llama-cpp server, lmstudio, Jan, as well as coding agents like Pi, across many backends such as Metal and CUDA.

You can install llama.cpp as follows.

brew install llama.cpp    # macOS and Linux (Homebrew)
winget install llama.cpp  # Windows

You can then start a server compatible with the OpenAI API. Replace the quantization scheme at the end of the command with the precision of your choice.

llama-server -hf ggml-org/gemma-4-E2B-it-GGUF
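Once the server is up, any OpenAI-compatible client can talk to it. Here is a minimal sketch using only the standard library; the port (8080) is llama-server's default, and the model name is a placeholder since llama-server serves whichever model it was launched with. The function only builds the payload unless `send=True`, so it stays runnable without a live server.

```python
import json
import urllib.request

def chat_request(prompt, base_url="http://localhost:8080/v1", send=False):
    """Build (and optionally send) an OpenAI-style chat completion request
    against a local llama-server instance."""
    payload = {
        "model": "gemma-4",  # placeholder; llama-server serves its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    }
    if not send:  # keep the sketch runnable without a running server
        return payload
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

print(chat_request("Give me one fun fact about llamas."))
```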

Check out this link for more options on combining llama.cpp with different coding agents and local apps. Find all the GGUF checkpoints in this collection.



Plug in your local agent

We worked on making sure the new models work locally with agents like openclaw, hermes, pi, and open code, all thanks to llama.cpp! Run the following to try Gemma 4 right away.

First, start your local server:

llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M

For hermes:

hermes model

For openclaw:

openclaw onboard

For pi, define a ~/.pi/agent/models.json:

{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",	
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ggml-org-gemma-4-26b-4b-gguf"
        }
      ]
    }
  }
}

For open code, define a ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "gemma-4-26b-4b-it": {
          "name": "Gemma 4 (local)",
          "limit": {
            "context": 128000,
            "output": 8192
          }
        }
      }
    }
  }
}



transformers.js

transformers.js enables running Gemma 4 right in the browser. You can check the model card to see text-only, image & text, and audio & text inference in detail here. We also shipped a demo for you to test the model here.



MLX

Full multimodal support for Gemma 4 is available through the open-source mlx-vlm library. Here’s how to ask the model to describe an image:

pip install -U mlx-vlm
mlx_vlm.generate \
  --model google/gemma-4-E4B-it \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
  --prompt "Describe this image in detail"

mlx-vlm supports TurboQuant, which delivers the same accuracy as the uncompressed baseline while using ~4x less active memory and running a lot faster end-to-end. This makes long-context inference practical on Apple Silicon without sacrificing quality. Use it like this:

mlx_vlm.generate \
  --model "mlx-community/gemma-4-26B-A4B-it" \
  --prompt "Your prompt here" \
  --kv-bits 3.5 \
  --kv-quant-scheme turboquant

For audio examples and more details, please check the MLX collection.



Mistral.rs

mistral.rs is a Rust-native inference engine with day-0 Gemma 4 support across all modalities (text, image, video, audio) and built-in tool-calling and agentic functionality. Install mistral.rs:

curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh  # macOS/Linux

irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex  # Windows (PowerShell)

You possibly can then start an OpenAI-compatible HTTP server:

mistralrs serve mistralrs-community/gemma-4-E4B-it-UQFF --from-uqff 8

Or, use interactive mode:

mistralrs run -m google/gemma-4-E4B-it --isq 8 --image image.png -i "Describe this image in detail."

mistralrs run -m google/gemma-4-E4B-it --isq 8 --audio audio.mp3 -i "Transcribe this fully."

Find all models in this collection. Find the instructions for installation and inference in the model cards.



Fine-tuning for all

Gemma 4 models are great for fine-tuning with your favorite tools and platforms, at any budget.



Fine-tuning with TRL

Gemma 4 is fully supported for fine-tuning with TRL. To celebrate, TRL has been upgraded with support for multimodal tool responses when interacting with environments, meaning models can now receive images back from tools during training, not just text.

To showcase this, we built an example training script where Gemma 4 learns to drive in the CARLA simulator. The model sees the road through a camera, decides what to do, and learns from the outcome. After training, it consistently changes lanes to avoid pedestrians. The same approach works for any task where a model must see and act: robotics, web browsing, or other interactive environments.

Get started:

# pip install git+https://github.com/huggingface/trl.git

python examples/scripts/openenv/carla_vlm_gemma.py \
    --env-urls https://sergiopaniego-carla-env.hf.space \
               https://sergiopaniego-carla-env-2.hf.space \
    --model google/gemma-4-E2B-it

Find the example here.
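The multimodal tool responses mentioned above can be pictured as a conversation in which a tool call returns pixels rather than text. The sketch below is purely illustrative, not TRL's actual internal schema: the field names follow the common OpenAI-style multimodal message convention, and the `camera_snapshot` tool is a hypothetical example.

```python
# Illustrative sketch (not TRL's actual schema): a tool call whose result
# is an image handed back to the model as a data URL.
import base64


def image_tool_response(tool_call_id: str, png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes as a tool message carrying an image."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            }
        ],
    }


messages = [
    {"role": "user", "content": "Take a camera snapshot and describe the road."},
    {
        "role": "assistant",
        "tool_calls": [
            {"id": "call_0", "type": "function",
             "function": {"name": "camera_snapshot", "arguments": "{}"}}
        ],
    },
    # The tool answers with an image instead of a string:
    image_tool_response("call_0", b"\x89PNG..."),
]
```

During training, the rollout loop feeds such image-bearing tool messages back through the processor so the model conditions on what the tool actually saw.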



Fine-tuning with TRL on Vertex AI

Additionally, we have prepared an example of how to fine-tune Gemma 4 with TRL on Vertex AI using SFT, showcasing how to extend the function calling capabilities while freezing both the vision and audio towers. The example includes how to build a custom Docker container with the latest Transformers, TRL, etc. with CUDA support on Google Cloud, and how to run it via Vertex AI Serverless Training Jobs.


from google.cloud import aiplatform

aiplatform.init(
    project="",
    location="",
    staging_bucket="",
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="gemma-4-fine-tuning",
    container_uri="",
    command=["python", "/gcs/gemma-4-fine-tuning/train.py"],
)

job = job.submit(
    replica_count=1,
    machine_type="a3-highgpu-1g",
    accelerator_type="NVIDIA_H100_80GB",
    accelerator_count=1,
    base_output_dir="/output-dir",
    environment_variables={
        "MODEL_ID": "google/gemma-4-E2B-it",
        "HF_TOKEN": "",
    },
    boot_disk_size_gb=500,
)

You can find the complete example in the “Hugging Face on Google Cloud” docs at https://hf.co/docs/google-cloud/examples/vertex-ai-notebooks-fine-tune-gemma-4.
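Freezing the vision and audio towers, as the example does, comes down to disabling gradients on those submodules before handing the model to the trainer. A minimal sketch follows; the `"vision_tower"` and `"audio_tower"` name substrings are assumptions about the checkpoint's parameter naming, so inspect `model.named_parameters()` for the actual prefixes:

```python
# Sketch of freezing the vision and audio towers before SFT.
# The substrings below are assumed parameter-name prefixes; verify them
# against model.named_parameters() for your checkpoint.
def freeze_towers(named_parameters, frozen_prefixes=("vision_tower", "audio_tower")):
    """Set requires_grad=False on parameters under the given submodules."""
    frozen = []
    for name, param in named_parameters:
        if any(prefix in name for prefix in frozen_prefixes):
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Usage with a transformers model (sketch):
# model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-E2B-it")
# frozen = freeze_towers(model.named_parameters())
```

TRL's trainers only update parameters with `requires_grad=True`, so only the text decoder is trained while the towers stay fixed.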



Fine-tuning with Unsloth Studio

If you want to fine-tune and run a Gemma 4 model in a UI, check out Unsloth Studio. It runs locally or on Google Colab. First, install and start the app:

# install unsloth studio on macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# install unsloth studio on Windows
irm https://unsloth.ai/install.ps1 | iex

# launch unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Search for a Gemma 4 model like google/gemma-4-E2B-it

Then select any of the Gemma 4 models from the hub.



Try Gemma 4

We’ve shipped demos so you can try different Gemma 4 models. We include demos based on the transformers implementation for the E4B, 26B/A4B, and dense 31B models, as well as a WebGPU demo built with transformers.js 🚀



Acknowledgements

Landing Gemma 4 in the open-source ecosystem took a lot of effort from many people, not only the authors of this blog post. In no particular order, we thank many people from the open-source team: the Gemma 4 transformers integration is owed to Cyril, Raushan, Eustache, Arthur, and Lysandre. We thank Joshua for the transformers.js integration and demo, Eric for the mistral.rs integration, Son for llama.cpp, Prince for the MLX integration, Quentin, Albert, and Kashif for TRL, Adarsh for the SGLang transformers backend, and Toshihiro for building the demos.

This work would not have been possible without Google’s extensive contribution of the model artifact, but also their significant effort contributing the model to transformers in order to standardize it. The open-source ecosystem is now more complete, with a very capable, freely-licensed, open-source model.


