Introducing Gemini Embeddings 2 Preview

Google has released a preview version of its latest embedding model. This model is notable for one important reason: it can embed text, PDFs, images, audio, and video, making it a one-stop shop for embedding absolutely anything you'd care to throw at it.

If you're new to embedding, you may wonder what all the fuss is about, but it turns out that embedding is one of the cornerstones of retrieval augmented generation, or RAG as it's known. In turn, RAG is one of the fundamental applications of modern artificial intelligence processing.

A quick recap of RAG and embedding

RAG is a technique of chunking, encoding, and storing information that can then be searched using similarity functions that match search terms to the embedded information. The encoding part turns whatever you're storing into a series of numbers called vectors; this is what embedding does. The vectors (embeddings) are then typically stored in a vector database.

When a user enters a search term, it is also encoded as embeddings, and the resulting vectors are compared with the contents of the vector database, usually using a process called cosine similarity. The closer the search term vectors are to parts of the information in the vector store, the more relevant the search terms are to those parts of the stored data. Large language models can then interpret all this and retrieve and display the most relevant parts to the user.
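To make cosine similarity concrete, here's a tiny self-contained sketch using plain NumPy. The three-dimensional vectors here are made up purely for illustration; real embedding vectors have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for a query and two stored chunks
query  = np.array([0.9, 0.1, 0.0])
chunk1 = np.array([0.8, 0.2, 0.1])   # points in a similar direction to the query
chunk2 = np.array([0.0, 0.1, 0.9])   # mostly orthogonal to the query

print(cosine_sim(query, chunk1) > cosine_sim(query, chunk2))  # True: chunk1 is the better match
```

The retrieval step in RAG is just this comparison, repeated against every vector in the store, with the top-scoring chunks returned.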

There's a whole bunch of other stuff that surrounds this, like how the input data should be split up or chunked, but the embedding, storing, and retrieval are the key features of RAG processing. To make it easier to visualise, here's a simplified schematic of a RAG process.

Image by Nano Banana
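As an aside on the chunking step mentioned above, a naive fixed-size chunker with overlap can be as simple as the sketch below. The chunk size and overlap are arbitrary choices of mine; production systems usually split on sentence or paragraph boundaries instead.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50):
    """Split text into fixed-size character chunks, each overlapping the last."""
    chunks = []
    step = size - overlap  # advance by less than `size` so chunks overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks

chunks = chunk_text("word " * 100, size=100, overlap=20)
print(len(chunks), len(chunks[0]))  # 7 100
```

The overlap means a sentence straddling a chunk boundary still appears whole in at least one chunk, which keeps its embedding meaningful.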

So, what’s special about Gemini Embedding?

Okay, so now that we understand how crucial embedding is for RAG, why is Google's new Gemini embedding model such a big deal? Simply this: traditional embedding models, with a few exceptions, have been restricted to text, PDFs, and other document types, and possibly images at a push.

What Gemini now offers is true multimodal input for embeddings. That means text, PDFs and documents, images, audio, and video. As this is a preview embedding model, there are certain size limitations on the inputs right now, but hopefully you can see the direction of travel and how potentially useful this could be.

Input limitations

I mentioned that there are limitations on what we can input to the new Gemini embedding model. They are:

  • Text: Up to 8192 input tokens, which is about 6000 words
  • Images: Up to 6 images per request, supporting PNG and JPEG formats
  • Videos: A maximum of 2 minutes of video in MP4 and MOV formats
  • Audio: A maximum duration of 80 seconds, in MP3 or WAV format
  • Documents: Up to 6 pages long
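These limits are easy to trip over, so it can be worth guarding your inputs before calling the API. Here's a minimal, hypothetical pre-flight check based on the limits listed above; the constants and the helper name are my own, not part of the SDK.

```python
# Hypothetical pre-flight checks mirroring the preview limits listed above
MAX_IMAGES_PER_REQUEST = 6
MAX_VIDEO_SECONDS = 120
MAX_AUDIO_SECONDS = 80
MAX_DOC_PAGES = 6

def check_request(num_images=0, video_seconds=0.0, audio_seconds=0.0, doc_pages=0):
    """Return a list of limit violations (an empty list means the request looks OK)."""
    problems = []
    if num_images > MAX_IMAGES_PER_REQUEST:
        problems.append(f"too many images: {num_images} > {MAX_IMAGES_PER_REQUEST}")
    if video_seconds > MAX_VIDEO_SECONDS:
        problems.append(f"video too long: {video_seconds}s > {MAX_VIDEO_SECONDS}s")
    if audio_seconds > MAX_AUDIO_SECONDS:
        problems.append(f"audio too long: {audio_seconds}s > {MAX_AUDIO_SECONDS}s")
    if doc_pages > MAX_DOC_PAGES:
        problems.append(f"document too long: {doc_pages} pages > {MAX_DOC_PAGES}")
    return problems

print(check_request(num_images=3, audio_seconds=37))  # [] - within limits
print(check_request(video_seconds=150))               # one violation reported
```

Checking up front gives clearer error messages than waiting for the API to reject the request.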

Okay, time to see the new embedding model in practice with some Python coding examples.

Setting up a development environment

To start, let's set up a standard development environment to keep our projects separate. I'll be using the uv tool for this, but feel free to use whichever method you're used to.

$ uv init embed-test --python 3.13
$ cd embed-test
$ uv venv
$ source .venv/bin/activate
$ uv add google-genai jupyter numpy scikit-learn pydub audioop-lts

# To run the notebook, type this in

$ uv run jupyter notebook

You'll also need a Gemini API key, which you can get from Google's AI Studio home page.

https://aistudio.google.com

Look for a Get API Key link near the bottom left of the screen after you've logged in. Make a note of it, as you'll need it later.

Please note, apart from being a user of their products, I have no association or affiliation with Google or any of its subsidiaries.

Setup Code

I won't talk much about embedding text or PDF documents, as these are relatively straightforward and covered extensively elsewhere. Instead, we'll look at embedding images and audio, which are less common.

This is the setup code, which is common to all our examples.

import os
import numpy as np
from pydub import AudioSegment
from google import genai
from google.genai import types
from sklearn.metrics.pairwise import cosine_similarity

from IPython.display import display, Image as IPImage, Audio as IPAudio, Markdown

# Read the API key from the environment rather than hard-coding it
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

MODEL_ID = "gemini-embedding-2-preview"

Example 1 — Embedding images

For this example, we'll embed 3 images: one of a ginger cat, one of a Labrador, and one of a yellow dolphin. We'll then set up a series of questions or phrases, each one more specific to or associated with one of the images, and see if the model can pick the most appropriate image for each query. It does this by computing a similarity score between the query and each image. The higher this score, the more pertinent the query is to the image.

Here are the images I'm using.

Image by Nano Banana

So, I have two questions and two phrases.

  • Which animal is yellow
  • Which is most likely called Rover
  • There's something fishy going on here
  • A purrrfect image

# Some helper functions
#

# embed text
def embed_text(text: str) -> np.ndarray:
    """Encode a text string into an embedding vector.

    Simply pass the string directly to embed_content.
    """
    result = client.models.embed_content(
        model=MODEL_ID,
        contents=[text],
    )
    return np.array(result.embeddings[0].values)
    
# Embed an image
def embed_image(image_path: str) -> np.ndarray:

    # Determine MIME type from extension
    ext = image_path.lower().rsplit('.', 1)[-1]
    mime_map = {'png': 'image/png', 'jpg': 'image/jpeg', 'jpeg': 'image/jpeg'}
    mime_type = mime_map.get(ext, 'image/png')

    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    result = client.models.embed_content(
        model=MODEL_ID,
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
        ],
    )
    return np.array(result.embeddings[0].values)

# --- Define image files ---
image_files = ["dog.png", "cat.png", "dolphin.png"]
image_labels = ["dog","cat","dolphin"]

# Our questions
text_descriptions = [
    "Which animal is yellow",
    "Which is most likely called Rover",
    "There's something fishy going on here",
    "A purrrfect image"
]

# --- Compute embeddings ---
print("Embedding texts...")
text_embeddings = np.array([embed_text(t) for t in text_descriptions])

print("Embedding images...")
image_embeddings = np.array([embed_image(f) for f in image_files])

# Use cosine similarity for matches
text_image_sim = cosine_similarity(text_embeddings, image_embeddings)

# Print best matches for every text
print("\nBest image match for each text:")
for i, text in enumerate(text_descriptions):
    # np.argmax looks across the row (i) to find the highest score among the columns
    best_idx = np.argmax(text_image_sim[i, :])
    best_image = image_labels[best_idx]
    best_score = text_image_sim[i, best_idx]
    
    print(f'  "{text}" => {best_image} (score: {best_score:.3f})')

Here's the output.

Embedding texts...
Embedding images...

Best image match for each text:
  "Which animal is yellow" => dolphin (score: 0.399)
  "Which is most likely called Rover" => dog (score: 0.357)
  "There's something fishy going on here" => dolphin (score: 0.302)
  "A purrrfect image" => cat (score: 0.368)

Not too shabby. The model came up with the same answers I would have given. How about you?
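If you want to sanity-check the margins rather than just the winner, you can print the whole similarity matrix. Here's a sketch of that idea using made-up stand-in vectors in place of real API output, so it runs standalone; in the real example, the rows would come from embed_text and the columns from embed_image.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in vectors: in the real example these come from embed_text / embed_image
text_embeddings = np.array([[0.9, 0.1, 0.0],
                            [0.1, 0.9, 0.1]])
image_embeddings = np.array([[0.8, 0.2, 0.0],   # "dog"
                             [0.0, 1.0, 0.2],   # "cat"
                             [0.2, 0.1, 1.0]])  # "dolphin"
image_labels = ["dog", "cat", "dolphin"]

sim = cosine_similarity(text_embeddings, image_embeddings)  # shape: (2 texts, 3 images)

# Rank every image for every text, best first
for row, scores in enumerate(sim):
    ranked = sorted(zip(image_labels, scores), key=lambda p: -p[1])
    print(f"text {row}: " + ", ".join(f"{label}={score:.3f}" for label, score in ranked))
```

Seeing all the scores tells you whether the winner won comfortably or only by a whisker, which matters when you set a relevance threshold.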

Example 2 — Embedding audio

For the audio, I used a man's voice describing a fishing trip in which he sees a bright yellow dolphin. Click below to hear the full audio. It's about 37 seconds long.

If you don't want to listen, here is the full transcript.

Hi, my name is Glen, and I want to tell you about an interesting sight I witnessed last Tuesday afternoon while out ocean fishing with some friends. It was a warm day with a yellow sun in the sky. We were fishing for tuna and had no luck catching anything. Boy, we must have spent the best part of 5 hours out there. So, we were pretty glum as we headed back to dry land. But then, suddenly, and I swear this is no lie, we saw a school of dolphins. Not only that, but one of them was bright yellow in colour. We had never seen anything like it in our lives, but I can tell you all thoughts of a bad fishing day went out the window. It was mesmerising.

Now, let's see if we can narrow down where the speaker talks about seeing a yellow dolphin.

Normally, when dealing with embeddings, we're only interested in the general properties, ideas, and concepts contained in the source information. If we want to pin down specific details, such as where in an audio file a particular phrase occurs or where in a video a particular action or event occurs, it becomes a slightly more complex task. To do this in our example, we first need to chunk the audio into smaller pieces before embedding each chunk. We then perform a similarity search on each embedded chunk before producing our final answer.


# --- HELPER FUNCTIONS ---

def embed_text(text: str) -> np.ndarray:
    result = client.models.embed_content(model=MODEL_ID, contents=[text])
    return np.array(result.embeddings[0].values)
    
def embed_audio(audio_path: str) -> np.ndarray:
    ext = audio_path.lower().rsplit('.', 1)[-1]
    mime_map = {'wav': 'audio/wav', 'mp3': 'audio/mp3'}
    mime_type = mime_map.get(ext, 'audio/wav')

    with open(audio_path, 'rb') as f:
        audio_bytes = f.read()

    result = client.models.embed_content(
        model=MODEL_ID,
        contents=[types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)],
    )
    return np.array(result.embeddings[0].values)

# --- MAIN SEARCH SCRIPT ---

def search_audio_with_embeddings(audio_file_path: str, search_phrase: str, chunk_seconds: int = 5):
    print(f"Loading {audio_file_path}...")
    audio = AudioSegment.from_file(audio_file_path)
    
    # pydub works in milliseconds, so 5 seconds = 5000 ms
    chunk_length_ms = chunk_seconds * 1000 
    
    audio_embeddings = []
    temp_files = []
    
    print(f"Slicing audio into {chunk_seconds}-second pieces...")
    
    # 2. Chop the audio into pieces
    # We use a loop to jump forward by chunk_length_ms each time
    for i, start_ms in enumerate(range(0, len(audio), chunk_length_ms)):
        # Extract the slice
        chunk = audio[start_ms:start_ms + chunk_length_ms]
        
        # Save it temporarily to your folder so the Gemini API can read it
        chunk_name = f"temp_chunk_{i}.wav"
        chunk.export(chunk_name, format="wav")
        temp_files.append(chunk_name)
        
        # 3. Embed this specific chunk
        print(f"  Embedding chunk {i + 1}...")
        emb = embed_audio(chunk_name)
        audio_embeddings.append(emb)
        
    audio_embeddings = np.array(audio_embeddings)
    
    # 4. Embed the search text
    print(f"\nEmbedding your search: '{search_phrase}'...")
    text_emb = np.array([embed_text(search_phrase)])
    
    # 5. Compare the text against all the audio chunks
    print("Calculating similarities...")
    sim_scores = cosine_similarity(text_emb, audio_embeddings)[0]
    
    # Find the chunk with the highest score
    best_chunk_idx = np.argmax(sim_scores)
    best_score = sim_scores[best_chunk_idx]
    
    # Calculate the timestamp
    start_time = best_chunk_idx * chunk_seconds
    end_time = start_time + chunk_seconds
    
    print("\n--- Results ---")
    print(f"The phrase '{search_phrase}' most closely matches the audio between {start_time}s and {end_time}s!")
    print(f"Confidence score: {best_score:.3f}")

    # Clean up the temporary chunk files
    for f in temp_files:
        os.remove(f)
    

# --- RUN IT ---

# Replace with whatever phrase you're looking for!
search_audio_with_embeddings("fishing2.mp3", "yellow dolphin", chunk_seconds=5)

Here is the output.

Loading fishing2.mp3...
Slicing audio into 5-second pieces...
  Embedding chunk 1...
  Embedding chunk 2...
  Embedding chunk 3...
  Embedding chunk 4...
  Embedding chunk 5...
  Embedding chunk 6...
  Embedding chunk 7...
  Embedding chunk 8...

Embedding your search: 'yellow dolphin'...
Calculating similarities...

--- Results ---
The phrase 'yellow dolphin' most closely matches the audio between 25s and 30s!
Confidence score: 0.643

That's pretty accurate. Listening to the audio again, the word "dolphin" is mentioned at the 25-second mark and "bright yellow" at the 29-second mark. Earlier in the audio, I deliberately introduced the phrase "yellow sun" to see whether the model would be confused, but it handled the distraction well.
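One weakness of fixed 5-second chunks is that a phrase can straddle a boundary and get split across two embeddings, as "dolphin" and "bright yellow" nearly did here. A common workaround is overlapping windows. Here's a sketch of the boundary arithmetic only, with no API calls; the window and hop sizes are my own choices.

```python
def overlapping_windows(total_ms: int, window_ms: int = 5000, hop_ms: int = 2500):
    """Yield (start_ms, end_ms) windows that overlap by window_ms - hop_ms."""
    windows = []
    start = 0
    while start < total_ms:
        windows.append((start, min(start + window_ms, total_ms)))
        start += hop_ms
    return windows

# A 12-second clip: each boundary is covered by the middle of some other window
for start, end in overlapping_windows(12_000):
    print(f"{start / 1000:.1f}s - {end / 1000:.1f}s")
```

Each window could then be exported with pydub's slicing (audio[start:end]) and embedded exactly as in the main example, at the cost of roughly twice as many embedding calls.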

Summary

This article introduces Gemini Embeddings 2 Preview as Google's new all-in-one embedding model for text, PDFs, images, audio, and video. It explains why that matters for RAG systems, where embeddings turn content and search queries into vectors that can be compared for similarity.

I then walked through two Python examples showing how to generate embeddings for images and audio with the Google GenAI SDK, use similarity scoring to match text queries against images, and chunk audio into smaller segments to identify the part of a spoken recording that's semantically closest to a given search phrase.

The ability to perform semantic searches beyond just text and other documents is a real boon. Google's new embedding model promises to open up a whole new raft of possibilities for multimodal search, retrieval, and recommendation systems, making it much easier to work with images, audio, video, and documents in a single pipeline. As the tooling matures, it could become a very practical foundation for richer RAG applications that understand far more than text alone.


You can find the original blog post announcing Gemini Embeddings 2 via the link below.

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2
