Build Smarter, Language-Aware Search and Retrieval Systems
As global information continues to expand across languages, developers face a growing challenge: how to make models understand text in dozens of languages with the same precision, nuance, and semantic clarity. Traditional multilingual embedding models often struggle with alignment, scalability, and consistent performance.
The NVIDIA Llama-Embed-Nemotron-8B model changes this. Built by fine-tuning the Llama-3.1-8B foundation model, this embedding model applies cross-lingual representation learning to deliver unified, high-fidelity embeddings across linguistically diverse content. Whether you are building cross-language retrieval systems, search engines, or conversational AI, this model helps close the comprehension gap between languages – high-resource or low-resource alike.
This post provides an overview of the model’s architecture, training methodology, and evaluation results, highlighting how it empowers developers to build more intelligent and inclusive multilingual applications.
Architectural Highlights
- Base model: 7.5B parameters, consisting of 32 hidden layers (hidden dimension = 4,096).
- Key innovation: Replaces uni-directional causal attention with bi-directional self-attention, enabling richer semantic understanding across the complete token context.
- Embedding output: Global average pooling compresses token information into a 4,096-dimensional dense vector, optimized for semantic search and cross-lingual tasks.
This design allows the model to generate consistent embeddings regardless of input language or structure, a critical step when tackling multilingual retrieval or alignment problems.
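To make the pooling step concrete, here is a minimal PyTorch sketch of masked mean pooling, averaging per-token hidden states into a single fixed-size embedding while ignoring padding positions. The function name and shapes are illustrative; consult the model card for the exact recommended pooling and normalization.

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # last_hidden_state: [batch, seq_len, 4096]; attention_mask: [batch, seq_len]
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # number of non-padding tokens
    return F.normalize(summed / counts, dim=-1)                      # unit-length 4,096-d embeddings
```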
Training Methodology
The model is trained using a bi-encoder architecture, independently encoding a pair of sentences (for example, a query and a passage) with the embedding model. Using contrastive learning, it maximizes similarity between the query and the passage that contains the answer, while minimizing it between the query and sampled negative passages that are not useful for answering the query.
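The snippet below is an illustrative in-batch contrastive (InfoNCE-style) objective for a bi-encoder, not the exact training code: row i of the passage matrix is the positive for query i, and every other row in the batch serves as a negative. The temperature value is a hypothetical placeholder.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # q_emb, p_emb: L2-normalized query/passage embeddings of shape [batch, dim]
    scores = q_emb @ p_emb.T / temperature                     # [batch, batch] similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)  # the positive is the diagonal entry
    return F.cross_entropy(scores, labels)                     # pull positives up, push negatives down
```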
Training data mix (16M pairs):
- 8M from publicly available datasets: Nemotron-CC-v2, MIRACL, HotpotQA, MS MARCO, Natural Questions, SQuAD, and more.
- 8M from synthetic datasets: Generated from open-source LLMs covering retrieval, semantic similarity and classification problem types.
Two-stage training pipeline:
- Pre-training: 11.5M query-document pairs curated from Nemotron-CC-v2 – NVIDIA’s state-of-the-art LLM pre-training dataset.
- Fine-tuning: 4.5M pairs combining public and high-quality synthetic datasets to refine semantic precision.
We plan to open-source our multilingual data mix soon and publish an in-depth technical report covering training dynamics and multilingual alignment.
Performance Evaluation
We evaluate our model on the MMTEB Benchmark, specifically on the main MTEB (Multilingual, v2) split. It consists of 131 tasks across 9 task types and 1,038 languages. Ranking on the MMTEB Leaderboards is based on the Borda rank. Each task is treated as a preference voter that gives votes to the models according to their relative performance on the task. The best-performing model on a task receives the most votes, and the model with the highest number of votes across tasks obtains the highest rank. The Borda rank tends to prefer models that perform well broadly across tasks.
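As a rough illustration of the Borda scheme (simplified; the official MMTEB tooling may handle ties and other details differently), each task ranks the models by score and awards more votes to higher-ranked models:

```python
def borda_votes(task_scores: dict[str, dict[str, float]]) -> dict[str, int]:
    # task_scores maps task name -> {model name: score on that task}
    votes: dict[str, int] = {}
    for scores in task_scores.values():
        ranked = sorted(scores, key=scores.get, reverse=True)  # best model on this task first
        for position, model in enumerate(ranked):
            votes[model] = votes.get(model, 0) + (len(ranked) - 1 - position)
    return votes  # higher vote total = better Borda rank
```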
Our model achieves state-of-the-art performance on the MMTEB Benchmark (as of October 21, 2025). The ranking is presented below:
| Borda Rank | Model | Borda Votes | Mean (Task) |
|---|---|---|---|
| 1. | llama-embed-nemotron-8b | 39,573 | 69.46 |
| 2. | gemini-embedding-001 | 39,368 | 68.37 |
| 3. | Qwen3-Embedding-8B | 39,364 | 70.58 |
| 4. | Qwen3-Embedding-4B | 39,099 | 69.45 |
| 5. | Qwen3-Embedding-0.6B | 37,419 | 64.34 |
| 6. | gte-Qwen2-7B-instruct | 37,167 | 62.51 |
| 7. | Linq-Embed-Mistral | 37,149 | 61.47 |
Modern applications — from document search to coding assistants — require embeddings that scale and generalize. With Llama-Embed-Nemotron-8B, developers can:
- Construct cross-language retrieval systems that align semantically across diverse alphabets and syntax.
- Power multi-lingual QA and semantic similarity tasks without compromising accuracy.
- Leverage an open, high-performing model that integrates easily with existing Hugging Face pipelines (see the sketch after this list).
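As a starting point, here is a hypothetical end-to-end sketch of cross-language retrieval with the transformers library. The model ID, loading options, and prompt formatting are assumptions; check the model card for the exact checkpoint name and recommended usage.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "nvidia/llama-embed-nemotron-8b"  # placeholder; verify the actual checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).eval()

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state                 # [batch, seq_len, 4096]
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)  # mean pooling
    return F.normalize(pooled, dim=-1)

# An English query retrieving passages written in other languages.
query = embed(["How do solar panels generate electricity?"])
passages = embed([
    "Los paneles solares convierten la luz solar en electricidad.",  # Spanish, relevant
    "太陽光パネルは光を電気に変換します。",                             # Japanese, relevant
    "Der Eiffelturm steht in Paris.",                                 # German, unrelated
])
print(query @ passages.T)  # cosine similarities; the unrelated passage should score lowest
```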
Try it Yourself
Deploy Llama-Embed-Nemotron-8B, or learn more about the NVIDIA NeMo Retriever family of Nemotron RAG models on the product page.
Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, YouTube, and the Nemotron channel on Discord.
