Raising the Bar for Multimodal Retrieval with ViDoRe V3’s Top Model



By Ronay Ak


Modern search systems are increasingly designed to process heterogeneous document images that contain text, tables, charts, figures, and other visual components. In this context, accurately retrieving relevant information across these diverse modalities is a central challenge. Multimodal embedding models built on top of foundational vision–language models (VLMs) map diverse content types into a shared representation space, enabling unified retrieval over text, images, and structured visual elements. Although encoding a whole query and candidate document into a single vector is standard practice (exemplified by our recently released commercial-ready Llama-Nemotron-Embed-VL-1B, which prioritizes efficiency and low storage), there is a growing research direction toward multi-vector, late-interaction embedding architectures that enable fine-grained interaction between queries and documents. By producing richer token-level representations, these models better capture detailed semantic relationships and have shown stronger accuracy on various (multimodal) benchmarks.

NVIDIA introduces the Nemotron ColEmbed V2 family, a set of late-interaction embedding models available in three sizes—3B, 4B, and 8B—designed for highly accurate multimodal retrieval. These models adopt a unified approach to text–image retrieval and achieve state-of-the-art performance on the ViDoRe V1, V2, and V3 benchmarks.



Nemotron ColEmbed V2 Highlights (TL;DR)

The nemotron-colembed-vl-8b-v2, nemotron-colembed-vl-4b-v2, and llama-nemotron-colembed-vl-3b-v2 are state-of-the-art late-interaction embedding models that rank first, third, and sixth (the best-ranked models in their respective weight classes, as of Feb 3, 2026) on the ViDoRe V3 benchmark, a comprehensive benchmark for evaluating visual document retrieval in enterprise use cases.

Figure: ColBERT-style late interaction (MaxSim) between query and document token embeddings.

The late interaction mechanism introduced by ColBERT for multi-vector embedding matching has been extended in our work to a multimodal setting, enabling fine-grained interactions between query and document tokens, whether textual or visual. As illustrated in the figure, each query token embedding interacts with all document token embeddings via the MaxSim operator, which selects the maximum similarity for each query token and then sums these maxima to produce the final relevance score. This approach requires storing the token embeddings for the entire document corpus, whether textual or visual, thereby increasing storage requirements. During inference, query token embeddings are computed and matched against the stored document embeddings using the same MaxSim operation.
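To make the MaxSim scoring step concrete, here is a minimal PyTorch sketch of the operation described above. The embedding dimension, sequence lengths, and L2 normalization are illustrative assumptions, not the models' exact implementation details.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction relevance score for one query/document pair.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized document token embeddings
    """
    sim = query_emb @ doc_emb.T            # pairwise similarities: (num_query_tokens, num_doc_tokens)
    per_query_max = sim.max(dim=1).values  # MaxSim: best-matching document token per query token
    return per_query_max.sum()             # sum the maxima to produce the final relevance score

# Toy example with random embeddings standing in for real model outputs.
q = F.normalize(torch.randn(8, 128), dim=-1)    # 8 query tokens
d = F.normalize(torch.randn(500, 128), dim=-1)  # 500 document (page) tokens
print(maxsim_score(q, d))
```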

The Nemotron ColEmbed V2 family of models is intended for researchers exploring visual document retrieval applications where accuracy is paramount. This distinguishes it from our 1B single-vector model released last month, which was designed for business environments requiring minimal storage and high throughput. It is instrumental in multimodal RAG systems, where textual queries can be used to retrieve document images, such as pages, text, charts, tables, or infographics. The models output multi-vector embeddings for input queries and documents. Potential applications include multimedia search engines, cross-modal retrieval systems, and conversational AI with rich input understanding.
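As a rough illustration of how these multi-vector outputs could be used in a retrieval or RAG pipeline, the sketch below ranks a small corpus of pre-computed document embeddings for one query, reusing the maxsim_score helper from the previous sketch. A production system would batch these computations and use an index rather than a Python loop.

```python
import torch

def rank_documents(query_emb, doc_embs, top_k=5):
    """Rank pre-computed multi-vector document embeddings against one query.

    query_emb: (num_query_tokens, dim) tensor
    doc_embs:  list of (num_doc_tokens_i, dim) tensors, one per stored page or document
    Returns a list of (document_index, score) pairs, best first.
    """
    scores = torch.stack([maxsim_score(query_emb, d) for d in doc_embs])
    top = torch.topk(scores, k=min(top_k, len(doc_embs)))
    return list(zip(top.indices.tolist(), top.values.tolist()))
```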

As a new benchmark, ViDoRe V3 is designed to set an industry standard for multimodal enterprise document retrieval. It tackles a key challenge in production RAG systems: accurately extracting information from complex, visually rich documents. With its strong multimodal document retrieval capability, the nemotron-colembed-vl-8b-v2 model ranks #1 on the ViDoRe V3 leaderboard, setting a new standard for accuracy.

Visual Document Retrieval benchmark (page retrieval) – Avg NDCG@10 on ViDoRe V3 public and private tasks.



Models’ Architecture

The llama-nemotron-colembed-vl-3b-v2 is a transformer-based multimodal embedding model built on top of a VLM based on google/siglip2-giant-opt-patch16-384 and meta-llama/Llama-3.2-3B. The nemotron-colembed-vl-8b-v2 and nemotron-colembed-vl-4b-v2 multimodal encoder models were built from Qwen3-VL-8B-Instruct and Qwen3-VL-4B-Instruct, respectively.



Architecture modifications:

  • Our models use bi-directional self-attention instead of the original uni-directional causal self-attention of the LLM decoder models. This allows the model to learn rich representations from the entire input sequence (see the sketch after this list).
  • ColBERT-style late interaction mechanism: for each input token, each model outputs an n-dimensional embedding vector of floating-point values, where n is determined by the model's hidden size.
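The following toy sketch illustrates the difference between causal and bi-directional self-attention referenced above. It uses identity query/key/value projections and a single head purely for illustration; it is not the models' actual attention implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, causal: bool) -> torch.Tensor:
    """Toy single-head self-attention over a (seq_len, dim) input.

    With causal=True each token attends only to itself and earlier tokens (the LLM
    decoder default); with causal=False every token attends to the full sequence,
    as in the bi-directional setup described above.
    """
    seq_len, dim = x.shape
    q, k, v = x, x, x                      # identity projections, for illustration only
    scores = q @ k.T / dim**0.5            # (seq_len, seq_len) attention logits
    if causal:
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))  # block attention to future tokens
    return F.softmax(scores, dim=-1) @ v   # contextualized per-token representations
```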



Training Methodology

The nemotron-colembed-vl-8b-v2, nemotron-colembed-vl-4b-v2, and llama-nemotron-colembed-vl-3b-v2 models were each trained independently using a bi-encoder architecture. This involves encoding a pair of inputs (for example, a query and a document) independently with the embedding model. Contrastive learning is then used to maximize the late-interaction similarity between the query and the document that contains the answer, while minimizing the similarity between the query and sampled negative documents that are not useful for answering the query.
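A minimal sketch of such a contrastive objective over late-interaction scores is shown below, reusing the maxsim_score helper from the earlier sketch. The temperature value and the single-query formulation with mined negatives are illustrative assumptions, not the exact training recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_doc_emb, neg_doc_embs, temperature=0.05):
    """InfoNCE-style loss for one query: pull the positive document up, push negatives down.

    query_emb:    (num_query_tokens, dim)
    pos_doc_emb:  (num_pos_tokens, dim)            document that contains the answer
    neg_doc_embs: list of (num_neg_tokens_i, dim)  sampled negative documents
    """
    pos = maxsim_score(query_emb, pos_doc_emb)
    negs = torch.stack([maxsim_score(query_emb, d) for d in neg_doc_embs])
    logits = torch.cat([pos.unsqueeze(0), negs]) / temperature
    # Cross-entropy with the positive document (index 0) as the target class.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```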

The llama-nemotron-colembed-vl-3b-v2 model was trained in a two-stage pipeline: it was first fine-tuned with 12.5M text QA pairs, and subsequently fine-tuned with text–image pairs. The nemotron-colembed-vl-8b-v2 and nemotron-colembed-vl-4b-v2 models were fine-tuned using only text–image pairs (the second stage).

Our training datasets contain both text-only and text–image pairs, and we apply hard negative mining following the positive-aware hard negative mining methods presented in the NV-Retriever paper to improve retrieval performance.
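One positive-aware strategy from the NV-Retriever paper discards candidate negatives whose retrieval score comes too close to the positive's score, since those are likely false negatives. Below is a minimal sketch of that idea; the 0.95 threshold ratio and the number of negatives kept are illustrative assumptions rather than the exact values used.

```python
def filter_hard_negatives(pos_score, candidate_scores, max_neg_ratio=0.95, num_negatives=4):
    """Positive-aware hard-negative filtering (in the spirit of NV-Retriever).

    pos_score:        retrieval score of the known positive document
    candidate_scores: scores of candidate negative documents from a teacher retriever
    Keeps the highest-scoring candidates that stay below a fraction of the positive score.
    """
    threshold = max_neg_ratio * pos_score
    kept = [(i, s) for i, s in enumerate(candidate_scores) if s < threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # hardest surviving negatives first
    return kept[:num_negatives]
```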

Key Improvements over V1:

⚗️ Advanced Model Merging: Uses post-training model merging to combine the strengths of multiple fine-tuned checkpoints. This delivers the accuracy and stability of an ensemble without any additional inference latency (see the sketch below).

🌍 Enhanced Synthetic Data: We significantly enriched our training mixture with diverse multilingual synthetic data, improving semantic alignment across languages and complex document types.
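The post does not spell out the merging algorithm; one common post-training approach is simple weighted averaging of checkpoint parameters ("model soup" style), sketched below as an assumption rather than the exact method used.

```python
import torch

def merge_checkpoints(state_dicts, weights=None):
    """Average parameters of several fine-tuned checkpoints that share one architecture.

    state_dicts: list of state_dicts loaded from the checkpoints to merge
    weights:     optional per-checkpoint mixing weights (defaults to uniform averaging)
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged
```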

Figure: Nemotron ColEmbed V2 model performance on the ViDoRe V3 benchmark.



Start Building with Nemotron ColEmbed V2

Nemotron ColEmbed V2 models mark a major step forward in high-accuracy text–image retrieval, delivering state-of-the-art results on the ViDoRe V1, V2, and V3 benchmarks. The availability of 3B, 4B, and 8B model variants further establishes a solid foundation for future research and advanced experimentation in multimodal retrieval applications.

Get started with the Nemotron ColEmbed V2 models by downloading them from Hugging Face: nemotron-colembed-vl-8b-v2, nemotron-colembed-vl-4b-v2, and llama-nemotron-colembed-vl-3b-v2. Learn more about the NVIDIA NeMo Retriever family of Nemotron RAG models on the product page, or access the microservice container from NVIDIA NGC. This is an excellent opportunity to explore state-of-the-art retrieval in your own applications and workflows.
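The exact loading code depends on each model card; as a minimal, assumption-laden sketch, the checkpoint files can be fetched with huggingface_hub. The repo id shown is hypothetical and should be checked against the actual model card.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id; confirm the exact organization/name and usage on the model card.
local_dir = snapshot_download("nvidia/nemotron-colembed-vl-8b-v2")
print(f"Model files downloaded to {local_dir}")
```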

Try the NVIDIA Enterprise RAG Blueprint, which uses Nemotron RAG models powered by the same technology behind our ViDoRe V3 win.


