Improve Accuracy In Multimodal Search and Visual Document Retrieval with Llama Nemotron RAG Models

By Ronay Ak and Bo Liu



How to build accurate, low-latency visual document retrieval with small Llama Nemotron models that work out of the box with standard vector databases

In real applications, data isn't just text. It lives in PDFs with charts, scanned contracts, tables, screenshots, and slide decks, so a text-only retrieval system will miss vital information. Multimodal RAG pipelines change this by enabling retrieval and reasoning over text, images, and layout together, leading to more accurate and actionable answers.

This post walks through two small Llama Nemotron models for multimodal retrieval over visual documents: llama-nemotron-embed-vl-1b-v2, a multimodal embedding model, and llama-nemotron-rerank-vl-1b-v2, a multimodal reranking model.

Both models are:

  • Small enough to run on most NVIDIA GPUs
  • Compatible with standard vector databases (single dense vector per page)
  • Designed to reduce hallucinations by grounding generation in better evidence, not longer prompts

We’ll show how they behave on realistic document benchmarks below.



Why multimodal RAG needs world-class retrieval

Multimodal RAG pipelines combine a retriever with a vision-language model (VLM) so responses are grounded in both retrieved page text and visual content, not only raw text prompts.

Embeddings control which pages are retrieved and shown to the VLM. Reranking models decide which of those pages are most relevant and will influence the answer. If either step is inaccurate, the VLM is more likely to hallucinate, often with high confidence. Using multimodal embeddings along with a multimodal reranker keeps generation grounded in the right page images and text.



The State of the Art in Commercial Multimodal Search

The llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2 models are designed for developers building multimodal question answering and search over large corpora of PDFs and images.

The llama-nemotron-embed-vl-1b-v2 model is a single-vector (dense) embedding model that efficiently condenses visual and textual information into a single representation. This design ensures compatibility with all standard vector databases and enables millisecond-latency search at enterprise scale.
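To make the vector-database compatibility concrete, here is a minimal indexing sketch. It is not the official API: embed_page() and embed_query() are hypothetical wrappers around llama-nemotron-embed-vl-1b-v2, with placeholder bodies that return random unit vectors so the snippet runs end to end; the FAISS usage is standard single-vector dense retrieval.

```python
# Minimal indexing sketch (not the official API): one dense vector per page in a standard
# vector store, FAISS here. embed_page() and embed_query() are hypothetical wrappers around
# llama-nemotron-embed-vl-1b-v2; their placeholder bodies return random unit vectors.
import numpy as np
import faiss

DIM = 2048  # embedding dimension of llama-nemotron-embed-vl-1b-v2

def _placeholder_vector(seed: int) -> np.ndarray:
    vec = np.random.default_rng(seed).standard_normal(DIM).astype("float32")
    return vec / np.linalg.norm(vec)

def embed_page(image_path: str, text: str | None = None) -> np.ndarray:
    """Hypothetical: encode one page image (plus optional extracted text) into a single dense vector."""
    return _placeholder_vector(hash((image_path, text)) & 0xFFFFFFFF)

def embed_query(question: str) -> np.ndarray:
    """Hypothetical: encode a text query into the same embedding space."""
    return _placeholder_vector(hash(question) & 0xFFFFFFFF)

index = faiss.IndexFlatIP(DIM)  # inner product == cosine similarity on unit-normalized vectors
page_ids = []
for image_path, text in [("report_p01.png", "Q3 revenue grew 12%..."), ("report_p02.png", None)]:
    index.add(embed_page(image_path, text).reshape(1, DIM))
    page_ids.append(image_path)

scores, idx = index.search(embed_query("What drove Q3 revenue growth?").reshape(1, DIM), k=2)
top_pages = [page_ids[i] for i in idx[0] if i != -1]
```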

llama-nemotron-rerank-vl-1b-v2 is a cross-encoder reranking model that reorders the top retrieved candidates to improve relevance and boost downstream answer quality without changing your storage or index format.
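As a sketch of how the reranker slots in behind the embedding stage (again not the official API; rerank_score() is a hypothetical stand-in for a call to llama-nemotron-rerank-vl-1b-v2, with a trivial placeholder body so the sketch runs):

```python
# Minimal second-stage reranking sketch (not the official API). The real cross-encoder jointly
# encodes the query and one candidate page (image and/or text) and returns a relevance score;
# rerank_score() below is only a placeholder for that call.
def rerank_score(question: str, page_id: str) -> float:
    """Hypothetical cross-encoder call, e.g., against a hosted reranking endpoint."""
    return float(len(set(question.lower().split()) & set(page_id.lower().split("_"))))  # placeholder

def rerank(question: str, candidate_pages: list[str], keep: int = 5) -> list[str]:
    """Reorder the dense-retrieval candidates; the vector index and stored embeddings are untouched."""
    scored = sorted(candidate_pages, key=lambda page: rerank_score(question, page), reverse=True)
    return scored[:keep]

# Usage: take the top-k pages from the embedding stage and keep only the best few for the VLM.
# final_pages = rerank("What drove Q3 revenue growth?", top_pages, keep=3)
```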

We evaluated llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2 on five visual document retrieval datasets: the popular ViDoRe V1, V2, and V3 (a practical visual document retrieval benchmark for enterprises composed of 8 public datasets), and two internal visual document retrieval datasets:

  • DigitalCorpora-10k: A dataset with over 1,300 questions based on a corpus of 10,000 documents from DigitalCorpora, with a good mix of text, tables, and charts.
  • Earnings V2: An internal retrieval dataset of 287 questions based on 500 PDFs, mostly consisting of earnings reports from big tech firms.



Visual Document Retrieval (page retrieval) benchmarks

The table below reports the average retrieval accuracy (Recall@5) across the five datasets, focusing specifically on commercially viable dense retrieval models.

We can see that llama-nemotron-embed-vl-1b-v2 provides higher retrieval accuracy (Recall@5) for the image and image+text modalities than its predecessor, llama-3.2-nemoretriever-1b-vlm-embed-v1, and also higher accuracy on the text modality than llama-nemotron-embed-1b-v2, our small text embedding model. Finally, our VLM reranker llama-nemotron-rerank-vl-1b-v2 improves retrieval accuracy further by 7.2%, 6.9%, and 6% for the text, image, and image+text modalities, respectively.

Note: The Image + Text modality means that both the page image and its text (extracted using ingestion libraries like NV-Ingest) are fed as input to the embedding model for a more accurate representation and retrieval.

Visual Document Retrieval benchmarks (page retrieval) – Avg Recall@5 on DigitalCorpora-10k, Earnings V2, ViDoRe V1, V2, V3

Model | Text | Image | Image + Text
llama-nemotron-embed-1b-v2 | 69.35% | – | –
llama-3.2-nemoretriever-1b-vlm-embed-v1 | 71.07% | 70.46% | 71.71%
llama-nemotron-embed-vl-1b-v2 | 71.04% | 71.20% | 73.24%
llama-nemotron-embed-vl-1b-v2 + llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64%
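Reading the per-modality gains quoted above as relative improvements over the embedding-only row is consistent with the table; here is a quick check (the numbers come from the table, the relative-improvement interpretation is ours):

```python
# Quick check of the per-modality gains quoted above, read as relative improvements
# over the embedding-only row (our interpretation of the table, not an official definition).
embed_only = {"text": 71.04, "image": 71.20, "image+text": 73.24}
with_rerank = {"text": 76.12, "image": 76.12, "image+text": 77.64}

for modality in embed_only:
    relative_gain = (with_rerank[modality] - embed_only[modality]) / embed_only[modality] * 100
    print(f"{modality:>10}: +{relative_gain:.1f}% relative Recall@5")
# -> roughly +7.2%, +6.9%, and +6.0%, matching the figures in the text.
```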

The table below compares the accuracy of llama-nemotron-rerank-vl-1b-v2 with two other publicly available multimodal reranker models: jina-reranker-m0 and MonoQwen2-VL-v0.1. Although jina-reranker-m0 performs well on image-only tasks, its public weights are restricted to non-commercial use (CC-BY-NC). In contrast, llama-nemotron-rerank-vl-1b-v2 offers superior performance across the text and combined image+text modalities, and its permissive commercial license makes it an excellent choice for enterprise deployments.

Model | Text | Image | Image + Text
llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64%
jina-reranker-m0 | 69.31% | 78.33% | N/A
MonoQwen2-VL-v0.1 | 74.70% | 75.80% | 75.98%



Architectural Highlights & Training Methodology

The llama-nemotron-embed-vl-1b-v2 embedding model is a transformer-based encoder with roughly 1.7B parameters. It is a fine-tuned version of the NVIDIA Eagle family of models, using the Llama 3.2 1B language model and the SigLIP2 400M vision encoder. Embedding models for retrieval are typically trained with a bi-encoder architecture that encodes the query and the document independently. The model applies mean pooling over the output token embeddings from the language model, so that it outputs a single embedding with 2048 dimensions. Contrastive learning is used to train the embedding model to increase the similarity between queries and relevant documents while decreasing the similarity to negative samples.
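A simplified illustration of that pooling and contrastive objective in PyTorch (schematic only, not the actual training code; the temperature, batch size, and sequence lengths are illustrative assumptions):

```python
# Schematic of the bi-encoder training setup described above (simplified, not the real code).
# Mean pooling over the language model's output token embeddings yields one 2048-dim vector,
# and an InfoNCE-style contrastive loss pulls queries toward their positive pages and away
# from in-batch negatives. Temperature, batch size, and sequence lengths are illustrative.
import torch
import torch.nn.functional as F

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()                   # (batch, seq, 1)
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

def contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """InfoNCE with in-batch negatives: the i-th document is the positive for the i-th query."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature                  # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Example shapes: a batch of 8 queries and 8 pages, 2048-dim embeddings after mean pooling.
q = mean_pool(torch.randn(8, 32, 2048), torch.ones(8, 32))
d = mean_pool(torch.randn(8, 512, 2048), torch.ones(8, 512))
loss = contrastive_loss(q, d)
```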

The llama-nemotron-rerank-vl-1b-v2 model is a cross-encoder with roughly 1.7B parameters. It is also a fine-tuned version of an NVIDIA Eagle-family model. The final-layer hidden states of the language model are aggregated using a mean pooling strategy, and a binary classification head is fine-tuned for the ranking task. The model was trained with a cross-entropy loss using publicly available and synthetically generated datasets.
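And a similarly simplified view of the reranker's scoring head (illustrative only; in the real model the query and candidate page are encoded jointly before the pooled hidden states reach the classification head):

```python
# Simplified view of the cross-encoder scoring described above (illustrative only).
# Mean pooling over the final-layer hidden states of the jointly encoded (query, page) pair
# feeds a binary classification head whose logit serves as the relevance score.
import torch
import torch.nn as nn

class RerankHead(nn.Module):
    def __init__(self, hidden_dim: int = 2048):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)   # binary relevance head

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        return self.classifier(pooled).squeeze(-1)   # one relevance logit per (query, page) pair

# Training pairs are labeled relevant / not relevant and optimized with a (binary) cross-entropy
# objective; at inference the logit is used to sort the retrieved candidates.
head = RerankHead()
logits = head(torch.randn(4, 256, 2048), torch.ones(4, 256))
loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.tensor([1.0, 0.0, 1.0, 0.0]))
```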



How Organizations are Using These Models

Here are three examples of how organizations are applying the new Nemotron embedding and reranking models, which you can adapt in your own systems.

Cadence: design and EDA workflows
Cadence models logic design assets such as micro-architecture and specification documents, constraints, and verification collateral as connected multimodal documents. As a result, an engineer can ask, "I need to extend the interrupt controller to support a low-power state; show me which spec sections need changes," and immediately surface the most relevant requirements. The system can then suggest a few alternative specification-update strategies, compare their tradeoffs, and generate the corresponding spec edits for the option the user selects.

IBM: domain-heavy storage and infra docs
IBM Storage treats each page of long PDFs (product guides, configuration manuals, and architecture diagrams) as a multimodal document, embeds it, and uses the reranker to prioritize pages where domain-specific terms, acronyms, and product names appear in the right context before sending them to downstream LLMs. This improves how AI systems interpret storage concepts and reason over complex infrastructure documentation.

ServiceNow: chat over large sets of PDFs
ServiceNow uses the multimodal embeddings to index pages from organizational PDFs and then applies the reranker to select the most relevant pages for each user query in its "Chat with PDF" experiences. By keeping high-scoring pages in context across turns, their agents maintain more coherent conversations and help users navigate large document collections more effectively.



Get Started

You can try the models directly:

  • Run llama-nemotron-embed-vl-1b-v2 with your vector database of choice to power multimodal search over PDFs and images.
  • Add llama-nemotron-rerank-vl-1b-v2 as a second-stage reranker on your top-k results to improve retrieval quality without changing your index (see the end-to-end sketch after this list).
  • Download the Nemotron RAG models if you want end-to-end components for agents. The models aren't limited to standalone use; they can also be integrated into ingestion pipelines.
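Putting the pieces together, here is the end-to-end query path referenced above. It reuses the helpers from the earlier sketches (embed_query, index, page_ids, rerank), and answer_with_vlm() is a hypothetical stand-in for whichever vision-language model handles generation:

```python
# End-to-end query path, reusing the helpers defined in the earlier sketches (embed_query,
# index, page_ids, rerank). answer_with_vlm() is hypothetical; the point is that generation
# only ever sees a handful of reranked pages rather than a long raw-text prompt.
def answer_with_vlm(question: str, pages: list[str]) -> str:
    """Hypothetical: send the question plus the retrieved page images/text to a VLM."""
    return f"(VLM answer to {question!r} grounded in pages {pages})"

def answer(question: str, k_dense: int = 20, k_final: int = 5) -> str:
    query_vec = embed_query(question).reshape(1, -1)
    _, idx = index.search(query_vec, k_dense)              # stage 1: dense page retrieval
    candidates = [page_ids[i] for i in idx[0] if i != -1]
    pages = rerank(question, candidates, keep=k_final)     # stage 2: cross-encoder reranking
    return answer_with_vlm(question, pages)
```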

Plug the new models into your existing RAG stack, or combine them with other open models on Hugging Face to build multimodal agents that understand your PDFs, not just their extracted text.

Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, YouTube, and the Nemotron channel on Discord.



Source link
