What if your AI agent could instantly parse complex PDFs, extract nested tables, and “see” data inside charts as easily as reading a text file? With NVIDIA Nemotron RAG, you can build a high-throughput intelligent document processing pipeline that handles massive document workloads with precision and accuracy.
This post walks you through the core components of a multimodal retrieval pipeline, step by step. First, we show you how to use the open source NVIDIA NeMo Retriever library to decompose complex documents into structured data using GPU-accelerated microservices. Then, we demonstrate how to wire that data into Nemotron RAG models to ensure your assistant provides grounded, accurate answers with full traceability back to the source.
Let’s dive in.
Quick links to the model and code
Access the following resources for the tutorial:
🧠 Models on Hugging Face:
☁️ Cloud endpoints:
🛠️ Code and documentation:
Prerequisites
To follow this tutorial, you need the following:
System requirements:
- Python 3.10 to 3.12 (tested on 3.12)
- NVIDIA GPU with a minimum of 24 GB VRAM for local model deployment
- 250 GB of disk space (for models, datasets, and vector database)
API access:
Python environment:
[project]
name = "idp-pipeline"
version = "0.1.0"
description = "IDP Nemotron RAG Pipeline Demo"
requires-python = "==3.12"
dependencies = [
"ninja", "packaging", "wheel", "requests", "python-dotenv", "ipywidgets", # Utils
"markitdown", "nv-ingest==26.1.1", "nv-ingest-api==26.1.1", "nv-ingest-client==26.1.1", # Ingest
"milvus-lite==2.4.12", "pymilvus", "openai>=1.51.0", # Database & API
"transformers", "accelerate", "pillow", "torch", "torchvision", "timm" # ML Core
]
Time required:
One to two hours for complete implementation (longer if compiling GPU-optimized dependencies like flash-attn)
What you’ll get: A production-ready multimodal RAG pipeline for document processing
The tutorial is available as a launchable Jupyter Notebook on GitHub for hands-on experimentation. The following is an outline of the build process.
- Unlocking trapped data: The method begins through the use of the NeMo Retriever library to extract information from complex documents.
- Context-aware orchestration: Using a microservice architecture, the pipeline decomposes documents and optimizes the info for Nemotron RAG models, making a high-speed, contextually aware system.
- High-throughput transformation: By scaling the workload with GPU-accelerated computing and NVIDIA NIM microservices, massive datasets are transformed into searchable intelligence in parallel.
- High precision in retrieval: The refined data is fed into Nemotron RAG, enabling the AI agent to pinpoint exact tables or paragraphs to answer complex queries with high reliability.
- Source-grounded reliability: The final integration wires the retrieval output into an assistant that provides “source-grounded” answers, offering transparent citations back to the specific page or chart.
Why traditional OCR and text-only processing fail on complex documents
Before building your pipeline, it’s important to understand the core challenges that standard text extraction fails to solve:
- Structural complexity: Documents contain matrices and tables where relationships between data are critical. Standard PDF parsers merge columns and rows, destroying structure—turning “Model A: 95°C max” and “Model B: 120°C max” into unusable text. This causes errors in manufacturing, compliance, and decision-making.
- Multimodal content: Critical information lives in charts, diagrams, and scanned images that text-only parsers miss. Performance trends, diagnostic results, and process flowcharts require visual understanding.
- Citation requirements: Regulated industries demand precise citations for audit trails. Answers need traceable references like “Section 4.2, Page 47”—not only facts without provenance.
- Conditional logic: “If-then” rules often span multiple sections. Understanding “Use Protocol A below 0°C, otherwise Protocol B” requires preserving document hierarchy and cross-referencing across pages—essential for technical manuals, policies, and regulatory guidelines.
These challenges explain why Nemotron RAG uses specialized extraction models, structured embeddings, and citation-backed generation rather than simple text parsing.
Key considerations for intelligent document processing deployments
When building your document processing pipeline, these factors determine production viability:
- Chunk size tradeoffs: Smaller chunks (256-512 tokens) enable precise retrieval but may lose context. Larger chunks (1,024-2,048 tokens) preserve context but reduce precision. For enterprise documents, 512-1,024 tokens with 100-200 token overlap balances both needs (see the sketch after this list).
- Extraction depth: Determine whether to segment content by page or keep documents whole. Page-level splitting enables precise citations and verification, while document-level segmentation maintains narrative flow and broader context. Choose based on whether you need exact source locations or comprehensive understanding.
- Table output format: Converting tables to markdown preserves row/column relationships in an LLM-native format, significantly reducing numeric hallucinations caused by plain-text linearization.
- Library vs. container mode: Library mode (SimpleBroker) is suitable for development and small workloads (<100 docs). Production deployments require container mode with Redis/Kafka for horizontal scaling across thousands of documents.
These configuration selections directly impact retrieval accuracy, citation precision, and system scalability.
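To make the chunk-size tradeoff concrete, the following is a minimal sketch (not from the tutorial notebook) of a sliding-window chunker. It approximates token counts with whitespace tokens and reads from a hypothetical report.txt; a real pipeline would count tokens with the embedding model's tokenizer.

# Minimal sliding-window chunker illustrating the size/overlap tradeoff.
# NOTE: whitespace "tokens" only approximate model tokens; this helper is
# illustrative and not part of the tutorial notebook.
def chunk_text(text: str, chunk_size: int = 768, overlap: int = 150) -> list[str]:
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap  # re-use the last `overlap` tokens to preserve context
    return chunks

# Example: 512-1,024 token chunks with 100-200 token overlap suit most enterprise docs.
chunks = chunk_text(open("report.txt").read(), chunk_size=768, overlap=150)
print(f"{len(chunks)} chunks")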
What are the components of a multimodal RAG pipeline?
Your intelligent document processing pipeline has three major stages before generating the cited answer to your questions. Each has a clear input/output contract.
Stage 1: Extraction (Nemotron page elements, table/chart extraction, and OCR)
- Input: PDF files
- Output: JSON with structured items: text chunks, table markdown, chart images
- Runs: Library, self-hosted (Docker), and/or remote client
Stage 2: Embedding (llama-nemotron-embed-vl-1b-v2)
- Input: Extracted items (text, tables, chart images)
- Output: 2048-dim vectors per item and original content
- Key capability: Multimodal—encodes text-only, image-only, or image and text together
- Runs: Locally on your GPU or remotely on NIM (coming soon)
Stage 3: Reranking (llama-nemotron-rerank-vl-1b-v2)
- Input: Top-K candidates from embedding search
- Output: Ranked list (highest relevance first)
- Key capability: Cross-encoder; sees (query, document, optional image) together
- Runs: Locally on your GPU or remotely on NIM (coming soon)
- Why it matters: Filters out “looks similar but incorrect” results; the VLM version also sees images to confirm relevance
Once the processing pipeline is set up, answers can be generated:
Generation (Llama-3.3-Nemotron-Super-49B)
- Input: Top-ranked documents + user query
- Output: Grounded, cited answer
- Key capability: Follows strict system prompt to cite sources, admit uncertainty
- Runs: Locally or via NIM on build.nvidia.com
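The notebook handles generation itself, but as a rough sketch: the build.nvidia.com endpoint is OpenAI-compatible, so calling the Nemotron Super model with a citation-enforcing system prompt could look like the following. The model ID, the prompt wording, and the names ranked_hits (the reranker's sorted results) and query (the user question) are assumptions, not taken from the notebook.

import os
from openai import OpenAI

# Hypothetical sketch: top-ranked chunks from the reranker become the grounding context.
client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])

context = "\n\n".join(
    f"[Source: page {h['page']}] {h['text']}" for h in ranked_hits[:5]
)

response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",  # model ID may differ; check build.nvidia.com
    messages=[
        {"role": "system", "content": (
            "Answer only from the provided sources. Cite the page for every claim, "
            "e.g. (page 12). If the sources do not contain the answer, say so."
        )},
        {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {query}"},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)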


Code for building each pipeline component
Check out the starter code for each part of the document processing pipeline.
Extraction
Extraction converts a PDF from “pixels and layout” into structured, queryable units because downstream retrieval and reasoning models can’t reliably operate on raw page coordinates and flattened text without losing meaning. The NeMo Retriever library is built to preserve document structure (tables stay tables, figures stay figures) using specialized extraction capabilities (text, tables, charts/graphics) rather than treating everything as plain text. The World Bank’s “Peru 2017 Country Profile” is a strong stress test since it mixes narrative, charts, and dense appendix tables—the same failure modes that break enterprise RAG if extraction is weak.
# Start nv-ingest (Library Mode) and connect a local client (SimpleClient on port 7671).
print("[INFO] Starting Ingestion Pipeline (Library Mode)...")
run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True, quiet=True)
time.sleep(15)  # warmup

client = NvIngestClient(
    message_client_allocator=SimpleClient,
    message_client_port=7671,  # Default LibMode port
    message_client_hostname="localhost"
)

# Submit an extraction job: keep tables as Markdown + crop charts (for downstream multimodal RAG).
ingestor = (Ingestor(client=client)
    .files([PDF_PATH])
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,    # chart crops
        extract_images=False,   # focus on charts/tables
        extract_method="pdfium",
        table_output_format="markdown"
    )
)

job_results = ingestor.ingest()
extracted_data = job_results[0]
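To sanity-check what came back, you can walk the returned items. The following is a minimal inspection sketch; the exact result schema (keys such as document_type, metadata, content_metadata.page_number, and table_metadata.table_content) can vary between nv-ingest versions, so treat the field names as assumptions to verify against your own output.

from collections import Counter

# Hypothetical inspection loop: count item types and preview where each item came from.
type_counts = Counter()
for item in extracted_data:
    doc_type = item.get("document_type", "unknown")  # e.g. "text", "structured", "image"
    meta = item.get("metadata", {})
    page = meta.get("content_metadata", {}).get("page_number", "?")
    type_counts[doc_type] += 1
    if doc_type == "structured":
        # Tables/charts: markdown content preserved for the embedder.
        preview = (meta.get("table_metadata", {}).get("table_content") or "")[:80]
        print(f"page {page} [{doc_type}]: {preview!r}")

print(dict(type_counts))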
Embedding
Embedding turns each extracted item into a fixed-size vector for millisecond-scale similarity searches over large document collections. Using a multimodal embedder is key to unlocking visually rich PDFs. Because it’s designed to embed document pages as text, image, or image and text, charts and tables can be retrieved as evidence rather than ignored. In this pipeline, each item is indexed into Milvus as a 2,048‑dim vector, and the resulting top‑K shortlist passes into reranking.
# Vector DB contract: 2048-dim vectors + original payload/metadata stored in Milvus.
HF_EMBED_MODEL_ID = "nvidia/llama-nemotron-embed-vl-1b-v2"
COLLECTION_NAME = "worldbank_peru_2017"
MILVUS_URI = "milvus_wb_demo.db"

milvus_client = MilvusClient(MILVUS_URI)
if milvus_client.has_collection(COLLECTION_NAME):
    milvus_client.drop_collection(COLLECTION_NAME)
milvus_client.create_collection(collection_name=COLLECTION_NAME, dimension=2048, auto_id=True)

# Multimodal encoding: text-only vs image-only vs image+text (table markdown + chart/table crop).
with torch.inference_mode():
    if modality == "image_text":
        emb = embed_model.encode_documents(images=[image_obj], texts=[content_text])
    elif modality == "image":
        emb = embed_model.encode_documents(images=[image_obj])
    else:
        emb = embed_model.encode_documents(texts=[content_text])

# (Notebook then L2-normalizes emb[0] and inserts {vector, text, page, type, has_image, image_b64, ...} into Milvus.)
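The model-loading and insert steps are elided above. As a rough sketch, assuming the Hugging Face model card exposes the encode_documents/encode_queries helpers through trust_remote_code (which the calls above imply), loading the embedder, normalizing a vector, and writing one record into Milvus could look like this. Field names follow the comment above; page_number and image_b64 are hypothetical variables from the notebook's extraction loop.

import torch
from transformers import AutoModel

# Assumed loading pattern; check the model card on Hugging Face for the exact recipe.
embed_model = AutoModel.from_pretrained(
    HF_EMBED_MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda").eval()

# L2-normalize so inner-product search behaves like cosine similarity,
# then store the vector alongside the original payload for citation at answer time.
vec = emb[0].float().cpu()
vec = (vec / vec.norm()).tolist()

milvus_client.insert(
    collection_name=COLLECTION_NAME,
    data=[{
        "vector": vec,
        "text": content_text,
        "page": page_number,           # hypothetical: page index from the extraction loop
        "type": modality,
        "has_image": image_obj is not None,
        "image_b64": image_b64 or "",  # hypothetical: base64 crop for charts/tables
    }],
)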
Reranking
Reranking is the precision layer applied after embedding retrieval. Scoring every document with a cross-encoder is too expensive, so you only rerank the embedder’s shortlist. A multimodal cross‑encoder reranker is particularly useful for enterprise PDFs because it can judge relevance using the same evidence users trust—tables and figures (optionally alongside text)—so “looks similar” gets filtered out and “actually answers” rises. In the notebook, reranking starts from Milvus hits, then continues into a scoring loop (not shown here) that assigns logits per candidate and sorts to produce the final ranked context for answer generation; a sketch of that loop follows the code below.
# Stage 1: embed query -> dense retrieve from Milvus (high recall).
with torch.no_grad():
    q_emb = embed_model.encode_queries([query])[0].float().cpu().numpy().tolist()

hits = milvus_client.search(
    collection_name=COLLECTION_NAME,
    data=[q_emb],
    limit=retrieve_k,
    output_fields=["text", "page", "source", "type", "has_image", "image_b64"]
)[0]

# Stage 2: VLM cross-encoder rerank (query + doc_text + optional doc_image) (high precision).
batch = rerank_inputs[i:i+batch_size]  # list of {"query","doc_text","doc_image"} dicts (built from hits)
inputs = rerank_processor.process_queries_documents_crossencoder(batch)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
    logits = rerank_model(**inputs).logits.squeeze(-1).float().cpu().numpy()

# (Notebook then attaches logits as scores and sorts valid_hits descending.)
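The loop that builds rerank_inputs from the Milvus hits and the final scoring/sorting step are summarized by the comments above. A minimal sketch of those two pieces might look like the following, assuming the reranker accepts a PIL image (or None) as doc_image; scores (the concatenated logits from the batched loop) and rerank_k are placeholder names.

import base64, io
from PIL import Image

# Build (query, doc_text, optional doc_image) triples from the Milvus hits.
rerank_inputs, valid_hits = [], []
for hit in hits:
    entity = hit["entity"]
    doc_image = None
    if entity.get("has_image") and entity.get("image_b64"):
        doc_image = Image.open(io.BytesIO(base64.b64decode(entity["image_b64"]))).convert("RGB")
    rerank_inputs.append({"query": query, "doc_text": entity.get("text", ""), "doc_image": doc_image})
    valid_hits.append(entity)

# After the batched scoring loop fills `scores` (one logit per candidate),
# attach them and sort so the most relevant evidence is passed to generation first.
for entity, score in zip(valid_hits, scores):
    entity["rerank_score"] = float(score)
ranked_hits = sorted(valid_hits, key=lambda e: e["rerank_score"], reverse=True)[:rerank_k]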
What are the next steps for optimizing retrieval?
With your intelligent document processing pipeline live, the path to production is wide open. The power of this setup lies in its flexibility. Try connecting new data sources to the NeMo Retriever library or refining your retrieval accuracy with specialized NIM microservices.
As your document library grows, you’ll find that this architecture serves as a scalable foundation for building multi-agent systems that understand the nuances of your enterprise knowledge. By pairing frontier models with NVIDIA Nemotron via an LLM router, you can sustain this high performance while optimizing for cost and efficiency. You can also learn how Justt leveraged Nemotron to achieve a 25% reduction in extraction error rate, increasing the reliability of financial chargeback analysis for their customers.
Join the community of developers building with the NVIDIA Blueprint for Enterprise RAG—trusted by a dozen industry-leading AI Data Platform providers, available on build.nvidia.com, GitHub, and the NGC catalog.
Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
