NVIDIA Llama Nemotron Nano VL is a state-of-the-art 8B Vision Language Model (VLM) designed for intelligent document processing, offering high accuracy and multimodal understanding. Available on Hugging Face, it excels at extracting and understanding information from complex documents like invoices, receipts, contracts, and more. With its powerful OCR capabilities and leading performance on the OCRBench v2 benchmark, this model delivers industry-leading accuracy for text and table extraction, as well as chart and diagram parsing. Whether you’re automating financial document processing or improving business intelligence workflows, Llama Nemotron Nano VL is optimized for fast, scalable deployments.
Try the tutorial below to start building your own intelligent document processing solutions with Llama Nemotron Nano VL! You can also post-train the model further on your own datasets using NVIDIA NeMo.
Introduction to Llama Nemotron Nano VL
Llama Nemotron Nano VL, the latest addition to the NVIDIA Nemotron family of models, is a vision language model (VLM) designed to push the boundaries of intelligent document processing (IDP) and optical character recognition (OCR). With its high accuracy, low model footprint, and multimodal capabilities, Llama Nemotron Nano VL enables the seamless extraction and understanding of information from complex documents. This includes PDFs, images, tables, charts, formulas, and diagrams, making it an ideal solution for automating document workflows across industries like finance, healthcare, legal, and government.
High-Accuracy OCR with Llama Nemotron Nano VL
Llama Nemotron Nano VL demonstrates exceptional accuracy on the OCRBench v2 benchmark, which tests models on real-world OCR and document understanding tasks. These tasks include text recognition, table extraction, and element parsing across various document types. The model’s advanced capabilities enable it to deliver higher performance than current leading VLMs in real-world enterprise scenarios.
Llama Nemotron Nano VL OCRBench v2 Performance:
- Text Recognition: Llama Nemotron Nano VL excels at detecting and extracting text, achieving high accuracy in real-world OCR tasks such as invoice processing.
- Element Parsing: The model accurately identifies and extracts critical document elements like tables, charts, and images, which are essential for understanding complex documents.
- Table Extraction: The model extracts tabular data from documents with high accuracy, making it well suited to financial statements and similar use cases.
- Grounding: It also supports grounding through bounding boxes in both queries and outputs, enhancing the interpretability of the model’s responses.
Model Architecture and Innovations
Llama Nemotron Nano VL builds upon Llama-3.1-8B-Instruct and C-RADIOv2-VLM-H, a Vision Transformer (ViT) that serves as the backbone for visual feature extraction. This allows the model to handle a wide range of visual elements in documents, including charts, graphs, and other complex visual representations.
Core Technologies
Strong Vision Foundation
C-RADIOv2-VLM-H Vision Transformer (ViT): The core visual understanding component of the model, C-RADIO allows for high-resolution processing of documents containing complex visual elements. It serves as a vision backbone that excels across visual domains and enables multi-image understanding at high resolution. This technology underpins the model’s ability to handle complex documents containing visual elements such as images, diagrams, charts, and tables.
C-RADIO is trained on multi-resolution data using multiple distillation techniques. Multiplicative noise was applied to the weights during training to improve generalization.
Llama Nemotron VL further adopts a design that dynamically aggregates encoded patch features, enabling support for high-resolution input without sacrificing spatial continuity. This strategy efficiently processes documents with arbitrary aspect ratios while preserving both local detail and global context. It enables fine-grained analysis of dense visual content, such as small fonts, multi-column layouts, and intricate charts, without compromising computational efficiency or coverage. Thanks to this innovation in high-resolution tiling, the model also preserves information better and introduces less distortion.
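To make the idea concrete, here is a minimal sketch of dynamic tiling for pages with arbitrary aspect ratios. This is a generic scheme, not the model’s exact implementation; the tile size, tile budget, and helper names are illustrative:

```python
from PIL import Image

def tile_image(image: Image.Image, tile_size: int = 512, max_tiles: int = 12):
    """Split a page into fixed-size tiles plus a global thumbnail.

    The grid adapts to the page's aspect ratio so small fonts and
    multi-column layouts keep their native resolution, while the
    thumbnail preserves global context.
    """
    w, h = image.size
    # Choose a grid that roughly preserves the aspect ratio.
    cols = max(1, round(w / tile_size))
    rows = max(1, round(h / tile_size))
    while cols * rows > max_tiles:  # stay within the tile budget
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = image.resize((cols * tile_size, rows * tile_size))
    tiles = [
        resized.crop((c * tile_size, r * tile_size,
                      (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = image.resize((tile_size, tile_size))  # global view
    return tiles, thumbnail
```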
By empowering the Llama-3.1-8B LLM with this strong vision foundation, Llama Nemotron Nano VL delivers unparalleled accuracy in parsing and interpreting documents.
High-quality data for document intelligence
Llama Nemotron Nano VL was trained using several OSS datasets along with data from NVIDIA’s VLM-based OCR solution, NeMo Retriever Parse. This provides capabilities in text and table parsing, along with grounding, enabling Llama Nemotron Nano VL to perform at an industry-leading level in document understanding tasks. Synthetic table extraction datasets that were used for training this OCR solution were also used for training the Llama Nemotron Nano VL 8B VLM to strengthen table understanding and extraction.
Llama Nemotron Nano VL excels in tasks like text recognition and visual reasoning, and demonstrates advanced chart and diagram understanding capabilities. The model can predict bounding box coordinates in normalized space to enable grounding-like tasks and text referring.
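For reference, mapping a predicted box from this normalized 0-1000 space back to pixels is a simple rescale; a minimal helper (the function name and example values are ours):

```python
def denormalize_bbox(bbox, image_width: int, image_height: int):
    """Map an [x1, y1, x2, y2] box from the model's 0-1000 normalized
    space back to pixel coordinates for the given image."""
    x1, y1, x2, y2 = bbox
    return (
        x1 * image_width / 1000,
        y1 * image_height / 1000,
        x2 * image_width / 1000,
        y2 * image_height / 1000,
    )

# e.g. a predicted box on a 1700x2200 px scanned page
print(denormalize_bbox([103, 57, 897, 121], 1700, 2200))
```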
This strong performance is underpinned by high-quality in-domain data combined with a diverse training distribution across document types, languages, and layouts. A strong data strategy ensures coverage across difficult use cases through selective curation, targeted augmentation, and formatting techniques that clarify task intent and reduce ambiguity, leading to models that generalize effectively to real-world applications.
Pre-Training
Llama Nemotron Nano VL undergoes a two-stage training regimen: pre-training followed by Supervised Fine-Tuning (SFT).
The initial pre-training phase focuses on achieving cross-modal alignment between the language and vision domains. This is achieved by training a Multi-Layer Perceptron (MLP) connector, which serves as an interface between the two modalities.
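Conceptually, such a connector is a small MLP that projects vision-encoder patch embeddings into the LLM’s token-embedding space so image patches can be consumed as ordinary tokens. A minimal PyTorch sketch; the dimensions and layer count are illustrative, not the model’s actual configuration:

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP that maps ViT patch features to LLM embedding space."""

    def __init__(self, vision_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns:        (batch, num_patches, llm_dim) "visual tokens"
        return self.proj(patch_features)

# During pre-training, the connector is the component being trained to
# align the vision and language domains.
connector = VisionLanguageConnector()
visual_tokens = connector(torch.randn(1, 256, 1280))
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```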
For the training process, Llama Nemotron Nano VL leverages a comprehensive and diverse collection of datasets. This aggregated dataset, comprising a total of ~1.5M samples, includes publicly available, synthetically generated, and internally curated datasets. A summary of the datasets used during the pre-training stage is presented in Figure 1.

Supervised Fine-Tuning
In the Supervised Fine-Tuning stage, Llama Nemotron Nano VL is trained end-to-end on a mixture of synthetic, public, and internally curated datasets. The data encompasses a wide spectrum of tasks, including but not limited to OCR, text grounding, table parsing, and general document-based VQA.
The document understanding capabilities of Llama Nemotron Nano VL can be largely attributed to the OCR-focused SFT data mix. Beyond plain OCR, many of the datasets involve tasks such as predicting the correct reading order, reconstructing markdown formatting along with semantic classes (such as captions, titles, and section headers) and bounding boxes of individual text blocks. The model is also trained to parse mathematical formulas in LaTeX format and to extract tables in LaTeX, HTML, or markdown format, depending on the prompt.
To ensure robustness across diverse domains, we apply affine and photometric augmentations to the document images. To further improve table and chart parsing performance, we enable swapping of tables and charts embedded in full-page documents between the datasets. This enables the model to handle diverse document layouts and structures.
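Affine and photometric perturbations of document images can be expressed with standard torchvision transforms. The pipeline below is representative only; the exact transforms and parameters used for Llama Nemotron Nano VL are not published:

```python
import torchvision.transforms as T

# Representative document-image augmentation: mild affine distortions
# (rotation, translation, scale, shear) plus photometric jitter
# (brightness, contrast, saturation). Parameter values are illustrative.
augment = T.Compose([
    T.RandomAffine(degrees=2, translate=(0.02, 0.02),
                   scale=(0.95, 1.05), shear=2),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),
])

# augmented = augment(page_image)  # page_image: a PIL image of a page
```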
A large portion of the internally created datasets are based on NeMo Retriever Parse training data. These include NVPDFTex, a collection of arXiv documents with ground-truth labels consisting of formatted text in reading order, with bounding boxes and semantic classes of text, as well as LaTeX tables and equations; Common Crawl PDFs that are labeled by human annotators; rendered text from Wikipedia with markdown formatting and tables; and a variety of synthetic datasets targeted at improving table parsing capabilities and dense OCR. In addition, the training mix includes a variety of publicly available datasets, such as DocLayNet, FinTabNet, and PubTables-1M, for which we refine the ground-truth labels.
Figure 2 below shows the task distribution of the training data. As can be seen, a significant portion of the training samples involve OCR combined with grounding and table parsing, as well as OCR-adjacent VQA tasks.

Post-Training Process
Llama Nemotron Nano VL was trained using NVIDIA Megatron and utilizes efficient Transformer implementations from NVIDIA Transformer Engine. For multimodal dataloading, we use Megatron Energon. We provide example Megatron training and inference scripts along with hyperparameters and other instructions to enable custom training of VLMs.
Examples
Table Extraction
VQA with Grounding
Text Extraction
Recommended prompts
To ensure the output is formatted exactly as you want, we recommend including detailed instructions in your prompts. We have provided some examples below to illustrate how this works for different tasks:
Document extraction in reading order along with grounding and semantic classes
Parse this document in reading order as mathpix markdown with LaTeX equations and tables. Fetch the bounding box for each block along with the corresponding category from the following options: Bibliography, Caption, Code, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, TOC (Table-of-Contents), Table, Text and Title. The coordinates should be normalized ranging from 0 to 1000 by the image width and height.
Your answer should be in the following format:\n[{"bbox": [x1, y1, x2, y2], "category": category, "content": text_content}...].
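Since the requested answer is a JSON-style list of blocks, the response can be parsed directly. A minimal sketch, assuming the model followed the format exactly (the function name is ours; production code should handle malformed output):

```python
import json

def parse_blocks(response: str):
    """Parse the model's reading-order output into blocks with a
    bounding box, semantic category, and text content."""
    blocks = json.loads(response)
    for block in blocks:
        x1, y1, x2, y2 = block["bbox"]  # normalized 0-1000 coordinates
        print(f'{block["category"]:>14} @ ({x1},{y1},{x2},{y2}): '
              f'{block["content"][:60]}')
    return blocks
```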
Table extraction for RD-TableBench
Convert the image to an HTML table. The output should begin with <table>. Specify rowspan and colspan attributes when they are greater than 1. Do not specify any other attributes. Only use the b, br, tr, th, td, sub and sup HTML tags. No additional formatting is required.
Table extraction with grounding
Transcribe the tables as HTML and extract their bounding box coordinates. The coordinates should be normalized ranging from 0 to 1000 by the image width and height, and the answer should be in the following format:\n[(x1, y1, x2, y2, html table), (x1, y1, x2, y2, html table)...].
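A minimal sketch for pulling the boxes and HTML out of a response in this format, assuming each entry is well formed (the regex and function name are ours):

```python
import re

def parse_grounded_tables(response: str):
    """Extract (x1, y1, x2, y2, html) tuples from a response in the
    format requested above. A production parser should be more
    defensive about deviations from the format."""
    pattern = re.compile(
        r"\((\d+),\s*(\d+),\s*(\d+),\s*(\d+),\s*(<table>.*?</table>)\)",
        re.DOTALL,
    )
    return [
        (int(x1), int(y1), int(x2), int(y2), html)
        for x1, y1, x2, y2, html in pattern.findall(response)
    ]
```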
OCRBench v2 Benchmark: A Closer Look
OCRBench v2 is a sophisticated benchmark designed to judge OCR models across a various range of real-world document types and layouts. It includes over 10,000 human-verified question-answer pairs to carefully assess a model’s capabilities in visual text localization, table parsing, diagram reasoning and key-value extraction.
Llama Nemotron Nano VL outperforms other VLMs on this benchmark and also achieves strong accuracy on benchmarks such as ChartQA and AI2D, making it a compelling option for enterprises aiming to automate document workflows such as:
- Invoice and receipt processing
- Compliance and identity document analysis
- Contract and legal document review
- Healthcare and financial document processing
Its combination of high accuracy, strong layout-aware reasoning, and efficient deployment on a single GPU makes it an ideal choice for large-scale enterprise automation.
Advanced Use Cases for Llama Nemotron Nano VL
Llama Nemotron Nano VL is optimized for a variety of document processing tasks across multiple industries. Here are some of the key use cases where the model excels:
1. Invoice and Receipt Processing
Automating the extraction of line items, totals, dates, and other key data points from invoices and receipts. This is crucial for accounting, ERP integration, and expense management.
2. Compliance Document Analysis
Extracting structured data from passports, IDs, and tax forms for regulatory compliance and KYC processes.
3. Contract Review
Automatically identifying key clauses, dates, and obligations in legal documents.
4. Healthcare and Insurance Automation
Extracting patient data, claim information, and policy details from medical records and insurance forms.
Get Started with Llama Nemotron Nano VL
Llama Nemotron Nano VL provides developers with the tools to automate document processing workflows at scale. It is available through the NVIDIA NIM API and for download on Hugging Face, where you can start building production-ready document understanding applications. You can also use NVIDIA NeMo to fine-tune the model on your own dataset.
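As a quick-start sketch, NIM endpoints are OpenAI-compatible, so a document image can be sent with a prompt as shown below. The endpoint URL, model id, and image-embedding convention here follow NVIDIA API catalog patterns but should be verified against the model’s catalog page:

```python
import base64
import os
from openai import OpenAI

# Endpoint URL and model id follow NVIDIA API catalog conventions;
# verify both on the model's catalog page before relying on them.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

# Embed the document image as base64 in the prompt (a common NIM
# convention for vision models; confirm it applies to this model).
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",  # assumed id
    messages=[{
        "role": "user",
        "content": "Extract the line items from this invoice as a markdown "
                   f'table. <img src="data:image/png;base64,{image_b64}" />',
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```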
Hands-On Tutorial: Building an Invoice/Receipt Document Intelligence Notebook and Video
The tutorial will walk you through:
- Setting up the environment for using Llama Nemotron Nano VL.
- Processing invoices and receipts to automatically extract and organize data.
- Optimizing your solution to handle large-scale document workflows.
Conclusion
Llama Nemotron Nano VL is a powerful multimodal model designed to meet the demanding needs of intelligent document processing in modern enterprises. Whether you’re processing invoices, contracts, or compliance documents, this model provides the accuracy, efficiency, and scalability required for high-performance document understanding.
For a hands-on experience, check out our tutorial on invoice and receipt document intelligence, and start leveraging the full power of Llama Nemotron Nano VL today.
Contributors
Amala Sanjay Deshmukh*, Kateryna Chumachenko*, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Lukas Voegtle, Philipp Fischer,
Jarno Seppanen, Ilia Karmanov, Guo Chen, Zhiqi Li, Guilin Liu, Zhiding Yu, Danial Mohseni Taheri, Pritam Biswas, Hao Zhang, Yao Xu, Mike Ranzinger, Greg Heinrich, Pavlo Molchanov, Jason Lu, Hongxu Yin, Sean Cha, Subhashree Radhakrishnan, Ratnesh Kumar, Zaid Pervaiz Bhat, Daniel Korzekwa, Sepehr Sameni, Boxin Wang, Zhuolin Yang, Nayeon Lee, Wei Ping, Wenliang Dai, Katherine Luna, Michael Evans, Leon Derczynski, Erick Galinkin, Akshay Hazare, Padmavathy Subramanian, Alejandra Rico, Amy Shen, Annie Surla,
Katherine Cheung, Saori Kaji, Meredith Price, Bo Liu, Benedikt Schifferer, Jean-Francois Puget, Oluwatobi Olabiyi, Karan Sapra,
Timo Roman, Jan Kautz, Andrew Tao, Bryan Catanzaro
* Equal Contribution