We’re thrilled to release Holotron-12B, a multimodal computer-use model from H Company. Post-trained from the open NVIDIA Nemotron-Nano-2 VL model on H Company’s proprietary data mixture, Holotron-12B is the result of a close collaboration between our research labs to engineer a new kind of model optimized primarily for scale and performance in production.
H Company is a member of the NVIDIA Inception Program.
The model is now available on Hugging Face.
Most multimodal models today optimize primarily for static vision tasks or instruction following. Holotron-12B, like our Holo2 model, has a different goal: serving as a policy model for computer-use agents that must perceive, decide, and act efficiently in interactive environments.
With Holotron-12B, we set out to build a model that could scale efficiently and cost-effectively in production while handling long contexts with multiple images, and still perform well on agent benchmarks. The NVIDIA Nemotron model offered a strong foundation on the inference side, and by developing Holotron-12B we have demonstrated how much more the model can accomplish with further training.
High Throughput Inference with a Hybrid SSM Architecture
Holotron-12B’s significant leap in inference efficiency is made possible by its foundational Nemotron architecture, which combines a hybrid State-Space Model (SSM) with an attention mechanism. Unlike purely transformer-based models, this design is optimized for high-throughput serving. State-space models offer superior scalability for long-context inference by avoiding the quadratic computation cost associated with full attention, which particularly benefits agentic workloads involving multiple images and lengthy interaction histories. For inference, the key contribution of an SSM is its dramatically reduced memory footprint: while vanilla attention stores K and V activations per token and layer (the notorious KV cache), SSMs are linear recurrent models, storing only a constant-size state per layer per generated sequence, independent of sequence length.
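The memory gap described above can be made concrete with a back-of-envelope calculation. The sketch below uses hypothetical layer counts and dimensions for illustration only, not the actual Holotron-12B configuration: a KV cache grows linearly with sequence length, while an SSM recurrent state stays constant.

```python
# Illustrative per-sequence decode-time memory: full-attention KV cache
# vs. a fixed-size SSM recurrent state. All dimensions are assumed toy
# values, not the real Holotron-12B architecture.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """K and V activations stored for every token at every layer."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

def ssm_state_bytes(n_layers, state_dim, dtype_bytes=2):
    """One constant-size recurrent state per layer, independent of length."""
    return n_layers * state_dim * dtype_bytes

# Assumed toy dimensions for illustration only.
layers, kv_heads, head_dim, state_dim = 40, 8, 128, 128 * 64

for seq_len in (1_000, 32_000, 128_000):
    kv = kv_cache_bytes(seq_len, layers, kv_heads, head_dim)
    ssm = ssm_state_bytes(layers, state_dim)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:8.1f} MiB, "
          f"SSM state {ssm / 2**20:6.1f} MiB")
```

At long contexts the KV cache dominates per-sequence memory by orders of magnitude, which is exactly where agentic workloads with many images and long histories operate.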
When evaluated on the WebVoyager benchmark, the model excels at a real-world multimodal agentic workload featuring long context, multiple high-resolution images, and high request concurrency (100 concurrent benchmark workers). Running on a single H100 GPU using vLLM with the latest SSM optimizations (v0.14.1), Holotron-12B achieved over 2x higher throughput compared with Holo2-8B. This makes Holotron-12B an attractive choice for throughput-bound workloads such as data generation, annotation, and online reinforcement learning.

In a controlled experiment setup (see figure 2), Holotron-12B continues to scale efficiently as concurrency increases, with total token throughput rising steadily to 8.9k tokens/s at a maximum concurrency of 100. In contrast, the total token throughput of Holo2-8B plateaus much earlier, at 5.1k tokens/s. This behaviour highlights a key strength of the Nemotron architecture: simpler, more efficient VRAM utilization and a smaller overall memory footprint, which allow much larger effective batch sizes on the same hardware. Even at large batch sizes, Holotron-12B maintains strong throughput.
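The link between a smaller per-sequence footprint and a larger effective batch size is simple division. The sketch below uses purely hypothetical numbers (an assumed free-VRAM budget and assumed per-sequence footprints, not measured values for either model) to show why a constant-size state lets a server hold many more concurrent sequences.

```python
# Hypothetical illustration: with a fixed VRAM budget left after model
# weights, the number of sequences a server can hold concurrently is
# budget / per-sequence state size. All numbers below are assumptions.

def max_concurrent_sequences(vram_budget_gib, per_seq_bytes):
    return int(vram_budget_gib * 2**30 // per_seq_bytes)

budget_gib = 20                 # assumed free VRAM after weights on one GPU
kv_cache_per_seq = 4 * 2**30    # assumed multi-GiB KV cache per long sequence
ssm_state_per_seq = 64 * 2**20  # assumed small constant recurrent state

print(max_concurrent_sequences(budget_gib, kv_cache_per_seq))   # → 5
print(max_concurrent_sequences(budget_gib, ssm_state_per_seq))  # → 320
```

Under these toy assumptions the constant-state design fits 64x more concurrent sequences into the same budget, which is why throughput keeps scaling with concurrency instead of plateauing.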

Training and Evaluating Holotron-12B
Holotron-12B was trained in two stages. We began from Nemotron-Nano-12B-v2-VL-BF16, a multimodal base model published by NVIDIA. We then performed supervised fine-tuning on H Company’s proprietary localization and navigation data mixture, focusing on screen understanding, grounding, and UI-level interactions.
The final checkpoint was trained on roughly 14 billion tokens.
Agent Benchmarks
On computer-use and navigation benchmarks, Holotron-12B shows strong improvements over the Nemotron base model and competitive performance with established agent models. Its WebVoyager score increased from 35.1% to 80.5%, exceeding Holo2-8B’s performance on the benchmark and illustrating the model’s ability to operate effectively in an agentic setting.

Localization Benchmarks
Holotron-12B also improves substantially over the base Nemotron model on localization and grounding benchmarks such as OS-World-G, GroundUI, and WebClick.

Holotron-12B demonstrates that the NVIDIA Nemotron VL model provides a strong foundation for real-world multimodal agents when paired with the right training setup and infrastructure work.
The model offers strong agent performance, significantly improved inference throughput, and a clear path for future improvements, particularly around higher-resolution vision training.
We look forward to seeing what others build with Holotron-12B. The model and checkpoints are available now on Hugging Face under an NVIDIA Open Model License.
NVIDIA today announced the release of Nemotron 3 Omni. Building on the success of Holotron-12B, we are preparing to post-train this next generation of multimodal models. By leveraging the improved hybrid SSM-Attention and MoE architectural foundations of the Nemotron 3 family, we aim to deliver even greater leaps in reasoning capabilities and multimodal precision with the newly announced Nemotron 3 Omni. As this evolution pushes Holotron beyond research and into commercial application, it will provide enterprises with the high-throughput, low-latency performance required for massive-scale autonomous “computer use” deployments.
