NVIDIA Vera CPU Delivers High Performance, Bandwidth, and Efficiency for AI Factories

AI is evolving, and reasoning models are increasing token demand, placing new requirements on every layer of AI infrastructure. More than ever, compute must scale efficiently to maximize token production and improve productivity for model creators and users.

Modern GPUs operate at peak capability, pushing throughput higher with every generation, but system performance is increasingly gated by the CPU-bound serial tasks inside an agentic loop, a classic example of a core computer science principle: Amdahl's law.

This dynamic is particularly visible in two classes of workloads: reinforcement learning (RL) for training models with new specialized skills such as coding or engineering, and agentic actions, which enable AI agents to use tools like web browsers, databases, code interpreters, and other software to complete tasks in real environments, or sandboxes.

Both workloads mix two historically separate CPU characteristics. Individual environments require strong single-threaded performance to execute complex code quickly, much like a workstation. At the same time, modern AI systems launch thousands of these environments concurrently, creating large-scale throughput demands typical of server infrastructure.
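To make that combination concrete, here is a minimal sketch, with a placeholder subprocess standing in for a real sandbox, of how an AI system might fan out many single-threaded environments at once:

```python
import concurrent.futures
import subprocess

def run_sandbox(task_id: int) -> str:
    # Each environment is a lightly threaded, workstation-like job;
    # this subprocess is a placeholder for real build/run/test work.
    result = subprocess.run(
        ["python3", "-c", f"print('task {task_id} done')"],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # Server-like throughput: launch a large batch of environments
    # concurrently, one worker per available core.
    with concurrent.futures.ProcessPoolExecutor(max_workers=88) as pool:
        results = list(pool.map(run_sandbox, range(1024)))
    print(f"completed {len(results)} sandboxed tasks")
```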

The NVIDIA Vera CPU is designed for modern AI workloads, with key design features including:

  • Extreme single-core performance: Fast execution of individual tasks is critical, and performance must hold up under constant load with many concurrent users and agentic tasks.
  • High memory and fabric bandwidth per core: Ensures consistent SLAs under load by moving large volumes of data efficiently for real-time analytics and context-switching tasks.
  • Efficient rack-scale co-design: AI factories must rapidly deploy and manage capacity to meet agentic demand while maximizing power efficiency.

Data centers built with Vera maximize AI infrastructure investments, whether Vera CPUs are directly connected to accelerators or performing tasks on standalone CPU capacity at the end of a wire.

The post-training reality

Reinforcement learning requires models to constantly evaluate their outputs, recognizing which results succeed or fail. For instance, models learning to do software development generate large amounts of code using models running on accelerators, which is then shipped to clusters of CPUs to build, run, and test, acting in a feedback-reward loop (see Figure 1).

These tasks span codebase research, compilation, runtime execution, scripting, data conversion, and other common operations. Overall, this flow requires many concurrent sandbox-like environments, each with a full complement of tools. Often, a single CPU core executes each lightly threaded case end-to-end from a set of accelerator-generated requests.
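One common way to realize that one-core-per-case pattern on Linux is to pin each worker process to its own core with the standard os.sched_setaffinity call; the worker body below is a hypothetical stand-in for the real build/run/test flow:

```python
import os
import multiprocessing as mp

def sandbox_worker(core_id: int, request: str) -> None:
    # Pin this process to a single core so the case runs end-to-end
    # on that core (Linux-specific API).
    os.sched_setaffinity(0, {core_id})
    # Hypothetical placeholder for compiling, running, and testing
    # the accelerator-generated request.
    print(f"core {core_id} handling {request}")

if __name__ == "__main__":
    requests = [f"accelerator-request-{i}" for i in range(8)]
    workers = [
        mp.Process(target=sandbox_worker, args=(i, r))
        for i, r in enumerate(requests)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```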

To maximize accelerator utilization and enable rapid model iteration, the token generation and training phases of the cycle operate on a tight schedule (or policy). Often, some evaluation jobs running on a CPU finish too late to influence the next step in the cycle. When this happens, it takes the model longer to learn to the same quality, and useful tokens are wasted.
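A minimal sketch of that deadline effect, assuming an asyncio-based scheduler and a placeholder evaluation job: evaluations that miss the policy window are cancelled, and their tokens contribute nothing to the update.

```python
import asyncio
import random

async def evaluate(rollout_id: int) -> int:
    # Placeholder CPU-side evaluation job with variable runtime.
    await asyncio.sleep(random.uniform(0.1, 2.0))
    return rollout_id

async def training_step(deadline_s: float = 1.0) -> None:
    rollouts = [asyncio.create_task(evaluate(i)) for i in range(32)]
    # Only results that land inside the policy window feed the next
    # training step; late evaluations are cancelled and wasted.
    done, pending = await asyncio.wait(rollouts, timeout=deadline_s)
    for task in pending:
        task.cancel()
    print(f"{len(done)} rewards used, {len(pending)} rollouts wasted")

if __name__ == "__main__":
    asyncio.run(training_step())
```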

Agentic loops demand a unique mix of high single-core performance, massive data bandwidth, and deterministic execution with minimal tail latency from the CPUs they employ.

These requirements are a central focus of the NVIDIA Vera CPU design (Figure 2), which delivers up to 50% faster sandbox performance compared to competitive platforms, 1.2 TB/s of memory bandwidth, and 88 Olympus cores with NVIDIA Spatial Multithreading (SMT) for the task concurrency necessary for AI factories.

NVIDIA Olympus core

The need for higher-performance cores that support AI led to the NVIDIA Olympus core, the first fully custom data center CPU core from NVIDIA. Olympus debuts in Vera alongside the second generation of the NVIDIA Scalable Coherency Fabric (SCF), originally developed for the NVIDIA Grace CPU.

Built for sustained high instructions-per-cycle (IPC) operation on memory-intensive workloads with control-flow logic, Olympus uses a 10-wide instruction fetch and decode frontend and a neural branch predictor capable of evaluating two taken branches per cycle. It is fully compatible with the Armv9.2 instruction set and existing software, delivering high performance on Arm-based containers, binaries, libraries, and operating systems.

Users can choose between performance-per-thread and thread count at runtime with NVIDIA SMT. This gives each thread stable performance, stronger isolation, and predictable tail latency under heavy load. Traditional SMT relies on time-shared resources and frequent context switching between threads, introducing performance variation.
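NVIDIA has not published the control interface here, so as an illustration only, the sketch below uses the generic Linux SMT switch, which already lets operators trade thread count for per-thread performance at runtime:

```python
from pathlib import Path

# Generic Linux runtime SMT control; whether NVIDIA SMT on Vera is
# managed through this exact interface is an assumption.
SMT_CONTROL = Path("/sys/devices/system/cpu/smt/control")

def set_smt(enabled: bool) -> None:
    # Requires root. "off" favors per-thread performance and isolation;
    # "on" favors total thread count.
    SMT_CONTROL.write_text("on" if enabled else "off")

def smt_state() -> str:
    return SMT_CONTROL.read_text().strip()

if __name__ == "__main__":
    print(f"current SMT state: {smt_state()}")
```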

NVIDIA Scalable Coherency Fabric and memory subsystem

The Vera CPU is built on a single monolithic compute die and fabric, with adjacent dielets implementing memory and I/O subsystems while preserving the uniformity of the compute topology.

From the perspective of an application, every core is the same practical distance from resources like other cores, caches, memory, and networking, and is provisioned with uniform, high-throughput bandwidth. Most latency-sensitive operations remain local, avoiding the unnecessary cross-die traffic typically observed on traditional CPUs.

The runtime paths of agentic tasks, analytics operations, KV and blob caches, orchestration, and control planes are inherently unpredictable in an AI factory. In traditional implementations, the topology of the processor and the usage patterns of neighboring tasks running on it must be considered ahead of time to maximize application performance. The Vera design enables optimal performance without this kind of tuning.

The second-generation SCF connects all 88 Olympus cores to a shared L3 cache and memory subsystem, delivering consistent latency and 3.4 TB/s of bisection bandwidth, enabling the Vera CPU to sustain over 90% of peak memory bandwidth under load. Each core is provisioned with up to 14 GB/s of memory bandwidth, roughly 3x the per-core rate of traditional data center CPUs, ensuring Extract-Transform-Load (ETL), real-time analytics, and memory-bound workloads maintain throughput when every core is active.
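The per-core figure follows from the socket totals above; a quick back-of-the-envelope check:

```python
total_bandwidth_gbs = 1200   # 1.2 TB/s socket memory bandwidth, in GB/s
olympus_cores = 88           # cores per Vera CPU

per_core_gbs = total_bandwidth_gbs / olympus_cores
print(f"{per_core_gbs:.1f} GB/s per core")  # ~13.6 GB/s, i.e. ~14 GB/s peak
```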

Feeding the SCF is Vera's second-generation LPDDR5X memory subsystem, delivering up to 1.2 TB/s of total bandwidth at less than half the memory power of traditional DDR configurations, and up to 1.5 TB of capacity, a 3x increase over the prior generation. Small Outline Compression-Attached Memory Modules (SOCAMM) bring low-power memory into the data center for the first time, replacing soldered memory with removable, upgradable modules that combine LPDDR efficiency with server-class serviceability.

Performance across the AI factory 

All these architectural elements enable the Vera CPU to deliver up to 1.5x the agentic sandbox performance under full-socket load compared to competitive x86 platforms across compilers, scripting tools, runtime engines, compression, and agentic tool calls (Figure 3).

This advantage compounds across three dimensions. In RL post-training, a 1.5x faster sandbox returns evaluation results within tighter time windows, enabling models to capture the best gradient tokens and accelerating training cycles.

In agentic inference, it reduces users’ wait time, improving accelerator utilization and easing pressure on KV cache offloading. 

For frontier training problems, 50% higher single-core performance means more sequential tests complete before hitting time limits, expanding the range of hard problems a model can learn from.

Agentic environments by the rack

Every AI factory requires millions of CPU cores to enable the agentic loop of RL and tool use. To unlock the potential of AI infrastructure, deployment must be rapid. For many AI factory operators, the Vera CPU will be the first in their fleet, arriving in data centers designed for high rack power and liquid cooling.

The new NVIDIA Vera CPU Rack offers exceptional density and performance within the same planning constraints, rack infrastructure, cooling, and power as the NVL72 products being deployed today.

With a capacity of more than 22.5K sandboxes, the Vera CPU Rack delivers over 4x the capacity and 2x the performance per watt of x86-based server racks (Figure 4). AI factories deploy and manage capacity at the rack level, radically reducing build-out times and improving time-to-market for new capacity while simplifying site planning.
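The 22.5K figure is consistent with the rack configuration in Table 1, assuming (for illustration) one sandbox per Olympus core:

```python
vera_cpus_per_rack = 256   # from Table 1: up to 256 Vera CPUs per rack
cores_per_cpu = 88         # Olympus cores per Vera CPU

# Assumption: one sandbox per core.
sandboxes_per_rack = vera_cpus_per_rack * cores_per_cpu
print(f"{sandboxes_per_rack} sandboxes per rack")  # 22,528, roughly 22.5K
```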

Each Vera CPU is connected with NVIDIA BlueField-4 SmartNICs containing dedicated Grace-based management cores, offloading networking tasks like security and management, and ensuring the most performant capacity in the system is fully available to agentic tasks.

Vera platforms and configurations 

In addition to the Vera CPU Rack, NVIDIA has engineered a complete family of Vera-based platforms for the varied workloads of modern AI factories. By delivering many choices of density, cooling, configuration, and form factor, Vera's design and system partners enable rapid deployment and capacity build-out, adaptable to the space constraints of any data center facility.

  • NVIDIA Vera Rubin NVL72: Integrated AI factory rack that tightly couples Vera host CPUs and Rubin GPUs through high-bandwidth NVIDIA NVLink-C2C and the NVIDIA NVLink scale-up fabric. Scenarios: large-scale AI factories, frontier model training, reasoning, and high-throughput inference.
  • NVIDIA Vera CPU Rack: Liquid-cooled (LC) CPU rack architecture with up to 4 nodes per 1U tray, scaling to 256 Vera CPUs per rack for dense, efficient compute, building capacity rapidly at rack scale alongside NVL72. Scenarios: AI factory infrastructure, agentic pipelines, orchestration layers, data processing, HPC, and CPU-dense services.
  • Single- and dual-socket Vera platforms: Flexible server platforms built around one or two Vera CPUs, with up to 1.5 TB of LPDDR5X per socket and 1.8 TB/s of NVLink-C2C between CPUs in dual-socket designs, suitable for any facility. Scenarios: cloud infrastructure, enterprise, analytics, storage, HPC, NVIDIA PCIe GPU-equipped servers, and AI factories.
  • NVIDIA HGX Rubin NVL8: Accelerated computing platform pairing Vera host CPUs with Rubin GPUs over PCIe, enabling balanced CPU-GPU performance across multiple server designs. Scenarios: AI inference, technical computing, analytics, and enterprise HPC deployments.
Table 1. Vera platform options for modern AI factories

Platform availability 

Vera systems will be available from major OEMs, including Cisco, Dell, HPE, Lenovo, and Supermicro, in the second half of 2026. See the Vera CPU webpage for more details.

Learn more about the Vera CPU and Vera Rubin.

Figure: NVIDIA Vera performance compared to AMD EPYC Turin and Intel Xeon 6 Granite Rapids across a range of workloads, including code compilation, interpreters, scripting, runtime engines, ETL, data analytics, and graph processing.


