Artificial intelligence is token-driven. Every prompt, reasoning step, and agent interaction generates tokens. Over the past year, token consumption has grown manyfold and now exceeds 10 quadrillion tokens per year. And while the vast majority of tokens to date have been generated by humans interacting with AI, the new era is one in which most tokens will be generated by AI interacting with AI.
Modern agentic systems plan tasks, invoke tools, execute code, retrieve data, and coordinate across continuous multistep workflows with numerous AI agents. These interactions generate large volumes of reasoning tokens, expand KV cache, and require CPU-based sandboxed environments to test and validate results generated by accelerated computing systems. This places low-latency, high-throughput demands across GPUs, CPUs, scale-up domains, scale-out networks, and storage.
Delivering useful intelligence for these modern agentic systems requires fleets of purpose-built rack-scale systems that function together as one coherent AI supercomputer. This post introduces the NVIDIA Vera Rubin POD, a set of five specialized rack-scale systems built on the third-generation NVIDIA MGX rack architecture for the era of agentic AI.
Introducing NVIDIA Vera Rubin POD
Built through extreme co-design of seven chips spanning compute, networking, and storage, NVIDIA Vera Rubin introduces the most sophisticated POD-scale AI platform. The platform features 40 racks, 1.2 quadrillion transistors, nearly 20,000 NVIDIA dies, 1,152 NVIDIA Rubin GPUs, 60 exaflops, and 10 PB/s total scale-up bandwidth.
The Vera Rubin POD introduces five new, distinct, purpose-built rack-scale systems for agentic AI workloads that require high throughput, extreme low-latency inference, dense CPU sandboxing, and massive context memory storage. Together, these racks form one cohesive system that can power the world's most energy- and cost-efficient data centers.


Every chip in the POD scales with a third-generation NVIDIA MGX rack, supported by an ecosystem of more than 80 partners and a global supply chain experienced in bringing large-scale AI systems to market. This enables fast deployments and seamless transitions, with every NVIDIA MGX rack sharing the same power, cooling, and mechanical envelopes.
There are two types of MGX racks, both with copper spines designed for performance, resiliency, and energy efficiency. The MGX NVL rack is connected by NVIDIA NVLink, and the new NVIDIA MGX ETL rack is connected by one of two types of spines: NVIDIA Spectrum-X Ethernet or NVIDIA Groq 3 LPU direct chip-to-chip links.
NVIDIA Vera Rubin NVL72: Platform for the four scaling laws
NVIDIA Vera Rubin NVL72 is the core rack-scale compute engine of the latest AI factory. Integrating 72 NVIDIA Rubin GPUs and 36 NVIDIA Vera CPUs connected through a massive NVLink copper spine, it acts as one giant GPU. NVIDIA Vera Rubin NVL72 is designed for the four scaling laws of AI: pretraining, post-training, test-time scaling, and agentic scaling. It is optimized for complex mixture-of-experts (MoE) routing and the heavy, compute-bound context phase of AI inference. It delivers up to 4x higher training performance, up to 10x higher inference performance per watt, and one-tenth the token cost relative to NVIDIA Blackwell.
NVIDIA Groq 3 LPX: Inference accelerator racks
Co-designed with the NVIDIA Vera Rubin platform for the massive context and low-latency demands of agentic AI, NVIDIA Groq 3 LPX features 256 language processing units (LPUs) per rack. It pairs with Vera Rubin NVL72 to eliminate the tradeoff between high-speed interactivity and throughput. By fusing high-bandwidth SRAM-only LPUs with Rubin GPUs and their large HBM capacity, the system delivers low latency and high throughput at long context lengths—supercharging user interactivity for trillion-parameter models without sacrificing system throughput. Vera Rubin NVL72 plus LPX delivers up to 35x more tokens and up to 10x more revenue opportunity for trillion-parameter models relative to Blackwell. To learn more, see Inside NVIDIA Groq 3 LPX.
NVIDIA Vera CPU rack: Agentic AI and reinforcement learning at scale
The NVIDIA Vera CPU rack integrates up to 256 NVIDIA Vera CPUs in a dense, liquid-cooled rack to provide scalable, energy-efficient capacity. A single rack can sustain over 22,500 concurrent reinforcement learning (RL) or agent sandbox environments, maximizing the environments available to test, execute, and validate results from the Vera Rubin NVL72 and LPX racks. Vera CPU racks provide the foundation for large-scale agentic AI and reinforcement learning, delivering results twice as efficiently and 50% faster than traditional rack-scale CPUs. Learn more about how the Vera CPU delivers high-performance bandwidth and efficiency for AI factories.
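The pattern of fanning out many validation sandboxes across CPU cores can be sketched in a few lines. This is an illustrative toy only: a real agentic or RL sandbox runs untrusted, model-generated code under OS-level process or container isolation, while this sketch uses threads and a trivial ground-truth check.

```python
from concurrent.futures import ThreadPoolExecutor

def run_sandbox(task_id: int) -> bool:
    """Execute and validate one candidate result in isolation.

    A real sandbox would run untrusted, model-generated code in a
    separate container; here we just validate a toy computation.
    """
    candidate = sum(range(task_id + 1))      # stand-in for a model's answer
    expected = task_id * (task_id + 1) // 2  # closed-form ground truth
    return candidate == expected

def validate_batch(n_tasks: int, n_workers: int = 32) -> float:
    """Fan validation tasks out across workers and return the pass rate."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_sandbox, range(n_tasks)))
    return sum(results) / n_tasks
```

At rack scale, the same fan-out shape applies with tens of thousands of environments spread across 256 CPUs rather than threads in one process.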
NVIDIA BlueField-4 STX: AI-native storage
The NVIDIA BlueField-4 STX rack is built with the NVIDIA BlueField-4 processor, which combines the Vera CPU and ConnectX-9 SuperNIC, and scales out with Spectrum-X Ethernet networking.
It hosts the NVIDIA CMX context memory storage platform, a new class of AI-native storage infrastructure that seamlessly extends GPU context capacity across the POD and accelerates inference by offloading KV cache into a dedicated, high-bandwidth storage layer. CMX is optimized to store and serve massive context memory (KV cache), treating temporary inference context as an AI-native, shared data type that can be reused across turns, sessions, and agents. This delivers up to 5x higher tokens per second and up to 5x higher power efficiency than traditional storage approaches.
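The core idea of a tiered KV cache can be illustrated with a small sketch: a fast "HBM" tier of limited size, backed by a larger storage tier that absorbs cold contexts and serves them back when a session returns. All names and sizes here are assumptions for illustration, not the CMX API.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small, fast 'HBM' tier backed by a
    large 'context storage' tier. Illustrative only."""

    def __init__(self, hbm_slots: int):
        self.hbm = OrderedDict()   # fast tier, kept in LRU order
        self.storage = {}          # large, slower tier
        self.hbm_slots = hbm_slots

    def put(self, session_id: str, kv_blocks: list):
        self.hbm[session_id] = kv_blocks
        self.hbm.move_to_end(session_id)
        while len(self.hbm) > self.hbm_slots:       # offload coldest context
            victim, blocks = self.hbm.popitem(last=False)
            self.storage[victim] = blocks

    def get(self, session_id: str):
        """Reuse cached context instead of recomputing prefill."""
        if session_id in self.hbm:
            self.hbm.move_to_end(session_id)
            return self.hbm[session_id], "hbm"
        if session_id in self.storage:              # fetch back into HBM
            self.put(session_id, self.storage.pop(session_id))
            return self.hbm[session_id], "storage"
        return None, "miss"                         # must recompute prefill
```

The payoff is that a returning turn, session, or agent hits either tier and skips the expensive prefill recomputation; only a true miss pays full price.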
NVIDIA Spectrum-6 SPX: Networking racks
Connecting the entire POD into a single supercomputer are the NVIDIA Spectrum-6 SPX networking racks. The Spectrum-6 SPX networking rack is engineered to accelerate east-west and north-south traffic across AI factories. Configurable with either Spectrum-X Ethernet or NVIDIA Quantum-X800 InfiniBand switches, it delivers low-latency, high-throughput rack-to-rack connectivity at scale.
The Spectrum-6 SPX rack now includes the 102.4 Tb/s Spectrum-6 switch, which features 512 lanes and 200 Gb/s co-packaged optics (CPO) in single- and multi-chip switch offerings. This silicon photonics integration replaces pluggable transceivers, delivering the highest power efficiency and resiliency, low latency and jitter, and nearly perfect effective bandwidth, keeping AI workloads across compute and storage environments perfectly synchronized.
By co-designing these purpose-built racks to operate as one, the Vera Rubin POD is positioned to accelerate every component of agentic AI workloads. This begins with the streamlined NVIDIA MGX rack design that forms the foundation of every rack in the POD.
Third-generation NVIDIA MGX rack-scale architecture
Production-grade AI racks must excel across several critical areas: rapid time to volume, proven performance at scale, deep hardware-software co-design, resiliency and energy efficiency, seamless data center deployment and logistics, readiness for future architectures, and more.
The third-generation NVIDIA MGX rack-scale architecture sets the standard across all of these categories with engineering breakthroughs integrated throughout its mechanical, power, and cooling design.
Enabling resiliency and scalability
The NVIDIA MGX rack prioritizes PCB-based connections with its single-wide design, unlocking completely modular, cable-free, hose-free, and fanless compute and NVLink switch trays for maximum reliability, scalability, and serviceability. Single 19-inch-wide racks also simplify shipping and logistics, accelerating deployment across AI factories.


The rack features a highly modular spine as its backplane, consisting of up to four preintegrated and prevalidated copper cable cartridges that connect every tray as one. The spine holds thousands of cables and shares the same mechanical form factor across both MGX NVL and MGX ETL racks.
Ensuring peak energy efficiency from chip to grid
At the component level, NVIDIA MGX racks feature dynamic power steering, where the system provisions power to the components that need it most. This feature can move power between the CPUs, GPUs, and NVLink switch trays to ensure components in the rack operate at peak energy efficiency, improving performance per watt.
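A minimal sketch of the power-steering idea: guarantee each component a floor, then split the spare rack budget in proportion to each component's excess demand, never granting more than it asked for. This is an illustration of the concept only, not NVIDIA's actual control algorithm, and all numbers are hypothetical.

```python
def steer_power(budget_w: float, demand_w: dict, floor_w: dict) -> dict:
    """Toy dynamic power steering for a rack power budget (watts).

    Each component gets its guaranteed floor; the remaining budget is
    shared in proportion to excess demand, capped at that demand.
    """
    alloc = dict(floor_w)
    spare = budget_w - sum(floor_w.values())
    excess = {k: max(demand_w[k] - floor_w[k], 0.0) for k in demand_w}
    total = sum(excess.values())
    for k in alloc:
        if total > 0:
            alloc[k] += min(excess[k], spare * excess[k] / total)
    return alloc
```

Under a GPU-heavy workload the GPUs absorb most of the spare budget; under a CPU-heavy sandboxing phase, the same budget flows the other way.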


AI training and inference workloads create large load swings. If not managed effectively, load swings could cause significant stress on the electrical grid, data center power infrastructure, and IT equipment.
To guard against power swings, MGX racks feature rack-level energy storage that cushions power transients with capacitors. When workloads demand a burst of power, the capacitors supply the extra power while the grid draw stays flat or ramps up gradually. When workloads suddenly stop, the capacitors recharge while the grid draw stays flat or ramps down.
NVIDIA Vera Rubin NVL72 now introduces Intelligent Power Smoothing. It features 6x more rack-level energy storage (400 J per GPU) versus prior generations, and introduces a new closed-loop system that allows the GPUs to continuously monitor the capacitors' state of charge to more efficiently flatten power profiles. This achieves much smaller AC power variation per minute, reduces peak current demands by up to 25%, and eliminates the need for massive battery packs to protect against large-scale power transients.
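The control loop can be sketched as a ramp-limited grid draw with capacitors absorbing the difference between instantaneous load and grid supply. This is a toy model of the idea, not the actual Vera Rubin firmware; the numbers and the simple controller are assumptions.

```python
def smooth_power(load_w, grid_w, max_ramp_w, cap_j, dt=1.0):
    """Toy closed-loop power smoothing.

    The grid draw chases the workload but may only change by
    max_ramp_w per step; the rack capacitors (cap_j joules) absorb
    the surplus or deficit between grid supply and load.
    """
    charge, trace = cap_j / 2, []
    for p in load_w:
        # Steer the grid draw toward the load, ramp-limited.
        step = max(-max_ramp_w, min(max_ramp_w, p - grid_w))
        grid_w += step
        # Surplus grid power charges the caps; deficits discharge them.
        charge = max(0.0, min(cap_j, charge + (grid_w - p) * dt))
        trace.append(grid_w)
    return trace, charge
```

Even when the workload steps instantly from idle to full power, the grid-side trace only ever moves by the ramp limit per step, which is exactly the flattened power profile the capacitors buy.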


At the power level, provisioning racks at a static Max-P strands power capacity that could otherwise be used to generate tokens. It assumes homogeneous workloads that always require peak power, when in reality AI factories run a mix of workloads with varying power needs.
By provisioning MGX racks at a lower, dynamic Max-Q level, data centers can maximize AI data center throughput by dynamically provisioning the right amount of power to each rack depending on the workload. This frees up stranded power, unlocks up to 30% more GPUs in the same power budget with 45°C liquid cooling, and boosts performance per watt.
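The arithmetic behind the claim is simple: for a fixed facility budget, lowering the per-rack provisioning level raises the rack count. The rack power figures below are hypothetical placeholders chosen only to show the shape of the calculation, not published specifications.

```python
def racks_supported(facility_w: int, per_rack_w: int) -> int:
    """How many racks a fixed facility power budget can host."""
    return facility_w // per_rack_w

# Hypothetical numbers for illustration -- not published rack specs.
facility = 100_000_000   # 100 MW facility power budget
max_p    = 140_000       # provision per rack for worst-case peak draw
max_q    = 107_000       # provision per rack for realistic mixed workloads

static  = racks_supported(facility, max_p)   # racks at static Max-P
dynamic = racks_supported(facility, max_q)   # racks at dynamic Max-Q
gain = dynamic / static - 1                  # roughly 30% more racks
```

The exact gain depends entirely on the real gap between peak and typical draw, which is what workload-aware dynamic provisioning exploits.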
Unlocking larger energy budgets for compute
All MGX racks are universally designed to operate with 45°C (113°F) warm-water inlet temperatures, so data centers already designed for liquid cooling are guaranteed a seamless transition without redesigning cooling infrastructure. Figure 5 shows a schematic representation of an infrastructure layout that provides 41°C (105.8°F) water to coolant distribution units (CDUs), which in turn supply coolant at 45°C (113°F) to the AI racks.


Operating at 45°C enables data centers in many climates to use ambient air and closed-loop dry coolers, reducing the need for compressors, driving down PUE, and unlocking larger energy budgets for compute. Lower inlet temperatures of 35°C require data centers to divert massive amounts of facility power or water to cooling, while higher inlet temperatures maximize the amount of grid power converted directly into tokens. This yields significant data center power savings—enough to allocate up to 10% additional Vera Rubin NVL72 racks for more token generation in the same power budget.
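The link between PUE and compute budget is direct: IT power is grid power divided by PUE, so shaving the cooling overhead frees a proportional slice for racks. The PUE values below are hypothetical, chosen only to illustrate the relationship.

```python
# Hypothetical PUE figures for illustration only.
grid_w = 100_000_000                  # total facility grid power (100 MW)
pue_chilled, pue_dry = 1.25, 1.12     # compressor-based vs dry cooling

it_chilled = grid_w / pue_chilled     # power left for IT with chillers
it_dry     = grid_w / pue_dry         # power left for IT with dry coolers
extra = it_dry / it_chilled - 1       # fractional gain in compute budget
```

With these placeholder numbers the dry-cooled facility frees roughly an extra tenth of its budget for compute, the same order as the article's 10%-more-racks figure.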
MGX racks can be 100% liquid-cooled, leveraging the same data center cooling infrastructure as prior generations. The third-generation MGX rack features new internal tray manifolds, rack UQD08 manifolds, and liquid-cooled busbars supporting up to 5,000 A. The coolant used for the rack will depend on the customer and data center, but many will continue to use de-ionized water or propylene glycol-based fluid (PG25), which can last up to 10 years in a closed-loop system with minimal liquid maintenance.
Open standard
Underpinning these features is an open, standardized MGX rack architecture. The first mass-production rack-scale system shipped with NVIDIA Blackwell in 2024. NVIDIA contributed the design to the Open Compute Project (OCP), reinforcing its commitment to open source technologies and enabling the entire ecosystem to rapidly innovate and accelerate adoption. NVIDIA has built an ecosystem of more than 80 global partners, creating a highly efficient, globally diversified supply chain that is experienced in bringing rack-scale AI systems to market.
NVIDIA MGX NVL racks
As independent third-party SemiAnalysis InferenceMax benchmarks demonstrate, NVIDIA rack-scale systems deliver 50x higher performance per watt and 35x lower cost per token (NVIDIA GB300 NVL72 versus NVIDIA H200), which translates directly into higher revenues and higher operating margins.
In 2024, NVIDIA shipped the first NVIDIA GB200 NVL72 rack-scale systems. In 2025, NVIDIA GB300 NVL72 shipped. Now, NVIDIA Vera Rubin NVL72 is in full production, on track to ship in the second half of 2026.
Streamlined design of NVIDIA Vera Rubin NVL72
NVIDIA Vera Rubin NVL72 is an engineering marvel designed to drop seamlessly into existing data center footprints. It will feature nearly 2x more transistors than NVIDIA GB200 NVL72 while delivering 10x more performance per watt through extreme co-design. The rack integrates 72 NVIDIA Rubin GPUs, 36 NVIDIA Vera CPUs, ConnectX-9 SuperNICs, and BlueField-4 DPUs across 18 compute trays, alongside nine NVLink switch trays. In total, the rack houses 1.3 million individual components and nearly 1,300 chips, all packed into a single-wide third-generation NVIDIA MGX rack weighing roughly 4,000 lbs, or about the weight of a pickup truck.


Compute and NVLink Switch trays
Enabling these 72 GPUs to act as a single unified engine is sixth-generation NVLink. It delivers 3.6 TB/s of bandwidth per GPU and 260 TB/s of scale-up bandwidth per rack—more than the bandwidth of the entire global internet. This high-speed data transfer happens in the NVLink spine at the back of the rack, which features four modular, preintegrated cable cartridges housing 5,000 copper cables over two miles in total length.
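The per-rack figure follows directly from the per-GPU figure, which is a quick sanity check worth making explicit:

```python
per_gpu_tbps = 3.6                 # NVLink bandwidth per Rubin GPU (TB/s)
gpus_per_rack = 72
scale_up_tbps = per_gpu_tbps * gpus_per_rack   # 259.2 TB/s, i.e. ~260 TB/s
```

So the quoted 260 TB/s per rack is the 72 GPUs' aggregate NVLink bandwidth, rounded.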
The compute trays inside the Vera Rubin NVL72 are completely redesigned from NVIDIA Blackwell. Each features a robust PCB midplane designed to fit in a single-wide rack, unlocking a cable-free, hose-free, and fanless design. This simplification drops compute tray assembly time from nearly two hours to just five minutes—up to 20x faster assembly and serviceability.
Each compute tray features two NVIDIA Vera Rubin superchips with 17,000 components each—roughly five times as many components as a contemporary smartphone. The superchips are connected to the front modular bays that house eight ConnectX-9 SuperNICs and one BlueField-4 DPU through the PCB midplane.


Vera Rubin NVL72 introduces new rack-scale resiliency features designed to maximize uptime and goodput for large AI clusters. The NVLink switch trays support operational resiliency features that allow administrators to place switches into maintenance mode and replace them while the rack continues operating. The architecture also supports continued operation even when multiple switch trays are unavailable, minimizing disruption during maintenance.
At the silicon level, NVIDIA Rubin GPUs continuously run nondisruptive health checks, and NVIDIA Vera CPUs feature in-system testing and SOCAMM memory for faster serviceability. Together, these chip-to-rack innovations reduce operational overhead and build on the resiliency improvements seen with Blackwell clusters.
NVIDIA Vera Rubin Ultra NVL576
NVIDIA Vera Rubin Ultra introduces a new two-layer, all-to-all NVLink topology that will enable developers to scale up to 576 GPUs. Vera Rubin Ultra NVL576 will combine eight separate MGX NVL racks, each with 72 Rubin Ultra GPUs, into a single 576-GPU NVLink domain with copper and direct optical connections. It will be built using the same MGX rack-scale ecosystem for the fastest time to production.
Demonstrating this massive multirack NVLink topology, Polyphe is NVIDIA's internal, fully functional GB200-based prototype of the multirack NVL576 scale-up architecture.


NVIDIA Kyber NVL1152: The next generation
To scale beyond NVL576, a new MGX rack, NVIDIA Kyber, will be introduced. NVIDIA Kyber is the next-generation MGX NVL rack design that will double the NVLink domain per rack to fit 144 GPUs.


NVIDIA Kyber will scale up into a massive all-to-all NVL1152 supercomputer using similar direct optical interconnects for rack-to-rack scale-up. Kyber provides the foundation for the next era of extreme scale-up AI computing using NVIDIA Feynman. Kyber will first be introduced with Vera Rubin Ultra as a standalone NVL144 system, providing customers with three options for Vera Rubin Ultra NVLink scale-up domains: NVL72, NVL144, and the flagship NVL576.
NVIDIA MGX ETL racks
While NVIDIA MGX NVL racks provide massive scale-up compute domains, agentic AI workflows demand highly specialized nodes for extreme low-latency inference, CPU sandboxing, and accelerated context memory for KV cache. To support these diverse needs, Vera Rubin introduces the MGX ETL rack architecture, a new, fully configurable MGX rack designed with a Spectrum-X Ethernet spine or a direct chip-to-chip spine, leveraging the same rack-scale ecosystem as MGX NVL racks.


MGX ETL shares the same form factor and physical infrastructure as MGX NVL racks and is designed to operate under the same mechanical, power, and cooling envelope. Both racks will share the same key rack components built by the experienced MGX ecosystem: racks, chassis, trays, cable cartridges, liquid cooling manifolds, quick disconnects, busbars (standard and liquid-cooled), support bracketry, side rails, power shelves, leak containment trays, tray handles, and more.
MGX ETL will use preintegrated and prevalidated copper cable cartridges with either a Spectrum-X Ethernet spine or a direct chip-to-chip spine. MGX ETL will leverage the established MGX ecosystem and supply chain that is experienced in building the rack architecture in high volume over multiple years.
NVIDIA Spectrum-X Ethernet spine
MGX ETL with a Spectrum-X Ethernet spine will be the foundation for the Vera CPU rack and the BlueField-4 STX storage rack in the Vera Rubin POD. The rack is highly configurable and can also be built to house up to 256 Rubin GPUs (HGX Rubin NVL8 systems), XPUs, or more.


In this design, 1U MGX ETL switch trays (based on Spectrum-6) sit in the middle of the rack. Rear-facing ports connect to the copper spine, while 32 front-facing OSFP cages provide optical transceiver connectivity to the rest of the POD.
MGX ETL leverages a Spectrum-X Multiplane topology that fans out the 200 Gb/s lanes across multiple switches, delivering full all-to-all connectivity among nodes within the rack while maintaining a single network tier. The preintegrated copper spine provides resilient, power-efficient connectivity (enabling connectivity between ETL racks with a single tier of optics) and extends purpose-built Spectrum-X Ethernet with zero jitter, noise isolation, and load balancing across the entire 256-chip rack.
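The multiplane fan-out can be illustrated with a toy model: lane p of every node connects to switch p, so any pair of nodes shares a one-hop path on every plane while the fabric stays a single tier. This sketches the idea only; it is not the Spectrum-X implementation, and the node and switch names are invented for illustration.

```python
def multiplane_map(n_nodes: int, n_planes: int) -> dict:
    """Toy multiplane fabric: lane p of each node goes to switch p."""
    return {node: {plane: f"switch-{plane}" for plane in range(n_planes)}
            for node in range(n_nodes)}

def one_hop_planes(fabric: dict, a: int, b: int) -> list:
    """Planes on which nodes a and b land on the same switch."""
    return [p for p in fabric[a] if fabric[a][p] == fabric[b][p]]
```

Because every plane gives every node pair a one-hop path, losing a plane degrades bandwidth rather than connectivity, and traffic can be balanced across the surviving planes.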
Direct chip-to-chip spine
Designed for extreme low-latency inference, the LPX rack connects 256 LPUs as one. It features 32 compute trays, each with eight LPUs, connected by a direct chip-to-chip spine consisting of two copper cable cartridges that create an intricate point-to-point topology over thousands of paired copper cable connections. These cables make up the direct chip-to-chip spine at the back of the rack, using the same cable cartridge mechanical form factor as other MGX racks. This massive interconnected fabric enables the entire 256-LPU rack to act as a single fast inference engine deployed alongside Vera Rubin NVL72.
When scaled to multiple LPX racks in data center deployments, the direct chip-to-chip links are maintained across racks, enabling multiple LPX racks to operate as a single, incredibly fast inference engine.
NVIDIA Vera Rubin DSX AI factory platform
NVIDIA Vera Rubin DSX is the AI factory platform that provides a blueprint and reference design for co-designed AI infrastructure from chip to grid. It maximizes grid-power-to-token efficiency and goodput, and accelerates time to first production.


NVIDIA Vera Rubin DSX unifies chips, systems, software libraries, APIs, and a global partner ecosystem into a single architecture that tightly integrates compute, networking, storage, power, cooling, and facility controls across the entire AI factory. This enables ecosystem partners to rapidly design, deploy, and scale gigawatt AI factories with maximum token throughput per watt and improved uptime from the resiliency and energy efficiency built into the DSX platform end to end.
Learn more about NVIDIA Vera Rubin POD
AI infrastructure is rapidly evolving from discrete chips, standalone servers, and rack-scale systems to co-designed POD-scale supercomputers and AI factories. Modern agentic AI workloads are driving a shift toward purpose-built AI infrastructure that integrates compute, networking, and storage into a single cohesive supercomputer. The NVIDIA Vera Rubin POD unifies five rack-scale systems with key mechanical, power, and cooling innovations from the third-generation NVIDIA MGX rack, delivering scalability, resiliency, and energy efficiency.
At AI factory scale, the NVIDIA Vera Rubin DSX Reference Design and the NVIDIA Omniverse DSX Blueprint for AI factory digital twins provide a unified framework for constructing and operating AI factories. Together, these innovations deliver dramatic gains in performance, cost efficiency, and energy savings to power the era of agentic applications.
Join us for NVIDIA GTC 2026 and watch the GTC keynote with NVIDIA founder and CEO Jensen Huang.
