NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of agentic systems. Co-designed with the NVIDIA Vera Rubin NVL72, LPX equips the AI factory with an engine optimized for fast, predictable token generation, while Vera Rubin NVL72 remains the flexible, general-purpose workhorse for training and inference, delivering high throughput across prefill and decode, including long-context processing, decode attention, and high-concurrency serving at scale.
This mix matters because the agentic future demands a new category of inference. As generation speeds approach 1,000 tokens per second per user, models move beyond conversation-speed interaction toward speed-of-thought computing. At that rate, AI systems can reason, simulate, and respond iteratively, enabling experiences that feel less like turn-based chat and more like real-time collaboration.
This shift also raises the ceiling for multi-agent systems. Individual agents will be powerful on their own, but coordinated groups of agents can accomplish much more, much like human societies scale their capability through collective intelligence and coordination.
Supporting these emerging workloads requires infrastructure that can deliver both high throughput and low latency. The combination of Vera Rubin NVL72 and LPX enables this heterogeneous architecture, pairing large-scale AI factory performance with the fast token generation needed to power continuously running agentic systems and next-generation AI applications.
Introducing NVIDIA Groq 3 LPX
Vera Rubin and LPX combine the extreme performance of Rubin GPUs and LPUs to deliver up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models. Integrated with the NVIDIA MGX ETL rack architecture and aligned with the broader Vera Rubin platform, LPX gives data centers a way to deploy a dedicated low-latency inference path alongside Vera Rubin NVL72 within a standard infrastructure design.
The system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators. Its architecture emphasizes deterministic execution, high on-chip SRAM bandwidth, and tightly coordinated scale-up communication so interactive inference can stay responsive even as concurrency rises and request shapes vary.
Deployed alongside Vera Rubin NVL72, LPX accelerates the latency-sensitive portions of the decode loop, including FFN and MoE expert execution, while Rubin GPUs continue to handle prefill and decode attention. Together, they deliver a heterogeneous serving path that improves interactive responsiveness without sacrificing AI factory throughput.


At rack scale, LPX delivers:
| Specification | NVIDIA Groq 3 LPX |
| --- | --- |
| AI inference compute | 315 PFLOPS |
| Total SRAM capacity | 128 GB |
| On-chip SRAM bandwidth | 40 PB/s |
| Scale-up density | 256 chips |
| Scale-up bandwidth | 640 TB/s |
Vera Rubin NVL72 and LPX create a more heterogeneous inference architecture for the AI factory—one that can support both high aggregate token production and responsive interactive AI experiences.
Inside the NVIDIA Groq 3 LPX compute tray
The LPX rack-scale accelerator houses 32 liquid-cooled 1U compute trays, each designed to support low-latency inference at scale. Every tray integrates eight LPU accelerators, a host processor, and fabric expansion logic in a cableless design that simplifies rack-scale deployment and tightly couples compute with communication.
LPU chip-to-chip (C2C) links provide direct communication within the tray, across trays via the LPU C2C spine, and across racks as systems scale. Connectivity is critical because interactive inference isn’t just about raw compute. It also depends on how efficiently the system can move data, coordinate work, and avoid variable delays as requests flow across devices.


Each tray provides:
| Resource | Per LPX Tray |
| --- | --- |
| LP30 chips | 8 |
| On-chip SRAM | 4 GB |
| SRAM bandwidth | 1.2 PB/s |
| DRAM via fabric expansion logic | Up to 256 GB |
| DRAM via host CPU | Up to 128 GB |
| AI inference compute (FP8) | 9.6 PFLOPS |
| Scale-up bandwidth | 20 TB/s |
At the system level, LPX is built for inference regimes where coordination overhead and jitter can quickly become visible to users. This is especially relevant as more AI applications move away from offline or throughput-oriented serving and toward interactive generation. To see how LPX is optimized for that regime, it helps to look at the processor architecture at the core of the system: the NVIDIA Groq 3 LPU.
A first look at the architecture of the NVIDIA Groq 3 LPU—the seventh chip of the Vera Rubin Platform
At the heart of LPX is the NVIDIA Groq 3 LPU, designed to deliver fast, predictable token generation by tightly coupling compute, memory, and communication under compiler control. Rather than optimizing only for peak arithmetic throughput, the LPU emphasizes deterministic execution, high on-chip memory bandwidth, and explicit data movement. These capabilities are especially important for decode-dominant, latency-sensitive inference regimes.


Tensor-first compute and explicit data movement
Compute and communication within the LPU are organized around 320-byte vectors as the unit of work. Arithmetic operations, memory access, and inter-device transfers all operate on these fixed-size vectors, simplifying scheduling and synchronization.
Specialized execution modules handle different classes of operations:
- Matrix execution modules (MXM) provide dense multiply-accumulate capability for tensor operations, operating on fixed data types with predictable throughput.
- Vector execution modules (VXM) handle pointwise arithmetic, type conversions, and activation functions using a mesh of arithmetic logic units (ALUs) per lane.
- Switch execution modules (SXM) perform structured data movement, including permutation, rotation, distribution, and transposition of vectors.
By making data movement explicit and programmable, the LPU enables memory access, compute, and communication to be overlapped, rather than relying on hardware heuristics.
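As a rough illustration of this fixed vector granularity, the sketch below computes how many 320-byte vectors a flat tensor occupies and how much padding the final vector carries. The helper is hypothetical; only the 320-byte figure comes from the text.

```python
import math

VECTOR_BYTES = 320  # fixed unit of work described in the text

def to_vectors(num_elements: int, bytes_per_element: int) -> tuple[int, int]:
    """Return (vector_count, padding_bytes) for a flat tensor."""
    total = num_elements * bytes_per_element
    vectors = math.ceil(total / VECTOR_BYTES)
    padding = vectors * VECTOR_BYTES - total
    return vectors, padding

# Example: a 4096-element FP8 activation row (1 byte per element)
vectors, padding = to_vectors(4096, 1)
print(vectors, padding)  # 13 vectors, 64 bytes of padding
```

Fixed-size vectors make this arithmetic trivial for the compiler, which is part of why scheduling and synchronization stay predictable.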
MEM enables extreme on-chip memory bandwidth
A central element of the LPU is the MEM block—a flat, SRAM-first memory architecture where 500 MB of high-speed on-chip SRAM serves as the primary working storage for inference. Rather than relying on hardware-managed caches, the compiler and runtime place the active working set, including weights, activations, and KV state, into on-chip memory and move data explicitly. This reduces unpredictable stalls and helps deliver low, stable latency by keeping the most latency-sensitive data close to compute.
Because on-chip SRAM capacity is finite, larger models are scaled across many interconnected LPU accelerators using parallel execution strategies such as layer-wise partitioning, so the overall system presents a much larger effective working set. In this design, performance is governed less by peak arithmetic throughput and more by how consistently the system can keep compute fed, which is why the LPX pairs 150 TB/s of on-chip memory bandwidth with high-bandwidth scale-up chip-to-chip (C2C) communication per LPU.
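To make the capacity arithmetic concrete, here is a back-of-the-envelope sketch. The 500 MB-per-LPU and 256-chips-per-rack figures come from this post; the FP8 one-byte-per-parameter assumption and the 80% weight budget are illustrative guesses, not published sizing guidance.

```python
import math

SRAM_PER_LPU_GB = 0.5   # 500 MB of on-chip SRAM per LPU (from the text)
LPUS_PER_RACK = 256     # chips per LPX rack (from the text)

def lpus_for_weights(params_billions: float, bytes_per_param: float = 1.0,
                     sram_fraction_for_weights: float = 0.8) -> int:
    """Estimate LPUs needed to hold a model's weights entirely in SRAM.

    sram_fraction_for_weights is a hypothetical budget that leaves room
    for activations and KV state alongside the weights.
    """
    weight_gb = params_billions * bytes_per_param  # 1 byte/param at FP8
    usable_gb = SRAM_PER_LPU_GB * sram_fraction_for_weights
    return math.ceil(weight_gb / usable_gb)

# Example: a 70B-parameter model at FP8 spans a large fraction of a rack,
# which is why layer-wise partitioning across many chips matters.
print(lpus_for_weights(70))  # 175 chips, out of 256 in a rack
```

The takeaway is that even mid-sized models are inherently multi-chip in an SRAM-first design, so predictable C2C communication becomes part of the critical path.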
C2C scaling with predictable communication
To scale inference across multiple devices, the LPU includes high-radix, high-speed C2C links designed for deterministic data exchange. Each LPU connects through 96 C2C links running at 112 Gbps each, enabling a streamlined LPX scale-up topology with high aggregate bi-directional I/O bandwidth of 2.5 TB/s and predictable communication timing. This is especially important for distributed inference pipelines, where communication overhead can otherwise become a significant source of latency.
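The per-LPU aggregate follows directly from the link count and signaling rate. The arithmetic below uses only the figures above; the gap between the raw product (about 2.7 TB/s) and the stated 2.5 TB/s presumably reflects line-coding or protocol overhead, which is an assumption on our part.

```python
LINKS = 96
GBPS_PER_LINK = 112  # per-direction signaling rate, from the text

raw_unidir_tb_s = LINKS * GBPS_PER_LINK / 8 / 1000  # Gb/s -> TB/s
raw_bidir_tb_s = 2 * raw_unidir_tb_s

print(f"{raw_bidir_tb_s:.3f} TB/s raw bi-directional")  # 2.688 TB/s raw
```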
Deterministic, compiler-orchestrated execution
The LPU builds on Groq’s spatial execution model, where the compiler explicitly schedules computation, data movement, and synchronization. Instead of relying on dynamic hardware schedulers at runtime, the compiler relies on a plesiochronous chip-to-chip protocol in hardware that cancels natural clock drift and aligns hundreds of LPU accelerators to act as a single coordinated system. With predictable data arrival and periodic software synchronization, developers can reason more directly about timing, and the system can coordinate both compute and network behavior with much greater determinism.
This execution model enables:
- Precise coordination between memory and compute
- Explicit control over instruction timing
- Reduced execution jitter under variable workloads
For real-time inference, this determinism helps keep time-to-first-token and per-token latency stable, even at small batch sizes.
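These metrics can be measured from token arrival timestamps alone. The following generic sketch, not tied to any particular serving stack, computes time-to-first-token, tokens per second, the tail inter-token gap, and jitter:

```python
import statistics

def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute interactive-serving metrics from token arrival timestamps."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    gaps.sort()
    p99 = gaps[min(len(gaps) - 1, int(0.99 * len(gaps)))]
    return {
        "ttft_s": ttft,
        "mean_tps": len(gaps) / (token_times[-1] - token_times[0]),
        "p99_gap_s": p99,  # tail latency between consecutive tokens
        "jitter_s": statistics.stdev(gaps),
    }

# Perfectly deterministic generation: every inter-token gap is identical,
# so jitter collapses to zero and the tail gap equals the mean gap.
times = [0.05 + 0.01 * i for i in range(101)]
m = latency_metrics(0.0, times)
print(round(m["ttft_s"], 3), round(m["jitter_s"], 6))  # 0.05 0.0
```

In practice, the determinism claims above correspond to driving `jitter_s` and the spread between the median and p99 inter-token gaps toward zero.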
The shift toward interactive inference
AI inference spans a broad performance spectrum. On one end are throughput-optimized services such as batch document processing, moderation, embeddings, and media pipelines, where the goal is to maximize tokens per GPU, tokens per watt, or overall cost efficiency. These workloads often support large-scale shared services, including free-tier and background AI offerings, where high utilization matters more than per-user responsiveness.
On the other end are latency-optimized services such as coding assistants, chatbots, voice assistants, copilots, and interactive agents, where delays are immediately visible to users. In these workloads, the most important metrics are time-to-first-token, tokens per second per user, and tail latency. Many modern AI platforms must support both regimes simultaneously, running high-throughput backends for large-scale processing while delivering responsive interactive experiences. This divergence is one reason heterogeneous inference architectures are becoming increasingly important.
What makes interactive inference harder
Several trends are making low-latency interactive inference both more important and harder to serve efficiently, as shown in Table 3. As models produce longer outputs and context windows grow, more of the workload shifts into decode, where tokens are generated sequentially and responsiveness is exposed directly to the user.
| Force | Why it matters |
| --- | --- |
| Low latency as a product feature | In interactive applications, responsiveness is no longer just an infrastructure metric; it is part of what users evaluate. |
| Longer reasoning outputs | As models generate longer outputs and multi-step chains of thought, more of the request shifts into sequential token generation. |
| Prefix caching | Reusing shared prompt state can reduce prefill cost, but it also increases the relative share of request-specific decode work that still must be served quickly. |
| Longer contexts | As context grows, the Transformer’s self-attention mechanism becomes increasingly constrained by data movement and memory bandwidth. |
At the same time, longer contexts increase pressure on memory bandwidth and data movement, while serving many concurrent users reduces the batching efficiency that throughput-oriented systems depend on. As a result, systems optimized for maximum aggregate throughput are not always the best fit for workloads that require fast, predictable token generation for each request.
This challenge becomes even more pronounced in agentic AI, where systems continuously cycle through inference, retrieval, tool use, and reasoning. In these loops, latency compounds across each step, making stable per-token performance and strong tail-latency behavior critical for responsive user experiences.
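A toy simulation makes the compounding effect visible. All numbers here are illustrative; the point is only that per-call jitter accumulates across a chained agent loop and widens the tail of end-to-end latency.

```python
import random

def loop_latency(steps: int, mean_s: float, jitter_s: float,
                 rng: random.Random) -> float:
    """End-to-end latency of an agent loop: each step is one model/tool call."""
    return sum(max(0.0, rng.gauss(mean_s, jitter_s)) for _ in range(steps))

rng = random.Random(0)
# 20 chained calls at ~500 ms each with ~200 ms of per-call jitter
runs = sorted(loop_latency(20, 0.5, 0.2, rng) for _ in range(10_000))
p50, p99 = runs[5_000], runs[9_900]
print(f"p50={p50:.1f}s p99={p99:.1f}s")
```

Even modest per-call variance produces a p99 noticeably above the median once twenty calls are chained, which is why stable per-token behavior matters more in agentic loops than in single-shot chat.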
The era of agentic inference requires a new architecture
Inference isn’t a single, uniform workload. Within a request, prefill and decode place different demands on hardware, and those demands shift with batch size, context length, and model structure. Some phases, including self-attention and sparse MoE, can become highly sensitive to memory bandwidth and data movement, while others, such as dense projection and feed-forward layers, scale efficiently on throughput-optimized hardware when enough parallelism is available. In interactive decode, many operations run at very small batch sizes, making latency much more sensitive to stalls, contention, and jitter.
Optimizing the full pipeline for only one regime forces a compromise. Hardware tuned for peak throughput under large batches isn’t ideal for the most latency-sensitive execution paths, while hardware optimized for low-latency execution is less efficient for the most compute-intensive phases.
As shown in Figure 4, a heterogeneous system combines both approaches, pairing low-latency interactive performance with high AI factory throughput. The result is a two-engine architecture: GPUs deliver high output for context-heavy prefill and execute decode attention, while LPUs speed up latency-sensitive decode components such as FFN/MoE execution. Together, they improve interactivity without giving up AI factory throughput.


Vera Rubin NVL72 meets LPX
Modern inference is a relay race. The same hardware that runs the heavy context leg doesn’t have to anchor the sprint to the next token. Rubin GPUs are the flexible, general-purpose workhorses for training and inference. They deliver high throughput across many model sizes, batch regimes, and serving patterns, from long-context prefill to decode attention and high-concurrency inference at scale.
LPX adds a specialized path optimized for fast, latency-sensitive token generation. Together, they enable a heterogeneous inference design that improves interactive responsiveness without giving up system-scale efficiency.


Decode phase: A repeated multi-engine loop
The prefill phase is dominated by ingesting large inputs and building the KV cache—a workload that benefits from dense parallel compute and large memory capacity. The Vera Rubin NVL72 handles this phase efficiently, especially for long-context workloads and MoE models where context can be large and highly variable.
The decode phase is different. Decode is a repeated per-token loop, and different parts of that loop stress different bottlenecks. In the Vera Rubin platform architecture with LPX, decode is best understood as a two-engine loop. GPUs handle decode work that benefits most from throughput and large memory capacity, such as full-context attention over the accumulated KV cache. LPX accelerates latency-sensitive execution inside decode, such as sparse MoE expert feed-forward networks (FFNs) and other pointwise operations. This split, often described as decode phase disaggregation or attention–FFN disaggregation (AFD), separates attention from FFN inside decode and exchanges intermediate activations for each token, so each engine runs the part of the loop it is best suited to execute. This AFD loop expands the highest-value operating region of the Pareto frontier.
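Schematically, the AFD loop described above can be sketched as follows. The `gpu_attention` and `lpu_ffn` callables are stand-ins for the two engines, not real APIs, and the state handling is deliberately simplified.

```python
from typing import Callable, List

def afd_decode(prompt_state: List[float],
               gpu_attention: Callable, lpu_ffn: Callable,
               sample: Callable, max_tokens: int, eos: int) -> List[int]:
    """One token per iteration: attention runs on the GPU engine,
    FFN/MoE runs on the LPU engine, and the intermediate activation
    is exchanged between them on every step."""
    kv_cache = list(prompt_state)   # accumulated context stays with the GPU
    tokens: List[int] = []
    hidden = kv_cache[-1]
    for _ in range(max_tokens):
        attn_out = gpu_attention(kv_cache, hidden)  # full-context attention
        hidden = lpu_ffn(attn_out)                  # latency-sensitive FFN path
        tok = sample(hidden)
        tokens.append(tok)
        if tok == eos:
            break
        kv_cache.append(hidden)                     # KV state grows on the GPU side
    return tokens

# Toy stand-ins, purely illustrative:
out = afd_decode([0.1, 0.2],
                 gpu_attention=lambda kv, h: sum(kv) + h,
                 lpu_ffn=lambda x: x * 0.5,
                 sample=lambda h: int(h * 10) % 7,
                 max_tokens=5, eos=-1)
print(out)  # [2, 4, 6, 4, 5]
```

The structural point is the per-token activation exchange: each engine only ever executes the part of the loop it is specialized for.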


At rack scale and beyond, the LPX is designed to operate as a tightly coordinated unit of compute, minimizing coordination overhead and reducing jitter. This is valuable in decode-heavy, agentic workflows where small delays compound across many model calls and verification loops.
NVIDIA Dynamo makes heterogeneous decode operational
Making heterogeneous decode practical requires software that can classify requests, route work by latency targets, move intermediate activations with low overhead, and keep tail latency stable under bursty, variable traffic. NVIDIA Dynamo provides that orchestration layer by coordinating disaggregated serving and disaggregated decode across heterogeneous backends.
In practice, Dynamo routes prefill to GPU workers to process the large context and build the KV cache. During decode, Dynamo orchestrates the AFD loop where GPUs run attention over the accumulated KV cache, intermediate activations are handed off to LPUs for FFN/MoE execution, and outputs return to the GPUs to continue token generation. The result is a single coherent serving path with more predictable tail latency while sustaining high AI factory throughput.
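A toy router illustrates the idea of latency-target-driven scheduling. This is not the Dynamo API; the names, thresholds, and queue model are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Request:
    context_tokens: int       # informs prefill placement (unused here)
    latency_target_ms: float  # per-token budget promised to the user

def route_decode(req: Request, lpx_queue_depth: int,
                 max_lpx_queue: int = 32,
                 interactive_budget_ms: float = 25.0) -> str:
    """Hypothetical latency-target-driven router: tight per-token budgets
    go to the low-latency LPX path while it has headroom; everything else
    batches on the GPU throughput path."""
    if (req.latency_target_ms <= interactive_budget_ms
            and lpx_queue_depth < max_lpx_queue):
        return "lpx-decode"
    return "gpu-decode"

print(route_decode(Request(8000, 10.0), lpx_queue_depth=4))    # lpx-decode
print(route_decode(Request(8000, 200.0), lpx_queue_depth=4))   # gpu-decode
print(route_decode(Request(8000, 10.0), lpx_queue_depth=64))   # gpu-decode
```

The third case shows the backpressure behavior: when the low-latency path saturates, interactive traffic degrades gracefully onto the throughput path rather than queuing indefinitely.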


With KV-aware routing, low-overhead transfers, and latency-target-driven scheduling, Dynamo helps keep interactive sessions out of long queues, reduces cross-tenant jitter, and maintains stable tail latency as concurrency and request shapes vary. The result is a production-ready heterogeneous serving model that delivers responsive user experiences while sustaining high AI factory throughput at scale.
Accelerating speculative decoding with LPX
Speculative decoding is an increasingly important technique for reducing latency in LLM inference. The approach uses a smaller draft model to generate multiple candidate tokens ahead of time, while a larger target model verifies and accepts those tokens in parallel. When the predictions match, multiple tokens can be committed at once, significantly increasing effective tokens per second and reducing response latency.
LPX is well suited to act as the draft-generation engine in this architecture. The deterministic execution model and extremely high on-chip SRAM bandwidth of the LPU enable very fast draft token generation, allowing the draft model to run ahead of the verifier. At the same time, GPUs such as Rubin remain highly efficient for large-model execution tasks such as prefill, attention processing, and token verification.
By pairing the two, the system combines the strengths of both processors:
- LPX generates draft tokens rapidly using its low-latency architecture.
- Rubin GPUs verify and finalize tokens efficiently using high-throughput compute and large memory capacity.
This separation enables speculative decoding to run across heterogeneous processors, rather than running both draft and verifier models on the same hardware. The result is a system that can deliver faster draft generation without sacrificing the efficiency of GPU-based verification.
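A minimal greedy version of this draft-and-verify loop can be sketched in a few lines. The toy `draft` and `target` callables are deterministic stand-ins, chosen so the draft diverges from the target at every fourth position; real systems verify proposals with one batched target-model forward pass.

```python
from typing import Callable, List

def speculative_decode(prefix: List[int], draft: Callable, target: Callable,
                       k: int, max_tokens: int) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target
    verifies them, the matching prefix is committed, and the first mismatch
    is replaced by the target's own token."""
    out = list(prefix)
    while len(out) - len(prefix) < max_tokens:
        proposal, ctx = [], list(out)
        for _ in range(k):                 # fast draft pass (the LPX role)
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        accepted, ctx = 0, list(out)
        for t in proposal:                 # verification pass (the GPU role)
            if target(ctx) == t:
                accepted += 1
                ctx.append(t)
            else:
                break
        out.extend(proposal[:accepted])
        out.append(target(out))            # target token at the divergence point
    return out[len(prefix):][:max_tokens]

# Toy deterministic "models": the draft agrees with the target except at
# every 4th context length, so most rounds commit several tokens at once.
target = lambda ctx: (ctx[-1] + 1) % 100
draft = lambda ctx: (ctx[-1] + 1) % 100 if len(ctx) % 4 else (ctx[-1] + 2) % 100
result = speculative_decode([0], draft, target, k=4, max_tokens=8)
print(result)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Here eight tokens are produced in two rounds instead of eight sequential target steps, which is exactly the latency win the draft/verify split is after.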


Unlocking intelligent agentic swarms
As AI use cases evolve from simple chat and batch inference to multi-step agentic workflows, responsiveness becomes a requirement. Offline inference and basic assistants can often prioritize aggregate throughput, but interactive applications, deep research, and agentic pipelines combine high token volume with tight feedback loops, where latency compounds across many model calls and tool interactions.
In this regime, heterogeneous inference matters. Pairing a high-throughput engine for long-context processing with a low-latency engine for decode FFNs makes it possible to increase user interactivity without sacrificing AI factory output.


Unlocking a new category of AI experiences on the Pareto frontier
A practical way to visualize this tradeoff between performance and cost is the Pareto frontier, plotting user interactivity, measured in tokens per second per user (TPS per user), on the horizontal axis against AI factory throughput, measured in tokens per second per megawatt (TPS per MW), on the vertical axis.
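Computing such a frontier from a set of measured operating points is straightforward. The sketch below keeps only the configurations not dominated in both dimensions; the sample points are hypothetical.

```python
def pareto_frontier(points):
    """Keep operating points not dominated in both TPS/user and TPS/MW."""
    pts = sorted(points, reverse=True)      # sort by TPS per user, descending
    frontier, best_mw = [], float("-inf")
    for tps_user, tps_mw in pts:
        if tps_mw > best_mw:                # strictly better factory throughput
            frontier.append((tps_user, tps_mw))
            best_mw = tps_mw
    return sorted(frontier)

# Hypothetical operating points as (tps_per_user, tps_per_mw):
configs = [(50, 9e6), (150, 6e6), (400, 4e6), (400, 1e6),
           (600, 2e6), (100, 5e6)]
print(pareto_frontier(configs))
```

Points like `(400, 1e6)` and `(100, 5e6)` drop out because another configuration beats them on both axes; the surviving curve is the frontier the following figures describe.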
As shown in Figure 10, different AI services operate at very different points on this curve. Throughput-first services, including many free-tier and background workloads, typically prioritize maximum efficiency and high utilization and often use smaller models with shorter context windows. Premium AI services, by contrast, demand higher model capability and far more responsive user-visible performance, especially for long-context reasoning and agentic workflows. In Figure 10, that premium tier is represented by a 2-trillion-parameter MoE model with a 400K input context window operating at roughly 400 TPS per user and beyond.


Reaching these premium operating points with a single homogeneous platform forces a tradeoff between responsiveness and overall AI factory throughput, because the workload mixes fundamentally different performance regimes within the same serving pipeline. A heterogeneous architecture expands the achievable region by combining complementary execution paths, allowing the system to sustain high factory output while delivering highly responsive, low-latency interactive experiences. As illustrated in Figure 10, the combination of Vera Rubin NVL72 and LPX delivers up to 35x higher TPS per megawatt at 400 TPS per user compared with the NVIDIA GB200 NVL72, effectively creating a new premium performance tier for interactive AI services.
This shift has a direct economic impact. Higher responsiveness expands the set of premium experiences an AI factory can serve and increases value per unit of infrastructure. With the Vera Rubin platform, AI factories can unlock up to 5x more revenue per megawatt compared with the GB200 NVL72, and up to 10x by pairing Vera Rubin NVL72 with LPX for the most latency-sensitive, high-value interactive workloads, such as agentic coding and multi-agent systems.


What NVIDIA Groq 3 LPX enables for developers
Developers are increasingly building systems that require three things at once:
- Responsiveness: low and predictable latency for interactive experiences and agent loops.
- Capability: strong model quality, reasoning depth, and long-context understanding.
- Scale: high throughput and cost efficiency to serve many concurrent users or agents.
LPX broadens the set of workloads an AI factory can serve efficiently. Use the low-latency path where predictable token generation improves the experience, such as coding assistants, agentic workflows with tight tool-calling loops, voice interactions, and real-time translation. Keep throughput-first workloads, such as batch serving and long-context throughput runs, on Rubin GPUs, where high concurrency and batching keep GPUs consistently busy and cost-efficient. The operational shift is one of mindset: stop optimizing for one headline metric and start optimizing for a range of real-world operating points.
Learn more
Dive deeper into the architecture behind NVIDIA Groq 3 LPX and Vera Rubin by starting with the NVIDIA product pages and technical blogs covering the Vera Rubin platform, LPX, AFD, and Dynamo. Explore the underlying research on tensor streaming processors and software-defined silicon design for AI. Together, these resources offer a deeper look at the hardware, system architecture, and orchestration software behind heterogeneous, low-latency inference at scale. Next, join an NVIDIA Developer Forum thread focused on inference and deployment to compare notes with other teams building low-latency serving systems.
Resources
Acknowledgments
Thanks to Amr Elmeleegy, Andrew Bitar, Andrew Ling, Graham Steele, Itay Neeman, Jamie Li, Omar Kilani, Santosh Raghavan, and Stuart Pitts, along with many other NVIDIA product leaders, engineers, and designers who contributed to this post.
