NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of agentic systems. Co-designed with the NVIDIA Vera Rubin NVL72, LPX equips the AI factory with an engine optimized for fast, predictable token generation, while Vera Rubin NVL72 remains the flexible, general-purpose workhorse for training and inference, delivering high throughput across prefill and decode, including long-context processing, decode attention, and high-concurrency serving at scale.
This mix matters because the agentic future demands a new category of inference. As generation speeds approach 1,000 tokens per second per user, models move beyond conversation-speed interaction toward speed-of-thought computing. At that rate, AI systems can reason, simulate, and respond iteratively, enabling experiences that feel less like turn-based chat and more like real-time collaboration.
This shift also raises the ceiling for multi-agent systems. Individual agents will be powerful on their own, but coordinated groups of agents can accomplish much more, much like human societies scale their capability through collective intelligence and coordination.
Supporting these emerging workloads requires infrastructure that can deliver both high throughput and low latency. The combination of Vera Rubin NVL72 and LPX enables this heterogeneous architecture, pairing large-scale AI factory performance with the fast token generation needed to power continuously running agentic systems and next-generation AI applications.
Introducing NVIDIA Groq 3 LPX
Vera Rubin and LPX combine the extreme performance of Rubin GPUs and LPUs to deliver up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models. Integrated with the NVIDIA MGX ETL rack architecture and aligned with the broader Vera Rubin platform, LPX gives data centers a way to deploy a dedicated low-latency inference path alongside Vera Rubin NVL72 within a standard infrastructure design.
The system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators. Its architecture emphasizes deterministic execution, high on-chip SRAM bandwidth, and tightly coordinated scale-up communication so interactive inference can stay responsive even as concurrency rises and request shapes vary.
Deployed alongside Vera Rubin NVL72, LPX accelerates the latency-sensitive portions of the decode loop, including FFN and MoE expert execution, while Rubin GPUs continue to handle prefill and decode attention. Together, they deliver a heterogeneous serving path that improves interactive responsiveness without sacrificing AI factory throughput.


At rack scale, LPX delivers:
| Specification | NVIDIA Groq 3 LPX |
| --- | --- |
| AI inference compute | 315 PFLOPS |
| Total SRAM capacity | 128 GB |
| On-chip SRAM bandwidth | 40 PB/s |
| Scale-up density | 256 chips |
| Scale-up bandwidth | 640 TB/s |
Vera Rubin NVL72 and LPX create a more heterogeneous inference architecture for the AI factory—one that can support both high aggregate token production and responsive interactive AI experiences.
Inside the NVIDIA Groq 3 LPX compute tray
The LPX rack-scale accelerator houses 32 liquid-cooled 1U compute trays, each designed to support low-latency inference at scale. Every tray integrates eight LPU accelerators, a host processor, and fabric expansion logic in a cableless design that simplifies rack-scale deployment and tightly couples compute with communication.
LPU chip-to-chip (C2C) links provide direct communication within the tray, across trays via the LPU C2C spine, and across racks as systems scale. Connectivity is critical because interactive inference isn’t just about raw compute. It also depends on how efficiently the system can move data, coordinate work, and avoid variable delays as requests flow across devices.


Each tray provides:
| Resource | Per LPX Tray |
| --- | --- |
| LP30 chips | 8 |
| On-chip SRAM | 4 GB |
| SRAM bandwidth | 1.2 PB/s |
| DRAM via fabric expansion logic | Up to 256 GB |
| DRAM via host CPU | Up to 128 GB |
| AI inference compute (FP8) | 9.6 PFLOPS |
| Scale-up bandwidth | 20 TB/s |
At the system level, LPX is built for inference regimes where coordination overhead and jitter can quickly become visible to users. This is especially relevant as more AI applications move away from offline or throughput-oriented serving and toward interactive generation. To see how LPX is optimized for that regime, it helps to look at the processor architecture at the core of the system: the NVIDIA Groq 3 LPU.
A first look at the architecture of the NVIDIA Groq 3 LPU—the seventh chip of the Vera Rubin Platform
At the heart of LPX is the NVIDIA Groq 3 LPU, designed to deliver fast, predictable token generation by tightly coupling compute, memory, and communication under compiler control. Rather than optimizing only for peak arithmetic throughput, the LPU emphasizes deterministic execution, high on-chip memory bandwidth, and explicit data movement. These capabilities are especially important for decode-dominant, latency-sensitive inference regimes.


Tensor-first compute and explicit data movement
Compute and communication within the LPU are organized around 320-byte vectors as the unit of work. Arithmetic operations, memory access, and inter-device transfers all operate on these fixed-size vectors, simplifying scheduling and synchronization.
Specialized execution modules handle different classes of operations:
- Matrix execution modules (MXM) provide dense multiply-accumulate capability for tensor operations, operating on fixed data types with predictable throughput.
- Vector execution modules (VXM) handle pointwise arithmetic, type conversions, and activation functions using a mesh of arithmetic logic units (ALUs) per lane.
- Switch execution modules (SXM) perform structured data movement, including permutation, rotation, distribution, and transposition of vectors.
By making data movement explicit and programmable, the LPU enables memory access, compute, and communication to be overlapped, rather than relying on hardware heuristics.
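As a rough illustration of this fixed vector granularity, the sketch below computes how many 320-byte vectors a flat tensor occupies and how much padding the final vector carries. The helper is hypothetical; only the 320-byte figure comes from the text.

```python
import math

VECTOR_BYTES = 320  # fixed unit of work described in the text

def to_vectors(num_elements: int, bytes_per_element: int) -> tuple[int, int]:
    """Return (vector_count, padding_bytes) for a flat tensor."""
    total = num_elements * bytes_per_element
    vectors = math.ceil(total / VECTOR_BYTES)
    padding = vectors * VECTOR_BYTES - total
    return vectors, padding

# Example: a 4096-element FP8 activation row (1 byte per element)
vectors, padding = to_vectors(4096, 1)
print(vectors, padding)  # 13 vectors, 64 bytes of padding
```

Fixed-size vectors make this arithmetic trivial for the compiler, which is part of why scheduling and synchronization stay predictable.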
MEM enables extreme on-chip memory bandwidth
A central element of the LPU is the MEM block—a flat, SRAM-first memory architecture where 500 MB of high-speed on-chip SRAM serves as the primary working storage for inference. Rather than relying on hardware-managed caches, the compiler and runtime place the active working set, including weights, activations, and KV state, into on-chip memory and move data explicitly. This reduces unpredictable stalls and helps deliver low, stable latency by keeping the most latency-sensitive data close to compute.
Because on-chip SRAM capacity is finite, larger models are scaled across many interconnected LPU accelerators using parallel execution strategies such as layer-wise partitioning, so the overall system presents a much larger effective working set. In this design, performance is governed less by peak arithmetic throughput and more by how consistently the system can keep compute fed, which is why the LPX pairs 150 TB/s of on-chip memory bandwidth with high-bandwidth scale-up chip-to-chip (C2C) communication per LPU.
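To make the capacity arithmetic concrete, here is a back-of-the-envelope sketch. The 500 MB-per-LPU and 256-chips-per-rack figures come from this post; the FP8 one-byte-per-parameter assumption and the 80% weight budget are illustrative guesses, not published sizing guidance.

```python
import math

SRAM_PER_LPU_GB = 0.5   # 500 MB of on-chip SRAM per LPU (from the text)
LPUS_PER_RACK = 256     # chips per LPX rack (from the text)

def lpus_for_weights(params_billions: float, bytes_per_param: float = 1.0,
                     sram_fraction_for_weights: float = 0.8) -> int:
    """Estimate LPUs needed to hold a model's weights entirely in SRAM.

    sram_fraction_for_weights is a hypothetical budget that leaves room
    for activations and KV state alongside the weights.
    """
    weight_gb = params_billions * bytes_per_param  # 1 byte/param at FP8
    usable_gb = SRAM_PER_LPU_GB * sram_fraction_for_weights
    return math.ceil(weight_gb / usable_gb)

# Example: a 70B-parameter model at FP8 spans a large fraction of a rack,
# which is why layer-wise partitioning across many chips matters.
print(lpus_for_weights(70))  # 175 chips, out of 256 in a rack
```

The takeaway is that even mid-sized models are inherently multi-chip in an SRAM-first design, so predictable C2C communication becomes part of the critical path.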
C2C scaling with predictable communication
To scale inference across multiple devices, the LPU includes high-radix, high-speed C2C links designed for deterministic data exchange. Each LPU connects through 96 C2C links running at 112 Gbps each, enabling a streamlined LPX scale-up topology with high aggregate bi-directional I/O bandwidth of 2.5 TB/s and predictable communication timing. This is especially important for distributed inference pipelines, where communication overhead can otherwise become a significant source of latency.
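The per-LPU aggregate follows directly from the link count and signaling rate. The arithmetic below uses only the figures above; the gap between the raw product (about 2.7 TB/s) and the stated 2.5 TB/s presumably reflects line-coding or protocol overhead, which is an assumption on our part.

```python
LINKS = 96
GBPS_PER_LINK = 112  # per-direction signaling rate, from the text

raw_unidir_tb_s = LINKS * GBPS_PER_LINK / 8 / 1000  # Gb/s -> TB/s
raw_bidir_tb_s = 2 * raw_unidir_tb_s

print(f"{raw_bidir_tb_s:.3f} TB/s raw bi-directional")  # 2.688 TB/s raw
```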
Deterministic, compiler-orchestrated execution
The LPU builds on Groq’s spatial execution model, where the compiler explicitly schedules computation, data movement, and synchronization. Instead of relying on dynamic hardware schedulers at runtime, the compiler relies on a plesiochronous chip-to-chip protocol in hardware that cancels natural clock drift and aligns hundreds of LPU accelerators to act as a single coordinated system. With predictable data arrival and periodic software synchronization, developers can reason more directly about timing, and the system can coordinate both compute and network behavior with much greater determinism.
This execution model enables:
- Precise coordination between memory and compute
- Explicit control over instruction timing
- Reduced execution jitter under variable workloads
For real-time inference, this determinism helps keep time-to-first-token and per-token latency stable, even at small batch sizes.
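These metrics can be measured from token arrival timestamps alone. The following generic sketch, not tied to any particular serving stack, computes time-to-first-token, tokens per second, the tail inter-token gap, and jitter:

```python
import statistics

def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute interactive-serving metrics from token arrival timestamps."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    gaps.sort()
    p99 = gaps[min(len(gaps) - 1, int(0.99 * len(gaps)))]
    return {
        "ttft_s": ttft,
        "mean_tps": len(gaps) / (token_times[-1] - token_times[0]),
        "p99_gap_s": p99,  # tail latency between consecutive tokens
        "jitter_s": statistics.stdev(gaps),
    }

# Perfectly deterministic generation: every inter-token gap is identical,
# so jitter collapses to zero and the tail gap equals the mean gap.
times = [0.05 + 0.01 * i for i in range(101)]
m = latency_metrics(0.0, times)
print(round(m["ttft_s"], 3), round(m["jitter_s"], 6))  # 0.05 0.0
```

In practice, the determinism claims above correspond to driving `jitter_s` and the spread between the median and p99 inter-token gaps toward zero.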
The shift toward interactive inference
AI inference spans a broad performance spectrum. On one end are throughput-optimized services such as batch document processing, moderation, embeddings, and media pipelines, where the goal is to maximize tokens per GPU, tokens per watt, or overall cost efficiency. These workloads often support large-scale shared services, including free-tier and background AI offerings, where high utilization matters more than per-user responsiveness.
On the other end are latency-optimized services such as coding assistants, chatbots, voice assistants, copilots, and interactive agents, where delays are immediately visible to users. In these workloads, the most important metrics are time-to-first-token, tokens per second per user, and tail latency. Many modern AI platforms must support both regimes simultaneously, running high-throughput backends for large-scale processing while delivering responsive interactive experiences. This divergence is one reason heterogeneous inference architectures are becoming increasingly important.
What makes interactive inference harder
Several trends are making low-latency interactive inference both more important and harder to serve efficiently, as shown in Table 3. As models produce longer outputs and context windows grow, more of the workload shifts into decode, where tokens are generated sequentially and responsiveness is exposed directly to the user.
| Force | Why it matters |
| --- | --- |
| Low latency as a product feature | In interactive applications, responsiveness is no longer just an infrastructure metric; it is part of what users evaluate. |
| Longer reasoning outputs | As models generate longer outputs and multi-step chains of thought, more of the request shifts into sequential token generation. |
| Prefix caching | Reusing shared prompt state can reduce prefill cost, but it also increases the relative share of request-specific decode work that still must be served quickly. |
| Longer contexts | As context grows, the Transformer’s self-attention mechanism becomes increasingly constrained by data movement and memory bandwidth. |
At the same time, longer contexts increase pressure on memory bandwidth and data movement, while serving many concurrent users reduces the batching efficiency that throughput-oriented systems depend on. As a result, systems optimized for maximum aggregate throughput are not always the best fit for workloads that require fast, predictable token generation for each request.
This challenge becomes even more pronounced in agentic AI, where systems continuously cycle through inference, retrieval, tool use, and reasoning. In these loops, latency compounds across each step, making stable per-token performance and strong tail-latency behavior critical for responsive user experiences.
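A toy simulation makes the compounding effect visible. All numbers here are illustrative; the point is only that per-call jitter accumulates across a chained agent loop and widens the tail of end-to-end latency.

```python
import random

def loop_latency(steps: int, mean_s: float, jitter_s: float,
                 rng: random.Random) -> float:
    """End-to-end latency of an agent loop: each step is one model/tool call."""
    return sum(max(0.0, rng.gauss(mean_s, jitter_s)) for _ in range(steps))

rng = random.Random(0)
# 20 chained calls at ~500 ms each with ~200 ms of per-call jitter
runs = sorted(loop_latency(20, 0.5, 0.2, rng) for _ in range(10_000))
p50, p99 = runs[5_000], runs[9_900]
print(f"p50={p50:.1f}s p99={p99:.1f}s")
```

Even modest per-call variance produces a p99 noticeably above the median once twenty calls are chained, which is why stable per-token behavior matters more in agentic loops than in single-shot chat.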
The era of agentic inference requires a new architecture
Inference isn’t a single, uniform workload. Within a request, prefill and decode place different demands on hardware, and those demands shift with batch size, context length, and model structure. Some phases, including self-attention and sparse MoE, can become highly sensitive to memory bandwidth and data movement, while others, such as dense projection and feed-forward layers, scale efficiently on throughput-optimized hardware when enough parallelism is available. In interactive decode, many operations run at very small batch sizes, making latency much more sensitive to stalls, contention, and jitter.
Optimizing the full pipeline for only one regime forces a compromise. Hardware tuned for peak throughput under large batches isn’t ideal for the most latency-sensitive execution paths, while hardware optimized for low-latency execution is less efficient for the most compute-intensive phases.
As shown in Figure 4, a heterogeneous system combines both approaches, pairing low-latency interactive performance with high AI factory throughput. The result is a two-engine architecture: GPUs deliver high output for context-heavy prefill and execute decode attention, while LPUs speed up latency-sensitive decode components such as FFN/MoE execution. Together, they improve interactivity without giving up AI factory throughput.


Vera Rubin NVL72 meets LPX
Modern inference is a relay race. The same hardware that runs the heavy context leg doesn’t have to anchor the sprint to the next token. Rubin GPUs are the flexible, general-purpose workhorses for training and inference. They deliver high throughput across many model sizes, batch regimes, and serving patterns, from long-context prefill to decode attention and high-concurrency inference at scale.
LPX adds a specialized path optimized for fast, latency-sensitive token generation. Together, they enable a heterogeneous inference design that improves interactive responsiveness without giving up system-scale efficiency.


Decode phase: A repeated multi-engine loop
The prefill phase is dominated by ingesting large inputs and building the KV cache—a workload that benefits from dense parallel compute and large memory capacity. The Vera Rubin NVL72 handles this phase efficiently, especially for long-context workloads and MoE models where context can be large and highly variable.
The decode phase is different. Decode is a repeated per-token loop, and different parts of that loop stress different bottlenecks. In the Vera Rubin platform architecture with LPX, decode is best understood as a two-engine loop. GPUs handle decode work that benefits most from throughput and large memory capacity, such as full-context attention over the accumulated KV cache. LPX accelerates latency-sensitive execution inside decode, such as sparse MoE expert feed-forward networks (FFNs) and other pointwise operations. This split, often described as decode phase disaggregation or attention–FFN disaggregation (AFD), separates attention from FFN inside decode and exchanges intermediate activations for each token, so each engine runs the part of the loop it is best suited to execute. This AFD loop expands the highest-value operating region of the Pareto frontier.
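Schematically, the AFD loop described above can be sketched as follows. The `gpu_attention` and `lpu_ffn` callables are stand-ins for the two engines, not real APIs, and the state handling is deliberately simplified.

```python
from typing import Callable, List

def afd_decode(prompt_state: List[float],
               gpu_attention: Callable, lpu_ffn: Callable,
               sample: Callable, max_tokens: int, eos: int) -> List[int]:
    """One token per iteration: attention runs on the GPU engine,
    FFN/MoE runs on the LPU engine, and the intermediate activation
    is exchanged between them on every step."""
    kv_cache = list(prompt_state)   # accumulated context stays with the GPU
    tokens: List[int] = []
    hidden = kv_cache[-1]
    for _ in range(max_tokens):
        attn_out = gpu_attention(kv_cache, hidden)  # full-context attention
        hidden = lpu_ffn(attn_out)                  # latency-sensitive FFN path
        tok = sample(hidden)
        tokens.append(tok)
        if tok == eos:
            break
        kv_cache.append(hidden)                     # KV state grows on the GPU side
    return tokens

# Toy stand-ins, purely illustrative:
out = afd_decode([0.1, 0.2],
                 gpu_attention=lambda kv, h: sum(kv) + h,
                 lpu_ffn=lambda x: x * 0.5,
                 sample=lambda h: int(h * 10) % 7,
                 max_tokens=5, eos=-1)
print(out)  # [2, 4, 6, 4, 5]
```

The structural point is the per-token activation exchange: each engine only ever executes the part of the loop it is specialized for.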


At rack scale and beyond, the LPX is designed to operate as a tightly coordinated unit of compute, minimizing coordination overhead and reducing jitter. This is valuable in decode-heavy, agentic workflows where small delays compound across many model calls and verification loops.
NVIDIA Dynamo makes heterogeneous decode operational
Making heterogeneous decode practical requires software that can classify requests, route work by latency targets, move intermediate activations with low overhead, and keep tail latency stable under bursty, variable traffic. NVIDIA Dynamo provides that orchestration layer by coordinating disaggregated serving and disaggregated decode across heterogeneous backends.
In practice, Dynamo routes prefill to GPU workers to process the large context and build the KV cache. During decode, Dynamo orchestrates the AFD loop where GPUs run attention over the accumulated KV cache, intermediate activations are handed off to LPUs for FFN/MoE execution, and outputs return to the GPUs to continue token generation. The result is a single coherent serving path with more predictable tail latency while sustaining high AI factory throughput.
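A toy router illustrates the idea of latency-target-driven scheduling. This is not the Dynamo API; the names, thresholds, and queue model are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Request:
    context_tokens: int       # informs prefill placement (unused here)
    latency_target_ms: float  # per-token budget promised to the user

def route_decode(req: Request, lpx_queue_depth: int,
                 max_lpx_queue: int = 32,
                 interactive_budget_ms: float = 25.0) -> str:
    """Hypothetical latency-target-driven router: tight per-token budgets
    go to the low-latency LPX path while it has headroom; everything else
    batches on the GPU throughput path."""
    if (req.latency_target_ms <= interactive_budget_ms
            and lpx_queue_depth < max_lpx_queue):
        return "lpx-decode"
    return "gpu-decode"

print(route_decode(Request(8000, 10.0), lpx_queue_depth=4))    # lpx-decode
print(route_decode(Request(8000, 200.0), lpx_queue_depth=4))   # gpu-decode
print(route_decode(Request(8000, 10.0), lpx_queue_depth=64))   # gpu-decode
```

The third case shows the backpressure behavior: when the low-latency path saturates, interactive traffic degrades gracefully onto the throughput path rather than queuing indefinitely.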


With KV-aware routing, low-overhead transfers, and latency-target-driven scheduling, Dynamo helps keep interactive sessions out of long queues, reduces cross-tenant jitter, and maintains stable tail latency as concurrency and request shapes vary. The result is a production-ready heterogeneous serving model that delivers responsive user experiences while sustaining high AI factory throughput at scale.
Accelerating speculative decoding with LPX
Speculative decoding is an increasingly important technique for reducing latency in LLM inference. The approach uses a smaller draft model to generate multiple candidate tokens ahead of time, while a larger target model verifies and accepts those tokens in parallel. When the predictions match, multiple tokens can be committed at once, significantly increasing effective tokens per second and reducing response latency.
LPX is well suited to act as the draft-generation engine in this architecture. The deterministic execution model and extremely high on-chip SRAM bandwidth of the LPU enable very fast draft token generation, allowing the draft model to run ahead of the verifier. At the same time, GPUs such as Rubin remain highly efficient for large-model execution tasks such as prefill, attention processing, and token verification.
By pairing the two, the system combines the strengths of both processors:
- LPX generates draft tokens rapidly using its low-latency architecture.
- Rubin GPUs verify and finalize tokens efficiently using high-throughput compute and large memory capacity.
This separation enables speculative decoding to run across heterogeneous processors, rather than running both draft and verifier models on the same hardware. The result is a system that can deliver faster draft generation without sacrificing the efficiency of GPU-based verification.
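A minimal greedy version of this draft-and-verify loop can be sketched in a few lines. The toy `draft` and `target` callables are deterministic stand-ins, chosen so the draft diverges from the target at every fourth position; real systems verify proposals with one batched target-model forward pass.

```python
from typing import Callable, List

def speculative_decode(prefix: List[int], draft: Callable, target: Callable,
                       k: int, max_tokens: int) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target
    verifies them, the matching prefix is committed, and the first mismatch
    is replaced by the target's own token."""
    out = list(prefix)
    while len(out) - len(prefix) < max_tokens:
        proposal, ctx = [], list(out)
        for _ in range(k):                 # fast draft pass (the LPX role)
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        accepted, ctx = 0, list(out)
        for t in proposal:                 # verification pass (the GPU role)
            if target(ctx) == t:
                accepted += 1
                ctx.append(t)
            else:
                break
        out.extend(proposal[:accepted])
        out.append(target(out))            # target token at the divergence point
    return out[len(prefix):][:max_tokens]

# Toy deterministic "models": the draft agrees with the target except at
# every 4th context length, so most rounds commit several tokens at once.
target = lambda ctx: (ctx[-1] + 1) % 100
draft = lambda ctx: (ctx[-1] + 1) % 100 if len(ctx) % 4 else (ctx[-1] + 2) % 100
result = speculative_decode([0], draft, target, k=4, max_tokens=8)
print(result)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Here eight tokens are produced in two rounds instead of eight sequential target steps, which is exactly the latency win the draft/verify split is after.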


Unlocking intelligent agentic swarms
As AI use cases evolve from simple chat and batch inference to multi-step agentic workflows, responsiveness becomes a requirement. Offline inference and basic assistants can often prioritize aggregate throughput, but interactive applications, deep research, and agentic pipelines combine high token volume with tight feedback loops, where latency compounds across many model calls and tool interactions.
In this regime, heterogeneous inference matters. Pairing a high-throughput engine for long-context processing with a low-latency engine for decode FFNs makes it possible to increase user interactivity without sacrificing AI factory output.


Unlocking a new category of AI experiences on the Pareto frontier
A practical way to visualize this tradeoff between performance and cost is the Pareto frontier, plotting user interactivity, measured in tokens per second per user (TPS per user), on the horizontal axis against AI factory throughput, measured in tokens per second per megawatt (TPS per MW), on the vertical axis.
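Computing such a frontier from a set of measured operating points is straightforward. The sketch below keeps only the configurations not dominated in both dimensions; the sample points are hypothetical.

```python
def pareto_frontier(points):
    """Keep operating points not dominated in both TPS/user and TPS/MW."""
    pts = sorted(points, reverse=True)      # sort by TPS per user, descending
    frontier, best_mw = [], float("-inf")
    for tps_user, tps_mw in pts:
        if tps_mw > best_mw:                # strictly better factory throughput
            frontier.append((tps_user, tps_mw))
            best_mw = tps_mw
    return sorted(frontier)

# Hypothetical operating points as (tps_per_user, tps_per_mw):
configs = [(50, 9e6), (150, 6e6), (400, 4e6), (400, 1e6),
           (600, 2e6), (100, 5e6)]
print(pareto_frontier(configs))
```

Points like `(400, 1e6)` and `(100, 5e6)` drop out because another configuration beats them on both axes; the surviving curve is the frontier the following figures describe.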
As shown in Figure 10, different AI services operate at very different points on this curve. Throughput-first services, including many free-tier and background workloads, typically prioritize maximum efficiency and high utilization and often use smaller models with shorter context windows. Premium AI services, by contrast, demand higher model capability and far more responsive user-visible performance, especially for long-context reasoning and agentic workflows. In Figure 10, that premium tier is represented by a 2-trillion-parameter MoE model with a 400K input context window operating at roughly 400 TPS per user and beyond.


Reaching these premium operating points with a single homogeneous platform forces a tradeoff between responsiveness and overall AI factory throughput, because the workload mixes fundamentally different performance regimes within the same serving pipeline. A heterogeneous architecture expands the achievable region by combining complementary execution paths, allowing the system to sustain high factory output while delivering highly responsive, low-latency interactive experiences. As illustrated in Figure 10, the combination of Vera Rubin NVL72 and LPX delivers up to 35x higher TPS per megawatt at 400 TPS per user compared with the NVIDIA GB200 NVL72, effectively creating a new premium performance tier for interactive AI services.
This shift has a direct economic impact. Higher responsiveness expands the set of premium experiences an AI factory can serve and increases value per unit of infrastructure. With the Vera Rubin platform, AI factories can unlock up to 5x more revenue per megawatt compared with the GB200 NVL72, and up to 10x by pairing Vera Rubin NVL72 with LPX for the most latency-sensitive, high-value interactive workloads, such as agentic coding and multi-agent systems.


What NVIDIA Groq 3 LPX enables for developers
Developers are increasingly building systems that require three things at once:
- Responsiveness: low and predictable latency for interactive experiences and agent loops.
- Capability: strong model quality, reasoning depth, and long-context understanding.
- Scale: high throughput and cost efficiency to serve many concurrent users or agents.
LPX broadens the set of workloads an AI factory can serve efficiently. Use the low-latency path where predictable token generation improves the experience, such as coding assistants, agentic workflows with tight tool-calling loops, voice interactions, and real-time translation. Keep throughput-first workloads, such as batch serving and long-context throughput runs, on Rubin GPUs, where high concurrency and batching keep GPUs consistently busy and cost-efficient. The operational shift is one of mindset: stop optimizing for one headline metric and start optimizing for a range of real-world operating points.
Learn more
Dive deeper into the architecture behind NVIDIA Groq 3 LPX and Vera Rubin by starting with the NVIDIA product pages and technical blogs covering the Vera Rubin platform, LPX, AFD, and Dynamo. Explore the underlying research on tensor streaming processors and software-defined silicon design for AI. Together, these resources offer a deeper look at the hardware, system architecture, and orchestration software behind heterogeneous, low-latency inference at scale. Next, join an NVIDIA Developer Forum thread focused on inference and deployment to compare notes with other teams building low-latency serving systems.
Resources
Acknowledgments
Thanks to Amr Elmeleegy, Andrew Bitar, Andrew Ling, Graham Steele, Itay Neeman, Jamie Li, Omar Kilani, Santosh Raghavan, and Stuart Pitts, along with many other NVIDIA product leaders, engineers, and designers who contributed to this post.
