In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use specialized hardware like FPGAs and ASICs. Yet, as markets grow more efficient, traders increasingly rely on advanced models such as deep neural networks to maintain profitability. Because implementing these complex models on low-level hardware requires significant investment, general-purpose GPUs offer a practical, cost-effective alternative.
The NVIDIA GH200 Grace Hopper Superchip in the Supermicro ARS-111GL-NHR server has achieved single-digit microsecond latencies in the STAC-ML Markets (Inference) benchmark, Tacana suite (audited by STAC), delivering performance comparable to or better than specialized hardware systems.
This post details these record-breaking results and provides a deep dive into the custom-tailored solutions required for low-latency GPU inference. It also walks you through an open source reference implementation and a tutorial for getting started.
STAC-ML benchmarking in financial services
Deep neural networks with long short-term memory (LSTM) are widely used for time series forecasting in capital markets. The STAC-ML Markets (Inference) benchmark measures LSTM model latency: the time between receiving new input and generating the output. It includes three models of increasing complexity (LSTM_A, LSTM_B, and LSTM_C), where LSTM_B is about six times larger than LSTM_A, and LSTM_C is roughly 200 times larger than LSTM_A. The benchmark features two suites: Tacana, which tests inference on a sliding window that updates every time step, and Sumaco, which tests inference on entirely new data for every operation.
The STAC-ML Markets (Inference) Tacana benchmark processes sliding window inputs, generating a single regression output, zt, at each iteration.


STAC-ML has emerged as a vital benchmark for financial institutions leveraging machine learning (ML) in trading. It rigorously measures the speed and reliability of a technology stack when running models on live market data under realistic, production-like conditions. By standardizing key metrics, such as latency, throughput, and efficiency for LSTM and other time series models, STAC-ML enables banks, hedge funds, and market makers to conduct objective, apples-to-apples comparisons of competing hardware and software solutions before deployment.
For trading desks situated in co-located data centers, where winning or losing an order can be decided in microseconds, STAC-ML results are essential. They validate that a platform can meet strict latency budgets for demanding use cases like high-frequency market making, short-term price prediction, and automatic hedging.
Moreover, since the benchmark is designed and governed by practitioners from leading financial firms, its scores carry significant weight in the technology selection process, helping firms manage risk in the rollout of new ML-driven trading strategies and justify major investment decisions.
Key NVIDIA STAC-ML results
NVIDIA demonstrated the following latencies (99th percentile) on a Supermicro ARS-111GL-NHR with a single NVIDIA GH200 Grace Hopper Superchip in FP16 precision for STAC-ML Tacana.
LSTM_A and p99 latency:
- 4.70 microseconds with one model instance
- 4.67 microseconds with two model instances
- 4.61 microseconds with four model instances
- 4.67 microseconds with eight model instances
LSTM_B and p99 latency:
- 7.10 microseconds with one model instance
- 6.88 microseconds with two model instances
- 7.10 microseconds with four model instances
LSTM_C and p99 latency:
- 15.80 microseconds with one model instance
Note that the observed latencies remain highly consistent when scaling from one to four or eight model instances (NMIs) for both LSTM_A and LSTM_B. This stability highlights the importance of green contexts in maintaining predictable performance for latency-sensitive applications. For more details, see STAC-ML Markets (Inference) on a Supermicro ARS-111GL-NHR with NVIDIA GH200 Grace Hopper Superchip.
Seamless integration of NVIDIA GH200 Grace Hopper Superchip
The NVIDIA GH200 Grace Hopper Superchip expands the powerful 64-bit Arm processor ecosystem by supporting a diverse range of containers, application binaries, and operating systems, all running effortlessly on Grace Hopper with no modifications required. It seamlessly integrates with the full NVIDIA software stack, including NVIDIA HPC and AI platforms.
Comparison to previous submissions
NVIDIA previously submitted optimized results for both throughput and latency (Sumaco and Tacana benchmarks), as detailed in NVIDIA A100 Aces Throughput, Latency Results in Key Inference Benchmark for Financial Services Industry. In this earlier Tacana work, the sliding window approach enabled more efficient handling of the recurrence through precomputation. We refactored the problem to perform computations across all time steps using a fixed number of matrix-matrix multiplications (GEMMs) and an initial precomputation, enabling competitive performance.
Recent benchmark submissions on FPGAs for Tacana have reported single-digit microsecond latencies for two LSTM sizes by focusing latency measurements on the final time step and leveraging precomputations outside critical sections.
Achieving such low latency on GPUs requires a custom-tailored solution, pushing the boundaries of GPU kernel launch latency.
The NVIDIA implementation consists of two sequential steps. The first step is precomputation, which generates the required inputs for the final time step of the sliding window LSTM. For instance, if there is a requirement to reset the sliding window’s initial hidden/cell inputs to 0, that would be two GEMM operations per layer. This precomputation phase is excluded from timing measurements.
The second step is inference, where the last LSTM time step is computed after the sliding window shifts to the new input. After the inference, the relevant data for the next inference iteration is precomputed in the next preprocess stage.
Low-latency LSTM inference on GPUs
This section describes techniques for implementing low-latency inference of LSTM networks efficiently on NVIDIA hardware, including an open source reference implementation.
NVIDIA open source LSTM CUDA kernels
dl-lowlat-infer is an open source repository that provides examples of CUDA kernels for implementing low-latency time series inference. The techniques used in the kernels presented here were also applied in the STAC-ML benchmarks. While the open source repository includes minimal benchmarking capabilities to enable code execution, it is not intended to be a fully fledged benchmarking suite like STAC-ML.
The dl-lowlat-infer repository showcases efficient techniques for running deep learning workloads on NVIDIA GPUs and is fully self-contained. It can generate model weights and inputs, randomly sample input data locations, and run inference for single or multiple model instances on the same GPU. Currently, it is restricted to sliding-window use cases for LSTMs.
This work focuses on three LSTM model sizes that are tuned specifically to fit and run efficiently on an NVIDIA RTX PRO 6000 Blackwell Server Edition GPU. The configurations include a small model that fits within the shared memory and registers of a single streaming multiprocessor (SM), a medium model that spans eight SMs within a thread block cluster (TBC), and a large model that utilizes nearly the full device (184 out of 186 SMs).
| Model | Layers | Time steps | Inputs | Units | Weights |
|---|---|---|---|---|---|
| Small | 2 | 64 | 128 | 96 | 160K |
| Medium | 3 | 96 | 192 | 160 | 635K |
| Large | 4 | 128 | 512 | 736 | 16.7M |

Table 1. LSTM model configurations
While the record-breaking STAC-ML Tacana results were achieved on the NVIDIA GH200 Grace Hopper Superchip, the following tutorial uses the NVIDIA RTX PRO 6000 Blackwell Server Edition. This transition is motivated by the target deployment environment for many financial services firms. Low-latency trading desks often operate in power-constrained co-location environments where the thermal and power envelopes of traditional data center-class GPUs, such as the GH200, may not be viable.
The NVIDIA RTX PRO 6000 Blackwell Server Edition GPU provides a robust and efficient alternative, suitable for deployment in these restricted environments. Crucially, the low-latency inference techniques and the open source code presented in the following tutorial are fully compatible with both architectures. This ensures that the same optimized kernels that deliver high performance on the RTX PRO 6000 Blackwell Server Edition GPU also run efficiently on the GH200, enabling users to easily benchmark on data center platforms.
How to build and run the low-latency LSTM inference reference implementation
To build and run the benchmark, you need CUDA 13.0 or newer and a C++20-capable compiler. The following instructions are tailored to the latest NVIDIA Blackwell architecture, but you can also run the code on NVIDIA Hopper GPUs by compiling for SM90. Only the small network is supported on older GPU architectures; the two larger networks cannot run there due to technical limitations.
Building inside Docker
The benchmark is designed to run inside a Docker container. From within the top-level directory of the code, you can build the container and the benchmark, and prepare the model weights and inputs:
make -C docker CUDA_ARCHS=120-real LOCAL_USER=1 release_run
CUDA_ARCHS sets the target GPU architecture in CMake. For example, 120-real matches the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU, while 100 can be used for NVIDIA Blackwell data center GPUs.
Running a model
After the image is built and the container is started, you can go to the app directory, /app/dl-lowlat-infer, and run, for example, 10-second executions of a single instance of the small model using the persistent algorithm:
./nvLstmInf lstm_s data/lstm_s data/lstm_s.npy 10
Run four instances of the medium model using the persistent algorithm with six CPU threads (one thread per model instance, plus the main thread and a timing thread):
./nvLstmInf --cpuset=0,1,2,3,4,5 --num-instances=4 lstm_m data/lstm_m data/lstm_m.npy 10
For more details on running and developing inside the container, refer to the benchmark documentation.
Results
Table 2 shows the results produced by launching the code on a system with an AMD EPYC 9124 16-core processor and an NVIDIA RTX PRO 6000 Blackwell Server Edition GPU.
| | Ping Pong | Small | Medium | Large |
|---|---|---|---|---|
| Average, µs | 2.4 | 3.5 | 4.7 | 13.2 |
| P99, µs | 2.5 | 4.3 | 5.4 | 14.2 |

Table 2. Ping Pong overhead and model inference latencies
The Ping Pong test measures the overhead of CPU-GPU synchronization and reading the input vector from host memory, which is the dominant contributor to latency for the small model. These measurements follow the same approach as the MVAPICH latency microbenchmarks from Ohio State University. This overhead varies across systems and depends on multiple factors in the hardware and software stack. For larger models with more layers, additional latency arises from using cluster- and grid-level synchronization primitives.
Implementation details
This section presents the implementation details.
Persistent kernels for inference
The inference stage with a batch size of one performs a matrix-vector multiplication, followed by the elementwise operations at each layer. After the final layer, the resulting hidden state is reduced into a single value, which is reported by the inference to the benchmarking process.
Inference was implemented using a persistent kernel approach, meaning the kernel stays active throughout the application’s lifetime. This persistence improves performance by loading weights into shared memory and registers just once during kernel initialization.


Depending on the problem size, and to ensure the weights fit within the available SMs, the implementation uses a single CUDA block, a TBC, or the full device. As a result, three distinct kernels are implemented, all sharing the same memory layout for weights and all following similar structural and timing conventions.
A TBC can span up to eight SMs on RTX PRO 6000 Blackwell Server Edition GPUs, which is sufficient to accommodate the weights for the medium model. The distributed shared memory API for TBCs allows more efficient data exchange and synchronization between SMs when gathering pieces of the computed hidden state from the other CUDA blocks.
Timing
Timing is managed by a CPU thread, which requires implementing signaling between host and device using CPU and GPU atomic synchronization primitives:
- The host signals the device when new input arrives in host memory and concurrently starts the timer.
- The kernel polls for this signal, then reads the input and initiates computation.
- The computed floating-point output also acts as the host’s signal to stop the timer.
There is additional signaling to abort the kernel execution or reset the double buffer ID. Those buffers contain the data coming from the precomputation.
Serving multiple model instances
It is not energy- or cost-efficient to run, for example, a single CUDA block inference on an RTX PRO 6000 Blackwell Server Edition GPU. Use the CUDA green context (GC) feature to serve multiple inference instances on the same GPU. Note that there are other ways to serve multiple instances independently; for example, use the NVIDIA Multi-Instance GPU (MIG) feature or add another layer of signaling complexity to the persistent CUDA kernel itself.
The GC feature enables partitioning the GPU into GCs within the application without complicating the kernel. Each GC binds to a specific number of SMs. Any CUDA work submitted to a CUDA stream created in such a context will be executed on the corresponding set of SMs.
GCs are more lightweight and transparent to the programmer than traditional contexts. The GPU is split into partitions of equal size to serve multiple persistent kernels. The remaining SMs are used for the precomputation phase. Because the precomputation is not latency-critical, precomputations from different model instances are submitted to the same partition with the remaining SMs, but in different CUDA streams.
Coordinating with a persistent kernel instance from the host involves multiple spin loops. Thus, serving multiple model instances requires spawning one additional CPU thread per GC.
The minimal GC size is two SMs, so the small and medium models allocate two and eight SMs per GC, respectively. The large model needs almost the whole device to hold the weights in shared memory and registers, so it is impossible to serve more than one model at a time on a single RTX PRO 6000 Blackwell Server Edition GPU.
GDRCopy
Polling a device on a flag located in a pinned host buffer can be quite expensive. GDRCopy provides a low-latency alternative by creating a CPU mapping of GPU memory using GPUDirect RDMA. This enables CPU-driven memory copies with minimal overhead, which is particularly helpful in low-latency scenarios where data transfers are small and frequent. In our experiments, using GDRCopy has yielded speedups of up to 0.5 µs on PCIe-based systems.
Ping Pong benchmark
To obtain a Ping Pong model, start from the smallest model implementation and remove all LSTM-related computations. This setup measures only the overhead from CPU signaling and input reading for a single time step. Since it involves no weights, it is implemented using a single CUDA block, just as in the smallest model. This allows estimating the minimal latency achievable on a given system with our implementation.
Start with low-latency inference
Building on the previous work reported in Benchmarking Deep Neural Networks for Low-Latency Trading and Rapid Backtesting on NVIDIA GPUs, we have now integrated custom CUDA kernels specifically optimized for latency-critical paths. These enhancements achieved record-breaking latency across two LSTM model sizes while preserving a flexible developer experience. The NVIDIA platform continues to offer a consistent, productivity-oriented environment for research, optimization, and deployment.
These capabilities are accessible through an open source time series modeling pipeline that showcases how to use NVIDIA technology efficiently for low-latency inference and backtesting. You can also view the GTC 2026 session Build High-Performance Financial AI: Achieve Microsecond Latency and Scalable LLM Inference on demand.
STAC and all STAC names are trademarks or registered trademarks of the Strategic Technology Evaluation Center, LLC.
