In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use specialized hardware like FPGAs and ASICs. Yet, as markets grow more efficient, traders increasingly rely on advanced models such as deep neural networks to maintain profitability. Because implementing these complex models on low-level hardware requires significant investment, general-purpose GPUs offer a practical, cost-effective alternative.
The NVIDIA GH200 Grace Hopper Superchip in the Supermicro ARS-111GL-NHR server has achieved single-digit microsecond latencies in the STAC-ML Markets (Inference) benchmark, Tacana suite (audited by STAC), delivering performance comparable to or better than specialized hardware systems.
This post details these record-breaking results and provides a deep dive into the custom-tailored solutions required for low-latency GPU inference. It also walks you through an open source reference implementation and a tutorial for getting started.
STAC-ML benchmarking in financial services
Deep neural networks with long short-term memory (LSTM) are widely used for time series forecasting in capital markets. The STAC-ML Markets (Inference) benchmark measures LSTM model latency: the time between receiving new input and generating the output. It includes three models of increasing complexity (LSTM_A, LSTM_B, and LSTM_C), where LSTM_B is about six times larger than LSTM_A, and LSTM_C is roughly 200 times larger than LSTM_A. The benchmark features two suites: Tacana, which tests inference on a sliding window that updates every time step, and Sumaco, which tests inference on entirely new data for every operation.
The STAC-ML Markets (Inference) Tacana benchmark processes sliding window inputs, generating a single regression output, zt, at each iteration.


STAC-ML has emerged as a vital benchmark for financial institutions leveraging machine learning (ML) in trading. It rigorously measures the speed and reliability of a technology stack when running models on live market data under realistic, production-like conditions. By standardizing key metrics, such as latency, throughput, and efficiency for LSTM and other time series models, STAC-ML enables banks, hedge funds, and market makers to conduct objective, apples-to-apples comparisons of competing hardware and software solutions before deployment.
For trading desks situated in co-located data centers, where winning or losing an order can be decided in microseconds, STAC-ML results are essential. They validate that a platform can meet strict latency budgets for demanding use cases like high-frequency market making, short-term price prediction, and automatic hedging.
Moreover, since the benchmark is designed and governed by practitioners from leading financial firms, its scores carry significant weight in the technology selection process, helping firms manage risk in the rollout of new ML-driven trading strategies and justify major investment decisions.
Key NVIDIA STAC-ML results
NVIDIA demonstrated the following latencies (99th percentile) on a Supermicro ARS-111GL-NHR with a single NVIDIA GH200 Grace Hopper Superchip in FP16 precision for STAC-ML Tacana.
LSTM_A and p99 latency:
- 4.70 microseconds with one model instance
- 4.67 microseconds with two model instances
- 4.61 microseconds with four model instances
- 4.67 microseconds with eight model instances
LSTM_B and p99 latency:
- 7.10 microseconds with one model instance
- 6.88 microseconds with two model instances
- 7.10 microseconds with four model instances
LSTM_C and p99 latency:
- 15.80 microseconds with one model instance
Note that the observed latencies remain highly consistent when scaling from one to four or eight model instances (NMIs) for both LSTM_A and LSTM_B. This stability highlights the importance of green contexts in maintaining predictable performance for latency-sensitive applications. For more details, see STAC-ML Markets (Inference) on a Supermicro ARS-111GL-NHR with NVIDIA GH200 Grace Hopper Superchip.
Seamless integration of NVIDIA GH200 Grace Hopper Superchip
The NVIDIA GH200 Grace Hopper Superchip expands the powerful 64-bit Arm processor ecosystem by supporting a diverse range of containers, application binaries, and operating systems, all running effortlessly on Grace Hopper with no modifications required. It seamlessly integrates with the full NVIDIA software stack, including NVIDIA HPC and AI platforms.
Comparison to previous submissions
NVIDIA previously submitted optimized results for both throughput and latency (Sumaco and Tacana benchmarks), as detailed in NVIDIA A100 Aces Throughput, Latency Results in Key Inference Benchmark for Financial Services Industry. In this earlier Tacana work, the sliding window approach enabled more efficient handling of the recurrence through precomputation. We refactored the problem to perform computations across all time steps using a fixed number of matrix-matrix multiplications (GEMMs) and an initial precomputation, enabling competitive performance.
Recent benchmark submissions on FPGAs for Tacana have reported single-digit microsecond latencies for two LSTM sizes by focusing latency measurements on the final time step and leveraging precomputations outside critical sections.
Achieving such low latency on GPUs requires a custom-tailored solution, pushing the boundaries of GPU kernel launch latency.
The NVIDIA implementation consists of two sequential steps. The first step is precomputation, which generates the required inputs for the final time step of the sliding window LSTM. For instance, if there is a requirement to reset the sliding window’s initial hidden/cell inputs to 0, that would be two GEMM operations per layer. This precomputation phase is excluded from timing measurements.
The second step is inference, where the last LSTM time step is computed after the sliding window shifts to the new input. After the inference, the relevant data for the next inference iteration is precomputed in the next preprocess stage.
Low-latency LSTM inference on GPUs
This section describes techniques for implementing low-latency inference of LSTM networks efficiently on NVIDIA hardware, including an open source reference implementation.
NVIDIA open source LSTM CUDA kernels
dl-lowlat-infer is an open source repository that provides examples of CUDA kernels for implementing low-latency time series inference. The techniques used in the kernels presented here were also applied in the STAC-ML benchmarks. While the open source repository includes minimal benchmarking capabilities to enable code execution, it is not intended to be a fully fledged benchmarking suite like STAC-ML.
The dl-lowlat-infer repository showcases efficient techniques for running deep learning workloads on NVIDIA GPUs and is fully self-contained. It can generate model weights and inputs, randomly sample input data locations, and run inference for single or multiple model instances on the same GPU. Currently, it is restricted to sliding-window use cases for LSTMs.
This work focuses on three LSTM model sizes that are tuned specifically to fit and run efficiently on an NVIDIA RTX PRO 6000 Blackwell Server Edition GPU. The configurations include a small model that fits within the shared memory and registers of a single streaming multiprocessor (SM), a medium model that spans eight SMs within a thread block cluster (TBC), and a large model that utilizes nearly the full device (184 out of 186 SMs).
| Model | Layers | Time steps | Inputs | Units | Weights |
|---|---|---|---|---|---|
| Small | 2 | 64 | 128 | 96 | 160K |
| Medium | 3 | 96 | 192 | 160 | 635K |
| Large | 4 | 128 | 512 | 736 | 16.7M |

Table 1. LSTM model configurations
While the record-breaking STAC-ML Tacana results were achieved on the NVIDIA GH200 Grace Hopper Superchip, the following tutorial uses the NVIDIA RTX PRO 6000 Blackwell Server Edition. This transition is motivated by the target deployment environment for many financial services firms. Low-latency trading desks often operate in power-constrained co-location environments where the thermal and power envelopes of traditional data center-class GPUs, such as the GH200, may not be viable.
The NVIDIA RTX PRO 6000 Blackwell Server Edition GPU provides a robust and efficient alternative, suitable for deployment in these restricted environments. Crucially, the low-latency inference techniques and the open source code presented in the following tutorial are fully compatible with both architectures. This ensures that the same optimized kernels that deliver high performance on the RTX PRO 6000 Blackwell Server Edition GPU also run efficiently on the GH200, enabling users to easily benchmark on data center platforms.
How to build and run the low-latency LSTM inference reference implementation
To build and run the benchmark, you need CUDA 13.0 or newer and a C++20-capable compiler. The following instructions are tailored to the latest NVIDIA Blackwell architecture, but you can also run the code on NVIDIA Hopper GPUs by compiling for SM90. Only the small network is supported on older GPU architectures; the two larger networks cannot run there due to technical limitations.
Building inside Docker
The benchmark is designed to run inside a Docker container. From within the top-level directory of the code, you can build the container and the benchmark, and prepare the model weights and inputs:
make -C docker CUDA_ARCHS=120-real LOCAL_USER=1 release_run
CUDA_ARCHS sets the target GPU architecture in CMake. For example, 120-real matches the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU, while 100 can be used for NVIDIA Blackwell data center GPUs.
Running a model
After the image is built and the container is started, you can go to the app directory, /app/dl-lowlat-infer, and run, for example, 10-second executions of a single instance of the small model using the persistent algorithm:
./nvLstmInf lstm_s data/lstm_s data/lstm_s.npy 10
Run four instances of the medium model using the persistent algorithm with six CPU threads (one thread per model instance, plus the main thread and a timing thread):
./nvLstmInf --cpuset=0,1,2,3,4,5 --num-instances=4 lstm_m data/lstm_m data/lstm_m.npy 10
For more details on running and developing inside the container, refer to the benchmark documentation.
Results
Table 2 shows the results produced by launching the code on a system with an AMD EPYC 9124 16-core processor and an NVIDIA RTX PRO 6000 Blackwell Server Edition GPU.
| | Ping Pong | Small | Medium | Large |
|---|---|---|---|---|
| Average, µs | 2.4 | 3.5 | 4.7 | 13.2 |
| P99, µs | 2.5 | 4.3 | 5.4 | 14.2 |

Table 2. Ping Pong overhead and model inference latencies
The Ping Pong test measures the overhead of CPU-GPU synchronization and reading the input vector from host memory, which is the dominant contributor to latency for the small model. These measurements follow the same approach as the MVAPICH latency microbenchmarks from Ohio State University. This overhead varies across systems and depends on multiple factors in the hardware and software stack. For larger models with more layers, additional latency arises from using cluster- and grid-level synchronization primitives.
Implementation details
This section presents the implementation details.
Persistent kernels for inference
The inference stage with a batch size of one performs a matrix-vector multiplication, followed by the elementwise operations at each layer. After the final layer, the resulting hidden state is reduced into a single value, which is reported by the inference to the benchmarking process.
Inference was implemented using a persistent kernel approach, meaning the kernel stays active throughout the application’s lifetime. This persistence improves performance by loading weights into shared memory and registers just once during kernel initialization.


Depending on the problem size, and to ensure the weights fit within the available SMs, the implementation uses a single CUDA block, a TBC, or the full device. As a result, three distinct kernels are implemented, all sharing the same memory layout for weights and all following similar structural and timing conventions.
A TBC can span up to eight SMs on RTX PRO 6000 Blackwell Server Edition GPUs, which is sufficient to accommodate the weights for the medium model. The distributed shared memory API for TBCs allows more efficient data exchange and synchronization between SMs when gathering pieces of the computed hidden state from the other CUDA blocks.
Timing
Timing is managed by a CPU thread, which requires implementing signaling between host and device using CPU and GPU atomic synchronization primitives:
- The host signals the device when new input arrives in host memory and concurrently starts the timer.
- The kernel polls for this signal, then reads the input and initiates computation.
- The computed floating-point output also acts as the host’s signal to stop the timer.
There is additional signaling to abort the kernel execution or reset the double buffer ID. Those buffers contain the data coming from the precomputation.
Serving multiple model instances
It is not energy- or cost-efficient to run, for example, a single CUDA block inference on an RTX PRO 6000 Blackwell Server Edition GPU. Use the CUDA green context (GC) feature to serve multiple inference instances on the same GPU. Note that there are other ways to serve multiple instances independently; for example, use the NVIDIA Multi-Instance GPU (MIG) feature or add another layer of signaling complexity to the persistent CUDA kernel itself.
The GC feature enables partitioning the GPU into GCs within the application without complicating the kernel. Each GC binds to a specific number of SMs. Any CUDA work submitted to a CUDA stream created in such a context will be executed on the corresponding set of SMs.
GCs are more lightweight and transparent to the programmer than traditional contexts. The GPU is split into partitions of equal size to serve multiple persistent kernels. The remaining SMs are used for the precomputation phase. Because the precomputation is not latency-critical, precomputations from different model instances are submitted to the same partition with the remaining SMs, but in different CUDA streams.
Coordinating with a persistent kernel instance from the host involves multiple spin loops. Thus, serving multiple model instances requires spawning one additional CPU thread per GC.
The minimal GC size is two SMs, so the small and medium models allocate two and eight SMs per GC, respectively. The large model needs almost the whole device to hold the weights in shared memory and registers, so it is impossible to serve more than one model at a time on a single RTX PRO 6000 Blackwell Server Edition GPU.
GDRCopy
Polling a device on a flag located in a pinned host buffer can be quite expensive. GDRCopy provides a low-latency alternative by creating a CPU mapping of GPU memory using GPUDirect RDMA. This enables CPU-driven memory copies with minimal overhead, which is particularly helpful in low-latency scenarios where data transfers are small and frequent. In our experiments, using GDRCopy has yielded speedups of up to 0.5 µs on PCIe-based systems.
Ping Pong benchmark
To obtain a Ping Pong model, start from the smallest model implementation and remove all LSTM-related computations. This setup measures only the overhead from CPU signaling and input reading for a single time step. Since it involves no weights, it is implemented using a single CUDA block, just as in the smallest model. This allows estimating the minimal latency achievable on a given system with our implementation.
Start with low-latency inference
Building on the previous work reported in Benchmarking Deep Neural Networks for Low-Latency Trading and Rapid Backtesting on NVIDIA GPUs, we have now integrated custom CUDA kernels specifically optimized for latency-critical paths. These enhancements achieved record-breaking latency across two LSTM model sizes while preserving a flexible developer experience. The NVIDIA platform continues to offer a consistent, productivity-oriented environment for research, optimization, and deployment.
These capabilities are accessible through an open source time series modeling pipeline that showcases how to use NVIDIA technology efficiently for low-latency inference and backtesting. You can also view the GTC 2026 session Build High-Performance Financial AI: Achieve Microsecond Latency and Scalable LLM Inference on demand.
STAC and all STAC names are trademarks or registered trademarks of the Strategic Technology Evaluation Center, LLC.
