NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance

Large language models (LLMs) are revolutionizing the financial trading landscape by enabling sophisticated analysis of vast amounts of unstructured data to generate actionable trading insights. These advanced AI systems can process financial news, social media sentiment, earnings reports, and market data to predict stock price movements and automate investment strategies with unprecedented accuracy.

The Securities Technology Analysis Center (STAC) has been developing benchmarks for workloads key to the financial industry for over 15 years. It has now developed the STAC-AI benchmark to help firms assess the end-to-end retrieval-augmented generation (RAG) and LLM inference pipeline.

This post presents the results achieved on the STAC-AI LANG6 benchmark across multiple NVIDIA platforms. It also shares recommendations on how you can benchmark NVIDIA TensorRT LLM according to the characteristics of your own dataset.

STAC-AI LANG6 (Inference-Only) benchmark

Within the broader context of a RAG pipeline, STAC-AI LANG6 is the part of the benchmark that focuses on LLM inference performance. The benchmark tests the hardware and software stack on the Llama 3.1 8B Instruct and Llama 3.1 70B Instruct models with the following custom datasets:

  • EDGAR4: The prompts are summarizations of the relationship of a company to one of various physical and financial concepts (such as commodities, currencies, interest rates, and real estate sectors). It uses EDGAR 10‑K paragraphs from a single security filing for a single year. The input/output sequence lengths aim to model medium-length requests.
  • EDGAR5: Questions covering several different aspects of an entire 10‑K filing. The document type is the full text of a single EDGAR 10‑K filing. The input/output sequence lengths aim to model long-context requests.

These datasets, based on EDGAR filings, model medium- and long-context summarization for financial trading and investment advice use cases. The prompts ask the model to perform analysis and summarization of annual reports (10-K filings) for thousands of public companies over the past five years.

The benchmark also tests two different inference scenarios: batch mode and interactive mode.

  • Batch (offline) mode: All requests are submitted at once, and all responses are collected at the end. Only throughput is measured.
  • Interactive (online) mode: Requests arrive at pseudo-random times. The mean arrival rate λ (the average number of requests the system receives per second) can be set to model different usage scenarios. The benchmark collects metrics such as response time (RT), words per second per user (WPS/user), and total words per second (WPS), but doesn't set any constraint on them. RT is analogous to time to first token (TTFT) in other benchmarks, and WPS/user to tokens/second/user.

Note that interactive mode doesn't cover the combination of Llama 3.1 70B Instruct with EDGAR5.
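Pseudo-random arrivals at a mean rate λ are conventionally modeled as a Poisson process, with exponentially distributed gaps between requests. The sketch below is illustrative only, not STAC's actual load generator; the rate and request count are made up for the example.

```python
import random

def poisson_arrival_times(rate_lambda, num_requests, seed=0):
    """Generate pseudo-random arrival timestamps for a mean rate of
    `rate_lambda` requests per second, using exponential inter-arrival gaps."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(rate_lambda)  # mean gap = 1 / lambda
        times.append(t)
    return times

# At lambda = 2 req/s, 1,000 requests should span roughly 500 seconds.
arrivals = poisson_arrival_times(rate_lambda=2.0, num_requests=1000)
```

Sweeping `rate_lambda` is what traces out the interactivity-throughput tradeoff discussed later: higher arrival rates raise throughput but also queueing delay.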

The benchmark checks the quality of the output and the word count against a control set of LLM-generated responses.

While other benchmarks allow all preprocessing, an important differentiator of STAC-AI is the requirement to apply chat templates and tokenize requests during inference. Real deployments may prefer to do this work on the server side to protect their system prompts, which imposes more load on the CPU.
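As a rough illustration of this work, the Llama 3.1 chat template wraps every request in header and end-of-turn tokens before tokenization. The hand-rolled sketch below approximates that template; a real deployment would call the tokenizer's own `apply_chat_template` to guarantee an exact match.

```python
def apply_llama31_chat_template(system_prompt, user_message):
    """Format a request in the Llama 3.1 chat style prior to tokenization.
    Illustrative only; use the tokenizer's apply_chat_template in practice."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = apply_llama31_chat_template(
    "You are a financial analyst.",
    "Summarize this 10-K paragraph: ...",
)
```

This string manipulation, plus tokenization, must run on the CPU for every request during the measured run.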

Hardware and software stack

This post compares two on-premises NVIDIA Hopper-based servers submitted by HPE with a cloud-based NVIDIA Blackwell node.

Because the benchmark requires post-training quantization as part of the benchmarking procedure, the models were quantized using the NVIDIA TensorRT Model Optimizer. To leverage the most performant kernels available for each deployment, quantization was performed to FP8 on NVIDIA Hopper and to NVFP4 on NVIDIA Blackwell.
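To illustrate the idea behind NVFP4, the toy sketch below rounds a block of weights onto the FP4 E2M1 value grid with one shared scale per block. This is a simplified sketch, not Model Optimizer's actual recipe, which also calibrates scales on representative data and stores block scales in FP8.

```python
# Representable magnitudes of the FP4 E2M1 format used by NVFP4.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Quantize one block of weights to the E2M1 grid with a shared scale.
    Toy example of block-scaled quantization; real NVFP4 uses 16-element
    blocks with FP8 scales plus a tensor-level scale."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # map the largest magnitude onto E2M1's max value
    dequantized = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        dequantized.append(mag * scale if x >= 0 else -mag * scale)
    return dequantized, scale

weights = [0.12, -0.6, 0.31, 0.05, -0.9, 0.44, 0.0, 0.27]
dq, s = quantize_block_fp4(weights)
```

The shared per-block scale is what lets a 4-bit format with only eight magnitudes track weight distributions closely enough for inference.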

To achieve the best performance on both Hopper and Blackwell, the NVIDIA TensorRT LLM inference framework was used for efficient model execution. The quantized models were run using the TensorRT LLM PyTorch runtime for a familiar, native PyTorch development experience while maintaining peak performance.

Benchmarking results on STAC-AI LANG6

Benchmarking results for both batch mode and interactive mode are detailed in this section.

Batch mode

For batch mode, NVIDIA Blackwell delivers significant speedups in all scenarios. Table 1 shows the WPS and requests per second (RPS) achieved. 

Note that the NVIDIA GB200 NVL72 results weren’t audited by STAC.

Systems: 2x GH200 144 GB (TensorRT LLM FP8), 4x GB200 NVL72 (TensorRT LLM NVFP4), and 2x RTX PRO 6000 (NVFP4).

| Model | Dataset | GH200 WPS | GH200 RPS | GB200 WPS | GB200 RPS | RTX PRO WPS | RTX PRO RPS |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | EDGAR4 | 8,237 | 51.5 | 37,480 | 224 | 5,500 | 32.9 |
| Llama 3.1 8B | EDGAR5 | 304 | 0.784 | 1,112 | 2.85 | 138 | 0.345 |
| Llama 3.1 70B | EDGAR4 | 1,071 | 6.77 | 5,618 | 35.9 | 831 | 5.26 |
| Llama 3.1 70B | EDGAR5 | 41.4 | 0.119 | 150 | 0.477 | 13 | 0.04 |

Table 1. STAC-AI batch mode results across all model and dataset combinations

Full results with more details across both interactive and batch modes can be found in the reports published by STAC.

Single-GPU performance was also assessed to account for the different number of GPUs in each system. Although STAC-AI doesn't measure per-GPU performance, the results shown in Figure 1 illustrate the throughput difference between single GPUs from each of the systems.

Relative performance bar chart showing per-GPU performance uplift of up to 3.2x on the GB200 NVL72 compared to the GH200.
Figure 1. Performance improvement from a single NVIDIA GH200 GPU to a single NVIDIA GB200 NVL72 GPU can reach 3.2x
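Per-GPU throughput here is simply total throughput divided by the number of GPUs in the submission. As a sketch, the batch-mode Llama 3.1 70B EDGAR4 numbers from Table 1 give the following (the 3.2x peak in Figure 1 comes from a different scenario):

```python
# Batch-mode totals from Table 1 (Llama 3.1 70B, EDGAR4) and GPU counts.
systems = {
    "GH200": {"wps": 1071.0, "num_gpus": 2},  # 2x GH200 144 GB, FP8
    "GB200": {"wps": 5618.0, "num_gpus": 4},  # 4x GB200 NVL72, NVFP4
}

# Per-GPU throughput = total WPS / number of GPUs in the submission.
per_gpu = {name: cfg["wps"] / cfg["num_gpus"] for name, cfg in systems.items()}
uplift = per_gpu["GB200"] / per_gpu["GH200"]  # per-GPU speedup, ~2.6x here
```

This normalization is what makes systems with different GPU counts comparable on a single chart.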

Interactive mode 

The balance between token economics (which depends on throughput) and user experience (which depends on interactivity metrics such as RT and WPS/user) is an important factor in modern LLM inference.

Interactive mode showcases the tradeoff along the interactivity-throughput Pareto front by selecting a range of arrival rates. Interactivity is measured by both RT and WPS/user. To facilitate visualization, the inverse of WPS/user, defined as interword latency (IWL), or 1/(WPS/user), is used. The graphs use the 95th percentile of both metrics.

As seen in Figure 2, GB200 NVL72 achieves a better tradeoff between throughput and both RT and IWL across the board. IWL (solid, lower is better) and RT (dashed, lower is better) are plotted versus interactive-mode throughput across model/dataset scenarios.

Six small line charts compare GH200, GB200, and RTX 6000 Pro in interactive mode. The top row plots p95 IWL (seconds per token) versus throughput (requests per second), and the bottom row plots p95 RT (seconds) versus throughput, for three model/dataset configurations.
Figure 2. NVIDIA GB200 NVL72 sustains higher interactivity at higher interactive throughput compared to NVIDIA GH200

Figure 3 shows that, even when operating at the same percentage of maximum throughput, NVIDIA GB200 NVL72 achieves better RT and IWL in most scenarios. Normalizing the x-axis removes raw throughput advantages and highlights interactivity at equal load.

Six small line charts show the same interactive-mode metrics and scenarios as Figure 2, but the throughput axis is normalized relative to each system’s batch-mode request throughput (shown with vertical reference lines). The top row shows p95 inter-word latency versus normalized throughput and the bottom row shows p95 reaction time versus normalized throughput.
Figure 3. At matched utilization (normalized to each system's batch-mode throughput), NVIDIA GB200 NVL72 delivers lower IWL and RT compared to NVIDIA GH200 in most scenarios

How to benchmark TensorRT LLM with your custom data

While the STAC benchmark uses proprietary data and metrics, you can benchmark TensorRT LLM with models tailored to your specific dataset characteristics. This tutorial walks you through quantizing a model, preparing your dataset, and running performance benchmarks, all customized to your use case.

Prerequisites:

  • A Docker image that includes TensorRT LLM (a TensorRT LLM release container, for example).
  • An NVIDIA GPU large enough to serve your model at the desired quantization level. You can find a support matrix for quantization in the TensorRT LLM documentation.
  • A Hugging Face account and token, along with access to the gated Llama 3.1 8B Instruct or Llama 3.1 70B Instruct models. You can set the HF_TOKEN environment variable to your token, and all subsequent commands will use it.

Step 1: Launch the container

The containers maintained by NVIDIA come with all required dependencies pre-installed. Change to an empty directory with enough space for the models and their quantized versions. You can start the container on a machine with NVIDIA GPUs with the following command. Make sure you specify your Hugging Face token.

docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
            --gpus=all \
            -u $(id -u):$(id -g) \
            -e USER=$(id -un) \
            -e HOME=/tmp \
            -e TRITON_CACHE_DIR=/tmp/.triton \
            -e TORCHINDUCTOR_CACHE_DIR=/tmp/.inductor_cache \
            -e HF_HOME=/workspace/model_cache \
            -e HF_TOKEN= \
            --volume "$(pwd)":/workspace \
            --workdir /workspace \
            nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2

Step 2: Clone the repositories

Model quantization reduces model size and improves inference speed. Use NVIDIA Model Optimizer to quantize Llama 3.1 8B Instruct to NVFP4 format. First, clone the Model Optimizer repository for the quantization example:

git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git -b 0.37.0

Step 3: Quantize the model

Next, execute the Hugging Face example script with the chosen model and quantization format; in this case, Llama 3.1 8B Instruct with NVFP4 quantization.

bash TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quant nvfp4

Step 4: Generate synthetic data

Use the benchmark utility to generate a synthetic dataset with the token distribution needed for your task. This example creates 30,000 requests with a fixed input sequence length of 2,048 and an output sequence length of 128. Nonzero standard deviations better approximate real traffic, if you have access to that information.

python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
  --stdout \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  token-norm-dist \
  --input-mean 2048 \
  --output-mean 128 \
  --input-stdev 0 \
  --output-stdev 0 \
  --num-requests 30000 \
  > dataset_2048_128.json

Step 5: Run the benchmark

The trtllm-bench command can run the generated requests in an offline fashion, sending all requests at once to the TensorRT LLM runtime (closely matching the STAC-AI batch mode).

While some options are available in the CLI, the full LLM API can be accessed through a YAML file passed with the extra_llm_api_options parameter. For the purposes of this example, enable CUDA graph padding. To learn about more options, see the TensorRT LLM API Reference.

cat > llm_options.yml << 'EOF'
cuda_graph_config:
  enable_padding: True
EOF

Finally, run the benchmark, specifying the model, the dataset, and the options:

trtllm-bench \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --model_path /workspace/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_Llama-3_1-8B-Instruct_nvfp4 \
  throughput \
  --dataset dataset_2048_128.json \
  --backend pytorch \
  --extra_llm_api_options llm_options.yml

This outputs various metrics such as request throughput, tokens/second/GPU, and more.
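As a sanity check on the reported numbers, per-GPU token throughput relates to request throughput roughly as RPS × sequence length ÷ GPU count. The values below are hypothetical; trtllm-bench reports the metric directly.

```python
def tokens_per_second_per_gpu(rps, osl, num_gpus, isl=0):
    """Back-of-the-envelope conversion from request throughput to per-GPU
    token throughput. With isl=0 only output tokens are counted; include
    isl if the metric you compare against counts input tokens too."""
    return rps * (isl + osl) / num_gpus

# Hypothetical run: 50 req/s, 128 output tokens per request, on 2 GPUs.
tps_gpu = tokens_per_second_per_gpu(rps=50.0, osl=128, num_gpus=2)
```

Cross-checking the reported tokens/second/GPU against this estimate quickly flags misconfigured sequence lengths or GPU counts.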

Get started with TensorRT LLM benchmarking

NVIDIA GB200 NVL72 significantly advanced performance on the STAC-AI LANG6 benchmark, setting a new record for LLM inference in the financial sector. NVIDIA Blackwell delivered up to 3.2x the performance of previous architectures, achieving both higher throughput and consistently superior interactivity.

Alongside the new record, NVIDIA Hopper continues to deliver strong, valuable results for LLM inference workloads. More than three years after its initial release, Hopper remains highly effective in both batch and interactive inference scenarios, maintaining good performance metrics even at high throughput and confirming its continued relevance for financial institutions.

To dive deeper into setting up and running your own performance evaluations, explore the TensorRT LLM Benchmarking Guide.


