NVIDIA-Accelerated Mistral 3 Open Models Deliver Efficiency, Accuracy at Any Scale



The new Mistral 3 open model family delivers industry-leading accuracy, efficiency, and customization capabilities for developers and enterprises. Optimized from NVIDIA GB200 NVL72 to edge platforms, Mistral 3 includes: 

  • One large state-of-the-art sparse multimodal and multilingual mixture of experts (MoE) model with a total parameter count of 675B 
  • A set of small, dense high-performance models (called Ministral 3) of sizes 3B, 8B, and 14B, each with Base, Instruct, and Reasoning variants (nine models total) 

All of the models were trained on NVIDIA Hopper GPUs and are now available from Mistral AI on Hugging Face. Developers can choose from a range of options for deploying these models on different NVIDIA GPUs, with different model precision formats and open source framework compatibility (Table 1). 

|                   | Mistral Large 3    | Ministral-3-14B   | Ministral-3-8B    | Ministral-3-3B    |
|-------------------|--------------------|-------------------|-------------------|-------------------|
| Total parameters  | 675B               | 14B               | 8B                | 3B                |
| Active parameters | 41B                | 14B               | 8B                | 3B                |
| Context window    | 256K               | 256K              | 256K              | 256K              |
| Base              | –                  | BF16              | BF16              | BF16              |
| Instruct          | –                  | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 |
| Reasoning         | Q4_K_M, NVFP4, FP8 | Q4_K_M, BF16      | Q4_K_M, BF16      | Q4_K_M, BF16      |
| Frameworks        |                    |                   |                   |                   |
| vLLM              | ✔                  | ✔                 | ✔                 | ✔                 |
| SGLang            | ✔                  | –                 | –                 | –                 |
| TensorRT-LLM      | ✔                  | –                 | –                 | –                 |
| Llama.cpp         | –                  | ✔                 | ✔                 | ✔                 |
| Ollama            | –                  | ✔                 | ✔                 | ✔                 |
| NVIDIA hardware   |                    |                   |                   |                   |
| GB200 NVL72       | ✔                  | ✔                 | ✔                 | ✔                 |
| Dynamo            | ✔                  | ✔                 | ✔                 | ✔                 |
| DGX Spark         | ✔                  | ✔                 | ✔                 | ✔                 |
| RTX               | –                  | ✔                 | ✔                 | ✔                 |
| Jetson            | –                  | ✔                 | ✔                 | ✔                 |
Table 1. Mistral 3 model specifications

Mistral Large 3 delivers best-in-class performance on NVIDIA GB200 NVL72  

NVIDIA-accelerated Mistral Large 3 achieves best-in-class performance on NVIDIA GB200 NVL72 by leveraging a comprehensive stack of optimizations tailored for large state-of-the-art MoE models. Figure 1 shows the performance Pareto frontiers for GB200 NVL72 and NVIDIA H200 across the interactivity range. 

[Figure: Line chart titled "Performance per MW on Mistral Large 3 NVFP4 ISL/OSL 1K/8K." The x-axis shows TPS per user (interactivity) from 0 to about 150; the y-axis shows TPS per megawatt from 0 to 7,000,000. The GB200 curve sits well above the H200 curve across the full interactivity range, illustrating substantially higher energy efficiency.]
Figure 1. Performance per megawatt for Mistral Large 3, comparing NVIDIA GB200 NVL72 and NVIDIA H200 across different interactivity targets

For production AI systems that must deliver both strong user experience (UX) and cost-efficient scale, GB200 provides up to 10x higher performance than the previous-generation H200, exceeding 5,000,000 tokens per second per megawatt (MW) at 40 tokens per second per user.  

This generational gain translates to better UX, lower per-token cost, and better energy efficiency for the new model. The gain is primarily driven by the following components of the inference optimization stack: 

  • NVIDIA TensorRT-LLM Wide Expert Parallelism (Wide-EP) provides optimized MoE GroupGEMM kernels, expert distribution and load balancing, and expert scheduling to fully exploit the NVL72 coherent memory domain. Of particular interest is how resilient this Wide-EP feature set is to architectural variations across large MoEs. This allows a model such as Mistral Large 3 (with 128 experts per layer, roughly half as many as DeepSeek-R1) to still realize the high-bandwidth, low-latency, non-blocking advantages of the NVIDIA NVLink fabric. 
  • Low-precision inference with NVFP4 maintains both efficiency and accuracy, with support in SGLang, TensorRT-LLM, and vLLM. 
  • Mistral Large 3 relies on NVIDIA Dynamo, a low-latency distributed inference framework, to rate-match and disaggregate the prefill and decode phases of inference. This in turn boosts performance for long-context workloads, such as the 1K/8K ISL/OSL configuration shown in Figure 1. 

As with all models, upcoming performance optimizations, such as speculative decoding with multi-token prediction (MTP) and EAGLE-3, are expected to push performance further, unlocking even more gains from this new model. 

NVFP4 quantization 

For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint that was quantized offline using the open source llm-compressor library. This reduces compute and memory costs while maintaining accuracy by leveraging NVFP4's higher-precision FP8 scaling factors and finer-grained block scaling to manage quantization error.  

The recipe targets only the MoE weights while keeping all other components at their original checkpoint precision. Because NVFP4 is native to NVIDIA Blackwell, this variant deploys seamlessly on GB200 NVL72, delivering lower compute and memory cost with minimal accuracy loss. 
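
As a rough illustration, the sketch below shows what an MoE-only NVFP4 recipe could look like with llm-compressor. It is a minimal example under stated assumptions: the Hugging Face model ID, the expert module-name pattern, the NVFP4 scheme preset, and the data-free flow are placeholders that may differ from the recipe used to produce the published checkpoint.

```python
# Hypothetical sketch of an offline NVFP4 quantization recipe with llm-compressor
# that targets only the MoE expert weights and leaves every other component at
# its original checkpoint precision. Names and patterns below are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Large-3-Instruct"  # placeholder model ID

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize only the expert projections (regex target is illustrative); skip the
# router, attention, embeddings, and lm_head so they keep their original precision.
recipe = QuantizationModifier(
    targets=["re:.*experts.*"],
    scheme="NVFP4",  # assumes an llm-compressor release with an NVFP4 preset
    ignore=["lm_head"],
)

# Depending on the scheme, a small calibration dataset may be required;
# this data-free call is a simplification.
oneshot(model=model, recipe=recipe, output_dir="Mistral-Large-3-NVFP4")
```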

Open source inference 

These open-weight models can be used with your open source inference framework of choice. TensorRT-LLM leverages optimizations for large MoE models to boost performance on GB200 NVL72 systems. To get started, you can use the preconfigured TensorRT-LLM Docker container.  

NVIDIA collaborated with vLLM to expand support for kernel integrations covering speculative decoding (EAGLE), NVIDIA Blackwell, disaggregation, and expanded parallelism. To get started, you can deploy the launchable that uses vLLM on NVIDIA cloud GPUs. For boilerplate code for serving the model and sample API calls for common use cases, see Running Mistral Large 3 675B Instruct with vLLM on NVIDIA GPUs.
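
For a quick local test before wiring up a full serving stack, a minimal offline-inference sketch with vLLM's Python API could look like the following. The model ID and parallelism settings are illustrative placeholders; check the model card for the published checkpoint name and recommended configuration.

```python
# Minimal sketch of offline inference with vLLM's Python API.
# The model ID and tensor_parallel_size are placeholders, not a validated config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Large-3-Instruct",  # placeholder Hugging Face ID
    tensor_parallel_size=8,                      # size to your GPU topology
)

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
prompts = ["Explain why wide expert parallelism helps large MoE models."]

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```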

Figure 2 shows the range of GPUs available in the NVIDIA build platform where you can deploy Mistral Large 3 and Ministral 3. You can select the GPU size and configuration that suits your needs.  

[Figure: The brev.dev console "Select your Compute" page, showing a row of GPU options including H200, H100, A100, L40S, and A10.]
Figure 2. A range of GPUs are available in the NVIDIA build platform where developers can deploy Mistral Large 3 and Ministral 3

NVIDIA also collaborated with SGLang to create an implementation of Mistral Large 3 with disaggregation and speculative decoding. For details, see the SGLang documentation.

Ministral 3 models deliver speed, versatility, and accuracy   

The small, dense, high-performance Ministral 3 models are designed for edge deployment. Offering flexibility for a variety of needs, they come in three parameter sizes (3B, 8B, and 14B), each with Base, Instruct, and Reasoning variants. You can try the models on edge platforms like NVIDIA GeForce RTX AI PCs, NVIDIA DGX Spark, and NVIDIA Jetson.

When developing locally, you still get the benefit of NVIDIA acceleration. NVIDIA collaborated with Ollama and llama.cpp for faster iteration, lower latency, and greater data privacy. You can expect fast inference at up to 385 tokens per second on the NVIDIA RTX 5090 GPU with the Ministral-3B variants. Get started with llama.cpp and Ollama.  
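
If you serve a Ministral 3 model locally through Ollama, calling it from Python can look roughly like the sketch below. The model tag is a placeholder; use whatever tag the model is published under in the Ollama registry.

```python
# Rough sketch of chatting with a locally served Ministral 3 model via the
# Ollama Python client; the model tag below is a placeholder.
import ollama

response = ollama.chat(
    model="ministral-3:8b",  # placeholder tag
    messages=[
        {"role": "user", "content": "Summarize the benefits of on-device inference."},
    ],
)
print(response["message"]["content"])
```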

For Ministral-3-3B-Instruct, Jetson developers can use the vLLM container on NVIDIA Jetson Thor to achieve 52 tokens per second at single concurrency, scaling up to 273 tokens per second at a concurrency of 8. 
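
As a rough way to observe that concurrency scaling yourself, the sketch below fires a batch of concurrent requests at a vLLM OpenAI-compatible endpoint (for example, one started with vllm serve) and reports aggregate output-token throughput. The endpoint URL, model name, and prompt are placeholders.

```python
# Rough concurrency sketch against a vLLM OpenAI-compatible server;
# the base URL and model name are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="mistralai/Ministral-3-3B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": "Write a haiku about edge AI."}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 8) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:.1f} output tokens/s at concurrency {concurrency}")

asyncio.run(main())
```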

Production-ready deployment with NVIDIA NIM 

Mistral Large 3 and Ministral-14B-Instruct are available for use through the NVIDIA API catalog and preview API, so developers can get started with minimal setup. Soon, enterprise developers will be able to use downloadable NVIDIA NIM microservices for easy deployment on any GPU-accelerated infrastructure.  
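
The NVIDIA API catalog exposes an OpenAI-compatible endpoint for hosted models, so calling the preview API can look roughly like the sketch below. The model identifier is a placeholder; check the model page on build.nvidia.com for the exact name and to generate an API key.

```python
# Hypothetical sketch of calling the preview API on the NVIDIA API catalog
# through its OpenAI-compatible interface; the model name is a placeholder.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # API key from build.nvidia.com
)

completion = client.chat.completions.create(
    model="mistralai/mistral-large-3-instruct",  # placeholder model identifier
    messages=[{"role": "user", "content": "Give three use cases for a 14B edge model."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```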

Start building with open source AI 

The NVIDIA-accelerated Mistral 3 open model family represents a significant leap for Transatlantic AI in the open source community. The flexibility of the family, spanning a large-scale MoE and edge-friendly dense transformers, meets developers where they are in their development lifecycle.  

With NVIDIA-optimized performance, advanced quantization techniques like NVFP4, and broad framework support, developers can achieve exceptional efficiency and scalability from cloud to edge. To get started, download the Mistral 3 models from Hugging Face or test them deployment-free on build.nvidia.com/mistralai. 


