The new Mistral 3 open model family delivers industry-leading accuracy, efficiency, and customization capabilities for developers and enterprises. Optimized for everything from NVIDIA GB200 NVL72 to edge platforms, Mistral 3 includes:
- One large state-of-the-art sparse multimodal and multilingual mixture of experts (MoE) model with a total parameter count of 675B
- A set of small, dense, high-performance models (called Ministral 3) in 3B, 8B, and 14B sizes, each with Base, Instruct, and Reasoning variants (nine models total)
All of the models were trained on NVIDIA Hopper GPUs and are now available through Mistral AI on Hugging Face. Developers can choose from a variety of options for deploying these models across NVIDIA GPUs, model precision formats, and open source frameworks (Table 1).
| | Mistral Large 3 | Ministral-3-14B | Ministral-3-8B | Ministral-3-3B |
|---|---|---|---|---|
| Total parameters | 675B | 14B | 8B | 3B |
| Active parameters | 41B | 14B | 8B | 3B |
| Context window | 256K | 256K | 256K | 256K |
| Base | – | BF16 | BF16 | BF16 |
| Instruct | – | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 | Q4_K_M, FP8, BF16 |
| Reasoning | Q4_K_M, NVFP4, FP8 | Q4_K_M, BF16 | Q4_K_M, BF16 | Q4_K_M, BF16 |
| Frameworks | | | | |
| vLLM | ✔ | ✔ | ✔ | ✔ |
| SGLang | ✔ | – | – | – |
| TensorRT-LLM | ✔ | – | – | – |
| Llama.cpp | – | ✔ | ✔ | ✔ |
| Ollama | – | ✔ | ✔ | ✔ |
| NVIDIA hardware | | | | |
| GB200 NVL72 | ✔ | ✔ | ✔ | ✔ |
| Dynamo | ✔ | ✔ | ✔ | ✔ |
| DGX Spark | ✔ | ✔ | ✔ | ✔ |
| RTX | – | ✔ | ✔ | ✔ |
| Jetson | – | ✔ | ✔ | ✔ |
Table 1. Specifications, precision formats, framework support, and NVIDIA hardware support for the Mistral 3 model family
Mistral Large 3 delivers best-in-class performance on NVIDIA GB200 NVL72
NVIDIA-accelerated Mistral Large 3 achieves best-in-class performance on NVIDIA GB200 NVL72 by leveraging a comprehensive stack of optimizations tailored for large state-of-the-art MoEs. Figure 1 shows the performance Pareto frontiers for GB200 NVL72 and NVIDIA H200 across the interactivity range.
Figure 1. Throughput versus interactivity Pareto frontiers for Mistral Large 3 on NVIDIA GB200 NVL72 and H200
For production AI systems that must deliver both strong user experience (UX) and cost-efficient scale, GB200 provides up to 10x higher performance than the previous-generation H200, exceeding 5,000,000 tokens per second per megawatt (MW) at 40 tokens per second per user.
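To put that figure in concrete terms: 1 MW is 1,000,000 joules per second, so 5,000,000 tokens per second per megawatt works out to roughly 1,000,000 J ÷ 5,000,000 tokens = 0.2 joules of energy per generated token at that interactivity level.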
This generational gain translates to better UX, lower per-token cost, and better energy efficiency for the new model. The gain is primarily driven by the following components of the inference optimization stack:
- NVIDIA TensorRT-LLM Wide Expert Parallelism (Wide-EP) provides optimized MoE GroupGEMM kernels, expert distribution and load balancing, and expert scheduling to fully exploit the NVL72 coherent memory domain. Of particular interest is how resilient this Wide-EP feature set is to architectural variations across large MoEs. This allows a model such as Mistral Large 3 (with 128 experts per layer, roughly half as many as DeepSeek-R1) to still realize the high-bandwidth, low-latency, non-blocking benefits of the NVIDIA NVLink fabric.
- Low-precision inference using NVFP4 maintains efficiency and accuracy, with support from SGLang, TensorRT-LLM, and vLLM.
- Mistral Large 3 relies on NVIDIA Dynamo, a low-latency distributed inference framework, to rate-match and disaggregate the prefill and decode phases of inference. This in turn boosts performance for long-context workloads, such as 8K/1K configurations (Figure 1).
As with all models, upcoming performance optimizations, such as speculative decoding with multitoken prediction (MTP) and EAGLE-3, are expected to push performance further, unlocking even more gains from this new model.
NVFP4 quantization
For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint that was quantized offline using the open source llm-compressor library. This reduces compute and memory costs while maintaining accuracy, because NVFP4 uses higher-precision FP8 scaling factors and finer-grained block scaling to control quantization error.
The recipe targets only the MoE weights while keeping all other components at their original checkpoint precision. Because NVFP4 is native to NVIDIA Blackwell, this variant deploys seamlessly on GB200 NVL72 with minimal accuracy loss.
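As a rough illustration of that kind of recipe, the sketch below uses llm-compressor to quantize only the expert linear layers to NVFP4 while ignoring everything else. The model ID, ignore patterns, and save path are placeholders rather than the configuration Mistral actually used, and scheme names and import paths can vary across llm-compressor releases (some NVFP4 schemes also expect a calibration dataset for activation scales).

```python
# Hypothetical sketch: offline NVFP4 quantization of only the MoE expert
# weights with llm-compressor; attention, embeddings, and lm_head stay at
# the original checkpoint precision. All names below are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Large-3-Instruct"  # placeholder Hugging Face ID

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4: FP4 weights with higher-precision FP8 block scale factors.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*self_attn.*", "re:.*router.*"],  # assumed module-name patterns
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("Mistral-Large-3-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Mistral-Large-3-NVFP4")
```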
Open source inference
These open weight models can be used with your open source inference framework of choice. TensorRT-LLM leverages optimizations for large MoE models to boost performance on GB200 NVL72 systems. To get started, you can use the preconfigured TensorRT-LLM Docker container.
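Inside that container, inference can also be driven from Python through the TensorRT-LLM LLM API. The snippet below is a minimal sketch under assumed names: the checkpoint ID and parallelism settings are placeholders, and exact arguments may differ across TensorRT-LLM releases.

```python
# Minimal TensorRT-LLM LLM API sketch (assumed to run inside the
# preconfigured TensorRT-LLM container); model ID and parallelism
# values are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Large-3-Instruct-NVFP4",  # placeholder checkpoint name
    tensor_parallel_size=8,                            # match your GPU count
)

outputs = llm.generate(
    ["Summarize why wide expert parallelism helps large MoE inference."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```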
NVIDIA collaborated with vLLM to expand support for kernel integrations for speculative decoding (EAGLE), NVIDIA Blackwell, disaggregation, and expanded parallelism. To get started, you can deploy the launchable that uses vLLM on NVIDIA cloud GPUs. For the boilerplate code for serving the model and sample API calls for common use cases, see Running Mistral Large 3 675B Instruct with vLLM on NVIDIA GPUs.
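As a taste of those sample API calls, the snippet below queries a vLLM OpenAI-compatible endpoint with the standard openai client; the base URL and served model name are placeholders that depend on how you launch the server.

```python
# Illustrative client call against a vLLM OpenAI-compatible server;
# base_url and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-Instruct",  # must match the served model name
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain mixture-of-experts routing in two sentences."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```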
Figure 2 shows the range of GPUs available on the NVIDIA Build platform where you can deploy Mistral Large 3 and Ministral 3. You can select the GPU size and configuration that suits your needs.
Figure 2. GPUs available on the NVIDIA Build platform for deploying Mistral Large 3 and Ministral 3
NVIDIA also collaborated with SGLang to create an implementation of Mistral Large 3 with disaggregation and speculative decoding. For details, see the SGLang documentation.
Ministral 3 models deliver speed, versatility, and accuracy
The small, dense, high-performance Ministral 3 models are designed for edge deployment. Offering flexibility for a variety of needs, they come in three parameter sizes (3B, 8B, and 14B), each with Base, Instruct, and Reasoning variants. You can try the models on edge platforms such as NVIDIA GeForce RTX AI PCs, NVIDIA DGX Spark, and NVIDIA Jetson.
When developing locally, you still get the benefit of NVIDIA acceleration. NVIDIA collaborated with Ollama and llama.cpp for faster iteration, lower latency, and greater data privacy. You can expect fast inference at up to 385 tokens per second on the NVIDIA RTX 5090 GPU with the Ministral-3B variants. Get started with llama.cpp and Ollama.
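For a sense of what local iteration looks like, here is a small sketch using the ollama Python package against a locally running Ollama server; the model tag for the Ministral 3 variants is an assumption, so check the Ollama library for the actual name.

```python
# Minimal local-inference sketch with the ollama Python package.
# Assumes the Ollama server is running and the model has been pulled;
# the model tag below is a placeholder, not a confirmed registry name.
import ollama

response = ollama.chat(
    model="ministral-3:8b",  # placeholder tag
    messages=[{"role": "user", "content": "Draft a haiku about GPUs."}],
)
print(response["message"]["content"])
```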
For Ministral-3-3B-Instruct, Jetson developers can use the vLLM container on NVIDIA Jetson Thor to achieve 52 tokens per second at single concurrency, scaling up to 273 tokens per second at a concurrency of 8.
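If you want to probe that concurrency scaling yourself, a simple asyncio harness against the container's OpenAI-compatible endpoint looks roughly like the sketch below; the endpoint, model name, and prompt are placeholders, and measured throughput will vary with request shape.

```python
# Hypothetical concurrency probe against a vLLM OpenAI-compatible endpoint
# on Jetson Thor; base_url, model name, and prompt are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "mistralai/Ministral-3-3B-Instruct"  # placeholder served model name
CONCURRENCY = 8

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Write a short product description."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"{sum(token_counts) / elapsed:.1f} generated tokens/s at concurrency {CONCURRENCY}")

asyncio.run(main())
```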
Production-ready deployment with NVIDIA NIM
Mistral Large 3 and Ministral-14B-Instruct are available through the NVIDIA API catalog and preview API so developers can get started with minimal setup. Soon, enterprise developers will be able to use downloadable NVIDIA NIM microservices for easy deployment on any GPU-accelerated infrastructure.
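Because the preview API is OpenAI-compatible, calling Mistral Large 3 through the API catalog looks roughly like the snippet below; the model slug is an assumption (check the catalog listing), and you need an API key from build.nvidia.com exported as NVIDIA_API_KEY.

```python
# Illustrative call to the NVIDIA API catalog's OpenAI-compatible endpoint;
# the model slug is a placeholder and NVIDIA_API_KEY must hold a valid key.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

response = client.chat.completions.create(
    model="mistralai/mistral-large-3-instruct",  # placeholder slug
    messages=[{"role": "user", "content": "Give me three use cases for on-device LLMs."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```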
Start building with open source AI
The NVIDIA-accelerated Mistral 3 open model family represents a significant leap for transatlantic AI in the open source community. The flexibility of the models, from a large-scale MoE to edge-friendly dense transformers, meets developers where they are in their development lifecycle.
With NVIDIA-optimized performance, advanced quantization techniques like NVFP4, and broad framework support, developers can achieve exceptional efficiency and scalability from cloud to edge. To get started, download the Mistral 3 models from Hugging Face or test them deployment-free on build.nvidia.com/mistralai.
