A defining strength of the NVIDIA software ecosystem is its commitment to continuous optimization. NVIDIA Jetson AGX Thor launched in August with up to a 5x boost in generative AI performance over NVIDIA Jetson AGX Orin. Through software updates since that release, Jetson Thor now delivers a 7x increase in generative AI throughput.
With this proven approach, demonstrated previously on NVIDIA Jetson Orin and NVIDIA Jetson AGX Xavier, developers can expect these improvements on models such as Llama and DeepSeek, and similar gains are expected for future model releases. In addition to consistent software enhancements, NVIDIA also provides support for leading models, often within days of their launch. This lets developers experiment with the latest AI models early on.
The Jetson Thor platform also supports major quantization formats, including the new NVFP4 format from the NVIDIA Blackwell GPU architecture, helping optimize inference even further. New techniques like speculative decoding are also supported, offering an additional way to speed up generative AI workloads at the edge.
Continuous software optimization
With the recent vLLM container release, Jetson Thor delivers up to 3.5x greater performance on the same model and quantization compared to its launch-day performance in late August. Table 1 shows the output tokens/sec on the Llama 3.3 70B and DeepSeek R1 70B models at launch in August compared to the latest benchmarked numbers from September 2025.
| Family | Model | Jetson AGX Thor Sept 2025 (output tokens/sec) | Jetson AGX Thor Aug 2025 (output tokens/sec) | Jetson AGX Thor speedup compared to launch |
| --- | --- | --- | --- | --- |
| Llama | Llama 3.3 70B | 41.5 | 12.64 | 3.3 |
| DeepSeek | DeepSeek R1 70B | 40.29 | 11.5 | 3.5 |
Table 1. Output tokens/sec on Llama 3.3 and DeepSeek R1 at launch compared to the latest benchmarks
Configuration for these benchmarks: Input Sequence Length: 2048; Output Sequence Length: 128; Max Concurrency: 8; Power Mode: MAXN
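The MAXN power mode matters for reproducing numbers like these. A minimal sketch of locking the module to its maximum power profile follows; note that the MAXN mode index varies by Jetson platform and JetPack release, so confirm it in /etc/nvpmodel.conf on your device first.
```bash
# Query the active power mode, switch to MAXN, and pin clocks for benchmarking.
# NOTE: the MAXN mode index differs across Jetson platforms and JetPack
# releases; verify it in /etc/nvpmodel.conf before running.
sudo nvpmodel -q         # show the currently active power mode
sudo nvpmodel -m 0       # 0 is commonly the MAXN profile (verify on your unit)
sudo jetson_clocks       # lock clocks at maximum for stable benchmark runs
```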
Jetson Thor also now supports EAGLE-3 speculative decoding in vLLM containers to further increase the performance of generative AI models. For example, on Llama 3.3 70B with speculative decoding, you can get 88.62 output tokens/sec, a 7x speedup compared to launch.


Run the latest models with day 0 support
Developers can run the latest and greatest generative AI models at the edge on Jetson Thor with day 0 support. For example, gpt-oss was supported on llama.cpp/ollama on day 0 of its launch on Jetson AGX Thor, and it's supported on vLLM as well. Similarly, you'll find week 0 support for many NVIDIA Nemotron models.
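As an illustration of that day 0 support, a minimal way to try gpt-oss through ollama on Jetson AGX Thor might look like the sketch below. It assumes ollama is already installed on the device, and the 20B variant is used purely as an example.
```bash
# Pull and chat with the 20B gpt-oss variant via ollama (example only;
# assumes ollama is installed on the Jetson and has GPU access).
ollama pull gpt-oss:20b
ollama run gpt-oss:20b "Summarize why on-device inference matters for robotics."
```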
Get max gen AI performance with Jetson Thor
Jetson Thor is powerful for generative AI at the edge, but using it to its full advantage requires the right techniques. This section is your guide to getting the most out of the platform. We'll dive into quantization and speculative decoding, the two key strategies for accelerating LLM and VLM inference. We'll finish with a tutorial showing how to benchmark your models on Jetson Thor. This gives you a clear path for choosing the best model and configuration for your specific use case.
Quantization: Shrinking model size, speeding up inference
At its core, quantization is the process of reducing the numerical precision of a model's data (its weights and activations). Think of it like using fewer decimal places to represent a number: it's not exactly the same, but it's close enough and much more efficient to store and compute with. We typically move from the standard 16-bit formats (like FP16 or BF16) to lower-bit formats like 8-bit or 4-bit.
This gives you two huge wins:
- Smaller memory footprint
This is the key that unlocks larger models on-device. By cutting the number of bytes needed for each parameter, you can load models that would otherwise be too big. As a rule of thumb (the sketch after this list walks through the arithmetic), a 70-billion-parameter model's weights take up about:
- 140 GB in 16-bit floating point (FP16), which won't fit in Thor's 128 GB of memory.
- 70 GB in 8-bit floating point (FP8), which fits with room to spare.
- 35 GB in 4-bit, enabling multiple large models.
- Faster memory access
Smaller weights mean fewer bytes to pull from memory into the compute cores. This directly reduces latency, which is critical for edge applications where time-to-first-token affects responsiveness and user experience.
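The rule of thumb above is just parameter count times bytes per weight. Here is a minimal Python sketch of that arithmetic; it ignores activation and KV-cache overhead, which you'd add on top.
```python
# Back-of-the-envelope weight footprint: parameters x bytes per weight.
# Activations and the KV cache are extra and not counted here.

def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    bytes_per_weight = bits_per_weight / 8
    return num_params * bytes_per_weight / 1e9  # decimal GB, as in the text

for fmt, bits in [("FP16", 16), ("FP8", 8), ("4-bit weights (W4A16)", 4)]:
    gb = weight_footprint_gb(70e9, bits)
    fits = "fits in 128 GB" if gb < 128 else "does not fit in 128 GB"
    print(f"70B @ {fmt}: ~{gb:.0f} GB of weights ({fits})")
```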
Let's look at the two formats that matter most on Jetson Thor.
FP8
FP8 is your go-to for an almost lossless first step in optimization. A 70B model's 16-bit weights are too large for Jetson Thor's memory once you account for activations and the KV cache. By halving the weight memory, FP8 makes it practical to load and run that same model on-device. When properly calibrated, FP8's accuracy is incredibly close to the FP16 baseline (often with a drop of less than 1%), making it a "safe first step" for chat and general workloads, though sensitive tasks like math or code generation may require extra tuning.
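If you don't have a pre-quantized FP8 checkpoint handy, vLLM can quantize weights to FP8 on the fly at load time via its --quantization flag. A hedged sketch follows, using a smaller Llama checkpoint purely as an illustration; the model name and context length are example values, and the flag assumes the Jetson vLLM container exposes standard vLLM options.
```bash
# Serve a model with on-the-fly FP8 weight quantization in vLLM.
# Model name and --max-model-len are illustrative; adjust for your deployment.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --max-model-len 8192
```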
W4A16: 4-bit weights and 16-bit activations
W4A16 unlocks massive models at the edge by quantizing static model weights to an ultra-compact 4 bits, while keeping the dynamic, in-flight calculations (the activations) at higher-precision 16 bits. This trade-off makes it possible to fit models with over 175B parameters on a single Jetson Thor, leaving plenty of headroom for their activations. Serving multiple large models at once (for example, two 70B models) is a feat that was a major challenge for previous Jetson generations.
Which format should you use?
Our recommendation is simple: start with W4A16. It typically delivers the highest inference speeds and the lowest memory footprint. If you test the quantized model on your task and find that the accuracy meets your quality bar, stick with it.
If your task is more complex (like nuanced reasoning or code generation) and you find W4A16's accuracy isn't quite there, switch to FP8. It's still fast, keeps memory usage low, and provides more than enough quality for most edge use cases.
Speculative decoding: Boost inference with a draft-verification decoding approach
Once you've picked a quantization format, the next big performance lever is speculative decoding. This technique speeds up inference by using two models: a small, fast "draft" model and your large, accurate "target" model.
Here's how it works:
- The draft model quickly generates a chunk of candidate tokens (a "guess" at what comes next).
- The target model then validates the entire chunk in a single pass instead of generating one token at a time.
This "draft-and-verify" process generates multiple tokens per cycle while guaranteeing the final output is identical to what the target model would produce alone. Your success is measured by the acceptance rate, the percentage of draft tokens accepted. A high rate yields significant latency wins, while a low rate can add overhead, so it's crucial to benchmark with prompts that reflect your workload. Your main lever for improving it is the choice of draft model: start with one architecturally similar to your target, and for specialized domains, consider fine-tuning a custom draft model to maximize the acceptance rate.
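To make the mechanics concrete, here is a toy Python sketch of the draft-and-verify loop under greedy decoding. The draft_model and target_model callables are stand-ins, not real APIs, and real implementations such as EAGLE-3 in vLLM also handle sampling and tree-structured drafts.
```python
# Toy greedy speculative decoding: draft k tokens, verify them with the target
# in one pass, keep the longest accepted prefix, then append the target's own
# next token. `draft_model` and `target_model` are illustrative stand-ins.

def speculative_decode(prompt, draft_model, target_model, k=5, max_new_tokens=64):
    tokens = list(prompt)
    accepted_total = drafted_total = 0

    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft: the small model proposes k candidate tokens, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))

        # 2) Verify: the target scores the whole drafted chunk at once.
        #    `target_model(context, draft)` returns len(draft) + 1 greedy
        #    tokens, where entry i is the target's choice given
        #    context + draft[:i] (a single forward pass in practice).
        target_choices = target_model(tokens, draft)

        # 3) Accept the longest prefix where draft and target agree.
        n_accept = 0
        while n_accept < len(draft) and draft[n_accept] == target_choices[n_accept]:
            n_accept += 1

        # The target always contributes one guaranteed-correct token, so each
        # cycle emits at least one token even when nothing is accepted.
        tokens += draft[:n_accept] + [target_choices[n_accept]]
        accepted_total += n_accept
        drafted_total += len(draft)

    acceptance_rate = accepted_total / max(drafted_total, 1)
    return tokens, acceptance_rate
```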
In our experiments, we found that EAGLE-3 speculative decoding delivered the best speedups. In our benchmarks on Llama 3.3 70B (W4A16), it delivered a 2.5x performance uplift, boosting throughput from 6.27 to 16.19 tokens/sec using vLLM with a concurrency of 1. We benchmarked this using the ShareGPT dataset, but you should always test on your own data to validate performance for your specific use case.
Putting quantization and speculative decoding together
The real magic happens when you combine these techniques. We used vLLM, which has great built-in support for EAGLE-3. Here's an example command we used to serve the Llama 3.3 W4A16 model with speculative decoding enabled.
vllm serve "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16" --trust_remote_code --speculative-config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.3-Instruct-70B","num_speculative_tokens":5}'
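Once the server is up, you can send requests to vLLM's OpenAI-compatible endpoint. A minimal sketch is shown below; the port is vLLM's default and the prompt and max_tokens values are just placeholders.
```bash
# Query the running vLLM server's OpenAI-compatible endpoint (default port 8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",
        "messages": [{"role": "user", "content": "Give me three tips for edge AI deployment."}],
        "max_tokens": 128
      }'
```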
To make getting started easier, NVIDIA is releasing a standalone vLLM container that supports Jetson Thor and is updated monthly with the latest improvements.
Here's a step-by-step guide to finding the best balance between model quality and inference performance:
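A typical way to launch such a container on a Jetson device is sketched below. The image name is a placeholder (check NGC or Jetson AI Lab for the actual Jetson Thor vLLM image and tag), and the runtime and mount flags follow common Jetson container practice rather than a documented command.
```bash
# Launch the vLLM container with GPU access (image name is a placeholder;
# look up the real Jetson Thor vLLM image and tag on NGC / Jetson AI Lab).
docker run --rm -it \
  --runtime nvidia \
  --ipc=host \
  --network host \
  -v ~/models:/models \
  <JETSON_THOR_VLLM_IMAGE:TAG>
# Inside the container, the `vllm serve` command shown above works unchanged.
```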
- Establish a quality baseline. Before optimizing, load your model at the highest precision you can (FP16 preferably, but if the model is too big, FP8 also works) and simply confirm that it performs your task correctly.
- Optimize with quantization. Progressively lower the weight precision (for example, to W4A16), testing for accuracy at each step. Stop when the quality no longer meets your requirements.
- Benchmark against reality. Validate your final setup using a performance benchmark that mimics your workload, whether that involves high concurrency, large context windows, or long output sequences.
If your chosen model still isn't fast enough, repeat this process with a smaller one. To see exactly how to run these performance benchmarks, follow our hands-on tutorial on Jetson AI Lab.
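For reference, a serving benchmark that mirrors the configuration behind Table 1 (2048 input tokens, 128 output tokens, concurrency 8) might look roughly like the sketch below, run against an already-started `vllm serve` endpoint. The `vllm bench serve` subcommand and these flag names are assumptions based on recent vLLM releases; check `vllm bench serve --help` (or the benchmark scripts in the vLLM repo) inside your container.
```bash
# Sketch of a serving benchmark against a running vLLM server.
# Flag names follow recent vLLM releases and may differ in your container;
# the model name matches the serve command used earlier.
vllm bench serve \
  --model RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 128 \
  --num-prompts 64 \
  --max-concurrency 8
```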
Now you can confidently improve your generative AI model performance on Jetson Thor. Get your Jetson AGX Thor Developer Kit today and download the latest NVIDIA JetPack 7 to start your journey.
