Large language models (LLMs) and multimodal reasoning systems are rapidly expanding beyond the data center. Automotive and robotics developers increasingly need to run conversational AI agents, multimodal perception, and high-level planning directly on the vehicle or robot, where latency, reliability, and the ability to operate offline matter most.
While many existing LLM and vision language model (VLM) inference frameworks focus on data center needs, such as managing large volumes of concurrent user requests and maximizing throughput across them, embedded inference requires a dedicated, tailored solution.
This post introduces NVIDIA TensorRT Edge-LLM, a new, open source C++ framework for LLM and VLM inference, to address the emerging need for high-performance edge inference. Edge-LLM is purpose-built for real-time applications on the embedded automotive and robotics platforms NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. The framework is provided as open source on GitHub for the NVIDIA JetPack 7.1 release.
TensorRT Edge-LLM has minimal dependencies, enabling deployment for production edge applications. Its lean, lightweight design with a clear focus on embedded-specific capabilities minimizes the framework's resource footprint.
In addition, advanced TensorRT Edge-LLM features, such as EAGLE-3 speculative decoding, NVFP4 quantization support, and chunked prefill, provide cutting-edge performance for demanding real-time use cases.


LLM and VLM inference for real-time edge use cases
Edge LLM and VLM inference workloads are defined by the following characteristics:
- Requests from a single user or a few users
- Low batch size, usually across cameras
- Production deployments for mission-critical applications
- Offline operation without updates
As a consequence, real-time robotics and automotive applications come with specific requirements, including:
- Minimal and predictable latency
- Minimal disk, memory, and compute requirements
- Compliance with production standards
- High robustness and reliability
TensorRT Edge-LLM is designed to meet and prioritize these embedded-specific needs, providing a robust foundation for embedded LLM and VLM inference.
Rapid adoption of TensorRT Edge-LLM for automotive use cases
Partners, including Bosch, ThunderSoft, and MediaTek, are already leveraging TensorRT Edge-LLM as a foundation for their in-car AI products and are showcasing their technology at CES 2026.
The Bosch AI-powered Cockpit, developed in collaboration with Microsoft and NVIDIA, features an innovative in-car AI assistant capable of natural voice interactions. The solution uses embedded automatic speech recognition (ASR) and text-to-speech (TTS) AI models alongside LLM inference through TensorRT Edge-LLM for a powerful onboard AI that cooperates with larger, cloud-based AI models through a sophisticated orchestrator.
ThunderSoft integrates TensorRT Edge-LLM into its upcoming AIBOX platform, based on NVIDIA DRIVE AGX Orin, to enable responsive, on-device LLM and multimodal inference inside the vehicle. By combining the ThunderSoft automotive software stack with the TensorRT Edge-LLM lightweight C++ runtime and optimized decoding path, the AIBOX delivers low-latency conversational and cockpit-assist experiences within strict power and memory limits.
MediaTek builds on top of TensorRT Edge-LLM for its CX1 SoC, which enables cutting-edge cabin AI and HMI applications. TensorRT Edge-LLM accelerates both LLM and VLM inference for a wide range of use cases, including driver and cabin activity monitoring. MediaTek contributes to the development of TensorRT Edge-LLM with new embedded-specific inference methods.
With the launch of TensorRT Edge-LLM, these LLM and VLM inference capabilities are now available for the NVIDIA Jetson ecosystem as the foundation for robotics technology.
TensorRT Edge-LLM under the hood
TensorRT Edge-LLM is designed to provide an end-to-end workflow for LLM and VLM inference. It spans three stages:
- Exporting Hugging Face models to ONNX
- Building optimized NVIDIA TensorRT engines for the target hardware
- Running inference on the target hardware


The Python export pipeline converts Hugging Face models to ONNX format with support for quantization, LoRA adapters, and EAGLE-3 speculative decoding (Figure 3).


The engine builder builds TensorRT engines optimized specifically for the embedded target hardware (Figure 4).
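While the TensorRT Edge-LLM engine builder wraps this step for you, the underlying idea follows the standard TensorRT build flow. The minimal C++ sketch below, assuming the TensorRT 10 builder API and a placeholder model.onnx path, shows that flow in its simplest form; the Edge-LLM builder adds LLM-specific handling such as dynamic shapes, KV cache inputs, and quantization recipes on top of it.

#include <cstdio>
#include <fstream>
#include <memory>
#include "NvInfer.h"
#include "NvOnnxParser.h"

// Minimal logger that forwards TensorRT warnings and errors to stdout.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("[TRT] %s\n", msg);
    }
};

int main() {
    Logger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
    auto parser  = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));

    // Parse the exported ONNX graph ("model.onnx" is a placeholder path).
    if (!parser->parseFromFile("model.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
        std::printf("Failed to parse ONNX model\n");
        return 1;
    }

    // Build a serialized engine tuned for the GPU this code runs on.
    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(builder->buildSerializedNetwork(*network, *config));
    if (!serialized) return 1;

    // Write the engine to disk so the runtime can load it later.
    std::ofstream out("model.plan", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()), serialized->size());
    return 0;
}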


The C++ runtime is responsible for LLM and VLM inference on the target hardware. It uses the TensorRT engines for the decoding loop that defines autoregressive models: iterative token generation based on the input and previously generated tokens. User applications interface with this runtime to solve LLM and VLM workloads.
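Conceptually, the decoding loop looks like the following C++ sketch. The LlmEngine type and its decodeStep method are hypothetical stand-ins, not the TensorRT Edge-LLM API; in the actual runtime, each step executes a TensorRT engine and reuses the KV cache rather than reprocessing the full sequence.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one TensorRT engine execution step; the real
// TensorRT Edge-LLM runtime API differs (see the framework documentation).
struct LlmEngine {
    int32_t eosTokenId = 2;  // assumed end-of-sequence token id
    int32_t decodeStep(const std::vector<int32_t>& tokens) {
        // Placeholder: a real engine runs a forward pass and samples the
        // next token from the logits.
        return tokens.empty() ? eosTokenId : tokens.back() + 1;
    }
};

// Autoregressive decoding: each new token is generated from the prompt plus
// all previously generated tokens, then appended to the sequence.
std::vector<int32_t> generate(LlmEngine& engine, std::vector<int32_t> tokens, std::size_t maxNewTokens) {
    for (std::size_t i = 0; i < maxNewTokens; ++i) {
        const int32_t next = engine.decodeStep(tokens);
        tokens.push_back(next);
        if (next == engine.eosTokenId) break;  // stop at end-of-sequence
    }
    return tokens;
}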


For a more detailed explanation of the components, see the TensorRT Edge-LLM documentation.
Get started with TensorRT Edge-LLM
Ready to get started with LLM and VLM inference on your Jetson AGX Thor DevKit?
1. Download the JetPack 7.1 release.
2. Clone the JetPack 7.1 release branch of the NVIDIA/TensorRT-Edge-LLM GitHub repo:
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
3. Check the TensorRT Edge-LLM Quick Start Guide for detailed instructions on getting out-of-the-box supported models from Hugging Face, converting them to ONNX, building TensorRT engines for your Jetson AGX Thor platform, and running them with the C++ runtime.
4. Explore the TensorRT Edge-LLM examples to learn more about features and capabilities.
5. See the TensorRT Edge-LLM Customization Guide to adapt TensorRT Edge-LLM to your own needs.
For NVIDIA DRIVE AGX Thor users, TensorRT Edge-LLM is part of the NVIDIA DriveOS release package. Upcoming DriveOS releases will leverage the GitHub repo.
As LLMs and VLMs move rapidly to the edge, TensorRT Edge-LLM provides a clean, reliable path from Hugging Face models to real-time, production-grade execution on NVIDIA automotive and robotics platforms.
Explore the workflow, test your models, and start building the next generation of intelligent on-device applications. To learn more, visit the NVIDIA/TensorRT-Edge-LLM GitHub repo.
Acknowledgments
Thanks to Michael Ferry, Nicky Liu, Martin Chi, Ruocheng Jia, Charl Li, Maggie Hu, Krishna Sai Chemudupati, Frederik Kaster, Xiang Guo, Yuan Yao, Vincent Wang, Levi Chen, Chen Fu, Le An, Josh Park, Xinru Zhang, Chengming Zhao, Sunny Gai, Ajinkya Rasani, Zhijia Liu, Ever Wong, Wenting Jiang, Jonas Li, Po-Han Huang, Brant Zhao, Yiheng Zhang, and Ashwin Nanjappa for your contributions to and support of TensorRT Edge-LLM.
