Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM



Large language models (LLMs) and multimodal reasoning systems are rapidly expanding beyond the data center. Automotive and robotics developers increasingly need to run conversational AI agents, multimodal perception, and high-level planning directly on the vehicle or robot, where latency, reliability, and the ability to operate offline matter most.

While many existing LLM and vision language model (VLM) inference frameworks focus on data center needs, such as managing large volumes of concurrent user requests and maximizing throughput across them, embedded inference requires a dedicated, tailored solution.

This post introduces NVIDIA TensorRT Edge-LLM, a new, open source C++ framework for LLM and VLM inference that addresses the emerging need for high-performance edge inference. TensorRT Edge-LLM is purpose-built for real-time applications on the embedded automotive and robotics platforms NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. The framework is available as open source on GitHub for the NVIDIA JetPack 7.1 release.

TensorRT Edge-LLM has minimal dependencies, enabling deployment in production edge applications. Its lean, lightweight design, with a clear focus on embedded-specific capabilities, minimizes the framework’s resource footprint.

In addition, advanced TensorRT Edge-LLM features, such as EAGLE-3 speculative decoding, NVFP4 quantization support, and chunked prefill, provide cutting-edge performance for demanding real-time use cases.
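Chunked prefill, for example, bounds the cost of long prompts by processing them in fixed-size pieces rather than in one large forward pass. The minimal C++ sketch below illustrates only the concept; the types and function names are illustrative placeholders, not the TensorRT Edge-LLM API.

// Conceptual sketch of chunked prefill (not the TensorRT Edge-LLM API).
// The prompt is processed in fixed-size chunks; each chunk extends the same
// KV cache, so the final state matches a single monolithic prefill pass while
// per-iteration latency and peak activation memory stay bounded.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct KvCache {
    std::size_t cachedTokens = 0;  // stand-in for per-layer key/value tensors
};

// Placeholder for one engine forward pass over a slice of the prompt.
void prefillChunk(const int* tokens, std::size_t count, KvCache& cache) {
    (void)tokens;                  // a real runtime would execute the engine here
    cache.cachedTokens += count;   // and append the chunk's keys/values to the cache
}

void chunkedPrefill(const std::vector<int>& prompt, KvCache& cache,
                    std::size_t chunkSize = 512) {
    for (std::size_t start = 0; start < prompt.size(); start += chunkSize) {
        std::size_t len = std::min(chunkSize, prompt.size() - start);
        prefillChunk(prompt.data() + start, len, cache);
    }
}

int main() {
    std::vector<int> prompt(2000, 42);  // a long dummy prompt
    KvCache cache;
    chunkedPrefill(prompt, cache);
    std::cout << "tokens cached: " << cache.cachedTokens << "\n";  // 2000
}

Because each chunk reuses the KV cache built by the previous chunks, the generated output is unchanged; only the scheduling of the prompt work differs, which is what keeps prefill latency predictable on an embedded target.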

A bar chart on the left shows TensorRT Edge-LLM performance compared to vLLM performance across three different configurations. TensorRT Edge-LLM shows significantly higher performance. The right bar chart shows TensorRT Edge-LLM performance for newer Qwen3 LLM and VLM models. In both charts, configurations where speculative decoding is enabled show substantially better performance.
Figure 1. TensorRT Edge-LLM shows compelling performance compared with the popular LLM and VLM inference framework vLLM
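The speculative decoding gains shown in Figure 1 come from letting a small draft model propose several tokens that the full model then verifies in a single pass, so one expensive engine call can yield multiple accepted tokens. The sketch below shows the core propose/verify loop with greedy decoding and toy stand-in models; EAGLE-3 is a more advanced scheme (its draft model works from the target model's hidden features), and none of these names belong to the TensorRT Edge-LLM API.

// Conceptual sketch of one speculative decoding step with greedy acceptance.
// Illustrative only; not the TensorRT Edge-LLM implementation.
#include <iostream>
#include <vector>

// Toy stand-ins for the draft and target models: given the current context,
// each returns its greedy prediction for the next token.
int draftPredict(const std::vector<int>& ctx)  { return (ctx.back() + 1) % 10; }
int targetPredict(const std::vector<int>& ctx) { return (ctx.back() + 1) % 10; }

// Propose k draft tokens, then verify them against the target model and keep
// the longest accepted prefix plus one corrected or bonus token.
std::vector<int> speculativeStep(std::vector<int> ctx, int k) {
    // 1. Draft phase: k cheap autoregressive steps with the draft model.
    std::vector<int> draft;
    std::vector<int> draftCtx = ctx;
    for (int i = 0; i < k; ++i) {
        draft.push_back(draftPredict(draftCtx));
        draftCtx.push_back(draft.back());
    }
    // 2. Verify phase: a real runtime scores all k positions in one batched
    //    target-engine pass; here we simply walk them in order.
    std::vector<int> accepted;
    for (int i = 0; i < k; ++i) {
        int expected = targetPredict(ctx);   // the target's greedy choice here
        ctx.push_back(expected);
        accepted.push_back(expected);
        if (draft[i] != expected) {          // mismatch: keep the correction, stop
            return accepted;
        }
    }
    accepted.push_back(targetPredict(ctx));  // bonus token when all drafts match
    return accepted;
}

int main() {
    std::vector<int> out = speculativeStep({1, 2, 3}, 4);
    std::cout << out.size() << " tokens from one verification pass\n";  // prints 5
}

When the draft model agrees with the target most of the time, each target pass produces several tokens instead of one, which is where the speedups in Figure 1 come from; when it disagrees, the output is still exactly what the target model alone would have produced.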

LLM and VLM inference for real-time edge use cases

Edge LLM and VLM inference workloads are defined by the following characteristics:

  • Requests from a few users or a single user
  • Low batch size, typically across cameras
  • Production deployments for mission-critical applications
  • Offline operation without updating

As a consequence, robotics and automotive real-time applications come with specific requirements, including:

  • Minimal and predictable latency
  • Minimal disk, memory, and compute requirements
  • Compliance with production standards
  • High robustness and reliability

TensorRT Edge-LLM is designed to meet and prioritize these embedded-specific needs, providing a robust foundation for embedded LLM and VLM inference.

Rapid adoption of TensorRT Edge-LLM for automotive use cases

Partners including Bosch, ThunderSoft, and MediaTek are already leveraging TensorRT Edge-LLM as a foundation for their in-car AI products and are showcasing their technology at CES 2026.

The new Bosch AI-powered Cockpit, developed in collaboration with Microsoft and NVIDIA, features an innovative in-car AI assistant capable of natural voice interactions. The solution uses embedded automatic speech recognition (ASR) and text-to-speech (TTS) AI models alongside LLM inference through TensorRT Edge-LLM for a powerful onboard AI that cooperates with larger, cloud-based AI models through a sophisticated orchestrator.

ThunderSoft integrates TensorRT Edge-LLM into its upcoming AIBOX platform, based on NVIDIA DRIVE AGX Orin, to enable responsive, on-device LLM and multimodal inference inside the vehicle. By combining the ThunderSoft automotive software stack with the TensorRT Edge-LLM lightweight C++ runtime and optimized decoding path, the AIBOX delivers low-latency conversational and cockpit-assist experiences within strict power and memory limits.

MediaTek builds on top of TensorRT Edge-LLM for its CX1 SoC, which enables cutting-edge cabin AI and HMI applications. TensorRT Edge-LLM accelerates both LLM and VLM inference for a wide range of use cases, including driver and cabin activity monitoring. MediaTek also contributes to the development of TensorRT Edge-LLM with new embedded-specific inference methods.

With the launch of TensorRT Edge-LLM, these LLM and VLM inference capabilities are now available for the NVIDIA Jetson ecosystem as the foundation for robotics technology.

TensorRT Edge-LLM under the hood

TensorRT Edge-LLM is designed to provide an end-to-end workflow for LLM and VLM inference. It spans three stages:

  • Exporting Hugging Face models to ONNX
  • Building optimized NVIDIA TensorRT engines for the target hardware
  • Running inference on the target hardware
A flowchart showing how on x86 host computers, Hugging Face models are the input of the Python Export Pipeline, which produces ONNX models as an output. On the target, these ONNX models are used by the Engine Builder to build TensorRT Engines. These engines are then used by the LLM Runtime to produce inference results for users’ applications.
Figure 2. TensorRT Edge-LLM workflow with key components

The Python export pipeline converts Hugging Face models to ONNX format with support for quantization, LoRA adapters, and EAGLE-3 speculative decoding (Figure 3).

A flowchart showing the quantization and export tools the TensorRT Edge-LLM Python Export Pipeline provides for different HuggingFace models. For base/vanilla models, quantize llm, export-llm and insert-lora are provided. Export-llm generates the Base ONNX model while insert-lora generates the LoRA-enabled ONNX model. For LoRA weights, the process-LoRA tool provides SafeTensors. For EAGLE draft models, quantize-draft and export-draft create the EAGLE Draft ONNX model. For Vision Transformers, the export-visual tools take care of both quantization and export to provide an ONNX model as output.
Figure 3. TensorRT Edge-LLM Python export pipeline stages and tools

The engine builder builds TensorRT engines optimized specifically for the embedded target hardware (Figure 4).

A flowchart showing how ONNX Models and Export Configs are processed by the TensorRT Edge-LLM Engine Builder. Depending on whether the model is an LLM or VLM, the TensorRT Edge-LLM LLM Builder or VIT Builder will be used. 
Figure 4. TensorRT Edge-LLM engine builder workflow

The C++ runtime is responsible for LLM and VLM inference on the target hardware. It uses the TensorRT engines for the decoding loop that defines autoregressive models: iterative token generation based on the input and previously generated tokens. User applications interface with this runtime to solve LLM and VLM workloads.

A flowchart showing the prefill phase and decode phase of TensorRT Edge-LLM C++ Runtime. Based on tokenized input prompt, the TRT engine runs and provides logits across possible output tokens. KV Cache is then generated and a first token is chosen through sampling. The runtime then enters the decoding phase, where the TensorRT Engine is used to generate the next logits followed by KV cache update and token sampling. Then it is checked whether the stop condition (EOS token) is met; if no, the loop continues with a TRT Engine call, if yes the generated sequence is returned.
Figure 5. Prefill and decode phases of TensorRT Edge-LLM C++ Runtime
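The control flow in Figure 5 is the standard autoregressive generation loop. The C++ sketch below restates it with toy stand-ins for engine execution and sampling; the real TensorRT Edge-LLM runtime API differs, so treat every name here as illustrative.

// Conceptual prefill/decode loop mirroring Figure 5. Illustrative only:
// the engine call and sampler below are toy stand-ins, not the actual
// TensorRT Edge-LLM runtime API.
#include <algorithm>
#include <iostream>
#include <vector>

constexpr int kVocabSize = 8;
constexpr int kEosToken  = 7;

struct KvCache {
    std::vector<int> tokens;  // stand-in for per-layer key/value tensors
};

// Toy "engine": appends the new tokens to the cache and returns logits that
// favor (last token + 1), so generation walks toward the EOS token.
std::vector<float> runEngine(const std::vector<int>& newTokens, KvCache& cache) {
    cache.tokens.insert(cache.tokens.end(), newTokens.begin(), newTokens.end());
    std::vector<float> logits(kVocabSize, 0.0f);
    logits[(cache.tokens.back() + 1) % kVocabSize] = 1.0f;
    return logits;
}

// Greedy sampling: pick the highest-scoring token.
int sampleToken(const std::vector<float>& logits) {
    return static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());
}

std::vector<int> generate(const std::vector<int>& prompt, int maxNewTokens) {
    KvCache cache;
    std::vector<int> output;

    // Prefill phase: process the whole prompt once, build the KV cache,
    // and sample the first generated token from the resulting logits.
    int token = sampleToken(runEngine(prompt, cache));

    // Decode phase: feed one token back at a time, extend the KV cache,
    // sample the next token, and stop on the EOS token or the generation limit.
    while (token != kEosToken && static_cast<int>(output.size()) < maxNewTokens) {
        output.push_back(token);
        token = sampleToken(runEngine({token}, cache));
    }
    return output;
}

int main() {
    for (int t : generate({0, 1, 2}, 16)) std::cout << t << ' ';  // prints 3 4 5 6
    std::cout << '\n';
}

User applications interact with the runtime at the level of this outer loop, submitting prompts and consuming generated tokens, while engine execution, KV cache management, and sampling are handled inside the framework.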

For a more detailed explanation of the components, see the TensorRT Edge-LLM documentation.

Get started with TensorRT Edge-LLM

Ready to get started with LLM and VLM inference on your Jetson AGX Thor DevKit?

1. Download the JetPack 7.1 release.

2. Clone the JetPack 7.1 release branch of the NVIDIA/TensorRT-Edge-LLM GitHub repo:

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git

3. Check the TensorRT Edge-LLM Quick Start Guide for detailed instructions on getting out-of-the-box supported models from Hugging Face, converting them to ONNX, building TensorRT engines for your Jetson AGX Thor platform, and running them with the C++ runtime.

4. Explore the TensorRT Edge-LLM examples to learn more about features and capabilities.

5. See the TensorRT Edge-LLM Customization Guide to adapt TensorRT Edge-LLM to your own needs.

For NVIDIA DRIVE AGX Thor users, TensorRT Edge-LLM is part of the NVIDIA DriveOS release package. Upcoming DriveOS releases will leverage the GitHub repo.

As LLMs and VLMs move rapidly to the edge, TensorRT Edge-LLM provides a clean, reliable path from Hugging Face models to real-time, production-grade execution on NVIDIA automotive and robotics platforms.

Explore the workflow, test your models, and start building the next generation of intelligent on-device applications. To learn more, visit the NVIDIA/TensorRT-Edge-LLM GitHub repo.

Acknowledgments 

Thanks to Michael Ferry, Nicky Liu, Martin Chi, Ruocheng Jia, Charl Li, Maggie Hu, Krishna Sai Chemudupati, Frederik Kaster, Xiang Guo, Yuan Yao, Vincent Wang, Levi Chen, Chen Fu, Le An, Josh Park, Xinru Zhang, Chengming Zhao, Sunny Gai, Ajinkya Rasani, Zhijia Liu, Ever Wong, Wenting Jiang, Jonas Li, Po-Han Huang, Brant Zhao, Yiheng Zhang, and Ashwin Nanjappa for your contributions to and support of TensorRT Edge-LLM.


