Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM



Large language models (LLMs) and multimodal reasoning systems are rapidly expanding beyond the data center. Automotive and robotics developers increasingly need to run conversational AI agents, multimodal perception, and high-level planning directly on the vehicle or robot, where latency, reliability, and the ability to operate offline matter most.

While many existing LLM and vision language model (VLM) inference frameworks focus on data center needs, such as managing large volumes of concurrent user requests and maximizing throughput across them, embedded inference requires a dedicated, tailored solution.

This post introduces NVIDIA TensorRT Edge-LLM, a new, open source C++ framework for LLM and VLM inference that addresses the emerging need for high-performance edge inference. TensorRT Edge-LLM is purpose-built for real-time applications on the embedded automotive and robotics platforms NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. The framework is available as open source on GitHub for the NVIDIA JetPack 7.1 release.

TensorRT Edge-LLM has minimal dependencies, enabling deployment in production edge applications. Its lean, lightweight design, with a clear focus on embedded-specific capabilities, minimizes the framework’s resource footprint.

In addition, advanced TensorRT Edge-LLM features, such as EAGLE-3 speculative decoding, NVFP4 quantization support, and chunked prefill, provide cutting-edge performance for demanding real-time use cases.
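Chunked prefill, for example, bounds the cost of long prompts by processing them in fixed-size pieces rather than in one large forward pass. The minimal C++ sketch below illustrates only the concept; the types and function names are illustrative placeholders, not the TensorRT Edge-LLM API.

// Conceptual sketch of chunked prefill (not the TensorRT Edge-LLM API).
// The prompt is processed in fixed-size chunks; each chunk extends the same
// KV cache, so the final state matches a single monolithic prefill pass while
// per-iteration latency and peak activation memory stay bounded.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct KvCache {
    std::size_t cachedTokens = 0;  // stand-in for per-layer key/value tensors
};

// Placeholder for one engine forward pass over a slice of the prompt.
void prefillChunk(const int* tokens, std::size_t count, KvCache& cache) {
    (void)tokens;                  // a real runtime would execute the engine here
    cache.cachedTokens += count;   // and append the chunk's keys/values to the cache
}

void chunkedPrefill(const std::vector<int>& prompt, KvCache& cache,
                    std::size_t chunkSize = 512) {
    for (std::size_t start = 0; start < prompt.size(); start += chunkSize) {
        std::size_t len = std::min(chunkSize, prompt.size() - start);
        prefillChunk(prompt.data() + start, len, cache);
    }
}

int main() {
    std::vector<int> prompt(2000, 42);  // a long dummy prompt
    KvCache cache;
    chunkedPrefill(prompt, cache);
    std::cout << "tokens cached: " << cache.cachedTokens << "\n";  // 2000
}

Because each chunk reuses the KV cache built by the previous chunks, the generated output is unchanged; only the scheduling of the prompt work differs, which is what keeps prefill latency predictable on an embedded target.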

A bar chart on the left shows TensorRT Edge-LLM performance compared to vLLM performance across three different configurations. TensorRT Edge-LLM shows significantly higher performance. The right bar chart shows TensorRT Edge-LLM performance for newer Qwen3 LLM and VLM models. In both charts, configurations where speculative decoding is enabled show substantially better performance.
Figure 1. TensorRT Edge-LLM shows compelling performance compared with the popular LLM and VLM inference framework vLLM
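The speculative decoding gains shown in Figure 1 come from letting a small draft model propose several tokens that the full model then verifies in a single pass, so one expensive engine call can yield multiple accepted tokens. The sketch below shows the core propose/verify loop with greedy decoding and toy stand-in models; EAGLE-3 is a more advanced scheme (its draft model works from the target model's hidden features), and none of these names belong to the TensorRT Edge-LLM API.

// Conceptual sketch of one speculative decoding step with greedy acceptance.
// Illustrative only; not the TensorRT Edge-LLM implementation.
#include <iostream>
#include <vector>

// Toy stand-ins for the draft and target models: given the current context,
// each returns its greedy prediction for the next token.
int draftPredict(const std::vector<int>& ctx)  { return (ctx.back() + 1) % 10; }
int targetPredict(const std::vector<int>& ctx) { return (ctx.back() + 1) % 10; }

// Propose k draft tokens, then verify them against the target model and keep
// the longest accepted prefix plus one corrected or bonus token.
std::vector<int> speculativeStep(std::vector<int> ctx, int k) {
    // 1. Draft phase: k cheap autoregressive steps with the draft model.
    std::vector<int> draft;
    std::vector<int> draftCtx = ctx;
    for (int i = 0; i < k; ++i) {
        draft.push_back(draftPredict(draftCtx));
        draftCtx.push_back(draft.back());
    }
    // 2. Verify phase: a real runtime scores all k positions in one batched
    //    target-engine pass; here we simply walk them in order.
    std::vector<int> accepted;
    for (int i = 0; i < k; ++i) {
        int expected = targetPredict(ctx);   // the target's greedy choice here
        ctx.push_back(expected);
        accepted.push_back(expected);
        if (draft[i] != expected) {          // mismatch: keep the correction, stop
            return accepted;
        }
    }
    accepted.push_back(targetPredict(ctx));  // bonus token when all drafts match
    return accepted;
}

int main() {
    std::vector<int> out = speculativeStep({1, 2, 3}, 4);
    std::cout << out.size() << " tokens from one verification pass\n";  // prints 5
}

When the draft model agrees with the target most of the time, each target pass produces several tokens instead of one, which is where the speedups in Figure 1 come from; when it disagrees, the output is still exactly what the target model alone would have produced.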

LLM and VLM inference for real-time edge use cases

Edge LLM and VLM inference workloads are defined by the following characteristics:

  • Requests from a few users or a single user
  • Low batch size, typically across cameras
  • Production deployments for mission-critical applications
  • Offline operation without updating

As a consequence, robotics and automotive real-time applications come with specific requirements, including:

  • Minimal and predictable latency
  • Minimal disk, memory, and compute requirements
  • Compliance with production standards
  • High robustness and reliability

TensorRT Edge-LLM is designed to meet and prioritize these embedded-specific needs, providing a robust foundation for embedded LLM and VLM inference.

Rapid adoption of TensorRT Edge-LLM for automotive use cases

Partners including Bosch, ThunderSoft, and MediaTek are already leveraging TensorRT Edge-LLM as a foundation for their in-car AI products and are showcasing their technology at CES 2026.

The new Bosch AI-powered Cockpit, developed in collaboration with Microsoft and NVIDIA, features an innovative in-car AI assistant capable of natural voice interactions. The solution uses embedded automatic speech recognition (ASR) and text-to-speech (TTS) AI models alongside LLM inference through TensorRT Edge-LLM for a powerful onboard AI that cooperates with larger, cloud-based AI models through a sophisticated orchestrator.

ThunderSoft integrates TensorRT Edge-LLM into its upcoming AIBOX platform, based on NVIDIA DRIVE AGX Orin, to enable responsive, on-device LLM and multimodal inference inside the vehicle. By combining the ThunderSoft automotive software stack with the TensorRT Edge-LLM lightweight C++ runtime and optimized decoding path, the AIBOX delivers low-latency conversational and cockpit-assist experiences within strict power and memory limits.

MediaTek builds on top of TensorRT Edge-LLM for its CX1 SoC, which enables cutting-edge cabin AI and HMI applications. TensorRT Edge-LLM accelerates both LLM and VLM inference for a wide range of use cases, including driver and cabin activity monitoring. MediaTek also contributes to the development of TensorRT Edge-LLM with new embedded-specific inference methods.

With the launch of TensorRT Edge-LLM, these LLM and VLM inference capabilities are now available for the NVIDIA Jetson ecosystem as the foundation for robotics technology.

TensorRT Edge-LLM under the hood

TensorRT Edge-LLM is designed to provide an end-to-end workflow for LLM and VLM inference. It spans three stages:

  • Exporting Hugging Face models to ONNX
  • Building optimized NVIDIA TensorRT engines for the target hardware
  • Running inference on the target hardware
A flowchart showing how on x86 host computers, Hugging Face models are the input of the Python Export Pipeline, which produces ONNX models as an output. On the target, these ONNX models are used by the Engine Builder to build TensorRT Engines. These engines are then used by the LLM Runtime to produce inference results for users’ applications.
Figure 2. TensorRT Edge-LLM workflow with key components

The Python export pipeline converts Hugging Face models to ONNX format with support for quantization, LoRA adapters, and EAGLE-3 speculative decoding (Figure 3).

A flowchart showing the quantization and export tools the TensorRT Edge-LLM Python Export Pipeline provides for different HuggingFace models. For base/vanilla models, quantize llm, export-llm and insert-lora are provided. Export-llm generates the Base ONNX model while insert-lora generates the LoRA-enabled ONNX model. For LoRA weights, the process-LoRA tool provides SafeTensors. For EAGLE draft models, quantize-draft and export-draft create the EAGLE Draft ONNX model. For Vision Transformers, the export-visual tools take care of both quantization and export to provide an ONNX model as output.
Figure 3. TensorRT Edge-LLM Python export pipeline stages and tools

The engine builder builds TensorRT engines optimized specifically for the embedded target hardware (Figure 4).

A flowchart showing how ONNX Models and Export Configs are processed by the TensorRT Edge-LLM Engine Builder. Depending on whether the model is an LLM or VLM, the TensorRT Edge-LLM LLM Builder or VIT Builder will be used. 
Figure 4. TensorRT Edge-LLM engine builder workflow

The C++ runtime is responsible for LLM and VLM inference on the target hardware. It uses the TensorRT engines for the decoding loop that defines autoregressive models: iterative token generation based on the input and previously generated tokens. User applications interface with this runtime to solve LLM and VLM workloads.

A flowchart showing the prefill phase and decode phase of TensorRT Edge-LLM C++ Runtime. Based on tokenized input prompt, the TRT engine runs and provides logits across possible output tokens. KV Cache is then generated and a first token is chosen through sampling. The runtime then enters the decoding phase, where the TensorRT Engine is used to generate the next logits followed by KV cache update and token sampling. Then it is checked whether the stop condition (EOS token) is met; if no, the loop continues with a TRT Engine call, if yes the generated sequence is returned.
Figure 5. Prefill and decode phases of TensorRT Edge-LLM C++ Runtime
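The control flow in Figure 5 is the standard autoregressive generation loop. The C++ sketch below restates it with toy stand-ins for engine execution and sampling; the real TensorRT Edge-LLM runtime API differs, so treat every name here as illustrative.

// Conceptual prefill/decode loop mirroring Figure 5. Illustrative only:
// the engine call and sampler below are toy stand-ins, not the actual
// TensorRT Edge-LLM runtime API.
#include <algorithm>
#include <iostream>
#include <vector>

constexpr int kVocabSize = 8;
constexpr int kEosToken  = 7;

struct KvCache {
    std::vector<int> tokens;  // stand-in for per-layer key/value tensors
};

// Toy "engine": appends the new tokens to the cache and returns logits that
// favor (last token + 1), so generation walks toward the EOS token.
std::vector<float> runEngine(const std::vector<int>& newTokens, KvCache& cache) {
    cache.tokens.insert(cache.tokens.end(), newTokens.begin(), newTokens.end());
    std::vector<float> logits(kVocabSize, 0.0f);
    logits[(cache.tokens.back() + 1) % kVocabSize] = 1.0f;
    return logits;
}

// Greedy sampling: pick the highest-scoring token.
int sampleToken(const std::vector<float>& logits) {
    return static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());
}

std::vector<int> generate(const std::vector<int>& prompt, int maxNewTokens) {
    KvCache cache;
    std::vector<int> output;

    // Prefill phase: process the whole prompt once, build the KV cache,
    // and sample the first generated token from the resulting logits.
    int token = sampleToken(runEngine(prompt, cache));

    // Decode phase: feed one token back at a time, extend the KV cache,
    // sample the next token, and stop on the EOS token or the generation limit.
    while (token != kEosToken && static_cast<int>(output.size()) < maxNewTokens) {
        output.push_back(token);
        token = sampleToken(runEngine({token}, cache));
    }
    return output;
}

int main() {
    for (int t : generate({0, 1, 2}, 16)) std::cout << t << ' ';  // prints 3 4 5 6
    std::cout << '\n';
}

User applications interact with the runtime at the level of this outer loop, submitting prompts and consuming generated tokens, while engine execution, KV cache management, and sampling are handled inside the framework.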

For a more detailed explanation of the components, see the TensorRT Edge-LLM documentation.

Get started with TensorRT Edge-LLM

Ready to get started with LLM and VLM inference on your Jetson AGX Thor DevKit?

1. Download the JetPack 7.1 release.

2. Clone the JetPack 7.1 release branch of the NVIDIA/TensorRT-Edge-LLM GitHub repo:

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git

3. Check the TensorRT Edge-LLM Quick Start Guide for detailed instructions on getting out-of-the-box supported models from Hugging Face, converting them to ONNX, building TensorRT engines for your Jetson AGX Thor platform, and running them with the C++ runtime.

4. Explore the TensorRT Edge-LLM examples to learn more about features and capabilities.

5. See the TensorRT Edge-LLM Customization Guide to adapt TensorRT Edge-LLM to your own needs.

For NVIDIA DRIVE AGX Thor users, TensorRT Edge-LLM is part of the NVIDIA DriveOS release package. Upcoming DriveOS releases will leverage the GitHub repo.

As LLMs and VLMs move rapidly to the edge, TensorRT Edge-LLM provides a clean, reliable path from Hugging Face models to real-time, production-grade execution on NVIDIA automotive and robotics platforms.

Explore the workflow, test your models, and start building the next generation of intelligent on-device applications. To learn more, visit the NVIDIA/TensorRT-Edge-LLM GitHub repo.

Acknowledgments 

Thanks to Michael Ferry, Nicky Liu, Martin Chi, Ruocheng Jia, Charl Li, Maggie Hu, Krishna Sai Chemudupati, Frederik Kaster, Xiang Guo, Yuan Yao, Vincent Wang, Levi Chen, Chen Fu, Le An, Josh Park, Xinru Zhang, Chengming Zhao, Sunny Gai, Ajinkya Rasani, Zhijia Liu, Ever Wong, Wenting Jiang, Jonas Li, Po-Han Huang, Brant Zhao, Yiheng Zhang, and Ashwin Nanjappa for your contributions to and support of TensorRT Edge-LLM.


