Deploying Open Source Vision Language Models (VLM) on Jetson

Vision-Language Models (VLMs) mark a significant leap in AI by combining visual perception with semantic reasoning. Moving beyond traditional models constrained by fixed labels, VLMs use a joint embedding space to interpret and discuss complex, open-ended environments in natural language.

Rapid gains in reasoning accuracy and efficiency have made these models well suited to edge devices. The NVIDIA Jetson family, ranging from the high-performance AGX Thor and AGX Orin to the compact Orin Nano Super, is purpose-built for accelerated physical AI and robotics applications, and provides the optimized runtime that leading open source models need.

In this tutorial, we show how to deploy the NVIDIA Cosmos Reasoning 2B model across the Jetson lineup using the vLLM framework. We also walk through connecting the model to the Live VLM WebUI, which provides a real-time, webcam-based interface for interactive physical AI.

Prerequisites

Supported Devices:

Jetson AGX Thor Developer Kit
Jetson AGX Orin (64GB / 32GB)
Jetson Orin Nano Super

JetPack Version:

JetPack 6 (L4T r36.x) — for Orin devices
JetPack 7 (L4T r38.x) — for Thor

Storage: NVMe SSD required

~5 GB for the FP8 model weights
~8 GB for the vLLM container image
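
The two downloads add up to roughly 13 GB, so it can be worth checking free space on the SSD first. The function below is a minimal sketch; `has_free_gb` is a hypothetical helper name, not part of any NVIDIA tooling.

```shell
# Hypothetical helper: succeed if DIR has at least MIN_GB gigabytes free.
has_free_gb() {
    dir=$1
    min_gb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, `has_free_gb "$HOME" 13 || echo "Less than 13 GB free"` prints a warning when the home directory's filesystem is too full for the model and the container image.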

Accounts:

Create a free NVIDIA NGC account to download both the model and the vLLM container

Overview

	Jetson AGX Thor	Jetson AGX Orin	Jetson Orin Nano Super
vLLM Container	`nvcr.io/nvidia/vllm:26.01-py3`	`ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04`	`ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04`
Model	FP8 via NGC (volume mount)	FP8 via NGC (volume mount)	FP8 via NGC (volume mount)
Max Model Length	8192 tokens	8192 tokens	256 tokens (memory-constrained)
GPU Memory Util	0.8	0.8	0.65
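
If you script the launch, the sizing values in the table above can be captured in one place. This is only a sketch: the function name `vllm_sizing_flags` and the device keys (`thor`, `agx-orin`, `orin-nano`) are made-up labels for illustration.

```shell
# Hypothetical helper: print the vLLM sizing flags from the overview table.
# The device keys are labels invented for this sketch.
vllm_sizing_flags() {
    case "$1" in
        thor|agx-orin) echo "--max-model-len 8192 --gpu-memory-utilization 0.8" ;;
        orin-nano)     echo "--max-model-len 256 --gpu-memory-utilization 0.65" ;;
        *)             echo "unknown device: $1" >&2; return 1 ;;
    esac
}
```

The memory-constrained device needs further flags beyond these two; they are covered in Step 4.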

The workflow is the same on every device:

Download the FP8 model checkpoint via NGC CLI
Pull the vLLM Docker image on your device
Launch the container with the model mounted as a volume
Connect Live VLM WebUI to the vLLM endpoint

Step 1: Install the NGC CLI

The NGC CLI lets you download model checkpoints from the NVIDIA NGC Catalog.

Download and install

mkdir -p ~/Projects/CosmosReasoning
cd ~/Projects/CosmosReasoning

# Download the NGC CLI for ARM64
# Get the most recent installer URL from: https://org.ngc.nvidia.com/setup/installers/cli
wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip
chmod u+x ngc-cli/ngc

# Add to PATH
export PATH="$PATH:$(pwd)/ngc-cli"
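
Note that the `export` above only affects the current shell session. Before continuing, you may want to confirm that the shell can actually find the binary. The helper below is a small sketch; `require_cmd` is a hypothetical name.

```shell
# Hypothetical helper: succeed only if the named command can be found on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found on PATH" >&2
    return 1
}
```

Running `require_cmd ngc && ngc --version` then prints the installed CLI version, or an error if the PATH change did not take effect.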

Configure the CLI

ngc config set

You will be prompted for:

API Key — generate one at NGC API Key setup
CLI output format — select json or ascii
org — press Enter to accept the default

Step 2: Download the Model

Download the FP8 quantized checkpoint. The same checkpoint is used on all Jetson devices:

cd ~/Projects/CosmosReasoning
ngc registry model download-version "nim/nvidia/cosmos-reason2-2b:1208-fp8-static-kv8"

This creates a directory called cosmos-reason2-2b_v1208-fp8-static-kv8/ containing the model weights. Note the complete path — you’ll mount it into the Docker container as a volume.
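
A quick sanity check on the download can save a failed container launch later. The sketch below assumes the checkpoint is laid out like a Hugging Face model directory with a `config.json` at the top level; that layout is an assumption, and `check_model_dir` is a hypothetical helper name.

```shell
# Hypothetical helper: sanity-check a downloaded checkpoint directory.
# Assumption: the checkpoint has a config.json at its top level,
# as Hugging Face style model directories do.
check_model_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ ! -f "$dir/config.json" ]; then
        echo "no config.json in: $dir" >&2
        return 1
    fi
    # Print the absolute path, ready to use as the volume mount source.
    (cd "$dir" && pwd)
}
```

For example, `check_model_dir ~/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8` prints the absolute path if the directory looks complete.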

Step 3: Pull the vLLM Docker Image

For Jetson AGX Thor

docker pull nvcr.io/nvidia/vllm:26.01-py3

For Jetson AGX Orin / Orin Nano Super

docker pull ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04
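
If one setup script has to serve both JetPack generations, the image can be chosen from the L4T release number listed in the prerequisites. The sketch below assumes that `/etc/nv_tegra_release` begins with a line of the form `# R36 (release), REVISION: ...`; both function names are hypothetical.

```shell
# Hypothetical helper: map an L4T major release to the vLLM image used in this tutorial.
pick_vllm_image() {
    case "$1" in
        36) echo "ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04" ;;  # JetPack 6 (Orin)
        38) echo "nvcr.io/nvidia/vllm:26.01-py3" ;;                               # JetPack 7 (Thor)
        *)  echo "unsupported L4T release: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: extract the major release from a line such as
# "# R36 (release), REVISION: ...". Prints nothing if the line does not match.
l4t_major() {
    printf '%s\n' "$1" | sed -n 's/^# R\([0-9][0-9]*\) (release).*/\1/p'
}
```

A possible use is `docker pull "$(pick_vllm_image "$(l4t_major "$(head -n 1 /etc/nv_tegra_release)")")"`.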

Step 4: Serve Cosmos Reasoning 2B with vLLM

Option A: Jetson AGX Thor

Thor has ample GPU memory and can run the model with a generous context length.

Set the path to your downloaded model and free cached memory on the host:

MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3

Launch the container with the model mounted:

docker run --rm -it \
  --runtime nvidia \
  --network host \
  --ipc host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvcr.io/nvidia/vllm:26.01-py3 \
  bash

Inside the container, serve the model:

vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8

Note: The --reasoning-parser qwen3 flag enables chain-of-thought reasoning extraction. The --media-io-kwargs flag configures video frame handling.

Wait until you see:

INFO:     Uvicorn running on http://0.0.0.0:8000

Option B: Jetson AGX Orin

AGX Orin has enough memory to run the model with the same generous parameters as Thor.

Set the path to your downloaded model and free cached memory on the host:

MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3

1. Launch the container:

docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash

2. Inside the container, activate the environment and serve:

cd /opt/
source venv/bin/activate

vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8

Wait until you see:

INFO:     Uvicorn running on http://0.0.0.0:8000

Option C: Jetson Orin Nano Super (memory-constrained)

The Orin Nano Super has significantly less RAM, so we need aggressive memory optimization flags.

Set the path to your downloaded model and free cached memory on the host:

MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3

1. Launch the container:

docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash

2. Inside the container, activate the environment and serve:

cd /opt/
source venv/bin/activate

vllm serve /models/cosmos-reason2-2b \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --enforce-eager \
  --max-model-len 256 \
  --max-num-batched-tokens 256 \
  --gpu-memory-utilization 0.65 \
  --max-num-seqs 1 \
  --enable-chunked-prefill \
  --limit-mm-per-prompt '{"image":1,"video":1}' \
  --mm-processor-kwargs '{"num_frames":2,"max_pixels":150528}'

Key flags explained (Orin Nano Super only):

Flag	Purpose
`--enforce-eager`	Disables CUDA graphs to save memory
`--max-model-len 256`	Limits context to fit in available memory
`--max-num-batched-tokens 256`	Matches the model length limit
`--gpu-memory-utilization 0.65`	Reserves headroom for system processes
`--max-num-seqs 1`	Single request at a time to reduce memory
`--enable-chunked-prefill`	Processes prefill in chunks for memory efficiency
`--limit-mm-per-prompt`	Limits to 1 image and 1 video per prompt
`--mm-processor-kwargs`	Reduces video frames and image resolution
`--VLLM_SKIP_WARMUP=true`	Skips warmup to save time and memory
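
To get a feel for what `--gpu-memory-utilization` means in absolute terms, multiply the fraction by the memory the device reports. On Jetson the CPU and GPU share the same physical memory, so this is also memory taken away from the rest of the system. The sketch below uses a hypothetical helper, `gpu_budget_mib`, and assumes an 8 GB board (8192 MiB) purely for illustration; the total actually reported on a device is usually somewhat lower.

```shell
# Hypothetical helper: MiB available to vLLM for a given total and utilization fraction.
# Usage: gpu_budget_mib TOTAL_MIB FRACTION
gpu_budget_mib() {
    awk -v total="$1" -v util="$2" 'BEGIN { printf "%.0f\n", total * util }'
}
```

With these assumed numbers, `gpu_budget_mib 8192 0.65` prints `5325`, which leaves roughly 2.8 GiB for the operating system, the desktop, and the WebUI.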

Wait until the server reports that it is ready:

INFO:     Uvicorn running on http://0.0.0.0:8000

Confirm the server is running

From another terminal on the Jetson:

curl http://localhost:8000/v1/models

You should see the model listed in the response.
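
In a scripted setup, the same check can be polled until the server answers instead of watching the log for the Uvicorn line. The function below is a generic retry loop; `wait_until` is a hypothetical helper name and the one-second interval is an arbitrary choice.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or LIMIT seconds have elapsed. Usage: wait_until LIMIT command [args...]
wait_until() {
    limit=$1
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

For example, `wait_until 600 curl -sf http://localhost:8000/v1/models && echo "server is up"` waits up to ten minutes for the endpoint to respond.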

Step 5: Test with a Quick API Call

Before connecting the WebUI, confirm the model responds correctly:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/cosmos-reason2-2b",
    "messages": [
      {
        "role": "user",
        "content": "What capabilities do you have?"
      }
    ],
    "max_tokens": 128
  }' | python3 -m json.tool

Tip: The model name used in the API request must match what vLLM reports. Confirm with curl http://localhost:8000/v1/models.
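
To avoid copying the model name by hand, it can be pulled out of the `/v1/models` response. The sketch below is a plain text match rather than a real JSON parser, and assumes the model's `id` is the first `"id"` field in the response and contains no escaped quotes; `first_model_id` is a hypothetical helper name.

```shell
# Hypothetical helper: read a /v1/models JSON response on stdin and
# print the value of the first "id" field.
# Assumption: ids contain no escaped double quotes.
first_model_id() {
    grep -o '"id": *"[^"]*"' | head -n 1 | sed 's/^"id": *"\(.*\)"$/\1/'
}
```

A possible use is `MODEL_ID=$(curl -s http://localhost:8000/v1/models | first_model_id)`, after which `$MODEL_ID` can be substituted into the request body.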

Step 6: Connect with Live VLM WebUI

Live VLM WebUI provides a real-time webcam-to-VLM interface. With vLLM serving Cosmos Reasoning 2B, you can stream your webcam and get live AI analysis with reasoning.
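
Before wiring up the webcam, it can help to check that the server handles a single still image. The sketch below builds a chat-completions payload using the OpenAI-style `image_url` content part with a base64 data URL, which vLLM's OpenAI-compatible server accepts for vision models. `build_image_request` is a hypothetical helper; it assumes a JPEG input and a prompt with no characters that need JSON escaping.

```shell
# Hypothetical helper: print a chat-completions payload that carries one image.
# Assumptions: the image is a JPEG, and PROMPT contains no double quotes
# or backslashes (no JSON escaping is performed).
# Usage: build_image_request MODEL IMAGE_FILE PROMPT
build_image_request() {
    model=$1
    image_file=$2
    prompt=$3
    # Encode the image and strip line wraps so it fits in a data URL.
    b64=$(base64 < "$image_file" | tr -d '\n')
    cat <<EOF
{
  "model": "$model",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "$prompt"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,$b64"}}
      ]
    }
  ],
  "max_tokens": 128
}
EOF
}
```

With a hypothetical local file `frame.jpg`, the payload can be piped straight to the server: `build_image_request /models/cosmos-reason2-2b frame.jpg "Describe this image." | curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @-`. On the memory-constrained configuration, the image and the response together still have to fit in the reduced context length.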

Install Live VLM WebUI

The easiest way is pip (open another terminal):

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
cd ~/Projects/CosmosReasoning
uv venv .live-vlm --python 3.12
source .live-vlm/bin/activate
uv pip install live-vlm-webui
live-vlm-webui

Or use Docker:

git clone https://github.com/nvidia-ai-iot/live-vlm-webui.git
cd live-vlm-webui
./scripts/start_container.sh

Configure the WebUI

Open https://localhost:8090 in your browser
Accept the self-signed certificate (click Advanced → Proceed)
In the VLM API Configuration section on the left sidebar:
- Set API Base URL to http://localhost:8000/v1
- Click the Refresh button to detect the model
- Select the Cosmos Reasoning 2B model from the dropdown
Select your camera and click Start

The WebUI will now stream your webcam frames to Cosmos Reasoning 2B and display the model’s analysis in real time.

Recommended WebUI settings for Orin

Since the memory-constrained Orin configuration runs with a shorter context length, adjust these settings in the WebUI:

Max Tokens: Set to 100–150 (shorter responses complete faster)
Frame Processing Interval: Set to 60+ (gives the model time between frames)

Troubleshooting

Out of memory on Orin

Problem: vLLM crashes with CUDA out-of-memory errors.

Solution:

Free system memory before starting:
```
sudo sysctl -w vm.drop_caches=3
```
Lower --gpu-memory-utilization (try 0.55 or 0.50)
Reduce --max-model-len further (try 128)
Ensure that no other GPU-intensive processes are running
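
To see whether these steps actually free anything, compare the available memory before and after. Because the CPU and GPU on Jetson share physical memory, the kernel's `MemAvailable` figure is a reasonable rough guide to what vLLM can claim. The sketch below uses a hypothetical helper, `mem_available_mib`, which reads `/proc/meminfo` by default.

```shell
# Hypothetical helper: print MemAvailable in MiB from a meminfo-style file.
# Usage: mem_available_mib [FILE]   (FILE defaults to /proc/meminfo)
mem_available_mib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

Run `mem_available_mib` once before and once after `sudo sysctl -w vm.drop_caches=3`; if the figure is still well below what `--gpu-memory-utilization` asks for, lower the fraction or stop other processes.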

Model not present in WebUI

Problem: The model doesn’t appear in the Live VLM WebUI dropdown.

Solution:

Confirm vLLM is running: curl http://localhost:8000/v1/models
Ensure that the WebUI API Base URL is set to http://localhost:8000/v1 (not https)
If vLLM and WebUI are in separate containers, use http://:8000/v1 instead of localhost

Slow inference on Orin

Problem: Each response takes a very long time.

Solution:

This is expected with the memory-constrained configuration. Cosmos Reasoning 2B FP8 on Orin prioritizes fitting in memory over speed
Reduce max_tokens in the WebUI to get shorter, faster responses
Increase the frame interval so the model isn’t constantly processing new frames

vLLM fails to load model

Problem: vLLM reports that the model path doesn’t exist or can’t be loaded.

Solution:

Confirm the NGC download completed successfully: ls ~/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8/
Ensure that the volume mount path is correct in your docker run command
Check that the model directory is mounted as read-only (:ro) and the path inside the container matches what you pass to vllm serve

Summary

In this tutorial, we showed how to deploy the NVIDIA Cosmos Reasoning 2B model on the Jetson family of devices using vLLM.

The combination of Cosmos Reasoning 2B’s chain-of-thought capabilities and Live VLM WebUI’s real-time streaming makes it well suited to prototyping and evaluating vision AI applications at the edge.
