Making Robot Perception More Efficient on NVIDIA Jetson Thor

Building autonomous robots requires robust, low-latency visual perception for depth estimation, obstacle recognition, localization, and navigation in dynamic environments. These capabilities demand heavy compute. NVIDIA Jetson platforms offer powerful GPUs for deep learning, but increasing AI complexity and the need for real-time performance can result in GPU oversubscription. Relying solely on the GPU for all perception tasks can create bottlenecks, increase power consumption, and cause thermal challenges, especially in the power-sensitive and thermally constrained environments common in mobile robotics.

The NVIDIA Jetson platform addresses these challenges by combining powerful GPUs with dedicated hardware accelerators. Jetson devices like NVIDIA Jetson AGX Orin and NVIDIA Jetson Thor include specialized hardware accelerators designed to execute image processing and computer vision tasks with high efficiency, freeing the GPU for more demanding deep learning workloads. The NVIDIA Vision Programming Interface (VPI) unlocks the full potential of these diverse hardware accelerators.

In this post, we explore the advantages of using these accelerators and explain how developers can use VPI to unlock the full potential of the Jetson platform. We will walk you through the development of a low-latency, low-power perception application for stereo disparity using these accelerators. To begin, we will develop a single stereo camera pipeline, and then move on to a multi-stream pipeline with eight stereo cameras running at 30 FPS on Jetson Thor T5000, about 10x faster than Jetson AGX Orin 64 GB.

Before we jump into development, let’s quickly review which accelerators are available on the Jetson platform, their advantages, what applications they can unlock, and how VPI can help.

What accelerators does Jetson offer beyond the GPU?

Jetson devices have powerful GPUs for deep learning, but increasing AI complexity demands careful management of GPU cycles. Jetson offers specialized engines for computer vision (CV) workloads. While the GPU is powerful and versatile, these engines, when combined with the GPU, offer significant computational benefits. VPI simplifies access to these engines, making experimentation and load balancing easy.

Figure 1. Vision Programming Interface (VPI) for Jetson developers

Let’s take a closer look at each accelerator to understand its purpose and advantages.

Programmable Vision Accelerator (PVA): 

The PVA is a programmable digital signal processing (DSP) engine with a 1024‑bit single-instruction, multiple-data (SIMD) unit and native memory with flexible direct memory access (DMA), optimized for vision and image processing with high performance per watt. It runs asynchronously alongside the CPU, GPU, and other accelerators, and is available on all Jetson SKUs except NVIDIA Jetson Nano.

Through VPI, developers can access ready‑to‑use algorithms like AprilTag detection, object tracking, and stereo disparity estimation. For custom implementation of algorithms, the PVA SDK, now available to Jetson developers, provides C/C++ APIs and tools for developing vision algorithms directly on the PVA.

Optical Flow Accelerator (OFA): 

The OFA is a fixed-function hardware accelerator for computing optical flow and stereo disparity. The OFA can operate in two modes: in stereo disparity mode, it estimates a disparity map by processing the rectified left and right views from a stereo camera pair; in optical flow mode, it estimates 2D motion vectors between two consecutive frames.

Video and Image Compositor (VIC): 

The VIC is a fixed-function, power-efficient hardware accelerator in Jetson devices that is specialized for low-level image processing tasks, such as rescaling, remapping, warping, color space conversion, and noise reduction.
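
For instance, offloading a format or color space conversion to the VIC takes just a few lines with the VPI Python API used later in this post. The following is a minimal sketch, assuming an RGBA input frame read from a hypothetical frame.png; the same submission pattern applies to other VIC-backed operations such as Rescale and Remap.

import vpi
import numpy as np
from PIL import Image

# Wrap an existing camera frame (hypothetical file) as a VPI image
frame = vpi.asimage(np.asarray(Image.open('frame.png').convert('RGBA')))

# Submit a grayscale conversion to the VIC backend; the GPU stays free for deep learning
stream = vpi.Stream()
gray = frame.convert(vpi.Format.Y8_ER_BL, backend=vpi.Backend.VIC, stream=stream)
stream.sync()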

What use cases benefit from these accelerators?

Below are some scenarios where developers may consider going beyond the GPU for their specific application needs:

  • GPU-oversubscribed applications: As a best practice, developers should prioritize deep learning (DL) workloads for the GPU, and offload computer vision tasks to the PVA, OFA, or VIC using VPI (see the sketch after this list). For instance, DeepStream’s Multi‑Object Tracker can run 12 video streams on Orin AGX with the GPU alone, but by load balancing with the PVA it can support 16 streams.
  • Power‑sensitive applications: In use cases like sentry mode or activity monitoring, offloading most computation to low‑power accelerators (PVA, OFA, VIC) can provide maximum efficiency.
  • Industrial applications with thermal limits: In high‑heat environments, distributing workloads across all accelerators reduces throttling and helps maintain latency and throughput within thermal budgets.
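
As a sketch of what this load balancing can look like, the snippet below keeps the GPU out of the loop entirely: format conversion runs on the VIC and stereo disparity runs on the OFA, PVA, and VIC. It reuses only the VPI Python calls shown in the tutorial later in this post; the input image files are hypothetical placeholders.

import vpi
import numpy as np
from PIL import Image

stream = vpi.Stream()

# Hypothetical rectified stereo pair loaded from disk
left = vpi.asimage(np.asarray(Image.open('left.png').convert('RGBA')))
right = vpi.asimage(np.asarray(Image.open('right.png').convert('RGBA')))

# Preprocessing on the VIC rather than the GPU
left = left.convert(vpi.Format.Y8_ER_BL, backend=vpi.Backend.VIC, stream=stream)
right = right.convert(vpi.Format.Y8_ER_BL, backend=vpi.Backend.VIC, stream=stream)

# Stereo disparity on the OFA, PVA, and VIC; the GPU remains free for DL inference
disparity = vpi.stereodisp(left, right,
                           backend=vpi.Backend.OFA | vpi.Backend.PVA | vpi.Backend.VIC,
                           stream=stream)
stream.sync()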

How to use VPI to unlock all the accelerators

VPI provides a unified, versatile framework that gives developers seamless access to accelerators on platforms ranging from Jetson modules to workstations and PCs with discrete GPUs.

Now let’s look at an example that brings it all together.

Example: stereo vision pipeline 

Modern robotics stacks often depend on passive stereo systems for 3D perception of the surrounding world. Consequently, computing stereo disparity maps is an essential step toward building a complex perception stack. Here we will look at a sample pipeline that a developer can use to produce stereo disparity and confidence maps. Below, we show how to build a low-latency, energy-efficient pipeline with all of the accelerators available via VPI.

Figure 2. Schematic of the stereo vision pipeline deployed across multiple accelerators on Jetson. PVA = Programmable Vision Accelerator. VIC = Video and Image Compositor. OFA = Optical Flow Accelerator.
  • Preprocessing on CPU: The preprocessing step can run on the CPU because it only happens once. This step computes a rectification map that will be used to correct lens distortion in the stereo camera frames.
  • Remap on VIC: This step undistorts and aligns camera frames using a precomputed rectification map, ensuring both optical axes are level and parallel. VPI supports polynomial and fisheye distortion models and lets developers define custom warp maps. See the Remap documentation for details.
  • Stereo disparity on OFA: The rectified image pairs are inputs to the semi-global matching (SGM) algorithm. In practice, SGM alone can be noisy and produce erroneous disparity values. A confidence map can be created to improve the result by discarding disparity estimates that correspond to low confidence values. For more details on SGM and the supported parameters, refer to the stereo disparity documentation.
  • Confidence map on PVA: VPI supports three confidence map modes: ABSOLUTE, RELATIVE, and INFERENCE. ABSOLUTE and RELATIVE require two OFA passes (left/right disparity) plus a PVA cross‑check, while INFERENCE uses a single OFA pass followed by a CNN on the PVA (two convolution layers plus two non-linear activation layers). Skipping confidence computation is fastest but produces noisy disparity maps, whereas RELATIVE and INFERENCE improve both disparity quality and confidence.

VPI’s unified memory architecture eliminates unnecessary data copies across engines, and its asynchronous stream/event model lets developers schedule workloads and sync points upfront. Hardware‑managed scheduling enables parallel execution across engines, freeing the CPU and hiding latency with an efficient streaming pipeline.

Building a high-performance stereo disparity pipeline using VPI

Getting started with Python APIs

This tutorial walks through a basic stereo disparity pipeline without remap using the VPI Python API. 

Prerequisites:

  • An NVIDIA Jetson device (e.g., Jetson AGX Thor)
  • VPI installed via NVIDIA SDK Manager or apt
  • Python libraries: vpi, numpy, Pillow, opencv-python

In this tutorial, we will:

  • Load left and right stereo images
  • Convert their format for processing
  • Synchronize the streams to ensure data is ready
  • Execute the stereo disparity algorithm
  • Post-process the output and save the result

Setup and initialization

The first step is to import the necessary libraries, parse the input image paths, and create VPIStream objects. A VPIStream acts as a command queue, allowing you to submit tasks for asynchronous execution. We’ll use two streams to demonstrate parallel processing.

import vpi
import numpy as np
from PIL import Image
from argparse import ArgumentParser

# Parse the paths of the left and right input images
parser = ArgumentParser()
parser.add_argument('left', help='Path to the left stereo image')
parser.add_argument('right', help='Path to the right stereo image')
args = parser.parse_args()

# Create two streams for parallel processing
streamLeft = vpi.Stream()
streamRight = vpi.Stream()

streamLeft will handle the left image, and streamRight will handle the right image.

Loading and converting images

VPI’s Python API can work directly with NumPy arrays. We load the images using Pillow and then wrap them with VPI’s asimage function. Next, we convert the images to a format suitable for the stereo disparity algorithm. For this example, we’ll convert from RGBA8 to Y8_ER_BL (8-bit grayscale, block-linear format).

# Load images and wrap them in VPI images
left_img = np.asarray(Image.open(args.left))
right_img = np.asarray(Image.open(args.right))
left = vpi.asimage(left_img)
right = vpi.asimage(right_img)
 
# Convert images to Y8_ER_BL format in parallel on different backends
left = left.convert(vpi.Format.Y8_ER_BL, scale=1, stream=streamLeft, backend=vpi.Backend.VIC)
right = right.convert(vpi.Format.Y8_ER_BL, scale=1, stream=streamRight, backend=vpi.Backend.CUDA)

The left image conversion is submitted to the VIC backend via streamLeft, while the right image conversion is submitted to the NVIDIA CUDA backend on streamRight. This allows the two operations to run in parallel on different hardware units, which is a key advantage of VPI.

Synchronizing and executing stereo disparity

Before we can perform stereo disparity, we must ensure that both images are ready. We use streamLeft.sync() to block the main thread until the left image conversion is complete. Then, we can submit the vpi.stereodisp operation on streamRight.

# Synchronize streamLeft to ensure the left image is ready
streamLeft.sync()
 
# Submit the stereo disparity operation on streamRight
disparityS16 = vpi.stereodisp(left, right, backend=vpi.Backend.OFA|vpi.Backend.PVA|vpi.Backend.VIC, stream=streamRight)

The stereo disparity algorithm is executed on a combination of VPI backends (OFA, PVA, VIC) to take advantage of the specialized hardware. The result is a disparity map in S16 format, representing the horizontal shift between corresponding pixels in the two images.

Post-processing and visualization

The raw disparity map must be post-processed for visualization. The disparity values, which are in Q10.5 fixed-point format, are scaled to a 0–255 range and saved.

# Post-process the disparity map:
# convert Q10.5 S16 values to U8 and scale to 0-255 for visualization
disparityU8 = disparityS16.convert(vpi.Format.U8, scale=255.0/(32*128), stream=streamRight, backend=vpi.Backend.CUDA)

# Make the result accessible on the CPU
disparityU8 = disparityU8.cpu()

# Save with Pillow
d_pil = Image.fromarray(disparityU8)
d_pil.save('./disparity.png')

This final step converts the raw data into a human-readable image, where grayscale intensity represents depth.
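
As a quick sanity check on the scale factor 255.0/(32*128): the S16 output is in Q10.5 fixed point, so one pixel of disparity corresponds to 32 raw units, and with a maximum disparity of 128 the largest raw value (32 × 128 = 4096) maps to 255. The sketch below walks through that arithmetic and shows a hypothetical conversion from disparity to metric depth; the baseline and focal length are assumed placeholder values, not part of this example.

# The S16 output is Q10.5 fixed point: 5 fractional bits, so 1 pixel of disparity = 32 raw units
max_disparity = 128                      # maximum disparity assumed by the scale factor above
scale = 255.0 / (32 * max_disparity)     # maps the largest raw value (4096) to 255

# Example: a raw S16 value of 1504 corresponds to 47 pixels of disparity
raw_value = 1504
disparity_px = raw_value / 32.0          # 47.0 pixels

# Hypothetical depth from disparity (placeholder calibration values)
baseline_m = 0.12                        # stereo baseline in meters (assumed)
focal_px = 800.0                         # focal length in pixels (assumed)
depth_m = baseline_m * focal_px / disparity_px   # roughly 2.04 meters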

Multi-Streaming disparity pipeline using C++ APIs

Advanced robotics applications need high throughput, which VPI enables through parallel multi‑streaming. By combining streamlined APIs with efficient use of hardware accelerators, VPI lets developers build fast, reliable vision pipelines, much like those powering Boston Dynamics’ next‑generation robots.

VPI uses VPIStream objects, which are first-in, first-out (FIFO) command queues for submitting tasks to a backend asynchronously. This allows for parallel execution of operations on different hardware units (asynchronous streams).

For maximum performance in mission-critical applications, VPI’s C++ API is ideal.

The following code snippets are from a C++ benchmark that demonstrates how to build and run a multi-stream stereo disparity pipeline. The SimpleMultiStreamBenchmark C++ app showcases this by pre‑generating synthetic NV12_BL images to avoid runtime overhead, then running multiple streams in parallel and measuring frames-per-second (FPS) throughput. It also supports saving inputs and disparity/confidence maps for debugging. This example pre-generates data to simulate a high-speed, real-time workload.

Setting up resources, object declaration, and initialization

We first declare and initialize all of the objects VPI requires to run this pipeline, per stream. This includes creating streams, input/output images, and stereo payloads. Since we will feed images of type NV12_BL to the stereo algorithm, we allocate that type, plus the Y8_ER image type for intermediate format conversion.

int totalIterations = itersPerStream * numStreams;
std::vector<VPIImage> leftInputs(numStreams), rightInputs(numStreams), confidences(numStreams), leftTmps(numStreams), rightTmps(numStreams);
std::vector<VPIImage> leftOuts(numStreams), rightOuts(numStreams), disparities(numStreams);
std::vector<VPIPayload> stereoPayloads(numStreams);
std::vector<VPIStream> streamsLeft(numStreams), streamsRight(numStreams);
std::vector<VPIEvent> events(numStreams);
int width   = cvImageLeft.cols;
int height  = cvImageLeft.rows;
int vic_pva_ofa = VPI_BACKEND_VIC | VPI_BACKEND_OFA | VPI_BACKEND_PVA;
VPIStereoDisparityEstimatorCreationParams stereoPayloadParams;
VPIStereoDisparityEstimatorParams stereoParams;
CHECK_STATUS(vpiInitStereoDisparityEstimatorCreationParams(&stereoPayloadParams));
CHECK_STATUS(vpiInitStereoDisparityEstimatorParams(&stereoParams));
stereoPayloadParams.maxDisparity = 128;
stereoParams.maxDisparity= 128;
stereoParams.confidenceType  = VPI_STEREO_CONFIDENCE_RELATIVE;

for (int i = 0; i < numStreams; i++)
{
    CHECK_STATUS(vpiImageCreateWrapperOpenCVMat(cvImageLeft, 0, &leftInputs[i]));
    CHECK_STATUS(vpiImageCreateWrapperOpenCVMat(cvImageRight, 0, &rightInputs[i]));
    CHECK_STATUS(vpiStreamCreate(0, &streamsLeft[i]));
    CHECK_STATUS(vpiStreamCreate(0, &streamsRight[i]));
    CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_Y8_ER, 0, &leftTmps[i]));
    CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_NV12_BL, 0, &leftOuts[i]));
    CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_Y8_ER, 0, &rightTmps[i]));
    CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_NV12_BL, 0, &rightOuts[i]));
    CHECK_STATUS(vpiCreateStereoDisparityEstimator(vic_pva_ofa, width, height, VPI_IMAGE_FORMAT_NV12_BL,
    &stereoPayloadParams, &stereoPayloads[i]));
    CHECK_STATUS(vpiEventCreate(0, &events[i]));
}
int outCount = saveOutput ? (numStreams * itersPerStream) : numStreams;
disparities.resize(outCount);
confidences.resize(outCount);
for (int i = 0; i < outCount; i++)
{
    CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_S16, 0, &disparities[i]));
    CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_U16, 0, &confidences[i]));
}

Converting image format

We use VPI’s C API to submit the image format conversion operations for each stream, converting to the NV12_BL input that mimics frames coming from a camera.

for (int i = 0; i < numStreams; i++)
{
    CHECK_STATUS(vpiSubmitConvertImageFormat(streamsLeft[i], VPI_BACKEND_CPU, leftInputs[i], leftTmps[i], NULL));
    CHECK_STATUS(vpiSubmitConvertImageFormat(streamsLeft[i], VPI_BACKEND_VIC, leftTmps[i], leftOuts[i], NULL));
    CHECK_STATUS(vpiEventRecord(events[i], streamsLeft[i]));
    CHECK_STATUS(vpiSubmitConvertImageFormat(streamsRight[i], VPI_BACKEND_CPU, rightInputs[i], rightTmps[i], NULL));
    CHECK_STATUS(vpiSubmitConvertImageFormat(streamsRight[i], VPI_BACKEND_VIC, rightTmps[i], rightOuts[i], NULL));
    CHECK_STATUS(vpiStreamWaitEvent(streamsRight[i], events[i]));
}
for (int i = 0; i < numStreams; i++)
{
    CHECK_STATUS(vpiStreamSync(streamsLeft[i]));
    CHECK_STATUS(vpiStreamSync(streamsRight[i]));
}

As before, we submit the operations to different hardware on the two separate streams. The conversion types are inferred from the formats of the input/output images. This time, we also record a VPIEvent after the left stream’s conversion operations. A VPIEvent is a VPI object that allows one stream to wait for another stream to finish all of the operations submitted at the time of recording. This lets us force the right stream to wait on the left stream’s conversion operation without blocking the calling (main) thread, enabling multiple left and right streams to operate in parallel.

Synchronizing and executing stereo disparity

We use VPI’s C API to submit our stereo disparity operation. We also benchmark our stereo disparity using std::chrono.

auto benchmarkStart = std::chrono::high_resolution_clock::now();
for (int iter = 0; iter < itersPerStream; iter++)
{
    for (int i = 0; i < numStreams; i++)
    {
        int dispIdx = saveOutput ? (i * itersPerStream + iter) : i;
        CHECK_STATUS(vpiSubmitStereoDisparityEstimator(streamsRight[i], vic_pva_ofa, stereoPayloads[i], leftOuts[i],
                                                     rightOuts[i], disparities[dispIdx], confidences[dispIdx],
                                                     &stereoParams));
    }
}
// ====================
// End Benchmarking
for (int i = 0; i < numStreams; i++)
{
    CHECK_STATUS(vpiStreamSync(streamsRight[i]));
}
auto benchmarkEnd = std::chrono::high_resolution_clock::now();

As before, we submit our operation with a confidence map and get a resulting disparity map. We then end our benchmarking timer and record the time taken for conversion and disparity. We explicitly sync all of the streams only after submitting to all of them, so the calling thread isn’t blocked at submission time.

Post-processing and cleanup

We use VPI’s C API and OpenCV interoperability to post-process and save the disparity map. We optionally save the output data for inspection and then clean up the VPI objects after the loop.

// ====================
// Save Outputs
if (saveOutput)
{
    for (int i = 0; i < numStreams * itersPerStream; i++)
    {
        VPIImageData dispData, confData;
        cv::Mat cvDisparity, cvDisparityColor, cvConfidence, cvMask;
        CHECK_STATUS(
        vpiImageLockData(disparities[i], VPI_LOCK_READ, VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR, &dispData));
        vpiImageDataExportOpenCVMat(dispData, &cvDisparity);
        cvDisparity.convertTo(cvDisparity, CV_8UC1, 255.0 / (32 * stereoParams.maxDisparity), 0);
        applyColorMap(cvDisparity, cvDisparityColor, cv::COLORMAP_JET);
        CHECK_STATUS(vpiImageUnlock(disparities[i]));
        std::ostringstream fpStream;
        fpStream << "stream_" << i / itersPerStream << "_iter_" << i % itersPerStream << "_disparity.png";
        imwrite(fpStream.str(), cvDisparityColor);

        // Confidence output (U16 -> scale to 8-bit and save)
        CHECK_STATUS(
        vpiImageLockData(confidences[i], VPI_LOCK_READ, VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR, &confData));
        vpiImageDataExportOpenCVMat(confData, &cvConfidence);
        cvConfidence.convertTo(cvConfidence, CV_8UC1, 255.0 / 65535.0, 0);
        CHECK_STATUS(vpiImageUnlock(confidences[i]));
        std::ostringstream fpStreamConf;
        fpStreamConf << "stream_" << i / itersPerStream << "_iter_" << i % itersPerStream << "_confidence.png";
        imwrite(fpStreamConf.str(), cvConfidence);
    }
}

// ====================
// Clean Up VPI Objects
for (int i = 0; i < numStreams; i++)
{
    CHECK_STATUS(vpiStreamSync(streamsLeft[i]));
    CHECK_STATUS(vpiStreamSync(streamsRight[i]));
    vpiStreamDestroy(streamsLeft[i]);
    vpiStreamDestroy(streamsRight[i]);
    vpiImageDestroy(rightInputs[i]);
    vpiImageDestroy(leftInputs[i]);
    vpiImageDestroy(leftTmps[i]);
    vpiImageDestroy(leftOuts[i]);
    vpiImageDestroy(rightTmps[i]);
    vpiImageDestroy(rightOuts[i]);
    vpiPayloadDestroy(stereoPayloads[i]);
    vpiEventDestroy(events[i]);
}
// Destroy all disparity and confidence images
for (int i = 0; i < (int)disparities.size(); i++)
{
    vpiImageDestroy(disparities[i]);
}
for (int i = 0; i < (int)confidences.size(); i++)
{
    vpiImageDestroy(confidences[i]);
}

Collect benchmarking results

We will now collect and display our benchmarking results.

// totalTime is the elapsed benchmark duration in microseconds
// (derived from benchmarkEnd - benchmarkStart)
double totalTimeSeconds = totalTime / 1000000.0;
double avgTimePerFrame  = totalTimeSeconds / totalIterations;
double throughputFPS    = totalIterations / totalTimeSeconds;

std::cout << "n" << std::string(70, '=') << std::endl;
std::cout << "SIMPLE MULTI-STREAM RESULTS" << std::endl;
std::cout << std::string(70, '=') << std::endl;
std::cout << "Input: RGB8 -> Y8_BL_ER" << std::endl;
std::cout << "Total time: " << totalTimeSeconds << " seconds" << std::endl;
std::cout << "Avg time per frame: " << (avgTimePerFrame * 1000) << " ms" << std::endl;
std::cout << "THROUGHPUT: " << throughputFPS << " FPS" << std::endl;
std::cout << std::string(70, '=') << std::endl;

std::cout << "THROUGHPUT: " << throughputFPS << " FPS" << std::endl;
std::cout << std::string(70, '=') << std::endl;

Review results

Given an image resolution of 960×600 and a maximum disparity of 128, this solution achieves 30 FPS with eight simultaneous streams running stereo disparity estimation, including confidence maps, on Thor T5000 with no load on the GPU. That is about 10x faster than on an Orin AGX 64 GB. The power mode is MAX_N in both cases. Performance is shown in Table 1.

Stereo disparity full pipeline (RELATIVE mode, resolution: 960×600, max disparity: 128)

Number of streams    Orin AGX (64 GB) FPS    Jetson Thor T5000 FPS    Speed-up ratio
1                    22                      122                      5.5
2                    12                      111                      9.5
4                    6                       58                       9.7
8                    3                       29                       9.7

Table 1. Comparison of stereo disparity pipeline in RELATIVE mode on Orin AGX vs. Thor T5000

How Boston Dynamics uses VPI

As a heavy user of the Jetson platform, Boston Dynamics relies on the Vision Programming Interface (VPI) to speed up its perception pipeline.

VPI enables seamless access to Jetson’s specialized hardware accelerators, offering a set of optimized vision algorithms such as AprilTag detection and SGM disparity, feature detectors like ORB and Harris corner, Pyramidal LK tracking, and OFA-powered optical flow. These are core to Boston Dynamics’ perception stack, supporting both prototype testing and system optimization through load balancing. By adopting VPI, engineers can quickly adapt to hardware updates and shorten time‑to‑value.

Takeaways

The hardware advancements in the Jetson Thor platform, together with libraries like VPI, empower developers to design efficient, low-latency solutions for edge-based robotics.

By utilizing the unique features of each available accelerator on Jetson, robotics companies such as Boston Dynamics can achieve sophisticated vision processing that is both efficient and scalable, a key step toward making intelligent, autonomous robots a reality in a variety of real-world applications.

To get started building your own CV applications on Jetson, explore the VPI documentation and samples.


