Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight



In vision AI systems, model throughput continues to improve, and the surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU scheduling. The previous post, Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6, described this as the data-to-tensor gap: a performance mismatch between AI pipeline stages.

The SMPTE VC-6 (ST 2117-1) codec addresses this gap through a hierarchical, tile-based architecture. Images are encoded as progressively refinable Levels of Quality (LoQs), each adding incremental detail. This allows selective retrieval and decoding of only the required resolution, region of interest, or color plane, with random access to independently decodable frames. Pipelines can retrieve and decode only what the model needs.
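As a toy illustration of why selective retrieval pays off, the sketch below models fetching only the bytes needed for a target LoQ. The per-LoQ payload sizes and the layout are invented for illustration and do not reflect the actual VC-6 bitstream format.

```python
def bytes_needed(incremental_sizes, target_loq):
    """Bytes to fetch to decode up to target_loq.

    incremental_sizes[k] is the extra payload LoQ-k adds on top of
    LoQ-(k+1); LoQ-3 is the coarsest level here, LoQ-0 full resolution.
    """
    return sum(size for loq, size in incremental_sizes.items() if loq >= target_loq)

# Illustrative per-LoQ payload sizes in bytes (coarse levels are cheap).
sizes = {0: 70_000, 1: 20_000, 2: 6_000, 3: 2_000}

# A pipeline that only needs the ~1K level fetches a fraction of the stream.
print(bytes_needed(sizes, 2))  # 8000 bytes instead of 98000 for full resolution
```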

However, efficient single-image execution doesn’t automatically translate to efficient scaling. As batch sizes grow, the bottleneck shifts from single-image kernel efficiency to workload orchestration, launch cadence, and GPU occupancy.

This post focuses on the architectural changes required to scale VC-6 decoding for batched inference and training workloads. Because NVIDIA Nsight Systems and NVIDIA Nsight Compute let developers identify system- and kernel-level constraints, they were used to redesign the VC-6 CUDA implementation for batch throughput. The result is up to ~85% lower per-image decode time compared with the previous implementation, with submillisecond decode for LoQ-0 (~4K) in batch and ~0.2 ms for lower LoQs, with identical output quality. This significantly improves pipeline efficiency for production vision AI workloads.

Introducing the VC-6 batch mode implementation

The new implementation is built around several architectural changes, including batch mode and kernel-level optimizations.

Batch mode: From N to a single decoder

  • Execution model redesign
    • Algorithmic changes to decode multiple images concurrently with a single decoder 
  • Improved parallelization
    • Leveraging the new work dimension (images) alongside the existing parallelization dimensions (tiles, planes) to shift initial VC-6 tile hierarchy work to the GPU.
    • Minibatch pipelining 

Kernel-level optimizations 

  • Nsight Compute driven range decoder kernel optimization
  • The optimizations led to a ~20% kernel speedup

The following sections detail these changes to the VC-6 decoder in depth. As with any CUDA optimization, the plan was to start with a system-level profiler, Nsight Systems, to identify and fix initial performance bottlenecks, and then use Nsight Compute to refine individual kernels.

Moving from N to a single decoder

The top part of Figure 1 shows the starting point, as detailed in Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6.

The middle rows show heavy CUDA API usage, each corresponding to a separate decoder instance decoding a single image. In All Streams, many small, concurrently running kernels on the GPU are shown in blue. The top row shows device utilization: light orange indicates less than full utilization, dark orange indicates full load. In this case, even with enough dispatched work, the GPU never reaches full utilization, so the profiled algorithm is not optimal.

This inefficiency is explained by the execution of many small kernels. Each kernel launch carries fixed overheads, such as scheduling and kernel resource management. In this setting, constant per-kernel overhead combined with little work per kernel results in an unfavorable ratio of overhead to useful work.

Changing this requires altering the paradigm from many small kernels to a few larger kernels.

In this case, NVIDIA Nsight motivated an execution model redesign from N decoders for N images to a single decoder that decodes a batch of N images at once. This new execution model redistributes the same fixed amount of work into fewer kernels, each with more work. The bottom part of Figure 1 shows the effect of this reimplementation: only two CUDA API timelines, a handful of large kernels, and full GPU utilization, indicated by the dark orange utilization row.
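The effect of the redesign can be approximated with a simple cost model. The overhead and work figures below are assumptions for illustration, not measurements:

```python
LAUNCH_OVERHEAD_US = 5.0   # assumed fixed cost per kernel launch
WORK_PER_IMAGE_US = 2.0    # assumed useful decode work per image

def per_image_time(batch_size, launches):
    """Per-image time when a batch is decoded using `launches` kernel launches."""
    total = launches * LAUNCH_OVERHEAD_US + batch_size * WORK_PER_IMAGE_US
    return total / batch_size

# N decoders, roughly ten small kernels per image: overhead dominates.
many_small = per_image_time(64, launches=64 * 10)
# One batched decoder, a handful of large kernels for all 64 images.
few_large = per_image_time(64, launches=4)

print(many_small)  # 52.0 us/image
print(few_large)   # 2.3125 us/image, overhead nearly amortized away
```

The fixed work is unchanged; only its distribution across launches changes, which is exactly what the bottom of Figure 1 shows.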

Shifting more work to the GPU

In the initial implementation, decoding the root and narrow levels of the VC-6 tile hierarchies was performed on the CPU. For single-image decoding, the amount of work in these narrower stages was too small to justify GPU execution. In the batched design, although the work per image remains small, aggregating multiple images provides sufficient parallelism to utilize the GPU efficiently.

Moreover, the algorithm was modified to eliminate host-side logic for handling variable image dimensions. With that logic embedded in GPU kernels, NVIDIA Nsight showed a reduction in both synchronization points and submission latency, along with increased pipeline fluidity.

Figures 2 and 3 show the utilization and CPU overhead overview of decoding images at LoQ-0 and LoQ-2, indicating more severe inefficiencies for LoQ-2.

However, with batch mode VC-6 (bottom of Figures 2 and 3), GPU execution of even the smallest LoQs becomes viable because the aggregated workload of several images can be computed efficiently on the GPU.

Minibatch pipelining

The new decoder design splits each batch into minibatches, which flow through a pipeline of CPU processing, PCIe transfer, and GPU decoding stages. The images of a minibatch occupy one pipeline stage at a time, while the stages operate concurrently and hide one another’s costs.

Figure 4 illustrates this minibatch pipelining. Similar to Figure 1, the CUDA API calls are dispatched from two threads, UPLOAD and GPU, with minimal host-side resource usage. Work aggregation has clearly reduced CUDA API calls, memory operations, and synchronizations, while amortizing kernel launch overhead across the batch.
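The benefit of this overlap can be sanity-checked with a back-of-the-envelope timing model; the stage durations below are illustrative assumptions, not profiled values:

```python
def sequential_time(n_minibatches, cpu, pcie, gpu):
    """Total time when each minibatch runs CPU, PCIe, and GPU stages back to back."""
    return n_minibatches * (cpu + pcie + gpu)

def pipelined_time(n_minibatches, cpu, pcie, gpu):
    """Fill the pipeline once, then emit one minibatch per slowest-stage tick."""
    return (cpu + pcie + gpu) + (n_minibatches - 1) * max(cpu, pcie, gpu)

# Illustrative milliseconds per minibatch for each stage.
cpu, pcie, gpu = 1.0, 0.5, 1.25

print(sequential_time(8, cpu, pcie, gpu))  # 22.0 ms
print(pipelined_time(8, cpu, pcie, gpu))   # 11.5 ms, bounded by the GPU stage
```

In steady state the pipeline's cost per minibatch approaches that of the slowest stage alone, which is why the faster stages effectively become free.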

Kernel-level optimizations

Nsight Systems revealed that the initial optimizations alleviated the CPU overhead, so further improvements required kernel optimization. The terminal_decode kernel, which implements a range decoder, is noteworthy. Nsight Compute highlighted previously noncritical microarchitectural constraints: typical low-level inefficiencies such as low SM occupancy, warp divergence, noncoalesced memory accesses, and register pressure. These insights are essential for developers to eliminate or minimize such issues where possible.

The Nsight Compute source heatmap and Warp Stall Sampling (All Samples) show measured time spent per individual source line. They reveal that significant time is spent on integer divisions in the range update logic (Figure 5). Since GPUs are not optimized for integer division and accuracy is non-negotiable, these operations cannot be optimized away.

For the decoder table lookup, implemented as binary search on shared memory, Nsight Compute also revealed significant short scoreboard stalls (Figure 6).

These stalls point to shared memory loads (LDS in Figure 6); dynamic indexing into a local array would otherwise lead to slow local memory access. Since the lookup tables are constant in size, it is possible to replace this approach with a local variable and an unrolled loop. Compared with the binary search, this exhaustive search enables constant indexing into a fixed-size array that can reside in registers. The combination of these two changes applied to both range decoders produced a ~20% speedup of this kernel.
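A minimal sketch of the lookup change, with an invented table size and contents (the real range-decoder tables differ). A binary search uses data-dependent indices, which on a GPU forces the table into shared or local memory; a fixed-trip-count exhaustive scan uses only constant indices, so once the loop is unrolled the table can live entirely in registers:

```python
TABLE_SIZE = 16  # assumed fixed size of the decoder lookup table

def binary_search(table, value):
    """Largest index i with table[i] <= value, via data-dependent indexing."""
    lo, hi = 0, TABLE_SIZE - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if table[mid] <= value:
            lo = mid
        else:
            hi = mid - 1
    return lo

def exhaustive_search(table, value):
    """Same result with a fixed trip count and constant indices; a CUDA
    compiler can fully unroll this loop and keep `table` in registers."""
    idx = 0
    for i in range(TABLE_SIZE):
        if table[i] <= value:
            idx = i
    return idx

# Illustrative sorted thresholds; both searches agree on every input.
table = [i * 8 for i in range(TABLE_SIZE)]
assert all(binary_search(table, v) == exhaustive_search(table, v)
           for v in range(128))
```

The exhaustive search does more comparisons in total, but on a GPU the removal of shared/local memory traffic outweighs the extra arithmetic for small constant-size tables.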

Figure 7 shows the Nsight Compute memory charts, which visually confirm that neither shared memory (last row) nor local memory (row 2) is used after the modification.

The trade-off is increased register usage, from 48 to 92 registers per thread. This is acceptable given the per-thread limit of 255 registers and the relatively small grid dimensions of this kernel. Since high block residency per SM is not a priority at this stage, the extra register pressure does not limit overall throughput.
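The register arithmetic behind that judgment can be sketched as follows, assuming 65,536 32-bit registers per SM and a hardware limit of 2,048 resident threads per SM, as on recent NVIDIA data center GPUs (register allocation granularity is ignored for simplicity):

```python
REGS_PER_SM = 65_536        # assumed register file size per SM
HW_THREAD_LIMIT = 2_048     # assumed max resident threads per SM

def max_resident_threads(regs_per_thread):
    """Upper bound on resident threads per SM given per-thread register use."""
    return min(REGS_PER_SM // regs_per_thread, HW_THREAD_LIMIT)

print(max_resident_threads(48))  # 1365 threads per SM before the change
print(max_resident_threads(92))  # 712 threads per SM after the change
```

Residency roughly halves, but with small grids the kernel was not occupancy-bound to begin with, so throughput is unaffected.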

Another optimization was to replace a custom selection routine with a cub::DeviceSelect function call. This simplifies the code and offloads the maintenance and optimization concerns for current and upcoming hardware to CUB.
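For reference, cub::DeviceSelect performs stream compaction: it keeps the elements that satisfy a predicate (or a flag array) and reports how many were kept, in parallel on the GPU. A host-side sketch of the same operation, with illustrative data and predicate:

```python
def select_if(items, predicate):
    """Host-side equivalent of stream compaction: keep elements that
    satisfy `predicate`, preserving order, and return the kept count."""
    selected = [x for x in items if predicate(x)]
    return selected, len(selected)

out, num_selected = select_if([3, 0, 7, 0, 2], lambda x: x != 0)
print(out, num_selected)  # [3, 7, 2] 3
```

Delegating this pattern to CUB means the decoder automatically benefits from CUB's tuning for each GPU generation instead of maintaining a hand-rolled kernel.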

Performance scaling and updated results

Figure 8 compares per-image decode time across batch sizes for the previous and improved implementations, evaluated at four LoQs (LoQ-0 ~4K, LoQ-1 ~2K, LoQ-2 ~1K, LoQ-3 ~0.5K) using the UHD-IQA dataset (available through V-Nova on Hugging Face).

Two distinct scaling behaviors emerge:

  • The previous implementation plateaus beyond small batch sizes (roughly 1–16); additional images don’t translate into further per-image gains. In contrast, the optimized CUDA implementation continues to improve as batch size increases. For example, LoQ-0 (~4K) decode time drops below 1 ms per image at large batch sizes.
  • The relative improvement grows at lower LoQs. Smaller per-image workloads expose more independent work that can be aggregated, leading to higher GPU utilization. At higher batch sizes, LoQ-2 decoding reaches ~0.2 ms per image and LoQ-3 ~0.14 ms.

Measured improvements include:

  • ~36% lower per-image decode time at batch size 1 (LoQ-0)
  • ~70–80% lower per-image decode time at batch sizes 16–32 for LoQ-2 and LoQ-3
  • Up to ~85% lower per-image decode time at batch size 256

Figure 9 shows the performance of the redesigned implementation across batch sizes on NVIDIA H100 (Hopper) and NVIDIA B200 (Blackwell) GPUs. The results indicate that the performance gains are not silicon-specific but stem from the improved batch mode, which exposes sufficient parallel work to saturate modern GPU architectures.

VC-6 for vision AI pipelines

Intelligent, tailored-to-fit decoding that leverages VC-6 random-access intra-only coding, LoQ decoding, and selective region-of-interest or color channel access can benefit training, inference, and video summarization workflows. This is an avenue for future work.

Get started with VC-6 decoding

Scaling VC-6 decoding requires more than kernel tuning. Nsight profiling reveals structural limits in launch cadence, occupancy, thread divergence, and memory behavior. By redesigning the CUDA execution model to expose more independent work and amortize overhead across batches, the new implementation achieves up to ~85% lower per-image decode time, reaching submillisecond decode for LoQ-0 (~4K) in batch and ~0.2 ms for lower LoQs, with identical output quality.

As vision AI workloads continue to scale, overall pipeline efficiency is determined at every step, including both the decode and preprocessing stages.

To get started, check out these resources:

  • VC-6 samples
    • Examples for VC-6 encoding and selective decoding
    • Benchmark suite to reproduce our results with Hugging Face datasets
  • VC-6 AI Blueprint
    • Demo showcasing VC-6 selective decoding in vision AI pipelines
    • Reference integration patterns for multiple use cases


