Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight



In vision AI systems, model throughput continues to improve, and the surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU scheduling. The previous post, Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6, described this as the data-to-tensor gap: a performance mismatch between AI pipeline stages.

The SMPTE VC-6 (ST 2117-1) codec addresses this gap through a hierarchical, tile-based architecture. Images are encoded as progressively refinable Levels of Quality (LoQs), each adding incremental detail. This allows selective retrieval and decoding of only the required resolution, region of interest, or color plane, with random access to independently decodable frames. Pipelines can retrieve and decode only what the model needs.
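As a toy illustration of why selective retrieval pays off, the sketch below models fetching only the bytes needed for a target LoQ. The per-LoQ payload sizes and the layout are invented for illustration and do not reflect the actual VC-6 bitstream format.

```python
def bytes_needed(incremental_sizes, target_loq):
    """Bytes to fetch to decode up to target_loq.

    incremental_sizes[k] is the extra payload LoQ-k adds on top of
    LoQ-(k+1); LoQ-3 is the coarsest level here, LoQ-0 full resolution.
    """
    return sum(size for loq, size in incremental_sizes.items() if loq >= target_loq)

# Illustrative per-LoQ payload sizes in bytes (coarse levels are cheap).
sizes = {0: 70_000, 1: 20_000, 2: 6_000, 3: 2_000}

# A pipeline that only needs the ~1K level fetches a fraction of the stream.
print(bytes_needed(sizes, 2))  # 8000 bytes instead of 98000 for full resolution
```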

However, efficient single-image execution doesn’t automatically translate to efficient scaling. As batch sizes grow, the bottleneck shifts from single-image kernel efficiency to workload orchestration, launch cadence, and GPU occupancy.

This post focuses on the architectural changes required to scale VC-6 decoding for batched inference and training workloads. Because NVIDIA Nsight Systems and NVIDIA Nsight Compute let developers identify system- and kernel-level constraints, they were used to redesign the VC-6 CUDA implementation for batch throughput. The result is up to ~85% lower per-image decode time compared with the previous implementation, with submillisecond decode for LoQ-0 (~4K) in batch and ~0.2 ms for lower LoQs, with identical output quality. This significantly improves pipeline efficiency for production vision AI workloads.

Introducing the VC-6 batch mode implementation

The new implementation is built around several architectural changes, including batch mode and kernel-level optimizations.

Batch mode: From N to a single decoder

  • Execution model redesign
    • Algorithmic changes to decode multiple images concurrently with a single decoder 
  • Improved parallelization
    • Leveraging the new work dimension (images) alongside the existing parallelization dimensions (tiles, planes) to shift initial VC-6 tile hierarchy work to the GPU.
    • Minibatch pipelining 

Kernel-level optimizations 

  • Nsight Compute driven range decoder kernel optimization
  • The optimizations led to a ~20% kernel speedup

The following sections detail these changes to the VC-6 decoder in depth. As with any CUDA optimization, the plan was to start with a system-level profiler, Nsight Systems, to identify and fix initial performance bottlenecks, and then use Nsight Compute to refine individual kernels.

Moving from N to a single decoder

The top part of Figure 1 shows the starting point, as detailed in Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6.

The middle rows show heavy CUDA API usage, each corresponding to a separate decoder instance decoding a single image. In All Streams, many small, concurrently running kernels on the GPU are shown in blue. The top row shows device utilization: light orange indicates less than full utilization, dark orange indicates full load. In this case, even with enough dispatched work, the GPU never reaches full utilization, so the profiled algorithm is not optimal.

This inefficiency is explained by the execution of many small kernels. Each kernel launch carries fixed overheads, such as scheduling and kernel resource management. In this setting, constant per-kernel overhead combined with little work per kernel results in an unfavorable ratio of overhead to useful work.

Changing this requires altering the paradigm from many small kernels to a few larger kernels.

In this case, NVIDIA Nsight motivated an execution model redesign from N decoders for N images to a single decoder that decodes a batch of N images at once. This new execution model redistributes the same fixed amount of work into fewer kernels, each with more work. The bottom part of Figure 1 shows the effect of this reimplementation: only two CUDA API timelines, a handful of large kernels, and full GPU utilization, indicated by the dark orange utilization row.
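The effect of the redesign can be approximated with a simple cost model. The overhead and work figures below are assumptions for illustration, not measurements:

```python
LAUNCH_OVERHEAD_US = 5.0   # assumed fixed cost per kernel launch
WORK_PER_IMAGE_US = 2.0    # assumed useful decode work per image

def per_image_time(batch_size, launches):
    """Per-image time when a batch is decoded using `launches` kernel launches."""
    total = launches * LAUNCH_OVERHEAD_US + batch_size * WORK_PER_IMAGE_US
    return total / batch_size

# N decoders, roughly ten small kernels per image: overhead dominates.
many_small = per_image_time(64, launches=64 * 10)
# One batched decoder, a handful of large kernels for all 64 images.
few_large = per_image_time(64, launches=4)

print(many_small)  # 52.0 us/image
print(few_large)   # 2.3125 us/image, overhead nearly amortized away
```

The fixed work is unchanged; only its distribution across launches changes, which is exactly what the bottom of Figure 1 shows.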

Shifting more work to the GPU

In the initial implementation, decoding the root and narrow levels of the VC-6 tile hierarchies was performed on the CPU. For single-image decoding, the amount of work in these narrower stages was too small to justify GPU execution. In the batched design, although the work per image remains small, aggregating multiple images provides sufficient parallelism to utilize the GPU efficiently.

Moreover, the algorithm was modified to eliminate host-side logic for handling variable image dimensions. With that logic embedded in GPU kernels, NVIDIA Nsight showed a reduction in both synchronization points and submission latency, along with increased pipeline fluidity.

Figures 2 and 3 show the utilization and CPU overhead overview of decoding images at LoQ-0 and LoQ-2, indicating more severe inefficiencies for LoQ-2.

However, with batch mode VC-6 (bottom of Figures 2 and 3), GPU execution of even the smallest LoQs becomes viable because the aggregated workload of several images can be computed efficiently on the GPU.

Minibatch pipelining

The new decoder design splits each batch into minibatches, which flow through a pipeline of CPU processing, PCIe transfer, and GPU decoding stages. The images of a minibatch occupy one pipeline stage at a time, while the stages operate concurrently and hide one another’s costs.

Figure 4 illustrates this minibatch pipelining. Similar to Figure 1, the CUDA API calls are dispatched from two threads, UPLOAD and GPU, with minimal host-side resource usage. Work aggregation has clearly reduced CUDA API calls, memory operations, and synchronizations, while amortizing kernel launch overhead across the batch.
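The benefit of this overlap can be sanity-checked with a back-of-the-envelope timing model; the stage durations below are illustrative assumptions, not profiled values:

```python
def sequential_time(n_minibatches, cpu, pcie, gpu):
    """Total time when each minibatch runs CPU, PCIe, and GPU stages back to back."""
    return n_minibatches * (cpu + pcie + gpu)

def pipelined_time(n_minibatches, cpu, pcie, gpu):
    """Fill the pipeline once, then emit one minibatch per slowest-stage tick."""
    return (cpu + pcie + gpu) + (n_minibatches - 1) * max(cpu, pcie, gpu)

# Illustrative milliseconds per minibatch for each stage.
cpu, pcie, gpu = 1.0, 0.5, 1.25

print(sequential_time(8, cpu, pcie, gpu))  # 22.0 ms
print(pipelined_time(8, cpu, pcie, gpu))   # 11.5 ms, bounded by the GPU stage
```

In steady state the pipeline's cost per minibatch approaches that of the slowest stage alone, which is why the faster stages effectively become free.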

Kernel-level optimizations

Nsight Systems revealed that the initial optimizations alleviated the CPU overhead, so further improvements required kernel optimization. The terminal_decode kernel, which implements a range decoder, is noteworthy. Nsight Compute highlighted previously noncritical microarchitectural constraints: typical low-level inefficiencies such as low SM occupancy, warp divergence, noncoalesced memory accesses, and register pressure. These insights are essential for developers to eliminate or minimize such issues where possible.

The Nsight Compute source heatmap and Warp Stall Sampling (All Samples) show measured time spent per individual source line. They reveal that significant time is spent on integer divisions in the range update logic (Figure 5). Since GPUs are not optimized for integer division and accuracy is non-negotiable, these operations cannot be optimized away.

For the decoder table lookup, implemented as binary search on shared memory, Nsight Compute also revealed significant short scoreboard stalls (Figure 6).

These stalls point to shared memory loads (LDS in Figure 6); dynamic indexing into a local array would otherwise lead to slow local memory access. Since the lookup tables are constant in size, it is possible to replace this approach with a local variable and an unrolled loop. Compared with the binary search, this exhaustive search enables constant indexing into a fixed-size array that can reside in registers. The combination of these two changes applied to both range decoders produced a ~20% speedup of this kernel.
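A minimal sketch of the lookup change, with an invented table size and contents (the real range-decoder tables differ). A binary search uses data-dependent indices, which on a GPU forces the table into shared or local memory; a fixed-trip-count exhaustive scan uses only constant indices, so once the loop is unrolled the table can live entirely in registers:

```python
TABLE_SIZE = 16  # assumed fixed size of the decoder lookup table

def binary_search(table, value):
    """Largest index i with table[i] <= value, via data-dependent indexing."""
    lo, hi = 0, TABLE_SIZE - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if table[mid] <= value:
            lo = mid
        else:
            hi = mid - 1
    return lo

def exhaustive_search(table, value):
    """Same result with a fixed trip count and constant indices; a CUDA
    compiler can fully unroll this loop and keep `table` in registers."""
    idx = 0
    for i in range(TABLE_SIZE):
        if table[i] <= value:
            idx = i
    return idx

# Illustrative sorted thresholds; both searches agree on every input.
table = [i * 8 for i in range(TABLE_SIZE)]
assert all(binary_search(table, v) == exhaustive_search(table, v)
           for v in range(128))
```

The exhaustive search does more comparisons in total, but on a GPU the removal of shared/local memory traffic outweighs the extra arithmetic for small constant-size tables.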

Figure 7 shows the Nsight Compute memory charts, which visually confirm that neither shared memory (last row) nor local memory (row 2) is used after the modification.

The trade-off is increased register usage, from 48 to 92 registers per thread. This is acceptable given the per-thread limit of 255 registers and the relatively small grid dimensions of this kernel. Since high block residency per SM is not a priority at this stage, the extra register pressure does not limit overall throughput.
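The register arithmetic behind that judgment can be sketched as follows, assuming 65,536 32-bit registers per SM and a hardware limit of 2,048 resident threads per SM, as on recent NVIDIA data center GPUs (register allocation granularity is ignored for simplicity):

```python
REGS_PER_SM = 65_536        # assumed register file size per SM
HW_THREAD_LIMIT = 2_048     # assumed max resident threads per SM

def max_resident_threads(regs_per_thread):
    """Upper bound on resident threads per SM given per-thread register use."""
    return min(REGS_PER_SM // regs_per_thread, HW_THREAD_LIMIT)

print(max_resident_threads(48))  # 1365 threads per SM before the change
print(max_resident_threads(92))  # 712 threads per SM after the change
```

Residency roughly halves, but with small grids the kernel was not occupancy-bound to begin with, so throughput is unaffected.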

Another optimization was to replace a custom selection routine with a cub::DeviceSelect function call. This simplifies the code and offloads the maintenance and optimization concerns for current and upcoming hardware to CUB.
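For reference, cub::DeviceSelect performs stream compaction: it keeps the elements that satisfy a predicate (or a flag array) and reports how many were kept, in parallel on the GPU. A host-side sketch of the same operation, with illustrative data and predicate:

```python
def select_if(items, predicate):
    """Host-side equivalent of stream compaction: keep elements that
    satisfy `predicate`, preserving order, and return the kept count."""
    selected = [x for x in items if predicate(x)]
    return selected, len(selected)

out, num_selected = select_if([3, 0, 7, 0, 2], lambda x: x != 0)
print(out, num_selected)  # [3, 7, 2] 3
```

Delegating this pattern to CUB means the decoder automatically benefits from CUB's tuning for each GPU generation instead of maintaining a hand-rolled kernel.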

Performance scaling and updated results

Figure 8 compares per-image decode time across batch sizes for the previous and improved implementations, evaluated at four LoQs (LoQ-0 ~4K, LoQ-1 ~2K, LoQ-2 ~1K, LoQ-3 ~0.5K) using the UHD-IQA dataset (available through V-Nova on Hugging Face).

Two distinct scaling behaviors emerge:

  • The previous implementation plateaus beyond small batch sizes (roughly 1–16); additional images don’t translate into further per-image gains. In contrast, the optimized CUDA implementation continues to improve as batch size increases. For example, LoQ-0 (~4K) decode time drops below 1 ms per image at large batch sizes.
  • The relative improvement grows at lower LoQs. Smaller per-image workloads expose more independent work that can be aggregated, leading to higher GPU utilization. At higher batch sizes, LoQ-2 decoding reaches ~0.2 ms per image and LoQ-3 ~0.14 ms.

Measured improvements include:

  • ~36% lower per-image decode time at batch size 1 (LoQ-0)
  • ~70–80% lower per-image decode time at batch sizes 16–32 for LoQ-2 and LoQ-3
  • Up to ~85% lower per-image decode time at batch size 256

Figure 9 shows the performance of the redesigned implementation across batch sizes on NVIDIA H100 (Hopper) and NVIDIA B200 (Blackwell) GPUs. The results indicate that the performance gains are not silicon-specific but stem from the improved batch mode, which exposes sufficient parallel work to saturate modern GPU architectures.

VC-6 for vision AI pipelines

Intelligent, tailored-to-fit decoding that leverages VC-6 random-access intra-only coding, LoQ decoding, and selective region-of-interest or color channel access can benefit training, inference, and video summarization workflows. This is an avenue for future work.

Get started with VC-6 decoding

Scaling VC-6 decoding requires more than kernel tuning. Nsight profiling reveals structural limits in launch cadence, occupancy, thread divergence, and memory behavior. By redesigning the CUDA execution model to expose more independent work and amortize overhead across batches, the new implementation achieves up to ~85% lower per-image decode time, reaching submillisecond decode for LoQ-0 (~4K) in batch and ~0.2 ms for lower LoQs, with identical output quality.

As vision AI workloads continue to scale, overall pipeline efficiency is determined at every step, including both the decode and preprocessing stages.

To get started, check out these resources:

  • VC-6 samples
    • Examples for VC-6 encoding and selective decoding
    • Benchmark suite to reproduce our results with Hugging Face datasets
  • VC-6 AI Blueprint
    • Demo showcasing VC-6 selective decoding in vision AI pipelines
    • Reference integration patterns for multiple use cases


