Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library



Deploying large language models (LLMs) requires large-scale distributed inference, which spreads model computation and request handling across many GPUs and nodes to scale to more users while reducing latency. Distributed inference frameworks use techniques such as disaggregated serving, KV cache loading, and wide expert parallelism.

In disaggregated serving environments, the prefill and decode phases run on separate GPUs, requiring efficient KV cache transfers between them. Low-latency, high-throughput communication to move these KV caches is critical to realizing the benefits of disaggregated serving.

In KV cache loading, storage is used to handle the growing KV caches in multiturn and agentic AI workloads such as coding assistants and reasoning. For long-context KV, previous results can be loaded from local SSDs and remote storage instead of being recomputed as prefill. This is one example of why storage is becoming a core part of inference workloads.

In wide expert parallelism, experts are split across many GPUs, and the intermediate results (activations) need to be dispatched to and combined from these experts. Due to the requirement for ultra-low-latency communication of intermediate activations between stages, these transfers are typically initiated by the GPU through optimized kernels, known as device-side APIs for networking, or device API for short.

Another unique feature of inference workloads is their need for dynamicity and resiliency. Services can run 24 hours a day, seven days a week, while the number of GPUs used can change based on user demand. There can also be more fine-grained dynamicity: the ratio of GPUs doing prefill and decode might change or, in the case of elastic expert parallelism, the number of replicated experts or even total experts can change.

In the event of failures, the system must be resilient, running at lower throughput for a transient period until the recovery mechanism handles the failure. This requirement extends the system's dynamicity needs: failures must be detected and the transitional state managed until recovery completes.

Finally, while there is a need for heterogeneous hardware support in terms of memory and storage, there can be heterogeneity in compute hardware as well. Handling each of these unique hardware components can become cumbersome. This calls for a library that can unify different communication and storage technologies, ensuring that frameworks can efficiently move data across various memory and storage hierarchies: GPU memory, CPU memory, and many tiers of local and distributed storage from NVMe to cloud object stores.

NVIDIA Inference Transfer Library (NIXL) is an open source, vendor-agnostic data movement library designed to support these dynamic, complex AI inference frameworks by offering a unified and powerful abstraction to move data across many memory and storage technologies.

This post explains NIXL core concepts, including agents, memory registration, metadata exchange, descriptors, transfer creation and management, and backend plugins. It also explains the usage flow of the library, highlights available performance tools, and provides a few examples to help you get started.

What is NIXL?

NIXL is an open source library for accelerating point-to-point data transfers in AI inference frameworks. NIXL provides a single, easy-to-use API that can address a variety of data transfer challenges within these frameworks while maintaining maximum performance.

This API supports multiple technologies such as RDMA, GPU-initiated networking, GPUDirect Storage, block and file storage, and advanced cloud storage options including S3 over RDMA and Azure Blob Storage. It is vendor-agnostic and can run across diverse environments. For example, it supports Amazon Web Services (AWS) with EFA networking and Trainium or Inferentia accelerators, as well as Azure with RDMA networking. The team is working with Google Cloud to add both RDMA and GPUDirect-TCPXO networking. NIXL is already a key component of many AI inference frameworks such as NVIDIA Dynamo, NVIDIA TensorRT-LLM, vLLM, SGLang, Anyscale Ray, LMCache, and more.

This figure illustrates the three challenges of Distributed Inference (Heterogeneous Resources, Dynamic Workload, Massive Scale) and the corresponding three requirements to address them (Resource Disaggregation, Fine-grained Resource Allocation, Distributed Computation). The accompanying text explains that NIXL has a unified API for the heterogeneous resources and is designed to meet these requirements.
Figure 1. NIXL addresses three core challenges in distributed AI inference: heterogeneous resources, dynamic workloads, and massive scale

Core use cases of NIXL include:  

  • Disaggregation: Moves KV blocks between prefill and decode workers with high throughput and low latency
  • Long context KV cache storage: Stores KV cache data in a long-term storage medium to avoid recomputation later
  • Weight transfer: Ships model weights to GPU nodes for fast startup or resharding. The weights might come from GPU memory, host memory, or storage
  • Reinforcement learning: Streams updated weights from learners to actors with minimal transfer overhead
  • Elastic expert parallelism: Dispatch and combine stages in expert parallelism can go through NIXL, with support for dynamic reconfigurations

The unified NIXL API covers different types of memory and storage, while its pluggable backend design allows this API to target many different high-performance technologies (RDMA, GPU-initiated networking, GPUDirect Storage, NVMe, object stores, and so on). NIXL is designed to have a fully non-blocking API and incur minimal overhead. This enables efficient overlapping of communication and computation with high-performance zero-copy transfers.

NIXL's dynamic metadata exchange enables it to scale a network of NIXL agents up and down at runtime. This feature makes it practical for dynamic, long-running services where compute nodes are regularly added based on user load, removed due to failures, or recycled for different purposes.

These features enable NIXL to abstract away various memory and storage types for the user of the library, while supporting a wide selection of high-performance transfer backends. Moreover, dynamicity and resiliency are baked into the NIXL design, targeting inference applications that run 24 hours a day, seven days a week.

NIXL design

NIXL functions as a standalone library, providing the needed abstraction for various network and storage backends. The design assumes a conductor process that determines when transfers are required, and a NIXL transfer agent that handles the transfers. All of this is done in an object-oriented manner. The transfer terminology is based on writing to or reading from a remote agent (or within the local agent). These write and read operations are also known as put and get.

This terminology enables a unified API that supports both efficient one-sided network communications and storage transfers. The user describes any memory or storage through a list of descriptors, which has an encompassing type to indicate whether the data resides in host memory, GPU memory, or some kind of storage. Each descriptor within a descriptor list points to a location in memory or storage: for example, a base address and a size in host memory, GPU memory, or an SSD, or similarly a location within a file or storage object. Note that each set of transfer descriptors must be of the same memory type, but a transfer can cross memory types, for example, sending from GPU memory to host memory.
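As a rough sketch of this descriptor concept (the class and field names below are invented for illustration and are not the real NIXL types), a descriptor list pairs one encompassing memory type with a set of regions:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Descriptor:
    addr: int        # base address, or an offset within a file/object
    length: int      # size of the region in bytes
    dev_id: int = 0  # device index, file descriptor, or similar locator

@dataclass
class DescriptorList:
    # Encompassing type shared by every descriptor in this list,
    # e.g. "DRAM" (host), "VRAM" (GPU), "FILE", "OBJ"
    mem_type: str
    descs: List[Descriptor]

# A transfer from GPU memory to host memory: each list has a single
# memory type, but the two sides of the transfer may differ.
src = DescriptorList("VRAM", [Descriptor(0x7F000000, 4096)])
dst = DescriptorList("DRAM", [Descriptor(0x10000000, 4096)])

assert src.mem_type != dst.mem_type  # cross-memory-type transfer is allowed
assert sum(d.length for d in src.descs) == sum(d.length for d in dst.descs)
```

The real library exposes typed descriptor lists through its C++, C, Python, and Rust bindings; this model only captures the shape of the concept.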

The conductor gives the NIXL agent access to the desired allocated memories through a registration call. When using one-sided read or write operations, keys or identifiers are generated so that only other processes with the correct key can access that memory. NIXL encapsulates this registration information, along with the required connection info, into a metadata object. Inside the NIXL agent, the Memory Section and Metadata Handler components are in charge of managing the needed local and remote information, respectively.

The conductor process is also in charge of dynamically exchanging the relevant metadata objects to decide which agents can communicate with one another at any point in time. The conductor process can directly obtain the metadata object from one agent and load it into another agent. For device API usage by GPU kernels, there is an additional preparation step to send the relevant local and remote metadata to the GPU.

This metadata exchange is only needed for remote agent transfers, not for local memory or storage transfers. For remote storage, NIXL talks to the local client of the distributed storage system, making it a loopback transfer within the agent. NIXL also provides optional convenience methods to exchange this metadata through a direct socket connection or a central metadata service such as etcd.
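A minimal model of this exchange, using a local socket pair in place of a real network connection and an invented dictionary in place of NIXL's opaque metadata blob, might look like:

```python
import json
import socket

# Hypothetical stand-in for an agent's metadata: registration keys plus
# connection info (real NIXL metadata is an opaque serialized object).
target_metadata = {
    "agent": "target-0",
    "conn_info": {"addr": "10.0.0.2", "port": 18515},
    "reg_keys": ["rkey-abcd"],
}

# One optional exchange path is a direct socket connection; a socketpair
# stands in for it here so the sketch stays self-contained.
a, b = socket.socketpair()
a.sendall(json.dumps(target_metadata).encode())
a.shutdown(socket.SHUT_WR)

remote_agents = {}                 # the initiator's remote-agent table
loaded = json.loads(b.recv(65536))
remote_agents[loaded["agent"]] = loaded  # "load remote metadata" step

assert "target-0" in remote_agents

# Removing the entry later models an agent leaving the network,
# which is what enables dynamic scale-down.
del remote_agents["target-0"]
```

With etcd, the same pattern becomes a watch on a shared key space instead of a point-to-point socket.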

Now the conductor process can ask the NIXL agent to prepare a transfer request. NIXL first checks whether the required information is available for this transfer. If it is, the conductor process can ask the NIXL agent to start the transfer. It can also monitor the transfer status until completion, in a nonblocking manner. Device API mode operates in a similar manner, from the GPU kernel.

The NIXL agent will internally find the optimal backend for carrying out the transfer request and deliver the prepared request to that backend (unless the user specifies a desired backend). This allows NIXL to achieve high performance while remaining hardware agnostic. Figure 2 shows the current list of supported backends, which is expanding with the rapid adoption of NIXL.
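The backend choice can be pictured as a lookup keyed by the source and destination memory types, with a user-specified backend taking precedence; the table below is a toy simplification, not NIXL's actual selection logic:

```python
# Toy backend-selection table keyed by (source, destination) memory types.
# The backend names match plugins mentioned in this post; the selection
# logic itself is an invented simplification of what the agent does.
BACKENDS = {
    ("VRAM", "VRAM"): "UCX",    # RDMA between GPUs
    ("VRAM", "DRAM"): "UCX",
    ("VRAM", "FILE"): "GDS",    # GPUDirect Storage path
    ("DRAM", "FILE"): "POSIX",
    ("DRAM", "OBJ"):  "OBJ",    # e.g. S3 object store
}

def pick_backend(src_type, dst_type, preferred=None):
    """Return the user's preferred backend if given, else look one up."""
    if preferred is not None:
        return preferred
    return BACKENDS[(src_type, dst_type)]

assert pick_backend("VRAM", "FILE") == "GDS"
assert pick_backend("VRAM", "FILE", preferred="POSIX") == "POSIX"
```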

A block diagram illustrating NIXL architecture. In the center is a large block labeled "Transfer Agent" that users of NIXL interact with through the API shown on top. Internally, it manages local registered memory information through Memory Section, and information about other NIXL agents through Metadata Handler. At the bottom, a subset of the existing transfer backend plugins are shown, both for networking and storage, which the agent uses to carry out the requested transfer.
Figure 2. NIXL architecture consists of a core transfer agent with a Memory Section and Metadata Handler, and supports multiple transfer backend plugins through an API 

Example NIXL use case

The following use case explores how applications or conductor processes can use the NIXL API to perform an asynchronous point-to-point data transfer using a high-performance networking library.

When transferring between two agents, one agent plays the role of the initiator, which creates and starts the read or write operation. The other agent plays the role of the target, whose memory is being accessed.

These roles are defined per transfer during the application run, based on who invokes the operation. The initiator agent checks the status of the transfer locally and typically sends a notification to the target agent to indicate when the transfer is complete.

Setting up the agents

Setting up the initiator and target agents involves the following steps:

Step 1: Agent creation 

At startup, each application spawns a runtime agent configured with relevant initialization parameters. The agent initializes the specified transfer backends, or uses UCX as the default if none are provided. UCX is a community-driven networking library and is widely tested internally. The user also gives a name to the agent, which can be any string, such as a UUID.

Step 2: Memory registration 

Users allocate memory on their chosen devices (GPU, CPU, storage) and register these regions with the agent through NIXL descriptors. NIXL will internally pass that information to each relevant backend that supports that memory type.

Optimization tip: Most backend registrations must go through a kernel call, which can be time consuming. It is recommended to reduce the number of registrations by registering larger blocks of memory, as transfers can be created anywhere within the registered memory.
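The effect of this tip can be illustrated with a mock agent that counts registration calls (the class is hypothetical, purely for illustration; it is not the NIXL agent API):

```python
class MockAgent:
    """Counts registration calls to show the cost of fine-grained registration."""
    def __init__(self):
        self.kernel_calls = 0
        self.registered = []      # (base, length) regions

    def register_memory(self, base, length):
        self.kernel_calls += 1    # each registration crosses into the kernel
        self.registered.append((base, length))

# Fine-grained: one registration per 64 KiB block -> 256 kernel calls.
fine = MockAgent()
for i in range(256):
    fine.register_memory(i * 65536, 65536)

# Coarse-grained: a single 16 MiB registration covers the same range;
# transfers can still address any sub-range inside it.
coarse = MockAgent()
coarse.register_memory(0, 256 * 65536)

assert fine.kernel_calls == 256
assert coarse.kernel_calls == 1
```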

Step 3: Metadata exchange 

Target agent metadata is shared with initiator agents for planned transfers. During runtime, new metadata can be loaded, or the metadata of another agent can be removed. This is a key feature that enables dynamicity in the NIXL library.

Optimization tip: When new registrations or deregistrations occur, the updated metadata must be exchanged. If one side has dynamic registrations and deregistrations while the other side has fixed buffers to receive the data, it is recommended to make the former side the initiator agent. This removes the need for extra metadata exchanges.

Preparing and performing the data transfer

After the metadata has been shared between the two peer agents, the initiator performs the following steps:

Step 1: Create the transfer request 

The transfer request indicates the operation type, READ or WRITE, as well as the initiator and target descriptors to use. A notification can optionally be specified. NIXL will validate these descriptors, decide on the transfer backend, and deliver the descriptors to that backend if preparations are required.

Step 2: Start (or post) the transfer request 

NIXL issues the request to the appropriate backend, keeping overhead low. The backend performs the data transfer between the source and destination addresses, using the system libraries and drivers underneath to carry it out efficiently.

Step 3: Check transfer status  

To enable overlap of compute and communication, the post call is nonblocking, which requires the user to check the status of a transfer separately. Note that the transfer might complete, or might end in an error (a network failure, for example). Such a failure doesn't impact the other agents in the system, nor the transfers within the same agent that don't encounter that network failure.
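The nonblocking pattern amounts to posting the transfer and then polling its status between units of useful work. The sketch below models it with a mock transfer object and invented state names rather than the real NIXL handles:

```python
import enum

class XferState(enum.Enum):
    IN_PROGRESS = "IN_PROG"
    DONE = "DONE"
    ERROR = "ERR"

class MockTransfer:
    """Completes after a few polls; stands in for an async backend transfer."""
    def __init__(self, polls_to_finish=3):
        self.remaining = polls_to_finish

    def check_status(self):
        if self.remaining > 0:
            self.remaining -= 1
            return XferState.IN_PROGRESS
        return XferState.DONE

xfer = MockTransfer()
polls = 0
while (state := xfer.check_status()) is XferState.IN_PROGRESS:
    polls += 1
    # ...the application would do useful compute here instead of blocking...

assert state is XferState.DONE
assert polls == 3
```

A real application would also branch on the error state and trigger its recovery path without touching unrelated transfers.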

On the target side, the user can look for notifications that indicate a transfer is complete. The name of the initiator agent shows up in the notification along with the notification message, so the target agent doesn't need to know the initiator's name beforehand.

Tear down

When a NIXL agent is deleted, NIXL will automatically deregister the locally registered memories. If an active transfer is being directed toward this NIXL agent, it will simply end in an error status. If local transfers are not finished, NIXL will attempt to release them during agent destruction. However, it is recommended to release those transfer requests preemptively.

Performance benchmarking tools are valuable for inference systems. They can be used to verify that a system is working as intended, or to find the best backend for a particular enterprise system. They can also help confirm performance improvements for a specific backend.

NIXL provides a two-layer setup: a low-level benchmark called NIXLBench and an LLM-aware profiler called KVBench.

NIXLBench is intentionally model-agnostic and maintains a simple system view. It executes real data transfers, sweeps block and batch sizes, and reports bandwidth metrics with latency percentiles. NIXLBench relies on etcd to exchange transfer metadata for network backends, but not for storage backends, as no metadata exchange is needed there.

KVBench offers significant benefits for LLM engineers by accelerating benchmarking and iteration: it automatically calculates the actual KV cache I/O size and batch size for supported models and generates a ready-to-run NIXLBench command. KVBench can also profile KV cache transfers using its CTPerfTest module.
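The KV cache I/O size follows from the standard per-token formula: two tensors (key and value) per layer, per KV head, per head dimension, at the chosen precision. The model numbers below are illustrative assumptions, not output from KVBench:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 for the key and value tensors; one entry per layer,
    # per KV head, per head-dimension element, at the given precision.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Example numbers resembling an 8B-class model with grouped-query attention
# (32 layers, 8 KV heads, head_dim 128, FP16); treat them as illustrative.
per_token = kv_bytes_per_token(32, 8, 128, 2)
assert per_token == 131072                      # 128 KiB per token

# Transfer size for one 4096-token context:
assert per_token * 4096 == 512 * 1024 * 1024    # 512 MiB
```

Numbers like these explain why KV cache movement dominates disaggregated-serving traffic and why block size matters when sweeping NIXLBench configurations.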

Get started with NVIDIA Inference Transfer Library

NIXL is fully open source and available on the ai-dynamo/nixl GitHub repo. It is written in C++ for high performance, efficiency, and composability. Several bindings are available, including C, Python, and Rust.

Currently, NIXL is only supported in Linux environments such as Ubuntu and RHEL and is available prebuilt as a Python wheel. We encourage you to try NIXL in your own AI inference frameworks and workloads.

To learn more, you can explore additional examples in the NIXL example guide. As a starting point, basic_two_peers is a simple two-peer Python example showing registration, metadata exchange, a single READ operation, notification, verification, and teardown. In addition, expanded_two_peers builds on the previous example by adding parallel READs and WRITEs with various preparation methods, reposting the same transfer request, and the use of patterns in notifications.

We welcome questions, contributions, pull requests, and feedback from the community on GitHub. Stay tuned for the upcoming NIXL v1.0.0 release. To learn more about NIXL, check out these additional resources:

Acknowledgments 

The NVIDIA Inference Transfer Library product team acknowledges the valuable contributions of all open source developers, contributors, testers, and community members who have participated in its evolution.


