Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library



Deploying large language models (LLMs) requires large-scale distributed inference, which spreads model computation and request handling across many GPUs and nodes to scale to more users while reducing latency. Distributed inference frameworks use techniques such as disaggregated serving, KV cache loading, and wide expert parallelism.

In disaggregated serving environments, the prefill and decode phases run on separate GPUs, requiring efficient KV cache transfers between them. Low-latency, high-throughput communication to move these KV caches is critical to realizing the benefits of disaggregated serving.

In KV cache loading, storage is used to handle the growing KV caches in multiturn and agentic AI workloads such as coding assistants and reasoning. For long-context KV, previous results can be loaded from local SSDs and remote storage instead of being recomputed as prefill. This is one example of why storage is becoming a core part of inference workloads.

In wide expert parallelism, experts are split across many GPUs, and the intermediate results (activations) need to be dispatched to and combined from these experts. Due to the requirement for ultra-low-latency communication of intermediate activations between stages, these transfers are typically initiated by the GPU through optimized kernels, known as device-side APIs for networking, or device API for short.

Another unique feature of inference workloads is their need for dynamicity and resiliency. Services can run 24 hours a day, seven days a week, while the number of GPUs used can change based on user demand. There can also be more fine-grained dynamicity: the ratio of GPUs doing prefill and decode might change or, in the case of elastic expert parallelism, the number of replicated experts or even total experts can change.

In the event of failures, the system must be resilient, running at lower throughput for a transient period until the recovery mechanism handles the failure. This requirement extends the system's dynamicity needs: failures must be detected and the transitional state managed until recovery completes.

Finally, while there is a need for heterogeneous hardware support in terms of memory and storage, there can be heterogeneity in compute hardware as well. Handling each of these unique hardware components can become cumbersome. This calls for a library that can unify different communication and storage technologies, ensuring that frameworks can efficiently move data across various memory and storage hierarchies: GPU memory, CPU memory, and many tiers of local and distributed storage from NVMe to cloud object stores.

NVIDIA Inference Transfer Library (NIXL) is an open source, vendor-agnostic data movement library designed to support these dynamic, complex AI inference frameworks by offering a unified and powerful abstraction to move data across many memory and storage technologies.

This post explains NIXL core concepts, including agents, memory registration, metadata exchange, descriptors, transfer creation and management, and backend plugins. It also explains the usage flow of the library, highlights available performance tools, and provides a few examples to help you get started.

What is NIXL?

NIXL is an open source library for accelerating point-to-point data transfers in AI inference frameworks. NIXL provides a single, easy-to-use API that can address a variety of data transfer challenges within these frameworks while maintaining maximum performance.

This API supports multiple technologies such as RDMA, GPU-initiated networking, GPUDirect Storage, block and file storage, and advanced cloud storage options including S3 over RDMA and Azure Blob Storage. It is vendor-agnostic and can run across diverse environments. For example, it supports Amazon Web Services (AWS) with EFA networking and Trainium or Inferentia accelerators, as well as Azure with RDMA networking. The team is working with Google Cloud to add both RDMA and GPUDirect-TCPXO networking. NIXL is already a key component of many AI inference frameworks such as NVIDIA Dynamo, NVIDIA TensorRT-LLM, vLLM, SGLang, Anyscale Ray, LMCache, and more.

This figure illustrates the three challenges of Distributed Inference (Heterogeneous Resources, Dynamic Workload, Massive Scale) and the corresponding three requirements to address them (Resource Disaggregation, Fine-grained Resource Allocation, Distributed Computation). The accompanying text explains that NIXL has a unified API for the heterogeneous resources and is designed to meet these requirements.
Figure 1. NIXL addresses three core challenges in distributed AI inference: heterogeneous resources, dynamic workloads, and massive scale

Core use cases of NIXL include:  

  • Disaggregation: Moves KV blocks between prefill and decode workers with high throughput and low latency
  • Long context KV cache storage: Stores KV cache data in a long-term storage medium to avoid recomputation later
  • Weight transfer: Ships model weights to GPU nodes for fast startup or resharding. The weights might come from GPU memory, host memory, or storage
  • Reinforcement learning: Streams updated weights from learners to actors with minimal transfer overhead
  • Elastic expert parallelism: Dispatch and combine stages in expert parallelism can go through NIXL, with support for dynamic reconfigurations

The unified NIXL API covers different types of memory and storage, while its pluggable backend design allows this API to target many different high-performance technologies (RDMA, GPU-initiated networking, GPUDirect Storage, NVMe, object stores, and so on). NIXL is designed to have a fully non-blocking API and incur minimal overhead. This enables efficient overlapping of communication and computation with high-performance zero-copy transfers.

NIXL's dynamic metadata exchange enables it to scale a network of NIXL agents up and down at runtime. This feature makes it practical for dynamic, long-running services where compute nodes are regularly added based on user load, removed due to failures, or recycled for different purposes.

These features enable NIXL to abstract away various memory and storage types for the user of the library, while supporting a wide selection of high-performance transfer backends. Moreover, dynamicity and resiliency are baked into the NIXL design, targeting inference applications that run 24 hours a day, seven days a week.

NIXL design

NIXL functions as a standalone library, providing the needed abstraction for various network and storage backends. The design assumes a conductor process that determines when transfers are required, and a NIXL transfer agent that handles the transfers. All of this is done in an object-oriented manner. The transfer terminology is based on writing to or reading from a remote agent (or within the local agent). These write and read operations are also known as put and get.

This terminology enables a unified API that supports both efficient one-sided network communications and storage transfers. The user describes any memory or storage through a list of descriptors, which has an encompassing type to indicate whether the data resides in host memory, GPU memory, or some kind of storage. Each descriptor within a descriptor list points to a location in memory or storage: for example, a base address and a size in host memory, GPU memory, or an SSD, or similarly a location within a file or storage object. Note that each set of transfer descriptors must be of the same memory type, but a transfer can cross memory types, for example, sending from GPU memory to host memory.
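As a rough sketch of this descriptor concept (the class and field names below are invented for illustration and are not the real NIXL types), a descriptor list pairs one encompassing memory type with a set of regions:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Descriptor:
    addr: int        # base address, or an offset within a file/object
    length: int      # size of the region in bytes
    dev_id: int = 0  # device index, file descriptor, or similar locator

@dataclass
class DescriptorList:
    # Encompassing type shared by every descriptor in this list,
    # e.g. "DRAM" (host), "VRAM" (GPU), "FILE", "OBJ"
    mem_type: str
    descs: List[Descriptor]

# A transfer from GPU memory to host memory: each list has a single
# memory type, but the two sides of the transfer may differ.
src = DescriptorList("VRAM", [Descriptor(0x7F000000, 4096)])
dst = DescriptorList("DRAM", [Descriptor(0x10000000, 4096)])

assert src.mem_type != dst.mem_type  # cross-memory-type transfer is allowed
assert sum(d.length for d in src.descs) == sum(d.length for d in dst.descs)
```

The real library exposes typed descriptor lists through its C++, C, Python, and Rust bindings; this model only captures the shape of the concept.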

The conductor gives the NIXL agent access to the desired allocated memories through a registration call. When using one-sided read or write operations, keys or identifiers are generated so that only other processes with the correct key can access that memory. NIXL encapsulates this registration information, along with the required connection info, into a metadata object. Inside the NIXL agent, the Memory Section and Metadata Handler components are in charge of managing the needed local and remote information, respectively.

The conductor process is also in charge of dynamically exchanging the relevant metadata objects to decide which agents can communicate with one another at any point in time. The conductor process can directly obtain the metadata object from one agent and load it into another agent. For device API usage by GPU kernels, there is an additional preparation step to send the relevant local and remote metadata to the GPU.

This metadata exchange is only needed for remote agent transfers, not for local memory or storage transfers. For remote storage, NIXL talks to the local client of the distributed storage system, making it a loopback transfer within the agent. NIXL also provides optional convenience methods to exchange this metadata through a direct socket connection or a central metadata service such as etcd.
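A minimal model of this exchange, using a local socket pair in place of a real network connection and an invented dictionary in place of NIXL's opaque metadata blob, might look like:

```python
import json
import socket

# Hypothetical stand-in for an agent's metadata: registration keys plus
# connection info (real NIXL metadata is an opaque serialized object).
target_metadata = {
    "agent": "target-0",
    "conn_info": {"addr": "10.0.0.2", "port": 18515},
    "reg_keys": ["rkey-abcd"],
}

# One optional exchange path is a direct socket connection; a socketpair
# stands in for it here so the sketch stays self-contained.
a, b = socket.socketpair()
a.sendall(json.dumps(target_metadata).encode())
a.shutdown(socket.SHUT_WR)

remote_agents = {}                 # the initiator's remote-agent table
loaded = json.loads(b.recv(65536))
remote_agents[loaded["agent"]] = loaded  # "load remote metadata" step

assert "target-0" in remote_agents

# Removing the entry later models an agent leaving the network,
# which is what enables dynamic scale-down.
del remote_agents["target-0"]
```

With etcd, the same pattern becomes a watch on a shared key space instead of a point-to-point socket.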

Now the conductor process can ask the NIXL agent to prepare a transfer request. NIXL first checks whether the required information is available for this transfer. If it is, the conductor process can ask the NIXL agent to start the transfer. It can also monitor the transfer status until completion, in a nonblocking manner. Device API mode operates in a similar manner, from the GPU kernel.

The NIXL agent will internally find the optimal backend for carrying out the transfer request and deliver the prepared request to that backend (unless the user specifies a desired backend). This allows NIXL to achieve high performance while remaining hardware agnostic. Figure 2 shows the current list of supported backends, which is expanding with the rapid adoption of NIXL.
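The backend choice can be pictured as a lookup keyed by the source and destination memory types, with a user-specified backend taking precedence; the table below is a toy simplification, not NIXL's actual selection logic:

```python
# Toy backend-selection table keyed by (source, destination) memory types.
# The backend names match plugins mentioned in this post; the selection
# logic itself is an invented simplification of what the agent does.
BACKENDS = {
    ("VRAM", "VRAM"): "UCX",    # RDMA between GPUs
    ("VRAM", "DRAM"): "UCX",
    ("VRAM", "FILE"): "GDS",    # GPUDirect Storage path
    ("DRAM", "FILE"): "POSIX",
    ("DRAM", "OBJ"):  "OBJ",    # e.g. S3 object store
}

def pick_backend(src_type, dst_type, preferred=None):
    """Return the user's preferred backend if given, else look one up."""
    if preferred is not None:
        return preferred
    return BACKENDS[(src_type, dst_type)]

assert pick_backend("VRAM", "FILE") == "GDS"
assert pick_backend("VRAM", "FILE", preferred="POSIX") == "POSIX"
```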

A block diagram illustrating NIXL architecture. In the center is a large block labeled "Transfer Agent" that users of NIXL interact with through the API shown on top. Internally, it manages local registered memory information through Memory Section, and information about other NIXL agents through Metadata Handler. At the bottom, a subset of the existing transfer backend plugins are shown, both for networking and storage, which the agent uses to carry out the requested transfer.
Figure 2. NIXL architecture consists of a core transfer agent with a Memory Section and Metadata Handler, and supports multiple transfer backend plugins through an API 

Example NIXL use case

The following use case explores how applications or conductor processes can use the NIXL API to perform an asynchronous point-to-point data transfer using a high-performance networking library.

When transferring between two agents, one agent plays the role of the initiator, which creates and starts the read or write operation. The other agent plays the role of the target, whose memory is being accessed.

These roles are defined per transfer during the application run, based on who invokes the operation. The initiator agent checks the status of the transfer locally and typically sends a notification to the target agent to indicate when the transfer is complete.

Setting up the agents

Setting up the initiator and target agents involves the following steps:

Step 1: Agent creation 

At startup, each application spawns a runtime agent configured with relevant initialization parameters. The agent initializes the specified transfer backends, or uses UCX as the default if none are provided. UCX is a community-driven networking library and is widely tested internally. The user also gives a name to the agent, which can be any string, such as a UUID.

Step 2: Memory registration 

Users allocate memory on their chosen devices (GPU, CPU, storage) and register these regions with the agent through NIXL descriptors. NIXL will internally pass that information to each relevant backend that supports that memory type.

Optimization tip: Most backend registrations must go through a kernel call, which can be time consuming. It is recommended to reduce the number of registrations by registering larger blocks of memory, as transfers can be created anywhere within the registered memory.
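The effect of this tip can be illustrated with a mock agent that counts registration calls (the class is hypothetical, purely for illustration; it is not the NIXL agent API):

```python
class MockAgent:
    """Counts registration calls to show the cost of fine-grained registration."""
    def __init__(self):
        self.kernel_calls = 0
        self.registered = []      # (base, length) regions

    def register_memory(self, base, length):
        self.kernel_calls += 1    # each registration crosses into the kernel
        self.registered.append((base, length))

# Fine-grained: one registration per 64 KiB block -> 256 kernel calls.
fine = MockAgent()
for i in range(256):
    fine.register_memory(i * 65536, 65536)

# Coarse-grained: a single 16 MiB registration covers the same range;
# transfers can still address any sub-range inside it.
coarse = MockAgent()
coarse.register_memory(0, 256 * 65536)

assert fine.kernel_calls == 256
assert coarse.kernel_calls == 1
```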

Step 3: Metadata exchange 

Target agent metadata is shared with initiator agents for planned transfers. During runtime, new metadata can be loaded, or the metadata of another agent can be removed. This is a key feature that enables dynamicity in the NIXL library.

Optimization tip: When new registrations or deregistrations occur, the updated metadata must be exchanged. If one side has dynamic registrations and deregistrations while the other side has fixed buffers to receive the data, it is recommended to make the former side the initiator agent. This removes the need for extra metadata exchanges.

Preparing and performing the data transfer

After the metadata has been shared between the two peer agents, the initiator performs the following steps:

Step 1: Create the transfer request 

The transfer request indicates the operation type, READ or WRITE, as well as the initiator and target descriptors to use. A notification can optionally be specified. NIXL will validate these descriptors, decide on the transfer backend, and deliver the descriptors to that backend if preparations are required.

Step 2: Start (or post) the transfer request 

NIXL issues the request to the appropriate backend, keeping overhead low. The backend performs the data transfer between the source and destination addresses, using the system libraries and drivers underneath to carry it out efficiently.

Step 3: Check transfer status  

To enable overlap of compute and communication, the post call is nonblocking, which requires the user to check the status of a transfer separately. Note that the transfer might complete, or might end in an error (a network failure, for example). Such a failure doesn't impact the other agents in the system, nor the transfers within the same agent that don't encounter that network failure.
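The nonblocking pattern amounts to posting the transfer and then polling its status between units of useful work. The sketch below models it with a mock transfer object and invented state names rather than the real NIXL handles:

```python
import enum

class XferState(enum.Enum):
    IN_PROGRESS = "IN_PROG"
    DONE = "DONE"
    ERROR = "ERR"

class MockTransfer:
    """Completes after a few polls; stands in for an async backend transfer."""
    def __init__(self, polls_to_finish=3):
        self.remaining = polls_to_finish

    def check_status(self):
        if self.remaining > 0:
            self.remaining -= 1
            return XferState.IN_PROGRESS
        return XferState.DONE

xfer = MockTransfer()
polls = 0
while (state := xfer.check_status()) is XferState.IN_PROGRESS:
    polls += 1
    # ...the application would do useful compute here instead of blocking...

assert state is XferState.DONE
assert polls == 3
```

A real application would also branch on the error state and trigger its recovery path without touching unrelated transfers.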

On the target side, the user can look for notifications that indicate a transfer is complete. The name of the initiator agent shows up in the notification along with the notification message, so the target agent doesn't need to know the initiator's name beforehand.

Tear down

When a NIXL agent is deleted, NIXL will automatically deregister the locally registered memories. If an active transfer is being directed toward this NIXL agent, it will simply end in an error status. If local transfers are not finished, NIXL will attempt to release them during agent destruction. However, it is recommended to release those transfer requests preemptively.

Performance benchmarking tools are valuable for inference systems. They can be used to verify that a system is working as intended, or to find the best backend for a particular enterprise system. They can also help confirm performance improvements for a specific backend.

NIXL provides a two-layer setup: a low-level benchmark called NIXLBench and an LLM-aware profiler called KVBench.

NIXLBench is intentionally model-agnostic and maintains a simple system view. It executes real data transfers, sweeps block and batch sizes, and reports bandwidth metrics with latency percentiles. NIXLBench relies on etcd to exchange transfer metadata for network backends, but not for storage backends, as no metadata exchange is needed there.

KVBench offers significant benefits for LLM engineers by accelerating benchmarking and iteration: it automatically calculates the actual KV cache I/O size and batch size for supported models and generates a ready-to-run NIXLBench command. KVBench can also profile KV cache transfers using its CTPerfTest module.
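The KV cache I/O size follows from the standard per-token formula: two tensors (key and value) per layer, per KV head, per head dimension, at the chosen precision. The model numbers below are illustrative assumptions, not output from KVBench:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 for the key and value tensors; one entry per layer,
    # per KV head, per head-dimension element, at the given precision.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Example numbers resembling an 8B-class model with grouped-query attention
# (32 layers, 8 KV heads, head_dim 128, FP16); treat them as illustrative.
per_token = kv_bytes_per_token(32, 8, 128, 2)
assert per_token == 131072                      # 128 KiB per token

# Transfer size for one 4096-token context:
assert per_token * 4096 == 512 * 1024 * 1024    # 512 MiB
```

Numbers like these explain why KV cache movement dominates disaggregated-serving traffic and why block size matters when sweeping NIXLBench configurations.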

Get started with NVIDIA Inference Transfer Library

NIXL is fully open source and available on the ai-dynamo/nixl GitHub repo. It is written in C++ for high performance, efficiency, and composability. Several bindings are available, including C, Python, and Rust.

Currently, NIXL is only supported in Linux environments such as Ubuntu and RHEL and is available prebuilt as a Python wheel. We encourage you to try NIXL in your own AI inference frameworks and workloads.

To learn more, you can explore additional examples in the NIXL example guide. As a starting point, basic_two_peers is a simple two-peer Python example showing registration, metadata exchange, a single READ operation, notification, verification, and teardown. In addition, expanded_two_peers builds on the previous example by adding parallel READs and WRITEs with various preparation methods, reposting the same transfer request, and the use of patterns in notifications.

We welcome questions, contributions, pull requests, and feedback from the community on GitHub. Stay tuned for the upcoming NIXL v1.0.0 release. To learn more about NIXL, check out these additional resources:

Acknowledgments 

The NVIDIA Inference Transfer Library product team acknowledges the valuable contributions of all open source developers, contributors, testers, and community members who have participated in its evolution.


