introduced Gaudi accelerators to Amazon’s EC2 DL1 instances, we faced a challenge that threatened the entire deployment. The performance numbers were not merely disappointing; they were disastrous. Models were seeing up to 50% performance degradation when scaling training across multiple nodes. The issue? A network topology that routed every byte of data through host memory, creating a bottleneck that undermined everything Gaudi was designed to do.
I led the engineering effort to address this issue, which ultimately resulted in the development of what we now call Peer Direct. It’s a feature that transformed the way Gaudi accelerators communicate in cloud environments, and its history holds some useful lessons about distributed AI training at scale.
The Problem with Host NICs
Gaudi was designed with the NIC (Network Interface Card) embedded directly in the silicon. Each chip has ten network interfaces that can each handle 100 Gbps and support RDMA over RoCE v2, allowing devices to access one another’s memory directly without involving the host CPU or operating system. This architecture is extremely efficient for AI training workloads, where collective operations like AllReduce need to aggregate gradients from dozens or hundreds of devices every training iteration.
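To make the AllReduce pattern concrete, here is a pure-Python sketch of a ring AllReduce, with plain lists standing in for device buffers and slice copies standing in for network transfers. It is purely illustrative, not HCCL's actual algorithm:

```python
def ring_allreduce(buffers):
    """In-memory sketch of a ring AllReduce: after the call, every
    'device' (list) holds the element-wise sum of all inputs."""
    n = len(buffers)
    size = len(buffers[0])
    chunk = size // n                 # assume size divisible by n
    seg = lambda i: slice(i * chunk, (i + 1) * chunk)

    # Phase 1, reduce-scatter: n-1 steps. At step s, device d passes the
    # partial sum of chunk (d - s) mod n to its right neighbour. Sends
    # are snapshotted first so every device acts on the same step's data.
    for s in range(n - 1):
        sends = [(d, (d - s) % n, list(buffers[d][seg((d - s) % n)]))
                 for d in range(n)]
        for d, idx, data in sends:
            dst = buffers[(d + 1) % n]
            for off, v in enumerate(data):
                dst[idx * chunk + off] += v

    # Phase 2, all-gather: n-1 steps. Each device forwards its newest
    # fully reduced chunk, (d + 1 - s) mod n, to the right neighbour.
    for s in range(n - 1):
        sends = [(d, (d + 1 - s) % n, list(buffers[d][seg((d + 1 - s) % n)]))
                 for d in range(n)]
        for d, idx, data in sends:
            buffers[(d + 1) % n][seg(idx)] = data
    return buffers
```

With n devices and an m-byte message, each device moves roughly 2(n−1)·m/n bytes in total, which is why link bandwidth between devices dominates at large message sizes.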
But cloud deployments do not always conform to ideal architectures. When Amazon evaluated Gaudi for DL1 instances, they had to use standard host NICs rather than Gaudi’s built-in networking. The reasons were pragmatic: cost savings and the logistics of working within existing data centre infrastructure rather than accommodating a new network topology. From their business perspective, leveraging established network infrastructure made perfect sense.
From a performance standpoint, it was a disaster. Instead of peer-to-peer RDMA transfers between Gaudi cards, all communication went the long way around. Data had to be copied out of Gaudi’s high-bandwidth memory into host DRAM, processed by the host CPU, sent out through the host NIC over TCP/IP, received by the remote host, and copied back into the remote Gaudi’s memory. All of the added hops introduced latency, consumed CPU cycles, and imposed bandwidth restrictions that completely ruined the scalability of distributed training.
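A back-of-the-envelope model shows why the extra hops hurt so much. The sketch below treats the staged path as store-and-forward, where each copy must finish before the next begins; the bandwidth figures are illustrative assumptions, not measured DL1 numbers:

```python
def staged_transfer_s(nbytes, hop_gbs):
    """Store-and-forward approximation: each copy/hop completes before
    the next begins, so per-hop times add up. Bandwidths in GB/s."""
    return sum(nbytes / (bw * 1e9) for bw in hop_gbs)

MSG = 256 * 1024 * 1024   # one 256 MB gradient bucket

# Host-staged path: HBM->DRAM copy, CPU+TCP send, receive into DRAM,
# DRAM->HBM copy. All figures are illustrative only.
host_path = staged_transfer_s(MSG, [25, 10, 10, 25])

# Peer path: a single RDMA hop at 100 Gbps line rate (~12.5 GB/s).
peer_path = staged_transfer_s(MSG, [12.5])
```

Even with generous copy bandwidths, the staged path comes out several times slower per message, before accounting for the CPU cycles it steals from the training process itself.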
The performance shortfall was so severe that it called into question whether the deployment would be worth it at all. This wasn’t a matter of some trivial optimisation; it was an existential threat to the entire arrangement with AWS.
Why Performance Matters This Much
It’s worth understanding why a 50% loss of performance is so disastrous in the world of training models, particularly large models such as GPT-5. It takes weeks or months to train large language models even on enormous clusters. If you are working with models that have billions or trillions of parameters, every percentage point of performance translates directly into time and money.
Consider the economics. If it takes 30 days to train a model versus 15, you’re not just waiting longer; you’re paying for double the compute time. At cloud scale, with hundreds or thousands of accelerators in continuous use, this adds up to millions of dollars. Worse, it halves your iteration speed. In a competitive AI world where firms are racing to develop better models, doubling the number of experiments within the same timeframe can be the difference between being ahead and falling behind.
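The arithmetic is simple enough to write down. The per-device hourly rate below is an illustrative placeholder, not actual DL1 pricing, which varies by region and purchase term:

```python
def training_cost_usd(days, devices, usd_per_device_hour):
    # Illustrative on-demand rate; real cloud pricing varies.
    return days * 24 * devices * usd_per_device_hour

slow = training_cost_usd(30, 256, 1.30)   # degraded scaling
fast = training_cost_usd(15, 256, 1.30)   # healthy scaling
# The 2x slowdown doubles the bill and halves experiment throughput.
```

At a 256-device scale the slowdown costs six figures per training run, and that cost recurs with every experiment.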
The environmental cost also matters. Large models require a great deal of electricity to train. Better performance means less compute time, which reduces energy consumption and carbon emissions. As pressure mounts on the AI industry to cut its carbon footprint, efficiency gains are no longer a luxury but a necessity.
The solution we designed, Peer Direct, delivered RDMA-like performance even though the physical network layout wasn’t suited to conventional RDMA. We wanted direct memory access between Gaudi devices on different systems without traversing host memory, over host NICs that weren’t designed for this in the first place.
The enabler was AWS Elastic Fabric Adapter (EFA), a high-performance network interface for HPC and AI workloads on EC2. EFA provides low-latency OS-bypass communication, typically with sub-10-microsecond latency, and exposes RDMA-like semantics through libfabric, a user-space communication library that provides a common interface across several networking technologies.
The task was to combine libfabric with Habana’s Collective Communication Library, HCCL, which handles all distributed training communication. HCCL was built around the assumption of native RDMA using Gaudi’s on-chip NICs. We needed to create a bridge enabling HCCL to leverage libfabric transparently for communication without compromising its performance guarantees and communication semantics.
The solution required several technical advances. First, we introduced a memory registration system that allowed libfabric to directly access Gaudi’s high-bandwidth memory. We used the Linux kernel DMA-BUF framework, which provides a standard mechanism for sharing buffers between device drivers. When HCCL needs to transfer data, the Gaudi driver exports a DMA-BUF file descriptor for the memory region, which libfabric can use to issue RDMA transfers directly from device memory.
Second, we added an LRU cache for memory registrations. Memory registration is expensive; it involves kernel calls and setup operations that can cause significant overhead. By caching the mapping from memory addresses to their libfabric handles, we could reuse registrations for hot memory regions, eliminating most registration overhead from actual training.
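The core idea of such a cache can be sketched in a few lines. Here `register_fn` and `deregister_fn` stand in for the expensive calls (`fi_mr_reg`/`fi_close` in libfabric); the class and its names are illustrative, not HCCL's actual code:

```python
from collections import OrderedDict

class MRCache:
    """LRU cache of memory registrations, keyed by (address, length)."""

    def __init__(self, register_fn, deregister_fn, capacity=128):
        self._reg = register_fn
        self._dereg = deregister_fn
        self._cap = capacity
        self._entries = OrderedDict()   # (addr, length) -> handle
        self.hits = self.misses = 0

    def lookup(self, addr, length):
        key = (addr, length)
        handle = self._entries.get(key)
        if handle is not None:
            self._entries.move_to_end(key)   # mark most recently used
            self.hits += 1
            return handle
        self.misses += 1
        handle = self._reg(addr, length)     # slow path: kernel round-trip
        self._entries[key] = handle
        if len(self._entries) > self._cap:
            _, evicted = self._entries.popitem(last=False)
            self._dereg(evicted)             # evict least recently used
        return handle
```

Because training loops reuse the same gradient buffers iteration after iteration, the hit rate in steady state approaches 100%, and registration cost is paid only during warm-up.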
The result was a communication pipeline that looked something like this: HCCL calls the OFI wrapper, which uses the cached libfabric handle to perform an RDMA transfer straight from source Gaudi memory to destination Gaudi memory, without either CPU being involved. The OFI wrapper was introduced to keep the codebase clean and avoid direct header inclusions: it’s a lightweight library that links dynamically against HCCL and enables the use of libfabric without requiring direct integration. After the transfer completes, libfabric reports through a completion queue, and HCCL continues computation with the newly received data.
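The shape of that wrapper layer can be sketched as follows. The `fabric` object and its method names are hypothetical stand-ins for the real libfabric calls (`fi_mr_reg`, `fi_writemsg`, `fi_cq_read`), not HCCL's actual interface:

```python
from collections import deque

class OfiWrapper:
    """Thin indirection layer: the collective library calls rdma_write,
    the wrapper resolves a (cached) registration and posts the transfer;
    completions are reaped later from a queue."""

    def __init__(self, fabric):
        self._fabric = fabric
        self._pending = deque()

    def rdma_write(self, src_addr, length, dst_rank):
        mr = self._fabric.lookup_mr(src_addr, length)    # fast path: cache hit
        ticket = self._fabric.post_write(mr, dst_rank)   # device-to-device DMA
        self._pending.append(ticket)
        return ticket

    def wait_all(self):
        # Real code would poll the completion queue until every posted
        # ticket has completed; a fake fabric completes synchronously.
        done = list(self._pending)
        self._pending.clear()
        return done
```

Keeping the wrapper this thin means HCCL's collective logic never sees libfabric types directly, which is what made the integration testable in isolation.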
The Development Experience
Building Peer Direct meant venturing into new territory on a tight schedule. Libfabric wasn’t yet mainstream in the AI accelerator world. There wasn’t much public documentation available, and community discussion was meagre. Much of the work came down to diving into the libfabric source code and reverse-engineering behaviour through experimentation.
Communication with the AWS engineers was essential but time-zone constrained. Working with a team twelve hours ahead meant that debug iterations had 24-hour turnarounds. Every issue needed careful documentation and precise communication, as real-time collaboration was impossible.
The stakes were high: the entire DL1 deployment was riding on this functionality working. Delays would have jeopardised a major product launch. No one on our team had deep knowledge of libfabric internals, so we were learning a complex codebase while designing a critical integration at the same time.
The Results
When we finally deployed Peer Direct, the speed improvements were worth all the effort. We saw a 1.5 to 2x throughput increase for collective operations at a 32MB message size, and the gains held at larger messages, with up to 1.76x higher throughput at a 256MB message size. The bottleneck created by CPU overhead disappeared completely.
Most importantly, these microbenchmark improvements translated directly into real model training performance. Training Habana’s DeepSpeed BERT model with 5 billion parameters across 128 Gaudi devices, we saw substantial throughput gains. Models using more aggressive memory optimisation methods such as ZeRO-2, which depend more heavily on collective operations, benefited disproportionately from Peer Direct.
Peer Direct was one of the primary enablers of Gaudi performance on AWS DL1 instances, allowing high-scale distributed training to run smoothly on launch day. Beyond this initial impact, the effort laid the groundwork for future high-performance communication features and proved that cloud-native AI accelerators could remain competitive despite the constraints of cloud infrastructure.
The experience reminded me of an important lesson in systems engineering: often the most significant performance improvements come not from optimising the fast path, but from eliminating unnecessary detours altogether. In distributed AI training, having data travel straight between accelerators, with no unnecessary copies and no CPU intervention, is the difference between a system that works and one that scales.
One key takeaway from this project is that assumptions about network topology should be tested at the earliest point in a distributed training effort. Many accelerator stacks were built for an idealised environment and don’t account for the extra hops, translation layers, and cost-driven compromises that exist in cloud environments. Therefore, before focusing on model-level or kernel-level optimisation, engineers should run simple collective microbenchmarks across the target topology. If scaling efficiency drops sharply with increasing node counts or message sizes, the likely culprit is the data path, not the kernels. By identifying a host-memory detour early, engineers can focus their efforts where they will have the greatest impact.
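A first-pass check of this kind needs nothing more than AllReduce wall-clock times at two scales. The sketch below uses the common bus-bandwidth convention for ring AllReduce; the timings in the example are fabricated for illustration:

```python
def bus_bandwidth_gbps(msg_bytes, time_s, n_devices):
    # Bus bandwidth = algorithm bandwidth * 2*(n-1)/n for ring AllReduce,
    # the usual convention for comparing runs at different scales.
    algbw = msg_bytes / time_s
    return algbw * 2 * (n_devices - 1) / n_devices / 1e9

def flag_datapath_bottleneck(results, tolerance=0.7):
    """results: {n_devices: (msg_bytes, seconds)}. If bus bandwidth at
    the largest scale falls below `tolerance` of the smallest-scale
    figure, suspect the data path rather than the kernels."""
    ns = sorted(results)
    bws = {n: bus_bandwidth_gbps(*results[n], n) for n in ns}
    return bws[ns[-1]] < tolerance * bws[ns[0]]
```

A healthy fabric keeps bus bandwidth roughly flat as node count grows; a host-memory detour shows up as a steep drop long before model-level profiling would reveal it.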
Another important lesson was the need to treat memory registration, not just data transfer, as a first-class performance concern. Registration overhead can greatly exceed the time spent communicating if each transfer requires a new registration. The LRU cache for registered memory was an unglamorous addition to HCCL; however, it eliminated a systemic source of latency and made the RDMA path viable for real-world workloads. When building distributed systems, engineers should profile not only the available network bandwidth but also the lifecycle costs of allocating buffers, registering them, and tearing down those registrations. Small changes to these control paths can yield large increases in end-to-end throughput.
Finally, the integration method used in this project offers a reusable pattern. Instead of rewriting HCCL to use libfabric directly, we created a thin abstraction layer that preserved existing semantics while replacing the underlying transport. This brought several advantages: minimised risk, reduced code churn, and incremental testability. Teams facing a similar challenge (adapting accelerator-native communication libraries to cloud-native fabrics) should aim to isolate the transport layer, preserve collective semantics, and create small, testable interfaces between the two. This allows not only faster development but also easier support of future transport backends.
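The pattern in miniature: collective logic written against a tiny transport interface, with backends swapped in behind it. The names are illustrative, not HCCL's API; a loopback backend makes the collective logic unit-testable without any hardware:

```python
class Transport:
    """Minimal interface the collective layer codes against."""
    def send(self, peer, buf):
        raise NotImplementedError
    def recv(self, peer):
        raise NotImplementedError

class LoopbackTransport(Transport):
    """In-process backend: one queue per destination, handy for tests.
    An RDMA-backed implementation would plug in behind the same two
    methods without touching the collective code."""
    def __init__(self):
        self._mail = {}
    def send(self, peer, buf):
        self._mail.setdefault(peer, []).append(bytes(buf))
    def recv(self, peer):
        return self._mail[peer].pop(0)

def broadcast(transport, root, peers, buf):
    """A trivial collective written purely against the interface."""
    for p in peers:
        if p != root:
            transport.send(p, buf)
```

Because `broadcast` never mentions a concrete backend, swapping the transport is a configuration change rather than a rewrite, which is exactly what made the libfabric migration incremental.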
Disclosure: I work as an AI Runtime Team Manager at Intel. The perspectives shared in this article are my own.
