
Busy GPUs: Sampling and pipelining method speeds up deep learning on large graphs


Graphs, a potentially extensive web of nodes connected by edges, can be used to express and interrogate relationships between data, like social connections, financial transactions, traffic, energy grids, and molecular interactions. As researchers collect more data and build out these graphical pictures, they will need faster and more efficient methods, as well as more computational power, to conduct deep learning on them, in the form of graph neural networks (GNNs).

Now, a new method, called SALIENT (SAmpling, sLIcing, and data movemeNT), developed by researchers at MIT and IBM Research, improves training and inference performance by addressing three key bottlenecks in computation. This dramatically cuts down on the runtime of GNNs on large datasets, which, for instance, contain on the scale of 100 million nodes and 1 billion edges. Further, the team found that the technique scales well when computational power is added from one to 16 graphical processing units (GPUs). The work was presented at the Fifth Conference on Machine Learning and Systems.

“We began to look at the challenges current systems experienced when scaling state-of-the-art machine learning techniques for graphs to really big datasets. It turned out there was a lot of work to be done, because a lot of the existing systems were achieving good performance primarily on smaller datasets that fit into GPU memory,” says Tim Kaler, the lead author and a postdoc in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

By vast datasets, experts mean scales like the entire Bitcoin network, where certain patterns and data relationships could spell out trends or foul play. “There are nearly a billion Bitcoin transactions on the blockchain, and if we want to identify illicit activities within such a network, then we face a graph of such a scale,” says co-author Jie Chen, senior research scientist and manager at IBM Research and the MIT-IBM Watson AI Lab. “We want to build a system that is able to handle that kind of graph and allows processing to be as efficient as possible, because every day we want to keep up with the pace of the new data that are generated.”

Kaler and Chen’s co-authors include Nickolas Stathas MEng ’21 of Jump Trading, who developed SALIENT as a part of his graduate work; former MIT-IBM Watson AI Lab intern and MIT graduate student Anne Ouyang; MIT CSAIL postdoc Alexandros-Stavros Iliopoulos; MIT CSAIL Research Scientist Tao B. Schardl; and Charles E. Leiserson, the Edwin Sibley Webster Professor of Electrical Engineering at MIT and a researcher with the MIT-IBM Watson AI Lab.     

For this problem, the team took a systems-oriented approach in developing their method, SALIENT, says Kaler. To do this, the researchers implemented what they saw as essential, basic optimizations of components that fit into existing machine-learning frameworks, such as PyTorch Geometric and the Deep Graph Library (DGL), which are interfaces for building a machine-learning model. Stathas says the process is like swapping out engines to build a faster car. Their method was designed to fit into existing GNN architectures, so that domain experts could easily apply this work to their specified fields to expedite model training and tease out insights during inference faster. The trick, the team determined, was to keep all of the hardware (CPUs, data links, and GPUs) busy at all times: while the CPU samples the graph and prepares mini-batches of data that will then be transferred through the data link, the more critical GPU is working to train the machine-learning model or conduct inference.
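In code, that division of labor resembles a producer-consumer loop: CPU threads keep a small queue of prepared mini-batches full while the training loop drains it onto the GPU. The sketch below is only an illustration of the idea, using placeholder names (`sample_minibatch`, `model`, `optimizer`) rather than SALIENT's actual implementation.

```python
# A rough sketch of the producer-consumer idea: CPU threads keep a small queue of
# prepared mini-batches full while the main loop feeds them to the GPU.
# `sample_minibatch`, `model`, and `optimizer` are placeholders, not SALIENT's API.
import queue
import threading

def producer(sample_minibatch, out_queue, num_batches):
    # CPU side: graph sampling and mini-batch preparation for upcoming steps.
    for _ in range(num_batches):
        out_queue.put(sample_minibatch())
    out_queue.put(None)  # sentinel: no more batches

def train_epoch(model, optimizer, sample_minibatch, num_batches, device="cuda"):
    batches = queue.Queue(maxsize=4)  # small buffer of ready-to-go batches
    threading.Thread(
        target=producer, args=(sample_minibatch, batches, num_batches), daemon=True
    ).start()
    while (batch := batches.get()) is not None:
        batch = batch.to(device, non_blocking=True)  # move over the CPU-GPU data link
        loss = model(batch)                          # GPU does the heavy compute
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```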

The researchers began by analyzing the performance of a commonly used machine-learning library for GNNs (PyTorch Geometric), which showed a startlingly low utilization of available GPU resources. Applying simple optimizations, the researchers improved GPU utilization from 10 to 30 percent, leading to a 1.4 to two times performance improvement relative to public benchmark codes. This fast baseline code could execute one complete pass over a large training dataset through the algorithm (an epoch) in 50.4 seconds.

Seeking further performance improvements, the researchers set out to examine the bottlenecks that occur at the beginning of the data pipeline: the algorithms for graph sampling and mini-batch preparation. Unlike other neural networks, GNNs perform a neighborhood aggregation operation, which computes information about a node using information present in other nearby nodes in the graph; in a social network graph, for instance, information from friends of friends of a user. As the number of layers in the GNN increases, the number of nodes the network has to reach out to for information can explode, exceeding the limits of a computer. Neighborhood sampling algorithms help by selecting a smaller random subset of nodes to gather; however, the researchers found that current implementations of this were too slow to keep up with the processing speed of modern GPUs. In response, they identified a mix of data structures, algorithmic optimizations, and so on that improved sampling speed, ultimately improving the sampling operation alone by about three times, taking the per-epoch runtime from 50.4 to 34.6 seconds. They also found that sampling, at an appropriate rate, can be done during inference, improving overall energy efficiency and performance, a point that had been overlooked in the literature, the team notes.
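In PyTorch Geometric, this kind of fanout-limited neighborhood sampling is exposed through `NeighborLoader`. The sketch below, using illustrative fanouts and a small synthetic stand-in graph, shows the general pattern whose implementation the team set out to speed up; it is not SALIENT's own optimized sampler.

```python
# A minimal sketch of fanout-based neighborhood sampling with PyTorch Geometric.
# The dataset and fanout values are illustrative only.
import torch
from torch_geometric.datasets import FakeDataset
from torch_geometric.loader import NeighborLoader

data = FakeDataset(num_channels=64)[0]  # synthetic stand-in for a huge graph like ogbn-papers100M

loader = NeighborLoader(
    data,
    num_neighbors=[15, 10, 5],  # fanout: neighbors sampled per node at each of 3 GNN layers
    batch_size=1024,            # target ("seed") nodes per mini-batch
    shuffle=True,
    num_workers=4,              # CPU workers prepare batches while the GPU trains
)

for batch in loader:
    # Each batch is a small subgraph containing the seed nodes plus their sampled
    # neighborhoods; only this subgraph, not the full graph, moves to the GPU per step.
    print(batch.num_nodes, batch.num_edges)
    break
```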

In previous systems, this sampling step was a multi-process approach, creating extra data and unnecessary data movement between the processes. The researchers made their SALIENT method more nimble by creating a single process with lightweight threads that kept the data on the CPU in shared memory. Further, SALIENT takes advantage of a cache of modern processors, says Stathas, parallelizing feature slicing, which extracts relevant information from nodes of interest and their surrounding neighbors and edges, within the shared memory of the CPU core cache. This again reduced the overall per-epoch runtime, from 34.6 to 27.8 seconds.
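The feature-slicing step itself amounts to gathering just the rows of the feature matrix that a mini-batch touches and staging them where they can be copied to the GPU cheaply. A minimal, single-threaded illustration of that idea, not SALIENT's parallel, cache-aware slicer, might look like this:

```python
# Gather only the feature rows needed by one mini-batch into pinned (page-locked)
# CPU memory so the host-to-device copy can be fast and asynchronous.
# Sizes and node IDs are illustrative.
import torch

num_nodes, feat_dim = 1_000_000, 128
features = torch.randn(num_nodes, feat_dim)                # full feature matrix stays in CPU RAM
batch_node_ids = torch.randint(0, num_nodes, (50_000,))    # nodes touched by one mini-batch

# Slice the relevant rows directly into a reusable pinned buffer.
pinned_buf = torch.empty(len(batch_node_ids), feat_dim, pin_memory=True)
torch.index_select(features, 0, batch_node_ids, out=pinned_buf)

# Pinned memory allows an asynchronous copy that can overlap with GPU compute.
if torch.cuda.is_available():
    gpu_feats = pinned_buf.to("cuda", non_blocking=True)
```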

The last bottleneck the researchers addressed was to pipeline mini-batch data transfers between the CPU and GPU using a prefetching step, which would prepare data just before it's needed. The team calculated that this would maximize bandwidth usage over the data link and bring the method up to perfect utilization; however, they only saw around 90 percent. They identified and fixed a performance bug in a popular PyTorch library that caused unnecessary round-trip communications between the CPU and GPU. With this bug fixed, the team achieved a 16.5-second per-epoch runtime with SALIENT.
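One common way to express this kind of prefetching in PyTorch is to issue the next batch's host-to-device copy on a separate CUDA stream so it overlaps with training on the current batch. The generator below is a simplified sketch of that pattern, assuming the incoming CPU batches are already in pinned memory; it illustrates the idea rather than reproducing SALIENT's pipeline.

```python
# Prefetching sketch: copy batch i+1 to the GPU on a side stream while the caller
# trains on batch i. Assumes `cpu_batches` yields pinned-memory tensors or
# PyTorch Geometric Data objects.
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

def prefetched(cpu_batches):
    """Yield GPU-resident batches, overlapping each copy with the previous batch's compute."""
    staged, ready_event = None, None
    for cpu_batch in cpu_batches:
        with torch.cuda.stream(copy_stream):
            nxt = cpu_batch.to(device, non_blocking=True)  # async host-to-device copy
            nxt_event = torch.cuda.Event()
            nxt_event.record(copy_stream)
        if staged is not None:
            torch.cuda.current_stream().wait_event(ready_event)  # copy must finish first
            yield staged  # caller trains on this batch while the next one copies
        staged, ready_event = nxt, nxt_event
    if staged is not None:
        torch.cuda.current_stream().wait_event(ready_event)
        yield staged

# Usage: for batch in prefetched(loader): loss = model(batch); ...
```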

“Our work showed, I think, that the devil is in the details,” says Kaler. “When you pay close attention to the details that impact performance when training a graph neural network, you can resolve a huge number of performance issues. With our solutions, we ended up being completely bottlenecked by GPU computation, which is the ideal goal of such a system.”

SALIENT’s speed was evaluated on three standard datasets (ogbn-arxiv, ogbn-products, and ogbn-papers100M), as well as in multi-machine settings, with different levels of fanout (the amount of data that the CPU would prepare for the GPU), and across several architectures, including the most recent state-of-the-art one, GraphSAGE-RI. In each setting, SALIENT outperformed PyTorch Geometric, most notably on the large ogbn-papers100M dataset, containing 100 million nodes and over a billion edges. Here, it was three times faster, running on one GPU, than the optimized baseline that was originally created for this work; with 16 GPUs, SALIENT was an additional eight times faster.

While other systems had slightly different hardware and experimental setups, so it wasn’t always a direct comparison, SALIENT still outperformed them. Among systems that achieved similar accuracy, representative performance numbers include 99 seconds using one GPU and 32 CPUs, and 13 seconds using 1,536 CPUs. In contrast, SALIENT’s runtime using one GPU and 20 CPUs was 16.5 seconds, and it was just two seconds with 16 GPUs and 320 CPUs. “If you look at the bottom-line numbers that prior work reports, our 16-GPU runtime (two seconds) is an order of magnitude faster than other numbers that have been reported previously on this dataset,” says Kaler. The researchers attributed their performance improvements, in part, to their approach of optimizing their code for a single machine before moving to the distributed setting. Stathas says that the lesson here is that for your money, “it makes more sense to use the hardware you have efficiently, and to its extreme, before you start scaling up to multiple computers,” which can provide significant savings on cost and the carbon emissions that can come with model training.

This new capability will now allow researchers to tackle and dig deeper into bigger and bigger graphs. For example, the Bitcoin network that was mentioned earlier contained 100,000 nodes; the SALIENT system can capably handle a graph 1,000 times (or three orders of magnitude) larger.

“In the future, we will be looking at not only running this graph neural network training system on the existing algorithms that we implemented for classifying or predicting the properties of each node, but we also want to do more in-depth tasks, such as identifying common patterns in a graph (subgraph patterns), [which] may be really interesting for indicating financial crimes,” says Chen. “We also want to identify nodes in a graph that are similar in the sense that they possibly correspond to the same bad actor in a financial crime. These tasks would require developing additional algorithms, and possibly also neural network architectures.”

This research was supported by the MIT-IBM Watson AI Lab and in part by the U.S. Air Force Research Laboratory and the U.S. Air Force Artificial Intelligence Accelerator.
