Accelerating Large-Scale Data Analytics with GPU-Native Velox and NVIDIA cuDF

As workloads scale and demand for faster data processing grows, GPU-accelerated databases and query engines have been shown to deliver significant price-performance gains compared with CPU-based systems. The high memory bandwidth and thread count of GPUs especially benefit compute-heavy workloads such as multiple joins, complex aggregations, string processing, and more. The growing availability of GPU nodes and the broad feature coverage of GPU algorithms make GPU data processing more accessible than ever before.

By addressing performance bottlenecks, both data and business analysts can now query massive datasets to generate real-time insights and explore analytics scenarios.

To support this increasing demand, IBM and NVIDIA are working together to bring NVIDIA cuDF to the Velox execution engine, enabling GPU-native query execution for widely used platforms like Presto and Apache Spark. The project is fully open source.

How Velox and cuDF work together to translate query plans

Velox acts as an intermediate layer, translating query plans from systems like Presto and Spark into executable GPU pipelines powered by cuDF, as shown in Figure 1. For more details, see Extending Velox – GPU Acceleration with cuDF.

In this post, we're excited to share initial performance results of Presto and Spark using the GPU backend in Velox. We dive into:

  • End-to-end Presto acceleration
  • Scaling up Presto to support multi-GPU execution
  • Demonstrating hybrid CPU-GPU execution in Apache Spark
Flow chart illustrating the architecture of a data processing system involving Velox and cuDF, which is integrated with other tools like Spark and Presto.
Figure 1. A query flows from Presto or Apache Spark through the Velox engine, where it's converted into executable GPU pipelines powered by cuDF

Moving your entire Presto query plan to GPU for faster execution

The first step of query processing is translating incoming SQL into query plans with tasks for every node in the cluster. On each worker node, the cuDF backend for Velox receives a plan from the Presto coordinator, rewrites the plan using GPU operators, and then executes it.

Running Presto plans using Velox with cuDF required improvements to the GPU operators for TableScan, HashJoin, HashAggregation, FilterProject, and more. 

  • TableScan: The Velox TableScan was extended on CPU to be compatible with GPU I/O, decompression, and decoding components in cuDF.
  • HashJoin: The available join types were expanded to include left, right, and inner joins, along with support for filters and null semantics.
  • HashAggregation: A streaming interface was introduced to manage partial and final aggregations.
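The partial/final split that the streaming aggregation interface manages can be illustrated with a minimal sketch in plain Python. The function names here are illustrative, not the Velox or cuDF API: each batch is reduced to a partial state, and a final step merges those states into one result.

```python
from collections import defaultdict

def partial_aggregate(batch):
    """Reduce one batch of (key, value) rows to partial (sum, count) states."""
    state = defaultdict(lambda: [0, 0])
    for key, value in batch:
        state[key][0] += value
        state[key][1] += 1
    return dict(state)

def final_aggregate(partials):
    """Merge the partial states from all batches into final per-key averages."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: s / c for key, (s, c) in merged.items()}

# Two input batches, aggregated incrementally rather than all at once
batches = [[("a", 2), ("b", 4)], [("a", 6)]]
partials = [partial_aggregate(b) for b in batches]
print(final_aggregate(partials))  # {'a': 4.0, 'b': 4.0}
```

The same shape appears in distributed engines generally: partial aggregation runs where the data lives, and only the compact states move to the final aggregation step.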

Overall, the operator expansion within the cuDF backend for Velox enables end-to-end GPU execution in Presto, making full use of the Presto SQL parser, optimizer, and coordinator.
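Conceptually, the plan rewrite replaces each supported CPU operator node with its GPU counterpart while leaving the plan shape intact. A toy sketch in Python follows; the GPU operator names are hypothetical placeholders, not the actual class names in the cuDF backend.

```python
# Hypothetical CPU-to-GPU operator mapping; illustrative names only
GPU_OPERATORS = {
    "TableScan": "CudfTableScan",
    "FilterProject": "CudfFilterProject",
    "HashAggregation": "CudfHashAggregation",
    "HashJoin": "CudfHashJoin",
}

def rewrite_plan(node):
    """Recursively swap each operator for its GPU equivalent, if one exists."""
    name, children = node
    return (GPU_OPERATORS.get(name, name), [rewrite_plan(c) for c in children])

# A plan arrives as a tree of (operator, children) nodes
plan = ("HashAggregation", [("FilterProject", [("TableScan", [])])])
print(rewrite_plan(plan))
# ('CudfHashAggregation', [('CudfFilterProject', [('CudfTableScan', [])])])
```

Because the rewrite only touches operator implementations, the coordinator-produced plan structure, and everything upstream of it, is reused unchanged.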

The team collected query runtime data using benchmarks in Presto tpch (derived from TPC-H) with Parquet data sources, using both the Presto C++ and Presto-on-GPU worker types. Note that Presto C++ was not able to complete Q21 with standard configuration options, so the figure shows the total runtime for the 21 successful queries.

As shown in Figure 2, at scale factor 1,000, we observed a 1,246-second runtime for Presto C++ on AMD 7965WX, 133.8 seconds for Presto on the NVIDIA RTX PRO 6000 Blackwell Workstation, and 99.9 seconds for Presto on the NVIDIA GH200 Grace Hopper Superchip. We also used CUDA managed memory to complete Q21 on GH200 (see the Figure 2 asterisk), yielding a 148.9-second runtime for Presto GPU on the full query set.

Bar chart with an X-axis showing categories for Presto C++ on CPU and Presto on NVIDIA GPU results and Y-axis showing runtime in seconds. 
Figure 2. Runtime results for 21 of the 22 queries defined in Presto tpch, executed with single-node Presto C++ on CPU and Presto on NVIDIA GPUs at scale factor 1,000

Multi-GPU Presto for faster data exchange and lower query runtime

In distributed query execution, Exchange is a critical operator that manages data movement between workers on the same node and also between nodes. GPU-accelerated Presto uses a UCX-based Exchange operator that supports running the entire execution pipeline on GPU. The UCX core leverages high-bandwidth NVLink for intra-node connectivity and RoCE or InfiniBand for inter-node connectivity. UCX, or Unified Communication X, is an open source communication library designed to achieve the best possible performance for HPC applications.

Velox supports several Exchange types for different kinds of data movement: Partitioned, Merge, and Broadcast. Partitioned Exchange uses a hash function to partition input data and then sends the partitions to other workers for further processing. Merge Exchange receives multiple sorted input partitions from other workers and then produces a single, sorted output partition. Broadcast Exchange loads the data in a single worker and then copies it to all other workers. Integration of GPU exchange into the cuDF backend for Velox is in progress, and the implementation is available in mainline Velox.
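The first two Exchange types can be sketched in a few lines of plain Python; `partition_rows` and `merge_partitions` below are illustrative stand-ins for the Velox operators, not their real interfaces.

```python
import heapq

def partition_rows(rows, num_workers):
    """Partitioned Exchange: route each (key, value) row to a worker
    by hashing its key, producing one output partition per worker."""
    partitions = [[] for _ in range(num_workers)]
    for key, value in rows:
        partitions[hash(key) % num_workers].append((key, value))
    return partitions

def merge_partitions(sorted_partitions):
    """Merge Exchange: combine several sorted input partitions into
    a single sorted output partition via a streaming k-way merge."""
    return list(heapq.merge(*sorted_partitions))

# Hash-partition six rows across two workers, then merge sorted inputs
parts = partition_rows([("a", 1), ("b", 2), ("c", 3)], num_workers=2)
merged = merge_partitions([[("a", 1)], [("b", 2), ("c", 3)]])
print(merged)  # [('a', 1), ('b', 2), ('c', 3)]
```

Broadcast Exchange is the degenerate case: one worker materializes the data and every other worker receives a full copy.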

As shown in Figure 3, Presto achieves efficient performance on GPU with the new UCX-based exchange, especially when high-bandwidth intra-node connectivity is provisioned between GPUs. An eight-GPU NVIDIA DGX A100 node delivered a >6x speedup when using NVLink in the exchange operator compared with the Presto baseline HTTP exchange. Results were collected for Presto on GPU with both the baseline HTTP Exchange method and the UCX-based cuDF Exchange method. With eight GPU workers, Presto can finish all 22 queries with the default async memory allocation, without using managed memory.

Note that Figure 3 uses multi-node execution plans from the Presto coordinator and cold-cache remote data sources. These results are not directly comparable to the single-node, hot-cache runtimes shown in Figure 2.

Bar chart with an X-axis showing categories for Presto C++ and Presto on GPU results and Y-axis showing runtime in seconds.
Figure 3. Runtime results for the 22 queries defined in Presto tpch benchmark, executed with Presto GPU on NVIDIA DGX A100 (eight A100 GPUs) at scale factor 1,000 

Hybrid CPU-GPU execution in Apache Spark

While the Presto integration focuses on end-to-end GPU execution, the Apache Spark integration with Apache Gluten and cuDF currently focuses on offloading specific query stages. This capability allows the most compute-intensive parts of a workload to be dispatched to GPUs, a strategy that makes the best use of GPU resources in hybrid clusters containing both CPU and GPU nodes.

For instance, the second stage of TPC-DS Query 95 at SF100 is compute intensive and can slow down CPU-only clusters. Offloading this stage to GPU achieves significant performance gains, while CPU capacity remains available in the cluster for other queries or workloads.
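A simplified view of stage-level offload: a stage runs on GPU only when all of its operators are GPU-supported, which mirrors how the scan stage stays on CPU while the compute-heavy stage offloads. The function and policy below are assumptions for illustration, not Gluten's actual offload logic.

```python
def assign_backends(stages, gpu_supported):
    """Assign each query stage to GPU when every operator in the stage
    has a GPU implementation; otherwise keep the stage on CPU."""
    plan = {}
    for stage, operators in stages.items():
        all_supported = all(op in gpu_supported for op in operators)
        plan[stage] = "GPU" if all_supported else "CPU"
    return plan

# Hypothetical two-stage query: a scan stage and a join/aggregate stage
stages = {
    "stage1": ["TableScan"],
    "stage2": ["HashJoin", "HashAggregation"],
}
supported = {"HashJoin", "HashAggregation", "FilterProject"}
print(assign_backends(stages, supported))
# {'stage1': 'CPU', 'stage2': 'GPU'}
```

Keeping the decision at stage granularity means data crosses the CPU-GPU boundary only at stage boundaries, where Spark already materializes shuffle data.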

As shown in Figure 4, even when the first stage (TableScan) runs on CPU, efficient interoperability between CPU and GPU enables a faster total runtime when the second stage is offloaded to GPU. The CPU-only condition uses eight vCPUs, and the CPU+GPU condition uses eight vCPUs plus one NVIDIA T4 GPU (g4dn.2xlarge).

Bar chart with an X-axis showing categories for Second Stage execution on CPU and GPU. Y-axis shows runtime in seconds.
Figure 4. Runtime results for query 95, as defined in Gluten tpcds, executed single-node with a single GPU at scale factor 100

Get involved with GPU-powered, large-scale data analytics

Driving GPU acceleration in the shared Velox execution engine unlocks performance gains for a wide range of downstream systems across the data processing ecosystem. The team is working with contributors across many companies to implement reusable GPU operators in Velox, and in turn speed up Presto, Spark (through Gluten), and other systems. This approach reduces duplication, simplifies maintenance, and brings new innovations across the open data stack.

We're excited to share this open source work with the community and to hear your feedback.

Acknowledgments

Many developers contributed to this work. IBM contributors include Zoltán Arnold Nagy, Deepak Majeti, Daniel Bauer, Chengcheng Jin, Luis Garcés-Erice, Sean Rooney, and Ali LeClerc. NVIDIA contributors include Greg Kimball, Karthikeyan Natarajan, Devavret Makkar, Shruti Shivakumar, and Todd Mostak.
