NVIDIA is partnering with the University of Wisconsin-Madison to bring GPU-accelerated analytics to DuckDB through the open-source Sirius engine.
DuckDB has seen rapid adoption among organizations such as DeepSeek, Microsoft, and Databricks thanks to its simplicity, speed, and flexibility. Because analytics workloads are highly amenable to massive parallelism, GPUs have emerged as the natural next step, offering higher performance, greater throughput, and lower total cost of ownership (TCO) compared with CPU-based databases. However, this growing demand for GPU acceleration is hindered by the challenge of building a database system from the ground up.
This challenge is addressed by the jointly developed Sirius, a composable GPU-native execution backend for DuckDB that reuses DuckDB's advanced subsystems while accelerating query execution on GPUs. Sirius delivers this GPU acceleration through NVIDIA CUDA-X libraries.
This blog post outlines the Sirius architecture and demonstrates how it achieved record-breaking performance on ClickBench, a widely used analytics benchmark.
Sirius: A GPU-native SQL engine


Sirius is a GPU-native SQL engine that provides drop-in acceleration for DuckDB and, in the future, other data systems.
The team recently published an article detailing the Sirius architecture and demonstrating state-of-the-art performance on TPC-H at SF100.
Implemented as a DuckDB extension, Sirius requires no modifications to DuckDB’s codebase and only minimal changes to the user-facing interface. At the execution boundary, Sirius consumes query plans in the universal Substrait format, ensuring compatibility with other data systems. To minimize engineering effort and maximize reliability, Sirius is built on well-established NVIDIA libraries:
- NVIDIA cuDF: High-performance, columnar-oriented relational operators (e.g., joins, aggregations, projections) natively designed for GPUs.
- NVIDIA RAPIDS Memory Manager (RMM): An efficient GPU memory allocator, reducing fragmentation and allocation overheads.
Sirius builds its GPU-native execution engine and buffer management on top of these high-performance libraries, while reusing DuckDB’s advanced subsystems, including its query parser, optimizer, and scan operators, where appropriate. This combination of mature ecosystems gave Sirius a head start, enabling it to break the ClickBench record with minimal engineering effort.
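As a minimal illustration of these two building blocks, the following Python sketch (using the cudf and rmm packages directly, not Sirius itself; the data and pool size are illustrative) sets up an RMM pool allocator and runs a join plus aggregation through cuDF, the same kind of relational work Sirius offloads to the GPU:

```python
import rmm
import cudf

# Route GPU allocations through an RMM pool to reduce allocation
# overhead and fragmentation (1 GiB initial pool, grows on demand).
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)

orders = cudf.DataFrame({"cust_id": [1, 2, 1, 3], "amount": [10.0, 25.5, 7.25, 3.0]})
customers = cudf.DataFrame({"cust_id": [1, 2, 3], "region": ["US", "EU", "US"]})

# Hash join and group-by aggregation execute entirely on the GPU.
joined = orders.merge(customers, on="cust_id")
per_region = joined.groupby("region")["amount"].sum().reset_index()
print(per_region)
```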


As illustrated in Figure 2, the process begins when Sirius receives an already-optimized query plan in DuckDB’s internal format, ensuring that DuckDB’s robust logical and physical optimizations are preserved. For table scans, Sirius invokes DuckDB’s scan functionality, which provides features such as min-max filtering, zone skipping, and on-the-fly decompression; these operations efficiently load the relevant data into host memory.
Next, the result of the table scan is converted from DuckDB’s native format into the Sirius data format (closely aligned with Apache Arrow), which is then transferred to GPU memory. In benchmarks like ClickBench, Sirius can cache frequently accessed tables on the GPU, accelerating repeated query execution.
The Sirius format can be mapped directly to a cudf::table for zero-copy interoperability, enabling all remaining SQL operators (aggregations, projections, and joins) to execute at GPU speed through cuDF primitives. Once computation completes, results are transferred back to the CPU, converted to DuckDB’s expected output format, and returned to the user, offering both raw speed and a seamless, familiar analytics experience.
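A similar CPU-to-GPU round trip can be approximated today with off-the-shelf pieces. The sketch below is an analogy to the flow in Figure 2, not Sirius code; the file name and TPC-H-style column names are illustrative. It scans a table with DuckDB, hands the Arrow-formatted result to cuDF for GPU execution, and brings the result back to the host:

```python
import duckdb
import cudf

# 1. CPU side: DuckDB parses, optimizes, and scans the data.
arrow_tbl = duckdb.sql("SELECT * FROM 'lineitem.parquet'").arrow()

# 2. Transfer: Arrow's columnar layout maps cleanly onto GPU memory.
gdf = cudf.DataFrame.from_arrow(arrow_tbl)

# 3. GPU side: filter, project, and aggregate with cuDF primitives.
result = (
    gdf[gdf["l_quantity"] > 24]
    .groupby("l_returnflag")["l_extendedprice"]
    .sum()
    .reset_index()
)

# 4. Back to the host as Arrow, ready to return to the caller.
print(result.to_arrow())
```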
Hitting #1 on ClickBench
Sirius running on an NVIDIA GH200 Grace Hopper Superchip instance from Lambda Labs ($1.5/hour) was evaluated against the top five systems on ClickBench. The comparison systems ran on CPU-only instances: AWS c6a.metal ($7.3/hour), AWS c8g.metal-48xl ($7.6/hour), and AWS c7a.metal-48xl ($9.8/hour). Hot-run execution time and relative runtime are reported, following the ClickBench methodology, where lower values indicate better performance and 1.0 represents the best possible score. Figure 3 shows the geometric mean of the relative runtime across all benchmark queries. In the ClickBench runs, Sirius achieved the lowest relative runtime on cheaper hardware, resulting in at least 7.2x better cost-efficiency under this setup. Note that these benchmark results were obtained at the time of evaluation and are subject to change in the future.
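For reference, the headline number is the geometric mean of per-query runtimes relative to the fastest system, which can be computed as in this sketch (the small additive constant that damps the effect of near-zero queries is our reading of the ClickBench methodology, assumed to be 10 ms here):

```python
from math import prod

def clickbench_score(times_s, best_times_s, offset_s=0.010):
    """Geometric mean of per-query relative runtimes.

    times_s:      this system's hot-run time per query (seconds)
    best_times_s: the fastest system's time per query (seconds)
    offset_s:     small additive constant so near-zero queries
                  don't dominate the ratios (assumed 10 ms).
    """
    ratios = [(t + offset_s) / (b + offset_s) for t, b in zip(times_s, best_times_s)]
    return prod(ratios) ** (1.0 / len(ratios))

# A system matching the best time on every query scores exactly 1.0.
print(clickbench_score([0.02, 1.5, 0.3], [0.02, 1.5, 0.3]))  # -> 1.0
```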


Figure 4 shows the hot-run query performance of Sirius and the top two systems on ClickBench: Umbra and DuckDB. Sirius achieved the lowest relative runtime on most queries, driven by efficient GPU computation through cuDF. For example, in q4, q5, and q18, Sirius shows substantial performance gains on commonly used operators such as filtering, projection, and aggregation.
A few queries, however, reveal opportunities for further improvement. For instance, q23 is bottlenecked by the "contains" operation on string columns, q24 and q26 by top-N operators, and q27 by aggregation over very large inputs. Future versions of Sirius will include continual improvements to these operators.
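For context, the two patterns called out above map to standard cuDF primitives, sketched here on toy data rather than the actual benchmark tables:

```python
import cudf

hits = cudf.DataFrame({
    "URL": ["https://google.com/a", "https://example.org/b", "https://google.com/c"],
    "Views": [120, 45, 300],
})

# q23-style predicate: substring "contains" on a string column.
google_hits = hits[hits["URL"].str.contains("google", regex=False)]

# q24/q26-style top-N: order by a measure and keep only the first N rows.
top2 = hits.nlargest(2, "Views")
```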


Figure 5 takes a closer look at one of the most complex ClickBench queries, the regular expression query (q28). When implemented naively, regular expression matching on GPUs can produce massive kernels with high register pressure and complex control flow, resulting in severe performance degradation.
To address this, Sirius leverages cuDF’s JIT-compiled string transformation framework for user-defined functions. Figure 5 compares the performance of the JIT approach to cuDF’s precompiled API (cudf::strings::replace_with_backrefs), showing a 13x speedup.
The JIT-compiled kernel achieves 85% warp occupancy, compared to only 32% for the precompiled version, demonstrating better GPU utilization. By decomposing the regular expression into standard string operations such as character comparisons and substring operations, the cuDF JIT framework can fuse these operations into a single kernel, improving data locality and reducing register pressure.
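The decomposition idea can be illustrated with cuDF’s Python API (a simplified analogy; Sirius’s JIT path operates at the C++/kernel level). Both routes below extract the host name from a URL: one through a full regular expression, and one through primitive split/get operations of the kind a fusing JIT can compile into a single kernel:

```python
import cudf

urls = cudf.Series([
    "https://www.example.com/page?id=1",
    "http://blog.example.org/post/42",
])

# Regex route: one large pattern with a capture group.
domains_regex = urls.str.extract(r"^https?://([^/]+)/")[0]

# Decomposed route: the same result from simple string primitives
# (split on '/' and take the host component at index 2).
domains_simple = urls.str.split("/").list.get(2)
```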


What’s next for Sirius
Looking ahead, NVIDIA and the University of Wisconsin-Madison are collaborating on foundational, shareable building blocks for GPU data processing, guided by the modular, interoperable, composable, extensible (MICE) principles described in the Composable Codex. Our priority areas are:
- Advanced GPU memory management: Developing robust strategies to manage GPU memory efficiently, including seamless spilling of data beyond physical GPU limits to maintain performance and scale (see the sketch after this list).
- GPU file readers and intelligent I/O prefetching: Plugging in GPU-native file readers with smart prefetching to speed up data loading, minimize stalls, and reduce I/O bottlenecks.
- Pipeline-oriented execution model: Evolving Sirius’s core to a fully composable pipeline architecture that streamlines data flows across GPUs, host, and disk, efficiently overlapping computation and communication while enabling plug-and-play interoperability with open standards.
- Scalable multi-node, multi-GPU architecture: Expanding Sirius’s capability to scale out efficiently across multiple nodes and GPUs, unlocking petabyte-scale data processing.
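On the memory-management front, today’s RAPIDS stack already hints at what spilling beyond physical GPU limits can look like. A minimal sketch, assuming a recent cuDF/RMM release where these options are available:

```python
import rmm
import cudf

# Unified (managed) memory lets allocations exceed physical GPU capacity;
# the driver migrates pages between device and host on demand.
rmm.reinitialize(managed_memory=True)

# cuDF's built-in spilling moves cold device buffers to host memory
# under pressure, trading bandwidth for the ability to keep scaling.
cudf.set_option("spill", True)
```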
By investing in these MICE-compliant components, Sirius aims to make GPU analytics engines easier to build, integrate, and extend, not only for Sirius itself but for the entire open-source analytics ecosystem.
Join Sirius
Sirius is open source under the permissive Apache 2.0 license. Led by NVIDIA and the University of Wisconsin-Madison, the project welcomes contributions from researchers and practitioners who share the mission of driving the GPU era in data analytics.
We invite you to:
