Is this the future of Python numerical computation?
Late last year, NVIDIA made a big announcement regarding the future of Python-based numerical computing. I wouldn’t be surprised if you missed it. After all, every other announcement from every AI company, then and now, seems mega-important.
That announcement introduced the cuNumeric library, a drop-in replacement for the ubiquitous NumPy library, built on top of the Legate framework.
Who are Nvidia?
Most people will probably know Nvidia from its ultra-fast chips that power computers and data centres all over the world. You may even be familiar with Nvidia’s charismatic, leather jacket-loving CEO, Jensen Huang, who seems to pop up on stage at every AI conference these days.
What many people don’t know is that Nvidia also designs and creates innovative device architectures and associated software. One of its most prized products is the Compute Unified Device Architecture (CUDA). CUDA is NVIDIA’s proprietary parallel-computing platform and programming model. Since its launch in 2007, it has evolved into a comprehensive ecosystem comprising drivers, runtime, compilers, math libraries, debugging and profiling tools, and container images. The result is a tightly tuned hardware and software loop that keeps NVIDIA GPUs at the centre of modern high-performance and AI workloads.
What’s Legate?
Legate is an NVIDIA-led open-source runtime layer that lets you run familiar Python data-science libraries (NumPy, cuNumeric, Pandas-style APIs, sparse linear-algebra kernels, …) on multi-core CPUs, single or multi-GPU nodes, and even multi-node clusters without changing your Python code. It translates high-level array operations into a graph of fine-grained tasks and hands that graph to the C++ Legion runtime, which schedules the tasks, partitions the data, and moves tiles between CPUs, GPUs and network links for you.
In a nutshell, Legate lets familiar single-node Python libraries scale transparently to multi-GPU, multi-node machines.
What’s cuNumeric?
cuNumeric is a drop-in replacement for NumPy whose array operations are executed by Legate’s task engine and accelerated on one or many NVIDIA GPUs (or, if no GPU is present, on all CPU cores). In practice, you install it and need only change one import line to start using it in place of your regular NumPy code. For example …
# old
import numpy as np
...
...
# new
import cupynumeric as np  # everything else stays the same
...
...
… and run your script from the terminal with the legate command.
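If the same script has to run on machines with and without cuNumeric installed, one option is to choose the array module at import time. The sketch below is just one way you might structure that; it is not something cuNumeric requires, and `pick_backend` is a hypothetical helper name.

```python
import importlib
import importlib.util


def pick_backend(candidates=("cupynumeric", "numpy")):
    """Return the first importable module from `candidates`, or None.

    `pick_backend` is a hypothetical helper for illustration only.
    """
    for name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)
    return None


np = pick_backend()
if np is None:
    print("Neither cupynumeric nor numpy is installed")
else:
    print(f"Using array backend: {np.__name__}")
```

The rest of the script can then keep using `np` as usual, whichever module was found.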
Behind the scenes, cuNumeric converts each NumPy call you make (for example, np.sin, np.linalg.svd, fancy indexing, broadcasting, reductions, etc.) into Legate tasks. Those tasks will:
- Partition your arrays into tiles sized to fit GPU memory.
- Schedule each tile on the best available device (GPU or CPU).
- Overlap compute with communication when the workload spans multiple GPUs or nodes.
- Spill tiles to NVMe/SSD automatically when your dataset outgrows GPU RAM.
Because the cuNumeric API mirrors NumPy’s almost one-for-one, existing scientific or data-science code can scale from a laptop to a multi-GPU cluster without a rewrite.
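To make the partitioning idea concrete, here is a toy, CPU-only sketch in plain Python. It is not how Legate is implemented: `split_into_tiles` and `tiled_sum_of_squares` are hypothetical names, and the "tasks" simply run one after another. It shows only the principle of splitting an array into tiles, computing a partial result per tile, and then combining the partials.

```python
def split_into_tiles(data, tile_size):
    """Yield consecutive slices of `data`, each at most `tile_size` long."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    for start in range(0, len(data), tile_size):
        yield data[start:start + tile_size]


def tiled_sum_of_squares(data, tile_size):
    """Reduce `data` tile by tile, as a scheduler might do per device."""
    # Each partial result depends only on its own tile, so in a real
    # runtime these could execute on different devices in parallel.
    partials = [
        sum(x * x for x in tile)
        for tile in split_into_tiles(data, tile_size)
    ]
    # A final, cheap reduction combines the per-tile results.
    return sum(partials)


values = list(range(10))
# Tiles here are [0..3], [4..7] and [8, 9].
print(tiled_sum_of_squares(values, tile_size=4))
```

Because each tile is independent, the tile size can be chosen to suit the memory of whatever device ends up processing it.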
Performance advantages
So, this all seems great, right? But it only makes sense if it results in tangible performance improvements over NumPy, and Nvidia is making some strong claims that this is the case. As data scientists, machine learning engineers and data engineers typically use NumPy a lot, we can appreciate that this can be a crucial aspect of the systems we write and maintain.
Now, I don’t have a cluster of GPUs or a supercomputer to test this on, but my desktop PC does have an Nvidia GeForce RTX 4070 Ti GPU, and we’re going to use that to test some of Nvidia’s claims.
(base) tom@tpr-desktop:~$ nvidia-smi
Sun Jun 15 15:26:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.75 Driver Version: 566.24 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti On | 00000000:01:00.0 On | N/A |
| 32% 29C P8 9W / 285W | 1345MiB / 12282MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I’ll install cuNumeric and NumPy on my PC to conduct comparative tests. This will help us assess whether Nvidia’s claims are accurate and understand the performance differences between the two libraries.
Setting up a development environment
As always, I like to set up a separate development environment to run my tests. That way, nothing I do in that environment will affect any of my other projects. At the time of writing, cuNumeric is not available to install on Windows, so I’ll be using WSL2 Ubuntu for Windows instead.
I’ll be using Miniconda to set up my environment, but feel free to use whichever tool you’re comfortable with.
$ conda create -n cunumeric-env python=3.10 -c conda-forge
$ conda activate cunumeric-env
$ conda install -c conda-forge -c legate cupynumeric
$ conda install -c conda-forge ucx cuda-cudart cuda-version=12
Code example 1: A simple matrix multiplication
Matrix multiplication is the bread and butter of the mathematical operations that underpin so many AI systems, so it makes sense to try that operation first.
import time
import gc
import argparse
import sys


def benchmark_numpy(n, runs):
    """Runs the matrix multiplication benchmark using standard NumPy on the CPU."""
    import numpy as np

    print("--- NumPy (CPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print(f"Generating two {n}x{n} random matrices on CPU...")
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = np.matmul(A, B)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed. The @ operator is a convenient
        # shorthand for np.matmul.
        C = A @ B
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C  # Clean up the result matrix
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.4f}s\n")
    return avg


def benchmark_cunumeric(n, runs):
    """Runs the matrix multiplication benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Import numpy for the canonical sync

    print("--- cuNumeric (GPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print(f"Generating two {n}x{n} random matrices on GPU...")
    A = cn.random.rand(n, n).astype(np.float32)
    B = cn.random.rand(n, n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    C_warmup = cn.matmul(A, B)
    # Best practice for synchronization: force a copy back to the CPU.
    _ = np.array(C_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        C = A @ B
        # Synchronize by converting the result to a host-side NumPy array.
        np.array(C)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.4f}s\n")
    return avg


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Benchmark matrix multiplication on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    args, unknown = parser.parse_known_args()

    # The dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)
Running the NumPy side of things uses the regular python example1.py command-line syntax. Running under Legate is more involved: the command disables Legate’s automatic configuration and then launches the example1.py script under Legate with one CPU, one GPU, and zero OpenMP threads, using the cuNumeric backend.
Here is the output.
(cunumeric-env) tom@tpr-desktop:~$ python example1.py
--- NumPy (CPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)
Generating two 3000x3000 random matrices on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.0976s
Run 2: time = 0.0987s
Run 3: time = 0.0957s
Run 4: time = 0.1063s
Run 5: time = 0.0989s
NumPy average: 0.0994s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example1.py --cunumeric
[0 - 7f2e8fcc8480] 0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7f2e8fcc8480] 0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7f2e8fcc8480] 0.000049 {4}{threads}: reservation ('GPU ctxsync 0x55cd5fd34530') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)
Generating two 3000x3000 random matrices on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.0113s
Run 2: time = 0.0089s
Run 3: time = 0.0086s
Run 4: time = 0.0090s
Run 5: time = 0.0087s
cuNumeric average: 0.0093s
Well, that’s a strong start. cuNumeric is registering a roughly 10x speedup over NumPy.
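For context, the timings can be turned into approximate throughput. Multiplying two n×n matrices takes about 2n³ floating-point operations (n³ multiplications and roughly as many additions). The hypothetical helper `matmul_gflops` below applies that estimate to the averages reported above; treat the figures as rough, since they are derived from wall-clock time alone.

```python
def matmul_gflops(n, seconds):
    """Approximate GFLOP/s for one (n x n) @ (n x n) matrix multiplication."""
    flops = 2 * n ** 3  # ~n^3 multiplies plus ~n^3 adds
    return flops / seconds / 1e9


numpy_avg = 0.0994      # seconds, NumPy average from the run above
cunumeric_avg = 0.0093  # seconds, cuNumeric average from the run above

print(f"NumPy:     {matmul_gflops(3000, numpy_avg):.0f} GFLOP/s")
print(f"cuNumeric: {matmul_gflops(3000, cunumeric_avg):.0f} GFLOP/s")
print(f"Speedup:   {numpy_avg / cunumeric_avg:.1f}x")
```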
The warnings that Legate outputs can be ignored. They are informational, indicating that Legate couldn’t find details about the machine’s CPU/memory layout (NUMA) or enough CPU cores to manage the GPU.
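The launch line shown in the log above can also be assembled from Python, which is handy when scripting several benchmark runs. This sketch uses only the pieces visible in that log (the LEGATE_AUTO_CONFIG variable and the --cpus, --gpus and --omps flags). `build_legate_command` is a hypothetical helper, and it only builds the command rather than running it.

```python
import os
import shlex


def build_legate_command(script, cpus=1, gpus=1, omps=0, script_args=()):
    """Return (env, argv) for launching `script` under the legate driver."""
    env = dict(os.environ)
    env["LEGATE_AUTO_CONFIG"] = "0"  # disable automatic configuration
    argv = [
        "legate",
        "--cpus", str(cpus),
        "--gpus", str(gpus),
        "--omps", str(omps),
        script,
        *script_args,
    ]
    return env, argv


env, argv = build_legate_command("example1.py", script_args=["--cunumeric"])
print("LEGATE_AUTO_CONFIG=" + env["LEGATE_AUTO_CONFIG"], shlex.join(argv))
# To actually launch it (requires Legate to be installed):
#   import subprocess; subprocess.run(argv, env=env, check=True)
```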
Code example 2: Logistic regression
Logistic regression is a foundational tool in data science because it provides a simple, interpretable way to model and predict binary outcomes (yes/no, pass/fail, click/no-click). In this example, we’ll measure how long it takes to train a simple binary classifier on synthetic data. The script first generates N samples with D features (X) and a corresponding random 0/1 label vector (y). For each of the five timed runs, it initialises the weight vector w to zeros, then performs 500 iterations of batch gradient descent: computing the linear predictions z = X.dot(w), applying the sigmoid p = 1/(1+exp(-z)), computing the gradient grad = X.T.dot(p - y) / N, and updating the weights with w -= 0.1 * grad. The script records the elapsed time for each run, cleans up memory, and finally prints the average training time.
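Written out, one iteration of the update described above looks like this, with N the number of samples, α the learning rate and σ the sigmoid function:

```latex
\begin{aligned}
z &= Xw \\
p &= \sigma(z) = \frac{1}{1 + e^{-z}} \\
\nabla &= \frac{1}{N} X^{\top}(p - y) \\
w &\leftarrow w - \alpha \nabla, \qquad \alpha = 0.1
\end{aligned}
```

The third line is the gradient of the mean binary cross-entropy loss with respect to w, which is why the update needs nothing more than two matrix-vector products and an element-wise sigmoid per iteration.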
import time
import gc
import argparse
import sys


# --- Reusable Training Function ---
# By putting the training loop in its own function, we avoid code duplication.
# The `np` argument allows us to pass in either the numpy or cupynumeric module.
def train_logistic_regression(np, X, y, iters, alpha):
    """Performs a fixed number of gradient descent iterations."""
    # Ensure w starts on the right device (CPU or GPU)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        z = X.dot(w)
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T.dot(p - y) / X.shape[0]
        w -= alpha * grad
    return w


def benchmark_numpy(n_samples, n_features, iters, alpha, runs):
    """Runs the logistic regression benchmark using standard NumPy on the CPU."""
    import numpy as np

    print("--- NumPy (CPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random dataset on CPU...")
    X = np.random.rand(n_samples, n_features)
    y = (np.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = train_logistic_regression(np, X, y, iters, alpha)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed
        _ = train_logistic_regression(np, X, y, iters, alpha)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.3f}s\n")
    return avg


def benchmark_cunumeric(n_samples, n_features, iters, alpha, runs):
    """Runs the logistic regression benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Also import numpy for the canonical synchronization

    print("--- cuNumeric (GPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print("Generating random dataset on GPU...")
    X = cn.random.rand(n_samples, n_features)
    y = (cn.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    w_warmup = train_logistic_regression(cn, X, y, iters, alpha)
    # Best practice for synchronization: force a copy back to the CPU.
    _ = np.array(w_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        w = train_logistic_regression(cn, X, y, iters, alpha)
        # Synchronize by converting the final result back to a NumPy array.
        np.array(w)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        del w
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.3f}s\n")
    return avg


if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark logistic regression on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    # Hyperparameters for the model
    parser.add_argument(
        "-n", "--n_samples", type=int, default=2_000_000, help="Number of data samples"
    )
    parser.add_argument(
        "-d", "--n_features", type=int, default=10, help="Number of features"
    )
    parser.add_argument(
        "-i", "--iters", type=int, default=500, help="Number of gradient descent iterations"
    )
    parser.add_argument(
        "-a", "--alpha", type=float, default=0.1, help="Learning rate"
    )
    # Benchmark control
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    args, unknown = parser.parse_known_args()

    # Dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n_samples, args.n_features, args.iters, args.alpha, args.runs)
    else:
        benchmark_numpy(args.n_samples, args.n_features, args.iters, args.alpha, args.runs)
And the outputs.
(cunumeric-env) tom@tpr-desktop:~$ python example2.py
--- NumPy (CPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations
Generating random dataset on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 12.292s
Run 2: time = 11.830s
Run 3: time = 11.903s
Run 4: time = 12.843s
Run 5: time = 11.964s
NumPy average: 12.166s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example2.py --cunumeric
[0 - 7f04b535c480] 0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7f04b535c480] 0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7f04b535c480] 0.001149 {4}{threads}: reservation ('GPU ctxsync 0x55fb037cf140') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations
Generating random dataset on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 1.964s
Run 2: time = 1.957s
Run 3: time = 1.968s
Run 4: time = 1.955s
Run 5: time = 1.960s
cuNumeric average: 1.961s
Not quite as impressive as our first example, but a roughly 6x speedup on an already fast NumPy program is not to be sniffed at.
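Both examples so far follow the same measurement pattern: build the inputs once, do one untimed warm-up call, then time several calls and average them. A generic version of that pattern might look like the sketch below. `time_runs` is a hypothetical helper; it uses time.perf_counter, which is better suited to interval timing than time.time. For GPU work, the function being timed must itself block until the result is ready, for example by copying it to a host array as the scripts above do.

```python
import time


def time_runs(fn, runs=5, warmup=1):
    """Call `fn` `warmup` times untimed, then return `runs` wall-clock durations."""
    for _ in range(warmup):
        fn()  # absorbs one-time setup costs
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()  # must not return before the work is actually done
        durations.append(time.perf_counter() - start)
    return durations


def workload():
    # Stand-in CPU workload so the sketch runs without any GPU libraries.
    return sum(i * i for i in range(100_000))


times = time_runs(workload, runs=3)
print(f"average: {sum(times) / len(times):.6f}s over {len(times)} runs")
```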
Code example 3: Solving linear equations
This script benchmarks how long it takes to solve a dense 3000×3000 system of linear equations. This is a fundamental operation in linear algebra, used to solve equations of the form Ax = b, where A is a large grid of numbers (a 3000×3000 matrix in this case) and b is a list of numbers (a vector).
The goal is to find the unknown list of numbers x that makes the equation true. This is a computationally intensive task that’s at the heart of many scientific simulations, engineering problems, financial models, and even some AI algorithms.
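To make "solving Ax = b" concrete on a tiny system, here is a pure-Python sketch of Gaussian elimination with partial pivoting. It is for illustration only: np.linalg.solve relies on optimised LAPACK routines, and I am not claiming cuNumeric follows this exact procedure. `solve_linear_system` is a hypothetical name.

```python
def solve_linear_system(A, b):
    """Solve A x = b for a small dense system given as lists of numbers."""
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs are left untouched.
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every row below the pivot row.
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(M[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (M[row][n] - known) / M[row][row]
    return x


A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
x = solve_linear_system(A, b)
print(x)  # approximately [0.8, 1.4]
```

The work grows with the cube of the matrix size, which is why a 3000×3000 system is a meaningful benchmark.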
import time
import gc
import argparse
import sys  # Import sys to check arguments

# Note: The library imports (numpy and cupynumeric) are now done *inside*
# their respective functions to keep them separate and avoid import errors.


def benchmark_numpy(n, runs):
    """Runs the linear solve benchmark using standard NumPy on the CPU."""
    import numpy as np

    print("--- NumPy (CPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random system on CPU...")
    A = np.random.randn(n, n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)

    # 2. Perform one untimed warm-up run. This is good practice even for
    # the CPU to make sure caches are warm and any one-time setup is done.
    print("Performing warm-up run...")
    _ = np.linalg.solve(A, b)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed
        x = np.linalg.solve(A, b)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the result to be safe with memory
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")
    return avg


def benchmark_cunumeric(n, runs):
    """Runs the linear solve benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Also import numpy for the canonical synchronization

    print("--- cuNumeric (GPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    # This ensures we are not timing the data transfer in our main loop.
    print("Generating random system on GPU...")
    A = cn.random.randn(n, n).astype(np.float32)
    b = cn.random.randn(n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run. This handles JIT
    # compilation and other one-time GPU setup costs.
    print("Performing warm-up run...")
    x_warmup = cn.linalg.solve(A, b)
    # Best practice for synchronization: force a copy back to the CPU.
    _ = np.array(x_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        x = cn.linalg.solve(A, b)
        # Synchronize by converting the result to a host-side NumPy array.
        # This is guaranteed to block until the GPU has finished.
        np.array(x)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the GPU array result
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")
    return avg


if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark linear solve on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    # Use parse_known_args() to handle potential extra arguments from Legate
    args, unknown = parser.parse_known_args()

    # The dispatcher logic: check if "--cunumeric" is in the command line.
    # This is a simple and effective way to switch between modes.
    if "--cunumeric" in sys.argv or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)
The outputs.
(cunumeric-env) tom@tpr-desktop:~$ python example4.py
--- NumPy (CPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)
Generating random system on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.133075s
Run 2: time = 0.126129s
Run 3: time = 0.135849s
Run 4: time = 0.137383s
Run 5: time = 0.138805s
NumPy average: 0.134248s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example4.py --cunumeric
[0 - 7f29f42ce480] 0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7f29f42ce480] 0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7f29f42ce480] 0.000053 {4}{threads}: reservation ('GPU ctxsync 0x562e88c28700') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)
Generating random system on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.009685s
Run 2: time = 0.010043s
Run 3: time = 0.009966s
Run 4: time = 0.009739s
Run 5: time = 0.009383s
cuNumeric average: 0.009763s
That is a very good result. On these timings, the cuNumeric run is roughly 14x faster than the NumPy run (about 0.134s versus 0.0098s on average).
Code example 4: Sorting
Sorting is such a fundamental part of everything that happens in computing, and modern computers are so fast that most developers don’t even think about it. But let’s see how much of a difference cuNumeric can make to this ubiquitous operation. We’ll sort a large (30,000,000-element) 1D array of numbers.
# benchmark_sort.py
import time
import sys
import gc

# Array size
n = 30_000_000  # 30 million elements


def benchmark_numpy():
    import numpy as np

    print(f"Sorting an array of {n} elements with NumPy (5 runs)\n")
    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()
    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")


def benchmark_cunumeric():
    import cupynumeric as np

    print(f"Sorting an array of {n} elements with cuNumeric (5 runs)\n")
    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        # Force GPU sync
        _ = np.linalg.norm(np.zeros(()))
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()
        _ = np.linalg.norm(np.zeros(()))
    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")


if __name__ == "__main__":
    if "--cunumeric" in sys.argv:
        benchmark_cunumeric()
    else:
        benchmark_numpy()
The outputs.
(cunumeric-env) tom@tpr-desktop:~$ python example5.py
--- NumPy (CPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)
Creating random array on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.588777s
Run 2: time = 0.586813s
Run 3: time = 0.586745s
Run 4: time = 0.586525s
Run 5: time = 0.583783s
NumPy average: 0.586529s
-----------------------------
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example5.py --cunumeric
[0 - 7fd9e4615480] 0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7fd9e4615480] 0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7fd9e4615480] 0.000082 {4}{threads}: reservation ('GPU ctxsync 0x564489232fd0') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)
Creating random array on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.010857s
Run 2: time = 0.007927s
Run 3: time = 0.007921s
Run 4: time = 0.008240s
Run 5: time = 0.007810s
cuNumeric average: 0.008551s
-------------------------------
Yet another hugely impressive performance from cuNumeric and Legate.
Summary
This article introduced cuNumeric, an NVIDIA library designed as a high-performance, drop-in replacement for NumPy. The key takeaway is that data scientists can speed up their existing Python code on NVIDIA GPUs with minimal effort, often by simply changing a single import line and running the script with the ‘legate’ command.
Two main components power the technology:
- Legate: An open-source runtime layer from NVIDIA that automatically translates high-level Python operations into tasks. It intelligently manages distributing these tasks across single or multiple GPUs, handling data partitioning, memory management (even spilling to disk if needed), and optimising communication.
- cuNumeric: The user-facing library that mirrors the NumPy API. When you make a call like np.matmul(), cuNumeric converts it into a task for the Legate engine to execute on the GPU.
I was able to validate Nvidia’s performance claims by running four benchmark tests on my desktop PC (with an NVIDIA RTX 4070 Ti GPU), comparing standard NumPy on the CPU against cuNumeric on the GPU.
The results reveal significant performance gains for cuNumeric:
- Matrix Multiplication: ~10x faster than NumPy.
- Logistic Regression Training: ~6x faster.
- Solving Linear Equations: roughly 14x faster on the timings shown above.
- Sorting a Large Array: Another huge improvement, running roughly 70x faster.
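The speedups can be recomputed directly from the average timings printed in the logs above. The short script below does that arithmetic; the numbers are copied from those logs and nothing else is assumed.

```python
# (NumPy average, cuNumeric average) in seconds, copied from the logs above.
averages = {
    "Matrix multiplication": (0.0994, 0.0093),
    "Logistic regression": (12.166, 1.961),
    "Linear solve": (0.134248, 0.009763),
    "Sorting": (0.586529, 0.008551),
}

# Speedup = CPU time divided by GPU time for the same workload.
speedups = {name: cpu / gpu for name, (cpu, gpu) in averages.items()}

for name, factor in speedups.items():
    print(f"{name:<24} {factor:5.1f}x")
```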
In conclusion, I showed that cuNumeric successfully delivers on its promise, making the immense computational power of GPUs accessible to the broader Python data science community without requiring a steep learning curve or a complete code rewrite.
For more information and links to related resources, check out the original Nvidia announcement on cuNumeric here.