NumPy API on a GPU?

Is this the future of Python numerical computation?

Late last year, NVIDIA made a big announcement regarding the future of Python-based numerical computing. I wouldn’t be surprised if you missed it. After all, every other announcement from every AI company, then and now, seems mega-important.

That announcement introduced the cuNumeric library, a drop-in replacement for the ubiquitous NumPy library, built on top of the Legate framework.

Who are Nvidia?

Most people will probably know Nvidia from the ultra-fast chips that power computers and data centres all over the world. You may even be familiar with Nvidia’s charismatic, leather jacket-loving CEO, Jensen Huang, who seems to pop up on stage at every AI conference these days.

What many people don’t know is that Nvidia also designs and creates innovative device architectures and associated software. One of its most prized products is the Compute Unified Device Architecture (CUDA). CUDA is NVIDIA’s proprietary parallel-computing platform and programming model. Since its launch in 2007, it has evolved into a comprehensive ecosystem comprising drivers, runtime, compilers, math libraries, debugging and profiling tools, and container images. The result is a finely tuned hardware and software loop that keeps NVIDIA GPUs at the centre of modern high-performance and AI workloads.

What’s Legate?

Legate is an NVIDIA-led open-source runtime layer that lets you run familiar Python data-science libraries (NumPy, cuNumeric, Pandas-style APIs, sparse linear-algebra kernels, …) on multi-core CPUs, single or multi-GPU nodes, and even multi-node clusters without changing your Python code. It translates high-level array operations into a graph of fine-grained tasks and hands that graph to the C++ Legion runtime, which schedules the tasks, partitions the data, and moves tiles between CPUs, GPUs and network links for you.

In a nutshell, Legate lets familiar single-node Python libraries scale transparently to multi-GPU, multi-node machines.

What’s cuNumeric?

cuNumeric is a drop-in replacement for NumPy whose array operations are executed by Legate’s task engine and accelerated on one or many NVIDIA GPUs (or, if no GPU is present, on all CPU cores). In practice, you install it and need only change one import line to start using it in place of your regular NumPy code. For example …

# old
import numpy as np
...
...

# new
import cupynumeric as np     # everything else stays the same
...
...

… and run your script on the terminal with the legate command.
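If the same script has to run on machines with and without cuNumeric installed, one option is to guard the import. The snippet below is a minimal sketch of that idea, not something the cuNumeric documentation prescribes; BACKEND is a variable name made up here for illustration.

```python
# Prefer cuNumeric when it is installed; otherwise fall back to plain NumPy.
# BACKEND is a hypothetical helper variable used only to report which one loaded.
try:
    import cupynumeric as np
    BACKEND = "cupynumeric"
except ImportError:
    import numpy as np
    BACKEND = "numpy"

# The rest of the script is written once against the shared NumPy-style API.
x = np.linspace(0.0, 1.0, 5)
total = float(np.sum(x * x))
print(f"backend={BACKEND}, sum of squares={total}")
```

The trade-off is that a silent fallback can hide a broken GPU install, so printing the chosen backend (as above) is a sensible habit.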

Behind the scenes, cuNumeric converts each NumPy call you make (for example, np.sin, np.linalg.svd, fancy indexing, broadcasting, reductions, etc.) into Legate tasks. Those tasks will:

Partition your arrays into tiles sized to fit GPU memory.
Schedule each tile on the best available device (GPU or CPU).
Overlap compute with communication when the workload spans multiple GPUs or nodes.
Spill tiles to NVMe/SSD automatically when your dataset outgrows GPU RAM.
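To make the idea of tiling concrete, here is a toy sketch in plain NumPy. It only illustrates splitting one array operation into per-tile tasks; it runs sequentially on the CPU and says nothing about how Legate actually partitions or schedules work. The run_tiled helper is a hypothetical name.

```python
import numpy as np

def run_tiled(func, array, n_tiles):
    """Apply an element-wise `func` tile by tile and stitch the results together."""
    tiles = np.array_split(array, n_tiles)     # partition into roughly equal tiles
    results = [func(tile) for tile in tiles]   # one "task" per tile (sequential here)
    return np.concatenate(results)

data = np.linspace(0.0, np.pi, 1_000_000)
tiled = run_tiled(np.sin, data, n_tiles=8)

# The tiled result should match the single whole-array call.
print(np.allclose(tiled, np.sin(data)))
```

In a real runtime, each tile could be dispatched to a different device, which is where the scheduling and data-movement work described above comes in.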

Because cuNumeric’s API mirrors NumPy’s almost one-for-one, existing scientific or data-science code can scale from a laptop to a multi-GPU cluster without a rewrite.

Performance advantages

So, this all sounds great, right? But it only makes sense if it results in tangible performance improvements over NumPy, and Nvidia is making some strong claims that this is the case. As data scientists, machine learning engineers and data engineers typically use NumPy a lot, we can appreciate that this can be a crucial aspect of the systems we write and maintain.

Now, I don’t have a cluster of GPUs or a supercomputer to test this on, but my desktop PC does have an Nvidia GeForce RTX 4070 Ti GPU, and we’re going to use that to test some of Nvidia’s claims.

(base) tom@tpr-desktop:~$ nvidia-smi
Sun Jun 15 15:26:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.75                 Driver Version: 566.24         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     On  |   00000000:01:00.0  On |                  N/A |
| 32%   29C    P8              9W /  285W |    1345MiB /  12282MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I’ll install cuNumeric and NumPy on my PC to run comparative tests. This will help us assess whether Nvidia’s claims are accurate and understand the performance differences between the two libraries.

Setting up a development environment

As always, I like to set up a separate development environment to run my tests. That way, nothing I do in that environment will affect any of my other projects. At the time of writing, cuNumeric is not available to install on Windows, so I’ll be using WSL2 Ubuntu for Windows instead.

I’ll be using Miniconda to set up my environment, but feel free to use whichever tool you’re comfortable with.

$ conda create -n cunumeric-env python=3.10 -c conda-forge
$ conda activate cunumeric-env
$ conda install -c conda-forge -c legate cupynumeric
$ conda install -c conda-forge ucx cuda-cudart cuda-version=12

Code example 1 — A simple matrix multiplication

Matrix multiplication is the bread and butter of the mathematical operations that underpin so many AI systems, so it makes sense to try that operation out first.

import time
import gc
import argparse
import sys

def benchmark_numpy(n, runs):
    """Runs the matrix multiplication benchmark using standard NumPy on the CPU."""
    import numpy as np
    
    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print(f"Generating two {n}x{n} random matrices on CPU...")
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = np.matmul(A, B)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed. The @ operator is a convenient
        # shorthand for np.matmul.
        C = A @ B
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C # Clean up the result matrix
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.4f}s\n")
    return avg

def benchmark_cunumeric(n, runs):
    """Runs the matrix multiplication benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np # Import numpy for the canonical sync
    
    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print(f"Generating two {n}x{n} random matrices on GPU...")
    A = cn.random.rand(n, n).astype(np.float32)
    B = cn.random.rand(n, n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    C_warmup = cn.matmul(A, B)
    # Best practice for synchronization: force a copy back to the CPU.
    _ = np.array(C_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        
        # Launch the operation on the GPU
        C = A @ B
        
        # Synchronize by converting the result to a host-side NumPy array.
        np.array(C)

        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.4f}s\n")
    return avg

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Benchmark matrix multiplication on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    
    args, unknown = parser.parse_known_args()

    # The dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)

Running the NumPy side of things uses the regular python example1.py command-line syntax. Running under Legate is more involved. The command shown in the output below disables Legate’s automatic configuration and then launches the example1.py script under Legate with one CPU, one GPU, and zero OpenMP threads, with the script’s --cunumeric flag selecting the cuNumeric code path.
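If you launch these benchmarks repeatedly, it can help to assemble the Legate command in one place. The sketch below only builds and prints the command seen in the terminal output (the LEGATE_AUTO_CONFIG variable and the --cpus, --gpus and --omps flags are taken from there); build_legate_command is a hypothetical helper name, and nothing is executed.

```python
import os
import shlex

def build_legate_command(script, script_args=(), cpus=1, gpus=1, omps=0):
    """Return (env, argv) for launching `script` under the legate launcher."""
    # Copy the current environment and switch off Legate's automatic configuration.
    env = dict(os.environ, LEGATE_AUTO_CONFIG="0")
    argv = [
        "legate",
        "--cpus", str(cpus),
        "--gpus", str(gpus),
        "--omps", str(omps),
        script,
        *script_args,
    ]
    return env, argv

env, argv = build_legate_command("example1.py", ["--cunumeric"])
print("LEGATE_AUTO_CONFIG=0", shlex.join(argv))
# To actually launch it, pass both to subprocess.run(argv, env=env, check=True).
```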

Here is the output.

(cunumeric-env) tom@tpr-desktop:~$ python example1.py
--- NumPy (CPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)

Generating two 3000x3000 random matrices on CPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.0976s
Run 2: time = 0.0987s
Run 3: time = 0.0957s
Run 4: time = 0.1063s
Run 5: time = 0.0989s

NumPy average: 0.0994s

(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example1.py --cunumeric
[0 - 7f2e8fcc8480]    0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7f2e8fcc8480]    0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7f2e8fcc8480]    0.000049 {4}{threads}: reservation ('GPU ctxsync 0x55cd5fd34530') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)

Generating two 3000x3000 random matrices on GPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.0113s
Run 2: time = 0.0089s
Run 3: time = 0.0086s
Run 4: time = 0.0090s
Run 5: time = 0.0087s

cuNumeric average: 0.0093s

Well, that’s a strong start. cuNumeric is registering a speedup of more than 10x over NumPy.

The warnings that Legate outputs can be ignored. They are informational, indicating that Legate couldn’t find details about the machine’s CPU/memory layout (NUMA) or enough CPU cores to manage the GPU.

Code example 2 — Logistic regression

Logistic regression is a foundational tool in data science because it provides a simple, interpretable way to model and predict binary outcomes (yes/no, pass/fail, click/no-click). In this example, we’ll measure how long it takes to train a simple binary classifier on synthetic data. The script first generates N samples with D features (X) and a corresponding random 0/1 label vector (y). Each of the five timed runs then initialises the weight vector w to zeros and performs 500 iterations of batch gradient descent: computing the linear predictions z = X.dot(w), applying the sigmoid p = 1/(1+exp(-z)), computing the gradient grad = X.T.dot(p - y) / N, and updating the weights with w -= 0.1 * grad. The script records the elapsed time for each run, cleans up memory, and finally prints the average training time.
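To see the update rule in action before scaling it up, here is a single gradient step on a hand-sized dataset. This is a small illustrative sketch in plain NumPy; the gradient_step helper and the four-sample dataset are made up for the example and are not part of the benchmark script.

```python
import numpy as np

def gradient_step(X, y, w, alpha):
    """One batch gradient descent step for logistic regression."""
    z = X.dot(w)                         # linear predictions
    p = 1.0 / (1.0 + np.exp(-z))         # sigmoid
    grad = X.T.dot(p - y) / X.shape[0]   # average gradient over the batch
    return w - alpha * grad

# Four samples, two features; the label simply copies the first feature.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = gradient_step(X, y, np.zeros(2), alpha=0.1)
print(w)
```

Starting from w = 0, every prediction is 0.5, so the gradient is [-0.25, 0] and the step nudges only the first weight upwards, which is the feature that actually predicts the label.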

import time
import gc
import argparse
import sys

# --- Reusable Training Function ---
# By putting the training loop in its own function, we avoid code duplication.
# The `np` argument allows us to pass in either the numpy or cupynumeric module.
def train_logistic_regression(np, X, y, iters, alpha):
    """Performs a fixed number of gradient descent iterations."""
    # Ensure w starts on the right device (CPU or GPU)
    w = np.zeros(X.shape[1])
    
    for _ in range(iters):
        z = X.dot(w)
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T.dot(p - y) / X.shape[0]
        w -= alpha * grad
    
    return w

def benchmark_numpy(n_samples, n_features, iters, alpha):
    """Runs the logistic regression benchmark using standard NumPy on the CPU."""
    import numpy as np
    
    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random dataset on CPU...")
    X = np.random.rand(n_samples, n_features)
    y = (np.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = train_logistic_regression(np, X, y, iters, alpha)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(args.runs):
        start = time.time()
        # The operation being timed
        _ = train_logistic_regression(np, X, y, iters, alpha)
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.3f}s\n")
    return avg

def benchmark_cunumeric(n_samples, n_features, iters, alpha):
    """Runs the logistic regression benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np # Also import numpy for the canonical synchronization
    
    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print("Generating random dataset on GPU...")
    X = cn.random.rand(n_samples, n_features)
    y = (cn.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    w_warmup = train_logistic_regression(cn, X, y, iters, alpha)
    # Best practice for synchronization: force a copy back to the CPU.
    _ = np.array(w_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(args.runs):
        start = time.time()
        
        # Launch the operation on the GPU
        w = train_logistic_regression(cn, X, y, iters, alpha)
        
        # Synchronize by converting the result back to a NumPy array.
        np.array(w)

        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        del w
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.3f}s\n")
    return avg

if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark logistic regression on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    # Hyperparameters for the model
    parser.add_argument(
        "-n", "--n_samples", type=int, default=2_000_000, help="Number of data samples"
    )
    parser.add_argument(
        "-d", "--n_features", type=int, default=10, help="Number of features"
    )
    parser.add_argument(
        "-i", "--iters", type=int, default=500, help="Number of gradient descent iterations"
    )
    parser.add_argument(
        "-a", "--alpha", type=float, default=0.1, help="Learning rate"
    )
    # Benchmark control
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    
    args, unknown = parser.parse_known_args()

    # Dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n_samples, args.n_features, args.iters, args.alpha)
    else:
        benchmark_numpy(args.n_samples, args.n_features, args.iters, args.alpha)

And the outputs.

(cunumeric-env) tom@tpr-desktop:~$ python example2.py
--- NumPy (CPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations

Generating random dataset on CPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 12.292s
Run 2: time = 11.830s
Run 3: time = 11.903s
Run 4: time = 12.843s
Run 5: time = 11.964s

NumPy average: 12.166s

(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example2.py --cunumeric
[0 - 7f04b535c480]    0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7f04b535c480]    0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7f04b535c480]    0.001149 {4}{threads}: reservation ('GPU ctxsync 0x55fb037cf140') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations

Generating random dataset on GPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 1.964s
Run 2: time = 1.957s
Run 3: time = 1.968s
Run 4: time = 1.955s
Run 5: time = 1.960s

cuNumeric average: 1.961s

Not quite as impressive as our first example, but a speedup of roughly 6x on an already fast NumPy program is not to be sniffed at.

Code example 3 — solving linear equations

This script benchmarks how long it takes to solve a dense 3000×3000 system of linear equations. This is a fundamental operation in linear algebra: solving an equation of the form Ax = b, where A is a large grid of numbers (a 3000×3000 matrix in this case) and b is a list of numbers (a vector).

The goal is to find the unknown list of numbers x that makes the equation true. This is a computationally intensive task that’s at the heart of many scientific simulations, engineering problems, financial models, and even some AI algorithms.
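For a feel of what the solver does, here is the same operation on a system small enough to check by hand. The 2×2 matrix and right-hand side are made up for illustration; np.linalg.solve is the same call the benchmark times.

```python
import numpy as np

# Two equations, two unknowns:
#   3*x0 + 1*x1 = 9
#   1*x0 + 2*x1 = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)
print(x)                       # approximately [2. 3.]

# Plugging x back in should reproduce b (up to floating-point rounding).
print(np.allclose(A @ x, b))
```

The benchmark below does exactly this, only with 3000 equations and 3000 unknowns.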

import time
import gc
import argparse
import sys # Import sys to inspect command-line arguments

# Note: The library imports (numpy and cupynumeric) are now done *inside*
# their respective functions to keep them separate and avoid import errors.

def benchmark_numpy(n, runs):
    """Runs the linear solve benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random system on CPU...")
    A = np.random.randn(n, n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)

    # 2. Perform one untimed warm-up run. This is good practice even for
    # the CPU to make sure caches are warm and any one-time setup is completed.
    print("Performing warm-up run...")
    _ = np.linalg.solve(A, b)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed
        x = np.linalg.solve(A, b)
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the result to be safe with memory
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")
    return avg

def benchmark_cunumeric(n, runs):
    """Runs the linear solve benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np # Also import numpy for the canonical synchronization

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    # This ensures we are not timing the data generation in our main loop.
    print("Generating random system on GPU...")
    A = cn.random.randn(n, n).astype(np.float32)
    b = cn.random.randn(n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run. This handles JIT
    # compilation and other one-time GPU setup costs.
    print("Performing warm-up run...")
    x_warmup = cn.linalg.solve(A, b)
    # Best practice for synchronization: force a copy back to the CPU.
    _ = np.array(x_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()

        # Launch the operation on the GPU
        x = cn.linalg.solve(A, b)

        # Synchronize by converting the result to a host-side NumPy array.
        # This is guaranteed to block until the GPU has finished.
        np.array(x)

        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the GPU array result
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")
    return avg

if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark linear solve on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )

    # Use parse_known_args() to handle potential extra arguments from Legate
    args, unknown = parser.parse_known_args()

    # The dispatcher logic: check if "--cunumeric" is in the command line.
    # This is a simple and effective way to switch between modes.
    if "--cunumeric" in sys.argv or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)

The outputs.

(cunumeric-env) tom@tpr-desktop:~$ python example4.py
--- NumPy (CPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)

Generating random system on CPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.133075s
Run 2: time = 0.126129s
Run 3: time = 0.135849s
Run 4: time = 0.137383s
Run 5: time = 0.138805s

NumPy average: 0.134248s

(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example4.py --cunumeric
[0 - 7f29f42ce480]    0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7f29f42ce480]    0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7f29f42ce480]    0.000053 {4}{threads}: reservation ('GPU ctxsync 0x562e88c28700') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)

Generating random system on GPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.009685s
Run 2: time = 0.010043s
Run 3: time = 0.009966s
Run 4: time = 0.009739s
Run 5: time = 0.009383s

cuNumeric average: 0.009763s

That is another strong result. Going by the averages above, the cuNumeric run is roughly 14x faster than the NumPy run.

Code example 4 — Sorting

Sorting is such a fundamental part of everything that happens in computing, and modern computers are so fast, that most developers don’t even think about it. But let’s see how much of a difference cuNumeric can make to this ubiquitous operation. We’ll sort a large (30,000,000-element) 1D array of numbers.

# benchmark_sort.py
import time
import sys
import gc

# Array size
n = 30_000_000 # 30 million elements

def benchmark_numpy():
    import numpy as np
    print(f"Sorting an array of {n} elements with NumPy (5 runs)\n")

    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")

def benchmark_cunumeric():
    import cupynumeric as np
    print(f"Sorting an array of {n} elements with cuNumeric (5 runs)\n")

    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        # Force GPU sync
        _ = np.linalg.norm(np.zeros(()))
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()
        _ = np.linalg.norm(np.zeros(()))

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")

if __name__ == "__main__":
    if "--cunumeric" in sys.argv:
        benchmark_cunumeric()
    else:
        benchmark_numpy()

The outputs.

(cunumeric-env) tom@tpr-desktop:~$ python example5.py
--- NumPy (CPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)

Creating random array on CPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.588777s
Run 2: time = 0.586813s
Run 3: time = 0.586745s
Run 4: time = 0.586525s
Run 5: time = 0.583783s

NumPy average: 0.586529s
-----------------------------

(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example5.py --cunumeric
[0 - 7fd9e4615480]    0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7fd9e4615480]    0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7fd9e4615480]    0.000082 {4}{threads}: reservation ('GPU ctxsync 0x564489232fd0') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)

Creating random array on GPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.010857s
Run 2: time = 0.007927s
Run 3: time = 0.007921s
Run 4: time = 0.008240s
Run 5: time = 0.007810s

cuNumeric average: 0.008551s
-------------------------------

Yet another hugely impressive performance from cuNumeric and Legate.

Summary

This article introduced cuNumeric, an NVIDIA library designed as a high-performance, drop-in replacement for NumPy. The key takeaway is that data scientists can speed up their existing Python code on NVIDIA GPUs with minimal effort, often by simply changing a single import line and running the script with the ‘legate’ command.

Two main components power the technology:

Legate: An open-source runtime layer from NVIDIA that automatically translates high-level Python operations into tasks. It manages the distribution of these tasks across single or multiple GPUs, handling data partitioning, memory management (even spilling to disk if needed), and optimising communication.
cuNumeric: The user-facing library that mirrors the NumPy API. When you make a call like np.matmul(), cuNumeric converts it into a task for the Legate engine to execute on the GPU.

I was able to test Nvidia’s performance claims by running four benchmark tests on my desktop PC (with an NVIDIA RTX 4070 Ti GPU), comparing standard NumPy on the CPU against cuNumeric on the GPU.

The results show significant performance gains for cuNumeric:

Matrix Multiplication: ~10x faster than NumPy.
Logistic Regression Training: ~6x faster.
Solving Linear Equations: roughly 14x faster, going by the averages reported above.
Sorting a Large Array: Another huge improvement, running roughly 70x faster.
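As a quick cross-check, the speedups can be recomputed directly from the average timings printed in the benchmark outputs. The snippet below is plain arithmetic on those reported numbers; the dictionary layout and names are just one way to organise it.

```python
# (NumPy average, cuNumeric average) in seconds, copied from the outputs above.
averages = {
    "matrix multiplication": (0.0994, 0.0093),
    "logistic regression": (12.166, 1.961),
    "linear solve": (0.134248, 0.009763),
    "sorting": (0.586529, 0.008551),
}

# Speedup = CPU time divided by GPU time.
speedups = {name: cpu / gpu for name, (cpu, gpu) in averages.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```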

In conclusion, I showed that cuNumeric delivers on its promise, making the immense computational power of GPUs accessible to the broader Python data science community without requiring a steep learning curve or a complete code rewrite.

For more information and links to related resources, check out the original Nvidia announcement on cuNumeric here.
