Simulating large-scale quantum computers has turn into tougher as the standard of quantum processing units (QPUs) improves. Validating the outcomes is essential to make sure that after the devices scale beyond what’s classically simulable, we are able to still trust the outputs.
Similarly, when generating large-scale datasets for various AI models that aim to help within the operation of quantum processors, we see the necessity to offer useful training data in any respect scales and abstractions accelerated by GPUs. Examples include AI quantum error correction decoders, AI compilers, AI agents for calibration and control, and models to generate recent device designs.
cuQuantum SDK is a set of high-performance libraries and tools for accelerating quantum computing simulations at each the circuit and device levels by orders of magnitude. The most recent version cuQuantum SDK, v25.11 introduces components that speed up two recent workloads: Pauli propagation and stabilizer simulations. Each of those is critical for simulating large scale quantum computers.
This post dives into how you may start running Pauli propagation simulations and speed up sampling out of your stabilizer simulations to resolve these problems with GPU-accelerated supercomputers.
cuQuantum cuPauliProp
Pauli propagation is a comparatively recent method for efficiently simulating the observables of large-scale quantum circuits, which might include noise models of real quantum processors. By expressing states and observables as weighted sums of Pauli tensor products, circuit simulation can dynamically discard terms which contribute insignificantly to a sought expectation value. This allows estimation of experimental quantities that are otherwise intractable for exact simulation.
Many relevant quantum computing applications are centered around computation of expectation values, for instance VQE and quantum simulation of physical dynamics. Various exact and approximate classical simulation techniques enable calculating such observables for big circuits, though they turn into prohibitively expensive in differing settings. For instance, the Matrix Product State technique, a extremely popular approximate tensor network state method for circuit simulation, is usually ill-suited for big circuits which encode the dynamics of two or three dimensional physical systems.
Pauli propagation is a complementary and useful addition to the approximate circuit simulation toolbox, for each pure and noisy circuits. Beyond being provably efficient for simulating near-Clifford and/or very noisy circuits, Pauli propagation has shown impressive performance when simulating circuits which Trotterize the evolution of certain quantum spin systems. This includes some “utility circuits” named in reference to their use in IBM’s utility experiment involving a 127 qubit device as detailed in Evidence for the Utility of Quantum Computing Before Fault Tolerance. Characterizing which circuits could be efficiently simulated with Pauli propagation is an ongoing research effort, as significant as refinement of the algorithmic details of the tactic itself.
cuQuantum 25.11 offers primitives to speed up Pauli propagation or derivative methods on NVIDIA GPUs with the discharge of this recent cuQuantum library, enabling developers and researchers to advance the frontier of classical circuit simulation. Core functions are described in the next sections.
Library initialization
Initialize the library handle and workspace descriptor required for operations:
import cupy as cp
from cuquantum.bindings import cupauliprop
from cuquantum import cudaDataType
# Create library handle and workspace descriptor
handle = cupauliprop.create()
workspace = cupauliprop.create_workspace_descriptor(handle)
# Assign GPU memory to workspace
ws_size = 1024 * 1024 * 64 # Example: 64 MiB
d_ws = cp.cuda.alloc(ws_size)
cupauliprop.workspace_set_memory(
handle, workspace, cupauliprop.Memspace.DEVICE,
cupauliprop.WorkspaceKind.WORKSPACE_SCRATCH, d_ws.ptr, ws_size
)
Define an observable
To begin the simulation, allocate device memory for the Pauli expansions (sums of products of Pauli operators expressed as a set of unsigned integers in addition to their coefficients) and initialize the input expansion with an observable (for instance, ).
# Helper to encode Pauli string into packed integers (2 bits per qubit: X and Z masks)
def encode_pauli(num_qubits, paulis, qubits):
num_ints = cupauliprop.get_num_packed_integers(num_qubits)
# Packed integer format: [X_ints..., Z_ints...]
packed = np.zeros(num_ints * 2, dtype=np.uint64)
x_mask, z_mask = packed[:num_ints], packed[num_ints:]
for p, q in zip(paulis, qubits):
idx, bit = divmod(q, 64)
if p in (cupauliprop.PauliKind.PAULI_X, cupauliprop.PauliKind.PAULI_Y):
x_mask[idx] |= (1 << bit)
if p in (cupauliprop.PauliKind.PAULI_Z, cupauliprop.PauliKind.PAULI_Y):
z_mask[idx] |= (1 << bit)
return packed
# 1. Allocate Device Buffers
# Define capability (max variety of Pauli strings) and allocate buffers
max_terms = 10000
num_packed_ints = cupauliprop.get_num_packed_integers(num_qubits)
d_pauli = cp.zeros((max_terms, 2 * num_packed_ints), dtype=cp.uint64, order="C")
d_coef = cp.zeros(max_terms, dtype=cp.float64, order="C")
# 2. Populate Initial Observable (Z_62)
encoded_pauli = encode_pauli(num_qubits, [cupauliprop.PauliKind.PAULI_Z], [62])
# Assign the primary term
d_pauli[0] = cp.array(encoded_pauli)
d_coef[0] = 1.0
# 3. Create Pauli Expansions
# Input expansion: pre-populated with our observable
expansion_in = cupauliprop.create_pauli_expansion(
handle, num_qubits,
d_pauli.data.ptr, d_pauli.nbytes,
d_coef.data.ptr, d_coef.nbytes,
cudaDataType.CUDA_R_64F,
1, 1, 1 # num_terms=1, is_sorted=True, is_unique=True
)
# Output expansion: empty initially (num_terms=0), needs its own buffers
d_pauli_out = cp.zeros_like(d_pauli)
d_coef_out = cp.zeros_like(d_coef)
expansion_out = cupauliprop.create_pauli_expansion(
handle, num_qubits,
d_pauli_out.data.ptr, d_pauli_out.nbytes,
d_coef_out.data.ptr, d_coef_out.nbytes,
cudaDataType.CUDA_R_64F,
0, 0, 0
)
Operator creation
Define quantum gates or operators, corresponding to a Pauli rotation .
# Create a Z-rotation gate on qubit 0
paulis = [cupauliprop.PauliKind.PAULI_Z]
qubits = [0]
gate = cupauliprop.create_pauli_rotation_gate_operator(
handle, theta, 1, qubits, paulis
)
Operator application
Apply an operator (a gate or noise-channel) to the expansion, evolving the system. Note that almost all applications work within the so-called Heisenberg picture, which suggests that the gates within the circuit are applied in reverse order to the observable. This also requires passing the adjoint argument as True when applying the operator.
# Get a view of the present terms within the input expansion
num_terms = cupauliprop.pauli_expansion_get_num_terms(handle, expansion_out)
view = cupauliprop.pauli_expansion_get_contiguous_range(
handle, expansion_in, 0, num_terms)
# Apply gate: in_expansion -> gate -> out_expansion
cupauliprop.pauli_expansion_view_compute_operator_application(
handle, view, expansion_out, gate,
True, # adjoint?
False, False, # make_sorted?, keep_duplicates?
0, None, # Truncation strategies (optional)
workspace
)
Expectation values
Compute the expectation value (trace with the zero state ).
import numpy as np
result = np.zeros(1, dtype=np.float64)
# Compute trace
cupauliprop.pauli_expansion_view_compute_trace_with_zero_state(
handle, view, result.ctypes.data, workspace
)
Combining these methods shows that NVIDIA DGX B200 GPUs offer significant speedups over CPU based codes. For small coefficient cutoffs, multiple order of magnitude speedups are observed over single-threaded Qiskit Pauli-Prop on essentially the most recent dual-socket data center CPUs.


cuQuantum cuStabilizer
Stabilizer simulations arise from the Gottesman-Knill theorem, which states that gates throughout the Clifford group (normalizer of the qubit Pauli group) could be efficiently simulated classically in polynomial time. This Clifford group is made up of CNOT, Hadamard and Phase gates (S). For that reason, stabilizer simulations have been critical for resource estimation and testing quantum error correcting codes at large scales.
There are a couple of different approaches to constructing stabilizer simulators, from tableau simulators to border simulators. cuStabilizer currently addresses improving the throughput for sampling rates in a frame simulator.
Frame simulation only focuses on effects of quantum noise on the quantum state. Because the quantum devices are imperfect, it’s possible to model the imperfections in circuit execution by inserting random “noisy” gates in it. If the noise-free result is understood, getting the noisy result requires only to trace the difference, or how the noisy gates change the circuit output.
It seems that this effect is far easier to compute in comparison with full circuit simulation. The variety of possible mixtures of how noisy gates could be inserted grows very fast with the dimensions of the circuit, which suggests that to be able to reliably model the error correcting algorithm a lot of shots is required.
For users taken with developing quantum error correcting codes, testing recent decoders, or generating data for AI decoders, frame simulation is right. APIs can be found to enhance sampling and speed up any frame simulation on NVIDIA GPUs. The cuQuantum SDK cuStabilizer library exposes C API and Python API. While the C API will provide higher performance, the Python API is best for getting began, because it is more flexible and handles memory allocation for the user.
Create a circuit and apply frame simulation
cuStabilizer has two important classes involved within the simulation: Circuit and FrameSimulator. The circuit can accept a string that accommodates circuit instructions, much like the format utilized in the Stim CPU simulator. To create a FrameSimulator you might want to specify information in regards to the circuit, to allocate enough resources.
import cuquantum.stabilizer as cust
# Circuit information
num_qubits = 5
num_shots = 10_000
num_measurements = 2
# Create a circuit on GPU
circ = cust.Circuit("""
H 0 1
X_ERROR(0.1) 1 2
DEPOLARIZE2(0.5) 2 3
CX 0 1 2 3
M 0 3
"""
sim = cust.FrameSimulator(
num_qubits,
num_shots,
num_measurements
)
sim.apply(circ)
You’ll be able to reuse a simulator between different circuits, so long as your simulator has enough qubits available. The next code will apply a circuit to a state modified by the primary circuit circ.
circ2 = cust.Circuit("""
Z_ERROR(0.01) 1 4
""")
sim.apply(circ2)
Read simulation results
The state of simulator consists of three bit-tables:
- x_bits
- z_bits
- measurement_bits
The primary two tables store the Pauli Frame (much like the cuPauliProp Pauli Expansion, but in a special layout and without the weights). The third stores the difference between noise-free measurement and the noisy measurements in each shot.
Essentially the most efficient method to store the bits is to encode them in an integer value. That is known as “bit-packed” format, where each byte in memory stores eight significant bits. While this format is best, manipulating individual bits requires extra steps in your program. The bit-packed format just isn’t easily integrated with the common notion of “array,” as those are considered to contain values of several bytes, corresponding to int32.
To offer a straightforward representation in numpy, cuStabilizer supports the bit_packed argument, which might toggle between different formats. If bit_packed=False, each bit is encoded in a single uint8 value, thus using 8x more memory. When specifying input bit tables, the format can be necessary for performance, as described within the cuQuantum documentation.
# Get measurement flips
m_table = sim.get_measurement_bits(bit_packed=False)
print(m_table.dtype)
# uint8
print(m_table.shape)
# (2, 10000)
print(m_table)
# [[0 0 0 ... 0 0 0]
# [1 0 0 ... 0 1 1]]
x_table, z_table = sim.get_pauli_xz_bits(bit_packed=True)
print(x_table.dtype)
# uint8
print(x_table.shape)
# (5, 1252)
For quick access to the underlying Pauli frames, cuStabilizer provides a PauliTable class, which could be indexed by the shot index:
# Get pauli table
pauli_table = sim.get_pauli_table()
num_frames_print = 5
for i in range(num_frames_print):
print(pauli_table[i])
# ...XZ
# ZXX..
# ...Z.
# .....
# ...Z.
When leveraging the sampling API we see that we are able to drastically improve the throughput compared to Google Stim, cutting-edge code on the newest data center CPUs.
Surface code simulation
cuStabilizer can accept Stim circuits as input, and you should utilize it to simulate surface code circuits:
import stim
p = 0.001
circ_stim = stim.Circuit.generated(
"surface_code:rotated_memory_z",
distance=5,
rounds=5,
after_clifford_depolarization=p,
after_reset_flip_probability=p,
before_measure_flip_probability=p,
before_round_data_depolarization=p,
)
circ = cust.Circuit(circ_stim)
sim = cust.FrameSimulator(
circ_stim.num_qubits,
num_shots,
circ_stim.num_measurements,
num_detectors=circ_stim.num_detectors,
)
sim.apply(circ)
pauli_table = sim.get_pauli_table()
for i in range(num_frames_print):
print(pauli_table[i])
Note that essentially the most efficient simulation is achieved for a lot of samples and variety of qubits. Moreover, the most effective performance is achieved when the resulting bit tables are kept on GPU, as when using the cupy package.
Figure 2 demonstrates the most effective use of cuStabilizer and expected performance on the NVIDIA B200 GPU and Intel Xeon Platinum 8570 CPU. It shows that the optimal performance for a code distance 31 is achieved at about one million shots. Users can get a 1,060x speedup for big code distances.


Start with recent cuQuantum libraries
The most recent functionalities in cuQuantum proceed to push the bounds of what is feasible with GPU based quantum computer emulations enabling two recent major classes of workloads. These workloads are critical for quantum error correction, verification and validation, and algorithm engineering for intermediate to large scale quantum devices.
Start with cuQuantum cuPauliProp using pip install cupauliprop-cu13. To learn more, review the cuPauliProp documentation.
Start with cuQuantum cuStabilizer using pip install custabilizer-cu13. To learn more, review the cuStabilizer documentation.
