Establishing a Scalable Sparse Ecosystem with the Universal Sparse Tensor

Sparse tensors are vectors, matrices, and higher-dimensional generalizations with many zeros. They’re crucial in fields such as scientific computing, signal processing, and deep learning because of their efficiency in storage, computation, and power. Despite their advantages, handling sparse tensors manually or through existing libraries is often cumbersome, error-prone, nonportable, and doesn’t scale with the combinatorial explosion of sparsity patterns, data types, operations, and targets.

Research largely focuses on sparse storage formats: data structures that compactly store nonzeros and permit efficient operations that avoid redundancies such as x+0=x and x*0=0. This allows scaling to larger problem sizes or solving the same sizes with fewer resources. No single sparse format is perfect; the best choice depends on the nonzero distribution, the operations, and the target architecture.

The Universal Sparse Tensor (UST) decouples a tensor’s sparsity from its in-memory storage representation. The UST uses a domain-specific language (DSL) to describe how a tensor should be represented in memory, so developers focus only on the sparsity of a tensor. Compile-time or runtime inspection of the chosen format for the operands in sparsity-agnostic polymorphic operations then decides between dispatching to an optimized handwritten library or, when no such solution exists, falling back to automatic sparse code generation. The UST has its roots in sparse compiler technology; see, for instance, Compiler Support for Sparse Matrix Computations, Sparse Tensor Algebra Compilation, and Compiler Support for Sparse Tensor Computations in MLIR.

This post focuses on how developers can use the UST to define common, but also less common, sparse storage formats, each tailored to specific properties of the sparse tensors that occur in their application. Interoperability with, for instance, SciPy, CuPy, and PyTorch sparse tensors maps common formats like COO, CSR, and DIA to the corresponding DSL of the UST. However, developers can also define their own novel sparsity formats via the DSL. Combined with dispatch or code generation, this enables the design of novel sparse formats without explicit coding, as will be shown in future posts.

The tensor format DSL

The tensor format DSL maps tensor dimensions (logical tensor indices) to storage levels (physical memory indices) using an invertible function that defines how each level should be stored. It consists of:

1. An ordered sequence of dimension specifications, such as (i, j), each of which contains a dimension expression to provide a named reference to each dimension.

2. An ordered sequence of level specifications, such as (i : dense, j : compressed), each of which contains a level expression, which defines what is stored in each level, and a required level type, which defines how the level is stored, including:

  • A set of level properties, such as unique and ordered
  • A required level format, such as:
    • dense: The underlying storage only involves indexing, similar to a dense array
    • compressed: At that level, a positions array defines the locations of each stored level and a coordinates array stores the indices within each stored level
    • singleton: A compressed variant where each coordinate belongs directly to the parent at the previous level
    • range: A dense variant where coordinate values are based on a compression at a previous level

The positions and coordinates arrays are stored as jagged (variable-sized) arrays, indexed first by level, namely pos[l], and then by content, where the positions use the convention that two consecutive elements define a half-open interval from pos[l][i] up to pos[l][i+1]. Where unused, jagged arrays are left empty. After all levels, a single values array contains the linearized numerical values of all stored elements. What makes the UST universal is that the DSL above can be easily extended to include even more storage formats.
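To make these conventions concrete, the following is a minimal Python sketch (illustrative only, not the actual UST implementation) of the jagged pos/crd arrays and the flat values array, including the half-open interval convention, for CSR-like storage of a small hypothetical 2×3 matrix:

```python
# Illustrative sketch of UST-style storage; not the real UST data structures.
# CSR-like storage of the hypothetical 2x3 matrix [[5, 0, 7], [0, 0, 2]].

pos = [[], [0, 2, 3]]     # level 0 is dense, so pos[0] is left empty
crd = [[], [0, 2, 2]]     # level 1: column coordinates of the stored entries
values = [5.0, 7.0, 2.0]  # linearized values of all stored elements

def row_entries(i):
    """Yield (column, value) pairs of row i via the half-open interval
    [pos[1][i], pos[1][i+1]) into crd[1] and values."""
    for p in range(pos[1][i], pos[1][i + 1]):
        yield crd[1][p], values[p]

print(list(row_entries(0)))  # [(0, 5.0), (2, 7.0)]
print(list(row_entries(1)))  # [(2, 2.0)]
```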

UST examples

The UST examples in this section show that the number of possibilities for storing dense and sparse tensors grows rapidly with the dimensionality.

Scalars

A scalar is a 0-dimensional (0D) tensor, so there are no dimensions and thus no levels. As such, the values array in the UST storage consists of a single “dense” element that contains the scalar value (Figure 1).

Scalar stored as a single element in the values array.
Figure 1. UST storage of a scalar with value 1

Vectors

A vector is a 1-dimensional (1D) tensor that can be interpreted visually as either a row or a column vector. A sample sparse row vector is shown in Figure 2.

A sample vector with five nonzeros.
Figure 2. A sample sparse row vector

Two common ways of storing a vector, dense or compressed, are shown below:

(i) -> (i : dense)         # dense vector
(i) -> (i : compressed)    # sparse vector

Figure 3 shows the UST storage of the sample vector using these two formats. Dense storage uses zero padding to get filled levels in the values array; the positions and coordinates arrays are unused. Sparse storage uses the positions array to define the half-open interval [0,5), which spans the coordinates and values of all stored elements.

Two side-by-side images: UST storage of a dense vector (left) and sparse vector (right).
Figure 3. UST storage of a dense vector (left) and sparse vector (right)
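Both encodings can be sketched in Python for a hypothetical length-12 vector with five nonzeros (the exact figure data is not reproduced here; the data is chosen so that the element at index 8 is -1, as in the text):

```python
# Hypothetical length-12 vector with five nonzeros (not the exact figure data).
v = [3, 0, 1, 0, 0, 0, 0, 0, -1, 4, 0, 2]

# (i) -> (i : dense): the values array is just the zero-padded vector itself.
dense_values = list(v)

# (i) -> (i : compressed): pos holds the single half-open interval [0, nnz),
# crd holds the index of each stored element, values holds the nonzeros.
crd = [i for i, x in enumerate(v) if x != 0]
pos = [0, len(crd)]
values = [v[i] for i in crd]

print(pos)     # [0, 5]
print(crd)     # [0, 2, 8, 9, 11]
print(values)  # [3, 1, -1, 4, 2]
```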

It’s also possible to map a single dimension to two levels to obtain blocked vector storage. An example that defines block size 4 is shown below, which essentially maps a 1D dimension space, named i, into a 2D level space (i/4,i%4).

(i) -> (i / 4 : compressed, i % 4 : dense)    # blocked vector

Figure 4 shows the UST storage of the sample vector. Zero padding is used to get completely filled blocks. The positions array at the first level denotes that two dense blocks are stored using [0,2). Arrays at the second level are unused. As shown in red, the element with, for example, index 8 and value -1 is mapped to inter-block level index 2 and intra-block level index 0.

Blocked vector storage.
Figure 4. UST storage in the blocked vector format
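The blocked mapping can be sketched as follows, again for a hypothetical length-12 vector whose element at index 8 is -1; with nonzeros in blocks 0 and 2 only, the compressed first level stores two blocks:

```python
v = [3, 0, 1, 0, 0, 0, 0, 0, -1, 4, 0, 2]  # hypothetical data; v[8] == -1
B = 4  # block size

# First level (compressed): store only the blocks containing a nonzero.
stored = sorted({i // B for i, x in enumerate(v) if x != 0})
pos = [0, len(stored)]   # two stored blocks: the half-open interval [0, 2)
crd = stored             # block indices of the stored blocks

# Second level (dense): each stored block is kept whole, zero padding included.
values = [v[b * B + k] for b in stored for k in range(B)]

# Dimension index i maps to level indices (i // B, i % B).
print(8 // B, 8 % B)  # 2 0  (inter-block index 2, intra-block index 0)
print(values)         # [3, 0, 1, 0, -1, 4, 0, 2]
```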

Matrices

Sparse matrix storage (2D) has been studied extensively, and many widely adopted sparse storage formats exist, each with a unique name. The flexibility of the UST makes it possible to define most of them, thereby unifying many variations under a single umbrella.

Using just the level formats dense/compressed combined with index permutations, the tensor format DSL of a d-dimensional tensor already gives rise to 2^d × d! different combinations. For d=2, this evaluates to eight different matrix formats.

(i, j) -> (i : dense, j : dense)              # Dense (row-major)
(i, j) -> (j : dense, i : dense)              # Dense (col-major)
(i, j) -> (i : dense, j : compressed)         # CSR
(i, j) -> (j : dense, i : compressed)         # CSC
(i, j) -> (i : compressed, j : compressed)    # DCSR
(i, j) -> (j : compressed, i : compressed)    # DCSC
(i, j) -> (i : compressed, j : dense)         # CROW (less common)
(i, j) -> (j : compressed, i : dense)         # CCOL (less common)

Dense storage using either row-major or column-major order is straightforward. Figure 5 shows the CSR format of a sample 4×5 matrix with six nonzero elements. Because level 0 is dense, arrays pos[0] and crd[0] are unused. Instead, the i-index is used as an index into level 1 to find the compressed storage of that row.

There, two consecutive positions pos[1][i] and pos[1][i+1] define a half-open interval of locations into the coordinates array used to reconstruct the j-index of each stored value. (Given this concise way of encoding, the size of the pos array must always be one more than the actual dimension size.) This is the last level, so the actual numerical values can be found at the same positions in the values array.

A sample matrix together with CSR representation.
Figure 5. Sample matrix and UST storage in CSR
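The same layout is exposed by SciPy’s CSR arrays, where indptr plays the role of the positions array pos[1] and indices the role of the coordinates array crd[1]. A sketch with a hypothetical 4×5 matrix with six nonzeros (the figure’s exact values are not reproduced here):

```python
import numpy as np
from scipy.sparse import csr_array

# Hypothetical 4x5 matrix with six nonzeros.
A = np.array([[1, 0, 2, 0, 0],
              [0, 0, 0, 3, 0],
              [0, 0, 0, 0, 0],
              [4, 0, 5, 0, 6]])
S = csr_array(A)

print(S.indptr)   # positions: [0 2 3 3 6], one more entry than the 4 rows
print(S.indices)  # column coordinates: [0 2 3 0 2 4]
print(S.data)     # values: [1 2 3 4 5 6]
```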

Looking at other level types, the COO format is expressed using the DSL shown below. Here, the compressed level uses the nonunique property to indicate that the same index can appear multiple times. The second level uses the singleton level format to indicate that no positions array is required, since each index is associated directly with its parent.

(i, j) -> (i : compressed(nonunique), j : singleton)    # COO

Storing the hypersparse matrix in Figure 6 in COO yields the given UST format. Coloring is used to indicate the storage relation of element a22=3. Like the previous formats, the UST COO format corresponds one-to-one with the COO format found in the literature, apart from the somewhat unusual initial positions array that defines the number of stored elements.

A sample matrix together with COO representation.
Figure 6. Sample hypersparse matrix and UST storage in COO
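SciPy’s COO arrays follow this scheme directly: the row array holds the (possibly repeating) coordinates of the compressed level, and the col array is the singleton level, pairing one-to-one with row. A sketch with a hypothetical hypersparse 4×5 matrix where a22=3, as in the text:

```python
import numpy as np
from scipy.sparse import coo_array

# Hypothetical hypersparse 4x5 matrix with a22 = 3.
A = np.zeros((4, 5))
A[0, 1] = 7
A[2, 2] = 3
A[2, 4] = 5
S = coo_array(A)

print(S.row)   # [0 2 2]  row index 2 repeats (nonunique compressed level)
print(S.col)   # [1 2 4]  one column per stored entry (singleton level)
print(S.data)  # [7. 3. 5.]
```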

The range level format can be combined with level expressions involving addition/subtraction to define diagonal and antidiagonal storage of elements. Here, there is a choice to use either the row index or the column index into the actually stored dense diagonal, giving rise to four variants.

(i, j) -> (j-i : compressed, i : range)    # DIA-I
(i, j) -> (j-i : compressed, j : range)    # DIA-J
(i, j) -> (i+j : compressed, i : range)    # ANTI-DIA-I
(i, j) -> (i+j : compressed, j : range)    # ANTI-DIA-J

The difference between using either a row or a column index only becomes apparent for nonsquare matrices. For the sample 5×7 matrix in Figure 7, Figures 8 and 9 show the DIA-I and DIA-J formats, respectively. Coloring is used to relate stored diagonals back to the matrix diagonals. The format uses zero padding inside and outside the range to obtain completely filled levels (where, unlike inside padding, outside padding doesn’t correspond to elements inside the original matrix index space).

A sample matrix with nonzeros clustered in three diagonals.
Figure 7. Sample matrix with three diagonals almost filled

Figure 8 shows the DIA-I format.

Diagonal storage with index I.
Figure 8. Sample matrix and UST storage in DIA-I

Figure 9 shows the DIA-J format for the same matrix. Choosing the index of the larger dimension results in more zero padding, so the DIA-I format would be preferred in this case.

Diagonal storage with index J.
Figure 9. Sample matrix and UST storage in DIA-J
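SciPy’s DIA format corresponds to the DIA-J variant: offsets holds the coordinates of the compressed j-i level, and each row of data is one diagonal indexed by the column index j, with zero padding where the diagonal falls outside the matrix. A sketch with a hypothetical 3×4 matrix:

```python
import numpy as np
from scipy.sparse import dia_array

# Hypothetical 3x4 matrix with two diagonals, at offsets j - i = 0 and 1.
A = np.array([[1, 5, 0, 0],
              [0, 2, 6, 0],
              [0, 0, 3, 7]])
S = dia_array(A)

print(S.offsets)  # [0 1]: coordinates of the compressed (j - i) level
print(S.data)     # each row is one diagonal, indexed by column j:
                  # [[1 2 3 0]
                  #  [0 5 6 7]]  (zeros are padding outside the range)
```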

Many more storage variations can be obtained with blocking, where 2D indices map into either 3D levels (strips) or 4D levels (blocks). For the latter, even though the literature typically only mentions BSR and BSC to define row-major or column-major order between the blocks, the tensor format DSL of the UST more precisely also distinguishes between row-major and column-major format within each stored block, giving rise to the following four variants for 2×3 blocks.

(i, j) -> (i / 2 : compressed, j / 3 : dense,
           i % 2 : dense,      j % 3 : dense)     # BSR-Row(2,3)
(i, j) -> (i / 2 : compressed, j / 3 : dense,
           j % 3 : dense,      i % 2 : dense)     # BSR-Col(2,3)
(i, j) -> (j / 3 : compressed, i / 2 : dense,
           i % 2 : dense,      j % 3 : dense)     # BSC-Row(2,3)
(i, j) -> (j / 3 : compressed, i / 2 : dense,
           j % 3 : dense,      i % 2 : dense)     # BSC-Col(2,3)

The 4×9 matrix in Figure 10 is used to illustrate block storage. Block B11 is colored red to highlight where it actually resides. For instance, entry a25=16 can be found by mapping the dimension indices (2,5) to the level indices (1, 1, 0, 2): block B11 indexed by (0,2). Because compressed row-major storage of blocks makes this the fourth stored block (after B00, B02, and B10), while dense row-major storage within the block makes this the third entry there, this eventually translates to the element at index 20 of the values array.

A sample matrix together with BSR representation.
Figure 10. Sample block matrix and UST storage in BSR-Row(2,3)
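The index arithmetic in this example can be checked with a small sketch (the block ranks follow the stored order B00, B02, B10, B11 given in the text):

```python
# Map dimension indices (i, j) to BSR-Row(2,3) level indices.
def bsr_row_levels(i, j, bm=2, bn=3):
    return (i // bm, j // bn, i % bm, j % bn)

print(bsr_row_levels(2, 5))  # (1, 1, 0, 2): block B11, intra-block (0, 2)

# B11 is the fourth stored block (rank 3, after B00, B02, B10), and
# row-major order inside a 2x3 block puts (0, 2) at flat offset 0*3 + 2 = 2.
rank, offset = 3, 0 * 3 + 2
print(rank * (2 * 3) + offset)  # 20: index into the values array
```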

These examples should provide an impression of the expressiveness of the tensor format DSL in describing sparse matrix storage formats. The next section explores higher dimensions.

Tensors

As previously discussed, just using dense/compressed level types with index permutations in the UST tensor format DSL gives rise to 2^d × d! different ways of storing tensors. For d=3, d=4, d=5, and d=6, this yields 48, 384, 3,840, and 46,080 combinations, respectively, far too many to list here. Adding the other level types increases this number even further.

To provide a few examples of storing 3D tensors: the dimension permutations of the UST allow for six different direct ways of storing 3D dense tensors, varying from the classic row-major to column-major order and everything in between. Storing a 3D tensor with all levels compressed gives rise to the CSF (compressed sparse fiber) format.

(i, j, k) -> (i : compressed, j : compressed, k : compressed)  # CSF

Consider the 3D 4×3×5 tensor in Figure 11, which is visualized on paper by showing the underlying 2D 3×5 matrices in a row for i=0, i=1, i=2 (empty), and i=3.

Visualization of 3D tensor.
Figure 11. Sample 3D tensor, visualized as matrices

Representing this tensor using the CSF format as a UST yields the storage shown in Figure 12. The coloring indicates how the stored element a320=8 is found. This format is suitable for very sparse tensors where the nonzeros are uniformly distributed over all dimensions.

CSF storage with seven rows
Figure 12. UST storage of the sample tensor as CSF
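A CSF encoding can be sketched in pure Python from sorted (i, j, k, value) tuples. The entries below are hypothetical (only a320=8 is taken from the text); note how the empty slice i=2 simply never appears at the compressed top level:

```python
# Build CSF storage (all three levels compressed) from (i, j, k, value) tuples.
# Hypothetical entries for a 4x3x5 tensor; a320 = 8 as in the text, slice i=2 empty.
entries = [(0, 0, 1, 5), (0, 2, 4, 2), (1, 1, 3, 7), (3, 2, 0, 8)]

def to_csf(entries):
    pos = [[0], [0], [0]]
    crd = [[], [], []]
    values = []
    last_i, last_ij = None, None
    for i, j, k, v in sorted(entries):
        if i != last_i:            # new i: open a fresh fiber at level 1
            crd[0].append(i)
            pos[1].append(pos[1][-1])
            last_i, last_ij = i, None
        if (i, j) != last_ij:      # new (i, j): open a fresh fiber at level 2
            crd[1].append(j)
            pos[1][-1] += 1
            pos[2].append(pos[2][-1])
            last_ij = (i, j)
        crd[2].append(k)           # innermost coordinates and values
        pos[2][-1] += 1
        values.append(v)
    pos[0].append(len(crd[0]))
    return pos, crd, values

pos, crd, values = to_csf(entries)
print(pos)     # [[0, 3], [0, 2, 3, 4], [0, 1, 2, 3, 4]]
print(crd)     # [[0, 1, 3], [0, 2, 1, 2], [1, 4, 3, 0]]
print(values)  # [5, 2, 7, 8]
```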

A 3D tensor can be regarded as a batched set of matrices, a popular concept in deep learning. Placing the batch dimension at different levels can have interesting consequences for the uniformity of sparsity within the batch. Consider the 3D 4×5×5 tensor of Figure 13, visualized as a batch of 5×5 matrices.

Sample 3D tensor
Figure 13. Sample tensor with elements clustered around diagonals

Given the diagonal nature of the matrices, it makes sense to store this tensor in a batched DIA format. The most natural way is keeping the batch dimension variable i as the outermost dense level, followed by diagonal storage of the matrices.

(i, j, k) -> (i : dense, k-j : compressed, j : range) # batched DIA-I

This nonuniform batched DIA-I storage is shown in Figure 14, where each batch can have its own diagonal structure. Color relates one particular diagonal back to its stored location. Zero padding occurs, both inside and outside the stored diagonals, in order to get completely filled dense levels. The coordinates array denotes where the diagonals reside (for instance, at -1, 0, +1 for the first matrix, and -1 and 0 for the last matrix).

Nonuniform batched diagonal storage.
Figure 14. UST storage as nonuniform batched DIA-I

It’s possible to make a subtle change in the tensor format DSL by permuting the batch variable below the diagonal computation:

(i, j, k) -> (k-j : compressed, i : dense, j : range)  # uniform
                                                       # batched DIA-I

Now the metadata for the diagonals is stored just once, but at the expense of forcing a uniform diagonal nonzero pattern on all matrices in the batch, which may require additional zero padding, as can be seen in Figure 15.

Uniform batched diagonal storage.
Figure 15. UST storage as uniform batched DIA-I
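The trade-off can be quantified with a small sketch: given hypothetical per-matrix diagonal offsets, uniform storage must pad every matrix out to the union of all offsets in the batch:

```python
# Hypothetical diagonal offsets (k - j) per matrix in a batch of four.
per_matrix = [[-1, 0, 1], [0, 1], [-1, 0, 1], [-1, 0]]

# Nonuniform batched DIA: each matrix stores only its own diagonals.
nonuniform = sum(len(offs) for offs in per_matrix)

# Uniform batched DIA: one shared offset list (the union) applies to every matrix.
shared = sorted({o for offs in per_matrix for o in offs})
uniform = len(shared) * len(per_matrix)

print(nonuniform)  # 10 stored diagonals
print(shared)      # [-1, 0, 1]
print(uniform)     # 12 stored diagonals (the extra 2 are pure zero padding)
```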

Learn more

This post has illustrated the breadth of tensor storage formats encompassed by the UST. Stay tuned for the integration of the UST with polymorphic operations that dispatch to optimized libraries or automatically generated code. This approach establishes a clean, easy-to-use, and scalable sparse ecosystem, facilitating the introduction of novel storage schemes without explicit coding.

Want to learn more? Sign up for the NVIDIA GTC 2026 session, Accelerating GPU Scientific Computing with nvmath-python, for a demo of the UST and exciting product updates.


