Block Sparse Matrices for Smaller and Faster Language Models

By François Lagunas

In previous blog posts we introduced sparse matrices and how they can be used to improve neural networks.

The fundamental assumption is that full dense layers are often overkill and can be pruned without a significant loss in precision.
In some cases sparse linear layers can even improve precision and/or generalization.

The main issue is that currently available code for sparse algebra computation is severely inefficient.
We are also still waiting for official PyTorch support.

That is why we ran out of patience and took some time this summer to address this “lacuna”.
Today, we are excited to release the extension pytorch_block_sparse.

By itself, and even better when combined with other methods like
distillation
and quantization,
this library enables networks that are both smaller and faster,
something Hugging Face considers crucial to let anybody use
neural networks in production at low cost, and to improve the end user experience.



Usage

The provided BlockSparseLinear module is a drop-in replacement for torch.nn.Linear, and it is trivial to use
in your models:


from pytorch_block_sparse import BlockSparseLinear

...


self.fc = BlockSparseLinear(1024, 256, density=0.1)  # 1024 -> 256, keeping 10% of the weights

The extension also provides a BlockSparseModelPatcher that allows you to modify an existing model “on the fly”,
as shown in this example notebook.
Such a model can then be trained as usual, without any change in your model source code.
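To illustrate the idea behind such patching (this is a toy stand-in, not the library's implementation; the module names, the `patch_model` helper, and the dict-based "model" are all invented for this sketch): layers whose dotted names match a regular expression get swapped for a block-sparse variant that keeps only a fraction of the weights, while everything else is left untouched.

```python
import re

def patch_model(modules, pattern, density):
    """Toy model patcher: `modules` maps dotted layer names to
    ("dense", n_params); layers whose full name matches `pattern`
    become ("block_sparse", n_kept) with only `density` of the weights."""
    rx = re.compile(pattern)
    patched = {}
    for name, (kind, n_params) in modules.items():
        if kind == "dense" and rx.fullmatch(name):
            patched[name] = ("block_sparse", int(n_params * density))
        else:
            patched[name] = (kind, n_params)
    return patched

# A pretend transformer: layer name -> (layer kind, parameter count).
model = {
    "encoder.layer.0.intermediate.dense": ("dense", 1024 * 256),
    "encoder.layer.0.output.dense": ("dense", 256 * 1024),
    "pooler.dense": ("dense", 256 * 256),
}

# Sparsify only the encoder feed-forward layers, keeping 25% of the weights.
patched = patch_model(model, r"encoder\.layer\.\d+\..*\.dense", density=0.25)
```

The real patcher works on live torch.nn modules rather than a dict, but the name-pattern-plus-density interface conveys why no change to the model source code is needed.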



NVIDIA CUTLASS

This extension relies on the cutlass tilesparse proof of concept by Yulhwa Kim.

It’s using C++ CUDA templates for block-sparse matrix multiplication
based on CUTLASS.

CUTLASS is a set of CUDA C++ templates for implementing high-performance CUDA kernels.
With CUTLASS, approaching cuBLAS performance on custom kernels is feasible without resorting to assembly language code.

The most recent versions include all the Ampere Tensor Core primitives, providing 10x or more speedups with a limited loss of precision.
Future versions of pytorch_block_sparse will make use of those primitives,
as block sparsity is 100% compatible with Tensor Core requirements.
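To make the block-sparse computation concrete, here is a minimal pure-Python sketch (not the CUDA kernel, and vastly slower; all sizes and names are illustrative): only the nonzero tiles are stored and visited during the multiplication, yet the product matches the equivalent dense one.

```python
import random

def dense_matmul(a, b):
    """Reference dense product: a is (n, k), b is (k, m)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def block_sparse_matmul(blocks, b, n_rows, block_size, m):
    """Product of a block-sparse matrix with a dense one.

    `blocks` maps (block_row, block_col) -> a block_size x block_size tile;
    absent keys are all-zero tiles that are neither stored nor visited."""
    out = [[0.0] * m for _ in range(n_rows)]
    for (br, bc), tile in blocks.items():
        for i in range(block_size):
            for t in range(block_size):
                w = tile[i][t]
                for j in range(m):
                    out[br * block_size + i][j] += w * b[bc * block_size + t][j]
    return out

random.seed(0)
n, k, m, bs = 8, 8, 4, 2  # toy sizes; the real kernels use much larger tiles
kept = {(0, 0), (0, 3), (1, 1), (2, 0), (3, 2)}  # 5 of 16 tiles, ~31% density
blocks = {pos: [[random.gauss(0, 1) for _ in range(bs)] for _ in range(bs)]
          for pos in kept}

# Materialize the equivalent dense matrix to check the result against.
dense = [[0.0] * k for _ in range(n)]
for (br, bc), tile in blocks.items():
    for i in range(bs):
        for j in range(bs):
            dense[br * bs + i][bc * bs + j] = tile[i][j]

x = [[random.gauss(0, 1) for _ in range(m)] for _ in range(k)]
ref = dense_matmul(dense, x)
out = block_sparse_matmul(blocks, x, n, bs, m)
assert all(abs(ref[i][j] - out[i][j]) < 1e-9 for i in range(n) for j in range(m))
```

The fixed tile shape is what makes this GPU-friendly: each stored tile is a small dense multiply with regular memory access, which is exactly what CUTLASS-style kernels (and Tensor Cores) are built for.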



Performance

At the present stage of the library, these sparse matrices are roughly
two times slower than their cuBLAS-optimized dense counterparts, and we are confident
that we can improve this in the long run.

This is already a huge improvement over PyTorch sparse matrices: their current implementation is an order of magnitude slower
than the dense one.

But the more important point is that the performance gain from using sparse matrices grows with the sparsity,
so a 75% sparse matrix is roughly 2x faster than the dense equivalent.

The memory savings are much more significant: for 75% sparsity, memory consumption is reduced by 4x,
as you would expect.
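A quick back-of-the-envelope check of that figure (plain arithmetic on the earlier example layer, not the library's own accounting):

```python
# Parameter count for a 1024 -> 256 linear layer at 75% sparsity.
in_features, out_features = 1024, 256
density = 0.25                               # 75% sparsity
dense_params = in_features * out_features    # 262144 weights
sparse_params = int(dense_params * density)  # 65536 weights actually stored
reduction = dense_params // sparse_params
print(reduction)  # -> 4
```

Because block-sparse storage keeps whole tiles rather than scattered scalars, the indexing overhead per stored weight is small, so the actual memory footprint tracks this ratio closely.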



Future work

Being able to efficiently train block-sparse linear layers was just the first step.
The sparsity pattern is currently fixed at initialization, and optimizing it during training will naturally yield large
improvements.

So in future versions, you can expect tools to measure the “usefulness” of parameters, in order to optimize the sparsity pattern.
NVIDIA Ampere's 50% sparse pattern within blocks will probably yield another significant performance gain, just as upgrading
to newer versions of CUTLASS does.

So, stay tuned for more sparsity goodness in the near future!


