CUDA Tile Programming Now Available for BASIC!

-


CUDA 13.1 introduced CUDA Tile, a next generation tile-based GPU programming paradigm designed to make fine-grained parallelism more accessible and versatile. One among its key strengths is language openness: any programming language can goal CUDA Tile, enabling developers to bring tile-based GPU acceleration right into a wide selection of ecosystems.

In response to overwhelming demand from seasoned developers in every single place, we’re releasing cuTile BASIC for GPUs, bringing CUDA Tile programming to this long-overlooked language.

What’s cuTile BASIC?

cuTile BASIC is an expression of the CUDA Tile programming model in BASIC, built on top of the CUDA Tile IR specification. It enables you to jot down tile kernels in BASIC using a tile-based model, which is a natural fit for a programming language like BASIC which predates multi-threaded programming.

cuTile BASIC is the right marriage of the facility of GPUs with the anachronistic charm and syntactic simplicity of the BASIC programming language – a sublime language, from a more pixelated era. Manually numbering your lines of code never looked so good or ran so fast!

Who’s cuTile BASIC for?

BASIC is one in all the oldest programming languages around and as such, is revered by a complete generation of developers who remember the sound of a 300 baud dial-up modem handshaking fondly. For a lot of such developers, BASIC was their first introduction to computer programming.

Now, developers with BASIC still burned into their brains can take legacy applications onto NVIDIA GPU-accelerated computing for the primary time. This unlocks performance and functionality the BASIC programming language could never have previously imagined – allowing your Lunar Lander to zip across the moon’s surface faster than an Artemis mission.

Relive the glory of writing games on your graphing calculator during math class while running on the world’s strongest GPUs.

Get setup

First, install cuTile BASIC with PIP:

pip install git+https://github.com/nvidia/cuda-tile.git@basic-experimental

The complete hardware and software requirements for running cuTile BASIC are listed at the top of this post (64k of RAM or more really helpful). 

cuTile BASIC example

In the event you’ve learned CUDA C++, you most likely encountered the canonical vector addition kernel. A vector add kernel in CUDA C++ looks something like the next, which takes two vectors and adds them together elementwise to provide a 3rd vector.

That is one in all the best CUDA kernels one can write:

__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
 /* calculate my thread index */
 int workIndex = threadIdx.x + blockIdx.x*blockDim.x;

 if(workIndex < vectorLength)
 {
  /* perform the vector addition */
  C[workIndex] = A[workIndex] + B[workIndex];
 }
}

On this kernel, each thread’s work is explicitly specified, and the programmer, when launching this kernel, will specify the variety of blocks and threads to be launched.

Now let’s have a look at the equivalent code written in cuTile BASIC. We don’t have to specify what each thread does. We only need to break the info into tiles and specify what mathematical operations should occur to those tiles. Every part else is handled for us.

The cuTile BASIC vector add kernel is shown below:

10 REM Vector Add: C = A + B
20 INPUT N, A(), B()
30 DIM A(N), B(N), C(N)
40 TILE A(128), B(128), C(128)
50 LET C(BID) = A(BID) + B(BID)
60 OUTPUT C
70 END

This instance may be very basic and uses standard BASIC with three additional things to indicate:

  • Indexing an array returns a tile, which is a subset of the array.
  • BID is a built-in variable that specifies the tile block index. 
  • TILE specifies what size tiles the arrays ought to be partitioned into.

Notice we didn’t need to specify anything apart from the addition operation. Every part else is handled by cuTile BASIC.

Putting all of it together

Now we’ll show how you can run this vector add kernel in BASIC. The simple workflow is that first the BASIC function is compiled to a cubin, after which it’s launched on the GPU. For brevity, we’ve omitted the boring Python host and wrapper code, but yow will discover it in our GitHub repository. 

If you could have the correct versions of CUDA Toolkit and Python installed and have downloaded the cuTile BASIC repo from GitHub, you’ll be able to execute the next command:

$ python examples/vector_add.py
[1/2] Compiling to cubin ...
      Arrays: ['A', 'B', 'C'], tile_shapes={'A': [128], 'B': [128], 'C': [128]}, grid_size=8
[2/2] Launching kernel on GPU ...

Results (showing 5 samples of 1024):
  C[   0] =        0.0  (expected 0.0)
  C[   1] =        3.0  (expected 3.0)
  C[ 511] =     1533.0  (expected 1533.0)
  C[ 512] =     1536.0  (expected 1536.0)
  C[1023] =     3069.0  (expected 3069.0)

VERIFICATION PASSED  (max_diff=0.000000, 1024 elements)

In case your output looks the identical, congratulations, you simply ran your very first cuTile BASIC program, and quite possibly your very first program of any sort written in BASIC! Max Headroom can be proud.

A BASIC matrix multiplication

BASIC is an easy language such that only a few lines of code can express common algorithms. Consider a matrix multiply (GEMM) kernel in BASIC, shown below:

10 REM GEMM: C(M,N) = A(M,K) * B(K,N)
15 INPUT M, N, K, A(), B()
20 DIM A(M, K), B(K, N), C(M, N)
30 TILE A(128, 32), B(32, 128), C(128, 128), ACC(128, 128)
40 LET TILEM = INT(BID / INT(N / 128))
50 LET TILEN = BID MOD INT(N / 128)
60 LET ACC = 0.0
70 FOR KI = 0 TO INT(K / 32) - 1
80   LET ACC = MMA(A(TILEM, KI), B(KI, TILEN), ACC)
90 NEXT KI
100 LET C(TILEM, TILEN) = ACC
110 OUTPUT C
120 END

On this kernel, as well as to straightforward BASIC syntax, TILE specifies how A, B, and C ought to be tiled and the scale of the accumulator tile, ACC.  MMA is the function call for matrix multiply and accumulate. Notice how easy this code is. You specify how your data ought to be subdivided into tiles, and specify your algorithm at a high-level, and under the covers, CUDA Tile handles every part else. 

This instance can also be available within the GitHub repo’s examples folder. Running it produces the next output:

$ python examples/gemm.py
[1/2] Compiling to cubin ...
      M=512, N=512, K=512, tile_shapes={'A': [128, 32], 'B': [32, 128], 'C': [128, 128]}, grid_size=16
[2/2] Launching kernel on GPU ...

Results (showing 5 samples of 512x512 = 262144 elements):
  C[0,0] =    -0.1199  (expected -0.1199)
  C[0,1] =   -14.4456  (expected -14.4456)
  C[256,0] =   -15.8891  (expected -15.8891)
  C[256,1] =    -2.8646  (expected -2.8646)
  C[511,511] =    11.4724  (expected 11.4724)

VERIFICATION PASSED  (max_diff=0.000012, tol=0.005120)

Matrix multiplication, resembling that shown above, is at the center of artificial intelligence tools like large language models. With cuTile Basic, developers can now explore the frontiers of artificial intelligence in models with trillions of parameters from a language that might barely imagine a complete megabyte of system memory. 

How developers can get cuTile

To run cuTile BASIC programs, you wish the next:

  • A GPU with compute capability 8.x, 10.x, 11.x or 12.x (in future CUDA releases we’ll add support for extra GPU architectures)
  • NVIDIA Driver R580 or later (R590 is required for tile-specific developer tools support)
  • CUDA Toolkit 13.1 or later
  • Python version 3.10 or higher
  • cuTile BASIC package

Start

When you’ve got all of the software, try the complete cuTile BASIC documentation, try all of the sample programs found on GitHub, and begin programming in cuTile BASIC today. Relish in the chance to port your modern AI or scientific computing code base to a historically pivotal language while retaining the power to run on probably the most powerful hardware available! Just don’t tell your Commodore 64.

CUDA Tile in any language

While BASIC is probably not the primary language developers consider for high-performance parallel computing, it’s an instructive demonstration that, because of the design of the CUDA software stack, CUDA Tile might be used from nearly any programming language. By compiling to the CUDA Tile IR format, CUDA Tile might be delivered to nearly any language … even BASIC!

Editor’s Note: On reflection, developers asking for broad support of the CUDA Tile programming model should perhaps have been a bit more specific. Search for cuTile COBOL coming April 1, 2027.



Source link

ASK ANA

What are your thoughts on this topic?
Let us know in the comments below.

0 0 votes
Article Rating
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Share this article

Recent posts

0
Would love your thoughts, please comment.x
()
x