Benchmarking LLMs on AI-Generated CUDA Code with ComputeEval 2025.2



Can AI coding assistants write efficient CUDA code? To help measure and improve their capabilities, we created ComputeEval, a robust, open-source benchmark for evaluating AI models and agents on CUDA programming tasks.

Just a few months ago, we announced the first release of ComputeEval. Today, we're introducing its first major expansion, adding more than 100 new CUDA challenges.

With this release, the dataset has grown to a total of 232 CUDA and CUDA Core Compute Libraries (CCCL) problems. We deliberately raised the bar by adding harder challenges that require LLMs to use modern CUDA features such as Tensor Cores, advanced shared memory patterns, and warp-level primitives. The new problems also test the ability to correctly orchestrate features like CUDA Graphs, streams, and events, all within the context of real-world applications such as dynamic simulations.
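To give a flavor of the kind of feature these problems exercise, here is a minimal sketch of a warp-level sum reduction built on the __shfl_down_sync primitive. This is an illustrative example only, not one of the benchmark problems; the kernel name, sizes, and launch configuration are ours.

```cuda
#include <cuda_runtime.h>

// Illustrative warp-level sum reduction (not a benchmark task).
// Each warp reduces its 32 values entirely in registers via shuffles,
// so only one atomic per warp touches global memory.
__global__ void warpReduceSum(const float* in, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? in[idx] : 0.0f;

    // Halve the number of active lanes each step until lane 0 holds the sum.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // Lane 0 of each warp contributes its partial sum.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, val);
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_out, 0, sizeof(float));
    // ... fill d_in with data ...

    warpReduceSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The newer problems combine primitives like this with higher-level orchestration (CUDA Graphs, streams, and events), which is where models tend to struggle most.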

LLM performance on CUDA programming

Our team evaluated several leading LLMs on ComputeEval to determine baseline performance metrics and understand the present state of AI-assisted CUDA programming (Table 1).

| Model | ComputeEval 2025.2 (232 problems) pass@1 | ComputeEval 2025.1 (128 problems) pass@1 |
|---|---|---|
| GPT-5 (medium) | 0.5819 | 0.61 |
| Claude Sonnet 4.0 | 0.5517 | 0.64 |
| gpt-oss-20B (high) | 0.5474 | N/A |
| gpt-oss-120B (high) | 0.5302 | N/A |
| Claude Opus 4.0 | 0.5216 | N/A |
| DeepSeek-R1 | 0.4397 | 0.55 |
| gpt-oss-120B (medium) | 0.4224 | N/A |
| gpt-oss-20B (medium) | 0.4224 | N/A |
| gpt-oss-120B (low) | 0.4052 | N/A |
| DeepSeek-V3.1 | 0.3750 | 0.44 |
| Llama 4 Maverick 17B 128E | 0.3448 | 0.47 |
| Llama 3.1 405B | 0.3405 | 0.40 |
| gpt-oss-20B (low) | 0.3319 | 0.41 |
Table 1. Pass@1 accuracy of state-of-the-art LLMs on ComputeEval 2025.1 and 2025.2. The latest version expands the dataset to 232 CUDA programming challenges, providing a more challenging benchmark for AI-assisted coding.
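For context, pass@1 is the standard functional-correctness metric for code generation: a problem counts as solved only if a generated solution compiles and passes all of that problem's tests. When n samples are drawn per problem and c of them pass, the commonly used unbiased estimator is pass@k = 1 − C(n−c, k) / C(n, k), averaged over problems, which for k = 1 reduces to the fraction of passing samples. (This is the general definition of the metric; the exact sampling setup of the ComputeEval harness is documented in the repository.)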

We observed that scores for all models declined with the move to ComputeEval 2025.2. This doesn't indicate that the models have become less capable; rather, it reflects that the benchmark itself has become more difficult. With each release, we're raising the bar, pushing models to demonstrate a deeper understanding of the nuances of accelerated computing.

What’s next and how to get involved

We’ll continue expanding both the dataset and the capabilities of the evaluation framework. Work is already underway to extend ComputeEval’s coverage to additional CUDA-X libraries, including cuBLAS, CUTLASS, cuDNN, RAPIDS, and more. We invite the broader HPC and AI communities to contribute and collaborate. Explore the code on GitHub and access the dataset on Hugging Face.


