Can AI coding assistants write efficient CUDA code? To help measure and improve their capabilities, we created ComputeEval, a robust, open-source benchmark for evaluating AI models and agents on CUDA programming tasks.
Just a few months ago, we announced the first release of ComputeEval, and today we're introducing its first major expansion, adding more than 100 new CUDA challenges.
With this release, the dataset has grown to a total of 232 CUDA and CUDA Core Compute Libraries (CCCL) problems. We deliberately raised the bar by adding harder challenges that require LLMs to use modern CUDA features, such as Tensor Cores, advanced shared memory patterns, and warp-level primitives. The new problems also test the ability to correctly orchestrate features like CUDA Graphs, Streams, and Events, all within the context of real-world applications like dynamic simulations.
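To give a flavor of the warp-level primitives category, here is a minimal sketch (not an actual benchmark problem) of a warp-wide sum reduction built on the `__shfl_down_sync` shuffle intrinsic:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sum 32 floats within a single warp using warp shuffle primitives,
// avoiding shared memory entirely.
__global__ void warpReduceSum(const float* in, float* out) {
    float val = in[threadIdx.x];
    // At each step, every lane adds the value held by the lane
    // `offset` positions above it; after log2(32) steps, lane 0
    // holds the warp-wide sum.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if (threadIdx.x == 0) *out = val;
}

int main() {
    float h_in[32], h_out = 0.0f, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;
    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    warpReduceSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The harder benchmark problems layer patterns like this into larger programs, where the model must also get launch configuration, synchronization, and memory movement right.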
LLM performance on CUDA programming
Our team evaluated several leading LLMs on ComputeEval to establish baseline performance metrics and understand the current state of AI-assisted CUDA programming (Table 1).
| Model | ComputeEval 2025.2 (232 problems) pass@1 | ComputeEval 2025.1 (128 problems) pass@1 |
|---|---|---|
| GPT-5 (medium) | 0.5819 | 0.61 |
| Claude Sonnet 4.0 | 0.5517 | 0.64 |
| gpt-oss-20b (high) | 0.5474 | N/A |
| gpt-oss-120b (high) | 0.5302 | N/A |
| Claude Opus 4.0 | 0.5216 | N/A |
| DeepSeek-R1 | 0.4397 | 0.55 |
| gpt-oss-120b (medium) | 0.4224 | N/A |
| gpt-oss-20b (medium) | 0.4224 | N/A |
| gpt-oss-120b (low) | 0.4052 | N/A |
| DeepSeek-V3.1 | 0.3750 | 0.44 |
| Llama 4 Maverick 17B 128E | 0.3448 | 0.47 |
| Llama 3.1 405B | 0.3405 | 0.4 |
| gpt-oss-20b (low) | 0.3319 | 0.41 |
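The table reports pass@1, the probability that a model's sample solves a problem. A common way to compute pass@k is the unbiased estimator introduced with the HumanEval benchmark (whether ComputeEval uses this exact estimator or a single sample per problem is an assumption here, not stated above):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single generation per problem (n = 1), pass@1 is just the pass rate.
print(pass_at_k(1, 1, 1))   # 1.0
# 4 correct out of 10 generations, evaluated at k = 1:
print(pass_at_k(10, 4, 1))  # 0.4
```

Averaging this quantity over all problems in the dataset yields the scores shown above.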
We observed that scores for all models declined with the move to ComputeEval 2025.2. This doesn't indicate that the models have become less capable; rather, it reflects that the benchmark itself has become more difficult. With each release, we're raising the bar for AI, pushing it to demonstrate a deeper understanding of the nuances of accelerated computing.
What’s next and how to get involved
We’ll continue expanding both the dataset and the capabilities of the evaluation framework. Work is already underway to extend ComputeEval’s coverage to additional CUDA-X libraries, including cuBLAS, CUTLASS, cuDNN, RAPIDS, and more. We invite the broader HPC and AI communities to contribute and collaborate. Explore the code on GitHub and access the dataset on Hugging Face.
