The most effective thing about agent skills is that they let you upskill your agents on hard problems. There are two ways to look at this:
- You can take Opus 4.5 or other SOTA models and tackle the toughest problems on the market.
- You can take models that run on your laptop and upskill them to handle harder problems. In this blog post, we'll show you how to tackle the latter.
This blog post walks through the process of using a new tool, upskill, to generate and evaluate agent skills with large models and use them with smaller models. We benchmark upskill on the task of writing CUDA kernels for diffusers models, but the approach is generally useful for cutting costs, or for using smaller models on hard, domain-specific problems.
What are agent skills?
In case you missed it, agent skills are taking the coding agent game by storm. At their core, they're a simple concept: define model context as files, with instructions as markdown and code as scripts. The file format makes them easy to generate, share, and review. In short, they're a practical medium for sharing capabilities across models and tools, and they're most useful for specific domains or hard problems, not stuff the model can do well anyway.
This post showcases the process by using Claude to generate a skill file that can be used by open source models for a complex and specialized task: writing CUDA kernels.
We first tried a straightforward skill based on existing documentation, and we found that it improved performance for some models, but not all. In fact, it could even degrade performance or increase token usage for some models. Take a look at the plot below to see the performance of the models with and without the basic skill.
Now, let's walk through how you can use upskill to upskill your agents on hard problems and measure performance.
First, we use Claude Code to build a kernel interactively and export the trace. We worked through the process by instructing, validating, and adding documentation links. This somewhat naive process is important to reveal the model's initial challenges. In fact, you can iterate on this multiple times, by trying to solve the task with draft versions of the skill and experimenting with smaller models. Each time, you can instruct the agent to improve the skill and test it on the smaller model.
Here's an example of the skill that we created and have been using to build kernels. We started from this agent trace, where the agent was able to build a kernel, but not without some help.
Once the teacher model has performed the task, we need it to make a skill. There are a few effective ways to do that:
- Within the same session, instruct the agent to create a skill file for the task it just completed.
- Use Anthropic's 'skill creator' skill, either within the agent session or with an exported trace and a new agent session.
- Use the upskill tool to create a skill based on the trace (see the example after this list).
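For the third option, a minimal invocation might look like this; the task description and trace filename are illustrative, and the full quick-start appears further below:
# generate a skill from an exported agent trace
upskill generate "write CUDA kernels with kernel-builder" --from ./claude-trace.md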
Usually, the first two options result in functional skills. However, the performance of an agent with the skill is unknown. That's where upskill is helpful, because it will also generate test cases for your skill based on the trace. It then compares the results under both scenarios: using the trace, or applying the skill. We see below that the original model (Claude Opus) achieved the same performance with and without the skill. This means the skill captured the task for this model. Great!
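As a quick sanity check, you can also run the generated test cases against the teacher model itself. A minimal sketch, assuming the opus model alias used elsewhere in this post:
# evaluate the teacher model on its own skill, with and without it in context
upskill eval ./skills/kernel-builder-cuda-kernels/ --model opus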
3. Take your skill to an open source, smaller, or cheaper model
Finally, we need to transfer our newly created skill to the tool or model we intend to use. Most tools, like Codex, Cursor, and opencode, have settled on a consistent format for skills: a directory at {agent}/skills/{skill_name}/SKILL.md. So we just need to copy the skill directory to this location.
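For example, for Claude Code this might look like the following; the ~/.claude/skills/ location is an assumption based on Claude Code's personal skills directory:
# copy the generated skill into the agent's skills directory
mkdir -p ~/.claude/skills
cp -r ./skills/kernel-builder-cuda-kernels ~/.claude/skills/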
With upskill, we can pass a skill and a set of models to the eval command, and upskill will run the test cases on those models with and without the skill to compare performance. We can see here that the skill increases accuracy on some open models, but not on all.
In this case, we may want to iterate further on the gpt-oss skill by regenerating it with upskill generate --from {skill}.
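A sketch of that iteration loop, where the improvement instruction, model name, and endpoint are placeholders:
# regenerate the skill with extra guidance for smaller models
upskill generate "add step-by-step guidance for smaller models" \
  --from ./skills/kernel-builder-cuda-kernels/

# re-evaluate on the target open model via an OpenAI-compatible endpoint
upskill eval ./skills/kernel-builder-cuda-kernels/ \
  --model "openai/gpt-oss-20b" \
  --base-url http://localhost:8080/v1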
There's more to agent skills than model performance. Often agents can reach a given accuracy with or without a skill; they just need to consume more tokens to get there. For recurring tasks, we want to optimize agents to use fewer tokens to achieve the same accuracy. The results below reveal another dimension of the skill: some models significantly reduce their token usage, while others use more tokens with the skill. For example, with moonshotai/Kimi-K2-Thinking the skill is clearly effective in terms of both accuracy and token usage. However, for Claude Opus 4.5 there is no clear performance increase but there is an increase in token usage, so you wouldn't want to use this skill with Claude Opus 4.5.
tl;dr: try out and evaluate models with the skills you create. Use upskill eval or a similar tool to evaluate model performance with and without skills.
That's the high-level, end-to-end process of upskilling your coding agents on hard problems. Try out upskill now like this:
# install upskill
pip install upskill
# or use uvx
uvx upskill --help
# generate a skill based on an agent trace
upskill generate "write nvidia kernels" --from ./trace.md
# evaluate models on a skill
upskill eval ./skills/my-skill/ --model haiku --model sonnet
# generate skills for local models
upskill generate "parse YAML" \
  --model opus \
  --eval-model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
  --eval-base-url http://localhost:8080/v1
We now have a high-level understanding of how we can upskill an agent. Let's take a look at the use case we solved: writing CUDA kernels.
We didn't just want to write kernel code, but to understand the full kernel-builder workflow: project structure, build.toml configuration, architecture-specific optimizations, and PyTorch bindings. This tutorial shows how upskill creates validated skills that actually work.
The kernel-builder-cuda-kernels skill teaches Claude everything it needs to know about CUDA development: which GPU architecture to target, how to structure a kernel-builder project, when to use shared memory versus registers, and how to write PyTorch bindings.
With this skill, you can tell Claude things like:
Build a fused LayerNorm + GELU kernel optimized for H100.
And Claude will create the complete project structure, CUDA implementation, and build configuration, following the exact conventions that kernel-builder expects.
This is not about generating boilerplate. The skill encodes domain expertise: H100 uses compute capability 9.0, shared memory should be aligned to 128 bytes, async memory copies require __CUDA_ARCH__ >= 900. Knowledge that would take hours to gather from documentation gets packaged into ~500 tokens that load on demand.
Setup and Install
Install upskill:
pip install upskill
# or use uvx for one-off runs
uvx upskill --help
Set your API keys:
export ANTHROPIC_API_KEY=sk-ant-...
export HF_TOKEN=hf_...
That's it. upskill uses Anthropic's Claude Opus 4.5 model by default, but also supports OpenAI and local models via OpenAI-compatible endpoints as generators. We want to use the more expensive, higher-quality models to generate skills, and the smaller ones to use them. Think Robin Hood.
Skill Generation
Let's walk through generating a skill that teaches agents how to build CUDA kernels with HuggingFace's kernels library.
Generate the Skill
Start with a clear task description:
upskill generate "build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder"
Above we used upskill, but it could in fact be any agent or chat tool and an exported trace.
upskill generate "write kernels" --from .md
Also, we could start from an existing skill and add to it:
upskill generate "add more error handling and edge cases" \
  --from ./skills/kernel-builder-cuda-kernels/
upskill loads the existing skill, applies your improvements, and re-evaluates to make sure the changes help.
upskill creates a skill, generates test cases, evaluates performance, and refines based on failures:
Generating skill with sonnet...
Generating test cases...
Evaluating on sonnet... (attempt 1)
60% -> 95% (+35%) OK
kernel-builder-cuda-kernels
Build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder.
SKILL.md ~520 tokens
baseline ████████████ 60%
with skill ███████████████████ 95% (+35%)
Saved to ./skills/kernel-builder-cuda-kernels
The baseline shows how the model performs without any skill. The "with skill" result shows performance after the skill is injected into context. A 35% improvement means the skill is working.
The skill is saved as a directory following the Agent Skills specification:
./skills/kernel-builder-cuda-kernels/
├── SKILL.md # Main instructions (~520 tokens)
└── skill_meta.json # Metadata and test cases
Open `SKILL.md` to see what upskill generated:
---
name: kernel-builder-cuda-kernels
description: Build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder.
---
# Building CUDA Kernels with kernel-builder
## Overview
This guide explains how to create optimized CUDA kernels for PyTorch models
using HuggingFace's kernel-builder. It covers project setup, kernel implementation,
and building for specific GPU architectures like NVIDIA H100.
## Project Structure
project/
├── build.toml # Build configuration
├── kernel_src/ # CUDA kernel implementations
│ ├── attention.cu
│ ├── layernorm.cu
│ └── geglu.cu
└── torch-ext/ # PyTorch C++ bindings
└── torch_binding.cpp
## Build Configuration
Create `build.toml` to define your kernel package:
[general]
name = "diffuser_kernels"
backends = ["cuda"]
[general.cuda]
# H100 is compute capability 9.0
capabilities = ["9.0"]
...
Evaluate on a Different Model
The important test is: does this skill help local or cheaper models build kernels?
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q4_K_M
# Evaluate on a local model (llama.cpp server)
upskill eval ./skills/my-skill/ \
  --model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
  --base-url http://localhost:8080/v1
Generating skill with sonnet...
Generating test cases...
Evaluating on "unsloth/GLM-4.7-Flash-GGUF:Q4_0"... (attempt 1)
40% -> 85% (+45%) OK
baseline ████████░░░░░░░░░░░░ 40%
with skill █████████████████░░░ 85% (+45%)
Saved to ./skills/kernel-builder-cuda-kernels
A 45% improvement on "unsloth/GLM-4.7-Flash-GGUF:Q4_0" means the skill successfully transfers domain knowledge from a capable model to a faster, cheaper one. Skills that work on weaker models generally work on stronger ones too.
This is the core value proposition: use expensive models to create skills, then deploy those skills with cheap or local models.
How the evaluation in upskill works
upskill uses a teacher-student approach to evaluate models, where the teacher model generates test cases on which the student model is evaluated.
- Teacher model (Opus) generates the skill
- Test cases (Opus) are generated automatically from the task description
- Student model (local) is evaluated with and without the skill
- Skill lift measures the improvement (see the sketch after this list for how these roles map to CLI flags)
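In CLI terms, reusing the flags from the quick-start above, the roles map roughly like this; the student model name and endpoint are placeholders:
# teacher (--model) generates the skill and test cases;
# student (--eval-model) is evaluated with and without the skill
upskill generate "build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder" \
  --model opus \
  --eval-model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
  --eval-base-url http://localhost:8080/v1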
If you pass an existing skill to upskill eval, it will generate test cases for the skill and evaluate the model on them. Test cases are simple input/output pairs that confirm the agent understands the task:
{
"cases": [
{
"input": "Create a build.toml for a CUDA kernel targeting H100",
"expected": {"contains": "9.0"}
},
{
"input": "Write a basic CUDA kernel template with proper includes",
"expected": {"contains": "cuda_runtime.h"}
}
]
}
We can also test how a skill performs across different models:
upskill eval ./skills/kernel-builder-cuda-kernels/ \
  --model haiku --model kimi --runs 5
Evaluating kernel-builder-cuda-kernels across 2 model(s)
3 test case(s), 5 run(s) per model
haiku
Pass rate: 4/5 (80%) Avg assertions: 2.8/3
kimi
Pass rate: 5/5 (100%) Avg assertions: 3.0/3
┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Model ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃
┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ haiku │ 4/5 │ 2.8/3 │ 1250 │
│ kimi │ 5/5 │ 3.0/3 │ 1890 │
└────────┴───────────┴────────────────┴────────────┘
This helps you find the cost-performance sweet spot: perhaps Haiku with the skill is good enough for your use case, saving significant API costs.
What’s Next
We have shown that upskill can create validated skills that transfer domain expertise from powerful models to cheaper ones. The kernel-builder skill is only one example of what is possible.
Some things to try:
- Generate skills for your internal tools
- Build a skill library for your codebase
- Capture tribal knowledge
- Benchmark across models
The approach works for any specialized task where you’d otherwise write detailed prompts repeatedly. Skills are portable across Claude Code, Codex, Cursor, and other tools that support the Agent Skills specification.