Codex is Open-Sourcing AI Models

By Ben Burtenshaw and Shaun Smith

Building on our work getting Claude Code to train open source models, we are now taking Codex even further. We gave Codex access to the Hugging Face Skills repository, which contains skills for Machine Learning and AI tasks such as training or evaluating models. With HF Skills, a coding agent can:

  • Fine-tune and apply RL alignment to language models
  • Review, explain, and act on live training metrics from Trackio
  • Evaluate checkpoints and act on evaluation results
  • Create reports from experiments
  • Export models to GGUF and quantize them for local deployment
  • Publish models to the Hub

This tutorial dives even deeper and shows you how it works and how to use it yourself. So let’s start.

Codex uses AGENTS.md files to perform specialized tasks, while Claude Code uses ‘Skills’. Fortunately, HF Skills is compatible with both approaches and works with major coding agents like Claude Code, Codex, and Gemini CLI.

With HF Skills, you can tell Codex something like:

Fine-tune Qwen3-0.6B on the dataset open-r1/codeforces-cots

And Codex will:

  1. Validate your dataset format
  2. Select appropriate hardware (t4-small for a 0.6B model)
  3. Use and update a training script with Trackio monitoring
  4. Submit the job to Hugging Face Jobs
  5. Report the job ID and estimated cost
  6. Check on progress whenever you ask
  7. Help you debug if something goes wrong

The model trains on Hugging Face GPUs while you do other things. When it’s done, your fine-tuned model appears on the Hub, ready to use.

This is not a toy demo. The extension supports the same training methods used in production: supervised fine-tuning, direct preference optimization, and reinforcement learning with verifiable rewards. You can train models from 0.5B to 7B parameters, convert them to GGUF for local deployment, and run multi-stage pipelines that combine different techniques.
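
Under the hood, the skills drive standard TRL training scripts. As a rough sketch of what a generated SFT script might look like (this is not the skill's exact script; the hyperparameters mirror the example report later in this post):

```python
# Illustrative sketch of a TRL SFT script like the ones the skills generate.
# The exact script Codex writes will differ; values mirror the report below.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset(
    "open-r1/codeforces-cots", "solutions_py_decontaminated", split="train"
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",  # SFTTrainer accepts a model id string
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-codeforces-cots-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        num_train_epochs=1,
        bf16=True,
        gradient_checkpointing=True,
        push_to_hub=True,  # push checkpoints to the Hub as they are saved
    ),
)
trainer.train()
```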



GOAL: End-to-end Machine Learning experiments

We explored this single-prompt approach in the Claude Code tutorial. However, we can now go further and get OpenAI Codex to run end-to-end Machine Learning experiments. For example, Codex should be able to monitor progress, evaluate the models, and maintain an up-to-date training report. This will allow engineers to delegate experiments to Codex and review reports in a more hands-off way. It will also allow Codex to make more decisions on its own based on the training report and evaluation results.

So let’s start!



Setup and Install

Before starting, you will need to install Codex, set up the Hugging Face Skills repository, and authenticate with Hugging Face:



Install Codex

Codex is OpenAI’s AI coding agent, included in ChatGPT Plus, Pro, Business, Edu, and Enterprise plans. It brings AI assistance directly into your development workflow.

See the Codex documentation for installation and setup instructions.



Install the Hugging Face Skills

The Hugging Face Skills repository includes an AGENTS.md file that Codex automatically detects and uses.

Clone the repository:

git clone https://github.com/huggingface/skills.git
cd skills

Codex will automatically detect the AGENTS.md file in the repository and load the skills. You can confirm the instructions are loaded with:

codex --ask-for-approval never "Summarize the present instructions."

See the Codex AGENTS guide for more details.



Connect to Hugging Face

Authenticate with Hugging Face using the hf auth login command and a write-access token from hf.co/settings/tokens:

hf auth login
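
If you want to double-check the login from Python, here is a quick sanity check with huggingface_hub (optional; the skills only need the stored token):

```python
from huggingface_hub import whoami

# Resolves the stored token to an account; raises if you are not logged in.
print(whoami()["name"])
```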

Codex supports MCP (Model Context Protocol) servers, and you can configure the Hugging Face MCP server for additional Hub integration capabilities. Add the following to your ~/.codex/config.toml file:

[mcp_servers.huggingface]
command = "npx"
args = ["-y", "mcp-remote", "https://huggingface.co/mcp?login"]

Configure the Hugging Face MCP server to use relevant tools like Jobs in the Settings page.

Then start Codex and you will be directed to the Hugging Face MCP authentication page.



Your first AI Experiment

Let’s walk through a complete example. We’ll fine-tune a small model to improve its code-solving abilities, using the open-r1/codeforces-cots dataset and the openai_humaneval benchmark.

The open-r1/codeforces-cots dataset contains Codeforces problems and solutions. It’s a good dataset for instruction-tuning a model to solve hard coding problems.
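
If you’d like to peek at the data yourself before handing it to Codex, you can stream a couple of rows with the datasets library (the solutions_py_decontaminated subset is the one used in the runs later in this post):

```python
from datasets import load_dataset

# Stream rows so we don't download the whole dataset just to inspect it.
ds = load_dataset(
    "open-r1/codeforces-cots",
    "solutions_py_decontaminated",
    split="train",
    streaming=True,
)
for row in ds.take(2):
    print(list(row.keys()))
```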



Instruct Codex to do an end-to-end fine-tuning experiment

Start Codex in your project directory. Then give it a simple, clear instruction:

Start a new fine-tuning experiment to improve code solving abilities using SFT.
- Maintain a report for the experiment. 
- Evaluate models with the openai_humaneval benchmark
- Use the open-r1/codeforces-cots dataset

You may notice that we’ve gone a bit further than the single-prompt approach in the Claude Code tutorial. We’ve added more detail to the instruction, but also more steps to the experiment.

Why not try iterating on this experiment yourself with more open-ended questions like “What’s the best model for code solving abilities?” or “What’s the best dataset for code solving abilities?”

Codex analyzes your request and prepares a training configuration. For a 0.6B model on a demo dataset, it selects t4-small: enough GPU for this model size, and the cheapest option available. Codex will start a new report at training_reports/--.md, which looks like the example below. As the experiment progresses, Codex will update the report with the latest information and each run’s report.

Example Training Report
# Base Model & Dataset
[Base Model](https://huggingface.co/Qwen/Qwen3-0.6B)  
[Dataset](https://huggingface.co/datasets/open-r1/codeforces-cots)

---

# `sft-a10g` - `TBD` - `In Progress`

## Training Parameters
| Parameter | Value |
|-----------|-------|
| Method | SFT (TRL) |
| Model | `Qwen/Qwen3-0.6B` |
| Dataset | `open-r1/codeforces-cots` (train, 5% eval split) |
| Max Length | 2048 |
| Epochs | 1 (extend to 3 after first check) |
| Per-Device Batch Size | 1 |
| Grad Accum Steps | 8 |
| Effective Batch | 8 |
| Learning Rate | 5e-5 |
| Weight Decay | 0.01 |
| Warmup Ratio | 0.03 |
| Eval Strategy | steps (500) |
| Save Strategy | steps (500), `hub_strategy=every_save`, limit=2 |
| Precision | bf16 |
| Gradient Checkpointing | true |
| Packing | false |
| Hub Model | `burtenshaw/qwen3-codeforces-cots-sft` |
| Hardware | a10g-small |
| Timeout | 2h |
| Trackio | project `qwen3-codeforces-cots`, run `sft-a10g` |

## Run Status
In Progress (queued to submit)

## Run Logs
Pending submission (job link will be added)

## Trackio Logs
Pending (will link after job starts)

## Run Evaluations
Pending (lighteval `openai_humaneval` for base + checkpoints)

---

# Experiment Evaluations
| Run Title | Benchmark | Score | Evaluation Job Link | Model Link |
|-----------|-----------|-------|---------------------|------------|
| `sft-a10g` - `TBD` - `In Progress` | HumanEval pass@1 | TBD | TBD | [burtenshaw/qwen3-codeforces-cots-sft](https://huggingface.co/burtenshaw/qwen3-codeforces-cots-sft) |



Updating the Training Report

As the experiment progresses, Codex will update the report with the latest information and each run’s report. You can view the report in the training_reports/--.md file.

For example, Codex will update the run title, such as `sft-a10g` - `TBD` - `In Progress`, while the experiment is in progress:

# `base-humaneval-a10g` - `2025-12-09 13:47:47 UTC` - `In Progress`

It can also link to the run logs and Trackio logs.

## Run Logs

[Run Logs](https://huggingface.co/jobs/burtenshaw/6938272ec67c9f186cfe1ae3)

## Trackio Logs

[Trackio Logs](https://burtenshaw-trackio.hf.space/?project=qwen3-codeforces-sft&metrics=train/loss&runs=sft-qwen3-codeforces-20251209-175806&sidebar=hidden&navbar=hidden)

And it will update the evaluation results in a combined table.

# Experiment Evaluations

| Run Title | Benchmark | Score | Evaluation Job Link | Model Link |
|-----------|-----------|-------|---------------------|------------|
| `base-humaneval-a10g` - `2025-12-09 13:47:47 UTC` - `Completed` | HumanEval pass@1 | 0.304 | [Logs](https://huggingface.co/jobs/burtenshaw/69382863c67c9f186cfe1ae7) | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| `qwen3-0.6b-lora-v1` - `2025-12-09 13:47:47 UTC` - `In Progress` | HumanEval pass@1 | TBD | TBD | [burtenshaw/qwen3-codeforces-cots-sft](https://huggingface.co/burtenshaw/qwen3-codeforces-cots-sft) |



Dataset Validation

Dataset format and processing are the most common source of training failures, and usually a significant amount of the work happens in the training script. Codex can validate datasets before the job starts and either define a configuration for TRL or preprocess the dataset separately.

In most cases, Codex will validate the dataset before training, but you can always run the dataset validation yourself before submitting the job:

Check if open-r1/codeforces-cots works for SFT training.

Codex runs a quick inspection on CPU (costing fractions of a penny) and reports:

Dataset validation for my-org/conversation-data:

SFT: ✓ READY
  Found 'messages' column with conversation format

DPO: ✗ INCOMPATIBLE
  Missing 'chosen' and 'rejected' columns
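
The check itself boils down to simple column inspection. Here is a minimal sketch of that kind of validation logic (not the skill’s actual script; my-org/conversation-data is the hypothetical dataset from the output above):

```python
from datasets import load_dataset

ds = load_dataset("my-org/conversation-data", split="train")  # hypothetical id
cols = set(ds.column_names)

# SFT expects a conversation column; DPO expects preference pairs.
print("SFT:", "READY" if "messages" in cols else "INCOMPATIBLE")
print("DPO:", "READY" if {"chosen", "rejected"} <= cols else "INCOMPATIBLE")
```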

If your dataset needs transformation, Codex can preprocess it before training:

Preprocess the dataset open-r1/codeforces-cots to use 'good_response' and 'bad_response' instead of 'chosen' and 'rejected'.

Codex will preprocess the dataset and update the training script with the new configuration.
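
As a sketch of what such a preprocessing step could look like, assuming a dataset whose preference columns are named good_response and bad_response (TRL’s DPOTrainer expects chosen and rejected; the repo ids here are illustrative):

```python
from datasets import load_dataset

ds = load_dataset("my-org/preference-data", split="train")  # hypothetical id

# Rename the preference columns to the names TRL's DPOTrainer expects.
ds = ds.rename_column("good_response", "chosen")
ds = ds.rename_column("bad_response", "rejected")

# Push the transformed dataset to the Hub for the training job to use.
ds.push_to_hub("username/preference-data-dpo")  # hypothetical repo id
```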



Review Before Submitting

Before Codex submits anything, you will see the configuration:

I'll fine-tune Qwen/Qwen3-0.6B on open-r1/codeforces-cots using SFT.

Configuration:
- Hardware: t4-small (~$0.75/hour)
- Estimated time: ~20 minutes
- Estimated cost: ~$0.30
- Output: username/qwen-codeforces-cots-sft

The model will be pushed to the Hub automatically. Should I submit?

This is your chance to adjust anything. Change the output repo name, pick different hardware, or ask Codex to modify the training parameters. Once you approve, the job is submitted.

For example, you can ask Codex to try a test run:

Do a quick test run on 100 examples.
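
Behind a request like that, the change usually amounts to capping the data and the optimizer steps. A self-contained sketch of those two knobs (values are illustrative):

```python
from datasets import load_dataset
from trl import SFTConfig

# A cheap smoke test: 100 examples and a handful of optimizer steps.
dataset = load_dataset(
    "open-r1/codeforces-cots", "solutions_py_decontaminated", split="train"
).select(range(100))
config = SFTConfig(output_dir="smoke-test", max_steps=10)
```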



Track Progress using the Training Report

After submission, you can track progress using the training report, which links to the run logs and Trackio logs. You can also tell Codex to fetch the logs, summarize progress, and update the training report. As Codex runs the jobs, it will update the training report with the latest information.

Example Codex Response
─ Worked for 10m 38s ─────────────────────────────────────────────────────────────────────────────────────────────────────────

• Training and eval jobs are live. Key details:

  - SFT launched on open-r1/codeforces-cots (solutions_py_decontaminated, 98/2 split) with LoRA on Qwen/Qwen3-0.6B, a10g-
    large, bf16, max_steps=1000, batch=1, grad_accum=8, lr=2e-5, save/eval every 200 steps, hub push burtenshaw/qwen3-
    codeforces-cots-sft. Job: https://huggingface.co/jobs/burtenshaw/6938272ec67c9f186cfe1ae3 (Trackio project codeforces-
    cots-sft, run qwen3-0.6b-lora-v1).
  - Baseline lighteval (HumanEval) on the base model queued on a10g-large. Job: https://huggingface.co/jobs/
  - Experiment report updated with current runs and links: training_reports/qwen3-codeforces-cots-sft.md.

  To evaluate the fine-tuned checkpoints once the SFT job pushes to the Hub, reuse the same lighteval script:
  hf jobs uv run --flavor a10g-large --timeout 1h --secrets HF_TOKEN -e MODEL_ID=burtenshaw/qwen3-codeforces-cots-sft -e
  RESULTS_ORG=burtenshaw -d /tmp/lighteval_humaneval.py
  (or swap MODEL_ID for a specific checkpoint path). This will log to the same results directory on the Hub.

  1. Monitor the two running jobs; I can fetch logs or metrics on request.
  2. After a checkpoint lands, trigger the lighteval run above to compare against the baseline; then we can fill in pass@1 in the report.

Trackio will monitor the training and evaluation scores for you by running evaluation jobs and selecting the best checkpoint. If you want to know the status of the training job, you can ask Codex to fetch the logs and summarize progress in a table:

Are the models outperforming the base model?
| Model | Benchmark | Score | Evaluation Job Link | Model Link |
|-----------|-----------|-------|---------------------|------------|
| `qwen3-0.6b-lora-v1` - `2025-12-09 13:47:47 UTC` - `Completed` | HumanEval pass@1 | 0.342 | [Logs](<link to training job>) | [burtenshaw/qwen3-codeforces-cots-sft](https://huggingface.co/burtenshaw/qwen3-codeforces-cots-sft) |
| `base-humaneval-a10g` - `2025-12-09 13:47:47 UTC` - `Completed` | HumanEval pass@1 | 0.306 | [Logs](<link to evaluation job>) | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |

You can also monitor the training loss in real time.

Example Trackio dashboard of a Sweep test

Codex fetches the logs and summarizes progress.

Click here for an example Trackio dashboard with some completed runs.
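
Trackio’s logging API is wandb-compatible, so a training script reports metrics with the familiar init/log/finish calls. A minimal sketch (the project and run names are illustrative):

```python
import trackio

# Open a run in the project the report links to.
trackio.init(project="qwen3-codeforces-cots", name="sft-a10g")

for loss in (2.1, 1.7, 1.4):  # stand-in values for a real training loop
    trackio.log({"train/loss": loss})

trackio.finish()
```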



Use Your Model

When training completes, your model is on the Hub:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("burtenshaw/qwen3-codeforces-cots-sft")
tokenizer = AutoTokenizer.from_pretrained("burtenshaw/qwen3-codeforces-cots-sft")
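
From there, you can prompt it with the tokenizer’s chat template. A small generation example (the prompt and decoding settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "burtenshaw/qwen3-codeforces-cots-sft"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Reverse a string in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```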

Transformers is great as a standard format, and we can easily convert the trained model to GGUF for local deployment. This is because the training skill contains instructions and support scripts for converting models to GGUF.

Convert my fine-tuned model to GGUF with Q4_K_M quantization.
Push to username/my-model-gguf.

Codex then converts the model to GGUF, applies quantization, and pushes it to the Hub. If we trained a LoRA adapter, it will first merge the adapter into the base model.
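
If you ever want to do the merge step manually, peft exposes it directly. A minimal sketch (the adapter repo id matches the training run above; the local path is illustrative):

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "burtenshaw/qwen3-codeforces-cots-sft"

# Load the base model with the adapter, then fold the LoRA weights in.
model = AutoPeftModelForCausalLM.from_pretrained(adapter_id)
merged = model.merge_and_unload()

merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(adapter_id).save_pretrained("merged-model")
```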

Then use it locally:

llama-server -hf <username>/<model>:<quantization>

For example:

llama-server -hf unsloth/Qwen3-1.7B-GGUF:Q4_K_M



Hardware and Cost

Codex selects hardware based on your model size, but understanding the tradeoffs helps you make better decisions. You can use the Hardware Guide to see the hardware options and prices, but Codex will do it for you and pick the best option.

For tiny models under 1B parameters, t4-small works well. These models train quickly; expect $1-2 for a full run. This is ideal for educational or experimental runs.

For small models (1-3B), step up to t4-medium or a10g-small. Training takes a few hours and costs $5-15.

For medium models (3-7B), you need a10g-large or a100-large with LoRA. Full fine-tuning doesn’t fit, but LoRA makes these very trainable. Budget $15-40 for production runs.
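
For reference, a LoRA setup in a TRL script is just a small config object passed to the trainer. A sketch with illustrative defaults (not the skill’s exact values):

```python
from peft import LoraConfig

# LoRA freezes the base weights and trains small adapter matrices,
# which is what lets 3-7B models fit on a single a10g/a100.
peft_config = LoraConfig(
    r=16,                       # adapter rank (illustrative)
    lora_alpha=32,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
# Pass this to TRL's SFTTrainer as `peft_config=peft_config`.
```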

For large models (7B+), the HF Skills jobs are not suitable at this scale yet. But stay tuned, because we’re working on it!



What’s Next

We’ve shown that Codex can handle the full lifecycle of model fine-tuning: validating data, selecting hardware, generating scripts, submitting jobs, monitoring progress, and converting outputs.

Some things to try:

  • Fine-tune a model on your own dataset
  • Try larger experiments with more models and datasets, and let the agent create a report for you
  • Train a reasoning model with GRPO on math or code, and let the agent create a report for you

The extension is open source. You can extend it, customize it for your workflows, or use it as a starting point for other training scenarios.




Resources



Codex



Hugging Face Skills



