Hugging Face TRL now officially integrates with RapidFire AI to speed up your fine-tuning and post-training experiments. TRL users can now discover, install, and run RapidFire AI as the fastest way to compare multiple fine-tuning/post-training configurations for customizing LLMs, without major code changes and without bloating GPU requirements.
Why this matters
When fine-tuning or post-training LLMs, teams often don't have the time and/or budget to try multiple configs, even though doing so can significantly boost eval metrics. RapidFire AI lets you launch multiple TRL configs concurrently, even on a single GPU, and compare them in near real time via a new adaptive, chunk-based scheduling and execution scheme. In internal benchmarks referenced on the TRL page, this delivers ~16–24× higher experimentation throughput than comparing configs sequentially one after another, enabling you to reach better metrics much faster.

RapidFire AI establishes live three-way communication between your IDE, a metrics dashboard, and a multi-GPU execution backend
What you get, out of the box
- Drop-in TRL wrappers — Use RFSFTConfig, RFDPOConfig, and RFGRPOConfig as near-zero-code replacements for TRL's SFT/DPO/GRPO configs (see the snippet after this list).
- Adaptive chunk-based concurrent training — RapidFire AI shards the dataset into a given number of chunks and cycles configs at chunk boundaries to enable earlier apples-to-apples comparisons and also maximize GPU utilization.
- Interactive Control Ops (IC Ops) — From the dashboard itself, you can Stop, Resume, Delete, and Clone-Modify, optionally with Warm-Start, any runs in flight to save resources on underperforming configs and double down on higher-performing ones, with no job restarts, no juggling separate GPUs or clusters, and no resource bloat.

Clone promising configurations with modified hyperparameters, optionally warm-starting from the parent’s weights, all from the live dashboard
- Multi-GPU orchestration — The RapidFire AI scheduler automatically places and orchestrates configs across available GPUs on chunks of data via efficient shared-memory mechanisms. You focus on your models and eval metrics, not plumbing.
- MLflow-based dashboard — Real-time metrics, logs, and IC Ops in one place as soon as you start your experiment. Support for more dashboards such as Trackio, W&B, and TensorBoard is coming soon.
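To make the drop-in point above concrete, here is a minimal sketch of the swap (hyperparameter values are placeholders; RFSFTConfig accepts the same fields shown in the full example later in this post):

from trl import SFTConfig
from rapidfireai.automl import RFSFTConfig

# Plain TRL: one SFT configuration
trl_args = SFTConfig(learning_rate=1e-4, max_steps=128, fp16=True)

# RapidFire AI: same fields, near-zero code change; several of these can be
# declared and trained concurrently (placeholder values, see the full example below)
rf_args = RFSFTConfig(learning_rate=1e-4, max_steps=128, fp16=True)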
How it works
RapidFire AI splits your dataset randomly into "chunks" and cycles LLM configurations through the GPUs at chunk boundaries. You get incremental signal on eval metrics across all configs much more quickly. Automated checkpointing via an efficient shared-memory-based adapter/model spilling/loading mechanism keeps training smooth, stable, and consistent. Use IC Ops to adapt mid-flight: stop low performers earlier and clone promising ones with tweaked config knobs, optionally warm-starting from the parent's weights.

Sequential vs. Task Parallel vs. RapidFire AI: The adaptive scheduler maximizes GPU utilization across multiple configs and GPUs. The bottom row shows IC Ops in action—stopping, cloning, and modifying runs mid-flight.
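To make the chunk-cycling idea concrete, here is a toy sketch in plain Python (a deliberately simplified stand-in, not RapidFire AI's actual scheduler or checkpointing code):

# Toy illustration only: two "configs" share one GPU by alternating over dataset
# chunks, so both produce comparable metrics at every chunk boundary instead of
# only after a full pass. The loss formula is a made-up stand-in for training.
def train_on_chunk(state, chunk, lr):
    state["examples_seen"] += len(chunk)
    state["loss"] = 1.0 / (1.0 + lr * state["examples_seen"])
    return state

configs = {"lr=1e-3": 1e-3, "lr=1e-4": 1e-4}
states = {name: {"examples_seen": 0, "loss": None} for name in configs}
chunks = [list(range(32))] * 4  # dataset split into 4 chunks

for i, chunk in enumerate(chunks, start=1):   # chunk boundary = comparison point
    for name, lr in configs.items():          # cycle every config through the GPU
        states[name] = train_on_chunk(states[name], chunk, lr)
    print(f"after chunk {i}: " +
          ", ".join(f"{name} loss={s['loss']:.3f}" for name, s in states.items()))

In the real system, each switch spills and reloads adapter/model state through shared memory, which is what keeps the cycling overhead low.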
Getting Started
Install RapidFire AI and get running in under a minute:
pip install rapidfireai
huggingface-cli login --token YOUR_TOKEN
pip uninstall -y hf-xet
rapidfireai init
rapidfireai start
The dashboard launches at http://localhost:3000, where you can monitor and control all of your experiments.
Supported TRL trainers
- SFT with RFSFTConfig
- DPO with RFDPOConfig
- GRPO with RFGRPOConfig
These are designed as drop-in replacements, so you can keep your TRL mental model while gaining far more concurrency and control in your fine-tuning/post-training applications.
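For DPO and GRPO, the pattern mirrors the SFT example below; here is a hedged sketch (the DPO/GRPO-specific fields follow TRL's DPOConfig and GRPOConfig, and the trainer_type strings are assumed by analogy with the SFT example):

from rapidfireai.automl import RFModelConfig, RFLoraConfig, RFDPOConfig, RFGRPOConfig

# DPO variant: same drop-in idea, with DPO-specific knobs from TRL's DPOConfig
dpo_config = RFModelConfig(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    peft_config=RFLoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
    training_args=RFDPOConfig(learning_rate=5e-5, beta=0.1, max_steps=128),
)

# GRPO variant: GRPO-specific knobs from TRL's GRPOConfig
grpo_config = RFModelConfig(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    peft_config=RFLoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
    training_args=RFGRPOConfig(learning_rate=5e-5, num_generations=4, max_steps=128),
)

# These plug into the same flow as SFT, e.g. RFGridSearch(configs=..., trainer_type="DPO")
# or trainer_type="GRPO" (strings assumed by analogy with trainer_type="SFT" below).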
Minimal TRL SFT example
Here’s what it looks like to train multiple configurations concurrently, even on a single GPU:
from rapidfireai import Experiment
from rapidfireai.automl import List, RFGridSearch, RFModelConfig, RFLoraConfig, RFSFTConfig
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load a customer-support dataset and take a small sample for a quick comparison
dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")
train_dataset = dataset["train"].select(range(128)).shuffle(seed=42)
# Format each row into TRL's conversational prompt/completion schema
def formatting_function(row):
return {
"prompt": [
{"role": "system", "content": "You are a helpful customer support assistant."},
{"role": "user", "content": row["instruction"]},
],
"completion": [{"role": "assistant", "content": row["response"]}]
}
# Apply the formatting to the training split used below
train_dataset = train_dataset.map(formatting_function)
# Two configs to compare: different LoRA capacity and learning rate
config_set = List([
RFModelConfig(
model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
peft_config=RFLoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]),
training_args=RFSFTConfig(learning_rate=1e-3, max_steps=128, fp16=True),
),
RFModelConfig(
model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
peft_config=RFLoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"]),
training_args=RFSFTConfig(learning_rate=1e-4, max_steps=128, fp16=True),
formatting_func=formatting_function,
)
])
experiment = Experiment(experiment_name="sft-comparison")
config_group = RFGridSearch(configs=config_set, trainer_type="SFT")
# Callback that creates the model and tokenizer for each config
def create_model(model_config):
model = AutoModelForCausalLM.from_pretrained(
model_config["model_name"],
device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_config["model_name"])
return (model, tokenizer)
# Train all configs concurrently, cycling at each of 4 chunk boundaries
experiment.run_fit(config_group, create_model, train_dataset, num_chunks=4, seed=42)
experiment.end()
What happens when you run this?
Suppose you run the above on a 2-GPU machine. Instead of training sequentially (Config 1 → wait → Config 2 → wait), both configs train concurrently:
| Approach | Time to Comparative Decision | GPU Utilization |
|---|---|---|
| Sequential (traditional) | ~15 minutes | 60% |
| RapidFire AI (concurrent) | ~5 minutes | 95%+ |
You get to a comparative decision 3× sooner on the same resources, once both configs finish processing the first data chunk instead of waiting for each to see the entire dataset one after another. Open http://localhost:3000 to watch live metrics and use IC Ops to stop, clone, or tweak runs in real time based on what you see.
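The exact minutes depend on model size, hardware, and overheads; the shape of the arithmetic behind the speedup is simply this (illustrative single-GPU numbers, assumed for the sketch, not the benchmark figures in this post):

# Illustrative arithmetic only: made-up per-chunk cost, not a measurement
minutes_per_chunk_per_config = 2.0   # assumed time to train one config on one chunk
num_configs, num_chunks = 2, 4

# Sequential: a fair comparison is only possible after every config has seen all chunks
sequential_decision = num_configs * num_chunks * minutes_per_chunk_per_config

# Chunk-cycled: every config has comparable metrics after its first chunk
chunked_decision = num_configs * minutes_per_chunk_per_config

print(f"first apples-to-apples comparison: ~{sequential_decision:.0f} min sequential "
      f"vs ~{chunked_decision:.0f} min chunk-cycled")
# More GPUs scale both numbers down; real runs land somewhat below the ideal
# ratio because of evaluation and checkpoint-swap overheads at chunk boundaries.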
Benchmarks: Real-World Speedups
Here’s what teams see in time to reach a comparable overall best training loss (across all tried configs) when switching from sequential comparisons to RapidFire AI-enabled hyperparallel experimentation:
| Scenario | Sequential Time | RapidFire AI Time | Speedup |
|---|---|---|---|
| 4 configs, 1 GPU | 120 min | 7.5 min | 16× |
| 8 configs, 1 GPU | 240 min | 12 min | 20× |
| 4 configs, 2 GPUs | 60 min | 4 min | 15× |
Benchmarks on NVIDIA A100 40GB with TinyLlama-1.1B and Llama-3.2-1B models
Get Started Today
🚀 Try it hands-on: Interactive Colab Notebook — Zero setup, runs in your browser
📚 Full Documentation: oss-docs.rapidfire.ai — Complete guides, examples, and API reference
💻 GitHub: RapidFireAI/rapidfireai — Open source, production-ready
📦 Install via PyPI: pypi.org/project/rapidfireai — pip install rapidfireai
💬 Join the Community: Discord — Get help, share results, request features
RapidFire AI was built because the common status quo of trying one config at a time wastes both time and GPU cycles. With this official integration, every TRL user can fine-tune/post-train smarter, iterate faster, and ship better models.
Try the integration and let us know: How much faster is your experimentation loop? What should we build next? We’re just getting started, and your feedback shapes where we go from here.
