Fine-tuning and reinforcement learning (RL) for large language models (LLMs) require advanced expertise and sophisticated workflows, putting them out of reach for many. The open source Unsloth project changes that by streamlining the process, making it easier for individuals and small teams to explore LLM customization. Paired with the efficiency and throughput of NVIDIA Blackwell GPUs, this combination helps democratize access to LLM development, opening the door for a wider community of practitioners to innovate.
This post explains how developers can train custom LLMs locally on NVIDIA RTX PRO 6000 Blackwell Series, GeForce RTX 50 Series, and NVIDIA DGX Spark systems using Unsloth. It also covers how these same workflows scale seamlessly into Blackwell-powered cloud instances, such as NVIDIA DGX Cloud and those from NVIDIA Cloud Partners, for production workloads.
What’s Unsloth?
Unsloth is an open source framework that simplifies and accelerates LLM fine-tuning and RL. It uses custom Triton kernels and algorithms to deliver:
- 2x faster training throughput
- 70% less VRAM usage
- No accuracy loss
It supports popular models such as Llama, gpt-oss, and DeepSeek, and is now optimized for NVIDIA Blackwell GPUs with NVFP4 precision.
With support from the NVIDIA DGX Cloud AI team, Unsloth extends from consumer GPUs, such as the GeForce RTX 50 Series, RTX PRO 6000 Blackwell Series, and NVIDIA GB10-based developer workstations (such as the NVIDIA DGX Spark), to enterprise-class NVIDIA HGX B200 and NVIDIA GB200 NVL72 systems. This makes fine-tuning accessible to everyone.
How does Unsloth perform on NVIDIA Blackwell?
Unsloth benchmarks show that, on NVIDIA Blackwell, it delivers significant gains compared with other optimized setups, including Flash Attention 2 (FA2). Specifically, it delivers:
- 2x increase in training speed
- 70% VRAM reduction (even for 70B+ parameter models)
- 12x longer context windows
These results mean that you can now fine-tune models with up to 40 billion parameters on a single Blackwell GPU.
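As a rough back-of-envelope check (my own arithmetic, not an Unsloth figure): in 4-bit precision each parameter occupies half a byte, so the base weights of a 40B-parameter model take about 20 GB, leaving headroom on a 32 GB card for LoRA adapters, activations, and optimizer state.

```python
# Back-of-envelope VRAM estimate for the 4-bit (QLoRA) base weights.
# Assumption (mine): 4-bit quantization ~= 0.5 bytes per parameter,
# ignoring quantization constants, activations, and optimizer state.
def base_weights_gb(num_params: float, bits: int = 4) -> float:
    bytes_per_param = bits / 8
    return num_params * bytes_per_param / 1e9

print(base_weights_gb(40e9))  # 40B params in 4-bit -> 20.0 GB
print(base_weights_gb(8e9))   # 8B params in 4-bit  -> 4.0 GB
```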
Test setup: NVIDIA GeForce RTX 5090 GPU with 32 GB of VRAM, Alpaca dataset, batch size = 2, gradient accumulation = 4, rank = 32, QLoRA applied on all linear layers.
| Model | VRAM | Unsloth speed | VRAM reduction | Longer context | Hugging Face + FA2 |
|---|---|---|---|---|---|
| Llama 3.1 (8B) | 80 GB | 2x | >70% | 12x longer | 1x |
| VRAM | Unsloth context length | Hugging Face + FA2 context length |
|---|---|---|
| 8 GB | 2,972 | OOM |
| 12 GB | 21,848 | 932 |
| 16 GB | 40,724 | 2,551 |
| 24 GB | 78,475 | 5,789 |
| 32 GB | 122,181 | 9,711 |
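To make the "rank = 32, QLoRA applied on all linear layers" setting in the test setup concrete, here is a sketch (my own arithmetic, using Llama 3.1 8B's published dimensions: hidden size 4096, intermediate size 14336, 32 layers, 8 KV heads of dim 128) of how many trainable parameters the LoRA adapters add — roughly 84M, about 1% of the 8B base model:

```python
# Estimate LoRA trainable parameters at rank r on all linear layers of a
# Llama-3.1-8B-shaped model. Each LoRA pair adds r * (in + out) parameters.
# The dimensions are Llama 3.1 8B's published config; the helper is illustrative.
def lora_params(r: int = 32, hidden: int = 4096, intermediate: int = 14336,
                kv_dim: int = 1024, layers: int = 32) -> int:
    per_layer = (
        r * (hidden + hidden)          # q_proj
        + r * (hidden + kv_dim)        # k_proj
        + r * (hidden + kv_dim)        # v_proj
        + r * (hidden + hidden)        # o_proj
        + r * (hidden + intermediate)  # gate_proj
        + r * (hidden + intermediate)  # up_proj
        + r * (intermediate + hidden)  # down_proj
    )
    return per_layer * layers

print(lora_params())  # 83886080, i.e. ~84M trainable parameters at rank 32
```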
Install Unsloth on NVIDIA GPUs
Unsloth setup is straightforward, whether you prefer a quick pip install, an isolated virtual environment, or a containerized Docker deployment. Try the following examples on any Blackwell generation GPU, including the GeForce RTX 50 Series.
Running a 20B model
The following example shows how to run the gpt-oss-20b model:
```python
from unsloth import FastLanguageModel
import torch

max_seq_length = 1024

# 4-bit pre-quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit",  # 20B model using bitsandbytes 4-bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b",  # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
]  # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = max_seq_length,  # Choose any length for long context!
    load_in_4bit = True,  # 4-bit quantization to reduce memory use
    full_finetuning = False,  # [NEW!] Full fine-tuning is now supported!
    # token = "hf_...",  # use one if using gated models
)
```
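The fourbit_models list above mixes bitsandbytes 4-bit and native MXFP4 variants. As a purely illustrative sketch (this helper and its thresholds are my own assumptions, not part of Unsloth's API or official guidance), you could select a checkpoint name based on your available VRAM:

```python
# Hypothetical helper: pick an unsloth/gpt-oss checkpoint by VRAM budget.
# The model names come from the list above; the GB thresholds are rough
# assumptions of mine, not official Unsloth recommendations.
def pick_gpt_oss(vram_gb: float) -> str:
    if vram_gb >= 80:
        return "unsloth/gpt-oss-120b-unsloth-bnb-4bit"
    if vram_gb >= 16:
        return "unsloth/gpt-oss-20b"  # native MXFP4 format
    return "unsloth/gpt-oss-20b-unsloth-bnb-4bit"  # tightest memory budget

print(pick_gpt_oss(32))  # a 32 GB RTX 5090 -> "unsloth/gpt-oss-20b"
```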
Docker deployment
Unsloth also offers a prebuilt Docker image, which is supported on NVIDIA Blackwell GPUs.
Note that the Docker container requires the NVIDIA Container Toolkit to be installed on your host system.
Before running the following command, fill in your specific information:
```shell
docker run -d -e JUPYTER_PASSWORD="mypassword" \
  -p 8888:8888 -p 2222:22 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth
```
Using an isolated environment
Run the following shell commands to install Unsloth in a Python virtual environment:
```shell
python -m venv unsloth
source unsloth/bin/activate
pip install unsloth
```
Note: Depending on your system, you may need to use pip3 / pip3.13 and python3 / python3.13.
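To confirm the install succeeded without launching a full training run, you can check that the package is importable (a generic Python check, not an Unsloth-specific command):

```python
# Quick sanity check: is a package importable in the active environment?
import importlib.util

def is_importable(package: str) -> bool:
    # find_spec returns None when the top-level package cannot be found.
    return importlib.util.find_spec(package) is not None

print(is_importable("unsloth"))  # True once pip install has completed
```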
Handling issues with xFormers
If you encounter issues with xFormers, build it from source.
First, uninstall any existing xFormers:
```shell
pip uninstall xformers -y
```
Next, clone and build:
```shell
pip install ninja
export TORCH_CUDA_ARCH_LIST="12.0"
git clone --depth=1 https://github.com/facebookresearch/xformers --recursive
cd xformers && python setup.py install && cd ..
```
Using uv
If you prefer uv, you can install Unsloth with `uv pip install unsloth`.
While Unsloth enables local experimentation with 20B and 40B models on a single Blackwell GPU, the same workflows are fully portable to NVIDIA DGX Cloud and NVIDIA Cloud Partners. This enables scaling to clusters of Blackwell GPUs for fine-tuning 70B+ models, reinforcement learning, and enterprise workloads without changing a line of code.
Start transforming LLM training runs
From experimentation to production, NVIDIA DGX Cloud and NVIDIA Cloud Partners deliver the power to train and fine-tune at any scale—combining elastic compute, enterprise storage, and real-time monitoring in fully managed AI environments optimized for NVIDIA GPUs.
According to Unsloth co-founders Daniel and Michael Han, “AI shouldn’t be an exclusive club. The next great AI breakthrough could come from anywhere—students, individual researchers, or small startups. Unsloth is here to make sure they have the tools they need.”
Start locally on your NVIDIA GeForce RTX 50 Series GPU, NVIDIA RTX PRO 6000 Blackwell Series GPU, or NVIDIA DGX Spark system to fine-tune models with Unsloth. Then scale seamlessly with NVIDIA DGX Cloud or an NVIDIA Cloud Partner to harness clusters of Blackwell GPUs with enterprise-grade reliability and visibility—all without compromise. Check out the step-by-step guide to fine-tuning LLMs with NVIDIA Blackwell GPUs and Unsloth, and how to install the software on NVIDIA DGX Spark.
