Pulling your hair out because LLM fine-tuning is taking forever? In this post, we introduce a lightweight tool developed by the community to make LLM fine-tuning go super fast!
Before diving into Unsloth, it may be helpful to read our QLoRA blog post, or be familiar with LLM fine-tuning using the 🤗 PEFT library.
Unsloth – 2x faster, -40% memory usage, 0% accuracy degradation
Unsloth is a lightweight library for faster LLM fine-tuning which is fully compatible with the Hugging Face ecosystem (Hub, transformers, PEFT, TRL). The library is actively developed by the Unsloth team (Daniel and Michael) and the open source community. The library supports most NVIDIA GPUs (from GTX 1070 all the way up to H100s) and can be used with the whole trainer suite from the TRL library (SFTTrainer, DPOTrainer, PPOTrainer). At the time of writing, Unsloth supports the Llama (CodeLlama, Yi, etc) and Mistral architectures.
Unsloth works by overwriting some parts of the modeling code with optimized operations. By manually deriving backpropagation steps and rewriting all PyTorch modules into Triton kernels, Unsloth can both reduce memory usage and make fine-tuning faster. Crucially, accuracy degradation is 0% with respect to normal QLoRA, because no approximations are made in the optimized code.
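To make the kernel-fusion idea concrete, here is a minimal sketch of what an elementwise Triton kernel looks like. The fused SwiGLU-style activation below is purely illustrative (the names and the specific op are ours, not Unsloth's actual kernels): it computes silu(gate) * x in a single pass rather than materializing intermediate tensors in global memory.

```python
import triton
import triton.language as tl

@triton.jit
def swiglu_kernel(x_ptr, gate_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    gate = tl.load(gate_ptr + offsets, mask=mask)
    # silu(gate) * x fused into one pass: no intermediate tensor is written
    # back to global memory, unlike a naive PyTorch implementation.
    out = x * gate * tl.sigmoid(gate)
    tl.store(out_ptr + offsets, out, mask=mask)
```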
Benchmarking
| 1 A100 40GB | Dataset | 🤗 Hugging Face | 🤗 + Flash Attention 2 | 🦥 Unsloth | 🦥 VRAM reduction |
|---|---|---|---|---|---|
| Code Llama 34b | Slim Orca | 1x | 1.01x | 1.94x | -22.7% |
| Llama-2 7b | Slim Orca | 1x | 0.96x | 1.87x | -39.3% |
| Mistral 7b | Slim Orca | 1x | 1.17x | 1.88x | -65.9% |
| Tiny Llama 1.1b | Alpaca | 1x | 1.55x | 2.74x | -57.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.24x | 1.88x | -11.6% |

| Free Colab T4 | Dataset | 🤗 Hugging Face | 🤗 + PyTorch 2.1.1 | 🦥 Unsloth | 🦥 VRAM reduction |
|---|---|---|---|---|---|
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |
Unsloth was benchmarked across 59 runs using 4 datasets on Tesla T4 and A100 Google Colab instances. QLoRA was applied to all linear layers (attention and MLP) with a rank of 16, and gradient checkpointing was on. By testing against the latest Transformers version (4.36), which natively integrates SDPA if you have PyTorch 2.1.1, Unsloth is up to 2.7x faster and uses up to 74% less memory. We also tested Unsloth on a free Google Colab instance (low RAM, 1 T4 GPU, PyTorch 2.1.0, CUDA 12.1). All 59 notebooks are provided for full reproducibility, and more details can be found in Unsloth’s benchmarking details here.
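For reference, the benchmark's adapter setup corresponds to something like the following vanilla 🤗 PEFT configuration (a sketch reconstructed from the settings described above, not the exact benchmark script):

```python
from peft import LoraConfig

# LoRA on all attention and MLP linear layers, rank 16, as in the benchmarks.
lora_config = LoraConfig(
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)
```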
How do I use Unsloth?
Just load your model with FastLanguageModel.from_pretrained! Currently, Unsloth supports Llama and Mistral type architectures (Yi, Deepseek, TinyLlama, Llamafied Qwen). Please open a GitHub issue if you want others! Also, on the latest Transformers main branch, you can now load pre-quantized 4bit models directly! This makes downloading models 4x faster, and reduces memory fragmentation by around 500MB, which allows you to fit larger batches! We have a few pre-quantized models for your convenience, including unsloth/llama-2-7b-bnb-4bit, unsloth/llama-2-13b-bnb-4bit, unsloth/mistral-7b-bnb-4bit and unsloth/codellama-34b-bnb-4bit.
You’ll need to provide your intended maximum sequence length to from_pretrained. Unsloth internally performs RoPE scaling, so larger maximum sequence lengths are automatically supported. Otherwise, the API is pretty much the same as transformers’ from_pretrained, except that FastLanguageModel.from_pretrained also returns the model tokenizer for convenience.
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
```
Once the model has been loaded, use FastLanguageModel.get_peft_model to attach adapters in order to perform QLoRA fine-tuning.
```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
)
```
Once adapters are attached, you can use the model directly within any class from the HF ecosystem, such as the SFTTrainer from TRL!
Unsloth + TRL integration
To use Unsloth with the TRL library, simply pass the Unsloth model into SFTTrainer or DPOTrainer! The trained model is fully compatible with the Hugging Face ecosystem, so you can push the final model to the Hub and use transformers for inference out of the box!
```python
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
from unsloth import FastLanguageModel

max_seq_length = 2048

# Load the dataset and the pre-quantized 4bit model.
dataset = load_dataset("imdb", split="train")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Attach LoRA adapters to all attention and MLP linear layers.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        # Use bf16 on Ampere+ GPUs, fp16 otherwise.
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()
```
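Once training completes, you can push the adapters to the Hub and reload them for inference with the standard PEFT / transformers APIs. A minimal sketch, assuming you are logged in to the Hub; the repo name is illustrative:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Push the trained adapters to the Hub (repo name is a placeholder).
model.push_to_hub("your-username/mistral-7b-imdb-qlora")
tokenizer.push_to_hub("your-username/mistral-7b-imdb-qlora")

# Reload for inference with the standard HF stack; no Unsloth required.
model = AutoPeftModelForCausalLM.from_pretrained(
    "your-username/mistral-7b-imdb-qlora",
    load_in_4bit = True,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/mistral-7b-imdb-qlora")

inputs = tokenizer("The movie was", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```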
Reproducible notebooks
Below, we share fully reproducible notebooks for anyone who wants to try out Unsloth with SFTTrainer on a free-tier Google Colab instance.
- Llama 7b Free Tesla T4 colab example here
- Mistral 7b Free Tesla T4 colab example here
- CodeLlama 34b A100 colab example here
- Zephyr DPO replication T4 colab example here
