QLoRa: Fine-Tune a Large Language Model on Your GPU

QLoRa: Quantized LLMs with Low-Rank Adapters
Fine-tuning a GPT model with QLoRa
GPT Inference with QLoRa
Conclusion

Most large language models (LLMs) are too big to be fine-tuned on consumer hardware. For instance, fine-tuning a 65 billion parameter model would require more than 780 GB of GPU memory, the equivalent of ten A100 80 GB GPUs. In other words, you would need cloud computing to fine-tune your models.

Now, with QLoRa (Dettmers et al., 2023), you can do it with just one A100.

In this blog post, I will introduce QLoRa. I will briefly describe how it works, and we will see how to use it to fine-tune a GPT model with 20 billion parameters on your GPU.

Note: I used my own NVIDIA RTX 3060 12 GB to run all of the commands in this post. You can also use a free instance of Google Colab to achieve the same results. If you want to use a GPU with less memory, you would have to use a smaller LLM.

QLoRa: Quantized LLMs with Low-Rank Adapters

In June 2021, Hu et al. (2021) introduced low-rank adapters (LoRa) for LLMs.

LoRa adds a small number of trainable parameters, i.e., adapters, to each layer of the LLM and freezes all the original parameters. For fine-tuning, we only have to update the adapter weights, which significantly reduces the memory footprint.
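To make the idea more concrete, here is a minimal, simplified sketch of what a LoRa adapter around a frozen linear layer could look like. It is purely illustrative and not the actual PEFT implementation; the class name and initialization are my own.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Illustrative sketch: a frozen linear layer plus a trainable low-rank update.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the original weights stay frozen
        # The adapter: two small matrices whose product is a rank-r update
        self.lora_A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(r, base.out_features))
        self.scaling = alpha / r

    def forward(self, x):
        # Output of the frozen layer plus the low-rank correction
        return self.base(x) + (x @ self.lora_A @ self.lora_B) * self.scaling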

QLoRa goes three steps further by introducing 4-bit quantization, double quantization, and the exploitation of NVIDIA unified memory for paging.

In a few words, each of these steps works as follows:

  • 4-bit NormalFloat (NF4) quantization: This is a method that improves upon quantile quantization. It ensures an equal number of values in each quantization bin, which avoids computational issues and errors for outlier values.
  • Double quantization: The authors of QLoRa define it as “the process of quantizing the quantization constants for additional memory savings.”
  • Paged optimizers: They rely on the NVIDIA Unified Memory feature and automatically handle page-to-page transfers between the CPU and GPU. This ensures error-free GPU processing, especially in situations where the GPU may run out of memory.

All of these steps drastically reduce the memory requirements for fine-tuning, while performing almost on par with standard fine-tuning.
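As a rough back-of-the-envelope illustration (it ignores activations, optimizer states, and quantization overhead), here is why storing the weights in 4-bit makes a 20 billion parameter model fit on much smaller GPUs:

# Back-of-the-envelope: memory needed just to store the weights of a 20B model
params = 20e9

fp16_gb = params * 2 / 1e9    # 16-bit: 2 bytes per parameter -> ~40 GB
nf4_gb = params * 0.5 / 1e9   # 4-bit: 0.5 byte per parameter -> ~10 GB

print(f"fp16 weights: ~{fp16_gb:.0f} GB, 4-bit weights: ~{nf4_gb:.0f} GB")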

Fine-tuning a GPT model with QLoRa

Hardware requirements for QLoRa:

  • GPU: The following demo works on a GPU with 12 GB of VRAM for a model with fewer than 20 billion parameters, e.g., GPT-J. For instance, I ran it with my RTX 3060 12 GB. If you have a bigger card with 24 GB of VRAM, you can do it with a 20 billion parameter model, e.g., GPT-NeoX-20b.
  • RAM: I recommend a minimum of 6 GB. Most recent computers have enough RAM.
  • Hard drive space: GPT-J and GPT-NeoX-20b are both very big models. I recommend at least 80 GB of free space.

If your machine doesn't meet these requirements, the free instance of Google Colab would be enough instead.

Software requirements for QLoRa:

We need CUDA. Make sure it is installed on your machine.
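A quick way to check from Python that PyTorch can see CUDA and your GPU:

import torch

print(torch.cuda.is_available())      # should print True
print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0))  # e.g., "NVIDIA GeForce RTX 3060"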

We will also need to install all the dependencies:

  • bitsandbytes: A library that contains everything we need to quantize an LLM.
  • transformers and accelerate: These are standard libraries that are used to efficiently train models from the Hugging Face Hub.
  • peft: A library that provides implementations of various methods to fine-tune only a small number of (extra) model parameters. We need it for LoRa.
  • datasets: This one is not a requirement. We will only use it to get a dataset for fine-tuning. Of course, you can provide your own dataset instead.

We can get all of them with pip:

pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
pip install -q datasets

Next, we can start writing the Python script.

Loading and Quantization of a GPT model

We need the following imports to load and quantize an LLM.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

For this demo, we will fine-tune the GPT NeoX model pre-trained by EleutherAI. This is a model with 20 billion parameters. Note: GPT NeoX has a permissive license (Apache 2.0) that allows commercial use.

We can get this model and the associated tokenizer from the Hugging Face Hub:

model_name = "EleutherAI/gpt-neox-20b"

#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Then, we need to define the configuration of the quantizer, as follows:

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

  • load_in_4bit: The model will be loaded into memory with 4-bit precision.
  • bnb_4bit_use_double_quant: We will use the double quantization proposed by QLoRa.
  • bnb_4bit_quant_type: This is the type of quantization. “nf4” stands for 4-bit NormalFloat.
  • bnb_4bit_compute_dtype: While we load and store the model in 4-bit, we will partially dequantize it when needed and do all the computations with 16-bit precision (bfloat16).

Now we can load the model in 4-bit:

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map={"":0})
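If you are curious, you can check how much memory the quantized model actually takes once loaded (get_memory_footprint returns a value in bytes):

# Memory footprint of the 4-bit model, in GB
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")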

Then, we enable gradient checkpointing to reduce the memory used during training:

model.gradient_checkpointing_enable()

Preprocessing the GPT model for LoRa

This is where we use PEFT. We prepare the model for LoRa, adding trainable adapters to each layer.

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

In LoraConfig, you can play with r, lora_alpha, and lora_dropout to obtain better results for your task. You can find more options and details in the PEFT repository.

With LoRa, we add only 8 million trainable parameters. We will train only these parameters and freeze everything else. Fine-tuning should be fast.
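You can verify this with PEFT, which can print the number of trainable parameters versus the total:

model.print_trainable_parameters()
# Should report roughly 8 million trainable parameters out of about 20 billion in total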

Get your dataset ready

For this demo, I use the “english_quotes” dataset. This is a dataset made of famous quotes distributed under a CC BY 4.0 license.

from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
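If you want to sanity-check the tokenization, you can print one example (the "quote" field comes from this particular dataset):

# Optional sanity check: look at one tokenized example
print(data["train"][0]["quote"])
print(data["train"][0]["input_ids"][:10])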

Fine-tuning GPT-NeoX-20B with QLoRa

Finally, the fine-tuning with Hugging Face Transformers is very standard.

import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        warmup_steps=2,
        max_steps=20,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

Don't forget optim="paged_adamw_8bit". It activates the paged optimizer for better memory management. Without it, we get out-of-memory errors.

Running this fine-tuning should only take 5 minutes on Google Colab.

The VRAM consumption should peak at 15 GB.
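If you want to verify the peak VRAM usage on your own machine, PyTorch can report how much memory it allocated during training (this only counts memory allocated by PyTorch tensors):

import torch

# Peak GPU memory allocated by PyTorch tensors during training, in GB
print(f"{torch.cuda.max_memory_allocated(0) / 1e9:.1f} GB")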

And that's it, we fine-tuned an LLM for free!

GPT Inference with QLoRa

Does it work? Let's try inference.

The QLoRa model we fine-tuned can be used directly for inference with standard Hugging Face Transformers, as follows:

text = "Ask not what your country"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

You should get this quote as output:

Ask not what your country can do for you, ask what you can do for your country.”

– John F.

We got the expected quote. Not bad for five minutes of fine-tuning!
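If you want to keep what we just trained, note that with PEFT you only have to save the small adapter weights, not the full 20 billion parameter model. A minimal sketch, where the output directory name is just an example:

# Save only the LoRa adapter weights, not the 20 billion parameter base model
model.save_pretrained("gpt-neox-20b-english-quotes-adapter")

# Later: reload the 4-bit base model and attach the saved adapter
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quant_config, device_map={"": 0}
)
model = PeftModel.from_pretrained(base_model, "gpt-neox-20b-english-quotes-adapter")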

Conclusion

Large language models got bigger but, at the same time, we finally got the tools to do fine-tuning and inference on consumer hardware.

Thanks to LoRa, and now QLoRa, we can fine-tune models with billions of parameters without relying on cloud computing and, according to the QLoRa paper, without a significant drop in performance.

If you have any problem running the code, please drop a comment, and I will try to help. You can also find more details about the QLoRa implementation in the official GitHub repository.

If you want to deploy an LLM, have a look at my tutorial using the NVIDIA Triton Inference Server:
