The Only Guide You Need to Fine-Tune Llama 3 or Any Other Open Source Model


Fine-tuning large language models (LLMs) like Llama 3 involves adapting a pre-trained model to a specific task using a domain-specific dataset. This process leverages the model’s pre-existing knowledge, making it efficient and cost-effective compared to training from scratch. In this guide, we’ll walk through the steps to fine-tune Llama 3 using QLoRA (Quantized LoRA), a parameter-efficient method that minimizes memory usage and computational costs.

Overview of Fine-Tuning

Fine-tuning involves several key steps:

  1. Choosing a Pre-trained Model: Select a base model that aligns with your desired architecture.
  2. Gathering a Relevant Dataset: Collect and preprocess a dataset specific to your task.
  3. Fine-Tuning: Adapt the model using the dataset to enhance its performance on specific tasks.
  4. Evaluation: Assess the fine-tuned model using both qualitative and quantitative metrics.

Concepts and Techniques

Fine-Tuning Large Language Models

Full Fine-Tuning

Full fine-tuning updates all of the parameters of the model, making it specific to the new task. This method requires significant computational resources and is often impractical for very large models.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT updates only a subset of the model’s parameters, reducing memory requirements and computational cost. This approach helps prevent catastrophic forgetting and preserves the general knowledge of the model.

Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA)

LoRA fine-tunes only a few small low-rank matrices, while QLoRA additionally quantizes the frozen base model to reduce the memory footprint further.

Fine-Tuning Methods

  1. Full Fine-Tuning: This involves training all of the parameters of the model on the task-specific dataset. While this method can be very effective, it is also computationally expensive and requires significant memory.
  2. Parameter-Efficient Fine-Tuning (PEFT): PEFT updates only a subset of the model’s parameters, making it more memory-efficient. Techniques like Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) fall into this category.

What’s LoRA?

Figure: Comparing fine-tuning methods: QLoRA enhances LoRA with 4-bit precision quantization and paged optimizers to manage memory spikes.

LoRA is an improved fine-tuning method where, instead of fine-tuning all of the weights of the pre-trained model, two smaller matrices that approximate the weight update are fine-tuned. These matrices constitute the LoRA adapter. The fine-tuned adapter is then loaded into the pre-trained model and used for inference. A minimal sketch of this idea follows.
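
To make the "two smaller matrices" idea concrete, here is a minimal numeric sketch (the layer shape and rank are illustrative, not taken from any particular model):

import torch

d, k, r = 4096, 4096, 8            # weight matrix shape and LoRA rank (illustrative)

W = torch.randn(d, k)              # frozen pre-trained weight, never updated
A = torch.randn(r, k) * 0.01       # LoRA matrix A (trainable)
B = torch.zeros(d, r)              # LoRA matrix B (trainable, initialized to zero)

delta_W = B @ A                    # low-rank approximation of a full weight update
W_effective = W + delta_W          # weight the model effectively uses after adaptation

full = d * k                       # parameters a full fine-tune would update in this layer
lora = d * r + r * k               # parameters LoRA actually trains
print(f"full: {full:,}  LoRA: {lora:,}  ({lora / full:.2%} of full)")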

Key Benefits of LoRA:

  • Memory Efficiency: LoRA reduces the memory footprint by fine-tuning only small matrices instead of the entire model.
  • Reusability: The original model remains unchanged, and multiple LoRA adapters can be used with it, making it easy to handle multiple tasks with lower memory requirements (see the adapter-loading sketch below).
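
For example, with the peft library a saved adapter can be attached to the unchanged base model at load time; the adapter path below is hypothetical:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the frozen base model once...
base_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# ...then attach a task-specific LoRA adapter on top of it.
summarizer = PeftModel.from_pretrained(base_model, "./summarization-adapter")  # hypothetical path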

What’s Quantized LoRA (QLoRA)?

QLoRA takes LoRA a step further by quantizing the weights of the frozen base model to lower precision (typically 4-bit) while the LoRA adapters are trained on top in higher precision. This further reduces memory usage and storage requirements while maintaining a comparable level of effectiveness.

Key Benefits of QLoRA:

  • Even Greater Memory Efficiency: By quantizing the base model’s weights, QLoRA significantly reduces the model’s memory and storage requirements.
  • Maintains Performance: Despite the reduced precision, QLoRA maintains performance levels close to those of full-precision models.
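
As a rough back-of-the-envelope illustration of why this matters (ignoring activations, optimizer state, and quantization overhead, so the numbers are approximate):

params = 7e9  # a 7B-parameter model

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit (QLoRA)", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{label}: ~{gigabytes:.1f} GB for the weights alone")

# fp16: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB (weights only)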

Task-Specific Adaptation

During fine-tuning, the model’s parameters are adjusted based on the new dataset, helping it better understand and generate content relevant to the specific task. This process retains the general language knowledge gained during pre-training while tailoring the model to the nuances of the target domain.

Fine-Tuning in Practice

Full Fine-Tuning vs. PEFT

  • Full Fine-Tuning: Involves training the entire model, which can be computationally expensive and requires significant memory.
  • PEFT (LoRA and QLoRA): Fine-tunes only a subset of parameters, reducing memory requirements and preventing catastrophic forgetting, making it a more efficient alternative.

Implementation Steps

  1. Set Up the Environment: Install the necessary libraries and configure the computing environment.
  2. Load and Preprocess the Dataset: Load the dataset and preprocess it into a format suitable for the model.
  3. Load the Pre-trained Model: Load the base model with quantization configurations if using QLoRA.
  4. Tokenization: Tokenize the dataset to prepare it for training.
  5. Training: Fine-tune the model using the prepared dataset.
  6. Evaluation: Evaluate the model’s performance on specific tasks using qualitative and quantitative metrics.

Step-by-Step Guide to Fine-Tune an LLM

Setting Up the Environment

We’ll use a Jupyter notebook for this tutorial. Platforms like Kaggle, which provide free GPU usage, or Google Colab are perfect for running these experiments.

1. Install Required Libraries

First, ensure you have the necessary libraries installed:

!pip install -qqq -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score

2. Import Libraries and Set Up Environment

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, 
    pipeline, HfArgumentParser
)
from trl import ORPOConfig, ORPOTrainer, setup_chat_format, SFTTrainer
from tqdm import tqdm
import gc
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login
# Disable Weights and Biases logging
os.environ['WANDB_DISABLED'] = "true"
interpreter_login()

3. Load the Dataset

We’ll use the DialogSum dataset for this tutorial:

Preprocess the dataset according to the model’s requirements, including applying appropriate templates and ensuring the data format is suitable for fine-tuning (Hugging Face, DataCamp).

dataset_name = "neil-code/dialogsum-test"
dataset = load_dataset(dataset_name)

Inspect the dataset structure:

print(dataset['test'][0])
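
If you want a quick look at the available splits and columns first, something like the following works; the preprocessing code later assumes each record exposes 'dialogue' and 'summary' fields:

print(dataset)                                # splits and column names
print(dataset['train'][0]['dialogue'][:200])  # first 200 characters of a dialogue
print(dataset['train'][0]['summary'])         # its reference summary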

4. Create BitsAndBytes Configuration

To load the model in 4-bit format:

compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

5. Load the Pre-trained Model

Using Microsoft’s Phi-2 model for this tutorial:

model_name = 'microsoft/phi-2'
device_map = {"": 0}
original_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True
)

6. Tokenization

Configure the tokenizer:

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    padding_side="left", 
    add_eos_token=True, 
    add_bos_token=True, 
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token

Fine-Tuning Llama 3 or Other Models

When fine-tuning models like Llama 3 or any other state-of-the-art open-source LLMs, there are specific considerations and adjustments required to ensure optimal performance. Below are the detailed steps and insights on how to approach this for various models, including Llama 3, GPT-3, and Mistral.

5.1 Using Llama 3

Model Selection:

  • Ensure you have the correct model identifier from the Hugging Face model hub. For instance, the Llama 3 model might be identified as meta-llama/Meta-Llama-3-8B on Hugging Face.
  • Make sure to request access and log in to your Hugging Face account if required for gated models like Llama 3 (Hugging Face), as shown in the snippet below.
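
For gated models such as Llama 3, you typically need to accept the license on the model page and authenticate with a Hugging Face token; a minimal sketch (the token shown is a placeholder):

from huggingface_hub import login

# Programmatic login; alternatively run `huggingface-cli login` in a terminal,
# or use interpreter_login() as in the setup section. The token below is a placeholder.
login(token="hf_xxx")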

Tokenization:

  • Use the appropriate tokenizer for Llama 3, ensuring it is compatible with the model and supports required features like padding and special tokens.

Memory and Computation:

  • Fine-tuning large models like Llama 3 requires significant computational resources. Ensure your environment, such as a powerful GPU setup, can handle the memory and processing requirements; these can be mitigated by using techniques like QLoRA to reduce the memory footprint (Hugging Face Forums).

Example:

model_name = 'meta-llama/Meta-Llama-3-8B'
device_map = {"": 0}
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
original_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True
)

Tokenization:

Depending on the specific use case and model requirements, ensure the tokenizer is configured correctly without redundant settings. For Llama 3, use_fast=True is recommended for better performance, since the model ships with a fast tokenizer (Hugging Face, Weights & Biases).

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    padding_side="left", 
    add_eos_token=True, 
    add_bos_token=True, 
    use_fast=True
)
tokenizer.pad_token = tokenizer.eos_token

5.2 Using Other Popular Models (e.g., GPT-3, Mistral)

Model Selection:

  • For models like GPT-3 and Mistral, ensure you use the correct model name and identifier from the Hugging Face model hub or other sources.

Tokenization:

  • As with Llama 3, make sure the tokenizer is correctly set up and compatible with the model.

Memory and Computation:

  • Each model could have different memory requirements. Adjust your environment setup accordingly.

Example for GPT-3 (note: GPT-3 is a proprietary OpenAI model and is not downloadable from the Hugging Face Hub; the identifier below is a placeholder, so in practice substitute an open model identifier):

model_name = 'openai/gpt-3'  # placeholder; GPT-3 weights are not publicly available
device_map = {"": 0}
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
original_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True
)

Example for Mistral:

model_name = 'mistralai/Mistral-7B-v0.1'
device_map = {"": 0}
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
original_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True
)

Tokenization Considerations: Each model may have unique tokenization requirements. Ensure the tokenizer matches the model and is configured correctly.

Llama 3 Tokenizer Example:

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    padding_side="left", 
    add_eos_token=True, 
    add_bos_token=True, 
    use_fast=True
)
tokenizer.pad_token = tokenizer.eos_token

GPT-3 and Mistral Tokenizer Example:

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    use_fast=True
)

7. Test the Model with Zero-Shot Inferencing

Evaluate the base model with a sample input:

from transformers import set_seed
set_seed(42)
index = 10
prompt = dataset['test'][index]['dialogue']
formatted_prompt = f"Instruct: Summarize the next conversation.n{prompt}nOutput:n"
# Generate output
def gen(model, prompt, max_length):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
res = gen(original_model, formatted_prompt, 100)
output = res[0].split('Output:\n')[1]
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

8. Pre-process the Dataset

Convert dialog-summary pairs into prompts:

def create_prompt_formats(sample):
    blurb = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    instruction = "### Instruct: Summarize the below conversation."
    input_context = sample['dialogue']
    response = f"### Output:n{sample['summary']}"
    end = "### End"
    
    parts = [blurb, instruction, input_context, response, end]
    formatted_prompt = "nn".join(parts)
    sample["text"] = formatted_prompt
    return sample
dataset = dataset.map(create_prompt_formats)

Tokenize the formatted dataset:

def preprocess_batch(batch, tokenizer, max_length):
    return tokenizer(batch["text"], max_length=max_length, truncation=True)
max_length = 1024
train_dataset = dataset["train"].map(lambda batch: preprocess_batch(batch, tokenizer, max_length), batched=True)
eval_dataset = dataset["validation"].map(lambda batch: preprocess_batch(batch, tokenizer, max_length), batched=True)
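
Before training, it is worth decoding one tokenized example back to text to confirm the prompt template survived preprocessing the way you intended:

# Spot-check: decode the first tokenized training example back to text.
print(tokenizer.decode(train_dataset[0]["input_ids"]))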

9. Prepare the Model for QLoRA

Prepare the model for parameter-efficient fine-tuning:

from peft import prepare_model_for_kbit_training

original_model = prepare_model_for_kbit_training(original_model)
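
Note that prepare_model_for_kbit_training only readies the quantized model for training; the LoRA adapter itself still has to be attached. A minimal sketch using peft follows; the rank, alpha, and target modules are illustrative defaults and should be adjusted to the architecture you are fine-tuning:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the LoRA update
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # adjust per model architecture
)

peft_model = get_peft_model(original_model, lora_config)
peft_model.print_trainable_parameters()  # shows how few parameters are actually trained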

Hyperparameters and Their Impact

Hyperparameters play a crucial role in optimizing the performance of your model. Here are some key hyperparameters to consider:

  1. Learning Rate: Controls the speed at which the model updates its parameters. A high learning rate can lead to faster convergence but may overshoot the optimal solution; a low learning rate ensures steady convergence but may require more epochs.
  2. Batch Size: The number of samples processed before the model updates its parameters. Larger batch sizes can improve stability but require more memory; smaller batch sizes may introduce more noise into the training process.
  3. Gradient Accumulation Steps: Simulates larger batch sizes by accumulating gradients over multiple steps before performing a parameter update (see the short example after this list).
  4. Number of Epochs: The number of times the entire dataset is passed through the model. More epochs can improve performance but may lead to overfitting if not managed properly.
  5. Weight Decay: A regularization technique that prevents overfitting by penalizing large weights.
  6. Learning Rate Scheduler: Adjusts the learning rate during training to improve performance and convergence.
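
As a quick illustration of how gradient accumulation interacts with batch size, here is the arithmetic for the values used in the training configuration below:

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1

# Gradients from 4 micro-batches are accumulated before each optimizer step,
# so the optimizer effectively sees a batch of 2 * 4 * 1 = 8 samples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 8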

Customize the training configuration by adjusting hyperparameters like learning rate, batch size, and gradient accumulation steps based on the specific model and task requirements. For instance, Llama 3 models may require different learning rates compared to smaller models (Weights & Biases, GitHub).

Example Training Configuration

orpo_args = ORPOConfig(
    learning_rate=8e-6,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    beta=0.1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    report_to="none",  # W&B logging was disabled in the environment setup above
    output_dir="./results/",
)

10. Train the Model

Set up the trainer and start training:

trainer = ORPOTrainer(
    model=original_model,
    args=orpo_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("fine-tuned-llama-3")
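
One caveat: ORPO is a preference-optimization method, so ORPOTrainer expects each training record to contain prompt, chosen, and rejected fields. The DialogSum prompts prepared above are plain text, which is the format SFTTrainer (imported earlier) consumes. If you simply want supervised fine-tuning on this dataset, a sketch along these lines should work; depending on your trl version, dataset_text_field and max_seq_length may need to move into an SFTConfig instead:

from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="./results-sft/",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=25,
    optim="paged_adamw_8bit",
    report_to="none",
)

sft_trainer = SFTTrainer(
    model=peft_model,              # the QLoRA-prepared model from the LoRA sketch in step 9
    args=sft_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",     # the column created by create_prompt_formats
    max_seq_length=1024,
)
sft_trainer.train()
sft_trainer.save_model("fine-tuned-summarizer")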

Evaluating the Fine-Tuned Model

After training, evaluate the model’s performance using both qualitative and quantitative methods.

1. Human Evaluation

Compare the generated summaries with human-written ones to assess their quality.

2. Quantitative Evaluation

Use metrics like ROUGE to evaluate performance:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# reference_summary: the human-written summary; generated_summary: the model's output
scores = scorer.score(reference_summary, generated_summary)
print(scores)
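
To go beyond a single pair of summaries, a rough aggregation over a handful of test dialogues could look like this, reusing the gen() helper from the zero-shot step; the sample count and generation length are arbitrary choices:

# Score the fine-tuned model on the first 10 test dialogues (illustrative only).
rouge1_f1 = []
for idx in range(10):
    sample = dataset['test'][idx]
    prompt = f"Instruct: Summarize the following conversation.\n{sample['dialogue']}\nOutput:\n"
    generated = gen(trainer.model, prompt, 512)[0].split('Output:\n')[-1]
    rouge1_f1.append(scorer.score(sample['summary'], generated)['rouge1'].fmeasure)

print(f"mean ROUGE-1 F1 over 10 samples: {sum(rouge1_f1) / len(rouge1_f1):.3f}")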

Common Challenges and Solutions

1. Memory Limitations

Using QLoRA helps mitigate memory issues by quantizing model weights to 4-bit. Ensure you have enough GPU memory to handle your batch size and model size.

2. Overfitting

Monitor validation metrics to prevent overfitting. Use techniques like early stopping and weight decay.
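
For early stopping specifically, transformers ships an EarlyStoppingCallback that can be attached to the trainer; it assumes the training arguments enable periodic evaluation and set load_best_model_at_end and metric_for_best_model:

from transformers import EarlyStoppingCallback

# Stop training if the monitored evaluation metric has not improved
# for 3 consecutive evaluations.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))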

3. Slow Training

Optimize training speed by adjusting the batch size and learning rate, and by using gradient accumulation.

4. Data Quality

Ensure your dataset is clean and well-preprocessed. Poor data quality can significantly impact model performance.

Conclusion

Fine-tuning LLMs using QLoRA is an effective way to adapt large pre-trained models to specific tasks with reduced computational costs. By following this guide, you can fine-tune Phi-2, Llama 3, or any other open-source model to achieve high performance on your specific tasks.
