In the fast-moving world of Natural Language Processing (NLP), we often find ourselves comparing different language models to see which one works best for specific tasks. This blog post compares three models: RoBERTa, Mistral-7B, and Llama-2-7b. We used them to tackle a common problem: classifying tweets about disasters. It is important to note that Mistral and Llama 2 are large models with 7 billion parameters. In contrast, RoBERTa-large (355M parameters) is a comparatively small model that serves as the baseline for this comparison study.
In this blog, we use a PEFT (Parameter-Efficient Fine-Tuning) technique, LoRA (Low-Rank Adaptation of Large Language Models), to fine-tune the pre-trained models on a sequence classification task. LoRA is designed to significantly reduce the number of trainable parameters while maintaining strong downstream task performance.
The main objective of this blog post is to implement LoRA fine-tuning for sequence classification using three pre-trained models from Hugging Face: meta-llama/Llama-2-7b-hf, mistralai/Mistral-7B-v0.1, and roberta-large.
Hardware Used
- Number of nodes: 1
- Number of GPUs per node: 1
- GPU type: A6000
- GPU memory: 48GB
Goals
- Implement fine-tuning of pre-trained LLMs using LoRA PEFT methods.
- Learn how to use the Hugging Face APIs (transformers, peft, and datasets).
- Set up hyperparameter tuning and experiment logging using Weights & Biases.
Dependencies
datasets
evaluate
peft
scikit-learn
torch
transformers
wandb
Note: For reproducing the reported results, please check the pinned versions in the wandb reports.
Pre-trained Models
RoBERTa
RoBERTa (Robustly Optimized BERT Approach) is an advanced variant of the BERT model proposed by the Meta AI research team. BERT is a transformer-based language model that uses self-attention mechanisms to build contextual word representations and is trained with a masked language modeling objective. Note that BERT is an encoder-only model used for natural language understanding tasks (such as sequence classification and token classification).
RoBERTa is a popular model to fine-tune and is suitable as a baseline for our experiments. For more information, you can check its Hugging Face model card.
Llama 2
Llama 2 models, whose name stands for Large Language Model Meta AI, belong to the family of large language models (LLMs) introduced by Meta AI. The Llama 2 models vary in size, with parameter counts ranging from 7 billion to 70 billion.
Llama 2 is an auto-regressive language model based on the transformer decoder architecture. To generate text, Llama 2 takes a sequence of words as input and iteratively predicts the next token.
The Llama 2 architecture is slightly different from models like GPT-3. For instance, Llama 2 employs the SwiGLU activation function rather than ReLU and uses rotary positional embeddings rather than absolute learnable positional embeddings.
Llama 2 also introduced architectural refinements to better handle long sequences, extending the context length to up to 4096 tokens and using grouped-query attention (GQA) during decoding.
Mistral 7B
Mistral 7B v0.1, with 7.3 billion parameters, is the first LLM introduced by Mistral AI.
The main novel techniques used in Mistral 7B's architecture are:
- Sliding Window Attention: replaces full attention (quadratic compute cost) with a sliding-window attention in which each token can attend to at most 4,096 tokens from the previous layer (linear compute cost). This mechanism allows Mistral 7B to handle longer sequences, since higher layers can still access information beyond the window size of 4,096 tokens. A minimal sketch of such a mask is shown after this list.
- Grouped-query Attention: also used in Llama 2, this technique speeds up inference by caching the key and value vectors for previously decoded tokens in the sequence and sharing them across groups of query heads.
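To make the sliding-window idea concrete, here is a minimal, illustrative sketch (not Mistral's actual implementation) of a causal sliding-window attention mask in PyTorch:

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Each query position may attend only to itself and the (window - 1)
    # positions before it, so the number of attended keys per token is
    # bounded and compute grows linearly with sequence length.
    q = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    causal = k <= q                         # no attention to future tokens
    in_window = (q - k) < window            # stay inside the sliding window
    return causal & in_window               # True = attention allowed

print(sliding_window_mask(seq_len=8, window=4).int())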
LoRA
PEFT, Parameter-Efficient Fine-Tuning, is a family of techniques (p-tuning, prefix tuning, IA3, Adapters, and LoRA) designed to fine-tune large models using a much smaller set of trainable parameters while preserving the performance levels typically achieved through full fine-tuning.
LoRA, Low-Rank Adaptation, is a PEFT method that shares similarities with Adapter layers. Its primary objective is to reduce the model's trainable parameters. It does so by learning a low-rank update matrix while keeping the pre-trained weights frozen.
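As an illustration of the idea (a simplified sketch, not the peft library's implementation), a LoRA layer keeps the pre-trained weight frozen and adds a trainable low-rank update B·A scaled by alpha/r:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Simplified illustration: y = W_frozen(x) + (alpha / r) * B A x,
    # where only A and B (r * (in_features + out_features) parameters) are trained.
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                     # frozen pre-trained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling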
Setup
RoBERTa has a maximum sequence length limit of 512 tokens, so we set MAX_LEN=512 for all models to ensure a fair comparison.
MAX_LEN = 512
roberta_checkpoint = "roberta-large"
mistral_checkpoint = "mistralai/Mistral-7B-v0.1"
llama_checkpoint = "meta-llama/Llama-2-7b-hf"
Data preparation
Data loading
We’ll load the dataset from Hugging Face:
from datasets import load_dataset
dataset = load_dataset("mehdiiraqui/twitter_disaster")
Now, let’s split the dataset into training and validation datasets. Then add the test set:
from datasets import Dataset
data = dataset['train'].train_test_split(train_size=0.8, seed=42)
data['val'] = data.pop("test")
data['test'] = dataset['test']
Here's an overview of the dataset:
DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 6090
    })
    val: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 1523
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 3263
    })
})
Let's check the data distribution:
import pandas as pd
data['train'].to_pandas().info()
data['test'].to_pandas().info()
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 7613 non-null int64
1 keyword 7552 non-null object
2 location 5080 non-null object
3 text 7613 non-null object
4 target 7613 non-null int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3263 non-null int64
1 keyword 3237 non-null object
2 location 2158 non-null object
3 text 3263 non-null object
4 target 3263 non-null int64
dtypes: int64(2), object(3)
memory usage: 127.6+ KB
Target distribution in the train dataset
target
0 4342
1 3271
Name: count, dtype: int64
As the classes are not balanced, we will compute the positive and negative weights and use them for loss calculation later:
pos_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[1])
neg_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[0])
The final weights are:
POS_WEIGHT, NEG_WEIGHT = (1.1637114032405993, 0.8766697374481806)
Then, we compute the maximum length of the text column:
max_char = data['train'].to_pandas()['text'].str.len().max()
max_words = data['train'].to_pandas()['text'].str.split().str.len().max()
The maximum number of characters is 152.
The maximum number of words is 31.
Data Processing
Let's have a look at one example row of the training data:
data['train'][0]
{'id': 5285,
'keyword': 'fear',
'location': 'Thibodaux, LA',
'text': 'my worst fear. https://t.co/iH8UDz8mq3',
'target': 0}
The data contains a keyword, a location, and the text of the tweet. For the sake of simplicity, we select the text feature as the only input to the LLM.
At this stage, we have prepared the train, validation, and test sets in the HuggingFace format expected by the pre-trained LLMs. The next step is to define the tokenized dataset for training, using the appropriate tokenizer to transform the text feature into two tensors: the sequence of token ids and the attention mask. As each model has its specific tokenizer, we will need to define three different tokenized datasets.
We start by defining the RoBERTa dataloader:
from transformers import AutoTokenizer
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_checkpoint, add_prefix_space=True)
Note: The RoBERTa tokenizer has been trained to treat spaces as part of the token. As a result, the first word of a sentence is encoded differently if it is not preceded by a white space. To ensure the first word includes a space, we set add_prefix_space=True. Also, to maintain consistent pre-processing for all three models, we set the parameter to 'True' for Llama 2 and Mistral 7B as well.
- Define the preprocessing function for converting one row of the dataframe:
def roberta_preprocessing_function(examples):
    return roberta_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)
By applying the preprocessing function to the first example of our training dataset, we get the tokenized inputs (input_ids) and the attention mask:
roberta_preprocessing_function(data['train'][0])
{'input_ids': [0, 127, 2373, 2490, 4, 1205, 640, 90, 4, 876, 73, 118, 725, 398, 13083, 329, 398, 119, 1343, 246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
- Now, let’s apply the preprocessing function to the complete dataset:
col_to_delete = ['id', 'keyword','location', 'text']
roberta_tokenized_datasets = data.map(roberta_preprocessing_function, batched=True, remove_columns=col_to_delete)
roberta_tokenized_datasets = roberta_tokenized_datasets.rename_column("target", "label")
roberta_tokenized_datasets.set_format("torch")
Note: we deleted the unneeded columns from our data: id, keyword, location, and text. We deleted the text column because it has already been converted into input ids and an attention mask.
We can take a look at our tokenized training dataset:
roberta_tokenized_datasets['train'][0]
{'label': tensor(0),
'input_ids': tensor([ 0, 127, 2373, 2490, 4, 1205, 640, 90, 4, 876,
73, 118, 725, 398, 13083, 329, 398, 119, 1343, 246,
2]),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
- For generating the training batches, we also need to pad the rows of a given batch to the maximum length present in the batch. For that, we will use the DataCollatorWithPadding class:
from transformers import DataCollatorWithPadding
roberta_data_collator = DataCollatorWithPadding(tokenizer=roberta_tokenizer)
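As a quick, illustrative sanity check, the collator can be applied to a few tokenized examples to verify that it pads them to the longest sequence in the batch:

# Collate a small batch of tokenized examples and inspect the padded shapes.
sample_batch = roberta_data_collator([roberta_tokenized_datasets["train"][i] for i in range(4)])
print(sample_batch["input_ids"].shape)       # (4, longest_sequence_in_this_batch)
print(sample_batch["attention_mask"].shape)  # same shape; 0s mark the padded positions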
You can follow the same steps for preparing the data for the Mistral 7B and Llama 2 models.
Note that Llama 2 and Mistral 7B do not have a default pad_token_id, so we use the eos_token_id for padding.
from transformers import AutoTokenizer, DataCollatorWithPadding
mistral_tokenizer = AutoTokenizer.from_pretrained(mistral_checkpoint, add_prefix_space=True)
mistral_tokenizer.pad_token_id = mistral_tokenizer.eos_token_id
mistral_tokenizer.pad_token = mistral_tokenizer.eos_token
def mistral_preprocessing_function(examples):
    return mistral_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)
mistral_tokenized_datasets = data.map(mistral_preprocessing_function, batched=True, remove_columns=col_to_delete)
mistral_tokenized_datasets = mistral_tokenized_datasets.rename_column("target", "label")
mistral_tokenized_datasets.set_format("torch")
mistral_data_collator = DataCollatorWithPadding(tokenizer=mistral_tokenizer)
from transformers import AutoTokenizer, DataCollatorWithPadding
llama_tokenizer = AutoTokenizer.from_pretrained(llama_checkpoint, add_prefix_space=True)
llama_tokenizer.pad_token_id = llama_tokenizer.eos_token_id
llama_tokenizer.pad_token = llama_tokenizer.eos_token
def llama_preprocessing_function(examples):
    return llama_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)
llama_tokenized_datasets = data.map(llama_preprocessing_function, batched=True, remove_columns=col_to_delete)
llama_tokenized_datasets = llama_tokenized_datasets.rename_column("target", "label")
llama_tokenized_datasets.set_format("torch")
llama_data_collator = DataCollatorWithPadding(tokenizer=llama_tokenizer)
Now that we have prepared the tokenized datasets, the next section will showcase how to load the pre-trained LLM checkpoints and how to set up the LoRA weights.
Models
RoBERTa
Load RoBERTa Checkpoints for the Classification Task
We load the pre-trained RoBERTa model with a sequence classification head using the Hugging Face AutoModelForSequenceClassification class:
from transformers import AutoModelForSequenceClassification
roberta_model = AutoModelForSequenceClassification.from_pretrained(roberta_checkpoint, num_labels=2)
LoRA setup for RoBERTa classifier
We import the LoRA configuration and set some parameters for the RoBERTa classifier:
- TaskType: Sequence classification
- r (rank): Rank of our decomposition matrices
- lora_alpha: Alpha parameter to scale the learned weights. The LoRA paper advises fixing alpha at 16
- lora_dropout: Dropout probability of the LoRA layers
- bias: Whether to add a bias term to the LoRA layers
The code below uses the values recommended by the LoRA paper. Later in this post, we will perform hyperparameter tuning of these parameters using wandb.
from peft import get_peft_model, LoraConfig, TaskType
roberta_peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
)
roberta_model = get_peft_model(roberta_model, roberta_peft_config)
roberta_model.print_trainable_parameters()
We can see that the number of trainable parameters represents only 0.64% of the RoBERTa model's parameters:
trainable params: 2,299,908 || all params: 356,610,052 || trainable%: 0.6449363911929212
Mistral
Load checkpoints for the classification model
Let’s load the pre-trained Mistral-7B model with a sequence classification head:
from transformers import AutoModelForSequenceClassification
import torch
mistral_model = AutoModelForSequenceClassification.from_pretrained(
pretrained_model_name_or_path=mistral_checkpoint,
num_labels=2,
device_map="auto"
)
For Mistral 7B, we have to add the padding token id as it is not defined by default.
mistral_model.config.pad_token_id = mistral_model.config.eos_token_id
LoRA setup for Mistral 7B classifier
For the Mistral 7B model, we need to specify the target_modules (the query and value projections from the attention modules):
from peft import get_peft_model, LoraConfig, TaskType
mistral_peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
target_modules=[
"q_proj",
"v_proj",
],
)
mistral_model = get_peft_model(mistral_model, mistral_peft_config)
mistral_model.print_trainable_parameters()
The number of trainable parameters represents only 0.024% of the Mistral model's parameters:
trainable params: 1,720,320 || all params: 7,112,380,416 || trainable%: 0.02418768259540745
Llama 2
Load checkpoints for the classification model
Let's load the pre-trained Llama 2 model with a sequence classification head.
from transformers import AutoModelForSequenceClassification
import torch
llama_model = AutoModelForSequenceClassification.from_pretrained(
pretrained_model_name_or_path=llama_checkpoint,
num_labels=2,
device_map="auto",
offload_folder="offload",
trust_remote_code=True
)
For Llama 2, we have to add the padding token id as it is not defined by default.
llama_model.config.pad_token_id = llama_model.config.eos_token_id
LoRA setup for Llama 2 classifier
We define the LoRA configuration for Llama 2 with the same target modules as for Mistral (the query and value projections from the attention modules):
from peft import get_peft_model, LoraConfig, TaskType
llama_peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS, r=16, lora_alpha=16, lora_dropout=0.05, bias="none",
target_modules=[
"q_proj",
"v_proj",
],
)
llama_model = get_peft_model(llama_model, llama_peft_config)
llama_model.print_trainable_parameters()
The number of trainable parameters represents only 0.12% of the Llama 2 model's parameters:
trainable params: 8,404,992 || all params: 6,615,748,608 || trainable%: 0.1270452143516515
At this point, we have defined the tokenized datasets for training as well as the LLMs set up with LoRA layers. The following section will introduce how to launch training using the HuggingFace Trainer class.
Setup the trainer
Evaluation Metrics
First, we define the performance metrics we will use to compare the three models: F1 score, recall, precision, and accuracy:
import evaluate
import numpy as np
def compute_metrics(eval_pred):
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    return {"precision": precision, "recall": recall, "f1-score": f1, "accuracy": accuracy}
Custom Trainer for Weighted Loss
As mentioned at the beginning of this post, we have an imbalanced distribution between positive and negative classes. We need to train our models with a weighted cross-entropy loss to account for that. The Trainer class doesn't support providing a custom loss out of the box, as it expects to get the loss directly from the model's outputs.
So, we need to define our custom WeightedCELossTrainer, which overrides the compute_loss method to calculate the weighted cross-entropy loss based on the model's predictions and the input labels:
from transformers import Trainer
class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # Get the model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Compute the weighted cross-entropy loss using the class weights computed earlier
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([neg_weights, pos_weights], device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
Trainer Setup
Let’s set the training arguments and the trainer for the three models.
RoBERTa
The first important step is to move the models to the GPU device for training.
roberta_model = roberta_model.cuda()
roberta_model.device
It will print the following:
device(type='cuda', index=0)
Then, we set the training arguments:
from transformers import TrainingArguments
lr = 1e-4
batch_size = 8
num_epochs = 5
training_args = TrainingArguments(
output_dir="roberta-large-lora-token-classification",
learning_rate=lr,
lr_scheduler_type= "constant",
warmup_ratio= 0.1,
max_grad_norm= 0.3,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=num_epochs,
weight_decay=0.001,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
report_to="wandb",
fp16=False,
gradient_checkpointing=True,
)
Finally, we define the RoBERTa trainer by providing the model, the training arguments and the tokenized datasets:
roberta_trainer = WeightedCELossTrainer(
model=roberta_model,
args=training_args,
train_dataset=roberta_tokenized_datasets['train'],
eval_dataset=roberta_tokenized_datasets["val"],
data_collator=roberta_data_collator,
compute_metrics=compute_metrics
)
Mistral-7B
Similar to RoBERTa, we initialize the WeightedCELossTrainer as follows:
from transformers import TrainingArguments, Trainer
mistral_model = mistral_model.cuda()
lr = 1e-4
batch_size = 8
num_epochs = 5
training_args = TrainingArguments(
output_dir="mistral-lora-token-classification",
learning_rate=lr,
lr_scheduler_type= "constant",
warmup_ratio= 0.1,
max_grad_norm= 0.3,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=num_epochs,
weight_decay=0.001,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
report_to="wandb",
fp16=True,
gradient_checkpointing=True,
)
mistral_trainer = WeightedCELossTrainer(
model=mistral_model,
args=training_args,
train_dataset=mistral_tokenized_datasets['train'],
eval_dataset=mistral_tokenized_datasets["val"],
data_collator=mistral_data_collator,
compute_metrics=compute_metrics
)
Note that we needed to enable half-precision training by setting fp16 to True. The main reason is that Mistral-7B is large: in full float32 precision its weights alone take roughly 29 GB (7 billion parameters × 4 bytes), which leaves little room for activations on a single 48 GB GPU.
Llama 2
Similar to Mistral 7B, we define the trainer as follows:
from transformers import TrainingArguments, Trainer
llama_model = llama_model.cuda()
lr = 1e-4
batch_size = 8
num_epochs = 5
training_args = TrainingArguments(
output_dir="llama-lora-token-classification",
learning_rate=lr,
lr_scheduler_type= "constant",
warmup_ratio= 0.1,
max_grad_norm= 0.3,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=num_epochs,
weight_decay=0.001,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
report_to="wandb",
fp16=True,
gradient_checkpointing=True,
)
llama_trainer = WeightedCELossTrainer(
model=llama_model,
args=training_args,
train_dataset=llama_tokenized_datasets['train'],
eval_dataset=llama_tokenized_datasets["val"],
data_collator=llama_data_collator,
compute_metrics=compute_metrics
)
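With the three trainers defined, fine-tuning is launched with the standard Trainer API; since report_to="wandb" is set in the training arguments, metrics are logged to Weights & Biases automatically. For example:

# Launch fine-tuning for each model.
# (Each call can also be run in a separate process or notebook to free GPU memory between models.)
roberta_trainer.train()
mistral_trainer.train()
llama_trainer.train()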
Hyperparameter Tuning
We used the Wandb Sweep API to run hyperparameter tuning with a Bayesian search strategy (30 runs). The tuned hyperparameters are the following (an illustrative sweep configuration is sketched after the table).
| method | metric | lora_alpha | lora_bias | lora_dropout | lora_rank | lr | max_length |
|---|---|---|---|---|---|---|---|
| bayes | goal: maximize, name: eval/f1-score | distribution: categorical, values: [16, 32, 64] | distribution: categorical, values: [None] | distribution: uniform, min: 0, max: 0.1 | distribution: categorical, values: [4, 8, 16, 32] | distribution: uniform, min: 1e-05, max: 2e-04 | distribution: categorical, values: [512] |
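Below is an illustrative sketch of what such a sweep configuration can look like with the Wandb Sweep API; the project name and the training entry point (train_fn) are placeholders, not the original setup:

import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/f1-score", "goal": "maximize"},
    "parameters": {
        "lora_alpha":   {"distribution": "categorical", "values": [16, 32, 64]},
        "lora_bias":    {"distribution": "categorical", "values": ["none"]},
        "lora_dropout": {"distribution": "uniform", "min": 0.0, "max": 0.1},
        "lora_rank":    {"distribution": "categorical", "values": [4, 8, 16, 32]},
        "lr":           {"distribution": "uniform", "min": 1e-05, "max": 2e-04},
        "max_length":   {"distribution": "categorical", "values": [512]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="disaster-tweets-lora")  # placeholder project name
# wandb.agent(sweep_id, function=train_fn, count=30)  # train_fn builds and trains a model from wandb.config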
For more information, you can check the Wandb experiment reports in the resources section.
Results
| Models | F1 score | Training time | Memory consumption | Number of trainable parameters |
|---|---|---|---|---|
| RoBERTa | 0.8077 | 538 seconds | GPU1: 9.1 GB, GPU2: 8.3 GB | 0.64% |
| Mistral 7B | 0.7364 | 2030 seconds | GPU1: 29.6 GB, GPU2: 29.5 GB | 0.024% |
| Llama 2 | 0.7638 | 2052 seconds | GPU1: 35 GB, GPU2: 33.9 GB | 0.12% |
Conclusion
In this blog post, we compared the performance of three models – RoBERTa, Mistral 7B, and Llama 2 – for disaster tweet classification using LoRA. From the performance results, we can see that RoBERTa outperforms Mistral 7B and Llama 2 by a large margin. This raises the question of whether we really need a complex and large LLM for tasks like short-sequence binary classification.
One lesson we can draw from this study is that one should account for the specific project requirements, available resources, and performance needs when choosing which model to use.
Also, for relatively simple prediction tasks with short sequences, smaller base models such as RoBERTa remain competitive.
Finally, we showed that the LoRA method can be applied to both encoder (RoBERTa) and decoder (Llama 2 and Mistral 7B) models.
Resources
- You can find the code in the following GitHub project.
- You can check the hyperparameter search results in the following Weights & Biases reports:

