In the fast-moving world of Natural Language Processing (NLP), we often find ourselves comparing different language models to see which one works best for specific tasks. This blog post compares three models: RoBERTa, Mistral-7B, and Llama-2-7b. We used them to tackle a common problem: classifying tweets about disasters. It is important to note that Mistral and Llama 2 are large models with 7 billion parameters. In contrast, RoBERTa-large (355M parameters) is a comparatively small model that serves as the baseline for this comparison study.
In this blog, we use a PEFT (Parameter-Efficient Fine-Tuning) technique, LoRA (Low-Rank Adaptation of Large Language Models), to fine-tune the pre-trained models on a sequence classification task. LoRA is designed to significantly reduce the number of trainable parameters while maintaining strong downstream task performance.
The main objective of this blog post is to implement LoRA fine-tuning for sequence classification using three pre-trained models from Hugging Face: meta-llama/Llama-2-7b-hf, mistralai/Mistral-7B-v0.1, and roberta-large.
Hardware Used
- Number of nodes: 1
- Number of GPUs per node: 1
- GPU type: A6000
- GPU memory: 48GB
Goals
- Implement fine-tuning of pre-trained LLMs using LoRA PEFT methods.
- Learn how to use the Hugging Face APIs (transformers, peft, and datasets).
- Set up hyperparameter tuning and experiment logging using Weights & Biases.
Dependencies
datasets
evaluate
peft
scikit-learn
torch
transformers
wandb
Note: For reproducing the reported results, please check the pinned versions in the wandb reports.
Pre-trained Models
RoBERTa
RoBERTa (Robustly Optimized BERT Approach) is an advanced variant of the BERT model proposed by the Meta AI research team. BERT is a transformer-based language model that uses self-attention mechanisms to build contextual word representations and is trained with a masked language modeling objective. Note that BERT is an encoder-only model used for natural language understanding tasks (such as sequence classification and token classification).
RoBERTa is a popular model to fine-tune and is suitable as a baseline for our experiments. For more information, you can check its Hugging Face model card.
Llama 2
Llama 2 models, whose name stands for Large Language Model Meta AI, belong to the family of large language models (LLMs) introduced by Meta AI. The Llama 2 models vary in size, with parameter counts ranging from 7 billion to 70 billion.
Llama 2 is an auto-regressive language model based on the transformer decoder architecture. To generate text, Llama 2 takes a sequence of words as input and iteratively predicts the next token.
The Llama 2 architecture is slightly different from models like GPT-3. For instance, Llama 2 employs the SwiGLU activation function rather than ReLU and uses rotary positional embeddings rather than absolute learnable positional embeddings.
Llama 2 also introduced architectural refinements to better handle long sequences, extending the context length to up to 4096 tokens and using grouped-query attention (GQA) during decoding.
Mistral 7B
Mistral 7B v0.1, with 7.3 billion parameters, is the first LLM introduced by Mistral AI.
The main novel techniques used in Mistral 7B's architecture are:
- Sliding Window Attention: replaces full attention (quadratic compute cost) with a sliding-window attention in which each token can attend to at most 4,096 tokens from the previous layer (linear compute cost). This mechanism allows Mistral 7B to handle longer sequences, since higher layers can still access information beyond the window size of 4,096 tokens. A minimal sketch of such a mask is shown after this list.
- Grouped-query Attention: also used in Llama 2, this technique speeds up inference by caching the key and value vectors for previously decoded tokens in the sequence and sharing them across groups of query heads.
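To make the sliding-window idea concrete, here is a minimal, illustrative sketch (not Mistral's actual implementation) of a causal sliding-window attention mask in PyTorch:

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Each query position may attend only to itself and the (window - 1)
    # positions before it, so the number of attended keys per token is
    # bounded and compute grows linearly with sequence length.
    q = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    causal = k <= q                         # no attention to future tokens
    in_window = (q - k) < window            # stay inside the sliding window
    return causal & in_window               # True = attention allowed

print(sliding_window_mask(seq_len=8, window=4).int())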
LoRA
PEFT, Parameter-Efficient Fine-Tuning, is a family of techniques (p-tuning, prefix tuning, IA3, Adapters, and LoRA) designed to fine-tune large models using a much smaller set of trainable parameters while preserving the performance levels typically achieved through full fine-tuning.
LoRA, Low-Rank Adaptation, is a PEFT method that shares similarities with Adapter layers. Its primary objective is to reduce the model's trainable parameters. It does so by learning a low-rank update matrix while keeping the pre-trained weights frozen.
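As an illustration of the idea (a simplified sketch, not the peft library's implementation), a LoRA layer keeps the pre-trained weight frozen and adds a trainable low-rank update B·A scaled by alpha/r:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Simplified illustration: y = W_frozen(x) + (alpha / r) * B A x,
    # where only A and B (r * (in_features + out_features) parameters) are trained.
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                     # frozen pre-trained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling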
Setup
RoBERTa has a maximum sequence length limit of 512 tokens, so we set MAX_LEN=512 for all models to ensure a fair comparison.
MAX_LEN = 512
roberta_checkpoint = "roberta-large"
mistral_checkpoint = "mistralai/Mistral-7B-v0.1"
llama_checkpoint = "meta-llama/Llama-2-7b-hf"
Data preparation
Data loading
We’ll load the dataset from Hugging Face:
from datasets import load_dataset
dataset = load_dataset("mehdiiraqui/twitter_disaster")
Now, let’s split the dataset into training and validation datasets. Then add the test set:
from datasets import Dataset
data = dataset['train'].train_test_split(train_size=0.8, seed=42)
data['val'] = data.pop("test")
data['test'] = dataset['test']
Here's an overview of the dataset:
DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 6090
    })
    val: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 1523
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 3263
    })
})
Let's check the data distribution:
import pandas as pd
data['train'].to_pandas().info()
data['test'].to_pandas().info()
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 7613 non-null int64
1 keyword 7552 non-null object
2 location 5080 non-null object
3 text 7613 non-null object
4 target 7613 non-null int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3263 non-null int64
1 keyword 3237 non-null object
2 location 2158 non-null object
3 text 3263 non-null object
4 target 3263 non-null int64
dtypes: int64(2), object(3)
memory usage: 127.6+ KB
Target distribution in the train dataset
target
0 4342
1 3271
Name: count, dtype: int64
As the classes are not balanced, we will compute the positive and negative weights and use them for loss calculation later:
pos_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[1])
neg_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[0])
The final weights are:
POS_WEIGHT, NEG_WEIGHT = (1.1637114032405993, 0.8766697374481806)
Then, we compute the maximum length of the text column:
max_char = data['train'].to_pandas()['text'].str.len().max()
max_words = data['train'].to_pandas()['text'].str.split().str.len().max()
The maximum number of characters is 152.
The maximum number of words is 31.
Data Processing
Let's have a look at one example row of the training data:
data['train'][0]
{'id': 5285,
'keyword': 'fear',
'location': 'Thibodaux, LA',
'text': 'my worst fear. https://t.co/iH8UDz8mq3',
'target': 0}
The data contains a keyword, a location, and the text of the tweet. For the sake of simplicity, we select the text feature as the only input to the LLM.
At this stage, we have prepared the train, validation, and test sets in the HuggingFace format expected by the pre-trained LLMs. The next step is to define the tokenized dataset for training, using the appropriate tokenizer to transform the text feature into two tensors: the sequence of token ids and the attention mask. As each model has its specific tokenizer, we will need to define three different tokenized datasets.
We start by defining the RoBERTa dataloader:
from transformers import AutoTokenizer
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_checkpoint, add_prefix_space=True)
Note: The RoBERTa tokenizer has been trained to treat spaces as part of the token. As a result, the first word of a sentence is encoded differently if it is not preceded by a white space. To ensure the first word includes a space, we set add_prefix_space=True. Also, to maintain consistent pre-processing for all three models, we set the parameter to 'True' for Llama 2 and Mistral 7B as well.
- Define the preprocessing function for converting one row of the dataframe:
def roberta_preprocessing_function(examples):
    return roberta_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)
By applying the preprocessing function to the first example of our training dataset, we get the tokenized inputs (input_ids) and the attention mask:
roberta_preprocessing_function(data['train'][0])
{'input_ids': [0, 127, 2373, 2490, 4, 1205, 640, 90, 4, 876, 73, 118, 725, 398, 13083, 329, 398, 119, 1343, 246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
- Now, let’s apply the preprocessing function to the complete dataset:
col_to_delete = ['id', 'keyword','location', 'text']
roberta_tokenized_datasets = data.map(roberta_preprocessing_function, batched=True, remove_columns=col_to_delete)
roberta_tokenized_datasets = roberta_tokenized_datasets.rename_column("target", "label")
roberta_tokenized_datasets.set_format("torch")
Note: we deleted the unneeded columns from our data: id, keyword, location, and text. We deleted the text column because it has already been converted into input ids and an attention mask.
We can take a look at our tokenized training dataset:
roberta_tokenized_datasets['train'][0]
{'label': tensor(0),
'input_ids': tensor([ 0, 127, 2373, 2490, 4, 1205, 640, 90, 4, 876,
73, 118, 725, 398, 13083, 329, 398, 119, 1343, 246,
2]),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
- For generating the training batches, we also need to pad the rows of a given batch to the maximum length present in the batch. For that, we will use the DataCollatorWithPadding class:
from transformers import DataCollatorWithPadding
roberta_data_collator = DataCollatorWithPadding(tokenizer=roberta_tokenizer)
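As a quick, illustrative sanity check, the collator can be applied to a few tokenized examples to verify that it pads them to the longest sequence in the batch:

# Collate a small batch of tokenized examples and inspect the padded shapes.
sample_batch = roberta_data_collator([roberta_tokenized_datasets["train"][i] for i in range(4)])
print(sample_batch["input_ids"].shape)       # (4, longest_sequence_in_this_batch)
print(sample_batch["attention_mask"].shape)  # same shape; 0s mark the padded positions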
You can follow the same steps for preparing the data for the Mistral 7B and Llama 2 models.
Note that Llama 2 and Mistral 7B do not have a default pad_token_id, so we use the eos_token_id for padding.
from transformers import AutoTokenizer, DataCollatorWithPadding
mistral_tokenizer = AutoTokenizer.from_pretrained(mistral_checkpoint, add_prefix_space=True)
mistral_tokenizer.pad_token_id = mistral_tokenizer.eos_token_id
mistral_tokenizer.pad_token = mistral_tokenizer.eos_token
def mistral_preprocessing_function(examples):
    return mistral_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)
mistral_tokenized_datasets = data.map(mistral_preprocessing_function, batched=True, remove_columns=col_to_delete)
mistral_tokenized_datasets = mistral_tokenized_datasets.rename_column("target", "label")
mistral_tokenized_datasets.set_format("torch")
mistral_data_collator = DataCollatorWithPadding(tokenizer=mistral_tokenizer)
from transformers import AutoTokenizer, DataCollatorWithPadding
llama_tokenizer = AutoTokenizer.from_pretrained(llama_checkpoint, add_prefix_space=True)
llama_tokenizer.pad_token_id = llama_tokenizer.eos_token_id
llama_tokenizer.pad_token = llama_tokenizer.eos_token
def llama_preprocessing_function(examples):
    return llama_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)
llama_tokenized_datasets = data.map(llama_preprocessing_function, batched=True, remove_columns=col_to_delete)
llama_tokenized_datasets = llama_tokenized_datasets.rename_column("target", "label")
llama_tokenized_datasets.set_format("torch")
llama_data_collator = DataCollatorWithPadding(tokenizer=llama_tokenizer)
Now that we have prepared the tokenized datasets, the next section will showcase how to load the pre-trained LLM checkpoints and how to set up the LoRA weights.
Models
RoBERTa
Load RoBERTa Checkpoints for the Classification Task
We load the pre-trained RoBERTa model with a sequence classification head using the Hugging Face AutoModelForSequenceClassification class:
from transformers import AutoModelForSequenceClassification
roberta_model = AutoModelForSequenceClassification.from_pretrained(roberta_checkpoint, num_labels=2)
LoRA setup for RoBERTa classifier
We import the LoRA configuration and set some parameters for the RoBERTa classifier:
- TaskType: Sequence classification
- r (rank): Rank of our decomposition matrices
- lora_alpha: Alpha parameter to scale the learned weights. The LoRA paper advises fixing alpha at 16
- lora_dropout: Dropout probability of the LoRA layers
- bias: Whether to add a bias term to the LoRA layers
The code below uses the values recommended by the LoRA paper. Later in this post, we will perform hyperparameter tuning of these parameters using wandb.
from peft import get_peft_model, LoraConfig, TaskType
roberta_peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
)
roberta_model = get_peft_model(roberta_model, roberta_peft_config)
roberta_model.print_trainable_parameters()
We can see that the number of trainable parameters represents only 0.64% of the RoBERTa model's parameters:
trainable params: 2,299,908 || all params: 356,610,052 || trainable%: 0.6449363911929212
Mistral
Load checkpoints for the classification model
Let’s load the pre-trained Mistral-7B model with a sequence classification head:
from transformers import AutoModelForSequenceClassification
import torch
mistral_model = AutoModelForSequenceClassification.from_pretrained(
pretrained_model_name_or_path=mistral_checkpoint,
num_labels=2,
device_map="auto"
)
For Mistral 7B, we have to add the padding token id as it is not defined by default.
mistral_model.config.pad_token_id = mistral_model.config.eos_token_id
LoRA setup for Mistral 7B classifier
For the Mistral 7B model, we need to specify the target_modules (the query and value projections from the attention modules):
from peft import get_peft_model, LoraConfig, TaskType
mistral_peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
target_modules=[
"q_proj",
"v_proj",
],
)
mistral_model = get_peft_model(mistral_model, mistral_peft_config)
mistral_model.print_trainable_parameters()
The number of trainable parameters represents only 0.024% of the Mistral model's parameters:
trainable params: 1,720,320 || all params: 7,112,380,416 || trainable%: 0.02418768259540745
Llama 2
Load checkpoints for the classification model
Let's load the pre-trained Llama 2 model with a sequence classification head.
from transformers import AutoModelForSequenceClassification
import torch
llama_model = AutoModelForSequenceClassification.from_pretrained(
pretrained_model_name_or_path=llama_checkpoint,
num_labels=2,
device_map="auto",
offload_folder="offload",
trust_remote_code=True
)
For Llama 2, we have to add the padding token id as it is not defined by default.
llama_model.config.pad_token_id = llama_model.config.eos_token_id
LoRA setup for Llama 2 classifier
We define the LoRA configuration for Llama 2 with the same target modules as for Mistral (the query and value projections from the attention modules):
from peft import get_peft_model, LoraConfig, TaskType
llama_peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS, r=16, lora_alpha=16, lora_dropout=0.05, bias="none",
target_modules=[
"q_proj",
"v_proj",
],
)
llama_model = get_peft_model(llama_model, llama_peft_config)
llama_model.print_trainable_parameters()
The number of trainable parameters represents only 0.12% of the Llama 2 model's parameters:
trainable params: 8,404,992 || all params: 6,615,748,608 || trainable%: 0.1270452143516515
At this point, we have defined the tokenized datasets for training as well as the LLMs set up with LoRA layers. The following section will introduce how to launch training using the HuggingFace Trainer class.
Setup the trainer
Evaluation Metrics
First, we define the performance metrics we will use to compare the three models: F1 score, recall, precision, and accuracy:
import evaluate
import numpy as np
def compute_metrics(eval_pred):
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    return {"precision": precision, "recall": recall, "f1-score": f1, "accuracy": accuracy}
Custom Trainer for Weighted Loss
As mentioned at the beginning of this post, we have an imbalanced distribution between positive and negative classes. We need to train our models with a weighted cross-entropy loss to account for that. The Trainer class doesn't support providing a custom loss out of the box, as it expects to get the loss directly from the model's outputs.
So, we need to define our custom WeightedCELossTrainer, which overrides the compute_loss method to calculate the weighted cross-entropy loss based on the model's predictions and the input labels:
from transformers import Trainer
class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # Get the model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Compute the weighted cross-entropy loss using the class weights computed earlier
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([neg_weights, pos_weights], device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
Trainer Setup
Let’s set the training arguments and the trainer for the three models.
RoBERTa
The first important step is to move the models to the GPU device for training.
roberta_model = roberta_model.cuda()
roberta_model.device
It will print the following:
device(type='cuda', index=0)
Then, we set the training arguments:
from transformers import TrainingArguments
lr = 1e-4
batch_size = 8
num_epochs = 5
training_args = TrainingArguments(
output_dir="roberta-large-lora-token-classification",
learning_rate=lr,
lr_scheduler_type= "constant",
warmup_ratio= 0.1,
max_grad_norm= 0.3,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=num_epochs,
weight_decay=0.001,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
report_to="wandb",
fp16=False,
gradient_checkpointing=True,
)
Finally, we define the RoBERTa trainer by providing the model, the training arguments and the tokenized datasets:
roberta_trainer = WeightedCELossTrainer(
model=roberta_model,
args=training_args,
train_dataset=roberta_tokenized_datasets['train'],
eval_dataset=roberta_tokenized_datasets["val"],
data_collator=roberta_data_collator,
compute_metrics=compute_metrics
)
Mistral-7B
Similar to RoBERTa, we initialize the WeightedCELossTrainer as follows:
from transformers import TrainingArguments, Trainer
mistral_model = mistral_model.cuda()
lr = 1e-4
batch_size = 8
num_epochs = 5
training_args = TrainingArguments(
output_dir="mistral-lora-token-classification",
learning_rate=lr,
lr_scheduler_type= "constant",
warmup_ratio= 0.1,
max_grad_norm= 0.3,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=num_epochs,
weight_decay=0.001,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
report_to="wandb",
fp16=True,
gradient_checkpointing=True,
)
mistral_trainer = WeightedCELossTrainer(
model=mistral_model,
args=training_args,
train_dataset=mistral_tokenized_datasets['train'],
eval_dataset=mistral_tokenized_datasets["val"],
data_collator=mistral_data_collator,
compute_metrics=compute_metrics
)
Note that we needed to enable half-precision training by setting fp16 to True. The main reason is that Mistral-7B is large: in full float32 precision its weights alone take roughly 29 GB (7 billion parameters × 4 bytes), which leaves little room for activations on a single 48 GB GPU.
Llama 2
Similar to Mistral 7B, we define the trainer as follows:
from transformers import TrainingArguments, Trainer
llama_model = llama_model.cuda()
lr = 1e-4
batch_size = 8
num_epochs = 5
training_args = TrainingArguments(
output_dir="llama-lora-token-classification",
learning_rate=lr,
lr_scheduler_type= "constant",
warmup_ratio= 0.1,
max_grad_norm= 0.3,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=num_epochs,
weight_decay=0.001,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
report_to="wandb",
fp16=True,
gradient_checkpointing=True,
)
llama_trainer = WeightedCELossTrainer(
model=llama_model,
args=training_args,
train_dataset=llama_tokenized_datasets['train'],
eval_dataset=llama_tokenized_datasets["val"],
data_collator=llama_data_collator,
compute_metrics=compute_metrics
)
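With the three trainers defined, fine-tuning is launched with the standard Trainer API; since report_to="wandb" is set in the training arguments, metrics are logged to Weights & Biases automatically. For example:

# Launch fine-tuning for each model.
# (Each call can also be run in a separate process or notebook to free GPU memory between models.)
roberta_trainer.train()
mistral_trainer.train()
llama_trainer.train()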
Hyperparameter Tuning
We used the Wandb Sweep API to run hyperparameter tuning with a Bayesian search strategy (30 runs). The tuned hyperparameters are the following (an illustrative sweep configuration is sketched after the table).
| method | metric | lora_alpha | lora_bias | lora_dropout | lora_rank | lr | max_length |
|---|---|---|---|---|---|---|---|
| bayes | goal: maximize, name: eval/f1-score | distribution: categorical, values: [16, 32, 64] | distribution: categorical, values: [None] | distribution: uniform, min: 0, max: 0.1 | distribution: categorical, values: [4, 8, 16, 32] | distribution: uniform, min: 1e-05, max: 2e-04 | distribution: categorical, values: [512] |
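Below is an illustrative sketch of what such a sweep configuration can look like with the Wandb Sweep API; the project name and the training entry point (train_fn) are placeholders, not the original setup:

import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/f1-score", "goal": "maximize"},
    "parameters": {
        "lora_alpha":   {"distribution": "categorical", "values": [16, 32, 64]},
        "lora_bias":    {"distribution": "categorical", "values": ["none"]},
        "lora_dropout": {"distribution": "uniform", "min": 0.0, "max": 0.1},
        "lora_rank":    {"distribution": "categorical", "values": [4, 8, 16, 32]},
        "lr":           {"distribution": "uniform", "min": 1e-05, "max": 2e-04},
        "max_length":   {"distribution": "categorical", "values": [512]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="disaster-tweets-lora")  # placeholder project name
# wandb.agent(sweep_id, function=train_fn, count=30)  # train_fn builds and trains a model from wandb.config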
For more information, you can check the Wandb experiment reports in the resources section.
Results
| Models | F1 score | Training time | Memory consumption | Number of trainable parameters |
|---|---|---|---|---|
| RoBERTa | 0.8077 | 538 seconds | GPU1: 9.1 GB, GPU2: 8.3 GB | 0.64% |
| Mistral 7B | 0.7364 | 2030 seconds | GPU1: 29.6 GB, GPU2: 29.5 GB | 0.024% |
| Llama 2 | 0.7638 | 2052 seconds | GPU1: 35 GB, GPU2: 33.9 GB | 0.12% |
Conclusion
In this blog post, we compared the performance of three models – RoBERTa, Mistral 7B, and Llama 2 – for disaster tweet classification using LoRA. From the performance results, we can see that RoBERTa outperforms Mistral 7B and Llama 2 by a large margin. This raises the question of whether we really need a complex and large LLM for tasks like short-sequence binary classification.
One lesson we can draw from this study is that one should account for the specific project requirements, available resources, and performance needs when choosing which model to use.
Also, for relatively simple prediction tasks with short sequences, smaller base models such as RoBERTa remain competitive.
Finally, we showed that the LoRA method can be applied to both encoder (RoBERTa) and decoder (Llama 2 and Mistral 7B) models.
Resources
- You can find the code in the following GitHub project.
- You can check the hyperparameter search results in the following Weights & Biases reports:

