🤗 Accelerate
Having started out at a time when wrappers were less common, I became accustomed to writing my own training loops, which I find easier to debug – an approach that 🤗 Accelerate supports well. It proved useful on this project: I wasn’t entirely certain of the required data and label formats or shapes, and my data didn’t match the well-organized examples often shown in tutorials, but having full access to intermediate computations throughout the training loop allowed me to iterate quickly.
Context Length
Most tutorials suggest using each sentence as a single training example. However, in this case, I decided that a longer context would be more suitable, as documents typically contain references to multiple entities, many of which are irrelevant (e.g. lawyers, other creditors, case numbers). The broader context enables the model to better identify the relevant client. I used the first 512 tokens of each document as one training example. This is a common maximum input length for these models, yet it comfortably accommodates all entities in most of my documents.
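As a quick sanity check (a minimal sketch of my own; documents stands for a hypothetical list of documents, each given as a list of words, and is not a variable from the code below), tokenizing each document shows how many of them actually fit within the 512-token limit:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
# `documents` is assumed to be a list of documents, each a list of words
lengths = [len(tokenizer(doc, is_split_into_words=True)["input_ids"]) for doc in documents]
share = sum(length <= 512 for length in lengths) / len(lengths)
print(f"Documents fitting into 512 tokens: {share:.1%}")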
Labelling of Subtokens
In the 🤗 token classification tutorial [1], the recommended approach is:
Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.
However, I found that the following method, suggested in the token classification chapter of their NLP course [2], works much better:
Each token gets the same label as the token that started the word it’s inside, since they are part of the same entity. For tokens inside a word but not at the beginning, we replace the B- with I-.
The label -100 is a special label that is ignored by the loss function. Hence, I used their functions with minor changes:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels
def tokenize_and_align_labels(examples):
    tokenizer = AutoTokenizer.from_pretrained("../model/xlm-roberta-large")
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True,
        padding="max_length", max_length=512)
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs
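To make the alignment concrete, here is a toy example of my own (using the label ids defined in the launch code below, where O = 0, B-evcu = 1, I-evcu = 2, B-prijmeni = 5): a word labelled B-evcu that is split into two subtokens keeps B-evcu on its first subtoken and gets I-evcu on the second, while special tokens receive -100:

word_ids = [None, 0, 1, 1, 2, None]   # as returned by tokenized_inputs.word_ids(i)
labels = [0, 1, 5]                    # word-level labels: O, B-evcu, B-prijmeni
print(align_labels_with_tokens(labels, word_ids))
# [-100, 0, 1, 2, 5, -100]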
I also used their postprocess() function:
To simplify the evaluation part, we define this postprocess() function that takes predictions and labels and converts them to lists of strings.
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()
    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_predictions, true_labels
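For example (a toy illustration of my own, assuming the id2label mapping defined in the launch code below), the positions labelled -100 are simply dropped before scoring:

import torch

predictions = torch.tensor([[1, 2, 0, 0]])
labels = torch.tensor([[1, 2, 0, -100]])
print(postprocess(predictions, labels))
# ([['B-evcu', 'I-evcu', 'O']], [['B-evcu', 'I-evcu', 'O']])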
Class Weights
Incorporating class weights into the loss function significantly improved model performance. While this adjustment may seem straightforward – without it, the model overemphasized the majority “O” class – it’s surprisingly absent from most tutorials. I implemented a custom compute_weights() function to handle this imbalance:
def compute_weights(trainset, num_labels):
    c = Counter()
    for t in trainset:
        c += Counter(t['labels'].tolist())
    weights = [sum(c.values())/(c[i]+1) for i in range(num_labels)]
    return weights
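On a toy tokenized training set (made-up numbers, just to show the effect), the resulting weights are roughly inversely proportional to class frequency; the +1 prevents division by zero for labels that never occur, while the ignored -100 positions only inflate the numerator:

from collections import Counter
import torch

toy_trainset = [
    {'labels': torch.tensor([-100, 0, 0, 0, 1, -100])},
    {'labels': torch.tensor([-100, 0, 2, 2, -100])},
]
print(compute_weights(toy_trainset, num_labels=3))
# [2.2, 5.5, 3.666...] – the rarest class (label 1) gets the largest weight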
Training Loop
I defined two additional functions: create_dataloaders(), which builds the PyTorch DataLoader() objects for batch processing, and a main() function that sets up the distributed training objects and runs the training loop.
from accelerate import Accelerator, notebook_launcher
from collections import Counter
from datasets import Dataset
from datetime import datetime
import os
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification
from transformers import XLMRobertaConfig, XLMRobertaForTokenClassification
from seqeval.metrics import classification_report, f1_score

def create_dataloaders(trainset, evalset, batch_size, num_workers):
    train_dataloader = DataLoader(trainset, shuffle=True,
                                  batch_size=batch_size, num_workers=num_workers)
    eval_dataloader = DataLoader(evalset, shuffle=False,
                                 batch_size=batch_size, num_workers=num_workers)
    return train_dataloader, eval_dataloader
def main(batch_size, num_workers, epochs, model_path, dataset_tr, dataset_ev, training_type, model_params, dt):
    accelerator = Accelerator(split_batches=True)
    num_labels = model_params['num_labels']
    # Prepare data #
    train_ds = Dataset.from_dict(
        {"tokens": [d[2][:512] for d in dataset_tr],
         "ner_tags": [d[1][:512] for d in dataset_tr]})
    eval_ds = Dataset.from_dict(
        {"tokens": [d[2][:512] for d in dataset_ev],
         "ner_tags": [d[1][:512] for d in dataset_ev]})
    trainset = train_ds.map(tokenize_and_align_labels, batched=True,
                            remove_columns=["tokens", "ner_tags"])
    evalset = eval_ds.map(tokenize_and_align_labels, batched=True,
                          remove_columns=["tokens", "ner_tags"])
    trainset.set_format("torch")
    evalset.set_format("torch")
    train_dataloader, eval_dataloader = create_dataloaders(trainset, evalset,
                                                           batch_size, num_workers)
    # Type of training #
    if training_type == 'from_scratch':
        config = XLMRobertaConfig.from_pretrained(model_path, **model_params)
        model = XLMRobertaForTokenClassification(config)
    elif training_type == 'transfer_learning':
        model = AutoModelForTokenClassification.from_pretrained(model_path,
            ignore_mismatched_sizes=True, **model_params)
        for param in model.parameters():
            param.requires_grad = False
        for param in model.classifier.parameters():
            param.requires_grad = True
    elif training_type == 'fine_tuning':
        model = AutoModelForTokenClassification.from_pretrained(model_path,
            **model_params)
        for param in model.parameters():
            param.requires_grad = True
        for param in model.classifier.parameters():
            param.requires_grad = True
    # Instantiate the optimizer #
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=2e-5)
    # Instantiate the learning rate scheduler #
    lr_scheduler = ReduceLROnPlateau(optimizer, patience=5)
    # Define loss function #
    weights = compute_weights(trainset, num_labels)
    loss_fct = CrossEntropyLoss(weight=torch.tensor(weights))
    # Prepare objects for distributed training #
    loss_fct, train_dataloader, model, optimizer, eval_dataloader, lr_scheduler = accelerator.prepare(
        loss_fct, train_dataloader, model, optimizer, eval_dataloader, lr_scheduler)
    # Training loop #
    max_f1 = 0      # for early stopping
    best_epoch = 0  # so the early-stopping check is defined from the first epoch
    for t in range(epochs):
        # training
        accelerator.print(f"\n\nEpoch {t+1}\n-------------------------------")
        model.train()
        tr_loss = 0
        preds = list()
        labs = list()
        for batch in train_dataloader:
            outputs = model(input_ids=batch['input_ids'],
                            attention_mask=batch['attention_mask'])
            labels = batch["labels"]
            loss = loss_fct(outputs.logits.view(-1, num_labels), labels.view(-1))
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
            tr_loss += loss
            predictions = outputs.logits.argmax(dim=-1)
            predictions_gathered = accelerator.gather(predictions)
            labels_gathered = accelerator.gather(labels)
            true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
            preds.extend(true_predictions)
            labs.extend(true_labels)
        lr_scheduler.step(tr_loss)
        accelerator.print(f"Train loss: {tr_loss/len(train_dataloader):>8f} \n")
        accelerator.print(classification_report(labs, preds))
        # evaluation
        model.eval()
        ev_loss = 0
        preds = list()
        labs = list()
        for batch in eval_dataloader:
            with torch.no_grad():
                outputs = model(input_ids=batch['input_ids'],
                                attention_mask=batch['attention_mask'])
            labels = batch["labels"]
            loss = loss_fct(outputs.logits.view(-1, num_labels), labels.view(-1))
            ev_loss += loss
            predictions = outputs.logits.argmax(dim=-1)
            predictions_gathered = accelerator.gather(predictions)
            labels_gathered = accelerator.gather(labels)
            true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
            preds.extend(true_predictions)
            labs.extend(true_labels)
        accelerator.print(f"Eval loss: {ev_loss/len(eval_dataloader):>8f} \n")
        accelerator.print(classification_report(labs, preds))
        accelerator.print(f"Current Learning Rate: {optimizer.param_groups[0]['lr']}")
        # checkpoint best model
        if f1_score(labs, preds) > max_f1:
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(f"../model/xlml_ner/{dt}/",
                                            is_main_process=accelerator.is_main_process,
                                            save_function=accelerator.save)
            accelerator.print(f"Model saved in epoch {t+1}.")
            max_f1 = f1_score(labs, preds)
            best_epoch = t
        # early stopping
        if (t - best_epoch) > 10:
            accelerator.print(f"Early stopping after epoch {t+1}.")
            break
    accelerator.print("Done!")
With everything prepared, the model is ready for training. I just need to launch the process:
label_list = [
    "O",
    "B-evcu", "I-evcu",          # variable symbol of creditor
    "B-rc", "I-rc",              # birth ID
    "B-prijmeni", "I-prijmeni",  # surname
    "B-jmeno", "I-jmeno",        # given name
    "B-datum", "I-datum",        # birth date
]
id2label = {a: b for a, b in enumerate(label_list)}
label2id = {b: a for a, b in enumerate(label_list)}

num_workers = 6  # number of GPUs
batch_size = num_workers*2
epochs = 100
model_path = "../model/xlm-roberta-large"
training_type = "fine_tuning" # from_scratch / transfer_learning / fine_tuning
model_params = {"id2label": id2label, "label2id": label2id, "num_labels": 11}
dt = datetime.now().strftime("%Y%m%d_%H%M%S")
os.mkdir(f"../model/xlml_ner/{dt}")
notebook_launcher(main, args=(batch_size, num_workers, epochs, model_path,
                              dataset_tr, dataset_ev, training_type, model_params, dt),
                  num_processes=num_workers, mixed_precision="fp16", use_port="29502")
I find using notebook_launcher() convenient, as it allows me to run training in the console and easily work with the results afterwards.
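After training, the best checkpoint can be loaded back for inference, for example via the transformers token-classification pipeline. A minimal sketch (the run directory and the example text are made up for illustration, not the deployed inference code):

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

checkpoint_dir = "../model/xlml_ner/20240101_120000/"  # hypothetical run directory
tokenizer = AutoTokenizer.from_pretrained("../model/xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(checkpoint_dir)

# aggregation_strategy="simple" merges B-/I- subtoken predictions into whole entities
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Jan Novák, nar. 1. 1. 1970, ev.č. 123456789"))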
XLM-RoBERTa base vs large vs Small-E-Czech
I experimented with fine-tuning three models. The XLM-RoBERTa base model [3] delivered satisfactory performance, but the server capacity also allowed me to try the XLM-RoBERTa large model [3], which has twice as many parameters.
XLM-RoBERTa is a multilingual version of RoBERTa. It’s pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
The large model showed a slight improvement in results, so I ultimately deployed it. I also tested Small-E-Czech [4], an Electra-small model pre-trained on Czech web data, but its performance was poor.
Fine-tuning vs Transfer learning vs Training from scratch
In addition to fine-tuning (updating all model weights), I tested transfer learning, as it is often suggested that training only the final (classification) layer may suffice. However, the performance difference was significant, favoring full fine-tuning. I also attempted training from scratch by importing only the architecture of the model, initializing the weights randomly, and then training, but as expected, this approach was ineffective.
RoBERTa vs LLM (Claude 3.5 Sonnet)
I briefly explored zero-shot LLMs, though with minimal prompt engineering (so 🥱). The model struggled even with basic requests, such as (I used Czech in the actual prompt):
Find the variable symbol of the creditor. This number has exactly 9 consecutive digits 0–9, without letters or other special characters. It is usually preceded by one of the following abbreviations: ‘ev.č.’, ‘zn. opr’, ‘VS. O’, ‘evid. č. opr.’. On the contrary, I am not interested in a transaction number with the abbreviation ‘č.j.’. This number does not appear often in documents; it may happen that you will not be able to find it, in which case write ‘cannot find’. If you are unsure, write ‘unsure’.
The model sometimes failed to output the 9-digit format correctly. Post-processing could filter out shorter numbers, but there were many false-positive 9-digit numbers.
Occasionally the model inferred incorrect birth IDs based solely on birth dates (even with temperature set to 0). On the other hand, it excelled at extracting names, surnames, and birth dates.
Overall, as in my previous experiments, I found that LLMs (at the time of writing) perform better on general tasks but lack accuracy and reliability for specific or unconventional tasks. The performance in identifying the client was fairly similar for both approaches. For internal reasons, the RoBERTa model was deployed.