Effectively Annotate Text Data for Transformers via Active Learning + Re-labeling
What is Active Learning?
What is ActiveLab?
Motivation
Classifying the Politeness of Text
Methodology
Model Training and Evaluation
Use Active Learning Scores to Determine What to Label Next
Adding New Annotations
Results

ActiveLab chooses which data is best to (re)label in order to train a more effective model. Given the same labeling budget, ActiveLab outperforms other selection methods.

In this article, I demonstrate how active learning can improve a fine-tuned Transformer model for text classification while keeping the total number of labels collected from human annotators low. When resource constraints prevent you from acquiring labels for the entirety of your data, active learning aims to save both time and money by choosing which examples data annotators should spend their effort labeling.

Active learning helps prioritize what data to label in order to maximize the performance of a supervised machine learning model trained on the labeled data. This process usually happens iteratively: at each round, active learning tells us which examples we should collect additional annotations for to improve our current model the most.

Active learning is most useful for efficiently annotating data in settings where you have a large pool of unlabeled data and a limited labeling budget, and you must decide which examples to label in order to train an accurate model. An important assumption of the methods presented here (and of most machine learning) is that the individual examples are independent and identically distributed.

ActiveLab is an active learning algorithm that is particularly useful when annotators are noisy, because it helps decide when we should collect one more annotation for a previously annotated example (whose label seems suspect) versus for a not-yet-annotated example. After collecting these new annotations for a batch of data to grow our training dataset, we re-train our model and evaluate its test accuracy.

CROWDLAB is a method to estimate our confidence in a consensus label from multi-annotator data; it produces accurate estimates via a weighted ensemble of a trained model's probabilistic prediction p_M and the individual labels assigned by each annotator j. ActiveLab forms a similar weighted ensemble estimate, treating each annotator's choice as a probabilistic decision p_j output by another predictor:
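The original post displays this formula as an image. As a rough sketch (my paraphrase of the CROWDLAB/ActiveLab approach, not a verbatim reproduction of the papers' notation), the ensemble estimate is a weighted average of the model's prediction and each annotator's label:

\hat{p}(y \mid x) = \frac{w_M \, p_M(y \mid x) + \sum_j w_j \, p_j(y \mid x)}{w_M + \sum_j w_j}

where the weights w_M and w_j reflect the estimated reliability of the model and of annotator j, respectively, and p_j places its probability mass on the class that annotator j selected.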

ActiveLab decides to re-label data when the estimated probability of the current consensus label for a previously annotated datapoint falls below our (purely model-based) confidence in the predicted label for an unannotated datapoint.

ActiveLab is best suited to data labeling applications where annotators are imperfect and you are able to train a reasonable classifier model (one that produces better-than-random predictions). The method works with any data modality and classifier model.

I recently joined Cleanlab as a Data Scientist and am excited to share how ActiveLab (part of our open-source library, freely available under the AGPL-v3 license) can be used in various workflows to improve datasets.

Here I consider a binary text classification task: predicting whether a particular phrase is polite or impolite. Compared to random selection of which examples to collect an additional annotation for, active learning with ActiveLab consistently produces much better Transformer models at each round (substantially reducing the error rate), no matter the total labeling budget!

The remainder of this article walks through the open-source code you can use to achieve these results. You can run all of the code to reproduce my active learning experiments in this Colab Notebook.

The dataset I consider here is a variant of the Stanford Politeness Corpus. It is structured as a binary text classification task: classify whether each phrase is polite or impolite. Human annotators are given a particular text phrase and provide an (imperfect) annotation regarding its politeness: 0 for impolite and 1 for polite.

Training a Transformer classifier on the annotated data, we measure model accuracy over a set of held-out test examples, whose ground truth labels I am confident in because they are derived from a consensus among 5 annotators who each labeled these examples.

As for the training data, we have:

  • X_labeled_full: our initial training set, with only a small set of 100 text examples labeled with 2 annotations per example.
  • X_unlabeled: a large set of 1,900 unlabeled text examples we can consider having annotators label.
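To make the format concrete, here is a hypothetical miniature version of these two objects (my assumption, inferred from how the annotation code later in this article indexes X_labeled_full by example id and annotator column; it is not the exact notebook contents):

import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for the real objects (illustration only).
# X_labeled_full: one row per annotated example, one column per annotator id,
# NaN where that annotator did not label that example (1 = polite, 0 = impolite).
X_labeled_full = pd.DataFrame(
    {"61": [1.0, np.nan], "99": [1.0, np.nan], "16": [np.nan, 1.0], "22": [np.nan, 0.0]},
    index=["example_0", "example_1"],
)

# X_unlabeled: the pool of not-yet-annotated text examples.
X_unlabeled = pd.DataFrame(
    {"text": ["<some unlabeled phrase>", "<another unlabeled phrase>"]},
    index=["example_100", "example_101"],
)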

Here are a few examples from X_labeled_full:

1.

  • Annotation (by annotator #61): polite
  • Annotation (by annotator #99): polite

2.

  • Annotation (by annotator #16): polite
  • Annotation (by annotator #22): impolite

3.

  • Annotation (by annotator #22): impolite
  • Annotation (by annotator #61): impolite

For each round, we perform the following steps (a condensed code sketch of this loop appears after the list):

  1. Compute ActiveLab consensus labels for each training example, derived from all annotations collected so far.
  2. Train our Transformer classification model on the current training set using these consensus labels.
  3. Evaluate test accuracy on the test set (which has high-quality ground truth labels).
  4. Run cross-validation to get out-of-sample predicted class probabilities from our model for the entire training set and unlabeled set.
  5. Get ActiveLab active learning scores for each example in the training set and unlabeled set. These scores estimate how informative it would be to collect another annotation for each example.
  6. Select a subset (n = batch_size) of examples with the lowest active learning scores.
  7. Collect one additional annotation for each of the n chosen examples.
  8. Add the new annotations (and the newly annotated examples, if previously unlabeled) to our training set for the next iteration.
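Here is a condensed sketch of one such round, reusing the helpers introduced later in this article (get_trainer, get_pred_probs, get_idx_to_label). The helpers get_consensus_labels, make_train_set, and add_new_annotations are hypothetical shorthand for steps 1 and 7-8, not functions from the notebook or from cleanlab:

from cleanlab.multiannotator import get_active_learning_scores

def run_active_learning_round(X_labeled_full, X_unlabeled, test_set,
                              multiannotator_labels, extra_annotations,
                              batch_size_to_label):
    # Steps 1-3: consensus labels, model training, test-set evaluation.
    consensus_labels = get_consensus_labels(X_labeled_full)        # hypothetical helper (step 1)
    train_set = make_train_set(X_labeled_full, consensus_labels)   # hypothetical helper
    trainer = get_trainer(train_set, test_set)
    trainer.train()
    test_accuracy = trainer.evaluate()["eval_accuracy"]

    # Steps 4-5: out-of-sample predictions via cross-validation, then ActiveLab scores.
    pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)
    scores, scores_unlabeled = get_active_learning_scores(
        multiannotator_labels, pred_probs, pred_probs_unlabeled
    )

    # Steps 6-8: pick the lowest-scoring batch, collect one extra annotation per chosen
    # example, and fold the new annotations (and examples) back into the training data.
    chosen_labeled, chosen_unlabeled = get_idx_to_label(
        X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label,
        scores, scores_unlabeled,
    )
    X_labeled_full, X_unlabeled = add_new_annotations(              # hypothetical helper (steps 7-8)
        X_labeled_full, X_unlabeled, chosen_labeled, chosen_unlabeled
    )
    return X_labeled_full, X_unlabeled, test_accuracy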

I subsequently compare models trained on data labeled via active learning versus data labeled via random selection. For each random selection round, I use majority vote consensus instead of ActiveLab consensus (in Step 1) and then simply pick random examples to collect an additional label for instead of using ActiveLab scores (in Step 6).

Here is the code we use for model training and evaluation, based on the Hugging Face transformers library, which offers many state-of-the-art Transformer networks.

import numpy as np
from datetime import datetime
from scipy.special import softmax
from sklearn.metrics import accuracy_score
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Helper method to get accuracy and pred_probs from Trainer.
def compute_metrics(p):
    logits, labels = p
    pred = np.argmax(logits, axis=1)
    pred_probs = softmax(logits, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    return {"logits": logits, "pred_probs": pred_probs, "accuracy": accuracy}

# Helper method to initiate a new Trainer with given train and test sets.
def get_trainer(train_set, test_set):

    # Model params.
    model_name = "distilbert-base-uncased"
    model_folder = "model_training"
    max_training_steps = 300
    num_classes = 2

    # Set training args.
    # We time-seed to ensure randomness between different benchmarking runs.
    training_args = TrainingArguments(
        max_steps=max_training_steps,
        output_dir=model_folder,
        seed=int(datetime.now().timestamp()),
    )

    # Tokenize train/test set (tokenize_data is a helper defined in the accompanying notebook).
    dataset_train = tokenize_data(train_set)
    dataset_test = tokenize_data(test_set)

    # Initiate a pre-trained model.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=dataset_train,
        eval_dataset=dataset_test,
    )
    return trainer

I first tokenize my train and test sets and then initialize a pre-trained DistilBert Transformer model. Fine-tuning DistilBert with 300 training steps produced a good balance between accuracy and training time for my data. This classifier outputs predicted class probabilities, which I convert to class predictions before evaluating their accuracy.
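For completeness, a single training-and-evaluation pass with this helper could look roughly like the following (a minimal sketch, assuming train_set and test_set are already prepared; Hugging Face prefixes the compute_metrics outputs with "eval_"):

trainer = get_trainer(train_set, test_set)
trainer.train()

# Trainer.evaluate() returns the compute_metrics outputs, prefixed with "eval_".
eval_metrics = trainer.evaluate()
print("Test accuracy:", eval_metrics["eval_accuracy"])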

Here is the code we use to score each example via an active learning estimate of how informative collecting one more label for that example would be.

from cleanlab.multiannotator import get_active_learning_scores

# Get out-of-sample predicted class probabilities for the labeled and unlabeled data
# (get_pred_probs is a helper defined in the accompanying notebook).
pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)

# Compute active learning scores.
active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
    multiannotator_labels, pred_probs, pred_probs_unlabeled
)

# Get the indices of examples to collect more labels for
# (get_idx_to_label is a helper defined in the accompanying notebook).
chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label(
    X_labeled_full,
    X_unlabeled,
    extra_annotations,
    batch_size_to_label,
    active_learning_scores,
    active_learning_scores_unlabeled,
)

During each round of active learning, we fit our Transformer model via 3-fold cross-validation on the current training set. This allows us to get out-of-sample predicted class probabilities for each example in the training set, and we can also use the trained Transformer to get predicted class probabilities for each example in the unlabeled pool. All of this is internally implemented in the get_pred_probs helper method. Using out-of-sample predictions helps us avoid bias due to potential overfitting.
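The notebook contains the actual implementation; as an illustration of the cross-validation logic described above, a minimal sketch of get_pred_probs could look like this (my assumption, reusing the get_trainer and tokenize_data helpers from earlier; the real helper may differ in details such as how the unlabeled-pool predictions are aggregated):

import numpy as np
from scipy.special import softmax
from sklearn.model_selection import KFold

def get_pred_probs(train_set, unlabeled_set, n_folds=3, num_classes=2):
    """Out-of-sample predicted class probabilities via K-fold cross-validation (sketch)."""
    pred_probs = np.zeros((len(train_set), num_classes))
    pred_probs_unlabeled = np.zeros((len(unlabeled_set), num_classes))

    kf = KFold(n_splits=n_folds, shuffle=True)
    for fold_train_idx, fold_holdout_idx in kf.split(train_set):
        fold_train = train_set.iloc[fold_train_idx]
        fold_holdout = train_set.iloc[fold_holdout_idx]

        # Fine-tune a fresh model on this fold's training split.
        trainer = get_trainer(fold_train, fold_holdout)
        trainer.train()

        # Out-of-sample predictions for the held-out fold.
        holdout_logits = trainer.predict(tokenize_data(fold_holdout)).predictions
        pred_probs[fold_holdout_idx] = softmax(holdout_logits, axis=1)

        # Predictions for the unlabeled pool, averaged across the K folds.
        unlabeled_logits = trainer.predict(tokenize_data(unlabeled_set)).predictions
        pred_probs_unlabeled += softmax(unlabeled_logits, axis=1) / n_folds

    return pred_probs, pred_probs_unlabeled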

Once I have these probabilistic predictions, I pass them into the get_active_learning_scores method from the open-source cleanlab package, which implements the ActiveLab algorithm. This method provides scores for all of our labeled and unlabeled data. Lower scores indicate data points for which collecting one additional label should be most informative for our current model (scores are directly comparable between labeled and unlabeled data).

I form a batch of the examples with the lowest scores as the examples to collect an annotation for (via the get_idx_to_label method). Here I always collect exactly the same number of annotations in each round (under both the active learning and random selection approaches). For this application, I limit the maximum number of annotations per example to 5 (we don't want to keep spending effort labeling the same example over and over again).
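The actual get_idx_to_label lives in the notebook. A minimal sketch of the selection logic it has to implement, under the assumption that X_labeled_full holds one column per annotator (so counting non-missing entries per row gives the number of annotations an example already has), might look like this; the extra_annotations argument is kept only to mirror the call signature above:

import numpy as np

def get_idx_to_label(
    X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label,
    active_learning_scores, active_learning_scores_unlabeled,
    max_annotations_per_example=5,
):
    """Pick the batch_size_to_label examples with the lowest ActiveLab scores (sketch)."""
    # Exclude labeled examples that already have the maximum number of annotations.
    num_annotations = X_labeled_full.notna().sum(axis=1).values
    labeled_scores = np.where(
        num_annotations >= max_annotations_per_example, np.inf, active_learning_scores
    )

    # ActiveLab scores are directly comparable across the labeled and unlabeled pools,
    # so we can rank them jointly and take the lowest-scoring batch.
    all_scores = np.concatenate([labeled_scores, active_learning_scores_unlabeled])
    chosen = np.argsort(all_scores)[:batch_size_to_label]

    chosen_examples_labeled = chosen[chosen < len(labeled_scores)]
    chosen_examples_unlabeled = chosen[chosen >= len(labeled_scores)] - len(labeled_scores)
    return chosen_examples_labeled, chosen_examples_unlabeled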

Here is the code used to add the new annotations for the chosen examples to the current training dataset.

# Combine ids of labeled and unlabeled chosen examples.
chosen_example_ids = np.concatenate([
    X_labeled_full.iloc[chosen_examples_labeled].index.values,
    X_unlabeled.iloc[chosen_examples_unlabeled].index.values,
])

# Collect annotations for the chosen examples.
for example_id in chosen_example_ids:
    # Collect a new annotation and which annotator it came from
    # (get_annotation is a helper defined in the accompanying notebook).
    new_annotation, chosen_annotator = get_annotation(example_id)

    # A new annotator has been chosen: add a column for them.
    if chosen_annotator not in X_labeled_full.columns.values:
        empty_col = np.full((len(X_labeled_full),), np.nan)
        X_labeled_full[chosen_annotator] = empty_col

    # Add the chosen annotation to the training set.
    X_labeled_full.at[example_id, chosen_annotator] = new_annotation

The chosen_example_ids are the ids of the text examples we want to collect an annotation for. For each of these, we use the get_annotation helper method to collect a new annotation from an annotator. Here, we prioritize annotators who have already annotated another example. If none of the annotators for the given example exist in the training set, we randomly select one; in this case, we add a new column to our training set representing the new annotator. Finally, we add the newly collected annotation to the training set. If the corresponding example was previously unannotated, we also add it to the training set and remove it from the unlabeled collection.
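As a minimal sketch of that last bookkeeping step (my assumption, not code from the notebook), any chosen example that came from the unlabeled pool is dropped from it once its first annotation lands in X_labeled_full:

# Remove newly annotated examples from the unlabeled pool now that they are
# part of X_labeled_full.
newly_labeled_ids = X_unlabeled.index.intersection(chosen_example_ids)
X_unlabeled = X_unlabeled.drop(newly_labeled_ids)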

We have now completed one round of collecting new annotations and can retrain the Transformer model on the updated training set. We repeat this process over multiple rounds to keep growing the training dataset and improving our model.

I ran 25 rounds of active learning (labeling batches of data and retraining the Transformer model), collecting 25 annotations in each round. I then repeated all of this, this time using random selection to choose which examples to annotate in each round, as a baseline comparison. Before additional data are annotated, both approaches start with the same initial training set of 100 examples (and hence achieve roughly the same Transformer accuracy in the first round). Because of inherent stochasticity in training Transformers, I ran this whole process five times (for each data labeling strategy) and report the standard deviation (shaded region) and mean (solid line) of test accuracies across the five replicate runs.

ActiveLab outperforms random selection substantially when averaged over 5 runs. The standard deviation is shaded and the solid line is the mean.

We see that selecting which data to annotate next has drastic effects on model performance. Active learning using ActiveLab consistently outperforms random selection by a large margin at each round. For example, in round 4, with 275 total annotations in the training set, we obtain markedly higher accuracy than the 76% achieved without a clever selection strategy for what to annotate. Overall, the Transformer models fit to the dataset constructed via active learning reach a substantially lower error rate, regardless of the total labeling budget.

While active learning has its benefits, it may not always be the most helpful approach: for instance, when the data labeling process is inexpensive, or when there is a selection bias or distribution shift between the unlabeled dataset and the data the model will encounter during deployment. Moreover, active learning feedback loops depend on the classifier model's ability to generate predictions that are more informative than random. When this is not the case, active learning may not provide any significant signal about which data are most informative.

All images unless otherwise noted are by the author.
