In this blog post, we'll simulate a real-world customer support use case and use machine learning tools of the Hugging Face ecosystem to address it.
We strongly recommend using this notebook as a template/example to solve your real-world use case.
Defining Task, Dataset & Model
Before jumping into the actual coding part, it is important to have a clear definition of the use case that you want to automate or partly automate.
A clear definition of the use case helps identify the most suitable task, the dataset to use, and the model to apply to your use case.
Defining your NLP task
Alright, let's dive into a hypothetical problem we wish to solve using natural language processing models. Let's assume we are selling a product and our customer support team receives thousands of messages including feedback, complaints, and questions which ideally should all be answered.
It quickly becomes obvious that customer support is by no means able to reply to every message. Thus, we decide to only respond to the most unsatisfied customers and aim to answer 100% of those messages, as these are likely the most urgent compared to the other neutral and positive messages.
Assuming that a) messages of very unsatisfied customers represent only a fraction of all messages and b) we can filter out unsatisfied messages in an automated way, customer support should be able to reach this goal.
To filter out unsatisfied messages in an automated way, we plan on applying natural language processing technologies.
The first step is to map our use case – filtering out unsatisfied messages – to a machine learning task.
The tasks page on the Hugging Face Hub is a great place to get started to see which task best fits a given scenario. Each task has a detailed description and potential use cases.
The task of finding messages of the most unsatisfied customers can be modeled as a text classification task: classify a message into one of the following 5 categories: very unsatisfied, unsatisfied, neutral, satisfied, or very satisfied.
Finding suitable datasets
Having decided on the task, next, we should find the data the model will be trained on. This is usually more important for the performance of your use case than picking the right model architecture.
Keep in mind that a model is only as good as the data it has been trained on. Thus, we should be very careful when curating and/or selecting the dataset.
Since we consider the hypothetical use case of filtering out unsatisfied messages, let's look into what datasets are available.
For your real-world use case, it is very likely that you have internal data that best represents the actual data your NLP system is supposed to handle. Therefore, you should use such internal data to train your NLP system.
It can nevertheless be helpful to also include publicly available data to improve the generalizability of your model.
Let's take a look at all available datasets on the Hugging Face Hub. On the left side, you can filter the datasets by Task Categories as well as Tasks, which are more specific. Our use case corresponds to Text Classification -> Sentiment Analysis, so let's select these filters. We are left with ca. 80 datasets at the time of writing this notebook. Two aspects should be evaluated when picking a dataset:
- Quality: Is the dataset of high quality? More specifically: Does the data correspond to the data you expect to deal with in your use case? Is the data diverse, unbiased, …?
- Size: How big is the dataset? Usually, one can safely say the bigger the dataset, the better.
It's quite tricky to evaluate whether a dataset is of high quality efficiently, and it's even more challenging to know whether and how the dataset is biased.
An efficient and reasonable heuristic for high quality is to look at the download statistics. The more downloads, the more usage, the higher the chance that the dataset is of high quality. The size is easy to evaluate as it can usually be read off quickly. Let's take a look at the most downloaded datasets:
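If you prefer to query this programmatically, a minimal sketch using the huggingface_hub client could look as follows. The filter tag and the sort argument are assumptions based on recent versions of the library; adjust them if your version differs.

from huggingface_hub import list_datasets

# list the most downloaded sentiment-analysis datasets on the Hub
most_downloaded = list_datasets(
    filter="task_ids:sentiment-classification",  # assumed Hub tag for sentiment analysis
    sort="downloads",
    direction=-1,
    limit=5,
)
for dataset_info in most_downloaded:
    print(dataset_info.id, dataset_info.downloads)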
Now we can inspect those datasets in more detail by reading through the dataset card, which ideally should give all relevant and important information. In addition, the dataset viewer is an incredibly powerful tool to inspect whether the data suits your use case.
Let's quickly go over the dataset cards of the datasets above:
- GLUE is a collection of small datasets that primarily serve to benchmark new model architectures for researchers. The datasets are too small and do not correspond enough to our use case.
- Amazon polarity is a huge and well-suited dataset for customer feedback since the data deals with customer reviews. However, it only has binary labels (positive/negative), whereas we are looking for more granularity in the sentiment classification.
- Tweet eval uses different emojis as labels that cannot easily be mapped to a scale going from unsatisfied to satisfied.
- Amazon reviews multi seems to be the most suitable dataset here. We have sentiment labels ranging from 1 to 5, corresponding to 1 to 5 stars on Amazon. These labels can be mapped to very unsatisfied, unsatisfied, neutral, satisfied, and very satisfied. We have inspected some examples on the dataset viewer to verify that the reviews look very similar to actual customer feedback reviews, so this seems like a good dataset. In addition, each review has a product_category label, so we could even go as far as to only use reviews of a product category corresponding to the one we are working in (see the sketch after this list). The dataset is multi-lingual, but we are only interested in the English version for now.
- Yelp review full looks like a very suitable dataset. It's large and contains product reviews and sentiment labels from 1 to 5. Sadly, the dataset viewer is not working here, and the dataset card is also relatively sparse, requiring some more time to inspect the dataset. At this point, we should read the paper, but given the time constraint of this blog post, we'll choose to go for Amazon reviews multi.
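As referenced in the Amazon reviews multi bullet above, here is a small sketch of how one could restrict the English split to a single product category and map the 1-5 star ratings to our sentiment scale. Note that the category value "kitchen" is just an assumed example – check the dataset viewer for the actual category names.

from datasets import load_dataset

# the sentiment scale defined earlier, indexed by (stars - 1)
label_names = ["very unsatisfied", "unsatisfied", "neutral", "satisfied", "very satisfied"]

english_reviews = load_dataset("amazon_reviews_multi", "en", split="train")
# keep only one (assumed) product category
kitchen_reviews = english_reviews.filter(lambda example: example["product_category"] == "kitchen")

example = kitchen_reviews[0]
print(example["review_body"][:100], "->", label_names[example["stars"] - 1])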
In conclusion, let's focus on the Amazon reviews multi dataset, considering all training examples.
As a final note, we recommend making use of the Hub's dataset functionality even when working with private datasets. The Hugging Face Hub, Transformers, and Datasets are seamlessly integrated, which makes it trivial to use them together when training models.
In addition, the Hugging Face Hub offers built-in features such as versioning and a dataset viewer for every dataset.
Finding a suitable model
Having decided on the task and the dataset that best describes our use case, we can now look into choosing a model to be used.
Most likely, you will have to fine-tune a pretrained model for your own use case, but it is worth checking whether the Hub already has suitable fine-tuned models. In this case, you might reach higher performance by just continuing to fine-tune such a model on your dataset.
Let's take a look at all models that have been fine-tuned on Amazon Reviews Multi. You can find the list of models in the bottom right corner – clicking on Browse models trained on this dataset shows a list of all publicly available models fine-tuned on the dataset. Note that we are only interested in the English version of the dataset because our customer feedback will only be in English. Most of the most downloaded models are trained on the multi-lingual version of the dataset, and those that are not multi-lingual have very little information or poor performance. At this point, it might be more sensible to fine-tune a purely pretrained model instead of using one of the already fine-tuned ones shown in the link above.
Alright, the next step now is to find a suitable pretrained model to be used for fine-tuning. This is actually harder than it seems given the large number of pretrained and fine-tuned models on the Hugging Face Hub. The best option is usually to simply try out a variety of different models to see which one performs best.
We still haven't found the perfect way of comparing different model checkpoints to each other at Hugging Face, but we provide some resources that are worth looking into:
- The model summary gives a short overview of different model architectures.
- A task-specific search on the Hugging Face Hub, e.g. a search on text-classification models, shows you the most downloaded checkpoints, which is an indication of how well those checkpoints perform.
However, both of the above resources are currently suboptimal. The model summary is not always kept up to date by the authors. The speed at which new model architectures are released and old model architectures become outdated makes it extremely difficult to have an up-to-date summary of all model architectures.
Similarly, the most downloaded model checkpoint is not necessarily the best one. E.g. bert-base-cased is among the most downloaded model checkpoints but is not the best performing checkpoint anymore.
The best approach is to try out various model architectures, stay up to date with new model architectures by following experts in the field, and check well-known leaderboards.
For text-classification, the important benchmarks to look at are GLUE and SuperGLUE. Both benchmarks evaluate pretrained models on a variety of text-classification tasks, such as grammatical correctness, natural language inference, Yes/No question answering, etc., which are quite similar to our target task of sentiment analysis. Thus, it is reasonable to choose one of the leading models of these benchmarks for our task.
At the time of writing this blog post, the best performing models are very large models containing more than 10 billion parameters, most of which are not open-sourced, e.g. ST-MoE-32B, Turing NLR v5, or ERNIE 3.0. One of the top-ranking models that is easily accessible is DeBERTa. Therefore, let's try out DeBERTa's newest base version – i.e. microsoft/deberta-v3-base.
Training / Fine-tuning a model with 🤗 Transformers and 🤗 Datasets
In this section, we will jump into the technical details of how to
fine-tune a model end-to-end to be able to automatically filter out very unsatisfied customer feedback messages.
Cool! Let's start by installing all necessary pip packages and setting up our code environment, then look into preprocessing the dataset, and finally start training the model.
The following notebook can be run online in a Google Colab Pro with the GPU runtime environment enabled.
Install all needed packages
To begin with, let's install git-lfs so that we can automatically upload our trained checkpoints to the Hub during training.
apt install git-lfs
Also, we install the 🤗 Transformers and 🤗 Datasets libraries to run this notebook. Since we will be using DeBERTa in this blog post, we also need to install the sentencepiece library for its tokenizer.
pip install datasets transformers[sentencepiece]
Next, let's log in to our Hugging Face account so that models are uploaded correctly under your name tag.
from huggingface_hub import notebook_login
notebook_login()
Output:
Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default
git config --global credential.helper store
Preprocess the dataset
Before we can start training the model, we should bring the dataset into a format
that is understandable by the model.
Thankfully, the 🤗 Datasets library makes this extremely easy, as you will see in the following cells.
The load_dataset function loads the dataset, nicely arranges it into predefined attributes, such as review_body and stars, and finally saves the newly arranged data using the arrow format on disk.
The arrow format allows for fast and memory-efficient data reading and writing.
Let's load and prepare the English version of the amazon_reviews_multi dataset.
from datasets import load_dataset
amazon_review = load_dataset("amazon_reviews_multi", "en")
Output:
Downloading and preparing dataset amazon_reviews_multi/en (download: 82.11 MiB, generated: 58.69 MiB, post-processed: Unknown size, total: 140.79 MiB) to /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609...
Dataset amazon_reviews_multi downloaded and prepared to /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609. Subsequent calls will reuse this data.
Great, that was fast 🔥. Let's take a look at the structure of the dataset.
print(amazon_review)
Output:
DatasetDict({
train: Dataset({
features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
num_rows: 200000
})
validation: Dataset({
features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
num_rows: 5000
})
test: Dataset({
features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
num_rows: 5000
})
})
We have 200,000 training examples as well as 5,000 validation and test examples. This sounds reasonable for training! We are only really interested in the input being the "review_body" column and the target being the "stars" column.
Let’s take a look at a random example.
random_id = 34
print("Stars:", amazon_review["train"][random_id]["stars"])
print("Review:", amazon_review["train"][random_id]["review_body"])
Output:
Stars: 1
Review: This product caused severe burning of my skin. I have used other brands with no problems
The dataset is in a human-readable format, but now we need to transform it into a "machine-readable" format. Let's define the model repository which includes all utils needed to preprocess and fine-tune the checkpoint we decided on.
model_repository = "microsoft/deberta-v3-base"
Next, we load the tokenizer of the model repository, which is DeBERTa's tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_repository)
As mentioned before, we will use the "review_body" as the model's input and "stars" as the model's target. Next, we make use of the tokenizer to transform the input into a sequence of token ids that can be understood by the model. The tokenizer does exactly this and can also help you limit your input data to a certain length so as not to run into a memory issue. Here, we limit
the maximum length to 128 tokens, which in the case of DeBERTa corresponds to roughly 100 words, which in turn corresponds to ca. 5-7 sentences. Looking at the dataset viewer again, we can see that this covers pretty much all training examples.
Important: This doesn't mean that our model cannot handle longer input sequences, it just means that we use a maximum length of 128 for training since it covers 99% of our training data and we don't want to waste memory. Transformer models have been shown to be very good at generalizing to longer sequences after training.
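To double-check that claim on your own data, you can quickly measure how many reviews actually exceed 128 tokens. The following is an illustrative check over a sample of the training split, not part of the original training code.

# share of (sampled) training reviews that are longer than 128 tokens
sample_reviews = amazon_review["train"]["review_body"][:10_000]
lengths = [len(tokenizer(review)["input_ids"]) for review in sample_reviews]
share_too_long = sum(length > 128 for length in lengths) / len(lengths)
print(f"{share_too_long:.1%} of the sampled reviews exceed 128 tokens")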
If you want to learn more about tokenization in general, please have a look at the Tokenizers docs.
The labels are easy to transform as they already correspond to numbers in their raw form, i.e. the range from 1 to 5. Here we just shift the labels into the range 0 to 4 since indexes usually start at 0.
Great, let's pour our thoughts into some code. We will define a preprocess_function that we apply to each data sample.
def preprocess_function(example):
    # tokenize the review text and truncate to a maximum of 128 tokens
    output_dict = tokenizer(example["review_body"], max_length=128, truncation=True)
    # shift the 1-5 star labels into the 0-4 range expected by the model
    output_dict["labels"] = [e - 1 for e in example["stars"]]
    return output_dict
To apply this function to all data samples in our dataset, we use the map method of the amazon_review object we created earlier. This applies the function to all the elements of all the splits in amazon_review, so our training, validation, and testing data will be preprocessed in one single command. We run the mapping function in batched=True mode to speed up the process and also remove all columns since we don't need them anymore for training.
tokenized_datasets = amazon_review.map(preprocess_function, batched=True, remove_columns=amazon_review["train"].column_names)
Let's take a look at the new structure.
tokenized_datasets
Output:
DatasetDict({
train: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
num_rows: 200000
})
validation: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
num_rows: 5000
})
test: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
num_rows: 5000
})
})
We can see that the outer layer of the structure stayed the same, but the naming of the columns has changed.
Let's take a look at the same random example we looked at previously, only that it is preprocessed now.
print("Input IDS:", tokenized_datasets["train"][random_id]["input_ids"])
print("Labels:", tokenized_datasets["train"][random_id]["labels"])
Output:
Input IDS: [1, 329, 714, 2044, 3567, 5127, 265, 312, 1158, 260, 273, 286, 427, 340, 3006, 275, 363, 947, 2]
Labels: 0
Alright, the input text is transformed into a sequence of integers which can be transformed into word embeddings by the model, and the label index is simply shifted by -1.
Fine-tune the model
Having preprocessed the dataset, next we can fine-tune the model. We will make use of the popular Hugging Face Trainer which allows us to start training in just a couple of lines of code. The Trainer can be used for more or less all tasks in PyTorch and is extremely convenient as it takes care of a lot of boilerplate code needed for training.
Let's start by loading the model checkpoint using the convenient AutoModelForSequenceClassification. Since the checkpoint of the model repository is just a pretrained checkpoint, we should define the size of the classification head by passing num_labels=5 (since we have 5 sentiment classes).
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_repository, num_labels=5)
Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.classifier.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.weight', 'mask_predictions.dense.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.classifier.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['pooler.dense.bias', 'classifier.weight', 'classifier.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Next, we load a data collator. A data collator is responsible for making sure that each batch is correctly padded during training, which should happen dynamically since training samples are reshuffled before each epoch.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
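To get a feeling for what the collator does, you can pad a small batch by hand – an illustrative check, not required for training.

# the collator pads a handful of tokenized examples to the length of the longest
# one in the batch and returns PyTorch tensors
features = [tokenized_datasets["train"][i] for i in range(4)]
batch = data_collator(features)
print(batch["input_ids"].shape)  # e.g. torch.Size([4, <longest sequence in this batch>])
print(batch["labels"])           # the star ratings shifted to the 0-4 range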
During training, it is important to monitor the performance of the model on a held-out validation set. To do so, we should pass a compute_metrics function to the Trainer, which is then called at each validation step during training.
The simplest metric for the text classification task is accuracy, which simply states what percentage of the samples were correctly classified. Using the accuracy metric can be problematic, however, if the validation or test data is very unbalanced. Let's verify quickly that this is not the case by counting the occurrences of each label.
from collections import Counter
print("Validation:", Counter(tokenized_datasets["validation"]["labels"]))
print("Test:", Counter(tokenized_datasets["test"]["labels"]))
Output:
Validation: Counter({0: 1000, 1: 1000, 2: 1000, 3: 1000, 4: 1000})
Test: Counter({0: 1000, 1: 1000, 2: 1000, 3: 1000, 4: 1000})
The validation and test data sets are as balanced as they can be, so we can safely use accuracy here!
Let’s load the accuracy metric via the datasets library.
from datasets import load_metric
accuracy = load_metric("accuracy")
Next, we define the compute_metrics function, which is applied to the predicted outputs of the model. Those outputs are of type EvalPrediction and therefore expose the model's predictions and the gold labels.
We compute the predicted label class by taking the argmax of the model's predictions before passing it alongside the gold labels to the accuracy metric.
import numpy as np
def compute_metrics(pred):
    pred_logits = pred.predictions
    # the predicted class is the index of the highest logit
    pred_classes = np.argmax(pred_logits, axis=-1)
    labels = np.asarray(pred.label_ids)
    acc = accuracy.compute(predictions=pred_classes, references=labels)
    return {"accuracy": acc["accuracy"]}
Great, now all components required for training are ready and all that's left to do is to define the hyper-parameters of the Trainer. We need to make sure that the model checkpoints are uploaded to the Hugging Face Hub during training. By setting push_to_hub=True, this is done automatically at every save_steps via the convenient push_to_hub method.
Besides that, we define some standard hyper-parameters such as learning rate, warm-up steps and training epochs. We will log the loss every 500 steps and run evaluation every 5000 steps.
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="deberta_amazon_reviews_v1",
    num_train_epochs=2,
    learning_rate=2e-5,
    warmup_steps=200,
    logging_steps=500,
    save_steps=5000,
    eval_steps=5000,
    push_to_hub=True,
    evaluation_strategy="steps",
)
Putting it all together, we can finally instantiate the Trainer by passing all required components. We'll use the "validation" split as the held-out dataset during training.
from transformers import Trainer
trainer = Trainer(
    args=training_args,
    compute_metrics=compute_metrics,
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)
The trainer is ready to go 🚀 You can start training by calling trainer.train().
train_metrics = trainer.train().metrics
trainer.save_metrics("train", train_metrics)
Output:
***** Running training *****
Num examples = 200000
Num Epochs = 2
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 50000
Output:
| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 5000 | 0.931200 | 0.979602 | 0.585600 |
| 10000 | 0.931600 | 0.933607 | 0.597400 |
| 15000 | 0.907600 | 0.917062 | 0.602600 |
| 20000 | 0.902400 | 0.919414 | 0.604600 |
| 25000 | 0.879400 | 0.910928 | 0.608400 |
| 30000 | 0.806700 | 0.933923 | 0.609200 |
| 35000 | 0.826800 | 0.907260 | 0.616200 |
| 40000 | 0.820500 | 0.904160 | 0.615800 |
| 45000 | 0.795000 | 0.918947 | 0.616800 |
| 50000 | 0.783600 | 0.907572 | 0.618400 |
Output:
***** Running Evaluation *****
Num examples = 5000
Batch size = 8
Saving model checkpoint to deberta_amazon_reviews_v1/checkpoint-50000
Configuration saved in deberta_amazon_reviews_v1/checkpoint-50000/config.json
Model weights saved in deberta_amazon_reviews_v1/checkpoint-50000/pytorch_model.bin
tokenizer config file saved in deberta_amazon_reviews_v1/checkpoint-50000/tokenizer_config.json
Special tokens file saved in deberta_amazon_reviews_v1/checkpoint-50000/special_tokens_map.json
added tokens file saved in deberta_amazon_reviews_v1/checkpoint-50000/added_tokens.json
Training completed. Do not forget to share your model on huggingface.co/models =)
Cool, we see that the model seems to learn something! Training loss and validation loss are going down and the accuracy also ends up being well over random chance (20%). Interestingly, we see an accuracy of around 58.6% after only 5000 steps which doesn't improve much afterward. Choosing a bigger model or training for longer would have probably given better results here, but that's good enough for our hypothetical use case!
Alright, finally let’s upload the model checkpoint to the Hub.
trainer.push_to_hub()
Output:
Saving model checkpoint to deberta_amazon_reviews_v1
Configuration saved in deberta_amazon_reviews_v1/config.json
Model weights saved in deberta_amazon_reviews_v1/pytorch_model.bin
tokenizer config file saved in deberta_amazon_reviews_v1/tokenizer_config.json
Special tokens file saved in deberta_amazon_reviews_v1/special_tokens_map.json
added tokens file saved in deberta_amazon_reviews_v1/added_tokens.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.
Evaluate / Analyse the model
Now that we have fine-tuned the model, we must be very careful about analyzing its performance.
Note that canonical metrics, such as accuracy, are useful to get a general picture
of your model's performance, but they may not be enough to evaluate how well the model performs on your actual use case.
The better approach is to find a metric that best describes the actual use case of the model and measure exactly this metric during and after training.
Let’s dive into evaluating the model 🤿.
The model has been uploaded to the Hub under deberta_v3_amazon_reviews after training, so in a first step, let's download it from there again.
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("patrickvonplaten/deberta_v3_amazon_reviews")
The Trainer is not only an excellent class to train a model, but also to evaluate a model on a dataset. Let's instantiate the trainer with the same instances and functions as before, but this time there is no need to pass a training dataset.
trainer = Trainer(
    args=training_args,
    compute_metrics=compute_metrics,
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
We use the Trainer's predict function to evaluate the model on the test dataset with the same metric.
prediction_metrics = trainer.predict(tokenized_datasets["test"]).metrics
prediction_metrics
Output:
***** Running Prediction *****
Num examples = 5000
Batch size = 8
Output:
{'test_accuracy': 0.608,
'test_loss': 0.9637690186500549,
'test_runtime': 21.9574,
'test_samples_per_second': 227.714,
'test_steps_per_second': 28.464}
The results are very similar to the performance on the validation dataset, which is usually a good sign as it shows that the model didn't overfit the test dataset.
However, 60% accuracy is far from being perfect on a 5-class classification problem, but do we need very high accuracy for all classes?
Since we are mostly concerned with very negative customer feedback, let's just focus on how well the model performs at classifying the reviews of the most unsatisfied customers. We also decide to help the model a bit – all feedback classified as either very unsatisfied or unsatisfied will be handled by us – to catch close to 99% of the very unsatisfied messages. At the same time, we also measure how many unsatisfied messages we can answer this way and how much unnecessary work we do by answering messages of neutral, satisfied, and very satisfied customers.
Great, let's write a new compute_metrics function.
import numpy as np
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_classes = np.argmax(pred_logits, axis=-1)
    labels = np.asarray(pred.label_ids)

    # look only at reviews whose gold label is "very unsatisfied" (class 0)
    very_unsatisfied_label_idx = (labels == 0)
    very_unsatisfied_pred = pred_classes[very_unsatisfied_label_idx]

    # both class 0 and class 1 predictions count as "caught": x * (x - 1) is 0 exactly for x in {0, 1}
    very_unsatisfied_pred = very_unsatisfied_pred * (very_unsatisfied_pred - 1)

    # share of very unsatisfied messages that would be replied to
    true_positives = sum(very_unsatisfied_pred == 0) / len(very_unsatisfied_pred)

    # share of neutral/satisfied/very satisfied messages (classes 2-4) incorrectly flagged
    satisfied_label_idx = (labels > 1)
    satisfied_pred = pred_classes[satisfied_label_idx]
    false_positives = sum(satisfied_pred <= 1) / len(satisfied_pred)

    return {"%_unsatisfied_replied": round(true_positives, 2), "%_satisfied_incorrectly_labels": round(false_positives, 2)}
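Before plugging the new metric into the Trainer, a tiny sanity check on toy predictions can help verify the logic – a purely illustrative example using hypothetical logits.

from transformers import EvalPrediction

# two "very unsatisfied" gold labels: one predicted as class 0, one as class 1 (both count
# as replied-to), and one satisfied review (gold label 3) incorrectly flagged as class 1
toy_logits = np.array([
    [5.0, 0.0, 0.0, 0.0, 0.0],  # predicted class 0
    [0.0, 5.0, 0.0, 0.0, 0.0],  # predicted class 1
    [0.0, 5.0, 0.0, 0.0, 0.0],  # predicted class 1
])
toy_labels = np.array([0, 0, 3])
print(compute_metrics(EvalPrediction(predictions=toy_logits, label_ids=toy_labels)))
# -> both ratios come out as 1.0 for this toy batch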
We again instantiate the Trainer to simply run the evaluation.
trainer = Trainer(
    args=training_args,
    compute_metrics=compute_metrics,
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
And let's run the evaluation again with our new metric computation, which is better suited for our use case.
prediction_metrics = trainer.predict(tokenized_datasets["test"]).metrics
prediction_metrics
Output:
***** Running Prediction *****
Num examples = 5000
Batch size = 8
Output:
{'test_%_satisfied_incorrectly_labels': 0.11733333333333333,
'test_%_unsatisfied_replied': 0.949,
'test_loss': 0.9637690186500549,
'test_runtime': 22.8964,
'test_samples_per_second': 218.375,
'test_steps_per_second': 27.297}
Cool! This already paints a pretty nice picture. We automatically catch around 95% of very unsatisfied customers at the cost of wasting our efforts on ca. 12% of satisfied messages.
Let's do some quick math. We receive around 10,000 messages per day, of which we expect ca. 500 to be very negative. Instead of having to reply to all 10,000 messages, using this automatic filtering we would only need to look into 500 + 0.12 * 10,000 = 1,700 messages and would reply to 475 of them, while incorrectly missing 5% of the very unsatisfied messages. Pretty nice – an 83% reduction in human effort while missing only 5% of very unsatisfied customers!
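The same back-of-the-envelope calculation in code, with purely illustrative numbers mirroring the estimate above:

daily_messages = 10_000
very_negative = 500            # expected very unsatisfied messages per day
caught_rate = 0.95             # share of very unsatisfied messages the model flags
false_positive_rate = 0.12     # share of other messages flagged unnecessarily

to_review = very_negative + false_positive_rate * daily_messages  # ~1,700 messages to look at
replied = caught_rate * very_negative                             # ~475 messages actually answered
reduction = 1 - to_review / daily_messages                        # ~83% less to read
print(round(to_review), round(replied), f"{reduction:.0%}")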
Obviously, the numbers don't represent the gained value of an actual use case, but we could come close to it with enough high-quality training data from your real-world example!
Let's save the results
trainer.save_metrics("prediction", prediction_metrics)
and again upload everything to the Hub.
trainer.push_to_hub()
Output:
Saving model checkpoint to deberta_amazon_reviews_v1
Configuration saved in deberta_amazon_reviews_v1/config.json
Model weights saved in deberta_amazon_reviews_v1/pytorch_model.bin
tokenizer config file saved in deberta_amazon_reviews_v1/tokenizer_config.json
Special tokens file saved in deberta_amazon_reviews_v1/special_tokens_map.json
added tokens file saved in deberta_amazon_reviews_v1/added_tokens.json
To https://huggingface.co/patrickvonplaten/deberta_amazon_reviews_v1
599b891..ad77e6d main -> main
Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Text Classification', 'type': 'text-classification'}}
To https://huggingface.co/patrickvonplaten/deberta_amazon_reviews_v1
ad77e6d..13e5ddd main -> main
The data is now saved here.
That's it for today 😎. As a final step, it would also make a lot of sense to try the model out on actual real-world data. This can be done directly with the inference widget on the model card:
It does seem to generalize quite well to real-world data 🔥
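If you prefer code over the widget, a quick check with the pipeline API might look as follows – a sketch that loads the same repository we downloaded earlier; since we did not define id2label, the classes show up as LABEL_0 to LABEL_4.

from transformers import pipeline

classifier = pipeline("text-classification", model="patrickvonplaten/deberta_v3_amazon_reviews")
print(classifier("The product broke after two days and nobody answered my emails."))
# e.g. [{'label': 'LABEL_0', 'score': ...}] – LABEL_0 corresponds to "very unsatisfied"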
Optimization
As soon as you think the model's performance is good enough for production, it's all about making the model as memory-efficient and fast as possible.
There are some obvious solutions to this like choosing the best-suited accelerated hardware, e.g. better GPUs, making sure no gradients are computed during the forward pass, or lowering the precision, e.g. to float16.
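As a rough illustration of the last two points, gradient-free inference in half precision could look like this – a sketch that assumes a CUDA GPU is available.

import torch

# half-precision, gradient-free inference on GPU (illustrative)
model = model.to("cuda").half().eval()
inputs = tokenizer("The delivery was late and the item arrived damaged.", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # 0 would mean "very unsatisfied"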
More advanced optimization methods include using open-source accelerator libraries such as ONNX Runtime, quantization, and inference servers like Triton.
At Hugging Face, we have been working a lot on facilitating the optimization of models, especially with our open-source Optimum library. Optimum makes it extremely simple to optimize most 🤗 Transformers models.
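For example, exporting the classifier to ONNX Runtime with Optimum might look roughly like this – a sketch that assumes optimum[onnxruntime] is installed; check the Optimum documentation for the exact API of your version.

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "patrickvonplaten/deberta_v3_amazon_reviews"
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
onnx_tokenizer = AutoTokenizer.from_pretrained(model_id)

onnx_classifier = pipeline("text-classification", model=ort_model, tokenizer=onnx_tokenizer)
print(onnx_classifier("Great quality, arrived on time!"))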
If you're looking for highly optimized solutions which don't require any technical knowledge, you might be interested in the Inference API, a plug & play solution to serve in production a wide variety of machine learning tasks, including sentiment analysis.
Moreover, if you are searching for support for your custom use cases, Hugging Face's team of experts can help accelerate your ML projects! Our team answers questions and finds solutions as needed on your machine learning journey from research to production. Visit hf.co/support to learn more and request a quote.

