Active Learning with AutoNLP and Prodigy



Abhishek Thakur


Active learning in the context of machine learning is a process in which you iteratively add labeled data, retrain a model and serve it to the end user. It is an endless process and requires human interaction for labeling/creating the data. In this article, we will discuss how to use AutoNLP and Prodigy to build an active learning pipeline.



AutoNLP

AutoNLP is a framework created by Hugging Face that helps you build your own state-of-the-art deep learning models on your own dataset with almost no coding at all. AutoNLP is built on the giant shoulders of Hugging Face's transformers, datasets, inference-api and many other tools.

With AutoNLP, you can train SOTA transformer models on your own custom dataset, fine-tune them (automatically) and serve them to the end user. All models trained with AutoNLP are state-of-the-art and production-ready.

At the time of writing this article, AutoNLP supports tasks like binary classification, regression, multi-class classification, token classification (such as named entity recognition or part of speech tagging), question answering, summarization and more. You can find a list of all the supported tasks here. AutoNLP supports languages like English, French, German, Spanish, Hindi, Dutch, Swedish and many more. There is also support for custom models with custom tokenizers (in case your language is not supported by AutoNLP).



Prodigy

Prodigy is an annotation tool developed by Explosion (the makers of spaCy). It is a web-based tool that allows you to annotate your data in real time. Prodigy supports NLP tasks such as named entity recognition (NER) and text classification, but it is not limited to NLP! It supports computer vision tasks and even creating your own tasks! You can try the Prodigy demo here.

Note that Prodigy is a commercial tool. You can find out more about it here.

We chose Prodigy because it is one of the most popular tools for labeling data and is infinitely customizable. It is also very easy to set up and use.



Dataset

Now begins the most interesting part of this article. After looking at many datasets and different types of problems, we stumbled upon the BBC News Classification dataset on Kaggle. This dataset was used in an in-class competition and can be accessed here.

Let's take a look at this dataset:

As we can see, this is a classification dataset. There is a Text column, which is the text of the news article, and a Category column, which is the category of the article. Overall, there are 5 different classes: business, entertainment, politics, sport & tech.
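If you want to poke at the data locally first, a few lines of pandas are enough. This is just a quick inspection sketch, assuming the CSV downloaded from Kaggle with the Text and Category columns described above:

import pandas as pd

# Load the Kaggle training CSV and check the first rows and the class balance
df = pd.read_csv("BBC_News_Train.csv")
print(df.head())
print(df["Category"].value_counts())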

Training a multi-class classification model on this dataset using AutoNLP is a piece of cake.

Step 1: Download the dataset.

Step 2: Open AutoNLP and create a new project.

Step 3: Upload the training dataset and select auto-splitting.

Step 4: Accept the pricing and train your models.

Please note that in the above example, we are training 15 different multi-class classification models. AutoNLP pricing can be as low as $10 per model. AutoNLP will select the best models and do hyperparameter tuning for you on its own. So now all we need to do is sit back, relax and wait for the results.

After around 15 minutes, all models finished training and the results were ready. It seems like the best model scored 98.67% accuracy!

So, we are now able to classify the articles in the dataset with an accuracy of 98.67%! But wait, we were talking about active learning and Prodigy. What happened to those? 🤔 We did use Prodigy, as we will see soon. We used it to label this dataset for the named entity recognition task. Before starting the labeling part, we thought it would be cool to have a project in which we are not only able to detect the entities in news articles but also categorize them. That is why we built this classification model on existing labels.
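As a side note: once a trained AutoNLP model lives on the Hugging Face Hub, querying it from Python takes only a few lines. Here is a minimal sketch using the transformers pipeline API; the model ID below is a hypothetical placeholder for whatever ID AutoNLP assigns to your project:

from transformers import pipeline

# Hypothetical model ID -- substitute the ID of your own trained AutoNLP model
classifier = pipeline("text-classification", model="your-username/autonlp-bbc-news-12345")

# Classify an unseen headline into one of the five categories
print(classifier("The championship final drew a record crowd on Saturday."))
# e.g. [{'label': 'sport', 'score': 0.99}]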



Active Learning

The dataset we used did have categories, but it did not have labels for entity recognition. So, we decided to use Prodigy to label the dataset for another task: named entity recognition.

Once you have Prodigy installed, you can simply run:

$ prodigy ner.manual bbc blank:en BBC_News_Train.csv --label PERSON,ORG,PRODUCT,LOCATION

Let's look at the different values:

  • bbc is the dataset that will be created by Prodigy.
  • blank:en is the spaCy tokenizer being used.
  • BBC_News_Train.csv is the dataset that will be used for labeling.
  • PERSON,ORG,PRODUCT,LOCATION is the list of labels that will be used for labeling.
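Each annotation you accept ends up in the bbc dataset as a JSON record. As a simplified, made-up illustration (the start/end/label fields are exactly what the conversion script later in this post reads; real records carry extra metadata such as token offsets and the accept/reject answer):

{
    "text": "Apple unveiled the iPhone in California.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 19, "end": 25, "label": "PRODUCT"},
        {"start": 29, "end": 39, "label": "LOCATION"}
    ]
}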

Once you run the above command, you can go to the Prodigy web interface (usually at localhost:8080) and start labelling the dataset. The Prodigy interface is very simple, intuitive and easy to use. It looks like the following:

All you have to do is select which entity you want to label (PERSON, ORG, PRODUCT, LOCATION) and then select the text that belongs to the entity. Once you are done with one document, you can click on the green button and Prodigy will automatically provide you with the next unlabelled document.

[Animation: labeling named entities in the Prodigy web interface]

Using Prodigy, we started labelling the dataset. When we had around 20 samples, we trained a model using AutoNLP. Prodigy doesn't export the data in AutoNLP format, so we wrote a quick and dirty script to convert the data into the AutoNLP format:

import json
import spacy

from prodigy.components.db import connect

# Fetch the annotations we created in Prodigy's "bbc" dataset
db = connect()
prodigy_annotations = db.get_dataset("bbc")
examples = ((eg["text"], eg) for eg in prodigy_annotations)
nlp = spacy.blank("en")

dataset = []

for doc, eg in nlp.pipe(examples, as_tuples=True):
    try:
        # Turn Prodigy's character-offset spans into spaCy entity spans
        doc.ents = [doc.char_span(s["start"], s["end"], s["label"]) for s in eg["spans"]]
        # Build IOB tags such as "B-PERSON" or "I-ORG"; tokens outside any
        # entity come out as "O-", which the strip below reduces to "O"
        iob_tags = [f"{t.ent_iob_}-{t.ent_type_}" if t.ent_iob_ else "O" for t in doc]
        iob_tags = [t.strip("-") for t in iob_tags]
        tokens = [str(t) for t in doc]
        temp_data = {
            "tokens": tokens,
            "tags": iob_tags
        }
        dataset.append(temp_data)
    except Exception:
        # Skip examples whose spans do not align with token boundaries
        # (char_span returns None in that case)
        pass

with open('data.jsonl', 'w') as outfile:
    for entry in dataset:
        json.dump(entry, outfile)
        outfile.write('\n')  # one JSON object per line (JSONL)
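Each line of the resulting data.jsonl is a single JSON object with parallel tokens and tags arrays in IOB format. A made-up example line:

{"tokens": ["Nokia", "shares", "fell", "on", "Monday"], "tags": ["B-ORG", "O", "O", "O", "O"]}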

This JSONL file can be used for training a model using AutoNLP. The steps will be the same as before, except we will select the Token Classification task when creating the AutoNLP project. Using the initial data we had, we trained a model using AutoNLP. The best model had an accuracy of around 86% with 0 precision and recall. We knew the model didn't learn anything, which is hardly surprising: we had only around 20 samples.

After labelling around 70 samples, we started getting some results. The accuracy went up to 92%, precision was 0.52 and recall was around 0.42. We were getting some results, but still not satisfactory. In the following image, we can see how this model performs on an unseen sample.

As you can see, the model is struggling. But it is much better than before! Previously, the model was not even able to predict anything in the same text. At least now, it is able to figure out that Bruce and David are names.

So we continued. We labelled a few more samples.

Please note that in each iteration, our dataset is getting bigger. All we are doing is uploading the new dataset to AutoNLP and letting it do the rest.
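To make the iteration structure explicit, here is a pseudocode-style sketch of the loop we are running. All three helper functions are hypothetical stand-ins, not a real API: in reality the labeling happens in Prodigy's browser UI and the training in AutoNLP's web UI.

def label_batch_with_prodigy(n_samples):
    """Hypothetical: annotate n_samples more documents in the Prodigy UI."""
    raise NotImplementedError

def export_to_jsonl(dataset, path):
    """Hypothetical: the Prodigy-to-AutoNLP conversion script shown above."""
    raise NotImplementedError

def train_with_autonlp(path):
    """Hypothetical: upload the file to AutoNLP, train, return best metrics."""
    raise NotImplementedError

def active_learning_loop(target_precision=0.7, target_recall=0.75):
    dataset = []
    while True:
        dataset += label_batch_with_prodigy(n_samples=50)  # grow the dataset
        export_to_jsonl(dataset, "data.jsonl")             # convert the format
        metrics = train_with_autonlp("data.jsonl")         # retrain on all data
        if (metrics["precision"] >= target_precision
                and metrics["recall"] >= target_recall):
            return metrics                                 # good enough: stop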

After labelling around 150 samples, we started getting some good results. The accuracy went up to 95.7%, precision was 0.64 and recall was around 0.76.

Let's take a look at how this model performs on the same unseen sample.

WOW! This is amazing! As you can see, the model is now performing extremely well! It is able to detect many entities in the same text. The precision and recall were still a bit low, and thus we continued labeling even more data. After labeling around 250 samples, we had the best results in terms of precision and recall. The accuracy went up to ~95.9%, and precision and recall were 0.73 and 0.79 respectively. At this point, we decided to stop labelling and end the experimentation process. The following graph shows how the accuracy of the best model improved as we added more samples to the dataset:

Well, it is a well-known fact that more relevant data leads to better models and thus better results. With this experimentation, we successfully created a model that can not only classify the entities in news articles but also categorize them. Using tools like Prodigy and AutoNLP, we invested our time and effort only in labeling the dataset (and even that was made simpler by the interface Prodigy offers). AutoNLP saved us a lot of time and effort: we didn't have to figure out which models to use, how to train them, how to evaluate them, how to tune the parameters, which optimizer and scheduler to use, pre-processing, post-processing and so on. We just needed to label the dataset and let AutoNLP do everything else.

We believe that with tools like AutoNLP and Prodigy it is incredibly easy to create data and state-of-the-art models. And since the whole process requires almost no coding at all, even someone without a coding background can create datasets which are generally not available to the public, train their own models using AutoNLP and share them with everyone else in the community (or just use them for their own research or business).

We have open-sourced the best model created using this process. You can try it here. The labelled dataset can also be downloaded here.
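If you want to query the NER model from Python, a minimal inference sketch with the transformers pipeline looks like the following; the model ID below is a placeholder, so substitute the actual ID from the link above:

from transformers import pipeline

# Placeholder model ID -- replace it with the open-sourced model's actual ID
ner = pipeline(
    "token-classification",
    model="your-username/autonlp-bbc-ner-12345",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

print(ner("Bruce met David at the BBC studios in London."))
# e.g. [{'entity_group': 'PERSON', 'word': 'Bruce', ...},
#       {'entity_group': 'ORG', 'word': 'BBC', ...}, ...]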

Models are only state-of-the-art because of the data they are trained on.


