Text Classification Challenge with Extra-Small Datasets: Fine-Tuning Versus ChatGPT
The dataset
Regular fine-tuning with RoBERTa
Few-shot with ChatGPT
Fine-tuning a GPT-3 model
Conclusion
Sources

LLMs excel on extra-small datasets, but classical approaches shine as datasets grow

Photo by Debby Hudson on Unsplash

The Toloka ML team continually researches and compares different approaches to text classification under various conditions. Here we present another of our experiments, this time on the performance of NLP models trained on extra-small datasets.

Previously, we provided a brief overview of potential solutions and compared classical models with large language models (LLMs) for a selected text classification task. However, those comparisons were based on a “regular” dataset that contained enough data points to construct a reliable classifier. In real-world scenarios, you may encounter situations where limited data is available or human labeling hasn’t been carried out.

Intuitively, LLMs such as GPT-3 or ChatGPT might outperform smaller models due to their extensive “knowledge”. To investigate this hypothesis, we created an artificially small dataset by extracting a portion of a larger one and compared several approaches: we fine-tuned the RoBERTa base model, employed ChatGPT for few-shot classification, and fine-tuned the GPT-3 Babbage model.

To evaluate the comprehension capabilities of various models, we selected a multiclass dataset consisting of scientific article abstracts. The task was to determine each article’s domain.

We opted for the WOS-11967 [1] dataset, which contains 11,967 documents with 35 categories that fall under seven parent categories: medical, psychology, computer science, biochemistry, electrical engineering, civil sciences, and mechanical engineering. We sampled 10,000 data points and focused solely on the parent categories for our evaluation.

While the dataset was not perfectly balanced, the category distribution was reasonably proportional, so satisfactory results could potentially be achieved across all classes. The category distribution is illustrated below.

The category distribution of the sample of the WOS-11967 dataset

Upon manual evaluation, we found that determining the domain of some abstracts was relatively straightforward, while in other cases the task became more challenging. For instance, computer science articles may discuss mathematical topics, and psychology articles might contain medical or biochemical terms and abbreviations, making them difficult to distinguish from the biochemistry or medical domains. The abstracts also varied significantly in length, with a mean of 274 tokens (ChatGPT tokens) and a standard deviation of 115 tokens.
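Token lengths like these can be measured with the tiktoken library. The sketch below is a minimal example under that assumption: the abstracts list is a placeholder for the real corpus, and the encoding choice reflects the ChatGPT-family tokenizer.

```python
import statistics

import tiktoken

# "cl100k_base" is the tokenizer used by the gpt-3.5-turbo (ChatGPT) family of models.
encoding = tiktoken.get_encoding("cl100k_base")

abstracts = [
    "First abstract text ...",   # placeholders for the sampled WOS-11967 abstracts
    "Second abstract text ...",
]
lengths = [len(encoding.encode(text)) for text in abstracts]

print(f"mean length: {statistics.mean(lengths):.0f} tokens")
print(f"std dev:     {statistics.stdev(lengths):.0f} tokens")
```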

To simulate scenarios involving extra-small datasets, we performed a train-test split on the corpora and allocated a small number of samples to the training set. We repeated this process three times with different training set sizes to evaluate how the models’ performance changes with the amount of available training data. We created three splits for our experiment: WOS-11967-s200 (200 samples in the training set, 9,800 samples in the test set), WOS-11967-s500 (500 / 9,500), and WOS-11967-s2000 (2,000 / 8,000).
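A minimal sketch of how such fixed-size splits can be produced with scikit-learn is shown below; the stratification and the random seed are our assumptions rather than details from the experiment.

```python
from sklearn.model_selection import train_test_split

def make_split(texts, labels, train_size, seed=42):
    """Split the sampled corpus so that exactly `train_size` examples land in the training set."""
    return train_test_split(
        texts,
        labels,
        train_size=train_size,
        stratify=labels,   # assumption: keep class proportions comparable across splits
        random_state=seed,
    )

# Usage: `texts` and `labels` would hold the 10,000 sampled abstracts and parent categories.
# for n in (200, 500, 2000):
#     X_train, X_test, y_train, y_test = make_split(texts, labels, n)
```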

Now, let’s take a look at the results obtained using different models to tackle these problems.

For our baseline, we selected the RoBERTa base model [2] and fine-tuned it on the three datasets mentioned earlier. We used the same hyperparameter configuration for each run (a batch size of 32, a learning rate of 3e-5, a linear scheduler with warmup, and a 256-token window), along with early stopping to prevent overfitting.
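A minimal sketch of this setup with Hugging Face Transformers is shown below. The warmup ratio, epoch count, and early-stopping patience are assumptions, and the `train_dataset` / `eval_dataset` arguments are assumed to be `datasets.Dataset` objects (with “text” and “label” columns) built from the splits above.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

def finetune_roberta(train_dataset, eval_dataset, num_labels=7):
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=num_labels)

    def tokenize(batch):
        # 256-token window, as described above
        return tokenizer(batch["text"], truncation=True, max_length=256, padding="max_length")

    args = TrainingArguments(
        output_dir="roberta-wos",
        per_device_train_batch_size=32,   # batch size of 32
        learning_rate=3e-5,               # learning rate of 3e-5
        lr_scheduler_type="linear",       # linear scheduler ...
        warmup_ratio=0.1,                 # ... with warmup (exact value is an assumption)
        num_train_epochs=20,              # assumption: early stopping decides the real length
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,      # required for early stopping
        metric_for_best_model="eval_loss",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset.map(tokenize, batched=True),
        eval_dataset=eval_dataset.map(tokenize, batched=True),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
    )
    trainer.train()
    return trainer
```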

We obtained the following results:

The data shows that 200 samples are insufficient for extracting all the patterns and information required to accurately classify the abstracts. The lower macro-average F1 score also indicates that the model underperforms on under-represented classes such as mechanical engineering: having only a few samples from a given class is not enough.
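For completeness, here is a minimal sketch of how per-class metrics and the macro-averaged F1 score can be computed from the test-set predictions with scikit-learn.

```python
from sklearn.metrics import classification_report, f1_score

def evaluate(y_true, y_pred, labels):
    """Print per-class metrics and the macro-averaged F1 score."""
    print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
    print("macro F1:", f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0))
```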

As expected, the model’s performance improved as the amount of available data increased, ultimately leading to robust performance for multiclass classification across the seven classes.

The second approach we explored was few-shot classification using ChatGPT. This method differs significantly from traditional classification because it doesn’t involve training a model per se. Instead, we engineered the input prompt to achieve optimal performance.

However, it was impossible to feed all 200 samples into the model due to its 4,096-token context limit. Given the length measurements above, we could only present around 14 abstracts to the model, and that number was further reduced once the tokens used for instructions and delimiters were taken into account.

Initially, we employed the “system” role for instructions and provided a single example per class to guide the model’s response. We simplified the category names to single tokens while retaining their meaning, which made it easier for the model to select the appropriate category and allowed us to limit the output to a single token. For instance, “Biochemistry” became “Bio” and “Computer Science” became “Computer”. Moreover, we restricted the number of tokens generated by providing a list of classes to choose from and instructing the model to return the “Unknown” token if it was unsure about the category.
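A minimal sketch of this setup with the openai Python package (pre-1.0 interface) might look as follows; the instruction text, the exact class tokens, and the few-shot examples are illustrative rather than the prompt we actually used.

```python
import openai  # openai<1.0 interface

# Simplified, (ideally) single-token class names; "Unknown" is the fallback answer.
LABELS = ["Bio", "Civil", "Computer", "ECE", "Medical", "Mechanical", "Psychology"]

SYSTEM_PROMPT = (
    "You classify scientific abstracts into one of the following domains: "
    + ", ".join(LABELS)
    + ". Reply with exactly one word from this list. "
    "If you are not sure, reply with Unknown."
)

def classify(abstract, examples):
    """Classify one abstract given a list of (abstract, label) few-shot examples."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_text, label in examples:   # one example per class
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": abstract})

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,   # deterministic answers for classification
        max_tokens=2,    # the simplified class names should fit into one or two tokens
    )
    return response["choices"][0]["message"]["content"].strip()
```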

Overall, performance with this method was inferior to that of the RoBERTa model trained on just 200 samples. We noticed that the model’s classification ability heavily relied on the supplied prompt: modifying a single sentence could either improve or worsen the metrics. In some cases, ChatGPT missed categories despite explicit instructions not to do so (which might be a drawback of how we formulated our prompt).

In a few edge cases, it produced categories that were not listed in the instructions but did describe the article domains, such as “Math” or “Chemistry”. It’s unclear whether these flaws should be attributed to the model or the dataset, but according to the validation set, they could be corrected with simple rules, such as changing all instances of “Math” to “Computer”.

To improve the metrics, we tried to use as much data as possible. Since we still couldn’t feed all 200 samples into the model, we devised a two-stage process:

  • First, we asked the model to identify similarities between abstracts from a specific domain and to generate summaries of them (a minimal sketch of this stage follows the list).
  • Second, we incorporated these summaries into the instructions to provide the model with insights about the classes and the features identified by the model itself in the first stage.
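The sketch below illustrates the first stage, again assuming the pre-1.0 openai package; the wording of the request is illustrative, and the actual prompt we used is shown further below.

```python
import openai  # openai<1.0 interface

def summarize_domain(domain, abstracts):
    """Ask the model what the abstracts of one domain have in common."""
    joined = "\n\n".join(f"Abstract {i + 1}:\n{text}" for i, text in enumerate(abstracts))
    request = (
        f"The following abstracts all belong to the {domain} domain.\n\n{joined}\n\n"
        "Briefly describe what these abstracts have in common, so that the description "
        "can later be used to recognize other abstracts from this domain."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": request}],
        temperature=0,
    )
    # The returned summary is later pasted into the classification prompt (stage two).
    return response["choices"][0]["message"]["content"]
```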

This approach allowed us to feed more training data samples into the model, and it worked: we boosted the metrics by roughly 10%. Below is the prompt we used to generate these summaries:

The prompt for ChatGPT used to extract meaningful details about article domains

For each domain, we supplied seven to eight abstracts, resulting in a total of 63 distinct abstracts used to prepare the classification prompt (eight abstracts per class across the seven classes to build the summaries, plus seven abstracts provided as examples in the actual prompt).

As before, we instructed the model to reply with “Unknown” if it was uncertain about the class. On the validation set, we observed that most “Unknown” responses corresponded to computer science articles, so we replaced all “Unknown” instances with the “Computer” class.
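A minimal sketch of this rule-based post-processing is shown below; the “Chemistry” mapping is an assumption about the closest in-list class, while the other two rules reflect the corrections described above.

```python
# Corrections derived from the validation set.
CORRECTIONS = {
    "Unknown": "Computer",    # most "Unknown" answers turned out to be computer science articles
    "Math": "Computer",       # out-of-list label observed in a few edge cases
    "Chemistry": "Bio",       # assumption: mapped to the closest class in the target list
}

def normalize(prediction):
    """Map out-of-list or 'Unknown' model answers onto the target classes."""
    return CORRECTIONS.get(prediction.strip(), prediction.strip())
```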

The resulting classification prompt read as follows:

The final prompt for ChatGPT used to classify article abstracts

Once again, performance was heavily influenced by the prompt and the samples provided. The model also generated several categories outside the target list, which required manual adjustments based on the validation set. This approach yielded the following results:

Performance was notably better than fine-tuning a RoBERTa model on 200 samples, and fewer labeled samples were required. However, as more labeled data became available, RoBERTa began to outperform this approach, even with just 500 samples.

We believe that further performance improvements are possible through careful prompt engineering. Some useful tips and tricks can be found in the Prompting Guide.

For our final approach, we fine-tuned the GPT-3 Babbage model on the same three datasets. We followed the dataset preparation recommendations outlined in the OpenAI guide and opted for the default hyperparameters without making any specific adjustments. The training process for each dataset took about 20 minutes and yielded the following results:
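A minimal sketch of the dataset preparation for the legacy prompt/completion fine-tuning format is shown below; the separator and the leading space in the completion follow the general recommendations from the OpenAI guide at the time, while the file name and the CLI invocation in the comment are illustrative.

```python
import json

def write_finetune_file(texts, labels, path="wos_train.jsonl"):
    """Write training data in the legacy prompt/completion JSONL format."""
    with open(path, "w", encoding="utf-8") as f:
        for text, label in zip(texts, labels):
            record = {
                "prompt": text.strip() + "\n\n###\n\n",  # a fixed separator marks the end of the prompt
                "completion": " " + label,               # completions start with a whitespace
            }
            f.write(json.dumps(record) + "\n")

# The model was then fine-tuned with the legacy OpenAI CLI, roughly:
#   openai api fine_tunes.create -t wos_train.jsonl -m babbage
```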

The fine-tuned GPT-3 model delivered impressive results even on the smallest dataset, surpassing both RoBERTa and ChatGPT. As the amount of training data increased, the performance gap between RoBERTa and the tuned GPT-3 model narrowed, which raised questions about the resources and feasibility of using either option. We discussed the pros and cons of both approaches in our previous articles.

This experiment demonstrates that our initial hypothesis was correct: larger models trained on more extensive data perform significantly better on extra-small datasets. With proper prompt engineering and few-shot techniques, it’s possible to achieve favorable results.

However, the differences in performance shrink as the dataset size increases. Furthermore, an appropriately tailored classical model, such as a domain-adapted RoBERTa model, can sometimes outperform generic LLMs in classification tasks, which can be attributed to the model’s specialized “knowledge” of the subject matter. Moreover, with the right optimizations, inference with these models can be significantly faster, which is crucial when developing online services.

All images unless otherwise noted are by the author.

  1. Kowsari K, Brown DE, Heidarysafa M, Jafari Meimandi K, Gerber MS, Barnes LE. HDLTex: Hierarchical Deep Learning for Text Classification. In: Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on. IEEE; 2017.
  2. Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR. 2019;abs/1907.11692. http://arxiv.org/abs/1907.11692
