How good is ChatGPT on QA tasks?
Intro to Question Answering
Intro to DeepPavlov Library
Reading Comprehension (SQuAD 2.0)
Natural Yes/No Questions (BoolQ)
Question-answering NLI (QNLI)
How to use DeepPavlov Library for QA?
Conclusion

A hands-on comparison of ChatGPT and fine-tuned encoder-based models on QA tasks.

Image by pencilparker from Pixabay.

ChatGPT, released by OpenAI, is a flexible Natural Language Processing (NLP) system that comprehends the conversation context to provide relevant responses. Although little is known about how this model is constructed, it has become popular due to its quality in solving natural language tasks. It supports various languages, including English, Spanish, French, and German. One of the advantages of this model is that it can generate answers in diverse styles: formal, informal, and humorous.

The base model behind ChatGPT has only 3.5B parameters, yet it provides better answers than the GPT-3 model with its 175B parameters. This highlights the importance of collecting human data for supervised model fine-tuning. ChatGPT has been evaluated on well-known natural language processing tasks, and this article will compare gpt-3.5-turbo performance with supervised transformer-based models from the DeepPavlov Library on question answering tasks. For this article, I have prepared a Google Colab notebook so that you can try using models from the DeepPavlov Library for some QA tasks.

Question answering is a natural language processing task used in various domains (e.g., customer support, education, healthcare), where the goal is to provide an appropriate answer to questions asked in natural language. There are several kinds of Question Answering tasks, such as factoid QA, where the answer is a short fact or a piece of information, and non-factoid QA, where the answer is an opinion or a longer explanation. Question Answering has been an active research area in NLP for many years, so several datasets have been created for evaluating QA systems. Here are some datasets for QA that we will look at in more detail in this article: SQuAD, BoolQ, and QNLI.

Example of factoid Question Answering made by the author.

When developing effective QA systems, several problems arise, such as handling the ambiguity and variability of natural language, dealing with large amounts of text, and understanding complex language structures and contexts. Advances in deep learning and other NLP techniques have helped solve some of these challenges and have led to significant improvements in the performance of QA systems in recent years.

The DeepPavlov Library uses BERT-based models, such as RoBERTa, to handle Question Answering. BERT is a pre-trained transformer-based deep learning model for natural language processing that achieved state-of-the-art results across a wide range of NLP tasks when it was proposed. There are also many models with a BERT-like architecture that are trained to handle tasks in specific languages besides English: Chinese, German, French, Italian, Spanish, and many others. BERT can be used to solve a variety of language problems, while changing only one layer in the original model.
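This reuse is easy to see in Hugging Face Transformers: the same pre-trained encoder can be loaded with different task heads, and only the head changes between tasks. A minimal sketch of the idea (the checkpoint name is the stock Hugging Face BERT, not a DeepPavlov config):

from transformers import (
    AutoModelForQuestionAnswering,       # adds a span-extraction head (SQuAD-style QA)
    AutoModelForSequenceClassification,  # adds a classification head (BoolQ/QNLI-style)
)

# Both models share the same pre-trained BERT encoder weights; only the small
# task-specific layer on top differs and is trained during fine-tuning.
qa_model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
clf_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)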

The DeepPavlov Library is a free, open-source NLP framework that includes state-of-the-art NLP models that can be used standalone or as part of DeepPavlov Dream. This Multi-Skill AI Assistant Platform offers various text classification models for intent recognition, topic classification, and insult identification.

DeepPavlov Demo example for QA generated by the author.

PyTorch is the underlying machine learning framework that the DeepPavlov Library employs. The library is implemented in Python and supports Python versions 3.6–3.9. It also supports the use of transformer-based models from the Hugging Face Hub through Hugging Face Transformers. Interaction with the models is possible either via the command-line interface (CLI), the application programming interface (API), or through Python pipelines. Please note that specific models may have additional installation requirements.
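As a minimal sketch, installation and the CLI/API interaction modes look like this; the config name squad_bert is an assumption, so check the documentation of your installed version for the exact model names:

pip install deeppavlov

# CLI: install the model's extra requirements, download its weights (-d), and interact
python -m deeppavlov install squad_bert
python -m deeppavlov interact squad_bert -d

# REST API: serve the same model over HTTP
python -m deeppavlov riseapi squad_bert -d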

Let’s start our comparison with the reading comprehension task, more specifically, the SQuAD 2.0 dataset. It is a large-scale, open-domain question answering dataset that contains over 100,000 questions with answers based on a given passage of text. Unlike the original SQuAD dataset, SQuAD 2.0 includes questions that cannot be answered solely based on the provided passage, which makes it more difficult for machine learning models to answer accurately.

Despite the fact that the first version of SQuAD was released back in 2016 and contained answers to questions about Wikipedia articles, QA in the SQuAD setting is still relevant. After training a model on this dataset, it is possible to provide as context not only Wikipedia, but also information from official documents like contracts, company reports, etc. Thus, it is possible to extract facts from a broad range of texts not seen during training.
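In Python, running a SQuAD-style model from the DeepPavlov Library looks roughly like this. The config name and the exact output format are assumptions based on the library's documented SQuAD models, which return the answer span, its start position, and a confidence score:

from deeppavlov import build_model

# 'squad_bert' is an assumed config name; it may differ between library versions.
model = build_model('squad_bert', download=True)

# SQuAD models take a batch of contexts and a batch of questions.
contexts = ['DeepPavlov is an open-source conversational AI framework.']
questions = ['What kind of framework is DeepPavlov?']
answers, start_positions, scores = model(contexts, questions)
print(answers)  # expected something like: ['an open-source conversational AI framework']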

Let’s now move on to composing a prompt for ChatGPT. To make the answers more stable, I restarted the session after each question. The following prompt was used to get answers for examples in SQuAD style:

Please answer the given question based on the context. If there is no answer in the context, reply NONE.
context: [‘context’]
question: [‘question’]
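For reproducibility, here is a sketch of sending this prompt to gpt-3.5-turbo. It uses the pre-1.0 openai SDK interface that was current at the time of writing; the chat() helper and the temperature setting are my additions, not part of the original setup:

import openai

openai.api_key = 'YOUR_API_KEY'  # placeholder

def chat(prompt: str) -> str:
    # One fresh, single-turn request per question approximates restarting the session.
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0,  # assumption: reduce answer variability during evaluation
    )
    return response['choices'][0]['message']['content'].strip()

context = '...'   # a SQuAD 2.0 passage goes here
question = '...'  # a question about that passage
print(chat(
    'Please answer the given question based on the context. '
    'If there is no answer in the context, reply NONE.\n'
    f'context: {context}\n'
    f'question: {question}'
))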

Let’s now look at and compare the results. Even though ChatGPT is exceptionally good at answering questions, experiments on 1,000 sampled examples showed that it lags far behind existing solutions when answering questions about a given context. If we carefully examine the examples on which the model is wrong, it turns out that ChatGPT doesn’t cope well with answering a question from context when the answer may not actually be in the presented context; it also doesn’t perform well when asked about the order of events in time. The DeepPavlov Library, on the other hand, provides the correct output in all these cases. A few examples follow:

SQuAD example 1. Expected — NONE, DeepPavlov — NONE. Made by the author.

Here we see that ChatGPT doesn’t always work well with numeric values. This may be because, in this case, the model is trying to extract an answer from the context even though there is in fact no answer.

SQuAD example 2. Expected — K-1, DeepPavlov — K-1. Made by the author.

This is an example that ChatGPT hasn’t seen. And as we can see, here it cannot answer the question about the order of events in time.

For a quantitative comparison of ChatGPT and the DeepPavlov SQuAD 2.0 pretrained model, I used the sampled set from the paper. Although the model from the DeepPavlov Library shows better results than ChatGPT, it isn’t far ahead on the tested sample. This may be due to mistakes in the markup of the dataset; if they are corrected, the results of both models are expected to improve.

Models’ performance on 1,000 sampled pairs from SQuAD 2.0. Made by the author.
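For readers who want to reproduce this kind of comparison: SQuAD systems are conventionally scored with exact match and token-level F1. A simplified sketch of the F1 computation (the official SQuAD script additionally lowercases the strings and strips punctuation and articles before comparing):

from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    # Token-level overlap F1 between a predicted and a gold answer string.
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1('K-1', 'K-1'))  # 1.0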

The DeepPavlov Library also contains a model that has been trained on the BoolQ dataset to answer natural questions in a yes/no format. BoolQ is a machine learning dataset of almost 16,000 yes/no questions, created by Google Research to test the ability of natural language processing algorithms to accurately answer binary questions based on a given text passage.

Let’s see how to get responses from ChatGPT using BoolQ examples. One can use the prompt from the article:

Please answer the given question based on the context. The answer should be exactly ‘yes’ or ‘no’.
context: [‘context’]
question: [‘question’]
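Programmatically, this prompt can be wrapped with the same chat() helper from the SQuAD sketch above; the final normalization line is my addition, since the model occasionally pads its reply with extra words despite the instruction:

def ask_boolq(context: str, question: str) -> str:
    reply = chat(
        'Please answer the given question based on the context. '
        "The answer should be exactly 'yes' or 'no'.\n"
        f'context: {context}\n'
        f'question: {question}'
    )
    # Normalize free-form replies such as 'Yes, it does.' to a bare label.
    return 'yes' if reply.lower().startswith('yes') else 'no'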

As you can see, the prompt is simple and shouldn’t confuse the model. Some experiments have already been carried out, showing that the accuracy of ChatGPT on the BoolQ dataset (reading comprehension) is 86.8, while the best model achieves 91.2. Although the difference between the compared models isn’t too big, here are some examples where ChatGPT lags behind the model from the DeepPavlov Library:

BoolQ example 1. Expected — Yes, DeepPavlov — Yes. Made by the author.

Despite the fact that the given example states that the long cloak is called a short cape, ChatGPT couldn’t generalize this information and conclude that they are the same thing.

BoolQ example 2. Expected — Yes, DeepPavlov — Yes. Made by the author.

Strangely, ChatGPT sometimes can’t match a numeric time value with its verbal equivalent.

In all of the above examples, the model from the DeepPavlov Library responds correctly. Just as in the SQuAD examples, ChatGPT makes mistakes on the BoolQ dataset when it needs to answer about time and when it must generalize information. I also noticed that ChatGPT relies heavily on information memorized during training: if we take a real fact as context but add false information to it, ChatGPT will mostly give an answer based on the real information instead of the fictional one. Example:

BoolQ example 3. Expected — Yes, DeepPavlov — Yes. Made by the author.

In the original text, it is indeed said that the sweet potato and the potato don’t belong to the same family. But in the given context it is clearly written that they belong to the same family, the nightshades, which was completely ignored by ChatGPT.

The last QA task on which I compared the results of the DeepPavlov Library and ChatGPT is question answering entailment, the QNLI dataset. QNLI (Question-answering Natural Language Inference) is a natural language understanding (NLU) task where a model is trained to determine the relationship between two sentences: the context sentence (question) and the hypothesis sentence (answer). The objective is to assess whether the hypothesis sentence entails the answer to the question or not.

To solve the QNLI problem using ChatGPT, the following prompt from the article was used:

Given the question: [‘question’]
Determine if the following sentence contains the corresponding answer: [‘sentence’]
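As before, this can be wrapped around the chat() helper; mapping the yes/no reply onto QNLI’s label set is my assumption about how the scoring was done:

def ask_qnli(question: str, sentence: str) -> str:
    reply = chat(
        f'Given the question: {question}\n'
        'Determine if the following sentence contains the corresponding answer: '
        f'{sentence}'
    )
    # Map the free-form yes/no reply onto the QNLI labels.
    return 'entailment' if reply.lower().startswith('yes') else 'not_entailment'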

Most of the time, ChatGPT manages to choose the correct answer. But I was able to confirm the results of the experiments: ChatGPT responds much worse than BERT-based models on the QNLI task. I didn’t manage to find new classes of ChatGPT mistakes; mostly, they all fall under the same classes of temporal mistakes (related to the inability to reason about events and their order in time) and logical mistakes (related to the inability to do deductive or inductive reasoning).

I also evaluated the DeepPavlov model on the sample from the dev set as in the paper. The results are as follows: the pretrained QNLI model in DeepPavlov outperforms both ChatGPT and the models from the article.

Models’ performance on 50 sampled pairs from the QNLI dataset. Made by the author.

A few examples of mistakes:

QNLI example 1. Expected — No (not entailment), DeepPavlov — not_entailment. Made by the author.

At first sight, even a person might think that the above sentence contains an answer to the question. But the key word in the question is “minimum”, and there is nothing about it in the sentence.

QNLI example 2. Expected — Yes (entailment), DeepPavlov — entailment. Made by the author.

In the above example, we can again notice that ChatGPT can be easily deceived. The sentence from the example doesn’t explicitly indicate which day of the week the debut took place on, but implicitly it can be concluded that it was on a Sunday. In general, ChatGPT seems to have limitations in terms of logical reasoning and contextual understanding. As a result, it may struggle with questions that are relatively easy for humans.

In addition to the issues outlined above, ChatGPT has other drawbacks. The responses generated by ChatGPT can be inconsistent and, at times, contradictory. The model’s answers may vary when asked the same question, and its performance may also be influenced by the order in which the questions are asked.
