

How to Validate OpenAI GPT Model Performance with Text Summarization
Using GPT Effectively
Validating GPT Model Performance
Evaluation of Results

Photo by Patrick Tomasso on Unsplash

Regardless of your occupation or age, you've probably heard about OpenAI's generative pre-trained transformer (GPT) technology on LinkedIn, YouTube, or in the news. These powerful artificial intelligence models/chatbots can seemingly handle any task, from writing poems to solving LeetCode problems to coherently summarizing long articles of text.

Screenshot of OpenAI's GPT Playground summarizing notes on Jupiter, taken by the author

The promising applications of GPT models seem limitless within the expanding NLP industry. But with ever-increasing model sizes, it's crucial for teams that are building large language models (LLMs) to understand each model's performance, cost, and behavior before committing to one. Since AI like GPT is a growing subject in ethics, developers should ensure that their models are fair, accountable, and explainable. However, properly testing such general-purpose models across many different contexts is tedious, expensive, and time-consuming.

From the perspective of a machine learning engineer at Kolena, this article offers a thorough guide to using GPT models and compares their performance on the abstractive text summarization task. With this actively researched NLP problem, we'll be able to compare model performance and cost, surface failure cases, and much more.

By the end of this article, you'll learn that GPT-3.5's Turbo model gives a 22% higher BERT-F1 score with a 15% lower failure rate at 4.8x the cost and 4.5x the average inference time compared to GPT-3's Ada model for abstractive text summarization.

Suppose you want to use GPT for fast solutions in NLP applications, like translating text or explaining code. Where do you start? Fortunately, there are only three major steps in using GPT for any new task:

  1. Picking the right model
  2. Creating an appropriate prompt
  3. Using GPT's API for responses (our code is at the end of this article)
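The three steps above can be sketched in a few lines of Python. This is a minimal sketch, not the article's exact code: it assumes the legacy `openai` v0.x SDK, an `OPENAI_API_KEY` environment variable, and uses the article's prompt template; the helper names `build_prompt` and `summarize` are our own.

```python
# Sketch: building the summarization prompt and requesting a summary.
# Assumes the legacy `openai` v0.x SDK and an OPENAI_API_KEY env variable.
import os


def build_prompt(article_text: str, word_count_limit: int) -> str:
    """Build the consistent summarization prompt used for every model."""
    return (
        f"Professionally summarize this news article like a reporter with about "
        f"{word_count_limit} to {word_count_limit + 50} words:\n{article_text}"
    )


def summarize(article_text: str, word_count_limit: int, model: str = "gpt-3.5-turbo") -> str:
    """Request a summary: chat endpoint for Turbo, completion endpoint for GPT-3 models."""
    import openai  # deferred so build_prompt stays usable without the SDK installed

    openai.api_key = os.environ["OPENAI_API_KEY"]
    prompt = build_prompt(article_text, word_count_limit)
    if model.startswith("gpt-3.5"):
        resp = openai.ChatCompletion.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp["choices"][0]["message"]["content"]
    resp = openai.Completion.create(model=model, prompt=prompt, max_tokens=512)
    return resp["choices"][0]["text"]
```

Keeping the prompt construction in one function is what lets every model receive exactly the same request, which matters for the comparisons below.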

Before picking a model, we must first consider a few things: How well does each model work? Which one gives the best ROI? Which one performs best in general? Which one performs best on your data?

To narrow down the logistics of selecting a GPT model, we use the CNN-DailyMail text summarization dataset to benchmark and compare the performance of five GPT models: Ada, Babbage, Curie, Davinci, and Turbo. The test split of the dataset contains 11,490 news articles and their respective summaries.

For step two, we generate new summaries with each model using a consistent prompt in the following format:

“Professionally summarize this news article like a reporter with about {word_count_limit} to {word_count_limit+50} words:\n {full_text}”

In practice, it takes some experimentation to refine a prompt that gives subjectively optimal results. By using the same prompt for every model, we can fairly compare model behaviors with one less variable in how each model differs.

In this particular article, we focus on step one: picking the right model.

Let's get acquainted with the GPT models of interest, which come from the GPT-3 and GPT-3.5 series. Each model has a token limit defining the maximum size of the combined input and output, so if, for example, your prompt for the Turbo model contains 2,000 tokens, the maximum output you'll receive is 2,096 tokens. For English text, 75 words typically tokenizes into roughly 100 tokens.
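The token arithmetic above is worth making explicit. The sketch below uses the 75-words-to-100-tokens rule of thumb from this article; for exact counts you would use a real tokenizer such as OpenAI's tiktoken.

```python
# Rough token arithmetic for planning prompts. The 4/3 tokens-per-word ratio
# is only a rule of thumb (75 words ~= 100 tokens); exact counts require a tokenizer.
def estimate_tokens(word_count: int) -> int:
    """Estimate token count for English text from its word count."""
    return round(word_count * 100 / 75)


def max_output_tokens(model_token_limit: int, prompt_tokens: int) -> int:
    """Input and output share one token budget, so output gets whatever is left."""
    return max(model_token_limit - prompt_tokens, 0)


print(estimate_tokens(75))            # → 100
print(max_output_tokens(4096, 2000))  # → 2096, the Turbo example above
```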

We're currently on the waitlist for GPT-4 access, so we'll include those models in the future. For now, the main difference between GPT-4 and GPT-3.5 isn't significant for basic tasks, but GPT-4 offers a much larger token limit at a much higher price point compared to Davinci.

Performance Metrics of Abstractive Text Summarization

As we all know, metrics help us measure performance. The tables below highlight the standard and custom metrics we use to evaluate models on their text summarization performance:

*We calculate BLEU scores with SacreBLEU and BERT scores with Microsoft’s deberta-xlarge-mnli model.

ROUGE and BLEU measure similarity through word matches between the ground truths and inferences, while BERT scores consider semantic similarity. The higher the value, the closer the similarity:
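To make the word-matching idea concrete, here is a minimal, illustrative ROUGE-L F1 implementation based on the longest common subsequence of words. This is a sketch only; the numbers reported in this article come from standard packages (SacreBLEU for BLEU, deberta-xlarge-mnli for BERT scores).

```python
# Minimal ROUGE-L F1 sketch: longest common subsequence (LCS) over words.
def _lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """F1 over LCS length: precision against the candidate, recall against the reference."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = _lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat sat on the mat"))  # → 1.0
```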

Results with Standard Metrics

Once we generate new summaries (inferences) per article with each model, we can compare model performance across any type of metric against the ground truths. Let's look into the summary comparisons and metric plots, ignoring Babbage for readability.

In the following example, the original 350-word news article has this summary:

A recent report from Suncorp Bank found Australians spent $20 billion on technology in the past 12 months. Men spent twice as much as women on computers, digital accessories, mobile apps, and streaming services. Families with children at home spend 50 per cent more to stay digitally connected than singles, couples without children and empty nesters. One third of households don't budget for technology or wildly underestimate how much they will spend.

We get the following ROUGE_L, BLEU, and generated summaries with Davinci and Ada:

Reading the generated summaries, you'll notice that Davinci does a coherent job of summarizing the content of a larger text. Ada, however, doesn't provide a summary of the same quality, and the lower values of ROUGE_L and BLEU reflect that lower quality of output.

Distribution of ROUGE_L

When we examine the distributions of ROUGE_L and BLEU for each model, we see that Ada has the lowest metric values and Turbo the highest, with Davinci falling just behind Turbo. As GPT models grow in size, we see a general increase in these metrics too. The greater the value of these metrics, the more words from the ground truth summary appear in the generated texts.

Distribution of BLEU


For BERT scores, the same trend holds: larger models are better at matching key words and semantic meaning from the provided summary. This is evident in how the distribution for larger models shifts to the right, in the direction of higher F1 scores.

Distribution of BERT_F1
BERT_F1 vs word_count

From the plot above, we see that larger models maintain their performance better than smaller models as text size grows. The larger models remain consistently performant across a wide range of text lengths, while the smaller models fluctuate in performance as texts grow longer.

Results with Custom Metrics

Let's check our custom metrics to see if there's any reason not to use Turbo or Davinci.

Distribution of API Request Costs

From the models' cost distributions, we learn that Davinci is far more expensive than any other model. Although Davinci and Turbo perform at similar levels, Davinci costs roughly ten times as much as Turbo.
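A per-request cost metric is simple to compute from token counts. In the sketch below, the per-1K-token USD prices are the historical rates from around the time of writing (early 2023) and should be treated as assumptions; always check OpenAI's current pricing page.

```python
# Per-request cost metric. Prices (USD per 1K tokens) are historical
# early-2023 rates and are assumptions here, not current pricing.
PRICE_PER_1K = {
    "text-ada-001": 0.0004,
    "text-babbage-001": 0.0005,
    "text-curie-001": 0.002,
    "text-davinci-003": 0.02,
    "gpt-3.5-turbo": 0.002,
}


def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one API request: total tokens times the model's per-1K rate."""
    return (prompt_tokens + completion_tokens) / 1000 * PRICE_PER_1K[model]


# The same token counts cost ten times more on Davinci than on Turbo:
print(request_cost("text-davinci-003", 2000, 300))  # ≈ 0.046 USD
print(request_cost("gpt-3.5-turbo", 2000, 300))     # ≈ 0.0046 USD
```

Note that Curie and Turbo share the same per-token price in this table, which is why a Curie response with twice the output tokens costs twice as much as a comparable Turbo response.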

Distribution of inf_to_gt_word_count

In the figure above, there's a drastic difference in the number of words generated for the same ground truth. Turbo and Davinci consistently provide a summary that's twice the ground truth summary length, whereas the smaller models vary wildly. Specifically, some generated summaries from the smaller models are much shorter, and some are more than four times as long! Remember that we prompted each model with the same request and word count target per article, but only some models respected that target while others completely ignored it.
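The inf_to_gt_word_count custom metric behind this plot is just the ratio of the generated summary's word count to the ground truth's. A minimal sketch (the function name follows the metric's name; the implementation is our own):

```python
# inf_to_gt_word_count: ratio of inference word count to ground truth word count.
# A value of 2.0 means the generated summary is twice the ground truth's length.
def inf_to_gt_word_count(inference: str, ground_truth: str) -> float:
    gt_words = len(ground_truth.split())
    return len(inference.split()) / gt_words if gt_words else 0.0


# A generated summary twice the ground-truth length scores 2.0:
print(inf_to_gt_word_count("word " * 100, "word " * 50))  # → 2.0
```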

The variance in summary length is a problem for users, as this imbalance indicates potential issues with the model or poor performance. In the example above, Curie repeats "a number of charitable causes in the past, most notably his work with St. Jude Children's Research Hospital" at least twice. Compared to Turbo, Curie generates a lower-quality summary while costing the same price per token. We should also note that the cost of generating this particular summary with Curie is double the cost of Turbo, since the number of tokens in the output was extremely high.

After running model evaluations for an hour on Kolena, we can outline and summarize each model's performance and characteristics as shown below.

We now understand that the larger the model size:

  • The more semantically similar the provided and generated summaries are
  • The more expensive it is to compute, excluding Turbo
  • The lower the number of empty summaries
  • The slower it is to generate a summary
  • The better it maintains performance as texts grow longer

Ultimately, Turbo is the best model offered in the GPT-3/3.5 series, providing the most consistent text similarity scores while also being very cost-effective.

Notes for Further Research

Interestingly, given a text to summarize, Ada sometimes fails to return any summary at all, even though the prompt is within the token limit. Turbo failed on none of the articles, which is a great achievement. However, this might be because Turbo isn't as responsive in flagging sensitive content or puts less emphasis on such considerations. Ada might be less performant, but we should ask OpenAI whether it refuses to generate summaries out of ethical consideration or technical limitation. Below is a sample of the articles where Ada failed to offer any summary but Turbo produced decent summaries. It does seem that Ada is less lenient in producing summaries with sensitive content:

Articles Where Ada Fails While Turbo Performs Well — From Kolena

The ground truth summaries from the dataset are not perfect. Nonetheless, we assume they are perfect for the purpose of straightforward performance computations, so model evaluation metrics might indicate that a great model is actually underperforming, even though it produces perfectly valid and detailed summaries. Perhaps some generated summaries are even better than their ground truths, as shown below:

The world of NLP is rapidly advancing with the introduction of LLMs like GPT. As such models become larger, more complex, and more expensive, it's crucial for developers and users alike to understand their expected performance levels for specific use cases.

There is no single best GPT model; the right choice depends on your problem, expectations, and available resources. There is much to consider when picking a single GPT model for your NLP tasks. In the quickly advancing era of LLMs, hopefully the findings outlined in this article give you a new perspective on the differences among OpenAI's models.

Stay tuned for more posts in the future, where we may cover prompt engineering, GPT-4 performance, or differences in model behavior by type of content as well!

As promised earlier in this article, our code for reference and all five models' summaries for every example in this article are available on this page. You can learn more about OpenAI's API and models in OpenAI's documentation.

All images of plots are screenshots taken from Kolena unless otherwise indicated. Note that similar plots can be generated manually with common frameworks such as matplotlib.


