The Decontaminated Evaluation of GPT-4
Decontamination of the evaluation data
It’s contaminated
Is GPT-4 good at these exams?

Image from Pixabay

GPT-4 was announced by OpenAI in March with impressive demonstrations and outstanding claims.

Most of those claims come from their own evaluation of GPT-4.

OpenAI used many existing professional and academic exams for this evaluation.

Models such as GPT-4 are exposed to “data contamination”, i.e., their evaluation data may also appear in their training data.

Why is this an issue?

Let’s take an example.

GPT-4 was evaluated on the LSAT exam. To perform a scientifically credible evaluation, OpenAI had to check whether the LSAT questions used for evaluation were in the training data of GPT-4. If they were, GPT-4 could have memorized the questions and would obviously perform better on these specific questions at evaluation time.

It’s like a human who had access to the exam questions before taking the exam.

In the GPT-4 technical report, one of the few things OpenAI disclosed about GPT-4 is the data contamination of their evaluation. They described their method to quantify and assess this contamination and drew several conclusions from their observations.

In this article, I review and discuss how OpenAI handled the data contamination of GPT-4. I expose several pitfalls of their method.

I can’t agree with several of their conclusions.

To check whether there is an overlap between the training and evaluation data, OpenAI used a very simple technique relying on a substring matching algorithm (described on page 28 of the technical report).

First, they removed all spaces and symbols in the training and evaluation data (the exams). They kept the numbers.

Then, they picked 3 random substrings of 50 characters for each question (or equivalent) in the exams used for evaluation. If one of these substrings happened to be in the training data of GPT-4, the question was removed from the evaluation data.
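The two steps above can be sketched in a few lines of Python. This is only my reading of the description in the report, not OpenAI’s code: the function names are hypothetical, the training data is assumed to fit in a single already-normalized string, and using the whole question when it is shorter than 50 characters is an assumption of this sketch.

```python
import random

def normalize(text: str) -> str:
    # Keep only letters and digits: spaces and symbols are dropped, numbers are kept.
    return "".join(ch for ch in text if ch.isalnum())

def sample_substrings(question: str, k: int = 3, length: int = 50, rng=None) -> list[str]:
    # Draw k random substrings of `length` characters from the normalized question.
    rng = rng or random.Random()
    norm = normalize(question)
    if len(norm) <= length:
        # Assumption of this sketch: short questions are checked as a whole.
        return [norm]
    starts = [rng.randrange(len(norm) - length + 1) for _ in range(k)]
    return [norm[s:s + length] for s in starts]

def is_contaminated(question: str, normalized_training_text: str,
                    k: int = 3, length: int = 50, rng=None) -> bool:
    # A question is flagged as soon as one sampled substring occurs in the training text.
    return any(sub in normalized_training_text
               for sub in sample_substrings(question, k, length, rng))
```

Note that only the sampled substrings are ever compared with the training data. Everything else in the question is never looked at.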

With this method, they made two critical choices.

The first one is that this method is random.

Selecting 3 random substrings is especially problematic for exams with very long questions.

For instance, one question in the Uniform Bar Exam may contain 1,500 sequences of 50 characters. Note: They’re very long questions, see some examples.

Randomly selecting 3 substrings among 1,500 means that a large part of each question is completely ignored by this decontamination strategy.

This strategy can’t reliably detect whether a large part of a question is in the training data.
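To give a rough sense of how much can slip through, here is a back-of-the-envelope calculation. It rests on assumptions that are mine, not the report’s: a given fraction of a question’s 50-character windows occurs in the training data, and the 3 substrings are drawn independently and uniformly at random.

```python
def miss_probability(overlap_fraction: float, samples: int = 3) -> float:
    # Probability that none of the sampled windows falls in the overlapping part.
    return (1.0 - overlap_fraction) ** samples

for fraction in (0.1, 0.2, 0.5):
    print(f"{fraction:.0%} of windows in training data -> "
          f"missed with probability {miss_probability(fraction):.1%}")
# 10% of windows in training data -> missed with probability 72.9%
# 20% of windows in training data -> missed with probability 51.2%
# 50% of windows in training data -> missed with probability 12.5%
```

Under these assumptions, a question with a fifth of its text in the training data would go undetected about half of the time.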

We can imagine that some of these exam questions have been studied or discussed in the GPT-4 training data, but only partially and not entirely, since they’re very long questions. A partial but significant match wouldn’t be detected in that case.

The Uniform Bar Exam has 400 questions. But by randomly checking 3 substrings for each question, OpenAI didn’t find any of these questions in the training data.

The second critical choice is that they decontaminated the evaluation data and not the training data.

Removing questions from the training data, retraining GPT-4, and then evaluating it on the exams again would have been too costly, obviously.

However, if they had assessed this contamination earlier in their development process, i.e., before training, they could have removed all the exam examples from the training data.

It’s also important to note that they didn’t include the RLHF data in their decontamination process. If an exam question is in the RLHF data, it will remain in the evaluation data.

RLHF stands for Reinforcement Learning from Human Feedback. Once pre-trained, GPT-4 is further fine-tuned using reinforcement learning on human feedback to improve its performance. This dataset of “feedback” was not checked during decontamination.

The main reason given for not including the RLHF training data is that fine-tuning with RLHF didn’t significantly improve the performance of GPT-4. They only observed a +0.3% change in the average score after RLHF post-training.


The details of the contamination for each exam are given on page 30 of the report.

Among the 49 exams used for evaluation, 12 were found to be completely absent from the training data. They are: all the Leetcode datasets, the Uniform Bar Exam, the SAT EBRW exam, and some AP exams.

In total, the exams used for evaluation contain 4,123 questions. 545 of these questions were found in the training data. Note: Why is there a “”? As far as I understand, OpenAI removed the question entirely if there is a match. But for the exam “USA Biolympiad Semifinal Exam 2020”, which contains 150 questions, they note that they removed 3.00% of the questions (see Table 10 of the paper). 3% of 150 is 4.5. One of these numbers is probably wrong.

That means 13.2% of the evaluation data is contaminated.
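The arithmetic behind these two figures is easy to reproduce. The snippet below only uses the numbers quoted above; the variable names are mine.

```python
total_questions = 4123
contaminated_questions = 545
share = contaminated_questions / total_questions
print(f"{share:.1%}")  # 13.2%

# USA Biolympiad Semifinal Exam 2020: 150 questions, 3.00% reported as removed.
biolympiad_questions = 150
reported_removed_pct = 3.00
implied_removed = biolympiad_questions * reported_removed_pct / 100
print(implied_removed)  # 4.5

# Whole questions are removed, so a fractional count cannot be right.
print(implied_removed.is_integer())  # False
```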

Interestingly, for several exams, the decontamination seems to improve the results obtained by GPT-4.

This is counter-intuitive.

We might expect that if the removed questions were in the training data, GPT-4 should be good at answering them, since it had the chance to memorize them.

But we know nothing about these excluded questions.

They may be the most difficult ones for some exams, hence the higher percentage of correct answers after excluding them from the evaluation.
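A toy example makes this effect concrete. The exam below is entirely made up for illustration (10 questions, 4 of them flagged as contaminated and mostly answered incorrectly); it does not come from the report.

```python
def accuracy(outcomes):
    """Share of correctly answered questions."""
    return sum(outcomes) / len(outcomes)

# Hypothetical exam: (found_in_training_data, answered_correctly) for each question.
exam = [
    (True, False), (True, False), (True, False), (True, True),
    (False, True), (False, True), (False, True),
    (False, True), (False, True), (False, False),
]

full = accuracy([ok for _, ok in exam])
clean = accuracy([ok for flagged, ok in exam if not flagged])
print(f"full exam: {full:.0%}, after removing flagged questions: {clean:.0%}")
# full exam: 60%, after removing flagged questions: 83%
```

The score goes up after decontamination only because the removed questions happened to be the ones the model got wrong. It says nothing about whether the model memorized them.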

OpenAI claims that the contamination didn’t have a significant impact. They note:

Overall across most exams, both contamination and vision have relatively little effect. (Caption of Table 9)

The degradation is generally small and as often positive as negative […] (Caption of Table 10)

This is the “overall” conclusion. If we look closer at the results, that’s not so obvious. Let’s see some of the details.

In Table 10 of the technical report, OpenAI also evaluated GPT-4 on two separate sets of questions for each exam:

  • “contaminated”: This set contains only the questions found in the training data.
  • “non-contaminated”: This set contains all the remaining questions.

This is an interesting experiment. The performance of GPT-4 on these two kinds of datasets (fifth and sixth columns) varies widely for some exams, for instance from 41.67% to 0% for AMC 12.

For some other exams, GPT-4 performed better on the evaluation data it didn’t see during training (non-contaminated).

Does it mean that GPT-4 is better on questions it didn’t see during training?

No, “contaminated” and “non-contaminated” are just two different evaluation datasets.

GPT-4 may perform better on one of the two datasets for many different reasons, for instance the topic of the questions, their length, their difficulty, etc.

Let’s take a closer look at the LSAT exam. And let’s say that a score above 160 is a good score on this exam.

GPT-4 achieved a score of 163. After decontamination, which removed 39% of the questions, GPT-4 achieved an even higher score of 167.

Can we conclude that GPT-4 can achieve a good score on the LSAT exam?

Yes, we can. But only if cheating is allowed.

On one hand, we have the full exam, on which GPT-4 scores 163. It’s a good score, but GPT-4 saw some of the questions before taking the exam.

On the other hand, if we remove 39% of the questions for decontamination, this is not an LSAT exam anymore. No human has taken a 61% LSAT. This exam doesn’t exist.

Moreover, the 39% of questions removed may contain the most difficult questions. We don’t know if a score of 167 is good or bad on this 61% LSAT.

We can reason similarly for all the other “contaminated” exams used for evaluation.

Some exams weren’t “contaminated”, such as the Uniform Bar Exam and the Leetcode questions, but there are additional issues.

I won’t write about these issues here. Arvind Narayanan and Sayash Kapoor already discussed the results for these questions in their excellent article, which you can read here:

As I wrote in the introduction, assessing the data contamination of large language models is an extremely difficult task.

When collecting and preprocessing the training data, ideally we should have already identified a list of relevant public exams and benchmarks to exclude from the training data.

Nonetheless, my opinion is that it actually makes a lot of sense for OpenAI to train GPT-4 on all these exams.

The goal would be to make GPT-4 as good as possible on the questions posed by these exams. I can see a lot of potential use cases for GPT-4 in this area, such as helping students and teachers prepare for exams.

Yet, this choice has a cost:
