As many people have probably heard, a new large language model (GPT-4) from OpenAI was recently announced. As expected, the model weights are not published (nothing new), but this time the paper describing the model architecture and training is also not published (that is new). Nonetheless, the model is offered in the form of a paid API (with a waitlist), and to ChatGPT Plus subscribers (with fairly restrictive limits).
The model is hyped to be multi-modal (consuming both text and images as inputs); however, the exposed APIs are text-only. More interesting are the published benchmark results: apparently it beats the previous best LLM (Google's instruction-finetuned PaLM) across the board, achieving impressive results on many of the exam-like datasets.
One line in the report stands out, though (columns are GPT-4, GPT-4 w/o vision, and GPT-3.5):
Codeforces is a website hosting programming competitions: participants are given textual problem statements and are asked to write code that solves the given problem. The code is then submitted to the system, which runs it on a predefined set of tests, verifying correctness and resource usage (runtime and RAM). Problem difficulty ranges from trivial (solvable by someone who has just started learning a programming language) to very difficult (with solutions requiring ideas worthy of research papers in theoretical computer science).
Since these are competitions, users are assigned Elo-like ratings. The GPT-4 report claims its rating to be 392, which is abysmally low: frankly, a school kid who has just had a few informatics lessons would score higher. Does this mean that GPT-4 cannot solve coding problems?
Another table from the same report disagrees:
Leetcode is a collection of coding problems from technical interviews. The catch is that they are not original but collected from external sources, so data leakage is extremely likely (i.e. GPT-4 has probably seen these exact problems with solutions during training).
So, is the mystery solved, and GPT-4 cannot solve coding problems after all? Apparently, there is some disagreement: a few days later, a paper by Microsoft researchers (an obvious conflict of interest…) came out with the fairly pretentious title "Sparks of Artificial General Intelligence: Early experiments with GPT-4". In it, the authors claim that GPT-4 is quite good at coding:
In its current state, we believe that GPT-4 has a high proficiency in writing focused programs that only depend on existing public libraries, which favorably compares to the average software engineer's ability.
So, what is going on? The aim of this post is to evaluate GPT-4 on some new problems and see whether it can solve them.
Problem selection
Leetcode is the worst possible source for selecting problems, since it hosts problems originating from other sources, meaning that even recent Leetcode problems are likely to be old and present in the GPT-4 training set. Evaluating whether it can memorize and retrieve solutions in a fuzzy way is not very interesting. For these reasons, we pick problems which appeared only recently.
For the contest source, we pick AtCoder. It offers a wide range of contests and problem difficulties, and has the pleasant property that the statements and solutions to its problems are usually quite short. This prevents overflowing the GPT-4 context size in the public demo.
For the same reason of reducing context size, we ask GPT-4 to produce solutions in Python, as they are usually shorter than in many other languages.
AtCoder specifics
The platform offers contests in three categories:
- AtCoder Grand Contests: this is the "main" track for strong human participants; even the easiest problems require a non-zero amount of thinking;
- AtCoder Regular Contests: these contests assume that participants have decent knowledge of and proficiency with a programming language, but do not necessarily require complicated ideas, with most problems being relatively straightforward for an experienced participant;
- AtCoder Beginner Contests: these are educational and aimed at people who just want to learn programming. Unlike the previous two categories, not all problems in these contests are original.
In each contest, there are multiple problems, ordered by perceived difficulty. As a rule of thumb, the easiest problem in each contest is typically solved by most of the participants.
To give some impression, here is an easy problem from each category:
- Beginner: link. The problem asks to match each string from a set against a small list of predefined strings. No thinking is necessary, but the participants must know how to compare strings in their programming language;
- Regular: link
You are given a string S of length N consisting of lowercase English letters, and a positive integer K. Determine whether there is a string S′ of length K that satisfies the following conditions.
The concatenation of S and S′ in this order is a palindrome. The concatenation of S′ and S in this order is a palindrome.
This already requires a bit of thinking (considering the cases K < 2N and K ≥ 2N) and less trivial coding.
- Grand: link

There is an array A_1,…,A_N, and initially A_i=i for all i. Define the following routine shuffle(L,R):

If R=L+1, swap the values of A_L and A_R and terminate.

Otherwise, run shuffle(L,R−1) followed by shuffle(L+1,R).

Suppose we run shuffle(1,N). Print the value of A_K after the routine finishes.
This is already hard for a human to solve without a piece of paper (manually working through some examples and noticing the pattern).
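The example-hunting a human would do on paper can also be done mechanically: a tiny brute-force simulation of the shuffle routine (exponential in N, so only for very small N) prints the final arrays and lets one look for the pattern. The function name `run_shuffle` is mine, not from the problem statement.

```python
def run_shuffle(n):
    """Simulate shuffle(1, n) on A = [1..n] and return the final array.
    The recursion performs 2^(n-2) base-case swaps, so keep n tiny."""
    a = list(range(n + 1))  # 1-indexed; a[0] is unused

    def shuffle(l, r):
        if r == l + 1:
            a[l], a[r] = a[r], a[l]
        else:
            shuffle(l, r - 1)
            shuffle(l + 1, r)

    shuffle(1, n)
    return a[1:]

for n in range(2, 7):
    print(n, run_shuffle(n))
```

The intended contest solution would instead derive a closed form for A_K from the pattern this simulation reveals, since N can be far too large to simulate.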
If we talk about intelligence, solving the "Grand" category can be considered a sign of high intelligence, competitive with strong humans; solving "Regular" can be considered a sign of intelligence; "Beginner" does not require much intelligence, just basic knowledge of a programming language and algorithmic techniques.
Prompt tuning
The following prompt was used to give GPT-4 the instructions (attempting to make it more instruction-like does not seem to help with the results):
# Solving Math with Coding
## You are given the task of writing a Python program to solve the following math problem:
## Requirements:
- The code should be a syntactically correct standalone Python program.
- Do not forget the proper indentation.
- Please print the final answer using print(solution).
### Possible Python Program:
So, what do the results look like? Surprisingly, they are much worse than I anticipated:
GPT-4 solutions can be found here. It looks like AGI is not happening for now: not a single problem requiring intelligence is solved! Furthermore, it has trouble even with the beginner problems.
For example, this problem: it asks to concatenate two sorted arrays, sort the concatenation, and report the positions of the elements of the original arrays in the sorted result. The GPT-4 solution is quadratic in time, so it times out.
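For reference, a near-linear approach is straightforward: sort the concatenation once and build a value-to-position map. A sketch, assuming pairwise-distinct elements (a guarantee such problems typically make); the function name is mine:

```python
def positions_in_merged(a, b):
    """Given two sorted lists with pairwise-distinct elements, return the
    1-based positions of the elements of each list in the sorted
    concatenation, in O((n + m) log(n + m)) time overall."""
    merged = sorted(a + b)
    rank = {v: i + 1 for i, v in enumerate(merged)}  # value -> position
    return [rank[v] for v in a], [rank[v] for v in b]

print(positions_in_merged([1, 3, 5], [2, 4]))  # ([1, 3, 5], [2, 4])
```

Since both inputs are already sorted, a two-pointer merge would even give O(n + m), but the single `sorted` call is comfortably fast enough here.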
Another failure example: this problem. It asks to count the number of substrings of a string which contain each character an even number of times. The GPT-4 solution is not only quadratic, but completely wrong, checking a different property.
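The property it should have checked admits a classic linear-time solution: track the parity of each character's count in a prefix bitmask; a substring has all-even counts exactly when the two prefixes bounding it share the same mask. A sketch (the function name is mine):

```python
from collections import Counter

def count_even_substrings(s):
    """Count substrings of s in which every character occurs an even
    number of times, in O(len(s)) using prefix parity bitmasks."""
    mask = 0                 # bit i = parity of the count of chr(i + ord('a'))
    seen = Counter({0: 1})   # the empty prefix has mask 0
    total = 0
    for ch in s:
        mask ^= 1 << (ord(ch) - ord('a'))
        total += seen[mask]  # each earlier equal mask yields one valid substring
        seen[mask] += 1
    return total

print(count_even_substrings("aabb"))  # 3: "aa", "bb", "aabb"
```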
Given its troubles with these, it is unsurprising that it scores zero on Regular/Grand problems. It often replicates some standard algorithms or uses standard techniques like dynamic programming, but none of this helps it solve the problems at hand.
One interesting observation: it often produces code which does not even work on the sample inputs, despite being supplied with them. Perhaps an easy way to make it slightly better would be to re-sample the solution until it passes the sample tests, but this is hard to implement with the prompt rate limits of the public demo.
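Such a rejection-sampling filter is simple to sketch given unthrottled API access; here `passes_samples` (my name) runs a candidate program against the provided sample tests, and the candidate-generation call is hypothetical, shown only in a comment:

```python
import subprocess

def passes_samples(code, samples, timeout=5.0):
    """Return True iff the candidate Python program produces the expected
    stdout for every (stdin, expected_stdout) sample pair."""
    for stdin, expected in samples:
        try:
            result = subprocess.run(
                ["python3", "-c", code],
                input=stdin, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

# Hypothetical re-sampling loop (sample_model is not a real API):
# code = sample_model(prompt)
# while not passes_samples(code, samples):
#     code = sample_model(prompt)
```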
To GPT-4's credit, the part it consistently gets right is parsing the inputs: perhaps people who participate in these contests could use it for that purpose (if it were not so slow). This is consistent with the use cases of the existing Copilot: it too is often good at such tedious and straightforward pieces of code.
How come GPT-4 is so good on exams, but so disastrously bad on coding problems? There should be plenty of code and coding problems in the training set (at the very least, other LLMs augment their training with ample amounts of those).
One major difference is the nature of the task: multiple-choice questions are much easier to answer, since the model just needs to pick an answer from a small space, while with coding problems it has to produce working code end-to-end. One way to verify that this is the reason would be to ask it to act as a classifier: given a statement and a solution, tell whether the solution is correct (or pick a correct one among multiple options). I did not run this test, but I am willing to bet that it would be just as bad.
Another potential explanation is that the exam questions are designed to test knowledge, not skills. Indeed, GPT-4 solves a non-zero number of problems from Beginner contests, which require only knowledge, while struggling with the more skill-based Regular and Grand contests. This would be a pessimistic view of LLMs: if they only do well on knowledge-based tasks, they are essentially glorified search engines (or natural-language interfaces to something trivial), and we are not getting AGI from them any time soon.
An optimist might argue that we just need to put more work into making it work for this use case: do finetuning (impossible, since GPT-4 is closed), or use chain-of-thought prompting to make the model reason for a while before jumping to coding. This is a valid argument, but given the performance on the Beginner problems, which do not require much reasoning, I doubt it would help much.
Nonetheless, I am planning to play a bit more with LLMs in this direction; specifically, to finetune LLaMA for this use case (let's see if it at least beats GPT-4; some first attempts with low-rank finetuning of the smallest 7B version are even worse, but there is clearly more headroom).