Exploring ChatGPT vs open-source models on slightly harder tasks
Warmup: Solving equations
Task: extracting snippets + answering questions on meetings
Task: do things with bash
Takeaways

All images were generated by Marco and Scott.

Open-source LLMs like Vicuna and MPT-7B-Chat are popping up everywhere, which has led to much discussion on how these models compare to commercial LLMs (like ChatGPT or Bard).

Most of the comparison has been on answers to simple one-turn questions or instructions. For instance, the folks at LMSYS Org did an interesting evaluation (+1 for being automated and reproducible) comparing Vicuna-13B to ChatGPT on various short questions, which is great as a comparison of the models as simple chatbots. However, many interesting ways of using LLMs typically require complex instructions and/or multi-turn conversations, and a bit of prompt engineering. We believe that in the 'real world', most people will want to test different LLM offerings on their own problem, with a variety of different prompts.

This blog post (written jointly with Scott Lundberg) is an example of what such an exploration might look like with guidance, an open-source project that helps users control LLMs. We compare two open-source models (Vicuna-13B, MPT-7B-Chat) with ChatGPT (3.5) on tasks of varying complexity.

Warmup: Solving equations

By way of warmup, let's start with the toy task of solving simple polynomial equations, where we can check the output for correctness and shouldn't need much prompt engineering. This is similar to the Math category here, with the difference that we evaluate models as correct / incorrect against the ground truth, rather than using GPT-4 to rate the output.

Each of these models has its own chat syntax, with special tokens separating utterances. Here is how the same conversation would look in Vicuna and MPT (where [generated response] is where the model would generate its output):

Vicuna:

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: Can you please solve the following equation? x^2 + 2x + 1 = 0
ASSISTANT: [generated response]

MPT:

<|im_start|>system
- You are a helpful assistant chatbot trained by MosaicML.
- You answer questions.
- You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- You are more than just an information source, you are also able to write poetry, short stories, and make jokes.
<|im_end|>
<|im_start|>user Can you please solve the following equation? x^2 + 2x + 1 = 0<|im_end|>
<|im_start|>assistant [generated response]<|im_end|>

To avoid the tedium of translating between these, guidance supports a unified chat syntax that gets translated to the model-specific syntax when calling the model.
Here is the prompt we'll use for all models (note how we use {{system}}, {{user}} and {{assistant}} tags rather than model-specific separators):

find_roots = guidance('''
{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please find the roots of the following equation: {{equation}}
Think step by step, find the roots, and then say:
ROOTS = [root1, root2...]
For example, if the roots are 1.3 and 2.2, say ROOTS = [1.3, 2.2].
Make sure to use real numbers, not fractions.
{{~/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')

We then load the models. Note that we are using the default system message in the prompt above.

import guidance

mpt = guidance.llms.transformers.MPTChat('mosaicml/mpt-7b-chat', device=1)
vicuna = guidance.llms.transformers.Vicuna('yourpath/vicuna-13b', device_map='auto')
chatgpt = guidance.llms.OpenAI("gpt-3.5-turbo")

Let's try these prompts on a very simple example.
Here is ChatGPT:

equation = 'x^2 + 3.0x = 0'
roots = [0, -3]
answer_gpt = find_roots(llm=chatgpt, equation=equation)
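
In guidance, each named generation is stored on the executed program, so (assuming the program object behaves like a dictionary, as in the version we used) the raw answer can be read back directly:

print(answer_gpt['answer'])  # the text generated for {{gen 'answer'}}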

Vicuna (we omit the system and user parts from now on):

answer_vicuna = find_roots(llm=vicuna, equation=equation)

MPT:

answer_mpt = find_roots(llm=mpt, equation=equation)

The correct answer was [0, -3], and only ChatGPT got it right (Vicuna didn't even follow the specified format).

In the notebook accompanying this post, we write a function to generate random quadratic equations with integer roots between -20 and 20, and run the prompt 20 times with each model. The results were as follows:

╔═════════╦══════════╗
║ Model   ║ Accuracy ║
╠═════════╬══════════╣
║ ChatGPT ║ 80%      ║
║ Vicuna  ║ 0%       ║
║ MPT     ║ 0%       ║
╚═════════╩══════════╝
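
For reference, here is a minimal sketch of the kind of evaluation loop behind these numbers. The helper names and the parsing logic here are illustrative; the notebook version differs in its details:

import random
import re

def random_quadratic():
    # (x - r1)(x - r2) expands to x^2 - (r1 + r2)x + r1*r2, so the roots are integers by construction
    r1, r2 = random.randint(-20, 20), random.randint(-20, 20)
    equation = f'x^2 + {-(r1 + r2)}x + {r1 * r2} = 0'
    return equation, sorted([r1, r2])

def parse_roots(answer):
    # Grab the last "ROOTS = [...]" in the model's answer and pull out the numbers
    matches = re.findall(r'ROOTS\s*=\s*\[(.*?)\]', answer)
    if not matches:
        return None
    try:
        return sorted(round(float(x)) for x in matches[-1].split(','))
    except ValueError:
        return None

def accuracy(llm, n=20):
    correct = 0
    for _ in range(n):
        equation, roots = random_quadratic()
        out = find_roots(llm=llm, equation=equation)
        if parse_roots(out['answer']) == roots:
            correct += 1
    return correct / n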

While ChatGPT makes a few mistakes, Vicuna and MPT didn't solve a single quadratic equation correctly, often making mistakes in intermediate steps (MPT typically doesn't even write intermediate steps). Here is an example of a ChatGPT mistake:

ChatGPT makes a calculation error on the last step, where (13 ± 25) / 2 should yield [19, -6] rather than [19.5, -6.5].
Now, since Vicuna and MPT failed on quadratic equations, we look at even simpler equations, such as x - 10 = 0. For these equations, we get these numbers:

╔═════════╦══════════╗
║ Model   ║ Accuracy ║
╠═════════╬══════════╣
║ ChatGPT ║ 100%     ║
║ Vicuna  ║ 85%      ║
║ MPT     ║ 30%      ║
╚═════════╩══════════╝

Here is an example of a mistake from MPT:

This was a very simple toy task, but it serves as an example of how to compare models with different chat syntaxes using the same prompt. For this particular task / prompt combination, ChatGPT far surpasses Vicuna and MPT in terms of accuracy (measured against ground truth).

Task: extracting snippets + answering questions on meetings

We now turn to a more realistic task, where evaluating accuracy isn't as straightforward. Let's say we want our LLM to answer questions (with the relevant conversation segments for grounding) about meeting transcripts.
This is an application where some users might prefer open-source LLMs over commercial ones for privacy reasons (e.g. some companies may not want to send their meeting data to OpenAI).

Here is a toy meeting transcript to start with:

John: Alright, so we're all here to discuss the offer we received from Microsoft to buy our startup. What are your thoughts on this?
Lucy: Well, I think it's a great opportunity for us. Microsoft is a huge company with a lot of resources, and they could really help us take our product to the next level.
Steven: I agree with Lucy. Microsoft has a lot of experience in the tech industry, and they could provide us with the support we need to grow our business.
John: I see your point, but I'm a little hesitant about selling our startup. We've put a lot of time and effort into building this company, and I'm not sure if I'm ready to let it go just yet.
Lucy: I understand where you're coming from, John, but we have to think about the future of our company. If we sell to Microsoft, we'll have access to their resources and expertise, which could help us grow our business even more.
Steven: Right, and let's not forget about the financial benefits. Microsoft is offering us a lot of money for our startup, which could help us invest in new projects and expand our team.
John: I see your point, but I still have some reservations. What if Microsoft changes our product or our company culture? What if we lose control over our own business?
Steven: You know what, I hadn't thought about this before, but maybe John is right. It would be a shame if our culture changed.
Lucy: Those are valid concerns, but we can negotiate the terms of the deal to ensure that we retain some control over our company. And as for the product and culture, we can work with Microsoft to make sure that our vision is still intact.
John: But won't we change just by virtue of being absorbed into a big company? I mean, we're a small startup with a very specific culture. Microsoft is a huge corporation with a very different culture. I'm not sure if the two can coexist.
Lucy: But John, didn't we always plan on being acquired? Won't this be a problem whenever it happens?
John: Right
John: I just don't want to lose what we've built here.
Steven: I share this concern too

Let's start by just trying to get ChatGPT to solve the task for us. We'll test it on the question 'How does Steven feel about selling?'. Here is a first attempt at a prompt:

qa_attempt1 = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Please answer the following question:
Question: {{query}}
Extract from the transcript the most relevant segments for the answer, and then answer the question.
{{/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')
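
We run it like so (meeting_transcript here is simply the transcript above stored as a string):

query1 = 'How does Steven feel about selling?'
answer1 = qa_attempt1(llm=chatgpt, transcript=meeting_transcript, query=query1)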

While the response is plausible, ChatGPT didn't extract any conversation segments to ground the answer (and thus fails to meet our specification). We actually iterate through five different prompts in the notebook, but we'll only show a couple here as examples, for the sake of discussion.
Here is prompt iteration #3:

qa_attempt3 = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need fewer than three segments, you can leave the rest blank.

As an example of output format, here is a fictitious answer to a question about another meeting transcript.
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank
ANSWER: Peter and John discussed the weather and Peter insulted John.
{{/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')

ChatGPT did extract relevant segments, but it didn't follow our output format (it neither summarized each segment nor gave the participants' names). After a couple more iterations, here is prompt iteration #5, where we place the one-shot example in a separate conversation round and create a fake meeting transcript for it. That finally does the trick:

qa_attempt5 = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: What were the main things that happened in the meeting?
Here is a meeting transcript:
----
Peter: Hey
John: Hey
Peter: John, how is the weather today?
John: It's raining.
Peter: That's too bad. I was hoping to go for a walk later.
John: Yeah, it's a shame.
Peter: John, you are a bad person.
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need fewer than three segments, you can leave the rest blank.
{{/user}}
{{#assistant~}}
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank
ANSWER: Peter and John discussed the weather and Peter insulted John.
{{~/assistant~}}
{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need fewer than three segments, you can leave the rest blank.
{{~/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')

qa_attempt5(llm=chatgpt, transcript=meeting_transcript, query=query1)

The reason we needed five (!) prompt iterations is that the OpenAI API doesn't allow us to do partial output completion yet (i.e. we can't specify how the assistant begins its answer), and thus it's hard for us to guide the output.
If, instead, we use one of the open-source models, we can guide the output much more clearly, forcing the model to use our structure.
For example, here is how we would modify qa_attempt3 so that the output format is specified:

qa_guided = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract the three segments from the transcript that are the most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns. If you need fewer than three segments, you can leave the rest blank.

As an example of output format, here is a fictitious answer to a question about another meeting transcript:
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank
ANSWER: Peter and John discussed the weather and Peter insulted John.
{{/user}}

{{#assistant~}}
CONVERSATION SEGMENTS:
Segment 1: {{gen 'segment1'}}
Segment 2: {{gen 'segment2'}}
Segment 3: {{gen 'segment3'}}
ANSWER: {{gen 'answer'}}
{{~/assistant~}}''')
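
Because every {{gen}} above has a name, the individual pieces can be read back after the program runs. Here is a quick sketch, again assuming the program object behaves like a dictionary as in the guidance version we used:

out = qa_guided(llm=vicuna, transcript=meeting_transcript, query=query1)
for name in ['segment1', 'segment2', 'segment3', 'answer']:
    print(name, '->', out[name])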

If we run this prompt with Vicuna, we get the right format the first time around (and every time):

We can, of course, run the same prompt with MPT:

While MPT follows the format, it ignores the question and takes snippets from the format example rather than from the actual transcript.
From now on, we'll just compare ChatGPT and Vicuna.

Let's try another question: "Who wants to sell the company?"

Here is ChatGPT:

Vicuna:

Both seem to work quite well. Let's switch the meeting transcript to the first few minutes of an interview with Elon Musk. The relevant portion for the question we'll ask is:

Elon Musk: Then I say, sir, that you don't know what you're talking about.
Interviewer: Really?
Elon Musk: Yes. Because you can't give a single example of hateful content. Not even one tweet. And yet you claimed that the hateful content was high. That's false.
Interviewer: No. What I claimed-
Elon Musk: You just lied.

Then we ask the following question:
“Does Elon Musk insult the interviewer?”

ChatGPT:

Vicuna:

Vicuna has the right format and even the right segments, but it surprisingly generates a completely wrong answer, saying "Elon Musk does not accuse him of lying or insult him in any way".

We tried a variety of other questions and conversations, and the overall pattern was that Vicuna was comparable to ChatGPT on most questions, but got the answer wrong more often than ChatGPT did.

Task: do things with bash

Now we try to get these LLMs to iteratively use a bash shell to solve individual problems. Each time they issue a command, we run it and insert the output back into the prompt, until the task is solved.

Here is the ChatGPT prompt (notice that {{shell this.command}} calls a user-defined function with this.command as its argument):

terminal = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please complete the following task:
Task: list the files in the current directory
You can give me one bash command to run at a time, using the syntax:
COMMAND: command
I will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.
{{/user}}

{{#assistant~}}
COMMAND: ls
{{~/assistant~}}

{{#user~}}
Output: guidance project
{{/user}}

{{#assistant~}}
The files or folders in the current directory are:
- guidance
- project
DONE
{{~/assistant~}}

{{#user~}}
Please complete the following task:
Task: {{task}}
You can give me one bash command to run at a time, using the syntax:
COMMAND: command
I will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.
{{/user}}

{{#geneach 'commands' stop=False}}
{{#assistant~}}
{{gen 'this.command'}}
{{~/assistant~}}

{{~#user~}}
Output: {{shell this.command}}
{{~/user~}}
{{/geneach}}''')
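
The shell function referenced by {{shell this.command}} is user-defined and passed in when the program is called. Here is a minimal sketch of what it could look like, along with an example invocation (our actual implementation may differ in details such as error handling):

import subprocess

def shell(command):
    # Run the command in a shell and return the combined output, to be pasted back into the prompt
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

out = terminal(
    task='Find out what license the open source project located in ~/work/project is using',
    shell=shell,
    llm=chatgpt,
)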

We created a dummy repo in ~/work/project, with the license in a file named license.txt (not the standard LICENSE file name).
Without communicating this to ChatGPT, let's see if it can figure it out when told to 'Find out what license the open source project located in ~/work/project is using':

Indeed, ChatGPT follows a very natural sequence and solves the task. It doesn't follow our instruction to say DONE, but we're able to stop the iteration automatically because it stops issuing COMMANDs.

For the open-source models, we write a simpler (guided) prompt consisting of a sequence of command-output pairs:

guided_terminal = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please complete the following task:
Task: list the files in the current directory
You can run bash commands using the syntax:
COMMAND: command
OUTPUT: output
Once you are done with the task, use COMMAND: DONE.
{{/user}}

{{#assistant~}}
COMMAND: ls
OUTPUT: guidance project
COMMAND: DONE
{{~/assistant~}}

{{#user~}}
Please complete the following task:
Task: {{task}}
You can run bash commands using the syntax:
COMMAND: command
OUTPUT: output
Once you are done with the task, use COMMAND: DONE.
{{~/user}}

{{#assistant~}}
{{#geneach 'commands' stop=False ~}}
COMMAND: {{gen 'this.command' stop='\n'}}
OUTPUT: {{shell this.command}}{{~/geneach}}
{{~/assistant~}}''')

Here is Vicuna:

Here is MPT:

In an interesting turn of events, Vicuna is unable to solve the task, but MPT succeeds. Besides privacy (we're not sending the session transcript to OpenAI), open-source models have a significant advantage here: the whole program is a single LLM run (and we even speed it up by not having the model generate the output structure tokens like COMMAND:).
In contrast, we have to make a new call to ChatGPT for each command, which is slower and more expensive.

Now we try a different command: "Find all jupyter notebook files in ~/work/guidance that are currently untracked by git".

Here is ChatGPT:

Once again, we run into the problem of ChatGPT not following our specified output structure (which makes it impossible to use inside a program, without a human in the loop). Our program just executes commands, and so it stopped after the last ChatGPT message above.

We suspected that the empty output was throwing ChatGPT off, so we addressed this particular issue by changing the message when there is no output. However, we can't fix the general problem of not being able to force ChatGPT to follow our specified output structure.
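
The empty-output fix itself was a small change to the kind of shell helper sketched earlier (the placeholder text below is just illustrative):

def shell(command):
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    output = result.stdout + result.stderr
    # Never feed an empty Output back to the model; use a placeholder instead
    return output if output.strip() else '(the command produced no output)'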

ChatGPT was able to solve the problem after this small modification. Let's see how Vicuna does:

Vicuna follows our output structure, but unfortunately runs the wrong command for the task. MPT (not shown) calls git status repeatedly, so it also fails.

We ran these programs on a variety of other instructions, and found that ChatGPT almost always produces the correct sequence of commands, while sometimes failing to follow the specified format (and thus needing human intervention). The open-source models didn't work as well (we could probably improve them with more prompt engineering, but they failed on most of the harder instructions).

Takeaways

In addition to the examples above, we tried various inputs for both tasks (question answering and bash). We also tried a variety of other tasks involving summarization, question answering, "creative" generation, and toy string manipulation tasks where we can evaluate accuracy automatically.
Here is a summary of our findings:

  • Quality on the task: for every task we tried, ChatGPT (3.5) was better than Vicuna on the task itself. MPT performed poorly on almost all tasks (perhaps we're using it wrong?), while Vicuna was often close to ChatGPT (sometimes very close, sometimes much worse, as in the last example task above).
  • Ease of use: it's much more painful to get ChatGPT to follow a specified output format, and thus it's harder to use it inside a program (without a human in the loop). Further, we always have to write parsers for the output (as opposed to Vicuna, where parsing a prompt with clear syntax is trivial).
    We're typically able to solve the structure problem by adding more few-shot examples, but it's tedious to write them, and sometimes ChatGPT goes off-script anyway. We also end up with prompts that are longer, clumsier, and uglier, which is unsatisfying.
    Being able to specify the output structure is a significant advantage, to the point that we'd sometimes prefer Vicuna over ChatGPT even when it's slightly worse on the task itself.
  • Efficiency: having the model locally means we can solve tasks in a single LLM run (guidance keeps the LLM state while the program is executing), which is faster and cheaper. This is especially true when any substeps involve calling other APIs or functions (e.g. search, terminal, etc.), which always requires a new call to the OpenAI API. guidance also speeds up generation by not having the model generate the output structure tokens, which sometimes makes a big difference.

In summary, our preliminary assessment is that MPT isn't ready for real-world use yet (unless we're using it wrong), and that Vicuna is a viable (if weaker) alternative to ChatGPT (3.5) for many tasks, in part due to the ability to specify the output structure. Now, it could be that these findings don't generalize, and are instead specific to the tasks and inputs we tried (or to the kinds of prompts we tend to write). We acknowledge that this is just a preliminary exploration, not an attempt at formal evaluation.
However, we believe that anyone who tries to use LLMs for real-world tasks will start with something like this to figure out which LLM is stronger for their use case / preferred prompt style (in addition to considerations of cost, privacy, model versioning, etc.).

We should acknowledge that we're biased by having used OpenAI models a lot in the past few years, having written various papers that rely on GPT-3 (e.g. here, here), and a paper that essentially says "GPT-4 is awesome, here are a bunch of cool examples".
Speaking of which, while Vicuna is somewhat comparable to ChatGPT (3.5), we believe GPT-4 is a much stronger model, and are excited to see whether open-source models can approach that. While guidance plays quite well with OpenAI models, it really shines when you can specify the output structure and speed up generation.

Again, we're clearly biased, but we think that guidance is a great way to use these models, whether with APIs (OpenAI, Azure) or locally (Hugging Face). Here is a link to the Jupyter notebook with code for all of the examples above (and more).

Disclaimer: this post was written jointly by Marco Tulio Ribeiro and Scott Lundberg. It strictly represents our personal opinions, and not those of our employer (Microsoft).

We are really thankful to Harsha Nori for insightful comments on this post.
