How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs

By Martin Görner
👉 You can play with the Keras chatbot arena
as you read. Click here to open it in a new tab. 👈

Table of contents
   1. Introduction
   2. The experiment
   3. Keras chatbot arena tech: Spaces, Gradio, TPUs, JAX and Keras
      3.1 Why TPUs?
      3.2 Why JAX and Keras?
      3.3 Sharding Models?
      3.4 Which models?
   4. Results
      4.1 Reliability
      4.2 The full chat – fixing mistakes
      4.3 More mistake fixing
   5. Recap



1. Introduction

I’m not interested in having LLMs solve big problems, quite the contrary. I want them to dispatch drudgery, and if they don’t get it right
on the first try, a short English sentence should be enough to fix it. In short, I want an assistant, like the
computers in old sci-fi movies, minus the “I’m sorry Dave, I’m afraid I can’t do that” bit 😅.

This paper explores such a tool for coding. Setting aside the creative title claim (no, AI is not
beating Kaggle grandmasters
yet), what the paper authors did was to
manually break various Kaggle problems into micro-tasks, have an LLM generate code for them, and iterate until unit tests pass. An example micro-task could be,
for an image classification problem, to “determine the format of the input data and reformat it into a CSV with columns ‘id’, ‘image_filename’ and ‘class’”.

I like that approach because that is how I would like to work on my projects with AI in the future. Have AI generate the boring pieces of code, like data reformatting,
so I can focus on the interesting bits: correctly framing the problem and devising the steps that will lead to a solution.

But this interactive coding assistant must be able to listen to feedback in plain English and fix mistakes in its code. With LLMs’ ability to infer information
from knowledge and context, this could be a very efficient computer interface. But if LLM quirks like hallucinations or lack of formal logic get in the way, we could end up
with a case of “artificial stupidity” rather than AI.

So I decided to run a little test with today’s LLMs. A super-simplified one, to see how effectively LLMs fix their mistakes when you point them out to them.



2. The experiment

Here is the scenario:

System prompt:

You are a helpful vocal assistant on a mobile device. Your job is to translate user requests into API calls using this Python API:

action.add_calendar_entry(title, date="YYYY-MM-DD", time="HH:MM", duration=m) # duration in minutes
action.remove_calendar_entry(title, date, time)

You should use 30 minutes as the default duration for new events. Reply to every request with a single line of executable code.

| Dialog prompts | Expected output |
| --- | --- |
| Add a meeting with Fred on Nov 11 at 5PM | action.add_calendar_entry("Meeting with Fred", date="2023-11-11", time="17:00", duration=30) |
| The current year is 2024. | action.add_calendar_entry("Meeting with Fred", date="2024-11-11", time="17:00", duration=30) |
| I am going to a rock concert in the evening of the same day at 8pm. Add a calendar entry for that. | action.add_calendar_entry("Rock Concert", date="2024-11-11", time="20:00", duration=30) |
| Set the duration to 3 hours. | action.add_calendar_entry("Rock Concert", date="2024-11-11", time="20:00", duration=60*3) |
| Add a meeting with Paul on the next day at 8am. 1/2h. | action.add_calendar_entry("Meeting with Paul", date="2024-11-12", time="08:00", duration=30) |
| Cancel the meeting with Fred. | action.remove_calendar_entry("Meeting with Fred", "2024-11-11", "17:00") |

That’s it. Very simple, but can LLMs handle this? And when they make a mistake, can you simply tell them what it is and expect a fix?
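For concreteness, the expected outputs above can be executed against a minimal mock of this API. This mock is my own sketch for illustration, not part of the experiment:

from datetime import datetime

class Action:
    # Minimal stand-in for the calendar API defined in the system prompt.
    def __init__(self):
        self.entries = []

    def add_calendar_entry(self, title, date, time, duration=30):
        # duration is in minutes, 30 being the stated default
        start = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M")
        self.entries.append({"title": title, "start": start, "duration": duration})

    def remove_calendar_entry(self, title, date, time):
        start = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M")
        self.entries = [e for e in self.entries
                        if not (e["title"] == title and e["start"] == start)]

action = Action()
# A model's one-line answer can then simply be exec()'d and checked:
exec('action.add_calendar_entry("Meeting with Fred", date="2023-11-11", time="17:00", duration=30)')
assert action.entries[0]["title"] == "Meeting with Fred"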

To test this, I needed an environment in which to quickly interact with multiple chatbots at once. Here is how I set it up.



3. Keras chatbot arena tech: Spaces, Gradio, TPUs, JAX and Keras

To experiment with this scenario, I wanted to be able to conduct two conversations at once, with different LLMs,
and pause one side while asking the other to fix a mistake in its output. Here is what it looks like.
It’s built with Gradio on Spaces and uses Keras, JAX and TPUs:

[screenshot: the Keras chatbot arena UI with two side-by-side conversations]
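The gist of the layout, as a minimal Gradio sketch (my own simplification for illustration, not the Space’s actual code — model names and the reply placeholder are made up): two chatbots side by side, each with its own model picker and its own prompt box, plus a shared prompt box. The per-side boxes are what let you pause one conversation while steering the other.

import gradio as gr

MODELS = ["Gemma 2 9B", "Llama 3.1 8B"]  # illustrative names

def reply(message, history, model_name):
    # Placeholder: the real Space routes this to the selected Keras LLM.
    return history + [(message, f"[{model_name}] …")]

with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            model_a = gr.Dropdown(MODELS, value=MODELS[0], label="Left model")
            chat_a = gr.Chatbot(label="Left conversation")
            msg_a = gr.Textbox(label="Prompt left model only")
            msg_a.submit(reply, [msg_a, chat_a, model_a], chat_a)
        with gr.Column():
            model_b = gr.Dropdown(MODELS, value=MODELS[1], label="Right model")
            chat_b = gr.Chatbot(label="Right conversation")
            msg_b = gr.Textbox(label="Prompt right model only")
            msg_b.submit(reply, [msg_b, chat_b, model_b], chat_b)
    msg_both = gr.Textbox(label="Prompt both models")
    msg_both.submit(reply, [msg_both, chat_a, model_a], chat_a)
    msg_both.submit(reply, [msg_both, chat_b, model_b], chat_b)

demo.launch()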

A few notes on how this was built before we return to the serious matter of chit-chatting with LLMs.



3.1 Why TPUs?

For their fast inference and large memory. A TPU v5e 2×4 has 8 cores and 16GB of
RAM per core for an aggregate 128GB of memory. With this much memory, we can load multiple LLMs at once, provided we shard
them across all cores, and switch between them at will in the UI. In this experiment, I was able to load five ∼8B param models (one more would OOM) and three ∼2B models, for a total of eight LLMs in memory at once, in bfloat16 format.
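As a back-of-the-envelope check (my arithmetic, not from the original setup): bfloat16 weights take 2 bytes per parameter, so five ∼8B-param models come to about 5 × 16GB = 80GB and three ∼2B models add roughly 12GB more, i.e. about 92GB of weights. The remaining ∼36GB must hold activations, KV caches and XLA compilation buffers, which is presumably what a sixth ∼8B model (another ∼16GB) tips over the edge.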



3.2 Why JAX and Keras?

JAX is the preferred ML environment for TPUs, thanks to its powerful XLA compiler. Keras, which now runs natively on top of JAX (as well as PyTorch and TensorFlow),
is my favorite modeling environment, and it has a nice selection of pretrained LLMs in its sister library KerasHub. It can even load selected non-Keras checkpoints
from Hugging Face, which will be useful for comparisons. I wrote about this previously here: Llama 3.2 in Keras.
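For example, both of these load paths go through the same KerasHub call (a sketch; the preset and Hugging Face checkpoint names are illustrative, not necessarily the ones used in the demo):

import keras_hub

# Load a native KerasHub preset...
model = keras_hub.models.Llama3CausalLM.from_preset("llama3_instruct_8b_en")
# ...or pull a non-Keras checkpoint straight from the Hugging Face Hub:
model = keras_hub.models.Llama3CausalLM.from_preset(
    "hf://meta-llama/Llama-3.1-8B-Instruct")
model.generate("Add a meeting with Fred on Nov 11 at 5PM", max_length=64)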



3.3 Sharding Models?

I also use Keras because it has by far the most user-friendly API for model parallelism. Here, I wanted to load as many models as possible into the TPU memory
at once. For this, each model must be sharded across the memory of all 8 TPU cores. Fortunately, most of them come with a default layout map that does exactly that.
For example:

layout_map = keras_hub.models.Llama3Backbone.get_layout_map(device_mesh)
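In context, the layout map plugs into the Keras distribution API roughly like this (a condensed sketch using recent Keras 3 APIs; the post linked below has the exact loading code):

import keras
import keras_hub

# 1x8 mesh: no data parallelism, weights sharded across all 8 TPU cores.
devices = keras.distribution.list_devices("tpu")
device_mesh = keras.distribution.DeviceMesh(
    shape=(1, 8), axis_names=("batch", "model"), devices=devices)

# Default per-weight layouts provided by the model itself:
layout_map = keras_hub.models.Llama3Backbone.get_layout_map(device_mesh)

# Make this the active distribution, then load: weights land pre-sharded.
keras.distribution.set_distribution(
    keras.distribution.ModelParallel(layout_map=layout_map, batch_dim_name="batch"))
model = keras_hub.models.Llama3CausalLM.from_preset(
    "llama3_instruct_8b_en", dtype="bfloat16")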

For the full loading code, and more background info on model parallelism, see my
earlier post here.
You will also find in that post a code snippet for visualizing the shardings actually applied once the model is loaded. Very useful for debugging.
And yes, debugging and a few layout map adjustments were necessary.



3.4 Which models?

For this experiment, I selected sub-10B param LLMs, mostly for their practicality, since many of them can be loaded
at the same time. But also, what the experiment tests is fairly simple and should be within reach of these smaller models.
All the models are instruction-tuned so that a dialog is possible. You can see their
chat templates in the demo’s implementation (a sketch of the idea follows).
Feel free to copy-paste the code for your own Keras chatbot needs.
The models are from the Gemma, Llama3, Mistral and Vicuna families. See the result tables below for a full list.
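As an illustration, formatting a dialog for the Gemma instruction-tuned models boils down to something like this (a sketch of the well-known Gemma turn format; the demo’s implementation has the exact per-family code, and each family uses different control tokens):

def format_gemma_prompt(history, user_msg):
    # history: list of (user, model) turn pairs; the tokenizer adds <bos>.
    prompt = ""
    for user, model in history:
        prompt += f"<start_of_turn>user\n{user}<end_of_turn>\n"
        prompt += f"<start_of_turn>model\n{model}<end_of_turn>\n"
    # Leave the final model turn open for the LLM to complete.
    prompt += f"<start_of_turn>user\n{user_msg}<end_of_turn>\n<start_of_turn>model\n"
    return prompt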



4. Results



4.1 Reliability

First, let’s see if our LLMs can answer the first question reliably. The system prompt and first question
“Add a meeting with Fred on Nov 11 at 5PM” were repeated five times.

Color code:

  • A ✔︎ check mark is awarded if the model produces the expected output, i.e. the API call
    action.add_calendar_entry("Meeting with Fred", date="2023-11-11", time="17:00", duration=30)
  • A 🍄 red poisonous mushroom means that the answer was mostly correct but contained a mistake (e.g. a wrong date)
  • A 🔥 dumpster fire means the response was garbage, with no recognizable API call.

The good news is that some models got this right every time, and all of them managed to reply with an API call (more or less correct) at least once. However, the smaller 1-2B param models
and the older models like Vicuna struggle. They respond badly most of the time.



4.2 The full chat – fixing mistakes

Now, let’s run through the full dialog, two models at a time. If a model makes a mistake, I try to steer it back on track. Let’s see if it works.

Color code:

  • A ✔︎ check mark means a valid API call was produced
  • A 🍄 red poisonous mushroom is when the model makes a single mistake
  • A 🥦 green broccoli is given to the model if it can successfully fix the mistake when asked

Shortened prompts are used in the table to save screen space.
The first question is deliberately imprecise: a month, day and time are given for the meeting, but not the year.
This is to make sure that all models make at least one mistake and get tested on their ability to fix it.

| Conversation (full transcript) | Gemma 2 9B-instr | Conversation (full transcript) | Llama-3.1 8B-instr | Conversation (full transcript) | Gemini online |
| --- | --- | --- | --- | --- | --- |
| Add a meeting with Fred… | ✔︎ 🍄 | Add a meeting with Fred… | ✔︎ 🍄 | Add a meeting with Fred… | ✔︎ 🍄 |
| Current year is 2024 | 🥦 | Current year is 2024 | 🍄 | Current year is 2024 | 🍄 |
| | | Fix the year in the API… | 🥦 | Fix the year in the API… | 🥦 |
| I am going to a rock concert… | ✔︎ | I am going to a rock concert… | ✔︎ | I am going to a rock concert… | ✔︎ 🍄 |
| Set the duration to 3 hours | ✔︎ | Set the duration to 3 hours | ✔︎ | Duration is required… | 🍄 |
| Add meeting with Paul on next day… | ✔︎ | Add meeting with Paul on next day… | ✔︎ | Use the default duration… | 🥦 |
| Cancel meeting with Fred | ✔︎ | Cancel meeting with Fred | ✔︎ | Set the duration to 3 hours | ✔︎ |
| | | | | Add meeting with Paul on next day… | ✔︎ 🍄 |
| | | | | Incorrect next day… | 🥦 |
| | | | | Cancel meeting with Fred | ✔︎ |

Gemma 2 9B and Llama 3.1 8B both succeed. Llama needed one extra “fix it” prompt but managed to get its broccoli 🥦.

A run with Google’s Gemini (online) is given in the third column for comparison. It is a massively larger model than the other two
and, surprisingly, it is not the best. It required slightly different prompts because Gemini can actually add entries to your Google Calendar,
so it had to be reminded to “answer with an API call from the provided API” every time. Even so, it made several mistakes and even got the date wrong
on the last prompt. This shows that a huge model is not necessarily better for this task.

Let’s move on to the small models: Llama 3.2 3B, Llama 3.2 1B and Gemma 2B.
This exercise seems to be overwhelmingly difficult for these models. New symbols are required here:

  • A 🔥🔥 double dumpster fire for responses with 3 or more mistakes. Attempts at fixing them one by one are useless.
  • The (🍄) red mushroom in parentheses indicates a recurring mistake, the same one on every line

And remember that these are the best runs. As seen in the “reliability” section above, some models were able to get past the first question just once out of five attempts.

| Conversation (full transcript) | Llama 3.2 3B-instr | Conversation (full transcript) | Llama 3.2 1B-instr | Conversation (full transcript) | Gemma 2B-instr |
| --- | --- | --- | --- | --- | --- |
| Add a meeting with Fred… | ✔︎ 🍄 | Add a meeting with Fred… | ✔︎ 🍄 | Add a meeting with Fred… | ✔︎ 🍄 |
| Current year is 2024 | 🥦 | Just the API call… | 🥦 🍄 | The time is wrong… | 🥦 (🍄) |
| I am going to a rock concert… | ✔︎ | Respect date format… | 🥦 | Just the API call… | 🥦 |
| Set the duration to 3 hours | ✔︎ 🍄 | Current year is 2024 | ✔︎ | Current year is 2024 | 🥦 (🍄) |
| Incorrect API call… | 🥦 | I am going to a rock concert… | 🔥🔥 | I am going to a rock concert… | ✔︎ 🍄 |
| Add a meeting with Paul… | ✔︎ 🍄 | Duration required… | 🥦 🔥 | The time is wrong… | 🥦 (🍄) |
| Respect date format… | 🔥🔥 | Extra parenthesis… | 🔥🔥 | Set the duration to 3 hours | ✔︎ (🍄) |
| Cancel meeting with Fred | 🔥🔥 | Set the duration to 3 hours | 🔥🔥 | Add a meeting with Paul… | ✔︎ |
| –giving up– | | –giving up– | | Cancel meeting with Fred | ✔︎ 🍄 |
| | | | | API requires 3 params… | 🥦 (🍄) |

Among the small models, only Gemma 2B manages to finish the dialog, albeit with a recurring mistake (🍄): it could not refrain from being chatty and adding
stuff on top of the requested API calls. Stuff like “Sure, here’s the updated code…”. It also kept mixing up dates and times. However, it was able to fix the mistakes
when asked 🥦.

Finally, let’s try some older models like Vicuna 1.5 7B and Mistral 7B. They are pitted against Codegemma 7B, which should be the best suited for this task, but as you can see, all three models struggle.

| Conversation (full transcript) | Codegemma 7B-instr | Conversation (full transcript) | Vicuna 1.5 7B-instr | Conversation (full transcript) | Mistral 7B-instr |
| --- | --- | --- | --- | --- | --- |
| Add a meeting with Fred… | ✔︎ 🍄 | Add a meeting with Fred… | ✔︎ 🍄 | Add a meeting with Fred… | ✔︎ 🍄 |
| Current year is 2024 | 🥦 (🍄) | Current year is 2024 | 🥦 | Respect the date format… | 🥦 🍄 |
| The year is mistyped… | (🍄) | I am going to a rock concert… | ✔︎ | Time in 24h format… | 🥦 |
| I am going to a rock concert… | ✔︎ (🍄) | Set the duration to 3 hours | 🔥🔥 | Current year is 2024 | 🥦 🍄 |
| Set the duration to 3 hours | ✔︎ (🍄) | Just the API call… | 🔥🔥 | Just the API call… | 🥦 |
| Add a meeting with Paul… | ✔︎ (🍄) | Add a meeting with Paul… | ✔︎ 🍄 | I am going to a rock concert… | 🍄 |
| Cancel meeting with Fred | ✔︎ 🍄 (🍄) | Just one API call… | 🍄 | You don’t need that info… | ✔︎ 🥦 |
| API requires 3 params… | 🥦 🍄 (🍄) | Cancel meeting with Fred | ✔︎ 🍄 | Set the duration to 3 hours | ✔︎ |
| It’s the wrong event now… | 🥦 (🍄) | | | Add a meeting with Paul… | ✔︎ 🍄 |
| | | | | Mistake in the year… | 🥦 🍄 |
| | | | | Cancel meeting with Fred | ✔︎ |

Codegemma got affected by a sticky recurring mistake (🍄): it would start spelling the year as “20 24” with a space and would not fix it.
Vicuna 1.5 7B is probably too old. At one point it starts repeating itself 🔥🔥, outputting multiple duplicate API calls and other junk.
It gets back on track to some extent, but with remaining mistakes. Finally, Mistral makes mistakes everywhere but is also able to fix them. It needed many interactions, but earned 6 broccoli 🥦 for fixed mistakes.



4.3 More mistake fixing

Some of these models cruised through the exercise with few mistakes, and therefore few chances to fix them and earn a broccoli 🥦. Using the Keras chatbot arena UI, we can run them
on mistakes 🍄 made by other LLMs and see if they can fix them.

Same color coding as before: green broccoli 🥦 for correctly fixing the mistake, red poisonous mushroom 🍄 if the mistake is still there,
dumpster fire 🔥 for multiple errors. The full transcript of the conversations is here.

(1) recurring mistake: outputs an apology next to the correct API call. For Gemma this is reliably fixed by asking for “API call only please”. It works for Llama too, but not reliably.

This was a good reality check for the models. Gemma 2 9B and Codegemma 7B get it almost right but keep apologizing for some mistakes instead of outputting a clean API call.
Llama 3.1 8B is a close second but has difficulties fixing a wrong API call reliably. And all the others are a 🔥 dumpster fire.



5. Recap

I didn’t know what to expect before starting this test. The API the models were playing with is unrealistically simple: just 2 API calls, ‘add_calendar_entry’
and ‘remove_calendar_entry’. So I thought this might be super easy for the models, and with a little bit of corrective prompting, all of them would ace it
every time. On the other hand, I knew that LLMs are probabilistic inference machines that don’t really listen to what you say. Prompts merely change the
probability distribution of the output, and some outputs are just hard to get.

The reality is interesting, as only one model, Gemma 9B, managed to get through the test almost perfectly. Here is a recap of all the ✔︎ checks (correct
answers), 🥦 broccoli (mistake fixes), 🍄 poisonous mushrooms (mistakes) and 🔥 dumpster fires (many mistakes in one answer) the models got
across all tests. This is not the most scientific way of summarizing the results, but it gives a fairly good picture:

| Rank | Model | Response ratings |
| --- | --- | --- |
| #1 | Gemma 2 9B-instr | ✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎🥦🥦🥦🥦🥦🥦🥦🍄🍄🍄🍄 |
| #2 | Llama-3.1 8B-instr | ✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎🥦🥦🥦🥦🥦🥦🥦🥦🍄🍄🍄🔥 |
| #3 | Codegemma 7B-instr | ✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎🥦🥦🥦🥦🥦🥦🥦🥦🥦🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄 |
| #4 | Mistral 7B-instr | ✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎🥦🥦🥦🥦🥦🥦🥦🥦🥦🥦🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄🔥🔥 |
| #5 | Gemma 2B-instr | ✔︎✔︎✔︎✔︎✔︎✔︎🥦🥦🥦🥦🥦🥦🥦🥦🥦🥦🥦🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄🍄🔥🔥🔥 |
| #6 | Llama 3.2 3B-instr | ✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎✔︎🥦🥦🥦🥦🥦🍄🍄🍄🍄🍄🍄🍄🍄🔥🔥🔥🔥🔥 |
| #7 | Vicuna 1.5 7B-instr | ✔︎✔︎✔︎✔︎✔︎✔︎🥦🥦🥦🍄🍄🍄🍄🍄🍄🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥 |
| #8 | Llama 3.2 1B-instr | ✔︎✔︎🥦🥦🥦🥦🍄🍄🍄🍄🍄🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥 |

I have no doubt that some fine-tuning could greatly improve these results. That is left for readers to explore. There is more information on Keras,
including a Keras LLM fine-tuning example, in this blog post. Also feel free to clone the Keras Chatbot Arena
to test your fine-tuned models. Happy 🥦!


