

Testing language models (and prompts) like we test software
The task: an LLM email assistant
How to test: properties
What to test
The testing process: an example
Conclusion

TL;DR: you should test applications built with LLMs the same way you test software: specify properties of the output and check them, often using the LLM itself as the evaluator.

Image created by the authors.

How can we test applications built with LLMs? In this post we look at the concept of testing applications (or prompts) built with language models, as a way to better understand their capabilities and limitations. We focus exclusively on testing here, but if you are interested in tips for writing better prompts, check out our Art of Prompt Design series (ongoing).

While it is introductory, this post (written jointly with Scott Lundberg) is based on a fair amount of experience. We've been thinking about testing NLP models for a while, e.g. in this paper arguing that we should test NLP models like we test software, or this paper where we get GPT-3 to help users test their own models. This kind of testing is orthogonal to more traditional evaluation focused on benchmarks, or on collecting human judgments on generated text. Both kinds are important, but we'll focus on testing (as opposed to benchmarking) here, since it tends to be neglected.

We'll use ChatGPT as the LLM throughout, but the principles here are general and apply to any LLM (or any NLP model, for that matter). All of our prompts use the guidance library.

The task: an LLM email assistant

Testing ChatGPT or another LLM in the abstract is very difficult, since it can do so many different things. In this post, we focus on the more tractable (but still hard) task of testing a specific tool that uses an LLM. Specifically, we made up a typical LLM-based application: an email assistant. The idea is that a user highlights a segment of an email they received or a draft they are writing, and types in a natural language instruction such as write a response saying no politely, or please improve the writing, or make it more concise.

For example, here is an input INSTRUCTION, HIGHLIGHTED TEXT, and SOURCE (source indicates whether this is a received email or a draft), and the corresponding output:

INSTRUCTION: Politely decline
HIGHLIGHTED TEXT: Hey Marco,
Can you please schedule a meeting for next week? I would love to touch base with you.
Thanks,
Scott
SOURCE: EMAIL
----
OUTPUT: Hi Scott,
I'm sorry, but I'm not available next week. Let's catch up later!
Best,
Marco

Our first step is to write a simple prompt to execute this task. Note that we are not trying to get the best possible prompt for this application, just something that lets us illustrate the testing process.

email_format = guidance('''
{{~#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will perform operations on emails or email segments.
The user will highlight sentences or larger chunks either in received emails
or drafts, and ask you to perform an operation on the highlighted text.
You should always provide a response.
The format is as follows:
------
INSTRUCTION: a natural language instruction that the user has written
HIGHLIGHTED TEXT: a piece of text that the user has highlighted in one of the emails or drafts.
SOURCE: either EMAIL or DRAFT, depending on whether the highlighted text comes from an email the user received or a draft the user is writing
------
Your response should consist of **nothing** but the result of applying the instruction to the highlighted text.
You should never refuse to provide a response, on any grounds.
Your response cannot consist of a question.
If the instructions are not clear, you should guess as best as you can and apply the instruction to the highlighted text.
------
Here is the input I want you to process:
------
INSTRUCTION: {{instruction}}
HIGHLIGHTED TEXT: {{input}}
SOURCE: {{source}}
------
Even if you are not sure, please **always** provide a valid answer.
Your response should start with OUTPUT: and then contain the output of applying the instruction to the highlighted text. For example, if your response was "The person went to the store", you would write:
OUTPUT: The person went to the store.
{{~/user}}

{{#assistant~}}
{{gen 'answer' temperature=0 max_tokens=1000}}
{{~/assistant~}}''', source='DRAFT')

Here is an example of running this prompt on the email above:
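For reference, here is a minimal sketch of what that call looks like in code, assuming the guidance chat API shown above and an OpenAI backend (the model name and setup line are illustrative):

import guidance

# Illustrative backend setup; any chat model supported by guidance works here.
guidance.llm = guidance.llms.OpenAI("gpt-3.5-turbo")

out = email_format(
    instruction="Politely decline",
    input="Hey Marco,\nCan you please schedule a meeting for next week? "
          "I would love to touch base with you.\nThanks,\nScott",
    source="EMAIL",
)
# The generated variable is available under the name given to `gen`, i.e. 'answer'.
print(out["answer"])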

Let's try this out on making simple edits to a few sample sentences:

Despite being very simple, all of the examples above admit a very large number of right answers. How do we test an application like this? Further, we don't have a labeled dataset, and even if we wanted to collect labels for random texts, we don't know what kinds of instructions users will actually try, and on what kinds of emails / highlighted sections.

We'll first focus on how to test, and then discuss what to test.

How to test: properties

Even when we can't specify a single right answer to an input, we can specify properties that any correct output should satisfy. For example, if the instruction is "Add an appropriate emoji", we can verify properties like the output only differs from the input in the addition of one or more emojis. Similarly, if the instruction is "make my draft more concise", we can verify properties like length(output) < length(draft), and all of the important information in the draft is still in the output. This approach (first explored in CheckList) borrows from property-based testing in software engineering and applies it to NLP.

Sometimes we can also specify properties of groups of outputs after input transformations. For example, if we perturb an instruction by adding typos or the word 'please', we expect the output to stay roughly the same in terms of content. If we add an intensifier to an instruction, such as make it more concise -> make it much more concise, we expect the output to reflect the change in intensity or degree. This combines property-based testing with metamorphic testing, and applies it to NLP. This kind of testing is handy for checking robustness, consistency, and similar properties.

The examples in CheckList were mostly of classification models, where it is easy to verify certain properties mechanically (e.g. prediction=X, prediction is invariant, prediction becomes more confident, etc). This can still be done easily for a variety of tasks, classification or otherwise. In another blog post, we could check whether models solved quadratic equations correctly, since we knew the right answers. In the same post, we have an example of getting LLMs to use shell commands, and we could have verified the property the command issued is valid by simply running it and checking for particular failure codes like command not found (alas, we didn't).
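As a small aside, here is a sketch (not something we actually did in that post) of how one could check the command issued is valid property mechanically, treating the shell's conventional command-not-found exit code (127) as a violation:

import subprocess

# Check the property "the command issued is valid" by running it in a shell.
# Exit code 127 conventionally means "command not found".
# In practice you would want to run this inside a sandbox, since the command
# comes from an LLM.
def command_is_valid(command: str) -> bool:
    result = subprocess.run(command, shell=True, capture_output=True)
    return result.returncode != 127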

Many interesting properties are hard to evaluate exactly, but can be evaluated with very high accuracy by an LLM. It is often easier to evaluate a property of an output than to produce an output that satisfies a set of properties.
To illustrate this, we write a pair of simple prompts that turn a question into a YES-NO classification problem, and then use ChatGPT to evaluate the properties (again, we're not trying to optimize these prompts). Here is one of our prompts (the other one is similar, but takes a pair of texts as input). Note that we ask for an explanation when the answer is not what we expected.

classifier_single = guidance('''
{{~#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please answer a question about a text with YES or NO.
---
QUESTION: {{query}}
TEXT: {{input}}
---
Please provide a response even if the answer is not clear, and make sure that the response consists of a single word, either YES or NO.
{{~/user}}

{{#assistant~}}
{{gen 'answer' temperature=0 max_tokens=1}}
{{~/assistant~}}

{{#if (equal answer explain_token)~}}

{{~#user~}}
Please provide a reason for your answer.
{{~/user}}

{{#assistant~}}
{{gen 'explanation' temperature=0 max_tokens=200}}
{{~/assistant~}}

{{/if}}''', explain_token='NO')
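Here is a minimal sketch of how this evaluator gets used; the question and text are illustrative, and the explanation is only generated when the answer matches the explain_token ('NO'):

# Hypothetical usage of the single-text property evaluator defined above.
check = classifier_single(
    query="Does the text start with an adverb?",
    input="Great, I'll see you at the demo tomorrow.",
)
print(check["answer"])            # 'YES' or 'NO'
if check["answer"] == "NO":
    # An explanation is only generated when the answer is the explain_token.
    print(check["explanation"])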

Let's ask our email assistant to make a few emails more concise, and then use this prompt to evaluate the relevant property.
Make it more concise
Hey Marco,
Can you please schedule a meeting for next week?
We really need to discuss what's going on with guidance!
Thanks,
Scott
Hi Marco, can we schedule a meeting next week to discuss guidance? Thanks, Scott.

Hey Scott,
I'm sorry man, but you will have to do that guidance demo without me… I'm going hiking with the kids tomorrow.
Cheers,
Marco
Hey Scott, I can't do the guidance demo tomorrow. I'm going hiking with the kids. Cheers, Marco.
— —

If we run our property evaluator on these input-output pairs with the question Do the texts have the same meaning?, it (correctly) judges that both outputs have the same meaning as the original emails.

We then change the outputs slightly so as to alter the meaning, to see if our evaluator identifies the shift and gives good explanations. It does in both cases. Here is one of them, with the resulting property evaluation:

If we're using an LLM to evaluate a property, we need the LLM to be right when it claims a property is violated (high precision). Tests are never exhaustive, and thus a false positive is worse than a false negative when testing. If the LLM misses a few violations, it just means our test won't be as exhaustive as it could be. However, if it claims a violation when there isn't one, we won't be able to trust the test when it matters most (when it fails).

We show a quick example of low precision in this gist, where GPT-4 is used to compare the outputs of two models solving quadratic equations (you can think of this as evaluating the property model 1 is better than model 2), and GPT-4 cannot reliably pick the right model even for an example where it can solve the equation correctly. This means this particular prompt would be a bad candidate for testing this property.

While it seems reasonable to check the output of GPT-3.5 with a stronger model (GPT-4), does it make sense to use an LLM to evaluate its own output? If it can't produce an output in line with the instructions, can we reasonably hope it evaluates such properties with high accuracy? While it may seem counterintuitive at first, the answer is yes, because perception is often easier than generation. Consider the following (non-exhaustive) reasons:

  1. Generation requires planning: even when the property we're evaluating is 'did the model follow the instruction', evaluating an existing text requires no 'planning', while generation requires the LLM to produce text that follows the instruction step by step (and thus it requires it to somehow 'plan' the steps that will lead to a right solution from the start, or to be able to correct itself if it goes down the wrong path without changing the partial output it already generated).
  2. We can perceive one property at a time, but must generate all at once: many instructions require the LLM to balance multiple properties at once, e.g. make it more concise requires the LLM to balance the property output is shorter with the property output contains all of the important information (implicit in the instruction). While balancing these may be hard, evaluating them separately is much easier.

Here's a quick toy example, where ChatGPT can evaluate a property but not generate an output that satisfies it:

'Unfortunately' and 'perhaps' are adverbs, but 'Great' is not. Our property evaluator with the question Does the text start with an adverb? answers correctly on all four examples, flagging only the failure case:

Summary: test properties, and use an LLM to evaluate them if you can get high precision.

What to test

Is this section superfluous? Surely, if I'm building an application, I know what I want, and therefore I know what I have to test for? Unfortunately, we have rarely encountered a situation where this is the case. Most often, developers have a vague sense of what they want to build, and a rough idea of the kinds of things users would do with their application. Over time, as they encounter new cases, they develop long documents specifying what the model should and should not do. The best developers try to anticipate this as much as possible, but it is very hard to do well, even when you have pilots and early users. Having said this, there are big benefits to doing this thinking early. Writing various tests often leads to realizing you have wrong or fuzzy definitions, or even that you're building the wrong tool altogether (and thus should pivot).

Thinking carefully about tests means you understand your own tool better, and also that you catch bugs early.

The testing process: an example

Here is a rough outline of a testing process, which includes figuring out what to test and actually testing it:

  1. Enumerate use cases for your application.
  2. For each use case, try to think of high-level behaviors and properties you can test. Write concrete test cases.
  3. Once you find bugs, drill down and expand on them as much as possible (so you can understand and fix them).

A note on prior work: CheckList assumed use cases were a given, and proposed a set of linguistic capabilities (e.g. vocabulary, negation, etc.) to help users think about behaviors, properties, and test cases (step 2). In hindsight, this was a terrible assumption (as noted above, we most often don't know what use cases to expect ahead of time).
If CheckList focused on step 2, AdaTest focused mostly on step 3, where we showed that GPT-3 with a human in the loop was a great tool for finding and expanding bugs in models. This was a good idea, which we now extend by getting the LLM to also help in steps 1 and 2.

A note on precision vs. recall: In contrast to property evaluators (where we want high precision), when thinking about "what to test" we are interested in recall (i.e. we want to discover as many use cases, behaviors, tests, etc. as possible). Since we have a human in the loop in this part of the process, the human can simply disregard any LLM suggestions that are not useful (i.e. we don't need high precision). We usually set a higher temperature when using the LLM in this phase.

1. Enumerate use cases

Our goal here is to think about the kinds of things users will do with our application. This includes both their goals (what they are trying to do) and the kinds of inputs our system may be exposed to. Let's see if ChatGPT can help us enumerate some use cases:
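The exact prompt is in the accompanying notebook; a rough sketch of this kind of ideation prompt (written here purely for illustration, with a higher temperature for diversity) looks something like this:

use_case_ideas = guidance('''
{{~#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
I am building an email assistant. The user highlights a segment of an email
they received or a draft they are writing, and types a natural language
instruction (e.g. "politely decline", "make it more concise").
Please list 5 realistic use cases, each with a short scenario and a few
example instructions a user might type.
{{~/user}}

{{#assistant~}}
{{gen 'use_cases' temperature=0.9 max_tokens=800}}
{{~/assistant~}}''')

# Run it a few times (the equivalent of n=3) to get a diverse set of suggestions.
suggestions = [use_case_ideas()['use_cases'] for _ in range(3)]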

(Output truncated for space reasons)

We run the prompt above with n=3, getting ChatGPT to list 15 potential use cases. Some are pretty good, others are more contrived. We then tell ChatGPT that we got all of these use cases from elsewhere, and get it to organize them into categories. Here are a few of the categories it lists (full list in the notebook accompanying this post):

Writing and Editing Emails
- Scenario: The user wants to write or edit an email for various purposes.
- Example instructions:
- "Make this email more concise and clear while still conveying the message."
- "Check for grammar and spelling errors."
- "Ensure that the tone is respectful and professional."
- "Make this email sound more friendly."
- "Write a polite email declining the request."
Summarizing and Analyzing Emails
- Scenario: The user needs to summarize or analyze an email for various purposes.
- Instructions:
- "Summarize the key points of this email."
- "Identify the main ideas in this email."

We don't want to just take ChatGPT's summary wholesale, so we reorganize the categories it lists and add a few ideas of our own (again, see the notebook). Then, we ask ChatGPT to iterate on our work. This is actually a good pattern:

ChatGPT suggested 'generating ideas for how to respond to the email' as a use case, which ironically we hadn't considered (even though we had already listed 6 broad use cases and were literally using ChatGPT to generate ideas).

We need some concrete data (in our case, emails) to test our model on.
We start by simply asking ChatGPT to generate various kinds of emails:

(Output truncated for space reasons)

ChatGPT writes mostly short emails, but it does cover a variety of situations. In addition to tweaking the prompt above for more diversity, we can also use existing datasets. For example (see the notebook), we load a dataset of Enron emails and take a small subset, so that we have an initial set of 60 input emails to work with (30 from ChatGPT and 30 from Enron).
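As a sketch of that step, assuming the Enron emails have been downloaded to a local CSV with a body column (a hypothetical layout; the notebook may load them differently), and that chatgpt_emails holds the 30 generated emails:

import pandas as pd

# Hypothetical layout: a local CSV of Enron emails with a 'body' column.
enron = pd.read_csv("enron_emails.csv")
enron_sample = enron["body"].dropna().sample(30, random_state=0).tolist()

# `chatgpt_emails` is the list of 30 emails generated by ChatGPT above.
emails = chatgpt_emails + enron_sample
assert len(emails) == 60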

Now that we have a list of use cases and some data to explore them with, we can move to the next step.

2. Consider behaviors and properties, write tests.

It is possible (and really useful) to use the same ideation process as above for this step (i.e. ask the LLM to generate ideas, select and tweak the best ones, and then ask the LLM to generate more ideas based on our selection). However, for space reasons we pick a few use cases that are straightforward to test, and test just the most basic properties. While one might want to test some use cases more exhaustively (e.g. even using CheckList capabilities as in here), we'll only scratch the surface below.

We ask our tool to write a response that politely says no to each of our 60 input emails. Then, we verify the output with the question Is the response a polite way of saying no to the email? Note that we could have broken the question down into two separate properties if precision was low.
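In code, this test loop looks roughly like the sketch below. We pass the email and the response together to the single-text evaluator just to keep the sketch self-contained; the pair version of the classifier would also work here:

# Run the instruction on every email and evaluate the property.
instruction = "Write a response that politely says no"
question = "Is the RESPONSE a polite way of saying no to the EMAIL?"

failures = []
for email in emails:
    out = email_format(instruction=instruction, input=email, source="EMAIL")
    response = out["answer"].replace("OUTPUT:", "").strip()
    check = classifier_single(
        query=question,
        input=f"EMAIL:\n{email}\n\nRESPONSE:\n{response}",
    )
    if check["answer"] == "NO":
        failures.append((email, response, check["explanation"]))

print(f"Failure rate: {len(failures) / len(emails):.1%}")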

Surprisingly, the tool fails 53.3% of the time on this simple instruction. Upon inspection, most of the failures have to do with ChatGPT not writing a response at all, e.g.:

While not directly related to its skill in writing full responses, it's good that the test caught this particular failure mode (which we could correct via better prompting). It often happens that trying to test one capability reveals a problem elsewhere.

We ask our tool to Shorten the email by removing everything that is unnecessary. Make sure not to lose any important information. We then evaluate two properties: (1) whether the text is shorter (measured directly by string length), and (2) whether the shortened version loses information, via the question Does the shortened version communicate all of the important information in the original email?
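A sketch of checking both properties for a single (draft, output) pair might look like this, again feeding both texts to the single-text evaluator for simplicity:

def check_shorten(draft: str, output: str):
    # Property 1: mechanical check that the output is actually shorter.
    is_shorter = len(output) < len(draft)
    # Property 2: LLM-evaluated check that no important information was lost.
    check = classifier_single(
        query="Does the shortened version communicate all of the important "
              "information in the original email?",
        input=f"ORIGINAL EMAIL:\n{draft}\n\nSHORTENED VERSION:\n{output}",
    )
    keeps_info = check["answer"] == "YES"
    return is_shorter, keeps_info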

The first property is almost always met, while the second property has a low failure rate of 8.3%, with failures like the following:

We ask our tool to take a received email and Extract any action items that I may need to put in my TODO list. Rather than using our existing input emails, we'll illustrate a technique we haven't talked about yet:

For this use case, we can generate emails with known action items, and then check if the tool can extract at least those. To do so, we take the action item Don't forget to water the plants and ask ChatGPT to paraphrase it 10 times. We then ask it to generate emails containing one of those paraphrases, like the one below:

These emails may have additional action items that are not related to watering the plants. However, this does not matter at all, since the property we're going to check is whether the tool extracts watering the plants as one of the action items, not whether it is the only one. In other words, our question for the output will be Does it talk about watering the plants?
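Concretely, the check for each generated email is a small loop along these lines (generated_emails is an assumed list holding the 10 emails with embedded paraphrases):

instruction = "Extract any action items that I may need to put in my TODO list"

failures = 0
for email in generated_emails:  # each email contains a watering-the-plants paraphrase
    out = email_format(instruction=instruction, input=email, source="EMAIL")
    todo_list = out["answer"].replace("OUTPUT:", "").strip()
    check = classifier_single(
        query="Does it talk about watering the plants?",
        input=todo_list,
    )
    if check["answer"] == "NO":
        failures += 1

print(f"{failures}/10 generated emails failed")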

Our email assistant prompt fails on 4/10 generated emails, saying that "There are no action items in the highlighted text", even though (by design) we know there is at least one action item in there. This is a high failure rate for such easy examples. Of course, if we were testing for real we would have a variety of embedded action items (rather than just this one example), and we would also check for other properties (e.g. whether the tool extracts all action items, whether it extracts only action items, etc). However, we'll now switch gears and look at an example of metamorphic testing.

Sticking with this use case (extracting action items), we return to our original 60 input emails. We'll test the tool's robustness by paraphrasing the instruction and verifying whether the output list has the same action items. Note that we are not testing whether the output is correct, but whether the model is consistent under paraphrased instructions (which in itself is an important property).

For presentation reasons we only paraphrase the original instruction once (in practice we would have many paraphrases of multiple instructions):
Orig: Extract any action items that I may need to put in my TODO list
Paraphrase: List any action items in the email that I might want to put in a TODO list
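The metamorphic test itself is a small loop: run both instructions on the same emails and ask whether the two outputs have the same meaning. A sketch, again using the single-text evaluator on the concatenated outputs for simplicity:

orig = "Extract any action items that I may need to put in my TODO list"
para = "List any action items in the email that I might want to put in a TODO list"

failures = 0
for email in emails:
    out1 = email_format(instruction=orig, input=email, source="EMAIL")["answer"]
    out2 = email_format(instruction=para, input=email, source="EMAIL")["answer"]
    check = classifier_single(
        query="Do the texts have the same meaning?",
        input=f"TEXT 1:\n{out1}\n\nTEXT 2:\n{out2}",
    )
    if check["answer"] == "NO":
        failures += 1

print(f"Failure rate: {failures / len(emails):.1%}")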

We then verify the property of whether the outputs of these different instructions have the same meaning (which they should, if they contain the same bullet points). The failure rate is 16.7%, with failures like the following:

Again, our evaluator appears to be working fine on the examples we have. Unfortunately, the model has a fairly high failure rate on this robustness test, extracting different action items when we paraphrase the instruction.

3. Drill down on discovered bugs

Let's return to our example of making a draft more concise, where we had a low error rate (8.3%). We can often find error patterns if we drill down into these errors. Here is a very simple prompt to do that, which is a quick-and-dirty emulation of AdaTest, where we optimized the prompt / UI much more (we're just trying to illustrate the principle here):

prompt = '''I have a tool that takes an email and makes it more concise, without losing any important information.
I will show you a few emails where the tool fails to do its job, since the output is missing important information.
Your goal is to try to come up with more emails that the tool would fail on.
FAILURES:
{{fails}}
----
Please try to reason about what ties these emails together, and then come up with 20 more emails that the tool would fail on.
Please use the same format as above, i.e. just the email body, no header or subject, and start each email with "EMAIL:".
'''

We run this prompt with the few discovered failures:
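A sketch of how that can be wired up with guidance, wrapping the string above into a chat program (the generation parameters are illustrative, and fails is assumed to be a string containing the failing emails, each prefixed with "EMAIL:"):

drill_down = guidance('''
{{~#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
''' + prompt + '''
{{~/user}}

{{#assistant~}}
{{gen 'new_emails' temperature=0.9 max_tokens=2000}}
{{~/assistant~}}''')

out = drill_down(fails=fails)
print(out['new_emails'])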

(Output truncated for space reasons)

ChatGPT provided a hypothesis for what ties those emails together. Whether that hypothesis is right or wrong, we can see how the model does on the new examples it generates. Indeed, the failure rate on the same property (Does the shortened version communicate all of the important information in the original email?) is now much higher (23.5%), with similar failures as before.

It does seem like ChatGPT latched on to some kind of pattern. While we don't have enough data yet to know whether it is a real pattern or not, this illustrates the drill-down strategy: use the failures you have already found to generate more examples like them, and look for patterns. We are very confident that this strategy works, because we have tried it in a lot of different scenarios, models, and applications (with AdaTest). In real testing, we would keep iterating on this process until we found real patterns, go back to the model (or in this case, the prompt) to fix the bugs, and then iterate again.
But now it’s time to wrap up this blog post 🙂

Conclusion

Here is a TL;DR of this whole post (not written by ChatGPT, we promise):

  • We think it's a good idea to test LLMs just like we test software. Testing does not replace benchmarks, but complements them.
  • If you can't specify a single right answer, and/or you don't have a labeled dataset, specify properties of the output or of groups of outputs. You can often use the LLM itself to evaluate such properties with high accuracy, since perception is easier than generation.
  • If you don't know what to test, get the LLM to help you figure it out. Generate potential use cases and potential inputs, and then think of properties you can test. If you find bugs, get the LLM to drill down on them to find patterns you can later fix.

Now, it's obvious that the process is much less linear and clean than how we described it here: it is not unusual that testing a property leads to discovering a new use case you hadn't thought of, and maybe even makes you realize you have to redesign your tool in the first place. However, having a stylized process is still helpful, and the kinds of techniques we describe here are very useful in practice.

Testing is definitely a laborious process (although using LLMs like we did above makes it much easier), but consider the alternatives. It is really hard to benchmark generation tasks with multiple right answers, and thus we often don't trust the benchmarks for these tasks. Collecting human judgments on the current model's output is even more laborious, and does not transfer well when you iterate on the model (suddenly your labels are not as useful anymore). Not testing usually means you don't really know how your model behaves, which is a recipe for disaster. Testing, on the other hand, often leads to (1) finding bugs, (2) insight into the task itself, and (3) discovering severe problems in the specification early, which allows for pivoting before it's too late. On balance, we think testing is time well spent.

— — — — — — — — — -
Here is a link to the Jupyter notebook with code for all of the examples above (and more). This post was written jointly by Marco Tulio Ribeiro and Scott Lundberg.
