What I Learned Pushing Prompt Engineering to the Limit
General Lessons
Prompting Techniques
Examples
Tooling
Conclusion

Satirical depiction of prompt engineering. Paradoxically, the DALL-E2 image was generated by the author using prompt engineering with the prompt “a mad scientist handing over a scroll to an artificially intelligent robot, generated in a retro style”, plus a variation, plus outpainting.

I spent the past two months building a large-language-model (LLM) powered application. It was an exciting, intellectually stimulating, and at times frustrating experience. My entire conception of prompt engineering, and of what is possible with LLMs, changed over the course of the project.

I’d like to share some of my biggest takeaways with you, with the goal of shedding light on some of the often unspoken aspects of prompt engineering. I hope that after reading about my trials and tribulations, you’ll be able to make more informed prompt engineering decisions. If you’ve already dabbled in prompt engineering, I hope this helps you push forward in your own journey!

For context, here is the TL;DR on the project we’ll be learning from:

  • My team and I built VoxelGPT, an application that combines LLMs with the FiftyOne computer vision query language to enable searching through image and video datasets via natural language. VoxelGPT also answers questions about FiftyOne itself.
  • VoxelGPT is open source (so is FiftyOne!). All the code is available on GitHub.
  • You can try VoxelGPT for free at gpt.fiftyone.ai.
  • If you’re curious how we built VoxelGPT, you can read more about it on TDS here.

Now, I’ve split the prompt engineering lessons into four categories:

  1. General Lessons
  2. Prompting Techniques
  3. Examples
  4. Tooling

Science? Engineering? Black Magic?

Prompt engineering is as much experimentation as it is engineering. There are an infinite number of ways to write a prompt, from the precise wording of your query to the content and formatting of the context you feed in. It can be overwhelming. I found it easiest to start simple, build up an intuition, and then test out hypotheses.

In computer vision, each dataset has its own schema, label types, and class names. The goal for VoxelGPT was to be able to work with any computer vision dataset, but we started with just a single dataset: MS COCO. Keeping all the additional degrees of freedom fixed allowed us to nail down the LLM’s ability to write syntactically correct queries in the first place.

Once you’ve found a formula that is successful in a limited context, then figure out how to generalize and build upon it.

Which Model(s) to Use?

People say that one of the essential characteristics of large language models is that they’re relatively interchangeable. In theory, you should be able to swap one LLM out for another without substantially changing the connective tissue.

While it’s true that changing the LLM you use is often as simple as swapping out an API call (a thin wrapper like the one sketched after this list can help keep the change that small), there are definitely some difficulties that arise in practice.

  • Some models have much shorter context lengths than others. Switching to a model with a shorter context can require major refactoring.
  • Open source is great, but open source LLMs are not (yet) as performant as GPT models. Plus, if you are deploying an application with an open source LLM, you will need to make sure that the container running the model has enough memory and storage. This can end up being more troublesome (and more expensive) than just using API endpoints.
  • If you start using GPT-4 and then switch to GPT-3.5 because of cost, you may be shocked by the drop-off in performance. For complicated code generation and inference tasks, GPT-4 is MUCH better.
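To keep model swaps cheap, it can help to route every completion through one small wrapper so the model name (and its context budget) lives in a single place. The sketch below is a minimal illustration, assuming the openai Python package’s chat completion interface as it existed when this project was built; the function name and token budget are made up for the example.

```python
import openai

# Hypothetical config: the model name and context budget live in one place,
# so swapping GPT-4 for GPT-3.5 (or anything else) is a one-line change.
MODEL_CONFIG = {"model": "gpt-3.5-turbo", "max_context_tokens": 4096}

def get_completion(prompt, temperature=0.0):
    """Send a single-turn prompt to the configured chat model and return the text."""
    response = openai.ChatCompletion.create(
        model=MODEL_CONFIG["model"],
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"]
```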

Where to Use LLMs?

Large language models are powerful. But just because they may be capable of certain tasks doesn’t mean you need to, or even should, use them for those tasks. The best way to think about LLMs is as enablers. LLMs are not the WHOLE solution: they are only a part of it. Don’t expect large language models to do everything.

For instance, it may be the case that the LLM you are using can (under ideal circumstances) generate properly formatted API calls. But if you know what the structure of the API call should look like, and you are really only interested in filling in sections of the API call (variable names, conditions, etc.), then just use the LLM to do those tasks, and use the (properly post-processed) LLM outputs to generate structured API calls yourself. This will be cheaper, more efficient, and more reliable.
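As a concrete illustration of this fill-in-the-blanks pattern, you can ask the model only for the pieces you can’t template, then assemble the call yourself. This is a minimal sketch rather than VoxelGPT’s actual code: `call_llm` is a stand-in for whatever completion function you use, and the template and JSON schema are invented for the example.

```python
import json

# Hypothetical template: the structure is fixed, only the slots come from the LLM
API_TEMPLATE = 'dataset.match(F("{field}") {operator} {value})'

def build_api_call(user_query, call_llm):
    """Use the LLM only to fill in slots, then assemble the structured call ourselves."""
    prompt = (
        "Extract the field name, comparison operator, and value from the request below. "
        'Respond with JSON of the form {"field": "...", "operator": "...", "value": "..."}.\n'
        "Request: " + user_query
    )
    slots = json.loads(call_llm(prompt))
    return API_TEMPLATE.format(**slots)
```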

A complete system built around LLMs will certainly have a lot of connective tissue and classical logic, plus a slew of traditional software engineering and ML engineering components. Find what works best for your application.

LLMs Are Biased

Language models are both inference engines and knowledge stores. Oftentimes, the knowledge store aspect of an LLM can be of great interest to users; many people use LLMs as search engine replacements! By now, anyone who has used an LLM knows that they’re prone to making up fake “facts”, a phenomenon known as hallucination.

Sometimes, however, LLMs suffer from the opposite problem: they’re too firmly fixated on facts from their training data.

In our case, we were attempting to prompt GPT-3.5 to determine the appropriate ViewStages (pipelines of logical operations) required to convert a user’s natural language query into a valid FiftyOne Python query. The problem was that GPT-3.5 knew about the `Match` and `FilterLabels` ViewStages, which have existed in FiftyOne for some time, but its training data did not include recently added functionality in which a `SortBySimilarity` ViewStage can be used to find images that resemble a text prompt.

We tried passing in a definition of `SortBySimilarity`, details about its usage, and examples. We even tried instructing GPT-3.5 that it MUST NOT use the `Match` or `FilterLabels` ViewStages, or else it would be penalized. No matter what we tried, the LLM still oriented itself towards what it knew, whether it was the right choice or not. We were fighting against the LLM’s instincts!

We ended up having to deal with this issue in post-processing.

Painful Post-Processing Is Inevitable

No matter how good your examples are and no matter how strict your prompts are, large language models will invariably hallucinate, give you improperly formatted responses, and throw a tantrum when they don’t understand the input. The most predictable property of LLMs is the unpredictability of their outputs.

I spent an ungodly amount of time writing routines to pattern match for and correct hallucinated syntax. The post-processing file ended up containing almost 1600 lines of Python code!

Some of these subroutines were as straightforward as adding parentheses, or changing “and” and “or” to “&” and “|” in logical expressions. Other subroutines were far more involved, like validating the names of the entities in the LLM’s responses, converting one ViewStage to another if certain conditions were met, and ensuring that the numbers and types of arguments to methods were valid.
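For flavor, here is roughly what the simplest of those subroutines look like. This is a simplified sketch, not the actual VoxelGPT post-processing code; the regex patterns and function names are illustrative.

```python
import re

def fix_boolean_operators(expression):
    """Replace word-style booleans with the symbols that query expressions expect."""
    expression = re.sub(r"\band\b", "&", expression)
    expression = re.sub(r"\bor\b", "|", expression)
    return expression

def balance_parentheses(expression):
    """Append any closing parentheses the LLM forgot (a crude but common fix)."""
    missing = expression.count("(") - expression.count(")")
    return expression + ")" * max(missing, 0)
```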

If you are using prompt engineering in a somewhat confined code generation context, I’d recommend the following approach:

  1. Write your own custom error parser using Abstract Syntax Trees (Python’s ast module).
  2. If the results are syntactically invalid, feed the generated error message into your LLM and have it try again (a minimal retry loop is sketched below).
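Here is a minimal sketch of that validate-and-retry loop, assuming a generic `llm` callable that takes a prompt string and returns a string; it is not VoxelGPT’s actual implementation.

```python
import ast

def generate_valid_python(prompt, llm, max_retries=3):
    """Generate code, validate it with ast.parse(), and feed syntax errors back to the LLM."""
    response = llm(prompt)
    for _ in range(max_retries):
        try:
            ast.parse(response)
            return response  # syntactically valid
        except SyntaxError as err:
            prompt_with_error = (
                f"{prompt}\n\nYour previous answer:\n{response}\n\n"
                f"raised this syntax error: {err}\n"
                "Please return corrected, valid Python code only."
            )
            response = llm(prompt_with_error)
    return response  # last attempt, possibly still invalid
```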

This approach fails to handle the more insidious case where the syntax is valid but the results are not right. If anyone has a good suggestion for this (beyond AutoGPT and “show your work” style approaches), please let me know!

The More the Merrier

To build VoxelGPT, I used what seemed like every prompting technique under the sun:

  • “You are an expert”
  • “Your task is”
  • “You MUST”
  • “You will be penalized”
  • “Here are the rules”

No combination of such phrases will guarantee a certain kind of behavior. Clever prompting is not enough.

That being said, the more of these techniques you employ in a prompt, the more you nudge the LLM in the right direction!

Examples > Documentation

It is common knowledge by now (and common sense!) that both examples and other contextual information like documentation can help elicit better responses from a large language model. I found this to be the case for VoxelGPT.

Once you add all of the directly pertinent examples and documentation, though, what should you do if you have extra room in the context window? In my experience, I found that tangentially related examples mattered more than tangentially related documentation.

Modularity >> Monolith

The more you can break down an overarching problem into smaller subproblems, the better. Rather than feeding the model the dataset schema and a list of end-to-end examples, it is much more effective to identify individual selection and inference steps (selection-inference prompting), and feed in only the relevant information at each step (see the sketch after the list below).

This is preferable for three reasons:

  1. LLMs are better at doing one task at a time than multiple tasks at once.
  2. The smaller the steps, the easier it is to sanitize inputs and outputs.
  3. It’s an important exercise for you as the engineer to understand the logic of your application. The point of LLMs isn’t to make the world a black box. It’s to enable new workflows.
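Here is a minimal sketch of what a selection-inference split can look like: one prompt selects the relevant dataset fields, and a second prompt does the actual query generation using only those fields. The function names and prompt wording are invented for the example, and `llm` stands in for whatever completion call you use.

```python
def select_relevant_fields(user_query, schema_fields, llm):
    """Selection step: narrow the schema down to the fields the query actually needs."""
    prompt = (
        "Dataset fields:\n" + "\n".join(schema_fields) + "\n\n"
        f"List (comma-separated) only the fields relevant to this request: {user_query}"
    )
    return [field.strip() for field in llm(prompt).split(",")]

def generate_query(user_query, relevant_fields, examples, llm):
    """Inference step: generate the query using only the selected fields and examples."""
    prompt = (
        "Relevant fields: " + ", ".join(relevant_fields) + "\n"
        "Examples:\n" + "\n".join(examples) + "\n\n"
        f"Translate this request into a dataset query: {user_query}"
    )
    return llm(prompt)
```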

How Many Do I Need?

A big part of prompt engineering is figuring out how many examples you need for a given task. This is highly problem specific.

For some tasks (effective query generation and answering questions based on the FiftyOne documentation), we were able to get away without any examples. For others (tag selection, whether or not chat history is relevant, and named entity recognition for label classes), we just needed a few examples to get the job done. Our main inference task, however, has almost 400 examples (and that is still the limiting factor in overall performance), so we only pass in the most relevant examples at inference time.

When you are generating examples, try to follow two guidelines:

  1. Be as comprehensive as possible. If you have a finite space of possibilities, then try to give the LLM at least one example for each case. For VoxelGPT, we tried to have at the very least one example for each syntactically correct way of using each ViewStage, and typically a few examples for each, so the LLM can do pattern matching.
  2. Be as consistent as possible. If you are breaking the task down into multiple subtasks, make sure the examples are consistent from one task to the next. You can reuse examples!

Synthetic Examples

Generating examples is a laborious process, and handcrafted examples can only take you so far. It’s simply not possible to think of every possible scenario ahead of time. Once you deploy your application, you can log user queries and use these to improve your example set.

Prior to deployment, however, your best bet might be to generate synthetic examples.

Here are two approaches to generating synthetic examples that you might find helpful:

  1. Use an LLM to generate examples. You can ask the LLM to vary its language, or even to imitate the style of potential users! This didn’t work for us, but I’m convinced it could work for many applications.
  2. Programmatically generate examples, potentially with randomness, based on elements of the input query itself. For VoxelGPT, this means generating examples based on the fields in the user’s dataset. We’re in the process of incorporating this into our pipeline, and the results we’ve seen so far have been promising (a toy sketch of the idea follows this list).
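To make the second approach concrete, here is a toy sketch of template-based synthetic example generation. The templates and FiftyOne-style snippets are illustrative only and are not taken from VoxelGPT’s pipeline.

```python
import random

# Hypothetical (natural language, code) templates keyed on a dataset field and a number
TEMPLATES = [
    ("show me samples with more than {n} {field} detections",
     'dataset.match(F("{field}.detections").length() > {n})'),
    ("samples with at most {n} {field} detections",
     'dataset.match(F("{field}.detections").length() <= {n})'),
]

def generate_synthetic_examples(dataset_fields, num_examples=20):
    """Randomly combine templates with the fields actually present in the user's dataset."""
    examples = []
    for _ in range(num_examples):
        nl_template, code_template = random.choice(TEMPLATES)
        field = random.choice(dataset_fields)
        n = random.randint(1, 10)
        examples.append((
            nl_template.format(field=field, n=n),
            code_template.format(field=field, n=n),
        ))
    return examples
```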

LangChain

LangChain is popular for a reason: the library makes it easy to connect LLM inputs and outputs in complex ways, abstracting away the gory details. The Models and Prompts modules especially are top notch.

That being said, LangChain is definitely a work in progress: its Memories, Indexes, and Chains modules all have significant limitations. Here are just a few of the issues I encountered when trying to use LangChain:

  1. Document Loaders and Text Splitters: In LangChain, Document Loaders are supposed to transform data from different file formats into text, and Text Splitters are supposed to split text into semantically meaningful chunks. VoxelGPT answers questions about the FiftyOne documentation by retrieving the most relevant chunks of the docs and piping them into a prompt. In order to generate meaningful answers to questions about the FiftyOne docs, I had to effectively build custom loaders and splitters, because LangChain didn’t provide the appropriate flexibility.
  2. Vectorstores: LangChain offers Vectorstore integrations and Vectorstore-based Retrievers to help find relevant information to incorporate into LLM prompts. This is great in theory, but the implementations are lacking in flexibility. I had to write a custom implementation with ChromaDB in order to pass embedding vectors in ahead of time and not have them recomputed every time I ran the application. I also had to write a custom retriever to implement the custom pre-filtering I needed.
  3. Question Answering with Sources: When building out question answering over the FiftyOne docs, I arrived at a reasonable solution using LangChain’s `RetrievalQA` Chain. When I wanted to add sources in, I thought it would be as straightforward as swapping that chain out for LangChain’s `RetrievalQAWithSourcesChain`. However, bad prompting techniques meant that this chain exhibited some unfortunate behavior, such as hallucinating about Michael Jackson. Once again, I had to take matters into my own hands.

What does all of this mean? It may be easier to just build the components yourself!

Vector Databases

Vector search may be on 🔥🔥🔥, but that doesn’t mean you NEED it for your project. I initially implemented our similar example retrieval routine using ChromaDB, but because we only had hundreds of examples, I ended up switching to an exact nearest neighbor search. I did have to deal with all of the metadata filtering myself, but the result was a faster routine with fewer dependencies.
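With only hundreds of examples, exact search over an in-memory matrix of embeddings is plenty fast. Here is a minimal sketch, assuming you already have nonzero embedding vectors for the query and the examples; the function name is made up for the example.

```python
import numpy as np

def top_k_similar_examples(query_embedding, example_embeddings, examples, k=5):
    """Exact nearest neighbor search by cosine similarity; no vector database needed."""
    q = query_embedding / np.linalg.norm(query_embedding)
    E = example_embeddings / np.linalg.norm(example_embeddings, axis=1, keepdims=True)
    similarities = E @ q
    top_indices = np.argsort(similarities)[::-1][:k]
    return [examples[i] for i in top_indices]
```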

TikToken

Adding TikToken into the equation was incredibly easy. In total, TikToken added fewer than 10 lines of code to the project, but it allowed us to be much more precise when counting tokens and trying to fit as much information as possible into the context length. This is the one true no-brainer when it comes to tooling.
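For reference, token counting with tiktoken really is about this small. The budget-packing helper below is a hypothetical illustration of how such counts can be used, not code from the project.

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def num_tokens(text):
    """Count tokens the way the target model will."""
    return len(encoding.encode(text))

def pack_examples(examples, token_budget=3000):
    """Greedily add examples to the prompt until the token budget is spent."""
    packed, used = [], 0
    for example in examples:
        cost = num_tokens(example)
        if used + cost > token_budget:
            break
        packed.append(example)
        used += cost
    return packed
```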

There are tons of LLMs to choose from, lots of shiny new tools, and a bunch of “prompt engineering” techniques. All of this can be both exciting and overwhelming. The key to building an application with prompt engineering is to:

  1. Break the problem down; build the solution up
  2. Treat LLMs as enablers, not as end-to-end solutions
  3. Only use tools when they make your life easier
  4. Embrace experimentation!

Go build something cool!
