What I Learned Pushing Prompt Engineering to the Limit

Satirical depiction of prompt engineering. Ironically, the DALL-E2 generated image was generated by the author using prompt engineering with the prompt “a mad scientist handing over a scroll to an artificially intelligent robot, generated in a retro style”, plus a variation, plus outpainting.

I spent the past two months building a large-language-model (LLM) powered application. It was an exciting, intellectually stimulating, and at times frustrating experience. My entire conception of prompt engineering, and of what is possible with LLMs, changed over the course of the project.

I’d like to share some of my biggest takeaways with the goal of shedding light on some of the often unspoken aspects of prompt engineering. I hope that after reading about my trials and tribulations, you’ll be able to make more informed prompt engineering decisions. If you’ve already dabbled in prompt engineering, I hope that this helps you push forward in your own journey!

For context, here is the TL;DR on the project we’ll be learning from:

  • My team and I built VoxelGPT, an application that combines LLMs with the FiftyOne computer vision query language to enable searching through image and video datasets via natural language. VoxelGPT also answers questions about FiftyOne itself.
  • VoxelGPT is open source (so is FiftyOne!). All the code is available on GitHub.
  • You can try VoxelGPT for free at gpt.fiftyone.ai.
  • If you’re curious how we built VoxelGPT, you can read more about it on TDS here.

Now, I’ve split the prompt engineering lessons into four categories:

  1. General Lessons
  2. Prompting Techniques
  3. Examples
  4. Tooling

Science? Engineering? Black Magic?

Prompt engineering is as much experimentation as it is engineering. There are an infinite number of ways to write a prompt, from the precise wording of your query to the content and formatting of the context you feed in. It can be overwhelming. I found it easiest to start simple and build up an intuition, and then test out hypotheses.

In computer vision, each dataset has its own schema, label types, and class names. The goal for VoxelGPT was to be able to work with any computer vision dataset, but we started with just a single dataset: MS COCO. Keeping all of these additional degrees of freedom fixed allowed us to home in on the LLM’s ability to write syntactically correct queries in the first place.

Once you’ve determined a formula that’s successful in a limited context, then figure out how to generalize and build upon it.

Which Model(s) to Use?

People say that one of the most important characteristics of large language models is that they’re relatively interchangeable. In theory, you should be able to swap one LLM out for another without substantially changing the connective tissue.

While it’s true that changing the LLM you use is often as simple as swapping out an API call, there are definitely some difficulties that arise in practice (see the sketch after the list below).

  • Some models have much shorter context lengths than others. Switching to a model with a shorter context can require major refactoring.
  • Open source is great, but open source LLMs aren’t as performant (yet) as GPT models. Plus, if you are deploying an application with an open source LLM, you will need to make sure that the container running the model has enough memory and storage. This can end up being more troublesome (and more expensive) than just using API endpoints.
  • If you start using GPT-4 and then switch to GPT-3.5 due to cost, you may be shocked by the drop-off in performance. For complicated code generation and inference tasks, GPT-4 is MUCH better.
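One way to make swapping less painful is to keep the model behind a thin interface, so the rest of the application never touches a provider-specific call directly. Here is a minimal sketch of that idea; the class names are my own illustration, not code from VoxelGPT, and the API call assumes the pre-1.0 `openai` Python package.

```python
from typing import Protocol

import openai  # assumes the pre-1.0 `openai` package; adapt for newer clients


class LLM(Protocol):
    """The only interface the rest of the application depends on."""

    context_length: int

    def complete(self, prompt: str) -> str: ...


class OpenAIChatModel:
    """One concrete backend; an open source model would implement the same interface."""

    def __init__(self, model: str = "gpt-3.5-turbo", context_length: int = 4096):
        self.model = model
        self.context_length = context_length

    def complete(self, prompt: str) -> str:
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response["choices"][0]["message"]["content"]
```

Exposing `context_length` on the interface also forces you to confront the shorter-context problem up front, instead of discovering it mid-refactor.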

Where to Use LLMs?

Large language models are powerful. But just because they may be capable of certain tasks doesn’t mean you need to, or even should, use them for those tasks. The best way to think about LLMs is as enablers. LLMs aren’t the WHOLE solution: they are just a component of it. Don’t expect large language models to do everything.

For example, it may be the case that the LLM you’re using can (under ideal circumstances) generate properly formatted API calls. But if you know what the structure of the API call should look like, and you’re really just interested in filling in sections of the API call (variable names, conditions, etc.), then just use the LLM to do those tasks, and use the (properly post-processed) LLM outputs to generate structured API calls yourself. This will be cheaper, more efficient, and more reliable.
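As a concrete (and heavily simplified) illustration of this division of labor: ask the LLM only for the slots, validate them, and assemble the call yourself. Everything here (the `call_llm` helper, the allowed fields, the output format) is a hypothetical placeholder, not VoxelGPT's actual code.

```python
import json

ALLOWED_FIELDS = {"label", "confidence", "uniqueness"}

SLOT_PROMPT = """Extract the field name, comparison operator, and value from the
user's request. Respond with JSON only, e.g. {{"field": "confidence", "op": ">", "value": 0.9}}.

Request: {query}"""


def build_filter_call(query: str, call_llm) -> str:
    """Use the LLM only to fill in slots, then construct the structured call ourselves."""
    raw = call_llm(SLOT_PROMPT.format(query=query))
    slots = json.loads(raw)  # in practice, wrap this in error handling / retries

    # Validate the LLM's output against what we know to be legal before using it
    if slots["field"] not in ALLOWED_FIELDS:
        raise ValueError(f"Unknown field: {slots['field']}")
    if slots["op"] not in {">", ">=", "<", "<=", "=="}:
        raise ValueError(f"Unknown operator: {slots['op']}")

    # The surrounding structure is fixed by us, not generated by the model
    return f'dataset.match(F("{slots["field"]}") {slots["op"]} {slots["value"]})'
```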

A complete system built with LLMs will certainly have a lot of connective tissue and classical logic, plus a slew of traditional software engineering and ML engineering components. Find what works best for your application.

LLMs Are Biased

Language models are both inference engines and knowledge stores. Oftentimes, the knowledge store aspect of an LLM is of great interest to users; many people use LLMs as search engine replacements! By now, anyone who has used an LLM knows that they’re prone to making up fake “facts”, a phenomenon known as hallucination.

Sometimes, however, LLMs suffer from the opposite problem: they’re too firmly fixated on facts from their training data.

In our case, we were attempting to prompt GPT-3.5 to determine the appropriate ViewStages (pipelines of logical operations) required to convert a user’s natural language query into a valid FiftyOne Python query. The problem was that GPT-3.5 knew about the `Match` and `FilterLabels` ViewStages, which have existed in FiftyOne for some time, but its training data did not include recently added functionality in which a `SortBySimilarity` ViewStage can be used to find images that resemble a text prompt.
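For readers who haven't used FiftyOne, here is roughly what the target output looks like. This is my own illustrative snippet (using the zoo's quickstart dataset), not VoxelGPT's actual generated code, and the text-similarity stage assumes a similarity index has already been computed with a CLIP-style model.

```python
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Stages GPT-3.5 knows well from its training data
view = dataset.match(F("uniqueness") > 0.5).filter_labels(
    "ground_truth", F("label") == "cat"
)

# The newer stage it kept avoiding: text-prompt similarity search
# (assumes a similarity index was computed with a CLIP-style model beforehand)
similar_view = dataset.sort_by_similarity("a photo of a dog", k=25)
```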

We tried passing in a definition of `SortBySimilarity`, details about its usage, and examples. We even tried instructing GPT-3.5 that it MUST NOT use the `Match` or `FilterLabels` ViewStages, or else it will be penalized. No matter what we tried, the LLM still oriented itself towards what it knew, whether it was the right choice or not. We were fighting against the LLM’s instincts!

We ended up having to deal with this issue in post-processing.

Painful Post-Processing Is Inevitable

No matter how good your examples are, and no matter how strict your prompts are, large language models will invariably hallucinate, give you improperly formatted responses, and throw a tantrum when they don’t understand input information. The most predictable property of LLMs is the unpredictability of their outputs.

I spent an ungodly amount of time writing routines to pattern match for and correct hallucinated syntax. The post-processing file ended up containing almost 1600 lines of Python code!

Some of these subroutines were as straightforward as adding parentheses, or changing “and” and “or” to “&” and “|” in logical expressions. Others were much more involved, like validating the names of the entities in the LLM’s responses, converting one ViewStage to another if certain conditions were met, and ensuring that the numbers and types of arguments to methods were valid.
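To give a flavor of the simpler end of that spectrum, here is a sketch of what such a subroutine might look like. It is deliberately naive (it would happily rewrite "and"/"or" inside quoted strings, for instance), which hints at why this file grew so large.

```python
import re


def fix_logical_operators(expr: str) -> str:
    """Replace bare 'and'/'or' with the '&'/'|' operators the query language expects."""
    expr = re.sub(r"\band\b", "&", expr)
    expr = re.sub(r"\bor\b", "|", expr)
    return expr


def balance_parens(expr: str) -> str:
    """Append any closing parentheses the LLM forgot."""
    open_count = expr.count("(") - expr.count(")")
    return expr + ")" * max(open_count, 0)
```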

If you are using prompt engineering in a somewhat confined code generation context, I’d recommend the following approach (a minimal sketch follows the list):

  1. Write your own custom error parser using Abstract Syntax Trees (Python’s ast module).
  2. If the results are syntactically invalid, feed the generated error message into your LLM and have it try again.
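Here is a minimal sketch of that loop, assuming the generated code is Python and treating `call_llm` as a placeholder for however you invoke your model:

```python
import ast


def generate_valid_code(prompt: str, call_llm, max_attempts: int = 3) -> str:
    """Ask the LLM for code, and feed syntax errors back to it until it parses."""
    response = call_llm(prompt)
    for _ in range(max_attempts):
        try:
            ast.parse(response)  # raises SyntaxError if the code is malformed
            return response
        except SyntaxError as err:
            retry_prompt = (
                f"{prompt}\n\nYour previous answer:\n{response}\n"
                f"failed to parse with error: {err}. Please fix it and try again."
            )
            response = call_llm(retry_prompt)
    raise RuntimeError("Could not obtain syntactically valid code")
```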

This approach fails to address the more insidious case where the syntax is valid but the results aren’t right. If anyone has a good suggestion for this (beyond AutoGPT and “show your work” style approaches), please let me know!

The More the Merrier

To build VoxelGPT, I used what seemed like every prompting technique under the sun:

  • “You are an expert”
  • “Your task is”
  • “You MUST”
  • “You will be penalized”
  • “Here are the rules”

No combination of such phrases will guarantee a certain type of behavior. Clever prompting is not enough.

That being said, the more of these techniques you employ in a prompt, the more you nudge the LLM in the right direction!
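For what it's worth, a prompt that stitches several of these nudges together looks something like the template below. This is an illustrative composite of my own, not VoxelGPT's actual prompt:

```python
PROMPT_TEMPLATE = """You are an expert in the FiftyOne query language.

Your task is to convert the user's natural language request into a valid query.

You MUST use only the view stages listed below. If you use anything else,
you will be penalized.

Here are the rules:
{rules}

Here are some examples:
{examples}

User request: {query}
Query:"""
```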

Examples > Documentation

It is common knowledge by now (and common sense!) that both examples and other contextual information like documentation can help elicit better responses from a large language model. I found this to be the case for VoxelGPT.

Once you add all of the directly pertinent examples and documentation though, what should you do if you have extra room in the context window? In my experience, I found that tangentially related examples mattered more than tangentially related documentation.

Modularity >> Monolith

The more you can break down an overarching problem into smaller subproblems, the better. Rather than feeding the model the dataset schema and a list of end-to-end examples, it’s much more effective to identify individual selection and inference steps (selection-inference prompting), and feed in only the relevant information at each step. A sketch of this two-step pattern appears after the list below.

This is preferable for three reasons:

  1. LLMs are better at doing one task at a time than at multiple tasks at once.
  2. The smaller the steps, the easier it is to sanitize inputs and outputs.
  3. It’s an important exercise for you as the engineer to understand the logic of your application. The point of LLMs isn’t to make the world a black box. It’s to enable new workflows.
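Here is a rough sketch of what that two-step decomposition can look like in code; the prompts and the `call_llm` helper are placeholders for illustration, not the actual VoxelGPT pipeline:

```python
SELECTION_PROMPT = """Given this dataset schema:
{schema}

Which fields are relevant to the request below? Answer with a comma-separated list.

Request: {query}"""

INFERENCE_PROMPT = """Using ONLY these fields and their descriptions:
{relevant_fields}

Write the query for this request: {query}"""


def selection_inference(query: str, schema: dict, call_llm) -> str:
    # Step 1 (selection): narrow the schema down to what matters for this query
    field_names = call_llm(SELECTION_PROMPT.format(schema=schema, query=query))
    selected = {
        name.strip(): schema[name.strip()]
        for name in field_names.split(",")
        if name.strip() in schema
    }

    # Step 2 (inference): generate the query using only the relevant context
    return call_llm(INFERENCE_PROMPT.format(relevant_fields=selected, query=query))
```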

How Many Do I Need?

A big part of prompt engineering is figuring out how many examples you need for a given task. This is highly problem specific.

For some tasks (effective query generation and answering questions based on the FiftyOne documentation), we were able to get away without any examples. For others (tag selection, whether or not chat history is relevant, and named entity recognition for label classes), we needed just a few examples to get the job done. Our main inference task, however, has almost 400 examples (and that is still the limiting factor in overall performance), so we only pass in the most relevant examples at inference time.

When you are generating examples, try to follow two guidelines:

  1. Be as comprehensive as possible. If you have a finite space of possibilities, then try to give the LLM at least one example for each case. For VoxelGPT, we tried to have at the very least one example for each syntactically correct way of using each ViewStage, and typically a few examples for each, so the LLM can do pattern matching.
  2. Be as consistent as possible. If you are breaking the task down into multiple subtasks, make sure that the examples are consistent from one task to the next. You can reuse examples!

Synthetic Examples

Generating examples is a laborious process, and handcrafted examples can only take you so far. It’s just not possible to think of every possible scenario ahead of time. When you deploy your application, you can log user queries and use these to improve your example set.

Prior to deployment, however, your best bet might be to generate synthetic examples.

Here are two approaches to generating synthetic examples that you might find helpful:

  1. Use an LLM to generate examples. You can ask the LLM to vary its language, or even imitate the style of potential users! This didn’t work for us, but I’m convinced it could work for many applications.
  2. Programmatically generate examples, potentially with randomness, based on elements in the input query itself. For VoxelGPT, this means generating examples based on the fields in the user’s dataset. We’re in the process of incorporating this into our pipeline, and the results we’ve seen so far have been promising. A toy sketch of this idea follows the list.
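Here is a toy version of the programmatic approach, with made-up templates and fields purely for illustration:

```python
import random

# Hypothetical (natural language, query) template pairs
TEMPLATES = [
    ("show me samples where {field} is greater than {value}",
     'dataset.match(F("{field}") > {value})'),
    ("images with at least {value} {field} detections",
     'dataset.match(F("{field}.detections").length() >= {value})'),
]


def generate_synthetic_examples(fields: list[str], n: int = 50) -> list[dict]:
    """Randomly combine templates with the fields actually present in the dataset."""
    examples = []
    for _ in range(n):
        nl_template, query_template = random.choice(TEMPLATES)
        field = random.choice(fields)
        value = random.randint(1, 10)
        examples.append(
            {
                "input": nl_template.format(field=field, value=value),
                "output": query_template.format(field=field, value=value),
            }
        )
    return examples
```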

LangChain

LangChain is popular for a reason: the library makes it easy to connect LLM inputs and outputs in complex ways, abstracting away the gory details. The Models and Prompts modules especially are top notch.

That being said, LangChain is definitely a work in progress: their Memories, Indexes, and Chains modules all have significant limitations. Here are just a few of the issues I encountered when attempting to use LangChain:

  1. Document Loaders and Text Splitters: In LangChain, Document Loaders are supposed to transform data from different file formats into text, and Text Splitters are supposed to split text into semantically meaningful chunks. VoxelGPT answers questions about the FiftyOne documentation by retrieving the most relevant chunks of the docs and piping them into a prompt. In order to generate meaningful answers to questions about the FiftyOne docs, I had to effectively build custom loaders and splitters, because LangChain didn’t provide the appropriate flexibility.
  2. Vectorstores: LangChain offers Vectorstore integrations and Vectorstore-based Retrievers to help find relevant information to incorporate into LLM prompts. This is great in theory, but the implementations are lacking in flexibility. I had to write a custom implementation with ChromaDB in order to pass embedding vectors ahead of time and not have them recomputed every time I ran the application. I also had to write a custom retriever to implement the custom pre-filtering I needed.
  3. Question Answering with Sources: When building out question answering over the FiftyOne docs, I arrived at a reasonable solution using LangChain’s `RetrievalQA` Chain. When I wanted to add sources in, I thought it would be as simple as swapping out that chain for LangChain’s `RetrievalQAWithSourcesChain`. However, bad prompting techniques meant that this chain exhibited some unfortunate behavior, such as hallucinating about Michael Jackson. Once again, I had to take matters into my own hands.

What does all of this mean? It may just be easier to build the components yourself!

Vector Databases

Vector search may be on 🔥🔥🔥, but that doesn’t mean you NEED it in your project. I initially implemented our similar example retrieval routine using ChromaDB, but because we only had hundreds of examples, I ended up switching to an exact nearest neighbor search. I did have to deal with all of the metadata filtering myself, but the result was a faster routine with fewer dependencies.
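At this scale, the "vector database" can simply be a NumPy array. A minimal sketch of the exact-search approach, assuming you have already embedded your examples:

```python
import numpy as np


def get_top_k_examples(
    query_embedding: np.ndarray,
    example_embeddings: np.ndarray,  # shape (num_examples, dim)
    examples: list[str],
    k: int = 10,
) -> list[str]:
    """Exact nearest neighbor search via cosine similarity."""
    query = query_embedding / np.linalg.norm(query_embedding)
    corpus = example_embeddings / np.linalg.norm(example_embeddings, axis=1, keepdims=True)
    similarities = corpus @ query
    top_k = np.argsort(similarities)[-k:][::-1]
    return [examples[i] for i in top_k]
```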

TikToken

Adding TikToken into the equation was incredibly easy. In total, TikToken added fewer than 10 lines of code to the project, but allowed us to be much more precise when counting tokens and trying to fit as much information as possible into the context length. This is the one true no-brainer when it comes to tooling.
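Roughly, the entire integration boils down to something like this; `encoding_for_model` and `encode` are real tiktoken calls, while the greedy packing helper is my own illustration:

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")


def num_tokens(text: str) -> int:
    """Count tokens exactly, rather than estimating from character counts."""
    return len(encoding.encode(text))


def pack_examples(prompt: str, examples: list[str], budget: int = 4096) -> list[str]:
    """Greedily add examples until the token budget is exhausted."""
    used = num_tokens(prompt)
    selected = []
    for example in examples:
        cost = num_tokens(example)
        if used + cost > budget:
            break
        selected.append(example)
        used += cost
    return selected
```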

There are tons of LLMs to choose from, lots of shiny new tools, and a bunch of “prompt engineering” techniques. All of this can be both exciting and overwhelming. The key to building an application with prompt engineering is to:

  1. Break the problem down; build the solution up
  2. Treat LLMs as enablers, not as end-to-end solutions
  3. Only use tools when they make your life easier
  4. Embrace experimentation!

Go build something cool!
