How I Turned ChatGPT into an SQL-Like Translator for Image and Video Datasets

Although the prompt for these two examples was structured in the same way, the responses differed in a few key ways. Response 1 attempts to create a DatasetView by adding ViewStages to the dataset. Response 2 defines and applies a MongoDB aggregation pipeline, followed by the limit() method (applying a Limit stage) to limit the view to 10 samples, in addition to a non-existent (AKA hallucinated) display() method. Moreover, while Response 1 loads in an actual dataset (Open Images V6), Response 2 is effectively template code, as "your_dataset_name" and "your_model_name" must be filled in.

These examples also highlighted the following issues:

  1. Boilerplate code: some responses contained code for importing modules, instantiating datasets (and models), and visualizing the view (session = fo.launch_app(dataset)).
  2. Explanatory text: in many cases, including educational contexts, the fact that the model explains its "reasoning" is a positive. If we want to perform queries on the user's behalf, however, this explanatory text just gets in the way. Some queries even resulted in multiple code blocks, split up by text.

What we actually wanted was for the LLM to respond with code that could be copied and pasted directly into a Python process, without all of the extra baggage. As a first attempt at prompting the model, I started passing the following text as a prefix to any natural language query I wanted it to translate:

Your task is to convert input natural language queries into Python code to generate ViewStages for the computer vision library FiftyOne.
Here are some rules:
- Avoid all header code like importing packages, and all footer code like saving the dataset or launching the FiftyOne App.
- Just give me the final Python code, no intermediate code snippets or explanation.
- always assume the dataset is stored in the Python variable `dataset`
- you can use the following ViewStages to generate your response, in any combination: exclude, exclude_by, exclude_fields, exclude_frames, …

Crucially, I defined a task and set rules, instructing the model what it was and was not allowed to do.

Note: with responses coming in a more uniform format, it was at this point that I moved from the ChatGPT chat interface to using GPT-4 via OpenAI’s API.
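To make this concrete, here is a minimal sketch of how such a prefixed query can be sent to the model. It uses the current OpenAI Python client (the project at the time used the ChatCompletion interface of the day), and the truncated prefix and translate_query helper are illustrative rather than the project's actual code:

# Minimal sketch (not the project's actual code): send a prefixed natural
# language query to GPT-4 via the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PREFIX = """Your task is to convert input natural language queries into Python code
to generate ViewStages for the computer vision library FiftyOne.
Here are some rules:
- Avoid all header code like importing packages, and all footer code like saving
  the dataset or launching the FiftyOne App.
- Just give me the final Python code, no intermediate code snippets or explanation.
"""

def translate_query(nl_query: str) -> str:
    """Return the model's Python translation of a natural language query."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PREFIX},
            {"role": "user", "content": nl_query},
        ],
        temperature=0.0,  # favor deterministic, copy-pasteable code
    )
    return response.choices[0].message.content

print(translate_query("Show me the 10 most unique images"))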

Limiting Scope

Our team also decided that, at least to start, we would limit the scope of what we were asking the LLM to do. While the fiftyone query language itself is full-bodied, asking a pre-trained model to do arbitrarily complex tasks without any fine-tuning is a recipe for disappointment. Start simple, and iteratively add in complexity.

For this experiment, we imposed the following bounds:

  • Just images and videos: don't expect the LLM to query 3D point clouds or grouped datasets.
  • Ignore fickle ViewStages: most ViewStages abide by the same basic rules, but a few buck the trend. `Concat` is the only ViewStage that takes in a second DatasetView; Mongo uses MongoDB aggregation syntax; GeoNear has a query argument, which takes in a fiftyone.utils.geojson.geo_within() object; and GeoWithin requires a 2D array to define the region to which the "within" applies. We decided to ignore Concat, Mongo, and GeoWithin, and to support all GeoNear usage except for the query argument.
  • Stick to two stages: while it would be great for the model to compose an arbitrary number of stages, in most workflows I've seen, one or two ViewStages suffice to create the desired DatasetView. The goal of this project was not to get caught in the weeds, but to build something useful for computer vision practitioners.
VoxelGPT using natural language to query an image dataset. Image courtesy of the author.

In addition to giving the model an explicit "task" and providing clear instructions, we found that we could improve performance by giving the model more details about how FiftyOne's query language works. Without this information, the LLM is flying blind. It's just grasping, reaching out into the darkness.

For example, in Prompt 2, when I asked for false positive predictions, the response attempted to reference these false positives with predictions.mistakes.false_positive. As far as ChatGPT was concerned, this seemed like a reasonable way to store and access information about false positives.

The model didn't know that in FiftyOne, the truth/falsity of detection predictions is evaluated with dataset.evaluate_detections(), and that after running said evaluation, you can retrieve all images with a false positive by matching for eval_fp>0 with:

images_with_fp = dataset.match(F("eval_fp")>0)
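For context, here is a minimal sketch of that workflow, assuming FiftyOne's quickstart zoo dataset and the default field and key names ("predictions", "ground_truth", "eval"); your dataset's names may differ:

# Minimal sketch of the evaluation workflow described above
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Populates per-sample counts eval_tp, eval_fp, and eval_fn for eval_key="eval"
dataset.evaluate_detections("predictions", gt_field="ground_truth", eval_key="eval")

# All images containing at least one false positive prediction
images_with_fp = dataset.match(F("eval_fp") > 0)
print(len(images_with_fp))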

I attempted to clarify the task by providing additional rules, such as:

- When a user asks for the most "unique" images, they are referring to the "uniqueness" field stored on samples.
- When a user asks for the most "wrong" or "mistaken" images, they are referring to the "mistakenness" field stored on samples.
- If a user doesn't specify a label field, e.g. "predictions" or "ground_truth", to which to apply certain operations, assume they mean "ground_truth" if a ground_truth field exists on the data.

I also provided information about label types:

- Object detection bounding boxes are in [top-left-x, top-left-y, width, height] format, all relative to the image width and height, in the range [0, 1]
- possible label types include Classification, Classifications, Detection, Detections, Segmentation, Keypoint, Regression, and Polylines

Additionally, while providing the model with a list of allowed view stages let me nudge it towards using them, it didn't know:

  • When a given stage was relevant, or
  • How to use the stage in a syntactically correct manner

To fill this gap, I wanted to give the LLM information about each of the view stages. I wrote code to loop through the view stages (which you can list with fiftyone.list_view_stages()), store the docstring, and then split the text of the docstring into description and inputs/arguments.
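A minimal sketch of that harvesting step might look like the following; the split on an "Args:" header assumes FiftyOne's Google-style docstrings and is illustrative rather than the project's exact parsing logic:

# Sketch: harvest and split the docstring of each view stage method
import inspect
import fiftyone as fo

stage_docs = {}
for stage_name in fo.list_view_stages():  # e.g. "exclude", "match", "sort_by", ...
    method = getattr(fo.Dataset, stage_name)
    docstring = inspect.getdoc(method) or ""

    # Split the docstring into a description and its inputs/arguments
    description, _, args = docstring.partition("Args:")
    stage_docs[stage_name] = {
        "description": description.strip(),
        "inputs": args.strip(),
    }

print(stage_docs["match"]["description"][:200])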

However, I soon ran into a problem: context length.

Using the base GPT-4 model via the OpenAI API, I was already bumping up against the 8,192 token context length. And this was before adding in examples, or any information about the dataset itself!

OpenAI does have a GPT-4 model with a 32,768 token context which in theory I could have used, but a back-of-the-envelope calculation convinced me that this could get expensive. If we filled the entire 32k token context, given OpenAI's pricing, it would cost about $2 per query!

Instead, our team rethought our approach and did the following:

  • Switch to GPT-3.5
  • Minimize token count
  • Be more selective with input info

Switching to GPT-3.5

There's no such thing as a free lunch: this did result in slightly lower performance, at least initially. Over the course of the project, we were able to recover and far surpass this through prompt engineering! In our case, the effort was worth the cost savings. In other cases, it might not be.

Minimizing Token Count

With context length becoming a limiting factor, I employed the following simple trick: use ChatGPT to optimize prompts!

One ViewStage at a time, I took the original description and list of inputs, and fed this information into ChatGPT, along with a prompt asking the LLM to minimize the token count of that text while retaining all semantic information. Using tiktoken to count the tokens in the original and compressed versions, I was able to reduce the number of tokens by about 30%.
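As a rough illustration of the bookkeeping, here is how tiktoken can be used to measure the savings; the two description strings below are stand-ins, not actual FiftyOne docs:

# Sketch: count tokens in the original and ChatGPT-compressed descriptions
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def num_tokens(text: str) -> int:
    return len(encoding.encode(text))

original = "Filters the labels in the given field of each sample in the collection by the given boolean expression"
compressed = "Filter labels in a field by a boolean expression"

savings = 1 - num_tokens(compressed) / num_tokens(original)
print(f"{num_tokens(original)} -> {num_tokens(compressed)} tokens ({savings:.0%} saved)")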

Being More Selective

While it's great to provide the model with context, some information is more helpful than other information, depending on the task at hand. If the model only needs to generate a Python query involving two ViewStages, it probably won't benefit much from information about what inputs the other ViewStages take.

We knew that we needed a way to select relevant information depending on the input natural language query. However, it wouldn't be as simple as performing a similarity search on the descriptions and input parameters, because the former often comes in very different language than the latter. We needed a way to link input and information selection.

That link, as it turns out, was examples.

Generating Examples

If you've ever played around with ChatGPT or another LLM, you've probably experienced first-hand how providing the model with even just a single relevant example can drastically improve performance.

As a starting point, I came up with 10 completely synthetic examples and passed these along to GPT-3.5 by adding this below the task rules and ViewStage descriptions in my input prompt:

Here are a few examples of Input-Output Pairs in A, B form:

A) "Filepath starts with '/Users'"
B) `dataset.match(F("filepath").starts_with("/Users"))`

A) "Predictions with confidence > 0.95"
B) `dataset.filter_labels("predictions", F("confidence") > 0.95)`

With just these 10 examples, there was a noticeable improvement in the quality of the model's responses, so our team decided to be systematic about it.

  1. First, we combed through our docs, finding any and all examples of views created through combinations of ViewStages.
  2. We then went through the list of ViewStages and added examples so that we had as close to complete coverage as possible over usage syntax. To this end, we made sure that there was at least one example for each argument or keyword, to give the model a pattern to follow.
  3. With usage syntax covered, we varied the names of fields and classes in the examples so that the model wouldn't generate any false assumptions about names correlating with stages. For instance, we don't want the model to strongly associate the "person" class with the match_labels() method just because all of the examples for match_labels() happen to include a "person" class.

Choosing Similar Examples

At the end of this example generation process, we already had hundreds of examples: far more than could fit in the context length. Fortunately, these examples contained (as input) natural language queries that we could directly compare with the user's input natural language query.

To perform this comparison, we pre-computed embeddings for these example queries with OpenAI's text-embedding-ada-002 model. At run-time, the user's query is embedded with the same model, and the examples with the most similar natural language queries (by cosine distance) are selected. Initially, we used ChromaDB to construct an in-memory vector database. However, given that we were dealing with hundreds or thousands of vectors, rather than hundreds of thousands or millions, it actually made more sense to switch to an exact vector search (plus we limited dependencies).
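An exact search like this needs little more than NumPy. The sketch below (not the project's actual code) embeds the user's query with the same OpenAI model and ranks the pre-computed example embeddings by cosine similarity:

# Sketch: exact nearest-neighbor example selection with NumPy
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-ada-002", input=[text]
    )
    return np.array(response.data[0].embedding)

def select_examples(query: str, examples: list, embeddings: np.ndarray, k: int = 5):
    """Return the k examples whose queries are most similar to `query`."""
    q = embed(query)
    # Cosine similarity = dot product of L2-normalized vectors
    sims = (embeddings @ q) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    top_k = np.argsort(-sims)[:k]
    return [examples[i] for i in top_k]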

It was becoming difficult to manage these examples and the components of the prompt, so it was at this point that we began to use LangChain's Prompts module. Initially, we were able to use their Similarity ExampleSelector to select the most relevant examples, but eventually we had to write a custom ExampleSelector so that we had more control over the pre-filtering.

Filtering for Appropriate Examples

In the computer vision query language, the appropriate syntax for a query can depend on the media type of the samples in the dataset: videos, for example, sometimes need to be treated differently than images. Rather than confuse the model by giving seemingly conflicting examples, or complicate the task by forcing the model to infer based on media type, we decided to only give examples that would be syntactically correct for a given dataset. In the context of vector search, this is known as pre-filtering.

This idea worked so well that we eventually applied the same considerations to other features of the dataset. In some cases, the differences were merely syntactic: when querying labels, the syntax for accessing a Detections label is different from that of a Classification label. Other filters were more strategic: sometimes we didn't want the model to know about a certain feature of the query language.

For instance, we didn't want to give the LLM examples using computations it might not have access to. If a text similarity index had not been constructed for a particular dataset, it wouldn't make sense to feed the model examples of searching for the best visual matches to a natural language query. In a similar vein, if the dataset didn't have any evaluation runs, then querying for true positives and false positives would yield either errors or null results.
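The sketch below illustrates this kind of pre-filtering; the example dicts and their keys ("media_type", "requires_similarity", "requires_evaluation") are hypothetical, but the dataset properties they are checked against are real FiftyOne attributes:

# Sketch: only keep examples the current dataset can actually satisfy
import fiftyone as fo

def prefilter_examples(examples, dataset: fo.Dataset):
    has_brain_runs = len(dataset.list_brain_runs()) > 0   # any brain run; the real pipeline checks run types
    has_evaluation = len(dataset.list_evaluations()) > 0

    kept = []
    for example in examples:
        if example["media_type"] != dataset.media_type:
            continue  # e.g. don't show video syntax for an image dataset
        if example.get("requires_similarity") and not has_brain_runs:
            continue
        if example.get("requires_evaluation") and not has_evaluation:
            continue
        kept.append(example)

    return kept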

You can see the entire example pre-filtering pipeline in view_stage_example_selector.py in the GitHub repo.

Selecting Contextual Info Based on Examples

For a given natural language query, we then use the examples chosen by our ExampleSelector to decide what additional information to provide in the context.

In particular, we count the occurrences of each ViewStage in these selected examples, identify the five most frequent ViewStages, and add the descriptions and information about the input parameters for these ViewStages as context in our prompt. The rationale for this is that if a stage frequently occurs in similar queries, it is likely (but not guaranteed) to be relevant to this query.

If it isn't relevant, then the description will help the model determine that it isn't relevant. If it is relevant, then information about input parameters will help the model generate a syntactically correct ViewStage operation.
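A simple version of this frequency-based selection might look like the following sketch, where each example is assumed to record the ViewStages used in its output and stage_docs is the description/inputs mapping built earlier:

# Sketch: pull in docs for the five most frequent stages in the selected examples
from collections import Counter

def select_stage_docs(selected_examples, stage_docs, n=5):
    counts = Counter(
        stage
        for example in selected_examples
        for stage in example["view_stages"]   # e.g. ["match", "sort_by"]
    )
    top_stages = [stage for stage, _ in counts.most_common(n)]
    return {stage: stage_docs[stage] for stage in top_stages}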

VoxelGPT using natural language to query an image dataset. Image courtesy of the author.

Up until this point, we had focused on squeezing as much relevant information as possible (and just relevant information) into a single prompt. But this approach was reaching its limits.

Even without accounting for the fact that every dataset has its own names for fields and classes, the space of possible Python queries was just too large.

To make progress, we needed to break the problem down into smaller pieces. Taking inspiration from recent approaches, including chain-of-thought prompting and selection-inference prompting, we divided the problem of generating a DatasetView into four distinct selection subproblems:

  1. Algorithms
  2. Runs of algorithms
  3. Relevant fields
  4. Relevant class names

We then chained these selection "links" together, and passed their outputs along to the model in the final prompt for DatasetView inference.

For each of these subtasks, the same principles of uniformity and simplicity apply. We tried to recycle the natural language queries from existing examples wherever possible, but made a point of simplifying the formats of all inputs and outputs for each selection task. What's simplest for one link may not be simplest for another!

Algorithms

In FiftyOne, information resulting from a computation on a dataset is stored as a "run". This includes computations like uniqueness, which measures how unique each image is relative to the rest of the images in the dataset, and hardness, which quantifies the difficulty a model will experience when trying to learn on a sample. It also includes computations of similarity, which involve generating a vector index for embeddings associated with each sample, and even evaluation computations, which we touched upon earlier.
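For reference, here is roughly how these runs get created in the first place; the dataset name, brain key, and field names are placeholders, and compute_hardness assumes a classification field that stores logits:

# Sketch: creating the "runs" referenced above (placeholder names throughout)
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("my_dataset")  # placeholder dataset name

fob.compute_uniqueness(dataset)                       # -> "uniqueness" field
fob.compute_similarity(                               # -> similarity index
    dataset, model="clip-vit-base32-torch", brain_key="clip_sim"
)
fob.compute_hardness(dataset, "predictions")          # -> "hardness" field (needs logits)
dataset.evaluate_detections(                          # -> evaluation run
    "predictions", gt_field="ground_truth", eval_key="eval"
)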

Each of these computations generates a different type of results object, which has its own API. Moreover, there is no one-to-one correspondence between ViewStages and these computations. Let's take uniqueness as an example.

A uniqueness computation result is stored in a float-valued field ("uniqueness" by default) on each image. This means that depending on the situation, you may want to sort by uniqueness:

view = dataset.sort_by("uniqueness")

Retrieve samples with uniqueness above a certain threshold:

from fiftyone import ViewField as F
view = dataset.match(F("uniqueness") > 0.8)

Or even just show the uniqueness field:

view = dataset.select_fields("uniqueness")

In this selection step, we task the LLM with predicting which of the possible computations might be relevant to the user's natural language query. An example for this task looks like:

Query: "most unique images with a false positive"
Algorithms used: ["uniqueness", "evaluation"]

Runs of Algorithms

Once potentially relevant computational algorithms have been identified, we task the LLM with selecting the most appropriate run of each computation. This is essential because some computations can be run multiple times on the same dataset with different configurations, and a ViewStage may only make sense with the right "run".

A great example of this is similarity runs. Suppose you are testing out two models (InceptionV3 and CLIP) on your data, and you have generated a vector similarity index on the dataset for each model. When using the SortBySimilarity view stage, which images are determined to be most similar to which other images can depend quite strongly on the embedding model, so the following two queries would need to generate different results:

## query A:
"show me the ten most similar images to image 1 with CLIP"

## query B:
"show me the ten most similar images to image 1 with InceptionV3"

This run selection process is handled separately for each type of computation, as each requires a modified set of task rules and examples.
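As a sketch of the outcome, the two queries above would resolve to the same SortBySimilarity stage but different similarity runs; the dataset name and brain keys below are illustrative:

# Sketch: same stage, different similarity runs (illustrative brain keys)
import fiftyone as fo

dataset = fo.load_dataset("my_dataset")  # placeholder; assumes both indexes exist
query_id = dataset.first().id            # stand-in for "image 1"

view_a = dataset.sort_by_similarity(query_id, k=10, brain_key="clip_sim")
view_b = dataset.sort_by_similarity(query_id, k=10, brain_key="inception_sim")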

Relevant Fields

This link in the chain involves identifying all field names relevant to the natural language query that are not related to a computational run. For instance, not all datasets with predictions have those labels stored under the name "predictions". Depending on the person, dataset, and application, predictions might be stored in a field named "pred", "resnet", "fine-tuned", "predictions_05_16_2023", or something else entirely.

Examples for this task included the query, the names and types of all fields in the dataset, and the names of relevant fields:

Query: "Exclude model2 predictions from all samples"
Available fields: "[id: string, filepath: string, tags: list, ground_truth: Detections, model1_predictions: Detections, model2_predictions: Detections, model3_predictions: Detections]"
Required fields: "[model2_predictions]"

Relevant Class Names

For label fields like classifications and detections, translating a natural language query into Python code requires using the names of actual classes in the dataset. To accomplish this, I tasked GPT-3.5 with performing named entity recognition for label classes in input queries.

In the query "samples with at least one cow prediction and no horses", the model's job is to identify "horse" and "cow". These identified names are then compared against the class names for label fields selected in the prior step: first case sensitive, then case insensitive, then plurality insensitive.
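A naive sketch of this matching cascade is shown below; it assumes the selected label field is a Detections field and uses a crude singularization, whereas the real pipeline is more careful:

# Sketch: exact-match cascade against a dataset's actual class names
import fiftyone as fo

def match_class_name(entity: str, dataset: fo.Dataset, field="ground_truth"):
    classes = dataset.distinct(f"{field}.detections.label")

    for cls in classes:                      # 1) case sensitive
        if entity == cls:
            return cls
    for cls in classes:                      # 2) case insensitive
        if entity.lower() == cls.lower():
            return cls
    for cls in classes:                      # 3) plurality insensitive (naive)
        if entity.lower().rstrip("s") == cls.lower().rstrip("s"):
            return cls
    return None                              # fall back to semantic matching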

If no matches are found between named entities and the class names in the dataset, we fall back to semantic matching: "people" → "person", "table" → "dining table", and "animal" → ["cat", "dog", "horse", …].

Whenever the match isn't identical, we use the names of the matched classes to update the query that's passed into the final inference step:

query: "20 random images with a table"
## becomes:
query: "20 random images with a dining table"

ViewStage Inference

Once all of these selections have been made, the similar examples, relevant descriptions, and relevant dataset info (selected algorithmic runs, fields, and classes) are passed in to the model, along with the (potentially modified) query.

Rather than instructing the model to return code in the form dataset.view1().view2()…viewn() as we were doing initially, we ended up nixing the dataset part, and instead asking the model to return the ViewStages as a list. At the time, I was surprised to see this improve performance, but in hindsight, it fits with the insight that the more you split the task up, the better an LLM can do.

Creating an LLM-powered toy is cool, but turning the same kernel into an LLM-powered application is much cooler. Here's a brief overview of how we did it.

Unit Testing

As we turned this from a proof-of-principle into a robustly engineered system, we used unit testing to stress test the pipeline and identify weak points. The modular nature of the links in the chain means that each step can individually be unit tested, validated, and iterated on without needing to run the entire chain.

This leads to faster improvement, because different individuals or groups of people within a prompt-engineering team can work on different links in the chain in parallel. It also results in reduced costs, as in theory, you should only need to run a single step of LLM inference to optimize a single link in the chain.

Evaluating LLM-Generated Code

We used Python's eval() function to turn GPT-3.5's response into a DatasetView. We then set the state of the FiftyOne App session to display this view.
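A minimal sketch of this final step, with a made-up model response for illustration:

# Sketch: turn the model's list of ViewStage strings into a view and display it
import fiftyone as fo
from fiftyone import ViewField as F  # F may appear inside the stage strings

dataset = fo.load_dataset("my_dataset")  # placeholder dataset name

# Hypothetical model response: a list of ViewStage strings
response = ['match(F("uniqueness") > 0.8)', 'limit(10)']

view = eval("dataset." + ".".join(response))

session = fo.launch_app(dataset)
session.view = view  # display the generated view in the FiftyOne App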

Input Validation

Garbage input → garbage output. To avoid this, we run validation to make sure that the user's natural language query is sensible.

First, we use OpenAI's moderation endpoint. Then we categorize any prompt into one of the following four cases:

1: Sensible and complete: the prompt can reasonably be translated into Python code for querying a dataset.

All images with dog detections

2: Sensible and incomplete: the prompt is reasonable, but can't be converted into a DatasetView without additional information. For example, if we have two models with predictions on our data, then the following prompt, which just refers to "my model", is insufficient:

Retrieve my model’s incorrect predictions

3: Out of scope: we're building an application that generates queried views into computer vision datasets. While the underlying GPT-3.5 model is a general purpose LLM, our application shouldn't turn into a disconnected ChatGPT session next to your dataset. Prompts like the following should be snuffed out:

Explain quantum computing like I’m five

4: Not sensible: given a random string, it wouldn't make sense to try to generate a view of the dataset (where would one even start?!)

Azlsakjdbiayervbg
