How I Turned ChatGPT into an SQL-Like Translator for Image and Video Datasets
The Query Language
Defining the Task
Providing Context
Examples
Divide and Conquer
Making it Usable
Recap

Though the prompt for these two examples was structured in the same way, the responses differed in a number of key ways. Response 1 attempts to create a DatasetView by adding ViewStages to the dataset. Response 2 defines and applies a MongoDB aggregation pipeline, followed by the limit() method (applying a Limit stage) to limit the view to 10 samples, in addition to a non-existent (AKA hallucinated) display() method. Moreover, while Response 1 loads in an actual dataset (Open Images V6), Response 2 is effectively template code, as "your_dataset_name" and "your_model_name" must be filled in.

These examples also highlighted the following issues:

  1. Extraneous code: some responses contained code for importing modules, instantiating datasets (and models), and visualizing the view (session = fo.launch_app(dataset)).
  2. Explanatory text: in many cases — including educational contexts — the fact that the model explains its "reasoning" is a positive. If we want to perform queries on the user's behalf, however, this explanatory text just gets in the way. Some queries even resulted in multiple code blocks, split up by text.

What we really wanted was for the LLM to respond with code that could be copied and pasted into a Python process, without all of the extra baggage. As a first attempt at prompting the model, I started to give the following text as a prefix to any natural language query I wanted it to translate:

Your task is to convert input natural language queries into Python code to generate ViewStages for the computer vision library FiftyOne.
Here are some rules:
- Avoid all header code like importing packages, and all footer code like saving the dataset or launching the FiftyOne App.
- Just give me the final Python code, no intermediate code snippets or explanation.
- always assume the dataset is stored in the Python variable `dataset`
- you can use the following ViewStages to generate your response, in any combination: exclude, exclude_by, exclude_fields, exclude_frames, …

Crucially, I defined a task and set rules, instructing the model what it was allowed and not allowed to do.

Note: with responses coming in a more uniform format, it was at this point that I moved from the ChatGPT chat interface to using GPT-4 via OpenAI’s API.

Limiting Scope

Our team also decided that, at least to start, we would limit the scope of what we were asking the LLM to do. While the FiftyOne query language itself is full-bodied, asking a pre-trained model to do arbitrarily complex tasks without any fine-tuning is a recipe for disappointment. Start simple, and iteratively add complexity.

For this experiment, we imposed the following bounds:

  • : don’t expect the LLM to question 3D point clouds or grouped datasets.
  • : most ViewStages abide by the identical basic rules, but a number of buck the trend. `Concat` is the one ViewStages` that takes in a second DatasetView; Mongo uses MongoDB Aggregation syntax; GeoNear has a query argument, which takes in a fiftyone.utils.geojson.geo_within() object; and GeoWithin requires a 2D array to define the region to which the “inside” applies. We decided to disregard Concat, Mongo, and GeoWithin, and to support all GeoNear usage except for the query argument.
  • : while it might be great for the model to compose an arbitrary variety of stages, in most workflows I’ve seen, one or two ViewStages suffice to create the specified DatasetView. The goal of this project was to not get caught within the weeds, but to construct something useful for computer vision practitioners.
VoxelGPT using natural language to query an image dataset. Image courtesy of the author.

In addition to giving the model an explicit "task" and providing clear instructions, we found that we could improve performance by giving the model more information about how FiftyOne's query language works. Without this information, the LLM is flying blind. It's just grasping, reaching out into the darkness.

For instance, in Prompt 2, when I asked for false positive predictions, the response attempted to reference these false positives with predictions.mistakes.false_positive. As far as ChatGPT was concerned, this seemed like a reasonable way to store and access information about false positives.

The model didn't know that in FiftyOne, the truth/falsity of detection predictions is evaluated with dataset.evaluate_detections(), and that after running said evaluation, you can retrieve all images with a false positive by matching for eval_fp>0 with:

from fiftyone import ViewField as F

images_with_fp = dataset.match(F("eval_fp") > 0)

I attempted to clarify the task by providing additional rules, such as:

- When a user asks for the most "unique" images, they are referring to the "uniqueness" field stored on samples.
- When a user asks for the most "wrong" or "mistaken" images, they are referring to the "mistakenness" field stored on samples.
- If a user doesn't specify a label field, e.g. "predictions" or "ground_truth", to which to apply certain operations, assume they mean "ground_truth" if a ground_truth field exists on the data.

I also provided details about label types:

- Object detection bounding boxes are in [top-left-x, top-left-y, width, height] format, all relative to the image width and height, in the range [0, 1]
- possible label types include Classification, Classifications, Detection, Detections, Segmentation, Keypoint, Regression, and Polylines

Moreover, while providing the model with a list of allowed view stages enabled me to nudge it towards using them, it didn't know:

  • When a given stage was relevant, or
  • How to use the stage in a syntactically correct manner

To fill this gap, I wanted to give the LLM information about each of the view stages. I wrote code to loop through the view stages (which you can list with fiftyone.list_view_stages()), store the docstring, and then split the text of the docstring into description and inputs/arguments.
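Here is a minimal sketch of that loop, assuming FiftyOne's Google-style docstrings (where an "Args:" header separates the description from the arguments); the dictionary name and the exact splitting logic are illustrative:

import fiftyone as fo

stage_info = {}
for stage_name in fo.list_view_stages():
    # Each stage name corresponds to a method on datasets/views, e.g. dataset.match()
    docstring = getattr(fo.Dataset, stage_name).__doc__ or ""
    # Everything before "Args:" describes the stage; everything after describes its inputs
    description, _, arguments = docstring.partition("Args:")
    stage_info[stage_name] = {
        "description": description.strip(),
        "inputs": arguments.strip(),
    }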

However, I soon ran into a problem: context length.

Using the base GPT-4 model via the OpenAI API, I was already bumping up against the 8,192-token context length. And this was before adding in examples, or any information about the dataset itself!

OpenAI does have a GPT-4 model with a 32,768-token context which in theory I could have used, but a back-of-the-envelope calculation convinced me that this could get expensive. If we filled the entire 32k token context, given OpenAI's pricing, it would cost about $2 per query!

Instead, our team rethought our approach and did the following:

  • Switch to GPT-3.5
  • Minimize token count
  • Be more selective with input info

Switching to GPT-3.5

There's no such thing as a free lunch — this did result in slightly lower performance, at least initially. Over the course of the project, we were able to recover and far surpass this through prompt engineering! In our case, the effort was worth the cost savings. In other cases, it might not be.

Minimizing Token Count

With context length becoming a limiting factor, I employed the following simple trick: use ChatGPT to optimize prompts!

One ViewStage at a time, I took the original description and list of inputs, and fed this information into ChatGPT, along with a prompt asking the LLM to minimize the token count of that text while retaining all semantic information. Using tiktoken to count the tokens in the original and compressed versions, I was able to reduce the number of tokens by about 30%.
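For reference, here is a minimal sketch of the before/after token count check with tiktoken; the placeholder strings stand in for a real description and its compressed version:

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text):
    # Number of tokens the model will see for this text
    return len(encoding.encode(text))

original = "..."    # original ViewStage description (placeholder)
compressed = "..."  # ChatGPT-compressed version (placeholder)
reduction = 1 - count_tokens(compressed) / count_tokens(original)
print(f"Token reduction: {reduction:.0%}")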

Being More Selective

While it's great to provide the model with context, some information is more helpful than other information, depending on the task at hand. If the model only needs to generate a Python query involving two ViewStages, it probably won't benefit terribly from information about what inputs the other ViewStages take.

We knew that we needed a way to select relevant information depending on the input natural language query. However, it wouldn't be as simple as performing a similarity search on the descriptions and input parameters, because the former often comes in very different language than the latter. We needed a way to link input and information selection.

That link, as it turns out, was examples.

Generating Examples

If you've ever played around with ChatGPT or another LLM, you've probably experienced first-hand how providing the model with even just a single relevant example can drastically improve performance.

As a starting point, I came up with 10 completely synthetic examples and passed these along to GPT-3.5 by adding this below the task rules and ViewStage descriptions in my input prompt:

Here are a few examples of Input-Output Pairs in A, B form:

A) "Filepath starts with '/Users'"
B) `dataset.match(F("filepath").starts_with("/Users"))`

A) "Predictions with confidence > 0.95"
B) `dataset.filter_labels("predictions", F("confidence") > 0.95)`

With just these 10 examples, there was a noticeable improvement in the quality of the model's responses, so our team decided to be systematic about it.

  1. First, we combed through our docs, finding any and all examples of views created through combinations of ViewStages.
  2. We then went through the list of ViewStages and added examples so that we had as close to complete coverage as possible over usage syntax. To this end, we made sure that there was at least one example for each argument or keyword, to give the model a pattern to follow.
  3. With usage syntax covered, we varied the names of fields and classes in the examples so that the model wouldn't generate any false assumptions about names correlating with stages. For instance, we don't want the model to strongly associate the "person" class with the match_labels() method just because all of the examples for match_labels() happen to include a "person" class.

Choosing Similar Examples

At the end of this example generation process, we already had hundreds of examples — far more than could fit in the context length. Fortunately, these examples contained (as input) natural language queries that we could directly compare with the user's input natural language query.

To perform this comparison, we pre-computed embeddings for these example queries with OpenAI's text-embedding-ada-002 model. At run time, the user's query is embedded with the same model, and the examples with the most similar natural language queries — by cosine distance — are selected. Initially, we used ChromaDB to construct an in-memory vector database. However, given that we were dealing with hundreds or thousands of vectors, rather than hundreds of thousands or millions, it actually made more sense to switch to an exact vector search (plus, we limited dependencies).
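At that scale, the exact search is just a normalized matrix-vector product. A minimal sketch, assuming the example query embeddings are stacked in a NumPy array (function and variable names here are illustrative):

import numpy as np

def select_similar_examples(query_embedding, example_embeddings, k=5):
    # Normalize so that dot products equal cosine similarities
    q = query_embedding / np.linalg.norm(query_embedding)
    E = example_embeddings / np.linalg.norm(example_embeddings, axis=1, keepdims=True)
    similarities = E @ q
    # Indices of the k example queries most similar to the user's query
    return np.argsort(-similarities)[:k]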

It was becoming difficult to manage these examples and the components of the prompt, so it was at this point that we started to use LangChain's Prompts module. Initially, we were able to use their SemanticSimilarityExampleSelector to select the most relevant examples, but eventually we had to write a custom ExampleSelector so that we had more control over the pre-filtering.

Filtering for Appropriate Examples

In the computer vision query language, the appropriate syntax for a query can depend on the media type of the samples in the dataset: videos, for example, sometimes need to be treated differently than images. Rather than confuse the model by giving seemingly conflicting examples, or complicate the task by forcing the model to infer based on media type, we decided to only give examples that would be syntactically correct for a given dataset. In the context of vector search, this is known as pre-filtering.

This idea worked so well that we eventually applied the same considerations to other features of the dataset. In some cases, the differences were merely syntactic — when querying labels, the syntax for accessing a Detections label is different from that of a Classification label. Other filters were more strategic: sometimes we didn't want the model to know about a certain feature of the query language.

For instance, we didn't want to give the LLM examples using computations it might not have access to. If a text similarity index had not been constructed for a particular dataset, it wouldn't make sense to feed the model examples of searching for the best visual matches to a natural language query. In a similar vein, if the dataset didn't have any evaluation runs, then querying for true positives and false positives would yield either errors or null results.
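Conceptually, the pre-filter simply winnows the example pool before the vector search runs. Here is a rough sketch; the example records with "media_type" and "requires_evaluation" tags are hypothetical, and the actual filtering logic is more involved:

def prefilter_examples(examples, dataset):
    # Hypothetical example records tagged with the conditions they assume
    media_type = dataset.media_type
    has_evaluations = bool(dataset.list_evaluations())
    allowed = []
    for example in examples:
        if example["media_type"] not in (media_type, "any"):
            continue
        if example.get("requires_evaluation") and not has_evaluations:
            continue
        allowed.append(example)
    return allowed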

You can see the complete example pre-filtering pipeline in view_stage_example_selector.py in the GitHub repo.

Selecting Contextual Info Based on Examples

For a given natural language query, we then use the examples chosen by our ExampleSelector to decide what additional information to provide in the context.

In particular, we count the occurrences of each ViewStage in these selected examples, identify the five most frequent ViewStages, and add the descriptions and information about the input parameters for these ViewStages as context in our prompt. The rationale for this is that if a stage frequently occurs in similar queries, it is likely (but not guaranteed) to be relevant to this query.

If it is not relevant, then the description will help the model determine that it is not relevant. If it is relevant, then information about the input parameters will help the model generate a syntactically correct ViewStage operation.
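A minimal sketch of this frequency-based selection, assuming each retrieved example records which ViewStages it uses (the field and dictionary names are illustrative):

from collections import Counter

def select_stage_context(selected_examples, stage_descriptions, k=5):
    counts = Counter(
        stage
        for example in selected_examples
        for stage in example["view_stages"]
    )
    # Descriptions and argument info for the k most frequent stages
    top_stages = [stage for stage, _ in counts.most_common(k)]
    return {stage: stage_descriptions[stage] for stage in top_stages}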

VoxelGPT using natural language to query an image dataset. Image courtesy of the author.

Up until this point, we had focused on squeezing as much relevant information as possible — and just relevant information — into a single prompt. But this approach was reaching its limits.

Even without accounting for the fact that every dataset has its own names for fields and classes, the space of possible Python queries was just too large.

To make progress, we needed to break the problem down into smaller pieces. Taking inspiration from recent approaches, including chain-of-thought prompting and selection-inference prompting, we divided the problem of generating a DatasetView into four distinct selection subproblems:

  1. Algorithms
  2. Runs of algorithms
  3. Relevant fields
  4. Relevant class names

We then chained these selection "links" together, and passed their outputs along to the model in the final prompt for DatasetView inference.

For each of these subtasks, the same principles of uniformity and simplicity apply. We tried to recycle the natural language queries from existing examples wherever possible, but made a point to simplify the formats of all inputs and outputs for each selection task. What's simplest for one link may not be simplest for another!

Algorithms

In FiftyOne, information resulting from a computation on a dataset is stored as a "run". This includes computations like uniqueness, which measures how unique each image is relative to the rest of the images in the dataset, and hardness, which quantifies the difficulty a model will experience when trying to learn on a given sample. It also includes computations of similarity, which involve generating a vector index for embeddings associated with each sample, as well as evaluation computations, which we touched upon earlier.

Each of these computations generates a different type of results object, which has its own API. Moreover, there isn't any one-to-one correspondence between ViewStages and these computations. Let's take uniqueness as an example.

The result of a uniqueness computation is stored in a float-valued field ("uniqueness" by default) on each image. This means that, depending on the situation, you may want to sort by uniqueness:

view = dataset.sort_by("uniqueness")

Retrieve samples with uniqueness above a certain threshold:

from fiftyone import ViewField as F
view = dataset.match(F("uniqueness") > 0.8)

Or even just show the uniqueness field:

view = dataset.select_fields("uniqueness")

In this selection step, we task the LLM with predicting which of the possible computations might be relevant to the user's natural language query. An example for this task looks like:

Query: "most original images with a false positive"
Algorithms used: ["uniqueness", "evaluation"]

Runs of Algorithms

Once potentially relevant computational algorithms have been identified, we task the LLM with selecting the most appropriate run of each computation. This is essential because some computations can be run multiple times on the same dataset with different configurations, and a ViewStage may only make sense with the right "run".

A good example of this is similarity runs. Suppose you are testing out two models (InceptionV3 and CLIP) on your data, and you have generated a vector similarity index on the dataset for each model. When using the SortBySimilarity view stage, which images are determined to be most similar to which other images can depend quite strongly on the embedding model, so the following two queries would need to generate different results:

## query A:
"show me the ten most similar images to image 1 with CLIP"

## query B:
"show me the ten most similar images to image 1 with InceptionV3"

This run selection process is handled individually for each type of computation, as each requires a modified set of task rules and examples.

Relevant Fields

This link in the chain involves identifying all field names relevant to the natural language query that are not related to a computational run. For instance, not all datasets with predictions have those labels stored under the name "predictions". Depending on the person, dataset, and application, predictions might be stored in a field named "pred", "resnet", "fine-tuned", "predictions_05_16_2023", or something else entirely.

Examples for this task included the query, the names and types of all fields in the dataset, and the names of relevant fields:

Query: "Exclude model2 predictions from all samples"
Available fields: "[id: string, filepath: string, tags: list, ground_truth: Detections, model1_predictions: Detections, model2_predictions: Detections, model3_predictions: Detections]"
Required fields: "[model2_predictions]"

Relevant Class Names

For label fields like classifications and detections, translating a natural language query into Python code requires using the names of actual classes in the dataset. To accomplish this, I tasked GPT-3.5 with performing named entity recognition for label classes in input queries.

In the query "samples with at least one cow prediction and no horses", the model's job is to identify "horse" and "cow". These identified names are then compared against the class names for the label fields selected in the prior step — first case sensitive, then case insensitive, then plurality insensitive.

If no matches are found between named entities and the class names in the dataset, we fall back to semantic matching: "people" → "person", "table" → "dining table", and "animal" → ["cat", "dog", "horse", …].
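A simplified sketch of that matching cascade, with naive plural handling and the semantic fallback omitted (the real implementation is more thorough):

def match_class_name(entity, class_names):
    # 1) exact, case-sensitive match
    if entity in class_names:
        return entity
    # 2) case-insensitive match
    lowered = {name.lower(): name for name in class_names}
    if entity.lower() in lowered:
        return lowered[entity.lower()]
    # 3) plurality-insensitive match (crude singular/plural variants)
    for candidate in (entity.rstrip("s"), entity + "s"):
        if candidate.lower() in lowered:
            return lowered[candidate.lower()]
    # 4) otherwise, fall back to semantic matching, e.g. via embeddings (not shown)
    return None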

Whenever the match is not identical, we use the names of the matched classes to update the query that is passed into the final inference step:

query: "20 random images with a table"
## becomes:
query: "20 random images with a dining table"

ViewStage Inference

Once all of these selections have been made, the similar examples, relevant descriptions, and relevant dataset info (selected algorithmic runs, fields, and classes) are passed to the model, along with the (potentially modified) query.

Rather than instruct the model to return code in the form dataset.view1().view2()…viewn(), as we were doing initially, we ended up nixing the dataset part, and instead asked the model to return the ViewStages as a list. At the time, I was surprised to see this improve performance, but in hindsight, it fits with the insight that the more you split the task up, the better an LLM can do.

Creating an LLM-powered toy is cool, but turning the same kernel into an LLM-powered application is much cooler. Here's a brief overview of how we did it.

Unit Testing

As we turned this from a proof of principle into a robustly engineered system, we used unit testing to stress test the pipeline and identify weak points. The modular nature of the links in the chain means that each step can individually be unit tested, validated, and iterated on without having to run the entire chain.

This leads to faster improvement, because different individuals or groups of people within a prompt-engineering team can work on different links in the chain in parallel. Moreover, it results in reduced costs, as in theory you should only need to run a single step of LLM inference to optimize a single link in the chain.

Evaluating LLM-Generated Code

We used Python's eval() function to turn GPT-3.5's response into a DatasetView. We then set the state of the FiftyOne App session to display this view.
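A minimal sketch of that final step, assuming the model's response is a string representing a list of ViewStages, e.g. '[fo.Match(F("uniqueness") > 0.8), fo.Limit(10)]' (and keeping in mind that eval() on model output should be treated carefully):

import fiftyone as fo
from fiftyone import ViewField as F  # made available to the eval'd expression

def response_to_view(dataset, llm_response):
    # The LLM returns a list of ViewStage objects as a string
    stages = eval(llm_response)
    view = dataset.view()
    for stage in stages:
        view = view.add_stage(stage)
    return view

# Display the resulting view in the App:
# session = fo.launch_app(dataset)
# session.view = response_to_view(dataset, llm_response)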

Input Validation

Garbage input → garbage output. To avoid this, we run validation to ensure that the user's natural language query makes sense.

First, we use OpenAI's moderation endpoint. Then we categorize any prompt into one of the following four cases:

Sensible and complete: the prompt can reasonably be translated into Python code for querying a dataset.

All images with dog detections

Sensible and incomplete: the prompt is reasonable, but cannot be converted into a DatasetView without additional information. For example, if we have two models with predictions on our data, then the following prompt, which just refers to "my model", is insufficient:

Retrieve my model’s incorrect predictions

Out of scope: we're building an application that generates queried views into computer vision datasets. While the underlying GPT-3.5 model is a general-purpose LLM, our application shouldn't turn into a disconnected ChatGPT session next to your dataset. Prompts like the following should be snuffed out:

Explain quantum computing like I’m five

Not sensible: given a random string, it wouldn't make sense to try to generate a view of the dataset — where would one even start?!

Azlsakjdbiayervbg
