Automatic Prompt Optimization for Multimodal Vision Agents: A Self-Driving Car Example


Optimizing Multimodal Agents

Multimodal AI agents, which can process text and images (or other media), are rapidly entering real-world domains like autonomous driving, healthcare, and robotics. These settings have traditionally relied on vision models like CNNs; in the post-GPT era, we can instead use vision and multimodal language models that follow human instructions in the form of prompts, rather than task-oriented, highly specific vision models.

However, getting good results from these models requires effective instructions, or, more commonly, prompt engineering. Existing prompt engineering methods rely heavily on trial and error, which is often exacerbated by the complexity and higher token cost of working across non-text modalities such as images. Automatic prompt optimization is a recent advancement in the field that systematically tunes prompts to produce more accurate, consistent outputs.

For instance, a self-driving car perception system might use a vision-language model to answer questions about road images. A poorly phrased prompt can lead to misunderstandings or errors with serious consequences. Instead of fine-tuning or reinforcement learning, we can use another multimodal model with reasoning capabilities to learn and adapt our prompts.

Fig. 1. Can a machine help us get from a baseline system prompt for driving hazards to an improved output based on our dataset?

Although these automatic methods can be applied to text-based agents, they are often not well documented for complex, real-world applications beyond a basic toy dataset, such as handwriting or image classification. To best show how these concepts work in a more complex, dynamic, and data-intensive setting, we will walk through an example using a self-driving car agent.


What Is Agent Optimization?

Agent optimization is part of automatic prompt engineering, but it involves working with the various parts of the agent, such as multiple prompts, tool calling, RAG, agent architecture, and different modalities. There are a variety of research projects and libraries, such as GEPA; however, many of these tools don't provide end-to-end support for tracing, evaluation, and managing datasets that include images.

For this walkthrough, we will be using the Opik Agent Optimizer SDK (opik-optimizer), an open-source agent optimization toolkit that automates this process using LLMs internally, along with optimization algorithms like GEPA and a number of its own, such as HRPO, for different use cases, so you can iteratively improve prompts without manual trial and error.

How Can LLMs Optimize Prompts?

Essentially, an LLM can "act as" a prompt engineer and rewrite a given prompt. We start by taking the standard approach, as a prompt engineer would with trial and error, and ask a small agent to review its work across a few examples, fix its mistakes, and create a new prompt.

Meta-prompting is a classic example: it applies chain-of-thought (CoT) reasoning, such as "explain the reason why you gave me this prompt", during its new prompt generation process, and we keep iterating on this across multiple rounds of prompt generation. Below is an example of an LLM-based meta-prompting optimizer adjusting the prompt and generating new candidates.

Fig. 2. How LLMs can be used to optimize prompts, a basic meta-prompter example where the LLM acts as a prompt tuner.
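
To make this concrete, below is a minimal sketch of that meta-prompting cycle. It is illustrative only, not the toolkit's internals; call_llm and evaluate are hypothetical helpers you would supply (one LLM call returning text, and a scorer returning 0.0 to 1.0):

# Illustrative meta-prompting loop, not the SDK's implementation.
def meta_optimize(prompt: str, examples: list[dict],
                  call_llm, evaluate, rounds: int = 3) -> str:
    best_prompt, best_score = prompt, evaluate(prompt, examples)
    for _ in range(rounds):
        meta_request = (
            "You are a prompt engineer. Current prompt:\n"
            f"{best_prompt}\n\n"
            f"It scored {best_score:.2f} on our examples. Explain the reason "
            "it fails, then output an improved prompt."
        )
        candidate = call_llm(meta_request)     # LLM critiques and rewrites (CoT)
        score = evaluate(candidate, examples)  # re-score the rewritten prompt
        if score > best_score:                 # hill climb: keep only improvements
            best_prompt, best_score = candidate, score
    return best_prompt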

In the toolkit, there is a meta-prompt-based optimizer called metaprompter, and we can show how the optimization works:

  1. It starts with an initial ChatPrompt, an OpenAI-style chat prompt object with system and user prompts,
  2. a dataset,
  3. and a metric to optimize against, which can be an LLM-as-a-judge (LLMaaJ) or an even simpler heuristic, such as an equality comparison between expected outputs in the dataset and outputs from the model.

Opik then uses various algorithms, including LLMs, to iteratively mutate the prompt and evaluate performance, automatically tracking results. It essentially acts as our own machine-driven prompt engineer!


Getting Started

In this walkthrough, we want to take a small dataset of self-driving car dashcam images and use automatic prompt optimization to tune the prompts of a multimodal agent that can detect hazards.

We need to set up our environment and install the toolkit to get going. First, you will want an open-source Opik instance, either in the cloud or locally, to log traces, manage datasets, and store optimization results. You can go to the repository and run the Docker start command to run the Opik platform, or set up a free account on their website.

Once set up, you'll need Python (we use 3.11 below) and a few libraries. First, install the opik-optimizer package; it will also install the opik core package, which handles datasets and evaluation.

Install and configure using uv:

# install with venv and py version
uv venv .venv --python 3.11

# install optimizer package
uv pip install opik-optimizer

# post-install configure SDK
opik configure

Or alternatively, install and configure using pip:

# Setup venv
python -m venv .venv

# load venv
source .venv/bin/activate

# install optimizer package
pip install opik-optimizer

# post-install configure SDK
opik configure

You'll also need API keys for any LLM models you plan to use. The SDK uses LiteLLM, so you can mix providers; see here for a full list of models, and browse their docs for other integrations like ollama and vLLM if you want to run models locally.

In our example, we will be using OpenAI models, so you need to set your key in your environment. Adjust this step as needed to load the API keys for your model:

export OPENAI_API_KEY="sk-…"
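
If you're scripting this, a fail-fast check at the top of the file can save a wasted run. A tiny sketch using only the standard library:

import os

# Fail fast if the key was not exported in this shell session.
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running."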

Now that we have our Opik environment set up and our keys configured to access LLM models for optimization and evaluation, we can get to work on the datasets used to tune our agent.

Working with Datasets To Tune the Agent

Before we can start with prompts and models, we need a dataset. To tune an AI agent, we need examples that serve as our "preferences" for the results we want to achieve. You'll normally have a "golden" dataset, which, for your AI agent, would include example input and output pairs that you maintain as the prime examples and evaluate your agent against.

For this example project, we will use an off-the-shelf dataset for self-driving cars that is already set up as a demo dataset in the optimizer SDK. The dataset contains dashcam images and human-labeled hazards. Our goal is to use a very basic prompt and have the optimizer "discover" the optimal prompt by reviewing the images and the test outputs it runs.

The dataset, DHPR (Driving Hazard Prediction and Reasoning), is available on Hugging Face and is already mapped in the SDK as the driving_hazard dataset (it is released under the BSD 3-Clause license). The internal mapping in the SDK handles Hugging Face conversions, image resizing, and compression, including PNG-to-JPEG conversion, into an Opik-compatible dataset. The SDK includes helper utilities if you wish to use your own multimodal dataset.

Fig. 3. The driving hazards dataset on Hugging Face.

The DHPR dataset includes a few fields that we'll use to ground our agent's behavior against human preferences during our optimization process. Here's a breakdown of what's in the dataset:

  • question, which was asked of the human annotator: "Based on my dashcam image, what is the potential hazard?"
  • hazard, which is the response from the human labeler
  • bounding_box, which has the hazard marked and can be overlaid on the image
  • plausible_speed, the annotator's estimate of the car's speed from the predefined set [10, 30, 50+]
  • image_source, metadata on where the source images were recorded

Now, let's create a new Python file, optimize_multimodal.py, and start with the dataset we will use to train and validate our optimization process:

from opik_optimizer.datasets import driving_hazard
dataset = driving_hazard(count=20)
validation_dataset = driving_hazard(count=5)

This code, when executed, ensures the Hugging Face dataset is downloaded and added to your Opik platform UI as a dataset we can optimize or test with. We'll then pass the variables dataset and validation_dataset to the optimization steps later in the code. You'll note we're setting the count values to low numbers, 20 and 5, to load a small sample and avoid processing the full dataset for our walkthrough, which can be resource-intensive.
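
As a quick sanity check, you can peek at what was loaded; get_items is the Opik dataset accessor for the stored rows:

# Inspect the first loaded row to confirm the fields described above.
items = dataset.get_items()
print(len(items))        # 20 rows from our count=20 sample
print(items[0].keys())   # expect fields like question, hazard, image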

When you run a full optimization process in a live environment, you should aim to use as much of the dataset as possible. It's good practice to start small and scale up, as diagnosing long-running optimizations can be problematic and resource-intensive.

We also configured the optional validation_dataset, which is used to test our optimization at the start and end on a hold-out set, ensuring the recorded improvement is validated on unseen data. Out of the box, the optimizer's pre-configured datasets all include pre-set splits, which you can access via the split argument. See the examples below:

# example a) driving_hazard pre-configured splits
from opik_optimizer.datasets import driving_hazard
trainset = driving_hazard(split="train")
valset = driving_hazard(split="validation")
testset = driving_hazard(split="test")

# example b) gsm8k math dataset pre-configured splits
from opik_optimizer.datasets import gsm8k
trainset = gsm8k(split="train")
valset = gsm8k(split="validation")
testset = gsm8k(split="test")

The splits also guarantee there is no overlapping data, as the dataset is shuffled in a deterministic order and split into three parts. We avoid using these splits here so we don't need very large datasets and runs while getting started.

Let's go ahead and run our code, optimize_multimodal.py, with just the driving hazard dataset. The dataset will be loaded into Opik and can be seen in our dashboard under "driving_hazard_train_20".

Fig. 4. The hazard dataset is loaded in our Opik datasets, and we can see the image data (base64).

With our dataset loaded in Opik, we can also load it in the Opik playground, which is a nice and easy way to see how various prompts behave and to test them against a simple prompt such as "Identify the hazards in this image."

Fig. 5. We can run a prompt across all rows of the image column by configuring a prompt, selecting a model, and selecting our dataset.

As you can see from the example, we can use the playground to test prompts for our agent quite quickly. This is the typical process we'd use for manual prompt engineering: adjusting the prompt in a playground-like environment and simulating how various changes to the prompt affect the model's outputs.

For some scenarios, this could be sufficient, paired with some automated scoring and intuition to adjust prompts. You can also see how bringing the existing prompt optimization process into a more visual and systematic workflow lets subtle changes be tested easily against our golden dataset.

Defining Evaluation Metrics To Optimize With

Next, we'll define the evaluation metric that lets the optimizer know which changes are working and which are not. We need a way to signal the optimizer about what's working and what's failing. For this, we will use an evaluation metric as the "reward": a simple score that the optimizer uses to decide which prompt changes to keep.

These evaluation metrics can be simple (e.g., Equals) or more complex (e.g., LLM-as-a-judge). Since Opik is a fully open-source evaluation suite, you can use a variety of different metrics, which you can explore here to find out more.

Logically, you'd think that when we compare the dataset ground truth (a) to the model output (b), we'd use a simple equality metric like (a == b), which returns a boolean true or false. However, a direct comparison metric can be harmful to the optimizer: it makes the process much harder, since the model is unlikely to produce the exact answer right from the start (or at any point during the optimization process).

Here is one of the human-annotated examples from the dataset we are trying to get the optimizer to match; you can see how getting the LLM to blindly produce the exact same output would be difficult:

Entity #1 brakes his car in front of Entity #2. Seeing that Entity #2 also pulled his brakes. At a speed of 45 km/h, I cannot stop my car in time and hit Entity #2.

To support the hill climbing the optimizer needs, we will use a comparison metric that gives an approximate score on a scale of 0.0 to 1.0. For this scenario, we will use the Levenshtein ratio, a simple math-based measure of how closely the characters and words in the output match those in the ground-truth dataset. With this closeness metric, a body of text that is only a few characters off could yield a score of, for example, 98% (0.98), as the two strings are very similar.

Fig. 6. Visual example of how the Levenshtein distance ratio is calculated.
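
To make the scoring concrete, here is a minimal, dependency-free sketch of one common Levenshtein-ratio formulation (Opik's exact normalization may differ slightly), along with why it beats exact matching as an optimization signal:

# Classic dynamic-programming edit distance, then normalized to 0.0-1.0.
def levenshtein_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def ratio(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))

print(ratio("hits Entity #2", "hit Entity #2"))  # ~0.93: near-miss earns partial credit
print("hits Entity #2" == "hit Entity #2")       # False: exact match gives no gradient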

In our Python script, we define this custom metric as a function over the input and output variables from our dataset. In practice, we define the mapping between the dataset field hazard and the output llm_output, as well as the scoring function to be passed to the optimizer. There are more metric examples in the documentation, but for now, we will use the following setup in our code after the dataset creation:

from typing import Any

from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult

def levenshtein_ratio(
    dataset_item: dict[str, Any],
    llm_output: str
) -> ScoreResult:
    # Score how closely the model output matches the labeled hazard text
    metric = LevenshteinRatio()
    metric_score = metric.score(
        reference=dataset_item["hazard"], output=llm_output
    )
    return ScoreResult(
        value=metric_score.value,
        name=metric_score.name,
        reason=f"Levenshtein ratio between `{dataset_item['hazard']}` and `{llm_output}` is `{metric_score.value}`.",
    )
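
Before wiring it into the optimizer, you can sanity-check the metric directly with a hypothetical item (not a real DHPR row):

# Hypothetical item purely for a local check of the metric function above.
sample = {"hazard": "Entity #1 brakes suddenly and Entity #2 rear-ends Entity #1."}
result = levenshtein_ratio(sample, "Entity #1 brakes and Entity #2 rear-ends Entity #1.")
print(result.value)   # high partial-credit score
print(result.reason)  # the explanation string the optimizer will see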

Setting Up Our Base Agent & Prompt

Here we configure the agent's starting point. In this case, we assume we already have an agent and a handwritten prompt. If you were optimizing your own agent, you'd replace these placeholders. We start by importing the ChatPrompt class, which allows us to configure the agent as a simple chat prompt. The optimizer SDK handles inputs via the ChatPrompt, and you can extend this with tool/function calling and more multi-prompt/agent scenarios for your own use cases.

from opik_optimizer import ChatPrompt

# Define the prompt to optimize
system_prompt = """You might be an authority driving safety assistant
specialized in hazard detection. Your task is to investigate dashcam
images and discover potential hazards that a driver should pay attention to.

For every image:
1. Fastidiously examine the visual scene
2. Discover any potential hazards (pedestrians, vehicles,
road conditions, obstacles, etc.)
3. Assess the urgency and severity of every hazard
4. Provide a transparent, specific description of the hazard

Be precise and actionable in your hazard descriptions.
Deal with safety-critical information."""

# Map into an OpenAI-style chat prompt object
prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "{question}"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "{image}",
                    },
                },
            ],
        },
    ],
)

In our example, we have a system prompt and a user prompt, built from the question {question} and the image {image} in the dataset we created earlier. We are going to try to optimize the system prompt so that the input changes based on each image. The fields in curly braces, like {data_field}, are columns in our dataset that the SDK will automatically map and also convert for things like multimodal images.
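
To make the substitution concrete, here is roughly what one rendered user message looks like after the placeholders are filled in for a single row. This is illustrative only; the SDK performs this mapping internally, and the image value is a base64 data URL in practice:

# Illustrative: the user message after {question} and {image} are substituted.
item = {
    "question": "Based on my dashcam image, what is the potential hazard?",
    "image": "data:image/jpeg;base64,/9j/4AAQ...",  # truncated for display
}
rendered = {
    "role": "user",
    "content": [
        {"type": "text", "text": item["question"]},
        {"type": "image_url", "image_url": {"url": item["image"]}},
    ],
}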


Loading and Wiring the Optimizers

The toolkit comes with a range of optimizers, from simple meta-prompting, which uses chain-of-thought reasoning to update prompts, to GEPA and more advanced reflective optimizers. At the time of this walkthrough, the Hierarchical Reflective Prompt Optimizer (HRPO) is the one we will use for example purposes, as it is suited to complex and ambiguous tasks.

The HRPO optimization algorithm uses hierarchical root-cause analysis to identify and address specific failure modes in your prompts. It analyzes evaluation results, identifies patterns in failures, and generates targeted improvements to systematically address each failure mode.

Fig. 7. How a hierarchical approach to "failures" is used to generate new candidate prompts with an LLM.
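
As a rough mental model (not the actual implementation), the reflective loop looks something like the sketch below. All four helper callables are hypothetical: run_agent runs the prompt on one item, score applies our metric, cluster_failures groups low scorers by root cause, and rewrite_for asks an LLM for a targeted fix:

# Conceptual sketch only; the real HRPO algorithm differs in its details.
def reflective_optimize(prompt, dataset, run_agent, score,
                        cluster_failures, rewrite_for, max_trials=10):
    def avg(p):  # mean metric score of prompt p over the dataset
        return sum(score(it, run_agent(p, it)) for it in dataset) / len(dataset)

    best, best_score = prompt, avg(prompt)
    for _ in range(max_trials):
        failures = [it for it in dataset if score(it, run_agent(best, it)) < 0.5]
        for mode in cluster_failures(failures):    # root-cause failure groups
            candidate = rewrite_for(best, mode)    # targeted prompt improvement
            new_score = avg(candidate)
            if new_score > best_score:             # keep only measurable gains
                best, best_score = candidate, new_score
    return best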

So far in our project, we have set up the base dataset, evaluation metric, and prompt for our agent, but haven't wired any of this up to an optimizer. Let's go ahead and wire HRPO into our project. We need to load our model and configure any parameters, such as the model we want the optimizer to run on:

from opik_optimizer import HRPO

# Setup optimizer and configuration parameters
optimizer = HRPO(
  model="openai/gpt-5.2",
  model_parameters={"temperature": 1},
)

There are additional parameters we can set, such as the number of threads for multi-threading, or model parameters passed directly to the LLM calls, as we show by setting our temperature value.

It's Time: Running the Optimizer

Now we have everything we need: our starting agent, dataset, metric, and the optimizer. To execute the optimizer, we call its optimize_prompt function and pass in all the components, along with any additional parameters. When executed, optimize_prompt() runs the optimizer we configured (optimizer) end to end.

# Execute optimizer
optimization_result = optimizer.optimize_prompt(
  prompt=prompt, # our ChatPrompt
  dataset=dataset, # our Opik dataset
  validation_dataset=validation_dataset, # optional, hold-out test
  metric=levenshtein_ratio, # our custom metric
  max_trials=10, # optional, number of trials
)

# Output and display results
optimization_result.display()

You'll notice some additional arguments we passed; the max_trials argument limits the number of trials the optimizer will run before stopping. You should start with a low number, as some datasets and optimizer loops can be token-heavy, especially with image-based runs, which can lead to very long runs and be time- and cost-intensive. Once we're happy with our setup, we can always come back and scale this up.

Let's run our full script now and see the optimizer in action. It's best to execute this in your terminal, but it should also work fine in a notebook such as Jupyter:

Fig. 8. Here we can see how the reflection steps described above are working, with each failure mode captured.

The optimizer will run through 10 trials. On each loop, it generates a number of failure modes to check, test, and develop new prompts for. At each trial, the new candidate prompts are tested and evaluated, and another trial begins. After a short while, we should reach the end of our optimization loop; in our case, this happens after 10 full trials, which shouldn't take more than a minute to execute.

Congratulations, we optimized our multimodal agent, and we can now take the new system prompt and apply it to the same model in production with improved accuracy. In a production scenario, you'd copy this into your codebase. To analyze our optimization run, we can check the terminal and dashboard, which should show the new results:

Fig. 9. Final results shown in the CLI terminal at the end of the script.

Based on the results, we can see that we went from a baseline score of 15% to 39% after 10 trials, a whopping 152% improvement with a new prompt in under a minute. These results are based on our comparison metric, which the optimizer used as its signal: a comparison of the output vs. our expected output in the dataset.

Digging into our results, a few key things to note:

  • During the trial runs, the score shoots up very quickly, then slowly normalizes. You could increase the number of trials to see whether the optimizer needs more rounds to determine the next set of prompt improvements.
  • The score will also be more "volatile" and prone to overfitting with small samples of 20 items (and 5 for validation), which we used to keep our test small; randomness will impact our scores massively. When you re-run, try using the full dataset or a larger sample (e.g., count=50) and see how the scores become more realistic.

Overall, as we scale this up, we need to give the optimizer more data and more time to "hill climb," which can take multiple rounds.

At the end of our optimization, our new and improved system prompt has recognized that it needs to label the interacting entities and that the output style needs to match. Here is our final improved prompt after 10 trials:

You are an expert driving incident analyst specialized in collision-causal description.

Your task is to analyze dashcam images and write the most likely collision-oriented causal narrative that matches reference-style answers.

For each image:
1. Identify the primary interacting participants and label them explicitly as "Entity #1", "Entity #2", etc. (e.g., vehicle, pedestrian, cyclist, obstacle).
2. Describe the single most salient accident interaction as an explicit causal chain using entity labels: "Entity #X [action/failure] → [immediate consequence/path conflict] → [impact]".
3. End with a clear impact outcome that MUST (a) use explicit collision language AND (b) name the entities involved (e.g., "Entity #2 rear-ends Entity #1", "Entity #1 side-impacts Entity #2", "Entity #1 strikes Entity #2").

Output requirements (critical):
- Produce ONE short, direct causal statement (1–2 sentences).
- The statement MUST include: (i) at least two entities by label, (ii) a concrete action/failure-to-yield/encroachment, and (iii) an explicit collision outcome naming the entities. If any of these are missing, the answer is invalid.
- Do NOT output a checklist, multiple hazards, severity/urgency ratings, or general driving advice.
- Avoid general risk discussion (visibility, congestion, pedestrians) unless it directly supports the single causal chain culminating in the collision/impact.
- Focus on the specific causal progression culminating in the impact (even if partially inferred from context); do not describe multiple possible crashes; commit to the single most likely one.

You can grab the full final code for the example, end to end, as follows:

from typing import Any

from opik_optimizer.datasets import driving_hazard
from opik_optimizer import ChatPrompt, HRPO
from opik.evaluation.metrics import LevenshteinRatio
from opik.evaluation.metrics.score_result import ScoreResult

# Import the dataset
dataset = driving_hazard(count=20)
validation_dataset = driving_hazard(split="test", count=5)

# Define the metric to optimize on
def levenshtein_ratio(dataset_item: dict[str, Any], llm_output: str) -> ScoreResult:
    metric = LevenshteinRatio()
    metric_score = metric.score(reference=dataset_item["hazard"], output=llm_output)
    return ScoreResult(
        value=metric_score.value,
        name=metric_score.name,
        reason=f"Levenshtein ratio between `{dataset_item['hazard']}` and `{llm_output}` is `{metric_score.value}`.",
    )

# Define the prompt to optimize
system_prompt = """You might be an authority driving safety assistant specialized in hazard detection.

Your task is to investigate dashcam images and discover potential hazards that a driver should pay attention to.

For every image:
1. Fastidiously examine the visual scene
2. Discover any potential hazards (pedestrians, vehicles, road conditions, obstacles, etc.)
3. Assess the urgency and severity of every hazard
4. Provide a transparent, specific description of the hazard

Be precise and actionable in your hazard descriptions. Deal with safety-critical information."""

prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "{question}"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "{image}",
                    },
                },
            ],
        },
    ],
)

# Initialize HRPO (Hierarchical Reflective Prompt Optimizer)
optimizer = HRPO(model="openai/gpt-5.2", model_parameters={"temperature": 1})

# Run optimization
optimization_result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    validation_dataset=validation_dataset,
    metric=levenshtein_ratio,
    max_trials=10,
)

# Show results
optimization_result.display()
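
In a production scenario, you'd then lift the winning prompt out of the result object. A small sketch, assuming OptimizationResult exposes prompt and score attributes; verify the exact field names against your opik-optimizer version's documentation:

# Assumption: .prompt holds the optimized chat messages and .score the best
# metric value. Check the OptimizationResult docs for your SDK version.
best_messages = optimization_result.prompt
print(f"Best score: {optimization_result.score:.3f}")
for message in best_messages:
    print(message["role"], str(message["content"])[:80])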

Going Further and Common Pitfalls

Now you're done with your first optimization run. Here are some additional tips for working with optimizers, and especially multimodal agents, for moving into more advanced scenarios and avoiding some common anti-patterns:

  • Model Costs and Selection: Multimodal prompts send larger payloads. Monitor token usage in the Opik dashboard. If cost is an issue, use a smaller vision model. Running these optimizers through multiple loops can get quite expensive. At the time of publication, on GPT-5.2, this example cost us about $0.15 USD. Monitor this as you run examples to see how the optimizer is behaving and catch any issues before you scale out.
  • Model Selection and Vision Support: Double-check that your chosen model supports images. Some very recent model releases may not be mapped yet, so you may hit issues. Keep your Python packages updated.
  • Dataset Image Size and Format: Consider using JPEGs and lower-resolution images, which are more efficient than large-resolution PNGs, which can be more token-hungry due to their size. Test how the model behaves via direct API calls, the playground, and small trial runs before scaling out. In the demo we ran, the dataset images were automatically converted by the SDK to JPEG (60% quality) with a max height/width of 512 pixels, a pattern you're welcome to follow.
  • Dataset Split: If you have many examples, split them into training/validation. Use a subset (n_samples) during optimization to find a better prompt, and reserve unseen data to verify the improvement generalizes. This prevents overfitting the prompt to a few items.
  • Evaluation Metric Design: For the Hierarchical Reflective optimizer, return a ScoreResult with a reason for each example; these reasons drive its root-cause analysis. Poor or missing reasons can make the optimizer less effective. Other optimizers behave differently, so knowing that evaluations are critical to success is essential. You can also consider whether LLM-as-a-judge is a viable evaluation metric for more complex scenarios.
  • Iteration and Logging: The example script automatically logs each trial's prompts and scores. Inspect these to understand how the prompt changed. If results stagnate, try increasing max_trials or using a different optimizer algorithm. You can also chain optimizers: take the output prompt from one optimizer and feed it into another. This is a great way to mix multiple approaches and ensemble optimizers to achieve better combined performance.
  • Combine with Other Methods: We can also mix extra steps and data into the optimizer, for example using bounding boxes, or adding annotations through purpose-built visual processing models like Meta's SAM 3 to enrich our data with additional metadata. In practice, our input dataset could have image and image_annotated columns, both usable as input to the optimizer; see the sketch after this list.
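
Here is a hypothetical sketch of that last idea: a user message carrying both the raw and annotated image, assuming your dataset rows include image and image_annotated columns:

# Hypothetical: give the model both views of the scene in one user turn.
user_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "{question}"},
        {"type": "image_url", "image_url": {"url": "{image}"}},           # raw frame
        {"type": "image_url", "image_url": {"url": "{image_annotated}"}}, # SAM-annotated
    ],
}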

Takeaways and Future Outlook of Optimizers

Thanks for following along with this. As part of this walkthrough, we explored:

  1. Getting started with open-source agent & prompt optimization
  2. Creating a process to optimize a multimodal vision-based agent
  3. Evaluating with image-based datasets in the context of LLMs

Moving forward, automating prompt design is becoming increasingly important as vision-capable LLMs advance. Thoughtfully optimized prompts can significantly improve model performance on complex multimodal tasks. Optimizers show how we can harness LLMs themselves to refine instructions, turning a long, tedious, and very manual process into a systematic search.

Looking ahead, we can begin to see new ways of working in which automatic prompt and agent-optimization tools replace outdated prompt-engineering methods and fully leverage each model's own understanding.


Enjoyed This Article?

Vincent Koc is a highly accomplished AI research engineer, author, and lecturer with a wealth of experience across a variety of global corporations. He works primarily on open-source development in artificial intelligence, with a keen interest in optimization approaches. Feel free to connect with him on LinkedIn and X if you want to stay connected or have any questions about the hands-on example.


References

[1] Y. Choi et al. Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs. https://arxiv.org/abs/2510.09201

[2] M. Suzgun and A. T. Kalai. Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding. https://arxiv.org/abs/2401.12954

[3] K. Charoenpitaks et al. Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction. https://ieeexplore.ieee.org/document/10568360 & https://github.com/DHPR-dataset/DHPR-dataset

[4] F. Yu et al. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. https://arxiv.org/abs/1805.04687 & https://bair.berkeley.edu/blog/2018/05/30/bdd/

[5] Chen et al. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. https://dl.acm.org/doi/10.5555/3692070.3692324 & https://mllm-judge.github.io/

[6] Opik. HRPO (Hierarchical Reflective Prompt Optimizer). https://www.comet.com/docs/opik/agent_optimization/algorithms/hierarchical_adaptive_optimizer & https://www.comet.com/site/products/opik/features/automatic-prompt-optimization/

[7] Meta. Introducing Meta Segment Anything Model 3 and Segment Anything Playground. https://ai.meta.com/blog/segment-anything-model-3/
