We now support VLMs in smolagents!

By Aymeric Roucher and merve




TL;DR

We have now added vision support to smolagents, which unlocks using vision language models in agentic pipelines natively.






Overview

In the agentic world, many capabilities are hidden behind a vision wall. A standard example is web browsing: web pages feature rich visual content that you can never fully recover by simply extracting their text, be it the relative position of objects, messages conveyed through color, specific icons… In this case, vision is a real superpower for agents. So we just added this capability to our smolagents!

Teaser of what this enables: an agentic browser that navigates the web in complete autonomy!

Here’s an example of what it looks like:



How we gave sight to smolagents

🤔 How do we want to pass images to agents? Passing an image can be done in two ways:

  1. You can have images directly available to the agent at start. This is often the case for Document AI.
  2. Sometimes, images need to be added dynamically. An example is when a web browser agent has just performed an action and needs to see the impact on its viewport.



1. Pass images once at agent start

For the case where we want to pass images right away, we added the possibility to pass a list of images to the agent in the run method: agent.run("Describe these images:", images=[image_1, image_2]).

These image inputs are then stored in the task_images attribute of TaskStep, together with the prompt of the task that you want to perform.

When running the agent, these images are passed to the model. This is useful for cases like taking actions based on long PDFs that include visual elements.
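
For instance, here is a minimal sketch of this usage (the image file names and the gpt-4o model choice are placeholder assumptions, not from the original example):

from PIL import Image

from smolagents import CodeAgent, OpenAIServerModel

# Placeholder document scans: replace with your own image files
image_1 = Image.open("document_page_1.png")
image_2 = Image.open("document_page_2.png")

# Any vision-capable model works; gpt-4o is just an assumption here
model = OpenAIServerModel(model_id="gpt-4o")
agent = CodeAgent(tools=[], model=model)

# The images are stored with the task and forwarded to the model
agent.run("Describe these images:", images=[image_1, image_2])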



2. Pass images at each step ⇒ use a callback

How can we dynamically add images into the agent’s memory?

To find out, we first need to understand how our agents work.

All agents in smolagents are based on the single MultiStepAgent class, which is an abstraction of the ReAct framework. On a basic level, this class performs actions in a cycle of steps (sketched in pseudocode after the list below), where existing variables and knowledge are incorporated into the agent logs as follows:

  • Initialization: the system prompt is stored in a SystemPromptStep, and the user query is logged into a TaskStep.
  • ReAct Loop (While):
    1. Use agent.write_inner_memory_from_logs() to write the agent logs into a list of LLM-readable chat messages.
    2. Send these messages to a Model object to get its completion. Parse the completion to get the action (a JSON blob for ToolCallingAgent, a code snippet for CodeAgent).
    3. Execute the action and log the result into memory (an ActionStep).
    4. At the end of each step, run all callback functions defined in agent.step_callbacks.
      ⇒ This is where we added support for images: make a callback that logs images into memory!
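
To make this flow concrete, here is a rough pseudocode sketch of the loop (parse_action and execute are hypothetical placeholders standing in for smolagents internals, not real functions):

# Rough pseudocode of the ReAct loop described above, not the actual smolagents source
while not task_done and step_number <= max_steps:
    # 1. Write the logs (system prompt, task, previous steps) into chat messages
    messages = agent.write_inner_memory_from_logs()

    # 2. Get a completion from the model and parse the action out of it
    completion = model(messages)
    action = parse_action(completion)  # JSON blob or code snippet

    # 3. Execute the action and log the result into memory
    step_log = ActionStep(step_number=step_number)
    step_log.observations = execute(action)

    # 4. Run all callbacks, e.g. one that attaches screenshots to this step
    for callback in agent.step_callbacks:
        callback(step_log, agent)

    step_number += 1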

The figure below details this process:

As you can see, for use cases where images are dynamically retrieved (e.g. a web browser agent), we support adding images to the model’s ActionStep, in the attribute step_log.observations_images.

This can be done via a callback, which will be run at the end of each step.
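
As a toy illustration of the mechanism (a minimal sketch only: the solid-color image and the gpt-4o model are stand-ins, and the real save_screenshot callback for the browser agent follows in the next section), a callback just accepts the current step log and the agent, and writes to observations_images:

from PIL import Image

from smolagents import CodeAgent, OpenAIServerModel
from smolagents.agents import ActionStep

def add_dummy_image(step_log: ActionStep, agent: CodeAgent) -> None:
    # Attach a solid red image as this step's visual observation
    step_log.observations_images = [Image.new("RGB", (64, 64), "red")]

model = OpenAIServerModel(model_id="gpt-4o")
agent = CodeAgent(tools=[], model=model, step_callbacks=[add_dummy_image])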

Let’s show how to make such a callback, and use it to build a web browser agent. 👇👇



Create a web browsing agent with vision

We’re going to use helium. It provides browser automation based on selenium: this is an easier way for our agent to manipulate webpages.

pip install "smolagents[all]" helium selenium python-dotenv

The agent itself can use helium directly, so there is no need for specific tools: it can use helium to perform actions, such as click("top 10") to click the button named “top 10” visible on the page.
We still need to make some tools to help the agent navigate the web: a tool to go back to the previous page, and another tool to close pop-ups, because these are quite hard to grab for helium since their close buttons don’t have any text.

from io import BytesIO
from time import sleep

import helium
from dotenv import load_dotenv
from PIL import Image
from selenium import webdriver
from selenium.common.exceptions import ElementNotInteractableException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from smolagents import CodeAgent, LiteLLMModel, OpenAIServerModel, TransformersModel, tool
from smolagents.agents import ActionStep


load_dotenv()
import os

@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'."
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result

@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()

@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows! This doesn't work on cookie consent banners.
    """
    
    modal_selectors = [
        "button[class*='close']",
        "[class*='modal']",
        "[class*='modal'] button",
        "[class*='CloseButton']",
        "[aria-label*='close']",
        ".modal-close",
        ".close-modal",
        ".modal .close",
        ".modal-backdrop",
        ".modal-overlay",
        "[class*='overlay']"
    ]

    wait = WebDriverWait(driver, timeout=0.5)

    for selector in modal_selectors:
        try:
            elements = wait.until(
                EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
            )

            for element in elements:
                if element.is_displayed():
                    try:
                        # Try clicking with JavaScript first, as it is usually more reliable
                        driver.execute_script("arguments[0].click();", element)
                    except ElementNotInteractableException:
                        # If the JavaScript click fails, fall back to a regular click
                        element.click()

        except TimeoutException:
            continue
        except Exception as e:
            print(f"Error handling selector {selector}: {str(e)}")
            continue
    return "Modals closed"

For now, the agent has no visual input.
So let us show how to dynamically feed it images into its step logs by using a callback.
We make a callback save_screenshot that will be run at the end of each step.

def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
    sleep(1.0)  # Let JavaScript animations happen before taking the screenshot
    driver = helium.get_driver()
    current_step = step_log.step_number
    if driver is not None:
        for previous_step in agent.logs:  # Remove previous screenshots from logs for lean processing
            if isinstance(previous_step, ActionStep) and previous_step.step_number <= current_step - 2:
                previous_step.observations_images = None
        png_bytes = driver.get_screenshot_as_png()
        image = Image.open(BytesIO(png_bytes))
        print(f"Captured a browser screenshot: {image.size} pixels")
        step_log.observations_images = [image.copy()]  # Create a copy to ensure it persists

    # Update observations with the current URL
    url_info = f"Current url: {driver.current_url}"
    step_log.observations = url_info if step_log.observations is None else step_log.observations + "\n" + url_info
    return

The most important line here is where we add the image to our observation images: step_log.observations_images = [image.copy()].

This callback accepts both the step_log and the agent itself as arguments. Having agent as an input allows you to perform deeper operations than simply modifying the latest logs.

Let’s make a model. We have added support for images in all models.
Just one caveat: when using TransformersModel with a VLM, for it to work properly you need to pass
flatten_messages_as_text=False upon initialization, like:

model = TransformersModel(model_id="HuggingFaceTB/SmolVLM-Instruct", device_map="auto", flatten_messages_as_text=False)

For this demo, let’s use a bigger Qwen2-VL via the Fireworks API:

model = OpenAIServerModel(
    api_key=os.getenv("FIREWORKS_API_KEY"),
    api_base="https://api.fireworks.ai/inference/v1",
    model_id="accounts/fireworks/models/qwen2-vl-72b-instruct",
)

Now let’s move on to defining our agent. We set the highest verbosity_level to display the LLM’s full output messages so that we can view its thoughts, and we increased max_steps to 20 to give the agent more steps to explore the web.
We also provide it with our callback save_screenshot defined above.

agent = CodeAgent(
    tools=[go_back, close_popups, search_item_ctrl_f],
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],
    max_steps=20,
    verbosity_level=2
)

Finally, we provide our agent with some guidance about using helium.

helium_instructions = """
You can use helium to access websites. Don't bother about the helium driver, it's already managed.
First you need to import everything from helium, then you can do other actions!
Code:
```py
from helium import *
go_to('github.com/trending')
```

You can directly click on clickable elements by inputting the text that appears on them.
Code:
```py
click("Top products")
```

If it is a link:
Code:
```py
click(Link("Top products"))
```

If you try to interact with an element and it's not found, you'll get a LookupError.
In general, stop your action after each button click to see what happens on your screenshot.
Never try to log in to a page.

To scroll up or down, use scroll_down or scroll_up with the number of pixels to scroll as an argument.
Code:
```py
scroll_down(num_pixels=1200) # This will scroll one viewport down
```

When you have pop-ups with a cross icon to close, don't try to click the close icon by finding its element or targeting an 'X' element (this most often fails).
Just use your built-in tool `close_popups` to close them:
Code:
```py
close_popups()
```

You can use .exists() to check for the existence of an element. For example:
Code:
```py
if Text('Accept cookies?').exists():
    click('I accept')
```

Proceed in several steps rather than trying to solve the task in one shot.
And at the end, only when you have your answer, return your final answer.
Code:
```py
final_answer("YOUR_ANSWER_HERE")
```

If pages seem stuck on loading, you might have to wait, for instance `import time` and run `time.sleep(5.0)`. But don't overuse this!
To list elements on a page, DO NOT try code-based element searches like 'contributors = find_all(S("ol > li"))': just look at the latest screenshot you have and read it visually, or use your tool search_item_ctrl_f.
Of course, you can act on buttons like a user would do when navigating.
After each code blob you write, you will automatically be provided with an updated screenshot of the browser and the current browser URL.
But beware that the screenshot will only be taken at the end of the whole action, it won't see intermediate states.
Don't kill the browser.
"""



Running the agent

Now everything’s ready: let’s run our agent!

github_request = """
I'm trying to find out how hard I have to work to get a repo in github.com/trending.
Can you navigate to the profile for the top author of the top trending repo, and give me their total number of commits over the last year?
"""

agent.run(github_request + helium_instructions)

Note, however, that this task is really hard: depending on the VLM that you use, this won’t always work. Strong VLMs like Qwen2-VL-72B or GPT-4o succeed more often.



Next Steps

This gives you a glimpse of the capabilities of a vision-enabled CodeAgent, but there’s much more to do!

We’re looking forward to seeing what you’ll build with vision language models and smolagents!


