Detecting and Editing Visual Objects with Gemini

✨ Overview

Traditional computer vision models are typically trained to detect a set set of object classes, like “person”, “cat”, or “automobile”. If you ought to detect something specific that wasn’t within the training set, comparable to an “illustration” in a book photograph, you often have to collect a dataset, label it manually, and train a custom model, which might take hours and even days.

On this exploration, we’ll test a distinct approach using Gemini. We are going to leverage its spatial understanding capabilities to perform open-vocabulary object detection. This enables us to search out objects based solely on a natural language description, with none training.

Once the visual objects are detected, we’ll extract them after which use Gemini’s image editing capabilities (specifically the Nano Banana models) to revive and creatively transform them.

🔥 Challenge

We’re coping with unstructured data: photos of books, magazines, and objects within the wild. These images present several difficulties for traditional computer vision:

Variety: The objects we would like to search out (illustrations, engravings, and any visuals on the whole) vary wildly in style and content.
Distortion: Pages are curved, photos are taken at angles, and lighting is uneven.
Noise: Old books have stains, paper grain, and text bleeding through from the opposite side.

Our challenge is to construct a sturdy pipeline that may detect these objects despite the distortions, extract them cleanly, and edit them to appear like high-quality digital assets… all using easy text prompts.

🏁 Setup

🐍 Python packages

We’ll use the next packages:

google-genai: the Google Gen AI Python SDK lets us call Gemini with just a few lines of code
pillow for image management
matplotlib for result visualization

We’ll also use these packages (dependencies of google-genai):

pydantic for data management
tenacity for request management

pip install --quiet "google-genai>=1.63.0" "pillow>=11.3.0" "matplotlib>=3.10.0"

🔗 Gemini API

To make use of the Gemini API, we’ve got two important options:

Via Vertex AI with a Google Cloud project
Via Google AI Studio with a Gemini API key

The Google Gen AI SDK provides a unified interface to those APIs, and we are able to use environment variables for the configuration. 🔽

🛠️ Option 1 – Gemini API via Vertex AI

Requirements:

Gen AI SDK environment variables:

GOOGLE_GENAI_USE_VERTEXAI="True"
GOOGLE_CLOUD_PROJECT=""
GOOGLE_CLOUD_LOCATION=""

💡 For preview models, the placement have to be set to global. For generally available models, we are able to select the closest location among the many Google model endpoint locations.

ℹ️ Learn more about organising a project and a development environment.

🛠️ Option 2 – Gemini API via Google AI Studio

Requirement:

Gen AI SDK environment variables:

GOOGLE_GENAI_USE_VERTEXAI="False"
GOOGLE_API_KEY=""

ℹ️ Learn more about getting a Gemini API key from Google AI Studio.

💡 You’ll be able to store your environment configuration outside of the source code:

Environment	Method
IDE	`.env` file (or equivalent)
Colab	Colab Secrets (🗝️ icon in left panel, see code below)
Colab Enterprise	Google Cloud project and site are routinely defined
Vertex AI Workbench	Google Cloud project and site are routinely defined

Define the next environment detection functions. You can even define your configuration manually if needed. 🔽

import os
import sys
from collections.abc import Callable

from google import genai

# Manual setup (leave unchanged if setup is environment-defined)

# @markdown **Which API: Vertex AI or Google AI Studio?**
GOOGLE_GENAI_USE_VERTEXAI = True  # @param {type: "boolean"}

# @markdown **Option A - Google Cloud project [+location]**
GOOGLE_CLOUD_PROJECT = ""  # @param {type: "string"}
GOOGLE_CLOUD_LOCATION = "global"  # @param {type: "string"}

# @markdown **Option B - Google AI Studio API key**
GOOGLE_API_KEY = ""  # @param {type: "string"}


def check_environment() -> bool:
    check_colab_user_authentication()
    return check_manual_setup() or check_vertex_ai() or check_colab() or check_local()


def check_manual_setup() -> bool:
    return check_define_env_vars(
        GOOGLE_GENAI_USE_VERTEXAI,
        GOOGLE_CLOUD_PROJECT.strip(),  # May need been pasted with line return
        GOOGLE_CLOUD_LOCATION,
        GOOGLE_API_KEY,
    )


def check_vertex_ai() -> bool:
    # Workbench and Colab Enterprise
    match os.getenv("VERTEX_PRODUCT", ""):
        case "WORKBENCH_INSTANCE":
            pass
        case "COLAB_ENTERPRISE":
            if not running_in_colab_env():
                return False
        case _:
            return False

    return check_define_env_vars(
        True,
        os.getenv("GOOGLE_CLOUD_PROJECT", ""),
        os.getenv("GOOGLE_CLOUD_REGION", ""),
        "",
    )


def check_colab() -> bool:
    if not running_in_colab_env():
        return False

    # Colab Enterprise was checked before, so that is Colab only
    from google.colab import auth as colab_auth  # type: ignore

    colab_auth.authenticate_user()

    # Use Colab Secrets (🗝️ icon in left panel) to store the environment variables
    # Secrets are private, visible only to you and the notebooks that you just select
    # - Vertex AI: Store your settings as secrets
    # - Google AI: Directly import your Gemini API key from the UI
    vertexai, project, location, api_key = get_vars(get_colab_secret)

    return check_define_env_vars(vertexai, project, location, api_key)


def check_local() -> bool:
    vertexai, project, location, api_key = get_vars(os.getenv)

    return check_define_env_vars(vertexai, project, location, api_key)


def running_in_colab_env() -> bool:
    # Colab or Colab Enterprise
    return "google.colab" in sys.modules


def check_colab_user_authentication() -> None:
    if running_in_colab_env():
        from google.colab import auth as colab_auth  # type: ignore

        colab_auth.authenticate_user()


def get_colab_secret(secret_name: str, default: str) -> str:
    from google.colab import errors, userdata  # type: ignore

    try:
        return userdata.get(secret_name)
    except errors.SecretNotFoundError:
        return default


def disable_colab_cell_scrollbar() -> None:
    if running_in_colab_env():
        from google.colab import output  # type: ignore

        output.no_vertical_scroll()


def get_vars(getenv: Callable[[str, str], str]) -> tuple[bool, str, str, str]:
    # Limit getenv calls to the minimum (may trigger UI confirmation for secret access)
    vertexai_str = getenv("GOOGLE_GENAI_USE_VERTEXAI", "")
    if vertexai_str:
        vertexai = vertexai_str.lower() in ["true", "1"]
    else:
        vertexai = bool(getenv("GOOGLE_CLOUD_PROJECT", ""))

    project = getenv("GOOGLE_CLOUD_PROJECT", "") if vertexai else ""
    location = getenv("GOOGLE_CLOUD_LOCATION", "") if project else ""
    api_key = getenv("GOOGLE_API_KEY", "") if not project else ""

    return vertexai, project, location, api_key


def check_define_env_vars(
    vertexai: bool,
    project: str,
    location: str,
    api_key: str,
) -> bool:
    match (vertexai, bool(project), bool(location), bool(api_key)):
        case (True, True, _, _):
            # Vertex AI - Google Cloud project [+location]
            location = location or "global"
            define_env_vars(vertexai, project, location, "")
        case (True, False, _, True):
            # Vertex AI - API key
            define_env_vars(vertexai, "", "", api_key)
        case (False, _, _, True):
            # Google AI Studio - API key
            define_env_vars(vertexai, "", "", api_key)
        case _:
            return False

    return True


def define_env_vars(vertexai: bool, project: str, location: str, api_key: str) -> None:
    os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = str(vertexai)
    os.environ["GOOGLE_CLOUD_PROJECT"] = project
    os.environ["GOOGLE_CLOUD_LOCATION"] = location
    os.environ["GOOGLE_API_KEY"] = api_key


def check_configuration(client: genai.Client) -> None:
    service = "Vertex AI" if client.vertexai else "Google AI Studio"
    print(f"✅ Using the {service} API", end="")

    if client._api_client.project:
        print(f' with project "{client._api_client.project[:7]}…"', end="")
        print(f' in location "{client._api_client.location}"')
    elif client._api_client.api_key:
        api_key = client._api_client.api_key
        print(f' with API key "{api_key[:5]}…{api_key[-5:]}"', end="")
        print(f" (in case of error, make certain it was created for {service})")


print("✅ Environment functions defined")

🤖 Gen AI SDK

To send Gemini requests, create a google.genai client:

from google import genai

check_environment()

client = genai.Client()

check_configuration(client)

🖼️ Image test suite

Let’s define an inventory of images for our tests: 🔽

from dataclasses import dataclass
from enum import StrEnum

Url = str


class Source(StrEnum):
    incunable = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014rosen0487:0165/full/pct:25/0/default.jpg"
    engravings = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg"
    museum_guidebook = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014gen34181:0033/full/pct:75/0/default.jpg"
    denver_illustrated = "https://tile.loc.gov/image-services/iiif/service:gdc:gdclccn:rc:01:00:04:94:rc01000494:0051/full/pct:50/0/default.jpg"
    physics_textbook = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:03:64:87:31:8:00036487318:0103/full/pct:50/0/default.jpg"
    portrait_miniatures = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2024:2024rosen013592v02:0249/full/pct:50/0/default.jpg"
    wizard_of_oz_drawings = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2006:2006gen32405:0048/full/pct:25/0/default.jpg"
    paintings = "https://images.unsplash.com/photo-1714146681164-f26fed839692?h=1440"
    alice_drawing = "https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"
    book = "https://images.unsplash.com/photo-1643451533573-ee364ba6e330?h=800"
    manual = "https://images.unsplash.com/photo-1623666936367-a100f62ba9b7?h=800"
    electronics = "https://images.unsplash.com/photo-1757397584789-8b2c5bfcdbc3?h=1440"


@dataclass
class SourceMetadata:
    title: str
    webpage_url: Url
    credit_line: str


LOC = "Library of Congress"
LOC_RARE_BOOKS = "Library of Congress, Rare Book and Special Collections Division"
LOC_MEETING_FRONTIERS = "Library of Congress, Meeting of Frontiers"

metadata_by_source: dict[Source, SourceMetadata] = {
    Source.incunable: SourceMetadata(
        "Vergaderinge der historien van Troy (1485)",
        "https://www.loc.gov/resource/rbc0001.2014rosen0487/?sp=165",
        LOC_RARE_BOOKS,
    ),
    Source.engravings: SourceMetadata(
        "Harper's illustrated catalogue (1847)",
        "https://www.loc.gov/resource/gdcscd.00340766921/?sp=121",
        LOC,
    ),
    Source.museum_guidebook: SourceMetadata(
        "Barnum's American Museum illustrated (1850)",
        "https://www.loc.gov/resource/rbc0001.2014gen34181/?sp=33",
        LOC_RARE_BOOKS,
    ),
    Source.denver_illustrated: SourceMetadata(
        "Denver illustrated (1893)",
        "https://www.loc.gov/resource/gdclccn.rc01000494/?sp=51",
        LOC_MEETING_FRONTIERS,
    ),
    Source.physics_textbook: SourceMetadata(
        "Lessons in physics (1916)",
        "https://www.loc.gov/resource/gdcscd.00036487318/?sp=103",
        LOC,
    ),
    Source.portrait_miniatures: SourceMetadata(
        "The history of portrait miniatures (1904)",
        "https://www.loc.gov/resource/rbc0001.2024rosen013592v02/?sp=249",
        LOC_RARE_BOOKS,
    ),
    Source.wizard_of_oz_drawings: SourceMetadata(
        "The wonderful Wizard of Oz (1899)",
        "https://www.loc.gov/resource/rbc0001.2006gen32405/?sp=48",
        LOC_RARE_BOOKS,
    ),
    Source.paintings: SourceMetadata(
        "Open book showing paintings by Vincent van Gogh",
        "https://unsplash.com/photos/9hD7qrxICag",
        "Photo by Trung Manh cong on Unsplash",
    ),
    Source.alice_drawing: SourceMetadata(
        "Open book showing an illustration and text from Alice's Adventures in Wonderland",
        "https://unsplash.com/photos/bewzr_Q9u2o",
        "Photo by Brett Jordan on Unsplash",
    ),
    Source.book: SourceMetadata(
        "Open book showing two botanical illustrations",
        "https://unsplash.com/photos/4IDqcNj827I",
        "Photo by Ranurte on Unsplash",
    ),
    Source.manual: SourceMetadata(
        "Open user manual for vintage camera",
        "https://unsplash.com/photos/aaFU96eYASk",
        "Photo by Annie Spratt on Unsplash",
    ),
    Source.electronics: SourceMetadata(
        "Circuit board with electronic components",
        "https://unsplash.com/photos/Aqa1pHQ57pw",
        "Photo by Albert Stoynov on Unsplash",
    ),
}

print("✅ Test images defined")

🧠 Gemini models

Gemini comes in several versions. We are able to currently use the next models:

For object detection: Gemini 2.5 or Gemini 3, each available in Flash or Pro versions.
For object editing: Gemini 2.5 Flash Image or Gemini 3 Pro Image, also often called Nano Banana and Nano Banana Pro.

🛠️ Helpers

Now, let’s add core helper classes and functions: 🔽

from enum import auto
from pathlib import Path
from typing import Any, forged

import IPython.display
import matplotlib.pyplot as plt
import pydantic
import tenacity
from google.genai.errors import ClientError
from google.genai.types import (
    FinishReason,
    GenerateContentConfig,
    GenerateContentResponse,
    PIL_Image,
    ThinkingConfig,
    ThinkingLevel,
)


# Multimodal models with spatial understanding and structured outputs
class MultimodalModel(StrEnum):
    # Generally Available (GA)
    GEMINI_2_5_FLASH = "gemini-2.5-flash"
    GEMINI_2_5_PRO = "gemini-2.5-pro"
    # Preview
    GEMINI_3_FLASH_PREVIEW = "gemini-3-flash-preview"
    GEMINI_3_1_PRO_PREVIEW = "gemini-3.1-pro-preview"
    # Default model used for object detection
    DEFAULT = GEMINI_3_FLASH_PREVIEW


# Image generation and editing models
class ImageModel(StrEnum):
    # Generally Available (GA)
    GEMINI_2_5_FLASH_IMAGE = "gemini-2.5-flash-image"  # Nano Banana 🍌
    # Preview
    GEMINI_3_PRO_IMAGE_PREVIEW = "gemini-3-pro-image-preview"  # Nano Banana Pro 🍌
    # Default model used for image editing
    DEFAULT = GEMINI_2_5_FLASH_IMAGE


Model = MultimodalModel | ImageModel


def generate_content(
    contents: list[Any],
    model: Model,
    config: GenerateContentConfig | None,
    should_display_response_info: bool = False,
) -> GenerateContentResponse | None:
    response = None
    client = check_client_for_model(model)

    for attempt in get_retrier():
        with attempt:
            response = client.models.generate_content(
                model=model.value,
                contents=contents,
                config=config,
            )
    if should_display_response_info:
        display_response_info(response, config)

    return response


def check_client_for_model(model: Model) -> genai.Client:
    if (
        model.value.endswith("-preview")
        and client.vertexai
        and client._api_client.location != "global"
    ):
        # Preview models are only available on the "global" location
        return genai.Client(location="global")

    return client


def display_response_info(
    response: GenerateContentResponse | None,
    config: GenerateContentConfig | None,
) -> None:
    if response is None:
        print("❌ No response")
        return

    if usage_metadata := response.usage_metadata:
        if usage_metadata.prompt_token_count:
            print(f"Input tokens   : {usage_metadata.prompt_token_count:9,d}")
        if usage_metadata.candidates_token_count:
            print(f"Output tokens  : {usage_metadata.candidates_token_count:9,d}")
        if usage_metadata.thoughts_token_count:
            print(f"Thoughts tokens: {usage_metadata.thoughts_token_count:9,d}")

    if (
        config is just not None
        and config.response_mime_type == "application/json"
        and response.parsed is None
    ):
        print("❌ Couldn't parse the JSON response")
        return
    if not response.candidates:
        print("❌ No `response.candidates`")
        return
    if (finish_reason := response.candidates[0].finish_reason) != FinishReason.STOP:
        print(f"❌ {finish_reason = }")
    if not response.text:
        print("❌ No `response.text`")
        return


def generate_image(
    sources: list[PIL_Image],
    prompt: str,
    model: ImageModel,
    config: GenerateContentConfig | None = None,
) -> PIL_Image | None:
    contents = [*sources, prompt.strip()]

    response = generate_content(contents, model, config)

    return check_get_output_image_from_response(response)


def check_get_output_image_from_response(
    response: GenerateContentResponse | None,
) -> PIL_Image | None:
    if response is None:
        print("❌ No `response`")
        return None
    if not response.candidates:
        print("❌ No `response.candidates`")
        if response.prompt_feedback:
            if block_reason := response.prompt_feedback.block_reason:
                print(f"{block_reason = :s}")
            if block_reason_message := response.prompt_feedback.block_reason_message:
                print(f"{block_reason_message = }")
        return None
    if not (content := response.candidates[0].content):
        print("❌ No `response.candidates[0].content`")
        return None
    if not (parts := content.parts):
        print("❌ No `response.candidates[0].content.parts`")
        return None

    output_image: PIL_Image | None = None
    for part in parts:
        if part.text:
            display_markdown(part.text)
            proceed
        sdk_image = part.as_image()
        assert sdk_image is just not None
        output_image = sdk_image._pil_image
        assert output_image is just not None
        break  # There needs to be a single image

    return output_image


def get_thinking_config(model: Model) -> ThinkingConfig | None:
    match model:
        case MultimodalModel.GEMINI_2_5_FLASH:
            return ThinkingConfig(thinking_budget=0)
        case MultimodalModel.GEMINI_2_5_PRO:
            return ThinkingConfig(thinking_budget=128, include_thoughts=False)
        case MultimodalModel.GEMINI_3_FLASH_PREVIEW:
            return ThinkingConfig(thinking_level=ThinkingLevel.MINIMAL)
        case MultimodalModel.GEMINI_3_1_PRO_PREVIEW:
            return ThinkingConfig(thinking_level=ThinkingLevel.LOW)
        case _:
            return None  # Default


def display_markdown(markdown: str) -> None:
    IPython.display.display(IPython.display.Markdown(markdown))


def display_image(image: PIL_Image) -> None:
    IPython.display.display(image)


def get_retrier() -> tenacity.Retrying:
    return tenacity.Retrying(
        stop=tenacity.stop_after_attempt(7),
        wait=tenacity.wait_incrementing(start=10, increment=1),
        retry=should_retry_request,
        reraise=True,
    )


def should_retry_request(retry_state: tenacity.RetryCallState) -> bool:
    if not retry_state.end result:
        return False
    err = retry_state.end result.exception()
    if not isinstance(err, ClientError):
        return False
    print(f"❌ ClientError {err.code}: {err.message}")

    retry = False
    match err.code:
        case 400 if err.message is just not None and " try again " in err.message:
            # Workshop: first time access to Cloud Storage (service agent provisioning)
            retry = True
        case 429:
            # Workshop: temporary project with 1 QPM quota
            retry = True
    print(f"🔄 Retry: {retry}")

    return retry


print("✅ Helpers defined")

🔍 Detecting visual objects

To perform visual object detection, craft the prompt to point what you’d prefer to detect and the way results needs to be returned. In the identical request, it’s possible to also extract additional details about each detected object. This may be virtually anything, from labels comparable to “furniture”, “table”, or “chair”, to more precise classifications like “mammals” or “reptiles”, or to contextual data comparable to captions, colours, shapes, etc.

For the following tests, we’ll experiment with detecting illustrations inside book photos. Here’s a possible prompt:

OBJECT_DETECTION_PROMPT = """
Detect every illustration throughout the book photo and extract the next data for every:
- `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).
- `caption`: Verbatim caption or legend comparable to "Figure 1". Use "" if not found.
- `label`: Single-word label describing the illustration. Use "" if not found.
"""

Notes:

Bounding boxes are very useful for locating or extracting the detected objects.
Typically, for Gemini models, a box_2d bounding box represents coordinates normalized to a (0, 0, 1000, 1000) space for a (0, 0, width, height) input image.
We’re also requesting to extract captions (metadata often present in reference books) and labels (dynamic metadata).

To automate response processing, it’s convenient to define a Pydantic class that matches the prompt, comparable to:

class DetectedObject(pydantic.BaseModel):
    box_2d: list[int]
    caption: str
    label: str

DetectedObjects: TypeAlias = list[DetectedObject]

Then, request a structured output with config fields response_mime_type and response_schema:

config = GenerateContentConfig(
    # …,
    response_mime_type="application/json",
    response_schema=DetectedObjects,
    # …,
)

This may generate a JSON response which the SDK can parse routinely, letting us directly use object instances:

detected_objects = forged(DetectedObjects, response.parsed)

Let’s add just a few object-detection-specific classes and functions: 🔽

import io
import urllib.request
from collections.abc import Iterator
from dataclasses import field
from datetime import datetime

import PIL.Image
from google.genai.types import Part, PartMediaResolutionLevel
from PIL.PngImagePlugin import PngInfo

OBJECT_DETECTION_PROMPT = """
Detect every illustration throughout the book photo and extract the next data for every:
- `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).
- `caption`: Verbatim caption or legend comparable to "Figure 1". Use "" if not found.
- `label`: Single-word label describing the illustration. Use "" if not found.
"""

# Margin added to detected/cropped objects, giving more context for a greater understanding of spatial distortions
CROP_MARGIN_PX = 10

# Set to True to avoid wasting each generated image
SAVE_GENERATED_IMAGES = False
OUTPUT_IMAGES_PATH = Path("./object_detection_and_editing")


# Matching class for structured output generation
class DetectedObject(pydantic.BaseModel):
    box_2d: list[int]
    caption: str
    label: str


# Misc data classes
InputImage = Path | Url
DetectedObjects = list[DetectedObject]
WorkflowStepImages = list[PIL_Image]


class WorkflowStep(StrEnum):
    SOURCE = auto()
    CROPPED = auto()
    RESTORED = auto()
    COLORIZED = auto()
    CINEMATIZED = auto()


@dataclass
class VisualObjectWorkflow:
    source_image: PIL_Image
    detected_objects: DetectedObjects
    images_by_step: dict[WorkflowStep, WorkflowStepImages] = field(default_factory=dict)

    def __post_init__(self) -> None:
        denormalize_bounding_boxes(self)


workflow_by_image: dict[InputImage, VisualObjectWorkflow] = {}


def denormalize_bounding_boxes(self: VisualObjectWorkflow) -> None:
    """Convert the box_2d coordinates.
    - Before: [y1, x1, y2, x2] normalized to 0-1000, as returned by Gemini
    - After:  [x1, y1, x2, y2] in source_image coordinates, as utilized in Pillow
    """

    def to_image_coord(coord: int, dim: int) -> int:
        return int(coord * dim / 1000 + 0.5)

    w, h = self.source_image.size
    for obj in self.detected_objects:
        y1, x1, y2, x2 = obj.box_2d
        x1, x2 = to_image_coord(x1, w), to_image_coord(x2, w)
        y1, y2 = to_image_coord(y1, h), to_image_coord(y2, h)
        obj.box_2d = [x1, y1, x2, y2]


def detect_objects(
    image: InputImage,
    prompt: str = OBJECT_DETECTION_PROMPT,
    model: MultimodalModel = MultimodalModel.DEFAULT,
    config: GenerateContentConfig | None = None,
    media_resolution: PartMediaResolutionLevel | None = None,
    display_results: bool = True,
) -> None:
    display_image_source_info(image)
    pil_image, content_part = get_pil_image_and_part(image, model, media_resolution)
    prompt = prompt.strip()
    contents = [content_part, prompt]
    config = config or get_object_detection_config(model)

    response = generate_content(contents, model, config)

    if response is just not None and response.parsed is just not None:
        detected_objects = forged(DetectedObjects, response.parsed)
    else:
        detected_objects = DetectedObjects()

    workflow = VisualObjectWorkflow(pil_image, detected_objects)
    workflow_by_image[image] = workflow
    add_cropped_objects(workflow, image, prompt)

    if display_results:
        display_detected_objects(workflow)


def get_pil_image_and_part(
    image: InputImage,
    model: MultimodalModel,
    media_resolution: PartMediaResolutionLevel | None,
) -> tuple[PIL_Image, Part]:
    if isinstance(image, Path):
        image_bytes = image.read_bytes()
    else:
        headers = {"User-Agent": "Mozilla/5.0"}
        req = urllib.request.Request(image, headers=headers)
        with urllib.request.urlopen(req, timeout=10) as response:
            image_bytes = response.read()

    pil_image = PIL.Image.open(io.BytesIO(image_bytes))
    content_part = Part.from_bytes(
        data=image_bytes,
        mime_type="image/*",
        media_resolution=media_resolution,
    )

    return pil_image, content_part


def get_object_detection_config(model: Model) -> GenerateContentConfig:
    # Low randomness for more determinism
    return GenerateContentConfig(
        temperature=0.0,
        top_p=0.0,
        seed=42,
        response_mime_type="application/json",
        response_schema=DetectedObjects,
        thinking_config=get_thinking_config(model),
    )


def add_cropped_objects(
    workflow: VisualObjectWorkflow,
    input: InputImage,
    prompt: str,
    crop_margin: int = CROP_MARGIN_PX,
) -> None:
    cropped_images: list[PIL_Image] = []
    obj_count = len(workflow.detected_objects)
    for obj_order, obj in enumerate(workflow.detected_objects, 1):
        cropped_image, _ = extract_object_image(workflow.source_image, obj, crop_margin)
        cropped_images.append(cropped_image)
        save_workflow_image(
            WorkflowStep.SOURCE,
            WorkflowStep.CROPPED,
            input,
            obj_order,
            obj_count,
            cropped_image,
            dict(prompt=prompt, crop_margin=str(crop_margin)),
        )
    workflow.images_by_step[WorkflowStep.CROPPED] = cropped_images


def extract_object_image(
    image: PIL_Image,
    obj: DetectedObject,
    margin: int = 0,
) -> tuple[PIL_Image, tuple[int, int, int, int]]:
    def clamp(coord: int, dim: int) -> int:
        return min(max(coord, 0), dim)

    x1, y1, x2, y2 = obj.box_2d
    w, h = image.size
    if margin != 0:
        x1, x2 = clamp(x1 - margin, w), clamp(x2 + margin, w)
        y1, y2 = clamp(y1 - margin, h), clamp(y2 + margin, h)

    box = (x1, y1, x2, y2)
    object_image = image.crop(box)

    return object_image, box


def save_workflow_image(
    source_step: WorkflowStep,
    target_step: WorkflowStep,
    input_image: InputImage,
    obj_order: int,
    obj_count: int,
    target_image: PIL_Image | None,
    image_info: dict[str, str] | None = None,
) -> None:
    if not SAVE_GENERATED_IMAGES or target_image is None:
        return
    if not OUTPUT_IMAGES_PATH.is_dir():
        OUTPUT_IMAGES_PATH.mkdir(parents=True)
    time_str = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    try:
        filename = f"{Source(input_image).name}_"
    except ValueError:
        filename = ""
    filename += f"{obj_order}o{obj_count}_{source_step}_{target_step}_{time_str}.png"
    image_path = OUTPUT_IMAGES_PATH.joinpath(filename)
    params = {}
    if image_info:
        png_info = PngInfo()
        for k, v in image_info.items():
            png_info.add_text(k, v)
        params.update(pnginfo=png_info)
    target_image.save(image_path, **params)


# Matplotlib
FIGURE_FG_COLOR = "#F1F3F4"
FIGURE_BG_COLOR = "#202124"
EDGE_COLOR = "#80868B"
rcParams = {
    "figure.dpi": 300,
    "text.color": FIGURE_FG_COLOR,
    "figure.edgecolor": FIGURE_FG_COLOR,
    "axes.titlecolor": FIGURE_FG_COLOR,
    "axes.edgecolor": FIGURE_FG_COLOR,
    "xtick.color": FIGURE_FG_COLOR,
    "ytick.color": FIGURE_FG_COLOR,
    "figure.facecolor": FIGURE_BG_COLOR,
    "axes.edgecolor": EDGE_COLOR,
    "xtick.bottom": False,
    "xtick.top": False,
    "ytick.left": False,
    "ytick.right": False,
    "xtick.labelbottom": False,
    "ytick.labelleft": False,
}
plt.rcParams.update(rcParams)


def display_image_source_info(image: InputImage) -> None:
    def get_image_info_md() -> str:
        if image not in Source:
            return f"[[Source Image]({image})]"
        source = Source(image)
        metadata = metadata_by_source.get(source)
        if not metadata:
            return f"[[Source Image]({source.value})]"
        parts = [
            f"[Source Image]({source.value})",
            f"[Source Page]({metadata.webpage_url})",
            metadata.title,
            metadata.credit_line,
        ]
        separator = "•"
        inner_info = f" {separator} ".join(parts)
        return f"{separator} {inner_info} {separator}"

    def yield_md_rows() -> Iterator[str]:
        horizontal_line = "---"
        image_info = get_image_info_md()
        yield horizontal_line
        yield f"_{image_info}_"
        yield horizontal_line

    display_markdown(f"{chr(10)}{chr(10)}".join(yield_md_rows()))


def display_detected_objects(workflow: VisualObjectWorkflow) -> None:
    source_image = workflow.source_image
    detected_objects = PIL.Image.recent("RGB", source_image.size, "white")
    for obj in workflow.detected_objects:
        obj_image, box = extract_object_image(source_image, obj)
        detected_objects.paste(obj_image, (box[0], box[1]))

    _, (ax1, ax2) = plt.subplots(1, 2, layout="compressed")
    ax1.imshow(source_image)
    ax2.imshow(detected_objects)

    disable_colab_cell_scrollbar()
    plt.show()


print("✅ Object detection helpers defined")

🧪 Let’s start easy: can we detect the only illustration on this incunable from 1485?

detect_objects(Source.incunable)

💡 This works nicely. The bounding box may be very precise, enclosing the hand-colored woodcut illustration very tightly.

🧪 Now, let’s check the detection of the multiple visuals on this museum guidebook:

detect_objects(Source.museum_guidebook)

💡 Remarks:

The bounding boxes are again very precise.

The outcomes are perfect: there aren’t any false positives and no false negatives.

The captions below the visuals will not be enclosed throughout the bounding boxes, which was specifically requested. The bounding box granularity may be controlled by changing the prompt.

🧪 What about barely warped visuals?

detect_objects(Source.paintings)

💡 This doesn’t make a difference. Notice how the bottom-right painting is partially covered by the orange bookmark. We’ll attempt to fix that within the restoration step.

🧪 What concerning the tilted visuals on this book concerning the architecture in Denver?

detect_objects(Source.denver_illustrated)

💡 Each visual is perfectly detected: spatial understanding covers tilted objects.

🧪 Finally, let’s check the detection on this significantly warped book page from Alice’s Adventures in Wonderland:

detect_objects(Source.alice_drawing)

💡 Page curvature and other distortions don’t prevent non-rectangular objects from being detected. In actual fact, spatial understanding works on the pixel level, which explains this precision for warped objects. In case you’d prefer to work at a lower level, you may as well ask for a “segmentation mask” within the prompt and also you’ll get a base64-encoded PNG (each pixel giving the 0-255 probability it belongs to the thing throughout the bounding box). See the segmentation doc for more details.

🏷️ Text extraction and dynamic labeling

On top of localizing each object with its bounding box, our prompt requested to extract a verbatim caption and to assign a single-word label, when possible.

Let’s add a straightforward function to display the detection data in a table: 🔽

from collections import defaultdict


def display_detection_data(source: Source, show_consolidated: bool = False) -> None:
    def string_with_visible_linebreaks(s: str) -> str:
        return f'''"{s.replace(chr(10), "↩️")}"'''

    def yield_md_rows_consolidated(workflow: VisualObjectWorkflow) -> Iterator[str]:
        yield "| label | count | captions |"
        yield "| :--- | ---: | :--- |"
        stats = defaultdict(list)
        for obj in workflow.detected_objects:
            stats[obj.label].append(string_with_visible_linebreaks(obj.caption))
        for label, captions in stats.items():
            count = len(captions)
            label_captions = " • ".join(sorted(captions))
            yield f"| {label} | {count} | {label_captions} |"

    def yield_md_rows_with_bbox(workflow: VisualObjectWorkflow) -> Iterator[str]:
        yield "| box_2d | label | caption |"
        yield "| :--- | :--- | :--- |"
        for obj in workflow.detected_objects:
            yield f"| {obj.box_2d} | {obj.label} | {string_with_visible_linebreaks(obj.caption)} |"

    workflow = workflow_by_image.get(source)
    if workflow is None:
        print(f'❌ No detection for source "{source.name}"')
        return
    md_rows = list(
        yield_md_rows_consolidated(workflow)
        if show_consolidated
        else yield_md_rows_with_bbox(workflow)
    )
    display_image_source_info(source)
    display_markdown(chr(10).join(md_rows))

Within the museum guidebook, the dynamic labeling is precise in accordance with the context, and the captions below each illustration are perfectly extracted:

display_detection_data(Source.museum_guidebook)

box_2d	label	caption
[954, 629, 1338, 1166]	beetle	“The Horned Beetle.”
[265, 984, 464, 1504]	armor	“Armor of a Man.”
[737, 984, 915, 1328]	armor	“Horse Armor.”
[1225, 1244, 1589, 1685]	beetle	“The Goliath Beetle.”
[264, 1766, 431, 2006]	mask	“The Mask.”
[937, 1769, 1260, 2087]	butterfly	“Painted Lady Butterfly.”
[1325, 2170, 1581, 2468]	butterfly	“The Lady Butterfly.”

Within the book photo showing 4 paintings, this is ideal too:

display_detection_data(Source.paintings)

box_2d	label	caption
[378, 203, 837, 575]	painting	“Hái Ô-liu (Olive Picking), tháng 12 năm 1889, sơn dầu trên toan, 28 3/4 x 35 in. [73 x 89 cm]”
[913, 207, 1380, 563]	painting	“Hẻm núi Les Peiroulets (Les Peiroulets Ravine), tháng 10 năm 1889, sơn dầu trên toan, 28 3/4 x 36 1/4 in. [73 x 92 cm]”
[387, 596, 845, 978]	painting	“Trưa: Nghỉ ngơi (phỏng theo Millet) (Noon: Rest from Work [after Millet]), tháng 1 năm 1890, sơn dầu trên toan, 28 3/4 x 35 7/8 in. [73 x 91 cm]”
[921, 611, 1397, 982]	painting	“Hoa hạnh đào (Almond Blossom), tháng 2 năm 1890, sơn dầu trên toan, 28 3/8 x 36 1/4 in. [73 x 92 cm]”

Within the Denver architecture book, the 4 captions are assigned to the right illustrations, which was not an obvious task:

display_detection_data(Source.denver_illustrated)

box_2d	label	caption
[203, 224, 741, 839]	constructing	“ERNEST AND CRANMER BUILDING.”
[743, 73, 1192, 758]	constructing	“PEOPLE’S BANK BUILDING.”
[1185, 211, 1787, 865]	constructing	“BOSTON BUILDING.”
[699, 754, 1238, 1203]	constructing	“COOPER BUILDING.”

💡 If you will have a more in-depth take a look at the input image, it’s hard to inform which caption belongs to which illustration at a look. Most of us would want to give it some thought (and is perhaps flawed). Asking Gemini reveals that the outcomes are intentional and never pure luck:

Within the “Alice’s Adventures in Wonderland” book page, there was a single illustration accompanying the story text. As expected, the caption is empty (i.e., no false positive):

display_detection_data(Source.alice_drawing)

box_2d	label	caption
[111, 146, 1008, 593]	illustration	“”

🔭 Generalizing object detection

We are able to use the identical principles for other object types. We’ll generally keep requesting bounding boxes to discover object positions inside images. Without changing our current output structure (i.e., no code change), we are able to use captions and labels to extract different object metadata depending on the input type.

🧪 See how we are able to detect electronic components by adapting the prompt while keeping the very same code and output structure:

ELECTRONIC_COMPONENT_DETECTION_PROMPT = """
Exhaustively detect all the person electronic components within the image and supply the next data for every:
- `box_2d`: bounding box coordinates.
- `caption`: Verbatim alphanumeric text visible on the component (including original line breaks), or "" if no text is present.
- `label`: Specific form of component.
"""

detect_objects(
    Source.electronics,
    ELECTRONIC_COMPONENT_DETECTION_PROMPT,
    media_resolution=PartMediaResolutionLevel.MEDIA_RESOLUTION_ULTRA_HIGH,
)

💡 Remarks:

Large and tiny components are detected, due to the precise instruction “exhaustively detect…”.

Through the use of the ultra-high media resolution, we ensure more details are tokenized and the “P” component (a visible outlier) gets detected.

Here’s a consolidated view of the detected components:

display_detection_data(Source.electronics, show_consolidated=True)

label	count	captions
integrated circuit	3	“49240↩️020S6K” • “8105↩️0:35” • “P4010↩️9NA0”
resistor	4	“” • “” • “105” • “R020”
inductor	1	“n1W”
diode	3	“K” • “L” • “P”
capacitor	6	“” • “” • “” • “” • “” • “”
transistor	1	“41”
connector	1	“”

💡 Remarks:

Components are detected together with their text markings, despite the three different text orientations (upright, sideways, and the other way up), the blur, and the photo noise.

We removed the degree of freedom for multi-line text by specifying the inclusion of “original line breaks” within the prompt: responses now consistently include the road breaks for the three integrated circuits (displayed with the ↩️ emoji for higher visibility).

The last degree of freedom lies within the labeling. While most components have been properly labeled, it’s unclear whether the “P” component is a diode, a resistor, or a fuse. Making the instructions more specific (e.g., listing the possible labels, using an enum for the label field within the Pydantic class, or providing guidelines and more details concerning the expected circuit boards) will make the prompt more “closed” and the outcomes more deterministic and accurate.
It’s also possible to enable/update the thinking_config configuration, which is able to trigger a series of thought before generating the ultimate answer. In all of the detections performed, our code used ThinkingLevel.MINIMAL, which didn’t devour any thought tokens (with Gemini 3 Flash). Updating the parameter to ThinkingLevel.LOW, ThinkingLevel.MEDIUM, or ThinkingLevel.HIGH will use thought tokens and may lead to raised outputs in complex cases.

This demonstrates the flexibility of the approach. Without retraining a model, we switched from detecting Fifteenth-century woodcuts and illustrations with vintage layouts to identifying modern electronics just by changing the prompt. Such detections, including caption and label metadata, might be used to auto-crop components for a parts catalog, confirm assembly lines, or create interactive schematics… all and not using a single labeled training image.

🪄 Editing visual objects

Now that we are able to detect visual objects, we are able to envision an automation workflow to extract and reuse them. For this, we’ll use Gemini 2.5 Flash Image (also often called Nano Banana 🍌) by default, a state-of-the-art image generation and editing model.

Our object editing functions will follow the identical template, taking one step as input and generating an edited image for the output step. Let’s define core helpers for this: 🔽

from typing import Protocol


class ObjectEditingFunction(Protocol):
    def __call__(
        self,
        image: InputImage,
        prompt: str | None = None,
        model: ImageModel | None = None,
        config: GenerateContentConfig | None = None,
        display_results: bool = True,
    ) -> None: ...


SourceTargetSteps = tuple[WorkflowStep, WorkflowStep]
registered_functions: dict[SourceTargetSteps, ObjectEditingFunction] = {}

DEFAULT_EDITING_CONFIG = GenerateContentConfig(response_modalities=["IMAGE"])
EMPTY_IMAGE = PIL.Image.recent("1", (1, 1), "white")


def object_editing_function(
    default_prompt: str,
    source_step: WorkflowStep,
    target_step: WorkflowStep,
    default_model: ImageModel = ImageModel.DEFAULT,
    default_config: GenerateContentConfig = DEFAULT_EDITING_CONFIG,
) -> ObjectEditingFunction:
    def editing_function(
        image: InputImage,
        prompt: str | None = default_prompt,
        model: ImageModel | None = default_model,
        config: GenerateContentConfig | None = default_config,
        display_results: bool = True,
    ) -> None:
        workflow, source_images = get_workflow_and_step_images(image, source_step)
        if prompt is None:
            prompt = default_prompt
        prompt = prompt.strip()
        if model is None:
            model = default_model
        # Note: "config is None" is valid and can use the model endpoint default config

        target_images: list[PIL_Image] = []
        display_image_source_info(image)
        obj_count = len(source_images)
        for obj_order, source_image in enumerate(source_images, 1):
            target_image = generate_image([source_image], prompt, model, config)
            save_workflow_image(
                source_step,
                target_step,
                image,
                obj_order,
                obj_count,
                target_image,
                dict(prompt=prompt),
            )
            target_images.append(target_image if target_image else EMPTY_IMAGE)

        workflow.images_by_step[target_step] = target_images
        if display_results:
            display_sources_and_targets(workflow, source_step, target_step)

    registered_functions[(source_step, target_step)] = editing_function

    return editing_function


def get_workflow_and_step_images(
    image: InputImage,
    step: WorkflowStep,
) -> tuple[VisualObjectWorkflow, list[PIL_Image]]:
    # Objects detected?
    if image not in workflow_by_image:
        detect_objects(image, display_results=False)
    workflow = workflow_by_image.get(image)
    assert workflow is just not None

    # Workflow step objects? (single level, might be prolonged to a dynamical graph)
    operation = (WorkflowStep.CROPPED, step)
    if step not in workflow.images_by_step and operation in registered_functions:
        source_function = registered_functions[operation]
        source_function(image, display_results=False)

    # Source images
    source_images = workflow.images_by_step.get(step)
    assert source_images is just not None

    return workflow, source_images


def display_sources_and_targets(
    workflow: VisualObjectWorkflow,
    source_step: WorkflowStep,
    target_step: WorkflowStep,
) -> None:
    source_images = workflow.images_by_step[source_step]
    target_images = workflow.images_by_step[target_step]
    if not source_images:
        print("❌ No images to display")
        return

    fig = plt.figure(layout="compressed")
    if horizontal := (len(source_images) >= 2):
        rows, cols = 2, len(source_images)
    else:
        rows, cols = len(source_images), 2
    gs = fig.add_gridspec(rows, cols)

    for i, (source_image, target_image) in enumerate(
        zip(source_images, target_images, strict=True)
    ):
        for dim, image in enumerate([source_image, target_image]):
            grid_spec = gs[dim, i] if horizontal else gs[i, dim]
            ax = fig.add_subplot(grid_spec)
            ax.set_axis_off()
            ax.imshow(image)

    disable_colab_cell_scrollbar()
    plt.show()


print("✅ Object editing helpers defined")

Now, let’s define a primary editing step to revive the detected objects that may contain many real-life artifacts…

✨ Restoring visual objects

For this restoration step, we want to craft a prompt that’s generic enough (to cover most use cases) but additionally specific enough (to take into consideration restoration needs).

A picture editing prompt relies on natural language, typically using imperative or declarative instructions. With an imperative prompt, you describe the actions to perform on the input, while with a declarative prompt, you describe the expected output. Each are possible and can provide equivalent results. Your selection is basically a matter of preference, so long as the prompt is sensible.

Our test suite is generally composed of book photos, which might contain various photographic and paper artifacts. The Nano Banana models understand these subtleties and may edit images accordingly, which simplifies the prompt.

Here’s a possible restoration function using an imperative prompt:

RESTORATION_PROMPT = """
- Isolate and straighten the visual on a pure white background, excluding any surrounding text.
- Clean up all physical artifacts and noise while preserving every original detail.
- Center the result and scale it to suit the canvas with minimal, symmetrical margins, ensuring no distortion or cropping.
"""

# Default config with low randomness for more deterministic restoration outputs
RESTORATION_CONFIG = GenerateContentConfig(
    temperature=0.0,
    top_p=0.0,
    seed=42,
    response_modalities=["IMAGE"],
)

restore_objects = object_editing_function(
    RESTORATION_PROMPT,
    WorkflowStep.CROPPED,
    WorkflowStep.RESTORED,
    default_config=RESTORATION_CONFIG,
)

print("✅ Restoration function defined")

🧪 Let’s try to revive the illustration from the 1485 incunable:

restore_objects(Source.incunable)

💡 We now have a pleasant restoration of the hand-colored woodcut illustration. Note that our prompt is generic () and might be made more specific to remove more or fewer artifacts. In this instance, there are remaining artifacts, comparable to the paper discoloration within the sword or the bleeding ink within the armor. We’ll see if we are able to fix these within the colorization step.

🧪 What concerning the illustrations from the museum guidebook?

restore_objects(Source.museum_guidebook)

💡 All good!

🧪 What concerning the barely warped visuals?

restore_objects(Source.paintings)

💡 Remarks:

Notice how, on the last painting, the orange bookmark is correctly removed and the hidden part inpainted to finish the painting.

We requested to “fill the canvas with minimal uniform margins, without distortion or cropping”. Depending on the aspect ratio and form of the visual, this degree of freedom can lead to different white margins.

This instance shows famous paintings by Vincent Van Gogh. Nano Banana doesn’t fetch any reference images and only uses the provided input. If these were photos of personal paintings, they might be restored in the identical way.

Within the Denver architecture book, the illustrations may be tilted, which our generic prompt doesn’t fully take into consideration. When several geometric transformations are involved, it could actually be difficult to craft an imperative prompt that details all of the operations to perform. As an alternative, a descriptive prompt may be more straightforward by directly describing the expected output.

🧪 Here’s an example of a descriptive prompt specializing in the restoration of tilted visuals:

tilted_visual_prompt = """
An upright, high-fidelity rendition of the visual isolated against a pure white background, filling the canvas with minimal uniform margins. The output is clean, sharp, and freed from physical artifacts.
"""

restore_objects(Source.denver_illustrated, tilted_visual_prompt)

💡 Remarks:

To get these results, the prompt focuses on requesting an “upright” visual “filling the canvas”, which proves more straightforward to put in writing than attempting to account for all possible geometric corrections.

The native visual understanding routinely identifies the content type (photo, illustration, etc.) and different artifacts (photographic, paper, printing, scanning…), allowing for precise restorations out of the box.

Notice how the consistency is preserved: the last visual is restored as an illustration, while the primary visuals maintain their photographic style.

The outcomes, with this fairly generic prompt, are impressive. It’s, in fact, possible to be more specific and request particular lighting, styles, colours…

On this last test, the input visual has distortions not only from the page curvature but additionally from the photo perspective.

🧪 Here’s an example of a descriptive prompt specializing in restoring warped illustrations:

warped_visual_prompt = """
An edge-to-edge digital extraction of the illustration from the provided book photo, excluding any peripheral text. All page curvature and perspective distortions are corrected, leading to a picture framed in an ideal rectangle, on a pure white canvas with minimal margins.
"""

restore_objects(Source.alice_drawing, warped_visual_prompt)

💡 It is basically impressive that such a restoration may be performed in a single step. Note that this prompt is just not stable and may generate less optimal results (it might profit from being more precise). If you will have complex transformations, test descriptive prompts iteratively, using precise and concise instructions, and also you is perhaps pleasantly surprised. Within the worst case, it’s also possible to process the transformations in successive, easier steps.

Now, let’s add a colorization step…

🎨 Colorization

Our restoration step respected the unique forms of the input images. Recent image editing models excel at transforming image styles, starting with colours. This will generally be performed directly with a straightforward, precise instruction.

Here’s a possible colorization function using an imperative prompt:

COLORIZATION_PROMPT = """
Colorize this image in a contemporary book illustration style, maintaining all original details with none additions.
"""

colorize = object_editing_function(
    COLORIZATION_PROMPT,
    WorkflowStep.RESTORED,
    WorkflowStep.COLORIZED,
)

print("✅ Colorization function defined")

🧪 Let’s modernize our 1485 illustration:

colorize(Source.incunable)

💡 All details are preserved, as requested within the prompt. Notice how the colorization can naturally fix some remaining artifacts (e.g., the paper discoloration within the sword or the bleeding ink within the armor).

🧪 Let’s colorize our museum guidebook illustrations:

colorize(Source.museum_guidebook)

💡 Our prompt may be very open because it only specifies “modern book illustration style”. This will generate very creative colorizations, but all of them appear to make perfect sense.

🧪 What about our Denver buildings?

colorize(Source.denver_illustrated)

💡 As requested, all of them appear like modern illustrations, including the primary visuals (originating from noisy photos).

It’s possible to go further by not only “colorizing” but additionally “transforming” the image right into a significantly different one.

🧪 Let’s make our “Alice’s Adventures in Wonderland” drawing right into a watercolor painting:

watercolor_prompt = """
Transform this visual right into a warm, watercolor painting.
"""

colorize(Source.alice_drawing, watercolor_prompt)

🧪 What about making it a conventional painting?

painting_prompt = """
Transform this visual into a conventional painting.
"""

colorize(Source.alice_drawing, painting_prompt)

We may also change image compositions. Depending on the context, some compositions are kind of implied by default. For instance, illustrations often have margins, while photos generally have edge-to-edge (full-bleed within the printing world) compositions. When possible, it’s interesting to seek advice from a form of visual (which intrinsically brings lots of semantics to the context) and adjust the instructions accordingly.

🧪 Let’s see how we are able to detect engravings on this 1847 book, restore them, and transform them into modern digital graphics:

detect_objects(Source.engravings)

restore_objects(Source.engravings)

visual_to_digital_graphic_prompt = """
Transform this visual right into a full-color, flat digital graphic, extending the content for a full-bleed effect.
"""

colorize(Source.engravings, visual_to_digital_graphic_prompt)

🧪 We may also transform the identical engravings into photos with a quite simple prompt:

visual_to_photo_prompt = """
Transform this visual right into a high-end, modern camera photograph.
"""

colorize(Source.engravings, visual_to_photo_prompt)

💡 As photos are generally full-bleed, the prompt doesn’t have to specify a composition.

It’s really as much as our imagination, as Nano Banana seems to know every aspect of the visual semantics.

Let’s add a final step to see how far we are able to go, reimagining images as cinematic movie stills…

🎞️ Cinematization

We’ve used fairly “closed” prompts up to now, crafting specific instructions and constraints to manage the outputs. It’s possible to go even further with “open” prompts and generate images in full creative mode. Notably, it could actually be interesting to seek advice from photographic or cinematographic terminology because it encompasses many visual techniques.

Here’s a possible generic cinematization function to reimagine images as movie stills:

CINEMATIZATION_PROMPT = """
Reimagine this image as a joyful, modern live-action cinematic movie still featuring skilled lighting and composition.
"""

cinematize = object_editing_function(
    CINEMATIZATION_PROMPT,
    WorkflowStep.RESTORED,
    WorkflowStep.CINEMATIZED,
)

🧪 Let’s cinematize the “Alice’s Adventures in Wonderland” drawing:

cinematize(Source.alice_drawing)

💡 This looks like a high-budget movie still. There are numerous degrees of freedom within the prompt, but you’re prone to get foreground figures in sharp focus, a gradual background blur, “golden hour” lighting (a magical ingredient for a lot of cinematographers), and detailed textures. Such compositions really evoke different atmospheres in comparison with the photos generated within the previous test.

🧪 Let’s test the workflow on a page from the Wonderful Wizard of Oz containing three drawings:

detect_objects(Source.wizard_of_oz_drawings)

restore_objects(Source.wizard_of_oz_drawings)

cinematize(Source.wizard_of_oz_drawings)

💡 The forged for a brand new movie is prepared 😉

Cinematic images have various use cases:

These cinematized stills may be perfect “reference images” for video generation models like Veo. See Generate Veo videos from reference images.
As they’re photorealistic representations, they will also be a source for generating 2D or 3D visuals, in any style, with realistic figures, perfect proportions, advanced lighting, enhanced compositions…
You should use them in lots of skilled contexts or for high-end products: presentations, magazines, posters, storyboards, brainstorming sessions…

🏁 Conclusion

Gemini’s native spatial understanding enables the detection of specific visual objects based on a single prompt in natural language.
We tested the detection of illustrations in book photos, which traditional machine learning (ML) models often miss, as they’re typically trained to detect people, animals, vehicles, food, and a finite set of physical object classes.
We tested the detection of straight, tilted, and even significantly warped illustrations, and so they were all the time precisely identified.
The core implementation was straightforward, requiring minimal code using the Python SDK and customised prompts. By comparison, fine-tuning a conventional object detection model is time-consuming: it involves assembling a picture dataset, labeling objects, and managing training jobs.
This solution may be very flexible: we could switch from detecting illustrations to electronic components, by adapting the prompt, while keeping the code unchanged.
Using structured outputs (with a JSON schema or Pydantic classes, and the Python SDK) makes the code each easy to implement and able to deploy to production.
Then, Nano Banana allows editing these visual objects in virtually any way possible.
We tested a workflow with restoration, colorization, and even cinematization steps, using imperative and descriptive prompts.
The probabilities seem really limitless, and the principles on this exploration may be reused in several contexts.

➕ More!

Thanks for reading. Let me know in the event you create something cool!

Detecting and Editing Visual Objects with Gemini

✨ Overview

🔥 Challenge

🏁 Setup

🐍 Python packages

🔗 Gemini API

🤖 Gen AI SDK

🖼️ Image test suite

🧠 Gemini models

🛠️ Helpers

🔍 Detecting visual objects

🏷️ Text extraction and dynamic labeling

🔭 Generalizing object detection

🪄 Editing visual objects

✨ Restoring visual objects

🎨 Colorization

🎞️ Cinematization

🏁 Conclusion

➕ More!

What are your thoughts on this topic?
Let us know in the comments below.

Share this article

Recent posts

Introducing Sonnet 4.6 Anthropic

A Generalizable MARL-LP Approach for Scheduling in Logistics

Designing Data and AI Systems That Hold Up in Production

Latest AirSnitch attack breaks Wi-Fi encryption in homes, offices, and enterprises

Google’s latest AI image generation model

Detecting and Editing Visual Objects with Gemini

✨ Overview

🔥 Challenge

🏁 Setup

🐍 Python packages

🔗 Gemini API

🤖 Gen AI SDK

🖼️ Image test suite

🧠 Gemini models

🛠️ Helpers

🔍 Detecting visual objects

🏷️ Text extraction and dynamic labeling

🔭 Generalizing object detection

🪄 Editing visual objects

✨ Restoring visual objects

🎨 Colorization

🎞️ Cinematization

🏁 Conclusion

➕ More!

What are your thoughts on this topic? Let us know in the comments below.

Share this article

Recent posts

What are your thoughts on this topic?
Let us know in the comments below.