Updating Classifier Evasion for Vision Language Models



Advances in AI architectures have unlocked multimodal functionality, enabling transformer models to process multiple forms of data in the same context. For example, vision language models (VLMs) can generate output from combined image and text input, enabling developers to build systems that interpret graphs, process camera feeds, or interact with traditionally human interfaces like desktop applications. In some situations, this extra vision modality may process external, untrusted images, and there is significant precedent regarding the attack surface of image-processing machine learning systems. In this post, we'll apply some of these historical ideas to modern architectures to help developers understand the threats and mitigations introduced by the vision domain.

Vision language models

VLMs extend the transformer architecture popularized by large language models (LLMs) to accept both text and image input. VLMs can be fine-tuned to caption images, detect and segment objects, and answer questions about images by combining the image and text into one set of tokens processed by the LLM. A widely used open source example is PaliGemma 2. As shown in Figure 1, PaliGemma 2 uses SigLIP to encode and project the image into a token space compatible with Gemma 2, then concatenates the image tokens with the text tokens before passing them to Gemma.

A diagram showing how PaliGemma 2 accepts image input, which is processed by the SigLIP image encoder and a linear projection before those tokens are concatenated with the text tokens and passed to Gemma 2 to generate text output.
Figure 1. The PaliGemma 2 architecture

How much influence can we exert over the LLM if we control the image input? Can we adapt classic adversarial image generation techniques to VLMs? If so, this may affect how we secure systems that integrate these VLMs into control flow or physical systems.

Evading image classifiers

In 2014, researchers discovered that human-imperceptible pixel perturbations could be used to control the output of image classification models. Figure 2, from the seminal paper Intriguing properties of neural networks, shows how the images on the left (all distinctly and correctly classified) could be perturbed by the pixel values in the center column (magnified for illustration) to generate the images on the right, all of which are classified as ostriches. This technique became known as classifier evasion.

A 3x3 grid of images where each row represents an image, a pixel mask, and the modified image that looks identical but has a different classification from a machine learning model.
Figure 2. Adversarial pixel perturbations used to change image classification

As the field of adversarial machine learning evolved, researchers developed increasingly sophisticated attack algorithms and open source tools. Most of these attacks relied on direct access to model gradients (open-box attacks) or approximated gradients through sampling methods (closed-box attacks) to craft perturbations that were both effective and "minimally perceptible." One simple technique is Projected Gradient Descent (PGD), which formalized adversarial example generation as a constrained optimization problem. PGD iteratively nudges the input in the direction of the gradient while ensuring that the perturbation stays small enough to limit perceptibility.
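Conceptually, a single PGD iteration for an untargeted attack on a classifier can be sketched as follows. This is a minimal sketch, not code from the original paper; the model, inputs, labels, and hyperparameters are placeholders:

import torch
import torch.nn.functional as F

def pgd_step(model, x_adv, x_orig, y_true, alpha=2/255, epsilon=8/255):
    # One L-infinity PGD step: ascend the loss, then project back into the epsilon-ball.
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y_true)   # untargeted: push away from the true class
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + alpha * x_adv.grad.sign()                   # nudge along the gradient
        x_adv = x_orig + (x_adv - x_orig).clamp(-epsilon, epsilon)  # keep the perturbation small
        x_adv = x_adv.clamp(0.0, 1.0)                               # stay within valid pixel values
    return x_adv.detach()

Repeating this step for a fixed number of iterations yields a perturbed image that stays within an epsilon-ball of the original.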

As the research community increasingly sought real-world relevance, the focus shifted toward the threat model itself. In practice, attackers rarely have pixel-level access to an entire image. Instead, they may be able to physically modify only part of an object, while being less constrained by perceptibility. This led to the development of adversarial patches, as shown in Figure 3, where the attacker optimizes a localized region of an image that can be printed and physically applied in the real world.

A picture of a banana and a graph showing the classification as "banana", then a "sticker" placed next to it on the table, and the graph showing "toaster."
Figure 3. Adding an algorithmically-generated patch flips the classification from “banana” to “toaster”

Let’s adapt these ideas for VLMs.

Constructing adversarial images for VLMs

We'll focus on a specific scenario in which a VLM processes an image of a red traffic light (Figure 4). The VLM prompt is static, "should I stop or go?", but the attacker has some level of control over the input image. We're also only considering open-box attacks, where the attacker has access to the entire model and input prompt during development to generate their adversarial input.

A traffic light with the red circle illuminated to signal "stop."
Figure 4. An unmodified traffic light

In the following examples, we test against this general inference setup, where the model is initialized, a processor is defined to handle input formatting, and a fixed prompt is defined:

import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id, use_fast=True)

prompt = "answer en should I stop or go?" #formatted as PaliGemma expects

def get_output(image): #attacker controlled image
    prompt = "answer en should I stop or go?"
    model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
    input_len = model_inputs["input_ids"].shape[-1]
    
    with torch.inference_mode():
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
        generation = generation[0][input_len:]
        decoded = processor.decode(generation, skip_special_tokens=True)
    return decoded

As expected with an unmodified image, the VLM generates “stop” as shown in Figure 5.

Screenshot from a Jupyter Notebook showing the benign stoplight and the model output: "stop."
Figure 5. Control test showing that the model produced the text “stop”

The traffic light was embedded by SigLIP and projected into token space. Those tokens were then concatenated with the tokens for "answer en should I stop or go?" before being passed to Gemma, which returned one token: "stop". With an LLM, we'd try some form of prompt injection to override the system instruction, but in this scenario, we can only control the image while the text is fixed.
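To make this concrete, we can inspect what the processor hands to the model. The shapes noted in the comments are assumptions for this 224-resolution checkpoint:

inputs = processor(text=prompt, images=image, return_tensors="pt")
print(inputs["input_ids"].shape)     # image placeholder tokens plus the text tokens
print(inputs["pixel_values"].shape)  # e.g., [1, 3, 224, 224]: the raw pixels SigLIP will embed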

Pixel perturbations

When attacking traditional image classification models, the model's probability output is used to measure loss. Pixel values are modified to reduce the likelihood that the image is correctly classified (an untargeted attack) and, optionally, to maximize the likelihood that the output is a specific class (a targeted attack). Similarly, with PaliGemma 2, we can use the token logits because, with greedy sampling, the model will always select the most likely token. The core ideas in using PGD to generate adversarial samples against PaliGemma are:

  1. We use the tokenizer to identify the desired and undesired outputs. In this case, we want to incentivize generating "go" and disincentivize generating "stop," so we get their token IDs.
stop_id = processor.tokenizer("stop", add_special_tokens=False).input_ids[0]
go_id = processor.tokenizer("go", add_special_tokens=False).input_ids[0]
  2. We have access to the model's output logits, so we can look at the comparative likelihood of the output tokens for both "stop" and "go".
logits = outputs.logits
next_token_logits = logits[:, -1, :]
logit_stop = next_token_logits[:, stop_id]
logit_go = next_token_logits[:, go_id]
  3. We can define a loss function as the difference between the logits for our desired and undesired outputs. This loss function measures how good or bad our adversarial image is.
loss = -(logit_go - logit_stop).mean()

Using those primitives, we run an optimization loop to generate a perturbation mask over the image. As this loop progresses, we can monitor our adversarial image's logits for "stop" vs. "go." We see that it doesn't take much perturbation for "go" to quickly become larger than "stop." This indicates that our modified traffic light will produce "go" when passed through PaliGemma 2, as shown in Figure 6.
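Here's a minimal sketch of such a loop, reusing the model, processor, image, and token IDs defined above. The perturbation bound, step size, and logging cadence are illustrative assumptions, and the perturbation is optimized directly in the processor's normalized pixel space:

model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
original = model_inputs["pixel_values"].to(torch.float32).clone().detach()

epsilon = 0.1   # L-infinity bound on the perturbation (normalized pixel space), assumed
alpha = 0.01    # step size per iteration, assumed
num_steps = 20

delta = torch.zeros_like(original, requires_grad=True)  # the perturbation we optimize

for step in range(1, num_steps + 1):
    outputs = model(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        pixel_values=(original + delta).to(torch.bfloat16),
    )
    next_token_logits = outputs.logits[:, -1, :]
    logit_stop = next_token_logits[:, stop_id]
    logit_go = next_token_logits[:, go_id]

    # Reward "go" and penalize "stop"; minimizing this loss pushes the model toward "go".
    loss = -(logit_go - logit_stop).mean()
    loss.backward()

    with torch.no_grad():
        delta -= alpha * delta.grad.sign()   # nudge pixels along the gradient
        delta.clamp_(-epsilon, epsilon)      # project back into the epsilon-ball
    delta.grad.zero_()

    if step % 4 == 0:
        print(f"Step {step}/{num_steps} | loss={loss.item():.4f} | "
              f"logit_stop={logit_stop.item():.3f} | logit_go={logit_go.item():.3f}")

adv_pixels = (original + delta).detach()  # adversarial pixel values (un-normalize to view or save)

Logging every few steps produces progress output like the following: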

Step 4/20 | loss=1.3125 | logit_stop=13.125 | logit_go=11.812
Step 8/20 | loss=-4.1875 | logit_stop=9.062 | logit_go=13.250
Step 12/20 | loss=-6.5938 | logit_stop=6.969 | logit_go=13.562
Step 16/20 | loss=-7.8125 | logit_stop=5.938 | logit_go=13.750
Step 20/20 | loss=-8.1250 | logit_stop=5.562 | logit_go=13.688
Screenshot from a Jupyter Notebook showing the modified stoplight and the model output: "go". There are some slightly visible artifacts, but the image is still clearly the same stoplight.
Figure 6. A barely perceptible pixel modification flipped the output from “stop” to “go”

The difference with VLMs

Conventional image classifiers were limited to a fixed set of image classes, but with VLMs, we've moved into the generative era, where the output can be manipulated into a much wider distribution. In the simplest conventional paradigm for this traffic light scenario, there might be two classes, "stop" and "go," and every possible input would be classified into one of those two buckets.

Now, the output is anything that the Gemma LLM can generate. Functionally, we're treating the model as a classifier with as many classes as there are distinct tokens. So, using the same attack generation process as before but optimizing for "eject" instead of "go," we can generate an output that may never have been considered by the application designers (Figure 7).
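Only the target token in the loss changes. For example, reusing the variables from the loop above (the token ID lookup is a hedged sketch):

eject_id = processor.tokenizer("eject", add_special_tokens=False).input_ids[0]
loss = -(next_token_logits[:, eject_id] - next_token_logits[:, stop_id]).mean()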

Screenshot from a Jupyter Notebook showing the modified stoplight and the model output: "eject". There are some slightly visible artifacts, but the image is still clearly the same stoplight.
Figure 7. A barely perceptible pixel modification flipped the output from “stop” to “eject”

When designing a system that may process untrusted images, developers should consider how resilient the rest of the system is to unexpected output. The security and robustness properties of the end-to-end system extend far beyond the core model's characteristics and include input and output sanitization, guardrails such as NeMo Guardrails, and safety control systems.
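As a minimal illustration of output sanitization, an application could restrict downstream actions to an expected set of responses. The allowlist and fallback behavior here are assumptions about the application, not part of the original example:

ALLOWED_ACTIONS = {"stop", "go"}

def safe_decision(image):
    # Only act on expected responses; fail closed on anything else (such as "eject").
    decision = get_output(image).strip().lower()
    return decision if decision in ALLOWED_ACTIONS else "stop"

Note that a guard like this only constrains the output space; it does nothing to prevent the earlier "stop"-to-"go" flip.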

Extending the attack

There are many cases where an attacker may have access to a portion of the visual environment without being able to modify pixel values across the entire image. This is easy to understand in the case of cameras, but it is also true for computer use agents, where the attacker may only have write access to a portion of a screenshot (for example, a banner ad displayed in a browser). In these cases, you can generate adversarial patches by optimizing just the controlled pixels, as shown in Figure 8. For this example, the adversarial input was generated on a white square rather than as a perturbation mask to better simulate a physical sticker.
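A sketch of this patch variant, continuing from the earlier loop: the patch location, size, and step size are assumptions, and the patch starts as a roughly white square in the processor's normalized pixel space.

patch_size = 48
y0, x0 = 160, 20  # assumed patch position within the 224x224 input

patch_mask = torch.zeros_like(original)
patch_mask[..., y0:y0 + patch_size, x0:x0 + patch_size] = 1.0

patch = torch.ones_like(original)  # start from a "white sticker" (white is ~1.0 after normalization)
patch.requires_grad_(True)
patch_alpha = 0.05                 # larger steps, since perceptibility matters less here

for step in range(num_steps):
    # Composite the optimized patch onto the unmodified image; only the patch region changes.
    composited = original * (1 - patch_mask) + patch * patch_mask
    outputs = model(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        pixel_values=composited.to(torch.bfloat16),
    )
    next_token_logits = outputs.logits[:, -1, :]
    loss = -(next_token_logits[:, go_id] - next_token_logits[:, stop_id]).mean()
    loss.backward()
    with torch.no_grad():
        patch -= patch_alpha * patch.grad.sign()
        patch.clamp_(-1.0, 1.0)    # stay within the assumed valid normalized pixel range
    patch.grad.zero_()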

Screenshot from a Jupyter Notebook showing the modified stoplight and the model output: "go". The stoplight clearly has a small square of random pixels in the bottom left.
Figure 8. A sticker flips the output from “stop” to “go”

These patches are brittle: the success of the attack depends heavily on their placement, lighting conditions, camera noise, shadows, and other difficult-to-control variables. In practice, this method produces patches so fragile that they're unlikely to succeed as physical sticker attacks, because the placement must be pixel-perfect and precisely aligned. To build more robust attacks, add Expectation Over Transformation (EOT) to the optimization loop by randomly shifting or rotating the image, adjusting brightness, and otherwise adding realistic noise to the generation process.
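One way to sketch EOT, assuming the patch loop above: replace the single forward pass with an average over randomly transformed copies of the composited image. The transform ranges are illustrative assumptions:

import random
import torchvision.transforms.functional as TF

def random_transform(pixel_values):
    # Simulate imperfect placement, rotation, lighting changes, and sensor noise.
    angle = random.uniform(-10, 10)
    dx, dy = random.randint(-8, 8), random.randint(-8, 8)
    out = TF.affine(pixel_values, angle=angle, translate=[dx, dy], scale=1.0, shear=0.0)
    out = out * random.uniform(0.8, 1.2)          # crude brightness/contrast jitter
    out = out + 0.02 * torch.randn_like(out)      # camera noise
    return out

# Inside the optimization loop, average the loss over several transformed samples.
eot_samples = 4
loss = 0.0
for _ in range(eot_samples):
    composited = original * (1 - patch_mask) + patch * patch_mask
    outputs = model(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        pixel_values=random_transform(composited).to(torch.bfloat16),
    )
    next_token_logits = outputs.logits[:, -1, :]
    loss = loss - (next_token_logits[:, go_id] - next_token_logits[:, stop_id]).mean()
loss = loss / eot_samples
loss.backward()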

Attackers must also consider their optimization constraints. "Human-imperceptible," for instance, is likely irrelevant in a computer-use scenario where the attacker expects the input to be processed by a fully autonomous system. The fewer constraints the attacker imposes, the more likely they are to succeed.

Learn more

VLMs extend the existing power and capability of LLMs to unlock many useful multimodal applications, including robotics and computer use agents. Images are part of the VLM prompt and can be used to manipulate model output just like untrusted text. Understanding the history of attacking and defending image classifiers and embedding models can help identify risks and inform mitigations to build robust systems. Images aren't the only additional modality being introduced into language models that has a history of adversarial machine learning research. Security teams should review older techniques for video, audio, and other modalities to assess and increase the resilience of their multimodal AI applications.

Because adversarial examples can be programmatically generated, they should be used to augment training, evaluation, and benchmarking to increase the robustness of the resulting systems. Learn more about generating adversarial examples in Exploring Adversarial Machine Learning.

When building agentic systems with VLMs, continue evaluating them based on their autonomy level and threat model. Explore the family of NVIDIA VLMs.


