Updating Classifier Evasion for Vision Language Models



Advances in AI architectures have unlocked multimodal functionality, enabling transformer models to process multiple forms of data in the same context. For example, vision language models (VLMs) can generate output from combined image and text input, enabling developers to build systems that interpret graphs, process camera feeds, or interact with traditionally human interfaces like desktop applications. In some situations, this extra vision modality may process external, untrusted images, and there is significant precedent regarding the attack surface of image-processing machine learning systems. In this post, we'll apply some of these historical ideas to modern architectures to help developers understand the threats and mitigations introduced by the vision domain.

Vision language models

VLMs extend the transformer architecture popularized by large language models (LLMs) to accept both text and image input. VLMs can be fine-tuned to caption images, detect and segment objects, and answer questions about images by combining the image and text into one set of tokens processed by the LLM. A widely used open source example is PaliGemma 2. As shown in Figure 1, PaliGemma 2 uses SigLIP to encode and project the image into a token space compatible with Gemma 2, then concatenates the image tokens with the text tokens before passing them to Gemma.

A diagram showing how PaliGemma 2 accepts image input, which is processed by the SigLIP image encoder and a linear projection before those tokens are concatenated with the text tokens and passed to Gemma 2 to generate text output.
Figure 1. The PaliGemma 2 architecture

How much influence can we exert over the LLM if we control the image input? Can we adapt classic adversarial image generation techniques to VLMs? If so, this may affect how we secure systems that integrate these VLMs into control flow or physical systems.

Evading image classifiers

In 2014, researchers discovered that human-imperceptible pixel perturbations could be used to control the output of image classification models. Figure 2, from the seminal paper Intriguing properties of neural networks, shows how the images on the left (all distinctly and correctly classified) could be perturbed by the pixel values in the center column (magnified for illustration) to generate the images on the right, all of which are classified as ostriches. This technique became known as classifier evasion.

A 3x3 grid of images where each row represents an image, a pixel mask, and the modified image that looks identical but has a different classification from a machine learning model.
Figure 2. Adversarial pixel perturbations used to change image classification

As the field of adversarial machine learning evolved, researchers developed increasingly sophisticated attack algorithms and open source tools. Most of these attacks relied on direct access to model gradients (open-box attacks) or approximated gradients through sampling methods (closed-box attacks) to craft perturbations that were both effective and "minimally perceptible." One simple technique is Projected Gradient Descent (PGD), which formalized adversarial example generation as a constrained optimization problem. PGD iteratively nudges the input in the direction of the gradient while ensuring that the perturbation stays small enough to limit perceptibility.
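Conceptually, a single PGD iteration for an untargeted attack on a classifier can be sketched as follows. This is a minimal sketch, not code from the original paper; the model, inputs, labels, and hyperparameters are placeholders:

import torch
import torch.nn.functional as F

def pgd_step(model, x_adv, x_orig, y_true, alpha=2/255, epsilon=8/255):
    # One L-infinity PGD step: ascend the loss, then project back into the epsilon-ball.
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y_true)   # untargeted: push away from the true class
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + alpha * x_adv.grad.sign()                   # nudge along the gradient
        x_adv = x_orig + (x_adv - x_orig).clamp(-epsilon, epsilon)  # keep the perturbation small
        x_adv = x_adv.clamp(0.0, 1.0)                               # stay within valid pixel values
    return x_adv.detach()

Repeating this step for a fixed number of iterations yields a perturbed image that stays within an epsilon-ball of the original.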

As the research community increasingly sought real-world relevance, the focus shifted toward the threat model itself. In practice, attackers rarely have pixel-level access to an entire image. Instead, they may be able to physically modify only part of an object, while being less constrained by perceptibility. This led to the development of adversarial patches, as shown in Figure 3, where the attacker optimizes a localized region of an image that can be printed and physically applied in the real world.

A picture of a banana and a graph showing the classification as "banana", then a "sticker" placed next to it on the table, and the graph showing "toaster."
Figure 3. Adding an algorithmically-generated patch flips the classification from “banana” to “toaster”

Let’s adapt these ideas for VLMs.

Constructing adversarial images for VLMs

We'll focus on a specific scenario in which a VLM processes an image of a red traffic light (Figure 4). The VLM prompt is static, "should I stop or go?", but the attacker has some level of control over the input image. We're also only considering open-box attacks, where the attacker has access to the entire model and input prompt during development to generate their adversarial input.

A traffic light with the red circle illuminated to signal "stop."
Figure 4. An unmodified traffic light

In the following examples, we test against this general inference setup, where the model is initialized, a processor is defined to handle input formatting, and a fixed prompt is defined:

import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id, use_fast=True)

prompt = "answer en should I stop or go?" #formatted as PaliGemma expects

def get_output(image): #attacker controlled image
    prompt = "answer en should I stop or go?"
    model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
    input_len = model_inputs["input_ids"].shape[-1]
    
    with torch.inference_mode():
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
        generation = generation[0][input_len:]
        decoded = processor.decode(generation, skip_special_tokens=True)
    return decoded

As expected with an unmodified image, the VLM generates “stop” as shown in Figure 5.

Screenshot from a Jupyter Notebook showing the benign stoplight and the model output: "stop."
Figure 5. Control test showing that the model produced the text “stop”

The traffic light was embedded by SigLIP and projected into token space. Those tokens were then concatenated with the tokens for "answer en should I stop or go?" before being passed to Gemma, which returned one token: "stop". With an LLM, we'd try some form of prompt injection to override the system instruction, but in this scenario, we can only control the image while the text is fixed.
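To make this concrete, we can inspect what the processor hands to the model. The shapes noted in the comments are assumptions for this 224-resolution checkpoint:

inputs = processor(text=prompt, images=image, return_tensors="pt")
print(inputs["input_ids"].shape)     # image placeholder tokens plus the text tokens
print(inputs["pixel_values"].shape)  # e.g., [1, 3, 224, 224]: the raw pixels SigLIP will embed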

Pixel perturbations

When attacking traditional image classification models, the model's probability output is used to measure loss. Pixel values are modified to reduce the likelihood that the image is correctly classified (an untargeted attack) and, optionally, to maximize the likelihood that the output is a specific class (a targeted attack). Similarly, with PaliGemma 2, we can use the token logits because, with greedy sampling, the model will always select the most likely token. The core ideas in using PGD to generate adversarial samples against PaliGemma are:

  1. We use the tokenizer to identify the desired and undesired outputs. In this case, we want to incentivize generating "go" and disincentivize generating "stop," so we get their token IDs.
stop_id = processor.tokenizer("stop", add_special_tokens=False).input_ids[0]
go_id = processor.tokenizer("go", add_special_tokens=False).input_ids[0]
  2. We have access to the model's output logits, so we can look at the comparative likelihood of the output tokens for both "stop" and "go".
logits = outputs.logits
next_token_logits = logits[:, -1, :]
logit_stop = next_token_logits[:, stop_id]
logit_go = next_token_logits[:, go_id]
  3. We can define a loss function as the difference between the logits for our desired and undesired outputs. This loss function measures how good or bad our adversarial image is.
loss = -(logit_go - logit_stop).mean()

Using those primitives, we run an optimization loop to generate a perturbation mask over the image. As this loop progresses, we can monitor our adversarial image's logits for "stop" vs. "go." We see that it doesn't take much perturbation for "go" to quickly become larger than "stop." This indicates that our modified traffic light will produce "go" when passed through PaliGemma 2, as shown in Figure 6.
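Here's a minimal sketch of such a loop, reusing the model, processor, image, and token IDs defined above. The perturbation bound, step size, and logging cadence are illustrative assumptions, and the perturbation is optimized directly in the processor's normalized pixel space:

model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
original = model_inputs["pixel_values"].to(torch.float32).clone().detach()

epsilon = 0.1   # L-infinity bound on the perturbation (normalized pixel space), assumed
alpha = 0.01    # step size per iteration, assumed
num_steps = 20

delta = torch.zeros_like(original, requires_grad=True)  # the perturbation we optimize

for step in range(1, num_steps + 1):
    outputs = model(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        pixel_values=(original + delta).to(torch.bfloat16),
    )
    next_token_logits = outputs.logits[:, -1, :]
    logit_stop = next_token_logits[:, stop_id]
    logit_go = next_token_logits[:, go_id]

    # Reward "go" and penalize "stop"; minimizing this loss pushes the model toward "go".
    loss = -(logit_go - logit_stop).mean()
    loss.backward()

    with torch.no_grad():
        delta -= alpha * delta.grad.sign()   # nudge pixels along the gradient
        delta.clamp_(-epsilon, epsilon)      # project back into the epsilon-ball
    delta.grad.zero_()

    if step % 4 == 0:
        print(f"Step {step}/{num_steps} | loss={loss.item():.4f} | "
              f"logit_stop={logit_stop.item():.3f} | logit_go={logit_go.item():.3f}")

adv_pixels = (original + delta).detach()  # adversarial pixel values (un-normalize to view or save)

Logging every few steps produces progress output like the following: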

Step 4/20 | loss=1.3125 | logit_stop=13.125 | logit_go=11.812
Step 8/20 | loss=-4.1875 | logit_stop=9.062 | logit_go=13.250
Step 12/20 | loss=-6.5938 | logit_stop=6.969 | logit_go=13.562
Step 16/20 | loss=-7.8125 | logit_stop=5.938 | logit_go=13.750
Step 20/20 | loss=-8.1250 | logit_stop=5.562 | logit_go=13.688
Screenshot from a Jupyter Notebook showing the modified stoplight and the model output: "go". There are some slightly visible artifacts, but the image is still clearly the same stoplight.
Figure 6. A barely perceptible pixel modification flipped the output from “stop” to “go”

The difference with VLMs

Conventional image classifiers were limited to a fixed set of image classes, but with VLMs, we've moved into the generative era, where the output can be manipulated into a much wider distribution. In the simplest conventional paradigm for this traffic light scenario, there might be two classes, "stop" and "go," and every possible input would be classified into one of those two buckets.

Now, the output is anything that the Gemma LLM can generate. Functionally, we're treating the model as a classifier with as many classes as there are distinct tokens. So, using the same attack generation process as before but optimizing for "eject" instead of "go," we can generate an output that may never have been considered by the application designers (Figure 7).
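Only the target token in the loss changes. For example, reusing the variables from the loop above (the token ID lookup is a hedged sketch):

eject_id = processor.tokenizer("eject", add_special_tokens=False).input_ids[0]
loss = -(next_token_logits[:, eject_id] - next_token_logits[:, stop_id]).mean()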

Screenshot from a Jupyter Notebook showing the modified stoplight and the model output: "eject". There are some slightly visible artifacts, but the image is still clearly the same stoplight.
Figure 7. A barely perceptible pixel modification flipped the output from “stop” to “eject”

When designing a system that may process untrusted images, developers should consider how resilient the rest of the system is to unexpected output. The security and robustness properties of the end-to-end system extend far beyond the core model's characteristics and include input and output sanitization, guardrails such as NeMo Guardrails, and safety control systems.
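As a minimal illustration of output sanitization, an application could restrict downstream actions to an expected set of responses. The allowlist and fallback behavior here are assumptions about the application, not part of the original example:

ALLOWED_ACTIONS = {"stop", "go"}

def safe_decision(image):
    # Only act on expected responses; fail closed on anything else (such as "eject").
    decision = get_output(image).strip().lower()
    return decision if decision in ALLOWED_ACTIONS else "stop"

Note that a guard like this only constrains the output space; it does nothing to prevent the earlier "stop"-to-"go" flip.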

Extending the attack

There are many cases where an attacker may have access to a portion of the visual environment without being able to modify pixel values across the entire image. This is easy to understand in the case of cameras, but it is also true for computer use agents, where the attacker may only have write access to a portion of a screenshot (for example, a banner ad displayed in a browser). In these cases, you can generate adversarial patches by optimizing just the controlled pixels, as shown in Figure 8. For this example, the adversarial input was generated on a white square rather than as a perturbation mask to better simulate a physical sticker.
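A sketch of this patch variant, continuing from the earlier loop: the patch location, size, and step size are assumptions, and the patch starts as a roughly white square in the processor's normalized pixel space.

patch_size = 48
y0, x0 = 160, 20  # assumed patch position within the 224x224 input

patch_mask = torch.zeros_like(original)
patch_mask[..., y0:y0 + patch_size, x0:x0 + patch_size] = 1.0

patch = torch.ones_like(original)  # start from a "white sticker" (white is ~1.0 after normalization)
patch.requires_grad_(True)
patch_alpha = 0.05                 # larger steps, since perceptibility matters less here

for step in range(num_steps):
    # Composite the optimized patch onto the unmodified image; only the patch region changes.
    composited = original * (1 - patch_mask) + patch * patch_mask
    outputs = model(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        pixel_values=composited.to(torch.bfloat16),
    )
    next_token_logits = outputs.logits[:, -1, :]
    loss = -(next_token_logits[:, go_id] - next_token_logits[:, stop_id]).mean()
    loss.backward()
    with torch.no_grad():
        patch -= patch_alpha * patch.grad.sign()
        patch.clamp_(-1.0, 1.0)    # stay within the assumed valid normalized pixel range
    patch.grad.zero_()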

Screenshot from a Jupyter Notebook showing the modified stoplight and the model output: "go". The stoplight clearly has a small square of random pixels in the bottom left.
Figure 8. A sticker flips the output from “stop” to “go”

These patches are brittle: the success of the attack depends heavily on their placement, lighting conditions, camera noise, shadows, and other difficult-to-control variables. In practice, this method produces patches so fragile that they're unlikely to succeed as physical sticker attacks, because the placement must be pixel-perfect and precisely aligned. To build more robust attacks, add Expectation Over Transformation (EOT) to the optimization loop by randomly shifting or rotating the image, adjusting brightness, and otherwise adding realistic noise to the generation process.
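One way to sketch EOT, assuming the patch loop above: replace the single forward pass with an average over randomly transformed copies of the composited image. The transform ranges are illustrative assumptions:

import random
import torchvision.transforms.functional as TF

def random_transform(pixel_values):
    # Simulate imperfect placement, rotation, lighting changes, and sensor noise.
    angle = random.uniform(-10, 10)
    dx, dy = random.randint(-8, 8), random.randint(-8, 8)
    out = TF.affine(pixel_values, angle=angle, translate=[dx, dy], scale=1.0, shear=0.0)
    out = out * random.uniform(0.8, 1.2)          # crude brightness/contrast jitter
    out = out + 0.02 * torch.randn_like(out)      # camera noise
    return out

# Inside the optimization loop, average the loss over several transformed samples.
eot_samples = 4
loss = 0.0
for _ in range(eot_samples):
    composited = original * (1 - patch_mask) + patch * patch_mask
    outputs = model(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        pixel_values=random_transform(composited).to(torch.bfloat16),
    )
    next_token_logits = outputs.logits[:, -1, :]
    loss = loss - (next_token_logits[:, go_id] - next_token_logits[:, stop_id]).mean()
loss = loss / eot_samples
loss.backward()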

Attackers must also consider their optimization constraints. "Human-imperceptible," for instance, is likely irrelevant in a computer-use scenario where the attacker expects the input to be processed by a fully autonomous system. The fewer constraints the attacker imposes, the more likely they are to succeed.

Learn more

VLMs extend the existing power and capability of LLMs to unlock many useful multimodal applications, including robotics and computer use agents. Images are part of the VLM prompt and can be used to manipulate model output just like untrusted text. Understanding the history of attacking and defending image classifiers and embedding models can help identify risks and inform mitigations to build robust systems. Images aren't the only additional modality being introduced into language models that has a history of adversarial machine learning research. Security teams should review older techniques for video, audio, and other modalities to assess and increase the resilience of their multimodal AI applications.

Because adversarial examples can be programmatically generated, they should be used to augment training, evaluation, and benchmarking to increase the robustness of the resulting systems. Learn more about generating adversarial examples in Exploring Adversarial Machine Learning.

When building agentic systems with VLMs, continue evaluating them based on their autonomy level and threat model. Explore the family of NVIDIA VLMs.


