Gen-AI Safety Landscape: A Guide to the Mitigation Stack for Text-to-Image Models


There is also a significant area of risk, documented in [4], where marginalized groups are linked to harmful connotations that reinforce hateful societal stereotypes. Examples include representations of demographic groups that conflate humans with animals or mythological creatures (such as Black people depicted as monkeys or other primates), conflate humans with food or objects (such as associating individuals with disabilities with vegetables), or associate demographic groups with negative semantic concepts (such as associating Muslim people with terrorism).

Problematic associations like these between groups of people and concepts reflect long-standing negative narratives about those groups. If a generative AI model learns problematic associations from existing data, it can reproduce them in the content it generates [4].

Figure: Problematic associations of marginalized groups and ideas.

There are several ways to fine-tune LLMs. According to [6], one common approach is known as Supervised Fine-Tuning (SFT). This involves taking a pre-trained model and further training it on a dataset of input and desired-output pairs. The model adjusts its parameters by learning to better match these expected responses.

Typically, fine-tuning involves two phases: SFT to establish a base model, followed by RLHF for enhanced performance. SFT involves imitating high-quality demonstration data, while RLHF refines LLMs through preference feedback.
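As a rough illustration, a single SFT step on prompt/desired-output pairs could look like the sketch below. It assumes a HuggingFace-style causal LM and tokenizer; the function and variable names are placeholders, not a specific library's training API.

```python
# Minimal sketch of one supervised fine-tuning (SFT) step, assuming a
# HuggingFace-style causal LM and tokenizer (with pad_token set).
import torch

def sft_step(model, tokenizer, prompts, desired_outputs, optimizer, device="cuda"):
    # Concatenate each prompt with its desired output; the model is trained to
    # reproduce the desired output token by token (standard causal-LM objective).
    texts = [p + "\n" + o for p, o in zip(prompts, desired_outputs)]
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)

    # Mask padding positions so they do not contribute to the loss.
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100

    outputs = model(**batch, labels=labels)  # cross-entropy over next tokens
    loss = outputs.loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```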

RLHF can be done in two ways: reward-based or reward-free methods. In the reward-based method, we first train a reward model using preference data. This model then guides online reinforcement learning algorithms like PPO. Reward-free methods are simpler: they directly train the model on preference or rating data to learn what humans prefer. Among these reward-free methods, DPO has demonstrated strong performance and has become popular in the community. Diffusion DPO can be used to steer the model away from problematic depictions towards more desirable alternatives. The tricky part of this process is not the training itself, but the data curation. For every risk, we need a set of hundreds or thousands of prompts, and for every prompt, a desirable and undesirable image pair. The desirable example should ideally be a perfect depiction for that prompt, and the undesirable example should be similar to the desirable image, except that it contains the risk we want to unlearn.
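One way such Diffusion DPO preference data could be organized is sketched below. The field names and risk categories are illustrative assumptions, not a fixed schema from any library.

```python
# Sketch of a preference-pair record for Diffusion DPO data curation.
# Field names and RiskCategory values are illustrative only.
from dataclasses import dataclass
from enum import Enum

class RiskCategory(Enum):
    PROBLEMATIC_ASSOCIATION = "problematic_association"
    STEREOTYPE_AMPLIFICATION = "stereotype_amplification"
    EXPLICIT_CONTENT = "explicit_content"

@dataclass
class PreferencePair:
    prompt: str               # the text prompt both images were generated from
    desirable_image: str      # path to the preferred ("won") image
    undesirable_image: str    # path to the rejected ("lost") image: similar to the
                              # desirable one except it contains the risk to unlearn
    risk: RiskCategory

# Example record for the "problematic association" risk:
example = PreferencePair(
    prompt="photo of a software engineer at work",
    desirable_image="data/engineer_desirable.png",
    undesirable_image="data/engineer_undesirable.png",
    risk=RiskCategory.PROBLEMATIC_ASSOCIATION,
)
```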

These mitigations are applied after the model is finalized and deployed in the production stack. They cover all the mitigations applied to the user's input prompt and to the final image output.

Prompt filtering

When users input a text prompt to generate an image, or upload an image to modify it with inpainting, filters can be applied to block requests that explicitly ask for harmful content. At this stage, we address cases where users explicitly provide harmful prompts like "show an image of a person killing another person" or upload an image and ask to "remove this person's clothing", and so on.

To detect and block harmful requests, we can use a simple blocklist-based approach with keyword matching and block all prompts that contain a matching harmful keyword (say, "suicide"). However, this approach is brittle and can produce a large number of false positives and false negatives. Any obfuscation (say, users querying for "suicid3" instead of "suicide") will slip through. Instead, an embedding-based CNN filter can be used for harmful pattern recognition: the user prompt is converted into embeddings that capture the semantic meaning of the text, and a classifier then detects harmful patterns within those embeddings. However, LLMs have proven to be better at harmful pattern recognition in prompts because they excel at understanding context, nuance, and intent in a way that simpler models like CNNs may struggle with. They provide a more context-aware filtering solution and can adapt to evolving language patterns, slang, obfuscation techniques, and emerging harmful content more effectively than models trained on fixed embeddings. The LLM can be trained to block anything defined by your organization's policy guidelines. Apart from harmful content like sexual imagery, violence, and self-injury, it can also be trained to identify and block requests to generate images of public figures or election misinformation. To use an LLM-based solution at production scale, you would have to optimize for latency and accept the inference cost.
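A minimal sketch of such a tiered prompt filter is shown below: a cheap blocklist pass first, then a more context-aware LLM-based policy check. The blocklist terms are illustrative, and `llm_policy_check` is a placeholder for whatever moderation model or endpoint your stack provides.

```python
# Tiered prompt filtering sketch: blocklist first, then an LLM-based policy check.
import re

BLOCKLIST = {"suicide", "beheading"}  # illustrative keywords only

def blocklist_hit(prompt: str) -> bool:
    # Simple keyword matching; obfuscations like "suicid3" will slip through,
    # which is exactly the brittleness described above.
    normalized = re.sub(r"[^a-z0-9 ]", "", prompt.lower())
    return any(term in normalized for term in BLOCKLIST)

def llm_policy_check(prompt: str) -> bool:
    """Placeholder: ask a fine-tuned LLM whether the prompt violates policy
    (sexual imagery, violence, self-injury, public figures, election misinfo, ...)."""
    raise NotImplementedError

def is_prompt_allowed(prompt: str) -> bool:
    if blocklist_hit(prompt):            # fast path: obvious violations
        return False
    return not llm_policy_check(prompt)  # slower, context-aware check
```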

Prompt manipulations

Before passing the raw user prompt to the model for image generation, several prompt manipulations can be applied to enhance the safety of the prompt. Several case studies are presented below:

Prompt augmentation to reduce stereotypes: LDMs amplify dangerous and complex stereotypes [5]. A broad range of ordinary prompts produce stereotypes, including prompts that simply mention traits, descriptors, occupations, or objects. For instance, prompting for basic traits or social roles can result in images reinforcing whiteness as the ideal, and prompting for occupations can amplify racial and gender disparities. Prompt engineering that adds gender and racial diversity to the user prompt is an effective solution. For example, "image of a ceo" -> "image of a ceo, asian woman" or "image of a ceo, black man" to produce more diverse results. This can also help reduce harmful stereotypes by transforming prompts like "image of a criminal" -> "image of a criminal, olive-skin-tone", since the original prompt would most likely have produced a Black man.
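A minimal sketch of this kind of augmentation follows. The descriptor list and the people-detection heuristic are illustrative assumptions; production systems would use a more robust way of deciding when and how to augment.

```python
# Sketch of prompt augmentation for demographic diversity: for people-centric
# prompts, randomly append a descriptor so results vary across requests.
import random

DESCRIPTORS = [
    "asian woman", "black man", "latina woman", "south asian man",
    "middle eastern woman", "older black woman",
]

PEOPLE_TERMS = {"person", "ceo", "doctor", "nurse", "engineer", "teacher", "criminal"}

def augment_for_diversity(prompt: str) -> str:
    mentions_person = any(term in prompt.lower() for term in PEOPLE_TERMS)
    if mentions_person:
        return f"{prompt}, {random.choice(DESCRIPTORS)}"
    return prompt

# e.g. "image of a ceo" -> "image of a ceo, asian woman" (varies per call)
```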

Prompt anonymization for privacy: Additional mitigation can be applied at this stage to anonymize or filter out content in prompts that requests information about specific private individuals. For example, "Image of John Doe in shower" -> "Image of a person in shower".
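One way to do this is with off-the-shelf named entity recognition, as in the sketch below using spaCy (the model name and the generic replacement phrase are assumptions; the `en_core_web_sm` model must be downloaded separately).

```python
# Sketch of prompt anonymization: replace detected PERSON entities with a
# generic phrase using spaCy NER.
import spacy

nlp = spacy.load("en_core_web_sm")

def anonymize_prompt(prompt: str) -> str:
    doc = nlp(prompt)
    anonymized = prompt
    # Replace entities from right to left so character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ == "PERSON":
            anonymized = anonymized[:ent.start_char] + "a person" + anonymized[ent.end_char:]
    return anonymized

# "Image of John Doe in shower" -> "Image of a person in shower"
```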

Prompt rewriting and grounding to convert harmful prompts to benign ones: Prompts can be rewritten or grounded (usually with a fine-tuned LLM) to reframe problematic scenarios in a positive or neutral way. For example, "Show a lazy [ethnic group] person taking a nap" -> "Show a person relaxing in the afternoon". Defining a well-specified prompt, commonly known as grounding the generation, enables models to adhere more closely to instructions when generating scenes, thereby mitigating certain latent and ungrounded biases. "Show two people having fun" (which could lead to inappropriate or dangerous interpretations) -> "Show two people dining at a restaurant".
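A sketch of LLM-based rewriting follows. The instruction template is illustrative, and `call_llm` is a placeholder for whatever inference API or fine-tuned rewriting model your stack exposes.

```python
# Sketch of LLM-based prompt rewriting/grounding.
REWRITE_INSTRUCTION = (
    "Rewrite the following image-generation prompt so it is specific, neutral, "
    "and free of stereotypes or harmful framing, while preserving the user's "
    "benign intent. Return only the rewritten prompt.\n\nPrompt: {prompt}"
)

def call_llm(instruction: str) -> str:
    """Placeholder for your LLM inference call (hosted endpoint or local model)."""
    raise NotImplementedError

def rewrite_prompt(user_prompt: str) -> str:
    return call_llm(REWRITE_INSTRUCTION.format(prompt=user_prompt))

# e.g. "Show two people having fun" -> "Show two people dining at a restaurant"
```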

Output image classifiers

Image classifiers can be deployed to detect whether images produced by the model are harmful, and to block them before they are sent back to users. Standalone image classifiers like this are effective for blocking images that are visibly harmful (showing graphic violence, sexual content, nudity, etc.). However, for inpainting-based applications where users upload an input image (e.g., an image of a white person) and give a harmful prompt ("give them blackface") to transform it in an unsafe manner, classifiers that only look at the output image in isolation will not be effective, because they lose the context of the "transformation" itself. For such applications, multimodal classifiers that consider the input image, the prompt, and the output image together to decide whether a transformation of the input to the output is safe are very effective. Such classifiers can also be trained to identify "unintended transformations", e.g., uploading an image of a woman and prompting to "make them beautiful", resulting in an image of a skinny, blonde, white woman.
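One way to structure such a multimodal classifier is sketched below: CLIP embeddings of the input image, the edit prompt, and the output image are concatenated and scored by a small MLP head. This is an architecture sketch only; the head is untrained here, and the choice of CLIP checkpoint and head size are assumptions.

```python
# Sketch of a multimodal "transformation safety" classifier built on CLIP features.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class TransformationSafetyHead(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * embed_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, in_img_emb, prompt_emb, out_img_emb):
        # Concatenate the three views of the transformation and score it.
        return self.mlp(torch.cat([in_img_emb, prompt_emb, out_img_emb], dim=-1))

def score_transformation(head, input_image, prompt, output_image):
    with torch.no_grad():
        imgs = processor(images=[input_image, output_image], return_tensors="pt")
        txt = processor(text=[prompt], return_tensors="pt", padding=True)
        img_emb = clip.get_image_features(**imgs)   # (2, 512)
        txt_emb = clip.get_text_features(**txt)     # (1, 512)
    logit = head(img_emb[0:1], txt_emb, img_emb[1:2])
    return torch.sigmoid(logit)  # probability the transformation is unsafe
```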

Regeneration instead of refusals

Instead of refusing the output image, models like DALL·E 3 use classifier guidance to improve unsolicited content. A bespoke algorithm based on classifier guidance is deployed, and its working is described in [3]:

When an image output classifier detects a harmful image, the prompt is re-submitted to DALL·E 3 with a special flag set. This flag triggers the diffusion sampling process to use the harmful content classifier to sample away from images that might have triggered it.

Essentially, this algorithm can "nudge" the diffusion model towards more appropriate generations. This can be done at both the prompt level and the image classifier level.
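The sketch below illustrates the general idea of classifier guidance used to steer sampling away from a concept. It is not DALL·E 3's actual (unpublished) implementation: `denoiser` and `harm_classifier` are hypothetical components, and the noise-schedule scaling factor is folded into `guidance_scale` for brevity.

```python
# Illustrative classifier-guidance step that steers a diffusion model AWAY from
# images a harm classifier would flag (standard classifier guidance with the
# guidance sign flipped).
import torch

def guided_denoise_step(denoiser, harm_classifier, x_t, t, guidance_scale=5.0):
    # Standard predicted noise from the diffusion model.
    eps = denoiser(x_t, t)

    # Gradient of log p(harmful | x_t) with respect to the current noisy latent.
    x = x_t.detach().requires_grad_(True)
    log_p_harm = torch.log_softmax(harm_classifier(x, t), dim=-1)[:, 1].sum()
    grad = torch.autograd.grad(log_p_harm, x)[0]

    # Adding the harm gradient to the noise prediction pushes the sample away
    # from regions the classifier considers harmful.
    return eps + guidance_scale * grad
```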
