Improving Green Screen Generation for Stable Diffusion


Despite community and investor enthusiasm around visual generative AI, the output from such systems is not always ready for real-world usage; one example is that generative AI systems tend to output entire, integrated images (or a series of images, in the case of video), rather than the isolated individual elements that are typically required for diverse applications in multimedia, and for visual effects practitioners.

A simple example of this is clip-art designed to ‘float’ over whatever target background the user has chosen:

Transparency of this type has been commonly available for over thirty years; since the digital revolution of the early 1990s, users have been able to extract elements from video and images through an increasingly sophisticated series of toolsets and techniques.

As an example, the challenge of ‘dropping out’ blue-screen and green-screen backgrounds in video footage, once the purview of costly chemical processes and optical printers (as well as hand-crafted mattes), would become the work of minutes in systems such as Adobe’s After Effects and Photoshop (among many other free and proprietary programs and systems).

Once an element has been isolated, an alpha channel (effectively a mask that obscures any non-relevant content) allows any element in the video to be effortlessly superimposed over new backgrounds, or composited together with other isolated elements.
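
To make these two operations concrete, here is a minimal sketch (assuming NumPy and Pillow, with placeholder file names and arbitrary HSV thresholds of my own choosing) that keys out a green background to build a hard alpha channel and then composites the result over a new background:

```python
# Minimal illustration of chroma keying and alpha compositing with NumPy + Pillow.
# File names and HSV thresholds are placeholder assumptions, not values from the article.
import numpy as np
from PIL import Image

def chroma_key_alpha(rgb: np.ndarray, hue_range=(35, 85), min_sat=60, min_val=60) -> np.ndarray:
    """Return an 8-bit alpha channel that is 0 wherever the pixel looks 'green-screen green'."""
    hsv = np.array(Image.fromarray(rgb).convert("HSV"))
    h, s, v = hsv[..., 0].astype(int), hsv[..., 1].astype(int), hsv[..., 2].astype(int)
    # Pillow stores hue on a 0-255 scale; rescale the OpenCV-style 0-179 range accordingly.
    lo, hi = [int(x * 255 / 179) for x in hue_range]
    is_green = (h >= lo) & (h <= hi) & (s >= min_sat) & (v >= min_val)
    return np.where(is_green, 0, 255).astype(np.uint8)

def composite_over(fg_rgb: np.ndarray, alpha: np.ndarray, bg_rgb: np.ndarray) -> np.ndarray:
    """Standard 'over' operator: out = fg * a + bg * (1 - a)."""
    a = (alpha.astype(np.float32) / 255.0)[..., None]
    out = fg_rgb.astype(np.float32) * a + bg_rgb.astype(np.float32) * (1.0 - a)
    return out.clip(0, 255).astype(np.uint8)

if __name__ == "__main__":
    fg = np.array(Image.open("green_screen_shot.png").convert("RGB"))        # placeholder path
    bg = np.array(Image.open("new_background.png").convert("RGB").resize(
        (fg.shape[1], fg.shape[0])))
    alpha = chroma_key_alpha(fg)
    Image.fromarray(composite_over(fg, alpha, bg)).save("composited.png")
```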

Examples of alpha channels, with their effects depicted in the lower row. Source: https://helpx.adobe.com/photoshop/using/saving-selections-alpha-channel-masks.html


Dropping Out

In computer vision, the creation of alpha channels falls under the aegis of semantic segmentation, with open source projects such as Meta’s Segment Anything providing a promptable approach to isolating/extracting target objects, through semantically-enhanced object recognition.
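
As an illustration of how such promptable segmentation is typically driven, below is a brief sketch using Meta's released segment-anything package; note that the published model is prompted with points and boxes rather than free text, and the checkpoint path, image path and click coordinates here are placeholder assumptions:

```python
# Sketch of point-prompted mask extraction with Meta's 'segment-anything' package.
# Checkpoint path, image path and click coordinates are placeholder assumptions.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed local checkpoint
predictor = SamPredictor(sam)

image = np.array(Image.open("photo.jpg").convert("RGB"))
predictor.set_image(image)

# One positive click roughly on the target object (label 1 = foreground).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[400, 300]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best = masks[np.argmax(scores)]            # boolean HxW mask
alpha = (best * 255).astype(np.uint8)      # reuse as a hard alpha channel
Image.fromarray(alpha).save("mask_alpha.png")
```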

The Segment Anything framework has been used in a wide range of visual effects extraction and isolation workflows, such as the Alpha-CLIP project.

Example extractions using Segment Anything, in the Alpha-CLIP framework. Source: https://arxiv.org/pdf/2312.03818


There are many other semantic segmentation methods that can be adapted to the task of assigning alpha channels.

However, semantic segmentation relies on training data which may not contain all of the object categories that need to be extracted. Although models trained on very high volumes of data can enable a wider range of objects to be recognized (effectively becoming foundation models, or world models), they are nonetheless limited by the classes that they are trained to recognize most effectively.

Semantic segmentation systems such as Segment Anything can struggle to identify certain objects, or parts of objects, as exemplified here in output from ambiguous prompts. Source: https://maucher.pages.mi.hdm-stuttgart.de/orbook/deeplearning/SAM.html


In any case, semantic segmentation is just as much an extraction process as a green screen procedure, but must isolate elements without the advantage of a single swathe of background color that can be easily recognized and removed.

Because of this, it has occasionally occurred to the user community that images and videos could be generated with backgrounds that could be immediately removed via conventional methods.

Unfortunately, popular latent diffusion models such as Stable Diffusion often have difficulty rendering a truly vivid green screen. This is because the models’ training data does not typically contain a great many examples of this relatively specialized scenario. Even when the system succeeds, the concept of ‘green’ tends to spread in an unwanted manner into the foreground subject, due to concept entanglement:

Above, we see that Stable Diffusion has prioritized authenticity of image over the need to create a single intensity of green, effectively replicating real-world problems that occur in traditional green screen scenarios. Below, we see that the 'green' concept has polluted the foreground image. The more the prompt focuses on the 'green' concept, the worse this problem is likely to get. Source: https://stablediffusionweb.com/


Despite the advanced methods in use, both the woman’s dress and the man’s tie (in the lower images seen above) would tend to ‘drop out’ along with the green background – a problem that hails back* to the days of photochemical emulsion dye removal in the 1970s and 1980s.
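
The naive workaround – simply asking the model for a green screen – is easy to try with an off-the-shelf library, and tends to exhibit exactly the uneven greens and color spill described above. A minimal sketch with the diffusers library follows, with the model ID and prompt chosen only for illustration:

```python
# Naive approach: simply ask Stable Diffusion for a green background.
# Model ID, prompt and device are assumptions for illustration; results will often show
# uneven greens and green spill on the subject, as discussed in the article.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a businessman wearing a tie, standing in front of a solid green screen background"
image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0]
image.save("naive_green_screen.png")
```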

As ever, the shortcomings of a model can be overcome by throwing specific data at the problem, and devoting considerable training resources. Systems such as Stanford’s 2024 offering LayerDiffuse create a fine-tuned model capable of generating images with alpha channels:

The Stanford LayerDiffuse project was trained on a million apposite images, imbuing the model with transparency capabilities. Source: https://arxiv.org/pdf/2402.17113


Unfortunately, in addition to the considerable curation and training resources required for this approach, the dataset used for LayerDiffuse is not publicly available, restricting the use of models trained on it. Even if this impediment did not exist, this approach is difficult to customize or develop for specific use cases.

A little later in 2024, Adobe Research collaborated with Stony Brook University to produce MAGICK, an AI extraction approach trained on custom-made diffusion images.

From the 2024 paper, an example of fine-grained alpha channel extraction in MAGICK. Source: https://openaccess.thecvf.com/content/CVPR2024/papers/Burgert_MAGICK_A_Large-scale_Captioned_Dataset_from_Matting_Generated_Images_using_CVPR_2024_paper.pdf


150,000 extracted, AI-generated objects were used to train MAGICK, so that the system would develop an intuitive understanding of extraction:

Samples from the MAGICK training dataset.

This dataset, as the source paper states, was very difficult to generate, for the aforementioned reason – that diffusion methods have difficulty creating solid keyable swathes of color. Therefore, manual selection of the generated mattes was necessary.

This logistical bottleneck once more results in a system that cannot be easily developed or customized, but rather must be used within its initially-trained range of capability.

TKG-DM – ‘Native’ Chroma Extraction for a Latent Diffusion Model

A new collaboration between German and Japanese researchers has proposed an alternative to such trained methods, capable – the paper states – of obtaining better results than the above-mentioned methods, without the need to train on specially-curated datasets.

TKG-DM alters the random noise that seeds a generated image so that it is better able to produce a solid, keyable background – in any color. Source: https://arxiv.org/pdf/2411.15580


The new method approaches the problem at the generation level, by optimizing the random noise from which an image is generated in a latent diffusion model (LDM) such as Stable Diffusion.

The approach builds on a previous investigation into the color schema of a Stable Diffusion distribution, and is capable of producing a background color of any kind, with less (or no) entanglement of the key background color into foreground content, compared to other methods.

Initial noise is conditioned by a channel mean shift that is able to influence aspects of the denoising process, without entangling the color signal into the foreground content.


The new paper comes from seven researchers across Hosei University in Tokyo and RPTU Kaiserslautern-Landau & DFKI GmbH, in Kaiserslautern.

Method

The new approach extends the architecture of Stable Diffusion by conditioning the initial Gaussian noise through a channel mean shift (CMS), which produces noise patterns designed to encourage the desired background/foreground separation in the generated result.

Schema for the workflow of the proposed system.

CMS adjusts the mean of each color channel while maintaining the overall development of the denoising process.
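
The paper's exact procedure is not reproduced here, but the general idea of a channel mean shift can be sketched in a few lines: re-centre each channel of the initial latent noise on a target offset before denoising begins. The offsets below are arbitrary placeholders, not values from TKG-DM:

```python
# Toy sketch of the idea described above: bias the initial latent noise by shifting
# each channel's mean before denoising begins. The shift values below are arbitrary
# placeholders, NOT the offsets used by TKG-DM; see the paper for the actual channel
# mean shift and init-noise selection procedure.
import torch

def channel_mean_shift(latents: torch.Tensor, target_shift: torch.Tensor) -> torch.Tensor:
    """latents: (B, C, H, W) Gaussian noise; target_shift: (C,) desired per-channel mean offset."""
    current_mean = latents.mean(dim=(2, 3), keepdim=True)          # (B, C, 1, 1)
    desired_mean = target_shift.view(1, -1, 1, 1).to(latents)       # broadcastable
    # Re-centre each channel on the desired mean while keeping its variance intact,
    # so the overall denoising trajectory is preserved.
    return latents - current_mean + desired_mean

latents = torch.randn(1, 4, 64, 64)                                  # SD1.5-style latent noise
green_bias = torch.tensor([0.0, 0.3, -0.3, 0.0])                     # placeholder channel offsets
shifted = channel_mean_shift(latents, green_bias)
# 'shifted' could then be passed as the initial latents, e.g. pipe(prompt, latents=shifted)
# in diffusers, alongside the attention-based foreground/background handling that the
# method also relies on.
```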

The authors explain that the color channel desired for the background chroma color is instantiated with a null text prompt, while the actual foreground content is created semantically, from the user's text instruction.

Self-attention and cross-attention are used to separate the two aspects of the image (the chroma background and the foreground content). Self-attention helps with internal consistency of the foreground object, while cross-attention maintains fidelity to the text prompt. The paper points out that since background imagery is usually less detailed and less emphasized in generations, its weaker influence is relatively easy to overcome and substitute with a swatch of pure color.

A visualization of the influence of self-attention and cross-attention in the chroma-style generation process.

Data and Tests

TKG-DM was tested using Stable Diffusion V1.5 and Stable Diffusion SDXL. Images were generated at 512x512px and 1024x1024px, respectively.

Images were created using the DDIM scheduler native to Stable Diffusion, at a guidance scale of 7.5, with 50 denoising steps. The targeted background color was green, now the dominant dropout color.
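
Those settings correspond to a fairly standard diffusers configuration; a sketch follows, with the model ID and prompt being assumptions rather than details from the paper:

```python
# The evaluation settings stated above, expressed as a diffusers configuration.
# Model ID and prompt are assumptions; scheduler, guidance scale, step count and
# resolution follow the figures quoted in the article (512px for SD1.5).
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a red vintage car, studio lighting, solid green background",
    height=512, width=512,
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
```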

The new approach was compared to DeepFloyd, under the settings used for MAGICK; to the fine-tuned low-rank diffusion model GreenBack LoRA; and also to the aforementioned LayerDiffuse.

For the data, 3,000 images from the MAGICK dataset were used.

Examples from the MAGICK dataset, from which 3000 images were curated in tests for the new system. Source: https://ryanndagreat.github.io/MAGICK/Explorer/magick_rgba_explorer.html


For metrics, the authors used Fréchet Inception Distance (FID) to evaluate foreground quality. They also developed a project-specific metric called m-FID, which uses the BiRefNet system to assess the quality of the resulting mask.

Visual comparisons of the BiRefNet system against prior methods. Source: https://arxiv.org/pdf/2401.03407


To test semantic alignment with the input prompts, the CLIP-Sentence (CLIP-S) and CLIP-Image (CLIP-I) methods were used. CLIP-S evaluates prompt fidelity, and CLIP-I the visual similarity to ground truth.
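
For readers who want to approximate the generic metrics, a hedged sketch using torchmetrics is shown below; the tensors are random stand-ins for real images, and m-FID itself is project-specific, so it is only indicated in a comment:

```python
# Sketch of the generic metrics mentioned above, using torchmetrics. The m-FID
# variant is project-specific (FID over BiRefNet-derived masks) and is only indicated
# in a comment; the tensors below are random stand-ins, and real evaluations would
# use far more samples than this.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

real = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)   # ground-truth images
fake = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)   # generated images

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())
# m-FID (per the paper): run the same computation on masks extracted with BiRefNet
# from the real and generated images, rather than on the RGB images themselves.

clip_s = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
prompts = ["a red vintage car on a green background"] * 16            # placeholder prompts
print("CLIP-S:", clip_s(fake, prompts).item())
```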

First set of qualitative results for the new method, this time for Stable Diffusion V1.5. Please refer to source PDF for better resolution.

The authors assert that the results (visualized above and below, for SD1.5 and SDXL respectively) demonstrate that TKG-DM obtains superior results without prompt engineering or the need to train or fine-tune a model.

SDXL qualitative results. Please refer to source PDF for better resolution.

They observe that with a prompt intended to incite a green background in the generated results, Stable Diffusion 1.5 has difficulty generating a clean background, while SDXL (though performing somewhat better) produces unstable light green tints liable to interfere with separation in a chroma process.

They further note that while LayerDiffuse generates well-separated backgrounds, it occasionally loses detail, such as precise numbers or letters, and the authors attribute this to limitations in the dataset. They add that mask generation also occasionally fails, resulting in ‘uncut’ images.

For quantitative tests, though LayerDiffuse apparently has the advantage in SDXL for FID, the authors emphasize that this is the result of a specialized dataset that effectively constitutes a ‘baked’ and inflexible product. As mentioned earlier, any objects or classes not covered in that dataset, or inadequately covered, may not perform as well, while further fine-tuning to accommodate novel classes presents the user with a curation and training burden.

Quantitative results for the comparisons. LayerDiffuse's apparent advantage, the paper implies, comes at the expense of flexibility, and the burden of data curation and training.


Finally, the researchers conducted a user study to evaluate prompt adherence across the various methods. One hundred participants were asked to judge 30 image pairs from each method, with subjects extracted using BiRefNet and manual refinements across all examples. The authors’ training-free approach was preferred in this study.

Results from the user study.

TKG-DM is compatible with the popular ControlNet third-party system for Stable Diffusion, and the authors contend that it produces superior results to ControlNet’s native ability to achieve this kind of separation.
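
For reference, the ControlNet baseline being compared against is the standard diffusers integration, sketched below with the commonly used public checkpoints; the TKG-DM noise conditioning itself is not shown, and the control image path is a placeholder:

```python
# Sketch of the standard diffusers ControlNet setup that the comparison refers to;
# the TKG-DM noise conditioning itself is not shown here. Model IDs are the common
# public checkpoints, and the control image path is a placeholder.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

control = load_image("canny_edges.png")   # placeholder pre-computed edge map
image = pipe(
    "a dancer in a red dress, solid green background",
    image=control,
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
image.save("controlnet_green_attempt.png")
```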

Conclusion

Perhaps the most notable takeaway from this new paper is the extent to which latent diffusion models are entangled, in contrast to the popular public perception that they can effortlessly separate facets of images and videos when generating new content.

The study further emphasizes the extent to which the research and hobbyist community has turned to fine-tuning as a fix for models’ shortcomings – a solution that will always address specific classes and types of object. In such a scenario, a fine-tuned model will either work very well on a limited number of classes, or else work passably well on a far higher volume of possible classes and objects, in accordance with higher amounts of data in the training sets.

Therefore it is refreshing to see at least one approach that does not depend on such laborious and arguably disingenuous solutions.

 

* Superman
