Introduction
Background replacement is a staple of image editing, but achieving production-grade results remains a major challenge for developers. Many existing tools work like “black boxes,” which means we have little control over the balance between quality and speed needed for a real application. I ran into these difficulties while building VividFlow. The project is primarily focused on Image-to-Video generation, but it also provides a feature for users to swap backgrounds using AI prompts.
To make the system more reliable across various kinds of images, I ended up focusing on three technical areas that made a major difference in my results:
- A Three-Tier Fallback Strategy: I found that orchestrating BiRefNet, U²-Net, and traditional gradients ensures the system always produces a usable mask, even when the primary model fails.
- Correction in Lab Color Space: Moving the blending step to Lab space helped me remove the “yellow halo” artifacts that often appear when compositing images in standard RGB space.
- Specific Logic for Cartoon Art: I added a dedicated pipeline to detect and preserve the sharp outlines and flat colors that are unique to illustrations.
These are the approaches that worked for me when I deployed the app on HuggingFace Spaces. In this article, I want to share the logic and some of the math behind these choices, and how they helped the system handle the messy variety of real-world images more consistently.
1. The Problem with RGB: Why Backgrounds Leave a Trace
Standard RGB alpha blending tends to leave a stubborn visual mess in background replacement. When you blend a portrait shot against a colored wall into a new background, the edge pixels often hold onto some of that original color. This is most obvious when the original and new backgrounds have contrasting colors, like swapping a warm yellow wall for a cool blue sky. You frequently end up with an unnatural yellowish tint that immediately gives away the fact that the image is a composite. This is why, even when your segmentation mask is pixel-perfect, the final composite can still look obviously fake: the color contamination betrays the edit.
The issue is rooted in how RGB blending works. Standard alpha compositing treats each color channel independently, calculating weighted averages without considering how humans actually perceive color. To see this problem concretely, consider the example visualized in Figure 1 below. Take a dark hair pixel (RGB 80, 60, 40) captured against a yellow wall (RGB 200, 180, 120). During the photo shoot, light from the wall reflects onto the hair edges, creating a color cast: at a soft edge, the camera effectively records a 50/50 mixture of hair and wall, a muddy average of RGB (140, 120, 80) that preserves obvious traces of the original yellow. When you then alpha-blend that contaminated pixel into a new blue background in RGB space, the channel-wise average carries the yellow straight into the composite, precisely the yellowish tint we want to eliminate. Instead of a clean transition, this contamination breaks the illusion of natural integration.
As demonstrated in the figure above, the middle panel shows how RGB blending produces a muddy result that retains the yellowish tint from the original wall. The rightmost panel reveals the solution: switching to Lab color space before the final blend allows surgical removal of this contamination. Lab space separates lightness (the L channel) from chroma (the a and b channels), enabling targeted correction of color casts without disturbing the luminance that defines object edges. The corrected result (RGB 75, 55, 35) achieves natural hair darkness while eliminating the yellow influence through vector operations in the ab plane, a mathematical process I detail in Section 4.
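Before moving on, here is a minimal numpy sketch of the per-channel arithmetic from the example above. The hair and wall values come from the text; the blue-sky color is an assumed illustrative value, not taken from the project.

```python
import numpy as np

# Pixel values from the example above; the blue-sky colour is illustrative.
hair = np.array([80, 60, 40], dtype=np.float32)          # dark hair
yellow_wall = np.array([200, 180, 120], dtype=np.float32)
blue_sky = np.array([90, 140, 220], dtype=np.float32)    # assumed new background

# At a soft edge, the camera effectively records a mixture of hair and wall.
edge_pixel = 0.5 * hair + 0.5 * yellow_wall
print(edge_pixel)   # [140. 120.  80.] -- already carries the yellow cast

# Standard RGB alpha compositing then averages each channel independently,
# so the yellow contamination is carried straight into the new composite.
alpha = 0.5
composite = alpha * edge_pixel + (1 - alpha) * blue_sky
print(composite)    # [115. 130. 150.] -- warmer than a clean hair-over-sky blend ([85. 100. 130.])
```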
2. System Architecture: Orchestrating the Workflow
The background replacement pipeline orchestrates several specialized components in a carefully designed sequence that prioritizes both robustness and efficiency. The architecture ensures that even when individual models encounter difficult scenarios, the system gracefully degrades to alternative approaches while maintaining output quality, without wasting GPU resources.

Following the architecture diagram, the pipeline executes through six distinct stages:
Image Preparation: The system resizes and normalizes input images to a maximum dimension of 1024 pixels, ensuring compatibility with diffusion model architectures while maintaining aspect ratio.
Semantic Analysis: An OpenCLIP vision encoder analyzes the image to detect the subject type (person, animal, object, nature, or building) and measures color temperature characteristics (warm versus cool tones).
Prompt Enhancement: Based on the semantic analysis, the system augments the user’s original prompt with contextually appropriate lighting descriptors (golden hour, soft diffused, bright daylight) and atmospheric qualities (professional, natural, elegant, cozy).
Background Generation: Stable Diffusion XL synthesizes a new background scene using the enhanced prompt, configured with a DPM-Solver++ scheduler running for 25 inference steps at guidance scale 7.5 (a minimal sketch of this stage follows the stage list).
Robust Mask Generation: The system attempts three progressively simpler approaches to extract the foreground. BiRefNet provides high-quality semantic segmentation as the first choice. When BiRefNet produces insufficient results, U²-Net through rembg offers reliable general-purpose extraction. Traditional gradient-based methods serve as the final fallback, guaranteeing mask production regardless of input complexity.
Perceptual Color Blending: The fusion stage operates in Lab color space to enable precise removal of background color contamination through chroma vector deprojection. Adaptive suppression strength scales with each pixel’s color similarity to the original background. Multi-scale edge refinement produces natural transitions around fine details, and the result is composited back to standard color space with proper gamma correction.
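To make the background-generation stage concrete, here is a minimal diffusers sketch using the settings listed above (DPM-Solver++, 25 steps, guidance scale 7.5). The model ID and the example prompt are assumptions for illustration; VividFlow’s actual wrapper code differs.

```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

# Load SDXL (model ID assumed; any SDXL checkpoint is loaded the same way).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Swap in DPM-Solver++ as described in the pipeline stages above.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="dpmsolver++"
)

# Enhanced prompt = user prompt + lighting/atmosphere descriptors from semantic analysis.
enhanced_prompt = "a cozy living room, golden hour lighting, professional, natural"

background = pipe(
    enhanced_prompt,
    num_inference_steps=25,
    guidance_scale=7.5,
    height=1024,
    width=1024,
).images[0]
background.save("background.png")
```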
3. The Three-Tier Mask Strategy: Quality Meets Reliability
In background replacement, the mask quality is the ceiling: your final image can never look better than the mask it is built on. However, relying on a single segmentation model is a recipe for failure when dealing with real-world variety. I found that a three-tier fallback strategy was the best way to ensure every user gets a usable result, regardless of the image type. A simplified sketch of the fallback logic follows the comparison below.

- BiRefNet (The Quality Leader): This is the first choice for complex scenes. If you look at the left panel of the comparison image, notice how cleanly it handles individual curly hair strands. It uses a bilateral architecture that balances high-level semantic understanding with fine-grained detail. In my experience, it is the only model that consistently avoids the “choppy” look around flyaway hair.
- U²-Net via rembg (The Balanced Fallback): When BiRefNet struggles (often with cartoons or very small subjects), the system automatically switches to U²-Net. Looking at the middle panel, the hair edges are a bit “fuzzier” and less detailed than with BiRefNet, but the overall body shape is still very accurate. I added custom alpha stretching and morphological refinements to this stage to help keep extremities like hands and feet from being accidentally clipped.
- Traditional Gradients (The “Never Fail” Safety Net): As a last resort, I use Sobel and Laplacian operators to find edges based on pixel intensity. The right panel shows the result: it is much simpler and misses the fine hair textures, but it is guaranteed to complete without a model error. To make this look professional, I apply a guided filter using the original image as the guidance signal, which helps smooth out noise while keeping the structural edges sharp.
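Here is a simplified sketch of that fallback chain. The rembg and OpenCV calls are the real library APIs; the BiRefNet wrapper is passed in as a callable because its loading code is project-specific, and the quality check is an illustrative coverage heuristic rather than the exact test used in VividFlow.

```python
import cv2
import numpy as np
from PIL import Image
from rembg import remove


def mask_is_usable(mask: np.ndarray, lo: float = 0.02, hi: float = 0.98) -> bool:
    """Illustrative quality check: foreground should be neither empty nor the whole frame."""
    coverage = np.count_nonzero(mask > 127) / mask.size
    return lo < coverage < hi


def gradient_fallback(bgr: np.ndarray) -> np.ndarray:
    """Tier 3: never-fail mask from Sobel/Laplacian edges (coarse but robust)."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    sobel = np.abs(cv2.Sobel(gray, cv2.CV_32F, 1, 0)) + np.abs(cv2.Sobel(gray, cv2.CV_32F, 0, 1))
    lap = np.abs(cv2.Laplacian(gray, cv2.CV_32F))
    edges = cv2.normalize(sobel + lap, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(edges, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Morphological closing fills small holes so the silhouette stays connected.
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))


def extract_mask(bgr: np.ndarray, birefnet_segment=None) -> np.ndarray:
    # Tier 1: BiRefNet, if a segmentation callable is supplied.
    if birefnet_segment is not None:
        try:
            mask = birefnet_segment(bgr)
            if mask_is_usable(mask):
                return mask
        except Exception:
            pass
    # Tier 2: U²-Net via rembg (returns RGBA; the alpha channel is the mask).
    try:
        rgba = remove(Image.fromarray(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)))
        mask = np.array(rgba)[:, :, 3]
        if mask_is_usable(mask):
            return mask
    except Exception:
        pass
    # Tier 3: traditional gradients, guaranteed to complete.
    return gradient_fallback(bgr)
```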
4. Perceptual Color Space Operations for Targeted Contamination Removal
The solution to RGB blending’s color contamination problem lies in choosing a color space where luminance and chromaticity separate cleanly. Lab color space, standardized by the CIE (2004), provides exactly this property through its three-channel structure: the L channel encodes lightness on a 0–100 scale, while the a and b channels represent color-opponent dimensions spanning green-to-red and blue-to-yellow respectively. Unlike RGB, where all three channels couple together during blending operations, Lab allows surgical manipulation of color information without disturbing the brightness values that define object boundaries.
The mathematical correction operates through vector projection in the ab chromatic plane. To understand this operation geometrically, consider Figure 3 below, which visualizes the process in the two-dimensional ab space. When an edge pixel exhibits contamination from a yellow background, its measured chroma vector C represents the pixel’s color coordinates (a, b) in the ab plane, pointing partially toward the yellow direction. In the diagram, the contaminated pixel appears as a red arrow with coordinates (a = 12, b = 28), while the background’s yellow chroma vector B appears as an orange arrow pointing toward (a = 5, b = 45). The key insight is that the portion of C that aligns with B represents unwanted background influence, while the perpendicular portion represents the subject’s true color.

Figure 3. Vector projection in Lab ab chromatic plane removing yellow background contamination.
As illustrated in the figure above, the system removes contamination by projecting C onto the normalized background direction B̂ and subtracting this projection. Mathematically, the corrected chroma vector becomes:
\[\mathbf{C}' = \mathbf{C} - (\mathbf{C} \cdot \hat{\mathbf{B}})\,\hat{\mathbf{B}}\]
where C · B̂ denotes the dot product that measures how much of C lies along the background direction. The yellow dashed line in Figure 3 represents this projection component, showing a contamination magnitude of 15 units along the background direction. The purple dashed arrow demonstrates the subtraction that yields the corrected green arrow C′ = (a′ = 4, b′ = 8). This corrected chroma exhibits a substantially reduced yellow component (from b = 28 down to b′ = 8) while maintaining the original red-green balance (a′ stays near its original value). The operation performs precisely what visual inspection suggests is required: it removes only the color component parallel to the background direction while preserving the perpendicular components that encode the subject’s inherent coloration.
Critically, this correction happens exclusively in the chromatic dimensions, while the L channel remains untouched throughout the operation. This preservation of luminance maintains the edge structure that viewers perceive as natural boundaries between foreground and background elements. Converting the corrected Lab values back to RGB space produces the final pixel color, which integrates cleanly with the new background without visible contamination artifacts.
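The following is a minimal numpy sketch of this deprojection for a single pixel. It assumes the Lab values have already been converted (for example via OpenCV or skimage) and uses the figure’s (a, b) values only to exercise the function; the L values shown are illustrative, and the full pipeline applies this per pixel with a strength map rather than a single scalar.

```python
import numpy as np


def deproject_chroma(pixel_lab: np.ndarray, bg_lab: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Remove the component of a pixel's (a, b) chroma that points toward the background chroma.

    pixel_lab, bg_lab: Lab triplets (L in 0..100, a/b signed).
    strength: 0..1 scaling of the correction (1.0 = full deprojection).
    """
    L, a, b = pixel_lab
    C = np.array([a, b], dtype=np.float32)           # pixel chroma vector
    B = np.array(bg_lab[1:], dtype=np.float32)       # background chroma vector
    norm = np.linalg.norm(B)
    if norm < 1e-6:
        return pixel_lab                             # neutral background: nothing to remove
    B_hat = B / norm
    projection = np.dot(C, B_hat) * B_hat            # contamination along the background direction
    C_corrected = C - strength * projection          # C' = C - (C . B_hat) B_hat
    return np.array([L, C_corrected[0], C_corrected[1]], dtype=np.float32)


# Figure 3 chroma values: pixel (a=12, b=28), yellow background (a=5, b=45); L values illustrative.
corrected = deproject_chroma(np.array([45.0, 12.0, 28.0]), np.array([75.0, 5.0, 45.0]))
print(corrected)  # L is untouched; the remaining chroma is perpendicular to the yellow direction
```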
5. Adaptive Correction Strength Through Color Distance Metrics
Simply removing all background color from edges risks overcorrection: edges can become artificially gray or desaturated, losing their natural warmth. To prevent this, I implemented adaptive strength modulation based on how contaminated each pixel actually is, using the ΔE color distance metric:
\[\Delta E = \sqrt{(\Delta L)^2 + (\Delta a)^2 + (\Delta b)^2}\]
where a ΔE below 1 is imperceptible, while values above 5 indicate clearly distinguishable colors. Pixels with a ΔE below 18 from the background are classified as contaminated candidates for correction.
The correction strength follows an inverse relationship: pixels very close to the background color receive strong correction, while distant pixels get gentle treatment:
\[S = 0.85 \times \max\left(0,\; 1 - \frac{\Delta E}{18}\right)\]
This formula ensures the strength gracefully tapers to zero as ΔE approaches the threshold, avoiding sharp discontinuities.
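For reference, here is a minimal vectorized sketch of this strength map, assuming the image has already been converted to Lab in CIE units (L in 0..100, a/b signed), for example with skimage’s rgb2lab. The resulting map scales the per-pixel deprojection from Section 4.

```python
import numpy as np

DELTA_E_THRESHOLD = 18.0   # pixels farther than this from the background are left alone
MAX_STRENGTH = 0.85        # never remove 100% of the chroma, to keep natural warmth


def spill_strength(lab_image: np.ndarray, bg_lab: np.ndarray) -> np.ndarray:
    """Per-pixel correction strength S = 0.85 * max(0, 1 - dE/18).

    lab_image: H x W x 3 array of Lab values; bg_lab: Lab triplet of the estimated
    original background colour.
    """
    diff = lab_image - bg_lab.reshape(1, 1, 3)
    delta_e = np.sqrt(np.sum(diff ** 2, axis=2))      # CIE76 colour distance per pixel
    return MAX_STRENGTH * np.clip(1.0 - delta_e / DELTA_E_THRESHOLD, 0.0, 1.0)
```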
Figure 4 illustrates this with a zoomed comparison of hair edges against different backgrounds. The left panel shows the original image with yellow wall contamination visible along the hair boundary. The middle panel reveals how standard RGB blending preserves a yellowish rim that immediately betrays the composite as artificial. The right panel shows the Lab-based correction eliminating the color spill while maintaining natural hair texture: the edge now integrates cleanly with the blue background, because the correction targets contamination precisely at the mask boundary without affecting legitimate subject color.

6. Cartoon-Specific Enhancement for Line Art Preservation
Cartoon and line-art images present unique challenges for generic segmentation models trained on photographic data. Unlike natural photographs with gradual transitions, cartoon characters feature sharp black outlines and flat color fills. Standard deep learning segmentation often misclassifies black outlines as background while giving insufficient coverage to solid fill areas, creating visible gaps in composites.
I developed an automatic detection pipeline that activates when the system identifies line-art characteristics through three features: edge density (the ratio of Canny edge pixels), color simplicity (unique colors relative to area), and dark-pixel prevalence (luminance below 50). When these thresholds are met, the specialized enhancement routines trigger. A rough sketch of the checks is shown below.
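This sketch computes the three detection features with OpenCV and numpy; the threshold values and the color quantization step are illustrative rather than the exact ones used in VividFlow.

```python
import cv2
import numpy as np


def looks_like_line_art(bgr: np.ndarray) -> bool:
    """Heuristic line-art detector: edge density, colour simplicity, dark-pixel prevalence."""
    h, w = bgr.shape[:2]
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

    # 1. Edge density: ratio of Canny edge pixels to total pixels.
    edges = cv2.Canny(gray, 100, 200)
    edge_density = np.count_nonzero(edges) / (h * w)

    # 2. Colour simplicity: number of distinct (coarsely quantized) colours relative to area.
    quantized = (bgr // 32).reshape(-1, 3)
    color_simplicity = len(np.unique(quantized, axis=0)) / (h * w)

    # 3. Dark-pixel prevalence: fraction of pixels with luminance below 50.
    dark_ratio = np.count_nonzero(gray < 50) / (h * w)

    # Illustrative thresholds: dense edges, few colours, and a visible share of black lines.
    return edge_density > 0.05 and color_simplicity < 0.01 and dark_ratio > 0.02
```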
Figure 5 below shows the enhancement pipeline through four stages. The first panel displays the original cartoon dog with its characteristic black outlines and flat colors. The second panel shows the enhanced mask; notice the complete white silhouette capturing the entire character. The third panel reveals Canny edge detection identifying the sharp outlines. The fourth panel highlights the dark regions (luminance < 50) that mark the black lines defining the character’s form.

The enhancement process in the figure above operates in two stages, sketched in code below. First, black outline protection scans for dark pixels (luminance < 80), dilates them slightly, and sets their mask alpha to 255 (full opacity), ensuring the black lines are never lost. Second, internal fill enhancement identifies high-confidence regions (alpha > 160), applies morphological closing to connect separated parts, then boosts medium-confidence pixels inside this zone to a minimum alpha of 220, eliminating gaps in flat-colored areas.
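Here is a condensed sketch of those two stages. The luminance and alpha thresholds follow the text where given; the kernel sizes and the lower bound used to define “medium confidence” are assumptions for illustration.

```python
import cv2
import numpy as np


def enhance_cartoon_mask(bgr: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Protect black outlines and fill flat-colour interiors of a cartoon mask (alpha 0..255)."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    mask = mask.copy()

    # Stage 1: black outline protection. Dark pixels (luminance < 80) are dilated
    # slightly and forced to full opacity so the line work is never lost.
    dark = (gray < 80).astype(np.uint8) * 255
    dark = cv2.dilate(dark, np.ones((3, 3), np.uint8), iterations=1)
    mask[dark > 0] = 255

    # Stage 2: internal fill enhancement. Close gaps between high-confidence regions
    # (alpha > 160), then boost medium-confidence pixels inside that zone to at least 220.
    high_conf = (mask > 160).astype(np.uint8) * 255
    closed = cv2.morphologyEx(high_conf, cv2.MORPH_CLOSE, np.ones((9, 9), np.uint8))
    boost = (closed > 0) & (mask > 60) & (mask < 220)   # lower bound of 60 is an assumption
    mask[boost] = 220
    return mask
```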
This specialized handling preserved mask coverage across anime characters, comic illustrations, and line drawings during development. Without it, generic models produce masks that are technically correct for photos but fail to preserve the sharp outlines and solid fills that define cartoon imagery.
Conclusion: Engineering Decisions Over Model Selection
Building this background replacement system reinforced a core principle: production-quality AI applications require thoughtful orchestration of multiple techniques rather than reliance on a single “best” model. The three-tier mask generation strategy ensures robustness across diverse inputs, Lab color space operations eliminate perceptual artifacts that RGB blending inherently produces, and cartoon-specific enhancements preserve artistic integrity for non-photographic content. Together, these design decisions create a system that handles real-world diversity while maintaining transparency about how corrections are applied, which is critical for developers integrating AI into their applications.
Several directions for future enhancement emerge from this work. Implementing guided filter refinement as a standard post-processing step could further smooth mask edges while preserving structural boundaries. The cartoon detection heuristics currently use fixed thresholds but may benefit from a lightweight classifier trained on labeled examples. The adaptive spill suppression currently uses a linear falloff, but smoothstep or double-smoothstep curves might provide more natural transitions. Finally, extending the system to handle video input would require temporal consistency mechanisms to prevent flickering between frames.
Project Links:
Acknowledgments:
This work builds upon the open-source contributions of BiRefNet, U²-Net, Stable Diffusion XL, and OpenCLIP. Special thanks to the HuggingFace team for providing the ZeroGPU infrastructure that enabled this deployment.
References & Further Reading
Color Science Foundations
- CIE. (2004). Colorimetry (3rd ed.). CIE Publication 15:2004. International Commission on Illumination.
- Sharma, G., Wu, W., & Dalal, E. N. (2005). The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations. Color Research & Application, 30(1), 21–30.
Deep Learning Segmentation
- Peng, Z., Shen, J., & Shao, L. (2024). Bilateral reference for high-resolution dichotomous image segmentation.
- Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O. R., & Jagersand, M. (2020). U²-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognition, 106, 107404.
Image Compositing & Color Spaces
- Lucas, B. D. (1984). Color image compositing in multiple color spaces.
Core Infrastructure
- Rombach, R., et al. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10684–10695.
- Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning (ICML), 8748–8763.
Image Attribution
- All figures in this article were generated using Gemini Nano Banana and Python code.
