Although Adobe’s Firefly latent diffusion model (LDM) is arguably one of the best currently available, Photoshop users who have tried its generative features may have noticed that it isn’t able to easily edit existing imagery – instead it completely replaces the user’s selected area with imagery based on the user’s text prompt (albeit that Firefly is adept at integrating the resulting generated section into the context of the image).
In the current beta version, Photoshop can at least incorporate a reference image as a partial image prompt, which catches Adobe’s flagship product up to the kind of functionality that Stable Diffusion users have enjoyed for over two years, thanks to third-party frameworks such as ControlNet:
This illustrates an open problem in image synthesis research – the difficulty that diffusion models have in editing existing images without implementing a full-scale ‘re-imagining’ of the selection indicated by the user.

Source: https://arxiv.org/pdf/2502.20376
This problem occurs because LDMs generate images through iterative denoising, where each stage of the process is conditioned on the text prompt supplied by the user. With the text prompt content converted into embedding tokens, and with a hyperscale model such as Stable Diffusion or Flux containing hundreds of thousands (or millions) of near-matching embeddings related to the prompt, the process has a calculated conditional distribution to aim for; and every step taken is a step towards this ‘conditional distribution goal’.
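To make that mechanism concrete, the minimal sketch below (using the diffusers library, with an arbitrarily chosen Stable Diffusion 1.5 checkpoint and prompt – both assumptions for illustration, not any particular product’s implementation) shows the prompt being encoded into embedding tokens and then driving a 50-step denoising run:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly available LDM (model ID is an illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a wooden cabin beside a lake at dusk"

# Internally, the prompt is tokenized and encoded into embedding tokens;
# every denoising step is then conditioned on these embeddings, pulling the
# latent toward the conditional distribution implied by the prompt.
tokens = pipe.tokenizer(
    prompt, padding="max_length",
    max_length=pipe.tokenizer.model_max_length, return_tensors="pt"
)
text_embeddings = pipe.text_encoder(tokens.input_ids.to("cuda"))[0]
print(text_embeddings.shape)  # e.g. [1, 77, 768] for Stable Diffusion 1.5

# The pipeline repeats the same encoding internally and runs the iterative
# denoising loop (50 steps here), guided by the prompt at every step.
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("cabin.png")
```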
So that’s text-to-image – a scenario where the user ‘hopes for the best’, since there is no telling exactly what the generation will be like.
Instead, many have sought to use an LDM’s powerful generative capability to edit existing images – but this entails a balancing act between fidelity and flexibility.
When an image is projected into the model’s latent space by methods such as DDIM inversion, the goal is to recover the original as closely as possible while still allowing for meaningful edits. The problem is that the more precisely an image is reconstructed, the more the model adheres to its structure, making major modifications difficult.
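For readers unfamiliar with the mechanics, here is a rough, self-contained sketch of plain DDIM inversion using diffusers – the checkpoint, prompt, image path and step count are placeholder assumptions, and this is the generic technique rather than any paper’s released code:

```python
import torch
from torchvision import transforms
from diffusers import StableDiffusionPipeline, DDIMScheduler
from diffusers.utils import load_image

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Encode the source image into the VAE latent space.
img = load_image("source.png").resize((512, 512))
pixels = transforms.ToTensor()(img).unsqueeze(0).to(device) * 2 - 1
with torch.no_grad():
    latents = pipe.vae.encode(pixels).latent_dist.mean * pipe.vae.config.scaling_factor

# Text conditioning for the inversion (a faithful prompt helps reconstruction).
tokens = pipe.tokenizer("a photo of a red barn in a field", padding="max_length",
                        max_length=pipe.tokenizer.model_max_length,
                        return_tensors="pt")
with torch.no_grad():
    emb = pipe.text_encoder(tokens.input_ids.to(device))[0]

# Walk the timesteps in *increasing* noise order, re-using the model's own
# noise predictions to step from x_t to x_{t+1}: the DDIM update run in reverse.
pipe.scheduler.set_timesteps(50)
timesteps = pipe.scheduler.timesteps.flip(0)
alphas = pipe.scheduler.alphas_cumprod
with torch.no_grad():
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        eps = pipe.unet(latents, t, encoder_hidden_states=emb).sample
        a_t, a_next = alphas[t], alphas[t_next]
        x0_pred = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        latents = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

inverted_noise = latents  # starting point for reconstruction or editing
```

Denoising `inverted_noise` again under the same conditioning approximately reconstructs the source image; the tension described above comes from how tightly that reconstruction is held while edits are introduced.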

On the other hand, if the process prioritizes editability, the model loosens its grip on the original, making it easier to introduce changes – but at the cost of overall consistency with the source image:

Since this is a problem that even Adobe’s considerable resources are struggling to address, we can reasonably conclude that the challenge is a notable one, and that it may not allow of easy solutions, if any.
Tight Inversion
Therefore the examples in a new paper released this week caught my attention, since the work offers a worthwhile and noteworthy improvement on the current state-of-the-art in this area, by proving able to apply subtle and refined edits to images projected into the latent space of a model – without the edits either being insignificant or else overwhelming the original content in the source image:

LDM hobbyists and practitioners may recognize this kind of result, since much of it can be created in a complex workflow using external systems such as ControlNet and IP-Adapter.
In fact the new method – dubbed Tight Inversion – does indeed leverage IP-Adapter, along with a dedicated face-based model for human depictions.

Source: https://arxiv.org/pdf/2308.06721
The signal achievement of Tight Inversion, then, is to have proceduralized these complex techniques into a single drop-in plug-in modality that can be applied to existing systems, including many of the most popular LDM distributions.
Naturally, this means that Tight Inversion (TI), like the adjunct systems it leverages, uses the source image as a conditioning factor for its own edited version, instead of relying solely on accurate text prompts:

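The sketch below shows the general flavor of this kind of image conditioning using the off-the-shelf IP-Adapter support in diffusers; it is generic usage rather than the authors’ Tight Inversion code, and the checkpoint, adapter weights, file names and scale value are illustrative assumptions:

```python
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# IP-Adapter projects the source image into conditioning tokens that sit
# alongside the text embeddings inside the denoiser's cross-attention.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.8)  # weight of the image condition vs. the text prompt

source = load_image("portrait.png")
edited = pipe(
    prompt="the same person wearing a blue denim jacket",
    ip_adapter_image=source,     # the source image itself becomes a condition
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
edited.save("edited.png")
```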
Though the authors concede that their approach is not free of the usual and ongoing tension between fidelity and editability in diffusion-based image editing techniques, they report state-of-the-art results when injecting TI into existing systems, versus the baseline performance.
The new work is titled Tight Inversion, and comes from five researchers across Tel Aviv University and Snap Research.
Method
Initially, a Large Language Model (LLM) is used to generate a set of varied text prompts, from each of which an image is generated. Then the aforementioned DDIM inversion is applied to each image under three prompt conditions: the text prompt used to generate the image; a shortened version of the same; and a null (empty) prompt.
With the inverted noise returned from these processes, the images are then regenerated with the same conditioning, and without classifier-free guidance (CFG).

As we can see from the graph above, the scores across the various metrics improve with increased text length. The metrics used were Peak Signal-to-Noise Ratio (PSNR); L2 distance; Structural Similarity Index (SSIM); and Learned Perceptual Image Patch Similarity (LPIPS).
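For reference, the four metrics can be computed between a source image and its reconstruction along these lines, using scikit-image and the lpips package; the exact formulation in the paper (for instance, how the L2 distance is normalized) may differ, and the file names here are placeholders:

```python
import torch
import lpips
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Load both images as float RGB in [0, 1] (drop any alpha channel).
orig = io.imread("original.png")[..., :3].astype("float32") / 255.0
recon = io.imread("reconstruction.png")[..., :3].astype("float32") / 255.0

psnr = peak_signal_noise_ratio(orig, recon, data_range=1.0)
l2 = float(((orig - recon) ** 2).mean())   # mean squared error as the L2 term
ssim = structural_similarity(orig, recon, channel_axis=-1, data_range=1.0)

def to_tensor(a):
    # HWC array in [0, 1] -> NCHW tensor in [-1, 1], as LPIPS expects.
    return torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0) * 2 - 1

loss_fn = lpips.LPIPS(net="alex")
lpips_score = float(loss_fn(to_tensor(orig), to_tensor(recon)))

print(f"PSNR {psnr:.2f}  L2 {l2:.5f}  SSIM {ssim:.4f}  LPIPS {lpips_score:.4f}")
```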
Image-Conscious
Effectively, Tight Inversion changes how a host diffusion model edits real images by conditioning the inversion process on the image itself, rather than relying only on text.
Normally, inverting an image into a diffusion model’s noise space requires estimating the starting noise that, when denoised, reconstructs the input. Standard methods use a text prompt to guide this process; but an imperfect prompt can lead to errors, losing details or altering structures.
Tight Inversion instead uses IP-Adapter to feed visual information into the model, so that it reconstructs the image with greater accuracy, converting the source image into conditioning tokens and projecting them into the inversion pipeline.
These parameters are adjustable: increasing the influence of the source image makes the reconstruction nearly perfect, while reducing it allows for more creative changes. This makes Tight Inversion useful both for subtle modifications, such as changing a shirt color, and for more significant edits, such as swapping out objects – without the common side-effects of other inversion methods, such as the loss of fine details or unexpected aberrations in the background content.
The authors state:
Data and Tests
The researchers evaluated TI on its capacity to reconstruct and to edit real-world source images. All experiments used Stable Diffusion XL with a DDIM scheduler, as outlined in the original Stable Diffusion paper; and all tests used 50 denoising steps at a default guidance scale of 7.5.
For image conditioning, IP-Adapter-plus sdxl vit-h was used. For few-step tests, the researchers used SDXL-Turbo with an Euler scheduler, and also conducted experiments with FLUX.1-dev, conditioning the model in the latter case on PuLID-Flux, using RF-Inversion at 28 steps.
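The snippet below sketches how the components named above – SDXL with a DDIM scheduler, 50 steps at guidance scale 7.5, and the IP-Adapter-plus sdxl vit-h weights – can be assembled with diffusers. It approximates the stated configuration only and does not reproduce the paper’s inversion logic; the image path, prompt and scale values are assumptions, and the loop at the end simply illustrates the fidelity/editability knob discussed earlier:

```python
import torch
from transformers import CLIPVisionModelWithProjection
from diffusers import AutoPipelineForText2Image, DDIMScheduler
from diffusers.utils import load_image

# The 'plus' SDXL IP-Adapter weights expect the ViT-H image encoder.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
)

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIM scheduler

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                     weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")

source = load_image("source.png")

# Stronger image conditioning hugs the source (near-reconstruction);
# weaker conditioning lets the text prompt drive larger changes.
for scale in (1.0, 0.6, 0.3):
    pipe.set_ip_adapter_scale(scale)
    out = pipe(prompt="the same scene, at sunset",
               ip_adapter_image=source,
               num_inference_steps=50,   # 50 denoising steps, as in the tests
               guidance_scale=7.5).images[0]
    out.save(f"sdxl_ip_scale_{scale}.png")
```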
PuLID was used solely in cases featuring human faces, since this is the domain that PuLID was trained to handle – and while it is noteworthy that a specialized sub-system is used for this one possible prompt type, our inordinate interest in generating human faces suggests that relying solely on the broader weights of a foundation model such as Stable Diffusion may not be adequate to the standards we demand for this particular task.
Reconstruction tests were performed for qualitative and quantitative evaluation. In the image below, we see qualitative examples for DDIM inversion:

The paper states:

The authors also tested Tight Inversion as a drop-in module for existing systems, pitting the modified versions against their baseline performance.
The three systems tested were the aforementioned DDIM Inversion and RF-Inversion, as well as ReNoise, which shares some authorship with the paper under discussion here. Since DDIM results have no difficulty in obtaining 100% reconstruction, the researchers focused only on editability.

Here the authors comment:
The authors also tested the system quantitatively. In line with prior works, they used the validation set of MS-COCO, and note that the results (illustrated below) show improved reconstruction across all metrics for all of the methods.

Next, the authors tested the system’s ability to edit photos, pitting it against baseline versions of the prior approaches prompt2prompt; Edit Friendly DDPM; LEDITS++; and RF-Inversion.
Shown below are some of the paper’s qualitative results for SDXL and Flux (and we refer the reader to the rather compressed layout of the original paper for further examples).

The authors contend that Tight Inversion consistently outperforms existing inversion techniques by striking a better balance between reconstruction and editability. While standard methods such as DDIM inversion and ReNoise can recover an image well, the paper states that they often struggle to preserve fine details when edits are applied.
In contrast, Tight Inversion leverages image conditioning to anchor the model’s output more closely to the original, preventing unwanted distortions. The authors contend that even when competing approaches produce reconstructions that appear accurate, the introduction of edits often leads to artifacts or structural inconsistencies, and that Tight Inversion mitigates these issues.
Finally, quantitative results were obtained by evaluating Tight Inversion against the MagicBrush benchmark, using DDIM inversion and LEDITS++, measured with CLIP Sim.

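CLIP similarity of this kind is typically measured by embedding the edited image and the target prompt with the same CLIP model and taking their cosine similarity; the minimal sketch below shows the general approach, with the CLIP variant, file name and prompt being assumptions rather than the paper’s exact evaluation setup:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any CLIP checkpoint works for this style of metric; the variant is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("edited.png")
target_prompt = "a cat wearing a wizard hat"

inputs = processor(text=[target_prompt], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarity between the normalized image and text embeddings.
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
clip_sim = float((img_emb * txt_emb).sum())
print(f"CLIP similarity: {clip_sim:.4f}")
```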
The authors conclude:
Conclusion
Though it doesn’t represent a ‘breakthrough’ in one of the thorniest challenges in LDM-based image synthesis, Tight Inversion consolidates a number of burdensome ancillary approaches into a unified method for AI-based image editing.
Although the tension between editability and fidelity is not eliminated by this method, it is notably reduced, according to the results presented. Considering that the central challenge this work addresses may ultimately prove intractable if handled on its own terms (rather than looking beyond LDM-based architectures in future systems), Tight Inversion represents a welcome incremental improvement in the state-of-the-art.