Disney’s Research arm is offering a new approach to compressing images, leveraging the open source Stable Diffusion V2.1 model to produce more realistic images at lower bitrates than competing methods.
Source: https://studios.disneyresearch.com/app/uploads/2024/09/Lossy-Image-Compression-with-Foundation-Diffusion-Models-Paper.pdf
The new approach (defined as a ‘codec’ despite its increased complexity compared to traditional codecs such as JPEG and AV1) can operate over any Latent Diffusion Model (LDM). In quantitative tests, it outperforms former methods in terms of accuracy and detail, and requires significantly less training and compute cost.
The key insight of the new work is that quantization error (a central process in all image compression) is analogous to noise (a central process in diffusion models).
Therefore a ‘traditionally’ quantized image can be treated as a noisy version of the original image, and used in an LDM’s denoising process instead of random noise, in order to reconstruct the image at a target bitrate.
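To make the analogy concrete, the minimal Python sketch below (not taken from the paper; the uniform quantizer and 0.5 step size are illustrative assumptions) shows that the error introduced by quantizing a latent behaves like zero-mean additive noise, which is why the quantized latent can be handed to a denoiser rather than starting from pure noise:

```python
# Minimal sketch of the stated analogy, not the paper's code.
import torch

def quantize(latent: torch.Tensor, step: float) -> torch.Tensor:
    """Uniform scalar quantization of a latent tensor."""
    return torch.round(latent / step) * step

z = torch.randn(4, 64, 64)        # a stand-in latent
z_hat = quantize(z, step=0.5)     # step size chosen only for illustration
err = z_hat - z

# Quantization error is roughly zero-mean with variance ~ step^2 / 12,
# i.e. it looks like mild additive noise on the latent.
print(err.mean().item(), err.var().item(), 0.5 ** 2 / 12)
```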

The authors contend:
However, in common with other projects that seek to exploit the compression capabilities of diffusion models, the output may hallucinate details. In contrast, lossy methods such as JPEG will produce clearly distorted or over-smoothed areas of detail, which can be recognized as compression limitations by the casual viewer.
Instead, Disney’s codec may introduce detail, inferred from context, that was not there in the source image, due to the coarse nature of the Variational Autoencoder (VAE) used in typical models trained on hyperscale data.
While this has some implications for artistic depictions and the verisimilitude of casual photographs, it could have a more critical impact in cases where small details constitute essential information, such as evidence for court cases, data for facial recognition, scans for Optical Character Recognition (OCR), and a wide selection of other possible use cases, in the eventuality of the popularization of a codec with this capability.
At this nascent stage of AI-enhanced image compression, all of these possible scenarios are far in the future. However, image storage is a hyperscale global challenge, touching on issues around data storage, streaming, and electricity consumption, among other concerns. Therefore AI-based compression could offer a tempting trade-off between accuracy and logistics. History shows that the best codecs do not always win the widest user base, when issues such as licensing and market capture by proprietary formats are factors in adoption.
Disney has been experimenting with machine learning as a compression method for a long time. In 2020, one of the researchers on the new paper was involved in a VAE-based project for improved video compression.
The new Disney paper was updated in early October. Today the company released an accompanying YouTube video. The project is titled Lossy Image Compression with Foundation Diffusion Models, and comes from four researchers at ETH Zürich (affiliated with Disney’s AI-based projects) and Disney Research. The researchers also offer a supplementary paper.
Method
The new method uses a VAE to encode an image into its compressed latent representation. At this stage the input image exists as derived features – low-level vector-based representations. The latent embedding is then quantized into a bitstream, before eventually being decoded back into pixel space.
This quantized image is then used as a template for the noise that typically seeds a diffusion-based image, with a variable number of denoising steps (where there is usually a trade-off between more denoising steps and greater accuracy, versus lower latency and greater efficiency).
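A rough sketch of the overall pipeline follows, under stated assumptions: `vae`, `denoiser` and `scheduler` are placeholders for a real latent-diffusion stack, not the authors’ implementation, and entropy coding of the bitstream is omitted:

```python
# Conceptual pipeline sketch; all components are placeholders.
import torch

def diffusion_codec(image, vae, denoiser, scheduler, q_step: float, n_steps: int):
    # 1) Encode the image into the VAE's compressed latent space.
    z = vae.encode(image)
    # 2) Quantize the latent (a real codec would entropy-code this to a bitstream).
    z_hat = torch.round(z / q_step) * q_step
    # 3) Treat the quantized latent as a partially noised sample and run only
    #    the final `n_steps` of the reverse diffusion process on it.
    x = z_hat
    for t in scheduler.timesteps[-n_steps:]:
        predicted_noise = denoiser(x, t)
        x = scheduler.step(predicted_noise, t, x)
    # 4) Decode the refined latent back into pixel space.
    return vae.decode(x)
```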

Both the quantization parameters and the total number of denoising steps can be controlled under the new system, through the training of a neural network that predicts the variables relevant to these aspects of encoding. This process is known as entropy coding, and the Disney system uses the Entroformer framework as the entropy model that powers the procedure.
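By way of illustration only, the toy module below predicts a quantization step size and a denoising-step count from latent statistics; it is an assumption made for explanatory purposes and bears no relation to the Entroformer’s actual architecture:

```python
# Toy parameter predictor; an illustrative assumption, not the paper's network.
import torch
import torch.nn as nn

class CodingParamPredictor(nn.Module):
    def __init__(self, latent_channels: int = 4, max_steps: int = 50):
        super().__init__()
        self.max_steps = max_steps
        self.net = nn.Sequential(
            nn.Linear(latent_channels * 2, 64),
            nn.ReLU(),
            nn.Linear(64, 2),  # -> (raw quantization step, raw step fraction)
        )

    def forward(self, latent: torch.Tensor):
        # Summarize the latent with per-channel mean and standard deviation.
        stats = torch.cat([latent.mean(dim=(2, 3)), latent.std(dim=(2, 3))], dim=1)
        raw = self.net(stats)
        q_step = nn.functional.softplus(raw[:, 0]) + 1e-3            # positive step size
        n_steps = (torch.sigmoid(raw[:, 1]) * self.max_steps).round()  # bounded step count
        return q_step, n_steps

predictor = CodingParamPredictor()
q, n = predictor(torch.randn(1, 4, 32, 32))
print(q.item(), int(n.item()))
```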
The authors state:
Stable Diffusion V2.1 is the diffusion backbone for the system, chosen because the entirety of the code and the base weights are publicly available. However, the authors emphasize that their schema is applicable to a wider variety of models.
Pivotal to the economics of the method is timestep prediction, which evaluates the optimal number of denoising steps – a balancing act between efficiency and performance.

The amount of noise in the latent embedding must be considered when predicting the optimal number of denoising steps.
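A hedged sketch of how such a prediction might work in practice: the linear alpha-bar schedule and the helper names below are assumptions for illustration, not the paper’s timestep-prediction module:

```python
# Illustrative mapping from quantization noise to a denoising-step budget.
import torch

def estimate_noise_std(latent: torch.Tensor, quantized: torch.Tensor) -> float:
    """Standard deviation of the quantization error, treated as injected noise."""
    return (latent - quantized).std().item()

def steps_to_run(noise_std: float, alphas_cumprod: torch.Tensor, total_steps: int = 50) -> int:
    """Map a noise level to a timestep, then to a share of the sampling steps:
    more noise -> start 'deeper' in the schedule -> more denoising steps."""
    noise_var = (1.0 - alphas_cumprod) / alphas_cumprod
    t = int(torch.argmin((noise_var - noise_std ** 2).abs()))
    return max(1, round(total_steps * t / (len(alphas_cumprod) - 1)))

alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)  # assumed stand-in schedule
z = torch.randn(4, 32, 32)
z_hat = torch.round(z / 0.5) * 0.5
print(steps_to_run(estimate_noise_std(z, z_hat), alphas_cumprod))
```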
Data and Tests
The model was trained on the Vimeo-90k dataset. The images were randomly cropped to 256x256px for each epoch (i.e., each complete ingestion of the refined dataset by the model training architecture).
The model was optimized for 300,000 steps at a learning rate of 1e-4. This is the most common value among computer vision projects, and also the lowest and most fine-grained generally practicable value, representing a compromise between broad generalization of the dataset’s concepts and traits, and a capacity for the reproduction of fine detail.
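These reported settings translate into a fairly conventional training loop; in the sketch below, `model`, its rate-distortion loss, the batch size, and the dataset wrapper are placeholders rather than the authors’ code:

```python
# Training-loop sketch matching the reported hyperparameters (300k steps, lr 1e-4).
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# Random 256x256 crops, applied by the (assumed) dataset wrapper each epoch.
crop = transforms.Compose([
    transforms.RandomCrop(256),
    transforms.ToTensor(),
])

def train(model, dataset, total_steps: int = 300_000, lr: float = 1e-4):
    loader = DataLoader(dataset, batch_size=8, shuffle=True)   # batch size is an assumption
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < total_steps:
        for images in loader:
            loss = model(images)        # assumed: model returns its training loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= total_steps:
                return
```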
The authors comment on some of the logistical considerations for an economical yet effective system:
Datasets used for testing the system were Kodak, CLIC2022, and COCO 30k. The data was pre-processed in accordance with the methodology outlined in the 2023 Google offering.
Metrics used were Peak Signal-to-Noise Ratio (PSNR); Learned Perceptual Image Patch Similarity (LPIPS); Multi-Scale Structural Similarity Index (MS-SSIM); and Fréchet Inception Distance (FID).
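Of these, PSNR is simple enough to compute directly, as in the hedged sketch below; LPIPS, MS-SSIM, and FID depend on learned features or Inception statistics and are usually taken from dedicated packages:

```python
# Direct PSNR computation; tensor shapes and values are illustrative.
import torch

def psnr(reference: torch.Tensor, reconstruction: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((reference - reconstruction) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

ref = torch.rand(1, 3, 256, 256)
rec = (ref + 0.01 * torch.randn_like(ref)).clamp(0, 1)
print(f"PSNR: {psnr(ref, rec):.2f} dB")
```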
Rival prior frameworks tested were divided between older systems that used Generative Adversarial Networks (GANs), and newer offerings based around diffusion models. The GAN systems tested were High-Fidelity Generative Image Compression (HiFiC) and ILLM (which offers some improvements on HiFiC).
The diffusion-based systems tested were CDC and HFD.

For the quantitative results (visualized above), the researchers state:
For the user study, a two-alternative-forced-choice (2AFC) method was used, in a tournament context where the favored images would go on to later rounds. The study used the Elo rating system originally developed for chess tournaments.
Therefore, participants would view and select the better of two presented 512x512px images from across the various generative methods. An additional experiment was undertaken in which all image comparisons from the same user were evaluated, via a Monte Carlo simulation over 10,000 iterations, with the median score presented in the results.
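For reference, the Elo bookkeeping behind such a 2AFC study reduces to a simple pairwise update; the K-factor and starting rating below are conventional defaults, not values taken from the paper:

```python
# Standard Elo update applied to one pairwise (2AFC) preference.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

ratings = {"ours": 1000.0, "CDC": 1000.0}
# One simulated judgement: the participant preferred "ours" over "CDC".
ratings["ours"], ratings["CDC"] = elo_update(ratings["ours"], ratings["CDC"], a_won=True)
print(ratings)
```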

Here the authors comment:
In the original paper, as well as the supplementary PDF, the authors provide further visual comparisons, one of which is shown earlier in this article. However, due to the granularity of difference between the samples, we refer the reader to the source PDF, so that these results can be judged fairly.
The paper concludes by noting that its proposed method operates twice as fast as the rival CDC (3.49 vs 6.87 seconds, respectively). It also observes that ILLM can process an image within 0.27 seconds, but that this system requires burdensome training.
Conclusion
The ETH/Disney researchers are clear, in the paper’s conclusion, about the potential of their system to generate false detail. However, none of the samples offered in the material dwell on this issue.
In all fairness, this problem is not limited to the new Disney approach, but is an inevitable collateral effect of using diffusion models – an inventive and interpretive architecture – to compress imagery.
Interestingly, only five days ago two other researchers from ETH Zurich produced a paper examining the possibility of an ‘optimal level of hallucination’ in AI-based compression systems.
The authors there make a case for the desirability of hallucinations where the domain is generic (and, arguably, ‘harmless’) enough:
Thus this second paper makes a case for compression to be optimally ‘creative’ and representative, rather than recreating as accurately as possible the core traits and lineaments of the original non-compressed image.
One wonders what the photographic and artistic community would make of this fairly radical redefinition of ‘compression’.