Optimizing Stable Diffusion for Intel CPUs with NNCF and 🤗 Optimum

Latent Diffusion models are game changers for text-to-image generation. Stable Diffusion is probably the most famous example, with wide adoption both locally and in industry. The idea behind the Stable Diffusion model is simple and compelling: you generate an image from a noise vector in multiple small steps, progressively refining the noise into a latent image representation. This approach works well, but it can take a long time to generate an image if you do not have access to powerful GPUs.

Over the past five years, the OpenVINO Toolkit has accumulated many features for high-performance inference. Initially designed for Computer Vision models, it still dominates in this domain, showing best-in-class inference performance for many contemporary models, including Stable Diffusion. However, optimizing Stable Diffusion models for resource-constrained applications requires going far beyond just runtime optimizations. And this is where the model optimization capabilities of the OpenVINO Neural Network Compression Framework (NNCF) come into play.

In this blog post, we outline the challenges of optimizing Stable Diffusion models and propose a workflow that substantially reduces the latency of such models when running on resource-constrained hardware such as CPUs. In particular, we achieved a 5.1x inference acceleration and a 4x model footprint reduction compared to PyTorch.



Stable Diffusion optimization

In the Stable Diffusion pipeline, the UNet model is computationally the most expensive to run. Thus, optimizing just this one model brings substantial benefits in terms of inference speed.

However, it turns out that traditional model optimization methods, such as post-training 8-bit quantization, do not work for this model. There are two main reasons for that. First, pixel-level prediction models, such as semantic segmentation, super-resolution, etc., are among the most complicated in terms of model optimization due to the complexity of the task, so tweaking model parameters and structure breaks the results in numerous ways. The second reason is that the model has a lower level of redundancy because it encodes a lot of information while being trained on hundreds of millions of samples. That is why researchers have to employ more sophisticated quantization methods to preserve accuracy after optimization. For example, Qualcomm used the layer-wise Knowledge Distillation method (AdaRound) to quantize Stable Diffusion models. This means that model tuning after quantization is required anyway. If so, why not just use Quantization-Aware Training (QAT), which can tune the model and the quantization parameters simultaneously, in the same way the source model is trained? Thus, we tried this approach in our work using NNCF, OpenVINO, and Diffusers and paired it with Token Merging.
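
To make the QAT step more concrete, here is a minimal sketch of wrapping a PyTorch model with NNCF before fine-tuning. The input shape, the `unet` object, and the `calibration_dataloader` are placeholders rather than our exact configuration.

from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

# Illustrative NNCF config: 8-bit quantization. The input shape below is a
# placeholder; the real UNet takes several inputs (latents, timestep, text
# embeddings), so the actual "input_info" section is more involved.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 4, 64, 64]},
    "compression": {"algorithm": "quantization"},
})

# A small calibration dataloader initializes the quantizer ranges.
nncf_config = register_default_init_args(nncf_config, calibration_dataloader)

# Insert fake-quantize operations; the returned controller exposes statistics
# and export helpers, and the wrapped model is then fine-tuned as usual (QAT).
compression_ctrl, unet = create_compressed_model(unet, nncf_config)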



Optimization workflow

We usually start the optimization of a model after it is trained. Here, we start from a model fine-tuned on the Pokemons dataset, which contains images of Pokemons and their text descriptions.

We used the text-to-image fine-tuning example for Stable Diffusion from Diffusers and integrated QAT from NNCF into the training script. We also modified the loss function to include knowledge distillation from the source model, which acts as a teacher in this process, while the actual model being trained acts as a student. This approach is different from the classical knowledge distillation method, where the trained teacher model is distilled into a smaller student model. In our case, knowledge distillation is used as an auxiliary method that helps improve the final accuracy of the optimized model. We also use the Exponential Moving Average (EMA) method for the model parameters, excluding the quantizers, which allows us to make the training process more stable. We tune the model for only 4096 iterations.
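
As a rough illustration of the modified objective (variable names and the weighting factor are placeholders, not the values from our script), the student's noise prediction is matched both to the usual diffusion target and to the prediction of a frozen copy of the source UNet:

import copy
import torch
import torch.nn.functional as F

# Frozen copy of the source UNet acting as the teacher.
teacher_unet = copy.deepcopy(unet).eval().requires_grad_(False)

def training_loss(noisy_latents, timesteps, encoder_hidden_states, target, kd_weight=1.0):
    # Standard diffusion objective: the student predicts the added noise.
    student_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    diffusion_loss = F.mse_loss(student_pred, target)

    # Auxiliary knowledge-distillation term: stay close to the teacher's prediction.
    with torch.no_grad():
        teacher_pred = teacher_unet(noisy_latents, timesteps, encoder_hidden_states).sample
    kd_loss = F.mse_loss(student_pred, teacher_pred)

    return diffusion_loss + kd_weight * kd_loss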

With some tricks, such as gradient checkpointing and keeping the EMA model in RAM instead of VRAM, we can run the optimization process on a single GPU with 24 GB of VRAM. The whole optimization takes less than a day on one GPU!
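
For illustration, here is a minimal sketch of these two tricks using diffusers utilities; the actual integration in our script, including the exclusion of quantizer parameters from the EMA, is more involved.

from diffusers.training_utils import EMAModel

# Recompute activations during the backward pass instead of storing them.
unet.enable_gradient_checkpointing()

# Keep the EMA copy of the weights in CPU RAM rather than in VRAM; it only
# needs to be moved to the training device when it is updated or saved.
ema_unet = EMAModel(unet.parameters())
ema_unet.to("cpu")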



Going beyond Quantization-Aware Training

Quantization alone can bring significant improvements by reducing model footprint, load time, memory consumption, and inference latency. But the beauty of quantization is that it can be applied together with other optimization methods, leading to a cumulative speedup.

Recently, Facebook Research introduced a Token Merging method for Vision Transformer models. The essence of the method is that it merges redundant tokens with important ones using one of the available strategies (averaging, taking max values, etc.). This is done before the self-attention block, which is the most computationally demanding part of Transformer models. Therefore, reducing the token dimension reduces the overall computation time in the self-attention blocks. This method has also been adapted for Stable Diffusion models and has shown promising results when optimizing Stable Diffusion pipelines for high-resolution image synthesis running on GPUs.
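
For reference, the original Token Merging for Stable Diffusion is available as the tomesd package and can be patched into a diffusers pipeline in one line. Our OpenVINO-compatible variant differs, so treat this as an illustration of the technique rather than our exact code; the model identifier is just an example.

import tomesd
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Merge a fraction of the tokens before the self-attention blocks; higher
# ratios are faster but may reduce image quality.
tomesd.apply_patch(pipe, ratio=0.4)

image = pipe("cartoon bird").images[0]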

We modified the Token Merging method to be compliant with OpenVINO and stacked it with 8-bit quantization when applied to the Attention UNet model. This also involves all of the techniques mentioned above, including Knowledge Distillation. As with quantization, fine-tuning is required to restore the accuracy. We again start optimization and fine-tuning from the model trained on the Pokemons dataset. The figure below shows the overall optimization workflow.

Figure: overview of the optimization workflow.
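
Once QAT fine-tuning is complete, the quantized UNet can be exported for OpenVINO. A rough sketch, assuming the `compression_ctrl` object returned by NNCF in the earlier sketch and the OpenVINO 2022.x conversion API; file names are arbitrary:

from openvino.tools import mo
from openvino.runtime import serialize

# Export the QAT-trained UNet, with its fake-quantize nodes, to ONNX.
compression_ctrl.export_model("unet_int8.onnx")

# Convert the ONNX model to OpenVINO IR; the quantized operations are preserved.
ov_model = mo.convert_model("unet_int8.onnx")
serialize(ov_model, "unet_int8.xml")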

The resulting model is very useful when running inference on devices with limited computational resources, such as client or edge CPUs. As mentioned, stacking Token Merging with quantization leads to an additional reduction in inference latency.

PyTorch FP32, inference speed: 230.5 seconds, memory footprint: 3.44 GB
OpenVINO FP32, inference speed: 120 seconds (1.9x), memory footprint: 3.44 GB
OpenVINO 8-bit, inference speed: 59 seconds (3.9x), memory footprint: 0.86 GB (0.25x)
ToMe + OpenVINO 8-bit, inference speed: 44.6 seconds (5.1x), memory footprint: 0.86 GB (0.25x)

Results of the image generation demo using different optimized models. The input prompt is “cartoon bird”, the seed is 42. The models were run with OpenVINO 2022.3 in Hugging Face Spaces on a “CPU upgrade” instance, which uses 3rd Generation Intel® Xeon® Scalable Processors with Intel® Deep Learning Boost technology.



Results

We used the optimization workflow described above to get two types of optimized models, 8-bit quantized and quantized with Token Merging, and compared them to the PyTorch baseline. We also converted the baseline to a vanilla OpenVINO floating-point (FP32) model for a comprehensive comparison.

The results above show the generated images and some model characteristics. As you can see, simply converting to OpenVINO brings a significant decrease in inference latency (1.9x). Applying 8-bit quantization boosts inference speed further, leading to a 3.9x speedup compared to PyTorch. Another benefit of quantization is a significant reduction of the model footprint, 0.25x of the PyTorch checkpoint, which also improves the model load time. Applying Token Merging (ToMe) with a merging ratio of 0.4 on top of quantization brings a 5.1x performance speedup while keeping the footprint at the same level. We do not provide a thorough analysis of the visual quality of the optimized models, but, as you can see, the results are quite solid.

For the results shown in this blog, we used the default 50 inference steps. With fewer inference steps, inference is faster, but this affects the quality of the resulting image. How large this effect is depends on the model and the scheduler. We recommend experimenting with different numbers of steps and schedulers to find what works best for your use case.

Below we show how to perform inference with the final pipeline optimized to run on Intel CPUs:

from optimum.intel import OVStableDiffusionPipeline

# Load the quantized pipeline, postponing compilation until the shapes are fixed.
name = "OpenVINO/stable-diffusion-pokemons-tome-quantized-aggressive"
pipe = OVStableDiffusionPipeline.from_pretrained(name, compile=False)

# Statically reshape to the target batch size and resolution, then compile once.
pipe.reshape(batch_size=1, height=512, width=512, num_images_per_prompt=1)
pipe.compile()

# Generate an image with the default 50 denoising steps.
prompt = "a drawing of a green pokemon with red eyes"
output = pipe(prompt, num_inference_steps=50, output_type="pil").images[0]
output.save("image.png")
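
As noted above, the number of inference steps is a direct speed/quality trade-off; a quick way to compare a few settings with the same pipeline (the step counts here are just examples):

# Fewer steps run faster but may degrade image quality; compare a few values.
for steps in (20, 30, 50):
    image = pipe(prompt, num_inference_steps=steps, output_type="pil").images[0]
    image.save(f"image_{steps}_steps.png")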

You can find the training and quantization code in the Hugging Face Optimum Intel library. The notebook that demonstrates the difference between the optimized and original models is available here. You can also find many models on the Hugging Face Hub under the OpenVINO organization. In addition, we have created a demo on Hugging Face Spaces that runs on a 3rd Generation Intel Xeon Scalable processor.



What about the general-purpose Stable Diffusion model?

As we showed with the Pokemon image generation task, it is possible to achieve a high level of optimization of the Stable Diffusion pipeline with a relatively small amount of training resources. At the same time, it is well known that training a general-purpose Stable Diffusion model is an expensive task. However, with enough budget and hardware resources, it is possible to optimize the general-purpose model using the described approach and tune it to produce high-quality images. The only caveat is related to the Token Merging method, which substantially reduces the model capacity. The rule of thumb here is that the more complicated the training dataset, the lower the merging ratio you should use during optimization.

If you enjoyed reading this post, you might also be interested in checking out this post, which discusses other complementary approaches to optimize the performance of Stable Diffusion on 4th generation Intel Xeon CPUs.


