Faster Stable Diffusion with Core ML on iPhone, iPad, and Mac

By Pedro Cuenca
WWDC'23 (Apple Worldwide Developers Conference) was held last week. A lot of the news focused on the Vision Pro announcement during the keynote, but there's much more to it. Like every year, WWDC week is packed with more than 200 technical sessions that dive deep into the upcoming features across Apple operating systems and frameworks. This year we're particularly excited about changes in Core ML devoted to compression and optimization techniques. These changes make running models such as Stable Diffusion faster and with less memory use! As a taste, consider the following test I ran on my iPhone 13 back in December, compared with the current speed using 6-bit palettization:
Stable Diffusion on iPhone, back in December and now using 6-bit palettization






Latest Core ML Optimizations

Core ML is a mature framework that allows machine learning models to run efficiently on-device, taking advantage of all the compute hardware in Apple devices: the CPU, the GPU, and the Neural Engine specialized in ML tasks. On-device execution is going through a period of extraordinary interest triggered by the popularity of models such as Stable Diffusion and Large Language Models with chat interfaces. Many people want to run these models on their own hardware for a variety of reasons, including convenience, privacy, and API cost savings. Naturally, many developers are exploring ways to run these models efficiently on-device and creating new apps and use cases. Core ML improvements that contribute to achieving that goal are big news for the community!

The Core ML optimization changes encompass two different (but complementary) software packages:

  • The Core ML framework itself. This is the engine that runs ML models on Apple hardware and is part of the operating system. Models must be exported in a special format supported by the framework, and this format is also known as "Core ML".
  • The coremltools conversion package. This is an open-source Python module whose mission is to convert PyTorch or TensorFlow models to the Core ML format.

coremltools now includes a new submodule called coremltools.optimize with all the compression and optimization tools. For full details on this package, please take a look at this WWDC session. In the case of Stable Diffusion, we'll be using 6-bit palettization, a type of quantization that compresses model weights from a 16-bit floating-point representation to just 6 bits per parameter. The name "palettization" refers to a technique similar to the one used in computer graphics to work with a limited set of colors: the color table (or "palette") contains a fixed number of colors, and the colors in the image are replaced with the indexes of the closest colors available in the palette. This immediately provides the benefit of drastically reducing storage size, and thus reducing download time and on-device disk use.
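
To make the idea concrete, here is a minimal, hypothetical sketch of post-training palettization of a single weight matrix with k-means, using numpy and scikit-learn. It is only an illustration of the concept, not how coremltools implements it, and real formats pack the 6-bit indices instead of storing one byte per index.

# Conceptual sketch of 6-bit palettization (not the coremltools implementation).
import numpy as np
from sklearn.cluster import KMeans

def palettize(weights: np.ndarray, nbits: int = 6):
    """Cluster weights into 2**nbits centroids and return (palette, indices, shape)."""
    n_colors = 2 ** nbits                                   # 64 palette entries for 6 bits
    flat = weights.reshape(-1, 1).astype(np.float32)
    kmeans = KMeans(n_clusters=n_colors, n_init=4, random_state=0).fit(flat)
    palette = kmeans.cluster_centers_.reshape(-1)           # the lookup table ("palette")
    indices = kmeans.labels_.astype(np.uint8)               # index of the closest palette entry
    return palette, indices, weights.shape

def depalettize(palette, indices, shape):
    """Reconstruct approximate weights by looking up each index in the palette."""
    return palette[indices].reshape(shape)

w = np.random.randn(320, 320).astype(np.float32)
palette, idx, shape = palettize(w)
w_hat = depalettize(palette, idx, shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())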

Illustration of 2-bit palettization. Image credit: Apple WWDC'23 session "Use Core ML Tools for machine learning model compression".

The compressed 6-bit weights can't be used for computation, because they are just indices into a table and no longer represent the magnitude of the original weights. Therefore, Core ML needs to decompress the palettized weights before use. In previous versions of Core ML, decompression took place when the model was first loaded from disk, so the amount of memory used was equal to the uncompressed model size. With the new improvements, weights are kept as 6-bit numbers and converted on the fly as inference progresses from layer to layer. This might seem slow (an inference run requires a lot of decompression operations), but it's typically more efficient than preparing all the weights in 16-bit mode! The reason is that memory transfers are in the critical path of execution, and transferring less memory is faster than transferring uncompressed data.
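
As a minimal sketch of that behavior, reusing the hypothetical (palette, indices, shape) tuples from the example above: on-the-fly decompression only materializes one layer's float weights at a time, instead of the whole model.

# Sketch: decompress palettized weights layer by layer during inference,
# so peak memory holds one layer's float weights instead of the whole model.
# `layers` is a hypothetical list of (palette, indices, shape) tuples.
def run_inference(layers, x):
    for palette, indices, shape in layers:
        w = palette[indices].reshape(shape)   # decompress only this layer's weights...
        x = x @ w                             # ...use them for the layer's computation...
        del w                                 # ...and release them before moving on
    return x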



Using Quantized and Optimized Stable Diffusion Models

Last December, Apple introduced ml-stable-diffusion, an open-source repo based on diffusers to easily convert Stable Diffusion models to Core ML. It also applies optimizations to the transformers attention layers that make inference faster on the Neural Engine (on devices where it's available). ml-stable-diffusion has just been updated after WWDC with the following:

  • Quantization is supported using --quantize-nbits during conversion. You can quantize to 8, 6, 4, or even 2 bits! For best results, we recommend using 6-bit quantization, as the precision loss is small while achieving fast inference and significant memory savings. If you want to go lower than that, please check this section for advanced techniques.
  • Additional optimizations of the attention layers that achieve even better performance on the Neural Engine! The trick is to split the query sequences into chunks of 512 to avoid the creation of large intermediate tensors. This method is called SPLIT_EINSUM_V2 in the code and can improve performance by 10% to 30%. A conceptual sketch of the chunking idea follows this list.
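
Here is a small numpy sketch of the general idea of processing queries in chunks; the function, shapes, and chunk size handling are hypothetical, and this is not the actual SPLIT_EINSUM_V2 implementation.

# Conceptual sketch: attention computed over query chunks of 512 so the score
# matrix stays (chunk, seq_len) instead of materializing (seq_len, seq_len).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunked_attention(q, k, v, chunk=512):
    """q, k, v: arrays of shape (seq_len, dim)."""
    scale = q.shape[-1] ** -0.5
    outputs = []
    for start in range(0, q.shape[0], chunk):
        q_chunk = q[start:start + chunk]
        scores = softmax(q_chunk @ k.T * scale)   # small (chunk, seq_len) intermediate
        outputs.append(scores @ v)
    return np.concatenate(outputs, axis=0)

q = k = v = np.random.randn(4096, 64).astype(np.float32)
print(chunked_attention(q, k, v).shape)   # (4096, 64)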

In order to make it easy for everyone to take advantage of these improvements, we converted the four official Stable Diffusion models and pushed them to the Hub, each in several variants.

In order to use 6-bit models, you need the developer versions of iOS/iPadOS 17 or macOS 14 (Sonoma), because those are the ones that contain the latest Core ML framework. You can download them from the Apple developer site if you are a registered developer, or you can sign up for the public beta that will be released in a few weeks.

Note that each variant is available in Core ML format and also as a zip archive. Zip files are ideal for native apps, such as our open-source demo app and other third-party tools. If you just want to run the models on your own hardware, the easiest way is to use our demo app and select the quantized model you want to test. You need to compile the app using Xcode, but an update will be available for download in the App Store soon. For more details, check our previous post.

Running the 6-bit stable-diffusion-2-1-base model in the demo app

If you want to download a particular Core ML package to integrate it into your own Xcode project, you can clone the repos or just download the version of interest using code like the following.

from huggingface_hub import snapshot_download
from pathlib import Path

repo_id = "apple/coreml-stable-diffusion-2-1-base-palettized"
variant = "original/packages"

model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False)
print(f"Model downloaded at {model_path}")
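
If you want a quick sanity check from Python, you can load the downloaded packages with coremltools on a Mac (6-bit packages require the OS versions noted above). This optional snippet continues from the one above, reusing the model_path and variant variables; the exact .mlpackage file names depend on the variant you downloaded.

import coremltools as ct

# Optional check: load each downloaded .mlpackage and print its input description.
for package in sorted(model_path.glob(f"{variant}/*.mlpackage")):
    mlmodel = ct.models.MLModel(str(package), compute_units=ct.ComputeUnit.CPU_AND_GPU)
    print(package.name, "->", mlmodel.get_spec().description.input)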



Converting and Optimizing Custom Models

If you want to use a custom Stable Diffusion model (for example, if you have fine-tuned or dreamboothed your own models), you can use Apple's ml-stable-diffusion repo to do the conversion yourself. This is a brief summary of how you'd go about it, but we recommend you read the documentation details.

If you want to apply quantization, you need the latest versions of coremltools, apple/ml-stable-diffusion, and Xcode in order to do the conversion.

  1. Select the model you want to convert. You can train your own or choose one from the Hugging Face Diffusers Models Gallery. For example, let's convert prompthero/openjourney-v4.
  2. Install apple/ml-stable-diffusion and run a first conversion using the ORIGINAL attention implementation like this:
python -m python_coreml_stable_diffusion.torch2coreml \
    --model-version prompthero/openjourney-v4 \
    --convert-unet \
    --convert-text-encoder \
    --convert-vae-decoder \
    --convert-vae-encoder \
    --convert-safety-checker \
    --quantize-nbits 6 \
    --attention-implementation ORIGINAL \
    --compute-unit CPU_AND_GPU \
    --bundle-resources-for-swift-cli \
    --check-output-correctness \
    -o models/original/openjourney-6-bit

  • Use --convert-vae-encoder if you want to use image-to-image tasks.
  • Do not use --chunk-unet with --quantize-nbits 6 (or less), as the quantized model is small enough to work fine on both iOS and macOS.

  3. Repeat the conversion for the SPLIT_EINSUM_V2 attention implementation:
python -m python_coreml_stable_diffusion.torch2coreml \
    --model-version prompthero/openjourney-v4 \
    --convert-unet \
    --convert-text-encoder \
    --convert-vae-decoder \
    --convert-safety-checker \
    --quantize-nbits 6 \
    --attention-implementation SPLIT_EINSUM_V2 \
    --compute-unit ALL \
    --bundle-resources-for-swift-cli \
    --check-output-correctness \
    -o models/split_einsum_v2/openjourney-6-bit

  4. Test the converted models on the desired hardware. As a rule of thumb, the ORIGINAL version often works better on macOS, whereas SPLIT_EINSUM_V2 is usually faster on iOS. For more details and additional data points, see these tests contributed by the community on the previous version of Stable Diffusion for Core ML.

  5. To integrate the desired model into your own app:

    • If you are going to distribute the model inside the app, use the .mlpackage files. Note that this will increase the size of your app binary.
    • Otherwise, you can use the compiled Resources to download them dynamically when your app starts.

If you don't use the --quantize-nbits option, weights will be represented as 16-bit floats. This is compatible with the current version of Core ML, so you won't need to install the betas of iOS, macOS, or Xcode.



Using Less than 6 bits

6-bit quantization is a sweet spot between model quality, model size, and convenience: you just need to provide a conversion option in order to be able to quantize any pre-trained model. This is an example of post-training compression.

The beta version of coremltools released last week also includes training-time compression methods. The idea here is that you can fine-tune a pre-trained Stable Diffusion model and perform the weight compression while fine-tuning takes place. This allows you to use 4- or even 2-bit compression while minimizing the loss in quality. The reason this works is that weight clustering is performed using a differentiable algorithm, and therefore we can apply the usual training optimizers to find the quantization table while minimizing model loss.
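
The general flavor of differentiable weight clustering can be sketched in PyTorch with a learnable lookup table and a straight-through estimator for the hard assignment. This is a toy illustration under those assumptions, not coremltools' actual algorithm; the class, initialization, and layer shapes are hypothetical.

# Toy sketch of training-time palettization: a learnable palette plus a
# straight-through estimator, so gradients reach both the palette and the weights.
import torch

class PalettizedLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, nbits=2):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # The palette (lookup table) is itself trainable. It is initialized naively here;
        # real methods typically initialize it from the weights (e.g., with k-means).
        self.palette = torch.nn.Parameter(torch.linspace(-0.1, 0.1, 2 ** nbits))

    def forward(self, x):
        # Hard assignment: snap each weight to its nearest palette entry.
        dists = (self.weight.unsqueeze(-1) - self.palette) ** 2
        idx = dists.argmin(dim=-1)
        w_q = self.palette[idx]                               # differentiable w.r.t. the palette
        # Straight-through estimator: the forward pass uses the quantized values,
        # while gradients also flow to the underlying float weights.
        w = w_q + (self.weight - self.weight.detach())
        return torch.nn.functional.linear(x, w)

layer = PalettizedLinear(64, 64, nbits=2)
layer(torch.randn(8, 64)).sum().backward()
print(layer.weight.grad is not None, layer.palette.grad is not None)   # True True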

We have plans to evaluate this process soon, and can't wait to see how 4-bit optimized models work and how fast they run. If you beat us to it, please drop us a note and we'll be happy to check 🙂



Conclusion

Quantization methods can be used to reduce the size of Stable Diffusion models, make them run faster on-device, and consume fewer resources. The latest versions of Core ML and coremltools support techniques like 6-bit palettization that are easy to apply and that have a minimal impact on quality. We have added 6-bit palettized models to the Hub, which are small enough to run on both iOS and macOS. We have also shown how you can convert fine-tuned models yourself, and can't wait to see what you do with these tools and techniques!


