Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive



SD Turbo and SDXL Turbo are two fast generative text-to-image models capable of producing viable images in as little as one step, a big improvement over the 30+ steps often required with previous Stable Diffusion models. SD Turbo is a distilled version of Stable Diffusion 2.1, and SDXL Turbo is a distilled version of SDXL 1.0. We have previously shown how to speed up Stable Diffusion inference with ONNX Runtime. Not only does ONNX Runtime provide performance advantages when used with SD Turbo and SDXL Turbo, but it also makes the models accessible in languages other than Python, such as C# and Java.



Performance gains

In this post, we introduce optimizations in the ONNX Runtime CUDA and TensorRT execution providers that significantly speed up inference of SD Turbo and SDXL Turbo on NVIDIA GPUs.

ONNX Runtime outperformed PyTorch for all (batch size, number of steps) combinations tested, with throughput gains as high as 229% for the SDXL Turbo model and 120% for the SD Turbo model. ONNX Runtime CUDA has particularly good performance for dynamic shape but demonstrates a marked improvement over PyTorch for static shape as well.



How to run SD Turbo and SDXL Turbo

To speed up inference with the ONNX Runtime CUDA execution provider, access our optimized versions of SD Turbo and SDXL Turbo on Hugging Face.

The models are generated by Olive, an easy-to-use model optimization tool that is hardware aware. Note that fp16 VAE should be enabled through the command line for best performance, as shown in the optimized versions shared. For instructions on how to run the SD and SDXL pipelines with the ONNX files hosted on Hugging Face, see the SD Turbo usage example and the SDXL Turbo usage example.
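If you prefer a Python API over the demo scripts, Olive-optimized ONNX models can also be loaded through Hugging Face Optimum's ONNX Runtime pipelines. The snippet below is a minimal sketch rather than the exact pipeline used for the benchmarks in this post; the model path is a placeholder for a local or hosted ONNX copy of SD Turbo.

from optimum.onnxruntime import ORTStableDiffusionPipeline

# Placeholder path/repo -- point this at the Olive-optimized SD Turbo ONNX files.
pipeline = ORTStableDiffusionPipeline.from_pretrained(
    "path/to/sd-turbo-onnx",
    provider="CUDAExecutionProvider",   # run inference on an NVIDIA GPU
)

# SD Turbo is distilled for few-step sampling, so guidance is disabled (scale 0.0).
image = pipeline(
    "little cute gremlin wearing a jacket, cinematic, vivid colours",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("gremlin.png")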

To speed up inference with the ONNX Runtime TensorRT execution provider instead, follow the instructions found here.

The following is an example of image generation with the SDXL Turbo model guided by a text prompt:

python3 demo_txt2img_xl.py \
  --version xl-turbo \
  "little cute gremlin wearing a jacket, cinematic, vivid colours, intricate masterpiece, golden ratio, highly detailed"

Generated Gremlin Example
Figure 1. Little cute gremlin wearing a jacket image generated with text prompt using SDXL Turbo.

Note that the example image was generated in 4 steps, demonstrating the ability of SD Turbo and SDXL Turbo to generate viable images in fewer steps than previous Stable Diffusion models.

For a user-friendly way to try out Stable Diffusion models, see our ONNX Runtime Extension for Automatic1111's SD WebUI. This extension enables optimized execution of the Stable Diffusion UNet model on NVIDIA GPUs and uses the ONNX Runtime CUDA execution provider to run inference against models optimized with Olive. Currently, the extension has only been optimized for Stable Diffusion 1.5. SD Turbo and SDXL Turbo models can also be used, but performance optimizations are still in progress.



Applications of Stable Diffusion in C# and Java

Taking advantage of the cross-platform, performance, and usability benefits of ONNX Runtime, members of the community have also contributed samples and UI tools of their own using Stable Diffusion with ONNX Runtime.

These community contributions include OnnxStack, a .NET library that builds upon our previous C# tutorial to provide users with a variety of capabilities for many different Stable Diffusion models when performing inference with C# and ONNX Runtime.

Additionally, Oracle has released a Stable Diffusion sample with Java that runs inference on top of ONNX Runtime. This project is also based on our C# tutorial.



Benchmark results

We benchmarked the SD Turbo and SDXL Turbo models on a Standard_ND96amsr_A100_v4 VM using an A100-SXM4-80GB GPU and on a Lenovo desktop with an RTX-4090 GPU (WSL Ubuntu 20.04) to generate images of resolution 512×512 using the LCM scheduler and fp16 models. The results were measured using these specifications:

  • onnxruntime-gpu==1.17.0 (built from source)
  • torch==2.1.0a0+32f93b1
  • tensorrt==8.6.1
  • transformers==4.36.0
  • diffusers==0.24.0
  • onnx==1.14.1
  • onnx-graphsurgeon==0.3.27
  • polygraphy==0.49.0

To reproduce these results, we recommend using the instructions linked in the 'How to run SD Turbo and SDXL Turbo' section above.

Because the original VAE of SDXL Turbo cannot run in fp16 precision, we used sdxl-vae-fp16-fix when testing SDXL Turbo. There are slight discrepancies between its output and that of the original VAE, but the decoded images are close enough for most purposes.
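For reference, the snippet below is a rough sketch (in PyTorch/diffusers, not the ONNX benchmark setup) of how the fp16-friendly VAE can be swapped in; madebyollin/sdxl-vae-fp16-fix is the public release of that VAE.

import torch
from diffusers import AutoencoderKL, AutoPipelineForText2Image

# Replace the original SDXL Turbo VAE with the fp16-safe variant before inference.
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", vae=vae, torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of a cat", num_inference_steps=1, guidance_scale=0.0).images[0]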

The PyTorch pipeline for static shape applies channels-last memory format and torch.compile with reduce-overhead mode.
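As a rough sketch of that baseline configuration (the actual benchmark script may differ in details):

import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16
).to("cuda")

# Channels-last memory format plus torch.compile in reduce-overhead mode,
# as used for the PyTorch static-shape baseline.
pipe.unet = pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# The first call triggers compilation; keeping batch and image size fixed afterwards
# lets the captured CUDA graphs be reused on later calls.
image = pipe("a photo of a cat", num_inference_steps=1, guidance_scale=0.0).images[0]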

The following charts illustrate the throughput in images per second vs. different (batch size, number of steps) combinations for various frameworks. It's worth noting that the label above each bar indicates the speedup percentage vs. Torch Compile; e.g., in the first chart, ORT_TRT (Static) is 31% faster than Torch (Compile) for the (batch, steps) combination (4, 1).

We elected to use 1 and 4 steps because both SD Turbo and SDXL Turbo can generate viable images in as little as 1 step but typically produce images of the best quality in 3-5 steps.



SDXL Turbo

The graphs below illustrate the throughput in images per second for the SDXL Turbo model with both static and dynamic shape. Results were gathered on an A100-SXM4-80GB GPU for various (batch size, number of steps) combinations. For dynamic shape, the TensorRT engine supports batch sizes 1 to 8 and image sizes 512×512 to 768×768, but it is optimized for batch size 1 and image size 512×512.

Throughput for SDXL Turbo on A100 Tensor Cores GPU (static shapes)
Throughput for SDXL Turbo on A100 Tensor Cores GPU (dynamic shapes)



SD Turbo

The next two graphs illustrate throughput in images per second for the SD Turbo model with both static and dynamic shape on an A100-SXM4-80GB GPU.

Throughput for SD Turbo on A100 Tensor Cores GPU (static shapes)
Throughput for SD Turbo on A100 Tensor Cores GPU (dynamic shapes)

The final set of graphs illustrates throughput in images per second for the SD Turbo model with both static and dynamic shape on an RTX-4090 GPU. In this dynamic shape test, the TensorRT engine is built for batch sizes 1 to 8 (optimized for batch size 1) and a fixed image size of 512×512 due to memory limitations.

Throughput for SD Turbo on RTX 4090 (static shapes)
Throughput for SD Turbo on RTX 4090 (dynamic shapes)



How fast are SD Turbo and SDXL Turbo with ONNX Runtime?

These results demonstrate that ONNX Runtime significantly outperforms PyTorch with both CUDA and TensorRT execution providers in static and dynamic shape for all (batch, steps) combinations shown. This conclusion applies to both model sizes (SD Turbo and SDXL Turbo), as well as both GPUs tested. Notably, ONNX Runtime with CUDA (dynamic shape) was shown to be 229% faster than Torch Eager for the (batch, steps) combination (1, 4).

Additionally, ONNX Runtime with the TensorRT execution provider performs slightly better for static shape, given that the ORT_TRT throughput is higher than the corresponding ORT_CUDA throughput for most (batch, steps) combinations. Static shape is typically favored when the user knows the batch and image size at graph definition time (e.g., the user is only planning to generate images with batch size 1 and image size 512×512). In these situations, the static shape has faster performance. However, if the user decides to switch to a different batch and/or image size, TensorRT must create a new engine (meaning double the engine files on disk) and switch engines (meaning additional time spent loading the new engine).

On the other hand, ONNX Runtime with the CUDA execution provider is often a better choice for dynamic shape for the SD Turbo and SDXL Turbo models when using an A100-SXM4-80GB GPU, but ONNX Runtime with the TensorRT execution provider performs slightly better for dynamic shape for most (batch, steps) combinations when using an RTX-4090 GPU. The benefit of using dynamic shape is that users can run inference more quickly when the batch and image sizes are not known until graph execution time (e.g., running batch size 1 and image size 512×512 for one image and batch size 4 and image size 512×768 for another). When dynamic shape is used in these cases, users only need to build and save one engine, rather than switching engines during inference.
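To make the static/dynamic trade-off concrete, the sketch below shows one way a dynamic-shape TensorRT engine can be configured through the ONNX Runtime TensorRT execution provider's optimization-profile options. The input name, dimension ranges, and model path are illustrative placeholders; the demo scripts linked above handle this configuration for you.

import onnxruntime as ort

trt_options = {
    "trt_fp16_enable": True,
    "trt_engine_cache_enable": True,           # reuse the built engine across runs
    "trt_engine_cache_path": "./trt_cache",
    # One engine covering a range of batch sizes and latent resolutions,
    # with the optimal profile at batch 1 and 512x512 (64x64 in latent space).
    "trt_profile_min_shapes": "sample:1x4x64x64",
    "trt_profile_opt_shapes": "sample:1x4x64x64",
    "trt_profile_max_shapes": "sample:8x4x96x96",
}

session = ort.InferenceSession(
    "unet/model.onnx",                         # placeholder path to the exported UNet
    providers=[("TensorrtExecutionProvider", trt_options), "CUDAExecutionProvider"],
)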



GPU optimizations

Besides the techniques introduced in our previous Stable Diffusion blog, the following optimizations were applied by ONNX Runtime to produce the SD Turbo and SDXL Turbo results outlined in this post:

  • Enable CUDA graph for static shape inputs (see the sketch after this list).
  • Add Flash Attention V2.
  • Remove extra outputs in text encoder (keep the hidden state output specified by clip_skip parameter).
  • Add SkipGroupNorm fusion to fuse group normalization with Add nodes that precede it.
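
As a minimal illustration of the first item, CUDA graph capture is exposed as a provider option on the CUDA execution provider; it assumes static input shapes and inputs/outputs bound to fixed GPU buffers via IOBinding (the model path below is a placeholder).

import onnxruntime as ort

# Enable CUDA graph capture for static-shape inference; the model path is a placeholder.
session = ort.InferenceSession(
    "unet/model.onnx",
    providers=[("CUDAExecutionProvider", {"enable_cuda_graph": True})],
)
# Subsequent runs with identical shapes replay the captured graph, cutting kernel
# launch overhead; inputs must be written into the same GPU buffers each iteration.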

Additionally, we have added support for new features, including LoRA weights for latent consistency models (LCMs).



Next steps

In the future, we plan to continue improving upon our Stable Diffusion work by updating the demo to support new features, such as IP Adapter and Stable Video Diffusion. ControlNet support will also be available soon.

We're also working on optimizing SD Turbo and SDXL Turbo performance with our existing Stable Diffusion web UI extension and plan to help add support for both models to a Windows UI developed by a member of the ONNX Runtime community.

Additionally, a tutorial on how to run SD Turbo and SDXL Turbo with C# and ONNX Runtime is coming soon. In the meantime, check out our previous tutorial on Stable Diffusion.



Resources

Check out some of the resources discussed in this post:


