
Improving Diffusers Package for High-Quality Image Generation


Goodbye Babel, generated by Andrew Zhu using Diffusers in pure Python

Stable Diffusion WebUI from AUTOMATIC1111 has proven to be a powerful tool for generating high-quality images with the Diffusion model. However, while the WebUI is easy to use, data scientists, machine learning engineers, and researchers often need more control over the image generation process. That is where the diffusers package from Hugging Face comes in: it provides a way to run the Diffusion model in Python and lets users customize their models and prompts to generate images tailored to their specific needs.

Despite its potential, the Diffusers package has several limitations that prevent it from generating images as good as those produced by the Stable Diffusion WebUI. The most significant of these limitations include:

  • The inability to use custom models in the .safetensor file format;
  • The 77 prompt token limitation;
  • A lack of LoRA support;
  • The absence of image scale-up functionality (also known as HighRes in Stable Diffusion WebUI);
  • Low performance and high VRAM usage by default.

This article aims to address these limitations and enable the Diffusers package to generate high-quality images comparable to those produced by the Stable Diffusion WebUI. With the enhancements provided here, data scientists, machine learning engineers, and researchers can enjoy greater control and flexibility in their image generation processes while also achieving exceptional results. In the following sections, we will explore the various strategies and techniques that can be used to overcome these limitations and unlock the full potential of the Diffusers package.

Note: if this is your first time running Stable Diffusion, please follow this link to install all required CUDA and Python packages.
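
If you just want a quick idea of what such an environment looks like, here is a rough sketch (the linked guide remains the authoritative reference; the exact PyTorch build and index URL depend on your GPU, driver, and CUDA version):

# install a CUDA-enabled PyTorch build (pick the index URL matching your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# core packages used throughout this article
pip install diffusers transformers accelerate safetensors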

1. Load Up Local Model files in .safetensor Format

Users can easily spin up diffusers to generate a picture like this:

from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")

You may not be satisfied with either the output image or the performance. Let's deal with the problems one by one. First, let's load a custom model in .safetensor format located anywhere on your machine. You cannot simply load the model file like this:

pipeline = DiffusionPipeline.from_pretrained("/model/custom_model.safetensors")

Here are the detailed steps to convert a .safetensor file to the Diffusers format:

1. Pull the diffusers code from GitHub:

git clone https://github.com/huggingface/diffusers.git

2. Under the scripts folder, locate the file convert_original_stable_diffusion_to_diffusers.py.

In your terminal, run the following command to convert the .safetensor file to the Diffusers format. Remember to change the --checkpoint_path and --dump_path values to match your setup.

python convert_original_stable_diffusion_to_diffusers.py --from_safetensors --checkpoint_path="D:\stable-diffusion-webui\models\Stable-diffusion\deliberate_v2.safetensors" --dump_path='D:\sd_models\deliberate_v2' --device='cuda:0'

3. Now you can load the pipeline using the newly converted model files. Here is the complete code:

from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")

You should be able to convert and use any models you download from Hugging Face or civitai.com.

Cat playing piano generated by the above code

2. Boost the Performance of Diffusers

Generating high-quality images can be a time-consuming process even for the latest 30xx and 40xx Nvidia RTX GPUs. By default, the Diffusers package comes with non-optimized settings. Two solutions can be applied to greatly boost performance.

Here is the iteration speed before applying the following solutions: only about 2.x iterations per second on an RTX 3070 Ti (8GB VRAM) to generate a 512×512 image.

The first solution is to use half-precision weights. Half-precision weights use 16-bit floating-point numbers instead of the traditional 32-bit numbers. This reduces the memory required for storing weights and speeds up computation, which can significantly improve the performance of the Diffusers package.

According to this video, reducing float precision from FP32 to FP16 also enables the Tensor Cores.

I wrote another article testing how much GPU Tensor Cores can boost computation speed.

Here is how to enable FP16 in Diffusers. Adding just two lines of code boosts performance by 500%, with almost no impact on image quality.

from diffusers import DiffusionPipeline
import torch # <----- Line 1 added
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,torch_dtype = torch.float16 # <----- Line 2 added
)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")

Now the iteration speed is boosted to 10.x iterations per second, roughly a 5x improvement.

The second solution is Xformers. Xformers is an open-source library that provides a set of high-performance transformers for various natural language processing (NLP) tasks. It is built on top of PyTorch and aims to offer efficient and scalable transformer models that can be easily integrated into existing NLP pipelines. (Nowadays, are there any models that don't use a Transformer? :P)

Install Xformers with pip install xformers, then we can easily switch Diffusers to use it with one line of code.

...
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention() # <--- one line added
...

This one line of code boosts performance by another 20%.

3. Remove the 77 prompt tokens limitation

In the current version of Diffusers, there is a limitation of 77 prompt tokens that can be used in the generation of images.

Fortunately, there is a solution to this problem. By using the "lpw_stable_diffusion" pipeline provided by the community, you can unlock the 77 prompt token limitation and generate high-quality images with longer prompts.

To use the "lpw_stable_diffusion" pipeline, use the following code:

pipeline = DiffusionPipeline.from_pretrained(
    model_path,
    custom_pipeline="lpw_stable_diffusion", #<--- code added
    torch_dtype=torch.float16
)

In this code, we initialize a new DiffusionPipeline object using the "from_pretrained" method. We specify the path to the pre-trained model and set the "custom_pipeline" argument to "lpw_stable_diffusion". This tells Diffusers to use the "lpw_stable_diffusion" pipeline, which unlocks the 77 prompt token limitation.

Now, let's try it out with a long prompt string. Here is the complete code:

from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion" #<--- code added
    ,torch_dtype = torch.float16
)
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()
prompt = """
Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine
"""
image = pipeline(prompt).images[0]
image.save("goodbye_babel_tower.png")

And you’ll get a picture like this:

Goodbye Babel, generated by Andrew Zhu using diffusers

If you still see a warning message like: Token indices sequence length is longer than the specified maximum sequence length for this model ( *** > 77 ). Running this sequence through the model will result in indexing errors. It is normal, and you can just ignore it.

4. Use Custom LoRA with Diffusers

Despite the claims of LoRA support in Diffusers, users still face limitations when loading local LoRA files in the .safetensor file format. This can be a significant obstacle for users who want to use LoRAs from the community.

To overcome this limitation, I have created a function that allows users to load LoRA files with weights in real time. This function can be used to load a LoRA file and its corresponding weight into a Diffusers model, enabling the generation of high-quality images with LoRA data.

Here is the function body:

from safetensors.torch import load_file
import torch

def __load_lora(
    pipeline
    ,lora_path
    ,lora_weight=0.5
):
    state_dict = load_file(lora_path)
    LORA_PREFIX_UNET = 'lora_unet'
    LORA_PREFIX_TEXT_ENCODER = 'lora_te'

    alpha = lora_weight
    visited = []

    # directly update weight in diffusers model
    for key in state_dict:

        # as we have set the alpha beforehand, just skip
        if '.alpha' in key or key in visited:
            continue

        if 'text' in key:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_TEXT_ENCODER+'_')[-1].split('_')
            curr_layer = pipeline.text_encoder
        else:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_UNET+'_')[-1].split('_')
            curr_layer = pipeline.unet

        # find the target layer
        temp_name = layer_infos.pop(0)
        while len(layer_infos) > -1:
            try:
                curr_layer = curr_layer.__getattr__(temp_name)
                if len(layer_infos) > 0:
                    temp_name = layer_infos.pop(0)
                elif len(layer_infos) == 0:
                    break
            except Exception:
                if len(temp_name) > 0:
                    temp_name += '_'+layer_infos.pop(0)
                else:
                    temp_name = layer_infos.pop(0)

        # org_forward(x) + lora_up(lora_down(x)) * multiplier
        pair_keys = []
        if 'lora_down' in key:
            pair_keys.append(key.replace('lora_down', 'lora_up'))
            pair_keys.append(key)
        else:
            pair_keys.append(key)
            pair_keys.append(key.replace('lora_up', 'lora_down'))

        # update weight
        if len(state_dict[pair_keys[0]].shape) == 4:
            weight_up = state_dict[pair_keys[0]].squeeze(3).squeeze(2).to(torch.float32)
            weight_down = state_dict[pair_keys[1]].squeeze(3).squeeze(2).to(torch.float32)
            curr_layer.weight.data += alpha * torch.mm(weight_up, weight_down).unsqueeze(2).unsqueeze(3)
        else:
            weight_up = state_dict[pair_keys[0]].to(torch.float32)
            weight_down = state_dict[pair_keys[1]].to(torch.float32)
            curr_layer.weight.data += alpha * torch.mm(weight_up, weight_down)

        # update visited list
        for item in pair_keys:
            visited.append(item)

    return pipeline

The logic is extracted from the convert_lora_safetensor_to_diffusers.py of the diffusers git repo.

Take one of the famous LoRAs, MoXin, as an example. You can use the __load_lora function like this:

from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
)
lora = (r"D:\sd_models\Lora\Moxin_10.safetensors",0.8)
pipeline = __load_lora(pipeline=pipeline,lora_path=lora[0],lora_weight=lora[1])
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

prompt = """
shukezouma,negative space,shuimobysim
a branch of flower, traditional chinese ink painting
"""
image = pipeline(prompt).images[0]
image.save("a branch of flower.png")

The prompt will generate a picture like this:

a branch of flower, generated by Andrew Zhu using diffusers

You can call __load_lora() multiple times to load several LoRAs for one generation.
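
For example, here is a minimal sketch of stacking two LoRAs on the same pipeline (the second file path and both weights are placeholders for illustration):

# each call patches the UNet/text encoder weights in place, so the effects are additive
pipeline = __load_lora(pipeline=pipeline, lora_path=r"D:\sd_models\Lora\Moxin_10.safetensors", lora_weight=0.8)
pipeline = __load_lora(pipeline=pipeline, lora_path=r"D:\sd_models\Lora\another_style.safetensors", lora_weight=0.5)  # hypothetical second LoRA file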

With this function, you can now load LoRA files with weights in real time and use them to generate high-quality images with Diffusers. The LoRA loading is pretty fast, usually taking only 1–2 seconds, which is far better than converting the model first (which generates another multi-GB model file).

5. Use Custom Textual Inversions with Diffusers

Using custom Textual Inversions with the Diffusers package can be a powerful way to generate high-quality images. However, the official documentation of Diffusers suggests that users need to train their own Textual Inversions, which can take up to an hour on a V100 GPU. This may not be practical for many users who want to generate images quickly.

So I investigated it and found a solution that enables Diffusers to use a textual inversion just like in Stable Diffusion WebUI. Below is the function I created to load a custom Textual Inversion.

import torch

def load_textual_inversion(
    learned_embeds_path
    , text_encoder
    , tokenizer
    , token = None
    , weight = 0.5
):
    '''
    Use this function to load a textual inversion model in the model initialization stage
    or the image generation stage.
    '''
    loaded_learned_embeds = torch.load(learned_embeds_path, map_location="cpu")
    string_to_token = loaded_learned_embeds['string_to_token']
    string_to_param = loaded_learned_embeds['string_to_param']

    # separate token and the embeds
    trained_token = list(string_to_token.keys())[0]
    embeds = string_to_param[trained_token]
    embeds = embeds[0] * weight

    # cast to dtype of text_encoder
    dtype = text_encoder.get_input_embeddings().weight.dtype
    embeds = embeds.to(dtype)

    # add the token to the tokenizer
    token = token if token is not None else trained_token
    num_added_tokens = tokenizer.add_tokens(token)
    if num_added_tokens == 0:
        #print(f"The tokenizer already contains the token {token}. The new token will replace the previous one")
        raise ValueError(f"The tokenizer already contains the token {token}. Please pass a different `token` that is not already in the tokenizer.")

    # resize the token embeddings
    text_encoder.resize_token_embeddings(len(tokenizer))

    # get the id for the token and assign the embeds
    token_id = tokenizer.convert_tokens_to_ids(token)
    text_encoder.get_input_embeddings().weight.data[token_id] = embeds
    return (tokenizer,text_encoder)

In the load_textual_inversion() function, you need to provide the following arguments:

  • learned_embeds_path: Path to the pre-trained textual inversion model file in .pt or .bin format.
  • text_encoder: Text encoder object obtained from the Diffusion Pipeline.
  • tokenizer: Tokenizer object obtained from the Diffusion Pipeline.
  • token: Optional argument specifying the prompt token. By default, it is set to None. It is the keyword that will trigger the textual inversion in your prompt.
  • weight: Optional argument specifying the weight of the textual inversion. By default, I set it to 0.5. You can change it to another value as needed.

You can now use the function with a Diffusers pipeline like this:

from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
    ,safety_checker = None
)

textual_inversion_path = r"D:\sd_models\embeddings\style-empire.pt"

tokenizer = pipeline.tokenizer
text_encoder = pipeline.text_encoder
load_textual_inversion(
    learned_embeds_path = textual_inversion_path
    , tokenizer = tokenizer
    , text_encoder = text_encoder
    , token = 'styleempire'
)

pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

prompt = """
styleempire,award winning beautiful street, storm,((dark storm clouds))
, fluffy clouds in the sky, shaded flat illustration, digital art
, trending on artstation, highly detailed, fine detail, intricate
, ((lens flare)), (backlighting), (bloom)
"""
neg_prompt = """
cartoon, 3d, ((disfigured)), ((bad art)), ((deformed)), ((poorly drawn))
, ((extra limbs)), ((close up)), ((b&w)), weird colors, blurry
, hat, cap, glasses, sunglasses, lightning, face
"""

generator = torch.Generator("cuda").manual_seed(1)
image = pipeline(
    prompt
    ,negative_prompt =neg_prompt
    ,generator = generator
).images[0]
image.save("tv_test.png")

Here is the result of applying an Empire Style Textual Inversion.

The modern street on the left turns into an old London style.

6. Upscale Images

The Diffusers package is great for generating high-quality images, but image upscaling is not its primary function. However, Stable-Diffusion-WebUI offers a feature called HighRes, which allows users to upscale their generated images by 2x or 4x. It would be great if Diffusers users could enjoy the same feature. After some research and testing, I found that the SwinIR model is an excellent option for image upscaling, and it can easily upscale images by 2x or 4x after they are generated.

To use the SwinIR model for image upscaling, we can use the code from the GitHub repository of JingyunLiang/SwinIR. If you just want the code, downloading models/network_swinir.py, utils/util_calculate_psnr_ssim.py and main_test_swinir.py is enough. Following the README guideline, you can upscale images like magic.
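
As a rough sketch of what that looks like, assuming you have downloaded one of the real-world super-resolution checkpoints (the flags follow the repository's README at the time of writing; the model filename below is a placeholder):

python main_test_swinir.py --task real_sr --scale 4 --model_path model_zoo/swinir/your_downloaded_realSR_x4_model.pth --folder_lq path/to/your/generated/images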

Here is a sample of how well SwinIR can scale up an image.

Left: original image, Right: 4x SwinIR upscaled image

Many other open-source solutions can be used to improve image quality. Here are three other models that I tried that return wonderful results.

RealSR can scale up an image 4 times almost as well as SwinIR, and its execution performance is the fastest: instead of invoking PyTorch and CUDA, the author compiles the code and CUDA usage directly into a binary. My observations show that RealSR can upscale an image in just about 2–4 seconds.
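
For example, a minimal usage sketch assuming you downloaded the pre-built ncnn-Vulkan port of RealSR (the binary name and flags may differ depending on the release you use):

realsr-ncnn-vulkan -i generated_image.png -o upscaled_image.png -s 4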

CodeFormer is good at restoring blurred or broken faces; it can also remove noise and enhance background details. This solution and algorithm are widely used in other applications, including Stable-Diffusion-WebUI.

GFPGAN is another powerful open-source solution that achieves amazing face restoration results, and it is fast too. GFPGAN is also integrated into Stable-Diffusion-WebUI.
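
As a rough usage sketch, assuming you cloned the GFPGAN repository and downloaded a pre-trained model as described in its README (the version and upscale values below are examples):

python inference_gfpgan.py -i path/to/your/generated/images -o results -v 1.3 -s 2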

7. Optimize Diffusers CUDA Memory Usage

When using Diffusers to generate images, it is important to consider CUDA memory usage, especially when you need to load other models to further process the generated images. If you try to load another model like SwinIR to upscale images, you might encounter a RuntimeError: CUDA out of memory because the Diffusers model still occupies the CUDA memory.

To mitigate this issue, there are several solutions for optimizing CUDA memory usage. The following two solutions worked the best for me:

  • Sliced Attention for Additional Memory Savings

Sliced attention is a technique that reduces the memory usage of self-attention mechanisms in transformers. By partitioning the attention matrix into smaller blocks, the memory requirements are reduced. This technique can be used with the Diffusers package to reduce the memory footprint of the Diffusers model.

To use it in Diffusers, it takes just one line of code:

pipeline.enable_attention_slicing()

  • Model Offloading to CPU

Normally, you won't have two models running at the same time. The idea is to offload the model data to CPU memory temporarily, releasing CUDA memory space for other models, and only load the data back to VRAM when you start using the model.

To dynamically offload data to CPU memory in Diffusers, use this one line of code:

pipeline.enable_model_cpu_offload()

After applying this, whenever Diffusers finishes an image generation task, the model data will be offloaded to CPU memory automatically until the next time it is called.
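
Putting the two options together, here is a minimal sketch of a memory-friendly setup (the model path is a placeholder; enable_model_cpu_offload() requires the accelerate package and replaces the explicit pipeline.to("cuda") call):

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2" # placeholder path to your converted model
    ,torch_dtype = torch.float16
)
# slice the attention computation into smaller blocks to lower peak VRAM usage
pipeline.enable_attention_slicing()
# keep submodels in CPU RAM and move each to the GPU only while it is in use
pipeline.enable_model_cpu_offload()

image = pipeline("A cute cat playing piano").images[0]
# after generation, the weights sit back in CPU RAM, leaving VRAM free for an upscaler such as SwinIR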

Summary

This article discusses how to improve the performance and capabilities of the Diffusers package. It covers several solutions to common issues faced by Diffusers users, including loading local .safetensor models, boosting performance, removing the 77 prompt tokens limitation, using custom LoRA and Textual Inversion, upscaling images, and optimizing CUDA memory usage.

By applying these solutions, Diffusers users can generate high-quality images with better performance and more control over the process. The article also includes code snippets and detailed explanations for each solution.

If you can successfully apply these solutions and code to your own case, there is an additional benefit, one that I have profited from a lot: by reading the Diffusers source code, you may end up implementing your own solutions and better understanding how Stable Diffusion works. To me, learning, finding, and implementing these solutions has been a fun journey. I hope these solutions can also help you, and I wish you a great time with Stable Diffusion and the diffusers package.

Here is the prompt that generated the heading image:

Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine

Size:
Seed:
Scheduler (or Sampling method):
Sampling steps:
CFG Scale (or Guidance Scale):
SwinIR model:

License and Code Reuse

The solutions provided in this article were achieved through extensive source reading, late-night testing, and logical design. It is important to note that at the time of writing (April 2023), the LoRA and Textual Inversion loading solutions and code included in this article are the only working versions available on the web.

If you find the code presented in this article useful and want to reuse it in your project, paper, or article, please reference back to this Medium article. The code presented here is licensed under the MIT license, which allows you to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the conditions of the license.

Please note that the solutions presented in this article may not be the optimal or most efficient way to achieve the desired results, and they are subject to change as new developments and improvements are made. It is always advisable to thoroughly test and validate any code before implementing it in a production environment.

