TL;DR: We show how to run one of the most powerful open-source text-to-image models, IF, on a free-tier Google Colab with 🧨 diffusers.
You can also explore the capabilities of the model directly in the Hugging Face Space.

Image compressed from official IF GitHub repo.
Introduction
IF is a pixel-based text-to-image generation model released in
late April 2023 by DeepFloyd. The
model architecture is strongly inspired by Google's closed-source
Imagen.
IF has two distinct advantages compared to existing text-to-image models
like Stable Diffusion:
- The model operates directly in "pixel space" (i.e., on uncompressed images) instead of running the denoising process in the latent space, as Stable Diffusion does.
- The model is trained on outputs of T5-XXL, a more powerful text encoder than the CLIP text encoder used by Stable Diffusion.
As a result, IF is better at generating images with high-frequency
details (e.g., human faces and hands) and is the first open-source
image generation model that can reliably generate images with text.
The downside of operating in pixel space and using a more powerful text
encoder is that IF has a significantly larger number of parameters. T5,
IF's text-to-image UNet, and IF's upscaler UNet have 4.5B, 4.3B, and
1.2B parameters, respectively. By comparison, Stable Diffusion
2.1's text encoder and UNet have just 400M and 900M parameters, respectively.
Nevertheless, it is possible to run IF on consumer hardware if one
optimizes the model for low memory usage. We will show you how to do so
with 🧨 diffusers in this blog post.
In 1.), we explain how to use IF for text-to-image generation, and in 2.)
and 3.), we go over IF's image variation and image inpainting
capabilities.
💡 Note: We are trading gains in memory for losses in
speed here to make it possible to run IF in a free-tier Google Colab. If
you have access to high-end GPUs such as an A100, we recommend leaving
all model components on the GPU for maximum speed, as done in the
official IF demo.
💡 Note: Some of the larger images have been compressed to load faster
in the blog format. When using the official model, they should be of even
higher quality!
Let’s dive in 🚀!

IF’s text generation capabilities
Accepting the license
Before you can use IF, you need to accept its usage conditions. To do so:
- Make sure to have a Hugging Face account and be logged in.
- Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0. Accepting the license on the stage I model card auto-accepts it for the other IF models.
- Make sure to log in locally. Install huggingface_hub:
pip install huggingface_hub --upgrade
run the login function in a Python shell
from huggingface_hub import login
login()
and enter your Hugging Face Hub access token.
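💡 Note: If you prefer a non-interactive login (e.g., in a scripted setup), login also accepts the token directly. A minimal sketch, assuming you have stored your token in an environment variable called HF_TOKEN (the variable name is just an example):
import os
from huggingface_hub import login

# Read the access token from an environment variable instead of typing it in
login(token=os.environ["HF_TOKEN"])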
Optimizing IF to run on memory constrained hardware
State-of-the-art ML shouldn't just be in the hands of an elite few.
Democratizing ML means making models available to run on more than just
the latest and greatest hardware.
The deep learning community has created world-class tools to run
resource-intensive models on consumer hardware.
Diffusers seamlessly integrates these libraries to allow for a
simple API when optimizing large models.
The free-tier Google Colab is both CPU RAM constrained (13 GB RAM) and
GPU VRAM constrained (15 GB VRAM for a T4), which makes running the
whole >10B-parameter IF model challenging!
Let's map out the sizes of IF's model components in full float32
precision (roughly twice the float16 sizes listed below).
There is no way we can run the model in float32, as the T5 and Stage 1
UNet weights are each larger than the available CPU RAM.
In float16, the component sizes are 11GB, 8.6GB, and 1.25GB for the T5,
Stage 1 UNet, and Stage 2 UNet, respectively, which is doable for the GPU, but
we still run into CPU memory overflow errors when loading T5
(some CPU RAM is occupied by other processes).
Therefore, we lower the precision of T5 even further by using
bitsandbytes 8-bit quantization, which allows saving the T5 checkpoint
with as little as 8 GB.
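As a rough back-of-the-envelope check (weights only, ignoring activations and any overhead), you can estimate these footprints from the parameter counts quoted above; the actual checkpoints are somewhat larger:
# Approximate weight size for a given parameter count and bytes per parameter
# (4 bytes in float32, 2 in float16, 1 with 8-bit quantization)
def approx_size_gb(params_billion, bytes_per_param):
    return params_billion * 1e9 * bytes_per_param / 1024**3

for name, params in [("T5-XXL", 4.5), ("Stage 1 UNet", 4.3), ("Stage 2 UNet", 1.2)]:
    print(
        f"{name}: "
        f"fp32 ≈ {approx_size_gb(params, 4):.1f} GB, "
        f"fp16 ≈ {approx_size_gb(params, 2):.1f} GB, "
        f"int8 ≈ {approx_size_gb(params, 1):.1f} GB"
    )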
Now that each component fits individually into CPU and GPU memory,
we need to make sure that components have all of the CPU and GPU memory for
themselves when needed.
Diffusers supports modularly loading individual components, i.e., we can
load the text encoder without loading the UNet. This modular loading
ensures that we only load the component we need at a given step in
the pipeline, which avoids exhausting the available CPU RAM and GPU VRAM.
Let’s give it a try 🚀
Available resources
The free-tier Google Colab comes with around 13 GB CPU RAM:
!grep MemTotal /proc/meminfo
MemTotal: 13297192 kB
And an NVIDIA T4 with 15 GB VRAM:
!nvidia-smi
Sun Apr 23 23:14:19 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P0    32W /  70W |   1335MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
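You can also query both numbers directly from Python; a small sketch (psutil comes preinstalled on Colab):
import psutil
import torch

# Available CPU RAM
print(f"CPU RAM available: {psutil.virtual_memory().available / 1024**3:.1f} GB")

# Free and total GPU VRAM
free, total = torch.cuda.mem_get_info()
print(f"GPU VRAM free: {free / 1024**3:.1f} GB / {total / 1024**3:.1f} GB")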
Install dependencies
Some optimizations can require up-to-date versions of dependencies. If
you’re having issues, please double check and upgrade versions.
! pip install --upgrade \
  diffusers~=0.16 \
  transformers~=4.28 \
  safetensors~=0.3 \
  sentencepiece~=0.1 \
  accelerate~=0.18 \
  bitsandbytes~=0.38 \
  torch~=2.0 -q
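As a quick sanity check that the upgraded versions are the ones actually imported (you may need to restart the Colab runtime for the upgrades to take effect):
import torch
import diffusers
import transformers
import accelerate

print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)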
1. Text-to-image generation
We will walk step by step through text-to-image generation with IF using
Diffusers. We will briefly explain APIs and optimizations, but more
in-depth explanations can be found in the official documentation for
Diffusers,
Transformers,
Accelerate, and
bitsandbytes.
1.1 Load text encoder
We will load T5 using 8-bit quantization. Transformers directly supports
bitsandbytes
through the load_in_8bit flag.
The flag variant="8bit" will download pre-quantized weights.
We also use the device_map flag to allow transformers to offload
model layers to the CPU or disk. Transformers big modeling supports
arbitrary device maps, which can be used to separately load model
parameters directly onto available devices. Passing "auto" will
automatically create a device map. See the transformers
docs
for more information.
from transformers import T5EncoderModel
text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    subfolder="text_encoder",
    device_map="auto",
    load_in_8bit=True,
    variant="8bit"
)
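As a quick sanity check, you can print the encoder's in-memory weight footprint; with 8-bit quantization it should be roughly the 8 GB mentioned above:
# Weight footprint of the loaded (quantized) model, in GB
print(f"{text_encoder.get_memory_footprint() / 1024**3:.1f} GB")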
1.2 Create text embeddings
The Diffusers API for accessing diffusion models is the
DiffusionPipeline class and its subclasses. Each instance of
DiffusionPipeline is a fully self-contained set of methods and models
for running diffusion networks. We can override the models it uses by
passing alternative instances as keyword arguments to from_pretrained.
In this case, we pass None for the unet argument, so no UNet will be
loaded. This allows us to run the text-embedding portion of the
diffusion process without loading the UNet into memory.
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,
    unet=None,
    device_map="auto"
)
IF also comes with a super resolution pipeline. We will save the prompt
embeddings so that we can later pass them directly to the super
resolution pipeline. This allows the super resolution pipeline to be
loaded without a text encoder.
Instead of an astronaut just riding a
horse, let's hand them a
sign as well!
Let's define a fitting prompt:
prompt = "a photograph of an astronaut riding a horse holding a sign that says Pixel's in space"
and run it through the 8-bit quantized T5 model:
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
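If you are curious what encode_prompt returns: the embeddings are plain tensors of shape (batch_size, sequence_length, hidden_size), where T5-XXL's hidden size is 4096. You can inspect them directly:
# Inspect the prompt and negative prompt embeddings
print(prompt_embeds.shape, prompt_embeds.dtype)
print(negative_embeds.shape, negative_embeds.dtype)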
1.3 Free memory
Once the prompt embeddings have been created, we don't need the text
encoder anymore. However, it is still in GPU memory. We need to
remove it so that we can load the UNet.
It’s non-trivial to free PyTorch memory. We must garbage-collect the
Python objects which point to the actual memory allocated on the GPU.
First, use the Python keyword del to delete all Python objects
referencing allocated GPU memory
del text_encoder
del pipe
Deleting the Python objects is not enough to free the GPU memory;
the memory is only freed when garbage collection runs.
Additionally, we will call torch.cuda.empty_cache(). This method
is not strictly necessary, as the cached CUDA memory would be immediately
available for further allocations, but emptying the cache allows us to
verify in the Colab UI that the memory is available.
We’ll use a helper function flush() to flush memory.
import gc
import torch

def flush():
    gc.collect()
    torch.cuda.empty_cache()
and run it
flush()
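If you want to double-check that the memory was actually released, you can also query PyTorch's allocator before loading the next component:
# Allocated/reserved GPU memory should drop back down after flushing
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")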
1.4 Stage 1: The main diffusion process
With the GPU memory now freed, we can reload the
DiffusionPipeline with only the UNet to run the main diffusion
process.
The variant and torch_dtype flags are used by Diffusers to download
and load the weights in 16-bit floating point format.
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=None,
    variant="fp16",
    torch_dtype=torch.float16,
    device_map="auto"
)
Usually, one passes the text prompt directly to DiffusionPipeline.__call__.
However, we previously computed our text embeddings, which we can pass
instead.
IF also comes with a super resolution diffusion process. Setting
output_type="pt" will return raw PyTorch tensors instead of a PIL
image. This way, we can keep the PyTorch tensors on the GPU and pass them
directly to the stage 2 super resolution pipeline.
Let’s define a random generator and run the stage 1 diffusion process.
generator = torch.Generator().manual_seed(1)
image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images
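Before converting to PIL, you can peek at the raw output; with the default settings it is a batch of 64×64 tensors with values roughly in [-1, 1]:
# Typically torch.Size([1, 3, 64, 64]) in float16
print(image.shape, image.dtype, image.device)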
Let's manually convert the raw tensors to PIL and have a sneak peek at
the final result. The output of stage 1 is a 64×64 image.
from diffusers.utils import pt_to_pil
pil_image = pt_to_pil(image)
pipe.watermarker.apply_watermark(pil_image, pipe.unet.config.sample_size)
pil_image[0]
And again, we remove the Python pointer and free CPU and GPU memory:
del pipe
flush()
1.5 Stage 2: Super Resolution 64×64 to 256×256
IF comes with a separate diffusion process for upscaling.
We run each diffusion process with a separate pipeline.
The super resolution pipeline can be loaded with a text encoder if
needed. However, we will typically have pre-computed text embeddings from
the first IF pipeline. If so, load the pipeline without the text
encoder.
Create the pipeline
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0",
    text_encoder=None,
    variant="fp16",
    torch_dtype=torch.float16,
    device_map="auto"
)
and run it, re-using the pre-computed text embeddings
image = pipe(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images
Again we can inspect the intermediate results.
pil_image = pt_to_pil(image)
pipe.watermarker.apply_watermark(pil_image, pipe.unet.config.sample_size)
pil_image[0]
And again, we delete the Python pointer and free memory
del pipe
flush()
1.6 Stage 3: Super Resolution 256×256 to 1024×1024
The second super resolution model for IF is Stability AI's previously
released x4
Upscaler.
Let's create the pipeline and load it directly onto the GPU with
device_map="auto".
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    torch_dtype=torch.float16,
    device_map="auto"
)
🧨 diffusers makes independently developed diffusion models easily
composable, as pipelines can be chained together. Here we can simply take
the previous PyTorch tensor output and pass it to the stage 3 pipeline as
image=image.
💡 Note: The x4 Upscaler does not use T5 and has its own text
encoder.
Therefore, we cannot use the previously created prompt embeddings and
instead must pass the original prompt.
pil_image = pipe(prompt, generator=generator, image=image).images
Unlike the IF pipelines, the IF watermark will not be added by default
to outputs from the Stable Diffusion x4 upscaler pipeline.
We can instead manually apply the watermark.
from diffusers.pipelines.deepfloyd_if import IFWatermarker
watermarker = IFWatermarker.from_pretrained("DeepFloyd/IF-I-XL-v1.0", subfolder="watermarker")
watermarker.apply_watermark(pil_image, pipe.unet.config.sample_size)
View output image
pil_image[0]
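To keep the result around, you can also save it to disk (the filename is just an example):
pil_image[0].save("./if_stage_3_1024.png")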
Et voilà! A beautiful 1024×1024 image in a free-tier Google Colab.
We have shown how 🧨 diffusers makes it easy to decompose and modularly
load resource-intensive diffusion models.
💡 Note: We don't recommend using the above setup in production.
8-bit quantization, manual de-allocation of model weights, and disk
offloading all trade memory for time (i.e., inference speed). This
can be especially noticeable if the diffusion pipeline is re-used. In
production, we recommend using a 40GB A100 with all model components
left on the GPU. See the official IF
demo.
2. Image variation
The same IF checkpoints can also be used for text-guided image variation
and inpainting. The core diffusion process is the same as for text-to-image
generation, except that the initial noised image is created from the image to
be varied or inpainted.
To run image variation, load the same checkpoints with
IFImg2ImgPipeline.from_pretrained() and
IFImg2ImgSuperResolutionPipeline.from_pretrained().
The APIs for memory optimization are all the same!
Let’s free the memory from the previous section.
del pipe
flush()
For image variation, we start with an initial image that we want to
adapt.
For this section, we will adapt the famous "Slaps Roof of Car" meme.
Let's download it from the internet.
import requests
url = "https://i.kym-cdn.com/entries/icons/original/000/026/561/automobile.jpg"
response = requests.get(url)
and load it into a PIL Image
from PIL import Image
from io import BytesIO
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image = original_image.resize((768, 512))
original_image
The image variation pipeline takes both PIL images and raw tensors. View
the docstrings for more in-depth documentation on the expected inputs.
2.1 Text Encoder
Image variation is guided by text, so we can define a prompt and encode
it with T5's text encoder.
Again we load the text encoder in 8-bit precision.
from transformers import T5EncoderModel
text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    subfolder="text_encoder",
    device_map="auto",
    load_in_8bit=True,
    variant="8bit"
)
For image variation, we load the checkpoint with
IFImg2ImgPipeline. When using
DiffusionPipeline.from_pretrained(...), checkpoints are loaded into
their default pipeline. The default pipeline for IF is the
text-to-image IFPipeline. When loading checkpoints
with a non-default pipeline, the pipeline must be explicitly specified.
from diffusers import IFImg2ImgPipeline
pipe = IFImg2ImgPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,
    unet=None,
    device_map="auto"
)
Let’s turn our salesman into an anime character.
prompt = "anime style"
As before, we create the text embeddings with T5
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
and free GPU and CPU memory.
First, remove the Python pointers
del text_encoder
del pipe
and then free the memory
flush()
2.2 Stage 1: The main diffusion process
Next, we load only the stage 1 UNet weights into the pipeline object,
just like we did in the previous section.
pipe = IFImg2ImgPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=None,
    variant="fp16",
    torch_dtype=torch.float16,
    device_map="auto"
)
The image variation pipeline requires both the original image and the
prompt embeddings.
We can optionally use the strength argument to configure the amount of
variation. strength directly controls the amount of noise added:
higher strength means more noise, which means more variation.
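For instance, here is a small sketch that sweeps a few strength values (the values themselves are just examples):
# Higher strength -> more noise -> the output deviates more from the original
variations = {}
for strength in (0.5, 0.7, 0.9):
    variations[strength] = pipe(
        image=original_image,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        strength=strength,
        output_type="pt",
        generator=torch.Generator().manual_seed(0),
    ).images
Below, we simply run the pipeline with its default strength.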
generator = torch.Generator().manual_seed(0)
image = pipe(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images
Let’s check the intermediate 64×64 again.
pil_image = pt_to_pil(image)
pipe.watermarker.apply_watermark(pil_image, pipe.unet.config.sample_size)
pil_image[0]
Looks good! We can free the memory and upscale the image again.
del pipe
flush()
2.3 Stage 2: Super Resolution
For super resolution, load the checkpoint with
IFImg2ImgSuperResolutionPipeline and the same checkpoint as before.
from diffusers import IFImg2ImgSuperResolutionPipeline
pipe = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0",
    text_encoder=None,
    variant="fp16",
    torch_dtype=torch.float16,
    device_map="auto"
)
💡 Note: The image variation super resolution pipeline requires the
generated image as well as the original image.
You can also use the Stable Diffusion x4 upscaler on this image. Feel
free to try it out using the code snippets from section 1.6.
image = pipe(
    image=image,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
).images[0]
image
Nice! Let's free the memory and look at the final inpainting pipelines.
del pipe
flush()
3. Inpainting
The IF inpainting pipeline is the same as the image variation pipeline,
except that only a selected area of the image is denoised.
We specify the area to inpaint with an image mask.
Let's showcase IF's amazing "letter generation" capabilities. We can
replace the sign's text with a different slogan.
First let’s download the image
import requests
url = "https://i.imgflip.com/5j6x75.jpg"
response = requests.get(url)
and turn it into a PIL Image
from PIL import Image
from io import BytesIO
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image = original_image.resize((512, 768))
original_image
We will mask the sign so we can replace its text.
For convenience, we have pre-generated the mask and loaded it into an HF
dataset.
Let’s download it.
from huggingface_hub import hf_hub_download
mask_image = hf_hub_download("diffusers/docs-images", repo_type="dataset", filename="if/sign_man_mask.png")
mask_image = Image.open(mask_image)
mask_image
💡 Note: You can create masks yourself by manually creating a
greyscale image.
from PIL import Image
import numpy as np

height = 64
width = 64

# Use uint8 so the mask values (0 and 255) map directly to a greyscale image
example_mask = np.zeros((height, width), dtype=np.uint8)
example_mask[20:30, 30:40] = 255

# Make sure to create the mask in mode 'L' (greyscale)
example_mask = Image.fromarray(example_mask, mode='L')
example_mask
Now we are able to start inpainting 🎨🖌
3.1. Text Encoder
Again, we load the text encoder first
from transformers import T5EncoderModel
text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    subfolder="text_encoder",
    device_map="auto",
    load_in_8bit=True,
    variant="8bit"
)
This time, we initialize the IFInpaintingPipeline inpainting pipeline
with the text encoder weights.
from diffusers import IFInpaintingPipeline
pipe = IFInpaintingPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,
    unet=None,
    device_map="auto"
)
Alright, let's have the man advertise for more layers instead.
prompt = 'the text, "just stack more layers"'
Having defined the prompt, we can create the prompt embeddings
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
Just like before, we free the memory
del text_encoder
del pipe
flush()
3.2 Stage 1: The main diffusion process
Just like before, we now load the stage 1 pipeline with only the UNet.
pipe = IFInpaintingPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=None,
    variant="fp16",
    torch_dtype=torch.float16,
    device_map="auto"
)
Now, we need to pass the input image, the mask image, and the prompt
embeddings.
image = pipe(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images
Let's take a look at the intermediate output.
pil_image = pt_to_pil(image)
pipe.watermarker.apply_watermark(pil_image, pipe.unet.config.sample_size)
pil_image[0]
Looks good! The text is pretty consistent!
Let's free the memory so we can upscale the image.
del pipe
flush()
3.3 Stage 2: Super Resolution
For super resolution, load the checkpoint with
IFInpaintingSuperResolutionPipeline.
from diffusers import IFInpaintingSuperResolutionPipeline
pipe = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0",
    text_encoder=None,
    variant="fp16",
    torch_dtype=torch.float16,
    device_map="auto"
)
The inpainting super resolution pipeline requires the generated image,
the original image, the mask image, and the prompt embeddings.
Let's do a final denoising run.
image = pipe(
    image=image,
    original_image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
).images[0]
image
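If you would like to compare against the original meme, here is a quick side-by-side sketch (resizing the result to the original's dimensions):
from PIL import Image

# Paste the original and the inpainted result next to each other
comparison = Image.new("RGB", (original_image.width * 2, original_image.height))
comparison.paste(original_image, (0, 0))
comparison.paste(image.resize(original_image.size), (original_image.width, 0))
comparison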
Nice, the model generated text without making a single
spelling error!
Conclusion
IF in 32-bit floating point precision uses 40 GB of weights in total. We
showed how, using only open-source models and libraries, IF can be run on
a free-tier Google Colab instance.
The ML ecosystem benefits deeply from the sharing of open tools and open
models. This notebook alone used models from DeepFloyd, StabilityAI, and
Google. The libraries used — Diffusers, Transformers, Accelerate, and
bitsandbytes — all benefit from countless contributors from different
organizations.
A big thank you to the DeepFloyd team for the creation and open-sourcing
of IF, and for contributing to the democratization of good
machine learning 🤗.











