Instruction-tuning Stable Diffusion with InstructPix2Pix

Sayak Paul


This post explores instruction-tuning to teach Stable Diffusion to follow instructions to translate or process input images. With this method, we can prompt Stable Diffusion using an input image and an “instruction”, such as: Apply a cartoon filter to the natural image.

schematic
Figure 1: We explore the instruction-tuning capabilities of Stable Diffusion. In this figure, we prompt an instruction-tuned Stable Diffusion system with prompts involving different transformations and input images. The tuned system seems to be able to learn the transformations stated in the input prompts. Figure best viewed in color and zoomed in.

This concept of teaching Stable Diffusion to follow user instructions to perform edits on input images was introduced in InstructPix2Pix: Learning to Follow Image Editing Instructions. We discuss how to extend the InstructPix2Pix training approach to follow more specific instructions related to tasks in image translation (such as cartoonization) and low-level image processing (such as image deraining). We cover: dataset preparation, training experiments and results, potential applications and limitations, and open questions.

Our code, pre-trained models, and datasets can be found here.



Introduction and motivation

Instruction-tuning is a supervised way of teaching language models to follow instructions to solve a task. It was introduced in Fine-tuned Language Models Are Zero-Shot Learners (FLAN) by Google. From recent times, you might recall works like Alpaca and FLAN V2, which are good examples of how beneficial instruction-tuning can be for various tasks.

The figure below shows a formulation of instruction-tuning (also called “instruction-finetuning”). In the FLAN V2 paper, the authors take a pre-trained language model (T5, for example) and fine-tune it on a dataset of exemplars, as shown in the figure below.

flan_schematic
Figure 2: FLAN V2 schematic (figure taken from the FLAN V2 paper).

With this approach, one can create exemplars covering many different tasks, which makes instruction-tuning a multi-task training objective:

| Input | Label | Task |
|---|---|---|
| Predict the sentiment of the following sentence: “The movie was pretty amazing. I couldn’t turn around my eyes even for a second.” | Positive | Sentiment analysis / Sequence classification |
| Please answer the following question. What’s the boiling point of Nitrogen? | -320.4F | Question answering |
| Translate the following English sentence into German: “I have a cat.” | Ich habe eine Katze. | Machine translation |

Using a similar philosophy, the authors of FLAN V2 conduct instruction-tuning on a mixture of thousands of tasks and achieve zero-shot generalization to unseen tasks:

flan_dataset_overview
Figure 3: FLAN V2 training and test task mixtures (figure taken from the FLAN V2 paper).

Our motivation behind this work comes partly from the FLAN line of work and partly from InstructPix2Pix. We wanted to explore whether it is possible to prompt Stable Diffusion with specific instructions and input images to process them as per our needs.

The pre-trained InstructPix2Pix models are good at following general instructions, but they may fall short of following instructions involving specific transformations:

cartoonization_results
Figure 4: We observe that for the input images (left column), our models (right column) perform “cartoonization” more faithfully compared to the pre-trained InstructPix2Pix models (middle column). It is interesting to note the results of the first row, where the pre-trained InstructPix2Pix model fails significantly. Figure best viewed in color and zoomed in. See original here.

But we can still leverage the findings from InstructPix2Pix to suit our customizations.

On the other hand, paired datasets for tasks like cartoonization, image denoising, image deraining, etc. are publicly available, which we can use to build instruction-prompted datasets, taking inspiration from FLAN V2. Doing so allows us to transfer the instruction-templating ideas explored in FLAN V2 to this work.



Dataset preparation



Cartoonization

In our early experiments, we prompted InstructPix2Pix to perform cartoonization, and the results did not meet our expectations. We tried various inference-time hyperparameter combinations (such as image guidance scale and the number of inference steps), but the results still were not compelling. This motivated us to approach the problem differently.

As hinted in the previous section, we wanted to benefit from both worlds:

(1) the training methodology of InstructPix2Pix and
(2) the flexibility of creating instruction-prompted dataset templates from FLAN.

We started by creating an instruction-prompted dataset for the task of cartoonization. Figure 5 presents our dataset creation pipeline:

itsd_data_wheel
Figure 5: An overview of our dataset creation pipeline for cartoonization (best viewed in color and zoomed in).

In particular, we:

  1. Ask ChatGPT to generate 50 synonymous sentences for the following instruction: “Cartoonize the image.”
  2. We then use a random subset (5000 samples) of the Imagenette dataset and leverage a pre-trained Whitebox CartoonGAN model to produce the cartoonized renditions of those images. The cartoonized renditions are the labels we want our model to learn from. So, in a way, this corresponds to transferring the biases learned by the Whitebox CartoonGAN model to our model.
  3. Then we create our exemplars in the following format (a code sketch of this pipeline follows Figure 6):
cartoonization_dataset_overview
Figure 6: Samples from the final cartoonization dataset (best viewed in color and zoomed in).
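
Below is a minimal sketch of this dataset creation pipeline in code. It is not our exact preparation script: the Hub dataset id (`frgfm/imagenette`), the column names, the truncated instruction list, and the `cartoonize()` helper (standing in for Whitebox CartoonGAN inference) are placeholders and assumptions.

```python
# Sketch: pair Imagenette images with cartoonized renditions and a randomly
# sampled ChatGPT-generated instruction, then write everything to disk.
import json
import random
from pathlib import Path

from datasets import load_dataset

instructions = [
    "Cartoonize the image.",
    "Apply a cartoon filter to the image.",
    # ... the remaining ChatGPT-generated synonymous instructions
]


def cartoonize(image):
    """Placeholder for inference with a pre-trained Whitebox CartoonGAN model."""
    raise NotImplementedError


out_dir = Path("cartoonization_dataset")
out_dir.mkdir(exist_ok=True)

# Take a random 5000-sample subset of Imagenette (dataset id/config assumed).
samples = load_dataset("frgfm/imagenette", "320px", split="train").shuffle(seed=42).select(range(5000))

with open(out_dir / "metadata.jsonl", "w") as f:
    for i, example in enumerate(samples):
        original = example["image"]
        cartoon = cartoonize(original)
        original.save(out_dir / f"{i}_original.png")
        cartoon.save(out_dir / f"{i}_cartoonized.png")
        record = {
            "instruction": random.choice(instructions),
            "original_image": f"{i}_original.png",
            "cartoonized_image": f"{i}_cartoonized.png",
        }
        f.write(json.dumps(record) + "\n")
```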

Our final dataset for cartoonization can be found here. For more details on how the dataset was prepared, refer to this directory. We experimented with this dataset by fine-tuning InstructPix2Pix and got promising results (more details in the “Training experiments and results” section).

We then proceeded to see if we could generalize this approach to low-level image processing tasks such as image deraining, image denoising, and image deblurring.



Low-level image processing

We focus on the common low-level image processing tasks explored in MAXIM. In particular, we conduct our experiments for the following tasks: deraining, denoising, low-light image enhancement, and deblurring.

We took different numbers of samples from the following datasets for each task and constructed a single dataset with prompts added like so:

| Task | Prompt | Dataset | Number of samples |
|---|---|---|---|
| Deblurring | “deblur the blurry image” | REDS (train_blur and train_sharp) | 1200 |
| Deraining | “derain the image” | Rain13k | 686 |
| Denoising | “denoise the noisy image” | SIDD | 8 |
| Low-light image enhancement | “enhance the low-light image” | LOL | 23 |

The datasets mentioned above typically come as input-output pairs, so we do not need to worry about the ground truth. Our final dataset is available here. The final dataset looks like so:

low_level_img_proc_dataset_overview
Figure 7: Samples from the final low-level image processing dataset (best viewed in color and zoomed in).

Overall, this setup helps draw parallels with the FLAN setup, where we create a mixture of different tasks. It also lets us train a single model once that performs well across the different tasks in the mixture. This differs significantly from what is typically done in low-level image processing. Works like MAXIM introduce a single model architecture capable of modeling the different low-level image processing tasks, but training happens independently on the individual datasets.
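
As a concrete illustration of this mixture construction, here is a minimal sketch. It is not our exact script: the `load_pairs()` helper (standing in for sampling from REDS, Rain13k, SIDD, and LOL) and the column names are hypothetical placeholders.

```python
# Sketch: attach a fixed per-task prompt to each input/ground-truth pair and
# merge everything into a single multi-task dataset.
from datasets import Dataset, concatenate_datasets

TASK_PROMPTS = {
    "deblurring": "deblur the blurry image",
    "deraining": "derain the image",
    "denoising": "denoise the noisy image",
    "low_light_enhancement": "enhance the low-light image",
}


def load_pairs(task):
    """Placeholder: return (input_image_path, ground_truth_image_path) tuples
    sampled from the source dataset corresponding to `task`."""
    raise NotImplementedError


per_task = []
for task, prompt in TASK_PROMPTS.items():
    pairs = load_pairs(task)
    per_task.append(
        Dataset.from_dict(
            {
                "instruction": [prompt] * len(pairs),
                "input_image": [inp for inp, _ in pairs],
                "ground_truth_image": [gt for _, gt in pairs],
            }
        )
    )

# Shuffle so that tasks are interleaved during training.
low_level_mixture = concatenate_datasets(per_task).shuffle(seed=42)
```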



Training experiments and results

We based our training experiments on this script. Our training logs (including validation samples and training hyperparameters) are available on Weights and Biases.

When training, we explored two options:

  1. Fine-tuning from an existing InstructPix2Pix checkpoint
  2. Fine-tuning from an existing Stable Diffusion checkpoint using the InstructPix2Pix training methodology

In our experiments, we found that the first option helps us adapt to our datasets faster (in terms of generation quality).

For more details on the training and hyperparameters, we encourage you to check out our code and the respective run pages on Weights and Biases.



Cartoonization results

For testing the instruction-tuned cartoonization model, we compared the outputs as follows:

cartoonization_full_results
Figure 8: We compare the results of our instruction-tuned cartoonization model (last column) with those of a CartoonGAN model (column two) and the pre-trained InstructPix2Pix model (column three). It is evident that the instruction-tuned model can more faithfully match the outputs of the CartoonGAN model. Figure best viewed in color and zoomed in. See original here.

To gather these results, we sampled images from the validation split of ImageNette. We used the following prompt with both our model and the pre-trained InstructPix2Pix model: “Generate a cartoonized version of the image”. For these two models, we kept the image_guidance_scale and guidance_scale at 1.5 and 7.0, respectively, and the number of inference steps at 20. More experimentation is certainly needed around these hyperparameters to study how they affect the results of the pre-trained InstructPix2Pix model in particular.
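
For reference, here is a hedged sketch of how such a comparison can be run with diffusers’ StableDiffusionInstructPix2PixPipeline, using the hyperparameters above. The input image URL is a placeholder; “timbrooks/instruct-pix2pix” is the pre-trained InstructPix2Pix checkpoint, and you would substitute one of our fine-tuned checkpoints to reproduce the last column of Figure 8.

```python
# Sketch: run instruction-prompted editing with a (pre-trained or fine-tuned)
# InstructPix2Pix-style checkpoint.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

model_id = "timbrooks/instruct-pix2pix"  # or a fine-tuned cartoonization checkpoint
pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

image = load_image("https://example.com/sample.png")  # placeholder input image URL

edited = pipeline(
    "Generate a cartoonized version of the image",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.0,
).images[0]
edited.save("cartoonized.png")
```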

More comparative results are available here. Our code for comparing these models is available here.

Our model, however, fails to produce the expected outputs for the ImageNette classes it has not seen enough of during training. This is somewhat expected, and we believe it could be mitigated by scaling up the training dataset.



Low-level image processing results

For low-level image processing (our model), we follow the same inference-time hyperparameters as above:

  • Number of inference steps: 20
  • Image guidance scale: 1.5
  • Guidance scale: 7.0

For deraining, our model provides compelling results compared to the ground truth and the output of the pre-trained InstructPix2Pix model:

deraining_results
Figure 9: Deraining results (best viewed in color and zoomed in). Inference prompt: “derain the image” (same as the training set). See original here.

However, for low-light image enhancement, it leaves a lot to be desired:

image_enhancement_results
Figure 10: Low-light image enhancement results (best viewed in color and zoomed in). Inference prompt: “enhance the low-light image” (same as the training set). See original here.

This failure can perhaps be attributed to our model not seeing enough exemplars for the task, and possibly to the need for better training. We notice similar findings for deblurring as well:

deblurring_results
Figure 11: Deblurring results (best viewed in color and zoomed in). Inference prompt: “deblur the image” (same as the training set). See original here.

We believe there is an opportunity for the community to explore how much the task mixture for low-level image processing affects the end results. Does enriching the task mixture with more representative samples help improve the end results? We leave this question for the community to explore further.

You can try out the interactive demo to make Stable Diffusion follow specific instructions.



Potential applications and limitations

In the world of image editing, there is a disconnect between what a domain expert has in mind (the tasks to be performed) and the actions that need to be applied in editing tools (such as Lightroom). Having an easy way of translating natural language goals to low-level image editing primitives would make for a seamless user experience. With the introduction of mechanisms like InstructPix2Pix, it is safe to say that we are getting closer to that realm.

However, challenges still remain:

  • These systems need to work for large, high-resolution original images.
  • Diffusion models often invent or re-interpret an instruction to perform the modifications in the image space. For a practical image editing application, this is unacceptable.



Open questions

We acknowledge that our experiments are preliminary. We did not go deep into ablating the apparent factors in our experiments. Hence, here we list a few open questions that popped up during our experiments:

  • What happens when we scale up the datasets? How does that impact the quality of the generated samples? We experimented with a handful of examples. For comparison, InstructPix2Pix was trained on more than 30000 samples.

  • What is the impact of training for longer, especially when the task mixture is broader? In our experiments, we did not conduct hyperparameter tuning, let alone an ablation on the number of training steps.

  • How does this approach generalize to a broader mixture of tasks commonly done in the “instruction-tuning” world? We only covered four tasks for low-level image processing: deraining, deblurring, denoising, and low-light image enhancement. Does adding more tasks to the mixture with more representative samples help the model generalize to unseen tasks or, perhaps, a combination of tasks (example: “Deblur the image and denoise it”)?

  • Does using different variations of the same instruction on-the-fly help improve performance? For cartoonization, we randomly sampled an instruction from the set of ChatGPT-generated synonymous instructions during dataset creation. But what happens when we perform random sampling during training instead?

    For low-level image processing, we used fixed instructions. What happens when we follow a similar methodology of using synonymous instructions for each task and input image?

  • What happens when we use the ControlNet training setup instead? ControlNet also allows adapting a pre-trained text-to-image diffusion model to be conditioned on additional images (such as semantic segmentation maps, canny edge maps, etc.). If you are interested, you can use the datasets presented in this post and perform ControlNet training referring to this post.



Conclusion

In this post, we presented our exploration of “instruction-tuning” Stable Diffusion. While pre-trained InstructPix2Pix models are good at following general image editing instructions, they may break when presented with more specific instructions. To mitigate that, we discussed how we prepared our datasets for further fine-tuning InstructPix2Pix and presented our results. As noted above, our results are still preliminary. But we hope this work provides a basis for researchers working on similar problems and that they feel motivated to explore the open questions further.



Links

Thanks to Alara Dirik and Zhengzhong Tu for the helpful discussions. Thanks to Pedro Cuenca and Kashif Rasul for their helpful reviews on the post.



Citation

To cite this work, please use the following citation:

@article{
  Paul2023instruction-tuning-sd,
  author = {Paul, Sayak},
  title = {Instruction-tuning Stable Diffusion with InstructPix2Pix},
  journal = {Hugging Face Blog},
  year = {2023},
  note = {https://huggingface.co/blog/instruction-tuning-sd},
}




