🤗 Diffusers is happy to celebrate its first anniversary! It has been an exciting year, and we’re proud and grateful for how far we’ve come thanks to our community and open-source contributors. Last year, text-to-image models like DALL-E 2, Imagen, and Stable Diffusion captured the world’s attention with their ability to generate stunningly photorealistic images from text, sparking a massive surge of interest and development in generative AI. But access to these powerful models was limited.
At Hugging Face, our mission is to democratize good machine learning by collaborating and helping each other build an open and ethical AI future together. That mission motivated us to create the 🤗 Diffusers library so everyone can experiment, research, or simply play with text-to-image models. That’s why we designed the library as a modular toolbox, so you can customize a diffusion model’s components or just start using it out-of-the-box.
As 🤗 Diffusers turns 1, here’s an overview of some of the most notable features we’ve added to the library with the help of our community. We are proud and immensely grateful to be part of an engaged community that promotes accessible usage, pushes diffusion models beyond just text-to-image generation, and is an all-around inspiration.
Striving for photorealism
Generative AI models are known for creating photorealistic images, but if you look closely, you may notice things that don’t look quite right, like extra fingers on a hand. This year, the DeepFloyd IF and Stability AI SDXL models made a splash by making generated images even more photorealistic.
DeepFloyd IF – A modular diffusion model that includes different stages for generating an image (for example, an image is upscaled 3x to produce a higher-resolution image). Unlike Stable Diffusion, the IF model works directly at the pixel level, and it uses a large language model to encode text.
Stable Diffusion XL (SDXL) – The latest Stable Diffusion model from Stability AI, with significantly more parameters than its predecessor Stable Diffusion 2. It generates hyper-realistic images, leveraging a base model for close adherence to the prompt and a refiner model specialized in fine details and high-frequency content.
Head over to the DeepFloyd IF docs and the SDXL docs today to learn how to start generating your own images!
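As a quick taste, here is a minimal sketch of generating an image with the SDXL base model (assuming the stabilityai/stable-diffusion-xl-base-1.0 checkpoint and a CUDA GPU; the refiner step is omitted for brevity):

```python
import torch
from diffusers import DiffusionPipeline

# Load the SDXL base model in half precision
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.to("cuda")

prompt = "An astronaut riding a green horse"
image = pipe(prompt=prompt).images[0]
image.save("astronaut.png")
```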
Video pipelines
Text-to-image pipelines are cool, but text-to-video is even cooler! We currently support two text-to-video pipelines, VideoFusion and Text2Video-Zero.
If you’re already familiar with text-to-image pipelines, using a text-to-video pipeline is very similar:
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load a text-to-video pipeline in half precision
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
# Offload submodules to the CPU when idle to save GPU memory
pipe.enable_model_cpu_offload()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames  # generate 24 frames
video_path = export_to_video(video_frames)         # write the frames to a video file
```
We expect text-to-video to undergo a revolution during 🤗 Diffusers’ second year, and we’re excited to see what the community builds on top of these pipelines to push the boundaries of video generation from language!
Text-to-3D models
In addition to text-to-video, we also have text-to-3D generation now thanks to OpenAI’s Shap-E model. Shap-E is trained by encoding a large dataset of 3D-text pairs, and a diffusion model is conditioned on the encoder’s outputs. You can design 3D assets for video games, interior design, and architecture.
Try it out today with the ShapEPipeline and ShapEImg2ImgPipeline.
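Here is a minimal sketch of text-to-3D with the ShapEPipeline (assuming the openai/shap-e checkpoint; the prompt and generation parameters are illustrative):

```python
import torch
from diffusers import ShapEPipeline
from diffusers.utils import export_to_gif

# Load the Shap-E text-to-3D pipeline
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16)
pipe.to("cuda")

prompt = "a shark"
# Render the 3D asset as a set of turntable frames and save them as a GIF
images = pipe(prompt, guidance_scale=15.0, num_inference_steps=64, frame_size=256).images
gif_path = export_to_gif(images[0], "shark_3d.gif")
```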
Image editing pipelines
Image editing is one of the most practical use cases in fashion, material design, and photography. With diffusion models, the possibilities of image editing continue to expand.
We have many pipelines in 🤗 Diffusers to support image editing. There are pipelines that let you describe your desired edit as a prompt, pipelines for removing concepts from an image, and even a pipeline that unifies multiple generation methods to create high-quality images like panoramas. With 🤗 Diffusers, you can experiment with the future of photo editing now!
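For instance, here is a minimal sketch of instruction-based editing with the StableDiffusionInstructPix2PixPipeline (assuming the timbrooks/instruct-pix2pix checkpoint; the input file name and prompt are illustrative):

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Load the InstructPix2Pix editing pipeline
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
)
pipe.to("cuda")

# Load the image to edit and describe the desired edit as a prompt
image = load_image("mountain.png")  # illustrative file name
edited = pipe("turn the mountains into a snowy landscape", image=image).images[0]
edited.save("mountain_snowy.png")
```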
Faster diffusion models
Diffusion models are known to be time-intensive because of their iterative denoising steps. With OpenAI’s Consistency Models, the image generation process is significantly faster. Generating a single 256×256 resolution image only takes 3/4 of a second on a modern CPU! You can try this out in 🤗 Diffusers with the ConsistencyModelPipeline.
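Here is a minimal sketch of one-step generation with the ConsistencyModelPipeline (assuming the openai/diffusers-cd_imagenet64_l2 checkpoint; the class label is illustrative):

```python
import torch
from diffusers import ConsistencyModelPipeline

# Load a consistency-distilled model (ImageNet 64x64, L2 objective)
pipe = ConsistencyModelPipeline.from_pretrained(
    "openai/diffusers-cd_imagenet64_l2", torch_dtype=torch.float16
)
pipe.to("cuda")

# One-step sampling: a single denoising step instead of dozens
image = pipe(num_inference_steps=1, class_labels=145).images[0]  # 145 = "king penguin" (illustrative)
image.save("consistency_sample.png")
```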
On top of speedier diffusion models, we also offer many optimization techniques for faster inference, like PyTorch 2.0’s scaled_dot_product_attention() (SDPA) and torch.compile(), sliced attention, feed-forward chunking, VAE tiling, CPU and model offloading, and more. These optimizations save memory, which translates to faster generation, and allow you to run inference on consumer GPUs. When you distribute a model with 🤗 Diffusers, all of these optimizations are immediately supported!
In addition, we also support specific hardware and formats like ONNX, the mps PyTorch device for Apple Silicon computers, Core ML, and others.
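Most of the memory optimizations are one-line toggles on a loaded pipeline. Here is a minimal sketch (assuming a Stable Diffusion checkpoint; the prompt is illustrative):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Compute attention in slices to reduce peak memory
pipe.enable_attention_slicing()
# Decode large latents tile by tile in the VAE
pipe.enable_vae_tiling()
# Keep submodules on the CPU and move them to the GPU only when needed
pipe.enable_model_cpu_offload()

image = pipe("a photo of an astronaut riding a horse").images[0]
```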
To learn more about how we optimize inference with 🤗 Diffusers, check out the docs!
Ethics and safety
Generative models are cool, but they also have the ability to produce harmful and NSFW content. To help users interact with these models responsibly and ethically, we’ve added a safety_checker component that flags inappropriate content generated during inference. Model creators can choose to incorporate this component into their models if they want.
In addition, generative models can also be used to produce disinformation. Earlier this year, the Balenciaga Pope image went viral for how realistic it looked despite being fake. This underscores the importance of, and the need for, a mechanism to distinguish between generated and human-made content. That’s why we’ve added an invisible watermark to images generated by the SDXL model, which helps users be better informed.
The development of these features is guided by our ethical charter, which you can find in our documentation.
Support for LoRA
Fine-tuning diffusion models is expensive and out of reach for most consumer GPUs. We added the Low-Rank Adaptation (LoRA) technique to close this gap. With LoRA, a method for parameter-efficient fine-tuning, you can fine-tune large diffusion models faster and consume less memory. The resulting model weights are also very lightweight compared to the original model, so you can easily share your custom models. If you want to learn more, our documentation shows how to perform fine-tuning and inference on Stable Diffusion with LoRA.
In addition to LoRA, we support other training techniques for personalized generation, including DreamBooth, textual inversion, custom diffusion, and more!
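At inference time, loading a trained LoRA on top of a base pipeline takes one extra line. Here is a minimal sketch (the LoRA repository id below is purely illustrative):

```python
import torch
from diffusers import DiffusionPipeline

# Load the base model, then apply lightweight LoRA weights on top of it
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("your-username/your-lora-weights")  # illustrative repo id

image = pipe("a portrait in the style of the LoRA").images[0]
```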
Torch 2.0 optimizations
PyTorch 2.0 introduced support for torch.compile() and scaled_dot_product_attention(), a more efficient implementation of the attention mechanism. 🤗 Diffusers provides first-class support for these features, leading to massive speedups in inference latency, which can sometimes be more than twice as fast!
In addition to visual content (images, videos, 3D assets, etc.), we also added support for audio! Check out the documentation to learn more.
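In practice, enabling this is a single call on the pipeline’s UNet. Here is a minimal sketch (assuming PyTorch 2.0+ and a CUDA GPU; SDPA is used automatically when available):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compile the UNet; the first call is slow (compilation), subsequent calls are much faster
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = pipe("a photo of an astronaut riding a horse").images[0]
```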
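For example, here is a minimal sketch of text-to-audio with the AudioLDM pipeline (assuming the cvssp/audioldm-s-full-v2 checkpoint; the prompt and output handling are illustrative):

```python
import torch
from diffusers import AudioLDMPipeline
from scipy.io import wavfile

# Load the AudioLDM text-to-audio pipeline
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16)
pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# AudioLDM generates 16 kHz audio
wavfile.write("techno.wav", rate=16000, data=audio)
```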
Community highlights
One of the most gratifying experiences of the past year has been seeing how the community is incorporating 🤗 Diffusers into their projects. From adapting Low-Rank Adaptation (LoRA) for faster training of text-to-image models to building a state-of-the-art inpainting tool, here are a few of our favorite projects:
We built Core ML Stable Diffusion to make it easier for developers to add state-of-the-art generative AI capabilities to their iOS, iPadOS and macOS apps with the highest efficiency on Apple Silicon. We built on top of 🤗 Diffusers instead of from scratch, as 🤗 Diffusers consistently stays on top of a rapidly evolving field and promotes much-needed interoperability of new and old ideas.
🤗 Diffusers has been absolutely developer-friendly for me to dive right into stable diffusion models. The main differentiating factor is that the 🤗 Diffusers implementation is not just code dropped from a research lab, which is often focused mostly on velocity. While research code is frequently poorly written and hard to understand (lack of typing, assertions, inconsistent design patterns and conventions), 🤗 Diffusers was a breeze to use and let me hack on my ideas within a couple of hours. Without it, I would have needed to invest significantly more time to get started. The well-written documentation and examples are extremely helpful as well.
BentoML is the unified framework for building, shipping, and scaling production-ready AI applications incorporating traditional ML, pre-trained AI models, and Generative and Large Language Models. All Hugging Face Diffusers models and pipelines can be seamlessly integrated into BentoML applications, enabling models to run on the most suitable hardware and scale independently based on usage.
Invoke AI is an open-source Generative AI tool built to empower professional creatives, from game designers and photographers to architects and product designers. Invoke recently launched their hosted offering at invoke.ai, allowing users to generate assets from any computer, powered by the latest research in open-source.
TaskMatrix connects a Large Language Model and a series of Visual Models to enable sending and receiving images during chatting.
Lama Cleaner is a powerful image inpainting tool that uses Stable Diffusion technology to remove unwanted objects, defects, or people from your pictures. It can also erase and replace anything in your images with ease.

Grounded-SAM combines a powerful zero-shot detector, Grounding-DINO, with the Segment-Anything-Model (SAM) to build a strong pipeline for detecting and segmenting everything with text inputs. When combined with 🤗 Diffusers inpainting models, Grounded-SAM can perform highly controllable image editing tasks, including replacing specific objects, inpainting the background, etc.
Stable-Dreamfusion leverages the convenient implementations of 2D diffusion models in 🤗 Diffusers to replicate recent text-to-3D and image-to-3D methods.
MMagic (Multimodal Advanced, Generative, and Intelligent Creation) is an advanced and comprehensive Generative AI toolbox that provides state-of-the-art AI models (e.g., diffusion models powered by 🤗 Diffusers and GANs) to synthesize, edit and enhance images and videos. In MMagic, users can use rich components to customize their own models like playing with Legos and manage the training loop easily.
Tune-A-Video, developed by Jay Zhangjie Wu and his team at Show Lab, is the first method to fine-tune a pre-trained text-to-image diffusion model using a single text-video pair, and it enables changing video content while preserving motion.
We also collaborated with Google Cloud (who generously provided the compute) to offer technical guidance and mentorship to help the community train diffusion models with TPUs (check out a summary of the event here). There were many cool models, such as this demo that combines ControlNet with Segment Anything.
Finally, we were delighted to receive contributions to our codebase from over 300 contributors, which allowed us to collaborate in the most open way possible. Here are just a few of the contributions from our community:
Besides these, a heartfelt shoutout to the following contributors who helped us ship some of the strongest features of Diffusers (in no particular order):
Building products with 🤗 Diffusers
Over the last year, we also saw many companies choosing to build their products on top of 🤗 Diffusers. Here are a couple of products that have caught our attention:
- PlaiDay: “PlaiDay is a Generative AI experience where people collaborate, create, and connect. Our platform unlocks the limitless creativity of the human mind, and provides a secure, fun social canvas for expression.”
- Previs One: “Previs One is a diffuser pipeline for cinematic storyboarding and previsualization — it understands film and tv compositional rules just as a director would speak them.”
- Zust.AI: “We leverage Generative AI to create studio-quality product photos for brands and marketing agencies.”
- Dashtoon: “Dashtoon is building a platform to create and consume visual content. We have multiple pipelines that load multiple LoRAs, multiple ControlNets, and even multiple models powered by diffusers. Diffusers has made the gap between a product engineer and an ML engineer super low, allowing Dashtoon to ship user value faster and better.”
- Virtual Staging AI: “Filling empty rooms with beautiful furniture using generative models.”
- Hexo.AI: “Hexo AI helps brands get better ROI on marketing spend through Personalized Marketing at Scale. Hexo is building a proprietary campaign generation engine which ingests customer data and generates brand-compliant personalized creatives.”
If you’re building products on top of 🤗 Diffusers, we’d love to chat to understand how we can make the library better together! Feel free to reach out to patrick@hf.co or sayak@hf.co.
Looking forward
As we celebrate our first anniversary, we’re grateful to our community and open-source contributors who have helped us come so far in such a short time. We’re happy to share that we’ll be presenting a 🤗 Diffusers demo at ICCV 2023 this fall – if you’re attending, do come and see us! We’ll continue to develop and improve our library, making it easier for everyone to use. We’re also excited to see what the community will create next with our tools and resources. Thank you for being a part of our journey so far, and we look forward to continuing to democratize good machine learning together! 🥳
❤️ Diffusers team
Acknowledgements: Thank you to Omar Sanseviero, Patrick von Platen, and Giada Pistilli for their reviews, and Chunte Lee for designing the thumbnail.
