Generative AI: Agents of change. What is Stable Diffusion? How can we use it?


How we can use generative AI to explore designs of future Volvo cars.

Ever wondered what would happen if you let your computer dream? This question caught my eye and got me obsessed. For nearly two years now I have been exploring how feeding a machine with tons of images can allow it to think, and then create something truly fascinating. A new realm of digital art is pushing the boundaries of creativity and revolutionizing the way art is created.

I’m Vivek Vivian, currently working with the exploratory design team at Volvo Cars. Working with various kinds of data helped me realize that traditional visualization techniques, like bar graphs and charts, are boring and that a change is required. This led me to seek out creative ways of expressing data using real-time visual programming tools and multimedia. It also pushed me into generative AI, because it was amazing to peek into a world where we get to see how the machine thinks. I feel that synthetic intelligences will solve some of the world’s most pressing problems in the near future.

These powerful AI image generators were born by combining deep neural networks that generate images with language models that let the user supply a text input, the prompt. These algorithms can learn from billions of images and then generate images in varied styles, based on the user’s words. Models like DALL-E 2, Midjourney, and Stable Diffusion are among the leading image generators currently available.

I’m currently collaborating with the Design Visualization team at Volvo Cars to explore ways of applying new technologies and techniques for co-creation and creativity. These tools allow us to make real-time changes to the environment, color, and materials in which the cars are viewed, providing fine control and making the process efficient and user-friendly. Not only can we use this approach to create 360° panoramas for virtual reality (VR), we can also generate environments in which to view objects, interact with them, and enhance aesthetics.

In the sections below we dig further into Stable Diffusion: what it is and how it works, followed by a use case that I’m experimenting with.

Stable Diffusion uses a kind of diffusion model (DM) called a latent diffusion model (LDM).[1] Diffusion models, introduced in 2015, are trained to remove successive applications of Gaussian noise from training images, and can be viewed as a sequence of denoising autoencoders. Stable Diffusion consists of three components: the variational autoencoder (VAE), the U-Net, and an optional text encoder.[2]

The VAE encoder compresses the image from pixel space into a lower-dimensional latent space, capturing the image’s underlying semantic meaning.[3] During forward diffusion, Gaussian noise is iteratively applied to the compressed latent representation. The U-Net block, which is built on a ResNet backbone, denoises the output of forward diffusion, running the process in reverse to obtain a latent representation. Finally, the VAE decoder generates the final image by converting that representation back into pixel space.[2] The denoising process can be conditioned on a variety of inputs, such as text, images, or other modalities, through a cross-attention mechanism that exposes the encoded conditioning data to the denoising U-Net.[2]
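
The forward-diffusion step described above can be sketched in a few lines. This is a minimal illustration, not Stable Diffusion's actual code: it assumes a linear noise schedule (the values of `beta_start`, `beta_end`, and the number of steps are illustrative assumptions), uses a flat list of floats as a stand-in for the VAE latent, and all function names are hypothetical. It relies on the standard closed form from the diffusion literature, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (assumed values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) over steps 0..t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(ab_t) * x_0 + sqrt(1 - ab_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [signal * x + sigma * e for x, e in zip(latent, noise)]
    return noisy, noise


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    latent = [1.0] * 8  # toy stand-in for a compressed image latent
    for t in (0, 499, 999):
        noisy, _ = add_noise(latent, t, alpha_bars, rng)
        print(t, round(alpha_bars[t], 4), [round(v, 2) for v in noisy[:3]])
```

As `t` grows, `alpha_bars[t]` shrinks towards zero, so the latent is gradually replaced by pure Gaussian noise. The U-Net is trained to predict that noise so the process can be run in reverse.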

For text conditioning, researchers use the fixed, pretrained CLIP ViT-L/14 text encoder to transform text prompts into an embedding space.[1] According to the researchers’ findings, LDMs are more computationally efficient for both training and generation.[3][4]

The Stable Diffusion model was trained on images and captions taken from LAION-5B, a publicly available dataset. The model runs in under 10 GB of VRAM on consumer GPUs, generating images at 512×512 pixels in just a few seconds.
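
To get a feel for why working in latent space is cheaper, the arithmetic below compares a 512×512 RGB image with its latent. The downsampling factor of 8 and the 4 latent channels are assumptions here: they correspond to the `--f` and `--C` options listed later in this article and are common values for Stable Diffusion v1, but check your model's config. The function names are hypothetical.

```python
def latent_shape(height, width, channels=4, factor=8):
    """Shape (C, H/f, W/f) of the latent for an image of the given size."""
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the factor")
    return (channels, height // factor, width // factor)


def compression_ratio(height, width, channels=4, factor=8, image_channels=3):
    """How many pixel-space values map onto one latent value."""
    c, h, w = latent_shape(height, width, channels, factor)
    return (height * width * image_channels) / (c * h * w)


if __name__ == "__main__":
    print(latent_shape(512, 512))       # (4, 64, 64)
    print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions the U-Net operates on 48 times fewer values than it would in pixel space, which is a large part of why the model fits on consumer GPUs.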

Unlike models like DALL-E, Stable Diffusion makes its source code available, together with the model (pretrained weights). It applies the Creative ML OpenRAIL-M license, a type of Responsible AI License (RAIL), to the model (M).[5]

At the Volvo Cars Open Innovation Arena, we’re experimenting with these technologies and trying to see how they could act as agents of change. We’re able to set up these networks and models on custom-built servers, which allows us to prototype and work on a proof of concept.

One of the experiments is explained below:

Diffusion models can be used to create various kinds of environments and backgrounds for visualization. The example here was created using the ControlNet depth algorithm together with a custom-trained diffusion model. ControlNet takes an input (image/prompt), generates a depth map of the input, and then uses the text prompt to guide the image towards the desired result.

ControlNet is a neural network structure for controlling diffusion models by adding extra conditions. This allows us to control specific parts of the image and helps us generate specific things. Depth, canny, HED, and scribble, to name just a few, are different ControlNet algorithms that can be used together with Stable Diffusion.
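
A ControlNet condition is an extra image derived from the input, such as an edge map or a depth map. The sketch below builds a crude edge map with a Sobel gradient and a threshold. It is a simplified stand-in for illustration only: it is not the Canny algorithm (there is no smoothing, non-maximum suppression, or hysteresis) and it is not ControlNet's own preprocessing. The function name and the threshold value are hypothetical.

```python
def sobel_edge_map(image, threshold=1.0):
    """Return a 0/1 edge map for a 2D grayscale image (list of rows).

    Border pixels are left at 0 because the 3x3 kernel does not fit there.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal and vertical Sobel responses.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            magnitude = (gx * gx + gy * gy) ** 0.5
            edges[y][x] = 1 if magnitude >= threshold else 0
    return edges


if __name__ == "__main__":
    # Toy image: dark left half, bright right half.
    image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
    for row in sobel_edge_map(image):
        print(row)
```

The resulting map marks where the structure of the input changes. A map of this kind, passed alongside the text prompt, is what lets the diffusion model keep the layout of the input while changing its appearance.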

Using ControlNet together with diffusion models helps us get accurate control of the image. Below is an example of using the depth and canny ControlNet maps to generate panoramas.

The generated 360° images can then be viewed in a 360° image viewer, or used as a skybox in Unity3D and viewed on a VR headset. Interactable 3D objects can be added to the scene to create a fully immersive experience.
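
For context on how a 360° image wraps around the viewer, here is a minimal sketch that maps a viewing direction to a pixel in the panorama. It assumes the panorama is stored in the common equirectangular layout, which the article does not state. The axis convention (y up, z forward, x right) and the function name are also assumptions for illustration; a 360° viewer or skybox shader performs an equivalent lookup internally.

```python
import math


def direction_to_pixel(x, y, z, width, height):
    """Map a 3D view direction to (column, row) in an equirectangular image.

    Assumed convention: y is up, z is forward, x is right.
    """
    length = math.sqrt(x * x + y * y + z * z)
    if length == 0:
        raise ValueError("direction must be non-zero")
    x, y, z = x / length, y / length, z / length
    longitude = math.atan2(x, z)   # -pi..pi, 0 = straight ahead
    latitude = math.asin(y)        # -pi/2..pi/2, 0 = horizon
    u = (longitude / (2 * math.pi) + 0.5) * width
    v = (0.5 - latitude / math.pi) * height
    # Clamp so directions exactly on the seam or poles stay inside the image.
    col = min(int(u), width - 1)
    row = min(int(v), height - 1)
    return col, row


if __name__ == "__main__":
    # Looking straight ahead lands in the middle of the panorama.
    print(direction_to_pixel(0, 0, 1, 2048, 1024))
```

With this layout, the horizontal axis of the image covers a full turn around the viewer and the vertical axis runs from straight up to straight down.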

Last but not least, let’s take a short tour of how we can set up Stable Diffusion models locally on your machine, or use a pre-made setup, and get them up and running.

There are many ways of doing this; here are two:

  1. If you have system constraints such as low GPU memory or disk space, the option is to use the Stable Diffusion setup on Hugging Face.
  2. To set it up locally, clone the Stable Diffusion GitHub repository to your local PC.
git clone

2.1. Once it is cloned into the desired directory, open the Stable Diffusion folder and go into the models folder. The models can be downloaded from Hugging Face. There are multiple models, for example stable-diffusion-v1.1, stable-diffusion-v1.2, stable-diffusion-v1.3, stable-diffusion-v1.4, and stable-diffusion-v2.1 (which needs an additional .yaml file, downloaded according to the model version and placed in the same folder as the model). After downloading the models, place them in the models folder in your Stable Diffusion directory.

NOTE: Please check the versions of CUDA and PyTorch required to run the models.
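
One way to check the installed versions from Python is sketched below. It reads `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`, and falls back gracefully when PyTorch is not installed. The helper name `describe_torch_env` is hypothetical, and which versions a given model needs is something to look up in that model's own documentation.

```python
def describe_torch_env():
    """Collect PyTorch/CUDA version info, or report that torch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda_build_version": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as false on a machine with an NVIDIA GPU, the installed PyTorch build and the GPU driver are worth checking before trying to run the model.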

2.2. You can run the following and enjoy your creations:

python scripts/ --prompt "a photograph of an astronaut riding a horse" --plms 
usage: [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--plms] [--laion400m] [--fixed_code] [--ddim_eta DDIM_ETA]
[--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS] [--scale SCALE] [--from-file FROM_FILE] [--config CONFIG] [--ckpt CKPT]
[--seed SEED] [--precision {full,autocast}]

optional arguments:
-h, --help show this help message and exit
--prompt [PROMPT] the prompt to render
--outdir [OUTDIR] dir to write results to
--skip_grid do not save a grid, only individual samples. Helpful when evaluating lots of samples
--skip_save do not save individual samples. For speed measurements.
--ddim_steps DDIM_STEPS number of ddim sampling steps
--plms use plms sampling
--laion400m uses the LAION400M model
--fixed_code if enabled, uses the same starting code across samples
--ddim_eta DDIM_ETA ddim eta (eta=0.0 corresponds to deterministic sampling
--n_iter N_ITER sample this often
--H H image height, in pixel space
--W W image width, in pixel space
--C C latent channels
--f F downsampling factor
--n_samples N_SAMPLES how many samples to produce for each given prompt. A.k.a. batch size
--n_rows N_ROWS rows in the grid (default: n_samples)
--scale SCALE unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
--from-file FROM_FILE if specified, load prompts from this file
--config CONFIG path to config which constructs model
--ckpt CKPT path to checkpoint of model
--seed SEED the seed (for reproducible sampling)
--precision {full,autocast} evaluate at this precision
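
The formula in the `--scale` help text can be read as a per-element blend of two noise predictions. The sketch below applies it to plain lists of floats. In the real sampler the two predictions come from the U-Net (once with the prompt, once with an empty prompt), which is not modeled here, and the function name is hypothetical.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0, -1.0]
    cond = [1.0, 1.0, 0.0]
    for scale in (0.0, 1.0, 7.5):
        print(scale, guided_noise(uncond, cond, scale))
```

A scale of 0 ignores the prompt, a scale of 1 returns the prompt-conditioned prediction unchanged, and larger values push the result further in the direction the prompt suggests.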

Finally, it would be great if you shared what you create with us. Have fun!

