This post explains how Stable Diffusion, a cutting-edge AI image generator, takes a user prompt and generates an image. I've done my best to use analogies to describe the mathematical processes involved, so this is by no means a rigorous academic explanation of Stable Diffusion. For a more detailed technical perspective, refer to this whitepaper.
I'm going to assume you're already familiar with Stable Diffusion as a concept and have experimented with it. If not, I recommend exploring it first to get an intuitive grasp of what Stable Diffusion and other AI text-to-image generators do.
Here's the TL;DR:
We start with a user-provided description of the desired image and a random jumble of pixels as our starting image. We use an encoder to compare the two. A gradient descent algorithm then makes the image more similar to the user's description. However, gradient descent doesn't necessarily make the image "better". That is the job of the neural net, a model that has been trained on what a "good" image should look like. We iterate through this cycle of making the image more similar to the description and then more coherent, until we end up with a coherent image that matches the user's description.
Let's dive into the details.
We start with a user-provided description of the desired image.
Emma Watson as a powerful mysterious sorceress, casting lightning magic, detailed clothing
This prompt is fed into CLIP, an encoder developed by OpenAI. An encoder takes an input (like text or the pixel values of an image) and transforms it into a list of numbers representing that input. The CLIP encoder receives a text prompt and a pixel representation of an image (a matrix of pixel values) and evaluates their similarity. The crux of CLIP's role here is its ability to mathematically gauge how well an image matches the user's text description of their envisioned image; it does not generate the image itself. This is achieved by converting both the image and the prompt into numerical representations and then comparing them using cosine similarity. Cosine similarity can be understood as a measure of how alike two lists of numbers are: the closer to +1, the more similar they are, and the closer to -1, the more dissimilar.

The starting image provided to the CLIP encoder might just be a jumble of randomly colored pixels, though Stable Diffusion may have a more refined initialization approach that I'm not aware of. Even if the initial image isn't a close match to the user's prompt, Stable Diffusion employs an iterative process in which each step refines the image, drawing it closer to the user's vision. I'll describe this iterative process next.
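To make the cosine-similarity idea concrete, here's a minimal sketch in Python. The two toy vectors below are stand-ins for the real CLIP embeddings (which have hundreds of dimensions and come from CLIP's text and image encoders), so only the comparison itself is faithful to the real system.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # +1 means the two vectors point the same way (very similar);
    # -1 means they point in opposite directions (very dissimilar).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the numerical representations CLIP would produce.
text_embedding = np.array([0.2, 0.9, -0.1, 0.4])
image_embedding = np.array([0.1, 0.8, 0.0, 0.5])

print(cosine_similarity(text_embedding, image_embedding))  # ~0.98, i.e. a good match
```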
Now that we have a starting image, which might just be a random assortment of pixels, it's essential to understand how it gets refined. The primary tool for this refinement is the gradient descent algorithm. To understand gradient descent, imagine the current/starting image as a point on a hill. The elevation of this point represents the "loss" value – the divergence between the image and the user's desired image description. Essentially, the higher we are on this hill, the more our image deviates from the desired description. Our goal, then, is to descend this hill as quickly as possible, aiming for the steepest downhill path.
In practical terms, the gradient descent algorithm achieves this by tweaking the image's pixel values. It compares the current pixel values against the loss value provided by the CLIP encoder and adjusts them to reduce that loss. Our metaphorical position on the hill descends, bringing the image into closer alignment with the user's description.
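Here's a tiny illustration of that downhill walk, assuming a made-up loss in place of the number CLIP would actually report (the real loss comes from comparing CLIP embeddings, and the real gradient comes from automatic differentiation rather than a hand-written formula):

```python
import numpy as np

def toy_clip_loss(image: np.ndarray) -> float:
    # Hypothetical stand-in for CLIP's score: here, "matching the prompt"
    # simply means every pixel value being close to 0.5.
    return float(np.sum((image - 0.5) ** 2))

image = np.random.rand(4, 4)   # a tiny "image" of randomly chosen pixel values
learning_rate = 0.3

for step in range(20):
    # The gradient of the toy loss tells us which direction is downhill
    # for every pixel; we nudge each pixel a small step in that direction.
    gradient = 2 * (image - 0.5)
    image -= learning_rate * gradient
    if step % 5 == 0:
        print(step, round(toy_clip_loss(image), 6))  # the loss keeps shrinking
```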
With this refined pixel representation, we then turn to a neural network called a UNet, a pretrained mathematical model. This model takes our improved image and further refines it, drawing on its training data to produce a more coherent picture. You can think of the UNet as a sketch artist. We provide an initial sketch, and the expert artist (UNet) uses their skills to enhance and improve upon that sketch, delivering a clearer rendition. This refining cycle continues until we achieve a satisfactory result.
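If you want to see what a single "sketch refinement" step looks like in practice, the snippet below uses the open-source diffusers library to load Stable Diffusion's UNet and ask it to predict the noise in a random latent. The checkpoint name and the random placeholder tensors are assumptions chosen for illustration; in the real pipeline the latent and the prompt embedding come from the earlier stages.

```python
import torch
from diffusers import UNet2DConditionModel

# Load the UNet from a Stable Diffusion v1 checkpoint (an illustrative choice;
# this assumes the `diffusers` library and this model repo are available).
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

latents = torch.randn(1, 4, 64, 64)        # a noisy latent "sketch" (not raw pixels)
timestep = torch.tensor(999)               # how noisy the UNet should assume the sketch is
text_embeddings = torch.randn(1, 77, 768)  # placeholder for the encoded prompt

with torch.no_grad():
    # The UNet predicts the noise present in the latent; removing (part of) that
    # predicted noise is the "clean up the sketch" step, repeated many times.
    noise_pred = unet(latents, timestep, encoder_hidden_states=text_embeddings).sample

print(noise_pred.shape)                    # same shape as the input latent
```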
From this point, the UNet's output is fed back into the CLIP encoder for another comparison against the initial text prompt. The gradient descent algorithm once again tweaks the pixel values, and the resulting image goes back through the UNet. This loop continues until the generated image's latent representation (a sequence of numbers) aligns closely enough with the original text prompt, as judged by the CLIP encoder.
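Putting the pieces together, here's a minimal sketch of the loop as described above. Every helper function is a hypothetical stand-in reduced to the simplest thing that lets the loop run, so the structure of the cycle (compare, nudge, refine, repeat) is the point rather than the math:

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_loss(image, prompt):
    # Hypothetical: how far the image is from the prompt (lower = closer).
    # Here "matching the prompt" just means pixel values near 0.5.
    return float(np.mean((image - 0.5) ** 2))

def gradient_step(image, prompt, lr=0.2):
    # Hypothetical: nudge every pixel downhill on the loss above.
    return image - lr * 2 * (image - 0.5)

def unet_refine(image):
    # Hypothetical stand-in for the UNet: lightly smooth the image.
    return 0.9 * image + 0.1 * image.mean()

def generate(prompt, steps=50, threshold=1e-4):
    image = rng.random((8, 8))                  # start from a random jumble of pixels
    for _ in range(steps):
        image = gradient_step(image, prompt)    # pull the image toward the prompt
        image = unet_refine(image)              # make the result more coherent
        if clip_loss(image, prompt) < threshold:
            break                               # close enough to the description
    return image

image = generate("Emma Watson as a powerful mysterious sorceress, casting lightning magic")
print(round(clip_loss(image, None), 6))         # the final loss is tiny
```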
That's it. If you want more detail, someone from Harvard who is much smarter than I am put together this presentation.