The Illustrated Stable Diffusion

(V2 Nov 2022: Updated images for more precise description of forward diffusion. A few more images in this version)

AI image generation is the most recent AI capability blowing people’s minds (mine included). The ability to create striking visuals from text descriptions has a magical quality to it and points clearly to a shift in how humans create art. The release of Stable Diffusion is a clear milestone in this development because it made a high-performance model available to the masses (performance in terms of image quality, as well as speed and relatively low resource/memory requirements).

After experimenting with AI image generation, you may start to wonder how it works.

This is a gentle introduction to how Stable Diffusion works.

Stable Diffusion is versatile in that it can be used in a number of different ways. Let’s focus at first on image generation from text only (text2img). The image above shows an example text input and the resulting generated image (The actual complete prompt is here). Aside from text to image, another main way of using it is by making it alter images (so inputs are text + image).

Let’s start to look under the hood because that helps explain the components, how they interact, and what the image generation options/parameters mean.

The Components of Stable Diffusion

Stable Diffusion is a system made up of several components and models. It is not one monolithic model.

As we look under the hood, the first observation we can make is that there’s a text-understanding component that translates the text information into a numeric representation that captures the ideas in the text.

We’re starting with a high-level view and we’ll get into more machine learning details later in this article. However, we can say that this text encoder is a special Transformer language model (technically: the text encoder of a CLIP model). It takes the input text and outputs a list of numbers representing each word/token in the text (a vector per token).
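
If you want to poke at this component yourself, here is a minimal sketch using the Hugging Face transformers library (assuming it is installed); the model id below is the CLIP text encoder that the v1 Stable Diffusion releases use:

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# CLIP always pads/truncates prompts to 77 tokens.
tokens = tokenizer("paradise cosmic beach", padding="max_length",
                   max_length=77, truncation=True, return_tensors="pt")
embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768]) -- one 768-dimensional vector per token
```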

That information is then presented to the Image Generator, which is composed of a couple of components itself.

The image generator goes through two stages:

1- Image information creator

This component is the secret sauce of Stable Diffusion. It’s where a lot of the performance gain over previous models is achieved.

This component runs for multiple steps to generate image information. This is the steps parameter in Stable Diffusion interfaces and libraries, which often defaults to 50 or 100.
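
As a concrete example of where that parameter shows up, here is a hedged sketch using the diffusers library, assuming it is installed and the checkpoint name below (an assumption) is available to you:

```python
from diffusers import StableDiffusionPipeline

# Any Stable Diffusion v1-style checkpoint works the same way here.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# num_inference_steps is the "steps" parameter discussed above.
image = pipe("paradise cosmic beach", num_inference_steps=50).images[0]
image.save("beach.png")
```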

The image information creator works completely in the image information space (or latent space). We’ll talk more about what that means later in the post. This property makes it faster than previous diffusion models that worked in pixel space. In technical terms, this component is made up of a UNet neural network and a scheduling algorithm.

The word “diffusion” describes what happens in this component. It is the step-by-step processing of information that leads to a high-quality image being generated in the end (by the next component, the image decoder).

2- Image Decoder

The image decoder paints a picture from the information it got from the information creator. It runs only once at the end of the process to produce the final pixel image.

With this we come to see the three main components (each with its own neural network) that make up Stable Diffusion:

  • ClipText for text encoding.
    Input: text.
    Output: 77 token embedding vectors, each in 768 dimensions.

  • UNet + Scheduler to gradually process/diffuse information in the information (latent) space.
    Input: text embeddings and a starting multi-dimensional array (a structured list of numbers, also called a tensor) made up of noise.
    Output: A processed information array.

  • Autoencoder Decoder that paints the final image using the processed information array (see the shape sketch after this list).
    Input: The processed information array (dimensions: (4, 64, 64))
    Output: The resulting image (dimensions: (3, 512, 512), which are (red/green/blue, width, height))
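
To make the hand-off between these three components concrete, here is a sketch of just the tensor shapes involved (random placeholder values, batch dimension omitted), not a working pipeline:

```python
import torch

text_embeddings = torch.randn(77, 768)   # ClipText output: 77 tokens x 768 dimensions
latents = torch.randn(4, 64, 64)         # what the UNet + scheduler refine, step by step
# ... roughly 50 diffusion steps later, the refined latents go to the decoder ...
image = torch.randn(3, 512, 512)         # autoencoder decoder output: (red/green/blue, width, height)
```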

What is Diffusion Anyway?

Diffusion is the process that takes place inside the pink “image information creator” component. Having the token embeddings that represent the input text, and a random starting image information array (these are also called latents), the process produces an information array that the image decoder uses to paint the final image.

This process happens in a step-by-step fashion. Each step adds more relevant information. To get an intuition of the process, we can inspect the random latents array, and see that it translates to visual noise. Visual inspection in this case is passing it through the image decoder.

Diffusion happens in multiple steps. Each step operates on an input latents array and produces another latents array that better resembles the input text and all the visual information the model picked up from the images it was trained on.

We can visualize a set of these latents to see what information gets added at each step.
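
Here is a sketch of that kind of inspection, assuming you have a VAE decoder at hand (like the `vae` object in the diffusers example later in this post) and have kept the latents from each step:

```python
import torch

def latents_to_pixels(vae, latents):
    # latents: (1, 4, 64, 64) -> pixels: (1, 3, 512, 512).
    # For simplicity this ignores the scaling constant that diffusers applies
    # to latents (vae.config.scaling_factor).
    with torch.no_grad():
        pixels = vae.decode(latents).sample
    return (pixels / 2 + 0.5).clamp(0, 1)   # map from [-1, 1] to [0, 1] for display

# snapshots = [latents_to_pixels(vae, step_latents) for step_latents in saved_latents]
```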

The process is quite breathtaking to look at.

Something especially fascinating happens between steps 2 and 4 in this case. It’s as if the outline emerges from the noise.

How diffusion works

The central idea of generating images with diffusion models relies on the fact that we have powerful computer vision models. Given a large enough dataset, these models can learn complex operations. Diffusion models approach image generation by framing the problem as follows:

Say we have an image. We generate some noise and add it to the image.

This can now be considered a training example. We can use this same formula to create lots of training examples to train the central component of our image generation model.

While this example shows a few noise amount values from no noise (amount 0, the original image) to total noise (amount 4), we can easily control how much noise to add to the image, and so we can spread it over tens of steps, creating tens of training examples per image for all the images in a training dataset.
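
A minimal sketch of building such training examples, assuming a DDPM-style mixing rule and a simple linear noise schedule (real schedulers are more careful about how noise levels are spaced):

```python
import torch

def make_training_example(image, amount, num_levels=50):
    """image: a tensor scaled to [-1, 1]; amount: 0 (no noise) .. num_levels-1 (all noise)."""
    noise = torch.randn_like(image)
    keep = 1.0 - amount / (num_levels - 1)          # how much of the image survives
    noisy_image = keep**0.5 * image + (1 - keep)**0.5 * noise
    # Training inputs: (noisy_image, amount); training target: the noise we added.
    return noisy_image, amount, noise

image = torch.rand(3, 64, 64) * 2 - 1
examples = [make_training_example(image, amount) for amount in range(0, 50, 5)]
```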

With this dataset, we can train the noise predictor and end up with a great noise predictor that actually creates images when run in a certain configuration. A training step should look familiar if you’ve had ML exposure:
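
A single training step might look roughly like this, assuming a `noise_predictor` network (e.g., a UNet) that accepts the noisy images and their noise amounts, and the same simplified noising rule as the sketch above:

```python
import torch
import torch.nn.functional as F

def training_step(noise_predictor, optimizer, images, num_levels=50):
    amounts = torch.randint(0, num_levels, (images.shape[0],))   # a random noise level per image
    noise = torch.randn_like(images)
    keep = (1.0 - amounts / (num_levels - 1)).view(-1, 1, 1, 1)
    noisy_images = keep**0.5 * images + (1 - keep)**0.5 * noise

    predicted_noise = noise_predictor(noisy_images, amounts)     # the model's guess
    loss = F.mse_loss(predicted_noise, noise)                    # compare against the real noise

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```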

Let’s now see how this can generate images.

Painting images by removing noise

The trained noise predictor takes a noisy image and the number of the denoising step, and predicts a slice of noise.

The sampled noise is predicted so that if we subtract it from the image, we get an image that’s closer to the images the model was trained on (not the exact images themselves, but the distribution - the world of pixel arrangements where the sky is usually blue and above the ground, people have two eyes, cats look a certain way – pointy ears and clearly unimpressed).

If the training dataset was of aesthetically pleasing images (e.g., LAION Aesthetics, which Stable Diffusion was trained on), then the resulting image would tend to be aesthetically pleasing. If we train it on images of logos, we end up with a logo-generating model.
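
A deliberately over-simplified sketch of that generation loop: start from pure noise and repeatedly remove a slice of the predicted noise. Real samplers (DDPM, DDIM, and friends) use more careful update rules than the plain subtraction shown here:

```python
import torch

@torch.no_grad()
def generate(noise_predictor, shape=(1, 3, 64, 64), steps=50):
    image = torch.randn(shape)                       # start from pure noise
    for step in reversed(range(steps)):
        t = torch.full((shape[0],), step)            # which denoising step we are on
        predicted_noise = noise_predictor(image, t)
        image = image - predicted_noise / steps      # remove one slice of noise
    return image
```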

This concludes the description of image generation by diffusion models mostly as described in Denoising Diffusion Probabilistic Models. Now that you have this intuition of diffusion, you know the main components of not only Stable Diffusion, but also Dall-E 2 and Google’s Imagen.

Note that the diffusion process we described so far generates images without using any text data. So if we deploy this model, it would generate great looking images, but we’d have no way of controlling whether it’s an image of a pyramid, a cat, or anything else. In the next sections we’ll describe how text is incorporated into the process in order to control what type of image the model generates.

Speed Boost: Diffusion on Compressed (Latent) Data Instead of the Pixel Image

To speed up the image generation process, the Stable Diffusion paper runs the diffusion process not on the pixel images themselves, but on a compressed version of the image. The paper calls this “Departure to Latent Space”.

This compression (and later decompression/painting) is done via an autoencoder. The autoencoder compresses the image into the latent space using its encoder, then reconstructs it from that compressed information using its decoder.

Now the forward diffusion process is done on the compressed latents. The slices of noise are of noise applied to those latents, not to the pixel image. And so the noise predictor is actually trained to predict noise in the compressed representation (the latent space).

The forward process (using the autoencoder’s encoder) is how we generate the data to train the noise predictor. Once it’s trained, we can generate images by running the reverse process (using the autoencoder’s decoder).
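
Here is a sketch of those two hand-offs with the diffusers library, assuming it is installed and the checkpoint name below (an assumption) is available: the encoder takes pixels into the latent space where the noise is added, and the decoder brings latents back to pixels at the very end.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.rand(1, 3, 512, 512) * 2 - 1              # a pixel image scaled to [-1, 1]
latents = vae.encode(image).latent_dist.sample()        # -> (1, 4, 64, 64): the compressed representation
noisy_latents = latents + torch.randn_like(latents)     # noise is applied to latents, not to pixels
reconstructed = vae.decode(latents).sample              # -> (1, 3, 512, 512): back to pixel space
```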

These two flows are what’s shown in Figure 3 of the LDM/Stable Diffusion paper:

This figure additionally shows the “conditioning” components, which in this case are the text prompts describing what image the model should generate. So let’s dig into the text components.

The Text Encoder: A Transformer Language Model

A Transformer language model is used as the language understanding component that takes the text prompt and produces token embeddings. The released Stable Diffusion model uses ClipText (A GPT-based model), while the paper used BERT.

The choice of language model is shown by the Imagen paper to be an important one. Swapping in larger language models had more of an effect on generated image quality than larger image generation components.

Larger/better language models have a significant effect on the quality of image generation models. Source: Google Imagen paper by Saharia et al., Figure A.5.

The early Stable Diffusion models just plugged in the pre-trained ClipText model released by OpenAI. It’s possible that future models may switch to the newly released and much larger OpenCLIP variants of CLIP (Nov 2022 update: true enough, Stable Diffusion V2 uses OpenCLIP). This new batch includes text models of sizes up to 354M parameters, as opposed to the 63M parameters in ClipText.

How CLIP is trained

CLIP is trained on a dataset of images and their captions. Think of a dataset looking like this, only with 400 million images and their captions:

A dataset of images and their captions.

In actuality, CLIP was trained on images crawled from the web along with their “alt” tags.

CLIP is a combination of an image encoder and a text encoder. Its training process can be simplified to this: take an image and its caption, and encode them with the image encoder and the text encoder respectively.

We then compare the resulting embeddings using cosine similarity. When we begin the training process, the similarity will be low, even if the text describes the image correctly.

We update the two models so that the next time we embed them, the resulting embeddings are similar.

By repeating this across the dataset and with large batch sizes, we end up with the encoders being able to produce embeddings where an image of a dog and the sentence “a picture of a dog” are similar. Just like in word2vec, the training process also needs to include negative examples of images and captions that don’t match, and the model needs to assign them low similarity scores.
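
A minimal sketch of that training objective, assuming the image and text embeddings for a batch of matching pairs have already been produced by the two encoders (the encoders, data loading, and many details are omitted):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so that a dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] = similarity between image i and caption j.
    logits = image_emb @ text_emb.t() / temperature
    # Matching pairs sit on the diagonal; every other pairing acts as a negative example.
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))   # placeholder embeddings
```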

Feeding Text Information Into The Image Generation Process

To make text a part of the image generation process, we have to adjust our noise predictor to use the text as an input.

Our dataset now includes the encoded text. Since we’re operating in the latent space, both the input images and predicted noise are in the latent space.
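
In tensor terms, one example from this conditioned dataset now looks roughly like this (shapes only, random placeholder values; the UNet call at the bottom is a hypothetical signature, not an actual library API):

```python
import torch

noisy_latents   = torch.randn(1, 4, 64, 64)   # latent image with noise added
noise_amount    = torch.tensor([37])          # which noising step was applied
text_embeddings = torch.randn(1, 77, 768)     # the encoded prompt: the new input
target_noise    = torch.randn(1, 4, 64, 64)   # what the noise predictor should output

# Hypothetical: predicted_noise = unet(noisy_latents, noise_amount, text_embeddings)
```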

To get a better sense of how the text tokens are used in the Unet, let’s look deeper inside the Unet.

Layers of the Unet Noise predictor (without text)

Let’s first look at a diffusion Unet that does not use text. Its inputs and outputs would look like this:

Inside, we see that:

  • The Unet is a series of layers that work on transforming the latents array
  • Each layer operates on the output of the previous layer
  • Some of the outputs are fed (via residual connections) into the processing later in the network
  • The timestep is transformed into a time step embedding vector, and that’s what gets used in the layers (see the sketch after this list)
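
To make that list concrete, here is a toy sketch of the structure: a stack of blocks transforming a latents array, a residual (skip) connection, and a timestep turned into an embedding vector that every block receives. It only illustrates the shape of the network; the real Stable Diffusion UNet is far larger and more elaborate:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim=128):
    # Sinusoidal embedding of the integer timestep, in the spirit of Transformer positional encodings.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class ToyBlock(nn.Module):
    def __init__(self, channels, t_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_shift = nn.Linear(t_dim, channels)   # inject the timestep into the block

    def forward(self, x, t_emb):
        h = self.conv(x) + self.to_shift(t_emb)[:, :, None, None]
        return torch.relu(h)

class ToyUNet(nn.Module):
    def __init__(self, channels=4, t_dim=128):
        super().__init__()
        self.down, self.mid, self.up = (ToyBlock(channels, t_dim) for _ in range(3))

    def forward(self, latents, t):
        t_emb = timestep_embedding(t)
        d = self.down(latents, t_emb)
        m = self.mid(d, t_emb)
        return self.up(m + d, t_emb)     # residual connection from the earlier layer

print(ToyUNet()(torch.randn(1, 4, 64, 64), torch.tensor([10])).shape)   # torch.Size([1, 4, 64, 64])
```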

Layers of the Unet Noise predictor WITH text

Let’s now look at how to alter this system to include attention to the text.

The main change we need to make to the system to add support for text inputs (technical term: text conditioning) is to add an attention layer between the ResNet blocks.

Note that the ResNet block doesn’t directly look at the text. But the attention layers merge those text representations into the latents, and the next ResNet block can then utilize that incorporated text information in its processing.
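
Here is a minimal sketch of such an attention layer: the queries come from the (flattened) latent features and the keys/values come from the text token embeddings, so the image features can attend to the prompt between ResNet blocks. The dimensions are representative, but the module itself is illustrative rather than the actual implementation:

```python
import torch
import torch.nn as nn

class ToyCrossAttention(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, latent_features, text_embeddings):
        # latent_features: (batch, height*width, latent_dim) -- flattened spatial positions
        # text_embeddings: (batch, 77, text_dim)             -- one vector per prompt token
        attended, _ = self.attn(query=latent_features,
                                key=text_embeddings,
                                value=text_embeddings)
        return latent_features + attended     # residual, so the image information is kept

x = torch.randn(1, 64 * 64, 320)              # latent features at one UNet resolution
prompt = torch.randn(1, 77, 768)              # CLIP text embeddings
print(ToyCrossAttention()(x, prompt).shape)   # torch.Size([1, 4096, 320])
```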

Conclusion

I hope this gives you a good first intuition about how Stable Diffusion works. Lots of other concepts are involved, but I believe they’re easier to understand once you’re familiar with the building blocks above. The resources below are great next steps that I found useful. Please reach out to me on Twitter for any corrections or feedback.

Resources

Acknowledgements

Thanks to Robin Rombach, Jeremy Howard, Hamel Husain, Dennis Soemers, Yan Sidyakin, Freddie Vargus, Anna Golubeva, and the Cohere For AI community for feedback on earlier versions of this article.

Contribute

Please help me make this article better. Possible ways:

  • Send any feedback or corrections on Twitter or as a Pull Request
  • Help make the article more accessible by suggesting captions and alt-text to the visuals (best as a pull request)
  • Translate it to another language and post it to your blog. Send me the link and I’ll add a link to it here. Translators of previous articles have always mentioned how much deeper they understood the concepts by going through the translation process.

Discuss

If you’re interested in discussing the overlap of image generation models with language models, feel free to post in the #images-and-words channel in the Cohere community on Discord. There, we discuss areas of overlap, including:

  • fine-tuning language models to produce good image generation prompts
  • Using LLMs to split the subject and style components of an image captioning prompt
  • Image-to-prompt (via tools like Clip Interrogator)

Citation


If you found this work helpful for your research, please cite it as following:

@misc{alammar2022diffusion,
title={The Illustrated Stable Diffusion},
author={Alammar, J},
year={2022},
url={https://jalammar.github.io/illustrated-stable-diffusion/}
}