William Peebles of UC Berkeley
Research using Transformers
Introduction
The Diffusion Transformer (DiT) is a diffusion model that replaces the dominant U-Net backbone with a transformer. Ho et al. first introduced the U-Net backbone for diffusion models.
History
U-Net
A CNN with an encoder-decoder structure and skip connections
The U-Net's inductive bias is not crucial to the performance of diffusion models
DiT adheres to the best practices of ViT (Vision Transformer)
U-net backbone → transformer
Diffusion transformers
Diffusion models are trained to learn the reverse process that inverts the forward-process corruptions
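The forward corruption can be sampled in closed form; a minimal numpy sketch (not the paper's implementation; the linear beta schedule below is an assumption from DDPM):

```python
import numpy as np

def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form (DDPM forward process)."""
    eps = rng.standard_normal(x0.shape)              # Gaussian corruption noise
    a_bar = alphas_cumprod[t]                        # cumulative product of (1 - beta)
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xt, eps                                   # the network is trained to predict eps

# Toy linear noise schedule (DDPM-style, 1000 steps)
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)
```

The reverse process is what the model learns: given `xt` and `t`, predict `eps` so the corruption can be undone step by step.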
Classifier-free guidance
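Classifier-free guidance combines a conditional and an unconditional noise prediction; a small sketch of the standard formula (function name is mine):

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

With `w = 1` this reduces to the conditional prediction; `w > 1` pushes samples further toward the condition at some cost in diversity.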
Latent diffusion models
Patchify
The spatial input is converted into a sequence of tokens, as in ViT
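The patchify step can be sketched with pure reshapes; a minimal numpy version under the assumption of a channels-first `(C, H, W)` latent and patch size `p`:

```python
import numpy as np

def patchify(x, p):
    """(C, H, W) latent -> (num_tokens, p*p*C) token sequence, ViT-style."""
    C, H, W = x.shape
    assert H % p == 0 and W % p == 0, "spatial dims must be divisible by p"
    x = x.reshape(C, H // p, p, W // p, p)
    x = x.transpose(1, 3, 2, 4, 0)               # (H/p, W/p, p, p, C)
    return x.reshape((H // p) * (W // p), p * p * C)
```

For a 32x32x4 latent with `p = 2` this yields 256 tokens of dimension 16; halving `p` quadruples the token count (and the compute), which is why patch size matters so much below.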
In-context conditioning: append the conditioning embeddings as extra tokens in the input sequence
Adaptive layer norm (adaLN) block: adaptive normalization layers regress shift and scale parameters from the conditioning vector
adaLN-Zero block: zero-initialization makes each residual block the identity at the start of training
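The adaLN-Zero idea can be sketched for a single residual branch; a simplified numpy version (the real block modulates attention and MLP sub-layers separately, and `W`, `b` stand in for the conditioning MLP's final zero-initialized projection):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned affine parameters (adaLN supplies them)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero(x, c, W, b):
    """adaLN-Zero residual branch: the conditioning vector c regresses a
    shift, a scale, and a gate alpha. Because W and b are zero-initialized,
    shift = scale = alpha = 0 at first, so the block starts as the identity."""
    d = x.shape[-1]
    params = c @ W + b                               # (3*d,) regressed from c
    shift, scale, alpha = params[:d], params[d:2 * d], params[2 * d:]
    h = layer_norm(x) * (1.0 + scale) + shift        # modulated normalization
    return x + alpha * h                             # gated residual branch
```

Zero-initializing the gate means gradients flow through an identity map early in training, which the paper found to be the best-performing conditioning variant.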
Experimental setup
Augmentation
horizontal flips
Transformer decoder
The output noise prediction has the same shape as the spatial input
The decoded tokens are rearranged back into the original spatial layout
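This rearrangement is the inverse of patchify; a minimal numpy sketch (the `C`, `H`, `W` arguments are assumptions matching a channels-first layout):

```python
import numpy as np

def unpatchify(tokens, p, C, H, W):
    """Invert patchify: (num_tokens, p*p*C) tokens -> (C, H, W) output."""
    x = tokens.reshape(H // p, W // p, p, p, C)
    x = x.transpose(4, 0, 2, 1, 3)               # (C, H/p, p, W/p, p)
    return x.reshape(C, H, W)
```

Since the final linear decoder emits `p*p*C` values per token, this reshape restores a noise prediction with exactly the input's spatial shape.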
Increasing transformer size and decreasing patch size both improve image quality
Diverse DiT block designs
Conclusion
A simple transformer-based backbone for diffusion models.
Future work: scaling DiT to larger models