Image / Video Gen
Image Generation Using Diffusion-Based Methods: Tips and Techniques for Stable Diffusion
Paper • 2208.11970 • Published
Note:
More theoretical reports: https://arxiv.org/pdf/2303.08797
Tutorial on Diffusion Models for Imaging and Vision
Paper • 2403.18103 • Published • 2

Denoising Diffusion Probabilistic Models
Paper • 2006.11239 • Published • 3

Denoising Diffusion Implicit Models
Paper • 2010.02502 • Published • 3
Progressive Distillation for Fast Sampling of Diffusion Models
Paper • 2202.00512 • Published • 1
Note:
1. Introduces v-prediction. For the DDPM noise scheduler (see the sketch below):
1.1 Definition: v = \sqrt{\bar{\alpha}_t}\,\epsilon - \sqrt{1-\bar{\alpha}_t}\,x_0
1.2 Conversion between epsilon prediction and velocity prediction: \epsilon_{\text{pred}} = \sqrt{\bar{\alpha}_t}\,v_{\text{pred}} + \sqrt{1-\bar{\alpha}_t}\,x_t
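A minimal sketch of the two conversions above, assuming the DDPM forward process x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon (helper names are mine, not from the paper):

```python
import torch

def v_to_eps(v_pred: torch.Tensor, x_t: torch.Tensor, abar_t: torch.Tensor) -> torch.Tensor:
    """eps = sqrt(abar_t) * v + sqrt(1 - abar_t) * x_t (note 1.2)."""
    return abar_t.sqrt() * v_pred + (1.0 - abar_t).sqrt() * x_t

def eps_to_v(eps_pred: torch.Tensor, x_t: torch.Tensor, abar_t: torch.Tensor) -> torch.Tensor:
    """v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0 (note 1.1),
    with x0 recovered as (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)."""
    x0 = (x_t - (1.0 - abar_t).sqrt() * eps_pred) / abar_t.sqrt()
    return abar_t.sqrt() * eps_pred - (1.0 - abar_t).sqrt() * x0
```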
Flow Matching for Generative Modeling
Paper • 2210.02747 • Published
simple diffusion: End-to-end diffusion for high resolution images
Paper • 2301.11093 • Published • 2
Note:
1. Uses v-prediction with an epsilon-space loss (see the sketch below):
   v_pred = uvit(z_t, logsnr_t)
   eps_pred = sigma_t * z_t + alpha_t * v_pred
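A sketch of the (v-prediction, epsilon-loss) combination, assuming a variance-preserving process with alpha_t^2 = sigmoid(logsnr_t); `uvit` stands in for the U-ViT backbone and `logsnr_t` is assumed broadcastable against `x0` (e.g. shape (B, 1, 1, 1)):

```python
import torch
import torch.nn.functional as F

def v_pred_eps_loss(uvit, x0, logsnr_t):
    alpha_t = torch.sigmoid(logsnr_t).sqrt()
    sigma_t = torch.sigmoid(-logsnr_t).sqrt()
    eps = torch.randn_like(x0)
    z_t = alpha_t * x0 + sigma_t * eps           # forward diffusion
    v_pred = uvit(z_t, logsnr_t)                 # network predicts v
    eps_pred = sigma_t * z_t + alpha_t * v_pred  # convert v-pred to eps-pred
    return F.mse_loss(eps_pred, eps)             # loss in epsilon space
```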
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Paper • 2209.03003 • Published • 1
MAGVIT: Masked Generative Video Transformer
Paper • 2212.05199 • Published
Note:
1. Inflation (sketch below):
1.1 Use central inflation for the convolution layers: the pretrained 2D kernel fills the temporally central slice of a zero-filled 3D kernel.
1.2 Replace the zero ("same") padding in the convolution layers with reflect padding.
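A minimal sketch of central inflation (my implementation of the note, not the paper's code):

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, kt: int = 3) -> nn.Conv3d:
    """Copy a pretrained 2D kernel into the temporally central slice
    of a zero-initialized 3D kernel."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(kt, *conv2d.kernel_size),
        padding=0, bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        conv3d.weight.zero_()
        conv3d.weight[:, :, kt // 2] = conv2d.weight  # central temporal slice
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# For note 1.2, apply reflect padding explicitly with F.pad(..., mode="reflect")
# before the convolution instead of relying on the layer's zero padding.
```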
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper • 2310.05737 • Published • 4
Note:
1. Known as MAGVIT-2. Growing the vocabulary size benefits generation quality: both reconstruction and generation consistently improve as the vocabulary size increases.
2. Lookup-free quantization (LFQ): the latent is decomposed into single-dimensional binary variables, so each dimension contributes one bit of the token id. For example, with a latent feature z \in R^{4}:
   [-1, 1, -2, 3] --> [0, 1, 0, 1] --> sum([0, 2^1, 0, 2^3]) = 10
   [ 1, 1, 1, 3] --> [1, 1, 1, 1] --> sum([2^0, 2^1, 2^2, 2^3]) = 15
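A minimal sketch of the binarize-then-read-bits step (my implementation of the worked example above):

```python
import torch

def lfq_token_ids(z: torch.Tensor) -> torch.Tensor:
    """z: (..., d) latent; returns integer token ids in [0, 2**d)."""
    bits = (z > 0).long()                                 # sign -> {0, 1}
    powers = 2 ** torch.arange(z.shape[-1], device=z.device)
    return (bits * powers).sum(dim=-1)                    # bits as an integer

z = torch.tensor([[-1., 1., -2., 3.], [1., 1., 1., 3.]])
print(lfq_token_ids(z))  # tensor([10, 15]), matching the example
```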
Scalable Diffusion Models with Transformers
Paper • 2212.09748 • Published • 15
Note:
1. adaLN-Zero: in addition to regressing γ and β, DiT regresses dimension-wise scaling parameters α that are applied immediately before each residual connection within the DiT block. Following the U-Net strategy of zero-initializing the final convolutional layer in each block, the modulation layer is initialized so that α = 0 and each block starts as the identity (sketch below).
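A simplified sketch of an adaLN-Zero block (my code; only the attention branch is shown, whereas the real DiT block also modulates the MLP branch):

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mod = nn.Linear(dim, 3 * dim)  # regresses gamma, beta, alpha
        nn.init.zeros_(self.mod.weight)     # adaLN-Zero: alpha = 0 at init,
        nn.init.zeros_(self.mod.bias)       # so the block is the identity

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim), cond: (B, dim) conditioning embedding
        gamma, beta, alpha = self.mod(cond).chunk(3, dim=-1)
        h = self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
        h, _ = self.attn(h, h, h)
        return x + alpha.unsqueeze(1) * h   # alpha gates the residual branch
```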
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Paper • 2408.12590 • Published • 33
Note:
1. Extends the 2D image-based VAE into a 3D VideoVAE with CausalConv3D (sketch below).
2. Encodes long videos with a divide-and-merge strategy.
3. Caption model:
3.1 The temporal encoder is implemented with [Token Turing Machines](https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing).
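A minimal sketch of a causal 3D convolution (my implementation, assuming the standard trick of padding only the past side of the time axis so frame t never sees future frames):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, c_in: int, c_out: int, kt: int = 3, ks: int = 3):
        super().__init__()
        self.kt = kt
        self.conv = nn.Conv3d(c_in, c_out, (kt, ks, ks),
                              padding=(0, ks // 2, ks // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); pad (kt - 1) frames on the past side only
        x = F.pad(x, (0, 0, 0, 0, self.kt - 1, 0))
        return self.conv(x)
```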
Classifier-Free Diffusion Guidance
Paper • 2207.12598 • Published • 2
Note:
1. Follow-up work: APG (https://arxiv.org/pdf/2410.02416), which decomposes the guidance update into components parallel and orthogonal to the conditional prediction (sketch below).
1.1 Leaning more on the orthogonal component significantly attenuates the over-saturation side effect of high guidance scales while maintaining the quality-boosting benefits of CFG.
1.2 APG performs best when applied to the denoised predictions rather than the noise predictions.
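A sketch of the projection idea (my paraphrase of APG, not the paper's exact algorithm): split the CFG difference on the denoised predictions into parallel/orthogonal parts and down-weight the parallel one.

```python
import torch

def apg_update(x0_cond, x0_uncond, guidance_scale=7.5, parallel_weight=0.0):
    """x0_cond / x0_uncond: (B, C, H, W) denoised predictions."""
    diff = x0_cond - x0_uncond
    # unit vector along the conditional prediction, per sample
    norm = x0_cond.flatten(1).norm(dim=1).view(-1, 1, 1, 1).clamp_min(1e-8)
    unit = x0_cond / norm
    # project diff onto the conditional direction
    dot = (diff * unit).flatten(1).sum(dim=1).view(-1, 1, 1, 1)
    parallel = dot * unit
    orthogonal = diff - parallel
    guided = parallel_weight * parallel + orthogonal  # lean on the orthogonal part
    return x0_cond + (guidance_scale - 1.0) * guided
```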
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Paper • 2310.00426 • Published • 61
Note:
1. Training recipe:
- Initialize the T2I model with a low-cost class-conditional model;
- Pretrain on text-image pair data rich in information density;
- Fine-tune on data with superior aesthetic quality.
2. adaLN-single (sketch below):
- One global set of shifts and scales is computed only at the first block and shared across all blocks (shared_adaln_cond);
- A layer-specific trainable embedding (adaln_cond) adaptively adjusts the scale and shift parameters in each block.
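A minimal sketch of adaLN-single as I read the note (the global MLP output would be computed once and cached in practice; class and parameter names are mine):

```python
import torch
import torch.nn as nn

class AdaLNSingle(nn.Module):
    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        # shared MLP replaces the per-block modulation MLPs of adaLN
        self.shared_mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))
        # cheap per-block trainable embeddings adjust the shared modulation
        self.block_embed = nn.Parameter(torch.zeros(num_blocks, 2 * dim))

    def forward(self, t_embed: torch.Tensor, block_idx: int):
        shared_adaln_cond = self.shared_mlp(t_embed)            # global, shared
        cond = shared_adaln_cond + self.block_embed[block_idx]  # layer-specific
        gamma, beta = cond.chunk(2, dim=-1)
        return gamma, beta
```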
FreeInit: Bridging Initialization Gap in Video Diffusion Models
Paper • 2312.07537 • Published • 26
Note:
1. Gap between training & inference: initial noises corrupted from real videos remain temporally correlated in the low-frequency band, while inference starts from i.i.d. Gaussian noise.
2. FreeInit procedure:
2.1 Initialize an independent Gaussian noise;
2.2 Run DDIM denoising to generate a clean video latent;
2.3 Obtain a noisy video latent through forward diffusion;
2.4 Combine the low-frequency components of this video latent with the high-frequency components of fresh Gaussian noise (sketch below);
2.5 Repeat.
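A sketch of step 2.4 (my implementation; the cutoff and the ideal low-pass mask are assumptions, the paper uses its own spatio-temporal filter):

```python
import torch

def freeinit_mix(noisy_latent: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """noisy_latent: (B, C, T, H, W) latent after forward diffusion (step 2.3)."""
    fresh_noise = torch.randn_like(noisy_latent)
    dims = (-3, -2, -1)
    lat_f = torch.fft.fftshift(torch.fft.fftn(noisy_latent, dim=dims), dim=dims)
    noise_f = torch.fft.fftshift(torch.fft.fftn(fresh_noise, dim=dims), dim=dims)
    # ideal low-pass mask over normalized spatio-temporal frequencies
    T, H, W = noisy_latent.shape[-3:]
    t = torch.linspace(-1, 1, T).view(T, 1, 1)
    h = torch.linspace(-1, 1, H).view(1, H, 1)
    w = torch.linspace(-1, 1, W).view(1, 1, W)
    lowpass = ((t**2 + h**2 + w**2).sqrt() <= cutoff).float()
    # low frequencies from the latent, high frequencies from fresh noise
    mixed_f = lat_f * lowpass + noise_f * (1.0 - lowpass)
    mixed_f = torch.fft.ifftshift(mixed_f, dim=dims)
    return torch.fft.ifftn(mixed_f, dim=dims).real
```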
black-forest-labs/FLUX.1-schnell
Text-to-Image • Updated • 965k • 2.47k
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Paper • 2403.03206 • Published • 56
Note: Known as SD-3.
1. Changes the distribution over t from uniform to one that weights intermediate timesteps more heavily by sampling them more frequently (logit-normal sampling; sketch below).
2. Uses a 50% / 50% mix of original and synthetic captions.
3. MM-DiT.
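A minimal sketch of logit-normal timestep sampling (m and s are the location/scale hyperparameters):

```python
import torch

def sample_t_logit_normal(batch_size: int, m: float = 0.0, s: float = 1.0) -> torch.Tensor:
    """Samples concentrate at intermediate noise levels rather than uniformly."""
    u = torch.randn(batch_size) * s + m  # u ~ N(m, s^2)
    return torch.sigmoid(u)              # t = sigmoid(u) in (0, 1)
```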
On the Importance of Noise Scheduling for Diffusion Models
Paper • 2301.10972 • Published • 1
Note:
1. When increasing the image size, the optimal noise schedule shifts toward a noisier one (due to increased redundancy across pixels). This matters even more in video generation (sketch below).
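One common way to implement such a shift (an assumption for illustration, not this paper's exact recipe) is to translate the log-SNR schedule toward more noise as resolution grows:

```python
import math

def shifted_logsnr(logsnr: float, base_res: int = 64, target_res: int = 256) -> float:
    # larger target_res => negative shift => noisier schedule at the same t
    return logsnr + 2.0 * math.log(base_res / target_res)
```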
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
Paper • 2402.14797 • Published • 19
Note:
1. Argues that treating spatial and temporal modeling separably causes motion artifacts, temporal inconsistencies, or the generation of dynamic images rather than videos with vivid motion.
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
Paper • 2312.03641 • Published • 20
Note:
1. Motion Brush?