arxiv:2410.04081

ε-VAE: Denoising as Visual Decoding

Published on Oct 5
Submitted by garyzhao9012 on Oct 9
Abstract

In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.
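To make the decode path concrete, here is a minimal sketch of denoising-as-decoding: the encoder produces a compact latent grid, and a conditional denoiser iteratively turns pure noise into an image while being guided by that latent. Everything below is an illustrative assumption: the module names (ConvEncoder, DenoisingDecoder), the tiny network sizes, and the DDIM-style deterministic sampler are stand-ins, not the paper's actual architecture, conditioning mechanism, or noise schedule.

```python
# Illustrative sketch only: denoising-as-decoding with a latent-conditioned denoiser.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvEncoder(nn.Module):
    """Compresses an image into a compact latent grid (stand-in for the paper's encoder)."""

    def __init__(self, latent_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, latent_dim, 3, stride=2, padding=1),  # 8x spatial downsampling
        )

    def forward(self, x):
        return self.net(x)


class DenoisingDecoder(nn.Module):
    """Predicts the noise in a noisy image x_t, conditioned on the encoder latent z."""

    def __init__(self, latent_dim: int = 4):
        super().__init__()
        # A real implementation would use a timestep-conditioned U-Net; this tiny
        # conv stack only illustrates the interface: (noisy image, latent) -> noise.
        self.net = nn.Sequential(
            nn.Conv2d(3 + latent_dim, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x_t, z, t):
        z_up = F.interpolate(z, size=x_t.shape[-2:], mode="nearest")  # broadcast latent to pixel grid
        return self.net(torch.cat([x_t, z_up], dim=1))  # t would feed a timestep embedding in practice


@torch.no_grad()
def decode_by_denoising(decoder, z, image_size=256, steps=50):
    """Iteratively refine pure noise into an image, guided by the latent z (DDIM-style, sigma = 0)."""
    x = torch.randn(z.shape[0], 3, image_size, image_size)
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    for t in reversed(range(steps)):
        eps = decoder(x, z, t)                                  # predicted noise at step t
        a_t = alphas_cumprod[t]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()          # current estimate of the clean image
        if t > 0:
            a_prev = alphas_cumprod[t - 1]
            x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
        else:
            x = x0
    return x


# Usage: encode once, then decode by iterative refinement instead of a single forward pass.
encoder, decoder = ConvEncoder(), DenoisingDecoder()
z = encoder(torch.randn(1, 3, 256, 256))        # e.g. a (1, 4, 32, 32) latent grid
x_hat = decode_by_denoising(decoder, z, image_size=256)
```

Training would follow the standard diffusion recipe (add noise to a clean image, predict that noise conditioned on the latent, and backpropagate through decoder and encoder together); the sketch above only covers the sampling-time decode path.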

Community

We present Epsilon-VAE, an effective visual tokenization framework that introduces a diffusion decoder into standard autoencoders, turning single-step decoding into a multi-step probabilistic process. Our approach outperforms traditional visual autoencoders in both reconstruction and generation quality, particularly in high-compression scenarios. We want to highlight that:

(1) Traditional image compression methods optimize the rate-distortion trade-off, prioritizing compactness over input fidelity. Building on this, we also aim to capture the broader "input distribution" during compression, generating compact representations suitable for latent generative models. Our approach introduces an additional dimension to the trade-off, perception or distribution fidelity, which aligns more closely with the rate-distortion-perception framework; a standard formulation of that trade-off is sketched after this list.

(2) Our decoding process is stochastic, which allows it to capture complex variations within the distribution. While stochasticity might suggest a risk of "hallucination" in reconstructions, the outputs remain faithful to the underlying distribution by design, producing perceptually plausible results. This advantage is particularly evident under extreme compression, with the degree of stochasticity adapting to the compression level.

(3) Our diffusion-based decoding method maintains the resolution generalizability typically found in standard autoencoders. This feature is highly practical: the autoencoder only needs to be trained on lower-resolution images, while the subsequent latent generative model can be trained on latents derived from higher-resolution inputs.
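On point (1), the rate-distortion-perception framework is usually attributed to Blau and Michaeli; one common way to state the trade-off, with a generic distortion measure Δ and a divergence d between image distributions (placeholders here, not necessarily the paper's choices), is:

```latex
\min_{p_{\hat{X}\mid X}} \; \mathbb{E}\left[\Delta(X,\hat{X})\right]
\quad \text{subject to} \quad
I(X;\hat{X}) \le R,
\qquad
d\left(p_X,\, p_{\hat{X}}\right) \le P
```

The divergence constraint is what "perception" adds over classical rate-distortion: reconstructions must not only stay close to their inputs on average but also follow approximately the same distribution as real images, which is the property the diffusion decoder is designed to preserve.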

Please check our paper for more detailed results!



