L2P: Unlocking Latent Potential for Pixel Generation

TL;DR

L2P (Latent-to-Pixel) is a transfer learning framework that converts pre-trained latent diffusion models (LDMs) into pixel-space diffusion models with negligible training overhead. It freezes the LDM’s intermediate layers and trains only the shallow layers on synthetic images generated by the source LDM, so the transfer needs only 8 GPUs and no real data. The resulting model performs on par with the source LDM on DPG-Bench and reaches 93% on GenEval, and removing the VAE memory bottleneck enables native 4K generation.

Key claims

L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM’s intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation [abstract].
L2P trains exclusively on LDM-generated synthetic images, requiring zero real-data collection and enabling rapid convergence by fitting an already smooth data manifold [abstract].
L2P enables seamless migration of massive latent priors to pixel space using only 8 GPUs [abstract].
Eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation [abstract].
L2P performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval [abstract].
L2P incurs negligible training overhead across mainstream LDM architectures [abstract].

Method

L2P transfers a pre-trained latent diffusion model to pixel space through selective layer training. The approach freezes the intermediate layers of the source LDM, which contain the learned diffusion priors, and trains only the shallow layers to adapt the model from latent inputs to pixel inputs. In place of the VAE, L2P employs large-patch tokenization to operate directly in pixel space.
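To make the tokenization step concrete, here is a minimal NumPy sketch of patch tokenization and its inverse. It is an illustration under assumptions, not the paper's implementation: the function names patchify and unpatchify, the (H, W, C) layout, and the patch size of 16 are all hypothetical, since this summary does not state the patch size L2P uses.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a (num_tokens, patch * patch * C) array."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def unpatchify(tokens, height, width, patch, channels=3):
    """Inverse of patchify: rebuild the (H, W, C) image from its tokens."""
    gh, gw = height // patch, width // patch
    x = tokens.reshape(gh, gw, patch, patch, channels)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (gh, patch, gw, patch, c)
    return x.reshape(height, width, channels)

# Illustrative sizes only (assumed, not taken from the paper).
image = np.arange(64 * 64 * 3).reshape(64, 64, 3)
tokens = patchify(image, patch=16)
restored = unpatchify(tokens, 64, 64, patch=16)
```

Each token here is a flattened block of raw pixels, so no VAE encoder or decoder is involved; in a real model a learned linear layer would project each token to the transformer width.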

The training corpus consists entirely of synthetic images generated by the source LDM itself. This design choice exploits the fact that the source LDM has already learned a smooth data manifold, allowing the pixel-space model to converge rapidly without needing real training data. The frozen intermediate layers preserve the diffusion knowledge while the trainable shallow layers learn the latent-to-pixel transformation.
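The freeze/train split can be sketched as a partition over parameter names. This is a hedged illustration: the naming scheme ("blocks.<i>...."), the helper split_trainable, and the choice to treat the first and last few blocks plus the embedding and output layers as "shallow" are assumptions made for the example; the summary does not specify which layers L2P trains.

```python
def split_trainable(param_names, num_blocks, shallow_blocks):
    """Partition parameter names into (trainable, frozen) lists.

    Assumption: transformer blocks are named "blocks.<index>.<rest>".
    The first and last `shallow_blocks` blocks are treated as shallow
    (trainable); everything outside the block stack, such as the patch
    embedding and output head, is also trainable. All other blocks are
    intermediate layers and stay frozen.
    """
    trainable, frozen = [], []
    for name in param_names:
        parts = name.split(".")
        if parts[0] == "blocks" and len(parts) > 1 and parts[1].isdigit():
            index = int(parts[1])
            is_shallow = (index < shallow_blocks
                          or index >= num_blocks - shallow_blocks)
        else:
            is_shallow = True  # embedding / output layers (assumed trainable)
        (trainable if is_shallow else frozen).append(name)
    return trainable, frozen

# Hypothetical six-block model, one shallow block at each end.
names = (["patch_embed.weight"]
         + [f"blocks.{i}.attn.weight" for i in range(6)]
         + ["out_proj.weight"])
trainable, frozen = split_trainable(names, num_blocks=6, shallow_blocks=1)
```

In a PyTorch training loop, the frozen names would typically have requires_grad set to False and only the trainable parameters would be passed to the optimizer.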

By removing the VAE from the inference pipeline, L2P eliminates a significant memory bottleneck. This architectural change enables the model to generate images at native 4K resolution, which the VAE’s memory footprint would otherwise make impractical.
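A quick back-of-the-envelope calculation shows why patch size matters once the model works directly on 4K pixels. The numbers below are illustrative assumptions: "4K" is taken as a 4096 x 4096 image and the patch sizes are arbitrary examples, since neither is specified in this summary.

```python
def num_tokens(height, width, patch):
    """Sequence length when an image is cut into patch x patch tokens."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch) * (width // patch)

side = 4096  # assumed "4K" side length
counts = {patch: num_tokens(side, side, patch) for patch in (16, 32, 64)}
for patch, count in counts.items():
    print(f"patch {patch:>2}: {count} tokens")
```

Because self-attention cost grows quadratically with sequence length, doubling the patch side cuts the token count by 4x and the attention cost by roughly 16x, which is one reason large patches make high-resolution pixel-space generation more tractable.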

Results

On DPG-Bench, L2P performs on par with the source LDM [abstract].
On GenEval, L2P reaches 93% of the source LDM’s performance [abstract].
L2P completes training using only 8 GPUs [abstract].
The method enables native 4K ultra-high resolution generation by removing the VAE memory bottleneck [abstract].

Why it’s interesting

L2P demonstrates that pixel-space diffusion can be competitive with latent diffusion without training from scratch, addressing the prohibitive computational cost that has limited pixel-space approaches. The use of synthetic-only training data is particularly noteworthy: by training on outputs from the source LDM, the method sidesteps data collection entirely while still achieving strong transfer. The 4K generation capability without VAE overhead suggests a path toward the higher resolutions that have been memory-constrained in latent approaches. This connects to the broader trend Matt noted of renewed interest in pixel-space methods, potentially offering better quality-efficiency tradeoffs at high resolutions.