Improved Baselines with Representation Autoencoders

TL;DR

RAEv2 improves Representation Autoencoders (RAE) through three changes: aggregating multiple encoder layers instead of using only the final layer, combining RAE with REPA (representation alignment), which turns out to be complementary rather than redundant, and enabling efficient self-guidance without extra models. The result is more than 10× faster convergence than the original RAE, reaching a state-of-the-art gFID of 1.06 on ImageNet-256 in just 80 epochs while improving both reconstruction and generation quality. The approach extends to text-to-image generation and world models.

Key claims

RAEv2 achieves state-of-the-art gFID of 1.06 on ImageNet-256 in 80 epochs, representing more than 10× faster convergence over the original RAE [Abstract].
On the FDr^k metric, RAEv2 achieves 2.17 at 80 epochs compared to the previous best of 3.26 at 800 epochs without any post-training [Abstract].
RAEv2 attains EP_FID@2 (epochs to reach unguided gFID ≤ 2) of 35 epochs, versus 177 for the original RAE [Abstract].
Defining the encoder representation as the sum of the last k layers rather than solely the final layer greatly improves reconstruction without encoder finetuning or specialized data [Abstract].
RAE (using pretrained representation as encoder) and REPA (which distills the same representation to intermediate layers) exhibit complementary working mechanisms, allowing both to be used together [Abstract].
With the generalized formulation, stronger representations such as DINOv3-B yield better generation, despite performing worse than DINOv2-B under the original RAE recipe [§2.1].
RAEv2-NWM achieves an FVD of 105.61 on the RECON validation set for world models, compared to 762.73 for DIAMOND, 200.97 for NWM, and 312.01 for RAE [Table 21].
Reconstruction rFID improves from 0.60 to 0.18 with RAEv2 [Results summary].
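
The EP_FID@2 metric above is defined as the number of epochs needed to reach an unguided gFID of 2 or lower. A minimal sketch of that definition follows, assuming gFID is logged at discrete checkpoints and no interpolation is done between them. The function name and the toy curve are hypothetical and are not data from the paper.

```python
from typing import Optional, Sequence, Tuple


def epochs_to_reach(
    curve: Sequence[Tuple[int, float]], threshold: float = 2.0
) -> Optional[int]:
    """Return the first logged epoch whose unguided gFID is <= threshold.

    Returns None if the threshold is never reached in the logged curve.
    """
    for epoch, gfid in sorted(curve):
        if gfid <= threshold:
            return epoch
    return None


# Hypothetical evaluation log of (epoch, unguided gFID) pairs.
# These values are illustrative only and do not come from the paper.
toy_curve = [(10, 6.4), (20, 3.1), (30, 2.2), (40, 1.9), (50, 1.7)]

ep_fid_at_2 = epochs_to_reach(toy_curve, threshold=2.0)
print(ep_fid_at_2)
```

Under this reading, a lower EP_FID@2 means the threshold is crossed earlier in training, so the resolution of the metric is limited by how often checkpoints are evaluated.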

Method

RAEv2 builds on Representation Autoencoders (RAE), which replace traditional VAE encoders with frozen pretrained vision encoders (like DINOv2 or DINOv3) for diffusion modeling. The key innovation is a generalized formulation where the encoder output is defined as the sum of the last k layers rather than just the final layer. This simple change leverages the rich abstractions that exist across all encoder layers without requiring encoder finetuning.
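
The multi-layer aggregation described above can be sketched as follows. This is a toy illustration in plain Python, assuming the per-layer hidden states are already available as nested lists and are summed without any per-layer normalization or weighting (the summary does not say whether either is applied). The function name and the toy values are hypothetical.

```python
from typing import List

# One layer's output: a list of tokens, each a list of feature values.
Tokens = List[List[float]]


def sum_last_k_layers(hidden_states: List[Tokens], k: int) -> Tokens:
    """Sum the outputs of the last k encoder layers, token by token.

    hidden_states[i] is the output of layer i; every layer must share
    the same number of tokens and the same feature dimension.
    """
    if not 1 <= k <= len(hidden_states):
        raise ValueError("k must be between 1 and the number of layers")
    selected = hidden_states[-k:]
    num_tokens = len(selected[0])
    dim = len(selected[0][0])
    return [
        [sum(layer[t][d] for layer in selected) for d in range(dim)]
        for t in range(num_tokens)
    ]


# Toy encoder output: 3 layers, 2 tokens, 2 feature dimensions.
toy_hidden_states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 1.0], [1.0, 2.0]],
    [[3.0, 3.0], [4.0, 4.0]],
]

final_only = sum_last_k_layers(toy_hidden_states, k=1)  # original RAE choice
last_two = sum_last_k_layers(toy_hidden_states, k=2)    # generalized choice
print(final_only, last_two)
```

Setting k=1 recovers the final-layer-only representation, so the original RAE is a special case of the generalized formulation.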

The second major contribution addresses the assumption that RAE replaces REPA (representation alignment). Through large-scale empirical analysis, the authors find that RAE and REPA have complementary working mechanisms. RAE uses the pretrained representation as the encoder, while REPA distills the same representation into intermediate layers of the diffusion model. Combining the two not only improves performance but also simplifies guidance: the REPA head is reused, which removes the need for a separate guidance model (as in AutoGuidance) or extra forward passes (as in CFG).
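
To make the REPA side of this combination concrete, the sketch below shows one plausible form of an alignment objective: the negative mean cosine similarity between projected intermediate features and the frozen encoder's token features. The summary does not specify the loss, so the choice of cosine similarity is an assumption, the projection head is assumed to have been applied already, and all names and values are hypothetical. The guidance mechanism that reuses the REPA head is not sketched because the summary gives no details on it.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def alignment_loss(projected: List[Vector], targets: List[Vector]) -> float:
    """Negative mean per-token cosine similarity (assumed objective).

    projected: intermediate features after a projection head.
    targets:   frozen pretrained encoder features for the same tokens.
    """
    sims = [cosine_similarity(p, t) for p, t in zip(projected, targets)]
    return -sum(sims) / len(sims)


# Toy example with 2 tokens and 2 feature dimensions.
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]      # same directions, different scale
orthogonal = [[0.0, 1.0], [1.0, 0.0]]   # perpendicular to the targets

print(alignment_loss(aligned, targets), alignment_loss(orthogonal, targets))
```

Because cosine similarity ignores scale, the aligned features reach the minimum loss even though their magnitudes differ from the targets.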

The approach is validated across three settings: ImageNet-256 class-conditional generation, text-to-image generation, and navigation world models (future state prediction).

Results

On ImageNet-256 class-conditional generation, RAEv2 achieves state-of-the-art metrics in just 80 epochs: a gFID of 1.06 and an FDr^6 of 2.17, compared to the previous best FDr of 3.26 at 800 epochs. The training efficiency gain is large: RAEv2 reaches unguided gFID ≤ 2 in 35 epochs versus 177 epochs for the original RAE, roughly a 5× speedup on this metric alone.
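
The ratios behind these efficiency figures can be checked with a few lines of arithmetic. The numbers are the ones quoted above; the variable names are illustrative. The summary does not state how the headline "more than 10×" convergence figure is derived, so the epoch budget ratio below is only one plausible reading.

```python
# Epochs to reach unguided gFID <= 2 (EP_FID@2), as quoted above.
rae_epochs_to_gfid2 = 177
raev2_epochs_to_gfid2 = 35
ep_fid_speedup = rae_epochs_to_gfid2 / raev2_epochs_to_gfid2

# Training budgets behind the FDr comparison (800 vs 80 epochs).
previous_best_epochs = 800
raev2_epochs = 80
budget_ratio = previous_best_epochs / raev2_epochs

# Relative reduction in FDr (3.26 -> 2.17).
fdr_reduction = 1 - 2.17 / 3.26

print(f"EP_FID@2 speedup: {ep_fid_speedup:.2f}x")
print(f"Epoch budget ratio: {budget_ratio:.0f}x")
print(f"FDr relative reduction: {fdr_reduction:.1%}")
```

The EP_FID@2 ratio works out to slightly above 5, while the FDr comparison pairs a lower score with a 10× smaller epoch budget.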

Reconstruction quality also improves, with rFID dropping from 0.60 to 0.18. For world models on the RECON validation set, RAEv2-NWM achieves an FVD of 105.61, compared to 762.73 for DIAMOND, 200.97 for NWM, and 312.01 for the RAE baseline, with consistent improvements in both FID and LPIPS across all rollout horizons from 1 to 16 seconds.

The generalized multi-layer aggregation enables stronger encoders like DINOv3-L to excel in both spatial and global performance while also improving generation, whereas they performed poorly under the original RAE formulation.

Why it’s interesting

RAEv2 demonstrates that unified tokenization for understanding and generation can be significantly improved through careful design choices without requiring new architectures or training procedures. The more than 10× convergence speedup makes RAE-based approaches competitive with traditional VAE pipelines on practical training budgets. The finding that RAE and REPA are complementary rather than competing opens new design space for efficient guidance mechanisms. Validation across diverse settings (class-conditional generation, text-to-image generation, world models) suggests the improvements are fundamental rather than task-specific. This work from Saining Xie’s team at Meta/NYU alongside Adobe Research moves the unified tokenization line forward by a concrete margin (more than 10× faster convergence, gFID of 1.06 on ImageNet-256) rather than incrementally.