JLT: Clean-Latent Prediction in Latent Diffusion Transformers

[Figure: ImageNet 256x256 samples from JLT-B/1]
Authors: Funing Fu [1,*], Tenghui Wang [2,*], Guanyu Zhou [2], Junyong Cen [1], Qichao Zhu [3]
[1] Independent Researcher  ·  [2] Wuhan University of Technology  ·  [3] Hangzhou Jiyi AI

TL;DR: We study whether predicting clean data is better than predicting velocity in latent space. Under the same architecture, training settings, and FLUX.2 VAE representation, clean-latent prediction (JLT-B/1) achieves FID 2.50, versus FID 6.56 for velocity prediction (DiT-B/1), a 62% relative reduction in FID (lower is better).

Results

Matched Target Ablation on ImageNet 256x256

Model             Target         FID-50K   IS
JLT-B/1           x (clean)      2.56      220.74
DiT-B/1           v (velocity)   6.56      132.12
JLT-B/2           x (clean)      14.81     107.29
DiT-B/2           v (velocity)   28.71     58.46
JLT-B/1 (final)   x (clean)      2.50      232.51
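
The relative FID reductions implied by the table can be recomputed with a few lines of arithmetic. The sketch below uses only the FID values listed above. The helper name `relative_reduction` and the row labels are illustrative and do not come from the paper.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - improved) / baseline

# (velocity-prediction FID, clean-prediction FID) pairs taken from the table above.
fid_pairs = {
    "B/1 matched": (6.56, 2.56),
    "B/1 final": (6.56, 2.50),
    "B/2 matched": (28.71, 14.81),
}

for name, (fid_v, fid_x) in fid_pairs.items():
    pct = 100.0 * relative_reduction(fid_v, fid_x)
    print(f"{name}: {pct:.1f}% lower FID with clean-latent prediction")
```

This gives roughly 61% for the matched /1 pair, 62% for the final JLT-B/1 model against DiT-B/1, and 48% for the matched /2 pair.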

Method

Prediction Targets

Under the linear corruption path z_t = t * x + (1-t) * epsilon:

y_x = x,    y_epsilon = epsilon,    y_v = x - epsilon

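These targets are algebraically equivalent: given z_t and t, each can be recovered from any other by an affine readout (away from the endpoints t = 0 and t = 1, where some conversions divide by zero). With finite model capacity, however, the choice of direct output parameterization changes the difficulty of the regression.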
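
To make the affine relationship concrete, here is a minimal sketch of the conversions implied by z_t = t * x + (1-t) * epsilon. The function names are hypothetical and chosen for illustration. They are not taken from the JLT codebase. The functions use plain arithmetic, so they apply to Python floats and should work elementwise on array types that overload the same operators.

```python
def corrupt(x, eps, t):
    """Linear corruption path: z_t = t * x + (1 - t) * eps."""
    return t * x + (1.0 - t) * eps

def x_from_v(z_t, v, t):
    """Recover the clean target from a velocity prediction v = x - eps."""
    return z_t + (1.0 - t) * v

def eps_from_v(z_t, v, t):
    """Recover the noise target from a velocity prediction v = x - eps."""
    return z_t - t * v

def v_from_x(z_t, x, t):
    """Recover the velocity from a clean prediction; requires t < 1."""
    return (x - z_t) / (1.0 - t)

def x_from_eps(z_t, eps, t):
    """Recover the clean target from a noise prediction; requires t > 0."""
    return (z_t - (1.0 - t) * eps) / t

# Round-trip check on a single scalar example.
x, eps, t = 0.8, -1.5, 0.3
z_t = corrupt(x, eps, t)
v = x - eps
print(x_from_v(z_t, v, t), eps_from_v(z_t, v, t))
print(v_from_x(z_t, x, t), x_from_eps(z_t, eps, t))
```

The divisions by 1 - t and t show where the equivalence becomes ill-conditioned near the endpoints of the path.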

Target-Geometry Analysis

Under a local linear-Gaussian approximation x ~ N(0, Sigma), with noise epsilon ~ N(0, I) drawn independently of x:

Cov(y_x) = Sigma,    Cov(y_epsilon) = I,    Cov(y_v) = Sigma + I
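
The velocity covariance follows in one step once the noise is taken to be standard Gaussian and independent of the data. That assumption is consistent with the covariances stated above and is made explicit here. The eigenvalue notation \lambda_i, u_i is introduced only for this sketch.

```latex
\begin{align*}
\operatorname{Cov}(y_v)
  &= \operatorname{Cov}(x - \epsilon) \\
  &= \operatorname{Cov}(x) - \operatorname{Cov}(x, \epsilon)
     - \operatorname{Cov}(\epsilon, x) + \operatorname{Cov}(\epsilon) \\
  &= \Sigma + I
  \qquad \text{since } \operatorname{Cov}(x, \epsilon) = 0 .
\end{align*}
% Along a unit eigenvector u_i of \Sigma with eigenvalue \lambda_i:
\[
u_i^\top \operatorname{Cov}(y_x)\, u_i = \lambda_i ,
\qquad
u_i^\top \operatorname{Cov}(y_v)\, u_i = \lambda_i + 1 .
\]
```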

Key insight: Velocity prediction adds an isotropic unit floor to every direction. When Sigma is anisotropic, low-variance directions become unit-variance in y_v, while clean prediction keeps their target variance small.
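
The unit floor can be checked numerically along individual directions. The sketch below is a toy Monte Carlo estimate, not the analysis code from the paper. It assumes a diagonal Sigma, so each direction can be simulated as an independent scalar, and the example variances 4.0, 1.0, and 0.01 are arbitrary choices.

```python
import random

def sample_variance(values):
    """Unbiased sample variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def target_variances(lam, n=20000, seed=0):
    """Empirical Var(y_x) and Var(y_v) along one direction with Var(x) = lam."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, lam ** 0.5) for _ in range(n)]   # x ~ N(0, lam)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]          # eps ~ N(0, 1)
    y_v = [a - b for a, b in zip(xs, eps)]                 # y_v = x - eps
    return sample_variance(xs), sample_variance(y_v)

# Hypothetical per-direction variances of an anisotropic Sigma.
for lam in (4.0, 1.0, 0.01):
    var_x, var_v = target_variances(lam)
    print(f"lambda={lam}: Var(y_x)~{var_x:.3f}, Var(y_v)~{var_v:.3f}")
```

For the smallest variance, the clean target stays near 0.01 while the velocity target sits near 1.01, so the regression target in that direction is dominated by the noise term.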

Training Curves

[Figure: training curves]

Key Findings

  1. Target geometry matters in latent space: Clean-latent prediction consistently outperforms matched velocity prediction under fixed representation, architecture, and training settings.
  2. Mechanism: Velocity prediction adds an isotropic covariance floor and amplifies low-variance latent directions, while clean prediction attenuates them.
  3. Representation independence: The advantage holds at both the /1 and /2 VAE-grid scales (FID 2.56 vs. 6.56 and 14.81 vs. 28.71, respectively), so it is not a byproduct of a particular patch size.

Citation

@misc{fu2026jltcleanlatentpredictionlatent,
  title={JLT: Clean-Latent Prediction in Latent Diffusion Transformers},
  author={Funing Fu and Tenghui Wang and Guanyu Zhou and Junyong Cen and Qichao Zhu},
  year={2026},
  eprint={2605.27102},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.27102}
}