data-archetype/semdisdiffae

Version History

Date        Change
2026-04-08  Fix posterior VP interpolation to use float32 precision (was using model dtype)
2026-04-07  Rename package capacitor_diffae → fcdm_diffae, class FCDMDiffAE; encode() now returns whitened latents, decode() dewhitens internally
2026-04-06  Initial release

SemDisDiffAE (Semantically Disentangled Diffusion AutoEncoder) is a fast image tokenizer with semantically structured 128-channel latents, built on FCDM (Fully Convolutional Diffusion Model) blocks with a VP-parameterized diagonal Gaussian posterior.

Trained with DINOv2 semantic alignment, this VAE was empirically found to offer comparable downstream diffusion convergence speed to other semantically aligned VAEs such as Flux.2 and PS-VAE, while being much faster to encode and decode and achieving very high reconstruction quality (38.6 dB mean PSNR on 2k images).
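For reference, PSNR on images scaled to [-1, 1] can be computed roughly as follows. This is a minimal sketch, not the evaluation code behind the numbers above: it assumes a peak-to-peak data range of 2.0 and works on flat sequences of pixel values, and the function name `psnr_db` is illustrative only.

```python
import math


def psnr_db(reference, reconstruction, data_range=2.0):
    """PSNR in dB between two equal-length sequences of pixel values.

    data_range is the peak-to-peak signal range: 2.0 for [-1, 1] images
    (the assumption made here), 1.0 for [0, 1], 255 for 8-bit images.
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs have unbounded PSNR
    return 10.0 * math.log10(data_range ** 2 / mse)
```

A "mean PSNR" over a dataset would then be the average of this value across images, which is an assumption about how the reported figure is aggregated.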

The encoder and decoder are purely convolutional, with no attention layers, which enables efficient inference at any resolution.

Technical Report · Interactive Results Viewer

Key Features

  • Fast: ~3 ms/img encode, ~6 ms/img decode (1 step) on Blackwell RTX Pro 6000, significantly faster than Flux.2 VAE
  • High fidelity: 38.6 dB mean PSNR (2k images), exceeding Flux.2 VAE (37.0 dB)
  • Semantically structured latents: DINOv2-aligned, producing latents with clear semantic segmentation visible in PCA projections
  • Comparable downstream convergence: empirically matches the downstream diffusion training convergence speed of Flux.2 and PS-VAE
  • Pure convolutional: no attention in encoder/decoder, so compute scales linearly, O(n), with the number of pixels
  • VP diffusion decoder: single-step DDIM for PSNR-optimal reconstruction, optional multi-step sampling with PDG for perceptual sharpening
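The single-step DDIM path can be illustrated with a generic deterministic DDIM update under x-prediction. This is a sketch under stated assumptions, not the model's sampler: `ddim_step` and its arguments are hypothetical names, the noise schedule is reduced to explicit (alpha, sigma) pairs with alpha² + sigma² = 1, and the network output `x_hat` is a stand-in list of numbers.

```python
def ddim_step(x_t, x_hat, alpha_t, sigma_t, alpha_s, sigma_s):
    """One deterministic DDIM update from noise level t to noise level s.

    x_t:   current noisy sample values
    x_hat: the network's clean-image prediction at level t (x-prediction)
    """
    if sigma_t <= 0.0:
        raise ValueError("sigma_t must be positive to recover the noise estimate")
    out = []
    for xt, xh in zip(x_t, x_hat):
        eps_hat = (xt - alpha_t * xh) / sigma_t       # noise implied by x_hat
        out.append(alpha_s * xh + sigma_s * eps_hat)  # re-noise to level s
    return out


# Single step from pure noise (alpha=0, sigma=1) to the clean endpoint
# (alpha=1, sigma=0): the result is exactly the x-prediction.
x_t = [0.3, -1.2, 0.7]      # stand-in for Gaussian noise
x_hat = [0.5, -0.25, 0.1]   # stand-in for the decoder's prediction
x_0 = ddim_step(x_t, x_hat, alpha_t=0.0, sigma_t=1.0, alpha_s=1.0, sigma_s=0.0)
```

With one step the output is just the network's clean-image estimate, which is consistent with the single-step setting being the PSNR-oriented one; multi-step sampling applies the same update repeatedly over intermediate noise levels.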

Architecture

Property       Value
Parameters     88.8M
Patch size     16
Model dim      896
Encoder depth  4 blocks
Decoder depth  8 blocks (2+4+2 skip-concat)
Bottleneck     128 channels
Compression    16x spatial, 6.0x total
Posterior      Diagonal Gaussian (VP log-SNR)
Block type     FCDM (ConvNeXt + GRN + scale/gate AdaLN)
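The compression figures follow from the patch size and bottleneck width: each 16x16 RGB patch (768 values) maps to 128 latent values. The arithmetic can be sketched as below; the helper names are illustrative, and the sketch assumes image sides that are multiples of the patch size and counts values rather than bits.

```python
def latent_shape(height, width, patch=16, channels=128):
    """Latent (C, H, W) for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("this sketch assumes sides divisible by the patch size")
    return (channels, height // patch, width // patch)


def total_compression(height, width, patch=16, channels=128, image_channels=3):
    """Ratio of input values to latent values."""
    c, h, w = latent_shape(height, width, patch, channels)
    return (image_channels * height * width) / (c * h * w)


print(latent_shape(512, 512))       # (128, 32, 32)
print(total_compression(512, 512))  # 6.0
```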

Quick Start

from fcdm_diffae import FCDMDiffAE, FCDMDiffAEInferenceConfig

model = FCDMDiffAE.from_pretrained("data-archetype/semdisdiffae", device="cuda")

# `images` is a [B,3,H,W] float tensor in [-1,1]; H and W are its height and width
# Encode (returns posterior mode by default)
latents = model.encode(images)  # [B,3,H,W] in [-1,1] -> [B,128,H/16,W/16]

# Decode: PSNR-optimal (1 step, default)
recon = model.decode(latents, height=H, width=W)

# Decode: perceptual sharpening (10 steps + PDG)
cfg = FCDMDiffAEInferenceConfig(num_steps=10, pdg=True, pdg_strength=2.0)
recon = model.decode(latents, height=H, width=W, inference_config=cfg)

# Full posterior access
posterior = model.encode_posterior(images)
z_sampled = posterior.sample()
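
As a rough picture of what sampling from a VP log-SNR posterior can look like, the sketch below uses one common variance-preserving form: z = alpha * mean + sigma * eps, with alpha² = sigmoid(logsnr) and sigma² = sigmoid(-logsnr). Whether this package uses exactly this form is an assumption; the function names and the per-element `mean`/`logsnr` inputs are hypothetical, and the technical report is the reference for the actual parameterization.

```python
import math
import random


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def vp_coeffs(logsnr):
    """(alpha, sigma) with alpha^2 + sigma^2 = 1 and alpha^2 / sigma^2 = exp(logsnr)."""
    return math.sqrt(_sigmoid(logsnr)), math.sqrt(_sigmoid(-logsnr))


def vp_posterior_sample(mean, logsnr, rng=random):
    """Draw one sample per element from the assumed VP diagonal Gaussian."""
    z = []
    for m, l in zip(mean, logsnr):
        alpha, sigma = vp_coeffs(l)
        z.append(alpha * m + sigma * rng.gauss(0.0, 1.0))
    return z
```

Under this form a high log-SNR element is nearly deterministic, while a low log-SNR element is dominated by unit-variance noise.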

Recommended Settings

Use case      Steps  PDG       Notes
PSNR-optimal  1      off       Default, fastest
Perceptual    10     on (2.0)  Sharper, ~15x slower

PDG is primarily useful for more compressed bottlenecks (32 or 64 channels) and is rarely necessary for 128-channel models where reconstruction quality is already high.
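The two recommended settings can be kept as named presets. The sketch below only builds keyword dictionaries; the keys mirror the `FCDMDiffAEInferenceConfig` arguments shown in Quick Start, while `PRESETS` and `preset_kwargs` are hypothetical helper names, and passing `pdg=False` explicitly is assumed to be equivalent to the default.

```python
# Values taken from the Recommended Settings table.
PRESETS = {
    "psnr": {"num_steps": 1, "pdg": False},
    "perceptual": {"num_steps": 10, "pdg": True, "pdg_strength": 2.0},
}


def preset_kwargs(use_case):
    """Return a fresh copy of the keyword arguments for a named use case."""
    try:
        return dict(PRESETS[use_case])
    except KeyError:
        raise ValueError(f"unknown use case {use_case!r}; expected one of {sorted(PRESETS)}")
```

Usage would then look like `cfg = FCDMDiffAEInferenceConfig(**preset_kwargs("perceptual"))`, followed by `model.decode(latents, height=H, width=W, inference_config=cfg)`.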

Training

Trained with:

  • Pixel-space VP diffusion reconstruction loss (x-prediction, SiD2 weighting)
  • DINOv2-S semantic alignment (negative cosine, weight 0.01)
  • VP posterior variance expansion (weight 1e-5)
  • Latent scale regularization (weight 0.0001)
  • AdamW optimizer, bf16 mixed precision, EMA decay 0.9995
  • 251k steps on a single GPU

See the technical report for full details.
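
To make the weights above concrete, the sketch below combines the listed terms as a plain weighted sum and shows a negative-cosine alignment term and an EMA update. It is an illustration under assumptions, not the training code: the unit weight on the reconstruction term, the weighted-sum form, and all function and key names are hypothetical.

```python
import math

LOSS_WEIGHTS = {
    "reconstruction": 1.0,       # assumed unit weight; not stated in the list above
    "dino_alignment": 0.01,
    "variance_expansion": 1e-5,
    "latent_scale": 1e-4,
}


def negative_cosine(a, b, eps=1e-8):
    """Negative cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return -dot / max(norm, eps)


def total_loss(terms):
    """Weighted sum of already-computed scalar loss terms, keyed as in LOSS_WEIGHTS."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())


def ema_update(ema_params, params, decay=0.9995):
    """Exponential moving average of parameters after one optimizer step."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

A decay of 0.9995 corresponds to an averaging horizon of roughly 1 / (1 - 0.9995) = 2000 steps.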

Dependencies

  • PyTorch >= 2.0
  • safetensors (for loading weights)

Citation

@misc{semdisdiffae,
  title   = {SemDisDiffAE: A Semantically Disentangled Diffusion Autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = apr,
  url     = {https://huggingface.co/data-archetype/semdisdiffae},
}

License

Apache 2.0
