| --- |
| license: apache-2.0 |
| tags: |
| - diffusion |
| - autoencoder |
| - image-reconstruction |
| - image-tokenizer |
| - pytorch |
| - fcdm |
| - semantic-alignment |
| library_name: fcdm_diffae |
| --- |
| |
| # data-archetype/semdisdiffae |
|
|
| ## Version History |
|
|
| | Date | Change | |
| |------|--------| |
| | 2026-04-08 | Fix posterior VP interpolation to use float32 precision (was using model dtype) | |
| | 2026-04-07 | Rename package `capacitor_diffae` → `fcdm_diffae`, class `FCDMDiffAE`; encode() now returns whitened latents, decode() dewhitens internally | |
| | 2026-04-06 | Initial release | |
|
|
| **SemDisDiffAE** (**Sem**antically **Dis**entangled **Diff**usion **A**uto**E**ncoder) |
| — a fast image tokenizer with semantically structured 128-channel latents, built |
| on FCDM (Fully Convolutional Diffusion Model) blocks with a VP-parameterized |
| diagonal Gaussian posterior. |
|
|
| Trained with DINOv2 semantic alignment, this VAE empirically matches the |
| downstream diffusion convergence speed of other semantically aligned VAEs |
| such as the Flux.2 VAE and PS-VAE, while encoding and decoding much faster |
| and reaching very high reconstruction quality (38.6 dB mean PSNR on 2k |
| images). |
|
|
| The architecture is purely convolutional, with no attention layers in the |
| encoder or decoder, which enables efficient inference at any resolution. |
|
|
| **[Technical Report](technical_report_semantic.md)** · |
| **[Interactive Results Viewer](https://huggingface.co/spaces/data-archetype/semdisdiffae-results)** |
|
|
| ## Key Features |
|
|
| - **Fast**: ~3 ms/img encode, ~6 ms/img decode (1 step) on Blackwell RTX Pro 6000 — significantly |
| faster than Flux.2 VAE |
| - **High fidelity**: 38.6 dB mean PSNR (2k images), exceeding Flux.2 VAE (37.0 dB) |
| - **Semantically structured latents**: DINOv2-aligned, producing latents with |
| clear semantic segmentation visible in PCA projections |
| - **Comparable downstream convergence**: empirically matches the downstream |
| diffusion training convergence speed of Flux.2 and PS-VAE |
| - **Pure convolutional**: no attention in encoder/decoder, so cost scales linearly with the number of pixels |
| - **VP diffusion decoder**: single-step DDIM for PSNR-optimal reconstruction, or optional |
| multi-step sampling with PDG for perceptual sharpening |
|
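To put the PSNR figures above in perspective, the sketch below computes PSNR from pixel values and inverts it to an RMS error. It assumes the common definition `10 * log10(data_range**2 / MSE)` on an 8-bit scale (`data_range=255`); the exact evaluation convention used for the reported numbers is not stated in this card, and the helper names `psnr_db` and `rmse_from_psnr` are illustrative only.

```python
import math


def psnr_db(reference, reconstruction, data_range=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(data_range ** 2 / mse)


def rmse_from_psnr(psnr, data_range=255.0):
    """Invert the PSNR formula to get the root-mean-square error."""
    return data_range / (10.0 ** (psnr / 20.0))


if __name__ == "__main__":
    # RMS error (in 8-bit gray levels) implied by the two PSNR values quoted above
    for name, value in [("SemDisDiffAE", 38.6), ("Flux.2 VAE", 37.0)]:
        print(f"{name}: {rmse_from_psnr(value):.2f} gray levels RMSE")
```

Under this assumed convention, 38.6 dB corresponds to an RMS error of roughly 3 gray levels out of 255.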
|
| ## Architecture |
|
|
| | Property | Value | |
| |----------|-------| |
| | Parameters | 88.8M | |
| | Patch size | 16 | |
| | Model dim | 896 | |
| | Encoder depth | 4 blocks | |
| | Decoder depth | 8 blocks (2+4+2 skip-concat) | |
| | Bottleneck | 128 channels | |
| | Compression | 16x spatial, 6.0x total | |
| | Posterior | Diagonal Gaussian (VP log-SNR) | |
| | Block type | FCDM (ConvNeXt + GRN + scale/gate AdaLN) | |
|
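The compression figures in the table follow from the patch size and channel counts: each 16×16 patch of a 3-channel image (768 values) maps to one 128-channel latent vector, giving 768 / 128 = 6.0. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the sketch assumes height and width are multiples of 16, since this card does not say how other sizes are handled.

```python
def latent_shape(height, width, patch_size=16, latent_channels=128):
    """Latent (C, H, W) for an image of the given size.

    Assumes height and width are multiples of patch_size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("this sketch only covers sizes divisible by the patch size")
    return (latent_channels, height // patch_size, width // patch_size)


def total_compression(patch_size=16, image_channels=3, latent_channels=128):
    """Ratio of input values to latent values per patch."""
    return image_channels * patch_size ** 2 / latent_channels


if __name__ == "__main__":
    print(latent_shape(1024, 768))   # latent grid for a 1024x768 image
    print(total_compression())       # values in / values out
```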
|
| ## Quick Start |
|
|
| ```python |
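| import torch |
| from fcdm_diffae import FCDMDiffAE, FCDMDiffAEInferenceConfig |
| |
| model = FCDMDiffAE.from_pretrained("data-archetype/semdisdiffae", device="cuda") |
| |
| # Placeholder input: a random RGB batch in [-1, 1]; substitute real images |
| H, W = 512, 512 |
| images = torch.rand(1, 3, H, W, device="cuda") * 2 - 1 |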
| |
| # Encode (returns posterior mode by default) |
| latents = model.encode(images) # [B,3,H,W] in [-1,1] -> [B,128,H/16,W/16] |
| |
| # Decode — PSNR-optimal (1 step, default) |
| recon = model.decode(latents, height=H, width=W) |
| |
| # Decode — perceptual sharpening (10 steps + PDG) |
| cfg = FCDMDiffAEInferenceConfig(num_steps=10, pdg=True, pdg_strength=2.0) |
| recon = model.decode(latents, height=H, width=W, inference_config=cfg) |
| |
| # Full posterior access |
| posterior = model.encode_posterior(images) |
| z_sampled = posterior.sample() |
| ``` |
|
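As a rough picture of what sampling from a variance-preserving (VP) diagonal Gaussian can look like, the sketch below uses the standard VP relations `alpha**2 = sigmoid(log_snr)` and `sigma**2 = sigmoid(-log_snr)`, so that `alpha**2 + sigma**2 = 1`. Whether `posterior.sample()` in `fcdm_diffae` is implemented exactly this way is an assumption; the function names here are illustrative and not part of the library.

```python
import math
import random


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def vp_coefficients(log_snr):
    """Return (alpha, sigma) with alpha^2 + sigma^2 = 1 and log(alpha^2 / sigma^2) = log_snr."""
    alpha = math.sqrt(_sigmoid(log_snr))
    sigma = math.sqrt(_sigmoid(-log_snr))
    return alpha, sigma


def vp_sample(mean, log_snr, rng=random):
    """Element-wise draw z_i = alpha_i * mean_i + sigma_i * eps_i, with eps_i ~ N(0, 1)."""
    sample = []
    for m, lam in zip(mean, log_snr):
        alpha, sigma = vp_coefficients(lam)
        sample.append(alpha * m + sigma * rng.gauss(0.0, 1.0))
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    # High log-SNR keeps the sample close to the mean; low log-SNR is mostly noise.
    print(vp_sample([0.5, -1.0], [8.0, -8.0], rng=rng))
```

The sketch computes in Python floats (double precision). In a tensor implementation, these coefficients are the kind of quantity that benefits from float32 rather than bf16, since `sigma` becomes very small at high log-SNR.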
|
| ## Recommended Settings |
|
|
| | Use case | Steps | PDG | Notes | |
| |----------|-------|-----|-------| |
| | PSNR-optimal | 1 | off | Default, fastest | |
| | Perceptual | 10 | on (2.0) | Sharper, ~15x slower | |
|
|
| PDG is primarily useful for more compressed bottlenecks (32 or 64 channels) |
| and is rarely necessary for 128-channel models where reconstruction quality |
| is already high. |
|
|
| ## Training |
|
|
| Trained with: |
| - Pixel-space VP diffusion reconstruction loss (x-prediction, SiD2 weighting) |
| - DINOv2-S semantic alignment (negative cosine, weight 0.01) |
| - VP posterior variance expansion (weight 1e-5) |
| - Latent scale regularization (weight 1e-4) |
| - AdamW optimizer, bf16 mixed precision, EMA decay 0.9995 |
| - 251k steps on a single GPU |
|
|
| See the [technical report](technical_report_semantic.md) for full details. |
|
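The weights listed above can be read as coefficients in a weighted sum of loss terms. The sketch below shows one plausible way to combine them, along with a plain negative-cosine alignment term. The reconstruction weight of 1.0, the dictionary keys, and the function names are assumptions for illustration; the actual training code may combine the terms differently.

```python
import math

# Weights from the list above; the reconstruction weight of 1.0 is assumed.
LOSS_WEIGHTS = {
    "reconstruction": 1.0,
    "semantic_alignment": 0.01,
    "variance_expansion": 1e-5,
    "latent_scale": 1e-4,
}


def negative_cosine(a, b, eps=1e-12):
    """Negative cosine similarity: -1 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return -dot / max(norm_a * norm_b, eps)


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of already-computed scalar loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


if __name__ == "__main__":
    # Toy feature vectors standing in for a projected latent and a DINOv2 feature
    align = negative_cosine([0.2, 0.9, -0.1], [0.1, 1.0, 0.0])
    print(total_loss({
        "reconstruction": 0.05,
        "semantic_alignment": align,
        "variance_expansion": 0.0,
        "latent_scale": 0.0,
    }))
```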
|
| ## Dependencies |
|
|
| - PyTorch >= 2.0 |
| - safetensors (for loading weights) |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{semdisdiffae, |
| title = {SemDisDiffAE: A Semantically Disentangled Diffusion Autoencoder}, |
| author = {data-archetype}, |
| email = {data-archetype@proton.me}, |
| year = {2026}, |
| month = apr, |
| url = {https://huggingface.co/data-archetype/semdisdiffae}, |
| } |
| ``` |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|