---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - image-tokenizer
  - pytorch
  - fcdm
  - semantic-alignment
library_name: fcdm_diffae
---

# data-archetype/semdisdiffae

### Version History

| Date | Change |
|------|--------|
| 2026-04-08 | Fix posterior VP interpolation to use float32 precision (was using model dtype) |
| 2026-04-07 | Rename package `capacitor_diffae` to `fcdm_diffae`; the main class is now `FCDMDiffAE`; `encode()` now returns whitened latents, `decode()` dewhitens internally |
| 2026-04-06 | Initial release |

**SemDisDiffAE** (**Sem**antically **Dis**entangled **Diff**usion **A**uto**E**ncoder)
is a fast image tokenizer with semantically structured 128-channel latents, built
on FCDM (Fully Convolutional Diffusion Model) blocks with a VP-parameterized
diagonal Gaussian posterior.

Trained with DINOv2 semantic alignment, this VAE was empirically found to
offer comparable downstream diffusion convergence speed to other semantically
aligned VAEs such as Flux.2 and PS-VAE, while being much faster to encode
and decode and achieving very high reconstruction quality (38.6 dB mean PSNR
on 2k images).
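The card does not state which PSNR convention is used. As a minimal sketch, assuming images in [-1, 1] (so the peak-to-peak range is 2) and a per-image PSNR that is then averaged over the evaluation set, the metric could be computed as follows. The function name `psnr` and the `data_range` parameter are illustrative and not part of `fcdm_diffae`.

```python
import numpy as np

def psnr(reference, reconstruction, data_range=2.0):
    """PSNR in dB for one image; data_range=2.0 assumes pixel values in [-1, 1]."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

def mean_psnr(references, reconstructions, data_range=2.0):
    """Average of per-image PSNR values (assumed meaning of "mean PSNR")."""
    scores = [psnr(a, b, data_range) for a, b in zip(references, reconstructions)]
    return sum(scores) / len(scores)
```

Under this convention, a uniform error of 0.02 on a [-1, 1] image corresponds to 40 dB.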

The architecture is purely convolutional, with no attention layers in the
encoder or decoder, which enables efficient inference at any resolution.

**[Technical Report](technical_report_semantic.md)** ·
**[Interactive Results Viewer](https://huggingface.co/spaces/data-archetype/semdisdiffae-results)**

## Key Features

- **Fast**: ~3 ms/img encode, ~6 ms/img decode (1 step) on Blackwell RTX Pro 6000 — significantly
  faster than Flux.2 VAE
- **High fidelity**: 38.6 dB mean PSNR (2k images), exceeding Flux.2 VAE (37.0 dB)
- **Semantically structured latents**: DINOv2-aligned, producing latents with
  clear semantic segmentation visible in PCA projections
- **Comparable downstream convergence**: empirically matches the downstream
  diffusion training convergence speed of Flux.2 and PS-VAE
- **Pure convolutional**: no attention in encoder/decoder, so cost is O(n) in the number of pixels
- **VP diffusion decoder**: single-step DDIM for PSNR-optimal reconstruction, or optional
  multi-step sampling with PDG for perceptual sharpening
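To make the single-step versus multi-step distinction concrete, here is a rough sketch of a deterministic DDIM sampler for an x-predicting denoiser. It is not the `fcdm_diffae` sampler: the cosine VP schedule, the uniform time grid, and the `denoise(x, t)` callable (standing in for the latent-conditioned decoder) are all assumptions, and PDG is not shown.

```python
import numpy as np

def ddim_sample_x_pred(denoise, shape, num_steps=1, seed=0):
    """Deterministic DDIM with x-prediction under an assumed cosine VP schedule."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)               # start from pure noise at t = 1
    ts = np.linspace(1.0, 0.0, num_steps + 1)    # t = 1 (noise) down to t = 0 (data)
    for t, s in zip(ts[:-1], ts[1:]):
        alpha_t, sigma_t = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
        alpha_s, sigma_s = np.cos(0.5 * np.pi * s), np.sin(0.5 * np.pi * s)
        x0_hat = denoise(x, t)                     # network's estimate of the clean image
        eps_hat = (x - alpha_t * x0_hat) / sigma_t # noise implied by that estimate
        x = alpha_s * x0_hat + sigma_s * eps_hat   # move to the next noise level
    return x
```

With `num_steps=1` the loop runs once from t = 1 to t = 0, so the result is just the denoiser's clean-image estimate from pure noise. Additional steps re-query the denoiser at intermediate noise levels, which is where sharper but lower-PSNR detail can enter.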

## Architecture

| Property | Value |
|----------|-------|
| Parameters | 88.8M |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 blocks |
| Decoder depth | 8 blocks (2+4+2 skip-concat) |
| Bottleneck | 128 channels |
| Compression | 16x spatial, 6.0x total |
| Posterior | Diagonal Gaussian (VP log-SNR) |
| Block type | FCDM (ConvNeXt + GRN + scale/gate AdaLN) |

## Quick Start

```python
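import torch
from fcdm_diffae import FCDMDiffAE, FCDMDiffAEInferenceConfig

model = FCDMDiffAE.from_pretrained("data-archetype/semdisdiffae", device="cuda")

# Placeholder input so the snippet runs end to end; substitute real images scaled to [-1, 1].
# Depending on how the model is loaded, the batch may need casting to the model's dtype.
B, H, W = 1, 512, 512
images = torch.rand(B, 3, H, W, device="cuda") * 2 - 1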

# Encode (returns posterior mode by default)
latents = model.encode(images)  # [B,3,H,W] in [-1,1] -> [B,128,H/16,W/16]

# Decode — PSNR-optimal (1 step, default)
recon = model.decode(latents, height=H, width=W)

# Decode — perceptual sharpening (10 steps + PDG)
cfg = FCDMDiffAEInferenceConfig(num_steps=10, pdg=True, pdg_strength=2.0)
recon = model.decode(latents, height=H, width=W, inference_config=cfg)

# Full posterior access
posterior = model.encode_posterior(images)
z_sampled = posterior.sample()
```
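
For readers unfamiliar with VP-parameterized posteriors, the sketch below shows one common reading: a mean and a per-element log-SNR define `alpha` and `sigma` with `alpha**2 + sigma**2 == 1`, and a sample interpolates between the mean and unit Gaussian noise. This is an assumption for illustration, not the `fcdm_diffae` implementation; see the technical report for the actual definition. The float32 casts mirror the precision fix noted in the version history.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_vp_posterior(mean, log_snr, rng):
    """Sample z = alpha * mean + sigma * eps, computed in float32 (assumed form)."""
    mean = np.asarray(mean, dtype=np.float32)
    log_snr = np.asarray(log_snr, dtype=np.float32)
    alpha = np.sqrt(sigmoid(log_snr))    # signal coefficient
    sigma = np.sqrt(sigmoid(-log_snr))   # noise coefficient, alpha^2 + sigma^2 = 1
    eps = rng.standard_normal(mean.shape).astype(np.float32)
    return alpha * mean + sigma * eps
```

At high log-SNR the sample collapses onto the mean, and at very low log-SNR it approaches unit Gaussian noise.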

## Recommended Settings

| Use case | Steps | PDG | Notes |
|----------|-------|-----|-------|
| PSNR-optimal | 1 | off | Default, fastest |
| Perceptual | 10 | on (2.0) | Sharper, ~15x slower |

PDG is primarily useful for more compressed bottlenecks (32 or 64 channels)
and is rarely necessary for 128-channel models where reconstruction quality
is already high.
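
The two rows of the table can be kept as named presets. The sketch below only builds keyword dictionaries using the parameter names from the Quick Start (`num_steps`, `pdg`, `pdg_strength`); the helper `decode_settings` is hypothetical, and passing `pdg=False` explicitly is an assumption about the config's accepted values.

```python
def decode_settings(use_case="psnr"):
    """Return decoder keyword arguments for a named use case (hypothetical helper)."""
    presets = {
        "psnr": {"num_steps": 1, "pdg": False},                             # default, fastest
        "perceptual": {"num_steps": 10, "pdg": True, "pdg_strength": 2.0},  # sharper, slower
    }
    if use_case not in presets:
        raise ValueError(f"unknown use case: {use_case!r}")
    return dict(presets[use_case])
```

A preset could then be expanded into the config object, for example `FCDMDiffAEInferenceConfig(**decode_settings("perceptual"))`.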

## Training

Trained with:
- Pixel-space VP diffusion reconstruction loss (x-prediction, SiD2 weighting)
- DINOv2-S semantic alignment (negative cosine, weight 0.01)
- VP posterior variance expansion (weight 1e-5)
- Latent scale regularization (weight 1e-4)
- AdamW optimizer, bf16 mixed precision, EMA decay 0.9995
- 251k steps on a single GPU
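
Two of the ingredients above have standard textbook forms, sketched here on plain Python lists for clarity. This is not the training code: how features are pooled or projected before the DINOv2-S comparison is not specified on this card, and the function names are illustrative.

```python
import math

def negative_cosine(a, b, eps=1e-8):
    """Negative cosine similarity between two feature vectors (lower is better aligned)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return -dot / max(norm_a * norm_b, eps)

def ema_update(ema, params, decay=0.9995):
    """One exponential-moving-average step over a flat list of parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

With a decay of 0.9995, the average has an effective horizon of about 1 / (1 - 0.9995) = 2,000 steps.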

See the [technical report](technical_report_semantic.md) for full details.

## Dependencies

- PyTorch >= 2.0
- safetensors (for loading weights)

## Citation

```bibtex
@misc{semdisdiffae,
  title   = {SemDisDiffAE: A Semantically Disentangled Diffusion Autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = apr,
  url     = {https://huggingface.co/data-archetype/semdisdiffae},
}
```

## License

Apache 2.0