Generate a high-level diagram of a Vision Transformer (ViT):
- Input: 224×224 RGB image
- Patch Embedding: split into 16×16 patches and apply a linear projection
- Add CLS token and positional encoding
- Transformer Encoder stack: Multi-Head Self-Attention + MLP + residual + LayerNorm (repeat L layers)
- Classification head: take CLS token for linear classification
Layout requirements:
- left-to-right flow
- clear arrow directions
- correct spelling of all labels
- show the number of layers L
- keep colors readable
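One way to realize the diagram above programmatically is to emit Graphviz DOT source. The sketch below is a minimal, illustrative assumption (node names, labels, and styling are not from the original prompt): it builds the five stages as boxes, sets a left-to-right flow with `rankdir=LR`, and shows the layer count `L` in the encoder label.

```python
# Hypothetical sketch: generate Graphviz DOT source for the ViT diagram
# described above. Node names and styling are illustrative assumptions.

def vit_diagram_dot(num_layers: int = 12) -> str:
    # The five stages of the prompt, in left-to-right order.
    nodes = [
        ("input", "Input\\n224x224 RGB image"),
        ("patch", "Patch Embedding\\n16x16 patches + linear projection"),
        ("tokens", "CLS token +\\npositional encoding"),
        ("encoder", f"Transformer Encoder x{num_layers}\\nMHSA + MLP + residual + LayerNorm"),
        ("head", "Classification head\\n(CLS token -> linear)"),
    ]
    lines = [
        "digraph ViT {",
        "  rankdir=LR;",                      # left-to-right flow
        "  node [shape=box, style=rounded];", # readable boxed labels
    ]
    for name, label in nodes:
        lines.append(f'  {name} [label="{label}"];')
    # One arrow chain through all stages gives clear arrow directions.
    chain = " -> ".join(name for name, _ in nodes)
    lines.append(f"  {chain};")
    lines.append("}")
    return "\n".join(lines)

print(vit_diagram_dot(12))
```

The resulting DOT text can be rendered with `dot -Tpng` to produce the diagram; the repeated encoder stack is collapsed into a single labeled block rather than drawn L times.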