DINOv3 ViT-H/16+ Booru Tagger

A multi-label image tagger trained on e621 and Danbooru annotations, using a DINOv3 ViT-H/16+ backbone fine-tuned end-to-end with a single linear projection head.

Model Details

| Property | Value |
|---|---|
| Backbone | `facebook/dinov3-vith16plus-pretrain-lvd1689m` |
| Architecture | ViT-H/16+ · 32 layers · hidden dim 1280 · 20 heads · SwiGLU MLP · RoPE · 4 register tokens |
| Head | `Linear((1 + 4) × 1280 → 74 625)`: CLS + 4 register tokens concatenated |
| Vocabulary | 74 625 tags (min frequency ≥ 50 across training set) |
| Input resolution | Any multiple of 16 px; trained at 512 px, generalises to higher resolutions |
| Input normalisation | ImageNet mean/std `[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]` |
| Output | Raw logits; apply `sigmoid` for per-tag probabilities |
| Parameters | ~632 M (backbone) + ~480 M (head) |
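
As a sanity check on the table above, the following sketch works through the sizes it implies: snapping an image side to a multiple of the 16 px patch size, the resulting token count, ImageNet normalisation of a pixel, and the head's input width and parameter count. The function names are illustrative and not part of the released scripts. Two things are assumptions rather than facts from this card: rounding down (instead of up or to nearest) when snapping, and the presence of a bias term in the linear head.

```python
PATCH = 16            # ViT-H/16+ patch size in pixels
HIDDEN = 1280         # hidden dimension from the table
NUM_REGISTERS = 4     # register tokens
NUM_TAGS = 74_625     # vocabulary size
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def snap_to_patch(side: int) -> int:
    """Round a side length down to a multiple of PATCH (at least one patch)."""
    return max(PATCH, (side // PATCH) * PATCH)

def sequence_length(height: int, width: int) -> int:
    """Patch tokens plus CLS and register tokens for a patch-aligned image."""
    patches = (height // PATCH) * (width // PATCH)
    return patches + 1 + NUM_REGISTERS

def normalise_pixel(rgb):
    """Map an 8-bit RGB triple to ImageNet-normalised floats."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))

head_in = (1 + NUM_REGISTERS) * HIDDEN        # CLS + registers, concatenated
head_params = head_in * NUM_TAGS + NUM_TAGS   # weights + bias (bias is assumed)

print(snap_to_patch(1000))         # 992
print(sequence_length(512, 512))   # 1029
print(head_in, head_params)        # 6400 477674625
```

With 6400 × 74 625 weights the head comes to roughly 478 M parameters, which is consistent with the ~480 M quoted in the table.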

Training

| Hyperparameter | Value |
|---|---|
| Training data | e621 + Danbooru (parquet) |
| Batch size | 32 |
| Learning rate | 1e-6 |
| Warmup steps | 50 |
| Loss | `BCEWithLogitsLoss` with per-tag `pos_weight = (neg/pos)^(1/T)`, cap 100 |
| Optimiser | AdamW (β₁=0.9, β₂=0.999, wd=0.01) |
| Precision | bfloat16 (backbone) / float32 (projection + loss) |
| Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |
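
The loss row gives the per-tag weight as `(neg/pos)^(1/T)` with a cap of 100. The sketch below shows one way such weights could be computed from tag counts. It is not the training code: the function name is hypothetical, and `temperature=2.0` is a placeholder because this card does not state the value of `T`. In a PyTorch training loop the resulting list would typically be converted to a tensor and passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`.

```python
def pos_weights(pos_counts, num_samples, temperature=2.0, cap=100.0):
    """Per-tag weight (neg/pos)^(1/T), clipped to `cap`.

    temperature=2.0 is a placeholder: the card does not state the value of T.
    """
    weights = []
    for pos in pos_counts:
        neg = num_samples - pos
        ratio = neg / max(pos, 1)   # guard against tags with zero positives
        weights.append(min(ratio ** (1.0 / temperature), cap))
    return weights

# Three hypothetical tags in a 1,000,000-image dataset: common, medium, rare.
weights = pos_weights([500_000, 10_000, 50], 1_000_000)
print(weights)
```

With these toy counts the balanced tag gets a weight of 1.0, while the rare tag's raw weight exceeds the cap and is clipped to 100. A larger `T` flattens the weights toward 1, so rare tags are boosted less aggressively than with the plain `neg/pos` ratio.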

Usage

Standalone (no `transformers` dependency)

from inference_tagger_standalone import Tagger

tagger = Tagger(
    checkpoint_path="tagger_proto.safetensors",
    vocab_path="tagger_vocab.json",
    device="cuda",
)

tags = tagger.predict("photo.jpg", topk=40)
# → [("solo", 0.98), ("anthro", 0.95), ...]

# or threshold-based
tags = tagger.predict("https://example.com/image.jpg", threshold=0.35)
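
Because the model outputs raw logits, both selection modes come down to applying a sigmoid and then either keeping the top `k` scores or keeping everything above a threshold. The following is a minimal pure-Python sketch of that post-processing, not the implementation of `Tagger.predict`; `select_tags` and the toy vocabulary are made up for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def select_tags(logits, idx2tag, topk=None, threshold=None):
    """Return (tag, probability) pairs, highest probability first."""
    scored = sorted(
        ((tag, sigmoid(z)) for tag, z in zip(idx2tag, logits)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] >= threshold]
    if topk is not None:
        scored = scored[:topk]
    return scored

# Toy logits for a hypothetical four-tag vocabulary.
vocab = ["solo", "anthro", "sitting", "sketch"]
logits = [3.9, 2.9, -0.2, -4.0]
print(select_tags(logits, vocab, topk=2))
print(select_tags(logits, vocab, threshold=0.35))
```

Top-k always returns a fixed number of tags, which suits captioning pipelines. A threshold returns a variable number and is the more natural choice when each tag is treated as an independent yes/no decision.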

CLI

# top-30 tags, pretty output
python inference_tagger_standalone.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab.json \
    --images photo.jpg https://example.com/image.jpg \
    --topk 30

# comma-separated string (pipe into diffusion trainer)
python inference_tagger_standalone.py ... --format tags

# JSON
python inference_tagger_standalone.py ... --format json

Web UI

pip install fastapi uvicorn jinja2 aiofiles

python tagger_ui_server.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab.json \
    --port 7860
# → open http://localhost:7860

Files

| File | Description |
|---|---|
| `*.safetensors` | Model weights (bfloat16) |
| `tagger_vocab.json` | `{"idx2tag": [...]}`: 74 625 tag strings ordered by training frequency |
| `inference_tagger_standalone.py` | Self-contained inference script (no `transformers` dep) |
| `tagger_ui_server.py` | FastAPI + Jinja2 web UI server |
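
If you want to map between logit indices and tag strings yourself, the vocabulary layout described above is enough to work with. This sketch parses the `{"idx2tag": [...]}` structure and builds the reverse lookup. `load_vocab` is a hypothetical helper, and the three-entry sample stands in for the real file.

```python
import json

def load_vocab(text: str):
    """Parse the vocabulary JSON and return (idx2tag, tag2idx)."""
    idx2tag = json.loads(text)["idx2tag"]
    tag2idx = {tag: i for i, tag in enumerate(idx2tag)}
    return idx2tag, tag2idx

# Tiny stand-in for the contents of tagger_vocab.json.
sample = '{"idx2tag": ["solo", "anthro", "1girl"]}'
idx2tag, tag2idx = load_vocab(sample)
print(idx2tag[1], tag2idx["1girl"])   # anthro 2
```

For the real file, read its text first, for example with `pathlib.Path("tagger_vocab.json").read_text(encoding="utf-8")`, and pass the result to `load_vocab`.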

Tag Vocabulary

Tags are sourced from e621 and Danbooru annotations and cover:

Subject — species, character count, gender (solo, duo, anthro, 1girl, male, …)
Body — anatomy, fur/scale/skin markings, body parts
Action / pose — looking at viewer, sitting, …
Scene — background, lighting, setting
Style — digital art, hi res, sketch, watercolor, …
Rating — explicit content tags are included; filter as needed for your use case
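
Filtering can be done as a post-processing step on the predicted `(tag, probability)` pairs. The sketch below drops any tag found in a caller-supplied blocklist. The helper name, the predictions, and the blocklist entry are all placeholders, since this card does not ship a list of explicit tags.

```python
def filter_tags(predictions, blocklist):
    """Drop (tag, probability) pairs whose tag is in `blocklist`."""
    blocked = {tag.lower() for tag in blocklist}
    return [(tag, p) for tag, p in predictions if tag.lower() not in blocked]

# Hypothetical predictions and blocklist entry, for illustration only.
predictions = [("solo", 0.98), ("example_explicit_tag", 0.91), ("sitting", 0.62)]
print(filter_tags(predictions, {"example_explicit_tag"}))
# [('solo', 0.98), ('sitting', 0.62)]
```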

Minimum tag frequency threshold: 50 occurrences across the combined dataset.
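
A frequency cut-off like this can be applied with a simple counter over the per-image tag lists. The sketch below is an illustration, not the script used to build the released vocabulary: `build_vocab` is a hypothetical name, and counting each tag once per image and breaking ties alphabetically are assumptions.

```python
from collections import Counter

def build_vocab(tag_lists, min_count=50):
    """Keep tags seen at least `min_count` times, most frequent first."""
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    kept = [(tag, n) for tag, n in counts.items() if n >= min_count]
    kept.sort(key=lambda item: (-item[1], item[0]))   # ties broken alphabetically
    return [tag for tag, _ in kept]

# Toy dataset with a lowered threshold so the effect is visible.
posts = [["solo", "anthro"], ["solo", "sketch"], ["solo", "anthro"]]
print(build_vocab(posts, min_count=2))   # ['solo', 'anthro']
```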

Limitations

Evaluated on booru-style illustrations and furry art; performance on photographic images or other art styles is untested.
The vocabulary reflects the biases of e621 and Danbooru annotation practices.

License

Apache 2.0
