DINOv3 ViT-H/16+ Booru Tagger

A multi-label image tagger trained on e621 and Danbooru annotations, using a DINOv3 ViT-H/16+ backbone fine-tuned end-to-end with a single linear projection head.

Model Details

| Property | Value |
|---|---|
| Backbone | facebook/dinov3-vith16plus-pretrain-lvd1689m |
| Architecture | ViT-H/16+ · 32 layers · hidden dim 1280 · 20 heads · SwiGLU MLP · RoPE · 4 register tokens |
| Head | Linear((1 + 4) × 1280 → 74 625), applied to the concatenated CLS + 4 register tokens |
| Vocabulary | 74 625 tags (frequency ≥ 50 across the training set) |
| Input resolution | Any multiple of 16 px; trained at 512 px, generalises to higher resolutions |
| Input normalisation | ImageNet mean/std [0.485, 0.456, 0.406] / [0.229, 0.224, 0.225] |
| Output | Raw logits; apply sigmoid for per-tag probabilities |
| Parameters | ~632 M (backbone) + ~480 M (head) |
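
The input geometry and head size listed above can be checked with a little arithmetic. The sketch below is not taken from the repository: the helper names are hypothetical, rounding a side length down to a multiple of 16 is an assumed resizing policy, and the head is assumed to carry a bias term.

```python
PATCH = 16            # patch size of ViT-H/16+
HIDDEN = 1280         # hidden dimension
NUM_REGISTERS = 4     # register tokens concatenated with CLS
NUM_TAGS = 74_625     # vocabulary size

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def snap_to_patch(size, patch=PATCH):
    """Round a side length down to a multiple of the patch size (assumed policy)."""
    return max(patch, (size // patch) * patch)


def patch_tokens(width, height, patch=PATCH):
    """Number of patch tokens for an image whose sides are multiples of `patch`."""
    return (width // patch) * (height // patch)


def head_parameters(hidden=HIDDEN, registers=NUM_REGISTERS, tags=NUM_TAGS):
    """Weights plus bias of a linear layer over the CLS + register tokens."""
    in_features = (1 + registers) * hidden
    return in_features * tags + tags  # bias term is an assumption


def normalise_pixel(rgb):
    """Map an 8-bit RGB triple to ImageNet-normalised floats."""
    return tuple((c / 255.0 - m) / s
                 for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))
```

Under these assumptions, `patch_tokens(512, 512)` gives 1024 patch tokens at the training resolution, and `head_parameters()` gives 477 674 625, consistent with the ~480 M quoted for the head.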

Training

| Hyperparameter | Value |
|---|---|
| Training data | e621 + Danbooru (parquet) |
| Batch size | 32 |
| Learning rate | 1e-6 |
| Warmup steps | 50 |
| Loss | BCEWithLogitsLoss with per-tag pos_weight = (neg/pos)^(1/T), cap 100 |
| Optimiser | AdamW (β₁=0.9, β₂=0.999, wd=0.01) |
| Precision | bfloat16 (backbone) / float32 (projection + loss) |
| Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |
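
The per-tag weighting in the loss can be illustrated with a small pure-Python sketch. It is an interpretation of the formula above, not the training code: the value of T is not stated here, so the default `temperature=2.0` is only a placeholder, the cap is assumed to apply after exponentiation, and the function names are hypothetical. The loss follows the standard definition of binary cross-entropy on logits with a positive-class weight.

```python
import math


def tag_pos_weight(pos, total, temperature=2.0, cap=100.0):
    """pos_weight = (neg / pos) ** (1 / T), capped at `cap`.

    `temperature` stands in for T, whose real value is not given.
    """
    if pos <= 0:
        return cap  # a tag with no positives gets the maximum weight
    neg = total - pos
    return min(cap, (neg / pos) ** (1.0 / temperature))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit, target, weight):
    """-[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))] for one tag."""
    return weight * target * softplus(-logit) + (1.0 - target) * softplus(logit)
```

With T > 1 the exponent compresses the raw neg/pos ratio, so rare tags are up-weighted less aggressively than plain inverse-frequency weighting would, and the cap bounds the weight for the rarest tags.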

Usage

Standalone (no transformers dependency)

from inference_tagger_standalone import Tagger

tagger = Tagger(
    checkpoint_path="tagger_proto.safetensors",
    vocab_path="tagger_vocab.json",
    device="cuda",
)

tags = tagger.predict("photo.jpg", topk=40)
# → [("solo", 0.98), ("anthro", 0.95), ...]

# or threshold-based
tags = tagger.predict("https://example.com/image.jpg", threshold=0.35)
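
For readers who want to post-process raw logits themselves, the sketch below shows one plausible way to turn them into `(tag, probability)` pairs with either a top-k cut or a probability threshold. It is not the code inside `inference_tagger_standalone.py`; `select_tags` is a hypothetical helper, and applying the threshold before the top-k cut is an assumption.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def select_tags(logits, idx2tag, topk=None, threshold=None):
    """Convert raw logits into (tag, probability) pairs, highest first.

    `threshold` keeps tags with probability >= threshold;
    `topk` keeps at most that many of the remaining tags.
    """
    scored = [(idx2tag[i], sigmoid(z)) for i, z in enumerate(logits)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] >= threshold]
    if topk is not None:
        scored = scored[:topk]
    return scored
```

Because each tag has its own sigmoid, the probabilities are independent and do not sum to one; a threshold therefore returns a variable number of tags per image, while top-k returns a fixed number.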

CLI

# top-30 tags, pretty output
python inference_tagger_standalone.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab.json \
    --images photo.jpg https://example.com/image.jpg \
    --topk 30

# comma-separated string (pipe into diffusion trainer)
python inference_tagger_standalone.py ... --format tags

# JSON
python inference_tagger_standalone.py ... --format json

Web UI

pip install fastapi uvicorn jinja2 aiofiles

python tagger_ui_server.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab.json \
    --port 7860
# → open http://localhost:7860

Files

| File | Description |
|---|---|
| *.safetensors | Model weights (bfloat16) |
| tagger_vocab.json | {"idx2tag": [...]}: 74 625 tag strings ordered by training frequency |
| inference_tagger_standalone.py | Self-contained inference script (no transformers dep) |
| tagger_ui_server.py | FastAPI + Jinja2 web UI server |
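
The vocabulary file can be read with the standard library alone. This is a minimal sketch that assumes only the `{"idx2tag": [...]}` layout described above; `load_vocab` and the reverse `tag2idx` mapping are hypothetical conveniences, not part of the shipped scripts.

```python
import json


def load_vocab(path):
    """Load the tag list and build a reverse tag -> index lookup."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    idx2tag = data["idx2tag"]  # position i holds the tag for logit i
    if not isinstance(idx2tag, list):
        raise ValueError("expected 'idx2tag' to be a list of tag strings")
    tag2idx = {tag: i for i, tag in enumerate(idx2tag)}
    return idx2tag, tag2idx


# Example (requires the file from this repository):
# idx2tag, tag2idx = load_vocab("tagger_vocab.json")
# len(idx2tag) should then equal the vocabulary size of 74 625.
```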

Tag Vocabulary

Tags are sourced from e621 and Danbooru annotations and cover:

  • Subject: species, character count, gender (solo, duo, anthro, 1girl, male, …)
  • Body: anatomy, fur/scale/skin markings, body parts
  • Action / pose: looking at viewer, sitting, …
  • Scene: background, lighting, setting
  • Style: digital art, hi res, sketch, watercolor, …
  • Rating: explicit content tags are included; filter as needed for your use case

Minimum tag frequency threshold: 50 occurrences across the combined dataset.
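
Since explicit content tags are part of the vocabulary, downstream users may want to drop unwanted tags after prediction. The sketch below filters a list of `(tag, probability)` pairs against a caller-supplied blocklist; `filter_tags` is a hypothetical helper, the case-insensitive match is an assumption, and the blocklist contents are left entirely to the user.

```python
def filter_tags(predictions, blocked):
    """Drop predictions whose tag appears in `blocked` (case-insensitive)."""
    blocked_set = {tag.lower() for tag in blocked}
    return [(tag, prob) for tag, prob in predictions
            if tag.lower() not in blocked_set]
```

Filtering after prediction leaves the model output untouched, so the same predictions can be reused with different blocklists.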

Limitations

  • Evaluated on booru-style illustrations and furry art; performance on photographic images or other art styles is untested.
  • The vocabulary reflects the biases of e621 and Danbooru annotation practices.

License

Apache 2.0
