Py-Feat Pose-MLP v2 — Landmark-to-6DoF Head Pose

A small distilled MLP that takes 68 face landmarks (the dlib-68 / OpenFace layout produced by mobilefacenet, OpenFace, etc.) and emits 6DoF head pose calibrated to img2pose's coordinate frame. Designed for py-feat pipelines that use a face detector without a built-in pose head (e.g. RetinaFace in py-feat ≥ 0.7).

Model Description

py-feat's v0.6 production pipeline used img2pose as its face detector. img2pose performs face localization and 6DoF head-pose regression jointly, so pose came "for free" from the detector. In v0.7 the default face detector became RetinaFace, which has a much higher WIDER FACE Hard AP but only detects faces. To preserve the Fex schema (pitch, roll, yaw, x, y, z columns), py-feat distills img2pose's pose regression into a small MLP that operates entirely on already-computed landmarks.

The MLP is bbox-free: it normalizes incoming landmarks by their centroid and inter-eye distance, so the same model works regardless of whether the upstream detector produced loose (img2pose) or tight (RetinaFace) face crops.
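
A minimal sketch of this normalization, written with NumPy for illustration. It assumes the usual 68-point indexing in which points 36-41 and 42-47 outline the two eyes, and it assumes "inter-eye distance" means the distance between the two eye centers. The name `normalize_landmarks_sketch` is hypothetical; the shipped feat.utils.face_pose_mlp.normalize_landmarks may differ in detail.

```python
import numpy as np

# Assumed eye indices in the 68-point layout (0-based); not taken from py-feat.
EYE_A = slice(36, 42)
EYE_B = slice(42, 48)


def normalize_landmarks_sketch(landmarks):
    """Center landmarks on their centroid and scale by inter-eye distance.

    landmarks: array of shape [N, 68, 2] in image-pixel coordinates.
    Returns an array of shape [N, 136] (flattened x, y pairs).
    """
    landmarks = np.asarray(landmarks, dtype=np.float64)
    centroid = landmarks.mean(axis=1, keepdims=True)         # [N, 1, 2]
    eye_a = landmarks[:, EYE_A].mean(axis=1)                 # [N, 2]
    eye_b = landmarks[:, EYE_B].mean(axis=1)                 # [N, 2]
    inter_eye = np.linalg.norm(eye_a - eye_b, axis=1)        # [N]
    inter_eye = np.maximum(inter_eye, 1e-6)[:, None, None]   # guard against division by zero
    normalized = (landmarks - centroid) / inter_eye
    return normalized.reshape(len(landmarks), -1)


# Shifting or uniformly rescaling a face leaves the features unchanged,
# which is why the tightness of the upstream crop does not matter.
rng = np.random.default_rng(0)
face = rng.uniform(0, 200, size=(1, 68, 2))   # synthetic points, not a real face
shifted_scaled = face * 2.5 + 40.0
same = np.allclose(
    normalize_landmarks_sketch(face),
    normalize_landmarks_sketch(shifted_scaled),
)
```

The check at the end illustrates the invariance the paragraph above relies on: translation cancels in the centroid subtraction and scale cancels in the division.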

Model Details

Model type: Multi-layer perceptron (MLP)
Architecture: Linear(136→512) → LayerNorm → GELU → Dropout(0.15) → Linear(512→256) → LayerNorm → GELU → Dropout → Linear(256→128) → LayerNorm → GELU → Dropout → Linear(128→6)
Parameter count: 236,934 (~0.9 MB safetensors)
Input: 68 2D landmarks, normalized by landmark centroid and inter-eye distance (feat.utils.face_pose_mlp.normalize_landmarks).
Output: 6 values — [Pitch, Roll, Yaw, X, Y, Z]. The MLP emits z-scored values; the loader de-normalizes using mean/std stored in the sidecar pose_mlp_v2.json. Angles are radians, calibrated to img2pose's coordinate frame.
Framework: PyTorch (safetensors weight file, no pickle).
Inference cost: ~10 µs / face on CPU (batched), negligible vs. the upstream face/landmark stages.
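
The layer list above can be written as a short PyTorch module, which also lets the stated parameter count be checked. This is a sketch, not the packaged implementation: the function name `build_pose_mlp` is hypothetical, and it assumes all three Dropout layers share the 0.15 rate given for the first one.

```python
import torch
from torch import nn


def build_pose_mlp(dropout: float = 0.15) -> nn.Sequential:
    """Build the layer stack listed in Model Details (illustrative names)."""

    def block(n_in: int, n_out: int) -> list:
        # Linear -> LayerNorm -> GELU -> Dropout, as in the architecture line.
        return [nn.Linear(n_in, n_out), nn.LayerNorm(n_out), nn.GELU(), nn.Dropout(dropout)]

    return nn.Sequential(
        *block(136, 512),
        *block(512, 256),
        *block(256, 128),
        nn.Linear(128, 6),
    )


model = build_pose_mlp().eval()  # eval() disables dropout for inference
n_params = sum(p.numel() for p in model.parameters())
approx_mb = n_params * 4 / 1e6  # float32 weights, 4 bytes each

with torch.no_grad():
    # Input: a batch of flattened, normalized landmarks [B, 136].
    # Output: z-scored [Pitch, Roll, Yaw, X, Y, Z], shape [B, 6].
    z = model(torch.zeros(2, 136))
```

Summing the parameters of this stack gives 236,934, matching the count above; at 4 bytes per float32 weight that is roughly 0.95 MB.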

Training Details

Teacher: img2pose (Albiero et al., 2021). The MLP is trained to match img2pose's regressed [Pitch, Roll, Yaw, X, Y, Z] outputs.
Training corpus: CelebV-HQ — n_clips = 35,445, n_train_frames = 2,783,134, n_val_frames = 154,619. Frames with FaceScore < 0.8 or |pose| > 75° are dropped (filters bad teacher signal on degenerate poses).
Loss: MSE on z-scored 6D output.
Optimizer: Adam, lr=1e-3, batch_size=1024.
Epochs: 40 (best val loss at last epoch — see pose_mlp_v2.json for per-epoch history).
Hardware: single GPU (training takes ~2 hr).
Seed: 42.
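
The loss is computed on standardized targets, and inference reverses that standardization. Standardizing presumably keeps angles (radians) and translations on comparable scales in the MSE. A minimal NumPy sketch with synthetic teacher poses follows; the helper names and all numbers are illustrative and are not the statistics stored in pose_mlp_v2.json.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for teacher outputs [Pitch, Roll, Yaw, X, Y, Z].
teacher = rng.normal(
    loc=[0.0, 0.0, 0.0, 0.0, 0.0, 5.0],
    scale=[0.2, 0.1, 0.3, 0.5, 0.5, 1.0],
    size=(1000, 6),
)

# Per-dimension statistics computed once on the training targets.
mean = teacher.mean(axis=0)
std = teacher.std(axis=0)


def zscore(pose):
    """Map raw pose values to zero-mean, unit-variance targets."""
    return (pose - mean) / std


def denormalize(z):
    """Invert zscore(): map network outputs back to pose units."""
    return z * std + mean


target_z = zscore(teacher)
# Pretend network output: the targets plus a little noise.
pred_z = target_z + rng.normal(scale=0.1, size=target_z.shape)

mse = float(np.mean((pred_z - target_z) ** 2))  # training loss on z-scored output
pred_pose = denormalize(pred_z)                 # what a loader would return
round_trip_ok = np.allclose(denormalize(target_z), teacher)
```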

Held-out validation MAE on CelebV-HQ (clip-disjoint split)

Axis	MAE (°)
Pitch	2.66
Roll	2.34
Yaw	1.58

For reference, img2pose's reported MAE on the AFLW2000-3D and BIWI test sets is ~4° on average. The values above measure the gap between the MLP's predictions and the teacher's, not error against a ground-truth motion-capture rig, so the MLP cannot be expected to exceed its teacher.
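
Because the model's angles are in radians while the table reports degrees, a per-axis MAE of this kind can be computed by converting absolute differences from the teacher before averaging. The sketch below uses small hand-made arrays in place of real predictions; the function name is hypothetical, and wrapping the differences is an added safeguard rather than something the card specifies.

```python
import numpy as np


def angular_mae_deg(pred_rad, teacher_rad):
    """Per-axis MAE in degrees for [N, 3] arrays of (pitch, roll, yaw) in radians."""
    diff = np.asarray(pred_rad) - np.asarray(teacher_rad)
    # Wrap differences into [-pi, pi) so errors near +/-180 degrees are not overstated.
    diff = (diff + np.pi) % (2 * np.pi) - np.pi
    return np.degrees(np.abs(diff)).mean(axis=0)


# Two frames of teacher angles and student angles offset by known amounts (degrees).
teacher = np.radians([[10.0, -5.0, 30.0], [0.0, 2.0, -45.0]])
pred = teacher + np.radians([[2.0, -1.0, 0.5], [-4.0, 3.0, 1.5]])

mae = angular_mae_deg(pred, teacher)  # order: pitch, roll, yaw
```

With these offsets the per-axis result is 3°, 2° and 1° for pitch, roll and yaw.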

v1 → v2 changelog

Aspect	v1	v2
Hidden	256→128→64	512→256→128
Activation	Linear → ReLU → Dropout	Linear → LayerNorm → GELU → Dropout
Dropout	0.10	0.15
Training frames	569,678	2,783,134
Epochs	30	40
Best val loss	0.0809	0.0777
Roll MAE (°)	2.530	2.335

Intended Use

Primary: Drop-in replacement for img2pose's pose head when using py-feat with a face detector that doesn't predict pose (face_model='retinaface' in feat.Detector, MediaPipe in feat.MPDetector).
Secondary: Any pipeline that produces 68 dlib-style face landmarks and wants img2pose-compatible head pose without re-running img2pose.

Out of scope

Eye / gaze direction: use L2CS-Net for gaze.
MediaPipe-478 landmarks: translate to the 68-point dlib layout first.
Head-pose inference from an incomplete landmark set (fewer than 68 points).

Usage

The MLP is loaded automatically by feat.Detector when face_model != 'img2pose'. To call it directly:

import torch
from feat.utils.face_pose_mlp import pose_from_landmarks_mlp

# 68 (x, y) landmarks in image-pixel coordinates, e.g. from mobilefacenet.
# Random values stand in for real detector output so the snippet runs as-is.
landmarks = torch.rand(1, 68, 2) * 256.0  # [1, 68, 2]

pose = pose_from_landmarks_mlp(landmarks)  # [1, 6]: (Pitch, Roll, Yaw, X, Y, Z)
print(pose)

Weights resolve from (in order):

FEAT_POSE_MLP_PATH environment variable
models/pose_mlp_v2.safetensors in the repo
This HuggingFace repo (py-feat/pose_mlp_v2)
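
The lookup order above could be implemented along the following lines. This is a sketch, not py-feat's loader: the function name `resolve_pose_mlp_weights` is hypothetical, and falling through when the environment variable points at a missing file is an assumption. `hf_hub_download` is the standard download helper in the huggingface_hub package.

```python
import os
from pathlib import Path

REPO_ID = "py-feat/pose_mlp_v2"
FILENAME = "pose_mlp_v2.safetensors"


def resolve_pose_mlp_weights(repo_root: Path = Path(".")) -> Path:
    """Return the first weights file found, following the documented order."""
    # 1. Explicit override via environment variable.
    env_path = os.environ.get("FEAT_POSE_MLP_PATH")
    if env_path and Path(env_path).is_file():
        return Path(env_path)

    # 2. Local copy inside the repository checkout.
    local = repo_root / "models" / FILENAME
    if local.is_file():
        return local

    # 3. Fall back to the Hugging Face Hub (downloads once, then uses the cache).
    from huggingface_hub import hf_hub_download  # lazy import; only needed here

    return Path(hf_hub_download(repo_id=REPO_ID, filename=FILENAME))
```

Importing huggingface_hub lazily keeps the first two branches usable offline and without that package installed.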

Limitations

The MLP cannot improve on img2pose's accuracy; it only approximates it more efficiently from bbox-free input. Use img2pose directly if you need img2pose's exact behavior, since a distillation gap remains (1.6-2.7° MAE per axis on the held-out validation split).
Trained on CelebV-HQ: expect degraded performance on non-frontal, occluded, or heavily rotated faces (>75°), both because the teacher's signal is poor there and because the data filter dropped such frames from training.
Output coordinates are img2pose's frame, not a standard FACS / BIWI frame. Pose values are interpretable across the py-feat pipeline but may need recalibration to compare with other tools.

Citation

If you use py-feat and this pose-MLP, please cite both py-feat and img2pose:

@article{cheong2023pyfeat,
  title={Py-Feat: Python Facial Expression Analysis Toolbox},
  author={Cheong, Jin Hyun and Jolly, Eshin and Xie, Tiankang and Byrne, Sophie and Kenney, Matthew and Chang, Luke J.},
  journal={Affective Science},
  volume={4},
  pages={781--796},
  year={2023}
}

@inproceedings{albiero2021img2pose,
  title={img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation},
  author={Albiero, Vítor and Chen, Xingyu and Yin, Xi and Pang, Guan and Hassner, Tal},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={7617--7627},
  year={2021}
}

@inproceedings{zhu2022celebvhq,
  title={CelebV-HQ: A Large-Scale Video Facial Attributes Dataset},
  author={Zhu, Hao and Wu, Wayne and Zhu, Wentao and Jiang, Liming and Tang, Siwei and Zhang, Li and Liu, Ziwei and Loy, Chen Change},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2022}
}

License

MIT (this distillation). The teacher (img2pose) is BSD-3, and the training corpus (CelebV-HQ) is released for non-commercial research use — please honor each upstream license if you re-train or re-distribute.

Files

pose_mlp_v2.safetensors — model weights (1 MB)
pose_mlp_v2.json — architecture, output-normalization stats, training history, validation MAE per epoch
README.md — this card

Acknowledgments

Distilled from img2pose by Vítor Albiero et al. (Meta AI / NVIDIA), trained on CelebV-HQ by Hao Zhu et al. (CUHK / S-Lab NTU). Built and maintained by Cosanlab at Dartmouth.
