Py-Feat Pose-MLP v2: Landmark-to-6DoF Head Pose
A small distilled MLP that takes 68 face landmarks (the dlib-68 / OpenFace
layout produced by mobilefacenet, OpenFace, etc.) and emits 6DoF head
pose calibrated to img2pose's coordinate frame. Designed for py-feat
pipelines that use a face detector without a built-in pose head (e.g.
RetinaFace in py-feat ≥ 0.7).
Model Description
py-feat's v0.6 production pipeline used img2pose as its face detector,
which multi-tasks face localization with 6DoF head pose regression, so
pose came "for free" from the detector. In v0.7 the default face detector
became RetinaFace, which has a much higher WIDER FACE Hard AP but only detects
faces. To preserve the Fex schema (pitch, roll, yaw, x, y,
z columns), py-feat distills img2pose's pose regression into a small
MLP that operates entirely on already-computed landmarks.
The MLP is bbox-free: it normalizes incoming landmarks by their centroid and inter-eye distance, so the same model works regardless of whether the upstream detector produced loose (img2pose) or tight (RetinaFace) face crops.
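The normalization described above can be sketched in plain Python. This is an illustrative approximation, not the library's `feat.utils.face_pose_mlp.normalize_landmarks`: the function name `normalize_landmarks_sketch`, the eye index ranges (36-41 and 42-47 in the dlib-68 layout), and the choice to measure inter-eye distance between the two eye centers are all assumptions.

```python
import math

# Assumed dlib-68 index ranges for the two eyes (six points each).
EYE_A = range(36, 42)
EYE_B = range(42, 48)


def _mean_point(points):
    """Return the (x, y) mean of a sequence of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def normalize_landmarks_sketch(landmarks):
    """Center 68 (x, y) landmarks on their centroid, then divide by inter-eye distance."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    cx, cy = _mean_point(landmarks)
    ax, ay = _mean_point([landmarks[i] for i in EYE_A])
    bx, by = _mean_point([landmarks[i] for i in EYE_B])
    inter_eye = math.hypot(ax - bx, ay - by)
    if inter_eye == 0:
        raise ValueError("degenerate landmarks: zero inter-eye distance")
    return [((x - cx) / inter_eye, (y - cy) / inter_eye) for x, y in landmarks]


# Synthetic landmarks, then the same shape shifted and scaled 3x,
# as if it came from a differently cropped or resized image.
face = [(1.5 * i + (i % 7), 0.5 * i + 2 * (i % 5)) for i in range(68)]
resized = [(3 * x + 100, 3 * y - 40) for x, y in face]

a = normalize_landmarks_sketch(face)
b = normalize_landmarks_sketch(resized)
max_diff = max(abs(p[0] - q[0]) + abs(p[1] - q[1]) for p, q in zip(a, b))
print(f"max difference after normalization: {max_diff:.2e}")
```

Because both the offset and the scale cancel, the two inputs map to the same normalized landmarks up to floating-point error, which is why the crop tightness of the upstream detector should not matter.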
Model Details
- Model type: Multi-layer perceptron (MLP)
- Architecture: `Linear(136→512) → LayerNorm → GELU → Dropout(0.15) → Linear(512→256) → LayerNorm → GELU → Dropout → Linear(256→128) → LayerNorm → GELU → Dropout → Linear(128→6)`
- Parameter count: 236,934 (~0.9 MB safetensors)
- Input: 68 2D landmarks, normalized by landmark centroid and inter-eye distance (`feat.utils.face_pose_mlp.normalize_landmarks`).
- Output: 6 values, `[Pitch, Roll, Yaw, X, Y, Z]`. The MLP emits z-scored values; the loader de-normalizes using `mean`/`std` stored in the sidecar `pose_mlp_v2.json`. Angles are radians, calibrated to img2pose's coordinate frame.
- Framework: PyTorch (safetensors weight file, no pickle).
- Inference cost: ~10 µs / face on CPU (batched), negligible vs. the upstream face/landmark stages.
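The stated parameter count can be reproduced from the layer widths alone. The arithmetic below assumes every Linear layer has a bias, every LayerNorm carries a per-feature scale and shift, and weights are stored as float32; those are assumptions about the implementation, although the total they produce matches the figure above.

```python
# Layer widths from the architecture line: 136 -> 512 -> 256 -> 128 -> 6.
WIDTHS = [136, 512, 256, 128, 6]


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layernorm_params(n_features):
    """Per-feature scale and shift of an affine LayerNorm."""
    return 2 * n_features


def count_parameters(widths):
    """Sum Linear parameters, adding a LayerNorm after every hidden layer."""
    total = 0
    for i, (n_in, n_out) in enumerate(zip(widths, widths[1:])):
        total += linear_params(n_in, n_out)
        is_hidden = i < len(widths) - 2  # the final Linear has no LayerNorm
        if is_hidden:
            total += layernorm_params(n_out)
    return total


total = count_parameters(WIDTHS)
size_bytes = total * 4  # float32: 4 bytes per parameter
print(total)       # 236934
print(size_bytes)  # 947736
```

At 947,736 bytes the raw weights come to a little over 0.9 MiB, in line with the file size quoted for the safetensors file.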
Training Details
- Teacher: img2pose (Albiero et al., 2021). The MLP is trained to match img2pose's regressed `[Pitch, Roll, Yaw, X, Y, Z]` outputs.
- Training corpus: CelebV-HQ, with `n_clips = 35,445`, `n_train_frames = 2,783,134`, `n_val_frames = 154,619`. Frames with `FaceScore < 0.8` or `|pose| > 75°` are dropped (filters bad teacher signal on degenerate poses).
- Loss: MSE on z-scored 6D output.
- Optimizer: Adam, `lr=1e-3`, `batch_size=1024`.
- Epochs: 40 (best val loss at last epoch; see `pose_mlp_v2.json` for per-epoch history).
- Hardware: single GPU (training takes ~2 hr).
- Seed: 42.
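The frame filter and the z-scored regression target listed above could look roughly like the following. This is a sketch under stated assumptions, not the training script: the helper names are hypothetical, the 75° limit is assumed to apply to each of pitch, roll, and yaw separately, and the population standard deviation is used because the card does not say which variant the authors chose.

```python
import math
import statistics

MIN_FACE_SCORE = 0.8
MAX_ANGLE_RAD = math.radians(75.0)


def keep_frame(face_score, pose):
    """Return True if a teacher-labelled frame passes both filters.

    `pose` is (pitch, roll, yaw, x, y, z) with angles in radians.
    Assumption: the 75 degree limit is checked per angle.
    """
    if face_score < MIN_FACE_SCORE:
        return False
    return all(abs(angle) <= MAX_ANGLE_RAD for angle in pose[:3])


def zscore_stats(targets):
    """Per-dimension mean and (population) standard deviation of 6D targets."""
    columns = list(zip(*targets))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return means, stds


def zscore(target, means, stds):
    """Standardize one 6D target with precomputed statistics."""
    return [(t - m) / s for t, m, s in zip(target, means, stds)]


def mse(pred, target):
    """Mean squared error over the six output dimensions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Toy teacher outputs: (face_score, pose). The second and third are dropped.
frames = [
    (0.95, (0.10, 0.00, -0.20, 1.0, 2.0, 3.0)),
    (0.60, (0.05, 0.02, 0.10, 1.5, 2.5, 3.5)),              # low FaceScore
    (0.99, (math.radians(80), 0.0, 0.0, 1.0, 1.0, 4.0)),    # extreme pitch
    (0.90, (0.30, 0.20, 0.00, 2.0, 1.0, 5.0)),
    (0.85, (-0.10, 0.10, 0.20, 3.0, 3.0, 4.0)),
]
kept = [pose for score, pose in frames if keep_frame(score, pose)]
means, stds = zscore_stats(kept)
z_targets = [zscore(pose, means, stds) for pose in kept]
print(len(kept), "frames kept;", "loss for an all-zero prediction:",
      mse([0.0] * 6, z_targets[0]))
```

Standardizing the targets keeps angles (radians) and translations on a comparable scale, so a single MSE does not get dominated by whichever output has the largest raw range.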
Held-out validation MAE on CelebV-HQ (clip-disjoint split)
| Axis | MAE (°) |
|---|---|
| Pitch | 2.66 |
| Roll | 2.34 |
| Yaw | 1.58 |
For reference, img2pose's reported MAE on the AFLW2000-3D / BIWI test sets is ~4° average. The MLP cannot exceed its teacher; the values here measure the gap between the MLP's predictions and the teacher's, not error against a ground-truth motion-capture rig.
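The MAE figures are in degrees, while the model emits z-scored values that de-normalize to radians. A minimal sketch of that path follows; the helper names are hypothetical, the `mean`/`std` values are made up for illustration, and no angle wrap-around is handled, which is only reasonable because poses beyond 75° were filtered out of the data.

```python
import math


def denormalize(z_pred, mean, std):
    """Undo z-scoring per output dimension: value = z * std + mean."""
    return [z * s + m for z, m, s in zip(z_pred, mean, std)]


def angle_mae_degrees(student, teacher):
    """Per-axis MAE in degrees over the first three outputs (pitch, roll, yaw).

    Both arguments are lists of 6D poses with angles in radians.
    """
    n = len(student)
    maes = []
    for axis in range(3):
        total = sum(abs(s[axis] - t[axis]) for s, t in zip(student, teacher))
        maes.append(math.degrees(total / n))
    return maes


# Illustrative statistics only; the real ones live in pose_mlp_v2.json.
mean = [0.0, 0.0, 0.0, 0.0, 0.0, 10.0]
std = [0.2, 0.1, 0.3, 1.0, 1.0, 2.0]

student_z = [[0.5, -0.2, 0.1, 0.0, 0.3, -0.1], [-0.4, 0.3, 0.0, 0.2, -0.1, 0.4]]
teacher = [[0.12, -0.01, 0.02, 0.0, 0.3, 9.9], [-0.10, 0.02, 0.01, 0.2, -0.1, 10.7]]

student = [denormalize(z, mean, std) for z in student_z]
pitch_mae, roll_mae, yaw_mae = angle_mae_degrees(student, teacher)
print(f"pitch {pitch_mae:.2f}  roll {roll_mae:.2f}  yaw {yaw_mae:.2f}  (degrees)")
```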
v1 → v2 changelog
| Aspect | v1 | v2 |
|---|---|---|
| Hidden | 256→128→64 | 512→256→128 |
| Activation | Linear → ReLU → Dropout | Linear → LayerNorm → GELU → Dropout |
| Dropout | 0.10 | 0.15 |
| Training frames | 569,678 | 2,783,134 |
| Epochs | 30 | 40 |
| Best val loss | 0.0809 | 0.0777 |
| Roll MAE (°) | 2.530 | 2.335 |
Intended Use
- Primary: Drop-in replacement for img2pose's pose head when using `py-feat` with a face detector that doesn't predict pose (`face_model='retinaface'` in `feat.Detector`, MediaPipe in `feat.MPDetector`).
- Secondary: Any pipeline that produces 68 dlib-style face landmarks and wants img2pose-compatible head pose without re-running img2pose.
Out of scope
- Eye / gaze direction: use `L2CS-Net` for gaze.
- Mediapipe-478 landmarks: translate to 68 dlib landmarks first.
- Head-pose inference from an incomplete landmark set (fewer than 68 points).
Usage
The MLP is loaded automatically by feat.Detector when
face_model != 'img2pose'. To call it directly:
```python
import torch
from feat.utils.face_pose_mlp import pose_from_landmarks_mlp

# 68 (x, y) landmarks in image-pixel coordinates, e.g. from mobilefacenet.
landmarks = torch.tensor([
    # ... [68, 2] ...
], dtype=torch.float32).unsqueeze(0)  # [1, 68, 2]

pose = pose_from_landmarks_mlp(landmarks)  # [1, 6]: (Pitch, Roll, Yaw, X, Y, Z)
print(pose)
```
Weights resolve from (in order):
1. `FEAT_POSE_MLP_PATH` environment variable
2. `models/pose_mlp_v2.safetensors` in the repo
3. This HuggingFace repo (`py-feat/pose_mlp_v2`)
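The lookup order above amounts to a simple fallback chain. The sketch below is illustrative only: `resolve_weights` and its `repo_root` argument are hypothetical names, the environment variable is assumed to be taken as-is without checking that the file exists, and the final step only reports the repo id rather than downloading anything.

```python
import os
from pathlib import Path

HF_REPO_ID = "py-feat/pose_mlp_v2"
WEIGHTS_NAME = "pose_mlp_v2.safetensors"


def resolve_weights(repo_root="."):
    """Return (source, location) following the documented lookup order."""
    # 1. An explicit override through the environment variable.
    env_path = os.environ.get("FEAT_POSE_MLP_PATH")
    if env_path:
        return ("env", env_path)
    # 2. A weights file checked out alongside the code.
    local = Path(repo_root) / "models" / WEIGHTS_NAME
    if local.is_file():
        return ("local", str(local))
    # 3. Fall back to the Hugging Face repo; a real loader would download here.
    return ("hub", HF_REPO_ID)


source, location = resolve_weights()
print(f"weights would be loaded from {source}: {location}")
```

Putting the environment variable first lets a user pin a specific weights file (for example a retrained checkpoint) without touching the installed package.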
Limitations
- The MLP cannot improve on img2pose's accuracy; it only matches it more efficiently with bbox-free input. Use img2pose directly if you need img2pose's exact behavior (a tiny ~1° distillation gap may remain).
- Trained on CelebV-HQ, so performance on non-frontal, occluded, or heavily-rotated faces (>75°) is degraded by both the teacher and the data filter.
- Output coordinates are img2pose's frame, not a standard FACS / BIWI frame. Pose values are interpretable across the `py-feat` pipeline but may need recalibration to compare with other tools.
Citation
If you use py-feat and this pose-MLP, please cite both py-feat and
img2pose:
```bibtex
@article{cheong2023pyfeat,
  title={Py-Feat: Python Facial Expression Analysis Toolbox},
  author={Cheong, Jin Hyun and Jolly, Eshin and Xie, Tiankang and Byrne, Sophie and Kenney, Matthew and Chang, Luke J.},
  journal={Affective Science},
  volume={4},
  pages={781--796},
  year={2023}
}
@inproceedings{albiero2021img2pose,
  title={img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation},
  author={Albiero, Vítor and Chen, Xingyu and Yin, Xi and Pang, Guan and Hassner, Tal},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={7617--7627},
  year={2021}
}
@inproceedings{zhu2022celebvhq,
  title={CelebV-HQ: A Large-Scale Video Facial Attributes Dataset},
  author={Zhu, Hao and Wu, Wayne and Zhu, Wentao and Jiang, Liming and Tang, Siwei and Zhang, Li and Liu, Ziwei and Loy, Chen Change},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2022}
}
```
License
MIT (this distillation). The teacher (img2pose) is BSD-3, and the
training corpus (CelebV-HQ) is released for non-commercial research
use; please honor each upstream license if you re-train or
re-distribute.
Files
- `pose_mlp_v2.safetensors`: model weights (1 MB)
- `pose_mlp_v2.json`: architecture, output-normalization stats, training history, validation MAE per epoch
- `README.md`: this card
Acknowledgments
Distilled from img2pose by Vítor Albiero et al. (Meta AI / NVIDIA), trained on CelebV-HQ by Hao Zhu et al. (CUHK / S-Lab NTU). Built and maintained by Cosanlab at Dartmouth.