TIGER-Lab/ClawBench
Benchmark dataset (V1+V2), live leaderboard Space, and full V1 execution traces — everything you need to run, regrade, or compare on ClawBench.
Note Task definitions (V1: 153, V2: 130), rubrics, eval schemas. Mirror of NAIL-Group/ClawBench.
Note Full 5-layer execution traces for every V1 model run, for re-grading, debugging, and post-hoc analysis.
Note Paper: ClawBench: Can AI Agents Complete Everyday Online Tasks?
Note Full 5-layer V2 execution traces, for re-grading, debugging, and post-hoc analysis.
Can AI agents complete everyday online tasks?
Note Live leaderboard — V1/V2/all, sortable by Reward (judge-graded) or Intercepted. Submit via PR to results.csv.
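A minimal sketch of how a results.csv-style file could be filtered by benchmark version and sorted by either score, mirroring the leaderboard's sort options. The column names (`model`, `version`, `reward`, `intercepted`) and the sample rows are assumptions made for illustration; they are not the actual schema or real leaderboard data, so check the real results.csv before adapting this.

```python
import csv
import io

# Hypothetical sample shaped like a results.csv file. Column names and
# values are placeholders, not the real schema or real leaderboard data.
SAMPLE_CSV = """model,version,reward,intercepted
model-a,V1,0.25,0.40
model-b,V1,0.50,0.30
model-c,V2,0.10,0.60
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with numeric score columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["reward"] = float(row["reward"])
        row["intercepted"] = float(row["intercepted"])
        rows.append(row)
    return rows


def rank(rows, key="reward", version=None):
    """Return rows sorted by `key` (descending), optionally filtered by version."""
    if version is not None:
        rows = [r for r in rows if r["version"] == version]
    return sorted(rows, key=lambda r: r[key], reverse=True)


rows = load_rows(SAMPLE_CSV)
# All versions, ranked by the (assumed) judge-graded reward column.
print([r["model"] for r in rank(rows, key="reward")])
# V1 only, ranked by the (assumed) intercepted column.
print([r["model"] for r in rank(rows, key="intercepted", version="V1")])
```

To rank a local copy of the file instead, read it with `open(path, newline="")` and pass its contents to `load_rows`, adjusting the column names to match the real header.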