ClawBench - a TIGER-Lab Collection

ClawBench

updated about 8 hours ago

Benchmark dataset (V1+V2), live leaderboard Space, and full V1 execution traces — everything you need to run, regrade, or compare on ClawBench.

TIGER-Lab/ClawBench

Viewer • Updated about 3 hours ago • 153 • 11

Note: Task definitions (V1: 153, V2: 130), rubrics, and eval schemas. Mirror of NAIL-Group/ClawBench.

NAIL-Group/ClawBenchV1Trace

Updated 2 days ago • 17

Note: Full 5-layer execution traces for every V1 model run, for re-grading, debugging, and post-hoc analysis.

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Paper • 2604.08523 • Published Apr 9 • 262

Note: The accompanying paper, "ClawBench: Can AI Agents Complete Everyday Online Tasks?"

TIGER-Lab/ClawBenchV2Trace

Updated 5 minutes ago

Note: Full 5-layer execution traces for V2 runs, for re-grading, debugging, and post-hoc analysis.

ClawBench Leaderboard

🦀

Can AI agents complete everyday online tasks?

Note: Live leaderboard with V1, V2, and combined views, sortable by Reward (judge-graded) or Intercepted. Submit results via a pull request to results.csv.
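The sorting and filtering behavior described in the leaderboard note can be sketched in a few lines of Python. This is only an illustration: the column names (`model`, `version`, `reward`, `intercepted`), the model names, and every value in the sample are made up, since the actual layout of results.csv is not described here.

```python
import csv
import io

# Hypothetical stand-in for results.csv. The real column names and values
# are NOT known from this page; everything below is invented for illustration.
SAMPLE_CSV = """model,version,reward,intercepted
agent-a,V1,0.42,0.61
agent-b,V2,0.35,0.70
agent-c,V1,0.58,0.55
"""


def load_rows(text):
    """Parse CSV text into dicts, converting the two metric columns to floats."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["reward"] = float(row["reward"])
        row["intercepted"] = float(row["intercepted"])
    return rows


def rank(rows, metric="reward", version="all"):
    """Return rows sorted by the chosen metric (highest first), optionally
    restricted to a single benchmark version ("V1" or "V2")."""
    if version != "all":
        rows = [r for r in rows if r["version"] == version]
    return sorted(rows, key=lambda r: r[metric], reverse=True)


rows = load_rows(SAMPLE_CSV)
for r in rank(rows, metric="reward", version="V1"):
    print(r["model"], r["reward"])
```

Passing `metric="intercepted"` or `version="all"` switches between the views the note mentions. Check the header of the real results.csv before adapting this.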