OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions Paper • 2602.05843 • Published 9 days ago • 57
TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents Paper • 2602.02196 • Published 12 days ago • 32
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation Paper • 2602.03796 • Published 11 days ago • 56
$φ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation Paper • 2503.13288 • Published Mar 17, 2025 • 51
MUR: Momentum Uncertainty guided Reasoning for Large Language Models Paper • 2507.14958 • Published Jul 20, 2025 • 47
A^3-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation Paper • 2601.09274 • Published Jan 14 • 84
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices Paper • 2512.01374 • Published Dec 1, 2025 • 105
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows Paper • 2510.24411 • Published Oct 28, 2025 • 72
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Paper • 2505.19897 • Published May 26, 2025 • 104