Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments Paper • 2601.16333 • Published Jan 22, 2026
Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding Paper • 2506.06275 • Published Jun 6, 2025
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks Paper • 2406.18403 • Published Jun 26, 2024
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition Paper • 2407.04559 • Published Jul 5, 2024