Evaluation
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Paper
• 2403.04132
• Published
• 40
Evaluating Very Long-Term Conversational Memory of LLM Agents
Paper
• 2402.17753
• Published
• 19
The FinBen: An Holistic Financial Benchmark for Large Language Models
Paper
• 2402.12659
• Published
• 23
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Paper
• 2402.13249
• Published
• 15
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Paper
• 2405.01535
• Published
• 124
To Believe or Not to Believe Your LLM
Paper
• 2406.02543
• Published
• 35
Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis
Paper
• 2406.11402
• Published
• 6
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Paper
• 2406.12624
• Published
• 37
Paper
• 2408.02666
• Published
• 29
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
Paper
• 2504.21117
• Published
• 26
AutoLibra: Agent Metric Induction from Open-Ended Feedback
Paper
• 2505.02820
• Published
• 3
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Paper
• 2505.00212
• Published
• 9
Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant
Paper
• 2504.18373
• Published
• 2
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
Paper
• 2505.03981
• Published
• 15