Haakkim
An open arena-style human preference evaluation platform for Arabic large language models — built from the ground up for Arabic.
| Rank | Model | BT Score | 95% CI | Battles |
|---|---|---|---|---|
| 1 | mistralai/ministral-3b-2512 | 1001.75 | [1001.20, 1002.93] | 40 |
| 2 | mistralai/ministral-8b-2512 | 1001.61 | [1000.72, 1002.97] | 43 |
| 3 | Qwen/Qwen3-235B-A22B-Thinking-2507 | 1001.21 | [1000.47, 1002.00] | 38 |
| 4 | Qwen/Qwen3-30B-A3B-Instruct-2507 | 1001.14 | [999.96, 1002.83] | 31 |
| 5 | deepseek/deepseek-v3.2-exp | 1001.13 | [1000.27, 1002.16] | 38 |
| 6 | deepseek/deepseek-v3.1 | 1000.99 | [999.81, 1002.07] | 29 |
| 7 | Qwen/Qwen3-235B-A22B-Instruct-2507 | 1000.98 | [1000.12, 1002.08] | 39 |
| 8 | deepseek/deepseek-r1-0528 | 1000.93 | [1000.10, 1002.14] | 38 |
| 9 | openai/gpt-oss-120b | 1000.93 | [1000.04, 1002.58] | 25 |
| 10 | deepseek/deepseek-v3.2 | 1000.89 | [999.86, 1002.25] | 31 |
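Bradley–Terry scores like those above can be fit by maximizing the pairwise log-likelihood over battle outcomes. A minimal sketch (gradient ascent, zero-mean identifiability constraint, 1000-centering as in the leaderboard; the platform's actual optimizer may differ):

```python
import math

def fit_bt(battles, n_models, iters=2000, lr=0.1):
    """Fit Bradley-Terry log-strengths by gradient ascent.
    battles: list of (winner_idx, loser_idx) pairs."""
    s = [0.0] * n_models
    for _ in range(iters):
        grad = [0.0] * n_models
        for w, l in battles:
            p = 1.0 / (1.0 + math.exp(s[l] - s[w]))  # P(winner beats loser)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        for i in range(n_models):
            s[i] += lr * grad[i] / len(battles)
        mean = sum(s) / n_models
        s = [x - mean for x in s]  # pin the mean: scores are only identified up to a shift
    return [1000.0 + x for x in s]  # 1000-centered, as on the leaderboard
```

For two models where A wins 3 of 4 battles, the fitted gap converges to ln 3 ≈ 1.10, i.e. 3:1 odds.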
Ranked Arena
Random model pairing, single-turn MSA, matched system instruction. Results feed the official Bradley–Terry leaderboard.
✓ BT Leaderboard
Side-by-Side
User-selected model pair, any dialect. Useful for targeted comparisons but excluded from ranked scoring to prevent selection bias.
Win-rate only
10 Questions
Fixed Arabic prompt pool, any dialect. Provides consistent benchmarking within a curated set of questions.
Win-rate only
Inverse-Probability Weighting
Corrects for non-uniform model exposure using ε-greedy adaptive sampling weights, clamped to [P1, P99].
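A minimal sketch of the two pieces described here, with assumed details: an ε-greedy pairing policy that favors under-exposed models, and inverse-probability weights clamped to the empirical 1st/99th percentiles. The exact policy and clamping method on the platform may differ.

```python
import random
from statistics import quantiles

def sample_pair(models, exposure, eps=0.2, rng=random):
    """epsilon-greedy pairing: with prob. eps pick uniformly at random,
    otherwise pick the two least-exposed models (illustrative policy)."""
    if rng.random() < eps:
        return tuple(rng.sample(models, 2))
    ranked = sorted(models, key=lambda m: exposure[m])
    return ranked[0], ranked[1]

def ipw_weights(sample_probs):
    """Inverse-probability weights, clamped to ~[P1, P99] to cap the
    influence of rarely sampled models."""
    w = [1.0 / p for p in sample_probs]
    qs = quantiles(w, n=100)       # 99 cut points
    lo, hi = qs[0], qs[-1]         # ~1st and ~99th percentiles
    return [min(max(x, lo), hi) for x in w]
```

Clamping matters because a model sampled with probability 0.001 would otherwise carry a weight of 1000 and dominate the weighted tallies.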
Bootstrap Confidence Intervals
200 vote-level resamples per run to produce 95% CIs on every model's BT score.
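The vote-level bootstrap can be sketched as follows: resample battles with replacement, recompute the statistic each time, and take empirical quantiles as the interval. The `stat` callback standing in for the full BT refit is a simplification.

```python
import random

def bootstrap_ci(votes, stat, n_resamples=200, alpha=0.05, seed=0):
    """Vote-level bootstrap: resample with replacement, recompute `stat`,
    return the empirical (alpha/2, 1 - alpha/2) quantiles as a CI."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        resample = [votes[rng.randrange(len(votes))] for _ in votes]
        stats.append(stat(resample))
    stats.sort()
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With 200 resamples and alpha = 0.05, the 5th and 194th sorted values bound the 95% interval, matching the per-model CIs shown on the leaderboard.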
Rankability Gate
BT scores published only when the comparison graph is fully connected and ESS is sufficient; otherwise win-rate fallback is shown.
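Both gate conditions are easy to state concretely. BT scores are only jointly identifiable when every model is linked to every other through some chain of battles (a connected comparison graph), and the Kish effective sample size checks that IPW weighting hasn't left too few "effective" votes. A sketch, assuming the Kish formula for ESS:

```python
from collections import defaultdict, deque

def is_connected(battles, models):
    """True if every model is reachable from every other via battle edges."""
    adj = defaultdict(set)
    for a, b in battles:
        adj[a].add(b)
        adj[b].add(a)
    seen, queue = {models[0]}, deque([models[0]])
    while queue:
        node = queue.popleft()
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == set(models)

def effective_sample_size(weights):
    """Kish ESS: (sum w)^2 / sum w^2; equals n for uniform weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)
```

If either check fails, falling back to per-model win rates avoids publishing BT scores that the data cannot actually pin down.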
Log-odds Scale
Scores are 1000-centered unscaled log-odds, so a 1-point gap corresponds to roughly e:1 ≈ 2.7:1 win odds. Full reproducibility: pipeline and dataset are open.
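The 2.7:1 figure follows directly from exponentiating the score gap:

```python
import math

def win_probability(score_a, score_b):
    """P(A beats B) under Bradley-Terry with unscaled log-odds scores."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

odds = math.exp(1001.0 - 1000.0)        # 1-point gap -> e:1, about 2.72:1
p = win_probability(1001.0, 1000.0)     # about 0.731
```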
Haakkim/Haakkim-1.0v — Battle Dataset
1,273 battle records (Parquet, PII-scrubbed). Includes voted comparisons and skipped battles across all 11 dialects and 3 evaluation modes. Full conversation transcripts, sampling weights, category annotations.
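Working with the records typically means separating voted comparisons from skipped battles. A toy sketch of that filtering on in-memory records; the field names (`model_a`, `model_b`, `winner`, `mode`) are assumptions and may not match the actual Parquet schema:

```python
# Hypothetical records mirroring the dataset's described contents;
# real column names and values may differ.
records = [
    {"model_a": "m1", "model_b": "m2", "winner": "model_a", "mode": "ranked"},
    {"model_a": "m1", "model_b": "m3", "winner": "skipped", "mode": "ranked"},
    {"model_a": "m2", "model_b": "m3", "winner": "model_b", "mode": "side_by_side"},
]

# Skipped battles are kept in the dataset but excluded from scoring.
voted = [r for r in records if r["winner"] != "skipped"]
# Only ranked-arena votes feed the BT leaderboard.
ranked_votes = [r for r in voted if r["mode"] == "ranked"]
```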
@misc{mars2026haakkim,
title = {Haakkim: An Arena-Style Human Preference Evaluation Platform for Arabic {LLMs}},
author = {Mars, Mourad and Barmandah, Hassan and Alassaf, Abdulrhman},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/Haakkim/Haakkim-1.0v}},
note = {College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia}
}