Haakkim
An open arena-style human preference evaluation platform for Arabic large language models — built from the ground up for Arabic.
| Rank | Model | BT Score | 95% CI | Battles |
|---|---|---|---|---|
| 1 | mistralai/ministral-3b-2512 | 1001.75 | [1001.20, 1002.93] | 40 |
| 2 | mistralai/ministral-8b-2512 | 1001.61 | [1000.72, 1002.97] | 43 |
| 3 | Qwen/Qwen3-235B-A22B-Thinking-2507 | 1001.21 | [1000.47, 1002.00] | 38 |
| 4 | Qwen/Qwen3-30B-A3B-Instruct-2507 | 1001.14 | [999.96, 1002.83] | 31 |
| 5 | deepseek/deepseek-v3.2-exp | 1001.13 | [1000.27, 1002.16] | 38 |
| 6 | deepseek/deepseek-v3.1 | 1000.99 | [999.81, 1002.07] | 29 |
| 7 | Qwen/Qwen3-235B-A22B-Instruct-2507 | 1000.98 | [1000.12, 1002.08] | 39 |
| 8 | deepseek/deepseek-r1-0528 | 1000.93 | [1000.10, 1002.14] | 38 |
| 9 | openai/gpt-oss-120b | 1000.93 | [1000.04, 1002.58] | 25 |
| 10 | deepseek/deepseek-v3.2 | 1000.89 | [999.86, 1002.25] | 31 |
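Bradley–Terry scores like those above can be fit by maximizing the pairwise log-likelihood over battle outcomes. A minimal sketch (gradient ascent, zero-mean identifiability constraint, 1000-centering as in the leaderboard; the platform's actual optimizer may differ):

```python
import math

def fit_bt(battles, n_models, iters=2000, lr=0.1):
    """Fit Bradley-Terry log-strengths by gradient ascent.
    battles: list of (winner_idx, loser_idx) pairs."""
    s = [0.0] * n_models
    for _ in range(iters):
        grad = [0.0] * n_models
        for w, l in battles:
            p = 1.0 / (1.0 + math.exp(s[l] - s[w]))  # P(winner beats loser)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        for i in range(n_models):
            s[i] += lr * grad[i] / len(battles)
        mean = sum(s) / n_models
        s = [x - mean for x in s]  # pin the mean: scores are only identified up to a shift
    return [1000.0 + x for x in s]  # 1000-centered, as on the leaderboard
```

For two models where A wins 3 of 4 battles, the fitted gap converges to ln 3 ≈ 1.10, i.e. 3:1 odds.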
Ranked Arena
Random model pairing, single-turn MSA, matched system instruction. Results feed the official Bradley–Terry leaderboard.
✓ BT Leaderboard
Side-by-Side
User-selected model pair, any dialect. Useful for targeted comparisons but excluded from ranked scoring to prevent selection bias.
Win-rate only
10 Questions
Fixed Arabic prompt pool, any dialect. Provides consistent benchmarking within a curated set of questions.
Win-rate only
Inverse-Probability Weighting
Corrects for non-uniform model exposure using ε-greedy adaptive sampling weights, clamped to [P1, P99].
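A minimal sketch of the two pieces described here, with assumed details: an ε-greedy pairing policy that favors under-exposed models, and inverse-probability weights clamped to the empirical 1st/99th percentiles. The exact policy and clamping method on the platform may differ.

```python
import random
from statistics import quantiles

def sample_pair(models, exposure, eps=0.2, rng=random):
    """epsilon-greedy pairing: with prob. eps pick uniformly at random,
    otherwise pick the two least-exposed models (illustrative policy)."""
    if rng.random() < eps:
        return tuple(rng.sample(models, 2))
    ranked = sorted(models, key=lambda m: exposure[m])
    return ranked[0], ranked[1]

def ipw_weights(sample_probs):
    """Inverse-probability weights, clamped to ~[P1, P99] to cap the
    influence of rarely sampled models."""
    w = [1.0 / p for p in sample_probs]
    qs = quantiles(w, n=100)       # 99 cut points
    lo, hi = qs[0], qs[-1]         # ~1st and ~99th percentiles
    return [min(max(x, lo), hi) for x in w]
```

Clamping matters because a model sampled with probability 0.001 would otherwise carry a weight of 1000 and dominate the weighted tallies.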
Bootstrap Confidence Intervals
200 vote-level resamples per run to produce 95% CIs on every model's BT score.
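The vote-level bootstrap can be sketched as follows: resample battles with replacement, recompute the statistic each time, and take empirical quantiles as the interval. The `stat` callback standing in for the full BT refit is a simplification.

```python
import random

def bootstrap_ci(votes, stat, n_resamples=200, alpha=0.05, seed=0):
    """Vote-level bootstrap: resample with replacement, recompute `stat`,
    return the empirical (alpha/2, 1 - alpha/2) quantiles as a CI."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        resample = [votes[rng.randrange(len(votes))] for _ in votes]
        stats.append(stat(resample))
    stats.sort()
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With 200 resamples and alpha = 0.05, the 5th and 194th sorted values bound the 95% interval, matching the per-model CIs shown on the leaderboard.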
Rankability Gate
BT scores published only when the comparison graph is fully connected and ESS is sufficient; otherwise win-rate fallback is shown.
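Both gate conditions are easy to state concretely. BT scores are only jointly identifiable when every model is linked to every other through some chain of battles (a connected comparison graph), and the Kish effective sample size checks that IPW weighting hasn't left too few "effective" votes. A sketch, assuming the Kish formula for ESS:

```python
from collections import defaultdict, deque

def is_connected(battles, models):
    """True if every model is reachable from every other via battle edges."""
    adj = defaultdict(set)
    for a, b in battles:
        adj[a].add(b)
        adj[b].add(a)
    seen, queue = {models[0]}, deque([models[0]])
    while queue:
        node = queue.popleft()
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == set(models)

def effective_sample_size(weights):
    """Kish ESS: (sum w)^2 / sum w^2; equals n for uniform weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)
```

If either check fails, falling back to per-model win rates avoids publishing BT scores that the data cannot actually pin down.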
Log-odds Scale
Scores are 1000-centered unscaled log-odds, so a 1-point gap corresponds to roughly e:1 ≈ 2.7:1 win odds. Full reproducibility: pipeline and dataset are open.
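The 2.7:1 figure follows directly from exponentiating the score gap:

```python
import math

def win_probability(score_a, score_b):
    """P(A beats B) under Bradley-Terry with unscaled log-odds scores."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

odds = math.exp(1001.0 - 1000.0)        # 1-point gap -> e:1, about 2.72:1
p = win_probability(1001.0, 1000.0)     # about 0.731
```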
Haakkim/Haakkim-1.0v — Battle Dataset
1,273 battle records (Parquet, PII-scrubbed). Includes voted comparisons and skipped battles across all 11 dialects and 3 evaluation modes. Full conversation transcripts, sampling weights, category annotations.
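Working with the records typically means separating voted comparisons from skipped battles. A toy sketch of that filtering on in-memory records; the field names (`model_a`, `model_b`, `winner`, `mode`) are assumptions and may not match the actual Parquet schema:

```python
# Hypothetical records mirroring the dataset's described contents;
# real column names and values may differ.
records = [
    {"model_a": "m1", "model_b": "m2", "winner": "model_a", "mode": "ranked"},
    {"model_a": "m1", "model_b": "m3", "winner": "skipped", "mode": "ranked"},
    {"model_a": "m2", "model_b": "m3", "winner": "model_b", "mode": "side_by_side"},
]

# Skipped battles are kept in the dataset but excluded from scoring.
voted = [r for r in records if r["winner"] != "skipped"]
# Only ranked-arena votes feed the BT leaderboard.
ranked_votes = [r for r in voted if r["mode"] == "ranked"]
```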
@misc{mars2026haakkim,
title = {Haakkim: An Arena-Style Human Preference Evaluation Platform for Arabic {LLMs}},
author = {Mars, Mourad and Barmandah, Hassan and Alassaf, Abdulrhman},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/Haakkim/Haakkim-1.0v}},
note = {College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia}
}