The Arabic RAG Leaderboard
The only leaderboard you will require for your RAG needs 🏆
For technical details, check our blog post here.
Evaluation Status
| Model | License | Revision | Precision | Params (M) |
|---|---|---|---|---|
| Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet | apache-2.0 | main | f16 | 305 |
| Alibaba-NLP/gte-multilingual-base | apache-2.0 | main | f16 | 305 |
| NAMAA-Space/AraModernBert-Base-STS | apache-2.0 | main | f32 | 149 |
| Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka | apache-2.0 | main | f32 | 135 |
| Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet | apache-2.0 | main | f32 | 118 |
| Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2 | apache-2.0 | main | f32 | 135 |
| Omartificial-Intelligence-Space/Arabic-labse-Matryoshka | apache-2.0 | main | f32 | 471 |
| Omartificial-Intelligence-Space/GATE-AraBert-v1 | Open | main | f16 | |
| intfloat/multilingual-e5-large-instruct | mit | main | f16 | 560 |
| mohamed2811/Muffakir_Embedding | | main | f32 | 135 |
| omarelshehy/Arabic-Retrieval-v1.0 | apache-2.0 | main | f32 | 135 |
| omarelshehy/Arabic-STS-Matryoshka-V2 | Open | main | f16 | |
| omarelshehy/Arabic-STS-Matryoshka-V2 | open | main | f32 | 135 |
| omarelshehy/Arabic-STS-Matryoshka | apache-2.0 | main | f32 | 560 |
| omarelshehy/arabic-english-sts-matryoshka-v2.0 | open | main | f32 | 560 |
| omarelshehy/arabic-english-sts-matryoshka | apache-2.0 | main | f32 | 560 |
| silma-ai/silma-embeddding-matryoshka-v0.1 | apache-2.0 | main | f32 | 135 |
| silma-ai/silma-embeddding-sts-v0.1 | apache-2.0 | main | f32 | 135 |
No failed evaluations.
About Retrieval Evaluation
The retrieval evaluation assesses a model's ability to find and retrieve relevant information from a large corpus of Arabic text. Models are evaluated on:
Web Search Dataset Metrics
- MRR (Mean Reciprocal Rank): Measures the ranking quality by focusing on the position of the first relevant result
- nDCG (Normalized Discounted Cumulative Gain): Evaluates the ranking quality considering all relevant results
- Recall@5: Measures the proportion of relevant documents found in the top 5 results
- Overall Score: Combined score calculated as the average of MRR, nDCG, and Recall@5
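Assuming binary relevance labels and a ranked list of document IDs per query, the three metrics and their average can be sketched in plain Python (function names are illustrative, not the harness's actual API):

```python
import math

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg(ranked, relevant):
    """Normalized DCG with binary relevance over the full ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked, start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), len(ranked)) + 1))
    return dcg / ideal if ideal else 0.0

def recall_at_5(ranked, relevant):
    """Proportion of the relevant documents that appear in the top 5."""
    return len(set(ranked[:5]) & relevant) / len(relevant)

def overall_score(ranked, relevant):
    """Overall score: the plain average of MRR, nDCG, and Recall@5."""
    return (mrr(ranked, relevant) + ndcg(ranked, relevant)
            + recall_at_5(ranked, relevant)) / 3

# One query: the only relevant document is ranked second.
score = overall_score(["d3", "d1", "d2"], {"d1"})  # MRR 0.5, nDCG ~0.63, Recall@5 1.0
```

Per-query values are then averaged over the query set; the leaderboard appears to report the result on a 0-100 scale.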
Model Requirements
- Must support Arabic text embeddings
- Should handle queries of at least 512 tokens
- Must work with the `sentence-transformers` library
Evaluation Process
- Models process Arabic web search queries
- Retrieved documents are evaluated using:
- MRR for first relevant result positioning
- nDCG for overall ranking quality
- Recall@5 for top results accuracy
- Metrics are averaged to calculate the overall score
- Models are ranked based on their overall performance
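The retrieval step in the process above can be sketched with toy vectors standing in for real model embeddings (document IDs and numbers are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_vec, corpus, k=5):
    """Rank corpus documents by cosine similarity to the query embedding."""
    ranked = sorted(corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dimensional "embeddings" standing in for real model output.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
top = retrieve_top_k([1.0, 0.05, 0.0], corpus, k=2)  # ["doc_a", "doc_b"]
```

The ranked lists produced this way are what MRR, nDCG, and Recall@5 are computed over.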
How to Prepare Your Model
- Ensure your model is publicly available on the Hugging Face Hub (we don't support private model evaluations yet)
- Model should output fixed-dimension embeddings for text
- Support batch processing for efficient evaluation (this is the default if you use `sentence-transformers`)
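Before submitting, it can help to sanity-check these requirements locally. The sketch below uses a stand-in encoder with the same `encode()` shape as a sentence-transformers model; with the real library installed you would load your actual checkpoint instead, e.g. `SentenceTransformer("your-org/your-model")` (hypothetical id):

```python
# Sanity-check sketch: verifies fixed-dimension output and batch support.
# StubEncoder is a stand-in defined here for illustration only.

class StubEncoder:
    """Mimics the encode() shape of a sentence-transformers model."""
    dim = 8

    def encode(self, texts, batch_size=32):
        # Deterministic toy vectors; a real model returns learned embeddings.
        return [[float((abs(hash(t)) >> i) % 7) for i in range(self.dim)]
                for t in texts]

def check_model(model, sample_texts):
    """Return the embedding dimension if the model meets the requirements."""
    embeddings = model.encode(sample_texts, batch_size=16)
    assert len(embeddings) == len(sample_texts), "one embedding per input text"
    dims = {len(vec) for vec in embeddings}
    assert len(dims) == 1, "embeddings must have a fixed dimension"
    return dims.pop()

dim = check_model(StubEncoder(), ["مرحبا بالعالم", "استرجاع المعلومات"])  # 8
```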
| Rank | Overall Score | Model Size (MB) | Max Tokens | Embedding Dim | | |
|---|---|---|---|---|---|---|
| 1 | 87.44 | 2165.81 | 8192 | 1024 | 81.27 | 87.05 |
| 2 | 85.82 | 2165.81 | 8192 | 1024 | 80.18 | 85.19 |
| 3 | 85.03 | 582.44 | 8192 | 768 | 76.76 | 86.05 |
| 4 | 83.96 | 515.72 | 512 | 768 | 77.02 | 77.61 |
| 5 | 82.28 | 515.72 | 512 | 768 | 76.43 | 75.05 |
| 6 | 79.93 | 515.72 | 512 | 768 | 74.08 | 85 |
| 7 | 68.89 | 421.97 | 512 | 768 | 63.6 | 63.27 |
| 8 | 59.62 | 127.26 | 512 | 384 | 62.81 | 60.42 |
| 9 | 56.75 | 515.72 | 512 | 768 | 51.19 | 58.99 |
| 10 | 53.46 | 1067.91 | 512 | 1024 | 52.6 | 67.73 |
| 11 | 52.44 | 417.64 | 512 | 768 | 47.92 | 76.33 |
| 12 | 51.83 | 1060.65 | 512 | 768 | 50.2 | 55.01 |
| 13 | 50.7 | 2165.81 | 8192 | 1024 | 47.45 | 49.72 |
| 14 | 49.99 | 530.33 | 512 | 768 | 48.65 | 52.75 |
| 15 | 49.88 | 1798.7 | 256 | 768 | 51.04 | 52.54 |
| 16 | 49.74 | 1798.7 | 256 | 768 | 48.34 | 60.29 |
| 17 | 49.55 | 2135.81 | 512 | 1024 | 48.29 | 49.86 |
| 18 | 49.33 | 515.72 | 512 | 768 | 50.03 | 49.1 |
| 19 | 48.68 | 1409.24 | 512 | 1024 | 46.56 | 45.88 |
| 20 | 47.93 | 448.81 | 128 | 384 | 50.18 | 43.49 |
| 21 | 47.86 | 1060.65 | 128 | 768 | 47.51 | 41.96 |
| 22 | 47.37 | 2135.81 | 512 | 1024 | 46.48 | 47.87 |
| 23 | 44.05 | 515.72 | 512 | 768 | 47.17 | 54.47 |
| 24 | 43.7 | 515.73 | 512 | 768 | 47.98 | 50.42 |
Evaluation Status
| Model | License | Revision | Precision | Params (M) |
|---|---|---|---|---|
| Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet | cc-by-nc-4.0 | main | f16 | 306 |
| Alibaba-NLP/gte-multilingual-reranker-base | apache-2.0 | main | f16 | 306 |
| BAAI/bge-reranker-v2-m3 | apache-2.0 | main | f32 | 568 |
| Lajavaness/bilingual-embedding-large | apache-2.0 | main | f32 | 560 |
| NAMAA-Space/GATE-Reranker-V1 | apache-2.0 | main | f32 | 135 |
| NAMAA-Space/Namaa-ARA-Reranker-V1 | apache-2.0 | main | f32 | 568 |
| OmarAlsaabi/e5-base-mlqa-finetuned-arabic-for-rag | Open | main | f16 | |
| OmarAlsaabi/e5-base-mlqa-finetuned-arabic-for-rag | | main | f32 | 278 |
| Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet | apache-2.0 | main | f32 | 118 |
| Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2 | apache-2.0 | main | f32 | 135 |
| Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka | apache-2.0 | main | f32 | 278 |
| Omartificial-Intelligence-Space/Arabic-labse-Matryoshka | apache-2.0 | main | f32 | 471 |
| OrdalieTech/Solon-embeddings-large-0.1 | mit | main | f32 | 560 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | apache-2.0 | main | f32 | 568 |
| anondeb/arabertv02_reranker_2021 | cc-by-nc-4.0 | main | f32 | 135 |
| asafaya/bert-base-arabic | | main | f32 | 111 |
| aubmindlab/bert-base-arabert | | main | f32 | 136 |
| aubmindlab/bert-large-arabertv2 | | main | i64 | 371 |
| colbert-ir/colbertv2.0 | mit | main | i64 | 110 |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | apache-2.0 | main | i64 | 33 |
| intfloat/multilingual-e5-large-instruct | mit | main | f16 | 560 |
| oddadmix/arabic-reranker-v1 | | main | f32 | 135 |
| omarelshehy/Arabic-Retrieval-v1.0 | apache-2.0 | main | f32 | 135 |
| sentence-transformers/LaBSE | apache-2.0 | main | i64 | 471 |
| silma-ai/silma-embeddding-matryoshka-v0.1 | apache-2.0 | main | f32 | 135 |
| jinaai/jina-embeddings-v3 | cc-by-nc-4.0 | main | bf16 | 572 |
About Reranking Evaluation
The reranking evaluation assesses a model's ability to improve search quality by reordering initially retrieved results. Models are evaluated across multiple unseen Arabic datasets to ensure robust performance.
Evaluation Metrics
- MRR@10 (Mean Reciprocal Rank at 10): Measures ranking quality by the position of the first relevant result in the top 10
- NDCG@10 (Normalized DCG at 10): Evaluates the ranking quality of all relevant results in the top 10
- MAP (Mean Average Precision): Measures the overall precision across all relevant documents
All metrics are averaged across multiple evaluation datasets to provide a comprehensive assessment of model performance.
Model Requirements
- Must accept query-document pairs as input
- Should output relevance scores for reranking (has cross-attention or similar mechanism for query-document matching)
- Support for Arabic text processing
Evaluation Process
- Models are tested on multiple unseen Arabic datasets
- For each dataset:
- Initial candidate documents are provided
- Model reranks the candidates
- MRR@10, NDCG@10, and MAP are calculated
- Final scores are averaged across all datasets
- Models are ranked based on overall performance
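A minimal sketch of this loop, with a word-overlap scorer standing in for a real cross-encoder (NDCG@10 would be computed analogously; all names and data here are illustrative):

```python
def rerank(query, candidates, score_fn):
    """Reorder candidate documents by descending relevance score."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)

def mrr_at_10(ranked, relevant):
    """Reciprocal rank of the first relevant document within the top 10."""
    for rank, doc in enumerate(ranked[:10], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Average of precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def word_overlap(query, doc):
    # Toy scorer: a real reranker would feed the query-document pair
    # through a cross-encoder to obtain a relevance score.
    return len(set(query.split()) & set(doc.split()))

candidates = ["cooking recipes", "Riyadh weather today", "weather news"]
ranked = rerank("Riyadh weather today forecast", candidates, word_overlap)
# The relevant candidate is now ranked first; per-query metrics are then
# averaged across queries and datasets to produce the final score.
```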
How to Prepare Your Model
- Model should be public on the Hugging Face Hub (private models are not supported yet)
- Make sure it works correctly with the `sentence-transformers` library