from dataclasses import dataclass
from enum import Enum


@dataclass
class EvalDimension:
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class EvalDimensions(Enum):
    d0 = EvalDimension("speed", "Speed (words/sec)")
    d1 = EvalDimension("contamination_score", "Contamination Score")
    d2 = EvalDimension("paraphrasing", "Paraphrasing")
    d3 = EvalDimension("sentiment analysis", "Sentiment Analysis")
    d4 = EvalDimension("coding", "Coding")
    d5 = EvalDimension("function calling", "Function Calling")
    d6 = EvalDimension("rag qa", "RAG QA")
    d7 = EvalDimension("reading comprehension", "Reading Comprehension")
    d8 = EvalDimension("entity extraction", "Entity Extraction")
    d9 = EvalDimension("summarization", "Summarization")
    d10 = EvalDimension("long context", "Long Context")
    d11 = EvalDimension("mmlu", "MMLU")
    d12 = EvalDimension("arabic language & grammar", "Arabic Language & Grammar")
    d13 = EvalDimension("general knowledge", "General Knowledge")
    d14 = EvalDimension("translation (incl dialects)", "Translation (incl Dialects)")
    d15 = EvalDimension("trust & safety", "Trust & Safety")
    d16 = EvalDimension("writing (incl dialects)", "Writing (incl Dialects)")
    d17 = EvalDimension("dialect detection", "Dialect Detection")
    d18 = EvalDimension("reasoning & math", "Reasoning & Math")
    d19 = EvalDimension("diacritization", "Diacritization")
    d20 = EvalDimension("instruction following", "Instruction Following")
    d21 = EvalDimension("transliteration", "Transliteration")
    d22 = EvalDimension("structuring", "Structuring")
    d23 = EvalDimension("hallucination", "Hallucination")
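
# A minimal usage sketch (illustrative only: SKILL_COLUMNS is a hypothetical
# name, not referenced elsewhere in this Space). It flattens the enum into
# the display column names shown on the leaderboard.
SKILL_COLUMNS = [dim.value.col_name for dim in EvalDimensions]
# -> ["Speed (words/sec)", "Contamination Score", "Paraphrasing", ...]
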
NUM_FEWSHOT = 0  # Change to match your few-shot setting
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<div><img class='abl_header_image' src='https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard/resolve/main/src/images/abl_logo.png'></div>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
<h1 style='width: 100%;text-align: center;' id="space-title">Arabic Broad Leaderboard (ABL) - The first comprehensive Leaderboard for Arabic LLMs</h1>
ABL, the official leaderboard of the <a href='https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark' target='_blank'>Arabic Broad Benchmark (ABB)</a>,
is a next-generation leaderboard offering innovative visualizations, analytical capabilities, model skill breakdowns, speed comparisons, and contamination detection mechanisms. ABL provides the community with an unprecedented ability to study the capabilities of Arabic models and choose the right model for the right task. Find more details in the FAQ section.
"""
# Which evaluations are you running? How can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
# FAQ
---
## What is the Benchmark Score?
* The benchmark score is calculated by taking the average of all individual question scores, as sketched below.
* Each question is scored from 0 to 10 using a mix of LLM-as-judge and manual rules, depending on the question type.
* Please refer to the ABB page below for more information about the scoring rules and the dataset:
https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark#scoring-rules
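
For illustration, the aggregation is just a plain average (a minimal sketch, not the actual evaluation harness; `question_scores` is a hypothetical variable):

```python
# Hypothetical example: each entry is one question's score on the 0-10 scale,
# assigned by LLM-as-judge or by fixed rules depending on the question type.
question_scores = [10, 7.5, 0, 9, 6]

# The benchmark score is the plain average of all question scores.
benchmark_score = sum(question_scores) / len(question_scores)
print(round(benchmark_score, 2))  # 6.5
```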
---
## What is the Contamination Score?
* The contamination score estimates the likelihood that a model was trained on the ABB benchmarking data in order to boost its scores on ABL.
* After testing each model on ABL, we run our private algorithm to detect contamination and arrive at a score.
* Contaminated models will show a red sign and a number above zero in the Contamination Score column.
* Any model showing signs of contamination will be deleted instantly from the leaderboard.
---
## What is the Speed?
* Speed shows how fast the model was during testing, using the "words per second" metric.
* The score is calculated by dividing the number of words the model generated over the entire test by the time taken (in seconds) to complete testing, as sketched below.
* Please note that we use the same GPU (A100) and a batch size of 1 for all Hugging Face models to ensure a fair comparison. Models above 15B are split across multiple GPUs.
* Each model should only be compared to other models in its size category.
* API or closed models can't be compared to open models, only to other API models, since they are not hosted on our infrastructure.
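
As a rough sketch of that calculation (illustrative only; the figures and variable names are made up):

```python
# Hypothetical example figures for one full benchmark run.
total_words_generated = 18_400  # words produced across all test questions
total_seconds = 920.0           # wall-clock time to complete the test

speed_words_per_sec = total_words_generated / total_seconds
print(round(speed_words_per_sec, 1), "words/sec")  # 20.0 words/sec
```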
---
## What does Size mean?
* Models below 3.5B parameters are considered Nano.
* Models between 3.5B and 10B parameters are considered Small.
* Models between 10B and 35B parameters are considered Medium.
* Models above 35B parameters are considered Large (see the sketch below).
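
These thresholds can be expressed as a small helper (an illustrative sketch; `size_category` is not part of the leaderboard code):

```python
def size_category(num_params_billions: float) -> str:
    # Map a parameter count (in billions) to the ABL size buckets above.
    if num_params_billions < 3.5:
        return "Nano"
    if num_params_billions < 10:
        return "Small"
    if num_params_billions < 35:
        return "Medium"
    return "Large"

print(size_category(7))   # Small
print(size_category(70))  # Large
```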
---
## What does Source mean?
* API: Closed models tested via an API.
* Hugging Face: Open models downloaded from Hugging Face and evaluated via the `transformers` library (minimal loading sketch below).
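
For Hugging Face models, loading for evaluation looks roughly like this (a minimal sketch: the model id is only an example, and generation settings are omitted):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model id only; any causal LM on the Hub follows the same pattern.
model_id = "silma-ai/SILMA-9B-Instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```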
---
## How can I reproduce the results?
You can easily reproduce the results of any model by following the steps on the ABB page below:
https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark#how-to-use-abb-to-benchmark-a-model
---
## I tested a model and got a slightly different score. Why is that?
* ABB is partially dependent on an external LLM-as-judge (GPT-4.1).
* LLM outputs are nondeterministic by nature, so the same run will not always produce identical scores.
* That said, according to our testing, such variations are always within a +/-1% range.
---
## I have seen an answer which seems correct to me but is getting a zero score. Why is that?
* First, LLM scoring is not perfectly consistent, and it occasionally assigns a wrong score to an answer; based on our testing, this is very rare.
* Second, we also apply fixed rules to penalize models; for example, when a model answers in another language, or in two languages at once, it receives a score of zero.
* In general, both the fixed rules and any LLM inconsistencies affect all models in the same way, which we consider fair.
---
## Why am I not allowed to submit models with more than 15B parameters?
* Models above 15B parameters don't fit on a single GPU and require multiple GPUs, which we can't always guarantee to provision in an automated manner.
* We also know that most community models are below 15B parameters.
* As an exception, we can accept requests from organizations on a case-by-case basis.
* Finally, we will always make sure to include larger models when they have high adoption from the community.
---
## How can I learn more about ABL and ABB?
Feel free to read through the following resources:
* **ABB Page**: https://huggingface.co/datasets/silma-ai/arabic-broad-benchmark
* **ABL blog post**: Coming soon...
---
## How can I contact the benchmark maintainers?
You can contact us via [email protected]
"""
EVALUATION_QUEUE_TEXT = """
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite the Leaderboard"
CITATION_BUTTON_TEXT = r"""
@misc{ABL,
  author       = {SILMA.AI Team},
  title        = {Arabic Broad Leaderboard},
  year         = {2025},
  publisher    = {SILMA.AI},
  howpublished = {\url{https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard}}
}
"""
FOOTER_TEXT = """<div style='display:flex;justify-content:center;align-items:center;'>
<span style='font-size:36px;font-weight:bold;margin-right:20px;'>Sponsored By</span>
<a href='https://silma.ai/?ref=abl' target='_blank'>
<img style='height:60px' src='https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard/resolve/main/src/images/silma-logo-wide.png'>
</a>
</div>"""