Update utils.py
utils.py
CHANGED
@@ -115,7 +115,7 @@ table > tbody td:first-child {
 """
 
 LLM_BENCHMARKS_ABOUT_TEXT = f"""
-# Open Persian LLM Leaderboard (
+# Open Persian LLM Leaderboard (v2.0.0)
 
 > The Open Persian LLM Evaluation Leaderboard, developed by **Part DP AI** in collaboration with **AUT (Amirkabir University of Technology) NLP Lab**, provides a comprehensive benchmarking system specifically designed for Persian LLMs. This leaderboard, based on the open-source [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), offers a unique platform for evaluating the performance of large language models (LLMs) on tasks that demand linguistic proficiency and technical skill in Persian.
 
@@ -127,13 +127,28 @@ LLM_BENCHMARKS_ABOUT_TEXT = f"""
 > The leaderboard allows open participation, meaning that developers and researchers working with open-source models can submit evaluation requests for their models. This accessibility encourages the development and testing of Persian LLMs within the broader AI ecosystem.
 >
 > 2. **Task Diversity**
->
-> - **
-> - **
-> - **
+> Over 20 specialized tasks have been curated for this leaderboard, each tailored to challenge different aspects of a model's capabilities. These tasks include:
+> - **GeneralKnowledge**
+> - **GSM8K**
+> - **DC-Homograph**
+> - **MC-Homograph**
+> - **PiQA**
+> - **Proverb-Quiz**
+> - **VerbEval**
+> - **Winogrande**
+> - **Arc-Challenge**
+> - **Arc-Easy**
+> - **Feqh**
+> - **Hallucination (Truthfulness)**
+> - **P-Hellaswag**
+> - **Law**
+> - **AUT Multiple Choice**
+> - **Parsi Literature**
+> - **BoolQA**
+> - **Reading Comprehension**
+> - **PartExpert**
 > - **MMLU Pro**
-> - **
-> - **AUT Multiple Choice Persian**
+> - **Iranian Social Norms**
 >
 > Each dataset is available in Persian, providing a robust testing ground for models in a non-English setting. The datasets collectively contain over **40k samples** across various categories such as **Common Knowledge**, **Reasoning**, **Summarization**, **Math**, and **Specialized Examinations**, offering comprehensive coverage of diverse linguistic and technical challenges.
 >