Update src/display/about.py
src/display/about.py  +17 -1
@@ -40,7 +40,23 @@ LLM_BENCHMARKS_TEXT = f"""
 ## How it works
 
 ## Reproducibility
-To reproduce
+To reproduce my results, here are the commands you can run:
+
+I use LM-Evaluation-Harness-Turkish, a version of the LM Evaluation Harness adapted for Turkish datasets, to ensure our leaderboard results are both reliable and replicable. Please see https://github.com/malhajar17/lm-evaluation-harness_turkish for more information.
+
+## How to Reproduce Results
+
+### 1) Set up the repo: Clone "lm-evaluation-harness_turkish" from https://github.com/malhajar17/lm-evaluation-harness_turkish and follow the installation instructions.
+### 2) Run evaluations: To get the same results as on the leaderboard (some tests might show small variations), use the following command, adjusting it for your model. For example, with the Trendyol model:
+```bash
+lm_eval --model vllm --model_args pretrained=Trendyol/Trendyol-LLM-7b-chat-v1.0 --tasks truthfulqa_mc2_tr,truthfulqa_mc1_tr,mmlu_tr,winogrande_tr,gsm8k_tr,arc_challenge_tr,hellaswag_tr --output /workspace/Trendyol-LLM-7b-chat-v1.0
+```
+### 3) Report results: I take the average of the truthfulqa_mc1_tr and truthfulqa_mc2_tr scores and report it as truthfulqa. The generated results file is then uploaded to the OpenLLM Turkish Leaderboard.
+
+Notes:
+
+- I currently use the "vllm" backend, which might produce results that differ slightly from the standard LM Evaluation Harness setup.
+- All the tests currently use "acc" as the metric, with a plan to migrate to "acc_norm" for "ARC" and "Hellaswag" soon.
 
 """
 
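Step 3 of the added text folds the two TruthfulQA tasks into a single reported score. The snippet below is a minimal sketch of that aggregation, assuming the run above wrote a JSON results file under the `--output` directory with the usual LM Evaluation Harness layout (a top-level "results" mapping keyed by task name, with an "acc" entry per task); the file path is hypothetical and the exact keys may vary between harness versions.

```python
import json

# Hypothetical path: point this at the JSON file written by the `--output` run above.
RESULTS_PATH = "/workspace/Trendyol-LLM-7b-chat-v1.0/results.json"

with open(RESULTS_PATH) as f:
    # Assumed layout: {"results": {"<task_name>": {"acc": <float>, ...}, ...}}
    results = json.load(f)["results"]

def acc(task: str) -> float:
    """Return the accuracy reported for a task, assuming an "acc" metric key."""
    return float(results[task]["acc"])

# Average the two TruthfulQA variants into the single reported "truthfulqa" score.
truthfulqa = (acc("truthfulqa_mc1_tr") + acc("truthfulqa_mc2_tr")) / 2
print(f"truthfulqa: {truthfulqa:.4f}")
```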