Commit · 7abc6a7
1 Parent(s): 96d111a

Update benchmark count and fix typo (`inetuning->finetuning`) (#395)
- Update benchmark count and fix typo (`inetuning->finetuning`) (cdeea55b7621c0b1fa7515a40bf2fb50df62d5d7)

Co-authored-by: Alvaro Bartolome <[email protected]>

Files changed:
- src/display/about.py (+2 -2)
src/display/about.py  CHANGED

@@ -28,7 +28,7 @@ If there is no icon, we have not uploaded the information on the model yet, feel

## How it works

-📈 We evaluate models on
+📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.

- <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
- <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
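For concreteness, here is a minimal sketch of how a model could be scored on one of these tasks with the Eleuther AI Language Model Evaluation Harness. It assumes a recent `lm-eval` release that exposes `simple_evaluate`; the model name, task choice, and few-shot count below are illustrative only and are not the leaderboard's exact configuration.

```python
# Minimal sketch (not the leaderboard's own pipeline): score a Hugging Face
# model on the AI2 Reasoning Challenge with 25 few-shot examples using the
# EleutherAI lm-evaluation-harness (`pip install lm-eval`, v0.4+ assumed).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # transformers-based backend
    model_args="pretrained=EleutherAI/pythia-160m",  # illustrative model choice
    tasks=["arc_challenge"],                         # AI2 Reasoning Challenge
    num_fewshot=25,                                  # 25-shot, as described above
)
print(results["results"]["arc_challenge"])           # per-task metrics, e.g. acc_norm

# HellaSwag would be run the same way with tasks=["hellaswag"] and num_fewshot=10.
```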
@@ -67,7 +67,7 @@ The tasks and few shots parameters are:
Side note on the baseline scores:
- for log-likelihood evaluation, we select the random baseline
- for DROP, we select the best submission score according to [their leaderboard](https://leaderboard.allenai.org/drop/submissions/public) when the paper came out (NAQANet score)
-- for GSM8K, we select the score obtained in the paper after inetuning a 6B model on the full GSM8K training set for 50 epochs
+- for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs

## Quantization
To get more information about quantization, see:
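As a side illustration of the first bullet in the hunk above: the random baseline for a log-likelihood (multiple-choice) task is simply the accuracy expected from guessing uniformly among the answer choices. A small sketch, with choice counts assumed from the usual dataset formats rather than taken from the leaderboard code:

```python
# Sketch: random baselines for multiple-choice (log-likelihood) tasks,
# i.e. the expected accuracy of picking an answer uniformly at random.
# The choice counts below follow the common dataset conventions (assumed here).
num_choices = {
    "arc_challenge": 4,  # ARC questions typically offer 4 answer options
    "hellaswag": 4,      # HellaSwag has 4 candidate endings
}

random_baseline_pct = {task: 100.0 / n for task, n in num_choices.items()}
print(random_baseline_pct)  # {'arc_challenge': 25.0, 'hellaswag': 25.0}
```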