Commit · 7abc6a7
1 Parent(s): 96d111a

Update benchmark count and fix typo (`inetuning->finetuning`) (#395)
- Update benchmark count and fix typo (`inetuning->finetuning`) (cdeea55b7621c0b1fa7515a40bf2fb50df62d5d7)

Co-authored-by: Alvaro Bartolome <[email protected]>

Files changed:
- src/display/about.py (+2 -2)
src/display/about.py  CHANGED

@@ -28,7 +28,7 @@ If there is no icon, we have not uploaded the information on the model yet, feel

## How it works

-📈 We evaluate models on
+📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.

- <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
- <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
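For concreteness, here is a minimal sketch of how a model could be scored on one of these tasks with the Eleuther AI Language Model Evaluation Harness. It assumes a recent `lm-eval` release that exposes `simple_evaluate`; the model name, task choice, and few-shot count below are illustrative only and are not the leaderboard's exact configuration.

```python
# Minimal sketch (not the leaderboard's own pipeline): score a Hugging Face
# model on the AI2 Reasoning Challenge with 25 few-shot examples using the
# EleutherAI lm-evaluation-harness (`pip install lm-eval`, v0.4+ assumed).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # transformers-based backend
    model_args="pretrained=EleutherAI/pythia-160m",  # illustrative model choice
    tasks=["arc_challenge"],                         # AI2 Reasoning Challenge
    num_fewshot=25,                                  # 25-shot, as described above
)
print(results["results"]["arc_challenge"])           # per-task metrics, e.g. acc_norm

# HellaSwag would be run the same way with tasks=["hellaswag"] and num_fewshot=10.
```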
@@ -67,7 +67,7 @@ The tasks and few shots parameters are:
Side note on the baseline scores:
- for log-likelihood evaluation, we select the random baseline
- for DROP, we select the best submission score according to [their leaderboard](https://leaderboard.allenai.org/drop/submissions/public) when the paper came out (NAQANet score)
-- for GSM8K, we select the score obtained in the paper after inetuning a 6B model on the full GSM8K training set for 50 epochs
+- for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs

## Quantization
To get more information about quantization, see:
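As a side illustration of the first bullet in the hunk above: the random baseline for a log-likelihood (multiple-choice) task is simply the accuracy expected from guessing uniformly among the answer choices. A small sketch, with choice counts assumed from the usual dataset formats rather than taken from the leaderboard code:

```python
# Sketch: random baselines for multiple-choice (log-likelihood) tasks,
# i.e. the expected accuracy of picking an answer uniformly at random.
# The choice counts below follow the common dataset conventions (assumed here).
num_choices = {
    "arc_challenge": 4,  # ARC questions typically offer 4 answer options
    "hellaswag": 4,      # HellaSwag has 4 candidate endings
}

random_baseline_pct = {task: 100.0 / n for task, n in num_choices.items()}
print(random_baseline_pct)  # {'arc_challenge': 25.0, 'hellaswag': 25.0}
```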