small update
README.md
CHANGED
@@ -10,7 +10,7 @@ pinned: false
 # EvoEval: Evolving Coding Benchmarks via LLM
 
 **EvoEval**<sup>1</sup> is a holistic benchmark suite created by _evolving_ **HumanEval** problems:
-- 🔥
+- 🔥 Contains **828** new problems across **5** semantic-altering and **2** ⭐ semantic-preserving benchmarks
 - 🔮 Allows evaluation/comparison across different **dimensions** and problem **types** (i.e., _Difficult_, _Creative_ or _Tool Use_ problems). See our [**visualization tool**](https://evo-eval.github.io/visualization.html) for ready-to-use comparison
 - 🏆 Complete with [**leaderboard**](https://evo-eval.github.io/leaderboard.html), **groundtruth solutions**, **robust testcases** and **evaluation scripts** to easily fit into your evaluation pipeline
 - 🤗 Generated LLM code samples from **>50** different models to save you time in running experiments
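The last two bullets in the updated README describe dropping EvoEval into an existing evaluation pipeline. Purely as an illustration of that workflow, and not the project's documented interface, here is a minimal sketch that reads HumanEval-style problems from a JSONL file and writes a `samples.jsonl` of model completions; the file names, the `task_id`/`prompt`/`completion` field names, and the `generate_completion` placeholder are all assumptions, so consult the EvoEval repository for the actual loading and evaluation commands.

```python
import json

# Minimal sketch (not EvoEval's documented API): read benchmark problems from a
# HumanEval-style JSONL file and write model completions as samples.jsonl, the
# format most execution-based evaluators consume. The file names and the
# "task_id" / "prompt" / "completion" field names are assumptions.

def generate_completion(prompt: str) -> str:
    # Placeholder for a real model call (OpenAI API, vLLM, transformers, ...).
    return "    pass\n"

def make_samples(problems_path: str, out_path: str) -> None:
    with open(problems_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines defensively
            problem = json.loads(line)
            record = {
                "task_id": problem["task_id"],
                "completion": generate_completion(problem["prompt"]),
            }
            dst.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    # Illustrative paths only; substitute the actual EvoEval problem file.
    make_samples("EvoEval_difficult.jsonl", "samples.jsonl")
```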