Update README.md
README.md CHANGED
## Model Eval

HumanEval is the most common code generation benchmark for evaluating model performance, especially on the completion of coding exercises.
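Each HumanEval task supplies a function signature and docstring as the prompt, together with unit tests that check the generated body. As a rough, optional sketch (assuming the `openai_humaneval` dataset on the Hugging Face Hub), you can inspect a task like this:

```python
# Minimal sketch: inspect one HumanEval task via the Hugging Face datasets library.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")  # 164 problems
task = humaneval[0]

print(task["task_id"])      # e.g. "HumanEval/0"
print(task["prompt"])       # function signature + docstring to be completed
print(task["entry_point"])  # name of the function exercised by the tests
# task["test"] contains the unit tests used to judge the completion.
```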
Model evaluation is, to some extent, more art than science: different models have different sensitivities to decoding methods, parameters, and instructions.

It is impractical for us to hand-pick a specific configuration for each fine-tuned model, because a capable LLM should retain its general abilities regardless of how users set these parameters.

Therefore, OpenCSG strove to provide a relatively fair method for comparing the fine-tuned models on the HumanEval benchmark.

To simplify the comparison, we chose the Pass@1 metric for Python, even though our fine-tuning dataset contains samples in multiple languages.
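Pass@1 counts a problem as solved only if its generated completion passes all of the task's unit tests. For reference, here is a minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper, specialised to our single-sample setting (one greedy completion per problem, as noted below); the per-problem outcomes are hypothetical:

```python
# Minimal sketch of the unbiased pass@k estimator; with one sample per problem
# (n = k = 1) the corpus-level Pass@1 is just the fraction of problems solved.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples (out of n) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem outcomes with a single sample each.
passed = [True, False, True]
scores = [pass_at_k(n=1, c=int(ok), k=1) for ok in passed]
print(f"Pass@1: {sum(scores) / len(scores):.1%}")  # Pass@1: 66.7%
```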
**For fairness, we evaluated both the original and the fine-tuned CodeLlama models using only the prompts from the original HumanEval cases, without adding any extra instructions.**

**In addition, we used greedy decoding for every model during evaluation.**
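As a rough illustration of this setup (not the exact harness we used; the model id and toy prompt below are placeholders), a single greedy completion can be generated with the `transformers` library roughly like this:

```python
# Minimal sketch: greedy decoding of a HumanEval-style prompt with transformers.
# The model id and the toy prompt are placeholders, not the evaluated checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "opencsg/opencsg-CodeLlama-13b-v0.1"  # assumed repo id; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A HumanEval prompt is just the signature plus docstring, with no extra instructions.
prompt = 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n'

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding: always take the most likely next token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```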
| Model                          | HumanEval Python Pass@1 |
| ------------------------------ | ----------------------- |
| ...                            | ...                     |
| opencsg-CodeLlama-34b-v0.1(4k) | **48.8%**               |

**TODO**

- We will provide more benchmark scores for fine-tuned models in the future.
- We will provide more practical problems for evaluating the performance of fine-tuned models in the field of software engineering.