AmeliaYin committed (verified)
Commit 48a6781 · Parent: 624bd72

Update README.md

Files changed (1): README.md (+9 -9)
README.md CHANGED
@@ -46,16 +46,16 @@ This is the repository for the base 13B version finetuned based on [CodeLlama-13
 
 ## Model Eval
 
- HumanEval is the commonest code generation benchmark to evaluate the performance of models, especially on the the compeltion of code exercise cases.
- Somehow, model evaluation is a kind of metaphysics. Different models are sensitive to different decoding methods, paramters and instructions.
- It is impratical for us to manually set specific configuration for each fine-tuned model, because a real LLM should master the universal capability despite the parameters manipulated by users.
+ HumanEval is the most common code generation benchmark for evaluating model performance, especially on the completion of code exercise cases.
+ Model evaluation is, to some extent, a metaphysics. Different models have different sensitivities to decoding methods, parameters and instructions.
+ It is impractical for us to manually set specific configurations for each fine-tuned model, because a real LLM should master general capabilities despite the parameters being manipulated by users.
 
- Thus, OpenCSG strained our brains to provide a relatively fair method to compare the fine-tuned models on HumanEval benchmark.
- To simplify the comparison, we chosed the Pass@1 metric on python language, but our finetuning dataset includes samples in multi language.
+ Therefore, OpenCSG racked their brains to provide a relatively fair method to compare the fine-tuned models on the HumanEval benchmark.
+ To simplify the comparison, we chose the Pass@1 metric for the Python language, but our fine-tuning dataset includes samples in multiple languages.
 
- **For fair, we evaluated the fine-tuned and original codellama models only with the original cases' prompts, not including any other instruction else.**
+ **For fairness, we evaluated the original and fine-tuned CodeLlama models based only on the prompts from the original cases, without including any other instructions.**
 
- **Otherwise, we use greedy decoding method for each model during the evaluation.**
+ **Additionally, we use the greedy decoding method for each model during evaluation.**
 
 | Model | HumanEval python pass@1 |
 | --- |----------------------------------------------------------------------------- |
@@ -67,8 +67,8 @@ To simplify the comparison, we chosed the Pass@1 metric on python language, but
 | opencsg-CodeLlama-34b-v0.1(4k)| **48.8%** |
 
 **TODO**
- - we will provide much more benchmark scores on fine-tuned models in future.
- - we will provide different practical problems to evaluate the performance of fine-tuned models in the field of software engineering.
+ - We will provide more benchmark scores on fine-tuned models in the future.
+ - We will provide different practical problems to evaluate the performance of fine-tuned models in the field of software engineering.
 
 
 
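
For reference, the evaluation setup the updated README describes (original HumanEval prompts only, greedy decoding, Python Pass@1) can be sketched roughly as below. This is a minimal illustration, not the authors' actual harness; it assumes the `transformers` and OpenAI `human-eval` packages, and the model id shown is a placeholder.

```python
# Minimal sketch of greedy pass@1 generation on HumanEval.
# Assumptions: `transformers` and `human-eval` are installed;
# MODEL_ID is a placeholder and may not match the real repository name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

MODEL_ID = "opencsg/opencsg-CodeLlama-13b-v0.1"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

samples = []
for task_id, problem in read_problems().items():
    # Use the original HumanEval prompt only, with no extra instructions.
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=384,
        do_sample=False,  # greedy decoding
    )
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Score afterwards with: evaluate_functional_correctness samples.jsonl
```

Passing `do_sample=False` is what makes the decoding greedy; with a single greedy completion per task, Pass@1 reduces to whether that one completion passes the unit tests.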