AmeliaYin committed (verified)
Commit 48a6781 · Parent: 624bd72

Update README.md

Files changed (1): README.md (+9 -9)
README.md CHANGED
@@ -46,16 +46,16 @@ This is the repository for the base 13B version finetuned based on [CodeLlama-13
 
 ## Model Eval
 
- HumanEval is the commonest code generation benchmark to evaluate the performance of models, especially on the the compeltion of code exercise cases.
- Somehow, model evaluation is a kind of metaphysics. Different models are sensitive to different decoding methods, paramters and instructions.
- It is impratical for us to manually set specific configuration for each fine-tuned model, because a real LLM should master the universal capability despite the parameters manipulated by users.
+ HumanEval is the most common code generation benchmark for evaluating model performance, especially on the completion of code exercise cases.
+ Model evaluation is, to some extent, a metaphysics. Different models have different sensitivities to decoding methods, parameters and instructions.
+ It is impractical for us to manually set specific configurations for each fine-tuned model, because a real LLM should master general capabilities despite the parameters being manipulated by users.
 
- Thus, OpenCSG strained our brains to provide a relatively fair method to compare the fine-tuned models on HumanEval benchmark.
- To simplify the comparison, we chosed the Pass@1 metric on python language, but our finetuning dataset includes samples in multi language.
+ Therefore, OpenCSG racked their brains to provide a relatively fair method to compare the fine-tuned models on the HumanEval benchmark.
+ To simplify the comparison, we chose the Pass@1 metric for the Python language, but our fine-tuning dataset includes samples in multiple languages.
 
- **For fair, we evaluated the fine-tuned and original codellama models only with the original cases' prompts, not including any other instruction else.**
+ **For fairness, we evaluated the original and fine-tuned CodeLlama models based only on the prompts from the original cases, without including any other instructions.**
 
- **Otherwise, we use greedy decoding method for each model during the evaluation.**
+ **Additionally, we use the greedy decoding method for each model during evaluation.**
 
 | Model | HumanEval python pass@1 |
 | --- |----------------------------------------------------------------------------- |
@@ -67,8 +67,8 @@ To simplify the comparison, we chosed the Pass@1 metric on python language, but
 | opencsg-CodeLlama-34b-v0.1(4k)| **48.8%** |
 
 **TODO**
- - we will provide much more benchmark scores on fine-tuned models in future.
- - we will provide different practical problems to evaluate the performance of fine-tuned models in the field of software engineering.
+ - We will provide more benchmark scores on fine-tuned models in the future.
+ - We will provide different practical problems to evaluate the performance of fine-tuned models in the field of software engineering.
 
 
 
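
For reference, the evaluation setup the updated README describes (original HumanEval prompts only, greedy decoding, Python Pass@1) can be sketched roughly as below. This is a minimal illustration, not the authors' actual harness; it assumes the `transformers` and OpenAI `human-eval` packages, and the model id shown is a placeholder.

```python
# Minimal sketch of greedy pass@1 generation on HumanEval.
# Assumptions: `transformers` and `human-eval` are installed;
# MODEL_ID is a placeholder and may not match the real repository name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

MODEL_ID = "opencsg/opencsg-CodeLlama-13b-v0.1"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

samples = []
for task_id, problem in read_problems().items():
    # Use the original HumanEval prompt only, with no extra instructions.
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=384,
        do_sample=False,  # greedy decoding
    )
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Score afterwards with: evaluate_functional_correctness samples.jsonl
```

Passing `do_sample=False` is what makes the decoding greedy; with a single greedy completion per task, Pass@1 reduces to whether that one completion passes the unit tests.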