---
title: codebleu
tags:
- evaluate
- metric
- code
- codebleu
description: "Unofficial `CodeBLEU` implementation with Linux and macOS support, available via PyPI and the Hugging Face Hub."
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---
# Metric Card for codebleu
## Metric Description
Unofficial `CodeBLEU` implementation with Linux and macOS support, available via PyPI and the Hugging Face Hub.

> An ideal evaluation metric should consider the grammatical correctness and the logic correctness.
> We propose weighted n-gram match and syntactic AST match to measure grammatical correctness, and introduce semantic data-flow match to calculate logic correctness.

(from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator/CodeBLEU) repository)

In a nutshell, `CodeBLEU` is a weighted combination of `n-gram match (BLEU)`, `weighted n-gram match (BLEU-weighted)`, `AST match`, and `data-flow match` scores.
The metric has shown higher correlation with human evaluation than the `BLEU` and `accuracy` metrics.
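With the default weights, the four components contribute equally to the final score. A minimal sketch of the combination (this mirrors the definition above, not the library's internal code; the function name is illustrative):

```python
# Conceptual combination of the four CodeBLEU component scores with default
# weights; illustrative only -- not the library's internal implementation.
def combine_codebleu(ngram, weighted_ngram, syntax, dataflow,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    a, b, g, d = weights
    return a * ngram + b * weighted_ngram + g * syntax + d * dataflow
```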
## How to Use
The metric can be used either directly through the `codebleu` PyPI package (`calc_codebleu`) or through the `evaluate` library. In both cases, pass a list of reference snippets and a list of predicted snippets together with the target language; full examples are given in the Examples section below.
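A minimal call, relying on the default weights and tokenizer described under Inputs (the code snippets here are only illustrative):

```python
from codebleu import calc_codebleu

# Single reference and single prediction; weights and tokenizer use their defaults.
result = calc_codebleu(
    ["def f(a):\n    return a"],   # references
    ["def g(x):\n    return x"],   # predictions
    lang="python",
)
print(result["codebleu"])
```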
### Inputs
- `references` (`list[str]` or `list[list[str]]`): reference code
- `predictions` (`list[str]`): predicted code
- `lang` (`str`): code language; see `codebleu.AVAILABLE_LANGS` for available languages (python, c_sharp, c, cpp, javascript, java, php at the moment)
- `weights` (`tuple[float,float,float,float]`): weights of the `ngram_match`, `weighted_ngram_match`, `syntax_match`, and `dataflow_match` scores respectively; defaults to `(0.25, 0.25, 0.25, 0.25)`
- `tokenizer` (`callable`): used to split a code string into tokens; defaults to `s.split()` (see the sketch after this list for passing a custom one)
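For illustration, a sketch of a call that exercises the non-default parameters above; the concrete values (the weights, `str.split` as tokenizer, the code snippets) are arbitrary, and the `calc_codebleu` signature is the one shown in the Examples section below:

```python
from codebleu import calc_codebleu

# Two references for one prediction (the list[list[str]] form).
references = [[
    "def sum(a, b):\n    return a + b",
    "def sum(x, y):\n    return x + y",
]]
predictions = ["def add(a, b):\n    return a + b"]

# Emphasize AST and data-flow agreement over raw n-gram overlap,
# and split tokens on whitespace explicitly.
result = calc_codebleu(
    references,
    predictions,
    lang="python",
    weights=(0.10, 0.10, 0.40, 0.40),
    tokenizer=str.split,
)
print(result["codebleu"])
```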
### Output Values
The metric outputs a `dict[str, float]` with the following fields:
- `codebleu`: the final `CodeBLEU` score
- `ngram_match_score`: `ngram_match` score (BLEU)
- `weighted_ngram_match_score`: `weighted_ngram_match` score (BLEU-weighted)
- `syntax_match_score`: `syntax_match` score (AST match)
- `dataflow_match_score`: `dataflow_match` score

Each of the scores is in the range `[0, 1]`, where `1` is the best score.
### Examples
Using the pip package (`pip install codebleu`):
```python
from codebleu import calc_codebleu

prediction = "def add ( a , b ) :\n return a + b"
reference = "def sum ( first , second ) :\n return second + first"

result = calc_codebleu([reference], [prediction], lang="python", weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None)
print(result)
# {
#   'codebleu': 0.5537,
#   'ngram_match_score': 0.1041,
#   'weighted_ngram_match_score': 0.1109,
#   'syntax_match_score': 1.0,
#   'dataflow_match_score': 1.0
# }
```
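With the default equal weights, the final score is simply the average of the four component scores: `(0.1041 + 0.1109 + 1.0 + 1.0) / 4 ≈ 0.5537`, matching the reported `codebleu` value.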
Or using the `evaluate` library (the `codebleu` package is still required):
```python
import evaluate

metric = evaluate.load("k4black/codebleu")

prediction = "def add ( a , b ) :\n return a + b"
reference = "def sum ( first , second ) :\n return second + first"

result = metric.compute(references=[reference], predictions=[prediction], lang="python", weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None)
```
Note: `lang` is required.
## Limitations and Bias
As this library requires compiling a shared-object (`.so`) file, it is platform dependent.
It is currently available for Linux (manylinux) and macOS on Python 3.8+.
## Citation
```bibtex
@misc{ren2020codebleu,
    title={CodeBLEU: a Method for Automatic Evaluation of Code Synthesis},
    author={Shuo Ren and Daya Guo and Shuai Lu and Long Zhou and Shujie Liu and Duyu Tang and Neel Sundaresan and Ming Zhou and Ambrosio Blanco and Shuai Ma},
    year={2020},
    eprint={2009.10297},
    archivePrefix={arXiv},
    primaryClass={cs.SE}
}
```
## Further References
This implementation is based on the original [CodeXGLUE/CodeBLEU](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator/CodeBLEU) code: it has been refactored, built for macOS, tested, and had multiple workarounds fixed to make it more usable.
The source code is available in the [k4black/codebleu](https://github.com/k4black/codebleu) GitHub repository.