---
title: codebleu
tags:
- evaluate
- metric
- code
- codebleu
description: "Unofficial `CodeBLEU` implementation with Linux and macOS support, available via PyPI and the Hugging Face Hub."
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---
# Metric Card for codebleu
## Metric Description
Unofficial `CodeBLEU` implementation with Linux and macOS support, available via PyPI and the Hugging Face Hub.

> An ideal evaluation metric should consider the grammatical correctness and the logic correctness.
> We propose weighted n-gram match and syntactic AST match to measure grammatical correctness, and introduce semantic data-flow match to calculate logic correctness.

(from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator/CodeBLEU) repository)

In a nutshell, `CodeBLEU` is a weighted combination of `n-gram match (BLEU)`, `weighted n-gram match (BLEU-weighted)`, `AST match`, and `data-flow match` scores.
The metric has shown higher correlation with human evaluation than the `BLEU` and `accuracy` metrics.
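With the default weights, the four components contribute equally to the final score. A minimal sketch of the combination (this mirrors the definition above, not the library's internal code; the function name is illustrative):

```python
# Conceptual combination of the four CodeBLEU component scores with default
# weights; illustrative only -- not the library's internal implementation.
def combine_codebleu(ngram, weighted_ngram, syntax, dataflow,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    a, b, g, d = weights
    return a * ngram + b * weighted_ngram + g * syntax + d * dataflow
```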
## How to Use
The metric can be used either directly through the `codebleu` PyPI package (`calc_codebleu`) or through the `evaluate` library. In both cases, pass a list of reference snippets and a list of predicted snippets together with the target language; full examples are given in the Examples section below.
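A minimal call, relying on the default weights and tokenizer described under Inputs (the code snippets here are only illustrative):

```python
from codebleu import calc_codebleu

# Single reference and single prediction; weights and tokenizer use their defaults.
result = calc_codebleu(
    ["def f(a):\n    return a"],   # references
    ["def g(x):\n    return x"],   # predictions
    lang="python",
)
print(result["codebleu"])
```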
### Inputs
- `references` (`list[str]` or `list[list[str]]`): reference code
- `predictions` (`list[str]`): predicted code
- `lang` (`str`): code language; see `codebleu.AVAILABLE_LANGS` for available languages (python, c_sharp, c, cpp, javascript, java, php at the moment)
- `weights` (`tuple[float,float,float,float]`): weights of the `ngram_match`, `weighted_ngram_match`, `syntax_match`, and `dataflow_match` scores respectively; defaults to `(0.25, 0.25, 0.25, 0.25)`
- `tokenizer` (`callable`): used to split a code string into tokens; defaults to `s.split()` (see the sketch after this list for passing a custom one)
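For illustration, a sketch of a call that exercises the non-default parameters above; the concrete values (the weights, `str.split` as tokenizer, the code snippets) are arbitrary, and the `calc_codebleu` signature is the one shown in the Examples section below:

```python
from codebleu import calc_codebleu

# Two references for one prediction (the list[list[str]] form).
references = [[
    "def sum(a, b):\n    return a + b",
    "def sum(x, y):\n    return x + y",
]]
predictions = ["def add(a, b):\n    return a + b"]

# Emphasize AST and data-flow agreement over raw n-gram overlap,
# and split tokens on whitespace explicitly.
result = calc_codebleu(
    references,
    predictions,
    lang="python",
    weights=(0.10, 0.10, 0.40, 0.40),
    tokenizer=str.split,
)
print(result["codebleu"])
```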
### Output Values
The metric outputs a `dict[str, float]` with the following fields:
- `codebleu`: the final `CodeBLEU` score
- `ngram_match_score`: `ngram_match` score (BLEU)
- `weighted_ngram_match_score`: `weighted_ngram_match` score (BLEU-weighted)
- `syntax_match_score`: `syntax_match` score (AST match)
- `dataflow_match_score`: `dataflow_match` score

Each of the scores is in the range `[0, 1]`, where `1` is the best score.
### Examples
Using the pip package (`pip install codebleu`):
```python
from codebleu import calc_codebleu

prediction = "def add ( a , b ) :\n return a + b"
reference = "def sum ( first , second ) :\n return second + first"

result = calc_codebleu([reference], [prediction], lang="python", weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None)
print(result)
# {
#   'codebleu': 0.5537,
#   'ngram_match_score': 0.1041,
#   'weighted_ngram_match_score': 0.1109,
#   'syntax_match_score': 1.0,
#   'dataflow_match_score': 1.0
# }
```
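With the default equal weights, the final score is simply the average of the four component scores: `(0.1041 + 0.1109 + 1.0 + 1.0) / 4 ≈ 0.5537`, matching the reported `codebleu` value.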
Or using the `evaluate` library (the `codebleu` package is still required):
```python
import evaluate

metric = evaluate.load("k4black/codebleu")

prediction = "def add ( a , b ) :\n return a + b"
reference = "def sum ( first , second ) :\n return second + first"

result = metric.compute(references=[reference], predictions=[prediction], lang="python", weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None)
```
Note: `lang` is required.
## Limitations and Bias
As this library requires compiling a shared-object (`.so`) file, it is platform dependent.
It is currently available for Linux (manylinux) and macOS on Python 3.8+.
## Citation
```bibtex
@misc{ren2020codebleu,
    title={CodeBLEU: a Method for Automatic Evaluation of Code Synthesis},
    author={Shuo Ren and Daya Guo and Shuai Lu and Long Zhou and Shujie Liu and Duyu Tang and Neel Sundaresan and Ming Zhou and Ambrosio Blanco and Shuai Ma},
    year={2020},
    eprint={2009.10297},
    archivePrefix={arXiv},
    primaryClass={cs.SE}
}
```
## Further References
This implementation is based on the original [CodeXGLUE/CodeBLEU](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator/CodeBLEU) code: it has been refactored, built for macOS, tested, and had multiple workarounds fixed to make it more usable.
The source code is available in the [k4black/codebleu](https://github.com/k4black/codebleu) GitHub repository.