| title: BLEURT | |
| emoji: 🤗 | |
| colorFrom: blue | |
| colorTo: red | |
| sdk: gradio | |
| sdk_version: 3.0.2 | |
| app_file: app.py | |
| pinned: false | |
| tags: | |
| - evaluate | |
| - metric | |
| description: >- | |
| BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model (Devlin et al. 2018) | |
| and then employing another pre-training phase using synthetic data. Finally it is trained on WMT human annotations. You may run BLEURT out-of-the-box or fine-tune | |
| it for your specific application (the latter is expected to perform better). | |
| See the project's README at https://github.com/google-research/bleurt#readme for more information. | |
| # Metric Card for BLEURT | |
| ## Metric Description | |
| BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning: starting from a pretrained BERT model ([Devlin et al. 2018](https://arxiv.org/abs/1810.04805)), then applying another pre-training phase on synthetic data, and finally training on WMT human annotations. | |
| It is possible to run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better). | |
| See the project's [README](https://github.com/google-research/bleurt#readme) for more information. | |
| ## Intended Uses | |
| BLEURT is intended to be used for evaluating text produced by language models. | |
| ## How to Use | |
| This metric takes as input lists of predicted sentences and reference sentences: | |
| ```python | |
| >>> from evaluate import load | |
| >>> predictions = ["hello there", "general kenobi"] | |
| >>> references = ["hello there", "general kenobi"] | |
| >>> bleurt = load("bleurt", module_type="metric") | |
| >>> results = bleurt.compute(predictions=predictions, references=references) | |
| ``` | |
| ### Inputs | |
| - **predictions** (`list` of `str`s): List of generated sentences to score. | |
| - **references** (`list` of `str`s): List of references to compare to. | |
| - **checkpoint** (`str`): BLEURT checkpoint. Will default to `BLEURT-tiny` if not specified. Other models that can be chosen are: `"bleurt-tiny-128"`, `"bleurt-tiny-512"`, `"bleurt-base-128"`, `"bleurt-base-512"`, `"bleurt-large-128"`, `"bleurt-large-512"`, `"BLEURT-20-D3"`, `"BLEURT-20-D6"`, `"BLEURT-20-D12"` and `"BLEURT-20"`. | |
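The inputs listed above can be sanity-checked before the model is downloaded and run. The following is a minimal sketch, not part of the `evaluate` API: `check_bleurt_inputs` and `KNOWN_CHECKPOINTS` are hypothetical names introduced here for illustration, and the sketch assumes that each prediction is paired with exactly one reference at the same position.

```python
from typing import List, Optional

# Checkpoint names copied from the list in the Inputs section above.
KNOWN_CHECKPOINTS = {
    "bleurt-tiny-128", "bleurt-tiny-512",
    "bleurt-base-128", "bleurt-base-512",
    "bleurt-large-128", "bleurt-large-512",
    "BLEURT-20-D3", "BLEURT-20-D6", "BLEURT-20-D12", "BLEURT-20",
}


def check_bleurt_inputs(
    predictions: List[str],
    references: List[str],
    checkpoint: Optional[str] = None,
) -> None:
    """Hypothetical helper: raise ValueError if the inputs look malformed."""
    # Assumption: predictions and references are paired one-to-one.
    if len(predictions) != len(references):
        raise ValueError(
            f"Got {len(predictions)} predictions but {len(references)} references."
        )
    # Both arguments are documented as lists of strings.
    for name, items in (("predictions", predictions), ("references", references)):
        if not all(isinstance(item, str) for item in items):
            raise ValueError(f"Every element of `{name}` must be a str.")
    # Only check the checkpoint name when one is given explicitly.
    if checkpoint is not None and checkpoint not in KNOWN_CHECKPOINTS:
        raise ValueError(f"Unknown BLEURT checkpoint: {checkpoint!r}")


check_bleurt_inputs(
    ["hello there", "general kenobi"],
    ["hello there", "general kenobi"],
    checkpoint="bleurt-base-128",
)
```

Running a check like this first avoids waiting for a checkpoint download only to fail on a mismatched list length or a mistyped checkpoint name.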
| ### Output Values | |
| - **scores** : a `list` of scores, one per prediction. | |
| Output Example: | |
| ```python | |
| {'scores': [1.0295498371124268, 1.0445425510406494]} | |
| ``` | |
| BLEURT scores typically fall between 0 and approximately 1, although they are not strictly bounded: the example above yields values slightly greater than 1. The score indicates how similar the generated text is to the reference text, with higher values indicating greater similarity. | |
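Because the metric returns one score per prediction, a corpus-level figure has to be derived by the caller. The sketch below shows one simple way to do that, averaging the per-sentence scores and reporting the extremes. `summarize_scores` is a hypothetical helper, not part of `evaluate`, and the mean is only one possible aggregate; the input dictionary reuses the example output shown above.

```python
from statistics import mean
from typing import Dict, List


def summarize_scores(results: Dict[str, List[float]]) -> Dict[str, float]:
    """Hypothetical helper: reduce per-prediction BLEURT scores to summary statistics."""
    scores = results["scores"]
    if not scores:
        raise ValueError("`scores` is empty; nothing to summarize.")
    return {
        "mean": mean(scores),  # simple corpus-level aggregate
        "min": min(scores),    # lowest-scoring prediction
        "max": max(scores),    # highest-scoring prediction
    }


# Example output copied from the section above.
results = {"scores": [1.0295498371124268, 1.0445425510406494]}
summary = summarize_scores(results)
print(summary)
```

Keeping the minimum alongside the mean makes it easier to spot individual predictions that score far below the rest.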
| #### Values from Popular Papers | |
| The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) reported that the metric correlates better with human judgment than similar metrics such as BLEU and BERTscore. | |
| BLEURT is used to compare models across different tasks (e.g. [table-to-text generation](https://paperswithcode.com/sota/table-to-text-generation-on-dart?metric=BLEURT)). | |
| ### Examples | |
| Example with the default model: | |
| ```python | |
| >>> from evaluate import load | |
| >>> predictions = ["hello there", "general kenobi"] | |
| >>> references = ["hello there", "general kenobi"] | |
| >>> bleurt = load("bleurt", module_type="metric") | |
| >>> results = bleurt.compute(predictions=predictions, references=references) | |
| >>> print(results) | |
| {'scores': [1.0295498371124268, 1.0445425510406494]} | |
| ``` | |
| Example with the `"bleurt-base-128"` model checkpoint: | |
| ```python | |
| >>> from evaluate import load | |
| >>> predictions = ["hello there", "general kenobi"] | |
| >>> references = ["hello there", "general kenobi"] | |
| >>> bleurt = load("bleurt", module_type="metric", checkpoint="bleurt-base-128") | |
| >>> results = bleurt.compute(predictions=predictions, references=references) | |
| >>> print(results) | |
| {'scores': [1.0295498371124268, 1.0445425510406494]} | |
| ``` | |
| ## Limitations and Bias | |
| The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) showed that BLEURT correlates well with human judgment, but this depends on the model and language pair selected. | |
| Furthermore, BLEURT currently supports only English-language scoring, because it relies on models trained on English corpora. It may also reflect, to some extent, biases and correlations present in the model's training data. | |
| Finally, calculating the BLEURT metric involves downloading the BLEURT model that is used to compute the score, which can take a significant amount of time depending on the model chosen. Starting with the default model, `bleurt-tiny`, and testing out larger models if necessary can be a useful approach if memory or internet speed is an issue. | |
| ## Citation | |
| ```bibtex | |
| @inproceedings{bleurt, | |
| title={BLEURT: Learning Robust Metrics for Text Generation}, | |
| author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh}, | |
| booktitle={ACL}, | |
| year={2020}, | |
| url={https://arxiv.org/abs/2004.04696} | |
| } | |
| ``` | |
| ## Further References | |
| - The original [BLEURT GitHub repo](https://github.com/google-research/bleurt/) | |