# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">InstruSumEval Leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
- This leaderboard evaluates the *evaluation* capabilities of language models on the [salesforce/instrusum](https://huggingface.co/datasets/Salesforce/InstruSum) benchmark from our paper ["Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization"](https://arxiv.org/abs/2311.09184).
- InstruSum is a benchmark for instruction-controllable summarization, where the goal is to generate summaries that satisfy user-provided instructions.
- The benchmark contains human evaluations for the generated summaries, on which the models are evaluated as judges for *long-context* instruction-following.
### Metrics
- **Accuracy**: The percentage of times the model agrees with the human evaluator.
- **Agreement**: The Cohen's Kappa score between the model and human evaluator.
- **Self-Accuracy**: The percentage of times the model agrees with itself when the inputs are swapped.
- **Self-Agreement**: The Cohen's Kappa score between the model and itself when the inputs are swapped.
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## How it works

### Task
The LLMs are evaluated as judges in a pairwise comparison task.
Each judge is presented with two summaries generated for the same **instruction-controllable** summarization request and asked to select the better one.
The model's accuracy and agreement with the human evaluator are then calculated.
### Dataset
The human annotations are from the [InstruSum](https://huggingface.co/datasets/Salesforce/InstruSum) dataset.
Its pairwise annotation [subset](https://huggingface.co/datasets/Salesforce/InstruSum/viewer/human_eval_pairwise) is used for evaluation.
This subset contains pairwise comparisons derived from the human evaluation results in the [`human_eval`](https://huggingface.co/datasets/Salesforce/InstruSum/viewer/human_eval) subset.
The conversion process is as follows:
- The ranking-based human evaluation results are converted into pairwise comparisons for the *overall quality* aspect.
- Only comparisons where the annotators reached a consensus are included.
- Comparisons that resulted in a tie are excluded.
### Evaluation Details
- Instruction-controllable summarization is treated as a *long-context* instruction-following task.
Therefore, the source article and the instruction are combined to form a single instruction for the model to follow.
- The LLMs are evaluated on the pairwise comparison task.
The [prompt](https://github.com/princeton-nlp/LLMBar/blob/main/LLMEvaluator/evaluators/prompts/comparison/Vanilla.txt) from [LLMBar](https://github.com/princeton-nlp/LLMBar) is adopted for the evaluation.
- The pairwise comparison is conducted bidirectionally: the order of the two summaries is swapped, and the model's consistency across the two orders is used to compute Self-Accuracy and Self-Agreement.
"""
CITATION_BUTTON_LABEL = "Please cite our paper if you use InstruSum in your work."
CITATION_BUTTON_TEXT = r"""@inproceedings{liu2024benchmarking,
  title={Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization},
  author={Liu, Yixin and Fabbri, Alexander R and Chen, Jiawen and Zhao, Yilun and Han, Simeng and Joty, Shafiq and Liu, Pengfei and Radev, Dragomir and Wu, Chien-Sheng and Cohan, Arman},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2024},
  year={2024},
}"""