Spaces: Crisp-Unimib / ITALIC-Leaderboard

Andrea Seveso committed on Jun 18

Commit 24a1861

1 Parent(s): 62b25e2

About ITALIC

Files changed (1)

src/about.py +82 -8

@@ -1,6 +1,7 @@
 from dataclasses import dataclass
 from enum import Enum
 @dataclass
 class Task:
     benchmark: str
@@ -11,29 +12,95 @@ class Task:
 # Select your tasks here
 # ---------------------------------------------------
 class Tasks(Enum):
-    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
     task0 = Task("anli_r1", "acc", "ANLI")
     task1 = Task("logiqa", "acc_norm", "LogiQA")
-NUM_FEWSHOT = 0 # Change with your few shot
-# ---------------------------------------------------
 # Your leaderboard name
-TITLE = """<h1 align="center" id="space-title">Demo leaderboard</h1>"""
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-Intro text
 """
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
-## How it works
-## Reproducibility
-To reproduce our results, here is the commands you can run:
 """
@@ -69,4 +136,11 @@ If everything is done, check you can launch the EleutherAIHarness on your model
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
 """

 from dataclasses import dataclass
 from enum import Enum
 @dataclass
 class Task:
     benchmark: str
 # Select your tasks here
 # ---------------------------------------------------
 class Tasks(Enum):
+    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
     task0 = Task("anli_r1", "acc", "ANLI")
     task1 = Task("logiqa", "acc_norm", "LogiQA")
+NUM_FEWSHOT = 0  # Change with your few shot
+# ---------------------------------------------------
 # Your leaderboard name
+TITLE = """<h1 align="center" id="space-title">_ITALIC_ leaderboard</h1>"""
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
+_ITALIC_ is a benchmark evaluating language models’ understanding of Italian culture, commonsense reasoning and linguistic proficiency in a morphologically rich language.
 """
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
+## Dataset Details
+### Dataset Description
+We present _ITALIC_, a large-scale benchmark dataset of 10,000 multiple-choice questions designed to evaluate the natural language understanding of the Italian language and culture.
+_ITALIC_ spans 12 domains, drawing on public tests used to assess domain experts in real-world scenarios.
+We detail our data collection process, stratification techniques, and selection strategies.
+_ITALIC_ provides a comprehensive assessment suite that captures commonsense reasoning and linguistic proficiency in a morphologically rich language.
+It serves as a benchmark for evaluating existing models and as a roadmap for future research, encouraging the development of more sophisticated and culturally aware natural language systems.
+- **Curated by:** CRISP research centre [https://crispresearch.it/](https://crispresearch.it/)
+- **Language(s) (NLP):** Italian
+- **License:** MIT
+### Dataset Sources
+- **Hugging Face:** [https://huggingface.co/datasets/Crisp-Unimib/ITALIC](https://huggingface.co/datasets/Crisp-Unimib/ITALIC)
+- **Zenodo:** [https://doi.org/10.5281/zenodo.14725822](https://doi.org/10.5281/zenodo.14725822)
+- **Paper:** [Full Paper Available at ACL Anthology](https://aclanthology.org/2025.naacl-long.68.pdf)
+## Dataset Structure
+_ITALIC_ contains 10,000 carefully curated questions selected from an initial corpus of 2,110,643 questions.
+Each question is formatted as a multiple-choice query, with an average question length of 87 characters and a median of 4 answer options.
+The longest question is 577 characters long. The minimum number of choices per question is 2, while the maximum is 5.
+The total number of tokens across the input data amounts to 499,963.
+| Column           | Data Type | Description                                     |
+| ---------------- | --------- | ----------------------------------------------- |
+| `question`       | [String]  | The actual content of the question              |
+| `options`        | [List]    | The options to choose from. Only one is correct |
+| `answer`         | [String]  | The correct answer out of the options           |
+| `category`       | [String]  | The dedicated cultural section of the question  |
+| `macro_category` | [String]  | The macro category of the question              |
+## Dataset Creation
+### Curation Rationale
+The corpus comprises questions and tasks from real-world exams, professional assessments, and domain-specific challenges.
+Given that the data originates from institutional sources, it is expected to maintain a high standard of quality and accuracy, as domain experts crafted it for public evaluations.
+### Source Data
+#### Data Collection and Processing
+The initial data was sourced from various files in PDF, HTML, DOC, and other formats published by official bodies that announce individual competitive public examinations.
+Please consult the full paper for a detailed description of our curation process.
+#### Who are the source data producers?
+The dataset includes admission tests, administered from 2008 to 2024, for the Carabinieri, Penitentiary Police, Italian Army, State Police, Forestry Corps, Firefighters, Air Force, Navy, Guardia di Finanza, and Italian ministries, as well as tests for teachers and principals at all levels of the Italian school system, nurses of the national health system, and managers of the public administration. All tests are freely available on the website of each institutional body.
+#### Personal and Sensitive Information
+The dataset does not contain confidential information.
+It is also free from content that could be considered offensive, insulting, threatening, or distressing. Since it solely comprises data from standardised tests and does not involve human subjects or personal data, an ethical review process was not required.
+## Bias, Risks, and Limitations
+Potential risks of misuse include using the benchmark results to justify or argue against the need to develop native LLMs specifically tailored for the Italian language.
+This possibility should be considered to avoid misinterpretations or unintended consequences when leveraging the evaluation outcomes.
+### Maintenance
+_ITALIC_ is designed to be robust and fully operational upon release and needs no routine maintenance. However, as language and cultural norms evolve, periodic updates will be required to keep the benchmark relevant; in such cases, a new dataset version will be created and made available.
 """
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
+@inproceedings{seveso2025italic,
+  title={ITALIC: An Italian Culture-Aware Natural Language Benchmark},
+  author={Seveso, Andrea and Potert{\`\i}, Daniele and Federici, Edoardo and Mezzanzanica, Mario and Mercorio, Fabio},
+  booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
+  pages={1469--1478},
+  year={2025}
+}
 """
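
The Dataset Structure table added in this commit describes five columns per question. As a rough illustration only (this is not code from the commit or from the leaderboard), the sketch below models one row as a dataclass and computes per-`macro_category` accuracy. It assumes that `answer` holds a string directly comparable to a model's prediction; the card does not say whether that string is the option text or a label. The names `ItalicRecord`, `check_options`, and `accuracy_by_macro_category`, the category values, and the toy questions are all hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ItalicRecord:
    """One row, mirroring the five columns in the Dataset Structure table."""
    question: str
    options: list[str]
    answer: str
    category: str
    macro_category: str


def check_options(record: ItalicRecord) -> None:
    """Enforce the stated range of 2 to 5 answer options per question."""
    if not 2 <= len(record.options) <= 5:
        raise ValueError(f"expected 2 to 5 options, got {len(record.options)}")


def accuracy_by_macro_category(records, predictions):
    """Return the fraction of exact matches with `answer`, per macro category."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record, predicted in zip(records, predictions):
        check_options(record)
        totals[record.macro_category] += 1
        # Assumption: a prediction counts as correct only on exact string match.
        hits[record.macro_category] += int(predicted == record.answer)
    return {name: hits[name] / totals[name] for name in totals}


# Toy records invented for illustration; they are NOT taken from ITALIC,
# and the category and macro-category values are placeholders.
toy_records = [
    ItalicRecord("Qual è la capitale d'Italia?",
                 ["Milano", "Roma", "Napoli"], "Roma",
                 "geography", "culture"),
    ItalicRecord("Chi ha scritto la Divina Commedia?",
                 ["Dante Alighieri", "Petrarca"], "Dante Alighieri",
                 "literature", "culture"),
    ItalicRecord("Qual è il plurale di 'uovo'?",
                 ["uovi", "uova", "uove", "uovo"], "uova",
                 "morphology", "language"),
]
toy_predictions = ["Roma", "Petrarca", "uova"]

scores = accuracy_by_macro_category(toy_records, toy_predictions)
print(scores)  # {'culture': 0.5, 'language': 1.0}
```

Grouping by `macro_category` rather than `category` keeps the example small; the same function works for either column if the attribute name is swapped.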