Andrea Seveso committed on
Commit 24a1861 · 1 Parent(s): 62b25e2

About ITALIC

Files changed (1)
  1. src/about.py +82 -8
src/about.py CHANGED
@@ -1,6 +1,7 @@
1
  from dataclasses import dataclass
2
  from enum import Enum
3
 
 
4
  @dataclass
5
  class Task:
6
  benchmark: str
@@ -11,29 +12,95 @@ class Task:
11
  # Select your tasks here
12
  # ---------------------------------------------------
13
  class Tasks(Enum):
14
- # task_key in the json file, metric_key in the json file, name to display in the leaderboard
15
  task0 = Task("anli_r1", "acc", "ANLI")
16
  task1 = Task("logiqa", "acc_norm", "LogiQA")
17
 
18
- NUM_FEWSHOT = 0 # Change with your few shot
19
- # ---------------------------------------------------
20
 
 
 
21
 
22
 
23
  # Your leaderboard name
24
- TITLE = """<h1 align="center" id="space-title">Demo leaderboard</h1>"""
25
 
26
  # What does your leaderboard evaluate?
27
  INTRODUCTION_TEXT = """
28
- Intro text
29
  """
30
 
31
  # Which evaluations are you running? how can people reproduce what you have?
32
  LLM_BENCHMARKS_TEXT = f"""
33
- ## How it works
34
 
35
- ## Reproducibility
36
- To reproduce our results, here is the commands you can run:
37
 
38
  """
39
 
@@ -69,4 +136,11 @@ If everything is done, check you can launch the EleutherAIHarness on your model
69
 
70
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
71
  CITATION_BUTTON_TEXT = r"""
72
  """
 
1
  from dataclasses import dataclass
2
  from enum import Enum
3
 
4
+
5
  @dataclass
6
  class Task:
7
  benchmark: str
 
12
  # Select your tasks here
13
  # ---------------------------------------------------
14
  class Tasks(Enum):
15
+ # task_key in the json file, metric_key in the json file, name to display in the leaderboard
16
  task0 = Task("anli_r1", "acc", "ANLI")
17
  task1 = Task("logiqa", "acc_norm", "LogiQA")
18
 
 
 
19
 
20
+ NUM_FEWSHOT = 0 # Change with your few shot
21
+ # ---------------------------------------------------
22
 
23
 
24
  # Your leaderboard name
25
+ TITLE = """<h1 align="center" id="space-title">_ITALIC_ leaderboard</h1>"""
26
 
27
  # What does your leaderboard evaluate?
28
  INTRODUCTION_TEXT = """
29
+ _ITALIC_ is a benchmark evaluating language models’ understanding of Italian culture, commonsense reasoning and linguistic proficiency in a morphologically rich language.
30
  """
31
 
32
  # Which evaluations are you running? how can people reproduce what you have?
33
  LLM_BENCHMARKS_TEXT = f"""
34
+ ## Dataset Details
35
+
36
+ ### Dataset Description
37
+
38
+ We present _ITALIC_, a large-scale benchmark dataset of 10,000 multiple-choice questions designed to evaluate language models' understanding of the Italian language and culture.
39
+ _ITALIC_ spans 12 domains and draws on public tests used to assess domain experts in real-world scenarios.
40
+ We detail our data collection process, stratification techniques, and selection strategies.
41
+
42
+ _ITALIC_ provides a comprehensive assessment suite that captures commonsense reasoning and linguistic proficiency in a morphologically rich language.
43
+ It serves as a benchmark for evaluating existing models and as a roadmap for future research, encouraging the development of more sophisticated and culturally aware natural language systems.
44
+
45
+ - **Curated by:** CRISP research centre [https://crispresearch.it/](https://crispresearch.it/)
46
+ - **Language(s) (NLP):** Italian
47
+ - **License:** MIT
48
+
49
+ ### Dataset Sources
50
+
51
+ - **Hugging Face:** [https://huggingface.co/datasets/Crisp-Unimib/ITALIC](https://huggingface.co/datasets/Crisp-Unimib/ITALIC)
52
+ - **Zenodo:** [https://doi.org/10.5281/zenodo.14725822](https://doi.org/10.5281/zenodo.14725822)
53
+ - **Paper:** [Full Paper Available at ACL Anthology](https://aclanthology.org/2025.naacl-long.68.pdf)
54
+
55
+ ## Dataset Structure
56
+
57
+ _ITALIC_ contains 10,000 carefully curated questions selected from an initial corpus of 2,110,643 questions.
58
+
59
+ Each question is formatted as a multiple-choice query, with an average question length of 87 characters and a median of 4 answer options.
60
+ The longest question is 577 characters long. The minimum number of choices per question is 2, while the maximum is 5.
61
+ The total number of tokens across the input data amounts to 499,963.
62
+
63
+ | Column | Data Type | Description |
64
+ | ---------------- | --------- | ----------------------------------------------- |
65
+ | `question` | [String] | The actual content of the question |
66
+ | `options` | [List] | The options to choose from. Only one is correct |
67
+ | `answer` | [String] | The correct answer out of the options |
68
+ | `category` | [String] | The dedicated cultural section of the question |
69
+ | `macro_category` | [String] | The macro category of the question |
70
+
71
+ ## Dataset Creation
72
+
73
+ ### Curation Rationale
74
+
75
+ The corpus comprises questions and tasks from real-world exams, professional assessments, and domain-specific challenges.
76
+ Given that the data originates from institutional sources, it is expected to maintain a high standard of quality and accuracy, as domain experts crafted it for public evaluations.
77
+
78
+ ### Source Data
79
+
80
+ #### Data Collection and Processing
81
+
82
+ The initial data was sourced from various files in PDF, HTML, DOC, and other formats published by official bodies that announce individual competitive public examinations.
83
+
84
+ Please consult the full paper for a detailed description of our curation process.
85
+
86
+ #### Who are the source data producers?
87
+
88
+ The dataset includes tests from 2008 to 2024 for admission to the Carabinieri, Penitentiary Police, Italian Army, State Police, Forestry Corps, Firefighters, Air Force, Navy, Guardia di Finanza, and Italian ministries, as well as tests for teachers and principals at all levels of the Italian school system, nurses of the national health system, and managers of the public administration. All tests are freely available on the website of each institutional body.
89
+
90
+ #### Personal and Sensitive Information
91
+
92
+ The dataset does not contain confidential information.
93
+ It is also free from content that could be considered offensive, insulting, threatening, or distressing. Since it solely comprises data from standardised tests and does not involve human subjects or personal data, an ethical review process was not required.
94
+
95
+ ## Bias, Risks, and Limitations
96
+
97
+ Potential risks of misuse include using the benchmark results to justify or argue against the need to develop native LLMs specifically tailored for the Italian language.
98
+ This possibility should be considered to avoid misinterpretations or unintended consequences when leveraging the evaluation outcomes.
99
+
100
+ ### Maintenance
101
+
102
+ _ITALIC_ is designed to be robust and fully operational upon release, with no need for routine maintenance. However, as language and cultural norms evolve, periodic updates will be required to ensure the benchmark remains relevant. A new dataset version will be created and made available in such cases.
103
 
 
 
104
 
105
  """
106
 
 
136
 
137
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
138
  CITATION_BUTTON_TEXT = r"""
139
+ @inproceedings{seveso2025italic,
140
+ title={ITALIC: An Italian Culture-Aware Natural Language Benchmark},
141
+ author={Seveso, Andrea and Potert{\`\i}, Daniele and Federici, Edoardo and Mezzanzanica, Mario and Mercorio, Fabio},
142
+ booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
143
+ pages={1469--1478},
144
+ year={2025}
145
+ }
146
  """