from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("tot", "acc", "Total")
    # --- Cultural knowledge categories ----------------
    task1 = Task("art", "acc", "Art")
    task2 = Task("civic", "acc", "Civic Education")
    task3 = Task("eve", "acc", "Current Events")
    task4 = Task("geo", "acc", "Geography")
    task5 = Task("his", "acc", "History")
    task6 = Task("lit", "acc", "Literature")
    task7 = Task("tou", "acc", "Tourism")
    # --- Linguistic proficiency categories ------------
    task8 = Task("lex", "acc", "Lexicon")
    task9 = Task("morp", "acc", "Morphology")
    task10 = Task("orth", "acc", "Orthography")
    task11 = Task("syno", "acc", "Synonyms and Antonyms")
    task12 = Task("synt", "acc", "Syntax")
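
# Illustrative sketch (not part of the template): how Tasks is typically consumed
# when turning a harness results file into a leaderboard row. The JSON shape
# below is an assumption; adapt it to the actual harness output, e.g.
# {"results": {"tot": {"acc": 0.85}, "art": {"acc": 0.91}, ...}}.
#
#   import json
#
#   with open("results.json") as fh:
#       results = json.load(fh)["results"]
#   row = {task.value.col_name: results[task.value.benchmark][task.value.metric]
#          for task in Tasks}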

NUM_FEWSHOT = 0  # Change to match your few-shot evaluation setting
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">ITALIC leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
_ITALIC_ is a benchmark that evaluates language models’ understanding of Italian culture, commonsense reasoning, and linguistic proficiency in a morphologically rich language.
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## Dataset Details
### Dataset Description
We present _ITALIC_, a large-scale benchmark dataset of 10,000 multiple-choice questions designed to evaluate the natural language understanding of the Italian language and culture.
_ITALIC_ spans 12 domains, drawing on public examinations used to assess domain experts in real-world scenarios.
We detail our data collection process, stratification techniques, and selection strategies.
_ITALIC_ provides a comprehensive assessment suite that captures commonsense reasoning and linguistic proficiency in a morphologically rich language.
It serves as a benchmark for evaluating existing models and as a roadmap for future research, encouraging the development of more sophisticated and culturally aware natural language systems.
- **Curated by:** CRISP research centre ([https://crispresearch.it/](https://crispresearch.it/))
- **Language(s) (NLP):** Italian
- **License:** MIT
### Dataset Sources
- **Huggingface:** [https://huggingface.co/datasets/Crisp-Unimib/ITALIC](https://huggingface.co/datasets/Crisp-Unimib/ITALIC)
- **Zenodo:** [https://doi.org/10.5281/zenodo.14725822](https://doi.org/10.5281/zenodo.14725822)
- **Paper:** [Full Paper Available at ACL Anthology](https://aclanthology.org/2025.naacl-long.68.pdf)
## Dataset Structure
_ITALIC_ contains 10,000 carefully curated questions selected from an initial corpus of 2,110,643 questions.
Each question is formatted as a multiple-choice query, with an average question length of 87 characters and a median of 4 answer options.
The longest question is 577 characters long. The minimum number of choices per question is 2, while the maximum is 5.
The total number of tokens across the input data amounts to 499,963.
| Column | Data Type | Description |
| ---------------- | --------- | ----------------------------------------------- |
| `question` | [String] | The actual content of the question |
| `options` | [List] | The options to choose from. Only one is correct |
| `answer` | [String] | The correct answer out of the options |
| `category` | [String] | The dedicated cultural section of the question |
| `macro_category` | [String] | The macro category of the question |
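
The dataset can be loaded directly from the Hub. Below is a minimal sketch using the standard `datasets` library; the split name `test` is an assumption, so check the dataset card for the actual split layout.

```python
from datasets import load_dataset

# Load ITALIC from the Hugging Face Hub and inspect one question.
italic = load_dataset("Crisp-Unimib/ITALIC")
sample = italic["test"][0]  # assumption: evaluation split is named "test"
print(sample["question"])
print(sample["options"])  # 2 to 5 options; exactly one is correct
print(sample["answer"])
```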
## Dataset Creation
### Curation Rationale
The corpus comprises questions and tasks from real-world exams, professional assessments, and domain-specific challenges.
Given that the data originates from institutional sources, it is expected to maintain a high standard of quality and accuracy, as domain experts crafted it for public evaluations.
### Source Data
#### Data Collection and Processing
The initial data was sourced from various files in PDF, HTML, DOC, and other formats published by official bodies that announce individual competitive public examinations.
Please consult the full paper for a detailed description of our curation process.
#### Who are the source data producers?
The dataset includes admission tests for the Carabinieri, the Penitentiary Police, the Italian Army, the State Police, the Forestry Corps, the Firefighters, the Air Force, the Navy, and the Guardia di Finanza, as well as selection tests for Italian ministries, for teachers and principals of the Italian school system at all levels, for nurses of the national health system, and for managers of the public administration, published between 2008 and 2024 and freely available on the website of each institutional body.
#### Personal and Sensitive Information
The dataset does not contain confidential information.
It is also free from content that could be considered offensive, insulting, threatening, or distressing. Since it solely comprises data from standardised tests and does not involve human subjects or personal data, an ethical review process was not required.
## Bias, Risks, and Limitations
Potential risks of misuse include using the benchmark results to justify or argue against the need to develop native LLMs specifically tailored for the Italian language.
This possibility should be considered to avoid misinterpretations or unintended consequences when leveraging the evaluation outcomes.
### Maintenance
_ITALIC_ is designed to be robust and fully operational upon release, with no need for routine maintenance. However, as language and cultural norms evolve, periodic updates will be required to ensure the benchmark remains relevant. A new dataset version will be created and made available in such cases.
"""
EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model
### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer
revision = "main"  # branch, tag, or commit hash of the checkpoint you are submitting
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it; stay posted!
### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
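
If your checkpoint is currently stored as PyTorch `.bin` files, a minimal conversion sketch (the model name is a placeholder for your own repo id) is:

```python
from transformers import AutoModel

# Load the existing checkpoint and re-save it with safetensors serialization.
model = AutoModel.from_pretrained("your model name")
model.save_pretrained("your-model-safetensors", safe_serialization=True)
```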
### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗
### 4) Fill up your model card
When we add extra information about models to the leaderboard, it is automatically taken from the model card.
## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done, check that you can launch the EleutherAI LM Evaluation Harness (`lm-eval`) on your model locally (you can pass `--limit` to restrict the number of examples per task); see the sketch below.
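
As a sketch (the model id and task name below are placeholders; use the tasks configured for this leaderboard), a local smoke test with the harness could look like:

```python
import lm_eval  # pip install lm-eval

# Evaluate a few examples per task as a quick sanity check before submitting.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-model,revision=main",
    tasks=["italic"],  # placeholder: replace with the leaderboard's task names
    limit=10,          # same effect as the --limit CLI flag
)
```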
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@inproceedings{seveso2025italic,
title={ITALIC: An Italian Culture-Aware Natural Language Benchmark},
author={Seveso, Andrea and Potert{\`\i}, Daniele and Federici, Edoardo and Mezzanzanica, Mario and Mercorio, Fabio},
booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
pages={1469--1478},
year={2025}
}
"""