from __future__ import annotations

from pathlib import Path

import gradio as gr
import pandas as pd
from apscheduler.schedulers.background import BackgroundScheduler
from constants import Constants, model_type_emoji
from gradio_leaderboard import ColumnFilter, Leaderboard, SelectColumns

TITLE = """<h1 align="center" id="space-title">TabArena Leaderboard for Predictive Machine Learning on IID Tabular Data</h1>"""

INTRODUCTION_TEXT = """
TabArena is a living benchmark system for predictive machine learning on tabular data.
The goal of TabArena and its leaderboard is to assess the peak performance of
model-specific pipelines.

**Datasets:** Currently, the leaderboard is based on a manually curated collection of
51 tabular classification and regression datasets for independent and identically
distributed (IID) data, spanning the small to medium data regime. The datasets were
carefully curated to represent various real-world predictive machine learning use cases.

**Models:** The focus of the leaderboard is on model-specific pipelines. Each pipeline
is evaluated with its default hyperparameter configuration, with tuned hyperparameters,
or as an ensemble of tuned configurations. Each model is implemented in a tested
real-world pipeline that was optimized by the maintainers of TabArena, where possible
together with the authors of the model, to get the most out of the model.

**Metrics:** The leaderboard is ranked by Elo. We present several additional metrics;
see the `About` tab for more information on them.

**Reference Pipeline:** The leaderboard includes a reference pipeline, which is applied
independently of the tuning protocol and constraints we constructed for models within
TabArena. The reference pipeline aims to represent the performance quickly achievable
by a practitioner on a dataset. The current reference pipeline is the predictive
machine learning system AutoGluon (version 1.3, with the `best_quality` preset and
4 hours of training time). AutoGluon is an ensemble pipeline across various model
types and thus provides a reference point for model-specific pipelines.

The current leaderboard is based on TabArena-v0.1.
"""

ABOUT_TEXT = """
TabArena is a living benchmark system for predictive machine learning on tabular data.
We introduce TabArena and provide an overview of TabArena-v0.1 in our paper: TBA.

## Using TabArena for Benchmarking

To compare your own methods to the pre-computed results for all models on the leaderboard,
you can use the TabArena framework. For examples of how to use TabArena for benchmarking,
please see https://github.com/TabArena/tabarena_benchmarking_examples

## Contributing to the Leaderboard; Contributing Models

For guidelines on how to contribute your model to TabArena, or the results of your model
to the official leaderboard, please see the appendix of our paper: TBA.

## Contributing Data

For anything related to the datasets used in TabArena, please see https://github.com/TabArena/tabarena_dataset_curation

---

## Leaderboard Documentation

The leaderboard is ranked by Elo and includes several other metrics. Here is a short
description of these metrics:

#### Elo

We evaluate models using the Elo rating system, following Chatbot Arena. Elo is a
pairwise comparison-based rating system in which each model's rating predicts its
expected win probability against the others, with a 400-point Elo gap corresponding to
a 10:1 (roughly 91%) expected win rate. We calibrate 1000 Elo to the performance of our
default random forest configuration across all figures, and perform 100 rounds of
bootstrapping to obtain 95% confidence intervals. Elo scores are computed using ROC AUC
for binary classification, log-loss for multiclass classification, and RMSE for
regression.
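
Concretely, the expected probability that a model with rating R_A beats a model with
rating R_B is 1 / (1 + 10^((R_B - R_A) / 400)); for a 400-point gap this gives
10/11, i.e., roughly 91%.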

#### Normalized Score

Following TabRepo, we linearly rescale the error such that the best method has a
normalized score of 1 and the median method has a normalized score of 0. Scores below
zero are clipped to zero. These scores are then averaged across datasets.
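
Written out, on a single dataset this rescaling is
normalized_score = max(0, (err_median - err) / (err_median - err_best)),
where err is the method's error, err_best the best error, and err_median the median
error on that dataset.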

#### Average Rank

Ranks of methods are computed on each dataset (lower is better) and averaged.

#### Harmonic Mean Rank

Taking the harmonic mean of ranks, 1 / ((1/N) * sum(1/rank_i for i in range(N))),
more strongly favors methods that achieve very low (i.e., good) ranks on some datasets.
It therefore favors methods that are sometimes very good and sometimes very bad over
methods that are always mediocre, as the former are more likely to be useful in
conjunction with other methods.

#### Improvability

We introduce improvability as a metric that measures how many percent lower the error
of the best method is compared to the current method on a dataset. This is then
averaged over datasets. Formally, for a single dataset, improvability is
(err_i - besterr_i) / err_i * 100%. Improvability is always between 0% and 100%.

---

## Contact

For most inquiries, please open issues in the relevant GitHub repository or here on
Hugging Face.
For any other inquiries related to TabArena, please reach out to: [email protected]

### Core Maintainers

The current core maintainers of TabArena are:
[Nick Erickson](https://github.com/Innixma),
[Lennart Purucker](https://github.com/LennartPurucker/),
[Andrej Tschalzev](https://github.com/atschalz),
[David Holzmüller](https://github.com/dholzmueller)
"""

CITATION_BUTTON_LABEL = (
    "If you use TabArena or the leaderboard in your research, please cite the following:"
)
CITATION_BUTTON_TEXT = r"""
@article{
TBA,
}
"""


def get_model_family(model_name: str) -> str:
    """Map a raw method name to its model family via case-insensitive substring matching."""
    prefixes_mapping = {
        Constants.reference: ["AutoGluon"],
        Constants.neural_network: ["REALMLP", "TabM", "FASTAI", "MNCA", "NN_TORCH"],
        Constants.tree: ["GBM", "CAT", "EBM", "XGB", "XT", "RF"],
        Constants.foundational: ["TABDPT", "TABICL", "TABPFN"],
        Constants.baseline: ["KNN", "LR"],
    }
    for method_type, prefixes in prefixes_mapping.items():
        for prefix in prefixes:
            if prefix.lower() in model_name.lower():
                return method_type
    return Constants.other
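# Example: get_model_family("GBM (tuned)") returns Constants.tree because "gbm" occurs in
# the lowercased method name; names without a known identifier fall back to Constants.other.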


def rename_map(model_name: str) -> str:
    """Rewrite internal model identifiers (e.g., "GBM") to display names (e.g., "LightGBM")."""
    mapping = {
        "TABM": "TabM",
        "REALMLP": "RealMLP",
        "GBM": "LightGBM",
        "CAT": "CatBoost",
        "XGB": "XGBoost",
        "XT": "ExtraTrees",
        "RF": "RandomForest",
        "MNCA": "ModernNCA",
        "NN_TORCH": "TorchMLP",
        "FASTAI": "FastaiMLP",
        "TABPFNV2": "TabPFNv2",
        "EBM": "EBM",
        "TABDPT": "TabDPT",
        "TABICL": "TabICL",
        "KNN": "KNN",
        "LR": "Linear",
    }
    for prefix, display_name in mapping.items():
        if prefix in model_name:
            return model_name.replace(prefix, display_name)
    return model_name
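# Example: rename_map("GBM (tuned + ensemble)") yields "LightGBM (tuned + ensemble)"; names
# without a matching identifier are returned unchanged.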


def load_data(filename: str) -> pd.DataFrame:
    """Load a leaderboard CSV and prepare the columns shown in the UI."""
    df_leaderboard = pd.read_csv(Path(__file__).parent / "data" / f"{filename}.csv.zip")
    print(
        f"Loaded dataframe with {len(df_leaderboard)} rows and columns {df_leaderboard.columns}"
    )

    # add model family information
    df_leaderboard["Type"] = df_leaderboard.loc[:, "method"].apply(
        lambda s: model_type_emoji[get_model_family(s)]
    )
    df_leaderboard["TypeName"] = df_leaderboard.loc[:, "method"].apply(
        lambda s: get_model_family(s)
    )
    df_leaderboard["method"] = df_leaderboard["method"].apply(rename_map)

    # combine the elo+ and elo- columns into a single display string
    df_leaderboard["Elo 95% CI"] = (
        "+"
        + df_leaderboard["elo+"].round(0).astype(int).astype(str)
        + "/-"
        + df_leaderboard["elo-"].round(0).astype(int).astype(str)
    )

    # derive the displayed metrics from the raw columns
    df_leaderboard["normalized-score"] = 1 - df_leaderboard["normalized-error"]
    df_leaderboard["hmr"] = 1 / df_leaderboard["mrr"]  # harmonic mean rank = 1 / mean reciprocal rank
    df_leaderboard["improvability"] = 100 * df_leaderboard["champ_delta"]

    # select only the columns we want to display
    df_leaderboard = df_leaderboard.loc[
        :,
        [
            "Type",
            "TypeName",
            "method",
            "elo",
            "Elo 95% CI",
            "normalized-score",
            "rank",
            "hmr",
            "improvability",
            "median_time_train_s_per_1K",
            "median_time_infer_s_per_1K",
        ],
    ]

    # round for better display ("Elo 95% CI" is already a formatted string)
    df_leaderboard["elo"] = df_leaderboard["elo"].round(0)
    df_leaderboard[["median_time_train_s_per_1K", "rank", "hmr"]] = df_leaderboard[
        ["median_time_train_s_per_1K", "rank", "hmr"]
    ].round(2)
    df_leaderboard[["normalized-score", "median_time_infer_s_per_1K", "improvability"]] = df_leaderboard[
        ["normalized-score", "median_time_infer_s_per_1K", "improvability"]
    ].round(3)

    # sort by Elo and add a "#" position column
    df_leaderboard = df_leaderboard.sort_values(by="elo", ascending=False)
    df_leaderboard = df_leaderboard.reset_index(drop=True)
    df_leaderboard = df_leaderboard.reset_index(names="#")

    # rename some columns for display
    return df_leaderboard.rename(
        columns={
            "median_time_train_s_per_1K": "Median Train Time (s/1K) [⬇️]",
            "median_time_infer_s_per_1K": "Median Predict Time (s/1K) [⬇️]",
            "method": "Model",
            "elo": "Elo [⬆️]",
            "rank": "Rank [⬇️]",
            "normalized-score": "Normalized Score [⬆️]",
            "hmr": "Harmonic Mean Rank [⬇️]",
            "improvability": "Improvability (%) [⬇️]",
        }
    )


def make_leaderboard(df_leaderboard: pd.DataFrame) -> Leaderboard:
    df_leaderboard["TypeFilter"] = df_leaderboard["TypeName"].apply(
        lambda m: f"{m} {model_type_emoji[m]}"
    )
    # De-selects but does not filter...
    # default = df_leaderboard["TypeFilter"].unique().tolist()
    # default = [(s, s) for s in default if "AutoML" not in s]

    # boolean helper columns for the "Custom Views" filters
    df_leaderboard["Only Default"] = df_leaderboard["Model"].str.endswith("(default)")
    df_leaderboard["Only Tuned"] = df_leaderboard["Model"].str.endswith("(tuned)")
    df_leaderboard["Only Tuned + Ensemble"] = df_leaderboard["Model"].str.endswith(
        "(tuned + ensemble)"
    ) | df_leaderboard["Model"].str.endswith("(4h)")

    # add an imputed-count postfix for models that could not run on all datasets
    mask = df_leaderboard["Model"].str.startswith("TabPFNv2")
    df_leaderboard.loc[mask, "Model"] = (
        df_leaderboard.loc[mask, "Model"] + " [35.29% IMPUTED]"
    )
    mask = df_leaderboard["Model"].str.startswith("TabICL")
    df_leaderboard.loc[mask, "Model"] = (
        df_leaderboard.loc[mask, "Model"] + " [29.41% IMPUTED]"
    )
    df_leaderboard["Imputed"] = df_leaderboard["Model"].str.startswith(
        "TabPFNv2"
    ) | df_leaderboard["Model"].str.startswith("TabICL")
    df_leaderboard["Imputed"] = df_leaderboard["Imputed"].replace(
        {
            True: "Imputed",
            False: "Not Imputed",
        }
    )

    return Leaderboard(
        value=df_leaderboard,
        select_columns=SelectColumns(
            default_selection=list(df_leaderboard.columns),
            cant_deselect=["Type", "Model"],
            label="Select Columns to Display:",
        ),
        hide_columns=[
            "TypeName",
            "TypeFilter",
            "RefModel",
            "Only Default",
            "Only Tuned",
            "Only Tuned + Ensemble",
            "Imputed",
        ],
        search_columns=["Model", "Type"],
        filter_columns=[
            ColumnFilter("TypeFilter", type="checkboxgroup", label="Model Types."),
            ColumnFilter("Only Default", type="boolean", default=False),
            ColumnFilter("Only Tuned", type="boolean", default=False),
            ColumnFilter("Only Tuned + Ensemble", type="boolean", default=False),
            ColumnFilter(
                "Imputed",
                type="checkboxgroup",
                label="(Not) Imputed Models.",
                info="We impute the performance for models that cannot run on all"
                " datasets due to task or dataset size constraints (e.g., TabPFN,"
                " TabICL). We impute with the performance of a default RandomForest."
                " We add a postfix [X% IMPUTED] to the model name if any results were"
                " imputed, where X% is the percentage of datasets that were imputed."
                " In general, imputation negatively represents the model performance,"
                " punishing the model for not being able to run on all datasets.",
            ),
        ],
        bool_checkboxgroup_label="Custom Views (exclusive, only toggle one at a time):",
    )
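

# The imputation described in the "Imputed" filter above happens upstream, when the
# results behind the leaderboard CSV are aggregated; this app only labels the affected
# models. As a rough, illustrative sketch of the idea (not used by this app; the column
# names "method", "dataset", "error" and the fallback name "RF (default)" are
# assumptions), missing (method, dataset) results could be filled in with the default
# RandomForest results like this:
def _impute_missing_results_sketch(
    df_results: pd.DataFrame, fallback_method: str = "RF (default)"
) -> pd.DataFrame:
    """Add rows so every method has a result on every dataset, copying the fallback's error."""
    fallback = df_results[df_results["method"] == fallback_method].set_index("dataset")["error"]
    all_datasets = sorted(df_results["dataset"].unique())
    rows = []
    for method in df_results["method"].unique():
        covered = set(df_results.loc[df_results["method"] == method, "dataset"])
        for dataset in all_datasets:
            if dataset not in covered and dataset in fallback.index:
                rows.append({"method": method, "dataset": dataset, "error": fallback[dataset]})
    if not rows:
        return df_results.copy()
    return pd.concat([df_results, pd.DataFrame(rows)], ignore_index=True)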


def main():
    demo = gr.Blocks()
    with demo:
        gr.HTML(TITLE)
        gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
        with gr.Tabs(elem_classes="tab-buttons"):
            with gr.TabItem("🏅 TabArena-v0.1", elem_id="llm-benchmark-tab-table", id=2):
                df_leaderboard = load_data("tabarena_leaderboard")
                make_leaderboard(df_leaderboard)

            # TODO: decide on which subsets we want to support here.
            # with gr.TabItem("🏅 Regression", elem_id="llm-benchmark-tab-table", id=0):
            #     df_leaderboard = load_data("leaderboard-regression")
            #     leaderboard = make_leaderboard(df_leaderboard)
            #
            # with gr.TabItem("🏅 Classification", elem_id="llm-benchmark-tab-table", id=1):
            #     df_leaderboard = load_data("leaderboard-classification")
            #     leaderboard = make_leaderboard(df_leaderboard)
            #
            # with gr.TabItem("🏅 TabPFNv2-Compatible", elem_id="llm-benchmark-tab-table", id=1):
            #     df_leaderboard = load_data("leaderboard-classification")
            #     leaderboard = make_leaderboard(df_leaderboard)
            #
            # with gr.TabItem("🏅 TabICL-Compatible", elem_id="llm-benchmark-tab-table", id=1):
            #     df_leaderboard = load_data("leaderboard-classification")
            #     leaderboard = make_leaderboard(df_leaderboard)

            with gr.TabItem("📝 About", elem_id="llm-benchmark-tab-table", id=4):
                gr.Markdown(ABOUT_TEXT, elem_classes="markdown-text")

        with gr.Row(), gr.Accordion("📙 Citation", open=False):
            gr.Textbox(
                value=CITATION_BUTTON_TEXT,
                label=CITATION_BUTTON_LABEL,
                lines=20,
                elem_id="citation-button",
                show_copy_button=True,
            )

    scheduler = BackgroundScheduler()
    # scheduler.add_job(restart_space, "interval", seconds=1800)
    scheduler.start()
    demo.queue(default_concurrency_limit=40).launch()


if __name__ == "__main__":
    main()