alana89 committed · Commit 21fc54b · verified · 1 Parent(s): 759bda8

Quickstart

Files changed (1)
  1. README.md +75 -33
README.md CHANGED
@@ -6,36 +6,78 @@ pipeline_tag: tabular-classification
 ---
 
 
- **TabSTAR** is now available to everyone through our repository: you can download the model and fit it on your own data.
-
- [🔗 GitHub Repository: alanarazi7/TabSTAR](https://github.com/alanarazi7/TabSTAR)
-
- We are currently working on enhancing it to be production-ready. Feel free to raise any issues or feature requests.
-
- # Paper title and link
-
- The model was presented in the paper [TabSTAR: A Foundation Tabular Model With Semantically Target-Aware
- Representations](https://arxiv.org/abs/2505.18125).
-
- # Paper abstract
-
- The abstract of the paper is the following:
-
- While deep learning has achieved remarkable success across many domains, it
- has historically underperformed on tabular learning tasks, which remain
- dominated by gradient boosting decision trees (GBDTs). However, recent
- advancements are paving the way for Tabular Foundation Models, which can
- leverage real-world knowledge and generalize across diverse datasets,
- particularly when the data contains free-text. Although incorporating language
- model capabilities into tabular tasks has been explored, most existing methods
- utilize static, target-agnostic textual representations, limiting their
- effectiveness. We introduce TabSTAR: a Foundation Tabular Model with
- Semantically Target-Aware Representations. TabSTAR is designed to enable
- transfer learning on tabular data with textual features, with an architecture
- free of dataset-specific parameters. It unfreezes a pretrained text encoder and
- takes as input target tokens, which provide the model with the context needed
- to learn task-specific embeddings. TabSTAR achieves state-of-the-art
- performance for both medium- and large-sized datasets across known benchmarks
- of classification tasks with text features, and its pretraining phase exhibits
- scaling laws in the number of datasets, offering a pathway for further
- performance improvements.
 ---
 
 
+ <p align="center">
+   <img src="https://raw.githubusercontent.com/alanarazi7/TabSTAR/main/figures/tabstar_logo.png" alt="TabSTAR Logo" width="50%">
+ </p>
+
+ ---
+
+ ## Install
+
+ To fit a pretrained TabSTAR model to your own dataset, install the package:
+
+ ```bash
+ pip install tabstar
+ ```
+
+ ---
+
+ ## Quickstart Example
+
+ ```python
+ from importlib.resources import files
+ import pandas as pd
+ from sklearn.metrics import classification_report
+ from sklearn.model_selection import train_test_split
+
+ from tabstar.tabstar_model import TabSTARClassifier
+
+ # Load sample data
+ csv_path = files("tabstar").joinpath("resources", "imdb.csv")
+ x = pd.read_csv(csv_path)
+ y = x.pop('Genre_is_Drama')
+
+ # Split train/test
+ x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
+
+ # Initialize and train
+ tabstar = TabSTARClassifier()
+ tabstar.fit(x_train, y_train)
+
+ # Predict and evaluate
+ y_pred = tabstar.predict(x_test)
+ print(classification_report(y_test, y_pred))
+ ```
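The bundled `imdb.csv` above already contains free-text columns; if you are bringing your own data, the quickstart implies a standard scikit-learn-style input: a feature DataFrame and a target Series. A minimal sketch with a hypothetical dataset (all column names and values below are invented for illustration, not part of the package):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table mixing free-text, numeric, and categorical columns,
# the kind of input TabSTAR targets.
x = pd.DataFrame({
    "Title": ["The Quiet Harbor", "Laugh Track", "Iron Orbit", "Last Waltz"],
    "Description": [
        "A slow-burning family story set in a fishing town.",
        "A sitcom pilot about rival food trucks.",
        "A space-station thriller with a skeleton crew.",
        "Two dancers reunite for one final competition.",
    ],
    "Runtime_Minutes": [124, 22, 109, 98],
    "Year": [2019, 2021, 2020, 2018],
})
y = pd.Series([1, 0, 0, 1], name="Genre_is_Drama")

# Same split pattern as the quickstart
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)
print(x_train.shape, x_test.shape)  # (3, 4) (1, 4)
```

A frame prepared this way can be passed to `TabSTARClassifier().fit(x_train, y_train)` exactly as in the quickstart.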
+
+ ---
+
+ # 📚 TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
+
+ **Repository:** [alanarazi7/TabSTAR](https://github.com/alanarazi7/TabSTAR)
+
+ **Paper:** [TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations](https://arxiv.org/abs/2505.18125)
+
+ **License:** MIT © Alan Arazi et al.
+
+ ---
+
+ ## Abstract
+
+ > While deep learning has achieved remarkable success across many domains, it
+ > has historically underperformed on tabular learning tasks, which remain
+ > dominated by gradient boosting decision trees (GBDTs). However, recent
+ > advancements are paving the way for Tabular Foundation Models, which can
+ > leverage real-world knowledge and generalize across diverse datasets,
+ > particularly when the data contains free-text. Although incorporating language
+ > model capabilities into tabular tasks has been explored, most existing methods
+ > utilize static, target-agnostic textual representations, limiting their
+ > effectiveness. We introduce TabSTAR: a Foundation Tabular Model with
+ > Semantically Target-Aware Representations. TabSTAR is designed to enable
+ > transfer learning on tabular data with textual features, with an architecture
+ > free of dataset-specific parameters. It unfreezes a pretrained text encoder and
+ > takes as input target tokens, which provide the model with the context needed
+ > to learn task-specific embeddings. TabSTAR achieves state-of-the-art
+ > performance for both medium- and large-sized datasets across known benchmarks
+ > of classification tasks with text features, and its pretraining phase exhibits
+ > scaling laws in the number of datasets, offering a pathway for further
+ > performance improvements.
+
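The central idea in the abstract — conditioning representations on the prediction target — can be illustrated with a deliberately toy sketch. The hash-based "encoder" below is a stand-in for the pretrained text encoder, and every name in it is hypothetical; it is not the TabSTAR architecture. It only shows how feeding a target token into the encoder makes the same cell embed differently for different tasks:

```python
import hashlib
import numpy as np

def toy_text_encoder(text: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a pretrained text encoder: a deterministic
    # pseudo-embedding seeded by a hash of the input string.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

def encode_cell(column: str, value: str, target: str) -> np.ndarray:
    # Target-aware encoding: the target token is part of the encoder
    # input, so the representation is conditioned on the task.
    return toy_text_encoder(f"target: {target} | {column}: {value}")

a = encode_cell("Description", "a space thriller", target="Genre_is_Drama")
b = encode_cell("Description", "a space thriller", target="Genre_is_Comedy")
# Same cell text, different targets -> different embeddings
print(np.allclose(a, b))
```

A target-agnostic encoder would map both calls to the same vector; making the target part of the encoder input is what lets one shared architecture produce task-specific embeddings.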