alana89 committed · Commit 21fc54b · verified · 1 Parent(s): 759bda8

Quickstart

Files changed (1)
  1. README.md +75 -33
README.md CHANGED
@@ -6,36 +6,78 @@ pipeline_tag: tabular-classification
 ---
 
 
- **TabSTAR** is now available to everyone through our repository: you can download the model and fit it on your own data.
-
- [🔗 GitHub Repository: alanarazi7/TabSTAR](https://github.com/alanarazi7/TabSTAR)
-
- We are currently working on enhancing it to be production-ready. Feel free to raise any issues or feature requests.
-
- # Paper title and link
-
- The model was presented in the paper [TabSTAR: A Foundation Tabular Model With Semantically Target-Aware
- Representations](https://arxiv.org/abs/2505.18125).
-
- # Paper abstract
-
- The abstract of the paper is the following:
-
- While deep learning has achieved remarkable success across many domains, it
- has historically underperformed on tabular learning tasks, which remain
- dominated by gradient boosting decision trees (GBDTs). However, recent
- advancements are paving the way for Tabular Foundation Models, which can
- leverage real-world knowledge and generalize across diverse datasets,
- particularly when the data contains free-text. Although incorporating language
- model capabilities into tabular tasks has been explored, most existing methods
- utilize static, target-agnostic textual representations, limiting their
- effectiveness. We introduce TabSTAR: a Foundation Tabular Model with
- Semantically Target-Aware Representations. TabSTAR is designed to enable
- transfer learning on tabular data with textual features, with an architecture
- free of dataset-specific parameters. It unfreezes a pretrained text encoder and
- takes as input target tokens, which provide the model with the context needed
- to learn task-specific embeddings. TabSTAR achieves state-of-the-art
- performance for both medium- and large-sized datasets across known benchmarks
- of classification tasks with text features, and its pretraining phase exhibits
- scaling laws in the number of datasets, offering a pathway for further
- performance improvements.
 ---
 
 
+ <p align="center">
+   <img src="https://raw.githubusercontent.com/alanarazi7/TabSTAR/main/figures/tabstar_logo.png" alt="TabSTAR Logo" width="50%">
+ </p>
+
+ ---
+
+ ## Install
+
+ To fit a pretrained TabSTAR model to your own dataset, install the package:
+
+ ```bash
+ pip install tabstar
+ ```
+
+ ---
+
+ ## Quickstart Example
+
+ ```python
+ from importlib.resources import files
+ import pandas as pd
+ from sklearn.metrics import classification_report
+ from sklearn.model_selection import train_test_split
+
+ from tabstar.tabstar_model import TabSTARClassifier
+
+ # Load sample data
+ csv_path = files("tabstar").joinpath("resources", "imdb.csv")
+ x = pd.read_csv(csv_path)
+ y = x.pop('Genre_is_Drama')
+
+ # Split train/test
+ x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
+
+ # Initialize and train
+ tabstar = TabSTARClassifier()
+ tabstar.fit(x_train, y_train)
+
+ # Predict and evaluate
+ y_pred = tabstar.predict(x_test)
+ print(classification_report(y_test, y_pred))
+ ```
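The bundled `imdb.csv` above already contains free-text columns; if you are bringing your own data, the quickstart implies a standard scikit-learn-style input: a feature DataFrame and a target Series. A minimal sketch with a hypothetical dataset (all column names and values below are invented for illustration, not part of the package):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table mixing free-text, numeric, and categorical columns,
# the kind of input TabSTAR targets.
x = pd.DataFrame({
    "Title": ["The Quiet Harbor", "Laugh Track", "Iron Orbit", "Last Waltz"],
    "Description": [
        "A slow-burning family story set in a fishing town.",
        "A sitcom pilot about rival food trucks.",
        "A space-station thriller with a skeleton crew.",
        "Two dancers reunite for one final competition.",
    ],
    "Runtime_Minutes": [124, 22, 109, 98],
    "Year": [2019, 2021, 2020, 2018],
})
y = pd.Series([1, 0, 0, 1], name="Genre_is_Drama")

# Same split pattern as the quickstart
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)
print(x_train.shape, x_test.shape)  # (3, 4) (1, 4)
```

A frame prepared this way can be passed to `TabSTARClassifier().fit(x_train, y_train)` exactly as in the quickstart.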
+
+ ---
+
+ # 📚 TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
+
+ **Repository:** [alanarazi7/TabSTAR](https://github.com/alanarazi7/TabSTAR)
+
+ **Paper:** [TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations](https://arxiv.org/abs/2505.18125)
+
+ **License:** MIT © Alan Arazi et al.
+
+ ---
+
+ ## Abstract
+
+ > While deep learning has achieved remarkable success across many domains, it
+ > has historically underperformed on tabular learning tasks, which remain
+ > dominated by gradient boosting decision trees (GBDTs). However, recent
+ > advancements are paving the way for Tabular Foundation Models, which can
+ > leverage real-world knowledge and generalize across diverse datasets,
+ > particularly when the data contains free-text. Although incorporating language
+ > model capabilities into tabular tasks has been explored, most existing methods
+ > utilize static, target-agnostic textual representations, limiting their
+ > effectiveness. We introduce TabSTAR: a Foundation Tabular Model with
+ > Semantically Target-Aware Representations. TabSTAR is designed to enable
+ > transfer learning on tabular data with textual features, with an architecture
+ > free of dataset-specific parameters. It unfreezes a pretrained text encoder and
+ > takes as input target tokens, which provide the model with the context needed
+ > to learn task-specific embeddings. TabSTAR achieves state-of-the-art
+ > performance for both medium- and large-sized datasets across known benchmarks
+ > of classification tasks with text features, and its pretraining phase exhibits
+ > scaling laws in the number of datasets, offering a pathway for further
+ > performance improvements.
+
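The central idea in the abstract — conditioning representations on the prediction target — can be illustrated with a deliberately toy sketch. The hash-based "encoder" below is a stand-in for the pretrained text encoder, and every name in it is hypothetical; it is not the TabSTAR architecture. It only shows how feeding a target token into the encoder makes the same cell embed differently for different tasks:

```python
import hashlib
import numpy as np

def toy_text_encoder(text: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a pretrained text encoder: a deterministic
    # pseudo-embedding seeded by a hash of the input string.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

def encode_cell(column: str, value: str, target: str) -> np.ndarray:
    # Target-aware encoding: the target token is part of the encoder
    # input, so the representation is conditioned on the task.
    return toy_text_encoder(f"target: {target} | {column}: {value}")

a = encode_cell("Description", "a space thriller", target="Genre_is_Drama")
b = encode_cell("Description", "a space thriller", target="Genre_is_Comedy")
# Same cell text, different targets -> different embeddings
print(np.allclose(a, b))
```

A target-agnostic encoder would map both calls to the same vector; making the target part of the encoder input is what lets one shared architecture produce task-specific embeddings.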