# Text Classification & Regression

Training a text classification/regression model with AutoTrain is super-easy! Get your data ready in
the proper format, and with just a few clicks your state-of-the-art model will be ready to
be used in production.

Config file task names:
- `text_classification`
- `text-classification`
- `text_regression`
- `text-regression`

## Data Format

Text classification/regression supports datasets in both CSV and JSONL formats.

### CSV Format

Let's train a model for classifying the sentiment of a movie review. The data should be
in the following CSV format:
```csv
text,target
"this movie is great",positive
"this movie is bad",negative
.
.
.
```
As you can see, the CSV file has two columns: one contains the text and the other
contains the label. The label can be any string. In this example, we have two labels: `positive`
and `negative`. You can have as many labels as you want.

If you would like to train a model that scores a movie review on a scale of 1-5, the data can be as follows:
```csv
text,target
"this movie is great",4.9
"this movie is bad",1.5
.
.
.
```
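The two CSV layouts above can be produced with Python's standard `csv` module. The following is a minimal sketch, not part of AutoTrain itself: the helper name `write_text_target_csv` and the sample rows are made up for illustration. It writes to an in-memory buffer so the result can be inspected before saving it to disk.

```python
import csv
import io


def write_text_target_csv(rows, fileobj):
    """Write (text, target) pairs as CSV with a `text,target` header.

    Hypothetical helper for illustration only.
    """
    writer = csv.writer(fileobj, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["text", "target"])
    for text, target in rows:
        writer.writerow([text, target])


# Classification: targets are label strings.
classification_rows = [
    ("this movie is great", "positive"),
    ("this movie is bad, really bad", "negative"),  # the comma forces quoting
]

# Regression: targets are numeric scores.
regression_rows = [
    ("this movie is great", 4.9),
    ("this movie is bad", 1.5),
]

buffer = io.StringIO()
write_text_target_csv(classification_rows, buffer)
csv_text = buffer.getvalue()

regression_buffer = io.StringIO()
write_text_target_csv(regression_rows, regression_buffer)
regression_csv_text = regression_buffer.getvalue()
```

Letting the `csv` module handle quoting avoids broken rows when a review contains commas or quotation marks.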
### JSONL Format

Instead of CSV, you can also use the JSONL format, with one JSON object per line:
```json
{"text": "this movie is great", "target": "positive"}
{"text": "this movie is bad", "target": "negative"}
.
.
.
```
and for regression:
```json
{"text": "this movie is great", "target": 4.9}
{"text": "this movie is bad", "target": 1.5}
.
.
```
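A JSONL file in this shape can be generated with Python's standard `json` module by serializing one object per line. This is a minimal sketch; the helper name `to_jsonl` and the sample records are illustrative and not part of AutoTrain.

```python
import json


def to_jsonl(records):
    """Serialize an iterable of dicts as JSONL: one JSON object per line.

    Hypothetical helper for illustration only.
    """
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


# Regression records; for classification, use label strings as targets instead.
records = [
    {"text": "this movie is great", "target": 4.9},
    {"text": "this movie is bad", "target": 1.5},
]

jsonl_text = to_jsonl(records)

# Reading the text back line by line recovers the original records.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```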
### Column Mapping / Names

By default, AutoTrain expects your dataset to have two columns: `text` and `target`.
If your column names are different from `text` and `target`, you can map your dataset columns to the AutoTrain column names.
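If you would rather not rely on column mapping, one alternative is to rename the header of your CSV file before training. The sketch below is a preprocessing workaround, not AutoTrain's own mapping mechanism. The helper `rename_columns` and the source column names `review` and `sentiment` are hypothetical.

```python
import csv
import io


def rename_columns(csv_text, mapping):
    """Return a copy of `csv_text` with header names replaced according to `mapping`.

    Columns not listed in `mapping` keep their original names.
    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    original_names = reader.fieldnames
    new_names = [mapping.get(name, name) for name in original_names]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(new_names)
    for row in reader:
        writer.writerow([row[name] for name in original_names])
    return out.getvalue()


source = "review,sentiment\ngood film,positive\nawful plot,negative\n"
renamed = rename_columns(source, {"review": "text", "sentiment": "target"})
```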
## Training

### Local Training

To train a text classification/regression model locally, you can use the `autotrain --config config.yaml` command.

Here is an example of a `config.yaml` file for training a text classification model:
```yaml
task: text_classification # or text_regression
base_model: google-bert/bert-base-uncased
project_name: autotrain-bert-imdb-finetuned
log: tensorboard
backend: local

data:
  path: stanfordnlp/imdb
  train_split: train
  valid_split: test
  column_mapping:
    text_column: text
    target_column: label

params:
  max_seq_length: 512
  epochs: 3
  batch_size: 4
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
```
In this example, we are fine-tuning the `google-bert/bert-base-uncased` model for text classification on the IMDB dataset.
We are using the `stanfordnlp/imdb` dataset, which is already available on the Hugging Face Hub.
We are training the model for 3 epochs with a batch size of 4 and a learning rate of `2e-5`.
We are using the `adamw_torch` optimizer and the `linear` scheduler.
We are also using `fp16` mixed precision training with a gradient accumulation of 1.
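To get a feel for what these settings imply, the arithmetic below estimates the effective batch size and the number of optimizer steps. It is a rough sketch under stated assumptions: training on a single device (the `num_devices` parameter is hypothetical and not a config key), and the IMDB train split containing 25,000 examples. The function names are illustrative.

```python
import math


def effective_batch_size(batch_size, gradient_accumulation, num_devices=1):
    """Number of examples contributing to each optimizer update."""
    return batch_size * gradient_accumulation * num_devices


def optimizer_steps(num_examples, batch_size, gradient_accumulation, epochs, num_devices=1):
    """Approximate total optimizer updates over the whole training run."""
    per_step = effective_batch_size(batch_size, gradient_accumulation, num_devices)
    steps_per_epoch = math.ceil(num_examples / per_step)
    return steps_per_epoch * epochs


# Values taken from the config above; 25,000 is the assumed IMDB train split size.
per_update = effective_batch_size(batch_size=4, gradient_accumulation=1)
total_steps = optimizer_steps(
    num_examples=25_000, batch_size=4, gradient_accumulation=1, epochs=3
)
```

Raising `gradient_accumulation` increases the effective batch size without increasing per-step memory use, at the cost of fewer optimizer updates per epoch.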

If you want to use a local CSV/JSONL dataset, you can change the `data` section to:
```yaml
data:
  path: data/ # this must be the path to the directory containing the train and valid files
  train_split: train # this must be either train.csv or train.json
  valid_split: valid # this must be either valid.csv or valid.json
  column_mapping:
    text_column: text # this must be the name of the column containing the text
    target_column: label # this must be the name of the column containing the target
```
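The directory layout described in the comments above (a `data/` folder holding `train.csv` and `valid.csv`) could be prepared with a short script like the one below. This is a sketch under assumptions: the helper `write_split`, the 80/20 split, and the fixed seed are illustrative choices, and the `text,label` header is chosen to match the `column_mapping` in the snippet above.

```python
import csv
import random
from pathlib import Path


def write_split(rows, out_dir, valid_fraction=0.2, seed=42):
    """Shuffle (text, label) rows and write train.csv and valid.csv into `out_dir`.

    Returns the number of rows written per split.
    Hypothetical helper for illustration only.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded for a reproducible split

    n_valid = max(1, int(len(rows) * valid_fraction))
    splits = {"valid": rows[:n_valid], "train": rows[n_valid:]}

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, split_rows in splits.items():
        with open(out_dir / f"{name}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["text", "label"])  # matches text_column / target_column
            writer.writerows(split_rows)

    return {name: len(split_rows) for name, split_rows in splits.items()}
```

Calling `write_split(rows, "data")` would then produce files that line up with `path: data/`, `train_split: train`, and `valid_split: valid`.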

To train the model, run the following command:

```bash
$ autotrain --config config.yaml
```

You can find example config files for text classification and regression [here](https://github.com/huggingface/autotrain-advanced/tree/main/configs/text_classification) and [here](https://github.com/huggingface/autotrain-advanced/tree/main/configs/text_regression), respectively.

### Training on Hugging Face Spaces

The parameters for training on Hugging Face Spaces are the same as for local training.
If you are using your own dataset, select "Local" as the dataset source and upload your dataset.
In the following screenshot, we are training a text classification model using the `google-bert/bert-base-uncased` model on the IMDB dataset.



For text regression, all you need to do is select "Text Regression" as the task; everything else remains the same (except the data, of course).

## Training Parameters

Training parameters for text classification and regression are the same.

[[autodoc]] trainers.text_classification.params.TextClassificationParams