# Text Classification Intel® Transfer Learning Tool CLI Example

## Fine Tuning Using Your Own Dataset
The example below shows how to fine-tune a TensorFlow text classification model using your own
dataset in .csv format. The .csv file is expected to have two columns: a numerical class label
and the text/sentence to classify. Note that although the TLT API is more flexible, allowing
map functions that translate string class names to numerical values and filtering of which
columns are used, the CLI only accepts .csv files in the expected format.

The `--dataset-dir` argument is the path to the directory where your dataset is located, and
`--dataset-file` is the name of the .csv file to load from that directory. Use the `--class-names`
argument to specify a list of the classes and `--delimiter` to specify the character that
separates the two columns. If no `--delimiter` is specified, the CLI defaults to a comma (`,`).
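For illustration, a minimal file in the expected format could look like the two-line sample written
below. The file name `sample.csv` and its contents are hypothetical and are not part of the dataset
used later in this example.

```bash
# Write a tiny two-class sample file in the expected layout:
# numerical class label, comma delimiter, then the text to classify
cat > sample.csv <<'EOF'
0,this message is perfectly normal
1,click here to claim your free prize
EOF
```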
This example downloads the [SMS Spam Collection](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)
dataset, which ships as a tab-separated values file inside a .zip archive. The dataset consists of
SMS text messages labeled as either `ham` or `spam`. The first column in the data file holds the
label (`ham` or `spam`) and the second column holds the text of the SMS message. The string class
labels are replaced with numerical values before training.
```bash
# Create dataset and output directories
DATASET_DIR=/tmp/data
OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Download the dataset and extract it into the dataset directory
wget -P ${DATASET_DIR} https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip
unzip ${DATASET_DIR}/sms+spam+collection.zip -d ${DATASET_DIR}

# Make a copy of the .csv file with 'numerical' in the file name
DATASET_FILE=SMSSpamCollection_numerical.csv
cp ${DATASET_DIR}/SMSSpamCollection ${DATASET_DIR}/${DATASET_FILE}

# Replace the string class labels in the first (tab-delimited) column with
# numerical values, anchored to the start of the line so that message text
# containing 'ham' or 'spam' is left untouched. The resulting numerical
# labels are passed as --class-names during training and evaluation.
sed -i 's/^ham\t/0\t/' ${DATASET_DIR}/${DATASET_FILE}
sed -i 's/^spam\t/1\t/' ${DATASET_DIR}/${DATASET_FILE}

# Train google/bert_uncased_L-10_H-256_A-4 using our dataset file, which has tab delimiters
tlt train \
    -f tensorflow \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --dataset-file ${DATASET_FILE} \
    --epochs 2 \
    --class-names 0,1 \
    --delimiter $'\t'

# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
    --model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/1 \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --dataset-dir ${DATASET_DIR} \
    --dataset-file ${DATASET_FILE} \
    --class-names 0,1 \
    --delimiter $'\t'
```
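As a quick sanity check (not part of the original walkthrough), you can confirm that the `sed`
replacements left only numerical labels in the first column of the tab-delimited file:

```bash
# List the distinct values in the label column; expect only 0 and 1
cut -f1 ${DATASET_DIR}/${DATASET_FILE} | sort -u
```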
## Fine Tuning Using a Dataset from the TFDS Catalog

This example demonstrates using the Intel Transfer Learning Tool CLI to fine-tune a text classification model using a
dataset from the [TensorFlow Datasets (TFDS) catalog](https://www.tensorflow.org/datasets/catalog/overview).
Intel Transfer Learning Tool supports the following text classification datasets from TFDS:
[imdb_reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews),
[glue/sst2](https://www.tensorflow.org/datasets/catalog/glue#gluesst2_default_config),
and [glue/cola](https://www.tensorflow.org/datasets/catalog/glue#gluecola_default_config).
```bash
# Create dataset and output directories
DATASET_DIR=/tmp/data
OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Name of the dataset to use
DATASET_NAME=imdb_reviews

# Train google/bert_uncased_L-10_H-256_A-4 using the TFDS dataset
tlt train \
    -f tensorflow \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --dataset-name ${DATASET_NAME} \
    --epochs 2

# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
    --model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/2 \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --dataset-dir ${DATASET_DIR} \
    --dataset-name ${DATASET_NAME}
```
## Distributed Transfer Learning Using a Dataset from Hugging Face

This example runs a distributed PyTorch training job using the TLT CLI. It fine-tunes a text classification model
for document-level sentiment analysis using a dataset from the [Hugging Face catalog](https://huggingface.co/datasets).
Intel Transfer Learning Tool supports the following text classification datasets from Hugging Face:

* [imdb](https://huggingface.co/datasets/imdb)
* [tweet_eval](https://huggingface.co/datasets/tweet_eval)
* [rotten_tomatoes](https://huggingface.co/datasets/rotten_tomatoes)
* [ag_news](https://huggingface.co/datasets/ag_news)
* [sst2](https://huggingface.co/datasets/sst2)
Follow [these instructions](/tlt/distributed/README.md) to set up your machines for distributed training with PyTorch. This will
ensure your environment has the right prerequisites, package dependencies, and hostfile configuration. Once
you have successfully run the sanity check, the following commands fine-tune `bert-large-uncased` on sst2 for
one epoch using 2 nodes and 2 processes per node.
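For reference, a hostfile is typically just a plain-text list of the participating hosts, one per
line. The host names below are placeholders for this sketch; substitute the names or IP addresses
of your own machines.

```bash
# Create a hostfile listing the two participating hosts (placeholder names)
cat > hostfile <<'EOF'
node-1
node-2
EOF
```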
```bash
# Create dataset and output directories
DATASET_DIR=/tmp/data
OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Name of the dataset to use
DATASET_NAME=sst2

# Train bert-large-uncased using the Hugging Face dataset sst2
tlt train \
    -f pytorch \
    --model-name bert-large-uncased \
    --dataset-name ${DATASET_NAME} \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --distributed \
    --hostfile hostfile \
    --nnodes 2 \
    --nproc-per-node 2
```
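With 2 nodes and 2 processes per node, this launches four training processes in total, two on each
of the hosts listed in the hostfile.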
## Citations

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
    author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
    title     = {Learning Word Vectors for Sentiment Analysis},
    booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
    month     = {June},
    year      = {2011},
    address   = {Portland, Oregon, USA},
    publisher = {Association for Computational Linguistics},
    pages     = {142--150},
    url       = {http://www.aclweb.org/anthology/P11-1015}
}

@inproceedings{wang2019glue,
    title     = {{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
    author    = {Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
    note      = {In the Proceedings of ICLR.},
    year      = {2019}
}

@misc{misc_sms_spam_collection_228,
    author       = {Almeida, Tiago},
    title        = {{SMS Spam Collection}},
    year         = {2012},
    howpublished = {UCI Machine Learning Repository}
}

@inproceedings{socher-etal-2013-recursive,
    title     = {Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank},
    author    = {Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D. and Ng, Andrew and Potts, Christopher},
    booktitle = {Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing},
    month     = {October},
    year      = {2013},
    address   = {Seattle, Washington, USA},
    publisher = {Association for Computational Linguistics},
    url       = {https://www.aclweb.org/anthology/D13-1170},
    pages     = {1631--1642}
}
```