# Text Classification Intel® Transfer Learning Tool CLI Example

## Fine Tuning Using Your Own Dataset

The example below shows how to fine tune a TensorFlow text classification model using your own dataset in the .csv format. The .csv file is expected to have 2 columns: a numerical class label and the text/sentence to classify. Note that although the TLT API is more flexible and allows for providing map functions to translate string class names to numerical values and for filtering which columns are used, the CLI only accepts .csv files in the expected format.

The `--dataset-dir` argument is the path to the directory where your dataset is located, and `--dataset-file` is the name of the .csv file to load from that directory. Use the `--class-names` argument to specify a list of the classes and the `--delimiter` argument to specify the character that separates the two columns. If no `--delimiter` is specified, the CLI defaults to a comma (`,`).

This example downloads the [SMS Spam Collection](https://archive.ics.uci.edu/dataset/228/sms+spam+collection) dataset, whose .zip file contains a tab-separated value file. The dataset has labeled SMS text messages that are classified as either `ham` or `spam`. The first column in the data file has the label (`ham` or `spam`) and the second column is the text of the SMS message. The string class labels are replaced with numerical values before training.

```bash
# Create dataset and output directories
DATASET_DIR=/tmp/data
OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Download the dataset and extract it into the dataset directory
wget -P ${DATASET_DIR} https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip
unzip ${DATASET_DIR}/sms+spam+collection.zip -d ${DATASET_DIR}

# Make a copy of the .csv file with 'numerical' in the file name
DATASET_FILE=SMSSpamCollection_numerical.csv
cp ${DATASET_DIR}/SMSSpamCollection ${DATASET_DIR}/${DATASET_FILE}

# Replace the string class labels with numerical values in the .csv file.
# The substitution is anchored to the label column at the start of each line,
# so occurrences of 'ham' or 'spam' inside the message text are left untouched.
# The list of numerical class labels is passed as --class-names during
# training and evaluation.
sed -i 's/^ham\t/0\t/' ${DATASET_DIR}/${DATASET_FILE}
sed -i 's/^spam\t/1\t/' ${DATASET_DIR}/${DATASET_FILE}

# Train google/bert_uncased_L-10_H-256_A-4 using our dataset file, which has tab delimiters
tlt train \
    -f tensorflow \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --dataset-file ${DATASET_FILE} \
    --epochs 2 \
    --class-names 0,1 \
    --delimiter $'\t'

# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
    --model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/1 \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --dataset-dir ${DATASET_DIR} \
    --dataset-file ${DATASET_FILE} \
    --class-names 0,1 \
    --delimiter $'\t'
```

## Fine Tuning Using a Dataset from the TFDS Catalog

This example demonstrates using the Intel Transfer Learning Tool CLI to fine tune a text classification model using a dataset from the [TensorFlow Datasets (TFDS) catalog](https://www.tensorflow.org/datasets/catalog/overview). The Intel Transfer Learning Tool supports the following text classification datasets from TFDS: [imdb_reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews), [glue/sst2](https://www.tensorflow.org/datasets/catalog/glue#gluesst2), and [glue/cola](https://www.tensorflow.org/datasets/catalog/glue#gluecola_default_config).
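To sanity check one of these datasets before training, you can preview it with the standard `tensorflow_datasets` Python API. The snippet below is a supplementary sketch, not part of the TLT CLI; it assumes the `tensorflow-datasets` package is installed and reuses the same `/tmp/data` directory as the commands that follow.

```python
import tensorflow_datasets as tfds

# Load the training split of imdb_reviews as (text, label) pairs,
# caching the download under the same directory used by `tlt train`
ds, info = tfds.load("imdb_reviews", split="train", with_info=True,
                     as_supervised=True, data_dir="/tmp/data")

print(info.features["label"].names)      # class names, e.g. ['neg', 'pos']
for text, label in tfds.as_numpy(ds.take(2)):
    print(label, text[:80])              # numeric label and the start of the review
```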
```bash
# Create dataset and output directories
DATASET_DIR=/tmp/data
OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Name of the dataset to use
DATASET_NAME=imdb_reviews

# Train google/bert_uncased_L-10_H-256_A-4 using the TFDS dataset
tlt train \
    -f tensorflow \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --dataset-name ${DATASET_NAME} \
    --epochs 2

# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
    --model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/2 \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --dataset-dir ${DATASET_DIR} \
    --dataset-name ${DATASET_NAME}
```

## Distributed Transfer Learning Using a Dataset from Hugging Face

This example runs a distributed PyTorch training job using the TLT CLI. It fine tunes a text classification model for document-level sentiment analysis using a dataset from the [Hugging Face catalog](https://huggingface.co/datasets). The Intel Transfer Learning Tool supports the following text classification datasets from Hugging Face:
* [imdb](https://huggingface.co/datasets/imdb)
* [tweet_eval](https://huggingface.co/datasets/tweet_eval)
* [rotten_tomatoes](https://huggingface.co/datasets/rotten_tomatoes)
* [ag_news](https://huggingface.co/datasets/ag_news)
* [sst2](https://huggingface.co/datasets/sst2)

Follow [these instructions](/tlt/distributed/README.md) to set up your machines for distributed training with PyTorch. This ensures your environment has the right prerequisites, package dependencies, and hostfile configuration (the hostfile is typically a plain text file listing one hostname or IP address per line, one entry for each participating machine). When you have successfully run the sanity check, the following commands fine tune `bert-large-uncased` with sst2 for one epoch using 2 nodes and 2 processes per node.

```bash
# Create dataset and output directories
DATASET_DIR=/tmp/data
OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Name of the dataset to use
DATASET_NAME=sst2

# Train bert-large-uncased using the Hugging Face dataset sst2
tlt train \
    -f pytorch \
    --model-name bert-large-uncased \
    --dataset-name ${DATASET_NAME} \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --distributed \
    --hostfile hostfile \
    --nnodes 2 \
    --nproc-per-node 2
```
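As a quick check of the data this job trains on, the sst2 split can be previewed with the Hugging Face `datasets` library. This is a supplementary sketch, not part of the TLT CLI; it assumes the `datasets` package is installed and that the `sst2` dataset id resolves on the Hub, and it points the cache at the same `/tmp/data` directory used above.

```python
from datasets import load_dataset

# Download/cache the sst2 training split in the same directory that
# the tlt command above uses for its dataset files
ds = load_dataset("sst2", split="train", cache_dir="/tmp/data")

print(ds.features)                        # column types, including the ClassLabel
print(ds[0]["sentence"], ds[0]["label"])  # first sentence and its numeric label
```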
## Citations

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

@inproceedings{wang2019glue,
  title  = {{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author = {Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note   = {In the Proceedings of ICLR.},
  year   = {2019}
}

@misc{misc_sms_spam_collection_228,
  author       = {Almeida, Tiago},
  title        = {{SMS Spam Collection}},
  year         = {2012},
  howpublished = {UCI Machine Learning Repository}
}

@inproceedings{socher-etal-2013-recursive,
  title     = {Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank},
  author    = {Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D. and Ng, Andrew and Potts, Christopher},
  booktitle = {Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing},
  month     = {October},
  year      = {2013},
  address   = {Seattle, Washington, USA},
  publisher = {Association for Computational Linguistics},
  url       = {https://www.aclweb.org/anthology/D13-1170},
  pages     = {1631--1642}
}
```