# Text Classification Intel® Transfer Learning Tool CLI Example

## Fine Tuning Using Your Own Dataset
The example below shows how to fine-tune a TensorFlow text classification model using your own
dataset in .csv format. The .csv file is expected to have two columns: a numerical class label
and the text/sentence to classify. Note that although the TLT API is more flexible, allowing
map functions that translate string class names to numerical values and filtering of which
columns are used, the CLI only accepts .csv files in the expected format.

The `--dataset-dir` argument is the path to the directory where your dataset is located, and
`--dataset-file` is the name of the .csv file to load from that directory. Use the `--class-names`
argument to specify a list of the classes and `--delimiter` to specify the character that
separates the two columns. If no `--delimiter` is specified, the CLI defaults to a comma (`,`).
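For illustration, a minimal file in the expected format could look like the two-line sample written
below. The file name `sample.csv` and its contents are hypothetical and are not part of the dataset
used later in this example.

```bash
# Write a tiny two-class sample file in the expected layout:
# numerical class label, comma delimiter, then the text to classify
cat > sample.csv <<'EOF'
0,this message is perfectly normal
1,click here to claim your free prize
EOF
```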
This example downloads the [SMS Spam Collection](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)
dataset, which ships as a tab-separated values file inside a .zip archive. The dataset consists of
SMS text messages labeled as either `ham` or `spam`. The first column in the data file holds the
label (`ham` or `spam`) and the second column holds the text of the SMS message. The string class
labels are replaced with numerical values before training.
```bash
# Create dataset and output directories
DATASET_DIR=/tmp/data
OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Download the dataset and extract it into the dataset directory
wget -P ${DATASET_DIR} https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip
unzip ${DATASET_DIR}/sms+spam+collection.zip -d ${DATASET_DIR}

# Make a copy of the .csv file with 'numerical' in the file name
DATASET_FILE=SMSSpamCollection_numerical.csv
cp ${DATASET_DIR}/SMSSpamCollection ${DATASET_DIR}/${DATASET_FILE}

# Replace the string class labels in the first (tab-delimited) column with
# numerical values, anchored to the start of the line so that message text
# containing 'ham' or 'spam' is left untouched. The resulting numerical
# labels are passed as --class-names during training and evaluation.
sed -i 's/^ham\t/0\t/' ${DATASET_DIR}/${DATASET_FILE}
sed -i 's/^spam\t/1\t/' ${DATASET_DIR}/${DATASET_FILE}

# Train google/bert_uncased_L-10_H-256_A-4 using our dataset file, which has tab delimiters
tlt train \
    -f tensorflow \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --dataset-file ${DATASET_FILE} \
    --epochs 2 \
    --class-names 0,1 \
    --delimiter $'\t'

# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
    --model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/1 \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --dataset-dir ${DATASET_DIR} \
    --dataset-file ${DATASET_FILE} \
    --class-names 0,1 \
    --delimiter $'\t'
```
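As a quick sanity check (not part of the original walkthrough), you can confirm that the `sed`
replacements left only numerical labels in the first column of the tab-delimited file:

```bash
# List the distinct values in the label column; expect only 0 and 1
cut -f1 ${DATASET_DIR}/${DATASET_FILE} | sort -u
```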
## Fine Tuning Using a Dataset from the TFDS Catalog

This example demonstrates using the Intel Transfer Learning Tool CLI to fine-tune a text classification model using a
dataset from the [TensorFlow Datasets (TFDS) catalog](https://www.tensorflow.org/datasets/catalog/overview).
Intel Transfer Learning Tool supports the following text classification datasets from TFDS:
[imdb_reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews),
[glue/sst2](https://www.tensorflow.org/datasets/catalog/glue#gluesst2_default_config),
and [glue/cola](https://www.tensorflow.org/datasets/catalog/glue#gluecola_default_config).
```bash
# Create dataset and output directories
DATASET_DIR=/tmp/data
OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Name of the dataset to use
DATASET_NAME=imdb_reviews

# Train google/bert_uncased_L-10_H-256_A-4 using the TFDS dataset
tlt train \
    -f tensorflow \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --dataset-name ${DATASET_NAME} \
    --epochs 2

# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
    --model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/2 \
    --model-name google/bert_uncased_L-10_H-256_A-4 \
    --dataset-dir ${DATASET_DIR} \
    --dataset-name ${DATASET_NAME}
```
## Distributed Transfer Learning Using a Dataset from Hugging Face

This example runs a distributed PyTorch training job using the TLT CLI. It fine-tunes a text classification model
for document-level sentiment analysis using a dataset from the [Hugging Face catalog](https://huggingface.co/datasets).
Intel Transfer Learning Tool supports the following text classification datasets from Hugging Face:

* [imdb](https://huggingface.co/datasets/imdb)
* [tweet_eval](https://huggingface.co/datasets/tweet_eval)
* [rotten_tomatoes](https://huggingface.co/datasets/rotten_tomatoes)
* [ag_news](https://huggingface.co/datasets/ag_news)
* [sst2](https://huggingface.co/datasets/sst2)
Follow [these instructions](/tlt/distributed/README.md) to set up your machines for distributed training with PyTorch. This will
ensure your environment has the right prerequisites, package dependencies, and hostfile configuration. Once
you have successfully run the sanity check, the following commands fine-tune `bert-large-uncased` on sst2 for
one epoch using 2 nodes and 2 processes per node.
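For reference, a hostfile is typically just a plain-text list of the participating hosts, one per
line. The host names below are placeholders for this sketch; substitute the names or IP addresses
of your own machines.

```bash
# Create a hostfile listing the two participating hosts (placeholder names)
cat > hostfile <<'EOF'
node-1
node-2
EOF
```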
```bash
# Create dataset and output directories
DATASET_DIR=/tmp/data
OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Name of the dataset to use
DATASET_NAME=sst2

# Train bert-large-uncased using the Hugging Face dataset sst2
tlt train \
    -f pytorch \
    --model-name bert-large-uncased \
    --dataset-name ${DATASET_NAME} \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --distributed \
    --hostfile hostfile \
    --nnodes 2 \
    --nproc-per-node 2
```
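With 2 nodes and 2 processes per node, this launches four training processes in total, two on each
of the hosts listed in the hostfile.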
## Citations

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
    author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
    title     = {Learning Word Vectors for Sentiment Analysis},
    booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
    month     = {June},
    year      = {2011},
    address   = {Portland, Oregon, USA},
    publisher = {Association for Computational Linguistics},
    pages     = {142--150},
    url       = {http://www.aclweb.org/anthology/P11-1015}
}

@inproceedings{wang2019glue,
    title     = {{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
    author    = {Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
    note      = {In the Proceedings of ICLR.},
    year      = {2019}
}

@misc{misc_sms_spam_collection_228,
    author       = {Almeida, Tiago},
    title        = {{SMS Spam Collection}},
    year         = {2012},
    howpublished = {UCI Machine Learning Repository}
}

@inproceedings{socher-etal-2013-recursive,
    title     = {Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank},
    author    = {Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D. and Ng, Andrew and Potts, Christopher},
    booktitle = {Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing},
    month     = {October},
    year      = {2013},
    address   = {Seattle, Washington, USA},
    publisher = {Association for Computational Linguistics},
    url       = {https://www.aclweb.org/anthology/D13-1170},
    pages     = {1631--1642}
}
```