	# Distributed Training with PyTorch and Intel® Transfer Learning Tool

	## Multinode setup

	### Create and activate a Python3 virtual environment

	We recommend using a Python virtual environment (`virtualenv` or `conda`) for consistent package management. Use the same method on every node; mixing the two is not supported.

	There are two ways to do this:

	a. Using `virtualenv`:

	1. Log in to one of the participating nodes.

	2. Create and activate a new Python 3 virtualenv:

	```
	virtualenv -p python3 tlt_dev_venv
	source tlt_dev_venv/bin/activate
	```

	3. Install Intel® Transfer Learning Tool (see main [README](/README.md))
	```
	pip install --editable .
	```

	4. Install the multinode dependencies using the shell script below. (You can also compile `torch_ccl` manually from the [torch-ccl repository](https://github.com/intel/torch-ccl).)
	```
	bash tlt/distributed/pytorch/pyt_hvd_setup.sh
	```

	b. Or `conda`:

	1. Log in to one of the participating nodes.

	2. Create and activate a new conda environment:
	```
	conda create -n tlt_dev_venv python=3.8 --yes
	conda activate tlt_dev_venv
	```

	3. Install Intel® Transfer Learning Tool (see main [README](/README.md))
	```
	pip install --editable .
	```

	4. Install dependencies from the shell script
	```
	bash tlt/distributed/pytorch/run_install.sh
	```

	## Verify multinode setup

	Create a `hostfile` listing the IP addresses of the participating nodes, then run the following command. It should print the hostname of each node.
	```
	mpiexec.hydra -ppn 1 -f hostfile hostname
	```
	Note: If the command fails with `'mpiexec.hydra' command not found`, activate the oneAPI environment:
	```
	source /opt/intel/oneapi/setvars.sh
	```
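
The verification step above can be scripted. The following is a minimal sketch, not part of Intel® Transfer Learning Tool: `count_entries` and `verify_nodes` are hypothetical helper names, and the sketch assumes that the hostfile lists one address per line and that the output of the `mpiexec.hydra` command was saved to a file.

```shell
# Hypothetical helper: count the non-blank lines in a file.
count_entries() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: succeed when the number of hostnames printed by
# mpiexec.hydra matches the number of entries in the hostfile.
#   $1: hostfile
#   $2: file holding the saved output of the mpiexec.hydra command
verify_nodes() {
    expected=$(count_entries "$1")
    actual=$(count_entries "$2")
    if [ "$expected" -eq "$actual" ]; then
        echo "OK: $actual of $expected nodes responded"
    else
        echo "MISMATCH: $actual of $expected nodes responded" >&2
        return 1
    fi
}

# Usage (not run here):
#   mpiexec.hydra -ppn 1 -f hostfile hostname > hostnames.out
#   verify_nodes hostfile hostnames.out
```

A mismatch usually means that at least one node in the hostfile did not respond, which is worth resolving before launching a training job.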

	## Launch a distributed training job with TLT CLI

	Step 1: Create a `hostfile` listing the IP addresses of the participating nodes. Make sure
	the first IP address is that of the current node.
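
If an existing hostfile lists the nodes in a different order, it can be reordered rather than edited by hand. This is a rough sketch under stated assumptions: `current_node_first` is a hypothetical helper, the hostfile has one address per line, and you supply the current node's address yourself (the addresses in the usage comment are placeholders).

```shell
# Hypothetical helper: print the hostfile entries with the current node first.
#   $1: IP address of the current node (must already be in the hostfile)
#   $2: path to the hostfile
current_node_first() {
    awk -v ip="$1" '
        $1 == ip { found = 1; next }     # hold back the current node
        NF       { rest[++n] = $1 }      # keep the other nodes in order
        END {
            if (!found) {
                print "address not in hostfile: " ip > "/dev/stderr"
                exit 1
            }
            print ip
            for (i = 1; i <= n; i++) print rest[i]
        }
    ' "$2"
}

# Usage (not run here, placeholder address):
#   current_node_first 192.168.0.11 hostfile > hostfile.ordered
```

The helper writes to standard output instead of modifying the file in place, so the original hostfile stays available if the result needs checking.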

	Step 2: Launch a distributed training job with TLT CLI using the appropriate flags.
	```
	tlt train \
	-f pytorch \
	--model_name resnet50 \
	--dataset_name CIFAR10 \
	--output_dir $OUTPUT_DIR \
	--dataset_dir $DATASET_DIR \
	--distributed \
	--hostfile hostfile \
	--nnodes 2 \
	--nproc_per_node 2
	```

	(Optional): Use the `--use_horovod` flag to use Horovod for distributed training.
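
One way to keep the flags consistent is to derive `--nnodes` from the hostfile instead of typing it. The sketch below only prints the command (a dry run) and reuses the flags shown above; `build_train_cmd` is a hypothetical helper, and it assumes that `--nnodes` should equal the number of hostfile entries and that `OUTPUT_DIR` and `DATASET_DIR` are already set.

```shell
# Hypothetical helper: print a tlt train command whose --nnodes value
# matches the number of non-blank lines in the hostfile.
#   $1: path to the hostfile
#   $2: number of processes per node
build_train_cmd() {
    nnodes=$(awk 'NF { n++ } END { print n + 0 }' "$1")
    echo "tlt train -f pytorch --model_name resnet50 --dataset_name CIFAR10" \
         "--output_dir $OUTPUT_DIR --dataset_dir $DATASET_DIR" \
         "--distributed --hostfile $1 --nnodes $nnodes --nproc_per_node $2"
}

# Usage (not run here):
#   build_train_cmd hostfile 2
```

Printing the command first makes it easy to review the values before running it.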

	## Troubleshooting

	- *"Port already in use"*

	This can happen when training is interrupted from the keyboard (for example, with Ctrl+C).

	Fix: Release the port from the terminal, or log out and log back in to free it.

	- *"HTTP Connection error"*

	This can happen after several attempts to train a text classification model, because training calls the Hugging Face API to fetch the dataset, model, and tokenizer.

	Fix: Wait a few seconds and try again.

	- *"TimeoutException"* when using Horovod

	This can happen when Horovod times out waiting for tasks to start.

	Fix: Check connectivity between the servers. If you have many servers, you may need to increase the `--hvd-start-timeout` parameter. The default timeout is 30 seconds.
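
For the *"HTTP Connection error"* case, the "wait and try again" advice can be wrapped in a small retry loop. This is an illustrative sketch only: `retry` is a hypothetical helper, not a TLT feature, and the attempt count and delay are arbitrary values to adjust.

```shell
# Hypothetical helper: run a command, retrying on failure.
#   $1: maximum number of attempts
#   $2: seconds to wait between attempts
#   remaining arguments: the command to run
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while ! "$@"; do
        if [ "$i" -ge "$attempts" ]; then
            echo "giving up after $i attempts" >&2
            return 1
        fi
        echo "attempt $i failed; retrying in ${delay}s" >&2
        sleep "$delay"
        i=$((i + 1))
    done
}

# Usage (not run here): up to 3 attempts, 10 seconds apart.
#   retry 3 10 tlt train -f pytorch ...
```

A retry loop only helps with transient failures; if every attempt fails, check network access to Hugging Face instead.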