# Distributed Training with PyTorch and Intel® Transfer Learning Tool

## Multinode setup

### Create and activate a Python3 virtual environment
We encourage you to use a Python 3 virtual environment (virtualenv or conda) for consistent package management. Use the same method on all of the nodes; mixing the two configurations is not supported.
There are two ways to do this:
a. Using virtualenv:

- Log in to one of the participating nodes.
- Create and activate a new Python 3 virtualenv:

  ```bash
  virtualenv -p python3 tlt_dev_venv
  source tlt_dev_venv/bin/activate
  ```

- Install Intel® Transfer Learning Tool (see the main README):

  ```bash
  pip install --editable .
  ```

- Install the multinode dependencies from the shell script (you can also compile `torch_ccl` manually):

  ```bash
  bash tlt/distributed/pytorch/pyt_hvd_setup.sh
  ```
b. Using conda:

- Log in to one of the participating nodes.
- Create and activate a new conda environment:

  ```bash
  conda create -n tlt_dev_venv python=3.8 --yes
  conda activate tlt_dev_venv
  ```

- Install Intel® Transfer Learning Tool (see the main README):

  ```bash
  pip install --editable .
  ```

- Install dependencies from the shell script:

  ```bash
  bash tlt/distributed/pytorch/run_install.sh
  ```
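Whichever method you choose, you can sanity-check the installation before moving on. A minimal check, assuming the `tlt` entry point is on your `PATH`:

```bash
# Verify the CLI is installed and discoverable (should print the available subcommands)
tlt --help
```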
### Verify multinode setup
Create a `hostfile` with a list of IP addresses of the participating nodes, then run the following command. You should see the hostnames of all the nodes listed.

```bash
mpiexec.hydra -ppn 1 -f hostfile hostname
```
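For example, a two-node `hostfile` is just one IP address per line (the addresses below are placeholders for your own nodes):

```
192.168.1.10
192.168.1.11
```

If the setup is correct, the command prints one hostname per node.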
**Note:** If the above command fails with `mpiexec.hydra: command not found`, activate the oneAPI environment:

```bash
source /opt/intel/oneapi/setvars.sh
```
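To avoid re-running this on every login, you can optionally append it to your shell profile (a sketch, assuming the default oneAPI install path):

```bash
# Load the oneAPI environment automatically on login (assumes /opt/intel/oneapi is the install location)
echo 'source /opt/intel/oneapi/setvars.sh' >> ~/.bashrc
```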
## Launch a distributed training job with TLT CLI
**Step 1:** Create a `hostfile` with a list of IP addresses of the participating nodes, as in the verification step above. Make sure the first IP address is that of the current node.

**Step 2:** Launch a distributed training job with the TLT CLI using the appropriate flags:
```bash
tlt train \
    -f pytorch \
    --model_name resnet50 \
    --dataset_name CIFAR10 \
    --output_dir $OUTPUT_DIR \
    --dataset_dir $DATASET_DIR \
    --distributed \
    --hostfile hostfile \
    --nnodes 2 \
    --nproc_per_node 2
```
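The command above assumes `$DATASET_DIR` and `$OUTPUT_DIR` are already exported; for example (the paths below are placeholders):

```bash
# Placeholder paths; point these at your own dataset and output locations
export DATASET_DIR=$HOME/datasets
export OUTPUT_DIR=$HOME/output
```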
(Optional): Add the `--use_horovod` flag to use Horovod for distributed training.
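For example, the same job launched with Horovod as the backend (a sketch; all other flags are unchanged from the command above):

```bash
tlt train -f pytorch --model_name resnet50 --dataset_name CIFAR10 \
    --output_dir $OUTPUT_DIR --dataset_dir $DATASET_DIR \
    --distributed --use_horovod --hostfile hostfile \
    --nnodes 2 --nproc_per_node 2
```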
## Troubleshooting
"Port already in use"
Might happen when you keyboard interrupt training.
Fix: Release the port from the terminal (or) log out and log in again to free the port.
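A minimal sketch for releasing the port by hand, assuming `lsof` is available (29500 is PyTorch's default master port; substitute the port named in your error message):

```bash
# Find the process still bound to the port
lsof -i :29500
# Terminate it to release the port (replace <PID> with the PID reported by lsof)
kill <PID>
```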
"HTTP Connection error"
Might happen if there are several attempts to train text classification model as it uses Hugging Face API to make calls to get dataset, model, tokenizer.
Fix: Wait for about few seconds and try again.
"TimeoutException" when using horovod
Might happen when horovod times out waiting for tasks to start.
Fix: Check connectivity between servers. You may need to increase the
--hvd-start-timeout
parameter if you have too many servers. Default value for timeout is 30 seconds.
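For example, raising the timeout to 120 seconds. This is a sketch that assumes `--hvd-start-timeout` is accepted alongside the other `tlt train` flags shown earlier; check `tlt train --help` if your version differs:

```bash
# Assumption: --hvd-start-timeout is passed with the same tlt train command
tlt train -f pytorch --model_name resnet50 --dataset_name CIFAR10 \
    --output_dir $OUTPUT_DIR --dataset_dir $DATASET_DIR \
    --distributed --use_horovod --hostfile hostfile \
    --nnodes 2 --nproc_per_node 2 --hvd-start-timeout 120
```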