# Distributed Training with PyTorch and Intel® Transfer Learning Tool

## Multinode setup

### Create and activate a Python3 virtual environment
We encourage you to use a Python 3 virtual environment (virtualenv or conda) for consistent package management. Use the same method on all of the nodes; mixing the two configurations is not supported.
There are two ways to do this:
a. Using virtualenv:

- Log in to one of the participating nodes.
- Create and activate a new Python 3 virtualenv:

  ```bash
  virtualenv -p python3 tlt_dev_venv
  source tlt_dev_venv/bin/activate
  ```

- Install Intel® Transfer Learning Tool (see the main README):

  ```bash
  pip install --editable .
  ```

- Install the multinode dependencies from the shell script (you can also compile `torch_ccl` manually):

  ```bash
  bash tlt/distributed/pytorch/pyt_hvd_setup.sh
  ```
b. Using conda:

- Log in to one of the participating nodes.
- Create and activate a new conda environment:

  ```bash
  conda create -n tlt_dev_venv python=3.8 --yes
  conda activate tlt_dev_venv
  ```

- Install Intel® Transfer Learning Tool (see the main README):

  ```bash
  pip install --editable .
  ```

- Install dependencies from the shell script:

  ```bash
  bash tlt/distributed/pytorch/run_install.sh
  ```
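Whichever method you choose, you can sanity-check the installation before moving on. A minimal check, assuming the `tlt` entry point is on your `PATH`:

```bash
# Verify the CLI is installed and discoverable (should print the available subcommands)
tlt --help
```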
### Verify multinode setup
Create a `hostfile` with a list of IP addresses of the participating nodes, then run the following command. You should see the hostnames of all the nodes listed.

```bash
mpiexec.hydra -ppn 1 -f hostfile hostname
```
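For example, a two-node `hostfile` is just one IP address per line (the addresses below are placeholders for your own nodes):

```
192.168.1.10
192.168.1.11
```

If the setup is correct, the command prints one hostname per node.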
**Note:** If the above command fails with `mpiexec.hydra: command not found`, activate the oneAPI environment:

```bash
source /opt/intel/oneapi/setvars.sh
```
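To avoid re-running this on every login, you can optionally append it to your shell profile (a sketch, assuming the default oneAPI install path):

```bash
# Load the oneAPI environment automatically on login (assumes /opt/intel/oneapi is the install location)
echo 'source /opt/intel/oneapi/setvars.sh' >> ~/.bashrc
```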
## Launch a distributed training job with TLT CLI
**Step 1:** Create a `hostfile` with a list of IP addresses of the participating nodes, as in the verification step above. Make sure the first IP address is that of the current node.

**Step 2:** Launch a distributed training job with the TLT CLI using the appropriate flags:
```bash
tlt train \
    -f pytorch \
    --model_name resnet50 \
    --dataset_name CIFAR10 \
    --output_dir $OUTPUT_DIR \
    --dataset_dir $DATASET_DIR \
    --distributed \
    --hostfile hostfile \
    --nnodes 2 \
    --nproc_per_node 2
```
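The command above assumes `$DATASET_DIR` and `$OUTPUT_DIR` are already exported; for example (the paths below are placeholders):

```bash
# Placeholder paths; point these at your own dataset and output locations
export DATASET_DIR=$HOME/datasets
export OUTPUT_DIR=$HOME/output
```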
(Optional): Add the `--use_horovod` flag to use Horovod for distributed training.
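For example, the same job launched with Horovod as the backend (a sketch; all other flags are unchanged from the command above):

```bash
tlt train -f pytorch --model_name resnet50 --dataset_name CIFAR10 \
    --output_dir $OUTPUT_DIR --dataset_dir $DATASET_DIR \
    --distributed --use_horovod --hostfile hostfile \
    --nnodes 2 --nproc_per_node 2
```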
## Troubleshooting
"Port already in use"
Might happen when you keyboard interrupt training.
Fix: Release the port from the terminal (or) log out and log in again to free the port.
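A minimal sketch for releasing the port by hand, assuming `lsof` is available (29500 is PyTorch's default master port; substitute the port named in your error message):

```bash
# Find the process still bound to the port
lsof -i :29500
# Terminate it to release the port (replace <PID> with the PID reported by lsof)
kill <PID>
```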
"HTTP Connection error"
Might happen if there are several attempts to train text classification model as it uses Hugging Face API to make calls to get dataset, model, tokenizer.
Fix: Wait for about few seconds and try again.
"TimeoutException" when using horovod
Might happen when horovod times out waiting for tasks to start.
Fix: Check connectivity between servers. You may need to increase the
--hvd-start-timeout
parameter if you have too many servers. Default value for timeout is 30 seconds.
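For example, raising the timeout to 120 seconds. This is a sketch that assumes `--hvd-start-timeout` is accepted alongside the other `tlt train` flags shown earlier; check `tlt train --help` if your version differs:

```bash
# Assumption: --hvd-start-timeout is passed with the same tlt train command
tlt train -f pytorch --model_name resnet50 --dataset_name CIFAR10 \
    --output_dir $OUTPUT_DIR --dataset_dir $DATASET_DIR \
    --distributed --use_horovod --hostfile hostfile \
    --nnodes 2 --nproc_per_node 2 --hvd-start-timeout 120
```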