# Distributed Training with PyTorch and Intel® Transfer Learning Tool

## Multinode setup

### Create and activate a Python3 virtual environment

We encourage you to use a Python virtual environment (virtualenv or conda) for consistent package management. Use the same method on all participating nodes; mixing the two configurations is not supported. After completing either method, you can run the quick sanity check at the end of this section.

There are two ways to do this:

a. Using `virtualenv`:

1. Login to one of the participating nodes.

2. Create and activate a new python3 virtualenv

```
virtualenv -p python3 tlt_dev_venv
source tlt_dev_venv/bin/activate
```

3. Install Intel® Transfer Learning Tool (see main [README](/README.md))
```
pip install --editable .
```

4. Install multinode dependencies using the shell script. Alternatively, you can build `torch_ccl` from source ([intel/torch-ccl](https://github.com/intel/torch-ccl)).
```
bash tlt/distributed/pytorch/pyt_hvd_setup.sh
```

b. Or `conda`:

1. Login to one of the participating nodes.

2. Create and activate a new conda environment
```
conda create -n tlt_dev_venv python=3.8 --yes
conda activate tlt_dev_venv
```

3. Install Intel® Transfer Learning Tool (see main [README](/README.md))
```
pip install --editable .
```

4. Install dependencies from the shell script
```
bash tlt/distributed/pytorch/run_install.sh
```
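
Whichever method you chose, a quick sanity check on each node can confirm the install before moving on. This is only a sketch: it assumes the editable install exposes the `tlt` Python package (as the repository layout suggests) and that the setup script installed PyTorch; `horovodrun --check-build` applies only if Horovod was installed.
```
# Verify that the tlt package and PyTorch import cleanly
python -c "import tlt, torch; print(torch.__version__)"
# If Horovod was installed, list the frameworks it was built with
horovodrun --check-build
```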

## Verify multinode setup

Create a `hostfile` with a list of IP addresses of the participating nodes, one per line.
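A minimal example, where the addresses below are placeholders for your nodes' actual IP addresses:
```
192.168.1.10
192.168.1.11
```
Then run the following command. You should see the hostname of each node listed in the hostfile: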
```
mpiexec.hydra -ppn 1 -f hostfile hostname
```
**Note:** If the command fails with `mpiexec.hydra: command not found`, activate the oneAPI environment first:
```
source /opt/intel/oneapi/setvars.sh
```

## Launch a distributed training job with TLT CLI

**Step 1:** Create a `hostfile` with a list of IP addresses of the participating nodes, in the same format as the example above. Make sure the first IP address is that of the current node (the node you launch the job from).

**Step 2:** Launch a distributed training job with the TLT CLI using the appropriate flags.
```
tlt train \
    -f pytorch \
    --model_name resnet50 \
    --dataset_name CIFAR10 \
    --output_dir $OUTPUT_DIR \
    --dataset_dir $DATASET_DIR \
    --distributed \
    --hostfile hostfile \
    --nnodes 2 \
    --nproc_per_node 2
```
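
The command assumes the `DATASET_DIR` and `OUTPUT_DIR` environment variables are set; the paths below are placeholders, adjust them for your machines:
```
export DATASET_DIR=$HOME/datasets
export OUTPUT_DIR=$HOME/output
mkdir -p $DATASET_DIR $OUTPUT_DIR
```
With `--nnodes 2` and `--nproc_per_node 2`, the job runs 2 worker processes on each of the 2 nodes listed in the hostfile, 4 workers in total.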

**(Optional)**: Use the `--use_horovod` flag to use Horovod for distributed training.

## Troubleshooting

- ***"Port already in use"***
    
    Might happen when a previous training run was interrupted (for example with Ctrl+C) and its process is still holding the port.

    **Fix:** Kill the process holding the port (see the sketch below), or log out and log in again to free it.
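
    One way to find and stop the leftover process (a sketch: the port number depends on your launcher configuration, 29500 is only the common PyTorch distributed default and may differ in your setup):
    ```
    lsof -i :29500   # find the PID of the process holding the port
    kill <PID>       # replace <PID> with the number reported above
    ```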

- ***"HTTP Connection error"***

    Might happen after several attempts to train a text classification model, since it uses the Hugging Face API to download the dataset, model, and tokenizer.

    **Fix:** Wait a few seconds and try again.

- ***"TimeoutException"*** when using horovod

    Might happen when horovod times out waiting for tasks to start. 
    
    **Fix:** Check connectivity between servers. You may need to increase the `--hvd-start-timeout` parameter if you have too many servers. Default value for timeout is 30 seconds.
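
    For example, assuming the flag is passed straight to `tlt train` (check `tlt train --help` for the exact option name in your version), a longer timeout could look like:
    ```
    tlt train \
        -f pytorch \
        --model_name resnet50 \
        --dataset_name CIFAR10 \
        --output_dir $OUTPUT_DIR \
        --dataset_dir $DATASET_DIR \
        --distributed \
        --use_horovod \
        --hostfile hostfile \
        --nnodes 2 \
        --nproc_per_node 2 \
        --hvd-start-timeout 120
    ```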