|
# Mixed Precision ImageNet Training in PyTorch |
|
|
|
`main_amp.py` is based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet). |
|
It implements Automatic Mixed Precision (Amp) training of popular model architectures, such as ResNet, AlexNet, and VGG, on the ImageNet dataset. Command-line flags forwarded to `amp.initialize` are used to easily manipulate and switch between various pure and mixed precision "optimization levels" or `opt_level`s. For a detailed explanation of `opt_level`s, see the [updated API guide](https://nvidia.github.io/apex/amp.html). |
|
|
|
Three lines enable Amp: |
|
``` |
|
# Added after model and optimizer construction |
|
model, optimizer = amp.initialize(model, optimizer, flags...) |
|
... |
|
# loss.backward() changed to: |
|
with amp.scale_loss(loss, optimizer) as scaled_loss: |
|
scaled_loss.backward() |
|
``` |
|
|
|
With the new Amp API **you never need to explicitly convert your model, or the input data, to half().** |
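
For context, here is a minimal single-GPU training step with those three lines in place. This is a sketch with placeholder data and hyperparameters, not an excerpt of `main_amp.py`:

```
import torch
import torchvision
from apex import amp

# Placeholder model, optimizer, and random data standing in for ImageNet.
model = torchvision.models.resnet50().cuda()           # model stays FP32; Amp handles casting
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss().cuda()
dataset = torch.utils.data.TensorDataset(
    torch.randn(32, 3, 224, 224), torch.randint(0, 1000, (32,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# Added after model and optimizer construction
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for images, target in loader:
    images, target = images.cuda(), target.cuda()       # note: no .half() on the inputs
    loss = criterion(model(images), target)
    optimizer.zero_grad()
    # loss.backward() changed to:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```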
|
|
|
## Requirements |
|
|
|
- Download the ImageNet dataset and move validation images to labeled subfolders |
|
- The following script may be helpful: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh |
|
|
|
## Training |
|
|
|
To train a model, create softlinks to the ImageNet dataset, then run `main_amp.py` with the desired model architecture, as shown in `Example commands` below.
|
|
|
The default learning rate schedule is set for ResNet50. The `main_amp.py` script rescales the learning rate according to the global batch size (number of distributed processes \* per-process minibatch size).
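
For example, a sketch of that scaling rule, assuming (as is common) that the base learning rate is defined for a global batch size of 256:

```
# Hypothetical values for illustration; the base-256 convention is an assumption.
base_lr = 0.1                # learning rate tuned for a global batch of 256
per_process_batch = 224      # --b
world_size = 2               # number of distributed processes

global_batch = per_process_batch * world_size   # 448
lr = base_lr * global_batch / 256.0             # linear scaling rule -> 0.175
```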
|
|
|
## Example commands |
|
|
|
**Note:** batch size `--b 224` assumes your GPUs have >=16GB of onboard memory. You may be able to increase this to 256, but that's cutting it close, so it may run out of memory with some PyTorch versions.
|
|
|
**Note:** All of the following use 4 dataloader subprocesses (`--workers 4`) to reduce potential |
|
CPU data loading bottlenecks. |
|
|
|
**Note:** `--opt-level` `O1` and `O2` both use dynamic loss scaling by default unless manually overridden. |
|
`--opt-level` `O0` and `O3` (the "pure" training modes) do not use loss scaling by default. |
|
`O0` and `O3` can be told to use loss scaling via manual overrides, but using loss scaling with `O0` |
|
(pure FP32 training) does not really make sense, and will trigger a warning. |
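
These command-line overrides are forwarded to `amp.initialize` as keyword arguments. A sketch of the mapping, using the keyword names from the Amp API (exactly how `main_amp.py` forwards its flags is an assumption, and the model here is a placeholder):

```
import torch
from apex import amp

model = torch.nn.Linear(10, 10).cuda()       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model, optimizer = amp.initialize(
    model, optimizer,
    opt_level="O1",             # --opt-level
    loss_scale=128.0,           # --loss-scale ("dynamic" or None keeps the opt_level default)
    keep_batchnorm_fp32=None,   # --keep-batchnorm-fp32 (only meaningful for O2/O3)
)
```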
|
|
|
Softlink training and validation datasets into the current directory: |
|
``` |
|
$ ln -sf /data/imagenet/train-jpeg/ train |
|
$ ln -sf /data/imagenet/val-jpeg/ val |
|
``` |
|
|
|
### Summary |
|
|
|
Amp allows easy experimentation with various pure and mixed precision options. |
|
``` |
|
$ python main_amp.py -a resnet50 --b 128 --workers 4 --opt-level O0 ./ |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O3 ./ |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O3 --keep-batchnorm-fp32 True ./ |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 ./ |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 --loss-scale 128.0 ./ |
|
$ python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 ./ |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./ |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 --loss-scale 128.0 ./ |
|
$ python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./ |
|
``` |
|
Options are explained below. Again, the [updated API guide](https://nvidia.github.io/apex/amp.html) provides more detail. |
|
|
|
#### `--opt-level O0` (FP32 training) and `O3` (FP16 training) |
|
|
|
"Pure FP32" training: |
|
``` |
|
$ python main_amp.py -a resnet50 --b 128 --workers 4 --opt-level O0 ./ |
|
``` |
|
"Pure FP16" training: |
|
``` |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O3 ./ |
|
``` |
|
FP16 training with FP32 batchnorm: |
|
``` |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O3 --keep-batchnorm-fp32 True ./ |
|
``` |
|
Keeping the batchnorms in FP32 improves stability and allows PyTorch to use cudnn batchnorms, which significantly increases speed for ResNet50.
|
|
|
The `O3` options might not converge, because they are not true mixed precision. |
|
However, they can be useful to establish "speed of light" performance for |
|
your model, which provides a baseline for comparison with `O1` and `O2`. |
|
For ResNet50 in particular, `--opt-level O3 --keep-batchnorm-fp32 True` establishes
|
the "speed of light." (Without `--keep-batchnorm-fp32`, it's slower, because it does |
|
not use cudnn batchnorm.) |
|
|
|
#### `--opt-level O1` (Official Mixed Precision recipe, recommended for typical use) |
|
|
|
`O1` patches Torch functions to cast inputs according to a whitelist-blacklist model. |
|
FP16-friendly (Tensor Core) ops like gemms and convolutions run in FP16, while ops |
|
that benefit from FP32, like batchnorm and softmax, run in FP32. |
|
Also, dynamic loss scaling is used by default. |
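
As a small illustration of the whitelist behavior (a sketch, assuming a CUDA GPU; not part of `main_amp.py`), a convolution returns FP16 output under `O1` even though its input and weights are FP32:

```
import torch
from apex import amp

conv = torch.nn.Conv2d(3, 64, kernel_size=3).cuda()     # weights stay FP32
optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)
conv, optimizer = amp.initialize(conv, optimizer, opt_level="O1")

x = torch.randn(8, 3, 224, 224, device="cuda")           # FP32 input
print(conv(x).dtype)                                      # expected: torch.float16 (whitelisted op)
```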
|
``` |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 ./ |
|
``` |
|
`O1` overridden to use static loss scaling: |
|
``` |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 --loss-scale 128.0 ./
|
``` |
|
Distributed training with 2 processes (1 GPU per process, see **Distributed training** below |
|
for more detail) |
|
``` |
|
$ python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 ./ |
|
``` |
|
For best performance, set `--nproc_per_node` equal to the total number of GPUs on the node |
|
to use all available resources. |
|
|
|
#### `--opt-level O2` ("Almost FP16" mixed precision. More dangerous than O1.) |
|
|
|
`O2` exists mainly to support some internal use cases. Please prefer `O1`. |
|
|
|
`O2` casts the model to FP16, keeps batchnorms in FP32, |
|
maintains master weights in FP32, and implements |
|
dynamic loss scaling by default. (Unlike `--opt-level O1`, `--opt-level O2` does not patch Torch functions.)
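
One way to see this (a sketch, not part of `main_amp.py`): after `amp.initialize` with `O2`, the model's parameters are FP16 while the optimizer steps on FP32 master copies:

```
import torch
import torchvision
from apex import amp

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

print(next(model.parameters()).dtype)                # torch.float16 (cast model weights)
print(optimizer.param_groups[0]["params"][0].dtype)  # torch.float32 (master weights)
```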
|
``` |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./ |
|
``` |
|
"Fast mixed precision" overridden to use static loss scaling: |
|
``` |
|
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 --loss-scale 128.0 ./ |
|
``` |
|
Distributed training with 2 processes (1 GPU per process) |
|
``` |
|
$ python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./ |
|
``` |
|
|
|
## Distributed training |
|
|
|
`main_amp.py` optionally uses `apex.parallel.DistributedDataParallel` (DDP) for multiprocess training with one GPU per process. |
|
``` |
|
model = apex.parallel.DistributedDataParallel(model) |
|
``` |
|
is a drop-in replacement for |
|
``` |
|
model = torch.nn.parallel.DistributedDataParallel(model, |
|
                                                  device_ids=[args.local_rank],
|
                                                  output_device=args.local_rank)
|
``` |
|
(Torch DDP permits multiple GPUs per process, so it requires you to manually specify the device to run on and the output device; Apex DDP uses only the current device by default.)
|
|
|
The choice of DDP wrapper (Torch or Apex) is orthogonal to the use of Amp and other Apex tools. It is safe to use `apex.amp` with either `torch.nn.parallel.DistributedDataParallel` or `apex.parallel.DistributedDataParallel`. In the future, I may add some features that permit optional tighter integration between `Amp` and `apex.parallel.DistributedDataParallel` for marginal performance benefits, but currently, there's no compelling reason to use Apex DDP versus Torch DDP for most models. |
|
|
|
To use DDP with `apex.amp`, the only gotcha is that |
|
``` |
|
model, optimizer = amp.initialize(model, optimizer, flags...) |
|
``` |
|
must precede |
|
``` |
|
model = DDP(model) |
|
``` |
|
If DDP wrapping occurs before `amp.initialize`, `amp.initialize` will raise an error. |
|
|
|
With both Apex DDP and Torch DDP, you must also call `torch.cuda.set_device(args.local_rank)` within |
|
each process prior to initializing your model or any other tensors. |
|
More information can be found in the docs for the |
|
Pytorch multiprocess launcher module [torch.distributed.launch](https://pytorch.org/docs/stable/distributed.html#launch-utility). |
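
Putting those requirements together, a per-process setup sketch (assuming launch via `torch.distributed.launch`, which supplies `--local_rank`; not a verbatim excerpt of `main_amp.py`):

```
import argparse
import torch
import torchvision
import apex
from apex import amp

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)    # before creating the model or any other tensors
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# amp.initialize must come first...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# ...then DDP wrapping (Apex DDP shown; Torch DDP also works)
model = apex.parallel.DistributedDataParallel(model)
```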
|
|
|
`main_amp.py` is written to interact with |
|
[torch.distributed.launch](https://pytorch.org/docs/master/distributed.html#launch-utility), |
|
which spawns multiprocess jobs using the following syntax: |
|
``` |
|
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main_amp.py args... |
|
``` |
|
`NUM_GPUS` should be less than or equal to the number of visible GPU devices on the node. The use of `torch.distributed.launch` is unrelated to the choice of DDP wrapper. It is safe to use either Apex DDP or Torch DDP with `torch.distributed.launch`.
|
|
|
Optionally, one can run imagenet with synchronized batch normalization across processes by adding |
|
`--sync_bn` to the `args...` |
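
Synchronized batchnorm is available through the Apex API; a sketch of the conversion (whether `main_amp.py` uses exactly this call is an assumption):

```
import apex
import torchvision

model = torchvision.models.resnet50()
# Replace every torch.nn.BatchNorm*d with apex.parallel.SyncBatchNorm, which reduces
# batch statistics across distributed processes during training. The conversion is
# typically done before moving the model to GPU and wrapping it with DDP.
model = apex.parallel.convert_syncbn_model(model)
```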
|
|
|
## Deterministic training (for debugging purposes) |
|
|
|
Running with the `--deterministic` flag should produce bitwise identical outputs run-to-run, |
|
regardless of what other options are used (see [Pytorch docs on reproducibility](https://pytorch.org/docs/stable/notes/randomness.html)). |
|
Since `--deterministic` disables `torch.backends.cudnn.benchmark`, `--deterministic` may |
|
cause a modest performance decrease. |
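
For reference, a sketch of the settings such a flag typically toggles (the exact implementation inside `main_amp.py` may differ):

```
import random
import numpy as np
import torch

torch.backends.cudnn.benchmark = False      # disable the cudnn autotuner (the source of the slowdown)
torch.backends.cudnn.deterministic = True   # force deterministic cudnn kernels
torch.manual_seed(0)                        # hypothetical fixed seeds
random.seed(0)
np.random.seed(0)
```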
|
|
|
## Profiling |
|
|
|
If you're curious how the network actually looks on the CPU and GPU timelines (for example, how good is the overall utilization? Is the prefetcher really overlapping data transfers?), try profiling `main_amp.py`.
|
[Detailed instructions can be found here](https://gist.github.com/mcarilli/213a4e698e4a0ae2234ddee56f4f3f95). |
|
|