|
.. role:: hidden
    :class: hidden-section
|
|
|
Advanced Amp Usage
===================================
|
|
|
GANs
----
|
|
|
GANs are an interesting synthesis of several of the topics below. A `comprehensive example`_
is under construction.
|
|
|
.. _`comprehensive example`:
    https://github.com/NVIDIA/apex/tree/master/examples/dcgan
|
|
|
Gradient clipping
-----------------
|
Amp calls the params owned directly by the optimizer's ``param_groups`` the "master params." |
|
|
|
These master params may be fully or partially distinct from ``model.parameters()``.
For example, with `opt_level="O2"`_, ``amp.initialize`` casts most model params to FP16,
creates an FP32 master param outside the model for each newly-FP16 model param,
and updates the optimizer's ``param_groups`` to point to these FP32 params.
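
The following is a minimal sketch (assuming ``opt_level="O2"`` and an already-constructed ``model``
and ``optimizer``) that illustrates how the model params and the master params can end up with
different dtypes::

    # Sketch only: with opt_level="O2", most model params are cast to FP16
    # (batchnorm params may remain FP32), while the optimizer's param_groups
    # hold separate FP32 master params.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    print({p.dtype for p in model.parameters()})            # mostly torch.float16
    print({p.dtype for p in amp.master_params(optimizer)})  # {torch.float32}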
|
|
|
The master params owned by the optimizer's ``param_groups`` may also fully coincide with the
model params, which is typically true for ``opt_level``\ s ``O0``, ``O1``, and ``O3``.
|
|
|
In all cases, correct practice is to clip the gradients of the params that are guaranteed to be
owned **by the optimizer's** ``param_groups``, instead of those retrieved via ``model.parameters()``.
|
|
|
Also, if Amp uses loss scaling, gradients must be clipped after they have been unscaled
(which occurs during exit from the ``amp.scale_loss`` context manager).
|
|
|
The following pattern should be correct for any ``opt_level``:: |
|
|
|
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Gradients are unscaled during context manager exit.
    # Now it's safe to clip.  Replace
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # with
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
    # or
    torch.nn.utils.clip_grad_value_(amp.master_params(optimizer), max_)
|
|
|
Note the use of the utility function ``amp.master_params(optimizer)``,
which returns a generator expression that iterates over the
params in the optimizer's ``param_groups``.
|
|
|
Also note that ``clip_grad_norm_(amp.master_params(optimizer), max_norm)`` is invoked
*instead of*, not *in addition to*, ``clip_grad_norm_(model.parameters(), max_norm)``.
|
|
|
.. _`opt_level="O2"`:
    https://nvidia.github.io/apex/amp.html#o2-fast-mixed-precision
|
|
|
Custom/user-defined autograd functions
--------------------------------------
|
|
|
The old Amp API for `registering user functions`_ is still considered correct.  Functions must
be registered before calling ``amp.initialize``.
|
|
|
.. _`registering user functions`:
    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions
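
As a brief, hedged sketch of what such a registration can look like (the module ``my_ops`` and its
function ``fused_op`` are hypothetical placeholders), using one of the registration calls provided
by ``apex.amp``::

    from apex import amp
    import my_ops  # hypothetical user module exposing a custom autograd function `fused_op`

    # Register the user function to run in FP16 (amp.register_float_function
    # forces FP32 instead).  Registration must happen before amp.initialize.
    amp.register_half_function(my_ops, "fused_op")

    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")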
|
|
|
Forcing particular layers/functions to a desired type
------------------------------------------------------
|
|
|
I'm still working on a generalizable exposure for this that won't require user-side code divergence
across different ``opt_level``\ s.
|
|
|
Multiple models/optimizers/losses
---------------------------------
|
|
|
Initialization with multiple models/optimizers
***********************************************
|
|
|
``amp.initialize``'s optimizer argument may be a single optimizer or a list of optimizers,
as long as you accept the returned output in the same format.
Similarly, the ``model`` argument may be a single model or a list of models, as long as the accepted
output matches.  The following calls are all legal::
|
|
|
    model, optim = amp.initialize(model, optim, ...)
    model, [optim0, optim1] = amp.initialize(model, [optim0, optim1], ...)
    [model0, model1], optim = amp.initialize([model0, model1], optim, ...)
    [model0, model1], [optim0, optim1] = amp.initialize([model0, model1], [optim0, optim1], ...)
|
|
|
Backward passes with multiple optimizers
****************************************
|
|
|
Whenever you invoke a backward pass, the ``amp.scale_loss`` context manager must receive
**all the optimizers that own any params for which the current backward pass is creating gradients.**
This is true even if each optimizer owns only some, but not all, of the params that are about to
receive gradients.
|
|
|
If, for a given backward pass, there's only one optimizer whose params are about to receive gradients,
you may pass that optimizer directly to ``amp.scale_loss``.  Otherwise, you must pass the
list of optimizers whose params are about to receive gradients.  Example with 3 losses and 2 optimizers::
|
|
|
    # loss0 accumulates gradients only into params owned by optim0:
    with amp.scale_loss(loss0, optim0) as scaled_loss:
        scaled_loss.backward()

    # loss1 accumulates gradients only into params owned by optim1:
    with amp.scale_loss(loss1, optim1) as scaled_loss:
        scaled_loss.backward()

    # loss2 accumulates gradients into some params owned by optim0
    # and some params owned by optim1:
    with amp.scale_loss(loss2, [optim0, optim1]) as scaled_loss:
        scaled_loss.backward()
|
|
|
Optionally have Amp use a different loss scaler per-loss
*********************************************************
|
|
|
By default, Amp maintains a single global loss scaler that will be used for all backward passes
(all invocations of ``with amp.scale_loss(...)``).  No additional arguments to ``amp.initialize``
or ``amp.scale_loss`` are required to use the global loss scaler.  The code snippets above with
multiple optimizers/backward passes use the single global loss scaler under the hood,
and they should "just work."
|
|
|
However, you can optionally tell Amp to maintain a loss scaler per-loss, which gives Amp increased
numerical flexibility.  This is accomplished by supplying the ``num_losses`` argument to
``amp.initialize`` (which tells Amp how many backward passes you plan to invoke, and therefore
how many loss scalers Amp should create), then supplying the ``loss_id`` argument to each of your
backward passes (which tells Amp which loss scaler to use for that particular backward pass)::
|
|
|
    model, [optim0, optim1] = amp.initialize(model, [optim0, optim1], ..., num_losses=3)

    with amp.scale_loss(loss0, optim0, loss_id=0) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss1, optim1, loss_id=1) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss2, [optim0, optim1], loss_id=2) as scaled_loss:
        scaled_loss.backward()
|
|
|
``num_losses`` and ``loss_id``\ s should be specified purely based on the set of
losses/backward passes.  The use of multiple optimizers, or association of single or
multiple optimizers with each backward pass, is unrelated.
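
For example (a small sketch along the lines of the snippets above), two losses that both feed the
same optimizer still get their own loss scalers::

    model, optim = amp.initialize(model, optim, ..., num_losses=2)

    # loss_ids index losses/backward passes, not optimizers:
    with amp.scale_loss(loss_a, optim, loss_id=0) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss_b, optim, loss_id=1) as scaled_loss:
        scaled_loss.backward()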
|
|
|
Gradient accumulation across iterations
---------------------------------------
|
|
|
The following should "just work," and properly accommodate multiple models/optimizers/losses, as well as
gradient clipping via the `instructions above`_::
|
|
|
    # If your intent is to simulate a larger batch size using gradient accumulation,
    # you can divide the loss by the number of accumulation iterations (so that gradients
    # will be averaged over that many iterations):
    loss = loss / iters_to_accumulate

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    # Every iters_to_accumulate iterations, call step() and reset gradients:
    if iter % iters_to_accumulate == 0:
        # Gradient clipping if desired:
        # torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
        optimizer.step()
        optimizer.zero_grad()
|
|
|
As a minor performance optimization, you can pass ``delay_unscale=True``
to ``amp.scale_loss`` until you're ready to ``step()``.  You should only attempt ``delay_unscale=True``
if you're sure you know what you're doing, because the interaction with gradient clipping and
multiple models/optimizers/losses can become tricky::
|
|
|
    if iter % iters_to_accumulate == 0:
        # Every iters_to_accumulate iterations, unscale and step:
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    else:
        # Otherwise, accumulate gradients; don't unscale or step.
        with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
            scaled_loss.backward()
|
|
|
.. _`instructions above`:
    https://nvidia.github.io/apex/advanced.html#gradient-clipping
|
|
|
Custom data batch types
-----------------------
|
|
|
The intention of Amp is that you never need to cast your input data manually, regardless of
``opt_level``.  Amp accomplishes this by patching any models' ``forward`` methods to cast
incoming data appropriately for the ``opt_level``.  But to cast incoming data,
Amp needs to know how.  The patched ``forward`` will recognize and cast floating-point Tensors
(non-floating-point Tensors like IntTensors are not touched) and
Python containers of floating-point Tensors.  However, if you wrap your Tensors in a custom class,
the casting logic doesn't know how to drill
through the tough custom shell to access and cast the juicy Tensor meat within.  You need to tell
Amp how to cast your custom batch class, by assigning it a ``to`` method that accepts a ``torch.dtype``
(e.g., ``torch.float16`` or ``torch.float32``) and returns an instance of the custom batch cast to
``dtype``.  The patched ``forward`` checks for the presence of your ``to`` method, and will
invoke it with the correct type for the ``opt_level``.
|
|
|
Example:: |
|
|
|
    class CustomData(object):
        def __init__(self):
            self.tensor = torch.cuda.FloatTensor([1, 2, 3])

        def to(self, dtype):
            self.tensor = self.tensor.to(dtype)
            return self
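
As a hedged usage sketch (assuming ``model`` has already been passed through ``amp.initialize``),
the patched ``forward`` detects the ``to`` method and casts the batch for you::

    data = CustomData()
    # With opt_level="O2", for example, data.tensor arrives inside model.forward as FP16.
    output = model(data)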
|
|
|
.. warning:: |
|
|
|
    Amp also forwards numpy ndarrays without casting them.  If you send input data as a raw, unwrapped
    ndarray, then later use it to create a Tensor within your ``model.forward``, this Tensor's type will
    not depend on the ``opt_level``, and may or may not be correct.  Users are encouraged to pass
    castable data inputs (Tensors, collections of Tensors, or custom classes with a ``to`` method)
    wherever possible.
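
    For example, a raw ndarray can be converted to a Tensor before it reaches the model (a small
    sketch; ``np_batch`` is a hypothetical numpy array already on the host)::

        batch = torch.from_numpy(np_batch).cuda()
        output = model(batch)  # the patched forward can now cast `batch` per the opt_level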
|
|
|
.. note:: |
|
|
|
    Amp does not call ``.cuda()`` on any Tensors for you.  Amp assumes that your original script
    is already set up to move Tensors from the host to the device as needed.
|
|