.. role:: hidden
    :class: hidden-section

Advanced Amp Usage
===================================

GANs
----

GANs are an interesting synthesis of several topics below.  A `comprehensive example`_
is under construction.

.. _`comprehensive example`:
    https://github.com/NVIDIA/apex/tree/master/examples/dcgan

Gradient clipping
-----------------

Amp calls the params owned directly by the optimizer's ``param_groups`` the "master params."

These master params may be fully or partially distinct from ``model.parameters()``.
For example, with `opt_level="O2"`_, ``amp.initialize`` casts most model params to FP16,
creates an FP32 master param outside the model for each newly-FP16 model param,
and updates the optimizer's ``param_groups`` to point to these FP32 params.

The master params owned by the optimizer's ``param_groups`` may also fully coincide with the
model params, which is typically true for ``opt_level``\ s ``O0``, ``O1``, and ``O3``.

In all cases, correct practice is to clip the gradients of the params that are guaranteed to be
owned **by the optimizer's** ``param_groups``, instead of those retrieved via ``model.parameters()``.

Also, if Amp uses loss scaling, gradients must be clipped after they have been unscaled
(which occurs during exit from the ``amp.scale_loss`` context manager).

The following pattern should be correct for any ``opt_level``::

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Gradients are unscaled during context manager exit.
    # Now it's safe to clip.  Replace
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # with
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
    # or
    torch.nn.utils.clip_grad_value_(amp.master_params(optimizer), max_)

Note the use of the utility function ``amp.master_params(optimizer)``,
which returns a generator that iterates over the
params in the optimizer's ``param_groups``.
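
For intuition, ``amp.master_params`` behaves like the following sketch (an
illustration of what it iterates over, not necessarily the library's exact
implementation)::

    def master_params(optimizer):
        # Walk every param_group and yield each param it owns.  Under O2
        # these are the FP32 master params; under O0/O1/O3 they typically
        # coincide with the model params.
        for group in optimizer.param_groups:
            for p in group['params']:
                yield p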

Also note that ``clip_grad_norm_(amp.master_params(optimizer), max_norm)`` is invoked
*instead of*, not *in addition to*, ``clip_grad_norm_(model.parameters(), max_norm)``.

.. _`opt_level="O2"`:
    https://nvidia.github.io/apex/amp.html#o2-fast-mixed-precision

Custom/user-defined autograd functions
--------------------------------------

The old Amp API for `registering user functions`_ is still considered correct.  Functions must
be registered before calling ``amp.initialize``.

.. _`registering user functions`:
    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions
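
For example (a sketch; ``fused_op`` is a hypothetical user function), the old
API's ``@amp.float_function`` decorator registers a function that should always
run in FP32::

    from apex import amp

    # Register user functions before amp.initialize is called.
    @amp.float_function
    def fused_op(x, y):
        # Amp casts incoming floating-point arguments to FP32 for this call,
        # regardless of opt_level.
        return x @ y

    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")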

Forcing particular layers/functions to a desired type
------------------------------------------------------

I'm still working on a generalizable exposure for this that won't require user-side code divergence
across different ``opt_level``\ s.

Multiple models/optimizers/losses
---------------------------------

Initialization with multiple models/optimizers
**********************************************

``amp.initialize``'s optimizer argument may be a single optimizer or a list of optimizers,
as long as the output you accept has the same type.
Similarly, the ``model`` argument may be a single model or a list of models, as long as the accepted
output matches.  The following calls are all legal::

    model, optim = amp.initialize(model, optim, ...)
    model, [optim0, optim1] = amp.initialize(model, [optim0, optim1], ...)
    [model0, model1], optim = amp.initialize([model0, model1], optim, ...)
    [model0, model1], [optim0, optim1] = amp.initialize([model0, model1], [optim0, optim1], ...)

Backward passes with multiple optimizers
****************************************

Whenever you invoke a backward pass, the ``amp.scale_loss`` context manager must receive
**all the optimizers that own any params for which the current backward pass is creating gradients.**
This is true even if each optimizer owns only some, but not all, of the params that are about to
receive gradients.

If, for a given backward pass, there's only one optimizer whose params are about to receive gradients,
you may pass that optimizer directly to ``amp.scale_loss``.  Otherwise, you must pass the
list of optimizers whose params are about to receive gradients.  Example with 3 losses and 2 optimizers::

    # loss0 accumulates gradients only into params owned by optim0:
    with amp.scale_loss(loss0, optim0) as scaled_loss:
        scaled_loss.backward()

    # loss1 accumulates gradients only into params owned by optim1:
    with amp.scale_loss(loss1, optim1) as scaled_loss:
        scaled_loss.backward()

    # loss2 accumulates gradients into some params owned by optim0
    # and some params owned by optim1:
    with amp.scale_loss(loss2, [optim0, optim1]) as scaled_loss:
        scaled_loss.backward()

Optionally have Amp use a different loss scaler per-loss
********************************************************

By default, Amp maintains a single global loss scaler that will be used for all backward passes
(all invocations of ``with amp.scale_loss(...)``).  No additional arguments to ``amp.initialize``
or ``amp.scale_loss`` are required to use the global loss scaler.  The code snippets above with
multiple optimizers/backward passes use the single global loss scaler under the hood,
and they should "just work."

However, you can optionally tell Amp to maintain a loss scaler per-loss, which gives Amp increased
numerical flexibility.  This is accomplished by supplying the ``num_losses`` argument to
``amp.initialize`` (which tells Amp how many backward passes you plan to invoke, and therefore
how many loss scalers Amp should create), then supplying the ``loss_id`` argument to each of your
backward passes (which tells Amp the loss scaler to use for this particular backward pass)::

    model, [optim0, optim1] = amp.initialize(model, [optim0, optim1], ..., num_losses=3)

    with amp.scale_loss(loss0, optim0, loss_id=0) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss1, optim1, loss_id=1) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss2, [optim0, optim1], loss_id=2) as scaled_loss:
        scaled_loss.backward()

``num_losses`` and ``loss_id``\ s should be specified purely based on the set of
losses/backward passes.  The use of multiple optimizers, or association of single or
multiple optimizers with each backward pass, is unrelated.
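
For example, two losses that both feed a single optimizer still receive two
distinct ``loss_id``\ s (a sketch illustrating the point above)::

    # num_losses counts backward passes, not optimizers:
    model, optim = amp.initialize(model, optim, ..., num_losses=2)

    with amp.scale_loss(loss0, optim, loss_id=0) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss1, optim, loss_id=1) as scaled_loss:
        scaled_loss.backward()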

Gradient accumulation across iterations
---------------------------------------

The following should "just work," and properly accommodate multiple models/optimizers/losses, as well as
gradient clipping via the `instructions above`_::

    # If your intent is to simulate a larger batch size using gradient accumulation,
    # you can divide the loss by the number of accumulation iterations (so that gradients
    # will be averaged over that many iterations):
    loss = loss / iters_to_accumulate

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    # Every iters_to_accumulate iterations, call step() and reset gradients:
    if iteration % iters_to_accumulate == 0:
        # Gradient clipping if desired:
        # torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
        optimizer.step()
        optimizer.zero_grad()

As a minor performance optimization, you can pass ``delay_unscale=True``
to ``amp.scale_loss`` until you're ready to ``step()``.  You should only attempt ``delay_unscale=True``
if you're sure you know what you're doing, because the interaction with gradient clipping and
multiple models/optimizers/losses can become tricky::

    if iteration % iters_to_accumulate == 0:
        # Every iters_to_accumulate iterations, unscale and step:
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    else:
        # Otherwise, accumulate gradients; don't unscale or step.
        with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
            scaled_loss.backward()

.. _`instructions above`:
    https://nvidia.github.io/apex/advanced.html#gradient-clipping
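
Putting the pieces together, a complete accumulation loop might look like the
following (a sketch; ``loader``, ``criterion``, and ``iters_to_accumulate`` are
assumed to be defined by the surrounding script)::

    optimizer.zero_grad()
    for iteration, (input, target) in enumerate(loader):
        output = model(input)
        # Average gradients over the accumulation window:
        loss = criterion(output, target) / iters_to_accumulate

        # Leave gradients scaled on non-stepping iterations:
        step_now = (iteration + 1) % iters_to_accumulate == 0
        with amp.scale_loss(loss, optimizer,
                            delay_unscale=not step_now) as scaled_loss:
            scaled_loss.backward()

        if step_now:
            optimizer.step()
            optimizer.zero_grad()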

Custom data batch types
-----------------------

The intention of Amp is that you never need to cast your input data manually, regardless of
``opt_level``.  Amp accomplishes this by patching any models' ``forward`` methods to cast
incoming data appropriately for the ``opt_level``.  But to cast incoming data,
Amp needs to know how.  The patched ``forward`` will recognize and cast floating-point Tensors
(non-floating-point Tensors like IntTensors are not touched) and
Python containers of floating-point Tensors.  However, if you wrap your Tensors in a custom class,
the casting logic doesn't know how to drill
through the tough custom shell to access and cast the juicy Tensor meat within.  You need to tell
Amp how to cast your custom batch class, by assigning it a ``to`` method that accepts a ``torch.dtype``
(e.g., ``torch.float16`` or ``torch.float32``) and returns an instance of the custom batch cast to
``dtype``.  The patched ``forward`` checks for the presence of your ``to`` method, and will
invoke it with the correct type for the ``opt_level``.

Example::

    class CustomData(object):
        def __init__(self):
            self.tensor = torch.cuda.FloatTensor([1, 2, 3])

        def to(self, dtype):
            self.tensor = self.tensor.to(dtype)
            return self
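
After ``amp.initialize``, instances of ``CustomData`` can then be passed to the
model directly; with ``opt_level="O2"``, for example, the patched ``forward``
will call ``to(torch.float16)`` on the batch before your own ``forward`` runs
(a sketch)::

    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    batch = CustomData()
    # The patched forward detects batch's "to" method and invokes it with
    # the dtype appropriate for the opt_level, then calls the original forward.
    output = model(batch)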

.. warning::
    Amp also forwards numpy ndarrays without casting them.  If you send input data as a raw, unwrapped
    ndarray, then later use it to create a Tensor within your ``model.forward``, this Tensor's type will
    not depend on the ``opt_level``, and may or may not be correct.  Users are encouraged to pass
    castable data inputs (Tensors, collections of Tensors, or custom classes with a ``to`` method)
    wherever possible.

.. note::
    Amp does not call ``.cuda()`` on any Tensors for you.  Amp assumes that your original script
    is already set up to move Tensors from the host to the device as needed.