Network Training and Gradient Derivation

Automatic Differentiation

After the forward network is constructed, MindSpore provides an automatic differentiation interface for computing the gradients of the model. The automatic differentiation tutorial describes a number of gradient computation scenarios.
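
As a minimal, illustrative sketch of that interface (the function and input below are invented for demonstration), ms.grad and ms.value_and_grad turn a forward function into one that returns gradients:

import mindspore as ms
from mindspore import Tensor

# A simple differentiable function: f(x) = x**2 + 2*x
def forward(x):
    return x ** 2 + 2 * x

x = Tensor(3.0, ms.float32)

# ms.grad returns a new function that computes df/dx
grad_fn = ms.grad(forward)
print(grad_fn(x))    # 2*x + 2 -> 8.0

# ms.value_and_grad returns the function value and the gradient together
loss, grad = ms.value_and_grad(forward)(x)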

Network Training

The entire training network consists of the forward network (the network and its loss function), automatic gradient derivation, and the optimizer update. MindSpore provides three ways to implement this process:

  1. Encapsulate the network with Model and perform training by calling model.train or model.fit, as in the Model Training example.

  2. Use the prefabricated TrainOneStepCell and TrainOneStepWithLossScaleCell for the common training process and the training process with loss_scale respectively, as in Quick Start: Linear Fitting. A sketch of these first two methods follows this list.

  3. Customize the training Cell.
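
For reference, here is a minimal, hypothetical sketch of the first two methods; the network, loss function, optimizer, and dataset names below are placeholders rather than the components used in the linked examples.

import mindspore as ms
from mindspore import nn

# Placeholder components; substitute your own network, loss and data
net = nn.Dense(16, 1)
loss_fn = nn.MSELoss()
opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

# Method 1: encapsulate the pieces in Model and train with model.train
model = ms.Model(net, loss_fn=loss_fn, optimizer=opt)
# model.train(epoch=10, train_dataset=train_dataset)

# Method 2: wrap the network and loss, then step with TrainOneStepCell
loss_net = nn.WithLossCell(net, loss_fn)
train_net = nn.TrainOneStepCell(loss_net, opt)
# for data, label in train_dataset.create_tuple_iterator():
#     loss = train_net(data, label)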

Customizing training Cell

The first two methods are illustrated by the examples from the official website linked above. For a customized training Cell, we first review what TrainOneStepCell does:

import mindspore as ms
from mindspore import ops, nn
from mindspore.parallel._utils import (_get_device_num, _get_gradients_mean,
                                       _get_parallel_mode)
class TrainOneStepCell(nn.Cell):
    def __init__(self, network, optimizer):
        super(TrainOneStepCell, self).__init__(auto_prefix=False)
        self.network = network  # Network structure with loss
        self.network.set_grad()   # Required in PYNATIVE mode. When True, the backward network used to compute gradients is built when the forward network is executed.
        self.optimizer = optimizer   # Optimizer for parameter updates
        self.weights = self.optimizer.parameters    # Get the parameters of the optimizer

        # Related logic of parallel computation
        self.reducer_flag = False
        self.grad_reducer = ops.identity
        self.parallel_mode = _get_parallel_mode()
        self.reducer_flag = self.parallel_mode in (ms.ParallelMode.DATA_PARALLEL, ms.ParallelMode.HYBRID_PARALLEL)
        if self.reducer_flag:
            self.mean = _get_gradients_mean()
            self.degree = _get_device_num()
            self.grad_reducer = nn.DistributedGradReducer(self.weights, self.mean, self.degree)

    def construct(self, *inputs):
        loss, grads = ms.value_and_grad(self.network, grad_position=None, weights=self.weights)(*inputs)   # Compute the loss and the gradients of all weight Parameters (free variables)
        # grads = grad_op(grads)    # Some computing logic for the gradient can be added here, such as gradient clipping
        grads = self.grad_reducer(grads)  # Gradient aggregation
        self.optimizer(grads)    # Optimizer updates parameters
        return loss

The whole training process can be encapsulated in a single Cell that performs the forward computation, backward gradient derivation, and parameter update of the network. Once the gradients are obtained, we can apply special processing to them.
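
A hypothetical usage of the cell above might look like the following; net, loss_fn, and train_dataset are placeholders for your own components.

from mindspore import nn

# Placeholder forward network and loss; substitute your own
net = nn.Dense(16, 1)
loss_fn = nn.MSELoss()
loss_net = nn.WithLossCell(net, loss_fn)      # forward network with loss
opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

train_net = TrainOneStepCell(loss_net, opt)   # the custom cell defined above
train_net.set_train()                         # switch to training mode

# for data, label in train_dataset.create_tuple_iterator():
#     loss = train_net(data, label)           # one forward/backward/update step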

Gradient Clipping

When gradients explode or grow particularly large during training and make it unstable, you can consider adding gradient clipping. The following example shows the common scenario of clipping by global norm:

import mindspore as ms
from mindspore import nn, ops

_grad_scale = ops.MultitypeFuncGraph("grad_scale")

@_grad_scale.register("Tensor", "Tensor")
def tensor_grad_scale(scale, grad):
    return grad * ops.cast(ops.Reciprocal()(scale), ops.dtype(grad))

class MyTrainOneStepCell(nn.TrainOneStepWithLossScaleCell):
    """
    Network training package class with gradient clip.

    Appends an optimizer to the training network. After that, the construct
    function can be called to create the backward graph.

    Args:
        network (Cell): The training network.
        optimizer (Cell): Optimizer for updating the weights.
        scale_sense (Union[Number, Cell]): The loss-scaling value or its update cell. Default value is 1.
        grad_clip (bool): Whether to clip gradients. Default value is False.
    """

    def __init__(self, network, optimizer, scale_sense=1, grad_clip=False):
        if isinstance(scale_sense, (int, float)):
            scale_sense = nn.FixedLossScaleUpdateCell(scale_sense)
        super(MyTrainOneStepCell, self).__init__(network, optimizer, scale_sense)
        self.grad_clip = grad_clip

    def construct(self, x, img_shape, gt_bboxe, gt_label, gt_num):
        # Most of these are attributes and methods of the base class; refer to the corresponding base-class API for details
        weights = self.weights
        loss = self.network(x, img_shape, gt_bboxe, gt_label, gt_num)
        scaling_sens = self.scale_sense
        # Start floating point overflow detection. Create and clear overflow detection status
        status, scaling_sens = self.start_overflow_check(loss, scaling_sens)
        # Multiply the gradient by a scale to prevent the gradient from overflowing
        scaling_sens_filled = ops.ones_like(loss) * ops.cast(scaling_sens, ops.dtype(loss))
        grads = self.grad(self.network, weights)(x, img_shape, gt_bboxe, gt_label, gt_num, scaling_sens_filled)
        # Calculate the true gradient value by dividing the obtained gradient by the scale
        grads = self.hyper_map(ops.partial(_grad_scale, scaling_sens), grads)
        # Gradient clipping
        if self.grad_clip:
            grads = ops.clip_by_global_norm(grads, clip_norm=1.0)  # clip_norm defaults to 1.0
        # Gradient aggregation
        grads = self.grad_reducer(grads)

        # Get floating point overflow status
        cond = self.get_overflow_status(status, grads)
        # Calculate loss scaling factor based on overflow state during dynamic loss scale
        overflow = self.process_loss_scale(cond)
        # If there is no overflow, execute the optimizer to update the parameters
        if not overflow:
            self.optimizer(grads)
        return loss, cond, scaling_sens.value()
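
Standalone, ops.clip_by_global_norm computes the global L2 norm across all passed tensors and, if that norm exceeds clip_norm, scales every tensor by clip_norm / global_norm. The gradient values below are illustrative:

import mindspore as ms
from mindspore import Tensor, ops

# Two illustrative "gradients"; global norm = sqrt(3^2 + 4^2 + 5^2 + 12^2) ~= 13.93
grads = (Tensor([3.0, 4.0], ms.float32), Tensor([5.0, 12.0], ms.float32))
clipped = ops.clip_by_global_norm(grads, clip_norm=1.0)
# Each tensor is scaled by 1.0 / 13.93, so the clipped global norm is 1.0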

Gradient Accumulation

Gradient accumulation is a technique that splits the data samples used to train a neural network into several small micro-batches and computes them sequentially, accumulating their gradients before each parameter update. It addresses the OOM (out of memory) problem in which a batch size that is too large, or a model that is too big to load, prevents the network from being trained.
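
A minimal sketch of the idea, assuming the functional ms.value_and_grad interface; the network, loss, optimizer, and accumulation step count below are illustrative placeholders:

import mindspore as ms
from mindspore import nn, ops

# Placeholder components; substitute your own
net = nn.Dense(16, 1)
loss_fn = nn.MSELoss()
opt = nn.SGD(net.trainable_params(), learning_rate=0.01)
accumulation_steps = 4    # number of micro-batches per parameter update

def forward(data, label):
    return loss_fn(net(data), label)

grad_fn = ms.value_and_grad(forward, grad_position=None, weights=net.trainable_params())

# Zero-initialized buffers with the same shapes as the parameters
acc_grads = [ops.zeros_like(p) for p in net.trainable_params()]

# for step, (data, label) in enumerate(train_dataset, start=1):
#     loss, grads = grad_fn(data, label)
#     acc_grads = [a + g for a, g in zip(acc_grads, grads)]
#     if step % accumulation_steps == 0:
#         # Average the accumulated gradients and apply one optimizer update
#         opt(tuple(a / accumulation_steps for a in acc_grads))
#         acc_grads = [ops.zeros_like(p) for p in net.trainable_params()]

Dividing by accumulation_steps keeps the effective gradient scale comparable to training with the full batch; summing without the division is also common when each micro-batch loss is already averaged.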

Please refer to Gradient Accumulation for details.