# Gradient Operation

`Ascend` `GPU` `CPU` `Model Development`

[![View Source On Gitee](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_source_en.png)](https://gitee.com/mindspore/docs/blob/r1.5/docs/mindspore/programming_guide/source_en/grad_operation.md)

## Overview

`GradOperation` is used to generate the gradient of the input function. The `get_all`, `get_by_list`, and `sens_param` parameters control how the gradient is calculated. For details, see the [mindspore API](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.GradOperation.html).

The following is an example of using `GradOperation`.

## First-order Derivation

The first-order derivation method of MindSpore is `mindspore.ops.GradOperation(get_all=False, get_by_list=False, sens_param=False)`. When `get_all` is set to `False`, only the derivative of the first input is computed; when it is set to `True`, the derivatives of all inputs are computed. When `get_by_list` is set to `False`, weight derivation is not performed; when it is set to `True`, weight derivation is performed. `sens_param` scales the output value of the network to change the final gradient; therefore, its dimension must be consistent with the output dimension.

The following uses the first-order derivation of the MatMul operator for in-depth analysis. For details about the complete sample code, see [First-order Derivation Sample Code](https://gitee.com/mindspore/docs/tree/r1.5/docs/sample_code/high_order_differentiation/first_order).

### Input Derivation

The input derivation code is as follows:

```python
import numpy as np
import mindspore.context as context
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor
from mindspore import ParameterTuple, Parameter
from mindspore import dtype as mstype

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.matmul = ops.MatMul()
        self.z = Parameter(Tensor(np.array([1.0], np.float32)), name='z')

    def construct(self, x, y):
        x = x * self.z
        out = self.matmul(x, y)
        return out

class GradNetWrtX(nn.Cell):
    def __init__(self, net):
        super(GradNetWrtX, self).__init__()
        self.net = net
        self.grad_op = ops.GradOperation()

    def construct(self, x, y):
        gradient_function = self.grad_op(self.net)
        return gradient_function(x, y)

x = Tensor([[0.8, 0.6, 0.2], [1.8, 1.3, 1.1]], dtype=mstype.float32)
y = Tensor([[0.11, 3.3, 1.1], [1.1, 0.2, 1.4], [1.1, 2.2, 0.3]], dtype=mstype.float32)
output = GradNetWrtX(Net())(x, y)
print(output)
```

The output is as follows:

```text
[[4.5099998 2.7       3.6000001]
 [4.5099998 2.7       3.6000001]]
```

To facilitate analysis, the inputs `x`, `y`, and `z` can be expressed as follows:

```python
x = Tensor([[x1, x2, x3], [x4, x5, x6]])
y = Tensor([[y1, y2, y3], [y4, y5, y6], [y7, y8, y9]])
z = Tensor([z])
```

The following forward result can be obtained based on the definition of the MatMul operator:

$output = [[(x1 \cdot y1 + x2 \cdot y4 + x3 \cdot y7) \cdot z, (x1 \cdot y2 + x2 \cdot y5 + x3 \cdot y8) \cdot z, (x1 \cdot y3 + x2 \cdot y6 + x3 \cdot y9) \cdot z],$

$[(x4 \cdot y1 + x5 \cdot y4 + x6 \cdot y7) \cdot z, (x4 \cdot y2 + x5 \cdot y5 + x6 \cdot y8) \cdot z, (x4 \cdot y3 + x5 \cdot y6 + x6 \cdot y9) \cdot z]]$

MindSpore uses the reverse-mode automatic differentiation mechanism[3] during gradient computation: the output is summed, and then the derivative of the sum with respect to the input `x` is computed.

(1) Summation formula:

$\sum{output} = [(x1 \cdot y1 + x2 \cdot y4 + x3 \cdot y7) + (x1 \cdot y2 + x2 \cdot y5 + x3 \cdot y8) + (x1 \cdot y3 + x2 \cdot y6 + x3 \cdot y9) +$

$(x4 \cdot y1 + x5 \cdot y4 + x6 \cdot y7) + (x4 \cdot y2 + x5 \cdot y5 + x6 \cdot y8) + (x4 \cdot y3 + x5 \cdot y6 + x6 \cdot y9)] \cdot z$

(2) Derivation formula:

$\frac{\mathrm{d}(\sum{output})}{\mathrm{d}x} = [[(y1 + y2 + y3) \cdot z, (y4 + y5 + y6) \cdot z, (y7 + y8 + y9) \cdot z], [(y1 + y2 + y3) \cdot z, (y4 + y5 + y6) \cdot z, (y7 + y8 + y9) \cdot z]]$

(3) Computation result:

$\frac{\mathrm{d}(\sum{output})}{\mathrm{d}x} = [[4.5099998 \quad 2.7 \quad 3.6000001] \quad [4.5099998 \quad 2.7 \quad 3.6000001]]$

If the derivatives of both inputs `x` and `y` are needed, simply set `self.grad_op = ops.GradOperation(get_all=True)` in `GradNetWrtX`, as sketched below.
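For instance, the following minimal sketch (reusing the `Net`, `x`, and `y` defined above; the wrapper name `GradNetWrtXY` is only illustrative) returns the gradients of all inputs as a tuple:

```python
class GradNetWrtXY(nn.Cell):
    """Illustrative wrapper: with get_all=True, gradients of all inputs are returned."""
    def __init__(self, net):
        super(GradNetWrtXY, self).__init__()
        self.net = net
        # get_all=True: differentiate with respect to every positional input
        self.grad_op = ops.GradOperation(get_all=True)

    def construct(self, x, y):
        gradient_function = self.grad_op(self.net)
        return gradient_function(x, y)

dx, dy = GradNetWrtXY(Net())(x, y)  # dx has the shape of x, dy has the shape of y
print(dx)
print(dy)
```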
### Weight Derivation

To compute the derivative with respect to the weights, change `GradNetWrtX` to the following:

```python
class GradNetWrtX(nn.Cell):
    def __init__(self, net):
        super(GradNetWrtX, self).__init__()
        self.net = net
        self.params = ParameterTuple(net.trainable_params())
        self.grad_op = ops.GradOperation(get_by_list=True)

    def construct(self, x, y):
        gradient_function = self.grad_op(self.net, self.params)
        return gradient_function(x, y)
```

```python
output = GradNetWrtX(Net())(x, y)
print(output)
```

The output is as follows:

```text
(Tensor(shape=[1], dtype=Float32, value= [ 2.15359993e+01]),)
```

The derivation formula is changed to:

$\frac{\mathrm{d}(\sum{output})}{\mathrm{d}z} = (x1 \cdot y1 + x2 \cdot y4 + x3 \cdot y7) + (x1 \cdot y2 + x2 \cdot y5 + x3 \cdot y8) + (x1 \cdot y3 + x2 \cdot y6 + x3 \cdot y9) +$

$(x4 \cdot y1 + x5 \cdot y4 + x6 \cdot y7) + (x4 \cdot y2 + x5 \cdot y5 + x6 \cdot y8) + (x4 \cdot y3 + x5 \cdot y6 + x6 \cdot y9)$

Computation result:

$\frac{\mathrm{d}(\sum{output})}{\mathrm{d}z} = [2.15359993e+01]$
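As an illustrative cross-check only (not part of the sample code): since `z` equals 1, the gradient with respect to `z` is simply the sum of all elements of `matmul(x, y)`, which can be reproduced with NumPy:

```python
import numpy as np

x_np = np.array([[0.8, 0.6, 0.2], [1.8, 1.3, 1.1]], dtype=np.float32)
y_np = np.array([[0.11, 3.3, 1.1], [1.1, 0.2, 1.4], [1.1, 2.2, 0.3]], dtype=np.float32)

# With z = 1, d(sum(output))/dz is the sum of all elements of x @ y
print(np.sum(x_np @ y_np))  # approximately 21.536, matching 2.15359993e+01 above
```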
### Gradient Value Scaling

You can use the `sens_param` parameter to control the scaling of the gradient value:

```python
class GradNetWrtX(nn.Cell):
    def __init__(self, net):
        super(GradNetWrtX, self).__init__()
        self.net = net
        self.grad_op = ops.GradOperation(sens_param=True)
        self.grad_wrt_output = Tensor([[0.1, 0.6, 0.2], [0.8, 1.3, 1.1]], dtype=mstype.float32)

    def construct(self, x, y):
        gradient_function = self.grad_op(self.net)
        return gradient_function(x, y, self.grad_wrt_output)
```

```python
output = GradNetWrtX(Net())(x, y)
print(output)
```

The output is as follows:

```text
[[2.211 0.51  1.49 ]
 [5.588 2.68  4.07 ]]
```

`self.grad_wrt_output` can be denoted in the following form:

```python
self.grad_wrt_output = Tensor([[s1, s2, s3], [s4, s5, s6]])
```

The scaled output value is the product of the original output value and the corresponding element of `self.grad_wrt_output`:

$output = [[(x1 \cdot y1 + x2 \cdot y4 + x3 \cdot y7) \cdot z \cdot s1, (x1 \cdot y2 + x2 \cdot y5 + x3 \cdot y8) \cdot z \cdot s2, (x1 \cdot y3 + x2 \cdot y6 + x3 \cdot y9) \cdot z \cdot s3],$

$[(x4 \cdot y1 + x5 \cdot y4 + x6 \cdot y7) \cdot z \cdot s4, (x4 \cdot y2 + x5 \cdot y5 + x6 \cdot y8) \cdot z \cdot s5, (x4 \cdot y3 + x5 \cdot y6 + x6 \cdot y9) \cdot z \cdot s6]]$

The derivation formula is changed to compute the derivative of the sum of the scaled output values with respect to each element of `x`:

$\frac{\mathrm{d}(\sum{output})}{\mathrm{d}x} = [[(s1 \cdot y1 + s2 \cdot y2 + s3 \cdot y3) \cdot z, (s1 \cdot y4 + s2 \cdot y5 + s3 \cdot y6) \cdot z, (s1 \cdot y7 + s2 \cdot y8 + s3 \cdot y9) \cdot z],$

$[(s4 \cdot y1 + s5 \cdot y2 + s6 \cdot y3) \cdot z, (s4 \cdot y4 + s5 \cdot y5 + s6 \cdot y6) \cdot z, (s4 \cdot y7 + s5 \cdot y8 + s6 \cdot y9) \cdot z]]$

To compute the derivative of a single output element (for example, `output[0][0]`) with respect to the input, set the scaling value of the corresponding position to 1 and the scaling values of all other positions to 0, as sketched below.
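A minimal sketch of this one-hot `sens` approach, assuming the `Net`, `x`, and `y` defined earlier (the wrapper name `GradNetWrtXSingle` is only illustrative):

```python
class GradNetWrtXSingle(nn.Cell):
    """Illustrative wrapper: gradient of output[0][0] only, via a one-hot sens value."""
    def __init__(self, net):
        super(GradNetWrtXSingle, self).__init__()
        self.net = net
        self.grad_op = ops.GradOperation(sens_param=True)
        # 1 at position [0][0] and 0 elsewhere, so only output[0][0] contributes
        self.grad_wrt_output = Tensor([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], dtype=mstype.float32)

    def construct(self, x, y):
        gradient_function = self.grad_op(self.net)
        return gradient_function(x, y, self.grad_wrt_output)

print(GradNetWrtXSingle(Net())(x, y))
# Expected: [[0.11 1.1 1.1]
#            [0.   0.  0. ]]
```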
You can also change the network structure so that it directly returns `output[0][0]`, and then differentiate it with the basic `GradNetWrtX` defined in the Input Derivation section:

```python
class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.matmul = ops.MatMul()
        self.z = Parameter(Tensor(np.array([1.0], np.float32)), name='z')

    def construct(self, x, y):
        x = x * self.z
        out = self.matmul(x, y)
        return out[0][0]
```

```python
output = GradNetWrtX(Net())(x, y)
print(output)
```

The output is as follows:

```text
[[0.11 1.1  1.1 ]
 [0.   0.   0.  ]]
```

## Stop Gradient

We can use `stop_gradient` to exclude certain operations from the gradient calculation. For example:

```python
import numpy as np
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor
from mindspore import ParameterTuple, Parameter
from mindspore import dtype as mstype
from mindspore.ops import stop_gradient

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.matmul = ops.MatMul()

    def construct(self, x, y):
        out1 = self.matmul(x, y)
        out2 = self.matmul(x, y)
        out2 = stop_gradient(out2)
        out = out1 + out2
        return out

class GradNetWrtX(nn.Cell):
    def __init__(self, net):
        super(GradNetWrtX, self).__init__()
        self.net = net
        self.grad_op = ops.GradOperation()

    def construct(self, x, y):
        gradient_function = self.grad_op(self.net)
        return gradient_function(x, y)

x = Tensor([[0.8, 0.6, 0.2], [1.8, 1.3, 1.1]], dtype=mstype.float32)
y = Tensor([[0.11, 3.3, 1.1], [1.1, 0.2, 1.4], [1.1, 2.2, 0.3]], dtype=mstype.float32)
output = GradNetWrtX(Net())(x, y)
print(output)
```

```text
[[4.5, 2.7, 3.6], [4.5, 2.7, 3.6]]
```

Here, `stop_gradient` is applied to `out2`, so `out2` makes no contribution to the gradient. If we delete `out2 = stop_gradient(out2)`, the result is:

```text
[[9.0, 5.4, 7.2], [9.0, 5.4, 7.2]]
```

Without `stop_gradient` on `out2`, it contributes to the gradient in the same way as `out1`, so every value in the result is doubled.

## High-order Derivation

High-order differentiation is used in domains such as AI-supported scientific computing and second-order optimization. For example, in molecular dynamics simulation, when the potential energy is trained using a neural network[1], the derivative of the network output with respect to the input needs to be computed in the loss function, and backward propagation then involves the second-order cross derivative of the loss function with respect to the input and the weights. In addition, second-order derivatives of the output with respect to the input appear in differential equations solved by AI (such as PINNs[2]). Another example is second-order optimization: to make a neural network converge quickly, the second-order derivative of the loss function with respect to the weights needs to be computed using the Newton method.

MindSpore supports high-order derivatives by computing derivatives multiple times, that is, by nesting first-order derivative operations. The following examples describe how to compute such derivatives. For details about the complete sample code, see [High-order Derivation Sample Code](https://gitee.com/mindspore/docs/tree/r1.5/docs/sample_code/high_order_differentiation/second_order).

### Single-input Single-output High-order Derivative

For example, the second-order derivative of the Sin operator (that is, -Sin) is implemented as follows:

```python
import numpy as np
import mindspore.context as context
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.sin = ops.Sin()

    def construct(self, x):
        out = self.sin(x)
        return out

class Grad(nn.Cell):
    def __init__(self, network):
        super(Grad, self).__init__()
        self.grad = ops.GradOperation()
        self.network = network

    def construct(self, x):
        gout = self.grad(self.network)(x)
        return gout

class GradSec(nn.Cell):
    def __init__(self, network):
        super(GradSec, self).__init__()
        self.grad = ops.GradOperation()
        self.network = network

    def construct(self, x):
        gout = self.grad(self.network)(x)
        return gout

net = Net()
firstgrad = Grad(net)  # first order
secondgrad = GradSec(firstgrad)  # second order
x_train = Tensor(np.array([1.0], dtype=np.float32))

output = secondgrad(x_train)
print(output)
```

The output is as follows:

```text
[-0.841471]
```
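As a quick sanity check of this result:

$\frac{\mathrm{d}}{\mathrm{d}x}\sin(x) = \cos(x), \qquad \frac{\mathrm{d}^2}{\mathrm{d}x^2}\sin(x) = -\sin(x), \qquad -\sin(1.0) \approx -0.841471$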
### Single-input Multi-output High-order Derivative

For example, for a multiplication operation whose output contains multiple elements, the high-order derivative is computed as follows:

```python
import mindspore.context as context
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor
from mindspore import dtype as mstype

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.mul = ops.Mul()

    def construct(self, x):
        out = self.mul(x, x)
        return out

class Grad(nn.Cell):
    def __init__(self, network):
        super(Grad, self).__init__()
        self.grad = ops.GradOperation(sens_param=False)
        self.network = network

    def construct(self, x):
        gout = self.grad(self.network)(x)
        return gout

class GradSec(nn.Cell):
    def __init__(self, network):
        super(GradSec, self).__init__()
        self.grad = ops.GradOperation(sens_param=False)
        self.network = network

    def construct(self, x):
        gout = self.grad(self.network)(x)
        return gout

net = Net()
firstgrad = Grad(net)  # first order
secondgrad = GradSec(firstgrad)  # second order
x = Tensor([0.1, 0.2, 0.3], dtype=mstype.float32)
output = secondgrad(x)
print(output)
```

The output is as follows:

```text
[2. 2. 2.]
```

### Multiple-Input Multiple-Output High-order Derivative

For example, if a neural network has multiple inputs `x` and `y`, the second-order derivatives `dxdx`, `dydy`, `dxdy`, and `dydx` can be obtained through the gradient scaling mechanism as follows:

```python
import numpy as np
import mindspore.context as context
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.mul = ops.Mul()

    def construct(self, x, y):
        x_square = self.mul(x, x)
        x_square_y = self.mul(x_square, y)
        return x_square_y

class Grad(nn.Cell):
    def __init__(self, network):
        super(Grad, self).__init__()
        self.grad = ops.GradOperation(get_all=True, sens_param=False)
        self.network = network

    def construct(self, x, y):
        gout = self.grad(self.network)(x, y)  # return dx, dy
        return gout

class GradSec(nn.Cell):
    def __init__(self, network):
        super(GradSec, self).__init__()
        self.grad = ops.GradOperation(get_all=True, sens_param=True)
        self.network = network
        self.sens1 = Tensor(np.array([1]).astype('float32'))
        self.sens2 = Tensor(np.array([0]).astype('float32'))

    def construct(self, x, y):
        dxdx, dxdy = self.grad(self.network)(x, y, (self.sens1, self.sens2))
        dydx, dydy = self.grad(self.network)(x, y, (self.sens2, self.sens1))
        return dxdx, dxdy, dydx, dydy

net = Net()
firstgrad = Grad(net)  # first order
secondgrad = GradSec(firstgrad)  # second order
x_train = Tensor(np.array([4], dtype=np.float32))
y_train = Tensor(np.array([5], dtype=np.float32))

dxdx, dxdy, dydx, dydy = secondgrad(x_train, y_train)
print(dxdx, dxdy, dydx, dydy)
```

The output is as follows:

```text
[10] [8.] [8.] [0.]
```

Specifically, the results of the first-order derivative computation are `dx` and `dy`. To compute `dxdx`, only the first-order derivative `dx` needs to be retained, so the scaling values corresponding to `x` and `y` are set to 1 and 0, respectively, that is, `self.grad(self.network)(x, y, (self.sens1, self.sens2))`. Similarly, to compute `dydy`, only the first-order derivative `dy` is retained, and the `sens_param` values corresponding to `x` and `y` are set to 0 and 1, respectively, that is, `self.grad(self.network)(x, y, (self.sens2, self.sens1))`.
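These values can also be verified analytically. The forward function of the network above is $f(x, y) = x^2 \cdot y$, so at $x = 4$, $y = 5$:

$\frac{\partial^2 f}{\partial x^2} = 2y = 10, \quad \frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} = 2x = 8, \quad \frac{\partial^2 f}{\partial y^2} = 0$

which matches the printed results `[10] [8.] [8.] [0.]`.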
## Support for Second-order Differential Operators

CPU supports the following operators: [Square](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Square.html#mindspore.ops.Square), [Exp](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Exp.html#mindspore.ops.Exp), [Neg](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Neg.html#mindspore.ops.Neg), [Mul](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Mul.html#mindspore.ops.Mul), and [MatMul](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.MatMul.html#mindspore.ops.MatMul).

GPU supports the following operators: [Pow](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Pow.html#mindspore.ops.Pow), [Log](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Log.html#mindspore.ops.Log), [Square](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Square.html#mindspore.ops.Square), [Exp](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Exp.html#mindspore.ops.Exp), [Neg](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Neg.html#mindspore.ops.Neg), [Mul](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Mul.html#mindspore.ops.Mul), [Div](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Div.html#mindspore.ops.Div), [MatMul](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.MatMul.html#mindspore.ops.MatMul), [Sin](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Sin.html#mindspore.ops.Sin), [Cos](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Cos.html#mindspore.ops.Cos), [Tan](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Tan.html#mindspore.ops.Tan), and [Atanh](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Atanh.html#mindspore.ops.Atanh).

Ascend supports the following operators: [Pow](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Pow.html#mindspore.ops.Pow), [Log](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Log.html#mindspore.ops.Log), [Square](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Square.html#mindspore.ops.Square), [Exp](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Exp.html#mindspore.ops.Exp), [Neg](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Neg.html#mindspore.ops.Neg), [Mul](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Mul.html#mindspore.ops.Mul), [Div](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Div.html#mindspore.ops.Div), [MatMul](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.MatMul.html#mindspore.ops.MatMul), [Sin](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Sin.html#mindspore.ops.Sin), [Cos](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Cos.html#mindspore.ops.Cos), [Tan](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Tan.html#mindspore.ops.Tan), [Sinh](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Sinh.html#mindspore.ops.Sinh), [Cosh](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Cosh.html#mindspore.ops.Cosh), and [Atanh](https://www.mindspore.cn/docs/api/en/r1.5/api_python/ops/mindspore.ops.Atanh.html#mindspore.ops.Atanh).

## References

[1] Zhang L, Han J, Wang H, et al. [Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics[J]](https://arxiv.org/pdf/1707.09571v2.pdf). Physical Review Letters, 2018, 120(14): 143001.

[2] Raissi M, Perdikaris P, Karniadakis G E. [Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations[J]](https://arxiv.org/pdf/1711.10561.pdf). arXiv preprint arXiv:1711.10561, 2017.

[3] Baydin A G, Pearlmutter B A, Radul A A, et al. [Automatic differentiation in machine learning: a survey[J]](https://jmlr.org/papers/volume18/17-468/17-468.pdf).
The Journal of Machine Learning Research, 2017, 18(1): 5595-5637.