The Adam algorithm was proposed in Adam: A Method for Stochastic Optimization, and the AdamWeightDecay variant in Decoupled Weight Decay Regularization.

The update formulas are as follows:

$\begin{split}\begin{array}{l} m = \beta_1 * m + (1 - \beta_1) * g \\ v = \beta_2 * v + (1 - \beta_2) * g * g \\ update = \frac{m}{\sqrt{v} + \epsilon} \\ update = \begin{cases} update + weight\_decay * w & \text{ if } weight\_decay > 0 \\ update & \text{ otherwise } \end{cases} \\ w = w - lr * update \end{array}\end{split}$

$$m$$ represents the 1st moment vector, $$v$$ represents the 2nd moment vector, $$g$$ represents the gradient, $$\beta_1, \beta_2$$ represent beta1 and beta2, $$lr$$ represents the learning rate (input lr), $$w$$ represents the weights (input var), $$weight\_decay$$ represents the weight decay (input decay), and $$\epsilon$$ represents epsilon.
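The update formulas can be traced with a short NumPy sketch. This is an illustration only, not the operator's implementation: the function name `adam_weight_decay_step` and the sample values are assumptions chosen for the example. It follows the formulas exactly as written, so it applies no bias correction.

```python
import numpy as np

def adam_weight_decay_step(w, m, v, g, lr, beta1, beta2, epsilon, weight_decay):
    """One update step following the documented formulas (hypothetical helper)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Adam direction; epsilon keeps the denominator away from zero.
    update = m / (np.sqrt(v) + epsilon)
    # Decoupled weight decay is added to the update, not to the gradient.
    if weight_decay > 0:
        update = update + weight_decay * w
    w = w - lr * update
    return w, m, v

# Sample values (assumed for illustration): all-ones tensors of shape (2, 2).
w = np.ones((2, 2), dtype=np.float32)
m = np.ones((2, 2), dtype=np.float32)
v = np.ones((2, 2), dtype=np.float32)
g = np.ones((2, 2), dtype=np.float32)

new_w, new_m, new_v = adam_weight_decay_step(
    w, m, v, g, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.0)
decayed_w, _, _ = adam_weight_decay_step(
    w, m, v, g, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.1)
```

With these inputs both moments stay close to 1, so each weight moves by roughly lr without decay and by roughly lr * (1 + weight_decay) with decay.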

Parameters

use_locking (bool) – Whether to protect updates of the variable tensors with a lock. If True, updates of the var, m, and v tensors are protected by a lock. If False, the result is unpredictable. Default: False.

Inputs:
• var (Tensor) - Weights to be updated. The shape is $$(N, *)$$, where $$*$$ means any number of additional dimensions. The data type can be float16 or float32.

• m (Tensor) - The 1st moment vector in the updating formula. The shape and data type should be the same as var.

• v (Tensor) - The 2nd moment vector in the updating formula (mean square gradients). The shape and data type should be the same as var.

• lr (float) - $$lr$$ in the updating formula. The paper suggested value is $$10^{-3}$$. The data type should be the same as var.

• beta1 (float) - The exponential decay rate for the 1st moment estimations. The data type should be the same as var. The paper suggested value is $$0.9$$.

• beta2 (float) - The exponential decay rate for the 2nd moment estimations. The data type should be the same as var. The paper suggested value is $$0.999$$.

• epsilon (float) - Term added to the denominator to improve numerical stability.

• decay (float) - The weight decay value. It must be a scalar tensor with float data type. Default: 0.0.

• gradient (Tensor) - The gradient, with the same shape and data type as var.
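The shape and data type constraints listed above can be checked before calling the operator. The following is a minimal sketch on NumPy arrays. The helper `check_adam_inputs` is hypothetical and is not part of MindSpore; it only mirrors the documented rules that var is float16 or float32 and that m, v, and gradient match var in shape and data type.

```python
import numpy as np

def check_adam_inputs(var, m, v, gradient):
    """Hypothetical helper: raise ValueError if the documented constraints are violated."""
    # var must be float16 or float32.
    if var.dtype not in (np.float16, np.float32):
        raise ValueError(f"var must be float16 or float32, got {var.dtype}")
    # m, v, and gradient must match var in both shape and data type.
    for name, arr in (("m", m), ("v", v), ("gradient", gradient)):
        if arr.shape != var.shape:
            raise ValueError(f"{name} has shape {arr.shape}, expected {var.shape}")
        if arr.dtype != var.dtype:
            raise ValueError(f"{name} has dtype {arr.dtype}, expected {var.dtype}")

ones = np.ones((2, 2), dtype=np.float32)
check_adam_inputs(ones, ones, ones, ones)  # consistent inputs pass silently
```

Raising early on a mismatch gives a clearer message than an error surfacing from inside the operator; whether the operator itself reports such mismatches in the same way is not stated here.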

Outputs:

Tuple of 3 Tensors, the updated parameters.

• var (Tensor) - The same shape and data type as var.

• m (Tensor) - The same shape and data type as m.

• v (Tensor) - The same shape and data type as v.

Supported Platforms:

GPU CPU

Examples

>>> import numpy as np
>>> import mindspore.nn as nn
>>> from mindspore import Tensor, Parameter, ops
>>> class Net(nn.Cell):
...     def __init__(self):
...         super(Net, self).__init__()
...         self.adam_weight_decay = ops.AdamWeightDecay()
...         self.var = Parameter(Tensor(np.ones([2, 2]).astype(np.float32)), name="var")
...         self.m = Parameter(Tensor(np.ones([2, 2]).astype(np.float32)), name="m")
...         self.v = Parameter(Tensor(np.ones([2, 2]).astype(np.float32)), name="v")
...     def construct(self, lr, beta1, beta2, epsilon, decay, grad):
...         out = self.adam_weight_decay(self.var, self.m, self.v, lr, beta1, beta2,