mindspore.experimental.optim.RAdam

class mindspore.experimental.optim.RAdam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)

Implements the RAdam (Rectified Adam) algorithm.

\[\begin{split}\begin{align*} &\rule{180mm}{0.4pt} \\ &\textbf{Input}: \gamma \text{ (lr)}, \: \beta_1, \beta_2 \text{ (betas)}, \: \theta_0 \text{ (params)}, \:f(\theta) \text{ (objective)}, \: \lambda \text{ (weightdecay)}, \: \epsilon \text{ (epsilon)} \\ &\textbf{Initialize}: \begin{cases} m_0 \leftarrow 0 \text{ (first moment)} \\ v_0 \leftarrow 0 \text{ (second moment)} \\ \rho_{\infty} \xleftarrow{\text{def}} \dfrac{2}{1 - \beta_2} - 1 \end{cases} \\ &\rule{180mm}{0.4pt} \\ &\textbf{For } t = 1 \text{ to } \ldots \text{ do}: \\ &\quad g_t \leftarrow \nabla_{\theta} f_t(\theta_{t - 1}) \\ &\quad \text{If } \lambda \neq 0: \\ &\quad\quad g_t \leftarrow g_t + \lambda \theta_{t - 1} \\ &\quad m_t \leftarrow \beta_1 m_{t - 1} + (1 - \beta_1) g_t \\ &\quad v_t \leftarrow \beta_2 v_{t - 1} + (1 - \beta_2) g_t^2 \\ &\quad \widehat{m_t} \leftarrow \dfrac{m_t}{1 - \beta_1^t} \\ &\quad \text{Let } \rho_t' = 2 t \beta_2^t /(1 - \beta_2^t) \quad \text{(auxiliary variable)} \\ &\quad \rho_t \leftarrow \rho_{\infty} - \rho_t' \\ &\quad \text{If } \rho_t > 5: \\ &\quad\quad l_t \leftarrow \dfrac{\sqrt{1 - \beta_2^t}}{\sqrt{v_t} + \epsilon} \\ &\quad\quad r_t \leftarrow \sqrt{\dfrac{(\rho_t - 4)(\rho_t - 2)\rho_{\infty}}{(\rho_{\infty} - 4) (\rho_{\infty} - 2) \rho_t}} \\ &\quad\quad \theta_t \leftarrow \theta_{t - 1} - \gamma \widehat{m_t} r_t l_t \\ &\quad \text{Else}: \\ &\quad\quad \theta_t \leftarrow \theta_{t - 1} - \gamma \widehat{m_t} \\ &\rule{180mm}{0.4pt} \\ &\bf{Return}: \theta_t \\ &\rule{180mm}{0.4pt} \end{align*}\end{split}\]

For more details about the RAdam algorithm, refer to the paper "On the Variance of the Adaptive Learning Rate and Beyond".
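The update rule above can be sketched in plain Python for a single scalar parameter. This is an illustrative sketch only, not the MindSpore implementation: the helper name `radam_step` and its signature are hypothetical, and it uses only the standard `math` module so that each line can be matched against the pseudocode.

```python
import math


def radam_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0):
    """One RAdam update for a scalar parameter (hypothetical helper)."""
    beta1, beta2 = betas
    # L2 penalty: fold the weight decay term into the gradient.
    if weight_decay != 0:
        grad = grad + weight_decay * theta
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias-corrected first moment.
    m_hat = m / (1 - beta1 ** t)
    # rho_inf and rho_t as defined in the pseudocode.
    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
    if rho_t > 5:
        # Rectified adaptive step.
        l_t = math.sqrt(1 - beta2 ** t) / (math.sqrt(v) + eps)
        r_t = math.sqrt((rho_t - 4) * (rho_t - 2) * rho_inf
                        / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        theta = theta - lr * m_hat * r_t * l_t
    else:
        # Early steps: bias-corrected momentum update without the
        # adaptive denominator.
        theta = theta - lr * m_hat
    return theta, m, v


# Minimize f(theta) = theta ** 2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = radam_step(theta, 2 * theta, m, v, t, lr=0.01)
```

With the default `beta2 = 0.999`, `rho_t` equals 1 at `t = 1`, so the first updates take the `Else` branch and skip the adaptive term; the rectified branch applies only once `rho_t` exceeds 5.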

Warning

This is an experimental optimizer API that is subject to change. This module must be used with the lr scheduler module in the LRScheduler class.

Parameters

params (Union[list(Parameter), list(dict)]) – list of parameters to optimize or dicts defining parameter groups.
lr (Union[int, float, Tensor], optional) – learning rate. Default: 1e-3.
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. Default: (0.9, 0.999).
eps (float, optional) – term added to the denominator to improve numerical stability. Default: 1e-8.
weight_decay (float, optional) – weight decay (L2 penalty). Default: 0.0.

Inputs:

gradients (tuple[Tensor]) - The gradients of params.

Raises

ValueError – If the learning rate is not an int, float or Tensor.
ValueError – If the learning rate is less than 0.
ValueError – If the eps is less than 0.0.
ValueError – If the weight_decay is less than 0.
ValueError – If any element of betas is not in the range [0, 1).

Supported Platforms: Ascend GPU CPU

Examples

>>> import mindspore
>>> from mindspore import nn
>>> from mindspore.experimental import optim
>>> # Define the network structure of LeNet5. Refer to
>>> # https://atomgit.com/mindspore/docs/blob/master/docs/mindspore/code/lenet.py
>>> net = LeNet5()
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=True)
>>> optimizer = optim.RAdam(net.trainable_params(), lr=0.1)
>>> def forward_fn(data, label):
...     logits = net(data)
...     loss = loss_fn(logits, label)
...     return loss, logits
>>> grad_fn = mindspore.value_and_grad(forward_fn, None, optimizer.parameters, has_aux=True)
>>> def train_step(data, label):
...     (loss, _), grads = grad_fn(data, label)
...     optimizer(grads)
...     return loss