# Optimization Algorithms

`Ascend` `GPU` `CPU` `Model Development`

[![View Source On Gitee](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_source_en.png)](https://gitee.com/mindspore/docs/blob/r1.6/docs/mindspore/programming_guide/source_en/optim.md)

## Overview

`mindspore.nn.optim` is the module in the MindSpore framework that implements various optimization algorithms, including common optimizers and learning rates. Its APIs are general enough that newer and more complex optimization methods can also be integrated into the module.

`mindspore.nn.optim` provides common optimizers for models, such as `SGD`, `Adam`, and `Momentum`. An optimizer uses the computed gradients to update the model parameters. The choice of optimization algorithm directly affects the performance of the final model: if the results are poor, the cause may lie in the optimization algorithm rather than in the features or the model design.

In addition, `mindspore.nn` provides the learning rate modules `dynamic_lr` and `learning_rate_schedule`. Both produce dynamic learning rates, but they are implemented differently. The learning rate is one of the most important hyperparameters in supervised learning and deep learning. It determines whether the objective function converges to a local minimum and how quickly it does so; an appropriate learning rate lets the objective function converge to a local minimum in a reasonable amount of time.

> All the following examples support the CPU, GPU, and Ascend environments.

## Learning Rates

### dynamic_lr

The `mindspore.nn.dynamic_lr` module provides the following functions, each of which returns a list with one learning-rate value per step:

- `piecewise_constant_lr` function: computes a piecewise constant learning rate that stays unchanged within each milestone segment.
- `exponential_decay_lr` function: computes the learning rate based on the exponential decay function.
- `natural_exp_decay_lr` function: computes the learning rate based on the natural exponential decay function.
- `inverse_decay_lr` function: computes the learning rate based on the inverse time decay function.
- `cosine_decay_lr` function: computes the learning rate based on the cosine decay function.
- `polynomial_decay_lr` function: computes the learning rate based on the polynomial decay function.
- `warmup_lr` function: gradually increases (warms up) the learning rate.

These are the different implementations provided by `dynamic_lr`. For example, a code example of the `piecewise_constant_lr` function is as follows:

```python
from mindspore.nn import piecewise_constant_lr

def test_dynamic_lr():
    milestone = [2, 5, 10]              # step milestones that end each constant segment
    learning_rates = [0.1, 0.05, 0.01]  # learning rate used within each segment
    lr = piecewise_constant_lr(milestone, learning_rates)
    print(lr)

if __name__ == '__main__':
    test_dynamic_lr()
```

The following information is displayed:

```text
[0.1, 0.1, 0.05, 0.05, 0.05, 0.01, 0.01, 0.01, 0.01, 0.01]
```
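The returned list can be passed directly as the `learning_rate` argument of an optimizer, which then consumes one value per training step. The following is a minimal sketch of this usage; the `nn.Dense` network and the choice of `nn.Momentum` are illustrative assumptions, and the list should cover the total number of training steps:

```python
from mindspore import nn
from mindspore.nn import piecewise_constant_lr

# A small network used only to provide trainable parameters (illustrative assumption).
net = nn.Dense(10, 5)

# 0.1 for steps 1-2, 0.05 for steps 3-5, 0.01 for steps 6-10.
lr = piecewise_constant_lr([2, 5, 10], [0.1, 0.05, 0.01])

# The optimizer reads one learning-rate value from the list at each step.
optim = nn.Momentum(net.trainable_params(), learning_rate=lr, momentum=0.9)
```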
### learning_rate_schedule

The `mindspore.nn.learning_rate_schedule` module provides the following classes: `ExponentialDecayLR`, `NaturalExpDecayLR`, `InverseDecayLR`, `CosineDecayLR`, `PolynomialDecayLR`, and `WarmUpLR`. They all belong to `learning_rate_schedule` but are implemented in different ways. Their meanings are as follows:

- `ExponentialDecayLR` class: computes the learning rate based on the exponential decay function.
- `NaturalExpDecayLR` class: computes the learning rate based on the natural exponential decay function.
- `InverseDecayLR` class: computes the learning rate based on the inverse time decay function.
- `CosineDecayLR` class: computes the learning rate based on the cosine decay function.
- `PolynomialDecayLR` class: computes the learning rate based on the polynomial decay function.
- `WarmUpLR` class: gradually increases (warms up) the learning rate.

These are the different implementations provided by `learning_rate_schedule`. Unlike the `dynamic_lr` functions, which return a precomputed list, these classes are cells that compute the learning rate from the current global step at run time. For example, a code example of the `ExponentialDecayLR` class is as follows:

```python
from mindspore import dtype as mstype
from mindspore import Tensor
from mindspore.nn import ExponentialDecayLR

def test_learning_rate_schedule():
    learning_rate = 0.1  # learning_rate(float) - The initial value of the learning rate.
    decay_rate = 0.9     # decay_rate(float) - The decay rate.
    decay_steps = 4      # decay_steps(int) - Number of steps over which one decay_rate factor is applied:
                         #                    lr = learning_rate * decay_rate^(global_step / decay_steps).
    global_step = Tensor(2, mstype.int32)
    exponential_decay_lr = ExponentialDecayLR(learning_rate, decay_rate, decay_steps)
    res = exponential_decay_lr(global_step)
    print(res)

if __name__ == '__main__':
    test_learning_rate_schedule()
```

The following information is displayed:

```text
0.094868325
```
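A `learning_rate_schedule` object can likewise be passed to an optimizer as its `learning_rate`; instead of reading values from a precomputed list, the optimizer evaluates the schedule against the current global step at every update. A minimal sketch, in which the `nn.Dense` network and the choice of `nn.SGD` are illustrative assumptions:

```python
from mindspore import nn
from mindspore.nn import ExponentialDecayLR

# A small network used only to provide trainable parameters (illustrative assumption).
net = nn.Dense(10, 5)

# The schedule recomputes the learning rate from the global step at each update:
# lr = 0.1 * 0.9^(global_step / 4)
decay_lr = ExponentialDecayLR(learning_rate=0.1, decay_rate=0.9, decay_steps=4)
optim = nn.SGD(net.trainable_params(), learning_rate=decay_lr)
```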
## Optimizers

### Usage

To use `mindspore.nn.optim`, you need to build an `Optimizer` object. This object maintains the current parameter state and updates the parameters based on the computed gradients.

- Building

    To build an `Optimizer`, you need to give it an iterable containing the parameters to be optimized (these must be `Parameter` objects). Then, you can set the `Optimizer` options, such as the learning rate and the weight decay.

    A code example is as follows:

    ```python
    from mindspore import nn

    # `net` is an already-defined network and `group_params` is a list of
    # parameter groups (see the next item for how to build such groups).
    optim = nn.SGD(group_params, learning_rate=0.1, weight_decay=0.0)
    optim = nn.Adam(params=net.trainable_params())
    optim = nn.Adam(group_params, learning_rate=0.1, weight_decay=0.0)
    ```

- Setting options for each parameter group separately

    The optimizer also allows you to set options for each parameter group separately. Instead of passing in the parameters directly, pass in an iterable of dictionaries. Each dictionary defines a parameter group and must contain a `params` key whose value is the list of parameters in that group. The other keys should be options accepted by the optimizer (such as `lr` and `weight_decay`) and are applied only to that group.

    You can still pass options as keyword arguments; they are used as default values for groups that do not override them. This is useful when you want to change the options of only one parameter group without changing the options of the other groups.

    Take `SGD` as an example. To specify a different learning rate for each layer, configure the optimizer as follows:

    ```python
    from mindspore import nn

    optim = nn.SGD([{'params': conv_params, 'weight_decay': 0.01},
                    {'params': no_conv_params, 'lr': 0.01},
                    {'order_params': net.trainable_params()}],
                   learning_rate=0.1, weight_decay=0.0)
    ```

    In this example, the parameters in `conv_params` use a weight decay of 0.01 and the default learning rate of 0.1, while the parameters in `no_conv_params` use the default weight decay of 0.0 and a learning rate of 0.01. The keyword argument `learning_rate=0.1` is used for all groups that do not set a learning rate of their own, and the same rule applies to `weight_decay`. The `order_params` entry specifies the order in which the parameters are arranged, usually the order in which they appear in the network.

### Built-in Optimizers

Common deep learning optimization algorithms include `SGD`, `Adam`, `FTRL`, `LazyAdam`, `Momentum`, `RMSProp`, `LARS`, `ProximalAdagrad`, and `Lamb`. The `mindspore.nn.optim` module provides corresponding class implementations for them. For example:

- `SGD`: with the default parameters, this is plain SGD. When the `momentum` parameter is set to a value other than 0, first-order momentum is taken into account. When `nesterov` is set to True, the optimizer becomes NAG (Nesterov Accelerated Gradient), which evaluates the gradient at the position one step ahead.

- `RMSProp`: considers second-order momentum, so different parameters receive different learning rates, that is, adaptive learning rates. It improves on `Adagrad` by using exponential smoothing so that only the second-order momentum within a certain window is considered.

- `Adam`: considers both first-order and second-order momentum. It can be seen as adding first-order momentum on top of `RMSProp`.

For example, the code example of `SGD` is as follows:

```python
from mindspore import nn, Model, Tensor
import mindspore.ops as ops
import numpy as np
from mindspore import dtype as mstype
from mindspore import Parameter

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.matmul = ops.MatMul()
        self.conv = nn.Conv2d(1, 6, 5, pad_mode='valid')
        self.z = Parameter(Tensor(np.array([1.0], np.float32)))

    def construct(self, x, y):
        x = x * self.z
        out = self.matmul(x, y)
        return out

net = Net()
# A simple optimizer over all trainable parameters.
optim = nn.SGD(params=net.trainable_params())

# Split the parameters into groups: convolution parameters use weight decay 0.01,
# the remaining parameters use a learning rate of 0.01.
conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
group_params = [{'params': conv_params, 'weight_decay': 0.01},
                {'params': no_conv_params, 'lr': 0.01},
                {'order_params': net.trainable_params()}]

optim = nn.SGD(group_params, learning_rate=0.1, weight_decay=0.0)

loss = nn.SoftmaxCrossEntropyWithLogits()
model = Model(net, loss_fn=loss, optimizer=optim)
```
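As a further illustration of how a built optimizer applies computed gradients to update parameters, the following self-contained sketch runs a single training step. The `nn.Dense` network, the random data, and the use of `nn.WithLossCell` and `nn.TrainOneStepCell` are illustrative assumptions rather than part of the original example:

```python
import numpy as np
from mindspore import Tensor, nn

# A small network, loss, and optimizer for a single illustrative training step.
net = nn.Dense(16, 10)
loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
optim = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

# Wrap network + loss, then attach the optimizer: calling the resulting cell
# runs the forward and backward passes, applies the parameter update, and returns the loss.
loss_net = nn.WithLossCell(net, loss_fn)
train_net = nn.TrainOneStepCell(loss_net, optim)
train_net.set_train()

data = Tensor(np.random.randn(32, 16).astype(np.float32))    # random inputs (assumption)
label = Tensor(np.random.randint(0, 10, (32,)).astype(np.int32))  # random labels (assumption)
print(train_net(data, label))
```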