mindformers.core.ConstantWithCoolDownLR
- class mindformers.core.ConstantWithCoolDownLR(learning_rate: float, warmup_steps: int = None, warmup_lr_init: float = 0., warmup_ratio: float = None, keep_steps: int = 0, decay_steps: int = None, decay_ratio: float = None, total_steps: int = None, num_cycles: float = 0.5, lr_end1: float = 0, final_steps: int = 0, lr_end2: float = None, **kwargs)
Constant learning rate with cool-down, implemented as described in the DeepSeek-V3 Technical Report, page 23.
The ConstantWithCoolDownLR schedule first increases the learning rate of each parameter group linearly during a warm-up phase, then holds it at the base value for keep_steps steps, after which it decays following a cosine curve. Once the decay finishes, the learning rate is held at lr_end1 for final_steps steps before switching to the final constant value lr_end2.
During the warm-up phase, the learning rate increases linearly from a smaller initial value to the base learning rate, as described by the following formula:
\[\eta_t = \eta_{\text{warmup}} + t \times \frac{\eta_{\text{base}} - \eta_{\text{warmup}}}{\text{warmup_steps}}\]
where \(\eta_{\text{warmup}}\) is the initial learning rate during the warm-up phase, and \(\eta_{\text{base}}\) is the base learning rate after the warm-up phase.
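For instance, with \(\eta_{\text{warmup}} = 0\), \(\eta_{\text{base}} = 0.005\), and warmup_steps = 10 (the values used in the Examples section below), step \(t = 1\) gives \(\eta_1 = 0 + 1 \times \frac{0.005 - 0}{10} = 0.0005\).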
During the decay phase, the learning rate follows a cosine decay schedule:
\[\eta_t = \eta_{\text{end}} + \frac{1}{2}(\eta_{\text{base}} - \eta_{\text{end}})\left(1 + \cos\left(\frac{t_{\text{cur}}}{t_{\text{max}}}\pi\right)\right)\]
where \(t_{\text{cur}}\) is the number of steps since the beginning of the decay phase, \(t_{\text{max}}\) is the total number of decay steps, and \(\eta_{\text{end}}\) is the post-decay learning rate lr_end1.
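To make the piecewise behaviour concrete, here is a minimal pure-Python sketch of the schedule for the default num_cycles = 0.5. The function name constant_with_cooldown and its flat signature are illustrative only, and the boundary handling and the warmup_ratio/decay_ratio/total_steps arguments of the actual implementation may differ:

import math

def constant_with_cooldown(step, base_lr, warmup_steps, warmup_lr_init,
                           keep_steps, decay_steps, lr_end1, final_steps, lr_end2):
    # Phase 1: linear warm-up from warmup_lr_init to base_lr.
    if step < warmup_steps:
        return warmup_lr_init + step * (base_lr - warmup_lr_init) / warmup_steps
    # Phase 2: hold the base learning rate for keep_steps steps.
    if step < warmup_steps + keep_steps:
        return base_lr
    # Phase 3: half-cosine decay from base_lr down to lr_end1.
    if step < warmup_steps + keep_steps + decay_steps:
        t_cur = step - warmup_steps - keep_steps
        return lr_end1 + 0.5 * (base_lr - lr_end1) * (1 + math.cos(math.pi * t_cur / decay_steps))
    # Phase 4: hold lr_end1 for final_steps steps.
    if step < warmup_steps + keep_steps + decay_steps + final_steps:
        return lr_end1
    # Phase 5: constant lr_end2 for the remainder of training.
    return lr_end2

# Matches (up to floating-point rounding) the values printed in the Examples section.
for step in (1, 15, 25, 35, 45):
    print(step, constant_with_cooldown(step, 0.005, 10, 0.0, 10, 10, 0.002, 10, 0.001))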
- Parameters
learning_rate (float) – Initial value of the learning rate.
warmup_steps (int, optional) – Number of warm-up steps. Default: None.
warmup_lr_init (float, optional) – Initial learning rate during the warm-up steps. Default: 0.0.
warmup_ratio (float, optional) – Ratio of total training steps used for warm-up. Default: None.
keep_steps (int, optional) – Number of steps the learning rate is held at the base value after warm-up. Default: 0.
decay_steps (int, optional) – Number of decay steps. Default: None.
decay_ratio (float, optional) – Ratio of total training steps used for decay. Default: None.
total_steps (int, optional) – Total number of training steps. Default: None.
num_cycles (float, optional) – Number of waves in the cosine schedule (the default decreases from the base value to lr_end1 following a half-cosine). Default: 0.5.
lr_end1 (float, optional) – Learning rate held after the decay phase. Default: 0.0.
final_steps (int, optional) – Number of steps the learning rate is held at lr_end1. Default: 0.
lr_end2 (float, optional) – Final value of the learning rate; falls back to lr_end1 if set to None. Default: None.
- Inputs:
global_step (Tensor) - The global step.
- Outputs:
Tensor, the learning rate at global_step.
Examples
>>> import mindspore as ms
>>> from mindspore import Tensor
>>> from mindformers.core.lr import ConstantWithCoolDownLR
>>>
>>> ms.set_context(mode=ms.GRAPH_MODE)
>>> warmup_steps = 10
>>> keep_steps = 10
>>> decay_steps = 10
>>> final_steps = 10
>>> learning_rate = 0.005
>>>
>>> constant_with_cooldown = ConstantWithCoolDownLR(learning_rate=learning_rate,
...                                                 warmup_steps=warmup_steps,
...                                                 keep_steps=keep_steps,
...                                                 decay_steps=decay_steps,
...                                                 final_steps=final_steps,
...                                                 lr_end1=0.002, lr_end2=0.001)
>>> print(constant_with_cooldown(Tensor(1)))
0.0005
>>> print(constant_with_cooldown(Tensor(15)))
0.005
>>> print(constant_with_cooldown(Tensor(25)))
0.0035
>>> print(constant_with_cooldown(Tensor(35)))
0.002
>>> print(constant_with_cooldown(Tensor(45)))
0.001
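As a usage sketch, the schedule can also be passed directly to an optimizer, which then evaluates it against the global step during training. This assumes ConstantWithCoolDownLR behaves as a standard MindSpore LearningRateSchedule cell, as MindFormers schedules typically do; nn.Dense here is only a placeholder network for illustration:

>>> import mindspore as ms
>>> from mindspore import nn
>>> from mindformers.core.lr import ConstantWithCoolDownLR
>>>
>>> net = nn.Dense(16, 8)  # placeholder network for illustration
>>> lr_schedule = ConstantWithCoolDownLR(learning_rate=0.005, warmup_steps=10,
...                                      keep_steps=10, decay_steps=10,
...                                      final_steps=10, lr_end1=0.002, lr_end2=0.001)
>>> # MindSpore optimizers accept a LearningRateSchedule cell as learning_rate.
>>> optimizer = nn.AdamWeightDecay(net.trainable_params(), learning_rate=lr_schedule)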