Resume Training 2.0
Overview
MindSpore Transformers has complete resume training capabilities. The core functions and applicable scenarios are as follows:
- Core Functions: Supports loading saved checkpoints to quickly resume training progress without starting from scratch.
- Multi-scenario Adaptation: Covers four mainstream resume training scenarios:
  - Interruption Resume Training: After a training task is interrupted abnormally (e.g., device failure, network fluctuation), resume the training process from the saved checkpoint.
  - Scaling Resume Training: Adjust the number of cards (scale-out / scale-in) during training and continue training from the saved checkpoint.
  - Incremental Resume Training: On the basis of existing training results, add newly produced training datasets and continue training from the saved checkpoint.
  - Automatic Recovery Resume Training: Allows the platform to automatically start resume training without manual intervention.
For large-scale training tasks (long training cycles, heavy resource investment), resume training avoids the loss of progress caused by unexpected interruptions and significantly reduces wasted time and computing resources.
This document only applies to scenarios where Checkpoint 2.0 is used for resume training; if you use Checkpoint 1.0, please refer to the resume training document of the old version.
Checkpoint Introduction
The training checkpoints of MindSpore Transformers are stored in the `output/checkpoint` directory by default, and each checkpoint is saved independently as a subfolder named after its iteration. Taking the checkpoint generated at the first step of an 8-card task as an example, the storage layout is as follows:
```text
output
├── checkpoint
│   ├── iteration_0000001
│   │   ├── metadata.json
│   │   ├── common.json
│   │   ├── {prefix}-model-0000000-0000008.safetensor
│   │   ├── ...
│   │   ├── {prefix}-model-0000007-0000008.safetensor
│   │   ├── {prefix}-opt-0000000-0000008.safetensor
│   │   ├── ...
│   │   └── {prefix}-opt-0000007-0000008.safetensor
│   ├── ...
│   └── latest_checkpointed_iteration.txt
```
You can refer to Checkpoint Saving and Loading for more information about checkpoints.
Configuration Description
| Parameter Name | Description | Value Description |
|---|---|---|
| `load_checkpoint` | The path to the checkpoint folder. It can be filled with the path of the `checkpoint` directory or of a specific `iteration_{x}` subfolder within it. | (str, optional) |
| `resume_training` | The switch for the resume training function. When set to `True`, training resumes from the checkpoint specified by `load_checkpoint`. | (bool, optional) - Default value: `False`. |
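A minimal resume configuration combines these two parameters. The following sketch uses an illustrative path; either of the two accepted path forms can be given:

```yaml
# Either the checkpoint directory or a specific iteration_{x} subfolder can be used here.
load_checkpoint: /path/to/checkpoint
resume_training: True
```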
Scenario Introduction
Interruption Resume Training
Overview: After an abnormal interruption of a normal training task, resume the training process based on the saved checkpoint without changing the distributed strategy.
MindSpore Transformers provides two ways to start resuming training:
Resume training based on the number of iterations recorded in `latest_checkpointed_iteration.txt`:

```yaml
load_checkpoint: /path/to/checkpoint
resume_training: True
```

Resume training based on a specified number of iterations:

```yaml
load_checkpoint: /path/to/checkpoint/iteration_{x}
resume_training: True
```

`x` represents the training iteration step corresponding to the checkpoint. For example, `0000001` indicates the checkpoint saved at the 1st training step.
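For example, to resume from the first-step checkpoint shown in the directory structure above, the configuration would look as follows (the output path is illustrative and should match your actual saving directory):

```yaml
# Resume from the checkpoint saved at training step 1.
load_checkpoint: /path/to/output/checkpoint/iteration_0000001
resume_training: True
```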
Scaling Resume Training
Overview: When the cluster scale needs to be expanded or reduced, or the distributed strategy needs to be modified, to continue a training task, the configuration is the same as in Interruption Resume Training. Relying on the online Reshard mechanism, MindSpore Transformers automatically adapts the checkpoint weights to any distributed strategy, ensuring smooth resume training.
Resume training based on the number of iterations recorded in `latest_checkpointed_iteration.txt`:

```yaml
load_checkpoint: /path/to/checkpoint
resume_training: True
```

Resume training based on a specified number of iterations:

```yaml
load_checkpoint: /path/to/checkpoint/iteration_{x}
resume_training: True
```

`x` represents the training iteration step corresponding to the checkpoint. For example, `0000001` indicates the checkpoint saved at the 1st training step.
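As a sketch of a scale-out case: the resume settings are unchanged, and only the cluster size and distributed strategy differ. The `parallel_config` fields below (`data_parallel`, `model_parallel`, `pipeline_stage`) are assumptions based on common MindSpore Transformers configurations and should be adapted to the schema actually used in your training YAML:

```yaml
# Resume settings are identical to interruption resume training.
load_checkpoint: /path/to/checkpoint
resume_training: True

# Assumed example: the checkpoint was saved under a different parallel strategy;
# online Reshard adapts the loaded weights to the strategy configured here.
parallel_config:
  data_parallel: 8
  model_parallel: 2
  pipeline_stage: 1
```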
Incremental Resume Training
Overview: The training datasets are produced and trained in parallel: after the current dataset has been trained, the newly produced dataset is added and training continues until all datasets have been trained. This scenario requires users to preset the total number of steps of the learning rate curve in advance, according to the total amount of data to be trained.

Assume that a total of 10T tokens of data are to be trained, each produced dataset contains only 1T tokens, and the entire training process is completed in 10 epochs, taking 100000 steps in total; that is, each dataset corresponds to roughly 10000 steps.
Step 1: Preset the total number of training steps to fix the learning rate curve for the entire training process.

```yaml
lr_schedule:
  total_steps: 100000
```
Step 2: Set a sufficiently large epoch value to ensure that all datasets can be trained.

```yaml
runner_config:
  epochs: 15
```

Since the learning rate curve of the entire training process is already fixed, the epoch value does not affect the learning rate; a larger value can be set to ensure that all 10 datasets are trained.
Step 3: After one epoch of the current dataset has been trained, replace the dataset and resume training; a sketch of replacing the dataset path is given at the end of this section. The following resumes training based on the number of iterations recorded in `latest_checkpointed_iteration.txt`; for other resume training methods, refer to Interruption Resume Training or Scaling Resume Training.

```yaml
load_checkpoint: /path/to/checkpoint
resume_training: True
```
When replacing the dataset for resume training, because the number of samples differs across datasets, the displayed epoch and the number of steps in a single epoch may change, but the total number of training steps remains unchanged; this is normal.
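As a minimal sketch of replacing the dataset between resume runs: the dataset-related keys below (`train_dataset`, `data_loader`, `dataset_dir`) are assumptions for illustration and should be replaced with the dataset configuration actually used in your training YAML; only the dataset path changes, while the resume settings stay the same.

```yaml
# Resume from the latest complete checkpoint, as in Step 3.
load_checkpoint: /path/to/checkpoint
resume_training: True

# Hypothetical dataset section: point the data loader at the newly produced dataset.
# The exact keys depend on the dataset type configured in your training YAML.
train_dataset:
  data_loader:
    dataset_dir: /path/to/new_dataset
```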
Automatic Recovery Resume Training
Overview: To allow the platform to automatically start resume training without manual intervention, `load_checkpoint` can be configured as the checkpoint saving directory path: during the first training run, the directory is empty and the model initializes its parameters randomly; during resume training, training recovers from the last complete checkpoint saved in the directory.
```yaml
load_checkpoint: /path/to/output/checkpoint
resume_training: True
```
Constraint Description
In multi-machine scenarios, all checkpoint files need to be stored in the same shared directory for resume training, and users need to add the shared path to the environment variable `SHARED_PATHS`; it is recommended to configure the top-level shared directory. For example, when the shared directory is `/data01`, execute `export SHARED_PATHS=/data01`.