Resumable Training After Breakpoint
Overview
MindSpore Transformers supports step-level resume training functionality, enabling the loading of saved checkpoints to resume previous training states. This feature is particularly important for handling large-scale training tasks, as it effectively reduces time and resource waste caused by unexpected interruptions.
MindSpore Transformers supports saving and loading weights in both ckpt and safetensors formats. It covers several resume training scenarios, including interrupted training resumption, strategy conversion resumption, incremental training resumption, and automatic recovery resumption, and supports multiple weight loading methods: loading the last fully saved weights, loading weights from a specified step, and loading MindSpore merged weights for resumption.
In a distributed environment, resume training requires that weights from all nodes be stored in the same shared directory. Users can set the shared path via the environment variable SHARED_PATHS.
Introduction to Weight and Strategy Files
MindSpore Transformers saves weight and strategy files, which are by default stored in the output/checkpoint and output/strategy folders. Users can modify the output_dir parameter in the YAML configuration to change the path of the output folder.
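For example, a minimal sketch of redirecting the output folder (the path is a placeholder):
output_dir: ./output   # weights go to ./output/checkpoint, strategy files to ./output/strategy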
Weight files mainly store network parameters, optimizer parameters, and resume training information. Weight files are saved separately in rank-specific folders, and each rank folder maintains a meta.json file to record the last fully saved weight information for that rank. Taking a single-machine 8-card setup as an example, the weight saving format is as follows:
output/checkpoint
├── rank_0
│   ├── meta.json
│   └── {prefix}-{epoch}_{step}.safetensors
├── rank_1
│   ├── meta.json
│   └── {prefix}-{epoch}_{step}.safetensors
...
└── rank_7
    ├── meta.json
    └── {prefix}-{epoch}_{step}.safetensors
The prefix of the weight name contains rank_id information, e.g., llama3_1_8b_rank_0. If a weight with the same prefix already exists when saving, an incremental suffix will be automatically added to the prefix to prevent overwriting old weights. For example, if llama3_1_8b_rank_0 already exists, the prefix will be updated to llama3_1_8b_rank_0_1, and if llama3_1_8b_rank_0_1 also exists, it will be updated to llama3_1_8b_rank_0_2.
Strategy files are only saved in distributed training tasks and are used for weight strategy conversion. Strategy files are saved in ckpt format with the rank_id as the suffix, mainly recording the network and optimizer sharding information for the current rank. Taking a single-machine 8-card setup as an example, the strategy file saving format is as follows:
output/strategy
├── ckpt_strategy_rank_0.ckpt
├── ckpt_strategy_rank_1.ckpt
...
└── ckpt_strategy_rank_7.ckpt
Strategy files will overwrite old files when saved. To prevent overwriting or mixing strategy files from different tasks, please promptly save strategy files to a custom folder.
For more information about weights, refer to Ckpt Weights and Safetensors Weights.
YAML Parameter Configuration Description
| Parameter | Description |
|---|---|
| load_checkpoint | Path to the weight file or folder, required for resuming training. Default is an empty string. |
| src_strategy_path_or_dir | Path to the strategy file or folder corresponding to the weights in load_checkpoint, required when auto_trans_ckpt is enabled and those weights are distributed weights. Default is an empty string. |
| auto_trans_ckpt | Switch for automatic weight conversion, needs to be enabled when the weights configured in load_checkpoint do not match the distributed strategy of the current task. Default is False. |
| transform_process_num | Number of processes used for automatic weight conversion, only applicable to automatic conversion of ckpt format weights, which can accelerate weight conversion. Default is 1. |
| resume_training | Switch for resuming training, can be set to True to resume from the last fully saved weights, or to a specific weight file name to resume from the weights of that step. Default is False. |
| load_ckpt_format | Format of the weights configured in load_checkpoint, can be set to "ckpt" or "safetensors". Default is "ckpt". |
| remove_redundancy | Switch for loading without redundancy, needs to be enabled when the weights configured in load_checkpoint are safetensors format weights saved without redundancy. Default is False. |
| load_ckpt_async | Whether to execute weight loading in parallel with model compilation. This configuration only applies to asynchronous loading scenarios with ckpt format weights and an unchanged distributed strategy. Default is False. |
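As a reference, the sketch below pulls these parameters into one configuration for a safetensors resumption with strategy conversion; paths are placeholders and parameters that are not needed can be omitted:
load_checkpoint: /path/to/checkpoint            # weights saved by the previous task
src_strategy_path_or_dir: /path/to/strategy     # strategy files of those weights, needed with auto_trans_ckpt
auto_trans_ckpt: True                           # convert the weights to the current distributed strategy
resume_training: True                           # or a weight name such as {prefix}-{epoch}_{step}.safetensors
load_ckpt_format: safetensors
remove_redundancy: False                        # True only if the weights were saved without redundancy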
Introduction to Resume Training Scenarios
Interrupted Training Resumption
Overview: Resume training based on saved weights after an unexpected interruption of a normal training task, without changing the distributed strategy.
Resume training from the last fully saved weights
load_checkpoint: /path/to/checkpoint
resume_training: True
The system will automatically search for and load the last fully saved weights for resumption, based on the weight records in each rank's meta.json. If there is no meta.json in the rank sub-folders of the weight folder, it falls back to resuming from the weights with the latest timestamp in each rank folder.
Resume training from weights of a specified step
load_checkpoint: /path/to/checkpoint
resume_training: {prefix}-{epoch}_{step}.safetensors   # For ckpt weights, fill in {prefix}-{epoch}_{step}.ckpt
Users must ensure the integrity of the specified weights. Each rank automatically replaces the rank information in the prefix to update the weight name to be loaded. For example, if the specified weight name is llama3_1_8b_rank_0-200_1.safetensors, then when loading rank_1 the weight name is replaced with llama3_1_8b_rank_1-200_1.safetensors. An error will occur if the weight is missing for a certain rank.
Strategy Conversion Resumption
Overview: Continue training after modifying the distributed strategy or scaling the cluster up or down; this requires enabling automatic weight conversion.
Safetensors Weights
Enabling automatic weight conversion will automatically merge safetensors weights into full weights for distributed loading. The merged safetensors weights will be saved to the output/unified_checkpoint folder. If the weights have been offline merged into full weights, they will be directly loaded in a distributed manner. For offline merging steps, refer to the Safetensors Weights - Weight Slicing and Merging section.
Resume training from the last fully saved weights
load_checkpoint: /path/to/checkpoint
src_strategy_path_or_dir: /path/to/strategy
resume_training: True
auto_trans_ckpt: True
Resume training from weights of a specified step
load_checkpoint: /path/to/checkpoint
src_strategy_path_or_dir: /path/to/strategy
resume_training: {prefix}-{epoch}_{step}.safetensors
auto_trans_ckpt: True
Resume training from merged weights
load_checkpoint: /path/to/unified_checkpoint
resume_training: True
auto_trans_ckpt: True
Ckpt Weights
Enabling automatic weight conversion will automatically convert weights to the distributed strategy of the current task before loading. The converted ckpt weights will be saved to the output/transformed_checkpoint folder and can be directly loaded in subsequent runs without enabling automatic weight conversion.
If a rank sub-folder contains weight files from multiple steps, filter the weights offline so that each rank sub-folder contains only the single ckpt file to be loaded.
load_checkpoint: /path/to/checkpoint
src_strategy_path_or_dir: /path/to/strategy
resume_training: True
auto_trans_ckpt: True
transform_process_num: 8
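Once this first converted run has produced weights under output/transformed_checkpoint, a later resumption can, as noted above, point at them directly without conversion:
load_checkpoint: /path/to/output/transformed_checkpoint
resume_training: True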
Incremental Training Resumption
Overview: The training data is produced and trained on incrementally: after training on the current dataset, newly produced datasets are added for continued training until all data has been processed. This scenario requires users to preset the total number of steps of the learning rate curve in advance, based on the total amount of training data.
Assume a total of 10T tokens of data will be trained, with each produced dataset containing 1T tokens. The entire training process is completed in 10 epochs, requiring a total of 100,000 steps.
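As a rough sanity check on that step count (the global batch size and sequence length below are illustrative assumptions, not values from this guide):
tokens per step ≈ 12288 sequences × 8192 tokens ≈ 1.0e8 tokens
total_steps ≈ 1e13 tokens ÷ 1.0e8 tokens per step ≈ 100,000 steps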
Step 1: Preset the total training steps to fix the learning rate curve for the entire training process
lr_schedule:
  total_steps: 100000
Step 2: Set a sufficiently large epoch value to ensure all datasets can be trained
runner_config:
  epochs: 15
The learning rate curve for the entire training process is fixed, and the epoch value setting will not affect the learning rate. You can set a larger value to ensure that all 10 datasets are fully trained.
Step 3: After training 1 epoch of the dataset, replace the dataset and resume training. The following example resumes from the last fully saved weights; for other resumption methods, refer to Interrupted Training Resumption or Strategy Conversion Resumption.
load_checkpoint: /path/to/checkpoint
resume_training: True
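In the same YAML, the dataset path is pointed at the newly produced data before relaunching. The keys below follow a typical MindSpore Transformers train_dataset section and are only a sketch; the exact layout depends on your model's configuration:
train_dataset:
  data_loader:
    type: MindDataset
    dataset_dir: /path/to/new_dataset   # the newly produced 1T-token dataset
    shuffle: True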
Because sample counts differ across datasets, the displayed epoch and step may change when resuming with a new dataset; the total number of training steps remains unchanged, so this is expected behavior.
Automatic Recovery Resumption
Overview: To facilitate automatic resumption of training by the platform without manual intervention, configure load_checkpoint to the save path of weight checkpoints. During the first training run, this directory is empty, and training will start normally with randomly initialized weights. For resumption, training will resume from the last fully saved weights in this directory.
load_checkpoint: /path/to/output/checkpoint
resume_training: True
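The key point is that load_checkpoint refers to the checkpoint folder under the current task's own output directory; a sketch assuming the default output_dir of ./output:
output_dir: ./output                   # weights are saved to ./output/checkpoint
load_checkpoint: ./output/checkpoint   # empty on the first run, holds saved weights afterwards
resume_training: True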
Notes and Recommendations
Distributed resume training must enable data sinking mode by configuring sink_mode=True.
It is recommended to set the SHARED_PATHS environment variable to the path of the top-level shared directory. For example, if /data01 is the shared directory and the project directory is under it, configure export SHARED_PATHS=/data01.
It is recommended to save the weights and strategy files of training tasks with different distributed strategies in separate folders.
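For the first note, data sinking is typically enabled in the runner configuration; a minimal sketch assuming the standard runner_config section:
runner_config:
  sink_mode: True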