Fault Recovery

During distributed parallel training, MindSpore provides three recovery methods for problems such as compute node failures or communication interruptions:

  • Model Reloading: During training, configure checkpoint saving so that sliced parameters are merged; each card then saves a complete model parameter file that can be loaded directly to resume training from the checkpoint (see the first sketch after this list). See Model loading in Model saving and loading for details.

  • Disaster Recovery in Dynamic Cluster Scenarios: When training is launched as a dynamic cluster and one process fails, the remaining processes enter a waiting state, and training can continue by restarting the failed process without restarting the whole cluster (currently supported only on GPU platforms); see the second sketch after this list.

  • Fault Recovery Based on Redundant Information: In large model training, devices partitioned along the data-parallel dimension hold identical model parameters. This redundancy can serve as a backup: if one node fails, another node holding the same parameters can be used to recover it; see the third sketch after this list.
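
The first method relies on saving merged checkpoints during training and loading them back after a failure. Below is a minimal sketch of that flow; `net`, `loss`, `optimizer`, `train_dataset`, and the checkpoint file name are placeholders assumed to exist in your training script, not part of the original documentation.

```python
import mindspore as ms
from mindspore.train import Model, ModelCheckpoint, CheckpointConfig

# integrated_save=True merges the sliced parameters in parallel training,
# so each card writes a complete parameter file that can be loaded directly.
ckpt_config = CheckpointConfig(save_checkpoint_steps=100, integrated_save=True)
ckpt_cb = ModelCheckpoint(prefix="resnet", directory="./ckpt", config=ckpt_config)

model = Model(net, loss_fn=loss, optimizer=optimizer)
model.train(10, train_dataset, callbacks=[ckpt_cb])

# Recovery: load the complete checkpoint back into the network and resume training.
param_dict = ms.load_checkpoint("./ckpt/resnet-10_100.ckpt")
ms.load_param_into_net(net, param_dict)
```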
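
The second method works at the process level in dynamic cluster startup. The sketch below shows how a worker process might enable disaster recovery through environment variables before joining the cluster; the variable names follow the dynamic cluster startup documentation, while the host, port, worker count, and recovery path are illustrative placeholders.

```python
import os
import mindspore as ms
from mindspore.communication import init

os.environ["MS_ENABLE_RECOVERY"] = "1"            # turn on disaster recovery
os.environ["MS_RECOVERY_PATH"] = "/tmp/recovery"  # persistent path for recovery state
os.environ["MS_ROLE"] = "MS_WORKER"               # this process acts as a worker
os.environ["MS_WORKER_NUM"] = "8"                 # total number of workers
os.environ["MS_SCHED_HOST"] = "127.0.0.1"         # scheduler address
os.environ["MS_SCHED_PORT"] = "8118"              # scheduler port

ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU")
init()  # join the cluster; if this process fails later, it can be re-launched
        # with the same configuration while the surviving processes wait
```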
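
The third method exploits the fact that ranks in the same data-parallel group store identical parameter slices. The sketch below illustrates the idea only: it assumes a hypothetical layout with 2-way model parallelism as the inner dimension and 4-way data parallelism, a per-rank checkpoint directory pattern, and an already-built `net` with the same sharding; the actual grouping depends on your parallel strategy.

```python
import os
import mindspore as ms
from mindspore.communication import init, get_rank

init()
rank = get_rank()
model_parallel_size = 2   # assumed model-parallel degree
data_parallel_size = 4    # assumed data-parallel degree

# Under the assumed layout, ranks with the same (rank % model_parallel_size)
# occupy the same model-parallel position and hold identical parameters.
peers = [rank % model_parallel_size + i * model_parallel_size
         for i in range(data_parallel_size)]

# Try this rank's own checkpoint first, then fall back to a surviving peer's copy.
for r in [rank] + [p for p in peers if p != rank]:
    ckpt = f"./ckpt/rank_{r}/net-10_100.ckpt"
    if os.path.exists(ckpt):
        param_dict = ms.load_checkpoint(ckpt)
        ms.load_param_into_net(net, param_dict)
        break
```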