[{"data":1,"prerenderedAt":411},["ShallowReactive",2],{"content-query-xf1agK6M30":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"category":13,"body":14,"_type":405,"_id":406,"_source":407,"_file":408,"_stem":409,"_extension":410},"/technology-blogs/en/1574","en",false,"","[AI Design Patterns] 05 - Checkpoints: Periodically Saving Models","The Checkpoints pattern ensures model training reliability and makes early stopping easier.","2022-06-02","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/06/27/aef47f48d2fb4f04a63d30001de6d056.png","technology-blogs","Influencers",{"type":15,"children":16,"toc":402},"root",[17,32,38,46,51,60,65,73,78,83,88,96,101,106,114,119,124,131,136,144,152,164,169,174,179,184,189,194,199,230,235,240,245,250,255,271,279,284,289,294,299,307,319,326,331,336,341,349,354,362,372,383],{"type":18,"tag":19,"props":20,"children":22},"element","h1",{"id":21},"ai-design-patterns-05-checkpoints-periodically-saving-models",[23,30],{"type":18,"tag":24,"props":25,"children":26},"span",{},[27],{"type":28,"value":29},"text","AI Design Patterns",{"type":28,"value":31}," 05 - Checkpoints: Periodically Saving Models",{"type":18,"tag":33,"props":34,"children":35},"p",{},[36],{"type":28,"value":37},"The Eager pattern is used for data processing. Which patterns can be used for model training? In the database domain, to avoid re-executing a failed storage process that takes a long time, the storage process is continuously recorded as checkpoints. In this way, a failed storage process can continue from the latest checkpoint, avoiding waste of time. A model training process usually takes longer than a storage process. Without a reliability mechanism, a model must be trained again from the beginning once the training process fails, wasting a lot of time. 
Similar to the mechanism used in the database domain, the Checkpoints pattern is a mechanism that ensures model training reliability.",{"type":18,"tag":33,"props":39,"children":40},{},[41],{"type":18,"tag":42,"props":43,"children":45},"img",{"alt":7,"src":44},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/06/25/f72b934bb25e4b67a4709a94a62604c1.png",[],{"type":18,"tag":33,"props":47,"children":48},{},[49],{"type":28,"value":50},"Overview of AI design patterns",{"type":18,"tag":33,"props":52,"children":53},{},[54],{"type":18,"tag":55,"props":56,"children":57},"strong",{},[58],{"type":28,"value":59},"Definition",{"type":18,"tag":33,"props":61,"children":62},{},[63],{"type":28,"value":64},"The Checkpoints pattern refers to the process of saving the full state of a model periodically (according to the number of iterations or training duration). When a model training fails, a saved checkpoint can be used to resume the training. This saves a lot of time because the model does not need to be trained again from the beginning. The Checkpoints pattern is suitable for scenarios where the training takes a long time or early stopping and fine-tuning need to be implemented. This pattern is also ideal for resumable training upon exceptions.",{"type":18,"tag":33,"props":66,"children":67},{},[68],{"type":18,"tag":55,"props":69,"children":70},{},[71],{"type":28,"value":72},"Challenge",{"type":18,"tag":33,"props":74,"children":75},{},[76],{"type":28,"value":77},"The cost is high to resume a failed training of a complex model from the beginning. The more layers a neural network has, or the larger the training dataset, the more time is required for training because more parameters and data samples need to be tuned and processed. For example, it takes 3 to 4 hours to train a VGG16 network on the CIFAR-10 dataset using a common NVIDIA GPU. 
Hours of training time would be wasted if the training fails and needs to be restarted from the beginning.",{"type":18,"tag":33,"props":79,"children":80},{},[81],{"type":28,"value":82},"Also, after a training runs for a long time, the precision may remain unchanged and overfitting may occur. In this case, it is more efficient to stop the training early (this practice is called early stopping) to obtain an intermediate model state.",{"type":18,"tag":33,"props":84,"children":85},{},[86],{"type":28,"value":87},"What's more, fine-tuning needs to be performed on intermediate models, so that the training can be more effective against new data to achieve better generalization.",{"type":18,"tag":33,"props":89,"children":90},{},[91],{"type":18,"tag":55,"props":92,"children":93},{},[94],{"type":28,"value":95},"Solution",{"type":18,"tag":33,"props":97,"children":98},{},[99],{"type":28,"value":100},"A model state is saved at the end of each epoch. If the next epoch fails, the training can continue from a saved checkpoint. Compared with an exported model (such as a final neural network model containing information about weights, activation functions, and hidden layers), the intermediate model state includes additional information, such as the epoch and current batch count, to ensure that the training can be resumed from it. However, the learning rate is usually not included in a checkpoint because it is dynamically adjusted during training.",{"type":18,"tag":33,"props":102,"children":103},{},[104],{"type":28,"value":105},"Ideally, a checkpoint should be saved after weights are updated based on each batch of data, but this leads to too many intermediate models occupying huge space. 
In practice, a checkpoint is usually saved at the end of each epoch, or only the latest several checkpoints are retained.",{"type":18,"tag":33,"props":107,"children":108},{},[109],{"type":18,"tag":55,"props":110,"children":111},{},[112],{"type":28,"value":113},"Scenarios",{"type":18,"tag":33,"props":115,"children":116},{},[117],{"type":28,"value":118},"Most AI frameworks provide the checkpointing capability during model training. MindSpore provides the ModelCheckpoint and CheckpointConfig modules through the training APIs to help you save checkpoints. You can manually add checkpoints, save checkpoints periodically (according to the number of iterations or training duration), or save checkpoints upon an exception or failure.",{"type":18,"tag":33,"props":120,"children":121},{},[122],{"type":28,"value":123},"A checkpoint file is a binary file that stores the values of all training parameters. The Protocol Buffers mechanism is adopted, which is language- and platform-agnostic and thus delivers good scalability.",{"type":18,"tag":33,"props":125,"children":126},{},[127],{"type":18,"tag":42,"props":128,"children":130},{"alt":7,"src":129},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/06/25/6b8fb677c3c048ccaa281f93a7323aec.png",[],{"type":18,"tag":33,"props":132,"children":133},{},[134],{"type":28,"value":135},"The following describes how to periodically save model states in MindSpore and how to save a model state upon an exception.",{"type":18,"tag":33,"props":137,"children":138},{},[139],{"type":18,"tag":55,"props":140,"children":141},{},[142],{"type":28,"value":143},"Saving Periodically",{"type":18,"tag":33,"props":145,"children":146},{},[147],{"type":18,"tag":55,"props":148,"children":149},{},[150],{"type":28,"value":151},"(1) Saving according to the number of iterations",{"type":18,"tag":33,"props":153,"children":154},{},[155,157,162],{"type":28,"value":156},"The following code snippet is an example of saving checkpoints according to the number 
of iterations and applying the saving policy through a callback during model training. During training, a checkpoint is saved every 1875 steps and a maximum of 10 intermediate models are retained. The model name format is ",{"type":18,"tag":55,"props":158,"children":159},{},[160],{"type":28,"value":161},"checkpoint_lenet-1_1875.ckpt",{"type":28,"value":163},".",{"type":18,"tag":33,"props":165,"children":166},{},[167],{"type":28,"value":168},"from mindspore.train.callback import ModelCheckpoint, CheckpointConfig",{"type":18,"tag":33,"props":170,"children":171},{},[172],{"type":28,"value":173},"# Set the model saving policy. In this example, a model is saved every 1875 steps, and a maximum of 10 checkpoint models are retained.",{"type":18,"tag":33,"props":175,"children":176},{},[177],{"type":28,"value":178},"config_ck = CheckpointConfig(save_checkpoint_steps=1875, keep_checkpoint_max=10)",{"type":18,"tag":33,"props":180,"children":181},{},[182],{"type":28,"value":183},"# Apply the model saving policy.",{"type":18,"tag":33,"props":185,"children":186},{},[187],{"type":28,"value":188},"ckpoint_cb = ModelCheckpoint(prefix=\"checkpoint_lenet\", config=config_ck)",{"type":18,"tag":33,"props":190,"children":191},{},[192],{"type":28,"value":193},"# Apply the saving policy to model training through a callback.",{"type":18,"tag":33,"props":195,"children":196},{},[197],{"type":28,"value":198},"model.train(epoch_size, ds_train, callbacks=[ckpoint_cb])",{"type":18,"tag":33,"props":200,"children":201},{},[202,204,209,211,216,218,222,224,228],{"type":28,"value":203},"The ",{"type":18,"tag":55,"props":205,"children":206},{},[207],{"type":28,"value":208},"load_checkpoint",{"type":28,"value":210}," and ",{"type":18,"tag":55,"props":212,"children":213},{},[214],{"type":28,"value":215},"load_param_into_net",{"type":28,"value":217}," functions are used to load models. 
The following code uses ",{"type":18,"tag":55,"props":219,"children":220},{},[221],{"type":28,"value":208},{"type":28,"value":223}," to load network parameters from a saved checkpoint and imports them to a network instance using ",{"type":18,"tag":55,"props":225,"children":226},{},[227],{"type":28,"value":215},{"type":28,"value":229}," for subsequent training or evaluation.",{"type":18,"tag":33,"props":231,"children":232},{},[233],{"type":28,"value":234},"from mindspore import load_checkpoint, load_param_into_net",{"type":18,"tag":33,"props":236,"children":237},{},[238],{"type":28,"value":239},"# Load the saved model used for testing.",{"type":18,"tag":33,"props":241,"children":242},{},[243],{"type":28,"value":244},"param_dict = load_checkpoint(\"checkpoint_lenet-1_1875.ckpt\")",{"type":18,"tag":33,"props":246,"children":247},{},[248],{"type":28,"value":249},"# Load parameters to the network.",{"type":18,"tag":33,"props":251,"children":252},{},[253],{"type":28,"value":254},"load_param_into_net(net, param_dict)",{"type":18,"tag":33,"props":256,"children":257},{},[258,260,269],{"type":28,"value":259},"See the example in ",{"type":18,"tag":261,"props":262,"children":266},"a",{"href":263,"rel":264},"https://mindspore.cn/tutorials/en/r1.7/beginner/quick_start.html",[265],"nofollow",[267],{"type":28,"value":268},"[1]",{"type":28,"value":270}," for the complete code.",{"type":18,"tag":33,"props":272,"children":273},{},[274],{"type":18,"tag":55,"props":275,"children":276},{},[277],{"type":28,"value":278},"(2) Saving according to training duration",{"type":18,"tag":33,"props":280,"children":281},{},[282],{"type":28,"value":283},"This saving policy provides second- and minute-based configuration parameters. 
For example, in the following code, a checkpoint file is saved every 30 seconds, and one checkpoint is retained per 3 minutes.",{"type":18,"tag":33,"props":285,"children":286},{},[287],{"type":28,"value":288},"from mindspore import CheckpointConfig",{"type":18,"tag":33,"props":290,"children":291},{},[292],{"type":28,"value":293},"# A checkpoint file is saved every 30 seconds, and one checkpoint is retained per 3 minutes.",{"type":18,"tag":33,"props":295,"children":296},{},[297],{"type":28,"value":298},"config_ck = CheckpointConfig(save_checkpoint_seconds=30, keep_checkpoint_per_n_minutes=3)",{"type":18,"tag":33,"props":300,"children":301},{},[302],{"type":18,"tag":55,"props":303,"children":304},{},[305],{"type":28,"value":306},"Saving Upon Exception",{"type":18,"tag":33,"props":308,"children":309},{},[310,312,317],{"type":28,"value":311},"When training large models, the interval between checkpoints is prolonged to reduce the number of checkpoints to be saved. For example, when training a Pangu model, the interval between checkpoints is 4 to 5 hours. If a failure occurs between two checkpoints, hours of time will be wasted even though checkpoints are implemented. In MindSpore 1.7, resumable training is provided as an enhancement of the checkpoint function. A checkpoint is triggered upon an exception, ensuring that the training can continue from the state when the exception occurs. This saves time. 
To use resumable training, simply add ",{"type":18,"tag":55,"props":313,"children":314},{},[315],{"type":28,"value":316},"exception_save=True",{"type":28,"value":318}," when configuring the policy.",{"type":18,"tag":33,"props":320,"children":321},{},[322],{"type":18,"tag":42,"props":323,"children":325},{"alt":7,"src":324},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/06/25/e8ded96945d442eeba0e697c30521bfd.png",[],{"type":18,"tag":33,"props":327,"children":328},{},[329],{"type":28,"value":330},"from mindspore import ModelCheckpoint, CheckpointConfig",{"type":18,"tag":33,"props":332,"children":333},{},[334],{"type":28,"value":335},"# Enable resumable training.",{"type":18,"tag":33,"props":337,"children":338},{},[339],{"type":28,"value":340},"config_ck = CheckpointConfig(save_checkpoint_steps=32, keep_checkpoint_max=10, exception_save=True)",{"type":18,"tag":33,"props":342,"children":343},{},[344],{"type":18,"tag":55,"props":345,"children":346},{},[347],{"type":28,"value":348},"Summary",{"type":18,"tag":33,"props":350,"children":351},{},[352],{"type":28,"value":353},"The Checkpoints pattern ensures model training reliability and makes early stopping easier. Resumable training is of great value to large models, avoiding the waste of time when an exception occurs. 
The Checkpoints pattern also facilitates fine-tuning during transfer learning, which will be introduced in the next article of this series.",{"type":18,"tag":33,"props":355,"children":356},{},[357],{"type":18,"tag":55,"props":358,"children":359},{},[360],{"type":28,"value":361},"References",{"type":18,"tag":33,"props":363,"children":364},{},[365,367],{"type":28,"value":366},"[1] MindSpore Quickstart example: ",{"type":18,"tag":261,"props":368,"children":370},{"href":263,"rel":369},[265],[371],{"type":28,"value":263},{"type":18,"tag":33,"props":373,"children":374},{},[375,377],{"type":28,"value":376},"[2] Saving models in MindSpore: ",{"type":18,"tag":261,"props":378,"children":381},{"href":379,"rel":380},"https://gitee.com/mindspore/docs/blob/master/tutorials/source_en/advanced/train/save.md",[265],[382],{"type":28,"value":379},{"type":18,"tag":33,"props":384,"children":385},{},[386,388,394,396],{"type":28,"value":387},"[3] ",{"type":18,"tag":389,"props":390,"children":391},"em",{},[392],{"type":28,"value":393},"Machine Learning Design Patterns",{"type":28,"value":395},": ",{"type":18,"tag":261,"props":397,"children":400},{"href":398,"rel":399},"https://www.oreilly.com/library/view/machine-learning-design/9781098115777/",[265],[401],{"type":28,"value":398},{"title":7,"searchDepth":403,"depth":403,"links":404},4,[],"markdown","content:technology-blogs:en:1574.md","content","technology-blogs/en/1574.md","technology-blogs/en/1574","md",1776506103334]