# Evaluating the Model during Training

`Linux` `Ascend` `GPU` `CPU` `Beginner` `Intermediate` `Expert` `Model Export` `Model Training`

[![View Source On Gitee](../_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.0/tutorials/training/source_en/advanced_use/evaluate_the_model_during_training.md)

## Overview

Training a complex network usually takes dozens or even hundreds of epochs. Before training, it is difficult to know at which epoch the model will reach the required accuracy. Therefore, the model accuracy is usually validated at a fixed epoch interval during training, and the corresponding model is saved. After training is complete, you can quickly select the optimal model by checking how the accuracy of the saved models changes. This section uses this method and takes the LeNet network as an example.

The procedure is as follows:

1. Define the callback function EvalCallBack to implement synchronous training and validation.
2. Define a training network and execute it.
3. Draw a line chart based on the model accuracy under different epochs and select the optimal model.

For a complete example, see [notebook](https://gitee.com/mindspore/docs/blob/r1.0/tutorials/notebook/evaluate_the_model_during_training.ipynb).

## Defining the Callback Function EvalCallBack

Implementation idea: The model accuracy is validated every n epochs. The validation is implemented in a user-defined callback. For details about the usage, see [API Description](https://www.mindspore.cn/doc/api_python/en/r1.0/mindspore/mindspore.train.html#mindspore.train.callback.Callback).

Core implementation: Validation points are set in `epoch_end` of the callback function as follows:

`cur_epoch % eval_per_epoch == 0`: indicates that the model accuracy is validated every `eval_per_epoch` epochs.

- `cur_epoch`: indicates the epoch number in the current training process.
- `eval_per_epoch`: indicates a user-defined value, that is, the validation frequency.

Other parameters are described as follows:

- `model`: indicates the `Model` object in MindSpore.
- `eval_dataset`: indicates the validation dataset.
- `epoch_per_eval`: records the validated model accuracy and the corresponding epoch numbers. The data format is `{"epoch": [], "acc": []}`.

```python
from mindspore.train.callback import Callback

class EvalCallBack(Callback):
    def __init__(self, model, eval_dataset, eval_per_epoch, epoch_per_eval):
        self.model = model
        self.eval_dataset = eval_dataset
        self.eval_per_epoch = eval_per_epoch
        self.epoch_per_eval = epoch_per_eval

    def epoch_end(self, run_context):
        cb_param = run_context.original_args()
        cur_epoch = cb_param.cur_epoch_num
        # Validate only at every eval_per_epoch-th epoch.
        if cur_epoch % self.eval_per_epoch == 0:
            acc = self.model.eval(self.eval_dataset, dataset_sink_mode=True)
            # Record the epoch number and its accuracy for later plotting.
            self.epoch_per_eval["epoch"].append(cur_epoch)
            self.epoch_per_eval["acc"].append(acc["Accuracy"])
            print(acc)
```

## Defining and Executing the Training Network

In the `CheckpointConfig` parameter for saving the model, you need to calculate the number of steps in a single epoch and then set the checkpoint interval to match the validation frequency. In this example, there are 1875 steps per epoch. To validate once every two epochs, set `save_checkpoint_steps=eval_per_epoch*1875`, where `eval_per_epoch` is equal to 2.

The parameters are described as follows:

- `config_ck`: defines the configuration for saving the model.
- `save_checkpoint_steps`: indicates the number of steps between two saved models.
- `keep_checkpoint_max`: indicates the maximum number of models that can be saved.
- `ckpoint_cb`: defines the name and path for saving the model.
- `model`: defines a model.
- `model.train`: indicates the model training function.
- `epoch_per_eval`: defines the dictionary that collects the `epoch` numbers and the corresponding model accuracy.
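The step count used for `save_checkpoint_steps` can be derived rather than hard-coded. The following is a minimal sketch in plain Python, assuming 60,000 training samples and a batch size of 32. Neither value is stated in this section; they are chosen only because they yield the 1875 steps per epoch mentioned above. The helper name `steps_per_epoch` is hypothetical.

```python
def steps_per_epoch(num_samples, batch_size, drop_remainder=True):
    """Return the number of batches (steps) in one pass over the dataset."""
    if drop_remainder:
        # The last incomplete batch is discarded.
        return num_samples // batch_size
    # Ceiling division keeps the last incomplete batch.
    return -(-num_samples // batch_size)


# Assumed values: 60,000 samples and batch size 32 are not given in this section.
steps = steps_per_epoch(60000, 32)
eval_per_epoch = 2
save_checkpoint_steps = eval_per_epoch * steps

print(steps)                  # 1875
print(save_checkpoint_steps)  # 3750
```

If the training dataset is already batched, calling `get_dataset_size()` on it should give the same per-epoch step count, which avoids the manual calculation.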
```python
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor
from mindspore.train import Model
from mindspore import context
from mindspore.nn.metrics import Accuracy

if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    ckpt_save_dir = "./lenet_ckpt"
    eval_per_epoch = 2

    ... ...

    # Calculate the number of steps in each epoch first; in this example, there are 1875 steps per epoch.
    config_ck = CheckpointConfig(save_checkpoint_steps=eval_per_epoch*1875, keep_checkpoint_max=15)
    ckpoint_cb = ModelCheckpoint(prefix="checkpoint_lenet", directory=ckpt_save_dir, config=config_ck)
    model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()})

    epoch_per_eval = {"epoch": [], "acc": []}
    eval_cb = EvalCallBack(model, eval_data, eval_per_epoch, epoch_per_eval)

    model.train(epoch_size, train_data, callbacks=[ckpoint_cb, LossMonitor(375), eval_cb],
                dataset_sink_mode=True)
```

The output is as follows:

    epoch: 1 step: 375, loss is 2.298612
    epoch: 1 step: 750, loss is 2.075152
    epoch: 1 step: 1125, loss is 0.39205977
    epoch: 1 step: 1500, loss is 0.12368304
    epoch: 1 step: 1875, loss is 0.20988345
    epoch: 2 step: 375, loss is 0.20582482
    epoch: 2 step: 750, loss is 0.029070046
    epoch: 2 step: 1125, loss is 0.041760832
    epoch: 2 step: 1500, loss is 0.067035824
    epoch: 2 step: 1875, loss is 0.0050643035
    {'Accuracy': 0.9763621794871795}

    ... ...

    epoch: 9 step: 375, loss is 0.021227183
    epoch: 9 step: 750, loss is 0.005586236
    epoch: 9 step: 1125, loss is 0.029125651
    epoch: 9 step: 1500, loss is 0.00045874066
    epoch: 9 step: 1875, loss is 0.023556218
    epoch: 10 step: 375, loss is 0.0005807788
    epoch: 10 step: 750, loss is 0.02574059
    epoch: 10 step: 1125, loss is 0.108463734
    epoch: 10 step: 1500, loss is 0.01950589
    epoch: 10 step: 1875, loss is 0.10563098
    {'Accuracy': 0.979667467948718}

After training, the `lenet_ckpt` folder is created in the current directory. The folder contains five models and data related to a calculation graph.
The structure is as follows:

```
lenet_ckpt
├── checkpoint_lenet-10_1875.ckpt
├── checkpoint_lenet-2_1875.ckpt
├── checkpoint_lenet-4_1875.ckpt
├── checkpoint_lenet-6_1875.ckpt
├── checkpoint_lenet-8_1875.ckpt
└── checkpoint_lenet-graph.meta
```

## Defining the Function to Obtain the Model Accuracy in Different Epochs

Define the drawing function `eval_show`, pass `epoch_per_eval` to `eval_show`, and draw the chart of model accuracy against `epoch`.

```python
import matplotlib.pyplot as plt

def eval_show(epoch_per_eval):
    plt.xlabel("epoch number")
    plt.ylabel("Model accuracy")
    plt.title("Model accuracy variation chart")
    # Plot accuracy against epoch number as a red line.
    plt.plot(epoch_per_eval["epoch"], epoch_per_eval["acc"], "red")
    plt.show()

eval_show(epoch_per_eval)
```

The output is as follows:

![png](./images/evaluate_the_model_during_training.png)

You can easily select the optimal model based on the preceding figure.

## Summary

This example trains the LeNet5 convolutional neural network on the MNIST dataset. This section describes how to validate a model during training, save the model at the corresponding `epoch`, and select the optimal model.
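Reading the best epoch off the chart can also be done programmatically. The sketch below is a hedged illustration in plain Python: the helper `best_checkpoint` is hypothetical, the file-name pattern is inferred from the `lenet_ckpt` listing shown earlier, and the sample dictionary holds only the two accuracy values printed in the training log, because the intermediate values are elided there.

```python
import os


def best_checkpoint(epoch_per_eval, ckpt_dir, prefix, steps_per_epoch):
    """Return (epoch, accuracy, path) for the highest recorded accuracy."""
    pairs = zip(epoch_per_eval["epoch"], epoch_per_eval["acc"])
    best_epoch, best_acc = max(pairs, key=lambda pair: pair[1])
    # Assumed naming pattern: <prefix>-<epoch>_<steps per epoch>.ckpt
    file_name = "{}-{}_{}.ckpt".format(prefix, best_epoch, steps_per_epoch)
    return best_epoch, best_acc, os.path.join(ckpt_dir, file_name)


# Only the two accuracies shown in the training log; other epochs are elided there.
epoch_per_eval = {"epoch": [2, 10], "acc": [0.9763621794871795, 0.979667467948718]}
epoch, acc, path = best_checkpoint(epoch_per_eval, "./lenet_ckpt", "checkpoint_lenet", 1875)
print(epoch, acc, path)
```

If several epochs share the highest accuracy, `max` returns the earliest one, so the oldest of the tied checkpoints is selected.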