{ "cells": [ { "cell_type": "markdown", "id": "6b898861", "metadata": {}, "source": [ "# 回调机制 Callback\n", "\n", "[![在OpenI运行](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.9/resource/_static/logo_openi.png)](https://openi.pcl.ac.cn/MindSpore/docs/src/branch/r1.9/tutorials/source_zh_cn/advanced/model/callback.ipynb?card=2&image=MindSpore1.8.1) [![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.9/resource/_static/logo_notebook.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.9/tutorials/zh_cn/advanced/model/mindspore_callback.ipynb) \n", "[![下载样例代码](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.9/resource/_static/logo_download_code.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.9/tutorials/zh_cn/advanced/model/mindspore_callback.py) \n", "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.9/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.9/tutorials/source_zh_cn/advanced/model/callback.ipynb)\n", "\n", "在深度学习训练过程中,为及时掌握网络模型的训练状态、实时观察网络模型各参数的变化情况和实现训练过程中用户自定义的一些操作,MindSpore提供了回调机制(Callback)来实现上述功能。\n", "\n", "Callback回调机制一般用在网络模型训练过程`Model.train`中,MindSpore的`Model`会按照Callback列表`callbacks`顺序执行回调函数,用户可以通过设置不同的回调类来实现在训练过程中或者训练后执行的功能。\n", "\n", "> 更多内置回调类的信息及使用方式请参考[API文档](https://www.mindspore.cn/docs/zh-CN/r1.9/api_python/mindspore/mindspore.Callback.html#mindspore.Callback)。\n", "\n", "## Callback介绍\n", "\n", "当聊到回调Callback的时候,大部分用户都会觉得很难理解,是不是需要堆栈或者特殊的调度方式,实际上我们简单的理解回调:\n", "\n", "假设函数A有一个参数,这个参数是个函数B,当函数A执行完以后执行函数B,那么这个过程就叫回调。\n", "\n", "`Callback`是回调的意思,MindSpore中的回调函数实际上不是一个函数而是一个类,用户可以使用回调机制来**观察训练过程中网络内部的状态和相关信息,或在特定时期执行特定动作**。\n", "\n", "例如监控损失函数Loss、保存模型参数ckpt、动态调整参数lr、提前终止训练任务等。下面我们继续以手写体识别模型为例,介绍常见的内置回调函数和自定义回调函数。" ] }, { "cell_type": "code", "execution_count": 1, "id": "53502ae0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip (10.3 MB)\n", "\n", "file_sizes: 100%|██████████████████████████| 10.8M/10.8M [00:00<00:00, 23.8MB/s]\n", "Extracting zip file...\n", "Successfully downloaded / unzipped to ./\n" ] } ], "source": [ "from download import download\n", "from mindspore import nn, Model\n", "from mindspore.dataset import vision, transforms, MnistDataset\n", "from mindspore.common.initializer import Normal\n", "from mindspore import dtype as mstype\n", "\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/\" \\\n", " \"notebook/datasets/MNIST_Data.zip\"\n", "\n", "# 下载MNIST数据集\n", "download(url, \"./\", kind=\"zip\", replace=True)\n", "\n", "# 数据处理\n", "def proc_dataset(data_path, batch_size=32):\n", " mnist_ds = MnistDataset(data_path, shuffle=True)\n", "\n", " # define map operations\n", " image_transforms = [\n", " vision.Resize(32),\n", " vision.Rescale(1.0 / 255.0, 0),\n", " vision.Normalize(mean=(0.1307,), std=(0.3081,)),\n", " vision.HWC2CHW()\n", " ]\n", "\n", " label_transform = transforms.TypeCast(mstype.int32)\n", "\n", " mnist_ds = mnist_ds.map(operations=label_transform, input_columns=\"label\")\n", " mnist_ds = mnist_ds.map(operations=image_transforms, input_columns=\"image\")\n", " mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)\n", " return mnist_ds\n", "\n", "train_dataset = proc_dataset('MNIST_Data/train')\n", "\n", "# 定义LeNet-5网络模型\n", "class LeNet5(nn.Cell):\n", "\n", " def __init__(self, num_class=10, num_channel=1):\n", " super(LeNet5, self).__init__()\n", " self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid')\n", " self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')\n", " self.relu = nn.ReLU()\n", " self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)\n", " self.flatten = nn.Flatten()\n", " self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02))\n", " self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02))\n", " self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02))\n", "\n", " def construct(self, x):\n", " x = self.conv1(x)\n", " x = self.relu(x)\n", " x = self.max_pool2d(x)\n", " x = self.conv2(x)\n", " x = self.relu(x)\n", " x = self.max_pool2d(x)\n", " x = self.flatten(x)\n", " x = self.relu(self.fc1(x))\n", " x = self.relu(self.fc2(x))\n", " x = self.fc3(x)\n", " return x\n", "\n", "def create_model():\n", " model = LeNet5()\n", " net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction=\"mean\")\n", " net_opt = nn.Momentum(model.trainable_params(), learning_rate=0.01, momentum=0.9)\n", " trainer = Model(model, loss_fn=net_loss, optimizer=net_opt, metrics={\"Accuracy\": nn.Accuracy()})\n", " return trainer\n", "\n", "trainer = create_model()" ] }, { "cell_type": "markdown", "id": "447deb5e", "metadata": {}, "source": [ "## 常用的内置回调函数\n", "\n", "MindSpore提供`Callback`能力,支持用户在训练/推理的特定阶段,插入自定义的操作。\n", "\n", "### ModelCheckpoint\n", "\n", "用于保存训练后的网络模型和参数,方便进行再推理或再训练,MindSpore提供了[ModelCheckpoint](https://mindspore.cn/docs/zh-CN/r1.9/api_python/mindspore/mindspore.ModelCheckpoint.html?highlight=modelcheckpoint#mindspore.ModelCheckpoint)接口,一般与配置保存信息接口[CheckpointConfig](https://mindspore.cn/docs/zh-CN/r1.9/api_python/mindspore/mindspore.CheckpointConfig.html#mindspore.CheckpointConfig)配合使用。" ] }, { "cell_type": "code", "execution_count": 2, "id": "f1665e56", "metadata": {}, "outputs": [], "source": [ "from mindspore import CheckpointConfig, ModelCheckpoint\n", "\n", "# 设置保存模型的配置信息\n", "config = CheckpointConfig(save_checkpoint_steps=1875, keep_checkpoint_max=10)\n", "# 实例化保存模型回调接口,定义保存路径和前缀名\n", "ckpt_callback = ModelCheckpoint(prefix=\"mnist\", directory=\"./checkpoint\", config=config)\n", "\n", "# 开始训练,加载保存模型和参数回调函数\n", "trainer.train(1, train_dataset, callbacks=[ckpt_callback])" ] }, { "cell_type": "markdown", "id": "68219c74", "metadata": {}, "source": [ "上面代码运行后,生成的Checkpoint文件目录结构如下:\n", "\n", "```text\n", "./checkpoint/\n", "├── mnist-1_1875.ckpt # 保存参数文件\n", "└── mnist-graph.meta # 编译后的计算图\n", "```\n", "\n", "### LossMonitor\n", "\n", "用于监控训练或测试过程中的损失函数值Loss变化情况,可设置`per_print_times`控制打印Loss值的间隔。" ] }, { "cell_type": "code", "execution_count": 3, "id": "40006c83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "epoch: 1 step: 1875, loss is 0.008795851841568947\n", "epoch: 2 step: 1875, loss is 0.007240554317831993\n", "epoch: 3 step: 1875, loss is 0.0036914246156811714\n" ] } ], "source": [ "from mindspore import LossMonitor\n", "\n", "loss_monitor = LossMonitor(1875)\n", "\n", "trainer.train(3, train_dataset, callbacks=[loss_monitor])" ] }, { "cell_type": "markdown", "id": "f333c296", "metadata": {}, "source": [ "训练场景下,LossMonitor监控训练的Loss值;边训练边推理场景下,监控训练的Loss值和推理的Metrics值。" ] }, { "cell_type": "code", "execution_count": 4, "id": "63c08736", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "epoch: 1 step: 1875, loss is 0.0026960039976984262\n", "Eval result: epoch 1, metrics: {'Accuracy': 0.9888822115384616}\n", "epoch: 2 step: 1875, loss is 0.00038617433165200055\n", "Eval result: epoch 2, metrics: {'Accuracy': 0.9877804487179487}\n" ] } ], "source": [ "test_dataset = proc_dataset('MNIST_Data/test')\n", "\n", "trainer.fit(2, train_dataset, test_dataset, callbacks=[loss_monitor])" ] }, { "cell_type": "markdown", "id": "ba362fc5", "metadata": {}, "source": [ "### TimeMonitor\n", "\n", "用于监控训练或测试过程的执行时间。可设置`data_size`控制打印执行时间的间隔。" ] }, { "cell_type": "code", "execution_count": 5, "id": "d48f823e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train epoch time: 3876.302 ms, per step time: 2.067 ms\n" ] } ], "source": [ "from mindspore import TimeMonitor\n", "\n", "time_monitor = TimeMonitor(1875)\n", "trainer.train(1, train_dataset, callbacks=[time_monitor])" ] }, { "cell_type": "markdown", "id": "b6488131", "metadata": {}, "source": [ "## 自定义回调机制\n", "\n", "MindSpore不仅有功能强大的内置回调函数,当用户有自己的特殊需求时,还可以基于`Callback`基类自定义回调类。\n", "\n", "用户可以基于`Callback`基类,根据自身的需求,实现自定义`Callback`。`Callback`基类定义如下所示:" ] }, { "cell_type": "code", "execution_count": 6, "id": "ad6d592d", "metadata": {}, "outputs": [], "source": [ "class Callback():\n", " \"\"\"Callback base class\"\"\"\n", " def on_train_begin(self, run_context):\n", " \"\"\"Called once before the network executing.\"\"\"\n", "\n", " def on_train_epoch_begin(self, run_context):\n", " \"\"\"Called before each epoch beginning.\"\"\"\n", "\n", " def on_train_epoch_end(self, run_context):\n", " \"\"\"Called after each epoch finished.\"\"\"\n", "\n", " def on_train_step_begin(self, run_context):\n", " \"\"\"Called before each step beginning.\"\"\"\n", "\n", " def on_train_step_end(self, run_context):\n", " \"\"\"Called after each step finished.\"\"\"\n", "\n", " def on_train_end(self, run_context):\n", " \"\"\"Called once after network training.\"\"\"" ] }, { "cell_type": "markdown", "id": "b6d218ef", "metadata": {}, "source": [ "回调机制可以把训练过程中的重要信息记录下来,通过把一个字典类型变量`RunContext.original_args()`,传递给Callback对象,使得用户可以在各个自定义的Callback中获取到相关属性,执行自定义操作,也可以自定义其他变量传递给`RunContext.original_args()`对象。\n", "\n", "`RunContext.original_args()`中的常用属性有:\n", "\n", "- epoch_num:训练的epoch的数量\n", "- batch_num:一个epoch中step的数量\n", "- cur_epoch_num:当前的epoch数\n", "- cur_step_num:当前的step数\n", "\n", "- loss_fn:损失函数\n", "- optimizer:优化器\n", "- train_network:训练的网络\n", "- train_dataset:训练的数据集\n", "- net_outputs:网络的输出结果\n", "\n", "- parallel_mode:并行模式\n", "- list_callback:所有的Callback函数\n", "\n", "通过下面两个场景,我们可以增加对自定义Callback回调机制功能的了解。\n", "\n", "### 自定义终止训练\n", "\n", "实现在规定时间内终止训练功能。用户可以设定时间阈值,当训练时间达到这个阈值后就终止训练过程。\n", "\n", "下面代码中,通过`run_context.original_args`方法可以获取到`cb_params`字典,字典里会包含前文描述的主要属性信息。\n", "\n", "同时可以对字典内的值进行修改和添加,在`begin`函数中定义一个`init_time`对象传递给`cb_params`字典。每个数据迭代结束`step_end`之后会进行判断,当训练时间大于设置的时间阈值时,会向run_context传递终止训练的信号,提前终止训练,并打印当前的epoch、step、loss的值。" ] }, { "cell_type": "code", "execution_count": 7, "id": "9cf7aa84", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Begin training, time is: 1673515004.6783535\n", "epoch: 1 step: 1875, loss is 0.0006050781812518835\n", "End training, time: 1673515009.1824663, epoch: 1, step: 1875, loss:0.0006050782\n" ] } ], "source": [ "import time\n", "from mindspore import Callback\n", "\n", "class StopTimeMonitor(Callback):\n", "\n", " def __init__(self, run_time):\n", " \"\"\"定义初始化过程\"\"\"\n", " super(StopTimeMonitor, self).__init__()\n", " self.run_time = run_time # 定义执行时间\n", "\n", " def on_train_begin(self, run_context):\n", " \"\"\"开始训练时的操作\"\"\"\n", " cb_params = run_context.original_args()\n", " cb_params.init_time = time.time() # 获取当前时间戳作为开始训练时间\n", " print(f\"Begin training, time is: {cb_params.init_time}\")\n", "\n", " def on_train_step_end(self, run_context):\n", " \"\"\"每个step结束后执行的操作\"\"\"\n", " cb_params = run_context.original_args()\n", " epoch_num = cb_params.cur_epoch_num # 获取epoch值\n", " step_num = cb_params.cur_step_num # 获取step值\n", " loss = cb_params.net_outputs # 获取损失值loss\n", " cur_time = time.time() # 获取当前时间戳\n", "\n", " if (cur_time - cb_params.init_time) > self.run_time:\n", " print(f\"End training, time: {cur_time}, epoch: {epoch_num}, step: {step_num}, loss:{loss}\")\n", " run_context.request_stop() # 停止训练\n", "\n", "train_dataset = proc_dataset('MNIST_Data/train')\n", "trainer.train(10, train_dataset, callbacks=[LossMonitor(), StopTimeMonitor(4)])" ] }, { "cell_type": "markdown", "id": "61ba7ead", "metadata": {}, "source": [ "从上面的打印结果可以看出,运行时间到达了阈值后就结束了训练。\n", "\n", "### 自定义阈值保存模型\n", "\n", "该回调机制实现当loss小于设定的阈值时,保存网络模型权重ckpt文件。\n", "\n", "示例代码如下:" ] }, { "cell_type": "code", "execution_count": 8, "id": "c4097da6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "epoch: 1 step: 1875, loss is 0.15191984176635742\n", "epoch: 2 step: 1875, loss is 0.14701086282730103\n", "epoch: 3 step: 1875, loss is 0.0020134493242949247\n", "Saved checkpoint, loss:0.0020134, current step num:5625.\n", "epoch: 4 step: 1875, loss is 0.018305214121937752\n", "epoch: 5 step: 1875, loss is 0.00019801077723968774\n", "Saved checkpoint, loss:0.0001980, current step num:9375.\n" ] } ], "source": [ "from mindspore import save_checkpoint\n", "\n", "# 定义保存ckpt文件的回调接口\n", "class SaveCkptMonitor(Callback):\n", " \"\"\"定义初始化过程\"\"\"\n", "\n", " def __init__(self, loss):\n", " super(SaveCkptMonitor, self).__init__()\n", " self.loss = loss # 定义损失值阈值\n", "\n", " def on_train_step_end(self, run_context):\n", " \"\"\"定义step结束时的执行操作\"\"\"\n", " cb_params = run_context.original_args()\n", " cur_loss = cb_params.net_outputs.asnumpy() # 获取当前损失值\n", "\n", " # 如果当前损失值小于设定的阈值就停止训练\n", " if cur_loss < self.loss:\n", " # 自定义保存文件名\n", " file_name = f\"./checkpoint/{cb_params.cur_epoch_num}_{cb_params.cur_step_num}.ckpt\"\n", " # 保存网络模型\n", " save_checkpoint(save_obj=cb_params.train_network, ckpt_file_name=file_name)\n", " print(\"Saved checkpoint, loss:{:8.7f}, current step num:{:4}.\".format(cur_loss, cb_params.cur_step_num))\n", "\n", "trainer = create_model()\n", "train_dataset = proc_dataset('MNIST_Data/train')\n", "trainer.train(5, train_dataset, callbacks=[LossMonitor(), SaveCkptMonitor(0.01)])" ] }, { "cell_type": "markdown", "id": "d789682d", "metadata": {}, "source": [ "最终,损失值小于设定阈值的网络权重被保存在`./checkpoint/`目录下。" ] } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 5 }