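{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Saving and Exporting Models\n", "\n", "[![Download Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.8/tutorials/zh_cn/advanced/train/mindspore_save.ipynb) [![Download Sample Code](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_download_code.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.8/tutorials/zh_cn/advanced/train/mindspore_save.py) [![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.8/tutorials/source_zh_cn/advanced/train/save.ipynb)\n", "\n", "During model training, you can add checkpoints (CheckPoint) to save the model parameters for later inference or further training. To run inference on a different hardware platform, you can generate MindIR, AIR, or ONNX files from the network and the CheckPoint file.\n", "\n", "- **CheckPoint**: Uses the Protocol Buffers mechanism and stores all parameter values of the network. It is typically used to resume training after an interruption or for fine-tuning after training.\n", "- **MindIR**: Short for MindSpore IR, a graph-based functional IR of MindSpore that defines an extensible graph structure and the IR representation of operators. It stores both the network structure and the weight parameter values. It removes model differences between backends and is typically used for inference across hardware platforms, for example running a model trained on Ascend 910 on Ascend 310, GPU, or MindSpore Lite devices.\n", "- **ONNX**: Short for Open Neural Network Exchange, an open file format designed for machine learning that stores both the network structure and the weight parameter values. It is typically used to migrate models between frameworks or to run them on an inference engine (TensorRT).\n", "- **AIR**: Short for Ascend Intermediate Representation, an open file format defined by Huawei for machine learning that stores both the network structure and the weight parameter values and is better adapted to Ascend AI processors. It is typically used for inference on Ascend 310.\n", "\n", "This chapter describes how to save CheckPoint files and how to export MindIR, AIR, and ONNX files.\n", "\n", "## Saving Models\n", "\n", "The [Saving and Loading chapter](https://mindspore.cn/tutorials/zh-CN/r1.8/beginner/save_load.html) of the beginner tutorial introduced saving model parameters directly with `save_checkpoint` and saving them during training through the Callback mechanism. This section covers both approaches in more detail.\n", "\n", "### Saving Models During Training\n", "\n", 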
"在训练过程中保存模型参数,MindSpore提供了两种保存策略,迭代策略和时间策略,可以通过创建`CheckpointConfig`对象设置相应策略。迭代策略和时间策略不能同时使用,其中迭代策略优先级高于时间策略,当同时设置时,只有迭代策略可以生效。当参数显示设置为None时,表示放弃该策略。另外,当训练过程中发生异常时,MindSpore也提供了断点续训功能,即在异常发生时系统会自动保存异常发生时的CheckPoint文件。\n", "\n", "1. 迭代策略\n", "\n", "`CheckpointConfig`中可根据迭代的次数进行配置,配置迭代策略的参数如下:\n", "\n", "- `save_checkpoint_steps`:表示每隔多少个step保存一个CheckPoint文件,默认值为1。\n", "- `keep_checkpoint_max`:表示最多保存多少个CheckPoint文件,默认值为5。\n", "\n", "```Python\n", "import mindspore as ms\n", "\n", "# 每隔32个step保存一个CheckPoint文件,且最多保存10个CheckPoint文件\n", "config_ck = ms.CheckpointConfig(save_checkpoint_steps=32, keep_checkpoint_max=10)\n", "```\n", "\n", "在迭代策略脚本正常结束的情况下,会默认保存最后一个step的CheckPoint文件。\n", "\n", "2. 时间策略\n", "\n", "`CheckpointConfig`中可根据训练的时长进行配置,配置时间策略的参数如下:\n", "\n", "- `save_checkpoint_seconds`:表示每隔多少秒保存一个CheckPoint文件,默认值为0。\n", "- `keep_checkpoint_per_n_minutes`:表示每隔多少分钟保留一个CheckPoint文件,默认值为0。\n", "\n", "```Python\n", "import mindspore as ms\n", "\n", "# 每隔30秒保存一个CheckPoint文件,每隔3分钟保留一个CheckPoint文件\n", "config_ck = ms.CheckpointConfig(save_checkpoint_seconds=30, keep_checkpoint_per_n_minutes=3)\n", "```\n", "\n", "`save_checkpoint_seconds`参数不可与`save_checkpoint_steps`参数一起使用。如果同时设置了两个参数,则`save_checkpoint_seconds`参数无效。\n", "\n", "3. 
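Breakpoint resume\n", "\n", "MindSpore provides a breakpoint-resume feature. When it is enabled and an exception occurs during training, MindSpore automatically saves a CheckPoint file at the moment of the exception (the end-of-life CheckPoint). The feature is controlled by the `exception_save` parameter (bool) of `CheckpointConfig`: True enables it, False disables it, and the default is False. The end-of-life CheckPoint does not interfere with the CheckPoint files saved by the normal process. Its naming scheme and save path follow the normal settings, except that `_breakpoint` is appended to the end of the file name to tell it apart. It is used as follows:\n", "\n", "```Python\n", "import mindspore as ms\n", "\n", "# Enable the breakpoint-resume feature\n", "config_ck = ms.CheckpointConfig(save_checkpoint_steps=32, keep_checkpoint_max=10, exception_save=True)\n", "```\n", "\n", "If an exception occurs during training, the end-of-life CheckPoint is saved automatically. For example, if the exception occurs at step 10 of epoch 10, the following end-of-life CheckPoint file is saved:\n", "\n", "```Python\n", "# '_breakpoint' is appended to the file name to distinguish it from CheckPoint files saved by the normal process\n", "resnet50-10_10_breakpoint.ckpt\n", "```\n", "\n", "### Saving Models Directly\n", "\n", "You can use the `save_checkpoint` function to save the network weights held in memory directly to a CheckPoint file. The commonly used parameters are:\n", "\n", "- `save_obj`: a Cell object or a data list.\n", "- `ckpt_file_name`: the name of the checkpoint file. If the file already exists, it is overwritten.\n", "- `integrated_save`: whether to merge and save the split Tensors in parallel scenarios. The default value is True.\n", "- `async_save`: whether to save the checkpoint file asynchronously. The default value is False.\n", "- `append_dict`: additional information to save. The keys of the dict must be of type str, and the values must be of type int, float, or bool. The default value is None.\n", "\n", "1. `save_obj` parameter\n", "\n", "The [Saving and Loading chapter](https://mindspore.cn/tutorials/zh-CN/r1.8/beginner/save_load.html#保存与加载) of the beginner tutorial showed how to save model parameters directly with `save_checkpoint` when `save_obj` is a Cell object. The following shows how to save parameters when a data list is passed in. Each element of the list is a dictionary, for example `[{\"name\": param_name, \"data\": param_data}, ...]`, where `param_name` must be of type str and `param_data` must be of type Parameter or Tensor. For example:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import mindspore as ms\n", "\n", "# Each list element is a dict with a str name and a Tensor (or Parameter) value\n", "save_list = [{\"name\": \"lr\", \"data\": ms.Tensor(0.01, ms.float32)},\n", "             {\"name\": \"train_epoch\", \"data\": ms.Tensor(20, ms.int32)}]\n", "ms.save_checkpoint(save_list, \"hyper_param.ckpt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 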
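`integrated_save` parameter\n", "\n", "Specifies whether the parameters are merged before saving. The default value is True. In model-parallel scenarios, Tensors are split across the programs running on different devices. If `integrated_save` is set to True, the split Tensors are merged and saved into each checkpoint file, so every checkpoint file holds the complete training parameters.\n", "\n", "```Python\n", "ms.save_checkpoint(net, \"resnet50-2_32.ckpt\", integrated_save=True)\n", "```\n", "\n", "3. `async_save` parameter\n", "\n", "Specifies whether to enable asynchronous saving. The default value is False. If set to True, the checkpoint file is written in a separate thread so that training and saving run in parallel, which reduces the total run time of the script when training large networks.\n", "\n", "```Python\n", "ms.save_checkpoint(net, \"resnet50-2_32.ckpt\", async_save=True)\n", "```\n", "\n", "4. `append_dict` parameter\n", "\n", "Additional information to save, given as a dict. Currently only values of basic types are supported, including int, float, and bool.\n", "\n", "```Python\n", "save_dict = {\"epoch_num\": 2, \"lr\": 0.01}\n", "# In addition to the parameters of net, the contents of save_dict are also saved in the ckpt file\n", "ms.save_checkpoint(net, \"resnet50-2_32.ckpt\", append_dict=save_dict)\n", "```\n", "\n", "## Transfer Learning\n", "\n", "In transfer learning, when a pretrained model is used for training, the model parameters in the CheckPoint file often cannot be used as they are and must be modified to fit the current network. This section shows how to delete the fully connected layer parameters from a pretrained ResNet50 model.\n", "\n", "First download the [pretrained ResNet50 model](https://download.mindspore.cn/vision/classification/resnet50_224.ckpt). This model file was obtained by training the `resnet50` model from [MindSpore Vision](https://mindspore.cn/vision/docs/zh-CN/r0.1/index.html) on the ImageNet dataset.\n", "\n", "Load the pretrained model with the `load_checkpoint` interface. The interface returns a dict whose keys are the parameter names of the network layers (type str) and whose values are the corresponding parameter values (type Parameter).\n", "\n", "In the following example, the pretrained ResNet50 model has 1000 classes, whereas the resnet50 network defined in the example has 2 classes, so the fully connected layer parameters must be deleted from the pretrained model." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Delete parameter from checkpoint: head.dense.weight\n", "Delete parameter from checkpoint: head.dense.bias\n" ] } ], "source": [ "from mindvision.classification.models import resnet50\n", "import mindspore as ms\n", "from mindvision.dataset import DownLoad\n", "\n", "# Download the pretrained ResNet50 model\n", "dl = DownLoad()\n", "dl.download_url('https://download.mindspore.cn/vision/classification/resnet50_224.ckpt')\n", "# Define a resnet50 network with 2 classes\n", "resnet = resnet50(2)\n", "# Load the model parameters into param_dict\n", "param_dict = 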
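ms.load_checkpoint(\"resnet50_224.ckpt\")\n", "\n", "# Get the list of parameter names of the fully connected layer\n", "param_filter = [x.name for x in resnet.head.get_parameters()]\n", "\n", "def filter_ckpt_parameter(origin_dict, param_filter):\n", "    \"\"\"Delete the entries of origin_dict whose key contains a name in param_filter.\"\"\"\n", "    for key in list(origin_dict.keys()):  # iterate over all parameter names in the checkpoint\n", "        for name in param_filter:  # iterate over the parameter names to delete\n", "            if name in key:\n", "                print(\"Delete parameter from checkpoint:\", key)\n", "                del origin_dict[key]\n", "                break\n", "\n", "# Delete the fully connected layer parameters\n", "filter_ckpt_parameter(param_dict, param_filter)\n", "\n", "# Load the remaining parameters into the network\n", "ms.load_param_into_net(resnet, param_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting Models\n", "\n", "The MindSpore `export` interface exports a network model to a file in a specified format for inference on other hardware platforms. Its main parameters are:\n", "\n", "- `net`: the MindSpore network.\n", "- `inputs`: the inputs of the network, of type Tensor. When there are multiple inputs, pass them together, for example `ms.export(network, ms.Tensor(input1), ms.Tensor(input2), file_name='network', file_format='MINDIR')`.\n", "- `file_name`: the file name of the exported model. If `file_name` does not include the corresponding suffix (such as .mindir), the system appends it automatically according to `file_format`.\n", "- `file_format`: MindSpore currently supports exporting models in the `AIR`, `ONNX`, and `MINDIR` formats.\n", "\n", "The following sections use `export` to generate MindIR, AIR, and ONNX files from the resnet50 network and its CheckPoint file.\n", "\n", "### Exporting MindIR Files\n", "\n", "To run inference across platforms or hardware (such as Ascend AI processors, MindSpore on-device, or GPUs), you can generate a MindIR model file from the network definition and the CheckPoint. Export is currently supported for static graphs. The following example uses the `resnet50` model from MindSpore Vision and the model file resnet50_224.ckpt obtained by training it on the ImageNet dataset to export a MindIR file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import mindspore as ms\n", "from mindvision.classification.models import resnet50\n", "\n", "resnet = resnet50(1000)\n", "# Load the pretrained parameters into the network\n", "ms.load_checkpoint(\"resnet50_224.ckpt\", net=resnet)\n", "\n", "# Dummy input that fixes the input shape of the exported graph\n", "input_np = np.random.uniform(0.0, 1.0, size=[1, 3, 224, 224]).astype(np.float32)\n", "\n", "# Export resnet50_224.mindir to the current directory\n", "ms.export(resnet, ms.Tensor(input_np), file_name='resnet50_224', file_format='MINDIR')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 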
"若想在MindIR格式文件中保存模型推理时需要的预处理操作信息,可以将数据集对象传入export接口:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.vision as vision\n", "import mindspore as ms\n", "from mindvision.classification.models import resnet50\n", "from mindvision.dataset import DownLoad\n", "\n", "def create_dataset_for_renset(path):\n", " \"\"\"创建数据集\"\"\"\n", " data_set = ds.ImageFolderDataset(path)\n", " mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]\n", " std = [0.229 * 255, 0.224 * 255, 0.225 * 255]\n", " data_set = data_set.map(operations=[vision.Decode(), vision.Resize(256), vision.CenterCrop(224),\n", " vision.Normalize(mean=mean, std=std), vision.HWC2CHW()], input_columns=\"image\")\n", " data_set = data_set.batch(1)\n", " return data_set\n", "\n", "dataset_url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/beginner/DogCroissants.zip\"\n", "path = \"./datasets\"\n", "# 下载并解压数据集\n", "dl = DownLoad()\n", "dl.download_and_extract_archive(url=dataset_url, download_path=path)\n", "# 加载数据集\n", "path = \"./datasets/DogCroissants/val/\"\n", "de_dataset = create_dataset_for_renset(path)\n", "# 定义网络\n", "resnet = resnet50()\n", "\n", "# 加载预处理模型参数到网络中\n", "ms.load_checkpoint(\"resnet50_224.ckpt\", net=resnet)\n", "# 导出带预处理信息的MindIR文件\n", "ms.export(resnet, de_dataset, file_name='resnet50_224', file_format='MINDIR')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> 如果file_name没有包含”.mindir”后缀,系统会为其自动添加”.mindir”后缀。\n", "\n", "为了避免Protocol Buffers的硬件限制,当导出的模型参数大小超过1G时,框架默认会把网络结构和参数分开保存。\n", "\n", "- 网络结构文件的名称以用户指定前缀加_graph.mindir结尾。\n", "- 同级目录下,会生成一个用户指定前缀加_variables的文件夹,里面存放网络的参数。其中参数大小每超过1T会被分开保存成命名为data_1、data_2、data_3等的多个文件。\n", "\n", "以上述代码为例,如果带参数的模型大小超过1G,生成的目录结构如下:\n", "\n", "```Text\n", "├── resnet50_224_graph.mindir\n", "└── resnet50_224_variables\n", " ├── data_1\n", " ├── data_2\n", " └── data_3\n", "\n", "```\n", "\n", "### 导出ONNX格式文件\n", 
"\n", "当有了CheckPoint文件后,如果想继续在昇腾AI处理器、GPU或CPU等多种硬件上做推理,需要通过网络和CheckPoint生成对应的ONNX格式模型文件。如下使用MindSpore Vision中的`resnet50`模型和该模型在ImageNet数据集上训练得到的模型文件resnet50_224.ckpt,导出ONNX格式文件。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import mindspore as ms\n", "from mindvision.classification.models import resnet50\n", "\n", "resnet = resnet50()\n", "ms.load_checkpoint(\"resnet50_224.ckpt\", net=resnet)\n", "\n", "input_np = np.random.uniform(0.0, 1.0, size=[1, 3, 224, 224]).astype(np.float32)\n", "\n", "# 保存resnet50_224.onnx文件到当前目录下\n", "ms.export(resnet, ms.Tensor(input_np), file_name='resnet50_224', file_format='ONNX')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> - 如果file_name没有包含”.onnx”后缀,系统会为其自动添加”.onnx”后缀。\n", "> - 目前ONNX格式导出仅支持ResNet系列、YOLOV3、YOLOV4、BERT网络。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 导出AIR格式文件\n", "\n", "AIR格式文件用于在昇腾AI处理器上执行推理,导出AIR格式文件需要在昇腾AI处理器上进行操作,可通过网络定义和CheckPoint生成AIR格式模型文件。如下使用MindSpore Vision中的`resnet50`模型和该模型在ImageNet数据集上训练得到的模型文件resnet50_224.ckpt,在昇腾AI处理器上导出AIR格式文件。\n", "\n", "```Python\n", "import numpy as np\n", "import mindspore as ms\n", "from mindvision.classification.models import resnet50\n", "\n", "resnet = resnet50()\n", "# 加载参数到网络中\n", "ms.load_checkpoint(\"resnet50_224.ckpt\", net=resnet)\n", "# 网络输入\n", "input_np = np.random.uniform(0.0, 1.0, size=[1, 3, 224, 224]).astype(np.float32)\n", "# 保存resnet50_224.air文件到当前目录下\n", "ms.export(resnet, ms.Tensor(input_np), file_name='resnet50_224', file_format='AIR')\n", "```\n", "\n", "如果file_name没有包含“.air”后缀,系统会为其自动添加“.air”后缀。" ] } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 
4, "nbformat_minor": 5 }