{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Migrating Scripts with MindConverter\n",
    "\n",
    "[![](https://gitee.com/mindspore/docs/raw/r1.3/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.3/docs/mindspore/migration_guide/source_zh_cn/migration_case_of_mindconverter.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.3/resource/_static/logo_notebook.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.3/migration_guide/zh_cn/mindspore_migration_case_of_mindconverter.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.3/resource/_static/logo_modelarts.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9yMS4zL21pZ3JhdGlvbl9ndWlkZS96aF9jbi9taW5kc3BvcmVfbWlncmF0aW9uX2Nhc2Vfb2ZfbWluZGNvbnZlcnRlci5pcHluYg==&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To convert a PyTorch model into a MindSpore script and weights, first export the PyTorch model as an ONNX model, then use the MindConverter CLI tool to migrate the script and weights.\n",
    "HuggingFace Transformers is a mainstream third-party natural language processing library for PyTorch. This tutorial uses BertForMaskedLM from Transformers as an example to demonstrate the migration process."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "This example requires the following third-party Python libraries:\n",
    "\n",
    "```bash\n",
    "pip install torch==1.5.1\n",
    "pip install transformers==4.2.2\n",
    "pip install mindspore==1.2.0\n",
    "pip install mindinsight==1.2.0\n",
    "pip install onnx\n",
    "pip install onnxruntime\n",
    "```\n",
    "\n",
    "> The commands above can optionally use the Tsinghua mirror in China to speed up downloads: append `-i https://pypi.tuna.tsinghua.edu.cn/simple` to each command.\n",
    ">\n",
    "> Installing `onnx` requires `protobuf-compiler` and `libprotoc-dev` to be installed first. If they are missing, install them with `apt-get install protobuf-compiler libprotoc-dev`.\n",
    ">\n",
    "> `onnxruntime` is imported by the ONNX validation step later in this tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exporting the ONNX Model\n",
    "\n",
    "First, instantiate BertForMaskedLM from HuggingFace together with its tokenizer (the model weights, vocabulary, and model configuration are downloaded on first use).\n",
    "\n",
    "This tutorial does not cover HuggingFace usage in detail; see the [HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html).\n",
    "\n",
    "The model predicts the masked word in a sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers.models.bert import BertForMaskedLM, BertTokenizer\n",
    "\n",
    "tokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\")\n",
    "model = BertForMaskedLM.from_pretrained(\"bert-base-uncased\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We run inference with this model to generate test cases for verifying that the migration is correct.\n",
    "\n",
    "Take the sentence `china is a poworful country, its capital is beijing.` as an example.\n",
    "\n",
    "We mask `beijing` and feed `china is a poworful country, its capital is [MASK].` to the model; the expected prediction is `beijing`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MASK token index: 12\n",
      "Tokens: [[ 101 2859 2003 1037 23776 16347 5313 2406 1010 2049 3007 2003\n",
      " 103 1012 102]]\n",
      "Attention mask: [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]\n",
      "Token type ids: [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]\n",
      "Pred id: 7211\n",
      "Pred token: beijing\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import torch\n",
    "\n",
    "text = \"china is a poworful country, its capital is [MASK].\"\n",
    "tokenized_sentence = tokenizer(text)\n",
    "\n",
    "# Position of the [MASK] token in the input sequence (not its vocabulary id).\n",
    "mask_idx = tokenized_sentence[\"input_ids\"].index(tokenizer.convert_tokens_to_ids(\"[MASK]\"))\n",
    "# Add a batch axis so every input has shape (1, sequence_length).\n",
    "input_ids = np.array([tokenized_sentence[\"input_ids\"]])\n",
    "attention_mask = np.array([tokenized_sentence[\"attention_mask\"]])\n",
    "token_type_ids = np.array([tokenized_sentence[\"token_type_ids\"]])\n",
    "\n",
    "print(f\"MASK token index: {mask_idx}\")\n",
    "print(f\"Tokens: {input_ids}\")\n",
    "print(f\"Attention mask: {attention_mask}\")\n",
    "print(f\"Token type ids: {token_type_ids}\")\n",
    "\n",
    "model.eval()\n",
    "with torch.no_grad():\n",
    "    predictions = model(input_ids=torch.tensor(input_ids),\n",
    "                        attention_mask=torch.tensor(attention_mask),\n",
    "                        token_type_ids=torch.tensor(token_type_ids))\n",
    "    # predictions[0] holds the logits; take the argmax at the masked position.\n",
    "    predicted_index = torch.argmax(predictions[0][0][mask_idx])\n",
    "    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]\n",
    "    print(f\"Pred id: {predicted_index}\")\n",
    "    print(f\"Pred token: {predicted_token}\")\n",
    "    assert predicted_token == \"beijing\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HuggingFace provides a tool for exporting ONNX models. A pretrained HuggingFace model can be exported to ONNX as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Creating folder exported_bert_base_uncased\n",
      "Using framework PyTorch: 1.5.1+cu101\n",
      "Found input input_ids with shape: {0: 'batch', 1: 'sequence'}\n",
      "Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}\n",
      "Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}\n",
      "Found output output_0 with shape: {0: 'batch', 1: 'sequence'}\n",
      "Ensuring inputs are in correct order\n",
      "position_ids is not present in the generated input list.\n",
      "Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "from transformers.convert_graph_to_onnx import convert\n",
    "\n",
    "# Path of the exported ONNX model.\n",
    "saved_onnx_path = \"./exported_bert_base_uncased/bert_base_uncased.onnx\"\n",
    "# Export the PyTorch (\"pt\") model with ONNX opset 11.\n",
    "convert(\"pt\", model, Path(saved_onnx_path), 11, tokenizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The log shows that the exported ONNX model has three input nodes, `input_ids`, `token_type_ids`, and `attention_mask`, along with their dynamic axes,\n",
    "and one output node, `output_0`.\n",
    "\n",
    "The ONNX model has now been exported. Next, verify the accuracy of the exported model. (If the export is run on an ARM machine, you may need to build and install PyTorch and Transformers yourself.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Validating the ONNX Model\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We again use the sentence from the PyTorch inference, `china is a poworful country, its capital is [MASK].`, as input and check whether the ONNX model behaves as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ONNX Pred id: 7211\n",
      "ONNX Pred token: beijing\n"
     ]
    }
   ],
   "source": [
    "import onnx\n",
    "import onnxruntime as ort\n",
    "\n",
    "# Use a separate name so the PyTorch `model` is not overwritten.\n",
    "onnx_model = onnx.load(saved_onnx_path)\n",
    "sess = ort.InferenceSession(bytes(onnx_model.SerializeToString()))\n",
    "result = sess.run(\n",
    "    output_names=None,\n",
    "    input_feed={\"input_ids\": input_ids,\n",
    "                \"attention_mask\": attention_mask,\n",
    "                \"token_type_ids\": token_type_ids}\n",
    ")[0]\n",
    "predicted_index = np.argmax(result[0][mask_idx])\n",
    "predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]\n",
    "\n",
    "print(f\"ONNX Pred id: {predicted_index}\")\n",
    "print(f\"ONNX Pred token: {predicted_token}\")\n",
    "assert predicted_token == \"beijing\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The exported ONNX model produces the same prediction as the original PyTorch model on this input. Next, use MindConverter to migrate the script and weights."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Migrating the Model Script and Weights with MindConverter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When converting a model, MindConverter requires the model path (`--model_file`), the input nodes (`--input_nodes`), the input node shapes (`--shape`), and the output nodes (`--output_nodes`).\n",
    "\n",
    "The output path for the generated script (`--output`) and the conversion report path (`--report`) are optional. Both default to the `output` directory under the current path, which is created automatically if it does not exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "MindConverter: conversion is completed.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!mindconverter --model_file ./exported_bert_base_uncased/bert_base_uncased.onnx --shape 1,128 1,128 1,128 \\\n",
    "    --input_nodes input_ids token_type_ids attention_mask \\\n",
    "    --output_nodes output_0 \\\n",
    "    --output ./converted_bert_base_uncased \\\n",
    "    --report ./converted_bert_base_uncased"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The message “MindConverter: conversion is completed.” indicates that the model was converted successfully.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the conversion, the following files are generated in the output directory:\n",
    "\n",
    "- Model definition script (`.py`)\n",
    "- Weight checkpoint file (`.ckpt`)\n",
    "- Weight mapping between the original and migrated models (`.json`)\n",
    "- Conversion report (`.txt`)\n",
    "\n",
    "Check the conversion results with `ls`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "bert_base_uncased.ckpt\treport_of_bert_base_uncased.txt\r\n",
      "bert_base_uncased.py\tweight_map_of_bert_base_uncased.json\r\n"
     ]
    }
   ],
   "source": [
    "!ls ./converted_bert_base_uncased"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All four files have been generated.\n",
    "\n",
    "The migration is complete. Next, verify the accuracy of the migrated model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Validating the MindSpore Model\n",
    "\n",
    "We again use `china is a poworful country, its capital is [MASK].` as input and check whether the migrated model behaves as expected."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the tool freezes the model's input shape during conversion, sentences must be padded to a fixed length before running inference with MindSpore. The following function pads a sentence.\n",
    "\n",
    "At inference time, the sentence length must not exceed the maximum length fixed at conversion (128 here, set with `--shape 1,128` during conversion)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def padding(input_ids, attn_mask, token_type_ids, target_len=128):\n",
    "    \"\"\"Right-pad the three inputs with zeros to target_len and add a batch axis.\"\"\"\n",
    "    # Build the padding once; the caller's lists are left unmodified.\n",
    "    pad = [0] * (target_len - len(input_ids))\n",
    "    return (np.array([list(input_ids) + pad]),\n",
    "            np.array([list(attn_mask) + pad]),\n",
    "            np.array([list(token_type_ids) + pad]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MindSpore Pred id: 7211\n"
     ]
    }
   ],
   "source": [
    "from converted_bert_base_uncased.bert_base_uncased import Model as MsBert\n",
    "from mindspore import load_checkpoint, load_param_into_net, context, Tensor\n",
    "\n",
    "\n",
    "context.set_context(mode=context.GRAPH_MODE, device_target=\"GPU\")\n",
    "padded_input_ids, padded_attention_mask, padded_token_type = padding(\n",
    "    tokenized_sentence[\"input_ids\"],\n",
    "    tokenized_sentence[\"attention_mask\"],\n",
    "    tokenized_sentence[\"token_type_ids\"],\n",
    "    target_len=128)\n",
    "padded_input_ids = Tensor(padded_input_ids)\n",
    "padded_attention_mask = Tensor(padded_attention_mask)\n",
    "padded_token_type = Tensor(padded_token_type)\n",
    "\n",
    "model = MsBert()\n",
    "param_dict = load_checkpoint(\"./converted_bert_base_uncased/bert_base_uncased.ckpt\")\n",
    "not_load_params = load_param_into_net(model, param_dict)\n",
    "output = model(padded_attention_mask, padded_input_ids, padded_token_type)\n",
    "\n",
    "# Every parameter in the checkpoint must have been loaded into the network.\n",
    "assert not not_load_params\n",
    "\n",
    "predicted_index = np.argmax(output.asnumpy()[0][mask_idx])\n",
    "print(f\"MindSpore Pred id: {predicted_index}\")\n",
    "assert predicted_index == 7211"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This completes the script and weight migration with MindConverter.\n",
    "\n",
    "You can now write training, inference, and deployment scripts for your own use case."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## FAQ"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Q: How can I change the batch size, sequence length, and other shape settings of the migrated script so that the model supports inference and training on data of any shape?**\n",
    "\n",
    "A: The shape restrictions in the migrated script usually come from Reshape operators, or from other operators that change tensor layout. Taking the BERT migration above as an example, first create two global variables for the desired batch size and sequence length, then change the target shapes of the Reshape operations to use those global variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Q: The class names in the generated script, such as `class Module0(nn.Cell)`, do not follow developers' naming habits. Will renaming them by hand affect loading the converted weights?**\n",
    "\n",
    "A: Weight loading depends only on variable names and class structure, so class names can be changed without affecting weight loading. If you change the class structure, the corresponding weight names must be updated to match the structure of the modified model."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}