{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 异构并行训练\n", "\n", "`Ascend` `GPU` `设计` `模型运行`\n", "\n", "[![在线运行](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_modelarts.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9yMS42L3Byb2dyYW1taW5nX2d1aWRlL3poX2NuL2Rlc2lnbi9taW5kc3BvcmVfaGV0ZXJvZ2VuZW91c190cmFpbmluZy5pcHluYg==&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c) [![下载Notebook](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_notebook.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.6/programming_guide/zh_cn/design/mindspore_heterogeneous_training.ipynb) [![下载样例代码](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_download_code.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.6/programming_guide/zh_cn/design/mindspore_heterogeneous_training.py) [![查看源文件](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.6/docs/mindspore/programming_guide/source_zh_cn/design/heterogeneous_training.ipynb)\n", "\n", "## 概述\n", "\n", "异构并行训练方法是通过分析图上算子内存占用和计算密集度,将内存消耗巨大或适合CPU逻辑处理的算子切分到CPU子图,将内存消耗较小计算密集型算子切分到硬件加速器子图,框架协同不同子图进行网络训练,使得处于不同硬件且无依赖关系的子图能够并行进行执行的过程。\n", "\n", "## 计算流程\n", "\n", "MindSpore异构并行训练典型的计算流程如下图所示:\n", "\n", "![heterogeneous-heter](./images/heter.png)\n", "\n", "1. 用户设置网络执行的后端" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2022-01-05T02:15:56.790220Z", "start_time": "2022-01-05T02:15:55.114811Z" } }, "outputs": [], "source": [ "from mindspore import context\n", "context.set_context(device_target=\"GPU\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 用户设置特定算子执行后端" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2022-01-05T09:02:10.573036Z", "start_time": "2022-01-05T09:02:09.034905Z" } }, "outputs": [], "source": [ "from mindspore import ops\n", "\n", "prim = ops.Add()\n", "\n", "prim.add_prim_attr(\"primitive_target\", \"CPU\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 框架根据计算图算子标志进行切图\n", "4. 框架调度不同后端执行子图" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "当前典型使用异构并行计算的场景有:优化器异构、Embedding异构、PS异构。\n", "\n", "## 优化器异构\n", "\n", "在盘古或GPT3大模型训练过程中,优化器状态占用了大量内存,进而限制了可训练的模型规模。使用优化器异构,将优化器指定到CPU上执行,可以极大扩展可训练模型规模:\n", "\n", "![heterogeneous-heter-opt](./images/heter-opt.png)\n", "\n", "如图所示,将Adam算子配置到CPU执行同时指定加速器进行FP16计算,可以将参数内存占用降低到原始的1/3。\n", "\n", "1. 配置优化器算子到CPU执行\n", "2. 初始化FP16的权重参数以及FP32的优化器状态变量\n", "3. 将输入优化器的梯度转为FP16(如果本来就是FP16梯度,可忽略这步)\n", "4. 权重和梯度转为FP32参与优化器运算\n", "5. 更新后的FP32权重赋值给FP16的权重\n", "\n", "优化器异构代码样例如下:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2022-01-05T09:02:10.635821Z", "start_time": "2022-01-05T09:02:10.574494Z" } }, "outputs": [], "source": [ "import numpy as np\n", "from mindspore import dtype as mstype\n", "import mindspore.ops as ops\n", "from mindspore.common.initializer import initializer\n", "from mindspore import Tensor\n", "from mindspore import ParameterTuple\n", "from mindspore.nn import Optimizer\n", "_adam_opt = ops.MultitypeFuncGraph(\"adam_opt\")\n", "host_assign = ops.Assign()\n", "host_assign.add_prim_attr(\"primitive_target\", \"CPU\")\n", "host_cast = ops.Cast()\n", "host_cast.add_prim_attr(\"primitive_target\", \"CPU\")\n", "device_cast = ops.Cast()\n", "\n", "@_adam_opt.register(\"Function\", \"Tensor\", \"Tensor\", \"Tensor\", \"Tensor\", \"Number\", \"Tensor\", \"Tensor\", \"Tensor\",\n", " \"Tensor\", \"Bool\", \"Bool\")\n", "def _update_run_kernel(opt, beta1, beta2, eps, lr, weight_decay, param, m, v, gradient, decay_flags, optim_filter):\n", " \"\"\"\n", " Update parameters by AdamWeightDecay op.\n", " \"\"\"\n", " success = True\n", " if optim_filter:\n", " param32 = host_cast(param, mstype.float32)\n", " gradient = device_cast(gradient, mstype.float32)\n", " if decay_flags:\n", " next_param = opt(param32, m, v, lr, beta1, beta2, eps, weight_decay, gradient)\n", " else:\n", " next_param = opt(param32, m, v, lr, beta1, beta2, eps, 0.0, gradient)\n", " ret = host_assign(param, host_cast(ops.depend(param32, next_param), ops.dtype(param)))\n", " return ops.depend(success, ret)\n", " return success\n", "\n", "class AdamWeightDecayOp(Optimizer):\n", " def __init__(self, params, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.0):\n", " super(AdamWeightDecayOp, self).__init__(learning_rate, params, weight_decay)\n", " self.beta1 = Tensor(np.array([beta1]).astype(np.float32))\n", " self.beta2 = Tensor(np.array([beta2]).astype(np.float32))\n", " self.eps = Tensor(np.array([eps]).astype(np.float32))\n", " self.moments1 = self.clone_param32(prefix=\"adam_m\", init='zeros')\n", " self.moments2 = self.clone_param32(prefix=\"adam_v\", init='zeros')\n", " self.opt = ops.AdamWeightDecay()\n", " self.hyper_map = ops.HyperMap()\n", " self.opt.add_prim_attr(\"primitive_target\", \"CPU\")\n", "\n", " def construct(self, gradients):\n", " \"\"\"AdamWeightDecayOp\"\"\"\n", " lr = self.get_lr()\n", " if self.is_group:\n", " if self.is_group_lr:\n", " optim_result = self.map_reverse(ops.partial(_adam_opt, self.opt, self.beta1, self.beta2, self.eps),\n", " lr, self.weight_decay, self.parameters, self.moments1, self.moments2,\n", " gradients, self.decay_flags, self.optim_filter)\n", " else:\n", " optim_result = self.map_reverse(ops.partial(_adam_opt, self.opt, self.beta1, self.beta2, self.eps, lr),\n", " self.weight_decay, self.parameters, self.moments1, self.moments2,\n", " gradients, self.decay_flags, self.optim_filter)\n", " else:\n", " optim_result = self.map_reverse(ops.partial(_adam_opt, self.opt, self.beta1, self.beta2, self.eps, lr,\n", " self.weight_decay), self.parameters, self.moments1, self.moments2,\n", " gradients, self.decay_flags, self.optim_filter)\n", " return optim_result\n", "\n", " def clone_param32(self, prefix, init=None):\n", " new = []\n", " for old_param in self.parameters:\n", " param_init = init\n", " if init is None:\n", " param_init = old_param.init\n", " new_state = old_param.clone()\n", " new_state.set_dtype(mstype.float32)\n", " new_state.set_data(initializer(param_init, shape=old_param.shape, dtype=mstype.float32))\n", " new_state.name = prefix + '.' + new_state.name\n", " new.append(new_state)\n", " return ParameterTuple(new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "步骤4、5也可以直接融合到优化器算子中做进一步优化,完整的优化器异构训练流程可以参考: \n", "\n", "## Embedding异构\n", "\n", "在一些需要查Embedding大表的网络中,Embedding表往往有上百G的规模,受加速器内存大小限制,无法直接将整表加载到加速器上执行。通过将与权重表相连的算子放到CPU上执行,避免加速器由于内存限制而无法训练网络的问题。\n", "\n", "![heterogeneous-heter-embed](./images/heter-embed.png)\n", "\n", "1. 配置EmbeddingLookup算子到CPU执行" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2022-01-05T09:02:10.663460Z", "start_time": "2022-01-05T09:02:10.636839Z" } }, "outputs": [], "source": [ "ops.EmbeddingLookup().add_prim_attr('primitive_target', 'CPU')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 配置EmbeddingLookup相关优化器到CPU执行" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2022-01-05T09:02:10.680690Z", "start_time": "2022-01-05T09:02:10.665043Z" } }, "outputs": [], "source": [ "use_locking = False\n", "use_nesterov = False\n", "ops.FusedSparseLazyAdam(use_locking, use_nesterov).add_prim_attr(\"primitive_target\", \"CPU\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "EmbeddingLookup算子设置代码样例如下:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2022-01-05T09:02:10.709005Z", "start_time": "2022-01-05T09:02:10.682761Z" } }, "outputs": [], "source": [ "import mindspore.nn as nn\n", "import mindspore.ops as ops\n", "from mindspore import Parameter\n", "from mindspore.common.initializer import initializer\n", "\n", "class EmbeddingLookup(nn.Cell):\n", " def __init__(self, vocab_size, embedding_size, param_init='normal',\n", " target='CPU', sparse=True):\n", " \"\"\"Initialize EmbeddingLookup.\"\"\"\n", " super(EmbeddingLookup, self).__init__()\n", " validator.check_value_type('sparse', sparse, [bool], self.cls_name)\n", " self.vocab_size = validator.check_positive_int(vocab_size, 'vocab_size')\n", " self.target = target\n", " self.sparse = sparse\n", " if target not in ('CPU', 'DEVICE'):\n", " raise ValueError('Attr \\'target\\' of \\'EmbeddingLookup\\' Op passed '\n", " + str(target) + ', should be one of values in \\'CPU\\', \\'DEVICE\\'.')\n", " if not sparse and target == 'CPU':\n", " raise ValueError('When target is CPU, embedding_lookup must be sparse.')\n", " if sparse:\n", " self.gatherv2 = ops.SparseGatherV2()\n", " else:\n", " self.gatherv2 = ops.Gather()\n", " self.embeddinglookup = ops.EmbeddingLookup().add_prim_attr('primitive_target', 'CPU')\n", " self.embedding_size = validator.check_positive_int(embedding_size, 'embedding_size')\n", " self.embedding_table = Parameter(initializer(param_init, [self.vocab_size, self.embedding_size]),\n", " name='embedding_table')\n", "\n", " def construct(self, indices):\n", " if self.target == \"CPU\":\n", " out = self.embeddinglookup(self.embedding_table, indices, 0)\n", " else:\n", " out = self.gatherv2(self.embedding_table, indices, 0)\n", " return out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "当前nn目录下的EmbeddingLookup、FTRL、LazyAdam等算子已经封装好异构接口,用户只需设置target属性为CPU或DEVICE即可切换执行后端。\n", "\n", "整体调用流程可以参考:\n", "\n", "## PS异构\n", "\n", "在EmbeddingTable达到T级别,单机内存无法放下时,使用Parameter Server,通过异构的Pull/Push算子进行权重的拉取和更新。\n", "\n", "![heterogeneous-heter-ps](./images/heter-ps.png)\n", "\n", "Parameter Server封装异构流程,用户只需配置参数使用PS即可,具体配置流程请参考[Parameter Server训练流程](https://www.mindspore.cn/docs/programming_guide/zh-CN/r1.6/apply_parameter_server_training.html)。\n", "\n", "此外,wide&deep网络中也有使用PS的流程,可参考:\n", "\n", "## 约束\n", "\n", "当前需要用户指定算子执行的后端,不支持根据网络进行自动化配置。" ] } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 4 }