{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 应用RoundToNearest后量化算法\n", "\n", "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/golden-stick/blob/r0.4/mindspore_gs/ptq/round_to_nearest/README.ipynb)\n", "\n", "## RoundToNearest后量化算法简介\n", "\n", "RoundToNearest算法是一类较朴素的后量化算法,其取整方式使用了Round to nearest,即四舍五入的方式。\n", "\n", "当前金箍棒中的RoundToNearest后量化(后面使用RTN来简称)主要针对LLM(大语言模型场景),使用MinMax校正器对线性层(Linear)进行量化。伪量化的网络结构示意如下:\n", "\n", "![fakequantizer](https://gitee.com/mindspore/golden-stick/raw/r0.4/docs/images/ptq/rtn/rtn-fakequantizer.png)\n", "\n", "表1:RTN算法规格\n", "\n", "| 规格 | 规格说明 |\n", "| --- | --- |\n", "| 硬件支持 | 量化阶段运行在CPU,量化模型推理仅支持Ascend |\n", "| 网络支持 | Llama2 13B/70B,具体请参见[Llama2网络](https://gitee.com/hangangqiang/mindformers/tree/dev/mindformers/models/llama) |\n", "| 运行模式支持 | Graph模式和PyNative模式 |\n", "\n", "## 示例\n", "\n", "跟金箍棒仓所有算法一样,RTN算法的应用主要可以分为两个阶段:量化阶段和部署阶段。\n", "\n", "量化阶段是部署前提前完成的,主要的工作是:收集权重的分布、计算量化参数、量化权重数据、插入反量化节点。\n", "\n", "部署阶段通常是指用户在生产环境,使用MindSpore框架对量化后的模型进行推理的过程。\n", "\n", "本用例使用Llama2网络进行演示,主要分四个步骤:环境准备、模型量化、模型部署评估、效果分析。\n", "\n", "### 步骤1. 环境准备\n", "\n", "#### 1.1. Ascend环境\n", "\n", "RTN算法需要运行在Ascend硬件上,Ascend的环境配置可以参考[MindSpore安装指南](https://www.mindspore.cn/install)安装昇腾AI处理器配套软件包小节和配置环境变量小节。\n", "\n", "#### 1.2. MindSpore环境\n", "\n", "金箍棒依赖于MindSpore,需要提前安装合适的MindSpore。可以从MindSpore官网下载预编译好的[v2.3.0-rc1版本安装包](https://www.mindspore.cn/versions)并安装。\n", "\n", "#### 1.3. MindSpore Transformers环境\n", "\n", "金箍棒依赖于MindSpore Transformers,需要提前安装合适的MindSpore Transformers。可以从MindSpore官网下载预编译好的[v1.1.0-rc1版本安装包](https://www.mindspore.cn/versions)并安装。\n", "\n", "#### 1.4. 金箍棒环境\n", "\n", "从MindSpore官网下载预编译好的[MindSpore GoldenStick v0.4.0 版本安装包](https://www.mindspore.cn/versions)并安装。\n", "\n", "#### 1.5. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Example\n", "\n", "As with every algorithm in the Golden Stick repository, applying the RTN algorithm splits into two phases: a quantization phase and a deployment phase.\n", "\n", "The quantization phase is completed ahead of deployment. Its main work is to collect the distribution of the weights, compute the quantization parameters, quantize the weight data, and insert dequantization nodes.\n", "\n", "The deployment phase refers to running inference on the quantized model with the MindSpore framework, typically in a production environment.\n", "\n", "This example demonstrates the flow on the Llama2 network in four steps: environment preparation, model quantization, model deployment and evaluation, and analysis of the results.\n", "\n", "### Step 1. Environment Preparation\n", "\n", "#### 1.1. Ascend Environment\n", "\n", "The RTN algorithm must run on Ascend hardware. For the Ascend environment setup, refer to the sections on installing the Ascend AI processor software packages and configuring environment variables in the [MindSpore installation guide](https://www.mindspore.cn/install).\n", "\n", "#### 1.2. MindSpore Environment\n", "\n", "Golden Stick depends on MindSpore, so a suitable MindSpore must be installed first. Download the prebuilt [v2.3.0-rc1 package](https://www.mindspore.cn/versions) from the MindSpore website and install it.\n", "\n", "#### 1.3. MindSpore Transformers Environment\n", "\n", "Golden Stick also depends on MindSpore Transformers, so a suitable MindSpore Transformers must be installed first. Download the prebuilt [v1.1.0-rc1 package](https://www.mindspore.cn/versions) from the MindSpore website and install it.\n", "\n", "#### 1.4. Golden Stick Environment\n", "\n", "Download the prebuilt [MindSpore GoldenStick v0.4.0 package](https://www.mindspore.cn/versions) from the MindSpore website and install it.\n", "\n", "#### 1.5. Preparing the Required Files\n", "\n", "Download the files for the MindSpore Transformers Llama2 network and the dataset used for evaluation in advance, namely the wikitext2 dataset and the Llama2 7B network files.\n", "\n", "First, create a working directory:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "!mkdir workspace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, prepare the dataset. Because of access restrictions, the wikitext2 dataset has to be downloaded manually:\n", "\n", "Dataset download link: [WikiText2 dataset](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip)\n", "\n", "After downloading, copy the dataset archive into the workspace directory created above, make sure the file is named wikitext-2-v1.zip, and then run the extraction code:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Archive: wikitext-2-v1.zip\n", " creating: wikitext-2/\n", " inflating: wikitext-2/wiki.test.tokens \n", " inflating: wikitext-2/wiki.valid.tokens \n", " inflating: wikitext-2/wiki.train.tokens \n" ] } ], "source": [ "!cd workspace; unzip wikitext-2-v1.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Third, prepare the Llama2 7B checkpoint file, the Llama2 tokenizer file, and the Llama2 model configuration file:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2024-03-19 17:29:17-- https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/llama2/llama2_7b.ckpt\n", "Length: 13476850247 (13G) [binary/octet-stream]\n", "Saving to: ‘llama2_7b.ckpt’\n", "\n", "llama2_7b.ckpt 100%[===================>] 12.55G 27.5MB/s in 7m 39s \n", "\n", "2024-03-19 17:36:57 (28.0 MB/s) - ‘llama2_7b.ckpt’ saved [13476850247/13476850247]\n", "\n", "--2024-03-19 17:36:57-- https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/llama2/tokenizer.model\n", "Length: 499723 (488K) [binary/octet-stream]\n", "Saving to: ‘tokenizer.model’\n", "\n", "tokenizer.model 100%[===================>] 488.01K --.-KB/s in 0.1s \n", "\n", "2024-03-19 17:36:57 (3.37 MB/s) - ‘tokenizer.model’ saved [499723/499723]\n", "\n" ] } ], "source": [ "!cd workspace; wget --no-check-certificate -O llama2_7b.ckpt https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/llama2/llama2_7b.ckpt\n", "!cd workspace; wget --no-check-certificate -O tokenizer.model https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/llama2/tokenizer.model\n", "!cd workspace; cp ../configs/run_llama2_7b_910b.yaml ./" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> If you run into network problems during the download, try fetching the files manually in a browser and placing them in the corresponding directory.\n", "\n", "With the preparation done, check the directory structure:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[01;34m.\u001b[00m\r\n", "├── llama2_7b.ckpt\r\n", "├── \u001b[01;34mwikitext-2\u001b[00m\r\n", "│   ├── wiki.train.tokens\r\n", "│   ├── wiki.test.tokens\r\n", "│   └── wiki.valid.tokens\r\n", "├── tokenizer.model\r\n", "├── \u001b[01;31mwikitext-2-v1.zip\u001b[00m\r\n", "└── run_llama2_7b_910b.yaml\r\n", "\r\n", "1 directory, 7 files\r\n" ] } ], "source": [ "!cd workspace; tree -L 2 -U" ] },
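{ "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, run a quick sanity check before moving on. The sketch below simply asserts that every file prepared above is in place; the path list mirrors the directory listing and is otherwise an illustrative addition." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Verify that all files required by the following steps exist.\n", "expected = [\n", "    \"./workspace/llama2_7b.ckpt\",\n", "    \"./workspace/tokenizer.model\",\n", "    \"./workspace/run_llama2_7b_910b.yaml\",\n", "    \"./workspace/wikitext-2/wiki.valid.tokens\",\n", "]\n", "for path in expected:\n", "    print(path, \"OK\" if os.path.exists(path) else \"MISSING\")" ] },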
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2. Model Quantization\n", "\n", "#### 2.1. Instantiating MindFormerConfig\n", "\n", "Building the Llama2 network from the MindSpore Transformers repository starts with a MindFormerConfig configuration object. We first implement a utility function that creates a MindFormerConfig, and define a few constants:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "import mindspore as ms\n", "from mindformers import MindFormerConfig, LlamaConfig, init_context, TransformerOpParallelConfig\n", "\n", "\n", "def _set_config(config_path, device_target, device_id):\n", "    \"\"\"Set up MindFormerConfig.\"\"\"\n", "    mfconfig = MindFormerConfig(config_path)\n", "    if device_id != -1:\n", "        mfconfig.context.device_id = device_id\n", "    mfconfig.context.device_target = device_target\n", "    mfconfig.model.model_config = LlamaConfig(**mfconfig.model.model_config)\n", "\n", "    init_context(use_parallel=mfconfig.use_parallel, context_config=mfconfig.context, parallel_config=mfconfig.parallel)\n", "\n", "    parallel_config = TransformerOpParallelConfig(**mfconfig.parallel_config)\n", "    mfconfig.model.model_config.parallel_config = parallel_config\n", "    mfconfig.model.model_config.checkpoint_name_or_path = mfconfig.load_checkpoint\n", "    return mfconfig\n", "\n", "\n", "def create_mfconfig(config_path, device_target, device_id, bs, seq_len, tokenizer_path=\"\", ckpt_path=\"\", model_parallel=1):\n", "    \"\"\"Create a mindformers config for the llama2 network in this example.\"\"\"\n", "    if model_parallel > 1:\n", "        use_parallel = True\n", "    else:\n", "        use_parallel = False\n", "        model_parallel = 1\n", "    compute_dtype = ms.float16\n", "    config = _set_config(config_path, device_target, device_id)\n", "    config.model.model_config.batch_size = bs\n", "    config.model.model_config.seq_length = seq_len\n", "    config.model.model_config.compute_dtype = compute_dtype\n", "    config.model.model_config.layernorm_compute_type = ms.float32\n", "    config.model.model_config.softmax_compute_type = ms.float16\n", "    config.model.model_config.rotary_dtype = ms.float16\n", "    config.model.model_config.param_init_type = ms.float16\n", "    config.processor.tokenizer.vocab_file = tokenizer_path\n", "    config.load_checkpoint = ckpt_path\n", "    config.model.model_config.checkpoint_name_or_path = ckpt_path\n", "    config.use_parallel = use_parallel\n", "    config.parallel_config.model_parallel = model_parallel\n", "    return config\n", "\n", "\n", "llama2_config_file = \"./workspace/run_llama2_7b_910b.yaml\"\n", "llama2_w16a16_ckpt_file = \"./workspace/llama2_7b.ckpt\"\n", "llama2_w8a16_ckpt_file = \"./workspace/llama2-7b-w8a16.ckpt\"\n", "vocab_file = \"./workspace/tokenizer.model\"\n", "wikitext2_ds_path = \"./workspace/wikitext-2/wiki.valid.tokens\"\n", "bs = 1\n", "seq_len = 256\n", "device_id = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> The device_id variable in the code can be changed according to which Ascend devices are idle in your environment.\n", "\n", "Instantiate a MindFormerConfig object:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "import mindspore as ms\n", "from mindspore import context\n", "\n", "context.set_context(device_target=\"CPU\", mode=ms.GRAPH_MODE)\n", "quant_network_config = create_mfconfig(llama2_config_file, \"CPU\", device_id, bs, seq_len, ckpt_path=llama2_w16a16_ckpt_file)" ] },
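{ "cell_type": "markdown", "metadata": {}, "source": [ "A few key fields of the config built above can be spot-checked; this purely illustrative cell prints only fields that create_mfconfig set explicitly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Spot-check the quantization-time config built above.\n", "mc = quant_network_config.model.model_config\n", "print(\"batch_size:\", mc.batch_size)\n", "print(\"seq_length:\", mc.seq_length)\n", "print(\"load_checkpoint:\", quant_network_config.load_checkpoint)" ] },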
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2. Instantiating the Llama2 Network" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2024-03-19 18:52:23,380 - mindformers[mindformers/version_control.py:62] - INFO - The Cell Reuse compilation acceleration feature is not supported when the environment variable ENABLE_CELL_REUSE is 0 or MindSpore version is earlier than 2.1.0 or stand_alone mode or pipeline_stages <= 1\n", "2024-03-19 18:52:23,382 - mindformers[mindformers/version_control.py:66] - INFO - \n", "The current ENABLE_CELL_REUSE=0, please set the environment variable as follows: \n", "export ENABLE_CELL_REUSE=1 to enable the Cell Reuse compilation acceleration feature.\n", "2024-03-19 18:52:23,383 - mindformers[mindformers/version_control.py:72] - INFO - The Cell Reuse compilation acceleration feature does not support single-card mode.This feature is disabled by default. ENABLE_CELL_REUSE=1 does not take effect.\n", "2024-03-19 18:52:23,384 - mindformers[mindformers/version_control.py:75] - INFO - The Cell Reuse compilation acceleration feature only works in pipeline parallel mode(pipeline_stage>1).Current pipeline stage=1, the feature is disabled by default.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "......\n", "[WARNING] ME(1746443:281473469290880,MainProcess):2024-03-19-18:55:07.118.525 [mindspore/train/serialization.py:185] The type of model.layers.31.attention_norm.weight:Float16 in 'parameter_dict' is different from the type of it in 'net':Float32, then the type convert from Float16 to Float32 in the network.\n", "[WARNING] ME(1746443:281473469290880,MainProcess):2024-03-19-18:55:07.123.086 [mindspore/train/serialization.py:185] The type of model.layers.31.ffn_norm.weight:Float16 in 'parameter_dict' is different from the type of it in 'net':Float32, then the type convert from Float16 to Float32 in the network.\n", "[WARNING] ME(1746443:281473469290880,MainProcess):2024-03-19-18:55:07.751.733 [mindspore/train/serialization.py:185] The type of model.norm_out.weight:Float16 in 'parameter_dict' is different from the type of it in 'net':Float32, then the type convert from Float16 to Float32 in the network.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2024-03-19 18:55:08,105 - mindformers[mindformers/models/modeling_utils.py:1413] - INFO - weights in ./workspace/llama2_7b.ckpt are loaded\n" ] } ], "source": [ "import mindspore as ms\n", "from mindformers import LlamaForCausalLM\n", "\n", "network = LlamaForCausalLM(quant_network_config.model.model_config)\n", "network.set_train(False)\n", "network.phase = 'predict'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3. Instantiating the RTN Algorithm" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from mindspore_gs.common import BackendTarget\n", "from mindspore_gs.ptq import PTQConfig, PTQMode\n", "from mindspore_gs.ptq import RoundToNearest as RTN\n", "\n", "cfg = PTQConfig(mode=PTQMode.QUANTIZE, backend=BackendTarget.ASCEND)\n", "ptq = RTN(config=cfg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.4. Quantizing the Llama2 Network and Saving the Checkpoint" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "qnet = ptq.apply(network.model)\n", "qnet = ptq.convert(qnet)\n", "network.model = qnet\n", "ms.save_checkpoint(network, llama2_w8a16_ckpt_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After a successful run, the `llama2-7b-w8a16.ckpt` file is generated under the `./workspace` directory." ] },
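{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick preview of the storage saving analyzed in step 4, the sketch below compares the two checkpoints on disk. It assumes both files now exist under ./workspace." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Compare the on-disk checkpoint sizes; the paths are the constants from step 2.1.\n", "for path in [llama2_w16a16_ckpt_file, llama2_w8a16_ckpt_file]:\n", "    print(f\"{path}: {os.path.getsize(path) / 1024 ** 3:.1f} GB\")" ] },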
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3. Model Deployment\n", "\n", "#### 3.1. Instantiating MindFormerConfig and the Llama2 Network" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2024-03-19 19:19:37,710 - mindformers[mindformers/version_control.py:62] - INFO - The Cell Reuse compilation acceleration feature is not supported when the environment variable ENABLE_CELL_REUSE is 0 or MindSpore version is earlier than 2.1.0 or stand_alone mode or pipeline_stages <= 1\n", "2024-03-19 19:19:37,710 - mindformers[mindformers/version_control.py:66] - INFO - \n", "The current ENABLE_CELL_REUSE=0, please set the environment variable as follows: \n", "export ENABLE_CELL_REUSE=1 to enable the Cell Reuse compilation acceleration feature.\n", "2024-03-19 19:19:37,711 - mindformers[mindformers/version_control.py:72] - INFO - The Cell Reuse compilation acceleration feature does not support single-card mode.This feature is disabled by default. ENABLE_CELL_REUSE=1 does not take effect.\n", "2024-03-19 19:19:37,712 - mindformers[mindformers/version_control.py:75] - INFO - The Cell Reuse compilation acceleration feature only works in pipeline parallel mode(pipeline_stage>1).Current pipeline stage=1, the feature is disabled by default.\n", "2024-03-19 19:21:07,859 - mindformers[mindformers/models/modeling_utils.py:1415] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None.\n" ] } ], "source": [ "context.set_context(device_target=\"Ascend\")\n", "deploy_network_config = create_mfconfig(llama2_config_file, \"Ascend\", device_id, bs, seq_len)\n", "deploy_network = LlamaForCausalLM(deploy_network_config.model.model_config)\n", "deploy_network.set_train(False)\n", "deploy_network.phase = 'predict'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2. Loading the Quantized Checkpoint\n", "\n", "Because MindSpore does not currently support saving a modified network, the algorithm must first be used to restore the network with the quantization structure before the quantized checkpoint can be loaded into it." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'model.tok_embeddings.embedding_weight': Parameter (name=model.tok_embeddings.embedding_weight, shape=(32000, 4096), dtype=Float16, requires_grad=True),\n", " ......\n", " 'model.layers.31.feed_forward.w3._weight_quantizer.scale': Parameter (name=model.layers.31.feed_forward.w3._weight_quantizer.scale, shape=(11008,), dtype=Float16, requires_grad=True),\n", " 'model.layers.31.feed_forward.w3._weight_quantizer.zp_neg': Parameter (name=model.layers.31.feed_forward.w3._weight_quantizer.zp_neg, shape=(11008,), dtype=Float16, requires_grad=True),\n", " 'model.norm_out.weight': Parameter (name=model.norm_out.weight, shape=(4096,), dtype=Float32, requires_grad=True),\n", " 'lm_head.weight': Parameter (name=lm_head.weight, shape=(32000, 4096), dtype=Float16, requires_grad=True)}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "deploy_cfg = PTQConfig(mode=PTQMode.DEPLOY, backend=BackendTarget.ASCEND)\n", "deploy_ptq = RTN(config=deploy_cfg)\n", "deploy_network.model = deploy_ptq.apply(deploy_network.model)\n", "deploy_network.model = deploy_ptq.convert(deploy_network.model)\n", "ms.load_checkpoint(llama2_w8a16_ckpt_file, deploy_network)" ] },
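{ "cell_type": "markdown", "metadata": {}, "source": [ "To confirm that the quantized structure was restored and loaded, the parameter dtypes can be tallied. This is an illustrative sketch using MindSpore's parameters_and_names; the quantized Linear weights are expected to show up as Int8, alongside the Float16 scale/zp_neg parameters of the weight quantizers visible above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "# Tally parameter dtypes of the restored deployment network.\n", "print(Counter(str(p.dtype) for _, p in deploy_network.parameters_and_names()))" ] },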
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3. Evaluating the Quantized Network\n", "\n", "This example evaluates the perplexity of Llama2 on the wikitext2 dataset. Use the tokenizer and dataset files downloaded in step 1 to instantiate a tokenizer object and a dataset object respectively, and instantiate a PerplexityMetric object as the metric." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[INFO] GE(1746443,python):2024-03-19-19:25:18.990.947 [ge_api.cc:523][status:INIT]1746443 AddGraph:Start to add graph in Session. graph_id: 1, graph_name: kernel_graph224, session_id: 0.\n", "[INFO] GE(1746443,python):2024-03-19-19:25:18.991.481 [ge_api.cc:1154][status:INIT]1746443 CompileGraph:Start to compile graph, graph_id: 1\n", "[INFO] GE(1746443,python):2024-03-19-19:25:18.991.586 [graph_manager.cc:1264][EVENT]1746443 PreRun:PreRun start: graph node size 1, session id 0, graph id 1, graph name kernel_graph224.\n", "[INFO] GE(1746443,python):2024-03-19-19:25:19.065.657 [ge_api.cc:1160][status:STOP]1746443 CompileGraph:Compile graph success.\n", "[INFO] GE(1746443,python):2024-03-19-19:25:19.067.797 [ge_api.cc:787][status:INIT]2453595 RunGraphWithStreamAsync:Session run graph with stream async, session_id: 0, graph_id: 1, input size 0, output size 0\n", "[INFO] GE(1746443,python):2024-03-19-19:25:19.079.152 [ge_api.cc:799][status:STOP]2453595 RunGraphWithStreamAsync:Session run graph with stream async finished.\n", "[INFO] GE(1746443,python):2024-03-19-19:26:40.520.923 [ge_api.cc:523][status:INIT]1746443 AddGraph:Start to add graph in Session. graph_id: 2, graph_name: kernel_graph225, session_id: 0.\n", "[INFO] GE(1746443,python):2024-03-19-19:26:40.581.045 [ge_api.cc:1154][status:INIT]1746443 CompileGraph:Start to compile graph, graph_id: 2\n", "[INFO] GE(1746443,python):2024-03-19-19:26:40.633.523 [graph_manager.cc:1264][EVENT]1746443 PreRun:PreRun start: graph node size 3025, session id 0, graph id 2, graph name kernel_graph225.\n", "[INFO] GE(1746443,python):2024-03-19-19:28:24.659.856 [ge_api.cc:799][status:STOP]2453595 RunGraphWithStreamAsync:Session run graph with stream async finished.\n", "[INFO] GE(1746443,python):2024-03-19-19:28:24.665.855 [ge_api.cc:787][status:INIT]2453595 RunGraphWithStreamAsync:Session run graph with stream async, session_id: 0, graph_id: 2, input size 739, output size 3\n", "[INFO] GE(1746443,python):2024-03-19-19:28:24.667.497 [ge_api.cc:799][status:STOP]2453595 RunGraphWithStreamAsync:Session run graph with stream async finished.\n", "[INFO] GE(1746443,python):2024-03-19-19:28:25.267.844 [ge_api.cc:787][status:INIT]2453595 RunGraphWithStreamAsync:Session run graph with stream async, session_id: 0, graph_id: 3, input size 3, output size 1\n", "......\n", "[INFO] GE(1746443,python):2024-03-19-19:29:18.708.299 [ge_api.cc:799][status:STOP]2453595 RunGraphWithStreamAsync:Session run graph with stream async finished.\n", "W8A16 Perplexity: {'PerplexityMetric': {'loss': 2.910757654840339, 'PPL': 18.370711954412435}}\n" ] } ], "source": [ "import mindspore as ms\n", "from mindformers import LlamaTokenizer\n", "from mindformers.core.metric import PerplexityMetric\n", "from mindspore_gs.datasets import create_wikitext_dataset\n", "\n", "tokenizer = LlamaTokenizer(vocab_file=vocab_file)\n", "deploy_ds = create_wikitext_dataset(wikitext2_ds_path, bs, seq_len, tokenizer)\n", "deploy_metrics = {\"PerplexityMetric\": PerplexityMetric()}\n", "deploy_model = ms.Model(deploy_network, metrics=deploy_metrics, eval_network=deploy_network)\n", "quant_ppl = deploy_model.eval(deploy_ds, dataset_sink_mode=deploy_network_config.runner_config.sink_mode)\n", "print(f\"W8A16 Perplexity: {quant_ppl}\")" ] },
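{ "cell_type": "markdown", "metadata": {}, "source": [ "The reported PPL is simply the exponential of the average loss, which can be verified directly against the values printed above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "# PerplexityMetric reports PPL = exp(mean loss); check against the run above.\n", "print(math.exp(2.910757654840339))  # ≈ 18.3707, matching the logged PPL" ] },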
"cell_type": "markdown", "metadata": {}, "source": [ "### 步骤4. 效果分析\n", "\n", "#### 4.1. 评估FP16网络的Perplexity指标" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "......\n", "[WARNING] ME(2617230:281472981539200,MainProcess):2024-03-19-19:41:51.626.9 [mindspore/train/serialization.py:185] The type of model.layers.31.attention_norm.weight:Float16 in 'parameter_dict' is different from the type of it in 'net':Float32, then the type convert from Float16 to Float32 in the network.\n", "[WARNING] ME(2617230:281472981539200,MainProcess):2024-03-19-19:41:51.881.3 [mindspore/train/serialization.py:185] The type of model.layers.31.ffn_norm.weight:Float16 in 'parameter_dict' is different from the type of it in 'net':Float32, then the type convert from Float16 to Float32 in the network.\n", "[WARNING] ME(2617230:281472981539200,MainProcess):2024-03-19-19:41:51.695.148 [mindspore/train/serialization.py:185] The type of model.norm_out.weight:Float16 in 'parameter_dict' is different from the type of it in 'net':Float32, then the type convert from Float16 to Float32 in the network.\n", "2024-03-19 19:41:52,132 - mindformers[mindformers/models/modeling_utils.py:1413] - INFO - weights in ./workspace/llama2_7b.ckpt are loaded\n", "[INFO] GE(2617230,python):2024-03-19-19:42:00.316.847 [ge_api.cc:523][status:INIT]2617230 AddGraph:Start to add graph in Session. graph_id: 1, graph_name: kernel_graph0, session_id: 0.\n", "[INFO] GE(2617230,python):2024-03-19-19:42:00.317.200 [ge_api.cc:1154][status:INIT]2617230 CompileGraph:Start to compile graph, graph_id: 1\n", "[INFO] GE(2617230,python):2024-03-19-19:42:00.317.282 [graph_manager.cc:1264][EVENT]2617230 PreRun:PreRun start: graph node size 1, session id 0, graph id 1, graph name kernel_graph0.\n", "......\n", "[INFO] GE(2617230,python):2024-03-19-19:43:17.424.380 [ge_api.cc:787][status:INIT]2654383 RunGraphWithStreamAsync:Session run graph with stream async, session_id: 0, graph_id: 2, input size 291, output size 3\n", "[INFO] GE(2617230,python):2024-03-19-19:43:17.424.812 [ge_api.cc:799][status:STOP]2654383 RunGraphWithStreamAsync:Session run graph with stream async finished.\n", "[INFO] GE(2617230,python):2024-03-19-19:43:17.464.158 [ge_api.cc:787][status:INIT]2654383 RunGraphWithStreamAsync:Session run graph with stream async, session_id: 0, graph_id: 3, input size 3, output size 1\n", "[INFO] GE(2617230,python):2024-03-19-19:43:17.464.296 [ge_api.cc:799][status:STOP]2654383 RunGraphWithStreamAsync:Session run graph with stream async finished.\n", "FP16 Perplexity: {'PerplexityMetric': {'loss': 2.909490694278072, 'PPL': 18.347451724873594}}\n" ] } ], "source": [ "import mindspore as ms\n", "from mindformers import LlamaForCausalLM, LlamaTokenizer\n", "from mindformers.core.metric import PerplexityMetric\n", "from mindspore_gs.datasets import create_wikitext_dataset\n", "\n", "fp16_network_config = create_mfconfig(llama2_config_file, \"Ascend\", device_id, bs, seq_len, ckpt_path=llama2_w16a16_ckpt_file)\n", "fp16_network = LlamaForCausalLM(fp16_network_config.model.model_config)\n", "fp16_network.set_train(False)\n", "fp16_network.phase = 'predict'\n", "tokenizer = LlamaTokenizer(vocab_file=vocab_file)\n", "fp16_ds = create_wikitext_dataset(wikitext2_ds_path, bs, seq_len, tokenizer)\n", "fp16_metrics = {\"PerplexityMetric\": PerplexityMetric()}\n", "fp16_model = ms.Model(fp16_network, metrics=fp16_metrics, eval_network=fp16_network)\n", "fp16_ppl = fp16_model.eval(fp16_ds, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2. Comparing the Results\n", "\n", "Table 2: Llama2 7B before and after RTN quantization\n", "\n", "| Metric | FP16 | W8A16 | Δ (W8A16 - FP16) |\n", "| --- | --- | --- | --- |\n", "| ckpt-size (GB)↓ | 13 | 7.1 | -5.9 |\n", "| wikitext2 perplexity↓ | 18.347 | 18.370 | +0.023 |\n", "\n", "After applying the RTN quantization algorithm:\n", "\n", "1. The checkpoint of the quantized network shrinks by 5.9 GB, to 54.6% of the original Float16 checkpoint. In other words, the memory used to store the static weights at deployment drops to 54.6% of the Float16 footprint, so the quantized network can be deployed in more resource-constrained environments, or deliver higher throughput in the same environment.\n", "2. The perplexity of the quantized network on wikitext2 rises only marginally (+0.023), i.e. the quantized network is nearly lossless on the wikitext2 generative task." ] } ], "metadata": { "kernelspec": { "display_name": "py37-0308", "language": "python", "name": "py37-0308" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.16" } }, "nbformat": 4, "nbformat_minor": 2 }