{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 应用RoundToNearest后量化算法\n", "\n", "[](https://gitee.com/mindspore/golden-stick/blob/r1.3.0/mindspore_gs/ptq/round_to_nearest/README_CN.ipynb)\n", "\n", "## RoundToNearest后量化算法简介\n", "\n", "RoundToNearest算法是一类较朴素的后量化算法,其取整方式采用Round to nearest,即四舍五入的方式。\n", "\n", "当前金箍棒中的RoundToNearest后量化(以下简称RTN)主要针对LLM(大语言模型)场景。该算法使用MinMax校正器对线性层(Linear)进行量化。伪量化的网络结构示意如下:\n", "\n", "\n", "\n", "表1:RTN算法规格\n", "\n", "| 规格 | 规格说明 |\n", "| --- | --- |\n", "| 硬件支持 | 量化阶段运行在CPU,量化模型推理仅支持Ascend |\n", "| 网络支持 | Llama2系列网络,具体请参见[Llama2网络](https://gitee.com/mindspore/mindformers/tree/r1.7.0/mindformers/models/llama) |\n", "| 运行模式支持 | Graph模式和PyNative模式 |\n", "\n", "表2:网络使用RTN算法量化前后对比\n", "\n", "
<table>
  <tr>
    <th rowspan="2">Metric</th>
    <th colspan="3">llama2-7B</th>
    <th colspan="3">llama2-13B</th>
    <th colspan="3">llama2-70B</th>
    <th colspan="3">baichuan2-13B</th>
    <th colspan="3">chatGLM3-6B</th>
  </tr>
  <tr>
    <th>FP16</th><th>W8A16</th><th>Gain</th>
    <th>FP16</th><th>W8A16</th><th>Gain</th>
    <th>FP16</th><th>W8A16</th><th>Gain</th>
    <th>FP16</th><th>W8A16</th><th>Gain</th>
    <th>FP16</th><th>W8A16</th><th>Gain</th>
  </tr>
  <tr>
    <td>ckpt-size(GB)↓</td>
    <td>13</td><td>7.1</td><td>-45.38%</td>
    <td>25</td><td>14</td><td>-44.00%</td>
    <td>129</td><td>65</td><td>-49.61%</td>
    <td>26</td><td>15</td><td>-42.31%</td>
    <td>12</td><td>6.1</td><td>-49.17%</td>
  </tr>
  <tr>
    <td>wikitext2-Perplexity↓</td>
    <td>15.130</td><td>15.129</td><td>0.00</td>
    <td>14.18</td><td>14.203</td><td>0.02</td>
    <td>10.379</td><td>10.435</td><td>0.046</td>
    <td>23.955</td><td>23.912</td><td>-0.043</td>
    <td>-</td><td>-</td><td>-</td>
  </tr>
  <tr>
    <td>squad1.1-F1↑</td>
    <td>60.48</td><td>60.76</td><td>0.28</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
  </tr>
  <tr>
    <td>squad1.1-EM↑</td>
    <td>39.62</td><td>39.57</td><td>-0.05</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
  </tr>
  <tr>
    <td>Full (prefill) performance (tokens/s)</td>
    <td>9.08</td><td>9.04</td><td>0</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
  </tr>
  <tr>
    <td>Incremental (decode) performance (tokens/s)</td>
    <td>30.24</td><td>21.08</td><td>-30.29%</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
  </tr>
  <tr>
    <td>Device memory (GB)</td>
    <td>-</td><td>-</td><td>-</td>
    <td>27</td><td>16</td><td>-40.7%</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
    <td>-</td><td>-</td><td>-</td>
  </tr>
</table>
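
To make the round-to-nearest rounding and MinMax calibration concrete, the snippet below is a minimal NumPy-only sketch of per-channel W8A16-style weight quantization, assuming an asymmetric 8-bit integer range and per-output-channel min/max statistics. The function names `rtn_quantize_weight` and `rtn_dequantize_weight` are hypothetical illustrations and are not part of the Golden Stick API.

```python
# Illustrative sketch only: per-channel weight quantization with MinMax
# calibration and round-to-nearest. Names are hypothetical, not Golden Stick API.
import numpy as np

def rtn_quantize_weight(w: np.ndarray, num_bits: int = 8):
    """Quantize a 2-D Linear weight [out_channels, in_channels] per output channel."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # e.g. [-128, 127]
    w_min = w.min(axis=1, keepdims=True)   # MinMax calibration: per-channel minimum
    w_max = w.max(axis=1, keepdims=True)   # MinMax calibration: per-channel maximum
    scale = (w_max - w_min) / (qmax - qmin)
    scale = np.maximum(scale, 1e-8)        # guard against constant channels
    zero_point = np.round(qmin - w_min / scale)
    # Round to nearest, then clip into the integer range.
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def rtn_dequantize_weight(q, scale, zero_point):
    """Recover a float16 approximation of the original weight for a W8A16 matmul."""
    return ((q.astype(np.float32) - zero_point) * scale).astype(np.float16)

# Usage example with a toy weight matrix.
w = np.random.randn(4, 8).astype(np.float32)
q, s, zp = rtn_quantize_weight(w)
w_rec = rtn_dequantize_weight(q, s, zp)
print("max reconstruction error:", np.abs(w - w_rec.astype(np.float32)).max())
```

In a W8A16 deployment the weights are stored as int8 (roughly halving checkpoint size, consistent with the ckpt-size column in Table 2) while activations stay in float16; the int8 weights are typically dequantized, or consumed by a mixed-precision kernel, at matmul time.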