{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 应用RoundToNearest后量化算法\n", "\n", "[](https://gitee.com/mindspore/golden-stick/blob/r1.0.0/mindspore_gs/ptq/round_to_nearest/README_CN.ipynb)\n", "\n", "## RoundToNearest后量化算法简介\n", "\n", "RoundToNearest算法是一类较朴素的后量化算法,其取整方式使用了Round to nearest,即四舍五入的方式。\n", "\n", "当前金箍棒中的RoundToNearest后量化(后面使用RTN来简称)主要针对LLM(大语言模型场景),使用MinMax校正器对线性层(Linear)进行量化。伪量化的网络结构示意如下:\n", "\n", "\n", "\n", "表1:RTN算法规格\n", "\n", "| 规格 | 规格说明 |\n", "| --- | --- |\n", "| 硬件支持 | 量化阶段运行在CPU,量化模型推理仅支持Ascend |\n", "| 网络支持 | Llama2系列网络,具体请参见[Llama2网络](https://gitee.com/mindspore/mindformers/tree/v1.3.2/mindformers/models/llama) |\n", "| 运行模式支持 | Graph模式和PyNative模式 |\n", "\n", "表2:网络使用RTN算法量化前后对比\n", "\n", "
指标 | \n", "llama2-7B | \n", "llama2-13B | \n", "llama2-70B | \n", "baichuan2-13B | \n", "chatGLM3-6B | \n", "||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FP16 | W8A16 | 收益 | \n", "FP16 | W8A16 | 收益 | \n", "FP16 | W8A16 | 收益 | \n", "FP16 | W8A16 | 收益 | \n", "FP16 | W8A16 | 收益 | \n", "|
ckpt-size(GB)↓ | \n", "13 | 7.1 | -45.38% | \n", "25 | 14 | -44.00% | \n", "129 | 65 | -49.61% | \n", "26 | 15 | -42.31% | \n", "12 | 6.1 | -49.17% | \n", "
wikitext2-Perplexity↓ | \n", "15.130 | 15.129 | 0.00 | \n", "14.18 | 14.203 | 0.02 | \n", "10.379 | 10.435 | 0.046 | \n", "23.955 | 23.912 | -0.043 | \n", "- | - | - | \n", "
squad1.1-F1↑ | \n", "60.48 | 60.76 | 0.28 | \n", "- | - | - | \n", "- | - | - | \n", "- | - | - | \n", "- | - | - | \n", "
squad1.1-EM↑ | \n", "39.62 | 39.57 | -0.05 | \n", "- | - | - | \n", "- | - | - | \n", "- | - | - | \n", "- | - | - | \n", "
全量性能(tokens/s) | \n", "9.08 | 9.04 | 0 | \n", "- | - | - | \n", "- | - | - | \n", "- | - | - | \n", "- | - | - | \n", "
增量性能(tokens/s) | \n", "30.24 | 21.08 | -30.29% | \n", "- | - | - | \n", "- | - | - | \n", "- | - | - | \n", "- | - | - | \n", "
显存(GB) | \n", "- | - | - | \n", "27 | 16 | -40.7% | \n", "- | - | - | \n", "- | - | - | \n", "- | - | - | \n", "
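\n",
"\n",
"The weight quantization that RTN performs can be sketched as follows. This is a minimal illustrative sketch in NumPy, assuming a per-output-channel asymmetric min-max scheme; the names `rtn_quantize` and `rtn_dequantize` are invented for illustration and are not Golden Stick APIs.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def rtn_quantize(weight, n_bits=8):\n",
"    # Per-output-channel asymmetric min-max quantization with round-to-nearest.\n",
"    # NOTE: illustrative sketch only, not the Golden Stick implementation.\n",
"    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1  # int8: [-128, 127]\n",
"    w_min = weight.min(axis=1, keepdims=True)\n",
"    w_max = weight.max(axis=1, keepdims=True)\n",
"    scale = (w_max - w_min) / (qmax - qmin)\n",
"    scale = np.where(scale == 0.0, 1.0, scale)  # guard rows with constant weights\n",
"    zero_point = np.round(qmin - w_min / scale)\n",
"    q = np.clip(np.round(weight / scale + zero_point), qmin, qmax)\n",
"    return q.astype(np.int8), scale, zero_point\n",
"\n",
"def rtn_dequantize(q, scale, zero_point):\n",
"    # Recover a floating-point approximation of the original weight.\n",
"    return (q.astype(np.float32) - zero_point) * scale\n",
"```\n",
"\n",
"In the W8A16 setting only the Linear weights are stored as int8 (plus a per-channel scale and zero point); activations remain in float16, so the weights are de-quantized back to floating point for the matmul at inference time, which is why checkpoint size roughly halves while throughput does not improve.\n",
"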