Quantization Methods

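This document describes how to quantize a model and how to run inference with the quantized model. Quantization trades a small amount of model accuracy for lower resource requirements and better performance at deployment time, which allows the model to be deployed on a wider range of devices. Because large language models are very large, post-training quantization (PTQ) has become the mainstream quantization approach for cost reasons. For details, see Introduction to Post-Training Quantization.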
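To make the accuracy-for-resources trade-off concrete, the following is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is illustrative only: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and this is neither the OutlierSuppressionLite algorithm nor the MindSpore Golden Stick implementation. In a W8A8 scheme, both weights and activations are represented with 8 bits.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.031, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max error:", max_error, "half step:", scale / 2)
```

Each value now needs 8 bits instead of 16 or 32, at the cost of a rounding error of at most half a quantization step. PTQ algorithms mainly differ in how they choose the scales so that this error hurts model accuracy as little as possible.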

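In this document, the Creating a Quantized Model section uses DeepSeek-R1 as an example to walk through the post-training quantization steps, and the Quantized Model Inference section describes how to run inference with the quantized model.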

Creating a Quantized Model

Taking the DeepSeek-R1 network as an example, we apply W8A8 quantization to it with the OutlierSuppressionLite algorithm.

Quantizing the Network with MindSpore Golden Stick

We use the PTQ algorithm from MindSpore Golden Stick to quantize the DeepSeek-R1 network. For the detailed procedure, see the DeepSeekR1-OutlierSuppressionLite quantization example.

Downloading the Quantized Weights Directly

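We have uploaded the quantized DeepSeek-R1 weights to the Modelers community as MindSpore-Lab/DeepSeek-R1-0528-A8W8. Follow the Modelers community documentation to download the weights to a local directory.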

Quantized Model Inference

After obtaining the DeepSeek-R1 W8A8 quantized weights in the previous step, make sure they are stored at the relative path DeepSeek-R1-W8A8.
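Before starting inference, it can help to confirm that the weights are where the later steps expect them. The following is a minimal sketch that only checks that the directory exists and is not empty. The helper name `find_weight_dir` is hypothetical, and the sketch assumes nothing about which files the directory should contain.

```python
from pathlib import Path


def find_weight_dir(name="DeepSeek-R1-W8A8", base=None):
    """Return the weight directory under `base` (default: current working directory).

    Raises FileNotFoundError if the directory is missing or empty.
    """
    base = Path(base) if base is not None else Path.cwd()
    path = base / name
    if not path.is_dir():
        raise FileNotFoundError(f"weight directory not found: {path}")
    if not any(path.iterdir()):
        raise FileNotFoundError(f"weight directory is empty: {path}")
    return path


try:
    print("Using weights at", find_weight_dir())
except FileNotFoundError as err:
    print("Weight check failed:", err)
```

Because the path is relative, the check depends on the directory from which the script is launched. Run it from the same working directory as the inference script.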

Offline Inference

Refer to the Installation Guide to set up the environment for the vLLM-MindSpore plugin. The following environment variable must be set:

export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend.
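If the script is launched from a context where the shell export is inconvenient, the same variable can be set from Python. This is a sketch under the assumption that the plugin reads the variable when `vllm_mindspore` is imported or later, so the assignment is placed before that import. The shell export above remains the documented method.

```python
import os

# Keep any value already exported in the shell; otherwise use the documented one.
os.environ.setdefault("VLLM_MS_MODEL_BACKEND", "MindFormers")

backend = os.environ["VLLM_MS_MODEL_BACKEND"]
if backend != "MindFormers":
    print(f"Warning: VLLM_MS_MODEL_BACKEND is {backend!r}, not 'MindFormers'")

# Import vllm_mindspore only after this point (assumption, see above).
```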

Once the environment is ready, run offline inference with the following Python code:

import vllm_mindspore # Add this line on the top of script.
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "I am",
    "Today is",
    "Llama is"
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95)

# Create an LLM from the local quantized weight directory.
llm = LLM(model="DeepSeek-R1-W8A8")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}. Generated text: {generated_text!r}")

After the script runs successfully, it prints the following inference results:

Prompt: 'I am'. Generated text: ' trying to create a virtual environment for my Python project, but I am encountering some'
Prompt: 'Today is'. Generated text: ' the 100th day of school. To celebrate, the teacher has'
Prompt: 'Llama is'. Generated text: ' a 100% natural, biodegradable, and compostable alternative'