Quantization Methods


This document introduces model quantization and quantized inference. Quantization reduces inference resource usage at a small cost in precision and improves inference performance, enabling deployment on a wider range of devices. Given the large scale of LLMs, post-training quantization (PTQ) has become the mainstream approach to model quantization. For details, refer to Post-Training Quantization Introduction.

In this document, the Creating Quantized Models section introduces post-training quantization steps using DeepSeek-R1 as an example. The Quantized Model Inference section explains how to perform inference with quantized models. The W8A8SC Sparse Quantization Models section introduces the principles, hardware support, and data format requirements of W8A8SC sparse quantization technology.

Creating Quantized Models

We use the DeepSeek-R1 network as an example to introduce W8A8 quantization with the OutlierSuppressionLite algorithm. This chapter requires the MindSpore Golden Stick module; see here for details about the module.

Quantizing Networks with MindSpore Golden Stick

We employ MindSpore Golden Stick's PTQ algorithm for quantization of DeepSeek-R1. For detailed methods, refer to DeepSeekR1-OutlierSuppressionLite Quantization Example.

Note:

  • Currently, quantization calibration is only supported on the Atlas 800I A2.

  • Do not install the accelerate library in the environment; otherwise, errors will occur during quantization.
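As background on what the PTQ step produces, W8A8 quantization maps floating-point weights to int8 using per-channel scales. The following is a minimal numpy sketch of the underlying arithmetic only; it is illustrative and is not the MindSpore Golden Stick API:

```python
import numpy as np

def quantize_w8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a weight matrix.

    Illustrative sketch of the arithmetic behind W8A8 weight quantization,
    not the MindSpore Golden Stick implementation.
    """
    # One scale per output channel (row), chosen so the largest magnitude maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_w8_per_channel(w)

# Dequantize to check the round-trip error, which is bounded by half a step.
w_dequant = q.astype(np.float32) * scale
max_err = np.abs(w - w_dequant).max()
print(q.dtype, max_err < scale.max())
```

Activations are quantized with the same scheme at runtime (typically per tensor), so matrix multiplications can run entirely in int8.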

Downloading Quantized Weights

We have uploaded the quantized DeepSeek-R1 to ModelArts Community: MindSpore-Lab/DeepSeek-R1-0528-A8W8. Refer to the ModelArts Community documentation to download the weights locally.

Quantized Model Inference

After obtaining the DeepSeek-R1 W8A8 weights, ensure they are stored in the relative path DeepSeek-R1-W8A8.

Offline Inference

Refer to the Installation Guide to set up the vLLM-MindSpore Plugin environment, then set the following environment variable:

export VLLM_MS_MODEL_BACKEND=MindFormers # Use MindSpore Transformers as the model backend.

Once ready, use the following Python code for offline inference:

import vllm_mindspore  # Add this line at the top of the script
from vllm import LLM, SamplingParams

# Sample prompts
prompts = [
    "I am",
    "Today is",
    "Llama is"
]

# Create sampling parameters
sampling_params = SamplingParams(temperature=0.0, top_p=0.95)

# Initialize LLM
llm = LLM(model="DeepSeek-R1-W8A8")
# Generate text
outputs = llm.generate(prompts, sampling_params)
# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Successful execution will yield inference results like:

Prompt: 'I am', Generated text: ' trying to create a virtual environment for my Python project, but I am encountering some'
Prompt: 'Today is', Generated text: ' the 100th day of school. To celebrate, the teacher has'
Prompt: 'Llama is', Generated text: ' a 100% natural, biodegradable, and compostable alternative'

W8A8SC Sparse Quantization Models

W8A8SC (Weight 8-bit Activation 8-bit Sparse Compression) is a model compression method that combines sparsity, quantization, and compression technologies. It can achieve higher model compression ratios and faster inference speeds while maintaining inference accuracy.

The overall sparse quantization workflow includes: sparsifying weights and importance pruning → 8-bit quantization of weights and activations → compression encoding of quantized weights and generation of compressed weights and index files. The full workflow can be completed using MindSpore Golden Stick. For details, see Golden Stick documentation and Golden Stick repository.

Technical Principles

The large model sparse quantization tool consists of three stages, sparsity, quantization, and compression:

  1. Sparsity: The model sparsity tool uses algorithms to determine the importance of each element in the model weights to the accuracy results, and sets the weight values that have little impact on the final accuracy to zero.

  2. Quantization: Both weights and activations are quantized, converting high-bit floating-point numbers to 8-bit, which can directly reduce weight volume and bring performance benefits.

  3. Compression: The weight compression tool further encodes and compresses model weights through compression algorithms, maximizing weight volume reduction and generating compressed weight and index files.
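The three stages above can be illustrated on a single weight matrix. This is a toy numpy sketch; the real tool's pruning algorithm, quantization granularity, and compressed on-disk format all differ:

```python
import numpy as np

# Toy walk-through of the sparsify -> quantize -> compress pipeline on one
# weight matrix (illustrative only; not the actual tool's algorithms/formats).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)

# 1. Sparsity: zero out the ~50% of weights with the smallest magnitude.
threshold = np.median(np.abs(w))
w_sparse = np.where(np.abs(w) >= threshold, w, 0.0)

# 2. Quantization: symmetric per-tensor int8.
scale = np.abs(w_sparse).max() / 127.0
q = np.round(w_sparse / scale).astype(np.int8)

# 3. Compression: store only nonzero values plus an index (bitmask) file.
mask = q != 0              # index file: which positions hold nonzero weights
values = q[mask]           # compressed weights: the nonzero int8 values

ratio = w.nbytes / (values.nbytes + np.packbits(mask).nbytes)
print(f"compression ratio vs fp32: {ratio:.1f}x")
```

At inference time the index file lets the hardware (or a kernel) reconstruct the dense int8 matrix, which is what the UNZIP unit described below does online.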

Hardware Support Description

Compression algorithms are closely related to hardware. The support for sparse quantization on different hardware platforms is as follows:

  • Atlas 300I Duo: Supports sparse quantization inference. The Atlas 300I Duo has a dedicated UNZIP hardware unit that decompresses sparsely quantized models online. This unit is an execution unit of the AI Core, subordinate to the MTE module; it moves compressed-format parameters from HBM, DDR, and the L2 buffer for decompression, and writes the decompressed parameters to the L0 or L1 buffer. The UNZIP unit supports a sparse mode that applies sparsity plus compression/decompression to quantized weights, further improving model inference performance.

  • Atlas 800I A2: Does not support W8A8SC model inference.

For more information about the AI Core architecture and hardware features, refer to the Ascend hardware documentation.

Data Format Requirements

The Atlas 300I Duo does not support the bf16 data format and only supports fp16, so sparse quantization supports only fp16. If weights are quantized on an Atlas 800I A2, they must be converted from bf16 to fp16 before deployment on the Atlas 300I Duo.
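Since a bf16 value is simply the upper 16 bits of the corresponding fp32 value, the format conversion itself is mechanical. Below is a numpy sketch of the bit-level bf16-to-fp16 conversion, for illustration only; real checkpoint conversion tools operate on the weight files directly:

```python
import numpy as np

# bf16 keeps the top 16 bits of an fp32 value, so a bf16 bit pattern can be
# widened to fp32 by appending 16 zero bits, then narrowed to fp16.
def bf16_bits_to_fp16(bf16_bits: np.ndarray) -> np.ndarray:
    fp32_bits = bf16_bits.astype(np.uint32) << 16
    return fp32_bits.view(np.float32).astype(np.float16)

# Hypothetical sample bit patterns: 0x3F80 is bf16 for 1.0, 0xBF80 for -1.0.
out = bf16_bits_to_fp16(np.array([0x3F80, 0xBF80], dtype=np.uint16))
print(out)  # [ 1. -1.]
```

Note that fp16 has a much narrower exponent range than bf16, so bf16 values whose magnitude exceeds the fp16 range overflow to inf during such a conversion.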

Online Inference with Sparse Quantization Models

Online inference with sparse quantization (W8A8SC) models means deploying models that have been sparsified, quantized, and compressed into an inference environment, leveraging the online decompression and efficient inference capabilities of hardware platforms such as the Atlas 300I Duo to deliver low-latency, high-throughput model services. Users only need to configure parameters and the service environment according to platform requirements; the compressed sparse quantization model can then be served through the standard inference interfaces, saving resources and improving inference efficiency. This section details how to load a sparse quantization model and complete the full online inference workflow with the vLLM-MindSpore Plugin, using Qwen3-8B-W8A8SC-TP2 as an example.

Qwen3-8B-W8A8SC-TP2 model weights can be downloaded from ModelArts Community.

Parameter Configuration

Before launching the service with vllm-mindspore serve, set the following environment variables:

export VLLM_MS_MODEL_BACKEND=Native
export MS_ENABLE_LCCL=off
export HCCL_OP_EXPANSION_MODE="AI_CPU"
export MS_ENABLE_INTERNAL_BOOST=off
export MS_ALLOC_CONF=enable_vmm:true
export MS_INTERNAL_ENABLE_CUSTOM_KERNEL_LIST=RmsNormQuant

Service Startup Example

vllm-mindspore serve "/path/to/Qwen3-8B-W8A8SC-TP2/" --quantization golden-stick  --load-format sparse_quant  --trust_remote_code --tensor_parallel_size=2 --max-num-seqs=256 --block-size=128 --gpu-memory-utilization=0.8 --max-num-batched-tokens=16384 --max-model-len=32768  2>&1 | tee log_master.txt

After the service starts successfully, you will get similar execution results:

(APIServer pid=919526) INFO:     Started server process [919526]
(APIServer pid=919526) INFO:     Waiting for application startup.
(APIServer pid=919526) INFO:     Application startup complete.

Sending Requests

curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/Qwen3-8B-W8A8SC-TP2/",
    "prompt": "Introduce the Forbidden City in Beijing",
    "max_tokens": 120,
    "temperature": 0
  }'

Single Request Accuracy Verification

{"id":"cmpl-46de75f5f3cf4b83a4161842fa61c6bc","object":"text_completion","created":1770365772,"model":"/path/to/Qwen3-8B-W8A8SC-TP2/","choices":[{"index":0,"text":" 的建筑风格和历史背景。\n北京故宫,又称紫禁城,是中国明清两代的皇家宫殿,位于北京市中心,是世界上现存规模最大、保存最完整的古代宫殿建筑群。它始建于明永乐四年(140","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":3,"total_tokens":53,"completion_tokens":50,"prompt_tokens_details":null},"kv_transfer_params":null}

Ceval Dataset Verification

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| cevaldataset | - | accuracy | gen | 86.40 |

The Ceval dataset accuracy reaches 86.40%, indicating that the sparse quantization model meets the accuracy requirements. The measured accuracy is affected by parameters such as batch_size and max_out_len, with fluctuations of about ±0.5%.