MindSpore Golden Stick Documentation
MindSpore Golden Stick is a model compression toolkit jointly designed and developed by the MindSpore team and Huawei Noah's Ark Lab. We have two major goals:
Build model compression capabilities for the MindSpore open-source ecosystem and provide simple, easy-to-use interfaces to improve deployment efficiency of MindSpore networks.
Shield the complexity of frameworks and hardware while offering extensible foundational capabilities for model compression algorithms.
Based on MindSpore's built-in compression technologies and a componentized design, MindSpore Golden Stick features:
SoTA Algorithms: The compression algorithms in Golden Stick come from two sources: state-of-the-art algorithms from the industry, which we continuously track and integrate into the MindSpore ecosystem, and innovative algorithms contributed by Huawei's algorithm teams.
Easy-to-use Interface: Golden Stick provides Transformers-like interfaces and supports compressing Hugging Face community weights directly, with output weights that also follow the Hugging Face weight format (see the sketch after this list).
Layered Decoupling: Golden Stick aims to be an easy-to-use platform for algorithm research. Its layered, modular architecture shields the complexity of frameworks and hardware on one hand, and lets algorithm engineers innovate and experiment quickly at the layer that suits their work on the other.
Hardware Adaptation: Supports quantizing Hugging Face weights on Ascend hardware and deploying them via the vLLM-MindSpore Plugin or MindSpore Transformers.
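To make the interface concrete, here is a minimal sketch of an A16W8 post-training quantization flow. It assumes the `PTQ`/`PTQConfig` classes shown in Golden Stick's PTQ examples; exact class, parameter, and method names vary between releases, so treat the signatures below as assumptions and consult the API reference before use.

```python
# Minimal PTQ sketch. Names follow Golden Stick's PTQ examples but may differ
# between releases -- treat the exact signatures as assumptions, not the final API.
import mindspore as ms
from mindspore_gs.common import BackendTarget
from mindspore_gs.ptq import PTQ, PTQConfig, PTQMode


def quantize_a16w8(network, calibration_ds, ckpt_path="quantized.ckpt"):
    """Quantize `network` (e.g. an LLM loaded from Hugging Face weights via
    MindSpore Transformers) to int8 weights with 16-bit activations."""
    cfg = PTQConfig(
        mode=PTQMode.QUANTIZE,             # calibrate and quantize now (vs. PTQMode.DEPLOY)
        backend=BackendTarget.ASCEND,      # target Ascend hardware for deployment
        weight_quant_dtype=ms.dtype.int8,  # A16W8: int8 weights, floating-point activations
    )
    ptq = PTQ(config=cfg)
    network = ptq.apply(network, datasets=calibration_ds)  # insert quant cells and calibrate
    network = ptq.convert(network)                         # fold into a deployable quantized net
    ms.save_checkpoint(network, ckpt_path)                 # export the quantized weights
    return network
```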
You can refer to the Architecture Design to quickly understand the system architecture of MindSpore Golden Stick.
If you have any suggestions for MindSpore Golden Stick, please reach out by filing an issue in the repository, and we will respond promptly.
Using MindSpore Golden Stick for Model Compression
MindSpore Golden Stick provides unified model compression interfaces and supports multiple compression techniques such as Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), and model pruning. You can learn more from the following documentation:
Currently, Golden Stick focuses primarily on compressing LLMs and multimodal understanding models, mainly through PTQ. The QAT and pruning algorithms were originally designed for CV models and are no longer actively evolved or maintained. QAT or pruning algorithms for LLMs may be planned in the future; contributions and feature requests are welcome in the community.
Repository: <https://gitee.com/mindspore/golden-stick>
Supported Algorithms in MindSpore Golden Stick
Post-Training Quantization (PTQ):
- Basic principles and common approaches of post-training quantization (minimal numeric sketches of the core ideas follow this list).
- An innovative algorithm jointly developed by the Huawei Taylor team and the MindSpore team, providing higher-accuracy A8W8 quantization.
- A8W4 Mixed-Precision Quantization: combines OutlierSuppressionLite and GPTQ to achieve layer-wise mixed-precision quantization.
- A16W8 quantization, providing the fundamental weight-quantization capability.
- A8W8 quantization that improves accuracy by smoothing activation distributions.
- A16W4 activation-aware weight quantization to improve low-bit performance.
- A16W4 quantization using a gradient-based post-training quantization method.
- Per-token dynamic quantization that computes quantization parameters in real time during inference.
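The A16W8 weight quantization listed above boils down to a simple mapping. The sketch below shows symmetric round-to-nearest int8 weight quantization with one scale per output channel; it illustrates the general principle only, not Golden Stick's implementation.

```python
# Round-to-nearest int8 weight quantization with a per-output-channel scale.
import numpy as np


def quantize_weight_per_channel(w):
    """Symmetric int8 quantization of a [out_channels, in_channels] weight."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per output channel
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    return q.astype(np.float64) * scale


w = np.random.randn(4, 8)
q, scale = quantize_weight_per_channel(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())  # bounded by scale / 2 per channel
```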
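Two further ideas from the list, smoothing activation distributions and per-token dynamic quantization, can be sketched the same way. The migration strength `alpha` and the toy shapes below are illustrative assumptions; the point is that smoothing leaves the matmul result unchanged while making activations easier to quantize, and that dynamic quantization computes one scale per token at inference time.

```python
# SmoothQuant-style smoothing plus per-token dynamic int8 activation quantization.
import numpy as np


def smooth(w, x, alpha=0.5):
    """Per-input-channel smoothing: x' = x / s, w' = w * s, so w' @ x'.T == w @ x.T."""
    s = (np.abs(x).max(axis=0) ** alpha) / (np.abs(w).max(axis=0) ** (1 - alpha))
    return w * s, x / s


def quantize_per_token(x):
    """Dynamic int8 quantization with one scale per token (one per row of x)."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale


x = np.random.randn(3, 8)       # 3 tokens, 8 channels
x[:, 0] *= 50.0                 # one outlier channel, typical of LLM activations
w = np.random.randn(4, 8)       # [out_channels, in_channels] weight

w_s, x_s = smooth(w, x)
q, scale = quantize_per_token(x_s)
x_hat = q.astype(np.float64) * scale
print(np.allclose(w @ x.T, w_s @ x_s.T))               # smoothing preserves the product
print("activation quant error:", np.abs(x_s - x_hat).max())
```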
Quantization-Aware Training (QAT):
- Basic principles and common approaches of quantization-aware training (a minimal fake-quantization sketch follows this list).
- Simulated Quantization Training: simulates quantization effects during training to improve the accuracy of quantized models.
- Searching for Low-Bit quantization-aware training algorithm.
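The simulated-quantization idea can be shown as a fake-quantization step: in the forward pass, values are rounded to the int8 grid and immediately dequantized so the network trains against quantization error, while the backward pass treats the rounding as identity (the straight-through estimator). This is a generic illustration, not the implementation used in Golden Stick.

```python
# Fake quantization as used conceptually in quantization-aware training.
import numpy as np


def fake_quant(x, num_bits=8):
    """Quantize to a symmetric int grid, then dequantize back to float."""
    qmax = 2 ** (num_bits - 1) - 1                       # 127 for int8
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale


w = np.random.randn(4, 4)
w_q = fake_quant(w)
# During QAT the loss is computed with w_q, but gradients update the latent
# full-precision w as if fake_quant were the identity (straight-through estimator).
print("simulated quantization error:", np.abs(w - w_q).max())
```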
Pruning:
- Basic principles and common approaches of pruning (a minimal channel-pruning sketch follows this list).
- Channel-level structured pruning to reduce the number of model parameters.
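As a generic illustration of channel-level structured pruning (not the specific criterion used by Golden Stick's pruning algorithm), the sketch below ranks a linear layer's output channels by L1 norm, drops the weakest ones, and shrinks the next layer's input dimension to match.

```python
# Channel-level structured pruning of a linear layer by L1-norm importance.
import numpy as np


def prune_output_channels(w, w_next, keep_ratio=0.5):
    """w: [out, in] weight of layer k; w_next: [out2, out] weight of layer k+1."""
    importance = np.abs(w).sum(axis=1)                   # L1 norm per output channel
    n_keep = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])     # indices of channels to keep
    return w[keep, :], w_next[:, keep]


w1 = np.random.randn(8, 16)
w2 = np.random.randn(4, 8)
w1_p, w2_p = prune_output_channels(w1, w2)
print(w1.shape, "->", w1_p.shape)   # (8, 16) -> (4, 16)
print(w2.shape, "->", w2_p.shape)   # (4, 8)  -> (4, 4)
```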
Deployment Integration
Contribution Guide
RELEASE NOTES