MindSpore Golden Stick Documentation

MindSpore Golden Stick is a model compression toolkit jointly designed and developed by the MindSpore team and Huawei Noah's Ark Lab. We have two major goals:

  1. Build model compression capabilities for the MindSpore open-source ecosystem and provide simple, easy-to-use interfaces to improve deployment efficiency of MindSpore networks.

  2. Shield the complexity of frameworks and hardware while offering extensible foundational capabilities for model compression algorithms.

Based on MindSpore's built-in compression technologies and a componentized design, MindSpore Golden Stick features:

  • SoTA Algorithms: The model compression algorithms in Golden Stick come from two sources: state-of-the-art algorithms from the industry, which we continuously track and integrate into the MindSpore ecosystem, and innovative algorithms contributed by Huawei's algorithm teams.

  • Easy-to-use Interface: Golden Stick provides Transformers-like interfaces and supports compressing Hugging Face community weights directly, producing output weights that also conform to the Hugging Face weight format (see the configuration sketch after this list).

  • Layered Decoupling: Golden Stick aims to be an easy-to-use platform for algorithm research. The framework has a layered, modular architecture that shields the complexity of frameworks and hardware on one hand, and on the other makes it easy for algorithm engineers to innovate and experiment quickly at different layers of an algorithm.

  • Hardware Adaptation: Supports quantizing Hugging Face weights on Ascend hardware and deploying the quantized models via the vLLM-MindSpore Plugin or MindSpore Transformers.

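To make the interface and hardware points above concrete, below is a minimal configuration sketch for post-training quantization targeting Ascend. The class and parameter names (`PTQConfig`, `PTQMode`, `BackendTarget`, `weight_quant_dtype`) follow the Golden Stick PTQ documentation as we understand it and may differ between releases, so treat them as assumptions and check against the version you have installed.

```python
# Minimal PTQ configuration sketch (names assumed from the Golden Stick PTQ docs;
# verify against your installed mindspore_gs version).
from mindspore import dtype as msdtype
from mindspore_gs.common import BackendTarget
from mindspore_gs.ptq import PTQConfig, PTQMode

# Quantize weights to int8 for deployment on Ascend hardware.
quant_cfg = PTQConfig(
    mode=PTQMode.QUANTIZE,            # use PTQMode.DEPLOY when loading already-quantized weights
    backend=BackendTarget.ASCEND,     # target the Ascend backend
    weight_quant_dtype=msdtype.int8,  # 8-bit weight quantization
)
```
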
You can refer to the Architecture Design to quickly understand the system architecture of MindSpore Golden Stick.

If you have any suggestions for MindSpore Golden Stick, please reach out via the repository's issue tracker, and we will respond promptly.

Using MindSpore Golden Stick for Model Compression

MindSpore Golden Stick provides unified model compression interfaces and supports multiple compression techniques such as Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), and model pruning. You can learn more from the documentation of each technique.

Currently, Golden Stick focuses primarily on compressing LLMs and multimodal understanding models, mainly with PTQ. The QAT and pruning algorithms were originally designed for CV models and are no longer actively evolved or maintained. QAT or pruning algorithms for LLMs may be planned in the future; contributions and feature requests from the community are welcome.

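As a rough orientation for the unified interface, the sketch below wraps the typical PTQ flow: apply a quantization algorithm to a float network, then convert it into its quantized form for deployment. It assumes the `RoundToNearest` algorithm and its `apply`/`convert` methods as described in the PTQ documentation; the network argument is whatever MindSpore Transformers model cell you have already built and loaded.

```python
# Sketch of the unified quantize-then-convert flow (API names assumed; the caller
# supplies a float network, e.g. a MindSpore Transformers LLM cell).
from mindspore_gs.common import BackendTarget
from mindspore_gs.ptq import PTQConfig, PTQMode, RoundToNearest


def quantize_for_ascend(network):
    """Apply round-to-nearest PTQ to `network` and return the quantized network."""
    cfg = PTQConfig(mode=PTQMode.QUANTIZE, backend=BackendTarget.ASCEND)
    rtn = RoundToNearest(config=cfg)
    network = rtn.apply(network)    # wrap quantizable layers with quantization cells
    network = rtn.convert(network)  # realize the quantized weights for deployment
    return network
```

The resulting quantized checkpoint can then be saved and served through the vLLM-MindSpore Plugin or MindSpore Transformers, as noted in the feature list above.
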
Repository: <https://gitee.com/mindspore/golden-stick>

Supported Algorithms in MindSpore Golden Stick