Supported Features List
The features supported by the vLLM-MindSpore Plugin are consistent with those of the community version of vLLM. For feature descriptions and usage, please refer to the official vLLM documentation.
The following features are supported in the vLLM-MindSpore Plugin.
| Features | vLLM V0 | vLLM V1 |
|---|---|---|
| Chunked Prefill | √ | √ |
| Automatic Prefix Caching | √ | √ |
| Multi-step Scheduler | √ | × |
| DeepSeek MTP | √ | WIP |
| Async Output | √ | √ |
| Quantization | √ | √ |
| LoRA | WIP | WIP |
| Tensor Parallel | √ | √ |
| Pipeline Parallel | WIP | WIP |
| Expert Parallel | × | √ |
| Data Parallel | × | √ |
| Prefill Decode Disaggregation | × | WIP |
| Multi-Modality | WIP | WIP |
| Prompt Adapter | × | WIP |
| Speculative Decoding | × | WIP |
| LogProbs | × | √ |
| Prompt LogProbs | × | WIP |
| Best of | × | × |
| Beam Search | × | WIP |
| Guided Decoding | × | WIP |
| Enc-Dec | × | × |
| Pooling | × | × |
| Reasoning Outputs | √ | √ |
| Tool Calling | WIP | √ |
| Graph Capture | × | √ |
- √: Feature aligned with the community version of vLLM.
- ×: Currently unsupported; alternative solutions are recommended.
- WIP: Under development or planned for future implementation.
Feature Description
LoRA currently has two inference modes: static graph and dynamic graph. Static graph mode offers better performance but does not support dynamically loading or unloading LoRA adapters. The LoRA feature currently supports only Qwen2.5; other models are in the process of adaptation.
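As a rough sketch of how a LoRA adapter might be attached at serving time, the following uses the `--enable-lora` and `--lora-modules` options from upstream vLLM's OpenAI-compatible server; the model name, adapter name, and adapter path are placeholders, and the exact launch command for the vLLM-MindSpore Plugin should be confirmed against its own documentation.

```shell
# Launch an OpenAI-compatible server with LoRA enabled (upstream vLLM flags).
# "my-adapter" and the adapter path are placeholders for a real LoRA checkpoint.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --enable-lora \
    --lora-modules my-adapter=/path/to/lora_adapter
```

Requests can then target the adapter by setting `"model": "my-adapter"` in the completion request body, falling back to the base model name for non-LoRA inference.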
Tool Calling currently supports only the DeepSeek V3 0324 W8A8 model.
Atlas 300I Duo supports Chunked Prefill, LoRA (static graph), Automatic Prefix Caching, and Tensor Parallel; other features are in the process of adaptation.