Parallel Inference Methods
The vLLM-MindSpore plugin supports hybrid parallel inference configurations that combine Tensor Parallelism (TP), Data Parallelism (DP), and Expert Parallelism (EP), and can be launched on multi-node, multi-card setups using Ray or multiprocess. For the applicable scenarios of different parallel strategies, refer to the vLLM Official Documentation. The following sections detail the usage scenarios, parameter configuration, and online inference workflow for Tensor Parallelism, Data Parallelism, Expert Parallelism, and Hybrid Parallelism.
Tensor Parallelism
Tensor Parallelism shards the model weight parameters within each model layer across multiple NPUs. It is the recommended strategy for large model inference when the model size exceeds the capacity of a single NPU, or when there is a need to reduce memory pressure on each NPU and free up more space for the KV cache to achieve higher throughput. For more information, see the Introduction to Tensor Parallelism in vLLM.
Parameter Configuration
To use Tensor Parallelism (TP), configure the following options in the launch command vllm-mindspore serve:
--tensor-parallel-size: The TP parallelism degree.
Single-Node Example
The following command is an example of launching Tensor Parallelism for Qwen-2.5 on a single node with four cards:
TENSOR_PARALLEL_SIZE=4 # TP parallelism degree
vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE}
Multi-Node Example
Multi-node Tensor Parallelism relies on Ray for launch. Please refer to Ray Multi-Node Cluster Management for Ray environment configuration.
The following command is an example of launching Tensor Parallelism for Qwen-2.5 across two nodes with four cards total:
# Master Node:
TENSOR_PARALLEL_SIZE=4 # TP parallelism degree
vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE}
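The command above assumes that a Ray cluster spanning both nodes is already running. A minimal sketch of bringing the cluster up first, using the same placeholders as the Ray Multi-Node Cluster Management appendix:
# On the master node: start the Ray head process
ray start --head --port=<port-to-ray>
# On the worker node: join the cluster started by the head node
ray start --address=<head_node_ip>:<port-to-ray>
# On either node: verify that the NPUs of both nodes are visible to the cluster
ray status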
Data Parallelism
Data Parallelism fully replicates the model across multiple groups of NPUs, and processes requests from different batches in parallel during inference. Data Parallelism is the recommended strategy for large model inference when there are sufficient NPUs to fully replicate the model, when there is a need to improve throughput rather than scale the model size, or when maintaining isolation between request batches in a multi-user environment is required. Data Parallelism can be combined with other parallel strategies. Please note: MoE (Mixture of Experts) layers will be sharded based on the product of the Tensor Parallelism degree and the Data Parallelism degree. For more information, see the Introduction to Data Parallelism in vLLM.
Parameter Configuration
To use Data Parallelism (DP), configure the following options in the launch command vllm-mindspore serve:
--data-parallel-size: The DP parallelism degree.
--data-parallel-backend: Sets the DP deployment method. Options are mp and ray. By default, DP is deployed using multiprocess:
mp: Deploy with multiprocess.
ray: Deploy with Ray.
When --data-parallel-backend is set to mp, the following options must also be configured in the launch command:
--data-parallel-size-local: The number of DP workers on the current service node. The sum across all nodes should equal --data-parallel-size.
--data-parallel-start-rank: The rank offset of the first DP worker handled by the current service node.
--data-parallel-address: The communication IP address of the master node.
--data-parallel-rpc-port: The communication port of the master node.
Service Startup Examples
Ray Startup Example (Recommended)
Ray simplifies startup in multi-node scenarios and is the recommended method. Please refer to Ray Multi-Node Cluster Management for Ray environment configuration. The following command is an example of launching Data Parallelism for Qwen-2.5 across two nodes with four cards using Ray:
DATA_PARALLEL_SIZE=4 # DP parallelism degree
DATA_PARALLEL_SIZE_LOCAL=2 # Number of DP workers on the current service node. The sum across all nodes should equal `--data-parallel-size`
vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} --data-parallel-backend=ray
Multiprocess Startup
Users who want to avoid the additional Ray dependency can launch the service with multiprocess. The following command is an example of launching Data Parallelism for Qwen-2.5 across two nodes with four cards using multiprocess; with the mp backend, the master node address, RPC port, and per-node start rank must also be provided:
# Master Node:
DATA_PARALLEL_SIZE=4 # DP parallelism degree
DATA_PARALLEL_SIZE_LOCAL=2 # Number of DP workers on the current service node
vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} --data-parallel-start-rank 0 --data-parallel-address <master_node_ip> --data-parallel-rpc-port <rpc_port>
# Worker Node (start rank equals the number of DP workers on the master node, i.e. 2):
vllm-mindspore serve /path/to/Qwen2.5/model --headless --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} --data-parallel-start-rank 2 --data-parallel-address <master_node_ip> --data-parallel-rpc-port <rpc_port>
Expert Parallelism
Expert Parallelism is a parallelization form specific to Mixture of Experts (MoE) models, achieved by distributing different expert networks across multiple NPUs. This parallel mode is suitable for MoE models (such as DeepSeekV3, Qwen3-MoE, Llama-4, etc.) and is the recommended strategy for large model inference when there is a need to balance the computational load of expert networks across NPUs. Expert Parallelism is enabled with the --enable-expert-parallel option, optionally together with --additional-config to set an explicit EP degree. This setting causes the MoE layers to adopt the Expert Parallelism strategy instead of Tensor Parallelism; by default, the EP parallelism degree follows the configured TP and DP degrees (see the parameter rules below). For more information, see the Introduction to Expert Parallelism in vLLM.
Parameter Configuration
To use Expert Parallelism (EP), configure the following options in the launch command vllm-mindspore serve:
--enable-expert-parallel: Enable Expert Parallelism.
--additional-config: Configure the expert_parallel field to set the EP parallelism degree. For example, to configure EP as 4: --additional-config '{"expert_parallel": 4}'
If --enable-expert-parallel is not configured, EP is not enabled, and configuring --additional-config '{"expert_parallel": 4}' will not take effect.
If --enable-expert-parallel is configured but --additional-config '{"expert_parallel": 4}' is not, the EP parallelism degree equals the TP parallelism degree multiplied by the DP parallelism degree.
If --enable-expert-parallel is configured and --additional-config '{"expert_parallel": 4}' is also configured, the EP parallelism degree equals 4. The three cases are illustrated by the sketch below.
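To make these rules concrete, the sketch below assumes a hypothetical MoE model path and the illustrative parallel degrees --tensor-parallel-size 2 and --data-parallel-size 2; only the EP-related options differ between the three variants:
# Case 1: EP not enabled; the expert_parallel field is ignored
vllm-mindspore serve /path/to/MoE-model --trust-remote-code --tensor-parallel-size 2 --data-parallel-size 2 --additional-config '{"expert_parallel": 4}'
# Case 2: EP enabled without an explicit degree; EP degree = TP x DP = 4
vllm-mindspore serve /path/to/MoE-model --trust-remote-code --tensor-parallel-size 2 --data-parallel-size 2 --enable-expert-parallel
# Case 3: EP enabled with an explicit degree; EP degree = 4 as configured
vllm-mindspore serve /path/to/MoE-model --trust-remote-code --tensor-parallel-size 2 --data-parallel-size 2 --enable-expert-parallel --additional-config '{"expert_parallel": 4}'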
Service Startup Examples
Single-Node Example
The following command is an example of launching Expert Parallelism for Qwen-3 MOE on a single node with eight cards:
vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --additional-config '{"expert_parallel": 8}'
Multi-Node Example
Multi-node Expert Parallelism relies on Ray for launch. Please refer to Ray Multi-Node Cluster Management for Ray environment configuration. The following command is an example of launching Expert Parallelism for Qwen-3 MOE across two nodes with eight cards total using Ray:
vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --additional-config '{"expert_parallel": 8}' --data-parallel-backend ray
Hybrid Parallelism
Users can flexibly combine and adjust parallel strategies based on the model used and the available machine resources. For example, in a DeepSeek-R1 scenario, the following hybrid strategy can be used:
Tensor Parallelism: 4
Data Parallelism: 4
Expert Parallelism: 4
Based on the introductions above, the configurations for the three parallel strategies can be combined and enabled in the vllm-mindspore serve launch command. Multi-node Hybrid Parallelism relies on Ray for launch. Please refer to Ray Multi-Node Cluster Management for Ray environment configuration. The combined Ray launch command for Hybrid Parallelism is as follows:
vllm-mindspore serve /path/to/DeepSeek-R1 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --data-parallel-backend ray
Appendix
Ray Multi-Node Cluster Management
On Ascend, there are two startup methods: multiprocess and Ray. In multi-node scenarios, if Ray is used, the pyACL package must additionally be installed to adapt to Ray, and the CANN dependency versions on all nodes must be consistent.
Installing pyACL
pyACL (Python Ascend Computing Language) wraps the corresponding API interfaces of AscendCL through CPython. Using these interfaces allows management of Ascend AI processors and their corresponding computing resources.
In the target environment, after obtaining the appropriate version of the Ascend-cann-nnrt installation package, extract the pyACL dependency package and install it separately. Then add the installation path to the environment variables:
./Ascend-cann-nnrt_*_linux-aarch64.run --noexec --extract=./
cd ./run_package
./Ascend-pyACL_*_linux-aarch64.run --full --install-path=<install_path>
export PYTHONPATH=<install_path>/CANN-<VERSION>/python/site-packages/:$PYTHONPATH
If there are permission issues during installation, use the following command to add permissions:
chmod -R 777 ./Ascend-pyACL_*_linux-aarch64.run
The Ascend runtime package can be downloaded from the Ascend homepage. For example, you can refer to the installation page to download the runtime package.
Multi-Node Cluster
Before managing a multi-node cluster, check that the hostnames of all nodes are different. If any are the same, set different hostnames using hostname <new-host-name>.
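A minimal sketch of that check (the new hostname is a placeholder; pick any value that is unique within the cluster):
# Print the current hostname on each node
hostname
# If two nodes report the same name, assign a unique one (typically requires root privileges)
hostname <new-host-name>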
Start the head node:
ray start --head --port=<port-to-ray>. Upon successful startup, the connection method for worker nodes will be displayed. Configure as follows, replacing IP and address with the actual environment information.
Local node IP: *.*.*.*
-------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='*.*.*.*:*'
To connect to this Ray cluster:
import ray
ray.init()
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
Connect worker nodes to the head node: ray start --address=<head_node_ip>:<port>.
Check the cluster status using ray status. If the displayed total number of NPUs matches the sum across all nodes, the cluster has been set up successfully. When there are two nodes, each with 8 NPUs, the result is as follows:
======== Autoscaler status: 2025-05-19 00:00:00.000000 ========
Node status
---------------------------------------------------------------
Active:
 1 node_efa0981305b1204810c3080c09898097099090f09ee909d0ae12545
 1 node_184f44c4790135907ab098897c878699d89098e879f2403bc990112
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/384.0 CPU
 0.0/16.0 NPU
 0B/2.58TiB memory
 0B/372.56GiB object_store_memory

Demands:
 (no resource demands)
Online Inference
Setting Environment Variables
Configure the following environment variables on both the head and worker nodes:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export MS_ENABLE_LCCL=off
export HCCL_OP_EXPANSION_MODE=AIV
export MS_ALLOC_CONF=enable_vmm:true
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_MS_MODEL_BACKEND=MindFormers
Environment Variable Descriptions:
MS_ENABLE_LCCL: Disables LCCL and enables HCCL communication.
HCCL_OP_EXPANSION_MODE: Configures the scheduling and expansion location of the communication algorithm to be the AI Vector Core computing unit on the Device side.
MS_ALLOC_CONF: Sets the memory policy. Refer to the MindSpore Official Documentation.
ASCEND_RT_VISIBLE_DEVICES: Configures the available device IDs for each node. Users can query this using the npu-smi info command (see the sketch below).
VLLM_MS_MODEL_BACKEND: The backend of the model being run. The models and model backends currently supported by the vLLM-MindSpore plugin can be queried in the Model Support List.
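As a sketch of how the available device IDs might be confirmed before setting ASCEND_RT_VISIBLE_DEVICES (the IDs below are example values; use the ones reported on your own node):
# List the NPUs visible on this node; the reported device IDs are the values
# to put into ASCEND_RT_VISIBLE_DEVICES
npu-smi info
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7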
If users need to use the Ray deployment method, the following environment variables must be additionally set:
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
export GLOO_SOCKET_IFNAME=enp189s0f0
export HCCL_SOCKET_IFNAME=enp189s0f0
export TP_SOCKET_IFNAME=enp189s0f0
Environment Variable Descriptions:
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION: Used when there are version compatibility issues with protocol buffers.
GLOO_SOCKET_IFNAME: The GLOO backend interface name, i.e. the network interface used for gloo communication between multiple machines. Find the network card name corresponding to the IP via ifconfig (see the sketch below).
HCCL_SOCKET_IFNAME: Configures the HCCL interface name, i.e. the network interface used for HCCL communication between multiple machines. Find the network card name corresponding to the IP via ifconfig.
TP_SOCKET_IFNAME: Configures the TP interface name, i.e. the network interface used for TP communication between multiple machines. Find the network card name corresponding to the IP via ifconfig.
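As a sketch of how the interface name might be located (the IP and interface name below are placeholders, not values from this guide; take the actual name from your own ifconfig output):
# Find the interface whose stanza contains this node's communication IP
ifconfig | grep -B 1 "inet <node-ip>"
# Export the discovered interface name for GLOO, HCCL, and TP communication
export GLOO_SOCKET_IFNAME=<nic-name>
export HCCL_SOCKET_IFNAME=<nic-name>
export TP_SOCKET_IFNAME=<nic-name>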
Starting the Service
The vLLM-MindSpore plugin can deploy online inference using the OpenAI API protocol. The following is the startup process for online inference:
# Launch configuration parameter explanation
vllm-mindspore serve
[Model Tag: Path to model Config and weight files]
--trust-remote-code # Use the locally downloaded model file
--max-num-seqs [Maximum Batch Size]
--max-model-len [Model Context Length]
--max-num-batched-tokens [Maximum number of tokens supported per iteration, recommended 4096]
--block-size [Block Size, recommended 128]
--gpu-memory-utilization [Memory utilization rate, recommended 0.9]
--tensor-parallel-size [TP parallelism degree]
--headless # Only needed on worker nodes; indicates that this node does not run the API server frontend
--data-parallel-size [DP parallelism degree]
--data-parallel-size-local [Number of DP workers on the current service node. The sum across all nodes equals data-parallel-size]
--data-parallel-start-rank [The offset of the first DP worker responsible for the current service node, used when using the multiprocess startup method]
--data-parallel-address [The communication IP address of the master node, used when using the multiprocess startup method]
--data-parallel-rpc-port [The communication port of the master node, used when using the multiprocess startup method]
--enable-expert-parallel # Enable Expert Parallelism
--data-parallel-backend [ray, mp] # Specify the DP deployment method: ray or mp (i.e., multiprocess)
--additional-config # Parallel features and additional configurations
Users can specify the local path where the model is saved as the model tag.
The following are execution examples for the multiprocess and Ray startup methods respectively, taking DP4-EP4-TP4 as an example:
Multiprocess Startup Method
# Master Node:
vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --quantization ascend
# Worker Node:
vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --headless --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --quantization ascend
Specifically, --data-parallel-address and --data-parallel-rpc-port must be configured with the actual environment information of the running instance.
Ray Startup Method
# Master Node:
vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --data-parallel-backend ray --quantization ascend
Sending Requests
Use the following command to send a request. The prompt field is the model input:
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 120, "temperature": 0}'
Users must ensure that the "model" field matches the model tag used when starting the service for the request to successfully match the model.
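If in doubt, the model tag actually being served can be queried from the service itself: the OpenAI-compatible server exposes a /v1/models endpoint (the address below assumes the default http://localhost:8000 used in the example above):
# List the served models; the "id" field of each entry is the value to use in the request's "model" field
curl http://localhost:8000/v1/models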