Multi-machine Parallel Inference (DeepSeek R1)
The vLLM-MindSpore Plugin supports parallel inference with tensor parallelism (TP), data parallelism (DP), expert parallelism (EP), and hybrid combinations of these strategies. For the scenarios each parallel strategy is suited to, refer to the vLLM official documentation.
This document uses the DeepSeek R1 671B W8A8 model as an example to introduce the inference workflows for tensor parallelism (TP16) and hybrid parallelism. The DeepSeek R1 671B W8A8 model requires multiple nodes to run inference. To ensure consistent execution configurations (including model configuration file paths, Python environments, etc.) across all nodes, it is recommended to use Docker containers to eliminate execution differences.
Users can configure the environment by following the Docker Installation section below.
Docker Installation
In this section, we recommend using Docker to deploy the vLLM-MindSpore Plugin environment. The following subsections describe the deployment steps:
Building the Image
Users can execute the following commands to clone the vLLM-MindSpore Plugin code repository and build the image:
git clone -b r0.3.0 https://gitee.com/mindspore/vllm-mindspore.git
cd vllm-mindspore
bash build_image.sh
After a successful build, users will see the following output:
Successfully built e40bcbeae9fc
Successfully tagged vllm_ms_20250726:latest
Here, e40bcbeae9fc is the image ID, and vllm_ms_20250726:latest is the image name and tag. Users can run the following command to confirm that the Docker image has been created successfully:
docker images
Entering the Container
After the image has been built, use the predefined environment variable DOCKER_NAME to start and enter the container:
docker exec -it $DOCKER_NAME bash
Downloading Model Weights
Users can download the model using either the Python tool or the git-lfs tool.
Downloading with Python Tool
Execute the following Python script to download the MindSpore-compatible DeepSeek-R1 W8A8 weights and files from Modelers Community:
from openmind_hub import snapshot_download
snapshot_download(repo_id="MindSpore-Lab/DeepSeek-R1-0528-A8W8",
                  local_dir="/path/to/save/deepseek_r1_0528_a8w8",
                  local_dir_use_symlinks=False)
local_dir is the user-specified model save path. Ensure sufficient disk space is available.
Downloading with git-lfs Tool
Run the following command to check if git-lfs is available:
git lfs install
If available, the following output will be displayed:
Git LFS initialized.
If the tool is unavailable, install git-lfs first. Refer to git-lfs installation guidance in the FAQ section.
Once confirmed, download the weights by executing the following command:
git clone https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-0528-A8W8.git
TP16 Tensor Parallel Inference
vLLM manages and runs multi-node resources through Ray. This example corresponds to a scenario with Tensor Parallelism (TP) set to 16.
Setting Environment Variables
Environment variables must be set before creating the Ray cluster. If the environment changes, stop the cluster with ray stop and recreate it; otherwise, the environment variables will not take effect.
Configure the following environment variables on the master and worker nodes:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export GLOO_SOCKET_IFNAME=enp189s0f0
export HCCL_SOCKET_IFNAME=enp189s0f0
export TP_SOCKET_IFNAME=enp189s0f0
export MS_ENABLE_LCCL=off
export HCCL_OP_EXPANSION_MODE=AIV
export MS_ALLOC_CONF=enable_vmm:true
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export vLLM_MODEL_BACKEND=MindFormers
export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml
Environment variable descriptions:
GLOO_SOCKET_IFNAME: GLOO backend port. Use ifconfig to find the network interface name corresponding to the IP.
HCCL_SOCKET_IFNAME: Configure the HCCL port. Use ifconfig to find the network interface name corresponding to the IP.
TP_SOCKET_IFNAME: Configure the TP port. Use ifconfig to find the network interface name corresponding to the IP.
MS_ENABLE_LCCL: Disable LCCL and enable HCCL communication.
HCCL_OP_EXPANSION_MODE: Configure the communication algorithm expansion location to the AI Vector Core (AIV) computing unit on the device side.
MS_ALLOC_CONF: Set the memory policy. Refer to the MindSpore documentation.
ASCEND_RT_VISIBLE_DEVICES: Configure the available device IDs for each node. Use the npu-smi info command to check.
vLLM_MODEL_BACKEND: The backend of the model to run. Currently supported models and backends for the vLLM-MindSpore Plugin can be found in the Model Support List.
MINDFORMERS_MODEL_CONFIG: Model configuration file. Users can find the corresponding YAML file in the MindSpore Transformers repository, such as predict_deepseek_r1_671b_w8a8.yaml.
The model parallel strategy is specified in the parallel_config section of the configuration file. For example, the TP16 tensor parallel configuration is as follows:
# default parallel of device num = 16 for Atlas 800T A2
parallel_config:
  data_parallel: 1
  model_parallel: 16
  pipeline_stage: 1
  expert_parallel: 1
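As an optional sanity check (not part of the original workflow), the product of the parallel dimensions should equal the total number of devices, which is 16 in this two-node, 8-NPU-per-node example. A minimal Python sketch:

# Optional sanity check: the parallel dimensions must multiply to the device count.
# Values mirror the example YAML above; total_devices is an assumption for this setup.
data_parallel, model_parallel, pipeline_stage = 1, 16, 1
total_devices = 16  # 2 nodes x 8 NPUs
assert data_parallel * model_parallel * pipeline_stage == total_devices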
Additionally, users need to ensure that MindSpore Transformers is installed. Users can add it by running the following command:
export PYTHONPATH=/path/to/mindformers:$PYTHONPATH
This will include MindSpore Transformers in the Python path.
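To confirm that MindSpore Transformers is now visible to Python, a minimal optional check (assuming the path above points to a valid MindSpore Transformers checkout) is:

# Quick import check; prints the location the mindformers package is loaded from.
import mindformers
print(mindformers.__file__)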
Starting Ray for Multi-Node Cluster Management
On Ascend, the pyACL package must be installed so that Ray can be adapted to the platform. Additionally, the CANN dependency versions must be consistent across all nodes.
Installing pyACL
pyACL (Python Ascend Computing Language) encapsulates AscendCL APIs via CPython, enabling management of Ascend AI processors and computing resources.
In the corresponding environment, obtain the Ascend-cann-nnrt installation package for the required version, extract the pyACL dependency package, install it separately, and add the installation path to the environment variables:
./Ascend-cann-nnrt_8.0.RC1_linux-aarch64.run --noexec --extract=./
cd ./run_package
./Ascend-pyACL_8.0.RC1_linux-aarch64.run --full --install-path=<install_path>
export PYTHONPATH=<install_path>/CANN-<VERSION>/python/site-packages/:$PYTHONPATH
If you encounter permission issues during installation, you can grant permissions using:
chmod -R 777 ./Ascend-pyACL_8.0.RC1_linux-aarch64.run
Download the Ascend runtime package from the Ascend homepage.
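As an optional check (not part of the original steps), the following minimal snippet verifies that the pyACL module can be imported from the configured PYTHONPATH:

# Quick import check for pyACL; prints where the acl module was loaded from.
import acl
print(acl.__file__)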
Multi-Node Cluster
Before managing a multi-node cluster, ensure that the hostnames of all nodes are unique. If they are identical, set different hostnames using hostname <new-host-name>.
Start the master node: ray start --head --port=<port-to-ray>. After successful startup, the connection method for worker nodes will be displayed. For example, running ray start --head --port=6379 on a node with IP 192.5.5.5 will display:

Local node IP: 192.5.5.5

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='192.5.5.5:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status
Connect worker nodes to the master node: ray start --address=<head_node_ip>:<port>.

Check the cluster status with ray status. If the total number of NPUs displayed matches the sum across all nodes, the cluster has been created successfully. For example, with two nodes, each with 8 NPUs, the output will be:

======== Autoscaler status: 2025-05-19 00:00:00.000000 ========
Node status
---------------------------------------------------------------
Active:
 1 node_efa0981305b1204810c3080c09898097099090f09ee909d0ae12545
 1 node_184f44c4790135907ab098897c878699d89098e879f2403bc990112
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/384.0 CPU
 0.0/16.0 NPU
 0B/2.58TiB memory
 0B/372.56GiB object_store_memory

Demands:
 (no resource demands)
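The same check can also be done programmatically. The snippet below is an optional sketch that assumes it runs on a node that has already joined the Ray cluster:

import ray

# Attach to the cluster started with `ray start` above.
ray.init(address="auto")

# Expect "NPU": 16.0 for the two-node, 8-NPU-per-node example.
print(ray.cluster_resources())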
Online Inference
Starting the Service
vLLM-MindSpore Plugin can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service.
# Service launch parameter explanation
vllm-mindspore serve
--model=[Model Config/Weights Path]
--trust-remote-code # Use locally downloaded model files
--max-num-seqs [Maximum Batch Size]
--max-model-len [Maximum Input/Output Length]
--max-num-batched-tokens [Maximum Tokens per Iteration, recommended: 4096]
--block-size [Block Size, recommended: 128]
--gpu-memory-utilization [GPU Memory Utilization, recommended: 0.9]
--tensor-parallel-size [TP Parallelism Degree]
Execution example:
# Master node:
vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray
In tensor parallel scenarios, the --tensor-parallel-size argument overrides the model_parallel setting in the model YAML file. Users can also specify a local model path via the --model argument.
Sending Requests
Use the following command to send requests, where prompt is the model input:
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0, "top_p": 1.0, "top_k": 1, "repetition_penalty": 1.0}'
Users need to ensure that the "model" field matches the --model argument used when starting the service, so that the request is routed to the correct model.
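The same request can also be sent from Python. The snippet below is a minimal sketch using the requests library, assuming the service listens on the default port 8000:

import requests

# Same completion request as the curl example above.
payload = {
    "model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8",
    "prompt": "I am",
    "max_tokens": 20,
    "temperature": 0,
    "top_p": 1.0,
    "top_k": 1,
    "repetition_penalty": 1.0,
}
response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(response.json()["choices"][0]["text"])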
Hybrid Parallel Inference
vLLM manages and operates resources across multiple nodes through Ray. This example corresponds to the following parallel strategy:
Data Parallelism (DP): 4;
Tensor Parallelism (TP): 4;
Expert Parallelism (EP): 4.
Setting Environment Variables
Configure the following environment variables on the master and worker nodes:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export MS_ENABLE_LCCL=off
export HCCL_OP_EXPANSION_MODE=AIV
export MS_ALLOC_CONF=enable_vmm:true
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export vLLM_MODEL_BACKEND=MindFormers
export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8_ep4tp4.yaml
Environment variable descriptions:
MS_ENABLE_LCCL: Disable LCCL and enable HCCL communication.
HCCL_OP_EXPANSION_MODE: Configure the communication algorithm expansion location to the AI Vector Core (AIV) computing unit on the device side.
MS_ALLOC_CONF: Set the memory policy. Refer to the MindSpore documentation.
ASCEND_RT_VISIBLE_DEVICES: Configure the available device IDs for each node. Use the npu-smi info command to check.
vLLM_MODEL_BACKEND: The backend of the model to run. Currently supported models and backends for the vLLM-MindSpore Plugin can be found in the Model Support List.
MINDFORMERS_MODEL_CONFIG: Model configuration file. Users can find the corresponding YAML file in the MindSpore Transformers repository, such as predict_deepseek_r1_671b_w8a8_ep4tp4.yaml.
The model parallel strategy is specified in the parallel_config section of the configuration file. For example, the hybrid parallel configuration is as follows:
# default parallel of device num = 16 for Atlas 800T A2
parallel_config:
  data_parallel: 4
  model_parallel: 4
  pipeline_stage: 1
  expert_parallel: 4
data_parallel and model_parallel specify the parallelism strategy for the attention and feed-forward dense layers, while expert_parallel specifies the expert routing parallelism strategy for MoE layers. Ensure that data_parallel * model_parallel is divisible by expert_parallel.
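The snippet below is a small optional check of this constraint, using the values from the example configuration above:

# Verify that expert_parallel evenly divides data_parallel * model_parallel.
data_parallel, model_parallel, expert_parallel = 4, 4, 4
assert (data_parallel * model_parallel) % expert_parallel == 0, \
    "data_parallel * model_parallel must be divisible by expert_parallel"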
Online Inference
Starting the Service
vllm-mindspore can deploy online inference using the OpenAI API protocol. Below is the workflow for launching the service:
# Parameter explanations for service launch
vllm-mindspore serve
--model=[Model Config/Weights Path]
--trust-remote-code # Use locally downloaded model files
--max-num-seqs [Maximum Batch Size]
--max-model-len [Maximum Input/Output Length]
--max-num-batched-tokens [Maximum Tokens per Iteration, recommended: 4096]
--block-size [Block Size, recommended: 128]
--gpu-memory-utilization [GPU Memory Utilization, recommended: 0.9]
--tensor-parallel-size [TP Parallelism Degree]
--headless # Required only for worker nodes, indicating that the node exposes no service frontend
--data-parallel-size [DP Parallelism Degree]
--data-parallel-size-local [DP count on the current service node, sum across all nodes equals data-parallel-size]
--data-parallel-start-rank [Offset of the first DP handled by the current service node]
--data-parallel-address [Master node communication IP]
--data-parallel-rpc-port [Master node communication port]
--enable-expert-parallel # Enable expert parallelism
Users can also specify a local model path via the --model argument. The following is an execution example:
# Master node:
vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel
# Worker node:
vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel
Sending Requests
Use the following command to send requests, where prompt is the model input:
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 20, "temperature": 0}'
Users need to ensure that the "model" field matches the --model argument used when starting the service, so that the request is routed to the correct model.