Performance Testing


The performance testing capability of the vLLM-MindSpore plugin is inherited from vLLM; for details, refer to the vLLM benchmark documentation. This document covers online performance testing and offline performance testing, and users can follow the steps below to run either.

Online Performance Testing

For single-card inference, taking Qwen2.5-7B as an example, prepare the environment by following the document Single-Card Inference (Qwen2.5-7B), and set the following environment variables:

export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend.
export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file.

Then start online inference with the following command:

vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct --device auto --disable-log-requests

For multi-card inference, taking Qwen2.5-32B as an example, prepare the environment by following the document Multi-Card Inference (Qwen2.5-32B), then start online inference with the following commands:

export TENSOR_PARALLEL_SIZE=4
export MAX_MODEL_LEN=1024
python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN

The service has been started successfully when the following logs are returned:

INFO:     Started server process [21349]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
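Once the server is up, you can sanity-check the OpenAI-compatible endpoint with a quick request before benchmarking. The sketch below only builds a chat-completion request with the Python standard library; the host/port (`localhost:8000`, vLLM's default) and the helper name `build_chat_request` are illustrative assumptions, not part of the benchmark tooling.

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str,
                       host: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible /v1/chat/completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }
    return urllib.request.Request(
        url=f"{host}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Qwen/Qwen2.5-7B-Instruct", "Hello!")
print(req.full_url)
# Send with urllib.request.urlopen(req) once the server is running.
```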

Clone the vLLM repository and import the vLLM-MindSpore plugin, so that its benchmark functionality can be reused:

export VLLM_BRANCH=v0.9.1
git clone https://github.com/vllm-project/vllm.git -b ${VLLM_BRANCH}
cd vllm
sed -i '1i import vllm_mindspore' benchmarks/benchmark_serving.py
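The `sed` command above inserts `import vllm_mindspore` as the first line of the benchmark script, so the plugin's patches to vLLM are applied before anything else is imported. For readers less familiar with `sed -i '1i ...'`, a rough Python equivalent of the same edit (illustrative only, not part of the tooling) is:

```python
import tempfile
from pathlib import Path

def prepend_import(path: str, line: str = "import vllm_mindspore") -> None:
    """Insert `line` at the top of a file, mirroring `sed -i '1i ...'`."""
    p = Path(path)
    p.write_text(line + "\n" + p.read_text())

# Demonstrate on a throwaway file instead of the real benchmark script.
demo = Path(tempfile.mkdtemp()) / "benchmark_serving.py"
demo.write_text("print('benchmark')\n")
prepend_import(str(demo))
print(demo.read_text().splitlines()[0])  # import vllm_mindspore
```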

Here, VLLM_BRANCH is the vLLM branch name, which must be compatible with the vLLM-MindSpore plugin. The compatibility mapping can be found here.

Run the test script:

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# single-card, take Qwen2.5-7B as example:
python3 benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --endpoint /v1/chat/completions  \
    --model Qwen/Qwen2.5-7B-Instruct  \
    --dataset-name sharegpt  \
    --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json  \
    --num-prompts 10

# multi-card, take Qwen2.5-32B as example:
python3 benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --endpoint /v1/chat/completions  \
    --model Qwen/Qwen2.5-32B-Instruct  \
    --dataset-name sharegpt  \
    --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json  \
    --num-prompts 10

If the test runs successfully, results like the following are returned:

============ Serving Benchmark Result ============
Successful requests:                     ....
Benchmark duration (s):                  ....
Total input tokens:                      ....
Total generated tokens:                  ....
Request throughput (req/s):              ....
Output token throughput (tok/s):         ....
Total Token throughput (tok/s):          ....
---------------Time to First Token----------------
Mean TTFT (ms):                          ....
Median TTFT (ms):                        ....
P99 TTFT (ms):                           ....
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          ....
Median TPOT (ms):                        ....
P99 TPOT (ms):                           ....
---------------Inter-token Latency----------------
Mean ITL (ms):                           ....
Median ITL (ms):                         ....
P99 ITL (ms):                            ....
==================================================
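The TTFT, TPOT, and ITL rows above are simple order statistics over per-request latency samples. As a minimal standard-library sketch of that arithmetic (the sample values are made up, and `benchmark_serving.py` may use a different percentile convention internally):

```python
import math
import statistics

def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile (one common convention)."""
    ordered = sorted(values)
    rank = max(math.ceil(0.99 * len(ordered)) - 1, 0)
    return ordered[rank]

# Hypothetical per-request time-to-first-token samples, in milliseconds.
ttft_ms = [48.0, 52.5, 47.1, 60.2, 49.9, 51.3, 55.0, 120.4, 50.2, 53.6]

print(f"Mean TTFT (ms):   {statistics.mean(ttft_ms):.2f}")
print(f"Median TTFT (ms): {statistics.median(ttft_ms):.2f}")
print(f"P99 TTFT (ms):    {p99(ttft_ms):.2f}")
```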

Offline Performance Testing

For offline performance testing, taking Qwen2.5-7B as an example, prepare the environment by following the document Single-Card Inference (Qwen2.5-7B), and set the following environment variables:

export vLLM_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend.
export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file.

Then clone the vLLM repository and import the vLLM-MindSpore plugin, so that its benchmark functionality can be reused:

export VLLM_BRANCH=v0.9.1
git clone https://github.com/vllm-project/vllm.git -b ${VLLM_BRANCH}
cd vllm
sed -i '1i import vllm_mindspore' benchmarks/benchmark_throughput.py

Here, VLLM_BRANCH is the vLLM branch name, which must be compatible with the vLLM-MindSpore plugin. The compatibility mapping can be found here.

Run the test script with the following command. The script launches the model and executes the test itself, so the user does not need to start the model separately:

python3 benchmarks/benchmark_throughput.py \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset-name sonnet \
    --dataset-path benchmarks/sonnet.txt \
    --num-prompts 10
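The throughput figures this script reports are straightforward ratios of token and request counts over wall-clock duration. A minimal sketch of that arithmetic with made-up numbers (the script's internal accounting may differ):

```python
def throughput(num_requests: int, prompt_tokens: int, output_tokens: int,
               duration_s: float) -> dict[str, float]:
    """Requests/s, total tokens/s, and output tokens/s over one benchmark run."""
    return {
        "requests_per_s": num_requests / duration_s,
        "total_tokens_per_s": (prompt_tokens + output_tokens) / duration_s,
        "output_tokens_per_s": output_tokens / duration_s,
    }

# Hypothetical run: 10 prompts, 5000 prompt tokens, 2000 output tokens, 4 s.
stats = throughput(10, 5000, 2000, 4.0)
print(f"Throughput: {stats['requests_per_s']:.2f} requests/s, "
      f"{stats['total_tokens_per_s']:.2f} total tokens/s, "
      f"{stats['output_tokens_per_s']:.2f} output tokens/s")
```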

If the test runs successfully, results like the following are returned:

Throughput: ... requests/s, ... total tokens/s, ... output tokens/s
Total num prompt tokens:  ...
Total num output tokens:  ...