Performance Testing
The vLLM-MindSpore plugin inherits its performance testing capabilities from vLLM; see the vLLM Benchmark documentation for details. This document covers online and offline performance testing, and users can follow the steps below to run either.
Online Performance Testing
For single-card inference, taking Qwen2.5-7B as an example, prepare the environment as described in Single-Card Inference (Qwen2.5-7B), then set the following environment variables:
export vLLM_MODEL_BACKEND=MindFormers  # Use MindSpore Transformers as the model backend.
export MINDFORMERS_MODEL_CONFIG=$YAML_PATH  # Path to the matching MindSpore Transformers model YAML file.
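The YAML file is the predict configuration shipped with MindSpore Transformers for the chosen model. As an illustration (the path below is hypothetical; substitute the actual location in your installation):
export MINDFORMERS_MODEL_CONFIG=/path/to/mindformers/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml  # hypothetical path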
Then start online inference with the following command:
vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct --device auto --disable-log-requests
For multi-card inference, taking Qwen2.5-32B as an example, prepare the environment as described in Multi-Card Inference (Qwen2.5-32B), then start online inference with the following commands:
export TENSOR_PARALLEL_SIZE=4
export MAX_MODEL_LEN=1024
python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust-remote-code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN
When the following log is printed, the service has been started successfully:
INFO: Started server process [21349]
INFO: Waiting for application startup.
INFO: Application startup complete.
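Before benchmarking, the service can be sanity-checked with a single request. A minimal sketch, assuming the server listens on vLLM's default port 8000 (adjust if --port was set):
# Send one request to the OpenAI-compatible chat endpoint (default port 8000 assumed).
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 8}'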
Clone the vLLM repository and import the vLLM-MindSpore plugin to reuse its benchmark functionality:
export VLLM_BRANCH=v0.9.1
git clone https://github.com/vllm-project/vllm.git -b ${VLLM_BRANCH}
cd vllm
sed -i '1i import vllm_mindspore' benchmarks/benchmark_serving.py
Here, VLLM_BRANCH is the vLLM branch name, which must be compatible with the vLLM-MindSpore plugin. The compatibility matrix can be found here.
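The sed command prepends import vllm_mindspore to the benchmark script so that the plugin's patches take effect before vLLM itself is imported. To confirm the patch was applied:
# The first line should now read: import vllm_mindspore
head -n 1 benchmarks/benchmark_serving.py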
Run the test script:
# Download the dataset (run once):
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Single card, taking Qwen2.5-7B as an example:
python3 benchmarks/benchmark_serving.py \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 10
# Multi-card, taking Qwen2.5-32B as an example:
python3 benchmarks/benchmark_serving.py \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model Qwen/Qwen2.5-32B-Instruct \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 10
If the test runs successfully, results like the following are returned:
============ Serving Benchmark Result ============
Successful requests: ....
Benchmark duration (s): ....
Total input tokens: ....
Total generated tokens: ....
Request throughput (req/s): ....
Output token throughput (tok/s): ....
Total Token throughput (tok/s): ....
---------------Time to First Token----------------
Mean TTFT (ms): ....
Median TTFT (ms): ....
P99 TTFT (ms): ....
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): ....
Median TPOT (ms): ....
P99 TPOT (ms): ....
---------------Inter-token Latency----------------
Mean ITL (ms): ....
Median ITL (ms): ....
P99 ITL (ms): ....
==================================================
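By default, benchmark_serving.py sends all prompts as fast as possible. To emulate a steady arrival rate and save the results to a JSON file, recent vLLM versions also accept --request-rate and --save-result; verify against your checked-out branch with --help. A hedged single-card sketch:
# Controlled arrival rate of 2 req/s; flags assumed present, check --help.
python3 benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset-name sharegpt \
    --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 100 \
    --request-rate 2 \
    --save-result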
Offline Performance Testing
For offline performance testing, taking Qwen2.5-7B as an example, prepare the environment as described in Single-Card Inference (Qwen2.5-7B), then set the following environment variables:
export vLLM_MODEL_BACKEND=MindFormers  # Use MindSpore Transformers as the model backend.
export MINDFORMERS_MODEL_CONFIG=$YAML_PATH  # Path to the matching MindSpore Transformers model YAML file.
Then clone the vLLM repository and import the vLLM-MindSpore plugin to reuse its benchmark functionality:
export VLLM_BRANCH=v0.9.1
git clone https://github.com/vllm-project/vllm.git -b ${VLLM_BRANCH}
cd vllm
sed -i '1i import vllm_mindspore' benchmarks/benchmark_throughput.py
Here, VLLM_BRANCH is the vLLM branch name, which must be compatible with the vLLM-MindSpore plugin. The compatibility matrix can be found here.
Run the test script with the following command. The script loads the model and runs the benchmark itself, so there is no need to start a serving process separately:
python3 benchmarks/benchmark_throughput.py \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset-name sonnet \
--dataset-path benchmarks/sonnet.txt \
--num-prompts 10
If the test runs successfully, results like the following are returned:
Throughput: ... requests/s, ... total tokens/s, ... output tokens/s
Total num prompt tokens: ...
Total num output tokens: ...
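For repeatable comparisons, fixed-length synthetic prompts can be preferable to a text dataset. A sketch, assuming the benchmark_throughput.py in your checked-out branch still supports --input-len and --output-len (verify with --help):
# Synthetic prompts with fixed input/output lengths; flags assumed, check --help.
python3 benchmarks/benchmark_throughput.py \
    --model Qwen/Qwen2.5-7B-Instruct \
    --input-len 512 \
    --output-len 128 \
    --num-prompts 10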