Inference

Overview

MindSpore Transformers offers large model inference capabilities. With the unified run_mindformer script, users can start inference directly from a configuration file without writing any code, which makes it very convenient to use.

Basic Process

The inference process can be categorized into the following steps:

1. Selecting a Model for Inference

Choose a model according to the inference task; for example, Qwen3 can be used for text generation.

2. Preparing Model Files

Obtain the Hugging Face model files: weights, configuration, and tokenizer. Store the downloaded files in a single folder for convenient subsequent use.
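
As an illustration, the downloaded folder for a model such as Qwen3-8B typically follows the standard Hugging Face layout shown below. The file names and the number of weight shards are only an example and vary by model:

/path/hf_dir
 ├── config.json                        # model configuration
 ├── generation_config.json             # generation defaults
 ├── tokenizer.json                     # tokenizer
 ├── tokenizer_config.json              # tokenizer configuration
 ├── model.safetensors.index.json       # weight shard index
 ├── model-00001-of-00005.safetensors   # weight shards
 └── ...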

3. YAML Configuration File Modification

The user needs to prepare a YAML configuration file that defines all configurations of the task. MindSpore Transformers provides a YAML configuration template, which users can customize according to their actual scenario. For detailed information, please refer to the Guide to Using Inference Configuration Templates.

4. Executing Inference Tasks

Use the unified script run_mindformer to execute inference tasks.

Inference Based on the run_mindformer Script

For single-device inference, you can directly run run_mindformer.py. For multi-device inference, you need to run scripts/msrun_launcher.sh.

The arguments to run_mindformer.py are described below:

  • config: Path to the YAML configuration file.

  • run_mode: The running mode; set to predict for inference.

  • use_parallel: Whether to use multi-card inference.

  • predict_data: Input data for inference. For multi-batch inference, pass the path to a txt file containing the inputs, one per line.

  • predict_batch_size: The batch size for multi-batch inference.

msrun_launcher.sh takes two parameters: the run_mindformer.py command (as a quoted string) and the number of inference cards.
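
In other words, a multi-card launch follows the pattern below, where the placeholders in angle brackets are illustrative and the concrete commands used in this guide are given in the following sections:

bash scripts/msrun_launcher.sh "run_mindformer.py --config <path/to/yaml> --run_mode predict --use_parallel True --predict_data <input>" <number_of_cards>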

The following describes single-card and multi-card inference using Qwen3-8B as an example, with the recommended predict_qwen3.yaml configuration file.

Configuration Modification

Inference can directly reuse the Hugging Face configuration file and tokenizer, and load weights in Hugging Face safetensors format online. The required configuration modifications are as follows:

use_legacy: False
pretrained_model_dir: '/path/hf_dir'

Parameter Description:

  • use_legacy: Whether to use the legacy architecture. Default value: True;

  • pretrained_model_dir: Path to the Hugging Face model directory, where files such as the model configuration and tokenizer are placed.

The default configuration is a single-card inference configuration. If multi-card inference is required, the relevant settings need to be modified as follows (model_parallel is set to 4 here to match the 4-card example below):

use_parallel: True
parallel_config:
  data_parallel: 1
  model_parallel: 4

For specific configuration instructions, please refer to yaml Configuration Instructions.
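
Putting the above together, the fields modified on top of the predict_qwen3.yaml template for the 4-card Qwen3-8B example below would look roughly as follows; the directory path is a placeholder and all other fields keep their template values:

use_legacy: False
pretrained_model_dir: '/path/hf_dir'  # folder with the Hugging Face config, tokenizer, and safetensors weights
use_parallel: True                    # set to False for single-card inference
parallel_config:
  data_parallel: 1                    # multi-card inference does not support data parallelism
  model_parallel: 4                   # must equal the number of cards used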

Single-Device Inference

When performing inference with full weights, it is recommended to use the default configuration and execute the following command to start the inference task:

python run_mindformer.py \
--config configs/qwen3/predict_qwen3.yaml \
--run_mode predict \
--use_parallel False \
--predict_data '帮助我制定一份去上海的旅游攻略'

If output like the following appears, the inference has succeeded. The inference results are also saved to the text_generation_result.txt file in the current directory.

'text_generation_text': [帮助我制定一份去上海的旅游攻略,包括景点、美食、住宿等信息...]

Multi-Card Inference

The configuration requirements for multi-card inference differ from those for single-card inference. Modify the configuration as follows:

  1. model_parallel must be consistent with the number of cards used. The following use case is 4-card inference, so model_parallel needs to be set to 4;

  2. The current version of multi-card inference does not support data parallelism. data_parallel needs to be set to 1.

When performing inference with full weights, online weight splitting must be enabled to load the weights. Refer to the following command:

bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config configs/qwen3/predict_qwen3.yaml \
 --run_mode predict \
 --use_parallel True \
 --predict_data '帮助我制定一份去上海的旅游攻略'" 4

If output like the following appears, the inference has succeeded. The inference results are also saved to the text_generation_result.txt file in the current directory. Detailed logs can be viewed in the ./output/msrun_log directory.

'text_generation_text': [帮助我制定一份去上海的旅游攻略,包括景点、美食、住宿等信息...]

Multi-Device Multi-Batch Inference

Multi-card multi-batch inference is launched in the same way as multi-card inference, but requires adding the predict_batch_size argument and modifying the predict_data argument.

The input_predict_data.txt file contains one input per line, and the number of inputs should equal predict_batch_size, for example:

帮助我制定一份去上海的旅游攻略
帮助我制定一份去上海的旅游攻略
帮助我制定一份去上海的旅游攻略
帮助我制定一份去上海的旅游攻略

Taking inference with full weights as an example, the inference task can be started with the following command:

bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config configs/qwen3/predict_qwen3.yaml \
 --run_mode predict \
 --predict_batch_size 4 \
 --use_parallel True \
 --predict_data path/to/input_predict_data.txt" 4

Inference results are viewed in the same way as for multi-card inference.

More Information

For more inference examples of different models, see the models supported by MindSpore Transformers.