Start Tasks
Overview
MindSpore Transformers provides a one-click startup script `run_mindformer.py` and a distributed task launch script `msrun_launcher.sh`.

The `run_mindformer.py` script is used to start tasks on a single device, providing one-click capabilities for pre-training, fine-tuning, and inference tasks. The `msrun_launcher.sh` script is used to start distributed tasks on multiple devices within a single node or across multiple nodes, launching the task on each device through the msrun tool.
Run_mindformer One-click Start Script
In the root directory of the MindSpore Transformers code, execute the `run_mindformer.py` script with Python to start a task. The parameters supported by the script are listed below. When an optional parameter is not set or is set to `None`, the value of the configuration item with the same name in the YAML configuration file is used.
Basic Parameters
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| `--config` | YAML config files. | str, required | pre-train/finetune/predict |
| `--mode` | Set the backend execution mode. | int, optional, `0` for GRAPH_MODE and `1` for PYNATIVE_MODE | pre-train/finetune/predict |
| `--device_id` | Set the execution device ID. The value must be within the range of available devices. | int, optional | pre-train/finetune/predict |
| `--device_target` | Set the backend execution device. MindSpore Transformers is only supported on `Ascend`. | str, optional | pre-train/finetune/predict |
| `--run_mode` | Set the running mode of the model: `train`, `finetune`, or `predict`. | str, optional | pre-train/finetune/predict |
| `--load_checkpoint` | File or folder paths for loading weights. For detailed usage, please refer to Weight Conversion Function. | str, optional | pre-train/finetune/predict |
| `--use_parallel` | Whether to use parallel mode. | bool, optional | pre-train/finetune/predict |
| | Override some settings in the used config; key-value pairs in xxx=yyy format are merged into the config file. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
| `--output_dir` | Set the paths for saving logs, weights, sharding strategies, and other files. | str, optional | pre-train/finetune/predict |
| `--register_path` | The absolute path of the directory where the external code is located, for example, a model directory under the research directory. | str, optional | pre-train/finetune/predict |
| `--remote_save_url` | Remote save URL to which all output files are transferred and stored. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
| `--seed` | Set the global seed. For details, refer to mindspore.set_seed. | int, optional | pre-train/finetune/predict |
| `--trust_remote_code` | Whether Hugging Face AutoTokenizer trusts remote code. | bool, optional | pre-train/finetune/predict |
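As a minimal sketch of how these options interact with the YAML file (the config path below is a placeholder, and the flag names follow the table above), any parameter passed on the command line overrides the item with the same name in the configuration file:

```shell
# Illustrative sketch: --run_mode, --device_id and --use_parallel override the YAML values;
# every option not given here is read from the YAML file. The config path is a placeholder.
python run_mindformer.py \
 --config /{path}/finetune_xxx.yaml \
 --run_mode finetune \
 --device_id 0 \
 --use_parallel False
```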
Weight Slicing
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| `--src_strategy_path_or_dir` | The strategy of load_checkpoint. | str, optional | pre-train/finetune/predict |
| `--auto_trans_ckpt` | Enable online automatic weight conversion. Refer to Weight Conversion Function. | bool, optional | pre-train/finetune/predict |
| `--transform_process_num` | The number of processes responsible for checkpoint transformation. | int, optional | pre-train/finetune/predict |
| `--only_save_strategy` | Whether to only save the strategy files. | bool, optional; when it is `True`, the task exits directly after the strategy files are saved | pre-train/finetune/predict |
| | The path to the distributed strategy file to be loaded. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
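As an illustration of the options above, the following sketch loads a complete (non-sharded) checkpoint into a distributed task and lets it be sliced online; the paths are placeholders and the flag names follow the assumptions in the table:

```shell
# Illustrative sketch: load a complete checkpoint and enable online automatic weight
# conversion so it is sliced to match the current distributed strategy.
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config /{path}/finetune_xxx.yaml \
 --run_mode finetune \
 --load_checkpoint /{path}/complete_model.ckpt \
 --auto_trans_ckpt True \
 --train_dataset_dir /{path}/dataset.mindrecord" 8
```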
Training
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| `--do_eval` | Whether to evaluate during the training process. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| `--eval_dataset_dir` | Dataset directory of the data loader used for evaluation. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune |
| `--train_dataset_dir` | Dataset directory of the data loader used for pre-training/fine-tuning. | str, optional | pre-train/finetune |
| `--resume_training` | Enable resumable training after breakpoint. For details, refer to Resumable Training After Breakpoint. | bool, optional | pre-train/finetune |
| `--profile` | Whether to use profile analysis. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| `--epochs` | Number of training epochs. | int, optional | pre-train/finetune |
| `--batch_size` | The sample size of the batch data. | int, optional | pre-train/finetune |
| `--gradient_accumulation_steps` | The number of gradient accumulation steps. | int, optional | pre-train/finetune |
| `--sink_mode` | Whether to use data sink mode. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| `--num_samples` | Number of dataset samples used. | int, optional | pre-train/finetune |
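For example, a hedged sketch of resuming an interrupted fine-tuning run, assuming the `--resume_training` and `--load_checkpoint` names listed above (paths are placeholders):

```shell
# Illustrative sketch: resume fine-tuning from a previously saved checkpoint directory.
python run_mindformer.py \
 --config /{path}/finetune_xxx.yaml \
 --run_mode finetune \
 --use_parallel False \
 --train_dataset_dir /{path}/dataset.mindrecord \
 --load_checkpoint /{path}/output/checkpoint \
 --resume_training True
```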
Inference
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| `--predict_data` | Input data for inference. | str, optional; it can be the input text for prediction (single-batch inference) or the path of a txt file containing multiple lines of text (multi-batch inference) | predict |
| `--modal_type` | Modal type of the input data for prediction. This parameter has been deprecated and will be removed in the next version. | str, optional | predict |
| `--adapter_id` | LoRA ID for prediction. This parameter has been deprecated and will be removed in the next version. | str, optional | predict |
| `--predict_batch_size` | The batch size for multi-batch inference. | int, optional | predict |
| `--do_sample` | Whether to use random sampling when selecting tokens for inference. | bool, optional; `True` uses random sampling, `False` uses greedy search | predict |
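For example, a sketch of multi-batch inference that reads one prompt per line from a txt file, assuming the parameter names above (paths are placeholders):

```shell
# Illustrative sketch: multi-batch inference; input_prompts.txt contains one prompt per line.
python run_mindformer.py \
 --config /{path}/predict_xxx.yaml \
 --run_mode predict \
 --use_parallel False \
 --predict_data /{path}/input_prompts.txt \
 --predict_batch_size 4 \
 --do_sample False
```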
Distributed Task Launch Script
The distributed task launch script msrun_launcher.sh is located in the scripts/ directory. Based on the input parameters, it automatically starts distributed multi-process tasks using the msrun command. The script can be used in the following ways:
For Default 8 Devices In Single Machine:

```shell
bash msrun_launcher.sh [EXECUTE_ORDER]
```

For Quick Start On Multiple Devices In Single Machine:

```shell
bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM]
```

For Multiple Devices In Single Machine:

```shell
bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [MASTER_PORT] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT]
```

For Multiple Devices In Multiple Machines:

```shell
bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [LOCAL_WORKER] [MASTER_ADDR] [MASTER_PORT] [NODE_RANK] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT]
```
The parameter descriptions of the script are as follows:
| Parameters | Parameter Descriptions | Value Description |
|---|---|---|
| EXECUTE_ORDER | The parameters of the Python script command to be executed in a distributed manner. | str, required, set it to a string containing the Python script to be executed and its script parameters |
| WORKER_NUM | The total number of Worker processes participating in the distributed task. | int, optional, default: `8` |
| LOCAL_WORKER | The number of Worker processes started on the current node. | int, optional, default: `8` |
| MASTER_ADDR | Specifies the IP address or hostname of the Scheduler. | str, optional, default: `"127.0.0.1"` |
| MASTER_PORT | Specifies the port number bound by the Scheduler. | int, optional, default: `8118` |
| NODE_RANK | The index of the current node. | int, optional, default: `0` |
| LOG_DIR | Worker and Scheduler log output paths. | str, optional, default: `output/msrun_log` |
| JOIN | Whether msrun waits for the Worker and Scheduler processes to exit. | bool, optional, default: `False` |
| CLUSTER_TIME_OUT | Cluster networking timeout in seconds. | int, optional, default: `7200` |
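For example, the single-node multi-device form above can be filled in as follows (an illustrative sketch; the config path and device count are placeholders):

```shell
# Illustrative sketch: launch a 4-device task on a single node with an explicit port,
# log directory, non-blocking JOIN, and a 300-second cluster networking timeout.
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config /{path}/task_config.yaml \
 --run_mode finetune" \
 4 8118 output/msrun_log False 300
```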
Task Startup Tutorial
Next, taking the fine-tuning of Qwen2.5-0.5B as an example, we will explain the usage of single-device, single-node, and multi-node tasks.
Single-Device
Execute the Python script in the root directory of the MindSpore Transformers code to perform single-device fine-tuning. The path in the command needs to be replaced with the real path.
```shell
python run_mindformer.py \
 --register_path research/qwen2_5 \
 --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
 --use_parallel False \
 --run_mode finetune \
 --train_dataset_dir ./path/alpaca-data.mindrecord
```
Single-Node
Execute the msrun startup script in the root directory of the MindSpore Transformers code to perform single-node fine-tuning. The path in the command needs to be replaced with the real path.
bash scripts/msrun_launcher.sh "run_mindformer.py \
--register_path research/qwen2_5 \
--config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
--run_mode finetune \
--train_dataset_dir ./path/alpaca-data.mindrecord "
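If fewer than 8 devices are to be used on the node, the quick start form can pass the number of Worker processes explicitly, for example with 4 devices (an illustrative sketch; this also requires the parallel configuration in the YAML file to match 4 devices):

```shell
# Illustrative sketch: launch the same fine-tuning task on 4 devices of a single node.
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --register_path research/qwen2_5 \
 --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
 --run_mode finetune \
 --train_dataset_dir ./path/alpaca-data.mindrecord" 4
```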
Multi-Node
Take Qwen2.5-0.5B as an example to perform 2-node 16-device fine-tuning.
1. Modify the corresponding config file `research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml` based on information such as the number of nodes used:

   ```yaml
   parallel_config:
     data_parallel: 16
     ...
   ```

   If the number of nodes or the number of devices changes, `data_parallel`, `model_parallel`, and `pipeline_stage` need to be modified to match the actual number of running devices: `device_num = data_parallel × model_parallel × pipeline_stage`. For example, 2 nodes with 16 devices in total can be configured as `data_parallel: 16`, `model_parallel: 1`, `pipeline_stage: 1`, since 16 = 16 × 1 × 1. Meanwhile, `micro_batch_num >= pipeline_stage` must hold.

2. Execute the msrun startup script:

   For distributed tasks started on multiple nodes and multiple devices, the script must be run on each node separately, with the parameter `MASTER_ADDR` set to the IP address of the primary node. The address is the same on all nodes; only the parameter `NODE_RANK` differs between nodes.

   ```shell
   # Node 0. Set the IP address of node 0 as {master_addr}; it serves as the primary node.
   # There are 16 devices in total, with 8 devices on each node.
   bash scripts/msrun_launcher.sh "run_mindformer.py \
    --register_path research/qwen2_5 \
    --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
    --train_dataset_dir /{path}/wiki4096.mindrecord \
    --run_mode finetune" \
    16 8 {master_addr} 8118 0 output/msrun_log False 300

   # Node 1. Set the IP address of node 0 as {master_addr}; it serves as the primary node.
   # The startup commands of node 0 and node 1 differ only in the parameter NODE_RANK.
   bash scripts/msrun_launcher.sh "run_mindformer.py \
    --register_path research/qwen2_5 \
    --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
    --train_dataset_dir /{path}/wiki4096.mindrecord \
    --run_mode finetune" \
    16 8 {master_addr} 8118 1 output/msrun_log False 300
   ```