Start Tasks
Overview
MindSpore Transformers provides a one-click startup script run_mindformer.py and a distributed task launch script msrun_launcher.sh.
The run_mindformer.py script starts tasks on a single device, providing one-click capabilities for pre-training, fine-tuning, and inference. The msrun_launcher.sh script starts distributed tasks across multiple devices within a single node or across multiple nodes, launching a process on each device through the msrun tool.
Run_mindformer One-click Start Script
In the root directory of the MindSpore Transformers code, execute the run_mindformer.py script with Python to start a task. The supported parameters of the script are listed below. When an optional parameter is not set or is set to None, the configuration item with the same name in the YAML configuration file is used instead.
Basic Parameters
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| --config | YAML config files. | str, required | pre-train/finetune/predict |
| --mode | Set the backend execution mode. | int, optional, 0 for GRAPH_MODE and 1 for PYNATIVE_MODE | pre-train/finetune/predict |
| --device_id | Set the execution device ID. The value must be within the range of available devices. | int, optional | pre-train/finetune/predict |
| --device_target | Set the backend execution device. MindSpore Transformers is only supported on Ascend. | str, optional | pre-train/finetune/predict |
| --run_mode | Set the running mode of the model: train, finetune, or predict. | str, optional | pre-train/finetune/predict |
| --load_checkpoint | File or folder paths for loading weights. For detailed usage, please refer to Weight Conversion Function. | str, optional | pre-train/finetune/predict |
| --use_parallel | Whether to use parallel mode. | bool, optional | pre-train/finetune/predict |
| | Override some settings in the used config; key-value pairs in xxx=yyy format will be merged into the config file. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
| --output_dir | Set the paths for saving logs, weights, sharding strategies, and other files. | str, optional | pre-train/finetune/predict |
| --register_path | The absolute path of the directory where the external code is located, for example, a model directory under the research directory. | str, optional | pre-train/finetune/predict |
| --remote_save_url | Remote save URL to which all output files will be transferred and stored. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
| --seed | Set the global seed. For details, refer to mindspore.set_seed. | int, optional | pre-train/finetune/predict |
| --trust_remote_code | Whether Hugging Face AutoTokenizer trusts remote code. | bool, optional | pre-train/finetune/predict |
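As an illustration only, a minimal single-device invocation that relies on the basic parameters above might look like the following (the YAML path is a placeholder, and the parameter names follow the table); any optional parameter that is omitted falls back to the value with the same name in the YAML file.
python run_mindformer.py \
--config path/to/model_config.yaml \
--run_mode finetune \
--device_id 0 \
--use_parallel False \
--output_dir ./output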
Weight Slicing
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| --src_strategy_path_or_dir | The strategy of load_checkpoint. | str, optional | pre-train/finetune/predict |
| --auto_trans_ckpt | Enable online automatic weight conversion. Refer to Weight Conversion Function. | bool, optional | pre-train/finetune/predict |
| --transform_process_num | The number of processes responsible for checkpoint transformation. | int, optional | pre-train/finetune/predict |
| --only_save_strategy | Whether to only save the strategy files. | bool, optional; when it is True, the task stops after the strategy files are saved | pre-train/finetune/predict |
| | The path to the distributed strategy file to be loaded. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
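As a sketch only (paths are placeholders, and the parameter names are assumed to be those in the table above), a previously sharded checkpoint can be loaded together with its source sharding strategy and converted online, for example for a single-device predict task.
python run_mindformer.py \
--config path/to/model_config.yaml \
--run_mode predict \
--use_parallel False \
--load_checkpoint path/to/sharded_checkpoint_dir \
--src_strategy_path_or_dir path/to/src_strategy \
--auto_trans_ckpt True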
Training
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| --do_eval | Whether to evaluate during the training process. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| --eval_dataset_dir | Dataset directory of the data loader used for evaluation. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune |
| --train_dataset_dir | Dataset directory of the data loader used for pre-training/fine-tuning. | str, optional | pre-train/finetune |
| --resume_training | Enable resumable training after breakpoint. For details, refer to Resumable Training After Breakpoint. | bool, optional | pre-train/finetune |
| --profile | Whether to use profile analysis. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| --epochs | Number of training epochs. | int, optional | pre-train/finetune |
| --batch_size | The sample size of the batch data. | int, optional | pre-train/finetune |
| --gradient_accumulation_steps | The number of gradient accumulation steps. | int, optional | pre-train/finetune |
| --sink_mode | Whether to use data sink mode. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| --num_samples | Number of dataset samples used. | int, optional | pre-train/finetune |
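For example, the training parameters above can be combined with the basic parameters in a single-device fine-tuning run as sketched below; the YAML and dataset paths are placeholders and the numeric values are illustrative only.
python run_mindformer.py \
--config path/to/finetune_config.yaml \
--run_mode finetune \
--use_parallel False \
--train_dataset_dir path/to/train_data.mindrecord \
--epochs 3 \
--batch_size 1 \
--gradient_accumulation_steps 4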
Inference
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| --predict_data | Input data for inference. | str, optional; it can be the input text for prediction (single-batch predict) or the path of a txt file containing multiple lines of text (multi-batch predict) | predict |
| --modal_type | Modal type of input data for prediction. This parameter has been deprecated and will be removed in the next version. | str, optional | predict |
| --adapter_id | LoRA ID for prediction. This parameter has been deprecated and will be removed in the next version. | str, optional | predict |
| --predict_batch_size | The batch size for multi-batch inference. | int, optional | predict |
| --do_sample | Whether to use random sampling when selecting tokens during inference. | int, optional | predict |
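As an illustration (the YAML and input paths are placeholders, and the parameter names follow the table above), a multi-batch inference run could pass a text file of prompts together with a batch size.
python run_mindformer.py \
--config path/to/predict_config.yaml \
--run_mode predict \
--use_parallel False \
--predict_data path/to/prompts.txt \
--predict_batch_size 4 \
--do_sample False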
Distributed Task Launch Script
The distributed task launch script msrun_launcher.sh is located in the scripts/ directory. Based on the input parameters, it automatically starts a distributed multi-process task through the msrun command. The script can be used in the following ways:
For Default 8 Devices In Single Machine:
bash msrun_launcher.sh [EXECUTE_ORDER]
For Quick Start On Multiple Devices In Single Machine:
bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM]
For Multiple Devices In Single Machine:
bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [MASTER_PORT] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT]
For Multiple Devices In Multiple Machines:
bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [LOCAL_WORKER] [MASTER_ADDR] [MASTER_PORT] [NODE_RANK] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT]
The parameter descriptions of the script are as follows:
| Parameters | Parameter Descriptions | Value Description |
|---|---|---|
| EXECUTE_ORDER | The Python script command, with its parameters, to be executed in a distributed manner. | str, required, set it to a string containing the Python script to be executed and the script parameters |
| WORKER_NUM | The total number of Worker processes participating in the distributed task. | int, optional, default: 8 |
| LOCAL_WORKER | The number of Worker processes launched on the current node. | int, optional, default: 8 |
| MASTER_ADDR | Specifies the IP address or hostname of the Scheduler. | str, optional, default: 127.0.0.1 |
| MASTER_PORT | Specifies the port number bound to the Scheduler. | int, optional, default: 8118 |
| NODE_RANK | The index of the current node. | int, optional, default: 0 |
| LOG_DIR | Worker and Scheduler log output path. | str, optional, default: output/msrun_log |
| JOIN | Whether msrun waits for the Worker and Scheduler processes to exit. | bool, optional, default: False |
| CLUSTER_TIME_OUT | Cluster networking timeout in seconds. | int, optional, default: 7200 |
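Putting the positional parameters together, a single-node launch on 4 devices with an explicit port, log directory, JOIN flag, and timeout might be written as follows; the script string and values are placeholders, and the argument order matches the "Multiple Devices In Single Machine" form above.
bash scripts/msrun_launcher.sh "run_mindformer.py \
--config path/to/finetune_config.yaml \
--run_mode finetune" \
4 8118 output/msrun_log False 300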
Task Startup Tutorial
Next, taking the fine-tuning of Qwen2.5-0.5B as an example, we will explain the usage of single-device, single-node, and multi-node tasks.
Single-Device
Execute the Python script in the root directory of the MindSpore Transformers code to perform single-device fine-tuning. The path in the command needs to be replaced with the real path.
python run_mindformer.py \
--register_path research/qwen2_5 \
--config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
--use_parallel False \
--run_mode finetune \
--train_dataset_dir ./path/alpaca-data.mindrecord
Single-Node
Execute the msrun startup script in the root directory of the MindSpore Transformers code to perform single-node fine-tuning. The path in the command needs to be replaced with the real path.
bash scripts/msrun_launcher.sh "run_mindformer.py \
--register_path research/qwen2_5 \
--config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
--run_mode finetune \
--train_dataset_dir ./path/alpaca-data.mindrecord"
Multi-Node
Take Qwen2.5-0.5B as an example to perform 2-node 16-device fine-tuning.
Modify the corresponding config file research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml based on information such as the number of nodes in use:
parallel_config:
  data_parallel: 2
  model_parallel: 4
  pipeline_stage: 2
  micro_batch_num: 16
  vocab_emb_dp: True
  gradient_aggregation_group: 4
If the number of nodes or devices changes, data_parallel, model_parallel, and pipeline_stage need to be adjusted so that the actual number of running devices satisfies device_num = data_parallel × model_parallel × pipeline_stage; in addition, micro_batch_num >= pipeline_stage must hold. With the configuration above, 2 × 4 × 2 = 16, which matches the 16 devices of the 2-node setup.
Then execute the msrun startup script:
For a distributed task across multiple nodes and devices, the script must be run separately on each node, with the parameter MASTER_ADDR set to the IP address of the primary node. The same IP address is set on all nodes; only the parameter NODE_RANK differs between nodes. The meaning of each parameter position can be found in msrun Launching.
# Node 0. Set the IP address of node 0 to the value of {ip_addr}, which is used as the IP address of the primary node. There are 16 devices in total, with 8 devices on each node.
bash scripts/msrun_launcher.sh "run_mindformer.py \
--register_path research/qwen2_5 \
--config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
--train_dataset_dir /{path}/wiki4096.mindrecord \
--run_mode finetune" \
16 8 {ip_addr} 8118 0 output/msrun_log False 300
# Node 1. Set the IP address of node 0 to the value of {ip_addr}, which is used as the IP address of the primary node. The startup commands of node 0 and node 1 differ only in the parameter NODE_RANK.
bash scripts/msrun_launcher.sh "run_mindformer.py \
--register_path research/qwen2_5 \
--config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
--train_dataset_dir /{path}/wiki4096.mindrecord \
--run_mode finetune" \
16 8 {ip_addr} 8118 1 output/msrun_log False 300