Start Tasks

Overview

MindSpore Transformers provides a one-click startup script run_mindformer.py and a distributed task launch script msrun_launcher.sh.

  • The run_mindformer.py script is used to start tasks on a single device, providing one-click capabilities for pre-training, fine-tuning, and inference tasks.

  • The msrun_launcher.sh script is used to start distributed tasks across multiple devices on a single node or across multiple nodes, launching a task process on each device through the msrun tool.

Run_mindformer One-click Start Script

In the root directory of the MindSpore Transformers code, execute the run_mindformer.py script with Python to start a task. The supported parameters of the script are as follows. When an optional parameter is not set or is set to None, the value of the configuration item with the same name in the YAML configuration file is used.
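
For example, a command-line argument only needs to be passed to override what the YAML file already defines. The following is a minimal sketch of this fallback behavior; the config path is a placeholder to be replaced with a real file:

# Uses the device_id value defined in the YAML configuration file.
python run_mindformer.py --config /path/to/config.yaml --run_mode finetune

# Overrides the device_id value from the YAML configuration file with 1.
python run_mindformer.py --config /path/to/config.yaml --run_mode finetune --device_id 1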

Basic Parameters

| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| --config | Path to the YAML configuration file. | str, required | pre-train/finetune/predict |
| --mode | Set the backend execution mode. | int, optional. 0 is GRAPH_MODE and 1 is PYNATIVE_MODE. Currently, only GRAPH_MODE is supported. | pre-train/finetune/predict |
| --device_id | Set the execution device ID. The value must be within the range of available devices. | int, optional | pre-train/finetune/predict |
| --device_target | Set the backend execution device. MindSpore Transformers is only supported on Ascend devices. | str, optional | pre-train/finetune/predict |
| --run_mode | Set the running mode of the model: train, finetune, or predict. | str, optional | pre-train/finetune/predict |
| --load_checkpoint | File or folder path for loading weights. For detailed usage, refer to Weight Conversion Function. | str, optional | pre-train/finetune/predict |
| --use_parallel | Whether to use parallel mode. | bool, optional | pre-train/finetune/predict |
| --options | Override settings in the config; key-value pairs in xxx=yyy format are merged into the configuration file. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
| --output_dir | Set the path for saving logs, weights, sharding strategies, and other files. | str, optional | pre-train/finetune/predict |
| --register_path | The absolute path of the directory where the external code is located, for example, a model directory under the research directory. | str, optional | pre-train/finetune/predict |
| --remote_save_url | Remote save URL to which all output files are transferred and stored. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
| --seed | Set the global seed. For details, refer to mindspore.set_seed. | int, optional | pre-train/finetune/predict |
| --trust_remote_code | Whether Hugging Face AutoTokenizer trusts remote code. | bool, optional | pre-train/finetune/predict |
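
As a hedged sketch of how several basic parameters can be combined for a single-device fine-tuning run (the config path and output directory below are placeholders, not files shipped with the repository):

python run_mindformer.py \
--config /path/to/config.yaml \
--run_mode finetune \
--device_target Ascend \
--device_id 0 \
--use_parallel False \
--seed 42 \
--output_dir ./output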

Weight Slicing

| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| --src_strategy_path_or_dir | The path of the strategy file or directory corresponding to load_checkpoint. | str, optional | pre-train/finetune/predict |
| --auto_trans_ckpt | Enable automatic online weight conversion. Refer to Weight Conversion Function. | bool, optional | pre-train/finetune/predict |
| --transform_process_num | The number of processes responsible for checkpoint conversion. | int, optional | pre-train/finetune/predict |
| --only_save_strategy | Whether to only save the strategy files. | bool, optional. When it is true, the task exits directly after saving the strategy files. | pre-train/finetune/predict |
| --strategy_load_checkpoint | The path of the distributed strategy file to be loaded. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
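
For example, to re-shard an already-distributed checkpoint online for a new parallel layout, the conversion switch can be combined with the source strategy path, roughly as follows. This is a sketch; the checkpoint and strategy paths are placeholders, and the exact behavior is described in Weight Conversion Function:

python run_mindformer.py \
--config /path/to/config.yaml \
--run_mode finetune \
--use_parallel True \
--load_checkpoint /path/to/checkpoint_dir \
--src_strategy_path_or_dir /path/to/src_strategy \
--auto_trans_ckpt True \
--transform_process_num 2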

Training

| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| --do_eval | Whether to evaluate during training. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| --eval_dataset_dir | Dataset directory of the data loader used for evaluation. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune |
| --train_dataset_dir | Dataset directory of the data loader used for pre-training/fine-tuning. | str, optional | pre-train/finetune |
| --resume_training | Enable resumable training after breakpoint. For details, refer to Resumable Training After Breakpoint. | bool, optional | pre-train/finetune |
| --profile | Whether to use profiling analysis. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| --epochs | Number of training epochs. | int, optional | pre-train/finetune |
| --batch_size | The sample size of the batch data. | int, optional | pre-train/finetune |
| --gradient_accumulation_steps | The number of gradient accumulation steps. | int, optional | pre-train/finetune |
| --sink_mode | Whether to use data sink mode. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| --num_samples | Number of dataset samples used. | int, optional | pre-train/finetune |
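
A hedged example of overriding a few training options from the command line; the dataset path and hyperparameter values are placeholders, and anything left unset still falls back to the YAML configuration file:

python run_mindformer.py \
--config /path/to/config.yaml \
--run_mode finetune \
--train_dataset_dir /path/to/train_data.mindrecord \
--epochs 3 \
--batch_size 4 \
--gradient_accumulation_steps 8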

Inference

| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|---|---|---|---|
| --predict_data | Input data for inference. | str, optional. It can be the input text for prediction (single-batch inference) or the path of a txt file containing multiple lines of text (multi-batch inference). | predict |
| --modal_type | Modal type of the input data for prediction. This parameter has been deprecated and will be removed in the next version. | str, optional | predict |
| --adapter_id | LoRA ID for prediction. This parameter has been deprecated and will be removed in the next version. | str, optional | predict |
| --predict_batch_size | The batch size for multi-batch inference. | int, optional | predict |
| --do_sample | Whether to use random sampling when selecting tokens during inference. | bool, optional. True means using sampling decoding, False means using greedy decoding. | predict |
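
For instance, single-batch and multi-batch inference might be started as follows. This is a sketch: the config path, prompt text, and txt file are placeholders.

# Single-batch inference with greedy decoding.
python run_mindformer.py \
--config /path/to/predict_config.yaml \
--run_mode predict \
--use_parallel False \
--predict_data "Give me a short introduction to large language models." \
--do_sample False

# Multi-batch inference from a txt file with one prompt per line.
python run_mindformer.py \
--config /path/to/predict_config.yaml \
--run_mode predict \
--use_parallel False \
--predict_data /path/to/input_data.txt \
--predict_batch_size 4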

Distributed Task Launch Script

The distributed task launch script msrun_launcher.sh is located in the scripts/ directory. Based on the input parameters, it automatically starts a distributed multi-process task using the msrun command. The script can be used in the following ways:

  1. For Default 8 Devices In a Single Machine:

     bash msrun_launcher.sh [EXECUTE_ORDER]

  2. For Quick Start On Multiple Devices In a Single Machine:

     bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM]

  3. For Multiple Devices In a Single Machine:

     bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [MASTER_PORT] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT]

  4. For Multiple Devices In Multiple Machines:

     bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [LOCAL_WORKER] [MASTER_ADDR] [MASTER_PORT] [NODE_RANK] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT]

The parameter descriptions of the script are as follows:

| Parameters | Parameter Descriptions | Value Description |
|---|---|---|
| EXECUTE_ORDER | The Python script command and its parameters to be executed in a distributed manner. | str, required. Set it to a string containing the Python script to be executed and its script parameters. |
| WORKER_NUM | The total number of Worker processes participating in the distributed task. | int, optional, default: 8 |
| LOCAL_WORKER | The number of Worker processes started on the current node. | int, optional, default: 8 |
| MASTER_ADDR | The IP address or hostname of the Scheduler. | str, optional, default: "127.0.0.1" |
| MASTER_PORT | The port number the Scheduler binds to. | int, optional, default: 8118 |
| NODE_RANK | The index of the current node. | int, optional, default: 0 |
| LOG_DIR | The log output path for the Workers and the Scheduler. | str, optional, default: "output/msrun_log" |
| JOIN | Whether msrun waits for the Workers and the Scheduler to exit. | bool, optional, default: False |
| CLUSTER_TIME_OUT | Cluster networking timeout in seconds. | int, optional, default: 7200 |
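
Putting the positional parameters together, a 4-device single-machine launch with an explicit port, log directory, and timeout could look like the sketch below, which follows usage method 3 above; the config path is a placeholder:

bash scripts/msrun_launcher.sh "run_mindformer.py \
--config /path/to/config.yaml \
--run_mode finetune" \
4 8228 output/msrun_log False 300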

Task Startup Tutorial

Next, taking the fine-tuning of Qwen2.5-0.5B as an example, we will explain the usage of single-device, single-node, and multi-node tasks.

Single-Device

Execute the Python script in the root directory of the MindSpore Transformers code to perform single-device fine-tuning. The path in the command needs to be replaced with the real path.

python run_mindformer.py \
--register_path research/qwen2_5 \
--config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
--use_parallel False \
--run_mode finetune \
--train_dataset_dir ./path/alpaca-data.mindrecord

Single-Node

Execute the msrun startup script in the root directory of the MindSpore Transformers code to perform single-node fine-tuning. The path in the command needs to be replaced with the real path.

bash scripts/msrun_launcher.sh "run_mindformer.py \
 --register_path research/qwen2_5 \
 --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
 --run_mode finetune \
 --train_dataset_dir ./path/alpaca-data.mindrecord"

Multi-Node

Take Qwen2.5-0.5B as an example to perform 2-node 16-device fine-tuning.

  1. Modify the corresponding config file research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml based on the number of nodes and devices to be used:

    parallel_config:
      data_parallel: 2
      model_parallel: 4
      pipeline_stage: 2
      micro_batch_num: 16
      vocab_emb_dp: True
      gradient_aggregation_group: 4
    

    If the number of nodes or devices changes, data_parallel, model_parallel, and pipeline_stage need to be modified so that device_num = data_parallel × model_parallel × pipeline_stage equals the actual number of running devices, and micro_batch_num must be greater than or equal to pipeline_stage. For the configuration above, device_num = 2 × 4 × 2 = 16, which matches the 2-node 16-device setup.

  2. Execute the msrun startup script:

    To start a distributed task across multiple nodes and multiple devices, the script must be run on each node separately, with the parameter MASTER_ADDR set to the IP address of the primary node. MASTER_ADDR is set to the same value on all nodes; only the parameter NODE_RANK differs between nodes. The meaning of each parameter position can be found in msrun Launching.

    # Node 0. Replace {ip_addr} with the IP address of node 0, which serves as the primary node. There are 16 devices in total across 2 nodes, with 8 devices per node.
    bash scripts/msrun_launcher.sh "run_mindformer.py \
      --register_path research/qwen2_5 \
      --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
      --train_dataset_dir /{path}/wiki4096.mindrecord \
      --run_mode finetune" \
      16 8 {ip_addr} 8118 0 output/msrun_log False 300
    
    
    # Node 1. Replace {ip_addr} with the IP address of node 0, which serves as the primary node. The startup commands of node 0 and node 1 differ only in the parameter NODE_RANK.
    bash scripts/msrun_launcher.sh "run_mindformer.py \
      --register_path research/qwen2_5 \
      --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
      --train_dataset_dir /{path}/wiki4096.mindrecord \
      --run_mode finetune" \
      16 8 {ip_addr} 8118 1 output/msrun_log False 300