# Start Tasks

[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/start_tasks.md)

## Overview

MindSpore Transformers provides a one-click startup script `run_mindformer.py` and a distributed task launch script `msrun_launcher.sh`.

- The `run_mindformer.py` script is used to start tasks on a **single device**, providing one-click capabilities for pre-training, fine-tuning, and inference tasks.
- The `msrun_launcher.sh` script is used to start distributed tasks on **multiple devices within a single node** or **multiple devices across multiple nodes**, launching the task on each device through the [msrun](https://www.mindspore.cn/tutorials/en/master/parallel/msrun_launcher.html) tool.

## Run_mindformer One-click Start Script

In the root directory of the MindSpore Transformers code, execute the `run_mindformer.py` script with Python to start a task. The supported parameters of the script are listed below. **When an optional parameter is not set or is set to `None`, the configuration item with the same name in the YAML configuration file is used instead.**

### Basic Parameters

| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|:---------------------:|:------------------------|-------------------|----------------------|
| `--config` | Path of the YAML configuration file. | str, required | pre-train/finetune/predict |
| `--mode` | Set the backend execution mode. | int, optional, `0` is GRAPH_MODE and `1` is PYNATIVE_MODE. Currently, only GRAPH_MODE is supported. | pre-train/finetune/predict |
| `--device_id` | Set the execution device ID. The value must be within the range of available devices. | int, optional | pre-train/finetune/predict |
| `--device_target` | Set the backend execution device. MindSpore Transformers is only supported on `Ascend` devices. | str, optional | pre-train/finetune/predict |
| `--run_mode` | Set the running mode of the model: `train`, `finetune` or `predict`. | str, optional | pre-train/finetune/predict |
| `--load_checkpoint` | File or folder path for loading weights. For detailed usage, refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/master/feature/ckpt.html). | str, optional | pre-train/finetune/predict |
| `--use_parallel` | Whether to use parallel mode. | bool, optional | pre-train/finetune/predict |
| `--options` | Override some settings in the used config; key-value pairs in `xxx=yyy` format are merged into the config file. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
| `--output_dir` | Set the path for saving logs, weights, sharding strategies, and other files. | str, optional | pre-train/finetune/predict |
| `--register_path` | The absolute path of the directory where the external code is located, for example, a model directory under the `research` directory. | str, optional | pre-train/finetune/predict |
| `--remote_save_url` | Remote save URL to which all output files are transferred and stored. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |
| `--seed` | Set the global seed. For details, refer to [mindspore.set_seed](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_seed.html). | int, optional | pre-train/finetune/predict |
| `--trust_remote_code` | Whether Hugging Face AutoTokenizer trusts remote code. | bool, optional | pre-train/finetune/predict |
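
As a reference, the sketch below shows how several of these basic parameters can be combined on the command line. It is illustrative only: the YAML path and output directory are placeholders to be replaced with real paths, and any parameter that is omitted falls back to the value of the same name in the YAML configuration file.

```shell
# Illustrative sketch: start a fine-tuning task on device 0 while overriding a few YAML settings.
# path/to/config.yaml and ./my_output are placeholders, not real files in the repository.
python run_mindformer.py \
 --config path/to/config.yaml \
 --run_mode finetune \
 --device_id 0 \
 --use_parallel False \
 --output_dir ./my_output
```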

### Weight Slicing

| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|:----------------------------:|:------------------------|-------------------|----------------------|
| `--src_strategy_path_or_dir` | The sharding strategy file or directory corresponding to `load_checkpoint`. | str, optional | pre-train/finetune/predict |
| `--auto_trans_ckpt` | Enable online automatic weight conversion. Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/master/feature/ckpt.html). | bool, optional | pre-train/finetune/predict |
| `--transform_process_num` | The number of processes responsible for checkpoint transformation. | int, optional | pre-train/finetune/predict |
| `--only_save_strategy` | Whether to only save the strategy files. | bool, optional, when it is `true`, the task exits directly after saving the strategy files. | pre-train/finetune/predict |
| `--strategy_load_checkpoint` | The path of the distributed strategy file to be loaded. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune/predict |

### Training

| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|:-------------------------------:|:------------------------|-------------------|----------------------|
| `--do_eval` | Whether to evaluate during training. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| `--eval_dataset_dir` | Dataset directory of the data loader for evaluation. This parameter has been deprecated and will be removed in the next version. | str, optional | pre-train/finetune |
| `--train_dataset_dir` | Dataset directory of the data loader for pre-training/fine-tuning. | str, optional | pre-train/finetune |
| `--resume_training` | Enable resumable training after breakpoint. For details, refer to [Resumable Training After Breakpoint](https://www.mindspore.cn/mindformers/docs/en/master/feature/resume_training.html#resumable-training). | bool, optional | pre-train/finetune |
| `--profile` | Whether to use profiling analysis. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| `--epochs` | Number of training epochs. | int, optional | pre-train/finetune |
| `--batch_size` | The sample size of the batch data. | int, optional | pre-train/finetune |
| `--gradient_accumulation_steps` | The number of gradient accumulation steps. | int, optional | pre-train/finetune |
| `--sink_mode` | Whether to use sink mode. This parameter has been deprecated and will be removed in the next version. | bool, optional | pre-train/finetune |
| `--num_samples` | Number of dataset samples used. | int, optional | pre-train/finetune |
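
As an illustrative sketch, the training parameters above can be appended to the launch command to override the corresponding YAML fields; the paths below are placeholders rather than real files.

```shell
# Illustrative sketch: fine-tune with command-line overrides for a few training parameters.
# path/to/config.yaml and path/to/train_data.mindrecord are placeholders.
python run_mindformer.py \
 --config path/to/config.yaml \
 --run_mode finetune \
 --train_dataset_dir path/to/train_data.mindrecord \
 --epochs 2 \
 --batch_size 1 \
 --gradient_accumulation_steps 4
```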

### Inference

| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|:----------------------:|:------------------------|-------------------|----------------------|
| `--predict_data` | Input data for inference. | str, optional, it can be the input text for predict (single-batch predict) or the path of a txt file containing multiple lines of text (multi-batch predict). | predict |
| `--modal_type` | Modal type of the input data for predict. This parameter has been deprecated and will be removed in the next version. | str, optional | predict |
| `--adapter_id` | LoRA ID for predict. This parameter has been deprecated and will be removed in the next version. | str, optional | predict |
| `--predict_batch_size` | The batch size for multi-batch inference. | int, optional | predict |
| `--do_sample` | Whether to use random sampling when selecting tokens for inference. | bool, optional, `True` means sampling-based decoding, `False` means greedy decoding. | predict |

## Distributed Task Pull-up Script

The distributed task pull-up script `msrun_launcher.sh` is located in the `scripts/` directory. Based on the input parameters, it automatically starts distributed multi-process tasks with the [msrun](https://www.mindspore.cn/tutorials/en/master/parallel/msrun_launcher.html) command. The script can be used in the following ways:

1. For Default 8 Devices In Single Machine:

   ```bash
   bash msrun_launcher.sh [EXECUTE_ORDER]
   ```

2. For Quick Start On Multiple Devices In Single Machine:

   ```bash
   bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM]
   ```

3. For Multiple Devices In Single Machine:

   ```bash
   bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [MASTER_PORT] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT]
   ```

4. For Multiple Devices In Multiple Machines:

   ```bash
   bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [LOCAL_WORKER] [MASTER_ADDR] [MASTER_PORT] [NODE_RANK] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT]
   ```

The parameter descriptions of the script are as follows:

| Parameters | Parameter Descriptions | Value Description |
|:------------------:|:------------------------|-------------------|
| `EXECUTE_ORDER` | The Python script command, with its parameters, to be executed in a distributed manner. | str, required, set it to a string containing the Python script to be executed and its parameters |
| `WORKER_NUM` | The total number of Worker processes participating in the distributed task. | int, optional, default: `8` |
| `LOCAL_WORKER` | The number of Worker processes pulled up on the current node. | int, optional, default: `8` |
| `MASTER_ADDR` | Specifies the IP address or hostname of the Scheduler. | str, optional, default: `"127.0.0.1"` |
| `MASTER_PORT` | Specifies the port number to which the Scheduler binds. | int, optional, default: `8118` |
| `NODE_RANK` | The index of the current node. | int, optional, default: `0` |
| `LOG_DIR` | Output path of the Worker and Scheduler logs. | str, optional, default: `"output/msrun_log"` |
| `JOIN` | Whether msrun waits for the Worker and Scheduler processes to exit. | bool, optional, default: `False` |
| `CLUSTER_TIME_OUT` | Cluster networking timeout in seconds. | int, optional, default: `7200` |
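
As a concrete instance of the third usage method above (multiple devices in a single machine), the sketch below launches a 4-device task with the port, log directory, `JOIN`, and timeout positions spelled out explicitly; the YAML path is a placeholder.

```shell
# Illustrative sketch of usage method 3.
# Positional order: EXECUTE_ORDER, WORKER_NUM, MASTER_PORT, LOG_DIR, JOIN, CLUSTER_TIME_OUT.
# path/to/config.yaml is a placeholder.
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config path/to/config.yaml \
 --run_mode finetune" \
 4 8118 output/msrun_log False 300
```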

## Task Startup Tutorial

The following takes the fine-tuning of Qwen2.5-0.5B as an example to explain how to start single-device, single-node, and multi-node tasks.

### Single-Device

Execute the Python script in the root directory of the MindSpore Transformers code to perform single-device fine-tuning. The paths in the command need to be replaced with real paths.

```shell
python run_mindformer.py \
 --register_path research/qwen2_5 \
 --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
 --use_parallel False \
 --run_mode finetune \
 --train_dataset_dir ./path/alpaca-data.mindrecord
```

### Single-Node

Execute the msrun startup script in the root directory of the MindSpore Transformers code to perform single-node fine-tuning. The paths in the command need to be replaced with real paths.

```shell
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --register_path research/qwen2_5 \
 --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
 --run_mode finetune \
 --train_dataset_dir ./path/alpaca-data.mindrecord"
```

### Multi-Node

Take Qwen2.5-0.5B as an example to perform 2-node 16-device fine-tuning.

1. Modify the corresponding configuration file `research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml` according to the number of nodes and devices actually used:

   ```yaml
   parallel_config:
     data_parallel: 2
     model_parallel: 4
     pipeline_stage: 2
     micro_batch_num: 16
     vocab_emb_dp: True
     gradient_aggregation_group: 4
   ```

   > If the number of nodes or devices changes, `data_parallel`, `model_parallel`, and `pipeline_stage` need to be adjusted so that `device_num = data_parallel × model_parallel × pipeline_stage` matches the actual number of running devices. Meanwhile, `micro_batch_num >= pipeline_stage` must hold.

2. Execute the msrun startup script:

   For a distributed task across multiple nodes and devices, the script must be executed on each node separately, with the parameter `MASTER_ADDR` set to the IP address of the primary node. The `MASTER_ADDR` value is the same on all nodes; only the parameter `NODE_RANK` differs between nodes. The meaning of each parameter position can be found in [msrun Launching](https://www.mindspore.cn/tutorials/en/master/parallel/msrun_launcher.html).

   ```shell
   # Node 0. Set {ip_addr} to the IP address of node 0, which serves as the primary node. There are 16 devices in total, with 8 devices on each node.
   bash scripts/msrun_launcher.sh "run_mindformer.py \
    --register_path research/qwen2_5 \
    --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
    --train_dataset_dir /{path}/wiki4096.mindrecord \
    --run_mode finetune" \
    16 8 {ip_addr} 8118 0 output/msrun_log False 300

   # Node 1. Set {ip_addr} to the IP address of node 0, which serves as the primary node. The startup commands of node 0 and node 1 differ only in the parameter NODE_RANK.
   bash scripts/msrun_launcher.sh "run_mindformer.py \
    --register_path research/qwen2_5 \
    --config research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml \
    --train_dataset_dir /{path}/wiki4096.mindrecord \
    --run_mode finetune" \
    16 8 {ip_addr} 8118 1 output/msrun_log False 300
   ```
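
For completeness, a single-device inference task can be started in a similar way with the inference parameters described earlier. The sketch below is illustrative: the inference YAML path is a placeholder, and `--predict_data` accepts either a prompt string or the path of a txt file, as noted in the Inference parameter table.

```shell
# Illustrative sketch: single-device inference with a placeholder inference YAML.
# For models under the research directory, --register_path also needs to be set.
python run_mindformer.py \
 --config path/to/predict_config.yaml \
 --run_mode predict \
 --use_parallel False \
 --predict_data "Give me a short introduction to large language models."
```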