# msrun Launching

## Overview

`msrun` is an encapsulation of the [Dynamic Cluster](https://www.mindspore.cn/tutorials/en/br_base/parallel/dynamic_cluster.html) startup method. With a single command on each node, users can launch multi-process distributed tasks across nodes, with no need to manually set the [dynamic networking environment variables](https://www.mindspore.cn/tutorials/en/br_base/parallel/dynamic_cluster.html). `msrun` supports the `Ascend`, `GPU` and `CPU` backends. As with the `Dynamic Cluster` startup, `msrun` has no dependencies on third-party libraries or configuration files.

> - `msrun` is available once MindSpore is installed, and the command `msrun --help` shows the supported parameters.
> - `msrun` supports both `graph mode` and `PyNative mode`.

The command line parameters are listed below:

Parameters | Functions | Types | Values | Instructions |
---|---|---|---|---|
--worker_num | The total number of Worker processes participating in the distributed task. | Integer | An integer greater than 0. The default value is 8. | The total number of Workers started on all nodes should equal this parameter. If more Workers are started, the extra Worker processes fail to register. If fewer are started, the cluster waits for the timeout window to elapse, then reports that the task failed to start and exits. The timeout window is configured by the parameter `cluster_time_out`. |
--local_worker_num | The number of Worker processes launched on the current node. | Integer | An integer greater than 0. The default value is 8. | When this parameter equals `worker_num`, all Worker processes run locally, and the `node_rank` value is ignored. |
--master_addr | Specifies the IP address or hostname of the Scheduler. | String | Legal IP address or hostname. The default is the IP address 127.0.0.1. | msrun automatically detects which node should launch the Scheduler process, so users do not need to handle this. If the IP address cannot be found or the hostname cannot be resolved by DNS, the training task fails to start. IPv6 addresses are not supported in the current version. If a hostname is given, msrun resolves it to an IP address automatically, which requires DNS service in the user's environment. |
--master_port | Specifies the Scheduler binding port number. | Integer | Port number in the range 1024 to 65535. The default is 8118. | |
--node_rank | The index of the current node. | Integer | An integer greater than or equal to 0. The default value is -1. | This parameter is ignored in the single-machine multi-card scenario. In multi-machine multi-card scenarios, if this parameter is not set, the `rank_id` of each Worker process is assigned automatically; if it is set, `rank_id` values are assigned to the Worker processes on each node according to the node index. If the number of Worker processes differs between nodes, it is recommended to leave this parameter unset so that `rank_id` values are assigned automatically. |
--log_dir | Output path for Worker and Scheduler logs. | String | Folder path. Defaults to the current directory. | If the path does not exist, msrun creates the folder recursively. Logs are named as follows: the Scheduler log is `scheduler.log`; each Worker log is `worker_[rank].log`, where the `rank` suffix normally matches the `rank_id` assigned to the Worker but may differ in multi-machine multi-card scenarios where `node_rank` is not set. Running `grep -rn "Global rank id"` is recommended to view the `rank_id` of each Worker. |
--join | Whether msrun waits for the Workers and the Scheduler to exit. | Bool | True or False. Default: False. | If set to False, msrun exits immediately after launching the processes; check the logs to confirm that the distributed task is running properly. If set to True, msrun waits for all processes to exit, collects the exception logs, and then exits. |
--cluster_time_out | Cluster networking timeout in seconds. | Integer | Default: 600 seconds. | The waiting time for cluster networking. If fewer than `worker_num` Workers have registered successfully when this window elapses, the task fails to start. |
--bind_core | Enable binding processes to CPU cores. | Bool/Dict | True/False or a device-to-CPU-range dict. Default: False. | If set to True, msrun automatically allocates CPU ranges based on device affinity. When a dict is passed manually, e.g. `{"device0":["0-10"],"device1":["11-20"]}`, msrun assigns CPU range 0-10 to process 0 (device0) and 11-20 to process 1 (device1). |
--sim_level | Set simulated compilation level. | Integer | Default: -1. Disable simulated compilation. | If this parameter is set, msrun starts only the processes for simulated compilation and does not execute operators. This feature is commonly used to debug large-scale distributed training parallel strategies, and to detect memory and strategy issues in advance. The settings for the simulated compilation level can be found in the document: DryRun. |
--sim_rank_id | rank_id of the simulated process. | Integer | Default: -1. Disable simulated compilation for a single process. | Set rank id of the simulated process. |
--rank_table_file | rank_table configuration. Only valid on Ascend platform. | String | File path of rank_table configuration. Default: empty string. | This parameter is the rank_table configuration file on the Ascend platform, describing the current distributed cluster. Since the rank_table configuration file reflects distributed cluster information at the physical level, when using this configuration, make sure that the Devices visible to the current process are consistent with the rank_table configuration. The Devices visible to the current process can be set via the environment variable `ASCEND_RT_VISIBLE_DEVICES`. |
--worker_log_name | Specifies the worker log name. | String | File name of worker log. Default: `worker_[rank].log`. | Lets users configure the worker log name. The placeholders `{ip}` and `{hostname}` can be used to include the IP address and the hostname, respectively, in the worker log name. By default, the rank is used as the suffix of the worker log name. |
--tail_worker_log | Output worker logs to the console. | String | One or more integers corresponding to worker process rank_id values. Default: -1. | By default, all worker logs of the current node are output to the console. When `--join=True`, users can specify one or more worker logs to output to the console. The values should be in [0, local_worker_num]. |
task_script | User Python script. | String | Legal script path. | Normally, this parameter is the path to a Python script, and msrun launches the process as `python task_script task_script_args` by default. msrun also supports setting this parameter to `pytest`; in this scenario the task script and task parameters are passed in the parameter `task_script_args`. |
task_script_args | Parameters for the user Python script. | | Parameter list. | For example, `msrun --worker_num=8 --local_worker_num=8 train.py --device_target=Ascend --dataset_path=/path/to/dataset` |

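
To make the relationship between these parameters concrete, the following is a minimal sketch of a helper that assembles an `msrun` command line from a subset of the options in the table. The function name `build_msrun_command` and its keyword arguments are hypothetical and are not part of MindSpore; the flag names, defaults, and value ranges are taken from the table above. The check that `local_worker_num` does not exceed `worker_num` is an assumption inferred from the description of `--worker_num`, not documented behavior.

```python
import shlex


def build_msrun_command(task_script, task_script_args=(), *, worker_num=8,
                        local_worker_num=8, master_addr="127.0.0.1",
                        master_port=8118, node_rank=None, log_dir=None,
                        join=False, cluster_time_out=600):
    """Hypothetical helper: return an msrun command line as a list of tokens."""
    # Value ranges taken from the parameter table.
    if worker_num <= 0 or local_worker_num <= 0:
        raise ValueError("worker_num and local_worker_num must be greater than 0")
    # Assumption: Workers beyond worker_num would fail to register anyway.
    if local_worker_num > worker_num:
        raise ValueError("local_worker_num cannot exceed worker_num")
    if not 1024 <= master_port <= 65535:
        raise ValueError("master_port must be in the range 1024 to 65535")

    cmd = [
        "msrun",
        f"--worker_num={worker_num}",
        f"--local_worker_num={local_worker_num}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        f"--join={join}",
        f"--cluster_time_out={cluster_time_out}",
    ]
    # node_rank and log_dir are only passed when set explicitly, so that
    # msrun falls back to its own defaults otherwise.
    if node_rank is not None:
        cmd.append(f"--node_rank={node_rank}")
    if log_dir is not None:
        cmd.append(f"--log_dir={log_dir}")

    # The task script and its arguments always come last.
    cmd.append(task_script)
    cmd.extend(task_script_args)
    return cmd


if __name__ == "__main__":
    # Single-node example with the script arguments from the table.
    tokens = build_msrun_command(
        "train.py",
        ["--device_target=Ascend", "--dataset_path=/path/to/dataset"],
        join=True,
    )
    print(shlex.join(tokens))
```

Returning a token list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.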
Environment Variables | Functions | Values |
---|---|---|
MS_ROLE | This process role. | The current version of msrun exports the following two values: |
MS_SCHED_HOST | The IP address of the user-specified Scheduler. | Same as parameter --master_addr . |
MS_SCHED_PORT | User-specified Scheduler binding port number. | Same as parameter --master_port . |
MS_WORKER_NUM | The total number of Worker processes specified by the user. | Same as parameter --worker_num . |
MS_TOPO_TIMEOUT | Cluster networking timeout. | Same as parameter --cluster_time_out . |
RANK_SIZE | The total number of Worker processes specified by the user. | Same as parameter --worker_num . |
RANK_ID | The rank_id assigned to the Worker process. | In a multi-machine multi-card scenario, if the parameter `--node_rank` is not set, RANK_ID is only exported after the cluster is initialized. To use this environment variable, it is therefore recommended to set the `--node_rank` parameter correctly. |
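
As an illustration of how a user script might consume these variables, the following is a minimal sketch that collects them into a dictionary. The function name `read_msrun_env` and the dictionary keys are hypothetical; only the environment variable names come from the table above. Missing variables are returned as `None`, which covers the case where `RANK_ID` is not yet exported because `--node_rank` was not set.

```python
import os


def read_msrun_env(environ=None):
    """Hypothetical helper: collect the variables exported by msrun.

    Any variable that is missing or empty is returned as None.
    """
    environ = os.environ if environ is None else environ

    def as_int(name):
        # Convert numeric variables, tolerating unset or empty values.
        value = environ.get(name)
        return int(value) if value not in (None, "") else None

    return {
        "role": environ.get("MS_ROLE"),
        "sched_host": environ.get("MS_SCHED_HOST"),
        "sched_port": as_int("MS_SCHED_PORT"),
        "worker_num": as_int("MS_WORKER_NUM"),
        "topo_timeout": as_int("MS_TOPO_TIMEOUT"),
        "rank_size": as_int("RANK_SIZE"),
        "rank_id": as_int("RANK_ID"),
    }


if __name__ == "__main__":
    info = read_msrun_env()
    if info["rank_id"] is None:
        print("RANK_ID is not set yet; consider passing --node_rank to msrun.")
    else:
        print(f"Worker {info['rank_id']} of {info['rank_size']}")
```

Accepting an optional mapping instead of always reading `os.environ` keeps the helper easy to exercise outside a real msrun launch.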