# Dynamic Cluster Startup [![View Source on AtomGit](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.8.0/resource/_static/logo_source_en.svg)](https://atomgit.com/mindspore/docs/blob/r2.8.0/tutorials/source_en/parallel/dynamic_cluster.md) ## Overview For reliability requirements during training, MindSpore provides **dynamic cluster** features that enable users to start Ascend/GPU/CPU distributed training tasks without relying on any third-party library (OpenMPI) and without any modification to the training script. We recommend users to use this startup method in preference. The MindSpore **Dynamic Cluster** feature replaces the OpenMPI capability by **reusing the Parameter Server mode training architecture**. The **Dynamic Cluster** feature starts multiple MindSpore training processes as `Workers`, and starts an additional `Scheduler` for cluster and disaster recovery, thus, distributed training can be achieved without the need for OpenMPI's message passing mechanism. The user only needs to make a few changes to the startup script to perform distributed training. > Dynamic cluster supports Ascend, GPU and CPU, so the dynamic cluster startup script can be quickly migrated between multiple hardware platforms without additional modifications. The relevant environment variables: | Environment Variables | Functions | Type      | Value | Description | |:----------------------|:---------|:------------------------|:------|:------------| | `MS_ROLE` | Specifies the role of this process. | String | | The Worker and Parameter Server processes register with the Scheduler process to complete the networking. | | `MS_SCHED_HOST` | Specifies the IP address of the Scheduler. | String | Legal IP address. | IPv6 addresses are only supported on `Ascend` platform in current version. | | `MS_SCHED_PORT` | Specifies the Scheduler binding port number. | Integer | Port number in the range of 1024 to 65535. |-| | `MS_NODE_ID` | Specifies the ID of this process, unique within the cluster. | String | Represents the unique ID of this process, which is automatically generated by MindSpore by default. | MS_NODE_ID needs to be set in the following cases. Normally it does not need to be set and is automatically generated by MindSpore: | | `MS_WORKER_NUM` | Specifies the number of processes with the role MS_WORKER. | Integer | Integer greater than 0. | The number of Worker processes started by the user should be equal to the value of this environment variable. If it is less than this value, the networking fails. If it is greater than this value, the Scheduler process will complete the networking according to the order of Worker registration, and the redundant Worker processes will fail to start. | | `MS_SERVER_NUM` | Specifies the number of processes with the role MS_PSERVER. | Integer | Integer greater than 0. | Only set in Parameter Server training mode. | | `MS_WORKER_IP` | Specifies the IP address used for communication and networking between processes. | String | Legitimate IP address. | This environment variable must be set when using IPv6. But when MS_SCHED_HOST is set to **::1** (Representing local loopback interface in IPv6), there's no need to set MS_WORKER_IP because MindSpore will use local loopback interface to communicate by default. | | `MS_ENABLE_RECOVERY` | Turn on disaster recovery. | Integer | 1 for on, 0 for off. The default is 0. |-| | `MS_ENABLE_LCCL` | Whether to use LCCL as communication library. | Integer | 1 for yes, other values for no. The default is no. | The LCCL communication library currently only supports single-machine multi-card scenario and must be executed when the graph compilation level is O0. | | `MS_DISABLE_LCCL_KERNELS_LIST` | Specifies the blacklist of LCCL operators that are not enabled. | String | Valid operator names, with multiple operators separated by commas (','). | Takes effect only when using the LCCL communication library.
Currently supported LCCL operators:
Notes:
- Operator names are case-sensitive
- There should be no spaces when multiple operators are separated by commas | | `MS_TOPO_TIMEOUT` | Cluster networking phase timeout time in seconds. | Integer | The default is 30 minutes. | This value represents that all nodes can register to the scheduler within this time window. If the time window is exceeded, registration will fail and if the number of nodes does not meet the requirements, cluster networking will fail. We suggest users to configure this environment variable when the cluster is in large-scale. | | `MS_NODE_TIMEOUT` | Node heartbeat timeout in seconds. | Integer | The default is 30 seconds. | The heartbeat timeout threshold between the Scheduler and Worker. If the Scheduler fails to receive heartbeat messages from the Worker within this period, it will trigger the cluster's abnormal exit process. | | `MS_RECEIVE_MSG_TIMEOUT` | Node timeout for receiving messages in seconds. | Integer | The default is 15 seconds. | This value represents the timeout window for the node to receive messages from the other end. If there is no message response within the time window, an empty message is returned. | | `MS_RETRY_INTERVAL_LOWER` | Lower limit of message retry interval between nodes in seconds. | Integer | The default is 3 seconds. | This value represents the lower limit of the time interval between each retry of sending a message by a node. MindSpore randomly selects the value between `MS_RETRY_INTERVAL_LOWER` and `MS_RETRY_INTERVAL_UPPER` as the interval time. This variable is used to control the message concurrency of the Scheduler. | | `MS_RETRY_INTERVAL_UPPER` | Upper limit of message retry interval between nodes in seconds | Integer | The default is 5 seconds. | This value represents the upper limit of the time interval between each retry of sending a message by a node. MindSpore randomly selects the value between `MS_RETRY_INTERVAL_LOWER` and `MS_RETRY_INTERVAL_UPPER` as the interval time. This variable is used to control the message concurrency of the Scheduler. | | `MS_DISABLE_HEARTBEAT` | Disable the heartbeat feature between nodes in the cluster. | Integer | Heartbeat feature is enabled by default. | If set to 1, the heartbeat between cluster nodes will be disabled. In this scenario, Scheduler will not detect Workers' exception and will not control the cluster to exit. This variable can reduce the message concurrency of the Scheduler.
It is recommended to set this environment variable when using `gdb attach` command for debugging. | | `MS_HEARTBEAT_RETRY_TIMEOUT` | Node heartbeat retry timeout in seconds. | Integer | The default is 20 seconds. | The timeout threshold for reconnection when the Worker does not receive a heartbeat response from the Scheduler. If the Worker fails to re-establish a connection with the Scheduler within this time period, it will trigger its own abnormal exit. Ensure that the value of `MS_NODE_TIMEOUT` is greater than `MS_HEARTBEAT_RETRY_TIMEOUT`, so that the Scheduler can re-receive the heartbeat message from the Worker within the `MS_NODE_TIMEOUT` window, ensuring the stable operation of the cluster. Notes: this environment variable configuration does not take effect when the heartbeat function is disabled or the cluster is not managed by the Scheduler process. | > The environment variables `MS_SCHED_HOST`, `MS_SCHED_PORT`, and `MS_WORKER_NUM` need to be consistent in their contents, or else the networking will fail due to the inconsistency in the configurations of the processes. ## Operation Practice Dynamic cluster startup scripts are consistent across hardware platforms. The following is an example of how to write a startup script for Ascend: > You can download the full sample code here: [startup_method](https://atomgit.com/mindspore/docs/tree/r2.8.0/docs/sample_code/startup_method). The directory structure is as follows: ```text └─ sample_code ├─ startup_method ├── net.py ├── run_dynamic_cluster.sh ├── run_dynamic_cluster_1.sh ├── run_dynamic_cluster_2.sh ... ``` `net.py` is defining the network structure and training process, and `run_dynamic_cluster.sh`, `run_dynamic_cluster_1.sh` and `run_dynamic_cluster_2.sh` are executing scripts. ### 1. Preparing Python Training Scripts Here, as an example of data parallel, a recognition network is trained for the MNIST dataset. First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize HCCL, NCCL or MCCL communication via `init()`. If you don't set `device_target` here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package. ```python import mindspore as ms from mindspore.communication import init ms.set_context(mode=ms.GRAPH_MODE) ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) init() ms.set_seed(1) ``` Then build the following network: ```python from mindspore import nn class Network(nn.Cell): def __init__(self): super().__init__() self.flatten = nn.Flatten() self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") self.relu = nn.ReLU() def construct(self, x): x = self.flatten(x) logits = self.relu(self.fc(x)) return logits net = Network() ``` Finally, the dataset is processed and the training process is defined: ```python import os from mindspore import nn import mindspore as ms import mindspore.dataset as ds from mindspore.communication import get_rank, get_group_size def create_dataset(batch_size): dataset_path = os.getenv("DATA_PATH") rank_id = get_rank() rank_size = get_group_size() dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) image_transforms = [ ds.vision.Rescale(1.0 / 255.0, 0), ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), ds.vision.HWC2CHW() ] label_transform = ds.transforms.TypeCast(ms.int32) dataset = dataset.map(image_transforms, 'image') dataset = dataset.map(label_transform, 'label') dataset = dataset.batch(batch_size) return dataset data_set = create_dataset(32) loss_fn = nn.CrossEntropyLoss() optimizer = nn.SGD(net.trainable_params(), 1e-2) def forward_fn(data, label): logits = net(data) loss = loss_fn(logits, label) return loss, logits grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) grad_reducer = nn.DistributedGradReducer(optimizer.parameters) for epoch in range(10): i = 0 for data, label in data_set: (loss, _), grads = grad_fn(data, label) grads = grad_reducer(grads) optimizer(grads) if i % 10 == 0: print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) i += 1 ``` ### 2. Preparing the Startup Script #### Single-Machine Multi-Card The content of the single-machine multi-card startup script [run_dynamic_cluster.sh](https://atomgit.com/mindspore/docs/blob/r2.8.0/docs/sample_code/startup_method/run_dynamic_cluster.sh) is as follows. Taking the single-machine 8-card as an example: ```bash EXEC_PATH=$(pwd) if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip fi unzip MNIST_Data.zip fi export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ rm -rf device mkdir device echo "start training" # Start 8 Worker training processes in a loop for((i=0;i<8;i++)); do export MS_WORKER_NUM=8 # Set the number of Worker processes in the cluster to 8 export MS_SCHED_HOST=127.0.0.1 # Set the Scheduler IP address to the local loop address export MS_SCHED_PORT=8118 # Set Scheduler port export MS_ROLE=MS_WORKER # Set the started process to the MS_WORKER role export MS_NODE_ID=$i # Set process id, optional python ./net.py > device/worker_$i.log 2>&1 & # Start training script done # Start 1 Scheduler process export MS_WORKER_NUM=8 # Set the number of Worker processes in the cluster to 8 export MS_SCHED_HOST=127.0.0.1 # Set the Scheduler IP address to the local loop address export MS_SCHED_PORT=8118 # Set Scheduler port export MS_ROLE=MS_SCHED # Set the started process to the MS_SCHED role python ./net.py > device/scheduler.log 2>&1 & # Start training script ``` > The training scripts for the Scheduler and Worker processes are identical in content and startup method, because the internal processes of the two roles are handled differently in MindSpore. Users simply pull up the process in the normal training manner, without modifying the Python code by role. This is one of the reasons why dynamic cluster startup scripts can be consistent across multiple hardware platforms. A single-machine 8-card distributed training can be executed by executing the following command: ```bash bash run_dynamic_cluster.sh ``` The script will run in the background, the log file will be saved to the device directory and the result will be saved in the worker_*.log and is as follows: ```text epoch: 0, step: 0, loss is 2.3499548 epoch: 0, step: 10, loss is 1.6682479 epoch: 0, step: 20, loss is 1.4237018 epoch: 0, step: 30, loss is 1.0437132 ... ``` #### Multi-Machine Multi-Card The startup script needs to be split in the multi-machine training scenario. The following is an example of performing 2-machine 8-card training, with each machine executing the startup 4 Worker: The script [run_dynamic_cluster_1.sh](https://atomgit.com/mindspore/docs/blob/r2.8.0/docs/sample_code/startup_method/run_dynamic_cluster_1.sh) starts 1 `Scheduler` process and 4 `Worker` processes on node 1: ```bash EXEC_PATH=$(pwd) if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip fi unzip MNIST_Data.zip fi export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ rm -rf device mkdir device echo "start training" # Start Worker1 to Worker4, 4 Worker training processes in a loop for((i=0;i<4;i++)); do export MS_WORKER_NUM=8 # Set the total number of Worker processes in the cluster to 8 (including other node processes) export MS_SCHED_HOST= # Set the Scheduler IP address to the Node 1 IP address export MS_SCHED_PORT=8118 # Set the Scheduler port export MS_ROLE=MS_WORKER # Set the startup process to the MS_WORKER role export MS_NODE_ID=$i # Set process id, optional python ./net.py > device/worker_$i.log 2>&1 & # Start training script done # Start 1 Scheduler process on node 1 export MS_WORKER_NUM=8 # Set the total number of Worker processes in the cluster to 8 (including other node processes) export MS_SCHED_HOST= # Set the Scheduler IP address to the Node 1 IP address export MS_SCHED_PORT=8118 # Set the Scheduler port export MS_ROLE=MS_SCHED # Set the startup process to the MS_SCHED role python ./net.py > device/scheduler.log 2>&1 & # Start training script ``` The script [run_dynamic_cluster_2.sh](https://atomgit.com/mindspore/docs/blob/r2.8.0/docs/sample_code/startup_method/run_dynamic_cluster_2.sh) starts `Worker5` to `Worker8` on node 2 (without executing Scheduler): ```bash EXEC_PATH=$(pwd) if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip fi unzip MNIST_Data.zip fi export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ rm -rf device mkdir device echo "start training" # Start Worker5 to Worker8, 4 Worker training processes in a loop for((i=4;i<8;i++)); do export MS_WORKER_NUM=8 # Set the total number of Worker processes in the cluster to 8 (including other node processes) export MS_SCHED_HOST= # Set the Scheduler IP address to the Node 1 IP address export MS_SCHED_PORT=8118 # Set the Scheduler port export MS_ROLE=MS_WORKER # Set the startup process to the MS_WORKER role export MS_NODE_ID=$i # Set process id, optional python ./net.py > device/worker_$i.log 2>&1 & # Start training script done ``` > In a multi-machine task, you need to set a different hostname for each host node, otherwise you will get an error reporting `device id` out of bounds. Refer to [FAQ](https://www.mindspore.cn/docs/en/r2.8.0/faq/distributed_parallel.html#q-when-starting-distributed-framework-using-dynamic-cluster-or-msrun-in-multi-machine-scenario,-an-error-is-reported-that-device-id-is-out-of-range-how-can-we-solve-it?). > > In a multi-machine task, `MS_WORKER_NUM` should be the total number of Worker nodes in the cluster. > > The inter-node network needs to be connected. You can use the `telnet ` command to test if this node is connected to a Scheduler node that has been started. Execute on Node 1: ```bash bash run_dynamic_cluster_1.sh ``` Execute on Node 2: ```bash bash run_dynamic_cluster_2.sh ``` That is, you can perform 2-machine 8-card distributed training tasks. ## Disaster Recovery Dynamic cluster supports disaster recovery under data parallel. In a parallel training scenario with multi-card data, if a process quits abnormally, the training can be continued after pulling up the corresponding script of the corresponding process again, and the accuracy convergence will not be affected. ## Security Authentication Dynamic cluster also supports the **Secure Encrypted Channel** feature, which supports the `TLS/SSL` protocol to satisfy users security needs. By default, the secure encrypted channel is turned off. If you need to turn it on, call init() only after configuring the secure encrypted channel correctly via `set_ps_context`, otherwise the initialization of the networking will fail. If you want to use the secure Encrypted channel, please configure it: `set_ps_context(config_file_path="/path/to/config_file.json", enable_ssl=True, client_password="xxxxxx", server_password="xxxxxx")` The `config.json` configuration file specified by `config_file_path` needs to add the following fields: ```json { "server_cert_path": "server.p12", "crl_path": "", "client_cert_path": "client.p12", "ca_cert_path": "ca.crt", "cipher_list": "ECDHE-R SA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-DSS-AES256-GCM-SHA384:DHE-PSK-AES128-GCM-SHA256:DHE-PSK-AES256-GCM-SHA384:DHE-PSK-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-PSK-CHACHA20-POLY1305:DHE-RSA-AES128-CCM:DHE-RSA-AES256-CCM:DHE-RSA-CHACHA20-POLY1305:DHE-PSK-AES128-CCM:DHE-PSK-AES256-CCM:ECDHE-ECDSA-AES128-CCM:ECDHE-ECDSA-AES256-CCM:ECDHE-ECDSA-CHACHA20-POLY1305", "cert_expire_warning_time_in_day": 90 } ``` - `server_cert_path`: The path to the p12 file (SSL-specific certificate file) that contains the cipher text of the certificate and the secret key on the server side. - `crl_path`: The file path to the revocation list (used to distinguish invalid untrusted certificates from valid trusted certificates). - `client_cert_path`: The client contains the path to the p12 file (SSL-specific certificate file) with the cipher text of the certificate and secret key. - `ca_cert_path`: The path to root certificate - `cipher_list`: Cipher suite (list of supported SSL encrypted types) - `cert_expire_warning_time_in_day`: The warning time of certificate expiration. The secret key in the p12 file is stored in cipher text, and the password needs to be passed in when starting. Please refer to the Python API [mindspore.set_ps_context](https://www.mindspore.cn/docs/en/r2.8.0/api_python/mindspore/mindspore.set_ps_context.html#mindspore.set_ps_context) for the `client_password` and `server_password` fields.