# Dynamic Cluster Startup

[![View Source on AtomGit](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.8.0/resource/_static/logo_source_en.svg)](https://atomgit.com/mindspore/docs/blob/r2.8.0/tutorials/source_en/parallel/dynamic_cluster.md)

## Overview

For reliability requirements during training, MindSpore provides **dynamic cluster** features that enable users to start Ascend/GPU/CPU distributed training tasks without relying on any third-party library (OpenMPI) and without any modification to the training script. We recommend users to use this startup method in preference.

The MindSpore **Dynamic Cluster** feature replaces the OpenMPI capability by **reusing the Parameter Server mode training architecture**.

The **Dynamic Cluster** feature starts multiple MindSpore training processes as `Workers`, and starts an additional `Scheduler` for cluster and disaster recovery, thus, distributed training can be achieved without the need for OpenMPI's message passing mechanism. The user only needs to make a few changes to the startup script to perform distributed training.

> Dynamic cluster supports Ascend, GPU and CPU, so the dynamic cluster startup script can be quickly migrated between multiple hardware platforms without additional modifications.

The relevant environment variables:

| Environment Variables | Functions | Type&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Value | Description |
|:----------------------|:---------|:------------------------|:------|:------------|
| `MS_ROLE` | Specifies the role of this process. | String | <ul><li>MS_SCHED: represents the Scheduler process. A training task starts only one Scheduler, which is responsible for networking, disaster recovery, etc., and **does not execute training code**.</li><li>MS_WORKER: Represents the Worker process, which generally sets up the distributed training process for this role.</li><li>MS_PSERVER: represents the Parameter Server process. Only in Parameter Server mode this role is effective.</li></ul> | The Worker and Parameter Server processes register with the Scheduler process to complete the networking. |
| `MS_SCHED_HOST` | Specifies the IP address of the Scheduler. | String | Legal IP address. | IPv6 addresses are only supported on `Ascend` platform in current version. |
| `MS_SCHED_PORT` | Specifies the Scheduler binding port number. | Integer | Port number in the range of 1024 to 65535. |-|
| `MS_NODE_ID` | Specifies the ID of this process, unique within the cluster. | String | Represents the unique ID of this process, which is automatically generated by MindSpore by default. | MS_NODE_ID needs to be set in the following cases. Normally it does not need to be set and is automatically generated by MindSpore:<ul><li>Enable Disaster Recovery Scenario: Disaster recovery requires obtaining the current process ID and thus re-registering with the Scheduler.</li><li>Enable GLOG log redirection scenario: In order to ensure that the logs of each training process are saved independently, it is necessary to set the process ID, which is used as the log saving path suffix.</li><li>Specify process rank id scenario: users can specify the rank id of this process by setting MS_NODE_ID to some integer.</li></ul> |
| `MS_WORKER_NUM` | Specifies the number of processes with the role MS_WORKER. | Integer | Integer greater than 0. | The number of Worker processes started by the user should be equal to the value of this environment variable. If it is less than this value, the networking fails. If it is greater than this value, the Scheduler process will complete the networking according to the order of Worker registration, and the redundant Worker processes will fail to start. |
| `MS_SERVER_NUM` | Specifies the number of processes with the role MS_PSERVER. | Integer | Integer greater than 0. | Only set in Parameter Server training mode. |
| `MS_WORKER_IP` | Specifies the IP address used for communication and networking between processes. | String | Legitimate IP address. | This environment variable must be set when using IPv6. But when MS_SCHED_HOST is set to **::1** (Representing local loopback interface in IPv6), there's no need to set MS_WORKER_IP because MindSpore will use local loopback interface to communicate by default. |
| `MS_ENABLE_RECOVERY` | Turn on disaster recovery. | Integer | 1 for on, 0 for off. The default is 0. |-|
| `MS_ENABLE_LCCL` | Whether to use LCCL as communication library. | Integer | 1 for yes, other values for no. The default is no. | The LCCL communication library currently only supports single-machine multi-card scenario and must be executed when the graph compilation level is O0. |
| `MS_DISABLE_LCCL_KERNELS_LIST` | Specifies the blacklist of LCCL operators that are not enabled. | String | Valid operator names, with multiple operators separated by commas (','). | Takes effect only when using the LCCL communication library.<br>Currently supported LCCL operators:<br><ul><li>AllReduce</li><li>AllGather</li><li>AllGatherMatmul</li><li>Broadcast</li><li>Barrier</li><li>MatMulAllReduce</li><li>MatmulReduceScatter</li><li>ReduceScatter</li></ul>Notes:<br>    - Operator names are case-sensitive<br>    - There should be no spaces when multiple operators are separated by commas |
| `MS_TOPO_TIMEOUT` | Cluster networking phase timeout time in seconds. | Integer | The default is 30 minutes. | This value represents that all nodes can register to the scheduler within this time window. If the time window is exceeded, registration will fail and if the number of nodes does not meet the requirements, cluster networking will fail. We suggest users to configure this environment variable when the cluster is in large-scale. |
| `MS_NODE_TIMEOUT` | Node heartbeat timeout in seconds. | Integer | The default is 30 seconds. | The heartbeat timeout threshold between the Scheduler and Worker. If the Scheduler fails to receive heartbeat messages from the Worker within this period, it will trigger the cluster's abnormal exit process. |
| `MS_RECEIVE_MSG_TIMEOUT` | Node timeout for receiving messages in seconds. | Integer | The default is 15 seconds. | This value represents the timeout window for the node to receive messages from the other end. If there is no message response within the time window, an empty message is returned. |
| `MS_RETRY_INTERVAL_LOWER` | Lower limit of message retry interval between nodes in seconds. | Integer | The default is 3 seconds. | This value represents the lower limit of the time interval between each retry of sending a message by a node. MindSpore randomly selects the value between `MS_RETRY_INTERVAL_LOWER` and `MS_RETRY_INTERVAL_UPPER` as the interval time. This variable is used to control the message concurrency of the Scheduler. |
| `MS_RETRY_INTERVAL_UPPER` | Upper limit of message retry interval between nodes in seconds | Integer | The default is 5 seconds. | This value represents the upper limit of the time interval between each retry of sending a message by a node. MindSpore randomly selects the value between `MS_RETRY_INTERVAL_LOWER` and `MS_RETRY_INTERVAL_UPPER` as the interval time. This variable is used to control the message concurrency of the Scheduler. |
| `MS_DISABLE_HEARTBEAT` | Disable the heartbeat feature between nodes in the cluster. | Integer | Heartbeat feature is enabled by default. | If set to 1, the heartbeat between cluster nodes will be disabled. In this scenario, Scheduler will not detect Workers' exception and will not control the cluster to exit. This variable can reduce the message concurrency of the Scheduler.<br>It is recommended to set this environment variable when using `gdb attach` command for debugging. |
| `MS_HEARTBEAT_RETRY_TIMEOUT` | Node heartbeat retry timeout in seconds. | Integer | The default is 20 seconds. | The timeout threshold for reconnection when the Worker does not receive a heartbeat response from the Scheduler. If the Worker fails to re-establish a connection with the Scheduler within this time period, it will trigger its own abnormal exit. Ensure that the value of `MS_NODE_TIMEOUT` is greater than `MS_HEARTBEAT_RETRY_TIMEOUT`, so that the Scheduler can re-receive the heartbeat message from the Worker within the `MS_NODE_TIMEOUT` window, ensuring the stable operation of the cluster. Notes: this environment variable configuration does not take effect when the heartbeat function is disabled or the cluster is not managed by the Scheduler process. |

> The environment variables `MS_SCHED_HOST`, `MS_SCHED_PORT`, and `MS_WORKER_NUM` need to be consistent in their contents, or else the networking will fail due to the inconsistency in the configurations of the processes.

## Operation Practice

Dynamic cluster startup scripts are consistent across hardware platforms. The following is an example of how to write a startup script for Ascend:

> You can download the full sample code here: [startup_method](https://atomgit.com/mindspore/docs/tree/r2.8.0/docs/sample_code/startup_method).

The directory structure is as follows:

```text
└─ sample_code
    ├─ startup_method
       ├── net.py
       ├── run_dynamic_cluster.sh
       ├── run_dynamic_cluster_1.sh
       ├── run_dynamic_cluster_2.sh
    ...
```

`net.py` is defining the network structure and training process, and `run_dynamic_cluster.sh`, `run_dynamic_cluster_1.sh` and `run_dynamic_cluster_2.sh` are executing scripts.

### 1. Preparing Python Training Scripts

Here, as an example of data parallel, a recognition network is trained for the MNIST dataset.

First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize HCCL, NCCL or MCCL communication via `init()`. If you don't set `device_target` here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package.

```python
import mindspore as ms
from mindspore.communication import init

ms.set_context(mode=ms.GRAPH_MODE)
ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)
init()
ms.set_seed(1)
```

Then build the following network:

```python
from mindspore import nn

class Network(nn.Cell):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros")
        self.relu = nn.ReLU()

    def construct(self, x):
        x = self.flatten(x)
        logits = self.relu(self.fc(x))
        return logits
net = Network()
```

Finally, the dataset is processed and the training process is defined:

```python
import os
from mindspore import nn
import mindspore as ms
import mindspore.dataset as ds
from mindspore.communication import get_rank, get_group_size

def create_dataset(batch_size):
    dataset_path = os.getenv("DATA_PATH")
    rank_id = get_rank()
    rank_size = get_group_size()
    dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id)
    image_transforms = [
        ds.vision.Rescale(1.0 / 255.0, 0),
        ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)),
        ds.vision.HWC2CHW()
    ]
    label_transform = ds.transforms.TypeCast(ms.int32)
    dataset = dataset.map(image_transforms, 'image')
    dataset = dataset.map(label_transform, 'label')
    dataset = dataset.batch(batch_size)
    return dataset

data_set = create_dataset(32)
loss_fn = nn.CrossEntropyLoss()
optimizer = nn.SGD(net.trainable_params(), 1e-2)

def forward_fn(data, label):
    logits = net(data)
    loss = loss_fn(logits, label)
    return loss, logits

grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True)
grad_reducer = nn.DistributedGradReducer(optimizer.parameters)

for epoch in range(10):
    i = 0
    for data, label in data_set:
        (loss, _), grads = grad_fn(data, label)
        grads = grad_reducer(grads)
        optimizer(grads)
        if i % 10 == 0:
            print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss))
        i += 1
```

### 2. Preparing the Startup Script

#### Single-Machine Multi-Card

The content of the single-machine multi-card startup script [run_dynamic_cluster.sh](https://atomgit.com/mindspore/docs/blob/r2.8.0/docs/sample_code/startup_method/run_dynamic_cluster.sh) is as follows. Taking the single-machine 8-card as an example:

```bash
EXEC_PATH=$(pwd)
if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then
    if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then
        wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip
    fi
    unzip MNIST_Data.zip
fi
export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/

rm -rf device
mkdir device
echo "start training"

# Start 8 Worker training processes in a loop
for((i=0;i<8;i++));
do
    export MS_WORKER_NUM=8          # Set the number of Worker processes in the cluster to 8
    export MS_SCHED_HOST=127.0.0.1  # Set the Scheduler IP address to the local loop address
    export MS_SCHED_PORT=8118       # Set Scheduler port
    export MS_ROLE=MS_WORKER        # Set the started process to the MS_WORKER role
    export MS_NODE_ID=$i                      # Set process id, optional
    python ./net.py > device/worker_$i.log 2>&1 &                            # Start training script
done

# Start 1 Scheduler process
export MS_WORKER_NUM=8              # Set the number of Worker processes in the cluster to 8
export MS_SCHED_HOST=127.0.0.1      # Set the Scheduler IP address to the local loop address
export MS_SCHED_PORT=8118           # Set Scheduler port
export MS_ROLE=MS_SCHED             # Set the started process to the MS_SCHED role
python ./net.py > device/scheduler.log 2>&1 &     # Start training script
```

> The training scripts for the Scheduler and Worker processes are identical in content and startup method, because the internal processes of the two roles are handled differently in MindSpore. Users simply pull up the process in the normal training manner, without modifying the Python code by role. This is one of the reasons why dynamic cluster startup scripts can be consistent across multiple hardware platforms.

A single-machine 8-card distributed training can be executed by executing the following command:

```bash
bash run_dynamic_cluster.sh
```

The script will run in the background, the log file will be saved to the device directory and the result will be saved in the worker_*.log and is as follows:

```text
epoch: 0, step: 0, loss is 2.3499548
epoch: 0, step: 10, loss is 1.6682479
epoch: 0, step: 20, loss is 1.4237018
epoch: 0, step: 30, loss is 1.0437132
...
```

#### Multi-Machine Multi-Card

The startup script needs to be split in the multi-machine training scenario. The following is an example of performing 2-machine 8-card training, with each machine executing the startup 4 Worker:

The script [run_dynamic_cluster_1.sh](https://atomgit.com/mindspore/docs/blob/r2.8.0/docs/sample_code/startup_method/run_dynamic_cluster_1.sh) starts 1 `Scheduler` process and 4 `Worker` processes on node 1:

```bash
EXEC_PATH=$(pwd)
if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then
    if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then
        wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip
    fi
    unzip MNIST_Data.zip
fi
export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/

rm -rf device
mkdir device
echo "start training"

# Start Worker1 to Worker4, 4 Worker training processes in a loop
for((i=0;i<4;i++));
do
    export MS_WORKER_NUM=8                    # Set the total number of Worker processes in the cluster to 8 (including other node processes)
    export MS_SCHED_HOST=<node_1 ip address>  # Set the Scheduler IP address to the Node 1 IP address
    export MS_SCHED_PORT=8118                 # Set the Scheduler port
    export MS_ROLE=MS_WORKER                  # Set the startup process to the MS_WORKER role
    export MS_NODE_ID=$i                      # Set process id, optional
    python ./net.py > device/worker_$i.log 2>&1 &                                    # Start training script
done

# Start 1 Scheduler process on node 1
export MS_WORKER_NUM=8                        # Set the total number of Worker processes in the cluster to 8 (including other node processes)
export MS_SCHED_HOST=<node_1 ip address>      # Set the Scheduler IP address to the Node 1 IP address
export MS_SCHED_PORT=8118                     # Set the Scheduler port
export MS_ROLE=MS_SCHED                       # Set the startup process to the MS_SCHED role
python ./net.py > device/scheduler.log 2>&1 &    # Start training script
```

The script [run_dynamic_cluster_2.sh](https://atomgit.com/mindspore/docs/blob/r2.8.0/docs/sample_code/startup_method/run_dynamic_cluster_2.sh) starts `Worker5` to `Worker8` on node 2 (without executing Scheduler):

```bash
EXEC_PATH=$(pwd)
if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then
    if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then
        wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip
    fi
    unzip MNIST_Data.zip
fi
export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/

rm -rf device
mkdir device
echo "start training"

# Start Worker5 to Worker8, 4 Worker training processes in a loop
for((i=4;i<8;i++));
do
    export MS_WORKER_NUM=8                    # Set the total number of Worker processes in the cluster to 8 (including other node processes)
    export MS_SCHED_HOST=<node_1 ip address>  # Set the Scheduler IP address to the Node 1 IP address
    export MS_SCHED_PORT=8118                 # Set the Scheduler port
    export MS_ROLE=MS_WORKER                  # Set the startup process to the MS_WORKER role
    export MS_NODE_ID=$i                      # Set process id, optional
    python ./net.py > device/worker_$i.log 2>&1 &                             # Start training script
done
```

> In a multi-machine task, you need to set a different hostname for each host node, otherwise you will get an error reporting `device id` out of bounds. Refer to [FAQ](https://www.mindspore.cn/docs/en/r2.8.0/faq/distributed_parallel.html#q-when-starting-distributed-framework-using-dynamic-cluster-or-msrun-in-multi-machine-scenario,-an-error-is-reported-that-device-id-is-out-of-range-how-can-we-solve-it?).
>
> In a multi-machine task, `MS_WORKER_NUM` should be the total number of Worker nodes in the cluster.
>
> The inter-node network needs to be connected. You can use the `telnet <scheduler ip> <scheduler port>` command to test if this node is connected to a Scheduler node that has been started.

Execute on Node 1:

```bash
bash run_dynamic_cluster_1.sh
```

Execute on Node 2:

```bash
bash run_dynamic_cluster_2.sh
```

That is, you can perform 2-machine 8-card distributed training tasks.

## Disaster Recovery

Dynamic cluster supports disaster recovery under data parallel. In a parallel training scenario with multi-card data, if a process quits abnormally, the training can be continued after pulling up the corresponding script of the corresponding process again, and the accuracy convergence will not be affected.

## Security Authentication

Dynamic cluster also supports the **Secure Encrypted Channel** feature, which supports the `TLS/SSL` protocol to satisfy users security needs. By default, the secure encrypted channel is turned off. If you need to turn it on, call init() only after configuring the secure encrypted channel correctly via `set_ps_context`, otherwise the initialization of the networking will fail. If you want to use the secure Encrypted channel, please configure it:

`set_ps_context(config_file_path="/path/to/config_file.json", enable_ssl=True, client_password="xxxxxx", server_password="xxxxxx")`

The `config.json` configuration file specified by `config_file_path` needs to add the following fields:

```json
{
  "server_cert_path": "server.p12",
  "crl_path": "",
  "client_cert_path": "client.p12",
  "ca_cert_path": "ca.crt",
  "cipher_list": "ECDHE-R SA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-DSS-AES256-GCM-SHA384:DHE-PSK-AES128-GCM-SHA256:DHE-PSK-AES256-GCM-SHA384:DHE-PSK-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-PSK-CHACHA20-POLY1305:DHE-RSA-AES128-CCM:DHE-RSA-AES256-CCM:DHE-RSA-CHACHA20-POLY1305:DHE-PSK-AES128-CCM:DHE-PSK-AES256-CCM:ECDHE-ECDSA-AES128-CCM:ECDHE-ECDSA-AES256-CCM:ECDHE-ECDSA-CHACHA20-POLY1305",
  "cert_expire_warning_time_in_day": 90
}
```

- `server_cert_path`: The path to the p12 file (SSL-specific certificate file) that contains the cipher text of the certificate and the secret key on the server side.
- `crl_path`: The file path to the revocation list (used to distinguish invalid untrusted certificates from valid trusted certificates).
- `client_cert_path`: The client contains the path to the p12 file (SSL-specific certificate file) with the cipher text of the certificate and secret key.
- `ca_cert_path`: The path to root certificate
- `cipher_list`: Cipher suite (list of supported SSL encrypted types)
- `cert_expire_warning_time_in_day`: The warning time of certificate expiration.

The secret key in the p12 file is stored in cipher text, and the password needs to be passed in when starting. Please refer to the Python API [mindspore.set_ps_context](https://www.mindspore.cn/docs/en/r2.8.0/api_python/mindspore/mindspore.set_ps_context.html#mindspore.set_ps_context) for the `client_password` and `server_password` fields.