# mpirun Startup

[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/tutorials/experts/source_en/parallel/mpirun.md)

## Overview

Open Message Passing Interface (OpenMPI) is an open-source, high-performance message-passing library for parallel computing and distributed-memory computing. It realizes parallelism by passing messages between processes and is widely used in scientific computing and machine learning tasks. Parallel training with OpenMPI is a general approach to accelerating training by utilizing the parallel computing resources of a cluster or a multi-core machine. In distributed training scenarios, OpenMPI synchronizes data on the host side and handles inter-process networking. Unlike rank table startup, running a script via the OpenMPI `mpirun` command on the Ascend hardware platform does not require the user to configure the `RANK_TABLE_FILE` environment variable.

> The `mpirun` startup supports Ascend and GPU, as well as both PyNative mode and Graph mode.

Related commands:

1. The `mpirun` startup command is as follows, where `DEVICE_NUM` is the number of GPUs on the machine:

    ```bash
    mpirun -n DEVICE_NUM python net.py
    ```

2. `mpirun` can also be configured with the following parameters. For more options, see the [mpirun documentation](https://www.open-mpi.org/doc/current/man1/mpirun.1.php):

    - `--output-filename log_output`: Save the log information of all processes to the `log_output` directory. The logs of different cards are saved under the `log_output/1/` path in files named by `rank_id`.
    - `--merge-stderr-to-stdout`: Merge stderr into the stdout output.
    - `--allow-run-as-root`: This parameter is required if the script is executed as the root user.
    - `-mca orte_abort_on_non_zero_status 0`: When a child process exits abnormally, OpenMPI aborts all child processes by default. Add this parameter if you do not want the other child processes to be aborted automatically.
    - `-bind-to none`: By default, OpenMPI limits the number of CPU cores available to each child process it spawns. Add this parameter if you do not want to limit the number of cores a process may use.

> OpenMPI sets a number of `OMPI_*` environment variables at startup; users should avoid manually modifying these environment variables in their scripts.

## Operation Practice

The `mpirun` startup script is consistent across the Ascend and GPU hardware platforms. The following demonstrates how to write a startup script using Ascend as an example:

> You can download the full sample code here: [startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method).

The directory structure is as follows:

```text
└─ sample_code
    ├─ startup_method
       ├── net.py
       ├── hostfile
       ├── run_mpirun_1.sh
       ├── run_mpirun_2.sh
    ...
```

`net.py` defines the network structure and the training process. `run_mpirun_1.sh` and `run_mpirun_2.sh` are the execution scripts, and `hostfile` is the file used to configure multi-machine multi-card training.

### 1. Installing OpenMPI

Download the OpenMPI-4.1.4 source package [openmpi-4.1.4.tar.gz](https://www.open-mpi.org/software/ompi/v4.1/) and install it by following the [OpenMPI official website tutorial](https://www.open-mpi.org/faq/?category=building#easy-build).
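If you build OpenMPI from source, the steps below are a minimal sketch of the usual configure/make/install flow described in the official tutorial. The install prefix `/usr/local/openmpi` and the use of `sudo` are illustrative assumptions; adapt them to your environment.

```bash
# Minimal sketch of building OpenMPI 4.1.4 from source.
# The prefix /usr/local/openmpi is an assumption; choose any writable path.
tar -xzf openmpi-4.1.4.tar.gz
cd openmpi-4.1.4
./configure --prefix=/usr/local/openmpi
make -j"$(nproc)" && sudo make install

# Make mpirun and the OpenMPI libraries visible to the shell and the training processes.
export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH

# Verify the installation.
mpirun --version
```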
### 2. Preparing the Python Training Script

Here, data parallelism is used as an example to train a recognition network on the MNIST dataset.

First specify the operation mode, the hardware device, and so on. Unlike a single-card script, a parallel script also needs to specify configuration items such as the parallel mode, and to initialize HCCL or NCCL communication via `init`. If `device_target` is not set here, it is automatically set to the backend hardware device corresponding to the installed MindSpore package.

```python
import mindspore as ms
from mindspore.communication import init

ms.set_context(mode=ms.GRAPH_MODE)
ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)
init()
ms.set_seed(1)
```

Then build the following network:

```python
from mindspore import nn

class Network(nn.Cell):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros")
        self.relu = nn.ReLU()

    def construct(self, x):
        x = self.flatten(x)
        logits = self.relu(self.fc(x))
        return logits

net = Network()
```

Finally, process the dataset and define the training process:

```python
import os
from mindspore import nn
import mindspore as ms
import mindspore.dataset as ds
from mindspore.communication import get_rank, get_group_size

def create_dataset(batch_size):
    dataset_path = os.getenv("DATA_PATH")
    rank_id = get_rank()
    rank_size = get_group_size()
    dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id)
    image_transforms = [
        ds.vision.Rescale(1.0 / 255.0, 0),
        ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)),
        ds.vision.HWC2CHW()
    ]
    label_transform = ds.transforms.TypeCast(ms.int32)
    dataset = dataset.map(image_transforms, 'image')
    dataset = dataset.map(label_transform, 'label')
    dataset = dataset.batch(batch_size)
    return dataset

data_set = create_dataset(32)
loss_fn = nn.CrossEntropyLoss()
optimizer = nn.SGD(net.trainable_params(), 1e-2)

def forward_fn(data, label):
    logits = net(data)
    loss = loss_fn(logits, label)
    return loss, logits

grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True)
grad_reducer = nn.DistributedGradReducer(optimizer.parameters)

for epoch in range(10):
    i = 0
    for data, label in data_set:
        (loss, _), grads = grad_fn(data, label)
        grads = grad_reducer(grads)
        optimizer(grads)
        if i % 10 == 0:
            print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss))
        i += 1
```

### 3. Preparing the Startup Script

#### Single-Machine Multi-Card

First download the [MNIST](http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip) dataset and extract it to the current folder.

Then execute the single-machine multi-card startup script, using single machine 8 cards as an example:

```bash
export DATA_PATH=./MNIST_Data/train/
mpirun -n 8 --output-filename log_output --merge-stderr-to-stdout python net.py
```

The log files are saved to the `log_output` directory, and the results are saved in `log_output/1/rank.*/stdout` as follows:

```text
epoch: 0, step: 0, loss is 2.3413472
epoch: 0, step: 10, loss is 1.6298866
epoch: 0, step: 20, loss is 1.3729795
epoch: 0, step: 30, loss is 1.2199347
epoch: 0, step: 40, loss is 0.85778403
epoch: 0, step: 50, loss is 1.0849445
epoch: 0, step: 60, loss is 0.9102987
epoch: 0, step: 70, loss is 0.7571399
epoch: 0, step: 80, loss is 0.7989929
epoch: 0, step: 90, loss is 1.0189024
epoch: 0, step: 100, loss is 0.6298542
...
```
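For reference, the optional `mpirun` parameters listed in the overview can be combined with the same single-machine launch. The sketch below is illustrative only; it assumes you run as root and do not want OpenMPI's default CPU-core binding, and the `grep` command simply inspects the per-rank log path documented above.

```bash
# Illustrative single-machine 8-card launch combining the optional flags from the overview.
# --allow-run-as-root is only needed when running as root; -bind-to none lifts the default
# core binding; the -mca option keeps the remaining ranks alive if one rank exits abnormally.
export DATA_PATH=./MNIST_Data/train/
mpirun -n 8 --allow-run-as-root -bind-to none \
    -mca orte_abort_on_non_zero_status 0 \
    --output-filename log_output --merge-stderr-to-stdout \
    python net.py

# Inspect the most recent loss values printed by rank 0.
grep "loss is" log_output/1/rank.0/stdout | tail -n 5
```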
#### Multi-Machine Multi-Card

Before running multi-machine multi-card training, complete the following configuration:

1. Ensure that the same versions of OpenMPI, NCCL, Python, and MindSpore are installed on every node.

2. Configure password-free SSH login between hosts. You can refer to the following steps:

    - Use the same user as the login user on each host (root is not recommended);
    - Execute `ssh-keygen -t rsa -P ""` to generate the key;
    - Execute `ssh-copy-id DEVICE-IP` with the IP of each machine that needs password-free login;
    - Execute `ssh DEVICE-IP`. If you can log in without entering a password, the configuration is successful;
    - Execute the above commands on all machines to ensure that every pair of hosts can reach each other.

After the configuration succeeds, you can start the multi-machine task with the `mpirun` command. There are currently two ways to start a multi-machine training task:

- By means of `mpirun -H`. The startup script is as follows:

    ```bash
    export DATA_PATH=./MNIST_Data/train/
    mpirun -n 16 -H DEVICE1_IP:8,DEVICE2_IP:8 --output-filename log_output --merge-stderr-to-stdout python net.py
    ```

    This indicates that 8 processes are started to run the program on each of the machines with IPs DEVICE1_IP and DEVICE2_IP. Execute on one of the nodes:

    ```bash
    bash run_mpirun_1.sh
    ```

- By means of `mpirun --hostfile`. For ease of debugging, this method is recommended for executing multi-machine multi-card scripts. First construct the hostfile as follows:

    ```text
    DEVICE1 slots=8
    192.168.0.1 slots=8
    ```

    The format of each line is `[hostname] slots=[slotnum]`, where the hostname can be either an IP address or a host name. The example above indicates that there are 8 cards on DEVICE1 and 8 cards on the machine with IP 192.168.0.1.

    The execution script for 2 machines and 16 cards is as follows. The variable `HOSTFILE`, which indicates the path to the hostfile, must be passed in:

    ```bash
    export DATA_PATH=./MNIST_Data/train/
    HOSTFILE=$1
    mpirun -n 16 --hostfile $HOSTFILE --output-filename log_output --merge-stderr-to-stdout python net.py
    ```

    Execute on one of the nodes:

    ```bash
    bash run_mpirun_2.sh ./hostfile
    ```

After execution, the log files are saved to the `log_output` directory and the results are saved in `log_output/1/rank.*/stdout`.
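Since step 1 above requires the same OpenMPI, Python, and MindSpore versions on every node, it can help to check them before launching. The loop below is only a sketch: it assumes password-free SSH is already configured as described above and that the `hostfile` from the example is in the current directory.

```bash
# Illustrative version check across the hosts listed in the hostfile.
# Assumes password-free SSH login and reads host names from the first column of 'hostfile'.
for host in $(awk '{print $1}' hostfile); do
    echo "==== $host ===="
    ssh "$host" "mpirun --version | head -n 1"
    ssh "$host" "python --version"
    ssh "$host" "python -c 'import mindspore; print(mindspore.__version__)'"
done
```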