Multi-dimensional Hybrid Parallel Case Based on Double Recursive Search
Overview
Multi-dimensional hybrid parallelism based on double recursive search means that the user only needs to configure optimization methods such as recomputation, optimizer parallelism, and pipeline parallelism. Based on these configurations, the double recursive strategy search algorithm automatically searches the operator-level strategies and generates the optimal parallel strategy.
Operation Practice
The following is a multi-dimensional hybrid parallel case based on double recursive search, using a single machine with 8 Ascend or GPU cards as an example:
Example Code Description
Download the complete example code: multiple_mix.
The directory structure is as follows:
└─ sample_code
    ├─ multiple_mix
       ├── sapp_mix_train.py
       └── run_sapp_mix_train.sh
    ...
sapp_mix_train.py is the script that defines the network structure and the training process. run_sapp_mix_train.sh is the execution script.
Configuring the Distributed Environment
Initialize HCCL or NCCL communication with init. The device_target is automatically set to the backend hardware device corresponding to the installed MindSpore package.
import os
import mindspore as ms
from mindspore.communication import init

# Save intermediate compilation (IR) graphs for debugging
os.environ['MS_DEV_SAVE_GRAPHS'] = '2'
# Run in graph mode and cap the device memory available to MindSpore
ms.set_context(mode=ms.GRAPH_MODE)
ms.runtime.set_memory(max_size="25GB")
# Initialize HCCL (Ascend) or NCCL (GPU) communication
init()
# Fix the random seed so that all ranks initialize identically
ms.set_seed(1)
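After init() returns, the rank id and the total number of devices can be queried through the communication interfaces. The following optional check is an illustration only and is not part of the original script:

from mindspore.communication import get_rank, get_group_size

# Optional check (not part of the original script): print this process's rank
# and the total number of devices participating in the job.
print(f"rank {get_rank()} of {get_group_size()} devices")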
Network Definition
The network definition adds recomputation and pipeline parallelism on top of the data parallelism and model parallelism provided by the double recursive strategy search algorithm:
from mindspore import nn
from mindspore.nn.utils import no_init_parameters

class Network(nn.Cell):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layer1 = nn.Dense(28*28, 512)
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Dense(512, 512)
        self.relu2 = nn.ReLU()
        self.layer3 = nn.Dense(512, 1)

    def construct(self, x):
        x = self.flatten(x)
        x = self.layer1(x)
        x = self.relu1(x)
        x = self.layer2(x)
        x = self.relu2(x)
        logits = self.layer3(x)
        return logits

# Delay parameter initialization until the parallel strategy is determined
with no_init_parameters():
    net = Network()
    optimizer = nn.SGD(net.trainable_params(), 1e-2)

# Configure recomputation of the relu operators
net.relu1.recompute()
net.relu2.recompute()
Loading the Dataset
The dataset is loaded in the same way as for the single-card model, with the following code:
import os
import mindspore as ms
import mindspore.dataset as ds
from mindspore.parallel.auto_parallel import AutoParallel
from mindspore import nn, train
from mindspore.communication import init

def create_dataset(batch_size):
    """create dataset"""
    dataset_path = "./MNIST_Data/train"
    dataset = ds.MnistDataset(dataset_path)
    image_transforms = [
        ds.vision.Rescale(1.0 / 255.0, 0),
        ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)),
        ds.vision.HWC2CHW()
    ]
    label_transform = ds.transforms.TypeCast(ms.int32)
    dataset = dataset.map(image_transforms, 'image')
    dataset = dataset.map(label_transform, 'label')
    dataset = dataset.batch(batch_size)
    return dataset

data_set = create_dataset(32)
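Before launching the distributed job, the data pipeline can be checked by iterating a single batch and inspecting its shape. This check is an illustration only and is not part of the original script:

# Optional check (not part of the original script): fetch one batch and print its shape.
for image, label in data_set.create_tuple_iterator():
    print(image.shape, label.shape)  # expected: (32, 1, 28, 28) (32,)
    break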
Training the Network
This part is consistent with the pipeline parallel training code. Two additional interfaces need to be called on top of the stand-alone training code: nn.WithLossCell, which wraps the network and the loss function, and ms.parallel.nn.Pipeline, which wraps the LossCell and configures the MicroBatch size. The run mode, device, number of cards, and so on are specified through the AutoParallel interface. Unlike single-card scripts, parallel scripts also need to set the parallel mode parallel_mode to the double recursive strategy search mode recursive_programming for automatic slicing into data parallelism and model parallelism. stages is the number of stages in pipeline parallelism, and optimizer parallelism is enabled by hsdp. The code is as follows:
import mindspore as ms
from mindspore import nn, train

loss_fn = nn.MAELoss()
loss_cb = train.LossMonitor()
# Configure the pipeline_stage number for each layer in pipeline parallelism
net_with_grads = ms.parallel.nn.Pipeline(nn.WithLossCell(net, loss_fn), 4,
                                         stage_config={"_backbone.layer1": 0,
                                                       "_backbone.relu1": 0,
                                                       "_backbone.layer2": 1,
                                                       "_backbone.relu2": 1,
                                                       "_backbone.layer3": 1})
# Wrap the network with AutoParallel and select the double recursive strategy search mode
net_with_grads_new = AutoParallel(net_with_grads, parallel_mode="recursive_programming")
# Enable optimizer parallelism
net_with_grads_new.hsdp()
net_with_grads_new.full_batch = True
# Two pipeline stages scheduled with the 1F1B (one-forward-one-backward) policy
net_with_grads_new.pipeline(stages=2, scheduler="1f1b")
model = ms.Model(net_with_grads_new, optimizer=optimizer)
model.train(10, data_set, callbacks=[loss_cb], dataset_sink_mode=True)
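For reference on the MicroBatch setting: with the batch size of 32 configured in create_dataset and the 4 micro-batches configured in Pipeline, each micro-batch scheduled through the pipeline stages contains 32 / 4 = 8 samples.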
Running a Stand-alone Eight-Card Script
Next, the corresponding script is invoked by a command, using the msrun startup method and the 8-card distributed training script as an example of distributed training:
bash run_sapp_mix_train.sh
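The contents of run_sapp_mix_train.sh are not reproduced here; a minimal sketch of such a launch script, assuming msrun's standard flags and an 8-card single-machine job, could look like the following (the shipped script and its log locations may differ):

#!/bin/bash
# Sketch only, assuming msrun's standard flags; the shipped run_sapp_mix_train.sh may differ.
msrun --worker_num=8 --local_worker_num=8 --master_port=8118 \
      --log_dir=log_output --join=True sapp_mix_train.py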
The results are saved in log_output/1/rank.*/stdout, and an example is as follows:
epoch: 1 step: 1875, loss is 11.6961808800697327
epoch: 2 step: 1875, loss is 10.2737872302532196
epoch: 3 step: 1875, loss is 8.87508840560913086
epoch: 4 step: 1875, loss is 8.1057268142700195