Other Features

https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0/resource/_static/logo_source_en.png

Sharding Propagation

In operator-level parallelism, the user is required to configure a slicing strategy for each operator in the forward network (if not configured, the data-parallel policy is used by default). The slicing strategy propagation feature can configure only a few operators to automatically generate a feasible sharding strategy for operators without a sharding strategy, and achieve the effect of minimizing communication overhead.

Parameter Server Training

Parameter Server is a widely used architecture in distributed training, which has better flexibility, scalability, and node disaster tolerance than the AllReduce training method of data parallel synchronization. The parameter server supports both synchronous SGD (Stochastic Gradient Descent) and asynchronous SGD training algorithms. In terms of scalability, the calculation of the model and the update of the model are deployed in the worker and server processes respectively, so that the resources of the worker and server can be scaled horizontally independently (adding or removing the worker and server resources). In addition, in the environment of large-scale data centers, computing equipment, networks and storage often have various failures that lead to some node abnormalities, and under the architecture of parameter servers, such failures can be easily handled without affecting the tasks in training.

Communication Operator Fusion

In the distributed training scenario, cross-device or even cross-node data transmission is a bottleneck that restricts scalability and computing power utilization. Communication operator fusion is an important method to improve the utilization of network resources and accelerate the efficiency of data transmission, which packages the communication operators of the same source node and the destination node and executes them at the same time to avoid the additional overhead caused by multiple single operator execution.

Communication Subgraph Extraction and Reuse

In distributed training, as the model size increases, the number of communication operators required also rises significantly. On one hand, it will boost the communication time in model compilation; on the other hand, it will consume a large amount of streams, and when the required number of streams exceeds the hardware limit, the model cannot scale even more, thus becoming a bottleneck in the development of large models. By classifying communication operators, extracting communication subgraphs for operators of the same class and reusing these extracted subgraphs, we can reduce the number of communication operators in the graph compilation. It will decrease communication time and require less streams so that the model size can further expand.

Dataset Slicing

When doing distributed training, you need to import the training dataset to each device. There are two common ways to import: 1) Import in parallel with the data, that is, the data is split into match dimensions, and each device is imported as part; 2) Import full amount of data per device. In addition, when some dimensions of the data are particularly large (such as the H/W dimension of the remote sensing picture may be particularly large), even if the sample size is small, the picture needs to be split, that is, the data is split in the H/W dimension, and each device reads a part of the picture. This special performance supports splitting datasets into specific dimensions to meet training requirements in the field of large-format image processing.

Functional Operator Splitting

In dynamic graph mode, you specify that a part of the network structure executes in graph mode and performs various parallel operations.

Performing Distributed Training on K8S Clusters

MindSpore Operator is a plugin that follows Kubernetes’ Operator pattern (based on the CRD-Custom Resource Definition feature) and implements distributed training on Kubernetes. MindSpore Operator defines Scheduler, PS, worker three roles in CRD, and users can easily use MindSpore on K8S for distributed training through simple YAML file configuration. The code repository of mindSpore Operator is described in: ms-operator.