[{"data":1,"prerenderedAt":868},["ShallowReactive",2],{"content-query-6qM4nuFr3X":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":10,"date":11,"cover":12,"type":13,"category":14,"body":15,"_type":862,"_id":863,"_source":864,"_file":865,"_stem":866,"_extension":867},"/technology-blogs/en/1766","en",false,"",[9],"AI Engineering","Distributed training is an ideal choice for models with high training requirements.","2022-06-29","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/11/28/20c2cd6af9ad4403b554fe4ebd56c9bd.png","technology-blogs","Influencers",{"type":16,"children":17,"toc":859},"root",[18,32,38,43,48,64,69,74,79,84,89,94,99,104,125,130,142,152,157,183,194,205,229,240,251,276,287,292,313,318,326,331,363,373,384,389,407,415,434,445,456,461,474,479,484,496,504,509,514,531,562,592,630,646,680,688,692,703,714,719,724,732,744,755,766,777,796,804,809,814,819,824,829,834,839,847],{"type":19,"tag":20,"props":21,"children":23},"element","h1",{"id":22},"ai-engineering-05-distributed-training-practice-of-the-resnet-50-model-based-on-mindspore",[24,30],{"type":19,"tag":25,"props":26,"children":27},"span",{},[28],{"type":29,"value":9},"text",{"type":29,"value":31}," 05 - Distributed Training Practice of the ResNet-50 Model Based on MindSpore",{"type":19,"tag":33,"props":34,"children":35},"p",{},[36],{"type":29,"value":37},"June 29, 2022",{"type":19,"tag":33,"props":39,"children":40},{},[41],{"type":29,"value":42},"Introduction",{"type":19,"tag":33,"props":44,"children":45},{},[46],{"type":29,"value":47},"Previously, we talked about the importance of distributed parallelism for AI computing power, parallel policies, and principles of automatic and semi-automatic parallelism implemented by the MindSpore framework. In this article, we use automatic parallelism to complete distributed training of the ResNet-50 model based on MindSpore. 
The procedure is as follows:",{"type":19,"tag":33,"props":49,"children":50},{},[51,53,62],{"type":29,"value":52},"1. Prepare a dataset: Download the ",{"type":19,"tag":54,"props":55,"children":59},"a",{"href":56,"rel":57},"https://www.cs.toronto.edu/~kriz/cifar.html",[58],"nofollow",[60],{"type":29,"value":61},"CIFAR-10 dataset",{"type":29,"value":63}," as the training dataset.",{"type":19,"tag":33,"props":65,"children":66},{},[67],{"type":29,"value":68},"2. Configure an 8-device environment powered by the Ascend 910 AI Processor.",{"type":19,"tag":33,"props":70,"children":71},{},[72],{"type":29,"value":73},"3. Invoke the collective communication library: Introduce the HCCL to initialize the communication between multiple devices.",{"type":19,"tag":33,"props":75,"children":76},{},[77],{"type":29,"value":78},"4. Load the dataset using data parallelism.",{"type":19,"tag":33,"props":80,"children":81},{},[82],{"type":29,"value":83},"5. Define the ResNet-50 network.",{"type":19,"tag":33,"props":85,"children":86},{},[87],{"type":29,"value":88},"6. Define a loss function and an optimizer for the distributed parallelism scenario.",{"type":19,"tag":33,"props":90,"children":91},{},[92],{"type":29,"value":93},"7. Build network training code: Define the distributed parallelism policy and training code.",{"type":19,"tag":33,"props":95,"children":96},{},[97],{"type":29,"value":98},"8. Execute the training script.",{"type":19,"tag":33,"props":100,"children":101},{},[102],{"type":29,"value":103},"Step 1: Prepare a Dataset",{"type":19,"tag":33,"props":105,"children":106},{},[107,109,115,117,123],{"type":29,"value":108},"Download the CIFAR-10 dataset used to train ResNet-50. The dataset consists of 60,000 ",{"type":19,"tag":110,"props":111,"children":112},"em",{},[113],{"type":29,"value":114},"32 x 32",{"type":29,"value":116}," color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. 
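Once the package has been downloaded and decompressed, a quick sanity check of the folder contents can be done in plain Python (file names are those of the standard CIFAR-10 binary release; the helper below is illustrative, not part of the tutorial code):

```python
import os

def check_cifar10_bin(data_path):
    # The binary release ships 5 training batches and 1 test batch; each holds
    # 10,000 records of 3,073 bytes (1 label byte + 32*32*3 pixel bytes).
    expected = [f'data_batch_{i}.bin' for i in range(1, 6)] + ['test_batch.bin']
    # Return the names of any files that are missing from the folder.
    return [f for f in expected if not os.path.isfile(os.path.join(data_path, f))]

# An empty list means the dataset looks complete, e.g.:
# check_cifar10_bin('./cifar-10-batches-bin')
```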
Download the dataset package and decompress it to a local path. The ",{"type":19,"tag":118,"props":119,"children":120},"strong",{},[121],{"type":29,"value":122},"cifar-10-batches-bin",{"type":29,"value":124}," folder is extracted.",{"type":19,"tag":33,"props":126,"children":127},{},[128],{"type":29,"value":129},"Step 2: Configure the Distributed Environment",{"type":19,"tag":33,"props":131,"children":132},{},[133,135,140],{"type":29,"value":134},"To perform distributed training on a bare metal server, we need to configure the networking information file of the current multi-device environment. Take the Ascend 910 AI Processor as an example. The following shows the ",{"type":19,"tag":118,"props":136,"children":137},{},[138],{"type":29,"value":139},"rank_table_8pcs.json",{"type":29,"value":141}," configuration file for an 8-device environment.",{"type":19,"tag":143,"props":144,"children":146},"pre",{"code":145}," {      \n\n    \"board_id\": \"0x0000\",      \n\n    \"chip_info\": \"910\",      \n\n    \"deploy_mode\": \"lab\",      \n\n    \"group_count\": \"1\",      \n\n    \"group_list\": [      \n\n        {      \n\n            \"device_num\": \"8\",      \n\n            \"server_num\": \"1\",      \n\n            \"group_name\": \"\",      \n\n            \"instance_count\": \"8\",      \n\n            \"instance_list\": [...]      
\n\n        }      \n\n    ],      \n\n    \"para_plane_nic_location\": \"device\",      \n\n    \"para_plane_nic_name\": [\"eth0\",\"eth1\",\"eth2\",\"eth3\",\"eth4\",\"eth5\",\"eth6\",\"eth7\"],      \n\n    \"para_plane_nic_num\": \"8\",      \n\n    \"status\": \"completed\"      \n\n}      \n",[147],{"type":19,"tag":148,"props":149,"children":150},"code",{"__ignoreMap":7},[151],{"type":29,"value":145},{"type":19,"tag":33,"props":153,"children":154},{},[155],{"type":29,"value":156},"Set the following parameters as required:",{"type":19,"tag":33,"props":158,"children":159},{},[160,162,167,169,174,176,181],{"type":29,"value":161},"· ",{"type":19,"tag":118,"props":163,"children":164},{},[165],{"type":29,"value":166},"board_id",{"type":29,"value":168},": current operating environment. Set this parameter to ",{"type":19,"tag":118,"props":170,"children":171},{},[172],{"type":29,"value":173},"0x0000",{"type":29,"value":175}," for x86 and ",{"type":19,"tag":118,"props":177,"children":178},{},[179],{"type":29,"value":180},"0x0020",{"type":29,"value":182}," for ARM.",{"type":19,"tag":33,"props":184,"children":185},{},[186,187,192],{"type":29,"value":161},{"type":19,"tag":118,"props":188,"children":189},{},[190],{"type":29,"value":191},"server_num",{"type":29,"value":193},": number of hosts.",{"type":19,"tag":33,"props":195,"children":196},{},[197,198,203],{"type":29,"value":161},{"type":19,"tag":118,"props":199,"children":200},{},[201],{"type":29,"value":202},"server_id",{"type":29,"value":204},": IP address of the local 
host.",{"type":19,"tag":33,"props":206,"children":207},{},[208,209,214,216,221,222,227],{"type":29,"value":161},{"type":19,"tag":118,"props":210,"children":211},{},[212],{"type":29,"value":213},"device_num",{"type":29,"value":215},"/",{"type":19,"tag":118,"props":217,"children":218},{},[219],{"type":29,"value":220},"para_plane_nic_num",{"type":29,"value":215},{"type":19,"tag":118,"props":223,"children":224},{},[225],{"type":29,"value":226},"instance_count",{"type":29,"value":228},": number of devices.",{"type":19,"tag":33,"props":230,"children":231},{},[232,233,238],{"type":29,"value":161},{"type":19,"tag":118,"props":234,"children":235},{},[236],{"type":29,"value":237},"rank_id",{"type":29,"value":239},": logical sequence number of a device, which starts from 0.",{"type":19,"tag":33,"props":241,"children":242},{},[243,244,249],{"type":29,"value":161},{"type":19,"tag":118,"props":245,"children":246},{},[247],{"type":29,"value":248},"device_id",{"type":29,"value":250},": physical sequence number of a device, that is, the actual sequence number of a device on the host.",{"type":19,"tag":33,"props":252,"children":253},{},[254,255,260,262,267,269,274],{"type":29,"value":161},{"type":19,"tag":118,"props":256,"children":257},{},[258],{"type":29,"value":259},"device_ip",{"type":29,"value":261},": IP address of the integrated NIC. 
You can run the ",{"type":19,"tag":118,"props":263,"children":264},{},[265],{"type":29,"value":266},"cat /etc/hccn.conf",{"type":29,"value":268}," command on the current host to obtain this value (specified by ",{"type":19,"tag":118,"props":270,"children":271},{},[272],{"type":29,"value":273},"address_x",{"type":29,"value":275},").",{"type":19,"tag":33,"props":277,"children":278},{},[279,280,285],{"type":29,"value":161},{"type":19,"tag":118,"props":281,"children":282},{},[283],{"type":29,"value":284},"para_plane_nic_name",{"type":29,"value":286},": NIC name.",{"type":19,"tag":33,"props":288,"children":289},{},[290],{"type":29,"value":291},"Step 3: Invoke the Collective Communication Library",{"type":19,"tag":33,"props":293,"children":294},{},[295,297,302,304,311],{"type":29,"value":296},"The distributed parallel training on MindSpore requires the Huawei Collective Communication Library (HCCL) for communication, which can be obtained in the software package of the Ascend AI Processor. In addition, the collective communication API provided by the HCCL is encapsulated in ",{"type":19,"tag":118,"props":298,"children":299},{},[300],{"type":29,"value":301},"mindspore.communication.management",{"type":29,"value":303}," to simplify distributed information configuration. HCCL implements multi-server multi-device communication based on the Ascend AI Processor. 
For common restrictions on using the distributed service, see ",{"type":19,"tag":54,"props":305,"children":308},{"href":306,"rel":307},"https://www.mindspore.cn/tutorial/en/0.2.0-alpha/advanced_use/distributed_training.html",[58],[309],{"type":29,"value":310},"Distributed Training",{"type":29,"value":312},".",{"type":19,"tag":33,"props":314,"children":315},{},[316],{"type":29,"value":317},"The sample code for invoking the collective communication library is as follows:",{"type":19,"tag":143,"props":319,"children":321},{"code":320},"import os      \n\nfrom mindspore import context      \n\nfrom mindspore.communication.management import init      \n\nif __name__ == \"__main__\":      \n\n     context.set_context(mode=context.GRAPH_MODE, device_target=\"Ascend\", device_id=int(os.environ[\"DEVICE_ID\"]))      \n\n     init()      \n\n     ...      \n",[322],{"type":19,"tag":148,"props":323,"children":324},{"__ignoreMap":7},[325],{"type":29,"value":320},{"type":19,"tag":33,"props":327,"children":328},{},[329],{"type":29,"value":330},"In the preceding commands:",{"type":19,"tag":33,"props":332,"children":333},{},[334,335,340,342,347,349,354,356,361],{"type":29,"value":161},{"type":19,"tag":118,"props":336,"children":337},{},[338],{"type":29,"value":339},"mode=context.GRAPH_MODE",{"type":29,"value":341},": To use distributed training, set ",{"type":19,"tag":118,"props":343,"children":344},{},[345],{"type":29,"value":346},"mode",{"type":29,"value":348}," to ",{"type":19,"tag":118,"props":350,"children":351},{},[352],{"type":29,"value":353},"GRAPH_MODE",{"type":29,"value":355},". 
(",{"type":19,"tag":118,"props":357,"children":358},{},[359],{"type":29,"value":360},"PYNATIVE_MODE",{"type":29,"value":362}," does not support parallelism.)",{"type":19,"tag":33,"props":364,"children":365},{},[366,367,371],{"type":29,"value":161},{"type":19,"tag":118,"props":368,"children":369},{},[370],{"type":29,"value":248},{"type":29,"value":372},": physical sequence number of the device, that is, the actual sequence number of the device on the host.",{"type":19,"tag":33,"props":374,"children":375},{},[376,377,382],{"type":29,"value":161},{"type":19,"tag":118,"props":378,"children":379},{},[380],{"type":29,"value":381},"init",{"type":29,"value":383},": enables HCCL communication and completes initialization for distributed training.",{"type":19,"tag":33,"props":385,"children":386},{},[387],{"type":29,"value":388},"Step 4: Load the Dataset Using Data Parallelism",{"type":19,"tag":33,"props":390,"children":391},{},[392,394,399,401,405],{"type":29,"value":393},"During distributed training, data is imported using data parallelism. The following code loads the CIFAR-10 dataset using data parallelism. 
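Conceptually, sharding gives each device a disjoint slice of the samples, and the slices together cover the whole dataset. The idea can be sketched in plain Python (illustrative only, not MindSpore's implementation):

```python
def shard_indices(num_samples, num_shards, shard_id):
    # Device shard_id takes every num_shards-th sample starting at its own id,
    # so the shards are disjoint and jointly cover all samples.
    return list(range(shard_id, num_samples, num_shards))

# 8 samples split across 4 devices:
# shard_indices(8, 4, 0) -> [0, 4]
# shard_indices(8, 4, 1) -> [1, 5]
```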
In the code, ",{"type":19,"tag":118,"props":395,"children":396},{},[397],{"type":29,"value":398},"data_path",{"type":29,"value":400}," indicates the dataset path, that is, the path of the ",{"type":19,"tag":118,"props":402,"children":403},{},[404],{"type":29,"value":122},{"type":29,"value":406}," folder.",{"type":19,"tag":143,"props":408,"children":410},{"code":409},"import mindspore.common.dtype as mstype      \n\nimport mindspore.dataset as ds      \n\nimport mindspore.dataset.transforms.c_transforms as C      \n\nimport mindspore.dataset.transforms.vision.c_transforms as vision      \n\nfrom mindspore.communication.management import get_rank, get_group_size      \n\ndef create_dataset(data_path, repeat_num=1, batch_size=32, rank_id=0, rank_size=1):      \n\n    resize_height = 224      \n\n    resize_width = 224      \n\n    rescale = 1.0 / 255.0      \n\n    shift = 0.0      \n\n         \n\n    # Get rank_id and rank_size.      \n\n    rank_id = get_rank()      \n\n    rank_size = get_group_size()      \n\n    data_set = ds.Cifar10Dataset(data_path, num_shards=rank_size, shard_id=rank_id)      \n\n         \n\n    # Define map operations.      \n\n    random_crop_op = vision.RandomCrop((32, 32), (4, 4, 4, 4))      \n\n    random_horizontal_op = vision.RandomHorizontalFlip()      \n\n    resize_op = vision.Resize((resize_height, resize_width))      \n\n    rescale_op = vision.Rescale(rescale, shift)      \n\n    normalize_op = vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023))      \n\n    changeswap_op = vision.HWC2CHW()      \n\n    type_cast_op = C.TypeCast(mstype.int32)      \n\n    c_trans = [random_crop_op, random_horizontal_op]      \n\n    c_trans += [resize_op, rescale_op, normalize_op, changeswap_op]      \n\n    # Apply map operations on images.      
\n\n    data_set = data_set.map(input_columns=\"label\", operations=type_cast_op)      \n\n    data_set = data_set.map(input_columns=\"image\", operations=c_trans)      \n\n    # apply shuffle operations      \n\n    data_set = data_set.shuffle(buffer_size=10)      \n\n    # Apply batch operations.      \n\n    data_set = data_set.batch(batch_size=batch_size, drop_remainder=True)      \n\n    # Apply repeat operations.      \n\n    data_set = data_set.repeat(repeat_num)      \n\n    return data_set   \n",[411],{"type":19,"tag":148,"props":412,"children":413},{"__ignoreMap":7},[414],{"type":29,"value":409},{"type":19,"tag":33,"props":416,"children":417},{},[418,420,425,427,432],{"type":29,"value":419},"Parameters ",{"type":19,"tag":118,"props":421,"children":422},{},[423],{"type":29,"value":424},"num_shards",{"type":29,"value":426}," (number of devices) and ",{"type":19,"tag":118,"props":428,"children":429},{},[430],{"type":29,"value":431},"shard_id",{"type":29,"value":433}," (logical sequence number of a device) need to be passed to the dataset API. You are advised to obtain them through the HCCL API.",{"type":19,"tag":33,"props":435,"children":436},{},[437,438,443],{"type":29,"value":161},{"type":19,"tag":118,"props":439,"children":440},{},[441],{"type":29,"value":442},"get_rank",{"type":29,"value":444},": obtains the ID of the current device in the cluster.",{"type":19,"tag":33,"props":446,"children":447},{},[448,449,454],{"type":29,"value":161},{"type":19,"tag":118,"props":450,"children":451},{},[452],{"type":29,"value":453},"get_group_size",{"type":29,"value":455},": obtains the number of devices in the cluster.",{"type":19,"tag":33,"props":457,"children":458},{},[459],{"type":29,"value":460},"Step 5: Define the Network",{"type":19,"tag":33,"props":462,"children":463},{},[464,466,473],{"type":29,"value":465},"In data parallelism and automatic parallelism, the way of defining the network is the same as that in a single-device system. 
For details about the defining code, see ",{"type":19,"tag":54,"props":467,"children":470},{"href":468,"rel":469},"https://gitee.com/mindspore/docs/blob/r0.2/tutorials/tutorial_code/resnet/resnet.py",[58],[471],{"type":29,"value":472},"ResNet-50 Implementation",{"type":29,"value":312},{"type":19,"tag":33,"props":475,"children":476},{},[477],{"type":29,"value":478},"Step 6: Define a Loss Function and an Optimizer",{"type":19,"tag":33,"props":480,"children":481},{},[482],{"type":29,"value":483},"Automatic parallelism splits a model by operator and obtains the optimal parallel policy through algorithms. Different from single-device training, you are advised to use small operators to implement a loss function to achieve better parallel training performance.",{"type":19,"tag":33,"props":485,"children":486},{},[487,489,494],{"type":29,"value":488},"For the loss function, we expand ",{"type":19,"tag":118,"props":490,"children":491},{},[492],{"type":29,"value":493},"SoftmaxCrossEntropyWithLogits",{"type":29,"value":495}," into multiple small operators for implementation based on a mathematical formula. 
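Numerically, the expansion computes loss = mean_n( -sum_j y_nj * log softmax(z_n)_j ), with the row maximum subtracted before the exponential for stability. The operator sequence can be cross-checked in plain Python (an illustrative check, independent of MindSpore):

```python
import math

def softmax_xent_expanded(logits, onehot):
    # Mirror the small-operator sequence:
    # ReduceMax -> Sub -> Exp -> ReduceSum -> Div -> Log -> Mul -> ReduceSum -> Mul(-1) -> ReduceMean
    losses = []
    for logit, label in zip(logits, onehot):
        m = max(logit)                                   # ReduceMax (numerical stability)
        exp = [math.exp(x - m) for x in logit]           # Sub + Exp
        total = sum(exp)                                 # ReduceSum
        softmax = [e / total for e in exp]               # Div
        # Log + Mul + ReduceSum, negated (Mul by -1):
        losses.append(-sum(y * math.log(p) for p, y in zip(softmax, label)))
    return sum(losses) / len(losses)                     # ReduceMean

# Equals log-sum-exp(logits) minus the true-class logit:
# softmax_xent_expanded([[1.0, 2.0, 3.0]], [[0, 0, 1]]) ≈ 0.4076
```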
The sample code is as follows:",{"type":19,"tag":143,"props":497,"children":499},{"code":498},"from mindspore.ops import operations as P      \n\nfrom mindspore import Tensor      \n\nimport mindspore.ops.functional as F      \n\nimport mindspore.common.dtype as mstype      \n\nimport mindspore.nn as nn      \n\nclass SoftmaxCrossEntropyExpand(nn.Cell):      \n\n    def __init__(self, sparse=False):      \n\n        super(SoftmaxCrossEntropyExpand, self).__init__()      \n\n        self.exp = P.Exp()      \n\n        self.sum = P.ReduceSum(keep_dims=True)      \n\n        self.onehot = P.OneHot()      \n\n        self.on_value = Tensor(1.0, mstype.float32)      \n\n        self.off_value = Tensor(0.0, mstype.float32)      \n\n        self.div = P.Div()      \n\n        self.log = P.Log()      \n\n        self.sum_cross_entropy = P.ReduceSum(keep_dims=False)      \n\n        self.mul = P.Mul()      \n\n        self.mul2 = P.Mul()      \n\n        self.mean = P.ReduceMean(keep_dims=False)      \n\n        self.sparse = sparse      \n\n        self.max = P.ReduceMax(keep_dims=True)      \n\n        self.sub = P.Sub()      \n\n             \n\n    def construct(self, logit, label):      \n\n        logit_max = self.max(logit, -1)      \n\n        exp = self.exp(self.sub(logit, logit_max))      \n\n        exp_sum = self.sum(exp, -1)      \n\n        softmax_result = self.div(exp, exp_sum)      \n\n        if self.sparse:      \n\n            label = self.onehot(label, F.shape(logit)[1], self.on_value, self.off_value)      \n\n        softmax_result_log = self.log(softmax_result)      \n\n        loss = self.sum_cross_entropy((self.mul(softmax_result_log, label)), -1)      \n\n        loss = self.mul2(F.scalar_to_array(-1.0), loss)      \n\n        loss = self.mean(loss, -1)      \n\n        return 
loss\n",[500],{"type":19,"tag":148,"props":501,"children":502},{"__ignoreMap":7},[503],{"type":29,"value":498},{"type":19,"tag":33,"props":505,"children":506},{},[507],{"type":29,"value":508},"We use the Momentum optimizer to update parameters. For details, see the implementation in the sample code.",{"type":19,"tag":33,"props":510,"children":511},{},[512],{"type":29,"value":513},"Step 7: Build Network Training Code",{"type":19,"tag":33,"props":515,"children":516},{},[517,522,524,529],{"type":19,"tag":118,"props":518,"children":519},{},[520],{"type":29,"value":521},"context.set_auto_parallel_context",{"type":29,"value":523}," is an API for configuring parallel training parameters and must be called before initializing the model. If no parameters are specified, the framework automatically sets the empirical values of the parameters based on the parallel mode. For example, in data parallelism, ",{"type":19,"tag":118,"props":525,"children":526},{},[527],{"type":29,"value":528},"parameter_broadcast",{"type":29,"value":530}," is enabled by default. Parameters:",{"type":19,"tag":33,"props":532,"children":533},{},[534,535,540,542,547,549,554,556,561],{"type":29,"value":161},{"type":19,"tag":118,"props":536,"children":537},{},[538],{"type":29,"value":539},"parallel_mode",{"type":29,"value":541},": distributed parallelism. 
Possible values are ",{"type":19,"tag":118,"props":543,"children":544},{},[545],{"type":29,"value":546},"ParallelMode.STAND_ALONE",{"type":29,"value":548}," (default), ",{"type":19,"tag":118,"props":550,"children":551},{},[552],{"type":29,"value":553},"ParallelMode.DATA_PARALLEL",{"type":29,"value":555},", and ",{"type":19,"tag":118,"props":557,"children":558},{},[559],{"type":29,"value":560},"ParallelMode.AUTO_PARALLEL",{"type":29,"value":312},{"type":19,"tag":33,"props":563,"children":564},{},[565,566,570,572,577,579,584,586,591],{"type":29,"value":161},{"type":19,"tag":118,"props":567,"children":568},{},[569],{"type":29,"value":528},{"type":29,"value":571},": indicates whether to broadcast initialized parameters. In ",{"type":19,"tag":118,"props":573,"children":574},{},[575],{"type":29,"value":576},"DATA_PARALLEL",{"type":29,"value":578}," and ",{"type":19,"tag":118,"props":580,"children":581},{},[582],{"type":29,"value":583},"HYBRID_PARALLEL",{"type":29,"value":585}," modes, the default value is ",{"type":19,"tag":118,"props":587,"children":588},{},[589],{"type":29,"value":590},"True",{"type":29,"value":312},{"type":19,"tag":33,"props":593,"children":594},{},[595,596,601,603,608,610,615,617,621,623,628],{"type":29,"value":161},{"type":19,"tag":118,"props":597,"children":598},{},[599],{"type":29,"value":600},"mirror_mean",{"type":29,"value":602},": During backward propagation, the framework collects the parameter gradients across all devices in data parallel mode, obtains the global gradient value, and passes it to the optimizer for update. The default value is ",{"type":19,"tag":118,"props":604,"children":605},{},[606],{"type":29,"value":607},"False",{"type":29,"value":609},", indicating that the ",{"type":19,"tag":118,"props":611,"children":612},{},[613],{"type":29,"value":614},"allreduce_sum",{"type":29,"value":616}," operation is applied. 
The value ",{"type":19,"tag":118,"props":618,"children":619},{},[620],{"type":29,"value":590},{"type":29,"value":622}," indicates that the ",{"type":19,"tag":118,"props":624,"children":625},{},[626],{"type":29,"value":627},"allreduce_mean",{"type":29,"value":629}," operation is applied.",{"type":19,"tag":33,"props":631,"children":632},{},[633,634,638,639,644],{"type":29,"value":161},{"type":19,"tag":118,"props":635,"children":636},{},[637],{"type":29,"value":213},{"type":29,"value":578},{"type":19,"tag":118,"props":640,"children":641},{},[642],{"type":29,"value":643},"global_rank:",{"type":29,"value":645}," Retain their default values, which are obtained by calling the HCCL API.",{"type":19,"tag":33,"props":647,"children":648},{},[649,651,656,658,662,663,668,670,674,675,679],{"type":29,"value":650},"If multiple network cases exist in the script, call ",{"type":19,"tag":118,"props":652,"children":653},{},[654],{"type":29,"value":655},"context.reset_auto_parallel_context()",{"type":29,"value":657}," to restore all parameters to default values before executing the next case. In the following example, we set ",{"type":19,"tag":118,"props":659,"children":660},{},[661],{"type":29,"value":539},{"type":29,"value":348},{"type":19,"tag":118,"props":664,"children":665},{},[666],{"type":29,"value":667},"AUTO_PARALLEL",{"type":29,"value":669},". 
To switch to data parallelism, change the value of ",{"type":19,"tag":118,"props":671,"children":672},{},[673],{"type":29,"value":539},{"type":29,"value":348},{"type":19,"tag":118,"props":676,"children":677},{},[678],{"type":29,"value":576},{"type":29,"value":312},{"type":19,"tag":143,"props":681,"children":683},{"code":682},"import os      \n\nfrom mindspore import context      \n\nfrom mindspore.nn.optim.momentum import Momentum      \n\nfrom mindspore.train.callback import LossMonitor      \n\nfrom mindspore.train.model import Model, ParallelMode      \n\nfrom resnet import resnet50      \n\ndevice_id = int(os.getenv('DEVICE_ID'))      \n\ncontext.set_context(mode=context.GRAPH_MODE, device_target=\"Ascend\")      \n\ncontext.set_context(device_id=device_id) # set device_id      \n\ndef test_train_cifar(epoch_size=10):      \n\n    context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, mirror_mean=True)      \n\n    loss_cb = LossMonitor()      \n\n    dataset = create_dataset(data_path, epoch_size)      \n\n    batch_size = 32      \n\n    num_classes = 10      \n\n    net = resnet50(batch_size, num_classes)      \n\n    loss = SoftmaxCrossEntropyExpand(sparse=True)      \n\n    opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9)      \n\n    model = Model(net, loss_fn=loss, optimizer=opt)      \n\n    model.train(epoch_size, dataset, callbacks=[loss_cb], dataset_sink_mode=True)    \n",[684],{"type":19,"tag":148,"props":685,"children":686},{"__ignoreMap":7},[687],{"type":29,"value":682},{"type":19,"tag":33,"props":689,"children":690},{},[691],{"type":29,"value":330},{"type":19,"tag":33,"props":693,"children":694},{},[695,696,701],{"type":29,"value":161},{"type":19,"tag":118,"props":697,"children":698},{},[699],{"type":29,"value":700},"dataset_sink_mode=True",{"type":29,"value":702},": uses the dataset offloading mode. 
That is, the computing is performed on the hardware platform.",{"type":19,"tag":33,"props":704,"children":705},{},[706,707,712],{"type":29,"value":161},{"type":19,"tag":118,"props":708,"children":709},{},[710],{"type":29,"value":711},"LossMonitor",{"type":29,"value":713},": returns the loss value through the callback function to monitor the loss during training.",{"type":19,"tag":33,"props":715,"children":716},{},[717],{"type":29,"value":718},"Step 8: Execute the Training Script",{"type":19,"tag":33,"props":720,"children":721},{},[722],{"type":29,"value":723},"After the training script is ready, run the corresponding command to invoke it. Currently, MindSpore uses a single-process-per-device operating mode for distributed execution: one process runs on each device, and the total number of processes equals the number of devices in use. The process for device 0 runs in the foreground, while the processes for the other devices run in the background. Each process needs its own directory to store log information and operator compilation information. 
The following uses a distributed training script with eight devices as an example to describe how to run a training script.",{"type":19,"tag":143,"props":725,"children":727},{"code":726},"#!/bin/bash      \n\nexport DATA_PATH=${DATA_PATH:-$1}      \n\nexport RANK_TABLE_FILE=$(pwd)/rank_table_8pcs.json      \n\nexport RANK_SIZE=8      \n\nfor((i=1;i\u003C${RANK_SIZE};i++))      \n\ndo      \n\n    rm -rf device$i      \n\n    mkdir device$i      \n\n    cp ./resnet50_distributed_training.py ./resnet.py ./device$i      \n\n    cd ./device$i      \n\n    export DEVICE_ID=$i      \n\n    export RANK_ID=$i      \n\n    echo \"start training for device $i\"      \n\n    env > env$i.log      \n\n    pytest -s -v ./resnet50_distributed_training.py > train.log$i 2>&1 &      \n\n    cd ../      \n\ndone      \n\nrm -rf device0      \n\nmkdir device0      \n\ncp ./resnet50_distributed_training.py ./resnet.py ./device0      \n\ncd ./device0      \n\nexport DEVICE_ID=0      \n\nexport RANK_ID=0      \n\necho \"start training for device 0\"      \n\nenv > env0.log      \n\npytest -s -v ./resnet50_distributed_training.py > train.log0 2>&1      \n\nif [ $? -eq 0 ];then      \n\n    echo \"training success\"      \n\nelse      \n\n    echo \"training failed\"      \n\n    exit 2      \n\nfi      \n\ncd ../   \n",[728],{"type":19,"tag":148,"props":729,"children":730},{"__ignoreMap":7},[731],{"type":29,"value":726},{"type":19,"tag":33,"props":733,"children":734},{},[735,737,742],{"type":29,"value":736},"The script requires the variable ",{"type":19,"tag":118,"props":738,"children":739},{},[740],{"type":29,"value":741},"DATA_PATH",{"type":29,"value":743}," (path of the dataset). 
The following environment variables need to be set:",{"type":19,"tag":33,"props":745,"children":746},{},[747,748,753],{"type":29,"value":161},{"type":19,"tag":118,"props":749,"children":750},{},[751],{"type":29,"value":752},"RANK_TABLE_FILE",{"type":29,"value":754},": path of the networking information file.",{"type":19,"tag":33,"props":756,"children":757},{},[758,759,764],{"type":29,"value":161},{"type":19,"tag":118,"props":760,"children":761},{},[762],{"type":29,"value":763},"DEVICE_ID",{"type":29,"value":765},": actual sequence number of the current device on the corresponding host.",{"type":19,"tag":33,"props":767,"children":768},{},[769,770,775],{"type":29,"value":161},{"type":19,"tag":118,"props":771,"children":772},{},[773],{"type":29,"value":774},"RANK_ID",{"type":29,"value":776},": logical sequence number of the current device.",{"type":19,"tag":33,"props":778,"children":779},{},[780,782,787,789,794],{"type":29,"value":781},"Run the script to start distributed training. The run takes about 5 minutes, most of which is spent on operator compilation; the actual training takes less than 20 seconds. By comparison, compilation and training in the single-device scenario take about 10 minutes in total. 
The following is an example segment in ",{"type":19,"tag":118,"props":783,"children":784},{},[785],{"type":29,"value":786},"train.log",{"type":29,"value":788}," in the ",{"type":19,"tag":118,"props":790,"children":791},{},[792],{"type":29,"value":793},"device",{"type":29,"value":795}," directory:",{"type":19,"tag":143,"props":797,"children":799},{"code":798},"epoch: 1 step: 156, loss is 2.0084016\n\nepoch: 2 step: 156, loss is 1.6407638\n\nepoch: 3 step: 156, loss is 1.6164391\n\nepoch: 4 step: 156, loss is 1.6838071\n\nepoch: 5 step: 156, loss is 1.6320667\n\nepoch: 6 step: 156, loss is 1.3098773\n\nepoch: 7 step: 156, loss is 1.3515002\n\nepoch: 8 step: 156, loss is 1.2943741\n\nepoch: 9 step: 156, loss is 1.2316195\n\nepoch: 10 step: 156, loss is 1.1533381\n",[800],{"type":19,"tag":148,"props":801,"children":802},{"__ignoreMap":7},[803],{"type":29,"value":798},{"type":19,"tag":33,"props":805,"children":806},{},[807],{"type":29,"value":808},"Summary",{"type":19,"tag":33,"props":810,"children":811},{},[812],{"type":29,"value":813},"Distributed training involves more development complexity than single-device training in the following aspects:",{"type":19,"tag":33,"props":815,"children":816},{},[817],{"type":29,"value":818},"1. Distributed configuration and communication for multiple devices",{"type":19,"tag":33,"props":820,"children":821},{},[822],{"type":29,"value":823},"2. Operator selection for the loss function when defining a network",{"type":19,"tag":33,"props":825,"children":826},{},[827],{"type":29,"value":828},"3. Complex training script",{"type":19,"tag":33,"props":830,"children":831},{},[832],{"type":29,"value":833},"4. Difficult debugging and tuning",{"type":19,"tag":33,"props":835,"children":836},{},[837],{"type":29,"value":838},"Regardless of these development and debugging costs, distributed training brings impressive performance benefits. 
For the ResNet-50 model, the performance of eight-device training (Ascend AI Processors) is several times higher than that of single-device training. Therefore, distributed training is an ideal choice for models with high training requirements.",{"type":19,"tag":33,"props":840,"children":841},{},[842],{"type":19,"tag":118,"props":843,"children":844},{},[845],{"type":29,"value":846},"References",{"type":19,"tag":33,"props":848,"children":849},{},[850,852],{"type":29,"value":851},"[1] ",{"type":19,"tag":54,"props":853,"children":856},{"href":854,"rel":855},"https://gitee.com/mindspore/docs/tree/master/docs/sample_code/distributed_training",[58],[857],{"type":29,"value":858},"Sample Code of Distributed Training",{"title":7,"searchDepth":860,"depth":860,"links":861},4,[],"markdown","content:technology-blogs:en:1766.md","content","technology-blogs/en/1766.md","technology-blogs/en/1766","md",1776506104311]