# Using MindSpore on the Cloud

`Linux` `Ascend` `Whole Process` `Beginner` `Intermediate` `Expert`

[![View Source On Gitee](https://gitee.com/mindspore/docs/raw/r1.3/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.3/docs/mindspore/programming_guide/source_en/use_on_the_cloud.md)

## Overview

ModelArts is a one-stop AI development platform provided by HUAWEI CLOUD. It integrates the Ascend AI Processor resource pool, so developers can experience MindSpore on this platform.

ResNet-50 is used as an example to describe how to use MindSpore to complete a training task on ModelArts.

## Preparations

### Preparing ModelArts

Create an account, configure ModelArts, and create an Object Storage Service (OBS) bucket by referring to the "Preparations" section in the ModelArts tutorial.

> For more information about ModelArts, visit . Prepare ModelArts by referring to the "Preparations" section.

### Accessing Ascend AI Processor Resources on HUAWEI CLOUD

You can click [here](https://console.huaweicloud.com/modelarts/?region=cn-north-4#/dashboard/applyModelArtsAscend910Beta) to join the beta testing program of the ModelArts Ascend Compute Service.

### Preparing Data

ModelArts uses OBS to store data. Therefore, before starting a training job, you need to upload the data to OBS. The CIFAR-10 dataset in binary format is used as an example.

1. Download and decompress the CIFAR-10 dataset.

    > Download the CIFAR-10 dataset at . Among the three dataset versions provided on the page, select the CIFAR-10 binary version.

2. Create an OBS bucket (for example, `ms-dataset`), create a data directory (for example, `cifar-10`) in the bucket, and upload the CIFAR-10 data to the data directory according to the following structure (a quick way to check the local layout before uploading is sketched after this list).

    ```text
    └─Object storage/ms-dataset/cifar-10
        ├─train
        │      data_batch_1.bin
        │      data_batch_2.bin
        │      data_batch_3.bin
        │      data_batch_4.bin
        │      data_batch_5.bin
        │
        └─eval
               test_batch.bin
    ```
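The decompressed CIFAR-10 binary archive does not produce the `train`/`eval` split by itself; you arrange it manually before uploading. The following is a minimal sketch for checking that your local copy matches the layout above. The local directory name `cifar-10` is an assumption used only for this illustration:

```python
import os

# assumption: the files have been arranged locally under ./cifar-10
# following the train/eval layout shown above, before uploading to OBS
local_root = './cifar-10'

expected = [os.path.join('train', 'data_batch_%d.bin' % i) for i in range(1, 6)]
expected.append(os.path.join('eval', 'test_batch.bin'))

missing = [f for f in expected if not os.path.isfile(os.path.join(local_root, f))]
if missing:
    print('Missing files:', missing)
else:
    print('CIFAR-10 layout looks correct, ready to upload to OBS.')
```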
### Preparing for Script Execution

Create an OBS bucket (for example, `resnet50-train`), create a code directory (for example, `resnet50_cifar10_train`) in the bucket, and upload all scripts in the following directories to the code directory:

> ResNet-50 is used in the scripts to train on the CIFAR-10 dataset and validate the accuracy after training is complete. The scripts support both `1*Ascend` and `8*Ascend` training on ModelArts.
>
> Note that the script version must be the same as the MindSpore version selected in "Creating a Training Job". For example, if you use scripts provided for MindSpore 1.1, you need to select MindSpore 1.1 when creating a training job.

To facilitate subsequent training job creation, you also need to create a training output directory and a log output directory. The directory structure created in this example is as follows:

```text
└─Object storage/resnet50-train
    ├─resnet50_cifar10_train
    │      dataset.py
    │      resnet.py
    │      resnet50_train.py
    │
    ├─output
    └─log
```

## Running the MindSpore Script on ModelArts After Simple Adaptation

The scripts provided in "Preparing for Script Execution" can run on ModelArts directly. If you only want to experience training CIFAR-10 with ResNet-50, skip this section. If you need to run customized MindSpore scripts or more MindSpore sample code on ModelArts, perform simple adaptation on the MindSpore code as follows.

### Adapting to Script Arguments

1. Set `data_url` and `train_url`. They are necessary for running the script on ModelArts and correspond to the data storage path (an OBS path) and the training output path (an OBS path), respectively.

    ```python
    import argparse

    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    ```

2. ModelArts allows you to pass values to other configuration options in the script. For details, see "Creating a Training Job".

    ```python
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    ```

### Adapting to OBS Data

MindSpore does not provide APIs for directly accessing OBS data, so you need to use the APIs provided by MoXing to interact with OBS. ModelArts training scripts are executed in containers, and the `/cache` directory is generally used to store the container data.

> HUAWEI CLOUD MoXing provides various APIs for users: . In this example, only the `copy_parallel` API is used.

1. Download the data stored in OBS to the execution container.

    ```python
    import moxing as mox
    mox.file.copy_parallel(src_url='s3://dataset_url/', dst_url='/cache/data_path')
    ```

2. Upload the training output from the container to OBS.

    ```python
    import moxing as mox
    mox.file.copy_parallel(src_url='/cache/output_path', dst_url='s3://output_url/')
    ```

### Adapting to 8-Device Training Jobs

To run scripts in the `8*Ascend` environment, you need to adapt the dataset creation code and the local data path, and configure a distributed policy. By reading the environment variables `DEVICE_ID` and `RANK_SIZE`, you can build training scripts that work for both `1*Ascend` and `8*Ascend`; a note on binding each process to its own device follows the three steps below.

1. Adapt the local data path.

    ```python
    import os

    device_num = int(os.getenv('RANK_SIZE'))
    device_id = int(os.getenv('DEVICE_ID'))

    # define local data path
    local_data_path = '/cache/data'

    if device_num > 1:
        # define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))
    ```

2. Adapt the dataset.

    ```python
    import os
    import mindspore.dataset.engine as de

    device_id = int(os.getenv('DEVICE_ID'))
    device_num = int(os.getenv('RANK_SIZE'))

    if device_num == 1:
        # create train data for the 1*Ascend situation
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        # create and shard train data for the 8*Ascend situation
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    ```

3. Configure a distributed policy.

    ```python
    import os
    from mindspore import context
    from mindspore.context import ParallelMode

    device_num = int(os.getenv('RANK_SIZE'))

    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    ```
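The three steps above assume that each training process is already bound to its own Ascend device and execution mode. If your own script does not yet set the execution context, a minimal sketch looks like the following; the fallback values for `DEVICE_ID` and `RANK_SIZE` are assumptions added only so the snippet also runs outside ModelArts:

```python
import os
from mindspore import context

# DEVICE_ID is injected by ModelArts; the fallback value is only an
# assumption so that this sketch can be tried outside the platform
device_id = int(os.getenv('DEVICE_ID', '0'))

# run in graph mode on the Ascend device assigned to this process,
# before configuring the auto-parallel context shown in step 3
context.set_context(mode=context.GRAPH_MODE, device_target='Ascend', device_id=device_id)
```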
### Sample Code

Perform simple adaptation on the MindSpore script based on the preceding three points. The following pseudocode is used as an example.

Original MindSpore script:

```python
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    train_dataset = create_dataset(args.local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--local_data_path', required=True, default=None, help='Location of data.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    args_opt, unknown = parser.parse_known_args()
    resnet50_train(args_opt)
```

Adapted MindSpore script:

```python
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de

# adapt to cloud: used for downloading data
import moxing as mox

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    # adapt to cloud: define local data path
    local_data_path = '/cache/data'

    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
        # adapt to cloud: define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))

    # adapt to cloud: download data from obs to local location
    print('Download data.')
    mox.file.copy_parallel(src_url=args.data_url, dst_url=local_data_path)

    train_dataset = create_dataset(local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    # adapt to cloud: get obs data path
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    # adapt to cloud: get obs output path
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    args_opt, unknown = parser.parse_known_args()
    resnet50_train(args_opt)
```
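The adapted script above only downloads the dataset; anything the training job writes inside the container (for example, checkpoints) is not automatically copied back to OBS. A minimal sketch of such an upload step, reusing the `copy_parallel` API shown in "Adapting to OBS Data", is given below. The `/cache/output` directory and the `upload_outputs` helper are illustrative names, not part of the sample scripts:

```python
import moxing as mox

def upload_outputs(local_output_path, train_url):
    # copy everything under the local output directory to the OBS output path
    mox.file.copy_parallel(src_url=local_output_path, dst_url=train_url)

# typical call at the end of resnet50_train(args), assuming results were
# written to /cache/output inside the container:
# upload_outputs('/cache/output', args.train_url)
```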
## Creating a Training Job

Create a training job to run the MindSpore script. The following provides step-by-step instructions for creating a training job on ModelArts.

### Opening the ModelArts Console

Click Console on the HUAWEI CLOUD ModelArts home page at .

### Using a Common Framework to Create a Training Job

The ModelArts tutorial shows how to use a common framework to create a training job.

### Using MindSpore as a Common Framework to Create a Training Job

The training scripts and data in this tutorial are used as an example to describe how to configure the arguments on the training job creation page.

1. `Algorithm Source`: Click `Frameworks`, and then select `Ascend-Powered-Engine` and the required MindSpore version (`Mindspore-0.5-python3.7-aarch64` is used as an example here; use scripts corresponding to the selected version).

2. `Code Directory`: Select the code directory created in an OBS bucket. Set `Startup File` to the startup script in the code directory.

3. `Data Source`: Click `Data Storage Path` and enter the OBS path of the CIFAR-10 dataset.

4. `Argument`: Set `data_url` and `train_url` to the values of `Data Storage Path` and `Training Output Path`, respectively. Click the add icon to pass values to other arguments in the script, for example, `epoch_size`.

5. `Resource Pool`: Click `Public Resource Pool > Ascend`.

6. `Specification`: Select `Ascend: 1 * Ascend 910 CPU: 24-core 96 GiB` or `Ascend: 8 * Ascend 910 CPU: 192-core 768 GiB`, which correspond to the single-node single-device and single-node 8-device specifications, respectively.

## Viewing the Execution Result

1. You can view the run logs on the Training Jobs page.

    When the `8*Ascend` specification is used to execute the ResNet-50 training job, the total number of epochs is 92, the accuracy is about 92%, and about 12,000 images are trained per second.

    When the `1*Ascend` specification is used to execute the ResNet-50 training job, the total number of epochs is 92, the accuracy is about 95%, and about 1,800 images are trained per second.

2. If you specify a log path when creating a training job, you can download the log files from OBS and view them (a programmatic alternative is sketched after this list).
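Besides downloading through the OBS console, the `copy_parallel` API used earlier can also fetch the log directory from any MoXing-enabled environment such as a ModelArts notebook. This is only a sketch; the OBS path below follows the example bucket layout from "Preparing for Script Execution" and must be replaced with the `Log Output Path` you actually configured:

```python
import moxing as mox

# assumption: the Log Output Path of the training job points to the log
# directory of the example bucket; replace both paths with your own
obs_log_url = 's3://resnet50-train/log/'
local_log_path = './log'

mox.file.copy_parallel(src_url=obs_log_url, dst_url=local_log_path)
```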