{ "cells": [ { "cell_type": "markdown", "source": [ "# Optimizing the Data Processing\n", "\n", "`Ascend` `GPU` `CPU` `Data Preparation`\n", "\n", "[![Run in ModelArts](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_modelarts_en.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9tYXN0ZXIvcHJvZ3JhbW1pbmdfZ3VpZGUvZW4vbWluZHNwb3JlX29wdGltaXplX2RhdGFfcHJvY2Vzc2luZy5pcHluYg==&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c) [![Download Notebook](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_notebook_en.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.6/programming_guide/en/mindspore_optimize_data_processing.ipynb) [![View Source On Gitee](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_source_en.png)](https://gitee.com/mindspore/docs/blob/r1.6/docs/mindspore/programming_guide/source_en/optimize_data_processing.ipynb)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Overview\n", "\n", "Data is the most important factor of deep learning. Data quality determines the upper limit of deep learning result, whereas model quality enables the result to approach the upper limit. Therefore, high-quality data input is beneficial to the entire deep neural network. During the entire data processing and data augmentation process, data continuously flows through a pipeline to the training system." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "![pipeline](https://gitee.com/mindspore/docs/raw/r1.6/docs/mindspore/programming_guide/source_en/images/pipeline.png)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "MindSpore provides data processing and data augmentation functions for users. In the pipeline process, if each step can be properly used, the data performance will be greatly improved. 
This section describes how to optimize performance during data loading, data processing, and data augmentation based on the [CIFAR-10 dataset[1]](#references).\n", "\n", "In addition, the storage, architecture, and computing resources of the host operating system affect the performance of data processing to a certain extent.\n", "\n", "## Preparations\n", "\n", "### Importing Modules\n", "\n", "The `dataset` module provides APIs for loading and processing datasets." ], "metadata": {} }, { "cell_type": "code", "execution_count": 1, "source": [ "import mindspore.dataset as ds" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "The `numpy` module is used to generate ndarrays." ], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "import numpy as np" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Downloading the Required Dataset\n", "\n", "Run the following code to download the CIFAR-10 dataset in binary format and decompress it into the `./datasets` directory. This dataset is used in the data loading examples." 
], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "import os\n", "import requests\n", "import tarfile\n", "import zipfile\n", "import shutil\n", "\n", "requests.packages.urllib3.disable_warnings()\n", "\n", "def download_dataset(url, target_path):\n", "    \"\"\"download and decompress dataset\"\"\"\n", "    if not os.path.exists(target_path):\n", "        os.makedirs(target_path)\n", "    download_file = url.split(\"/\")[-1]\n", "    if not os.path.exists(download_file):\n", "        res = requests.get(url, stream=True, verify=False)\n", "        if download_file.split(\".\")[-1] not in [\"tgz\", \"zip\", \"tar\", \"gz\"]:\n", "            download_file = os.path.join(target_path, download_file)\n", "        with open(download_file, \"wb\") as f:\n", "            for chunk in res.iter_content(chunk_size=512):\n", "                if chunk:\n", "                    f.write(chunk)\n", "    if download_file.endswith(\"zip\"):\n", "        z = zipfile.ZipFile(download_file, \"r\")\n", "        z.extractall(path=target_path)\n", "        z.close()\n", "    if download_file.endswith(\".tar.gz\") or download_file.endswith(\".tar\") or download_file.endswith(\".tgz\"):\n", "        t = tarfile.open(download_file)\n", "        names = t.getnames()\n", "        for name in names:\n", "            t.extract(name, target_path)\n", "        t.close()\n", "    print(\"The {} file is downloaded and saved in the path {} after processing\".format(os.path.basename(url), target_path))\n", "\n", "download_dataset(\"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz\", \"./datasets\")\n", "test_path = \"./datasets/cifar-10-batches-bin/test\"\n", "train_path = \"./datasets/cifar-10-batches-bin/train\"\n", "os.makedirs(test_path, exist_ok=True)\n", "os.makedirs(train_path, exist_ok=True)\n", "if not os.path.exists(os.path.join(test_path, \"test_batch.bin\")):\n", "    shutil.move(\"./datasets/cifar-10-batches-bin/test_batch.bin\", test_path)\n", "[shutil.move(\"./datasets/cifar-10-batches-bin/\"+i, train_path) for i in os.listdir(\"./datasets/cifar-10-batches-bin/\") if os.path.isfile(\"./datasets/cifar-10-batches-bin/\"+i) and not i.endswith(\".html\") and not os.path.exists(os.path.join(train_path, i))]" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "The directory structure of the downloaded dataset file is as follows:\n", "\n", "```text\n", "./datasets/cifar-10-batches-bin\n", "├── readme.html\n", "├── test\n", "│   └── test_batch.bin\n", "└── train\n", "    ├── batches.meta.txt\n", "    ├── data_batch_1.bin\n", "    ├── data_batch_2.bin\n", "    ├── data_batch_3.bin\n", "    ├── data_batch_4.bin\n", "    └── data_batch_5.bin\n", "```" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Download the CIFAR-10 dataset in Python file format and decompress it into the `./datasets/cifar-10-batches-py` directory. This dataset is used in the data conversion example." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "download_dataset(\"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-python.tar.gz\", \"./datasets\")" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "The directory structure of the extracted dataset file is as follows:\n", "\n", "```text\n", "./datasets/cifar-10-batches-py\n", "├── batches.meta\n", "├── data_batch_1\n", "├── data_batch_2\n", "├── data_batch_3\n", "├── data_batch_4\n", "├── data_batch_5\n", "├── readme.html\n", "└── test_batch\n", "```" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Optimizing the Data Loading Performance\n", "\n", "MindSpore provides multiple data loading methods, including common dataset loading, user-defined dataset loading, and the MindSpore data format loading. 
The dataset loading performance varies depending on the underlying implementation method.\n", "\n", "| | Common Dataset | User-defined Dataset | MindRecord Dataset |\n", "| :----: | :----: | :----: | :----: |\n", "| Underlying implementation | C++ | Python | C++ |\n", "| Performance | High | Medium | High |\n", "\n", "### Performance Optimization Solution" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "![data-loading-performance-scheme](https://gitee.com/mindspore/docs/raw/r1.6/docs/mindspore/programming_guide/source_en/images/data_loading_performance_scheme.png)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Suggestions on data loading performance optimization are as follows:\n", "\n", "- Built-in loading operators are preferred for supported dataset formats. For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/api/en/r1.6/api_python/mindspore.dataset.html). If the performance cannot meet the requirements, use the multi-thread concurrency solution. For details, see [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#multi-thread-optimization-solution).\n",
"- For a dataset format that is not supported, convert the dataset to the MindSpore data format and then use the `MindDataset` class to load it. For details, see the [MindDataset API](https://www.mindspore.cn/docs/api/en/r1.6/api_python/dataset/mindspore.dataset.MindDataset.html) and [Converting Dataset to MindRecord](https://www.mindspore.cn/docs/programming_guide/en/r1.6/convert_dataset.html). If the performance cannot meet the requirements, use the multi-thread concurrency solution. For details, see [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#multi-thread-optimization-solution).\n",
"- For a dataset format that is not supported, the user-defined `GeneratorDataset` class is preferred for fast algorithm verification. For details, see the [GeneratorDataset API](https://www.mindspore.cn/docs/api/en/r1.6/api_python/dataset/mindspore.dataset.GeneratorDataset.html). If the performance cannot meet the requirements, use the multi-process concurrency solution. For details, see [Multi-process Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#multi-process-optimization-solution).\n", "\n", "### Code Example\n", "\n", "Based on the preceding suggestions, the following examples load data with the built-in `Cifar10Dataset` class (see the [API](https://www.mindspore.cn/docs/api/en/r1.6/api_python/dataset/mindspore.dataset.Cifar10Dataset.html) for details), the `MindDataset` class after data conversion, and the `GeneratorDataset` class. The sample code is as follows:\n", "\n", "1. Use the `Cifar10Dataset` class of built-in operators to load the CIFAR-10 dataset in binary format. The multi-thread optimization solution is used for data loading. Four threads are enabled to concurrently complete the task. Finally, a dictionary iterator is created for the data and a data record is read through the iterator." 
], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "cifar10_path = \"./datasets/cifar-10-batches-bin/train\"\n", "\n", "# create Cifar10Dataset for reading data\n", "cifar10_dataset = ds.Cifar10Dataset(cifar10_path, num_parallel_workers=4)\n", "# create a dictionary iterator and read a data record through the iterator\n", "print(next(cifar10_dataset.create_dict_iterator()))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'image': Tensor(shape=[32, 32, 3], dtype=UInt8, value=\n", "[[[209, 206, 192],\n", " [211, 209, 201],\n", " [221, 217, 213],\n", " ...\n", " [172, 175, 194],\n", " [169, 173, 190],\n", " [115, 121, 145]],\n", " [[226, 230, 211],\n", " [227, 229, 218],\n", " [230, 232, 221],\n", " ...\n", " [153, 153, 171],\n", " [156, 156, 173],\n", " [106, 111, 129]],\n", " [[214, 226, 203],\n", " [214, 222, 204],\n", " [217, 227, 206],\n", " ...\n", " [167, 166, 176],\n", " [147, 147, 156],\n", " [ 78, 84, 96]],\n", " ...\n", " [[ 40, 69, 61],\n", " [ 37, 63, 57],\n", " [ 43, 68, 66],\n", " ...\n", " [ 55, 70, 69],\n", " [ 40, 54, 51],\n", " [ 27, 44, 36]],\n", " [[ 33, 61, 50],\n", " [ 37, 65, 56],\n", " [ 54, 72, 74],\n", " ...\n", " [ 47, 60, 56],\n", " [ 58, 66, 64],\n", " [ 36, 50, 46]],\n", " [[ 29, 41, 37],\n", " [ 38, 60, 59],\n", " [ 51, 76, 81],\n", " ...\n", " [ 32, 51, 43],\n", " [ 47, 61, 54],\n", " [ 56, 67, 66]]]), 'label': Tensor(shape=[], dtype=UInt32, value= 5)}\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "2. Use the `Cifar10ToMR` class to convert the CIFAR-10 dataset into the MindSpore data format. In this example, the CIFAR-10 dataset in Python file format is used. Then use the `MindDataset` class to load the dataset in the MindSpore data format. The multi-thread optimization solution is used for data loading. Four threads are enabled to concurrently complete the task. Finally, a dictionary iterator is created for data and a data record is read through the iterator." 
], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "import os\n", "from mindspore.mindrecord import Cifar10ToMR\n", "\n", "trans_path = \"./transform/\"\n", "\n", "if not os.path.exists(trans_path):\n", " os.mkdir(trans_path)\n", "\n", "os.system(\"rm -f {}cifar10*\".format(trans_path))\n", "\n", "cifar10_path = './datasets/cifar-10-batches-py'\n", "cifar10_mindrecord_path = './transform/cifar10.record'\n", "\n", "cifar10_transformer = Cifar10ToMR(cifar10_path, cifar10_mindrecord_path)\n", "# execute transformation from CIFAR-10 to MindRecord\n", "cifar10_transformer.transform(['label'])\n", "\n", "# create MindDataset for reading data\n", "cifar10_mind_dataset = ds.MindDataset(dataset_files=cifar10_mindrecord_path, num_parallel_workers=4)\n", "# create a dictionary iterator and read a data record through the iterator\n", "print(next(cifar10_mind_dataset.create_dict_iterator()))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'data': Tensor(shape=[1283], dtype=UInt8, value= [255, 216, 255, 224, 0, 16, 74, 70, 73, 70, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 255, 219, 0, 67, \n", " 107, 249, 17, 58, 213, 185, 117, 181, 143, 255, 217]), 'id': Tensor(shape=[], dtype=Int64, value= 32476), 'label': Tensor(shape=[], dtype=Int64, value= 9)}\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "3. The `GeneratorDataset` class is used to load the user-defined dataset, and the multi-process optimization solution is used. Four processes are enabled to concurrently complete the task. Finally, a dictionary iterator is created for the data, and a data record is read through the iterator." 
], "metadata": {} }, { "cell_type": "code", "execution_count": 7, "source": [ "def generator_func(num):\n", "    for i in range(num):\n", "        yield (np.array([i]),)\n", "\n", "# create a GeneratorDataset object for reading data\n", "dataset = ds.GeneratorDataset(source=generator_func(5), column_names=[\"data\"], num_parallel_workers=4)\n", "# create a dictionary iterator and read a data record through the iterator\n", "print(next(dataset.create_dict_iterator()))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'data': Tensor(shape=[1], dtype=Int64, value= [0])}\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Optimizing the Shuffle Performance\n", "\n", "The shuffle operation is used to shuffle ordered datasets or repeated datasets. MindSpore provides the `shuffle` function for users. A larger value of `buffer_size` indicates a higher shuffling degree but consumes more time and computing resources. This API allows users to shuffle the data at any point in the pipeline. For details, see [shuffle](https://www.mindspore.cn/docs/programming_guide/en/r1.6/pipeline.html#shuffle). 
However, because the underlying implementations differ, this function is slower than setting the `shuffle` parameter of the [Built-in Loading Operators](https://www.mindspore.cn/docs/api/en/r1.6/api_python/mindspore.dataset.html) to shuffle data directly during loading.\n", "\n", "### Performance Optimization Solution" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "![shuffle-performance-scheme](https://gitee.com/mindspore/docs/raw/r1.6/docs/mindspore/programming_guide/source_en/images/shuffle_performance_scheme.png)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Suggestions on shuffle performance optimization are as follows:\n", "\n", "- Use the `shuffle` parameter of built-in loading operators to shuffle data.\n", "- If the `shuffle` function is used and the performance still cannot meet the requirements, adjust the value of the `buffer_size` parameter. A smaller value costs less time and fewer computing resources but shuffles the data less thoroughly.\n", "\n", "### Code Example\n", "\n", "Based on the preceding shuffle performance optimization suggestions, the `shuffle` parameter of the built-in `Cifar10Dataset` class and the `shuffle` function are used to shuffle data. The sample code is as follows:\n", "\n", "1. Use the `Cifar10Dataset` class of built-in operators to load the CIFAR-10 dataset. In this example, the CIFAR-10 dataset in binary format is used, and the `shuffle` parameter is set to `True` to shuffle the data. Finally, a dictionary iterator is created for the data and a data record is read through the iterator." 
], "metadata": {} }, { "cell_type": "code", "execution_count": 8, "source": [ "cifar10_path = \"./datasets/cifar-10-batches-bin/train\"\n", "\n", "# create Cifar10Dataset for reading data\n", "cifar10_dataset = ds.Cifar10Dataset(cifar10_path, shuffle=True)\n", "# create a dictionary iterator and read a data record through the iterator\n", "print(next(cifar10_dataset.create_dict_iterator()))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'image': Tensor(shape=[32, 32, 3], dtype=UInt8, value=\n", "[[[119, 193, 196],\n", " [121, 192, 204],\n", " [123, 193, 209],\n", " ...\n", " [110, 168, 177],\n", " [109, 167, 176],\n", " [110, 168, 178]],\n", " [[110, 188, 199],\n", " [109, 185, 202],\n", " [111, 186, 204],\n", " ...\n", " [107, 173, 179],\n", " [107, 173, 179],\n", " [109, 175, 182]],\n", " [[110, 186, 200],\n", " [108, 183, 199],\n", " [110, 184, 199],\n", " ...\n", " [115, 183, 189],\n", " [117, 185, 190],\n", " [117, 185, 191]],\n", " ...\n", " [[210, 253, 250],\n", " [212, 251, 250],\n", " [214, 250, 249],\n", " ...\n", " [194, 247, 247],\n", " [190, 246, 245],\n", " [184, 245, 244]],\n", " [[215, 253, 251],\n", " [218, 252, 250],\n", " [220, 251, 249],\n", " ...\n", " [200, 248, 248],\n", " [195, 247, 245],\n", " [189, 245, 244]],\n", " [[216, 253, 253],\n", " [222, 251, 250],\n", " [225, 250, 249],\n", " ...\n", " [204, 249, 248],\n", " [200, 246, 244],\n", " [196, 245, 244]]]), 'label': Tensor(shape=[], dtype=UInt32, value= 0)}\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "2. Use the `shuffle` function to shuffle data. Set `buffer_size` to 3 and use the `GeneratorDataset` class to generate data." 
], "metadata": {} }, { "cell_type": "code", "execution_count": 9, "source": [ "def generator_func():\n", "    for i in range(5):\n", "        yield (np.array([i, i+1, i+2, i+3, i+4]),)\n", "\n", "ds1 = ds.GeneratorDataset(source=generator_func, column_names=[\"data\"])\n", "print(\"before shuffle:\")\n", "for data in ds1.create_dict_iterator():\n", "    print(data[\"data\"])\n", "\n", "ds2 = ds1.shuffle(buffer_size=3)\n", "print(\"after shuffle:\")\n", "for data in ds2.create_dict_iterator():\n", "    print(data[\"data\"])" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "before shuffle:\n", "[0 1 2 3 4]\n", "[1 2 3 4 5]\n", "[2 3 4 5 6]\n", "[3 4 5 6 7]\n", "[4 5 6 7 8]\n", "after shuffle:\n", "[2 3 4 5 6]\n", "[0 1 2 3 4]\n", "[1 2 3 4 5]\n", "[4 5 6 7 8]\n", "[3 4 5 6 7]\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Optimizing the Data Augmentation Performance\n", "\n", "During image classification training, especially when the dataset is small, users can use data augmentation to preprocess images and enrich the dataset. MindSpore provides multiple data augmentation methods, including:\n", "\n", "- Use the built-in C operator (`c_transforms` module) to perform data augmentation.\n", "- Use the built-in Python operator (`py_transforms` module) to perform data augmentation.\n", "- Define Python functions as needed to perform data augmentation.\n", "\n", "For details, see [Data Augmentation](https://www.mindspore.cn/docs/programming_guide/en/r1.6/augmentation.html). 
The performance varies according to the underlying implementation methods.\n", "\n", "| Module | Underlying API | Description |\n", "| :----: | :----: | :----: |\n", "| c_transforms | C++ (based on OpenCV) | High performance |\n", "| py_transforms | Python (based on PIL) | This module provides multiple image augmentation functions and the method for converting PIL images into NumPy arrays |\n", "\n", "### Performance Optimization Solution" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "![data-enhancement-performance-scheme](https://gitee.com/mindspore/docs/raw/r1.6/docs/mindspore/programming_guide/source_en/images/data_enhancement_performance_scheme.png)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Suggestions on data augmentation performance optimization are as follows:\n", "\n", "- The `c_transforms` module is preferentially used to perform data augmentation for its highest performance. If the performance cannot meet the requirements, refer to [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#multi-thread-optimization-solution), [Compose Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#compose-optimization-solution), or [Operator Fusion Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#operator-fusion-optimization-solution).\n", "- If the `py_transforms` module is used to perform data augmentation and the performance still cannot meet the requirements, refer to [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#multi-thread-optimization-solution), [Multi-process Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#multi-process-optimization-solution), [Compose Optimization 
Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#compose-optimization-solution), or [Operator Fusion Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#operator-fusion-optimization-solution).\n", "- The `c_transforms` module maintains buffer management in C++, and the `py_transforms` module maintains buffer management in Python. Because switching between Python and C++ has a performance cost, it is advised not to mix the two operator types.\n", "- If user-defined Python functions are used to perform data augmentation and the performance still cannot meet the requirements, use the [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#multi-thread-optimization-solution) or [Multi-process Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/r1.6/optimize_data_processing.html#multi-process-optimization-solution). If the performance still cannot be improved, optimize the user-defined Python code.\n", "\n", "MindSpore also allows the data augmentation methods in the `c_transforms` and `py_transforms` modules to be used together. However, because the two modules have different underlying implementations, excessive mixing increases resource overhead and reduces processing performance. It is recommended to use operators from only `c_transforms` or only `py_transforms`, or to use the operators of one module first and then those of the other. Do not switch frequently between the data augmentation interfaces of the two modules.\n", "\n", "### Code Example\n", "\n", "Based on the preceding suggestions of data augmentation performance optimization, the `c_transforms` module and a user-defined Python function are used to perform data augmentation. The sample code is as follows:\n", "\n", "1. 
The `c_transforms` module is used to perform data augmentation. During data augmentation, the multi-thread optimization solution is used. Four threads are enabled to concurrently complete the task. The operator fusion optimization solution is also used: the `RandomResizedCrop` fusion class replaces the separate `RandomResize` and `RandomCrop` classes." ], "metadata": {} }, { "cell_type": "code", "execution_count": 10, "source": [ "import mindspore.dataset.vision.c_transforms as C\n", "import matplotlib.pyplot as plt\n", "\n", "cifar10_path = \"./datasets/cifar-10-batches-bin/train\"\n", "\n", "# create Cifar10Dataset for reading data\n", "cifar10_dataset = ds.Cifar10Dataset(cifar10_path, num_parallel_workers=4)\n", "transforms = C.RandomResizedCrop((800, 800))\n", "# apply the transform to the dataset through dataset.map()\n", "cifar10_dataset = cifar10_dataset.map(operations=transforms, input_columns=\"image\", num_parallel_workers=4)\n", "\n", "data = next(cifar10_dataset.create_dict_iterator())\n", "plt.imshow(data[\"image\"].asnumpy())\n", "plt.show()" ], "outputs": [ { "output_type": "display_data", "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" } } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "2. A user-defined Python function is used to perform data augmentation. During data augmentation, the multi-process optimization solution is used, and four processes are enabled to concurrently complete the task." ], "metadata": {} }, { "cell_type": "code", "execution_count": 11, "source": [ "def generator_func():\n", "    for i in range(5):\n", "        yield (np.array([i, i+1, i+2, i+3, i+4]),)\n", "\n", "ds3 = ds.GeneratorDataset(source=generator_func, column_names=[\"data\"])\n", "print(\"before map:\")\n", "for data in ds3.create_dict_iterator():\n", "    print(data[\"data\"])\n", "\n", "func = lambda x: x**2\n", "ds4 = ds3.map(operations=func, input_columns=\"data\", python_multiprocessing=True, num_parallel_workers=4)\n", "print(\"after map:\")\n", "for data in ds4.create_dict_iterator():\n", "    print(data[\"data\"])" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "before map:\n", "[0 1 2 3 4]\n", "[1 2 3 4 5]\n", "[2 3 4 5 6]\n", "[3 4 5 6 7]\n", "[4 5 6 7 8]\n", "after map:\n", "[ 0 1 4 9 16]\n", "[ 1 4 9 16 25]\n", "[ 4 9 16 25 36]\n", "[ 9 16 25 36 49]\n", "[16 25 36 49 64]\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Optimizing the Operating System Performance\n", "\n", "Data processing is performed on the host. Therefore, configurations of the host or operating system may affect the performance of data processing. Major factors include storage, NUMA architecture, and CPU (computing resources).\n", "\n", "1. Storage\n", "\n", "    The data loading process involves frequent disk operations, and the performance of disk reading and writing directly affects the speed of data loading. Solid-state drives (SSDs) are recommended for storing large datasets because they reduce the impact of I/O on data processing.\n", "\n", "    > In most cases, after a dataset is loaded, it is stored in the page cache of the operating system. 
To some extent, this reduces I/O overheads and accelerates reading in subsequent epochs.\n", "\n", "2. NUMA architecture\n", "\n", "    NUMA (Non-Uniform Memory Access) was developed to solve the scalability problem of traditional symmetric multiprocessor (SMP) systems. A NUMA system has multiple memory buses. Several processors are connected to one block of memory through a memory bus to form a group, so the whole system is divided into several groups. Each group is called a node. Memory that belongs to a node is called its local memory, and memory that belongs to other nodes is called foreign memory. The latency of a node accessing its local memory differs from that of accessing foreign memory, so cross-node memory access should be avoided during data processing. Generally, the following command can be used to bind a process to a node:\n", "\n", "    ```bash\n", "    numactl --cpubind=0 --membind=0 python train.py\n", "    ```\n", "\n", "    The example above binds the `train.py` process to NUMA node 0." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "3. CPU (computing resource)\n", "\n", "    Although multi-threading can accelerate data processing, it does not guarantee that CPU computing resources are fully utilized. Manually configuring computing resources in advance can improve CPU utilization to a certain extent.\n", "\n", "    - Resource allocation\n", "\n", "        In distributed training, multiple training processes are run on one device. These training processes allocate and compete for computing resources based on the policy of the operating system. When there is a large number of processes, data processing performance may deteriorate due to resource contention. 
In some cases, users need to manually allocate resources to avoid resource contention.\n", "\n", "        ```bash\n", "        numactl --cpubind=0 python train.py\n", "        ```\n", "\n", "        or\n", "\n", "        ```bash\n", "        taskset -c 0-15 python train.py\n", "        ```\n", "\n", "        > The `numactl` method directly specifies the NUMA node ID. The `taskset` method allows finer control by specifying CPU cores within a NUMA node. In this example, the core IDs range from 0 to 15.\n", "\n", "    - CPU frequency\n", "\n", "        The CPU frequency setting is critical to maximizing the computing power of the host CPU. Generally, the Linux kernel supports tuning the CPU frequency to reduce power consumption. Power consumption can be reduced to varying degrees by selecting power management policies for different system idle states. However, lower power consumption means slower CPU wake-up, which in turn degrades performance. Therefore, if the CPU's power setting is in the conservative or powersave mode, the `cpupower` command can be used to switch to the performance mode, which can significantly improve data processing performance.\n", "\n", "        ```bash\n", "        cpupower frequency-set -g performance\n", "        ```" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Performance Optimization Solution Summary\n", "\n", "### Multi-thread Optimization Solution\n", "\n", "During the data pipeline process, the number of threads for related operators can be set to improve the concurrency and performance. If the user does not manually specify the `num_parallel_workers` parameter, each data processing operation will use 8 sub-threads for concurrent processing by default. 
For example:\n", "\n", "- During data loading, the `num_parallel_workers` parameter in the built-in data loading class is used to set the number of threads.\n", "- During data augmentation, the `num_parallel_workers` parameter in the `map` function is used to set the number of threads.\n", "- During batch processing, the `num_parallel_workers` parameter in the `batch` function is used to set the number of threads.\n", "\n", "For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/api/en/r1.6/api_python/mindspore.dataset.html).\n", "\n", "When using MindSpore for standalone or distributed training, set the `num_parallel_workers` parameter according to the following principles:\n", "\n", "- The sum of the `num_parallel_workers` values set for all data loading and processing operations should not exceed the number of CPU cores of the machine. Otherwise, the operations compete with each other for resources.\n", "- Before setting the `num_parallel_workers` parameter, it is recommended to use MindSpore's Profiler (performance analysis) tool to analyze the performance of each operation in the training, and allocate more resources to the operations with poor performance, that is, set a larger `num_parallel_workers` for them, to balance the throughput between operations and avoid unnecessary waiting.\n", "- In a standalone training scenario, increasing the `num_parallel_workers` parameter can often directly improve processing performance. In a distributed scenario, however, CPU competition increases, so blindly increasing `num_parallel_workers` may degrade performance. Try several values and choose a compromise.\n", "\n", "### Multi-process Optimization Solution\n", "\n", "During data processing, operators implemented by Python support the multi-process mode. For example:\n", "\n", "- By default, the `GeneratorDataset` class is in multi-process mode. 
The `num_parallel_workers` parameter indicates the number of enabled processes. The default value is 1. For details, see [GeneratorDataset](https://www.mindspore.cn/docs/api/en/r1.6/api_python/dataset/mindspore.dataset.GeneratorDataset.html).\n", "- If a user-defined Python function or the `py_transforms` module is used to perform data augmentation, the meaning of the `num_parallel_workers` parameter of the `map` function depends on the `python_multiprocessing` parameter. When `python_multiprocessing` is set to True, `num_parallel_workers` indicates the number of processes. When it is False (the default value), `num_parallel_workers` indicates the number of threads. For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/api/en/r1.6/api_python/mindspore.dataset.html).\n", "\n", "### Compose Optimization Solution\n", "\n", "A Map operator can receive a list of Tensor operators and apply them in the given order. Compared with using a separate Map operator for each Tensor operator, such a Fat Map operator achieves better performance, as shown in the following figure:" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "![compose](https://gitee.com/mindspore/docs/raw/r1.6/docs/mindspore/programming_guide/source_en/images/compose.png)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Operator Fusion Optimization Solution\n", "\n", "Some fusion operators are provided to combine the functions of two or more operators into one operator. For details, see [Augmentation Operators](https://www.mindspore.cn/docs/api/en/r1.6/api_python/mindspore.dataset.vision.html). Compared with pipelines built from their component operators, such fusion operators provide better performance. 
As shown in the figure:\n", "\n", "![operator-fusion](https://gitee.com/mindspore/docs/raw/r1.6/docs/mindspore/programming_guide/source_en/images/operator_fusion.png)\n", "\n", "### Operating System Optimization Solution\n", "\n", "- Use Solid State Drives to store the data.\n", "- Bind the process to a NUMA node.\n", "- Manually allocate more computing resources.\n", "- Set a higher CPU frequency.\n", "\n", "## References\n", "\n", "[1] Alex Krizhevsky. [Learning Multiple Layers of Features from Tiny Images](http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf)." ], "metadata": {} } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }