{ "cells": [ { "cell_type": "markdown", "source": [ "# Converting Dataset to MindRecord\r\n", "\r\n", "`Ascend` `GPU` `CPU` `Data Preparation`\r\n", "\r\n", "[![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_source_en.png)](https://gitee.com/mindspore/docs/blob/r1.5/docs/mindspore/programming_guide/source_en/convert_dataset.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_notebook_en.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.5/programming_guide/en/mindspore_convert_dataset.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_modelarts_en.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9tYXN0ZXIvcHJvZ3JhbW1pbmdfZ3VpZGUvZW4vbWluZHNwb3JlX2NvbnZlcnRfZGF0YXNldC5pcHluYg==&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Overview\n", "\n", "Users can convert non-standard datasets and common datasets into the MindSpore data format, MindRecord, so that they can be easily loaded to MindSpore for training. In addition, the performance of MindSpore in some scenarios is optimized, which delivers better user experience when you use datasets in the MindSpore data format.\n", "\n", "The MindSpore data format has the following features:\n", "\n", "1. Unified storage and access of user data are implemented, simplifying training data loading.\n", "2. Data is aggregated for storage, which can be efficiently read, managed and moved.\n", "3. Data encoding and decoding are efficient and transparent to users.\n", "4. The partition size is flexibly controlled to implement distributed training.\n", "\n", "The mindspore data format aims to normalize the datasets of users to MindRecord, which can be further loaded through the `MindDataset` and used in the training procedure (Please refer to the [API](https://www.mindspore.cn/docs/api/en/r1.5/api_python/dataset/mindspore.dataset.MindDataset.html) for detailed use)." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "![data-conversion-concept](https://gitee.com/mindspore/docs/raw/r1.5/docs/mindspore/programming_guide/source_en/images/data_conversion_concept.png)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "A MindRecord file consists of data files and index files. Data files and index files do not support renaming for now.\n", "\n", "- Data file\n", "\n", " A data file contains a file header, scalar data pages and block data pages for storing normalized training data. It is recommended that the size of a single MindRecord file does not exceed 20 GB. Users can break up a large dataset and store the dataset into multiple MindRecord files.\n", "\n", "- Index file\n", "\n", " An index file contains the index information generated based on scalar data (such as image labels and image file names), used for convenient data fetching and storing statistical data about the dataset." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "![mindrecord](https://gitee.com/mindspore/docs/raw/r1.5/docs/mindspore/programming_guide/source_en/images/mindrecord.png)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "A data file consists of the following key parts:\n", "\n", "- File Header\n", "\n", " The file header stores the file header size, scalar data page size, block data page size, schema, index fields, statistics, file partition information, and mapping between scalar data and block data. It is the metadata of the MindRecord file.\n", "\n", "- Scalar data page\n", "\n", " The scalar data page is used to store integer, string and floating point data, such as the label of an image, file name of an image, and length, width of an image. The information suitable for storage with scalars is stored here.\n", "\n", "- Block data page\n", "\n", " The block data page is used to store data such as binary strings and NumPy arrays. Additional examples include converted python dictionaries generated from texts and binary image files.\n", "\n", "## Converting Dataset to MindRecord\n", "\n", "The following tutorial demonstrates how to convert image data and its annotations to MindRecord. For more instructions on MindSpore data format conversion, please refer to the [MindSpore Data Format Conversion](https://www.mindspore.cn/docs/programming_guide/en/r1.5/dataset_conversion.html) chapter in the programming guide.\n", "\n", "Example 1: Show how to convert data into a MindRecord data file according to the defined dataset structure.\n", "\n", "1. Import the `FileWriter` class for file writing." ], "metadata": {} }, { "cell_type": "code", "execution_count": 1, "source": [ " from mindspore.mindrecord import FileWriter" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "2. Define a dataset schema which defines dataset fields and field types." ], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ " cv_schema_json = {\"file_name\": {\"type\": \"string\"}, \"label\": {\"type\": \"int32\"}, \"data\": {\"type\": \"bytes\"}}" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ " Schema mainly contains `name`, `type` and `shape`:\n", " - `name`: field names, consist of letters, digits and underscores.\n", " - `type`: field types, include int32, int64, float32, float64, string and bytes.\n", " - `shape`: [-1] for one-dimensional array, [m, n, ...] for higher dimensional array in which m and n represent the dimensions. \n", "\n", " > - The type of a field with the `shape` attribute can only be int32, int64, float32, or float64.\n", " > - If the field has the `shape` attribute, only data in `numpy.ndarray` type can be transferred to the `write_raw_data` API.\n", "\n", "3. Prepare the data sample list to be written based on the user-defined schema format. Binary data of the images is transferred below." ], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ " data = [{\"file_name\": \"1.jpg\", \"label\": 0, \"data\": b\"\\x10c\\xb3w\\xa8\\xee$o&\\xd4\\x00\\xf8\\x129\\x15\\xd9\\xf2q\\xc0\\xa2\\x91YFUO\\x1dsE1\\x1ep\"},\n", " {\"file_name\": \"3.jpg\", \"label\": 99, \"data\": b\"\\xaf\\xafU<\\xb8|6\\xbd}\\xc1\\x99[\\xeaj+\\x8f\\x84\\xd3\\xcc\\xa0,i\\xbb\\xb9-\\xcdz\\xecp{T\\xb1\\xdb\"}]" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "4. Adding index fields can accelerate data loading. This step is optional." ], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ " indexes = [\"file_name\", \"label\"]" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "5. Create a `FileWriter` object, transfer the file name and number of slices, add the schema and index, call the `write_raw_data` API to write data, and call the `commit` API to generate a local data file." ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ " writer = FileWriter(file_name=\"test.mindrecord\", shard_num=4)\n", " writer.add_schema(cv_schema_json, \"test_schema\")\n", " writer.add_index(indexes)\n", " writer.write_raw_data(data)\n", " writer.commit()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "metadata": {}, "execution_count": 5 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ " This example will generate `test.mindrecord0`, `test.mindrecord0.db`, `test.mindrecord1`, `test.mindrecord1.db`, `test.mindrecord2`, `test.mindrecord2.db`, `test.mindrecord3`, `test.mindrecord3.db`, totally eight files, called MindRecord datasets. `test.mindrecord0` and `test.mindrecord0.db` are collectively referred to as a MindRecord file, where `test.mindrecord0` is the data file and `test.mindrecord0.db` is the index file.\n", "\n", " **Interface Description:**\n", " - `write_raw_data`: write data to memory.\n", " - `commit`: write data in memory to disk.\n", "\n", "6. For adding data to the existing data format file, call the `open_for_append` API to open the existing data file, call the `write_raw_data` API to write new data, and then call the `commit` API to generate a local data file." ], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ " writer = FileWriter.open_for_append(\"test.mindrecord0\")\n", " writer.write_raw_data(data)\n", " writer.commit()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "metadata": {}, "execution_count": 6 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Example 2: Convert a picture in `jpg` format into a MindRecord dataset according to the method in Example 1.\n", "\n", "Download the image data `transform.jpg` that needs to be processed as the raw data to be processed.\n", "\n", "Create a folder directory `./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/` to store all the converted datasets in this experience.\n", "\n", "Create a folder directory `./datasets/convert_dataset_to_mindrecord/images/` to store the downloaded image data.\n", "\n", "Execute the following command in jupyter notebook:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "!wget -N https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/transform.jpg --no-check-certificate\n", "!mkdir -p ./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/\n", "!mkdir -p ./datasets/convert_dataset_to_mindrecord/images/\n", "!mv -f ./transform.jpg ./datasets/convert_dataset_to_mindrecord/images/" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Execute the following code to convert the downloaded `transform.jpg` into a MindRecord dataset." ], "metadata": {} }, { "cell_type": "code", "execution_count": 8, "source": [ "# step 1 import class FileWriter\n", "import os\n", "from mindspore.mindrecord import FileWriter\n", "\n", "# clean up old run files before in Linux\n", "data_path = './datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/'\n", "os.system('rm -f {}test.*'.format(data_path))\n", "\n", "# import FileWriter class ready to write data\n", "data_record_path = './datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/test.mindrecord'\n", "writer = FileWriter(file_name=data_record_path,shard_num=4)\n", "\n", "# define the data type\n", "data_schema = {\"file_name\":{\"type\":\"string\"},\"label\":{\"type\":\"int32\"},\"data\":{\"type\":\"bytes\"}}\n", "writer.add_schema(data_schema,\"test_schema\")\n", "\n", "# prepeare the data contents\n", "file_name = \"./datasets/convert_dataset_to_mindrecord/images/transform.jpg\"\n", "with open(file_name, \"rb\") as f:\n", " bytes_data = f.read()\n", "data = [{\"file_name\":\"transform.jpg\", \"label\":1, \"data\":bytes_data}]\n", "\n", "# add index field\n", "indexes = [\"file_name\",\"label\"]\n", "writer.add_index(indexes)\n", "\n", "# save data to the files\n", "writer.write_raw_data(data)\n", "writer.commit()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "metadata": {}, "execution_count": 8 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "This example will generate 8 files, which become the MindRecord dataset. `test.mindrecord0` and `test.mindrecord0.db` are called 1 MindRecord files, where `test.mindrecord0` is the data file, and `test.mindrecord0.db` is the index file. The generated files are as follows:\n", "\n", "```text\n", "./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/\n", "├── test.mindrecord0\n", "├── test.mindrecord0.db\n", "├── test.mindrecord1\n", "├── test.mindrecord1.db\n", "├── test.mindrecord2\n", "├── test.mindrecord2.db\n", "├── test.mindrecord3\n", "└── test.mindrecord3.db\n", "```" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Loading MindRecord Dataset\n", "\n", "The following tutorial briefly demonstrates how to load the MindRecord dataset using the `MindDataset`.\n", "\n", "1. Import the `dataset` for dataset loading." ], "metadata": {} }, { "cell_type": "code", "execution_count": 10, "source": [ " import mindspore.dataset as ds" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "2. Use the `MindDataset` to load the MindRecord dataset." ], "metadata": {} }, { "cell_type": "code", "execution_count": 11, "source": [ " data_set = ds.MindDataset(dataset_file=\"test.mindrecord0\") # read full dataset\n", " count = 0\n", " for item in data_set.create_dict_iterator(output_numpy=True):\n", " print(\"sample: {}\".format(item))\n", " count += 1\n", " print(\"Got {} samples\".format(count))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "sample: {'data': array([175, 175, 85, 60, 184, 124, 54, 189, 125, 193, 153, 91, 234,\n", " 106, 43, 143, 132, 211, 204, 160, 44, 105, 187, 185, 45, 205,\n", " 122, 236, 112, 123, 84, 177, 219], dtype=uint8), 'file_name': array(b'3.jpg', dtype='|S5'), 'label': array(99, dtype=int32)}\n", "sample: {'data': array([ 16, 99, 179, 119, 168, 238, 36, 111, 38, 60, 113, 140, 142,\n", " 40, 162, 144, 144, 150, 188, 177, 30, 212, 81, 69, 82, 19,\n", " 63, 255, 217], dtype=uint8), 'file_name': array(b'1.jpg', dtype='|S5'), 'label': array(0, dtype=int32)}\n", "sample: {'data': array([175, 175, 85, 60, 184, 124, 54, 189, 125, 193, 153, 91, 234,\n", " 106, 43, 143, 132, 211, 204, 160, 44, 105, 187, 185, 45, 205,\n", " 122, 236, 112, 123, 84, 177, 219], dtype=uint8), 'file_name': array(b'3.jpg', dtype='|S5'), 'label': array(99, dtype=int32)}\n", "sample: {'data': array([230, 218, 209, 174, 7, 184, 62, 212, 0, 248, 18, 57, 21,\n", " 217, 242, 113, 192, 162, 145, 89, 70, 85, 79, 29, 115, 69,\n", " 49, 30, 112], dtype=uint8), 'file_name': array(b'2.jpg', dtype='|S5'), 'label': array(56, dtype=int32)}\n", "sample: {'data': array([ 16, 99, 179, 119, 168, 238, 36, 111, 38, 60, 113, 140, 142,\n", " 40, 162, 144, 144, 150, 188, 177, 30, 212, 81, 69, 82, 19,\n", " 63, 255, 217], dtype=uint8), 'file_name': array(b'1.jpg', dtype='|S5'), 'label': array(0, dtype=int32)}\n", "sample: {'data': array([230, 218, 209, 174, 7, 184, 62, 212, 0, 248, 18, 57, 21,\n", " 217, 242, 113, 192, 162, 145, 89, 70, 85, 79, 29, 115, 69,\n", " 49, 30, 112], dtype=uint8), 'file_name': array(b'2.jpg', dtype='|S5'), 'label': array(56, dtype=int32)}\n", "Got 6 samples\n" ] } ], "metadata": {} } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 5 }