{ "cells": [ { "cell_type": "markdown", "source": [ "# Converting Dataset to MindRecord\n", "\n", "[![Download Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_notebook_en.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.8/tutorials/en/advanced/dataset/mindspore_record.ipynb) [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_source_en.png)](https://gitee.com/mindspore/docs/blob/r1.8/tutorials/source_en/advanced/dataset/record.ipynb)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "In MindSpore, the dataset used to train the network model can be converted into MindSpore-specific format data (MindSpore Record format), making it easier to save and load data. The goal is to normalize the user's dataset and further enable the reading of the data through the MindDataset interface and use it during the training process.\n", "\n", "![conversion](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/tutorials/source_en/advanced/dataset/images/data_conversion_concept.png)\n", "\n", "In addition, the performance of MindSpore in some scenarios is optimized, and using the MindSpore Record data format can reduce disk IO and network IO overhead, which results in a better user experience.\n", "\n", "The MindSpore data format has the following features:\n", "\n", "1. Unified storage and access of user data are implemented, simplifying training data loading.\n", "2. Data is aggregated for storage, which can be efficiently read, managed and moved.\n", "3. Data encoding and decoding are efficient and transparent to users.\n", "4. The partition size is flexibly controlled to implement distributed training.\n", "\n", "## Record File Structure\n", "\n", "As shown in the following figure, a MindSpore Record file consists of a data file and an index file." 
], "metadata": {} }, { "cell_type": "markdown", "source": [ "![mindrecord](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/tutorials/source_en/advanced/dataset/images/mindrecord.png)" ], "metadata": {} },
{ "cell_type": "markdown", "source": [ "The data file contains a file header, scalar data pages, and block data pages, which store the user's training data after normalization. It is recommended that a single MindSpore Record file be smaller than 20 GB; a larger dataset can be stored as multiple MindSpore Record files.\n", "\n", "The index file contains index information generated from scalar data (such as the image label and image file name), which makes it convenient to retrieve data and collect dataset statistics.\n", "\n", "The specific purposes of the file header, scalar data pages, and block data pages in a data file are as follows:\n", "\n", "- **File header**: the meta information of the MindSpore Record file. It mainly stores the file header size, scalar data page size, block data page size, schema information, index fields, statistics, file segment information, and the correspondence between scalar data and block data.\n", "\n", "- **Scalar data page**: mainly stores integer, string, and floating-point data, such as the label of an image, its file name, and its length and width. In other words, information that is suitable to store as scalars is saved here.\n", "\n", "- **Block data page**: mainly stores binary strings, NumPy arrays, and similar data, such as the binary image files themselves and dictionaries converted into text.\n", "\n", "> Note that neither the data files nor the index files currently support being renamed."
], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Converting Dataset to Record Format\n", "\n", "The following describes how to convert CV and NLP datasets to the MindSpore Record file format and how to read MindSpore Record files through the `MindDataset` interface.\n", "\n", "### Converting CV class dataset\n", "\n", "This example uses a CV dataset containing 100 records to show how to convert a CV dataset to the MindSpore Record file format and read it through the `MindDataset` interface.\n", "\n", "First, create a dataset of 100 images and save it as a MindSpore Record file. Each sample contains three fields: `file_name` (string), `label` (integer), and `data` (binary). Then use the `MindDataset` interface to read the MindSpore Record file.\n", "\n", "1. Generate 100 images and convert them to the MindSpore Record file format." ], "metadata": {} },
{ "cell_type": "code", "execution_count": 1, "source": [ "import os\n", "from PIL import Image\n", "from io import BytesIO\n", "\n", "import mindspore.mindrecord as record\n", "\n", "\n", "# The full path to the output MindSpore Record file\n", "MINDRECORD_FILE = \"test.mindrecord\"\n", "\n", "if os.path.exists(MINDRECORD_FILE):\n", "    os.remove(MINDRECORD_FILE)\n", "    os.remove(MINDRECORD_FILE + \".db\")\n", "\n", "# Define the contained fields\n", "cv_schema = {\"file_name\": {\"type\": \"string\"},\n", "             \"label\": {\"type\": \"int32\"},\n", "             \"data\": {\"type\": \"bytes\"}}\n", "\n", "# Declare the MindSpore Record file format\n", "writer = record.FileWriter(file_name=MINDRECORD_FILE, shard_num=1)\n", "writer.add_schema(cv_schema, \"it is a cv dataset\")\n", "writer.add_index([\"file_name\", \"label\"])\n", "\n", "# Build a dataset of 100 white JPEG images with labels 1 to 100\n", "data = []\n", "for i in range(1, 101):\n", "    sample = {}\n", "    white_io = BytesIO()\n", "    Image.new('RGB', (i*10, i*10), (255, 255, 255)).save(white_io, 'JPEG')\n", "    image_bytes = white_io.getvalue()\n", "    sample['file_name'] = str(i) + \".jpg\"\n", "    sample['label'] = i\n", "    sample['data'] = image_bytes\n", "\n", "    data.append(sample)\n", "    # Write the buffered samples every 10 records\n", "    if i % 10 == 0:\n", "        writer.write_raw_data(data)\n", "        data = []\n", "\n", "# Write any remaining samples\n", "if data:\n", "    writer.write_raw_data(data)\n", "\n", "writer.commit()" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The printed result `MSRStatus.SUCCESS` above indicates that the dataset conversion succeeded. The later examples in this tutorial print the same result when a dataset is converted successfully.\n", "\n", "2. Read the MindSpore Record file through the `MindDataset` interface." ] },
{ "cell_type": "code", "execution_count": 2, "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.vision as vision\n", "\n", "# Read the MindSpore Record file\n", "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n", "decode_op = vision.Decode()\n", "data_set = data_set.map(operations=decode_op, input_columns=[\"data\"], num_parallel_workers=2)\n", "\n", "# Count the number of samples\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "### Converting NLP class dataset\n", "\n", "This example first creates a MindSpore Record file containing 100 records, where each sample contains eight fields that are all integer arrays, and then uses the `MindDataset` interface to read the MindSpore Record file.\n", "\n", "> For ease of presentation, the preprocessing step that converts text to lexicographic order is omitted here.\n", "\n", "1. Generate 100 text records and convert them to the MindSpore Record file format."
], "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import mindspore.mindrecord as record\n", "\n", "# The full path of the output MindSpore Record file\n", "MINDRECORD_FILE = \"test.mindrecord\"\n", "\n", "if os.path.exists(MINDRECORD_FILE):\n", "    os.remove(MINDRECORD_FILE)\n", "    os.remove(MINDRECORD_FILE + \".db\")\n", "\n", "# Define the fields that each sample contains\n", "nlp_schema = {\"source_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", "              \"source_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n", "              \"source_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", "              \"source_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n", "              \"target_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", "              \"target_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n", "              \"target_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", "              \"target_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]}}\n", "\n", "# Declare the MindSpore Record file format\n", "writer = record.FileWriter(file_name=MINDRECORD_FILE, shard_num=1)\n", "writer.add_schema(nlp_schema, \"Preprocessed nlp dataset.\")\n", "\n", "# Build a virtual dataset of 100 samples\n", "data = []\n", "for i in range(1, 101):\n", "    sample = {\"source_sos_ids\": np.array([i, i + 1, i + 2, i + 3, i + 4], dtype=np.int64),\n", "              \"source_sos_mask\": np.array([i * 1, i * 2, i * 3, i * 4, i * 5, i * 6, i * 7], dtype=np.int64),\n", "              \"source_eos_ids\": np.array([i + 5, i + 6, i + 7, i + 8, i + 9, i + 10], dtype=np.int64),\n", "              \"source_eos_mask\": np.array([19, 20, 21, 22, 23, 24, 25, 26, 27], dtype=np.int64),\n", "              \"target_sos_ids\": np.array([28, 29, 30, 31, 32], dtype=np.int64),\n", "              \"target_sos_mask\": np.array([33, 34, 35, 36, 37, 38], dtype=np.int64),\n", "              \"target_eos_ids\": np.array([39, 40, 41, 42, 43, 44, 45, 46, 47], dtype=np.int64),\n", "              \"target_eos_mask\": np.array([48, 49, 50, 51], dtype=np.int64)}\n", "    data.append(sample)\n", "\n", "    # Write the buffered samples every 10 records\n", "    if i % 10 == 0:\n", "        writer.write_raw_data(data)\n", "        data = []\n", "\n", "# Write any remaining samples\n", "if data:\n", "    writer.write_raw_data(data)\n", "\n", "writer.commit()" ] },
{ "cell_type": "markdown", "source": [ "2. Read the MindSpore Record file through the `MindDataset` interface." ], "metadata": {} },
{ "cell_type": "code", "execution_count": 4, "source": [ "import mindspore.dataset as ds\n", "\n", "# Read the MindSpore Record file\n", "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE, shuffle=False)\n", "\n", "# Count the number of samples\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))\n", "\n", "# Print the first 10 samples\n", "count = 0\n", "for item in data_set.create_dict_iterator():\n", "    print(\"source_sos_ids:\", item[\"source_sos_ids\"])\n", "    count += 1\n", "    if count == 10:\n", "        break" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "## Converting Other Datasets\n", "\n", "MindSpore provides tool classes that convert commonly used datasets to the MindSpore Record file format.\n", "\n", "> For more detailed descriptions of dataset conversion, refer to the [API Documentation](https://www.mindspore.cn/docs/en/r1.8/api_python/mindspore.mindrecord.html).\n", "\n", "### Converting the CIFAR-10 dataset\n", "\n", "Users can convert the CIFAR-10 raw data into the MindSpore Record format through the `Cifar10ToMR` class and read it by using the `MindDataset` interface.\n", "\n", "1. Download the [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz) and extract it to the specified directory. The following example code downloads and extracts the dataset to the specified location."
], "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mindvision import dataset\n", "\n", "# Declare the dataset download address and dataset storage path\n", "dl_path = \"./datasets\"\n", "dl_url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-python.tar.gz\"\n", "\n", "# Download and unzip the dataset\n", "dl = dataset.DownLoad()\n", "dl.download_and_extract_archive(url=dl_url, download_path=dl_path)" ] }, { "cell_type": "markdown", "source": [ "The directory structure of the extracted dataset files is as follows:\n", "\n", "```text\n", "./datasets/cifar-10-batches-py\n", "├── batches.meta\n", "├── data_batch_1\n", "├── data_batch_2\n", "├── data_batch_3\n", "├── data_batch_4\n", "├── data_batch_5\n", "├── readme.html\n", "└── test_batch\n", "```" ], "metadata": {} }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Create a `Cifar10ToMR` object, call the `transform` interface, and convert the CIFAR-10 dataset to the MindSpore Record file format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from mindspore.mindrecord import Cifar10ToMR\n", "\n", "ds_target_path = \"./datasets/mindspore_dataset_conversion/\"\n", "\n", "os.system(\"rm -f {}*\".format(ds_target_path))\n", "os.system(\"mkdir -p {}\".format(ds_target_path))\n", "\n", "# CIFAR-10 dataset path\n", "CIFAR10_DIR = \"./datasets/cifar-10-batches-py\"\n", "# Output MindSpore Record file path\n", "MINDRECORD_FILE = \"./datasets/mindspore_dataset_conversion/cifar10.mindrecord\"\n", "\n", "cifar10_transformer = Cifar10ToMR(CIFAR10_DIR, MINDRECORD_FILE)\n", "cifar10_transformer.transform(['label'])" ] }, { "cell_type": "markdown", "source": [ "3. Read the MindSpore Record file format through the `MindDataset` interface." 
], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.vision as vision\n", "\n", "# Read MindSpore Record file format\n", "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n", "decode_op = vision.Decode()\n", "data_set = data_set.map(operations=decode_op, input_columns=[\"data\"], num_parallel_workers=2)\n", "\n", "# Count the number of samples\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Converting the CSV dataset\n", "\n", "This example first creates a CSV file containing 5 records, then converts the CSV file to the MindSpore Record file format through the `CsvToMR` tool class, and finally reads it through the `MindDataset` interface.\n", "\n", "> This example relies on `pandas`, a third-party support package, and can be installed using the command `pip install pandas`. As this document runs as a Notebook, you need to restart the kernel after completing the installation to execute subsequent code.\n", "\n", "1. Generate the CSV file, and convert to MindSpore Record." 
], "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import csv\n", "import os\n", "from mindspore import mindrecord as record\n", "\n", "# The path to the CSV file\n", "CSV_FILE = \"test.csv\"\n", "# The path to the output MindSpore Record file\n", "MINDRECORD_FILE = \"test.mindrecord\"\n", "\n", "if os.path.exists(MINDRECORD_FILE):\n", "    os.remove(MINDRECORD_FILE)\n", "    os.remove(MINDRECORD_FILE + \".db\")\n", "\n", "def generate_csv():\n", "    \"\"\"Generate csv format file data\"\"\"\n", "    headers = [\"id\", \"name\", \"math\", \"english\"]\n", "    rows = [(1, \"Lily\", 78.5, 90),\n", "            (2, \"Lucy\", 99, 85.2),\n", "            (3, \"Mike\", 65, 71),\n", "            (4, \"Tom\", 95, 99),\n", "            (5, \"Jeff\", 85, 78.5)]\n", "    # newline='' lets the csv module control line endings\n", "    with open(CSV_FILE, 'w', encoding='utf-8', newline='') as f:\n", "        writer = csv.writer(f)\n", "        writer.writerow(headers)\n", "        writer.writerows(rows)\n", "\n", "# Generate csv format file data\n", "generate_csv()\n", "\n", "# Convert csv format file\n", "csv_transformer = record.CsvToMR(CSV_FILE, MINDRECORD_FILE, partition_number=1)\n", "csv_transformer.transform()\n", "\n", "# Check that both the data file and the index file were created\n", "assert os.path.exists(MINDRECORD_FILE)\n", "assert os.path.exists(MINDRECORD_FILE + \".db\")" ] },
{ "cell_type": "markdown", "source": [ "2. Read the MindSpore Record file through the `MindDataset` interface." ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mindspore.dataset as ds\n", "\n", "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n", "\n", "# Count the number of samples\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Converting the TFRecord dataset\n", "\n", "> This example requires TensorFlow to be installed in advance. Currently, only TensorFlow 1.13.0-rc1 and later are supported. 
As this document runs as a Notebook, you need to restart the kernel after completing the installation to execute subsequent code.\n", "\n", "This example first creates a TFRecord file through TensorFlow, then converts the TFRecord file into a MindSpore Record file through the `TFRecordToMR` tool class, and finally reads it through the `MindDataset` interface and uses the `Decode` function to decode the `image_bytes` field.\n", "\n", "1. Import the related modules." ] },
{ "cell_type": "code", "execution_count": 10, "source": [ "import collections\n", "from io import BytesIO\n", "import os\n", "import mindspore.dataset as ds\n", "import mindspore.mindrecord as record\n", "import mindspore.dataset.vision as vision\n", "from PIL import Image\n", "import tensorflow as tf\n", "\n", "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "2. Generate the TFRecord file." ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The path to the TFRecord file\n", "TFRECORD_FILE = \"test.tfrecord\"\n", "# The path to the output MindSpore Record file\n", "MINDRECORD_FILE = \"test.mindrecord\"\n", "\n", "def generate_tfrecord():\n", "    def create_int_feature(values):\n", "        if isinstance(values, list):\n", "            feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))\n", "        else:\n", "            feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[values]))\n", "        return feature\n", "\n", "    def create_float_feature(values):\n", "        if isinstance(values, list):\n", "            feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))\n", "        else:\n", "            feature = tf.train.Feature(float_list=tf.train.FloatList(value=[values]))\n", "        return feature\n", "\n", "    def create_bytes_feature(values):\n", "        if isinstance(values, bytes):\n", "            # For bytes input, store a generated 10x10 white JPEG image instead of the raw bytes\n", "            white_io = BytesIO()\n", "            Image.new('RGB', (10, 10), (255, 255, 255)).save(white_io, 'JPEG')\n", "            image_bytes = white_io.getvalue()\n", "            feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes]))\n", "        else:\n", "            feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[bytes(values, encoding='utf-8')]))\n", "        return feature\n", "\n", "    writer = tf.io.TFRecordWriter(TFRECORD_FILE)\n", "\n", "    example_count = 0\n", "    for i in range(10):\n", "        # Create sample data for TensorFlow\n", "        file_name = \"000\" + str(i) + \".jpg\"\n", "        image_bytes = bytes(str(\"aaaabbbbcccc\" + str(i)), encoding=\"utf-8\")\n", "        int64_scalar = i\n", "        float_scalar = float(i)\n", "        int64_list = [i, i+1, i+2, i+3, i+4, i+1234567890]\n", "        float_list = [float(i), float(i+1), float(i+2.8), float(i+3.2),\n", "                      float(i+4.4), float(i+123456.9), float(i+98765432.1)]\n", "\n", "        # Save the data in the TFRecord file format\n", "        features = collections.OrderedDict()\n", "        features[\"file_name\"] = create_bytes_feature(file_name)\n", "        features[\"image_bytes\"] = create_bytes_feature(image_bytes)\n", "        features[\"int64_scalar\"] = create_int_feature(int64_scalar)\n", "        features[\"float_scalar\"] = create_float_feature(float_scalar)\n", "        features[\"int64_list\"] = create_int_feature(int64_list)\n", "        features[\"float_list\"] = create_float_feature(float_list)\n", "\n", "        tf_example = tf.train.Example(features=tf.train.Features(feature=features))\n", "        writer.write(tf_example.SerializeToString())\n", "        example_count += 1\n", "\n", "    writer.close()\n", "    print(\"Write {} rows in tfrecord.\".format(example_count))\n", "\n", "generate_tfrecord()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "3. Convert TFRecord to MindSpore Record."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_dict = {\"file_name\": tf.io.FixedLenFeature([], tf.string),\n", " \"image_bytes\": tf.io.FixedLenFeature([], tf.string),\n", " \"int64_scalar\": tf.io.FixedLenFeature([], tf.int64),\n", " \"float_scalar\": tf.io.FixedLenFeature([], tf.float32),\n", " \"int64_list\": tf.io.FixedLenFeature([6], tf.int64),\n", " \"float_list\": tf.io.FixedLenFeature([7], tf.float32),\n", " }\n", "\n", "if os.path.exists(MINDRECORD_FILE):\n", " os.remove(MINDRECORD_FILE)\n", " os.remove(MINDRECORD_FILE + \".db\")\n", "\n", "tfrecord_transformer = record.TFRecordToMR(TFRECORD_FILE, MINDRECORD_FILE, feature_dict, [\"image_bytes\"])\n", "tfrecord_transformer.transform()\n", "\n", "assert os.path.exists(MINDRECORD_FILE)\n", "assert os.path.exists(MINDRECORD_FILE + \".db\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Read MindSpore Record through `MindDataset` interface." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n", "decode_op = vision.Decode()\n", "data_set = data_set.map(operations=decode_op, input_columns=[\"image_bytes\"], num_parallel_workers=2)\n", "\n", "# Count the number of samples\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))" ] } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 5 }