{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# 转换数据集为MindRecord\n",
    "\n",
    "`Ascend` `GPU` `CPU` `数据准备`\n",
    "\n",
    "[![在线运行](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_modelarts.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9tYXN0ZXIvcHJvZ3JhbW1pbmdfZ3VpZGUvemhfY24vbWluZHNwb3JlX2NvbnZlcnRfZGF0YXNldC5pcHluYg==&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c)&emsp;[![下载Notebook](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.6/programming_guide/zh_cn/mindspore_convert_dataset.ipynb)&emsp;[![下载样例代码](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_download_code.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.6/programming_guide/zh_cn/mindspore_convert_dataset.py)&emsp;[![查看源文件](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.6/docs/mindspore/programming_guide/source_zh_cn/convert_dataset.ipynb)"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 概述\n",
    "\n",
    "用户可以将非标准的数据集和常用的数据集转换为MindSpore数据格式，即MindRecord，从而方便地加载到MindSpore中进行训练。同时，MindSpore在部分场景做了性能优化，使用MindRecord数据格式可以获得更好的性能体验。\n",
    "\n",
    "MindSpore数据格式具备的特征如下：\n",
    "\n",
    "1. 实现多变的用户数据统一存储、访问，训练数据读取更加简便。\n",
    "2. 数据聚合存储，高效读取，且方便管理、移动。\n",
    "3. 高效的数据编解码操作，对用户透明、无感知。\n",
    "4. 可以灵活控制分区的大小，实现分布式训练。\n",
    "\n",
    "MindSpore数据格式的目标是归一化用户的数据集，并进一步通过`MindDataset`（详细使用方法参考[API](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset/mindspore.dataset.MindDataset.html)）实现数据的读取，并用于训练过程。\n",
    "\n",
    "![data-conversion-concept](https://gitee.com/mindspore/docs/raw/r1.6/docs/mindspore/programming_guide/source_zh_cn/images/data_conversion_concept.png)"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 基本概念"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "一个MindRecord文件由数据文件和索引文件组成，且数据文件及索引文件暂不支持重命名操作：\n",
    "\n",
    "- 数据文件\n",
    "\n",
    "    包含文件头、标量数据页、块数据页，用于存储用户归一化后的训练数据，且单个MindRecord文件建议小于20G，用户可将大数据集进行分片存储为多个MindRecord文件。\n",
    "\n",
    "- 索引文件\n",
    "\n",
    "    包含基于标量数据（如图像Label、图像文件名等）生成的索引信息，用于方便的检索、统计数据集信息。\n",
    "\n",
    "![mindrecord](https://gitee.com/mindspore/docs/raw/r1.6/docs/mindspore/programming_guide/source_zh_cn/images/mindrecord.png)\n",
    "\n",
    "数据文件主要由以下几个关键部分组成：\n",
    "\n",
    "- 文件头\n",
    "\n",
    "    文件头主要用来存储文件头大小、标量数据页大小、块数据页大小、Schema信息、索引字段、统计信息、文件分区信息、标量数据与块数据对应关系等，是MindRecord文件的元信息。\n",
    "\n",
    "- 标量数据页\n",
    "\n",
    "    标量数据页主要用来存储整型、字符串、浮点型数据，如图像的Label、图像的文件名、图像的长宽等信息，即适合用标量来存储的信息会保存在这里。\n",
    "\n",
    "- 块数据页\n",
    "\n",
    "    块数据页主要用来存储二进制串、Numpy数组等数据，如二进制图像文件本身、文本转换成的字典等。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 将数据集转换为MindRecord\n",
    "\n",
    "下面本教程将简单演示如何将图片数据及其标注转换为MindRecord格式。更多MindSpore数据格式转换说明，可参见编程指南中[MindSpore数据格式转换](https://www.mindspore.cn/docs/programming_guide/zh-CN/r1.6/dataset_conversion.html)章节。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "示例一：展示如何将数据按照定义的数据集结构转换为MindRecord数据文件。\n",
    "\n",
    "1. 导入文件写入工具类`FileWriter`。"
   ],
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T02:13:50.133177Z",
     "start_time": "2021-02-22T02:13:50.127990Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "source": [
    "from mindspore.mindrecord import FileWriter"
   ],
   "outputs": [],
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T02:14:54.114866Z",
     "start_time": "2021-02-22T02:14:11.315731Z"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "2. 定义数据集结构文件Schema。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "source": [
    "cv_schema_json = {\"file_name\": {\"type\": \"string\"}, \"label\": {\"type\": \"int32\"}, \"data\": {\"type\": \"bytes\"}}"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Schema文件主要包含字段名`name`、字段数据类型`type`和字段各维度维数`shape`：\n",
    "\n",
    "- 字段名：字段的引用名称，可以包含字母、数字和下划线。\n",
    "\n",
    "- 字段数据类型：包含int32、int64、float32、float64、string、bytes。\n",
    "\n",
    "- 字段维数：一维数组用[-1]表示，更高维度可表示为[m, n, …]，其中m、n为各维度维数。\n",
    "\n",
    "> 如果字段有属性`shape`，则用户传入`write_raw_data`接口的数据必须为`numpy.ndarray`类型，对应数据类型必须为int32、int64、float32、float64。\n",
    "\n",
    "3. 按照用户定义的Schema格式，准备需要写入的数据列表，此处传入的是图片数据的二进制流。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "source": [
    "data = [{\"file_name\": \"1.jpg\", \"label\": 0, \"data\": b\"\\x10c\\xb3w\\xa8\\xee$o&<q\\x8c\\x8e(\\xa2\\x90\\x90\\x96\\xbc\\xb1\\x1e\\xd4QER\\x13?\\xff\\xd9\"},\n",
    "        {\"file_name\": \"2.jpg\", \"label\": 56, \"data\": b\"\\xe6\\xda\\xd1\\xae\\x07\\xb8>\\xd4\\x00\\xf8\\x129\\x15\\xd9\\xf2q\\xc0\\xa2\\x91YFUO\\x1dsE1\\x1ep\"},\n",
    "        {\"file_name\": \"3.jpg\", \"label\": 99, \"data\": b\"\\xaf\\xafU<\\xb8|6\\xbd}\\xc1\\x99[\\xeaj+\\x8f\\x84\\xd3\\xcc\\xa0,i\\xbb\\xb9-\\xcdz\\xecp{T\\xb1\\xdb\"}]"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "4. 添加索引字段可以加速数据读取，该步骤非必选。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "source": [
    "indexes = [\"file_name\", \"label\"]"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "5. 创建FileWriter对象，传入文件名及分片数量，然后添加Schema文件及索引，调用`write_raw_data`接口写入数据，最后调用`commit`接口生成本地数据文件。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "source": [
    "writer = FileWriter(file_name=\"test.mindrecord\", shard_num=4)\n",
    "writer.add_schema(cv_schema_json, \"test_schema\")\n",
    "writer.add_index(indexes)\n",
    "writer.write_raw_data(data)\n",
    "writer.commit()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "MSRStatus.SUCCESS"
      ]
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "该示例会生成`test.mindrecord0`、`test.mindrecord0.db`、`test.mindrecord1`、`test.mindrecord1.db`、`test.mindrecord2`、`test.mindrecord2.db`、`test.mindrecord3`、`test.mindrecord3.db`共8个文件，称为MindRecord数据集。`test.mindrecord0`和`test.mindrecord0.db`称为1个MindRecord文件，其中`test.mindrecord0`为数据文件，`test.mindrecord0.db`为索引文件。\n",
    "\n",
    "接口说明：\n",
    "\n",
    "- `FileWriter`：如果参数shard_num>1，那么在生成MindRecord文件时，会将原始数据集保存至shard_num个MindRecord文件中，每个MindRecord文件会保存相邻MindRecord文件的元数据信息。然后在使用`MindDataset`接口读取MindRecord数据集时，通过`MindDataset(dataset_files=\"./test.mindrecord0\")`可以读取全部shard_num个MindRecord文件；通过`MindDataset(dataset_files=[\"./test.mindrecord0\"])`只会读取`test.mindrecord0`文件中的数据。\n",
    "\n",
    "- `write_raw_data`：将数据写入到内存之中。\n",
    "\n",
    "- `commit`：将最终内存中的数据写入到磁盘。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "6. 如果需要在现有数据格式文件中增加新数据，可以调用`open_for_append`接口打开已存在的数据文件，继续调用`write_raw_data`接口写入新数据，最后调用`commit`接口生成本地数据文件。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "source": [
    "writer = FileWriter.open_for_append(\"test.mindrecord0\")\n",
    "writer.write_raw_data(data)\n",
    "writer.commit()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "MSRStatus.SUCCESS"
      ]
     },
     "metadata": {},
     "execution_count": 6
    }
   ],
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T02:32:29.779366Z",
     "start_time": "2021-02-22T02:32:29.606138Z"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "示例二：将`jpg`格式的图片，按照示例一的方法，将其转换成MindRecord数据集。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "下载需要处理的图片数据`transform.jpg`作为待处理的原始数据。\n",
    "\n",
    "创建文件夹目录`./datasets/convert_dataset_to_mindrecord/data_to_mindrecord/`用于存放本次体验中所有的转换数据集。\n",
    "\n",
    "创建文件夹目录`./datasets/convert_dataset_to_mindrecord/images/`用于存放下载下来的图片数据。\n",
    "\n",
    "以下示例代码将数据集下载并解压到指定位置。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import os\n",
    "import requests\n",
    "import tarfile\n",
    "import zipfile\n",
    "\n",
    "requests.packages.urllib3.disable_warnings()\n",
    "\n",
    "def download_dataset(url, target_path):\n",
    "    \"\"\"download and decompress dataset\"\"\"\n",
    "    if not os.path.exists(target_path):\n",
    "        os.makedirs(target_path)\n",
    "    download_file = url.split(\"/\")[-1]\n",
    "    if not os.path.exists(download_file):\n",
    "        res = requests.get(url, stream=True, verify=False)\n",
    "        if download_file.split(\".\")[-1] not in [\"tgz\", \"zip\", \"tar\", \"gz\"]:\n",
    "            download_file = os.path.join(target_path, download_file)\n",
    "        with open(download_file, \"wb\") as f:\n",
    "            for chunk in res.iter_content(chunk_size=512):\n",
    "                if chunk:\n",
    "                    f.write(chunk)\n",
    "    if download_file.endswith(\"zip\"):\n",
    "        z = zipfile.ZipFile(download_file, \"r\")\n",
    "        z.extractall(path=target_path)\n",
    "        z.close()\n",
    "    if download_file.endswith(\".tar.gz\") or download_file.endswith(\".tar\") or download_file.endswith(\".tgz\"):\n",
    "        t = tarfile.open(download_file)\n",
    "        names = t.getnames()\n",
    "        for name in names:\n",
    "            t.extract(name, target_path)\n",
    "        t.close()\n",
    "    print(\"The {} file is downloaded and saved in the path {} after processing\".format(os.path.basename(url), target_path))\n",
    "\n",
    "download_dataset(\"https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/transform.jpg\", \"./datasets/convert_dataset_to_mindrecord/images\")\n",
    "if not os.path.exists(\"./datasets/convert_dataset_to_mindrecord/data_to_mindrecord/\"):\n",
    "    os.makedirs(\"./datasets/convert_dataset_to_mindrecord/data_to_mindrecord/\")"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "执行以下代码，将下载的`transform.jpg`转换为MindRecord数据集。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "source": [
    "# step 1 import class FileWriter\n",
    "import os\n",
    "from mindspore.mindrecord import FileWriter\n",
    "\n",
    "# clean up old run files before in Linux\n",
    "data_path = './datasets/convert_dataset_to_mindrecord/data_to_mindrecord/'\n",
    "os.system('rm -f {}test.*'.format(data_path))\n",
    "\n",
    "# import FileWriter class ready to write data\n",
    "data_record_path = './datasets/convert_dataset_to_mindrecord/data_to_mindrecord/test.mindrecord'\n",
    "writer = FileWriter(file_name=data_record_path, shard_num=4)\n",
    "\n",
    "# define the data type\n",
    "data_schema = {\"file_name\": {\"type\": \"string\"}, \"label\": {\"type\": \"int32\"}, \"data\": {\"type\": \"bytes\"}}\n",
    "writer.add_schema(data_schema, \"test_schema\")\n",
    "\n",
    "# prepeare the data contents\n",
    "file_name = \"./datasets/convert_dataset_to_mindrecord/images/transform.jpg\"\n",
    "with open(file_name, \"rb\") as f:\n",
    "    bytes_data = f.read()\n",
    "data = [{\"file_name\": \"transform.jpg\", \"label\": 1, \"data\": bytes_data}]\n",
    "\n",
    "# add index field\n",
    "indexes = [\"file_name\", \"label\"]\n",
    "writer.add_index(indexes)\n",
    "\n",
    "# save data to the files\n",
    "writer.write_raw_data(data)\n",
    "writer.commit()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "MSRStatus.SUCCESS"
      ]
     },
     "metadata": {},
     "execution_count": 8
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "该示例会生成8个文件，成为MindRecord数据集。`test.mindrecord0`和`test.mindrecord0.db`称为1个MindRecord文件，其中`test.mindrecord0`为数据文件，`test.mindrecord0.db`为索引文件，生成的文件如下所示：\n",
    "\n",
    "```text\n",
    "./datasets/convert_dataset_to_mindrecord/data_to_mindrecord/\n",
    "├── test.mindrecord0\n",
    "├── test.mindrecord0.db\n",
    "├── test.mindrecord1\n",
    "├── test.mindrecord1.db\n",
    "├── test.mindrecord2\n",
    "├── test.mindrecord2.db\n",
    "├── test.mindrecord3\n",
    "└── test.mindrecord3.db\n",
    "```"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 读取MindRecord数据集"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "下面将简单演示如何通过`MindDataset`读取MindRecord数据集。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "1. 导入读取类`MindDataset`。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "source": [
    "import mindspore.dataset as ds"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "2. 首先使用`MindDataset`读取MindRecord数据集，然后对数据创建了字典迭代器，并通过迭代器读取了一条数据记录。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "source": [
    "file_name = './datasets/convert_dataset_to_mindrecord/data_to_mindrecord/test.mindrecord0'\n",
    "# create MindDataset for reading data\n",
    "define_data_set = ds.MindDataset(dataset_files=file_name)\n",
    "# create a dictionary iterator and read a data record through the iterator\n",
    "count = 0\n",
    "for item in define_data_set.create_dict_iterator(output_numpy=True):\n",
    "    print(\"sample: {}\".format(item))\n",
    "    count += 1\n",
    "print(\"Got {} samples\".format(count))"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "sample: {'data': array([255, 216, 255, ..., 159, 255, 217], dtype=uint8), 'file_name': array(b'transform.jpg', dtype='|S13'), 'label': array(1, dtype=int32)}\n",
      "Got 1 samples\n"
     ]
    }
   ],
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T02:35:01.078717Z",
     "start_time": "2021-02-22T02:35:01.036028Z"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "MindSpore",
   "language": "python",
   "name": "mindspore"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}