{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 数据处理\n", "\n", "[![在线运行](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/resource/_static/logo_modelarts.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9vYnMuZHVhbHN0YWNrLmNuLW5vcnRoLTQubXlodWF3ZWljbG91ZC5jb20vbWluZHNwb3JlLXdlYnNpdGUvbm90ZWJvb2svcjEuNy90dXRvcmlhbHMvemhfY24vYWR2YW5jZWQvZGF0YXNldC9taW5kc3BvcmVfdHJhbnNmb3JtLmlweW5i&imageid=9d63f4d1-dc09-4873-b669-3483cea777c0) [![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.7/tutorials/zh_cn/advanced/dataset/mindspore_transform.ipynb) \n", "[![下载样例代码](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/resource/_static/logo_download_code.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.7/tutorials/zh_cn/advanced/dataset/mindspore_transform.py) \n", "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.7/tutorials/source_zh_cn/advanced/dataset/transform.ipynb)\n", "\n", "数据是深度学习的基础,良好的数据输入可以对整个深度神经网络训练起到非常积极的作用。在训练前对已加载的数据集进行数据处理,可以解决诸如数据量过大、样本分布不均等问题,从而获得对训练结果更有利的数据输入。\n", "\n", "MindSpore的各个数据集类都为用户提供了多种数据处理操作,用户可以通过构建数据处理的流水线(pipeline)来定义需要使用的数据处理操作,在训练过程中,数据即可像水一样源源不断地经过数据处理pipeline流向训练系统。\n", "\n", "MindSpore目前支持如数据清洗`shuffle`、数据分批`batch`、数据重复`repeat`、数据拼接`concat`等常用数据处理操作。\n", "\n", "> 更多数据处理操作参见[API文档](https://www.mindspore.cn/docs/zh-CN/r1.7/api_python/mindspore.dataset.html)。\n", "\n", "## 数据处理操作\n", "\n", "### shuffle\n", "\n", "shuffle操作会随机打乱数据顺序,对数据集进行混洗。\n", "\n", "设定的`buffer_size`越大,数据混洗程度越大,同时所消耗的时间、计算资源也更大。\n", "\n", "![shuffle](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/tutorials/source_zh_cn/advanced/dataset/images/op_shuffle.png)\n", "\n", "下面的样例先构建了一个随机数据集,然后对其进行混洗操作,最后展示了数据混洗前后的结果。" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n", "------ after processing ------\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n" ] } ], "source": [ "import numpy as np\n", "import mindspore.dataset as ds\n", "\n", "ds.config.set_seed(0)\n", "\n", "def generator_func():\n", " \"\"\"定义生成数据集函数\"\"\"\n", " for i in range(5):\n", " yield (np.array([i, i+1, i+2]),)\n", "\n", "# 生成数据集\n", "dataset = ds.GeneratorDataset(generator_func, [\"data\"])\n", "for data in dataset.create_dict_iterator():\n", " print(data)\n", "\n", "print(\"------ after processing ------\")\n", "\n", "# 执行数据清洗操作\n", "dataset = dataset.shuffle(buffer_size=2)\n", "for data in dataset.create_dict_iterator():\n", " print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从上面的打印结果可以看出,经过`shuffle`操作之后,数据顺序被打乱了。\n", "\n", "### batch\n", "\n", "batch操作将数据集分批,分别输入到训练系统中进行训练,可以减少训练轮次,达到加速训练过程的目的。\n", "\n", "![batch](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/tutorials/source_zh_cn/advanced/dataset/images/op_batch.png)\n", "\n", "下面的样例先构建了一个数据集,然后分别展示了丢弃多余数据与否的数据集分批结果,其中批大小为2。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n", "------not drop remainder ------\n", "{'data': Tensor(shape=[2, 3], dtype=Int64, value=\n", "[[0, 1, 2],\n", " [1, 2, 3]])}\n", "{'data': Tensor(shape=[2, 3], dtype=Int64, value=\n", "[[2, 3, 4],\n", " [3, 4, 5]])}\n", "{'data': Tensor(shape=[1, 3], dtype=Int64, value=\n", "[[4, 5, 6]])}\n", "------ drop remainder ------\n", "{'data': Tensor(shape=[2, 3], dtype=Int64, value=\n", "[[0, 1, 2],\n", " [1, 2, 3]])}\n", "{'data': Tensor(shape=[2, 3], dtype=Int64, value=\n", "[[2, 3, 4],\n", " [3, 4, 5]])}\n" ] } ], "source": [ "import numpy as np\n", "import mindspore.dataset as ds\n", "\n", "def generator_func():\n", " \"\"\"定义生成数据集函数\"\"\"\n", " for i in range(5):\n", " yield (np.array([i, i+1, i+2]),)\n", "\n", "dataset = ds.GeneratorDataset(generator_func, [\"data\"])\n", "for data in dataset.create_dict_iterator():\n", " print(data)\n", "\n", "# 采用不丢弃多余数据的方式对数据集进行分批\n", "dataset = ds.GeneratorDataset(generator_func, [\"data\"])\n", "dataset = dataset.batch(batch_size=2, drop_remainder=False)\n", "print(\"------not drop remainder ------\")\n", "for data in dataset.create_dict_iterator():\n", " print(data)\n", "\n", "# 采用丢弃多余数据的方式对数据集进行分批\n", "dataset = ds.GeneratorDataset(generator_func, [\"data\"])\n", "dataset = dataset.batch(batch_size=2, drop_remainder=True)\n", "print(\"------ drop remainder ------\")\n", "for data in dataset.create_dict_iterator():\n", " print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从上面的打印结果可以看出,数据集大小为5,每2个分一组,不丢弃多余数据时分为3组,丢弃多余数据时分为2组,最后一条数据被丢弃。\n", "\n", "### repeat\n", "\n", "repeat操作对数据集进行重复,达到扩充数据量的目的。`repeat`和`batch`操作的先后顺序会影响训练batch的数量,建议将`repeat`置于`batch`之后。\n", "\n", "![repeat](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/tutorials/source_zh_cn/advanced/dataset/images/op_repeat.png)\n", "\n", "下面的样例先构建了一个随机数据集,然后将其重复2次,最后展示了重复后的数据结果。" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n", "------ after processing ------\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n" ] } ], "source": [ "import numpy as np\n", "import mindspore.dataset as ds\n", "\n", "def generator_func():\n", " \"\"\"定义生成数据集函数\"\"\"\n", " for i in range(5):\n", " yield (np.array([i, i+1, i+2]),)\n", "\n", "# 生成数据集\n", "dataset = ds.GeneratorDataset(generator_func, [\"data\"])\n", "for data in dataset.create_dict_iterator():\n", " print(data)\n", "\n", "print(\"------ after processing ------\")\n", "\n", "# 对数据进行数据重复操作\n", "dataset = dataset.repeat(count=2)\n", "for data in dataset.create_dict_iterator():\n", " print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从上面的打印结果可以看出,数据集被拷贝了之后扩充到原数据集后面。\n", "\n", "### zip\n", "\n", "zip操作实现两个数据集的列拼接,将其合并为一个数据集。使用时需要注意以下两点:\n", "\n", "1. 如果两个数据集的列名相同,则不会合并,请注意列的命名。\n", "2. 如果两个数据集的行数不同,合并后的行数将和较小行数保持一致。\n", "\n", "![zip](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/tutorials/source_zh_cn/advanced/dataset/images/op_zip.png)\n", "\n", "下面的样例先构建了两个不同样本数的随机数据集,然后将其进行列拼接,最后展示了拼接后的数据结果。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------ data1 ------\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [5, 6, 7])}\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [6, 7, 8])}\n", "------ data2 ------\n", "{'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n", "{'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n", "{'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n", "{'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n", "------ data3 ------\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2]), 'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3]), 'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4]), 'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n", "{'data1': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5]), 'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n" ] } ], "source": [ "import numpy as np\n", "import mindspore.dataset as ds\n", "\n", "def generator_func():\n", " \"\"\"定义生成数据集函数1\"\"\"\n", " for i in range(7):\n", " yield (np.array([i, i+1, i+2]),)\n", "\n", "def generator_func2():\n", " \"\"\"定义生成数据集函数2\"\"\"\n", " for _ in range(4):\n", " yield (np.array([1, 2]),)\n", "\n", "print(\"------ data1 ------\")\n", "dataset1 = ds.GeneratorDataset(generator_func, [\"data1\"])\n", "for data in dataset1.create_dict_iterator():\n", " print(data)\n", "\n", "print(\"------ data2 ------\")\n", "dataset2 = ds.GeneratorDataset(generator_func2, [\"data2\"])\n", "for data in dataset2.create_dict_iterator():\n", " print(data)\n", "\n", "print(\"------ data3 ------\")\n", "\n", "# 对数据集1和数据集2做zip操作,生成数据集3\n", "dataset3 = ds.zip((dataset1, dataset2))\n", "for data in dataset3.create_dict_iterator():\n", " print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从上面的打印结果可以看出,数据集3由数据集1和数据集2列拼接得到,其列数为后两者之和,其行数与后两者中最小行数(数据集2行数)保持一致,数据集1中后面多余的行数被丢弃。\n", "\n", "### concat\n", "\n", "concat实现两个数据集的行拼接,并将其合并为一个数据集。使用时需要注意:输入数据集中的列名、列数据类型和列数据的排列应相同。\n", "\n", "![concat](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/tutorials/source_zh_cn/advanced/dataset/images/op_concat.png)\n", "\n", "下面的样例先构建了两个随机数据集,然后将其做行拼接,最后展示了拼接后的数据结果。值得一提的是,使用`+`运算符也能达到同样的效果。" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data1:\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 0, 0])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 0, 0])}\n", "data2:\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "data3:\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 0, 0])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 0, 0])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n" ] } ], "source": [ "import numpy as np\n", "import mindspore.dataset as ds\n", "\n", "def generator_func():\n", " \"\"\"定义生成数据集函数1\"\"\"\n", " for _ in range(2):\n", " yield (np.array([0, 0, 0]),)\n", "\n", "def generator_func2():\n", " \"\"\"定义生成数据集函数2\"\"\"\n", " for _ in range(2):\n", " yield (np.array([1, 2, 3]),)\n", "\n", "# 生成数据集1\n", "dataset1 = ds.GeneratorDataset(generator_func, [\"data\"])\n", "print(\"data1:\")\n", "for data in dataset1.create_dict_iterator():\n", " print(data)\n", "\n", "# 生成数据集2\n", "dataset2 = ds.GeneratorDataset(generator_func2, [\"data\"])\n", "print(\"data2:\")\n", "for data in dataset2.create_dict_iterator():\n", " print(data)\n", "\n", "# 在数据集1上concat数据集2,生成数据集3\n", "dataset3 = dataset1.concat(dataset2)\n", "print(\"data3:\")\n", "for data in dataset3.create_dict_iterator():\n", " print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从上面的打印结果可以看出,数据集3由数据集1和数据集2行拼接得到,其列数与后两者保持一致,其行数为后两者之和。\n", "\n", "### map\n", "\n", "map操作将指定的函数作用于数据集的指定列数据,实现数据映射操作。\n", "\n", "用户可以自定义映射函数,也可以直接使用`c_transforms`或`py_transforms`中的函数针对图像、文本数据进行数据增强。\n", "\n", "![map](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.7/tutorials/source_zh_cn/advanced/dataset/images/op_map.png)\n", "\n", "下面的样例先构建了一个随机数据集,然后定义了数据翻倍的映射函数并将其作用于数据集,最后对比展示了映射前后的数据结果。" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n", "------ after processing ------\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 2, 4])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 4, 6])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 6, 8])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [ 6, 8, 10])}\n", "{'data': Tensor(shape=[3], dtype=Int64, value= [ 8, 10, 12])}\n" ] } ], "source": [ "import numpy as np\n", "import mindspore.dataset as ds\n", "\n", "def generator_func():\n", " \"\"\"定义生成数据集函数\"\"\"\n", " for i in range(5):\n", " yield (np.array([i, i+1, i+2]),)\n", "\n", "def pyfunc(x):\n", " \"\"\"定义对数据的操作\"\"\"\n", " return x*2\n", "\n", "# 生成数据集\n", "dataset = ds.GeneratorDataset(generator_func, [\"data\"])\n", "\n", "# 显示上述生成的数据集\n", "for data in dataset.create_dict_iterator():\n", " print(data)\n", "\n", "print(\"------ after processing ------\")\n", "\n", "# 对数据集做map操作,操作函数为pyfunc\n", "dataset = dataset.map(operations=pyfunc, input_columns=[\"data\"])\n", "\n", "# 显示map操作后的数据集\n", "for data in dataset.create_dict_iterator():\n", " print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从上面的打印结果可以看出,经过map操作,将函数`pyfunc`作用到数据集后,数据集中每个数据都被乘2。" ] } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }