{ "cells": [
{ "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Quick Start: Data Processing with MindPandas\n", "\n", "[![View Source File](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.9/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.9/docs/mindpandas/docs/source_zh_cn/mindpandas_quick_start.ipynb)\n", "\n", "Data preprocessing is essential to model training, and good feature engineering can substantially improve training accuracy. This section uses feature engineering for a recommender system as an example to walk through the data processing workflow with MindPandas.\n", "\n", "## Setting the MindPandas Execution Mode\n", "\n", "MindPandas supports multithread and multiprocess modes; for details, see [MindPandas Execution Mode Introduction and Configuration Instructions](https://www.mindspore.cn/mindpandas/docs/zh-CN/r0.1/mindpandas_configuration.html). This example uses multithread mode and sets the partition shape to 16*3, as shown below:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": true, "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import mindpandas as pd\n", "import random\n", "\n", "pd.set_concurrency_mode(\"multithread\")\n", "pd.set_partition_shape((16, 3))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Data Generation\n", "\n", "Generate two-dimensional data with 10,000 rows and 40 columns, made up of one label column, 13 dense feature columns, and 26 sparse feature columns. The label is a random value of 0 or 1, each dense feature is a random integer from -10 to 10000 (inclusive), and each sparse feature is a random 8-character hexadecimal string." ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "DENSE_NUM = 13\n", "SPARSE_NUM = 26\n", "ROW_NUM = 10000\n", "cat_val, int_val, lab_val = [], [], []\n", "\n", "def gen_cat_feature(length):\n", "    # Random hex string, left-padded with zeros to exactly `length` characters.\n", "    # The upper bound is 16 ** length - 1 so the result never exceeds `length` digits.\n", "    result = hex(random.randint(0, 16 ** length - 1)).replace('0x', '').upper()\n", "    if len(result) < length:\n", "        result = '0' * (length - len(result)) + result\n", "    return str(result)\n", "\n", "def gen_int_feature():\n", "    return random.randint(-10, 10000)\n", "\n", "def gen_lab_feature():\n", "    x = random.randint(0, 1)\n", "    return round(x)\n", "\n", "# Sparse (categorical) features: columns C1..C26\n", "for i in range(ROW_NUM * SPARSE_NUM):\n", "    cat_val.append(gen_cat_feature(8))\n", "np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)\n", "df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])\n", "\n", "# Dense (integer) features: columns I1..I13\n", "for i in range(ROW_NUM * DENSE_NUM):\n", "    int_val.append(gen_int_feature())\n", "np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)\n", "df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])\n", "\n", "# Label column\n", "for i in range(ROW_NUM):\n", "    lab_val.append(gen_lab_feature())\n", "np_lab = np.array(lab_val).reshape(ROW_NUM, 1)\n", "df_lab = pd.DataFrame(np_lab, columns=['label'])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "\n", "Concatenate the label, dense features, and sparse features into the dataset to be processed. The result is shown below:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|   | label | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | ... | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | C25 | C26 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 0 | 153 | 4326 | 4239 | 3998 | 4394 | 8434 | 8463 | 7862 | 9993 | ... | 938379C6 | 9878C0E2 | A75A4A8C | D9F9E0F2 | 173E6F23 | 004968BA | E66F6B9F | 287A48D1 | AC62D5CE | A723AB7F |\n",
"| 1 | 1 | 1962 | 6771 | 372 | 1754 | 7408 | 9176 | 6414 | 751 | 7680 | ... | 1613C18C | CE911717 | 8B35FF3E | 585C6D76 | 5A4EF600 | 3FA13F3A | 1B8B88AD | C232D96E | CD630ACA | AB435A6A |\n",
"| 2 | 1 | 8665 | 1485 | 3321 | 5368 | 2658 | 6317 | 2848 | 2780 | 2522 | ... | 193587B6 | 17AC3A54 | 025D3F81 | 5E2D04CB | D28747FF | D6A6A51A | C4E08EE7 | C520A45C | B8CB53F1 | 3933626E |\n",
"| 3 | 1 | 7794 | 5804 | 9079 | 4813 | 1912 | 4740 | 212 | 373 | 620 | ... | 8C816BC2 | F5AA01BE | 08CBECA8 | DC884327 | 9F95F1D4 | 9C389A00 | 7CFFC865 | DC9203DB | 86DC5DC2 | EFFF0EAC |\n",
"| 4 | 0 | 3331 | 4672 | 9741 | 6430 | 4610 | 8867 | 9055 | 3170 | 7955 | ... | E18EF1EB | 0905B30C | 1A584C44 | BAC91CC4 | 8DAAC9B4 | 7298201D | 73A30ED7 | 9560AB20 | 6B452601 | D7754942 |\n",
"\n",
"5 rows × 40 columns
\n", "\n", "|   | id | weight | label | is_training |\n",
"|---|---|---|---|---|\n",
"| 0 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 89... | [0.016285343191127986, 0.4332400559664201, 0.4... | [0] | 1 |\n",
"| 1 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 70... | [0.19702267958837047, 0.6775934439336398, 0.03... | [1] | 1 |\n",
"| 2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -5... | [0.8667199520431611, 0.14931041375174894, 0.33... | [1] | 1 |\n",
"| 3 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 40... | [0.7796982715556, 0.5809514291425145, 0.907992... | [1] | 1 |\n",
"| 4 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 64... | [0.3337995803776601, 0.467819308414951, 0.9741... | [0] | 1 |\n",
"| ... | ... | ... | ... | ... |\n",
"| 9995 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 87... | [0.8151663502847437, 0.962722366580052, 0.5130... | [1] | 0 |\n",
"| 9996 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 47... | [0.6402237985812769, 0.9683190085948431, 0.948... | [1] | 0 |\n",
"| 9997 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -7... | [0.9435508042761515, 0.9097541475114931, 0.313... | [0] | 0 |\n",
"| 9998 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 36... | [0.6173443900489559, 0.41225264841095344, 0.92... | [1] | 0 |\n",
"| 9999 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 53... | [0.869017883904486, 0.8232060763541875, 0.5049... | [0] | 0 |\n",
"\n",
"10000 rows × 4 columns
\n", "