{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# 快速入门:MindPandas数据处理\n", "\n", "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.9/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.9/docs/mindpandas/docs/source_zh_cn/mindpandas_quick_start.ipynb)\n", "\n", "数据预处理对于模型训练非常重要,好的特征工程可以大幅度提升训练精度。本章节以推荐系统的特征工程为例,介绍使用MindPandas处理数据的流程。\n", "\n", "## MindPandas执行模式设置\n", "\n", "MindPandas支持多线程与多进程模式,本示例使用多线程模式,更多详见[MindPandas执行模式介绍及配置说明](https://www.mindspore.cn/mindpandas/docs/zh-CN/r0.1/mindpandas_configuration.html),并设置切片维度为16*3,示例如下:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": true, "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import mindpandas as pd\n", "import random\n", "\n", "pd.set_concurrency_mode(\"multithread\")\n", "pd.set_partition_shape((16, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 数据生成\n", "\n", "生成10000行、40列的二维数据,包含标签、稠密特征以及稀疏特征等信息。标签是值为“0”或“1“的随机数、稠密特征是取值范围为(-10, 10000)的随机数、稀疏特征为随机字符串。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "DENSE_NUM = 13\n", "SPARSE_NUM = 26\n", "ROW_NUM = 10000\n", "cat_val, int_val, lab_val = [], [], []\n", "\n", "def gen_cat_feature(length):\n", " result = hex(random.randint(0, 16 ** length)).replace('0x', '').upper()\n", " if len(result) < length:\n", " result = '0' * (length - len(result)) + result\n", " return str(result)\n", "\n", "def gen_int_feature():\n", " return random.randint(-10, 10000)\n", "\n", "def gen_lab_feature():\n", " x = random.randint(0, 1)\n", " return round(x)\n", "\n", "for i in range(ROW_NUM * SPARSE_NUM):\n", " cat_val.append(gen_cat_feature(8))\n", "np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)\n", "df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])\n", "\n", "for i in range(ROW_NUM * DENSE_NUM):\n", " int_val.append(gen_int_feature())\n", "np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)\n", "df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])\n", "\n", "for i in range(ROW_NUM):\n", " lab_val.append(gen_lab_feature())\n", "np_lab = np.array(lab_val).reshape(ROW_NUM, 1)\n", "df_lab = pd.DataFrame(np_lab, columns=['label'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 数据预处理\n", "\n", "将标签、稠密特征、稀疏特征等拼接为待处理的数据集,结果如下所示:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelI1I2I3I4I5I6I7I8I9...C17C18C19C20C21C22C23C24C25C26
0015343264239399843948434846378629993...938379C69878C0E2A75A4A8CD9F9E0F2173E6F23004968BAE66F6B9F287A48D1AC62D5CEA723AB7F
111962677137217547408917664147517680...1613C18CCE9117178B35FF3E585C6D765A4EF6003FA13F3A1B8B88ADC232D96ECD630ACAAB435A6A
21866514853321536826586317284827802522...193587B617AC3A54025D3F815E2D04CBD28747FFD6A6A51AC4E08EE7C520A45CB8CB53F13933626E
31779458049079481319124740212373620...8C816BC2F5AA01BE08CBECA8DC8843279F95F1D49C389A007CFFC865DC9203DB86DC5DC2EFFF0EAC
40333146729741643046108867905531707955...E18EF1EB0905B30C1A584C44BAC91CC48DAAC9B47298201D73A30ED79560AB206B452601D7754942
\n", "

5 rows × 40 columns

\n", "
" ], "text/plain": [ " label I1 I2 I3 I4 I5 I6 I7 I8 I9 ... C17 \\\n", "0 0 153 4326 4239 3998 4394 8434 8463 7862 9993 ... 938379C6 \n", "1 1 1962 6771 372 1754 7408 9176 6414 751 7680 ... 1613C18C \n", "2 1 8665 1485 3321 5368 2658 6317 2848 2780 2522 ... 193587B6 \n", "3 1 7794 5804 9079 4813 1912 4740 212 373 620 ... 8C816BC2 \n", "4 0 3331 4672 9741 6430 4610 8867 9055 3170 7955 ... E18EF1EB \n", "\n", " C18 C19 C20 C21 C22 C23 C24 \\\n", "0 9878C0E2 A75A4A8C D9F9E0F2 173E6F23 004968BA E66F6B9F 287A48D1 \n", "1 CE911717 8B35FF3E 585C6D76 5A4EF600 3FA13F3A 1B8B88AD C232D96E \n", "2 17AC3A54 025D3F81 5E2D04CB D28747FF D6A6A51A C4E08EE7 C520A45C \n", "3 F5AA01BE 08CBECA8 DC884327 9F95F1D4 9C389A00 7CFFC865 DC9203DB \n", "4 0905B30C 1A584C44 BAC91CC4 8DAAC9B4 7298201D 73A30ED7 9560AB20 \n", "\n", " C25 C26 \n", "0 AC62D5CE A723AB7F \n", "1 CD630ACA AB435A6A \n", "2 B8CB53F1 3933626E \n", "3 86DC5DC2 EFFF0EAC \n", "4 6B452601 D7754942 \n", "\n", "[5 rows x 40 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.concat([df_lab, df_int, df_cat], axis=1)\n", "df.to_pandas().head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 特征工程\n", "\n", "1. 获取稠密数据每一列的最大值与最小值,为后续的归一化做准备。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "max_dict, min_dict = {}, {}\n", "for i, j in enumerate(df_int.max()):\n", " max_dict[f'I{i + 1}'] = j\n", "\n", "for i, j in enumerate(df_int.min()):\n", " min_dict[f'I{i + 1}'] = j" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 截取df的第2列到第40列。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "features = df.iloc[:, 1:40]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "3. 对df的“label”列应用自定义函数,将“label”列中的数值转换为numpy数组,将数据添加到df的“label”列。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def get_label(x):\n", " return np.array([x])\n", "df['label'] = df['label'].apply(get_label)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "4. 对features应用自定义函数,将稠密数据进行归一化处理,其他数据填充为1,将数据添加到df的“weight”列。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def get_weight(x):\n", " ret = []\n", " for index, val in enumerate(x):\n", " if index < DENSE_NUM:\n", " col = f'I{index + 1}'\n", " ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))\n", " else:\n", " ret.append(1)\n", " return ret\n", "feat_weight = features.apply(get_weight, axis=1)\n", "df['weight'] = feat_weight" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "5. 对features应用自定义函数,获取稠密数据的索引,其他数据使用其哈希值进行填充,将数据添加到df的“id”列。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def get_id(x):\n", " ret = []\n", " for index, val in enumerate(x):\n", " if index < DENSE_NUM:\n", " ret.append(index + 1)\n", " else:\n", " ret.append(hash(val))\n", " return ret\n", "feat_id = features.apply(get_id, axis=1)\n", "df['id'] = feat_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 数据集划分\n", "\n", "df新增\"is_training\"列,将数据的前70%设为训练数据,其他数据标记为非训练数据。结果如下所示:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idweightlabelis_training
0[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 89...[0.016285343191127986, 0.4332400559664201, 0.4...[0]1
1[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 70...[0.19702267958837047, 0.6775934439336398, 0.03...[1]1
2[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -5...[0.8667199520431611, 0.14931041375174894, 0.33...[1]1
3[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 40...[0.7796982715556, 0.5809514291425145, 0.907992...[1]1
4[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 64...[0.3337995803776601, 0.467819308414951, 0.9741...[0]1
...............
9995[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 87...[0.8151663502847437, 0.962722366580052, 0.5130...[1]0
9996[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 47...[0.6402237985812769, 0.9683190085948431, 0.948...[1]0
9997[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -7...[0.9435508042761515, 0.9097541475114931, 0.313...[0]0
9998[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 36...[0.6173443900489559, 0.41225264841095344, 0.92...[1]0
9999[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 53...[0.869017883904486, 0.8232060763541875, 0.5049...[0]0
\n", "

10000 rows × 4 columns

\n", "
" ], "text/plain": [ " id \\\n", "0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 89... \n", "1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 70... \n", "2 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -5... \n", "3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 40... \n", "4 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 64... \n", "... ... \n", "9995 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 87... \n", "9996 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 47... \n", "9997 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -7... \n", "9998 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 36... \n", "9999 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 53... \n", "\n", " weight label is_training \n", "0 [0.016285343191127986, 0.4332400559664201, 0.4... [0] 1 \n", "1 [0.19702267958837047, 0.6775934439336398, 0.03... [1] 1 \n", "2 [0.8667199520431611, 0.14931041375174894, 0.33... [1] 1 \n", "3 [0.7796982715556, 0.5809514291425145, 0.907992... [1] 1 \n", "4 [0.3337995803776601, 0.467819308414951, 0.9741... [0] 1 \n", "... ... ... ... \n", "9995 [0.8151663502847437, 0.962722366580052, 0.5130... [1] 0 \n", "9996 [0.6402237985812769, 0.9683190085948431, 0.948... [1] 0 \n", "9997 [0.9435508042761515, 0.9097541475114931, 0.313... [0] 0 \n", "9998 [0.6173443900489559, 0.41225264841095344, 0.92... [1] 0 \n", "9999 [0.869017883904486, 0.8232060763541875, 0.5049... [0] 0 \n", "\n", "[10000 rows x 4 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m_train_len = int(len(df) * 0.7)\n", "df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)\n", "df = df[['id', 'weight', 'label', 'is_training']]\n", "df.to_pandas()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "至此,数据生成、数据预处理以及特征工程等操作已完成,处理好的数据即可传入模型进行训练。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }