{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Quick Start: MindSpore Pandas Data Processing\n", "\n", "[![View source files in Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0/resource/_static/logo_source_en.png)](https://gitee.com/mindspore/docs/blob/r2.0/docs/mindpandas/docs/source_en/mindpandas_quick_start.ipynb)\n", "\n", "Data preprocessing is vital for model training. With good feature engineering, training accuracy could be significantly enhanced. This tutorial takes the feature engineering of recommender system as an example to introduce the procedure of using MindSpore Pandas to process data.\n", "\n", "## Setting MindSpore Pandas Execution Mode\n", "\n", "MindSpore Pandas supports two execution modes, which are multithread mode and multiprocess mode. This example takes multithread mode as example. We set partition shape to 16*3. Example is shown as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "pycharm": { "is_executing": true, "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import mindpandas as pd\n", "import random\n", "\n", "pd.set_concurrency_mode(\"multithread\")\n", "pd.set_partition_shape((16, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Generation\n", "\n", "Two dimensional data sized 10,000 rows and 40 columns, with label, dense features and sparse features is generated. The label is a random number with the value \"0\" or \"1\", the dense features are random numbers with the value range from -10 to 10000, and the sparse features are random strings." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "DENSE_NUM = 13\n", "SPARSE_NUM = 26\n", "ROW_NUM = 10000\n", "cat_val, int_val, lab_val = [], [], []\n", "\n", "def gen_cat_feature(length):\n", " result = hex(random.randint(0, 16 ** length)).replace('0x', '').upper()\n", " if len(result) < length:\n", " result = '0' * (length - len(result)) + result\n", " return str(result)\n", "\n", "def gen_int_feature():\n", " return random.randint(-10, 10000)\n", "\n", "def gen_lab_feature():\n", " x = random.randint(0, 1)\n", " return round(x)\n", "\n", "for i in range(ROW_NUM * SPARSE_NUM):\n", " cat_val.append(gen_cat_feature(8))\n", "np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)\n", "df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])\n", "\n", "for i in range(ROW_NUM * DENSE_NUM):\n", " int_val.append(gen_int_feature())\n", "np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)\n", "df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])\n", "\n", "for i in range(ROW_NUM):\n", " lab_val.append(gen_lab_feature())\n", "np_lab = np.array(lab_val).reshape(ROW_NUM, 1)\n", "df_lab = pd.DataFrame(np_lab, columns=['label'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "\n", "Label, dense features and sparse features are concatenated to form the to-be-processed dataset. The results are shown as follows:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelI1I2I3I4I5I6I7I8I9...C17C18C19C20C21C22C23C24C25C26
005795705182777859305752152066240172...A5AE1E6D25A100C3C6B8E0A4A94F6B56B27D726BEB9F3C73D98D17B2793AB3158C12657FAFCEEBFF
10696883894352331240215087225442494411...EEAC1040BDC711B916269D1BD59EA7BB460218D4F89E137CF488ED52C1DDB598AE9C21C911D47A2A
21114493279399774581447189166310056421...54EE530F68D2F7EFEFD65C79B2F2CCF586E0211031617C1944A2DFA4032C30D1C8098BADCE4DD8BB
3162143183922993891602783268047754436...639D80AA3A14B8849FC92B4F67DB32801EE1FC45CE19F4C1F34CC6FDC3C9F66CCA1B3F85F184D01E
413220323522435050746328689468383063...7671D909126B3F691262514D25C181372BA958DED6CE7BE318D4EEE1315D0FFB7C25DB1D6E4ABFB1
\n", "

5 rows × 40 columns

\n", "
" ], "text/plain": [ " label I1 I2 I3 I4 I5 I6 I7 I8 I9 ... C17 \\\n", "0 0 5795 7051 8277 785 9305 7521 5206 6240 172 ... A5AE1E6D \n", "1 0 6968 8389 4352 3312 4021 5087 2254 4249 4411 ... EEAC1040 \n", "2 1 1144 9327 9399 7745 8144 7189 1663 1005 6421 ... 54EE530F \n", "3 1 6214 3183 9229 938 9160 2783 2680 4775 4436 ... 639D80AA \n", "4 1 3220 3235 2243 50 5074 6328 6894 6838 3063 ... 7671D909 \n", "\n", " C18 C19 C20 C21 C22 C23 C24 \\\n", "0 25A100C3 C6B8E0A4 A94F6B56 B27D726B EB9F3C73 D98D17B2 793AB315 \n", "1 BDC711B9 16269D1B D59EA7BB 460218D4 F89E137C F488ED52 C1DDB598 \n", "2 68D2F7EF EFD65C79 B2F2CCF5 86E02110 31617C19 44A2DFA4 032C30D1 \n", "3 3A14B884 9FC92B4F 67DB3280 1EE1FC45 CE19F4C1 F34CC6FD C3C9F66C \n", "4 126B3F69 1262514D 25C18137 2BA958DE D6CE7BE3 18D4EEE1 315D0FFB \n", "\n", " C25 C26 \n", "0 8C12657F AFCEEBFF \n", "1 AE9C21C9 11D47A2A \n", "2 C8098BAD CE4DD8BB \n", "3 CA1B3F85 F184D01E \n", "4 7C25DB1D 6E4ABFB1 \n", "\n", "[5 rows x 40 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.concat([df_lab, df_int, df_cat], axis=1)\n", "df.to_pandas().head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering\n", "\n", "1. Get the maximum and minimum values of each column of the dense data, to prepare for subsequent normalization." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "max_dict, min_dict = {}, {}\n", "for i, j in enumerate(df_int.max()):\n", " max_dict[f'I{i + 1}'] = j\n", "\n", "for i, j in enumerate(df_int.min()):\n", " min_dict[f'I{i + 1}'] = j" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Select columns 2 to 40 of df and copy into a new dataframe, named features." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "features = df.iloc[:, 1:40]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "3. Apply a custom function to the \"label\" column of df, and the numeric values are converted to numpy array. The result is added to \"label\" column." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_label(x):\n", " return np.array([x])\n", "df['label'] = df['label'].apply(get_label)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "4. Apply a custom function to features, normalize the dense data, fill the other data with 1, and add the data to the \"weight\" column of df." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_weight(x):\n", " ret = []\n", " for index, val in enumerate(x):\n", " if index < DENSE_NUM:\n", " col = f'I{index + 1}'\n", " ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))\n", " else:\n", " ret.append(1)\n", " return ret\n", "feat_weight = features.apply(get_weight, axis=1)\n", "df['weight'] = feat_weight" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "5. Apply a custom function to features, get the index of the dense data, other data is filled with its hash value, add the data to the \"id\" column of df." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_id(x):\n", " ret = []\n", " for index, val in enumerate(x):\n", " if index < DENSE_NUM:\n", " ret.append(index + 1)\n", " else:\n", " ret.append(hash(val))\n", " return ret\n", "feat_id = features.apply(get_id, axis=1)\n", "df['id'] = feat_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Splitting Dataset\n", "\n", "Adding \"is_training\" column. The first 70% of the data is set as training data and other data is set as non-training data. The results are shown below:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idweightlabelis_training
0[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31...[0.5799200799200799, 0.705335731414868, 0.8280...[0]1
1[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3...[0.6971028971028971, 0.8390287769784173, 0.435...[0]1
2[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71...[0.11528471528471529, 0.9327537969624301, 0.94...[1]1
3[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38...[0.6217782217782217, 0.3188449240607514, 0.923...[1]1
4[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3...[0.3226773226773227, 0.3240407673860911, 0.225...[1]1
...............
9995[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6...[0.09270729270729271, 0.3959832134292566, 0.03...[0]0
9996[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12...[0.5147852147852148, 0.48810951239008793, 0.46...[1]0
9997[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2...[0.4792207792207792, 0.4045763389288569, 0.514...[1]0
9998[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6...[0.550949050949051, 0.1035171862509992, 0.2167...[0]0
9999[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4...[0.9004995004995004, 0.9000799360511591, 0.826...[0]0
\n", "

10000 rows × 4 columns

\n", "
" ], "text/plain": [ " id \\\n", "0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31... \n", "1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... \n", "2 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71... \n", "3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38... \n", "4 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... \n", "... ... \n", "9995 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... \n", "9996 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12... \n", "9997 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2... \n", "9998 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... \n", "9999 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4... \n", "\n", " weight label is_training \n", "0 [0.5799200799200799, 0.705335731414868, 0.8280... [0] 1 \n", "1 [0.6971028971028971, 0.8390287769784173, 0.435... [0] 1 \n", "2 [0.11528471528471529, 0.9327537969624301, 0.94... [1] 1 \n", "3 [0.6217782217782217, 0.3188449240607514, 0.923... [1] 1 \n", "4 [0.3226773226773227, 0.3240407673860911, 0.225... [1] 1 \n", "... ... ... ... \n", "9995 [0.09270729270729271, 0.3959832134292566, 0.03... [0] 0 \n", "9996 [0.5147852147852148, 0.48810951239008793, 0.46... [1] 0 \n", "9997 [0.4792207792207792, 0.4045763389288569, 0.514... [1] 0 \n", "9998 [0.550949050949051, 0.1035171862509992, 0.2167... [0] 0 \n", "9999 [0.9004995004995004, 0.9000799360511591, 0.826... [0] 0 \n", "\n", "[10000 rows x 4 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m_train_len = int(len(df) * 0.7)\n", "df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)\n", "df = df[['id', 'weight', 'label', 'is_training']]\n", "df.to_pandas()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Till now, data generation, proprocessing and feature engineering are completed. The processed data can be passed into the model for training." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 1 }