{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Quick Start: MindSpore Pandas Data Processing\n", "\n", "[![View source files in Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0/resource/_static/logo_source_en.png)](https://gitee.com/mindspore/docs/blob/r2.0/docs/mindpandas/docs/source_en/mindpandas_quick_start.ipynb)\n", "\n", "Data preprocessing is vital for model training. With good feature engineering, training accuracy could be significantly enhanced. This tutorial takes the feature engineering of recommender system as an example to introduce the procedure of using MindSpore Pandas to process data.\n", "\n", "## Setting MindSpore Pandas Execution Mode\n", "\n", "MindSpore Pandas supports two execution modes, which are multithread mode and multiprocess mode. This example takes multithread mode as example. We set partition shape to 16*3. Example is shown as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "pycharm": { "is_executing": true, "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import mindpandas as pd\n", "import random\n", "\n", "pd.set_concurrency_mode(\"multithread\")\n", "pd.set_partition_shape((16, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Generation\n", "\n", "Two dimensional data sized 10,000 rows and 40 columns, with label, dense features and sparse features is generated. The label is a random number with the value \"0\" or \"1\", the dense features are random numbers with the value range from -10 to 10000, and the sparse features are random strings." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "DENSE_NUM = 13\n", "SPARSE_NUM = 26\n", "ROW_NUM = 10000\n", "cat_val, int_val, lab_val = [], [], []\n", "\n", "def gen_cat_feature(length):\n", " result = hex(random.randint(0, 16 ** length)).replace('0x', '').upper()\n", " if len(result) < length:\n", " result = '0' * (length - len(result)) + result\n", " return str(result)\n", "\n", "def gen_int_feature():\n", " return random.randint(-10, 10000)\n", "\n", "def gen_lab_feature():\n", " x = random.randint(0, 1)\n", " return round(x)\n", "\n", "for i in range(ROW_NUM * SPARSE_NUM):\n", " cat_val.append(gen_cat_feature(8))\n", "np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)\n", "df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])\n", "\n", "for i in range(ROW_NUM * DENSE_NUM):\n", " int_val.append(gen_int_feature())\n", "np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)\n", "df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])\n", "\n", "for i in range(ROW_NUM):\n", " lab_val.append(gen_lab_feature())\n", "np_lab = np.array(lab_val).reshape(ROW_NUM, 1)\n", "df_lab = pd.DataFrame(np_lab, columns=['label'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "\n", "Label, dense features and sparse features are concatenated to form the to-be-processed dataset. The results are shown as follows:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelI1I2I3I4I5I6I7I8I9...C17C18C19C20C21C22C23C24C25C26
005795705182777859305752152066240172...A5AE1E6D25A100C3C6B8E0A4A94F6B56B27D726BEB9F3C73D98D17B2793AB3158C12657FAFCEEBFF
10696883894352331240215087225442494411...EEAC1040BDC711B916269D1BD59EA7BB460218D4F89E137CF488ED52C1DDB598AE9C21C911D47A2A
21114493279399774581447189166310056421...54EE530F68D2F7EFEFD65C79B2F2CCF586E0211031617C1944A2DFA4032C30D1C8098BADCE4DD8BB
3162143183922993891602783268047754436...639D80AA3A14B8849FC92B4F67DB32801EE1FC45CE19F4C1F34CC6FDC3C9F66CCA1B3F85F184D01E
413220323522435050746328689468383063...7671D909126B3F691262514D25C181372BA958DED6CE7BE318D4EEE1315D0FFB7C25DB1D6E4ABFB1
\n", "

5 rows × 40 columns

\n", "
" ], "text/plain": [ " label I1 I2 I3 I4 I5 I6 I7 I8 I9 ... C17 \\\n", "0 0 5795 7051 8277 785 9305 7521 5206 6240 172 ... A5AE1E6D \n", "1 0 6968 8389 4352 3312 4021 5087 2254 4249 4411 ... EEAC1040 \n", "2 1 1144 9327 9399 7745 8144 7189 1663 1005 6421 ... 54EE530F \n", "3 1 6214 3183 9229 938 9160 2783 2680 4775 4436 ... 639D80AA \n", "4 1 3220 3235 2243 50 5074 6328 6894 6838 3063 ... 7671D909 \n", "\n", " C18 C19 C20 C21 C22 C23 C24 \\\n", "0 25A100C3 C6B8E0A4 A94F6B56 B27D726B EB9F3C73 D98D17B2 793AB315 \n", "1 BDC711B9 16269D1B D59EA7BB 460218D4 F89E137C F488ED52 C1DDB598 \n", "2 68D2F7EF EFD65C79 B2F2CCF5 86E02110 31617C19 44A2DFA4 032C30D1 \n", "3 3A14B884 9FC92B4F 67DB3280 1EE1FC45 CE19F4C1 F34CC6FD C3C9F66C \n", "4 126B3F69 1262514D 25C18137 2BA958DE D6CE7BE3 18D4EEE1 315D0FFB \n", "\n", " C25 C26 \n", "0 8C12657F AFCEEBFF \n", "1 AE9C21C9 11D47A2A \n", "2 C8098BAD CE4DD8BB \n", "3 CA1B3F85 F184D01E \n", "4 7C25DB1D 6E4ABFB1 \n", "\n", "[5 rows x 40 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.concat([df_lab, df_int, df_cat], axis=1)\n", "df.to_pandas().head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering\n", "\n", "1. Get the maximum and minimum values of each column of the dense data, to prepare for subsequent normalization." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "max_dict, min_dict = {}, {}\n", "for i, j in enumerate(df_int.max()):\n", " max_dict[f'I{i + 1}'] = j\n", "\n", "for i, j in enumerate(df_int.min()):\n", " min_dict[f'I{i + 1}'] = j" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Select columns 2 to 40 of df and copy into a new dataframe, named features." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "features = df.iloc[:, 1:40]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "3. Apply a custom function to the \"label\" column of df, and the numeric values are converted to numpy array. The result is added to \"label\" column." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_label(x):\n", " return np.array([x])\n", "df['label'] = df['label'].apply(get_label)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "4. Apply a custom function to features, normalize the dense data, fill the other data with 1, and add the data to the \"weight\" column of df." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_weight(x):\n", " ret = []\n", " for index, val in enumerate(x):\n", " if index < DENSE_NUM:\n", " col = f'I{index + 1}'\n", " ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))\n", " else:\n", " ret.append(1)\n", " return ret\n", "feat_weight = features.apply(get_weight, axis=1)\n", "df['weight'] = feat_weight" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "5. Apply a custom function to features, get the index of the dense data, other data is filled with its hash value, add the data to the \"id\" column of df." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_id(x):\n", " ret = []\n", " for index, val in enumerate(x):\n", " if index < DENSE_NUM:\n", " ret.append(index + 1)\n", " else:\n", " ret.append(hash(val))\n", " return ret\n", "feat_id = features.apply(get_id, axis=1)\n", "df['id'] = feat_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Splitting Dataset\n", "\n", "Adding \"is_training\" column. The first 70% of the data is set as training data and other data is set as non-training data. The results are shown below:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idweightlabelis_training
0[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31...[0.5799200799200799, 0.705335731414868, 0.8280...[0]1
1[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3...[0.6971028971028971, 0.8390287769784173, 0.435...[0]1
2[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71...[0.11528471528471529, 0.9327537969624301, 0.94...[1]1
3[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38...[0.6217782217782217, 0.3188449240607514, 0.923...[1]1
4[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3...[0.3226773226773227, 0.3240407673860911, 0.225...[1]1
...............
9995[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6...[0.09270729270729271, 0.3959832134292566, 0.03...[0]0
9996[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12...[0.5147852147852148, 0.48810951239008793, 0.46...[1]0
9997[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2...[0.4792207792207792, 0.4045763389288569, 0.514...[1]0
9998[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6...[0.550949050949051, 0.1035171862509992, 0.2167...[0]0
9999[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4...[0.9004995004995004, 0.9000799360511591, 0.826...[0]0
\n", "

10000 rows × 4 columns

\n", "
" ], "text/plain": [ " id \\\n", "0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31... \n", "1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... \n", "2 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71... \n", "3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38... \n", "4 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... \n", "... ... \n", "9995 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... \n", "9996 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12... \n", "9997 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2... \n", "9998 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... \n", "9999 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4... \n", "\n", " weight label is_training \n", "0 [0.5799200799200799, 0.705335731414868, 0.8280... [0] 1 \n", "1 [0.6971028971028971, 0.8390287769784173, 0.435... [0] 1 \n", "2 [0.11528471528471529, 0.9327537969624301, 0.94... [1] 1 \n", "3 [0.6217782217782217, 0.3188449240607514, 0.923... [1] 1 \n", "4 [0.3226773226773227, 0.3240407673860911, 0.225... [1] 1 \n", "... ... ... ... \n", "9995 [0.09270729270729271, 0.3959832134292566, 0.03... [0] 0 \n", "9996 [0.5147852147852148, 0.48810951239008793, 0.46... [1] 0 \n", "9997 [0.4792207792207792, 0.4045763389288569, 0.514... [1] 0 \n", "9998 [0.550949050949051, 0.1035171862509992, 0.2167... [0] 0 \n", "9999 [0.9004995004995004, 0.9000799360511591, 0.826... [0] 0 \n", "\n", "[10000 rows x 4 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m_train_len = int(len(df) * 0.7)\n", "df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)\n", "df = df[['id', 'weight', 'label', 'is_training']]\n", "df.to_pandas()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Till now, data generation, proprocessing and feature engineering are completed. The processed data can be passed into the model for training." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 1 }