{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Quick Start: MindSpore Pandas Data Processing\n", "\n", "[![View source files in Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0/resource/_static/logo_source_en.png)](https://gitee.com/mindspore/docs/blob/r2.0/docs/mindpandas/docs/source_en/mindpandas_quick_start.ipynb)\n", "\n", "Data preprocessing is vital for model training, and good feature engineering can significantly improve training accuracy. This tutorial uses feature engineering for a recommender system as an example to show how to process data with MindSpore Pandas.\n", "\n", "## Setting MindSpore Pandas Execution Mode\n", "\n", "MindSpore Pandas supports two execution modes: multithread and multiprocess. This example uses multithread mode and sets the partition shape to 16*3, as shown below:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "pycharm": { "is_executing": true, "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import mindpandas as pd\n", "import random\n", "\n", "pd.set_concurrency_mode(\"multithread\")\n", "pd.set_partition_shape((16, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Generation\n", "\n", "A two-dimensional dataset of 10,000 rows and 40 columns is generated, consisting of one label column, 13 dense feature columns, and 26 sparse feature columns. The label is a random value of \"0\" or \"1\", the dense features are random integers in the range -10 to 10000, and the sparse features are random 8-character hexadecimal strings." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "DENSE_NUM = 13\n", "SPARSE_NUM = 26\n", "ROW_NUM = 10000\n", "cat_val, int_val, lab_val = [], [], []\n", "\n", "def gen_cat_feature(length):\n", "    # Random uppercase hex string, left-padded with zeros to `length` characters.\n", "    # randint is inclusive, so the upper bound is 16 ** length - 1.\n", "    result = hex(random.randint(0, 16 ** length - 1)).replace('0x', '').upper()\n", "    if len(result) < length:\n", "        result = '0' * (length - len(result)) + result\n", "    return str(result)\n", "\n", "def gen_int_feature():\n", "    # Random integer in [-10, 10000] for a dense feature.\n", "    return random.randint(-10, 10000)\n", "\n", "def gen_lab_feature():\n", "    # Random binary label: 0 or 1.\n", "    return random.randint(0, 1)\n", "\n", "# Sparse (categorical) features: columns C1..C26\n", "for i in range(ROW_NUM * SPARSE_NUM):\n", "    cat_val.append(gen_cat_feature(8))\n", "np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)\n", "df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])\n", "\n", "# Dense (integer) features: columns I1..I13\n", "for i in range(ROW_NUM * DENSE_NUM):\n", "    int_val.append(gen_int_feature())\n", "np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)\n", "df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])\n", "\n", "# Label column\n", "for i in range(ROW_NUM):\n", "    lab_val.append(gen_lab_feature())\n", "np_lab = np.array(lab_val).reshape(ROW_NUM, 1)\n", "df_lab = pd.DataFrame(np_lab, columns=['label'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "\n", "The label, dense features, and sparse features are concatenated to form the dataset to be processed. The result is shown below:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
|   | label | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | ... | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | C25 | C26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5795 | 7051 | 8277 | 785 | 9305 | 7521 | 5206 | 6240 | 172 | ... | A5AE1E6D | 25A100C3 | C6B8E0A4 | A94F6B56 | B27D726B | EB9F3C73 | D98D17B2 | 793AB315 | 8C12657F | AFCEEBFF |
| 1 | 0 | 6968 | 8389 | 4352 | 3312 | 4021 | 5087 | 2254 | 4249 | 4411 | ... | EEAC1040 | BDC711B9 | 16269D1B | D59EA7BB | 460218D4 | F89E137C | F488ED52 | C1DDB598 | AE9C21C9 | 11D47A2A |
| 2 | 1 | 1144 | 9327 | 9399 | 7745 | 8144 | 7189 | 1663 | 1005 | 6421 | ... | 54EE530F | 68D2F7EF | EFD65C79 | B2F2CCF5 | 86E02110 | 31617C19 | 44A2DFA4 | 032C30D1 | C8098BAD | CE4DD8BB |
| 3 | 1 | 6214 | 3183 | 9229 | 938 | 9160 | 2783 | 2680 | 4775 | 4436 | ... | 639D80AA | 3A14B884 | 9FC92B4F | 67DB3280 | 1EE1FC45 | CE19F4C1 | F34CC6FD | C3C9F66C | CA1B3F85 | F184D01E |
| 4 | 1 | 3220 | 3235 | 2243 | 50 | 5074 | 6328 | 6894 | 6838 | 3063 | ... | 7671D909 | 126B3F69 | 1262514D | 25C18137 | 2BA958DE | D6CE7BE3 | 18D4EEE1 | 315D0FFB | 7C25DB1D | 6E4ABFB1 |

5 rows × 40 columns

|   | id | weight | label | is_training |
|---|---|---|---|---|
| 0 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31... | [0.5799200799200799, 0.705335731414868, 0.8280... | [0] | 1 |
| 1 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.6971028971028971, 0.8390287769784173, 0.435... | [0] | 1 |
| 2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71... | [0.11528471528471529, 0.9327537969624301, 0.94... | [1] | 1 |
| 3 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38... | [0.6217782217782217, 0.3188449240607514, 0.923... | [1] | 1 |
| 4 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.3226773226773227, 0.3240407673860911, 0.225... | [1] | 1 |
| ... | ... | ... | ... | ... |
| 9995 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.09270729270729271, 0.3959832134292566, 0.03... | [0] | 0 |
| 9996 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12... | [0.5147852147852148, 0.48810951239008793, 0.46... | [1] | 0 |
| 9997 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2... | [0.4792207792207792, 0.4045763389288569, 0.514... | [1] | 0 |
| 9998 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.550949050949051, 0.1035171862509992, 0.2167... | [0] | 0 |
| 9999 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4... | [0.9004995004995004, 0.9000799360511591, 0.826... | [0] | 0 |

10000 rows × 4 columns