Quickstart: Data Processing with MindPandas

Data preprocessing is critical for model training, and good feature engineering can substantially improve training accuracy. This chapter walks through processing data with MindPandas, using feature engineering for a recommender system as the example.

Setting the MindPandas Execution Mode

MindPandas supports both multithread and multiprocess execution modes. This example uses the multithread mode (see the MindPandas execution mode introduction and configuration instructions for details) and sets the partition shape to 16 × 3:

[1]:
import random

import numpy as np
import mindpandas as pd

pd.set_concurrency_mode("multithread")
pd.set_partition_shape((16, 3))
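
For intuition, a (16, 3) partition shape splits a DataFrame into a 16 × 3 grid of blocks that can be processed concurrently. Assuming a roughly even split (the exact block boundaries are an internal detail of MindPandas), the 10000 × 40 dataset built below yields blocks of about 625 rows × 14 columns:

```python
import math

# Illustrative only: approximate per-block size under a (16, 3) partition
# shape, assuming rows and columns are split roughly evenly.
rows, cols = 10000, 40
part_rows, part_cols = 16, 3
block_rows = math.ceil(rows / part_rows)  # 625
block_cols = math.ceil(cols / part_cols)  # 14
print(block_rows, block_cols)
```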

Data Generation

Generate a 10000-row, 40-column two-dimensional dataset containing a label, dense features, and sparse features. The label is a random value of 0 or 1, each dense feature is a random integer in [-10, 10000] (both endpoints inclusive), and each sparse feature is a random 8-character hexadecimal string.

[2]:
DENSE_NUM = 13    # number of dense (numeric) feature columns
SPARSE_NUM = 26   # number of sparse (categorical) feature columns
ROW_NUM = 10000   # number of rows to generate
cat_val, int_val, lab_val = [], [], []

def gen_cat_feature(length):
    # Random fixed-width uppercase hex string; the upper bound is
    # 16 ** length - 1 so the result never exceeds `length` digits.
    result = hex(random.randint(0, 16 ** length - 1)).replace('0x', '').upper()
    return result.zfill(length)

def gen_int_feature():
    # Dense feature: random integer in [-10, 10000] (randint is inclusive)
    return random.randint(-10, 10000)

def gen_lab_feature():
    # Label: random 0 or 1
    return random.randint(0, 1)

# Sparse features: ROW_NUM x SPARSE_NUM random 8-character hex strings
for i in range(ROW_NUM * SPARSE_NUM):
    cat_val.append(gen_cat_feature(8))
np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)
df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])

# Dense features: ROW_NUM x DENSE_NUM random integers in [-10, 10000]
for i in range(ROW_NUM * DENSE_NUM):
    int_val.append(gen_int_feature())
np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)
df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])

# Labels: ROW_NUM random 0/1 values
for i in range(ROW_NUM):
    lab_val.append(gen_lab_feature())
np_lab = np.array(lab_val).reshape(ROW_NUM, 1)
df_lab = pd.DataFrame(np_lab, columns=['label'])
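
As a standalone sanity check (mirroring the category generator above, not part of the tutorial itself), every sparse value should be a fixed-width uppercase hex string:

```python
import random

def gen_cat_feature(length):
    # Same idea as the generator above: random value in [0, 16**length - 1],
    # zero-padded so every sparse feature has exactly `length` hex digits.
    return hex(random.randint(0, 16 ** length - 1)).replace('0x', '').upper().zfill(length)

samples = [gen_cat_feature(8) for _ in range(1000)]
assert all(len(s) == 8 for s in samples)
assert all(c in '0123456789ABCDEF' for s in samples for c in s)
```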

Data Preprocessing

Concatenate the label, the dense features, and the sparse features into the dataset to be processed. The result is shown below:

[3]:
df = pd.concat([df_lab, df_int, df_cat], axis=1)
df.to_pandas().head(5)
[3]:
label I1 I2 I3 I4 I5 I6 I7 I8 I9 ... C17 C18 C19 C20 C21 C22 C23 C24 C25 C26
0 0 153 4326 4239 3998 4394 8434 8463 7862 9993 ... 938379C6 9878C0E2 A75A4A8C D9F9E0F2 173E6F23 004968BA E66F6B9F 287A48D1 AC62D5CE A723AB7F
1 1 1962 6771 372 1754 7408 9176 6414 751 7680 ... 1613C18C CE911717 8B35FF3E 585C6D76 5A4EF600 3FA13F3A 1B8B88AD C232D96E CD630ACA AB435A6A
2 1 8665 1485 3321 5368 2658 6317 2848 2780 2522 ... 193587B6 17AC3A54 025D3F81 5E2D04CB D28747FF D6A6A51A C4E08EE7 C520A45C B8CB53F1 3933626E
3 1 7794 5804 9079 4813 1912 4740 212 373 620 ... 8C816BC2 F5AA01BE 08CBECA8 DC884327 9F95F1D4 9C389A00 7CFFC865 DC9203DB 86DC5DC2 EFFF0EAC
4 0 3331 4672 9741 6430 4610 8867 9055 3170 7955 ... E18EF1EB 0905B30C 1A584C44 BAC91CC4 8DAAC9B4 7298201D 73A30ED7 9560AB20 6B452601 D7754942

5 rows × 40 columns

Feature Engineering

  1. Compute the maximum and minimum of each dense feature column, in preparation for the normalization step later.

[4]:
max_dict, min_dict = {}, {}
# Per-column max/min of the dense features, keyed by column name
for i, j in enumerate(df_int.max()):
    max_dict[f'I{i + 1}'] = j

for i, j in enumerate(df_int.min()):
    min_dict[f'I{i + 1}'] = j
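
The two loops above are equivalent to calling `.max()`/`.min()` and converting the resulting Series to a dict. A minimal sketch with standard pandas (MindPandas mirrors this API, but verify against your version):

```python
import pandas as pd

# Toy stand-in for df_int, small enough to check by eye
df_int = pd.DataFrame({'I1': [1, 5, 3], 'I2': [-2, 0, 7]})
max_dict = df_int.max().to_dict()  # per-column maxima keyed by column name
min_dict = df_int.min().to_dict()  # per-column minima keyed by column name
print(max_dict, min_dict)
```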
  2. Slice columns 2 through 40 of df, i.e. all feature columns without the label.

[5]:
features = df.iloc[:, 1:40]  # all 39 feature columns, dropping 'label'
  3. Apply a custom function to the "label" column of df, converting each value into a NumPy array, and write the result back to the "label" column.

[6]:
def get_label(x):
    # Wrap each scalar label in a 1-element NumPy array
    return np.array([x])
df['label'] = df['label'].apply(get_label)
  4. Apply a custom function to features: min-max normalize the dense features and fill the remaining positions with 1, then store the result in a new "weight" column of df.

[7]:
def get_weight(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            # Dense feature: min-max normalize to [0, 1]
            col = f'I{index + 1}'
            ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))
        else:
            # Sparse feature: constant weight of 1
            ret.append(1)
    return ret
feat_weight = features.apply(get_weight, axis=1)
df['weight'] = feat_weight
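
The dense branch of `get_weight` is plain min-max scaling, which linearly maps each column's [min, max] range onto [0, 1]. A worked check of that formula in isolation:

```python
def min_max(val, lo, hi):
    # Min-max scaling: lo -> 0.0, hi -> 1.0, linear in between
    return (val - lo) / (hi - lo)

print(min_max(-10, -10, 10000))    # 0.0
print(min_max(10000, -10, 10000))  # 1.0
print(min_max(4995, -10, 10000))   # 0.5 (the exact midpoint of the range)
```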
  5. Apply a custom function to features: for dense features record their positional index, for the others record the hash of the value, then store the result in a new "id" column of df.

[8]:
def get_id(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            # Dense feature: use the 1-based column index as the id
            ret.append(index + 1)
        else:
            # Sparse feature: use the hash of the string value
            ret.append(hash(val))
    return ret
feat_id = features.apply(get_id, axis=1)
df['id'] = feat_id
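
Note that Python's built-in hash() is randomized per interpreter run for strings (controlled by PYTHONHASHSEED), so the sparse-feature ids above are not reproducible across runs. If stable ids matter, a deterministic hash such as zlib.crc32 could be swapped in; a hypothetical alternative, not part of the tutorial:

```python
import zlib

def stable_hash(s):
    # Unlike built-in hash(), which is randomized per interpreter run for
    # strings, crc32 returns the same 32-bit value in every run.
    return zlib.crc32(s.encode('utf-8'))

print(stable_hash('938379C6') == stable_hash('938379C6'))  # True
```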

Dataset Splitting

Add an "is_training" column to df, marking the first 70% of the rows as training data and the rest as non-training data. The result is shown below:

[9]:
m_train_len = int(len(df) * 0.7)  # first 70% of rows form the training set
df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)
df = df[['id', 'weight', 'label', 'is_training']]  # keep only the model inputs
df.to_pandas()
[9]:
id weight label is_training
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 89... [0.016285343191127986, 0.4332400559664201, 0.4... [0] 1
1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 70... [0.19702267958837047, 0.6775934439336398, 0.03... [1] 1
2 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -5... [0.8667199520431611, 0.14931041375174894, 0.33... [1] 1
3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 40... [0.7796982715556, 0.5809514291425145, 0.907992... [1] 1
4 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 64... [0.3337995803776601, 0.467819308414951, 0.9741... [0] 1
... ... ... ... ...
9995 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 87... [0.8151663502847437, 0.962722366580052, 0.5130... [1] 0
9996 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 47... [0.6402237985812769, 0.9683190085948431, 0.948... [1] 0
9997 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -7... [0.9435508042761515, 0.9097541475114931, 0.313... [0] 0
9998 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 36... [0.6173443900489559, 0.41225264841095344, 0.92... [1] 0
9999 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 53... [0.869017883904486, 0.8232060763541875, 0.5049... [0] 0

10000 rows × 4 columns
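
The split above is positional: the first 70% of rows become the training set. That is fine here because every row is generated independently at random, but for real, ordered data one would typically shuffle before splitting. A sketch with standard pandas (illustrative, not part of the tutorial):

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})
# Shuffle rows first so the positional 70/30 split is not order-biased
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
m = int(len(df) * 0.7)
df['is_training'] = [1] * m + [0] * (len(df) - m)
print(df['is_training'].sum())  # 7
```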

At this point data generation, data preprocessing, and feature engineering are complete, and the processed data can be fed into a model for training.