# Using TabSim Data Simulator [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/xai/docs/source_en/using_tabsim.md) ## Introduction Sometimes, it is impossible to get sufficient amount of data for modeling or algorithm development, TabSim can be used for capturing the distribution of the tabular data and generating simulated data afterward. The complete code of the tutorial below is [using_tabsim.py](https://gitee.com/mindspore/xai/blob/master/examples/using_tabsim.py). ## Installation TabSim is part of the XAI package, no extra installation is required besides [MindSpore](https://mindspore.cn/install/en) and [XAI](https://www.mindspore.cn/xai/docs/en/master/installation.html). ## User Flow There are 2 phases in the TabSim user flow: 1. Digestion: Analyzing the real tabular data, capturing the statistic characteristics and output the digest file. Accomplished by the commandline tool `mindspore_xai tabdig`. 2. Simulation: Generating simulated data according to the statistics stores in the digested file. Accomplished by the commandline tool `mindspore_xai tabsim`. ## Digestion Phase ```bash mindspore_xai tabdig [--bins ] [--clip-sd ] ``` ``: Path of the real CSV table to be simulated. ``: Path of the digest file to be saved. ``: [optional] Number of bins [2 - 32] for discretizing numeric columns, default: 10 ``: [optional] Number of standard deviations away from the mean that defines the outliers, outlier values will be clipped. default: 3, setting to 0 or less will disable the value clipping. ### File Format of Real Data The real data must be a CSV file with header that contains the names and type of all columns. Header pattern: `|,|,|,...` ``: Column name, allowed pattern: `[0-9a-zA-Z_\-]+` ``: Column type, options: 'int', 'float', 'str', 'cat' - 'int': Integers - 'float': Float numbers - 'str': Strings, allowed pattern: `[0-9a-zA-Z_\-\+\.]*` - 'cat': Catergorical values, the underlying data type is integer without order 'int' and 'float' are numeric, while 'str' and 'cat' are discrete columns. There are at most 256 distinct values allowed in each discrete column. Optionally, users may specify at most one label column by adding a '*' before a discrete column (numeric columns are not allowed). Header example: `col_A|int,col_B|float,col_C|str,*col_D|cat` It is recommended users randomly pick around 1 million records from the real database to form the real data file for the statistic accuracy and the memory constraints. ### File Format of Digest Files Digest files are clear text json, they stores the name, type and value distributions of columns without any actual record. Users should never modify the digest file manually; otherwise that may corrupt it. ### Digestion Example We used the [Iris](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) dataset for the demonstration. This dataset consists of 3 different types of irises’ petal and sepal lengths. The Python code below writes tabular data to `real_table.csv`. ```python import sklearn.datasets iris = sklearn.datasets.load_iris() features = iris.data labels = iris.target # save the tabular data to file header = 'sepal_length|float,sepal_width|float,petal_length|float,petal_width|float,*class|cat' with open('real_table.csv', 'w') as f: f.write(header + '\n') for i in range(len(labels)): for feat in features[i]: f.write("{},".format(feat)) f.write("{}\n".format(labels[i])) ``` Content of `real_table.csv`: ```text sepal_length|float,sepal_width|float,petal_length|float,petal_width|float,*class|cat 5.1,3.5,1.4,0.2,0 4.9,3.0,1.4,0.2,0 4.7,3.2,1.3,0.2,0 4.6,3.1,1.5,0.2,0 5.0,3.6,1.4,0.2,0 5.4,3.9,1.7,0.4,0 4.6,3.4,1.4,0.3,0 5.0,3.4,1.5,0.2,0 4.4,2.9,1.4,0.2,0 ... ``` Then, we analyze the real tabular data, capturing the statistic characteristics and output it to `digest.json`. ```bash mindspore_xai tabdig real_table.csv digest.json ``` Content of `digest.json`: ```json { "label_col_idx": 4, "columns": [ { "name": "sepal_length", "idx": 0, "ctype": "float", "dtype": "float", "is_numeric": true, "is_label": false, ... ``` ## Simulation Phase ```bash mindspore_xai tabsim [--batch-size ] [--noise ] ``` ``: Path of the digest file of the real data. ``: Path of the simulated CSV table. ``: Number of rows to be generated to ``. ``: [optional] Number of rows in each batch, default: 10000 ``: [optional] Noise level (0.0-1.0) of value picking probabilities, 0.0 means 100% follows the digested joint distributions, higher the noise level more even the probabilities. default: 0 ### File Format of Simulated Data The simulated CSV file has a similar format to the real data CSV file, but the header is different: `,,,...` It contains no `` and no `*` while the column order remains the same. ### Simulation Example Here we generate 200000 rows of simulated data according to the statistics stored in the digested file, the simulated data is output to `sim_table.csv`. ```bash mindspore_xai tabsim digest.json sim_table.csv 200000 ``` Content of `sim_table.csv`: ```text sepal_length,sepal_width,petal_length,petal_width,class 5.577184113278916,2.600922272560204,4.432243573999988,1.3937476921377445,1 6.723739024436704,2.7995789972671985,4.093195099230183,1.377081159510022,1 4.787110003892638,2.8994714750972608,1.221068662892122,0.18023497892950327,0 5.47601589088659,2.683719381022501,4.429520567795243,1.44376166769605,1 5.713634033969561,2.238437659593092,4.468051986603512,1.5218876291352155,1 6.014412107785783,2.921972441210267,4.066770696930024,0.9183809029577147,1 6.188386742135447,2.92122446931648,5.288862927543273,1.4537708701756062,2 7.394485586937094,2.867479423550221,5.730391070749579,1.998759192383688,2 5.468839597899383,2.8957462954323083,4.4090170094158525,1.502682955942951,1 ... ```