[{"data":1,"prerenderedAt":218},["ShallowReactive",2],{"content-query-dT69cPOe2G":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":10,"date":11,"cover":12,"type":13,"category":14,"body":15,"_type":212,"_id":213,"_source":214,"_file":215,"_stem":216,"_extension":217},"/technology-blogs/en/1770","en",false,"",[9],"MindSpore Made Easy","Learn more about MindSpore.","2022-06-22","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/09/05/912e9e9141424d33b87131087e5110cc.png","technology-blogs","Developer Sharing",{"type":16,"children":17,"toc":209},"root",[18,32,38,48,53,61,66,74,79,87,92,100,105,113,118,126,131,139,144,152,157,165,170,178,183,191,196,204],{"type":19,"tag":20,"props":21,"children":23},"element","h1",{"id":22},"mindspore-made-easy-dive-into-mindsporecsvdataset-for-dataset-loading",[24,30],{"type":19,"tag":25,"props":26,"children":27},"span",{},[28],{"type":29,"value":9},"text",{"type":29,"value":31}," Dive Into MindSpore—CSVDataset for Dataset Loading",{"type":19,"tag":33,"props":34,"children":35},"p",{},[36],{"type":29,"value":37},"June 22, 2022. Author: kaierlong. Development environment: Ubuntu 20.04, Python 3.8, MindSpore 1.7.0. 1. Parameter Description dataset_files: dataset file path, which can be a single file or a list of files. field_delim: delimiter used to separate fields. The default value is a comma (,). column_names: field names, which are used as the keys of the data fields. shuffle: indicates whether to shuffle the data. Possible values are False, Shuffle.GLOBAL, and Shuffle.FILES. Shuffle.GLOBAL (default): shuffles both the files and the data within each file. Shuffle.FILES: shuffles the files only. 2. 
Preparing Data 2.1 Downloading Data Run the following commands to download iris.data and iris.names to a specified directory:",{"type":19,"tag":39,"props":40,"children":42},"pre",{"code":41},"mkdir iris && cd iris\nwget -c https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\nwget -c https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names\n",[43],{"type":19,"tag":44,"props":45,"children":46},"code",{"__ignoreMap":7},[47],{"type":29,"value":41},{"type":19,"tag":33,"props":49,"children":50},{},[51],{"type":29,"value":52},"Note: If the wget command cannot be used due to system restrictions, you can use a web browser to download the files. 2.2 Data Overview The Iris flower dataset, widely used for multivariate analysis, contains 150 records in three classes (Setosa, Versicolour, and Virginica). Each class contains 50 records, and each record contains four attributes: sepal length, sepal width, petal length, and petal width. For more details, see the official dataset description in the iris.names file downloaded above. 2.3 Dataset Division The dataset is divided into a training dataset and a test dataset in a 4:1 ratio. 
The code is as follows:",{"type":19,"tag":39,"props":54,"children":56},{"code":55},"from random import shuffle\n\n\ndef preprocess_iris_data(iris_data_file, train_file, test_file, header=True):\n    cls_0 = \"Iris-setosa\"\n    cls_1 = \"Iris-versicolor\"\n    cls_2 = \"Iris-virginica\"\n\n    cls_0_samples = []\n    cls_1_samples = []\n    cls_2_samples = []\n\n    with open(iris_data_file, \"r\", encoding=\"UTF8\") as fp:\n        lines = fp.readlines()\n    for line in lines:\n        line = line.strip()\n        if not line:\n            continue\n        if cls_0 in line:\n            cls_0_samples.append(line)\n            continue\n        if cls_1 in line:\n            cls_1_samples.append(line)\n            continue\n        if cls_2 in line:\n            cls_2_samples.append(line)\n\n    shuffle(cls_0_samples)\n    shuffle(cls_1_samples)\n    shuffle(cls_2_samples)\n\n    print(\"number of class 0: {}\".format(len(cls_0_samples)), flush=True)\n    print(\"number of class 1: {}\".format(len(cls_1_samples)), flush=True)\n    print(\"number of class 2: {}\".format(len(cls_2_samples)), flush=True)\n\n    train_samples = cls_0_samples[:40] + cls_1_samples[:40] + cls_2_samples[:40]\n    test_samples = cls_0_samples[40:] + cls_1_samples[40:] + cls_2_samples[40:]\n\n    header_content = \"Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Classes\"\n\n    with open(train_file, \"w\", encoding=\"UTF8\") as fp:\n        if header:\n            fp.write(\"{}\\n\".format(header_content))\n        for sample in train_samples:\n            fp.write(\"{}\\n\".format(sample))\n\n    with open(test_file, \"w\", encoding=\"UTF8\") as fp:\n        if header:\n            fp.write(\"{}\\n\".format(header_content))\n        for sample in test_samples:\n            fp.write(\"{}\\n\".format(sample))\n\n\ndef main():\n    iris_data_file = \"{your_path}/iris/iris.data\"\n    iris_train_file = \"{your_path}/iris/iris_train.csv\"\n    iris_test_file = \"{your_path}/iris/iris_test.csv\"\n\n    preprocess_iris_data(iris_data_file, iris_train_file, iris_test_file)\n\n\nif __name__ == \"__main__\":\n    main()\n",[57],{"type":19,"tag":44,"props":58,"children":59},{"__ignoreMap":7},[60],{"type":29,"value":55},{"type":19,"tag":33,"props":62,"children":63},{},[64],{"type":29,"value":65},"Save the above code to the preprocess.py file (replace the data file paths with the actual ones) and run the following command: python3 preprocess.py The output is as follows:",{"type":19,"tag":39,"props":67,"children":69},{"code":68},"number of class 0: 50\nnumber of class 1: 50\nnumber of class 2: 50\n",[70],{"type":19,"tag":44,"props":71,"children":72},{"__ignoreMap":7},[73],{"type":29,"value":68},{"type":19,"tag":33,"props":75,"children":76},{},[77],{"type":29,"value":78},"The iris_train.csv and iris_test.csv files are generated. The directory content is as follows:",{"type":19,"tag":39,"props":80,"children":82},{"code":81},".\n├── iris.data\n├── iris.names\n├── iris_test.csv\n└── iris_train.csv\n",[83],{"type":19,"tag":44,"props":84,"children":85},{"__ignoreMap":7},[86],{"type":29,"value":81},{"type":19,"tag":33,"props":88,"children":89},{},[90],{"type":29,"value":91},"3. Trial and Error Cases The following shows several trial-and-error cases to help you get familiar with CSVDataset. 
3.1 Usage of column_defaults Run the following code to load data (for convenience in reproducing the problem, we set shuffle to False):",{"type":19,"tag":39,"props":93,"children":95},{"code":94},"from mindspore.dataset import CSVDataset\n\n\ndef dataset_load(data_files):\n    column_defaults = [float, float, float, float, str]\n    column_names = [\"Sepal.Length\", \"Sepal.Width\", \"Petal.Length\", \"Petal.Width\", \"Classes\"]\n\n    dataset = CSVDataset(\n        dataset_files=data_files,\n        field_delim=\",\",\n        column_defaults=column_defaults,\n        column_names=column_names,\n        num_samples=None,\n        shuffle=False)\n\n    data_iter = dataset.create_dict_iterator()\n    item = None\n    for data in data_iter:\n        item = data\n        break\n\n    print(\"====== sample ======\\n{}\".format(item), flush=True)\n\n\ndef main():\n    iris_train_file = \"{your_path}/iris/iris_train.csv\"\n\n    dataset_load(data_files=iris_train_file)\n\n\nif __name__ == \"__main__\":\n    main()\n",[96],{"type":19,"tag":44,"props":97,"children":98},{"__ignoreMap":7},[99],{"type":29,"value":94},{"type":19,"tag":33,"props":101,"children":102},{},[103],{"type":29,"value":104},"Save the above code to the load.py file (replace the data file paths with the actual ones) and run the following command: python3 load.py An error is reported as follows: Traceback (most recent call last):",{"type":19,"tag":39,"props":106,"children":108},{"code":107},"File \"/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py\", line 107, in \u003Cmodule\u003E\nmain()\nFile \"/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py\", line 103, in main\ndataset_load(data_files=iris_train_file)\nFile \"/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py\", line 75, in dataset_load\ndataset = CSVDataset(\nFile \"/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/validators.py\", line 1634, in new_method\nraise TypeError(\"column type in 
column_defaults is invalid.\")\nTypeError: column type in column_defaults is invalid.\n",[109],{"type":19,"tag":44,"props":110,"children":111},{"__ignoreMap":7},[112],{"type":29,"value":107},{"type":19,"tag":33,"props":114,"children":115},{},[116],{"type":29,"value":117},"The code in line 1634 of the mindspore/dataset/engine/validators.py file is as follows:",{"type":19,"tag":39,"props":119,"children":121},{"code":120},"# check column_defaults\ncolumn_defaults = param_dict.get('column_defaults')\nif column_defaults is not None:\n    if not isinstance(column_defaults, list):\n        raise TypeError(\"column_defaults should be type of list.\")\n    for item in column_defaults:\n        if not isinstance(item, (str, int, float)):\n            raise TypeError(\"column type in column_defaults is invalid.\")\n",[122],{"type":19,"tag":44,"props":123,"children":124},{"__ignoreMap":7},[125],{"type":29,"value":120},{"type":19,"tag":33,"props":127,"children":128},{},[129],{"type":29,"value":130},"3.1.1 Analysis column_defaults (list, optional): data types of the data columns. Valid types are float, int, and string. The default value is None, indicating that all columns are treated as strings. The validation code in mindspore/dataset/engine/validators.py shows that each element of column_defaults must be an instance of str, int, or float (that is, a sample default value), not a type object. So we need to change column_defaults = [float, float, float, float, str] to column_defaults = [5.84, 3.05, 3.76, 1.20, \"Classes\"] (the values are taken from the iris.names file). After the modification, run the code again. The following error is reported:",{"type":19,"tag":39,"props":132,"children":134},{"code":133},"WARNING: Logging before InitGoogleLogging() is written to STDERR\n[ERROR] MD(13306,0x70000269b000,Python):2022-06-14-16:51:59.681.109 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Unexpected error. 
Invalid csv, csv file: /Users/kaierlong/Downloads/iris/iris_train.csv parse failed at line 1, type does not match.\nLine of code : 506\nFile : /Users/jenkins/agent-working-dir/workspace/Compile_CPU_X86_MacOS_PY39/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc\n",[135],{"type":19,"tag":44,"props":136,"children":137},{"__ignoreMap":7},[138],{"type":29,"value":133},{"type":19,"tag":33,"props":140,"children":141},{},[142],{"type":29,"value":143},"Traceback (most recent call last):",{"type":19,"tag":39,"props":145,"children":147},{"code":146},"File \"/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py\", line 107, in \u003Cmodule\u003E\nmain()\nFile \"/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py\", line 103, in main\ndataset_load(data_files=iris_train_file)\nFile \"/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py\", line 90, in dataset_load\nfor data in data_iter:\nFile \"/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py\", line 147, in __next__\ndata = self._get_next()\nFile \"/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py\", line 211, in _get_next\nraise err\nFile \"/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py\", line 204, in _get_next\nreturn {k: self._transform_tensor(t) for k, t in self._iterator.GetNextAsMap().items()}\nRuntimeError: Unexpected error. 
Invalid csv, csv file: /Users/kaierlong/Downloads/iris/iris_train.csv parse failed at line 1, type does not match.\nLine of code : 506\nFile : /Users/jenkins/agent-working-dir/workspace/Compile_CPU_X86_MacOS_PY39/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc\n",[148],{"type":19,"tag":44,"props":149,"children":150},{"__ignoreMap":7},[151],{"type":29,"value":146},{"type":19,"tag":33,"props":153,"children":154},{},[155],{"type":29,"value":156},"We'll analyze this error in the following section. 3.2 Processing of the Header Information 3.2.1 Analysis According to the error information, the error originates at line 506 of the mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc file. The source code is as follows:",{"type":19,"tag":39,"props":158,"children":160},{"code":159},"Status CsvOp::LoadFile(const std::string &file, int64_t start_offset, int64_t end_offset, int32_t worker_id) {\n  CsvParser csv_parser(worker_id, jagged_rows_connector_.get(), field_delim_, column_default_list_, file);\n  RETURN_IF_NOT_OK(csv_parser.InitCsvParser());\n  csv_parser.SetStartOffset(start_offset);\n  csv_parser.SetEndOffset(end_offset);\n\n  auto realpath = FileUtils::GetRealPath(file.c_str());\n  if (!realpath.has_value()) {\n    MS_LOG(ERROR) \u003C\u003C \"Invalid file path, \" \u003C\u003C file \u003C\u003C \" does not exist.\";\n    RETURN_STATUS_UNEXPECTED(\"Invalid file path, \" + file + \" does not exist.\");\n  }\n\n  std::ifstream ifs;\n  ifs.open(realpath.value(), std::ifstream::in);\n  if (!ifs.is_open()) {\n    RETURN_STATUS_UNEXPECTED(\"Invalid file, failed to open \" + file + \", the file is damaged or permission denied.\");\n  }\n  if (column_name_list_.empty()) {\n    std::string tmp;\n    getline(ifs, tmp);\n  }\n  csv_parser.Reset();\n  try {\n    while (ifs.good()) {\n      // when ifstream reaches the end of file, the function get() return std::char_traits::eof()\n      // which is a 32-bit -1, it's not equal to the 8-bit -1 on Euler OS. So instead of char, we use\n      // int to receive its return value.\n      int chr = ifs.get();\n      int err = csv_parser.ProcessMessage(chr);\n      if (err != 0) {\n        // if error code is -2, the returned error is interrupted\n        if (err == -2) return Status(kMDInterrupted);\n        RETURN_STATUS_UNEXPECTED(\"Invalid file, failed to parse csv file: \" + file + \" at line \" +\n                                 std::to_string(csv_parser.GetTotalRows() + 1) +\n                                 \". Error message: \" + csv_parser.GetErrorMessage());\n      }\n    }\n  } catch (std::invalid_argument &ia) {\n    std::string err_row = std::to_string(csv_parser.GetTotalRows() + 1);\n    RETURN_STATUS_UNEXPECTED(\"Invalid csv, csv file: \" + file + \" parse failed at line \" + err_row +\n                             \", type does not match.\");\n  } catch (std::out_of_range &oor) {\n    std::string err_row = std::to_string(csv_parser.GetTotalRows() + 1);\n    RETURN_STATUS_UNEXPECTED(\"Invalid csv, \" + file + \" parse failed at line \" + err_row + \" : value out of range.\");\n  }\n  return Status::OK();\n}\n",[161],{"type":19,"tag":44,"props":162,"children":163},{"__ignoreMap":7},[164],{"type":29,"value":159},{"type":19,"tag":33,"props":166,"children":167},{},[168],{"type":29,"value":169},"Notice that when column_names is specified (so column_name_list_ is not empty), the source code never skips the header line (the header we wrote during data division in section 2.3); every line, including the header, is treated as data. The first line is consumed only when column_names is not provided. In other words, CSVDataset does not provide the capability of skipping a header line in a file whose column names are supplied explicitly.",{"type":19,"tag":39,"props":171,"children":173},{"code":172},"Based on the analysis, modify the code for data division\npreprocess_iris_data(iris_data_file, iris_train_file, iris_test_file)\nto\npreprocess_iris_data(iris_data_file, iris_train_file, iris_test_file, header=False)\n\nExecute the preprocess.py file again to generate new data.\nExecute the load.py file (no modification is required). 
The output is as follows:\n====== sample ======\n{'Sepal.Length': Tensor(shape=[], dtype=Float32, value= 5.5), 'Sepal.Width': Tensor(shape=[], dtype=Float32, value= 4.2), 'Petal.Length': Tensor(shape=[], dtype=Float32, value= 1.4), 'Petal.Width': Tensor(shape=[], dtype=Float32, value= 0.2),\n'Classes': Tensor(shape=[], dtype=String, value= 'Iris-setosa')}\n",[174],{"type":19,"tag":44,"props":175,"children":176},{"__ignoreMap":7},[177],{"type":29,"value":172},{"type":19,"tag":33,"props":179,"children":180},{},[181],{"type":29,"value":182},"Notes: For readability, we reformatted the output. You can see that the data is read correctly. The data contains five fields, and the field names match the specified column_names. 4. Positive Code Example Through the two trial-and-error cases in section 3, we now have a preliminary understanding of CSVDataset. You may notice one remaining problem: the Classes field is not converted to a numeric value. The following describes how to convert the Classes field to a numeric value. 
The source code is as follows:",{"type":19,"tag":39,"props":184,"children":186},{"code":185},"from mindspore.dataset import CSVDataset\nfrom mindspore.dataset.text import Lookup, Vocab\n\n\ndef dataset_load(data_files):\n    column_defaults = [5.84, 3.05, 3.76, 1.20, \"Classes\"]\n    column_names = [\"Sepal.Length\", \"Sepal.Width\", \"Petal.Length\", \"Petal.Width\", \"Classes\"]\n\n    dataset = CSVDataset(\n        dataset_files=data_files,\n        field_delim=\",\",\n        column_defaults=column_defaults,\n        column_names=column_names,\n        num_samples=None,\n        shuffle=False)\n\n    cls_to_id_dict = {\"Iris-setosa\": 0, \"Iris-versicolor\": 1, \"Iris-virginica\": 2}\n    vocab = Vocab.from_dict(word_dict=cls_to_id_dict)\n    lookup = Lookup(vocab)\n    dataset = dataset.map(input_columns=\"Classes\", operations=lookup)\n\n    data_iter = dataset.create_dict_iterator()\n    item = None\n    for data in data_iter:\n        item = data\n        break\n\n    print(\"====== sample ======\\n{}\".format(item), flush=True)\n\n\ndef main():\n    iris_train_file = \"{your_path}/iris/iris_train.csv\"\n\n    dataset_load(data_files=iris_train_file)\n\n\nif __name__ == \"__main__\":\n    main()\n",[187],{"type":19,"tag":44,"props":188,"children":189},{"__ignoreMap":7},[190],{"type":29,"value":185},{"type":19,"tag":33,"props":192,"children":193},{},[194],{"type":29,"value":195},"Save the above code to the load.py file (replace the data file paths with the actual ones) and run the following command: python3 load.py The output is as follows: ====== sample ======",{"type":19,"tag":39,"props":197,"children":199},{"code":198},"{'Sepal.Length': Tensor(shape=[], dtype=Float32, value= 5.5), 'Sepal.Width': Tensor(shape=[], dtype=Float32, value= 4.2), 'Petal.Length': Tensor(shape=[], dtype=Float32, value= 1.4), 'Petal.Width': Tensor(shape=[], dtype=Float32, value= 0.2),\n'Classes': Tensor(shape=[], dtype=Int32, value= 
0)}\n",[200],{"type":19,"tag":44,"props":201,"children":202},{"__ignoreMap":7},[203],{"type":29,"value":198},{"type":19,"tag":33,"props":205,"children":206},{},[207],{"type":29,"value":208},"Notes: The data contains five fields. The Classes field is converted to a numeric value based on cls_to_id_dict = {\"Iris-setosa\": 0, \"Iris-versicolor\": 1, \"Iris-virginica\": 2}. To do this, we used the Vocab and Lookup utilities from the mindspore.dataset.text module. Alternatively, you can convert the fields to numeric values during data division in section 2.3. 5. Summary This article explored the CSVDataset interface in MindSpore through several cases. 6. Improvements to Be Made A header line cannot be included in the data file. Specified fields cannot be read by simply calling an API. 7. References mindspore.dataset.CSVDataset",{"title":7,"searchDepth":210,"depth":210,"links":211},4,[],"markdown","content:technology-blogs:en:1770.md","content","technology-blogs/en/1770.md","technology-blogs/en/1770","md",1776506104456]