[{"data":1,"prerenderedAt":53},["ShallowReactive",2],{"content-query-0U7wnEc7Mf":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":10,"date":11,"cover":12,"type":13,"category":14,"body":15,"_type":47,"_id":48,"_source":49,"_file":50,"_stem":51,"_extension":52},"/technology-blogs/en/1736","en",false,"",[9],"MindSpore Made Easy","Storage and access of user data are unified, simplifying training data loading. The shard size is flexibly controlled to implement distributed training.","2022-08-24","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/08/24/43554e5e899b4d7689db84f0f4805dc2.png","technology-blogs","Basics",{"type":16,"children":17,"toc":44},"root",[18,32],{"type":19,"tag":20,"props":21,"children":23},"element","h1",{"id":22},"mindspore-made-easy-dive-into-mindspore-minddataset-for-dataset-loading",[24,30],{"type":19,"tag":25,"props":26,"children":27},"span",{},[28],{"type":29,"value":9},"text",{"type":29,"value":31}," Dive Into MindSpore – MindDataset for Dataset Loading",{"type":19,"tag":33,"props":34,"children":35},"p",{},[36,38,42],{"type":29,"value":37},"Development environment: Ubuntu 20.04, Python 3.8, MindSpore 1.7.0

1. Background

Previously, we introduced three dataset loading APIs: ImageFolderDataset, CSVDataset, and TFRecordDataset. In this article, we will talk about the MindDataset API for MindRecord data loading in MindSpore. A complete machine learning workflow includes dataset loading (possibly including data processing), model definition, model training, and model evaluation. Reading data efficiently is a major concern for all deep learning networks. To this end, TensorFlow uses the TFRecord data format, and MindSpore provides MindRecord. The MindRecord data format has the following features: 1. Storage and access of user data are unified, simplifying training data loading. 2. Data is aggregated for storage, so it can be efficiently read, managed, and moved. 
3. Data encoding and decoding are efficient and transparent to users. 4. The shard size is flexibly controlled to implement distributed training.

2. Parameters

The following explains some of the parameters in the official document: dataset_files: The value is a string or a list. If it is a string, the system automatically searches for and loads the MindRecord files with the matching prefix. If the value is a list, the system reads the MindRecord files in the list. columns_list: The data fields or columns to be read from the MindRecord file. The default value is None, indicating that all data columns are read. For details about other parameters, see the descriptions in the previous articles.

3. Generating Data

This article uses the THUCNews dataset. If you want to use the dataset for commercial purposes, contact the dataset author. The following describes how to generate MindRecord files. The whole process involves: reading and processing raw data, declaring the MindRecord file format, defining MindRecord data fields, adding MindRecord index fields, and writing MindRecord data.

3.1 Composing Code

Next, let's compose the code for generating MindRecord data from the THUCNews dataset.

3.1.1 Code for Data Generation

import codecs
import os
import re

import numpy as np

from collections import Counter
from mindspore.mindrecord import FileWriter


def get_txt_files(data_dir):
    cls_txt_dict = {}
    txt_file_list = []

    # get files list and class files list.
    sub_data_name_list = next(os.walk(data_dir))[1]
    sub_data_name_list = sorted(sub_data_name_list)
    for sub_data_name in sub_data_name_list:
        sub_data_dir = os.path.join(data_dir, sub_data_name)
        data_name_list = next(os.walk(sub_data_dir))[2]
        data_file_list = [os.path.join(sub_data_dir, data_name) for data_name in data_name_list]
        cls_txt_dict[sub_data_name] = data_file_list
        txt_file_list.extend(data_file_list)
        num_data_files = len(data_file_list)
        print(\"{}: {}\".format(sub_data_name, num_data_files), flush=True)

    num_txt_files = len(txt_file_list)
    print(\"total: {}\".format(num_txt_files), flush=True)

    return cls_txt_dict, txt_file_list


def get_txt_data(txt_file):
    with codecs.open(txt_file, \"r\", \"UTF8\") as fp:
        txt_content = fp.read()
    txt_data = re.sub(\"\\s+\", \" \", txt_content)
    return txt_data


def build_vocab(txt_file_list, vocab_size=7000):
    counter = Counter()
    for txt_file in txt_file_list:
        txt_data = get_txt_data(txt_file)
        counter.update(txt_data)

    num_vocab = len(counter)
    if num_vocab \u003C vocab_size - 1:
        real_vocab_size = num_vocab + 2
    else:
        real_vocab_size = vocab_size

    # pad_id is 0, unk_id is 1, so real word ids start from 2
    vocab_dict = {word_freq[0]: ix + 2 for ix, word_freq in enumerate(counter.most_common(real_vocab_size - 2))}

    print(\"real vocab size: {}\".format(real_vocab_size), flush=True)
    print(\"vocab dict:\\n{}\".format(vocab_dict), flush=True)

    return vocab_dict


def make_mindrecord_files(
        data_dir, mindrecord_dir, vocab_size=7000, min_seq_length=10,
        max_seq_length=800, num_train_shard=16, num_test_shard=4):
    # get txt files
    cls_txt_dict, txt_file_list = get_txt_files(data_dir=data_dir)
    # map word to id
    vocab_dict = build_vocab(txt_file_list=txt_file_list, vocab_size=vocab_size)
    # map class to id
    class_dict = {class_name: ix for ix, class_name in enumerate(cls_txt_dict.keys())}

    data_schema = {
        \"seq_ids\": {\"type\": \"int32\", \"shape\": [-1]},
        \"seq_len\": {\"type\": \"int32\", \"shape\": [-1]},
        \"seq_cls\": {\"type\": \"int32\", \"shape\": [-1]}
    }

    train_file = os.path.join(mindrecord_dir, \"train.mindrecord\")
    test_file = os.path.join(mindrecord_dir, \"test.mindrecord\")

    train_writer = FileWriter(train_file, shard_num=num_train_shard, overwrite=True)
    test_writer = FileWriter(test_file, shard_num=num_test_shard, overwrite=True)
    train_writer.add_schema(data_schema, \"train\")
    test_writer.add_schema(data_schema, \"test\")

    # indexes = [\"seq_ids\", \"seq_len\", \"seq_cls\"]
    # train_writer.add_index(indexes)
    # test_writer.add_index(indexes)

    pad_id = 0
    unk_id = 1

    num_samples = 0
    num_train_samples = 0
    num_test_samples = 0
    train_samples = []
    test_samples = []

    for class_name, class_file_list in cls_txt_dict.items():
        class_id = class_dict[class_name]
        num_class_pass = 0
        for txt_file in class_file_list:
            txt_data = get_txt_data(txt_file=txt_file)
            txt_len = len(txt_data)
            if txt_len \u003C min_seq_length:
                num_class_pass += 1
                continue
            if txt_len > max_seq_length:
                txt_data = txt_data[:max",{"type":19,"tag":39,"props":40,"children":41},"max",{},[],{"type":29,"value":43},"_seq_length]
                txt_len = max_seq_length

            word_ids = []
            for word in txt_data:
                word_id = vocab_dict.get(word, unk_id)
                word_ids.append(word_id)
            for _ in range(max_seq_length - txt_len):
                word_ids.append(pad_id)

            num_samples += 1
            sample = {
                \"seq_ids\": np.array(word_ids, dtype=np.int32),
                \"seq_len\": np.array(txt_len, dtype=np.int32),
                \"seq_cls\": np.array(class_id, dtype=np.int32)}

            if num_samples % 10 == 0:
                train_samples.append(sample)
                num_train_samples += 1
                if num_train_samples % 10000 == 0:
                    train_writer.write_raw_data(train_samples)
                    train_samples = []
            else:
                test_samples.append(sample)
                num_test_samples += 1
                if num_test_samples % 10000 == 0:
                    test_writer.write_raw_data(test_samples)
                    test_samples = []

    if train_samples:
        train_writer.write_raw_data(train_samples)
    if test_samples:
        test_writer.write_raw_data(test_samples)

    train_writer.commit()
    test_writer.commit()

    print(\"num samples: {}\".format(num_samples), flush=True)
    print(\"num train samples: {}\".format(num_train_samples), flush=True)
    print(\"num test samples: {}\".format(num_test_samples), flush=True)


def main():
    data_dir = \"/Users/kaierlong/Documents/DownFiles/tmp/009_resources/THUCNews\"
    mindrecord_dir = \"/Users/kaierlong/Documents/DownFiles/tmp/009_resources/mindrecords\"
    make_mindrecord_files(data_dir=data_dir, mindrecord_dir=mindrecord_dir)


if __name__ == \"__main__\":
    main()

3.1.2 Code Interpretation

The following mainly describes the make_mindrecord_files function.

Declaring the MindRecord file format

train_writer = FileWriter(train_file, shard_num=num_train_shard, overwrite=True)
test_writer = FileWriter(test_file, shard_num=num_test_shard, overwrite=True)

Interpretation: Import the FileWriter class and create FileWriter instances. train_file is not necessarily the exact name of a written file; it can be a file name prefix. shard_num indicates the number of data files to write.

Defining MindRecord data columns

data_schema = {
    \"seq_ids\": {\"type\": \"int32\", \"shape\": [-1]},
    \"seq_len\": {\"type\": \"int32\", \"shape\": [-1]},
    \"seq_cls\": {\"type\": \"int32\", \"shape\": [-1]}
}
train_writer.add_schema(data_schema, \"train\")
test_writer.add_schema(data_schema, \"test\")

Interpretation: Define the dataset schema, and then add it to the FileWriter instances by using add_schema. The schema contains the field names, the field data types (type), and, optionally, the numbers of dimensions of the fields (shape). If shape is specified, the data passed to the write_raw_data interface must be of the numpy.ndarray type, and the corresponding data type must be int32, int64, float32, or float64. Field name: the reference name of a field, which may contain letters, digits, and underscores (_). Field data type: int32, int64, float32, float64, string, or bytes. Number of field dimensions: [-1] indicates one dimension, and [m, n, ...] indicates a higher dimension, where m and n indicate the array length of each dimension. 
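The schema constraints described above (allowed field types, field-name characters, and the shape rule) can be sketched as a small stand-alone checker. This is only an illustration of the rules stated in this section, not MindSpore's actual validation code; check_schema, ALLOWED_TYPES, and FIELD_NAME_RE are names invented here.

```python
import re

# Rules restated from this section (illustrative only, not MindSpore internals):
# - field data type: int32, int64, float32, float64, string, or bytes
# - field name: letters, digits, and underscores
# - if 'shape' is given, the field must hold numeric ndarray data
ALLOWED_TYPES = {'int32', 'int64', 'float32', 'float64', 'string', 'bytes'}
NUMERIC_TYPES = {'int32', 'int64', 'float32', 'float64'}
FIELD_NAME_RE = re.compile(r'^[A-Za-z0-9_]+$')

def check_schema(schema):
    for name, spec in schema.items():
        assert FIELD_NAME_RE.match(name), 'bad field name: ' + name
        assert spec['type'] in ALLOWED_TYPES, 'bad type: ' + spec['type']
        if 'shape' in spec:
            assert spec['type'] in NUMERIC_TYPES, 'shape needs a numeric type'
    return True

schema = {
    'seq_ids': {'type': 'int32', 'shape': [-1]},
    'seq_len': {'type': 'int32', 'shape': [-1]},
    'seq_cls': {'type': 'int32', 'shape': [-1]},
}
print(check_schema(schema))  # True
```

Running the checker on the article's data_schema passes; changing a type to, say, float16 would trip the type assertion.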
(Optional) Adding MindRecord index fields

You can add index fields to accelerate data reading. Note that an index field must be of the int, float, or str type; otherwise, an error is reported. For details about the error information, see problem 1 in section 5.1.

Writing MindRecord data

train_samples = []
sample = {
    \"seq_ids\": np.array(word_ids, dtype=np.int32),
    \"seq_len\": np.array(txt_len, dtype=np.int32),
    \"seq_cls\": np.array(class_id, dtype=np.int32)}
train_samples.append(sample)
train_writer.write_raw_data(train_samples)
train_writer.commit()

Interpretation: The FileWriter instance writes a list whose elements are dictionaries, and the content format of each dictionary must match the schema. Call the write_raw_data method to write data, and finalize the files with the commit method after all data is written. Note: Writing can be triggered as soon as the list holds a single sample; to speed up writing, we instead buffer samples and flush once the list reaches a threshold, for example 10000, chosen based on the size of the data to be written and the memory of the device.

3.2 Generating Data

Save the code in section 3.1.1 (set data_dir and mindrecord_dir to your own directories) to the generate_mindrecord.py file and run the following command:

python3 generate_mindrecord.py

Run the tree command in the directory specified by mindrecord_dir to check the generated data. 
The output is as follows:

├── test.mindrecord0
├── test.mindrecord0.db
├── test.mindrecord1
├── test.mindrecord1.db
├── test.mindrecord2
├── test.mindrecord2.db
├── test.mindrecord3
├── test.mindrecord3.db
├── train.mindrecord00
├── train.mindrecord00.db
├── train.mindrecord01
├── train.mindrecord01.db
├── train.mindrecord02
├── train.mindrecord02.db
├── train.mindrecord03
├── train.mindrecord03.db
├── train.mindrecord04
├── train.mindrecord04.db
├── train.mindrecord05
├── train.mindrecord05.db
├── train.mindrecord06
├── train.mindrecord06.db
├── train.mindrecord07
├── train.mindrecord07.db
├── train.mindrecord08
├── train.mindrecord08.db
├── train.mindrecord09
├── train.mindrecord09.db
├── train.mindrecord10
├── train.mindrecord10.db
├── train.mindrecord11
├── train.mindrecord11.db
├── train.mindrecord12
├── train.mindrecord12.db
├── train.mindrecord13
├── train.mindrecord13.db
├── train.mindrecord14
├── train.mindrecord14.db
├── train.mindrecord15
└── train.mindrecord15.db

0 directories, 40 files

Description: 16 MindRecord training data files are generated, as set by parameter num_train_shard. 4 MindRecord test data files are generated, as set by parameter num_test_shard. The prefixes of the data files are set by train_file and test_file.

4. Loading Data

The following describes how to load the generated MindRecord data.

4.1 Code for Data Loading

To load MindRecord data, we need to use the MindDataset data loading interface mentioned in section 2.

4.1.1 Code

To ensure reproducible results, shuffle is set to False. 
import os

from mindspore.dataset import MindDataset


def create_mindrecord_dataset(mindrecord_dir, train_mode=True):
    if train_mode:
        file_prefix = os.path.join(mindrecord_dir, \"train.mindrecord00\")
    else:
        file_prefix = os.path.join(mindrecord_dir, \"test.mindrecord0\")

    dataset = MindDataset(dataset_files=file_prefix, columns_list=None, shuffle=False)

    for item in dataset.create_dict_iterator():
        print(item, flush=True)
        break


def main():
    mindrecord_dir = \"/Users/kaierlong/Documents/DownFiles/tmp/009_resources/mindrecords\"
    create_mindrecord_dataset(mindrecord_dir=mindrecord_dir, train_mode=True)


if __name__ == \"__main__\":
    main()

4.1.2 Code Interpretation

In section 3.1.1, num_train_shard and num_test_shard are set to 16 and 4, so the suffixes of the files generated in section 3.2 differ: test data files end with a single digit (0 to 3), while train data files end with two digits (00, 01, and so on). As a result, the dataset_files values for the train and test data are different, as shown in the preceding code. If train.mindrecord0 is forcibly used for the train data, an error is reported. For details, see problem 2 in section 5.2.

4.2 Loading Test

Save the code in section 4.1.1 to the load_mindrecord.py file and run the following command:

python3 load_mindrecord.py

The output is as follows:

{'seq_cls': Tensor(shape=[1], dtype=Int32, value= [0]), 'seq_ids': Tensor(shape=[800], dtype=Int32, value= [ 40, 80, 289, 400, 80, 163, 2239, 288, 413, 94, 309, 429, 3, 890, 664, 2941, 582, 539, 14, ...... 55, 7, 5, 65, 7, 24, 40, 8, 40, 80, 1254, 396, 566, 276, 96, 42, 4, 73, 803, 857, 72, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'seq_len': Tensor(shape=[1], dtype=Int32, value= [742])}

The data is successfully read, including three fields: seq_cls, seq_ids, and seq_len. 
The shapes of the fields match the expected values. Note: If too many MindRecord data files are loaded at once, an error may occur. For details about the error information, see problem 3 in section 5.3. In this case, run the following command to raise the open files limit to a proper value:

# ulimit -n ${num}
ulimit -n 1024

5. Troubleshooting

5.1 Problem 1

Traceback (most recent call last):
  File \"/Users/kaierlong/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01/04_mindrecord_make.py\", line 167, in \u003Cmodule>
    main()
  File \"/Users/kaierlong/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01/04_mindrecord_make.py\", line 163, in main
    make_mindrecord(data_dir=data_dir, mindrecord_dir=mindrecord_dir)
  File \"/Users/kaierlong/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01/04_mindrecord_make.py\", line 98, in make_mindrecord
    train_writer.add_index(indexes)
  File \"/Users/kaierlong/Pyenvs/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/mindrecord/filewriter.py\", line 223, in add_index
    raise MRMDefineIndexError(\"Failed to set field {} since it's not primitive type.\".format(field))
mindspore.mindrecord.common.exceptions.MRMDefineIndexError: [MRMDefineIndexError]: Failed to define index field. Detail: Failed to set field seq_ids since it's not primitive type.

Troubleshooting: Index fields must be of a primitive type, e.g. int, float, or str. 
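The error above comes from indexing the array-typed field seq_ids: only scalar primitive fields qualify as indexes. The helper below is a hypothetical sketch of that rule (index_candidates is not a MindSpore API, and the scalar label field in the example is invented for illustration); MindSpore's add_index performs the real check.

```python
# Sketch: pick index-able fields from a MindRecord-style schema.
# Per this section, an index field must be int, float, or str;
# array fields declared with a 'shape' (like seq_ids) do not qualify.
def index_candidates(schema):
    primitive = {'int32', 'int64', 'float32', 'float64', 'string'}
    return [name for name, spec in schema.items()
            if spec['type'] in primitive and 'shape' not in spec]

schema = {
    'seq_ids': {'type': 'int32', 'shape': [-1]},  # array: cannot be indexed
    'label': {'type': 'int32'},                   # hypothetical scalar field
}
print(index_candidates(schema))  # ['label']
```

With the article's actual schema, every field carries a shape, so no valid index field exists, which is why the add_index calls are commented out in section 3.1.1.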
5.2 Problem 2

Traceback (most recent call last):
  File \"/Users/kaierlong/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01/04_mindrecord_load.py\", line 36, in \u003Cmodule>
    main()
  File \"/Users/kaierlong/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01/04_mindrecord_load.py\", line 32, in main
    create_mindrecord_dataset(mindrecord_dir=mindrecord_dir, train_mode=True)
  File \"/Users/kaierlong/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01/04_mindrecord_load.py\", line 23, in create_mindrecord_dataset
    dataset = MindDataset(dataset_files=file_prefix, columns_list=None, shuffle=False)
  File \"/Users/kaierlong/Pyenvs/env_mix_dl/lib/python3.9/site-packages/mindspore/dataset/engine/validators.py\", line 994, in new_method
    check_file(dataset_file)
  File \"/Users/kaierlong/Pyenvs/env_mix_dl/lib/python3.9/site-packages/mindspore/dataset/core/validator_helpers.py\", line 578, in check_file
    raise ValueError(\"The file {} does not exist or permission denied!\".format(dataset_file))
ValueError: The file /Users/kaierlong/Documents/DownFiles/tmp/009_resources/mindrecords/train.mindrecord0 does not exist or permission denied!

Troubleshooting: See section 4.1.2. 
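Because MindDataset accepts either a single file name or a list (section 2), another way to avoid guessing the shard suffix is to collect the real shard files for a prefix and pass the whole list. The helper below is a hypothetical sketch written for illustration; it only filters out the .db metadata files that accompany each shard.

```python
import os

def shard_files(mindrecord_dir, prefix, file_names):
    # file_names stands in for os.listdir(mindrecord_dir);
    # keep shard files for the prefix, skip .db metadata files.
    return sorted(
        os.path.join(mindrecord_dir, name)
        for name in file_names
        if name.startswith(prefix) and not name.endswith('.db')
    )

names = ['train.mindrecord00', 'train.mindrecord00.db',
         'train.mindrecord01', 'train.mindrecord01.db',
         'test.mindrecord0', 'test.mindrecord0.db']
print(shard_files('/path/to/mindrecords', 'train.mindrecord', names))
# ['/path/to/mindrecords/train.mindrecord00', '/path/to/mindrecords/train.mindrecord01']
```

The resulting list can then be passed directly as the dataset_files argument of MindDataset.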
5.3 Problem 3

Line of code : 247
File : /Users/jenkins/agent-working-dir/workspace/Compile_CPU_ARM_MacOS_PY39/mindspore/mindspore/ccsrc/minddata/mindrecord/io/shard_reader.cc

(env_ms_1.7.0) [kaierlong@Long-De-MacBook-Pro-16]: ~/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01$ python3 04_mindrecord_load.py
Traceback (most recent call last):
  File \"/Users/kaierlong/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01/04_mindrecord_load.py\", line 36, in \u003Cmodule>
    main()
  File \"/Users/kaierlong/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01/04_mindrecord_load.py\", line 32, in main
    create_mindrecord_dataset(mindrecord_dir=mindrecord_dir, train_mode=True)
  File \"/Users/kaierlong/Codes/OpenI/kaierlong/Dive_Into_MindSpore/code/chapter_01/04_mindrecord_load.py\", line 25, in create_mindrecord_dataset
    for item in dataset.create_dict_iterator():
  File \"/Users/kaierlong/Pyenvs/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/validators.py\", line 971, in new_method
    return method(self, *args, **kwargs)
  File \"/Users/kaierlong/Pyenvs/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py\", line 1478, in create_dict_iterator
    return DictIterator(self, num_epochs, output_numpy)
  File \"/Users/kaierlong/Pyenvs/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py\", line 95, in __init__
    offload_model = offload.GetOffloadModel(consumer, self.__ori_dataset.get_col_names())
  File \"/Users/kaierlong/Pyenvs/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py\", line 1559, in get_col_names
    self._col_names = runtime_getter[0].GetColumnNames()
RuntimeError: Unexpected error. Invalid file, failed to open files for reading mindrecord files.
Please check file path, permission and open files limit(ulimit -a): /Users/kaierlong/Documents/DownFiles/tmp/009_resources/mindrecords/train.mindrecord11
Line of code : 247
File : /Users/jenkins/agent-working-dir/workspace/Compile_CPU_ARM_MacOS_PY39/mindspore/mindspore/ccsrc/minddata/mindrecord/io/shard_reader.cc

Troubleshooting: Run the following command to raise the open files limit (${num}) according to the actual situation of the device:

# ulimit -n ${num}
ulimit -n 1024

Run the ulimit -a command before modifying the parameter. The output is as follows:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 256
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8176
cpu time (seconds, -t) unlimited
max user processes (-u) 5333
virtual memory (kbytes, -v) unlimited

Run the ulimit -a command after modifying the parameter. The output is as follows:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8176
cpu time (seconds, -t) unlimited
max user processes (-u) 5333
virtual memory (kbytes, -v) unlimited

6. Summary

This article describes how to generate MindRecord data files in MindSpore and how to load them with MindDataset, and it also presents some common problems encountered during data reading together with troubleshooting measures.

7. References

MindDataset API
Converting Dataset to MindRecord",{"title":7,"searchDepth":45,"depth":45,"links":46},4,[],"markdown","content:technology-blogs:en:1736.md","content","technology-blogs/en/1736.md","technology-blogs/en/1736","md",1776506103702]