MindSpore Data Processing FAQs
Q1: How do I offload data if I do not use high-level APIs?
A1: You can refer to the test_tdt_data_transfer.py example of the manual offloading mode, which does not use the model.train API. Currently, this method is supported on the GPU and Ascend hardware platforms.
Q2: How do I optimize the high memory usage when processing data with MindSpore Dataset?
A2: You can refer to the following procedure to reduce the memory usage, but it may also reduce the efficiency of data processing.
1. Before defining a *Dataset object, run ds.config.set_prefetch_size(2) to set the prefetch size for data processing.
2. When defining the *Dataset object, set its num_parallel_workers parameter to 1.
3. If you further perform the .map(...) operation on the *Dataset object, set the num_parallel_workers parameter of .map(...) to 1.
4. If you further perform the .batch(...) operation on the *Dataset object, set the num_parallel_workers parameter of .batch(...) to 1.
5. If you further perform the .shuffle(...) operation on the *Dataset object, reduce the value of buffer_size.
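Taken together, the steps above can be sketched as one low-memory pipeline configuration. This is only a sketch: the dataset path, column name, and the choice of ImageFolderDataset are placeholders, not part of the original answer.

```python
import mindspore.dataset as ds
import mindspore.dataset.vision as vision

# Step 1: shrink the prefetch buffer before any dataset object is defined.
ds.config.set_prefetch_size(2)

# Steps 2-4: a single worker each for loading, map, and batch.
dataset = ds.ImageFolderDataset("/path/to/images", num_parallel_workers=1)
dataset = dataset.map(operations=vision.Decode(), input_columns=["image"],
                      num_parallel_workers=1)
# Step 5: a small shuffle buffer keeps fewer samples in memory.
dataset = dataset.shuffle(buffer_size=4)
dataset = dataset.batch(32, num_parallel_workers=1)
```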
Q3: How do I optimize the high CPU usage when processing data with MindSpore Dataset, which is manifested as high sy (system) usage and low us (user) usage?
A3: You can refer to the following procedure to reduce the CPU usage and further improve the performance. The main cause for the high CPU usage is the resource competition between third-party library multithreading and data processing multithreading.
1. If OpenCV (cv2) operations are involved in the data processing phase, run cv2.setNumThreads(2) to set the number of global CV2 threads.
2. If NumPy operations are involved in the data processing phase, run export OPENBLAS_NUM_THREADS=1 to set the number of OpenBLAS threads.
3. If Numba operations are involved in the data processing phase, run numba.set_num_threads(1) to set the parallelism degree to reduce thread contention.
Q4: Why is there no difference between shuffle=True and shuffle=False in GeneratorDataset?
A4: If shuffle is enabled, the input dataset must support random access (for example, the custom dataset implements the __getitem__ method). If the custom dataset returns data in yield mode, random access is not supported. For details, see the custom dataset section in the tutorial.
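To illustrate why random access matters, here is a minimal pure-Python sketch (class and function names are made up for illustration): a dataset with __getitem__/__len__ lets a sampler visit indices in any order, which is what shuffle needs, while a yield-style dataset can only be consumed front to back.

```python
import random

class RandomAccessDataset:
    """Supports shuffle: any index can be visited at any time."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

def iterable_dataset(data):
    """yield-style dataset: no __getitem__, so a sampler cannot reorder it."""
    for sample in data:
        yield sample

data = list(range(5))
ds_random = RandomAccessDataset(data)

# What shuffle=True effectively does: permute the index order before reading.
random.seed(0)
indices = list(range(len(ds_random)))
random.shuffle(indices)
shuffled = [ds_random[i] for i in indices]

# A generator can only be read front to back.
sequential = list(iterable_dataset(data))
```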
Q5: How does MindSpore Dataset combine two columns into one column?
A5: You can perform the following operations to combine two columns into one:
```python
import numpy as np

def combine(x, y):
    x = x.flatten()
    y = y.flatten()
    return np.append(x, y)

dataset = dataset.map(operations=combine, input_columns=["data", "data2"], output_columns=["data"])
```
Note: Because the shapes of the two columns differ, you need to flatten them before combining.
Q6: Does GeneratorDataset support ds.PKSampler sampling?
A6: GeneratorDataset does not support the PKSampler sampling logic. The main reason is that custom data operations are too flexible for a built-in PKSampler to cover universally, so the API layer directly reports that the operation is not supported. However, with GeneratorDataset you can easily implement the required sampling yourself: define the sampling rules in the __getitem__ function of your dataset class (for example, an ImageDataset) and return the sampled data.
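The idea can be sketched in pure Python (all class and parameter names here are hypothetical): each access draws K samples from each of P randomly chosen classes, which is the PK sampling pattern.

```python
import random

class PKImageDataset:
    """Hypothetical dataset implementing PK sampling inside __getitem__:
    every access returns K samples from each of P randomly chosen classes."""
    def __init__(self, samples_by_class, p=2, k=2, seed=0):
        self.samples_by_class = samples_by_class  # dict: label -> list of samples
        self.p = p
        self.k = k
        self.rng = random.Random(seed)

    def __getitem__(self, index):
        chosen_labels = self.rng.sample(sorted(self.samples_by_class), self.p)
        batch = []
        for label in chosen_labels:
            for sample in self.rng.sample(self.samples_by_class[label], self.k):
                batch.append((sample, label))
        return batch

    def __len__(self):
        # number of PK batches per epoch (arbitrary for this sketch)
        return 10

dataset = PKImageDataset({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})
batch = dataset[0]
```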
Q7: How does MindSpore load the existing pre-trained word vector?
A7: When defining EmbeddingLookup or Embedding, you can pass in the pre-trained word vectors, encapsulated as a tensor, as the initial value of EmbeddingLookup.
Q8: What is the difference between c_transforms and py_transforms? Which one is recommended?
A8: c_transforms is recommended. Its performance is better because it is executed entirely at the C++ layer.
Principle: The underlying layer of c_transforms uses the C++ implementations of OpenCV/libjpeg-turbo for data processing, while py_transforms uses the Python library Pillow.
Since MindSpore 1.8, data augmentation APIs are merged, and users will no longer need to explicitly specify c_transforms or py_transforms. MindSpore determines the backend to be used based on the data type transferred to data augmentation APIs. By default, c_transforms is used because of its better performance. For details, see the latest API document and import description.
Q9: Since each piece of my data contains multiple images with varying widths and heights, I need to perform a map operation on the data converted to the MindRecord format. But the data I read from the records is in the format of np.ndarray, and my data processing operations are for image formats. How can I preprocess the generated data in MindRecord format?
A9: You are advised to perform the following operations:
#1 Define the schema as follows, where the fields data1, data2, data3, ... store your images; only the binary content of each image is stored.

```python
cv_schema_json = {"label": {"type": "int32"}, "data1": {"type": "bytes"}, "data2": {"type": "bytes"}, "data3": {"type": "bytes"}}
```
#2 The data can be organized as follows, and then the data_list can be written by FileWriter.write_raw_data(...).
```python
data_list = []
data = {}
data['label'] = 1

with open("1.jpg", "rb") as f:
    data['data1'] = f.read()

with open("2.jpg", "rb") as f2:
    data['data2'] = f2.read()

with open("3.jpg", "rb") as f3:
    data['data3'] = f3.read()

data_list.append(data)
```
#3 Use MindDataset to load the data, decode it with the provided Decode operation, and then perform subsequent processing.

```python
data_set = ds.MindDataset("mindrecord_file_name")
data_set = data_set.map(input_columns=["data1"], operations=vision.Decode(), num_parallel_workers=2)
data_set = data_set.map(input_columns=["data2"], operations=vision.Decode(), num_parallel_workers=2)
data_set = data_set.map(input_columns=["data3"], operations=vision.Decode(), num_parallel_workers=2)
resize_op = vision.Resize((32, 32), interpolation=Inter.LINEAR)
data_set = data_set.map(operations=resize_op, input_columns=["data1"], num_parallel_workers=2)

for item in data_set.create_dict_iterator(output_numpy=True):
    print(item)
```
Q10: During conversion from my custom image dataset to the MindRecord format, my data is in numpy.ndarray format with a shape of [4, 100, 132, 3], which indicates four three-channel frames, with each value within 0 and 255. However, when I view the data that is converted into the MindRecord format, I find that the shape is [19800] but that of the original data is [158400]. Why?
A10: Perhaps the dtype of your ndarray data is int8, because 158400 is exactly eight times 19800. You are advised to explicitly specify the dtype of the ndarray data as float64.
Q11: I want to save a generated image, but cannot find it in the corresponding directory after executing the code. Similarly, in JupyterLab, a dataset is generated for training. During training, the data can be read in the corresponding path, but I cannot find the image or dataset in the path. Why?
A11: Perhaps the image or dataset generated by JupyterLab is stored in Docker. The data downloaded by MoXing can be viewed only inside Docker during the training process; once the training is complete, the data is released along with the Docker container. You can use MoXing in the training job to transfer the data that needs to be downloaded back to OBS, and then download it from OBS to your local host.
Q12: How do I understand the dataset_sink_mode parameter in model.train of MindSpore?
A12: When dataset_sink_mode is set to True, data processing and network computation are performed in pipeline mode. That is, as data processing proceeds batch by batch, each processed batch is placed in a queue that caches processed data, and network computation retrieves its data from this queue for training. Because data processing and network computation are thus pipelined, the total training time is determined by whichever of the two takes longer.
When dataset_sink_mode is set to False, data processing and network computation are performed in serial mode. That is, after a batch of data is processed, it is transferred to the network for computation. After the computation is complete, the next batch of data will be processed and transferred to the network for computation. This process repeats until the training is complete. The total time consumed is the time consumed for data processing plus the time consumed for network computation.
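The time difference between the two modes can be illustrated with back-of-the-envelope arithmetic (the per-step costs below are made-up numbers, not measurements):

```python
# Hypothetical per-step costs in seconds (made-up numbers).
t_data, t_compute, steps = 0.02, 0.05, 1000

# dataset_sink_mode=True: data processing and computation overlap, so the
# slower stage dominates (ignoring the short pipeline warm-up).
t_pipeline = max(t_data, t_compute) * steps

# dataset_sink_mode=False: the two stages run back to back on every step.
t_serial = (t_data + t_compute) * steps
```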
Q13: Can MindSpore train image data of different sizes by batch?
A13: You can refer to the usage of YOLOv3 in this scenario. The script contains the resizing of different images. For details about the script, see yolo_dataset.
Q14: Must data be converted into the MindRecord format when MindSpore is used for segmentation training?
A14: build_seg_data.py is a script that converts a dataset to the MindRecord format. It can be used directly or adapted to your own dataset. Alternatively, if you want to try implementing your own dataset reading, you can use GeneratorDataset to customize a dataset loading logic.
Q15: When MindSpore performs multi-device training on the Ascend hardware platform, how does the custom dataset transfer different data to different devices?
A15: When GeneratorDataset is used, the num_shards=num_shards and shard_id=device_id parameters control which shard of data a specific device reads. The __getitem__ and __len__ methods should still be written against the full dataset.
Example:
```python
# Device 0:
ds.GeneratorDataset(..., num_shards=8, shard_id=0, ...)
# Device 1:
ds.GeneratorDataset(..., num_shards=8, shard_id=1, ...)
# Device 2:
ds.GeneratorDataset(..., num_shards=8, shard_id=2, ...)
# ...
# Device 7:
ds.GeneratorDataset(..., num_shards=8, shard_id=7, ...)
```
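Conceptually, sharding partitions the sample indices among devices. The following pure-Python sketch assumes a round-robin split by index modulo num_shards (an assumption for illustration; the framework's actual internal distribution strategy may differ):

```python
def shard_indices(num_samples, num_shards, shard_id):
    """Indices one shard would read under a round-robin split
    (illustrative assumption, not the framework's documented strategy)."""
    return [i for i in range(num_samples) if i % num_shards == shard_id]

num_samples, num_shards = 16, 8
per_device = [shard_indices(num_samples, num_shards, d) for d in range(num_shards)]

# Every sample is covered exactly once across the 8 devices.
covered = sorted(i for shard in per_device for i in shard)
```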
Q16: How do I build a multi-label MindRecord dataset for images?
A16: The data schema can be defined as follows:

```python
cv_schema_json = {"label": {"type": "int32", "shape": [-1]}, "data": {"type": "bytes"}}
```
Note: A label is a NumPy array, storing, for example, the label values 1, 1, 0, 1, 0, 1. These labels all correspond to the same data, that is, the binary value of the same image. For details, see the tutorial for converting a dataset to the MindRecord format.
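A sketch of assembling one record for this schema (the image bytes are a placeholder; the actual write would then go through FileWriter as in Q9, shown here only in comments):

```python
import numpy as np

cv_schema_json = {"label": {"type": "int32", "shape": [-1]},
                  "data": {"type": "bytes"}}

# One image with six binary labels stored as an int32 NumPy array.
labels = np.array([1, 1, 0, 1, 0, 1], dtype=np.int32)
image_bytes = b"\xff\xd8\xff"  # placeholder standing in for real JPEG bytes

record = {"label": labels, "data": image_bytes}
data_list = [record]

# With mindspore.mindrecord available, the write would look like:
# writer = FileWriter("multi_label.mindrecord", shard_num=1)
# writer.add_schema(cv_schema_json, "multi-label schema")
# writer.write_raw_data(data_list)
# writer.commit()
```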
Q17: What is the reason for the error message 'wrong shape of image' when I use a model trained by MindSpore to perform prediction on a 28 x 28 digital image with white text on a black background?
A17: The model was trained by MindSpore on the MNIST gray-scale image dataset, so it requires its input to be a single-channel 28 x 28 gray-scale image.
Q18: MindSpore has a framework dedicated to data processing. Are there any related design and usage introductions?
A18: The MindSpore Dataset module enables users to easily define data preprocessing pipelines and efficiently (multi-process/multi-thread) process samples in datasets. Additionally, MindSpore Dataset provides a variety of APIs for loading and processing datasets. For more information, refer to the data processing pipeline introduction. If you want to further optimize the performance of the data processing pipelines, refer to the Optimizing the Data Processing section.
Q19: How do I locate the cause of the data delivery failure 'TDT Push data into device Failed' during network training?
A19: The preceding error message indicates that the training data fails to be sent to the destination device through the training data transfer (TDT) channel. There may be multiple reasons causing this error, and the log provides corresponding check suggestions.
1. Usually we look for the first error (ERROR level) or traceback in the log, and try to find information that can help locate the cause of the error.
2. In the graph compilation phase, if the error occurs before training begins (for example, loss is not printed in the log), check whether there are any errors (ERROR level) in the log related to network operators or caused by improper environment configurations (such as an incorrect hccl.json file that causes a multi-device communication initialization error).
3. If the error occurs in the middle of the training process, it is usually caused by a mismatch between the delivered data volume (batch size) and the data volume (number of steps) required for network training. You can use the get_dataset_size API to print the number of batches contained in an epoch. The following are some possible causes:
(1) Check the number of printed loss times. If the data volume (number of steps) is an integer multiple of the number of batches in an epoch, the epoch processing may be faulty. The following is an example:
```python
...
dataset = dataset.create_tuple_iterator(num_epochs=-1)  # If an iterator needs to be returned, num_epochs should be set to 1; however, it is recommended to return dataset directly.
return dataset
```
(2) Check whether data processing is slow and cannot keep up with the network training speed. In this scenario, you can use the Profiler and MindSpore Insight tools to check for obvious iteration gaps, or manually traverse the dataset and print the average time consumed per batch, to determine if it takes longer than the time for network forward and backward propagation combined. If so, performance optimization is required for data processing.
(3) During the training process, if there is abnormal data that causes a data delivery failure, there are usually other error logs (ERROR level) that indicate which step of data processing has encountered an error and provide check suggestions. If they are not obvious, you can also try to find the abnormal data by traversing every data piece in the dataset (such as turning off shuffle and using binary search).
4. If this log is printed after the training is complete (which is caused by forcible resource deallocation), ignore this error.
5. If the specific cause cannot be identified, you can seek assistance from module developers through methods such as raising an issue or posting on forums.
Q20: Can py_transforms and c_transforms be used together for data augmentation? If so, how should they be used?
A20: To achieve high performance, using py_transforms and c_transforms together is not recommended. However, if top performance is not essential and c_transforms does not cover all the augmentation operations you need (that is, a required operation exists only in py_transforms), the two can be mixed. Note that the outputs of c_transforms operations are usually NumPy arrays, while the outputs of py_transforms operations are PIL images; for details, see the corresponding module descriptions. The common mixed usages are as follows:
1. c_transforms augmentation operation + ToPIL operation + py_transforms augmentation operation + ToNumpy operation
2. py_transforms augmentation operation + ToNumpy operation + c_transforms augmentation operation
```python
# Example of using c_transforms and py_transforms operations together.
# In the following case: c_vision refers to c_transforms, py_vision refers to py_transforms.
import mindspore.dataset.transforms
import mindspore.dataset.vision.c_transforms as c_vision
import mindspore.dataset.vision.py_transforms as py_vision

decode_op = c_vision.Decode()

# If the input type is not PIL, add a ToPIL operation.
transforms = [
    py_vision.ToPIL(),
    py_vision.CenterCrop(375),
    py_vision.ToTensor()
]
transform = mindspore.dataset.transforms.Compose(transforms)

data1 = data1.map(operations=decode_op, input_columns=["image"])
data1 = data1.map(operations=transform, input_columns=["image"])
```
In versions later than MindSpore 1.8, due to the merging of data augmentation APIs, the code becomes more concise, for example:
```python
import mindspore.dataset.vision as vision

transforms = [
    vision.Decode(),         # c_transforms data augmentation
    vision.ToPIL(),          # Switch the input of the next augmentation to PIL.
    vision.CenterCrop(375),  # py_transforms data augmentation
]
data1 = data1.map(operations=transforms, input_columns=["image"])
```
Q21: What should I do if the error message "The data pipeline is not a tree (i.e., one node has 2 consumers)" is displayed?
A21: The above error is usually caused by incorrect script writing. Under normal circumstances, the operations in a data processing pipeline are sequentially linked, as defined below:
```python
# Pipeline structure:
# dataset1 -> map -> shuffle -> batch
dataset1 = XXDataset()
dataset1 = dataset1.map(...)
dataset1 = dataset1.shuffle(...)
dataset1 = dataset1.batch(...)
```
However, in the following abnormal scenario, where dataset1 has two branch nodes, namely dataset2 and dataset3, the above error occurs. This is because when the dataset1 node branches, the data flow direction becomes undefined; therefore, this situation is not allowed.
```python
# Pipeline structure:
# dataset1 -> dataset2 -> map
#          |
#          --> dataset3 -> map
dataset1 = XXDataset()
dataset2 = dataset1.map(***)
dataset3 = dataset1.map(***)
```
The correct format is as follows. dataset3 is obtained by performing data augmentation on dataset2 instead of dataset1.
```python
dataset2 = dataset1.map(***)
dataset3 = dataset2.map(***)
```
Q22: What is the corresponding API of DataLoader in MindSpore?
A22: If DataLoader is considered as an API used to receive custom datasets, the MindSpore data processing API that is most similar to DataLoader is GeneratorDataset, which can receive custom datasets. Refer to the GeneratorDataset documentation for specific usage, and the API operator mapping table for difference comparison.
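The protocol GeneratorDataset relies on is the familiar random-access dataset interface. A pure-Python sketch (class name hypothetical; the MindSpore call is shown only in comments):

```python
class MyDataset:
    """Hypothetical random-access source, analogous to a DataLoader dataset."""
    def __init__(self):
        self.samples = [(i, i * i) for i in range(4)]  # (data, label) pairs

    def __getitem__(self, index):
        return self.samples[index]

    def __len__(self):
        return len(self.samples)

source = MyDataset()
first = source[0]
size = len(source)

# With MindSpore installed, this source plugs in directly:
# import mindspore.dataset as ds
# dataset = ds.GeneratorDataset(source, column_names=["data", "label"])
```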
Q23: How do I debug a custom dataset when it encounters an error?
A23: Custom datasets are usually passed to GeneratorDataset. If an error occurs while using a custom dataset, you can debug it in various ways, such as adding print statements and printing the shape and data type of the returned values. Custom datasets should keep intermediate processing results as NumPy arrays; you are not advised to mix them with MindSpore network computation operators. In addition, after a custom dataset such as MyDataset below is initialized, you can traverse it directly as follows (mainly to simplify debugging and analyze problems in the original dataset, without passing it to GeneratorDataset); debugging then follows conventional Python syntax rules.
```python
dataset = MyDataset()
for item in dataset:
    print("item:", item)
```
Q24: Can data processing operations and network computation operators be used together?
A24: Generally, if data processing operations and network computation operators are used together, the performance deteriorates. You can try to use them together when the corresponding data processing operations are missing and the custom Python operations are inappropriate. Note that the inputs required by the two are different. Generally, the inputs of data processing operations are NumPy arrays or PIL images, while the inputs of the network computation operators are MindSpore.Tensor. To use them together, you need to ensure that the output format of the former is the same as the input format of the latter. Data processing operations refer to APIs under the mindspore.dataset module in the API document on the official website, for example, mindspore.dataset.vision.CenterCrop. Network computation operators include operators under modules such as mindspore.nn and mindspore.ops.
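A minimal sketch of this format boundary (both operations here are hand-written stand-ins for illustration, not MindSpore APIs): the data processing side produces NumPy arrays, and before the result reaches a network-style operator it must be converted to the type the operator expects.

```python
import numpy as np

def center_crop(img, size):
    """Stand-in for a data processing operation: NumPy array in, NumPy array out."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def fake_network_op(x):
    """Stand-in for a network operator: it insists on float32 input."""
    assert x.dtype == np.float32, "network operators need tensor-like float input"
    return x.mean()

img = np.arange(16, dtype=np.uint8).reshape(4, 4)
cropped = center_crop(img, 2)          # data processing output: uint8 NumPy array

# Boundary conversion: match the input format the next stage expects
# (in MindSpore this is where the array would become a Tensor).
out = fake_network_op(cropped.astype(np.float32))
```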
Q25: Why does MindRecord generate a .db file? What error will occur when a dataset is loaded without the .db file?
A25: The .db file is the index file of a MindRecord file. The absence of the .db file usually results in an error when you obtain the total amount of data in the dataset. An example of the error message is 'MindRecordOp Count total rows failed'.
Q26: How do I read and decode images in a custom dataset?
A26: After a custom dataset is passed to GeneratorDataset, when images are read inside its methods (such as the __getitem__ function), you can directly return data of the bytes type, NumPy arrays, or NumPy arrays that have already been decoded. The details are as follows:
(1) After the images are read, data of the bytes type is returned.
```python
import numpy as np
import mindspore.dataset as ds
import mindspore.dataset.vision.py_transforms as py_vision

class ImageDataset:
    def __init__(self, data_path):
        self.data = data_path

    def __getitem__(self, index):
        # use file open and read method
        f = open(self.data[index], 'rb')
        img_bytes = f.read()
        f.close()
        # return bytes directly
        return (img_bytes, )

    def __len__(self):
        return len(self.data)

# data_path is a list of image file names
dataset1 = ds.GeneratorDataset(ImageDataset(data_path), ["data"])
decode_op = py_vision.Decode()
to_tensor = py_vision.ToTensor(output_type=np.int32)
dataset1 = dataset1.map(operations=[decode_op, to_tensor], input_columns=["data"])
```
(2) After the images are read, NumPy arrays are returned.
```python
# In the preceding case, the __getitem__ function can be modified as follows.
# The decoding operation is the same as in the preceding case.
def __getitem__(self, index):
    # use np.fromfile to read image
    img_np = np.fromfile(self.data[index])
    # return the NumPy array directly
    return (img_np, )
```
(3) Decode the images after reading them.
```python
# Based on the preceding case, the __getitem__ function can be modified as follows to
# return the decoded data directly, so the decoding map operation is no longer needed.
from PIL import Image

def __getitem__(self, index):
    # use Image.open to open the file, and convert it to RGB
    img_rgb = Image.open(self.data[index]).convert("RGB")
    return (img_rgb, )
```
Q27: How do I solve the error 'RuntimeError: can't start new thread' when using MindSpore Dataset to process data?
A27: The main cause is that when using *Dataset, .map(...), and .batch(...), the value of the num_parallel_workers parameter is too large and the number of user processes reaches the upper limit. You can resolve this by raising the user process limit with ulimit -u <maximum number of processes>, or by setting num_parallel_workers to a smaller value.
Q28: How do I solve the error "RuntimeError: Failed to copy data into tensor." when I use GeneratorDataset to load data?
A28: When using GeneratorDataset to load NumPy arrays returned by a pyfunc, the MindSpore framework performs the conversion from NumPy array to MindSpore tensor. If the memory that a NumPy array points to has already been freed, a memory copy error may occur. For example:
(1) An in-place conversion of NumPy array -> MindSpore tensor -> NumPy array is performed in the __getitem__ function. The MindSpore tensor and the NumPy array ndarray_1 share the same memory, so when the tensor goes out of scope as the __getitem__ function exits, the memory it points to is freed.
```python
import numpy as np
import mindspore.dataset as ds
from mindspore import Tensor

class RandomAccessDataset:
    def __init__(self):
        pass

    def __getitem__(self, item):
        ndarray = np.zeros((544, 1056, 3))
        tensor = Tensor.from_numpy(ndarray)
        ndarray_1 = tensor.asnumpy()
        return ndarray_1

    def __len__(self):
        return 8

data1 = ds.GeneratorDataset(RandomAccessDataset(), ["data"])
```
(2) Leaving aside the pointless round-trip conversion in the preceding example: when the __getitem__ function exits, the tensor object is released, and the NumPy array ndarray_1 that shares its memory becomes invalid. To avoid this problem, use the copy.deepcopy function to allocate independent memory for the returned NumPy array ndarray_2.
```python
import copy

import numpy as np
import mindspore.dataset as ds
from mindspore import Tensor

class RandomAccessDataset:
    def __init__(self):
        pass

    def __getitem__(self, item):
        ndarray = np.zeros((544, 1056, 3))
        tensor = Tensor.from_numpy(ndarray)
        ndarray_1 = tensor.asnumpy()
        # deepcopy gives the returned array its own, independent memory
        ndarray_2 = copy.deepcopy(ndarray_1)
        return ndarray_2

    def __len__(self):
        return 8

data1 = ds.GeneratorDataset(RandomAccessDataset(), ["data"])
```
Q29: How do I determine the cause of GetNext timeout based on the exit status of data preprocessing?
A29: When the data offloading mode (data preprocessing, transmission queue, and network computation in a pipeline) is used for training, if a GetNext timeout error is reported, the data preprocessing module outputs the status information to help you analyze the error cause. You can view the following situations in the logs. Causes and improvement methods are also provided:
(1) If information similar to the following is displayed, data preprocessing does not generate any data that can be used for training.
```
preprocess_batch: 0;
batch_queue: ;
push_start_time -> push_end_time
```
Improvement method: First iterate through the dataset objects to confirm whether dataset preprocessing is normal.
(2) If information similar to the following is displayed, a data record is generated during data preprocessing but it has not been sent to the device:
```
preprocess_batch: 0;
batch_queue: 1;
push_start_time -> push_end_time
2022-05-09-11:36:00.521.386 ->
```
Improvement method: Check whether the device plog contains any error message.
(3) If information similar to the following is displayed, three data records are generated during data preprocessing and have been sent to the device. In addition, the fourth data record is being preprocessed.
```
preprocess_batch: 3;
batch_queue: 1, 0, 1;
push_start_time -> push_end_time
2022-05-09-11:36:00.521.386 -> 2022-05-09-11:36:00.782.215
2022-05-09-11:36:01.212.621 -> 2022-05-09-11:36:01.490.139
2022-05-09-11:36:01.893.412 -> 2022-05-09-11:36:02.006.771
```
Improvement method: Check the last push_end_time and the time when the GetNext error is reported. If the time exceeds the default GetNext timeout time (1900s by default and can be modified using mindspore.set_context(op_timeout=xx)), the data preprocessing performance is poor. In this case, improve the data preprocessing performance according to section Optimizing the Data Processing.
(4) If information similar to the following is displayed, 182 data records were generated during data preprocessing and the 183rd data record is being sent to the device.
```
preprocess_batch: 182;
batch_queue: 1, 0, 1, 1, 2, 1, 0, 1, 1, 0;
push_start_time -> push_end_time
                            -> 2022-05-09-14:31:00.603.866
2022-05-09-14:31:00.621.146 -> 2022-05-09-14:31:01.018.964
2022-05-09-14:31:01.043.705 -> 2022-05-09-14:31:01.396.650
2022-05-09-14:31:01.421.501 -> 2022-05-09-14:31:01.807.671
2022-05-09-14:31:01.828.931 -> 2022-05-09-14:31:02.179.945
2022-05-09-14:31:02.201.960 -> 2022-05-09-14:31:02.555.941
2022-05-09-14:31:02.584.413 -> 2022-05-09-14:31:02.943.839
2022-05-09-14:31:02.969.583 -> 2022-05-09-14:31:03.309.299
2022-05-09-14:31:03.337.607 -> 2022-05-09-14:31:03.684.034
2022-05-09-14:31:03.717.230 -> 2022-05-09-14:31:04.038.521
2022-05-09-14:31:04.064.571 ->
```
Improvement method: Check whether the device plog contains any error message.