[ "MindSpore Made Easy" ]

[ "MindSpore Made Easy" ]

MindSpore Made Easy Dive Into MindSpore—CSVDataset for Dataset Loading

June 22, 2022
Author: kaierlong
Development environment: Ubuntu 20.04, Python 3.8, MindSpore 1.7.0

1. Parameter Description

dataset_files: dataset file path, which can be a single file or a list of files.
field_delim: delimiter used to separate fields. The default value is a comma (,).
column_names: field names, used as the keys of the data fields.
shuffle: whether to shuffle the data. Possible values are False, Shuffle.GLOBAL, and Shuffle.FILES.
    Shuffle.GLOBAL (default): shuffles both the files and the data within them.
    Shuffle.FILES: shuffles the files only.

2. Preparing Data

2.1 Downloading Data

Run the following commands to download iris.data and iris.names to a specified directory:

mkdir iris && cd iris
wget -c https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
wget -c https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names

Note: If the wget command cannot be used due to system restrictions, you can download the files with a web browser instead.

2.2 Data Overview

The Iris flower dataset, widely used for multivariate analysis, contains 150 samples split evenly across three classes (Setosa, Versicolour, and Virginica). Each class contains 50 records, and each record has four attributes: sepal length, sepal width, petal length, and petal width. For more details, see the official dataset description in iris.names.

2.3 Dataset Division

The dataset is divided into a training set and a test set at a 4:1 ratio. The code is as follows:

from random import shuffle


def preprocess_iris_data(iris_data_file, train_file, test_file, header=True):
    cls_0 = "Iris-setosa"
    cls_1 = "Iris-versicolor"
    cls_2 = "Iris-virginica"

    cls_0_samples = []
    cls_1_samples = []
    cls_2_samples = []

    with open(iris_data_file, "r", encoding="UTF8") as fp:
        lines = fp.readlines()
        for line in lines:
            line = line.strip()
            if not line:
                continue
            if cls_0 in line:
                cls_0_samples.append(line)
                continue
            if cls_1 in line:
                cls_1_samples.append(line)
                continue
            if cls_2 in line:
                cls_2_samples.append(line)

    shuffle(cls_0_samples)
    shuffle(cls_1_samples)
    shuffle(cls_2_samples)

    print("number of class 0: {}".format(len(cls_0_samples)), flush=True)
    print("number of class 1: {}".format(len(cls_1_samples)), flush=True)
    print("number of class 2: {}".format(len(cls_2_samples)), flush=True)

    train_samples = cls_0_samples[:40] + cls_1_samples[:40] + cls_2_samples[:40]
    test_samples = cls_0_samples[40:] + cls_1_samples[40:] + cls_2_samples[40:]

    header_content = "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Classes"

    with open(train_file, "w", encoding="UTF8") as fp:
        if header:
            fp.write("{}\n".format(header_content))
        for sample in train_samples:
            fp.write("{}\n".format(sample))

    with open(test_file, "w", encoding="UTF8") as fp:
        if header:
            fp.write("{}\n".format(header_content))
        for sample in test_samples:
            fp.write("{}\n".format(sample))


def main():
    iris_data_file = "{your_path}/iris/iris.data"
    iris_train_file = "{your_path}/iris/iris_train.csv"
    iris_test_file = "{your_path}/iris/iris_test.csv"

    preprocess_iris_data(iris_data_file, iris_train_file, iris_test_file)


if __name__ == "__main__":
    main()

Save the above code to the preprocess.py file (replace the data file paths with the actual ones) and run the following command:

python3 preprocess.py

The output is as follows:

number of class 0: 50
number of class 1: 50
number of class 2: 50

The iris_train.csv and iris_test.csv files are generated. The directory content is as follows:

.
├── iris.data
├── iris.names
├── iris_test.csv
└── iris_train.csv
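The split arithmetic in section 2.3 can be sanity-checked without the real data: 40 samples per class go to training and 10 to test, giving 120 and 30 records overall. A minimal stdlib sketch, using dummy placeholder samples in place of the real CSV lines:

```python
# Simulate the per-class 40/10 split performed by preprocess_iris_data,
# using dummy placeholder strings instead of the real iris rows.
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
samples = {cls: ["{}-{}".format(cls, i) for i in range(50)] for cls in classes}

train = sum((samples[cls][:40] for cls in classes), [])
test = sum((samples[cls][40:] for cls in classes), [])

print(len(train), len(test))  # 120 30, i.e. a 4:1 ratio
```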

3. Trial and Error Cases

The following trial-and-error cases will help you get familiar with CSVDataset.

3.1 Usage of column_defaults

Run the following code to load the data (shuffle is set to False so that the problem is easy to reproduce):

from mindspore.dataset import CSVDataset


def dataset_load(data_files):
    column_defaults = [float, float, float, float, str]
    column_names = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Classes"]

    dataset = CSVDataset(
        dataset_files=data_files,
        field_delim=",",
        column_defaults=column_defaults,
        column_names=column_names,
        num_samples=None,
        shuffle=False)

    data_iter = dataset.create_dict_iterator()
    item = None
    for data in data_iter:
        item = data
        break

    print("====== sample ======\n{}".format(item), flush=True)


def main():
    iris_train_file = "{your_path}/iris/iris_train.csv"

    dataset_load(data_files=iris_train_file)


if __name__ == "__main__":
    main()

Save the above code to the load.py file (replace the data file paths with the actual ones) and run the following command:

python3 load.py

An error is reported as follows:

Traceback (most recent call last):

File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 107, in 
main()
File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 103, in main
dataset_load(data_files=iris_train_file)
File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 75, in dataset_load
dataset = CSVDataset(
File "/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/validators.py", line 1634, in new_method
raise TypeError("column type in column_defaults is invalid.")
TypeError: column type in column_defaults is invalid.

The check around line 1634 of the mindspore/dataset/engine/validators.py file is as follows:

# check column_defaults
column_defaults = param_dict.get('column_defaults')
if column_defaults is not None:
    if not isinstance(column_defaults, list):
        raise TypeError("column_defaults should be type of list.")
    for item in column_defaults:
        if not isinstance(item, (str, int, float)):
            raise TypeError("column type in column_defaults is invalid.")

3.1.1 Analysis

According to the documentation, column_defaults (list, optional) specifies the data types of the columns. Valid types are float, int, and string. The default value is None, meaning all columns are treated as strings.

The code in mindspore/dataset/engine/validators.py shows that column_defaults must contain instances of these types, not the type objects themselves. So we need to change

column_defaults = [float, float, float, float, str]

to

column_defaults = [5.84, 3.05, 3.76, 1.20, "Classes"]

(the numeric values are taken from the iris.names file). After the modification, run the code again. The following error is reported:

WARNING: Logging before InitGoogleLogging() is written to STDERR
[ERROR] MD(13306,0x70000269b000,Python):2022-06-14-16:51:59.681.109 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Unexpected error. Invalid csv, csv file: /Users/kaierlong/Downloads/iris/iris_train.csv parse failed at line 1, type does not match.
Line of code : 506
File : /Users/jenkins/agent-working-dir/workspace/Compile_CPU_X86_MacOS_PY39/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc

Traceback (most recent call last):
  File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 107, in <module>
    main()
  File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 103, in main
    dataset_load(data_files=iris_train_file)
  File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 90, in dataset_load
    for data in data_iter:
  File "/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 147, in __next__
    data = self._get_next()
  File "/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 211, in _get_next
    raise err
  File "/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 204, in _get_next
    return {k: self._transform_tensor(t) for k, t in self._iterator.GetNextAsMap().items()}
RuntimeError: Unexpected error. Invalid csv, csv file: /Users/kaierlong/Downloads/iris/iris_train.csv parse failed at line 1, type does not match.
Line of code : 506
File         : /Users/jenkins/agent-working-dir/workspace/Compile_CPU_X86_MacOS_PY39/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc

We'll analyze this error in the following section.

3.2 Processing of the Header Information

3.2.1 Analysis

According to the error information, the failure occurs at line 506 of the mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc file. The source code is as follows:

Status CsvOp::LoadFile(const std::string &file, int64_t start_offset, int64_t end_offset, int32_t worker_id) {
  CsvParser csv_parser(worker_id, jagged_rows_connector_.get(), field_delim_, column_default_list_, file);
  RETURN_IF_NOT_OK(csv_parser.InitCsvParser());
  csv_parser.SetStartOffset(start_offset);
  csv_parser.SetEndOffset(end_offset);

  auto realpath = FileUtils::GetRealPath(file.c_str());
  if (!realpath.has_value()) {
    MS_LOG(ERROR) << "Invalid file path, " << file << " does not exist.";
    RETURN_STATUS_UNEXPECTED("Invalid file path, " + file + " does not exist.");
  }

  std::ifstream ifs;
  ifs.open(realpath.value(), std::ifstream::in);
  if (!ifs.is_open()) {
    RETURN_STATUS_UNEXPECTED("Invalid file, failed to open " + file + ", the file is damaged or permission denied.");
  }
  if (column_name_list_.empty()) {
    std::string tmp;
    getline(ifs, tmp);
  }
  csv_parser.Reset();
  try {
    while (ifs.good()) {
      // when ifstream reaches the end of file, the function get() return std::char_traits<char>::eof()
      // which is a 32-bit -1, it's not equal to the 8-bit -1 on Euler OS. So instead of char, we use
      // int to receive its return value.
      int chr = ifs.get();
      int err = csv_parser.ProcessMessage(chr);
      if (err != 0) {
        // if error code is -2, the returned error is interrupted
        if (err == -2) return Status(kMDInterrupted);
        RETURN_STATUS_UNEXPECTED("Invalid file, failed to parse csv file: " + file + " at line " +
                                 std::to_string(csv_parser.GetTotalRows() + 1) +
                                 ". Error message: " + csv_parser.GetErrorMessage());
      }
    }
  } catch (std::invalid_argument &ia) {
    std::string err_row = std::to_string(csv_parser.GetTotalRows() + 1);
    RETURN_STATUS_UNEXPECTED("Invalid csv, csv file: " + file + " parse failed at line " + err_row +
                             ", type does not match.");
  } catch (std::out_of_range &oor) {
    std::string err_row = std::to_string(csv_parser.GetTotalRows() + 1);
    RETURN_STATUS_UNEXPECTED("Invalid csv, " + file + " parse failed at line " + err_row + " : value out of range.");
  }
  return Status::OK();
}

Note that the source code skips the first line only when column_name_list_ is empty, that is, when column_names is not specified. Because our code passes column_names explicitly, no line is skipped, and every line, including the header written during data division in section 2.3, is treated as data. In other words, CSVDataset provides no way to skip a header line when column_names is given.
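The resulting failure mode is easy to reproduce in plain Python: the first field of the header row cannot be converted to a float, which is exactly the "type does not match" error raised at line 1:

```python
# Parsing the header row as data fails on the first numeric column,
# mirroring the "parse failed at line 1, type does not match" error.
try:
    float("Sepal.Length")
    parsed = True
except ValueError:
    parsed = False

print("header field parsed as float:", parsed)  # False
```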

Based on this analysis, change the data-division call

preprocess_iris_data(iris_data_file, iris_train_file, iris_test_file)

to

preprocess_iris_data(iris_data_file, iris_train_file, iris_test_file, header=False)

Execute the preprocess.py file again to generate new data.
Execute the load.py file (no modification is required). The output is as follows:
====== sample ======
{'Sepal.Length': Tensor(shape=[], dtype=Float32, value= 5.5), 'Sepal.Width': Tensor(shape=[], dtype=Float32, value= 4.2), 'Petal.Length': Tensor(shape=[], dtype=Float32, value= 1.4), 'Petal.Width': Tensor(shape=[], dtype=Float32, value= 0.2),
'Classes': Tensor(shape=[], dtype=String, value= 'Iris-setosa')}

Notes: For readability, the format of the output has been adjusted. You can see that the data is read correctly: it contains five fields, and the field names follow the specified column_names.

4. Positive Code Example

Through the two trial-and-error cases in section 3, we now have a preliminary understanding of CSVDataset. One problem remains: the Classes field is not converted to a numeric value. The following code shows how to do that:

from mindspore.dataset import CSVDataset
from mindspore.dataset.text import Lookup, Vocab


def dataset_load(data_files):
    column_defaults = [5.84, 3.05, 3.76, 1.20, "Classes"]
    column_names = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Classes"]

    dataset = CSVDataset(
        dataset_files=data_files,
        field_delim=",",
        column_defaults=column_defaults,
        column_names=column_names,
        num_samples=None,
        shuffle=False)

    cls_to_id_dict = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
    vocab = Vocab.from_dict(word_dict=cls_to_id_dict)
    lookup = Lookup(vocab)
    dataset = dataset.map(input_columns="Classes", operations=lookup)

    data_iter = dataset.create_dict_iterator()
    item = None
    for data in data_iter:
        item = data
        break

    print("====== sample ======\n{}".format(item), flush=True)


def main():
    iris_train_file = "{your_path}/iris/iris_train.csv"

    dataset_load(data_files=iris_train_file)


if __name__ == "__main__":
    main()

Save the above code to the load.py file (replace the data file paths with the actual ones) and run the following command:

python3 load.py

The output is as follows:

====== sample ======

{'Sepal.Length': Tensor(shape=[], dtype=Float32, value= 5.5), 'Sepal.Width': Tensor(shape=[], dtype=Float32, value= 4.2), 'Petal.Length': Tensor(shape=[], dtype=Float32, value= 1.4), 'Petal.Width': Tensor(shape=[], dtype=Float32, value= 0.2),
'Classes': Tensor(shape=[], dtype=Int32, value= 0)}
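Conceptually, the Vocab/Lookup step is a per-sample dictionary lookup applied to the Classes column. A plain-Python illustration of the same mapping (no MindSpore required):

```python
# The Vocab maps each class name to its integer id; Lookup applies
# this mapping to every value of the "Classes" column.
cls_to_id_dict = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

classes_column = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
ids = [cls_to_id_dict[cls] for cls in classes_column]
print(ids)  # [0, 1, 2, 0]
```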

Notes: The data contains five fields. The Classes field has been converted to a numeric value according to cls_to_id_dict = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}, using the mindspore.dataset.text operations. Alternatively, you could convert the field to a numeric value during data division in section 2.3.

5. Summary

This article explored the CSVDataset interface in MindSpore through several hands-on cases.

6. Improvements to Be Made

A header line cannot be kept in the data file when column_names is specified.
There is no simple API option for reading only selected fields.

7. References

mindspore.dataset.CSVDataset