# Loading Dataset

[![View Source On Gitee](./_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.1/docs/programming_guide/source_en/dataset_loading.md)

## Overview

MindSpore can load common datasets. You can directly use the classes in `mindspore.dataset` to load them. The following table lists the supported common datasets and their corresponding classes.

| Dataset | Dataset Class | Description |
| ---- | ---- | ---- |
| MNIST | MnistDataset | MNIST is a large handwritten digit image dataset. It has 60,000 training images and 10,000 test images and is often used to train various image processing systems. |
| CIFAR-10 | Cifar10Dataset | CIFAR-10 is a small image dataset that contains 60,000 32 x 32 color images of 10 categories. Each category contains 6,000 images, of which 5,000 are training images and 1,000 are test images. |
| CIFAR-100 | Cifar100Dataset | CIFAR-100 is similar to CIFAR-10, but it has 100 categories. Each category contains 600 images, of which 500 are training images and 100 are test images. |
| CelebA | CelebADataset | CelebA is a large face image dataset that contains more than 200,000 face images of celebrities. Each image has 40 feature labels. |
| PASCAL-VOC | VOCDataset | PASCAL-VOC is a common image dataset, which is widely used in computer vision fields such as object detection and image segmentation. |
| COCO | CocoDataset | COCO is a large dataset for object detection, image segmentation, and pose estimation. |
| CLUE | CLUEDataset | CLUE is a large Chinese semantic comprehension text dataset. |

MindSpore can also load datasets in different data storage formats. You can directly use the corresponding classes in `mindspore.dataset` to load data files from disk. The following table lists the supported data formats and their corresponding classes.

| Data Format | Dataset Class | Description |
| ---- | ---- | ---- |
| MindRecord | MindDataset | MindRecord is a self-developed data format of MindSpore. It features efficient read/write and easy distributed processing. |
| Manifest | ManifestDataset | Manifest is a data format supported by Huawei ModelArts. It describes the original files and labeling information and can be used for labeling, training, and inference. |
| TFRecord | TFRecordDataset | TFRecord is a binary data file format defined by TensorFlow. |
| NumPy | NumpySlicesDataset | NumPy data source refers to NumPy array data that has already been read into memory. |
| Text File | TextFileDataset | Text File refers to common data in text format. |
| CSV File | CSVDataset | CSV refers to comma-separated values. Files in this format store tabular data in plain text. |

MindSpore also supports loading user-defined datasets using `GeneratorDataset`. You can implement your own dataset classes as required.

> For details about the API for dataset loading, see [MindSpore API](https://www.mindspore.cn/doc/api_python/en/r1.1/mindspore/mindspore.dataset.html).

## Loading Common Datasets

The following describes how to load common datasets.
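All of these dataset classes follow the same basic usage pattern: point the class at the dataset directory, then iterate. As a minimal sketch of that pattern, the snippet below loads three MNIST samples; the `MNIST/` path is an assumption and must point to a directory containing the decompressed MNIST files.

```python
import mindspore.dataset as ds

# Hypothetical path: assumes the decompressed MNIST files are stored here.
DATA_DIR = "MNIST/"

# Load three samples and display the shape and label of each image.
mnist_dataset = ds.MnistDataset(DATA_DIR, num_samples=3)
for data in mnist_dataset.create_dict_iterator():
    print("Image shape:", data['image'].shape, ", Label:", data['label'])
```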
### CIFAR-10/100 Dataset

Download the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) and decompress it. The directory structure is as follows:

```text
└─cifar-10-batches-bin
    ├── batches.meta.txt
    ├── data_batch_1.bin
    ├── data_batch_2.bin
    ├── data_batch_3.bin
    ├── data_batch_4.bin
    ├── data_batch_5.bin
    ├── readme.html
    └── test_batch.bin
```

The following example uses the `Cifar10Dataset` API to load the CIFAR-10 dataset, uses a sequential sampler to obtain five samples, and displays the shape and label of each image. The methods for loading the CIFAR-100 and MNIST datasets are similar.

```python
import mindspore.dataset as ds

DATA_DIR = "cifar-10-batches-bin/"

sampler = ds.SequentialSampler(num_samples=5)
dataset = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)

for data in dataset.create_dict_iterator():
    print("Image shape:", data['image'].shape, ", Label:", data['label'])
```

The output is as follows:

```text
Image shape: (32, 32, 3) , Label: 6
Image shape: (32, 32, 3) , Label: 9
Image shape: (32, 32, 3) , Label: 9
Image shape: (32, 32, 3) , Label: 4
Image shape: (32, 32, 3) , Label: 1
```

### VOC Dataset

There are multiple versions of the VOC dataset; VOC2012 is used as an example here. Download the [VOC2012 dataset](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar) and decompress it. The directory structure is as follows:

```text
└─ VOCtrainval_11-May-2012
    └── VOCdevkit
        └── VOC2012
            ├── Annotations
            ├── ImageSets
            ├── JPEGImages
            ├── SegmentationClass
            └── SegmentationObject
```

The following example uses the `VOCDataset` API to load the VOC2012 dataset, and displays the original image shape and target image shape when the segmentation and detection tasks are specified.

```python
import mindspore.dataset as ds

DATA_DIR = "VOCtrainval_11-May-2012/VOCdevkit/VOC2012/"

dataset = ds.VOCDataset(DATA_DIR, task="Segmentation", usage="train", num_samples=2, decode=True, shuffle=False)

print("[Segmentation]:")
for data in dataset.create_dict_iterator():
    print("image shape:", data["image"].shape)
    print("target shape:", data["target"].shape)

dataset = ds.VOCDataset(DATA_DIR, task="Detection", usage="train", num_samples=1, decode=True, shuffle=False)

print("[Detection]:")
for data in dataset.create_dict_iterator():
    print("image shape:", data["image"].shape)
    print("bbox shape:", data["bbox"].shape)
```

The output is as follows:

```text
[Segmentation]:
image shape: (281, 500, 3)
target shape: (281, 500, 3)
image shape: (375, 500, 3)
target shape: (375, 500, 3)
[Detection]:
image shape: (442, 500, 3)
bbox shape: (2, 4)
```

### COCO Dataset

There are multiple versions of the COCO dataset. Here, the COCO2017 validation dataset is taken as an example. Download the COCO2017 [validation dataset](http://images.cocodataset.org/zips/val2017.zip), [detection task annotation](http://images.cocodataset.org/annotations/annotations_trainval2017.zip), and [panoptic task annotation](http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip), and decompress them. Take only a part of the validation dataset and store it in the following directory structure:

```text
└─ COCO
    ├── val2017
    └── annotations
        ├── instances_val2017.json
        ├── panoptic_val2017.json
        └── person_keypoints_val2017.json
```

The following example uses the `CocoDataset` API to load the COCO dataset, and displays the data keys when the object detection, stuff segmentation, keypoint detection, and panoptic segmentation tasks are specified.
```python
import mindspore.dataset as ds

DATA_DIR = "COCO/val2017/"
ANNOTATION_FILE = "COCO/annotations/instances_val2017.json"
KEYPOINT_FILE = "COCO/annotations/person_keypoints_val2017.json"
PANOPTIC_FILE = "COCO/annotations/panoptic_val2017.json"

dataset = ds.CocoDataset(DATA_DIR, annotation_file=ANNOTATION_FILE, task="Detection", num_samples=1)
for data in dataset.create_dict_iterator():
    print("Detection:", data.keys())

dataset = ds.CocoDataset(DATA_DIR, annotation_file=ANNOTATION_FILE, task="Stuff", num_samples=1)
for data in dataset.create_dict_iterator():
    print("Stuff:", data.keys())

dataset = ds.CocoDataset(DATA_DIR, annotation_file=KEYPOINT_FILE, task="Keypoint", num_samples=1)
for data in dataset.create_dict_iterator():
    print("Keypoint:", data.keys())

dataset = ds.CocoDataset(DATA_DIR, annotation_file=PANOPTIC_FILE, task="Panoptic", num_samples=1)
for data in dataset.create_dict_iterator():
    print("Panoptic:", data.keys())
```

The output is as follows:

```text
Detection: dict_keys(['image', 'bbox', 'category_id', 'iscrowd'])
Stuff: dict_keys(['image', 'segmentation', 'iscrowd'])
Keypoint: dict_keys(['image', 'keypoints', 'num_keypoints'])
Panoptic: dict_keys(['image', 'bbox', 'category_id', 'iscrowd', 'area'])
```

## Loading Datasets in Specific Formats

The following describes how to load dataset files in specific formats.

### MindRecord

MindRecord is a data format defined by MindSpore. Using MindRecord can improve data loading performance.

> For details about how to convert a dataset into the MindRecord data format, see [Data Format Conversion](https://www.mindspore.cn/doc/programming_guide/en/r1.1/dataset_conversion.html).

The following example uses the `MindDataset` API to load MindRecord files, and displays the labels of the loaded data.

```python
import mindspore.dataset as ds

DATA_FILE = ["mindrecord_file_0", "mindrecord_file_1", "mindrecord_file_2"]
mindrecord_dataset = ds.MindDataset(DATA_FILE)

for data in mindrecord_dataset.create_dict_iterator(output_numpy=True):
    print(data["label"])
```

### Manifest

Manifest is a data format supported by Huawei ModelArts. For details, see [Specifications for Importing the Manifest File](https://support.huaweicloud.com/en-us/engineers-modelarts/modelarts_23_0009.html).

The following example uses the `ManifestDataset` API to load a Manifest file, and displays the labels of the loaded data.

```python
import mindspore.dataset as ds

DATA_FILE = "manifest_file"
manifest_dataset = ds.ManifestDataset(DATA_FILE)

for data in manifest_dataset.create_dict_iterator():
    print(data["label"])
```

### TFRecord

TFRecord is a binary data file format defined by TensorFlow.

The following example uses the `TFRecordDataset` API to load TFRecord files and introduces two methods of setting the dataset format and features.

1. Specify the dataset path or TFRecord file list to create a `TFRecordDataset` object.

    ```python
    import mindspore.dataset as ds

    DATA_FILE = ["tfrecord_file_0", "tfrecord_file_1", "tfrecord_file_2"]
    tfrecord_dataset = ds.TFRecordDataset(DATA_FILE)
    ```

2. Compile a schema file or create a schema object to set the dataset format and features.

    - Compile a schema file.

        Write the dataset format and features to the schema file in JSON format. The following is an example:

        ```json
        {
            "columns": {
                "image": {
                    "type": "uint8",
                    "rank": 1
                },
                "label": {
                    "type": "string",
                    "rank": 1
                },
                "id": {
                    "type": "int64",
                    "rank": 0
                }
            }
        }
        ```

        - `columns`: column information field, which needs to be defined based on the actual column names of the dataset.
        In the preceding example, the dataset columns are `image`, `label`, and `id`.

        When creating `TFRecordDataset`, pass in the path of the schema file.

        ```python
        # DATA_FILE is defined in the snippet above.
        SCHEMA_DIR = "dataset_schema_path/schema.json"
        tfrecord_dataset = ds.TFRecordDataset(DATA_FILE, schema=SCHEMA_DIR)
        ```

    - Create a schema object.

        Create a schema object, add user-defined fields to it, and pass it in when creating the dataset object.

        ```python
        from mindspore import dtype as mstype

        # `ds` and DATA_FILE are defined in the snippets above.
        schema = ds.Schema()
        schema.add_column('image', de_type=mstype.uint8)
        schema.add_column('label', de_type=mstype.int32)
        tfrecord_dataset = ds.TFRecordDataset(DATA_FILE, schema=schema)
        ```

### NumPy

If all data has already been read into memory, you can directly use the `NumpySlicesDataset` class to load it.

The following examples describe how to use `NumpySlicesDataset` to load array, list, and dict data.

- Load NumPy array data.

    ```python
    import numpy as np
    import mindspore.dataset as ds

    np.random.seed(6)
    features, labels = np.random.sample((4, 2)), np.random.sample((4, 1))

    data = (features, labels)
    dataset = ds.NumpySlicesDataset(data, column_names=["col1", "col2"], shuffle=False)

    for data in dataset:
        print(data[0], data[1])
    ```

    The output is as follows:

    ```text
    [0.89286015 0.33197981] [0.33540785]
    [0.82122912 0.04169663] [0.62251943]
    [0.10765668 0.59505206] [0.43814143]
    [0.52981736 0.41880743] [0.73588211]
    ```

- Load Python list data.

    ```python
    import mindspore.dataset as ds

    data1 = [[1, 2], [3, 4]]

    dataset = ds.NumpySlicesDataset(data1, column_names=["col1"], shuffle=False)

    for data in dataset:
        print(data[0])
    ```

    The output is as follows:

    ```text
    [1 2]
    [3 4]
    ```

- Load Python dict data.

    ```python
    import mindspore.dataset as ds

    data1 = {"a": [1, 2], "b": [3, 4]}

    dataset = ds.NumpySlicesDataset(data1, column_names=["col1", "col2"], shuffle=False)

    for data in dataset.create_dict_iterator():
        print(data)
    ```

    The output is as follows:

    ```text
    {'col1': Tensor(shape=[], dtype=Int64, value= 1), 'col2': Tensor(shape=[], dtype=Int64, value= 3)}
    {'col1': Tensor(shape=[], dtype=Int64, value= 2), 'col2': Tensor(shape=[], dtype=Int64, value= 4)}
    ```

### CSV

The following example uses `CSVDataset` to load CSV dataset files, and displays the labels of the loaded data. Loading a text dataset file is similar; a sketch using `TextFileDataset` follows the CSV example below.

```python
import mindspore.dataset as ds

DATA_FILE = ["csv_file_0", "csv_file_1", "csv_file_2"]
csv_dataset = ds.CSVDataset(DATA_FILE)

# Prints the column named "1" of the CSV files.
for data in csv_dataset.create_dict_iterator(output_numpy=True):
    print(data["1"])
```
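As a minimal sketch of the text-file case, the snippet below loads plain text files with `TextFileDataset`; the file names are assumptions, and each line of the files is returned as one sample.

```python
import mindspore.dataset as ds

# Hypothetical file list: assumes these plain text files exist on disk.
DATA_FILE = ["text_file_0", "text_file_1"]
text_dataset = ds.TextFileDataset(DATA_FILE)

# Each line of the text files is returned as one sample in the "text" column.
for data in text_dataset.create_dict_iterator(output_numpy=True):
    print(data["text"])
```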
## Loading User-defined Datasets

For datasets that cannot be directly loaded by MindSpore, you can construct a `GeneratorDataset` object to load them in a customized way, or convert them into the MindRecord data format. The following demonstrates several methods of loading user-defined datasets. For comparison, the generated random data is kept the same across the examples.

### Constructing Dataset Generator Function

Construct a generator function that defines how data is returned, and then use this function to construct the user-defined dataset object. This method is applicable to simple scenarios.

```python
import numpy as np
import mindspore.dataset as ds

np.random.seed(58)
data = np.random.sample((5, 2))
label = np.random.sample((5, 1))

def GeneratorFunc():
    for i in range(5):
        yield (data[i], label[i])

dataset = ds.GeneratorDataset(GeneratorFunc, ["data", "label"])

for sample in dataset.create_dict_iterator():
    print(sample["data"], sample["label"])
```

The output is as follows:

```text
[0.36510558 0.45120592] [0.78888122]
[0.49606035 0.07562207] [0.38068183]
[0.57176158 0.28963401] [0.16271622]
[0.30880446 0.37487617] [0.54738768]
[0.81585667 0.96883469] [0.77994068]
```

### Constructing Iterable Dataset Class

Construct a dataset class that implements the `__iter__` and `__next__` methods, and then use an object of this class to construct the user-defined dataset object. Compared with directly defining a generator function, using a dataset class allows more customized functionality.

```python
import numpy as np
import mindspore.dataset as ds

class IterDatasetGenerator:
    def __init__(self):
        np.random.seed(58)
        self.__index = 0
        self.__data = np.random.sample((5, 2))
        self.__label = np.random.sample((5, 1))

    def __next__(self):
        if self.__index >= len(self.__data):
            raise StopIteration
        else:
            item = (self.__data[self.__index], self.__label[self.__index])
            self.__index += 1
            return item

    def __iter__(self):
        self.__index = 0
        return self

    def __len__(self):
        return len(self.__data)

dataset_generator = IterDatasetGenerator()
dataset = ds.GeneratorDataset(dataset_generator, ["data", "label"], shuffle=False)

for data in dataset.create_dict_iterator():
    print(data["data"], data["label"])
```

The output is as follows:

```text
[0.36510558 0.45120592] [0.78888122]
[0.49606035 0.07562207] [0.38068183]
[0.57176158 0.28963401] [0.16271622]
[0.30880446 0.37487617] [0.54738768]
[0.81585667 0.96883469] [0.77994068]
```

### Constructing Random Accessible Dataset Class

Construct a dataset class that implements the `__getitem__` method, and then use an object of this class to construct the user-defined dataset object. Because each sample can be accessed by index, this method is applicable to distributed training.

```python
import numpy as np
import mindspore.dataset as ds

class GetDatasetGenerator:
    def __init__(self):
        np.random.seed(58)
        self.__data = np.random.sample((5, 2))
        self.__label = np.random.sample((5, 1))

    def __getitem__(self, index):
        return (self.__data[index], self.__label[index])

    def __len__(self):
        return len(self.__data)

dataset_generator = GetDatasetGenerator()
dataset = ds.GeneratorDataset(dataset_generator, ["data", "label"], shuffle=False)

for data in dataset.create_dict_iterator():
    print(data["data"], data["label"])
```

The output is as follows:

```text
[0.36510558 0.45120592] [0.78888122]
[0.49606035 0.07562207] [0.38068183]
[0.57176158 0.28963401] [0.16271622]
[0.30880446 0.37487617] [0.54738768]
[0.81585667 0.96883469] [0.77994068]
```

If you want to perform distributed training, you additionally need to define a sampler class that implements the `__iter__` method, which returns the indices of the sampled data each time.
The code that needs to be added is as follows:

```python
import math

class MySampler():
    def __init__(self, dataset, local_rank, world_size):
        self.__num_data = len(dataset)
        self.__local_rank = local_rank
        self.__world_size = world_size
        # Each rank gets the same number of samples, rounding up.
        self.samples_per_rank = int(math.ceil(self.__num_data / float(self.__world_size)))
        self.total_num_samples = self.samples_per_rank * self.__world_size

    def __iter__(self):
        # Pad the index list (by wrapping around to the beginning) so that it
        # divides evenly among ranks, then take every world_size-th index
        # starting from this rank's offset.
        indices = list(range(self.__num_data))
        indices.extend(indices[:self.total_num_samples-len(indices)])
        indices = indices[self.__local_rank:self.total_num_samples:self.__world_size]
        return iter(indices)

    def __len__(self):
        return self.samples_per_rank

dataset_generator = GetDatasetGenerator()
sampler = MySampler(dataset_generator, local_rank=0, world_size=2)
dataset = ds.GeneratorDataset(dataset_generator, ["data", "label"], shuffle=False, sampler=sampler)

for data in dataset.create_dict_iterator():
    print(data["data"], data["label"])
```

The output is as follows:

```text
[0.36510558 0.45120592] [0.78888122]
[0.57176158 0.28963401] [0.16271622]
[0.81585667 0.96883469] [0.77994068]
```
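To check the sharding logic, the complementary shard can be produced under the same assumptions by changing `local_rank` to 1. Since five samples cannot be divided evenly between two ranks, the sampler pads the index list by wrapping around to the beginning, so this shard ends with a repeat of the dataset's first sample:

```python
# Same data and sampler as above, but sampling the other shard (rank 1 of 2).
dataset_generator = GetDatasetGenerator()
sampler = MySampler(dataset_generator, local_rank=1, world_size=2)
dataset = ds.GeneratorDataset(dataset_generator, ["data", "label"], shuffle=False, sampler=sampler)

for data in dataset.create_dict_iterator():
    print(data["data"], data["label"])
```

With the same seed, this should print:

```text
[0.49606035 0.07562207] [0.38068183]
[0.30880446 0.37487617] [0.54738768]
[0.36510558 0.45120592] [0.78888122]
```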