mindspore.dataset

This module provides APIs to load and process various common datasets such as MNIST, CIFAR-10, CIFAR-100, VOC, COCO, ImageNet, CelebA, CLUE, etc. It also supports datasets in standard formats, including MindRecord, TFRecord, Manifest, etc. Users can also define their own datasets with this module.

In addition, this module provides APIs to sample data while loading.

Caching can be enabled for most datasets through the keyword argument cache. Note that caching is not yet supported on the Windows platform, so do not use it when loading and processing data on Windows. For more details and limitations, refer to Single-Node Tensor Cache.

The API examples commonly import the following modules:

import mindspore.dataset as ds
import mindspore.dataset.transforms as transforms
import mindspore.dataset.vision as vision

Descriptions of common dataset terms are as follows:

  • Dataset, the base class of all the datasets. It provides data processing methods to help preprocess the data.

  • SourceDataset, an abstract class that represents the source of a dataset pipeline, which produces data from data sources such as files and databases.

  • MappableDataset, an abstract class that represents a source dataset which supports random access.

  • Iterator, the base class of dataset iterator for enumerating elements.
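To make the random-access idea behind MappableDataset concrete, the following is a minimal sketch in plain Python. The class name SquaresSource and its sample values are hypothetical and exist only for illustration. The sketch assumes that a random-access source is simply an object exposing __getitem__ and __len__, which is the kind of Python data source GeneratorDataset accepts.

```python
class SquaresSource:
    """Hypothetical random-access source: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        self._size = size

    def __getitem__(self, index):
        # Random access: any valid index can be fetched directly, in any order.
        if not 0 <= index < self._size:
            raise IndexError(index)
        return index, index * index

    def __len__(self):
        # The total sample count lets a sampler choose indices up front.
        return self._size


source = SquaresSource(5)
picked = [source[i] for i in (3, 0, 4)]  # fetch samples out of order
print(picked)
```

Because every index is addressable, a sampler can decide the visiting order. An object like this could be passed as the source of GeneratorDataset together with two column names of the user's choosing.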

Introduction to data processing pipeline

../_images/dataset_pipeline_en.png

As shown in the figure above, the MindSpore dataset module makes it easy for users to define data preprocessing pipelines and to transform the samples in a dataset efficiently, using multiple processes or threads. The specific steps are as follows:

  • Loading datasets: Users can load supported datasets using the *Dataset classes, or load datasets customized in Python through a UDF loader combined with GeneratorDataset. The loading classes also accept a variety of parameters, such as a sampler, data slicing, and data shuffling;

  • Dataset operation: Users call dataset object methods such as .shuffle / .filter / .skip / .split / .take / … to further shuffle, filter, skip, split, or limit the number of samples taken from the dataset;

  • Dataset sample transform operation: The user can add data transform operations (vision transform, NLP transform, audio transform) to the map operation to perform transformations. During data preprocessing, multiple map operations can be defined to apply different transform operations to different fields. The data transform operation can also be a user-defined transform pyfunc (Python function);

  • Batch: After the transformation of the samples, the user can use the batch operation to organize multiple samples into batches, or use self-defined batch logic with the parameter per_batch_map applied;

  • Iterator: Finally, the user can use the dataset object method create_dict_iterator to create an iterator, which can output the preprocessed data cyclically.

An example of the data processing pipeline is shown below. Please refer to datasets_example.py for the complete example.

import numpy as np
import mindspore as ms
import mindspore.dataset as ds
import mindspore.dataset.vision as vision
import mindspore.dataset.transforms as transforms

# construct data and label
data1 = np.array(np.random.sample(size=(300, 300, 3)) * 255, dtype=np.uint8)
data2 = np.array(np.random.sample(size=(300, 300, 3)) * 255, dtype=np.uint8)
data3 = np.array(np.random.sample(size=(300, 300, 3)) * 255, dtype=np.uint8)
data4 = np.array(np.random.sample(size=(300, 300, 3)) * 255, dtype=np.uint8)

label = [1, 2, 3, 4]

# load the data and label by NumpySlicesDataset
dataset = ds.NumpySlicesDataset(([data1, data2, data3, data4], label), ["data", "label"])

# apply the transform to data
dataset = dataset.map(operations=vision.RandomCrop(size=(250, 250)), input_columns="data")
dataset = dataset.map(operations=vision.Resize(size=(224, 224)), input_columns="data")
dataset = dataset.map(operations=vision.Normalize(mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                                  std=[0.229 * 255, 0.224 * 255, 0.225 * 255]),
                      input_columns="data")
dataset = dataset.map(operations=vision.HWC2CHW(), input_columns="data")

# apply the transform to label
dataset = dataset.map(operations=transforms.TypeCast(ms.int32), input_columns="label")

# batch
dataset = dataset.batch(batch_size=2)

# create iterator
epochs = 2
ds_iter = dataset.create_dict_iterator(output_numpy=True, num_epochs=epochs)
for _ in range(epochs):
    for item in ds_iter:
        print("item: {}".format(item), flush=True)

Vision

mindspore.dataset.Caltech101Dataset

A source dataset that reads and parses Caltech101 dataset.

mindspore.dataset.Caltech256Dataset

A source dataset that reads and parses Caltech256 dataset.

mindspore.dataset.CelebADataset

A source dataset that reads and parses CelebA dataset.

mindspore.dataset.Cifar10Dataset

A source dataset that reads and parses Cifar10 dataset.

mindspore.dataset.Cifar100Dataset

A source dataset that reads and parses Cifar100 dataset.

mindspore.dataset.CityscapesDataset

A source dataset that reads and parses Cityscapes dataset.

mindspore.dataset.CocoDataset

A source dataset that reads and parses COCO dataset.

mindspore.dataset.DIV2KDataset

A source dataset that reads and parses the DIV2K dataset.

mindspore.dataset.EMnistDataset

A source dataset that reads and parses the EMNIST dataset.

mindspore.dataset.FakeImageDataset

A source dataset for generating fake images.

mindspore.dataset.FashionMnistDataset

A source dataset that reads and parses the Fashion-MNIST dataset.

mindspore.dataset.FlickrDataset

A source dataset that reads and parses the Flickr8k and Flickr30k datasets.

mindspore.dataset.Flowers102Dataset

A source dataset that reads and parses Flowers102 dataset.

mindspore.dataset.ImageFolderDataset

A source dataset that reads images from a tree of directories.

mindspore.dataset.KMnistDataset

A source dataset that reads and parses the KMNIST dataset.

mindspore.dataset.ManifestDataset

A source dataset for reading images from a Manifest file.

mindspore.dataset.MnistDataset

A source dataset that reads and parses the MNIST dataset.

mindspore.dataset.PhotoTourDataset

A source dataset that reads and parses the PhotoTour dataset.

mindspore.dataset.Places365Dataset

A source dataset that reads and parses the Places365 dataset.

mindspore.dataset.QMnistDataset

A source dataset that reads and parses the QMNIST dataset.

mindspore.dataset.SBDataset

A source dataset that reads and parses Semantic Boundaries Dataset.

mindspore.dataset.SBUDataset

A source dataset that reads and parses the SBU dataset.

mindspore.dataset.SemeionDataset

A source dataset that reads and parses Semeion dataset.

mindspore.dataset.STL10Dataset

A source dataset that reads and parses STL10 dataset.

mindspore.dataset.SVHNDataset

A source dataset that reads and parses SVHN dataset.

mindspore.dataset.USPSDataset

A source dataset that reads and parses the USPS dataset.

mindspore.dataset.VOCDataset

A source dataset that reads and parses VOC dataset.

mindspore.dataset.WIDERFaceDataset

A source dataset that reads and parses WIDERFace dataset.

Text

mindspore.dataset.AGNewsDataset

A source dataset that reads and parses AG News datasets.

mindspore.dataset.AmazonReviewDataset

A source dataset that reads and parses Amazon Review Polarity and Amazon Review Full datasets.

mindspore.dataset.CLUEDataset

A source dataset that reads and parses CLUE datasets.

mindspore.dataset.CoNLL2000Dataset

A source dataset that reads and parses CoNLL2000 chunking dataset.

mindspore.dataset.CSVDataset

A source dataset that reads and parses comma-separated values (CSV) files as dataset.

mindspore.dataset.DBpediaDataset

A source dataset that reads and parses the DBpedia dataset.

mindspore.dataset.EnWik9Dataset

A source dataset that reads and parses the EnWik9 dataset.

mindspore.dataset.IMDBDataset

A source dataset that reads and parses the Internet Movie Database (IMDb) dataset.

mindspore.dataset.IWSLT2016Dataset

A source dataset that reads and parses IWSLT2016 datasets.

mindspore.dataset.IWSLT2017Dataset

A source dataset that reads and parses IWSLT2017 datasets.

mindspore.dataset.PennTreebankDataset

A source dataset that reads and parses PennTreebank datasets.

mindspore.dataset.SogouNewsDataset

A source dataset that reads and parses Sogou News dataset.

mindspore.dataset.TextFileDataset

A source dataset that reads and parses datasets stored on disk in text format.

mindspore.dataset.UDPOSDataset

A source dataset that reads and parses UDPOS dataset.

mindspore.dataset.WikiTextDataset

A source dataset that reads and parses WikiText2 and WikiText103 datasets.

mindspore.dataset.YahooAnswersDataset

A source dataset that reads and parses the YahooAnswers dataset.

mindspore.dataset.YelpReviewDataset

A source dataset that reads and parses the Yelp Review Polarity and Yelp Review Full datasets.

Audio

mindspore.dataset.LJSpeechDataset

A source dataset that reads and parses LJSpeech dataset.

mindspore.dataset.SpeechCommandsDataset

A source dataset that reads and parses the SpeechCommands dataset.

mindspore.dataset.TedliumDataset

A source dataset that reads and parses Tedlium dataset.

mindspore.dataset.YesNoDataset

A source dataset that reads and parses the YesNo dataset.

Standard Format

mindspore.dataset.CSVDataset

A source dataset that reads and parses comma-separated values (CSV) files as dataset.

mindspore.dataset.MindDataset

A source dataset that reads and parses MindRecord dataset.

mindspore.dataset.OBSMindDataset

A source dataset that reads and parses a MindRecord dataset that is stored in cloud storage such as OBS, Minio or AWS S3.

mindspore.dataset.TFRecordDataset

A source dataset that reads and parses datasets stored on disk in TFRecord format.

User Defined

mindspore.dataset.GeneratorDataset

A source dataset that generates data from Python by invoking Python data source each epoch.

mindspore.dataset.NumpySlicesDataset

Creates a dataset with given data slices, mainly for loading Python data into dataset.

mindspore.dataset.PaddedDataset

Creates a dataset with filler data provided by user.

mindspore.dataset.RandomDataset

A source dataset that generates random data.

Graph

mindspore.dataset.ArgoverseDataset

Loads the Argoverse dataset and creates a graph.

mindspore.dataset.Graph

A graph object that stores graph structure and feature data, and provides capabilities such as graph sampling.

mindspore.dataset.GraphData

Reads the graph dataset used for GNN training from the shared file and database.

mindspore.dataset.InMemoryGraphDataset

Basic Dataset for loading graph into memory.

Sampler

mindspore.dataset.DistributedSampler

A sampler that accesses a shard of the dataset. It helps divide the dataset into multiple subsets for distributed training.
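As a rough illustration of what accessing a shard means, the sketch below splits sample indices across shards in plain Python. It assumes a simple strided assignment, which is only one possible scheme and is not claimed to be how DistributedSampler assigns samples internally. The helper name shard_indices is hypothetical; num_shards and shard_id mirror the sampler's parameter names.

```python
def shard_indices(num_samples, num_shards, shard_id):
    """Return the sample indices assigned to one shard (strided scheme, assumed)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    # Shard k takes indices k, k + num_shards, k + 2 * num_shards, ...
    return list(range(shard_id, num_samples, num_shards))


# Ten samples divided among three devices.
shards = [shard_indices(10, 3, shard_id) for shard_id in range(3)]
print(shards)
```

Each index lands in exactly one shard, so the devices together cover the dataset without overlap, although shard sizes can differ by one when the sample count is not a multiple of num_shards.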

mindspore.dataset.PKSampler

Samples K elements for each of P classes in the dataset.
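The idea of taking K elements per class can be sketched in plain Python as follows. This is an illustration of the concept under simplifying assumptions (no shuffling, classes visited in order of first appearance), not the library's implementation; the helper name pk_sample and the example labels are hypothetical.

```python
from collections import defaultdict


def pk_sample(labels, k):
    """Pick up to k sample indices per class, classes in order of first appearance."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        # Keep an index only while its class still has fewer than k picks.
        if len(by_class[label]) < k:
            by_class[label].append(index)
    # Concatenate the per-class picks, class by class.
    return [index for picks in by_class.values() for index in picks]


labels = ["cat", "dog", "cat", "cat", "dog", "bird"]
selected = pk_sample(labels, k=2)
print(selected)
```

A class with fewer than K samples, such as "bird" here, simply contributes all of the samples it has in this sketch.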

mindspore.dataset.RandomSampler

Samples the elements randomly.

mindspore.dataset.SequentialSampler

Samples the dataset elements sequentially, which is equivalent to not using a sampler.

mindspore.dataset.SubsetRandomSampler

Samples the elements randomly from a sequence of indices.

mindspore.dataset.SubsetSampler

Samples the elements from a sequence of indices.

mindspore.dataset.WeightedRandomSampler

Samples the elements from [0, len(weights) - 1] randomly with the given weights (probabilities).
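A minimal sketch of weighted index sampling, using only the Python standard library, may help clarify the description above. It assumes sampling with replacement and uses a fixed seed for repeatability; the helper name weighted_sample is hypothetical and the code does not reproduce the library's random number stream.

```python
import random


def weighted_sample(weights, num_samples, seed=0):
    """Draw num_samples indices from [0, len(weights) - 1] with replacement."""
    rng = random.Random(seed)  # local generator so the draw is reproducible
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


# Index 0 is very likely, index 1 has zero weight, index 2 is rare.
drawn = weighted_sample([0.9, 0.0, 0.1], num_samples=20)
print(drawn)
```

An index with zero weight is never drawn, and larger weights make an index proportionally more likely to appear.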

Config

The configuration module provides various functions to set and get the supported configuration parameters, and read a configuration file.

mindspore.dataset.config.set_sending_batches

Set the default number of sending batches when training with sink_mode=True on Ascend devices.

mindspore.dataset.config.load

Load the project configuration from the file.

mindspore.dataset.config.set_seed

Set the seed so that randomly generated numbers are fixed, giving deterministic results.

mindspore.dataset.config.get_seed

Get random number seed.

mindspore.dataset.config.set_prefetch_size

Set the queue capacity of the thread in the pipeline.

mindspore.dataset.config.get_prefetch_size

Get the prefetch size, expressed as a number of rows.

mindspore.dataset.config.set_num_parallel_workers

Set a new global configuration default value for the number of parallel workers.

mindspore.dataset.config.get_num_parallel_workers

Get the global configuration of number of parallel workers.

mindspore.dataset.config.set_numa_enable

Set the default state of the NUMA enabled flag.

mindspore.dataset.config.get_numa_enable

Get the state of NUMA, indicating whether it is enabled or disabled.

mindspore.dataset.config.set_monitor_sampling_interval

Set the default interval (in milliseconds) for monitor sampling.

mindspore.dataset.config.get_monitor_sampling_interval

Get the global configuration of sampling interval of performance monitor.

mindspore.dataset.config.set_callback_timeout

Set the default timeout (in seconds) for WaitedDSCallback.

mindspore.dataset.config.get_callback_timeout

Get the default timeout for WaitedDSCallback.

mindspore.dataset.config.set_auto_num_workers

Set num_parallel_workers for each op automatically (this feature is turned off by default).

mindspore.dataset.config.get_auto_num_workers

Get whether the automatic number of workers feature is turned on or off.

mindspore.dataset.config.set_enable_shared_mem

Set the default state of shared memory flag.

mindspore.dataset.config.get_enable_shared_mem

Get the default state of shared mem enabled variable.

mindspore.dataset.config.set_enable_autotune

Set whether to enable AutoTune.

mindspore.dataset.config.get_enable_autotune

Get whether AutoTune is currently enabled.

mindspore.dataset.config.set_autotune_interval

Set the configuration adjustment interval (in steps) for AutoTune.

mindspore.dataset.config.get_autotune_interval

Get the current configuration adjustment interval (in steps) for AutoTune.

mindspore.dataset.config.set_auto_offload

Set the automatic offload flag of the dataset.

mindspore.dataset.config.get_auto_offload

Get the state of the automatic offload flag (True or False).

mindspore.dataset.config.set_enable_watchdog

Set the default state of the watchdog Python thread. The watchdog Python thread is enabled by default.

mindspore.dataset.config.get_enable_watchdog

Get the state of the watchdog Python thread, indicating whether it is enabled or disabled.

mindspore.dataset.config.set_fast_recovery

Set whether dataset pipeline should recover in fast mode during failover (yet with slightly different random augmentations).

mindspore.dataset.config.get_fast_recovery

Get whether the fast recovery mode is enabled for the current dataset pipeline.

mindspore.dataset.config.set_multiprocessing_timeout_interval

Set the default interval (in seconds) for multiprocessing/multithreading timeout when main process/thread gets data from subprocesses/child threads.

mindspore.dataset.config.get_multiprocessing_timeout_interval

Get the global configuration of multiprocessing/multithreading timeout when main process/thread gets data from subprocesses/child threads.

Others

mindspore.dataset.BatchInfo

Only the batch size function and per_batch_map of the batch operation can dynamically adjust parameters based on the number of batches and epochs during training.
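The sketch below illustrates the kind of batch size function the description refers to: a callable that receives a BatchInfo object and returns the size of the next batch. It relies on the BatchInfo methods get_batch_num and get_epoch_num. The FakeBatchInfo class is a hypothetical stand-in, written only so that the function can be exercised without a running pipeline, and growing_batch_size is likewise an illustrative name.

```python
class FakeBatchInfo:
    """Hypothetical stand-in for BatchInfo, used only to exercise the function."""

    def __init__(self, batch_num, epoch_num):
        self._batch_num = batch_num
        self._epoch_num = epoch_num

    def get_batch_num(self):
        # Number of the batch being processed in the current epoch, from 0.
        return self._batch_num

    def get_epoch_num(self):
        # Number of the current epoch, from 0.
        return self._epoch_num


def growing_batch_size(batch_info):
    """Return a batch size that grows by one with each batch in the epoch."""
    return batch_info.get_batch_num() + 1


sizes = [growing_batch_size(FakeBatchInfo(n, epoch_num=0)) for n in range(4)]
print(sizes)
```

In a real pipeline, a function like growing_batch_size would be passed as the batch_size argument of the batch operation, and the pipeline would supply the BatchInfo object.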

mindspore.dataset.DatasetCache

A client to interface with tensor caching service.

mindspore.dataset.DSCallback

Abstract base class used to build dataset callback classes.

mindspore.dataset.SamplingStrategy

Specifies the sampling strategy when executing get_sampled_neighbors.

mindspore.dataset.Schema

Class to represent a schema of a dataset.

mindspore.dataset.Shuffle

Specify the shuffle mode.

mindspore.dataset.WaitedDSCallback

Abstract base class used to build dataset callback classes that are synchronized with the training callback class mindspore.train.Callback.

mindspore.dataset.OutputFormat

Specifies the output storage format when executing get_all_neighbors.

mindspore.dataset.compare

Compare if two dataset pipelines are the same.

mindspore.dataset.deserialize

Construct a dataset pipeline from a JSON file produced by the dataset serialize function.

mindspore.dataset.serialize

Serialize dataset pipeline into a JSON file.

mindspore.dataset.show

Write the dataset pipeline graph to the log through logger.info.

mindspore.dataset.sync_wait_for_dataset

Wait until the dataset files required by all devices are downloaded.

mindspore.dataset.utils.imshow_det_bbox

Draw an image with given bboxes and class labels (with scores).