mindspore.dataset

This module provides APIs to load and process various datasets, such as MNIST, CIFAR-10, CIFAR-100, VOC, ImageNet, and CelebA. It also supports datasets in special formats, including mindrecord, tfrecord, and manifest. Users can also create samplers with this module to sample data.

class mindspore.dataset.ImageFolderDatasetV2(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, extensions=None, class_indexing=None, decode=False, num_shards=None, shard_id=None)

A source dataset that reads images from a tree of directories.

All images within one folder have the same label. The generated dataset has two columns [‘image’, ‘label’]. The shape of the image column is [image_size] if the decode flag is False, or [H,W,C] otherwise. The type of the image tensor is uint8. The label is a scalar uint64 tensor. This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows which argument combinations are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed
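The table above can be read as a small decision function. The sketch below is only an illustration in plain Python: expected_order is a hypothetical helper, not part of mindspore.dataset, and any non-None object stands in for a Sampler.

```python
def expected_order(sampler=None, shuffle=None):
    """Return the expected row order for a sampler/shuffle combination.

    Hypothetical helper that mirrors the table; not a MindSpore API.
    """
    if sampler is not None:
        if shuffle is not None:
            # A sampler combined with an explicit shuffle value is not allowed.
            raise RuntimeError("sampler and shuffle are mutually exclusive")
        return "order defined by sampler"
    # Without a sampler, only an explicit shuffle=False gives sequential order.
    return "sequential order" if shuffle is False else "random order"


print(expected_order())                  # random order
print(expected_order(shuffle=False))     # sequential order
print(expected_order(sampler=object()))  # order defined by sampler
```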

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, set in the config).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • extensions (list[str], optional) – List of file extensions to be included in the dataset (default=None).

  • class_indexing (dict, optional) – A str-to-int mapping from folder name to index (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • decode (bool, optional) – Whether to decode the images after reading (default=False).

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.
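As a rough illustration of the default class_indexing behavior described above (folder names sorted alphabetically, indices starting from 0), the following plain-Python sketch builds such a mapping. default_class_indexing is a hypothetical helper, and Python's built-in sorted is assumed to be an acceptable stand-in for the library's alphabetical ordering.

```python
def default_class_indexing(folder_names):
    """Map folder names to indices: sorted order, starting from 0.

    Hypothetical helper; the real mapping is built inside the dataset.
    """
    return {name: index for index, name in enumerate(sorted(folder_names))}


print(default_class_indexing(["dog", "cat", "bird"]))
# {'bird': 0, 'cat': 1, 'dog': 2}
```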

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).
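The sharding-related conditions in the list above can be summarized in one check. This is a sketch under assumptions: check_sharding is a hypothetical name, "sharding" is taken to mean that num_shards or shard_id is given, and the real library may perform these checks in a different order.

```python
def check_sharding(num_shards=None, shard_id=None, sampler=None):
    """Raise the errors listed above for invalid sharding arguments."""
    sharding = num_shards is not None or shard_id is not None
    if sampler is not None and sharding:
        raise RuntimeError("sampler and sharding cannot be specified together")
    if num_shards is not None and shard_id is None:
        raise RuntimeError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        raise RuntimeError("shard_id is specified but num_shards is None")
    # At this point, either both values are set or neither is.
    if sharding and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")


check_sharding()                           # no sharding: accepted
check_sharding(num_shards=4, shard_id=3)   # valid shard: accepted
```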

Examples

>>> import mindspore.dataset as ds
>>> # path to imagefolder directory. This directory needs to contain sub-directories which contain the images
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # 1) read all samples (image files) in dataset_dir with 8 threads
>>> imagefolder_dataset = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8)
>>> # 2) read all samples (image files) from folder cat and folder dog with label 0 and 1
>>> imagefolder_dataset = ds.ImageFolderDatasetV2(dataset_dir, class_indexing={"cat": 0, "dog": 1})
>>> # 3) read all samples (image files) in dataset_dir with extensions .JPEG and .png (case sensitive)
>>> imagefolder_dataset = ds.ImageFolderDatasetV2(dataset_dir, extensions=[".JPEG", ".png"])
apply(apply_func)

Apply a function to this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match the signature of the per_batch_map callable.
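To make the effect of drop_remainder concrete, here is a minimal plain-Python sketch of grouping consecutive rows. batch_rows is a hypothetical helper that works on a list, not the library's implementation, and it ignores per_batch_map and callable batch sizes.

```python
def batch_rows(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into batches of batch_size rows."""
    batches = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            # The last, incomplete batch is dropped.
            break
        batches.append(batch)
    return batches


rows = list(range(10))
print(batch_rows(rows, 4))        # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(batch_rows(rows, 4, True))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```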

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows are combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pairs.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarray of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through the device.

Parameters

prefetch_size (int, optional) – Number of records to prefetch ahead of the user’s request (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=1).
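The sketch below illustrates how a predicate selects rows. It is plain Python under assumptions: filter_rows is a hypothetical helper, rows are modeled as dictionaries, the predicate is assumed to receive the selected columns as positional arguments, and rows for which it returns True are kept.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which predicate returns True."""
    kept = []
    for row in rows:
        # None or an empty list means that all columns are passed.
        columns = input_columns or list(row)
        if predicate(*(row[name] for name in columns)):
            kept.append(row)
    return kept


rows = [{"data": value} for value in range(64)]
small = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
print(len(small))  # 11
```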

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is generated from data in the range 0 ~ 63
>>> # keep only the rows whose value is less than 11, i.e. filter out
>>> # the data that is greater than or equal to 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of a RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.
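The chaining of operations described above can be sketched in plain Python. apply_operations is a hypothetical helper, not the library's map: it models one row as a dictionary, treats a returned tuple as several output columns (an assumption for illustration), and leaves out columns_order handling.

```python
def apply_operations(row, input_columns, operations, output_columns):
    """Apply operations in order and name the outputs of the last one."""
    # The first operation receives the columns named in input_columns.
    values = tuple(row[name] for name in input_columns)
    for operation in operations:
        result = operation(*values)
        # The outputs of one operation become the inputs of the next.
        values = result if isinstance(result, tuple) else (result,)
    if len(values) != len(output_columns):
        raise ValueError("output_columns must match the last operation's outputs")
    return dict(zip(output_columns, values))


row = {"col0": 1, "col1": 5, "col2": 2}
operations = [
    lambda x, y: (x, x + y, x + y + 1),       # 2 columns in, 3 columns out
    lambda x, y, z: x * y * z,                # 3 columns in, 1 column out
    lambda x: (x % 2, x % 3, x % 5, x % 7),   # 1 column in, 4 columns out
]
outputs = apply_operations(row, ["col2", "col0"], operations,
                           ["mod2", "mod3", "mod5", "mod7"])
print(outputs)  # {'mod2': 0, 'mod3': 0, 'mod5': 4, 'mod7': 3}
```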

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. This parameter is mandatory if len(input_columns) != len(output_columns) (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations which
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, each of which takes 1 column as input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the output
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operation[i] must equal the number of
>>> # input columns of operation[i+1]. Otherwise, this map call will also result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameter must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
num_classes()

Get the number of classes in dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List, list of the data type of each column.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.
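As a small illustration of the selection and reordering described above, the sketch below projects one row modeled as a dictionary. project_row is a hypothetical helper in plain Python, not a MindSpore function.

```python
def project_row(row, columns):
    """Keep only the named columns, in the order given by columns."""
    return {name: row[name] for name in columns}


row = {"column1": 1, "column2": 2, "column3": 3, "column4": 4}
print(project_row(row, ["column3", "column1", "column2"]))
# {'column3': 3, 'column1': 1, 'column2': 2}
```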

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.
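The effect of operation order on the number of batches can be checked with simple arithmetic. The sketch below is plain Python with hypothetical helper names; it assumes that repeating before batching lets a batch span the boundary between two epochs.

```python
def count_batches(total_rows, batch_size, drop_remainder=False):
    """Number of batches produced from total_rows consecutive rows."""
    full, remainder = divmod(total_rows, batch_size)
    return full if drop_remainder or remainder == 0 else full + 1


def batch_then_repeat(num_rows, batch_size, count):
    # Each epoch is batched on its own, then the batches are repeated.
    return count_batches(num_rows, batch_size) * count


def repeat_then_batch(num_rows, batch_size, count):
    # All repeated rows are treated as one long sequence before batching.
    return count_batches(num_rows * count, batch_size)


print(batch_then_repeat(10, 4, 2))  # 6
print(repeat_then_batch(10, 4, 2))  # 5
```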

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
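The four steps above can be sketched in plain Python as follows. buffer_shuffle is a hypothetical helper that works on any iterable and uses random.Random for the random choice; it illustrates the algorithm and is not the library's implementation.

```python
import random


def buffer_shuffle(rows, buffer_size, seed=None):
    """Shuffle rows with a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break
    output = []
    # Step 4: repeat steps 2 and 3 until the buffer is empty.
    while buffer:
        # Step 2: pick a random element as the next output row.
        index = rng.randrange(len(buffer))
        output.append(buffer[index])
        # Step 3: refill the freed slot with the next input row, if any.
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)
    return output


shuffled = buffer_shuffle(list(range(10)), buffer_size=4, seed=58)
print(shuffled)
```

With a small buffer, a row can only move a limited distance toward the front: the k-th output row always comes from the first buffer_size + k input rows. A buffer as large as the dataset removes that restriction, which is why it gives a global shuffle.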

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first count elements of this dataset.

Parameters

count (int) – Number of elements to be skipped in the dataset.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger the callback with the given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of batches (rows) that are released. When num_batch is None, the same number as specified by sync_wait is released.

  • data (dict or None) – The data passed to the callback.

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – The number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending the next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Takes at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before batch, the given number of rows is taken; otherwise, the given number of batches is taken.
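The difference between the two orders can be seen in a short plain-Python sketch. take_items and batch_rows are hypothetical list-based helpers that imitate the behavior described in the note; they are not MindSpore functions.

```python
def take_items(items, count=-1):
    """Return at most count items; -1 means all items."""
    return list(items) if count == -1 else list(items)[:count]


def batch_rows(rows, batch_size):
    """Group consecutive rows into batches of batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


rows = list(range(10))
# take before batch: 4 rows are taken, then batched.
print(batch_rows(take_items(rows, 4), 2))  # [[0, 1], [2, 3]]
# batch before take: 4 batches are taken.
print(take_items(batch_rows(rows, 2), 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```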

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes 50 elements.
>>> data = data.take(50)
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limit the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.
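The requirement that column names must not collide can be illustrated with rows modeled as dictionaries. This is a plain-Python sketch: zip_rows is a hypothetical helper, the ValueError is an assumed error type, and stopping at the shorter input follows Python's built-in zip rather than documented library behavior.

```python
def zip_rows(left_rows, right_rows):
    """Combine two datasets row by row into rows with all columns."""
    zipped = []
    for left, right in zip(left_rows, right_rows):
        clash = set(left) & set(right)
        if clash:
            # Columns in the input datasets must not have the same name.
            raise ValueError(f"duplicate column names: {sorted(clash)}")
        zipped.append({**left, **right})
    return zipped


ds1_rows = [{"image": "img0"}, {"image": "img1"}]
ds2_rows = [{"label": 0}, {"label": 1}]
print(zip_rows(ds1_rows, ds2_rows))
# [{'image': 'img0', 'label': 0}, {'image': 'img1', 'label': 1}]
```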

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.MnistDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None)

A source dataset for reading and parsing the MNIST dataset.

The generated dataset has two columns [‘image’, ‘label’]. The type of the image tensor is uint8. The label is a scalar uint32 tensor. This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows which argument combinations are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, set in the config).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>> dataset_dir = "/path/to/mnist_folder"
>>> # 1) read 3 samples from mnist_dataset
>>> mnist_dataset = ds.MnistDataset(dataset_dir=dataset_dir, num_samples=3)
>>> # in mnist_dataset dataset, each dictionary has keys "image" and "label"
apply(apply_func)

Apply a function to this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match the signature of the per_batch_map callable.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows are combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pairs.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarray of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through the device.

Parameters

prefetch_size (int, optional) – Number of records to prefetch ahead of the user’s request (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
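>>> # dataset is an instance of Dataset object with one column "data" holding the values 0 ~ 63
>>> # keep the rows whose value is less than 11, i.e. filter out the values greater than or equal to 11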
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns = ["data"])
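The predicate semantics can be checked without MindSpore. The following is a minimal pure-Python sketch, assuming each row is a dict keyed by column name; filter_rows is a hypothetical helper used only for illustration and is not part of mindspore.dataset.

```python
# Stand-in for a dataset with a single column "data" holding the values 0 ~ 63.
rows = [{"data": value} for value in range(64)]


def filter_rows(rows, predicate, input_columns):
    """Keep the rows for which predicate(*selected column values) is True.

    Hypothetical helper: it only mimics the documented behaviour of filter.
    """
    return [
        row for row in rows
        if predicate(*(row[name] for name in input_columns))
    ]


# Rows for which the predicate returns True are kept, so values >= 11 are removed.
kept = filter_rows(rows, predicate=lambda data: data < 11, input_columns=["data"])
print([row["data"] for row in kept])
```

Rows for which the predicate returns False are dropped, so only the values below 11 remain.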
get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count if this is a RepeatDataset; otherwise, the count is 1.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – list of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations which
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, which takes 1 column for input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the output
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameter must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
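The column-count rule in example 3) can be traced in plain Python. The following is a minimal sketch, assuming scalar column values instead of 2d arrays; chain is a hypothetical helper that only mimics how the outputs of one operation feed the next, and it does not model columns_order or any MindSpore internals.

```python
def chain(operations, *inputs):
    """Feed the outputs of each operation into the next one (hypothetical helper)."""
    values = inputs
    for op in operations:
        result = op(*values)
        # An operation may return one value or a tuple of values (one per column).
        values = result if isinstance(result, tuple) else (result,)
    return values


operations = [
    lambda x, y: (x, x + y, x + y + 1),        # 2 columns in, 3 columns out
    lambda x, y, z: x * y * z,                 # 3 columns in, 1 column out
    lambda x: (x % 2, x % 3, x % 5, x % 7),    # 1 column in, 4 columns out
]
output_columns = ["mod2", "mod3", "mod5", "mod7"]

# Illustrative scalar values for the input columns "col2" and "col0".
col2, col0 = 3, 4
outputs = chain(operations, col2, col0)

# The outputs of the last operation are named by output_columns, in order.
named = dict(zip(output_columns, outputs))
print(named)
```

If an operation returned a different number of values than the next one accepts, the call to op(*values) would raise a TypeError, which mirrors the requirement that the column counts of adjacent operations match.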
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List of data type.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur since the amount of data is not the amount training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
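The effect of the order of repeat and batch on the number of batches can be worked out with simple arithmetic. The following is a minimal sketch, assuming that batches are allowed to span epoch boundaries when repeat is applied first; num_batches is a hypothetical helper and the row counts are illustrative.

```python
import math


def num_batches(num_rows, batch_size, drop_remainder=False):
    """Number of batches produced from num_rows rows (hypothetical helper)."""
    if drop_remainder:
        return num_rows // batch_size
    return math.ceil(num_rows / batch_size)


num_rows, batch_size, repeat_count = 10, 4, 3

# batch then repeat: each epoch yields ceil(10 / 4) = 3 batches, repeated 3 times.
batch_then_repeat = num_batches(num_rows, batch_size) * repeat_count

# repeat then batch: the 30 repeated rows are batched as one stream.
repeat_then_batch = num_batches(num_rows * repeat_count, batch_size)

print(batch_then_repeat, repeat_then_batch)
```

With batch applied first, every epoch has the same batch boundaries, which is one reason the note recommends that order.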
reset()

Reset the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
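The four steps of the shuffle algorithm described above can be sketched in pure Python. This is a minimal illustration, assuming Python's random module as the random source; buffer_shuffle is a hypothetical helper and not the MindSpore implementation.

```python
import random

_END = object()  # sentinel marking an exhausted parent node


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in buffer-shuffled order (hypothetical helper)."""
    rng = random.Random(seed)
    source = iter(rows)

    # Step 1: fill the shuffle buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break

    # Step 4: repeat steps 2 and 3 until the buffer is empty.
    while buffer:
        # Step 2: randomly select the next row to propagate.
        index = rng.randrange(len(buffer))
        chosen = buffer[index]

        # Step 3: refill the freed slot from the parent node, if rows remain.
        following = next(source, _END)
        if following is _END:
            buffer.pop(index)
        else:
            buffer[index] = following

        yield chosen


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

Because the buffer only ever holds rows read so far, a row can move earlier by at most buffer_size - 1 positions. This is why a buffer as large as the whole dataset is needed for a global shuffle.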
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to be skipped from the dataset.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of batches (rows) that are released. When num_batch is None, the same number as specified by sync_wait will be released (default=None).

  • data (dict or None) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – the number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
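>>> # batch_size is assumed to be an int defined by the caller
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     # release the next batch; data is not rebound to the return value
>>>     data.sync_update("callback1")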
take(count=-1)

Takes at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before the batch operation, the given number of rows is taken; otherwise, the given number of batches is taken.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes at most the first 50 elements.
>>> data = data.take(50)
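The difference between taking rows and taking batches can be seen with plain lists. The following is a minimal sketch, assuming rows are list items; batch and take here are hypothetical stand-ins for the dataset operations of the same name.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of batch_size (hypothetical stand-in)."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(rows, count=-1):
    """Return at most count elements; -1 means all elements (hypothetical stand-in)."""
    return list(rows) if count == -1 else list(rows[:count])


rows = list(range(10))

# take before batch: 4 rows are taken, then grouped into 2 batches.
take_then_batch = batch(take(rows, 4), 2)

# batch before take: 5 batches are built, then 4 batches (8 rows) are taken.
batch_then_take = take(batch(rows, 2), 4)

print(take_then_batch)
print(batch_then_take)
```

The same count therefore yields 4 rows in the first pipeline and 8 rows in the second.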
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – limit the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
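The requirement that zipped datasets must not share column names can be sketched with rows represented as dicts. This is a minimal illustration, assuming that zipping stops at the shorter input; zip_rows is a hypothetical helper and the column names are made up for the example.

```python
def zip_rows(rows_a, rows_b):
    """Merge rows pairwise, rejecting duplicate column names (hypothetical helper)."""
    zipped = []
    for row_a, row_b in zip(rows_a, rows_b):
        clash = set(row_a) & set(row_b)
        if clash:
            raise ValueError(f"duplicate column names: {sorted(clash)}")
        # Each output row carries the columns of both inputs.
        zipped.append({**row_a, **row_b})
    return zipped


ds1_rows = [{"image": 0}, {"image": 1}]
ds2_rows = [{"label": 10}, {"label": 11}]
print(zip_rows(ds1_rows, ds2_rows))
```

If both inputs had a column with the same name, the merged row could not keep both values, which is why the names must be distinct.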
class mindspore.dataset.StorageDataset(dataset_files, schema, distribution='', columns_list=None, num_parallel_workers=None, deterministic_output=None, prefetch_size=None)[source]

A source dataset that reads and parses datasets stored on disk in various formats, including TFData format.

Parameters
  • dataset_files (list[str]) – List of files to be read.

  • schema (str) – Path to the json schema file. If numRows (parsed from the schema) does not exist, the full dataset is read.

  • distribution (str, optional) – Path of distribution config file (default=””).

  • columns_list (list[str], optional) – List of columns to be read (default=None, read all columns).

  • num_parallel_workers (int, optional) – Number of parallel working threads (default=None).

  • deterministic_output (bool, optional) – Whether the result of this dataset can be reproduced or not (default=True). If True, performance might be affected.

  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

Raises
  • RuntimeError – If schema file failed to read.

  • RuntimeError – If distribution file path is given but failed to read.

apply(apply_func)

Apply a function to this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match the signature of the per_batch_map callable.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
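The effect of drop_remainder can be illustrated on a plain list of rows. The following is a minimal sketch, assuming 250 illustrative rows; batch_rows is a hypothetical helper that only mimics the documented grouping behaviour.

```python
def batch_rows(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into batches (hypothetical helper)."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    # Drop the last batch only when it is incomplete and drop_remainder is set.
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


rows = list(range(250))
kept_all = batch_rows(rows, 100)
dropped = batch_rows(rows, 100, drop_remainder=True)

print([len(b) for b in kept_all])
print([len(b) for b in dropped])
```

With drop_remainder=True every batch has exactly batch_size rows, at the cost of discarding the trailing 50 rows in this example.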
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The order of the columns in the data obtained
>>> # by the iterator might differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarrays.

To specify which columns to list and in what order, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through the device.

Parameters

prefetch_size (int, optional) – prefetch number of records ahead of the user’s request (default=None).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
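>>> # dataset is an instance of Dataset object with one column "data" holding the values 0 ~ 63
>>> # keep the rows whose value is less than 11, i.e. filter out the values greater than or equal to 11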
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns = ["data"])
get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count if this is a RepeatDataset; otherwise, the count is 1.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – list of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations which
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, which takes 1 column for input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the output
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameter must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
num_classes()[source]

Get the number of classes in dataset.

Returns

Number, number of classes.

Raises
  • ValueError – If dataset type is invalid.

  • ValueError – If dataset is not Imagenet dataset or manifest dataset.

  • RuntimeError – If schema file is given but failed to load.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List of data type.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
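The effect of project can be pictured with rows modelled as dictionaries. This is a plain-Python sketch under that assumption, not MindSpore code; project_rows is a hypothetical helper.

```python
# Plain-Python sketch of column projection (not MindSpore code).
# Rows are modelled as dicts mapping column name -> value.

def project_rows(rows, columns):
    """Keep only `columns`, in the given order; other columns are dropped."""
    return [{name: row[name] for name in columns} for row in rows]

rows = [{"column1": 1, "column2": 2, "column3": 3, "column4": 4}]
projected = project_rows(rows, ["column3", "column1", "column2"])
print(list(projected[0]))  # ['column3', 'column1', 'column2']
```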
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of repeat and batch affects the number of batches. It is recommended to apply the repeat operation after the batch operation. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
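The note on the order of repeat and batch can be made concrete with a little arithmetic. The sketch below is plain Python, not MindSpore code; num_batches is a hypothetical helper, and it assumes that the last incomplete batch is kept and that repeat followed by batch treats all repetitions as one continuous stream of rows.

```python
import math

def num_batches(num_rows, batch_size, repeat_count, batch_first):
    """Count the batches a pipeline yields, keeping the last incomplete batch."""
    if batch_first:
        # batch -> repeat: every repetition yields the same set of batches.
        return math.ceil(num_rows / batch_size) * repeat_count
    # repeat -> batch: rows of all repetitions are batched as one stream.
    return math.ceil(num_rows * repeat_count / batch_size)

print(num_batches(10, 4, 2, batch_first=True))   # 6
print(num_batches(10, 4, 2, batch_first=False))  # 5
```

With 10 rows, a batch size of 4 and 2 repetitions, batching first gives 3 batches per epoch, while repeating first lets a batch span the epoch boundary.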
reset()

Reset the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
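The four steps of the shuffle algorithm described above can be sketched in plain Python. This is an illustration only, not the MindSpore implementation; buffer_shuffle is a hypothetical helper and the standard random module stands in for the library's random number generator.

```python
import random

def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in the order produced by the shuffle-buffer steps."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break
    # Steps 2 to 4: emit a random element, refill from the source, repeat.
    while buffer:
        index = rng.randrange(len(buffer))
        yield buffer.pop(index)
        for row in source:  # pulls at most one row from the source
            buffer.append(row)
            break

shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

Because only buffer_size rows are candidates at each step, a small buffer only shuffles locally. This is why a buffer as large as the whole dataset is needed for a global shuffle.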
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to skip in the dataset.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int or None) – The number of batches (rows) that are released. When num_batch is None, the same number as specified in sync_wait is released (default=None).

  • data (dict or None) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – The number of batches without blocking at the start of each epoch.

  • condition_name (str) – The condition name that is used to toggle sending next row.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data = data.sync_update("callback1")
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before the batch operation, the given number of rows is taken; otherwise, the given number of batches is taken.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes at most 50 elements.
>>> data = data.take(50)
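The difference between taking before and after batching, as described in the note above, can be sketched with lists. This is plain Python rather than MindSpore code; batch is a hypothetical helper that keeps the last incomplete batch.

```python
from itertools import islice

def batch(rows, batch_size):
    """Group consecutive rows into lists of batch_size (last may be shorter)."""
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

rows = list(range(10))

# take -> batch: take(4) keeps 4 rows, which then form 2 batches of 2 rows.
take_then_batch = batch(islice(rows, 4), 2)
print(take_then_batch)  # [[0, 1], [2, 3]]

# batch -> take: take(4) keeps 4 batches, that is 8 rows.
batch_then_take = list(islice(batch(rows, 2), 4))
print(batch_then_take)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```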
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limit on the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. Each transmission is limited to 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
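The requirement that zipped datasets must not share column names can be illustrated with rows modelled as dictionaries. This is a plain-Python sketch, not MindSpore code; zip_rows is a hypothetical helper, and raising ValueError on a name clash is an assumption made for the illustration.

```python
def zip_rows(rows_a, rows_b):
    """Merge rows pairwise; reject inputs that share a column name."""
    merged = []
    for row_a, row_b in zip(rows_a, rows_b):
        clash = set(row_a) & set(row_b)
        if clash:
            # Error type is an assumption, chosen only for this sketch.
            raise ValueError(f"duplicate column names: {sorted(clash)}")
        merged.append({**row_a, **row_b})
    return merged

images = [{"image": "a"}, {"image": "b"}]
labels = [{"label": 0}, {"label": 1}]
zipped = zip_rows(images, labels)
print(zipped)  # [{'image': 'a', 'label': 0}, {'image': 'b', 'label': 1}]
```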
class mindspore.dataset.MindDataset(dataset_file, columns_list=None, num_parallel_workers=None, shuffle=None, num_shards=None, shard_id=None, block_reader=False, sampler=None)[source]

A source dataset that reads from shard files and database.

Parameters
  • dataset_file (str) – One of the file names in the dataset.

  • columns_list (list[str], optional) – List of columns to be read (default=None).

  • num_parallel_workers (int, optional) – The number of readers (default=None).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, performs shuffle).

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.

  • block_reader (bool, optional) – Whether to read data in block mode (default=False).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, sampler is exclusive with shuffle and block_reader). Supported samplers: SubsetRandomSampler, PKSampler.

Raises
  • ValueError – If num_shards is specified but shard_id is None.

  • ValueError – If shard_id is specified but num_shards is None.

  • ValueError – If block reader is true but partition is specified.
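The three error conditions listed above amount to a small argument check. The following plain-Python sketch restates them; check_shard_args is a hypothetical helper, not a MindSpore function, and it assumes that "partition is specified" means that num_shards is given.

```python
def check_shard_args(num_shards=None, shard_id=None, block_reader=False):
    """Raise ValueError for the argument combinations listed under Raises."""
    if num_shards is not None and shard_id is None:
        raise ValueError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        raise ValueError("shard_id is specified but num_shards is None")
    # Assumption: "partition is specified" means num_shards is not None.
    if block_reader and num_shards is not None:
        raise ValueError("block_reader is True but partition is specified")

check_shard_args(num_shards=4, shard_id=0)  # a valid combination, no error
```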

apply(apply_func)

Apply a function in this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.
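The contract of apply, including the two TypeError cases, can be sketched with a toy class. ToyDataset is a hypothetical stand-in for Dataset that wraps a list of rows; it is not part of MindSpore.

```python
class ToyDataset:
    """Hypothetical stand-in for Dataset: wraps a list of rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def batch(self, batch_size):
        rows = self.rows
        return ToyDataset(rows[i:i + batch_size]
                          for i in range(0, len(rows), batch_size))

    def apply(self, apply_func):
        if not callable(apply_func):
            raise TypeError("apply_func must be a function")
        result = apply_func(self)
        if not isinstance(result, ToyDataset):
            raise TypeError("apply_func must return a dataset")
        return result

def apply_func(dataset):
    return dataset.batch(2)

data = ToyDataset(range(5)).apply(apply_func)
print(data.rows)  # [[0, 1], [2, 3], [4]]
```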

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of repeat and batch affects the number of batches. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list of string, optional) – List of names of the input columns. The size of the list should match with signature of per_batch_map callable.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
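The effect of drop_remainder can be sketched with plain lists. This is an illustration in plain Python, not MindSpore code; batch_rows is a hypothetical helper.

```python
def batch_rows(rows, batch_size, drop_remainder=False):
    """Group consecutive rows; optionally drop a short final batch."""
    rows = list(rows)
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

kept = batch_rows(range(10), 4)
dropped = batch_rows(range(10), 4, drop_remainder=True)
print(len(kept), len(dropped))  # 3 2
```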
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarray of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
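The difference between the two iterator styles can be sketched with rows modelled as dictionaries. This is plain Python, not MindSpore code; dict_iterator and tuple_iterator are hypothetical helpers that only mimic the shape of the returned items.

```python
def dict_iterator(rows):
    """Yield each row as a dict of column name -> value."""
    for row in rows:
        yield dict(row)

def tuple_iterator(rows, columns=None):
    """Yield each row as a list of values, optionally in `columns` order."""
    for row in rows:
        names = columns if columns is not None else list(row)
        yield [row[name] for name in names]

rows = [{"column1": 1, "column2": 2}]
print(list(dict_iterator(rows)))                          # [{'column1': 1, 'column2': 2}]
print(list(tuple_iterator(rows)))                         # [[1, 2]]
print(list(tuple_iterator(rows, ["column2", "column1"])))  # [[2, 1]]
```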
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through the device.

Parameters

prefetch_size (int, optional) – Number of records to prefetch ahead of the user’s request (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. Each transmission is limited to 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is an instance of Dataset with a column "data" holding the values 0 ~ 63
>>> # keep only the rows whose value is less than 11 (values >= 11 are filtered out)
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
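How the predicate and input_columns interact can be sketched in plain Python. The helper filter_rows below is hypothetical and is not MindSpore code; rows are modelled as dictionaries, and an empty or missing input_columns means that all columns are passed to the predicate, as stated in the note above.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep rows for which predicate(*selected column values) is true."""
    kept = []
    for row in rows:
        names = input_columns if input_columns else list(row)
        if predicate(*(row[name] for name in names)):
            kept.append(row)
    return kept

rows = [{"data": i} for i in range(64)]
small = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
print(len(small))  # 11
```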
get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of a RepeatDataset; for any other dataset, return 1.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – list of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations which
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, each of which takes 1 column as input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List of data type.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
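Renaming pairs each entry of input_columns with the entry at the same position in output_columns. The plain-Python sketch below models rows as dictionaries; rename_columns is a hypothetical helper, not MindSpore code, and leaving unlisted columns untouched is an assumption of the sketch.

```python
def rename_columns(rows, input_columns, output_columns):
    """Rename columns pairwise; columns not listed keep their names."""
    mapping = dict(zip(input_columns, output_columns))
    return [{mapping.get(name, name): value for name, value in row.items()}
            for row in rows]

rows = [{"input_col1": 1, "input_col2": 2, "other": 3}]
renamed = rename_columns(rows,
                         ["input_col1", "input_col2"],
                         ["output_col1", "output_col2"])
print(renamed)  # [{'output_col1': 1, 'output_col2': 2, 'other': 3}]
```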
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of repeat and batch affects the number of batches. It is recommended to apply the repeat operation after the batch operation. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to skip in the dataset.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int or None) – The number of batches (rows) that are released. When num_batch is None, the same number as specified in sync_wait is released (default=None).

  • data (dict or None) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – The number of batches without blocking at the start of each epoch.

  • condition_name (str) – The condition name that is used to toggle sending next row.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data = data.sync_update("callback1")
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before the batch operation, the given number of rows is taken; otherwise, the given number of batches is taken.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes at most 50 elements.
>>> data = data.take(50)
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limit on the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. Each transmission is limited to 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.GeneratorDataset(source, column_names, column_types=None, schema=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

A source dataset that generates data from Python by invoking a Python data source each epoch.

This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows which input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
-------------------    -------------------    ------------------------
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed
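The table can also be read as a small decision function. The sketch below is plain Python, not MindSpore code; expected_order is a hypothetical helper, and raising ValueError for the "not allowed" rows is an assumption made for the illustration.

```python
def expected_order(sampler=None, shuffle=None):
    """Return the order behavior from the table for a sampler/shuffle pair."""
    if sampler is not None:
        if shuffle is not None:
            # Error type is an assumption, chosen only for this sketch.
            raise ValueError("sampler and shuffle are mutually exclusive")
        return "order defined by sampler"
    if shuffle is False:
        return "sequential order"
    return "random order"  # shuffle is None or True

print(expected_order())               # random order
print(expected_order(shuffle=False))  # sequential order
```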

Parameters
  • source (Callable/Iterable/Random Accessible) – A generator callable object, an iterable Python object or a random accessible Python object. A callable source is required to return a tuple of NumPy arrays as a row of the dataset on source().next(). An iterable source is required to return a tuple of NumPy arrays as a row of the dataset on iter(source).next(). A random accessible source is required to return a tuple of NumPy arrays as a row of the dataset on source[idx].

  • column_names (list[str]) – List of column names of the dataset.

  • column_types (list[mindspore.dtype], optional) – List of column data types of the dataset (default=None). If provided, sanity check will be performed on generator output.

  • schema (Schema/String, optional) – Path to the json schema file or schema object (default=None). If the schema is not provided, the meta data from column_names and column_types is considered the schema.

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all samples).

  • num_parallel_workers (int, optional) – Number of subprocesses used to fetch the dataset in parallel (default=1).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Random accessible input is required. (default=None, expected order behavior shown in the table).

  • sampler (Sampler/Iterable, optional) – Object used to choose samples from the dataset. Random accessible input is required (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None). This argument should be specified only when ‘num_samples’ is “None”. Random accessible input is required.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified. Random accessible input is required.

Examples

>>> import numpy as np
>>> import mindspore.dataset as de
>>> # 1) Multidimensional generator function as callable input
>>> def generator_md():
>>>     for i in range(64):
>>>         yield (np.array([[i, i + 1], [i + 2, i + 3]]),)
>>> # create multi_dimension_generator_dataset with GeneratorMD and column name "multi_dimensional_data"
>>> multi_dimension_generator_dataset = de.GeneratorDataset(generator_md, ["multi_dimensional_data"])
>>> # 2) Multi-column generator function as callable input
>>> def generator_mc(maxid = 64):
>>>     for i in range(maxid):
>>>         yield (np.array([i]), np.array([[i, i + 1], [i + 2, i + 3]]))
>>> # create multi_column_generator_dataset with GeneratorMC and column names "col1" and "col2"
>>> multi_column_generator_dataset = de.GeneratorDataset(generator_mc, ["col1", "col2"])
>>> # 3) Iterable dataset as iterable input
>>> class MyIterable():
>>>     def __iter__(self):
>>>         return # User implementation
>>> # create iterable_generator_dataset with MyIterable object
>>> iterable_generator_dataset = de.GeneratorDataset(MyIterable(), ["col1"])
>>> # 4) Random accessible dataset as Random accessible input
>>> class MyRA():
>>>     def __getitem__(self, index):
>>>         return # User implementation
>>> # create ra_generator_dataset with MyRA object
>>> ra_generator_dataset = de.GeneratorDataset(MyRA(), ["col1"])
>>> # List/Dict/Tuple is also random accessible
>>> list_generator = de.GeneratorDataset([(np.array(0),), (np.array(1),), (np.array(2),)], ["col1"])
>>> # 5) Built-in Sampler
>>> # my_ds is a user-defined random accessible object, as in 4)
>>> my_generator = de.GeneratorDataset(my_ds, ["img", "label"], sampler=de.RandomSampler())
>>>
apply(apply_func)

Applies a function to this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.
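
The contract described for apply (a function goes in, a dataset must come out) can be sketched with a toy class. ToyDataset below is a hypothetical stand-in written for this illustration; it is not the library's Dataset and only models the two TypeError conditions listed above.

```python
class ToyDataset:
    """Minimal stand-in for a Dataset; only what the sketch needs."""

    def __init__(self, rows):
        self.rows = list(rows)

    def batch(self, batch_size):
        # Group consecutive rows into lists of at most batch_size rows.
        batches = [self.rows[i:i + batch_size]
                   for i in range(0, len(self.rows), batch_size)]
        return ToyDataset(batches)

    def apply(self, apply_func):
        # Mirrors the documented contract: apply_func must be a function
        # that takes a dataset and returns a dataset.
        if not callable(apply_func):
            raise TypeError("apply_func must be a function")
        result = apply_func(self)
        if not isinstance(result, ToyDataset):
            raise TypeError("apply_func must return a dataset")
        return result


def apply_func(dataset):
    # Any chain of dataset operations can be packaged this way.
    return dataset.batch(2)


data = ToyDataset(range(5)).apply(apply_func)
```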

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list of string, optional) – List of names of the input columns. The size of the list should match with signature of per_batch_map callable.

Returns

BatchDataset, dataset batched.
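
How drop_remainder affects the last batch can be sketched in plain Python. batch_rows is a hypothetical helper for illustration, not the library's implementation; it assumes an integer batch_size and ignores per_batch_map.

```python
def batch_rows(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into batches of batch_size rows.

    Hypothetical pure-Python illustration; not the library's implementation.
    """
    batches = []
    current = []
    for row in rows:
        current.append(row)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    # The last batch may be incomplete; keep it unless drop_remainder is True.
    if current and not drop_remainder:
        batches.append(current)
    return batches


kept = batch_rows(range(10), 4)
dropped = batch_rows(range(10), 4, drop_remainder=True)
```

With 10 rows and batch_size 4, the two trailing rows form a short third batch in kept and are discarded in dropped.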

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows are combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The order of the columns in the data obtained
>>> # by the iterator may differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarrays.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The order of the columns in the data obtained
>>> # by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through a device.

Parameters

prefetch_size (int, optional) – Number of records to prefetch ahead of the user’s request (default=None).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filters the dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.
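
The semantics of the predicate (rows for which it returns True are kept) can be sketched as follows. filter_rows is a hypothetical helper, and passing the selected columns to the predicate as positional arguments is an assumption of this sketch.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which predicate returns True.

    rows is a list of dicts mapping column name to value. Hypothetical
    illustration only; not the library's implementation.
    """
    kept = []
    for row in rows:
        # With no input_columns, every column is passed to the predicate.
        columns = input_columns if input_columns else list(row.keys())
        if predicate(*(row[name] for name in columns)):
            kept.append(row)
    return kept


rows = [{"data": i} for i in range(64)]
# Rows with data < 11 are kept; rows with data >= 11 are filtered out.
small = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
```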

Examples

>>> import mindspore.dataset as ds
>>> # dataset is an instance of Dataset with a column "data" holding the values 0 ~ 63
>>> # keep the rows that are less than 11; rows greater than or equal to 11 are filtered out
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.
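
The chaining of operations and the roles of output_columns and columns_order can be sketched on a single row. map_row is a hypothetical helper, not the library's implementation. In particular, whether the original input columns survive when the outputs are given new names is not spelled out here; this sketch keeps them, which is an assumption.

```python
def map_row(row, input_columns, operations, output_columns=None, columns_order=None):
    """Apply operations in order to selected columns of one row.

    row is a dict mapping column name to value (insertion-ordered).
    Hypothetical illustration only; not the library's implementation.
    """
    # The first operation receives the columns named in input_columns.
    values = tuple(row[name] for name in input_columns)
    for operation in operations:
        result = operation(*values)
        # An operation may output one column or a tuple of columns; the
        # outputs become the inputs of the next operation.
        values = result if isinstance(result, tuple) else (result,)

    # Without output_columns, the outputs reuse the input column names.
    names = output_columns if output_columns is not None else input_columns
    if len(names) != len(values):
        raise ValueError("output_columns must match the outputs of the last operation")

    # Assumption of this sketch: outputs with an existing name replace that
    # column in place, outputs with a new name are appended.
    mapped = dict(row)
    mapped.update(zip(names, values))

    # Only the columns in columns_order are propagated, in that order.
    if columns_order is not None:
        mapped = {name: mapped[name] for name in columns_order}
    return mapped


row = {"col0": 1, "col1": 2, "col2": 3}
operations = [lambda x, y: (x, x + y, x + y + 1),   # 2 columns in, 3 out
              lambda x, y, z: x * y * z,            # 3 columns in, 1 out
              lambda x: (x % 2, x % 3, x % 5, x % 7)]  # 1 column in, 4 out
mapped = map_row(row, ["col2", "col0"], operations,
                 output_columns=["mod2", "mod3", "mod5", "mod7"],
                 columns_order=["mod7", "mod3", "col1"])
```

Here the first lambda receives col2 and col0, the product 3 * 4 * 5 = 60 flows through the chain, and only the three columns named in columns_order reach the result.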

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. This parameter is mandatory if len(input_columns) != len(output_columns) (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations which
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, which takes 1 column for input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameter must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List, list of the data type of each column.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.
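
Projection on a single row amounts to selecting the named columns in the requested order. project_row is a hypothetical helper used only to illustrate this; rows are modeled as dicts.

```python
def project_row(row, columns):
    """Select the named columns from a row, in the order given.

    Hypothetical illustration only; columns not listed are discarded.
    """
    return {name: row[name] for name in columns}


row = {"column1": 1, "column2": 2, "column3": 3, "column4": 4}
projected = project_row(row, ["column3", "column1", "column2"])
```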

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.
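
Renaming pairs each entry of input_columns with the entry at the same position in output_columns. rename_row is a hypothetical helper sketching this on one row; the length check and its exception type are assumptions.

```python
def rename_row(row, input_columns, output_columns):
    """Rename columns of a row; unlisted columns keep their names.

    Hypothetical illustration only; not the library's implementation.
    """
    if len(input_columns) != len(output_columns):
        # Assumed behavior: the two lists must pair up one to one.
        raise ValueError("input_columns and output_columns must have the same length")
    mapping = dict(zip(input_columns, output_columns))
    return {mapping.get(name, name): value for name, value in row.items()}


row = {"input_col1": 1, "input_col2": 2, "other": 3}
renamed = rename_row(row, ["input_col1", "input_col2"], ["output_col1", "output_col2"])
```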

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount the training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.
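
Why the order of repeat and batch changes the number of batches can be checked with simple arithmetic. count_batches is a hypothetical helper; it assumes drop_remainder=False and a finite repeat count.

```python
import math


def count_batches(num_rows, batch_size, repeat_count, repeat_first):
    """Count batches (without drop_remainder) for the two operation orders."""
    if repeat_first:
        # repeat -> batch: the rows of all epochs are batched as one stream.
        return math.ceil(num_rows * repeat_count / batch_size)
    # batch -> repeat: each epoch is batched on its own, then repeated.
    return math.ceil(num_rows / batch_size) * repeat_count


batch_then_repeat = count_batches(10, 4, 2, repeat_first=False)
repeat_then_batch = count_batches(10, 4, 2, repeat_first=True)
```

With 10 rows, a batch size of 4 and 2 repeats, batching first gives 3 batches per epoch (6 in total), while repeating first batches 20 rows into 5 batches, one of which mixes rows from two epochs.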

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Resets the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
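
The four steps above can be sketched as a plain Python generator. buffer_shuffle is a hypothetical helper that follows the listed steps using the standard random module; it is not the library's implementation and does not model the per-epoch reseeding.

```python
import random


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in the order produced by a fixed-size shuffle buffer."""
    if buffer_size < 2:
        raise ValueError("buffer_size must be larger than 1")
    rng = random.Random(seed)
    source = iter(rows)

    # Step 1: fill the shuffle buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break

    # Steps 2 to 4: emit a random buffered row, then refill its slot
    # from the source until both the source and the buffer are empty.
    while buffer:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)


shuffled = list(buffer_shuffle(range(20), buffer_size=4, seed=58))
```

Because a row can only be emitted once it has entered the buffer, the row at output position i always comes from the first i + buffer_size input rows. A buffer as large as the dataset removes that limit, which is why it yields a global shuffle.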

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skips the first count elements of this dataset.

Parameters

count (int) – Number of elements of the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of steps (rows) that are released. When num_batch is None, the same number as specified in sync_wait is released.

  • data (dict or None) – The data passed to the callback.

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – the number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object; batch_size is an int chosen by the user.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Takes at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before the batch operation, the given number of rows is taken; otherwise, the given number of batches is taken.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.
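
The effect of ordering take and batch can be sketched with two small list helpers. batch and take below are hypothetical pure-Python stand-ins for the dataset operations, written only to show the difference between taking rows and taking batches.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size rows."""
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(items, count=-1):
    """Take at most count items; -1 (or a too-large count) takes all."""
    items = list(items)
    return items if count == -1 else items[:count]


rows = list(range(10))
take_first = batch(take(rows, 4), 2)   # 4 rows are taken, then batched
batch_first = take(batch(rows, 2), 4)  # rows are batched, then 4 batches are taken
```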

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes at most the first 50 elements.
>>> data = data.take(50)
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limit on the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.
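
Zipping merges the columns of the input datasets row by row, which is why column names must be unique across them. zip_rows is a hypothetical helper for illustration; stopping at the shortest input and raising ValueError on a name clash are assumptions of this sketch.

```python
def zip_rows(*datasets):
    """Zip datasets row by row; each dataset is a list of dict rows.

    Hypothetical illustration only; not the library's implementation.
    """
    zipped = []
    # The built-in zip stops at the shortest input (an assumption here).
    for rows in zip(*datasets):
        merged = {}
        for row in rows:
            duplicated = merged.keys() & row.keys()
            if duplicated:
                raise ValueError("duplicate column names: %s" % sorted(duplicated))
            merged.update(row)
        zipped.append(merged)
    return zipped


ds1 = [{"image": i} for i in range(3)]
ds2 = [{"label": i % 2} for i in range(3)]
data = zip_rows(ds1, ds2)
```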

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.TFRecordDataset(dataset_files, schema=None, columns_list=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, shard_equal_rows=False)[source]

A source dataset that reads and parses datasets stored on disk in TFData format.

Parameters
  • dataset_files (str or list[str]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • schema (str or Schema, optional) – Path to the json schema file or schema object (default=None). If the schema is not provided, the meta data from the TFData file is considered the schema.

  • columns_list (list[str], optional) – List of columns to be read (default=None, read all columns)

  • num_samples (int, optional) – Number of samples (rows) to read (default=None). If num_samples is None and numRows (parsed from schema) does not exist, read the full dataset; if num_samples is None and numRows (parsed from schema) is greater than 0, read numRows rows; if both num_samples and numRows (parsed from schema) are greater than 0, read num_samples rows.

  • num_parallel_workers (int, optional) – number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, Shuffle level, optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.

  • shard_equal_rows (bool) – Get equal rows for all shards (default=False). If shard_equal_rows is False, the number of rows of each shard may not be equal.
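
The interaction between num_samples and the numRows value parsed from the schema can be summarized as a small function. rows_to_read is a hypothetical helper that encodes the three cases listed for num_samples. The fourth combination (num_samples given, numRows absent) is not described above, and treating it as "read num_samples rows" is an assumption.

```python
def rows_to_read(num_samples=None, num_rows_in_schema=None):
    """Return how many rows are read, or None for the full dataset.

    Hypothetical illustration of the documented num_samples rules.
    """
    if num_samples is not None and num_samples > 0:
        # num_samples takes precedence over numRows from the schema.
        return num_samples
    if num_rows_in_schema is not None and num_rows_in_schema > 0:
        # No num_samples: fall back to numRows from the schema.
        return num_rows_in_schema
    # Neither is available: read the full dataset.
    return None
```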

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.common.dtype as mstype
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple tf data files
>>> # 1) get all rows from dataset_files with no explicit schema:
>>> # The meta-data in the first row will be used as a schema.
>>> tfdataset = ds.TFRecordDataset(dataset_files=dataset_files)
>>> # 2) get all rows from dataset_files with user-defined schema:
>>> schema = ds.Schema()
>>> schema.add_column('col_1d', de_type=mstype.int64, shape=[2])
>>> tfdataset = ds.TFRecordDataset(dataset_files=dataset_files, schema=schema)
>>> # 3) get all rows from dataset_files with schema file "./schema.json":
>>> tfdataset = ds.TFRecordDataset(dataset_files=dataset_files, schema="./schema.json")
apply(apply_func)

Applies a function to this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list of string, optional) – List of names of the input columns. The size of the list should match with signature of per_batch_map callable.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows are combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The order of the columns in the data obtained
>>> # by the iterator may differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarrays.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The order of the columns in the data obtained
>>> # by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through a device.

Parameters

prefetch_size (int, optional) – Number of records to prefetch ahead of the user’s request (default=None).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filters the dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is an instance of Dataset with a column "data" holding the values 0 ~ 63
>>> # keep the rows that are less than 11; rows greater than or equal to 11 are filtered out
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size(estimate=False)[source]

Get the number of batches in an epoch.

Parameters

estimate (bool, optional) – Fast estimation of the dataset size instead of a full scan.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations that
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, which takes 1 column for input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operation[i] must equal the number of
>>> # input columns of operation[i+1]. Otherwise, this map call will also result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameter must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
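The chaining rule in example 3) can be modelled in plain Python. The sketch below is not MindSpore code: chain_operations is a hypothetical helper, and scalar values stand in for the 2d arrays of the real columns. It only shows how the outputs of one operation become the inputs of the next, and how output_columns names the final results.

```python
def chain_operations(inputs, operations):
    """Feed the outputs of each operation into the next one, in list order."""
    values = tuple(inputs)
    for op in operations:
        out = op(*values)
        # Normalise a single return value to a 1-tuple so it can be unpacked.
        values = out if isinstance(out, tuple) else (out,)
    return values


operations = [
    lambda x, y: (x, x + y, x + y + 1),      # 2 columns in, 3 columns out
    lambda x, y, z: x * y * z,               # 3 columns in, 1 column out
    lambda x: (x % 2, x % 3, x % 5, x % 7),  # 1 column in, 4 columns out
]
row = {"col0": 2, "col1": 10, "col2": 3}     # illustrative scalar values
input_columns = ["col2", "col0"]
output_columns = ["mod2", "mod3", "mod5", "mod7"]

outputs = chain_operations((row[name] for name in input_columns), operations)
named = dict(zip(output_columns, outputs))
print(named)
```

If the number of values returned by one operation does not match the number of parameters of the next, the call fails with a TypeError in this sketch, which mirrors the note that a mismatch results in an error.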
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List of data type.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
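A minimal plain-Python model of projection, assuming rows are dictionaries keyed by column name (project_rows is a hypothetical helper, not part of MindSpore):

```python
def project_rows(rows, columns):
    """Select the given columns, in the given order, and discard the rest."""
    return [{name: row[name] for name in columns} for row in rows]


rows = [{"column1": 1, "column2": 2, "column3": 3, "column4": 4}]
projected = project_rows(rows, ["column3", "column1", "column2"])
print(list(projected[0]))  # ['column3', 'column1', 'column2']
```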
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of repeat and batch affects the number of batches. It is recommended to apply the repeat operation after the batch operation. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
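To see why the order of repeat and batch changes the number of batches, consider the following plain-Python sketch. The batch and repeat functions here are hypothetical stand-ins written with itertools, not the MindSpore operators.

```python
from itertools import chain, islice


def batch(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into lists of batch_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk or (drop_remainder and len(chunk) < batch_size):
            return
        yield chunk


def repeat(rows, count):
    """Replay the rows count times."""
    rows = list(rows)
    return chain.from_iterable(rows for _ in range(count))


rows = range(10)
batch_then_repeat = list(repeat(batch(rows, 4), 2))  # 3 batches per epoch, twice
repeat_then_batch = list(batch(repeat(rows, 2), 4))  # 20 rows cut into batches
print(len(batch_then_repeat), len(repeat_then_batch))  # 6 5
```

Batching first keeps epoch boundaries intact (the short batch of 2 rows appears once per epoch), while repeating first lets a batch span two epochs.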
reset()

Reset the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
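The four steps of the shuffle algorithm can be sketched in plain Python as follows. This is an illustrative model rather than the MindSpore implementation; buffer_shuffle and its seed parameter are hypothetical names.

```python
import random


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in an order produced by the four steps listed above."""
    if buffer_size < 2:
        raise ValueError("buffer_size must be larger than 1")
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break
    # Step 4: keep going until the buffer is empty.
    while buffer:
        # Step 2: pick a random buffered row as the next output.
        index = rng.randrange(len(buffer))
        chosen = buffer[index]
        # Step 3: refill the freed slot from the parent, if rows remain.
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)
        yield chosen


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

Because a row can only be emitted once it has entered the buffer, the i-th output always comes from the first i + buffer_size input rows. This is why a buffer as large as the dataset is needed for a global shuffle.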
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to skip in the dataset.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of steps (rows) that are released. When num_batch is None, the same number as specified by sync_wait will be released (default=None).

  • data (dict or None) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – the number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before the batch operation, the given number of rows is taken; otherwise, the given number of batches is taken.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes 50 elements.
>>> data = data.take(50)
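The effect of ordering take and batch can be checked with a small plain-Python model. The take and batch functions below are hypothetical stand-ins that operate on lists, not the MindSpore operators.

```python
def take(items, count=-1):
    """Return at most count items; -1 means all items."""
    items = list(items)
    return items if count == -1 else items[:count]


def batch(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size rows."""
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


rows = list(range(100))
take_then_batch = batch(take(rows, 6), 4)  # 6 rows, grouped afterwards
batch_then_take = take(batch(rows, 4), 6)  # 6 batches of 4 rows each

print(sum(len(b) for b in take_then_batch))  # 6
print(sum(len(b) for b in batch_then_take))  # 24
```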
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – limit the number of batch to be sent to device (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. The amount of data that can be transferred at one time is limited to 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
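The requirement that zipped datasets must not share column names can be illustrated with a plain-Python model. zip_rows is a hypothetical helper. Stopping at the shortest input and raising ValueError on a name clash are assumptions of this sketch, not documented behavior.

```python
def zip_rows(*datasets):
    """Merge rows of several datasets column-wise, rejecting duplicate names."""
    zipped = []
    for rows in zip(*datasets):  # assumption: stop at the shortest dataset
        merged = {}
        for row in rows:
            clash = merged.keys() & row.keys()
            if clash:
                raise ValueError(f"duplicate column names: {sorted(clash)}")
            merged.update(row)
        zipped.append(merged)
    return zipped


ds1 = [{"image": "img0"}, {"image": "img1"}]
ds2 = [{"label": 0}, {"label": 1}]
print(zip_rows(ds1, ds2))
```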
class mindspore.dataset.ManifestDataset(dataset_file, usage='train', num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, class_indexing=None, decode=False, num_shards=None, shard_id=None)[source]

A source dataset that reads images from a manifest file.

The generated dataset has two columns [‘image’, ‘label’]. The shape of the image column is [image_size] if decode flag is False, or [H,W,C] otherwise. The type of the image tensor is uint8. The label is just a scalar uint64 tensor. This dataset can take in a sampler. sampler and shuffle are mutually exclusive. Table below shows what input args are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed
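The table can also be read as a small decision function. The sketch below is only a restatement of the table in plain Python; resolve_order is a hypothetical name and is not part of the MindSpore API.

```python
def resolve_order(sampler=None, shuffle=None):
    """Return the expected order behavior for a sampler/shuffle combination."""
    if sampler is not None:
        if shuffle is not None:
            # Rows 5 and 6 of the table: the combination is not allowed.
            raise RuntimeError("sampler and shuffle are mutually exclusive")
        return "order defined by sampler"
    # Without a sampler, only an explicit shuffle=False gives sequential order.
    return "sequential order" if shuffle is False else "random order"


print(resolve_order())               # random order
print(resolve_order(shuffle=False))  # sequential order
```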

Parameters
  • dataset_file (str) – File to be read.

  • usage (str, optional) – Need train, eval or inference data (default=”train”).

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • class_indexing (dict, optional) – A str-to-int mapping from label name to index (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).
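The sharding-related conditions listed above can be summarised as a validation routine. This is a hypothetical sketch of the checks in the order they are listed, not the library's actual validation code.

```python
def check_shard_args(num_shards=None, shard_id=None, sampler=None):
    """Raise the documented error type for an invalid sharding setup."""
    sharded = num_shards is not None or shard_id is not None
    if sampler is not None and sharded:
        raise RuntimeError("sampler and sharding cannot be specified together")
    if num_shards is not None and shard_id is None:
        raise RuntimeError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        raise RuntimeError("shard_id is specified but num_shards is None")
    if sharded and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")


check_shard_args(num_shards=2, shard_id=0)  # valid: shard 0 of 2
check_shard_args()                          # valid: no sharding
```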

Examples

>>> import mindspore.dataset as ds
>>> dataset_file = "/path/to/manifest_file.manifest"
>>> # 1) read all samples specified in manifest_file dataset with 8 threads for training:
>>> manifest_dataset = ds.ManifestDataset(dataset_file, usage="train", num_parallel_workers=8)
>>> # 2) reads samples (specified in manifest_file.manifest) for shard 0 in a 2-way distributed training setup:
>>> manifest_dataset = ds.ManifestDataset(dataset_file, num_shards=2, shard_id=0)
apply(apply_func)

Apply a function in this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(dataset):
>>>     dataset = dataset.batch(2)
>>>     return dataset
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.
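The contract of apply can be modelled with a toy class. The Dataset class below is a hypothetical stand-in that stores rows in a list. It accepts any callable, which is slightly looser than the "is not a function" wording above.

```python
class Dataset:
    """Hypothetical minimal dataset holding rows in a list."""

    def __init__(self, rows):
        self.rows = list(rows)

    def batch(self, batch_size):
        rows = self.rows
        return Dataset(rows[i:i + batch_size] for i in range(0, len(rows), batch_size))

    def apply(self, apply_func):
        if not callable(apply_func):
            raise TypeError("apply_func must be a function")
        result = apply_func(self)
        if not isinstance(result, Dataset):
            raise TypeError("apply_func must return a Dataset")
        return result


def apply_func(dataset):
    return dataset.batch(2)


data = Dataset(range(5)).apply(apply_func)
print(data.rows)  # [[0, 1], [2, 3], [4]]
```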

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of repeat and batch affects the number of batches. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list of string, optional) – List of names of the input columns. The size of the list should match with signature of per_batch_map callable.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
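The effect of drop_remainder, and the idea that a batch becomes a single row for downstream nodes, can be sketched in plain Python. batch_rows is a hypothetical helper that stacks column values into lists. It does not model per_batch_map or BatchInfo.

```python
def batch_rows(rows, batch_size, drop_remainder=False):
    """Combine consecutive rows into one row per batch."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        if drop_remainder and len(chunk) < batch_size:
            break
        # One output row per batch: each column holds the stacked values.
        batches.append({name: [row[name] for row in chunk] for name in chunk[0]})
    return batches


rows = [{"image": [i, i], "label": i % 2} for i in range(5)]
kept = batch_rows(rows, 2)
dropped = batch_rows(rows, 2, drop_remainder=True)
print(len(kept), len(dropped))  # 3 2
```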
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarray of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
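How the columns argument selects and orders the values of each row can be modelled in plain Python. iterate_as_tuples is a hypothetical helper, and rows are modelled as dictionaries keyed by column name.

```python
def iterate_as_tuples(rows, columns=None):
    """Yield each row as a list of values, optionally selecting and reordering columns."""
    for row in rows:
        names = columns if columns is not None else list(row)
        yield [row[name] for name in names]


rows = [{"image": "img0", "label": 0}, {"image": "img1", "label": 1}]
default_order = list(iterate_as_tuples(rows))
label_first = list(iterate_as_tuples(rows, columns=["label", "image"]))
print(default_order)  # [['img0', 0], ['img1', 1]]
print(label_first)    # [[0, 'img0'], [1, 'img1']]
```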
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through the device.

Parameters

prefetch_size (int, optional) – prefetch number of records ahead of the user’s request (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. The amount of data that can be transferred at one time is limited to 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is a Dataset with one column "data" that holds the values 0 ~ 63
>>> # keep the rows less than 11, i.e. filter out the data greater than or equal to 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_class_indexing()[source]

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the number of repetitions in RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations that
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, which takes 1 column for input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operation[i] must equal the number of
>>> # input columns of operation[i+1]. Otherwise, this map call will also result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameter must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
num_classes()[source]

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List of data type.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur since the amount of data is not the amount training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
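The note on ordering repeat and batch can be checked with simple arithmetic. The following is a minimal sketch in plain Python, not MindSpore code; the helper name batches_after_pipeline is hypothetical, and it assumes the last incomplete batch is kept (drop_remainder=False).

```python
import math


def batches_after_pipeline(num_rows, batch_size, repeat_count, repeat_before_batch):
    """Count the batches produced by a repeat/batch pipeline (hypothetical helper)."""
    if repeat_before_batch:
        # repeat -> batch: rows of consecutive epochs are pooled before batching,
        # so a batch may mix rows from two epochs.
        return math.ceil(num_rows * repeat_count / batch_size)
    # batch -> repeat: each epoch is batched on its own, then replayed.
    return math.ceil(num_rows / batch_size) * repeat_count


# 10 rows, batch size 4, 2 repetitions
batch_then_repeat = batches_after_pipeline(10, 4, 2, repeat_before_batch=False)
repeat_then_batch = batches_after_pipeline(10, 4, 2, repeat_before_batch=True)
print(batch_then_repeat, repeat_then_batch)
```

With batch before repeat, every epoch has the same batch boundaries, which is one reason the note recommends that order.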
reset()

Reset the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
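The four steps of the shuffle algorithm can be sketched in plain Python. This is an illustrative model only, not the MindSpore implementation; the function name buffered_shuffle and its seed argument are hypothetical.

```python
import random


def buffered_shuffle(rows, buffer_size, seed=None):
    """Yield rows in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break
    # Step 4: keep going until the buffer is empty.
    while buffer:
        # Step 2: pick a random buffered row to emit next.
        index = rng.randrange(len(buffer))
        chosen = buffer[index]
        # Step 3: refill the freed slot from the source, if rows remain.
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)
        yield chosen


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

Because only buffer_size rows are candidates at any time, a row can move forward by at most buffer_size - 1 positions. This is why a buffer as large as the dataset is needed for a global shuffle.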
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of batches (rows) that are released. When num_batch is None, the same number as specified in sync_wait is released.

  • data (dict or None) – The data passed to the callback.

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – the number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data = data.sync_update("callback1")
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before batch, the given number of rows is taken; otherwise, the given number of batches is taken.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes at most 50 elements.
>>> data = data.take(50)
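The effect of ordering take and batch can be illustrated with plain Python lists. This is a sketch of the described semantics, not MindSpore code; the helpers take and chunk are hypothetical stand-ins for the take and batch operations.

```python
def take(items, count=-1):
    """Return at most count items; -1 or an oversized count returns everything."""
    items = list(items)
    if count == -1 or count > len(items):
        return items
    return items[:count]


def chunk(items, size):
    """Group items into consecutive lists of at most size elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


rows = list(range(10))
take_then_batch = chunk(take(rows, 3), 2)   # takes 3 rows, then batches them
batch_then_take = take(chunk(rows, 2), 3)   # batches first, then takes 3 batches
print(take_then_batch)
print(batch_then_take)
```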
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limit on the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. The limit on data transmitted per transfer is 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
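The requirement that zipped datasets must not share column names can be modeled with rows represented as dictionaries. This is a plain-Python sketch, not MindSpore code; zip_rows is a hypothetical helper, and stopping at the shorter input is an assumption of this sketch.

```python
def zip_rows(left_rows, right_rows):
    """Merge rows pairwise; a row is a dict of column name -> value."""
    zipped = []
    for left, right in zip(left_rows, right_rows):
        clash = set(left) & set(right)
        if clash:
            # Columns in the inputs must not have the same name.
            raise ValueError("duplicate column names: {}".format(sorted(clash)))
        zipped.append({**left, **right})
    return zipped


rows1 = [{"image": "img0"}, {"image": "img1"}]
rows2 = [{"label": 0}, {"label": 1}]
zipped_rows = zip_rows(rows1, rows2)
print(zipped_rows)
```

If the inputs do share a column name, the rename operation can be applied to one of them before zipping.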
class mindspore.dataset.Cifar10Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

A source dataset that reads cifar10 data.

The generated dataset has two columns [‘image’, ‘label’]. The type of the image tensor is uint8. The label is a scalar uint32 tensor. This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows which input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
-------------------   -------------------   ------------------------
None                  None                  random order
None                  True                  random order
None                  False                 sequential order
Sampler object        None                  order defined by sampler
Sampler object        True                  not allowed
Sampler object        False                 not allowed

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>> dataset_dir = "/path/to/cifar10_dataset_directory"
>>> # 1) get all samples from CIFAR10 dataset in sequence:
>>> dataset = ds.Cifar10Dataset(dataset_dir=dataset_dir, shuffle=False)
>>> # 2) randomly select 350 samples from CIFAR10 dataset:
>>> dataset = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=350, shuffle=True)
>>> # 3) get samples from CIFAR10 dataset for shard 0 in a 2 way distributed training:
>>> dataset = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_shards=2, shard_id=0)
>>> # in CIFAR10 dataset, each dictionary has keys "image" and "label"
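The sampler/shuffle table and the Raises list can be read as a small set of argument checks. The sketch below restates them in plain Python; it is not the library's validation code, and the function names and error messages are hypothetical.

```python
def expected_order(sampler=None, shuffle=None):
    """Return the order behavior described in the sampler/shuffle table."""
    if sampler is not None:
        if shuffle is not None:
            # Sampler object with shuffle=True or shuffle=False is not allowed.
            raise RuntimeError("sampler and shuffle are mutually exclusive")
        return "order defined by sampler"
    return "sequential order" if shuffle is False else "random order"


def check_sharding(sampler=None, num_shards=None, shard_id=None):
    """Apply the sharding rules from the Raises list."""
    if sampler is not None and (num_shards is not None or shard_id is not None):
        raise RuntimeError("sampler and sharding are mutually exclusive")
    if num_shards is not None and shard_id is None:
        raise RuntimeError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        raise RuntimeError("shard_id is specified but num_shards is None")
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")


print(expected_order())               # no sampler, shuffle left at None
print(expected_order(shuffle=False))  # no sampler, shuffle disabled
check_sharding(num_shards=2, shard_id=0)  # valid: shard 0 of 2
```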
apply(apply_func)

Apply a function in this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(dataset):
>>>     dataset = dataset.batch(2)
>>>     return dataset
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match the signature of the per_batch_map callable.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
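The effect of drop_remainder can be modeled on a plain list of rows. This is a sketch of the described behavior, not MindSpore code; batch_rows is a hypothetical helper.

```python
def batch_rows(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into batches of batch_size rows."""
    batches = []
    current = []
    for row in rows:
        current.append(row)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    # The leftover rows form a short last batch unless drop_remainder is set.
    if current and not drop_remainder:
        batches.append(current)
    return batches


kept = batch_rows(range(250), 100)
dropped = batch_rows(range(250), 100, drop_remainder=True)
print([len(b) for b in kept], [len(b) for b in dropped])
```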
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarray of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through the device.

Parameters

prefetch_size (int, optional) – prefetch number of records ahead of the user’s request (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. The limit on data transmitted per transfer is 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel.

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is an instance of Dataset with one column "data" holding the values 0 ~ 63
>>> # filter out the data that is greater than or equal to 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
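How the predicate receives its arguments can be sketched with rows represented as dictionaries. This is a plain-Python model, not MindSpore code; filter_rows is a hypothetical helper.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which predicate returns True."""
    kept = []
    for row in rows:
        # With no input_columns, every column is passed to the predicate.
        columns = input_columns or list(row)
        if predicate(*(row[name] for name in columns)):
            kept.append(row)
    return kept


rows = [{"data": value} for value in range(64)]
filtered = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
print([row["data"] for row in filtered])
```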
get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of a RepeatDataset; returns 1 for any other dataset.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – list of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations which
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, which takes 1 column for input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operation[i] must equal the number of
>>> # input columns of operation[i+1]. Otherwise, this map call will also result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
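The column flow in example 3 can be traced on a single row with a plain-Python model. This is a sketch, not MindSpore code: map_row is a hypothetical helper, columns hold scalars instead of 2d arrays for brevity, and keeping the original input columns available next to the new ones is an assumption taken from the columns_order lists in example 3.

```python
def map_row(row, input_columns, operations, output_columns, columns_order):
    """Model map() on one row, where a row is a dict of column name -> value."""
    values = tuple(row[name] for name in input_columns)
    for operation in operations:
        result = operation(*values)
        # A tuple means several output columns; anything else is one column.
        values = result if isinstance(result, tuple) else (result,)
    if len(values) != len(output_columns):
        raise ValueError("the last operation must output len(output_columns) columns")
    # Assumption of this sketch: original columns stay available.
    columns = dict(row)
    columns.update(zip(output_columns, values))
    # Only the columns listed in columns_order are propagated, in that order.
    return {name: columns[name] for name in columns_order}


operations = [
    lambda x, y: (x, x + y, x + y + 1),      # 2 columns in, 3 columns out
    lambda x, y, z: x * y * z,               # 3 columns in, 1 column out
    lambda x: (x % 2, x % 3, x % 5, x % 7),  # 1 column in, 4 columns out
]
row = {"col0": 2, "col1": 5, "col2": 3}
mapped = map_row(
    row,
    input_columns=["col2", "col0"],
    operations=operations,
    output_columns=["mod2", "mod3", "mod5", "mod7"],
    columns_order=["mod7", "mod3", "col1"],
)
print(mapped)
```

Here x=3 and y=2 enter the first lambda, the second one yields 3 * 5 * 6 = 90, and the last one splits 90 into its four remainders.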
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List of data type.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
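Projection on a single row can be modeled as selecting and reordering dictionary keys. This is a plain-Python sketch, not MindSpore code; project_row is a hypothetical helper.

```python
def project_row(row, columns):
    """Keep only the named columns, in the order given; discard the rest."""
    return {name: row[name] for name in columns}


row = {"column1": 1, "column2": 2, "column3": 3, "column4": 4}
projected = project_row(row, ["column3", "column1", "column2"])
print(projected)
```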
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur since the amount of data is not the amount training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of batches (rows) that are released. When num_batch is None, the same number as specified in sync_wait is released.

  • data (dict or None) – The data passed to the callback.

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – the number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data = data.sync_update("callback1")
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before batch, the given number of rows is taken; otherwise, the given number of batches is taken.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes at most 50 elements.
>>> data = data.take(50)
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limit on the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. The limit on data transmitted per transfer is 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.Cifar100Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

A source dataset that reads cifar100 data.

The generated dataset has three columns [‘image’, ‘coarse_label’, ‘fine_label’]. The type of the image tensor is uint8. The coarse and fine labels are each a scalar uint32 tensor. This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows which input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed
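
The table above can be read as a small decision function. The sketch below is only an illustration in plain Python: expected_order is a hypothetical helper, not part of mindspore.dataset, and the RuntimeError mirrors the exception listed under Raises for this class.

```python
def expected_order(sampler=None, shuffle=None):
    """Return the order behavior implied by the sampler/shuffle table.

    Hypothetical helper for illustration only.
    """
    if sampler is not None:
        if shuffle is not None:
            # Every sampler + shuffle combination is marked "not allowed".
            raise RuntimeError("sampler and shuffle cannot be specified at the same time")
        return "order defined by sampler"
    # Without a sampler, only an explicit shuffle=False gives sequential order.
    return "sequential order" if shuffle is False else "random order"


default_order = expected_order()
sequential = expected_order(shuffle=False)
sampled = expected_order(sampler=object())
```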

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).
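
The conditions listed under Raises could be checked along the following lines. This is a sketch in plain Python under stated assumptions: check_sharding is a hypothetical name, and the order of the checks is chosen for readability, not taken from the library.

```python
def check_sharding(sampler=None, num_shards=None, shard_id=None):
    """Validate sharding arguments (hypothetical helper, not MindSpore code)."""
    sharding_requested = num_shards is not None or shard_id is not None
    if sampler is not None and sharding_requested:
        raise RuntimeError("sampler and sharding cannot be specified at the same time")
    if num_shards is not None and shard_id is None:
        raise RuntimeError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        raise RuntimeError("shard_id is specified but num_shards is None")
    if shard_id is not None and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")


# Valid combinations pass silently.
check_sharding()
check_sharding(num_shards=4, shard_id=3)
```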

Examples

>>> import mindspore.dataset as ds
>>> dataset_dir = "/path/to/cifar100_dataset_directory"
>>> # 1) get all samples from CIFAR100 dataset in sequence:
>>> cifar100_dataset = ds.Cifar100Dataset(dataset_dir=dataset_dir, shuffle=False)
>>> # 2) randomly select 350 samples from CIFAR100 dataset:
>>> cifar100_dataset = ds.Cifar100Dataset(dataset_dir=dataset_dir, num_samples=350, shuffle=True)
>>> # in CIFAR100 dataset, each dictionary has 3 keys: "image", "fine_label" and "coarse_label"
apply(apply_func)

Applies a function to this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list of string, optional) – List of names of the input columns. The size of the list should match with signature of per_batch_map callable.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows are combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
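
To make the effect of drop_remainder and of the repeat/batch order concrete, here is a small plain-Python sketch that models rows as a list. It does not call MindSpore. batch_rows is a hypothetical helper, and list multiplication stands in for the repeat operation.

```python
def batch_rows(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into batches (hypothetical helper)."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        if drop_remainder and len(chunk) < batch_size:
            # The last, incomplete batch is dropped.
            break
        batches.append(chunk)
    return batches


rows = list(range(10))

kept = batch_rows(rows, 4)                          # last batch holds 2 rows
dropped = batch_rows(rows, 4, drop_remainder=True)  # incomplete batch removed

# Batch first, then repeat twice: the short batch appears in every epoch.
batch_then_repeat = batch_rows(rows, 4) * 2
# Repeat twice first, then batch: batches may span the epoch boundary.
repeat_then_batch = batch_rows(rows * 2, 4)
```

With 10 rows and a batch size of 4, batching before repeating yields 6 batches over two epochs, while repeating before batching yields 5, which is why the order affects the number of batches.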
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The order of the columns in the data obtained
>>> # by the iterator might differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarrays.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through a device.

Parameters

prefetch_size (int, optional) – Number of records to prefetch ahead of the user’s request (default=None).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transfer is limited to 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns. When default=None, the predicate will be applied on all columns in the dataset.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel.

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is generated data with one column, "data", holding the values 0 ~ 63
>>> # keeps the rows whose value is less than 11, filtering out those greater than or equal to 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
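
The predicate keeps a row when it returns True. The following plain-Python sketch reproduces the example on rows modeled as dictionaries. It is an illustration under that assumption and does not use the MindSpore pipeline; filter_rows is a hypothetical helper.

```python
def filter_rows(rows, predicate, input_columns):
    """Keep the rows for which predicate returns True (hypothetical helper)."""
    kept = []
    for row in rows:
        # The predicate receives the values of the selected columns, in order.
        arguments = [row[name] for name in input_columns]
        if predicate(*arguments):
            kept.append(row)
    return kept


rows = [{"data": value} for value in range(64)]
filtered = filter_rows(rows, lambda data: data < 11, ["data"])
```

Only the rows with values 0 through 10 remain, so rows with values greater than or equal to 11 are the ones removed.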
get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of a RepeatDataset, or 1 otherwise.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations which
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, each of which takes 1 column as input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operation[i] must equal the number of
>>> # input columns of operation[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameter must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
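
The way columns flow from one operation to the next in example 3) can be traced with a plain-Python sketch. It is a simplified model, not the MindSpore implementation: run_operations is a hypothetical helper, each column holds a single integer instead of a 2d array, and only the chaining and naming of outputs is modeled, not how the remaining columns are merged or ordered.

```python
def run_operations(inputs, operations):
    """Feed the outputs of each operation into the next one (hypothetical helper)."""
    values = tuple(inputs)
    for operation in operations:
        result = operation(*values)
        # An operation may return one column or a tuple of columns.
        values = result if isinstance(result, tuple) else (result,)
    return values


operations = [(lambda x, y: (x, x + y, x + y + 1)),
              (lambda x, y, z: x * y * z),
              (lambda x: (x % 2, x % 3, x % 5, x % 7))]

row = {"col0": 1, "col1": 2, "col2": 3}
input_columns = ["col2", "col0"]
output_columns = ["mod2", "mod3", "mod5", "mod7"]

# The first operation receives the input columns in the listed order: x=3, y=1.
outputs = run_operations([row[name] for name in input_columns], operations)
# The outputs of the last operation are named by output_columns.
named_outputs = dict(zip(output_columns, outputs))
```

Here the first operation produces (3, 4, 5), the second multiplies them to 60, and the third produces the four remainders that receive the names in output_columns.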
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List, list of data types.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
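
The effect of a projection on a single row can be pictured with a plain-Python sketch. The row is modeled as a dictionary and project_row is a hypothetical helper; neither is part of the MindSpore API.

```python
def project_row(row, columns):
    """Select the named columns in the requested order (hypothetical helper)."""
    return {name: row[name] for name in columns}


row = {"column1": 1, "column2": 2, "column3": 3, "column4": 4}
projected = project_row(row, ["column3", "column1", "column2"])
```

The result contains only the three requested columns, in the requested order, and column4 is discarded.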
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid here. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
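
The four steps above can be sketched in plain Python as follows. This is a minimal model of the described algorithm, not the MindSpore implementation: buffer_shuffle is a hypothetical name, and random.Random stands in for whatever random number generator the library uses.

```python
import random

_EXHAUSTED = object()  # marks the end of the input rows


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in buffer-shuffled order (hypothetical helper)."""
    rng = random.Random(seed)
    source = iter(rows)

    # Step 1: fill the shuffle buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break

    # Step 4: repeat steps 2 and 3 until the buffer is empty.
    while buffer:
        # Step 2: pick a random element as the next output row.
        selected = buffer.pop(rng.randrange(len(buffer)))
        # Step 3: refill the buffer with the next input row, if any.
        next_row = next(source, _EXHAUSTED)
        if next_row is not _EXHAUSTED:
            buffer.append(next_row)
        yield selected


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
```

Because a row can only be emitted once it has entered the buffer, a small buffer_size shuffles only locally. A buffer as large as the dataset lets any row appear at any position.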

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to be skipped in the dataset.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of steps (rows) that are released. When it is None, the same number as specified by sync_wait is released.

  • data (dict or None) – The data passed to the callback.

sync_wait(condition_name, num_batch=1, callback=None)

Adds a blocking condition to the input Dataset.

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – the number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data = data.sync_update("callback1")
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before the batch operation, it takes the given number of rows; otherwise, it takes the given number of batches.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes 50 elements.
>>> data = data.take(50)
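
How the position of take relative to batch changes the result can be shown with a plain-Python sketch in which rows are list items. It does not use MindSpore. take_rows and batch_rows are hypothetical helpers that only mimic the behavior described in the note.

```python
def take_rows(items, count=-1):
    """Return at most count items, or all of them when count is -1 (hypothetical helper)."""
    return list(items) if count == -1 else list(items[:count])


def batch_rows(rows, batch_size):
    """Group consecutive rows into batches (hypothetical helper)."""
    return [rows[start:start + batch_size] for start in range(0, len(rows), batch_size)]


rows = list(range(10))

# take before batch: 4 rows are taken, then grouped into 2 batches.
take_then_batch = batch_rows(take_rows(rows, 4), 2)
# batch before take: 4 batches are taken, covering 8 rows.
batch_then_take = take_rows(batch_rows(rows, 2), 4)
# count larger than the dataset, or -1, takes everything.
everything = take_rows(rows, 50)
```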
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limits the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transfer is limited to 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.CelebADataset(dataset_dir, num_parallel_workers=None, shuffle=None, dataset_type='all', sampler=None, decode=False, extensions=None, num_samples=None, num_shards=None, shard_id=None)[source]

A source dataset for reading and parsing the CelebA dataset. Only list_attr_celeba.txt is supported currently.

Note

The generated dataset has two columns [‘image’, ‘attr’]. The type of the image tensor is uint8. The attr tensor is of type uint32 and is one-hot encoded.

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=value set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None).

  • dataset_type (str, optional) – One of ‘all’, ‘train’, ‘valid’ or ‘test’ (default=’all’).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None).

  • decode (bool, optional) – decode the images after reading (default=False).

  • extensions (list[str], optional) – List of file extensions to be included in the dataset (default=None).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.

apply(apply_func)

Applies a function to this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list of string, optional) – List of names of the input columns. The size of the list should match with signature of per_batch_map callable.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows are combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The order of the columns in the data obtained
>>> # by the iterator might differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarrays.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through a device.

Parameters

prefetch_size (int, optional) – Number of records to prefetch ahead of the user’s request (default=None).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transfer is limited to 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns. When default=None, the predicate will be applied on all columns in the dataset.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel.

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is generated data with one column, "data", holding the values 0 ~ 63
>>> # keeps the rows whose value is less than 11, filtering out those greater than or equal to 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of a RepeatDataset, or 1 otherwise.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order in which they appear in this list. This parameter is mandatory if len(input_columns) != len(output_columns) (default=None, all columns will be propagated to the child node, and the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations which
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, each of which takes 1 column as input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the output
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameters must be
>>> # specified. Otherwise, this map call will result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List of data type.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.
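
As a rough plain-Python picture of this behavior (a sketch, not the library's implementation), rows can be modelled as dicts. The helper name project_rows and the sample column names are made up for illustration.

```python
# Plain-Python sketch of project(): keep only the named columns, in the
# requested order. project_rows is a hypothetical helper, not a MindSpore API.
def project_rows(rows, columns):
    return [{name: row[name] for name in columns} for row in rows]


rows = [{"column1": 1, "column2": 2, "column3": 3, "column4": 4}]
projected = project_rows(rows, ["column3", "column1", "column2"])
print(projected)
```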

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consist of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply repeat after batch. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.
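
The effect of the ordering on the number of batches can be checked with a small plain-Python sketch. The helpers batch and repeat below are hypothetical stand-ins that operate on lists, and the sketch assumes the last incomplete batch is kept. It does not model dataset_sink_mode.

```python
# Plain-Python sketch: batch-then-repeat versus repeat-then-batch.
# batch and repeat are hypothetical list helpers, not MindSpore APIs.
def batch(rows, batch_size):
    rows = list(rows)
    # Keep the last, possibly incomplete, batch.
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def repeat(items, count):
    return list(items) * count


rows = list(range(10))
batch_then_repeat = repeat(batch(rows, 4), 2)  # 3 batches per epoch, repeated twice
repeat_then_batch = batch(repeat(rows, 2), 4)  # 20 rows cut into batches of 4

print(len(batch_then_repeat), len(repeat_then_batch))
```

With repeat applied first, a batch can straddle two epochs (here the third batch holds rows 8, 9, 0, 1), which is one reason to apply batch before repeat.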

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided for use in the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
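
The four steps above can be sketched as a plain-Python generator. This is an illustration of the buffer algorithm only: buffered_shuffle is a hypothetical name, Python's random module stands in for the library's random number generator, and the per-epoch reseeding is not modelled.

```python
import itertools
import random


# Plain-Python sketch of the shuffle-buffer algorithm. buffered_shuffle is a
# hypothetical helper for illustration, not part of mindspore.dataset.
def buffered_shuffle(rows, buffer_size, seed=None):
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = list(itertools.islice(source, buffer_size))
    # Step 4: keep going until the buffer is empty.
    while buffer:
        # Step 2: pick a random element of the buffer as the next output row.
        index = rng.randrange(len(buffer))
        chosen = buffer[index]
        # Step 3: refill the freed slot from the source, if any rows remain.
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)
        yield chosen


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

In this sketch a row can move forward by at most buffer_size - 1 positions. When buffer_size is at least the number of rows, every row is in the buffer before the first draw, which corresponds to the global shuffle mentioned for buffer_size.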

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to skip.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of steps (rows) that are released. When num_batch is None, the same number as specified in sync_wait is released.

  • data (dict or None) – The data passed to the callback.

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – the number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before batch, the given number of rows is taken; otherwise, the given number of batches is taken.
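
Both points of the note can be illustrated with a short plain-Python sketch. The helpers take and batch are hypothetical list-based stand-ins, not the library's implementation.

```python
# Plain-Python sketch of take() semantics and its interaction with batch().
# take and batch are hypothetical list helpers, not MindSpore APIs.
def batch(rows, batch_size):
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(items, count=-1):
    items = list(items)
    # -1 (or a count larger than the dataset) returns every element.
    return items if count == -1 else items[:count]


rows = list(range(10))
take_then_batch = batch(take(rows, 4), 2)  # 4 rows, grouped into 2 batches
batch_then_take = take(batch(rows, 2), 4)  # 4 batches, holding 8 rows in total

print(take_then_batch)
print(batch_then_take)
```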

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes at most 50 elements.
>>> data = data.take(50)
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limit on the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. At most 256M of data can be transmitted at a time.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.
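
A plain-Python sketch of the idea, with rows modelled as dicts, is shown below. zip_datasets is a hypothetical helper. Stopping at the shortest input is simply what Python's built-in zip does and is an assumption here, not a statement about the library.

```python
# Plain-Python sketch of zipping datasets column-wise. zip_datasets is a
# hypothetical helper for illustration, not part of mindspore.dataset.
def zip_datasets(*datasets):
    seen = set()
    for rows in datasets:
        names = set(rows[0]) if rows else set()
        clash = seen & names
        if clash:
            # Column names must be unique across the zipped datasets.
            raise ValueError(f"duplicate column names: {sorted(clash)}")
        seen |= names
    zipped = []
    for group in zip(*datasets):  # assumption: stops at the shortest input
        merged = {}
        for row in group:
            merged.update(row)
        zipped.append(merged)
    return zipped


ds1 = [{"a": 1}, {"a": 2}]
ds2 = [{"b": 10}, {"b": 20}]
zipped = zip_datasets(ds1, ds2)
print(zipped)
```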

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.VOCDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None)[source]

A source dataset for reading and parsing VOC dataset.

The generated dataset has two columns [‘image’, ‘target’]. The shape of both columns is [image_size] if the decode flag is False, or [H, W, C] otherwise. The type of both tensors is uint8. This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows which input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>> dataset_dir = "/path/to/voc_dataset_directory"
>>> # 1) read all VOC dataset samples in dataset_dir with 8 threads in random order:
>>> voc_dataset = ds.VOCDataset(dataset_dir, num_parallel_workers=8)
>>> # 2) read then decode all VOC dataset samples in dataset_dir in sequence:
>>> voc_dataset = ds.VOCDataset(dataset_dir, decode=True, shuffle=False)
>>> # in VOC dataset, each dictionary has keys "image" and "target"
apply(apply_func)

Apply a function in this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply repeat after batch.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match the signature of the per_batch_map callable.
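
The basic batching rules (fixed-size groups, equal shapes within a column, and drop_remainder) can be sketched in plain Python. batch_rows is a hypothetical helper in which rows are dicts of lists and the "shape" of an element is approximated by its length. per_batch_map and BatchInfo are not modelled.

```python
# Plain-Python sketch of batch(): group consecutive rows column by column.
# batch_rows is a hypothetical helper, not part of mindspore.dataset.
def batch_rows(rows, batch_size, drop_remainder=False):
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        if len(chunk) < batch_size and drop_remainder:
            break  # drop the last incomplete batch
        batched = {}
        for name in chunk[0]:
            column = [row[name] for row in chunk]
            # All elements of a column must have the same shape (length here).
            if len({len(value) for value in column}) != 1:
                raise ValueError(f"column {name!r} has elements of different shapes")
            batched[name] = column
        batches.append(batched)
    return batches


rows = [{"x": [i, i]} for i in range(5)]
kept = batch_rows(rows, 2)
dropped = batch_rows(rows, 2, drop_remainder=True)
print(len(kept), len(dropped))
```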

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarray of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through a device.

Parameters

prefetch_size (int, optional) – Number of records to prefetch ahead of the user’s request (default=None).

Note

If the device is Ascend, features of data will be transferred one by one. At most 256M of data can be transmitted at a time.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel.
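
How the predicate and input_columns interact can be sketched in plain Python. filter_rows is a hypothetical helper in which rows are dicts. It passes the selected columns to the predicate as positional arguments and falls back to all columns when input_columns is None or empty.

```python
# Plain-Python sketch of filter(): keep the rows for which the predicate
# returns True. filter_rows is a hypothetical helper, not a MindSpore API.
def filter_rows(rows, predicate, input_columns=None):
    kept = []
    for row in rows:
        # No input_columns (None or empty) means the predicate sees all columns.
        names = input_columns if input_columns else list(row)
        if predicate(*(row[name] for name in names)):
            kept.append(row)
    return kept


rows = [{"data": i} for i in range(64)]
small = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
print(len(small))
```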

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is an instance of Dataset whose column "data" holds the values 0 to 63
>>> # keep only the rows whose value is less than 11 (values >= 11 are filtered out)
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order in which they appear in this list. This parameter is mandatory if len(input_columns) != len(output_columns) (default=None, all columns will be propagated to the child node, and the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations which
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, each of which takes 1 column as input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the output
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameters must be
>>> # specified. Otherwise, this map call will result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List of data type.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consist of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply repeat after batch. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Resets the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided for use in the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
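
The four steps listed above can be sketched in plain Python. This is only an illustration of the algorithm as described, not MindSpore's implementation; buffered_shuffle and its seed argument are hypothetical names introduced for this sketch.

```python
import random


def buffered_shuffle(rows, buffer_size, seed=None):
    """Yield `rows` in the order produced by the shuffle-buffer steps."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first `buffer_size` rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break
    # Step 4: keep going until the buffer is empty.
    while buffer:
        # Step 2: pick a random buffered row as the next output row.
        index = rng.randrange(len(buffer))
        picked = buffer[index]
        # Step 3: refill the freed slot from the source, if any rows remain.
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)
        yield picked


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

With a small buffer, a row can only move a limited distance forward from its original position, which is why a buffer_size equal to the number of rows is needed for a global shuffle.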
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of batches (rows) that are released. When num_batch is None, the same number as specified in sync_wait is released.

  • data (dict or None) – The data passed to the callback.

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – the number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     # release the next batch; the dataset object itself is not reassigned
>>>     data.sync_update("callback1")
take(count=-1)

Takes at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before batch, the given number of rows is taken; otherwise, the given number of batches is taken.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes at most 50 elements.
>>> data = data.take(50)
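
The effect of ordering take and batch, described in the note above, can be seen with a small plain-Python sketch. The batch and take helpers here are stand-ins written for illustration on Python lists; they are not the MindSpore operators.

```python
from itertools import islice


def batch(rows, batch_size):
    """Group consecutive rows into lists of `batch_size` (the last may be short)."""
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(rows, count=-1):
    """Return at most `count` elements; -1 means take everything."""
    rows = list(rows)
    return rows if count == -1 else list(islice(rows, count))


rows = list(range(10))
take_then_batch = batch(take(rows, 4), batch_size=2)  # 4 rows -> 2 batches
batch_then_take = take(batch(rows, batch_size=2), 4)  # 4 batches -> 8 rows
print(take_then_batch)
print(batch_then_take)
```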
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limit on the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, the features of the data are transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.TextFileDataset(dataset_files, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None)[source]

A source dataset that reads and parses datasets stored on disk in text format. The generated dataset has one column [‘text’].

Parameters
  • dataset_files (str or list[str]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • num_samples (int, optional) – number of samples(rows) to read (default=None, reads the full dataset).

  • num_parallel_workers (int, optional) – number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, Shuffle level, optional) –

    Whether to reshuffle the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling is performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset should be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument should be specified only when num_shards is also specified.

Examples

>>> import mindspore.dataset as ds
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple text files
>>> dataset = ds.TextFileDataset(dataset_files=dataset_files)
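
The difference between the shuffle levels can be pictured with the following plain-Python sketch. It is a conceptual model only: read_text_files, the string values "files" and "global", and the in-memory dict of files are assumptions made for this illustration, and how MindSpore mixes samples internally under Shuffle.GLOBAL is not specified here.

```python
import random


def read_text_files(files, shuffle="global", seed=None):
    """Return the lines of `files` (a dict of file name -> list of lines).

    shuffle=False     keeps files (sorted by name) and lines in order
    shuffle="files"   shuffles the file order only
    shuffle="global"  shuffles the file order and the samples
    shuffle=True      behaves like "global"
    """
    if shuffle is True:
        shuffle = "global"
    rng = random.Random(seed)
    names = sorted(files)  # the file list is sorted lexicographically
    if shuffle in ("files", "global"):
        rng.shuffle(names)
    lines = [line for name in names for line in files[name]]
    if shuffle == "global":
        rng.shuffle(lines)
    return lines


files = {"b.txt": ["b1", "b2"], "a.txt": ["a1", "a2"]}
print(read_text_files(files, shuffle=False))
print(read_text_files(files, shuffle="files", seed=1))
print(read_text_files(files, shuffle="global", seed=1))
```

With file-level shuffling, the lines of one file always stay together and in order; only the global level can interleave lines from different files.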
apply(apply_func)

Apply a function in this dataset.

The specified apply_func is a function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>> # use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None)

Combines batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply repeat after batch.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list of string, optional) – List of names of the input columns. The size of the list should match with signature of per_batch_map callable.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
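
Why the order of repeat and batch changes the number of batches can be checked with a short plain-Python sketch. The batch and repeat helpers below operate on Python lists and are written only for this illustration; they are not the MindSpore operators.

```python
def batch(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into batches, optionally dropping a short last batch."""
    rows = list(rows)
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


def repeat(rows, count):
    """Concatenate `count` copies of the rows."""
    return list(rows) * count


rows = list(range(10))
batch_then_repeat = repeat(batch(rows, 4), 2)  # 3 batches per epoch, 6 in total
repeat_then_batch = batch(repeat(rows, 2), 4)  # 20 rows -> 5 batches
print(len(batch_then_repeat), len(repeat_then_batch))
print(repeat_then_batch[2])  # this batch mixes rows from two epochs
```

Batching first keeps every epoch's batches identical in layout, which is one reason the note recommends applying repeat after batch.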
create_dict_iterator()

Create an Iterator over the dataset.

The data retrieved will be a dictionary. The order of the columns in the dictionary may not be the same as the original order.

Returns

Iterator, dictionary of column_name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None)

Create an Iterator over the dataset. The data retrieved will be a list of ndarray of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters

columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

Returns

Iterator, list of ndarray.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # creates an iterator. The columns in the data obtained by the
>>> # iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None)

Returns a TransferDataset that transfers data through a device.

Parameters

prefetch_size (int, optional) – prefetch number of records ahead of the user’s request (default=None).

Note

If the device is Ascend, the features of the data are transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns not provided or empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the Dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # generator data(0 ~ 63)
>>> # filter out the data that is greater than or equal to 11 (rows with data < 11 are kept)
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
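
The way a predicate selects rows can be modelled in plain Python as follows. This sketch assumes that the values of input_columns are passed to the predicate as positional arguments; filter_rows and the dict-based rows are hypothetical and exist only for this illustration.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which `predicate` returns True."""
    kept = []
    for row in rows:
        # With no input_columns, every column is passed to the predicate.
        names = input_columns or list(row)
        if predicate(*(row[name] for name in names)):
            kept.append(row)
    return kept


rows = [{"data": value} for value in range(64)]
small = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
print(len(small))
```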
get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_class_indexing()

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the number of repetitions in RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(input_columns=None, operations=None, output_columns=None, columns_order=None, num_parallel_workers=None, python_multiprocessing=False)

Applies each operation in operations to this dataset.

The order of operations is determined by the position of each operation in operations. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in columns_order will be propagated to the child node. These columns will be in the same order as specified in columns_order.

Parameters
  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • operations (list[TensorOp] or Python list[functions]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • columns_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. This parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the config will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2". Each column is
>>> # a 2d array of integers.
>>>
>>> # This config is a global setting, meaning that all future operations that
>>> # use this config value will use 2 worker threads, unless specified
>>> # otherwise in their constructor. set_num_parallel_workers can be called
>>> # again later if a different number of worker threads is needed.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Two operations, which takes 1 column for input and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Applies decode_op on column "image". This column will be replaced by the output
>>> # column of decode_op. Since columns_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(input_columns, operations)
>>>
>>> # Rename column "image" to "decoded_image"
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns)
>>>
>>> # Specify the order of the columns.
>>> columns_order = ["label", "image"]
>>> ds_decoded = data.map(input_columns, operations, None, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> columns_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> columns_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Simple example using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Creates a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since columns_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(input_columns, operations)
>>>
>>> # Creates a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(input_columns, operations, output_columns)
>>>
>>> # Multiple operations using pyfunc. Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: the number of output columns of operation[i] must equal the number of
>>> # input columns of operation[i+1]. Otherwise, this map call will also result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: because the number of input columns is not the same as the number of
>>> # output columns, the output_columns and columns_order parameter must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> columns_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> columns_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(input_columns, operations, output_columns, columns_order)
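
How the outputs of one operation feed the next can be traced with a plain-Python sketch of example 3). It uses scalar integers instead of 2d arrays and a dict for one row; run_operations is a hypothetical helper written for this illustration and does not reflect how MindSpore executes map internally.

```python
def run_operations(row, input_columns, operations, output_columns):
    """Apply `operations` in order to the `input_columns` of one row (a dict)."""
    values = tuple(row[name] for name in input_columns)
    for operation in operations:
        result = operation(*values)
        # An operation may return one value or a tuple of values (columns).
        values = result if isinstance(result, tuple) else (result,)
    if len(values) != len(output_columns):
        raise ValueError("output_columns must match the last operation's outputs")
    return dict(zip(output_columns, values))


operations = [(lambda x, y: (x, x + y, x + y + 1)),
              (lambda x, y, z: x * y * z),
              (lambda x: (x % 2, x % 3, x % 5, x % 7))]
row = {"col0": 3, "col1": 0, "col2": 2}
outputs = run_operations(row, ["col2", "col0"], operations,
                         ["mod2", "mod3", "mod5", "mod7"])
print(outputs)
```

Here x=2 and y=3 give (2, 5, 6), whose product is 60, so the four output columns hold 60 modulo 2, 3, 5 and 7.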
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shape of each column.

output_types()

Get the types of output data.

Returns

List of data type.

project(columns)

Projects certain columns in input datasets.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – list of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # creates a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Renames the columns in input datasets.

Parameters
  • input_columns (list[str]) – list of names of the input columns.

  • output_columns (list[str]) – list of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # creates a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeats this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended to apply repeat after batch. If dataset_sink_mode is False, the repeat operation here is invalid. If dataset_sink_mode is True, the repeat count should be equal to the number of training epochs. Otherwise, errors could occur since the amount of data is not the amount that training requires.

Parameters

count (int) – Number of times the dataset should be repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # creates a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # creates a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. the shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
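
The difference between the two orderings in the example above can be made concrete with a plain-Python sketch. For simplicity it shuffles each list completely with random.shuffle rather than using a shuffle buffer, so it illustrates the ordering effect only; shuffle_rows is a hypothetical helper.

```python
import random


def shuffle_rows(rows, rng):
    """Return a fully shuffled copy of `rows`."""
    rows = list(rows)
    rng.shuffle(rows)
    return rows


rng = random.Random(58)
rows = list(range(5))
epochs = 3

# shuffle, then repeat: each epoch is shuffled on its own
shuffled_and_repeated = []
for _ in range(epochs):
    shuffled_and_repeated.extend(shuffle_rows(rows, rng))

# repeat, then shuffle: all epochs are mixed as one big dataset
repeat_and_shuffle = shuffle_rows(rows * epochs, rng)

print(shuffled_and_repeated)
print(repeat_and_shuffle)
```

In the first result every block of 5 rows contains each row exactly once; in the second, a row may appear several times before another row appears at all.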
reset()

Resets the dataset for the next epoch.

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided for use in the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object
>>> # optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # creates a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset which skips first 3 elements from data
>>> data = data.skip(3)
sync_update(condition_name, num_batch=None, data=None)

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int or None) – The number of batches (rows) that are released. When num_batch is None, the same number as specified in sync_wait is released.

  • data (dict or None) – The data passed to the callback.

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset

Parameters
  • input_dataset (Dataset) – Input dataset to apply flow control

  • num_batch (int) – the number of batches without blocking at the start of each epoch

  • condition_name (str) – The condition name that is used to toggle sending next row

  • callback (function) – The callback function that will be invoked when sync_update is called

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     # release the next batch; the dataset object itself is not reassigned
>>>     data.sync_update("callback1")
take(count=-1)

Takes at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in the dataset will be taken.

  2. The order of take and batch matters. If take is applied before batch, the given number of rows is taken; otherwise, the given number of batches is taken.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>> # data is an instance of Dataset object.
>>> # creates a dataset that includes at most 50 elements.
>>> data = data.take(50)
to_device(num_batch=None)

Transfers data through CPU, GPU or Ascend devices.

Parameters

num_batch (int, optional) – Limit on the number of batches to be sent to the device (default=None).

Note

If the device is Ascend, the features of the data are transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • ValueError – If num_batch is None or 0 or larger than int_max.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zips the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (tuple or class Dataset) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>> # ds1 and ds2 are instances of Dataset object
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
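
The requirement that zipped datasets must not share column names can be illustrated with a plain-Python sketch in which each row is a dict. zip_rows is a hypothetical helper, and both the ValueError and stopping at the shortest dataset are assumptions of this sketch rather than documented MindSpore behavior.

```python
def zip_rows(*datasets):
    """Merge the rows of several datasets column-wise; each row is a dict."""
    zipped = []
    for rows in zip(*datasets):  # stops at the shortest dataset (an assumption)
        merged = {}
        for row in rows:
            clash = merged.keys() & row.keys()
            if clash:
                raise ValueError(f"duplicate column names: {sorted(clash)}")
            merged.update(row)
        zipped.append(merged)
    return zipped


images = [{"image": "a"}, {"image": "b"}]
labels = [{"label": 0}, {"label": 1}]
print(zip_rows(images, labels))
```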
class mindspore.dataset.Schema(schema_file=None)[source]

Class to represent a schema of dataset.

Parameters

schema_file (str) – Path of schema file (default=None).

Returns

Schema object, schema info about dataset.

Raises

RuntimeError – If schema file failed to load.

Example

>>> import mindspore.dataset as ds
>>> import mindspore.common.dtype as mstype
>>> # create schema, specify column name, mindspore.dtype and shape of the column
>>> schema = ds.Schema()
>>> schema.add_column('col1', de_type=mstype.int64, shape=[2])
add_column(name, de_type, shape=None)[source]

Add new column to the schema.

Parameters
  • name (str) – name of the column.

  • de_type (str) – data type of the column.

  • shape (list[int], optional) – shape of the column (default=None, [-1] which is an unknown shape of rank 1).

Raises

ValueError – If column type is unknown.

from_json(json_obj)[source]

Load the schema from a parsed JSON object.

Parameters

json_obj (dictionary) – Parsed JSON object.

Raises
parse_columns(columns)[source]

Parse the columns and add them to the schema.

Parameters

columns (dict or list[dict]) –

dataset attribute information, decoded from schema file.

  • list[dict]: each dict must contain the keys ‘name’ and ‘type’; ‘shape’ is optional.

  • dict: the keys of columns are the column names, and each value is a dict that must contain ‘type’; ‘shape’ is optional.

Raises

Example

>>> schema = Schema()
>>> columns1 = [{'name': 'image', 'type': 'int8', 'shape': [3, 3]},
>>>             {'name': 'label', 'type': 'int8', 'shape': [1]}]
>>> schema.parse_columns(columns1)
>>> columns2 = {'image': {'shape': [3, 3], 'type': 'int8'}, 'label': {'shape': [1], 'type': 'int8'}}
>>> schema.parse_columns(columns2)
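
The two accepted layouts describe the same columns. The following plain-Python sketch normalizes both into one list to make that equivalence explicit; normalize_columns is a hypothetical helper, and raising ValueError for a missing key is a choice made for this sketch, not the library's documented behavior.

```python
def normalize_columns(columns):
    """Return a list of {'name', 'type', 'shape'} dicts for either layout."""
    if isinstance(columns, dict):
        # dict layout: the key is the column name, the value holds type/shape
        items = [dict(spec, name=name) for name, spec in columns.items()]
    else:
        # list layout: every entry already carries its own 'name'
        items = [dict(spec) for spec in columns]
    normalized = []
    for spec in items:
        if "name" not in spec or "type" not in spec:
            raise ValueError("each column needs a 'name' and a 'type'")
        normalized.append({"name": spec["name"], "type": spec["type"],
                           "shape": spec.get("shape")})  # shape is optional
    return normalized


columns1 = [{'name': 'image', 'type': 'int8', 'shape': [3, 3]},
            {'name': 'label', 'type': 'int8', 'shape': [1]}]
columns2 = {'image': {'shape': [3, 3], 'type': 'int8'},
            'label': {'shape': [1], 'type': 'int8'}}
print(normalize_columns(columns1) == normalize_columns(columns2))
```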
to_json()[source]

Get a JSON string of the schema.

Returns

Str, JSON string of the schema.

class mindspore.dataset.DistributedSampler(num_shards, shard_id, shuffle=True)[source]

Sampler that accesses a shard of the dataset.

Parameters
  • num_shards (int) – Number of shards to divide the dataset into.

  • shard_id (int) – Shard ID of the current shard within num_shards.

  • shuffle (bool, optional) – If true, the indices are shuffled (default=True).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a distributed sampler with 10 shards total. This shard is shard 5
>>> sampler = ds.DistributedSampler(10, 5)
>>> data = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8, sampler=sampler)
Raises
  • ValueError – If num_shards is not positive.

  • ValueError – If shard_id is less than 0, or greater than or equal to num_shards.

  • ValueError – If shuffle is not a boolean value.
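
One way to picture what a shard is: each shard receives a disjoint subset of the row indices, and together the shards cover the dataset. The sketch below assumes a round-robin split and a seed shared by all shards; shard_indices, its seed argument, and the split strategy are assumptions for illustration and may differ from MindSpore's implementation.

```python
import random


def shard_indices(num_rows, num_shards, shard_id, shuffle=True, seed=0):
    """Return the row indices that belong to shard `shard_id`."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    indices = list(range(num_rows))
    if shuffle:
        # Every shard must use the same seed so the shards stay disjoint.
        random.Random(seed).shuffle(indices)
    # Round-robin split (an assumption): shard k takes positions
    # k, k + num_shards, k + 2 * num_shards, ...
    return indices[shard_id::num_shards]


shards = [shard_indices(20, 4, shard_id) for shard_id in range(4)]
print(shards)
```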

class mindspore.dataset.PKSampler(num_val, num_class=None, shuffle=False)[source]

Samples K elements for each P class in the dataset.

Parameters
  • num_val (int) – Number of elements to sample for each class.

  • num_class (int, optional) – Number of classes to sample (default=None, all classes).

  • shuffle (bool, optional) – If true, the class IDs are shuffled (default=False).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a PKSampler that will get 3 samples from every class.
>>> sampler = ds.PKSampler(3)
>>> data = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8, sampler=sampler)
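
The idea of sampling K elements for each of P classes can be sketched in plain Python. The sketch assumes that the first num_val rows of each class are taken and that a class with fewer rows contributes all of them; pk_sample, its seed argument, and these selection rules are assumptions for illustration, not MindSpore's implementation.

```python
import random
from collections import defaultdict


def pk_sample(labels, num_val, num_class=None, shuffle=False, seed=0):
    """Return row indices: up to `num_val` per class, for `num_class` classes."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    class_ids = sorted(by_class)
    if shuffle:
        random.Random(seed).shuffle(class_ids)  # shuffle the class IDs only
    if num_class is not None:
        class_ids = class_ids[:num_class]
    picked = []
    for class_id in class_ids:
        picked.extend(by_class[class_id][:num_val])  # first K rows of the class
    return picked


labels = [0, 0, 0, 0, 1, 1, 2, 2, 2]
print(pk_sample(labels, num_val=3))
```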
Raises
class mindspore.dataset.RandomSampler(replacement=False, num_samples=None)[source]

Samples the elements randomly.

Parameters
  • replacement (bool, optional) – If True, put the sample ID back for the next draw (default=False).

  • num_samples (int, optional) – Number of elements to sample (default=None, all elements).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a RandomSampler
>>> sampler = ds.RandomSampler()
>>> data = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8, sampler=sampler)
Raises
class mindspore.dataset.SequentialSampler[source]

Samples the dataset elements sequentially, which is equivalent to using no sampler with shuffle=False.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a SequentialSampler
>>> sampler = ds.SequentialSampler()
>>> data = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8, sampler=sampler)
class mindspore.dataset.SubsetRandomSampler(indices)[source]

Samples the elements randomly from a sequence of indices.

Parameters

indices (list[int]) – A sequence of indices.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> indices = [0, 1, 2, 3, 7, 88, 119]
>>>
>>> # creates a SubsetRandomSampler, will sample from the provided indices
>>> sampler = ds.SubsetRandomSampler(indices)
>>> data = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8, sampler=sampler)
class mindspore.dataset.WeightedRandomSampler(weights, num_samples, replacement=True)[source]

Samples the elements from [0, len(weights) - 1] randomly with the given weights (probabilities).

Parameters
  • weights (list[float]) – A sequence of weights, not necessarily summing up to 1.

  • num_samples (int) – Number of elements to sample.

  • replacement (bool, optional) – If True, put the sample ID back for the next draw (default=True).
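
Weighted sampling with and without replacement can be sketched with the standard random module. weighted_sample_ids is a hypothetical helper and its seed argument is an assumption of the sketch. The draw-and-remove scheme used for replacement=False is one common approach and may differ from what the sampler does internally.

```python
import random


def weighted_sample_ids(weights, num_samples, replacement=True, seed=0):
    """Hypothetical helper: draw IDs from [0, len(weights) - 1] by weight."""
    rng = random.Random(seed)
    ids = list(range(len(weights)))
    if replacement:
        # random.choices accepts relative weights; they need not sum to 1.
        return rng.choices(ids, weights=weights, k=num_samples)

    if num_samples > len(ids):
        raise ValueError("cannot draw more unique IDs than there are weights")
    remaining_ids = ids
    remaining_weights = list(weights)
    picked = []
    for _ in range(num_samples):
        # Draw one position by weight, then remove it from the pool.
        pos = rng.choices(range(len(remaining_ids)), weights=remaining_weights, k=1)[0]
        picked.append(remaining_ids.pop(pos))
        remaining_weights.pop(pos)
    return picked


weights = [0.9, 0.01, 0.4, 0.8, 0.1, 0.1, 0.3]
drawn = weighted_sample_ids(weights, 4)
unique = weighted_sample_ids(weights, 4, replacement=False)
```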

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> weights = [0.9, 0.01, 0.4, 0.8, 0.1, 0.1, 0.3]
>>>
>>> # creates a WeightedRandomSampler that will sample 4 elements with replacement (the default)
>>> sampler = ds.WeightedRandomSampler(weights, 4)
>>> data = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8, sampler=sampler)
mindspore.dataset.zip(datasets)[source]

Zips the datasets in the input tuple of datasets.

Parameters

datasets (tuple of class Dataset) – A tuple of datasets to be zipped together. The number of datasets should be more than 1.

Returns

DatasetOp, ZipDataset.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir1 = "path/to/imagefolder_directory1"
>>> dataset_dir2 = "path/to/imagefolder_directory2"
>>> ds1 = ds.ImageFolderDatasetV2(dataset_dir1, num_parallel_workers=8)
>>> ds2 = ds.ImageFolderDatasetV2(dataset_dir2, num_parallel_workers=8)
>>>
>>> # creates a dataset which is the combination of ds1 and ds2
>>> data = ds.zip((ds1, ds2))
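
What zipping means for rows can be sketched with plain dictionaries. zip_rows is a hypothetical stand-in, not the library function. The sketch assumes that rows are paired by position, that the result stops at the shortest input, and that column names must not collide; none of these points is stated in this section.

```python
def zip_rows(*datasets):
    """Hypothetical stand-in: merge rows of several datasets by position."""
    merged = []
    # Built-in zip stops at the shortest input (an assumption of this sketch).
    for rows in zip(*datasets):
        combined = {}
        for row in rows:
            # Assumption: a column name may appear in only one input.
            overlap = combined.keys() & row.keys()
            if overlap:
                raise ValueError(f"duplicate column names: {sorted(overlap)}")
            combined.update(row)
        merged.append(combined)
    return merged


images = [{"image": "a.jpg"}, {"image": "b.jpg"}, {"image": "c.jpg"}]
captions = [{"text": "cat"}, {"text": "dog"}]
rows = zip_rows(images, captions)
print(rows)  # [{'image': 'a.jpg', 'text': 'cat'}, {'image': 'b.jpg', 'text': 'dog'}]
```
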
mindspore.dataset.config = <mindspore.dataset.core.configuration.ConfigurationManager object>

The configuration manager.

class mindspore.dataset.core.configuration.ConfigurationManager[source]

The configuration manager.

get_num_parallel_workers()[source]

Get the default number of parallel workers.

Returns

Int, number of parallel workers to be used as a default for each operation.

get_prefetch_size()[source]

Get the prefetch size in number of rows.

Returns

Size, total number of rows to be prefetched.

get_seed()[source]

Get the seed.

Returns

Int, seed.

load(file)[source]

Load configuration from a file.

Parameters

file – Path to the config file to be loaded.

Raises

RuntimeError – If file is invalid and parsing fails.

Examples

>>> import mindspore.dataset as ds
>>> con = ds.engine.ConfigurationManager()
>>> # sets the default value according to values in configuration file.
>>> con.load("path/to/config/file")
>>> # example config file:
>>> # {
>>> #     "logFilePath": "/tmp",
>>> #     "rowsPerBuffer": 32,
>>> #     "numParallelWorkers": 4,
>>> #     "workerConnectorSize": 16,
>>> #     "opConnectorSize": 16,
>>> #     "seed": 5489
>>> # }
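
A file with the keys from the example above can be written with the standard json module. This sketch assumes the configuration file is plain JSON, as the example suggests. The file name dataset_config.json is hypothetical, and the sketch does not call load() itself.

```python
import json
import os
import tempfile

# Keys and values copied from the example config file above.
settings = {
    "logFilePath": "/tmp",
    "rowsPerBuffer": 32,
    "numParallelWorkers": 4,
    "workerConnectorSize": 16,
    "opConnectorSize": 16,
    "seed": 5489,
}

# Write the settings to a temporary location (hypothetical file name).
config_path = os.path.join(tempfile.mkdtemp(), "dataset_config.json")
with open(config_path, "w") as handle:
    json.dump(settings, handle, indent=4)

# Read the file back to confirm it parses before passing the path to load().
with open(config_path) as handle:
    round_trip = json.load(handle)
```

The resulting path is what would be passed to load(), as in the example above.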
set_num_parallel_workers(num)[source]

Set the default number of parallel workers.

Parameters

num – Number of parallel workers to be used as a default for each operation.

Raises

ValueError – If num is invalid (<= 0 or > MAX_INT_32).

Examples

>>> import mindspore.dataset as ds
>>> con = ds.engine.ConfigurationManager()
>>> # sets the new parallel_workers value, now parallel dataset operators will run with 8 workers.
>>> con.set_num_parallel_workers(8)
set_prefetch_size(size)[source]

Set the number of rows to be prefetched.

Parameters

size – Total number of rows to be prefetched.

Raises

ValueError – If size is invalid (<= 0 or > MAX_INT_32).

Examples

>>> import mindspore.dataset as ds
>>> con = ds.engine.ConfigurationManager()
>>> # sets the new prefetch value.
>>> con.set_prefetch_size(1000)
set_seed(seed)[source]

Set the seed to be used in any random generator. This is used to produce deterministic results.

Parameters

seed (int) – Seed to be set.

Raises

ValueError – If seed is invalid (< 0 or > MAX_UINT_32).

Examples

>>> import mindspore.dataset as ds
>>> con = ds.engine.ConfigurationManager()
>>> # sets the new seed value, now operators with a random seed will use new seed value.
>>> con.set_seed(1000)
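
The range checks documented for the three setters can be sketched as follows. The helper names are hypothetical, and the numeric values of MAX_INT_32 and MAX_UINT_32 are assumed to be the usual 32-bit limits; the library's own validation may be implemented differently.

```python
MAX_INT_32 = 2**31 - 1   # assumed: largest signed 32-bit integer
MAX_UINT_32 = 2**32 - 1  # assumed: largest unsigned 32-bit integer


def check_positive_int32(value, name):
    """Hypothetical check for set_num_parallel_workers and set_prefetch_size."""
    if not 0 < value <= MAX_INT_32:
        raise ValueError(f"{name} must be in (0, {MAX_INT_32}], got {value}")
    return value


def check_seed(seed):
    """Hypothetical check for set_seed: zero is allowed, negatives are not."""
    if not 0 <= seed <= MAX_UINT_32:
        raise ValueError(f"seed must be in [0, {MAX_UINT_32}], got {seed}")
    return seed


workers = check_positive_int32(8, "num")
prefetch = check_positive_int32(1000, "size")
seed = check_seed(1000)
```

Note the asymmetry: a seed of 0 passes the documented check, while 0 workers or a prefetch size of 0 does not.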