mindspore.dataset.GeneratorDataset

class mindspore.dataset.GeneratorDataset(source, column_names=None, column_types=None, schema=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None, python_multiprocessing=True, max_rowsize=None, batch_sampler=None, collate_fn=None)

A source dataset that generates data from Python by invoking the Python data source each epoch.

The column names and column types of the generated dataset depend on the Python data defined by the user.

Parameters:

source (Union[Random Accessible, Iterable]) –
A custom dataset from which to load the data. MindSpore supports the following types of datasets:
- Random-accessible (map-style) dataset: A dataset object that implements the __getitem__() and __len__() methods and represents a mapping from indexes/keys to data samples. For example, such a source, when accessed with source[idx], can read the idx-th sample from disk. See Random-accessible dataset example for details.
- Iterable-style dataset: An iterable dataset object that implements the __iter__() and __next__() methods and represents an iterable over data samples. This type of dataset suits situations where random reads are costly or even impossible, and where the batch size depends on the data being acquired. For example, such a source, when accessed with iter(source), can return a stream of data read from a database or remote server. See Iterable-style dataset example for details.
column_names (Union[str, list[str]], optional) – List of column names of the dataset. Default: None. Users are required to provide either column_names or schema.
column_types (list[mindspore.dtype], optional) – List of column data types of the dataset. Default: None. If provided, a sanity check is performed on the generator output (to be deprecated in a future version).
schema (Union[str, Schema], optional) – Data format policy, which specifies the data types and shapes of the data columns to be read. Both a JSON file path and an object constructed by mindspore.dataset.Schema are acceptable (to be deprecated in a future version). Default: None.
num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all samples.
num_parallel_workers (int, optional) – Number of worker threads/subprocesses used to fetch the dataset in parallel. Default: 1.
shuffle (bool, optional) – Whether or not to shuffle the dataset. Random accessible input is required. Default: None, expected order behavior shown in the table below.
sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Random accessible input is required. Default: None, expected order behavior shown in the table below.
num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. Random accessible input is required. When this argument is specified, num_samples reflects the maximum number of samples per shard. Used in data parallel training.
shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument must be specified only when num_shards is also specified. Random accessible input is required.
python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option can be beneficial if the Python operation is computationally heavy. Default: True.
max_rowsize (int, optional) – Maximum size of data (in MB) used for shared memory allocation when copying data between processes. The total shared memory occupied grows as num_parallel_workers and mindspore.dataset.config.set_prefetch_size() increase. If set to -1, shared memory is allocated dynamically according to the actual size of the data. Only used when python_multiprocessing is set to True. Default: None, allocate shared memory dynamically (to be deprecated in a future version).
batch_sampler (Iterable, optional) – Similar to sampler, but returns a batch of indices at a time; the corresponding data will be combined into a batch. Mutually exclusive with num_samples, shuffle, num_shards, shard_id and sampler. Default: None, do not use batch sampler.
collate_fn (Callable[List[numpy.ndarray]], optional) – Defines how to merge a list of data into a batch. Only valid if batch_sampler is used. Default: None, do not use collation function.
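
The way batch_sampler and collate_fn work together can be pictured with a small pure-Python sketch. It does not call MindSpore: ToySource, iterate_batches and stack_columns are hypothetical names, and returning the raw list of rows when collate_fn is None is an assumption made only for illustration.

```python
class ToySource:
    """Hypothetical random-accessible source: implements __getitem__ and __len__."""

    def __init__(self, size):
        self._rows = [(i, i * i) for i in range(size)]

    def __getitem__(self, index):
        return self._rows[index]

    def __len__(self):
        return len(self._rows)


def iterate_batches(source, batch_sampler, collate_fn=None):
    """Yield one batch for each list of indices produced by batch_sampler."""
    for indices in batch_sampler:
        rows = [source[i] for i in indices]
        # Assumption: without collate_fn, the rows are returned unmerged.
        yield collate_fn(rows) if collate_fn is not None else rows


def stack_columns(rows):
    """Hypothetical collate function: regroup a list of rows into per-column lists."""
    return tuple(list(column) for column in zip(*rows))


# The batch sampler here is simply an iterable of index lists.
batches = list(iterate_batches(ToySource(5), [[0, 1], [2, 3], [4]], stack_columns))
print(batches[0])  # ([0, 1], [0, 1])
```

Each element yielded by the batch sampler decides the membership of one batch, which is why the size-related parameters such as num_samples and shuffle are mutually exclusive with it.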

Warning

GeneratorDataset implicitly uses the dill module in multiprocessing spawn mode to serialize/deserialize source, which is known to be insecure. It is possible to construct malicious pickle data that executes arbitrary code during unpickling. Never load data that could have come from untrusted sources or that may have been tampered with.

Note

The parameters column_types, schema and max_rowsize will be deprecated in a future version.
If python_multiprocessing=True (default: True) and num_parallel_workers>1 (default: 1), multiprocessing mode is started to accelerate data loading. In this mode, the memory consumption of the subprocesses gradually increases as the dataset is iterated, mainly because the subprocesses of the user-defined dataset obtain member variables from the main process in a copy-on-write manner. For example, if the __init__ function of a dataset holds a large amount of member variable data (such as a very large list of file names loaded during dataset construction) and multiprocessing mode is used, OOM may occur (the estimated total memory usage is: (num_parallel_workers+1) * size of the parent process). The simplest solutions are to replace Python objects (such as list/dict/int/float/string) with non-referenced data types (such as Pandas, Numpy or PyArrow objects) for member variables, to load less metadata into member variables, or to set python_multiprocessing=False to use multi-threading mode.
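
The memory estimate quoted above is plain arithmetic, shown here as a short sketch. The function name estimate_total_memory and the sample figures (a 2 GiB parent process, 8 workers) are hypothetical.

```python
def estimate_total_memory(parent_process_bytes, num_parallel_workers):
    """Apply the estimate from the note: (num_parallel_workers + 1) * parent size."""
    if num_parallel_workers < 1:
        raise ValueError("num_parallel_workers must be at least 1")
    return (num_parallel_workers + 1) * parent_process_bytes


GIB = 1024 ** 3
# Hypothetical figures: the parent process occupies 2 GiB and 8 workers are used.
estimate = estimate_total_memory(2 * GIB, 8)
print(estimate / GIB)  # 18.0
```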

You can use the following classes/functions to reduce the size of member variables:

mindspore.dataset.utils.LineReader: Use this class to initialize your text file object in the __init__ function. Then read the file content based on the line number of the object with the __getitem__ function.
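
The idea of keeping only lightweight metadata in member variables and reading content on demand can be sketched in pure Python. OffsetLineReader below is a hypothetical stand-in written for illustration; it is not the implementation of mindspore.dataset.utils.LineReader and makes no claim about how that class works internally.

```python
import os
import tempfile
from array import array


class OffsetLineReader:
    """Hypothetical stand-in: stores byte offsets only and reads lines on demand."""

    def __init__(self, path):
        self._path = path
        # array("Q") holds raw unsigned 64-bit integers instead of Python int objects.
        self._offsets = array("Q")
        position = 0
        with open(path, "rb") as handle:
            for line in handle:
                self._offsets.append(position)
                position += len(line)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, index):
        # Seek to the stored offset and read exactly one line.
        with open(self._path, "rb") as handle:
            handle.seek(self._offsets[index])
            return handle.readline().decode("utf-8").rstrip("\r\n")


# Build a small temporary file to exercise the reader.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    tmp_path = tmp.name

reader = OffsetLineReader(tmp_path)
lines = [reader[i] for i in range(len(reader))]
os.remove(tmp_path)
print(lines)  # ['first', 'second', 'third']
```

Because the object holds a compact array of offsets rather than the lines themselves, far less data is shared with worker subprocesses.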
The input source accepts user-defined Python functions (PyFuncs). Adding network computing operators from mindspore.nn, mindspore.ops or others into source is supported only when the multiprocessing start method is set to spawn mode by ds.config.set_multiprocessing_start_method("spawn"), with python_multiprocessing=True and num_parallel_workers>1; otherwise it is not supported.

When the user-defined dataset passed as source calls the DVPP operator during dataset loading and processing, the supported scenarios are as follows:

	Multithreading	Multiprocessing (spawn)	Multiprocessing (fork)
Independent process mode	Data Processing: supported; Data Processing + Network training: not supported	Data Processing: supported; Data Processing + Network training: supported	Data Processing: supported; Data Processing + Network training: not supported
Non-independent process mode	Data Processing: supported; Data Processing + Network training: supported	Data Processing: supported; Data Processing + Network training: supported	Data Processing: supported; Data Processing + Network training: not supported

The parameters num_samples, shuffle, num_shards and shard_id can be used to control the sampler used in the dataset. Their effects when combined with the parameter sampler are as follows.

Sampler obtained by different combinations of parameters sampler and num_samples, shuffle, num_shards, shard_id
Parameter sampler	Parameter num_shards / shard_id	Parameter shuffle	Parameter num_samples	Sampler Used
mindspore.dataset.Sampler type	None	None	None	sampler
numpy.ndarray, list, tuple, int type	/	/	num_samples	SubsetSampler(indices=sampler, num_samples=num_samples)
iterable type	/	/	num_samples	IterSampler(sampler=sampler, num_samples=num_samples)
None	num_shards / shard_id	None / True	num_samples	DistributedSampler(num_shards=num_shards, shard_id=shard_id, shuffle=True, num_samples=num_samples)
None	num_shards / shard_id	False	num_samples	DistributedSampler(num_shards=num_shards, shard_id=shard_id, shuffle=False, num_samples=num_samples)
None	None	None / True	None	RandomSampler(num_samples=num_samples)
None	None	None / True	num_samples	RandomSampler(replacement=True, num_samples=num_samples)
None	None	False	num_samples	SequentialSampler(num_samples=num_samples)
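
The rows of the table in which sampler is None can be read as a simple decision procedure. The sketch below encodes only those rows; default_sampler is a hypothetical helper that returns a description of the sampler, not a MindSpore object, and it is not taken from the library's source.

```python
def default_sampler(num_shards=None, shard_id=None, shuffle=None, num_samples=None):
    """Describe the sampler listed in the table for the case sampler=None."""
    if num_shards is not None:
        # shuffle=None behaves like shuffle=True for sharded datasets.
        return ("DistributedSampler", {
            "num_shards": num_shards,
            "shard_id": shard_id,
            "shuffle": shuffle is not False,
            "num_samples": num_samples,
        })
    if shuffle is False:
        return ("SequentialSampler", {"num_samples": num_samples})
    if num_samples is not None:
        return ("RandomSampler", {"replacement": True, "num_samples": num_samples})
    return ("RandomSampler", {"num_samples": None})


print(default_sampler(shuffle=False, num_samples=10))
# ('SequentialSampler', {'num_samples': 10})
```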

Raises:

RuntimeError – If source raises an exception during execution.
RuntimeError – If the length of column_names does not match the output length of source.
ValueError – If num_parallel_workers exceeds the maximum number of threads.
ValueError – If sampler and shuffle are specified at the same time.
ValueError – If sampler and sharding are specified at the same time.
ValueError – If num_shards is specified but shard_id is None.
ValueError – If shard_id is specified but num_shards is None.
ValueError – If shard_id is not in the range [0, num_shards).
ValueError – If batch_sampler is specified together with num_samples, shuffle, num_shards, shard_id or sampler.
ValueError – If collate_fn is specified while batch_sampler is None.
TypeError – If batch_sampler is not iterable.
TypeError – If collate_fn is not callable.
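
The sharding conditions listed above can be restated as a small pure-Python check. This is a sketch of the documented rules only: check_shard_arguments and is_rejected are hypothetical names, the error messages are invented for illustration, and MindSpore's own validation code may differ.

```python
def check_shard_arguments(num_shards=None, shard_id=None):
    """Raise ValueError for the sharding argument combinations listed above."""
    if num_shards is None and shard_id is None:
        return  # Sharding is not requested.
    if shard_id is None:
        raise ValueError("num_shards is specified but shard_id is None")
    if num_shards is None:
        raise ValueError("shard_id is specified but num_shards is None")
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id is not in the range [0, num_shards)")


def is_rejected(**kwargs):
    """Return True when the given arguments trigger a ValueError."""
    try:
        check_shard_arguments(**kwargs)
    except ValueError:
        return True
    return False


print(is_rejected(num_shards=4, shard_id=4))  # True
print(is_rejected(num_shards=4, shard_id=3))  # False
```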

Examples

>>> import mindspore.dataset as ds
>>> import numpy as np
>>>
>>> # 1) Multidimensional generator function as callable input.
>>> def generator_multidimensional():
...     for i in range(64):
...         yield (np.array([[i, i + 1], [i + 2, i + 3]]),)
>>>
>>> dataset = ds.GeneratorDataset(source=generator_multidimensional, column_names=["multi_dimensional_data"])
>>>
>>> # 2) Multi-column generator function as callable input.
>>> def generator_multi_column():
...     for i in range(64):
...         yield np.array([i]), np.array([[i, i + 1], [i + 2, i + 3]])
>>>
>>> dataset = ds.GeneratorDataset(source=generator_multi_column, column_names=["col1", "col2"])
>>>
>>> # 3) Iterable dataset as iterable input.
>>> class MyIterable:
...     def __init__(self):
...         self._index = 0
...         self._data = np.random.sample((5, 2))
...         self._label = np.random.sample((5, 1))
...
...     def __next__(self):
...         if self._index >= len(self._data):
...             raise StopIteration
...         else:
...             item = (self._data[self._index], self._label[self._index])
...             self._index += 1
...             return item
...
...     def __iter__(self):
...         self._index = 0
...         return self
...
...     def __len__(self):
...         return len(self._data)
>>>
>>> dataset = ds.GeneratorDataset(source=MyIterable(), column_names=["data", "label"])
>>>
>>> # 4) Random accessible dataset as random accessible input.
>>> class MyAccessible:
...     def __init__(self):
...         self._data = np.random.sample((5, 2))
...         self._label = np.random.sample((5, 1))
...
...     def __getitem__(self, index):
...         return self._data[index], self._label[index]
...
...     def __len__(self):
...         return len(self._data)
>>>
>>> dataset = ds.GeneratorDataset(source=MyAccessible(), column_names=["data", "label"])
>>>
>>> # Python list, dict and tuple objects are also random accessible
>>> dataset = ds.GeneratorDataset(source=[(np.array(0),), (np.array(1),), (np.array(2),)], column_names=["col"])

Tutorial Examples:

Load & Process Data With Dataset Pipeline

add_sampler(new_sampler)

Add a child sampler for the current dataset.

Note

If the sampler is added and it has a shuffle option, its value must be Shuffle.GLOBAL. Additionally, the original sampler's shuffle value cannot be Shuffle.PARTIAL.

Parameters: new_sampler (Sampler) – The child sampler to be added.

Examples

>>> import mindspore.dataset as ds
>>> dataset = ds.GeneratorDataset([i for i in range(10)], "column1")
>>>
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> dataset.add_sampler(new_sampler)

prepare_multiprocessing()

Preprocessing of prepared_source.

Pre-processing Operation

`mindspore.dataset.Dataset.apply`	Apply a function in this dataset.
`mindspore.dataset.Dataset.concat`	Concatenate the dataset objects in the input list.
`mindspore.dataset.Dataset.filter`	Filter dataset by predicate.
`mindspore.dataset.Dataset.flat_map`	Map func to each row in dataset and flatten the result.
`mindspore.dataset.Dataset.map`	Apply each operation in operations to this dataset.
`mindspore.dataset.Dataset.project`	The specified columns will be selected from the dataset and passed into the pipeline with the order specified.
`mindspore.dataset.Dataset.rename`	Rename the columns in input datasets.
`mindspore.dataset.Dataset.repeat`	Repeat this dataset count times.
`mindspore.dataset.Dataset.reset`	Reset the dataset for next epoch.
`mindspore.dataset.Dataset.save`	Save the dynamic data processed by the dataset pipeline in common dataset format.
`mindspore.dataset.Dataset.shuffle`	Shuffle the dataset by creating a cache with the size of buffer_size.
`mindspore.dataset.Dataset.skip`	Skip the first N elements of this dataset.
`mindspore.dataset.Dataset.split`	Split the dataset into smaller, non-overlapping datasets.
`mindspore.dataset.Dataset.take`	Take the first specified number of samples from the dataset.
`mindspore.dataset.Dataset.zip`	Zip the datasets in the sense of input tuple of datasets.

Batch

`mindspore.dataset.Dataset.batch`	Combine batch_size consecutive rows into a batch, applying per_batch_map to the samples first.
`mindspore.dataset.Dataset.bucket_batch_by_length`	Bucket elements according to their lengths.
`mindspore.dataset.Dataset.padded_batch`	Combine batch_size consecutive rows into a batch, applying pad_info to the samples first.

Iterator

`mindspore.dataset.Dataset.create_dict_iterator`	Create an iterator over the dataset that yields samples of type dict, while the key is the column name and the value is the data.
`mindspore.dataset.Dataset.create_tuple_iterator`	Create an iterator over the dataset that yields samples of type list, whose elements are the data for each column.

Attribute

`mindspore.dataset.Dataset.get_batch_size`	Return the size of batch.
`mindspore.dataset.Dataset.get_class_indexing`	Get the mapping dictionary from category names to category indexes.
`mindspore.dataset.Dataset.get_col_names`	Return the names of the columns in dataset.
`mindspore.dataset.Dataset.get_dataset_size`	Return the number of batches in an epoch.
`mindspore.dataset.Dataset.get_repeat_count`	Get the replication times in RepeatDataset.
`mindspore.dataset.Dataset.input_indexs`	Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode.
`mindspore.dataset.Dataset.num_classes`	Get the number of classes in a dataset.
`mindspore.dataset.Dataset.output_shapes`	Get the shapes of output data.
`mindspore.dataset.Dataset.output_types`	Get the types of output data.

Apply Sampler

`mindspore.dataset.MappableDataset.add_sampler`	Add a child sampler for the current dataset.
`mindspore.dataset.MappableDataset.use_sampler`	Replace the last child sampler of the current dataset, leaving the parent sampler unchanged.

Others

`mindspore.dataset.Dataset.recv`	The dataset communication interface receives data sent by the source Dataset using `mindspore.dataset.Dataset.send`.
`mindspore.dataset.Dataset.send`	The dataset communication interface sends data to the target Dataset, which can be received through `mindspore.dataset.Dataset.recv`.
`mindspore.dataset.Dataset.sync_update`	Release a blocking condition and trigger callback with given data.
`mindspore.dataset.Dataset.sync_wait`	Add a blocking condition to the input Dataset and a synchronize action will be applied.
`mindspore.dataset.Dataset.to_json`	Serialize a pipeline into JSON string and dump into file if filename is provided.