mindspore.dataset.config

The configuration module provides various functions to set and get the supported configuration parameters, and read a configuration file.

Common imported modules in corresponding API examples are as follows:

import mindspore.dataset as ds

mindspore.dataset.config.get_auto_num_workers()

Get whether the automatic number of workers feature is turned on.

Returns

bool, whether the auto num_workers feature is turned on.

Examples

>>> # Get the global configuration of auto number worker feature.
>>> flag = ds.config.get_auto_num_workers()

mindspore.dataset.config.get_auto_offload()

Get the state of the automatic offload flag (True or False).

Returns

bool, whether the automatic offload feature is enabled.

Examples

>>> # Get the global configuration of the automatic offload feature.
>>> auto_offload = ds.config.get_auto_offload()

mindspore.dataset.config.get_autotune_interval()

Get the current configuration adjustment interval (in steps) for AutoTune.

Returns

int, the configuration adjustment interval (in steps) for AutoTune.

Examples

>>> # get the global configuration of the autotuning interval
>>> autotune_interval = ds.config.get_autotune_interval()

mindspore.dataset.config.get_callback_timeout()

Get the default timeout for WaitedDSCallback.

Returns

int, timeout (in seconds) used to end the wait in WaitedDSCallback in case of a deadlock.

Examples

>>> # Get the global configuration of callback timeout.
>>> # If set_callback_timeout() is never called before, the default value(60) will be returned.
>>> callback_timeout = ds.config.get_callback_timeout()

mindspore.dataset.config.get_enable_autotune()

Get whether AutoTune is currently enabled.

Returns

bool, whether AutoTune is currently enabled.

Examples

>>> # get the state of AutoTune
>>> autotune_flag = ds.config.get_enable_autotune()

mindspore.dataset.config.get_enable_shared_mem()

Get whether shared memory is enabled.

Note

get_enable_shared_mem is not supported on Windows and macOS platforms yet.

Returns

bool, whether shared memory is enabled.

Examples

>>> # Get the flag of shared memory feature.
>>> shared_mem_flag = ds.config.get_enable_shared_mem()

mindspore.dataset.config.get_enable_watchdog()

Get the state (enabled or disabled) of the watchdog Python thread. This is the DEFAULT watchdog Python thread state used for all processes.

Returns

bool, whether the watchdog Python thread is enabled by default.

Examples

>>> # Get the global configuration of watchdog Python thread.
>>> watchdog_state = ds.config.get_enable_watchdog()

mindspore.dataset.config.get_monitor_sampling_interval()

Get the global configuration of sampling interval of performance monitor. If set_monitor_sampling_interval is never called before, the default value(1000) will be returned.

Returns

int, interval (in milliseconds) for performance monitor sampling.

Examples

>>> # Get the global configuration of monitor sampling interval.
>>> # If set_monitor_sampling_interval() is never called before, the default value(1000) will be returned.
>>> sampling_interval = ds.config.get_monitor_sampling_interval()

mindspore.dataset.config.get_multiprocessing_timeout_interval()

Get the global configuration of multiprocessing/multithreading timeout when main process/thread gets data from subprocesses/child threads.

Returns

int, interval (in seconds) for multiprocessing/multithreading timeout when main process/thread gets data from subprocesses/child threads (default is 300s).

Examples

>>> # Get the global configuration of multiprocessing/multithreading timeout when main process/thread gets data
>>> # from subprocesses/child threads. If set_multiprocessing_timeout_interval() is never called before, the
>>> # default value(300) will be returned.
>>> multiprocessing_timeout_interval = ds.config.get_multiprocessing_timeout_interval()

mindspore.dataset.config.get_num_parallel_workers()

Get the global configuration of number of parallel workers. This is the DEFAULT num_parallel_workers value used for each operation.

Returns

int, number of parallel workers to be used as a default for each operation.

Examples

>>> # Get the global configuration of parallel workers.
>>> # If set_num_parallel_workers() is never called before, the default value(8) will be returned.
>>> num_parallel_workers = ds.config.get_num_parallel_workers()

mindspore.dataset.config.get_numa_enable()

Get the state (enabled or disabled) of numa. This is the DEFAULT numa setting used for all processes.

Returns

bool, whether numa is enabled by default.

Examples

>>> # Get the global configuration of numa.
>>> numa_state = ds.config.get_numa_enable()

mindspore.dataset.config.get_prefetch_size()

Get the prefetch size in number of rows. If set_prefetch_size is never called before, the default value 16 will be returned.

Returns

int, total number of rows to be prefetched.

Examples

>>> # Get the global configuration of prefetch size.
>>> # If set_prefetch_size() is never called before, the default value(16) will be returned.
>>> prefetch_size = ds.config.get_prefetch_size()

mindspore.dataset.config.get_seed()

Get the random number seed. If the seed has been set, the set value is returned; otherwise, the default seed value is returned, which equals std::mt19937::default_seed (5489).

Returns

int, random number seed.

Examples

>>> # Get the global configuration of seed.
>>> # If set_seed() is never called before, the default value(std::mt19937::default_seed) will be returned.
>>> seed = ds.config.get_seed()

mindspore.dataset.config.load(file)

Load the project configuration from the file.

Parameters

file (str) – Path of the configuration file to be loaded.

Raises

RuntimeError – If file is invalid and parsing fails.

Examples

>>> # Set new default configuration according to values in the configuration file.
>>> # example config file:
>>> # {
>>> #     "logFilePath": "/tmp",
>>> #     "numParallelWorkers": 4,
>>> #     "seed": 5489,
>>> #     "monitorSamplingInterval": 30
>>> # }
>>> config_file = "/path/to/config/file"
>>> ds.config.load(config_file)
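
A hedged follow-up sketch: assuming the example file above was loaded successfully, its values become the new global defaults.

>>> # Values from the loaded file are now the global defaults (per the example file above).
>>> num_parallel_workers = ds.config.get_num_parallel_workers()  # 4
>>> seed = ds.config.get_seed()  # 5489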

mindspore.dataset.config.set_auto_num_workers(enable)

Set num_parallel_workers for each op automatically (this feature is turned off by default).

If turned on, the num_parallel_workers in each op will be adjusted automatically, possibly overwriting the num_parallel_workers passed in by the user or the default value (if the user doesn't pass anything) set by ds.config.set_num_parallel_workers().

For now, this function is only optimized for the YoloV3 dataset with per_batch_map (running map in batch). This feature aims to provide a baseline for optimized num_workers assignment for each operation. Operations whose num_parallel_workers is adjusted to a new value will be logged.

Parameters

enable (bool) – Whether to enable auto num_workers feature or not.

Raises

TypeError – If enable is not of boolean type.

Examples

>>> # Enable auto_num_worker feature, this might override the num_parallel_workers passed in by user
>>> ds.config.set_auto_num_workers(True)

mindspore.dataset.config.set_auto_offload(offload)

Set the automatic offload flag of the dataset. If offload is True, as many dataset operations as possible will be automatically offloaded from the CPU to the Device (GPU or Ascend).

Parameters

offload (bool) – Whether to use the automatic offload feature.

Raises

TypeError – If offload is not a boolean data type.

Examples

>>> # Enable automatic offload feature
>>> ds.config.set_auto_offload(True)

mindspore.dataset.config.set_autotune_interval(interval)

Set the configuration adjustment interval (in steps) for AutoTune.

The default setting is 0, which will adjust the configuration after each epoch. Otherwise, the configuration will be adjusted every interval steps.

Parameters

interval (int) – Interval (in steps) to adjust the configuration of the data pipeline.

Raises
  • TypeError – If interval is not of type int.

  • ValueError – If interval is negative.

Examples

>>> # set a new interval for AutoTune
>>> ds.config.set_autotune_interval(30)

mindspore.dataset.config.set_callback_timeout(timeout)

Set the default timeout (in seconds) for WaitedDSCallback.

Parameters

timeout (int) – Timeout (in seconds) to be used to end the wait in WaitedDSCallback in case of a deadlock.

Raises
  • TypeError – If timeout is not of type int.

  • ValueError – If timeout <= 0 or timeout > INT32_MAX(2147483647).

Examples

>>> # Set a new global configuration value for the timeout value.
>>> ds.config.set_callback_timeout(100)

mindspore.dataset.config.set_enable_autotune(enable, filepath_prefix=None)

Set whether to enable AutoTune. AutoTune is disabled by default.

AutoTune is used to automatically adjust the global configuration of the data pipeline according to the workload of environmental resources during the training process to improve the speed of data processing.

The optimized global configuration can be saved as a JSON file by setting filepath_prefix for subsequent reuse.

Parameters
  • enable (bool) – Whether to enable AutoTune.

  • filepath_prefix (str, optional) – The prefix filepath used to save the optimized global configuration. In multi-device training, the rank id and the .json extension will be appended to the filepath_prefix string; in standalone training, the rank id is set to 0. For example, if filepath_prefix="/path/to/some/dir/prefixname" and the rank id is 1, the generated file will be "/path/to/some/dir/prefixname_1.json". If the file already exists, it will be automatically overwritten. Default: None, which means the configuration file is not saved; the tuned result can still be checked through the INFO log.

Raises

TypeError – If enable is not a boolean data type.

Note

  • When enable is False, filepath_prefix will be ignored.

  • The JSON file can be loaded by API mindspore.dataset.deserialize to build a tuned pipeline.

  • In distributed training scenario, set_enable_autotune() must be called after cluster communication has been initialized (mindspore.communication.management.init()), otherwise the AutoTune file will always suffix with rank id 0.

An example of the generated JSON file is as follows. The "remark" field indicates whether the dataset pipeline has been tuned. The "summary" field shows the tuned configuration of the dataset pipeline. Users can modify their scripts based on the tuned result.

{
    "remark": "The following file has been auto-generated by the Dataset AutoTune.",
    "summary": [
        "CifarOp(ID:5)       (num_parallel_workers: 2, prefetch_size:64)",
        "MapOp(ID:4)         (num_parallel_workers: 2, prefetch_size:64)",
        "MapOp(ID:3)         (num_parallel_workers: 2, prefetch_size:64)",
        "BatchOp(ID:2)       (num_parallel_workers: 8, prefetch_size:64)"
    ],
    "tree": {
        ...
    }
}

Examples

>>> # enable AutoTune and save optimized data pipeline configuration
>>> ds.config.set_enable_autotune(True, "/path/to/autotune_out.json")
>>>
>>> # enable AutoTune
>>> ds.config.set_enable_autotune(True)
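
As noted above, a saved AutoTune JSON file can be rebuilt into a data pipeline with mindspore.dataset.deserialize. A minimal sketch; the file path is illustrative:

>>> # Rebuild the tuned pipeline from the saved configuration file.
>>> tuned_dataset = ds.deserialize(json_filepath="/path/to/autotune_out_0.json")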

mindspore.dataset.config.set_enable_shared_mem(enable)

Set the default state of the shared memory flag. If enable is True, shared memory queues will be used to pass data to the processes that are created for operators that set python_multiprocessing=True.

Note

set_enable_shared_mem is not supported on Windows and macOS platforms yet.

Parameters

enable (bool) – Whether to use shared memory in operators when python_multiprocessing=True.

Raises

TypeError – If enable is not a boolean data type.

Examples

>>> # Enable shared memory feature to improve the performance of Python multiprocessing.
>>> ds.config.set_enable_shared_mem(True)
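
A minimal sketch of where this flag takes effect, assuming a NumpySlicesDataset source: shared memory queues are only used by operators that run with Python multiprocessing.

>>> import numpy as np
>>> ds.config.set_enable_shared_mem(True)
>>> data = ds.NumpySlicesDataset(np.arange(10), column_names=["x"], shuffle=False)
>>> # The map below spawns worker processes; with the flag on, they pass data back
>>> # through shared memory queues.
>>> data = data.map(operations=[(lambda x: x * 2)], input_columns=["x"],
...                 num_parallel_workers=2, python_multiprocessing=True)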

mindspore.dataset.config.set_enable_watchdog(enable)

Set the state of the watchdog Python thread, which is enabled by default. The watchdog is a thread that cleans up hanging subprocesses.

Parameters

enable (bool) – Whether to launch a watchdog Python thread. System default: True.

Raises

TypeError – If enable is not a boolean data type.

Examples

>>> # Set a new global configuration value for the state of watchdog Python thread as enabled.
>>> ds.config.set_enable_watchdog(True)

mindspore.dataset.config.set_monitor_sampling_interval(interval)

Set the default interval (in milliseconds) for monitor sampling.

Parameters

interval (int) – Interval (in milliseconds) to be used for performance monitor sampling.

Raises
  • TypeError – If interval is not of type int.

  • ValueError – If interval <= 0 or interval > INT32_MAX(2147483647).

Examples

>>> # Set a new global configuration value for the monitor sampling interval.
>>> ds.config.set_monitor_sampling_interval(100)

mindspore.dataset.config.set_multiprocessing_timeout_interval(interval)

Set the default interval (in seconds) for multiprocessing/multithreading timeout when main process/thread gets data from subprocesses/child threads.

Parameters

interval (int) – Interval (in seconds) to be used for multiprocessing/multithreading timeout when main process/thread gets data from subprocess/child threads. System default: 300s.

Raises
  • TypeError – If interval is not of type int.

  • ValueError – If interval <= 0 or interval > INT32_MAX(2147483647).

Examples

>>> # Set a new global configuration value for multiprocessing/multithreading timeout when getting data.
>>> ds.config.set_multiprocessing_timeout_interval(300)

mindspore.dataset.config.set_num_parallel_workers(num)

Set a new global configuration default value for the number of parallel workers. This setting will affect the parallelism of all dataset operations.

Parameters

num (int) – Number of parallel workers to be used as a default for each operation.

Raises
  • TypeError – If num is not of type int.

  • ValueError – If num <= 0 or num > INT32_MAX(2147483647).

Examples

>>> # Set a new global configuration value for the number of parallel workers.
>>> # Now parallel dataset operators will run with 8 workers.
>>> ds.config.set_num_parallel_workers(8)
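
A minimal sketch of how the global default interacts with per-operation settings, assuming a NumpySlicesDataset source: an operation that omits num_parallel_workers uses the global default, while an explicit value overrides it.

>>> ds.config.set_num_parallel_workers(8)
>>> data = ds.NumpySlicesDataset([1, 2, 3, 4], column_names=["x"], shuffle=False)
>>> data = data.map(operations=[(lambda x: x + 1)], input_columns=["x"])  # uses the global default (8)
>>> data = data.map(operations=[(lambda x: x * 2)], input_columns=["x"],
...                 num_parallel_workers=2)  # explicit value overrides the default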

mindspore.dataset.config.set_numa_enable(numa_enable)

Set the default state of numa enabled. If numa_enable is True, the numa library must be installed.

Parameters

numa_enable (bool) – Whether to use numa bind feature.

Raises

TypeError – If numa_enable is not a boolean data type.

Examples

>>> # Set a new global configuration value for the state of numa enabled.
>>> # Now parallel dataset operators will run with the numa bind feature.
>>> ds.config.set_numa_enable(True)

mindspore.dataset.config.set_prefetch_size(size)

Set the queue capacity of the threads in the pipeline (in number of rows).

Parameters

size (int) – The length of the cache queue.

Raises
  • TypeError – If size is not of type int.

  • ValueError – If size <= 0 or size > INT32_MAX(2147483647).

Note

Since the total memory used for prefetching can grow very large with a high number of workers, when the number of workers is greater than 4, the per-worker prefetch size will be reduced. The actual per-worker prefetch size at runtime will be prefetch_size * (4 / num_parallel_workers).
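For example, with prefetch_size=1000 and num_parallel_workers=8, the per-worker prefetch size at runtime will be 1000 * (4 / 8) = 500 rows.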

Examples

>>> # Set a new global configuration value for the prefetch size.
>>> ds.config.set_prefetch_size(1000)

mindspore.dataset.config.set_seed(seed)

Set the seed so that randomly generated numbers will be fixed for deterministic results.

Note

This set_seed function sets the seed in the Python random library and the numpy.random library for deterministic Python augmentations that use randomness. This set_seed function should be called when an iterator is created to reset the random seed.

Parameters

seed (int) – Random number seed. It is used to generate deterministic random numbers.

Raises
  • TypeError – If seed isn’t of type int.

  • ValueError – If seed < 0 or seed > UINT32_MAX(4294967295).

Examples

>>> # Set a new global configuration value for the seed value.
>>> # Operations with randomness will use the seed value to generate random values.
>>> ds.config.set_seed(1000)
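
A minimal sketch of the determinism this provides, assuming a NumpySlicesDataset source: rebuilding the pipeline after setting the same seed reproduces the same shuffle order.

>>> ds.config.set_seed(1000)
>>> data = ds.NumpySlicesDataset([1, 2, 3, 4, 5], column_names=["x"], shuffle=True)
>>> # Setting the same seed again before rebuilding the pipeline yields the same order.
>>> ds.config.set_seed(1000)
>>> data_again = ds.NumpySlicesDataset([1, 2, 3, 4, 5], column_names=["x"], shuffle=True)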

mindspore.dataset.config.set_sending_batches(batch_num)

Set the default number of sending batches when training with sink_mode=True on Ascend devices.

Parameters

batch_num (int) – The total number of batches to be sent. Once batch_num batches have been sent, sending will wait unless batch_num is increased. Default: 0, which means all batches in the dataset will be sent.

Raises

TypeError – If batch_num is not of type int.

Examples

>>> # Set a new global configuration value for the sending batches
>>> ds.config.set_sending_batches(10)