mindpandas.config

MindPandas configuration module.

mindpandas.config.get_adaptive_concurrency()[source]

Get the flag for using adaptive concurrency or not.

Returns

bool, value of adaptive_concurrency flag.

Examples

>>> # Get the adaptive concurrency flag.
>>> import mindpandas as pd
>>> adaptive = pd.get_adaptive_concurrency()
mindpandas.config.get_benchmark_mode()[source]

Get the current benchmark mode.

Returns

bool, indicates whether benchmark mode is enabled.

Examples

>>> # Get the current benchmark mode.
>>> import mindpandas as pd
>>> mode = pd.get_benchmark_mode()
mindpandas.config.get_concurrency_mode()[source]

Get the current concurrency mode. It is one of {‘multithread’, ‘multiprocess’}.

Returns

str, current concurrency mode.

Examples

>>> # Get the current concurrency mode.
>>> import mindpandas as pd
>>> mode = pd.get_concurrency_mode()
mindpandas.config.get_min_block_size()[source]

Get the current min block size of each partition.

Returns

int, current min_block_size of each partition in config.

Examples

>>> # Get the current min block size.
>>> import mindpandas as pd
>>> min_block_size = pd.get_min_block_size()
mindpandas.config.get_partition_shape()[source]

Get the current partition shape.

Returns

tuple, Number of expected partitions along each axis. It is a tuple of two positive integers. The first element is the row-wise number of partitions and the second element is the column-wise number of partitions.

Examples

>>> # Get the current partition shape.
>>> import mindpandas as pd
>>> shape = pd.get_partition_shape()
mindpandas.config.set_adaptive_concurrency(adaptive, **kwargs)[source]

Users can set adaptive concurrency to allow read_csv to automatically select the concurrency mode based on the file size. Available options are True or False. When set to True, files larger than 18 MB read by read_csv, and DataFrames initialized from a pandas DataFrame using more than 1 GB of CPU memory, use the multiprocess mode; otherwise the multithread mode is used. When set to False, the current concurrency mode is used.

Parameters
  • adaptive (bool) – True to turn on adaptive concurrency, False to turn off adaptive concurrency.

  • **kwargs

    When ‘adaptive’ is set to False, no additional parameters are required. When ‘adaptive’ is set to True, ‘kwargs’ includes:

    • address: The IP address of the master node. Optional, uses “127.0.0.1” by default.

    • cpu: The number of CPU cores to use. Optional, uses all CPU cores by default.

    • datamem: The amount of memory used by datasystem (MB). Optional, uses 30% of total memory by default.

    • mem: The total memory (including datamem) used by MindPandas (MB).

      Optional, uses 90% of total memory by default.

    • tmp_dir: The temporary directory for the mindpandas process. Optional, uses “/tmp/mindpandas” by default.

    • tmp_file_size_limit: The temporary file size limit (MB).

      Optional, the default value is “None” which uses up to 95% of current free disk space.

Raises

ValueError – If adaptive is not True or False.

Examples

>>> # Set adaptive concurrency to True.
>>> import mindpandas as pd
>>> pd.set_adaptive_concurrency(True)
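
The optional keyword arguments listed above can also be passed when enabling adaptive concurrency; the following sketch uses placeholder values that would need to be tuned for the actual machine.

>>> # Illustrative values only; every keyword argument is optional.
>>> import mindpandas as pd
>>> pd.set_adaptive_concurrency(True, address='127.0.0.1', cpu=8, datamem=2048, mem=8192)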
mindpandas.config.set_benchmark_mode(mode)[source]

Users can select whether to turn on benchmark mode for performance analysis. The default mode is False.

Parameters

mode (bool) – True to enable benchmark mode, False to disable it.

Raises

ValueError – If mode is not True or False.

Examples

>>> # Change the mode to True.
>>> import mindpandas as pd
>>> pd.set_benchmark_mode(True)
mindpandas.config.set_concurrency_mode(mode, **kwargs)[source]

Set the backend concurrency mode used to parallelize the computation. The default mode is multithread. Available options are {‘multithread’, ‘multiprocess’}. For instructions on the usage of the two modes, please refer to the MindPandas execution mode introduction and configuration instructions for more information.

Parameters
  • mode (str) – This parameter can be set to ‘multithread’ for multithread backend, or ‘multiprocess’ for distributed multiprocess backend.

  • **kwargs

    When running on multithread mode, no additional parameters are required. When running on multiprocess mode, additional parameters include:

    • address: The IP address of the master node. Optional, uses “127.0.0.1” by default.

    • cpu: The number of CPU cores to use. Optional, uses all CPU cores by default.

    • datamem: The amount of memory used by datasystem (MB). Optional, uses 30% of total memory by default.

    • mem: The total memory (including datamem) used by MindPandas (MB).

      Optional, uses 90% of total memory by default.

    • tmp_dir: The temporary directory for the mindpandas process. Optional, uses “/tmp/mindpandas” by default.

    • tmp_file_size_limit: The temporary file size limit (MB).

      Optional, the default value is “None” which uses up to 95% of current free disk space.

Raises

ValueError – If mode is not ‘multithread’ or ‘multiprocess’.

Examples

>>> # Change the mode to multiprocess.
>>> import mindpandas as pd
>>> pd.set_concurrency_mode('multiprocess')
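
For multiprocess mode, the optional keyword arguments listed above can be supplied as well; the values in the following sketch are placeholders rather than recommended settings.

>>> # Illustrative values only; adjust the address, cpu and memory limits for the deployment.
>>> import mindpandas as pd
>>> pd.set_concurrency_mode('multiprocess', address='127.0.0.1', cpu=8, datamem=2048, mem=8192, tmp_dir='/tmp/mindpandas')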
mindpandas.config.set_min_block_size(min_block_size)[source]

Users can set the min block size of each partition using this API. It is the minimum size of each axis of a partition; in other words, each partition’s size will be at least (min_block_size, min_block_size), unless the original data is smaller than this size. For example, if min_block_size is set to 32 and a DataFrame has only 16 columns with a partition shape of (2, 2), the columns will not be split further during partitioning.

Parameters

min_block_size (int) – Minimum size of a partition’s number of rows and number of columns during partitioning.

Raises

ValueError – If min_block_size is not of int type.

Examples

>>> # Set the min block size of each partition to 8.
>>> import mindpandas as pd
>>> pd.set_min_block_size(8)
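
The following sketch illustrates the behavior described above, assuming mindpandas.DataFrame accepts the same constructor arguments as pandas.DataFrame: with 16 columns, a partition shape of (2, 2) and a min block size of 32, the columns are not split further.

>>> # Assumption: the DataFrame constructor mirrors pandas; the data values are arbitrary.
>>> import numpy as np
>>> import mindpandas as pd
>>> pd.set_partition_shape((2, 2))
>>> pd.set_min_block_size(32)
>>> df = pd.DataFrame(np.ones((64, 16)))  # only 16 columns, so no column-wise split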
mindpandas.config.set_partition_shape(shape)[source]

Users can set the partition shape of the data, where shape[0] is the expected number of partitions along axis 0 (row-wise) and shape[1] is the expected number of partitions along axis 1 (column-wise). For example, if the shape is (16, 16), MindPandas will try to slice the original data into 16 * 16 partitions.

Parameters

shape (tuple) – Number of expected partitions along each axis. It should be a tuple of two positive integers. The first element is the row-wise number of partitions and the second element is the column-wise number of partitions.

Raises

ValueError – If shape is not tuple type or the value of shape is not int.

Examples

>>> # Set the shape of each partition to (16, 16).
>>> import mindpandas as pd
>>> pd.set_partition_shape((16, 16))
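
As a usage sketch, the partition shape is set before loading data so that the resulting DataFrame is partitioned accordingly; the CSV path below is a placeholder.

>>> # 'data.csv' is a placeholder path for illustration.
>>> import mindpandas as pd
>>> pd.set_partition_shape((16, 16))
>>> df = pd.read_csv('data.csv')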