Class Dataset

Inheritance Relationships

Base Type

  • public std::enable_shared_from_this< Dataset >

Derived Types

Class Documentation

class Dataset : public std::enable_shared_from_this<Dataset>

A base class to represent a dataset in the data pipeline.

Subclassed by mindspore::dataset::AlbumDataset, mindspore::dataset::BatchDataset, mindspore::dataset::MapDataset, mindspore::dataset::MnistDataset, mindspore::dataset::ProjectDataset, mindspore::dataset::ShuffleDataset

Public Functions

Dataset()

Constructor.

virtual ~Dataset() = default

Destructor.

int64_t GetDatasetSize(bool estimate = false)

Gets the dataset size.

Parameters

estimate[in] Supported by only some of the ops. It speeds up getting the dataset size at the expense of accuracy.

Returns

Dataset size. Returns -1 on failure.

std::vector<mindspore::DataType> GetOutputTypes()

Gets the output types.

Returns

A vector of DataType. Returns an empty vector on failure.

std::vector<std::vector<int64_t>> GetOutputShapes()

Gets the output shapes.

Returns

A vector of shapes, each given as a vector of int64_t dimensions. Returns an empty vector on failure.

int64_t GetBatchSize()

Gets the batch size.

Returns

The batch size, as an int64_t.

int64_t GetRepeatCount()

Gets the repeat count.

Returns

The repeat count, as an int64_t.

int64_t GetNumClasses()

Gets the number of classes.

Returns

Number of classes. Returns -1 on failure.

inline std::vector<std::string> GetColumnNames()

Gets the column names.

Returns

Names of the columns. Returns an empty vector on failure.

inline std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing()

Gets the class indexing.

Returns

The class indexing, as a vector of (string, index vector) pairs. Returns an empty vector on failure.
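Because the declared return type is a vector of pairs rather than a map, a caller that needs lookup by name has to build one. The sketch below shows one way to do that with standard containers only. It assumes each pair holds a class name and its index values; the `ClassIndexing` alias and the `ToLookup` helper are hypothetical names introduced here, not part of the API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Same shape as the declared return type of GetClassIndexing().
// The alias name is hypothetical and used only in this sketch.
using ClassIndexing = std::vector<std::pair<std::string, std::vector<int32_t>>>;

// Copies the pairs into a std::map keyed by the string in each pair.
// If a name appears more than once, the first occurrence is kept.
std::map<std::string, std::vector<int32_t>> ToLookup(const ClassIndexing &indexing) {
  return std::map<std::string, std::vector<int32_t>>(indexing.begin(), indexing.end());
}
```

An empty input produces an empty map, so the failure case described above can still be detected with `empty()` after conversion.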

std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers)

Setter function for runtime number of workers.

Example
/* Set the number of workers (threads) that process the dataset in parallel */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
ds = ds->SetNumWorkers(16);

Parameters

num_workers[in] The number of threads in this operator

Returns

Shared pointer to the original object

std::shared_ptr<PullIterator> CreatePullBasedIterator()

Function to create a PullBasedIterator over the Dataset.

Example
/* dataset is an instance of Dataset object */
std::shared_ptr<PullIterator> iter = dataset->CreatePullBasedIterator();
std::unordered_map<std::string, mindspore::MSTensor> row;
iter->GetNextRow(&row);

Returns

Shared pointer to the PullIterator

inline std::shared_ptr<Iterator> CreateIterator(int32_t num_epochs = -1)

Function to create an Iterator over the Dataset pipeline.

Example
/* dataset is an instance of Dataset object */
std::shared_ptr<Iterator> iter = dataset->CreateIterator();
std::unordered_map<std::string, mindspore::MSTensor> row;
iter->GetNextRow(&row);

Parameters

num_epochs[in] Number of epochs to run through the pipeline (default=-1, which means infinite epochs). An empty row is returned at the end of each epoch.

Returns

Shared pointer to the Iterator

inline bool DeviceQueue(const std::string &queue_name = "", const std::string &device_type = "", int32_t device_id = 0, int32_t num_epochs = -1, bool send_epoch_end = true, int32_t total_batches = 0, bool create_data_info_queue = false)

Function to transfer data through a device.

Note

If the device is Ascend, the features of the data are transferred one by one. Each transfer is limited to 256M.

Parameters
  • queue_name[in] Channel name (default="", create new unique name).

  • device_type[in] Type of device (default="", get from MSContext).

  • device_id[in] ID of the device (default=0, get from MSContext).

  • num_epochs[in] Number of epochs (default=-1, infinite epochs).

  • send_epoch_end[in] Whether to send end of sequence to the device (default=true).

  • total_batches[in] Number of batches to be sent to the device (default=0, all data).

  • create_data_info_queue[in] Whether to create a queue that stores the types and shapes of the data (default=false).

Returns

Returns true if no error is encountered, otherwise false.

inline bool Save(const std::string &dataset_path, int32_t num_files = 1, const std::string &dataset_type = "mindrecord")

Function to create a Saver to save the dynamic data processed by the dataset pipeline.

Example
/* Create a dataset and save its data into MindRecord */
std::string folder_path = "/path/to/cifar_dataset";
std::shared_ptr<Dataset> ds = Cifar10(folder_path, "all", std::make_shared<SequentialSampler>(0, 10));
std::string save_file = "Cifar10Data.mindrecord";
bool rc = ds->Save(save_file);

Note

Usage restrictions:

  1. Supported dataset formats: 'mindrecord' only.

  2. To save the samples in order, set the dataset's shuffle to false and num_files to 1.

  3. Before calling this function, do not use the batch operator, the repeat operator, or data augmentation operators with a random attribute in the map operator.

  4. Mindrecord does not support bool, uint64, multi-dimensional uint8 (drop dimension), or multi-dimensional string.

Parameters
  • dataset_path[in] Path to dataset file

  • num_files[in] Number of dataset files (default=1)

  • dataset_type[in] Dataset format (default="mindrecord")

Returns

Returns true if no error is encountered, otherwise false.

std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false)

Function to create a BatchDataset.

Example
/* Create a dataset where every 100 rows are combined into a batch */
std::shared_ptr<Dataset> ds = ImageFolder(folder_path, true);
ds = ds->Batch(100, true);

Note

Combines batch_size consecutive rows into each batch.

Parameters
  • batch_size[in] The number of rows each batch is created with

  • drop_remainder[in] Determines whether to drop the last, possibly incomplete, batch. If true, and fewer than batch_size rows are available to make the last batch, those rows are dropped and not propagated to the next node.

Returns

Shared pointer to the current BatchDataset
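The effect of drop_remainder can be illustrated with a small standalone model. This is only a sketch of the behaviour described above, not MindSpore code; the `BatchSizes` helper is a hypothetical name that returns the size of each batch produced from a given number of rows.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative model only: returns the size of every batch produced when
// num_rows consecutive rows are grouped into batches of batch_size rows.
std::vector<int64_t> BatchSizes(int64_t num_rows, int32_t batch_size, bool drop_remainder) {
  assert(num_rows >= 0 && batch_size > 0);
  std::vector<int64_t> sizes;
  // Emit one full batch for every complete group of batch_size rows.
  for (int64_t remaining = num_rows; remaining >= batch_size; remaining -= batch_size) {
    sizes.push_back(batch_size);
  }
  // The last, incomplete batch is kept only when drop_remainder is false.
  const int64_t leftover = num_rows % batch_size;
  if (leftover > 0 && !drop_remainder) {
    sizes.push_back(leftover);
  }
  return sizes;
}
```

In this model, `BatchSizes(250, 100, true)` gives two batches of 100 rows, while `BatchSizes(250, 100, false)` adds a third batch holding the remaining 50 rows.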

inline std::shared_ptr<MapDataset> Map(const std::vector<TensorTransform*> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})

Function to create a MapDataset.

Example
// Create objects for the tensor ops
std::shared_ptr<TensorTransform> decode_op = std::make_shared<vision::Decode>(true);
std::shared_ptr<TensorTransform> random_color_op = std::make_shared<vision::RandomColor>(0.0, 0.0);

/* 1) Simple map example */
// Apply decode_op on column "image". This column will be replaced by the
// output column of decode_op.
dataset = dataset->Map({decode_op}, {"image"});

// Decode and rename column "image" to "decoded_image".
dataset = dataset->Map({decode_op}, {"image"}, {"decoded_image"});

/* 2) Map example with more than one operation */
// Create a dataset where the images are decoded, then randomly color jittered.
// decode_op takes column "image" as input and outputs one column. The column
// output by decode_op is passed as input to random_color_op.
// random_color_op will output one column. Column "image" will be replaced by
// the column output by random_color_op (the very last operation). All other
// columns are unchanged.
dataset = dataset->Map({decode_op, random_color_op}, {"image"});

Note

Applies each operation in operations to this dataset.

Parameters
  • operations[in] Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns[in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operation. The default input_columns is the first column.

  • output_columns[in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, the output columns have the same names as the input columns, i.e., the columns are replaced.

  • cache[in] Tensor cache to use (default=nullptr, which means no cache is used).

  • callbacks[in] List of Dataset callbacks to be called.

Returns

Shared pointer to the current MapDataset
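The interaction between the operation chain, input_columns, and output_columns can be sketched with a standalone model. The code below is an illustration under simplifying assumptions, not the MindSpore implementation: a row is a map from column name to an `int` that stands in for a tensor, only one input column is handled, and `Row`, `Op`, and `MapRow` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Row = std::map<std::string, int>;  // column name -> value (int stands in for a tensor)
using Op = std::function<int(int)>;      // stand-in for a tensor operation

// Applies ops in order to one input column. The result replaces the input
// column, or is stored under output_column when a non-empty name is given.
// All other columns are left unchanged.
Row MapRow(Row row, const std::vector<Op> &ops, const std::string &input_column,
           const std::string &output_column = "") {
  int value = row.at(input_column);  // throws std::out_of_range if the column is missing
  for (const auto &op : ops) {
    value = op(value);  // the output of each op feeds the next one
  }
  row.erase(input_column);
  row[output_column.empty() ? input_column : output_column] = value;
  return row;
}
```

Calling `MapRow(row, ops, "image")` mirrors the replace case in the example above, and `MapRow(row, ops, "image", "decoded_image")` mirrors the rename case.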

inline std::shared_ptr<MapDataset> Map(const std::vector<std::shared_ptr<TensorTransform>> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset.

Parameters
  • operations[in] Vector of shared pointers to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns[in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operation. The default input_columns is the first column.

  • output_columns[in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, the output columns have the same names as the input columns, i.e., the columns are replaced.

  • cache[in] Tensor cache to use (default=nullptr, which means no cache is used).

  • callbacks[in] List of Dataset callbacks to be called.

Returns

Shared pointer to the current MapDataset

inline std::shared_ptr<MapDataset> Map(const std::vector<std::reference_wrapper<TensorTransform>> &operations, const std::vector<std::string> &input_columns = {}, const std::vector<std::string> &output_columns = {}, const std::shared_ptr<DatasetCache> &cache = nullptr, const std::vector<std::shared_ptr<DSCallback>> &callbacks = {})

Function to create a MapDataset.

Note

Applies each operation in operations to this dataset.

Parameters
  • operations[in] Vector of references to TensorTransform objects to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns[in] Vector of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operation. The default input_columns is the first column.

  • output_columns[in] Vector of names assigned to the columns output by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. By default, the output columns have the same names as the input columns, i.e., the columns are replaced.

  • cache[in] Tensor cache to use (default=nullptr, which means no cache is used).

  • callbacks[in] List of Dataset callbacks to be called.

Returns

Shared pointer to the current MapDataset

inline std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns)

Function to create a ProjectDataset.

Example
/* Reorder the original column names in dataset */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all", std::make_shared<RandomSampler>(false, 10));
ds = ds->Project({"label", "image"});

Note

Applies project to the dataset.

Parameters

columns[in] The names of the columns to project

Returns

Shared pointer to the current ProjectDataset
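As a rough illustration of what projecting columns means, the sketch below selects the requested columns from a row and returns them in the requested order. It is a standalone model, not MindSpore code: `Row` and `ProjectRow` are hypothetical names, values are plain strings, and silently skipping unknown column names is an assumption of this sketch rather than documented behaviour.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A row modelled as ordered (column name, value) pairs; values are strings for brevity.
using Row = std::vector<std::pair<std::string, std::string>>;

// Keeps only the requested columns, in the order they are requested.
// Column names not present in the row are skipped (an assumption of this sketch).
Row ProjectRow(const Row &row, const std::vector<std::string> &columns) {
  Row out;
  for (const auto &name : columns) {
    for (const auto &column : row) {
      if (column.first == name) {
        out.push_back(column);
        break;
      }
    }
  }
  return out;
}
```

With a row holding "image" then "label", `ProjectRow(row, {"label", "image"})` returns the same two columns with "label" first, matching the reordering shown in the example above.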

inline std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size)

Function to create a ShuffleDataset.

Example
/* Shuffle the rows of the dataset using a buffer of 10 rows */
std::shared_ptr<Dataset> ds = Mnist(folder_path, "all", std::make_shared<RandomSampler>(false, 10));
ds = ds->Shuffle(10);

Note

Randomly shuffles the rows of this dataset.

Parameters

buffer_size[in] The size of the buffer (must be larger than 1) for shuffling

Returns

Shared pointer to the current ShuffleDataset
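The role of buffer_size can be pictured with a generic buffer-based shuffle. The sketch below is an assumption about how such a buffer is commonly used, not a description of the MindSpore implementation: rows fill a buffer of buffer_size entries, and once the buffer is full a randomly chosen entry is emitted for each incoming row. `BufferShuffle` and its `seed` parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Generic buffer-based shuffle (illustrative only). Rows are plain ints here.
std::vector<int> BufferShuffle(const std::vector<int> &rows, std::size_t buffer_size, unsigned seed) {
  assert(buffer_size > 1);
  std::mt19937 rng(seed);
  std::vector<int> buffer;
  std::vector<int> out;
  out.reserve(rows.size());

  // Moves one randomly chosen buffered row to the output.
  auto emit_random = [&]() {
    std::uniform_int_distribution<std::size_t> pick(0, buffer.size() - 1);
    std::swap(buffer[pick(rng)], buffer.back());
    out.push_back(buffer.back());
    buffer.pop_back();
  };

  for (int row : rows) {
    buffer.push_back(row);
    if (buffer.size() == buffer_size) {
      emit_random();  // buffer is full: emit one row to make room for the next
    }
  }
  while (!buffer.empty()) {
    emit_random();  // input exhausted: drain the remaining rows in random order
  }
  return out;
}
```

Under this model a larger buffer lets a row move further from its original position, at the cost of holding more rows in memory at once.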