mindspore.mindrecord

Introduction to MindRecord.

MindRecord is a module for reading, writing, searching and converting datasets in the MindSpore format. Users can use the FileWriter API to generate MindRecord data and the MindDataset API to load it. Users can also convert datasets in other formats to MindRecord data through the corresponding sub-module.
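The write-then-read workflow described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the schema and records are shortened copies of the FileWriter example in this page, the function name write_then_read and the schema description "demo_schema" are hypothetical, and MindSpore is imported inside the function so that the snippet loads even where MindSpore is not installed.

```python
# Schema and records are shortened copies of the FileWriter example in this reference.
SCHEMA = {"file_name": {"type": "string"},
          "label": {"type": "int32"},
          "data": {"type": "bytes"}}
RECORDS = [{"file_name": "1.jpg", "label": 0, "data": b"\x10c\xb3w"},
           {"file_name": "2.jpg", "label": 56, "data": b"\xe6\xda\xd1\xae"}]


def write_then_read(path, schema=SCHEMA, records=RECORDS):
    """Write records to a MindRecord file, then read two columns back."""
    # Imported lazily so this module can be loaded without MindSpore installed.
    from mindspore.mindrecord import FileReader, FileWriter

    writer = FileWriter(file_name=path, shard_num=1, overwrite=True)
    writer.add_schema(schema, "demo_schema")   # hypothetical description string
    writer.add_index(["file_name", "label"])
    writer.write_raw_data(records)
    writer.commit()

    reader = FileReader(file_name=path, columns=["file_name", "label"])
    try:
        # get_next() yields one dict per sample, keyed by the requested columns.
        return list(reader.get_next())
    finally:
        reader.close()
```

Calling write_then_read with a path inside an existing directory is expected to return one dict per record, each keyed by "file_name" and "label".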

class mindspore.mindrecord.Cifar100ToMR(source, destination)[source]

A class to transform from cifar100 to MindRecord.

Note

For examples, please refer to Converting the CIFAR-10 Dataset.

Parameters
  • source (str) – The cifar100 directory to be transformed.

  • destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.

Raises

ValueError – If source or destination is invalid.

run(fields=None)[source]

Execute transformation from cifar100 to MindRecord.

Parameters

fields (list[str], optional) – A list of index fields, e.g. ["fine_label", "coarse_label"]. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index().

Returns

MSRStatus, SUCCESS or FAILED.

transform(fields=None)[source]

Encapsulate the mindspore.mindrecord.Cifar100ToMR.run() function to exit normally.

Parameters

fields (list[str], optional) – A list of index fields, e.g. ["fine_label", "coarse_label"]. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index().

Returns

MSRStatus, SUCCESS or FAILED.

class mindspore.mindrecord.Cifar10ToMR(source, destination)[source]

A class to transform from cifar10 to MindRecord.

Note

For examples, please refer to Converting the CIFAR-10 Dataset.

Parameters
  • source (str) – The cifar10 directory to be transformed.

  • destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.

Raises

ValueError – If source or destination is invalid.

run(fields=None)[source]

Execute transformation from cifar10 to MindRecord.

Parameters

fields (list[str], optional) – A list of index fields. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index().

Returns

MSRStatus, SUCCESS or FAILED.

transform(fields=None)[source]

Encapsulate the mindspore.mindrecord.Cifar10ToMR.run() function to exit normally.

Parameters

fields (list[str], optional) – A list of index fields. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index().

Returns

MSRStatus, SUCCESS or FAILED.
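The destination requirement (directory created in advance, no file with the same name) can be checked before the converter is built. The sketch below is one possible way to do so. The helper names prepare_destination and convert_cifar10 are hypothetical, and using "label" as an index field assumes that the CIFAR-10 schema exposes a field with that name.

```python
import os


def prepare_destination(destination):
    """Create the parent directory and refuse to reuse an existing file name."""
    parent = os.path.dirname(os.path.abspath(destination))
    os.makedirs(parent, exist_ok=True)
    if os.path.exists(destination):
        raise FileExistsError(f"{destination} already exists")
    return destination


def convert_cifar10(source, destination):
    """Convert a CIFAR-10 directory to MindRecord, indexing the label field."""
    # Imported lazily so prepare_destination() is usable without MindSpore.
    from mindspore.mindrecord import Cifar10ToMR

    converter = Cifar10ToMR(source, prepare_destination(destination))
    # Assumption: "label" is a field of the CIFAR-10 schema.
    return converter.transform(["label"])
```

The same pattern should apply to Cifar100ToMR with index fields such as "fine_label" and "coarse_label".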

class mindspore.mindrecord.CsvToMR(source, destination, columns_list=None, partition_number=1)[source]

A class to transform from csv to MindRecord.

Note

For examples, please refer to Converting CSV Dataset.

Parameters
  • source (str) – The file path of csv.

  • destination (str) – The MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.

  • columns_list (list[str], optional) – A list of columns to be read. Default: None.

  • partition_number (int, optional) – The partition size. Default: 1.

Raises
  • ValueError – If source, destination or partition_number is invalid.

  • RuntimeError – If columns_list is invalid.

run()[source]

Execute transformation from csv to MindRecord.

Returns

MSRStatus, SUCCESS or FAILED.

transform()[source]

Encapsulate the mindspore.mindrecord.CsvToMR.run() function to exit normally.

Returns

MSRStatus, SUCCESS or FAILED.
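A minimal sketch of a CSV conversion follows. The sample column names (id, label, score) and the helper names write_sample_csv and convert_csv are hypothetical and serve only as illustration; the CsvToMR call uses the constructor signature listed above.

```python
import csv


def write_sample_csv(path):
    """Write a small CSV with a header row and two data rows; return the header."""
    rows = [["id", "label", "score"], [1, 0, 0.5], [2, 1, 0.75]]
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return rows[0]


def convert_csv(source, destination, columns=None):
    """Convert a CSV file to MindRecord, optionally reading only some columns."""
    # Imported lazily so write_sample_csv() works without MindSpore.
    from mindspore.mindrecord import CsvToMR

    converter = CsvToMR(source, destination, columns_list=columns, partition_number=1)
    return converter.transform()
```

Passing columns=["id", "label"] would restrict the conversion to those two columns, while the default of None leaves the choice to the converter.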

class mindspore.mindrecord.FileReader(file_name, num_consumer=4, columns=None, operator=None)[source]

Class to read MindRecord files.

Note

If file_name is a file path, the reader tries to load all MindRecord files generated in a conversion and throws an exception if any of them is missing. If file_name is a list of file paths, only the MindRecord files in the list are loaded.

Parameters
  • file_name (str, list[str]) – A MindRecord file path or a list of file paths.

  • num_consumer (int, optional) – Number of reader workers which load data. Default: 4. It should not be smaller than 1 or larger than the number of processor cores.

  • columns (list[str], optional) – A list of fields whose data is to be read. Default: None.

  • operator (int, optional) – Reserved parameter for operators. Default: None.

Raises

ParamValueError – If file_name, num_consumer or columns is invalid.

Examples

>>> from mindspore.mindrecord import FileReader
>>>
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> reader = FileReader(file_name=mindrecord_file)
>>>
>>> # create iterator for mindrecord and get saved data
>>> for _, item in enumerate(reader.get_next()):
...     ori_data = item
>>> reader.close()

close()[source]

Stop reader worker and close file.

get_next()[source]

Yield a batch of data according to columns at a time.

Returns

dict, a batch whose keys are the same as columns.

Raises

MRMUnsupportedSchemaError – If schema is invalid.

len()[source]

Get the number of samples in MindRecord.

Returns

int, the number of samples in MindRecord.

schema()[source]

Get the schema of the MindRecord.

Returns

dict, the schema info.

class mindspore.mindrecord.FileWriter(file_name, shard_num=1, overwrite=False)[source]

Class to write user defined raw data into MindRecord files.

Note

After the MindRecord file is generated, if the file name is changed, the file may fail to be read.

Parameters
  • file_name (str) – File name of MindRecord file.

  • shard_num (int, optional) – The number of MindRecord files. It should be in the range [1, 1000]. Default: 1.

  • overwrite (bool, optional) – Whether to overwrite if the file already exists. Default: False.

Raises

ParamValueError – If file_name, shard_num or overwrite is invalid.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> schema_json = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
>>> indexes = ["file_name", "label"]
>>> data = [{"file_name": "1.jpg", "label": 0,
...          "data": b"\x10c\xb3w\xa8\xee$o&<q\x8c\x8e(\xa2\x90\x90\x96\xbc\xb1\x1e\xd4QER\x13?\xff"},
...         {"file_name": "2.jpg", "label": 56,
...          "data": b"\xe6\xda\xd1\xae\x07\xb8>\xd4\x00\xf8\x129\x15\xd9\xf2q\xc0\xa2\x91YFUO\x1dsE1"},
...         {"file_name": "3.jpg", "label": 99,
...          "data": b"\xaf\xafU<\xb8|6\xbd}\xc1\x99[\xeaj+\x8f\x84\xd3\xcc\xa0,i\xbb\xb9-\xcdz\xecp{T\xb1"}]
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1, overwrite=True)
>>> schema_id = writer.add_schema(schema_json, "test_schema")
>>> status = writer.add_index(indexes)
>>> status = writer.write_raw_data(data)
>>> status = writer.commit()

add_index(index_fields)[source]

Select index fields from the schema to accelerate reading. The schema is added through add_schema.

Note

The index fields should be of a primitive type, e.g. int, float or str. If the function is not called, the primitive-type fields in the schema are set as indexes by default.

Please refer to the Examples of class mindspore.mindrecord.FileWriter.

Parameters

index_fields (list[str]) – Fields from the schema.

Returns

MSRStatus, SUCCESS or FAILED.

Raises
  • ParamTypeError – If index field is invalid.

  • MRMDefineIndexError – If index field is not primitive type.

  • MRMAddIndexError – If failed to add index field.

  • MRMGetMetaError – If the schema is not set or failed to get meta.

add_schema(content, desc=None)[source]

The schema is added to describe the raw data to be written.

Note

Please refer to the Examples of class mindspore.mindrecord.FileWriter.

Parameters
  • content (dict) – Dictionary of schema content.

  • desc (str, optional) – String of schema description. Default: None.

Returns

int, schema id.

Raises
  • MRMInvalidSchemaError – If schema is invalid.

  • MRMBuildSchemaError – If failed to build schema.

  • MRMAddSchemaError – If failed to add schema.

commit()[source]

Flush data in memory to disk and generate the corresponding database files.

Note

Please refer to the Examples of class mindspore.mindrecord.FileWriter.

Returns

MSRStatus, SUCCESS or FAILED.

Raises
  • MRMOpenError – If failed to open MindRecord file.

  • MRMSetHeaderError – If failed to set header.

  • MRMIndexGeneratorError – If failed to create index generator.

  • MRMGenerateIndexError – If failed to write to database.

  • MRMCommitError – If failed to flush data to disk.

  • RuntimeError – If parallel writing fails.

open_and_set_header()[source]

Open the writer and set the header, which stores meta information. This function is only used for parallel writing and is called before write_raw_data.

Returns

MSRStatus, SUCCESS or FAILED.

Raises
  • MRMOpenError – If failed to open MindRecord file.

  • MRMSetHeaderError – If failed to set header.

classmethod open_for_append(file_name)[source]

Open MindRecord file and get ready to append data.

Parameters

file_name (str) – String of MindRecord file name.

Returns

FileWriter, file writer object for the opened MindRecord file.

Raises
  • ParamValueError – If file_name is invalid.

  • FileNameError – If path contains invalid characters.

  • MRMOpenError – If failed to open MindRecord file.

  • MRMOpenForAppendError – If failed to open file for appending data.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> schema_json = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
>>> data = [{"file_name": "1.jpg", "label": 0,
...          "data": b"\x10c\xb3w\xa8\xee$o&<q\x8c\x8e(\xa2\x90\x90\x96\xbc\xb1\x1e\xd4QER\x13?\xff"}]
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1, overwrite=True)
>>> schema_id = writer.add_schema(schema_json, "test_schema")
>>> status = writer.write_raw_data(data)
>>> status = writer.commit()
>>> write_append = FileWriter.open_for_append("test.mindrecord")
>>> status = write_append.write_raw_data(data)
>>> status = write_append.commit()

set_header_size(header_size)[source]

Set the size of the header, which contains shard information, schema information, page meta information, etc. The larger the header, the more data the MindRecord file can store. If the header is larger than the default size (16MB), users need to call this API to set a proper size.

Parameters

header_size (int) – Size of header, between 16*1024 (16KB) and 128*1024*1024 (128MB).

Returns

MSRStatus, SUCCESS or FAILED.

Raises

MRMInvalidHeaderSizeError – If failed to set header size.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> status = writer.set_header_size(1 << 25)  # 32MB

set_page_size(page_size)[source]

Set the size of a page, which is the area where data is stored. Pages are divided into two types: raw page and blob page. The larger a page, the more data it can store. If a sample is larger than the default size (32MB), users need to call this API to set a proper size.

Parameters

page_size (int) – Size of page, between 32*1024 (32KB) and 256*1024*1024 (256MB).

Returns

MSRStatus, SUCCESS or FAILED.

Raises

MRMInvalidPageSizeError – If failed to set page size.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> status = writer.set_page_size(1 << 26)  # 64MB
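
The size limits quoted for set_header_size and set_page_size can be checked before calling either method. The sketch below encodes only the documented ranges; treating both bounds as inclusive is an assumption, the helper name in_documented_range is hypothetical, and the library may apply further checks that are not described here.

```python
KB = 1024
MB = 1024 * KB

# Ranges quoted from the set_header_size and set_page_size parameter descriptions.
HEADER_SIZE_RANGE = (16 * KB, 128 * MB)
PAGE_SIZE_RANGE = (32 * KB, 256 * MB)


def in_documented_range(size, size_range):
    """Return True if size is an int within the range (bounds assumed inclusive)."""
    low, high = size_range
    return isinstance(size, int) and low <= size <= high
```

For instance, the values used in the examples, 1 << 25 (32MB) for the header and 1 << 26 (64MB) for the page, both fall inside their respective ranges.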

write_raw_data(raw_data, parallel_writer=False)[source]

Convert raw data into a series of consecutive MindRecord files after the raw data is verified against the schema.

Note

Please refer to the Examples of class mindspore.mindrecord.FileWriter.

Parameters
  • raw_data (list[dict]) – List of raw data.

  • parallel_writer (bool, optional) – Whether to write raw data in parallel. Default: False.

Returns

MSRStatus, SUCCESS or FAILED.

Raises
  • ParamTypeError – If index field is invalid.

  • MRMOpenError – If failed to open MindRecord file.

  • MRMValidateDataError – If data does not match blob fields.

  • MRMSetHeaderError – If failed to set header.

  • MRMWriteDatasetError – If failed to write dataset.

  • TypeError – If parallel_writer is not bool.
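
Because write_raw_data takes a list of dicts, a large dataset does not have to be held in memory at once: the method can be called repeatedly with smaller lists before a single commit. A minimal sketch, in which the helper names chunked and write_in_chunks, the default chunk size and the schema description "chunked_schema" are all hypothetical:

```python
from itertools import islice


def chunked(records, size):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_in_chunks(path, schema, records, chunk_size=1000):
    """Write records chunk by chunk, then commit once at the end."""
    # Imported lazily so chunked() is usable without MindSpore.
    from mindspore.mindrecord import FileWriter

    writer = FileWriter(file_name=path, shard_num=1, overwrite=True)
    writer.add_schema(schema, "chunked_schema")  # hypothetical description string
    for chunk in chunked(records, chunk_size):
        writer.write_raw_data(chunk)
    return writer.commit()
```

Since chunked accepts any iterable, records may also be a generator that produces samples lazily.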

class mindspore.mindrecord.ImageNetToMR(map_file, image_dir, destination, partition_number=1)[source]

A class to transform from imagenet to MindRecord.

Parameters
  • map_file (str) –

    The map file that maps each class directory to a label. This file can be generated by the command ls -l [image_dir] | grep -vE "total|\." | awk -F " " '{print $9, NR-1;}' > [file_path], where image_dir is the image directory containing the n01440764, n01443537, n01484850 and n15075141 directories, and file_path is the generated map_file. An example of map_file is shown below:

    n01440764 0
    n01443537 1
    n01484850 2
    n01491361 3
    ...
    n15075141 999
    

  • image_dir (str) – Image directory containing the n01440764, n01443537, n01484850 and n15075141 directories.

  • destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.

  • partition_number (int, optional) – The partition size. Default: 1.

Raises

ValueError – If map_file, image_dir or destination is invalid.

run()[source]

Execute transformation from imagenet to MindRecord.

Returns

MSRStatus, SUCCESS or FAILED.

transform()[source]

Encapsulate the mindspore.mindrecord.ImageNetToMR.run() function to exit normally.

Returns

MSRStatus, SUCCESS or FAILED.
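
Where the shell pipeline for map_file is not available, a comparable file can be produced in Python. This is a sketch rather than an exact port: it lists only subdirectories whose names contain no dot, sorts them by name and numbers them from 0, which is assumed to match the intent of the command. The function names build_map_lines and write_map_file are hypothetical.

```python
import os


def build_map_lines(image_dir):
    """Return 'class_dir index' lines for each class directory, sorted by name."""
    names = sorted(
        entry for entry in os.listdir(image_dir)
        if "." not in entry and os.path.isdir(os.path.join(image_dir, entry))
    )
    return [f"{name} {index}" for index, name in enumerate(names)]


def write_map_file(image_dir, map_path):
    """Write the map file and return the number of classes listed."""
    lines = build_map_lines(image_dir)
    with open(map_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)
```

The resulting path can then be passed as map_file to ImageNetToMR together with the same image_dir.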

class mindspore.mindrecord.MindPage(file_name, num_consumer=4)[source]

Class to read MindRecord files in pagination.

Parameters
  • file_name (Union[str, list[str]]) – A MindRecord file path or a list of file paths.

  • num_consumer (int, optional) – The number of reader workers which load data. Default: 4. It should not be smaller than 1 or larger than the number of processor cores.

Raises
  • ParamValueError – If file_name or num_consumer is invalid.

  • MRMInitSegmentError – If failed to initialize ShardSegment.

property candidate_fields

Return candidate category fields.

Returns

list[str], by which data could be grouped.

property category_field

Getter function for category fields.

Returns

list[str], by which data could be grouped.

get_category_fields()[source]

Return candidate category fields.

Returns

list[str], by which data could be grouped.

read_at_page_by_id(category_id, page, num_row)[source]

Query by category id in pagination.

Parameters
  • category_id (int) – Category id, as given in the return value of read_category_info.

  • page (int) – Index of page.

  • num_row (int) – Number of rows in a page.

Returns

list[dict], data queried by category id.

Raises
  • ParamValueError – If any parameter is invalid.

  • MRMFetchDataError – If failed to fetch data by category.

  • MRMUnsupportedSchemaError – If schema is invalid.

read_at_page_by_name(category_name, page, num_row)[source]

Query by category name in pagination.

Parameters
  • category_name (str) – Value of the category field, as given in the return value of read_category_info.

  • page (int) – Index of page.

  • num_row (int) – Number of rows in a page.

Returns

list[dict], data queried by category name.

read_category_info()[source]

Return category information when data is grouped by indicated category field.

Returns

str, description of group information.

Raises

MRMReadCategoryInfoError – If failed to read category information.

set_category_field(category_field)[source]

Set category field for reading.

Note

The field should be one of the candidate category fields.

Parameters

category_field (str) – String of category field name.

Returns

MSRStatus, SUCCESS or FAILED.
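
The MindPage methods are typically used together: pick a candidate field, set it as the category field, inspect the groups, then page through one group. A minimal sketch of that sequence follows. The function name browse_category is hypothetical, and passing 0 as the index of the first page is an assumption.

```python
def browse_category(file_name, category_field, category_name, num_row=10):
    """Group a MindRecord file by one field and return one page of a category."""
    # Imported lazily so this module loads without MindSpore installed.
    from mindspore.mindrecord import MindPage

    reader = MindPage(file_name)
    if category_field not in reader.candidate_fields:
        raise ValueError(f"{category_field!r} is not a candidate category field")
    reader.set_category_field(category_field)
    info = reader.read_category_info()  # str describing the groups
    # Assumption: page index 0 refers to the first page.
    rows = reader.read_at_page_by_name(category_name, 0, num_row)
    return info, rows
```

The returned info string can be inspected to find valid category names and ids before further calls to read_at_page_by_name or read_at_page_by_id.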

class mindspore.mindrecord.MnistToMR(source, destination, partition_number=1)[source]

A class to transform from Mnist to MindRecord.

Parameters
  • source (str) – Directory that contains t10k-images-idx3-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and train-labels-idx1-ubyte.gz.

  • destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.

  • partition_number (int, optional) – The partition size. Default: 1.

Raises

ValueError – If source, destination or partition_number is invalid.

run()[source]

Execute transformation from Mnist to MindRecord.

Returns

MSRStatus, SUCCESS or FAILED.

transform()[source]

Encapsulate the mindspore.mindrecord.MnistToMR.run() function to exit normally.

Returns

MSRStatus, SUCCESS or FAILED.
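
Since the source directory must contain four specific archives, it can be useful to check for them before starting the conversion. A sketch under that assumption, with hypothetical helper names missing_mnist_files and convert_mnist:

```python
import os

# File names quoted from the MnistToMR source parameter description.
MNIST_FILES = (
    "t10k-images-idx3-ubyte.gz",
    "train-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
)


def missing_mnist_files(source):
    """Return the expected MNIST archives that are absent from source."""
    return [name for name in MNIST_FILES
            if not os.path.isfile(os.path.join(source, name))]


def convert_mnist(source, destination):
    """Convert MNIST to MindRecord after checking the input directory."""
    missing = missing_mnist_files(source)
    if missing:
        raise FileNotFoundError(f"missing MNIST files: {missing}")
    # Imported lazily so the check above works without MindSpore.
    from mindspore.mindrecord import MnistToMR

    return MnistToMR(source, destination).transform()
```

An empty list from missing_mnist_files means that all four archives named in the parameter description are present.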

class mindspore.mindrecord.TFRecordToMR(source, destination, feature_dict, bytes_fields=None)[source]

A class to transform from TFRecord to MindRecord.

Note

For examples, please refer to Converting TFRecord Dataset.

Parameters
  • source (str) – TFRecord file to be transformed.

  • destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.

  • feature_dict (dict[str, FixedLenFeature]) – Dictionary that states the feature type. FixedLenFeature is supported.

  • bytes_fields (list[str], optional) – The bytes fields in feature_dict, which can hold image bytes. Default: None, which means that there is no bytes-type field such as an image.

Raises
  • ValueError – If parameter is invalid.

  • Exception – If the tensorflow module is not found or its version is not correct.

run()[source]

Execute transformation from TFRecord to MindRecord.

Returns

MSRStatus, SUCCESS or FAILED.

tfrecord_iterator()[source]

Yield a dictionary whose keys are fields in schema.

Returns

dict, data dictionary whose keys are the same as columns.

tfrecord_iterator_oldversion()[source]

Yield a dict whose keys are fields in the schema and whose values are the data. This function is for old versions of tensorflow (version number < 2.1.0).

Returns

dict, data dictionary whose keys are the same as columns.

transform()[source]

Encapsulate the mindspore.mindrecord.TFRecordToMR.run() function to exit normally.

Returns

MSRStatus, SUCCESS or FAILED.
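
A minimal sketch of building feature_dict and running the conversion follows. The feature names "image_bytes" and "label" and the function name convert_tfrecord are hypothetical; the actual keys must match the features stored in the TFRecord file. TensorFlow and MindSpore are imported inside the function so that the snippet loads even where neither is installed.

```python
def convert_tfrecord(source, destination):
    """Convert a TFRecord file with an image and a label feature to MindRecord."""
    # Both imports are lazy: TensorFlow and MindSpore are needed only when called.
    import tensorflow as tf
    from mindspore.mindrecord import TFRecordToMR

    # Hypothetical feature names; replace them with those stored in the TFRecord file.
    feature_dict = {
        "image_bytes": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    converter = TFRecordToMR(source, destination, feature_dict,
                             bytes_fields=["image_bytes"])
    return converter.transform()
```

Listing "image_bytes" in bytes_fields marks that feature as raw bytes, in line with the bytes_fields parameter description.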