mindspore.train

mindspore.train.summary

SummaryRecord.

Use SummaryRecord to dump summary data. A summary is a series of operations that collect data for analysis and visualization.

class mindspore.train.summary.SummaryRecord(log_dir, queue_max_size=0, flush_time=120, file_prefix='events', file_suffix='_MS', network=None, max_file_size=None)[source]

SummaryRecord is used to record the summary data and lineage data.

The API lazily creates a summary file and lineage files in the given directory and writes data to them. The data is written to the files when the ‘record’ method is executed. In addition to recording the data bubbled up from the network by the summary operators, SummaryRecord also supports recording extra data, which can be added by calling add_value.

Note

  1. Make sure to close the SummaryRecord at the end, or the process will not exit. The Examples section below shows two ways to close it properly.

  2. Only one SummaryRecord instance is allowed at a time; otherwise, problems will occur when writing data.
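The two closing patterns mentioned in the note can be illustrated without MindSpore. The sketch below uses a hypothetical stand-in class, `FakeRecord` (not part of MindSpore), which only tracks whether `close()` was called. It assumes that leaving the `with` block calls `close()`, as the note implies.

```python
class FakeRecord:
    """Hypothetical stand-in for a recorder; not a MindSpore class."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always close, even if the body of the with block raised.
        self.close()
        return False  # do not swallow exceptions


# Pattern 1: the with statement closes automatically.
with FakeRecord() as record_a:
    pass  # record data here

# Pattern 2: create the instance first, then guard the work with try/finally.
record_b = FakeRecord()
try:
    pass  # record data here
finally:
    record_b.close()

print(record_a.closed, record_b.closed)
```

Creating the instance before the `try` block matters: if construction fails, there is nothing to close, and `finally` never touches an undefined name.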

Parameters
  • log_dir (str) – The directory in which the summary is saved.

  • queue_max_size (int) – Deprecated. The capacity of the event queue (reserved). Default: 0.

  • flush_time (int) – Deprecated. Frequency, in seconds, at which the summaries are flushed to disk. Default: 120.

  • file_prefix (str) – The prefix of the file name. Default: “events”.

  • file_suffix (str) – The suffix of the file name. Default: “_MS”.

  • network (Cell) – The network from which a pipeline is obtained for saving the graph summary. Default: None.

  • max_file_size (Optional[int]) – The maximum size, in bytes, of each file written to disk. Unlimited by default. For example, to write files no larger than 4GB, specify max_file_size=4 * 1024**3.
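The byte arithmetic in the max_file_size example can be checked with a few lines of plain Python. The helper name `gib_to_bytes` is hypothetical and exists only for this sketch.

```python
def gib_to_bytes(gib):
    """Convert a size in GiB to bytes (1 GiB = 1024**3 bytes)."""
    return gib * 1024 ** 3


max_file_size = gib_to_bytes(4)
print(max_file_size)  # 4294967296
```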

Raises
  • TypeError – If max_file_size, queue_max_size or flush_time is not an int, or if file_prefix or file_suffix is not a str.

  • RuntimeError – If the log_dir cannot be resolved to a canonicalized absolute pathname.

Examples

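>>> from mindspore.train.summary import SummaryRecord
>>>
>>> # Use a with statement to close automatically.
>>> with SummaryRecord(log_dir="./summary_dir") as summary_record:
>>>     pass
>>>
>>> # Use try/finally to ensure closing. Create the instance before the try
>>> # block so that close() is never called on an undefined name.
>>> summary_record = SummaryRecord(log_dir="./summary_dir")
>>> try:
>>>     pass
>>> finally:
>>>     summary_record.close()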
add_value(plugin, name, value)[source]

Add a value to be recorded later.

When the plugin is ‘tensor’, ‘scalar’, ‘image’ or ‘histogram’, the name should be the tag name, and the value should be a Tensor.

When the plugin is ‘graph’, the value should be a GraphProto.

When the plugin is ‘dataset_graph’, ‘train_lineage’, ‘eval_lineage’, or ‘custom_lineage_data’, the value should be a proto message.

Parameters
  • plugin (str) – The plugin for the value.

  • name (str) – The name for the value.

  • value (Union[Tensor, GraphProto, TrainLineage, EvaluationLineage, DatasetGraph, UserDefinedInfo]) –

    The value to store.

    • GraphProto: The ‘value’ should be a serialized string of this type when the plugin is ‘graph’.

    • Tensor: The ‘value’ should be this type when the plugin is ‘scalar’, ‘image’, ‘tensor’ or ‘histogram’.

    • TrainLineage: The ‘value’ should be this type when the plugin is ‘train_lineage’.

    • EvaluationLineage: The ‘value’ should be this type when the plugin is ‘eval_lineage’.

    • DatasetGraph: The ‘value’ should be this type when the plugin is ‘dataset_graph’.

    • UserDefinedInfo: The ‘value’ should be this type when the plugin is ‘custom_lineage_data’.
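The plugin-to-type rules listed above can be summarized as a lookup table. This is only a sketch: the type names are plain strings rather than the real classes, and `expected_type` is a hypothetical helper, not a MindSpore function.

```python
# Mapping taken from the parameter description; type names are plain strings here.
EXPECTED_VALUE_TYPE = {
    "scalar": "Tensor",
    "image": "Tensor",
    "tensor": "Tensor",
    "histogram": "Tensor",
    "graph": "GraphProto",
    "train_lineage": "TrainLineage",
    "eval_lineage": "EvaluationLineage",
    "dataset_graph": "DatasetGraph",
    "custom_lineage_data": "UserDefinedInfo",
}


def expected_type(plugin):
    """Return the documented value type name for a plugin (hypothetical helper)."""
    try:
        return EXPECTED_VALUE_TYPE[plugin]
    except KeyError:
        raise ValueError(f"unknown plugin: {plugin!r}") from None


print(expected_type("scalar"))
print(expected_type("eval_lineage"))
```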

Raises

Examples

>>> with SummaryRecord(log_dir="./summary_dir", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     summary_record.add_value('scalar', 'loss', Tensor(0.1))
close()[source]

Flush all events and close the summary records. Prefer a with statement, which closes the record automatically.

Examples

>>> # Create the instance before the try block so that close() always has a target.
>>> summary_record = SummaryRecord(log_dir="./summary_dir")
>>> try:
>>>     pass
>>> finally:
>>>     summary_record.close()
flush()[source]

Flush the event file to disk.

Call it to make sure that all pending events have been written to disk.

Examples

>>> with SummaryRecord(log_dir="./summary_dir", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     summary_record.flush()
property log_dir

Get the full path of the log file.

Returns

str, the full path of log file.

Examples

>>> with SummaryRecord(log_dir="./summary_dir", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     print(summary_record.log_dir)
record(step, train_network=None, plugin_filter=None)[source]

Record the summary.

Parameters
  • step (int) – Represents training step number.

  • train_network (Cell) – The network that called the callback.

  • plugin_filter (Optional[Callable[[str], bool]]) – The filter function. A plugin is not written when the function returns False for it.

Returns

bool, whether the record process is successful or not.
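A plugin filter is simply a callable that takes a plugin name and returns a bool. The sketch below shows one possible filter applied to a list of plugin names in plain Python. The function name `skip_histograms` and the list of names are illustrative assumptions; the filtering loop is not MindSpore's own code.

```python
def skip_histograms(plugin):
    """Return False for plugins that should NOT be written."""
    return plugin != "histogram"


plugins = ["scalar", "image", "histogram", "tensor"]
written = [name for name in plugins if skip_histograms(name)]
print(written)  # ['scalar', 'image', 'tensor']
```

Given the signature above, such a callable would be passed as `plugin_filter=skip_histograms` when calling `record`.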

Examples

>>> with SummaryRecord(log_dir="./summary_dir", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     summary_record.record(step=2)
set_mode(mode)[source]

Set the mode that the recorder should be aware of. The mode is ‘train’ by default.

Parameters

mode (str) – The mode to set, which should be ‘train’ or ‘eval’.

Raises

ValueError – When the mode is not recognized.

Examples

>>> with SummaryRecord(log_dir="./summary_dir", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     summary_record.set_mode('eval')

mindspore.train.callback

Callback related classes and functions.

class mindspore.train.callback.Callback[source]

Abstract base class used to build a callback class. Callbacks are context managers which will be entered and exited when passing into the Model. You can leverage this mechanism to init and release resources automatically.

A callback function performs some operations on the current step or epoch.

Examples

>>> class Print_info(Callback):
>>>     def step_end(self, run_context):
>>>         cb_params = run_context.original_args()
>>>         print(cb_params.cur_epoch_num)
>>>         print(cb_params.cur_step_num)
>>>
>>> print_cb = Print_info()
>>> model.train(epoch, dataset, callbacks=print_cb)
begin(run_context)[source]

Called once before the network executes.

Parameters

run_context (RunContext) – Includes some information about the model.

end(run_context)[source]

Called once after network training.

Parameters

run_context (RunContext) – Includes some information about the model.

epoch_begin(run_context)[source]

Called before each epoch begins.

Parameters

run_context (RunContext) – Includes some information about the model.

epoch_end(run_context)[source]

Called after each epoch finishes.

Parameters

run_context (RunContext) – Includes some information about the model.

step_begin(run_context)[source]

Called before each step begins.

Parameters

run_context (RunContext) – Includes some information about the model.

step_end(run_context)[source]

Called after each step finished.

Parameters

run_context (RunContext) – Includes some information about the model.
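The order in which the hooks fire follows from their descriptions. The sketch below drives a hypothetical stand-in class with a hypothetical training loop, neither of which is MindSpore code. The hook names match Callback, and the loop structure is an assumption made for illustration.

```python
class RecordingCallback:
    """Hypothetical stand-in with the same hook names as Callback."""

    def __init__(self):
        self.calls = []

    def begin(self, run_context):
        self.calls.append("begin")

    def epoch_begin(self, run_context):
        self.calls.append("epoch_begin")

    def step_begin(self, run_context):
        self.calls.append("step_begin")

    def step_end(self, run_context):
        self.calls.append("step_end")

    def epoch_end(self, run_context):
        self.calls.append("epoch_end")

    def end(self, run_context):
        self.calls.append("end")


def fake_train(callback, epochs, steps_per_epoch):
    """Hypothetical training loop that only drives the hooks."""
    run_context = None  # a real loop would pass a RunContext here
    callback.begin(run_context)
    for _ in range(epochs):
        callback.epoch_begin(run_context)
        for _ in range(steps_per_epoch):
            callback.step_begin(run_context)
            callback.step_end(run_context)
        callback.epoch_end(run_context)
    callback.end(run_context)


cb = RecordingCallback()
fake_train(cb, epochs=1, steps_per_epoch=2)
print(cb.calls)
```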

class mindspore.train.callback.LossMonitor(per_print_times=1)[source]

Monitor the loss in training.

If the loss is NAN or INF, it will terminate training.

Note

If per_print_times is 0, the loss is not printed.

Parameters

per_print_times (int) – Print the loss every per_print_times steps. Default: 1.

Raises

ValueError – If per_print_times is not an int or is less than zero.
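The behavior described for LossMonitor can be sketched in plain Python. This is not the library's implementation: `monitor_loss` is a hypothetical helper that returns the steps at which the loss would be printed and the step at which a NAN or INF loss would end training.

```python
import math


def monitor_loss(losses, per_print_times=1):
    """Return (printed_steps, stopped_at) for a list of loss values.

    Hypothetical helper: stopped_at is None if no NAN/INF loss was seen.
    """
    if not isinstance(per_print_times, int) or per_print_times < 0:
        raise ValueError("per_print_times must be an int >= 0")
    printed = []
    for step, loss in enumerate(losses, start=1):
        if math.isnan(loss) or math.isinf(loss):
            return printed, step  # training would terminate here
        if per_print_times != 0 and step % per_print_times == 0:
            printed.append(step)
    return printed, None


print(monitor_loss([0.9, 0.7, 0.5, 0.4], per_print_times=2))  # ([2, 4], None)
print(monitor_loss([0.9, float("nan"), 0.5]))                 # ([1], 2)
```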

class mindspore.train.callback.TimeMonitor(data_size=None)[source]

Monitor the time in training.

Parameters

data_size (int) – Dataset size. Default: None.

class mindspore.train.callback.ModelCheckpoint(prefix='CKP', directory=None, config=None)[source]

The checkpoint callback class.

It is used together with the training process to save the model and network parameters after training.

Parameters
  • prefix (str) – Prefix of the checkpoint file names. Default: “CKP”.

  • directory (str) – Folder path into which checkpoint files will be saved. Default: None.

  • config (CheckpointConfig) – Checkpoint strategy config. Default: None.

Raises
  • ValueError – If the prefix is invalid.

  • TypeError – If the config is not CheckpointConfig type.

end(run_context)[source]

Save the last checkpoint after training finished.

Parameters

run_context (RunContext) – Context of the train running.

property latest_ckpt_file_name

Return the latest checkpoint path and file name.

step_end(run_context)[source]

Save the checkpoint at the end of step.

Parameters

run_context (RunContext) – Context of the train running.

class mindspore.train.callback.SummaryCollector(summary_dir, collect_freq=10, collect_specified_data=None, keep_default_action=True, custom_lineage_data=None, collect_tensor_freq=None, max_file_size=None)[source]

SummaryCollector can help you to collect some common information.

It can help you collect the loss, learning rate, computational graph and so on. SummaryCollector also persists data collected by the summary operator into a summary file.

Note

  1. Multiple SummaryCollector instances in callback list are not allowed.

  2. Not all information is collected at the training phase or at the eval phase.

  3. SummaryCollector always records the data collected by the summary operator.

Parameters
  • summary_dir (str) – The collected data will be persisted to this directory. If the directory does not exist, it will be created automatically.

  • collect_freq (int) – The frequency of data collection, in steps. It should be greater than zero. Default: 10. When a frequency is set, data is collected whenever (current steps % freq) == 0, and the first step is always collected. Note that if the data sink mode is used, the unit becomes the epoch. Collecting data too frequently is not recommended because it can affect performance.

  • collect_specified_data (Union[None, dict]) –

    Perform custom operations on the collected data. Default: None, which means all data is collected as the default behavior. If you want to customize the data collected, you can do so with a dictionary. For example, you can set {‘collect_metric’: False} to disable metric collection. The data that can be controlled is listed below.

    • collect_metric: Whether to collect training metrics, currently only loss is collected. The first output will be treated as loss, and it will be averaged. Optional: True/False. Default: True.

    • collect_graph: Whether to collect computational graph, currently only training computational graph is collected. Optional: True/False. Default: True.

    • collect_train_lineage: Whether to collect lineage data for the training phase. This field is displayed on the lineage page of MindInsight. Optional: True/False. Default: True.

    • collect_eval_lineage: Whether to collect lineage data for the eval phase. This field is displayed on the lineage page of MindInsight. Optional: True/False. Default: True.

    • collect_input_data: Whether to collect dataset for each training. Currently only image data is supported. Optional: True/False. Default: True.

    • collect_dataset_graph: Whether to collect dataset graph for the training phase. Optional: True/False. Default: True.

    • histogram_regular: Collect weights and biases for display on the parameter distribution page in MindInsight. This field accepts a regular expression string that controls which parameters to collect. Default: None, which means only the first five parameters are collected. Collecting too many parameters at once is not recommended because it can affect performance. Note that if you collect too many parameters and run out of memory, the training will fail.

  • keep_default_action (bool) – This field affects the collection behavior of the ‘collect_specified_data’ field. Optional: True/False, Default: True. True: means that after specified data is set, non-specified data is collected as the default behavior. False: means that after specified data is set, only the specified data is collected, and the others are not collected.

  • custom_lineage_data (Union[dict, None]) – Allows you to customize the data and present it on the MindInsight lineage page. In the custom data, keys must be of type str, and values must be of type str, int or float. Default: None, which means there is no custom data.

  • collect_tensor_freq (Optional[int]) – Same semantics as collect_freq, but controls TensorSummary only. Because TensorSummary data is much larger than other summary data, this parameter is used to reduce how often it is collected. By default, TensorSummary data is collected for at most 21 steps, and for no more steps than other summary data. Default: None, which means the behavior described above is followed. For example, given collect_freq=10, when the total number of steps is 600, TensorSummary is collected for 21 steps and other summary data for 61 steps; when the total number of steps is 20, both are collected for 3 steps. Also note that in parallel mode, the total steps are split evenly, which affects how many steps TensorSummary is collected for.

  • max_file_size (Optional[int]) – The maximum size, in bytes, of each file written to disk. Default: None, which means no limit. For example, to write files no larger than 4GB, specify max_file_size=4 * 1024**3.
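The step counts quoted for collect_freq and collect_tensor_freq can be reproduced with a short plain-Python sketch. The helper names are hypothetical. The rule for which steps are collected comes from the collect_freq description, and the cap of 21 is applied only as a count, since the text does not say which steps TensorSummary picks.

```python
def collected_steps(total_steps, collect_freq):
    """Steps at which data is collected: the first step plus every multiple of collect_freq."""
    if collect_freq <= 0:
        raise ValueError("collect_freq must be greater than zero")
    return [step for step in range(1, total_steps + 1)
            if step == 1 or step % collect_freq == 0]


def default_tensor_step_count(total_steps, collect_freq, cap=21):
    """Number of TensorSummary steps when collect_tensor_freq is None (count only)."""
    return min(cap, len(collected_steps(total_steps, collect_freq)))


print(len(collected_steps(600, 10)))       # 61
print(default_tensor_step_count(600, 10))  # 21
print(collected_steps(20, 10))             # [1, 10, 20]
print(default_tensor_step_count(20, 10))   # 3
```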

Raises
  • ValueError – If the parameter value is not expected.

  • TypeError – If the parameter type is not expected.

  • RuntimeError – If an error occurs during data collection.
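One way to read the interaction between collect_specified_data and keep_default_action is as a dictionary merge. The sketch below is an interpretation of the parameter descriptions, not the library's code. `effective_settings` is a hypothetical helper, and it covers only the True/False switches (histogram_regular is left out because it is not a boolean).

```python
# Boolean switches and their documented defaults.
DEFAULT_SWITCHES = {
    "collect_metric": True,
    "collect_graph": True,
    "collect_train_lineage": True,
    "collect_eval_lineage": True,
    "collect_input_data": True,
    "collect_dataset_graph": True,
}


def effective_settings(specified=None, keep_default_action=True):
    """Resolve which switches are on (hypothetical helper)."""
    if specified is None:
        # Nothing specified: everything follows the default behavior.
        return dict(DEFAULT_SWITCHES)
    if keep_default_action:
        # Non-specified data keeps its default; specified entries override.
        return {**DEFAULT_SWITCHES, **specified}
    # Only the specified data is collected; everything else is switched off.
    all_off = {key: False for key in DEFAULT_SWITCHES}
    return {**all_off, **specified}


print(effective_settings({"collect_metric": False}))
print(effective_settings({"collect_metric": True}, keep_default_action=False))
```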

Examples

>>> # Simple usage:
>>> summary_collector = SummaryCollector(summary_dir='./summary_dir')
>>> model.train(epoch, dataset, callbacks=summary_collector)
>>>
>>> # Do not collect metrics; collect the parameters of the first layer; everything else is collected by default
>>> specified = {'collect_metric': False, 'histogram_regular': '^conv1.*'}
>>> summary_collector = SummaryCollector(summary_dir='./summary_dir', collect_specified_data=specified)
>>> model.train(epoch, dataset, callbacks=summary_collector)
>>>
>>> # Only collect metrics and custom lineage data, and record the data collected by the summary operator;
>>> # nothing else is collected
>>> specified = {'collect_metric': True}
>>> summary_collector = SummaryCollector('./summary_dir',
>>>                                      collect_specified_data=specified,
>>>                                      keep_default_action=False,
>>>                                      custom_lineage_data={'version': 'resnet50_v1'}
>>>                                      )
>>> model.train(epoch, dataset, callbacks=summary_collector)
class mindspore.train.callback.CheckpointConfig(save_checkpoint_steps=1, save_checkpoint_seconds=0, keep_checkpoint_max=5, keep_checkpoint_per_n_minutes=0, integrated_save=True, async_save=False)[source]

The config for model checkpoint.

Note

During training, if the dataset is transmitted through the data channel, it is suggested to set save_checkpoint_steps to an integer multiple of loop_size. Otherwise, the timing of checkpoint saving may deviate.

Parameters
  • save_checkpoint_steps (int) – Number of steps between checkpoint saves. Default: 1.

  • save_checkpoint_seconds (int) – Number of seconds between checkpoint saves. Default: 0. Can’t be used with save_checkpoint_steps at the same time.

  • keep_checkpoint_max (int) – Maximum number of checkpoint files to keep. Default: 5.

  • keep_checkpoint_per_n_minutes (int) – Keep one checkpoint every n minutes. Default: 0. Can’t be used with keep_checkpoint_max at the same time.

  • integrated_save (bool) – Whether to perform an integrated save in the automatic model parallel scenario. Default: True. The integrated save function is only supported in the automatic parallel scenario, not in manual parallel.

  • async_save (bool) – Whether to save the checkpoint to a file asynchronously. Default: False.

Raises

ValueError – If the input_param is None or 0.
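A retention limit such as keep_checkpoint_max can be pictured as a bounded queue. The sketch below assumes the limit means that only the most recent N checkpoint files are kept and the oldest is removed first; the source does not spell this out. The helper `retained_checkpoints` and the file names are hypothetical.

```python
from collections import deque


def retained_checkpoints(saved_files, keep_checkpoint_max=5):
    """Return the files still kept if only the newest N are retained."""
    kept = deque(maxlen=keep_checkpoint_max)
    for name in saved_files:
        kept.append(name)  # the oldest entry is dropped automatically when full
    return list(kept)


# Hypothetical file names, one per saved step.
files = [f"ck_prefix-{step}.ckpt" for step in range(1, 8)]
print(retained_checkpoints(files))
```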

Examples

>>> config = CheckpointConfig()
>>> ckpoint_cb = ModelCheckpoint(prefix="ck_prefix", directory='./', config=config)
>>> model.train(10, dataset, callbacks=ckpoint_cb)
property async_save

Get the value of _async_save.

get_checkpoint_policy()[source]

Get the policy of checkpoint.

property integrated_save

Get the value of _integrated_save.

property keep_checkpoint_max

Get the value of _keep_checkpoint_max.

property keep_checkpoint_per_n_minutes

Get the value of _keep_checkpoint_per_n_minutes.

property save_checkpoint_seconds

Get the value of _save_checkpoint_seconds.

property save_checkpoint_steps

Get the value of _save_checkpoint_steps.

class mindspore.train.callback.RunContext(original_args)[source]

Provides information about the model.

Provides information about the run call being made, that is, about the original request to the model function. Callback objects can stop the loop by calling request_stop() of run_context.

Parameters

original_args (dict) – Holds the related information of the model.

get_stop_requested()[source]

Returns whether a stop is requested or not.

Returns

bool, if true, model.train() stops iterations.

original_args()[source]

Get the _original_args object.

Returns

Dict, an object holding the original arguments of the model.

request_stop()[source]

Sets stop requested during training.

Callbacks can use this function to request stop of iterations. model.train() checks whether this is called or not.
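The stop-request mechanism amounts to a flag that callbacks set and the training loop polls. The sketch below uses a hypothetical stand-in class and loop rather than MindSpore code. The method names mirror RunContext, and the contents of the arguments dict are made up for illustration.

```python
class FakeRunContext:
    """Hypothetical stand-in mirroring the RunContext method names."""

    def __init__(self, original_args):
        self._original_args = original_args
        self._stop_requested = False

    def original_args(self):
        return self._original_args

    def request_stop(self):
        self._stop_requested = True

    def get_stop_requested(self):
        return self._stop_requested


def run_steps(run_context, max_steps, stop_at):
    """Hypothetical loop: a callback-like check requests a stop at step `stop_at`."""
    executed = 0
    for step in range(1, max_steps + 1):
        executed = step
        if step == stop_at:
            run_context.request_stop()  # what a callback would do
        if run_context.get_stop_requested():
            break  # the loop checks the flag and stops iterating
    return executed


ctx = FakeRunContext({"epoch_num": 1})
print(run_steps(ctx, max_steps=10, stop_at=3))  # 3
```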

mindspore.train.serialization

Model and parameters serialization.

mindspore.train.serialization.save_checkpoint(parameter_list, ckpt_file_name, async_save=False)[source]

Saves checkpoint info to a specified file.

Parameters
  • parameter_list (list) – Parameters list, each element is a dict like {“name”:xx, “type”:xx, “shape”:xx, “data”:xx}.

  • ckpt_file_name (str) – Checkpoint file name.

  • async_save (bool) – Whether to save the checkpoint to a file asynchronously. Default: False.

Raises

RuntimeError – Failed to save the Checkpoint file.

mindspore.train.serialization.load_checkpoint(ckpt_file_name, net=None)[source]

Loads checkpoint info from a specified file.

Parameters
  • ckpt_file_name (str) – Checkpoint file name.

  • net (Cell) – Cell network. Default: None

Returns

Dict, key is parameter name, value is a Parameter.

Raises

ValueError – Checkpoint file is incorrect.

mindspore.train.serialization.load_param_into_net(net, parameter_dict)[source]

Loads parameters into network.

Parameters
  • net (Cell) – Cell network.

  • parameter_dict (dict) – Parameter dict.

Raises

TypeError – Argument is not a Cell, or parameter_dict is not a Parameter dict.

mindspore.train.serialization.export(net, *inputs, file_name, file_format='GEIR')[source]

Exports the MindSpore prediction model to a file in the specified format.

Parameters
  • net (Cell) – MindSpore network.

  • inputs (Tensor) – Inputs of the net.

  • file_name (str) – File name of model to export.

  • file_format (str) –

    MindSpore currently supports the ‘GEIR’, ‘ONNX’ and ‘BINARY’ formats for the exported model.

    • GEIR: Graph Engine Intermediate Representation. An intermediate representation format of Ascend models.

    • ONNX: Open Neural Network eXchange. An open format built to represent machine learning models.

    • BINARY: Binary format for models. An intermediate representation format for models.

mindspore.train.serialization.parse_print(print_file_name)[source]

Loads Print data from a specified file.

Parameters

print_file_name (str) – The name of the file that holds the saved print data.

Returns

List, each element of which is a Tensor.

Raises

ValueError – The print file may be empty; make sure to enter the correct file name.

mindspore.train.amp

Auto mixed precision.

mindspore.train.amp.build_train_network(network, optimizer, loss_fn=None, level='O0', **kwargs)[source]

Build the mixed precision training cell automatically.

Parameters
  • network (Cell) – Definition of the network.

  • loss_fn (Union[None, Cell]) – Definition of the loss_fn. If None, the network should have the loss inside. Default: None.

  • optimizer (Optimizer) – Optimizer to update the Parameter.

  • level (str) –

    Supports [O0, O2, O3]. Default: “O0”.

    • O0: Do not change.

    • O2: Cast network to float16, keep batchnorm and loss_fn (if set) run in float32, using dynamic loss scale.

    • O3: Cast network to float16, with additional property ‘keep_batchnorm_fp32=False’.

    O2 is recommended on GPU, O3 is recommended on Ascend.

  • cast_model_type (mindspore.dtype) – Supports mstype.float16 or mstype.float32. If set to mstype.float16, use float16 mode to train. If set, overwrite the level setting.

  • keep_batchnorm_fp32 (bool) – Keep Batchnorm running in float32. If set, it overwrites the level setting. keep_batchnorm_fp32 takes effect only when cast_model_type is float16.

  • loss_scale_manager (Union[None, LossScaleManager]) – If None, the loss is not scaled; otherwise, the loss is scaled by the LossScaleManager. If set, it overwrites the level setting.
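The relationship between level and the keyword arguments that overwrite it can be sketched as a preset table plus an override step. This is an illustration of the parameter descriptions only. `resolve_config` is a hypothetical helper, strings stand in for dtypes and manager objects, and each preset lists only what the text states.

```python
# Presets as described for each level; strings stand in for real dtypes/managers.
LEVEL_PRESETS = {
    "O0": {},  # do not change anything
    "O2": {
        "cast_model_type": "float16",
        "keep_batchnorm_fp32": True,
        "loss_scale_manager": "dynamic",
    },
    "O3": {
        "cast_model_type": "float16",
        "keep_batchnorm_fp32": False,
    },
}


def resolve_config(level="O0", **overrides):
    """Start from the level preset, then let explicit keyword arguments win."""
    if level not in LEVEL_PRESETS:
        raise ValueError(f"level should be one of {sorted(LEVEL_PRESETS)}")
    config = dict(LEVEL_PRESETS[level])
    config.update(overrides)
    return config


print(resolve_config("O2"))
print(resolve_config("O3", keep_batchnorm_fp32=True))
```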

mindspore.train.loss_scale_manager

Loss scale manager abstract class.

class mindspore.train.loss_scale_manager.LossScaleManager[source]

Loss scale manager abstract class.

get_loss_scale()[source]

Get loss scale value.

get_update_cell()[source]

Get the loss scaling update logic cell.

update_loss_scale(overflow)[source]

Update loss scale value.

Parameters

overflow (bool) – Whether it overflows.

class mindspore.train.loss_scale_manager.FixedLossScaleManager(loss_scale=128.0, drop_overflow_update=True)[source]

Fixed loss-scale manager.

Parameters
  • loss_scale (float) – Loss scale. Default: 128.0.

  • drop_overflow_update (bool) – Whether to drop the optimizer update when there is an overflow. Default: True.

Examples

>>> loss_scale_manager = FixedLossScaleManager()
>>> model = Model(net, loss_scale_manager=loss_scale_manager)
get_drop_overflow_update()[source]

Get the flag indicating whether to drop the optimizer update when an overflow happens.

get_loss_scale()[source]

Get loss scale value.

get_update_cell()[source]

Returns the cell for TrainOneStepWithLossScaleCell.

update_loss_scale(overflow)[source]

Update loss scale value.

Parameters

overflow (bool) – Whether it overflows.

class mindspore.train.loss_scale_manager.DynamicLossScaleManager(init_loss_scale=16777216, scale_factor=2, scale_window=2000)[source]

Dynamic loss-scale manager.

Parameters
  • init_loss_scale (float) – Init loss scale. Default: 2**24.

  • scale_factor (int) – Coefficient of increase and decrease. Default: 2.

  • scale_window (int) – Maximum number of consecutive normal (overflow-free) steps. Default: 2000.

Examples

>>> loss_scale_manager = DynamicLossScaleManager()
>>> model = Model(net, loss_scale_manager=loss_scale_manager)
get_drop_overflow_update()[source]

Get the flag indicating whether to drop the optimizer update when an overflow happens.

get_loss_scale()[source]

Get loss scale value.

get_update_cell()[source]

Returns the cell for TrainOneStepWithLossScaleCell.

update_loss_scale(overflow)[source]

Update loss scale value.

Parameters

overflow (bool) – Whether it overflows.
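A common dynamic loss-scaling scheme can be sketched with the three parameters above. This is a generic illustration and is not taken from MindSpore's source. The class `DynamicScaleSketch`, the floor of 1 on the scale, and the counter reset are all assumptions of this sketch.

```python
class DynamicScaleSketch:
    """Generic dynamic loss scaling; a sketch, not the MindSpore implementation."""

    def __init__(self, init_loss_scale=2 ** 24, scale_factor=2, scale_window=2000):
        self.loss_scale = init_loss_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0  # consecutive steps without overflow

    def update_loss_scale(self, overflow):
        if overflow:
            # Shrink the scale on overflow (assumed floor of 1) and restart counting.
            self.loss_scale = max(self.loss_scale / self.scale_factor, 1)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                # Enough clean steps in a row: grow the scale again.
                self.loss_scale *= self.scale_factor
                self.good_steps = 0


manager = DynamicScaleSketch(init_loss_scale=1024, scale_factor=2, scale_window=3)
manager.update_loss_scale(True)       # overflow: 1024 -> 512
for _ in range(3):
    manager.update_loss_scale(False)  # three clean steps: 512 -> 1024
print(manager.loss_scale)
```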

mindspore.train.quant

Quantization.

You can train a model with quantization aware training. MindSpore supports quantization aware training, which models quantization errors in both the forward and backward passes using fake-quantization ops. Note that the entire computation is carried out in floating point. At the end of quantization aware training, MindSpore provides conversion functions to convert the trained model into a lower-precision one.
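The idea of a fake-quantization op (round a float onto an integer grid, then map it back to float) can be shown in a few lines of plain Python. This is a generic textbook scheme and is not MindSpore's implementation. The helper names and the range conventions for symmetric and narrow_range are assumptions of this sketch.

```python
def quant_range(num_bits=8, symmetric=False, narrow_range=False):
    """Integer range [qmin, qmax] for the given settings (assumed conventions)."""
    if symmetric:
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** num_bits - 1
    if narrow_range:
        qmin += 1  # drop the lowest level
    return qmin, qmax


def fake_quantize(x, min_val, max_val, num_bits=8):
    """Quantize x onto an integer grid, then map it back to float."""
    qmin, qmax = quant_range(num_bits)
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = qmin - round(min_val / scale)
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))      # clamp to the representable range
    return (q - zero_point) * scale  # the result is still a float


print(quant_range(8))                                     # (0, 255)
print(quant_range(8, symmetric=True, narrow_range=True))  # (-127, 127)
print(fake_quantize(0.3, min_val=-1.0, max_val=1.0))
```

The output of `fake_quantize` differs from its input by at most half a quantization step for values inside the range, which is the quantization error that training then learns to tolerate.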

mindspore.train.quant.convert_quant_network(network, bn_fold=False, freeze_bn=0, quant_delay=(0, 0), num_bits=(8, 8), per_channel=(False, False), symmetric=(False, False), narrow_range=(False, False))[source]

Create quantization aware training network.

Parameters
  • network (Cell) – The network to be converted into a quantization aware training network.

  • bn_fold (bool) – Flag indicating whether to use bn fold ops to simulate the inference operation. Default: False.

  • freeze_bn (int) – Number of steps after which the BatchNorm OP parameters use the total mean and variance. Default: 0.

  • quant_delay (int, list or tuple) – Number of steps after which weights and activations are quantized during eval. The first element represents weights and the second element represents data flow. Default: (0, 0)

  • num_bits (int, list or tuple) – Number of bits to use for quantizing weights and activations. The first element represents weights and the second element represents data flow. Default: (8, 8)

  • per_channel (bool, list or tuple) – Quantization granularity, per layer or per channel. If True, quantization is per channel; otherwise, it is per layer. The first element represents weights and the second element represents data flow. Default: (False, False)

  • symmetric (bool, list or tuple) – Whether the quantization algorithm is symmetric. If True, symmetric quantization is used; otherwise, asymmetric quantization is used. The first element represents weights and the second element represents data flow. Default: (False, False)

  • narrow_range (bool, list or tuple) – Whether the quantization algorithm uses a narrow range. If True, a narrow range is used; otherwise, it is not. The first element represents weights and the second element represents data flow. Default: (False, False)

Returns

Cell, the network that has been converted into a quantization aware training network cell.