mindspore.train

mindspore.train.summary

SummaryRecord.

Users can use SummaryRecord to dump summary data. A summary is a series of operations that collect data for analysis and visualization.

class mindspore.train.summary.SummaryRecord(log_dir, queue_max_size=0, flush_time=120, file_prefix='events', file_suffix='_MS', network=None)[source]

SummaryRecord is used to record the summary value.

Note

The API will create an event file in a given directory and add summaries and events to it. It writes the event log to a file by executing the record method. In addition, if the SummaryRecord object is created and the summary operator is used in the network, even if the record method is not called, the event in the cache will be written to the file at the end of execution. Make sure to close the SummaryRecord object at the end.

Parameters
  • log_dir (str) – Directory in which the summary is saved.

  • queue_max_size (int) – Capacity of the event queue (reserved). Default: 0.

  • flush_time (int) – Interval, in seconds, at which summaries are flushed to disk. Default: 120.

  • file_prefix (str) – The prefix of file. Default: “events”.

  • file_suffix (str) – The suffix of file. Default: “_MS”.

  • network (Cell) – Network from which a pipeline is obtained for saving the graph summary. Default: None.

Raises
  • TypeError – If queue_max_size or flush_time is not an int, or file_prefix or file_suffix is not a str.

  • RuntimeError – If log_dir cannot be resolved to a canonicalized absolute pathname.

Examples

>>> with SummaryRecord(log_dir="/opt/log", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     pass
close()[source]

Flush all events and close summary records. Use a with statement to close the object automatically.

Examples

>>> with SummaryRecord(log_dir="/opt/log", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     pass # summary_record autoclosed
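
The automatic close relies on Python's context-manager protocol. The following is a minimal sketch with a stand-in class (FakeRecord is hypothetical and is not part of MindSpore); it only illustrates why leaving the with block closes the object without an explicit close() call.

```python
class FakeRecord:
    """Hypothetical stand-in for a record object; not the MindSpore class."""

    def __init__(self):
        self.closed = False
        self.flush_count = 0

    def flush(self):
        # A real recorder would write pending events to disk here.
        self.flush_count += 1

    def close(self):
        # Flush pending events first, then mark the record as closed.
        self.flush()
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with block exits, even if an exception was raised.
        self.close()
        return False  # do not suppress exceptions


with FakeRecord() as record:
    pass  # work with the record here

print(record.closed)
```

Because __exit__ also runs when the block raises, pending events are flushed in the error path as well, which is the reason to prefer the with form over a manual close() call.
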
flush()[source]

Flush the event file to disk.

Call it to make sure that all pending events have been written to disk.

Examples

>>> with SummaryRecord(log_dir="/opt/log", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     summary_record.flush()
property log_dir

Get the full path of the log file.

Examples

>>> with SummaryRecord(log_dir="/opt/log", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     print(summary_record.log_dir)
Returns

String, the full path of log file.

record(step, train_network=None)[source]

Record the summary.

Parameters
  • step (int) – Represents training step number.

  • train_network (Cell) – The network that called the callback. Default: None.

Examples

>>> with SummaryRecord(log_dir="/opt/log", file_prefix="xxx_", file_suffix="_yyy") as summary_record:
>>>     summary_record.record(step=2)
Returns

bool, whether the record process is successful or not.

mindspore.train.callback

Callback related classes and functions.

class mindspore.train.callback.Callback[source]

Abstract base class used to build a callback function.

A callback function performs operations at the current step or epoch.

Examples

>>> class Print_info(Callback):
>>>     def step_end(self, run_context):
>>>         cb_params = run_context.original_args()
>>>         print(cb_params.cur_epoch_num)
>>>         print(cb_params.cur_step_num)
>>>
>>> print_cb = Print_info()
>>> model.train(epoch, dataset, callbacks=print_cb)
begin(run_context)[source]

Called once before the network executes.

Parameters

run_context (RunContext) – Includes some information about the model.

end(run_context)[source]

Called once after network training.

Parameters

run_context (RunContext) – Includes some information about the model.

epoch_begin(run_context)[source]

Called before each epoch begins.

Parameters

run_context (RunContext) – Includes some information about the model.

epoch_end(run_context)[source]

Called after each epoch finishes.

Parameters

run_context (RunContext) – Includes some information about the model.

step_begin(run_context)[source]

Called before each step begins.

Parameters

run_context (RunContext) – Includes some information about the model.

step_end(run_context)[source]

Called after each step finishes.

Parameters

run_context (RunContext) – Includes some information about the model.
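
The order in which the hooks fire can be sketched with a toy training loop. This is an illustration only: RecordingCallback and run_training are hypothetical stand-ins, the call order is inferred from the method descriptions above, and no MindSpore code is involved.

```python
class RecordingCallback:
    """Hypothetical stand-in with the same hook names as Callback."""

    def __init__(self):
        self.calls = []

    def begin(self, run_context):
        self.calls.append("begin")

    def epoch_begin(self, run_context):
        self.calls.append("epoch_begin")

    def step_begin(self, run_context):
        self.calls.append("step_begin")

    def step_end(self, run_context):
        self.calls.append("step_end")

    def epoch_end(self, run_context):
        self.calls.append("epoch_end")

    def end(self, run_context):
        self.calls.append("end")


def run_training(callback, epochs, steps_per_epoch):
    """Toy loop that only dispatches hooks; no real training happens."""
    run_context = {}  # placeholder for a RunContext-like object
    callback.begin(run_context)
    for _ in range(epochs):
        callback.epoch_begin(run_context)
        for _ in range(steps_per_epoch):
            callback.step_begin(run_context)
            callback.step_end(run_context)
        callback.epoch_end(run_context)
    callback.end(run_context)


cb = RecordingCallback()
run_training(cb, epochs=1, steps_per_epoch=2)
print(cb.calls)
```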

class mindspore.train.callback.LossMonitor(per_print_times=1)[source]

Monitor the loss in training.

If the loss is NaN or INF, training is terminated.

Note

If per_print_times is 0, the loss is not printed.

Parameters

per_print_times (int) – Interval, in steps, at which the loss is printed. Default: 1.

Raises

ValueError – If per_print_times is not an int or is less than zero.
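
The behaviour described for LossMonitor can be sketched as a small decision function. check_loss is a hypothetical helper, not a MindSpore API, and the modulo rule used to pick the printing steps is an assumption.

```python
import math


def check_loss(loss, step, per_print_times=1):
    """Return (should_print, should_stop) for one training step.

    Hypothetical helper; printing when step % per_print_times == 0
    is an assumption about how the interval is applied.
    """
    if not isinstance(per_print_times, int) or per_print_times < 0:
        raise ValueError("per_print_times must be an int >= 0")
    # A NaN or INF loss requests termination of training.
    should_stop = math.isnan(loss) or math.isinf(loss)
    # per_print_times == 0 disables printing entirely.
    should_print = per_print_times != 0 and step % per_print_times == 0
    return should_print, should_stop


print(check_loss(0.25, step=4, per_print_times=2))
print(check_loss(float("nan"), step=5, per_print_times=2))
print(check_loss(0.25, step=4, per_print_times=0))
```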

class mindspore.train.callback.TimeMonitor(data_size)[source]

Time Monitor.

class mindspore.train.callback.ModelCheckpoint(prefix='CKP', directory=None, config=None)[source]

The checkpoint callback class.

It is used together with the training process to save the model and network parameters after training.

Parameters
  • prefix (str) – Prefix of the checkpoint file names. Default: “CKP”.

  • directory (str) – Folder path into which checkpoint files will be saved. Default: None.

  • config (CheckpointConfig) – Checkpoint strategy config. Default: None.

Raises
  • ValueError – If the prefix is invalid.

  • TypeError – If config is not of type CheckpointConfig.

end(run_context)[source]

Save the last checkpoint after training finishes.

Parameters

run_context (RunContext) – Context of the training run.

property latest_ckpt_file_name

Return the latest checkpoint path and file name.

step_end(run_context)[source]

Save the checkpoint at the end of a step.

Parameters

run_context (RunContext) – Context of the training run.

class mindspore.train.callback.SummaryStep(summary, flush_step=10)[source]

The summary callback class.

Parameters
  • summary (Object) – Summary record object.

  • flush_step (int) – Interval, in steps, at which the summary is recorded. Default: 10.

step_end(run_context)[source]

Save summary.

Parameters

run_context (RunContext) – Context of the training run.

class mindspore.train.callback.CheckpointConfig(save_checkpoint_steps=1, save_checkpoint_seconds=0, keep_checkpoint_max=5, keep_checkpoint_per_n_minutes=0, integrated_save=True)[source]

The config for model checkpoint.

Note

During training, if the dataset is transmitted through the data channel, it is recommended to set save_checkpoint_steps to an integer multiple of loop_size. Otherwise, the timing of checkpoint saves may deviate.

Parameters
  • save_checkpoint_steps (int) – Number of steps between checkpoint saves. Default: 1.

  • save_checkpoint_seconds (int) – Number of seconds between checkpoint saves. Default: 0. Cannot be used together with save_checkpoint_steps.

  • keep_checkpoint_max (int) – Maximum number of checkpoint files to keep. Default: 5.

  • keep_checkpoint_per_n_minutes (int) – Keep one checkpoint every n minutes. Default: 0. Cannot be used together with keep_checkpoint_max.

  • integrated_save (bool) – Whether to perform an integrated save in the automatic model parallel scenario. Default: True. Integrated save is supported only in the automatic parallel scenario, not in manual parallel.

Raises

ValueError – If the input_param is None or 0.

Examples

>>> config = CheckpointConfig()
>>> ckpoint_cb = ModelCheckpoint(prefix="ck_prefix", directory='./', config=config)
>>> model.train(10, dataset, callbacks=ckpoint_cb)
get_checkpoint_policy()[source]

Get the policy of checkpoint.

property integrated_save

Get the value of _integrated_save.

property keep_checkpoint_max

Get the value of _keep_checkpoint_max.

property keep_checkpoint_per_n_minutes

Get the value of _keep_checkpoint_per_n_minutes.

property save_checkpoint_seconds

Get the value of _save_checkpoint_seconds.

property save_checkpoint_steps

Get the value of _save_checkpoint_steps.
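
How save_checkpoint_steps and keep_checkpoint_max interact can be sketched with a small simulation. retained_checkpoints is a hypothetical helper, not a MindSpore API. It assumes that a checkpoint is written every save_checkpoint_steps steps and that keep_checkpoint_max caps the number of files kept, with the oldest file removed first.

```python
from collections import deque


def retained_checkpoints(total_steps, save_checkpoint_steps=1, keep_checkpoint_max=5):
    """Simulate which checkpoint steps remain on disk after training.

    Hypothetical helper: assumes a save every save_checkpoint_steps steps
    and removal of the oldest file once keep_checkpoint_max is exceeded.
    """
    kept = deque(maxlen=keep_checkpoint_max)  # the oldest entry is dropped first
    for step in range(1, total_steps + 1):
        if step % save_checkpoint_steps == 0:
            kept.append(step)
    return list(kept)


print(retained_checkpoints(total_steps=20, save_checkpoint_steps=2, keep_checkpoint_max=3))
```

Under these assumptions, a run of 20 steps that saves every 2 steps and keeps 3 files ends with only the checkpoints from steps 16, 18 and 20.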

class mindspore.train.callback.RunContext(original_args)[source]

Provides information about the model.

Describes the run call being made and provides information about the original request to the model function. Callback objects can stop the loop by calling request_stop() on run_context.

Parameters

original_args (dict) – Holds information related to the model.

get_stop_requested()[source]

Returns whether a stop is requested or not.

Returns

bool, if True, model.train() stops iterating.

original_args()[source]

Get the _original_args object.

Returns

Dict, an object holding the original arguments of the model.

request_stop()[source]

Sets stop requested during training.

Callbacks can use this function to request stop of iterations. model.train() checks whether this is called or not.
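
The stop-request handshake can be sketched with stand-ins. FakeRunContext and train below are hypothetical and only mirror the method names documented above; the real model.train() loop is not shown in this reference.

```python
class FakeRunContext:
    """Hypothetical stand-in mirroring the RunContext method names."""

    def __init__(self, original_args):
        self._original_args = original_args
        self._stop_requested = False

    def original_args(self):
        return self._original_args

    def request_stop(self):
        self._stop_requested = True

    def get_stop_requested(self):
        return self._stop_requested


def train(run_context, losses, loss_limit):
    """Toy loop: a step_end-style check requests a stop on a large loss."""
    steps_run = 0
    for loss in losses:
        steps_run += 1
        run_context.original_args()["cur_step_num"] = steps_run
        if loss > loss_limit:
            # This branch plays the role of a callback.
            run_context.request_stop()
        if run_context.get_stop_requested():
            # The loop checks the flag and ends the iteration.
            break
    return steps_run


context = FakeRunContext({"cur_step_num": 0})
print(train(context, losses=[0.9, 0.5, 7.0, 0.4], loss_limit=5.0))
```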

mindspore.train.serialization

Model and parameters serialization.

mindspore.train.serialization.save_checkpoint(parameter_list, ckpoint_file_name)[source]

Saves checkpoint info to a specified file.

Parameters
  • parameter_list (list) – Parameters list, each element is a dict like {“name”:xx, “type”:xx, “shape”:xx, “data”:xx}.

  • ckpoint_file_name (str) – Checkpoint file name.

Raises

RuntimeError – Failed to save the Checkpoint file.

mindspore.train.serialization.load_checkpoint(ckpoint_file_name, net=None)[source]

Loads checkpoint info from a specified file.

Parameters
  • ckpoint_file_name (str) – Checkpoint file name.

  • net (Cell) – Cell network. Default: None.

Returns

Dict, key is parameter name, value is a Parameter.

Raises

ValueError – Checkpoint file is incorrect.

mindspore.train.serialization.load_param_into_net(net, parameter_dict)[source]

Loads parameters into network.

Parameters
  • net (Cell) – Cell network.

  • parameter_dict (dict) – Parameter dict.

Raises

TypeError – Argument is not a Cell, or parameter_dict is not a Parameter dict.

mindspore.train.serialization.export(net, *inputs, file_name, file_format='GEIR')[source]

Exports MindSpore predict model to file in specified format.

Parameters
  • net (Cell) – MindSpore network.

  • inputs (Tensor) – Inputs of the net.

  • file_name (str) – File name of model to export.

  • file_format (str) –

    MindSpore currently supports ‘GEIR’, ‘ONNX’ and ‘LITE’ format for exported model.

    • GEIR: Graph Engine Intermediate Representation. An intermediate representation format for Ascend models.

    • ONNX: Open Neural Network eXchange. An open format built to represent machine learning models.

    • LITE: Huawei model format for mobile. A lite model only for MindSpore Lite.

mindspore.train.amp

Auto mixed precision.

mindspore.train.amp.build_train_network(network, optimizer, loss_fn=None, level='O0', **kwargs)[source]

Build the mixed precision training cell automatically.

Parameters
  • network (Cell) – Definition of the network.

  • loss_fn (Union[None, Cell]) – Definition of the loss_fn. If None, the network should have the loss inside. Default: None.

  • optimizer (Optimizer) – Optimizer to update the Parameter.

  • level (str) –

    Supports [O0, O2]. Default: “O0”.

    • O0: Do not change.

    • O2: Cast the network to float16, keep batchnorm and loss_fn (if set) running in float32, and use dynamic loss scaling.

  • cast_model_type (mindspore.dtype) – Supports mstype.float16 or mstype.float32. If set to mstype.float16, float16 mode is used for training. If set, overrides the level setting.

  • keep_batchnorm_fp32 (bool) – Whether to keep batchnorm running in float32. If set, overrides the level setting.

  • loss_scale_manager (Union[None, LossScaleManager]) – If None, the loss is not scaled; otherwise the loss is scaled by the LossScaleManager. If set, overrides the level setting.
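
The way explicit keyword arguments take precedence over the level preset can be sketched as follows. resolve_amp_config and LEVEL_PRESETS are hypothetical. The O2 values follow the level description above, the O0 values are an assumption derived from "Do not change", and plain strings stand in for mindspore dtypes and manager objects.

```python
# Presets inferred from the level descriptions; strings stand in for
# mindspore dtypes and loss scale manager objects.
LEVEL_PRESETS = {
    "O0": {"cast_model_type": "float32", "keep_batchnorm_fp32": False,
           "loss_scale_manager": None},
    "O2": {"cast_model_type": "float16", "keep_batchnorm_fp32": True,
           "loss_scale_manager": "dynamic"},
}


def resolve_amp_config(level="O0", **kwargs):
    """Hypothetical helper: start from the level preset, then apply overrides."""
    if level not in LEVEL_PRESETS:
        raise ValueError(f"level must be one of {sorted(LEVEL_PRESETS)}")
    config = dict(LEVEL_PRESETS[level])  # copy so the preset is not mutated
    for key, value in kwargs.items():
        if key not in config:
            raise ValueError(f"unsupported option: {key}")
        config[key] = value  # an explicit argument wins over the preset
    return config


print(resolve_amp_config("O2"))
print(resolve_amp_config("O2", keep_batchnorm_fp32=False))
```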

mindspore.train.loss_scale_manager

Loss scale manager abstract class.

class mindspore.train.loss_scale_manager.LossScaleManager[source]

Loss scale manager abstract class.

get_loss_scale()[source]

Get loss scale value.

get_update_cell()[source]

Get the loss scaling update logic cell.

update_loss_scale(overflow)[source]

Update loss scale value.

Parameters

overflow (bool) – Whether it overflows.

class mindspore.train.loss_scale_manager.FixedLossScaleManager(loss_scale=128.0, drop_overflow_update=True)[source]

Fixed loss-scale manager.

Parameters
  • loss_scale (float) – Loss scale. Default: 128.0.

  • drop_overflow_update (bool) – Whether to drop the optimizer update when an overflow occurs. Default: True.

Examples

>>> loss_scale_manager = FixedLossScaleManager()
>>> model = Model(net, loss_scale_manager=loss_scale_manager)
get_drop_overflow_update()[source]

Get the flag indicating whether to drop the optimizer update when an overflow occurs.

get_loss_scale()[source]

Get loss scale value.

get_update_cell()[source]

Returns the cell for TrainOneStepWithLossScaleCell.

update_loss_scale(overflow)[source]

Update loss scale value.

Parameters

overflow (bool) – Whether it overflows.

class mindspore.train.loss_scale_manager.DynamicLossScaleManager(init_loss_scale=16777216, scale_factor=2, scale_window=2000)[source]

Dynamic loss-scale manager.

Parameters
  • init_loss_scale (float) – Initial loss scale. Default: 2**24.

  • scale_factor (int) – Factor by which the loss scale is increased or decreased. Default: 2.

  • scale_window (int) – Maximum number of consecutive normal steps without overflow. Default: 2000.

Examples

>>> loss_scale_manager = DynamicLossScaleManager()
>>> model = Model(net, loss_scale_manager=loss_scale_manager)
get_drop_overflow_update()[source]

Get the flag indicating whether to drop the optimizer update when an overflow occurs.

get_loss_scale()[source]

Get loss scale value.

get_update_cell()[source]

Returns the cell for TrainOneStepWithLossScaleCell.

update_loss_scale(overflow)[source]

Update loss scale value.

Parameters

overflow (bool) – Whether it overflows.
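
The roles of init_loss_scale, scale_factor and scale_window in DynamicLossScaleManager can be illustrated with the usual dynamic loss-scaling rule. ToyDynamicLossScale is a hypothetical stand-in, not MindSpore's implementation. It assumes the scale is divided by scale_factor on overflow (with an assumed lower bound of 1) and multiplied by scale_factor after scale_window consecutive steps without overflow.

```python
class ToyDynamicLossScale:
    """Hypothetical stand-in; not MindSpore's DynamicLossScaleManager."""

    def __init__(self, init_loss_scale=2 ** 24, scale_factor=2, scale_window=2000):
        self.loss_scale = init_loss_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0  # consecutive steps without overflow

    def update_loss_scale(self, overflow):
        if overflow:
            # Shrink the scale (assumed lower bound of 1) and restart the count.
            self.loss_scale = max(self.loss_scale / self.scale_factor, 1)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                # Enough clean steps in a row: grow the scale again.
                self.loss_scale *= self.scale_factor
                self.good_steps = 0


scaler = ToyDynamicLossScale(init_loss_scale=1024, scale_factor=2, scale_window=3)
scaler.update_loss_scale(overflow=True)       # 1024 -> 512
for _ in range(3):
    scaler.update_loss_scale(overflow=False)  # third clean step: 512 -> 1024
print(scaler.loss_scale)
```

A fixed loss scale, by contrast, keeps the same value for the whole run, which corresponds to FixedLossScaleManager with its loss_scale parameter.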