mindspore.tools

mindspore.tools.stress_detect(detect_type='aic')[source]

Detects faults in hardware compute accuracy or in the communication links between devices. A common usage is to start a new thread, or to call this interface from a Callback function at each step or when saving checkpoints, to check whether a hardware malfunction could affect training accuracy.

Parameters

detect_type (str, optional) – The type of stress test to perform. Two options are available: 'aic' and 'hccs', which run an AiCore stress test and an HCCS link stress test on the device, respectively. Default: 'aic'.

Returns

int, the error type. 0 indicates normal operation; 1 indicates that some or all test cases failed to start; 2 indicates a hardware failure, in which case replacing the device is recommended.

Supported Platforms:

Ascend

Examples

>>> from mindspore.tools import stress_detect
>>> ret = stress_detect()
>>> print(ret)
0
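The integer return codes above can be mapped to actions in a monitoring thread or Callback. The sketch below is a hypothetical helper (the code-to-meaning mapping mirrors the return values documented above; the name interpret_stress_result is illustrative and not part of the MindSpore API, and stress_detect itself requires an Ascend device, so it is only referenced in a comment):

```python
# Hypothetical helper that interprets stress_detect() return codes.
# The mapping follows the documented return values above.
STRESS_RESULTS = {
    0: "normal",
    1: "failed to start some or all test cases",
    2: "hardware failure: replacing the device is recommended",
}

def interpret_stress_result(ret: int) -> str:
    """Translate a stress_detect() return code into a readable message."""
    return STRESS_RESULTS.get(ret, f"unknown error code: {ret}")

# On an Ascend device, `ret` would come from:
#   from mindspore.tools import stress_detect
#   ret = stress_detect(detect_type='aic')
print(interpret_stress_result(0))
```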
mindspore.tools.sdc_detect_start()[source]

Start silent data corruption (SDC) detection. It checks the inputs and outputs of MatMul operations during forward and backward computation on the current device, which increases execution time; the relative overhead decreases as the matrix shapes grow. Enabling SDC detection causes approximately 100% performance degradation for a single 4096-sized MatMul computation, and approximately 90% degradation on the Llama2-7B model (model parallel 4, pipeline parallel 2, with qkv concatenation and ffn concatenation in the decoder layers).

Supported Platforms:

Ascend

Examples

>>> from mindspore.tools import sdc_detect_start
>>> sdc_detect_start()
mindspore.tools.sdc_detect_stop()[source]

Stop silent data corruption detection.

Supported Platforms:

Ascend

Examples

>>> from mindspore.tools import sdc_detect_stop
>>> sdc_detect_stop()
mindspore.tools.get_sdc_detect_result()[source]

Get the result of silent data corruption detection.

Returns

bool, indicating whether silent data corruption has occurred after detection start.

Supported Platforms:

Ascend

Examples

>>> from mindspore.tools import get_sdc_detect_result
>>> result = get_sdc_detect_result()
>>> print(result)
False
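The three SDC functions above are typically used as a start / run / inspect / stop sequence. A context manager makes the start/stop pairing explicit; in this sketch the three callables are injected as parameters so it can run without an Ascend device, and the name sdc_scope is hypothetical, not a MindSpore API. On real hardware you would pass sdc_detect_start, sdc_detect_stop, and get_sdc_detect_result directly:

```python
from contextlib import contextmanager

@contextmanager
def sdc_scope(start, stop, get_result, results):
    """Run a block with SDC detection enabled, recording the result.

    `start`, `stop`, and `get_result` stand in for
    mindspore.tools.sdc_detect_start, sdc_detect_stop, and
    get_sdc_detect_result; they are injected here so the sketch
    runs anywhere.
    """
    start()
    try:
        yield
    finally:
        results.append(get_result())  # read the result before stopping
        stop()

# Stand-ins for the real MindSpore functions.
log = []
results = []
with sdc_scope(lambda: log.append("start"),
               lambda: log.append("stop"),
               lambda: False,  # pretend no corruption was detected
               results):
    log.append("train step")  # forward/backward with MatMul checks active

print(log)      # ['start', 'train step', 'stop']
print(results)  # [False]
```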
mindspore.tools.set_dump(target, enabled=True)[source]

Enable or disable dump for the target and its contents.

target should be an instance of mindspore.nn.Cell or mindspore.ops.Primitive . Note that this API takes effect only when the Dump function is enabled, the dump_mode field in the Dump configuration file is set to "2", and the ms_backend compilation backend is used (refer to the backend parameter of jit). See the dump document for details. By default, instances of mindspore.nn.Cell and mindspore.ops.Primitive do not have the Dump data feature enabled.

Note

  1. This API is only available for JIT compilation, requires 'Ascend' as the device_target and ms_backend as the compilation backend (please refer to the backend parameter in jit), and does not support fused operators.

  2. This API should be called before training starts. Calling it during training may have no effect.

  3. After using set_dump(Cell, True) , operators in forward and backward computation (computation generated by the grad operations) of the cell will be dumped.

  4. For the mindspore.nn.SoftmaxCrossEntropyWithLogits layer, the forward and backward computations use the same set of operators, so you can only see dump data from the backward computation. Note that this layer also uses these operators internally when initialized with sparse=True and reduction="mean" .

Parameters
  • target (Union[Cell, Primitive]) – The Cell instance or Primitive instance to which the dump flag is set.

  • enabled (bool, optional) – Whether dump is enabled for the target. True enables dump; False disables it. Default: True.

Supported Platforms:

Ascend

Examples

Note

Please set the environment variable MINDSPORE_DUMP_CONFIG to the path of the dump configuration file, and set the dump_mode field in that file to 2, before running this example. See the dump document for details.

>>> import numpy as np
>>> import mindspore as ms
>>> import mindspore.nn as nn
>>> from mindspore import Tensor, jit
>>> from mindspore.tools import set_dump
>>>
>>> ms.set_device(device_target="Ascend")
>>>
>>> class MyNet(nn.Cell):
...     def __init__(self):
...         super().__init__()
...         self.conv1 = nn.Conv2d(5, 6, 5, pad_mode='valid')
...         self.relu1 = nn.ReLU()
...
...     @jit
...     def construct(self, x):
...         x = self.conv1(x)
...         x = self.relu1(x)
...         return x
>>>
>>> if __name__ == "__main__":
...     net = MyNet()
...     set_dump(net.conv1)
...     input_tensor = Tensor(np.ones([1, 5, 10, 10], dtype=np.float32))
...     output = net(input_tensor)
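The note above requires a dump configuration file with dump_mode set to 2 (dump only the targets marked with set_dump). The sketch below generates such a file and points MINDSPORE_DUMP_CONFIG at it. The field names follow the MindSpore dump documentation but may vary by version, so treat this as an illustrative sketch rather than an authoritative configuration:

```python
import json
import os
import tempfile

# Minimal dump configuration sketch. dump_mode 2 restricts dumping to
# targets marked with set_dump(); other field values are placeholders.
config = {
    "common_dump_settings": {
        "dump_mode": 2,                 # 2: dump set_dump() targets only
        "path": "/tmp/dump_output",     # where dump data is written
        "net_name": "MyNet",
        "iteration": "all",
        "input_output": 0,              # 0: dump both inputs and outputs
        "kernels": [],
        "support_device": [0, 1, 2, 3, 4, 5, 6, 7],
    }
}

config_path = os.path.join(tempfile.gettempdir(), "dump_config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# MindSpore reads this variable at startup, so set it before launching
# the training process (or export it in the shell instead).
os.environ["MINDSPORE_DUMP_CONFIG"] = config_path
```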