mindspore.tools
- mindspore.tools.stress_detect(detect_type='aic')[source]
Used to detect faults in hardware precision or in communication between links. A common usage pattern is to start a new thread, or call this interface from a Callback function, at each step or when saving checkpoints, to check whether a hardware malfunction could affect accuracy; a sketch of this Callback pattern follows the example below.
- Parameters
detect_type (str, optional) – The type of stress test to perform. Two options are available: 'aic' and 'hccs', which run AiCore and HCCS link stress tests on the device, respectively. Default: "aic".
- Returns
int, the return value indicates the error type: 0 means normal; 1 means some or all test cases failed to start; 2 means a hardware fault was detected, in which case replacing the device is recommended.
- Supported Platforms:
Ascend
Examples
>>> from mindspore.tools import stress_detect
>>> ret = stress_detect()
>>> print(ret)
0
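A minimal sketch of the Callback pattern mentioned above, assuming mindspore.train.Callback and its on_train_step_end hook; the check interval and the reaction to each return code are illustrative choices for this sketch, not prescribed by the API.
>>> from mindspore.train import Callback
>>> from mindspore.tools import stress_detect
>>>
>>> class StressDetectCallback(Callback):
...     # Runs a stress test every `interval` steps; `interval` is a
...     # hypothetical knob chosen for this sketch.
...     def __init__(self, interval=1000, detect_type='aic'):
...         super().__init__()
...         self.interval = interval
...         self.detect_type = detect_type
...
...     def on_train_step_end(self, run_context):
...         cb_params = run_context.original_args()
...         if cb_params.cur_step_num % self.interval == 0:
...             ret = stress_detect(detect_type=self.detect_type)
...             if ret == 2:
...                 # Hardware fault detected: stop training so the
...                 # device can be inspected or replaced.
...                 run_context.request_stop()
Such a callback would be passed in the callbacks list of Model.train, alongside the checkpoint callbacks.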
- mindspore.tools.sdc_detect_start()[source]
Start silent data corruption (SDC) detection. It checks the inputs and outputs of MatMul operations during forward and backward computation on the current device, which may increase execution time. The relative overhead of the check decreases as matrix shapes grow: starting SDC detection causes approximately 100% performance degradation for a single 4096-sized MatMul computation, and approximately 90% degradation on the Llama2-7B model (model parallel 4, pipeline parallel 2, with qkv concatenation and ffn concatenation in the decoder layers).
- Supported Platforms:
Ascend
Examples
>>> from mindspore.tools import sdc_detect_start
>>> sdc_detect_start()
- mindspore.tools.sdc_detect_stop()[source]
Stop silent data corruption detection.
- Supported Platforms:
Ascend
Examples
>>> from mindspore.tools import sdc_detect_stop
>>> sdc_detect_stop()
- mindspore.tools.get_sdc_detect_result()[source]
Get the result of silent data corruption detection.
- Returns
bool, indicating whether silent data corruption has occurred after detection start.
- Supported Platforms:
Ascend
Examples
>>> from mindspore.tools import get_sdc_detect_result
>>> result = get_sdc_detect_result()
>>> print(result)
False
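Taken together, sdc_detect_start, sdc_detect_stop and get_sdc_detect_result form a start/compute/stop/query protocol. A minimal sketch of a full detection window follows; the MatMul workload here is only a stand-in for a real training step.
>>> import numpy as np
>>> import mindspore as ms
>>> from mindspore import Tensor, ops
>>> from mindspore.tools import sdc_detect_start, sdc_detect_stop, get_sdc_detect_result
>>>
>>> ms.set_device(device_target="Ascend")
>>> sdc_detect_start()  # begin checking MatMul inputs and outputs
>>> x = Tensor(np.ones([4096, 4096], dtype=np.float32))
>>> w = Tensor(np.ones([4096, 4096], dtype=np.float32))
>>> y = ops.matmul(x, w)  # stand-in workload; detection overhead applies here
>>> sdc_detect_stop()  # stop checking; the overhead ends here
>>> print(get_sdc_detect_result())  # True means corruption was observed
False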
- mindspore.tools.set_dump(target, enabled=True)[source]
Enable or disable dump for the target and its contents.
target should be an instance of mindspore.nn.Cell or mindspore.ops.Primitive. Please note that this API takes effect only when the Dump function is enabled and the dump_mode field in the Dump configuration file is set to "2" with the ms_backend compilation backend (please refer to the backend parameter in jit). See the dump document for details. By default, instances of mindspore.nn.Cell and mindspore.ops.Primitive do not have the Dump data feature enabled.
Note
This API is only available for JIT compilation, requires 'Ascend' as the device_target and ms_backend as the compilation backend (please refer to the backend parameter in jit), and does not support fused operators.
This API must be called before training starts; if called during training, it may not take effect.
After calling set_dump(Cell, True), operators in the forward and backward computation (computation generated by the grad operations) of the cell will be dumped.
For the mindspore.nn.SoftmaxCrossEntropyWithLogits layer, the forward computation and backward computation use the same set of operators, so you can only see dump data from the backward computation. Please note that the mindspore.nn.SoftmaxCrossEntropyWithLogits layer will also use the above operators internally when initialized with sparse=True and reduction="mean".
- Parameters
target (Union[Cell, Primitive]) – The Cell instance or Primitive instance to which the dump flag is set.
enabled (bool, optional) – True means enable dump; False means disable dump. Default: True.
- Supported Platforms:
Ascend
Examples
Note
Please set the environment variable MINDSPORE_DUMP_CONFIG to the path of the dump config file, and set the dump_mode field in the dump config file to 2, before running this example. See the dump document for details.
>>> import numpy as np
>>> import mindspore as ms
>>> import mindspore.nn as nn
>>> from mindspore import Tensor, jit
>>> from mindspore.tools import set_dump
>>>
>>> ms.set_device(device_target="Ascend")
>>>
>>> class MyNet(nn.Cell):
...     def __init__(self):
...         super().__init__()
...         self.conv1 = nn.Conv2d(5, 6, 5, pad_mode='valid')
...         self.relu1 = nn.ReLU()
...
...     @jit
...     def construct(self, x):
...         x = self.conv1(x)
...         x = self.relu1(x)
...         return x
>>>
>>> if __name__ == "__main__":
...     net = MyNet()
...     set_dump(net.conv1)
...     input_tensor = Tensor(np.ones([1, 5, 10, 10], dtype=np.float32))
...     output = net(input_tensor)
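For reference, a minimal sketch of a dump config file that would enable the example above. The field names follow the MindSpore dump documentation; the path, net_name, and file locations are placeholders, and dump_mode 2 restricts dumping to the targets flagged by set_dump.
>>> import json
>>> import os
>>> dump_config = {
...     "common_dump_settings": {
...         "op_debug_mode": 0,
...         "dump_mode": 2,  # 2: dump only the cells/primitives passed to set_dump
...         "path": "/tmp/mindspore_dump",  # placeholder output directory
...         "net_name": "MyNet",  # placeholder network name
...         "iteration": "all",
...         "saved_data": "tensor",
...         "input_output": 0,
...         "kernels": [],
...         "support_device": [0, 1, 2, 3, 4, 5, 6, 7]
...     }
... }
>>> with open("/tmp/dump_config.json", "w") as f:
...     json.dump(dump_config, f, indent=4)
>>> # MINDSPORE_DUMP_CONFIG must point at the file before MindSpore initializes.
>>> os.environ["MINDSPORE_DUMP_CONFIG"] = "/tmp/dump_config.json"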