mindspore.utils
- mindspore.utils.stress_detect()
Inspect the hardware to determine whether any faults affect its accuracy and precision. Common use cases include invoking this interface at each step or when saving checkpoints, so that users can check whether a hardware issue could impact precision.
- Returns
int, the return value represents the error type: zero indicates normal operation; non-zero values indicate a hardware failure.
- Supported Platforms:
Ascend
Examples
>>> from mindspore.utils import stress_detect
>>> ret = stress_detect()
>>> print(ret)
0
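The checkpoint use case mentioned above can be sketched as a small guard that saves only when the check returns zero. This is a minimal sketch, not part of MindSpore: `save_if_healthy` and the `save` callable are hypothetical names, and the detector is passed in as a parameter so the snippet also runs on machines without MindSpore or Ascend hardware.

```python
from typing import Callable


def save_if_healthy(detect: Callable[[], int], save: Callable[[], None]) -> bool:
    """Run a hardware check and save a checkpoint only if it reports 0.

    `detect` is expected to behave like stress_detect(): 0 means normal
    operation, any non-zero value means a hardware failure.
    `save` is a hypothetical stand-in for the script's checkpoint routine.
    """
    code = detect()
    if code != 0:
        print(f"hardware check returned {code}; checkpoint skipped")
        return False
    save()
    return True


# Illustration with stub callables (assumptions, not real MindSpore calls).
saved = []
save_if_healthy(lambda: 0, lambda: saved.append("ckpt"))
print(saved)
```

On an Ascend device, `mindspore.utils.stress_detect` could be passed as `detect`. How a training script should react to a non-zero code is left to the caller here.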
- mindspore.utils.dryrun.set_simulation()
Enable the dryrun function, which is mainly used to simulate the actual operation of large models. Once it is enabled, memory usage, compilation information, and similar data can be simulated without occupying device cards. In PyNative mode, if values are fetched from the device to the host while dryrun is enabled, the Python call stack is logged to inform users that these values are inaccurate.
- Supported Platforms:
Ascend
Examples
>>> import os
>>> import mindspore as ms
>>> from mindspore.utils import dryrun
>>> import numpy as np
>>> dryrun.set_simulation()
>>> print(os.environ.get('MS_SIMULATION_LEVEL'))
1
- mindspore.utils.dryrun.mock(mock_val, *args)
If some if branches in the network need actual execution values that virtual execution cannot obtain, this interface can be used to return simulated values instead. During actual execution, the real results are computed and returned.
- Parameters
mock_val (Union[Value, Tensor]) – The value to return when dryrun is enabled.
args (Union[Value, function]) – The content to mock; it can be a value, a function, and so on.
- Returns
If dryrun is enabled, mock_val will be returned; otherwise, the actual execution value of args will be returned.
- Supported Platforms:
Ascend
GPU
CPU
Examples
>>> import mindspore as ms
>>> from mindspore.utils import dryrun
>>> import numpy as np
>>> dryrun.set_simulation()
>>> a = ms.Tensor(np.random.rand(3, 3).astype(np.float32))
>>> if dryrun.mock(True, a[0, 0] > 0.5):
...     print("return mock_val: True.")
return mock_val: True.
>>>
>>> import mindspore as ms
>>> from mindspore.utils import dryrun
>>> import numpy as np
>>> a = ms.Tensor(np.ones((3, 3)).astype(np.float32))
>>> if dryrun.mock(False, a[0, 0] > 0.5):
...     print("return real execution: True.")
return real execution: True.
>>>
>>> import mindspore as ms
>>> from mindspore.utils import dryrun
>>> import numpy as np
>>> a = ms.Tensor(np.ones((3, 3)).astype(np.float32))
>>> if dryrun.mock(False, (a > 0.5).any):
...     print("return real execution: True.")
return real execution: True.
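The return rule described for mock can be modelled in plain Python to make the two paths easier to follow. This is a toy stand-in, not MindSpore's implementation: `toy_mock` is a hypothetical name, treating a set `MS_SIMULATION_LEVEL` environment variable as "dryrun enabled" is an assumption based on the set_simulation example, and invoking a callable with the remaining arguments is an assumption inferred from the example that passes `(a > 0.5).any`.

```python
import os


def toy_mock(mock_val, *args):
    """Toy model of the documented behaviour: mock_val under dryrun,
    otherwise the actual value of args."""
    # Assumption: dryrun counts as enabled when MS_SIMULATION_LEVEL is set.
    if os.environ.get("MS_SIMULATION_LEVEL"):
        return mock_val
    if not args:
        raise ValueError("nothing to evaluate when dryrun is disabled")
    target, *rest = args
    # Assumption: a callable is invoked with the remaining arguments;
    # any other value is returned unchanged.
    return target(*rest) if callable(target) else target


os.environ.pop("MS_SIMULATION_LEVEL", None)   # dryrun off
print(toy_mock(False, 0.9 > 0.5))             # real value: True
print(toy_mock(False, max, 1, 2))             # callable is invoked: 2

os.environ["MS_SIMULATION_LEVEL"] = "1"       # dryrun on
print(toy_mock(True, 0.1 > 0.5))              # mocked value: True
os.environ.pop("MS_SIMULATION_LEVEL", None)   # restore the environment
```

Note that a plain expression such as `0.9 > 0.5` is evaluated before the call is made, whereas a callable is only evaluated inside the function and only when dryrun is off.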
- mindspore.utils.sdc_detect_start()
Start silent data corruption (SDC) detection. It checks the inputs and outputs of MatMul operations during forward and backward computation on the current device, which may increase execution time. The overhead of the check decreases as the matrix shapes increase. Starting SDC detection results in approximately 100% performance degradation for a single 4096-sized MatMul computation, and approximately 90% degradation on the Llama2-7B model (model parallel is 4, pipeline parallel is 2, with qkv concatenation and ffn concatenation in the decoder layers).
- Supported Platforms:
Ascend
Examples
>>> from mindspore.utils import sdc_detect_start
>>> sdc_detect_start()
- mindspore.utils.sdc_detect_stop()
Stop silent data corruption detection.
- Supported Platforms:
Ascend
Examples
>>> from mindspore.utils import sdc_detect_stop
>>> sdc_detect_stop()
- mindspore.utils.get_sdc_detect_result()
Get the result of silent data corruption detection.
- Returns
bool, indicating whether silent data corruption has occurred since detection was started.
- Supported Platforms:
Ascend
Examples
>>> from mindspore.utils import get_sdc_detect_result
>>> result = get_sdc_detect_result()
>>> print(result)
False
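Because SDC detection adds overhead while it is active, one option is to enable it only around a chosen region of the training loop. The following is a minimal sketch of that pattern, not a MindSpore API: `sdc_detection` is a hypothetical context manager, and the start and stop functions are passed in as parameters so the snippet runs without Ascend hardware.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def sdc_detection(start: Callable[[], None],
                  stop: Callable[[], None]) -> Iterator[None]:
    """Call `start` on entry and `stop` on exit, even if the body raises."""
    start()
    try:
        yield
    finally:
        stop()


# Illustration with stub callables that only record the call order.
events = []
with sdc_detection(lambda: events.append("start"),
                   lambda: events.append("stop")):
    events.append("train step")
print(events)  # ['start', 'train step', 'stop']
```

On Ascend, `sdc_detect_start` and `sdc_detect_stop` could be passed as `start` and `stop`, with `get_sdc_detect_result()` queried wherever the script needs the outcome. This section does not state whether the result should be read before or after stopping detection, so the sketch leaves that choice to the caller.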