# Debug and Tune

## FAQs and Solutions

The following common problems may be encountered during the accuracy debugging phase:

- The first loss is not aligned with the benchmark: this means the forward computation of the network is not aligned with the benchmark. Fix the network input, turn off sources of randomness such as shuffle, save the outputs of key network nodes as npy files, and use [TroubleShooter to check whether two sets of Tensor values (npy files) are equal](https://gitee.com/mindspore/toolkits/blob/master/troubleshooter/docs/migrator.md#%E5%BA%94%E7%94%A8%E5%9C%BA%E6%99%AF4%E6%AF%94%E8%BE%83%E4%B8%A4%E7%BB%84tensor%E5%80%BCnpy%E6%96%87%E4%BB%B6%E6%98%AF%E5%90%A6%E7%9B%B8%E7%AD%89) to locate the first inconsistent position. Then bisect from that position to find where the forward difference arises that causes the loss to diverge from the benchmark.
- The first loss is aligned with the benchmark, but subsequent losses are misaligned: this is mainly caused by the backward pass of the network. You can use [TroubleShooter to compare MindSpore and PyTorch ckpt/pth](https://gitee.com/mindspore/toolkits/blob/master/troubleshooter/docs/migrator.md#%E5%BA%94%E7%94%A8%E5%9C%BA%E6%99%AF2%E6%AF%94%E5%AF%B9mindspore%E4%B8%8Epytorch%E7%9A%84ckptpth) to check the results of the backward update by comparing the values of the corresponding parameters in the ckpt and pth files.
- The loss becomes NAN/INF: use [TroubleShooter to obtain INF/NAN value throw points](https://gitee.com/mindspore/toolkits/blob/master/troubleshooter/docs/tracker.md#%E5%BA%94%E7%94%A8%E5%9C%BA%E6%99%AF2%E8%8E%B7%E5%8F%96infnan%E5%80%BC%E6%8A%9B%E5%87%BA%E7%82%B9) to identify the first location in the network where a NAN or INF appears. Overflow operator detection is also available via the [Dump](https://www.mindspore.cn/docs/en/r2.5.0/model_train/debug/dump.html) tool.

The following common problems may be encountered during the device memory debugging phase:

- Malloc device memory failed: MindSpore failed to allocate memory on the device side. The usual cause is that the device is occupied by another process; you can check the running processes with `ps -ef | grep "python"`.
- Out of Memory: possible reasons for failing to allocate dynamic memory include a batch size that is too large, so that processing too much data leads to a large memory footprint, and communication operators occupying too much memory, which lowers the overall memory reuse rate.

## Introduction of MindSpore Debugging

### Function Debugging

During network migration, you are advised to use PyNative mode for debugging. In PyNative mode you can set breakpoints, and log printing is user-friendly. After debugging is complete, switch to graph mode, which offers better execution performance. Graph compilation can also surface problems, for example gradient truncation caused by third-party operators. For details, see [Error Analysis](https://www.mindspore.cn/docs/en/r2.5.0/model_train/debug/error_analysis/error_scenario_analysis.html).

### Precision Tuning

The accuracy debugging process is as follows:

#### 1. Checking Parameters

This step includes checking all parameters, the number of trainable parameters, and the shape of every parameter.

- `Parameter` is used for PyTorch trainable parameters; `requires_grad=False` or a `buffer` is used for PyTorch untrainable parameters.
- `Parameter` is used for both MindSpore trainable and untrainable parameters.
- The parameters of MindSpore and PyTorch are similar except for BatchNorm. Note that MindSpore has no parameter corresponding to `num_batches_tracked`; you can replace this parameter with `global_step` in the optimizer.
| MindSpore | PyTorch |
| --------- | ------- |
| gamma | weight |
| beta | bias |
| moving_mean | running_mean |
| moving_variance | running_var |
| - | num_batches_tracked |
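As a companion to the mapping above, the following is a minimal sketch of renaming PyTorch BatchNorm state-dict keys to their MindSpore names when converting a pth file by hand. The `rename_bn_keys` helper and the placeholder dictionary are illustrative, not part of any official API, and the sketch assumes every key it touches belongs to a BatchNorm layer (a `weight` key on a Linear or Conv layer must not be renamed):

```python
# Mapping from PyTorch BatchNorm parameter names to MindSpore names,
# taken from the table above.
PT_TO_MS = {
    "weight": "gamma",
    "bias": "beta",
    "running_mean": "moving_mean",
    "running_var": "moving_variance",
}


def rename_bn_keys(state_dict):
    """Return a new dict with BatchNorm keys renamed to MindSpore names.

    num_batches_tracked has no MindSpore counterpart and is dropped.
    Assumes all keys belong to BatchNorm layers (illustrative only).
    """
    renamed = {}
    for key, value in state_dict.items():
        prefix, _, leaf = key.rpartition(".")
        if leaf == "num_batches_tracked":
            continue  # no corresponding MindSpore parameter
        ms_leaf = PT_TO_MS.get(leaf, leaf)
        renamed[(prefix + "." if prefix else "") + ms_leaf] = value
    return renamed


# Placeholder values stand in for real tensors.
pt_keys = {
    "bn.weight": 1, "bn.bias": 2, "bn.running_mean": 3,
    "bn.running_var": 4, "bn.num_batches_tracked": 5,
}
print(sorted(rename_bn_keys(pt_keys)))
# → ['bn.beta', 'bn.gamma', 'bn.moving_mean', 'bn.moving_variance']
```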
**Obtaining PyTorch Parameters:**

```python
from torch import nn


class ptNet(nn.Module):
    def __init__(self):
        super(ptNet, self).__init__()
        self.fc = nn.Linear(1, 1)

    def forward(self, x):  # PyTorch modules define forward, not construct
        output = self.fc(x)
        return output


ptnet = ptNet()
all_parameter = []
trainable_params = []
# Obtain network parameters.
for name, item in ptnet.named_parameters():
    if item.requires_grad:
        trainable_params.append(item)
    all_parameter.append(item)
    print(name, item.shape)
for name, buffer in ptnet.named_buffers():
    all_parameter.append(buffer)
    print(name, buffer.shape)
print(f"all parameter numbers: {len(all_parameter)}")
print(f"trainable parameter numbers: {len(trainable_params)}")
```

Outputs:

```text
fc.weight torch.Size([1, 1])
fc.bias torch.Size([1])
all parameter numbers: 2
trainable parameter numbers: 2
```
**Obtaining MindSpore Parameters:**

```python
from mindspore import nn


class msNet(nn.Cell):
    def __init__(self):
        super(msNet, self).__init__()
        self.fc = nn.Dense(1, 1, weight_init='normal')

    def construct(self, x):
        output = self.fc(x)
        return output


msnet = msNet()
# Obtain all parameters.
all_parameter = []
for item in msnet.get_parameters():
    all_parameter.append(item)
    print(item.name, item.data.shape)
print(f"all parameter numbers: {len(all_parameter)}")
# Obtain trainable parameters.
trainable_params = msnet.trainable_params()
for item in trainable_params:
    print(item.name, item.data.shape)
print(f"trainable parameter numbers: {len(trainable_params)}")
```

Outputs:

```text
fc.weight (1, 1)
fc.bias (1,)
all parameter numbers: 2
fc.weight (1, 1)
fc.bias (1,)
trainable parameter numbers: 2
```
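When TroubleShooter is not available, the npy comparison described in the FAQs above can be approximated with plain NumPy. The arrays below are hypothetical dumps of the same network node from MindSpore and PyTorch; in practice they would come from `np.load("xxx.npy")`:

```python
import numpy as np

# Hypothetical dumps of the same network node from the two frameworks.
ms_out = np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32)
pt_out = np.array([[0.1, 0.2], [0.3, 0.400001]], dtype=np.float32)

# Element-wise comparison with a tolerance; also report the largest
# deviation and its position so the first inconsistent node can be
# located and the forward pass bisected from there.
diff = np.abs(ms_out - pt_out)
equal = np.allclose(ms_out, pt_out, rtol=1e-3, atol=1e-5)
max_err = diff.max()
pos = np.unravel_index(np.argmax(diff), diff.shape)
print(f"allclose: {equal}, max error: {max_err} at {pos}")
```

Tighten `rtol`/`atol` to the precision your comparison requires; a check with `np.isfinite(ms_out).all()` on each dump is also a quick way to catch the NAN/INF case from the FAQs.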