# Large Model Accuracy Optimization Guide
## Overview and Scenarios of Accuracy Issues
### Descriptions
As the Ascend AI processor (hereinafter referred to as NPU) becomes widely used in deep learning, the MindSpore framework, which is developed natively for the Ascend NPU, shows clear performance advantages. In large-scale cluster training, this performance improvement greatly reduces the cost of large model development, so more and more users are migrating their training models to MindSpore. However, because of differences in hardware and framework usage, users may encounter accuracy problems after completing the model migration.
This document summarizes the common accuracy problems in large model training and the general methods for locating them, to help users troubleshoot quickly and shorten the time spent on accuracy problem localization. Readers are expected to have basic knowledge of large models. To stay focused, this document does not explain the basic concepts of large models and concentrates on accuracy optimization.
### Categorized Summary of Common Problems
Various accuracy problems occur in large model training. Common ones include: the loss fails to converge, the loss converges poorly, the loss stops converging in the late stage of training, numerical overflow occurs, and the loss curve cannot be fitted to the benchmark as it descends. These problems can have many causes, including the model structure, the dataset, the hyperparameters, the precision of the forward and backward computation, the optimizer computation, floating-point precision, and randomness.
When an accuracy problem occurs, analyze it starting from these possible causes. First perform a quick check against the CheckList, then align parameters and weights, fix randomness, and turn on deterministic computation. Next troubleshoot the basic problems, and finally troubleshoot anomalous steps through long stable training. At the current stage, this document mainly introduces general accuracy localization methods for scenarios that have an accuracy benchmark; content for scenarios without a benchmark will be added later.
## Accuracy Problems Location CheckList
Before locating operator accuracy problems, first eliminate interference from non-operator factors. The CheckList below is summarized from previous accuracy localization cases. To locate problems more easily, users can first carry out a quick check against it.
### Network Structure CheckList
#### Generalized structure
| **Key parameters** | **Descriptions** | **CheckList** |
| ----------------- | ------------------------- |---------------------------------|
| num_layers | Number of transformer layers | Correspond to the Megatron num-layers parameter and check for consistency. |
| num_heads | Number of attention heads in transformer | Correspond to the Megatron num-attention-heads parameter and check for consistency. |
| hidden_size | Transformer hidden layer size | Correspond to the Megatron hidden-size parameter and check for consistency. |
| intermediate_size | Feed-Forward Network hidden layer size | Correspond to the Megatron ffn-hidden-size parameter and check for consistency. |
| n_kv_heads | Number of kv groups | Correspond to the Megatron num-query-groups parameter and check for consistency. |
| Normalization function | Normalization functions; common structures are LayerNorm and RMSNorm | The normalization function is fixed in MindSpore Transformers and cannot be modified by configuration. In Megatron it can be customized through the normalization parameter; check for consistency. |
| rms_norm_eps | Normalization epsilon parameter | Correspond to the Megatron layernorm_epsilon parameter and check for consistency. |
| dropout | dropout in the network | Currently, when MindSpore enables dropout, recalculation cannot be enabled; for precision comparison, it is recommended to disable dropout on both sides to reduce the random factor. |
| Fusion computation | Common fusion operators include FA, RoPE, Norm, SwiGLU; some users also fuse Wq, Wk, Wv for computation | 1. For accuracy comparison on the same hardware, if fusion operators are used, they should be consistent.<br>2. When comparing accuracy on different hardware, focus on checking whether the fused computation differs. |
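
The parameter correspondences in the table above can be checked mechanically. The following is a minimal sketch, assuming both configurations have already been loaded into plain dictionaries. The helper name `diff_configs` and the sample values are hypothetical; only the key mapping is taken from the table.

```python
# Mapping of MindSpore Transformers keys to Megatron keys, taken from the table above.
KEY_MAP = {
    "num_layers": "num-layers",
    "num_heads": "num-attention-heads",
    "hidden_size": "hidden-size",
    "intermediate_size": "ffn-hidden-size",
    "n_kv_heads": "num-query-groups",
    "rms_norm_eps": "layernorm_epsilon",
}


def diff_configs(ms_cfg, megatron_cfg, key_map=KEY_MAP):
    """Return (ms_key, ms_value, megatron_key, megatron_value) for every mismatch.

    A key missing on either side is reported with None so it is not silently skipped.
    """
    mismatches = []
    for ms_key, mg_key in key_map.items():
        ms_val = ms_cfg.get(ms_key)
        mg_val = megatron_cfg.get(mg_key)
        if ms_val is None or mg_val is None or ms_val != mg_val:
            mismatches.append((ms_key, ms_val, mg_key, mg_val))
    return mismatches


# Hypothetical example values, for illustration only.
ms_cfg = {"num_layers": 32, "num_heads": 32, "hidden_size": 4096,
          "intermediate_size": 11008, "n_kv_heads": 8, "rms_norm_eps": 1e-5}
megatron_cfg = {"num-layers": 32, "num-attention-heads": 32, "hidden-size": 4096,
                "ffn-hidden-size": 11008, "num-query-groups": 8,
                "layernorm_epsilon": 1e-6}

for row in diff_configs(ms_cfg, megatron_cfg):
    print("mismatch:", row)
```
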
#### MOE Structure
| **Key parameters** | **Descriptions** | **CheckList** |
| ----------------- | ------------------------------------------------------------ |------------------------------------------------------------------------------------------------------------------------------------|
| expert_num | Number of experts | Correspond to the Megatron num-experts parameter and check for consistency. |
| num_experts_chosen | Number of experts selected per token | Correspond to the Megatron moe-router-topk parameter and check for consistency. |
| capacity_factor | Expert capacity factor | Correspond to the Megatron moe_expert_capacity_factor parameter and check for consistency. |
| aux_loss_factor | Load balancing loss contribution factor | When enabled, a value below 0.05 is recommended. For precision alignment it is recommended to disable it, since the way the loss is printed is inconsistent with Megatron. |
| enable_sdrop | Whether to enable the sdrop (drop implementation) method | It is recommended to set it to true; the corresponding Megatron needs to set the following parameters:<br>`moe-token-drop-policy: position`<br>`moe-pad-expert-input-to-capacity: True` |
| router_dense_type | Data type of the router dense layer | Configurable in MindSpore Transformers; FP32 calculation is recommended to prevent overflow. Not configurable in Megatron. |
| use_fused_ops_topkrouter | Whether to use the fusion operator for dispatch as well as combine indexing calculations | The fusion operator in MindSpore Transformers takes effect when `enable_sdrop=True`; for precision alignment, setting it to True is recommended. |
| use_shared_expert_gating | Whether the gating factor is used in the shared expert network | Check whether the shared expert network has a gating factor; if so, set it to True. |
### Optimizer CheckList
| **Key parameters** | **Descriptions** | **CheckList** |
| ----------------- | ------------------------------------------------------------ |------------------------------------------------------------------------------------------------------------------------------------|
| adam optimizer | optimizer type | If Megatron uses the adam optimizer, the mathematically equivalent implementation of MindSpore Transformers is AdamW. |
| eps | adam optimizer minimal value parameter | Check the parameters for consistency, recommended value is 1e-8. |
| beta1 | adam optimizer gradient momentum parameters | Check the parameters for consistency, recommended value is 0.9. |
| beta2 | adam optimizer gradient variance parameter | Check the parameters for consistency, recommended value is 0.95. |
| weight_decay | weight decay | By default, bias and one-dimensional weights are not decayed; check whether the user applies any special handling. |
| lr | learning rate | After setting up warmup, learning rate decay, draw a graph to see if the learning rate change is consistent. |
| lr_warmup_fraction | Learning rate warmup step percentage | After setting up warmup, learning rate decay, draw a graph to see if the learning rate change is consistent. |
| clip_grad | clipping gradient | Check the parameters for consistency, recommended value is 1.0. |
| global_batch_size | Global batch size | Consistency with the benchmark can be checked by printing a log during training. |
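
The `lr` and `lr_warmup_fraction` rows above suggest plotting the learning rate. Before plotting, the two curves can also be compared numerically. The sketch below assumes a linear warmup followed by cosine decay, which is only one possible schedule; the function names are hypothetical, and the schedule actually configured on each side should be substituted.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr, warmup_fraction, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay to min_lr."""
    warmup_steps = int(total_steps * warmup_fraction)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def max_lr_gap(lrs_a, lrs_b):
    """Largest absolute per-step difference between two learning-rate curves."""
    if len(lrs_a) != len(lrs_b):
        raise ValueError("curves must cover the same number of steps")
    return max(abs(a - b) for a, b in zip(lrs_a, lrs_b))


total = 1000
ours = [warmup_cosine_lr(s, total, 1e-5, 0.01) for s in range(total)]
# Hypothetical benchmark that was configured with a different warmup fraction.
benchmark = [warmup_cosine_lr(s, total, 1e-5, 0.02) for s in range(total)]
print("max |lr difference|:", max_lr_gap(ours, benchmark))
```

A non-zero gap points to a mismatch in warmup, decay, or base learning rate that is worth resolving before comparing loss curves.
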
### Weight CheckList
| **Key parameters** | **Descriptions** | **CheckList** |
| ----------------- | ------------------------------------------------------------ |------------------------------------------------------------------------------------------------------------------------------------|
| param_init_type | Weight initialization type | MindSpore Transformers usually sets the param_init_dtype type to FP32. This is because the gradient communication type needs to be the same as the weight type, controlling the communication type to be FP32. Megatron gradient communication type defaults to FP32 and is not tied to the weight type. |
| init-method-std | Distribution of randomly initialized weights | If random weight initialization is used, parameters of the random distribution such as mean/std need to be checked for consistency. |
### Mixed-precision CheckList
| **Key parameters** | **Descriptions** | **CheckList** |
| ----------------- | ----------------------------------------- |---------------------------------------|
| compute_dtype | Compute precision | In Megatron, setting `-bf16: true` selects BF16; otherwise FP16 is used. |
| layernorm_compute_type | LayerNorm/RMSNorm compute precision | Not configurable in Megatron; check that the implementations are consistent. |
| softmax_compute_type | When MindSpore uses FA, the internal Softmax is always computed inside FA. The compute type is configurable only for the implementation composed of small operators | Not configurable in Megatron; check that the implementations are consistent. |
| rotary_dtype | Compute precision of rotary position encoding | Not configurable in Megatron; check that the implementations are consistent. |
| Calculation of weights | Compute precision of each weight, such as Embedding and lm_head | Since MindSpore Transformers weight initialization needs to be set to FP32 and the usual compute precision is BF16/FP16, check whether the weight data type is converted to BF16/FP16 before the weight is used in computation. |
| bias add | bias in the linear layer | If bias is present, check that the add in the Linear layer uses a compute precision consistent with the benchmark. |
| residual add | sum of residuals | Check that the compute precision of the residual add is consistent with the benchmark. |
| loss | Loss calculation module | Check that the compute precision throughout the loss module is consistent with the benchmark. |
| Operator High Precision Mode | Ascend operators support a high precision mode | Method: `context.set_context(ascend_config= {"ge_options":{ "global":{ "ge.opSelectImplmode":"high_precision" } } })` |
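
To illustrate why the precision of an add (the bias add and residual add rows above) matters, the sketch below rounds the operands and the result to FP16 or FP32 using only the Python standard library. FP16 is used because the standard library has no BF16 type; BF16 has even fewer mantissa bits, so the effect is at least as pronounced. The values are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_in(cast, a, b):
    """Add two numbers with both inputs and the result rounded by `cast`."""
    return cast(cast(a) + cast(b))


activation, bias = 1024.0, 0.25  # hypothetical values
print("FP16 add:", add_in(to_fp16, activation, bias))
print("FP32 add:", add_in(to_fp32, activation, bias))
```

Near 1024 the spacing between adjacent FP16 values is 1.0, so the FP16 add drops the bias entirely, while the FP32 add keeps it. If one framework performs the add in FP32 and the other in BF16/FP16, such differences accumulate across layers.
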
### Parallel Strategy CheckList
| **Key parameters** | **Descriptions** | **CheckList** |
| ----------------- | ------------------------------------------------------------ |------------------------------------------------------------------------------------------------------------------------------------|
| data_parallel | data parallel | Parallel slicing affects the communication behavior, and the calculations that introduce communication after slicing may be slightly different from the single-card calculations. |
| model_parallel | model parallel | Parallel slicing affects the communication behavior, and the calculations that introduce communication after slicing may be slightly different from the single-card calculations. |
| pipeline_stage | pipeline parallel | Parallel slicing affects the communication behavior, and the calculations that introduce communication after slicing may be slightly different from the single-card calculations. |
| use_seq_parallel | Corresponding to Megatron Short Sequence Parallelism | Parallel slicing affects the communication behavior, and the calculations that introduce communication after slicing may be slightly different from the single-card calculations. |
| enable_parallel_optimizer | optimizer parallel | For optimizer parallel, MindSpore and PyTorch have different implementation schemes and inconsistent communication behavior. It is recommended to turn it off when performing precision alignment. |
| micro_batch_interleave_num | multicopy parallel | For optimizer parallel, MindSpore and PyTorch have different implementation schemes and inconsistent communication behavior. It is recommended to turn it off. |
### Other CheckList
| **Key parameters** | **CheckList** |
| ----------------- | ---------------------------|
| Data Check | Check whether the data is abnormal. Randomly select part of the data and run a decode/encode check to see whether the positions of input and label correspond correctly. |
| Special Token Check | Check whether the special ids such as bos_token_id, eos_token_id, pad_token_id are consistent with the ids used when the data was produced. |
| inputs_id check | Check whether inputs_id in Embedding satisfies `0 <= inputs_id < vocab_size`. |

With the learning rate set greater than 0, the weights are updated and the long stability test is performed. At a certain step the loss may differ sharply from the benchmark, after which the training loss begins to diverge, as shown in the figure:

In this scenario, focus on the training steps just before and after the abrupt change. The following checks can be tried:
* Check the data near the abrupt loss change for anomalies. Decode the data to text with the tokenizer to see whether it is abnormal; you can also skip this batch of data during training to verify whether the data causes the problem.
* Check whether numerical overflow occurs near the abrupt change.
* Check whether the local norm is abnormal, dump the training data of the step where the abrupt change occurs, narrow down the computation where it arises, and analyze whether any operator produces abnormal output.
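
To find the steps worth inspecting, the logged loss values can be scanned for abrupt jumps. This is a minimal sketch, assuming the per-step losses have already been parsed from the training log into a list. The function name, window size, and threshold are hypothetical and should be tuned to the model.

```python
import math


def find_loss_spikes(losses, window=20, threshold=0.5):
    """Return step indices whose loss is non-finite or exceeds the mean of the
    preceding `window` finite losses by more than `threshold`."""
    spikes = []
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            spikes.append(i)  # NaN/Inf usually indicates overflow
            continue
        history = [x for x in losses[max(0, i - window):i] if math.isfinite(x)]
        if len(history) < window:
            continue  # not enough clean history to form a baseline
        if loss - sum(history) / len(history) > threshold:
            spikes.append(i)
    return spikes


# Hypothetical loss curve: slowly decreasing, with one jump at step 60.
losses = [2.0 - 0.001 * i for i in range(100)]
losses[60] += 1.0
print("suspicious steps:", find_loss_spikes(losses))
```

The reported steps indicate which batches to decode and which steps to dump.
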
#### Loss Varies Greatly in the Later Stages
In the long stability test, the loss may also fit well in the early part of training but differ considerably from the benchmark at convergence in the later part, as shown in the figure:

In this scenario, troubleshooting can be done from the following perspectives:
* Check whether the parameters are aligned: focus on the optimizer-related parameters, such as the optimizer type, learning rate, and weight decay. Plot the learning rate over training to compare whether its change is consistent, and confirm that the set of weights subject to weight decay matches the benchmark.
* Mixed-precision check: use the Dump tool to carefully check whether the mixed precision used in the computation is consistent with the benchmark.
* If the loss at convergence differs only slightly, for example by less than 1%, accuracy acceptance can be performed by evaluating downstream tasks.
#### Scenario Expansion
After completing the single-card alignment, gradually expand from single-card to multi-card and cluster testing, adding model size and features such as model parallelism, pipeline parallelism, and optimizer parallelism as appropriate. Expanding step by step from simple scenarios to the actual training scenario makes it possible to isolate the impact of each added feature on accuracy.
### Large Model Migration Accuracy Standard
Accuracy standard for large model migration refers to the accuracy standard set for key indicators to ensure that the model accuracy before and after migration is basically the same after migrating the models trained by other third-party hardware or frameworks to MindSpore and Ascend Hardware. It is summarized based on the actual migration scenarios of MindSpore's large models for developers' reference. Since the accuracy of large models is strongly related to the application domain, model structure, number of parameters, and hyperparameters, and is not fully interpretable, there is no complete and unified mandatory standard. Therefore, this standard is only used as a reference standard to help users make a basic judgment on the accuracy of model migration.
#### Accuracy Standard Specifications
1. Relative discrepancy is uniformly described as a percentage (x.x%) and absolute discrepancy is uniformly described as a decimal (0.xx);
2. If the accuracy fluctuations of the third-party model training no longer meet this accuracy standard, the original model should be adequately tested and the standard should be relaxed in accordance with the fluctuations of the original model;
#### Default Configuration
| Classes | Default Values | Descriptions |
|--------------------|------|-------------------------------|
| Dataset | [pretrain] wikitext-103 [sft] alpaca | |
| Precision mode | BF16 | Mixed-precision configurations are kept consistent; distinguish the actual FP32/FP16/BF16 configuration of each API in the network. |
| Parallel method | Data parallel | The parallelism can be adjusted according to the computational resources. |
| Cluster size | Stand-alone 8 cards | Can be adjusted according to the computational resources. |
| checkpoint | [pretrain] Script initialization by default [sft]Loading pre-training weights | ckpt has a large impact on the accuracy metrics, prioritizing weights with small fluctuations in loss and a clear downward trend in overall loss.|
|determinism|Turn on|The accuracy indicator determination phase can turn off determinism. The comparison phase needs to turn on determinism in order to minimize random error interference.|
#### Accuracy Standard Indicator
* Test Standard
1. Unless the user specifies otherwise, observe continuously for 5000 steps or 12 hours by default. The number of steps can be reduced depending on available resources, but fewer than 1000 steps is not recommended.
2. Load the same weights, keep all hyperparameters configured the same, and turn off all randomness.
3. Indicators such as loss fluctuate strongly with the model, weights, and hyperparameters. Prefer a combination with smooth loss fluctuation as the benchmark, to reduce the influence of random fluctuation on the accuracy judgment.
4. Test the randomness of the third-party model adequately by repeating the experiment at least 2 times with determinism turned off and observing the range of fluctuation in the accuracy metrics.
* loss Accuracy Standard
1. The absolute error of first loss is less than 0.005, or the relative error is less than 0.5%.
2. The average absolute error is less than 0.01, or the average relative error is less than 1%.
* Monitoring Indicators
The average relative error of the global norm does not exceed 10%.
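
The thresholds above can be applied to logged values with a few lines of code. The sketch below assumes per-step loss and global norm values from both runs are available as equal-length lists with non-zero benchmark values, and it interprets "average relative error" as the mean of the per-step relative errors. The function names and sample numbers are hypothetical.

```python
def _errors(ours, benchmark):
    """Per-step absolute and relative errors against the benchmark."""
    if len(ours) != len(benchmark) or not ours:
        raise ValueError("both series must be non-empty and of equal length")
    abs_err = [abs(a - b) for a, b in zip(ours, benchmark)]
    rel_err = [e / abs(b) for e, b in zip(abs_err, benchmark)]  # benchmark != 0 assumed
    return abs_err, rel_err


def check_loss_standard(ours, benchmark):
    """Apply the loss accuracy standard: first-step and average thresholds."""
    abs_err, rel_err = _errors(ours, benchmark)
    mean_abs = sum(abs_err) / len(abs_err)
    mean_rel = sum(rel_err) / len(rel_err)
    return {
        "first_loss_ok": abs_err[0] < 0.005 or rel_err[0] < 0.005,
        "mean_loss_ok": mean_abs < 0.01 or mean_rel < 0.01,
        "mean_abs_err": mean_abs,
        "mean_rel_err": mean_rel,
    }


def check_global_norm(ours, benchmark):
    """Monitoring indicator: average relative error of global norm within 10%."""
    _, rel_err = _errors(ours, benchmark)
    return sum(rel_err) / len(rel_err) <= 0.10


# Hypothetical logged values, for illustration only.
result = check_loss_standard([10.502, 9.81, 9.2], [10.5, 9.8, 9.21])
print(result)
print("global norm ok:", check_global_norm([1.05, 2.0], [1.0, 2.0]))
```
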
### Case Details
This section uses a practical example to walk through accuracy localization based on the process above.
#### Problem Phenomenon
A model is trained on a 128-card cluster with Ascend+MindSpore and compared with GPU+PyTorch training. The loss in the late, converged stage of training is about 0.1 higher than with GPU+PyTorch. As shown in the figure, the convergence is not as expected:

The red line is the Ascend+MindSpore training curve and the blue line is the GPU+PyTorch training curve.
#### Problem Location Process
Before locating the problem, check against the CheckList to confirm that there is no error and then start locating the problem.
First, the loss at step 1 is confirmed to be aligned. Comparing the local norm at step 1 and computing the difference between the local norm of each weight and the benchmark shows that the local norm of the Embedding weight differs considerably from the benchmark.

The reason for this is that MindSpore Transformers uses FP32 for weight initialization, and FP32 precision is used for both forward and backward Embedding calculations, while PyTorch forward and backward calculations are BF16, which leads to differences in the calculated local norm values.
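
The per-weight comparison described above can be sketched as follows, assuming the local norms of both runs have been exported as dictionaries keyed by weight names that are already mapped to a common naming scheme. The function names, weight names, and values are hypothetical.

```python
import math


def l2_norm(values):
    """L2 norm of a flat sequence of gradient values."""
    return math.sqrt(sum(v * v for v in values))


def rank_local_norm_gaps(ours, benchmark):
    """Return (name, ours, benchmark, relative_gap) sorted by gap, largest first.

    Only weights present on both sides are compared.
    """
    rows = []
    for name in sorted(set(ours) & set(benchmark)):
        ref = benchmark[name]
        rel = abs(ours[name] - ref) / max(abs(ref), 1e-12)
        rows.append((name, ours[name], ref, rel))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Hypothetical step-1 local norms, for illustration only.
ours = {"embedding.weight": 0.52, "layers.0.wq.weight": 0.101, "lm_head.weight": 0.300}
benchmark = {"embedding.weight": 0.40, "layers.0.wq.weight": 0.100, "lm_head.weight": 0.301}

for name, mine, ref, rel in rank_local_norm_gaps(ours, benchmark):
    print(f"{name}: ours={mine} benchmark={ref} relative gap={rel:.2%}")
```

Sorting by relative gap puts the weight with the largest discrepancy first, which is where the precision check should start.
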
Once the compute precision is aligned, the optimizer computation is also checked and found to be fine, and the long stable training alignment begins.
The long stable training check is extended from single-card to multi-card experiments. First, set the learning rate to 0, i.e., the weights are not updated. The forward loss difference at each step is around 0.001, so the forward computation error is as expected. The global norm difference at each step is about 0.05, so the backward computation difference is not significant. The initial judgment is that the model migration code is correct, the model structure is consistent, and the differences in forward and backward computation are not significant.

Next, enable weight updates: train on a single card with learning rate=1e-5 for 1k steps. The loss in the late, converged stage shows a steady difference of 0.1, reproducing the problem.

Troubleshooting identifies the following problems:
* The compute precision is inconsistent in places during training. These are found by examining the Dump files, and the inconsistent places are unified.
* The weight decay implementation is inconsistent: the user's PyTorch network applies weight decay to all weights, whereas in MindSpore Transformers bias weights and one-dimensional weights are not decayed by default.
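
The weight decay mismatch can be made visible by listing which parameters each policy decays. This is a minimal sketch, assuming parameter shapes are available as a name-to-shape dictionary and that bias parameters can be recognized by a name ending in `bias`. That naming rule, the function name, and the sample parameters are hypothetical, and the actual grouping logic of either framework may differ.

```python
def split_weight_decay_groups(param_shapes, decay_all=False):
    """Split parameter names into (decay, no_decay) lists.

    With decay_all=False, bias and one-dimensional weights are excluded from
    weight decay. With decay_all=True, every parameter is decayed.
    """
    decay, no_decay = [], []
    for name, shape in param_shapes.items():
        is_bias = name.endswith("bias")  # assumed naming convention
        if decay_all or (len(shape) > 1 and not is_bias):
            decay.append(name)
        else:
            no_decay.append(name)
    return decay, no_decay


# Hypothetical parameters, for illustration only.
shapes = {"wq.weight": (4096, 4096), "wq.bias": (4096,), "norm.weight": (4096,)}

default_decay, _ = split_weight_decay_groups(shapes)
all_decay, _ = split_weight_decay_groups(shapes, decay_all=True)
print("treated differently:", sorted(set(all_decay) - set(default_decay)))
```

Any parameter in the printed list is decayed on one side but not the other, so one of the two configurations has to be changed to match.
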
After fixing these problems, the experiment is repeated for 10,000 steps. The loss difference fluctuates around the 0 axis and stays below 0.03, so the accuracy meets expectations and the single-card accuracy is aligned.
After completing the single-card training, start the multi-card training test: set the learning rate=1e-5 and train 1,000 steps. Convergence is consistent in the late stage of training, but there is a stable error of 0.05 in the middle stage.

To verify that this error is within reasonable limits, deterministic computation is turned off and the GPU experiment is run twice. The red line in the figure is the MindSpore training curve, and the blue and green lines are the curves of the first and second GPU runs, respectively. At the training instability around 7,000 steps, the MindSpore curve lies between the curves of the two GPU runs, indicating that the error is within a reasonable range, and the problem is finally solved.
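
The visual check against two repeated runs can also be expressed numerically. The sketch below, with a hypothetical function name and made-up values, reports the fraction of steps at which a candidate loss curve lies between two reference curves, optionally widened by a margin.

```python
def fraction_within_envelope(candidate, run_a, run_b, margin=0.0):
    """Fraction of steps where candidate lies between run_a and run_b (+/- margin)."""
    if not (len(candidate) == len(run_a) == len(run_b)) or not candidate:
        raise ValueError("all curves must be non-empty and of equal length")
    inside = 0
    for c, a, b in zip(candidate, run_a, run_b):
        low, high = min(a, b) - margin, max(a, b) + margin
        if low <= c <= high:
            inside += 1
    return inside / len(candidate)


# Hypothetical loss values around the unstable region, for illustration only.
mindspore_run = [2.00, 1.90, 1.85]
gpu_run_1 = [2.10, 1.95, 1.80]
gpu_run_2 = [1.95, 1.85, 1.90]
print("fraction inside:", fraction_within_envelope(mindspore_run, gpu_run_1, gpu_run_2))
```

A fraction close to 1 in the unstable region supports the conclusion that the remaining difference is within the run-to-run fluctuation of the benchmark itself.
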
