# Configuration File Descriptions
[View Source on Gitee](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/configuration.md)
## Overview
Different parameters usually need to be configured during the training and inference process of a model. MindSpore Transformers supports the use of `YAML` files to centrally manage and adjust the configurable items, which makes the configuration of the model more structured and improves its maintainability at the same time.
## Description of the YAML File Contents
The `YAML` file provided by MindSpore Transformers contains configuration items for different functions, which are described below according to their contents.
### Basic Configuration
The basic configuration is mainly used to specify MindSpore random seeds and related settings for loading weights.
| Parameter Name | Data Type | Optional | Default Value | Value Description |
|--------------------------------|-----------|-----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| seed | int | Optional | 0 | Sets the global random seed to ensure experimental reproducibility. For details, see [mindspore.set_seed](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_seed.html). |
| run_mode | string | Required | None | Sets the model's run mode. Options: `train`, `finetune`, `eval`, or `predict`. |
| output_dir | string | Optional | None | Sets the output directory for saving log files, checkpoint files, and parallel strategy files. If the directory does not exist, it will be created automatically. |
| load_checkpoint | string | Optional | None | The file or folder path for loading weights. Supports the following three scenarios: 1. The path to the complete weights file; 2. The path to the distributed weights folder after offline splitting; 3. The path to the folder containing LoRA incremental weights and base model weights. For details on how to obtain various weights, see [Checkpoint Conversion Function](https://www.mindspore.cn/mindformers/docs/en/master/feature/ckpt.html#weight-format-conversion). |
| auto_trans_ckpt | bool | Optional | False | Whether to enable automatic splitting and merging of distributed weights. When enabled, you can load split weights from multiple cards onto a single card, or load single-card weights from multiple cards onto multiple cards. For more information, see [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/master/feature/ckpt.html#distributed-weight-slicing-and-merging) |
| resume_training | bool | Optional | False | Whether to enable the resumable training feature. When enabled, the optimizer state, learning rate scheduler state, and other parameters will be restored from the path specified by `load_checkpoint` to continue training. For more information, see [Resumable Training](https://www.mindspore.cn/mindformers/docs/en/master/feature/resume_training.html#resumable-training) |
| load_ckpt_format | string | Optional | "ckpt" | The format of the loaded model weights. Optional values include `"ckpt"` and `"safetensors"`. |
| remove_redundancy | bool | Optional | False | Whether the loaded model weights were saved with redundancy removed. For details, see [Saving and Loading Weights with De-Redundancy](https://www.mindspore.cn/mindformers/docs/en/master/feature/safetensors.html#de-redundant-saving-and-loading). |
| train_precision_sync | bool | Optional | None | Whether to enable deterministic computation for training. Setting this to `True` enables it, which is generally used to ensure experimental reproducibility; setting it to `False` disables it. |
| infer_precision_sync | bool | Optional | None | Whether to enable deterministic computation for inference. Setting this to `True` enables it, which is generally used to ensure experimental reproducibility; setting it to `False` disables it. |
| use_skip_data_by_global_norm | bool | Optional | False | Whether to enable data skipping based on the global gradient norm. When a batch of data causes exploding gradients, that batch is automatically skipped to improve training stability. For more information, see [Data Skipping](https://www.mindspore.cn/mindformers/docs/en/master/feature/skip_data_and_ckpt_health_monitor.html#skipping-data). |
| use_checkpoint_health_monitor | bool | Optional | False | Whether to enable weight health monitoring. When enabled, checkpoint integrity and availability are verified when saving, preventing corrupted weight files from being saved. For more information, see [Checkpoint Health Monitor](https://www.mindspore.cn/mindformers/docs/en/master/feature/skip_data_and_ckpt_health_monitor.html#checkpoint-health-monitor). |
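The snippet below sketches how these items typically appear at the top level of a `YAML` file; the paths and values are illustrative placeholders, not recommendations:

```yaml
seed: 0
run_mode: 'train'
output_dir: './output'                  # logs, checkpoints, and strategy files are saved here
load_checkpoint: '/path/to/model_dir'   # placeholder path to complete or distributed weights
load_ckpt_format: 'safetensors'
auto_trans_ckpt: True                   # automatically split/merge distributed weights
resume_training: False
```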
### Context Configuration
Context configuration is mainly used to specify the parameters related to [mindspore.set_context](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_context.html).
| Parameter Name | Data Type | Optional | Default Value | Value Description |
|-----------------------------|---------------|-----------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| context.mode | int | Required | None | Sets the backend execution mode. `0` indicates GRAPH_MODE. MindSpore Transformers currently only supports running in GRAPH_MODE. |
| context.device_target | string | Required | None | Sets the backend execution device. MindSpore Transformers only supports running on `Ascend` devices. |
| context.device_id | int | Optional | 0 | Sets the execution device ID. The value must be within the available device range. The default value is `0`. |
| context.enable_graph_kernel | bool | Optional | False | Whether to enable graph fusion to optimize network execution performance. The default value is `False`. |
| context.max_call_depth | int | Optional | 1000 | Sets the maximum depth of function calls. This value must be a positive integer. The default value is `1000`. |
| context.max_device_memory | string | Optional | "1024GB" | Sets the maximum memory available on the device. The format is "xxGB". The default value is `"1024GB"`. |
| context.mempool_block_size | string | Optional | "1GB" | Sets the memory block size. The format is "xxGB". The default value is `"1GB"`. |
| context.save_graphs | bool / int | Optional | False | Whether to save compiled graphs during execution. `False` or `0`: do not save intermediate compiled graphs; `1`: output some intermediate files during graph compilation; `True` or `2`: generate more IR files related to the backend process; `3`: generate a visual computation graph and more detailed frontend IR graphs. |
| context.save_graphs_path | string | Optional | './graph' | The path to save compiled graphs. If not set and `save_graphs != False`, the default temporary path `'./graph'` is used. |
| context.affinity_cpu_list | dict / string | Optional | None | Optional configuration item used to implement a user-defined core-binding strategy. When not configured, the default automatic core binding is applied; when set to `"None"`, core binding is disabled; when a `dict` is passed in, a custom CPU core-binding strategy is applied. For details, refer to [mindspore.runtime.set_cpu_affinity](https://www.mindspore.cn/docs/en/master/api_python/runtime/mindspore.runtime.set_cpu_affinity.html). |
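As an illustration, a typical `context` block might look like the following sketch; the memory value is a placeholder to be adjusted for the actual device:

```yaml
context:
  mode: 0                        # GRAPH_MODE, the only supported mode
  device_target: 'Ascend'
  device_id: 0
  max_device_memory: '58GB'      # placeholder; must not exceed the device's physical memory
```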
### Legacy Model Configuration
If you use MindSpore Transformers to run tasks for legacy models, you need to configure the relevant hyperparameters in a YAML file. Note that the configuration described in this section applies only to legacy models and cannot be mixed with mcore model configurations; see [version compatibility](https://gitee.com/mindspore/mindformers/blob/master/README.md#models-list).
Because different model configurations may vary, this section only describes the general configuration of models in MindSpore Transformers.
| Parameter Name | Type | Optional | Default Value | Value Description |
|--------------------------------------------|-----------|-----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| model.arch.type | string | Required | None | Sets the model class. This class can be used to instantiate the model when building it. |
| model.model_config.type | string | Required | None | Sets the model configuration class. This class must match the model class; that is, it must contain all parameters used by the model class. |
| model.model_config.num_layers | int | Required | None | Sets the number of model layers, typically the number of decoder layers. |
| model.model_config.seq_length | int | Required | None | Sets the model sequence length. This parameter indicates the maximum sequence length supported by the model. |
| model.model_config.hidden_size | int | Required | None | Sets the dimension of the model's hidden state. |
| model.model_config.vocab_size | int | Required | None | Sets the size of the model vocabulary. |
| model.model_config.top_k | int | Optional | None | Sets the sampling from the `top_k` tokens with the highest probability during inference. |
| model.model_config.top_p | float | Optional | None | Sets the sampling from the tokens with the highest probability, whose cumulative probability does not exceed `top_p`, during inference. The value range is usually `(0,1]`. |
| model.model_config.use_past | bool | Optional | False | Whether to enable incremental inference for the model. Enabling this allows Paged Attention to improve inference performance. Must be set to `False` during model training. |
| model.model_config.max_decode_length | int | Optional | None | Sets the maximum length of generated text, including the input length. |
| model.model_config.max_length | int | Optional | None | Same as `max_decode_length`. When both `max_decode_length` and `max_length` are set, only `max_length` takes effect. |
| model.model_config.max_new_tokens | int | Optional | None | Sets the maximum length of generated new text, excluding the input length. When both `max_length` and `max_new_tokens` are set, only `max_new_tokens` takes effect. |
| model.model_config.min_length | int | Optional | None | Sets the minimum length of generated text, including the input length. |
| model.model_config.min_new_tokens | int | Optional | None | Sets the minimum length of new text generated, excluding the input length. When both `min_length` and `min_new_tokens` are set, only `min_new_tokens` takes effect. |
| model.model_config.repetition_penalty | float | Optional | 1.0 | Sets the penalty coefficient for generating repeated text. `repetition_penalty` must be no less than 1. When it is equal to 1, no penalty is imposed on repeated output. |
| model.model_config.block_size | int | Optional | None | Sets the block size in Paged Attention. This only takes effect when `use_past=True`. |
| model.model_config.num_blocks | int | Optional | None | Sets the total number of blocks in Paged Attention. This only takes effect when `use_past=True`. This should satisfy `batch_size × seq_length <= block_size × num_blocks`. |
| model.model_config.return_dict_in_generate | bool | Optional | False | Whether to return the inference results of the `generate` interface in dictionary form. Defaults to `False`. |
| model.model_config.output_scores | bool | Optional | False | Whether to include the scores before softmax of the input for each forward generation when returning the results in dictionary form. Defaults to `False`. |
| model.model_config.output_logits | bool | Optional | False | Whether to include the logits of the model output for each forward generation when returning the results in dictionary form. Defaults to `False`. |
| model.model_config.layers_per_stage | list(int) | Optional | None | Sets the number of transformer layers assigned to each stage when enabling pipeline stages. Defaults to `None`, indicating an equal distribution across all stages. The value to be set is a list of integers with a length equal to the number of pipeline stages, where the i-th position indicates the number of transformer layers assigned to the i-th stage. |
| model.model_config.bias_swiglu_fusion | bool | Optional | False | Whether to use the swiglu fusion operator. Defaults to `False`. |
| model.model_config.apply_rope_fusion | bool | Optional | False | Whether to use the RoPE fusion operator. Defaults to `False`. |
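For reference, a legacy model configuration might be organized as in the sketch below; the class names and sizes are illustrative placeholders rather than a validated setup for any particular model:

```yaml
model:
  arch:
    type: LlamaForCausalLM       # hypothetical model class
  model_config:
    type: LlamaConfig            # hypothetical configuration class matching the model class
    num_layers: 32
    seq_length: 4096
    hidden_size: 4096
    vocab_size: 32000
    use_past: False              # must be False during training
    repetition_penalty: 1.0
```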
In addition to the basic configuration of the above models, the MoE model requires separate configuration of some MoE module hyperparameters. Since different models use different parameters, only the general configuration is described:
| Parameter Name | Type | Optional | Default Value | Value Description |
|--------------------------------------|-------------|-----------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| moe_config.expert_num | int | Required | None | Sets the number of routing experts. |
| moe_config.shared_expert_num | int | Required | None | Sets the number of shared experts. |
| moe_config.moe_intermediate_size | int | Required | None | Sets the size of the intermediate dimension of the expert layer. |
| moe_config.capacity_factor | float | Required | None | Sets the expert capacity factor. |
| moe_config.num_experts_chosen | int | Required | None | Sets the number of experts chosen for each token. |
| moe_config.enable_sdrop | bool | Optional | False | Enables the `sdrop` token drop strategy. Since MindSpore Transformers' MoE uses a static shape implementation, it cannot retain all tokens. |
| moe_config.aux_loss_factor | list(float) | Optional | None | Sets the weight for the balanced loss. |
| moe_config.first_k_dense_replace | int | Optional | 1 | Sets the number of leading blocks in which a dense layer is used instead of the MoE layer. Typically set to `1`, which disables MoE in the first block. |
| moe_config.balance_via_topk_bias | bool | Optional | False | Enables the `aux_loss_free` load balancing algorithm. |
| moe_config.topk_bias_update_rate | float | Optional | None | Sets the bias update step for the `aux_loss_free` load balancing algorithm. |
| moe_config.comp_comm_parallel | bool | Optional | False | Sets whether to enable parallel computation and communication for the FFN. |
| moe_config.comp_comm_parallel_degree | int | Optional | None | Sets the number of splits for FFN computation and communication. A larger number results in more overlap but consumes more memory. This parameter is only valid when `comp_comm_parallel=True`. |
| moe_config.moe_shared_expert_overlap | bool | Optional | False | Sets whether to enable parallel computation and communication for shared and routing experts. |
| moe_config.use_gating_sigmoid | bool | Optional | False | Sets whether to use the sigmoid function for gating results in MoE. |
| moe_config.use_gmm | bool | Optional | False | Sets whether to use GroupedMatmul for MoE expert computation. |
| moe_config.use_fused_ops_permute | bool | Optional | False | Specifies whether MoE uses the permute and unpermute fused operators for performance acceleration. This option only takes effect when `use_gmm=True`. |
| moe_config.enable_deredundency | bool | Optional | False | Specifies whether to enable de-redundancy communication. This requires the expert parallel degree to be an integer multiple of the number of NPUs per node. Default value: False. This option takes effect when `use_gmm=True`. |
| moe_config.npu_nums_per_device | int | Optional | 8 | Specifies the number of NPUs in each node. Default value: 8. This option takes effect when `enable_deredundency=True`. |
| moe_config.enable_gmm_safe_tokens | bool | Optional | False | Ensures that each expert is assigned at least one token to prevent GroupedMatmul calculation failures in extreme load imbalance. The default value is `False`. It is recommended to enable this when `use_gmm=True`. |
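A hedged sketch combining the items above into a `moe_config` block; all values are placeholders for illustration:

```yaml
moe_config:
  expert_num: 8                  # number of routing experts
  shared_expert_num: 1
  moe_intermediate_size: 1408
  num_experts_chosen: 2          # experts selected per token
  first_k_dense_replace: 1       # keep the first block dense
  use_gmm: True                  # compute experts with GroupedMatmul
  use_fused_ops_permute: True    # only effective because use_gmm=True
```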
### Mcore Model Configuration
When using MindSpore Transformers to launch an Mcore model task, you need to configure relevant hyperparameters under `model_config`, including model selection, model parameters, calculation type, and MoE parameters.
Because different model configurations may vary, here are some common model configurations in MindSpore Transformers:
| Parameter | Type | Optional | Default Value | Value Description |
|-----------------------------------------------------------|-----------------|-----------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| model.model_config.model_type | string | Required | None | Sets the model configuration class. The model configuration class must match the model class; that is, the model configuration class should contain all parameters used by the model class. |
| model.model_config.architectures | string | Required | None | Sets the model class. This class is used to instantiate the model when building it. |
| model.model_config.offset | int / list(int) | Required | 0 | When pipeline parallelism is enabled, sets the offset of the number of layers assigned to each stage, based on the total number of model layers. |
| model.model_config.vocab_size | int | Optional | 128000 | Model vocabulary size. |
| model.model_config.hidden_size | int | Required | 0 | Transformer hidden layer size. |
| model.model_config.ffn_hidden_size | int | Optional | None | Transformer feedforward layer size, corresponding to `intermediate_size` in HuggingFace. If not set, the default is 4 * hidden_size. |
| model.model_config.num_layers | int | Required | 0 | Number of Transformer layers, corresponding to `num_hidden_layers` in HuggingFace. |
| model.model_config.max_position_embeddings | int | Optional | 4096 | Maximum sequence length the model can handle. |
| model.model_config.hidden_act | string | Optional | 'gelu' | Activation function used for the nonlinearity in the MLP. |
| model.model_config.num_attention_heads | int | Required | 0 | Number of Transformer attention heads. |
| model.model_config.num_query_groups | int | Optional | None | Number of query groups for the group-query attention mechanism, corresponding to `num_key_value_heads` in HuggingFace. If not configured, the normal attention mechanism is used. |
| model.model_config.kv_channels | int | Optional | None | Projection weight dimension for the multi-head attention mechanism, corresponding to `head_dim` in HuggingFace. If not configured, defaults to `hidden_size // num_attention_heads`. |
| model.model_config.layernorm_epsilon | float | Required | 1e-5 | Epsilon value for any LayerNorm operations. |
| model.model_config.add_bias_linear | bool | Required | True | Whether to include a bias term in all linear layers (the QKV projection, the projection after core attention, and both layers in the MLP). |
| model.model_config.tie_word_embeddings | bool | Required | True | Whether to share input and output embedding weights. |
| model.model_config.use_flash_attention | bool | Required | True | Whether to use flash attention in the attention layer. |
| model.model_config.use_contiguous_weight_layout_attention | bool | Required | False | Determines the weight layout in the QKV linear projection of the self-attention layer. Affects only the self-attention layer. |
| model.model_config.hidden_dropout | float | Required | 0.1 | Dropout probability for the Transformer hidden state. |
| model.model_config.attention_dropout | float | Required | 0.1 | Dropout probability for the post-attention layer. |
| model.model_config.position_embedding_type | string | Required | 'rope' | Position embedding type for the attention layer. |
| model.model_config.params_dtype | string | Required | 'float32' | dtype to use when initializing weights. |
| model.model_config.compute_dtype | string | Required | 'bfloat16' | Computed dtype for Linear layers. |
| model.model_config.layernorm_compute_dtype | string | Required | 'float32' | Computed dtype for LayerNorm layers. |
| model.model_config.softmax_compute_dtype | string | Required | 'float32' | The dtype used to compute the softmax during attention computation. |
| model.model_config.rotary_dtype | string | Required | 'float32' | Computed dtype for custom rotated position embeddings. |
| model.model_config.init_method_std | float | Required | 0.02 | The standard deviation of the zero-mean normal for the default initialization method, corresponding to `initializer_range` in HuggingFace. If `init_method` and `output_layer_init_method` are provided, this method is not used. |
| model.model_config.moe_grouped_gemm | bool | Required | False | When there are multiple experts per rank, batch multiple local (potentially small) GEMMs into a single kernel launch to leverage grouped GEMM capabilities for improved utilization and performance. |
| model.model_config.num_moe_experts | int | Optional | None | The number of experts to use for the MoE layer, corresponding to `n_routed_experts` in HuggingFace. When set, the MLP is replaced by the MoE layer. Setting this to None disables the MoE. |
| model.model_config.num_experts_per_tok | int | Required | 2 | The number of experts to route each token to. |
| model.model_config.moe_ffn_hidden_size | int | Optional | None | Size of the hidden layer of the MoE feedforward network. Corresponds to `moe_intermediate_size` in HuggingFace. |
| model.model_config.moe_router_dtype | string | Required | 'float32' | Data type used for routing and weighted averaging of expert outputs. Corresponds to `router_dense_type` in HuggingFace. |
| model.model_config.gated_linear_unit | bool | Required | False | Use a gated linear unit for the first linear layer in the MLP. |
| model.model_config.norm_topk_prob | bool | Required | True | Whether to normalize the top-k routing probabilities. |
| model.model_config.moe_router_pre_softmax | bool | Required | False | Enables pre-softmax (pre-sigmoid) routing for MoE, meaning softmax is performed before top-k selection. By default, softmax is performed after top-k selection. |
| model.model_config.moe_token_drop_policy | string | Required | 'probs' | The token drop policy. Can be either 'probs' or 'position'. If `'probs'`, the token with the lowest probability is dropped. If `'position'`, the token at the end of each batch is dropped. |
| model.model_config.moe_router_topk_scaling_factor | float | Optional | None | Scaling factor for the routing score in Top-K routing, corresponding to `routed_scaling_factor` in HuggingFace. Valid only when `moe_router_pre_softmax` is enabled. Defaults to `None`, meaning no scaling. |
| model.model_config.moe_aux_loss_coeff | float | Required | 0.0 | Scaling factor for the auxiliary loss. The recommended initial value is 1e-2. |
| model.model_config.moe_router_load_balancing_type | string | Required | 'aux_loss' | The router's load balancing strategy. `'aux_loss'` corresponds to the load balancing loss used in GShard and SwitchTransformer; `'seq_aux_loss'` corresponds to the load balancing loss used in DeepSeekV2 and DeepSeekV3, which is used to calculate the loss of each sample; `'sinkhorn'` corresponds to the balancing algorithm used in S-BASE, and `'none'` means no load balancing. |
| model.model_config.moe_permute_fusion | bool | Optional | False | Whether to use the moe_token_permute fusion operator. Default is `False`. |
| model.model_config.moe_router_force_expert_balance | bool | Optional | False | Whether to use forced load balancing in the expert router. This option is only for performance testing and not for general use. Defaults to `False`. |
| model.model_config.use_interleaved_weight_layout_mlp | bool | Optional | True | Determines the weight arrangement in the linear_fc1 projection of the MLP; affects only MLP layers. When `True`, an interleaved arrangement `[Gate_weights[0], Hidden_weights[0], Gate_weights[1], Hidden_weights[1], ...]` is used; when `False`, a contiguous arrangement `[Gate_weights, Hidden_weights]` is used. This affects the tensor memory layout but not mathematical equivalence. |
| model.model_config.moe_router_enable_expert_bias | bool | Optional | False | Whether to use TopK routing with a dynamic expert bias in the auxiliary-loss-free load balancing strategy. Routing decisions are based on the sum of the routing score and the expert bias. |
| model.model_config.enable_expert_relocation | bool | Optional | False | Whether to enable dynamic expert migration for load balancing in the MoE model. When enabled, experts will be dynamically redistributed between devices based on their load history to improve training efficiency and load balance. Defaults to False. |
| model.model_config.expert_relocation_initial_iteration | int | Optional | 20 | Sets the initial iteration at which expert migration starts; expert migration begins after this many training iterations. |
| model.model_config.expert_relocation_freq | int | Optional | 50 | Sets the frequency of expert migration. After the initial iteration, expert migration is performed every N training iterations. |
| model.model_config.print_expert_load | bool | Optional | False | Whether to print expert load information. If enabled, detailed expert load statistics will be printed during training. Defaults to `False`. |
| model.model_config.moe_router_num_groups | int | Optional | None | The number of expert groups to use for group-limited routing. Equivalent to `n_group` in HuggingFace. |
| model.model_config.moe_router_group_topk | int | Optional | None | The number of selected groups for group-limited routing. Equivalent to `topk_group` in HuggingFace. |
| model.model_config.moe_router_topk | int | Optional | 2 | The number of experts to route each token to. Equivalent to `num_experts_per_tok` in HuggingFace. When used with `moe_router_num_groups` and `moe_router_group_topk`, experts are first divided into `moe_router_num_groups` groups, then `moe_router_group_topk` groups are selected, and finally `moe_router_topk` experts are chosen from the selected groups. |
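The sketch below illustrates how some of these entries fit together under `model_config`; the names and sizes are placeholders borrowed from a typical dense decoder, not a validated configuration:

```yaml
model:
  model_config:
    model_type: llama                # hypothetical configuration class key
    architectures: LlamaForCausalLM  # hypothetical model class
    num_layers: 32
    hidden_size: 4096
    ffn_hidden_size: 11008
    num_attention_heads: 32
    num_query_groups: 8              # enables group-query attention
    position_embedding_type: 'rope'
    params_dtype: 'float32'
    compute_dtype: 'bfloat16'
    offset: 0
```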
### Model Training Configuration
When starting model training, in addition to model-related parameters, you also need to set the parameters of the trainer, runner_config, learning rate, optimizer, and other modules required for training. MindSpore Transformers provides the following configuration items.
| Parameters | Descriptions | Types |
|---------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|
| trainer.type | Set the trainer class, usually different models for different application scenarios will set different trainer classes. | str |
| trainer.model_name | Set the model name in the format '{name}_xxb', indicating a certain specification of the model. | str |
| runner_config.epochs | Set the number of epochs for model training. | int |
| runner_config.batch_size | Set the sample size of the batch data, which overrides the `batch_size` in the dataset configuration. | int |
| runner_config.sink_mode | Enable data sink mode. | bool |
| runner_config.sink_size | Set the number of steps of data sent from Host to Device in each sink, effective only when `sink_mode=True`. This argument will be deprecated in a future release. | int |
| runner_config.gradient_accumulation_steps | Set the number of gradient accumulation steps, the default value is 1, which means that gradient accumulation is not enabled. | int |
| runner_wrapper.type | Set the wrapper class, generally set 'MFTrainOneStepCell'. | str |
| runner_wrapper.local_norm | Whether to print the gradient norm of each parameter on each card. | bool |
| runner_wrapper.scale_sense.type | Set the gradient scaling class, generally just set 'DynamicLossScaleUpdateCell'. | str |
| runner_wrapper.scale_sense.loss_scale_value | Set the initial value of the dynamic loss scale; the loss scaling factor is adjusted dynamically according to this configuration. | int |
| runner_wrapper.use_clip_grad | Whether to enable gradient clipping, which helps avoid cases where the backward gradient is too large and training fails to converge. | bool |
| lr_schedule.type | Set the lr_schedule class, lr_schedule is mainly used to adjust the learning rate in model training. | str |
| lr_schedule.learning_rate | Set the initialized learning rate size. | float |
| lr_scale | Whether to enable learning rate scaling. | bool |
| lr_scale_factor | Set the learning rate scaling factor. | int |
| layer_scale | Whether to enable layer-wise learning rate decay. | bool |
| layer_decay | Set the layer-wise decay coefficient. | float |
| optimizer.type | Set the optimizer class, the optimizer is mainly used to calculate the gradient for model training. | str |
| optimizer.weight_decay | Set the optimizer weight decay factor. | float |
| optimizer.fused_num | Set the number of weights to fuse; the fused weights are updated into the network parameters according to the fusion algorithm. Defaults to `10`. | int |
| optimizer.interleave_step | Set the step interval for selecting weights to fuse; every `interleave_step` steps, one weight is taken as a candidate for fusion. Defaults to `1000`. | int |
| optimizer.fused_algo | Fusion algorithm; supports `ema` and `sma`. Defaults to `ema`. | string |
| optimizer.ema_alpha | The fusion coefficient, effective only when `fused_algo` is set to `ema`. Defaults to `0.2`. | float |
| train_dataset.batch_size | The description is same as that of `runner_config.batch_size`. | int |
| train_dataset.input_columns | Set the input data columns for the training dataset. | list |
| train_dataset.output_columns | Set the output data columns for the training dataset. | list |
| train_dataset.construct_args_key | Set the dataset `keys` that map to the model's `construct` inputs, in lexicographical order; used when the parameter order of the model does not match the order of the dataset inputs. | list |
| train_dataset.column_order | Set the order of the output data columns of the training dataset. | list |
| train_dataset.num_parallel_workers | Set the number of processes that read the training dataset. | int |
| train_dataset.python_multiprocessing | Whether to enable Python multiprocessing mode to improve data processing performance. | bool |
| train_dataset.drop_remainder | Whether to discard the last batch of data if it contains fewer samples than batch_size. | bool |
| train_dataset.repeat | Set the number of dataset duplicates. | int |
| train_dataset.numa_enable | Whether to enable the NUMA binding function for data reading. | bool |
| train_dataset.prefetch_size | Set the amount of pre-read data. | int |
| train_dataset.data_loader.type | Set the data loading class. | str |
| train_dataset.data_loader.dataset_dir | Set the path for loading data. | str |
| train_dataset.data_loader.shuffle | Whether to randomly sort the data when reading the dataset. | bool |
| train_dataset.transforms | Set options related to data augmentation. | - |
| train_dataset_task.type | Set up the dataset class, which is used to encapsulate the data loading class and other related configurations. | str |
| train_dataset_task.dataset_config | Typically set as a reference to `train_dataset`, containing all configuration entries for `train_dataset`. | - |
| auto_tune | Enable auto-tuning of data processing parameters, see [set_enable_autotune](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.config.set_enable_autotune.html) for details. | bool |
| filepath_prefix | Set the save path for parameter configurations after data optimization. | str |
| autotune_per_step | Set the configuration tuning step interval for automatic data acceleration, for details see [set_autotune_interval](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.config.set_autotune_interval.html). | int |
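Putting the pieces together, a training configuration often combines these modules as in the following sketch; the class names (trainer, scheduler, optimizer, data loader) and all values are illustrative assumptions:

```yaml
trainer:
  type: CausalLanguageModelingTrainer   # hypothetical trainer class
  model_name: 'llama2_7b'               # hypothetical '{name}_xxb' identifier
runner_config:
  epochs: 2
  batch_size: 4
  sink_mode: True
runner_wrapper:
  type: MFTrainOneStepCell
  use_clip_grad: True
lr_schedule:
  type: CosineWithWarmUpLR              # hypothetical scheduler class
  learning_rate: 3.e-4
optimizer:
  type: AdamW
  weight_decay: 0.1
train_dataset: &train_dataset
  input_columns: ["input_ids"]
  num_parallel_workers: 8
  drop_remainder: True
  data_loader:
    type: MindDataset                   # hypothetical data loading class
    dataset_dir: "/path/to/dataset"
    shuffle: True
train_dataset_task:
  type: CausalLanguageModelDataset      # hypothetical dataset class
  dataset_config: *train_dataset        # reference to the train_dataset entry above
```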
### Parallel Configuration
In order to improve model performance, it is usually necessary to configure a parallelism strategy in large-scale cluster usage scenarios; for details, please refer to [Distributed Parallelism](https://www.mindspore.cn/mindformers/docs/en/master/feature/parallel_training.html). The parallel configuration in MindSpore Transformers is as follows.
| Parameters | Descriptions | Types |
|-----------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
| use_parallel | Whether to enable parallel mode. | bool |
| parallel_config.data_parallel | Set the data parallel degree. | int |
| parallel_config.model_parallel | Set the model parallel degree. | int |
| parallel_config.context_parallel | Set the sequence (context) parallel degree. | int |
| parallel_config.pipeline_stage | Set the number of pipeline parallel stages. | int |
| parallel_config.micro_batch_num | Set the number of pipeline parallel micro batches, which should satisfy `parallel_config.micro_batch_num` >= `parallel_config.pipeline_stage` when `parallel_config.pipeline_stage` is greater than 1. | int |
| parallel_config.seq_split_num | Set the sequence split number in sequence pipeline parallel, which should be a divisor of sequence length. | int |
| parallel_config.gradient_aggregation_group | Set the size of the gradient communication operator fusion group. | int |
| parallel_config.context_parallel_algo | Set the long-sequence parallel scheme; options are `colossalai_cp`, `ulysses_cp`, and `hybrid_cp`. Effective only when `context_parallel` is greater than 1. | str |
| parallel_config.ulysses_degree_in_cp | Set the Ulysses sequence parallel dimension; used together with the `hybrid_cp` long-sequence parallel scheme. `context_parallel` must be divisible by this parameter with a quotient greater than 1, and the number of attention heads must be divisible by `ulysses_degree_in_cp`. | int |
| micro_batch_interleave_num | Set the number of multi-copy parallel replicas; multi-copy parallelism is enabled when this is greater than 1. Usually enabled when using model parallelism, mainly to hide the communication overhead it introduces; enabling it is not recommended when only pipeline parallelism is used. For details, please refer to [MicroBatchInterleaved](https://www.mindspore.cn/docs/en/master/api_python/parallel/mindspore.parallel.nn.MicroBatchInterleaved.html). | int |
| parallel.parallel_mode | Set parallel mode, `0` means data parallel mode, `1` means semi-automatic parallel mode, `2` means automatic parallel mode, `3` means mixed parallel mode, usually set to semi-automatic parallel mode. | int |
| parallel.gradients_mean | Whether to execute the averaging operator after the gradient AllReduce. Typically set to `False` in semi-automatic parallel mode and `True` in data parallel mode. | bool |
| parallel.enable_alltoall | Enables generation of the AllToAll communication operator during communication. Typically set to `True` only in MOE scenarios, default value is `False`. | bool |
| parallel.full_batch | Whether to load the full batch of data from the dataset in parallel mode. Setting it to `True` means all ranks will load the full batch of data. Setting it to `False` means each rank will only load the corresponding batch of data. When set to `False`, the corresponding `dataset_strategy` must be configured. | bool |
| parallel.dataset_strategy | Only supports `List of List` type and is effective only when `full_batch=False`. The number of sublists in the list must be equal to the length of `train_dataset.input_columns`. Each sublist in the list must have the same shape as the data returned by the dataset. Generally, data parallel splitting is done along the first dimension, so the first dimension of the sublist should be configured to match `data_parallel`, while the other dimensions should be set to `1`. For detailed explanation, refer to [Dataset Splitting](https://www.mindspore.cn/tutorials/en/master/parallel/dataset_slice.html). | list |
| parallel.search_mode | Set fully-automatic parallel strategy search mode, options are `recursive_programming`, `dynamic_programming` and `sharding_propagation`, only works in fully-automatic parallel mode, experimental interface. | str |
| parallel.strategy_ckpt_save_file | Set the save path for the parallel slicing strategy file. | str |
| parallel.strategy_ckpt_config.only_trainable_params | Whether to save (or load) the slicing strategy information for trainable parameters only; the default is `True`. Set this parameter to `False` when the network contains frozen parameters that still need to be sliced. | bool |
| parallel.enable_parallel_optimizer | Whether to turn on optimizer parallelism. In data parallel mode, model weight parameters are sliced across the number of devices; in semi-automatic parallel mode, they are sliced by `parallel_config.data_parallel`. | bool |
| parallel.parallel_optimizer_config.gradient_accumulation_shard | Set whether the cumulative gradient variable is sliced on the data-parallel dimension, only effective if `enable_parallel_optimizer=True`. | bool |
| parallel.parallel_optimizer_config.parallel_optimizer_threshold | Set the threshold (in KB) below which optimizer weight parameters are not sliced; effective only if `enable_parallel_optimizer=True`. | int |
| parallel.parallel_optimizer_config.optimizer_weight_shard_size | Set the communication-domain size for optimizer weight slicing; `parallel_config.data_parallel` must be divisible by this value. Effective only if `enable_parallel_optimizer=True`. | int |
| parallel.pipeline_config.pipeline_interleave | Enable interleaved pipeline parallelism. Set this to `true` when using Seq-Pipe or ZeroBubbleV (also known as DualPipeV). | bool |
| parallel.pipeline_config.pipeline_scheduler | Set the pipeline scheduling strategy. We only support `"seqpipe"` and `"zero_bubble_v"` now. | str |
> Configure the parallel strategy to satisfy device_num = data_parallel × model_parallel × context_parallel × pipeline_stage.
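For example, on an 8-device cluster the constraint above can be met with `2 × 2 × 1 × 2 = 8`; the sketch below is illustrative only:

```yaml
use_parallel: True
parallel_config:
  data_parallel: 2
  model_parallel: 2
  context_parallel: 1
  pipeline_stage: 2          # 2 * 2 * 1 * 2 = 8 devices
  micro_batch_num: 2         # must be >= pipeline_stage when pipeline_stage > 1
parallel:
  parallel_mode: 1           # semi-automatic parallel
  gradients_mean: False
  enable_parallel_optimizer: True
```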
### Model Optimization Configuration
1. MindSpore Transformers provides recomputation-related configurations to reduce the memory footprint of the model during training, see [Recomputation](https://www.mindspore.cn/mindformers/docs/en/master/advanced_development/performance_optimization.html#recomputation) for details.
| Parameters | Descriptions | Types |
|----------------------------------------------------|---------------------------------------------------------------------------------------------------------|-----------------|
| recompute_config.recompute | Whether to enable recompute. | bool/list/tuple |
| recompute_config.select_recompute | Whether to enable selective recomputation, recomputing only the operators in the attention layer. | bool/list |
| recompute_config.parallel_optimizer_comm_recompute | Whether to recompute AllGather communication introduced in parallel by the optimizer. | bool/list |
| recompute_config.mp_comm_recompute | Whether to recompute communications introduced by model parallel. | bool |
| recompute_config.recompute_slice_activation | Whether to slice the Cell outputs that are kept in memory. | bool |
| recompute_config.select_recompute_exclude | Disable recomputation for the specified operator, valid only for the Primitive operators. | bool/list |
| recompute_config.select_comm_recompute_exclude | Disable communication recomputation for the specified operator, valid only for the Primitive operators. | bool/list |
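A short hedged example of a recomputation block combining these entries (whether to enable each item depends on the actual memory pressure):

```yaml
recompute_config:
  recompute: True
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True
```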
2. MindSpore Transformers provides fine-grained activations SWAP-related configurations to reduce the memory footprint of the model during training, see [Fine-Grained Activations SWAP](https://www.mindspore.cn/mindformers/docs/en/master/feature/memory_optimization.html#fine-grained-activations-swap) for details.
| Parameters | Descriptions | Types |
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
| swap_config.swap | Enable activations SWAP. | bool |
| swap_config.default_prefetch | Control the timing of releasing memory in forward phase and starting prefetch in backward phase of the default SWAP strategy, only taking effect when swap=True, layer_swap=None, and op_swap=None. | int |
| swap_config.layer_swap | Select specific layers to enable activations SWAP. | list |
| swap_config.op_swap | Select specific operators within layers to enable activations SWAP. | list |
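A minimal sketch of enabling the default SWAP strategy; since `layer_swap` and `op_swap` are left unset, `default_prefetch` takes effect:

```yaml
swap_config:
  swap: True
  default_prefetch: 1
```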
### Callbacks Configuration
MindSpore Transformers provides encapsulated callback classes, which are mainly used to report the model training state and outputs during training, save model weight files, and perform other operations. The following callback classes are currently supported.
1. MFLossMonitor
This callback function class is mainly used to print information such as training progress, model Loss, and learning rate during the training process and has several configurable items as follows:
| Parameter Name | Type | Optional | Default Value | Value Description |
|--------------------------------|--------|-----------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| learning_rate | float | Optional | None | Sets the initial learning rate for `MFLossMonitor`. Used for logging and training progress calculation. If not set, attempts to obtain it from the optimizer or other configuration. |
| per_print_times | int | Optional | 1 | Sets the frequency of logging for `MFLossMonitor`, in steps. The default value is `1`, which prints a log message once per training step. |
| micro_batch_num | int | Optional | 1 | Sets the number of micro batches processed at each training step, used to calculate the actual loss value. If not set, it is the same as `parallel_config.micro_batch_num` in [Parallel Configuration](#parallel-configuration). |
| micro_batch_interleave_num | int | Optional | 1 | Sets the size of the multi-replica micro-batch for each training step, used for loss calculation. If not configured, it is the same as `micro_batch_interleave_num` in [Parallel Configuration](#parallel-configuration). |
| origin_epochs | int | Optional | None | Sets the total number of training epochs in `MFLossMonitor`. If not configured, it is the same as `runner_config.epochs` in [Model Training Configuration](#model-training-configuration). |
| dataset_size | int | Optional | None | Sets the total number of samples in the dataset in `MFLossMonitor`. If not configured, it automatically uses the actual dataset size loaded. |
| initial_epoch | int | Optional | 0 | Sets the starting epoch number for `MFLossMonitor`. The default value is `0`, indicating that counting starts from epoch 0. This can be used to resume training progress when resuming training from a breakpoint. |
| initial_step | int | Optional | 0 | Sets the number of initial training steps in `MFLossMonitor`. The default value is `0`. This can be used to align logs and progress bars when resuming training. |
| global_batch_size | int | Optional | 0 | Sets the global batch size in `MFLossMonitor` (i.e., the total number of samples used in each training step). If not configured, it is automatically calculated based on the dataset size and parallelization strategy. |
| gradient_accumulation_steps | int | Optional | 1 | Sets the number of gradient accumulation steps in `MFLossMonitor`. If not configured, it is consistent with `gradient_accumulation_steps` in [Model Training Configuration](#model-training-configuration). Used for loss normalization and training progress estimation. |
| check_for_nan_in_loss_and_grad | bool | Optional | False | Whether to enable NaN/Inf detection for loss values and gradients in `MFLossMonitor`. If enabled, training will be terminated if overflow (NaN or INF) is detected. The default value is `False`. It is recommended to enable it during the debugging phase to improve training stability. |
2. SummaryMonitor
This callback function class is mainly used to collect Summary data, see [mindspore.SummaryCollector](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.SummaryCollector.html) for details.
3. CheckpointMonitor
This callback function class is mainly used to save the model weights file during the model training process and has several configurable items as follows:
| Parameter Name | Type | Optional | Default Value | Value Description |
|--------------------------------|---------|----------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| prefix | string | Optional | 'CKP' | Set the prefix for the weight file name. For example, `CKP-100.ckpt` is generated. If not configured, the default value `'CKP'` is used. |
| directory | string | Optional | None | Set the directory for saving weight files. If not configured, the default directory is `checkpoint/` under the `output_dir` directory. |
| save_checkpoint_seconds | int | Optional | 0 | Set the interval for automatically saving weights (in seconds). Mutually exclusive with `save_checkpoint_steps` and takes precedence. For example, save every 3600 seconds. |
| save_checkpoint_steps | int | Optional | 1 | Sets the automatic saving interval for weights based on the number of training steps (unit: steps). Mutually exclusive with `save_checkpoint_seconds`; if both are set, the time-based saving takes precedence. For example, save every 1000 steps. |
| keep_checkpoint_max | int | Optional | 5 | The maximum number of weight files to retain. When the number of saved weights exceeds this value, the system will delete the oldest files in order of creation time to ensure that the total number does not exceed this limit. Used to control disk space usage. |
| keep_checkpoint_per_n_minutes | int | Optional | 0 | Retain one weight every N minutes. This is a time-windowed retention policy often used to balance storage and recovery flexibility in long-term training. For example, setting it to `60` means retaining at least one weight every hour. |
| integrated_save | bool | Optional | True | Whether to enable aggregated weight saving. `True`: aggregate the weights from all devices when saving, i.e., every device saves the same complete weights; `False`: each device saves only its own weights. In semi-automatic parallel mode, it is recommended to set this to `False` to avoid memory issues when saving weight files. |
| save_network_params | bool | Optional | False | Whether to save only the model weights. The default value is `False`. |
| save_trainable_params | bool | Optional | False | Whether to save trainable parameters separately (i.e., the model's parameter weights during partial fine-tuning). |
| async_save | bool | Optional | False | Whether to save weights asynchronously. Enabling this feature will not block the main training process, improving training efficiency. However, please note that I/O resource contention may cause write delays. |
| remove_redundancy | bool | Optional | False | Whether to remove redundancy from model weights when saving. Defaults to `False`. |
| checkpoint_format | string | Optional | 'ckpt' | The format of saved model weights. Optional values are `ckpt` and `safetensors`. Defaults to `ckpt`. |
| embedding_local_norm_threshold | float | Optional | 1.0 | The threshold used in health monitoring to detect abnormalities in the embedding layer gradient or output norm. If the norm exceeds this value, an alarm or data skipping mechanism may be triggered to prevent training divergence. Defaults to `1.0` and can be adjusted based on model scale. |
Multiple callback classes can be configured under the `callbacks` field at the same time. The following is an example of a `callbacks` configuration.
```yaml
callbacks:
- type: MFLossMonitor
- type: CheckpointMonitor
prefix: "name_xxb"
save_checkpoint_steps: 1000
integrated_save: False
async_save: False
```
### Processor Configuration
Processor is mainly used to preprocess the input data for model inference. Since the Processor configuration items are not fixed, only the generic Processor configuration items in MindSpore Transformers are explained here.
| Parameter Name | Type | Optional | Default Value | Value Description |
|---------------------------------|---------|-----------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| processor.type | string | Required | None | Sets the name of the data processing class (Processor) to be used, such as `LlamaProcessor` or `Qwen2Processor`. This class determines the overall input data preprocessing flow and must match the model architecture. |
| processor.return_tensors | string | Optional | 'ms' | Sets the type of tensors returned after data processing. Can be set to `'ms'` to indicate a MindSpore Tensor. |
| processor.image_processor.type | string | Required | None | Sets the type of the image data processing class. Responsible for image normalization, scaling, cropping, and other operations, and must be compatible with the model's visual encoder. |
| processor.tokenizer.type | string | Required | None | Sets the text tokenizer type, such as `LlamaTokenizer` or `Qwen2Tokenizer`. This determines how the text is segmented into subwords or tokens and must be consistent with the language model. |
| processor.tokenizer.vocab_file | string | Required | None | Sets the vocabulary file path required by the tokenizer (such as `vocab.txt` or `tokenizer.model`). The specific file type depends on the tokenizer implementation. This must correspond to `processor.tokenizer.type`; otherwise, loading may fail. |
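As an illustration, a processor block for a text model might look like the sketch below; the class names and vocabulary path are placeholders:

```yaml
processor:
  type: LlamaProcessor                    # hypothetical processing class
  return_tensors: 'ms'
  tokenizer:
    type: LlamaTokenizer                  # hypothetical tokenizer class
    vocab_file: '/path/to/tokenizer.model'
```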
### Model Evaluation Configuration
MindSpore Transformers provides a model evaluation function and also supports evaluating the model while training. The following configurations are related to model evaluation.
| Parameter Name | Type | Optional | Default Value | Value Description |
|---------------------|--------|-----------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| eval_dataset | dict | Required | None | Dataset configuration for evaluation, used in the same way as `train_dataset`. |
| eval_dataset_task | dict | Required | None | Evaluation task configuration, used in the same way as dataset task configuration (such as preprocessing, batch size, etc.), used to define the evaluation process. |
| metric.type | string | Required | None | Set the evaluation type, such as `Accuracy`, `F1`, etc. The specific value must be consistent with the supported evaluation metrics. |
| do_eval | bool | Optional | False | Whether to enable the evaluation-while-training feature. |
| eval_step_interval | int | Optional | 100 | Sets the evaluation step interval. The default value is 100. A value less than or equal to 0 disables step-by-step evaluation. |
| eval_epoch_interval | int | Optional | -1 | Sets the evaluation epoch interval. The default value is -1. A value less than 0 disables epoch-by-epoch evaluation. This configuration is not recommended in data sinking mode. |
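A hedged sketch of enabling evaluation while training; the metric and dataset entries are placeholders that reuse the `train_dataset` conventions:

```yaml
do_eval: True
eval_step_interval: 100        # evaluate every 100 steps
eval_epoch_interval: -1        # disable epoch-based evaluation
metric:
  type: Accuracy               # placeholder metric type
eval_dataset: &eval_dataset
  input_columns: ["input_ids"]
  data_loader:
    type: MindDataset          # hypothetical data loading class
    dataset_dir: "/path/to/eval_dataset"
    shuffle: False
eval_dataset_task:
  type: CausalLanguageModelDataset   # hypothetical dataset class
  dataset_config: *eval_dataset
```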
### Profile Configuration
MindSpore Transformers provides Profile as the main tool for model performance tuning, please refer to [Performance Tuning Guide](https://www.mindspore.cn/mindformers/docs/en/master/advanced_development/performance_optimization.html) for more details. The following is the Profile related configuration.
| Parameter Name | Type | Optional | Default Value | Value Description |
|-----------------------|--------|----------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| profile | bool | Optional | False | Whether to enable the performance collection tool. The default value is `False`. For details, see [mindspore.Profiler](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.Profiler.html). |
| profile_start_step | int | Optional | 1 | Sets the number of steps at which to start collecting performance data. The default value is `1`. |
| profile_stop_step | int | Optional | 10 | Sets the number of steps at which to stop collecting performance data. The default value is `10`. |
| profile_communication | bool | Optional | False | Sets whether to collect communication performance data during multi-device training. This parameter is invalid when using a single card for training and the default value is `False`. |
| profile_memory | bool | Optional | True | Sets whether to collect Tensor memory data. Defaults to `True`. |
| profile_rank_ids | list | Optional | None | Sets the rank ids for which performance collection is enabled. Defaults to `None`, meaning that performance collection is enabled for all rank ids. |
| profile_pipeline | bool | Optional | False | Sets whether to enable performance collection for one card in each stage of the pipeline in parallel. Defaults to `False`. |
| profile_output | string | Required | None | Sets the folder path for saving performance collection files. |
| profiler_level | int | Optional | 1 | Sets the data collection level. Possible values are `(0, 1, 2)`. Defaults to `1`. |
| with_stack | bool | Optional | False | Sets whether to collect call stack data on the Python side. Defaults to `False`. |
| data_simplification | bool | Optional | False | Sets whether to enable data simplification. If enabled, the FRAMEWORK directory and other redundant data will be deleted after exporting performance data. The default value is `False`. |
| init_start_profile | bool | Optional | False | Sets whether to enable performance data collection during Profiler initialization. This parameter has no effect when `profile_start_step` is set. It must be set to `True` when `profile_memory` is enabled. |
| mstx | bool | Optional | False | Sets whether to collect mstx timestamp records, including training steps, HCCL communication operators, etc. The default value is `False`. |
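A minimal illustrative profiling setup that collects steps 1 through 10; the output path is a placeholder:

```yaml
profile: True
profile_start_step: 1
profile_stop_step: 10
profile_output: './profile'    # placeholder output folder
profiler_level: 1
with_stack: False
```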
### Metric Monitoring Configuration
The metric monitoring configuration is primarily used to configure methods to record metrics during training, please refer to [Training Metrics Monitoring](https://www.mindspore.cn/mindformers/docs/en/master/feature/monitor.html) for more details. Below is a description of the common metric monitoring configuration options in MindSpore Transformers:
| Parameters | Type | Optional | Default Value | Value Descriptions |
|--------------------------------------------------|-----------------------|-----------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| monitor_config.monitor_on | bool | Optional | False | Set whether to enable monitoring. The default is `False`, which will disable all parameters below. |
| monitor_config.dump_path | string | Optional | './dump' | Set the save path for metric files of `local_norm`, `device_local_norm` and `local_loss` during training. Defaults to './dump' when not set or set to `null`. |
| monitor_config.target | list(string) | Optional | ['.*'] | Set the (partial) names of the target parameters monitored by the `optimizer state` and `local_norm` metrics; regular expressions are supported. Defaults to `['.*']` when not set or set to `null`, i.e., all parameters are monitored. |
| monitor_config.invert | bool | Optional | False | Set whether to invert the targets specified in `monitor_config.target`, defaults to `False`. |
| monitor_config.step_interval | int | Optional | 1 | Set the frequency for metric recording. The default value is `1`, that is, the metrics are recorded every step. |
| monitor_config.local_loss_format | string / list(string) | Optional | null | Set the format for recording the `local_loss` metric: `'tensorboard'`, `'log'` (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, i.e., this metric is not monitored. |
| monitor_config.device_local_loss_format | string / list(string) | Optional | null | Set the format for recording the `device_local_loss` metric: `'tensorboard'`, `'log'` (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, i.e., this metric is not monitored. |
| monitor_config.local_norm_format | string / list(string) | Optional | null | Set the format for recording the `local_norm` metric: `'tensorboard'`, `'log'` (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, i.e., this metric is not monitored. |
| monitor_config.device_local_norm_format | string / list(string) | Optional | null | Set the format for recording the `device_local_norm` metric: `'tensorboard'`, `'log'` (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, i.e., this metric is not monitored. |
| monitor_config.optimizer_state_format | string / list(string) | Optional | null | Set the format for recording the `optimizer state` metric: `'tensorboard'`, `'log'` (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, i.e., this metric is not monitored. |
| monitor_config.weight_state_format | string / list(string) | Optional | null | Set the format for recording the `weight L2-norm` metric: `'tensorboard'`, `'log'` (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, i.e., this metric is not monitored. |
| monitor_config.throughput_baseline | int / float | Optional | null | Set the baseline of the `throughput linearity` metric; must be a positive number. Defaults to `null`, i.e., this metric is not monitored. |
| monitor_config.print_struct | bool | Optional | False | Set whether to print the names of all the model's trainable parameters. If set to `True`, the names are printed at the beginning of the first step, and the training process exits after the step ends. Defaults to `False`. |
| monitor_config.check_for_global_norm | bool | Optional | False | Set whether to enable the process-level fault recovery function. Defaults to `False`. |
| monitor_config.global_norm_spike_threshold | float | Optional | 3.0 | Set the threshold for the global norm; data skipping is triggered when the global norm exceeds it. Defaults to `3.0`. |
| monitor_config.global_norm_spike_count_threshold | int | Optional | 10 | Set the cumulative number of consecutive global norm anomalies; when this threshold is reached, an exception interrupt is triggered and training is terminated. Defaults to `10`. |
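A hedged example of a monitoring block that records `local_norm` for all parameters to both the log and TensorBoard; values are placeholders:

```yaml
monitor_config:
  monitor_on: True
  dump_path: './dump'
  target: ['.*']                            # monitor all parameters
  step_interval: 1
  local_norm_format: ['log', 'tensorboard']
  global_norm_spike_threshold: 3.0
```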
### TensorBoard Configuration
The TensorBoard configuration is primarily used to configure parameters related to TensorBoard during training, allowing for real-time monitoring and visualization of training metrics, please refer to [Training Metrics Monitoring](https://www.mindspore.cn/mindformers/docs/en/master/feature/monitor.html) for more details. Below is a description of the common TensorBoard configuration options in MindSpore Transformers:
| Parameters | Type | Optional | Default Value | Value Description |
|--------------------------------------------|--------|-----------|----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| tensorboard.tensorboard_dir | string | Required | None | Sets the path where TensorBoard event files are saved. |
| tensorboard.tensorboard_queue_size | int | Optional | 10 | Sets the maximum size of the capture queue; once it is exceeded, data is written to the event file. The default value is `10`. |
| tensorboard.log_loss_scale_to_tensorboard | bool | Optional | False | Sets whether loss scale information is logged to the event file, default is `False`. |
| tensorboard.log_timers_to_tensorboard | bool | Optional | False | Sets whether to log timer information to the event file. The timer information contains the duration of the current training step (or iteration) as well as the throughput. Defaults to `False`. |
| tensorboard.log_expert_load_to_tensorboard | bool | Optional | False | Sets whether to log experts load to the event file, defaults to `False`. |
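For instance, a minimal TensorBoard block might be configured as follows; the directory is a placeholder:

```yaml
tensorboard:
  tensorboard_dir: './tensorboard'
  tensorboard_queue_size: 10
  log_loss_scale_to_tensorboard: True
  log_timers_to_tensorboard: True
```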