Configuration File Descriptions

Overview

A model usually requires different parameters to be configured for training and inference. MindSpore Transformers supports using YAML files to centrally manage and adjust these configurable items, which makes model configuration more structured and easier to maintain.

Description of the YAML File Contents

The YAML file provided by MindSpore Transformers contains configuration items for different functions, which are described below according to their contents.

Basic Configuration

The basic configuration is mainly used to specify MindSpore random seeds and related settings for loading weights.

Parameter Name

Data Type

Optional

Default Value

Value Description

seed

int

Optional

0

Sets the global random seed to ensure experimental reproducibility. For details, see mindspore.set_seed.

run_mode

string

Required

None

Sets the model's run mode. Optional values: train, finetune, eval, or predict.

output_dir

string

Optional

None

Sets the output directory for saving log files, checkpoint files, and parallel strategy files. If the directory does not exist, it will be created automatically.

load_checkpoint

string

Optional

None

The file or folder path for loading weights. Supports the following three scenarios: 1. The path to the complete weights file; 2. The path to the distributed weights folder after offline splitting; 3. The path to the folder containing LoRA incremental weights and base model weights. For details on how to obtain various weights, see Checkpoint Conversion Function.

auto_trans_ckpt

bool

Optional

False

Whether to enable automatic splitting and merging of distributed weights. When enabled, weights split across multiple cards can be loaded onto a single card, and single-card weights can be loaded onto multiple cards. For more information, see Distributed Weight Slicing and Merging.

resume_training

bool

Optional

False

Whether to enable the resumable training feature. When enabled, the optimizer state, learning rate scheduler state, and other parameters will be restored from the path specified by load_checkpoint to continue training. For more information, see Resumable Training.

load_ckpt_format

string

Optional

"ckpt"

The format of the loaded model weights. Optional values include "ckpt" and "safetensors".

remove_redundancy

bool

Optional

False

Whether the loaded model weights were saved with redundancy removed. For details, see Saving and Loading Weights with De-Redundancy.

train_precision_sync

bool

Optional

None

Whether to enable deterministic computation for training. Setting this to True makes training computation deterministic, which is generally used to ensure experimental reproducibility; setting this to False disables this feature.

infer_precision_sync

bool

Optional

None

Whether to enable deterministic computation for inference. Setting this to True makes inference computation deterministic, which is generally used to ensure reproducible results; setting this to False disables this feature.

use_skip_data_by_global_norm

bool

Optional

False

Whether to enable data skipping based on the global gradient norm. When a batch of data causes exploding gradients, that batch is automatically skipped to improve training stability. For more information, see Data Skipping.

use_checkpoint_health_monitor

bool

Optional

False

Whether to enable weight health monitoring. When enabled, checkpoint integrity and availability are verified when saving, preventing corrupted weight files from being saved. For more information, see Checkpoint Health Monitor.
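
The following is a minimal example of the basic configuration. The values shown are illustrative and should be adjusted to the actual task:

seed: 0
output_dir: './output'        # logs, checkpoints, and strategy files are saved here
run_mode: 'train'             # one of train, finetune, eval, predict
load_checkpoint: ''           # leave empty to start without loading weights
load_ckpt_format: 'ckpt'
auto_trans_ckpt: False
resume_training: False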

Context Configuration

Context configuration is mainly used to specify the parameters related to mindspore.set_context.

Parameter Name

Data Type

Optional

Default Value

Value Description

context.mode

int

Required

None

Sets the backend execution mode. 0 indicates GRAPH_MODE. MindSpore Transformers currently only supports running in GRAPH_MODE mode.

context.device_target

string

Required

None

Sets the backend execution device. MindSpore Transformers only supports running on Ascend devices.

context.device_id

int

Optional

0

Sets the execution device ID. The value must be within the available device range. The default value is 0.

context.enable_graph_kernel

bool

Optional

False

Whether to enable graph fusion to optimize network execution performance. The default value is False.

context.max_call_depth

int

Optional

1000

Sets the maximum depth of function calls. This value must be a positive integer. The default value is 1000.

context.max_device_memory

string

Optional

"1024GB"

Sets the maximum memory available on the device. The format is "xxGB". The default value is "1024GB".

context.mempool_block_size

string

Optional

"1GB"

Sets the memory block size. The format is "xxGB". The default value is "1GB".

context.save_graphs

bool / int

Optional

False

Save compiled graphs during execution:
False or 0: Do not save intermediate compiled graphs
1: Output some intermediate files during graph compilation
True or 2: Generate more IR files related to the backend process
3: Generate a visual computation graph and a more detailed frontend IR graph

context.save_graphs_path

string

Optional

'./graph'

The path to save compiled graphs. If not set and save_graphs != False, the default temporary path './graph' is used.

context.affinity_cpu_list

dict / string

Optional

None

Optional configuration item used to implement a user-defined core binding strategy.
• When not configured: the default automatic core binding is used
• When set to None: core binding is disabled
• When a dict is passed in: a custom CPU core binding strategy is applied. For details, refer to mindspore.runtime.set_cpu_affinity
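
The following is an illustrative context configuration. The device and memory values are examples and should be set according to the actual environment:

context:
  mode: 0                     # GRAPH_MODE
  device_target: "Ascend"
  device_id: 0
  max_call_depth: 10000       # illustrative value
  max_device_memory: "58GB"   # illustrative value
  save_graphs: False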

Legacy Model Configuration

If you use MindSpore Transformers to run tasks for legacy models, you need to configure the relevant hyperparameters in a YAML file. Please note that the configuration described in this section applies only to legacy models and cannot be mixed with mcore model configurations. Please pay attention to version compatibility.

Because different model configurations may vary, this section only describes the general configuration of models in MindSpore Transformers.

Parameter Name

Type

Optional

Default Value

Value Description

model.arch.type

string

Required

None

Sets the model class. This class can be used to instantiate the model when building it.

model.model_config.type

string

Required

None

Sets the model configuration class. This class must match the model class; that is, it must contain all parameters used by the model class.

model.model_config.num_layers

int

Required

None

Sets the number of model layers, typically the number of decoder layers.

model.model_config.seq_length

int

Required

None

Sets the model sequence length. This parameter indicates the maximum sequence length supported by the model.

model.model_config.hidden_size

int

Required

None

Sets the dimension of the model's hidden state.

model.model_config.vocab_size

int

Required

None

Sets the size of the model vocabulary.

model.model_config.top_k

int

Optional

None

Sets the sampling from the top_k tokens with the highest probability during inference.

model.model_config.top_p

float

Optional

None

During inference, sampling is performed from the highest-probability tokens whose cumulative probability does not exceed top_p. The value range is usually (0,1].

model.model_config.use_past

bool

Optional

False

Whether to enable incremental inference for the model. Enabling this allows Paged Attention to improve inference performance. Must be set to False during model training.

model.model_config.max_decode_length

int

Optional

None

Sets the maximum length of generated text, including the input length.

model.model_config.max_length

int

Optional

None

Same as max_decode_length. When both max_decode_length and max_length are set, only max_length takes effect.

model.model_config.max_new_tokens

int

Optional

None

Sets the maximum length of generated new text, excluding the input length. When both max_length and max_new_tokens are set, only max_new_tokens takes effect.

model.model_config.min_length

int

Optional

None

Sets the minimum length of generated text, including the input length.

model.model_config.min_new_tokens

int

Optional

None

Sets the minimum length of new text generated, excluding the input length. When both min_length and min_new_tokens are set, only min_new_tokens takes effect.

model.model_config.repetition_penalty

float

Optional

1.0

Sets the penalty coefficient for generating repeated text. repetition_penalty must be no less than 1. When it is equal to 1, no penalty is imposed on repeated output.

model.model_config.block_size

int

Optional

None

Sets the block size in Paged Attention. This only takes effect when use_past=True.

model.model_config.num_blocks

int

Optional

None

Sets the total number of blocks in Paged Attention. This only takes effect when use_past=True. This should satisfy batch_size × seq_length <= block_size × num_blocks.

model.model_config.return_dict_in_generate

bool

Optional

False

Whether to return the inference results of the generate interface in dictionary form. Defaults to False.

model.model_config.output_scores

bool

Optional

False

Whether to include the scores before softmax of the input for each forward generation when returning the results in dictionary form. Defaults to False.

model.model_config.output_logits

bool

Optional

False

Whether to include the logits of the model output for each forward generation when returning the results in dictionary form. Defaults to False.

model.model_config.layers_per_stage

list(int)

Optional

None

Sets the number of transformer layers assigned to each stage when enabling pipeline stages. Defaults to None, indicating an equal distribution across all stages. The value to be set is a list of integers with a length equal to the number of pipeline stages, where the i-th position indicates the number of transformer layers assigned to the i-th stage.

model.model_config.bias_swiglu_fusion

bool

Optional

False

Whether to use the swiglu fusion operator. Defaults to False.

model.model_config.apply_rope_fusion

bool

Optional

False

Whether to use the RoPE fusion operator. Defaults to False.
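
The following is an illustrative legacy model configuration. The class names and hyperparameter values are examples only and must match the model actually being used:

model:
  arch:
    type: LlamaForCausalLM    # illustrative model class
  model_config:
    type: LlamaConfig         # illustrative model configuration class
    num_layers: 32
    seq_length: 4096
    hidden_size: 4096
    vocab_size: 32000
    use_past: False           # must be False during training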

In addition to the basic configuration of the above models, the MoE model requires separate configuration of some MoE module hyperparameters. Since different models use different parameters, only the general configuration is described:

Parameter Name

Type

Optional

Default Value

Value Description

moe_config.expert_num

int

Required

None

Sets the number of routing experts.

moe_config.shared_expert_num

int

Required

None

Sets the number of shared experts.

moe_config.moe_intermediate_size

int

Required

None

Sets the size of the intermediate dimension of the expert layer.

moe_config.capacity_factor

int

Required

None

Sets the expert capacity factor.

moe_config.num_experts_chosen

int

Required

None

Sets the number of experts chosen for each token.

moe_config.enable_sdrop

bool

Optional

False

Enables the sdrop token drop strategy. Since MindSpore Transformers' MoE uses a static shape implementation, it cannot retain all tokens.

moe_config.aux_loss_factor

list(float)

Optional

None

Sets the weight for the balanced loss.

moe_config.first_k_dense_replace

int

Optional

1

Sets the number of leading blocks that use a dense FFN instead of MoE. Typically set to 1 to disable MoE in the first block.

moe_config.balance_via_topk_bias

bool

Optional

False

Enables the aux_loss_free load balancing algorithm.

moe_config.topk_bias_update_rate

float

Optional

None

Sets the bias update step for the aux_loss_free load balancing algorithm.

moe_config.comp_comm_parallel

bool

Optional

False

Sets whether to enable parallel computation and communication for ffn.

moe_config.comp_comm_parallel_degree

int

Optional

None

Sets the number of splits for ffn computation and communication. A larger number results in more overlap, but consumes more memory. This parameter is only valid when comp_comm_parallel=True.

moe_config.moe_shared_expert_overlap

bool

Optional

False

Sets whether to enable parallel computation and communication for shared and routing experts.

moe_config.use_gating_sigmoid

bool

Optional

False

Sets whether to use the sigmoid function for gating results in MoE.

moe_config.use_gmm

bool

Optional

False

Sets whether to use GroupedMatmul for MoE expert computation.

moe_config.use_fused_ops_permute

bool

Optional

False

Specifies whether MoE uses the permute and unpermute fused operators for performance acceleration. This option only takes effect when use_gmm=True.

moe_config.enable_deredundency

bool

Optional

False

Specifies whether to enable de-redundancy communication. This requires the expert parallel degree to be an integer multiple of the number of NPUs in each node. Default value: False. This option takes effect when use_gmm=True.

moe_config.npu_nums_per_device

int

Optional

8

Specifies the number of NPUs in each node. Default value: 8. This option takes effect when enable_deredundency=True.

moe_config.enable_gmm_safe_tokens

bool

Optional

False

Ensures that each expert is assigned at least one token to prevent GroupedMatmul calculation failures in extreme load imbalance. The default value is False. It is recommended to enable this when use_gmm=True.
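
The following is an illustrative moe_config for a legacy MoE model; the values are examples only:

moe_config:
  expert_num: 8
  shared_expert_num: 1
  moe_intermediate_size: 1408
  num_experts_chosen: 2
  capacity_factor: 2
  enable_sdrop: True
  first_k_dense_replace: 1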

Mcore Model Configuration

When using MindSpore Transformers to launch an Mcore model task, you need to configure relevant hyperparameters under model_config, including model selection, model parameters, calculation type, and MoE parameters.

Because different model configurations may vary, here are some common model configurations in MindSpore Transformers:

Parameter

Type

Optional

Default Value

Value Description

model.model_config.model_type

string

Required

None

Sets the model configuration class. The model configuration class must match the model class; that is, the model configuration class should contain all parameters used by the model class.

model.model_config.architectures

string

Required

None

Sets the model class. When building the model, you can instantiate the model based on the model class.

model.model_config.offset

int / list(int)

Required

0

When pipeline parallelism is enabled, you need to set the layer offset for each stage based on the number of model layers to build pipeline parallelism.

model.model_config.vocab_size

int

Optional

128000

Model vocabulary size.

model.model_config.hidden_size

int

Required

0

Transformer hidden layer size.

model.model_config.ffn_hidden_size

int

Optional

None

Transformer feedforward layer size, corresponding to intermediate_size in HuggingFace. If not set, the default is 4 * hidden_size.

model.model_config.num_layers

int

Required

0

Number of Transformer layers, corresponding to num_hidden_layers in HuggingFace.

model.model_config.max_position_embeddings

int

Optional

4096

Maximum sequence length the model can handle.

model.model_config.hidden_act

string

Optional

'gelu'

Activation function used for the nonlinearity in the MLP.

model.model_config.num_attention_heads

int

Required

0

Number of Transformer attention heads.

model.model_config.num_query_groups

int

Optional

None

Number of query groups for the group-query attention mechanism, corresponding to num_key_value_heads in HuggingFace. If not configured, the normal attention mechanism is used.

model.model_config.kv_channels

int

Optional

None

Projection weight dimension for the multi-head attention mechanism, corresponding to head_dim in HuggingFace. If not configured, defaults to hidden_size // num_attention_heads.

model.model_config.layernorm_epsilon

float

Required

1e-5

Epsilon value for any LayerNorm operations.

model.model_config.add_bias_linear

bool

Required

True

Include a bias term in all linear layers (after QKV projection, after core attention, and both in MLP layers).

model.model_config.tie_word_embeddings

bool

Required

True

Whether to share input and output embedding weights.

model.model_config.use_flash_attention

bool

Required

True

Whether to use flash attention in the attention layer.

model.model_config.use_contiguous_weight_layout_attention

bool

Required

False

Determines the weight layout in the QKV linear projection of the self-attention layer. Affects only the self-attention layer.

model.model_config.hidden_dropout

float

Required

0.1

Dropout probability for the Transformer hidden state.

model.model_config.attention_dropout

float

Required

0.1

Dropout probability for the post-attention layer.

model.model_config.position_embedding_type

string

Required

'rope'

Position embedding type for the attention layer.

model.model_config.params_dtype

string

Required

'float32'

dtype to use when initializing weights.

model.model_config.compute_dtype

string

Required

'bfloat16'

Computed dtype for Linear layers.

model.model_config.layernorm_compute_dtype

string

Required

'float32'

Computed dtype for LayerNorm layers.

model.model_config.softmax_compute_dtype

string

Required

'float32'

The dtype used to compute the softmax during attention computation.

model.model_config.rotary_dtype

string

Required

'float32'

Computed dtype for custom rotated position embeddings.

model.model_config.init_method_std

float

Required

0.02

The standard deviation of the zero-mean normal distribution used by the default initialization method, corresponding to initializer_range in HuggingFace. If init_method and output_layer_init_method are provided, this value is not used.

model.model_config.moe_grouped_gemm

bool

Required

False

When there are multiple experts per rank, combine multiple local (potentially small) GEMMs into a single kernel launch to leverage grouped GEMM capabilities for improved utilization and performance.

model.model_config.num_moe_experts

int

Optional

None

The number of experts to use for the MoE layer, corresponding to n_routed_experts in HuggingFace. When set, the MLP is replaced by the MoE layer. Setting this to None disables the MoE.

model.model_config.num_experts_per_tok

int

Required

2

The number of experts to route each token to.

model.model_config.moe_ffn_hidden_size

int

Optional

None

Size of the hidden layer of the MoE feedforward network. Corresponds to moe_intermediate_size in HuggingFace.

model.model_config.moe_router_dtype

string

Required

'float32'

Data type used for routing and weighted averaging of expert outputs. Corresponds to router_dense_type in HuggingFace.

model.model_config.gated_linear_unit

bool

Required

False

Use a gated linear unit for the first linear layer in the MLP.

model.model_config.norm_topk_prob

bool

Required

True

Whether to use top-k probabilities for normalization.

model.model_config.moe_router_pre_softmax

bool

Required

False

Enables pre-softmax (pre-sigmoid) routing for MoE, meaning softmax is performed before top-k selection. By default, softmax is performed after top-k selection.

model.model_config.moe_token_drop_policy

string

Required

'probs'

The token drop policy. Can be either 'probs' or 'position'. If 'probs', the token with the lowest probability is dropped. If 'position', the token at the end of each batch is dropped.

model.model_config.moe_router_topk_scaling_factor

float

Optional

None

Scaling factor for the routing score in Top-K routing, corresponding to routed_scaling_factor in HuggingFace. Valid only when moe_router_pre_softmax is enabled. Defaults to None, meaning no scaling.

model.model_config.moe_aux_loss_coeff

float

Required

0.0

Scaling factor for the auxiliary loss. The recommended initial value is 1e-2.

model.model_config.moe_router_load_balancing_type

string

Required

'aux_loss'

The router's load balancing strategy. 'aux_loss' corresponds to the load balancing loss used in GShard and SwitchTransformer; 'seq_aux_loss' corresponds to the load balancing loss used in DeepSeekV2 and DeepSeekV3, which is used to calculate the loss of each sample; 'sinkhorn' corresponds to the balancing algorithm used in S-BASE, and 'none' means no load balancing.

model.model_config.moe_permute_fusion

bool

Optional

False

Whether to use the moe_token_permute fusion operator. Default is False.

model.model_config.moe_router_force_expert_balance

bool

Optional

False

Whether to use forced load balancing in the expert router. This option is only for performance testing and not for general use. Defaults to False.

model.model_config.use_interleaved_weight_layout_mlp

bool

Optional

True

Determines the weight arrangement in the linear_fc1 projection of the MLP. Affects only MLP layers.
1. When True, use an interleaved arrangement: [Gate_weights[0], Hidden_weights[0], Gate_weights[1], Hidden_weights[1], ...].
2. When False, use a continuous arrangement: [Gate_weights, Hidden_weights].
Note: This affects tensor memory layout, but does not affect mathematical equivalence.

model.model_config.moe_router_enable_expert_bias

bool

Optional

False

Whether to use TopK routing with a dynamic expert bias in the aux-loss-free load balancing strategy. Routing decisions are based on the sum of the routing score and the expert bias.

model.model_config.enable_expert_relocation

bool

Optional

False

Whether to enable dynamic expert migration for load balancing in the MoE model. When enabled, experts will be dynamically redistributed between devices based on their load history to improve training efficiency and load balance. Defaults to False.

model.model_config.expert_relocation_initial_iteration

int

Optional

20

The initial iteration at which expert relocation starts. Expert relocation begins after this many training iterations.

model.model_config.expert_relocation_freq

int

Optional

50

Frequency of expert relocation in training iterations. After the initial iteration, expert relocation is performed every N iterations.

model.model_config.print_expert_load

bool

Optional

False

Whether to print expert load information. If enabled, detailed expert load statistics will be printed during training. Defaults to False.

model.model_config.moe_router_num_groups

int

Optional

None

The number of expert groups to use for group-limited routing. Equivalent to n_group in HuggingFace.

model.model_config.moe_router_group_topk

int

Optional

None

The number of selected groups for group-limited routing. Equivalent to topk_group in HuggingFace.

model.model_config.moe_router_topk

int

Optional

2

The number of experts to route each token to. Equivalent to num_experts_per_tok in HuggingFace. When used with moe_router_num_groups and moe_router_group_topk, experts are first divided into moe_router_num_groups groups, moe_router_group_topk groups are selected, and then moe_router_topk experts are chosen from the selected groups.
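
The following is an illustrative Mcore model configuration. The architecture name and hyperparameter values are examples only and must correspond to the actual model:

model:
  model_config:
    model_type: 'llama'                 # illustrative
    architectures: 'LlamaForCausalLM'   # illustrative
    num_layers: 32
    hidden_size: 4096
    ffn_hidden_size: 11008
    num_attention_heads: 32
    num_query_groups: 8
    vocab_size: 128000
    position_embedding_type: 'rope'
    params_dtype: 'float32'
    compute_dtype: 'bfloat16'
    use_flash_attention: True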

Model Training Configuration

When starting model training, in addition to model-related parameters, you also need to set the parameters of modules required for training, such as trainer, runner_config, the learning rate, and the optimizer. MindSpore Transformers provides the following configuration items.

Parameters

Descriptions

Types

trainer.type

Set the trainer class, usually different models for different application scenarios will set different trainer classes.

str

trainer.model_name

Set the model name in the format '{name}_xxb', indicating a certain specification of the model.

str

runner_config.epochs

Set the number of epochs for model training.

int

runner_config.batch_size

Set the sample size of the batch data, which overrides the batch_size in the dataset configuration.

int

runner_config.sink_mode

Enable data sink mode.

bool

runner_config.sink_size

Set the number of steps sent down from the host to the device in each sink iteration, effective only when sink_mode=True. This argument will be deprecated in a future release.

int

runner_config.gradient_accumulation_steps

Set the number of gradient accumulation steps, the default value is 1, which means that gradient accumulation is not enabled.

int

runner_wrapper.type

Set the wrapper class, generally set 'MFTrainOneStepCell'.

str

runner_wrapper.local_norm

Whether to print the gradient norm (local norm) of each parameter on each card.

bool

runner_wrapper.scale_sense.type

Set the gradient scaling class, generally just set 'DynamicLossScaleUpdateCell'.

str

runner_wrapper.scale_sense.loss_scale_value

Set the initial value of the dynamic loss scale; the loss scaling can change dynamically based on this configuration.

int

runner_wrapper.use_clip_grad

Whether to turn on gradient clipping, which helps avoid cases where the backward gradient is too large and training fails to converge.

bool

lr_schedule.type

Set the lr_schedule class, lr_schedule is mainly used to adjust the learning rate in model training.

str

lr_schedule.learning_rate

Set the initialized learning rate size.

float

lr_scale

Whether to enable learning rate scaling.

bool

lr_scale_factor

Set the learning rate scaling factor.

int

layer_scale

Whether to turn on layer-wise decay.

bool

layer_decay

Set the layer-wise decay factor.

float

optimizer.type

Set the optimizer class; the optimizer is mainly used to update model parameters during training.

str

optimizer.weight_decay

Set the optimizer weight decay factor.

float

optimizer.fused_num

Set the number of weights to be fused; the fused weights are applied to the network parameters according to the fusion algorithm. Defaults to 10.

int

optimizer.interleave_step

Set the step interval between weights to be fused; every interleave_step steps, a weight is taken as a candidate for fusion. Defaults to 1000.

int

optimizer.fused_algo

Fusion algorithm; supports ema and sma. Defaults to ema.

string

optimizer.ema_alpha

The fusion coefficient, effective only when fused_algo is set to ema. Defaults to 0.2.

float

train_dataset.batch_size

The description is same as that of runner_config.batch_size.

int

train_dataset.input_columns

Set the input data columns for the training dataset.

list

train_dataset.output_columns

Set the output data columns for the training dataset.

list

train_dataset.construct_args_key

Set the keys of the dataset inputs that are passed to the model's construct method, in lexicographical order; used when the model's parameter order does not match the order of the dataset inputs.

list

train_dataset.column_order

Set the order of the output data columns of the training dataset.

list

train_dataset.num_parallel_workers

Set the number of processes that read the training dataset.

int

train_dataset.python_multiprocessing

Whether to enable Python multiprocessing mode to improve data processing performance.

bool

train_dataset.drop_remainder

Whether to discard the last batch of data if it contains fewer samples than batch_size.

bool

train_dataset.repeat

Set the number of dataset duplicates.

int

train_dataset.numa_enable

Whether to enable NUMA binding when dataset reading starts.

bool

train_dataset.prefetch_size

Set the amount of pre-read data.

int

train_dataset.data_loader.type

Set the data loading class.

str

train_dataset.data_loader.dataset_dir

Set the path for loading data.

str

train_dataset.data_loader.shuffle

Whether to randomly sort the data when reading the dataset.

bool

train_dataset.transforms

Set options related to data augmentation.

-

train_dataset_task.type

Set up the dataset class, which is used to encapsulate the data loading class and other related configurations.

str

train_dataset_task.dataset_config

Typically set as a reference to train_dataset, containing all configuration entries for train_dataset.

-

auto_tune

Enable auto-tuning of data processing parameters, see set_enable_autotune for details.

bool

filepath_prefix

Set the save path for parameter configurations after data optimization.

str

autotune_per_step

Set the configuration tuning step interval for automatic data acceleration, for details see set_autotune_interval.

int
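
The following is an illustrative training configuration fragment. The trainer, optimizer, and learning-rate classes are examples; use the classes appropriate for the actual model and task:

trainer:
  type: CausalLanguageModelingTrainer   # illustrative trainer class
  model_name: 'llama2_7b'               # illustrative
runner_config:
  epochs: 2
  batch_size: 1
  sink_mode: True
  gradient_accumulation_steps: 1
runner_wrapper:
  type: MFTrainOneStepCell
  use_clip_grad: True
optimizer:
  type: AdamW                           # illustrative optimizer class
  weight_decay: 0.1
lr_schedule:
  type: CosineWithWarmUpLR              # illustrative learning-rate class
  learning_rate: 3.e-4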

Parallel Configuration

To improve model performance in large-scale cluster scenarios, it is usually necessary to configure a parallelism strategy for the model. For details, please refer to Distributed Parallelism. The parallel configuration in MindSpore Transformers is as follows.

Parameters

Descriptions

Types

use_parallel

Enable parallel mode.

bool

parallel_config.data_parallel

Set the data parallelism degree.

int

parallel_config.model_parallel

Set the model parallelism degree.

int

parallel_config.context_parallel

Set the sequence (context) parallelism degree.

int

parallel_config.pipeline_stage

Set the number of pipeline stages.

int

parallel_config.micro_batch_num

Set the number of pipeline-parallel micro-batches, which should satisfy parallel_config.micro_batch_num >= parallel_config.pipeline_stage when parallel_config.pipeline_stage is greater than 1.

int

parallel_config.seq_split_num

Set the sequence split number in sequence pipeline parallel, which should be a divisor of sequence length.

int

parallel_config.gradient_aggregation_group

Set the size of the gradient communication operator fusion group.

int

parallel_config.context_parallel_algo

Set the long-sequence parallel scheme; options are colossalai_cp, ulysses_cp, and hybrid_cp. Effective only if the number of context_parallel slices is greater than 1.

str

parallel_config.ulysses_degree_in_cp

Set the Ulysses sequence parallelism degree, used together with the hybrid_cp long-sequence parallel scheme. This requires that context_parallel be divisible by this parameter, that this parameter be greater than 1, and that the number of attention heads be divisible by ulysses_degree_in_cp.

int

micro_batch_interleave_num

Set the multi-copy parallelism number; multi-copy parallelism is enabled when it is greater than 1. It is usually enabled when using model parallelism, mainly to hide the communication overhead introduced by model parallelism, and is not recommended when only pipeline parallelism is used. For details, please refer to MicroBatchInterleaved.

int

parallel.parallel_mode

Set the parallel mode: 0 means data parallel mode, 1 means semi-automatic parallel mode, 2 means automatic parallel mode, and 3 means mixed parallel mode. This is usually set to semi-automatic parallel mode.

int

parallel.gradients_mean

Whether to execute the averaging operator after the gradient AllReduce. Typically set to False in semi-automatic parallel mode and True in data parallel mode.

bool

parallel.enable_alltoall

Enables generation of the AllToAll communication operator during communication. Typically set to True only in MOE scenarios, default value is False.

bool

parallel.full_batch

Whether to load the full batch of data from the dataset in parallel mode. Setting it to True means all ranks will load the full batch of data. Setting it to False means each rank will only load the corresponding batch of data. When set to False, the corresponding dataset_strategy must be configured.

bool

parallel.dataset_strategy

Only supports List of List type and is effective only when full_batch=False. The number of sublists in the list must be equal to the length of train_dataset.input_columns. Each sublist in the list must have the same shape as the data returned by the dataset. Generally, data parallel splitting is done along the first dimension, so the first dimension of the sublist should be configured to match data_parallel, while the other dimensions should be set to 1. For detailed explanation, refer to Dataset Splitting.

list

parallel.search_mode

Set the fully-automatic parallel strategy search mode; options are recursive_programming, dynamic_programming, and sharding_propagation. It only works in fully-automatic parallel mode and is an experimental interface.

str

parallel.strategy_ckpt_save_file

Set the save path for the parallel slicing strategy file.

str

parallel.strategy_ckpt_config.only_trainable_params

Whether to save (or load) the slicing strategy information for trainable parameters only. The default is True; set this parameter to False when the network contains frozen parameters that still need to be sliced.

bool

parallel.enable_parallel_optimizer

Whether to turn on optimizer parallelism:
1. In data parallel mode, model weight parameters are sliced across the number of devices.
2. In semi-automatic parallel mode, model weight parameters are sliced by parallel_config.data_parallel.

bool

parallel.parallel_optimizer_config.gradient_accumulation_shard

Set whether the cumulative gradient variable is sliced on the data-parallel dimension, only effective if enable_parallel_optimizer=True.

bool

parallel.parallel_optimizer_config.parallel_optimizer_threshold

Set the threshold for slicing optimizer weight parameters; effective only if enable_parallel_optimizer=True.

int

parallel.parallel_optimizer_config.optimizer_weight_shard_size

Set the size of the communication domain used for optimizer weight slicing; parallel_config.data_parallel must be an integer multiple of this value. Effective only if enable_parallel_optimizer=True.

int

parallel.pipeline_config.pipeline_interleave

Whether to enable interleaved pipeline parallelism. Set this variable to True when using Seq-Pipe or ZeroBubbleV (also known as DualPipeV).

bool

parallel.pipeline_config.pipeline_scheduler

Set the pipeline scheduling strategy. Currently only "seqpipe" and "zero_bubble_v" are supported.

str

Configure the parallel strategy to satisfy device_num = data_parallel × model_parallel × context_parallel × pipeline_stage.
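
For example, the following illustrative configuration targets an 8-device cluster (2 × 4 × 1 × 1 = 8); the parallelism degrees are examples only:

use_parallel: True
parallel_config:
  data_parallel: 2
  model_parallel: 4
  context_parallel: 1
  pipeline_stage: 1
  micro_batch_num: 1
parallel:
  parallel_mode: 1            # semi-automatic parallel mode
  enable_parallel_optimizer: True
  full_batch: True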

Model Optimization Configuration

  1. MindSpore Transformers provides recomputation-related configurations to reduce the memory footprint of the model during training, see Recomputation for details.

    Parameters

    Descriptions

    Types

    recompute_config.recompute

    Whether to enable recompute.

    bool/list/tuple

    recompute_config.select_recompute

    Whether to turn on selective recomputation, which recomputes only the operators in the attention layer.

    bool/list

    recompute_config.parallel_optimizer_comm_recompute

    Whether to recompute AllGather communication introduced in parallel by the optimizer.

    bool/list

    recompute_config.mp_comm_recompute

    Whether to recompute communications introduced by model parallel.

    bool

    recompute_config.recompute_slice_activation

    Whether to slice the output activations of Cells kept in memory.

    bool

    recompute_config.select_recompute_exclude

    Disable recomputation for the specified operator, valid only for the Primitive operators.

    bool/list

    recompute_config.select_comm_recompute_exclude

    Disable communication recomputation for the specified operator, valid only for the Primitive operators.

    bool/list

  2. MindSpore Transformers provides fine-grained activations SWAP-related configurations to reduce the memory footprint of the model during training, see Fine-Grained Activations SWAP for details.

    Parameters

    Descriptions

    Types

    swap_config.swap

    Enable activations SWAP.

    bool

    swap_config.default_prefetch

    Control the timing of releasing memory in forward phase and starting prefetch in backward phase of the default SWAP strategy, only taking effect when swap=True, layer_swap=None, and op_swap=None.

    int

    swap_config.layer_swap

    Select specific layers to enable activations SWAP.

    list

    swap_config.op_swap

    Select specific operators within layers to enable activations SWAP.

    list
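
The following is an illustrative recomputation and SWAP configuration fragment; the switches shown are examples and should be tuned to the actual memory budget:

recompute_config:
  recompute: True
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: False
swap_config:
  swap: False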

Callbacks Configuration

MindSpore Transformers provides encapsulated Callbacks classes, which are mainly used to report the model training state and outputs during training, save model weight files, and perform other operations. Currently, the following Callbacks classes are supported.

  1. MFLossMonitor

    This callback function class is mainly used to print information such as training progress, model Loss, and learning rate during the training process and has several configurable items as follows:

    Parameter Name

    Type

    Optional

    Default Value

    Value Description

    learning_rate

    float

    Optional

    None

    Sets the initial learning rate for MFLossMonitor. Used for logging and training progress calculation. If not set, attempts to obtain it from the optimizer or other configuration.

    per_print_times

    int

    Optional

    1

    Sets the frequency of logging for MFLossMonitor, in steps. The default value is 1, which prints a log message once per training step.

    micro_batch_num

    int

    Optional

    1

    Sets the number of micro batches processed at each training step, used to calculate the actual loss value. If not set, it is the same as parallel_config.micro_batch_num in Parallel Configuration.

    micro_batch_interleave_num

    int

    Optional

    1

    Sets the size of the multi-replica micro-batch for each training step, used for loss calculation. If not configured, it is the same as micro_batch_interleave_num in Parallel Configuration.

    origin_epochs

    int

    Optional

    None

    Sets the total number of training epochs in MFLossMonitor. If not configured, it is the same as runner_config.epochs in Model Training Configuration.

    dataset_size

    int

    Optional

    None

    Sets the total number of samples in the dataset in MFLossMonitor. If not configured, it automatically uses the actual dataset size loaded.

    initial_epoch

    int

    Optional

    0

    Sets the starting epoch number for MFLossMonitor. The default value is 0, indicating that counting starts from epoch 0. This can be used to resume training progress when resuming training from a breakpoint.

    initial_step

    int

    Optional

    0

    Sets the number of initial training steps in MFLossMonitor. The default value is 0. This can be used to align logs and progress bars when resuming training.

    global_batch_size

    int

    Optional

    0

    Sets the global batch size in MFLossMonitor (i.e., the total number of samples used in each training step). If not configured, it is automatically calculated based on the dataset size and parallelization strategy.

    gradient_accumulation_steps

    int

    Optional

    1

    Sets the number of gradient accumulation steps in MFLossMonitor. If not configured, it is consistent with gradient_accumulation_steps in Model Training Configuration. Used for loss normalization and training progress estimation.

    check_for_nan_in_loss_and_grad

    bool

    Optional

    False

    Whether to enable NaN/Inf detection for loss values and gradients in MFLossMonitor. If enabled, training will be terminated if overflow (NaN or INF) is detected. The default value is False. It is recommended to enable it during the debugging phase to improve training stability.

  2. SummaryMonitor

    This callback function class is mainly used to collect Summary data, see mindspore.SummaryCollector for details.

  3. CheckpointMonitor

    This callback function class is mainly used to save the model weights file during the model training process and has several configurable items as follows:

    Parameter Name

    Type

    Optional

    Default Value

    Value Description

    prefix

    string

    Optional

    'CKP'

    Set the prefix for the weight file name. For example, CKP-100.ckpt is generated. If not configured, the default value 'CKP' is used.

    directory

    string

    Optional

    None

    Set the directory for saving weight files. If not configured, the default directory is checkpoint/ under the output_dir directory.

    save_checkpoint_seconds

    int

    Optional

    0

    Set the interval for automatically saving weights (in seconds). Mutually exclusive with save_checkpoint_steps and takes precedence. For example, save every 3600 seconds.

    save_checkpoint_steps

    int

    Optional

    1

    Sets the automatic saving interval for weights based on the number of training steps (unit: steps). Mutually exclusive with save_checkpoint_seconds; if both are set, the time-based saving takes precedence. For example, save every 1000 steps.

    keep_checkpoint_max

    int

    Optional

    5

    The maximum number of weight files to retain. When the number of saved weights exceeds this value, the system will delete the oldest files in order of creation time to ensure that the total number does not exceed this limit. Used to control disk space usage.

    keep_checkpoint_per_n_minutes

    int

    Optional

    0

    Retain one weight every N minutes. This is a time-windowed retention policy often used to balance storage and recovery flexibility in long-term training. For example, setting it to 60 means retaining at least one weight every hour.

    integrated_save

    bool

    Optional

    True

    Whether to enable aggregated weight saving:
    True: Aggregate weights from all devices when saving the weight file, i.e., all devices have the same weights;
    False: Each device saves its own weights.
    In semi-automatic parallel mode, it is recommended to set this to False to avoid memory issues when saving weight files.

    save_network_params

    bool

    Optional

    False

    Whether to save only the model weights. The default value is False.

    save_trainable_params

    bool

    Optional

    False

    Whether to save trainable parameters separately (i.e., the model's parameter weights during partial fine-tuning).

    async_save

    bool

    Optional

    False

    Whether to save weights asynchronously. Enabling this feature will not block the main training process, improving training efficiency. However, please note that I/O resource contention may cause write delays.

    remove_redundancy

    bool

    Optional

    False

    Whether to remove redundancy from model weights when saving. Defaults to False.

    checkpoint_format

    string

    Optional

    'ckpt'

    The format of saved model weights. Defaults to ckpt. Optional values: ckpt, safetensors.

    embedding_local_norm_threshold

    float

    Optional

    1.0

    The threshold used in health monitoring to detect abnormalities in the embedding layer gradient or output norm. If the norm exceeds this value, an alarm or data skipping mechanism may be triggered to prevent training divergence. Defaults to 1.0 and can be adjusted based on model scale.

Multiple Callbacks function classes can be configured at the same time under the callbacks field. The following is an example of callbacks configuration.

callbacks:
  - type: MFLossMonitor
  - type: CheckpointMonitor
    prefix: "name_xxb"
    save_checkpoint_steps: 1000
    integrated_save: False
    async_save: False

Processor Configuration

Processor is mainly used to preprocess the input data for model inference. Since the Processor configuration items are not fixed, only the generic Processor configuration items in MindSpore Transformers are explained here.

Parameter Name

Type

Optional

Default Value

Value Description

processor.type

string

Required

None

Sets the name of the data processing class (Processor) to be used, such as LlamaProcessor or Qwen2Processor. This class determines the overall input data preprocessing flow and must match the model architecture.

processor.return_tensors

string

Optional

'ms'

Sets the type of tensors returned after data processing. Can be set to 'ms' to indicate a MindSpore Tensor.

processor.image_processor.type

string

Required

None

Sets the type of the image data processing class. Responsible for image normalization, scaling, cropping, and other operations, and must be compatible with the model's visual encoder.

processor.tokenizer.type

string

Required

None

Sets the text tokenizer type, such as LlamaTokenizer or Qwen2Tokenizer. This determines how the text is segmented into subwords or tokens and must be consistent with the language model.

processor.tokenizer.vocab_file

string

Required

None

Sets the vocabulary file path required by the tokenizer (such as vocab.txt or tokenizer.model). The specific file type depends on the tokenizer implementation. This must correspond to processor.tokenizer.type; otherwise, loading may fail.
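
The following is an illustrative processor configuration. The class names and vocabulary path are examples and must match the actual model:

processor:
  type: LlamaProcessor                  # illustrative processor class
  return_tensors: ms
  tokenizer:
    type: LlamaTokenizer                # illustrative tokenizer class
    vocab_file: './tokenizer.model'     # illustrative path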

Model Evaluation Configuration

MindSpore Transformers provides a model evaluation function and also supports model evaluation while training. The following are the configurations related to model evaluation.

Parameter Name

Type

Optional

Default Value

Value Description

eval_dataset

dict

Required

None

Dataset configuration for evaluation, used in the same way as train_dataset.

eval_dataset_task

dict

Required

None

Evaluation task configuration, used in the same way as dataset task configuration (such as preprocessing, batch size, etc.), used to define the evaluation process.

metric.type

string

Required

None

Set the evaluation type, such as Accuracy, F1, etc. The specific value must be consistent with the supported evaluation metrics.

do_eval

bool

Optional

False

Whether to enable the evaluation-while-training feature.

eval_step_interval

int

Optional

100

Sets the evaluation step interval. The default value is 100. A value less than or equal to 0 disables step-by-step evaluation.

eval_epoch_interval

int

Optional

-1

Sets the evaluation epoch interval. The default value is -1. A value less than 0 disables epoch-by-epoch evaluation. This configuration is not recommended in data sinking mode.
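
The following is an illustrative configuration for evaluation while training; the metric type is an example and must be one of the supported evaluation metrics:

do_eval: True
eval_step_interval: 500
eval_epoch_interval: -1
metric:
  type: PerplexityMetric                # illustrative metric type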

Profile Configuration

MindSpore Transformers provides Profile as the main tool for model performance tuning; please refer to the Performance Tuning Guide for more details. The following are the Profile-related configurations.

Parameter Name

Type

Optional

Default Value

Value Description

profile

bool

Optional

False

Whether to enable the performance collection tool. The default value is False. For details, see mindspore.Profiler.

profile_start_step

int

Optional

1

Sets the number of steps at which to start collecting performance data. The default value is 1.

profile_stop_step

int

Optional

10

Sets the number of steps at which to stop collecting performance data. The default value is 10.

profile_communication

bool

Optional

False

Sets whether to collect communication performance data during multi-device training. This parameter is invalid when using a single card for training and the default value is False.

profile_memory

bool

Optional

True

Sets whether to collect Tensor memory data. Defaults to True.

profile_rank_ids

list

Optional

None

Sets the rank ids for which performance collection is enabled. Defaults to None, meaning that performance collection is enabled for all rank ids.

profile_pipeline

bool

Optional

False

Sets whether to enable performance collection for one card in each stage of the pipeline in parallel. Defaults to False.

profile_output

string

Required

None

Sets the folder path for saving performance collection files.

profiler_level

int

Optional

1

Sets the data collection level. Possible values are (0, 1, 2). Defaults to 1.

with_stack

bool

Optional

False

Sets whether to collect call stack data on the Python side. Defaults to False.

data_simplification

bool

Optional

False

Sets whether to enable data simplification. If enabled, the FRAMEWORK directory and other redundant data will be deleted after exporting performance data. The default value is False.

init_start_profile

bool

Optional

False

Sets whether to enable performance data collection during Profiler initialization. This parameter has no effect when profile_start_step is set. It must be set to True when profile_memory is enabled.

mstx

bool

Optional

False

Sets whether to collect mstx timestamp records, including training steps, HCCL communication operators, etc. The default value is False.
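
The following is an illustrative Profile configuration that collects performance data between steps 5 and 10; the values are examples only:

profile: True
profile_start_step: 5
profile_stop_step: 10
profile_output: './profile'
profiler_level: 1
profile_memory: False
profile_communication: False
with_stack: False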

Metric Monitoring Configuration

The metric monitoring configuration is primarily used to configure methods for recording metrics during training. Please refer to Training Metrics Monitoring for more details. Below is a description of the common metric monitoring configuration options in MindSpore Transformers:

Parameters

Type

Optional

Default Value

Value Descriptions

monitor_config.monitor_on

bool

Optional

False

Set whether to enable monitoring. The default is False, which will disable all parameters below.

monitor_config.dump_path

string

Optional

'./dump'

Set the save path for metric files of local_norm, device_local_norm and local_loss during training. Defaults to './dump' when not set or set to null.

monitor_config.target

list(string)

Optional

['.*']

Set the (partial) names of the target parameters monitored by the optimizer state and local_norm metrics; regular expressions are supported. Defaults to ['.*'] when not set or set to null, that is, all parameters are monitored.

monitor_config.invert

bool

Optional

False

Set whether to invert the targets specified in monitor_config.target, defaults to False.

monitor_config.step_interval

int

Optional

1

Set the frequency for metric recording. The default value is 1, that is, the metrics are recorded every step.

monitor_config.local_loss_format

string / list(string)

Optional

null

Set the format for recording the local_loss metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.device_local_loss_format

string / list(string)

Optional

null

Set the format for recording the device_local_loss metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.local_norm_format

string / list(string)

Optional

null

Set the format for recording the local_norm metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.device_local_norm_format

string / list(string)

Optional

null

Set the format for recording the device_local_norm metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.optimizer_state_format

string / list(string)

Optional

null

Set the format for recording the optimizer state metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.weight_state_format

string / list(string)

Optional

null

Set the format for recording the weight L2-norm metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.throughput_baseline

int / float

Optional

null

Set the baseline for the throughput linearity metric; it must be a positive number. Defaults to null, that is, this metric is not monitored.

monitor_config.print_struct

bool

Optional

False

Set whether to print the names of all trainable parameters of the model. If set to True, the names are printed at the beginning of the first step, and the training process exits after the step ends. Defaults to False.

monitor_config.check_for_global_norm

bool

Optional

False

Set whether to enable process level fault recovery function. Defaults to False.

monitor_config.global_norm_spike_threshold

float

Optional

3.0

Set the threshold for global norm, triggering data skipping when the global norm is exceeded. Defaults to 3.0.

monitor_config.global_norm_spike_count_threshold

int

Optional

10

Set the cumulative number of consecutive global norm anomalies; when this threshold is reached, an exception is raised to terminate training. Defaults to 10.
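
The following is an illustrative monitor_config fragment; the monitored metrics and thresholds are examples only:

monitor_config:
  monitor_on: True
  dump_path: './dump'
  target: ['.*']
  step_interval: 1
  local_loss_format: ['log', 'tensorboard']
  device_local_norm_format: 'log'
  global_norm_spike_threshold: 3.0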

TensorBoard Configuration

The TensorBoard configuration is primarily used to configure parameters related to TensorBoard during training, allowing for real-time monitoring and visualization of training metrics. Please refer to Training Metrics Monitoring for more details. Below is a description of the common TensorBoard configuration options in MindSpore Transformers:

Parameters

Type

Optional

Default Value

Value Description

tensorboard.tensorboard_dir

string

Required

None

Sets the path where TensorBoard event files are saved.

tensorboard.tensorboard_queue_size

int

Optional

10

Sets the maximum size of the capture queue. If the queue exceeds this value, its contents are written to the event file. The default value is 10.

tensorboard.log_loss_scale_to_tensorboard

bool

Optional

False

Sets whether loss scale information is logged to the event file, default is False.

tensorboard.log_timers_to_tensorboard

bool

Optional

False

Sets whether to log timer information to the event file. The timer information contains the duration of the current training step (or iteration) as well as the throughput. Defaults to False.

tensorboard.log_expert_load_to_tensorboard

bool

Optional

False

Sets whether to log the expert load to the event file. Defaults to False.
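
The following is an illustrative TensorBoard configuration; the directory path is an example:

tensorboard:
  tensorboard_dir: './tensorboard'
  tensorboard_queue_size: 10
  log_loss_scale_to_tensorboard: True
  log_timers_to_tensorboard: True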