Configuration File Descriptions

Overview

A model usually requires different parameters to be configured for training and inference. MindSpore Transformers supports using YAML files to centrally manage and adjust these configurable items, which makes model configuration more structured and easier to maintain.

Description of the YAML File Contents

The YAML file provided by MindSpore Transformers contains configuration items for different functions, which are described below according to their contents.

Basic Configuration

The basic configuration is mainly used to specify MindSpore random seeds and related settings for loading weights.

Parameter Name

Data Type

Optional

Default Value

Value Description

seed

int

Optional

0

Sets the global random seed to ensure experimental reproducibility. For details, see mindspore.set_seed.

run_mode

string

Required

None

Sets the model's run mode. Optional values: train, finetune, eval, or predict.

output_dir

string

Optional

None

Sets the output directory for saving log files, checkpoint files, and parallel strategy files. If the directory does not exist, it will be created automatically.

load_checkpoint

string

Optional

None

The file or folder path for loading weights. Supports the following three scenarios: 1. The path to the complete weights file; 2. The path to the distributed weights folder after offline splitting; 3. The path to the folder containing LoRA incremental weights and base model weights. For details on how to obtain various weights, see Checkpoint Conversion Function.

auto_trans_ckpt

bool

Optional

False

Whether to enable automatic splitting and merging of distributed weights. When enabled, weights split across multiple cards can be loaded onto a single card, and single-card weights can be loaded onto multiple cards. For more information, see Distributed Weight Slicing and Merging.

resume_training

bool

Optional

False

Whether to enable the resumable training feature. When enabled, the optimizer state, learning rate scheduler state, and other parameters will be restored from the path specified by load_checkpoint to continue training. For more information, see Resumable Training.

load_ckpt_format

string

Optional

"ckpt"

The format of the loaded model weights. Optional values include "ckpt" and "safetensors".

remove_redundancy

bool

Optional

False

Whether the loaded model weights were saved with redundancy removed. For details, see Saving and Loading Weights with De-Redundancy.

train_precision_sync

bool

Optional

None

Whether to enable deterministic computation for training. Setting this to True makes training computation deterministic, which is generally used to ensure experimental reproducibility; setting this to False disables this feature.

infer_precision_sync

bool

Optional

None

Whether to enable deterministic computation for inference. Setting this to True makes inference computation deterministic, which is generally used to ensure reproducible results; setting this to False disables this feature.

use_skip_data_by_global_norm

bool

Optional

False

Whether to enable data skipping based on the global gradient norm. When a batch of data causes exploding gradients, that batch is automatically skipped to improve training stability. For more information, see Data Skipping.

use_checkpoint_health_monitor

bool

Optional

False

Whether to enable weight health monitoring. When enabled, checkpoint integrity and availability are verified when saving, preventing corrupted weight files from being saved. For more information, see Checkpoint Health Monitor.
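
The following is a minimal example of the basic configuration. The values shown are illustrative and should be adjusted to the actual task:

seed: 0
output_dir: './output'        # logs, checkpoints, and strategy files are saved here
run_mode: 'train'             # one of train, finetune, eval, predict
load_checkpoint: ''           # leave empty to start without loading weights
load_ckpt_format: 'ckpt'
auto_trans_ckpt: False
resume_training: False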

Context Configuration

Context configuration is mainly used to specify the parameters related to mindspore.set_context.

Parameter Name

Data Type

Optional

Default Value

Value Description

context.mode

int

Required

None

Sets the backend execution mode. 0 indicates GRAPH_MODE. MindSpore Transformers currently only supports running in GRAPH_MODE mode.

context.device_target

string

Required

None

Sets the backend execution device. MindSpore Transformers only supports running on Ascend devices.

context.device_id

int

Optional

0

Sets the execution device ID. The value must be within the available device range. The default value is 0.

context.enable_graph_kernel

bool

Optional

False

Whether to enable graph fusion to optimize network execution performance. The default value is False.

context.max_call_depth

int

Optional

1000

Sets the maximum depth of function calls. This value must be a positive integer. The default value is 1000.

context.max_device_memory

string

Optional

"1024GB"

Sets the maximum memory available on the device. The format is "xxGB". The default value is "1024GB".

context.mempool_block_size

string

Optional

"1GB"

Sets the memory block size. The format is "xxGB". The default value is "1GB".

context.save_graphs

bool / int

Optional

False

Save compiled graphs during execution:
False or 0: Do not save intermediate compiled graphs
1: Output some intermediate files during graph compilation
True or 2: Generate more IR files related to the backend process
3: Generate a visual computation graph and a more detailed frontend IR graph

context.save_graphs_path

string

Optional

'./graph'

The path to save compiled graphs. If not set and save_graphs != False, the default temporary path './graph' is used.

context.affinity_cpu_list

dict / string

Optional

None

Optional configuration item used to implement a user-defined core binding strategy.
• When not configured: the default automatic core binding is used
• When set to None: core binding is disabled
• When a dict is passed in: a custom CPU core binding strategy is applied. For details, refer to mindspore.runtime.set_cpu_affinity
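
The following is an illustrative context configuration. The device and memory values are examples and should be set according to the actual environment:

context:
  mode: 0                     # GRAPH_MODE
  device_target: "Ascend"
  device_id: 0
  max_call_depth: 10000       # illustrative value
  max_device_memory: "58GB"   # illustrative value
  save_graphs: False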

Legacy Model Configuration

If you use MindSpore Transformers to run tasks for legacy models, you need to configure the relevant hyperparameters in a YAML file. Please note that the configuration described in this section applies only to legacy models and cannot be mixed with mcore model configurations. Please pay attention to version compatibility.

Because different model configurations may vary, this section only describes the general configuration of models in MindSpore Transformers.

Parameter Name

Type

Optional

Default Value

Value Description

model.arch.type

string

Required

None

Sets the model class. This class can be used to instantiate the model when building it.

model.model_config.type

string

Required

None

Sets the model configuration class. This class must match the model class; that is, it must contain all parameters used by the model class.

model.model_config.num_layers

int

Required

None

Sets the number of model layers, typically the number of decoder layers.

model.model_config.seq_length

int

Required

None

Sets the model sequence length. This parameter indicates the maximum sequence length supported by the model.

model.model_config.hidden_size

int

Required

None

Sets the dimension of the model's hidden state.

model.model_config.vocab_size

int

Required

None

Sets the size of the model vocabulary.

model.model_config.top_k

int

Optional

None

Sets the sampling from the top_k tokens with the highest probability during inference.

model.model_config.top_p

float

Optional

None

During inference, sampling is performed from the highest-probability tokens whose cumulative probability does not exceed top_p. The value range is usually (0,1].

model.model_config.use_past

bool

Optional

False

Whether to enable incremental inference for the model. Enabling this allows Paged Attention to improve inference performance. Must be set to False during model training.

model.model_config.max_decode_length

int

Optional

None

Sets the maximum length of generated text, including the input length.

model.model_config.max_length

int

Optional

None

Same as max_decode_length. When both max_decode_length and max_length are set, only max_length takes effect.

model.model_config.max_new_tokens

int

Optional

None

Sets the maximum length of generated new text, excluding the input length. When both max_length and max_new_tokens are set, only max_new_tokens takes effect.

model.model_config.min_length

int

Optional

None

Sets the minimum length of generated text, including the input length.

model.model_config.min_new_tokens

int

Optional

None

Sets the minimum length of new text generated, excluding the input length. When both min_length and min_new_tokens are set, only min_new_tokens takes effect.

model.model_config.repetition_penalty

float

Optional

1.0

Sets the penalty coefficient for generating repeated text. repetition_penalty must be no less than 1. When it is equal to 1, no penalty is imposed on repeated output.

model.model_config.block_size

int

Optional

None

Sets the block size in Paged Attention. This only takes effect when use_past=True.

model.model_config.num_blocks

int

Optional

None

Sets the total number of blocks in Paged Attention. This only takes effect when use_past=True. This should satisfy batch_size × seq_length <= block_size × num_blocks.

model.model_config.return_dict_in_generate

bool

Optional

False

Whether to return the inference results of the generate interface in dictionary form. Defaults to False.

model.model_config.output_scores

bool

Optional

False

Whether to include the scores before softmax of the input for each forward generation when returning the results in dictionary form. Defaults to False.

model.model_config.output_logits

bool

Optional

False

Whether to include the logits of the model output for each forward generation when returning the results in dictionary form. Defaults to False.

model.model_config.layers_per_stage

list(int)

Optional

None

Sets the number of transformer layers assigned to each stage when enabling pipeline stages. Defaults to None, indicating an equal distribution across all stages. The value to be set is a list of integers with a length equal to the number of pipeline stages, where the i-th position indicates the number of transformer layers assigned to the i-th stage.

model.model_config.bias_swiglu_fusion

bool

Optional

False

Whether to use the swiglu fusion operator. Defaults to False.

model.model_config.apply_rope_fusion

bool

Optional

False

Whether to use the RoPE fusion operator. Defaults to False.
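
The following is an illustrative legacy model configuration. The class names and hyperparameter values are examples only and must match the model actually being used:

model:
  arch:
    type: LlamaForCausalLM    # illustrative model class
  model_config:
    type: LlamaConfig         # illustrative model configuration class
    num_layers: 32
    seq_length: 4096
    hidden_size: 4096
    vocab_size: 32000
    use_past: False           # must be False during training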

In addition to the basic configuration of the above models, the MoE model requires separate configuration of some MoE module hyperparameters. Since different models use different parameters, only the general configuration is described:

Parameter Name

Type

Optional

Default Value

Value Description

moe_config.expert_num

int

Required

None

Sets the number of routing experts.

moe_config.shared_expert_num

int

Required

None

Sets the number of shared experts.

moe_config.moe_intermediate_size

int

Required

None

Sets the size of the intermediate dimension of the expert layer.

moe_config.capacity_factor

int

Required

None

Sets the expert capacity factor.

moe_config.num_experts_chosen

int

Required

None

Sets the number of experts chosen for each token.

moe_config.enable_sdrop

bool

Optional

False

Enables the sdrop token drop strategy. Since MindSpore Transformers' MoE uses a static shape implementation, it cannot retain all tokens.

moe_config.aux_loss_factor

list(float)

Optional

None

Sets the weight for the balanced loss.

moe_config.first_k_dense_replace

int

Optional

1

Sets the number of leading blocks that use a dense FFN instead of MoE. Typically set to 1 to disable MoE in the first block.

moe_config.balance_via_topk_bias

bool

Optional

False

Enables the aux_loss_free load balancing algorithm.

moe_config.topk_bias_update_rate

float

Optional

None

Sets the bias update step for the aux_loss_free load balancing algorithm.

moe_config.comp_comm_parallel

bool

Optional

False

Sets whether to enable parallel computation and communication for ffn.

moe_config.comp_comm_parallel_degree

int

Optional

None

Sets the number of splits for ffn computation and communication. A larger number results in more overlap, but consumes more memory. This parameter is only valid when comp_comm_parallel=True.

moe_config.moe_shared_expert_overlap

bool

Optional

False

Sets whether to enable parallel computation and communication for shared and routing experts.

moe_config.use_gating_sigmoid

bool

Optional

False

Sets whether to use the sigmoid function for gating results in MoE.

moe_config.use_gmm

bool

Optional

False

Sets whether to use GroupedMatmul for MoE expert computation.

moe_config.use_fused_ops_permute

bool

Optional

False

Specifies whether MoE uses the permute and unpermute fused operators for performance acceleration. This option only takes effect when use_gmm=True.

moe_config.enable_deredundency

bool

Optional

False

Specifies whether to enable de-redundancy communication. This requires the expert parallel degree to be an integer multiple of the number of NPUs in each node. Default value: False. This option takes effect when use_gmm=True.

moe_config.npu_nums_per_device

int

Optional

8

Specifies the number of NPUs in each node. Default value: 8. This option takes effect when enable_deredundency=True.

moe_config.enable_gmm_safe_tokens

bool

Optional

False

Ensures that each expert is assigned at least one token to prevent GroupedMatmul calculation failures in extreme load imbalance. The default value is False. It is recommended to enable this when use_gmm=True.
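
The following is an illustrative moe_config for a legacy MoE model; the values are examples only:

moe_config:
  expert_num: 8
  shared_expert_num: 1
  moe_intermediate_size: 1408
  num_experts_chosen: 2
  capacity_factor: 2
  enable_sdrop: True
  first_k_dense_replace: 1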

Mcore Model Configuration

When using MindSpore Transformers to launch an Mcore model task, you need to configure relevant hyperparameters under model_config, including model selection, model parameters, calculation type, and MoE parameters.

Because different model configurations may vary, here are some common model configurations in MindSpore Transformers:

Parameter

Type

Optional

Default Value

Value Description

model.model_config.model_type

string

Required

None

Sets the model configuration class. The model configuration class must match the model class; that is, the model configuration class should contain all parameters used by the model class.

model.model_config.architectures

string

Required

None

Sets the model class. When building the model, you can instantiate the model based on the model class.

model.model_config.offset

int / list(int)

Required

0

When pipeline parallelism is enabled, you need to set the layer offset for each stage based on the number of model layers to build pipeline parallelism.

model.model_config.vocab_size

int

Optional

128000

Model vocabulary size.

model.model_config.hidden_size

int

Required

0

Transformer hidden layer size.

model.model_config.ffn_hidden_size

int

Optional

None

Transformer feedforward layer size, corresponding to intermediate_size in HuggingFace. If not set, the default is 4 * hidden_size.

model.model_config.num_layers

int

Required

0

Number of Transformer layers, corresponding to num_hidden_layers in HuggingFace.

model.model_config.max_position_embeddings

int

Optional

4096

Maximum sequence length the model can handle.

model.model_config.hidden_act

string

Optional

'gelu'

Activation function used for the nonlinearity in the MLP.

model.model_config.num_attention_heads

int

Required

0

Number of Transformer attention heads.

model.model_config.num_query_groups

int

Optional

None

Number of query groups for the group-query attention mechanism, corresponding to num_key_value_heads in HuggingFace. If not configured, the normal attention mechanism is used.

model.model_config.kv_channels

int

Optional

None

Projection weight dimension for the multi-head attention mechanism, corresponding to head_dim in HuggingFace. If not configured, defaults to hidden_size // num_attention_heads.

model.model_config.layernorm_epsilon

float

Required

1e-5

Epsilon value for any LayerNorm operations.

model.model_config.add_bias_linear

bool

Required

True

Include a bias term in all linear layers (after QKV projection, after core attention, and both in MLP layers).

model.model_config.tie_word_embeddings

bool

Required

True

Whether to share input and output embedding weights.

model.model_config.use_flash_attention

bool

Required

True

Whether to use flash attention in the attention layer.

model.model_config.use_contiguous_weight_layout_attention

bool

Required

False

Determines the weight layout in the QKV linear projection of the self-attention layer. Affects only the self-attention layer.

model.model_config.hidden_dropout

float

Required

0.1

Dropout probability for the Transformer hidden state.

model.model_config.attention_dropout

float

Required

0.1

Dropout probability for the post-attention layer.

model.model_config.position_embedding_type

string

Required

'rope'

Position embedding type for the attention layer.

model.model_config.params_dtype

string

Required

'float32'

dtype to use when initializing weights.

model.model_config.compute_dtype

string

Required

'bfloat16'

Computed dtype for Linear layers.

model.model_config.layernorm_compute_dtype

string

Required

'float32'

Computed dtype for LayerNorm layers.

model.model_config.softmax_compute_dtype

string

Required

'float32'

The dtype used to compute the softmax during attention computation.

model.model_config.rotary_dtype

string

Required

'float32'

Computed dtype for custom rotated position embeddings.

model.model_config.init_method_std

float

Required

0.02

The standard deviation of the zero-mean normal distribution used by the default initialization method, corresponding to initializer_range in HuggingFace. If init_method and output_layer_init_method are provided, this value is not used.

model.model_config.moe_grouped_gemm

bool

Required

False

When there are multiple experts per rank, combine multiple local (potentially small) GEMMs into a single kernel launch to leverage grouped GEMM capabilities for improved utilization and performance.

model.model_config.num_moe_experts

int

Optional

None

The number of experts to use for the MoE layer, corresponding to n_routed_experts in HuggingFace. When set, the MLP is replaced by the MoE layer. Setting this to None disables the MoE.

model.model_config.num_experts_per_tok

int

Required

2

The number of experts to route each token to.

model.model_config.moe_ffn_hidden_size

int

Optional

None

Size of the hidden layer of the MoE feedforward network. Corresponds to moe_intermediate_size in HuggingFace.

model.model_config.moe_router_dtype

string

Required

'float32'

Data type used for routing and weighted averaging of expert outputs. Corresponds to router_dense_type in HuggingFace.

model.model_config.gated_linear_unit

bool

Required

False

Use a gated linear unit for the first linear layer in the MLP.

model.model_config.norm_topk_prob

bool

Required

True

Whether to use top-k probabilities for normalization.

model.model_config.moe_router_pre_softmax

bool

Required

False

Enables pre-softmax (pre-sigmoid) routing for MoE, meaning softmax is performed before top-k selection. By default, softmax is performed after top-k selection.

model.model_config.moe_token_drop_policy

string

Required

'probs'

The token drop policy. Can be either 'probs' or 'position'. If 'probs', the token with the lowest probability is dropped. If 'position', the token at the end of each batch is dropped.

model.model_config.moe_router_topk_scaling_factor

float

Optional

None

Scaling factor for the routing score in Top-K routing, corresponding to routed_scaling_factor in HuggingFace. Valid only when moe_router_pre_softmax is enabled. Defaults to None, meaning no scaling.

model.model_config.moe_aux_loss_coeff

float

Required

0.0

Scaling factor for the auxiliary loss. The recommended initial value is 1e-2.

model.model_config.moe_router_load_balancing_type

string

Required

'aux_loss'

The router's load balancing strategy. 'aux_loss' corresponds to the load balancing loss used in GShard and SwitchTransformer; 'seq_aux_loss' corresponds to the load balancing loss used in DeepSeekV2 and DeepSeekV3, which is used to calculate the loss of each sample; 'sinkhorn' corresponds to the balancing algorithm used in S-BASE, and 'none' means no load balancing.

model.model_config.moe_permute_fusion

bool

Optional

False

Whether to use the moe_token_permute fusion operator. Default is False.

model.model_config.moe_router_force_expert_balance

bool

Optional

False

Whether to use forced load balancing in the expert router. This option is only for performance testing and not for general use. Defaults to False.

model.model_config.use_interleaved_weight_layout_mlp

bool

Optional

True

Determines the weight arrangement in the linear_fc1 projection of the MLP. Affects only MLP layers.
1. When True, use an interleaved arrangement: [Gate_weights[0], Hidden_weights[0], Gate_weights[1], Hidden_weights[1], ...].
2. When False, use a continuous arrangement: [Gate_weights, Hidden_weights].
Note: This affects tensor memory layout, but does not affect mathematical equivalence.

model.model_config.moe_router_enable_expert_bias

bool

Optional

False

Whether to use TopK routing with a dynamic expert bias in the aux-loss-free load balancing strategy. Routing decisions are based on the sum of the routing score and the expert bias.

model.model_config.enable_expert_relocation

bool

Optional

False

Whether to enable dynamic expert migration for load balancing in the MoE model. When enabled, experts will be dynamically redistributed between devices based on their load history to improve training efficiency and load balance. Defaults to False.

model.model_config.expert_relocation_initial_iteration

int

Optional

20

The initial iteration at which expert relocation starts. Expert relocation begins after this many training iterations.

model.model_config.expert_relocation_freq

int

Optional

50

Frequency of expert relocation in training iterations. After the initial iteration, expert relocation is performed every N iterations.

model.model_config.print_expert_load

bool

Optional

False

Whether to print expert load information. If enabled, detailed expert load statistics will be printed during training. Defaults to False.

model.model_config.moe_router_num_groups

int

Optional

None

The number of expert groups to use for group-limited routing. Equivalent to n_group in HuggingFace.

model.model_config.moe_router_group_topk

int

Optional

None

The number of selected groups for group-limited routing. Equivalent to topk_group in HuggingFace.

model.model_config.moe_router_topk

int

Optional

2

The number of experts to route each token to. Equivalent to num_experts_per_tok in HuggingFace. When used with moe_router_num_groups and moe_router_group_topk, experts are first divided into moe_router_num_groups groups, moe_router_group_topk groups are selected, and then moe_router_topk experts are chosen from the selected groups.
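
The following is an illustrative Mcore model configuration. The architecture name and hyperparameter values are examples only and must correspond to the actual model:

model:
  model_config:
    model_type: 'llama'                 # illustrative
    architectures: 'LlamaForCausalLM'   # illustrative
    num_layers: 32
    hidden_size: 4096
    ffn_hidden_size: 11008
    num_attention_heads: 32
    num_query_groups: 8
    vocab_size: 128000
    position_embedding_type: 'rope'
    params_dtype: 'float32'
    compute_dtype: 'bfloat16'
    use_flash_attention: True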

Model Training Configuration

When starting model training, in addition to model-related parameters, you also need to set the parameters of modules required for training, such as trainer, runner_config, the learning rate, and the optimizer. MindSpore Transformers provides the following configuration items.

Parameters

Descriptions

Types

trainer.type

Set the trainer class, usually different models for different application scenarios will set different trainer classes.

str

trainer.model_name

Set the model name in the format '{name}_xxb', indicating a certain specification of the model.

str

runner_config.epochs

Set the number of epochs for model training.

int

runner_config.batch_size

Set the sample size of the batch data, which overrides the batch_size in the dataset configuration.

int

runner_config.sink_mode

Enable data sink mode.

bool

runner_config.sink_size

Set the number of steps sent down from the host to the device in each sink iteration, effective only when sink_mode=True. This argument will be deprecated in a future release.

int

runner_config.gradient_accumulation_steps

Set the number of gradient accumulation steps, the default value is 1, which means that gradient accumulation is not enabled.

int

runner_wrapper.type

Set the wrapper class, generally set 'MFTrainOneStepCell'.

str

runner_wrapper.local_norm

Whether to print the gradient norm (local norm) of each parameter on each card.

bool

runner_wrapper.scale_sense.type

Set the gradient scaling class, generally just set 'DynamicLossScaleUpdateCell'.

str

runner_wrapper.scale_sense.loss_scale_value

Set the initial value of the dynamic loss scale; the loss scaling can change dynamically based on this configuration.

int

runner_wrapper.use_clip_grad

Whether to turn on gradient clipping, which helps avoid cases where the backward gradient is too large and training fails to converge.

bool

lr_schedule.type

Set the lr_schedule class, lr_schedule is mainly used to adjust the learning rate in model training.

str

lr_schedule.learning_rate

Set the initialized learning rate size.

float

lr_scale

Whether to enable learning rate scaling.

bool

lr_scale_factor

Set the learning rate scaling factor.

int

layer_scale

Whether to turn on layer-wise decay.

bool

layer_decay

Set the layer-wise decay factor.

float

optimizer.type

Set the optimizer class; the optimizer is mainly used to update model parameters during training.

str

optimizer.weight_decay

Set the optimizer weight decay factor.

float

optimizer.fused_num

Set the number of weights to be fused; the fused weights are applied to the network parameters according to the fusion algorithm. Defaults to 10.

int

optimizer.interleave_step

Set the step interval between weights to be fused; every interleave_step steps, a weight is taken as a candidate for fusion. Defaults to 1000.

int

optimizer.fused_algo

Fusion algorithm; supports ema and sma. Defaults to ema.

string

optimizer.ema_alpha

The fusion coefficient, effective only when fused_algo is set to ema. Defaults to 0.2.

float

train_dataset.batch_size

The description is same as that of runner_config.batch_size.

int

train_dataset.input_columns

Set the input data columns for the training dataset.

list

train_dataset.output_columns

Set the output data columns for the training dataset.

list

train_dataset.construct_args_key

Set the keys of the dataset inputs that are passed to the model's construct method, in lexicographical order; used when the model's parameter order does not match the order of the dataset inputs.

list

train_dataset.column_order

Set the order of the output data columns of the training dataset.

list

train_dataset.num_parallel_workers

Set the number of processes that read the training dataset.

int

train_dataset.python_multiprocessing

Whether to enable Python multiprocessing mode to improve data processing performance.

bool

train_dataset.drop_remainder

Whether to discard the last batch of data if it contains fewer samples than batch_size.

bool

train_dataset.repeat

Set the number of dataset duplicates.

int

train_dataset.numa_enable

Whether to enable NUMA binding when dataset reading starts.

bool

train_dataset.prefetch_size

Set the amount of pre-read data.

int

train_dataset.data_loader.type

Set the data loading class.

str

train_dataset.data_loader.dataset_dir

Set the path for loading data.

str

train_dataset.data_loader.shuffle

Whether to randomly sort the data when reading the dataset.

bool

train_dataset.transforms

Set options related to data augmentation.

-

train_dataset_task.type

Set up the dataset class, which is used to encapsulate the data loading class and other related configurations.

str

train_dataset_task.dataset_config

Typically set as a reference to train_dataset, containing all configuration entries for train_dataset.

-

auto_tune

Enable auto-tuning of data processing parameters, see set_enable_autotune for details.

bool

filepath_prefix

Set the save path for parameter configurations after data optimization.

str

autotune_per_step

Set the configuration tuning step interval for automatic data acceleration, for details see set_autotune_interval.

int
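
The following is an illustrative training configuration fragment. The trainer, optimizer, and learning-rate classes are examples; use the classes appropriate for the actual model and task:

trainer:
  type: CausalLanguageModelingTrainer   # illustrative trainer class
  model_name: 'llama2_7b'               # illustrative
runner_config:
  epochs: 2
  batch_size: 1
  sink_mode: True
  gradient_accumulation_steps: 1
runner_wrapper:
  type: MFTrainOneStepCell
  use_clip_grad: True
optimizer:
  type: AdamW                           # illustrative optimizer class
  weight_decay: 0.1
lr_schedule:
  type: CosineWithWarmUpLR              # illustrative learning-rate class
  learning_rate: 3.e-4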

Parallel Configuration

To improve model performance in large-scale cluster scenarios, it is usually necessary to configure a parallelism strategy for the model. For details, please refer to Distributed Parallelism. The parallel configuration in MindSpore Transformers is as follows.

Parameters

Descriptions

Types

use_parallel

Enable parallel mode.

bool

parallel_config.data_parallel

Set the data parallelism degree.

int

parallel_config.model_parallel

Set the model parallelism degree.

int

parallel_config.context_parallel

Set the sequence (context) parallelism degree.

int

parallel_config.pipeline_stage

Set the number of pipeline stages.

int

parallel_config.micro_batch_num

Set the number of pipeline-parallel micro-batches, which should satisfy parallel_config.micro_batch_num >= parallel_config.pipeline_stage when parallel_config.pipeline_stage is greater than 1.

int

parallel_config.seq_split_num

Set the sequence split number in sequence pipeline parallel, which should be a divisor of sequence length.

int

parallel_config.gradient_aggregation_group

Set the size of the gradient communication operator fusion group.

int

parallel_config.context_parallel_algo

Set the long-sequence parallel scheme; options are colossalai_cp, ulysses_cp, and hybrid_cp. Effective only if the number of context_parallel slices is greater than 1.

str

parallel_config.ulysses_degree_in_cp

Set the Ulysses sequence parallelism degree, used together with the hybrid_cp long-sequence parallel scheme. This requires that context_parallel be divisible by this parameter, that this parameter be greater than 1, and that the number of attention heads be divisible by ulysses_degree_in_cp.

int

micro_batch_interleave_num

Set the multi-copy parallelism number; multi-copy parallelism is enabled when it is greater than 1. It is usually enabled when using model parallelism, mainly to hide the communication overhead introduced by model parallelism, and is not recommended when only pipeline parallelism is used. For details, please refer to MicroBatchInterleaved.

int

parallel.parallel_mode

Set the parallel mode: 0 means data parallel mode, 1 means semi-automatic parallel mode, 2 means automatic parallel mode, and 3 means mixed parallel mode. This is usually set to semi-automatic parallel mode.

int

parallel.gradients_mean

Whether to execute the averaging operator after the gradient AllReduce. Typically set to False in semi-automatic parallel mode and True in data parallel mode.

bool

parallel.enable_alltoall

Enables generation of the AllToAll communication operator during communication. Typically set to True only in MOE scenarios, default value is False.

bool

parallel.full_batch

Whether to load the full batch of data from the dataset in parallel mode. Setting it to True means all ranks will load the full batch of data. Setting it to False means each rank will only load the corresponding batch of data. When set to False, the corresponding dataset_strategy must be configured.

bool

parallel.dataset_strategy

Only supports List of List type and is effective only when full_batch=False. The number of sublists in the list must be equal to the length of train_dataset.input_columns. Each sublist in the list must have the same shape as the data returned by the dataset. Generally, data parallel splitting is done along the first dimension, so the first dimension of the sublist should be configured to match data_parallel, while the other dimensions should be set to 1. For detailed explanation, refer to Dataset Splitting.

list

parallel.search_mode

Set the fully-automatic parallel strategy search mode; options are recursive_programming, dynamic_programming, and sharding_propagation. It only works in fully-automatic parallel mode and is an experimental interface.

str

parallel.strategy_ckpt_save_file

Set the save path for the parallel slicing strategy file.

str

parallel.strategy_ckpt_config.only_trainable_params

Whether to save (or load) the slicing strategy information for trainable parameters only. The default is True; set this parameter to False when the network contains frozen parameters that still need to be sliced.

bool

parallel.enable_parallel_optimizer

Whether to turn on optimizer parallelism:
1. In data parallel mode, model weight parameters are sliced across the number of devices.
2. In semi-automatic parallel mode, model weight parameters are sliced by parallel_config.data_parallel.

bool

parallel.parallel_optimizer_config.gradient_accumulation_shard

Set whether the cumulative gradient variable is sliced on the data-parallel dimension, only effective if enable_parallel_optimizer=True.

bool

parallel.parallel_optimizer_config.parallel_optimizer_threshold

Set the threshold for slicing optimizer weight parameters; effective only if enable_parallel_optimizer=True.

int

parallel.parallel_optimizer_config.optimizer_weight_shard_size

Set the size of the communication domain used for optimizer weight slicing; parallel_config.data_parallel must be an integer multiple of this value. Effective only if enable_parallel_optimizer=True.

int

parallel.pipeline_config.pipeline_interleave

Whether to enable interleaved pipeline parallelism. Set this variable to True when using Seq-Pipe or ZeroBubbleV (also known as DualPipeV).

bool

parallel.pipeline_config.pipeline_scheduler

Set the pipeline scheduling strategy. Currently only "seqpipe" and "zero_bubble_v" are supported.

str

Configure the parallel strategy to satisfy device_num = data_parallel × model_parallel × context_parallel × pipeline_stage.
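
For example, the following illustrative configuration targets an 8-device cluster (2 × 4 × 1 × 1 = 8); the parallelism degrees are examples only:

use_parallel: True
parallel_config:
  data_parallel: 2
  model_parallel: 4
  context_parallel: 1
  pipeline_stage: 1
  micro_batch_num: 1
parallel:
  parallel_mode: 1            # semi-automatic parallel mode
  enable_parallel_optimizer: True
  full_batch: True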

Model Optimization Configuration

  1. MindSpore Transformers provides recomputation-related configurations to reduce the memory footprint of the model during training, see Recomputation for details.

    Parameters

    Descriptions

    Types

    recompute_config.recompute

    Whether to enable recompute.

    bool/list/tuple

    recompute_config.select_recompute

    Whether to turn on selective recomputation, which recomputes only the operators in the attention layer.

    bool/list

    recompute_config.parallel_optimizer_comm_recompute

    Whether to recompute AllGather communication introduced in parallel by the optimizer.

    bool/list

    recompute_config.mp_comm_recompute

    Whether to recompute communications introduced by model parallel.

    bool

    recompute_config.recompute_slice_activation

    Whether to slice the output activations of Cells kept in memory.

    bool

    recompute_config.select_recompute_exclude

    Disable recomputation for the specified operator, valid only for the Primitive operators.

    bool/list

    recompute_config.select_comm_recompute_exclude

    Disable communication recomputation for the specified operator, valid only for the Primitive operators.

    bool/list

  2. MindSpore Transformers provides fine-grained activations SWAP-related configurations to reduce the memory footprint of the model during training, see Fine-Grained Activations SWAP for details.

    Parameters

    Descriptions

    Types

    swap_config.swap

    Enable activations SWAP.

    bool

    swap_config.default_prefetch

    Control the timing of releasing memory in forward phase and starting prefetch in backward phase of the default SWAP strategy, only taking effect when swap=True, layer_swap=None, and op_swap=None.

    int

    swap_config.layer_swap

    Select specific layers to enable activations SWAP.

    list

    swap_config.op_swap

    Select specific operators within layers to enable activations SWAP.

    list
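
The following is an illustrative recomputation and SWAP configuration fragment; the switches shown are examples and should be tuned to the actual memory budget:

recompute_config:
  recompute: True
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: False
swap_config:
  swap: False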

Callbacks Configuration

MindSpore Transformers provides encapsulated Callbacks classes, which are mainly used to report the model training state and outputs during training, save model weight files, and perform other operations. Currently, the following Callbacks classes are supported.

  1. MFLossMonitor

    This callback function class is mainly used to print information such as training progress, model Loss, and learning rate during the training process and has several configurable items as follows:

    Parameter Name

    Type

    Optional

    Default Value

    Value Description

    learning_rate

    float

    Optional

    None

    Sets the initial learning rate for MFLossMonitor. Used for logging and training progress calculation. If not set, attempts to obtain it from the optimizer or other configuration.

    per_print_times

    int

    Optional

    1

    Sets the frequency of logging for MFLossMonitor, in steps. The default value is 1, which prints a log message once per training step.

    micro_batch_num

    int

    Optional

    1

    Sets the number of micro batches processed at each training step, used to calculate the actual loss value. If not set, it is the same as parallel_config.micro_batch_num in Parallel Configuration.

    micro_batch_interleave_num

    int

    Optional

    1

    Sets the size of the multi-replica micro-batch for each training step, used for loss calculation. If not configured, it is the same as micro_batch_interleave_num in Parallel Configuration.

    origin_epochs

    int

    Optional

    None

    Sets the total number of training epochs in MFLossMonitor. If not configured, it is the same as runner_config.epochs in Model Training Configuration.

    dataset_size

    int

    Optional

    None

    Sets the total number of samples in the dataset in MFLossMonitor. If not configured, it automatically uses the actual dataset size loaded.

    initial_epoch

    int

    Optional

    0

    Sets the starting epoch number for MFLossMonitor. The default value is 0, indicating that counting starts from epoch 0. This can be used to resume training progress when resuming training from a breakpoint.

    initial_step

    int

    Optional

    0

    Sets the number of initial training steps in MFLossMonitor. The default value is 0. This can be used to align logs and progress bars when resuming training.

    global_batch_size

    int

    Optional

    0

    Sets the global batch size in MFLossMonitor (i.e., the total number of samples used in each training step). If not configured, it is automatically calculated based on the dataset size and parallelization strategy.

    gradient_accumulation_steps

    int

    Optional

    1

    Sets the number of gradient accumulation steps in MFLossMonitor. If not configured, it is consistent with gradient_accumulation_steps in Model Training Configuration. Used for loss normalization and training progress estimation.

    check_for_nan_in_loss_and_grad

    bool

    Optional

    False

    Whether to enable NaN/Inf detection for loss values and gradients in MFLossMonitor. If enabled, training will be terminated if overflow (NaN or INF) is detected. The default value is False. It is recommended to enable it during the debugging phase to improve training stability.

  2. SummaryMonitor

    This callback function class is mainly used to collect Summary data, see mindspore.SummaryCollector for details.

  3. CheckpointMonitor

    This callback function class is mainly used to save the model weights file during the model training process and has several configurable items as follows:

    Parameter Name

    Type

    Optional

    Default Value

    Value Description

    prefix

    string

    Optional

    'CKP'

    Set the prefix for the weight file name. For example, CKP-100.ckpt is generated. If not configured, the default value 'CKP' is used.

    directory

    string

    Optional

    None

    Set the directory for saving weight files. If not configured, the default directory is checkpoint/ under the output_dir directory.

    save_checkpoint_seconds

    int

    Optional

    0

    Set the interval for automatically saving weights (in seconds). Mutually exclusive with save_checkpoint_steps and takes precedence. For example, save every 3600 seconds.

    save_checkpoint_steps

    int

    Optional

    1

    Sets the automatic saving interval for weights based on the number of training steps (unit: steps). Mutually exclusive with save_checkpoint_seconds; if both are set, the time-based saving takes precedence. For example, save every 1000 steps.

    keep_checkpoint_max

    int

    Optional

    5

    The maximum number of weight files to retain. When the number of saved weights exceeds this value, the system will delete the oldest files in order of creation time to ensure that the total number does not exceed this limit. Used to control disk space usage.

    keep_checkpoint_per_n_minutes

    int

    Optional

    0

    Retain one weight every N minutes. This is a time-windowed retention policy often used to balance storage and recovery flexibility in long-term training. For example, setting it to 60 means retaining at least one weight every hour.

    integrated_save

    bool

    Optional

    True

    Whether to enable aggregated weight saving:
    True: Aggregate weights from all devices when saving the weight file, i.e., all devices have the same weights;
    False: Each device saves its own weights.
    In semi-automatic parallel mode, it is recommended to set this to False to avoid memory issues when saving weight files.

    save_network_params

    bool

    Optional

    False

    Whether to save only the model weights. The default value is False.

    save_trainable_params

    bool

    Optional

    False

    Whether to save trainable parameters separately (i.e., the model's parameter weights during partial fine-tuning).

    async_save

    bool

    Optional

    False

    Whether to save weights asynchronously. Enabling this feature will not block the main training process, improving training efficiency. However, please note that I/O resource contention may cause write delays.

    remove_redundancy

    bool

    Optional

    False

    Whether to remove redundancy from model weights when saving. Defaults to False.

    checkpoint_format

    string

    Optional

    'ckpt'

    The format of saved model weights. Defaults to ckpt. Optional values: ckpt, safetensors.

    embedding_local_norm_threshold

    float

    Optional

    1.0

    The threshold used in health monitoring to detect abnormalities in the embedding layer gradient or output norm. If the norm exceeds this value, an alarm or data skipping mechanism may be triggered to prevent training divergence. Defaults to 1.0 and can be adjusted based on model scale.

Multiple Callbacks function classes can be configured at the same time under the callbacks field. The following is an example of callbacks configuration.

callbacks:
  - type: MFLossMonitor
  - type: CheckpointMonitor
    prefix: "name_xxb"
    save_checkpoint_steps: 1000
    integrated_save: False
    async_save: False

Processor Configuration

Processor is mainly used to preprocess the input data for model inference. Since the Processor configuration items are not fixed, only the generic Processor configuration items in MindSpore Transformers are explained here.

Parameter Name

Type

Optional

Default Value

Value Description

processor.type

string

Required

None

Sets the name of the data processing class (Processor) to be used, such as LlamaProcessor or Qwen2Processor. This class determines the overall input data preprocessing flow and must match the model architecture.

processor.return_tensors

string

Optional

'ms'

Sets the type of tensors returned after data processing. Can be set to 'ms' to indicate a MindSpore Tensor.

processor.image_processor.type

string

Required

None

Sets the type of the image data processing class. Responsible for image normalization, scaling, cropping, and other operations, and must be compatible with the model's visual encoder.

processor.tokenizer.type

string

Required

None

Sets the text tokenizer type, such as LlamaTokenizer or Qwen2Tokenizer. This determines how the text is segmented into subwords or tokens and must be consistent with the language model.

processor.tokenizer.vocab_file

string

Required

None

Sets the vocabulary file path required by the tokenizer (such as vocab.txt or tokenizer.model). The specific file type depends on the tokenizer implementation. This must correspond to processor.tokenizer.type; otherwise, loading may fail.
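
The following is an illustrative processor configuration. The class names and vocabulary path are examples and must match the actual model:

processor:
  type: LlamaProcessor                  # illustrative processor class
  return_tensors: ms
  tokenizer:
    type: LlamaTokenizer                # illustrative tokenizer class
    vocab_file: './tokenizer.model'     # illustrative path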

Model Evaluation Configuration

MindSpore Transformers provides a model evaluation function and also supports model evaluation while training. The following are the configurations related to model evaluation.

Parameter Name

Type

Optional

Default Value

Value Description

eval_dataset

dict

Required

None

Dataset configuration for evaluation, used in the same way as train_dataset.

eval_dataset_task

dict

Required

None

Evaluation task configuration, used in the same way as dataset task configuration (such as preprocessing, batch size, etc.), used to define the evaluation process.

metric.type

string

Required

None

Set the evaluation type, such as Accuracy, F1, etc. The specific value must be consistent with the supported evaluation metrics.

do_eval

bool

Optional

False

Whether to enable the evaluation-while-training feature.

eval_step_interval

int

Optional

100

Sets the evaluation step interval. The default value is 100. A value less than or equal to 0 disables step-by-step evaluation.

eval_epoch_interval

int

Optional

-1

Sets the evaluation epoch interval. The default value is -1. A value less than 0 disables epoch-by-epoch evaluation. This configuration is not recommended in data sinking mode.
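
The following is an illustrative configuration for evaluation while training; the metric type is an example and must be one of the supported evaluation metrics:

do_eval: True
eval_step_interval: 500
eval_epoch_interval: -1
metric:
  type: PerplexityMetric                # illustrative metric type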

Profile Configuration

MindSpore Transformers provides Profile as the main tool for model performance tuning; please refer to the Performance Tuning Guide for more details. The following are the Profile-related configurations.

Parameter Name

Type

Optional

Default Value

Value Description

profile

bool

Optional

False

Whether to enable the performance collection tool. The default value is False. For details, see mindspore.Profiler.

profile_start_step

int

Optional

1

Sets the number of steps at which to start collecting performance data. The default value is 1.

profile_stop_step

int

Optional

10

Sets the number of steps at which to stop collecting performance data. The default value is 10.

profile_communication

bool

Optional

False

Sets whether to collect communication performance data during multi-device training. This parameter is invalid when using a single card for training and the default value is False.

profile_memory

bool

Optional

True

Sets whether to collect Tensor memory data. Defaults to True.

profile_rank_ids

list

Optional

None

Sets the rank ids for which performance collection is enabled. Defaults to None, meaning that performance collection is enabled for all rank ids.

profile_pipeline

bool

Optional

False

Sets whether to enable performance collection for one card in each stage of the pipeline in parallel. Defaults to False.

profile_output

string

Required

None

Sets the folder path for saving performance collection files.

profiler_level

int

Optional

1

Sets the data collection level. Possible values are (0, 1, 2). Defaults to 1.

with_stack

bool

Optional

False

Sets whether to collect call stack data on the Python side. Defaults to False.

data_simplification

bool

Optional

False

Sets whether to enable data simplification. If enabled, the FRAMEWORK directory and other redundant data will be deleted after exporting performance data. The default value is False.

init_start_profile

bool

Optional

False

Sets whether to enable performance data collection during Profiler initialization. This parameter has no effect when profile_start_step is set. It must be set to True when profile_memory is enabled.

mstx

bool

Optional

False

Sets whether to collect mstx timestamp records, including training steps, HCCL communication operators, etc. The default value is False.
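
The following is an illustrative Profile configuration that collects performance data between steps 5 and 10; the values are examples only:

profile: True
profile_start_step: 5
profile_stop_step: 10
profile_output: './profile'
profiler_level: 1
profile_memory: False
profile_communication: False
with_stack: False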

Metric Monitoring Configuration

The metric monitoring configuration is primarily used to configure methods for recording metrics during training. Please refer to Training Metrics Monitoring for more details. Below is a description of the common metric monitoring configuration options in MindSpore Transformers:

Parameters

Type

Optional

Default Value

Value Descriptions

monitor_config.monitor_on

bool

Optional

False

Set whether to enable monitoring. The default is False, which will disable all parameters below.

monitor_config.dump_path

string

Optional

'./dump'

Set the save path for metric files of local_norm, device_local_norm and local_loss during training. Defaults to './dump' when not set or set to null.

monitor_config.target

list(string)

Optional

['.*']

Set the (partial) names of the target parameters monitored by the optimizer state and local_norm metrics; regular expressions are supported. Defaults to ['.*'] when not set or set to null, that is, all parameters are monitored.

monitor_config.invert

bool

Optional

False

Set whether to invert the targets specified in monitor_config.target, defaults to False.

monitor_config.step_interval

int

Optional

1

Set the frequency for metric recording. The default value is 1, that is, the metrics are recorded every step.

monitor_config.local_loss_format

string / list(string)

Optional

null

Set the format for recording the local_loss metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.device_local_loss_format

string / list(string)

Optional

null

Set the format for recording the device_local_loss metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.local_norm_format

string / list(string)

Optional

null

Set the format for recording the local_norm metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.device_local_norm_format

string / list(string)

Optional

null

Set the format for recording the device_local_norm metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.optimizer_state_format

string / list(string)

Optional

null

Set the format for recording the optimizer state metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.weight_state_format

string / list(string)

Optional

null

Set the format for recording the weight L2-norm metric. It can be the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or null. Defaults to null, that is, this metric is not monitored.

monitor_config.throughput_baseline

int / float

Optional

null

Set the baseline for the throughput linearity metric; it must be a positive number. Defaults to null, that is, this metric is not monitored.

monitor_config.print_struct

bool

Optional

False

Set whether to print the names of all trainable parameters of the model. If set to True, the names are printed at the beginning of the first step, and the training process exits after the step ends. Defaults to False.

monitor_config.check_for_global_norm

bool

Optional

False

Set whether to enable process level fault recovery function. Defaults to False.

monitor_config.global_norm_spike_threshold

float

Optional

3.0

Set the threshold for global norm, triggering data skipping when the global norm is exceeded. Defaults to 3.0.

monitor_config.global_norm_spike_count_threshold

int

Optional

10

Set the cumulative number of consecutive global norm anomalies; when this threshold is reached, an exception is raised to terminate training. Defaults to 10.
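
The following is an illustrative monitor_config fragment; the monitored metrics and thresholds are examples only:

monitor_config:
  monitor_on: True
  dump_path: './dump'
  target: ['.*']
  step_interval: 1
  local_loss_format: ['log', 'tensorboard']
  device_local_norm_format: 'log'
  global_norm_spike_threshold: 3.0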

TensorBoard Configuration

The TensorBoard configuration is primarily used to configure parameters related to TensorBoard during training, allowing for real-time monitoring and visualization of training metrics. Please refer to Training Metrics Monitoring for more details. Below is a description of the common TensorBoard configuration options in MindSpore Transformers:

Parameters

Type

Optional

Default Value

Value Description

tensorboard.tensorboard_dir

string

Required

None

Sets the path where TensorBoard event files are saved.

tensorboard.tensorboard_queue_size

int

Optional

10

Sets the maximum size of the capture queue. If the queue exceeds this value, its contents are written to the event file. The default value is 10.

tensorboard.log_loss_scale_to_tensorboard

bool

Optional

False

Sets whether loss scale information is logged to the event file, default is False.

tensorboard.log_timers_to_tensorboard

bool

Optional

False

Sets whether to log timer information to the event file. The timer information contains the duration of the current training step (or iteration) as well as the throughput. Defaults to False.

tensorboard.log_expert_load_to_tensorboard

bool

Optional

False

Sets whether to log the expert load to the event file. Defaults to False.
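
The following is an illustrative TensorBoard configuration; the directory path is an example:

tensorboard:
  tensorboard_dir: './tensorboard'
  tensorboard_queue_size: 10
  log_loss_scale_to_tensorboard: True
  log_timers_to_tensorboard: True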