# Configuration File Descriptions

## Overview

Model training and inference usually require configuring a number of parameters. MindSpore Transformers supports using YAML files to centrally manage and adjust the configurable items, which makes model configuration more structured and improves its maintainability.
## Description of the YAML File Contents

The YAML files provided by MindSpore Transformers contain configuration items for different functions, which are described below by content.

### Basic Configuration

The basic configuration is mainly used to specify the MindSpore random seed and the settings related to loading weights.
| Parameter Name | Data Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| seed | int | Optional | 0 | Sets the global random seed to ensure experimental reproducibility. For details, see mindspore.set_seed. |
| run_mode | string | Required | None | Sets the model's run mode. Options: `train`, `finetune`, `eval`, or `predict`. |
| output_dir | string | Optional | None | Sets the output directory for saving log files, checkpoint files, and parallel strategy files. The directory is created automatically if it does not exist. |
| load_checkpoint | string | Optional | None | The file or folder path for loading weights. Supports three scenarios: 1. the path to a complete weights file; 2. the path to a folder of distributed weights produced by offline splitting; 3. the path to a folder containing LoRA incremental weights and base model weights. For details on how to obtain the various weights, see Checkpoint Conversion Function. |
| auto_trans_ckpt | bool | Optional | False | Whether to enable automatic splitting and merging of distributed weights. When enabled, weights split across multiple cards can be loaded onto a single card, and single-card weights can be loaded onto multiple cards. For more information, see Distributed Weight Slicing and Merging. |
| resume_training | bool | Optional | False | Whether to enable the resumable training feature. When enabled, the optimizer state, learning rate scheduler state, and other parameters are restored from the path specified by `load_checkpoint`. |
| load_ckpt_format | string | Optional | "ckpt" | The format of the loaded model weights. Optional values are `ckpt` and `safetensors`. |
| remove_redundancy | bool | Optional | False | Whether the loaded model weights have had redundancy removed. For details, see Saving and Loading Weights with De-Redundancy. |
| train_precision_sync | bool | Optional | None | Enables deterministic computation for training. Setting this to `True` enables it, which improves computational determinism and is generally used to ensure experimental reproducibility; `False` disables it. |
| infer_precision_sync | bool | Optional | None | Enables deterministic computation for inference. Setting this to `True` enables it; `False` disables it. |
| use_skip_data_by_global_norm | bool | Optional | False | Whether to enable data skipping based on the global gradient norm. When a batch of data causes exploding gradients, that batch is skipped automatically to improve training stability. For more information, see Data Skipping. |
| use_checkpoint_health_monitor | bool | Optional | False | Whether to enable weight health monitoring. When enabled, checkpoint integrity and availability are verified at save time, preventing corrupted weight files from being saved. For more information, see Checkpoint Health Monitor. |
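For reference, a minimal sketch of a basic configuration is shown below; the checkpoint path and chosen run mode are illustrative and should be replaced with your own values.

```yaml
seed: 0
run_mode: 'finetune'
output_dir: './output'                      # logs, checkpoints, and strategy files are written here
load_checkpoint: '/path/to/checkpoint_dir'  # illustrative path
load_ckpt_format: 'safetensors'
resume_training: False
```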
### Context Configuration

The context configuration is mainly used to specify parameters related to mindspore.set_context.
| Parameter Name | Data Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| context.mode | int | Required | None | Sets the backend execution mode: `0` for GRAPH_MODE, `1` for PYNATIVE_MODE. |
| context.device_target | string | Required | None | Sets the backend execution device. MindSpore Transformers only supports running on `Ascend` devices. |
| context.device_id | int | Optional | 0 | Sets the execution device ID. The value must be within the range of available devices. The default value is `0`. |
| context.enable_graph_kernel | bool | Optional | False | Whether to enable graph fusion to optimize network execution performance. The default value is `False`. |
| context.max_call_depth | int | Optional | 1000 | Sets the maximum depth of function calls. The value must be a positive integer. The default value is `1000`. |
| context.max_device_memory | string | Optional | "1024GB" | Sets the maximum memory available on the device, in the format "xxGB". The default value is `"1024GB"`. |
| context.mempool_block_size | string | Optional | "1GB" | Sets the memory block size, in the format "xxGB". The default value is `"1GB"`. |
| context.save_graphs | bool / int | Optional | False | Whether to save compiled graphs during execution. |
| context.save_graphs_path | string | Optional | './graph' | The path where compiled graphs are saved. If `save_graphs` is enabled and this path is not set, `'./graph'` is used. |
| context.affinity_cpu_list | dict / string | Optional | None | Optional configuration item used to implement a user-defined core-binding strategy. |
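The following sketch illustrates a typical context configuration; the memory and depth values are illustrative and should be adjusted to the actual hardware.

```yaml
context:
  mode: 0                    # 0: GRAPH_MODE
  device_target: 'Ascend'
  max_call_depth: 10000      # illustrative
  max_device_memory: '58GB'  # illustrative; adjust to the actual device memory
```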
### Legacy Model Configuration

If you use MindSpore Transformers to run tasks with legacy models, you need to configure the relevant hyperparameters in a YAML file. Note that the configuration described in this section applies only to legacy models and cannot be mixed with Mcore model configurations; pay attention to version compatibility.

Because configurations may vary from model to model, this section only describes the general model configuration in MindSpore Transformers.
| Parameter Name | Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| model.arch.type | string | Required | None | Sets the model class, which is used to instantiate the model when building it. |
| model.model_config.type | string | Required | None | Sets the model configuration class. It must match the model class, i.e., contain all parameters used by the model class. |
| model.model_config.num_layers | int | Required | None | Sets the number of model layers, typically the number of decoder layers. |
| model.model_config.seq_length | int | Required | None | Sets the model sequence length, i.e., the maximum sequence length supported by the model. |
| model.model_config.hidden_size | int | Required | None | Sets the dimension of the model's hidden state. |
| model.model_config.vocab_size | int | Required | None | Sets the size of the model vocabulary. |
| model.model_config.top_k | int | Optional | None | During generation, samples from the `top_k` tokens with the highest probabilities. |
| model.model_config.top_p | float | Optional | None | During generation, samples from the highest-probability tokens whose cumulative probability does not exceed `top_p`. |
| model.model_config.use_past | bool | Optional | False | Whether to enable incremental inference for the model, which allows Paged Attention to be used to improve inference performance. Must be set to `True` when running inference. |
| model.model_config.max_decode_length | int | Optional | None | Sets the maximum length of generated text, including the input length. |
| model.model_config.max_length | int | Optional | None | Same as `max_decode_length`. |
| model.model_config.max_new_tokens | int | Optional | None | Sets the maximum length of newly generated text, excluding the input length. When both `max_new_tokens` and `max_length` are set, `max_new_tokens` takes precedence. |
| model.model_config.min_length | int | Optional | None | Sets the minimum length of generated text, including the input length. |
| model.model_config.min_new_tokens | int | Optional | None | Sets the minimum length of newly generated text, excluding the input length. When both `min_new_tokens` and `min_length` are set, `min_new_tokens` takes precedence. |
| model.model_config.repetition_penalty | float | Optional | 1.0 | Sets the penalty coefficient for generating repeated text; `1.0` means no penalty. |
| model.model_config.block_size | int | Optional | None | Sets the block size in Paged Attention. Takes effect only when `use_past=True`. |
| model.model_config.num_blocks | int | Optional | None | Sets the total number of blocks in Paged Attention. Takes effect only when `use_past=True`. |
| model.model_config.return_dict_in_generate | bool | Optional | False | Whether to return the inference results of the `generate` interface as a dictionary. Defaults to `False`. |
| model.model_config.output_scores | bool | Optional | False | Whether to include the pre-softmax scores of each forward generation step when returning results in dictionary form. Defaults to `False`. |
| model.model_config.output_logits | bool | Optional | False | Whether to include the logits output by the model at each forward generation step when returning results in dictionary form. Defaults to `False`. |
| model.model_config.layers_per_stage | list(int) | Optional | None | Sets the number of transformer layers assigned to each stage when pipeline stages are enabled. Defaults to `None`. |
| model.model_config.bias_swiglu_fusion | bool | Optional | False | Whether to use the SwiGLU fusion operator. Defaults to `False`. |
| model.model_config.apply_rope_fusion | bool | Optional | False | Whether to use the RoPE fusion operator. Defaults to `False`. |
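As an illustration, a minimal legacy model configuration might look like the sketch below; the class names and dimensions are illustrative and must match the actual model being used.

```yaml
model:
  arch:
    type: LlamaForCausalLM    # illustrative model class
  model_config:
    type: LlamaConfig         # illustrative config class; must match the model class
    num_layers: 32
    seq_length: 4096
    hidden_size: 4096
    vocab_size: 32000
    use_past: True            # enable incremental inference
    block_size: 16            # Paged Attention settings; effective because use_past=True
    num_blocks: 512
```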
In addition to the basic model configuration above, MoE models require separate configuration of some MoE module hyperparameters. Since different models use different parameters, only the general configuration is described:
| Parameter Name | Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| moe_config.expert_num | int | Required | None | Sets the number of routed experts. |
| moe_config.shared_expert_num | int | Required | None | Sets the number of shared experts. |
| moe_config.moe_intermediate_size | int | Required | None | Sets the intermediate dimension of the expert layer. |
| moe_config.capacity_factor | int | Required | None | Sets the expert capacity factor. |
| moe_config.num_experts_chosen | int | Required | None | Sets the number of experts chosen for each token. |
| moe_config.enable_sdrop | bool | Optional | False | Enables the `sdrop` token-dropping policy. |
| moe_config.aux_loss_factor | list(float) | Optional | None | Sets the weight of the load-balancing (auxiliary) loss. |
| moe_config.first_k_dense_replace | int | Optional | 1 | Enables MoE blocks only after the first k dense layers. Typically set to `1`. |
| moe_config.balance_via_topk_bias | bool | Optional | False | Enables the top-k bias load-balancing algorithm. |
| moe_config.topk_bias_update_rate | float | Optional | None | Sets the bias update step size for the `balance_via_topk_bias` load-balancing algorithm. |
| moe_config.comp_comm_parallel | bool | Optional | False | Sets whether to enable overlapping of FFN computation and communication. |
| moe_config.comp_comm_parallel_degree | int | Optional | None | Sets the number of splits for FFN computation and communication. A larger number yields more overlap but consumes more memory. Valid only when `comp_comm_parallel=True`. |
| moe_config.moe_shared_expert_overlap | bool | Optional | False | Sets whether to enable overlapping of computation and communication between shared and routed experts. |
| moe_config.use_gating_sigmoid | bool | Optional | False | Sets whether to apply the sigmoid function to gating results in MoE. |
| moe_config.use_gmm | bool | Optional | False | Sets whether to use GroupedMatmul for MoE expert computation. |
| moe_config.use_fused_ops_permute | bool | Optional | False | Specifies whether MoE uses the permute and unpermute fused operators for performance acceleration. Takes effect only when `use_gmm=True`. |
| moe_config.enable_deredundency | bool | Optional | False | Specifies whether to enable de-redundancy communication. Requires the expert parallel degree to be an integer multiple of the number of NPUs per node. Default value: False. Takes effect only when `use_gmm=True`. |
| moe_config.npu_nums_per_device | int | Optional | 8 | Specifies the number of NPUs per node. Default value: 8. Takes effect only when `enable_deredundency=True`. |
| moe_config.enable_gmm_safe_tokens | bool | Optional | False | Ensures that each expert is assigned at least one token, preventing GroupedMatmul calculation failures under extreme load imbalance. The default value is `False`. |
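A sketch of an MoE configuration follows; the expert counts and dimensions are illustrative and depend on the specific model.

```yaml
moe_config:
  expert_num: 8
  shared_expert_num: 1
  moe_intermediate_size: 1408  # illustrative
  capacity_factor: 2
  num_experts_chosen: 2
  use_gmm: True
  use_fused_ops_permute: True  # effective only because use_gmm is True
```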
### Mcore Model Configuration

When using MindSpore Transformers to launch an Mcore model task, you need to configure the relevant hyperparameters under `model_config`, including model selection, model parameters, compute types, and MoE parameters.

Because configurations may vary from model to model, the common model configurations in MindSpore Transformers are listed here:
| Parameter | Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| model.model_config.model_type | string | Required | None | Sets the model configuration class. It must match the model class, i.e., contain all parameters used by the model class. |
| model.model_config.architectures | string | Required | None | Sets the model class, which is used to instantiate the model when building it. |
| model.model_config.offset | int / list(int) | Required | 0 | When pipeline parallelism is enabled, sets the per-stage layer offset based on the number of model layers to build the pipeline. |
| model.model_config.vocab_size | int | Optional | 128000 | Model vocabulary size. |
| model.model_config.hidden_size | int | Required | 0 | Transformer hidden layer size. |
| model.model_config.ffn_hidden_size | int | Optional | None | Transformer feed-forward layer size, corresponding to `intermediate_size`. |
| model.model_config.num_layers | int | Required | 0 | Number of Transformer layers, corresponding to `num_hidden_layers`. |
| model.model_config.max_position_embeddings | int | Optional | 4096 | Maximum sequence length the model can handle. |
| model.model_config.hidden_act | string | Optional | 'gelu' | Activation function used for the nonlinearity in the MLP. |
| model.model_config.num_attention_heads | int | Required | 0 | Number of Transformer attention heads. |
| model.model_config.num_query_groups | int | Optional | None | Number of query groups for the grouped-query attention mechanism, corresponding to `num_key_value_heads`. |
| model.model_config.kv_channels | int | Optional | None | Projection weight dimension for the multi-head attention mechanism, corresponding to `head_dim`. |
| model.model_config.layernorm_epsilon | float | Required | 1e-5 | Epsilon value for all LayerNorm operations. |
| model.model_config.add_bias_linear | bool | Required | True | Whether to include a bias term in all linear layers (after QKV projection, after core attention, and in both MLP layers). |
| model.model_config.tie_word_embeddings | bool | Required | True | Whether to share input and output embedding weights. |
| model.model_config.use_flash_attention | bool | Required | True | Whether to use flash attention in the attention layer. |
| model.model_config.use_contiguous_weight_layout_attention | bool | Required | False | Determines the weight layout in the QKV linear projection of the self-attention layer. Affects only the self-attention layer. |
| model.model_config.hidden_dropout | float | Required | 0.1 | Dropout probability for the Transformer hidden state. |
| model.model_config.attention_dropout | float | Required | 0.1 | Dropout probability for the post-attention layer. |
| model.model_config.position_embedding_type | string | Required | 'rope' | Position embedding type for the attention layer. |
| model.model_config.params_dtype | string | Required | 'float32' | dtype used when initializing weights. |
| model.model_config.compute_dtype | string | Required | 'bfloat16' | Compute dtype for linear layers. |
| model.model_config.layernorm_compute_dtype | string | Required | 'float32' | Compute dtype for LayerNorm layers. |
| model.model_config.softmax_compute_dtype | string | Required | 'float32' | dtype used to compute the softmax during attention computation. |
| model.model_config.rotary_dtype | string | Required | 'float32' | Compute dtype for rotary position embeddings. |
| model.model_config.init_method_std | float | Required | 0.02 | Standard deviation of the zero-mean normal distribution used by the default initialization method, corresponding to `initializer_range`. |
| model.model_config.moe_grouped_gemm | bool | Required | False | When there are multiple experts per rank, compresses multiple local (potentially small) GEMMs into a single kernel launch to leverage grouped GEMM capabilities for improved utilization and performance. |
| model.model_config.num_moe_experts | int | Optional | None | The number of experts used in the MoE layer. |
| model.model_config.num_experts_per_tok | int | Required | 2 | The number of experts each token is routed to. |
| model.model_config.moe_ffn_hidden_size | int | Optional | None | Hidden size of the MoE feed-forward network, corresponding to `moe_intermediate_size`. |
| model.model_config.moe_router_dtype | string | Required | 'float32' | Data type used for routing and for the weighted averaging of expert outputs. |
| model.model_config.gated_linear_unit | bool | Required | False | Whether to use a gated linear unit for the first linear layer in the MLP. |
| model.model_config.norm_topk_prob | bool | Required | True | Whether to normalize the top-k probabilities. |
| model.model_config.moe_router_pre_softmax | bool | Required | False | Enables pre-softmax (pre-sigmoid) routing for MoE, meaning softmax is performed before top-k selection. By default, softmax is performed after top-k selection. |
| model.model_config.moe_token_drop_policy | string | Required | 'probs' | The token drop policy, either 'probs' or 'position'. With 'probs', the tokens with the lowest probabilities are dropped; with 'position', tokens at the end of each batch are dropped. |
| model.model_config.moe_router_topk_scaling_factor | float | Optional | None | Scaling factor for the routing score in top-k routing, corresponding to `routed_scaling_factor`. |
| model.model_config.moe_aux_loss_coeff | float | Required | 0.0 | Scaling factor for the auxiliary loss. A recommended initial value is 1e-2. |
| model.model_config.moe_router_load_balancing_type | string | Required | 'aux_loss' | The router's load-balancing strategy. |
| model.model_config.moe_permute_fusion | bool | Optional | False | Whether to use the moe_token_permute fusion operator. Default is `False`. |
| model.model_config.moe_router_force_expert_balance | bool | Optional | False | Whether to use forced load balancing in the expert router. Intended only for performance testing, not for general use. Defaults to `False`. |
| model.model_config.use_interleaved_weight_layout_mlp | bool | Optional | True | Determines the weight arrangement in the linear_fc1 projection of the MLP. Affects only MLP layers. |
| model.model_config.moe_router_enable_expert_bias | bool | Optional | False | Whether to use top-k routing with a dynamic expert bias in the auxiliary-loss-free load-balancing strategy. Routing decisions are based on the sum of the routing score and the expert bias. |
| model.model_config.enable_expert_relocation | bool | Optional | False | Whether to enable dynamic expert relocation for load balancing in MoE models. When enabled, experts are dynamically redistributed across devices based on their load history to improve training efficiency and load balance. Defaults to False. |
| model.model_config.expert_relocation_initial_iteration | int | Optional | 20 | The initial iteration at which expert relocation starts; relocation begins after this many training iterations. |
| model.model_config.expert_relocation_freq | int | Optional | 50 | Frequency of expert relocation in training iterations. After the initial iteration, relocation is performed every N iterations. |
| model.model_config.print_expert_load | bool | Optional | False | Whether to print expert load information. If enabled, detailed expert load statistics are printed during training. Defaults to `False`. |
| model.model_config.moe_router_num_groups | int | Optional | None | The number of expert groups used for group-limited routing, equivalent to `n_group`. |
| model.model_config.moe_router_group_topk | int | Optional | None | The number of groups selected in group-limited routing, equivalent to `topk_group`. |
| model.model_config.moe_router_topk | int | Optional | 2 | The number of experts each token is routed to, equivalent to `num_experts_per_tok`. |
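The sketch below shows how these Mcore hyperparameters fit together in a YAML file; the class names and sizes are illustrative and must be taken from the actual model being launched.

```yaml
model:
  model_config:
    model_type: llama                # illustrative configuration class
    architectures: LlamaForCausalLM  # illustrative model class
    offset: 0
    num_layers: 24
    hidden_size: 2048
    ffn_hidden_size: 8192
    num_attention_heads: 16
    num_query_groups: 8              # grouped-query attention
    position_embedding_type: 'rope'
    params_dtype: 'float32'
    compute_dtype: 'bfloat16'
```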
### Model Training Configuration

When starting model training, in addition to the model-related parameters, you also need to set the parameters of the modules required for training, such as trainer, runner_config, learning rate, and optimizer. MindSpore Transformers provides the following configuration items.
| Parameters | Descriptions | Types |
|---|---|---|
| trainer.type | Sets the trainer class; different models and application scenarios usually use different trainer classes. | str |
| trainer.model_name | Sets the model name in the format '{name}_xxb', indicating a model of a particular specification. | str |
| runner_config.epochs | Sets the number of epochs for model training. | int |
| runner_config.batch_size | Sets the batch size of samples, which overrides the `batch_size` in the dataset configuration. | int |
| runner_config.sink_mode | Enables data sink mode. | bool |
| runner_config.sink_size | Sets the number of iterations sent from Host to Device per sink; effective only when `sink_mode=True`. | int |
| runner_config.gradient_accumulation_steps | Sets the number of gradient accumulation steps; the default value 1 means gradient accumulation is disabled. | int |
| runner_wrapper.type | Sets the wrapper class; generally set to 'MFTrainOneStepCell'. | str |
| runner_wrapper.local_norm | Whether to print the gradient norm of each parameter on the card. | bool |
| runner_wrapper.scale_sense.type | Sets the gradient scaling class; generally set to 'DynamicLossScaleUpdateCell'. | str |
| runner_wrapper.scale_sense.loss_scale_value | Sets the dynamic loss scale factor; the loss scale changes dynamically according to this configuration. | int |
| runner_wrapper.use_clip_grad | Enables gradient clipping, which avoids training divergence when backward gradients are too large. | bool |
| lr_schedule.type | Sets the lr_schedule class, which is mainly used to adjust the learning rate during model training. | str |
| lr_schedule.learning_rate | Sets the initial learning rate. | float |
| lr_scale | Whether to enable learning rate scaling. | bool |
| lr_scale_factor | Sets the learning rate scaling factor. | int |
| layer_scale | Whether to enable layer decay. | bool |
| layer_decay | Sets the layer decay factor. | float |
| optimizer.type | Sets the optimizer class; the optimizer is used to update model weights during training. | str |
| optimizer.weight_decay | Sets the optimizer weight decay factor. | float |
| optimizer.fused_num | Sets the number of weights to be fused. | int |
| optimizer.interleave_step | Sets the step interval for selecting fusion candidates; every `interleave_step` steps, a weight is taken as a candidate for fusion. | int |
| optimizer.fused_algo | Sets the weight fusion algorithm. | string |
| optimizer.ema_alpha | The fusion coefficient; effective only when `fused_algo` is set to `ema`. | float |
| train_dataset.batch_size | Same as `runner_config.batch_size`. | int |
| train_dataset.input_columns | Sets the input data columns of the training dataset. | list |
| train_dataset.output_columns | Sets the output data columns of the training dataset. | list |
| train_dataset.construct_args_key | Sets the subset of dataset columns that are passed as inputs to the model's `construct` method. | list |
| train_dataset.column_order | Sets the order of the output data columns of the training dataset. | list |
| train_dataset.num_parallel_workers | Sets the number of processes that read the training dataset. | int |
| train_dataset.python_multiprocessing | Enables Python multiprocessing to improve data processing performance. | bool |
| train_dataset.drop_remainder | Whether to discard the last batch of data if it contains fewer samples than batch_size. | bool |
| train_dataset.repeat | Sets the number of times the dataset is repeated. | int |
| train_dataset.numa_enable | Whether to enable NUMA binding when reading data. | bool |
| train_dataset.prefetch_size | Sets the amount of prefetched data. | int |
| train_dataset.data_loader.type | Sets the data loading class. | str |
| train_dataset.data_loader.dataset_dir | Sets the path for loading data. | str |
| train_dataset.data_loader.shuffle | Whether to shuffle the data when reading the dataset. | bool |
| train_dataset.transforms | Sets data augmentation options. | - |
| train_dataset_task.type | Sets the dataset task class, which wraps the data loading class and other related configurations. | str |
| train_dataset_task.dataset_config | Typically set as a reference to `train_dataset`. | - |
| auto_tune | Enables auto-tuning of data processing parameters; see set_enable_autotune for details. | bool |
| filepath_prefix | Sets the save path for the parameter configuration after data optimization. | str |
| autotune_per_step | Sets the step interval for automatic data acceleration tuning; see set_autotune_interval for details. | int |
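A typical combination of these training modules is sketched below; the trainer, schedule, and optimizer class names are illustrative and depend on the task being run.

```yaml
trainer:
  type: CausalLanguageModelingTrainer  # illustrative trainer class
  model_name: 'llama2_7b'              # illustrative model name
runner_config:
  epochs: 2
  batch_size: 4
  sink_mode: True
  sink_size: 1
  gradient_accumulation_steps: 8
lr_schedule:
  type: CosineWithWarmUpLR             # illustrative schedule class
  learning_rate: 3.0e-4
optimizer:
  type: AdamW                          # illustrative optimizer class
  weight_decay: 0.01
```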
### Parallel Configuration

To improve model performance in large-scale cluster scenarios, it is usually necessary to configure a parallelism strategy for the model. For details, see Distributed Parallelism. The parallel configuration in MindSpore Transformers is as follows.
| Parameters | Descriptions | Types |
|---|---|---|
| use_parallel | Enables parallel mode. | bool |
| parallel_config.data_parallel | Sets the data parallel degree. | int |
| parallel_config.model_parallel | Sets the model parallel degree. | int |
| parallel_config.context_parallel | Sets the sequence parallel degree. | int |
| parallel_config.pipeline_stage | Sets the pipeline parallel degree. | int |
| parallel_config.micro_batch_num | Sets the pipeline-parallel micro-batch number; it should satisfy `micro_batch_num >= pipeline_stage`. | int |
| parallel_config.seq_split_num | Sets the number of sequence splits in sequence-pipeline parallelism; it should be a divisor of the sequence length. | int |
| parallel_config.gradient_aggregation_group | Sets the size of the gradient communication operator fusion group. | int |
| parallel_config.context_parallel_algo | Sets the long-sequence parallel scheme; options are `colossalai_cp`, `ulysses_cp`, and `hybrid_cp`. | str |
| parallel_config.ulysses_degree_in_cp | Sets the Ulysses sequence-parallel dimension; configured together with `context_parallel_algo` set to `hybrid_cp`. | int |
| micro_batch_interleave_num | Sets the multi-copy parallel number; multi-copy parallelism is enabled when it is greater than 1. Usually enabled when using model parallelism, mainly to hide the communication overhead introduced by model parallelism; not recommended when only pipeline parallelism is used. For details, see MicroBatchInterleaved. | int |
| parallel.parallel_mode | Sets the parallel mode: `0` for data parallel, `1` for semi-automatic parallel, `2` for automatic parallel, and `3` for hybrid parallel. | int |
| parallel.gradients_mean | Whether to apply the mean operator after gradient AllReduce. Typically set to `False` in semi-automatic parallel mode and `True` in data parallel mode. | bool |
| parallel.enable_alltoall | Enables generation of the AllToAll communication operator during communication. Typically enabled only in MoE scenarios; defaults to `False`. | bool |
| parallel.full_batch | Whether to load the full batch of data from the dataset in parallel mode. `True` means all ranks load the full batch; when set to `False`, `parallel.dataset_strategy` must be configured. | bool |
| parallel.dataset_strategy | Only supports `List of List` type and takes effect only when `full_batch=False`. The number of sub-lists must equal the length of `train_dataset.input_columns`. | list |
| parallel.search_mode | Sets the fully-automatic parallel strategy search mode; options are `recursive_programming`, `dynamic_programming`, and `sharding_propagation`. | str |
| parallel.strategy_ckpt_save_file | Sets the save path for the parallel slicing strategy file. | str |
| parallel.strategy_ckpt_config.only_trainable_params | Whether to save (or load) slicing strategy information only for trainable parameters. The default is True; set it to `False` when strategy information for non-trainable parameters is also needed. | bool |
| parallel.enable_parallel_optimizer | Enables optimizer parallelism. | bool |
| parallel.parallel_optimizer_config.gradient_accumulation_shard | Sets whether the accumulated gradient variable is sharded along the data-parallel dimension; effective only when `enable_parallel_optimizer=True`. | bool |
| parallel.parallel_optimizer_config.parallel_optimizer_threshold | Sets the threshold below which optimizer weight parameters are not sliced; effective only when `enable_parallel_optimizer=True`. | int |
| parallel.parallel_optimizer_config.optimizer_weight_shard_size | Sets the communication-domain size for optimizer weight sharding; the data-parallel dimension must be divisible by this value. | int |
| parallel.pipeline_config.pipeline_interleave | Enables interleaved pipeline scheduling; set it to `True` when using interleaved pipeline parallelism. | bool |
| parallel.pipeline_config.pipeline_scheduler | Sets the pipeline scheduling strategy; currently only `'1f1b'` is supported. | str |
The parallel strategy must satisfy `device_num = data_parallel × model_parallel × context_parallel × pipeline_stage`.
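For example, the following sketch configures an 8-device cluster, where 2 × 2 × 1 × 2 = 8 satisfies the constraint above; the parallel degrees are illustrative.

```yaml
use_parallel: True
parallel_config:
  data_parallel: 2
  model_parallel: 2
  context_parallel: 1
  pipeline_stage: 2
  micro_batch_num: 2  # >= pipeline_stage
parallel:
  parallel_mode: 1    # semi-automatic parallel
  enable_parallel_optimizer: True
```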
### Model Optimization Configuration

MindSpore Transformers provides recomputation-related configurations to reduce the memory footprint of the model during training. For details, see Recomputation.
| Parameters | Descriptions | Types |
|---|---|---|
| recompute_config.recompute | Whether to enable recomputation. | bool/list/tuple |
| recompute_config.select_recompute | Enables selective recomputation, recomputing only the operators in the attention layer. | bool/list |
| recompute_config.parallel_optimizer_comm_recompute | Whether to recompute the AllGather communication introduced by optimizer parallelism. | bool/list |
| recompute_config.mp_comm_recompute | Whether to recompute the communication introduced by model parallelism. | bool |
| recompute_config.recompute_slice_activation | Whether to slice the outputs of cells kept in memory. | bool |
| recompute_config.select_recompute_exclude | Disables recomputation for the specified operators; valid only for Primitive operators. | bool/list |
| recompute_config.select_comm_recompute_exclude | Disables communication recomputation for the specified operators; valid only for Primitive operators. | bool/list |
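A minimal recomputation sketch is shown below; the values are illustrative.

```yaml
recompute_config:
  recompute: True
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True
```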
MindSpore Transformers provides fine-grained activations SWAP-related configurations to reduce the memory footprint of the model during training. For details, see Fine-Grained Activations SWAP.
| Parameters | Descriptions | Types |
|---|---|---|
| swap_config.swap | Enables activations SWAP. | bool |
| swap_config.default_prefetch | Controls when memory is released in the forward phase and when prefetch starts in the backward phase under the default SWAP strategy; takes effect only when swap=True, layer_swap=None, and op_swap=None. | int |
| swap_config.layer_swap | Selects specific layers for which activations SWAP is enabled. | list |
| swap_config.op_swap | Selects specific operators within layers for which activations SWAP is enabled. | list |
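The following minimal sketch enables the default SWAP strategy; the prefetch value is illustrative.

```yaml
swap_config:
  swap: True
  default_prefetch: 1  # effective because layer_swap and op_swap are not set
```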
### Callbacks Configuration

MindSpore Transformers provides encapsulated Callbacks function classes, which mainly implement operations such as reporting the model's training state and output during training and saving model weight files. The following Callbacks function classes are currently supported.

#### MFLossMonitor

This callback class is mainly used to print information such as training progress, model loss, and learning rate during training. It has the following configurable items:
| Parameter Name | Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| learning_rate | float | Optional | None | Sets the initial learning rate for MFLossMonitor, used for logging and training progress calculation. If not set, it attempts to obtain the value from the optimizer or other configuration. |
| per_print_times | int | Optional | 1 | Sets the logging frequency of MFLossMonitor, in steps. The default value is `1`, which prints a log message once per training step. |
| micro_batch_num | int | Optional | 1 | Sets the number of micro-batches processed in each training step, used to calculate the actual loss value. If not set, it is the same as `parallel_config.micro_batch_num` in [Parallel Configuration](#parallel-configuration). |
| micro_batch_interleave_num | int | Optional | 1 | Sets the multi-replica micro-batch size for each training step, used for loss calculation. If not configured, it is the same as `micro_batch_interleave_num` in [Parallel Configuration](#parallel-configuration). |
| origin_epochs | int | Optional | None | Sets the total number of training epochs in MFLossMonitor. If not configured, it is the same as `runner_config.epochs` in [Model Training Configuration](#model-training-configuration). |
| dataset_size | int | Optional | None | Sets the total number of samples in the dataset in MFLossMonitor. If not configured, the actual size of the loaded dataset is used. |
| initial_epoch | int | Optional | 0 | Sets the starting epoch number for MFLossMonitor. The default value is `0`, indicating that counting starts from epoch 0. Can be used to restore training progress when resuming from a breakpoint. |
| initial_step | int | Optional | 0 | Sets the initial training step count in MFLossMonitor. The default value is `0`. Can be used to align logs and progress bars when resuming training. |
| global_batch_size | int | Optional | 0 | Sets the global batch size in MFLossMonitor (i.e., the total number of samples used in each training step). If not configured, it is calculated automatically from the dataset size and the parallel strategy. |
| gradient_accumulation_steps | int | Optional | 1 | Sets the number of gradient accumulation steps in MFLossMonitor. If not configured, it is consistent with `gradient_accumulation_steps` in [Model Training Configuration](#model-training-configuration). Used for loss normalization and training progress estimation. |
| check_for_nan_in_loss_and_grad | bool | Optional | False | Whether to enable NaN/Inf detection for loss values and gradients in MFLossMonitor. If enabled, training is terminated when overflow (NaN or INF) is detected. The default value is `False`. Enabling it during the debugging phase is recommended to improve training stability. |

#### SummaryMonitor
This callback class is mainly used to collect Summary data. For details, see mindspore.SummaryCollector.
#### CheckpointMonitor

This callback class is mainly used to save model weight files during training. It has the following configurable items:
| Parameter Name | Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| prefix | string | Optional | 'CKP' | Sets the prefix of the weight file name; for example, `CKP-100.ckpt` is generated. If not configured, the default value `'CKP'` is used. |
| directory | string | Optional | None | Sets the directory for saving weight files. If not configured, the default is the `checkpoint/` directory under `output_dir`. |
| save_checkpoint_seconds | int | Optional | 0 | Sets the interval for automatically saving weights, in seconds. Mutually exclusive with `save_checkpoint_steps` and takes precedence over it. For example, save every 3600 seconds. |
| save_checkpoint_steps | int | Optional | 1 | Sets the automatic saving interval for weights based on the number of training steps (unit: steps). Mutually exclusive with `save_checkpoint_seconds`; if both are set, the time-based saving takes precedence. For example, save every 1000 steps. |
| keep_checkpoint_max | int | Optional | 5 | The maximum number of weight files to retain. When the number of saved weights exceeds this value, the oldest files are deleted in order of creation time so that the total does not exceed the limit. Used to control disk space usage. |
| keep_checkpoint_per_n_minutes | int | Optional | 0 | Retains one weight file every N minutes. A time-windowed retention policy often used to balance storage and recovery flexibility in long-term training. For example, setting it to `60` retains at least one weight file per hour. |
| integrated_save | bool | Optional | True | Whether to enable aggregated weight saving: `True` aggregates the weights from all devices when saving the weight file, i.e., all devices save the same full weights; `False` lets each device save only its own weights. In semi-automatic parallel mode, setting this to `False` is recommended to avoid memory issues when saving weight files. |
| save_network_params | bool | Optional | False | Whether to save only the model weights. The default value is `False`. |
| save_trainable_params | bool | Optional | False | Whether to additionally save the trainable parameters (i.e., the model's parameter weights during partial fine-tuning). |
| async_save | bool | Optional | False | Whether to save weights asynchronously. Enabling it does not block the main training process, improving training efficiency; however, I/O resource contention may cause write delays. |
| remove_redundancy | bool | Optional | False | Whether to remove redundancy from model weights when saving. Defaults to `False`. |
| checkpoint_format | string | Optional | 'ckpt' | The format of saved model weights. Defaults to `ckpt`; options are `ckpt` and `safetensors`. |
| embedding_local_norm_threshold | float | Optional | 1.0 | The threshold used in health monitoring to detect abnormalities in the embedding-layer gradient or output norm. If the norm exceeds this value, an alarm or data-skipping mechanism may be triggered to prevent training divergence. Defaults to `1.0`; adjust according to model scale. |
Multiple Callbacks function classes can be configured at the same time under the `callbacks` field. The following is an example of a `callbacks` configuration.

```yaml
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMonitor
    prefix: "name_xxb"
    save_checkpoint_steps: 1000
    integrated_save: False
    async_save: False
```
### Processor Configuration

The Processor is mainly used to preprocess the inference input data of the model. Since the Processor configuration items are not fixed, only the generic Processor configuration items in MindSpore Transformers are described here.
| Parameter Name | Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| processor.type | string | Required | None | Sets the name of the data processing class (Processor) to be used. |
| processor.return_tensors | string | Optional | 'ms' | Sets the type of tensors returned after data processing; `'ms'` returns MindSpore tensors. |
| processor.image_processor.type | string | Required | None | Sets the image data processing class, responsible for image normalization, scaling, cropping, and other operations; it must be compatible with the model's visual encoder. |
| processor.tokenizer.type | string | Required | None | Sets the text tokenizer type. |
| processor.tokenizer.vocab_file | string | Required | None | Sets the path of the vocabulary file required by the tokenizer; it must match the tokenizer type. |
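A sketch of a Processor configuration follows; the class names and the vocabulary path are illustrative and must match the actual model.

```yaml
processor:
  type: LlamaProcessor              # illustrative processor class
  return_tensors: 'ms'
  tokenizer:
    type: LlamaTokenizer            # illustrative tokenizer class
    vocab_file: './tokenizer.model' # illustrative path
```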
### Model Evaluation Configuration

MindSpore Transformers provides a model evaluation function and also supports evaluating the model while training. The following are the configurations related to model evaluation.
| Parameter Name | Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| eval_dataset | dict | Required | None | The dataset configuration for evaluation, used in the same way as `train_dataset`. |
| eval_dataset_task | dict | Required | None | The evaluation task configuration, used in the same way as the dataset task configuration (preprocessing, batch size, etc.), defining the evaluation process. |
| metric.type | string | Required | None | Sets the evaluation type. |
| do_eval | bool | Optional | False | Whether to enable the evaluation-while-training feature. |
| eval_step_interval | int | Optional | 100 | Sets the evaluation step interval. The default value is 100; a value less than or equal to 0 disables step-interval evaluation. |
| eval_epoch_interval | int | Optional | -1 | Sets the evaluation epoch interval. The default value is -1; a value less than 0 disables epoch-interval evaluation. This configuration is not recommended in data sink mode. |
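The sketch below enables evaluation every 500 steps during training; the data loader and dataset task classes, paths, and the YAML anchor usage are illustrative.

```yaml
do_eval: True
eval_step_interval: 500
eval_epoch_interval: -1
eval_dataset: &eval_dataset                 # anchor reused below, mirroring train_dataset usage
  data_loader:
    type: MindDataset                       # illustrative data loading class
    dataset_dir: "/path/to/eval_data"       # illustrative path
    shuffle: False
eval_dataset_task:
  type: CausalLanguageModelDataset          # illustrative dataset task class
  dataset_config: *eval_dataset
```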
### Profile Configuration

MindSpore Transformers provides Profile as the main tool for model performance tuning. For details, see the Performance Tuning Guide. The following are the Profile-related configurations.
| Parameter Name | Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| profile | bool | Optional | False | Whether to enable the performance collection tool. The default value is `False`. |
| profile_start_step | int | Optional | 1 | Sets the step at which performance data collection starts. The default value is `1`. |
| profile_stop_step | int | Optional | 10 | Sets the step at which performance data collection stops. The default value is `10`. |
| profile_communication | bool | Optional | False | Sets whether to collect communication performance data during multi-device training. This parameter has no effect when training on a single card. The default value is `False`. |
| profile_memory | bool | Optional | True | Sets whether to collect Tensor memory data. Defaults to `True`. |
| profile_rank_ids | list | Optional | None | Sets the rank IDs for which performance collection is enabled. Defaults to `None`, meaning collection is enabled on all ranks. |
| profile_pipeline | bool | Optional | False | Sets whether to enable performance collection for one card in each pipeline-parallel stage. Defaults to `False`. |
| profile_output | string | Required | None | Sets the folder path for saving performance collection files. |
| profiler_level | int | Optional | 1 | Sets the data collection level. Possible values are `0`, `1`, and `2`. |
| with_stack | bool | Optional | False | Sets whether to collect call stack data on the Python side. Defaults to `False`. |
| data_simplification | bool | Optional | False | Sets whether to enable data simplification. If enabled, the FRAMEWORK directory and other redundant data are deleted after performance data is exported. The default value is `False`. |
| init_start_profile | bool | Optional | False | Sets whether to start performance data collection at Profiler initialization. This parameter has no effect when `profile_start_step` is set. |
| mstx | bool | Optional | False | Sets whether to collect mstx timestamp records, including training steps, HCCL communication operators, etc. The default value is `False`. |
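A sketch of a Profile configuration that collects data between steps 5 and 10 is shown below; the output path is illustrative.

```yaml
profile: True
profile_start_step: 5
profile_stop_step: 10
profile_communication: False
profile_memory: True
profile_output: './profile'  # illustrative path
profiler_level: 1
```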
### Metric Monitoring Configuration

The metric monitoring configuration is primarily used to configure how metrics are recorded during training. For details, see Training Metrics Monitoring. The common metric monitoring configuration options in MindSpore Transformers are described below:
| Parameters | Type | Optional | Default Value | Value Descriptions |
|---|---|---|---|---|
| monitor_config.monitor_on | bool | Optional | False | Sets whether to enable monitoring. The default is `False`, in which case all of the following parameters are ineffective. |
| monitor_config.dump_path | string | Optional | './dump' | Sets the save path for the metric files of `local_norm`, `device_local_norm`, and `local_loss`. |
| monitor_config.target | list(string) | Optional | ['.*'] | Sets the (partial) names of the target parameters monitored by the `local_norm` and optimizer-state metrics; regular expressions are supported. |
| monitor_config.invert | bool | Optional | False | Sets whether to invert the selection of the targets specified in `monitor_config.target`. |
| monitor_config.step_interval | int | Optional | 1 | Sets the frequency (in steps) of metric recording. The default value is `1`. |
| monitor_config.local_loss_format | string / list(string) | Optional | null | Sets the format to record the `local_loss` metric; options are `'tensorboard'` and `'log'`. `null` means the metric is not recorded. |
| monitor_config.device_local_loss_format | string / list(string) | Optional | null | Sets the format to record the `device_local_loss` metric; options are `'tensorboard'` and `'log'`. `null` means the metric is not recorded. |
| monitor_config.local_norm_format | string / list(string) | Optional | null | Sets the format to record the `local_norm` metric; options are `'tensorboard'` and `'log'`. `null` means the metric is not recorded. |
| monitor_config.device_local_norm_format | string / list(string) | Optional | null | Sets the format to record the `device_local_norm` metric; options are `'tensorboard'` and `'log'`. `null` means the metric is not recorded. |
| monitor_config.optimizer_state_format | string / list(string) | Optional | null | Sets the format to record the optimizer-state metric; options are `'tensorboard'` and `'log'`. `null` means the metric is not recorded. |
| monitor_config.weight_state_format | string / list(string) | Optional | null | Sets the format to record the weight-norm metric; options are `'tensorboard'` and `'log'`. `null` means the metric is not recorded. |
| monitor_config.throughput_baseline | int / float | Optional | null | Sets the baseline of the throughput-linearity metric; it must be positive. `null` means the metric is not monitored. |
| monitor_config.print_struct | bool | Optional | False | Sets whether to print the names of all trainable parameters of the model. If set to `True`, the parameter names are printed at the start of the first step, after which training exits. |
| monitor_config.check_for_global_norm | bool | Optional | False | Sets whether to enable the process-level fault recovery function. Defaults to `False`. |
| monitor_config.global_norm_spike_threshold | float | Optional | 3.0 | Sets the threshold of the global norm; data skipping is triggered when the global norm exceeds it. Defaults to `3.0`. |
| monitor_config.global_norm_spike_count_threshold | int | Optional | 10 | Sets the cumulative number of consecutive global-norm anomalies that triggers an exception interrupt and terminates training. Defaults to `10`. |
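The following sketch enables monitoring and records the local norm to both the log and TensorBoard; the target patterns are illustrative.

```yaml
monitor_config:
  monitor_on: True
  dump_path: './dump'
  target: ['layers.0', 'embedding']  # illustrative parameter-name patterns
  invert: False
  step_interval: 1
  local_norm_format: ['log', 'tensorboard']
  global_norm_spike_threshold: 3.0
```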
### TensorBoard Configuration

The TensorBoard configuration is primarily used to set TensorBoard-related parameters during training, enabling real-time monitoring and visualization of training metrics. For details, see Training Metrics Monitoring. The common TensorBoard configuration options in MindSpore Transformers are described below:
| Parameters | Type | Optional | Default Value | Value Description |
|---|---|---|---|---|
| tensorboard.tensorboard_dir | string | Required | None | Sets the path where TensorBoard event files are saved. |
| tensorboard.tensorboard_queue_size | int | Optional | 10 | Sets the maximum size of the capture queue; if exceeded, the data is written to the event file. The default value is 10. |
| tensorboard.log_loss_scale_to_tensorboard | bool | Optional | False | Sets whether loss scale information is logged to the event file. Default is `False`. |
| tensorboard.log_timers_to_tensorboard | bool | Optional | False | Sets whether to log timer information to the event file. The timer information contains the duration of the current training step (or iteration) as well as the throughput. Defaults to `False`. |
| tensorboard.log_expert_load_to_tensorboard | bool | Optional | False | Sets whether to log expert load to the event file. Defaults to `False`. |
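A minimal TensorBoard configuration sketch follows; the directory is illustrative.

```yaml
tensorboard:
  tensorboard_dir: './tensorboard'  # illustrative path
  tensorboard_queue_size: 10
  log_loss_scale_to_tensorboard: True
  log_timers_to_tensorboard: True
```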