# Configuration File Descriptions

[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0rc1/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0rc1/docs/mindformers/docs/source_en/feature/configuration.md)

## Overview

Different parameters usually need to be configured during the training and inference of a model. MindSpore Transformers supports using `YAML` files to centrally manage and adjust the configurable items, which makes the model configuration more structured and improves its maintainability.

## Description of the YAML File Contents

The `YAML` file provided by MindSpore Transformers contains configuration items for different functions, which are described below according to their contents.

### Basic Configuration

The basic configuration is mainly used to specify MindSpore random seeds and related settings for loading weights.

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| seed | Set the global seed. For details, refer to [mindspore.set_seed](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/mindspore/mindspore.set_seed.html). | int |
| run_mode | Set the running mode of the model: `train`, `finetune`, `eval` or `predict`. | str |
| output_dir | Set the path where files such as logs, checkpoints and slicing strategies are saved. | str |
| load_checkpoint | File or folder path for loading weights. Currently there are 3 application scenarios:<br>1. Support for passing in full weight file paths.<br>2. Support for passing in offline sliced weight folder paths.<br>3. Support for passing in folder paths containing LoRA weights and base weights.<br>Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/ckpt.html) for the ways of obtaining various weights. | str |
| auto_trans_ckpt | Enable automatic slicing and merging of distributed weights. Refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/ckpt.html). | bool |
| resume_training | Enable resumable training after breakpoint. For details, refer to [Resumable Training After Breakpoint](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/resume_training.html#resumable-training). | bool |
| load_ckpt_format | The format of the checkpoint to be loaded, either `ckpt` or `safetensors`. | str |
| remove_redundancy | Whether the checkpoint to be loaded has had its redundancy removed. The default value is `False`. | bool |
| train_precision_sync | Switch deterministic computation on or off for the training process. The default value is `None`. | Optional[bool] |
| infer_precision_sync | Switch deterministic computation on or off for the inference process. The default value is `None`. | Optional[bool] |
| use_skip_data_by_global_norm | Enable the data skipping function. The default value is `False`. | bool |
| use_checkpoint_health_monitor | Enable the checkpoint health monitoring function. The default value is `False`. | bool |
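
The following is a minimal sketch of how these items are typically combined in a task `YAML` file; the weight path is a placeholder and should point to an actual checkpoint:

```yaml
seed: 0
run_mode: 'finetune'
output_dir: './output'  # logs, checkpoints and strategy files are saved here
load_checkpoint: './ckpt/model.safetensors'  # placeholder path
load_ckpt_format: 'safetensors'
auto_trans_ckpt: True   # slice/merge weights automatically in distributed scenarios
resume_training: False
```
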
### Context Configuration

Context configuration is mainly used to specify parameters related to [mindspore.set_context](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/mindspore/mindspore.set_context.html).

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| context.mode | Set the backend execution mode. `0` means GRAPH_MODE; MindSpore Transformers currently only supports running in GRAPH_MODE. | int |
| context.device_target | Set the backend execution device. MindSpore Transformers is only supported on `Ascend` devices. | str |
| context.device_id | Set the execution device ID. The value must be within the range of available devices, and the default value is `0`. | int |
| context.enable_graph_kernel | Enable graph fusion to optimize network execution performance. The default value is `False`. | bool |
| context.max_call_depth | Set the maximum depth of a function call. The value must be a positive integer, and the default value is `1000`. | int |
| context.max_device_memory | Set the maximum memory available to the device in the format "xxGB". The default value is `1024GB`. | str |
| context.mempool_block_size | Set the size of the memory pool block for devices in the format "xxGB". The default value is `1GB`. | str |
| context.save_graphs | Save the compilation graph during execution.<br>1. `False` or `0` indicates that the intermediate compilation graph is not saved.<br>2. `1` means outputting some of the intermediate files generated during graph compilation.<br>3. `True` or `2` indicates the generation of more IR files related to the backend process.<br>4. `3` indicates the generation of visualized computation graphs and more detailed front-end IR graphs. | bool/int |
| context.save_graphs_path | Path for saving the compilation graph. | str |
| context.affinity_cpu_list | Optional configuration used to implement user-defined core binding policies. The default binding policy is enabled when this item is not configured; `None` means the binding function is disabled. The default value is `{}`. To enable a custom binding policy, pass in a `dict`. See [mindspore.runtime.set_cpu_affinity](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/runtime/mindspore.runtime.set_cpu_affinity.html) for details. | dict/str |
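
A minimal sketch of a typical `context` block follows; the values are illustrative and should be adjusted to the target device:

```yaml
context:
  mode: 0                     # GRAPH_MODE
  device_target: "Ascend"
  max_call_depth: 10000
  max_device_memory: "58GB"   # illustrative; match the memory of the target device
  save_graphs: False
```
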
### Model Configuration

Since the configuration varies from model to model, only the generic model configuration in MindSpore Transformers is described here.

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| model.arch.type | Set the model class; the model is instantiated according to this class when it is constructed. | str |
| model.model_config.type | Set the model configuration class. It needs to match the model class to be used, i.e. it should contain all the parameters used by the model class. | str |
| model.model_config.num_layers | Set the number of model layers, usually the number of Decoder Layers in the model. | int |
| model.model_config.seq_length | Set the model sequence length. This parameter indicates the maximum sequence length supported by the model. | int |
| model.model_config.hidden_size | Set the dimension of the model hidden state. | int |
| model.model_config.vocab_size | Set the model vocabulary size. | int |
| model.model_config.top_k | Sample from the `top_k` tokens with the highest probability during inference. | int |
| model.model_config.top_p | Sample from the highest-probability tokens whose cumulative probability does not exceed `top_p` during inference. | float |
| model.model_config.use_past | Turn on incremental inference for the model. When turned on, Paged Attention can be used to improve inference performance. It must be set to `False` during model training. | bool |
| model.model_config.max_decode_length | Set the maximum length of the generated text, including the input length. | int |
| model.model_config.max_length | Same as `max_decode_length`. When set together with `max_decode_length`, `max_length` takes effect. | int |
| model.model_config.max_new_tokens | Set the maximum length of the newly generated text, excluding the input length. When set together with `max_length`, `max_new_tokens` takes effect. | int |
| model.model_config.min_length | Set the minimum length of the generated text, including the input length. | int |
| model.model_config.min_new_tokens | Set the minimum length of the newly generated text, excluding the input length. When set together with `min_length`, `min_new_tokens` takes effect. | int |
| model.model_config.repetition_penalty | Set the penalty factor for generating duplicate text. `repetition_penalty` is not less than 1; when it equals 1, duplicate outputs are not penalized. | float |
| model.model_config.block_size | Set the size of the block in Paged Attention, effective only if `use_past=True`. | int |
| model.model_config.num_blocks | Set the total number of blocks in Paged Attention, effective only if `use_past=True`. `batch_size × seq_length <= block_size × num_blocks` should be satisfied. | int |
| model.model_config.return_dict_in_generate | Return the inference result of the `generate` interface as a dictionary. The default value is `False`. | bool |
| model.model_config.output_scores | Include the scores before softmax at each forward generation step when returning the result as a dictionary. The default value is `False`. | bool |
| model.model_config.output_logits | Include the logits output by the model at each forward generation step when returning the result as a dictionary. The default value is `False`. | bool |
| model.model_config.layers_per_stage | Set the number of transformer layers assigned to each stage when pipeline stage is enabled. The default value is `None`, which means the transformer layers are evenly distributed across stages. The value is a list of integers with a length equal to the number of pipeline stages, where the i-th element indicates the number of transformer layers assigned to the i-th stage. | list |
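
A simplified sketch of a `model` block follows; the class names and sizes are illustrative and must match the model actually being used:

```yaml
model:
  arch:
    type: LlamaForCausalLM    # illustrative model class
  model_config:
    type: LlamaConfig         # illustrative configuration class matching the model class
    num_layers: 32
    seq_length: 4096
    hidden_size: 4096
    vocab_size: 32000
    use_past: False           # must be False during training
```
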
### MoE Configuration

In addition to the basic model configuration above, a MoE model needs some hyperparameters of the MoE module to be configured separately. Since the parameters used vary from model to model, only the generic configuration is explained:

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| moe_config.expert_num | Set the number of routing experts. | int |
| moe_config.shared_expert_num | Set the number of shared experts. | int |
| moe_config.moe_intermediate_size | Set the size of the intermediate dimension of the expert layer. | int |
| moe_config.capacity_factor | Set the expert capacity factor. | float |
| moe_config.num_experts_chosen | Set the number of experts selected per token. | int |
| moe_config.enable_sdrop | Set whether to enable the token drop policy `sdrop`. Since the MoE of MindSpore Transformers is a static-shape implementation, it cannot retain all tokens. | bool |
| moe_config.aux_loss_factor | Set the weights of the load-balancing loss. | list[float] |
| moe_config.first_k_dense_replace | Set the block from which the MoE layers are enabled; generally set to 1, indicating that MoE is not enabled in the first block. | int |
| moe_config.balance_via_topk_bias | Set whether to enable the `aux_loss_free` load balancing algorithm. | bool |
| moe_config.topk_bias_update_rate | Set the `bias` update step size of the `aux_loss_free` load balancing algorithm. | float |
| moe_config.comp_comm_parallel | Set whether to enable computation-communication parallelism for FFN. Default value: `False`. | bool |
| moe_config.comp_comm_parallel_degree | Set the number of splits for FFN computation and communication. The higher the number, the more overlap there is, but more memory is consumed. This parameter is only valid when `comp_comm_parallel` is enabled. | int |
| moe_config.moe_shared_expert_overlap | Set whether to enable computation-communication parallelism between shared experts and routing experts. Default value: `False`. | bool |
| moe_config.use_gating_sigmoid | Enable sigmoid activation for gating results in MoE. Default: `False`. | bool |
| moe_config.use_gmm | Enable GroupedMatmul for MoE expert computation. Default: `False`. | bool |
| moe_config.use_fused_ops_permute | Enable fused permute/unpermute operators in MoE for performance acceleration. Only takes effect when `use_gmm=True`. | bool |
| moe_config.enable_deredundency | Enable de-redundancy communication in MoE. Requires that the expert parallelism is an integer multiple of the number of NPUs per node. Default: `False`. Only takes effect when `use_gmm=True`. | bool |
| moe_config.npu_nums_per_device | Set the number of NPUs per node. Default: 8. Only takes effect when `enable_deredundency=True`. | int |
| moe_config.enable_gmm_safe_tokens | Ensure each expert is allocated at least 1 token to prevent GroupedMatmul computation failures under extreme load imbalance. Default: `False`. Recommended to enable when `use_gmm=True`. | bool |
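
A hedged sketch of a `moe_config` block with illustrative values:

```yaml
moe_config:
  expert_num: 8
  shared_expert_num: 1
  num_experts_chosen: 2
  first_k_dense_replace: 1    # no MoE in the first block
  use_gmm: True
  use_fused_ops_permute: True # effective only with use_gmm=True
```
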
### Model Training Configuration

When starting model training, in addition to model-related parameters, you also need to set the parameters of the trainer, runner_config, learning rate, optimizer, and other modules required for training. MindSpore Transformers provides the following configuration items.

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| trainer.type | Set the trainer class. Different models for different application scenarios usually set different trainer classes. | str |
| trainer.model_name | Set the model name in the format '{name}_xxb', indicating a certain specification of the model. | str |
| runner_config.epochs | Set the number of epochs for model training. | int |
| runner_config.batch_size | Set the sample size of the batch data, which overrides the `batch_size` in the dataset configuration. | int |
| runner_config.sink_mode | Enable data sink mode. | bool |
| runner_config.sink_size | Set the number of iterations to be sent down from Host to Device per iteration, effective only when `sink_mode=True`. This argument will be deprecated in a future release. | int |
| runner_config.gradient_accumulation_steps | Set the number of gradient accumulation steps. The default value is 1, which means gradient accumulation is not enabled. | int |
| runner_wrapper.type | Set the wrapper class, generally 'MFTrainOneStepCell'. | str |
| runner_wrapper.local_norm | Whether to print the gradient norm of each parameter on the card. | bool |
| runner_wrapper.scale_sense.type | Set the gradient scaling class, generally 'DynamicLossScaleUpdateCell'. | str |
| runner_wrapper.scale_sense.use_clip_grad | Turn on gradient clipping to avoid cases where the inverse gradient is too large and training fails to converge. | bool |
| runner_wrapper.scale_sense.loss_scale_value | Set the loss dynamic scale factor. The model loss can change dynamically according to the configuration of this parameter. | int |
| lr_schedule.type | Set the lr_schedule class, which is mainly used to adjust the learning rate during model training. | str |
| lr_schedule.learning_rate | Set the initial learning rate. | float |
| lr_scale | Whether to enable learning rate scaling. | bool |
| lr_scale_factor | Set the learning rate scaling factor. | int |
| layer_scale | Whether to turn on layer decay. | bool |
| layer_decay | Set the layer decay coefficient. | float |
| optimizer.type | Set the optimizer class. The optimizer is mainly used to compute the gradients for model training. | str |
| optimizer.weight_decay | Set the optimizer weight decay factor. | float |
| train_dataset.batch_size | Same as `runner_config.batch_size`. | int |
| train_dataset.input_columns | Set the input data columns for the training dataset. | list |
| train_dataset.output_columns | Set the output data columns for the training dataset. | list |
| train_dataset.construct_args_key | Set the `keys` of the dataset portion of the model `construct` input, in lexicographical order; used when the parameter passing order of the model does not match the order of the dataset input. | list |
| train_dataset.column_order | Set the order of the output data columns of the training dataset. | list |
| train_dataset.num_parallel_workers | Set the number of processes that read the training dataset. | int |
| train_dataset.python_multiprocessing | Enable Python multi-process mode to improve data processing performance. | bool |
| train_dataset.drop_remainder | Whether to discard the last batch of data if it contains fewer samples than batch_size. | bool |
| train_dataset.repeat | Set the number of dataset repetitions. | int |
| train_dataset.numa_enable | Set the default state of NUMA to the enabled state for data reading. | bool |
| train_dataset.prefetch_size | Set the amount of pre-read data. | int |
| train_dataset.data_loader.type | Set the data loading class. | str |
| train_dataset.data_loader.dataset_dir | Set the path for loading data. | str |
| train_dataset.data_loader.shuffle | Whether to shuffle the data when reading the dataset. | bool |
| train_dataset.transforms | Set options related to data augmentation. | - |
| train_dataset_task.type | Set the dataset class, which encapsulates the data loading class and other related configurations. | str |
| train_dataset_task.dataset_config | Typically set as a reference to `train_dataset`, containing all configuration entries of `train_dataset`. | - |
| auto_tune | Enable auto-tuning of data processing parameters, see [set_enable_autotune](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/dataset/mindspore.dataset.config.set_enable_autotune.html) for details. | bool |
| filepath_prefix | Set the save path for parameter configurations after data optimization. | str |
| autotune_per_step | Set the configuration tuning step interval for automatic data acceleration. For details see [set_autotune_interval](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/dataset/mindspore.dataset.config.set_autotune_interval.html). | int |
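
A condensed sketch of the training-related blocks; the optimizer and learning-rate schedule classes shown are illustrative:

```yaml
runner_config:
  epochs: 2
  batch_size: 1
  sink_mode: True
  gradient_accumulation_steps: 1
runner_wrapper:
  type: MFTrainOneStepCell
optimizer:
  type: AdamW                 # illustrative optimizer class
  weight_decay: 0.01
lr_schedule:
  type: CosineWithWarmUpLR    # illustrative lr_schedule class
  learning_rate: 1.0e-5
```
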
### Parallel Configuration

To improve model performance in large-scale cluster scenarios, it is usually necessary to configure a parallelism strategy for the model. For details, please refer to [Distributed Parallelism](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/parallel_training.html). The parallel configuration in MindSpore Transformers is as follows.

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| use_parallel | Enable parallel mode. | bool |
| parallel_config.data_parallel | Set the number of data parallel. | int |
| parallel_config.model_parallel | Set the number of model parallel. | int |
| parallel_config.context_parallel | Set the number of sequence parallel. | int |
| parallel_config.pipeline_stage | Set the number of pipeline parallel. | int |
| parallel_config.micro_batch_num | Set the pipeline parallel microbatch size, which should satisfy `parallel_config.micro_batch_num` >= `parallel_config.pipeline_stage` when `parallel_config.pipeline_stage` is greater than 1. | int |
| parallel_config.seq_split_num | Set the sequence split number in sequence pipeline parallel, which should be a divisor of the sequence length. | int |
| parallel_config.gradient_aggregation_group | Set the size of the gradient communication operator fusion group. | int |
| parallel_config.context_parallel_algo | Set the long sequence parallel scheme: `colossalai_cp`, `ulysses_cp` or `hybrid_cp`, effective only if `context_parallel` is greater than 1. | str |
| parallel_config.ulysses_degree_in_cp | Set the Ulysses sequence parallel dimension, configured together with the `hybrid_cp` long sequence parallel scheme. It is required that `context_parallel` is divisible by this parameter and greater than 1, and that the number of attention heads is divisible by `ulysses_degree_in_cp`. | int |
| micro_batch_interleave_num | Set the number of multi-copy parallel; multi-copy parallelism is enabled if it is greater than 1. It is usually enabled when using model parallel, mainly to optimize the communication overhead generated by model parallel, and is not recommended when only using pipeline parallel. For details, please refer to [MicroBatchInterleaved](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/parallel/mindspore.parallel.nn.MicroBatchInterleaved.html). | int |
| parallel.parallel_mode | Set the parallel mode: `0` means data parallel mode, `1` means semi-automatic parallel mode, `2` means automatic parallel mode, and `3` means hybrid parallel mode. It is usually set to semi-automatic parallel mode. | int |
| parallel.gradients_mean | Whether to execute the averaging operator after the gradient AllReduce. Typically set to `False` in semi-automatic parallel mode and `True` in data parallel mode. | bool |
| parallel.enable_alltoall | Enable generation of the AllToAll communication operator during communication. Typically set to `True` only in MoE scenarios; the default value is `False`. | bool |
| parallel.full_batch | Whether to load the full batch of data from the dataset in parallel mode. `True` means all ranks load the full batch of data; `False` means each rank only loads the corresponding batch of data. When set to `False`, the corresponding `dataset_strategy` must be configured. | bool |
| parallel.dataset_strategy | Only supports `List of List` type and is effective only when `full_batch=False`. The number of sublists in the list must be equal to the length of `train_dataset.input_columns`. Each sublist in the list must have the same shape as the data returned by the dataset. Generally, data parallel splitting is done along the first dimension, so the first dimension of the sublist should be configured to match `data_parallel`, while the other dimensions should be set to `1`. For a detailed explanation, refer to [Dataset Splitting](https://www.mindspore.cn/tutorials/en/r2.7.0rc1/parallel/dataset_slice.html). | list |
| parallel.search_mode | Set the fully-automatic parallel strategy search mode: `recursive_programming`, `dynamic_programming` or `sharding_propagation`. It only works in fully-automatic parallel mode and is an experimental interface. | str |
| parallel.strategy_ckpt_save_file | Set the save path for the parallel slicing strategy file. | str |
| parallel.strategy_ckpt_config.only_trainable_params | Whether to save (or load) the slicing strategy information for trainable parameters only. The default is `True`; set this parameter to `False` when there are frozen parameters in the network that still need to be sliced. | bool |
| parallel.enable_parallel_optimizer | Turn on optimizer parallel.<br>1. Slice model weight parameters by the number of devices in data parallel mode.<br>2. Slice model weight parameters by `parallel_config.data_parallel` in semi-automatic parallel mode. | bool |
| parallel.parallel_optimizer_config.gradient_accumulation_shard | Set whether the accumulated gradient variable is sliced along the data-parallel dimension, effective only if `enable_parallel_optimizer=True`. | bool |
| parallel.parallel_optimizer_config.parallel_optimizer_threshold | Set the threshold for slicing optimizer weight parameters, effective only if `enable_parallel_optimizer=True`. | int |
| parallel.parallel_optimizer_config.optimizer_weight_shard_size | Set the size of the communication domain for optimizer weight slicing. The value must divide `parallel_config.data_parallel` exactly, effective only if `enable_parallel_optimizer=True`. | int |
| parallel.pipeline_config.pipeline_interleave | Enable interleaved pipeline parallel. This variable should be set to `True` when using Seq-Pipe. | bool |
| parallel.pipeline_config.pipeline_scheduler | Set the scheduling strategy of Seq-Pipe; only `"seqpipe"` is supported now. | str |

> Configure the parallel strategy to satisfy device_num = data_parallel × model_parallel × context_parallel × pipeline_stage.
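
As a sketch, a configuration for an 8-device setup that satisfies the constraint above (2 × 2 × 1 × 2 = 8) might look like this:

```yaml
use_parallel: True
parallel:
  parallel_mode: 1              # semi-automatic parallel
  enable_parallel_optimizer: True
parallel_config:
  data_parallel: 2
  model_parallel: 2
  context_parallel: 1
  pipeline_stage: 2
  micro_batch_num: 2            # >= pipeline_stage when pipeline_stage > 1
```
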
### Model Optimization Configuration

1. MindSpore Transformers provides recomputation-related configurations to reduce the memory footprint of the model during training. See [Recomputation](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/advanced_development/performance_optimization.html#recomputation) for details.

    | Parameters | Descriptions | Types |
    |------------|--------------|-------|
    | recompute_config.recompute | Whether to enable recomputation. | bool/list/tuple |
    | recompute_config.select_recompute | Turn on selective recomputation, recomputing only the operators in the attention layer. | bool/list |
    | recompute_config.parallel_optimizer_comm_recompute | Whether to recompute the AllGather communication introduced by optimizer parallel. | bool/list |
    | recompute_config.mp_comm_recompute | Whether to recompute the communication introduced by model parallel. | bool |
    | recompute_config.recompute_slice_activation | Whether to slice the outputs of Cells kept in memory. | bool |
    | recompute_config.select_recompute_exclude | Disable recomputation for the specified operators, valid only for Primitive operators. | bool/list |
    | recompute_config.select_comm_recompute_exclude | Disable communication recomputation for the specified operators, valid only for Primitive operators. | bool/list |

2. MindSpore Transformers provides fine-grained activations SWAP-related configurations to reduce the memory footprint of the model during training. See [Fine-Grained Activations SWAP](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/memory_optimization.html#fine-grained-activations-swap) for details.

    | Parameters | Descriptions | Types |
    |------------|--------------|-------|
    | swap_config.swap | Enable activations SWAP. | bool |
    | swap_config.default_prefetch | Control the timing of releasing memory in the forward phase and starting prefetch in the backward phase of the default SWAP strategy, only taking effect when `swap=True`, `layer_swap=None`, and `op_swap=None`. | int |
    | swap_config.layer_swap | Select specific layers to enable activations SWAP. | list |
    | swap_config.op_swap | Select specific operators within layers to enable activations SWAP. | list |
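
A minimal sketch of the two memory-optimization blocks with illustrative values (full recomputation enabled, SWAP left off):

```yaml
recompute_config:
  recompute: True
  select_recompute: False
swap_config:
  swap: False
```
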
### Callbacks Configuration

MindSpore Transformers provides encapsulated Callbacks function classes, which mainly implement operations such as returning and printing the model training state during training and saving model weight files. The following Callbacks function classes are currently supported.

1. MFLossMonitor

    This callback function class is mainly used to print information such as training progress, model loss, and learning rate during the training process, and has the following configurable items:

    | Parameters | Descriptions | Types |
    |------------|--------------|-------|
    | learning_rate | Set the initial learning rate in `MFLossMonitor`. The default value is `None`. | float |
    | per_print_times | Set the frequency of printing log information in `MFLossMonitor`. The default value is `1`, that is, log information is printed every step. | int |
    | micro_batch_num | Set the size of the micro batch data in each training step, which is used to calculate the actual loss value. If this parameter is not set, its value is the same as `parallel_config.micro_batch_num` in [Parallel Configuration](#parallel-configuration). | int |
    | micro_batch_interleave_num | Set the size of the interleave micro batch data in each training step, which is used to calculate the actual loss value. If this parameter is not set, its value is the same as `micro_batch_interleave_num` in [Parallel Configuration](#parallel-configuration). | int |
    | origin_epochs | Set the initial number of training epochs in `MFLossMonitor`. If this parameter is not set, its value is the same as `runner_config.epochs` in [Model Training Configuration](#model-training-configuration). | int |
    | dataset_size | Set the initial size of the dataset in `MFLossMonitor`. If this parameter is not set, the size of the initialized dataset is the same as the size of the actual dataset used for training. | int |
    | initial_epoch | Set the start epoch number of training in `MFLossMonitor`. The default value is `0`. | int |
    | initial_step | Set the start step number of training in `MFLossMonitor`. The default value is `0`. | int |
    | global_batch_size | Set the number of global batch data samples in `MFLossMonitor`. If this parameter is not set, the system automatically calculates it based on the dataset size and parallel strategy. | int |
    | gradient_accumulation_steps | Set the number of gradient accumulation steps in `MFLossMonitor`. If this parameter is not set, its value is the same as `gradient_accumulation_steps` in [Model Training Configuration](#model-training-configuration). | int |
    | check_for_nan_in_loss_and_grad | Whether to enable overflow detection in `MFLossMonitor`. After overflow detection is enabled, training exits if overflow occurs during model training. The default value is `False`. | bool |

2. SummaryMonitor

    This callback function class is mainly used to collect Summary data, see [mindspore.SummaryCollector](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/mindspore/mindspore.SummaryCollector.html) for details.

3. CheckpointMonitor

    This callback function class is mainly used to save the model weight files during the model training process and has the following configurable items:

    | Parameters | Descriptions | Types |
    |------------|--------------|-------|
    | prefix | Set the prefix of the saved file names. | str |
    | directory | Set the directory where the files are saved. | str |
    | save_checkpoint_seconds | Set the interval in seconds for saving model weights. | int |
    | save_checkpoint_steps | Set the interval in steps for saving model weights. | int |
    | keep_checkpoint_max | Set the maximum number of saved model weight files. If there are more weight files in the save path, they are deleted starting from the earliest created file to ensure that the total number does not exceed `keep_checkpoint_max`. | int |
    | keep_checkpoint_per_n_minutes | Set the interval in minutes for saving model weights. | int |
    | integrated_save | Turn on aggregated saving of the weight files.<br>1. When set to `True`, the weights of all devices are aggregated when saving, i.e., every device saves the same complete weights.<br>2. When set to `False`, each device saves only its own weights.<br>When using semi-automatic parallel mode, it is usually necessary to set this to `False` to avoid memory problems when saving the weight files. | bool |
    | save_network_params | Set whether to additionally save only the model weights. The default value is `False`. | bool |
    | save_trainable_params | Set whether to additionally save the trainable parameter weights, i.e. the parameter weights of the model during partial fine-tuning. The default value is `False`. | bool |
    | async_save | Set whether to save the model weight files asynchronously. | bool |
    | remove_redundancy | Whether to remove redundancy when saving the checkpoint. The default value is `False`. | bool |
    | checkpoint_format | The format for saving the checkpoint, either `ckpt` or `safetensors`. The default value is `ckpt`. | str |
    | embedding_local_norm_threshold | Set the threshold for the embedding norm in health monitoring. The default value is `1.0`. | float |

Multiple Callbacks function classes can be configured at the same time under the `callbacks` field. The following is an example of `callbacks` configuration.

```yaml
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMonitor
    prefix: "name_xxb"
    save_checkpoint_steps: 1000
    integrated_save: False
    async_save: False
```

### Processor Configuration

The Processor is mainly used to preprocess inference data input to the model. Since the Processor configuration items are not fixed, only the generic Processor configuration items in MindSpore Transformers are explained here.

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| processor.type | Set the data processing class. | str |
| processor.return_tensors | Set the type of tensor returned by the data processing class, typically 'ms'. | str |
| processor.image_processor.type | Set the image data processing class. | str |
| processor.tokenizer.type | Set the text tokenizer class. | str |
| processor.tokenizer.vocab_file | Set the path of the file to be read by the text tokenizer, which needs to correspond to the tokenizer class. | str |

### Model Evaluation Configuration

MindSpore Transformers provides a model evaluation function and also supports evaluation while training. The following is the configuration related to model evaluation.

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| eval_dataset | Used in the same way as `train_dataset`. | - |
| eval_dataset_task | Used in the same way as `train_dataset_task`. | - |
| eval_callbacks | Used in the same way as `callbacks`. | - |
| do_eval | Enable evaluation while training. | bool |
| eval_step_interval | Set the evaluation step interval. The default value is 100; a value less than 0 disables evaluation by step interval. | int |
| eval_epoch_interval | Set the epoch interval for evaluation. The default value is -1; a value less than 0 disables evaluation by epoch interval. It is not recommended to use this configuration in data sinking mode. | int |
| metric.type | Set the type of evaluation. | str |
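
A brief sketch of enabling evaluation while training, assuming `eval_dataset` and `eval_dataset_task` have been configured like their training counterparts; the metric class shown is illustrative:

```yaml
do_eval: True
eval_step_interval: 100
eval_epoch_interval: -1       # disable epoch-based evaluation
metric:
  type: PerplexityMetric      # illustrative metric type
```
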
### Profile Configuration

MindSpore Transformers provides Profile as the main tool for model performance tuning. Please refer to the [Performance Tuning Guide](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/advanced_development/performance_optimization.html) for more details. The following is the Profile related configuration.

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| profile | Whether to enable the performance collection tool, see [mindspore.Profiler](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/mindspore/mindspore.Profiler.html) for details. Default: `False`. | bool |
| profile_start_step | Set the step at which performance data collection starts. Default: `1`. | int |
| profile_stop_step | Set the step at which performance data collection stops. Default: `10`. | int |
| profile_communication | Set whether communication performance data is collected in multi-device training; this parameter is invalid when using single-card training. Default: `False`. | bool |
| profile_memory | Set whether to collect Tensor memory data. Default: `True`. | bool |
| profile_rank_ids | Specify the rank ids for which performance data collection is enabled. Default: `None`, which means all rank ids are enabled. | list |
| profile_pipeline | Set whether to collect performance data on one card of each parallel stage. Default: `False`. | bool |
| profile_output | Set the directory for saving performance data. | str |
| profiler_level | Set the collection level, one of (0, 1, 2). Default: `1`. | int |
| with_stack | Set whether to collect Python-side stack trace data. Default: `False`. | bool |
| data_simplification | Set whether to enable data simplification, which deletes the FRAMEWORK directory and other extraneous data after exporting performance data. Default: `False`. | bool |
| init_start_profile | Set whether to start collecting performance data when the Profiler is initialized; this parameter does not take effect when `profile_start_step` is set. It needs to be set to `True` when `profile_memory` is turned on. | bool |
| mstx | Set whether to enable mstx timestamp recording, covering training steps, HCCL operators, etc. Default: `False`. | bool |
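
A hedged sketch that collects performance data from step 10 to step 20 on all ranks:

```yaml
profile: True
profile_start_step: 10
profile_stop_step: 20
profile_output: './profile'
profiler_level: 1
with_stack: False
```
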
### Metric Monitoring Configuration

The metric monitoring configuration is primarily used to configure the methods for recording metrics during training. Please refer to [Training Metrics Monitoring](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/monitor.html) for more details. Below is a description of the common metric monitoring configuration options in MindSpore Transformers:

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| monitor_config.monitor_on | Set whether to enable monitoring. The default is `False`, in which case all the parameters below are disabled. | bool |
| monitor_config.dump_path | Set the save path for the `local_norm`, `device_local_norm` and `local_loss` metric files during training. Defaults to './dump' when not set or set to `null`. | str |
| monitor_config.target | Set the (partial) names of the target parameters monitored by the `optimizer state` and `local_norm` metrics; regular expressions are supported. Defaults to ['.*'] when not set or set to `null`, that is, all parameters are selected. | list[str] |
| monitor_config.invert | Set whether to invert the targets specified in `monitor_config.target`. Defaults to `False`. | bool |
| monitor_config.step_interval | Set the frequency of metric recording. The default value is `1`, that is, metrics are recorded every step. | int |
| monitor_config.local_loss_format | Set the format for recording the metric `local_loss`: the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, that is, this metric is not monitored. | str/list[str] |
| monitor_config.device_local_loss_format | Set the format for recording the metric `device_local_loss`: the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, that is, this metric is not monitored. | str/list[str] |
| monitor_config.local_norm_format | Set the format for recording the metric `local_norm`: the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, that is, this metric is not monitored. | str/list[str] |
| monitor_config.device_local_norm_format | Set the format for recording the metric `device_local_norm`: the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, that is, this metric is not monitored. | str/list[str] |
| monitor_config.optimizer_state_format | Set the format for recording the metric `optimizer state`: the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, that is, this metric is not monitored. | str/list[str] |
| monitor_config.weight_state_format | Set the format for recording the metric `weight L2-norm`: the string 'tensorboard' or 'log' (write to TensorBoard or to the log, respectively), a list composed of them, or `null`. Defaults to `null`, that is, this metric is not monitored. | str/list[str] |
| monitor_config.throughput_baseline | Set the baseline of the metric `throughput linearity`, which must be a positive number. Defaults to `null`, that is, this metric is not monitored. | int/float |
| monitor_config.print_struct | Set whether to print all trainable parameter names of the model. If set to `True`, all trainable parameter names are printed at the beginning of the first step, and the training process exits after the step ends. Defaults to `False`. | bool |
| monitor_config.check_for_global_norm | Set whether to enable the process-level fault recovery function. Defaults to `False`. | bool |
| monitor_config.global_norm_spike_threshold | Set the threshold for the global norm; data skipping is triggered when the global norm exceeds it. Defaults to `3.0`. | float |
| monitor_config.global_norm_spike_count_threshold | Set the cumulative number of consecutive global norm anomalies; when this threshold is reached, an exception interrupt is triggered to terminate training. Defaults to `10`. | int |
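
A hedged sketch of a `monitor_config` block that writes local loss to both TensorBoard and the log, and local norm to TensorBoard only:

```yaml
monitor_config:
  monitor_on: True
  dump_path: './dump'
  target: ['.*']                              # monitor all parameters
  step_interval: 1
  local_loss_format: ['tensorboard', 'log']
  local_norm_format: 'tensorboard'
```
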
### TensorBoard Configuration

The TensorBoard configuration is primarily used to configure TensorBoard-related parameters during training, allowing real-time monitoring and visualization of training metrics. Please refer to [Training Metrics Monitoring](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/monitor.html) for more details. Below is a description of the common TensorBoard configuration options in MindSpore Transformers:

| Parameters | Descriptions | Types |
|------------|--------------|-------|
| tensorboard.tensorboard_dir | Set the path where TensorBoard event files are saved. | str |
| tensorboard.tensorboard_queue_size | Set the maximum cache size of the capture queue; if it is exceeded, data is written to the event file. The default value is 10. | int |
| tensorboard.log_loss_scale_to_tensorboard | Set whether loss scale information is logged to the event file. The default is `False`. | bool |
| tensorboard.log_timers_to_tensorboard | Set whether to log timer information to the event file. The timer information contains the duration of the current training step (or iteration) as well as the throughput. The default is `False`. | bool |
| tensorboard.log_expert_load_to_tensorboard | Set whether to log expert load to the event file. The default is `False`. | bool |
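
A minimal sketch of a `tensorboard` block with illustrative values:

```yaml
tensorboard:
  tensorboard_dir: './tensorboard'
  tensorboard_queue_size: 10
  log_loss_scale_to_tensorboard: True
  log_timers_to_tensorboard: True
```
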