# Supervised Fine-Tuning (SFT)

[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.2/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/r2.7.2/docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md)

## Overview

SFT (Supervised Fine-Tuning) adopts the concept of supervised learning and refers to the process of adjusting some or all parameters of a pre-trained model to better adapt it to a specific task or dataset.

MindSpore Transformers supports two SFT fine-tuning methods: full-parameter fine-tuning and LoRA fine-tuning. Full-parameter fine-tuning updates all parameters during training; it is suitable for refining the model with large-scale data and offers the best task adaptability, but requires significant computational resources. LoRA fine-tuning updates only a small subset of parameters, consuming less memory and training faster than full-parameter fine-tuning, though its performance may be inferior on certain tasks.

## Basic Process of SFT Fine-Tuning

In practical terms, SFT fine-tuning can be broken down into the following steps:

### 1. Weight Preparation

Before fine-tuning, the weight files of the pre-trained model need to be prepared. MindSpore Transformers supports loading [safetensors weights](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/safetensors.html), enabling direct loading of model weights downloaded from the Hugging Face model hub.

### 2. Dataset Preparation

MindSpore Transformers currently supports datasets in [Hugging Face format](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/dataset.html#hugging-face-dataset) and [MindRecord format](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/dataset.html#mindrecord-dataset) for the fine-tuning phase. Users can prepare data according to task requirements.

### 3. Configuration File Preparation

Fine-tuning tasks are uniformly controlled through [configuration files](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/configuration.html), allowing users to flexibly adjust [model training hyperparameters](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/training_hyperparameters.html). Additionally, fine-tuning performance can be optimized using [distributed parallel training](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/parallel_training.html), [memory optimization features](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/memory_optimization.html), and [other training features](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/other_training_features.html).

### 4. Launching the Training Task

MindSpore Transformers provides a [one-click startup script](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/start_tasks.html) to initiate fine-tuning tasks. During training, [logs](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/logging.html) and [visualization tools](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/monitor.html) can be used to monitor the training process.

### 5. Model Saving

Checkpoints are saved during training, or model weights are saved to a specified path upon completion. Currently, weights can be saved in [Safetensors format](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/safetensors.html) or [Ckpt format](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/ckpt.html) and can be used for resumed training or further fine-tuning.
### 6. Fault Recovery

To handle exceptions such as training interruptions, MindSpore Transformers offers [training high availability](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/high_availability.html) features such as last-state saving and automatic recovery, as well as [checkpoint-based resumed training](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/resume_training.html), enhancing training stability.

## Full-Parameter Fine-Tuning with MindSpore Transformers

### Selecting a Pre-Trained Model

MindSpore Transformers currently supports mainstream large-scale models in the industry. This guide uses the Qwen3-8B model as an example.

### Downloading Model Weights

MindSpore Transformers supports loading Hugging Face model weights, enabling direct loading of weights downloaded from the Hugging Face model hub. For details, refer to [MindSpore Transformers-Safetensors Weights](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/safetensors.html).

| Model Name | Hugging Face Weight Download Link            |
|:-----------|:--------------------------------------------:|
| Qwen3-8B   | [Link](https://huggingface.co/Qwen/Qwen3-8B) |

### Dataset Preparation

MindSpore Transformers supports online loading of Hugging Face datasets. For details, refer to [MindSpore Transformers-Dataset-Hugging Face Dataset](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/feature/dataset.html#hugging-face-dataset). This guide uses [llm-wizard/alpaca-gpt4-data](https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data) as the fine-tuning dataset.

| Dataset Name                | Applicable Phase | Download Link                                                       |
|:----------------------------|:----------------:|:-------------------------------------------------------------------:|
| llm-wizard/alpaca-gpt4-data | Fine-Tuning      | [Link](https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data) |

### Executing the Fine-Tuning Task

#### Single-NPU Training

First, prepare the configuration file. This guide provides a fine-tuning configuration file for the Qwen3-8B model, `finetune_qwen3.yaml`, available for download from the [Gitee repository](https://gitee.com/mindspore/mindformers/blob/r1.8.0/configs/qwen3/finetune_qwen3.yaml).

> Due to limited single-NPU memory, `num_hidden_layers` in the configuration file is set to 4; this serves as an example only.

Then, modify the parameters in the configuration file based on actual conditions, mainly including:

```yaml
pretrained_model_dir: '/path/to/Qwen3-8B'
...
train_dataset: &train_dataset
  ...
  data_loader:
    type: HFDataLoader
    path: "llm-wizard/alpaca-gpt4-data-zh"  # An Alpaca-style dataset. Ensure the network can access Hugging Face for automatic dataset download.
    # path: "json"  # To load a local JSON file offline instead, uncomment this line and the next one, and comment out the line above.
    # data_files: '/path/to/alpaca_gpt4_data_zh.json'
    ...
    handler:
      - type: take  # Invoke the `take` method from the datasets library to fetch the first n samples for demonstration
        n: 2000     # Take the first 2000 samples for demonstration. Remove this line and the one above in actual use.

model:
  model_config:
    num_hidden_layers: 4
    ...

parallel_config:
  data_parallel: 1
  model_parallel: 1
  pipeline_stage: 1
  use_seq_parallel: False
  micro_batch_num: 1
```

Run `run_mindformer.py` to start the single-NPU fine-tuning task.
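Before launching, you can optionally verify that the dataset referenced by `data_loader` is reachable and formatted as expected. Below is a minimal sketch using the Hugging Face `datasets` library (this is an illustrative check, not part of the MindSpore Transformers workflow; it assumes network access to the Hugging Face Hub):

```python
from datasets import load_dataset

# Load only the first 2000 samples, mirroring the `take` handler (n: 2000) in the YAML above.
# For offline use, load the local JSON file instead:
# dataset = load_dataset("json", data_files="/path/to/alpaca_gpt4_data_zh.json", split="train")
dataset = load_dataset("llm-wizard/alpaca-gpt4-data-zh", split="train[:2000]")

# Alpaca-style samples typically contain "instruction", "input", and "output" fields.
print(dataset)
print(dataset[0])
```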
The command is as follows:

```shell
python run_mindformer.py \
 --config configs/qwen3/finetune_qwen3.yaml \
 --use_parallel False \
 --run_mode finetune
```

Parameter descriptions:

```text
config: Model configuration file
use_parallel: Whether to enable parallel training
run_mode: Running mode, train: training, finetune: fine-tuning, predict: inference
```

#### Single-Node Training

First, prepare the configuration file. This guide provides a fine-tuning configuration file for the Qwen3-8B model, `finetune_qwen3.yaml`, available for download from the [Gitee repository](https://gitee.com/mindspore/mindformers/blob/r1.8.0/configs/qwen3/finetune_qwen3.yaml).

Then, modify the parameters in the configuration file based on actual conditions, mainly including:

```yaml
pretrained_model_dir: '/path/to/Qwen3-8B'
...
train_dataset: &train_dataset
  ...
  data_loader:
    type: HFDataLoader
    path: "llm-wizard/alpaca-gpt4-data-zh"  # An Alpaca-style dataset. Ensure the network can access Hugging Face for automatic dataset download.
    # path: "json"  # To load a local JSON file offline instead, uncomment this line and the next one, and comment out the line above.
    # data_files: '/path/to/alpaca_gpt4_data_zh.json'
    ...
    handler:
      - type: take  # Invoke the `take` method from the datasets library to fetch the first n samples for demonstration
        n: 2000     # Take the first 2000 samples for demonstration. Remove this line and the one above in actual use.

parallel_config:
  data_parallel: 1
  model_parallel: 4
  pipeline_stage: 2
  micro_batch_num: 2
```

Run the following msrun startup script for 8-NPU distributed training:

```bash
total_rank_num=8
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config configs/qwen3/finetune_qwen3.yaml \
 --auto_trans_ckpt True \
 --use_parallel True \
 --run_mode finetune" \
 $total_rank_num
```

Parameter descriptions:

```text
config: Model configuration file
auto_trans_ckpt: Whether to automatically convert the weight file format
use_parallel: Whether to enable parallel training
run_mode: Running mode, train: training, finetune: fine-tuning, predict: inference
```

After the task completes, a checkpoint folder is generated in the mindformers/output directory, and the model files are saved in this folder.

#### Multi-Node Training

Multi-node, multi-NPU fine-tuning tasks are launched similarly to pre-training. Refer to the [multi-node, multi-NPU pre-training commands](https://www.mindspore.cn/mindformers/docs/en/r1.8.0/guide/pre_training.html#multi-node-training).

First, modify the configuration file, adjusting the settings based on the number of nodes:

```yaml
parallel_config:
  data_parallel: ...
  model_parallel: ...
  pipeline_stage: ...
  context_parallel: ...
```

Then modify the command as follows:

1. Add the startup script parameter `--config configs/qwen3/finetune_qwen3.yaml` to use the fine-tuning configuration, which loads the pre-trained weights.
2. Set `--run_mode finetune` in the startup script, where `run_mode` indicates the running mode: train (training), finetune (fine-tuning), or predict (inference).

After the task completes, a checkpoint folder is generated in the mindformers/output directory, and the model files are saved in this folder.

## LoRA Fine-Tuning with MindSpore Transformers

MindSpore Transformers supports configuration-driven LoRA fine-tuning, eliminating the need for per-model code adaptations. LoRA fine-tuning tasks can be performed by modifying the model configuration in the full-parameter fine-tuning YAML file and adding the `pet_config` parameter-efficient fine-tuning configuration.
Below is an example of the model configuration section in a YAML file for LoRA fine-tuning of the Qwen3 model, with detailed explanations of the `pet_config` parameters.

### Introduction to LoRA Principles

LoRA significantly reduces the number of trainable parameters by representing the weight update as the product of two low-rank matrices. For example, suppose a weight matrix W has dimensions $m \times n$. With LoRA, its update is decomposed into two low-rank matrices A and B, where A has dimensions $m \times r$ and B has dimensions $r \times n$ ($r$ is much smaller than $m$ and $n$). During fine-tuning, only these two low-rank matrices are updated, leaving the rest of the original model unchanged. This approach not only drastically reduces the computational cost of fine-tuning but also preserves the model's original performance, making it particularly suitable for model optimization in environments with limited data or computational resources. For detailed principles, refer to the paper [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685).

### Modifying the Configuration File

Based on the full-parameter fine-tuning configuration file, add the LoRA-related parameters to the model configuration and rename the file to `finetune_qwen3_8b_lora.yaml`. Below is an example configuration snippet showing how to add LoRA fine-tuning parameters for the Qwen3-8B model:

```yaml
# model config
model:
  model_config:
    ...
    # Add `pet_config` under the `model_config` level.
    pet_config:
      pet_type: lora
      lora_rank: 8
      lora_alpha: 16
      lora_dropout: 0.1
      lora_a_init: 'normal'
      lora_b_init: 'zeros'
      target_modules: '.*word_embeddings|.*linear_qkv|.*linear_proj|.*linear_fc1|.*linear_fc2'
      freeze_include: ['*']
      freeze_exclude: ['*lora*']
```

### Detailed Explanation of pet_config Parameters

In `model_config`, `pet_config` is the core configuration section for LoRA fine-tuning, used to specify LoRA-related parameters. The parameters are explained as follows:

- **pet_type:** Specifies the type of Parameter-Efficient Tuning (PET) as LoRA. This means LoRA modules will be inserted into key layers of the model to reduce the number of parameters required for fine-tuning.
- **lora_rank:** Defines the rank of the low-rank matrices. A smaller rank results in fewer parameters to update, reducing computational resource usage. Setting it to 8 is a common balance point, significantly reducing the parameter count while maintaining model performance.
- **lora_alpha:** Controls the scaling factor for weight updates in the LoRA module. This value determines the magnitude and impact of weight updates during fine-tuning. Setting it to 16 indicates a moderate scaling factor, helping to stabilize the training process.
- **lora_dropout:** Sets the dropout probability in the LoRA module. Dropout is a regularization technique used to reduce the risk of overfitting. A value of 0.1 means there is a 10% chance of randomly "disabling" certain neural connections during training, which is particularly important when data is limited.
- **lora_a_init:** Specifies the initialization method for the LoRA A matrix. Common choices include 'normal' and 'zeros'.
- **lora_b_init:** Specifies the initialization method for the LoRA B matrix. Common choices include 'normal' and 'zeros'.
- **target_modules:** A regular expression specifying the modules to which LoRA is applied. The configuration above applies LoRA to the weight matrices of the word embeddings, the attention layers (linear_qkv, linear_proj), and the MLP layers (linear_fc1, linear_fc2).
- **freeze_include / freeze_exclude:** Control parameter freezing during fine-tuning. With the configuration above, all parameters are frozen except those whose names contain `lora`, so only the LoRA matrices are updated.
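To make these values concrete, the following is a minimal, illustrative sketch in plain Python/NumPy (not MindSpore Transformers code; the matrix dimensions are hypothetical) of how a LoRA-adapted linear layer combines the frozen weight with the low-rank update, using the `lora_rank` and `lora_alpha` values from the configuration above:

```python
import numpy as np

# Illustrative only: a hypothetical LoRA-adapted linear layer, not the MindSpore Transformers implementation.
m, n = 4096, 4096          # hypothetical weight matrix dimensions
r, alpha = 8, 16           # lora_rank and lora_alpha from pet_config above

W = np.random.randn(m, n)          # frozen pre-trained weight W, never updated during LoRA fine-tuning
A = np.random.randn(m, r) * 0.02   # LoRA A matrix, 'normal' initialization
B = np.zeros((r, n))               # LoRA B matrix, 'zeros' initialization, so the initial update is zero

def lora_forward(x):
    # y = x @ (W + (alpha / r) * A @ B); only A and B receive gradients during fine-tuning.
    # (lora_dropout=0.1 would be applied on the A/B branch during training; omitted here.)
    return x @ W + (alpha / r) * (x @ A @ B)

# Trainable parameters drop from m*n to r*(m+n): 16,777,216 -> 65,536 for this single matrix.
print(W.size, A.size + B.size)
```

Because B starts at zero, the adapted model behaves exactly like the pre-trained model at the start of fine-tuning, which is why the 'zeros' initialization of the B matrix is a common default.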
### LoRA Fine-Tuning Example for Qwen3-8B

The dataset used for LoRA fine-tuning can be prepared as described in the [Dataset Preparation](#dataset-preparation) section of the full-parameter fine-tuning process.

For the Qwen3-8B model, the following msrun startup command can be executed for 8-NPU distributed fine-tuning:

```shell
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config /path/to/finetune_qwen3_8b_lora.yaml \
 --use_parallel True \
 --run_mode finetune" 8
```
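Parameter descriptions:

```text
config: Model configuration file (the LoRA configuration file prepared above)
use_parallel: Whether to enable parallel training
run_mode: Running mode, train: training, finetune: fine-tuning, predict: inference
```

As with full-parameter fine-tuning, a checkpoint folder containing the fine-tuned weights is expected to be generated in the mindformers/output directory after the task completes.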