# Supervised Fine-Tuning (SFT)

[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0rc1/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0rc1/docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md)

## Overview

SFT (Supervised Fine-Tuning) adopts the concept of supervised learning and refers to the process of adjusting some or all parameters of a pre-trained model so that it adapts better to a specific task or dataset.

MindSpore Transformers supports two SFT methods: full-parameter fine-tuning and LoRA fine-tuning. Full-parameter fine-tuning updates all parameters during training; it is suitable for deep adaptation on large-scale data and offers the best task adaptability, but requires significant computational resources. LoRA fine-tuning updates only a subset of parameters, consuming less memory and training faster than full-parameter fine-tuning, though its performance may be inferior on some tasks.

## Basic Process of SFT Fine-Tuning

In practice, SFT fine-tuning can be broken down into the following steps:

### 1. Weight Preparation

Before fine-tuning, the weight files of the pre-trained model need to be prepared. MindSpore Transformers supports loading [safetensors weights](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/safetensors.html), enabling direct loading of model weights downloaded from the Hugging Face model hub.

### 2. Dataset Preparation

For the fine-tuning phase, MindSpore Transformers currently supports datasets in [Hugging Face format](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/dataset.html#huggingface-datasets) and [MindRecord format](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/dataset.html#mindrecord-dataset). Users can prepare data according to task requirements.

### 3. Configuration File Preparation

Fine-tuning tasks are uniformly controlled through [configuration files](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/configuration.html), allowing users to flexibly adjust [model training hyperparameters](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/training_hyperparameters.html). Fine-tuning performance can be further optimized with [distributed parallel training](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/parallel_training.html), [memory optimization features](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/memory_optimization.html), and [other training features](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/other_training_features.html).

### 4. Launching the Training Task

MindSpore Transformers provides a [one-click startup script](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/start_tasks.html) for launching fine-tuning tasks. During training, [logs](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/logging.html) and [visualization tools](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/monitor.html) can be used to monitor the training process.

### 5. Model Saving

Checkpoints are saved during training, or model weights are saved to a specified path upon completion. Weights can currently be saved in [Safetensors format](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/safetensors.html) or [Ckpt format](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/ckpt.html) and used for resumed training or further fine-tuning.
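To check what a training run actually saved, a Safetensors file can be inspected offline with the `safetensors` Python library. The sketch below is only an illustration: the file path is a hypothetical example, so substitute the file actually produced under your output directory.

```python
from safetensors.numpy import load_file

# Hypothetical path: substitute the file actually generated under output/.
state = load_file("output/checkpoint/rank_0/qwen2_5_7b.safetensors")
for name, tensor in state.items():
    print(name, tensor.shape, tensor.dtype)  # one line per saved weight
```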
### 6. Fault Recovery

To handle exceptions such as training interruptions, MindSpore Transformers offers [training high availability](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/high_availability.html) features such as last-state saving and automatic recovery, as well as [checkpoint-based resumed training](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/resume_training.html), enhancing training stability.

## Full-Parameter Fine-Tuning with MindSpore Transformers

### Selecting a Pre-Trained Model

MindSpore Transformers currently supports mainstream large-scale models in the industry. This guide uses the Qwen2.5-7B model as an example.

### Downloading Model Weights

MindSpore Transformers supports loading Hugging Face model weights, enabling direct loading of weights downloaded from the Hugging Face model hub. For details, refer to [MindSpore Transformers-Safetensors Weights](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/safetensors.html).

| Model Name | Hugging Face Weight Download Link |
| :--------- | :-------------------------------: |
| Qwen2.5-7B | [Link](https://huggingface.co/Qwen/Qwen2.5-7B) |

### Dataset Preparation

MindSpore Transformers supports online loading of Hugging Face datasets. For details, refer to [MindSpore Transformers-Dataset-Hugging Face Dataset](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/dataset.html#huggingface-datasets). This guide uses [llm-wizard/alpaca-gpt4-data](https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data) as the fine-tuning dataset.

| Dataset Name | Applicable Phase | Download Link |
| :----------- | :--------------: | :-----------: |
| llm-wizard/alpaca-gpt4-data | Fine-Tuning | [Link](https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data) |
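Before configuring the task, it can help to look at a few raw samples. The sketch below loads the dataset with the Hugging Face `datasets` library (network access to the hub is assumed); it is only for inspection, since the fine-tuning task itself loads the dataset online through the configuration file.

```python
from datasets import load_dataset

# Load the fine-tuning dataset directly from the Hugging Face hub.
ds = load_dataset("llm-wizard/alpaca-gpt4-data", split="train")
print(ds.column_names)  # Alpaca-style fields, typically instruction/input/output
print(ds[0])            # one raw record, before any data handler processing
```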
### Executing the Fine-Tuning Task

#### Single-NPU Training

First, prepare the configuration file. This guide provides a fine-tuning configuration file for the Qwen2.5-7B model, `finetune_qwen2_5_7b_8k_1p.yaml`, available for download from the [Gitee repository](https://gitee.com/mindspore/docs/blob/r2.7.0rc1/docs/mindformers/docs/source_zh_cn/example/supervised_fine_tuning/finetune_qwen2_5_7b_8k_1p.yaml).

> Due to limited single-NPU memory, `num_layers` in the configuration file is set to 4; it is used as an example only.

Then, modify the parameters in the configuration file based on actual conditions, mainly including:

```yaml
load_checkpoint: '/path/to/Qwen2.5-7B/'  # Path to the pre-trained model weight folder
...
train_dataset: &train_dataset
  ...
  data_loader:
    ...
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer:
          vocab_file: "/path/to/Qwen2.5-7B/vocab.json"   # Path to the vocabulary file
          merges_file: "/path/to/Qwen2.5-7B/merges.txt"  # Path to the merges file
```

Run `run_mindformer.py` to start the single-NPU fine-tuning task. The command is as follows:

```shell
python run_mindformer.py \
 --config /path/to/finetune_qwen2_5_7b_8k_1p.yaml \
 --register_path research/qwen2_5 \
 --use_parallel False \
 --run_mode finetune
```

Parameter descriptions:

```commandline
config:        Model configuration file
register_path: Path to the directory containing the external model code (here, research/qwen2_5)
use_parallel:  Whether to enable parallel training
run_mode:      Running mode; train: training, finetune: fine-tuning, predict: inference
```

#### Single-Node Training

First, prepare the configuration file. This guide provides a fine-tuning configuration file for the Qwen2.5-7B model, `finetune_qwen2_5_7b_8k.yaml`, available for download from the [Gitee repository](https://gitee.com/mindspore/docs/blob/r2.7.0rc1/docs/mindformers/docs/source_zh_cn/example/supervised_fine_tuning/finetune_qwen2_5_7b_8k.yaml).

Then, modify the parameters in the configuration file based on actual conditions, mainly including:

```yaml
load_checkpoint: '/path/to/Qwen2.5-7B/'  # Path to the pre-trained model weight folder
...
train_dataset: &train_dataset
  ...
  data_loader:
    ...
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer:
          vocab_file: "/path/to/Qwen2.5-7B/vocab.json"   # Path to the vocabulary file
          merges_file: "/path/to/Qwen2.5-7B/merges.txt"  # Path to the merges file
```

Run the following msrun startup script for 8-NPU distributed training:

```bash
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --register_path research/qwen2_5 \
 --config /path/to/finetune_qwen2_5_7b_8k.yaml \
 --use_parallel True \
 --run_mode finetune" 8
```

Parameter descriptions:

```commandline
config:        Model configuration file
register_path: Path to the directory containing the external model code (here, research/qwen2_5)
use_parallel:  Whether to enable parallel training
run_mode:      Running mode; train: training, finetune: fine-tuning, predict: inference
```

After the task completes, a checkpoint folder is generated in the `mindformers/output` directory, and the model files are saved in this folder.

#### Multi-Node Training

Multi-node, multi-NPU fine-tuning tasks are launched similarly to pre-training; refer to the [multi-node, multi-NPU pre-training commands](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/guide/pre_training.html#multi-node-training).

First, modify the configuration file, adjusting the settings based on the number of nodes:

```yaml
parallel_config:
  data_parallel: ...
  model_parallel: ...
  pipeline_stage: ...
  context_parallel: ...
```

Then modify the command as follows:

1. Add the startup script parameter `--config /path/to/finetune_qwen2_5_7b_8k.yaml` to load the pre-trained weights.
2. Set `--run_mode finetune` in the startup script, where `run_mode` indicates the running mode: train (training), finetune (fine-tuning), or predict (inference).

After the task completes, a checkpoint folder is generated in the `mindformers/output` directory, and the model files are saved in this folder.

## LoRA Fine-Tuning with MindSpore Transformers

MindSpore Transformers supports configuration-driven LoRA fine-tuning, eliminating the need for code adaptation for each model. LoRA fine-tuning tasks can be performed by modifying the model configuration in the full-parameter fine-tuning YAML file and adding a `pet_config` parameter-efficient fine-tuning configuration. Below is an example of the model configuration section of a YAML file for LoRA fine-tuning of the Qwen2.5-7B model, with detailed explanations of the `pet_config` parameters.

### Introduction to LoRA Principles

LoRA significantly reduces the number of trainable parameters by representing the update to the original model's weight matrix as the product of two low-rank matrices. For example, suppose a weight matrix $W$ has dimensions $m \times n$. With LoRA, its update is expressed through two low-rank matrices $A$ and $B$, where $A$ has dimensions $m \times r$ and $B$ has dimensions $r \times n$ ($r$ is much smaller than $m$ and $n$). During fine-tuning, only these two low-rank matrices are updated, leaving the rest of the original model unchanged.

This approach not only drastically reduces the computational cost of fine-tuning but also preserves the model's original performance, making it particularly suitable for model optimization in environments with limited data or computational resources. For detailed principles, refer to the paper [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685).
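To make the savings concrete, the following minimal NumPy sketch counts the trainable parameters for a single weight matrix. It illustrates the principle only, not MindSpore Transformers internals, and the dimensions $m = n = 4096$ are an illustrative assumption (typical of attention projections in a 7B-class model), paired with the rank $r = 16$ used later in this guide.

```python
import numpy as np

# Parameter-count comparison for one weight matrix (assumed dimensions).
m, n, r = 4096, 4096, 16
full_params = m * n           # updating W directly: 16,777,216 parameters
lora_params = m * r + r * n   # updating A (m x r) and B (r x n): 131,072 parameters
print(lora_params / full_params)  # ~0.0078, i.e. under 1% of full fine-tuning

# Minimal forward sketch of the principle: the effective weight is
# W + (alpha / r) * A @ B, and only A and B receive gradient updates.
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n)).astype(np.float32)           # frozen pre-trained weight
A = (0.01 * rng.normal(size=(m, r))).astype(np.float32)  # trainable low-rank factor
B = np.zeros((r, n), dtype=np.float32)  # typically zero-initialized, so training starts from W
alpha = 16
W_eff = W + (alpha / r) * (A @ B)       # equals W exactly before any update
```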
### Modifying the Configuration File

Based on the full-parameter fine-tuning configuration file, add the LoRA-related parameters to the model configuration and rename the file `finetune_qwen2_5_7b_8k_lora.yaml`. Below is an example configuration snippet showing how to add LoRA fine-tuning parameters for the Qwen2.5-7B model:

```yaml
# model config
model:
  model_config:
    ...
    pet_config:
      pet_type: lora
      lora_rank: 16
      lora_alpha: 16
      lora_dropout: 0.05
      target_modules: '.*wq|.*wk|.*wv|.*wo'
```

### Detailed Explanation of pet_config Parameters

In `model_config`, `pet_config` is the core configuration section for LoRA fine-tuning, used to specify LoRA-related parameters. The parameters are explained as follows:

- **pet_type:** Specifies the type of Parameter-Efficient Tuning (PET) as LoRA. This means LoRA modules will be inserted into key layers of the model to reduce the number of parameters required for fine-tuning.
- **lora_rank:** Defines the rank of the low-rank matrices. A smaller rank results in fewer parameters to update, reducing computational resource usage. Setting it to 16 is a common balance point, significantly reducing the parameter count while maintaining model performance.
- **lora_alpha:** Controls the scaling factor for weight updates in the LoRA module. This value determines the magnitude and impact of weight updates during fine-tuning. Setting it to 16 indicates a moderate scaling factor, helping to stabilize the training process.
- **lora_dropout:** Sets the dropout probability in the LoRA module. Dropout is a regularization technique used to reduce the risk of overfitting. A value of 0.05 means there is a 5% chance of randomly "disabling" certain neural connections during training, which is particularly important when data is limited.
- **target_modules:** Specifies, via a regular expression, which weight matrices in the model LoRA is applied to. Here, the configuration applies LoRA to the Query (wq), Key (wk), Value (wv), and Output (wo) matrices in the self-attention mechanism. These matrices play critical roles in the Transformer architecture, and applying LoRA to them maintains model performance while reducing the parameter count. (A quick way to check what a given pattern matches is sketched at the end of this guide.)

### LoRA Fine-Tuning Example for Qwen2.5-7B

The dataset used for LoRA fine-tuning can be prepared as described in the [Dataset Preparation](#dataset-preparation) section of the full-parameter fine-tuning process.

For the Qwen2.5-7B model, the following msrun startup command can be executed for 8-NPU distributed fine-tuning:

```shell
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --register_path research/qwen2_5 \
 --config /path/to/finetune_qwen2_5_7b_8k_lora.yaml \
 --use_parallel True \
 --run_mode finetune" 8
```
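Since `target_modules` is interpreted as a regular expression, it is worth sanity-checking which parameter names a pattern will match before launching a job. The snippet below is a minimal offline check; the parameter names are hypothetical examples in the style of the attention naming used above, not names read from the actual Qwen2.5-7B implementation.

```python
import re

# Hypothetical parameter names; actual names depend on the model implementation.
names = [
    "model.layers.0.attention.wq.weight",
    "model.layers.0.attention.wk.weight",
    "model.layers.0.attention.wv.weight",
    "model.layers.0.attention.wo.weight",
    "model.layers.0.feed_forward.w1.weight",
]
pattern = re.compile(r".*wq|.*wk|.*wv|.*wo")
for name in names:
    print(name, bool(pattern.match(name)))
# The four attention projections match; the feed-forward weight does not.
```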