[{"data":1,"prerenderedAt":592},["ShallowReactive",2],{"content-query-gHHJ7wicBy":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":586,"_id":587,"_source":588,"_file":589,"_stem":590,"_extension":591},"/technology-blogs/en/2948","en",false,"","Project Sharing | Training Your Exclusive AI Model with MindSpore and LoRA Fine-tuning Techniques","In addition to well-known pretrained foundation models such as LLaMA, LLaMA2, ChatGLM, and Falcon, a large number of top models on the Hugging Face Open LLM Leaderboard are fine-tuned models.","2023-11-17","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/45538d308da245b89c4f9c35553392c1.png","technology-blogs",{"type":14,"children":15,"toc":583},"root",[16,24,34,42,47,62,81,89,94,99,104,109,114,119,127,132,137,149,157,162,176,181,186,197,202,211,219,224,241,251,258,283,291,308,316,323,328,340,351,376,394,405,449,485,502,533,538,545,553,566],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"project-sharing-training-your-exclusive-ai-model-with-mindspore-and-lora-fine-tuning-techniques",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":17,"tag":29,"props":30,"children":31},"strong",{},[32],{"type":23,"value":33},"Author: Zhong Yuanke | Source: Zhihu",{"type":17,"tag":25,"props":35,"children":36},{},[37],{"type":17,"tag":29,"props":38,"children":39},{},[40],{"type":23,"value":41},"Summary",{"type":17,"tag":25,"props":43,"children":44},{},[45],{"type":23,"value":46},"In the past year, foundation models have rapidly developed. In addition to well-known pretrained foundation models such as LLaMA, LLaMA2, ChatGLM, and Falcon, a large number of top models on the Hugging Face Open LLM Leaderboard, a Hugging Face Space hosted by Hugging Face H4, are fine-tuned models. 
Foundation model training requires significant computing power, so many AI developers opt to use fine-tuning to train such models, as it significantly reduces the computation workload. Mainstream fine-tuning methods include Freeze, P-Tuning, and Low-Rank Adaptation (LoRA). This blog will explain the principles of the LoRA method.",{"type":17,"tag":25,"props":48,"children":49},{},[50,55,57],{"type":17,"tag":29,"props":51,"children":52},{},[53],{"type":23,"value":54},"01",{"type":23,"value":56}," ",{"type":17,"tag":29,"props":58,"children":59},{},[60],{"type":23,"value":61},"LoRA Analysis",{"type":17,"tag":25,"props":63,"children":64},{},[65,67,73,74,79],{"type":23,"value":66},"The LoRA fine-tuning method was proposed in the paper ",{"type":17,"tag":68,"props":69,"children":70},"em",{},[71],{"type":23,"value":72},"LoRA: Low-Rank Adaptation",{"type":23,"value":56},{"type":17,"tag":68,"props":75,"children":76},{},[77],{"type":23,"value":78},"of Large Language Models",{"type":23,"value":80},", which introduced LoRA as follows: \"We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.\" As the original description indicates, LoRA freezes the parameters of the original model and injects trainable low-rank decomposition modules, reducing the number of parameters that need to be fine-tuned.",{"type":17,"tag":25,"props":82,"children":83},{},[84],{"type":17,"tag":85,"props":86,"children":88},"img",{"alt":7,"src":87},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/baea29dfe9df428680bf06b159d7b558.png",[],{"type":17,"tag":25,"props":90,"children":91},{},[92],{"type":23,"value":93},"Figure 1 LoRA fine-tuning updates only A and B",{"type":17,"tag":25,"props":95,"children":96},{},[97],{"type":23,"value":98},"The figure above shows a schematic 
diagram of LoRA, where the blue part on the left represents the weights to be frozen, and the part on the right represents the newly added fine-tuning weight modules.",{"type":17,"tag":25,"props":100,"children":101},{},[102],{"type":23,"value":103},"The paper concludes that LoRA has the following key advantages:",{"type":17,"tag":25,"props":105,"children":106},{},[107],{"type":23,"value":108},"· Having a small number of parameters reduces the demands for computing power and storage.",{"type":17,"tag":25,"props":110,"children":111},{},[112],{"type":23,"value":113},"· Fine-tuning is small in scale and therefore highly efficient.",{"type":17,"tag":25,"props":115,"children":116},{},[117],{"type":23,"value":118},"· LoRA generates an independent fine-tuning module that can be integrated with other fine-tuning techniques.",{"type":17,"tag":25,"props":120,"children":121},{},[122],{"type":17,"tag":29,"props":123,"children":124},{},[125],{"type":23,"value":126},"Symbols",{"type":17,"tag":25,"props":128,"children":129},{},[130],{"type":23,"value":131},"dmodel: dimensionality of the input and output of a Transformer layer.",{"type":17,"tag":25,"props":133,"children":134},{},[135],{"type":23,"value":136},"Wq, Wk, Wv, and Wo: mapping matrices of the query, key, value, and output of a self-attention module.",{"type":17,"tag":25,"props":138,"children":139},{},[140,144,145],{"type":17,"tag":85,"props":141,"children":143},{"alt":7,"src":142},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/ec32820dc61c4736bf9435d6c8994989.png",[],{"type":23,"value":56},{"type":17,"tag":85,"props":146,"children":148},{"alt":7,"src":147},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/249479c877a641fe9cdcbd403d7891f3.png",[],{"type":17,"tag":25,"props":150,"children":151},{},[152],{"type":17,"tag":29,"props":153,"children":154},{},[155],{"type":23,"value":156},"Applying LoRA to 
Transformer",{"type":17,"tag":25,"props":158,"children":159},{},[160],{"type":23,"value":161},"In principle, LoRA can be applied to the Transformer sub-modules of various models for sub-module fine-tuning. For example, Self-Attention has four matrices Wq, Wk, Wv, and Wo, each of which can be regarded as an independent matrix for LoRA fine-tuning.",{"type":17,"tag":25,"props":163,"children":164},{},[165,170,171],{"type":17,"tag":29,"props":166,"children":167},{},[168],{"type":23,"value":169},"04",{"type":23,"value":56},{"type":17,"tag":29,"props":172,"children":173},{},[174],{"type":23,"value":175},"Code Explanation",{"type":17,"tag":25,"props":177,"children":178},{},[179],{"type":23,"value":180},"The following explanation uses the source code of MindPet, the MindSpore fine-tuning suite, as an example to illustrate the implementation of LoRA, and shows how LoRA in MindPet is applied to fine-tune the Wukong-Huahua model.",{"type":17,"tag":25,"props":182,"children":183},{},[184],{"type":23,"value":185},"Link to the MindPet project:",{"type":17,"tag":25,"props":187,"children":188},{},[189],{"type":17,"tag":190,"props":191,"children":195},"a",{"href":192,"rel":193},"https://github.com/mindspore-lab/mindpet",[194],"nofollow",[196],{"type":23,"value":192},{"type":17,"tag":25,"props":198,"children":199},{},[200],{"type":23,"value":201},"Wukong-Huahua+LoRA fine-tuning code:",{"type":17,"tag":25,"props":203,"children":204},{},[205],{"type":17,"tag":190,"props":206,"children":209},{"href":207,"rel":208},"https://github.com/mindspore-lab/minddiffusion/blob/main/vision/wukong-huahua/run_train.py",[194],[210],{"type":23,"value":207},{"type":17,"tag":25,"props":212,"children":213},{},[214],{"type":17,"tag":29,"props":215,"children":216},{},[217],{"type":23,"value":218},"Core Code Explanation",{"type":17,"tag":25,"props":220,"children":221},{},[222],{"type":23,"value":223},"The core code of LoRA fine-tuning consists of two 
parts: (1) freezing the original weights; (2) adding the LoRA weights.",{"type":17,"tag":25,"props":225,"children":226},{},[227,232,234,239],{"type":17,"tag":29,"props":228,"children":229},{},[230],{"type":23,"value":231},"Freeze the original weights.",{"type":23,"value":233}," Here we use the ",{"type":17,"tag":29,"props":235,"children":236},{},[237],{"type":23,"value":238},"freeze_delta",{"type":23,"value":240}," API provided by MindPet.",{"type":17,"tag":242,"props":243,"children":245},"pre",{"code":244},"if opts.enable_lora:\n    from mindpet.graph import freeze_delta\n    freeze_delta(LatentDiffusionWithLoss, 'lora')\n",[246],{"type":17,"tag":247,"props":248,"children":249},"code",{"__ignoreMap":7},[250],{"type":23,"value":244},{"type":17,"tag":25,"props":252,"children":253},{},[254],{"type":17,"tag":29,"props":255,"children":256},{},[257],{"type":23,"value":238},{"type":17,"tag":25,"props":259,"children":260},{},[261,263,268,270,274,276,281],{"type":23,"value":262},"The core code of ",{"type":17,"tag":29,"props":264,"children":265},{},[266],{"type":23,"value":267},"_freeze_from_list",{"type":23,"value":269}," used by ",{"type":17,"tag":29,"props":271,"children":272},{},[273],{"type":23,"value":238},{"type":23,"value":275}," is as follows. 
It can be seen that the module sets the requires_grad attribute of each parameter in the specified freeze list to ",{"type":17,"tag":29,"props":277,"children":278},{},[279],{"type":23,"value":280},"False",{"type":23,"value":282},", so those parameters are no longer updated by gradient descent and are effectively frozen.",{"type":17,"tag":242,"props":284,"children":286},{"code":285},"def _freeze_from_list(model, include, exclude):\n    \"\"\"\n    Freeze the network based on the include/exclude list.\n    \"\"\"\n    for name, param in model.parameters_and_names():\n        if _match_str_and_list(name, include) and not _match_str_and_list(name, exclude):\n            param.requires_grad = False\n        elif not _match_str_and_list(name, include) and _match_str_and_list(name, exclude):\n            param.requires_grad = True\n",[287],{"type":17,"tag":247,"props":288,"children":289},{"__ignoreMap":7},[290],{"type":23,"value":285},{"type":17,"tag":25,"props":292,"children":293},{},[294,299,301,306],{"type":17,"tag":29,"props":295,"children":296},{},[297],{"type":23,"value":298},"Add LoRA weights",{"type":23,"value":300},". Here, q, k, and v in CrossAttention are treated as separate modules, and a trainable rank decomposition matrix is injected into each module using the ",{"type":17,"tag":29,"props":302,"children":303},{},[304],{"type":23,"value":305},"LoRADense",{"type":23,"value":307}," API provided by MindPet. 
Then, the original q, k, and v are replaced with LoRA matrices.",{"type":17,"tag":242,"props":309,"children":311},{"code":310},"from mindpet.delta import LoRADense\nself.to_q = LoRADense(query_dim, inner_dim, has_bias=False, lora_rank=lora_rank, lora_alpha=lora_alpha).to_float(dtype)\nself.to_v = LoRADense(context_dim, inner_dim, has_bias=False, lora_rank=lora_rank, lora_alpha=lora_alpha).to_float(dtype)\nself.to_k = LoRADense(context_dim, inner_dim, has_bias=False, lora_rank=lora_rank, lora_alpha=lora_alpha).to_float(dtype)\n\nself.to_out = nn.SequentialCell(\n    LoRADense(inner_dim, query_dim, lora_rank=lora_rank, lora_alpha=lora_alpha).to_float(dtype),\n    nn.Dropout(dropout)\n)\n",[312],{"type":17,"tag":247,"props":313,"children":314},{"__ignoreMap":7},[315],{"type":23,"value":310},{"type":17,"tag":25,"props":317,"children":318},{},[319],{"type":17,"tag":29,"props":320,"children":321},{},[322],{"type":23,"value":305},{"type":17,"tag":25,"props":324,"children":325},{},[326],{"type":23,"value":327},"The main parameters of this API are described as follows:",{"type":17,"tag":25,"props":329,"children":330},{},[331,333,338],{"type":23,"value":332},"· ",{"type":17,"tag":29,"props":334,"children":335},{},[336],{"type":23,"value":337},"in_channels",{"type":23,"value":339}," (int) - spatial dimension of the tensor input to the original Dense layer.",{"type":17,"tag":25,"props":341,"children":342},{},[343,344,349],{"type":23,"value":332},{"type":17,"tag":29,"props":345,"children":346},{},[347],{"type":23,"value":348},"out_channels",{"type":23,"value":350}," (int) - spatial dimension of the tensor output from the original Dense layer.",{"type":17,"tag":25,"props":352,"children":353},{},[354,355,360,362,367,369,374],{"type":23,"value":332},{"type":17,"tag":29,"props":356,"children":357},{},[358],{"type":23,"value":359},"lora_rank",{"type":23,"value":361}," (int) - number of rows in the 
",{"type":17,"tag":29,"props":363,"children":364},{},[365],{"type":23,"value":366},"lora_a",{"type":23,"value":368}," matrix and number of columns in the ",{"type":17,"tag":29,"props":370,"children":371},{},[372],{"type":23,"value":373},"lora_b",{"type":23,"value":375}," matrix in the LoRA algorithm.",{"type":17,"tag":25,"props":377,"children":378},{},[379,380,385,387,392],{"type":23,"value":332},{"type":17,"tag":29,"props":381,"children":382},{},[383],{"type":23,"value":384},"lora_alpha",{"type":23,"value":386}," (Union[int, float]) - constant hyperparameter. The value cannot be ",{"type":17,"tag":29,"props":388,"children":389},{},[390],{"type":23,"value":391},"0",{"type":23,"value":393},".",{"type":17,"tag":25,"props":395,"children":396},{},[397,398,403],{"type":23,"value":332},{"type":17,"tag":29,"props":399,"children":400},{},[401],{"type":23,"value":402},"lora_dropout",{"type":23,"value":404}," (float) - dropout rate. The value range is [0.0, 1.0).",{"type":17,"tag":25,"props":406,"children":407},{},[408,409,414,416,420,422,427,429,434,436,441,443,448],{"type":23,"value":332},{"type":17,"tag":29,"props":410,"children":411},{},[412],{"type":23,"value":413},"lora_a_init",{"type":23,"value":415}," (Union[Tensor, str, Initializer, numbers.Number]) - initialization method of the ",{"type":17,"tag":29,"props":417,"children":418},{},[419],{"type":23,"value":366},{"type":23,"value":421}," matrix. Its data type is the same as that of ",{"type":17,"tag":29,"props":423,"children":424},{},[425],{"type":23,"value":426},"x",{"type":23,"value":428},". The value of ",{"type":17,"tag":29,"props":430,"children":431},{},[432],{"type":23,"value":433},"str",{"type":23,"value":435}," is referenced from the ",{"type":17,"tag":29,"props":437,"children":438},{},[439],{"type":23,"value":440},"initializer",{"type":23,"value":442}," function. 
The default value is ",{"type":17,"tag":29,"props":444,"children":445},{},[446],{"type":23,"value":447},"HeUniform(negative_slope=math.sqrt(5))",{"type":23,"value":393},{"type":17,"tag":25,"props":450,"children":451},{},[452,453,458,459,463,464,468,469,473,474,478,479,484],{"type":23,"value":332},{"type":17,"tag":29,"props":454,"children":455},{},[456],{"type":23,"value":457},"lora_b_init",{"type":23,"value":415},{"type":17,"tag":29,"props":460,"children":461},{},[462],{"type":23,"value":373},{"type":23,"value":421},{"type":17,"tag":29,"props":465,"children":466},{},[467],{"type":23,"value":426},{"type":23,"value":428},{"type":17,"tag":29,"props":470,"children":471},{},[472],{"type":23,"value":433},{"type":23,"value":435},{"type":17,"tag":29,"props":475,"children":476},{},[477],{"type":23,"value":440},{"type":23,"value":442},{"type":17,"tag":29,"props":480,"children":481},{},[482],{"type":23,"value":483},"zeros",{"type":23,"value":393},{"type":17,"tag":25,"props":486,"children":487},{},[488,489,494,496,501],{"type":23,"value":332},{"type":17,"tag":29,"props":490,"children":491},{},[492],{"type":23,"value":493},"has_bias",{"type":23,"value":495}," (bool) - whether to use the bias vector. The default value is ",{"type":17,"tag":29,"props":497,"children":498},{},[499],{"type":23,"value":500},"True",{"type":23,"value":393},{"type":17,"tag":25,"props":503,"children":504},{},[505,506,511,513,518,520,525,527,532],{"type":23,"value":332},{"type":17,"tag":29,"props":507,"children":508},{},[509],{"type":23,"value":510},"activation",{"type":23,"value":512}," (Union[str, Cell, Primitive, None]) - activation function applied to the output of the fully-connected layer. 
You can specify an activation function name, such as ",{"type":17,"tag":29,"props":514,"children":515},{},[516],{"type":23,"value":517},"relu",{"type":23,"value":519},", or a specific activation function, such as ",{"type":17,"tag":29,"props":521,"children":522},{},[523],{"type":23,"value":524},"mindspore.nn.ReLU()",{"type":23,"value":526},". The default value is ",{"type":17,"tag":29,"props":528,"children":529},{},[530],{"type":23,"value":531},"None",{"type":23,"value":393},{"type":17,"tag":25,"props":534,"children":535},{},[536],{"type":23,"value":537},"The principle of this module is as follows: it computes the sum (h) of the output calculated from the original weights and the output of LoRA modules A and B (as shown in Figure 1).",{"type":17,"tag":25,"props":539,"children":540},{},[541],{"type":17,"tag":85,"props":542,"children":544},{"alt":7,"src":543},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/5718a85eaf9348b88d20f6fb5183957d.png",[],{"type":17,"tag":242,"props":546,"children":548},{"code":547},"# Dense result, which is the result of the original weights, as shown in the blue part in Figure 1.\ndense_result = self.matmul(input_tensor, weight)\nif self.has_bias:\n    bias = self.cast(self.bias, self.dtype)\n    dense_result = self.bias_add(dense_result, bias)\n\n# LoRA result, which is the result of the LoRA weights, as shown in the orange part in Figure 1.\ninput_tensor = self.lora_dropout(input_tensor)\ninput_tensor = self.lora_a_matmul(input_tensor, lora_a)\ninput_tensor = self.lora_b_matmul(input_tensor, lora_b)\ninput_tensor = self.mul(input_tensor, scaling)\n\n# Result addition and activation: Add the two parts together.\ndense_result = self.add(dense_result, input_tensor)\n",
[549],{"type":17,"tag":247,"props":550,"children":551},{"__ignoreMap":7},[552],{"type":23,"value":547},{"type":17,"tag":25,"props":554,"children":555},{},[556,561,562],{"type":17,"tag":29,"props":557,"children":558},{},[559],{"type":23,"value":560},"05",{"type":23,"value":56},{"type":17,"tag":29,"props":563,"children":564},{},[565],{"type":23,"value":41},{"type":17,"tag":25,"props":567,"children":568},{},[569,571,575,577,581],{"type":23,"value":570},"The main principle of LoRA fine-tuning is to add new weights to the original weights to adapt to downstream tasks. The new weights are trainable low-rank matrices, which makes LoRA fine-tuning more efficient. This fine-tuning method can be implemented by calling only two APIs: (1) ",{"type":17,"tag":29,"props":572,"children":573},{},[574],{"type":23,"value":238},{"type":23,"value":576}," that freezes the original weights; (2) ",{"type":17,"tag":29,"props":578,"children":579},{},[580],{"type":23,"value":305},{"type":23,"value":582}," that adds the LoRA weights.",{"title":7,"searchDepth":584,"depth":584,"links":585},4,[],"markdown","content:technology-blogs:en:2948.md","content","technology-blogs/en/2948.md","technology-blogs/en/2948","md",1776506108239]