[{"data":1,"prerenderedAt":1002},["ShallowReactive",2],{"content-query-kBZl4atnbT":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"category":13,"body":14,"_type":996,"_id":997,"_source":998,"_file":999,"_stem":1000,"_extension":1001},"/technology-blogs/en/2502","en",false,"","Ice in the Hole for Efficient Development of Foundation Models - MindSpore PET","In this blog, we focus on the first article of the Ice in the Hole for Efficient Development of Foundation Models series to introduce MindSpore PET.","2023-04-07","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/05/23/1ca6f7f186d2450587c51b098a37ff08.png","technology-blogs","Practices",{"type":15,"children":16,"toc":993},"root",[17,25,31,36,53,61,69,77,91,102,113,124,132,142,153,161,169,179,187,195,205,213,221,229,237,245,253,261,269,274,279,284,289,294,299,304,309,314,318,323,328,333,338,346,354,362,389,397,408,416,427,449,457,465,473,481,489,500,508,516,524,532,543,565,573,581,589,597,605,616,624,632,640,648,656,664,672,680,691,699,707,715,723,731,739,747,755,763,777,785,796,811,819,829,840,848,852,857,862,867,872,877,882,887,892,897,902,907,911,916,921,926,931,936,941,946,950,954,959,964,969,974,985],{"type":18,"tag":19,"props":20,"children":22},"element","h1",{"id":21},"ice-in-the-hole-for-efficient-development-of-foundation-models-mindspore-pet",[23],{"type":24,"value":8},"text",{"type":18,"tag":26,"props":27,"children":28},"p",{},[29],{"type":24,"value":30},"AI has entered the \"era of foundation model\". Foundation models have stronger generalization capabilities. When applying foundation models to vertical domains, you only need to fine-tune parameters to adapt to multiple scenarios. 
Therefore, developing foundation models has become a consensus across industries.",{"type":18,"tag":26,"props":32,"children":33},{},[34],{"type":24,"value":35},"In this regard, Ascend has launched an enablement platform and full-process enablement kits based on MindSpore for foundation model development, including MindSpore TransFormers, MindSpore Diffusion (for text-to-image conversion), MindSpore RLHF (reinforcement learning from human feedback), and MindSpore PET (fine-tuning with fewer parameters), supporting pre-training, fine-tuning, compression, inference, and service-oriented deployment.",{"type":18,"tag":26,"props":37,"children":38},{},[39],{"type":18,"tag":40,"props":41,"children":42},"strong",{},[43,45,51],{"type":24,"value":44},"In this blog, we focus on the first article of the ",{"type":18,"tag":46,"props":47,"children":48},"em",{},[49],{"type":24,"value":50},"Ice in the Hole for Efficient Development of Foundation Models",{"type":24,"value":52}," series to introduce MindSpore PET.",{"type":18,"tag":26,"props":54,"children":55},{},[56],{"type":18,"tag":40,"props":57,"children":58},{},[59],{"type":24,"value":60},"MindSpore PET introduction: MindSpore PET (parameter-efficient tuning) is developed based on the MindSpore AI convergence framework. Currently, the kit provides six algorithms: five classic algorithms for fine-tuning with fewer parameters (LoRA, Prefix-Tuning, Adapter, LowRankAdapter, and BitFit) and R-Drop for improving the accuracy of downstream tasks. The fine-tuning algorithms update only a few parameters, greatly saving computing and storage memory and reducing the fine-tuning time while maintaining the accuracy of full-parameter fine-tuning. 
The accuracy-improvement algorithm increases model randomness without adding computing memory or training time, preventing model overfitting.",{"type":18,"tag":26,"props":62,"children":63},{},[64],{"type":18,"tag":40,"props":65,"children":66},{},[67],{"type":24,"value":68},"The kit provides APIs and use cases for all algorithms to realize out-of-the-box use. In addition, the kit provides APIs that save only the few learnable parameters of these algorithms, keeping the generated CKPT file small.",{"type":18,"tag":26,"props":70,"children":71},{},[72],{"type":18,"tag":40,"props":73,"children":74},{},[75],{"type":24,"value":76},"Open source repository:",{"type":18,"tag":26,"props":78,"children":79},{},[80],{"type":18,"tag":40,"props":81,"children":82},{},[83],{"type":18,"tag":84,"props":85,"children":89},"a",{"href":86,"rel":87},"https://github.com/mindspore-lab/MindPet",[88],"nofollow",[90],{"type":24,"value":86},{"type":18,"tag":26,"props":92,"children":93},{},[94],{"type":18,"tag":40,"props":95,"children":96},{},[97],{"type":18,"tag":98,"props":99,"children":101},"img",{"alt":7,"src":100},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/05/23/8df5c355dfcb4a629c3aeea62b7f1626.png",[],{"type":18,"tag":26,"props":103,"children":104},{},[105],{"type":18,"tag":40,"props":106,"children":107},{},[108],{"type":18,"tag":40,"props":109,"children":110},{},[111],{"type":24,"value":112},"MindSpore PET - LoRA",{"type":18,"tag":26,"props":114,"children":115},{},[116],{"type":18,"tag":40,"props":117,"children":118},{},[119],{"type":18,"tag":40,"props":120,"children":121},{},[122],{"type":24,"value":123},"Algorithm Principles",{"type":18,"tag":26,"props":125,"children":126},{},[127],{"type":18,"tag":40,"props":128,"children":129},{},[130],{"type":24,"value":131},"LoRA stands for low-rank adaptation. It is an algorithm developed by Microsoft for fine-tuning language foundation models with fewer parameters. 
LoRA assumes that when adapting to downstream tasks, the fully connected layers of the foundation model have a low intrinsic rank and therefore contain a large amount of redundant information. Therefore, LoRA injects trainable rank-decomposition matrices into each fully connected layer of the Transformer architecture and freezes the weights of the original pre-trained model, so that the number of parameters participating in training is greatly reduced.",{"type":18,"tag":26,"props":133,"children":134},{},[135],{"type":18,"tag":40,"props":136,"children":137},{},[138],{"type":18,"tag":98,"props":139,"children":141},{"alt":7,"src":140},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/05/23/616386c16bd840e2ab48cf8ee37e0ab0.png",[],{"type":18,"tag":26,"props":143,"children":144},{},[145],{"type":18,"tag":40,"props":146,"children":147},{},[148],{"type":18,"tag":40,"props":149,"children":150},{},[151],{"type":24,"value":152},"Application effect - Wukong-Huahua as an example",{"type":18,"tag":26,"props":154,"children":155},{},[156],{"type":18,"tag":40,"props":157,"children":158},{},[159],{"type":24,"value":160},"Wukong-Huahua is a Chinese text-to-image foundation model based on the diffusion model. Despite its powerful capabilities, its huge network scale, large number of parameters (about 900 million), and long training time when adapting to downstream tasks make it costly in computing and storage memory to deploy.",{"type":18,"tag":26,"props":162,"children":163},{},[164],{"type":18,"tag":40,"props":165,"children":166},{},[167],{"type":24,"value":168},"The CLIP model converts human language into mathematical vectors that machines can understand, and the U-Net model predicts noise. 
The attention structures of both models contain fully connected layers, which may carry a large amount of redundant information when adapting to downstream tasks.",{"type":18,"tag":26,"props":170,"children":171},{},[172],{"type":18,"tag":40,"props":173,"children":174},{},[175],{"type":18,"tag":98,"props":176,"children":178},{"alt":7,"src":177},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/05/23/bb984edd9e7b45b597b9ad2fdd150fa0.png",[],{"type":18,"tag":26,"props":180,"children":181},{},[182],{"type":18,"tag":40,"props":183,"children":184},{},[185],{"type":24,"value":186},"Therefore, the LoRA module is injected into the query, key, value, and output (QKVO) modules of the cross-attention layers of the U-Net, and the resulting performance is outstanding.",{"type":18,"tag":26,"props":188,"children":189},{},[190],{"type":18,"tag":40,"props":191,"children":192},{},[193],{"type":24,"value":194},"As shown in the following figure, after LoRA adaptation, high-quality images can be generated even though only 0.07% of the parameters are trained.",{"type":18,"tag":26,"props":196,"children":197},{},[198],{"type":18,"tag":40,"props":199,"children":200},{},[201],{"type":18,"tag":98,"props":202,"children":204},{"alt":7,"src":203},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/05/23/5b2e64a0d0af4e76a274f9f6b2ed3f05.png",[],{"type":18,"tag":26,"props":206,"children":207},{},[208],{"type":18,"tag":40,"props":209,"children":210},{},[211],{"type":24,"value":212},"In addition, compared with full-parameter fine-tuning, the LoRA algorithm greatly improves training performance.",{"type":18,"tag":26,"props":214,"children":215},{},[216],{"type":18,"tag":40,"props":217,"children":218},{},[219],{"type":24,"value":220},"Model",{"type":18,"tag":26,"props":222,"children":223},{},[224],{"type":18,"tag":40,"props":225,"children":226},{},[227],{"type":24,"value":228},"Number of 
Epochs",{"type":18,"tag":26,"props":230,"children":231},{},[232],{"type":18,"tag":40,"props":233,"children":234},{},[235],{"type":24,"value":236},"Parameter Quantity",{"type":18,"tag":26,"props":238,"children":239},{},[240],{"type":18,"tag":40,"props":241,"children":242},{},[243],{"type":24,"value":244},"Training Time for One Epoch",{"type":18,"tag":26,"props":246,"children":247},{},[248],{"type":18,"tag":40,"props":249,"children":250},{},[251],{"type":24,"value":252},"Static Memory",{"type":18,"tag":26,"props":254,"children":255},{},[256],{"type":18,"tag":40,"props":257,"children":258},{},[259],{"type":24,"value":260},"Dynamic Memory",{"type":18,"tag":26,"props":262,"children":263},{},[264],{"type":18,"tag":40,"props":265,"children":266},{},[267],{"type":24,"value":268},"CKPT Size",{"type":18,"tag":26,"props":270,"children":271},{},[272],{"type":24,"value":273},"Baseline",{"type":18,"tag":26,"props":275,"children":276},{},[277],{"type":24,"value":278},"10",{"type":18,"tag":26,"props":280,"children":281},{},[282],{"type":24,"value":283},"100%",{"type":18,"tag":26,"props":285,"children":286},{},[287],{"type":24,"value":288},"103 min",{"type":18,"tag":26,"props":290,"children":291},{},[292],{"type":24,"value":293},"13898 MB",{"type":18,"tag":26,"props":295,"children":296},{},[297],{"type":24,"value":298},"13952 MB",{"type":18,"tag":26,"props":300,"children":301},{},[302],{"type":24,"value":303},"4 GB",{"type":18,"tag":26,"props":305,"children":306},{},[307],{"type":24,"value":308},"LoRA",{"type":18,"tag":26,"props":310,"children":311},{},[312],{"type":24,"value":313},"15",{"type":18,"tag":26,"props":315,"children":316},{},[317],{"type":24,"value":283},{"type":18,"tag":26,"props":319,"children":320},{},[321],{"type":24,"value":322},"37 min",{"type":18,"tag":26,"props":324,"children":325},{},[326],{"type":24,"value":327},"4120 MB",{"type":18,"tag":26,"props":329,"children":330},{},[331],{"type":24,"value":332},"12908 
MB",{"type":18,"tag":26,"props":334,"children":335},{},[336],{"type":24,"value":337},"3.06 MB + 3.97 GB",{"type":18,"tag":26,"props":339,"children":340},{},[341],{"type":18,"tag":40,"props":342,"children":343},{},[344],{"type":24,"value":345},"(1) Full-parameter fine-tuning originally takes 17 hours. After LoRA adaptation, it takes only 9 hours, saving nearly 50% of the time.",{"type":18,"tag":26,"props":347,"children":348},{},[349],{"type":18,"tag":40,"props":350,"children":351},{},[352],{"type":24,"value":353},"(2) The computing memory is reduced by 40%, so you can double the batch size to further improve the speed.",{"type":18,"tag":26,"props":355,"children":356},{},[357],{"type":18,"tag":40,"props":358,"children":359},{},[360],{"type":24,"value":361},"(3) The saved CKPT file is only 3.06 MB, instead of the 4 GB needed to save all parameters.",{"type":18,"tag":26,"props":363,"children":364},{},[365],{"type":18,"tag":40,"props":366,"children":367},{},[368,370,375,377,381,383,387],{"type":24,"value":369},"When there are ",{"type":18,"tag":46,"props":371,"children":372},{},[373],{"type":24,"value":374},"n",{"type":24,"value":376}," downstream tasks, only ",{"type":18,"tag":46,"props":378,"children":379},{},[380],{"type":24,"value":374},{"type":24,"value":382}," x 3.06 MB rather than ",{"type":18,"tag":46,"props":384,"children":385},{},[386],{"type":24,"value":374},{"type":24,"value":388}," x 4 GB needs to be saved. 
Moreover, experiments demonstrate an inspiring result: if you have trained models of multiple styles, switching between them takes only 0.5s, seamlessly shifting from Pablo Picasso to Shinkai Makoto.",{"type":18,"tag":26,"props":390,"children":391},{},[392],{"type":18,"tag":40,"props":393,"children":394},{},[395],{"type":24,"value":396},"Thanks to the static graph feature of the MindSpore framework, the graph needs to be compiled only during the first forward pass of training, even if other LoRA CKPT files are loaded later to update parameters.",{"type":18,"tag":26,"props":398,"children":399},{},[400],{"type":18,"tag":40,"props":401,"children":402},{},[403],{"type":18,"tag":40,"props":404,"children":405},{},[406],{"type":24,"value":407},"Usage",{"type":18,"tag":26,"props":409,"children":410},{},[411],{"type":18,"tag":40,"props":412,"children":413},{},[414],{"type":24,"value":415},"The LoRA algorithm, designed to ease the burden of foundation models, is user-friendly by nature: end-to-end adaptation can be completed in only five steps.",{"type":18,"tag":26,"props":417,"children":418},{},[419],{"type":18,"tag":40,"props":420,"children":421},{},[422],{"type":18,"tag":40,"props":423,"children":424},{},[425],{"type":24,"value":426},"Step 1:",{"type":18,"tag":26,"props":428,"children":429},{},[430],{"type":18,"tag":40,"props":431,"children":432},{},[433,435,440,442,447],{"type":24,"value":434},"Replace the ",{"type":18,"tag":40,"props":436,"children":437},{},[438],{"type":24,"value":439},"Dense",{"type":24,"value":441}," layer of QKVO in the cross-attention structure with ",{"type":18,"tag":40,"props":443,"children":444},{},[445],{"type":24,"value":446},"LoRADense",{"type":24,"value":448},".",{"type":18,"tag":26,"props":450,"children":451},{},[452],{"type":18,"tag":40,"props":453,"children":454},{},[455],{"type":24,"value":456},"from tk.delta import 
LoRADense",{"type":18,"tag":26,"props":458,"children":459},{},[460],{"type":18,"tag":40,"props":461,"children":462},{},[463],{"type":24,"value":464},"# Original Dense layer",{"type":18,"tag":26,"props":466,"children":467},{},[468],{"type":18,"tag":40,"props":469,"children":470},{},[471],{"type":24,"value":472},"# self.to_q = nn.Dense(query_dim, inner_dim, has_bias=False).to_float(dtype)",{"type":18,"tag":26,"props":474,"children":475},{},[476],{"type":18,"tag":40,"props":477,"children":478},{},[479],{"type":24,"value":480},"# Replace Dense layer with LoRADense",{"type":18,"tag":26,"props":482,"children":483},{},[484],{"type":18,"tag":40,"props":485,"children":486},{},[487],{"type":24,"value":488},"self.to_q = LoRADense(query_dim, inner_dim, has_bias=False, lora_rank=4, lora_alpha=4).to_float(dtype)",{"type":18,"tag":26,"props":490,"children":491},{},[492],{"type":18,"tag":40,"props":493,"children":494},{},[495],{"type":18,"tag":40,"props":496,"children":497},{},[498],{"type":24,"value":499},"Step 2:",{"type":18,"tag":26,"props":501,"children":502},{},[503],{"type":18,"tag":40,"props":504,"children":505},{},[506],{"type":24,"value":507},"Invoke the freezing method in the training script to train only the new LoRA module.",{"type":18,"tag":26,"props":509,"children":510},{},[511],{"type":18,"tag":40,"props":512,"children":513},{},[514],{"type":24,"value":515},"from tk.graph import freeze_delta",{"type":18,"tag":26,"props":517,"children":518},{},[519],{"type":18,"tag":40,"props":520,"children":521},{},[522],{"type":24,"value":523},"# Freeze all cells except LoRA and head",{"type":18,"tag":26,"props":525,"children":526},{},[527],{"type":18,"tag":40,"props":528,"children":529},{},[530],{"type":24,"value":531},"freeze_delta(LatentDiffusionWithLoss, 'lora')",{"type":18,"tag":26,"props":533,"children":534},{},[535],{"type":18,"tag":40,"props":536,"children":537},{},[538],{"type":18,"tag":40,"props":539,"children":540},{},[541],{"type":24,"value":542},"Step 
3:",{"type":18,"tag":26,"props":544,"children":545},{},[546],{"type":18,"tag":40,"props":547,"children":548},{},[549,551,556,558,563],{"type":24,"value":550},"In the training script, replace ",{"type":18,"tag":40,"props":552,"children":553},{},[554],{"type":24,"value":555},"ModelCheckpoint",{"type":24,"value":557},", which saves the CKPT file, with ",{"type":18,"tag":40,"props":559,"children":560},{},[561],{"type":24,"value":562},"TrainableParamsCheckPoint",{"type":24,"value":564}," so that only the parameters to be updated are saved.",{"type":18,"tag":26,"props":566,"children":567},{},[568],{"type":18,"tag":40,"props":569,"children":570},{},[571],{"type":24,"value":572},"from tk.graph import TrainableParamsCheckPoint",{"type":18,"tag":26,"props":574,"children":575},{},[576],{"type":18,"tag":40,"props":577,"children":578},{},[579],{"type":24,"value":580},"# Original callback",{"type":18,"tag":26,"props":582,"children":583},{},[584],{"type":18,"tag":40,"props":585,"children":586},{},[587],{"type":24,"value":588},"# ckpt_callback = ModelCheckpoint(...)",{"type":18,"tag":26,"props":590,"children":591},{},[592],{"type":18,"tag":40,"props":593,"children":594},{},[595],{"type":24,"value":596},"# Replace ModelCheckpoint with TrainableParamsCheckPoint",{"type":18,"tag":26,"props":598,"children":599},{},[600],{"type":18,"tag":40,"props":601,"children":602},{},[603],{"type":24,"value":604},"ckpt_callback = TrainableParamsCheckPoint(...)",{"type":18,"tag":26,"props":606,"children":607},{},[608],{"type":18,"tag":40,"props":609,"children":610},{},[611],{"type":18,"tag":40,"props":612,"children":613},{},[614],{"type":24,"value":615},"Step 4:",{"type":18,"tag":26,"props":617,"children":618},{},[619],{"type":18,"tag":40,"props":620,"children":621},{},[622],{"type":24,"value":623},"Adjust parameters including the learning rate and batch size based on the training 
objective.",{"type":18,"tag":26,"props":625,"children":626},{},[627],{"type":18,"tag":40,"props":628,"children":629},{},[630],{"type":24,"value":631},"epochs: 15",{"type":18,"tag":26,"props":633,"children":634},{},[635],{"type":18,"tag":40,"props":636,"children":637},{},[638],{"type":24,"value":639},"start_learning_rate: 1e-4",{"type":18,"tag":26,"props":641,"children":642},{},[643],{"type":18,"tag":40,"props":644,"children":645},{},[646],{"type":24,"value":647},"end_learning_rate: 1e-6",{"type":18,"tag":26,"props":649,"children":650},{},[651],{"type":18,"tag":40,"props":652,"children":653},{},[654],{"type":24,"value":655},"train_batch_size: 3",{"type":18,"tag":26,"props":657,"children":658},{},[659],{"type":18,"tag":40,"props":660,"children":661},{},[662],{"type":24,"value":663},"warmup_steps: 0",{"type":18,"tag":26,"props":665,"children":666},{},[667],{"type":18,"tag":40,"props":668,"children":669},{},[670],{"type":24,"value":671},"lora_rank: 4",{"type":18,"tag":26,"props":673,"children":674},{},[675],{"type":18,"tag":40,"props":676,"children":677},{},[678],{"type":24,"value":679},"lora_alpha: 4",{"type":18,"tag":26,"props":681,"children":682},{},[683],{"type":18,"tag":40,"props":684,"children":685},{},[686],{"type":18,"tag":40,"props":687,"children":688},{},[689],{"type":24,"value":690},"Step 5:",{"type":18,"tag":26,"props":692,"children":693},{},[694],{"type":18,"tag":40,"props":695,"children":696},{},[697],{"type":24,"value":698},"After training is complete, load the pre-trained CKPT file and the CKPT file generated by fine-tuning in the evaluation script.",{"type":18,"tag":26,"props":700,"children":701},{},[702],{"type":18,"tag":40,"props":703,"children":704},{},[705],{"type":24,"value":706},"# Load the pre-trained CKPT file.",{"type":18,"tag":26,"props":708,"children":709},{},[710],{"type":18,"tag":40,"props":711,"children":712},{},[713],{"type":24,"value":714},"pre_trained_params = load_checkpoint(pre_trained_ckpt_path)",{"type":18,"tag":26,"props":716,"children":717},{},[718],{"type":18,"tag":40,"props":719,"children":720},{},[721],{"type":24,"value":722},"load_param_into_net(net, pre_trained_params)",{"type":18,"tag":26,"props":724,"children":725},{},[726],{"type":18,"tag":40,"props":727,"children":728},{},[729],{"type":24,"value":730},"# Load the CKPT file generated by fine-tuning.",{"type":18,"tag":26,"props":732,"children":733},{},[734],{"type":18,"tag":40,"props":735,"children":736},{},[737],{"type":24,"value":738},"trainable_params = load_checkpoint(trainable_ckpt_path)",{"type":18,"tag":26,"props":740,"children":741},{},[742],{"type":18,"tag":40,"props":743,"children":744},{},[745],{"type":24,"value":746},"load_param_into_net(net, trainable_params)",{"type":18,"tag":26,"props":748,"children":749},{},[750],{"type":18,"tag":40,"props":751,"children":752},{},[753],{"type":24,"value":754},"# Start evaluation.",{"type":18,"tag":26,"props":756,"children":757},{},[758],{"type":18,"tag":40,"props":759,"children":760},{},[761],{"type":24,"value":762},"model.eval()",{"type":18,"tag":26,"props":764,"children":765},{},[766],{"type":18,"tag":40,"props":767,"children":768},{},[769,771],{"type":24,"value":770},"We have released all the code and provided detailed descriptions of the APIs and use cases: ",{"type":18,"tag":84,"props":772,"children":775},{"href":773,"rel":774},"https://github.com/mindspore-lab/MindPet/blob/master/doc/TK_DeltaAlgorithm_README.md",[88],[776],{"type":24,"value":773},{"type":18,"tag":26,"props":778,"children":779},{},[780],{"type":18,"tag":40,"props":781,"children":782},{},[783],{"type":24,"value":784},"Note that compared with full-parameter fine-tuning, a larger learning rate is needed after LoRA adaptation. 
For example, during the adaptation to Wukong-Huahua, the learning rate is increased from 1e-5 to 1e-4.",{"type":18,"tag":26,"props":786,"children":787},{},[788],{"type":18,"tag":40,"props":789,"children":790},{},[791],{"type":18,"tag":40,"props":792,"children":793},{},[794],{"type":24,"value":795},"MindSpore PET - prefix-tuning",{"type":18,"tag":26,"props":797,"children":798},{},[799],{"type":18,"tag":40,"props":800,"children":801},{},[802,804,809],{"type":24,"value":803},"Prefix-tuning, proposed in ",{"type":18,"tag":46,"props":805,"children":806},{},[807],{"type":24,"value":808},"Prefix-Tuning: Optimizing Continuous Prompts for Generation",{"type":24,"value":810},", is another algorithm for fine-tuning foundation language models with fewer parameters. The researchers suggest using continuous vectors instead of discrete words to construct prefix templates, that is, adding continuous token embeddings before the input, which can increase the correlation between the query and key. Prefix-tuning is therefore implemented by injecting trainable prefix vectors k and v before the key and value matrices of each multi-head attention and freezing the original network parameters, greatly improving the performance of generative tasks.",{"type":18,"tag":26,"props":812,"children":813},{},[814],{"type":18,"tag":40,"props":815,"children":816},{},[817],{"type":24,"value":818},"Prefix-tuning achieves good results on both the GPT-2 and Pangu Alpha foundation models. 
Compared with full-parameter fine-tuning, using prefix-tuning to train Pangu Alpha requires only 5.5% of the parameters while maintaining the original accuracy, saving more than 65% of the computing memory and halving the time required for an iteration.",{"type":18,"tag":26,"props":820,"children":821},{},[822],{"type":18,"tag":40,"props":823,"children":824},{},[825],{"type":18,"tag":98,"props":826,"children":828},{"alt":7,"src":827},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/05/23/89d1d75f046143fe9331fbcacf96e7df.png",[],{"type":18,"tag":26,"props":830,"children":831},{},[832],{"type":18,"tag":40,"props":833,"children":834},{},[835],{"type":18,"tag":40,"props":836,"children":837},{},[838],{"type":24,"value":839},"MindSpore PET - R-Drop",{"type":18,"tag":26,"props":841,"children":842},{},[843],{"type":18,"tag":40,"props":844,"children":845},{},[846],{"type":24,"value":847},"R-Drop stands for regularized dropout. It is a fine-tuning algorithm used to improve accuracy. It uses two simple dropouts to construct positive samples for contrastive learning, improving model randomness. Specifically, after the model loads a batch of data, it copies the data, feeds both copies to the model, calculates a loss for each, and adds the results to obtain the final loss value. Despite the simple logic, the algorithm can efficiently prevent model overfitting and further improve accuracy. 
According to verification on multiple downstream tasks with BERT, the accuracy can be improved by 2.6% with almost the same memory and time overhead.",{"type":18,"tag":26,"props":849,"children":850},{},[851],{"type":24,"value":220},{"type":18,"tag":26,"props":853,"children":854},{},[855],{"type":24,"value":856},"Downstream Task",{"type":18,"tag":26,"props":858,"children":859},{},[860],{"type":24,"value":861},"Mode",{"type":18,"tag":26,"props":863,"children":864},{},[865],{"type":24,"value":866},"Training Parameter",{"type":18,"tag":26,"props":868,"children":869},{},[870],{"type":24,"value":871},"Memory",{"type":18,"tag":26,"props":873,"children":874},{},[875],{"type":24,"value":876},"Time per Step",{"type":18,"tag":26,"props":878,"children":879},{},[880],{"type":24,"value":881},"Accuracy",{"type":18,"tag":26,"props":883,"children":884},{},[885],{"type":24,"value":886},"epoch",{"type":18,"tag":26,"props":888,"children":889},{},[890],{"type":24,"value":891},"dropout_rate",{"type":18,"tag":26,"props":893,"children":894},{},[895],{"type":24,"value":896},"alpha",{"type":18,"tag":26,"props":898,"children":899},{},[900],{"type":24,"value":901},"bert-base",{"type":18,"tag":26,"props":903,"children":904},{},[905],{"type":24,"value":906},"IFLYTEK",{"type":18,"tag":26,"props":908,"children":909},{},[910],{"type":24,"value":273},{"type":18,"tag":26,"props":912,"children":913},{},[914],{"type":24,"value":915},"50",{"type":18,"tag":26,"props":917,"children":918},{},[919],{"type":24,"value":920},"0.2",{"type":18,"tag":26,"props":922,"children":923},{},[924],{"type":24,"value":925},"\\",{"type":18,"tag":26,"props":927,"children":928},{},[929],{"type":24,"value":930},"1183 MB",{"type":18,"tag":26,"props":932,"children":933},{},[934],{"type":24,"value":935},"116.147 
ms",{"type":18,"tag":26,"props":937,"children":938},{},[939],{"type":24,"value":940},"0.5952",{"type":18,"tag":26,"props":942,"children":943},{},[944],{"type":24,"value":945},"R-Drop",{"type":18,"tag":26,"props":947,"children":948},{},[949],{"type":24,"value":915},{"type":18,"tag":26,"props":951,"children":952},{},[953],{"type":24,"value":920},{"type":18,"tag":26,"props":955,"children":956},{},[957],{"type":24,"value":958},"6",{"type":18,"tag":26,"props":960,"children":961},{},[962],{"type":24,"value":963},"1195 MB",{"type":18,"tag":26,"props":965,"children":966},{},[967],{"type":24,"value":968},"117.668 ms",{"type":18,"tag":26,"props":970,"children":971},{},[972],{"type":24,"value":973},"0.6211",{"type":18,"tag":26,"props":975,"children":976},{},[977],{"type":18,"tag":40,"props":978,"children":979},{},[980],{"type":18,"tag":40,"props":981,"children":982},{},[983],{"type":24,"value":984},"Foundation model development involves demanding requirements and complex processes. With the help of enablement kits, developers can make models easier to develop, adapt, and deploy.",{"type":18,"tag":26,"props":986,"children":987},{},[988],{"type":18,"tag":40,"props":989,"children":990},{},[991],{"type":24,"value":992},"For more information about MindSpore TransFormers, MindSpore Diffusion, and MindSpore RLHF, please follow our MindSpore official account on WeChat. We will continue to bring you information about AI technologies and activities.",{"title":7,"searchDepth":994,"depth":994,"links":995},4,[],"markdown","content:technology-blogs:en:2502.md","content","technology-blogs/en/2502.md","technology-blogs/en/2502","md",1776506106026]