[{"data":1,"prerenderedAt":348},["ShallowReactive",2],{"content-query-qnsSEPPjON":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":342,"_id":343,"_source":344,"_file":345,"_stem":346,"_extension":347},"/technology-blogs/en/3020","en",false,"","Learning Foundation Models from Scratch — Rise of GPT2: Zero Shot","Author: Xing Guangjie | Source: Zhihu","2024-02-07","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/15/f7a4539a321e4980af6231b8d6c8b8ae.png","technology-blogs",{"type":14,"children":15,"toc":335},"root",[16,24,29,34,39,44,49,58,68,73,78,86,91,96,101,110,115,122,132,139,144,151,156,163,172,184,192,200,205,213,232,237,242,247,257,269,277,289,294,299,307,312,317,322,330],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"learning-foundation-models-from-scratch-rise-of-gpt2-zero-shot",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":23,"value":9},{"type":17,"tag":25,"props":30,"children":31},{},[32],{"type":23,"value":33},"MindSpore hopes to establish channels for communication and sharing between developers with this series of blogs,",{"type":17,"tag":25,"props":35,"children":36},{},[37],{"type":23,"value":38},"which focus on the learning experience of the MindSpore open courses. The courses have two topics on foundation models from Transformer to LLaMA. The first session (talk 1 to talk 10) has been concluded, which explains the evolution roadmap starting from Transformer to ChatGPT and provides guidance on building a simple version of ChatGPT. The ongoing second session (talk 11) is improved based on the previous courses, focusing on the full-process practice from development to application with more advanced knowledge and experienced teachers. 
The courses are combined with the deep learning framework MindSpore, and the difficulty increases gradually, which is friendly to students hoping to learn machine learning, especially foundation model technologies.",{"type":17,"tag":25,"props":40,"children":41},{},[42],{"type":23,"value":43},"Course materials:",{"type":17,"tag":25,"props":45,"children":46},{},[47],{"type":23,"value":48},"github.com/mindspore-courses/step_into_llm",{"type":17,"tag":25,"props":50,"children":51},{},[52],{"type":17,"tag":53,"props":54,"children":55},"strong",{},[56],{"type":23,"value":57},"1. Learning Summary",{"type":17,"tag":59,"props":60,"children":62},"h3",{"id":61},"core-technology-of-gpt-2",[63],{"type":17,"tag":53,"props":64,"children":65},{},[66],{"type":23,"value":67},"Core Technology of GPT-2",{"type":17,"tag":25,"props":69,"children":70},{},[71],{"type":23,"value":72},"• Task Conditioning",{"type":17,"tag":25,"props":74,"children":75},{},[76],{"type":23,"value":77},"When the training dataset is constructed for the decoder-only structure, both GPT and GPT-2 stitch the input and output segments together. However, GPT-2 additionally inserts a task identifier between the two segments.",{"type":17,"tag":25,"props":79,"children":80},{},[81],{"type":17,"tag":82,"props":83,"children":85},"img",{"alt":7,"src":84},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/15/72b8fb67735c4b8288dcff8540ecea77.png",[],{"type":17,"tag":25,"props":87,"children":88},{},[89],{"type":23,"value":90},"Task Conditioning concat",{"type":17,"tag":25,"props":92,"children":93},{},[94],{"type":23,"value":95},"• Zero Shot Learning and Zero Shot Task Transfer",{"type":17,"tag":25,"props":97,"children":98},{},[99],{"type":23,"value":100},"This mechanism is similar to the prompts commonly used today. 
The difference is that GPT-2 requires a prompt in a specific format to trigger a task, while other LLMs generally accept prompts described in natural language.",{"type":17,"tag":59,"props":102,"children":104},{"id":103},"implementation-of-gpt-2-fully-connected-layer-module",[105],{"type":17,"tag":53,"props":106,"children":107},{},[108],{"type":23,"value":109},"Implementation of GPT-2 Fully Connected Layer Module",{"type":17,"tag":25,"props":111,"children":112},{},[113],{"type":23,"value":114},"The common self-attention code is not shown here. During training, GPT-2 masks subsequent words so that they are invisible to the current position, which is consistent with the autoregressive generation performed during inference.",{"type":17,"tag":25,"props":116,"children":117},{},[118],{"type":17,"tag":82,"props":119,"children":121},{"alt":7,"src":120},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/15/8c3fcb3629374f159452f982cc2f5386.png",[],{"type":17,"tag":123,"props":124,"children":126},"pre",{"code":125},"import numpy as np\nimport mindspore\nfrom mindspore import Tensor\n\n# seq_len is assumed to be defined earlier in the notebook.\nmax_positions = seq_len\n\n# Lower-triangular causal mask: position i can attend only to positions \u003C= i.\nbias = Tensor(np.tril(np.ones((max_positions, max_positions))).reshape(\n              (1, 1, max_positions, max_positions)), mindspore.bool_)\nbias\n",[127],{"type":17,"tag":128,"props":129,"children":130},"code",{"__ignoreMap":7},[131],{"type":23,"value":125},{"type":17,"tag":25,"props":133,"children":134},{},[135],{"type":17,"tag":82,"props":136,"children":138},{"alt":7,"src":137},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/15/7d3bfb18027445d8811276818fced21f.png",[],{"type":17,"tag":25,"props":140,"children":141},{},[142],{"type":23,"value":143},"After softmax is performed, all the positions of the upper triangle change to 
0.",{"type":17,"tag":25,"props":145,"children":146},{},[147],{"type":17,"tag":82,"props":148,"children":150},{"alt":7,"src":149},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/15/a5342a73554d4713b1721c3db2adc673.png",[],{"type":17,"tag":25,"props":152,"children":153},{},[154],{"type":23,"value":155},"After self-attention, a fully connected projection layer (mindnlp.models.utils.utils.Conv1D) is added. That is, the resulting matrix is transformed once more to merge the outputs of the multiple heads and enhance the expression capability.",{"type":17,"tag":25,"props":157,"children":158},{},[159],{"type":17,"tag":82,"props":160,"children":162},{"alt":7,"src":161},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/15/1382f98044dd47bd896a7e50c843334e.png",[],{"type":17,"tag":59,"props":164,"children":166},{"id":165},"data-preprocessing",[167],{"type":17,"tag":53,"props":168,"children":169},{},[170],{"type":23,"value":171},"Data Preprocessing",{"type":17,"tag":25,"props":173,"children":174},{},[175,177,182],{"type":23,"value":176},"The key to data preprocessing is how to perform padding. For a fixed ",{"type":17,"tag":53,"props":178,"children":179},{},[180],{"type":23,"value":181},"max_seq_len",{"type":23,"value":183},", long training samples need to be truncated, samples that are too short need to be padded, and all samples need to be aligned. Different truncation strategies may produce different training results, so you need to strike a balance based on the information to be retained. 
For example, a summarization task is conducted in this course, so all summary data is retained while the source article is truncated.",{"type":17,"tag":123,"props":185,"children":187},{"code":186},"def merge_and_pad(article, summary):\n    article_len = len(article)\n    summary_len = len(summary)\n\n    sep_id = np.array([tokenizer.sep_token_id])\n    pad_id = np.array([tokenizer.pad_token_id])\n    if article_len + summary_len > max_seq_len:\n        # Truncate the article; keep the full summary.\n        new_article_len = max_seq_len - summary_len\n        merged = np.concatenate([article[:new_article_len], sep_id, summary[1:]])\n    elif article_len + summary_len - 1 \u003C max_seq_len:\n        # Pad short samples up to max_seq_len.\n        pad_len = max_seq_len - article_len - summary_len + 1\n        pad_text = np.array([tokenizer.pad_token_id] * pad_len)\n        merged = np.concatenate([article, summary[1:], pad_text])\n    else:\n        merged = np.concatenate([article, summary[1:]])\n\n    return merged.astype(np.int32)\n",[188],{"type":17,"tag":128,"props":189,"children":190},{"__ignoreMap":7},[191],{"type":23,"value":186},{"type":17,"tag":25,"props":193,"children":194},{},[195],{"type":17,"tag":53,"props":196,"children":197},{},[198],{"type":23,"value":199},"2. Focus",{"type":17,"tag":25,"props":201,"children":202},{},[203],{"type":23,"value":204},"Nowadays, the spotlight has shifted to newer models such as LLaMA and GLM. However, taking a retrospective look allows us to contemplate why past technologies gained prominence in the first place and how certain technologies that initially performed poorly were eventually proven effective.",{"type":17,"tag":25,"props":206,"children":207},{},[208],{"type":17,"tag":53,"props":209,"children":210},{},[211],{"type":23,"value":212},"3. 
Experience Sharing",{"type":17,"tag":25,"props":214,"children":215},{},[216,218,223,225,230],{"type":23,"value":217},"If select operations such as ",{"type":17,"tag":53,"props":219,"children":220},{},[221],{"type":23,"value":222},"attn_weights = where(causal_mask, attn_weights, mask_value)",{"type":23,"value":224}," are used, masking is slower than addition on Ascend hardware. In this case, directly add a large negative value (the minimum of the data type) to the masked positions, which does not affect the value range during training: ",{"type":17,"tag":53,"props":226,"children":227},{},[228],{"type":23,"value":229},"attn_weights = attn_weights + adder",{"type":23,"value":231},".",{"type":17,"tag":25,"props":233,"children":234},{},[235],{"type":23,"value":236},"If pre-trained GPT-2 is used directly for fine-tuning, the performance may be poor. In addition, the built-in tokenizer of GPT-2 does not support Chinese word segmentation, so you can use the pre-trained BERT-Base-Chinese tokenizer for word segmentation instead.",{"type":17,"tag":25,"props":238,"children":239},{},[240],{"type":23,"value":241},"from mindnlp.transforms import BertTokenizer",{"type":17,"tag":25,"props":243,"children":244},{},[245],{"type":23,"value":246},"tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')",{"type":17,"tag":25,"props":248,"children":249},{},[250,255],{"type":17,"tag":53,"props":251,"children":252},{},[253],{"type":23,"value":254},"from mindnlp.models import GPT2Config, GPT2LMHeadModel",{"type":23,"value":256},": GPT2LMHeadModel is designed for language-model pre-training, while other variants, such as GPT2Classification, can be used for downstream classification tasks.",{"type":17,"tag":25,"props":258,"children":259},{},[260,262,267],{"type":23,"value":261},"When mixed precision is used, pay attention to whether overflow occurs during training. 
You can use ",{"type":17,"tag":53,"props":263,"children":264},{},[265],{"type":23,"value":266},"mindspore.amp.DynamicLossScaler",{"type":23,"value":268}," to multiply the loss by a large scaling factor, so the gradients are scaled up by the same factor and extremely small values can still be represented, for example, in the float16 type. To recover the original gradient values, divide by the factor accordingly. Overflow detection makes the whole mixed precision training process better organized, and MindSpore supports convenient overflow detection for Ascend software and hardware.",{"type":17,"tag":123,"props":270,"children":272},{"code":271},"from mindspore import ops, ms_function\nfrom mindspore.amp import init_status, all_finite, DynamicLossScaler\n\n@ms_function\ndef train_step(data, label):\n    # Initialize the overflow detection bit.\n    status = init_status()\n    data = ops.depend(data, status)\n    loss, grads = grad_fn(data, label)\n    loss = loss_scaler.unscale(loss)\n    # Check whether the overflow detection bit is set.\n    is_finite = all_finite(grads, status)\n    # Apply the gradients only when no overflow occurred; otherwise skip this step.\n    if is_finite:\n        grads = loss_scaler.unscale(grads)\n        loss = ops.depend(loss, accumulator(grads))\n    loss = ops.depend(loss, loss_scaler.adjust(is_finite))\n    return loss, is_finite\n",[273],{"type":17,"tag":128,"props":274,"children":275},{"__ignoreMap":7},[276],{"type":23,"value":271},{"type":17,"tag":25,"props":278,"children":279},{},[280,282,287],{"type":23,"value":281},"When GPU memory is insufficient, use the ",{"type":17,"tag":53,"props":283,"children":284},{},[285],{"type":23,"value":286},"gradient accumulation technique",{"type":23,"value":288}," supported by MindSpore.",{"type":17,"tag":25,"props":290,"children":291},{},[292],{"type":23,"value":293},"from mindnlp.modules import Accumulator",{"type":17,"tag":25,"props":295,"children":296},{},[297],{"type":23,"value":298},"accumulator = Accumulator(optimizer, accumulate_step, 
max_grad_norm)",{"type":17,"tag":25,"props":300,"children":301},{},[302],{"type":17,"tag":53,"props":303,"children":304},{},[305],{"type":23,"value":306},"4. Feedback",{"type":17,"tag":25,"props":308,"children":309},{},[310],{"type":23,"value":311},"The course covers a wide range of content taught by professional teachers and provides a platform for interaction. The sharing of cutting-edge industry perspectives is the most valuable part, as students are rarely exposed to such knowledge.",{"type":17,"tag":25,"props":313,"children":314},{},[315],{"type":23,"value":316},"The map API of the MindSpore dataset is great as well.",{"type":17,"tag":25,"props":318,"children":319},{},[320],{"type":23,"value":321},"MindNLP benchmarks against HuggingFace, which makes it easier to use pre-trained models for fine-tuning.",{"type":17,"tag":25,"props":323,"children":324},{},[325],{"type":17,"tag":53,"props":326,"children":327},{},[328],{"type":23,"value":329},"5. Outlook",{"type":17,"tag":25,"props":331,"children":332},{},[333],{"type":23,"value":334},"The theory and the code are not well aligned. It may yield better outcomes if the new techniques of GPT-2 could be added to the code in a fairly simple way. For example, it is worth showing how GPT-2's distinguishing features appear in the code and how the prompt information is integrated into the training.",{"title":7,"searchDepth":336,"depth":336,"links":337},4,[338,340,341],{"id":61,"depth":339,"text":67},3,{"id":103,"depth":339,"text":109},{"id":165,"depth":339,"text":171},"markdown","content:technology-blogs:en:3020.md","content","technology-blogs/en/3020.md","technology-blogs/en/3020","md",1776506109334]