[{"data":1,"prerenderedAt":397},["ShallowReactive",2],{"content-query-V8atuK4Lsn":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":391,"_id":392,"_source":393,"_file":394,"_stem":395,"_extension":396},"/technology-blogs/en/2830","en",false,"","RWKV: The New RNN in the Transformer Era","In the Transformer era, RWKV is an innovative deep learning network architecture that combines the advantages of Transformer and RNN.","2023-09-15","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/e8c69f853d494e75bf2aa2ad5543b5c6.png","technology-blogs",{"type":14,"children":15,"toc":388},"root",[16,24,30,39,44,49,57,62,67,74,79,87,92,100,105,110,117,122,127,134,139,146,151,156,161,166,174,179,186,191,196,201,209,214,219,224,229,236,241,246,251,256,261,266,271,278,283,288,293,298,305,313,318,325,333,338,345,353,358,365,373,378,383],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"rwkv-the-new-rnn-in-the-transformer-era",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":23,"value":29},"In the Transformer era, RWKV is an innovative deep learning network architecture that combines the advantages of Transformer and RNN. It can perform highly parallel training and efficient inference, with a linear time complexity, which could outperform Transformer in long sequence inference.",{"type":17,"tag":25,"props":31,"children":32},{},[33],{"type":17,"tag":34,"props":35,"children":36},"strong",{},[37],{"type":23,"value":38},"1. Overview",{"type":17,"tag":25,"props":40,"children":41},{},[42],{"type":23,"value":43},"When the natural language uses RNN for modeling at the beginning, RNN is a feature extraction network architecture based on the loop layer, which can transfer the hidden state of a previous time step to the next time step in natural language modeling.",{"type":17,"tag":25,"props":45,"children":46},{},[47],{"type":23,"value":48},"Because RNN has a loop architecture (see the following figure), calculation of each time step depends on the hidden state of the previous time step, resulting in high calculation complexity. Moreover, gradient disappearance or gradient explosion may occur leading to low training efficiency. Therefore, the scalability of RNN is not good.",{"type":17,"tag":25,"props":50,"children":51},{},[52],{"type":17,"tag":53,"props":54,"children":56},"img",{"alt":7,"src":55},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/cdb562f631144257980d08fc5720e076.png",[],{"type":17,"tag":25,"props":58,"children":59},{},[60],{"type":23,"value":61},"RNN architecture",{"type":17,"tag":25,"props":63,"children":64},{},[65],{"type":23,"value":66},"Transformer was proposed by Google in 2017. It is a feature extraction network architecture based on the self-attention mechanism and is mainly used for natural language processing. The self-attention mechanism can calculation each location in the input sequence, so as to obtain the global context information. Encoders and decoders in Transformer can handle tasks such as machine translation and text generation. The core of Transformer is the self-attention mechanism (see the following figure). Because it processes natural languages in complete sentences, it features high training efficiency and parallel processing. The disadvantage of Transformer is high calculation complexity. 
**2. Basic Principles**

RWKV was proposed to address the problems of both architectures: it builds on the linear attention mechanism to solve the parallelization difficulty of RNNs, achieving RNN-like time complexity with Transformer-like performance. To understand its basic principles, let's first look at the Linear Transformer and the Attention Free Transformer.

**2.1 Linear Transformer**

The Linear Transformer reduces the computational complexity of self-attention from O(N^2) to O(N), where N is the sequence length. This is very important for accelerating the Transformer as a whole.

The typical self-attention computation in the Transformer is as follows:

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/64ce2d1d3ca54a278e96fafe18e0d39d.png)

Formula (1)

The matrices Q, K, and V are the Query, Key, and Value obtained by applying linear projections to the input x. If we use a subscript i to denote the ith row of a matrix (for example, Qi is the ith row of Q), the computation in formula (1) can be abstracted into the following form:

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/430299566b87484b81d31511b9db7fdb.png)

The computational complexity of the original Transformer grows quadratically with the sequence length N because the self-attention computation contains two nested loops. The outer loop computes the new representation of the token corresponding to each query; in the inner loop, that query must be matched against every key. In other words, the outer loop is "for q in Queries" and the inner loop is "for k in Keys"; since there are N queries and N keys, the complexity is O(N^2). The Linear Transformer keeps only the outer "for q in Queries" loop: because the summation term does not depend on i, all Qi can share its value. It can be computed once, stored in memory, and reused for every Qi. Therefore, the computational complexity of the Linear Transformer is O(N). Two new symbols are introduced:

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/bd391d396b15418e881a83d9d54f59c5.png)

During inference, when the output at moment i is needed, the Linear Transformer can reuse the previous states Si-1 and Zi-1 and only add the computation for the current moment. The Transformer, by contrast, cannot reuse any of the computation from moment i-1 when producing the output at moment i. The Linear Transformer is therefore more efficient.

To summarize this section:

The computational complexity of the Linear Transformer is O(N) (not counting the embedding dimension).

As the formulas above show, because Si can be computed from Si-1 (and similarly for Zi), sequential decoding is possible: S1 is computed first, S2 is computed from S1, and so on. This sequential decoding is the main reason why this type of Transformer resembles an RNN.
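As an illustration of this recurrence, here is a minimal sketch of sequential linear-attention decoding, assuming the elu(x)+1 feature map commonly used in the linear attention literature; the function and variable names are our own.

```python
import numpy as np

def phi(x):
    """Feature map from the linear attention literature: elu(x) + 1 (positive)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_step(q_i, k_i, v_i, S, Z):
    """One decoding step: update the running states and emit output i.

    S accumulates sum_j phi(k_j) v_j^T  (d x d matrix);
    Z accumulates sum_j phi(k_j)        (d vector).
    Each step costs O(d^2), independent of sequence length, so the
    whole sequence is O(N) in N.
    """
    S = S + np.outer(phi(k_i), v_i)
    Z = Z + phi(k_i)
    out = (phi(q_i) @ S) / (phi(q_i) @ Z + 1e-8)
    return out, S, Z

d = 16
rng = np.random.default_rng(0)
S, Z = np.zeros((d, d)), np.zeros(d)
for i in range(8):  # sequential decoding, reusing Si-1 and Zi-1
    q, k, v = rng.normal(size=(3, d))
    y, S, Z = linear_attention_step(q, k, v, S, Z)
print(y.shape)  # (16,)
```

Note that each step touches only the fixed-size states S and Z, never the whole history; this is exactly why decoding behaves like an RNN.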
**2.2 Attention Free Transformer**

The Attention Free Transformer (AFT) is a neural network model proposed by Apple. Starting from the traditional Transformer, AFT eliminates the dot-product attention mechanism, replacing it with element-wise operations and learned position biases, which reduces computation while maintaining performance. The decoder form of AFT is:

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/ded6ff5cf85a4963a792ffa7262ce5d4.png)

Formula (6)

Here σ is the sigmoid function, ⊙ is the element-wise product, and wi,j is a trainable parameter. The form of AFT differs from the Linear Transformer. The first difference is the attention score: the Linear Transformer assigns one weight to each value, just as the Transformer does, whereas AFT assigns a weight to each dimension. In other words, in the Linear Transformer all dimensions of the same value share one weight, while in AFT different dimensions of a value receive different weights. In addition, the computation of the attention score becomes particularly simple: K is merely offset by a trainable bias, and Q is used much like a gate.

Referring to formula (5), AFT can also be written in a recursive form. Seen this way, AFT, like the Linear Transformer, can reuse the computation of the previous moment during inference, behaving like an RNN and becoming more efficient than the Transformer.
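As a rough sketch of that recursive form, here is the simplified variant in which the pairwise position bias w is dropped (known as AFT-simple); with the full bias the bookkeeping is more involved, so this is illustrative only, and all names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_simple_decode(Q, K, V):
    """AFT-simple decoder (position bias w_{t,t'} set to 0 for clarity).

    y_t = sigmoid(q_t) * (sum_{t'<=t} exp(k_t') * v_t')
                       / (sum_{t'<=t} exp(k_t')),
    with all products element-wise per dimension, so each step is O(d).
    """
    N, d = Q.shape
    num = np.zeros(d)   # running numerator, reused across steps
    den = np.zeros(d)   # running denominator, reused across steps
    Y = np.zeros((N, d))
    for t in range(N):  # recursive form: states from step t-1 are reused
        num = num + np.exp(K[t]) * V[t]
        den = den + np.exp(K[t])
        Y[t] = sigmoid(Q[t]) * num / den
    return Y

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 16))
print(aft_simple_decode(Q, K, V).shape)  # (8, 16)
```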
**2.3 RWKV Network Architecture**

Features:

Rebuilds AFT: reduces the self-attention complexity from O(N^2) to O(N), in the spirit of the Linear Transformer.

Retains the simple "attention" form and sequential decoding of AFT, so that the model has an RNN form.

Overall network architecture of RWKV:

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/3c0ab57a8ea740a8b227b9271ab8f12d.png)

RWKV network architecture

Look at the time-mixing block first. The purpose of time-mixing is global interaction, which corresponds to self-attention in the Transformer.

R (receptance) carries the past information; after a Sigmoid activation it acts as a gate that controls how much of that information is forgotten.

W encodes the relative position as a channel-wise decay of dimension d, and U compensates for the signal at the current position.

WKV plays a role similar to attention: for position t, it is a learnable weighted sum over the past K and V.

R, K, and V correspond to Q, K, and V in AFT (and in the Transformer). That is, K and V keep roughly the same meaning, and R takes the place of Q.

The main difference is how the WKV term is computed:

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/2c3da2a7890a44248be4b03089ccdf51.png)

Comparing this formula carefully with the AFT formula (6), there are two changes:

The offset wi,j no longer depends on the absolute positions. Instead, it depends only on the relative position, so a single parameter vector w needs to be trained.

The current position is handled separately, through the added parameter u.

Formula (8) can also be written in a recursive form, so RWKV has both the O(N) complexity of the Linear Transformer and the simplicity of AFT; a sketch of this recurrence follows the figure below. The final output of the time-mixing block is:

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/bd4176beb4394ba699fab1574186fd27.png)
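Below is a minimal sketch of the WKV recurrence as described in the RWKV paper, written per channel and without the numerical stabilization a real implementation needs; the function and variable names are our own assumptions.

```python
import numpy as np

def wkv_decode(K, V, w, u):
    """Recurrent WKV computation (simplified, unstabilized sketch).

    wkv_t = (a_{t-1} + exp(u + k_t) * v_t) / (b_{t-1} + exp(u + k_t)),
    a_t   = exp(-w) * a_{t-1} + exp(k_t) * v_t,
    b_t   = exp(-w) * b_{t-1} + exp(k_t),
    where w (relative-position decay) and u (current-position bonus) are
    learned per-channel vectors. Each step is O(d), so the whole sequence
    is O(N * d).
    """
    N, d = K.shape
    a = np.zeros(d)  # running decayed sum of exp(k) * v
    b = np.zeros(d)  # running decayed sum of exp(k)
    out = np.zeros((N, d))
    for t in range(N):
        e_cur = np.exp(u + K[t])                  # separate weight for position t
        out[t] = (a + e_cur * V[t]) / (b + e_cur)
        a = np.exp(-w) * a + np.exp(K[t]) * V[t]  # decay the past, add the present
        b = np.exp(-w) * b + np.exp(K[t])
    return out

rng = np.random.default_rng(0)
N, d = 8, 16
K, V = rng.normal(size=(2, N, d))
w = np.abs(rng.normal(size=d))  # decay is kept non-negative
u = rng.normal(size=d)
print(wkv_decode(K, V, w, u).shape)  # (8, 16)
```

In the full time-mixing block, this WKV output is then gated element-wise by the Sigmoid of R and linearly projected, which is what the final-output formula above expresses.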
**3. Experiment Effect**

The following figure compares the runtime of the RWKV network with different types of Transformer. For RWKV, the time cost grows linearly with the sequence length and remains far below that of the Transformer architectures.

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/1eff913e994c4109be476813a7c8cfcf.png)

**3.1 Performance Comparison**

The following figure compares RWKV with Transformer pre-trained models (BLOOM, OPT, and Pythia). Across six benchmarks (Winogrande, PIQA, ARC-Challenge, ARC-Easy, LAMBADA, and SciQ), RWKV is competitive with open-source quadratic-complexity Transformer models such as Pythia, OPT, and BLOOM, and it even outperforms Pythia and GPT-Neo on four tasks (PIQA, OBQA, ARC-Easy, and COPA).

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/35eac384942d4227b7b9a47a3ab49c82.png)

**3.2 Effect Comparison**

The following figure shows that increasing the context length reduces the loss on the Pile test set, indicating that RWKV can make effective use of longer context information.

![](https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/10/24/5d25cc83aee54a00b36202fb84c27505.png)

**4. Summary and Prospect**

The memory and computational complexity of the Transformer network grows quadratically with the sequence length, while an RNN scales only linearly. However, the RNN's limitations in parallelization and scalability make it difficult to compete with the Transformer. RWKV-LM/ChatRWKV is a ten-billion-parameter base language model/dialog model built on the RWKV pre-trained architecture rather than on the Transformer. It matches the capability of Transformer-based LLMs while offering higher computational efficiency (fast computation and low resource consumption).
Because past information is compressed into a single history vector, RWKV's ability to capture long-range dependencies is not as good as that of the original attention, and, similarly, its robustness to prompts falls short of the Transformer's. Linear attention replaces the Transformer's matrix multiplications with element-wise computation; on the Ascend architecture this theoretical advantage in computational complexity does not translate into a practical one, and the space-complexity advantage of linear attention is eroded by FlashAttention.

Compared with the Transformer network, RWKV does not yet have a vibrant ecosystem (fewer acceleration libraries and algorithms). Whether RWKV can grow into a mainstream neural network remains to be seen.