[{"data":1,"prerenderedAt":133},["ShallowReactive",2],{"content-query-EDLPkILlfh":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":10,"date":11,"cover":12,"type":13,"category":14,"body":15,"_type":127,"_id":128,"_source":129,"_file":130,"_stem":131,"_extension":132},"/technology-blogs/en/1849","en",false,"",[9],"Technical Knowledge","This article presents how to use MindSpore to reproduce the Swin Transformer image classification model on ImageNet.","2021-10-28","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/09/29/4ee54b2ab87c4ed5814fa6f95bcc8c2a.png","technology-blogs","Developer Sharing",{"type":16,"children":17,"toc":124},"root",[18,32,48,58,63,71,76,84,89,97],{"type":19,"tag":20,"props":21,"children":23},"element","h1",{"id":22},"technical-knowledge-using-mindspore-to-reproduce-the-implementation-of-swin-transformer",[24,30],{"type":19,"tag":25,"props":26,"children":27},"span",{},[28],{"type":29,"value":9},"text",{"type":29,"value":31}," Using MindSpore to Reproduce the Implementation of Swin Transformer",{"type":19,"tag":33,"props":34,"children":35},"p",{},[36,38,46],{"type":29,"value":37},"October 28, 2021 This article presents how I used MindSpore to reproduce the Swin Transformer image classification model on ImageNet. The Swin Transformer, proposed in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (the ICCV 2021 best paper), serves as a general-purpose backbone for computer vision. For all related code, visit ",{"type":19,"tag":39,"props":40,"children":44},"a",{"href":41,"rel":42},"https://gitee.com/ZJUTER0126/models",[43],"nofollow",[45],{"type":29,"value":41},{"type":29,"value":47},". The modified code is also available in the MindSpore models repository. 
Data Augmentation The Swin Transformer source code is implemented based on PyTorch and timm (PyTorch Image Models, a library for image classification), and its AutoAugment is implemented based on the Python Imaging Library (PIL). Since the image library interfaces of MindSpore differ from those of PyTorch and timm, implementing the code from scratch is difficult. Here I ported the data augmentation code of timm to MindSpore and rewrote the PyTorch-specific parts based on NumPy. In effect, a pluggable AutoAugment is added to the dataset interface of MindSpore, with the same interfaces as in PyTorch. You can reuse this code when reproducing similar papers (see the swin_transformer/src/data/data_utils folder in my repository). Mixed Precision The mixed precision training method accelerates the deep neural network training process by mixing the single-precision (FP32) and half-precision (FP16) floating-point data formats without compromising network accuracy. This method accelerates computation, reduces memory usage and access, and enables a larger model or batch size to be trained on specific hardware. For more details, see Enabling Mixed Precision. The following describes the differences between the mixed precision of MindSpore and that of PyTorch (Apex). After I reproduced a MindSpore-based Swin Transformer that matched the original PyTorch implementation in terms of model structure, data, and training settings, the new model's accuracy was still 10% to 20% lower than PyTorch's by the 10th epoch, which is why I examined mixed precision here. The operation modes are defined as follows: O0: pure FP32 training, which can be used as the accuracy baseline. O1: mixed-precision training (recommended). Whether FP16 (GEMM, convolution) or FP32 (Softmax) is used for a calculation is automatically determined based on the blocklist and trustlist. 
O2: \"almost FP16\" mixed-precision training. There are no blocklists or trustlists. Almost all calculations except batch normalization (BN) are performed using FP16. O3: pure FP16 training, which is unstable but can be used as the speed baseline. MindSpore supports all the options above, with O1 as the default operation mode. The difference between O2 and O3 is whether the BN layer uses FP32 or FP16. That is, for most ViTs, which do not use BN layers, O2 and O3 are the same. The Apex library provides a list of the functions that run in FP16 and FP32 in O1 mode. For details, see list. In O1 mode, modules that have a great impact on precision, such as Softmax and LayerNorm, use FP32. So here I used FP32 for these modules to stay aligned with PyTorch. To convert data to FP32 and FP16, I directly used the amp.py file of MindSpore.",{"type":19,"tag":49,"props":50,"children":52},"pre",{"code":51},"print(f\"=> using amp_level {args.amp_level}\")\n# Convert data to FP16.\nnet.to_float(mstype.float16)\nprint(f\"=> change {args.arch} to fp16\")\n",[53],{"type":19,"tag":54,"props":55,"children":56},"code",{"__ignoreMap":7},[57],{"type":29,"value":51},{"type":19,"tag":33,"props":59,"children":60},{},[61],{"type":29,"value":62},"# Convert the precision-sensitive cells back to FP32. The Swin Transformer mainly uses nn.Dense and several nn.Conv2d functions.\ncell_types = (nn.GELU, nn.Softmax, nn.Conv2d, nn.Conv1d, nn.BatchNorm2d, nn.LayerNorm)\n_do_keep_fp32(net, cell_types)\nprint(f\"=> cast {cell_types} to fp32 back\")\n\nclass OutputTo16(nn.Cell):\n    \"Wrap cell for amp. Cast network output back to float16.\"\n    def __init__(self, op):\n        super(OutputTo16, self).__init__(auto_prefix=False)\n        self._op = op\n    def construct(self, x):\n        return F.cast(self._op(x), mstype.float16)\n\ndef _do_keep_fp32(network, cell_types):\n    cells = network.name_cells()\n    change = False\n    for name in cells:\n        subcell = cells[name]\n        if subcell == network:\n            continue\n        elif isinstance(subcell, cell_types):\n            network._cells[name] = OutputTo16(subcell.to_float(mstype.float32))\n            change = True\n        else:\n            # Recursively process child cells.\n            _do_keep_fp32(subcell, cell_types)\n    if isinstance(network, nn.SequentialCell) and change:\n        network.cell_list = list(network.cells())\nAfter manual conversion to O1 mode, the accuracy of the MindSpore-based Swin Transformer became normal by the 10th epoch. Tuning the Model Performance Based on CANN, MindSpore delivers more impressive matrix computation performance than the NVIDIA V100 GPU, and its on-device execution greatly improves the efficiency of data loading and model execution. For details, see On-Device Execution. Note that a large number of index operators sharply degrades model performance and may make the single-step training time four or five times longer than that of the V100 GPU under the same conditions. 
The following are some examples that can be used to address performance bottlenecks on Ascend 910 AI Processors: Example 1: Different Expressions of the Query-Key-Value (QKV) Attention",{"type":19,"tag":49,"props":64,"children":66},{"code":65},"\"\"\" Expression 1 of the QKV attention\"\"\"\nself.qkv = nn.Dense(in_channels=dim, out_channels=dim * 3, has_bias=qkv_bias)\nqkv = ops.Reshape()(self.qkv(x), (B_, N, 3, self.num_heads, C // self.num_heads))\nqkv = ops.Transpose()(qkv, (2, 0, 3, 1, 4))\nq, k, v = qkv[0]*self.scale, qkv[1], qkv[2]\n\n\"\"\" Expression 2 of QKV attention\"\"\"\nself.q = nn.Dense(in_channels=dim, out_channels=dim, has_bias=qkv_bias)\nself.k = nn.Dense(in_channels=dim, out_channels=dim, has_bias=qkv_bias)\nself.v = nn.Dense(in_channels=dim, out_channels=dim, has_bias=qkv_bias)\n\nq = ops.Reshape()(self.q(x), (B_, N, self.num_heads, C // self.num_heads))\nk = ops.Reshape()(self.k(x), (B_, N, self.num_heads, C // self.num_heads))\nk = ops.Transpose()(k, (0, 1, 3, 2))\nv = ops.Reshape()(self.v(x), (B_, N, self.num_heads, C // self.num_heads))\n",[67],{"type":19,"tag":54,"props":68,"children":69},{"__ignoreMap":7},[70],{"type":29,"value":65},{"type":19,"tag":33,"props":72,"children":73},{},[74],{"type":29,"value":75},"These two expressions are the most popular forms of the self-attention paradigm. Expression 2 is required on Ascend 910 AI Processors. Though the two expressions yield the same result, expression 1 uses index operations (even just qkv[0]), which adds an extra 100 ms to single-step training of the swin_tiny model (from about 600 ms to over 700 ms in total). 
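The equivalence of the two QKV expressions can be checked with a small NumPy sketch (plain NumPy rather than the MindSpore API; the sizes, the column-wise split of the fused weight, and the omission of the scale factor are illustrative assumptions):

```python
import numpy as np

# Hypothetical sizes mirroring the snippet above: B_ windows, N tokens,
# H heads, C channels (C must be divisible by H).
B_, N, H, C = 2, 4, 2, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((B_, N, C)).astype(np.float32)
w_qkv = rng.standard_normal((C, 3 * C)).astype(np.float32)

# Expression 1: one fused projection, then index the stacked tensor.
qkv = (x @ w_qkv).reshape(B_, N, 3, H, C // H).transpose(2, 0, 3, 1, 4)
q1, k1, v1 = qkv[0], qkv[1], qkv[2]

# Expression 2: three separate projections, no index operators needed.
w_q, w_k, w_v = w_qkv[:, :C], w_qkv[:, C:2 * C], w_qkv[:, 2 * C:]
q2 = (x @ w_q).reshape(B_, N, H, C // H).transpose(0, 2, 1, 3)
k2 = (x @ w_k).reshape(B_, N, H, C // H).transpose(0, 2, 1, 3)
v2 = (x @ w_v).reshape(B_, N, H, C // H).transpose(0, 2, 1, 3)

assert np.allclose(q1, q2) and np.allclose(k1, k2) and np.allclose(v1, v2)
```

Both paths produce identical q, k, and v tensors of shape (B_, H, N, C // H); only the second avoids slicing a stacked tensor.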
Example 2: Using Reshape and Transpose to Replace Some Typical Indexing",{"type":19,"tag":49,"props":77,"children":79},{"code":78},"\"\"\"Expression 1\"\"\"\nx0 = x[:, 0::2, 0::2, :] # B H/2 W/2 C\nx1 = x[:, 1::2, 0::2, :] # B H/2 W/2 C\nx2 = x[:, 0::2, 1::2, :] # B H/2 W/2 C\nx3 = x[:, 1::2, 1::2, :] # B H/2 W/2 C\nx = torch.cat([x0, x1, x2, x3], -1) # B H/2 W/2 4*C\nx = x.view(B, -1, 4 * C) # B H/2*W/2 4*C\n\n\"\"\"Expression 2\"\"\"\nx = P.Reshape()(x, (B, self.H_2, 2, self.W_2, 2, self.dim))\nx = P.Transpose()(x, (0, 1, 3, 4, 2, 5))\nx = P.Reshape()(x, (B, self.H2W2, self.dim_mul_4))\n",[80],{"type":19,"tag":54,"props":81,"children":82},{"__ignoreMap":7},[83],{"type":29,"value":78},{"type":19,"tag":33,"props":85,"children":86},{},[87],{"type":29,"value":88},"Both expressions above implement an operation similar to upsampling and PixelShuffle; that is, they rearrange pixels in a regular pattern, and the numbers of input and output elements are equal. In such cases, replacing the indexing with equivalent reshape and transpose operations saves a lot of time: about 150 ms/step for the swin_tiny model. Example 3: Replacing Indexing with Matrix Multiplication",{"type":19,"tag":49,"props":90,"children":92},{"code":91},"\"\"\"Expression 1\"\"\"\na = [1, 2, 3]\nindex = 2\n\na[index] => 3\n\n\"\"\"Expression 2\"\"\"\na = [1, 2, 3]\nindex = 2\none_hot_index = [0, 0, 1] # predefine\na dot one_hot_index.T => 3\n",[93],{"type":19,"tag":54,"props":94,"children":95},{"__ignoreMap":7},[96],{"type":29,"value":91},{"type":19,"tag":33,"props":98,"children":99},{},[100,102,108,110,116,118],{"type":29,"value":101},"To put it simply, one-hot encoding and matrix multiplication replace indexing to improve the model performance. The more indexes there are, the more obvious the optimization effect. 
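As a quick plain-NumPy illustration of Example 3 (not MindSpore code; the array and index values are made up), the dot product with a precomputed one-hot vector returns the same value as the index operator:

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.float32)
index = 2
one_hot_index = np.eye(3, dtype=np.float32)[index]  # [0., 0., 1.]

indexed = a[index]              # index operator
via_matmul = a @ one_hot_index  # matrix multiplication instead

assert via_matmul == indexed == 3.0
```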
Model Training Heterogeneous parallel training analyzes the memory usage and computing density of operators on a graph, allocates operators that consume a large amount of memory or are suitable for CPU logic to CPU subgraphs, and allocates compute-intensive operators that consume little memory to hardware accelerator subgraphs. The framework coordinates the subgraphs for network training, so that subgraphs located on different hardware with no mutual dependencies can be executed in parallel. First, let's look at the optimization of heterogeneous parallel computing. During the training of Pangu or GPT-3 models, optimizer states occupy a large amount of memory, which limits the scale of models that can be trained. By executing the optimizer on CPUs, a heterogeneous optimizer greatly expands the scale of models that can be trained. Saving and updating model weights occupies significant memory but requires little computation. Because it is impractical to perform the grad_sum and zero_op operations on the accelerator, we need to shift these calculations to the CPU. To do this, optimize as follows: 1. Make TrainOneStepCell inherit from nn.TrainOneStepWithLossScaleCell (this cell already contains the logic for automatically filtering out overflowed gradients) to reduce the code size. 2. Shift the _sum_op and _clear_op operations to the CPU, following the heterogeneous computing template:\nassignadd = P.AssignAdd()\nassignadd.add_prim_attr(\"primitive_target\", \"CPU\")\nIn addition, given that the final weights include the grad_sum and zeros weights, we can remove the zeros weight as follows:\n_sum_op = C.MultitypeFuncGraph(\"grad_sum_op\")\nassignadd = P.AssignAdd()\nassignadd.add_prim_attr(\"primitive_target\", \"CPU\")\n# Equivalent to a -= a: we obtain 0 by subtracting a from itself.\nself.hyper_map(F.partial(_sum_op), self._grad_sum, -self._grad_sum)\nThen we can apply clip_grad_norm. 
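The accumulate-then-clear trick above can be sketched in plain NumPy (an analogy, not the MindSpore hyper_map/AssignAdd API): the accumulator is cleared by adding its own negation, so no separate zeros tensor needs to be kept.

```python
import numpy as np

grad_sum = np.zeros(4, dtype=np.float32)  # accumulated gradients

for step in range(3):
    grad = np.full(4, 0.5, dtype=np.float32)  # stand-in per-step gradient
    grad_sum += grad                          # _sum_op: AssignAdd-style accumulation

assert np.allclose(grad_sum, 1.5)

# Clearing: a += -a is equivalent to a -= a, yielding zeros in place.
grad_sum += -grad_sum
assert np.allclose(grad_sum, 0.0)
```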
After all these steps, we can successfully reproduce the PyTorch-based Swin Transformer. Let's move on to using ModelArts for model training. Using ModelArts ModelArts is a one-stop AI development platform. It enables developers to rapidly build, train, and deploy models anywhere (from the cloud to the edge) and to manage full-lifecycle AI workflows. ModelArts accelerates AI development and fosters AI innovation with key capabilities including data preprocessing and auto labeling, distributed training, automated model building, and one-click workflow execution. You can simply think of it as an efficient deep learning platform. The following are some personal impressions of using ModelArts on Huawei Cloud. 1. Fast: Running the same code on ModelArts is about 100 ms/step faster than on an offline Ascend 910 server. Possible causes are that the Ascend 910 Pro A performs better than the Ascend 910, that the ECS environment is stable, or that ModelArts has optimized distributed computing. 2. High cost: The unit prices of V100 GPU (8-device) and Ascend 910 training are high, and the whole Swin Transformer training process takes about 3 days. 3. High learning cost: Though it's hard to get started, once we are familiar with it, we can use ModelArts to store training parameters and start training with just a few clicks. Training Result After reproduction, our new model achieved an accuracy (81.15%) similar to that of the Swin Transformer in the paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 
MindSpore documentation: ",{"type":19,"tag":39,"props":103,"children":106},{"href":104,"rel":105},"https://www.mindspore.cn/docs/en/r0.7/index.html",[43],[107],{"type":29,"value":104},{"type":29,"value":109}," GitHub: ",{"type":19,"tag":39,"props":111,"children":114},{"href":112,"rel":113},"https://github.com/mindspore-ai/mindspore",[43],[115],{"type":29,"value":112},{"type":29,"value":117}," Gitee: ",{"type":19,"tag":39,"props":119,"children":122},{"href":120,"rel":121},"https://gitee.com/mindspore/mindspore",[43],[123],{"type":29,"value":120},{"title":7,"searchDepth":125,"depth":125,"links":126},4,[],"markdown","content:technology-blogs:en:1849.md","content","technology-blogs/en/1849.md","technology-blogs/en/1849","md",1776506105750]