[{"data":1,"prerenderedAt":305},["ShallowReactive",2],{"content-query-gnwqfIpWRH":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":299,"_id":300,"_source":301,"_file":302,"_stem":303,"_extension":304},"/technology-blogs/en/3103","en",false,"","LocalMIM: Achieving Faster Training Speed with Local Multi-Scale Reconstruction","The effectiveness of LocalMIM has been verified on the columnar architecture ViT and pyramid architecture Swin, and its performance has been tested in classification, detection, and segmentation tasks.","2024-03-06","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/8864caf0846843cfba2103ccb808b3d2.png","technology-blogs",{"type":14,"children":15,"toc":296},"root",[16,24,30,35,43,48,59,64,73,82,87,92,97,102,115,130,137,142,150,157,162,186,197,202,210,215,222,227,234,239,246,251,256,261,268,273,278,286,291],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"localmim-achieving-faster-training-speed-with-local-multi-scale-reconstruction",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":23,"value":29},"Author: Wang Yunhe Source: Zhihu",{"type":17,"tag":25,"props":31,"children":32},{},[33],{"type":23,"value":34},"Unlike existing Masked Image Modeling (MIM) models that conduct reconstruction tasks only at the top layer of an encoder, we perform reconstruction tasks at multiple selected local layers and design multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale signals respectively. For top-1 fine-tuning accuracy on ImageNet-1K, LocalMIM outperforms other models. It is 3.1 times faster than MAE, 5.6 times faster than MaskFeat, 3.6 times faster than SimMIM192, and 6.4 times faster than GreenMIM. 
Our approach is architecture-independent and can be applied to new backbone networks in the future.",{"type":17,"tag":25,"props":36,"children":37},{},[38],{"type":17,"tag":39,"props":40,"children":42},"img",{"alt":7,"src":41},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/c35ea96952924298a0d5be3a1e75f889.png",[],{"type":17,"tag":25,"props":44,"children":45},{},[46],{"type":23,"value":47},"Paper:",{"type":17,"tag":25,"props":49,"children":50},{},[51],{"type":17,"tag":52,"props":53,"children":57},"a",{"href":54,"rel":55},"https://arxiv.org/pdf/2303.05251v1.pdf",[56],"nofollow",[58],{"type":23,"value":54},{"type":17,"tag":25,"props":60,"children":61},{},[62],{"type":23,"value":63},"Code Used on MindSpore:",{"type":17,"tag":25,"props":65,"children":66},{},[67],{"type":17,"tag":52,"props":68,"children":71},{"href":69,"rel":70},"https://gitee.com/mindspore/hub/blob/master/mshub_res/assets/noah-cvlab/gpu/1.8/localmim_v1.0_imagenet2012.md",[56],[72],{"type":23,"value":69},{"type":17,"tag":25,"props":74,"children":75},{},[76],{"type":17,"tag":77,"props":78,"children":79},"strong",{},[80],{"type":23,"value":81},"Introduction",{"type":17,"tag":25,"props":83,"children":84},{},[85],{"type":23,"value":86},"In recent years, the emergence of contrastive methods (e.g. MoCo and SimCLR) and MIM methods (e.g. MAE) has sped up the development of vision self-supervised representation learning. With the development of vision transformers, MIM methods have become more and more popular because of their superior fine-tuning performance in downstream tasks. In real-world applications, MIM methods are expected to learn common knowledge from massive unlabeled data (such as images randomly crawled from the web). However, their training cost is high, which limits their industrial deployment.",{"type":17,"tag":25,"props":88,"children":89},{},[90],{"type":23,"value":91},"Computation workloads are mostly consumed by MIM's encoder and decoder. 
Because the decoder is small, existing pre-training acceleration methods generally speed up encoding by reducing the computation workload of the encoder, using one of the following approaches:",{"type":17,"tag":25,"props":93,"children":94},{},[95],{"type":23,"value":96},"● For MAE and GreenMIM, the encoder processes only visible patches.",{"type":17,"tag":25,"props":98,"children":99},{},[100],{"type":23,"value":101},"● For LoMaR, UM-MAE, and FastMIM, the input image resolution is reduced to minimize the total number of patches.",{"type":17,"tag":25,"props":103,"children":104},{},[105,107,113],{"type":23,"value":106},"Unlike the preceding ideas, our method starts from representation learning itself. Theory[1] shows that after the input image is divided into patches and linearly mapped, inter-patch semantic associations of the obtained patch representations are largely lost due to the randomness of the mapping process. The attention mechanism in vision transformers learns semantic associations through interactions between patches in subsequent layers and builds a better representation space than pixel space. Note that the self-attention mechanism has computational complexity quadratic in the patch number ",{"type":17,"tag":108,"props":109,"children":110},"em",{},[111],{"type":23,"value":112},"N",{"type":23,"value":114},", so inter-patch interactions are difficult to learn. All existing MIM methods introduce reconstruction tasks only at the top layer. Consequently, inter-patch interactions at lower layers are not explicitly guided, and patch representations and semantic associations can only be learned through a slow implicit process. This slows down overall representation learning. The problem is more conspicuous for pyramidal backbone networks, whose lower layers have more patches than the top layer (for example, Swin-224 has 3136 patches at the bottom layer and 49 patches at the top layer). 
In real-world applications, lower layers play a key role in representation learning:",{"type":17,"tag":116,"props":117,"children":118},"ol",{},[119,125],{"type":17,"tag":120,"props":121,"children":122},"li",{},[123],{"type":23,"value":124},"A lower layer with a good learning capability can transfer knowledge to a higher layer to facilitate its learning.",{"type":17,"tag":120,"props":126,"children":127},{},[128],{"type":23,"value":129},"When fine-tuning on a downstream task, a higher layer usually adapts to the new task quickly, while a lower layer changes slowly and therefore needs to learn the data fully during pre-training. This is why layer-wise learning rate decay usually yields the best results on NLP and CV downstream tasks. To intuitively show how well semantic associations between patches are learned at different layers of a model, we examine the Normalized Mutual Information (NMI) between query and key patches at each layer, as shown in Figure 3.
As shown in Figure 3, the attention of the lower layers of many existing classical models (BEiT, SimMIM, and MaskFeat) is not as heavily reliant on the query patch as it is in the top layer.",{"type":17,"tag":25,"props":143,"children":144},{},[145],{"type":17,"tag":77,"props":146,"children":147},{},[148],{"type":23,"value":149},"Method",{"type":17,"tag":25,"props":151,"children":152},{},[153],{"type":17,"tag":39,"props":154,"children":156},{"alt":7,"src":155},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/696d3ab1a6594bd7aeb04c3e40effa4f.png",[],{"type":17,"tag":25,"props":158,"children":159},{},[160],{"type":23,"value":161},"LocalMIM architecture",{"type":17,"tag":25,"props":163,"children":164},{},[165,167,172,174,178,180,184],{"type":23,"value":166},"According to the above analysis, although lower-layer learning plays a key role in MIM, existing MIM methods explicitly guide only top-layer learning. Considering that reconstruction tasks require inter-patch semantic inference, we introduce reconstruction tasks into multiple local layers to explicitly guide them. Feature distillation can also explicitly guide local layers, but it requires a pre-trained teacher network or one updated with momentum, which significantly increases the computation workload. In addition, simple feature matching cannot bring much gain to semantic association learning compared with inference tasks. Further, we find that the gain is not obvious when the reconstruction task of the top layer is directly introduced to multiple local layers. The reason may be that different local layers need to learn information of different granularities. To this end, we extract supervision signals of different scales from the original input to guide the learning of multiple local layers. 
Specifically, to obtain supervision signals, existing methods usually first divide the input ",{"type":17,"tag":108,"props":168,"children":169},{},[170],{"type":23,"value":171},"x",{"type":23,"value":173}," into non-overlapping regions, the same division used to construct the encoder input. Then, a proper feature description operator (such as pixel normalization, HOG, or a pre-trained codebook) is used to extract the features of each region as supervision signals. In this case, coarse-scale (larger ",{"type":17,"tag":108,"props":175,"children":176},{},[177],{"type":23,"value":25},{"type":23,"value":179}," value) supervision signals capture high-level semantic information of the input, such as the shape of a partial or whole object. In contrast, fine-scale (smaller ",{"type":17,"tag":108,"props":181,"children":182},{},[183],{"type":23,"value":25},{"type":23,"value":185}," value) supervision signals capture low-level semantic information, such as edges, corners, and textures.
Figure 2(a) shows the overall process of the algorithm, and Figure 2(b) shows the decoding process at a specific scale. A decoder includes three parts: transformer blocks for inference, deconvolution/pooling for scaling, and an MLP for prediction. The transformer blocks infer the information of the masked patches from the visible patch representations. The deconvolution/pooling part handles the case where the scales of the features and the supervision signals are inconsistent. For example, for columnar architectures like ViT, the feature scale at each layer remains unchanged while the supervision-signal scales change; when they are inconsistent, we use deconvolution/pooling operations for up/down-sampling. The MLP integrates the predictions after scaling as the final output. A model with a pyramid architecture is generally divided into multiple stages, and we introduce a reconstruction task at the end of each stage. For a columnar architecture, we select layers for reconstruction based on the experience of processing pyramid architectures.",{"type":17,"tag":25,"props":198,"children":199},{},[200],{"type":23,"value":201},"To sum up, local multi-scale reconstruction (LocalMIM[2]) can explicitly guide lower layers to accelerate overall representation learning and promote multi-scale understanding of input images. Moreover, the method is architecture-independent: in principle, it adapts to various backbone networks and can be used in more advanced backbone networks in the future.",{"type":17,"tag":25,"props":203,"children":204},{},[205],{"type":17,"tag":77,"props":206,"children":207},{},[208],{"type":23,"value":209},"Experimental Results",{"type":17,"tag":25,"props":211,"children":212},{},[213],{"type":23,"value":214},"The effectiveness of LocalMIM has been verified on the columnar architecture ViT and the pyramid architecture Swin, and its performance has been tested in classification, detection, and segmentation tasks, as shown in Tables 1, 2, and 3. 
Note that we examine only two feature description operators: pixel normalization and HOG.",{"type":17,"tag":25,"props":216,"children":217},{},[218],{"type":17,"tag":39,"props":219,"children":221},{"alt":7,"src":220},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/3fe89e55d5044bbc8a2ebd8e0a61c457.png",[],{"type":17,"tag":25,"props":223,"children":224},{},[225],{"type":23,"value":226},"Table 1: Top-1 fine-tuning accuracy on ImageNet-1K",{"type":17,"tag":25,"props":228,"children":229},{},[230],{"type":17,"tag":39,"props":231,"children":233},{"alt":7,"src":232},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/6949e592c58b436fad10dda74fa38887.png",[],{"type":17,"tag":25,"props":235,"children":236},{},[237],{"type":23,"value":238},"Table 2: Semantic segmentation on ADE20K",{"type":17,"tag":25,"props":240,"children":241},{},[242],{"type":17,"tag":39,"props":243,"children":245},{"alt":7,"src":244},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/8173f2b36b624bfba3797352fa164595.png",[],{"type":17,"tag":25,"props":247,"children":248},{},[249],{"type":23,"value":250},"Table 3: Object detection and instance segmentation on COCO",{"type":17,"tag":25,"props":252,"children":253},{},[254],{"type":23,"value":255},"As shown in Table 1, LocalMIM is more efficient than existing models. Specifically, in terms of top-1 fine-tuning accuracy on ImageNet-1K, LocalMIM outperforms other models. It is 3.1 times faster than MAE, 5.6 times faster than MaskFeat, 3.6 times faster than SimMIM192, and 6.4 times faster than GreenMIM. As can be seen from Tables 2 and 3, LocalMIM achieves better performance at significantly lower computational cost in downstream detection and segmentation tasks. For other ablation experiments and implementation details, refer to the paper.",{"type":17,"tag":25,"props":257,"children":258},{},[259],{"type":23,"value":260},"In addition, we conduct an interesting experiment. 
Gradient truncation is performed on the selected layers during training. That is, the parameters of each stage receive only the backward gradient from the reconstruction task of that stage, and no gradient from higher layers. The results are shown in Table 4.",{"type":17,"tag":25,"props":262,"children":263},{},[264],{"type":17,"tag":39,"props":265,"children":267},{"alt":7,"src":266},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/a9626d38e4a54c0fb72b791135e3d148.png",[],{"type":17,"tag":25,"props":269,"children":270},{},[271],{"type":23,"value":272},"Table 4: Training LocalMIM with isolated gradients achieves performance similar to global backpropagation.",{"type":17,"tag":25,"props":274,"children":275},{},[276],{"type":23,"value":277},"We find that even without a global backpropagation gradient, using only the local supervision gradient can still guide the representation learning of each layer of the backbone network well. This demonstrates the advantage of our local supervision task and the possibility of decoupled training of neural networks. Decoupled training allows very deep networks to be trained without memory concerns and effectively mitigates exploding or vanishing gradients. The backbone networks commonly used in the vision field are currently rather shallow; looking ahead, however, there may be a need to pre-train significantly deeper backbone networks. We hope our method can serve as an inspiration for such pre-training endeavors.",{"type":17,"tag":25,"props":279,"children":280},{},[281],{"type":17,"tag":77,"props":282,"children":283},{},[284],{"type":23,"value":285},"References",{"type":17,"tag":25,"props":287,"children":288},{},[289],{"type":23,"value":290},"[1] How to Understand Masked Autoencoders. Cao and Xu, 2022.",{"type":17,"tag":25,"props":292,"children":293},{},[294],{"type":23,"value":295},"[2] Masked Image Modeling with Local Multi-Scale Reconstruction. Wang et al. 
CVPR 2023.",{"title":7,"searchDepth":297,"depth":297,"links":298},4,[],"markdown","content:technology-blogs:en:3103.md","content","technology-blogs/en/3103.md","technology-blogs/en/3103","md",1776506110366]