[{"data":1,"prerenderedAt":272},["ShallowReactive",2],{"content-query-ex9MSxm3r2":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":266,"_id":267,"_source":268,"_file":269,"_stem":270,"_extension":271},"/technology-blogs/en/3105","en",false,"","Strike Back by Convolutional Architectures, Pure Convolutional Query-based Detector DECO Outperforms DETR","March 18, 2024","2024-03-18","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/883950ca892b40dc835211da5b54c731.png","technology-blogs",{"type":14,"children":15,"toc":263},"root",[16,24,29,34,39,50,55,64,73,78,83,91,96,104,109,117,122,130,135,142,147,152,157,162,169,174,181,186,193,198,206,211,216,224,229,236,241,246,253,258],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"strike-back-by-convolutional-architectures-pure-convolutional-query-based-detector-deco-outperforms-detr",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":23,"value":9},{"type":17,"tag":25,"props":30,"children":31},{},[32],{"type":23,"value":33},"Author: Wang Yunhe; source: Zhihu",{"type":17,"tag":25,"props":35,"children":36},{},[37],{"type":23,"value":38},"Paper link:",{"type":17,"tag":25,"props":40,"children":41},{},[42],{"type":17,"tag":43,"props":44,"children":48},"a",{"href":45,"rel":46},"https://arxiv.org/abs/2312.13735",[47],"nofollow",[49],{"type":23,"value":45},{"type":17,"tag":25,"props":51,"children":52},{},[53],{"type":23,"value":54},"MindSpore 
code:",{"type":17,"tag":25,"props":56,"children":57},{},[58],{"type":17,"tag":43,"props":59,"children":62},{"href":60,"rel":61},"https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/DECO",[47],[63],{"type":23,"value":60},{"type":17,"tag":25,"props":65,"children":66},{},[67],{"type":17,"tag":68,"props":69,"children":70},"strong",{},[71],{"type":23,"value":72},"Introduction",{"type":17,"tag":25,"props":74,"children":75},{},[76],{"type":23,"value":77},"The appearance of Detection Transformer (DETR) triggered a wave of discussion and research in the field of object detection, and a lot of subsequent related work improved the original DETR in terms of precision and speed. However, is Transformer the only answer to the visualization field? It can be seen from the works on ConvNeXt and RepLKNet that CNN architecture still has great space for improvement in the visualization field. Our work explores how to obtain a DETR-like detector with outstanding performance based on a pure convolutional architecture.",{"type":17,"tag":25,"props":79,"children":80},{},[81],{"type":23,"value":82},"To make a tribute to DETR, we name our method Detection ConvNets (DECO). With an architecture similar to DETR and the utilization of different backbones, DECO achieves 38.6% and 40.8% of APs on the challenging object detection benchmark (COCO), and 35 FPS and 28 FPS on a NVIDIA V100 GPU, outperforming DETR. With cross-scale feature-fusion modules similar to RT-DETR, DECO achieves 47.8% AP and 34 FPS. The overall performance is better than that of many improvement methods based on DETR.",{"type":17,"tag":25,"props":84,"children":85},{},[86],{"type":17,"tag":68,"props":87,"children":88},{},[89],{"type":23,"value":90},"Overall DECO Architecture",{"type":17,"tag":25,"props":92,"children":93},{},[94],{"type":23,"value":95},"For an input image, DETR utilizes the Transformer Encoder-Decoder architecture to interact with image features by using a group of queries. 
In this way, a fixed number of box predictions is output directly, eliminating the need for post-processing operations such as non-maximum suppression (NMS). The overall architecture of DECO is similar to that of DETR: it also includes a backbone for image feature extraction and an encoder-decoder structure that interacts with the queries, and it outputs a fixed number of detection results. The only difference is that the encoder and decoder of DECO are pure convolutional architectures, which makes DECO a query-based end-to-end detector built entirely from convolutions.",{"type":17,"tag":25,"props":97,"children":98},{},[99],{"type":17,"tag":100,"props":101,"children":103},"img",{"alt":7,"src":102},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/b9f2e4aca4fe47c1bc1624689ff64a84.png",[],{"type":17,"tag":25,"props":105,"children":106},{},[107],{"type":23,"value":108},"Figure: Comparison between the DECO and DETR architectures",{"type":17,"tag":25,"props":110,"children":111},{},[112],{"type":17,"tag":68,"props":113,"children":114},{},[115],{"type":23,"value":116},"DECO Encoder Architecture",{"type":17,"tag":25,"props":118,"children":119},{},[120],{"type":23,"value":121},"Replacing the encoder in DECO is relatively straightforward: four ConvNeXt blocks form the encoder. Specifically, each encoder layer stacks a 7 × 7 depthwise convolution, a LayerNorm layer, a 1 × 1 convolution, a GELU activation, and another 1 × 1 convolution. Moreover, in DETR, because the Transformer is permutation-invariant with respect to its input, positional encodings must be added to the input of each encoder layer. 
However, for the DECO encoder, no positional encodings are needed, because convolutions already encode spatial position implicitly.",{"type":17,"tag":25,"props":123,"children":124},{},[125],{"type":17,"tag":68,"props":126,"children":127},{},[128],{"type":23,"value":129},"DECO Decoder Architecture",{"type":17,"tag":25,"props":131,"children":132},{},[133],{"type":23,"value":134},"Compared with the encoder, replacing the decoder is much more complex. The decoder's main function is to let the queries interact fully with the image features, so that the queries can perceive the image content and predict the coordinates and categories of the objects in the image. The DECO decoder takes two inputs: the feature output of the encoder and a group of learnable query vectors. The decoder can be divided into two major modules: the self-interaction module (SIM) and the cross-interaction module (CIM).",{"type":17,"tag":25,"props":136,"children":137},{},[138],{"type":17,"tag":100,"props":139,"children":141},{"alt":7,"src":140},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/ad7077ea3aa042a58b2da1808c47ddc9.png",[],{"type":17,"tag":25,"props":143,"children":144},{},[145],{"type":23,"value":146},"Figure: DECO Decoder architecture",{"type":17,"tag":25,"props":148,"children":149},{},[150],{"type":23,"value":151},"The SIM fuses the queries with the output of the previous decoder layer. This part is composed of several convolutional layers: a 9 × 9 depthwise convolution and a 1 × 1 convolution mix spatial and channel information respectively, producing the object information that is sent to the CIM for further detection and feature extraction. The queries are a group of randomly initialized vectors whose number determines how many box predictions the detector finally outputs; this number can be adjusted as required. Because DECO is built entirely from convolutions, we reshape the queries into two dimensions. 
For example, 100 queries can be reshaped into a 10 × 10 map.",{"type":17,"tag":25,"props":153,"children":154},{},[155],{"type":23,"value":156},"The main function of the CIM is to let the image features interact fully with the queries, so that the queries can perceive the image content and predict the coordinates and categories of the objects in the image. In the Transformer architecture, the cross-attention mechanism serves this purpose conveniently; for a convolutional architecture, however, achieving full interaction between two features is a major challenge.",{"type":17,"tag":25,"props":158,"children":159},{},[160],{"type":23,"value":161},"Because the SIM output and the global features output by the encoder have different sizes, they must be spatially aligned before fusion. First, we apply nearest-neighbor upsampling to the SIM output.",{"type":17,"tag":25,"props":163,"children":164},{},[165],{"type":17,"tag":100,"props":166,"children":168},{"alt":7,"src":167},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/2b1a7f8a6a324a6eab15f83688c70041.png",[],{"type":17,"tag":25,"props":170,"children":171},{},[172],{"type":23,"value":173},"In this way, the upsampled feature has the same size as the global feature output by the encoder. 
Then, the upsampled feature is fused with the global feature output by the encoder, a depthwise convolution performs the feature interaction, and a residual connection adds the input back.",{"type":17,"tag":25,"props":175,"children":176},{},[177],{"type":17,"tag":100,"props":178,"children":180},{"alt":7,"src":179},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/bf884ff2f1ab4d94b079089b165662b3.png",[],{"type":17,"tag":25,"props":182,"children":183},{},[184],{"type":23,"value":185},"Next, the interacted feature exchanges channel information through an FFN, and pooling is then performed to reach the target size, giving the decoder output embedding:",{"type":17,"tag":25,"props":187,"children":188},{},[189],{"type":17,"tag":100,"props":190,"children":192},{"alt":7,"src":191},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/883618a24d484925b06cfba83d8d0cf7.png",[],{"type":17,"tag":25,"props":194,"children":195},{},[196],{"type":23,"value":197},"Finally, we feed the output embedding into the detection head for classification and regression.",{"type":17,"tag":25,"props":199,"children":200},{},[201],{"type":17,"tag":68,"props":202,"children":203},{},[204],{"type":23,"value":205},"Utilization of Cross-Scale Features",{"type":17,"tag":25,"props":207,"children":208},{},[209],{"type":23,"value":210},"Like the original DETR, the DECO obtained with the above architecture shares a common weakness: the lack of cross-scale features, which matters greatly for high-precision object detection. Deformable DETR uses a cross-scale deformable attention module to integrate features of different scales. 
However, this method is tightly coupled with the attention operator and cannot be applied directly to our DECO.",{"type":17,"tag":25,"props":212,"children":213},{},[214],{"type":23,"value":215},"To let DECO handle cross-scale features, we apply the cross-scale feature-fusion module proposed in RT-DETR to the features output by the decoder. In fact, a series of improvements has been proposed since the advent of DETR, and we believe many of them are applicable to DECO as well. We hope more people will join the discussion and exploration toward a better DECO.",{"type":17,"tag":25,"props":217,"children":218},{},[219],{"type":17,"tag":68,"props":220,"children":221},{},[222],{"type":23,"value":223},"Experimental Results",{"type":17,"tag":25,"props":225,"children":226},{},[227],{"type":23,"value":228},"We conducted experiments on COCO, comparing DECO with DETR while keeping the main architecture unchanged: the number of queries and the number of decoder layers stay the same, and only the Transformer in DETR is replaced with our convolutional architecture as described above. DECO achieves a better accuracy-speed trade-off than DETR.",{"type":17,"tag":25,"props":230,"children":231},{},[232],{"type":17,"tag":100,"props":233,"children":235},{"alt":7,"src":234},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/b11bcfd56fd34c099868ff976c80f488.png",[],{"type":17,"tag":25,"props":237,"children":238},{},[239],{"type":23,"value":240},"Figure: Performance comparison between DECO and DETR",{"type":17,"tag":25,"props":242,"children":243},{},[244],{"type":23,"value":245},"We also compared DECO equipped with cross-scale features against more object detection methods, including DETR variants. 
As shown in the following figure, DECO achieves better performance than many previous detectors.",{"type":17,"tag":25,"props":247,"children":248},{},[249],{"type":17,"tag":100,"props":250,"children":252},{"alt":7,"src":251},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/f78ee496236c4180b4051a4e9022e899.png",[],{"type":17,"tag":25,"props":254,"children":255},{},[256],{"type":23,"value":257},"Figure: Performance comparison between DECO and different detectors",{"type":17,"tag":25,"props":259,"children":260},{},[261],{"type":23,"value":262},"The paper also presents extensive ablation experiments and visualizations of the DECO architecture, covering the fusion strategy used in the decoder (addition, element-wise multiplication, or concatenation) and how the query dimensions are set to achieve the best results. For more detailed results and discussion, see the original paper.",{"title":7,"searchDepth":264,"depth":264,"links":265},4,[],"markdown","content:technology-blogs:en:3105.md","content","technology-blogs/en/3105.md","technology-blogs/en/3105","md",1776506110443]