[{"data":1,"prerenderedAt":185},["ShallowReactive",2],{"content-query-Irgor8TuX5":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"category":13,"body":14,"_type":179,"_id":180,"_source":181,"_file":182,"_stem":183,"_extension":184},"/technology-blogs/en/1716","en",false,"","A Model Introducing Convolutions to ViT","This blog proposes the new model CvT by introducing desirable properties of CNNs to the ViT architecture. It can effectively learn and process image features using the CNN while utilizing the dynamic attention mechanism and global information perception of Transformer.","2022-07-18","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/11/28/0923b9bd16c84035affb6f6c12325579.png","technology-blogs","Influencers",{"type":15,"children":16,"toc":176},"root",[17,25,30,35,48,56,61,70,75,80,85,93,98,103,108,113,120,125,133,138,145,150,158,163,171],{"type":18,"tag":19,"props":20,"children":22},"element","h1",{"id":21},"a-model-introducing-convolutions-to-vit",[23],{"type":24,"value":8},"text",{"type":18,"tag":26,"props":27,"children":28},"p",{},[29],{"type":24,"value":8},{"type":18,"tag":26,"props":31,"children":32},{},[33],{"type":24,"value":34},"July 18, 2022",{"type":18,"tag":26,"props":36,"children":37},{},[38,40,46],{"type":24,"value":39},"Since the launch of the Vision Transformer (ViT) network, Transformer models have also been increasingly applied in the CV field. However, as shown in figure 1, the original ViT model delivers unsatisfying performance in small datasets without obvious advantages over a traditional convolutional neural network (CNN). In addition, compared with a CNN of similar size, a ViT network require larger training datasets. All these have become barriers of applying ViT models for inference on small datasets. 
In this blog, I'd like to share a paper from ICCV 2021: ",{"type":18,"tag":41,"props":42,"children":43},"em",{},[44],{"type":24,"value":45},"CvT: Introducing Convolutions to Vision Transformers",{"type":24,"value":47},". This paper proposes combining convolutional neural networks and Transformers, which can effectively improve the prediction accuracy of ViT models on small datasets.",{"type":18,"tag":26,"props":49,"children":50},{},[51],{"type":18,"tag":52,"props":53,"children":55},"img",{"alt":7,"src":54},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/08/15/cff6b3a7f56e4027b6684108b128b205.png",[],{"type":18,"tag":26,"props":57,"children":58},{},[59],{"type":24,"value":60},"Figure 1 Prediction accuracy of CNN-based models and ViT-based models",{"type":18,"tag":26,"props":62,"children":63},{},[64],{"type":18,"tag":65,"props":66,"children":67},"strong",{},[68],{"type":24,"value":69},"Cause Analysis",{"type":18,"tag":26,"props":71,"children":72},{},[73],{"type":24,"value":74},"According to the authors, the reasons why the ViT network is outperformed by the CNN on small datasets are as follows:",{"type":18,"tag":26,"props":76,"children":77},{},[78],{"type":24,"value":79},"1. CNN convolutions are characterized by local receptive fields, shared weights, and spatial subsampling. In this way, locally correlated features of an image can be effectively captured, with invariance to translation, scaling, and rotation. These characteristics, however, are absent from the ViT network. The ViT network divides an image into patches and converts them into a one-dimensional sequence to calculate attention. In essence, it computes the associations between each patch and image regions at different positions, which requires more data for training.",{"type":18,"tag":26,"props":81,"children":82},{},[83],{"type":24,"value":84},"2. The multi-layer structure of a CNN helps extract information at different levels. 
For example, low-layer convolution kernels extract image edge and texture information, while high-layer kernels capture semantic information. These properties enable the CNN to handle image tasks well. As a result, an idea emerges naturally: improve the overall performance of the ViT network by introducing convolutions.",{"type":18,"tag":26,"props":86,"children":87},{},[88],{"type":18,"tag":65,"props":89,"children":90},{},[91],{"type":24,"value":92},"Proposed Solution",{"type":18,"tag":26,"props":94,"children":95},{},[96],{"type":24,"value":97},"Figure 2 (a) shows the pipeline of the CvT architecture proposed by the authors. Compared with the ViT architecture, the CvT architecture has two distinct features:",{"type":18,"tag":26,"props":99,"children":100},{},[101],{"type":24,"value":102},"1. The Transformer model employs a multi-stage structure similar to that of a CNN. The first layer of each stage applies convolutional token embedding to extract features from a 2D image or token map and perform spatial subsampling. In this way, feature maps with progressively richer channel dimensions are obtained, just as convolutions produce in a CNN.",{"type":18,"tag":26,"props":104,"children":105},{},[106],{"type":24,"value":107},"2. The linear projection in the Transformer is replaced with convolutional projection, enabling the model to better capture local spatial context and reduce semantic ambiguity in the attention mechanism. 
In the Q/K/V calculation of attention, the authors also tried different stride values to further compress the model and improve computing efficiency at the cost of some accuracy.",{"type":18,"tag":26,"props":109,"children":110},{},[111],{"type":24,"value":112},"This paper points out that such a structure can fully leverage the desirable properties of the CNN, such as local receptive fields, shared weights, and spatial subsampling, while retaining the merits of Transformer models, such as dynamic attention, global context fusion, and better generalization.",{"type":18,"tag":26,"props":114,"children":115},{},[116],{"type":18,"tag":52,"props":117,"children":119},{"alt":7,"src":118},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/08/15/6b74baeee8934ec5b6187a801bc4d3f9.png",[],{"type":18,"tag":26,"props":121,"children":122},{},[123],{"type":24,"value":124},"Figure 2 (a) Structure of a CvT network (b) Details of a convolutional transformer block",{"type":18,"tag":26,"props":126,"children":127},{},[128],{"type":18,"tag":65,"props":129,"children":130},{},[131],{"type":24,"value":132},"Effect Analysis",{"type":18,"tag":26,"props":134,"children":135},{},[136],{"type":24,"value":137},"The following table compares the test results of the CvT model and other models on the ImageNet, ImageNet Real, and ImageNet V2 datasets. Compared with Transformer models, the CvT model achieves higher prediction accuracy with fewer learnable parameters and floating-point operations (FLOPs). 
It also delivers higher prediction accuracy than traditional ResNets.",{"type":18,"tag":26,"props":139,"children":140},{},[141],{"type":18,"tag":52,"props":142,"children":144},{"alt":7,"src":143},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/08/15/5058a923c8244597aa5bff9be98cd57d.png",[],{"type":18,"tag":26,"props":146,"children":147},{},[148],{"type":24,"value":149},"Figure 3 Effect comparison between different models",{"type":18,"tag":26,"props":151,"children":152},{},[153],{"type":18,"tag":65,"props":154,"children":155},{},[156],{"type":24,"value":157},"Conclusion",{"type":18,"tag":26,"props":159,"children":160},{},[161],{"type":24,"value":162},"This paper proposes CvT, a new model that introduces desirable properties of CNNs into the ViT architecture. It can effectively learn and process image features using convolutions while utilizing the dynamic attention mechanism and global context awareness of the Transformer. When training datasets are small, it can significantly improve image prediction accuracy.",{"type":18,"tag":26,"props":164,"children":165},{},[166],{"type":18,"tag":65,"props":167,"children":168},{},[169],{"type":24,"value":170},"Reference",{"type":18,"tag":26,"props":172,"children":173},{},[174],{"type":24,"value":175},"Wu H, Xiao B, Codella N, et al. CvT: Introducing convolutions to vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 22-31.",{"title":7,"searchDepth":177,"depth":177,"links":178},4,[],"markdown","content:technology-blogs:en:1716.md","content","technology-blogs/en/1716.md","technology-blogs/en/1716","md",1776506103658]