[{"data":1,"prerenderedAt":292},["ShallowReactive",2],{"content-query-BurSfNUVCA":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":286,"_id":287,"_source":288,"_file":289,"_stem":290,"_extension":291},"/technology-blogs/en/2998","en",false,"","PICR-NET: MindSpore-based RGB-D Salient Object Detection Network, Achieving Accurate and High-Quality Detection","In recent years, with the development and popularization of depth cameras, depth maps have been successfully applied to various computer vision tasks, giving rise to a new branch of SOD: RGB-D SOD.","2023-11-16","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/01/99cdcc2c709643a5988be74bdd4d5cee.png","technology-blogs",{"type":14,"children":15,"toc":283},"root",[16,24,30,35,44,49,54,59,70,75,84,96,101,106,115,120,128,133,138,143,148,153,158,163,171,179,184,189,196,201,206,213,218,223,231,236,243,248,253,260,265,270,278],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"picr-net-mindspore-based-rgb-d-salient-object-detection-network-achieving-accurate-and-high-quality-detection",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":23,"value":29},"Author: Li Ruifeng | Source: Zhihu",{"type":17,"tag":25,"props":31,"children":32},{},[33],{"type":23,"value":34},"Paper Title",{"type":17,"tag":25,"props":36,"children":37},{},[38],{"type":17,"tag":39,"props":40,"children":41},"em",{},[42],{"type":23,"value":43},"Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection",{"type":17,"tag":25,"props":45,"children":46},{},[47],{"type":23,"value":48},"Paper Source",{"type":17,"tag":25,"props":50,"children":51},{},[52],{"type":23,"value":53},"ACM MM 2023",{"type":17,"tag":25,"props":55,"children":56},{},[57],{"type":23,"value":58},"Paper 
URL",{"type":17,"tag":25,"props":60,"children":61},{},[62],{"type":17,"tag":63,"props":64,"children":68},"a",{"href":65,"rel":66},"https://arxiv.org/abs/2308.08930",[67],"nofollow",[69],{"type":23,"value":65},{"type":17,"tag":25,"props":71,"children":72},{},[73],{"type":23,"value":74},"Code URL",{"type":17,"tag":25,"props":76,"children":77},{},[78],{"type":17,"tag":63,"props":79,"children":82},{"href":80,"rel":81},"https://gitee.com/big_feather/acm-mm-2023-picr",[67],[83],{"type":23,"value":80},{"type":17,"tag":25,"props":85,"children":86},{},[87,89,94],{"type":23,"value":88},"As an open-source AI framework, MindSpore supports ultra-large-scale AI pre-training and brings an excellent experience of device-edge-cloud synergy, simplified development, ultimate performance, and security and reliability to researchers and developers. Since it was open sourced on March 28th, 2020, MindSpore has been downloaded more than 6 million times. It has also been the subject of hundreds of papers presented at premier AI conferences. Furthermore, it has a large community of developers and has been adopted by over 100 top universities and 5000 commercial applications. Being widely used in scenarios such as AI computing centers, finance, smart manufacturing, cloud, wireless, datacom, energy, \"1+8+",{"type":17,"tag":39,"props":90,"children":91},{},[92],{"type":23,"value":93},"N",{"type":23,"value":95},"\" consumers, and smart automobiles, MindSpore has emerged as the leading open-source software on Gitee. 
The MindSpore community extends a warm welcome to all who wish to contribute to open-source development kits, models, industrial applications, algorithm innovations, academic collaborations, AI-themed book writing, and application cases across the cloud, device, edge, and security.",{"type":17,"tag":25,"props":97,"children":98},{},[99],{"type":23,"value":100},"Thanks to support from the scientific, industrial, and academic communities, MindSpore-based papers account for 7% of all papers about AI frameworks in 2023, ranking No. 2 globally for two consecutive years. The MindSpore community is thrilled to share and interpret top-level conference papers and looks forward to collaborating with experts from industry, academia, and research institutions to yield proprietary AI outcomes and innovate AI applications. In this blog, I'd like to share a paper from the team led by Prof. Cong Runmin, School of Control Science and Engineering, Shandong University.",{"type":17,"tag":25,"props":102,"children":103},{},[104],{"type":23,"value":105},"MindSpore aims to achieve three goals: easy development, efficient execution, and all-scenario coverage. The development of MindSpore has been characterized by rapid improvements across successive iterations, with its API design becoming more complete, reasonable, and powerful. To augment its convenience and power, several kits based on MindSpore have been developed. 
One such example is MindSpore Insight, which can present model architectures in graphs and dynamically monitor the changes of indicators and parameters during model execution, thereby simplifying the development process.",{"type":17,"tag":25,"props":107,"children":108},{},[109],{"type":17,"tag":110,"props":111,"children":112},"strong",{},[113],{"type":23,"value":114},"01 Research Background",{"type":17,"tag":25,"props":116,"children":117},{},[118],{"type":23,"value":119},"Inspired by the attention mechanism of human vision, salient object detection (SOD) aims to locate the most attractive objects or regions in a given scene. In recent years, with the development and popularization of depth cameras, depth maps have been successfully applied to various computer vision tasks, giving rise to a new branch of SOD: RGB-D SOD. Depth maps help a computer simulate the human visual system more comprehensively: by offering supplementary information such as structure and location cues, they provide new solutions for scenarios where detection is challenging due to low contrast or complex backgrounds.",{"type":17,"tag":25,"props":121,"children":122},{},[123],{"type":17,"tag":110,"props":124,"children":125},{},[126],{"type":23,"value":127},"02 Team Introduction",{"type":17,"tag":25,"props":129,"children":130},{},[131],{"type":23,"value":132},"Cong Runmin, a distinguished professor and PhD supervisor at Shandong University, was a recipient of the Young Elite Scientist Sponsorship Program by the China Association for Science and Technology, The World's Top 2% Scientists (2021-2023), the Taishan Scholar Project of Shandong Province, and the Qilu Young Scholar program of Shandong University. He holds the position of Deputy Secretary-General for the Youth Committee of the China Society of Image and Graphics (CSIG), and also serves as the Vice Chairman of CSIG's Youbo Club. 
His research interests include computer vision, artificial intelligence, multimedia information processing, visual saliency computation, and underwater environment perception. He has participated in a variety of scientific research projects at the national and provincial levels; published 66 papers in CCF-A venues and IEEE/ACM Transactions, such as IEEE TIP, NeurIPS, CVPR, and ICCV, including 2 ESI hot papers and 11 ESI highly cited papers; and holds 22 national invention patents. In addition, he serves on the editorial boards of multiple SCI Q2 journals and has won many paper awards.",{"type":17,"tag":25,"props":134,"children":135},{},[136],{"type":23,"value":137},"Liu Hongyu, a second-year postgraduate student at Beijing Jiaotong University, focuses on research on RGB-D and high-resolution SOD, and has won a national scholarship.",{"type":17,"tag":25,"props":139,"children":140},{},[141],{"type":23,"value":142},"Zhang Chen, a postgraduate student at Beijing Jiaotong University, has published five CCF-A/IEEE Transactions papers, which have been cited more than 150 times. He has also been granted one national patent and has won several awards for his outstanding scholarly articles.",{"type":17,"tag":25,"props":144,"children":145},{},[146],{"type":23,"value":147},"Zhang Wei, a distinguished professor and PhD supervisor at Shandong University, is also a member of the Changjiang (Yangtze River) Scholars Program. He is mainly engaged in research in fields such as visual perception, machine learning, and robotics. He has led and participated in more than 10 major national projects. 
He has published more than 80 papers in authoritative journals and conferences such as IEEE TPAMI, TNNLS, TIP, TCYB, CVPR, ICCV, IJCAI, and AAAI, and holds more than 10 invention patents in the United States and China.",{"type":17,"tag":25,"props":149,"children":150},{},[151],{"type":23,"value":152},"Zheng Feng, winner of the National Natural Science Foundation of China (NSFC) Fund for Excellent Young Scholars and associate professor (researcher) at Southern University of Science and Technology, focuses on research on machine learning, computer vision, and cross-media computing. He has published 85 academic papers in top international journals and conferences, including IEEE TPAMI/TIP/TNNLS, AAAI, NeurIPS, CVPR, and ICCV, two of which are highly cited and 45 of which appear in CCF-recommended venues.",{"type":17,"tag":25,"props":154,"children":155},{},[156],{"type":23,"value":157},"Song Ran, a professor and PhD supervisor at the School of Control Science and Engineering, Shandong University, was honored as a Top Young Scholar of the National Ten Thousand Talents Program. His papers have received best paper awards at international conferences three times. Moreover, he has participated in several key national projects.",{"type":17,"tag":25,"props":159,"children":160},{},[161],{"type":23,"value":162},"Kwong Tak Wu Sam is Chair Professor of Computational Intelligence and concurrently Associate Vice-President (Strategic Research) of Lingnan University. Professor Kwong is a distinguished scholar in evolutionary computation, artificial intelligence solutions, and image/video processing. 
Professor Kwong has a prolific publication record with over 350 journal articles and 160 conference papers, and an h-index of 76 based on Google Scholar.",{"type":17,"tag":25,"props":164,"children":165},{},[166],{"type":17,"tag":110,"props":167,"children":168},{},[169],{"type":23,"value":170},"03 Introduction to the Paper",{"type":17,"tag":25,"props":172,"children":173},{},[174],{"type":17,"tag":175,"props":176,"children":178},"img",{"alt":7,"src":177},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/01/3cbd88aed0cd4a349092948c7f670ae1.png",[],{"type":17,"tag":25,"props":180,"children":181},{},[182],{"type":23,"value":183},"Figure 1: Visual comparison of representative networks with different architectures, where MVSalNet, VST, and TriTransNet are pure CNN, Transformer, and Transformer-assisted CNN architectures, respectively.",{"type":17,"tag":25,"props":185,"children":186},{},[187],{"type":23,"value":188},"From the perspective of model architecture, existing RGB-D SOD methods can be classified into three types: pure CNN models, pure Transformer models, and Transformer-assisted CNN models. For a pure CNN architecture, the excellent local perception capability of the convolutional operation means the saliency results describe local details (e.g., boundaries) well. However, the results may be incomplete, such as the result of MVSalNet in the first image of Figure 1. For a pure Transformer architecture, the ability to capture long-range dependencies can enhance the integrity of detection results. However, the patch-dividing operation may compromise the quality of details, causing blocking effects and even introducing additional false detections, as shown in the result of VST in Figure 1. The Transformer-assisted CNN structure introduces a Transformer to assist CNNs in global context modeling, which can alleviate the disadvantages of either single solution by combining the two. 
However, in a layer-by-layer decoding process, the convolutional operation gradually dilutes the global information obtained by the Transformer, resulting in missing or false detections, as shown in the result of TriTransNet in Figure 1.",{"type":17,"tag":25,"props":190,"children":191},{},[192],{"type":17,"tag":175,"props":193,"children":195},{"alt":7,"src":194},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/01/2d857fe5f20643e99d43e06e2e94918c.png",[],{"type":17,"tag":25,"props":197,"children":198},{},[199],{"type":23,"value":200},"Figure 2: Overall framework of the proposed PICR-Net",{"type":17,"tag":25,"props":202,"children":203},{},[204],{"type":23,"value":205},"Therefore, given the relationship between the Transformer and CNN, we propose a new model architecture named PICR-Net. As shown in Figure 2, we use the Transformer to complete most of the encoding and decoding processes, and design a pluggable CNN-induced refinement (CNNR) unit for content refinement at the network end. In this way, the Transformer and CNN can be fully utilized without interfering with each other, providing global and local perception abilities and generating accurate and high-quality saliency maps.",{"type":17,"tag":25,"props":207,"children":208},{},[209],{"type":17,"tag":175,"props":210,"children":212},{"alt":7,"src":211},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/01/3ba42aaef09746e29e5966ea2ca9f0ac.png",[],{"type":17,"tag":25,"props":214,"children":215},{},[216],{"type":23,"value":217},"Figure 3: Cross-modality point-aware interaction (CmPI) module",{"type":17,"tag":25,"props":219,"children":220},{},[221],{"type":23,"value":222},"Once the multi-level encoding features of the RGB and depth modalities are extracted, how to achieve comprehensive interaction between them becomes a critical issue in the encoding stage. 
Existing cross-modality interaction schemes under the Transformer architecture usually model the relationship among all positions of the two modalities. However, it is well known that an RGB image and its depth map are spatially aligned, that is, the two modalities have a clear relationship only at corresponding positions. Modeling the relationship between all pixels of different modalities is therefore computationally redundant, and such forced association modeling may also introduce unnecessary noise. Considering this cross-modality reality of RGB-D SOD tasks, we introduce position constraint factors and propose a cross-modality point-aware interaction scheme, as shown in Figure 3. Its core is to explore the interaction of the two modalities' features at the same position through multi-head attention.",{"type":17,"tag":25,"props":224,"children":225},{},[226],{"type":17,"tag":110,"props":227,"children":228},{},[229],{"type":23,"value":230},"04 Experiment Results",{"type":17,"tag":25,"props":232,"children":233},{},[234],{"type":23,"value":235},"To verify the effectiveness of PICR-Net, we compared it with 16 SOTA methods on five widely used RGB-D SOD datasets.",{"type":17,"tag":25,"props":237,"children":238},{},[239],{"type":17,"tag":175,"props":240,"children":242},{"alt":7,"src":241},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/01/795b4773c63146a1bb174cc1ef0f43b8.png",[],{"type":17,"tag":25,"props":244,"children":245},{},[246],{"type":23,"value":247},"Table 1: Quantitative comparison results of three indicators on five datasets",{"type":17,"tag":25,"props":249,"children":250},{},[251],{"type":23,"value":252},"Table 1 shows the quantitative results of the proposed PICR-Net on five benchmark datasets, with the best performance marked in bold. 
The method proposed in this paper is superior to all comparison methods on these five datasets, except for the S-measure on the LFSD dataset. For example, compared with the second-best method, the percentage gains in MAE reach 16.7%, 1.9%, 9.5%, and 6.1% on the DUT-test, LFSD, NLPR-test, and STERE1000 datasets, respectively. Similar gains can be observed on other indicators.",{"type":17,"tag":25,"props":254,"children":255},{},[256],{"type":17,"tag":175,"props":257,"children":259},{"alt":7,"src":258},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/01/0920ec67c5b74ecea171f173caec1c00.png",[],{"type":17,"tag":25,"props":261,"children":262},{},[263],{"type":23,"value":264},"Figure 4: Visual comparison of PICR-Net and SOTA methods in different challenging scenarios, such as small objects (i.e., a, c, and d), multiple objects (i.e., c), low contrast (i.e., d and f), low-quality depth maps (i.e., b and e), and uneven lighting (i.e., g).",{"type":17,"tag":25,"props":266,"children":267},{},[268],{"type":23,"value":269},"As shown in Figure 4, PICR-Net not only accurately detects salient objects in these challenging scenarios, but also achieves better integrity and local details.",{"type":17,"tag":25,"props":271,"children":272},{},[273],{"type":17,"tag":110,"props":274,"children":275},{},[276],{"type":23,"value":277},"05 Summary and Prospects",{"type":17,"tag":25,"props":279,"children":280},{},[281],{"type":23,"value":282},"Given the respective features and advantages of the Transformer and CNN, this paper proposes a network called PICR-Net for RGB-D SOD: the network as a whole follows a Transformer-based encoder-decoder architecture, and finally adds a pluggable CNNR unit for detail refinement. 
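To make the idea of a pluggable refinement unit concrete, the following is a minimal, framework-agnostic NumPy sketch of what such a CNN-based refinement step could look like. The concatenation-based fusion, the two-convolution design, and the residual connection are illustrative assumptions for this sketch, not the actual CNNR implementation from the paper.

```python
import numpy as np

def conv2d(x, w, b):
    # Naive same-padded 2-D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(c_in):
            for u in range(k):
                for v in range(k):
                    out[o] += w[o, i, u, v] * xp[i, u:u + h, v:v + wd]
        out[o] += b[o]
    return out

def cnnr_refine(feature, coarse, w1, b1, w2, b2):
    # Hypothetical refinement unit: fuse the decoder feature with the
    # coarse saliency map, apply two 3x3 convolutions with a ReLU in
    # between, and add the predicted residual back onto the coarse map.
    x = np.concatenate([feature, coarse[None]], axis=0)  # (C + 1, H, W)
    x = np.maximum(conv2d(x, w1, b1), 0.0)               # conv + ReLU
    delta = conv2d(x, w2, b2)[0]                         # predicted residual map
    return coarse + delta                                # refined saliency map
```

Because the convolutions only look at small local neighborhoods, a unit like this can sharpen boundaries and local details without disturbing the global structure produced by the Transformer decoder, which matches the division of labor described above.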
In addition, compared with the conventional cross-attention mechanism, the CmPI module proposed in this paper considers the prior correlation between the RGB and depth modalities, allowing for more effective cross-modality interaction by introducing spatial constraints and global saliency guidance. Comprehensive experiments show that PICR-Net achieves competitive performance against 16 SOTA methods on five benchmark datasets. MindSpore provides cross-platform development, training, and deployment capabilities to developers, as well as comprehensive documentation and API mapping tables, making MindSpore simple to learn. It is expected that Chinese deep learning frameworks will attract more and more developers through their distinctive features and convenience.",{"title":7,"searchDepth":284,"depth":284,"links":285},4,[],"markdown","content:technology-blogs:en:2998.md","content","technology-blogs/en/2998.md","technology-blogs/en/2998","md",1776506108947]