[{"data":1,"prerenderedAt":274},["ShallowReactive",2],{"content-query-SntDXl7fDn":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":268,"_id":269,"_source":270,"_file":271,"_stem":272,"_extension":273},"/technology-blogs/en/2950","en",false,"","Unedited Video Identification and Location via MindSpore-based Anchor-free Temporal Action Localization","This blog introduces a paper that focuses on temporal action localization, with the aim to refine the anchor-free methods by using a novel Progressive Boundary-aware Boosting Network.","2023-12-08","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/efbc10ef60f74525bdeb9fd68567ef13.png","technology-blogs",{"type":14,"children":15,"toc":265},"root",[16,24,30,35,40,49,54,59,64,75,80,89,101,106,111,120,125,130,135,140,148,153,165,173,178,183,188,196,201,208,213,221,226,233,240,247,252,260],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"unedited-video-identification-and-location-via-mindspore-based-anchor-free-temporal-action-localization",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":23,"value":29},"December 08, 2023",{"type":17,"tag":25,"props":31,"children":32},{},[33],{"type":23,"value":34},"Author: Li Ruifeng | Source: Zhihu",{"type":17,"tag":25,"props":36,"children":37},{},[38],{"type":23,"value":39},"Paper Title",{"type":17,"tag":25,"props":41,"children":42},{},[43],{"type":17,"tag":44,"props":45,"children":46},"em",{},[47],{"type":23,"value":48},"Anchor-free Temporal Action Localization via Progressive Boundary-aware Boosting",{"type":17,"tag":25,"props":50,"children":51},{},[52],{"type":23,"value":53},"Source of the Paper",{"type":17,"tag":25,"props":55,"children":56},{},[57],{"type":23,"value":58},"IPM 2023",{"type":17,"tag":25,"props":60,"children":61},{},[62],{"type":23,"value":63},"Paper
URL",{"type":17,"tag":25,"props":65,"children":66},{},[67],{"type":17,"tag":68,"props":69,"children":73},"a",{"href":70,"rel":71},"https://www.sciencedirect.com/science/article/abs/pii/S0306457322002424",[72],"nofollow",[74],{"type":23,"value":70},{"type":17,"tag":25,"props":76,"children":77},{},[78],{"type":23,"value":79},"Code URL",{"type":17,"tag":25,"props":81,"children":82},{},[83],{"type":17,"tag":68,"props":84,"children":87},{"href":85,"rel":86},"https://gitee.com/chunjie-zhang/anchor-free",[72],[88],{"type":23,"value":85},{"type":17,"tag":25,"props":90,"children":91},{},[92,94,99],{"type":23,"value":93},"As an open-source AI framework, MindSpore supports ultra-large-scale AI pre-training and offers researchers and developers device-edge-cloud synergy, simplified development, ultimate performance, and security and reliability. Since it was open-sourced on March 28, 2020, MindSpore has been downloaded more than 6 million times. It has also been the subject of hundreds of papers presented at premier AI conferences. Furthermore, it has a large community of developers and has been introduced in over 100 top universities and 5000 commercial applications. Widely used in scenarios such as AI computing centers, finance, smart manufacturing, cloud, wireless, datacom, energy, \"1+8+",{"type":17,"tag":44,"props":95,"children":96},{},[97],{"type":23,"value":98},"N",{"type":23,"value":100},"\" consumers, and smart automobiles, MindSpore has emerged as one of the leading open-source software projects on Gitee.
The MindSpore community extends a warm welcome to all who wish to contribute to open-source development kits, models, industrial applications, algorithm innovations, academic collaborations, AI-themed book writing, and application cases across the cloud, device, edge, and security.",{"type":17,"tag":25,"props":102,"children":103},{},[104],{"type":23,"value":105},"Thanks to the support from scientific, industry, and academic circles, MindSpore-based papers accounted for 7% of all papers about AI frameworks in 2023, ranking No. 2 globally for two consecutive years. The MindSpore community is thrilled to share and interpret top-level conference papers and is looking forward to collaborating with experts from industries, academia, and research institutions, so as to yield proprietary AI outcomes and innovate AI applications. In this blog, I'd like to share a paper by the team led by Prof. Zhang Chunjie of the School of Computer and Information Technology at Beijing Jiaotong University.",{"type":17,"tag":25,"props":107,"children":108},{},[109],{"type":23,"value":110},"This paper focuses on temporal action localization and aims to refine anchor-free methods. Such methods have gained increasing attention because they have low computational cost and need neither complex hyperparameters nor pre-set anchors. However, inaccurate action boundary prediction remains a bottleneck for most existing methods. Therefore, a novel Progressive Boundary-aware Boosting Network (PBBNet) is proposed, which improves on these methods through progressive boundary refinement and temporal context aggregation.
The algorithm can be implemented using the cases and code in the MindSpore official documentation.",{"type":17,"tag":25,"props":112,"children":113},{},[114],{"type":17,"tag":115,"props":116,"children":117},"strong",{},[118],{"type":23,"value":119},"01 Research Background",{"type":17,"tag":25,"props":121,"children":122},{},[123],{"type":23,"value":124},"As the cost of photography decreases, information is stored as video data in many scenarios. Unlike text, images, and audio, video data contains both spatial and temporal information, which makes it more complex to deal with, especially for unedited long-duration videos in the real world. To effectively analyze these unedited videos, we often need to focus on specific themes, such as human actions, animal actions, or object movements.",{"type":17,"tag":25,"props":126,"children":127},{},[128],{"type":23,"value":129},"With deep learning algorithms, computers can identify and trim the video snippets related to specific themes from unedited videos, thereby efficiently dealing with the huge amount of video data in the real world. Among these themes, human actions are one of the most important, and videos containing human actions are easier to collect. As a result, most research on unedited video understanding focuses on human actions. Temporal action localization (TAL), an important and basic video understanding task, aims to accurately locate and classify video snippets that may contain human action instances in unedited videos. TAL algorithms can be applied to various scenarios, including business recommendation and autonomous driving.",{"type":17,"tag":25,"props":131,"children":132},{},[133],{"type":23,"value":134},"In recent years, the TAL task has attracted increasing attention. However, most existing anchor-free TAL models still suffer from inaccurate action boundary predictions, mainly because they generate fewer proposals and are weak at temporal modeling.
For example, the AFSD model uses a coarse-to-fine framework to effectively improve the accuracy of boundary predictions. However, its refined prediction policy is relatively simple, and boundary neighborhood information is not fully used, leading to ambiguous boundary predictions when the boundary context is complex.",{"type":17,"tag":25,"props":136,"children":137},{},[138],{"type":23,"value":139},"In addition, the AFSD model refines the action proposals only once, which limits the improvement of boundary predictions. Furthermore, anchor-free TAL methods directly regress action boundaries, so they depend on the ability of the model to capture temporal context information. However, the temporal modeling capability of existing anchor-free TAL models is still not satisfactory. Therefore, it is necessary to improve anchor-free temporal action localization in terms of both boundary refinement and temporal modeling.",{"type":17,"tag":25,"props":141,"children":142},{},[143],{"type":17,"tag":115,"props":144,"children":145},{},[146],{"type":23,"value":147},"02 Team Introduction",{"type":17,"tag":25,"props":149,"children":150},{},[151],{"type":23,"value":152},"Tang Yepeng, the first author of this paper, is a PhD from the School of Computer and Information Technology, Beijing Jiaotong University. His research focuses on computer vision, video understanding, and temporal action localization.",{"type":17,"tag":25,"props":154,"children":155},{},[156,158,163],{"type":23,"value":157},"The Center of Digital Media Information Processing (MePro) at Beijing Jiaotong University started in 1998 and was recognized under the Innovation Team Development Plan of the Ministry of Education in 2012. MePro has 14 faculty members and more than 100 doctoral and postgraduate students. Its research primarily revolves around digital media information processing, including image/video coding and transmission, digital watermarking and forensics, and media content analysis and understanding.
In 2022, the lab made significant contributions to the field by publishing 61 high-impact papers, including 38 papers in the esteemed international journal ",{"type":17,"tag":44,"props":159,"children":160},{},[161],{"type":23,"value":162},"IEEE Trans",{"type":23,"value":164}," and 23 papers presented at top-tier international conferences like NeurIPS, CVPR, ECCV, and ACM MM.",{"type":17,"tag":25,"props":166,"children":167},{},[168],{"type":17,"tag":115,"props":169,"children":170},{},[171],{"type":23,"value":172},"03 Introduction to the Paper",{"type":17,"tag":25,"props":174,"children":175},{},[176],{"type":23,"value":177},"The paper studies anchor-free TAL for videos. This technology plays a critical role in video analysis and intelligent surveillance, where the aim is to identify and locate human behaviors in unedited long-duration videos. However, most existing anchor-free TAL models still suffer from inaccurate action boundary predictions. On the one hand, because only a small number of proposals are generated, anchor-free TAL methods are inherently at a disadvantage in predicting action boundaries.",{"type":17,"tag":25,"props":179,"children":180},{},[181],{"type":23,"value":182},"In addition, existing methods do not make full use of the boundary neighborhood information to carry out hierarchical and detailed boundary predictions. On the other hand, anchor-free TAL methods directly regress action boundaries, so they depend on the ability of the model to capture temporal context information. PBBNet is therefore proposed to solve these problems.",{"type":17,"tag":25,"props":184,"children":185},{},[186],{"type":23,"value":187},"To be specific, PBBNet consists of three main modules: Temporal Context-aware Module (TCM), Instance-wise Boundary-aware Module (IBM), and Frame-wise Progressive Boundary-aware Module (FPBM).
The TCM aggregates temporal context information and provides coarse-grained and fine-grained aggregated features for the IBM and the FPBM, respectively. The IBM locates the approximate temporal position of action instances: it builds a multi-scale feature pyramid and predicts the action boundary and category at each location of the multi-scale features. The FPBM refines the preliminary boundary predictions from the IBM. Compared with the IBM, the FPBM conducts frame-wise prediction on the boundary of each action instance and uses hierarchical supervision information for training. In this process, action boundaries are regressed multiple times at the frame level to obtain high-quality action boundary predictions. This method has achieved advanced performance on multiple TAL datasets.",{"type":17,"tag":25,"props":189,"children":190},{},[191],{"type":17,"tag":192,"props":193,"children":195},"img",{"alt":7,"src":194},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/5bc6f4f4547849cb8c47f3a41d36126b.png",[],{"type":17,"tag":25,"props":197,"children":198},{},[199],{"type":23,"value":200},"Model architecture",{"type":17,"tag":25,"props":202,"children":203},{},[204],{"type":17,"tag":192,"props":205,"children":207},{"alt":7,"src":206},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/da7c2687c88b4c9bb60c55abc37a3e7b.png",[],{"type":17,"tag":25,"props":209,"children":210},{},[211],{"type":23,"value":212},"CCPG and sub-datasets",{"type":17,"tag":25,"props":214,"children":215},{},[216],{"type":17,"tag":115,"props":217,"children":218},{},[219],{"type":23,"value":220},"04 Experiment Results",{"type":17,"tag":25,"props":222,"children":223},{},[224],{"type":23,"value":225},"Experiments are conducted on publicly recognized TAL datasets, including THUMOS14, ActivityNet-v1.3, and HACS.
The comparison results and analysis of PBBNet and other algorithms are as follows.",{"type":17,"tag":25,"props":227,"children":228},{},[229],{"type":17,"tag":192,"props":230,"children":232},{"alt":7,"src":231},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/d223b6ba6738484783580406921ebe24.png",[],{"type":17,"tag":25,"props":234,"children":235},{},[236],{"type":17,"tag":192,"props":237,"children":239},{"alt":7,"src":238},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/fa9991d0ae4e40a7b4d2b3f5727bef89.png",[],{"type":17,"tag":25,"props":241,"children":242},{},[243],{"type":17,"tag":192,"props":244,"children":246},{"alt":7,"src":245},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/01/19/4eb9856a14024ae790d74417188234ab.png",[],{"type":17,"tag":25,"props":248,"children":249},{},[250],{"type":23,"value":251},"The experimental results show that PBBNet outperforms other anchor-free TAL methods and performs as well as methods with dense temporal anchors. In the past, many anchor-free methods used a coarse-to-fine strategy to optimize action instance boundaries in unedited videos. However, they usually made the refined prediction only once, which made precise prediction difficult. To solve this problem, a progressive boosting strategy is proposed, which progressively refines the boundary predictions under supervision that goes from weak to strong.
In addition, the TCM is used to model the relationships within the temporal context to obtain more accurate prediction results.",{"type":17,"tag":25,"props":253,"children":254},{},[255],{"type":17,"tag":115,"props":256,"children":257},{},[258],{"type":23,"value":259},"05 Summary and Prospects",{"type":17,"tag":25,"props":261,"children":262},{},[263],{"type":23,"value":264},"This blog introduces the Progressive Boundary-aware Boosting Network, an anchor-free temporal action localization method for high-quality identification and localization of human actions in unedited videos. The method targets the inaccurate action boundary predictions of anchor-free methods. To this end, the instance-wise boundary-aware module and the frame-wise progressive boundary-aware module are designed to boost the boundary predictions. In addition, the temporal context-aware module is introduced to capture temporal context information, which helps the model generate better results. Extensive experiments on multiple datasets demonstrate the effectiveness of this method. The paper improves anchor-free localization in terms of both boundary refinement and temporal modeling. These innovations will undoubtedly contribute to the application of anchor-free TAL methods in real-world scenarios.",{"title":7,"searchDepth":266,"depth":266,"links":267},4,[],"markdown","content:technology-blogs:en:2950.md","content","technology-blogs/en/2950.md","technology-blogs/en/2950","md",1776506108333]