Weakly-Supervised Video Anomaly Detection with Snippet Anomalous Attention
Paper Title: Weakly-Supervised Video Anomaly Detection with Snippet Anomalous Attention
Published in: IEEE Transactions on Circuits and Systems for Video Technology
Paper URL: https://arxiv.org/abs/2309.16309
Code URL: https://github.com/Daniel00008/WS-VAD-mindspore
As an open source AI framework, MindSpore supports ultra-large-scale AI pre-training and delivers device-edge-cloud synergy, simplified development, ultimate performance, and security and reliability for researchers and developers. Since it was open sourced on March 28, 2020, MindSpore has been downloaded more than 6.57 million times and has been the subject of thousands of papers presented at premier AI conferences. It also has a large developer community and has been adopted in over 290 universities and 5,000 commercial applications. Widely used in scenarios such as AI computing centers, finance, smart manufacturing, cloud, wireless, datacom, energy, "1+8+N" consumer devices, and smart automobiles, MindSpore has emerged as the leading open source software on Gitee. The MindSpore community warmly welcomes everyone who wishes to contribute to open source development kits, models, industrial applications, algorithm innovations, academic collaborations, AI-themed book writing, and application cases across the cloud, device, edge, and security domains.
Thanks to support from the scientific, industrial, and academic communities, MindSpore-based papers accounted for 7% of all papers about AI frameworks in 2023, ranking No. 2 globally for the second consecutive year. The MindSpore community is thrilled to share and interpret top-conference papers and looks forward to collaborating with experts from industry, academia, and research institutions to produce proprietary AI outcomes and innovative AI applications. In this blog, I'd like to share a paper from the team led by Prof. Yahong Han of the College of Intelligence and Computing at Tianjin University.
01 Research Background
With the generation of large-scale video data and the improvement of storage capabilities, video anomaly detection (VAD) has become a key technology for solving practical problems. VAD detects and identifies events or actions in videos that deviate from normal patterns or do not match the scene. It is widely applied in security monitoring, industrial manufacturing, and traffic management. For example, in a city security monitoring system, abnormal events such as trespassing, theft, and violence can be identified and reported in real time to improve public security and property protection.
In a traffic management system, VAD can monitor traffic flow in real time, detect traffic accidents, and identify violations to improve traffic safety. However, existing VAD methods rely heavily on manually annotated data, so researchers have proposed various unsupervised, weakly supervised, and multi-modal solutions. Among these, the weakly supervised approach attracts the most attention: it uses only video-level labels (1 for abnormal, 0 for normal) for training, which reduces manual annotation costs, yet it greatly improves detection performance compared with unsupervised VAD.
02 Team Introduction
Yidan Fan, the first author of this paper, is a postgraduate at Tianjin University. Her research focuses on VAD and domain adaptation.
Prof. Yahong Han, the corresponding author of this paper and Fan's advisor, is a professor and doctoral supervisor at the College of Intelligence and Computing, Tianjin University. His research focuses on multimedia analysis, computer vision, and machine learning.
03 Introduction to the Paper
Because pre-trained models face the difficulty of obtaining labeled data in real downstream VAD tasks, this paper attempts to improve VAD when only weakly labeled data (video-level labels) is available. Previous methods either focus only on discriminative features, overlooking snippet-level embeddings that carry rich context information during feature selection, or employ self-training to generate pseudo-labels that are prone to noise. This paper proposes an anomalous attention mechanism for weakly-supervised anomaly detection based on snippet-level encoded features. The mechanism learns different areas of the video, including areas that are difficult to detect, and assists in attention optimization. Training and validation of the weakly-supervised VAD task are implemented on the MindSpore framework, which accelerates model training compared with other deep learning frameworks.
This paper tackles the challenge of VAD in real scenarios where frame-wise labels are absent during training and only video-level labels are available as coarse supervision, and proposes a new weakly-supervised VAD (WS-VAD) method. Most existing methods either learn discriminative features (Figure 1-a/b) or employ self-training (Figure 1-c) to generate snippet-level pseudo-labels. Both approaches have limitations: the former tends to overlook non-discriminative features at the snippet level, while the latter is susceptible to label noise.

Figure 1 Comparison between existing methods (a, b, c) and the proposed WS-VAD (d)
WS-VAD focuses on snippet-level features and the completeness of anomalies, using snippet-level anomalous attention to implement VAD, as shown in Figure 2. These choices suppress the most discriminative snippets to restrain their influence on the final result, focus on weakly abnormal information, and avoid noise because no pseudo-labels are generated. Specifically, WS-VAD consists of three main modules: a temporal embedding unit, an anomalous attention unit, and a multi-branch supervision module. The first encodes features and aggregates contextual information; the second detects snippet-level anomalies; the third extracts weakly abnormal information with the help of anomalous attention and models the completeness of anomalies under an optimized training strategy. In this process, the anomalous attention is continuously refined by the predicted anomaly scores and in turn supports high-quality regression of those scores, so that anomalies are detected effectively.

Figure 2 Model structure of WS-VAD
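Conceptually, the attention branch turns per-snippet features into an attention distribution over time, and video-level supervision is applied to the attention-pooled score. The following is a minimal NumPy sketch of this idea only; the feature size, the single linear attention head, and the sigmoid snippet classifier are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy snippet features: T snippets, each a D-dim embedding
# (in practice these would come from a pre-trained backbone).
T, D = 16, 32
feats = rng.normal(size=(T, D))

# Hypothetical attention head: one linear layer scoring each snippet,
# normalized over time so the weights form a distribution.
w_att = rng.normal(size=(D,)) * 0.1
att = softmax(feats @ w_att)                    # shape (T,), sums to 1

# Hypothetical classifier producing per-snippet anomaly scores in (0, 1).
w_cls = rng.normal(size=(D,)) * 0.1
snippet_scores = 1.0 / (1.0 + np.exp(-(feats @ w_cls)))

# Attention-weighted pooling yields a single video-level prediction,
# which can be trained against the video-level label alone.
video_score = float(att @ snippet_scores)
print(round(video_score, 4))
```

Because the pooled score is differentiable in both branches, a binary video-level loss pushes the attention toward snippets that explain the label, which is the intuition behind refining attention with the predicted anomaly scores.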
04 Experiment Results
WS-VAD is tested on the UCF-Crime and XD-Violence benchmark datasets. The area under the frame-level ROC curve (AUC) and the average precision (AP, the area under the precision-recall curve) are used as evaluation metrics to validate the effectiveness of WS-VAD. The experiment results are compared with those of state-of-the-art (SOTA) methods and further analyzed.

Table 1 Comparison of results on UCF-Crime

Table 2 Comparison of results on XD-Violence
Experiment results show that the proposed WS-VAD, based on snippet-level anomalous attention, clearly surpasses SOTA weakly supervised VAD methods. († in the tables means the code is re-trained with open source features.) In addition, extensive analysis of the results demonstrates the effectiveness of snippet-level anomalous attention. First, compared with other methods, WS-VAD is more robust to object changes and scene transitions. Second, when anomalies occupy a large proportion of the video, WS-VAD effectively integrates local information around the anomaly while restraining the influence of the most discriminative snippets, producing a relatively smooth anomaly score curve with small fluctuations. Finally, thanks to the snippet-level attention, WS-VAD improves the accuracy of anomaly localization, especially when the abnormal snippets are short or sparsely distributed.
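Both metrics used here are ranking measures over frame-level anomaly scores. A self-contained NumPy sketch of how each is computed (the toy labels and scores are made up, and the AUC formula assumes no tied scores for simplicity):

```python
import numpy as np

def auc_roc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity: the fraction of
    (positive, negative) pairs in which the positive is scored higher.
    Assumes no tied scores."""
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores)
    ranks = np.empty(len(order))
    ranks[order] = np.arange(1, len(order) + 1)   # ranks 1..N, ascending
    n_pos = labels.sum()
    n_neg = (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(labels, scores):
    """AP: mean of the precision measured at each positive, with items
    visited in descending score order."""
    hits = np.asarray(labels, dtype=float)[np.argsort(-np.asarray(scores))]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision * hits).sum() / hits.sum())

y = [1, 0, 1, 1, 0, 0]
s = [0.9, 0.3, 0.8, 0.4, 0.5, 0.1]
print(auc_roc(y, s))            # 8 of 9 positive/negative pairs ordered correctly
print(average_precision(y, s))
```

UCF-Crime results are conventionally reported with AUC, while XD-Violence uses AP, which is more sensitive to performance on the (rare) positive frames.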
05 Summary and Prospects
This paper proposes WS-VAD, an approach that exploits snippet-level encoded features to tackle the difficulty of obtaining labeled data in real downstream VAD tasks. Specifically, an attention mechanism is first introduced on top of global and local modeling of the original features. Then, combined with the snippet-level anomalous attention, a multi-branch supervision module is proposed to utilize both the general predicted scores and the attention-based predictions. WS-VAD suppresses the most discriminative snippets so that the uncertain portions of the video can be learned, and it explores the completeness of anomalies. Finally, to better generate the anomalous attention, an optimization process with a normalization term and guidance terms is given. WS-VAD addresses the problems faced by weakly-supervised anomaly detection in video surveillance applications and offers a new perspective for deploying VAD in real scenarios.
In the research on WS-VAD, MindSpore's automatic parallelism is used to accelerate training, which significantly improves training and inference efficiency and simplifies development. We welcome developers to join us and share experience and skills in the MindSpore community. Together, we will build a more powerful, extensive, and diversified MindSpore ecosystem.