Integrating Vision and Language: An Improved VAD Model

Abstract

Automatic anomaly detection in video surveillance is crucial for public and private safety, yet it remains challenging due to the ambiguity of abnormal events, the scarcity of labeled data, and misalignment between modalities. Traditional video anomaly detection methods focus mainly on spatiotemporal visual features and often ignore semantic information and cross-modal interactions. Moreover, many multimodal approaches rely on simple fusion strategies that fail to resolve alignment problems between modalities. To address these issues, we propose a multimodal framework built around a Hierarchical Multi-scale Temporal Network (H-MSTN) that models short-, medium-, and long-term dependencies in both visual and textual streams. A lightweight cross-modal attention module enforces semantic alignment, while a Multimodal Attention-Based Fusion Transformer (MAFT) refines cross-modal representations in real time. We evaluate the framework on the UCF-Crime and XD-Violence benchmarks. The proposed method achieves 92.42% AUC on UCF-Crime and 88.63% AP on XD-Violence, with significantly lower computational cost and faster inference than recent multimodal baselines such as ReFLIP-VAD. These results demonstrate a strong efficiency–accuracy trade-off for real-time deployment while maintaining competitive or improved performance over prior methods such as MVAD and TEVAD.
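The abstract does not specify the internals of the cross-modal attention module, but modules of this kind typically build on scaled dot-product attention with visual features as queries and textual features as keys/values. The sketch below illustrates that general mechanism only; all names, shapes, and dimensions are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, text):
    """Illustrative cross-modal attention: each visual snippet attends
    over caption-token features (scaled dot-product attention).

    visual: (T_v, d) snippet-level visual features.
    text:   (T_t, d) token-level textual features.
    Returns text-conditioned visual features of shape (T_v, d).
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)   # (T_v, T_t) alignment scores
    weights = softmax(scores, axis=-1)      # attention of each snippet over tokens
    return weights @ text                   # (T_v, d) fused representation

# Hypothetical toy inputs: 16 video snippets and 8 caption tokens, 64-d each.
rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 64))
txt = rng.standard_normal((8, 64))
fused = cross_modal_attention(vis, txt)
print(fused.shape)  # (16, 64)
```

In a full model, the queries, keys, and values would pass through learned projections and the fused output would feed a downstream fusion transformer; the sketch keeps only the alignment step the abstract attributes to the cross-modal attention module.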

Keywords

Video Anomaly Detection (VAD), Vision-Language Models, Multimodal Anomaly Detection, Hierarchical Multi-Scale Temporal Network (H-MSTN), Cross-Modal Attention Module (CMAM), Multimodal Attention-Based Fusion Transformer (MAFT)

References

  1. Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016.
  2. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  3. Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  4. Weiling Chen, Keng Teck Ma, Zi Jian Yew, Minhoe Hur, and David Aik-Aun Khoo. Tevad: Improved video anomaly detection with captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5549–5559, 2023.
  5. Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 387–395, 2023.
  6. Hamza Karim, Keval Doshi, and Yasin Yilmaz. Real-time weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 6848–6856, 2024.
  7. Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986, 2021.
  8. Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14009–14018, 2021.
  9. Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection. arXiv preprint arXiv:2310.18961, 2023.
  10. Yuan Yuan, Zhaojian Li, and Bin Zhao. A survey of multimodal learning: Methods, applications, and future. ACM Computing Surveys, 57(7):1–34, 2025.
  11. Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023.
  12. Yongshuo Zong, Oisin Mac Aodha, and Timothy M Hospedales. Self-supervised multimodal learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(7):5299–5318, 2024.
  13. Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.
  14. Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. Video event restoration based on keyframes for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14592–14601, 2023.
  15. Peng Wu, Xiaotao Liu, and Jing Liu. Weakly supervised audio-visual violence detection. IEEE Transactions on Multimedia, 25:1674–1685, 2022.
  16. Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17949–17958, 2022.
  17. Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
  18. Peng Wu, Jing Liu, Xiangteng He, Yuxin Peng, Peng Wang, and Yanning Zhang. Toward video anomaly retrieval from video anomaly detection: New benchmarks and model. IEEE Transactions on Image Processing, 33:2213–2225, 2024.
  19. Dicong Wang, Qilong Wang, Qinghua Hu, and Kaijun Wu. Multimodal vad: Visual anomaly detection in intelligent monitoring system via audio-vision-language. IEEE Transactions on Instrumentation and Measurement, 2025.
  20. Ata-Ur Rehman, Hafiz Sami Ullah, Haroon Farooq, Muhammad Salman Khan, Tayyeb Mahmood, and Hafiz Owais Ahmed Khan. Multi-modal anomaly detection by using audio and visual cues. IEEE Access, 9:30587–30603, 2021.
  21. Peng Wu, Wanshun Su, Guansong Pang, Yujia Sun, Qingsen Yan, Peng Wang, and Yanning Zhang. Avadclip: Audio-visual collaboration for robust video anomaly detection. arXiv preprint arXiv:2504.04495, 2025.
  22. Hui Lv and Qianru Sun. Video anomaly detection and explanation via large language models. arXiv preprint arXiv:2401.05702, 2024.
  23. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  24. Debojyoti Biswas and Jelena Tesic. Unsupervised domain adaptation with debiased contrastive learning and support-set guided pseudo-labeling for remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024.
  25. Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, and Steven CH Hoi. Align and prompt: Video-and-language pre-training with entity prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4953–4963, 2022.
  26. Noussaiba Jaafar and Zied Lachiri. Multimodal fusion methods with deep neural networks and meta-information for aggression detection in surveillance. Expert Systems with Applications, 211:118523, 2023.
  27. Swalpa Kumar Roy, Ankur Deria, Chiranjibi Shah, Juan M Haut, Qian Du, and Antonio Plaza. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 61:1–15, 2023.
  28. Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In European conference on computer vision, pages 322–339. Springer, 2020.
  29. Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 733–742, 2016.
  30. Peng Wu, Jing Liu, Mingming Li, Yujia Sun, and Fang Shen. Fast sparse coding networks for anomaly detection in videos. Pattern Recognition, 107:107515, 2020.
  31. Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing, 30:3513–3527, 2021.
  32. Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and Hanwang Zhang. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8022–8031, 2023.
  33. Prabhu Prasad Dev, Raju Hazari, and Pranesh Das. Reflip-vad: Towards weakly supervised video anomaly detection via vision-language model. IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  34. Kun-Lun Li, Hou-Kuan Huang, Sheng-Feng Tian, and Wei Xu. Improving one-class svm for anomaly detection. In Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 03EX693), volume 5, pages 3077–3081. IEEE, 2003.
  35. Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8201–8211, 2019.
  36. Yang Liu, Jing Liu, Jieyu Lin, Mengyang Zhao, and Liang Song. Appearance-motion united auto-encoder framework for video anomaly detection. IEEE Transactions on Circuits and Systems II: Express Briefs, 69(5):2498–2502, 2022.
  37. Hui Lv, Chuanwei Zhou, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Localizing anomalies from weakly-labeled videos. IEEE Transactions on Image Processing, 30:4505–4515, 2021.
  38. Lin Wang, Xiangjun Wang, Feng Liu, Mingyang Li, Xin Hao, and Nianfu Zhao. Attention-guided mil weakly supervised visual anomaly detection. Measurement, 209:112500, 2023.
  39. Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6074–6082, 2024.
  40. Bernhard Schölkopf, Robert C Williamson, Alex Smola, John Shawe-Taylor, and John Platt. Support vector method for novelty detection. Advances in Neural Information Processing Systems, 12, 1999.
  41. Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Self-supervised sparse representation for video anomaly detection. In European Conference on Computer Vision, pages 729–745. Springer, 2022.
  42. Luca Zanella, Benedetta Liberatori, Willi Menapace, Fabio Poiesi, Yiming Wang, and Elisa Ricci. Delving into clip latent space for video anomaly recognition. Computer Vision and Image Understanding, 249:104163, 2024.
  43. Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In European Conference on Computer Vision, pages 105–124. Springer, 2022.
  44. Shuo Li, Fang Liu, and Licheng Jiao. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1395–1403, 2022.
  45. Yang Zhen, Yuanfang Guo, Jinjie Wei, Xiuguo Bao, and Di Huang. Multi-scale background suppression anomaly detection in surveillance videos. In 2021 IEEE International Conference on Image Processing (ICIP), pages 1114–1118. IEEE, 2021.
  46. Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18899–18908, 2024.
  47. Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In 2023 IEEE International Conference on Image Processing (ICIP), pages 3230–3234. IEEE, 2023.
  48. Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. W-talc: Weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–579, 2018.
  49. Sanath Narayan, Hisham Cholakkal, Fahad Shahbaz Khan, and Ling Shao. 3c-net: Category count and center loss for weakly-supervised action localization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8679–8687, 2019.
  50. Yi Zhu and Shawn Newsam. Motion-aware feature for improved video anomaly detection. arXiv preprint arXiv:1907.10211, 2019.
