
Window-Based Vision Transformer Network Implementation For Multi-Label Image Classification In Remote Sensing

Abstract

Swift progress in technology and the widespread availability of low-cost internet access have led to a substantial rise in remote-sensing data volumes, especially for high-resolution and very-high-resolution images. These images contain complex information, and analyzing them with a single scene-level label is inadequate, since it ignores the distinct features that other labels in the image can convey. In multi-label image classification, multiple labels are assigned to an image, reflecting the various objects or features present in the scene. Classifying such images is critically important for monitoring environmental changes over large geographical areas, disaster management, urban planning, agriculture and forestry management, natural resource conservation, and military intelligence. Many methods, primarily deep learning algorithms, are now applied to such classification problems. However, current deep learning approaches for multi-label remote-sensing images often struggle to capture local fine-grained details and global contextual relationships simultaneously, leaving a gap for models that can efficiently integrate these complementary representations. In this study, advanced neural networks are explored and evaluated on the Multi-label AID dataset, which contains 3,000 images and 17 distinct labels: AlexNet, VGG16, DenseNet-201, Inception-v3, and ConvNeXt as CNN models; ViT and SwinT as transformer models; and MaxViT as a hybrid model combining convolutional and transformer blocks. The OneCycleLR scheduler and the Asymmetric Loss (ASL) function are employed for every model to systematically evaluate their impact on performance. MaxViT was chosen because its multi-axis, window- and grid-based attention can jointly model local and global dependencies, making it particularly suitable for the complex spatial patterns of remote-sensing imagery compared with other hybrid architectures. To the best of our knowledge, this is the first application of the window-based MaxViT architecture to the Multi-label AID dataset, providing a novel benchmark for multi-label remote-sensing classification. MaxViT demonstrated superior performance on this dataset, significantly outperforming the existing models and setting a new benchmark with a mean average precision (mAP) of 84.98%.
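For readers who want a concrete picture of the setup described above, the sketch below shows one plausible way to combine the named pieces: a pretrained MaxViT backbone with a 17-way sigmoid head, the Asymmetric Loss, a OneCycleLR schedule, and mAP evaluation. It is a minimal illustration, not the authors' code: the timm model variant (`maxvit_tiny_tf_224`), the learning rates, the schedule lengths, and the simplified ASL implementation are all assumptions.

```python
# Illustrative sketch only (not the authors' released code): fine-tuning a
# pretrained MaxViT for multi-label classification with a simplified
# Asymmetric Loss (ASL) and a OneCycleLR schedule. Model variant name and
# all hyperparameters below are assumptions, not values from the paper.
import torch
import timm
from sklearn.metrics import average_precision_score

NUM_LABELS = 17  # Multi-label AID provides 17 labels


class AsymmetricLoss(torch.nn.Module):
    """Simplified Asymmetric Loss (Ridnik et al.) for multi-hot targets."""

    def __init__(self, gamma_neg=4.0, gamma_pos=0.0, clip=0.05, eps=1e-8):
        super().__init__()
        self.gamma_neg, self.gamma_pos = gamma_neg, gamma_pos
        self.clip, self.eps = clip, eps

    def forward(self, logits, targets):
        p = torch.sigmoid(logits)
        # Probability shifting: hard-discard the easiest negatives.
        p_neg = (p - self.clip).clamp(min=0)
        loss_pos = targets * (1 - p) ** self.gamma_pos * torch.log(p.clamp(min=self.eps))
        loss_neg = (1 - targets) * p_neg ** self.gamma_neg * torch.log((1 - p_neg).clamp(min=self.eps))
        return -(loss_pos + loss_neg).mean()


# MaxViT from timm with a 17-way head (variant name is an assumption).
model = timm.create_model("maxvit_tiny_tf_224", pretrained=True, num_classes=NUM_LABELS)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = AsymmetricLoss()
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=30, steps_per_epoch=100  # placeholder schedule
)

# One training step on dummy data; real code would loop over a DataLoader.
images = torch.randn(4, 3, 224, 224)
targets = torch.randint(0, 2, (4, NUM_LABELS)).float()  # multi-hot labels
targets[0] = 1.0  # ensure every label has a positive so per-label AP is defined
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()

# Evaluation: mAP is the mean of per-label average precision over all labels.
model.eval()
with torch.no_grad():
    scores = torch.sigmoid(model(images)).numpy()
mAP = average_precision_score(targets.numpy(), scores, average="macro")
```

As a design note, OneCycleLR warms the learning rate up to `max_lr` and then anneals it back down over a single cycle, which often lets fine-tuning converge in fewer epochs, while ASL down-weights the easy negatives that dominate multi-label remote-sensing batches.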

Keywords

Remote Sensing (RS), MaxViT, Scene Classification, Multi-label AID, OneCycleLR


