
Window-Based Vision Transformer Network Implementation For Multi-Label Image Classification In Remote Sensing

Abstract

Swift progress in technology and the widespread availability of low-cost internet access have led to a substantial rise in remote-sensing data volumes, especially for high-resolution and very-high-resolution images. These images contain complex information, and analyzing them with a single scene-level label is inadequate, since it ignores the distinct features that other labels in the image can convey. In multi-label image classification, multiple labels are assigned to an image, reflecting the various objects or features present in the scene. Classifying such images is critically important for monitoring environmental changes over large geographical areas, disaster management, urban planning, agriculture and forestry management, natural resource conservation, and military intelligence. Many methods, primarily deep learning algorithms, are now applied to such classification problems. However, current deep learning approaches for multi-label remote-sensing images often struggle to capture local fine-grained details and global contextual relationships simultaneously, leaving a gap for models that can efficiently integrate these complementary representations. In this study, advanced neural networks are explored and evaluated on the Multi-label AID dataset, which contains 3,000 images and 17 distinct labels: AlexNet, VGG16, DenseNet-201, Inception-v3, and ConvNeXt as CNN models; ViT and SwinT as transformer models; and MaxViT as a hybrid model combining convolutional and transformer blocks. The OneCycleLR scheduler and the Asymmetric Loss (ASL) function are employed for every model to systematically evaluate their impact on performance. MaxViT was chosen because its multi-axis, window- and grid-based attention can jointly model local and global dependencies, making it particularly suitable for the complex spatial patterns of remote-sensing imagery compared with other hybrid architectures. To the best of our knowledge, this is the first application of the window-based MaxViT architecture to the Multi-label AID dataset, providing a novel benchmark for multi-label remote-sensing classification. MaxViT demonstrated superior performance on this dataset, significantly outperforming the existing models and setting a new benchmark with a mean average precision (mAP) of 84.98%.
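For readers who want a concrete picture of the setup described above, the sketch below shows one plausible way to combine the named pieces: a pretrained MaxViT backbone with a 17-way sigmoid head, the Asymmetric Loss, a OneCycleLR schedule, and mAP evaluation. It is a minimal illustration, not the authors' code: the timm model variant (`maxvit_tiny_tf_224`), the learning rates, the schedule lengths, and the simplified ASL implementation are all assumptions.

```python
# Illustrative sketch only (not the authors' released code): fine-tuning a
# pretrained MaxViT for multi-label classification with a simplified
# Asymmetric Loss (ASL) and a OneCycleLR schedule. Model variant name and
# all hyperparameters below are assumptions, not values from the paper.
import torch
import timm
from sklearn.metrics import average_precision_score

NUM_LABELS = 17  # Multi-label AID provides 17 labels


class AsymmetricLoss(torch.nn.Module):
    """Simplified Asymmetric Loss (Ridnik et al.) for multi-hot targets."""

    def __init__(self, gamma_neg=4.0, gamma_pos=0.0, clip=0.05, eps=1e-8):
        super().__init__()
        self.gamma_neg, self.gamma_pos = gamma_neg, gamma_pos
        self.clip, self.eps = clip, eps

    def forward(self, logits, targets):
        p = torch.sigmoid(logits)
        # Probability shifting: hard-discard the easiest negatives.
        p_neg = (p - self.clip).clamp(min=0)
        loss_pos = targets * (1 - p) ** self.gamma_pos * torch.log(p.clamp(min=self.eps))
        loss_neg = (1 - targets) * p_neg ** self.gamma_neg * torch.log((1 - p_neg).clamp(min=self.eps))
        return -(loss_pos + loss_neg).mean()


# MaxViT from timm with a 17-way head (variant name is an assumption).
model = timm.create_model("maxvit_tiny_tf_224", pretrained=True, num_classes=NUM_LABELS)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = AsymmetricLoss()
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=30, steps_per_epoch=100  # placeholder schedule
)

# One training step on dummy data; real code would loop over a DataLoader.
images = torch.randn(4, 3, 224, 224)
targets = torch.randint(0, 2, (4, NUM_LABELS)).float()  # multi-hot labels
targets[0] = 1.0  # ensure every label has a positive so per-label AP is defined
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()

# Evaluation: mAP is the mean of per-label average precision over all labels.
model.eval()
with torch.no_grad():
    scores = torch.sigmoid(model(images)).numpy()
mAP = average_precision_score(targets.numpy(), scores, average="macro")
```

As a design note, OneCycleLR warms the learning rate up to `max_lr` and then anneals it back down over a single cycle, which often lets fine-tuning converge in fewer epochs, while ASL down-weights the easy negatives that dominate multi-label remote-sensing batches.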

Keywords

Remote Sensing (RS), MaxViT, Scene Classification, Multi-label AID, OneCycleLR


