Part-of-Speech Tagging-Based Document Clustering for Kurdish Corpora

Toreen Masoud; Ismael Ali

doi:10.38094/jastt7031086

2026
Special Issue: Selected Proceedings of the 1st International Conference on Artificial Intelligence for Sustainability in the Developing World (AISDW2025)

Published 2026-04-19

Toreen Dilshad Masoud
Department of Information Technology, Technical College of Informatics, Akre University for Applied Sciences, Akre, Kurdistan Region, Iraq

Ismael Ail Ali
Jazari Research Center, Research Center, University of Zakho, Zakho, Kurdistan Region, Iraq

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

How to Cite

T. Masoud and I. Ali, “Part-of-Speech Tagging-Based Document Clustering for Kurdish Corpora”, JASTT, vol. 7, no. 03, pp. 37–45, Apr. 2026, doi: 10.38094/jastt7031086.

Abstract

This study investigates the Natural Language Processing (NLP) challenges of unsupervised clustering for Kurdish documents, focusing on the role of Part-of-Speech (POS) tagging in improving clustering performance. Given the linguistic complexity of Kurdish and the scarcity of annotated corpora, the proposed method builds a TF-IDF (Term Frequency–Inverse Document Frequency) matrix, clusters it with K-Means, and applies POS tagging to a Badini Kurdish corpus. POS tagging is important for capturing the syntactic and grammatical structures of text. The experiments were conducted on the UOZBDN corpus, which includes 231 documents distributed across five categories. To evaluate the impact of POS tagging, clustering performance was compared with and without POS tags on the same corpus. One of the main challenges in this research is the absence of prior studies that apply document clustering techniques to Kurdish corpora, so little prior work is available for direct comparison with the results obtained here. The results showed that incorporating POS tags, particularly a carefully selected subset of 22 key POS categories, significantly improved clustering performance. The proposed approach achieved a Purity score of 0.9714, a Normalized Mutual Information (NMI) score of 0.9477, and a Silhouette score of 0.2403. These results demonstrate that POS tagging significantly enhanced clustering quality and highlight the importance of POS information in the Badini Kurdish corpus.
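As a rough illustration of the pipeline the abstract describes, the following minimal sketch filters tokens by POS tag, computes TF-IDF weights, and scores a clustering with Purity and NMI. It is not the authors' implementation: the function names, the toy tag labels (`NOUN`, `VERB`, `DET`), and the arithmetic-mean normalization used for NMI are assumptions made for illustration, and the K-Means step and the Silhouette score are omitted.

```python
import math
from collections import Counter


def filter_by_pos(tagged_tokens, keep_tags):
    """Keep only tokens whose POS tag is in keep_tags (hypothetical tag set)."""
    return [tok for tok, tag in tagged_tokens if tag in keep_tags]


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(
        (c / n) * math.log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items()
    )

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    denom = (entropy(ct) + entropy(cp)) / 2
    return mi / denom if denom > 0 else 1.0


# Toy usage with made-up English tokens and tags, not data from the UOZBDN corpus.
tagged = [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")]
tokens = filter_by_pos(tagged, {"NOUN", "VERB"})
weights = tfidf([tokens, ["cat", "sleeps"]])
score = purity(["a", "a", "b", "b"], [0, 0, 0, 1])
```

In this sketch, restricting the vocabulary to a chosen tag subset happens before TF-IDF weighting, so the clustering step only sees terms from the retained POS categories.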

Keywords

Natural Language Processing, Document Clustering, Kurdish Language Processing, TF-IDF, Part-of-Speech Tagging

