Part-of-Speech Tagging-Based Document Clustering for Kurdish Corpora
Abstract
This study investigates the Natural Language Processing (NLP) challenges of unsupervised clustering for Kurdish documents, focusing on the role of Part-of-Speech (POS) tagging in improving clustering performance. Given the linguistic complexity of Kurdish and the scarcity of annotated corpora, the proposed method combines a TF-IDF (Term Frequency–Inverse Document Frequency) matrix with K-Means clustering and applies POS tagging to the Badini Kurdish corpus. POS tagging is important for capturing the syntactic and grammatical structure of text. The experiments were conducted on the UOZBDN corpus, which contains 231 documents distributed across five categories. To evaluate the impact of POS tags, clustering performance was compared with and without POS tagging on the same corpus. A main challenge of this research is the absence of prior studies that apply document clustering techniques to Kurdish corpora, which leaves little prior work for direct comparison with the results obtained here. The results show that incorporating POS tags, particularly a carefully selected subset of 22 key POS categories, significantly improved clustering performance. The proposed approach achieved a Purity score of 0.9714, an NMI (Normalized Mutual Information) score of 0.9477, and a Silhouette score of 0.2403. These results demonstrate that POS tagging significantly enhanced clustering quality and highlight the importance of POS information in the Badini Kurdish corpus.
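The two external metrics reported above, Purity and NMI, compare predicted cluster ids against gold category labels. The following is a minimal sketch of how such scores can be computed, not the study's own evaluation code. The function names `purity`, `entropy`, and `nmi` and the toy labels are hypothetical, and normalizing NMI by the arithmetic mean of the two entropies is an assumption, since the abstract does not state which variant was used. The Silhouette score is left out because it needs the document vectors, not only the labels.

```python
import math
from collections import Counter


def purity(true_labels, cluster_ids):
    """Fraction of documents that belong to the majority category of their cluster."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    # Count how often each (cluster, category) pair occurs.
    pair_counts = Counter(zip(cluster_ids, true_labels))
    # For every cluster, keep the size of its largest category.
    best = {}
    for (cluster, _category), count in pair_counts.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, cluster_ids):
    """Mutual information normalized by the mean entropy (assumed variant)."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    class_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_ids)
    pair_counts = Counter(zip(cluster_ids, true_labels))
    mutual_info = 0.0
    for (cluster, category), count in pair_counts.items():
        mutual_info += (count / n) * math.log(
            count * n / (cluster_counts[cluster] * class_counts[category])
        )
    denom = (entropy(true_labels) + entropy(cluster_ids)) / 2
    # If both labelings are constant, they agree trivially.
    return mutual_info / denom if denom > 0 else 1.0


# Toy data (hypothetical): six documents, two categories, two clusters.
gold = ["sport", "sport", "sport", "health", "health", "health"]
pred = [0, 0, 1, 1, 1, 1]

# 5 of the 6 documents sit in their cluster's majority category.
print(f"purity = {purity(gold, pred):.4f}")
print(f"nmi    = {nmi(gold, pred):.4f}")
```

Both functions depend only on the label assignment, so renaming the cluster ids does not change the scores. Purity alone can be inflated by using many small clusters, which is one reason to report it together with NMI.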
Keywords
Natural Language Processing, Document Clustering, Kurdish Language Processing, TF-IDF, Part-of-Speech Tagging