Optimizing n-gram lengths for cross-linguistic text classification: A comparative analysis of English and Arabic morphosyntactic structures

Authors: Boumedyen Shannaq

Affiliations:

Management Information System Department, University of Buraimi, Al Buraimi, Oman

Abstract

This paper investigates how n-gram length affects text classification in English and Arabic, two languages with different writing systems, and examines how language characteristics influence the optimal n-gram length. The English dataset comprises 4,450 articles in five categories (business, technology, entertainment, sports, and politics), with 2,225 records used for training and 2,225 for testing. The Arabic dataset consists of 5,000 documents randomly selected from a collection of 111,728. For English text classification, 2-grams perform best, with a precision of 0.482, recall of 0.489, and F1 score of 0.472. For Arabic, 6-grams perform best, reaching an F1 score close to 0.85. These results indicate that language-dependent morphological and syntactic features can significantly affect the performance of n-gram-based models. The findings can inform language-sensitive text classification techniques for categorizing documents accurately and efficiently across languages.
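To make the role of n-gram length concrete, the following is a minimal sketch of how the feature set changes with n. It is not the author's implementation: the abstract does not say whether the n-grams are word-level or character-level, so character-level n-grams are assumed here, and the helper names `char_ngrams` and `ngram_profile` and the two sample strings are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n.

    Assumption: character-level n-grams; the abstract does not specify
    the unit. Runs of whitespace are collapsed to a single space first.
    """
    normalized = " ".join(text.split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def ngram_profile(text, n):
    """Count how often each n-gram occurs (a simple bag-of-n-grams feature vector)."""
    return Counter(char_ngrams(text, n))


# Hypothetical sample strings, used only to illustrate the effect of n.
english_sample = "stock markets rallied"
arabic_sample = "والمعلمون يكتبون"

for n in (2, 6):
    for label, sample in (("English", english_sample), ("Arabic", arabic_sample)):
        profile = ngram_profile(sample, n)
        print(label, "n =", n, "distinct n-grams:", len(profile))
```

Shorter n-grams yield a small, dense feature set, while longer ones capture more of each word at the cost of sparsity. A classifier trained on such profiles would be evaluated once per value of n to find the best length for each language.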

Keywords

N-gram length, Text classification, English language, Arabic language, Morphological features

DOI

https://doi.org/10.21833/ijaas.2025.04.015

Citation (APA)

Shannaq, B. (2025). Optimizing n-gram lengths for cross-linguistic text classification: A comparative analysis of English and Arabic morphosyntactic structures. International Journal of Advanced and Applied Sciences, 12(4), 136–145. https://doi.org/10.21833/ijaas.2025.04.015