Automatic Clustering Technology of Scientific Documents Based on Water Cycle Algorithm

Abdolrazzagh-Nezhad, Hashemzadeh, Ghasemi, Effat

Authors	Abdolrazzagh-Nezhad, Hashemzadeh, Ghasemi, Effat
Journal	Journal of Decisions and Operations Research
Article type	Full Paper
Publication date	2024
Journal rank	ISI
Journal type	Print
Country of publication	Iran

Abstract

This paper presents a novel automatic clustering method for scientific documents based on the Water Cycle Algorithm (WCA). The goal is to improve the quality and efficiency of clustering large-scale, imbalanced textual data while reducing the need for manual parameter tuning. The proposed approach has three main stages: preprocessing of the text data, document representation using TF-IDF adapted for scientific content, and a dynamic mechanism that activates and deactivates cluster centers to determine the number of clusters automatically. WCA, a nature-inspired optimization algorithm, optimizes both the number and the coordinates of the cluster centers, addressing common challenges in document clustering such as sensitivity to outliers and scalability.
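The activation and deactivation of cluster centers can be illustrated with a minimal sketch. The abstract does not give the encoding, so everything here is an assumption: each candidate solution is taken to carry one activation threshold per candidate center, a center counts as active when its threshold exceeds a cutoff of 0.5, and at least two centers are always kept. The function name `decode_active_centers` and its parameters are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def decode_active_centers(
    thresholds: Sequence[float],
    centers: Sequence[Sequence[float]],
    cutoff: float = 0.5,      # assumed cutoff, not stated in the abstract
    min_clusters: int = 2,    # assumed lower bound on the cluster count
) -> List[List[float]]:
    """Return the candidate centers whose activation threshold exceeds the cutoff.

    If fewer than `min_clusters` centers are active, the centers with the
    highest thresholds are activated instead, so a solution always
    describes a valid clustering.
    """
    if len(thresholds) != len(centers):
        raise ValueError("one threshold is required per candidate center")
    if len(centers) < min_clusters:
        raise ValueError("not enough candidate centers")

    active = [i for i, t in enumerate(thresholds) if t > cutoff]
    if len(active) < min_clusters:
        # Rank candidates by threshold and keep the strongest ones.
        ranked = sorted(range(len(thresholds)),
                        key=lambda i: thresholds[i], reverse=True)
        active = sorted(ranked[:min_clusters])
    return [list(centers[i]) for i in active]
```

Under this encoding, the optimizer searches over thresholds and coordinates together, so the number of clusters is a by-product of the search rather than an input.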

The study evaluates the proposed method on two benchmark datasets from the NIPS 2015 and AAAI 2013 conferences, comparing its performance against four established meta-heuristic algorithms: Differential Evolution, Genetic Algorithm, Artificial Bee Colony, and Particle Swarm Optimization. The evaluation metrics are the Davies-Bouldin (DB) index and the Chou and Su (CS) index, both of which measure clustering quality based on intra-cluster cohesion and inter-cluster separation. The results show that the WCA-based approach consistently achieves lower DB and CS values, indicating better-defined and more separable clusters. The method also shows a significant reduction in computational time, making it particularly suitable for large and imbalanced datasets.
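To make the DB index concrete, here is a minimal pure-Python sketch of its standard definition: the average, over clusters, of the worst-case ratio between the summed within-cluster scatter of two clusters and the distance between their centroids. It assumes Euclidean distance and dense numeric vectors; the paper may compute the index with a different distance on TF-IDF vectors. The function name `davies_bouldin` is illustrative.

```python
import math
from typing import Dict, Hashable, List, Sequence


def davies_bouldin(points: Sequence[Sequence[float]],
                   labels: Sequence[Hashable]) -> float:
    """Davies-Bouldin index with Euclidean distance (lower is better)."""
    clusters: Dict[Hashable, List[Sequence[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid of each cluster: coordinate-wise mean of its members.
    centroids = {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter of each cluster: mean distance of its members to the centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # Worst-case similarity between cluster i and any other cluster.
        # Coinciding centroids would divide by zero; not handled in this sketch.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)
```

For example, four points at the corners of a long rectangle, `[(0, 0), (0, 2), (10, 0), (10, 2)]`, score much lower when grouped by their x-coordinate (labels `[0, 0, 1, 1]`) than when grouped by their y-coordinate (labels `[0, 1, 0, 1]`), which matches the intuition that compact, well-separated clusters are preferred.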

A key innovation of this work is the adaptation of WCA, previously applied mainly to numerical or image clustering, to textual data, combined with a novel activation/deactivation mechanism for cluster centers. This allows the algorithm to adjust the number of clusters dynamically without prior specification, which increases flexibility and automation. In addition, TF-IDF document representation keeps the method interpretable and scalable, while preprocessing steps such as stop-word removal, stemming, and dimensionality reduction improve clustering coherence and reduce noise.
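The representation step can be sketched as follows. This is one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency), not necessarily the weighting adapted for scientific content in the paper. The stop-word list is a tiny illustrative sample, and stemming and dimensionality reduction are omitted; the names `STOP_WORDS`, `tokenize`, and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "the", "of", "for", "and", "in", "on", "is", "to"})


def tokenize(text: str) -> List[str]:
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf(docs: Sequence[str]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

With this variant, a term that appears in every document receives a weight of zero, so only terms that distinguish documents from each other contribute to the distance between them.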

tags: Automatic Clustering