Improving Keyword Extraction in Multilingual Texts

Hashemzade, B. and M. Abdolrazzagh-Nezhad

Authors	Hashemzade, B. and M. Abdolrazzagh-Nezhad
Journal	International Journal of Electrical and Computer Engineering
Paper type	Full Paper
Publication date	2020
Journal rank	ISI
Journal type	Print
Country of publication	Indonesia

Abstract

This paper addresses automatic keyword extraction from multilingual texts. It presents an unsupervised, language-independent algorithm that improves accuracy by using information from several languages at once. The method extends the traditional TF-IDF (Term Frequency-Inverse Document Frequency) approach: it considers a word's statistical significance within a single-language document and also its relevance across parallel texts in other languages. The core idea is to select a word as a keyword only if it holds a consistently high TF-IDF score both in the original language of the text and in its translated or parallel versions.

The algorithm first gathers parallel news texts in eight languages from sources such as the BBC. Preprocessing removes common stop words (such as prepositions and conjunctions) in each language, after which the TF-IDF score of each candidate word is calculated in every language version of the document. Rather than relying on the TF-IDF score from a single language, the method averages each word's TF-IDF score across all available language versions. The words with the highest average scores are selected as the final keywords for the whole multilingual document set.
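
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how a word in one language is matched to its counterpart in another, so the sketch assumes a hypothetical `alignment` table that maps each term to a shared concept label. The smoothed IDF variant and the function names `tf_idf` and `average_tf_idf_keywords` are also choices made here for illustration.

```python
import math
from collections import Counter, defaultdict


def tf_idf(doc_tokens, corpus):
    """Score every term of one document against a same-language corpus.

    Uses a smoothed IDF (an assumption of this sketch) so that terms
    missing from the corpus do not cause a division by zero.
    """
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {
        term: (count / total)
        * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }


def average_tf_idf_keywords(versions, corpora, stop_words, alignment, top_k=5):
    """Rank concepts by TF-IDF averaged over all language versions.

    versions:   {lang: tokens of this article in that language}
    corpora:    {lang: list of token lists used for document frequencies}
    stop_words: {lang: set of words to drop}
    alignment:  {lang: {term: concept}} (hypothetical cross-lingual mapping)
    """
    totals = defaultdict(float)
    for lang, tokens in versions.items():
        stops = stop_words.get(lang, set())
        kept = [t for t in tokens if t not in stops]
        corpus = [[t for t in doc if t not in stops] for doc in corpora[lang]]
        for term, score in tf_idf(kept, corpus).items():
            concept = alignment[lang].get(term)
            if concept is not None:  # unaligned terms are skipped
                totals[concept] += score
    # Average over every language version, including those where the
    # concept did not appear (it contributes zero there).
    averaged = {c: s / len(versions) for c, s in totals.items()}
    return sorted(averaged, key=averaged.get, reverse=True)[:top_k]
```

Dividing by the number of language versions, rather than by the number of versions in which a concept appears, penalizes words that score highly in only one language, which matches the idea of keeping words that are consistently important across versions.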

The main result is a marked improvement in extraction accuracy. Using the average TF-IDF method, the proposed algorithm achieved an accuracy of 91.3% on a dataset of 200 news articles in eight languages, compared with 60.65% for the conventional TF-IDF baseline and 80% for a graph-based algorithm. The method's strengths are its simplicity, its unsupervised nature, and its language independence: it requires no language-specific rules, morphological analysis, or large training datasets, which makes it scalable and applicable to diverse linguistic contexts.

In conclusion, this research demonstrates that cross-lingual information can be used to refine keyword extraction. By aggregating statistical signals from parallel texts, the algorithm reduces errors caused by language-specific noise and identifies core thematic words more reliably. The approach offers a practical and efficient way to improve information retrieval, text summarization, and content analysis in settings where documents exist in several language versions at once.

Permanent link to the article

tags: Keyword Extraction