Multilingual thematic modeling: A comparative study of classical and transformational approaches

Aizhan Nazyrova; Aikerim Nasrullayeva; Assel Mukanova; Aigerim Buribayeva; Banu Yergesh

doi:10.53894/ijirss.v8i6.10204

Engineering

Issue
Vol. 8 No. 6 (2025)

Keywords:

DistilBERT, Efficiency, Green AI, Semantic analysis, Sentiment analysis, Sustainability, NLP, Word2Vec, BERT.

Abstract

This study compares classical and transformer-based sentiment analysis models on Kazakh-Russian bilingual texts, addressing the lack of resource-efficient NLP solutions for low-resource languages. Three models were implemented and evaluated: (1) Word2Vec with a two-layer neural network, (2) BERT (rubert-base-cased), and (3) DistilBERT (distilrubert-tiny). A balanced dataset of 226,000 bilingual comments was used. The models were compared on F1-score, accuracy, computational efficiency, inference speed, model size, and energy consumption. BERT achieved the highest F1-score (0.90), but at significant computational and memory cost. DistilBERT reached a nearly identical F1-score (0.89) with substantially lower resource requirements, while Word2Vec scored lower (F1 = 0.81) but was the fastest and most energy-efficient of the three. Error analysis revealed that all three models struggled with negation, sarcasm, idiomatic expressions, and code-mixed language. The findings indicate that lightweight transformer models, particularly DistilBERT, offer a favorable trade-off between accuracy and efficiency. Word2Vec remains a viable option for real-time and embedded applications, while BERT, although the most accurate, is less practical in resource-constrained environments. The study contributes to Green AI by demonstrating how efficient sentiment analysis systems can be built for low-resource languages. The proposed dataset and evaluation framework can serve as a benchmark for future Kazakh-Russian NLP research and for practical applications, including mobile services, e-Government platforms, and education technologies.
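The abstract does not state how F1 was averaged or how inference speed was timed, so the following is only an illustrative sketch of how such indicators might be computed for a binary sentiment classifier. The function names (`binary_f1`, `accuracy`, `mean_latency_ms`) and the toy labels are assumptions made for this example; they are not taken from the paper or its dataset.

```python
from time import perf_counter


def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_latency_ms(predict, texts, repeats=3):
    """Average wall-clock time per call to `predict`, in milliseconds."""
    start = perf_counter()
    for _ in range(repeats):
        for text in texts:
            predict(text)
    elapsed = perf_counter() - start
    return 1000 * elapsed / (repeats * len(texts))


# Toy labels for illustration only (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"F1 = {binary_f1(y_true, y_pred):.2f}")
print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
```

Any callable that maps a text to a label can be passed to `mean_latency_ms`, which would allow the same timing loop to be reused across the three model families under comparable conditions.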

Authors

Aizhan Nazyrova

Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Satpayev str. 2, Astana, Kazakhstan.

https://orcid.org/0000-0002-9162-6791

Aikerim Nasrullayeva

Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Satpayev str. 2, Astana, and Higher School of Information Technology and Engineering, Astana International University, Kabanbay Batyra ave., 8, Astana, 010000, Kazakhstan.

https://orcid.org/0009-0003-3388-878X

nasrullayevaik@gmail.com (Primary Contact)

Assel Mukanova

Higher School of Information Technology and Engineering, Astana International University, Kabanbay Batyra ave., 8, Astana, 010000, Kazakhstan.

https://orcid.org/0000-0002-8964-3891

Aigerim Buribayeva

Higher School of Information Technology and Engineering, Astana International University, Kabanbay Batyra ave., 8, Astana, 010000, Kazakhstan.

https://orcid.org/0009-0008-6374-6447

Banu Yergesh

Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Satpayev str. 2, Astana, Kazakhstan.

https://orcid.org/0000-0002-8967-2625

Nazyrova, A., Nasrullayeva, A., Mukanova, A., Buribayeva, A., & Yergesh, B. (2025). Multilingual thematic modeling: A comparative study of classical and transformational approaches. International Journal of Innovative Research and Scientific Studies, 8(6), 2787–2799. https://doi.org/10.53894/ijirss.v8i6.10204

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
