Document analysis via combined vectorization and machine learning approaches

Dinara Kaibassova; Bigul Mukhametzhanova; Dinara Tokseit; Aigul Kubegenova; Murad Kozhanov

doi:10.53894/ijirss.v8i4.8356

Social Sciences


Issue
Vol. 8 No. 4 (2025)


Abstract

This study develops a hybrid model for automatic document classification that combines statistical and semantic text vectorization techniques with machine learning algorithms. The methodology integrates Term Frequency–Inverse Document Frequency (TF-IDF) and Word2Vec embeddings with classifiers such as Support Vector Machine (SVM) and Random Forest. The proposed approach comprises data preprocessing (tokenization, normalization, stop word removal, and lemmatization), feature extraction, model training, and evaluation using classification metrics such as accuracy, F1-score, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa. In the experiments, the Word2Vec + SVM model outperforms the other configurations, achieving 90.2% accuracy and an F1-score of 82.52%, which highlights the advantage of incorporating semantic context into the vector representation. The study concludes that hybrid methods combining TF-IDF and Word2Vec with robust classifiers improve both the precision and the generalizability of document analysis models. Potential applications include sentiment analysis, topic modeling, text classification in the legal and healthcare domains, and multilingual settings. The research provides a foundation for developing high-performance text analysis systems for a range of real-world natural language processing tasks.
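
The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes a binary classification task and uses toy labels invented for illustration; the study's dataset, number of classes, and implementation are not given here, and the function names `confusion_counts` and `evaluate` are hypothetical.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count (tp, fp, fn, tn) for a binary task with the given positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, F1, MCC, and Cohen's kappa for binary predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    n = tp + fp + fn + tn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # MCC: correlation between true and predicted labels, in [-1, 1].
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = ((p_observed - p_chance) / (1 - p_chance)
             if p_chance != 1 else 0.0)

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc, "kappa": kappa}


# Toy labels (made up for illustration, not taken from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
scores = evaluate(y_true, y_pred)
print(scores)
```

Reporting MCC and Cohen’s Kappa alongside accuracy is useful when classes are imbalanced, because both discount agreement that a classifier could reach by chance.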

Authors

Dinara Kaibassova

Astana IT University, Astana, 010000, Kazakhstan.

https://orcid.org/0000-0002-8410-7758

Bigul Mukhametzhanova

Abylkas Saginov Karaganda Technical University, Karaganda, 100000, Kazakhstan.

https://orcid.org/0000-0003-3585-8181

Dinara Tokseit

L. N. Gumilyov Eurasian National University, Astana, 010000, Kazakhstan.

https://orcid.org/0000-0001-9075-3943

Aigul Kubegenova

West Kazakhstan Agrarian and Technical University named after Zhangir Khan, Uralsk, 090000, Kazakhstan.

https://orcid.org/0000-0003-0156-7757

Murad Kozhanov

Abylkas Saginov Karaganda Technical University, Karaganda, 100000, Kazakhstan.

https://orcid.org/0000-0002-5310-9953

mukhamedzhanova.bigul@mail.ru (Primary Contact)

Kaibassova, D., Mukhametzhanova, B., Tokseit, D., Kubegenova, A., & Kozhanov, M. (2025). Document analysis via combined vectorization and machine learning approaches. International Journal of Innovative Research and Scientific Studies, 8(4), 2195–2204. https://doi.org/10.53894/ijirss.v8i4.8356


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

