Document analysis via combined vectorization and machine learning approaches

Dinara Kaibassova, Bigul Mukhametzhanova, Dinara Tokseit, Aigul Kubegenova, Murad Kozhanov

Abstract

This study develops a hybrid model for automatic document classification that combines statistical and semantic text vectorization with machine learning algorithms. The methodology integrates Term Frequency–Inverse Document Frequency (TF-IDF) and Word2Vec embeddings with classifiers such as Support Vector Machine (SVM) and Random Forest. The pipeline comprises data preprocessing (tokenization, normalization, stop word removal, and lemmatization), feature extraction, model training, and evaluation using accuracy, F1-score, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa. In the experiments, the Word2Vec + SVM model outperforms the other configurations, achieving 90.2% accuracy and an F1-score of 82.52%, which highlights the advantage of incorporating semantic context into the vector representation. The study concludes that hybrid methods combining TF-IDF and Word2Vec with robust classifiers improve both the precision and the generalizability of document analysis models. Potential applications include sentiment analysis, topic modeling, text classification in the legal and healthcare domains, and multilingual settings. The research provides a foundation for high-performance text analysis systems applicable to real-world natural language processing tasks.
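
To make two of the building blocks named in the abstract concrete, the following is a minimal sketch in standard-library Python of an unsmoothed TF-IDF weighting and of the four evaluation metrics for the binary case. It is not the authors' implementation: the function names `tfidf` and `binary_metrics` are hypothetical, the idf formula log(N / df) is one common variant chosen here as an assumption, and the Word2Vec and SVM stages are omitted because they require external libraries.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of TF-IDF weights for tokenized documents.

    Assumption: tf = count / document length and idf = log(N / df), with no
    smoothing. Library implementations often smooth and normalize, so their
    values will differ from these.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        rows.append(
            [(counts[t] / length) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, rows


def binary_metrics(y_true, y_pred):
    """Accuracy, F1, MCC, and Cohen's kappa for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # Matthews Correlation Coefficient from the confusion-matrix counts.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc, "kappa": kappa}


if __name__ == "__main__":
    # Toy inputs invented for illustration; they are not data from the study.
    vocab, rows = tfidf([["contract", "law"], ["patient", "law"]])
    print(vocab, rows)
    print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1]))
```

In this toy setup the term shared by both documents receives a weight of zero, which illustrates why a purely statistical weighting can benefit from a complementary semantic representation. MCC and Cohen's kappa are reported alongside accuracy because both correct for agreement that would occur by chance, which matters when classes are imbalanced.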

Authors

mukhamedzhanova.bigul@mail.ru (Primary Contact)
Kaibassova, D., Mukhametzhanova, B., Tokseit, D., Kubegenova, A., & Kozhanov, M. (2025). Document analysis via combined vectorization and machine learning approaches. International Journal of Innovative Research and Scientific Studies, 8(4), 2195–2204. https://doi.org/10.53894/ijirss.v8i4.8356
