Methodology for detection and comparison of tandem repeats in genomic sequences using modern statistical and vector metrics
Abstract
This study develops a robust methodology for the automated detection and quantitative analysis of tandem repeats in genomic sequences. The method accounts for mismatches and distances, with the aim of improving primer design and the accuracy of genomic research. The approach combines an efficient algorithm for identifying complementary DNA fragments, focusing on the 3' end of primers, with two independent similarity metrics: the Hardy–Weinberg χ² test and cosine similarity. For each set of sequences, the methodology generates similarity matrices, heat maps, 3D surface visualizations, and scatter plots to support a comprehensive evaluation. Experimental validation on the complete genome of Lactobacillus brevis ATCC 367 identified 586 tandem repeats. The two metrics were highly consistent and indicated high similarity among most repeats, while specific cases showed discrepancies that require further investigation. By combining statistical and vector analyses, the methodology improves the reliability of genomic studies and supports the identification of biologically significant variations. The proposed tool is applicable across molecular biology, particularly to primer design, genome annotation, and biomarker discovery. It is scalable and adaptable to large genomic datasets, which makes it suitable for high-throughput bioinformatics analyses.
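To make the idea of pairing a vector metric with a statistical metric concrete, the following minimal Python sketch compares repeat sequences in both ways and builds a pairwise similarity matrix. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how sequences are vectorized, so overlapping k-mer counts are assumed here, and a generic Pearson χ² statistic on a 2 × N count table stands in for the Hardy–Weinberg χ² formulation. All function names (`kmer_counts`, `cosine_similarity`, `chi2_statistic`, `similarity_matrix`) and the example sequences are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=2):
    """Count overlapping k-mers (assumed vectorization, not from the paper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


def chi2_statistic(a, b):
    """Generic Pearson chi-square statistic for a 2 x N table of counts.

    This is a stand-in for the statistical metric; it is NOT the
    Hardy-Weinberg formulation named in the abstract.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    total = total_a + total_b
    stat = 0.0
    for key in a.keys() | b.keys():
        column_total = a[key] + b[key]
        for observed, row_total in ((a[key], total_a), (b[key], total_b)):
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


def similarity_matrix(seqs, k=2):
    """Pairwise cosine similarity between all sequences."""
    vectors = [kmer_counts(s, k) for s in seqs]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


if __name__ == "__main__":
    # Hypothetical repeat units, for illustration only.
    repeats = ["ACGACGACG", "ACGACGATG", "TTTTTTTTT"]
    for row in similarity_matrix(repeats):
        print(["%.3f" % value for value in row])
```

A matrix like this is the kind of input a heat map or 3D surface plot would take. Comparing the cosine values with the χ² statistics for the same pairs is one simple way to spot pairs on which the two metrics disagree.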
Authors

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.