Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.04.01.535163v1?rss=1
Authors: Caldonazzo Garbelini, J. M., Sanches, D. S., Pozo, A. T. R.
Abstract: Motivation: The search for conserved motifs in DNA sequences is an important problem in bioinformatics. The growing availability of large-scale genomic data poses significant challenges for computational biology, particularly in terms of analysis efficiency, k-mer identification, and the presence of noise. The detection of conserved motifs and patterns in DNA sequences is crucial for understanding gene function and regulation. Therefore, it is essential to develop a data structure that can handle these large volumes of information and provide accurate and fast results. Results: We present SMT, an innovative tool designed to efficiently store and count k-mers, optimizing memory usage and computation time. The application of SMT has also proven effective in discovering motifs in noisy datasets, allowing the identification of conserved regions in sequences. Furthermore, SMT enables exact searches in constant time, recovers the most abundant k-mers, and performs approximate searches in linear time to find fragments with up to d mutations. This approach facilitates large-scale data analysis and provides important insights into the conserved properties of biological sequences. The application of SMT in motif discovery demonstrates its potential to drive research in bioinformatics and genomics. Supplementary data and results are available to provide additional information and support the conclusions presented in this work. Availability and implementation: The source code of the presented method is publicly available at https://github.com/jadermcg/SMT.
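To make the three operations named in the abstract concrete (exact lookup, most abundant k-mers, and approximate search with up to d mutations), here is a minimal Python sketch. It is not the SMT data structure and does not use the code from the authors' repository: it assumes a plain hash map (`collections.Counter`) as a stand-in, treats "mutations" as substitutions measured by Hamming distance, and the function names and example sequences are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every length-k substring (k-mer) across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_matches(counts, query, d):
    """Return stored k-mers within d substitutions of the query.

    This scans every distinct k-mer, so it is linear in the number of
    distinct k-mers stored (an assumption of this sketch, not a claim
    about how SMT performs its approximate search).
    """
    return {kmer: n for kmer, n in counts.items() if hamming(kmer, query) <= d}


# Hypothetical toy input, for illustration only.
sequences = ["ACGTACGT", "TTACGTAA"]
counts = count_kmers(sequences, 4)

exact = counts["ACGT"]                          # hash lookup, expected O(1)
top = counts.most_common(3)                     # most abundant k-mers
near = approximate_matches(counts, "ACGA", 1)   # up to d = 1 substitution

print(exact)  # 3
print(near)   # {'ACGT': 3}
```

A hash map gives constant expected-time exact lookups, which mirrors the behaviour the abstract describes, but it says nothing about the memory optimizations the authors report for SMT.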
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC