Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.02.01.526717v1?rss=1
Authors: Sladky, O., Vesely, P., Brinda, K.
Abstract: The popularity of k-mer-based methods has recently led to the development of compact k-mer-set representations, such as simplitigs/Spectrum-Preserving String Sets (SPSS), matchtigs, and eulertigs. These aim to represent k-mer sets via strings that contain individual k-mers as substrings more efficiently than the traditional unitigs. Here, we demonstrate that all such representations can be viewed as superstrings of input k-mers, and as such can be generalized into a unified framework that we call the masked superstring of k-mers. We study the complexity of masked superstring computation and prove NP-hardness for both k-mer superstrings and their masks. We then design local and global greedy heuristics for efficient computation of masked superstrings, implement them in a program called KmerCamel, and evaluate their performance using selected genomes and pan-genomes. Overall, masked superstrings unify the theory and practice of textual k-mer set representations and provide a useful framework for optimizing representations for specific bioinformatics applications.
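To make the idea of a masked superstring concrete, here is a minimal Python sketch. It rests on assumptions that the abstract does not spell out: the mask is taken to be a binary string of the same length as the superstring, where a 1 at position i means that the k-mer starting at position i belongs to the represented set. The function names `decode_masked_superstring` and `encode_greedy` are hypothetical, reverse complements are ignored, and the simple right-extension heuristic below is only an illustration of the greedy flavor, not the algorithm implemented in KmerCamel.

```python
def decode_masked_superstring(superstring: str, mask: str, k: int) -> set:
    """Return the k-mers whose start positions are marked with '1' in the mask."""
    if len(superstring) != len(mask):
        raise ValueError("superstring and mask must have the same length")
    return {
        superstring[i:i + k]
        for i in range(len(superstring) - k + 1)
        if mask[i] == "1"
    }


def encode_greedy(kmers, k: int, alphabet: str = "ACGT"):
    """Illustrative heuristic (an assumption, not the paper's method):
    start from an unused k-mer, extend it to the right while some unused
    k-mer overlaps by k-1 characters, then start over and concatenate."""
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = set(kmers)
    chars = []  # characters of the superstring
    mask = []   # one 0/1 flag per character
    while remaining:
        current = min(remaining)  # deterministic choice of the next start
        remaining.remove(current)
        chars.extend(current)
        mask.extend([1] + [0] * (k - 1))  # only the first position starts a k-mer
        extended = True
        while extended:
            extended = False
            suffix = "".join(chars[len(chars) - (k - 1):]) if k > 1 else ""
            for c in alphabet:
                candidate = suffix + c
                if candidate in remaining:
                    remaining.remove(candidate)
                    chars.append(c)
                    mask.append(0)
                    mask[len(chars) - k] = 1  # the new k-mer starts here
                    extended = True
                    break
    return "".join(chars), "".join(str(bit) for bit in mask)


kmers = {"ACG", "CGT", "GTA", "TTT"}
superstring, mask = encode_greedy(kmers, k=3)
print(superstring, mask)
print(decode_masked_superstring(superstring, mask, k=3) == kmers)
```

In this sketch, the k-mers that span the junction between two concatenated pieces (such as TAT and ATT in the example) stay masked with 0, so they are not reported as part of the set. This is the role the mask plays: the superstring may contain extra k-mers, and the mask says which ones count.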
Copyrights belong to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC