Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.01.11.523286v1?rss=1
Authors: Grigorjew, A., Gynter, A., Dias, F., Buchfink, B., Drost, H.-G., Tomescu, A. I.
Abstract: Sequence alignments have become the foundation of life science research by unlocking biological mechanisms through protein comparisons. Despite their methodological success, most algorithmic innovation in the past decades has focused on the optimal alignment problem, often ignoring information derived from suboptimal solutions. The assumption that the score-derived optimal alignment represents the biologically most relevant choice has led many life scientists to accept the reduction from thousands or millions of possible alignment configurations to a single optimal alignment. We argue that one optimal alignment per pairwise sequence comparison may have been a reasonable approximation when dealing with very similar sequences, but is insufficient when aiming to capture the natural variation of the protein universe at tree-of-life scale. To overcome this alignment-sensitivity limitation, we propose the concept of pairwise alignment-safety as a way to explore the neighborhood of suboptimal alignment configurations when comparing divergent protein sequences. We show that by using alignment-safe intervals, it is possible to encode the defining structural features of proteins even when comparing highly divergent sequences. To demonstrate this, we present EMERALD, a dedicated command-line tool able to infer alignment-safe sequence intervals from biodiverse protein sequence clusters. EMERALD effectively explores suboptimal alignment paths within the pairwise dynamic programming matrix and flags robust intervals that are shared across all suboptimal configurations. We apply EMERALD to clusters of 396k sequences generated from the Swiss-Prot database and show that alignment-safe intervals derived from the suboptimal alignment space are sufficient to capture the structural identity of biodiverse proteins even when comparing highly divergent clusters.
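To make the idea of alignment-safety concrete, here is a minimal toy sketch in Python. It is not EMERALD's algorithm or interface: the function name `safe_pairs`, the `delta` threshold, and the simple match/mismatch/gap scores are all assumptions made for illustration. It brute-forces every global alignment of two short strings whose score is within `delta` of the optimum and keeps only the aligned position pairs that appear in all of them. Contiguous runs of such pairs would correspond loosely to the "safe intervals" described above.

```python
from functools import lru_cache


def safe_pairs(a, b, delta=0, match=1, mismatch=-1, gap=-1):
    """Toy illustration (hypothetical helper, not EMERALD's method).

    Returns (optimal_score, pairs), where pairs are the 0-based (i, j)
    aligned positions shared by every global alignment of a and b that
    scores at least optimal_score - delta. Exponential time: only
    suitable for very short sequences.
    """
    n, m = len(a), len(b)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best achievable score for aligning the suffixes a[i:] and b[j:].
        if i == n:
            return gap * (m - j)
        if j == m:
            return gap * (n - i)
        s = match if a[i] == b[j] else mismatch
        return max(s + best(i + 1, j + 1),
                   gap + best(i + 1, j),
                   gap + best(i, j + 1))

    opt = best(0, 0)
    shared = None  # running intersection over all qualifying alignments

    def walk(i, j, score, pairs):
        nonlocal shared
        # Prune paths that cannot reach the suboptimality threshold.
        if score + best(i, j) < opt - delta:
            return
        if i == n and j == m:
            found = set(pairs)
            shared = found if shared is None else shared & found
            return
        if i < n and j < m:
            s = match if a[i] == b[j] else mismatch
            walk(i + 1, j + 1, score + s, pairs + [(i, j)])
        if i < n:
            walk(i + 1, j, score + gap, pairs)  # gap in b
        if j < m:
            walk(i, j + 1, score + gap, pairs)  # gap in a

    walk(0, 0, 0, [])
    return opt, sorted(shared)


if __name__ == "__main__":
    # "AAC" vs "AC" has two optimal alignments; only the final C-C
    # column is common to both.
    print(safe_pairs("AAC", "AC"))
```

Under these assumed scores, widening `delta` admits more suboptimal alignments, so the shared set can only shrink. A practical tool would need to derive the same information from the dynamic programming matrix rather than by enumeration.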
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC