
Multiple sequence-alignment-based RNA language model and its application to structural inference

2023/3/16

PaperPlayer biorxiv bioinformatics


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.03.15.532863v1?rss=1

Authors: Zhang, Y., Lang, M., Jiang, J., Gao, Z., Xu, F., Litfin, T., Chen, K., Singh, J., Huang, X., Song, G., Tian, Y., Zhan, J., Chen, J., Zhou, Y.

Abstract: Compared to proteins, DNA and RNA are more difficult languages to interpret because 4-letter-coded DNA/RNA sequences have lower information content than 20-letter-coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because, unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple-sequence-alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap. The resulting unsupervised two-dimensional attention maps and one-dimensional embeddings from RNA-MSM can be directly mapped with high accuracy to 2D base-pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks over existing state-of-the-art techniques. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.
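The abstract describes mapping the model's 2D attention maps to base-pairing probabilities. The sketch below is an illustrative guess at how such a mapping could look, not the paper's actual method: the weights `w` and `b` are hypothetical stand-ins for parameters that would be learned from data, and the toy attention tensor replaces real RNA-MSM outputs.

```python
import numpy as np

def attention_to_pair_probs(attn, w=1.0, b=0.0):
    """Map transformer attention maps to base-pair probabilities.

    Illustrative only: w and b are hypothetical parameters standing in
    for a mapping the paper would learn from structural data.

    attn: array of shape (n_layers, n_heads, L, L)
    returns: (L, L) matrix of pairing probabilities in [0, 1]
    """
    # Average attention over all layers and heads -> (L, L)
    a = attn.mean(axis=(0, 1))
    # Symmetrize, since base pairing is an undirected relation
    a = 0.5 * (a + a.T)
    # Logistic link from attention score to probability
    return 1.0 / (1.0 + np.exp(-(w * a + b)))

# Toy example: random "attention" for a 30-nt RNA, 12 layers x 8 heads
rng = np.random.default_rng(0)
attn = rng.random((12, 8, 30, 30))
probs = attention_to_pair_probs(attn)
print(probs.shape)  # (30, 30)
```

The symmetrization step reflects that a base pair (i, j) is the same contact as (j, i), so the probability matrix should be symmetric regardless of the attention pattern's direction.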

Copyright belongs to the original authors. Visit the link for more info.

Podcast created by Paper Player, LLC