Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.03.15.532863v1?rss=1
Authors: Zhang, Y., Lang, M., Jiang, J., Gao, Z., Xu, F., Litfin, T., Chen, K., Singh, J., Huang, X., Song, G., Tian, Y., Zhan, J., Chen, J., Zhou, Y.
Abstract: Compared to proteins, DNA and RNA are more difficult languages to interpret because 4-letter-coded DNA/RNA sequences carry less information content than 20-letter-coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing evolutionary information from homologous sequences because, unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple-sequence-alignment-based RNA language model (RNA-MSM) that utilizes homologous sequences from an automatic pipeline, RNAcmap. The resulting unsupervised two-dimensional attention maps and one-dimensional embeddings from RNA-MSM can be directly mapped, with high accuracy, to 2D base-pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks over existing state-of-the-art techniques. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.
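For listeners curious how "attention maps mapped to base-pairing probabilities" might look in practice, here is a minimal sketch. It does not use the authors' actual RNA-MSM code or API; the model stand-in, tensor shapes, and the simple logistic-regression head are illustrative assumptions, shown only to make the general idea concrete.

```python
# Hypothetical sketch: turning transformer attention maps into base-pairing
# probabilities with a simple logistic-regression head, in the spirit of the
# abstract. RNA-MSM itself is NOT loaded here; all shapes and the random
# "attn" tensor are placeholder assumptions.
import torch
import torch.nn as nn

L = 60                      # RNA sequence length (assumed)
n_layers, n_heads = 12, 12  # attention geometry (assumed)

# Stand-in for the attentions an MSA-transformer-style model would return:
# one (L, L) map per layer per head, stacked into (layers*heads, L, L).
attn = torch.rand(n_layers * n_heads, L, L)

# Symmetrize each map, since base pairing is symmetric: P(i, j) == P(j, i).
attn = 0.5 * (attn + attn.transpose(-1, -2))

# Treat the per-head attention values at each position pair (i, j) as
# features, and map them to a pairing probability with a trainable head:
# features (L, L, layers*heads) -> probabilities (L, L).
features = attn.permute(1, 2, 0)         # (L, L, layers*heads)
head = nn.Linear(n_layers * n_heads, 1)  # supervised mapping, trained on known structures
pair_prob = torch.sigmoid(head(features)).squeeze(-1)  # (L, L), values in [0, 1]

print(pair_prob.shape)  # torch.Size([60, 60])
```

In the paper's setting, a head like this would be trained on RNAs with known secondary structures; the abstract's point is that the unsupervised attention maps are already informative enough for such a direct mapping to work well.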
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC