
Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction

2023/2/3

PaperPlayer biorxiv bioinformatics

Shownotes Transcript

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.01.31.526427v1?rss=1

Authors: Chen, K., Zhou, Y., Ding, M., Wang, Y., Ren, Z., Yang, Y.

Abstract: RNA splicing is an important post-transcriptional process of gene expression in eukaryotic organisms. Here, we developed a novel language model, SpliceBERT, pre-trained on the precursor messenger RNA sequences of 72 vertebrates to improve sequence-based modelling of RNA splicing. SpliceBERT is capable of generating embeddings that preserve the evolutionary information of nucleotides and the functional characteristics of splice sites. Moreover, the pre-trained model can be used to prioritize potential splice-disrupting variants in an unsupervised manner, based on a genetic variant's impact on SpliceBERT's output for the surrounding sequence context. Benchmarked on a multi-species splice site prediction task and a human branchpoint prediction task, SpliceBERT outperformed not only conventional baseline models but also other language models pre-trained only on the human genome. Our study highlights the importance of unsupervised learning on genomic sequences from multiple species and indicates that language models are a promising approach to deciphering the determinants of RNA splicing.
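The unsupervised variant-prioritization idea in the abstract can be sketched as follows: score a variant by how much it shifts the model's per-position nucleotide distributions over the surrounding sequence context. This is a minimal illustrative sketch only; the "model" here is a hypothetical stand-in (a crude smoothed-frequency estimator), not SpliceBERT, and the function names (`toy_nucleotide_probs`, `variant_effect_score`) are invented for this example. With the real pre-trained model, its masked-language-model probabilities would take the stand-in's place.

```python
# Hedged sketch of unsupervised variant scoring via a context-distribution shift.
# The scoring model below is a trivial stand-in, NOT SpliceBERT.
import math
from collections import Counter

NUCS = "ACGT"

def toy_nucleotide_probs(seq, pos):
    """Stand-in for a masked language model: a Laplace-smoothed nucleotide
    distribution at `pos`, estimated from the rest of the sequence."""
    context = seq[:pos] + seq[pos + 1:]
    counts = Counter(context)
    total = sum(counts[n] for n in NUCS)
    return {n: (counts[n] + 1) / (total + 4) for n in NUCS}

def kl_divergence(p, q):
    """KL divergence between two distributions over A/C/G/T."""
    return sum(p[n] * math.log(p[n] / q[n]) for n in NUCS)

def variant_effect_score(seq, pos, alt, window=10):
    """Sum, over a window around the variant, of the KL divergence between
    per-position nucleotide distributions on the reference vs. mutated
    sequence; larger scores suggest a larger disruption of the context."""
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    lo, hi = max(0, pos - window), min(len(seq), pos + window + 1)
    score = 0.0
    for i in range(lo, hi):
        p = toy_nucleotide_probs(seq, i)
        q = toy_nucleotide_probs(alt_seq, i)
        score += kl_divergence(p, q)
    return score

seq = "ACGTACGTGTAAGTACGTACGT"  # toy sequence with a GT-AG-like donor motif
print(variant_effect_score(seq, 8, "C"))  # G->C at the GT dinucleotide: score > 0
print(variant_effect_score(seq, 0, "A"))  # ref allele unchanged: score is 0.0
```

In SpliceBERT's setting, the same comparison would be done with the model's predicted nucleotide probabilities for the reference and alternate sequences, so variants that strongly perturb the learned sequence context (e.g., at splice sites) receive high scores without any supervised labels.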

Copyright belongs to the original authors. Visit the link for more info.

Podcast created by Paper Player, LLC