Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.01.23.525250v1?rss=1
Authors: Zhao, N., Wang, S., Huang, Q., Dong, S., Boyle, A. P.
Abstract: Interpreting predictive machine learning models to derive biological knowledge is the ultimate goal of developing models in an era of exploding genomic data. Recently, sequence-based deep learning models have greatly outperformed other machine learning models, such as SVMs, in genome-wide prediction tasks. However, deep learning models are black boxes, and their predictions are challenging to interpret. Here we present an end-to-end computational pipeline, Explain-seq, that automates the process of developing and interpreting deep learning models in the context of genomics. Explain-seq takes genomic sequences as input and outputs predictive motifs derived from the model trained on those sequences. We demonstrated Explain-seq with a public STARR-seq dataset of the A549 human lung cancer cell line released by ENCODE. We found that our deep learning model outperformed a gkm-SVM model in predicting A549 enhancer activities. By interpreting our well-performing model, we identified 47 transcription factor (TF) motifs that matched known TF position weight matrices (PWMs), including ZEB1, SP1, YY1, and INSM1. These factors are associated with epithelial-mesenchymal transition and with lung cancer proliferation and metastasis. In addition, some motifs had no match in the JASPAR database and may be considered de novo enhancer motifs in the A549 cell line.
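The abstract mentions two building blocks that a short sketch can make concrete: turning a genomic sequence into model input, and checking a sequence against a PWM. The Python below is a minimal illustration only, not the Explain-seq implementation. The function names (`one_hot`, `log_odds`, `best_match`), the uniform 0.25 background, the pseudocount, and the toy count matrix are all assumptions made for this example; the matrix is invented and is not taken from JASPAR.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of [A, C, G, T]; unknown bases such as N become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def log_odds(pfm, background=0.25, pseudocount=0.01):
    """Convert a count matrix (rows = positions, columns = A, C, G, T) to log2 odds scores."""
    pwm = []
    for row in pfm:
        total = sum(row) + 4 * pseudocount
        pwm.append([math.log2(((v + pseudocount) / total) / background) for v in row])
    return pwm

def best_match(seq, pwm):
    """Slide the PWM along the sequence and return (offset, score) of the best window, or None."""
    encoded = one_hot(seq)
    width = len(pwm)
    best = None
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * pwm[i][j]
            for i in range(width)
            for j in range(4)
        )
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical toy motif (invented counts) that strongly prefers CAGGTG.
toy_pfm = [
    [0, 10, 0, 0],   # C
    [10, 0, 0, 0],   # A
    [0, 0, 10, 0],   # G
    [0, 0, 10, 0],   # G
    [0, 0, 0, 10],   # T
    [0, 0, 10, 0],   # G
]

offset, score = best_match("TTTCAGGTGAAA", log_odds(toy_pfm))
print(offset, round(score, 2))
```

A real pipeline would feed the one-hot matrices to a trained network and compare learned motifs against a curated PWM collection with a dedicated motif-comparison tool; this sketch only shows the encoding and scoring ideas at small scale.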
Copyright belongs to the original authors. Visit the link for more information.
Podcast created by Paper Player, LLC