Prediction of virus-host association using protein language models and multiple instance learning

2023/4/8

PaperPlayer biorxiv bioinformatics

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.04.07.536023v1?rss=1

Authors: Liu, D., Young, F., Robertson, D. L., Yuan, K.

Abstract: Predicting virus-host association is essential for understanding how viruses interact with host species and for discovering new therapeutics for viral diseases across humans and animals. Currently, the hosts of the majority of viruses are unknown. Here, we introduce EvoMIL, a deep learning method that predicts virus-host association at the species level from viral sequence alone. The method combines a pre-trained large protein language model with attention-based multiple instance learning (MIL) to allow protein-orientated predictions. Our results show that protein embeddings capture stronger predictive signals than traditional handcrafted features, including amino acid and DNA k-mers and physicochemical properties. EvoMIL binary classifiers achieve AUC values of over 0.95 for all prokaryotic hosts and nearly 0.8 for almost all eukaryotic hosts. In multi-host prediction tasks, EvoMIL achieved median performance improvements of 8.6% in prokaryotic hosts and 1.8% in eukaryotic hosts. Furthermore, EvoMIL estimates the importance of individual proteins in the prediction and maps them onto an embedding landscape of all viral proteins, in which proteins with similar functions cluster together.
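
To make the attention-based MIL idea in the abstract concrete, here is a minimal NumPy sketch of how a virus (a "bag" of protein embeddings) might be pooled into a single vector and scored for one host. This is not the authors' implementation: the function names, the parameters `V`, `w`, `clf_w`, `clf_b`, the tiny dimensions, and the random embeddings standing in for protein language model output are all assumptions chosen for illustration. In a real model these parameters would be learned.

```python
import numpy as np


def attention_mil_pool(embeddings, V, w):
    """Pool one bag of protein embeddings into a single vector.

    embeddings: (n_proteins, d) array, one row per protein.
    V: (d, h) projection matrix (hypothetical, learned in practice).
    w: (h,) scoring vector (hypothetical, learned in practice).
    Returns (bag_vector, attention_weights).
    """
    scores = np.tanh(embeddings @ V) @ w      # one raw score per protein
    scores = scores - scores.max()            # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    bag_vector = weights @ embeddings         # attention-weighted average
    return bag_vector, weights


def predict_host(embeddings, V, w, clf_w, clf_b):
    """Binary virus-host score for one host, plus per-protein weights."""
    bag_vector, weights = attention_mil_pool(embeddings, V, w)
    logit = bag_vector @ clf_w + clf_b
    prob = 1.0 / (1.0 + np.exp(-logit))       # sigmoid
    return prob, weights


# Toy data: 5 proteins with 8-dimensional embeddings (random placeholders).
rng = np.random.default_rng(0)
d, h = 8, 4
proteins = rng.normal(size=(5, d))
V = rng.normal(size=(d, h))
w = rng.normal(size=h)
clf_w = rng.normal(size=d)
clf_b = 0.0

prob, weights = predict_host(proteins, V, w, clf_w, clf_b)
```

The attention weights sum to one across the proteins of a virus, which is what would allow a per-protein importance estimate of the kind the abstract describes. The pooled prediction does not depend on the order in which the proteins are listed.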

Copyright belongs to the original authors. Visit the link for more information.

Podcast created by Paper Player, LLC