Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.06.22.546084v1?rss=1
Authors: Dickson, A. M., Mofrad, M. R. K.
Abstract: A central goal of bioinformatics research is to understand proteins on a functional level, typically by extrapolating from experimental results with the protein sequence information. One strategy is to assume that proteins with similar sequences will also share function. This has the benefit of being interpretable; it gives a very clear idea of why a protein might have a particular function by comparing with the most similar reference example. However, direct machine learning classifiers now outperform pure sequence similarity methods in raw prediction ability. A hybrid method is to use pre-trained language models to create protein embeddings, and then indirectly predict protein function using their relative similarity. We find that fine-tuning an auxiliary objective on protein function indirectly improves these hybrid methods, to the point that they are in some cases better than direct classifiers. Our empirical results demonstrate that interpretable protein comparison models can be developed using fine-tuning techniques, without cost, or even with some benefit, to overall performance. K-nearest neighbors (KNN) embedding-based models also offer free generalization to previously unknown classes, while continuing to outperform only pre-trained models, further demonstrating the potential of fine-tuned embeddings outside of direct classification.
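The embedding-based KNN approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine_similarity`, `knn_predict`), the choice of cosine similarity, the majority vote, and the toy 2-D vectors and labels are all assumptions made for illustration. In practice the embeddings would come from a pre-trained or fine-tuned protein language model.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Assumes both vectors are non-zero and of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, reference_embeddings, reference_labels, k=3):
    # Rank reference proteins by similarity to the query embedding.
    ranked = sorted(
        zip(reference_embeddings, reference_labels),
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    neighbors = ranked[:k]
    # Majority vote over the k most similar references (hypothetical rule).
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors

# Toy stand-ins for protein embeddings and function labels (made up).
references = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
labels = ["kinase", "kinase", "transporter", "transporter"]
query = [0.9, 0.1]

predicted, neighbors = knn_predict(query, references, labels, k=3)
print(predicted)   # predicted function label
print(neighbors)   # the reference examples that explain the prediction
```

The returned neighbors are what make the prediction interpretable: each one is a concrete reference protein that can be inspected. Supporting a previously unknown class in this sketch only requires appending a new embedding and label to the reference lists, with no retraining of a classifier.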
Copyrights belong to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC