Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.06.21.545880v1?rss=1
Authors: Groth, P. M., Michael, R., Salomon, J., Tian, P., Boomsma, W.
Abstract: Protein engineering has the potential to create optimized protein variants with improved properties and function. An initial step in the protein optimization process typically consists of a search among natural (wildtype) sequences to find the naturally occurring proteins with the most desirable properties. Promising candidates from this initial discovery phase then form the basis of the second step: a more local optimization procedure, exploring the space of variants separated from this candidate by a number of mutations. While considerable progress has been made on evaluating machine learning methods on single protein datasets, benchmarks of data-driven approaches for global fitness landscape exploration are still lacking. In this paper, we have carefully curated a representative benchmark dataset, which reflects industrially relevant scenarios for the initial wildtype discovery phase of protein engineering. We focus on exploration within a protein family, and investigate the downstream predictive power of various protein representation paradigms, i.e., protein language model-based representations, structure-based representations, and evolution-based representations. Our benchmark highlights the importance of coherent split strategies, and how we can be misled into overly optimistic estimates of the state of the field. The codebase and data can be accessed via https://github.com/petergroth/FLOP.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC