Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.02.09.527362v1?rss=1
Authors: Mardikoraem, M., Woldring, D.
Abstract: Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed our ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance include handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations. Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling methods and protein representations to improve model performance on two datasets with binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding, physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). Performance is further examined with respect to protein fitness, sequence length, data size, and sampling method. In addition, an ensemble of representation methods is generated to discover the contribution of distinct representations to the final prediction score. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling while encoding sequences with One-Hot, UniRep, and ESM representations. In addition, ensemble learning increased the predictive performance of the affinity-based dataset by 4% compared to the best single encoding candidate (F1-score = 97%), while ESM alone was sufficiently robust for stability prediction (F1-score = 92%).
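As a rough illustration of the imbalanced-classification setup the abstract describes, the sketch below one-hot encodes toy protein sequences, applies SMOTE oversampling to the training split only, and scores a simple classifier with F1. It uses scikit-learn and imbalanced-learn; the toy data, the logistic-regression model, and all parameters are assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch (not the authors' code): one-hot encode fixed-length protein
# sequences, rebalance the training set with SMOTE, and report an F1-score.
# The toy dataset, model choice, and hyperparameters are illustrative assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE           # pip install imbalanced-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq: str) -> np.ndarray:
    """Flattened one-hot encoding of a single fixed-length protein sequence."""
    mat = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        mat[pos, AA_INDEX[aa]] = 1.0
    return mat.ravel()

# Toy, severely imbalanced data: few "high-fitness" (label 1) sequences among many label-0.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(AMINO_ACIDS), size=20)) for _ in range(500)]
labels = np.array([1 if i < 25 else 0 for i in range(500)])  # ~5% positives

X = np.stack([one_hot(s) for s in seqs])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, stratify=labels, random_state=0)

# Oversample minority class in the training split only; the test set stays untouched.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print("F1 on held-out set:", f1_score(y_te, clf.predict(X_te)))
```

In the paper's setting, the one-hot features would be swapped for physicochemical, UniRep, or ESM embeddings, and the classifier and evaluation would follow the authors' own protocol.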
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC