Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.02.26.528265v1?rss=1
Authors: Li, Y., Lu, S., Nan, X., Zhang, S., Zhou, Q.
Abstract: Accurate prediction of protein binding residues (PBRs) from sequence is important for the understanding of cellular activity and helpful for the design of novel drug. However, experimental methods are time-consuming and expensive. In recent years, a lot of computational predictors based on machine learning and deep learning models are proposed to reduce such consumption. But those methods often use MSA tools such as PSI-BLAST or NetSurfP to generate some statistical features and enter them into predictive models as necessary supplementary input. The input generation process normally takes long time, and there is no standard to specify which and how many statistic results should be provided to a prediction model. In addition, prediction of PBRs relies on residue local context, but the most appropriate scale is undetermined. Most works pre-selected certain residue features as input and a scale size based on expertise for certain type of PBRs. In this study, we propose a general tool-free end-to-end framework that can be applied to all types of PBRs, Multiscale Protein Binding Residues Prediction using language model (MsPBRsP). We adopt a pre-trained language model ProtTrans to save the large consumption caused by MSA tools, and use protein sequence alone as input to our model. To ease scale size uncertainty, we construct multi-size windows in attention layer and multi-size kernels in convolutional layer. We test our framework on various benchmark datasets including PBRs from protein-protein, protein-nucleotide, protein-small ligand, heterodimer, homodimer and antibody-antigen interactions. Compared with existing state-of-the-art methods, MsPBRsP achieves superior performance with less running time and higher prediction rates on every PBRs prediction task. Specifically, we boost F1 score by 27.1% and AUPRC score by 7.6% on NSP448 dataset and decrease running time from over 10 minutes to under 0.1s on average. The source code and datasets are available at https://github.com/biolushuai/MsPBRsP-for-multiple-PBRsprediction.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC