Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2022.12.27.522071v1?rss=1
Authors: Draizen, E. J., Murillo, L. F., Readey, J., Mura, C., Bourne, P. E.
Abstract: Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be of utility, such datasets must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently far more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing. Here, we report Prop3D, a protein biophysical and evolutionary featurization and data-processing pipeline that we have developed and deployed--both in the cloud and on local HPC resources--in order to systematically and reproducibly create comprehensive datasets, using the Highly Scalable Data Service (HSDS). Prop3D and its associated 'Prop3D-20sf' dataset can be of broader utility, as a community-wide resource, for other structure-related workflows, particularly for tasks that arise at the intersection of deep learning and classical structural bioinformatics.
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC