Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.07.20.549913v1?rss=1
Authors: Deorowicz, S., Gudys, A.
Abstract: Motivation: The introduction of DeepMind's AlphaFold 2 enabled prediction of protein structures at unprecedented scale. The AlphaFold Protein Structure Database and the ESM Metagenomic Atlas contain hundreds of millions of structures stored in CIF and/or PDB formats. When compressed with a general-purpose utility like gzip, this translates to tens of terabytes of data, which hinders the effective use of predicted structures in large-scale analyses. Results: Here, we present ProteStAr, a compressor dedicated to CIF/PDB files as well as supplementary PAE files. Its main contribution is a novel approach to predicting atom coordinates on the basis of previously analyzed atoms. This allows efficient encoding of the coordinates, which are the largest component of protein structure files. By default, the compression is lossless, though a lossy mode with a controlled maximum error of coordinate reconstruction is also available. Compared to the competing packages, i.e., BinaryCIF, Foldcomp, and PDC, our approach offers a superior compression ratio at established reconstruction accuracy. Through efficient use of threads at both the compression and decompression stages, the algorithm takes advantage of the multicore architecture of current central processing units and operates at speeds of about 1 GB/s. The presence of a C++ API further increases the usability of the presented method. Availability and implementation: The source code of ProteStAr is available at https://github.com/refresh-bio/protestar.
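To make the idea of "lossy mode with a controlled maximum error" plus "prediction from previously analyzed atoms" more concrete, here is a minimal Python sketch. It is not ProteStAr's algorithm or API: the function names (`quantize`, `delta_encode`, and so on) are hypothetical, and the predictor is simply "the previous value", which stands in for the paper's more sophisticated coordinate prediction. It only illustrates why small prediction residuals are cheaper to encode and how a quantization step bounds the reconstruction error.

```python
# Illustrative sketch only; NOT the ProteStAr implementation.
# All function names here are hypothetical.

def quantize(coords, max_error):
    """Map floats to integers so that reconstruction error <= max_error."""
    step = 2.0 * max_error  # rounding to a grid of this step errs by at most step / 2
    return [round(c / step) for c in coords]

def dequantize(q, max_error):
    """Inverse of quantize, up to the bounded rounding error."""
    step = 2.0 * max_error
    return [v * step for v in q]

def delta_encode(values):
    """Predict each value as the previous one and store only the residual."""
    residuals, prev = [], 0
    for v in values:
        residuals.append(v - prev)
        prev = v
    return residuals

def delta_decode(residuals):
    """Rebuild the values by accumulating residuals."""
    values, prev = [], 0
    for r in residuals:
        prev += r
        values.append(prev)
    return values

# Example: x-coordinates (in angstroms) of consecutive atoms, made up for illustration.
coords = [11.104, 12.560, 13.998, 15.377]
max_error = 0.05

q = quantize(coords, max_error)
residuals = delta_encode(q)  # small integers after the first entry
restored = dequantize(delta_decode(residuals), max_error)
```

After the first entry, the residuals are small integers, which an entropy coder can store in far fewer bits than raw coordinates. Setting a smaller `max_error` tightens the grid and increases the residual magnitudes, which is the usual trade-off between accuracy and compression ratio.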
Copyright belongs to the original authors. Visit the link for more information.
Podcast created by Paper Player, LLC