Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.03.14.532504v1?rss=1
Authors: Tong, J., Lu, M., Peng, B., An, S., Wang, J., Yu, C.
Abstract: Background: The size of high-resolution mass spectrometry (HRMS) data has been increasing significantly, and several lossy compressors have been developed to achieve higher compression rates. However, a comprehensive evaluation of how precision losses in MS data (m/z and intensities) affect data processing is lacking. Assessing the impact of different degrees of precision loss on data processing results should clarify the variation rates under different accuracy losses and help explain them. Results: Sixteen vendor files were converted to mzML files with different combinations of data precision (32- or 64-bit) for m/z and intensities via msConvert. The mzML files with a suitable precision combination were then converted to precision-lossy files with bounded absolute m/z errors or bounded relative intensity errors by truncation transformations. We set an error threshold of 1% to evaluate the feature and compound detection results obtained from MZmine3. Across the different combinations of storage precision for m/z and intensities, the variation was less than 0.13% for both features and compounds. Five maximum absolute errors for m/z (10^-5, 2×10^-5, 5×10^-5, 10^-4, 10^-3) and five maximum relative errors for intensities (2×10^-4, 2×10^-3, 8×10^-3, 2×10^-2, 2×10^-1) were examined. An m/z error of 10^-4 gave a feature detection error of 0.57% and a compound detection error of 1.1%. For intensities, the 2×10^-2 error group had an error of 4.65% for features and 0.98% for compounds relative to the precision-lossless files. Taken together, we consider that a maximum absolute error of 10^-4 for m/z and a maximum relative error of 2×10^-2 for intensity can meet the 1% error threshold and can be recommended as error bounds for lossy compression. Conclusion: We found that mzML files with both m/z and intensity encoded in 32-bit precision appear to be the preferred combination, offering a smaller file size with only minor variation. Further, we examined how varying levels of precision affect MS data processing and proposed reasonable accuracy levels (10^-4 for m/z and 2×10^-2 for intensities). This guidance aims to help researchers improve lossy compression algorithms and minimize the negative effects of precision losses on downstream data processing.
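The abstract does not spell out the truncation transformations themselves, so the following is only a minimal sketch of one way to bound the two kinds of error it names. It assumes m/z values are snapped down to a grid whose spacing equals the maximum absolute error, and that intensities have their binary mantissa truncated to enough bits to respect the maximum relative error. The function names (truncate_mz, truncate_intensity, as_float32) are hypothetical and do not come from the paper, msConvert, or MZmine3.

```python
import math
import struct


def truncate_mz(mz: float, max_abs_error: float) -> float:
    """Snap an m/z value down to a grid of spacing max_abs_error.

    Assumption: this is an illustrative truncation, not the paper's exact
    method. The value is lowered by less than max_abs_error (up to
    floating-point rounding).
    """
    return math.floor(mz / max_abs_error) * max_abs_error


def truncate_intensity(intensity: float, max_rel_error: float) -> float:
    """Truncate the binary mantissa so the relative error stays bounded.

    Assumption: illustrative only. With k mantissa bits kept, truncation
    loses less than 2**(1 - k) of the value, so k is chosen to make that
    bound no larger than max_rel_error.
    """
    if intensity <= 0.0:
        return intensity  # nothing to truncate for zero or invalid values
    k = math.ceil(math.log2(1.0 / max_rel_error)) + 1
    mantissa, exponent = math.frexp(intensity)  # mantissa lies in [0.5, 1)
    kept = math.floor(mantissa * 2**k) / 2**k   # drop the low mantissa bits
    return math.ldexp(kept, exponent)


def as_float32(value: float) -> float:
    """Round-trip a value through IEEE 754 single precision (32-bit)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


if __name__ == "__main__":
    mz, intensity = 500.123456789, 123456.789  # made-up example values
    print("m/z abs error:", mz - truncate_mz(mz, 1e-4))
    print("intensity rel error:",
          (intensity - truncate_intensity(intensity, 2e-2)) / intensity)
    print("32-bit m/z abs error:", abs(mz - as_float32(mz)))
```

The as_float32 helper is there to relate the storage-precision comparison to the error bounds: single precision carries a 24-bit significand, so for an m/z between 256 and 512 the rounding error is at most 2^-16 (about 1.5×10^-5), which sits below the 10^-4 bound recommended in the abstract.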
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC