μ-PBWT: Enabling the Storage and Use of UK Biobank Data on a Commodity Laptop

2023/2/16
PaperPlayer biorxiv bioinformatics

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.02.15.528658v1?rss=1

Authors: Cozzi, D., Rossi, M., Rubinacci, S., Köppl, D., Boucher, C., Bonizzoni, P.

Abstract: Motivation: The positional Burrows-Wheeler Transform (PBWT) has been introduced as a key data structure for indexing haplotype sequences, with the main purpose of finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time, a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow queries over modern biobank panels consisting of several million haplotypes, because the panels must be kept entirely in memory. Results: In this paper, we present a method for constructing the run-length encoded PBWT for memory-efficient haplotype matching. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genomes Project and UK Biobank data. Our experiments demonstrate that μ-PBWT reduces memory usage by up to a factor of 25 compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in half the space of its BCF file. In addition, μ-PBWT is able to index a dataset with 2 million haplotypes and 2.3 million sites in 4 GB of space, which can be uploaded in 20 seconds on a commodity laptop. μ-PBWT is an adaptation of techniques for the run-length compressed BWT to the PBWT (RLPBWT), and it is based on keeping in memory only a small representation of the RLPBWT that still allows the efficient computation of set maximal exact matches (SMEMs) over the original panel. Availability: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html
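To make the idea of a run-length encoded PBWT more concrete, here is a minimal Python sketch. It is not the μ-PBWT implementation: it only shows the basic PBWT ordering (haplotypes sorted at each site by their reversed prefix) and how the reordered columns collapse into a few runs. The function names `pbwt_columns` and `run_length_encode` and the toy panel are illustrative assumptions, and biallelic sites coded as 0/1 are assumed.

```python
from itertools import groupby


def pbwt_columns(panel):
    """Yield each site's allele column in positional-prefix order.

    `panel` is a list of equal-length haplotypes, each a list of 0/1 alleles
    (an assumption of this sketch; real panels may be multiallelic).
    """
    order = list(range(len(panel)))  # at site 0, haplotypes keep input order
    for k in range(len(panel[0])):
        yield [panel[i][k] for i in order]
        # Stable partition by the allele at site k: zeros first, then ones.
        # This keeps haplotypes sorted by their reversed prefix up to site k.
        zeros = [i for i in order if panel[i][k] == 0]
        ones = [i for i in order if panel[i][k] == 1]
        order = zeros + ones


def run_length_encode(column):
    """Collapse a column into (allele, run_length) pairs."""
    return [(allele, len(list(group))) for allele, group in groupby(column)]


if __name__ == "__main__":
    # Toy panel: 5 haplotypes, 4 sites (made up for illustration).
    panel = [
        [0, 1, 0, 1],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 0],
    ]
    for k, column in enumerate(pbwt_columns(panel)):
        print(k, column, run_length_encode(column))
```

In this toy panel the raw second column reads 1, 1, 0, 0, 1 (three runs), while the PBWT-ordered column groups similar haplotypes together and needs only two runs. Storing runs instead of full columns is the kind of saving a run-length encoded index relies on when haplotypes share long stretches.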

Copyrights belong to the original authors. Visit the link for more info.

Podcast created by Paper Player, LLC