cover of episode Building a Pangenome Alignment Index via Recursive Prefix-Free Parsing

Building a Pangenome Alignment Index via Recursive Prefix-Free Parsing

2023/1/27
logo of podcast PaperPlayer biorxiv bioinformatics

PaperPlayer biorxiv bioinformatics

Shownotes Transcript

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.01.26.525723v1?rss=1

Authors: Oliva, M., Gagie, T., Boucher, C.

Abstract: Motivation: Pangenomics alignment has emerged as an opportunity to reduce bias in biomedical research. Traditionally, short read aligners---such as Bowtie and BWA---were used to index a single reference genome, which was then used to find approximate alignments of reads to that genome. Unfortunately, these methods can only index a small number of genomes due to the linear-memory requirement of the algorithms used to construct the index. Although there are a couple of emerging pangenome aligners that can index a larger number of genomes more algorithmic progress is needed to build an index for all available data. Results: Emerging pangenomic methods include VG, Giraffe, and Moni, where the first two methods build an index a variation graph from the multiple alignment of the sequences, and Moni simply indexes all the sequences in a manner that takes the repetition of the sequences into account. Moni uses a preprocessing technique called prefix-free parsing to build a dictionary and parse from the input---these, in turn, are used to build the main run-length encoded BWT, and suffix array of the input. This is accomplished in linear space in the size of the dictionary and parse. Therein lies the open problem that we tackle in this paper. Although the dictionary scales nicely (sub-linear) with the size of the input, the parse becomes orders of magnitude larger than the dictionary. To scale the construction of Moni, we need to remove the parse from the construction of the RLBWT and suffix array. We accomplish this, in this paper by applying prefix-free parsing recursively on the parse. Although conceptually simple, this leads to an algorithmic challenge of constructing the RLBWT and suffix array without access to the parse. We solve this problem, implement it, and demonstrate that this improves the construction time by a factor of 8.9 the running time and by a factor of 2.7 the memory required. Availability: Our implementation is open source and available at https://github.com/marco-oliva/r-pfbwt. Contact: Marco Oliva at [email protected]

Copy rights belong to original authors. Visit the link for more info

Podcast created by Paper Player, LLC