RExPRT: a machine learning tool to predict pathogenicity of tandem repeat loci

2023/3/23

PaperPlayer biorxiv bioinformatics

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.03.22.533484v1?rss=1

Authors: Fazal, S., Danzi, M., Xu, I., Kobren, S. N., Sunyaev, S., Reuter, C., Marwaha, S., Wheeler, M. T., Dolzhenko, E., Lucas, F., Wuchty, S., Tekin, M., Zuchner, S., Aguiar-Pulido, V.

Abstract: Tandem repeats (TRs) are polymorphic DNA sequences composed of repeating motifs 2 to 6 base pairs in length. Expansions of TRs are responsible for approximately 50 monogenic diseases, compared to over 4,300 disease-causing genes disrupted by single nucleotide variants and small indels. It thus appears reasonable to expect the discovery of additional pathogenic repeat expansions, which has the potential to significantly narrow the current diagnostic gap in many diseases. Recently, short- and long-read whole genome sequencing, combined with advanced bioinformatics tools, has identified a growing number of TR expansions in the human population. The majority of these loci are expanded in less than 1% of genomes. Categorizing and prioritizing such TR loci is a growing challenge for human genomic studies. We present a first-in-class machine learning tool, RExPRT (Repeat EXpansion Pathogenicity pRediction Tool), which is designed to distinguish pathogenic from benign TR expansions. RExPRT's predictive features include annotations of the surrounding genetic architecture that were selected based on their enrichment in known pathogenic loci compared to other repeats. Leave-one-out cross-validation results demonstrated that an ensemble approach comprising support vector machines (SVM) and extreme gradient boosted decision trees (XGB) classifies TRs with a precision of 92% and a recall of 90%. Further validation of RExPRT on unseen test data demonstrates a similar precision of 86% and a recall of 60%. RExPRT's high precision, in particular, will be of significant value to large-scale discovery studies, which require the prioritization of promising candidate loci for time-consuming and costly functional follow-up studies. Thus, RExPRT establishes a foundation for the application of machine learning approaches to categorize the pathogenicity of tandem repeat loci.
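
The evaluation protocol named in the abstract (leave-one-out cross-validation of a two-model ensemble, scored by precision and recall) can be sketched in a few lines of Python. This is a toy illustration only, not RExPRT's code: the data are invented, a nearest-centroid and a nearest-neighbor classifier stand in for the SVM and XGB models so the sketch needs no third-party libraries, and the rule that both models must agree before a locus is called pathogenic is an assumption, since the abstract does not say how the two models are combined.

```python
from math import dist

# Toy data: (feature vector, label); label 1 = pathogenic, 0 = benign.
# Features and values are invented for illustration only.
DATA = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.85, 0.7), 1), ((0.7, 0.95), 1),
    ((0.2, 0.1), 0), ((0.1, 0.3), 0), ((0.3, 0.2), 0), ((0.15, 0.15), 0),
    ((0.6, 0.4), 0), ((0.55, 0.6), 1),
]

def nearest_centroid(train, x):
    """Stand-in model 1: predict the class whose mean vector is closest to x."""
    centroids = {}
    for label in (0, 1):
        points = [f for f, y in train if y == label]
        centroids[label] = tuple(sum(c) / len(points) for c in zip(*points))
    return min(centroids, key=lambda label: dist(centroids[label], x))

def nearest_neighbor(train, x):
    """Stand-in model 2: predict the label of the closest training point."""
    return min(train, key=lambda item: dist(item[0], x))[1]

def ensemble(train, x):
    """Call pathogenic only when both models agree (an assumed rule)."""
    both = nearest_centroid(train, x) == 1 and nearest_neighbor(train, x) == 1
    return int(both)

def leave_one_out(data, predict):
    """Hold out each sample once, train on the rest, collect predictions."""
    return [predict(data[:i] + data[i + 1:], x) for i, (x, _) in enumerate(data)]

def precision_recall(truth, predicted):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

truth = [y for _, y in DATA]
predicted = leave_one_out(DATA, ensemble)
precision, recall = precision_recall(truth, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Requiring agreement between the two models tends to trade recall for precision, which matches the abstract's emphasis on precision when prioritizing candidate loci, but the actual combination rule should be taken from the paper.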

Copyright belongs to the original authors. Visit the link for more info.

Podcast created by Paper Player, LLC