Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.03.22.533737v1?rss=1
Authors: Georges, A., Mijangos, L., Patel, H. R., Aitkens, M., Gruber, B. R.
Abstract: Distance measures are widely used for examining genetic structure in datasets that comprise many individuals scored for a very large number of attributes. Genotype datasets composed of single nucleotide polymorphisms (SNPs) typically contain bi-allelic scores for tens of thousands if not hundreds of thousands of loci. We examine the application of distance measures to SNP data (both genotypes and sequence tag presence-absence) and use real datasets and simulated data to illustrate pitfalls in the application of genetic distances and their visualization. Missing values arise from ascertainment biases in the SNP discovery process (null alleles in the case of SNP genotyping; true missing data in the case of sequence tag presence-absence data). Missing values can cause displacement of affected individuals from their natural groupings and artificial inflation of confidence envelopes, leading to potential misinterpretation. Failure of a distance measure to conform to metric and Euclidean properties is important but only likely to create unacceptable outcomes in extreme cases. Lack of randomness in the selection of individuals and lack of independence of both individuals and loci (e.g. polymorphic haploblocks) can lead to substantial and otherwise inexplicable distortions of the visual representations and, again, potential misinterpretation. Euclidean Distance is the metric of choice in many distance studies. However, other measures may be preferable because of underlying models of divergence, population demographic history, or linkage disequilibrium, because it is desirable to down-weight joint absences, or because of other characteristics specific to the data or analyses. Distance measures for SNP genotype data that depend on the arbitrary choice of reference and alternate alleles (e.g. Bray-Curtis distance) should be avoided. Careful consideration should be given to which state is scored zero when applying binary distance measures to fragment presence-absence data (e.g. Jaccard distance). Filtering on missing values then imputing those that remain avoids distortion in visual representations. The presence of closely related individuals or polymorphic haploblocks in the genomes of target species with limited genomic information occasionally emerges as a challenge that needs to be managed.
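The abstract's two warnings about coding choices can be illustrated with a small sketch. It is not taken from the paper: the genotype coding (0, 1, 2 copies of the alternate allele), the toy individuals, and the helper names (euclidean, bray_curtis, jaccard, swap_alleles) are all assumptions made for illustration. It shows that swapping the reference and alternate allele at one locus leaves Euclidean distance unchanged but alters Bray-Curtis distance, and that Jaccard distance on presence-absence data changes when the state scored zero is reversed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bray_curtis(a, b):
    """Bray-Curtis distance: sum of absolute differences over sum of scores."""
    total = sum(x + y for x, y in zip(a, b))
    if total == 0:
        return 0.0  # both vectors are all zero; treat as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def jaccard(a, b):
    """Jaccard distance on 0/1 vectors; joint zeros are ignored."""
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if either == 0:
        return 0.0  # no state scored 1 in either vector
    return 1 - shared / either


def swap_alleles(genotype, locus):
    """Recode one locus as if reference and alternate alleles were swapped."""
    recoded = list(genotype)
    recoded[locus] = 2 - recoded[locus]
    return recoded


# Hypothetical SNP genotypes: counts of the alternate allele at four loci.
ind_a = [0, 1, 2, 0]
ind_b = [1, 1, 0, 0]

# Swap reference/alternate at the last locus for both individuals.
ind_a_swapped = swap_alleles(ind_a, 3)
ind_b_swapped = swap_alleles(ind_b, 3)

print("Euclidean  ", euclidean(ind_a, ind_b), euclidean(ind_a_swapped, ind_b_swapped))
print("Bray-Curtis", bray_curtis(ind_a, ind_b), bray_curtis(ind_a_swapped, ind_b_swapped))

# Hypothetical presence-absence scores for five sequence tags (1 = present).
tags_x = [1, 1, 0, 0, 0]
tags_y = [1, 0, 0, 0, 0]

# The same data with absence scored 1 and presence scored 0.
tags_x_flipped = [1 - t for t in tags_x]
tags_y_flipped = [1 - t for t in tags_y]

print("Jaccard    ", jaccard(tags_x, tags_y), jaccard(tags_x_flipped, tags_y_flipped))
```

With these toy values the Euclidean distance is the same before and after the allele swap, Bray-Curtis moves from 0.6 to about 0.33, and Jaccard moves from 0.5 to 0.25 when the zero state is reversed. The individuals and their data are identical in each pair of calculations; only the coding differs.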
Copyright belongs to the original authors. Visit the link for more information.
Podcast created by Paper Player, LLC