Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.06.25.546422v1?rss=1
Authors: Gu, L.-L., Chen, G.-B., Wu, H.-S., Zhang, Y.-J., He, J.-C., Liu, X.-L., Wang, Z.-Y., Jiang, D., Fang, M.
Abstract: Genetic analysis using big data can enhance the power of GWAS, but large data sets often have many missing phenotypes. The UK Biobank database contains ~500,000 individuals with ~3,000 phenotypes, and phenotype missing rates range from 0.11% to 98.35%. Imputing missing phenotypes is an important way of improving GWAS power, and multi-phenotype imputation methods can significantly improve imputation accuracy. However, most existing multi-phenotype imputation methods cannot impute missing phenotypes for millions of individuals; for example, PHENIX (Nature Genetics 2016, 48:466-472) would require months of time and ~1 TB of computer memory. We herein developed Mixed Fast Random Forest (MFRF), a machine learning method for phenotypic imputation. Our simulation results showed that the imputation accuracy of MFRF was higher than or equal to that of existing state-of-the-art methods. MFRF was also computationally fast and memory efficient, using only 0.23-0.54 h and 68.32-126.35 MB of computer memory for the UK Biobank dataset. We applied MFRF to impute 425 phenotypes from the UK Biobank dataset and conducted GWAS using the imputed phenotypes. Compared with the GWAS before phenotype imputation, 1355 (15.6%) additional GWAS loci were identified.
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC