Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.03.23.533903v1?rss=1
Authors: Kotlarz, K., Mielczarek, M., Biecek, P., Wojdak-Maksymiec, K., Suchocki, T., Topolski, P., Jagusiak, W., Szyda, J.
Abstract: The most serious drawback underlying the biological annotation of Whole Genome Sequence (WGS) data is the p >> n problem, meaning that the number of polymorphic variants (p) is much larger than the number of available phenotypic records (n). Therefore, the major aim of the study was to propose a way to circumvent the problem by combining a LASSO logistic regression model with Deep Learning (DL). This was illustrated by a practical biological problem: the classification of cows as mastitis-susceptible or mastitis-resistant, based on genotypes of Single Nucleotide Polymorphisms (SNPs) identified in their WGS. Several DL architectures were proposed by optimising DL hyperparameters with the Optuna software on different SNP subsets defined by LASSO logistic regressions with different penalty values; among them, the architecture with 204,642 SNPs was selected as the best one. This architecture was composed of 2 layers with 7 and 46 units, respectively, and respective drop-out rates of 0.210 and 0.358. Its classification of the test data set resulted in AUC=0.750, accuracy=0.650, sensitivity=0.600, and specificity=0.700, so it was selected as the best model and carried forward to genomic and functional annotation. Significant SNPs were selected based on the SHapley Additive exPlanation (SHAP) values transformed to Z-scores to assess the underlying type I error. These SNPs were annotated to genes. As a final result, a single GO term related to biological process and thirteen GO terms related to molecular function were significantly enriched in the gene set that corresponded to the significant SNPs.
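The SNP selection step in the abstract (SHAP values transformed to Z-scores to assess type I error) can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything here is an assumption: one importance score per SNP (for example a mean absolute SHAP value), standardisation against the mean and standard deviation of all scores, a one-sided test against the standard normal distribution, and the hypothetical names `shap_z_scores`, `upper_tail_p`, `significant_snps` and `alpha`. The input values are toy numbers, not results from the paper.

```python
from statistics import NormalDist, mean, stdev


def shap_z_scores(importances):
    """Standardise per-SNP importance scores to Z-scores.

    Assumption: one score per SNP, e.g. a mean absolute SHAP value.
    """
    mu = mean(importances)
    sd = stdev(importances)  # sample standard deviation
    if sd == 0:
        raise ValueError("all importance scores are identical")
    return [(value - mu) / sd for value in importances]


def upper_tail_p(z):
    """One-sided p-value of a Z-score under the standard normal distribution."""
    return 1.0 - NormalDist().cdf(z)


def significant_snps(snp_ids, importances, alpha=0.05):
    """Return (snp_id, z, p) for SNPs whose one-sided p-value is below alpha.

    Assumption: no multiple-testing correction is applied in this sketch.
    """
    hits = []
    for snp_id, z in zip(snp_ids, shap_z_scores(importances)):
        p = upper_tail_p(z)
        if p < alpha:
            hits.append((snp_id, z, p))
    return hits


if __name__ == "__main__":
    # Toy data: 19 SNPs with a small score and one SNP with a large score.
    ids = [f"snp_{i}" for i in range(1, 21)]
    scores = [0.01] * 19 + [0.50]
    for snp_id, z, p in significant_snps(ids, scores):
        print(snp_id, round(z, 2), p)
```

With hundreds of thousands of SNPs, as in the study, a real analysis would need to account for multiple testing; that choice is left open here because the abstract does not specify it.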
Copyrights belong to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC