Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.02.09.527930v1?rss=1
Authors: Mecham, A., Stephenson, A., Quinteros, B. I., Salmons, G., Piccolo, S. R.
Abstract: TidyGEO is a Web-based tool for downloading, tidying, and reformatting data series from Gene Expression Omnibus (GEO). As a freely accessible repository with data from over 4 million biological samples across more than 4,000 organisms, GEO provides diverse opportunities for secondary research. Transcriptomic data are most common in GEO, but other measurement types are also prevalent, including DNA methylation levels, genotypes, and chromatin-accessibility profiles. GEO's diversity and expansiveness present opportunities and challenges. Although scientists may find assay data relevant to a given research question, most analyses require sample annotations, such as a sample's treatment group, disease subtype, or age. In GEO, such annotations are stored alongside assay data in delimited, text-based files. However, the structure and semantics of the annotations vary widely from one series to another, and many annotations are not useful for analysis purposes. Thus, every GEO series must be tidied before it is analyzed. Manual approaches may be used, but these are error prone and take time away from other research tasks. Custom computer scripts can be written, but many scientists lack the computational expertise to create such scripts. To address these challenges, we created TidyGEO, which supports essential data-cleaning tasks for sample-level annotations, such as selecting informative columns, renaming columns, splitting or merging columns, standardizing data values, and filtering samples. Additionally, users can integrate annotations with assay data, restructure assay data, and generate code that enables others to reproduce these steps. The source code for TidyGEO is at https://github.com/srp33/TidyGEO.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC