Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.04.14.536886v1?rss=1
Authors: Teufel, F., Gislason, M. H., Almagro Armenteros, J. J., Johansen, A. R., Winther, O., Nielsen, H.
Abstract: When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too closely related pairs of sequences ending up in different partitions. If this is ignored, performance estimates of prediction methods will tend to be exaggerated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too closely related pairs remain. We present GraphPart, an algorithm for homology partitioning, where as many sequences as possible are kept in the dataset, but partitions are defined such that closely related sequences always end up in the same partition. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation quality on par with reduction approaches.
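To make the difference between homology reduction and homology partitioning concrete, here is a minimal Python sketch of the partitioning idea. It is not the GraphPart algorithm itself, which is described in the paper. It is a simplified stand-in that merges related sequences by single linkage (connected components) and then distributes the resulting clusters across partitions. The function name partition_by_similarity, its parameters, and the assumption that pairs of too closely related sequences have already been computed (for example by a pairwise alignment tool) are all illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def partition_by_similarity(ids, similar_pairs, n_partitions):
    """Assign every id to a partition so that each pair listed in
    similar_pairs ends up in the same partition.

    Hypothetical helper for illustration only: a single-linkage
    simplification, not the GraphPart algorithm.
    """
    # Union-find structure: every sequence starts as its own cluster.
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every pair that is too closely related.
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each connected component.
    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)

    # Place whole clusters, largest first, into the currently smallest
    # partition. No sequence is discarded, unlike homology reduction.
    partitions = [[] for _ in range(n_partitions)]
    for members in sorted(clusters.values(), key=len, reverse=True):
        min(partitions, key=len).extend(members)
    return partitions


ids = ["a", "b", "c", "d", "e"]
similar_pairs = [("a", "b"), ("b", "c")]  # assumed precomputed
parts = partition_by_similarity(ids, similar_pairs, n_partitions=2)
```

In this toy input, a, b and c are linked and therefore share a partition, while d and e fill the other one, and all five sequences are kept. A known weakness of plain single linkage is that long chains of related sequences can merge into one very large cluster, which makes balanced partitions hard to achieve.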
Copyright belongs to the original authors. Visit the link for more information.
Podcast created by Paper Player, LLC