Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.02.15.528638v1?rss=1
Authors: Stam, M., Langlois, J., Chevalier, C., Reboul, G., Bastard, K., Medigue, C., Vallenet, D.
Abstract: Background. The growing availability of large genomic datasets presents an opportunity to discover novel metabolic pathways and enzymatic reactions profitable for industrial or synthetic biological applications. Efforts to identify new enzyme functions in this substantial number of sequences cannot succeed without the help of bioinformatics tools and the development of new strategies. The classical way to assign a function to a gene uses sequence similarity. However, another way is to mine databases to identify conserved gene clusters (i.e. syntenies) because, in prokaryotic genomes, genes involved in the same pathway are frequently encoded in a single locus with an operonic organisation. This Genomic Context (GC) conservation is considered a reliable indicator of functional relationships, and is thus a promising approach to improve gene function prediction.
Methods. Here we present NetSyn (Network Synteny), a tool that aims to cluster protein sequences according to the similarity of their genomic context rather than their sequence similarity. Starting from a set of protein sequences of interest, NetSyn retrieves neighbouring genes from the corresponding genomes as well as their protein sequences. Homologous protein families are then computed to measure synteny conservation between each pair of input sequences using a GC score. A network is then created where nodes represent the input proteins and edges indicate that two proteins share a common GC. The weight of each edge corresponds to the synteny conservation score. The network is then partitioned into clusters of proteins sharing a high degree of synteny conservation.
Results. As a proof of concept, we used NetSyn on two different datasets. The first one is made of homologous sequences of an enzyme family (the BKACE family, previously named DUF849), with the aim of dividing it into sub-families of specific activities. NetSyn was able to go further by providing sub-families in addition to those previously published. The second dataset corresponds to a set of non-homologous proteins consisting of different Glycosyl Hydrolases (GH), with the aim of interconnecting them and finding conserved operon-like genomic structures. NetSyn was able to detect the locus of Cellvibrio japonicus for the degradation of xyloglucan. It contains three non-homologous GH and was found conserved in fourteen bacterial genomes.
Discussion. NetSyn is able to cluster proteins according to their genomic context, which is a way to make functional links between proteins without relying solely on their sequence similarity. We showed that NetSyn is efficient in exploring large protein families to define iso-functional groups. It can also highlight functional interactions between proteins from different families and predict new conserved genomic structures that have not yet been experimentally characterised. NetSyn can also be useful in pinpointing mis-annotations that have been propagated in databases and in suggesting annotations for proteins currently annotated as 'unknown'. NetSyn is freely available at https://github.com/labgem/netsyn.
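The network idea described in the Methods can be illustrated with a toy Python sketch. This is not NetSyn's code and does not use its API: the protein and family names are made up, the score (a simple count of shared neighbour families) is an assumed stand-in for the GC score, and connected components stand in for the partitioning step, whose actual method the abstract does not specify.

```python
from itertools import combinations

# Hypothetical toy data: input protein -> set of homologous-family IDs
# found among its neighbouring genes. Not real NetSyn input or output.
contexts = {
    "protA": {"fam1", "fam2", "fam3"},
    "protB": {"fam1", "fam2", "fam4"},
    "protC": {"fam1", "fam2", "fam3"},
    "protD": {"fam7", "fam8"},
    "protE": {"fam7", "fam8", "fam9"},
}


def toy_gc_score(a, b):
    """Assumed stand-in score: number of neighbour families shared."""
    return len(a & b)


def build_network(contexts, min_score=2):
    """Return weighted edges between proteins whose score passes the threshold."""
    edges = {}
    for p, q in combinations(sorted(contexts), 2):
        score = toy_gc_score(contexts[p], contexts[q])
        if score >= min_score:  # min_score is an illustrative threshold
            edges[(p, q)] = score
    return edges


def connected_components(nodes, edges):
    """Group nodes into clusters; a simple substitute for network partitioning."""
    adjacency = {n: set() for n in nodes}
    for p, q in edges:
        adjacency[p].add(q)
        adjacency[q].add(p)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


edges = build_network(contexts)
clusters = connected_components(contexts, edges)
print(edges)
print(clusters)
```

In this toy network, proteins end up in the same cluster because their neighbouring genes belong to the same families, regardless of whether the input proteins themselves are similar in sequence.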
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC