Yang2010
TITLE: Identifying gene clusters within localized regions in multiple genomes SOURCE: JCB Vol 12, No 5, 2010
Consider a model of gene clusters where the distance between any pair of genes within a cluster is at most d. Problem is to find all maximal gene clusters with e-value below a cutoff.
Their model allows for multiple gene numbers to correspond to the same gene, i.e. a gene can belong in multiple families.
[Durand2003a] shows how to compute p-value, e-value estimated from p-value.
The general problem is shown to be NP-hard via reduction from the clique problem.
Ordered clusters
A polynomial time algorithm can be obtained for a restricted problem with the following assumptions:
- a cluster appears in every chromosome
- none of the genes appear in more than one orthologous group
- orthologous genes are strictly ordered within each cluster without allowing paralogous copies
The idea is to construct a DAG where a maximal cluster corresponds to finding the longest path. To find additional clusters, they mask the genes in the optimal cluster and rerun the algorithm or using an algorithm to find u longest paths.
Unordered clusters
The idea to consider each window of size d on each chromosome and find the location of all occurrence of its gene numbers in the other chromosomes and then intersect them. The complexity is $O((tD+1)^{k-1})$ after some preprocessing. This is much better than the $O(n^{k-1})$ when tD is small relative to n.
Results on bacterial genomes
Input: B. subtilis, S. pyogenes, S. pneumoniae, and C. acetobutylicum from COG, $1 \le d \le 50$
Examined 7 maximal unordered gene clusters with low e-values that appear in all four genomes and showed biological significance. Then applied the restricted version.
Results on yeast genomes
Input: S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus. using orthologous groups from Kellis, $1 \le d \le 50$
Showed top result and compare GCFinder against GeneTeams with $1 \le \delta \le 50$. Evaluated functional enrichment of each cluster by applying the GO Term Finder [Boyle2004] on the S. cerevisiae genes and clusters a cluster to have a significant GO term if its Bonferroni corrected p-value is blow 0.05 within any of the three ontologies. GCFinder found almost twice as many clusters with significant functional enrichment.