Melvin's digital garden

Yang2010

TITLE: Identifying gene clusters within localized regions in multiple genomes SOURCE: JCB Vol 12, No 5, 2010

Consider a model of gene clusters where the distance between any pair of genes within a cluster is at most d. Problem is to find all maximal gene clusters with e-value below a cutoff.

Their model allows for multiple gene numbers to correspond to the same gene, i.e. a gene can belong in multiple families.

[Durand2003a] shows how to compute p-value, e-value estimated from p-value.

The general problem is shown to be NP-hard via reduction from the clique problem.

Ordered clusters

A polynomial time algorithm can be obtained for a restricted problem with the following assumptions:

  • a cluster appears in every chromosome
  • none of the genes appear in more than one orthologous group
  • orthologous genes are strictly ordered within each cluster without allowing paralogous copies

The idea is to construct a DAG where a maximal cluster corresponds to finding the longest path. To find additional clusters, they mask the genes in the optimal cluster and rerun the algorithm or using an algorithm to find u longest paths.

Unordered clusters

The idea to consider each window of size d on each chromosome and find the location of all occurrence of its gene numbers in the other chromosomes and then intersect them. The complexity is $O((tD+1)^{k-1})$ after some preprocessing. This is much better than the $O(n^{k-1})$ when tD is small relative to n.

Results on bacterial genomes

Input: B. subtilis, S. pyogenes, S. pneumoniae, and C. acetobutylicum from COG, $1 \le d \le 50$

Examined 7 maximal unordered gene clusters with low e-values that appear in all four genomes and showed biological significance. Then applied the restricted version.

Results on yeast genomes

Input: S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus. using orthologous groups from Kellis, $1 \le d \le 50$

Showed top result and compare GCFinder against GeneTeams with $1 \le \delta \le 50$. Evaluated functional enrichment of each cluster by applying the GO Term Finder [Boyle2004] on the S. cerevisiae genes and clusters a cluster to have a significant GO term if its Bonferroni corrected p-value is blow 0.05 within any of the three ontologies. GCFinder found almost twice as many clusters with significant functional enrichment.

Links to this note