Melvin's digital garden

Ling2008

created#200811180337 #paper #gene_cluster #max_gap

Efficiently identifying max-gap clusters in pairwise genome comparisons [file:///home/melvin/Modules/Literature/Ling2008.pdf]

Main contributions

extended the model so that each element has a start and an end position; segmental elements may overlap or completely cover other segmental elements

derived an upper bound on the merging distance that guarantees no valid max-gap cluster is missed

the merging distance is small, which avoids the need to look ahead and makes the method more efficient than existing approaches

Applications

conservation of spatial clustering to infer homology of genomic regions

inference of functional gene groups, e.g. operons

max-gap clusters are popular; there are two classes of methods: heuristics and formal methods (gene teams)

advantages of gene teams model:

  • correctness and completeness of algorithms can be verified
  • there are rigorous statistical tests, e.g. [Hoberman2005]

Methods based on collinearity

DAGchainer, SyMAP, ColinearScan and FISH

may miss many blocks where some “local” rearrangement events occur

gene order is not always conserved in duplicated genomic regions (Venter2001) or in syntenic regions (Postlethwait2000, Pevzner2003)

Methods based on merging

Salgado2000, Price2005, Zheng2005, Westover2005, Vandepoele2002, Pevzner2003, Bourque2005, Hampson2005, Cannon2003, McLysaght2002

relax the gene order constraint in a limited way; order conservation is still implicitly required

it was shown in Bergon2002 that such methods cannot guarantee finding all max-gap clusters

Methods based on divide and conquer

inefficient because most clusters are of small size, which makes recursive decomposition very expensive
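A minimal sketch of the recursive-splitting idea, to make the cost argument concrete. This is an illustrative version, not the code of any cited tool. Assumptions: every gene has a one-to-one ortholog, positions are integer gene indices, and the gap g counts the genes lying between two consecutive cluster members. The names `split_by_gap` and `gene_teams` are hypothetical.

```python
from typing import Dict, List, Set


def split_by_gap(genes: List[str], pos: Dict[str, int], g: int) -> List[List[str]]:
    """Sort genes by position and cut wherever more than g genes intervene."""
    ordered = sorted(genes, key=lambda x: pos[x])
    runs = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if pos[cur] - pos[prev] - 1 > g:
            runs.append([cur])      # gap too large: start a new run
        else:
            runs[-1].append(cur)    # still within the max-gap constraint
    return runs


def gene_teams(pos_a: Dict[str, int], pos_b: Dict[str, int], g: int) -> List[Set[str]]:
    """Return the maximal gene sets that form a max-gap run in both genomes.

    Assumes pos_a and pos_b have exactly the same keys.
    """
    teams: List[Set[str]] = []
    stack = [sorted(pos_a)] if pos_a else []
    while stack:
        group = stack.pop()
        runs_a = split_by_gap(group, pos_a, g)
        if len(runs_a) > 1:         # group breaks apart in genome A
            stack.extend(runs_a)
            continue
        runs_b = split_by_gap(group, pos_b, g)
        if len(runs_b) > 1:         # group breaks apart in genome B
            stack.extend(runs_b)
            continue
        teams.append(set(group))    # one run in both genomes: a cluster
    return teams


# Toy example with hypothetical positions.
pos_a = {"a": 0, "b": 1, "c": 2, "d": 10, "e": 11}
pos_b = {"a": 5, "b": 0, "c": 6, "d": 1, "e": 20}
teams = gene_teams(pos_a, pos_b, g=1)
```

Every split re-sorts the whole group, so when the final clusters are small, many rounds of sorting are spent on groups that keep falling apart; this is the expense the note refers to.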

Statistical evaluation

given two genomes, each with n genes, and m one-to-one orthologs between them, the statistical significance of a max-gap cluster of h genes with max-gap constraint g is assessed by computing the probability of observing a cluster of size exactly h when the gene orders in both genomes are randomly shuffled.

Equation 6 of [Hoberman2005b]. Note that this differs from the traditional way of assessing statistical significance, which looks at the probability of obtaining a result at least as extreme as the observation under the null hypothesis.
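A rough way to sanity-check such a probability is simulation. The sketch below is a Monte Carlo stand-in, not Equation 6 of [Hoberman2005b]. Assumptions: the m orthologs are placed at random distinct positions among n gene slots in each genome, g counts intervening genes, and the event of interest is "at least one maximal cluster of size exactly h". The names `team_sizes` and `estimate_p_exact` are hypothetical.

```python
import random
from typing import List


def team_sizes(pos_a: List[int], pos_b: List[int], g: int) -> List[int]:
    """Sizes of maximal max-gap clusters; ortholog i sits at pos_a[i] and pos_b[i]."""
    def split(group, pos):
        ordered = sorted(group, key=lambda i: pos[i])
        runs = [[ordered[0]]]
        for prev, cur in zip(ordered, ordered[1:]):
            if pos[cur] - pos[prev] - 1 > g:
                runs.append([cur])
            else:
                runs[-1].append(cur)
        return runs

    sizes: List[int] = []
    stack = [list(range(len(pos_a)))] if pos_a else []
    while stack:
        group = stack.pop()
        for pos in (pos_a, pos_b):
            runs = split(group, pos)
            if len(runs) > 1:       # group breaks apart in this genome
                stack.extend(runs)
                break
        else:                       # no break: one run in both genomes
            sizes.append(len(group))
    return sizes


def estimate_p_exact(n: int, m: int, h: int, g: int,
                     trials: int = 2000, seed: int = 0) -> float:
    """Fraction of random placements with at least one cluster of size exactly h."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pos_a = rng.sample(range(n), m)   # random slots in genome A
        pos_b = rng.sample(range(n), m)   # independent random slots in genome B
        if h in team_sizes(pos_a, pos_b, g):
            hits += 1
    return hits / trials
```

Calling `estimate_p_exact(n=1000, m=200, h=4, g=2)` gives an empirical estimate that can be compared against an analytical value; the simulation only approximates the null model and becomes slow for genome-scale n.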

Results

Comparison with [He2005], g from 0 to 5

At small values of g, most clusters found correspond to operons, reproducing the findings in [He2005]

At larger values of g, found new patterns

  • clusters spanning multiple operons
  • clusters where operon structure is not conserved but some genes still remain in proximity in both genomes

CloseUp [Hampson2005] showed that gene order information is redundant in the analysis of plant genomes; sharing of gene content alone is sufficient to detect homology

Started with the anchors computed by PatternHunter and used by the GRIMM-synteny algorithm [Pevzner2003]; each anchor is a pair of homologous subsequences between human and mouse. There are 642,542 anchors, with lengths ranging from 30 to 9699 nt.

Assessment based upon comparison with GRIMM-synteny algorithm

  • those that match exactly what GRIMM-synteny found
  • those that are in the same region but cover a larger chromosome region than GRIMM-synteny’s findings
  • those that have no overlap with what GRIMM-synteny found
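
The three categories can be sketched as a simple interval comparison. This is a simplified illustration, not the paper's evaluation code. Assumptions: each cluster and each GRIMM-synteny block is reduced to one (chromosome, start, end) interval on a single genome with inclusive ends, and an extra "other" label catches partial overlaps that the list above does not name. All names and coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), ends inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b are on the same chromosome and share at least one position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def covers(a: Interval, b: Interval) -> bool:
    """True if a fully contains b."""
    return a[0] == b[0] and a[1] <= b[1] and b[2] <= a[2]


def classify(cluster: Interval, reference: List[Interval]) -> str:
    """Label a cluster relative to a list of reference blocks."""
    if cluster in reference:
        return "exact"
    hits = [r for r in reference if overlaps(cluster, r)]
    if not hits:
        return "no_overlap"
    if any(covers(cluster, r) for r in hits):
        return "larger"     # contains a reference block and extends beyond it
    return "other"          # partial overlap; a category assumed here, not in the note


# Hypothetical reference blocks and clusters.
reference = [("chr1", 100, 200), ("chr2", 50, 80)]
labels = [classify(c, reference) for c in [
    ("chr1", 100, 200),
    ("chr1", 90, 250),
    ("chr3", 1, 10),
    ("chr1", 150, 300),
]]
```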

gene teams tend to cover a larger portion of the chromosome regions, and the results tend to be more exhaustive.
