Ling2008
created#200811180337 #paper #gene_cluster #max_gap
Efficiently identifying max-gap clusters in pairwise genome comparisons [file:///home/melvin/Modules/Literature/Ling2008.pdf]
Main contributions
extended the model to allow start and end of each element, segmental elements could overlap or completely cover other segmental elements
derived an upper bound merging distance which guarantees to not miss any valid max-gap clusters
merging distance is small which avoid the need to look ahead, hence more efficient than existing approaches
Applications
conservation of spatial clustering to infer homology of genomic regions
inference of functional gene groups, e.g. operons
max-gap clusters are popular, two class of methods, heuristics and formal methods (gene teams)
advantages of gene teams model:
- correctness and completeness of algorithms can be verified
- there are rigorous statistical tests, e.g. [Hoberman2005]
Methods based on collinearity
DAGchainer, SyMAP, ColinearScan and FISH
may miss many blocks where some “local” rearrangement events occur
gene order is not always conserved in duplicated genomics regions (Venter2001) or in syntenic regions (Postlethwait2000, Pevzner2003)
Methods based on merging
Salgado2000, Price2005, Zheng2005, Westover2005, Vandepoele2002, Pevzner2003, Bourque2005 Hampson2005, Cannon2003, McLysaght2002
relax the gene order constraints in a limited way, order conservation implicitly required
was shown in Bergon2002 that such methods can not guarantee to find all max-gap clusters
Methods based on divide and conquer
inefficient because most clusters are of small size, which makes recursive decomposition very expensive
Statistical evaluation
two genomes each which has n genes, suppose there are m one-to-one orthologs, the statistical significance of a max-gap cluster of h genes with max-gap constraint equal to g is assessed through computing the probability of observing one cluster of size exactly h when orders of genes in both genomes are randomly shuffled.
Equation 6 of [Hoberman2005b]. Note that this is different from traditional way of assessing statistical significance of looking at the probability of obtaining a result under the null hypothesis that is at least as extreme as the observation.
Results
Comparison with [He2005], g from 0 to 5
At small values of g, found most clusters correspond to operons, reproducing findings in [He2005]
At larger values of g, found new patterns
- clusters spanning multiple operons
- clusters where operon structure is not conserved but some genes still remain
- in proximity in both genomes
CloseUp [Hampson2005] showed the gene order information is redundant in analysis of plant genomes, sharing of gene content alone is sufficient to detect homology
Started with anchors computed by PatternHunter and used by GRIMM-synteny algorithm [Pevzner2003], each anchor is a pair of homologous subsequences between human and mouse, there are 642,542 anchors with lengths ranging from 30-9699nt.
Assessment based upon comparison with GRIMM-synteny algorithm
- those that match exactly what GRIMM-synteny found
- those that are in the same region but cover a larger chromosome region than
- GRIMM-synteny’s findings
- those that have no overlap with what GRIMM-synteny found
gene teams tend to discover a larger portion of chromosome regions and results tend to be more exhaustive.