Melvin's digital garden

Han2009

The goal is to identify the parent-daughter relationship among paralogs in a species $S$, using an outgroup species $S_o$.

The method clusters gene pairs into either parent or daughter based on two features: the log of the syntenic length, and the ratio of syntenic length to total length.

Assume that the co-occurrence of a homologous gene at each position in the region follows a Bernoulli distribution. The model parameters are:

  • $p_s$, the probability of co-occurrence of homologs in a syntenic region
  • $p_n$, the probability of co-occurrence of homologs in a non-syntenic region
  • the length of the syntenic region to the left of the gene $g_i$
  • the length of the syntenic region to the right of the gene $g_i$
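The likelihood implied by these parameters can be sketched in a few lines. This is a minimal illustration, not the paper's code. It assumes each flank of $g_i$ is encoded as a list of 0/1 indicators (1 if the neighboring gene has a homolog in the matching region, nearest neighbor first), and that the first `length` positions of a flank are syntenic. The function names and this encoding are hypothetical.

```python
import math

def bernoulli_log_prob(x, p):
    """Log-probability of a 0/1 observation x under Bernoulli(p)."""
    return math.log(p) if x else math.log(1.0 - p)

def side_log_likelihood(obs, length, p_s, p_n):
    """Log-likelihood of one flank (assumed encoding: nearest neighbor first).

    The first `length` positions are treated as syntenic (parameter p_s),
    the remaining positions as non-syntenic (parameter p_n).
    """
    return sum(
        bernoulli_log_prob(x, p_s if k < length else p_n)
        for k, x in enumerate(obs)
    )

def best_side_length(obs, p_s, p_n):
    """Syntenic length for one flank that maximizes the log-likelihood."""
    candidates = range(len(obs) + 1)
    return max(candidates, key=lambda n: side_log_likelihood(obs, n, p_s, p_n))

# Toy flanks (made-up data), scored with the human-macaque estimates.
left = [1, 1, 1, 0, 0]
right = [1, 0, 0, 0]
print(best_side_length(left, 0.829, 0.050), best_side_length(right, 0.829, 0.050))
```

Given $p_s$ and $p_n$, the two flanks contribute independent terms to the log-likelihood, so in this sketch the left and right lengths can be chosen separately. On the toy flanks the best lengths are 3 and 1.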

Use EM to find the best set of model parameters for the given data. The total number of parameters is $N+2$, where $N$ is the number of possible gene pairs between $S$ and $S_o$. The syntenic length of each gene is the total length of the syntenic region surrounding that gene.
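The fitting loop can be sketched as follows. This is a simplified hard-assignment variant (alternating between choosing window lengths and re-estimating the two probabilities), which may differ from the exact EM updates in the paper. The 0/1 flank encoding, the function names, the starting values, and the clamping constant `eps` are all assumptions made for illustration.

```python
import math

def _log_prob(x, p):
    """Log-probability of a 0/1 observation x under Bernoulli(p)."""
    return math.log(p) if x else math.log(1.0 - p)

def _best_length(obs, p_s, p_n):
    """Flank length (number of leading syntenic positions) with highest likelihood."""
    def score(n):
        return sum(_log_prob(x, p_s if k < n else p_n) for k, x in enumerate(obs))
    return max(range(len(obs) + 1), key=score)

def fit_synteny_model(pairs, p_s=0.7, p_n=0.2, n_iter=20, eps=1e-6):
    """Alternate between window assignment and probability re-estimation.

    pairs: list of (left, right) flanks, each a 0/1 list, nearest neighbor first.
    Returns (p_s, p_n, windows), where windows[i] = (left_length, right_length).
    """
    windows = []
    for _ in range(n_iter):
        # Step 1: with p_s and p_n fixed, pick the best window for each pair.
        windows = [
            (_best_length(left, p_s, p_n), _best_length(right, p_s, p_n))
            for left, right in pairs
        ]
        # Step 2: with windows fixed, re-estimate p_s and p_n as hit frequencies.
        in_hits = in_total = out_hits = out_total = 0
        for (left, right), (l_len, r_len) in zip(pairs, windows):
            for obs, n in ((left, l_len), (right, r_len)):
                in_hits += sum(obs[:n])
                in_total += n
                out_hits += sum(obs[n:])
                out_total += len(obs) - n
        new_p_s = in_hits / in_total if in_total else p_s
        new_p_n = out_hits / out_total if out_total else p_n
        # Clamp away from 0 and 1 so the logs stay finite.
        new_p_s = min(max(new_p_s, eps), 1.0 - eps)
        new_p_n = min(max(new_p_n, eps), 1.0 - eps)
        converged = abs(new_p_s - p_s) < 1e-9 and abs(new_p_n - p_n) < 1e-9
        p_s, p_n = new_p_s, new_p_n
        if converged:
            break
    return p_s, p_n, windows

# Toy data (made up): three gene pairs with left and right flanks.
toy_pairs = [
    ([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]),
    ([0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]),
    ([1, 1, 1, 1, 0, 1], [1, 1, 1, 0, 0, 0]),
]
p_s_hat, p_n_hat, toy_windows = fit_synteny_model(toy_pairs)
print(p_s_hat, p_n_hat, toy_windows)
```

In this sketch the syntenic length of a pair is the sum of its left and right window lengths, so a pair with long windows on both sides would look like a parent copy and a pair with almost no window would look like a daughter copy.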

Applied to a human-macaque comparison, the estimates were $p_s = 0.829$ and $p_n = 0.050$. The average length of shared synteny was 141 (SD = 154). Some gene pairs were ambiguous. The parent copies had an average synteny of 243 genes, while the daughter copies had an average synteny of 4.5 genes.
