🌱

Melvin's digital garden

Sliding Window Cluster

CREATED: 200710020214 References:

Friedman and Hughes 2001

New ideas:

relation to clustering with n-match distance?
use techniques from DM clustering
NP-hard for (w,k) clusters, k is the size of the clique, what about best match?

Overall approach: for each maximal windows I_G of length w in G,

consider only homologous genes in H, call this H’
for each maximal windows I_H of length w in H’
- if number of homologous genes in I_H >= k
  - report minimum interval containing homologous genes in both windows
- end if
end for end for

Observations:

can treat each marker as the left end of a window of length at most w
insertion of a new marker introduces a new window and increase the size of windows which contains the new marker by 1
deletion of a marker removes a window and reduces the size of windows which contain this marker by 1
consecutive windows in G are similar, change via insertion/deletion of genes

Data structure:

Array Labels: Positions: Index: Return maximal pairs of indices where (1) pos(j) - pos(i) <= w (2) j - i >= k
Using a balanced binary search tree where the data are stored in the leaves, internal nodes keep track of aggregate properties about a range of windows such as Min/Max Size, Min/Max Position, Min/Max Label.

| Algorithmic issues | Array | Segment Tree / Binary indexed tree | | generating H’ from H and $I_G$ | sorting, $O(w \log w)$ | insertion, $O(\log w)$ | | find windows in H’ which has size $\ge k$ | enumerate widows, $O(w)$ | query, $O(|R| log w)$ | | determine end points of window in H and in G | dequeue, $O(w)$ | query, $O(log w)$ |

Links to this note

Contributions