Clustering based Gene Cluster Discovery

CREATED: 200710011210

Class project CS6220: Advanced Topics in Data Mining, taught by Anthony Tung

** Clustering based Gene Cluster Discovery *** My responsibilities

report (max 6 pages)
presentation (max 25mins)
experiments - datasets, other methods to compare against *** Ideas
subspace clustering ideas ** currently find subspace greedily
use freq k-n match and merging of clusters
use of density, #points/vol MBR depends on the number of dimensions, one suggestion is to use #points/length in current dimension since we add dimensions iteratively
another issue is the ordering of the dimensions, we discuss adding the dimensions such that the resulting clusters has the largest average size (equivalent average length of the partitions in the new dimension is the largest)
start with each dimension
neighbor joining approach?
bounding box of a cluster
max size of a cluster
min density
clustering by partitioning, split data along one dimension using greedy method
density based clusters
scoring - min max, min sum, min pairwise sum
facility location - sum of facility costs + sum of distances (have constant factor approx algo) *** Issues
number of clusters unknown
n-match difference is not a distance metric! Does not satisfy triangle inequality
elongated clusters can be good
how to define a scoring function based on n-match?
how to find a good set of clusters? ** if we knew the number of clusters, say k (k-means?) ** top-down vs bottom-up ** clustering in high dimension, subspace clusters ** aim to work reasonable for up to say 100 dimensions *** Schedule (tentative)
Week 9: This week (brainstorm about issues, email if you think of other issues)
Week 10: Meeting after lecture, start on report/implementation
Week 11: report/implementation
Week 12: start on presentation
Week 13: 13/11 (Tuesday) is project presentation *** Deliverables
Report (LaTeX), melvin to start a template
Presentation (LaTeX + Beamer) *** Objectives
Formulate the conserved gene cluster mining problem as a clustering problem in m dimensional space using the n－match difference as the distance measure.
Propose algorithms for solving the above mentioned problem.
Evaluate the performance of our algorithms against traditional conserved gene cluster mining algorithms. *** Notations
m = number of sequences
n = n-match
S = set of labels
Input is a set of m sequences, where sequences are simply strings over the labels, need not be of the same length or have all the labels (if a sequence doesn’t have a label we can treat the position of the missing label as at infinity)
Output is a set of labels which are “close” in n out of the m of sequence *** Focus
novel formulation
new variant of clustering using n-match difference *** Assessment ** proposal(oral) – 10% ** updates – 5% ** cooperation – 5% ** report/presentation – 20% (13/11)

Melvin's digital garden

Clustering based Gene Cluster Discovery

Links to this note