Data mining at I2R
CREATED: 200804041051
Speaker: Ng See-Kiong
- Algorithms – Data – Applications
** Kids Cancer Heroes
- Biological data: sequences (DNA),
- Goals: better drug/treatment, save time/cost, better science
- Emerging patterns (contrasting patterns): observed in most members of one class but in few or none of the other classes
- ALL (acute lymphoblastic leukemia): a heterogeneous cancer with more than 12 subtypes
- Diagnosis requires 3 different tests and 4 different experts, which is not feasible in poorer countries
- PCL: prediction by collective likelihood of emerging patterns; accurate and easily comprehensible
** Fishing for the baby whales
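Looking back at the emerging-pattern definition in the Kids Cancer Heroes notes above, here is a minimal sketch of how such patterns could be enumerated. It is not PCL itself and not the speaker's implementation. The thresholds (`min_target`, `max_other`), the function names, and the toy feature names are all assumptions made for illustration.

```python
from itertools import combinations

def support(pattern, samples):
    """Fraction of samples (sets of items) that contain every item in pattern."""
    if not samples:
        return 0.0
    return sum(pattern <= s for s in samples) / len(samples)

def emerging_patterns(target, other, max_len=2, min_target=0.8, max_other=0.1):
    """Enumerate small itemsets that are frequent in `target` but rare in `other`.

    The thresholds are illustrative assumptions, not values from the talk.
    """
    items = sorted(set().union(*target))
    found = []
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            p = frozenset(combo)
            s_t, s_o = support(p, target), support(p, other)
            if s_t >= min_target and s_o <= max_other:
                found.append((combo, s_t, s_o))
    return found

# Toy, made-up data: each sample is a set of discretized features.
class_a = [
    {"g1_high", "g2_low"},
    {"g1_high", "g2_low", "g3_high"},
    {"g1_high", "g2_low"},
]
class_b = [
    {"g1_high", "g3_high"},
    {"g2_low"},
    {"g3_high"},
]

patterns = emerging_patterns(class_a, class_b)
```

In this toy data neither feature alone separates the classes, but the pair does, which is the kind of readable contrast the notes describe.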
- Observation: no man is an island
- Real world networks
  - large scale, continual growth
  - distributed, organic growth
  - abstract notions of distance
  - scale-free: grow by adding a new node with m edges to existing nodes, preferential attachment (Science, 1999)
  - small world: social networks (6 degrees of separation), WWW (19 clicks)
  - contain a number of highly connected hubs
  - degree distribution: random networks - Poisson, scale-free networks - power law
  - scale-free networks are robust to random failure of nodes
- evaluate growth of entities, model dynamic changes
- similar to evolution of webpages
- compute change in Pubrank over time
** Getting Rich with Data Mining
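The scale-free growth rule in the network notes above (add a new node with m edges, attached preferentially to well-connected nodes) can be sketched with the standard library alone. This is a rough illustration under assumptions: the seed clique, the fixed random seed, and the function name `preferential_attachment` are not from the talk.

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=0):
    """Grow an undirected graph to n nodes; each new node adds m edges."""
    rng = random.Random(seed)
    # Assumption: start from a clique of m + 1 nodes so every node has degree >= m.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks a node with probability proportional to its degree.
    stubs = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

edges = preferential_attachment(n=2000, m=2)
degree = Counter(v for e in edges for v in e)

# The most connected nodes (hubs) sit far above the average degree of about 2 * m.
hubs = degree.most_common(5)
```

Tallying `Counter(degree.values())` gives the degree distribution, which can be compared against the power-law shape mentioned in the notes.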
- 75% of fund managers fail to beat the S&P 500 index
- most fund managers use Modern Portfolio Theory (MPT) to create their portfolios
- assumes that the price in the past can predict the price in the future
- fundamental analysis
- model as a bipartite graph, set of stocks, set of financial ratios
- mine maximal quasi-bicliques to cluster stocks and financial ratios
** Protecting your data privacy
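For the stock and financial-ratio graph in the notes above, a small sketch of what a quasi-biclique test might look like. The notes do not define the term or say what an edge means. This sketch assumes an edge means "this stock has a favorable value for this ratio" and uses one common definition in which every vertex may miss at most `eps` edges to the other side. The tickers and ratio names are made up, and the function only checks a candidate cluster; it does not mine maximal ones.

```python
def is_quasi_biclique(adj, stocks, ratios, eps):
    """True if each vertex misses at most eps edges to the other side.

    adj maps a stock to the set of ratios it is linked to (assumed edge meaning).
    """
    for s in stocks:
        if len(ratios - adj.get(s, set())) > eps:
            return False
    for r in ratios:
        missing = sum(1 for s in stocks if r not in adj.get(s, set()))
        if missing > eps:
            return False
    return True

# Hypothetical bipartite graph: stocks on one side, ratio labels on the other.
adj = {
    "STOCK_A": {"low_PE", "high_ROE", "low_debt"},
    "STOCK_B": {"low_PE", "high_ROE"},
    "STOCK_C": {"high_ROE", "low_debt"},
    "STOCK_D": {"low_PE"},
}
stocks = {"STOCK_A", "STOCK_B", "STOCK_C"}
ratios = {"low_PE", "high_ROE", "low_debt"}

exact = is_quasi_biclique(adj, stocks, ratios, eps=0)    # strict biclique test
relaxed = is_quasi_biclique(adj, stocks, ratios, eps=1)  # tolerate one missing edge
```

Tolerating a few missing edges is what lets noisy real data still form clusters that a strict biclique would reject.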
- public health threat
- using data mining for health monitoring is a potential breach of individuals' privacy
- perturb the data by adding noise
- privacy-preserving data sharing
** Other projects
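The "perturb the data by adding noise" idea in the privacy notes above can be illustrated as follows. The notes do not say which noise distribution is used, so zero-mean Gaussian noise, the value of `sigma`, and the made-up temperature readings are all assumptions.

```python
import random
import statistics

def perturb(values, sigma, seed=0):
    """Return a copy of values with independent zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]

# Made-up individual readings (e.g. body temperatures), for illustration only.
rng = random.Random(42)
true_values = [rng.uniform(36.0, 40.0) for _ in range(5000)]

noisy = perturb(true_values, sigma=2.0)

# Individual records are distorted, but the aggregate stays close.
true_mean = statistics.mean(true_values)
noisy_mean = statistics.mean(noisy)
```

Because the noise averages out over many records, population-level monitoring remains possible while any single shared record no longer reveals the exact original value.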
- emergence/re-emergence of infectious diseases: epitope prediction, computer-aided vaccine design
- traffic problems in highly urbanized cities: intelligent route planning
- people want to be understood but they do not want to be known: anonymous data collection
** Observations
- Data is king
- Key features of real data