Melvin's digital garden

Data mining at I2R

CREATED: 200804041051
Speaker: Ng See-Kiong

  • Algorithms – Data – Applications

** Kids Cancer Heroes
  • Biological data: sequences (DNA),
  • Goals: better drug/treatment, save time/cost, better science
  • Emerging patterns (contrasting): observed in most members of one class but found in none/few of the other classes
  • ALL (acute lymphoblastic leukemia): a heterogeneous cancer with more than 12 subtypes
  • Diagnosis requires 3 different tests and 4 different experts, which is not feasible in poorer countries
  • PCL: prediction by collective likelihood of emerging patterns; accurate and easily comprehensible

** Fishing for the baby whales
  • Observation: no man is an island
  • Real world networks
      • large scale, continual growth
      • distributed, organic growth
      • abstract notions of distance
      • scale-free: grow by adding a new node with m edges to existing nodes, chosen by preferential attachment (Science, 1999)
      • small world: social networks (6 degrees of separation), WWW (19 clicks)
      • contain a number of highly connected hubs
      • degree distribution: Poisson for random networks, power law for scale-free networks
      • scale-free networks are robust to random failure of nodes
  • evaluate growth of entities, model dynamic changes
  • similar to evolution of webpages
  • compute change in Pubrank over time

** Getting Rich with Data Mining
  • 75% of fund managers fail to beat the S&P 500 index
  • most fund managers use Modern Portfolio Theory (MPT) to create their portfolios
  • assumes that the price in the past can predict the price in the future
  • fundamental analysis
  • model as a bipartite graph, set of stocks, set of financial ratios
  • mine maximal quasi-bicliques to cluster stocks and financial ratios

** Protecting your data privacy
  • public health threat
  • using data mining for health monitoring, potential breach of individual’s privacy
  • perturb the data by adding noise
  • privacy preserving data sharing

** Other projects
  • emergence/reemergence of infectious diseases, epitope prediction, computer aided vaccine design
  • traffic problems in highly urbanized cities, intelligent route planning
  • people want to be understood but they do not want to be known, anonymous data collection

** Observations
  • Data is king
  • Key features of real data
      • lots of data
      • different types of data
      • many features, few samples
      • data can be wrong/incomplete
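
The emerging-pattern idea from the Kids Cancer Heroes notes (a pattern observed in most members of one class but in none or few of the others) can be sketched as a support comparison between two classes. This is only an illustration, not the method from the talk: the toy samples, the item names, and the thresholds `min_home`, `max_other` and `max_size` are all assumptions, and it enumerates patterns by brute force.

```python
from itertools import combinations

def support(pattern, samples):
    """Fraction of samples (sets of items) that contain every item in pattern."""
    if not samples:
        return 0.0
    return sum(pattern <= s for s in samples) / len(samples)

def emerging_patterns(home, other, min_home=0.75, max_other=0.25, max_size=2):
    """Return item sets that are frequent in `home` but rare in `other`.

    The thresholds and max_size are illustrative assumptions.
    """
    items = sorted(set().union(*home))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            if (support(pattern, home) >= min_home
                    and support(pattern, other) <= max_other):
                found.append(pattern)
    return found

# Hypothetical discretized gene-expression samples for two subtypes.
class_a = [{"g1_high", "g2_low"}, {"g1_high", "g2_low", "g3_high"},
           {"g1_high", "g2_low"}, {"g1_high", "g3_high"}]
class_b = [{"g1_high", "g3_high"}, {"g2_low", "g3_high"},
           {"g3_high"}, {"g1_high", "g3_high"}]

eps = emerging_patterns(class_a, class_b)
```

In this toy data `g1_high` alone is not an emerging pattern because it is also common in the second class, while the pair `g1_high` with `g2_low` never occurs there.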
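
The scale-free growth rule from the network notes (each new node brings m edges and attaches preferentially to well-connected nodes) can be simulated with the standard library alone. A minimal sketch: the function name `grow_scale_free`, the seed clique of m + 1 nodes, and the sizes used below are assumptions made for illustration.

```python
import random
from collections import Counter

def grow_scale_free(n, m, seed=0):
    """Grow a graph to n nodes; each new node attaches m edges to existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Seed graph: a clique on m + 1 nodes (an assumption of this sketch).
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:          # m distinct existing nodes
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

edges = grow_scale_free(n=2000, m=2)
degree = Counter(v for e in edges for v in e)
hubs = degree.most_common(5)             # the most highly connected nodes
```

Most nodes keep a degree close to m, while a handful of early nodes turn into hubs, which is the heavy-tailed behaviour the notes contrast with the Poisson degree distribution of random networks.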
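
For the stock/financial-ratio bipartite graph, a quasi-biclique relaxes a biclique by tolerating a few missing edges. The sketch below only checks a candidate cluster under one common definition (every vertex may miss at most `eps` vertices on the other side), which may differ from the definition used in the talk; it does not mine maximal quasi-bicliques. The tickers, ratio labels and edges are hypothetical.

```python
def is_quasi_biclique(edges, stocks, ratios, eps):
    """True if every stock misses at most eps of the ratios and every
    ratio misses at most eps of the stocks."""
    edge_set = set(edges)
    for s in stocks:
        if sum((s, r) not in edge_set for r in ratios) > eps:
            return False
    for r in ratios:
        if sum((s, r) not in edge_set for s in stocks) > eps:
            return False
    return True

# Hypothetical data: an edge (stock, ratio) means the stock scores well on that ratio.
edges = [("AAA", "low_PE"), ("AAA", "high_ROE"), ("AAA", "low_debt"),
         ("BBB", "low_PE"), ("BBB", "high_ROE"),
         ("CCC", "low_PE"), ("CCC", "low_debt"),
         ("DDD", "high_ROE")]
stocks = ["AAA", "BBB", "CCC"]
ratios = ["low_PE", "high_ROE", "low_debt"]

exact = is_quasi_biclique(edges, stocks, ratios, eps=0)     # strict biclique test
relaxed = is_quasi_biclique(edges, stocks, ratios, eps=1)   # one missing edge allowed
```

Here the three stocks do not form an exact biclique with the three ratios, but they do once a single missing edge per vertex is tolerated.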
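
The "perturb the data by adding noise" idea from the privacy notes can be illustrated with zero-mean noise: individual records change, but aggregates such as the mean stay close. This is a sketch under assumptions, not the scheme from the talk: Gaussian noise, the value of `sigma`, and the toy temperature readings are all chosen for illustration, and no formal privacy guarantee is claimed.

```python
import random
import statistics

def perturb(values, sigma, seed=0):
    """Return a copy of values with independent zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]

# Hypothetical body-temperature readings collected for health monitoring.
rng = random.Random(1)
true_values = [rng.uniform(36.0, 40.0) for _ in range(5000)]

shared = perturb(true_values, sigma=1.0)

# Records are individually distorted, yet the population mean barely moves.
mean_shift = abs(statistics.mean(shared) - statistics.mean(true_values))
avg_distortion = statistics.mean(abs(a - b) for a, b in zip(shared, true_values))
```

The trade-off is that a larger `sigma` hides individual values better but inflates the variance of the shared data by roughly `sigma ** 2`, so fewer records are needed to hide a person and more are needed to recover an accurate aggregate.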
