Melvin's digital garden

Data Linkage Problem

CREATED: 200701010947 Author: Dongwon Lee Website: http://pike.psu.edu

** Group linkage

  • entity with group of elements as information ** author->citations ** images->sub images *** Methods
  • Jaccard similarity = $(g_1 \cap g_2)/(g_1 \cup g_2)$
  • bipartite matching *** Idea
  • similar ~ many matching elements, few matching byt strong similarity
  • define $BM(g_1, g_2) = (\sum_{(r_1,r_2) \in M} sim(r_1,r_2)) / (m_1 + m_2 - |M|)$ ** M = maximum bipartite matching ** $sim(r1,r2) \ge \rho$ ** similar if $BM \ge \theta$ *** Improvement
  • computing M is expensive, $O(n^3)$
  • use greedy matching, relax condition that each element can only be be in one matching, just match each element to closest match
  • compute upper bound using $g_1 \cup g_2$
  • compute lower bound using $g_1 \cap g_2$
  • use upper bound and lower bound to quickly decide if BM is similar
  • idea of filtering, compute some bounds quickly to avoid expensive computation

** Adaptive SNM - sorted neighbourhood method *** Method

sort

sliding window, pairwise comparison

*** Adaptative window size

  • similar items -> enlarge window
  • dissimilar items -> reduce window

Links to this note