Melvin's digital garden

Motif p-value

CREATED: 200701030709

  • p-value of motif (PWM) in region R = prob(motif occurs in R)
  • spaced speeds (101110111) = prob(spaced seed hits R)

** Seed => Motif

  • |m| = |s|
  • m[i] = 1 if s[i] = 1
  • Pr[m[i] = 0] = Pr[m[i] = 1] = 1/2
  • Threshold t = (1/2)^z, z = number of 0’s in s
  • m occurs in R w.r.t threshold t iff s hits R
  • P(m occurs in R) = P(s hits R)
  • Implies that p-value problem is NP-hard

** Seed <= Motif

  • f(i,b) = Pr(m hits R[i..n]|b is a prefix of R[i..n])
  • want to find f(1, \e)
  • f(i,b) = sum_{c in Omega}f(i,bc)Pr(R[i + |b|] = c | b is a prefix of R[i..n])
  • overlap(b,m’) = largest overlap of b and m’, m’ compatible with PWM m ** compatible = match score >= threshold ** need to enumerate all strings of length L which are compatible
  • f(i,b) = f(i’,overlap(b,m’)), i’ = i + |b| - |overlap(b,m’)|

** Enumerating compatible strings

AGCCTT -> 031123 (frequency rank of ith character in PWM)

if s is compatible, try s+1

if s is not NOT compatible, try s’

** s = 231200, s’ = 23200

After L steps, we enumerate one compatible string, total O(KL)

** Reference

Links to this note