Melvin's digital garden

Max's master's thesis presentation

CREATED: 201002261708

** Challenges in protein sequencing

  • missing peaks
  • ion types
  • neutral losses
  • multiple isotopes
  • noise
  • multiple charge states

** Improve DB search using tags

  • usually consider top k tags
  • need a good scoring function
  • false positives are generally formed from lower-intensity peaks, true positives from high-intensity peaks
  • look-ahead method: strong “future” peaks strengthen the current peak

** DB search without tags by filtering using parent mass

Tags must be formed from continuous paths, so missing peaks cause problems
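To see why a single missing peak breaks a tag, here is a minimal sketch of tag extraction. It assumes a simplified model that is not spelled out in the talk notes: peaks are singly charged prefix-ion masses, two peaks are linked when their gap matches one residue mass, and a tag is a path through those links. The function name `extract_tags`, the tolerance and the reduced residue table are all illustrative choices.

```python
# Monoisotopic residue masses in Da (illustrative subset, not the full alphabet).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}


def extract_tags(peaks, tol=0.02, min_len=3):
    """Return all tags of at least min_len residues read off peak-to-peak gaps."""
    peaks = sorted(peaks)
    # edges[i] lists (j, residue) whenever peaks[j] - peaks[i] matches a residue mass.
    edges = {i: [] for i in range(len(peaks))}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            gap = peaks[j] - peaks[i]
            for residue, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tol:
                    edges[i].append((j, residue))

    tags = set()

    def walk(i, tag):
        # Every sufficiently long path prefix counts as a tag.
        if len(tag) >= min_len:
            tags.add(tag)
        for j, residue in edges[i]:
            walk(j, tag + residue)

    for start in range(len(peaks)):
        walk(start, "")
    return tags


# Prefix-ion ladder for the toy peptide GASV (each mass includes one proton).
full_ladder = [58.02874, 129.06585, 216.09788, 315.16629]
# The same ladder with the second peak missing.
gapped_ladder = [58.02874, 216.09788, 315.16629]
```

With the full ladder the path is continuous and the tag `ASV` comes out; with one peak missing, the gap spans two residues, no single residue mass matches it, and no tag of length 3 survives.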

Idea: Filter database using precursor mass

Problem: large number of matches, around 200K
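A minimal sketch of the precursor mass filter, assuming the precursor mass is given as a neutral monoisotopic mass in Da and that candidates are plain peptide strings. The names `peptide_mass` and `filter_by_precursor` and the default tolerance are hypothetical, not taken from the thesis.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum of an intact peptide


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


def filter_by_precursor(peptides, precursor_mass, tol=0.5):
    """Keep peptides whose mass lies within tol Da of the precursor mass."""
    return [p for p in peptides if abs(peptide_mass(p) - precursor_mass) <= tol]


candidates = filter_by_precursor(["GAS", "SAG", "GGG", "VVV"], 233.1)
```

Here `GAS` and `SAG` both pass: peptides with the same residue composition have the same mass, which is one reason a mass filter alone leaves many candidates to score.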

Runtime optimizations

  • build a mass index of peptide sequences
  • build a peptide trie to avoid scoring candidates with similar prefixes, but not much savings as branching factor is large
  • build a fragment index
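The mass index from the list above could look roughly like this: peptides sorted by mass once, then a binary search per query instead of a scan over the whole database. This is a sketch under assumptions; the class name `MassIndex`, its `query` method and the reduced residue table are illustrative, and the trie and fragment index are not shown.

```python
import bisect

# Monoisotopic residue masses in Da (illustrative subset).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
WATER = 18.01056


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


class MassIndex:
    """Peptides sorted by mass, so a mass window is found by binary search."""

    def __init__(self, peptides):
        entries = sorted((peptide_mass(p), p) for p in peptides)
        self.masses = [mass for mass, _ in entries]
        self.peptides = [peptide for _, peptide in entries]

    def query(self, mass, tol):
        """Return all peptides with mass in [mass - tol, mass + tol]."""
        lo = bisect.bisect_left(self.masses, mass - tol)
        hi = bisect.bisect_right(self.masses, mass + tol)
        return self.peptides[lo:hi]


index = MassIndex(["VVV", "GAS", "GGG", "SAG"])
hits = index.query(233.1, 0.5)
```

Building the index costs one sort; each lookup is then logarithmic in the database size plus the number of hits returned.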

** Comparing PMF (parent mass filter) against Inspect

Comparable results on filtered dataset, where precursor mass is accurate

Inspect and PMF agree with each other on 168 sequences that differ from the annotated sequence

** Errors in parent mass

GPM data: ~70% of spectra have a precursor mass error of more than 0.5 Da

Compute convoluted mass histogram

Select datasets where convoluted mass is close to given precursor mass (filter bad datasets)
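The notes do not define the "convoluted mass histogram". One plausible reading, assumed here, is a self-convolution of the spectrum: complementary singly charged b and y ions sum to the neutral parent mass plus two protons, so a histogram of pairwise peak sums should peak near that value. The function names, bin width and tolerance below are illustrative.

```python
from collections import Counter
from itertools import combinations

PROTON = 1.00728  # Da


def estimate_parent_mass(peaks, bin_width=0.1):
    """Estimate the neutral parent mass from the most common pairwise peak sum.

    Assumes singly charged fragment peaks, so b_i + y_(n-i) = M + 2 * PROTON.
    """
    counts = Counter()
    for a, b in combinations(peaks, 2):
        counts[round((a + b) / bin_width)] += 1
    best_bin, _ = counts.most_common(1)[0]
    return best_bin * bin_width - 2 * PROTON


def precursor_is_consistent(peaks, precursor_mass, tol=0.5):
    """True when the reported precursor mass agrees with the spectrum itself."""
    return abs(estimate_parent_mass(peaks) - precursor_mass) <= tol


# Toy spectrum for peptide GASV (neutral mass about 332.17 Da): b1-b3 and y1-y3.
spectrum = [58.02874, 129.06585, 216.09788, 118.08625, 205.11828, 276.15539]
estimate = estimate_parent_mass(spectrum)
```

Spectra for which `precursor_is_consistent` returns False would be the ones dropped by the filter. On real data, noise peaks and multiply charged fragments would need handling that this sketch leaves out.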
