Melvin's digital garden

Data science at honestbee

[2016-10-24 Mon 19:09:47] speaker: Dat Le, Lead Data Scientist at honestbee; event: DataScienceSG

four categories of problems:

  • predictive models
  • recommendation engine
  • customer segmentation
  • operational optimization

item availability prediction

  • problem: an ordered item turns out not to be in the store
  • features:
    • date of delivery
    • product metadata
    • store metadata
    • extra data:
      • weather, holiday, promotion period
    • financial data:
      • STI, inflation rate, unemployment rate
  • uses XGBoost
    • decision-tree-based gradient boosting machine
    • winning algorithm in many of Kaggle’s data science challenges
  • evaluation using AUC score
    • less sensitive than accuracy to a highly skewed dataset
  • show a “low in stock” label
    • updated daily
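
A minimal sketch of why AUC suits a skewed dataset better than accuracy. It is plain Python, not the speaker's XGBoost pipeline. The labels and scores are made up for illustration (1 = item unavailable), and `auc_score` and `accuracy` are hypothetical helper names.

```python
def auc_score(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up skewed data: only 5 of 100 items are unavailable (label 1).
labels = [1] * 5 + [0] * 95

# A useless model that scores every item as "available".
constant_scores = [0.0] * 100

# A model that ranks most unavailable items above the available ones.
informative_scores = [0.9, 0.8, 0.7, 0.6, 0.2] + [0.1] * 90 + [0.3] * 5

print("constant:   ", accuracy(labels, constant_scores), auc_score(labels, constant_scores))
print("informative:", accuracy(labels, informative_scores), auc_score(labels, informative_scores))
```

The constant model reaches high accuracy just by predicting the majority class, but its AUC stays at chance level. The informative model is rewarded for ranking unavailable items higher.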

item-based recommendation engine

  • collaborative filtering
  • pandas + Jaccard index
  • user-based
    • similarity between users
  • item-based
    • similarity between items
    • Jaccard index
      • purchases with both items / purchases with either item
  • hard to test offline
  • run A/B test on production
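
A rough sketch of item-based similarity with the Jaccard index as defined above (purchases with both items / purchases with either item). The talk mentioned pandas; this version uses plain Python sets to stay self-contained. The orders, item names and the `most_similar` helper are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up purchase history: order id -> items bought in that order.
orders = {
    "o1": {"milk", "bread", "eggs"},
    "o2": {"milk", "bread"},
    "o3": {"milk", "eggs"},
    "o4": {"bread", "jam"},
}

# Invert to item -> set of orders containing that item.
item_orders = {}
for order_id, items in orders.items():
    for item in items:
        item_orders.setdefault(item, set()).add(order_id)

# Pairwise item similarity, keyed by alphabetically sorted item pairs.
similarity = {
    (x, y): jaccard(item_orders[x], item_orders[y])
    for x, y in combinations(sorted(item_orders), 2)
}


def most_similar(item, k=2):
    """Return the k items most similar to `item` as (name, score) pairs."""
    scored = []
    for (x, y), score in similarity.items():
        if item == x:
            scored.append((y, score))
        elif item == y:
            scored.append((x, score))
    # Highest score first; break ties alphabetically for stable output.
    return sorted(scored, key=lambda t: (-t[1], t[0]))[:k]


print(most_similar("milk"))
```

Computing every pair is quadratic in the number of items, so a real catalogue would likely need a sparse or batched approach. That part is not covered in these notes.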

data infrastructure

  • EC2 spot and reserved instances
  • Mesos
  • Marathon, Spark
  • Airflow (job scheduling), MLlib, Spark SQL/RDD
  • continuous integration and deployment
