Yesterday at work, I gave an internal talk on the beta-binomial model, AKA my beloved Bayesian conjugate prior model. I've been using it a lot recently for A/B testing analysis, usually as a complement (not replacement!) to the usual frequentist methods (e.g. running a t-test between the cohorts, checking cohort balance before and after the experiment, running regressions to control for stuff that may be imbalanced). I've been reading a lot of Gelman lately, and just getting more and more enamored with the Bayesian view, and the failings (ahem) of the frequentists.

Specifically, what I like is:

  • Rather than worrying about sampling distributions (which are core to frequentist stats), you can speak more intuitively about beliefs.
  • You have to be explicit. That is, instead of a bunch of hidden assumptions about how your parameters or your data are distributed, you make those choices yourself - they're knobs to turn.
  • In the case of conjugate priors, the math is super easy. Updating a conjugate prior - such as updating a beta distribution with binomial data - is just so damn simple. And it's lovely!
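To make "so damn simple" concrete, here's a minimal sketch of the beta-binomial update: a Beta(α, β) prior plus k successes in n trials gives a Beta(α + k, β + n − k) posterior. The function names and the numbers (a flat Beta(1, 1) prior, 12 conversions out of 100 visitors) are made up for illustration.

```python
def update_beta(alpha, beta, successes, trials):
    """Return posterior Beta parameters after observing binomial data."""
    failures = trials - successes
    # Conjugacy: successes add to alpha, failures add to beta.
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example: flat prior, then 12 conversions in 100 trials.
prior = (1.0, 1.0)
posterior = update_beta(*prior, successes=12, trials=100)

print(posterior)              # (13.0, 89.0)
print(beta_mean(*posterior))  # 13 / 102, a bit above the raw rate of 0.12
```

That's the whole update: two additions. The posterior mean gets pulled slightly from the raw 12% toward the prior mean of 50%, which is the prior doing its (small) job.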

I also just really love how the beta generalizes to n dimensions as the Dirichlet. Back in CS109B, we did a cool homework where we Bayesian-updated the Dirichlet distribution of two texts to see if we could determine the author. Hollaaaa, NLP!
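Here's a rough sketch of what that kind of Dirichlet update can look like. To be clear, this is not the actual homework: the tiny vocabulary, the toy "texts", and the function names are all invented for illustration. The idea is the same as the beta case, just with more categories: add word counts to the Dirichlet parameters, then score a mystery text by its Dirichlet-multinomial predictive probability under each author's posterior.

```python
from collections import Counter
from math import lgamma


def dirichlet_update(alpha, counts):
    """Posterior Dirichlet parameters: prior parameters plus observed counts."""
    return {word: alpha[word] + counts.get(word, 0) for word in alpha}


def log_predictive(alpha, counts):
    """Log probability of a word sequence with these counts under a
    Dirichlet(alpha) distribution, with the word probabilities integrated out."""
    total_alpha = sum(alpha.values())
    total_count = sum(counts.get(word, 0) for word in alpha)
    score = lgamma(total_alpha) - lgamma(total_alpha + total_count)
    for word, a in alpha.items():
        score += lgamma(a + counts.get(word, 0)) - lgamma(a)
    return score


# Hypothetical toy data: a five-word vocabulary and two "authors".
vocab = ["the", "whale", "sea", "love", "letter"]
prior = {word: 1.0 for word in vocab}  # symmetric Dirichlet(1, ..., 1)

author_a = Counter("the whale the sea the whale sea".split())
author_b = Counter("the love letter the love the letter".split())

post_a = dirichlet_update(prior, author_a)
post_b = dirichlet_update(prior, author_b)

# Score an unattributed snippet under each author's posterior.
mystery = Counter("the sea the whale".split())
score_a = log_predictive(post_a, mystery)
score_b = log_predictive(post_b, mystery)
guess = "A" if score_a > score_b else "B"
print(guess)  # A
```

Words outside the vocabulary are simply ignored here, which a real version would need to handle more carefully.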

One good question that came up during my talk yesterday was: When is this useful? When is it more useful than the usual (frequentist) methods?

My sense is that it's most useful when you don't have much data (or don't expect to have much data). As \(n \rightarrow \infty\), the data swamps the prior, so Bayesian and frequentist estimates should agree. But there's a big wide world out there of sparse data and low sample sizes!
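A quick way to see this is to compare the posterior mean against the plain observed rate as n grows. This is just a sketch with made-up numbers: a 10% observed rate, a flat Beta(1, 1) prior, and a deliberately opinionated Beta(20, 20) prior that thinks the rate is around 50%.

```python
def posterior_mean(alpha, beta, successes, trials):
    """Posterior mean of a Beta(alpha, beta) prior after binomial data."""
    return (alpha + successes) / (alpha + beta + trials)


# Hypothetical priors: one flat, one strongly centered on 0.5.
priors = {
    "flat Beta(1, 1)": (1.0, 1.0),
    "opinionated Beta(20, 20)": (20.0, 20.0),
}

observed_rate = 0.10  # assumed for illustration
gaps = {}  # distance between posterior mean and observed rate

for n in (10, 100, 10_000):
    k = int(round(observed_rate * n))
    raw_rate = k / n  # the frequentist point estimate (MLE)
    for name, (a, b) in priors.items():
        mean = posterior_mean(a, b, k, n)
        gaps[(name, n)] = abs(mean - raw_rate)
        print(f"n={n:>6}  {name:<26} posterior mean={mean:.4f}  raw rate={raw_rate:.4f}")
```

With 10 observations the opinionated prior drags the estimate way up toward 50%; with 10,000 both priors land almost exactly on the raw rate. The prior only really matters when the data is thin.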

For example, take views on a single YouTube video. The distribution of views across videos on YouTube has a very long tail: most videos get few to no views, and a minority get a whole lot. But what if you want to predict stuff on that long tail? A low-data area where you don't expect much data to (ever) come? That's where Bayes can really shine.

Or, heh, SHINE WITH THE FIRE OF BAD ASSUMPTIONS! But at least they're explicit. Not-much-data means your prior grows in importance.

As I type, the Women in Machine Learning and Data Science conference is happening/has happened. One of the talks was from the very cool StitchFix Algorithms team, on - gasp! - using the beta-binomial for predicting sales of slow-moving items! Very cool.