Suppose you pulled a slot machine 5 times, and won 1 time. What's your best estimate for the true underlying payout probability $p$?

This seemingly innocent question is surprisingly involved. You might think the answer is $\frac{1}{5}$, but it's actually $\frac{2}{7}$.

(Note: we're making the usual assumption of a *uniform prior* for $p$, which just means
"before we saw any data, we thought any value of $p$ was equally likely.")

## Maximum Likelihood Estimate

The maximum likelihood estimate (MLE) of a parameter is
the value whose likelihood is highest. If you're looking at the
probability density function (PDF) of that parameter
(which, under our uniform prior, is proportional to the likelihood),
the MLE is simply the **highest point on the curve** (i.e. the *mode*).

For a binomial, the MLE of $p$ is indeed $\frac{k}{n}$ (where you saw $k$ "successes" out of $n$ attempts), which is $\frac{1}{5}$ in this case.
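For the curious, here's a quick sketch of where that comes from: write down the binomial likelihood, take the log, and set the derivative to zero.

$$L(p) = \binom{n}{k} p^k (1-p)^{n-k}$$

$$\frac{d}{dp}\log L(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{k}{n}$$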

## Expected Value

However, the MLE is different from the expected value of $p$.
If you're looking at the PDF of a parameter, the expected value is the *mean* of that curve.
For a binomial with a uniform prior, the expected value turns out to be
$\frac{k+1}{n+2}$. (This is also known as
Laplace's Rule of Succession.)

For the data above ($k=1$, $n=5$) you get $\frac{2}{7}$. Yes, this seems strange, but this will be a better estimate of $p$ than $\frac{1}{5}$.

## Nitty Gritty Math Details

The posterior distribution for a binomial parameter $p$ takes the shape of a beta distribution, $\mathrm{Beta}(x; \alpha, \beta)$. When the data is $k$ successes out of $n$ attempts, the parameters are $\alpha=k+1$ and $\beta=(n-k)+1$.
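In case it's not obvious why the posterior is a beta distribution, here's the one-line Bayes' rule calculation: with a uniform prior, the posterior is proportional to the likelihood,

$$P(p \mid \text{data}) \propto p^k (1-p)^{n-k} \cdot 1,$$

which is exactly the kernel of $\mathrm{Beta}(x;\, k+1,\, n-k+1)$.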

The distribution $\mathrm{Beta}(x; \alpha, \beta)$ has the following properties: $$\text{mean} = \frac{\alpha}{\alpha + \beta} = \frac{k+1}{n+2}$$ and $$\text{mode} = \frac{\alpha - 1}{\alpha + \beta - 2} = \frac{k}{n}$$

Lo and behold, when you plug in $k=1$ and $n=5$ you get $\frac{2}{7}$ for the mean, and $\frac{1}{5}$ for the mode.
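If you'd rather not trust my algebra, here's a minimal sketch that checks both numbers numerically (assuming you have SciPy installed; the mode is computed by hand, since SciPy doesn't expose a mode method for distributions):

```
from scipy.stats import beta

k, n = 1, 5
a, b = k + 1, (n - k) + 1     # posterior parameters: alpha=2, beta=5

posterior = beta(a, b)
print(posterior.mean())       # 0.2857... == 2/7
print((a - 1) / (a + b - 2))  # mode by the formula above: 0.2 == 1/5
```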

## Head-to-head Matchup

But don't take my word for it. You can simulate it yourself with the Python code below:

```
import random

def flip_coin(p_heads):
    # Flip a biased coin that lands heads with probability p_heads.
    if random.random() < p_heads:
        return 'H'
    else:
        return 'T'

def compute_errors():
    # Pick a true underlying 'p' uniformly, and generate data.
    p_heads = random.random()
    NUM_COINS = 5
    coins = [flip_coin(p_heads) for _ in range(NUM_COINS)]
    # Come up with estimates of 'p'.
    mle_p = coins.count('H') / NUM_COINS
    laplace_p = (coins.count('H') + 1) / (NUM_COINS + 2)
    # Return their errors.
    return (abs(mle_p - p_heads), abs(laplace_p - p_heads))

win_count = {'mle': 0, 'laplace': 0}
mle_errors = []
laplace_errors = []
for _ in range(1000000):
    (mle_error, laplace_error) = compute_errors()
    if mle_error <= laplace_error:
        win_count['mle'] += 1
    else:
        win_count['laplace'] += 1
    mle_errors.append(mle_error)
    laplace_errors.append(laplace_error)

print("win_counts:", win_count)
print("average mle_error:", sum(mle_errors) / len(mle_errors))
print("average laplace_error:", sum(laplace_errors) / len(laplace_errors))
```

When I run this, I get

```
win_counts: {'laplace': 569967, 'mle': 430033}
average mle_error: 0.142
average laplace_error: 0.123
```

This means the formula $\frac{k+1}{n+2}$ was closer 57% of the time, and had a smaller average error than $\frac{k}{n}$.

## WTF?!

If you're having a hard time coming to grips with this reality, let me try to explain some intuition behind why $\frac{k}{n}$ is a bad estimator.

The basic problem is that $\frac{k}{n}$ tends to "believe" extreme data too easily. If we see 0 wins out of 5 attempts, $\frac{k}{n}$ will conclude that the exact value of $p$ must be 0. Of course, this is extremely unlikely. It's more likely that the true value of $p$ is $> 0$, and that 0-out-of-5 was just an unlucky streak. (Similarly for when the data is 5-out-of-5 wins.)
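A quick way to see this is to tabulate both estimators for every possible outcome with $n = 5$; note that the Laplace estimate never commits to the extremes of 0 or 1:

```
n = 5
for k in range(n + 1):
    mle = k / n
    laplace = (k + 1) / (n + 2)
    print("k=%d: mle=%.3f, laplace=%.3f" % (k, mle, laplace))
```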
