Dustin Boswell - Brain Dumps about Computers, Programming, and Everything Else
Maximum Likelihood vs. Expected Value April 5, 2011

Suppose you pulled a slot machine 5 times, and won 1 time. What's your best estimate for the true underlying payout probability $p$?

This seemingly innocent question is actually quite involved. You might think the answer is $\frac{1}{5}$, but it's actually $\frac{2}{7}$.

(Note: we're making the usual assumption of a uniform prior for $p$, which just means "before we saw any data, we thought any value of $p$ was equally likely.")

## Maximum Likelihood Estimate

The maximum likelihood estimate (MLE) of a parameter is the value whose likelihood is highest. If you're looking at the probability density function (PDF) of that parameter, the MLE is simply the highest point on the curve (i.e. the mode).

For a binomial, the MLE of $p$ is indeed $\frac{k}{n}$ (where you saw $k$ "successes" out of $n$ attempts), which is $\frac{1}{5}$ in this case.
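As a quick numeric sanity check (my addition, not from the original post), you can evaluate the binomial likelihood on a fine grid of candidate values for $p$ and see where it peaks:

```python
import math

def binomial_likelihood(p, k, n):
    """Likelihood of seeing k successes in n trials, given success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Scan a grid of candidate p values and pick the one with the highest likelihood.
k, n = 1, 5
grid = [i / 1000 for i in range(1001)]
mle = max(grid, key=lambda p: binomial_likelihood(p, k, n))
print(mle)  # 0.2, i.e. k/n
```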

## Expected Value

However, the MLE is different from the expected value of $p$. If you're looking at the PDF of a parameter, the expected value is the mean of that curve. For a binomial, the expected value turns out to be $\frac{k+1}{n+2}$. (This is also known as Laplace's Rule of Succession.)

For the data above ($k=1$, $n=5$) you get $\frac{2}{7}$. Yes, this seems strange, but this will be a better estimate of $p$ than $\frac{1}{5}$.
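You can confirm this number without any beta-distribution machinery (my addition, not from the original post): under a uniform prior, the posterior for $p$ is proportional to $p^k (1-p)^{n-k}$, so the posterior mean is the ratio of two integrals, which we can approximate numerically.

```python
# Posterior under a uniform prior: posterior(p) ∝ p^k * (1-p)^(n-k).
# The expected value is E[p] = ∫ p·posterior(p) dp / ∫ posterior(p) dp.
k, n = 1, 5
N = 100_000
grid = [(i + 0.5) / N for i in range(N)]  # midpoint rule on [0, 1]
weights = [p**k * (1 - p)**(n - k) for p in grid]
mean = sum(p * w for p, w in zip(grid, weights)) / sum(weights)
print(mean)  # ≈ 2/7 ≈ 0.2857
```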

## Nitty Gritty Math Details

The posterior distribution for a binomial parameter $p$ takes the shape of a beta distribution, which is a function $Beta(x; \alpha, \beta)$. When the data is $k$ successes out of $n$ attempts, the parameters to the function are $\alpha=k+1$ and $\beta=(n-k)+1$.

The function $Beta(x; \alpha, \beta)$ has the following properties: $$\text{mean} = \frac{\alpha}{\alpha + \beta} = \frac{k+1}{n+2}$$ and $$\text{mode} = \frac{\alpha - 1}{\alpha + \beta - 2} = \frac{k}{n}$$

Lo and behold, when you plug in $k=1$ and $n=5$ you get $\frac{2}{7}$ for the mean, and $\frac{1}{5}$ for the mode.

But don't take my word for it. You can simulate it yourself with the Python code below:
```python
import random

def flip_coin(p_heads):
    if random.random() < p_heads:
        return 'H'
    else:
        return 'T'

def compute_errors():
    # pick a true underlying 'p' uniformly, and generate data
    p_heads = random.random()
    NUM_COINS = 5
    coins = [flip_coin(p_heads) for x in range(NUM_COINS)]

    # come up with estimates of 'p'
    mle_p = coins.count('H') / NUM_COINS
    laplace_p = (coins.count('H') + 1) / (NUM_COINS + 2)

    # return their errors
    return (abs(mle_p - p_heads), abs(laplace_p - p_heads))

win_count = {'mle': 0, 'laplace': 0}
mle_errors = []
laplace_errors = []
for x in range(1000000):
    (mle_error, laplace_error) = compute_errors()
    if mle_error <= laplace_error:
        win_count['mle'] += 1
    else:
        win_count['laplace'] += 1

    mle_errors.append(mle_error)
    laplace_errors.append(laplace_error)

print("win_counts:", win_count)
print("average mle_error:", sum(mle_errors) / len(mle_errors))
print("average laplace_error:", sum(laplace_errors) / len(laplace_errors))
```


When I run this, I get

```
win_counts: {'laplace': 569967, 'mle': 430033}
average mle_error: 0.142
average laplace_error: 0.123
```


This means the formula $\frac{k+1}{n+2}$ was closer 57% of the time, and had a smaller average error than $\frac{k}{n}$ did.

## WTF?!

If you're having a hard time coming to grips with this reality, let me try to explain some intuition behind why $\frac{k}{n}$ is a bad estimator.

The basic problem is that $\frac{k}{n}$ tends to "believe" extreme data too easily. If we see 0 wins out of 5 attempts, $\frac{k}{n}$ will conclude that the exact value of $p$ must be 0. Of course, this is extremely unlikely. It's more likely that the true value of $p$ is $> 0$, and that 0-out-of-5 was just an unlucky streak. (Similarly for when the data is 5-out-of-5 wins.)
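To see this concretely, here's a small comparison (my addition, not from the original post) of the two estimators across every possible outcome of 5 pulls. Notice that $\frac{k+1}{n+2}$ never fully commits to 0 or 1, while $\frac{k}{n}$ does at the extremes:

```python
# Compare the two estimators of p for every possible outcome of n = 5 pulls.
n = 5
for k in range(n + 1):
    mle = k / n                    # maximum likelihood estimate
    laplace = (k + 1) / (n + 2)    # Laplace's Rule of Succession
    print(f"{k} wins out of {n}:  k/n = {mle:.3f}   (k+1)/(n+2) = {laplace:.3f}")
```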