Bayes' Theorem and Maximum Likelihood

So far we have been thinking of the probability for getting a result $n$ if we know that the mean value should be $a$. Now suppose we make a measurement and get $n$ counts, but we don't know anything about $a$, except that it must be nonnegative, of course. We may turn the question around and ask what is the most likely value for $a$, given the result of our measurement. To make this turned-around idea more concrete, we use the concept of conditional probability. We say that the Poisson distribution $P_a(n)$ tells us the probability that we get $n$, on the condition that the mean value is $a$. The notation $P(A|B)$ denotes the probability for getting $A$, given that $B$ occurs or $B$ is true. Thus we could write
\begin{equation}
P(n|a) = P_a(n) = \frac{a^n e^{-a}}{n!} . \tag{19}
\end{equation}
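As a quick numerical illustration of Eq.~(19), here is a minimal Python sketch; the mean $a = 4$ is an arbitrary choice made for this example. It evaluates $P(n|a)$ for a few counts and checks that the probabilities sum to 1.
\begin{verbatim}
# Minimal sketch: evaluate the Poisson probability P(n|a) of Eq. (19).
# The mean a = 4.0 is an arbitrary illustrative choice.
import math

def poisson(n, a):
    # P(n|a) = a^n e^{-a} / n!
    return a**n * math.exp(-a) / math.factorial(n)

a = 4.0
for n in range(10):
    print(n, poisson(n, a))

# Summed over all n the probabilities give 1 (to machine precision):
print(sum(poisson(n, a) for n in range(100)))
\end{verbatim}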

Now the reverse question is, ``What is the probability that the mean value is $a$, given that we just made a measurement and got $n$?'' This probability would be denoted $P(a|n)$. Now a trivial but important theorem due to Bayes states that
\begin{equation}
P(a|n)\,P(n) = P(n|a)\,P(a) \tag{20}
\end{equation}

where $P(a)$ is the \emph{a priori} probability for $a$ to occur, regardless of whether the event $n$ occurs, and $P(n)$ is the \emph{a priori} probability for $n$ to occur, regardless of whether the event $a$ occurs. From this theorem we conclude that
\begin{equation}
P(a|n) = \frac{P(n|a)\,P(a)}{P(n)} . \tag{21}
\end{equation}
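For a continuous parameter such as $a$, the denominator $P(n)$ is fixed by the law of total probability,
\[
P(n) = \int_0^\infty P(n|a)\,P(a)\,da ,
\]
so it serves simply as a normalization once $P(a)$ is chosen.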

So we need to know $P(a)$ and $P(n)$ to make progress. The first is the \emph{a priori} probability for getting a particular value for $a$. If we don't know anything about $a$, except that it is nonnegative, then we must say that any nonnegative value whatsoever is equally probable. Thus, without benefit of knowing the outcome of the measurement, we say $P(a)$ is constant, independent of $a$ for nonnegative $a$, and zero for negative $a$. So the rhs of this equation reduces simply to
\begin{equation}
P(a|n) = \frac{1}{N}\,\frac{a^n e^{-a}}{n!} \tag{22}
\end{equation}

where the normalization factor $N$ can be determined by requiring that the total probability for having any $a$ be 1. In fact it turns out that $N = 1$, since $\int_0^\infty a^n e^{-a}\,da = n!$, so
\begin{equation}
P(a|n) = \frac{a^n e^{-a}}{n!} . \tag{23}
\end{equation}

This distribution is called the likelihood function for the parameter $a$. Notice that we are now thinking of the rhs as a continuous function of $a$ with fixed $n$. This result is very remarkable, since a single measurement gives us the \emph{whole} probability distribution! Recall that if we were to measure the length of a table top, even if we started by assuming we were going to get a Gaussian distribution, a single measurement would allow us only to guess $\mu$ and would tell us nothing about $\sigma$. To get $\sigma$ takes at least two measurements, and even then we would be putting ourselves at the mercy of the gods of statistics by taking a chance with only two measurements. If we weren't so rash as to assume a Gaussian, we would have to make many measurements of the length of the table top to get the probability distribution in the measured length.
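To see this change of viewpoint concretely, the following sketch (with $n = 5$ chosen arbitrarily) treats the rhs of Eq.~(23) as a continuous function of $a$ and confirms numerically that it integrates to 1, i.e.\ that $N = 1$.
\begin{verbatim}
# Sketch: treat P(a|n) = a^n e^{-a}/n! of Eq. (23) as a continuous
# function of a for fixed n, and verify that it is normalized (N = 1).
# The count n = 5 is an arbitrary illustrative choice.
import math
from scipy.integrate import quad

n = 5

def likelihood(a):
    return a**n * math.exp(-a) / math.factorial(n)

total, abserr = quad(likelihood, 0.0, math.inf)
print(total)   # ~1.0, so the normalization factor N is indeed 1
\end{verbatim}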

We now ask, what is the most probable value of $a$, given that we just found $n$? This is the value with maximum likelihood. If we examine the probability distribution, we see that it peaks at $a = n$, just as we might have expected. We may then ask, what is the error in the determination of this value? This is a tricky question, because the Poisson distribution is not shaped like a Gaussian distribution. However, for large $n$ it looks more and more like a Gaussian. Expanding $\ln P(a|n) = n\ln a - a - \ln n!$ about its peak at $a = n$, and keeping terms through second order, gives for large $n$
\begin{equation}
P(a|n) \approx \frac{1}{\sqrt{2\pi n}}\, e^{-(a-n)^2/2n} \tag{24}
\end{equation}

so for large $n$ the error is
\begin{equation}
\sigma = \sqrt{n} . \tag{25}
\end{equation}

To summarize, a single measurement yields the entire probability distribution. For large enough $n$ we can say that
\begin{equation}
a = n \pm \sqrt{n} . \tag{26}
\end{equation}
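The quality of the Gaussian approximation in Eqs.~(24)--(26) is easy to inspect numerically. The sketch below (with $n = 50$ chosen arbitrarily) compares the exact likelihood of Eq.~(23) with the Gaussian of Eq.~(24) at the peak and one standard deviation away; the agreement is excellent at the peak and within a few percent at $a = n \pm \sqrt{n}$.
\begin{verbatim}
# Sketch: compare the exact likelihood a^n e^{-a}/n! with the Gaussian
# approximation (1/sqrt(2 pi n)) exp(-(a-n)^2/(2n)) of Eq. (24).
# The count n = 50 is an arbitrary illustrative choice.
import math

n = 50

def exact(a):
    # computed with logs (lgamma) to avoid overflowing n!
    return math.exp(n * math.log(a) - a - math.lgamma(n + 1))

def gauss(a):
    return math.exp(-(a - n)**2 / (2 * n)) / math.sqrt(2 * math.pi * n)

for a in (n - math.sqrt(n), n, n + math.sqrt(n)):
    print(a, exact(a), gauss(a))   # exact vs. Gaussian values
\end{verbatim}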

To see how Bayesian statistics works, suppose we repeated the experiment and got a new value $n_2$. What is the probability distribution for $a$ in light of the new result? Now things have changed, since the \emph{a priori} probability for $a$ is no longer constant, because we already made one measurement and got $n_1$. Instead we have
\begin{equation}
P(a) = \frac{a^{n_1} e^{-a}}{n_1!} \tag{27}
\end{equation}

so
\begin{equation}
P(a|n_2) = \frac{1}{N}\,\frac{a^{n_2} e^{-a}}{n_2!}\,\frac{a^{n_1} e^{-a}}{n_1!} \tag{28}
\end{equation}

Notice that the likelihood function is now the product of the individual likelihood functions. A more systematic notation would write this function as $P(a|n_1,n_2)$, i.e.\ the probability for $a$ having a particular value, given that we made two measurements and found $n_1$ and $n_2$. The normalization factor $N$ is obtained by requiring the total probability to be 1. The most likely value of $a$ is easily shown to be just the average
\begin{equation}
a_{\max} = \frac{n_1 + n_2}{2} \tag{29}
\end{equation}

as we should have expected.
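The ``easily shown'' step takes one line: dropping factors independent of $a$, Eq.~(28) is proportional to $a^{n_1+n_2} e^{-2a}$, so setting the derivative of its logarithm to zero gives
\[
\frac{d}{da}\Bigl[(n_1+n_2)\ln a - 2a\Bigr] = \frac{n_1+n_2}{a} - 2 = 0
\qquad\Longrightarrow\qquad
a_{\max} = \frac{n_1+n_2}{2} .
\]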

The Bayesian approach insists that we fold together all of our knowledge about a parameter in constructing its likelihood function. Thus a generalization of these results would state that the likelihood function for the parameter set $a$, given the independently measured results $x_1$, $x_2$, $x_3$, etc., is just
\begin{equation}
P(a|x_1,x_2,x_3,\ldots) = \frac{1}{N}\,P(x_1|a)\,P(x_2|a)\,P(x_3|a)\cdots \tag{30}
\end{equation}

where $N$ is a normalization factor. Again, this is just the product of the separate likelihood functions. The result is completely general and applies to any probability distribution, not just a Poisson distribution. We will use this result in discussing $\chi^2$ fits to data as a maximum likelihood search.
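As a preview of that discussion, here is a minimal sketch that maximizes the product in Eq.~(30) for a single Poisson mean by minimizing the negative log likelihood. The counts are made up for illustration; the optimum lands on the sample average, consistent with Eq.~(29).
\begin{verbatim}
# Sketch: maximum likelihood estimate of a single Poisson mean from
# several measurements, by minimizing -ln of the product in Eq. (30).
# The counts below are made-up illustrative data.
import math
from scipy.optimize import minimize_scalar

counts = [4, 7, 5, 6, 3]

def neg_log_likelihood(a):
    # -ln prod_i P(n_i|a); the a-independent ln(n_i!) terms are dropped
    return -sum(n * math.log(a) - a for n in counts)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0),
                      method='bounded')
print(res.x)                      # ~5.0
print(sum(counts) / len(counts))  # the plain average, as in Eq. (29)
\end{verbatim}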