
Bayes' Theorem is a formula for computing conditional probabilities. It's quite intuitive once you understand what it's saying. Here is the formula.

p(A|B) = \frac{p(A)\ p(B|A)}{p(B)}

It can be read like this. (The probability of A given B) equals (the probability of A) times (the probability of B given A) divided by (the probability of B).

[Figure: Venn diagram showing two overlapping sets, A and B. Caption: Bayes Theorem.]

\begin{align}p(B)\ p(A|B) & = p(A\And B) = p(A)\ p(B|A) \\
 \\
p(B)\ p(A|B) & = p(A)\ p(B|A) \\
 \\
p(A|B) & = \frac{p(A)\ p(B|A)}{p(B)}
\end{align}

What does Bayes' Theorem do for you? Assume you have an element that you know is in B. Bayes' Theorem tells you how to compute the probability that the element is in the shaded overlapping area (A ∩ B). It goes like this.

Assume you know the following.

  1. You know how big A ∩ B is with respect to A. In other words, you know p(B|A).
  2. You also know p(A) and p(B) separately.

What you don't know, but want to know, is how big A ∩ B is with respect to B. In other words, you want to know p(A|B). Bayes' Theorem gives us a formula for computing p(A|B) in terms of p(B|A), p(A), and p(B).
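As a minimal sketch in Python (the function and argument names are illustrative, not from any particular library):

 def bayes(p_a, p_b_given_a, p_b):
     """Return p(A|B) computed from p(A), p(B|A), and p(B)."""
     return p_a * p_b_given_a / p_b

 # Example: the mammogram numbers from the cancer screening example below.
 print(bayes(0.01, 0.8, 0.107))   # about 0.0748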

See also Odds: Bayes made useful.


A simple derivation

The probability of both A and B—written p(A\And B)—is the probability of A times the probability of B given A.

p(A\And B) = p(A)\, p(B|A)

This is sometimes known as the chain rule.

The same is true the other way around.

p(B\And A) = p(B)\, p(A|B)

But

p(A\And B) = p(B\And A)

So the two right sides are also equal.

p(B)\, p(A|B) = p(A)\, p(B|A)

Then solve for p(A|B).

p(A|B) = \frac{p(A)\, p(B|A)}{p(B)}

A count-based derivation

Another way to think about it is that the probability of A given B is the probability of A in the population consisting of B. Let |A|, |B|, |A∩B|, and |U| refer to the size of A, B, A∩B, and the Universe U respectively. Then

p(A) = \frac{|A|}{|U|}
p(B) = \frac{|B|}{|U|}
p(B|A) = \frac{|A \cap B|}{|A|}
and
\begin{align}p(A|B) &= \frac{|A \cap B|}{|B|} \\ 
                           &= \frac{|A \cap B|\times \frac{|A|}{|A|}}{|B|} \\
                           &= \frac{\frac{|A \cap B|}{|A|}\times |A|}{|B|} \\
                           &= \frac{p(B|A) \times |A|}{|B|}  \\
                           &= \frac{\frac{p(B|A) \times |A|}{|U|}}{\frac{|B|}{|U|}}  \\
                           &= \frac{p(B|A) \times \frac{|A|}{|U|}}{\frac{|B|}{|U|}} \\
                           &= \frac{p(B|A)\ p(A)}{p(B)}
\end{align}
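The count-based argument can also be checked numerically. This is a small Python sketch with made-up set sizes; none of these numbers come from the text:

 # Hypothetical set sizes, chosen only for illustration.
 size_U = 1000        # |U|, the universe
 size_A = 200         # |A|
 size_B = 400         # |B|
 size_A_and_B = 80    # |A ∩ B|

 p_A = size_A / size_U
 p_B = size_B / size_U
 p_B_given_A = size_A_and_B / size_A

 # The count-based definition of p(A|B) ...
 direct = size_A_and_B / size_B
 # ... agrees with Bayes' Theorem.
 via_bayes = p_A * p_B_given_A / p_B
 print(direct, via_bayes)   # both are 0.2 (up to floating-point rounding)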


Cancer screening test example

Assume the following.
  • 1% of women have breast cancer (and therefore 99% do not).
  • 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it).
  • 10% of mammograms detect breast cancer when it’s not there, a false positive (and therefore 90% correctly return a negative result).
What is the probability that a positive test indicates an actual cancer?

Compute using population counts

Computing the answer directly with hypothetical population counts looks like this.

Assume there are 1000 people.
  • 10 of them (1%) will have cancer.
  • Of those 10, 8 (80%) will get a positive test result.
  • Of the other 990, 99 (10%) will get a (false) positive test result.
                  Cancer (1%)                  No cancer (99%)                Total
 Positive test    (True positive) 80% = 8      (False positive) 10% = 99        107
 Negative test    (False negative) 20% = 2     (True negative) 90% = 891        893
 Total            10                           990                             1000

\begin{align}
 p(cancer\ |\ positive\ test) & = \frac{\#\ people\ with\ cancer\ and\ a\ positive\ test\ result} {\#\ people\ with\ a\ positive\ test\ result} \\
 \\
                           & = \frac{8} {8 + 99} \\
 \\
                           & = \frac{8} {107} \\
 \\
                           & = 0.0748 \\
 \\
                           & = 7.48\%
\end{align}

In this example, what's tricky is realizing that the number of people with a positive test result includes both the 8 people with cancer who test positive and the 99 people without cancer who test positive. So of the 107 people who test positive, only 8 (or 7.48%) actually have cancer.

What's the probability that a negative test means there is no cancer?


\begin{align}
 p(no\ cancer\ |\ negative\ test) & = \frac{\#\ people\ without\ cancer\ and\ a\ negative\ test\ result} {\#\ people\ with\ a\ negative\ test\ result} \\
 \\
                           & = \frac{891} {2 + 891} \\
 \\
                           & = \frac{891} {893} \\
 \\
                           & = 0.9978 \\
 \\
                           & = 99.78\%
\end{align}
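Both count-based calculations can be reproduced with a short Python sketch built on the hypothetical 1000-person population above:

 # Hypothetical population of 1000, as in the table above.
 total = 1000
 cancer = 10                      # 1% of 1000 have cancer
 no_cancer = total - cancer       # 990 do not

 true_pos = 0.80 * cancer         # 8 with cancer who test positive
 false_neg = 0.20 * cancer        # 2 with cancer who test negative
 false_pos = 0.10 * no_cancer     # 99 without cancer who test positive
 true_neg = 0.90 * no_cancer      # 891 without cancer who test negative

 p_cancer_given_pos = true_pos / (true_pos + false_pos)
 p_no_cancer_given_neg = true_neg / (true_neg + false_neg)
 print(round(p_cancer_given_pos, 4))      # 0.0748
 print(round(p_no_cancer_given_neg, 4))   # 0.9978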

Compute using Bayes' theorem

To see this in terms of the diagram at the top of the page, let B be the women with cancer and A be the women with a positive test result.

  • B is the 1% of the population that has cancer. B̄ (everything that is outside B) is the rest of the population.
  • A is the set of women who test positive on the test. A ∩ B is the set of women for which the test is correct.
  • B - (A ∩ B) is the set of women who have cancer but for whom the test was negative.
  • A - (A ∩ B) (everything in A but not in B) is the set of women for whom the test was positive even though they do not have cancer.

What we want to know is what proportion of A is made up of A ∩ B, or, given an arbitrary member of A, what's the probability that the person is in A ∩ B. In other words, p(B | A).

Putting this in terms of Bayes' Theorem:

 p(cancer\ |\ positive\ test) = p(B | A) = \frac{p(B \cap A)}{p(A)} = \frac{p(B)\, p(A|B)}{p(A)} = \frac{p(cancer)\, p(positive\ test\ |\ cancer)}{p(positive\ test)}



\begin{align}
p(cancer | positive\ test) & = \frac{p(cancer)\, p(positive\ test | cancer)}{p(positive\ test)}  \\ 
\\
                          & = \frac{0.01  \times 0.8 }{ 0.01 \times 0.8 + 0.99 \times 0.10 } \\
\\
                          & = \frac{0.008} {0.008 + 0.099} \\
\\
                          & = \frac{0.008} {0.107} \\
\\
                          & = 0.0748 \\
\\
                          & = 7.48\% \\
\end{align}
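The same computation, done directly with probabilities rather than counts, as a Python sketch:

 p_cancer = 0.01
 p_pos_given_cancer = 0.80
 p_pos_given_no_cancer = 0.10

 # Total probability of a positive test (cancer and non-cancer cases).
 p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_no_cancer

 # Bayes' Theorem.
 p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos
 print(round(p_cancer_given_pos, 4))   # 0.0748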

Wikipedia example

Suppose there is a school with 60% boys and 40% girls. The female students wear trousers or skirts in equal numbers; the boys all wear trousers. An observer seeing a (random) student from a distance notices that this student is wearing trousers. What is the probability that this student is a girl?

In this case, B is the set of girls and A is the set of people who wear trousers. So given that we see someone wearing trousers (someone in A), what is the probability that the person is in B, i.e., in B ∩ A?


\begin{align}
 p(girl | trousers) & = \frac{ p(girl \cap trousers)} {p(trousers)} \\ 
 \\
                    & = \frac{ p(girl)\ p(trousers | girl)} {p(trousers)} \\ 
 \\
                     & =    \frac{ 0.4 \times  0.5} {0.6 \times 1 + 0.4 \times 0.5} \\
 \\
                     & =    \frac{ 0.2} {0.6 + 0.2} \\
 \\
                     & =    \frac{ 0.2} {0.8} \\
 \\
                    & =    0.25 \\
\end{align}

Or more concretely (assuming 100 students):


\begin{align}
 p(girl | trousers) & = \frac{ \#\ girls\ with\ trousers} {\#\ students\ with\ trousers} \\ 
 \\
                    & = \frac{20} {60 + 20} \\ 
 \\
                    & = 0.25 \\
\end{align}

In other words, select at random a student with trousers. What's the probability it will be a girl? It's the number of girls with trousers divided by the number of students with trousers.

            Girls   Boys   Total
 Trousers     20     60      80
 Skirts       20      0      20
 Total        40     60     100

This is similar to asking: select a student at random. What's the probability it will be a girl? That's

 \frac{\#\ girls} {\#\ students}

Then modify the question so that it is restricted to only students who wear trousers. That's


\frac{ \#\ girls\ with\ trousers} {\#\ students\ with\ trousers} = \frac{ \#\ girls\ \times prob\ that\ a\ girl\ will\ be\ wearing\ trousers} {\#\ students\ with\ trousers}


What may be confusing is that

\#\ girls\ with\ trousers

is expressed formally as

(\#\ girls)  \times  (prob\ that\ a\ girl\ will\ be\ wearing\ trousers).

What may also be tricky is realizing that the number of students with trousers includes both the 20 girls with trousers and the 60 boys with trousers. So of the 80 people with trousers, 20 (or 25%) are girls.
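The trousers arithmetic as a small Python sketch, using the hypothetical 100-student school above:

 # Hypothetical school of 100 students: 60 boys and 40 girls.
 boys, girls = 60, 40

 girls_with_trousers = 0.5 * girls   # girls wear trousers or skirts equally
 boys_with_trousers = 1.0 * boys     # all boys wear trousers

 students_with_trousers = girls_with_trousers + boys_with_trousers
 p_girl_given_trousers = girls_with_trousers / students_with_trousers
 print(p_girl_given_trousers)   # 0.25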

Counterfeit coin example

In his review of McGrayne's The theory that would not die, John Allen Paulos gives this example.

Assume that you’re presented with three coins, two of them fair and the other a counterfeit that always lands heads. If you randomly pick one of the three coins, the probability that it’s the counterfeit is 1 in 3. This is the prior probability of the hypothesis that the coin is counterfeit. Now after picking the coin, you flip it three times and observe that it lands heads each time. Seeing this new evidence that your chosen coin has landed heads three times in a row, you want to know the revised posterior probability that it is the counterfeit. The answer to this question, found using Bayes’s theorem (calculation mercifully omitted), is 4 in 5. You thus revise your probability estimate of the coin’s being counterfeit upward from 1 in 3 to 4 in 5.

How did he get 4 out of 5?

The first (and I think hardest) issue is: what is the universe of events? For this problem the events in the universe of events are the results of flipping a coin three times. That's the information we will use to determine whether a coin is counterfeit.

The easiest way to think about it is to suppose we have a population in which 1/3 of the coins is counterfeit. We select a coin at random from that population and we flip it three times. The results will be as follows.

  • If the coin is counterfeit, which it will be on average 1/3 of the time, it will come up heads on each of the three flips.
  • If the coin is not counterfeit, which it will be on average 2/3 of the time, it will come up all heads, on average, once out of eight times.

Putting this together, the probability of getting three heads will be   \frac{1}{3} \times 1 + \frac{2}{3} \times \frac{1}{8} = \frac{1}{3} + \frac{2}{24} = \frac{4}{12} + \frac{1}{12} = \frac{5}{12}.   In other words, if we select coins from a population in which 1/3 of them are counterfeit and flip the coin 3 times, the probability that the three flips will all be heads is 5/12.

It seems to me that's the hardest part of working out this example. The rest follows directly from Bayes Theorem.

We know

p(\mbox{counterfeit}\ |\ \mbox{all heads}) = \frac{p(\mbox{all heads}\ |\ \mbox{counterfeit})  p(\mbox{counterfeit})}{p(\mbox{all heads})}

But

\begin{align}
 p(\mbox{all heads}\ |\ \mbox{counterfeit}) &= 1\ \ \ \ \ \ \ \ \ \ \ \ \mbox{The counterfeit coin always comes up all heads.} \\
 p(\mbox{counterfeit}) &= \frac{1}{3}\ \ \ \ \ \ \ \ \ \ \ \ \mbox{One third of the coins are counterfeit.}\\
 p(\mbox{all heads}) &= \frac{5}{12}\ \ \ \ \ \ \ \ \ \ \mbox{When a coin selected at random from a population in which 1/3 are counterfeit}\\
                     &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \mbox{is flipped three times, the probability that all three flips will be all heads is 5/12.}
 \end{align}

So

\begin{align}
 p(\mbox{counterfeit}\ |\ \mbox{all heads}) &= \frac{1 \times  \frac{1}{3}}{\frac{5}{12}} \\
 &= \frac{\frac{1}{3}}{\frac{5}{12}} \\
 &= \frac{4}{5} \\
 \end{align}

In other words, having flipped the unknown coin three times and seen heads each time, the probability that the unknown coin is counterfeit has increased from 1/3 to 4/5.
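A quick check of this in Python, using exact fractions so that the 5/12 and 4/5 come out exactly:

 from fractions import Fraction

 p_counterfeit = Fraction(1, 3)
 p_heads3_given_counterfeit = Fraction(1)      # the counterfeit always lands heads
 p_heads3_given_fair = Fraction(1, 8)          # (1/2) ** 3 for a fair coin

 # Total probability of three heads in a row.
 p_heads3 = (p_counterfeit * p_heads3_given_counterfeit
             + (1 - p_counterfeit) * p_heads3_given_fair)

 # Bayes' Theorem.
 posterior = p_counterfeit * p_heads3_given_counterfeit / p_heads3
 print(p_heads3, posterior)   # 5/12 4/5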

This is a nice illustration of why Bayes' Theorem uses the terms a priori and a posteriori. The a priori probability that a coin selected at random from a population, 1/3 of which is counterfeit, is itself counterfeit is 1/3. But with the additional information provided by the three flips, the a posteriori probability is now 4/5.

As far as the diagram goes, B is the set of counterfeit coins; B̄ is the set of fair coins; A is the event of getting three heads when we flip a coin three times; and Ā covers all the other possible results when we flip a coin three times.

The diagram is misleading for this example in that B is completely contained in A. Every time we flip the counterfeit coin three times it will come up three heads. So A ∩ B = B. No part of B is outside A.

We are still interested in the following. Given that a coin comes up three heads, i.e., that it is in A, what is the probability that it is in B, i.e., p(B|A)? In this case, that's \frac{p(B)}{p(A)}=\frac{\frac{1}{3}}{\frac{5}{12}}=\frac{4}{5}.

Cause and effect

A common use for Bayes' Theorem is for diagnostic purposes: given some symptoms, i.e., effects, determine the probability that those effects were produced by certain causes. (This is adapted from Russell and Norvig, Artificial Intelligence, a modern approach.)

p(\mathit{cause}|\mathit{effect}) = \frac{p(\mathit{effect}|\mathit{cause}) p(\mathit{cause})}{p(\mathit{effect})}

Suppose that one knows that a cause will produce an effect, at least with a certain probability: p(\mathit{effect}|\mathit{cause}). One wants to determine the probability that an observed effect was produced by a particular cause: p(\mathit{cause}|\mathit{effect}). In other words, one wants to reason backward and diagnose the cause of some observed effect. Because p(\mathit{effect}|\mathit{cause}) reflects how the cause operates, it is likely that one will know p(\mathit{effect}|\mathit{cause}),\  p(\mathit{cause}), \mbox{ and }p(\mathit{effect}), even though one may not know the desired diagnostic probability, p(\mathit{cause}|\mathit{effect}).

Russell and Norvig discuss the relationship between meningitis and stiff neck. It is known that meningitis often produces a stiff neck. So p(\mathit{stiff neck}|\mathit{meningitis}) is known. Also known are p(\mathit{stiff neck}) and p(\mathit{meningitis}) in general. What isn't necessarily known is p(\mathit{meningitis}|\mathit{stiff neck}). Certainly one can gather statistics for p(\mathit{meningitis}|\mathit{stiff neck}). But if there is an outbreak of meningitis, those statistics may not apply. But p(\mathit{stiff neck}|\mathit{meningitis}) will be unchanged because it reflects a mechanism in how meningitis works. So if one sees a case of stiff neck, one wants to know p(\mathit{meningitis}|\mathit{stiff neck}) under the new circumstances. It is possible to determine that using Bayes' theorem.

p(\mathit{meningitis} | \mathit{stiff neck}) = \frac{p(\mathit{stiff neck}|\mathit{meningitis})p(\mathit{meningitis})}{\mathit{p(stiff neck})}
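Here is a minimal sketch of that diagnostic computation, with made-up numbers chosen only for illustration:

 # Hypothetical numbers, for illustration only.
 p_stiff_given_meningitis = 0.7
 p_meningitis = 1 / 50000
 p_stiff_neck = 0.01

 p_meningitis_given_stiff = (p_stiff_given_meningitis * p_meningitis
                             / p_stiff_neck)
 print(p_meningitis_given_stiff)   # about 0.0014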

Multiple causes and effects

The more general case is that one knows that a number of effects are produced by a cause. Given some of them, one wants to know the probability that the cause produced them.

p(\mathit{cause}|\mathit{effect_1}\ldots\mathit{effect_n}) = \frac{p(\mathit{effect_1}\ldots\mathit{effect_n}|\mathit{cause}) p(\mathit{cause})}{p(\mathit{effect_1}\ldots\mathit{effect_n})}

If the effects are independent, we can write this.

p(\mathit{cause}|\mathit{effect_1}\ldots\mathit{effect_n}) = \frac{p(\mathit{effect_1}\ldots\mathit{effect_n}|\mathit{cause}) p(\mathit{cause})}{p(\mathit{effect_1})\ldots p(\mathit{effect_n})}

Furthermore, if the cause produces the effects independently, we can write this.

p(\mathit{cause}|\mathit{effect_1}\ldots\mathit{effect_n}) = \frac{p(\mathit{effect_1}|\mathit{cause})\ldots p(\mathit{effect_n}|\mathit{cause}) p(\mathit{cause})}{p(\mathit{effect_1})\ldots p(\mathit{effect_n})}

This is the case if the production of effect_i by the cause is independent of the production of effect_j by the cause, for all pairs i and j.

This is not the case for our XOR example. In that case we have two "causes," X1 and X2, and one "effect," X3. It is not the case that the causes independently produce the effect—and certainly not that X3 considered as a cause produced X1 and X2 as independent effects. To produce X3 as an effect one needs certain combinations of the causes. Independence in the sense of one-cause-to-the-effect doesn't hold in either direction.

The tennis example is similar. One has multiple causes, the various weather conditions, and one effect, whether or not to play tennis. It is not the case that any one of the causes produces the effect independently. Nor is it the case that whether or not one plays tennis can be a cause of what the weather conditions are.

The following is an example of a single cause and multiple independent effects.

  • The temperature is the cause.
  • One of the effects is that the number of people at the beach is high.
  • Another effect is that the sale of ice cream is high.

Each of these is an effect of the cause—hot temperature. But these two effects are independent of each other. They each depend on the cause, but they depend on the cause independently. In addition, they might depend on other things besides the temperature. For example, the sale of ice cream may increase during holidays whether or not the temperature is hot, and the number of people at the beach may depend on whether it is raining as well as the temperature.

Assume that the beach is crowded and that ice cream sales are high. One wants to know the probability that these two perceived effects were caused by high temperatures.

\begin{align}p(\mathit{hot}&|\mathit{beachCrowded} \And \mathit{iceCreamSalesHigh}) \\ 
              &= \frac{p(\mathit{beachCrowded} \And \mathit{iceCreamSalesHigh}|\mathit{hot})\ p(\mathit{hot})}{p(\mathit{beachCrowded} \And \mathit{iceCreamSalesHigh})}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \mbox{      (by Bayes Theorem)} \\
              &= \frac{p(\mathit{beachCrowded}|\mathit{hot})\ p(\mathit{iceCreamSalesHigh}|\mathit{hot})\ p(\mathit{hot})}{p(\mathit{beachCrowded})\ p(\mathit{iceCreamSalesHigh})}\ \ \ \ \ \ \ \ \ \   \mbox{    (by independence)}\\
\end{align}
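Here is a sketch of that last line in Python, with made-up probabilities; none of these numbers come from the example, they only show how the pieces combine:

 # Hypothetical probabilities, for illustration only.
 p_hot = 0.3
 p_beach_given_hot = 0.8
 p_icecream_given_hot = 0.7
 p_beach = 0.4
 p_icecream = 0.5

 # Bayes' Theorem combined with the independence assumptions above.
 p_hot_given_both = (p_beach_given_hot * p_icecream_given_hot * p_hot
                     / (p_beach * p_icecream))
 print(round(p_hot_given_both, 2))   # 0.84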

See also

Khan Academy

Sharon McGrayne

Bayesia white papers

Other

Here are some shorter videos and some longer videos.

Here's a 3-part (4+ hour) tutorial on Bayesian Inference.

Papers

“Idiot’s Bayes – not so stupid after all?” – the whole paper is about why it doesn’t suck, which is related to redundancies in language.
Naive Bayes at Forty: The Independence Assumption in Information Retrieval
Spam Filtering with Naive Bayes – Which Naive Bayes?