Bayes' Theorem is a formula for computing conditional probabilities. It's quite intuitive once you understand what it's saying. Here is the formula:

p(A | B) = p(A) × p(B | A) / p(B)
It can be read like this. (The probability of A given B) equals (the probability of A) times (the probability of B given A) divided by (the probability of B).
What does Bayes' Theorem do for you? Assume you have an element that you know is in B. Bayes' Theorem tells you how to compute the probability that the element is in the shaded overlapping area (A ∩ B). It goes like this.
Assume you know the following.
- You know how big A ∩ B is with respect to A. In other words, you know p(B | A).
- You also know p(A) and p(B) separately.

What you don't know, but want to know, is how big A ∩ B is with respect to B. In other words, you don't know p(A | B) but want to compute it. Bayes' Theorem gives us a formula for computing p(A | B) in terms of p(B | A), p(A), and p(B).
See also Odds: Bayes made useful.
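As a sanity check, the formula is easy to wrap in a small helper (a sketch in Python; the function name and the numbers are made up for illustration):

```python
def bayes(p_a, p_b_given_a, p_b):
    """Bayes' Theorem: p(A | B) = p(A) * p(B | A) / p(B)."""
    return p_a * p_b_given_a / p_b

# Made-up numbers: p(A) = 0.5, p(B | A) = 0.4, p(B) = 0.8.
print(bayes(0.5, 0.4, 0.8))  # 0.25
```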
A simple derivation
The probability of both A and B, written p(A ∩ B), is the probability of B times the probability of A given B:

p(A ∩ B) = p(B) × p(A | B)

This is sometimes known as the chain rule.

The same is true the other way around:

p(A ∩ B) = p(A) × p(B | A)

So the two right sides are also equal:

p(B) × p(A | B) = p(A) × p(B | A)

Then solve for p(A | B):

p(A | B) = p(A) × p(B | A) / p(B)
A count-based derivation
Another way to think about it is that the probability of A given B is the probability of A in the population consisting of B. Let |A|, |B|, |A ∩ B|, and |U| refer to the size of A, B, A ∩ B, and the Universe U respectively. Then

p(A | B) = |A ∩ B| / |B| = (|A ∩ B| / |U|) / (|B| / |U|) = p(A ∩ B) / p(B) = p(A) × p(B | A) / p(B)
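The count-based identity can be checked numerically; the set sizes below are made up for illustration:

```python
# Made-up sizes: a universe of 100 with |A| = 40, |B| = 50, |A ∩ B| = 20.
size_u, size_a, size_b, size_ab = 100, 40, 50, 20

p_a = size_a / size_u
p_b = size_b / size_u
p_b_given_a = size_ab / size_a               # |A ∩ B| / |A|

p_a_given_b_counts = size_ab / size_b        # direct count ratio: 20/50
p_a_given_b_bayes = p_a * p_b_given_a / p_b  # Bayes' Theorem

print(p_a_given_b_counts, p_a_given_b_bayes)  # 0.4 0.4
```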
Cancer screening test example

Suppose 1% of women have a certain cancer. A screening test returns a positive result for 80% of the women who have the cancer, but also for 10% of the women who do not. Given a positive test result, what is the probability that the woman actually has cancer?
Compute using population counts
To compute the answer directly with hypothetical population counts, it looks like this.
- Assume there are 1000 people.
- 10 of them (1%) will have cancer.
- Of those 10, 8 (80%) will get a positive test result.
- Of the other 990, 99 (10%) will get a (false) positive test result.
| | Cancer (1%) | No cancer (99%) | Total |
|---|---|---|---|
| Positive test | (true positive) 80% = 8 | (false positive) 10% = 99 | 107 |
| Negative test | (false negative) 20% = 2 | (true negative) 90% = 891 | 893 |
| Total | 10 | 990 | 1000 |
In this example, what's tricky is realizing that the number of people with a positive test result includes both the 8 people with cancer who test positive and the 99 people without cancer who test positive. So of the 107 people who test positive, only 8 (or 7.48%) actually have cancer.
What's the probability that a negative test means there is no cancer? Of the 893 people who test negative, 891 do not have cancer, so it's 891/893 ≈ 99.8%.
Compute using Bayes' theorem
To see this in terms of the diagram at the top of the page, let B be the women with cancer and A be the women with a positive test result.
- B is the 1% of the population that has cancer. The complement of B (everything that is outside B) is the rest of the population.
- A is the set of women who test positive on the test. A ∩ B is the set of women who have cancer and test positive, i.e., those for whom the test is correctly positive.
- B - (A ∩ B) is the set of women who have cancer but for whom the test was negative.
- A - (A ∩ B) (everything in A but not in B) is the set of women for whom the test was positive even though they do not have cancer.
What we want to know is what proportion of A is made up of A ∩ B, or given an arbitrary member of A, what's the probability that the person is in (A ∩ B). In other words p(B | A).
Putting this in terms of Bayes' Theorem:

p(B | A) = p(B) × p(A | B) / p(A) = (0.01 × 0.8) / 0.107 = 0.008 / 0.107 ≈ 7.48%
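The count-based answer and the formula can be checked against each other in a few lines of Python (numbers taken from the table above):

```python
# Population counts from the table.
total = 1000
true_pos = 8                       # women with cancer who test positive
false_pos = 99                     # women without cancer who test positive
positives = true_pos + false_pos   # 107

# Count-based: what fraction of positive tests are true positives?
p_counts = true_pos / positives

# Bayes' Theorem: p(B | A) = p(B) * p(A | B) / p(A).
p_bayes = 0.01 * 0.8 / (positives / total)

print(round(p_counts, 4), round(p_bayes, 4))  # 0.0748 0.0748
```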
Trousers example

Suppose a school has 40 girls and 60 boys. Half of the girls wear trousers and half wear skirts, while all of the boys wear trousers. In this case, B is the set of girls and A is the set of people who wear trousers. So given that we see someone wearing trousers (someone in A), what is the probability that the person is in B, i.e., in B ∩ A?
Or more concretely (assuming 100 students):

p(B | A) = p(B) × p(A | B) / p(A) = (0.4 × 0.5) / 0.8 = 0.25
In other words, select at random a student with trousers. What's the probability it will be a girl? It's the number of girls with trousers divided by the number of students with trousers.
| | Girls | Boys | Total |
|---|---|---|---|
| Trousers | 20 | 60 | 80 |
| Skirts | 20 | 0 | 20 |
| Total | 40 | 60 | 100 |
Similar to saying: select a student at random. What's the probability it will be a girl? That's 40/100 = 0.4.
Then modify the question so that it is restricted to only students who wear trousers. That's 20/80 = 0.25.
What may be confusing is that the intuitive count ratio

(number of girls with trousers) / (number of students with trousers) = 20/80

is expressed formally as

p(B | A) = p(A ∩ B) / p(A) = 0.2 / 0.8 = 0.25
What may also be tricky is realizing that the number of students with trousers includes both the 20 girls with trousers and the 60 boys with trousers. So of the 80 people with trousers, 20 (or 25%) are girls.
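The same cross-check works for the trousers example, using the counts from the table:

```python
girls, boys = 40, 60
girls_trousers, boys_trousers = 20, 60
total = girls + boys                            # 100
trousers = girls_trousers + boys_trousers       # 80

p_girl = girls / total                          # p(B) = 0.4
p_trousers_given_girl = girls_trousers / girls  # p(A | B) = 0.5
p_trousers = trousers / total                   # p(A) = 0.8

p_girl_given_trousers = p_girl * p_trousers_given_girl / p_trousers
print(p_girl_given_trousers)  # 0.25
```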
Counterfeit coin example
Suppose that 1/3 of the coins in a collection are counterfeit, with heads on both sides, and the rest are fair. A coin selected at random is flipped three times and comes up heads each time. The probability that the coin is counterfeit turns out to be 4 out of 5. How do we get 4 out of 5?
The first—and I think hardest—issue is what is the universe of events? For this problem the events in the universe of events are the results of flipping a coin three times. That's the information we will use to determine whether a coin is counterfeit.
The easiest way to think about it is to suppose we have a population in which 1/3 of the coins are counterfeit. We select a coin at random from that population and flip it three times. The results will be as follows.
- If the coin is counterfeit, which it will be on average 1/3 of the time, it will come up heads on each of the three flips.
- If the coin is not counterfeit, which it will be on average 2/3 of the time, it will come up all heads, on average, once out of eight times.
Putting this together, the probability of getting three heads will be (1/3 × 1) + (2/3 × 1/8) = 1/3 + 1/12 = 5/12. In other words, if we select coins from a population in which 1/3 of them are counterfeit and flip the coin 3 times, the probability that the three flips will all be heads is 5/12.
It seems to me that's the hardest part of working out this example. The rest follows directly from Bayes' Theorem:

p(counterfeit | 3 heads) = p(counterfeit) × p(3 heads | counterfeit) / p(3 heads) = (1/3 × 1) / (5/12) = 4/5
In other words, having flipped the unknown coin three times and seen heads each time, the probability that the unknown coin is counterfeit has increased from 1/3 to 4/5.
This is a nice illustration of why Bayes' Theorem uses the terms a priori and a posteriori. The a priori probability that a coin selected at random from a population, 1/3 of which is counterfeit, is itself counterfeit is 1/3. But with the additional information provided by the flipping, the a posteriori probability is 4/5.
As far as the diagram goes, B is the set of counterfeit coins; the complement of B is the set of fair coins; A is the event of getting three heads when we flip a coin three times; and the complement of A covers all the other possible results of flipping a coin three times.
The diagram is misleading for this example in that B is completely contained in A. Every time we flip the counterfeit coin three times it will come up three heads. So A ∩ B = B. No part of B is outside A.
We are still interested in the following. Given that a coin comes up three heads, i.e., that it is in A, what is the probability that it is in B, i.e., p(B | A)? In this case, since A ∩ B = B, that's p(B | A) = p(B) / p(A) = (1/3) / (5/12) = 4/5.
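The 4/5 answer can be verified both by the formula and by a quick simulation (a sketch; it assumes the counterfeit coin always lands heads):

```python
import random

random.seed(0)

# Exact calculation.
p_counterfeit = 1 / 3
p_three_heads = p_counterfeit * 1 + (2 / 3) * (1 / 8)   # 5/12
posterior = p_counterfeit * 1 / p_three_heads           # 4/5
print(round(posterior, 6))  # 0.8

# Simulation: draw a coin, flip it three times, keep the all-heads runs.
all_heads = counterfeit_and_all_heads = 0
for _ in range(100_000):
    counterfeit = random.random() < 1 / 3
    three_heads = counterfeit or all(random.random() < 0.5 for _ in range(3))
    if three_heads:
        all_heads += 1
        counterfeit_and_all_heads += counterfeit
print(round(counterfeit_and_all_heads / all_heads, 2))  # ≈ 0.8
```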
Cause and effect
A common use for Bayes' Theorem is for diagnostic purposes: given some symptoms, i.e., effects, determine the probability that those effects were produced by certain causes. (This is adapted from Russell and Norvig, Artificial Intelligence, a modern approach.)
Suppose that one knows that a cause will produce an effect, at least with a certain probability: p(effect | cause). One wants to determine the probability that, given an observed effect, it was produced by a particular cause: p(cause | effect). In other words, one wants to reason backward and diagnose the cause of some observed effect. Because the goal is to reason backwards, it is likely that one will know the causal probability p(effect | cause) along with the priors p(cause) and p(effect), even though one may not know the desired diagnostic probability, p(cause | effect).
Russell and Norvig discuss the relationship between meningitis and stiff neck. It is known that meningitis often produces a stiff neck. So p(stiff neck | meningitis) is known. Also known are p(meningitis) and p(stiff neck) in general. What isn't necessarily known is p(meningitis | stiff neck). Certainly one can gather statistics for p(meningitis | stiff neck). But if there is an outbreak of meningitis, those statistics may not apply. But p(stiff neck | meningitis) will be unchanged because it reflects a mechanism in how meningitis works. So if one sees a case of stiff neck, one wants to know p(meningitis | stiff neck) under the new circumstances. It is possible to determine that using Bayes' Theorem:

p(meningitis | stiff neck) = p(meningitis) × p(stiff neck | meningitis) / p(stiff neck)
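For instance, with illustrative numbers (assumed here for the sketch, not quoted from Russell and Norvig): p(stiff neck | meningitis) = 0.7, p(meningitis) = 1/50,000, and p(stiff neck) = 0.01:

```python
# Illustrative numbers, assumed for this sketch.
p_stiff_given_men = 0.7
p_men = 1 / 50_000
p_stiff = 0.01

# p(meningitis | stiff neck)
#   = p(meningitis) * p(stiff neck | meningitis) / p(stiff neck)
p_men_given_stiff = p_men * p_stiff_given_men / p_stiff
print(round(p_men_given_stiff, 6))  # 0.0014
```

So even during normal times, only about 1 in 700 stiff necks would be due to meningitis under these assumptions.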
Multiple causes and effects
The more general case is that one knows that a number of effects are produced by a cause. Given some of them, one wants to know the probability that the cause produced them:

p(cause | effect_1, …, effect_n) = p(cause) × p(effect_1, …, effect_n | cause) / p(effect_1, …, effect_n)

If the effects are independent of one another, the denominator factors:

p(effect_1, …, effect_n) = p(effect_1) × … × p(effect_n)

Furthermore, if the cause produces the effects independently, the conditional term in the numerator factors as well:

p(effect_1, …, effect_n | cause) = p(effect_1 | cause) × … × p(effect_n | cause)

This is the case if the production of effect_i by the cause is independent of the production of effect_j by the cause for all pairs i and j.
This is not the case for our XOR example. In that case we have two "causes," X1 and X2, and one "effect," X3. It is not the case that the causes independently produce the effect—and certainly not that X3 considered as a cause produced X1 and X2 as independent effects. To produce X3 as an effect one needs certain combinations of the causes. Independence in the sense of one-cause-to-the-effect doesn't hold in either direction.
The tennis example is similar. One has multiple causes, the various weather conditions, and one effect, whether or not to play tennis. It is not the case that any one of the causes produces the effect independently. Nor is it the case that whether or not one plays tennis can be a cause of what the weather conditions are.
The following is an example of a single cause and multiple independent effects.
- The temperature is the cause.
- One of the effects is that the number of people at the beach is high.
- Another effect is that the sale of ice cream is high.
Each of these is an effect of the cause—hot temperature. But these two effects are independent of each other. They each depend on the cause, but they depend on the cause independently. In addition, they might depend on other things besides the temperature. For example, the sale of ice cream may increase during holidays whether or not the temperature is hot, and the number of people at the beach may depend on whether it is raining as well as the temperature.
Assume that the beach is crowded and that ice cream sales are high. One wants to know the probability that these two perceived effects were caused by high temperatures:

p(hot | beach, ice cream) = p(hot) × p(beach | hot) × p(ice cream | hot) / (p(beach) × p(ice cream))
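A sketch of the computation with made-up probabilities (every number below is an assumption for illustration, chosen so the independence assumptions are mutually consistent):

```python
# Made-up illustrative probabilities.
p_hot = 0.3                 # p(cause): high temperature
p_beach_given_hot = 0.8     # p(effect_1 | cause)
p_ice_given_hot = 0.7       # p(effect_2 | cause)
p_beach = 0.5               # p(effect_1)
p_ice = 0.5                 # p(effect_2)

# Assuming the effects are independent and are produced independently
# by the cause:
# p(hot | beach, ice cream)
#   = p(hot) * p(beach | hot) * p(ice | hot) / (p(beach) * p(ice))
p_hot_given_both = p_hot * p_beach_given_hot * p_ice_given_hot / (p_beach * p_ice)
print(round(p_hot_given_both, 3))  # 0.672
```

Observing both effects raises the probability of high temperature from the prior 0.3 to about 0.67 under these assumed numbers.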