Bayes’ Theorem

Bayes' Theorem is a simple mathematical formula used for calculating conditional probabilities. It figures prominently in subjectivist or Bayesian approaches to epistemology, statistics, and inductive logic. Subjectivists, who maintain that rational belief is governed by the laws of probability, lean heavily on conditional probabilities in their theories of evidence and their models of empirical learning. Bayes' Theorem is central to these enterprises both because it simplifies the calculation of conditional probabilities and because it clarifies significant features of the subjectivist position. Indeed, the Theorem's central insight, that a hypothesis is confirmed by any body of data that its truth renders probable, is the cornerstone of all subjectivist methodology.

1. Conditional Probabilities and Bayes' Theorem

The probability of a hypothesis H conditional on a given body of data E is the ratio of the unconditional probability of the conjunction of the hypothesis with the data to the unconditional probability of the data alone.

(1.1)	Definition.
	The probability of H conditional on E is defined as P_E(H) = P(H & E)/P(E), provided that both terms of this ratio exist and P(E) > 0.¹

To illustrate, suppose J. Doe is a randomly chosen American who was alive on January 1, 2000. According to the United States Centers for Disease Control and Prevention, roughly 2.4 million of the 275 million Americans alive on that date died during the 2000 calendar year. Among the approximately 16.6 million senior citizens (aged 75 or older), about 1.36 million died. The unconditional probability of the hypothesis that our J. Doe died during 2000, H, is just the population-wide mortality rate P(H) = 2.4M/275M = 0.00873. To find the probability of J. Doe's death conditional on the information, E, that he or she was a senior citizen, we divide the probability that he or she was a senior who died, P(H & E) = 1.36M/275M = 0.00495, by the probability that he or she was a senior citizen, P(E) = 16.6M/275M = 0.06036. Thus, the probability of J. Doe's death given that he or she was a senior is P_E(H) = P(H & E)/P(E) = 0.00495/0.06036 = 0.082. Notice how the size of the total population factors out of this equation, so that P_E(H) is just the proportion of seniors who died. One should contrast this quantity, which gives the mortality rate among senior citizens, with the "inverse" probability of E conditional on H, P_H(E) = P(H & E)/P(H) = 0.00495/0.00873 = 0.57, which is the proportion of deaths in the total population that occurred among seniors.
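The arithmetic of this example can be reproduced directly from definition (1.1). The following is a minimal sketch; the function name `conditional` and the variable names are illustrative choices, not part of the original text.

```python
# Counts quoted in the text, in millions of people.
TOTAL = 275.0          # Americans alive on January 1, 2000
DEATHS = 2.4           # died during 2000 (H)
SENIORS = 16.6         # senior citizens (E)
SENIOR_DEATHS = 1.36   # seniors who died (H & E)


def conditional(p_joint: float, p_given: float) -> float:
    """Definition (1.1): P_E(H) = P(H & E) / P(E), defined only when P(E) > 0."""
    if p_given <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_given


p_h = DEATHS / TOTAL               # P(H)
p_e = SENIORS / TOTAL              # P(E)
p_h_and_e = SENIOR_DEATHS / TOTAL  # P(H & E)

p_h_given_e = conditional(p_h_and_e, p_e)  # mortality rate among seniors
p_e_given_h = conditional(p_h_and_e, p_h)  # share of all deaths that were seniors

print(round(p_h, 5), round(p_h_given_e, 3), round(p_e_given_h, 2))
# prints: 0.00873 0.082 0.57
```

Because the population total cancels in each ratio, the same values result from dividing the raw counts directly (1.36/16.6 and 1.36/2.4).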

Here are some straightforward consequences of (1.1):

Probability. P_E is a probability function.²
Logical Consequence. If E entails H, then P_E(H) = 1.
Preservation of Certainties. If P(H) = 1, then P_E(H) = 1.
Mixing. P(H) = P(E)P_E(H) + P(~E)P_~E(H).³
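The Mixing identity, for instance, follows in two steps from (1.1) and the additivity of P. The derivation below is a sketch that assumes both P(E) and P(~E) are positive, so that the two conditional probabilities are defined.

```latex
\begin{align*}
P(H) &= P(H \,\&\, E) + P(H \,\&\, {\sim}E)
     && \text{(the two conjunctions are incompatible and jointly exhaust $H$)} \\
     &= P(E)\,P_E(H) + P({\sim}E)\,P_{\sim E}(H)
     && \text{(by (1.1), applied to each term)}
\end{align*}
```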

The most important fact about conditional probabilities is undoubtedly Bayes' Theorem, whose significance was first appreciated by the British cleric Thomas Bayes in his posthumously published masterwork, "An Essay Toward Solving a Problem in the Doctrine of Chances" (Bayes 1764). Bayes' Theorem relates the "direct" probability of a hypothesis conditional on a given body of data, P_E(H), to the "inverse" probability of the data conditional on the hypothesis, P_H(E).

(1.2)	Bayes' Theorem.
	P_E(H) = [P(H)/P(E)] P_H(E)
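That (1.2) follows from (1.1) can be seen by writing the probability of the conjunction H & E in two ways. The sketch below assumes that P(H) and P(E) are both positive.

```latex
\begin{align*}
P_E(H)\,P(E) &= P(H \,\&\, E) = P_H(E)\,P(H)
  && \text{(two applications of (1.1))} \\
P_E(H) &= \frac{P(H)}{P(E)}\,P_H(E)
  && \text{(dividing through by $P(E) > 0$)}
\end{align*}
```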

In an unfortunate, but now unavoidable, choice of terminology, statisticians refer to the inverse probability P_H(E) as the "likelihood" of H on E. It expresses the degree to which the hypothesis predicts the data given the background information codified in the probability P.

In the example discussed above, the condition that J. Doe died during 2000 is a fairly strong predictor of senior citizenship. Indeed, the equation P_H(E) = 0.57 tells us that 57% of the total deaths occurred among seniors that year. Bayes' theorem lets us use this information to compute the "direct" probability of J. Doe dying given that he or she was a senior citizen. We do this by multiplying the "prediction term" P_H(E) by the ratio of the total number of deaths in the population to the number of senior citizens in the population, P(H)/P(E) = 2.4M/16.6M = 0.144. The result is P_E(H) = 0.57 × 0.144 = 0.082, just as expected.
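The same calculation can be written as a short function implementing form (1.2). This is a sketch only; the function name `bayes` and its parameter names are illustrative assumptions.

```python
def bayes(prior: float, likelihood: float, evidence: float) -> float:
    """Form (1.2): P_E(H) = [P(H) / P(E)] * P_H(E)."""
    if evidence <= 0:
        raise ValueError("P(E) must be positive")
    return (prior / evidence) * likelihood


# Rates from the J. Doe example (counts in millions, as quoted in the text).
p_h = 2.4 / 275           # P(H): died during 2000
p_e = 16.6 / 275          # P(E): was a senior citizen
p_e_given_h = 1.36 / 2.4  # P_H(E): share of deaths that occurred among seniors

posterior = bayes(p_h, p_e_given_h, p_e)
print(round(posterior, 3))  # prints: 0.082
```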

Though a mathematical triviality, Bayes' Theorem is of great value in calculating conditional probabilities because inverse probabilities are typically both easier to ascertain and less subjective than direct probabilities. People with different views about the unconditional probabilities of E and H often disagree about E's value as an indicator of H. Even so, they can agree about the degree to which the hypothesis predicts the data if they know any of the following intersubjectively available facts: (a) E's objective probability given H, (b) the frequency with which events like E will occur if H is true, or (c) the fact that H logically entails E. Scientists often design experiments so that likelihoods can be known in one of these "objective" ways. Bayes' Theorem then ensures that any dispute about the significance of the experimental results can be traced to "subjective" disagreements about the unconditional probabilities of H and E.

When both P_H(E) and P_~H(E) are known, an experimenter need not even know E's probability to determine a value for P_E(H) using Bayes' Theorem.

(1.3)	Bayes' Theorem (2nd form).⁴
	P_E(H) = P(H)P_H(E) / [P(H)P_H(E) + P(~H)P_~H(E)]
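The second form is obtained from (1.2) by expanding the denominator P(E) with the Mixing identity. A sketch of the step, assuming P(H) and P(~H) are both positive:

```latex
\begin{align*}
P(E) &= P(H)\,P_H(E) + P({\sim}H)\,P_{\sim H}(E)
  && \text{(Mixing, with the roles of $H$ and $E$ exchanged)} \\
P_E(H) &= \frac{P(H)\,P_H(E)}{P(E)}
        = \frac{P(H)\,P_H(E)}{P(H)\,P_H(E) + P({\sim}H)\,P_{\sim H}(E)}
  && \text{(substituting into (1.2))}
\end{align*}
```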

In this guise Bayes' theorem is particularly useful for inferring causes from their effects since it is often fairly easy to discern the probability of an effect given the presence or absence of a putative cause. For instance, physicians often screen for diseases of known prevalence using diagnostic tests of recognized sensitivity and specificity. The sensitivity of a test, its "true positive" rate, is the fraction of times that patients with the disease test positive for it. The test's specificity, its "true negative" rate, is the proportion of healthy patients who test negative. If we let H be the event of a given patient having the disease, and E be the event of her testing positive for it, then the test's sensitivity and specificity are given by the likelihoods P_H(E) and P_~H(~E), respectively, and the "baseline" prevalence of the disease in the population is P(H). Given these inputs about the effects of the disease on the outcome of the test, one can use (1.3) to determine the probability of disease given a positive test. For a more detailed illustration of this process, see Example 1 in the Supplementary Document "Examples, Tables, and Proof Sketches".
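The screening calculation just described might be sketched as follows. The prevalence, sensitivity, and specificity used here are hypothetical values chosen only for illustration; they are not taken from the Supplementary Document, and the function name is likewise an assumption.

```python
def posterior_given_positive(prevalence: float,
                             sensitivity: float,
                             specificity: float) -> float:
    """Form (1.3) with H = 'has the disease' and E = 'tests positive'."""
    false_positive_rate = 1.0 - specificity    # P_~H(E)
    numerator = prevalence * sensitivity       # P(H) * P_H(E)
    denominator = numerator + (1.0 - prevalence) * false_positive_rate
    if denominator == 0:
        raise ValueError("a positive test has probability zero under these inputs")
    return numerator / denominator


# Hypothetical inputs, for illustration only.
p = posterior_given_positive(prevalence=0.01, sensitivity=0.95, specificity=0.90)
print(round(p, 3))  # prints: 0.088
```

With these made-up inputs a positive result raises the probability of disease from 1% to under 9%, because false positives among the many healthy patients outnumber true positives among the few who are ill.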

2. Special Forms of Bayes' Theorem

Bayes' Theorem can be expressed in a variety of forms that are useful for different purposes. One version employs what Rudolf Carnap called the relevance quotient or probability ratio (Carnap 1962, 466). This is the factor PR(H, E) = P_E(H)/P(H) by which H's unconditional probability must be multiplied to get its probability conditional on E. Bayes' Theorem is equivalent to a simple symmetry principle for probability ratios.

(1.4)	Probability Ratio Rule.
	PR(H, E) = PR(E, H)

The term on the right provides one measure of the degree to which H predicts E. If we think of P(E) as expressing the "baseline" predictability of E given the background information codified in P, and of P_H(E) as E's predictability when H is added to this background, then PR(E, H) captures the degree to which knowing H makes E more or less predictable relative to the baseline: PR(E, H) = 0 means that H categorically predicts ~E; PR(E, H) = 1 means that adding H does not alter the baseline prediction at all; PR(E, H) = 1/P(E) means that H categorically predicts E. Since P(E) = P_T(E) where T is any truth of logic, we can think of (1.4) as telling us that

The probability of a hypothesis conditional on a body of data is equal to the unconditional probability of the hypothesis multiplied by the degree to which the hypothesis surpasses a tautology as a predictor of the data.

In our J. Doe example, PR(H, E) is obtained by comparing the predictability of senior status given that J. Doe died in 2000 to its predictability given no information whatever about his or her mortality. Dividing the former "prediction term" by the latter yields PR(H, E) = P_H(E)/P(E) = 0.57/0.06036 = 9.44. Thus, as a predictor of senior status in 2000, knowing that J. Doe died is more than nine times better than not knowing whether he or she lived or died.
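The symmetry asserted by (1.4) can be checked numerically on the J. Doe counts. This is a minimal sketch with illustrative variable names.

```python
# J. Doe counts from the text, in millions.
total, deaths, seniors, senior_deaths = 275.0, 2.4, 16.6, 1.36

p_h = deaths / total                   # P(H)
p_e = seniors / total                  # P(E)
p_h_given_e = senior_deaths / seniors  # P_E(H)
p_e_given_h = senior_deaths / deaths   # P_H(E)

pr_h_e = p_h_given_e / p_h  # PR(H, E): boost E gives to H
pr_e_h = p_e_given_h / p_e  # PR(E, H): boost H gives to E

print(round(pr_h_e, 2), round(pr_e_h, 2))  # prints: 9.39 9.39
```

Working from the unrounded counts gives a ratio of about 9.39; the figure 9.44 in the text results from first rounding P_H(E) to 0.57.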

Another useful form of Bayes' Theorem is the Odds Rule. In the jargon of bookies, the "odds" of a hypothesis is its probability divided by the probability of its negation: O(H) = P(H)/P(~H). So, for example, a racehorse whose odds of winning a particular race are 7-to-5 has a 7/12 chance of winning and a 5/12 chance of losing. To understand the difference between odds and probabilities it helps to think of probabilities as fractions of the distance between the probability of a contradiction and that of a tautology, so that P(H) = p means that H is p times as likely to be true as a tautology. In contrast, writing O(H) = [P(H) − P(F)]/[P(T) − P(H)] (where F is some logical contradiction) makes it clear that O(H) expresses this same quantity as the ratio of the amount by which H's probability exceeds that of a contradiction to the amount by which it is exceeded by that of a tautology. Thus, the difference between "probability talk" and "odds talk" corresponds to the difference between saying "we are two thirds of the way there" and saying "we have gone twice as far as we have yet to go."
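The conversion between the two scales is a one-line formula in each direction. The sketch below uses exact fractions so that the racehorse and "two thirds of the way there" examples come out exactly; the function names are illustrative.

```python
from fractions import Fraction


def odds_from_prob(p: Fraction) -> Fraction:
    """O(H) = P(H) / P(~H) = p / (1 - p); undefined when p = 1."""
    if p == 1:
        raise ValueError("odds are undefined for a probability of 1")
    return p / (1 - p)


def prob_from_odds(o: Fraction) -> Fraction:
    """Inverse conversion: P(H) = O(H) / (1 + O(H))."""
    return o / (1 + o)


print(prob_from_odds(Fraction(7, 5)))  # 7-to-5 odds, prints: 7/12
print(odds_from_prob(Fraction(2, 3)))  # two thirds of the way, prints: 2
```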

The analogue of the probability ratio is the odds ratio OR(H, E) = O_E(H)/O(H), the factor by which H's unconditional odds must be multiplied to obtain its odds conditional on E. Bayes' Theorem is equivalent to the following fact about odds ratios:

(1.5)	Odds Ratio Rule.
	OR(H, E) = P_H(E)/P_~H(E)

Notice the similarity between (1.4) and (1.5). While each employs a different way of expressing probabilities, each shows how its expression for H's probability conditional on E can be obtained by multiplying its expression for H's unconditional probability by a factor involving inverse probabilities.

The quantity LR(H, E) = P_H(E)/P_~H(E) that appears in (1.5) is the likelihood ratio of H given E. In testing situations like the one described in Example 1, the likelihood ratio is the test's true positive rate divided by its false positive rate: LR = sensitivity/(1 − specificity). As with the probability ratio, we can construe the likelihood ratio as a measure of the degree to which H predicts E. Instead of comparing E's probability given H with its unconditional probability, however, we now compare it with its probability conditional on ~H. LR(H, E) is thus the degree to which the hypothesis surpasses its negation as a predictor of the data. Once more, Bayes' Theorem tells us how to factor conditional probabilities into unconditional probabilities and measures of predictive power.

The odds of a hypothesis conditional on a body of data is equal to the unconditional odds of the hypothesis multiplied by the degree to which it surpasses its negation as a predictor of the data.

In our running J. Doe example, LR(H, E) is obtained by comparing the predictability of senior status given that J. Doe died in 2000 to its predictability given that he or she lived out the year. Dividing the former "prediction term" by the latter yields LR(H, E) = P_H(E)/P_~H(E) = 0.567/0.056 = 10.12. Thus, as a predictor of senior status in 2000, knowing that J. Doe died is more than ten times better than knowing that he or she lived.
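The Odds Ratio Rule (1.5) can be verified on the same counts by computing the odds ratio from the direct probabilities and the likelihood ratio from the inverse ones. A minimal sketch with illustrative names:

```python
# J. Doe counts from the text, in millions.
total, deaths, seniors, senior_deaths = 275.0, 2.4, 16.6, 1.36


def odds(p: float) -> float:
    """O(H) = P(H) / P(~H)."""
    return p / (1.0 - p)


p_h = deaths / total                                         # P(H)
p_h_given_e = senior_deaths / seniors                        # P_E(H)
p_e_given_h = senior_deaths / deaths                         # P_H(E)
p_e_given_not_h = (seniors - senior_deaths) / (total - deaths)  # P_~H(E)

odds_ratio = odds(p_h_given_e) / odds(p_h)          # OR(H, E)
likelihood_ratio = p_e_given_h / p_e_given_not_h    # LR(H, E)

print(round(odds_ratio, 2), round(likelihood_ratio, 2))  # prints: 10.14 10.14
```

The unrounded counts give a ratio of about 10.14; small differences from the figure in the text come from rounding the two likelihoods before dividing.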

The similarities between the "probability ratio" and "odds ratio" versions of Bayes' Theorem can be developed further if we express H's probability as a multiple of the probability of some other hypothesis H* using the relative probability function B(H, H*) = P(H)/P(H*). It should be clear that B generalizes both P and O since P(H) = B(H, T) and O(H) = B(H, ~H). By comparing the conditional and unconditional values of B we obtain the Bayes' Factor:

BR(H, H*; E) = B_E(H, H*)/B(H, H*) = [P_E(H)/P_E(H*)]/ [P(H)/P(H*)].

We can also generalize the likelihood ratio by setting LR(H, H*; E) = P_H(E)/P_H*(E). This compares E's predictability on the basis of H with its predictability on the basis of H*. We can use these two quantities to formulate an even more general form of Bayes' Theorem.

(1.6)	Bayes' Theorem (General Form)
	BR(H, H*; E) = LR(H, H*; E)

The message of (1.6) is this:

The ratio of probabilities for two hypotheses conditional on a body of data is equal to the ratio of their unconditional probabilities multiplied by the degree to which the first hypothesis surpasses the second as a predictor of the data.
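To see the general form at work with hypotheses that are not negations of one another, consider a purely hypothetical setup in which three rival hypotheses assign different chances to a coin landing heads, and E is the observation of heads. The priors, likelihoods, and names below are invented for illustration. The sketch obtains the posteriors from (1.2) and then confirms that the Bayes' Factor equals the generalized likelihood ratio.

```python
from fractions import Fraction

# Hypothetical priors P(H) over three rival hypotheses (they sum to 1).
priors = {
    "biased_heads": Fraction(1, 5),
    "fair": Fraction(1, 2),
    "biased_tails": Fraction(3, 10),
}
# Hypothetical likelihoods P_H(E), where E = "the coin lands heads".
likelihoods = {
    "biased_heads": Fraction(4, 5),
    "fair": Fraction(1, 2),
    "biased_tails": Fraction(1, 5),
}

# P(E) by total probability, then P_E(H) for each hypothesis via (1.2).
p_e = sum(priors[h] * likelihoods[h] for h in priors)
posteriors = {h: priors[h] * likelihoods[h] / p_e for h in priors}


def bayes_factor(h: str, h_star: str) -> Fraction:
    """BR(H, H*; E) = [P_E(H)/P_E(H*)] / [P(H)/P(H*)]."""
    return (posteriors[h] / posteriors[h_star]) / (priors[h] / priors[h_star])


def likelihood_ratio(h: str, h_star: str) -> Fraction:
    """LR(H, H*; E) = P_H(E) / P_H*(E)."""
    return likelihoods[h] / likelihoods[h_star]


print(bayes_factor("biased_heads", "fair"), likelihood_ratio("biased_heads", "fair"))
# prints: 8/5 8/5
```

Exact fractions are used so that the two sides can be compared for strict equality rather than up to floating-point error.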

The various versions of Bayes' Theorem differ only with respect to the functions used to express unconditional probabilities (P(H), O(H), B(H, H*)) and in the likelihood term used to represent predictive power (PR(E, H), LR(H, E), LR(H, H*; E)). In each case, though, the underlying message is the same:

conditional probability = unconditional probability × predictive power

(1.2) – (1.6) are multiplicative forms of Bayes' Theorem that use division to compare the disparities between unconditional and conditional probabilities. Sometimes these comparisons are best expressed additively by replacing ratios with differences. The following table gives the additive analogue of each ratio measure.

Table 1
Ratio	Difference
Probability Ratio PR(H, E) = P_E(H)/P(H)	Probability Difference PD(H, E) = P_E(H) − P(H)
Odds Ratio OR(H, E) = O_E(H)/O(H)	Odds Difference OD(H, E) = O_E(H) − O(H)
Bayes' Factor BR(H, H*; E) = B_E(H, H*)/B(H, H*)	Bayes' Difference BD(H, H*; E) = B_E(H, H*) − B(H, H*)

We can use Bayes' theorem to obtain additive analogues of (1.4) – (1.6), which are here displayed along with their multiplicative counterparts:

Table 2
	Ratio	Difference
(1.4)	PR(H, E) = PR(E, H) = P_H(E)/P(E)	PD(H, E) = P(H) [PR(E, H) − 1]
(1.5)	OR(H, E) = LR(H, E) = P_H(E)/P_~H(E)	OD(H, E) = O(H) [OR(H, E) − 1]
(1.6)	BR(H, H*; E) = LR(H, H*; E) = P_H(E)/P_H*(E)	BD(H, H*; E) = B(H, H*) [BR(H, H*; E) − 1]

Notice how each additive measure is obtained by multiplying H's unconditional probability, expressed on the relevant scale, P, O or B, by the associated multiplicative measure diminished by 1.
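The additive forms of (1.4) and (1.5) can be checked on the J. Doe counts by computing each difference directly and then again from the unconditional value and the inverse-probability ratio. This is a sketch only, with illustrative variable names.

```python
# J. Doe counts from the text, in millions.
total, deaths, seniors, senior_deaths = 275.0, 2.4, 16.6, 1.36


def odds(p: float) -> float:
    """O(H) = P(H) / P(~H)."""
    return p / (1.0 - p)


p_h = deaths / total                                         # P(H)
p_e = seniors / total                                        # P(E)
p_h_given_e = senior_deaths / seniors                        # P_E(H)
p_e_given_h = senior_deaths / deaths                         # P_H(E)
p_e_given_not_h = (seniors - senior_deaths) / (total - deaths)  # P_~H(E)

pr_e_h = p_e_given_h / p_e              # PR(E, H)
lr = p_e_given_h / p_e_given_not_h      # LR(H, E), equal to OR(H, E) by (1.5)

pd_direct = p_h_given_e - p_h                 # PD(H, E) from its definition
pd_via_ratio = p_h * (pr_e_h - 1.0)           # P(H) [PR(E, H) - 1]
od_direct = odds(p_h_given_e) - odds(p_h)     # OD(H, E) from its definition
od_via_ratio = odds(p_h) * (lr - 1.0)         # O(H) [OR(H, E) - 1]

print(round(pd_direct, 4), round(od_direct, 4))  # prints: 0.0732 0.0804
```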

While the results of this section are useful to anyone who employs the probability calculus, they have a special relevance for subjectivist or "Bayesian" approaches to statistics, epistemology, and inductive inference.⁵ Subjectivists lean heavily on conditional probabilities in their theory of evidential support and their account of empirical learning. Given that Bayes' Theorem is the single most important fact about conditional probabilities, it is not at all surprising that it should figure prominently in subjectivist methodology.