A brief overview of Bayes Factors
Posted: 2025-12-07 · Last updated: 2025-12-07
Bayes' theorem expresses the following relationships between various probabilities:
\begin{equation} P(H \,\vert\, D) = \frac{P(H) P(D \,\vert\, H)}{P(D)} \label{bayes} \end{equation}
- $P(H \,\vert\, D)$ is the posterior: the probability of the hypothesis after having seen the data.
- $P(H)$ is the prior.
- $P(D \,\vert\, H)$ is the likelihood of the data (given the hypothesis).
- $P(D)$ is the marginal likelihood of the data.
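For concreteness, here is a small numerical sketch of Equation \eqref{bayes} in Python, using a made-up diagnostic-testing scenario; every probability in it is an assumption chosen for illustration, and $P(D)$ is expanded over the hypothesis and its complement.
```python
# Toy application of Bayes' theorem (all numbers are assumptions).
# H = "patient has the condition", D = "test came back positive".
prior = 0.01                 # P(H): assumed base rate
likelihood = 0.95            # P(D | H): assumed sensitivity
false_positive_rate = 0.05   # P(D | not H): assumed

# Marginal likelihood P(D), expanded over H and its complement.
marginal = likelihood * prior + false_positive_rate * (1 - prior)

# Posterior P(H | D) via Bayes' theorem.
posterior = prior * likelihood / marginal
print(f"P(H | D) = {posterior:.3f}")  # ~0.161
```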
Proving Bayes' theorem
\begin{align*} P(H \,\cap\, D) &= P(H \,\vert\, D) P(D)\\ P(D \,\cap\, H) &= P(D \,\vert\, H) P(H) \end{align*}
Since $P(H \,\cap\, D) = P(D \,\cap\, H)$, the two right-hand sides are equal; dividing both by $P(D)$ yields Equation \eqref{bayes}.
Working in odds space (deriving a Bayes Factor)
Suppose now we are working with two hypotheses, $H_1 , H_2$. We can use Equation \eqref{bayes} twice to obtain the following representation of posterior odds:
\begin{align*} \frac{P(H_1 \,\vert\, D)}{P(H_2 \,\vert\, D)} &= \frac{\frac{P(H_1) P(D \,\vert\, H_1)}{P(D)}}{\frac{P(H_2) P(D \,\vert\, H_2)}{P(D)}}\\ &= \frac{P(H_1) P(D \,\vert\, H_1)}{P(H_2) P(D \,\vert\, H_2)}\\ &= \underbrace{\frac{P(H_1)}{P(H_2)}}_{\text{Prior Odds}} \cdot \underbrace{\frac{P(D \,\vert\, H_1)}{P(D \,\vert\, H_2)}}_{\text{Bayes Factor}} \end{align*}
In other words, Posterior Odds = Prior Odds · Bayes Factor.
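As a quick numerical check of this identity, here is a sketch with assumed priors and likelihoods, taking $H_1, H_2$ to be mutually exclusive and exhaustive so that $P(D)$ can be expanded over them:
```python
# Assumed priors and likelihoods for two competing hypotheses.
prior_1, prior_2 = 0.2, 0.8   # P(H1), P(H2), exhaustive and exclusive
lik_1, lik_2 = 0.9, 0.3       # P(D | H1), P(D | H2)

marginal = prior_1 * lik_1 + prior_2 * lik_2   # P(D)
post_1 = prior_1 * lik_1 / marginal            # P(H1 | D)
post_2 = prior_2 * lik_2 / marginal            # P(H2 | D)

posterior_odds = post_1 / post_2
prior_odds = prior_1 / prior_2
bayes_factor = lik_1 / lik_2

# Both routes give the same posterior odds (~0.75).
print(posterior_odds, prior_odds * bayes_factor)
```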
More precisely, we use subscripts and define
\begin{equation} BF_{12} = \frac{P(D \,\vert\, H_1)}{P(D \,\vert\, H_2)} \label{bf}. \end{equation}
Note how $BF_{12}$ is a ratio of likelihoods, and how the subscripts map onto the fraction: the first index names the hypothesis in the numerator and the second the one in the denominator.
Crucially,
\begin{equation} BF_{21} = \frac{1}{BF_{12}}. \label{reversed} \end{equation}
Calculating the posterior probability from a Bayes Factor
Suppose now that exactly one of $H_1$ and $H_2$ is true. That is, the hypotheses are mutually exclusive and exhaustive: $H_1 \cap H_2 = \emptyset$ and $H_1 \cup H_2 = \Omega$, with $P(\Omega) = 1$. Then $P(H_1 \,\vert\, D) + P(H_2 \,\vert\, D) = 1$, so $P(H_2 \,\vert\, D) = 1 - P(H_1 \,\vert\, D)$.
This means that the posterior probability can be calculated as follows:
\begin{align*} \frac{P(H_1 \,\vert\, D)}{1-P(H_1 \,\vert\, D)} &= \frac{P(H_1)}{P(H_2)} \cdot \frac{P(D \,\vert\, H_1)}{P(D \,\vert\, H_2)}\\ \\ \Longleftrightarrow\\ P(H_1 \,\vert\, D) &= \frac{\frac{P(H_1)}{P(H_2)} \cdot BF_{12}}{1 + \frac{P(H_1)}{P(H_2)} \cdot BF_{12}} \end{align*}
In the common case where prior odds are equal to 1, the posterior probability has a convenient expression:
\begin{equation} P(H_1 \,\vert\, D) = \frac{BF_{12}}{1 + BF_{12}} \label{postprob} \end{equation}
Bayes Factors and strength of evidence
If $BF_{12} \geq 1$, use the table below; otherwise, use Equation \eqref{reversed} to compute $BF_{21}$ and look that value up instead. The second column assumes prior odds equal to 1.
| Bayes Factor | Posterior probability | Interpretation |
|---|---|---|
| $1$ | $0.5$ | No evidence |
| $1 \dots 3$ | $0.5 \dots 0.75$ | Anecdotal evidence |
| $3 \dots 10$ | $0.75 \dots 0.91$ | Moderate evidence |
| $10 \dots 30$ | $0.91 \dots 0.97$ | Strong evidence |
| $30 \dots 100$ | $0.97 \dots 0.99$ | Very strong evidence |
| $> 100$ | $> 0.99$ | Extremely strong evidence |
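The posterior-probability column follows from Equation \eqref{postprob}. Here is a minimal sketch that reproduces it and also handles prior odds different from 1 (the function `posterior_prob` is my own naming, not anything standard):
```python
def posterior_prob(bf_12, prior_odds=1.0):
    """Posterior probability of H1 from a Bayes Factor and prior odds."""
    posterior_odds = prior_odds * bf_12
    return posterior_odds / (1 + posterior_odds)

# Reproduce the table's second column (prior odds = 1).
for bf in [1, 3, 10, 30, 100]:
    print(f"BF = {bf:>3}: P(H1 | D) = {posterior_prob(bf):.2f}")

# With prior odds below 1, the same Bayes Factor is less persuasive.
print(posterior_prob(10, prior_odds=0.25))  # ~0.71 rather than ~0.91
```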
Relationship to likelihood ratio tests
The classical likelihood ratio test statistic is
\begin{equation} \Lambda = \frac{\sup_{\theta \in \Theta_1} P(D \,\vert\, \theta)}{\sup_{\theta \in \Theta_0} P(D \,\vert\, \theta)} \end{equation}
where $\Theta_0, \Theta_1$ are the parameter spaces under the null and alternative. This ratio compares the hypotheses at their best-fitting parameter values.
The Bayes Factor instead compares average likelihoods:
\begin{equation} BF_{10} = \frac{\int P(D \,\vert\, \theta) \, p(\theta \,\vert\, H_1) \, d\theta}{\int P(D \,\vert\, \theta) \, p(\theta \,\vert\, H_0) \, d\theta} \end{equation}
The averaging penalizes models with diffuse priors over large parameter spaces: probability mass spent on poorly fitting parameter values drags down the marginal likelihood. This provides an automatic Occam's razor absent from $\Lambda$.
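To make the contrast concrete, here is a toy comparison on binomial data, with a point null $H_0: \theta = 0.5$ against an $H_1$ that places a Uniform(0, 1) prior on $\theta$; the model and prior are my choice for illustration, not something prescribed by either framework.
```python
from math import comb

# Toy comparison: k successes in n trials.
# H0: theta = 0.5 (point null); H1: theta unknown with a Uniform(0, 1) prior.
n, k = 20, 14

def binom_lik(theta, n, k):
    """Binomial likelihood P(D | theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Likelihood ratio test: the best-fitting theta under H1 is the MLE k / n.
lrt = binom_lik(k / n, n, k) / binom_lik(0.5, n, k)

# Bayes Factor: under a Uniform(0, 1) prior the marginal likelihood,
# i.e. the integral of comb(n, k) * theta^k * (1 - theta)^(n - k), is 1 / (n + 1).
marginal_h1 = 1 / (n + 1)
bf_10 = marginal_h1 / binom_lik(0.5, n, k)

print(f"Lambda = {lrt:.2f}")    # ~5.18
print(f"BF_10  = {bf_10:.2f}")  # ~1.29
```
With these numbers, $\Lambda \approx 5.2$ while $BF_{10} \approx 1.3$: the uniform prior spreads mass over values of $\theta$ that fit the data poorly, so the averaged comparison is far less enthusiastic than the best-case comparison.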
Decomposing Bayes Factors
In some cases, there is no single data set $D$ with which to evaluate the hypotheses, but rather multiple pieces of evidence $E_1, E_2, \dots, E_m$ such that $E \equiv \bigcap_{i=1}^{m} E_{i}$ is the totality of the evidence.
Rewriting Equation \eqref{bayes} in terms of evidence, we get
\begin{equation} P(H \,\vert\, E) = \frac{P(H) P(E \,\vert\, H)}{P(E)}. \end{equation}
Using the chain rule, for $m=2$ it holds that
\begin{equation} P(E \,\vert\, z) = P(E_1 \,\vert\, z) P(E_2 \,\vert\, E_1, z) \end{equation}
for any conditioning variable $z$. More generally,
\begin{equation} P(E \,\vert\, z) = P(E_1 \,\vert\, z) \prod_{i=2}^{m} P(E_i \,\vert\, E_1, \dots, E_{i-1}, z). \end{equation}
The Bayes Factor can thus be decomposed as
\begin{align*} \frac{P(E \,\vert\, H_1)}{P(E \,\vert\, H_2)} &= \frac{P(E_1 \,\vert\, H_1) \prod_{i=2}^{m} P(E_i \,\vert\, E_1, \dots, E_{i-1}, H_1)}{P(E_1 \,\vert\, H_2) \prod_{i=2}^{m} P(E_i \,\vert\, E_1, \dots, E_{i-1}, H_2)}\\ &= \frac{P(E_1 \,\vert\, H_1)}{P(E_1 \,\vert\, H_2)} \prod_{i=2}^{m} \frac{P(E_i \,\vert\, E_1, \dots, E_{i-1}, H_1)}{P(E_i \,\vert\, E_1, \dots, E_{i-1}, H_2)}\\ &= BF_{12}^{E_1} \cdot BF_{12}^{E_2 \, \vert \, E_1} \cdot \dots \cdot BF_{12}^{E_m \, \vert \, E_1, \dots, E_{m-1}} \end{align*}
This decomposition is useful when evidence arrives sequentially or when different pieces of evidence have qualitatively different sources. Instead of computing a single likelihood ratio over all evidence at once, you can update beliefs incrementally. Each factor $BF_{12}^{E_i \, \vert \, E_1, \dots, E_{i-1}}$ measures how much the $i$-th piece of evidence favors $H_1$ over $H_2$, given what was already known.
In practice, this matters when some evidence is easier to evaluate than others, or when you want to diagnose which pieces of evidence drive the overall conclusion. A large overall Bayes factor might be dominated by a single $E_i$, or it might accumulate from many modest contributions. The decomposition makes this transparent. It also helps when combining evidence from heterogeneous sources (e.g., experimental data and observational data) where assuming independence would be wrong, but the conditional structure is tractable.
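Here is a sketch of this incremental updating, where the per-piece conditional Bayes Factors are made-up inputs (in a real analysis each would come from its own likelihood calculation):
```python
from math import exp, log

# Made-up conditional Bayes Factors, one per piece of evidence E_i,
# each already conditioned on the evidence that came before it.
conditional_bfs = [4.0, 2.5, 0.8, 3.0]

prior_odds = 1.0
posterior_odds = prior_odds
for i, bf in enumerate(conditional_bfs, start=1):
    posterior_odds *= bf
    prob = posterior_odds / (1 + posterior_odds)
    print(f"after E_{i}: cumulative BF = {posterior_odds / prior_odds:.1f}, "
          f"P(H1 | evidence so far) = {prob:.2f}")

# Equivalently, log Bayes Factors simply add up.
print(f"overall BF = {exp(sum(log(bf) for bf in conditional_bfs)):.1f}")  # 24.0
```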
When the pieces of evidence are conditionally independent given the hypothesis, that is, when $P(E_i \,\vert\, E_1, \dots, E_{i-1}, H) = P(E_i \,\vert\, H)$ for all $i$ and for each of $H_1, H_2$, the decomposition simplifies to a product of unconditional Bayes Factors:
\begin{equation} \frac{P(E \,\vert\, H_1)}{P(E \,\vert\, H_2)} = \prod_{i=1}^{m} \frac{P(E_i \,\vert\, H_1)}{P(E_i \,\vert\, H_2)} = \prod_{i=1}^{m} BF_{12}^{E_i} \end{equation}
This is convenient but rarely justified in practice. Evidence from the same domain or measurement process is typically correlated, and treating dependent evidence as independent can inflate confidence.
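The toy numbers below (my own) show the size of the error the shortcut can introduce: the later pieces of evidence mostly repeat what the first one already showed, so their properly conditioned factors are much smaller than their marginal ones.
```python
from math import prod

# Three correlated pieces of evidence (made-up numbers).
marginal_bfs = [5.0, 5.0, 5.0]      # BF for each E_i on its own
conditional_bfs = [5.0, 2.0, 1.5]   # BF for each E_i given the earlier pieces (assumed)

print(f"assuming independence: BF = {prod(marginal_bfs):.0f}")    # 125
print(f"respecting dependence: BF = {prod(conditional_bfs):.1f}")  # 15.0
```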