The distribution of p-values, size and power
Posted: 2020-07-20 · Last updated: 2023-12-16
In this note, I will derive the distribution of the p-value of an exemplary test and show how this distribution can be used to calculate size and power. I will also show why this presentation is useful: it makes clear that size and power are two sides of the same coin. This note requires advanced knowledge of hypothesis testing and statistical inference, and I will skip over a lot of the details. Clearing out the cobwebs is left to you, the reader.
For simplicity, consider a two-sided one-sample z-test. This test requires that the underlying data are normally distributed with a known variance $\sigma^2$. For most comparisons of means, we would use Student's t-test, but for reasons of simplicity assume the variance to be known so that it does not have to be estimated (and so that we can directly use the normal distribution).
Given these assumptions, we want to perform the following test:
\begin{align*} \begin{array}{cc} H_0:& \mu = \mu_0,\\ H_1:& \mu \neq \mu_0. \end{array} \end{align*}
So we want to test whether we can reject the hypothesis that the true mean is equal to $\mu_0$.
The test statistic of the z-test is $$T = \frac{\overline{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}},$$ where $\overline{x}$ is the arithmetic mean of the data, $\mu_0$ is the mean to be tested against and $n$ is the sample size. We choose some level $\alpha$, which is the (theoretical!) probability of rejecting $H_0$ when it is in fact true. $\alpha$ is often set at $0.05$. Under $H_0$, $T$ is standard normally distributed, with cumulative distribution function $\Phi(x)$.
After the test statistic is calculated, we can calculate the p-value, $$p = 2(1-\Phi(\lvert T \rvert)),$$ and if $p < \alpha$, we reject $H_0$ and accept $H_1$.
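As a concrete illustration, here is a minimal sketch of this test in R on simulated data. The function name z_test and the parameter values are made up for this example; they are not part of the derivation.
# Minimal sketch of the two-sided one-sample z-test (sigma assumed known)
z_test <- function (x, mu0, sigma) {
  T <- (mean(x) - mu0) / (sigma / sqrt(length(x)))   # test statistic
  p <- 2 * (1 - pnorm(abs(T)))                       # two-sided p-value
  c(statistic = T, p.value = p)
}
set.seed(1)
z_test(rnorm(50, mean = 0.3, sd = 1), mu0 = 0, sigma = 1)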
Now assume that $H_0$ is not correct. Obviously, if $\overline{x}$ is much larger or much smaller than $\mu_0$, we would assume that the true mean, $\mu$, is also not equal to $\mu_0$. We would expect that our hypothesis test, given repeated random sampling from the population, would reject more often than if, say, we were to consistently find that $\overline{x}$ were only a little bit larger or a little bit smaller than $\mu_0$. This raises the question of how often the test correctly rejects $H_0$ (i.e. when $H_0$ is not correct). This probability is called power: $$\text{Power} = P(\text{reject $H_0$ | $H_0$ not true}).$$
However, there is another issue. If $H_0$ is in fact correct, we would expect the test to (falsely) reject it in only 5% of cases (or whatever your $\alpha$ is). For the test considered here, this holds exactly by construction, as we will show below. But in reality, many hypothesis tests exhibit some size distortion, especially when the conditions for their application are not met exactly or the sample size is small. In the construction of hypothesis tests, it is not just important that the test rejects if $H_0$ is not true, but also that, for every $\alpha$ chosen by the researcher, the test rejects a true $H_0$ in exactly $\alpha\cdot100\%$ of cases. You could achieve perfect power by always rejecting, but this would clearly be a terrible test because you would also reject in those cases where $H_0$ is true! This probability of rejection if $H_0$ is true is called size: $$\text{Size} = P(\text{reject $H_0$ | $H_0$ true}).$$
Note the similarity between the probabilistic definitions of size and power. The only difference is whether $H_0$ is true or not.
The power of a test can be determined with rather involved calculations, and there are several websites and computer programs that can do it. (Size is more important for theoreticians, as most tests that we commonly use have excellent properties in this regard.) However, in this note, I want to present an alternative approach: calculating these probabilities by using the distribution of the p-values themselves. This is attractive because we know exactly when we reject $H_0$: precisely if $p < \alpha$!
Let me now derive the distribution of the p-value.
\begin{align*} F_p(\pi) &= P(p \leq \pi)\\ &= P(2(1-\Phi(\lvert T \rvert)) \leq \pi)\\ &= P(\Phi(\lvert T \rvert) \geq 1-\pi/2)\\ &= P(\lvert T \rvert \geq \Phi^{-1}(1-\pi/2))\\ &= P(T \geq \Phi^{-1}(1-\pi/2))+P(T \leq -\Phi^{-1}(1-\pi/2))\\ &= 1-F_T(\Phi^{-1}(1-\pi/2))+F_T(-\Phi^{-1}(1-\pi/2)). \end{align*}
In the first line, we used the definition of the cumulative distribution function, and in the second line the definition of the p-value (for the test at hand). The third line rearranges the inequality. The fourth line applies the inverse of the c.d.f. (the quantile function), which is strictly increasing and therefore preserves the inequality. The fifth line resolves the absolute value ($\lvert T \rvert \geq k$ means that either $T \geq k$ or $T \leq -k$). In the sixth line, we again used the definition of the c.d.f.
Obviously, the distribution of the p-value depends on the distribution of the test statistic $T$. How is it distributed? Well, that depends—perhaps most notably, it depends on whether $H_0$ is correct or not.
Let us first consider the easy case: $H_0$ is true. Let us show that theoretically, the z-test falsely rejects in exactly $\alpha\cdot100\%$ of cases. If $H_0$ is true, what can be said about the distribution of $T$? It is standard normally distributed. Why? Since the data are normal, so is $\overline{x}$, and hence $T$. We know that $\mu = \mu_0$, and therefore $E[\overline{x}] = \frac{1}{n} \sum_i E[x_i] = \frac{1}{n} n \mu_0 = \mu_0$, where the penultimate equality follows from our assumption that $H_0$ is correct and the linearity of the expectation has been applied. Then, $E[T] = 0$. Moreover, $V[\overline{x}] = \sigma^2/n$, so dividing by $\sigma/\sqrt{n}$ gives $V[T] = 1$. It therefore follows that $T\rvert_{H_0} \sim N(0,1)$ and we have that $F_T(x) = \Phi(x)$. Hence,
\begin{align*} F_p(\pi) &= 1-\Phi(\Phi^{-1}(1-\pi/2))+\Phi(-\Phi^{-1}(1-\pi/2))\\ &= 1-(1-\pi/2) + \Phi(-\Phi^{-1}(1-\pi/2))\\ &= 1-(1-\pi/2) + 1-\Phi(\Phi^{-1}(1-\pi/2))\\ &= 1-(1-\pi/2) + 1-(1-\pi/2)\\ &= \pi, \end{align*}
which shows that for all $\alpha$ we have that the size (probability of false rejection) is exactly $\alpha$, since $F_p(\alpha) = \alpha$. Sometimes, this famous result is referred to as the uniform distribution of the p-value under the null, something that should hold for all tests, at least asymptotically. [(The third line follows from the symmetry of the standard normal distribution.)]{.small}
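We can illustrate this result with a quick simulation. The sketch below repeatedly draws samples for which $H_0$ is true and computes the z-test p-value; the parameter values are arbitrary and only serve as an illustration. The empirical rejection rate at $\alpha = 0.05$ should then be close to $0.05$, and a histogram of the p-values should look approximately flat.
# Simulation sketch: the p-value is uniform under H0 (illustrative values)
set.seed(42)
pvals <- replicate(10000, {
  x <- rnorm(30, mean = 0, sd = 2)      # H0: mu = 0 is true, sigma = 2 known
  T <- (mean(x) - 0) / (2 / sqrt(30))
  2 * (1 - pnorm(abs(T)))
})
mean(pvals < 0.05)  # empirical size, should be close to 0.05
hist(pvals)         # should look approximately uniform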
Having derived the exact distribution of the p-value allows us to compute the probability of rejection if $H_0$ is not true. As stated above, this probability is referred to as power.
If $H_0$ is not true, $T$ is no longer standard normally distributed. However, it is still normally distributed: the data are normal, so their mean $\overline{x}$ is normal, and $T$ is just a shifted and scaled version of $\overline{x}$. Let $\mu_1$ be the true mean, with $\mu_1 \neq \mu_0$, so that $H_0$ is not correct. We then have to find the expectation and the variance of $T$.
\begin{align*} E\left[\frac{\overline{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}\right] &= \frac{E[\overline{x} - \mu_0]}{\frac{\sigma}{\sqrt{n}}}\\ &= \frac{E[\overline{x}] - \mu_0}{\frac{\sigma}{\sqrt{n}}}\\ &= \frac{\frac{1}{n} \sum_i E[x_i] - \mu_0}{\frac{\sigma}{\sqrt{n}}}\\ &= \frac{\frac{1}{n} n \mu_1 - \mu_0}{\frac{\sigma}{\sqrt{n}}}\\ &= \frac{\mu_1 - \mu_0}{\frac{\sigma}{\sqrt{n}}}. \end{align*}
The variance remains equal to $1$: subtracting the constant $\mu_0$ does not change the variance, and we still divide by $\sigma/\sqrt{n}$, the standard deviation of $\overline{x}$.
Hence, $$T\rvert_{H_1} \sim N\left(\frac{\mu_1 - \mu_0}{\frac{\sigma}{\sqrt{n}}}, 1\right).$$ This uniquely identifies the cumulative distribution function of $T$ under the alternative hypothesis, $F_T = F_{T\rvert H_1}$.
Using the definition of power,
\begin{align*} \text{Power} &= P(\text{reject $H_0$ | $H_0$ not true})\\ &= F_p(\text{$\alpha$ | $H_0$ not true})\\ &= 1-F_{T\rvert H_1}(\Phi^{-1}(1-\alpha/2))+F_{T\rvert H_1}(-\Phi^{-1}(1-\alpha/2)). \end{align*}
$F_{T\rvert H_1}$, the c.d.f. of the test statistic under the alternative, does not necessarily follow trivially; but in our case it does, since $T$ is still normally distributed. Note that power depends on the difference between the hypothesized mean and the true mean, divided by the standard deviation (this quantity is also called the effect size), as well as on the sample size. Power is therefore a more nuanced concept than size, and reporting it in a scientific paper is nontrivial.
We can write this down as an R function:
# Analytic power of the two-sided one-sample z-test
power <- function (mu0, mu1, sigma, n, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha/2), mean = (mu1 - mu0)/(sigma/sqrt(n))) +
    pnorm(-qnorm(1 - alpha/2), mean = (mu1 - mu0)/(sigma/sqrt(n)))
}
For example, if we want to test $H_0: \mu = 0$, but the true mean is $\mu_1 = 1$ and we have $\sigma = 500$ and $n = 1{,}000{,}000$, power is
power(0, 1, 500, 1000000) # 0.5160053
which is confirmed by programs such as G*Power and various online power calculators. Note, however, that some of these sources may use a t-test in place of the z-test. This is relevant if $n$ is small. For this example, I chose a large $n$ so that there is nearly no difference between the t- and the z-tests. It is quite easy to do the above derivations using the t-test as well.
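As a sanity check, we can also approximate this rejection probability by simulation. The configuration below ($\mu_0 = 0$, $\mu_1 = 1$, $\sigma = 2$, $n = 25$) is made up purely for illustration; the simulated rejection rate should be close to the analytic value.
# Monte Carlo check of the power formula (illustrative values)
set.seed(123)
reject <- replicate(10000, {
  x <- rnorm(25, mean = 1, sd = 2)       # data generated under mu1 = 1
  T <- (mean(x) - 0) / (2 / sqrt(25))    # z statistic for H0: mu = 0
  2 * (1 - pnorm(abs(T))) < 0.05         # TRUE if the test rejects
})
mean(reject)          # should be close to power(0, 1, 2, 25)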
If the null hypothesis is true, you will find that the "power" is always equal to $\alpha$. But this is no longer the power; it is the size of the test:
power(0, 0, 5, 100) # 0.05
This is because power and size are ultimately the same concept: They give the probability of rejection. Their only difference is whether the null hypothesis is true or not. It is convenient to find these expressions using the distribution of the p-value since rejection happens precisely if $p < \alpha$, whose probability is $F_p(\alpha)$.
In this sense, the z-test is exact. However, in reality, it is unlikely that the variance is known; it will generally have to be estimated from the data. Under $H_0$, $T$ will then follow Student's $t$ distribution with $n-1$ degrees of freedom, and the $p$-value is computed from this distribution as well. The size is therefore still exactly $\alpha$, but the power will generally be somewhat smaller than in our z-test.
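For completeness, here is a sketch of the corresponding power calculation for the t-test. Under the alternative, the test statistic follows a noncentral $t$ distribution with $n-1$ degrees of freedom and noncentrality parameter $(\mu_1-\mu_0)/(\sigma/\sqrt{n})$, while the critical value comes from the central $t$ distribution. The function name powert is mine; its result should agree with power.t.test() from base R.
# Power of the two-sided one-sample t-test via the noncentral t distribution
powert <- function (mu0, mu1, sigma, n, alpha = 0.05) {
  ncp <- (mu1 - mu0)/(sigma/sqrt(n))      # noncentrality parameter
  crit <- qt(1 - alpha/2, df = n - 1)     # critical value under H0
  1 - pt(crit, df = n - 1, ncp = ncp) + pt(-crit, df = n - 1, ncp = ncp)
}
powert(0, 1, 5, 100)
# compare: power.t.test(n = 100, delta = 1, sd = 5, type = "one.sample")$power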
We can reparametrize our function in terms of the effect size:
# The same power function, parametrized by the effect size (mu1 - mu0)/sigma
powereff <- function (effect, n, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha/2), mean = effect*sqrt(n)) +
    pnorm(-qnorm(1 - alpha/2), mean = effect*sqrt(n))
}
For example, we can now plot the achieved power by fixing some effect size and varying the sample size. Consider a small effect size, $0.1$, and let $n$ range from $30$ to $2500$. This produces the following plot:
# Power as a function of n for a fixed effect size of 0.1
plot(function (n) powereff(0.1, n), xlim = c(30, 2500),
     ylim = c(0, 1), xlab = "n", main = "Power and sample size",
     ylab = "Power")
abline(h = 1, col = "red")      # maximum attainable power
abline(h = 0.05, col = "blue")  # size, i.e. the rejection rate when H0 is true
As is clearly visible, such a small effect size requires quite a large number of samples to be detectable at all. Pretty disturbing, isn't it?