Max R. P. Grossmann

Conditional distributions

Posted: 2020-07-10 · Last updated: 2023-12-02

If average lifespan is 80 years, and you know that someone is already 70 years old, how high is that person’s expected lifespan?
You might want to answer 80, but remember that the average lifespan of 80 years given above is the unconditional lifespan that holds for a newborn. However, if someone has already lived for 70 years, the probability of them having a lifespan of less than 70 is zero, and the probability mass of these lower lifespans can have no influence on this person’s expected lifespan!

First, we investigate this issue using a small simulation. After that, we approach the problem more analytically.
In our Monte Carlo simulation, we generate 1,000,000 normally distributed lifespans with $ \mu = 80 $ and $ \sigma = 10$. We then remove those shorter than 70 years and calculate the mean of the remaining lifespans. This way, we can make an inference about the expected lifespan of our 70-year-old specimen. The R code is as follows:

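# Fix the seed so that the simulation is reproducible
set.seed(20170830)
# Draw one million lifespans from a normal distribution with mean 80 and sd 10
lifespans <- rnorm(n = 1000000, mean = 80, sd = 10)
# Keep only the lifespans longer than 70 years
lifespans2 <- lifespans[lifespans > 70]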

Typing mean(lifespans2) returns 82.87548 years. According to our simulation, this is the expected lifespan of a person known to have a lifespan larger than 70 years. This is a pretty amazing result: Someone who has persevered through the dangers of young age can expect a lifespan almost three years above the unconditional average! This effect is even stronger at higher ages. For example, under the same assumptions, the expected lifespan of a 79-year-old is about 87.4 years, more than seven years above the unconditional average of 80.
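The filtering step can be repeated for other ages. The following is a minimal sketch under the same assumptions as the simulation above; the helper `conditional_mean` and the chosen ages are illustrative and not part of the original code.

```r
set.seed(20170830)
lifespans <- rnorm(n = 1000000, mean = 80, sd = 10)

# Hypothetical helper: mean lifespan among simulated people who outlived `age`
conditional_mean <- function(age) mean(lifespans[lifespans > age])

# Illustrative ages (an assumption, chosen only for demonstration)
ages <- c(50, 60, 70, 79, 90)
sim_means <- sapply(ages, conditional_mean)

# Conditional expected lifespan and its excess over the unconditional mean of 80
data.frame(age = ages, expected_lifespan = sim_means, excess = sim_means - 80)
```

The `excess` column shows how far the conditional expectation lies above the unconditional mean, and it grows with the age conditioned on.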

Therefore, it is not proper to make statements like “I’m 70 years old, meaning I only have 10 years left!” There are two main fallacies involved here: 1. The expected lifespan (whether unconditional or conditional) does not mean that everyone dies at some constant age; it is an average with outliers on both sides. 2. As shown above, 80 years is the unconditional expectation (for a newborn), but if we know that someone cannot have a lifespan of, say, less than 70 years, we must incorporate this information.

We will now approach the problem more analytically. Consider a continuous random variable $ X $ that is characterized by a probability density function (p.d.f.) $f(x)$ and a cumulative distribution function (c.d.f.) $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t) \, dt$. This is the *unconditional* distribution.

We now want to look at the conditional distribution of $X$: If $X$ is known to lie between $x_l$ and $x_u$, how does its distribution change? Formally, the condition is $x_l < X < x_u$, and we are looking for $P(X\,\vert\,x_l < X < x_u)$.

We know from the principles of probability that $P(A \cap B) = P(A) \cdot P(B\,\vert\,A)$. By solving for $P(B\,\vert\,A)$, we find $P(B\,\vert\,A) = \frac{P(A \cap B)}{P(A)}$ and by substituting, it follows that $$P(X\,\vert\,x_l < X < x_u) = \frac{P((x_l < X < x_u) \cap X)}{P(x_l < X < x_u)}.$$

We can immediately determine the value of $P(x_l < X < x_u)$: From the definition of the c.d.f., it follows that $P(x_l < X < x_u) = F(x_u)-F(x_l)$.
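As a quick numerical illustration, the sketch below compares $F(x_u)-F(x_l)$ with the share of simulated lifespans that fall between the bounds. It reuses the normal distribution from the simulation; the bounds of 70 and 90 are assumptions picked purely for demonstration.

```r
set.seed(20170830)
lifespans <- rnorm(n = 1000000, mean = 80, sd = 10)

# Illustrative bounds (an assumption, not taken from the text)
x_l <- 70
x_u <- 90

# Probability from the c.d.f.: F(x_u) - F(x_l)
p_exact <- pnorm(x_u, mean = 80, sd = 10) - pnorm(x_l, mean = 80, sd = 10)

# Share of simulated lifespans strictly between the bounds
p_sim <- mean(lifespans > x_l & lifespans < x_u)

c(exact = p_exact, simulated = p_sim)
```

Both numbers should agree up to simulation noise.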

But what is $P((x_l < X < x_u) \cap X)$? This is actually pretty simple. Which probability are we looking for? In fact, we are looking for the probability distribution of $X$ if $x_l < X < x_u$ holds at that $X$. The value of $P((x_l < X < x_u) \cap X)$ is therefore $P(X)$ if the condition holds and zero otherwise. By incorporating this insight and combining it with the results obtained above, we find that

$$ P(X\,\vert\,x_l < X < x_u) = \left\{ \begin{array}{ll} 0 & \mbox{if } X < x_l, \\
\frac{P(X)}{F(x_u)-F(x_l)} & \mbox{if } x_l < X < x_u, \\
0 & \mbox{if } X > x_u. \end{array} \right. $$

Since $P(X)$ here stands for the p.d.f. of $X$, we ultimately have

$$ f(x\,\vert\,x_l < x < x_u) = \left\{ \begin{array}{ll} 0 & \mbox{if } x < x_l, \\
\frac{f(x)}{F(x_u)-F(x_l)} & \mbox{if } x_l < x < x_u, \\
0 & \mbox{if } x > x_u. \end{array} \right. $$
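The conditional density can be sketched in R for the normal distribution used in the simulation. The helper `dcond` and its parameter names are hypothetical; numerically integrating it is a rough sanity check that the density is properly normalized.

```r
# Hypothetical helper (not from the original post): density of X ~ N(mu, sigma)
# conditional on x_l < X < x_u
dcond <- function(x, x_l, x_u, mu = 80, sigma = 10) {
  scale <- pnorm(x_u, mu, sigma) - pnorm(x_l, mu, sigma)  # F(x_u) - F(x_l)
  ifelse(x > x_l & x < x_u, dnorm(x, mu, sigma) / scale, 0)
}

# Integrate the conditional density over the allowed range (70, Inf);
# extra arguments to integrate() are passed on to dcond
total <- integrate(dcond, lower = 70, upper = Inf, x_l = 70, x_u = Inf)$value
total

# Outside the bounds, the conditional density is zero
dcond(60, x_l = 70, x_u = Inf)
```

The value of `total` should be 1 up to the numerical tolerance of `integrate`.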

Note that $F(x_u)-F(x_l)$ merely acts as a scaling factor to ensure that the integral of $f(x\,\vert\,x_l < x < x_u)$ from $-\infty$ to $\infty$ (or, due to the restriction, from $x_l$ to $x_u$), equals 1. Assuming that we only care about $x$ values that lie between $x_l$ and $x_u$, we can define the restricted c.d.f. as follows:

$$ F(x\,\vert\,x_l < x < x_u) = \int_{x_l}^{x} \frac{f(t)}{F(x_u)-F(x_l)}\,dt. $$

Since $F(x_u)-F(x_l)$ is just a constant scaling factor, we can also write it as

$$ F(x\,\vert\,x_l < x < x_u) = \frac{1}{{F(x_u)-F(x_l)}}\int_{x_l}^{x} f(t)\,dt. $$

Note that this expression for the c.d.f. is only valid for values inside our bounds! Beyond our bounds, the c.d.f. is 0 if $ x < x_l $ and 1 if $ x > x_u$. Analogously, we can define our conditional expectation $E[X\,\vert\,x_l < X < x_u]$:

$$ E[X\,\vert\,x_l < X < x_u] = \frac{1}{{F(x_u)-F(x_l)}}\int_{x_l}^{x_u} x f(x)\,dx. $$

Let us now return to the example above. In our simulation of human lifespan, we assumed a normal distribution. Some distributions are easier to handle. For example, a uniform distribution that is constrained in the manner we describe here would once again become a uniform distribution; only the start and end points would change. The proof is straightforward and left to the reader.

It is well known that a normal distribution with $\mu = 80 $ and $\sigma = 10$ has the following p.d.f.:

$$ f(x) = \frac{\exp\left(-\frac{(x-80)^2}{200}\right)}{\sqrt{200\pi}}. $$

The unconditional c.d.f. $F(x)$ can be found by looking it up. If we set an infinite upper bound ($x_u = \infty$) and a lower bound of 70 ($x_l = 70$), we find that $F(x_u)-F(x_l) = 1 - F(70) \approx 1-0.1587 = 0.8413$. We wanted to calculate $ E[X\,\vert\,70 < X] $, right? By using the above formula and plugging in our scaling factor and the p.d.f., we find

$$ E[X\,\vert\,70 < X] \approx \frac{1}{0.8413}\int_{70}^{\infty} x \frac{\exp\left(-\frac{(x-80)^2}{200}\right)}{\sqrt{200\pi}}\,dx. $$

By using a computer algebra system (I used SAGE), we find that

$$ E[X\,\vert\,70 < X] \approx \frac{1}{0.8413}\cdot \frac{5 \, {\left(8 \, \sqrt{\pi} {\left(\operatorname{erf}\left(\frac{1}{2} \, \sqrt{2}\right) + 1\right)} \sqrt{e} + \sqrt{2}\right)}}{\sqrt{\pi} \sqrt{e}}. $$

… and this is approximately 82.876, incredibly close to our simulation-based estimate of 82.87548! Pretty awesome, isn’t it?
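The same number can also be obtained without a computer algebra system. The sketch below uses R's built-in `pnorm`, `dnorm`, and `integrate` to evaluate the scaling factor and the integral numerically; the variable names are illustrative.

```r
x_l <- 70

# Scaling factor F(x_u) - F(x_l) with x_u = Inf, i.e. 1 - F(70)
scale <- 1 - pnorm(x_l, mean = 80, sd = 10)

# Numerical value of the integral of x * f(x) from 70 to infinity
numerator <- integrate(function(x) x * dnorm(x, mean = 80, sd = 10),
                       lower = x_l, upper = Inf)$value

# Conditional expectation E[X | 70 < X]
cond_mean <- numerator / scale
c(scale = scale, cond_mean = cond_mean)
```

Up to the tolerance of `integrate`, `cond_mean` should match the analytical value of roughly 82.876.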

I made a graph depicting the conditional expectation depending on $x_l$. Enjoy!
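A plot of this kind can be approximated with a few lines of R. This is not the code behind the original graph but a minimal sketch: it swaps the numerical integral for the standard closed form of the mean of a lower-truncated normal distribution, $\mu + \sigma\,\phi(\alpha)/(1-\Phi(\alpha))$ with $\alpha = (x_l-\mu)/\sigma$, which is not derived in the text. The helper name `cond_mean_normal` and the plotted range are assumptions.

```r
# Hypothetical helper: E[X | X > x_l] for X ~ N(mu, sigma), using the
# standard closed form for a lower-truncated normal distribution
cond_mean_normal <- function(x_l, mu = 80, sigma = 10) {
  alpha <- (x_l - mu) / sigma
  mu + sigma * dnorm(alpha) / pnorm(alpha, lower.tail = FALSE)
}

# Illustrative range of lower bounds (an assumption)
x_l <- seq(0, 100, by = 1)
cond_means <- cond_mean_normal(x_l)

plot(x_l, cond_means, type = "l",
     xlab = "Lower bound x_l (current age)",
     ylab = "E[X | X > x_l]")
abline(a = 0, b = 1, lty = 2)  # 45-degree line for reference
```

The curve stays close to 80 for small $x_l$ and lies above the dashed 45-degree line throughout, since the conditional expectation always exceeds the age already reached.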