## Conditional distributions

Last updated: 2020-07-10 12:00.

If average lifespan is 80 years, and you know that someone is already 70 years old, how high is that person’s expected lifespan?

You might want to answer *80*, but remember that the average lifespan of 80 years given above is the unconditional lifespan that holds for a newborn. However, if someone has already lived for 70 years, the probability of them having a lifespan of less than 70 is zero, and the probability mass of these lower lifespans can have no influence on this person’s expected lifespan!

First, we investigate this issue using a small simulation. After that, we approach the problem more analytically.

In our Monte Carlo simulation, we generate 1,000,000 normally distributed lifespans with $ \mu = 80 $ and $ \sigma = 10$. We then remove those shorter than 70 years and calculate the mean of the remaining lifespans. This lets us estimate the expected lifespan of our 70-year-old specimen. The R code is as follows:

```
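set.seed(20170830)  # fixed seed so the result is reproducible
# Draw 1,000,000 lifespans from a normal distribution with mean 80 and sd 10
lifespans <- rnorm(n = 1000000, mean = 80, sd = 10)
# Keep only the lifespans longer than 70 years (vectorized subsetting)
lifespans2 <- lifespans[lifespans > 70]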
```

By typing `mean(lifespans2)`

we get the following result: **82.87548 years**. According to our simulation, this is the expected lifespan of a person *known* to have lived longer than 70 years. This is a pretty amazing result: someone who has persevered through the dangers of young age can expect a lifespan almost three years above the unconditional average! The effect grows with age. For example, under the same model a 79-year-old has an expected lifespan of about 87.4 years, more than seven years above the unconditional average.
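To see how the effect changes with age, the same simulation can be repeated for several cut-off ages. The following is only a sketch under the same assumed model (normal lifespans with mean 80 and standard deviation 10); the names `cond_mean_sim` and `ages` are mine and purely illustrative.

```r
# Sketch: simulated expected lifespan, given survival to a certain age.
# Assumes lifespans ~ N(mean = 80, sd = 10), as in the simulation above.
set.seed(20170830)
lifespans <- rnorm(n = 1000000, mean = 80, sd = 10)

# Mean of all simulated lifespans that exceed `age` (illustrative helper)
cond_mean_sim <- function(age) mean(lifespans[lifespans > age])

ages <- c(50, 70, 79, 90)  # arbitrary example ages
result <- data.frame(age = ages,
                     expected_lifespan = sapply(ages, cond_mean_sim))
print(result)
```

The older the person already is, the further the conditional mean moves above 80, because more and more of the lower part of the distribution is ruled out.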

Therefore, it is not proper to make statements like “I’m 70 years old, meaning I only have 10 years left!” There are two main fallacies involved here: 1. The expected lifespan (whether unconditional or conditional) does not mean that everyone dies at some fixed age; it is an average, with individual outcomes on both sides. 2. As shown above, *80 years* is the unconditional expectation (for a newborn), but if we know that someone cannot have a lifespan of, say, less than 70 years, we must incorporate this information.

We will now approach the problem more analytically. Consider a continuous random variable $ X $ that is characterized by a probability density function (p.d.f.) $f(x)$ and a cumulative distribution function (c.d.f.) $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t) \, dt$. This is the *unconditional* distribution.

We now want to look at the conditional distribution of $X$: if $X$ is known to lie between $x_l$ and $x_u$, how does its distribution change? Formally, the condition is $x_l < X < x_u$, and we are looking for $P(X\,\vert\,x_l < X < x_u)$.

We know from the principles of probability that $P(A \cap B) = P(A) \cdot P(B\,\vert\,A)$. By solving for $P(B\,\vert\,A)$, we find $P(B\,\vert\,A) = \frac{P(A \cap B)}{P(A)}$ and by substituting, it follows that $$P(X\,\vert\,x_l < X < x_u) = \frac{P((x_l < X < x_u) \cap X)}{P(x_l < X < x_u)}.$$

We can immediately determine the value of $P(x_l < X < x_u)$: From the definition of the c.d.f., it follows that $P(x_l < X < x_u) = F(x_u)-F(x_l)$.
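The identity $P(x_l < X < x_u) = F(x_u)-F(x_l)$ is easy to check numerically. Here is a small sketch for a normal distribution with mean 80 and standard deviation 10; the bounds 70 and 90 are arbitrary values I picked for illustration.

```r
# Sketch: compare F(x_u) - F(x_l) with a simulated frequency.
# Assumes X ~ N(mean = 80, sd = 10); the bounds are illustrative only.
x_l <- 70
x_u <- 90

# Exact probability from the normal c.d.f.
p_exact <- pnorm(x_u, mean = 80, sd = 10) - pnorm(x_l, mean = 80, sd = 10)

# Share of simulated draws that fall between the bounds
set.seed(1)
x <- rnorm(n = 1000000, mean = 80, sd = 10)
p_sim <- mean(x > x_l & x < x_u)

print(c(exact = p_exact, simulated = p_sim))
```

Both numbers should agree up to simulation noise.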

But what is $P((x_l < X < x_u) \cap X)$? This is actually pretty simple. Which probability are we looking for? In fact, we are looking for the probability distribution of $X$ if $x_l < X < x_u$ holds at that $X$. The value of $P((x_l < X < x_u) \cap X)$ is therefore $P(X)$ if the condition holds and zero otherwise. By incorporating this insight and combining it with the results obtained above, we find that

$$
P(X\,\vert\,x_l < X < x_u) =
\left\{
\begin{array}{ll}
0 & \mbox{if } X < x_l, \\
\frac{P(X)}{F(x_u)-F(x_l)} & \mbox{if } x_l < X < x_u, \\
0 & \mbox{if } X > x_u.
\end{array}
\right.
$$

Since $P(X)$ is the p.d.f. of $X$, we ultimately have

$$
f(x\,\vert\,x_l < x < x_u) =
\left\{
\begin{array}{ll}
0 & \mbox{if } x < x_l, \\
\frac{f(x)}{F(x_u)-F(x_l)} & \mbox{if } x_l < x < x_u, \\
0 & \mbox{if } x > x_u.
\end{array}
\right.
$$
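The conditional density can be written down in R almost verbatim. This is a minimal sketch for the normal case; the function name `dcond` and its arguments are my own, and the defaults of 80 and 10 are simply the values from the lifespan example.

```r
# Sketch of the conditional density f(x | x_l < x < x_u) for a normal X.
# `dcond` is an illustrative name; mean = 80 and sd = 10 are assumed defaults.
dcond <- function(x, x_l, x_u, mean = 80, sd = 10) {
  scale <- pnorm(x_u, mean, sd) - pnorm(x_l, mean, sd)  # F(x_u) - F(x_l)
  # Rescaled density inside the bounds, zero outside
  ifelse(x > x_l & x < x_u, dnorm(x, mean, sd) / scale, 0)
}

# Numerical check: the conditional density integrates to 1 over the bounds
total <- integrate(dcond, lower = 70, upper = 90, x_l = 70, x_u = 90)$value
print(total)
```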

Note that $F(x_u)-F(x_l)$ merely acts as a scaling factor which ensures that the integral of $f(x\,\vert\,x_l < x < x_u)$ from $-\infty$ to $\infty$ (or, due to the restriction, from $x_l$ to $x_u$) equals 1. Assuming that we only care about $x$ values that lie between $x_l$ and $x_u$, we can define the restricted c.d.f. as follows:

$$ F(x\,\vert\,x_l < x < x_u) = \int_{x_l}^{x} \frac{f(t)}{F(x_u)-F(x_l)}\,dt. $$

Since $F(x_u)-F(x_l)$ is just a constant scaling factor, we can also write it as

$$ F(x\,\vert\,x_l < x < x_u) = \frac{1}{{F(x_u)-F(x_l)}}\int_{x_l}^{x} f(t)\,dt. $$
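Because the remaining integral is just $F(x) - F(x_l)$, the restricted c.d.f. needs no numerical integration when $F$ is available. A minimal sketch for the normal case follows; `pcond` is a name I made up, and clamping the result to the range from 0 to 1 is my own choice for handling values outside the bounds.

```r
# Sketch of the restricted c.d.f. for a normal X, using
# integral of f from x_l to x = F(x) - F(x_l).
# `pcond` is an illustrative name; mean = 80 and sd = 10 are assumed defaults.
pcond <- function(x, x_l, x_u, mean = 80, sd = 10) {
  scale <- pnorm(x_u, mean, sd) - pnorm(x_l, mean, sd)  # F(x_u) - F(x_l)
  p <- (pnorm(x, mean, sd) - pnorm(x_l, mean, sd)) / scale
  pmin(pmax(p, 0), 1)  # clamp: 0 below x_l, 1 above x_u
}

print(pcond(c(60, 70, 80, 90, 100), x_l = 70, x_u = 90))
```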

Note that our c.d.f. is only valid for values inside our bounds! Beyond our bounds, the c.d.f. is 0 for $ x < x_l $ and 1 for $ x > x_u$. Analogously, we can define our conditional expectation $E[X\,\vert\,x_l < X < x_u]$:

$$ E[X\,\vert\,x_l < X < x_u] = \frac{1}{{F(x_u)-F(x_l)}}\int_{x_l}^{x_u} x f(x)\,dx. $$
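The conditional expectation can be evaluated numerically with R's `integrate`. This is a sketch for the normal case only; `cond_expectation` is a hypothetical helper name, and the bounds in the example call are arbitrary.

```r
# Sketch: E[X | x_l < X < x_u] for a normal X by numerical integration.
# `cond_expectation` is an illustrative name; mean = 80, sd = 10 are assumed.
cond_expectation <- function(x_l, x_u, mean = 80, sd = 10) {
  scale <- pnorm(x_u, mean, sd) - pnorm(x_l, mean, sd)  # F(x_u) - F(x_l)
  # Integral of x * f(x) between the bounds
  numerator <- integrate(function(x) x * dnorm(x, mean, sd),
                         lower = x_l, upper = x_u)$value
  numerator / scale
}

# Symmetric bounds around the mean leave the expectation at the mean
print(cond_expectation(70, 90))
```

With bounds that are symmetric around the mean, the result stays at 80; shifting only one bound pulls the expectation away from it.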

Let us now return to the example above. In our simulation of human lifespan, we assumed a normal distribution. Some distributions are easier to handle. For example, a uniform distribution that is constrained in the manner we describe here would once again become a uniform distribution; only the start and end points would change. The proof is straightforward and left to the reader.
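For readers who want to check their work on the uniform case, here is one way the argument can go. It is only a sketch, and it assumes (in my notation, not used above) that $X$ is uniform on $[a, b]$ and that the bounds satisfy $a \le x_l < x_u \le b$. Then $f(x) = \frac{1}{b-a}$ on $[a, b]$ and $F(x_u)-F(x_l) = \frac{x_u - x_l}{b-a}$, so for $x_l < x < x_u$:

$$ f(x\,\vert\,x_l < x < x_u) = \frac{f(x)}{F(x_u)-F(x_l)} = \frac{\frac{1}{b-a}}{\frac{x_u - x_l}{b-a}} = \frac{1}{x_u - x_l}. $$

This is constant in $x$, which is exactly the density of a uniform distribution on $(x_l, x_u)$.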

It is well known that a normal distribution with $\mu = 80 $ and $\sigma = 10$ has the following p.d.f.:

$$ f(x) = \frac{\exp\left(-\frac{(x-80)^2}{200}\right)}{\sqrt{200\pi}}. $$

The values of the unconditional c.d.f. $F(x)$ can be looked up in a table or computed numerically. If we set an infinite upper bound ($x_u = \infty$) and a lower bound of 70, we find that $F(x_u)-F(x_l) = 1 - F(70) \approx 1-0.1587 = 0.8413$. We wanted to calculate $ E[X\,\vert\,70 < X] $, right? By using the above formula and plugging in our scaling factor and the p.d.f., we find

$$ E[X\,\vert\,70 < X] \approx \frac{1}{0.8413}\int_{70}^{\infty} x \frac{\exp\left(-\frac{(x-80)^2}{200}\right)}{\sqrt{200\pi}}\,dx. $$

By using a computer algebra system (I used SAGE), we find that

$$ E[X\,\vert\,70 < X] \approx \frac{1}{0.8413}\cdot \frac{5 \, {\left(8 \, \sqrt{\pi} {\left(\operatorname{erf}\left(\frac{1}{2} \, \sqrt{2}\right) + 1\right)} \sqrt{e} + \sqrt{2}\right)}}{\sqrt{\pi} \sqrt{e}}. $$

… and this is approximately **82.876**, incredibly close to our simulation-based estimate of **82.87548**! Pretty awesome, isn’t it?
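If no computer algebra system is at hand, the same number can be reproduced in R. The sketch below evaluates the integral numerically and also plugs numbers into the closed-form expression. Base R has no `erf`, so I define it through the standard identity $\operatorname{erf}(z) = 2\,\Phi(z\sqrt{2}) - 1$; this helper is my addition.

```r
# Sketch: reproduce E[X | 70 < X] for X ~ N(mean = 80, sd = 10).
# erf() is not part of base R; defined here via the normal c.d.f.
erf <- function(z) 2 * pnorm(z * sqrt(2)) - 1

# Scaling factor F(x_u) - F(x_l) with x_u = Inf and x_l = 70
scale <- 1 - pnorm(70, mean = 80, sd = 10)

# Numerical value of the integral of x * f(x) from 70 to infinity
numerator <- integrate(function(x) x * dnorm(x, mean = 80, sd = 10),
                       lower = 70, upper = Inf)$value

# The closed-form expression for the same integral
closed_form <- 5 * (8 * sqrt(pi) * (erf(sqrt(2) / 2) + 1) * sqrt(exp(1)) +
                      sqrt(2)) / (sqrt(pi) * sqrt(exp(1)))

print(c(numerical = numerator / scale, closed_form = closed_form / scale))
```

Both routes should agree with each other and with the simulation up to rounding.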

I made a graph depicting the conditional expectation depending on $x_l$. Enjoy!
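A graph of this kind could be produced along the following lines. This is a sketch, not the code behind the original figure. Instead of integrating numerically for every age, it uses the standard closed form for the mean of a normal distribution truncated from below, $\mu + \sigma\,\varphi(\alpha)/(1-\Phi(\alpha))$ with $\alpha = (x_l-\mu)/\sigma$; the function name `cond_mean` and the age grid are my own choices.

```r
# Sketch: conditional expected lifespan as a function of the lower bound x_l.
# Uses the closed-form mean of a normal distribution truncated from below
# (a swapped-in shortcut, not the integration route taken in the text).
cond_mean <- function(x_l, mu = 80, sigma = 10) {
  alpha <- (x_l - mu) / sigma
  mu + sigma * dnorm(alpha) / pnorm(alpha, lower.tail = FALSE)
}

ages <- seq(0, 100, by = 1)   # illustrative grid of current ages
expected <- cond_mean(ages)   # vectorized over ages

plot(ages, expected, type = "l",
     xlab = "Current age (lower bound)",
     ylab = "Expected lifespan",
     main = "Conditional expected lifespan")
abline(a = 0, b = 1, lty = 2)  # dashed reference line: lifespan = current age
```

The curve stays above the dashed line everywhere, since the expected lifespan of someone who is alive always exceeds their current age.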