This is just a quick and dirty note on how to derive the OLS estimator using matrix calculus. When I wrote this note, it was surprisingly difficult to find an uncluttered derivation of it – so here it is.

I assume you have knowledge of, say, the Gauss-Markov theorem and you know all the terms that are involved here. Ideally, you already learned how to derive the OLS estimator and you are just visiting to freshen up your memory. I will repeat as little as possible.

We have a design matrix $X$, and the observations of our dependent variable are collected in the column vector $y$. We assume random errors $\varepsilon$ with mean 0 and variance $\sigma^2$. We are looking for a good estimate $\hat{\beta}$ of the (true) coefficients $\beta$. The regression model is:

$$ y = X\beta + \varepsilon. $$

First of all, what is a "good estimate"? We say that an estimate of our coefficients is good if it minimizes the sum of squares of the residuals – after all, that's why it's called OLS: ordinary *least squares*. In other words, our $\hat{\beta}$ should minimize the expression $\varepsilon^{T} \varepsilon$ (the superscript $T$ indicates a transposed matrix: By transposing the column vector $\varepsilon$ and multiplying this transpose with $\varepsilon$, we are indeed left with a single number, the sum of squares of the error terms).
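To make the notation concrete, here is a minimal NumPy sketch showing that $\varepsilon^{T} \varepsilon$ is just the sum of squared entries. The vector `eps` is made up purely for illustration and does not come from any data.

```python
import numpy as np

# Hypothetical error vector, chosen only for illustration.
eps = np.array([1.0, -2.0, 0.5])

# eps^T eps: the inner product of the vector with itself.
inner_product = eps @ eps

# The same quantity written as an explicit sum of squares.
sum_of_squares = np.sum(eps ** 2)

print(inner_product, sum_of_squares)
```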

We are looking for $\hat{\beta}$. $X$ is known, and so is $y$. We want to minimize $\varepsilon^{T} \varepsilon$, so we first express $\varepsilon$ in terms of the known quantities. Subtracting $X\beta$ from both sides of the regression model gives

$$ \varepsilon = y - X\beta. $$

We apply our minimization objective:

$$ \underset{\beta}{\text{min}} \,\, \varepsilon^{T} \varepsilon = (y - X\beta)^{T} (y - X\beta). $$

We multiply out the product, using that for two matrices $A$ and $B$, $(AB)^T = B^T A^T$, so that $(y - X\beta)^T = y^T - \beta^T X^T$:

$$ \underset{\beta}{\text{min}} \,\, y^T y - y^T X \beta - \beta^T X^T y + \beta^T X^T X \beta. $$

Now note that the first term is irrelevant for the minimization, as it does not depend on $\beta$. The second and third terms are identical: each is a scalar, and a scalar equals its own transpose, so $y^T X \beta = (y^T X \beta)^T = \beta^T X^T y$. Our minimization problem becomes

$$ \underset{\beta}{\text{min}} \,\, - 2\beta^T X^T y + \beta^T X^T X \beta. $$
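If you want to convince yourself of the expansion, a quick numerical check can help. The sketch below uses random data (the shapes, the seed, and all variable names are my own assumptions) and compares the original sum of squares with the expanded form, and the two cross terms with each other.

```python
import numpy as np

# Arbitrary random data; shapes and seed are assumptions for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta = rng.normal(size=3)

# Sum of squares computed directly from epsilon = y - X beta.
residual = y - X @ beta
direct = residual @ residual

# The four-term expansion of (y - X beta)^T (y - X beta).
expanded = y @ y - y @ X @ beta - beta @ X.T @ y + beta @ X.T @ X @ beta

# The two cross terms are scalars and should coincide.
cross_1 = y @ X @ beta
cross_2 = beta @ X.T @ y

# Expansion with the cross terms merged into -2 beta^T X^T y.
simplified = y @ y - 2 * beta @ X.T @ y + beta @ X.T @ X @ beta

print(direct, expanded, simplified)
```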

We solve by differentiating with respect to $\beta$. (If you do not know how to differentiate matrices, check this out.)

(By the rules of matrix calculus, we are roughly speaking "differentiating with respect to $\beta^{T}$". Also note that the quadratic form $\beta^{T} X^{T} X \beta$ is akin to $\beta^2$ in ordinary calculus, and we differentiate accordingly.)
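For reference, these are the two standard identities that the next step relies on, written in the convention where the derivative with respect to a column vector is again a column vector. The symbols $a$ (a constant vector) and $A$ (a constant symmetric matrix) are placeholders introduced here:

$$ \frac{\partial}{\partial \beta} \left(\beta^T a\right) = a, \qquad \frac{\partial}{\partial \beta} \left(\beta^T A \beta\right) = 2 A \beta. $$

In our case $a = X^T y$ and $A = X^T X$, which is symmetric because $\left(X^T X\right)^T = X^T X$.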

$$ \frac{\partial}{\partial \beta} \left(-2 \beta^T X^T y + \beta^T X^T X \beta\right) = -2 X^T y + 2 X^{T} X \beta $$
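A derivative like this is easy to sanity-check numerically. The following sketch compares the analytic gradient with a central finite-difference approximation on random data. The data, the step size `h`, and the helper `objective` are assumptions of mine, not part of the derivation.

```python
import numpy as np

# Arbitrary random data; shapes and seed are assumptions for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
beta = rng.normal(size=3)

def objective(b):
    """Reduced objective: -2 b^T X^T y + b^T X^T X b."""
    return -2 * b @ X.T @ y + b @ X.T @ X @ b

# Gradient from the formula above.
analytic_grad = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate direction at a time.
h = 1e-6
numeric_grad = np.array([
    (objective(beta + h * e) - objective(beta - h * e)) / (2 * h)
    for e in np.eye(3)
])

# Largest absolute disagreement between the two gradients.
max_abs_diff = np.max(np.abs(analytic_grad - numeric_grad))
print(max_abs_diff)
```

The two gradients should agree up to floating-point noise.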

Our estimate of $\beta$ can only be "good" if the derivative of the sum of squares of the residuals is zero there; this is the necessary condition for a minimum. Because $X^T X$ is positive semidefinite, the objective is convex in $\beta$, so a point where the derivative vanishes is indeed a minimum. We will now use the hat notation $\hat{\beta}$ so that it is clear that we are dealing with our OLS *estimate* of $\beta$. The remaining steps are straightforward:

$$ -2 X^T y + 2 X^{T} X \hat{\beta} = 0 $$

$$ 2 X^{T} X \hat{\beta} = 2 X^T y $$

$$ X^{T} X \hat{\beta} = X^T y $$
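The last equation is a plain linear system in $\hat{\beta}$, so it can be solved numerically without forming an inverse. Here is a small sketch on random data (shapes, seed, and names are assumptions of mine) that solves the system with `np.linalg.solve`, checks that the derivative vanishes at the solution, and compares the sum of squares against a slightly perturbed coefficient vector.

```python
import numpy as np

# Arbitrary random data; shapes and seed are assumptions for illustration.
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

# Solve the linear system X^T X beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The derivative from above, evaluated at beta_hat, should be (numerically) zero.
gradient_at_solution = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

def sum_of_squares(b):
    """Sum of squared residuals for a coefficient vector b."""
    r = y - X @ b
    return r @ r

# Nudging beta_hat in a random direction should not decrease the sum of squares.
perturbed = beta_hat + 0.01 * rng.normal(size=3)

print(gradient_at_solution)
print(sum_of_squares(beta_hat), sum_of_squares(perturbed))
```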

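We "pre-multiply" both sides by $\left(X^{T} X\right)^{-1}$ so that only $\hat{\beta}$ remains on the left-hand side. This requires $X^T X$ to be invertible, which holds when the columns of $X$ are linearly independent. Here, $A^{-1}$ is the inverse of matrix $A$ and $I$ is the identity matrix: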

$$ X^{T} X \hat{\beta} = X^T y $$

$$ \underbrace{\left(X^{T} X\right)^{-1} X^{T} X}_{I} \hat{\beta} = \left(X^{T} X\right)^{-1} X^T y $$

It follows that

$$ \hat{\beta} = \left(X^{T} X\right)^{-1} X^T y. $$
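As a quick illustration, the closed-form estimator translates almost literally into NumPy. The sketch below simulates data from the regression model with made-up values for the true coefficients (`beta_true`), sample size, and noise level, applies the formula, and compares the result with NumPy's own least-squares routine `np.linalg.lstsq`.

```python
import numpy as np

# Simulated data; sample size, coefficients and noise level are assumptions.
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# The estimator exactly as derived: (X^T X)^{-1} X^T y.
beta_hat_formula = np.linalg.inv(X.T @ X) @ X.T @ y

# NumPy's least-squares solver, used here only as a cross-check.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat_formula)
print(beta_hat_lstsq)
```

Both results should agree closely and lie near `beta_true`. In practice, an explicit inverse is usually avoided in favor of a solver such as `np.linalg.lstsq`, which tends to be numerically more stable; the literal formula is shown here only because it mirrors the derivation.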