Let \(\mathbf{x}_1, \cdots, \mathbf{x}_n\) be a sample of \(n\)-dimensional random vectors with mean \(\mathbf{\mu}\). Their covariance is defined as
\[ \Sigma = \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^\top \right] \>. \] I don’t blame you for not really grokking what this means. Should you want to understand it, we will need to revisit some fundamental concepts in linear algebra, starting with projections, moving on to outer products, and finally landing on covariances before briefly touching on PCA.
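Before we do, here is a quick NumPy sketch (the data and variable names are my own invention, not anything from the interactive figures below) showing that the formula really is just an average of outer products of centered observations:

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw a sample with a known covariance so we can check our estimate against it
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.8], [0.8, 1.0]], size=10_000)

mu = X.mean(axis=0)                            # estimated mean vector
centered = X - mu                              # x - mu for every observation
Sigma_hat = (centered.T @ centered) / len(X)   # average of the outer products (x - mu)(x - mu)^T

print(Sigma_hat)                # close to [[2.0, 0.8], [0.8, 1.0]]
print(np.cov(X.T, bias=True))   # NumPy's estimator agrees
```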
Projections
Consider a vector \(\mathbf{u} \in \mathbb{R}^n\) which I want to project onto a vector \(\mathbf{v} \in \mathbb{R}^n\) to obtain a new vector denoted \(\operatorname{proj}_{\mathbf{v}}(\mathbf{u})\). What do I even mean by “project”? By project, I mean obtain a vector in the direction of \(\mathbf{v}\) which is geometrically closest to \(\mathbf{u}\). There are some useful interpretations of the projection as the “components of one vector in the direction of another”, and we may reach for that later. It may be useful to consider 3 cases and reason about what that means.
In the case where \(\mathbf{u}\) and \(\mathbf{v}\) are collinear (i.e. \(\mathbf{u} = a\mathbf{v}\) for some scalar \(a\), so that \(\left< \mathbf{u}, \mathbf{v} \right> = \pm \lVert \mathbf{u} \rVert \lVert \mathbf{v} \rVert\) ), then the closest vector to \(\mathbf{u}\) in the direction of \(\mathbf{v}\) is \(\mathbf{u}\) itself! So whatever the projection is, it should return \(\mathbf{u}\) unchanged.
In the case where \(\mathbf{u}\) and \(\mathbf{v}\) are orthogonal (i.e. \(\left< \mathbf{u}, \mathbf{v} \right> =0\) ), then the closest vector to \(\mathbf{u}\) in the direction of \(\mathbf{v}\) is the zero vector, \(\mathbf{0}\). This may seem weird, but see the little animation below for a rationale as to why.
In the case where \(\mathbf{u}\) and \(\mathbf{v}\) are neither collinear nor orthogonal (i.e. \(0 \lt \left\lvert \left< \mathbf{u}, \mathbf{v} \right> \right\rvert \lt \lVert \mathbf{u} \rVert \lVert \mathbf{v} \rVert\) ), what should the projection return? What is the closest vector in this case? Since \(\operatorname{proj}_{\mathbf{v}}(\mathbf{u})\) is in the direction of \(\mathbf{v}\), it is a scalar multiple of \(\mathbf{v}\), hence \(\operatorname{proj}_{\mathbf{v}}(\mathbf{u}) = c \mathbf{v}\) for some \(c\). We can use some algebra to find the closest such vector. The scalar we are looking for, \(c\), is the solution to
\[ c = \underset{\tilde c \in \mathbb{R}}{\arg \min} \left\lVert \mathbf{u} - \tilde c\mathbf{v} \right\rVert^2 \>.\]
It is not hard to find such a scalar. Simply expand out the norm, differentiate with respect to \(\tilde c\) and set to 0.
\[ \dfrac{d}{d \tilde c} \left( \mathbf{u}^\top \mathbf{u} - 2 \tilde c \mathbf{u}^\top \mathbf{v} + \tilde c^2 \mathbf{v}^\top \mathbf{v}\right) = -2\mathbf{u}^\top \mathbf{v} + 2 \tilde c \mathbf{v}^\top \mathbf{v} = 0 \]
\[\therefore \boxed{ c = \dfrac{\mathbf{u}^\top \mathbf{v}}{\mathbf{v}^\top \mathbf{v}}}\]
You can verify this is a minimum since the second derivative of the above quantity is always positive. The projection will be
\[\operatorname{proj}_{\mathbf{v}}(\mathbf{u}) = \dfrac{\mathbf{u}^\top \mathbf{v}}{\mathbf{v}^\top \mathbf{v}} \mathbf{v}\]
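If you would rather let a computer do the calculus, here is a small sketch (mine, not part of this post's interactive code) that checks the closed form against a brute-force search over candidate scalars \(\tilde c\), and verifies the residual is orthogonal to \(\mathbf{v}\):

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=3), rng.normal(size=3)

c = (u @ v) / (v @ v)    # the boxed closed-form scalar
proj = c * v             # proj_v(u)

# Brute force: the distance ||u - c~ v|| should be minimized at c~ = c
c_grid = np.linspace(c - 2, c + 2, 100_001)
dists = np.linalg.norm(u[None, :] - c_grid[:, None] * v[None, :], axis=1)
print(c_grid[dists.argmin()], c)   # nearly identical

print((u - proj) @ v)              # ~0: the residual meets v at a right angle
```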
Use the slider below to vary the angle between \(\mathbf{u}\) (blue) and \(\mathbf{v}\) (red). The projection \(\operatorname{proj}_{\mathbf{v}}(\mathbf{u})\) is drawn in green, with a dashed line showing the residual \(\mathbf{u} - \operatorname{proj}_{\mathbf{v}}(\mathbf{u})\) meeting \(\mathbf{v}\) at a right angle. This makes sense given what we know about right triangles and their hypotenuse. Imagine taking some other vector, \(\mathbf{w}\), in the direction of \(\mathbf{v}\) and asserting it was the closest vector. This would form a right triangle where \(\mathbf{u} - \mathbf{w}\) is the hypotenuse and hence longer in magnitude than \(\mathbf{u} - \operatorname{proj}_{\mathbf{v}}(\mathbf{u})\).
In summary, the projection is just the vector in the direction of \(\mathbf{v}\) which is closest to \(\mathbf{u}\). This looks awfully familiar if you squint. Have you seen this before?
Maybe. If we realize inner products are scalars, so that we can write \(\mathbf{u}^\top \mathbf{v} = \mathbf{v}^\top \mathbf{u}\), then massaging the equation for \(c\), along with judicious use of brackets, we can get
\[ \operatorname{proj}_{\mathbf{v}}(\mathbf{u}) = \mathbf{v}(\mathbf{v}^\top \mathbf{v})^{-1} \mathbf{v}^\top \mathbf{u} \>.\]
Now does this look familiar? It almost looks like \(X (X^\top X)^{-1} X^\top y\). This is no coincidence. We know that \(X (X^\top X)^{-1} X^\top\) is a projection matrix and will project \(y\) onto the column space of \(X\).
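To see this numerically, here is a hedged sketch: take a single-column design matrix \(X = \mathbf{v}\), so its column space is \(\operatorname{span}(\mathbf{v})\), and the hat matrix reduces to exactly the projection above (the setup is mine):

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.normal(size=5)
y = rng.normal(size=5)

X = v[:, None]                        # 5 x 1 design matrix; its column space is span(v)
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix X (X^T X)^{-1} X^T

print(H @ y)                          # y projected onto the column space of X
print(((y @ v) / (v @ v)) * v)        # the same vector via the scalar projection formula
```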
To me, the insight here is reframing the projection formula we may learn in an introduction to linear algebra from an operation on \(\mathbf{v}\) to an operation on \(\mathbf{u}\). The formula \(\left(\mathbf{u}^\top \mathbf{v}\right)/({\mathbf{v}^\top \mathbf{v}}) \mathbf{v}\) makes it appear as if \(\mathbf{u}\) is ancillary to the computation in so far as it is used in creating the scalar needed to rescale \(\mathbf{v}\); that \(\mathbf{v}\) is the thing being transformed. If we change our perspective and think of \(\mathbf{u}\) as being transformed – and not just transformed, but transformed linearly – we can make good headway to understanding covariance.
Outer Products
All linear transformations on vectors can be represented by matrices. This raises the question “If projection is a linear operation, what is the matrix representation of projection?” Importantly, we need to represent projection as a matrix, \(A\), multiplying the vector \(\mathbf{u}\). Given \(\mathbf{v}\) and \(\mathbf{u}\), we know how to compute the projection, so it might be useful to start there and try to see the operation as linear in \(\mathbf{u}\). Let’s write out the first few elements of the projection to try and spot a pattern.
The components are
\[ \begin{aligned} \left[ \operatorname{proj}_{\mathbf{v}}(\mathbf{u}) \right]_1 &= \dfrac{u_1 v_1 + u_2 v_2 + \cdots + u_n v_n}{v_1^2 + v_2^2 + \cdots + v_n^2}\, v_1 \\[6pt] \left[ \operatorname{proj}_{\mathbf{v}}(\mathbf{u}) \right]_2 &= \dfrac{u_1 v_1 + u_2 v_2 + \cdots + u_n v_n}{v_1^2 + v_2^2 + \cdots + v_n^2}\, v_2 \\[6pt] &\;\;\vdots \\[6pt] \left[ \operatorname{proj}_{\mathbf{v}}(\mathbf{u}) \right]_n &= \dfrac{u_1 v_1 + u_2 v_2 + \cdots + u_n v_n}{v_1^2 + v_2^2 + \cdots + v_n^2}\, v_n \end{aligned} \]
Now, it should be straightforward to write this as a matrix operation. Notice that all the entries share the squared norm of \(\mathbf{v}\), so we can pull that out. The remaining sums for the \(k^{th}\) component are
\[ \left[ \operatorname{proj}_{\mathbf{v}}(\mathbf{u}) \right]_k = \dfrac{1}{\lVert \mathbf{v} \rVert^2} \sum_{i=1}^n (u_i v_i) v_k\]
which is re-written as
\[ \operatorname{proj}_{\mathbf{v}}(\mathbf{u}) = \dfrac{1}{\lVert \mathbf{v} \rVert^2} \begin{bmatrix} v_1 v_1 & v_1 v_2 & \cdots & v_1 v_n \\ v_2 v_1 & v_2 v_2 & \cdots & v_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ v_n v_1 & v_n v_2 & \cdots & v_n v_n \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} \>. \]
Now we’re getting somewhere. Let’s return to the equation for the projection and recall that we can move scalars around as needed. If we replace the inner product of \(\mathbf{v}\) with itself with the notation for the squared norm, we find
\[\operatorname{proj}_{\mathbf{v}}(\mathbf{u}) = \dfrac{1}{\lVert \mathbf{v} \rVert^2} \mathbf{v} \mathbf{v}^\top \mathbf{u} \>.\]
This hints that
\[ \mathbf{v} \mathbf{v}^\top = \begin{bmatrix} v_1 v_1 & v_1 v_2 & \cdots & v_1 v_n \\ v_2 v_1 & v_2 v_2 & \cdots & v_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ v_n v_1 & v_n v_2 & \cdots & v_n v_n \end{bmatrix}\>. \] Indeed, this is true! We call \(\mathbf{v} \mathbf{v}^\top\) an “outer product” (as opposed to an inner product). This operation makes sense if we think of multiplying an \(n\times 1\) matrix with a \(1 \times n\) matrix to produce an \(n \times n\) matrix (though I don’t find this compelling).
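Don’t take my word for it; a quick sanity check (again a sketch of my own) confirms the scaled outer product acts exactly like the projection, and has the two properties every projection matrix should: symmetry and idempotence.

```python
import numpy as np

rng = np.random.default_rng(3)
u, v = rng.normal(size=4), rng.normal(size=4)

P = np.outer(v, v) / (v @ v)   # (1 / ||v||^2) v v^T

print(np.allclose(P @ u, ((u @ v) / (v @ v)) * v))  # matches proj_v(u)
print(np.allclose(P, P.T))                          # symmetric
print(np.allclose(P @ P, P))                        # idempotent: projecting twice changes nothing
```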
To me, the insight here is that \(\frac{1}{\lVert \mathbf{v} \rVert^2} \mathbf{v} \mathbf{v}^\top\) is a matrix and represents a linear transformation; it represents the projection onto the span of \(\mathbf{v}\). This is key because covariance is written as an expectation of outer products. Hence, whatever we are averaging, it must have something to do with projections onto something.
Covariance
Consider some multivariate data \(\mathbf{x}_i\) with mean vector \(\mathbf{\mu} = \mathbf{0}\) and unknown covariance matrix. From the formula above, we know we can estimate the covariance by taking an expectation of projection matrices (each scaled by a factor of \(\lVert \mathbf{x}_i \rVert^2\)). This implies that covariance matrices have something to do with projection, but what?
Consider a standard result from multivariate statistics, that \(\operatorname{Var}(\mathbf{u}^\top \mathbf{x}) = \mathbf{u}^\top \Sigma \mathbf{u} = \mathbb{E}\left( \mathbf{u}^\top \mathbf{x} \mathbf{x}^\top \mathbf{u} \right)\). The thing inside the expectation looks a lot like a projection of \(\mathbf{u}\) onto the data, followed by one final inner product with \(\mathbf{u}^\top\). If we reason through this, the operation is interpreted as:
- We first project \(\mathbf{u}\) onto all data points, creating a set of vectors in the direction of the \(\mathbf{x}_i\).
- Recall from the projection section that \(\operatorname{proj}_{\mathbf{v}}(\mathbf{u}) \propto \mathbf{u}^\top \mathbf{v} = \mathbf{v}^\top \mathbf{u}\). So taking the inner product of these vectors with \(\mathbf{u}^\top\) creates a quantity proportional to the length of \(\operatorname{proj}_{\mathbf{u}}(\mathbf{x}_i \mathbf{x}_i^\top \mathbf{u})\).
- Averaging those quantities produces an estimate of the variance of \(\mathbf{u}^\top \mathbf{x}_i\). Why? The variance of (mean centered) data is just an expectation of squares, and through judicious use of brackets we can write \(\operatorname{Var}(\mathbf{u}^\top \mathbf{x}) = \mathbb{E}((\mathbf{x}^\top \mathbf{u})^2)\), as the sketch after this list verifies numerically.
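Here is a small numerical sketch of that chain of reasoning (data and names are mine). Averaging \(\mathbf{u}^\top \mathbf{x}_i \mathbf{x}_i^\top \mathbf{u}\) over mean-centered points recovers \(\mathbf{u}^\top \hat\Sigma \mathbf{u}\), and with \(\mathbf{u} = \begin{bmatrix}1 & 0\end{bmatrix}^\top\) that is just the first diagonal entry of the covariance:

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=50_000)
u = np.array([1.0, 0.0])        # the suggested u = [1, 0]^T

per_point = (X @ u) ** 2        # u^T x_i x_i^T u, one value per data point
print(per_point.mean())         # ~2.0: the empirical Var(u^T x), i.e. Sigma[0, 0]

Sigma_hat = (X.T @ X) / len(X)  # data is mean zero, so no centering needed
print(u @ Sigma_hat @ u)        # identical, by judicious rearrangement of brackets
```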
The interactive visualization below demonstrates this process (I recommend first setting the number of points to 1 or 2 to really understand what it is showing). The dots are the datapoints, and the blue vector is \(\mathbf{u}\). The green vectors are the projections \(\operatorname{proj}_{\mathbf{x}_i}(\mathbf{u})\), and the yellow arrows indicate \(\operatorname{proj}_{\mathbf{u}}(\mathbf{x}_i \mathbf{x}_i^\top \mathbf{u})\) – that is, the green vectors projected back onto \(\mathbf{u}\). The length from the origin to the tip of the yellow arrow is what goes into the summand. Try rotating \(\mathbf{u}\) so that the yellow arrows have the smallest possible length and think about why that is (where are those green vectors being projected to, and why are they being projected there?). If you want to get a sense of why the variance calculation works, I recommend stepping through the calculation of the expectation when \(\mathbf{u} = \begin{bmatrix}1 & 0 \end{bmatrix}^\top\) and seeing what you are left with.
The insight here is that \(\operatorname{Var}(\mathbf{u}^\top \mathbf{x})\) is not a single projection, it is really two projections – once onto the data, and again back onto \(\mathbf{u}\). Perhaps a smaller insight is that anytime we do an inner product of some sort, we should be thinking of projecting one vector onto another in some sense.
I’m still not totally sold on this interpretation. I’ve asked Gemini for another explanation and it has said (I’m paraphrasing)
\(\mathbf{x}_i \mathbf{x}^\top_i \mathbf{u}\) captures how much of \(\mathbf{u}\) is preserved by the data point \(\mathbf{x}_i\), and \(\mathbf{u}^\top (\mathbf{x}_i \mathbf{x}^\top_i \mathbf{u})\) captures how much of the preserved part actually aligns with \(\mathbf{u}\).
I’m not sure I find that a better interpretation, but this will have to do.
PCA
A natural question after thinking about this is “For what \(\mathbf{u}\) is \(\operatorname{Var}(\mathbf{u}^\top \mathbf{x})\) largest/smallest?” If you play with the visualization above – especially with a high correlation and many points – you’ll notice that as you rotate \(\mathbf{u}\), those yellow arrows grow and shrink. At two specific points (separated by 90°), the total length of those yellow arrows reaches a maximum and a minimum. These are the Principal Components.
Talk of largest/smallest implies we need to do some sort of optimization. From now on, let’s restrict ourselves to those \(\mathbf{u}\) such that \(\lVert \mathbf{u} \rVert = 1\) (i.e. unit vectors). This will be a constraint in our optimization. Using Lagrange multipliers, our objective function is
\[ L(\mathbf{u}, \lambda) = \mathbf{u}^\top \Sigma \mathbf{u} - \lambda(\mathbf{u}^\top \mathbf{u} - 1) \>.\]
Differentiating with respect to \(\mathbf{u}\) and setting the result to \(\mathbf{0}\): \[ \dfrac{\partial L}{\partial \mathbf{u}} = 2\Sigma \mathbf{u} - 2\lambda \mathbf{u} = \mathbf{0} \]
\[ \therefore \boxed{\Sigma \mathbf{u} = \lambda \mathbf{u}} \]
It is unsurprising that the directions of minimum/maximum variance are the eigenvectors of \(\Sigma\) – this is a commonly parroted fact about PCA! Since \(\mathbf{u}\) is a unit vector, this also means
\[ \operatorname{Var}(\mathbf{u}^\top \mathbf{x}) = \mathbf{u}^\top \Sigma \mathbf{u} = \mathbf{u}^\top \lambda \mathbf{u} = \lambda \]
and so the eigenvalues of the covariance matrix are also the variances in that direction, another commonly parroted fact about PCA! Moreover, the eigenvectors from this approach are orthonormal meaning we can actually get a fairly solid interpretation of the covariance matrix in terms of projection matrices.
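Both parroted facts are easy to check numerically. Here is a hedged sketch comparing `np.linalg.eigh` against the variance of data projected along each eigenvector, and confirming that no other unit vector beats the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200_000)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # columns of eigvecs are unit eigenvectors

for lam, u in zip(eigvals, eigvecs.T):
    print(lam, np.var(X @ u))              # Var(u^T x) matches the eigenvalue

# Sweep unit vectors around the circle: the max of u^T Sigma u is the top eigenvalue
angles = np.linspace(0, np.pi, 1_000)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
print((dirs @ Sigma * dirs).sum(axis=1).max(), eigvals.max())
```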
From the definition I provided at the top of the post, the covariance matrix is an average of some projections. That is a “meh” interpretation, but consider the following. We can diagonalize \(\Sigma\) via the spectral decomposition to be
\[ \Sigma = \mathbf{U} \Lambda \mathbf{U}^\top = \sum_{j=1}^n \lambda_j \mathbf{u}_j \mathbf{u}_j^\top \]
and look – another projection matrix! In fact, there are \(n\) of them, so really the covariance matrix is a weighted sum of projection matrices! The weights are the variances of the principal components, and while that may not be a very visceral interpretation, I do find it pretty damn cool.
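To close the loop, one last sketch (mine) rebuilding \(\Sigma\) as exactly that weighted sum of rank-one projection matrices:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Sum of lambda_j * u_j u_j^T over all eigenpairs
Sigma_rebuilt = sum(lam * np.outer(u, u) for lam, u in zip(eigvals, eigvecs.T))
print(np.allclose(Sigma, Sigma_rebuilt))   # True: a weighted sum of projection matrices
```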