Why does a larger eigenvalue indicate more variance?

Detailed Explanation: Why Larger Eigenvalues Indicate More Variance in PCA

To understand why larger eigenvalues correspond to directions of greater variance in Principal Component Analysis (PCA), we need to dive into the mathematical foundations of eigenvalues, eigenvectors, and projections. Here’s a step-by-step breakdown:


1. Recap: Covariance Matrix and Eigen-Decomposition

Given a centered data matrix \( X \) (size \( N \times D \)), the covariance matrix is:
\[
C = \frac{1}{N-1} X^T X.
\]

  • \( C \) is symmetric (\( C = C^T \)) and positive semi-definite.
  • Its eigenvectors \( v_1, v_2, \dots, v_D \) (the principal components) are orthogonal, and its eigenvalues \( \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_D \geq 0 \) represent the variance along each eigenvector.
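To make this concrete, here is a minimal NumPy sketch; the data is synthetic and the variable names (`X`, `C`, `eigvals`, `eigvecs`) are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # N = 200 samples, D = 3 features
X = X - X.mean(axis=0)                                     # center each column

C = (X.T @ X) / (X.shape[0] - 1)                           # covariance matrix, shape (D, D)

# np.linalg.eigh handles symmetric matrices and returns eigenvalues in
# ascending order, so reverse to get lambda_1 >= lambda_2 >= ... >= lambda_D.
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(eigvals)   # the variances along the principal directions
```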

2. Projection of Data onto an Eigenvector

When we project the centered data \( X \) onto an eigenvector \( v_j \), the transformed data \( s_j \) (the scores) is:
\[
s_j = X v_j.
\]

  • \( s_j \) is a vector of length \( N \) (one value per data point).
  • The variance of \( s_j \) is calculated as (verified numerically in the sketch below):
\[
\text{Var}(s_j) = \frac{1}{N-1} s_j^T s_j = \frac{1}{N-1} (X v_j)^T (X v_j) = v_j^T \underbrace{\left( \frac{1}{N-1} X^T X \right)}_{C} v_j = v_j^T C v_j.
\]
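Here is that check on synthetic data, using an arbitrary unit vector `v`; the identity \( \text{Var}(Xv) = v^T C v \) holds for any unit vector, not only eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X = X - X.mean(axis=0)                       # centered data
C = (X.T @ X) / (X.shape[0] - 1)             # covariance matrix

v = rng.normal(size=4)
v = v / np.linalg.norm(v)                    # any unit-length direction

s = X @ v                                    # scores: projection of each row onto v
var_scores = s @ s / (X.shape[0] - 1)        # sample variance (the scores have mean 0)

print(np.isclose(var_scores, v @ C @ v))     # True: Var(Xv) = v^T C v
```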

3. Eigenvalue Equation and Variance

From the eigen-decomposition \( C v_j = \lambda_j v_j \), multiply both sides on the left by \( v_j^T \):
\[
v_j^T C v_j = v_j^T (\lambda_j v_j) = \lambda_j \underbrace{v_j^T v_j}_{=1} = \lambda_j.
\]
Thus:
\[
\text{Var}(s_j) = \lambda_j.
\]

Key Insight:

The eigenvalue \( \lambda_j \) is exactly the variance of the data projected onto its corresponding eigenvector \( v_j \).
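As a sanity check of this key insight, again on synthetic data, the variance of each score vector matches the corresponding eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)) * np.array([2.0, 1.0, 0.5])
X = X - X.mean(axis=0)
C = (X.T @ X) / (X.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order

for lam, v in zip(eigvals, eigvecs.T):
    s = X @ v                                 # scores along this eigenvector
    var_s = s @ s / (X.shape[0] - 1)
    print(np.isclose(var_s, lam))             # True for every component
```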


4. Why Does a Larger \( \lambda_j \) Mean a More Important Direction?

  • The first principal component (PC1), \( v_1 \), is the unit direction maximizing \( \text{Var}(Xv) \). As the proof below shows, this is the eigenvector with the largest eigenvalue \( \lambda_1 \).
  • The second PC, \( v_2 \), is the next best direction (orthogonal to \( v_1 \)) with variance \( \lambda_2 \), and so on.

Geometric Interpretation:

Eigenvalues \( \lambda_j \) quantify how "stretched" the data is along each PC. Larger \( \lambda_j \) means the data spreads out more in that direction, making it a dominant feature.
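One way to see the dominance numerically is to compare \( v^T C v \) for many random unit directions against \( \lambda_1 \); this is only an illustration on synthetic data, not a proof:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3)) * np.array([3.0, 1.0, 0.4])
X = X - X.mean(axis=0)
C = (X.T @ X) / (X.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)
lambda_1 = eigvals[-1]                                 # largest eigenvalue

# Variance along many random unit directions never exceeds lambda_1.
dirs = rng.normal(size=(10_000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
variances = np.einsum('id,de,ie->i', dirs, C, dirs)    # v^T C v for each direction

print(variances.max() <= lambda_1 + 1e-12)             # True
```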


5. Formal Proof: Variance Maximization

PCA solves the constrained optimization problem:
\[
\max_{v} v^T C v \quad \text{subject to} \quad \|v\| = 1.
\]
The Lagrangian is:
\[
\mathcal{L}(v, \lambda) = v^T C v - \lambda (v^T v - 1).
\]
Taking the gradient w.r.t. \( v \) and setting it to zero:
\[
\nabla_v \mathcal{L} = 2 C v - 2 \lambda v = 0 \implies C v = \lambda v.
\]
This shows that the optimal directions \( v \) are eigenvectors of \( C \), and the corresponding variances \( v^T C v = \lambda \) are the eigenvalues; the maximum is therefore attained at the eigenvector with the largest eigenvalue.
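As a numerical illustration of this result (not the Lagrangian argument itself), power iteration, a standard routine for finding the top eigenvector, converges to a unit vector whose Rayleigh quotient \( v^T C v \) equals \( \lambda_1 \), assuming the top two eigenvalues are distinct:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5)) * np.array([2.5, 1.5, 1.0, 0.7, 0.3])
X = X - X.mean(axis=0)
C = (X.T @ X) / (X.shape[0] - 1)

# Power iteration: repeatedly apply C and renormalize to unit length.
v = rng.normal(size=5)
v /= np.linalg.norm(v)
for _ in range(1000):
    v = C @ v
    v /= np.linalg.norm(v)

rayleigh = v @ C @ v                          # variance along the converged direction
lambda_1 = np.linalg.eigvalsh(C)[-1]          # largest eigenvalue, for comparison

print(np.isclose(rayleigh, lambda_1))         # True: the maximizer is the top eigenvector
```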


6. Total Variance and Explained Variance

  • The total variance in the data is the sum of all eigenvalues:
\[
\text{Total Variance} = \sum_{j=1}^D \lambda_j = \text{trace}(C).
\]
  • The proportion of variance explained by the \( j \)-th PC is:
\[
\frac{\lambda_j}{\sum_{k=1}^D \lambda_k}.
\]
Larger \( \lambda_j \) means the PC explains more of the total variance.
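A minimal sketch of both quantities on synthetic data: the eigenvalue sum should equal \( \text{trace}(C) \), and the explained-variance ratios should sum to 1.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(250, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)
C = (X.T @ X) / (X.shape[0] - 1)

eigvals = np.linalg.eigvalsh(C)[::-1]           # eigenvalues, descending

total_variance = eigvals.sum()
print(np.isclose(total_variance, np.trace(C)))  # True: sum of eigenvalues = trace(C)

explained_ratio = eigvals / total_variance      # fraction of variance per PC
print(explained_ratio)                          # descending, sums to 1
```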

7. Intuitive Example: 2D Data

Consider a 2D dataset with covariance matrix:
\[
C = \begin{bmatrix}
4 & 3 \\
3 & 4
\end{bmatrix}.
\]

  • Eigenvalues: \( \lambda_1 = 7 \), \( \lambda_2 = 1 \).
  • PC1 (\( \lambda_1 = 7 \)): the direction where the data varies most (accounts for \( 7/8 = 87.5\% \) of the variance).
  • PC2 (\( \lambda_2 = 1 \)): the orthogonal direction with less variance (\( 1/8 = 12.5\% \)).
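This example is easy to verify with NumPy:

```python
import numpy as np

C = np.array([[4.0, 3.0],
              [3.0, 4.0]])

eigvals, eigvecs = np.linalg.eigh(C)      # ascending order
print(eigvals)                            # [1. 7.]
print(eigvals[::-1] / eigvals.sum())      # [0.875 0.125] -> 87.5% and 12.5%
```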

8. Key Takeaways

  1. Eigenvalue = Variance: \( \lambda_j \) is the variance of the data projected onto \( v_j \).
  2. Dominant PCs: Larger \( \lambda_j \) means more variance is captured by \( v_j \), making it a "major feature."
  3. Optimality: PCs are the best orthogonal directions for maximizing variance (or, equivalently, minimizing reconstruction error).
