Covariance measures how two variables change together: whether they increase or decrease in tandem.
You will also see the following formula in the literature:
What does the covariance matrix represent?
The diagonal elements = variances of individual features → how much that feature spreads out.
The off-diagonal elements = covariances → how two features change together:
Negative covariance: when one increases, the other decreases, e.g. X1 = age and X2 = number of pushups.
Zero covariance: no linear relationship, e.g. X1 = your age and X2 = your neighbor’s income.
Positive covariance: they increase together, e.g. X1 = your age and X2 = your income.
For example, X1 could be age and X2 could be income, and we may see that Cov(X1, X2) is positive: as age goes up, income goes up too.
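For two features, the layout described above can be sketched as follows. The symbol \Sigma for the covariance matrix is a common convention and is an assumption here, not something taken from the text.

```latex
\Sigma =
\begin{pmatrix}
\operatorname{Var}(X_1) & \operatorname{Cov}(X_1, X_2) \\
\operatorname{Cov}(X_2, X_1) & \operatorname{Var}(X_2)
\end{pmatrix},
\qquad
\operatorname{Cov}(X_1, X_2) = \operatorname{Cov}(X_2, X_1)
```

Because the two off-diagonal entries are equal, the matrix is symmetric.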
Example: Suppose we are looking at two features:
GameHours: the number of hours a student plays video games per day.
ExamScore: the student’s exam score out of 100.
Below is a sample taken from a population. Find the covariance and the correlation coefficient between the two features.
Student  GameHours  ExamScore
A        1          95
B        2          90
C        3          85
D        4          80
E        5          75
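We expect that as GameHours increases, ExamScore decreases → negative covariance. Let’s see if we can deduce this mathematically.
Because the data is a sample, we use the sample covariance formula, which divides by n - 1: Cov(X, Y) = Σ (x_i - x̄)(y_i - ȳ) / (n - 1).
The means are x̄ = 3 and ȳ = 85, and the products of the deviations sum to -50, so the covariance is -50 / 4 = -12.5, in units of hours × score points.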
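The covariance above can be checked with a few lines of plain Python. This is a minimal sketch: the helper name `sample_covariance` and the variable names are illustrative choices, not part of the original example.

```python
# Sample data from the table above
game_hours = [1, 2, 3, 4, 5]
exam_score = [95, 90, 85, 80, 75]


def sample_covariance(xs, ys):
    """Sum of the products of deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


cov_xy = sample_covariance(game_hours, exam_score)
print(cov_xy)  # -12.5
```

The negative sign confirms the expectation: more gaming hours go with lower scores in this sample.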
Why do we need the correlation coefficient (r for a sample, ρ for a population)?
Because covariance has units.
Example: if X is in meters and Y is in seconds → Cov(X, Y) is in meter-seconds.
This makes it hard to compare covariances across datasets that may use different units.
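To see the unit problem concretely, the sketch below recomputes the covariance of the GameHours/ExamScore sample after converting hours to minutes. The `game_minutes` list and the `sample_covariance` helper are illustrative assumptions; the underlying data is unchanged.

```python
game_hours = [1, 2, 3, 4, 5]
exam_score = [95, 90, 85, 80, 75]
# Same measurements, expressed in minutes instead of hours
game_minutes = [h * 60 for h in game_hours]


def sample_covariance(xs, ys):
    """Sum of the products of deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


cov_hours = sample_covariance(game_hours, exam_score)
cov_minutes = sample_covariance(game_minutes, exam_score)
print(cov_hours)    # -12.5
print(cov_minutes)  # -750.0
```

The relationship between the two features is identical in both cases, yet the covariance is 60 times larger in magnitude just because the unit changed.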
The formula for the correlation coefficient is r = Cov(X, Y) / (s_X · s_Y), where s_X and s_Y are the standard deviations of X and Y.
Key points about r:
It is dimensionless (no units) and lies in [-1, 1]
+1 → perfect positive linear relationship
0 → no linear relationship
-1 → perfect negative linear relationship
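Continuing with the previous example, the standard deviations are s_X ≈ 1.581 and s_Y ≈ 7.906, so r = -12.5 / (1.581 × 7.906) = -1.
If we plot GameHours versus ExamScore, the points fall exactly on a straight line that slopes downward. This is typical of features that have a correlation coefficient of -1.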
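The correlation for this sample can be verified with a short sketch. The helper names `sample_covariance` and `correlation` are illustrative assumptions. It also checks that switching GameHours to minutes leaves r unchanged.

```python
import math

game_hours = [1, 2, 3, 4, 5]
exam_score = [95, 90, 85, 80, 75]


def sample_covariance(xs, ys):
    """Sum of the products of deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # Cov(X, X) is Var(X)
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


r_hours = correlation(game_hours, exam_score)
r_minutes = correlation([h * 60 for h in game_hours], exam_score)
print(round(r_hours, 6))    # -1.0
print(round(r_minutes, 6))  # -1.0
```

Rounding is used because floating-point square roots can leave the raw result a hair away from exactly -1.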
Practice: The following is a sample taken from a population. What is the correlation coefficient between GameHours and ExamScore?
Student  GameHours  ExamScore
A        1          92
B        2          56
C        3          68
D        4          38
E        5          40
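Performing the same calculation as above, we find that the means of GameHours and ExamScore are 3 and 58.8.
Cov(X, Y) = -30.5 with the sample formula (dividing by n - 1 = 4); dividing by n = 5 instead gives -24.4.
Correlation coefficient ≈ -0.867, which is the same with either divisor because the factor cancels in the ratio.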
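The practice answer can be checked with the sketch below. The `ddof` parameter (1 for the sample formula, 0 for the population formula) and the helper names are illustrative assumptions, not part of the original exercise.

```python
import math

game_hours = [1, 2, 3, 4, 5]
exam_score = [92, 56, 68, 38, 40]


def covariance(xs, ys, ddof=1):
    """Covariance dividing by n - ddof (ddof=1: sample, ddof=0: population)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - ddof)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(covariance(xs, xs))
    s_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (s_x * s_y)


cov_sample = covariance(game_hours, exam_score)              # divides by 4
cov_population = covariance(game_hours, exam_score, ddof=0)  # divides by 5
r = correlation(game_hours, exam_score)

print(round(cov_sample, 3))      # -30.5
print(round(cov_population, 3))  # -24.4
print(round(r, 3))               # -0.867
```

Unlike the first example, the points here do not lie exactly on a line, so r is strongly negative but not -1.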