DEV Community

Ashwani Kumar Shamlodhiya

Covariance + correlation

Covariance measures how two variables change together: whether they tend to move in the same direction or in opposite directions.

You will also see the following formula in the literature:

What does this matrix represent?
The diagonal elements = variances of the individual features → how much each feature spreads out.
The off-diagonal elements = covariances → how two features change together:
Negative covariance: when one increases, the other tends to decrease, e.g. X1 = age and X2 = number of push-ups.
Zero covariance: no linear relationship, e.g. X1 = your age and X2 = your neighbor's income.
Positive covariance: they tend to increase together, e.g. X1 = your age and X2 = your income.

For example, X1 could be age and X2 could be income, and we may see that Cov(X1, X2) is positive: as age goes up, income goes up too.
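
For two features, the matrix described above can be written out explicitly. This is a sketch that assumes the matrix in question is the usual 2 × 2 covariance matrix; the symbol Σ is notation chosen here for illustration.

$$
\Sigma =
\begin{bmatrix}
\operatorname{Var}(X_1) & \operatorname{Cov}(X_1, X_2) \\
\operatorname{Cov}(X_2, X_1) & \operatorname{Var}(X_2)
\end{bmatrix},
\qquad
\operatorname{Cov}(X_1, X_2) = \operatorname{Cov}(X_2, X_1)
$$

The matrix is symmetric, and Cov(X1, X1) is simply Var(X1), which is why the variances sit on the diagonal.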

Example: Suppose we’re looking at:
GameHours: the number of hours a student plays video games per day.
ExamScore: the student’s exam score out of 100.
Below is a sample taken from a population. Find the covariance and the correlation coefficient between the two features.

Student   GameHours ExamScore
A         1         95
B         2         90
C         3         85
D         4         80
E         5         75

We expect that as GameHours (video games) increases, ExamScore decreases → negative covariance. Let’s see if we can deduce this mathematically.
Here we use the sample covariance formula, which divides the sum of the deviation products by n - 1.
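
Written out for this sample, the calculation might look as follows. This assumes the standard sample covariance with X = GameHours, Y = ExamScore and n = 5; the symbols x̄, ȳ and n are notation chosen here for illustration.

$$
\begin{aligned}
\operatorname{Cov}(X, Y) &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad \bar{x} = 3,\ \bar{y} = 85 \\
&= \frac{(-2)(10) + (-1)(5) + (0)(0) + (1)(-5) + (2)(-10)}{5 - 1} \\
&= \frac{-50}{4} = -12.5
\end{aligned}
$$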

So the covariance is -12.5, in units of hours × score points (a product of the two units, not a ratio).

Why do we need the correlation coefficient (written r for a sample, ρ for a population)?
Because covariance has units.
Example: if X is in meters and Y is in seconds → Cov(X, Y) is in meter·seconds.
This makes covariances hard to compare across datasets that use different units or scales.
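
To make the unit problem concrete, here is a minimal sketch that reuses the GameHours/ExamScore sample from above and only changes the unit of playing time from hours to minutes. The helper `sample_covariance` is a hypothetical function written for this illustration, not part of any library.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


game_hours = [1, 2, 3, 4, 5]
scores = [95, 90, 85, 80, 75]

# Same playing time, expressed in minutes instead of hours.
game_minutes = [h * 60 for h in game_hours]

cov_hours = sample_covariance(game_hours, scores)
cov_minutes = sample_covariance(game_minutes, scores)

print(cov_hours)    # covariance in hours x score points
print(cov_minutes)  # 60 times larger, purely because of the unit change
```

The relationship between the two features is identical in both cases, yet the covariance differs by a factor of 60, so its magnitude alone says little about how strong the relationship is.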

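The formula for the correlation coefficient is:

$$
r = \frac{\operatorname{Cov}(X, Y)}{s_X \, s_Y}
$$

where s_X and s_Y are the sample standard deviations of X and Y.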

Key points on r:
It is dimensionless (no units) and r ∈ [-1, 1]
+1 → perfect positive linear relationship
0 → no linear relationship
-1 → perfect negative linear relationship

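Continuing with the previous example, the sample variances are 2.5 for GameHours and 62.5 for ExamScore, so the product of the sample standard deviations is √(2.5 × 62.5) = 12.5 and r = -12.5 / 12.5 = -1.

If we plot GameHours versus ExamScore, the points fall exactly on a downward-sloping straight line. This is typical of features whose correlation coefficient is -1 (a perfect upward-sloping line would give +1).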
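
The same result can be checked numerically. The sketch below assumes the usual Pearson formula; `sample_correlation` is a hypothetical helper written for this illustration. Because the n - 1 factors in the covariance and in the two standard deviations cancel, it works directly with the sums of deviation products.

```python
import math


def sample_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of deviation products; the (n - 1) divisors cancel in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


game_hours = [1, 2, 3, 4, 5]
scores = [95, 90, 85, 80, 75]

r_example = sample_correlation(game_hours, scores)
# Reversing the scores turns the falling line into a rising one.
r_reversed = sample_correlation(game_hours, scores[::-1])

print(r_example)   # approximately -1: perfect negative linear relationship
print(r_reversed)  # approximately +1: perfect positive linear relationship
```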


Practice: The following is a sample taken from a population. What is the correlation coefficient between GameHours and ExamScore?

Student   GameHours ExamScore
A         1         92
B         2         56
C         3         68
D         4         38
E         5         40

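Performing the same calculation as above, we find that the means of GameHours and ExamScore are 3 and 58.8,
Cov(X,Y) = -30.5 (sample covariance, dividing by n - 1 = 4; dividing by n = 5 instead gives -24.4)
Correlation coeff ≈ -0.867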
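
The practice numbers can be reproduced with a short script. This is a sketch using Python's standard `statistics` module for the means and sample standard deviations; the variable names are chosen here for illustration. It computes the covariance with both divisors: n - 1 (the sample formula used in the worked example) gives about -30.5, while dividing by n gives about -24.4.

```python
import statistics

game_hours = [1, 2, 3, 4, 5]
scores = [92, 56, 68, 38, 40]

n = len(game_hours)
mean_x = statistics.mean(game_hours)
mean_y = statistics.mean(scores)

# Sum of the products of deviations from the means.
sum_products = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(game_hours, scores)
)

cov_sample = sum_products / (n - 1)  # sample covariance (divide by n - 1)
cov_population = sum_products / n    # population-style covariance (divide by n)

# statistics.stdev is the sample standard deviation (n - 1 divisor),
# so it pairs with the sample covariance.
r = cov_sample / (statistics.stdev(game_hours) * statistics.stdev(scores))

print(mean_x, mean_y)
print(cov_sample, cov_population)
print(round(r, 3))
```

The correlation coefficient does not depend on which divisor is used, as long as the covariance and the standard deviations use the same one.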
