Prabrit Das

Pearson Correlation Coefficient: Connecting the Dots in Data

In data analysis, understanding how variables relate to one another is crucial. The Pearson correlation coefficient is a fundamental statistical tool that measures the linear relationship between two continuous variables.

Table of Contents

  1. What is Correlation?
  2. Pearson Correlation Coefficient
  3. Mathematics Behind Pearson Correlation Coefficient
  4. Mathematical Example
  5. Code for Pearson Correlation Coefficient
  6. Conclusion

What is Correlation?

Correlation is a statistical measure of the strength and direction of the relationship between two variables. It is an invaluable tool for determining whether an increase in one variable corresponds to an increase or a decrease in another.

Pearson Correlation Coefficient

The Pearson Correlation Coefficient (denoted $r$) measures the strength and direction of the linear relationship between two continuous variables. The value of $r$ ranges from $-1$ to $+1$.

  • $+1$ : Perfect positive linear relationship

  • $0$ : No linear relationship

  • $-1$ : Perfect negative linear relationship

Graph Representation

Interpretation of correlation

Mathematics Behind Pearson's Correlation Coefficient

Pearson's $r$ assesses how much two variables change together relative to how much they change independently. The formula for finding the value of $r$ is:

$$r = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \overline{Y})^2}}$$

Where:

  • $X$ and $Y$ are the two variables.
  • $\overline{X}$ and $\overline{Y}$ are their respective mean values.
  • $X_i$ and $Y_i$ are individual data points.

This formula normalizes the covariance between $X$ and $Y$ by the product of their standard deviations, which guarantees that $r$ always lies between $-1$ and $+1$.
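To make that normalization explicit, the same formula can be written in terms of the sample covariance and the sample standard deviations. The $1/(n-1)$ factors below are an assumed convention (the population $1/n$ convention works equally well), since the common factor cancels in the ratio:

```latex
\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y}),
\qquad
s_X = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X})^2},
\qquad
s_Y = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2}

r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}
```

The bound $-1 \le r \le +1$ then follows from the Cauchy-Schwarz inequality applied to the two vectors of deviations.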

Interpreting the Coefficient

Understanding the value of $r$ involves both its sign and its magnitude.

  • Sign (+/-)
    The sign indicates the direction of the relationship.

    • Positive: when $X$ increases, $Y$ tends to increase.
    • Negative: when $X$ increases, $Y$ tends to decrease.
  • Magnitude
    The magnitude $|r|$ represents the strength of the relationship. A common rule of thumb is:

    • 0.00 to 0.19: Very weak
    • 0.20 to 0.39: Weak
    • 0.40 to 0.59: Moderate
    • 0.60 to 0.79: Strong
    • 0.80 to 1.00: Very Strong
  • Graph

Depiction of how a higher absolute value of the coefficient corresponds to a stronger correlation

Mathematical Example

Let us take Demand versus Price as an example:

| Demand | Price |
|--------|-------|
| 65     | 67    |
| 66     | 68    |
| 67     | 65    |
| 67     | 68    |
| 68     | 72    |
| 69     | 72    |
| 70     | 69    |
| 72     | 71    |

  • Solution:

Considering Demand as $X$ and Price as $Y$:

$$\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y}) = 24$$

and

$$\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \overline{Y})^2} \approx 39.799$$

so the correlation coefficient is:

$$r = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \overline{Y})^2}} = \frac{24}{39.799} \approx 0.6030227$$

  • which indicates a strong positive correlation, since the value falls in the 0.60 to 0.79 band.
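For readers who want to check the arithmetic, the intermediate sums can be worked out by hand from the table. Each term is listed in table order:

```latex
\overline{X} = \frac{544}{8} = 68, \qquad \overline{Y} = \frac{552}{8} = 69

\sum_{i=1}^{8} (X_i - 68)(Y_i - 69) = 6 + 2 + 4 + 1 + 0 + 3 + 0 + 8 = 24

\sum_{i=1}^{8} (X_i - 68)^2 = 9 + 4 + 1 + 1 + 0 + 1 + 4 + 16 = 36

\sum_{i=1}^{8} (Y_i - 69)^2 = 4 + 1 + 16 + 1 + 9 + 9 + 0 + 4 = 44

r = \frac{24}{\sqrt{36 \cdot 44}} = \frac{24}{\sqrt{1584}} = \frac{2}{\sqrt{11}} \approx 0.6030
```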

Code for Pearson Correlation Coefficient

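#include <stdio.h>
#include <math.h>
#include <stdlib.h>

int main(void)
{
  int n;

  printf("Enter the number of elements -> ");
  if (scanf("%d", &n) != 1 || n < 2)
  {
    printf("Need at least two data points\n");
    return 1;
  }

  // Allocate the arrays only after n is known
  double *a = malloc(n * sizeof *a); // x values
  double *b = malloc(n * sizeof *b); // y values
  if (a == NULL || b == NULL)
  {
    printf("Out of memory\n");
    free(a);
    free(b);
    return 1;
  }

  printf("Enter each pair as x followed by y\n");
  for (int i = 0; i < n; i++)
  {
    if (scanf("%lf %lf", &a[i], &b[i]) != 2)
    {
      printf("Invalid input\n");
      free(a);
      free(b);
      return 1;
    }
  }

  // Running sums must start at zero
  double m_a = 0.0;
  double m_b = 0.0;
  for (int i = 0; i < n; i++)
  {
    m_a += a[i];
    m_b += b[i];
  }

  double M_A = m_a / n; // Mean value of x
  double M_B = m_b / n; // Mean value of y

  double C = 0.0; // Sum of squared deviations of x
  double D = 0.0; // Sum of squared deviations of y
  double E = 0.0; // Sum of products of the deviations
  for (int i = 0; i < n; i++)
  {
    double A = a[i] - M_A;
    double B = b[i] - M_B;

    C += A * A;
    D += B * B;
    E += A * B;
  }
  free(a);
  free(b);

  double F = C * D;
  if (F == 0.0)
  {
    // r is undefined when either variable has zero variance
    printf("Correlation is undefined: one variable is constant\n");
    return 1;
  }

  double cor = E / sqrt(F);
  printf("The Correlation Coefficient is %lf\n\n", cor);

  if (cor > 0)
  {
    printf("Positively Correlated\n");
  }
  else if (cor < 0)
  {
    printf("Negatively Correlated\n");
  }
  else
  {
    printf("No linear relation found\n");
  }
  return 0;
}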
  • Output

Output with the above sample data

Conclusion

The Pearson Correlation Coefficient offers a straightforward way to quantify the linear relationship between two continuous variables. By mastering Pearson's $r$ and recognizing when to use alternative measures, one can improve data analysis proficiency and derive more accurate insights from data.
