What is a Correlation Test?
A correlation test measures the strength of the association between two variables. For instance, if we want to explore whether there is a relationship between the heights of fathers and sons, the correlation coefficient can help us answer that question.
Methods for Correlation Analyses
There are two main methods for correlation analysis:
Parametric Correlation: This method measures the linear dependence between two variables (x and y) and is based on the distribution of the data. The most commonly used parametric method is the Pearson correlation.
Non-Parametric Correlation: This includes methods like Kendall's tau and Spearman's rho, which are rank-based correlation coefficients. These methods do not assume a specific data distribution.
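As a quick illustration of the difference, the sketch below applies all three coefficients (via scipy.stats) to a monotonic but non-linear toy sample; the data values are invented for illustration:

```python
# Compare Pearson, Spearman, and Kendall on a monotonic, non-linear relation
from scipy.stats import pearsonr, spearmanr, kendalltau

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 4, 9, 16, 25, 36, 49, 64]  # y = x**2: monotonic but not linear

r, _ = pearsonr(x, y)      # measures linear association only
rho, _ = spearmanr(x, y)   # rank-based: exactly 1.0 for any increasing relation
tau, _ = kendalltau(x, y)  # rank-based: also 1.0 here

print(f"Pearson r:    {r:.3f}")
print(f"Spearman rho: {rho:.3f}")
print(f"Kendall tau:  {tau:.3f}")
```

The rank-based coefficients report a perfect monotonic association, while Pearson's r falls short of 1 because the relationship is not a straight line.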
Note
- The Pearson correlation method is the most widely used for analyzing linear relationships.
Pearson Correlation Formula
The Pearson correlation coefficient ( r ) is calculated using the formula:
r = Σ(x_i - m_x)(y_i - m_y) / sqrt( Σ(x_i - m_x)² * Σ(y_i - m_y)² )
Where:
- ( x ) and ( y ) are two vectors of length ( n ),
- ( m_x ) and ( m_y ) are the means of ( x ) and ( y ), respectively.
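The formula can be checked by hand. Here is a minimal sketch that computes r from the deviations about the means and compares the result with scipy.stats.pearsonr; the sample data is invented for illustration:

```python
# Compute Pearson's r directly from its definition and verify against SciPy
import math
from scipy.stats import pearsonr

x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.5, 3.1, 6.2, 7.8, 10.4]

n = len(x)
m_x = sum(x) / n
m_y = sum(y) / n

# Numerator: sum of products of deviations from the means
num = sum((xi - m_x) * (yi - m_y) for xi, yi in zip(x, y))
# Denominator: square root of the product of the sums of squared deviations
den = math.sqrt(sum((xi - m_x) ** 2 for xi in x)
                * sum((yi - m_y) ** 2 for yi in y))

r_manual = num / den
r_scipy, _ = pearsonr(x, y)

print(f"manual: {r_manual:.6f}, scipy: {r_scipy:.6f}")
```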
Important Notes
- ( r ) ranges from -1 (negative correlation) to 1 (positive correlation).
- An ( r ) value of 0 indicates no correlation.
- The Pearson correlation assumes interval or ratio data and should not be applied to ordinal variables; use Spearman's rho or Kendall's tau for those.
- A sample size of 20-30 is generally recommended for good estimation.
- Outliers can lead to misleading correlation values, making the method not robust in such cases.
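As a rough illustration of the outlier note above, the sketch below injects a single extreme point into an otherwise near-linear sample and compares Pearson's r with the rank-based Spearman's rho; the data is made up for illustration:

```python
# Show how one outlier distorts Pearson's r far more than Spearman's rho
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 8.0, 9.1]

r_clean, _ = pearsonr(x, y)

# Append a single extreme outlier
x_out = x + [10]
y_out = y + [-20.0]

r_out, _ = pearsonr(x_out, y_out)
rho_out, _ = spearmanr(x_out, y_out)

print(f"Pearson without outlier: {r_clean:.3f}")
print(f"Pearson with outlier:    {r_out:.3f}")
print(f"Spearman with outlier:   {rho_out:.3f}")
```

One point is enough to flip Pearson's r from nearly perfect to negative, while the rank-based coefficient degrades far more gracefully.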
Computing Pearson Correlation in Python
To compute the Pearson correlation in Python, you can use the pearsonr() function from the scipy.stats library.
Syntax
pearsonr(x, y)
Parameters
- x, y: Numeric vectors (arrays) of the same length.
Example Code
Here’s how you can find the Pearson correlation using Python:
# Import the necessary libraries
import pandas as pd
from scipy.stats import pearsonr
# Load your data into Python
df = pd.read_csv("Auto.csv")
# Select the two columns of interest as pandas Series
weight = df['weight']
mpg = df['mpg']
# Apply the pearsonr() function; it returns the coefficient and a p-value
corr, _ = pearsonr(weight, mpg)
print('Pearson correlation: %.3f' % corr)
Output
When you run the code, you might see an output like:
Pearson correlation: -0.878
Pearson Correlation for Anscombe’s Data
Anscombe’s quartet consists of four datasets that have nearly identical simple statistical properties but appear very different when graphed. Each dataset comprises eleven (x, y) points and was constructed by the statistician Francis Anscombe in 1973. This example illustrates the importance of graphing data before analyzing it and highlights how outliers can affect statistical properties.
Visualizing Anscombe’s Data
When we plot these points, we notice significant differences in their distributions. Yet applying Pearson's correlation coefficient to each of these datasets yields nearly identical correlation values (about 0.82).
Key Insights
- All four datasets share a correlation coefficient of roughly 0.82, yet only the first shows a plausible linear relationship. The second is clearly non-linear, and the third and fourth are each dominated by a single outlier.
This emphasizes the need for careful data visualization and analysis before concluding relationships based solely on correlation coefficients.
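To see this numerically, the sketch below hard-codes Anscombe's published values and applies pearsonr() to each of the four datasets; all four coefficients come out almost identical:

```python
# Pearson's r for each dataset in Anscombe's quartet
from scipy.stats import pearsonr

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for datasets I-III
datasets = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in datasets.items():
    r, _ = pearsonr(x, y)
    print(f"Dataset {name}: r = {r:.3f}")
```

Every dataset prints r ≈ 0.816, even though only the first looks linear when plotted.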
For more content, follow me at — https://linktr.ee/shlokkumar2303