What is a Correlation Test?
A correlation test measures the strength of the association between two variables. For instance, if we want to explore whether there is a relationship between the heights of fathers and sons, the correlation coefficient can help us answer that question.
Methods for Correlation Analyses
There are two main methods for correlation analysis:
Parametric Correlation: This method measures the linear dependence between two variables (x and y) and is based on the distribution of the data. The most commonly used parametric method is the Pearson correlation.
Non-Parametric Correlation: This includes methods like Kendall's tau and Spearman's rho, which are rank-based correlation coefficients. These methods do not assume a specific data distribution.
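As a quick illustration of the difference, the sketch below applies all three coefficients (via scipy.stats) to a monotonic but non-linear toy sample; the data values are invented for illustration:

```python
# Compare Pearson, Spearman, and Kendall on a monotonic, non-linear relation
from scipy.stats import pearsonr, spearmanr, kendalltau

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 4, 9, 16, 25, 36, 49, 64]  # y = x**2: monotonic but not linear

r, _ = pearsonr(x, y)      # measures linear association only
rho, _ = spearmanr(x, y)   # rank-based: exactly 1.0 for any increasing relation
tau, _ = kendalltau(x, y)  # rank-based: also 1.0 here

print(f"Pearson r:    {r:.3f}")
print(f"Spearman rho: {rho:.3f}")
print(f"Kendall tau:  {tau:.3f}")
```

The rank-based coefficients report a perfect monotonic association, while Pearson's r falls short of 1 because the relationship is not a straight line.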
Note
- The Pearson correlation method is the most widely used for analyzing linear relationships.
Pearson Correlation Formula
The Pearson correlation coefficient ( r ) is calculated using the formula:
r = Σ(x_i - m_x)(y_i - m_y) / sqrt( Σ(x_i - m_x)² * Σ(y_i - m_y)² )
Where:
- ( x ) and ( y ) are two vectors of length ( n ),
- ( m_x ) and ( m_y ) are the means of ( x ) and ( y ), respectively.
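The formula can be checked by hand. Here is a minimal sketch that computes r from the deviations about the means and compares the result with scipy.stats.pearsonr; the sample data is invented for illustration:

```python
# Compute Pearson's r directly from its definition and verify against SciPy
import math
from scipy.stats import pearsonr

x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.5, 3.1, 6.2, 7.8, 10.4]

n = len(x)
m_x = sum(x) / n
m_y = sum(y) / n

# Numerator: sum of products of deviations from the means
num = sum((xi - m_x) * (yi - m_y) for xi, yi in zip(x, y))
# Denominator: square root of the product of the sums of squared deviations
den = math.sqrt(sum((xi - m_x) ** 2 for xi in x)
                * sum((yi - m_y) ** 2 for yi in y))

r_manual = num / den
r_scipy, _ = pearsonr(x, y)

print(f"manual: {r_manual:.6f}, scipy: {r_scipy:.6f}")
```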
Important Notes
- ( r ) ranges from -1 (negative correlation) to 1 (positive correlation).
- An ( r ) value of 0 indicates no correlation.
- The Pearson correlation assumes interval or ratio data and should not be applied to ordinal variables; use Spearman's rho or Kendall's tau for those.
- A sample size of 20-30 is generally recommended for good estimation.
- Outliers can lead to misleading correlation values, making the method not robust in such cases.
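As a rough illustration of the outlier note above, the sketch below injects a single extreme point into an otherwise near-linear sample and compares Pearson's r with the rank-based Spearman's rho; the data is made up for illustration:

```python
# Show how one outlier distorts Pearson's r far more than Spearman's rho
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 8.0, 9.1]

r_clean, _ = pearsonr(x, y)

# Append a single extreme outlier
x_out = x + [10]
y_out = y + [-20.0]

r_out, _ = pearsonr(x_out, y_out)
rho_out, _ = spearmanr(x_out, y_out)

print(f"Pearson without outlier: {r_clean:.3f}")
print(f"Pearson with outlier:    {r_out:.3f}")
print(f"Spearman with outlier:   {rho_out:.3f}")
```

One point is enough to flip Pearson's r from nearly perfect to negative, while the rank-based coefficient degrades far more gracefully.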
Computing Pearson Correlation in Python
To compute the Pearson correlation in Python, you can use the pearsonr() function from the scipy.stats library.
Syntax
pearsonr(x, y)
Parameters
- x, y: Numeric vectors (arrays) of the same length.
Example Code
Here’s how you can find the Pearson correlation using Python:
# Import the necessary libraries
import pandas as pd
from scipy.stats import pearsonr
# Load your data into Python
df = pd.read_csv("Auto.csv")
# Select the two columns of interest as pandas Series
weight = df['weight']
mpg = df['mpg']
# Apply the pearsonr() function; it returns the coefficient and a p-value
corr, _ = pearsonr(weight, mpg)
print('Pearson correlation: %.3f' % corr)
Output
When you run the code, you might see an output like:
Pearson correlation: -0.878
Pearson Correlation for Anscombe’s Data
Anscombe’s quartet consists of four datasets that have nearly identical simple statistical properties but appear very different when graphed. Each dataset comprises eleven (x, y) points and was constructed by the statistician Francis Anscombe in 1973. This example illustrates the importance of graphing data before analyzing it and highlights how outliers can affect statistical properties.
Visualizing Anscombe’s Data
When we plot these points, we notice significant differences in their distributions. Yet applying Pearson's correlation coefficient to each of these datasets yields nearly identical correlation values (about 0.82).
Key Insights
- All four datasets share a correlation coefficient of roughly 0.82, yet only the first shows a plausible linear relationship. The second is clearly non-linear, and the third and fourth are each dominated by a single outlier.
This emphasizes the need for careful data visualization and analysis before concluding relationships based solely on correlation coefficients.
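To see this numerically, the sketch below hard-codes Anscombe's published values and applies pearsonr() to each of the four datasets; all four coefficients come out almost identical:

```python
# Pearson's r for each dataset in Anscombe's quartet
from scipy.stats import pearsonr

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for datasets I-III
datasets = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in datasets.items():
    r, _ = pearsonr(x, y)
    print(f"Dataset {name}: r = {r:.3f}")
```

Every dataset prints r ≈ 0.816, even though only the first looks linear when plotted.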
For more content, follow me at — https://linktr.ee/shlokkumar2303