Anand

Posted on Jun 20

T-Test and Chi-Square Test in Data Analysis 🐍🤖🧠

#statistics #python #datascience #machinelearning

We apply these tests to data to determine whether there are statistically significant differences or associations between groups or variables.

T-Test

Overview

The T-test is a statistical test used to compare the means of two groups (or one group's mean against a known value) to determine if they are significantly different from each other. It assumes the data are approximately normally distributed and is most useful when the sample size is small and the population standard deviation is unknown.

Types of T-Tests

Independent T-Test: Compares the means of two independent groups.
Paired T-Test: Compares means from the same group at different times.
One-Sample T-Test: Compares the mean of a single group against a known mean.
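The worked example in this post covers the independent T-test, so here is a minimal sketch of the other two variants using `scipy.stats.ttest_rel` and `scipy.stats.ttest_1samp`. The before/after scores and the reference mean of 75 are made-up values for illustration only.

```python
from scipy import stats

# Hypothetical scores for the same six students before and after a course
before = [72, 75, 68, 80, 77, 70]
after = [78, 79, 70, 85, 80, 76]

# Paired t-test: each "before" value is matched with the "after" value
# at the same position, so both lists must have the same length
t_paired, p_paired = stats.ttest_rel(after, before)

# One-sample t-test: compare the "after" mean against an assumed
# reference mean of 75
t_one, p_one = stats.ttest_1samp(after, popmean=75)

print(f"Paired:     T-Statistic: {t_paired:.3f}, P-Value: {p_paired:.4f}")
print(f"One-sample: T-Statistic: {t_one:.3f}, P-Value: {p_one:.4f}")
```

The paired test works on the per-student differences, which is why it can detect a shift that an independent test on the same numbers might miss.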

Example
Suppose we want to compare the test scores of students from two different classes to see if there is a significant difference:

from scipy import stats

# Sample data
class_a_scores = [85, 86, 88, 75, 78, 94, 91, 88]
class_b_scores = [82, 84, 80, 72, 76, 90, 89, 85]

# Perform the t-test
t_stat, p_value = stats.ttest_ind(class_a_scores, class_b_scores)

print(f"T-Statistic: {t_stat}, P-Value: {p_value}")

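output: T-Statistic: 1.07950662400349, P-Value: 0.2986093279117022

The P-value (about 0.30) is well above the common 0.05 threshold, so this sample does not give us evidence that the two class means differ.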
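To see where the T-statistic comes from, the sketch below recomputes it by hand with only the standard library. It assumes the equal-variance (pooled) form of the test, which is what `ttest_ind` uses by default; the variable names are just for illustration.

```python
import math
from statistics import mean, variance

class_a_scores = [85, 86, 88, 75, 78, 94, 91, 88]
class_b_scores = [82, 84, 80, 72, 76, 90, 89, 85]

n_a, n_b = len(class_a_scores), len(class_b_scores)

# Pooled variance: the two sample variances weighted by degrees of freedom
pooled_var = (
    (n_a - 1) * variance(class_a_scores) + (n_b - 1) * variance(class_b_scores)
) / (n_a + n_b - 2)

# Standard error of the difference between the two means
std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# T-statistic: difference in means measured in standard errors
t_manual = (mean(class_a_scores) - mean(class_b_scores)) / std_err

print(f"Manual T-Statistic: {t_manual:.4f}")
```

The result should agree with the SciPy value above (about 1.0795): the classes differ by 3.375 points on average, but that gap is only about one standard error wide.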

Chi-Square Test

Overview
The Chi-Square Test is used to determine if there is a significant association between two categorical variables. It compares the observed frequencies of occurrences with the expected frequencies.

Types of Chi-Square Tests

Chi-Square Test for Independence: Assesses whether two categorical variables are independent.
Chi-Square Goodness of Fit Test: Determines whether the observed frequencies in a sample match an expected distribution.
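The example below covers the test for independence, so here is a small sketch of the goodness of fit variant using `scipy.stats.chisquare`. The die-roll counts are invented for illustration.

```python
from scipy.stats import chisquare

# Hypothetical counts of each face from 60 rolls of a die
observed = [8, 12, 9, 11, 10, 10]

# With no f_exp argument, chisquare assumes every category is
# equally likely (here: 10 expected rolls per face)
chi2_gof, p_gof = chisquare(observed)

print(f"Chi-Square Statistic: {chi2_gof}, P-Value: {p_gof}")
```

A large P-value here means the counts are consistent with a fair die. To test against a non-uniform distribution, pass the expected counts through the `f_exp` argument (they should sum to the same total as the observed counts).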

Example

Suppose we want to check if there is an association between smoking status (smoker/non-smoker) and exercise frequency (regular/irregular).

import numpy as np
from scipy.stats import chi2_contingency

# Sample data in a contingency table
# Rows: Smoking Status (Smoker, Non-Smoker)
# Columns: Exercise Frequency (Regular, Irregular)
data = np.array([[15, 35], [40, 10]])

# Perform the Chi-Square test
chi2, p, dof, expected = chi2_contingency(data)

print(f"Chi-Square Statistic: {chi2}, P-Value: {p}")

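output (approximate, with SciPy's default Yates' continuity correction for 2x2 tables): Chi-Square Statistic: 23.27, P-Value: 1.4e-06

The P-value is far below 0.05, so smoking status and exercise frequency are not independent in this sample data.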
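To make "observed versus expected" concrete, here is a sketch that rebuilds the expected counts and the statistic by hand for the same table, using plain Python. One detail worth knowing: for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the sketch computes the statistic both with and without it.

```python
# Same contingency table as above
# Rows: Smoking Status (Smoker, Non-Smoker)
# Columns: Exercise Frequency (Regular, Irregular)
observed = [[15, 35], [40, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell if the two variables were independent
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson statistic: sum of (observed - expected)^2 / expected
pearson = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Yates' correction shrinks each |observed - expected| by 0.5 first
yates = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(f"Expected counts: {expected}")
print(f"Pearson: {pearson:.4f}, with Yates' correction: {yates:.4f}")
```

Calling `chi2_contingency(data, correction=False)` should give the uncorrected Pearson value instead of the corrected one.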

Impact of T-Test and Chi-Square Test in Data Analysis

T-Test

Comparing Group Means: Helps in comparing the means of two groups, useful in experiments and A/B testing.
Hypothesis Testing: Assists in determining if observed differences are statistically significant.

Chi-Square Test

Association Between Variables: Useful in understanding relationships between categorical variables, such as demographic factors and preferences.
Goodness of Fit: Helps in determining if a sample distribution fits an expected distribution, useful in model validation.

→ Let's perform a T-test and a Chi-Square test using datasets from the sklearn library.

T-Test Example
We'll use the Wine dataset from sklearn for the T-test. The Wine dataset contains data on various chemical properties of wines from three different cultivars. We'll compare the mean of one chemical property (alcohol content) between two of these cultivars.

from sklearn.datasets import load_wine
import pandas as pd
from scipy import stats

# Load the wine dataset
wine = load_wine()
wine_data = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_data['target'] = wine.target

# Extract data for two cultivars (e.g., 0 and 1)
cultivar_0 = wine_data[wine_data['target'] == 0]['alcohol']
cultivar_1 = wine_data[wine_data['target'] == 1]['alcohol']

# Perform the t-test
t_stat, p_value = stats.ttest_ind(cultivar_0, cultivar_1)

print(f"T-Statistic: {t_stat}, P-Value: {p_value}")

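output: T-Statistic: 16.478551495156527, P-Value: 1.9551698789379198e-33

The P-value is far below 0.05, so the difference in mean alcohol content between cultivars 0 and 1 is statistically significant.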
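The default `ttest_ind` call assumes both groups have the same variance. When group sizes or spreads differ, a common alternative is Welch's t-test, which SciPy exposes through `equal_var=False`. The sketch below is an optional variation on the example above, not a replacement for it.

```python
from sklearn.datasets import load_wine
import pandas as pd
from scipy import stats

# Rebuild the same two groups used in the example above
wine = load_wine()
wine_data = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_data['target'] = wine.target
cultivar_0 = wine_data[wine_data['target'] == 0]['alcohol']
cultivar_1 = wine_data[wine_data['target'] == 1]['alcohol']

# Compare group sizes and spreads before deciding on equal_var
print(f"Cultivar 0: n={len(cultivar_0)}, std={cultivar_0.std():.3f}")
print(f"Cultivar 1: n={len(cultivar_1)}, std={cultivar_1.std():.3f}")

# equal_var=False switches ttest_ind to Welch's t-test,
# which does not assume the two groups have equal variances
t_welch, p_welch = stats.ttest_ind(cultivar_0, cultivar_1, equal_var=False)

print(f"Welch T-Statistic: {t_welch}, P-Value: {p_welch}")
```

If both versions of the test lead to the same conclusion, as we would expect for a difference this large, the equal-variance assumption is not driving the result.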

Chi-Square Test Example
We'll use the Iris dataset from sklearn for the Chi-Square test. This dataset contains measurements of various features of Iris flowers from three different species. We'll test if there is an association between the species and a categorical feature created from one of the numerical features (e.g., sepal length).

from sklearn.datasets import load_iris
import pandas as pd
from scipy.stats import chi2_contingency

# Load the iris dataset
iris = load_iris()
iris_data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_data['species'] = iris.target

# Create a categorical feature from a numerical feature (e.g., sepal length)
iris_data['sepal_length_cat'] = pd.qcut(iris_data['sepal length (cm)'], q=3, labels=['short', 'medium', 'long'])

# Create a contingency table
contingency_table = pd.crosstab(iris_data['sepal_length_cat'], iris_data['species'])

# Perform the Chi-Square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square Statistic: {chi2}, P-Value: {p}")

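output: Chi-Square Statistic: 123.28296703296704, P-Value: 1.0624436052362445e-25

The P-value is far below 0.05, so sepal length category and species are not independent in this dataset.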
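The example above only prints two of the four values that `chi2_contingency` returns. The sketch below looks at the other two: the degrees of freedom and the table of expected counts. The "every expected count should be at least 5" check is a widely used rule of thumb, included here as an assumption rather than a hard requirement.

```python
from sklearn.datasets import load_iris
import pandas as pd
from scipy.stats import chi2_contingency

# Rebuild the contingency table from the example above
iris = load_iris()
iris_data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_data['species'] = iris.target
iris_data['sepal_length_cat'] = pd.qcut(
    iris_data['sepal length (cm)'], q=3, labels=['short', 'medium', 'long']
)
contingency_table = pd.crosstab(iris_data['sepal_length_cat'], iris_data['species'])

chi2, p, dof, expected = chi2_contingency(contingency_table)

# Degrees of freedom are (rows - 1) * (columns - 1)
n_rows, n_cols = contingency_table.shape
print(f"Degrees of freedom: {dof} (formula gives {(n_rows - 1) * (n_cols - 1)})")

# Rule of thumb (an assumption of this sketch): every expected count >= 5
print(f"Smallest expected count: {expected.min():.2f}")
print(f"All expected counts >= 5: {(expected >= 5).all()}")
```

When some expected counts are very small, the Chi-Square approximation becomes less reliable, and merging sparse categories is one common way to handle it.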

Conclusion

Both T-tests and Chi-Square tests are essential tools in data analysis, providing insights into the relationships between variables and helping to validate hypotheses. Their proper application can lead to meaningful conclusions and better decision-making based on statistical evidence.

Note: You can run the above Python code in your own environment to reproduce the results of the T-test and Chi-Square test on these datasets.
