Statistical Essentials for Data Analysts: A Beginner's Guide

#statistics #datascience #analytics #python

Understanding Basic Statistical Terminologies with Python

In this post, we'll explore some fundamental statistical concepts using Python and explain each one in detail. We'll work with a dataset of student exam scores, using Python's built-in statistics module for the calculations and the matplotlib library for visualization.

Let's start by importing the necessary libraries and defining our dataset:

import matplotlib.pyplot as plt
import statistics

# data of student scores in an exam
student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

Mean, Median, and Mode
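The mean is the average of the dataset: the sum of the values divided by their count. The median is the middle value when the data is sorted in ascending order; with an even number of values, it is the average of the two middle values. The mode is the most frequent value in the dataset.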

mean_score = statistics.mean(student_scores)
median_score = statistics.median(student_scores)
mode_score = statistics.mode(student_scores)

print("Mean:", mean_score) # Mean: 83.7
print("Median:", median_score) # Median: 85.0
print("Mode:", mode_score) # Mode: 85
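To make the definitions concrete, here is a minimal sketch that recomputes the mean and median by hand on the same scores. It also uses statistics.multimode (available from Python 3.8), which returns every value tied for most frequent; that call and the names manual_mean, manual_median, and modes are illustrative additions, not part of the original code.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# Mean: sum of the values divided by how many there are.
manual_mean = sum(student_scores) / len(student_scores)

# Median: middle of the sorted data; with an even count,
# average the two middle values.
ordered = sorted(student_scores)
count = len(ordered)
mid = count // 2
if count % 2 == 0:
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    manual_median = ordered[mid]

# multimode() returns a list of all values tied for most frequent.
modes = statistics.multimode(student_scores)

print(manual_mean)    # 83.7
print(manual_median)  # 85.0
print(modes)          # [85]
```

If two scores were tied for most frequent, multimode would return both, whereas statistics.mode returns a single value.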

Standard Deviation and Variance
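Standard deviation measures how far data points typically lie from the mean. Variance is the average of the squared differences from the mean, and the standard deviation is its square root. Note that statistics.stdev and statistics.variance compute the sample versions, which divide by n - 1 rather than n; statistics.pstdev and statistics.pvariance give the population versions.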

std_deviation = statistics.stdev(student_scores)
variance = statistics.variance(student_scores)

print("Standard Deviation:", std_deviation) # Standard Deviation: ≈ 5.478
print("Variance:", variance) # Variance: ≈ 30.011
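The difference between dividing by n and by n - 1 is easy to see by computing both by hand. The following is a small sketch on the same scores; the names squared_diffs, population_variance, and sample_variance are illustrative.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

n = len(student_scores)
mean = statistics.mean(student_scores)

# Sum of squared differences from the mean.
squared_diffs = sum((x - mean) ** 2 for x in student_scores)

population_variance = squared_diffs / n        # matches statistics.pvariance
sample_variance = squared_diffs / (n - 1)      # matches statistics.variance

print(round(population_variance, 3))  # 27.01
print(round(sample_variance, 3))      # 30.011
```

The sample version is the usual choice when the scores are treated as a sample drawn from a larger group of students.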

Range and Quartiles
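The range is the difference between the maximum and minimum values in the dataset. Quartiles split the sorted data into four parts of equal size. Here Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data; other conventions exist and can give slightly different values.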

range_score = max(student_scores) - min(student_scores)
sorted_scores = sorted(student_scores)
n = len(sorted_scores)
# Lower and upper halves; for an odd count, the middle value is left out of both
q1 = statistics.median(sorted_scores[:n // 2])
q2 = statistics.median(sorted_scores)
q3 = statistics.median(sorted_scores[(n + 1) // 2:])

print("Range:", range_score) # Range: 16
print("Q1:", q1) # Q1: 78
print("Q2 (Median):", q2) # Q2 (Median): 85.0
print("Q3:", q3) # Q3: 88
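The standard library can also compute quartiles directly with statistics.quantiles (Python 3.8+). This is a hedged sketch rather than a replacement for the code above: quantiles interpolates between data points, so its Q1 and Q3 may differ slightly from the median-of-halves values, and the two supported methods can differ from each other.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# n=4 asks for the three cut points that split the data into quartiles.
# method="exclusive" is the default.
exclusive = statistics.quantiles(student_scores, n=4)

# method="inclusive" is meant for population data, or samples known to
# contain the most extreme values of the population.
inclusive = statistics.quantiles(student_scores, n=4, method="inclusive")

print("Exclusive [Q1, Q2, Q3]:", exclusive)
print("Inclusive [Q1, Q2, Q3]:", inclusive)
```

Whichever convention you pick, state it when reporting quartiles so that others can reproduce the numbers.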

Interquartile Range (IQR)
The Interquartile Range (IQR) is the range between the first and third quartiles, measuring the spread of data.

iqr = q3 - q1
print("Interquartile Range (IQR):", iqr) # Interquartile Range (IQR): 10
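One common use of the IQR, not covered in the original walkthrough, is flagging unusually low or high values. The sketch below applies the widely used 1.5 × IQR rule of thumb to the same scores; the factor 1.5 and the names lower_fence, upper_fence, and outliers are conventions and assumptions, not something the dataset dictates.

```python
student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# Quartiles as computed with the median-of-halves approach.
q1, q3 = 78, 88
iqr = q3 - q1

# Values beyond these fences are often treated as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in student_scores if s < lower_fence or s > upper_fence]

print(lower_fence, upper_fence)  # 63.0 103.0
print(outliers)                  # []
```

No score falls outside the fences here, which agrees with the fairly tight spread of the data.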

Correlation Coefficient
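The correlation coefficient (Pearson's r) measures the strength and direction of the linear relationship between two variables. It always lies between -1 and 1: values near 1 indicate a strong positive relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship. We'll calculate the correlation coefficient between hours studied and test scores.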

def correlation_coefficient(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance: the average product of deviations (note the division by n)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    # Population standard deviations of x and y
    std_dev_x = (sum((xi - mean_x) ** 2 for xi in x) / n) ** 0.5
    std_dev_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5
    correlation = covariance / (std_dev_x * std_dev_y)
    return correlation

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

correlation = correlation_coefficient(hours_studied, test_scores)
print("Correlation between hours studied and test scores:", correlation)
# Correlation between hours studied and test scores: ≈ 0.9944
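As a cross-check, Pearson's r can also be computed from the sample covariance and sample standard deviations (dividing by n - 1); the n - 1 factors cancel, so the result is the same as with the population versions. The helper name pearson_r below is hypothetical. On Python 3.10 and later, statistics.correlation computes the same quantity, so the sketch calls it only when it is available.

```python
import statistics

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

def pearson_r(x, y):
    """Pearson's r from the sample covariance and sample standard deviations."""
    n = len(x)
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    # Sample covariance: divide the sum of products of deviations by n - 1.
    sample_cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return sample_cov / (statistics.stdev(x) * statistics.stdev(y))

r = pearson_r(hours_studied, test_scores)
print(round(r, 4))  # 0.9944

# Only present on Python 3.10+, so guard the call.
if hasattr(statistics, "correlation"):
    builtin_r = statistics.correlation(hours_studied, test_scores)
    print(round(builtin_r, 4))
```

A value this close to 1 indicates a strong positive linear relationship in this small dataset; it does not by itself show that studying more causes higher scores.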

Scatter Plot Visualization
Lastly, we'll visualize the relationship between hours studied and test scores using a scatter plot.

plt.scatter(hours_studied, test_scores)
plt.xlabel('Hours Studied')
plt.ylabel('Test Scores')
plt.title('Hours Studied vs. Test Scores')
plt.show()

DEV Community
