DEV Community

Anand

Posted on

Statistical Essentials for Data Analysts: A Beginner's Guide

Understanding Basic Statistical Terminology with Python

In this post, we'll explore some fundamental statistical concepts using Python and explain each one in detail. We'll work with a small dataset of student exam scores, using Python's built-in statistics module for the calculations and the matplotlib library for visualization.

Let's start by importing the necessary libraries and defining our dataset:

import matplotlib.pyplot as plt
import statistics

students data

# data of student scores in an exam
student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

Mean, Median, and Mode
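The mean is the average value of the dataset: the sum of the values divided by their count. The median is the middle value when the data is sorted in ascending order; with an even number of values, as here, it is the average of the two middle values. The mode is the most frequent value in the dataset.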

mean_score = statistics.mean(student_scores)
median_score = statistics.median(student_scores)
mode_score = statistics.mode(student_scores)

print("Mean:", mean_score) # Mean: 83.7
print("Median:", median_score) # Median: 85.0
print("Mode:", mode_score) # Mode: 85
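To make these definitions concrete, here is a minimal sketch that recomputes the mean and median by hand so they can be compared with the statistics module. It also uses statistics.multimode (available in Python 3.8 and later), which returns every value tied for most frequent. The variable names are illustrative and not part of the original example.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# Mean: sum of the values divided by how many there are
manual_mean = sum(student_scores) / len(student_scores)

# Median: middle of the sorted data; with an even count,
# average the two middle values
ordered = sorted(student_scores)
mid = len(ordered) // 2
if len(ordered) % 2 == 0:
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    manual_median = ordered[mid]

# multimode returns a list of all values that share the highest count
modes = statistics.multimode(student_scores)

print("Manual mean:", manual_mean)
print("Manual median:", manual_median)
print("Modes:", modes)
```

If two scores were tied for most frequent, multimode would list both, which makes the tie visible instead of reporting a single value.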

Standard Deviation and Variance
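Variance is the average of the squared differences from the mean, and standard deviation is its square root, which expresses the spread in the same units as the data. Note that statistics.stdev and statistics.variance compute the sample versions, which divide by n - 1 rather than n.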

std_deviation = statistics.stdev(student_scores)
variance = statistics.variance(student_scores)

print("Standard Deviation:", std_deviation) # Standard Deviation: 5.478 (rounded)
print("Variance:", variance) # Variance: 30.011 (rounded)
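The statistics module offers both sample and population versions of these measures. The sketch below computes the variance by hand both ways, so the results can be compared with statistics.variance (divides by n - 1) and statistics.pvariance (divides by n). The variable names are illustrative.

```python
import math
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

n = len(student_scores)
mean = sum(student_scores) / n
# Sum of squared differences from the mean
squared_diffs = sum((score - mean) ** 2 for score in student_scores)

sample_variance = squared_diffs / (n - 1)   # same definition as statistics.variance
population_variance = squared_diffs / n     # same definition as statistics.pvariance

print("Sample variance:", round(sample_variance, 3))
print("Population variance:", round(population_variance, 3))
print("Sample std dev:", round(math.sqrt(sample_variance), 3))
print("Population std dev:", round(math.sqrt(population_variance), 3))
```

The sample versions are the usual choice when the scores are a sample from a larger group; the population versions fit when the list contains every score of interest.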

Range and Quartiles
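The range is the difference between the maximum and minimum values in the dataset. Quartiles are the three cut points (Q1, Q2, and Q3) that divide the sorted data into four equal parts. Here we compute them as the median of the lower half, of the whole dataset, and of the upper half.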

range_score = max(student_scores) - min(student_scores)
sorted_scores = sorted(student_scores)
# Median-of-halves quartiles; this split assumes an even number of scores
q1 = statistics.median(sorted_scores[:len(sorted_scores)//2])
q2 = statistics.median(sorted_scores)
q3 = statistics.median(sorted_scores[len(sorted_scores)//2:])

print("Range:", range_score) # Range: 16
print("Q1:", q1) # Q1: 78
print("Q2 (Median):", q2) # Q2 (Median): 85.0
print("Q3:", q3) # Q3: 88
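There is more than one convention for computing quartiles, and they can give slightly different values for the same data. As a rough comparison, the sketch below uses statistics.quantiles (available in Python 3.8 and later) with its two built-in methods. This is an alternative to the median-of-halves approach, not a replacement for it.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# n=4 asks for the three cut points that split the data into quarters
exclusive = statistics.quantiles(student_scores, n=4)  # default method="exclusive"
inclusive = statistics.quantiles(student_scores, n=4, method="inclusive")

print("Exclusive [Q1, Q2, Q3]:", exclusive)
print("Inclusive [Q1, Q2, Q3]:", inclusive)
```

Both methods interpolate between neighbouring data points, so Q1 and Q3 may not be actual scores from the list. Whichever convention you pick, use it consistently when comparing datasets.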

Interquartile Range (IQR)
The Interquartile Range (IQR) is the difference between the third and first quartiles (Q3 - Q1). It measures the spread of the middle 50% of the data.

iqr = q3 - q1
print("Interquartile Range (IQR):", iqr) # Interquartile Range (IQR): 10
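Because the IQR looks only at the middle half of the data, a single extreme value affects it far less than it affects the range. The sketch below illustrates this by adding one hypothetical low score of 20, which is not part of the original dataset. The helper name spread is also made up for this example.

```python
import statistics

def spread(values):
    """Return (range, IQR) using the median-of-halves quartile convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])  # skips the middle value when the count is odd
    return ordered[-1] - ordered[0], q3 - q1

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]
with_low_score = student_scores + [20]  # hypothetical extra score, for illustration only

print(spread(student_scores))   # (16, 10)
print(spread(with_low_score))   # (72, 10)
```

The range jumps from 16 to 72, while the IQR stays at 10 for this particular data.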

Correlation Coefficient
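The (Pearson) correlation coefficient measures the strength and direction of the linear relationship between two variables. It always lies between -1 and 1: values near 1 indicate a strong positive relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship. We'll calculate the correlation coefficient between hours studied and test scores.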

def correlation_coefficient(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Divide by n so the covariance is on the same (population) scale as the standard deviations
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    std_dev_x = (sum((xi - mean_x) ** 2 for xi in x) / n) ** 0.5
    std_dev_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5
    return covariance / (std_dev_x * std_dev_y)

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

correlation = correlation_coefficient(hours_studied, test_scores)
print("Correlation between hours studied and test scores:", correlation)
# Correlation between hours studied and test scores: approximately 0.9944
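Two properties are handy for sanity-checking a correlation value: it can never fall outside the range -1 to 1, and it does not change when a variable is converted to different units. The sketch below checks both using a compact helper built on the statistics module. The name pearson_r and the small decreasing dataset are illustrative assumptions.

```python
import statistics

def pearson_r(x, y):
    """Pearson r from population covariance and standard deviations (illustrative helper)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / len(x)
    return covariance / (statistics.pstdev(x) * statistics.pstdev(y))

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

r_hours = pearson_r(hours_studied, test_scores)

# Changing units (hours -> minutes) leaves r unchanged
minutes_studied = [h * 60 for h in hours_studied]
r_minutes = pearson_r(minutes_studied, test_scores)

# A perfectly decreasing straight line gives r = -1 (up to rounding)
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

print(round(r_hours, 4), round(r_minutes, 4), round(r_negative, 4))
```

A result outside the range -1 to 1 usually means the covariance and the standard deviations were not normalized the same way. On Python 3.10 and later, statistics.correlation can serve as an independent cross-check.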

Scatter Plot Visualization
Lastly, we'll visualize the relationship between hours studied and test scores using a scatter plot.

plt.scatter(hours_studied, test_scores)
plt.xlabel('Hours Studied')
plt.ylabel('Test Scores')
plt.title('Hours Studied vs. Test Scores')
plt.show()

Correlation Scatter Plot

