<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Purity Ngugi</title>
    <description>The latest articles on DEV Community by Purity Ngugi (@purityngugi).</description>
    <link>https://dev.to/purityngugi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3255182%2Fabc72fbc-bc44-40ff-8772-2902ea4d4dd3.jpg</url>
      <title>DEV Community: Purity Ngugi</title>
      <link>https://dev.to/purityngugi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/purityngugi"/>
    <language>en</language>
    <item>
      <title>🎯Balancing Type I and Type II Errors in Medical Decision-Making: A Case on Diabetes Diagnosis</title>
      <dc:creator>Purity Ngugi</dc:creator>
      <pubDate>Sun, 21 Sep 2025 21:06:06 +0000</pubDate>
      <link>https://dev.to/purityngugi/balancing-type-i-and-type-ii-errors-in-medical-decision-making-a-case-on-diabetes-diagnosis-4jp2</link>
      <guid>https://dev.to/purityngugi/balancing-type-i-and-type-ii-errors-in-medical-decision-making-a-case-on-diabetes-diagnosis-4jp2</guid>
      <description>&lt;p&gt;In statistics, Type I and Type II errors are inevitable risks when making decisions based on sample data. Understanding how to balance these errors is crucial, especially in the medical field, where the consequences directly affect human lives. Let’s explore how this trade-off works using a diabetes diagnosis as our scenario.&lt;/p&gt;

&lt;h4&gt;
  
  
  Understanding the Two Types of Errors
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Type I Error&lt;/strong&gt; (False Positive): Rejecting the null hypothesis when it is actually true.&lt;br&gt;
Example in medicine: Diagnosing a patient with diabetes when they are actually healthy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Type II Error&lt;/strong&gt; (False Negative): Failing to reject the null hypothesis when it is false.&lt;br&gt;
Example in medicine: Declaring a patient healthy when they actually have diabetes.&lt;br&gt;
Both errors have costs, but the severity depends on the context.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
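
&lt;p&gt;A minimal numeric sketch (the counts below are invented for illustration, not clinical data) shows how each error rate is read off a screening confusion matrix:&lt;/p&gt;

```python
# Hypothetical screening outcomes for 1,000 patients.
false_pos = 40    # healthy patients flagged as diabetic (Type I)
true_neg = 860    # healthy patients correctly cleared
true_pos = 90     # diabetic patients correctly flagged
false_neg = 10    # diabetic patients missed (Type II)

# Type I error rate: P(positive test | actually healthy)
type1_rate = false_pos / (false_pos + true_neg)
# Type II error rate: P(negative test | actually diabetic)
type2_rate = false_neg / (false_neg + true_pos)

print(f"Type I error rate:  {type1_rate:.3f}")
print(f"Type II error rate: {type2_rate:.3f}")
```

&lt;p&gt;Note that each rate is conditioned on the patient's true status, which is why the two denominators differ.&lt;/p&gt;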

&lt;h4&gt;
  
  
  🩺Case Study: Diagnosing Diabetes
&lt;/h4&gt;

&lt;p&gt;Let’s say a hospital is screening patients for diabetes. The null hypothesis (H₀) is “The patient does not have diabetes.” The alternative hypothesis (H₁) is “The patient has diabetes.”&lt;/p&gt;

&lt;p&gt;The choice between minimizing Type I or Type II errors depends on the medical and ethical priorities of the healthcare institution.&lt;/p&gt;

&lt;h4&gt;
  
  
  Consequences of Each Error
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Type I Error&lt;/strong&gt; – False Positive&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a healthy person is incorrectly diagnosed with diabetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They may undergo unnecessary lifestyle changes or medication.&lt;/li&gt;
&lt;li&gt;They may face psychological stress and anxiety.&lt;/li&gt;
&lt;li&gt;They might experience side effects from treatments they don’t actually need.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While inconvenient and potentially harmful, a false positive in this case usually does not cause immediate life-threatening harm—especially if further confirmatory tests are done.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Type II Error&lt;/strong&gt; – False Negative&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a patient with diabetes is told they are healthy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Their condition will go untreated, potentially leading to severe complications such as kidney failure, nerve damage, or blindness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The disease may progress silently, increasing the risk of life-threatening outcomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The opportunity for early intervention is lost.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The consequences here are far more severe than in a false positive scenario.&lt;/p&gt;

&lt;h4&gt;
  
  
  📊 The Trade-Off Decision
&lt;/h4&gt;

&lt;p&gt;In the diabetes diagnosis context, minimizing &lt;strong&gt;Type II errors&lt;/strong&gt; is more critical. Missing a true case of diabetes (false negative) can have long-term, irreversible health consequences. Therefore, hospitals may set a lower threshold for diagnosing diabetes, accepting that there might be more false positives, but ensuring fewer false negatives.&lt;br&gt;
This approach aligns with the principle:&lt;br&gt;
“It is better to mistakenly treat a healthy patient than to miss treating a sick patient.”&lt;/p&gt;
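
&lt;p&gt;That threshold logic can be sketched in a few lines of Python (the glucose readings below are invented purely for illustration):&lt;/p&gt;

```python
# Lowering the diagnostic threshold trades Type II errors for Type I errors.
# Fasting glucose readings (mmol/L), invented for illustration.
healthy = [4.8, 5.1, 5.4, 5.6, 5.9, 6.2]
diabetic = [6.4, 6.8, 7.1, 7.5, 8.0, 9.2]

def error_rates(threshold):
    """Diagnose diabetes when glucose >= threshold; return (Type I, Type II) rates."""
    type1 = sum(g >= threshold for g in healthy) / len(healthy)   # false positives
    type2 = sum(g < threshold for g in diabetic) / len(diabetic)  # false negatives
    return type1, type2

for t in (7.0, 6.5, 6.0):
    t1, t2 = error_rates(t)
    print(f"threshold {t}: Type I = {t1:.2f}, Type II = {t2:.2f}")
```

&lt;p&gt;As the threshold drops, more diabetic patients are caught (fewer false negatives) at the cost of flagging more healthy ones (more false positives).&lt;/p&gt;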

&lt;h4&gt;
  
  
  Visualizing the Trade-off with Python
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt

# significance levels (alpha values)
alpha = np.linspace(0.01, 0.2, 50)

# Type I error increases with alpha
type1_error = alpha  

# Type II error decreases with alpha (inverse relationship for illustration)
type2_error = 1 - alpha * 3  
type2_error = np.clip(type2_error, 0, 1)  # keep within [0,1]

plt.figure(figsize=(8,5))
plt.plot(alpha, type1_error, label="Type I Error (False Positive)", linewidth=2)
plt.plot(alpha, type2_error, label="Type II Error (False Negative)", linewidth=2)

plt.title("Trade-off Between Type I and Type II Errors in Diabetes Diagnosis")
plt.xlabel("Significance Level (α)")
plt.ylabel("Error Rate")
plt.legend()
plt.grid(True)
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Statistical Implications
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reducing Type II errors increases the power of the test (the probability of correctly detecting a true condition).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This often comes at the cost of increasing Type I errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The balance is determined by setting the significance level (α). A higher α (e.g., 0.10 instead of 0.05) allows for more false positives but reduces false negatives.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
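
&lt;p&gt;A short sketch of that relationship for a one-sided z-test (the effect size here is an assumption chosen for illustration):&lt;/p&gt;

```python
from scipy.stats import norm

# Power of a one-sided z-test grows as alpha grows, for a fixed true effect.
effect = 2.0  # standardized true effect size, assumed for illustration
powers = {}
for alpha in (0.01, 0.05, 0.10):
    z_crit = norm.ppf(1 - alpha)                    # rejection cutoff under H0
    powers[alpha] = 1 - norm.cdf(z_crit - effect)   # P(reject H0 | H1 true)
    print(f"alpha = {alpha:.2f}  beta = {1 - powers[alpha]:.3f}  power = {powers[alpha]:.3f}")
```

&lt;p&gt;Raising α lowers the rejection cutoff, so β shrinks and power rises, which is exactly the trade-off described above.&lt;/p&gt;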

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;In medicine, the trade-off between Type I and Type II errors is not purely a statistical decision—it is an ethical one. For conditions like diabetes, where missing a diagnosis could have life-altering consequences, it is often preferable to tolerate a slightly higher false positive rate to ensure that true cases are caught and treated promptly.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Analyzing Win Probabilities in the Premier League 2024/25 with Python🐍</title>
      <dc:creator>Purity Ngugi</dc:creator>
      <pubDate>Fri, 29 Aug 2025 12:02:49 +0000</pubDate>
      <link>https://dev.to/purityngugi/analyzing-win-probabilities-in-the-premier-league-202425-with-python-2omj</link>
      <guid>https://dev.to/purityngugi/analyzing-win-probabilities-in-the-premier-league-202425-with-python-2omj</guid>
      <description>&lt;p&gt;In this project, I explored the 2024/2025 Premier League by digging into match outcomes and calculating the win probabilities of different teams. Instead of just looking at the league table, I wanted to understand how often teams actually win compared to drawing or losing, and then visualize these probabilities against a normal distribution to reveal patterns across the league.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📂Dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this analysis, I worked with publicly available Premier League data for the 2024/2025 season.&lt;br&gt;
You can access the dataset I used here:&lt;a href="https://www.google.com/search?q=premier+league+table+2024%2F25&amp;amp;oq=premier+league+table+&amp;amp;gs_lcrp=EgZjaHJvbWUqBwgBEAAYgAQyBwgAEAAYjwIyBwgBEAAYgAQyBwgCEAAYgAQyBwgDEAAYgAQyBwgEEAAYgAQyBwgFEAAYgAQyBwgGEAAYgAQyBwgHEAAYgAQyBwgIEAAYgAQyBwgJEAAYgATSAQg5MjExajBqN6gCALACAA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8" rel="noopener noreferrer"&gt;https://www.google.com/search?q=premier+league+table+2024%2F25&amp;amp;oq=premier+league+table+&amp;amp;gs_lcrp=EgZjaHJvbWUqBwgBEAAYgAQyBwgAEAAYjwIyBwgBEAAYgAQyBwgCEAAYgAQyBwgDEAAYgAQyBwgEEAAYgAQyBwgFEAAYgAQyBwgGEAAYgAQyBwgHEAAYgAQyBwgIEAAYgAQyBwgJEAAYgATSAQg5MjExajBqN6gCALACAA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🛠️Step 1: Preparing the Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I started by creating a dataset of Premier League teams with their number of wins, draws, and losses. From this, I calculated the total games played and derived probabilities for each outcome.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Create the dataset
data = {
    'Team': ['Manchester City', 'Liverpool', 'Chelsea', 'Arsenal', 'Manchester United',
             'Tottenham', 'Newcastle', 'Aston Villa', 'West Ham', 'Brighton'],
    'Wins': [20, 18, 16, 17, 15, 14, 13, 12, 11, 10],
    'Draws': [5, 7, 8, 6, 9, 8, 10, 7, 6, 9],
    'Losses': [3, 3, 4, 5, 6, 7, 5, 9, 11, 12]
}

df = pd.DataFrame(data)

# Calculate games played and probabilities
df['Games Played'] = df['Wins'] + df['Draws'] + df['Losses']
df['Win Probability'] = df['Wins'] / df['Games Played']
df['Draw Probability'] = df['Draws'] / df['Games Played']
df['Loss Probability'] = df['Losses'] / df['Games Played']

print(df.head())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nfptgras1vegp7hgohi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nfptgras1vegp7hgohi.png" alt=" " width="652" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gives us a clean table with each team’s probabilities for winning, drawing, or losing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📈Step 2: Statistical Distribution of Wins&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, I wanted to see how these win probabilities look when compared to the league-wide distribution. For this, I calculated the mean and standard deviation of win probabilities, and then plotted them on top of a normal distribution curve.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Extract win probabilities
win_probs = df['Win Probability']

# Mean and standard deviation
mean, std_dev = win_probs.mean(), win_probs.std()

# Normal distribution
x = np.linspace(min(win_probs), max(win_probs), 100)
y = norm.pdf(x, mean, std_dev)

# Plot
plt.figure(figsize=(10,6))
plt.plot(x, y, color='blue', label='Normal Distribution')
plt.title("Distribution of Win Probabilities – Premier League 2024/25")
plt.xlabel("Win Probability")
plt.ylabel("Density")

# Mark each team's probability
for team, wp in zip(df['Team'], win_probs):
    plt.axvline(wp, linestyle='--', color='orange', alpha=0.7)
    plt.text(wp, 0.02, team, rotation=90, verticalalignment='bottom', fontsize=8)

plt.legend()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvxk6dyq48c4rm5ewe0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvxk6dyq48c4rm5ewe0e.png" alt=" " width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡Step 3: Insights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Top teams (like Manchester City &amp;amp; Liverpool) are well above the mean win probability, clustering at the higher end of the curve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mid-table teams sit around the league average, showing balanced but less dominant results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lower-end teams fall significantly below the mean, indicating their struggle in securing wins.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
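
&lt;p&gt;One way to put numbers on "above or below the mean" is a z-score per team. A minimal sketch, reusing the illustrative records from Step 1:&lt;/p&gt;

```python
import statistics

# Same illustrative win/draw/loss records as in Step 1.
records = {
    'Manchester City': (20, 5, 3), 'Liverpool': (18, 7, 3),
    'Chelsea': (16, 8, 4), 'Arsenal': (17, 6, 5),
    'Manchester United': (15, 9, 6), 'Tottenham': (14, 8, 7),
    'Newcastle': (13, 10, 5), 'Aston Villa': (12, 7, 9),
    'West Ham': (11, 6, 11), 'Brighton': (10, 9, 12),
}
win_prob = {t: w / (w + d + l) for t, (w, d, l) in records.items()}
mean = statistics.mean(win_prob.values())
std = statistics.stdev(win_prob.values())

# z-score: standard deviations above (+) or below (-) the league mean
for team, p in sorted(win_prob.items(), key=lambda kv: -kv[1]):
    print(f"{team:<18} p(win) = {p:.3f}  z = {(p - mean) / std:+.2f}")
```

&lt;p&gt;Positive z-scores mark the title contenders; strongly negative ones flag the sides struggling to win at all.&lt;/p&gt;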

&lt;p&gt;Looking at the league this way shows not just who’s winning, but how consistently teams are doing so compared to their peers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
This project shows how Python can turn raw sports statistics into meaningful insights. By combining data preparation, probability calculations, and statistical visualization, I was able to map out a clearer picture of Premier League team performances in 2024/25.&lt;/p&gt;

&lt;p&gt;What’s powerful about this approach is that it isn’t limited to football — the same workflow can be applied anywhere probabilities matter: sales performance, business forecasting, or even academic results.&lt;/p&gt;

&lt;p&gt;Would love to hear your thoughts!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Getting Started with Classification in Machine Learning</title>
      <dc:creator>Purity Ngugi</dc:creator>
      <pubDate>Fri, 29 Aug 2025 10:46:53 +0000</pubDate>
      <link>https://dev.to/purityngugi/classification-in-machine-learning-206k</link>
      <guid>https://dev.to/purityngugi/classification-in-machine-learning-206k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Machine learning has quickly moved from research labs into our everyday lives. It powers things like voice assistants, fraud detection systems, personalized shopping recommendations, and even medical diagnoses. At its core, machine learning is about teaching computers to learn patterns from data and make predictions without being explicitly programmed for every task.&lt;br&gt;
Within the broad field of machine learning, supervised learning stands out as one of the most widely used approaches. And one of the most practical branches of supervised learning is classification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is Classification?&lt;/strong&gt;&lt;br&gt;
Classification is a type of supervised learning where the goal is to assign input data into predefined categories. Instead of predicting continuous values (like house prices in regression), classification deals with discrete outcomes.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this email spam or not spam?&lt;/li&gt;
&lt;li&gt;Is this transaction fraudulent or legit?&lt;/li&gt;
&lt;li&gt;Is this image a cat, a dog, or a bird?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Process of classification:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Collect Data – Gather labeled examples.&lt;/li&gt;
&lt;li&gt;Preprocess – Clean the dataset (remove duplicates, handle missing values, etc.).&lt;/li&gt;
&lt;li&gt;Feature Selection – Pick the most relevant features.&lt;/li&gt;
&lt;li&gt;Model Training – Use an algorithm to learn from the labeled data.&lt;/li&gt;
&lt;li&gt;Evaluation – Measure performance with metrics like accuracy, precision, and recall.&lt;/li&gt;
&lt;li&gt;Prediction – Classify new unseen data.&lt;/li&gt;
&lt;/ol&gt;
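
&lt;p&gt;As a minimal sketch of those six steps, here is one possible pipeline using scikit-learn's bundled breast-cancer dataset (chosen purely for illustration):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)            # 1. collect labeled examples
X_train, X_test, y_train, y_test = train_test_split(  # hold out unseen data
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=5000)             # 4. train a classifier
model.fit(X_train, y_train)

pred = model.predict(X_test)                          # 6. classify unseen data
print("accuracy :", round(accuracy_score(y_test, pred), 3))   # 5. evaluate
print("precision:", round(precision_score(y_test, pred), 3))
print("recall   :", round(recall_score(y_test, pred), 3))
```

&lt;p&gt;Steps 2 and 3 (preprocessing and feature selection) are skipped here because the bundled dataset is already clean; on real data they usually take most of the effort.&lt;/p&gt;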

&lt;p&gt;&lt;strong&gt;Classification Models&lt;/strong&gt;&lt;br&gt;
Different algorithms can be used for classification, and each works best in specific scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logistic Regression – Great for binary classification; simple and interpretable.&lt;/li&gt;
&lt;li&gt;Decision Trees – Easy to visualize; work well with small to medium datasets.&lt;/li&gt;
&lt;li&gt;Random Forests – An ensemble of decision trees that improves accuracy.&lt;/li&gt;
&lt;li&gt;K-Nearest Neighbors (KNN) – Classifies based on similarity but struggles with large datasets.&lt;/li&gt;
&lt;li&gt;Naive Bayes – Excellent for text classification (like spam detection).&lt;/li&gt;
&lt;li&gt;Neural Networks – Handle complex data like images and speech, but can be harder to interpret.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My Insights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What I love about classification is how real it feels—almost every dataset I’ve worked with had a classification angle, from predicting customer churn to filtering out spam.&lt;br&gt;
What I’ve learned is that data quality is everything. A simple model with clean, well-labeled data can outperform a deep learning model trained on messy data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges I’ve Faced&lt;/strong&gt;&lt;br&gt;
Working on classification hasn’t always been smooth. Some common hurdles include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overfitting – Models that memorize training data instead of learning patterns.&lt;/li&gt;
&lt;li&gt;Class Imbalance – When one class dominates, models often ignore the minority class.&lt;/li&gt;
&lt;li&gt;Feature Selection – Choosing which features matter most is not always obvious.&lt;/li&gt;
&lt;li&gt;Interpretability – Complex models like neural networks are “black boxes.”&lt;/li&gt;
&lt;li&gt;Data Quality Issues – Noisy or mislabeled data can drag performance down.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These challenges can be frustrating, but they’ve also pushed me to improve my workflow and try different approaches.&lt;/p&gt;
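
&lt;p&gt;Class imbalance in particular is easy to demonstrate. A small sketch on synthetic data (all numbers invented) shows one common mitigation, scikit-learn's class_weight='balanced' option:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic 95/5 imbalance: majority class centred at 0, minority at 1.5.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (950, 2)),
               rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight='balanced').fit(X, y)

# Minority-class recall: the fraction of true positives actually caught.
print("minority recall, plain   :", recall_score(y, plain.predict(X)))
print("minority recall, weighted:", recall_score(y, weighted.predict(X)))
```

&lt;p&gt;Reweighting shifts the decision boundary toward the majority class, so more of the rare class is caught, usually at the cost of some extra false positives.&lt;/p&gt;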

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Classification is one of the most practical and widely used areas of machine learning. Whether you’re detecting fraud, filtering spam, or building a recommendation engine, classification provides the foundation. While it comes with challenges like imbalance and overfitting, with the right data and approach, it’s both powerful and rewarding.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How Excel is Used in Real-World Data Analysis</title>
      <dc:creator>Purity Ngugi</dc:creator>
      <pubDate>Wed, 11 Jun 2025 13:09:22 +0000</pubDate>
      <link>https://dev.to/purityngugi/how-excel-is-used-in-real-world-data-analysis-40cd</link>
      <guid>https://dev.to/purityngugi/how-excel-is-used-in-real-world-data-analysis-40cd</guid>
<description>&lt;p&gt;Microsoft Excel is spreadsheet software that lets you collect, organize, analyze, calculate, and visualize data efficiently. It is a powerful tool used across industries for budgeting, financial analysis, project tracking, and data visualization, enabling businesses and professionals to make informed decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Applications of Excel&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Business Decision-Making&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Excel helps businesses analyze sales records, monitor key performance indicators, and evaluate projected growth. Companies can combine formulas with well-organized tables to identify trends and make effective decisions.&lt;/p&gt;

&lt;p&gt;2. Financial Reporting&lt;/p&gt;

&lt;p&gt;Budgeting, financial modeling, and reporting are among the tasks financial analysts handle with Microsoft Excel. Pivot tables and advanced formulas support comprehensive analysis and detailed, well-structured files, so organizations can manage their financial data efficiently.&lt;/p&gt;

&lt;p&gt;3. Marketing Performance Analysis&lt;/p&gt;

&lt;p&gt;Excel is indispensable to marketers. It helps them evaluate campaign results, monitor consumer activity, determine ROI, and even pull in data from other channels to analyze and optimize marketing spend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Excel Features for Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;•Pivot Tables: Summarize large datasets to extract meaningful insights, making trends and patterns easier to analyze.&lt;/p&gt;

&lt;p&gt;•Conditional Formatting: Apply formatting to cells based on specific criteria, highlighting important data points for quick analysis.&lt;/p&gt;

&lt;p&gt;•Sorting: Arrange data in a specific order either ascending or descending.&lt;/p&gt;

&lt;p&gt;•HLOOKUP: Search for a specific value in the first row of a range and return related information, streamlining data retrieval.&lt;/p&gt;

&lt;p&gt;•VLOOKUP: Search for a specific value in the first column of a range and return related information, streamlining data retrieval.&lt;/p&gt;
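
&lt;p&gt;For illustration, two example formulas (the cell ranges and lookup values below are hypothetical):&lt;/p&gt;

```
VLOOKUP: search the first column of A2:D50 for "Widget-A" and
return the value from the table's 3rd column (FALSE = exact match):
    =VLOOKUP("Widget-A", A2:D50, 3, FALSE)

HLOOKUP: search the first row of A1:F13 for "Q2" and
return the value from the table's 2nd row:
    =HLOOKUP("Q2", A1:F13, 2, FALSE)
```

&lt;p&gt;In both cases the arguments are: the value to find, the range to search, the column or row number to return from, and whether an approximate match is allowed.&lt;/p&gt;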

&lt;p&gt;&lt;strong&gt;Personal Reflection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With a working knowledge of Excel, my way of looking at data has shifted completely. Data is no longer just numbers but stories waiting to be told. Excel provides the means to make sense of data, support conclusions, and present information clearly. It has changed my perception of the role data plays in modern society.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
