DEV Community: Wangila russell

Normalization vs. Denormalization: Choosing the Right Database Design for Performance

Wangila russell — Thu, 09 Jul 2026 08:46:10 +0000

If you've worked with relational databases like PostgreSQL, MySQL, or SQL Server, you've probably encountered the terms normalization and denormalization. These concepts are fundamental to database design and often come up during interviews, system design discussions, and real-world software development.

As someone building ETL pipelines and data engineering projects, I recently found myself thinking about when to keep data normalized and when it's actually better to duplicate information. The answer isn't simply "one is better than the other." It depends entirely on your use case.

Let's explore both approaches, their advantages, disadvantages, and where each one fits.

What is Normalization?

Normalization is the process of organizing data into multiple related tables to eliminate redundancy and maintain data integrity.

Instead of storing the same information repeatedly, you store it once and reference it using relationships such as primary keys and foreign keys.

Imagine you're designing an e-commerce database.

Instead of storing customer information every time an order is placed, you separate the data.
Customers Table

Customer_ID	Name	Email
1	Alice	alice@email.com
2	John	john@email.com

Orders Table
| Order_ID | Customer_ID | Product | Amount |
| -------- | ----------- | -------- | ------ |
| 101 | 1 | Laptop | 1200 |
| 102 | 1 | Mouse | 25 |
| 103 | 2 | Keyboard | 80 |

The customer's details exist only once. Every order simply references the customer through Customer_ID.

Why Normalize?

Normalization offers several important benefits.

1. Eliminates Duplicate Data

Instead of storing Alice's email address on every order, it exists in only one location.

This saves storage space and reduces unnecessary repetition.

2. Improves Data Integrity

Suppose Alice changes her email.

Without normalization, you might need to update hundreds of records.

With normalization, you update exactly one row.

3. Prevents Update Anomalies

Imagine forgetting to update one duplicate record.

Now your database contains conflicting information.

Normalization minimizes these inconsistencies.
_

Easier Maintenance_

Well-structured databases are easier to modify, extend, and debug.

Relationships between entities become much clearer.

_The Downsides of Normalization
_
The biggest drawback is that retrieving information often requires joins.

For example:

SELECT
    c.name,
    o.product,
    o.amount
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id;

As the number of tables grows, queries become more complex.

On very large databases with millions of records, multiple joins can become expensive.

Why Denormalize?

Denormalization is common in systems where reading data is far more frequent than updating it.

Examples include:

Business Intelligence dashboards
Data warehouses
Analytics platforms
Reporting databases
Recommendation engines
Search indexes

In these systems, speed matters more than eliminating duplicate data.

Advantages of Denormalization
1. Faster Queries

Since related data already exists together, fewer joins are required.

Queries execute much faster.

2. Better Performance for Analytics

Reporting systems often scan millions of rows.

Denormalized tables simplify these operations.

3. Simpler Queries

Instead of joining five or six tables, developers can retrieve everything from one table.

This often makes SQL easier to read and maintain.

The Downsides of Denormalization

Everything comes with trade-offs.

Duplicate Data

Storage requirements increase because the same information appears multiple times.

More Difficult Updates

If Alice changes her email address, every copy must be updated.

Missing one row introduces inconsistent data.

Harder Maintenance

As the database grows, maintaining duplicated information becomes increasingly challenging.

When Should You Normalize?

Normalization works best for transactional systems (OLTP).

Examples include:

Banking systems
Hospital management systems
University databases
Inventory management
Customer relationship management (CRM)

These applications perform frequent inserts, updates, and deletes, making data consistency essential.

*When Should You Denormalize? *

Denormalization shines in read-heavy workloads.

Examples include:

Data warehouses
Power BI dashboards
ETL reporting databases
Data lakes with reporting layers
Large e-commerce analytics
Social media feeds

Most users are reading information rather than modifying it.

CONCLUSION
Normalization reduces redundancy, improves consistency, and keeps transactional systems reliable.

Denormalization sacrifices some of that consistency in exchange for significantly faster reads and simpler reporting.

Neither approach is universally better.

The right choice depends on your application's workload, performance requirements, and maintenance goals.

If you're designing an OLTP system, start with normalization.

If you're building an analytics platform or reporting layer, don't be afraid to denormalize where it makes sense.

Understanding when to use each approach is a skill that separates someone who simply writes SQL from someone who designs efficient, scalable databases.

Data Backfilling with Apache Airflow: Architectures and Implementations for Historical Data Processing

Wangila russell — Wed, 01 Jul 2026 10:54:20 +0000

Introduction
Modern data pipelines are designed to process data continuously, whether hourly, daily, or in real time. However, in practice, pipelines don't always run perfectly. Infrastructure failures, API outages, deployment issues, or newly created workflows often leave gaps in historical data. These missing records can affect dashboards, machine learning models, business reports, and downstream analytics.

This is where data backfilling becomes essential.

Backfilling is the process of rerunning a data pipeline to process historical data for a specific time range. Rather than waiting for future scheduled runs, engineers intentionally execute workflows for dates that were missed or require reprocessing.

Apache Airflow provides one of the most robust solutions for managing backfills because it treats workflows as directed acyclic graphs (DAGs) and schedules tasks based on logical execution dates instead of the current system time.

This article explores what data backfilling is, why it matters, common architectural patterns, and how to implement historical data processing using Apache Airflow.

What is Data Backfilling?

Data backfilling is the process of loading or reprocessing historical data that was not previously ingested into a data warehouse or database.

Imagine a pipeline that collects cryptocurrency prices every day.
Day 1 ✔ Day 2 ✔ Day 3 ✘ (API outage) Day 4 ✘ (Server failure) Day 5 ✔
Instead of accepting missing data for Days 3 and 4, a backfill reruns the pipeline specifically for those dates.

The result becomes:
Day 1 ✔ Day 2 ✔ Day 3 ✔ Day 4 ✔ Day 5 ✔

Why Backfilling Matters

Organizations rely on historical data for:

Business Intelligence dashboards
Financial reporting
Forecasting models
Machine Learning training
Compliance reporting
Trend analysis

Missing data can lead to:

Incorrect KPIs
Poor model accuracy
Misleading visualizations
Faulty business decisions

Backfilling restores data integrity without manually inserting records.

Apache Airflow and Historical Processing

Airflow separates the execution date from the actual runtime.
For example:

Pipeline runs today

Execution Date:
2026-01-01

Actual Runtime:
2026-07-01

The DAG behaves as though it is processing data for January 1st, even though it executes in July.
This concept makes Airflow particularly powerful for historical processing.

Scheduling Historical Data

Suppose a DAG is scheduled daily.

`schedule="@daily"`

If the pipeline starts today but has:

start_date=datetime(2026,1,1)
catchup=True

Airflow automatically schedules every missing execution between January 1st and today.
Instead of one run, Airflow generates many historical runs.

Manual Backfills

Sometimes only specific dates require reprocessing.
Airflow supports manual backfills through the command line.

airflow dags backfill \
    --start-date 2026-01-01 \
    --end-date 2026-01-07 \
    crypto_etl

This command reruns the DAG for each day between January 1st and January 7th.

Challenges of Large Backfills
Historical processing introduces unique challenges.
_
API Rate Limits_
Processing years of data may exceed API quotas.

*Database Bottlenecks
*
Large inserts can slow production systems.

Long Execution Times
Backfills may take hours or days.

Dependency Management
Downstream pipelines should not execute until backfills finish.
Proper orchestration is essential.

When Should You Avoid Backfills?
Backfills are not always the best solution.
Avoid them when:

Source systems no longer retain historical data.
Historical records are immutable and already archived.
The processing cost outweighs the analytical value.
Regulatory policies prohibit rewriting historical datasets.

Sometimes documenting missing data is preferable to recreating it.

Conclusion

Data backfilling is an essential capability for building reliable data engineering pipelines. Rather than accepting gaps caused by outages or newly deployed workflows, engineers can safely reconstruct historical datasets while preserving data quality.

Apache Airflow simplifies this process by scheduling workflows based on logical execution dates instead of the current time, making it possible to replay historical periods with minimal manual effort. Combined with idempotent ETL design, partitioned processing, and robust validation, Airflow enables organizations to maintain accurate, complete, and trustworthy datasets.

As data platforms continue to grow in scale and complexity, mastering backfilling techniques becomes a valuable skill for every data engineer. Building pipelines that can recover gracefully from failures is just as important as building pipelines that run successfully the first time.

Understanding Apache Airflow DAGs: Structure, Communication, and Deployment

Wangila russell — Tue, 23 Jun 2026 13:21:29 +0000

Apache Airflow has become one of the most widely used workflow orchestration platforms for building, scheduling, and monitoring data pipelines. At the heart of Airflow lies the Directed Acyclic Graph (DAG), a structure that defines how tasks are organized and executed. Understanding DAGs is essential for anyone working with data engineering, ETL pipelines, or workflow automation.

What is a DAG?
A Directed Acyclic Graph (DAG) is a collection of tasks organized in a way that defines dependencies and execution order.

Directed- means tasks have a specific direction of execution.
Acyclic- means there are no loops; a task cannot eventually depend on itself.
Graph- represents the relationship between tasks.

Basic DAG Structure
A typical Airflow DAG consists of:

DAG definition
Tasks (Operators or TaskFlow functions)
Dependencies

from airflow.sdk import dag, task
from datetime import datetime 
@dag( 
     start_date=datetime(2026, 1, 1), 
     schedule="@daily", 
     catchup=False 
) 
def sample_dag(): 
     @task def extract(): 
        return "data" 
     @task def transform(data): 
        return data.upper() 
     @task def load(data): 
        print(data)
     load(transform(extract())) 
sample_dag()

This DAG follows a simple Extract → Transform → Load pattern.

Task Communication with XCom

Tasks in Airflow are isolated from one another. To share information between tasks, Airflow provides Cross-Communication (XCom).

XCom allows tasks to push and pull small pieces of data.

Deploying DAGs with SCP

In many production environments, Airflow runs on a remote Linux server. Instead of manually recreating DAG files, engineers often use Secure Copy Protocol (SCP) to transfer DAGs.

scp gas_prices_dag.py user@server:/home/user/airflow/dags/

This command securely copies the DAG file to the server's DAG directory.

SCP is especially useful when deploying updated pipelines from a development machine to a production Airflow environment.

Running Airflow Services with nohup

Airflow components such as the scheduler and webserver need to remain running even after a terminal session closes.

The nohup command helps achieve this.

nohup airflow standalone &

This starts the scheduler in the background and prevents it from stopping when the terminal closes.
The output is redirected to log files for troubleshooting.

Managing Airflow with systemd

For production environments, systemd is the preferred way to manage Airflow services.

A systemd service can automatically:

Start Airflow after system boot
Restart failed services
Manage logs
Monitor service health

Monitoring and Troubleshooting DAGs

Airflow provides a web interface where users can:

Trigger DAG runs
Monitor task execution
View task logs
Retry failed tasks
Inspect XCom values

Conclusion
Apache Airflow DAGs provide a powerful way to orchestrate complex workflows and data pipelines. By understanding DAG structure, task dependencies, XCom communication, and deployment techniques such as SCP, nohup, and systemd, data engineers can build reliable and maintainable ETL systems. Whether running a simple daily pipeline or a large-scale production workflow, mastering DAGs is the foundation of effective workflow orchestration with Apache Airflow.

Data Engineering Pipeline: Understanding ETL vs ELT

Wangila russell — Fri, 12 Jun 2026 12:12:26 +0000

Introduction
This week, I started learning Data Engineering concepts, and one of the most important topics we covered was ETL and ELT pipelines.

To make it practical, I built my first simple data pipeline using Python, which helped me understand how raw data is transformed into usable insights.

What is a Data Pipeline?

A data pipeline is a process that moves data from one system to another while transforming it into a usable format.

Think of it like:

Raw data → Cleaning → Processing → Final usable data

What is ETL?
ETL = Extract, Transform, Load

Extract – Get data from a source (API, database, file)
Transform – Clean and structure the data
Load – Store it in a database or warehouse Example

Extract weather data from an API
Clean missing values
Store it in a database

What is ELT?

ELT = Extract, Load, Transform

Extract – Get raw data
Load – Store raw data first
Transform – Clean and process inside the database Key idea: In ELT, transformation happens after loading, usually in powerful systems like data warehouses.

ETL	ELT
Transform before loading	Transform after loading
Used in traditional systems	Used in modern cloud systems
Less raw data stored	Raw data stored first

My First Data Pipeline

This week, I built a simple Python pipeline that:

Extracted data from a collectapi
Transformed it into a structured format
Converted it into a Pandas DataFrame
Transformed the data and stored in my db

Here is a code snippet of transforming the data

def transform_cities(data):
    cities_df = pd.DataFrame(data)
    cities_df = cities_df.drop(columns="lowername")
    cities_df = cities_df.rename(columns={"name": "cities"})
    return cities_df

Final Thoughts

Building my first pipeline helped me connect theory with practice. It showed me that data engineering is not just about tools it’s about designing flow and structure for data.

This is just the beginning, and I’m excited to build more complex pipelines in the future.

Python and Its Role in the Data Analytics Space

Wangila russell — Mon, 11 May 2026 10:50:45 +0000

Introduction
Python is one of the most widely used programming languages in the world today. Over the last decade, it has become extremely popular among software developers, cybersecurity professionals, machine learning engineers, and especially data analysts. Many companies, organizations, universities, and researchers rely on Python because of its simplicity, flexibility, and powerful capabilities.

In the field of data analytics, Python has become one of the most important tools for collecting, cleaning, analyzing, visualizing, and interpreting data. Organizations generate large amounts of data every day, and Python helps analysts transform raw data into meaningful information that supports decision-making.

Python is considered beginner-friendly because its syntax is simple and easy to understand. Unlike many programming languages that require complicated structures, Python allows beginners to focus on solving problems instead of struggling with syntax rules. This is one of the main reasons why many people learning data analytics start with Python.

This article explains what Python is, why it is widely used in data analytics, the major libraries used in analytics, how Python is used to clean and visualize data, real-world applications of Python in analytics, and why beginners should learn Python.

Why Python is Popular in Data Analytics

Python has become the preferred language in data analytics for several reasons.

1. Simplicity and Readability

Python uses simple syntax that resembles normal English. This makes it easier to learn compared to languages such as Java or C++.

For example:

sales = [100, 200, 300]
print(sum(sales))

2. Large Number of Libraries

Python has many libraries specifically designed for data analysis and visualization. These libraries reduce the amount of code needed and improve productivity.

Examples include:

Pandas
NumPy
Matplotlib
Seaborn
Plotly
Scikit-learn

3. Automation Capabilities

Python can automate repetitive tasks such as:

Data cleaning
Report generation
File processing
Data extraction

This saves time and increases efficiency.

Python Libraries Used in Data Analytics

Libraries are collections of pre-written code that help developers perform tasks more efficiently.

1. Pandas

Pandas is one of the most important Python libraries for data analytics.

It is mainly used for:

Reading data
Cleaning data
Filtering data
Grouping data
Data transformation
Statistical analysis

Example:

import pandas as pd

file = pd.read_csv("sales.csv")
print(file.head())

2. NumPy

NumPy is used for numerical computing.

It supports:

Mathematical calculations
Arrays
Matrix operations
Statistical functions

Example:

import numpy as np

numbers = np.array([1, 2, 3, 4])
print(numbers.mean())

3. Matplotlib

Matplotlib is used for creating visualizations.

It helps analysts create:

Bar charts
Pie charts
Line graphs
Histograms

Example:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar"]
sales = [100, 200, 300]

plt.plot(months, sales)
plt.show()

4. Seaborn

Seaborn builds on Matplotlib and provides more attractive visualizations.

It is commonly used for:

Heatmaps
Distribution plots
Correlation analysis 5. Scikit-learn

Scikit-learn is mainly used for machine learning.

It supports:

Prediction models
Classification
Clustering
Regression

How Python is Used in Data Cleaning

Raw data is often messy and contains problems such as:

Missing values
Duplicates
Incorrect formats
Blank spaces
Invalid entries

Data cleaning is one of the most important stages in analytics.

Python helps analysts clean data efficiently.

Example of Handling Missing Values

import pandas as pd

file = pd.read_csv("customers.csv")
file = file.dropna()

Example of Removing Duplicates

file = file.drop_duplicates()

Example of Converting Data Types

file["amount"] = file["amount"].astype(float)

These operations help ensure the data is accurate and reliable before analysis begins.

How Python is Used in Data Analysis

After cleaning data, analysts use Python to analyze and discover patterns.

Python supports many analytical operations such as:

Calculating averages
Grouping records
Identifying trends
Comparing categories
Statistical analysis Example of Grouping Data

revenue = file.groupby("region")["sales"].sum()
print(revenue)

Example of Finding Average Sales

average_sales = file["sales"].mean()
print(average_sales)

Python enables analysts to process thousands or millions of records efficiently.

How Python is Used in Data Visualization

Data visualization helps analysts present information in a way that is easy to understand.

Python can create:

Dashboards
Charts
Graphs
Interactive reports

Visualizations help organizations make better decisions.

Example of Bar Chart

import matplotlib.pyplot as plt

products = ["Phone", "Laptop", "Tablet"]
sales = [500, 300, 200]
plt.bar(products, sales)
plt.title("Product Sales")
plt.show()

Visualizations make trends and patterns easier to identify.

Real-World Applications of Python in Data Analytics

Python is used in many industries around the world.

1. Banking and Finance

Banks use Python for:

Fraud detection
Customer analysis
Risk management
Financial forecasting 2. Healthcare Hospitals and researchers use Python for:
Patient data analysis
Disease prediction
Medical research

3. E-commerce

Online stores use Python to:

Analyze customer behavior
Recommend products
Track sales trends

Python and APIs in Data Analytics

An API allows applications to communicate and exchange data.
Python can connect to APIs and extract real-time data.
Example:


import requests
response = requests.get("https://dummyjson.com/products")
print(response.json())

This is important because modern analytics often involves collecting live data from online systems.

Python and JSON Data

JSON stands for JavaScript Object Notation.
It is a popular format for storing and exchanging data.
Python works very well with JSON.

Example:

import json
sample = '{"name":"John", "age":25}'
person = json.loads(sample)
print(person)

Many APIs return JSON responses, making Python ideal for data extraction.

Conclusion

Python has become one of the most important programming languages in the world, especially in data analytics. Its simplicity, flexibility, and powerful libraries make it ideal for beginners and professionals alike.

Python helps analysts collect, clean, analyze, visualize, and interpret data efficiently. It supports automation, works with APIs and JSON data, and integrates with many technologies.

Organizations in banking, healthcare, education, telecommunications, e-commerce, and transportation rely heavily on Python for decision-making and data-driven insights.

For beginners interested in technology, data analytics, machine learning, or software development, learning Python is an excellent choice. It is easy to start with, highly demanded in the job market, and supported by a massive global community.

As the importance of data continues growing, Python will remain a key tool in helping organizations understand information and make better decisions.

RAG FOR DUMMIES

Wangila russell — Sun, 14 Sep 2025 13:52:12 +0000

Introduction

Large Language Models (LLMs) like ChatGPT are powerful, but they have two big problems:

They hallucinate (make up answers that sound real).
They don’t always know the latest information because their knowledge is frozen at training time.

Enter RAG – Retrieval-Augmented Generation.
Think of RAG as giving an AI a memory stick + Google access. Instead of only relying on what it remembers, it can look up relevant info first, then answer your question.
What is RAG?

RAG = Retriever + Generator.

Retriever: Finds the most relevant pieces of information from an external knowledge base (documents, PDFs, databases, websites, etc.).
Generator: Uses an LLM to create a natural language response, but grounded in the retrieved context.

Without RAG, the model is like a student taking a test with no books allowed.
With RAG, it’s an open-book exam — much more reliable.
How RAG Works (Step by Step)

You ask a question → “What’s the latest cyberattack trend in 2025?”
Retriever searches knowledge → Fetches relevant articles/reports.
Generator (LLM) → Reads both your question + retrieved context.
Final Answer → Factual, updated, and less likely to be hallucinated.

Conclusion
RAG is like giving AI superpowers:

It remembers less but knows more (because it can look things up).
It makes AI more accurate, explainable, and trustworthy.

The future of AI will almost certainly be retrieval-augmented rather than purely generative.
So next time you hear “RAG,” just remember:

It’s an open-book exam for AI.

📝 Supervised Learning

Wangila russell — Sun, 24 Aug 2025 21:28:41 +0000

Understanding Supervised Learning

Supervised learning is essentially "learning with guidance." Supervised learning is a type of machine learning where we teach the computer using labeled data. In simple terms, the dataset already contains both the input (features) and the correct output (labels). The algorithm’s job is to learn the relationship between them so it can predict outcomes for new, unseen data.
Supervised learning can be divided into two categories:

Regression – Used when the target variable is continuous. Example: predicting house prices, stock values, or a person’s weight.
Classification – Used when the target variable is categorical. Example: predicting whether an email is spam or not spam, or whether a patient has a disease or not.

How Classification Works
Classification is a branch of supervised learning where the goal is to assign input data to one of several categories. For example, given an email, the model decides whether it’s spam or not spam. The process involves training on labeled examples, learning patterns, and then applying the model to make predictions.

Models Used for Classification

k-Nearest Neighbors (k-NN) – Classifies based on similarity to nearby data points.
Naïve Bayes – Probabilistic model often used for text classification.
Decision Trees & Random Forests – Handle both categorical and numerical data effectively.
Gradient Boosting (XGBoost, LightGBM, CatBoost) – State-of-the-art models for structured data. My Personal Insights What fascinates me about classification is its wide range of applications – from medical diagnosis to fraud detection. Even though models like Random Forests are powerful, sometimes simpler models (like Logistic Regression) perform surprisingly well when data is clean and structured. Challenges I’ve Faced The biggest challenge has been feature selection. Too many irrelevant features can mislead the model. Another issue is interpretability – complex models like Gradient Boosting are accurate but hard to explain, which can be problematic in sensitive areas like healthcare.

⚖️ Balancing Type I and Type II Errors: A Medical Perspective

Wangila russell — Sun, 10 Aug 2025 22:01:16 +0000

Introduction
In statistics, Type I and Type II errors represent two different kinds of mistakes we can make when testing hypotheses. Deciding where to trade off between them is a crucial part of designing tests, experiments, or decision-making systems. In high-stakes fields such as medicine, the trade-off can literally mean life or death.
Understanding the Errors
Type I Error — False Positive
A Type I error occurs when we reject the null hypothesis when it is actually true.
In simple terms:
We conclude something is happening when it really is not.
Example in medicine:
A test says a patient has a disease when they are actually healthy.
Consequence:
Unnecessary anxiety, additional testing, possible harmful treatments.
Type II Error — False Negative
A Type II error happens when we fail to reject the null hypothesis when it is actually false.
In simple terms:
We miss detecting something that is actually happening.
Example in medicine:
A test says a patient does not have a disease when they actually do.
Consequence:
Missed diagnosis, delayed treatment, worsened prognosis.
The Trade-Off
There is an inherent trade-off between Type I and Type II errors.

Lowering the chance of Type I errors (making a test more “strict”) usually increases the chance of Type II errors.
Lowering the chance of Type II errors (making a test more “sensitive”) usually increases the chance of Type I errors.
This balance is controlled by:
Significance level (α): Probability of a Type I error.
Power (1 - β): Probability of detecting a true effect (reducing Type II errors).
Medical Scenario: Screening for a Serious Disease
Let’s imagine a blood test that screens for an early-stage cancer.
If we prioritize avoiding Type I errors:
We set a very strict threshold for calling the test positive.
Fewer healthy people will be incorrectly told they have cancer (fewer false positives).
BUT… some people with early cancer may test negative and go untreated (more false negatives).
If we prioritize avoiding Type II errors:
We set a more lenient threshold for calling the test positive.
We will catch almost everyone who has cancer (fewer false negatives).
BUT… more healthy people may be told they might have cancer, leading to unnecessary biopsies (more false positives).
Where to Trade Off in Medicine
The trade-off decision depends on:
Severity of the disease — If the disease is fatal and treatable in early stages, we often accept more Type I errors to catch all true cases.
Cost and risk of follow-up tests — If confirmatory tests are cheap and safe, a higher false-positive rate is acceptable.
Psychological impact — Over-diagnosis can cause stress; under-diagnosis can be life-threatening.

Example Decision:
For cancer screening, most doctors would favor minimizing Type II errors (false negatives) even at the cost of more false positives, because missing the disease could be deadly, whereas a false alarm can be corrected with further tests.
Conclusion
In any testing scenario, we cannot completely eliminate both Type I and Type II errors — improving one often worsens the other.
In medical diagnostics, especially for serious diseases, the priority is often to reduce Type II errors to ensure no case goes undetected, even if it means tolerating a higher number of false positives.

The choice of where to trade off depends on:

The consequences of each type of error
The costs and risks of follow-up actions
The values and priorities of patients, doctors, and society

In short: In life-critical medical scenarios, it’s better to risk a false alarm than to miss the real danger.

⚽ Calculating Premier League Win Probabilities Using Python and the Football-Data.org API

Wangila russell — Mon, 28 Jul 2025 10:37:36 +0000

As a football enthusiast and data science learner, I decided to analyze last season’s Premier League teams by calculating the probability of winning a specific number of games using the Bernoulli distribution. This article walks through how I used the Football-Data.org API and Python to extract match data and model win probabilities.

📦 Tools & Tech Stack

Python 🐍
Requests (HTTP Library)
Football-Data.org API
Bernoulli Distribution Formula:

𝑃(𝑘 wins)=(𝑛/𝑘)𝑝𝑘 (1−𝑝)𝑛−𝑘

where:

k = number of games won

n = total number of games played (usually 38)

p = estimated probability of winning a game
🔑 Step 1: Getting the API Key
To use the API:

Sign up at https://www.football-data.org/
Get your API key from the dashboard
Save it in a .env file like this:

API_KEY=your_api_key_here

🔐 Make sure to add .env to your .gitignore so it's never pushed to GitHub.

📡 Step 2: Fetch Premier League Standings via API
We used the /competitions/PL/standings endpoint for the 2024/2025 season:

import requests
import os
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv("API_KEY")

url = "https://api.football-data.org/v4/competitions/PL/standings"
headers = {"X-Auth-Token": api_key}

response = requests.get(url, headers=headers)
data = response.json()

📐 Step 3: Calculate Win Probability
We used the Bernoulli distribution to calculate the probability of each team winning k games out of n = 38:

import math

def calculate_win_probability(team_name, wins, total_games=38):
    p = wins / total_games
    probability = math.comb(total_games, wins) * (p ** wins) * ((1 - p) ** (total_games - wins))
    return team_name, round(probability, 6)

📈 Results
This gave us a probabilistic view of how likely it is that a team would win exactly the number of games they did — based on a binomial model.
| Team | Wins | Win Probability |
| --------------- | ---- | --------------- |
| Manchester City | 28 | 0.048129 |
| Arsenal | 26 | 0.060201 |
🤔 Limitations
The Bernoulli/binomial model assumes each match is independent and has equal probability, which isn’t realistic in football.
It does not account for home/away advantage, injuries, transfers, or form.
Still, it’s a fun and mathematically sound way to get started with sports analytics!

✅ Conclusion
This project was a great exercise in:

Consuming real-world APIs
Using statistical methods like the binomial distribution
Thinking probabilistically about sports performance

🧠 Understanding Measures of Central Tendency in Data Science

Wangila russell — Sun, 20 Jul 2025 15:25:44 +0000

_📌 Introduction*_*
In the world of data science, one of the first steps to understanding your dataset is to summarize it effectively. That’s where measures of central tendency come in. These are statistical metrics that give us a quick snapshot of what a "typical" data point looks like.

Whether you're cleaning data, performing exploratory data analysis (EDA), or building predictive models, knowing the center of your data distribution is crucial for making informed decisions.
📊 What Are Measures of Central Tendency?

Measures of central tendency are used to describe the center point or typical value of a dataset. The three most common ones are:

1. Mean (Average)
The sum of all values divided by the number of values. It's sensitive to outliers but useful for normally distributed data.
Example:

import numpy as np
data = [2, 4, 6, 8, 100]
mean = np.mean(data)
print(mean)

2. Median
The middle value when the data is sorted. It’s robust to outliers and skewed data.

Example:

median = np.median(data)
print(median)  # Output: 6

Mode The most frequently occurring value(s) in the dataset.

Example:

from scipy import stats
mode = stats.mode(data)
print(mode.mode[0])  # Output: 2

🔍 Why Are They Important in Data Science?
Data Summarization: Helps understand large datasets at a glance.

Outlier Detection: Comparing mean and median can help detect anomalies.

Feature Engineering: Central values are often used in data imputation, scaling, or as baselines.

Modeling Decisions: Knowing data distribution helps choose appropriate algorithms (e.g., use median for skewed data).

Interpretability: When explaining models or visualizations to stakeholders, central tendency makes results more relatable.
📈 Visual Example
A boxplot or histogram often visually illustrates the mean, median, and distribution.

import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(data)
plt.title("Boxplot Showing Central Tendency")
plt.show()

📌 Conclusion
Measures of central tendency are fundamental tools in the data scientist's toolbox. They offer insight into the nature of the data, support better decision-making, and help communicate results effectively. Understanding when and how to use the mean, median, and mode ensures that your analysis is both accurate and actionable.

How Excel is Used in Real-World Data Analysis

Wangila russell — Tue, 10 Jun 2025 13:03:27 +0000

Excel is one of the most widely used tools in data analysis. It’s accessible, powerful, and flexible—used across industries to store, manipulate, and visualize data. This past week, I began my journey into Excel as part of my Data Science & Analytics course, and I was surprised at how much can be done with what initially looks like a simple spreadsheet program.

Real-World Applications of Excel in Data Analysis include:
Business Decision-Making: Companies rely on Excel to analyze trends and make informed decisions. For example, sales data can be sorted and filtered to show top-performing products, which helps managers decide where to focus marketing efforts or adjust inventory levels.

Financial Reporting: Accountants and financial analysts use Excel for budgeting, forecasting, and tracking expenses. With formulas and functions, it's easy to calculate monthly costs, compare actuals to forecasts, and generate quick summaries.

Marketing Performance Analysis: Excel helps marketers track campaign performance by analyzing metrics like click-through rates, conversion rates, and customer engagement. Pivot tables and charts make it easy to compare results across campaigns or time periods.

Excel Features and Formulas I’ve Learned
This week, I learned several powerful Excel tools that are essential in real-world data work:

VLOOKUP: This function helps you find specific data in large tables. For example, if you have a product ID and need to retrieve its description or price from another sheet, VLOOKUP makes that quick and simple.

Conditional Formatting: With this, I can highlight cells based on specific rules—such as showing all sales below a target in red. This instantly draws attention to important data points.

Pivot Tables and Pivot Charts: These allow you to summarize large datasets in a clean and interactive way. I used them to break down data by categories and create dynamic charts for dashboards.

Other useful skills included data validation (to control input), INDEX-MATCH (a more flexible alternative to VLOOKUP), and creating dashboards that combine multiple insights into a single, interactive view.

Personal Reflection
Before learning Excel, I saw data as something complex and intimidating. But now, I realize that with the right tools, anyone can make sense of data and extract meaningful insights. Excel has given me a hands-on way to explore data, find patterns, and tell stories with numbers. It’s no longer just rows and columns—it’s a canvas for analysis.