DEV Community: Naomi Jepkorir

Trade-offs of Airflow Standalone: SQLite Locks and Ghost Users

Naomi Jepkorir — Mon, 22 Jun 2026 15:51:04 +0000

Running Apache Airflow locally is the fastest way to test your DAGs, but it comes with a hidden cost: SQLite. Because local Airflow relies on a simple file-based database and background web workers, a single ungraceful shutdown can leave your environment completely deadlocked.

If you are fighting database is locked errors, port collisions, or mysterious 'Unauthorized' login failures, you don't need to rebuild your environment. You just need to know how to properly clear the cache. I had to learn this through first-hand experience, and here is how you can handle it:

1. The SQLite Death Grip

The Trade-off: Airflow uses SQLite by default because it requires zero configuration; it’s literally just a file on your hard drive. The massive downside is that SQLite handles concurrent connections poorly. If your standalone server hangs and you force-quit your terminal, background processes maintain a death grip on that file.

When you try to boot back up, you are met with a wall of database is locked errors in the logs.

The Fix: You have to nuke the invisible processes and delete the corrupted database to start fresh. Run this in your terminal:

# 1. Kill any hidden background processes
pkill -9 -f airflow

# 2. Delete the deadlocked database (and any journal files)
rm ~/airflow/airflow.db*
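
If you would rather confirm the lock before deleting anything, a small Python check can help. This is only a sketch: the helper name sqlite_is_locked is made up for illustration (it is not part of Airflow), and it assumes the default ~/airflow/airflow.db location used above. It asks SQLite for the write lock and reports whether another process is still holding it.

```python
import os
import sqlite3


def sqlite_is_locked(db_path, timeout=0.5):
    """Return True if another connection currently holds the write lock.

    Hypothetical helper for illustration; not part of Airflow.
    """
    # isolation_level=None puts the connection in autocommit mode, so we
    # control the transaction explicitly with BEGIN / ROLLBACK.
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # request the write lock
        conn.execute("ROLLBACK")         # release it again, changing nothing
        return False
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc).lower():
            return True
        raise
    finally:
        conn.close()


if __name__ == "__main__":
    path = os.path.expanduser("~/airflow/airflow.db")  # assumed default location
    if os.path.exists(path):
        print("locked" if sqlite_is_locked(path) else "not locked")
    else:
        print("no database file found at", path)
```

If this prints "locked" after the pkill step, something is still holding the file, and it is worth checking for leftover processes before reaching for rm.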

2. The Port 8080 Hostage Situation

The Trade-off: airflow standalone is actually a wrapper command that spins up four heavy components at once (scheduler, webserver, triggerer, database). Pushing this to the background or shutting it down ungracefully often leaves orphaned web workers alive.

You will know you hit this trap if your UI refuses to load and your logs spit out an Errno 98. Here's a snippet of what it might look like:

api-server | INFO:     Started server process [25087]
api-server | INFO:     Waiting for application startup.
api-server | INFO:     Application startup complete.
api-server | ERROR:    [Errno 98] error while attempting to bind on address ('0.0.0.0', 8080): address already in use
api-server | INFO:     Waiting for application shutdown.
api-server | INFO:     Application shutdown complete.

The Fix: Those orphaned workers are usually gunicorn processes. You will need to terminate them so your new server can bind to the port.

pkill -9 -f gunicorn
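
Before restarting, you can double-check that nothing is still listening on the port. A minimal sketch, assuming the server is reachable on 127.0.0.1 and the default port 8080; port_in_use is a hypothetical helper name, not an Airflow command:

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 8080 in use:", port_in_use(8080))
```

If it still reports True after the pkill, some other process owns the port and the new server will hit the same Errno 98.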

3. The "Ghost User" Lockout

The Trade-off: To save you from having to create an admin profile manually every time you test locally, standalone Airflow automatically generates a password and saves it to a tiny hidden simple_auth_manager_passwords.json.generated file.

However, if you just deleted your airflow.db to fix the SQLite lock mentioned above, you also deleted your user account. But because that .generated text file still exists, Airflow assumes the user is still active and refuses to give you a new password. You will type the exact password into the UI and immediately hit a 401 Unauthorized error.

If you check your startup logs, you will see Airflow skip password generation entirely.

The Fix: You need to delete that hidden file so Airflow is forced to mint a brand new admin user from scratch.

rm ~/airflow/simple_auth_manager_passwords.json.generated

The Ultimate Reset Protocol

If your local environment is completely frozen, chain it all together. This is the exact sequence to completely factory reset a jammed standalone server without touching a single line of your DAG code:

pkill -9 -f airflow
pkill -9 -f gunicorn
rm -f ~/airflow/airflow.db*
rm -f ~/airflow/simple_auth_manager_passwords.json.generated
airflow standalone

When you run that final command, Airflow will breathe easily, rebuild a clean database, bind to port 8080 without a fight, and print a new password right in your terminal.
And that's on another day of fighting errors in the terminal😂.

How Linux is Used in Real-World Data Engineering

Naomi Jepkorir — Tue, 21 Apr 2026 09:26:32 +0000

So, you know Python and think, "Hey, why don't I get into Data Engineering?" You have your learning checklist ready to go, but there’s a giant, terminal-shaped hole in your plan: Linux.

While Python is the language of data, Linux is the environment where that data actually lives. Most production systems are built entirely on it, yet many beginners don't realize they need it until they’re staring at a broken cloud server with no GUI in sight.

If you want to avoid that "deep end" feeling, you need to understand the environment you're building in. Here's how it actually shows up in your day-to-day life as a Data Engineer:

1. Automation: cron Is Your Teammate

In tutorials, data is static. In reality, data never sleeps.

Let’s say you need to ingest millions of rows from an API every night. You’re not waking up at 2:00 AM to run a script manually, and if you are, something has gone very wrong.

This is where Linux automation comes in.

With a single cron job, your pipeline runs reliably in the background:

# Run the data ingestion script every night at 2:00 AM
0 2 * * * /usr/bin/python3 /home/user/pipelines/ingest_data.py >> /var/log/ingest.log 2>&1

That one line handles scheduling, execution and logging, and because 2>&1 sends errors to the same file, failures leave a trail too.

This is the difference between “I ran a script” and “I operate a system.”
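
For that log file to be useful, the script itself should say what happened and finish with a non-zero exit code when it fails. Here is a minimal sketch of what ingest_data.py could look like. run_ingest is a hypothetical placeholder for your real ingestion logic, and the log format is just one reasonable choice:

```python
import logging
import sys

# Log to stderr; the cron line's "2>&1" sends it into the same log file
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)


def run_ingest():
    """Hypothetical placeholder: replace with the real API ingestion."""
    return 0  # pretend we ingested zero rows


def main(job=run_ingest):
    """Run the job, log the outcome, and return a shell exit code."""
    try:
        rows = job()
    except Exception:
        # logging.exception records the full traceback in the log
        logging.exception("ingestion failed")
        return 1
    logging.info("ingestion finished, rows=%s", rows)
    return 0


if __name__ == "__main__":
    code = main()
    if code != 0:
        sys.exit(code)  # a non-zero exit code marks the run as failed
```

With this shape, a bad night shows up in /var/log/ingest.log as a timestamped traceback instead of silence.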

2. First-Pass Cleaning: Bash vs. Pandas

Scenario: You try to load a massive, 50GB dataset into a Python Pandas dataframe, and your machine immediately crashes with a MemoryError.

Here's the problem:

Pandas loads the whole file into memory (RAM) by default
Linux tools stream data from disk

Data engineers use native Linux tools to slice, filter and clean massive files before Python ever touches them.

Need to find error logs?

grep "ERROR" massive_server_log.txt > filtered_errors.txt

Need specific columns from a huge CSV?

awk -F',' -v OFS=',' '{print $2, $5}' raw_data.csv > cleaned_columns.csv

Mastering sed, awk, and grep allows you to process gigabytes of data in seconds using fractions of the memory.

If your first instinct is Pandas for large files, you’re already in trouble.
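
When you do reach for Python, the same streaming idea applies: read one line at a time instead of loading the whole file. A minimal sketch of a grep-style filter; filter_lines and the file names are hypothetical, chosen to mirror the grep example above:

```python
def filter_lines(src_path, dst_path, needle):
    """Copy lines containing `needle` from src to dst, one line at a time."""
    kept = 0
    with open(src_path, "r", encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # the file object yields one line at a time
            if needle in line:
                dst.write(line)
                kept += 1
    return kept


# Example usage (file names are placeholders):
# filter_lines("massive_server_log.txt", "filtered_errors.txt", "ERROR")
```

Memory use stays roughly flat no matter how big the input is, because only the current line is held at any moment.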

3. Environment Mastery: Docker Makes It Reproducible 🐳

"It works on my machine!" is how outages begin.

Real pipelines depend on exact versions of Python, libraries and system dependencies. You cannot assume your production server matches your laptop.

Docker solves this by packaging everything into a consistent environment.

But here's the catch: Docker runs on Linux. If you don't understand Linux basics, your containers will fail in confusing ways: permissions, file paths, volumes.

A simple example:

docker build -t data-pipeline .
docker run -v /data:/app/data data-pipeline

If you don’t understand how /data permissions work on the host system, this breaks fast.

Knowing commands like chmod and chown isn’t optional, it’s what makes your pipelines actually run.
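
If you want to see what chmod and chown are actually changing, you can inspect a path before mounting it. A small sketch using only the standard library; describe_path is a made-up helper name, and you would point it at your own host directory:

```python
import os
import stat


def describe_path(path):
    """Return (mode string, owner uid, owner gid) for a file or directory."""
    info = os.stat(path)
    # stat.filemode renders the permission bits the way `ls -l` does
    return stat.filemode(info.st_mode), info.st_uid, info.st_gid


if __name__ == "__main__":
    print(describe_path("."))
```

If the uid shown here differs from the user your container runs as, and the mode string does not grant "other" access, the mounted volume will refuse reads or writes inside the container.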

4. Surviving the Cloud: SSH and Tmux

Production systems don’t come with a UI. You get a terminal and a blinking cursor.

You connect using SSH to a remote server, and everything you do happens there.

Now imagine this:
You start a 6-hour job… and your Wi-Fi drops.

Connection gone. Job gone.

Unless you’re using a terminal multiplexer like tmux.

tmux new -s pipeline_run

Run your jobs inside tmux, and they keep running even if you disconnect. You can come back hours later, reattach with tmux attach -t pipeline_run, and pick up exactly where you left off.

This isn’t a trick, it’s survival.

Wrapping It Up...

Jupyter notebooks are great for experimenting. But real data engineering happens in the terminal.

Linux is how you:

automate pipelines
process massive files
manage environments
operate remote systems

It’s not a nice-to-have skill. It’s the bridge between local projects and real-world systems.

The next time you’re about to write Python to move a file or filter a dataset, try Bash first.

Surviving a Kernel Panic: My Ubuntu War Story

Naomi Jepkorir — Mon, 23 Mar 2026 07:17:40 +0000

When your backlog is full of data science models and software engineering tasks, the last thing you need is your OS failing to boot because of a kernel panic, no?

Well, this happened to me. I was given the option to reboot, did it the first time, and I was back in after choosing an older kernel version from the menu. It happened a second time, a third time... and I was just okay with it, saying, "I'll fix it later."
This worked like a charm right up until Linux decided it had had enough. It finally threw an Input/output error, locked the root filesystem / into Read-only mode, and locked me out completely.

I couldn't even poweroff via the terminal.

The silver lining came when I realized I had a Kali Linux installer ISO sitting on a flash drive nearby. In case you find yourself in this, let's call it "very specific", situation, here is how to perform open-heart surgery on your system before you do anything rash like wiping your drive.

Step 1: The Hard Reset & The USB Boot

Since the terminal was completely frozen, a graceful shutdown was out of the question. Long-press the power button to force the PC to power off. Turn it back on and immediately open your startup menu (usually by spamming Esc, F12, or F9 depending on your laptop model). From there, boot directly into your rescue USB.

Step 2: Dropping into the Underworld (The BusyBox Shell)

Because I was using a Kali Linux installer ISO (not a Live Desktop), there was no beautiful graphical interface to save me. I had to navigate the installer menu and select "Execute a shell".

This drops you into a raw, stripped-down BusyBox ash shell. From here, we need to find exactly where the broken Ubuntu system lives on the hard drive.

Run this command to list your partitions:

fdisk -l

Scan the output for your main Linux filesystem. In my case, Ubuntu was sitting on /dev/sda2.

Step 3: Find and repair

Because I was forced to hard-reset the machine multiple times while the OS was locked up, the filesystem metadata was corrupted. Files were left hanging in memory, creating "orphaned inodes." If you try to boot with a corrupted filesystem, Linux will panic to protect your data.

NOTE: Make sure your broken partition is not mounted before you repair it.
Unmount it just to be safe (it is fine if this reports that the partition is not mounted):

umount /dev/sda2

It's time to repair the drive. Run the ext4 filesystem check tool on your specific partition:

e2fsck -y /dev/sda2

(💡Tip: The -y flag is crucial here. It automatically answers "yes" to the hundreds of prompts asking if you want to fix individual filesystem inconsistencies. Without it, you will be holding down the 'Y' key for an eternity).

Once the screen stops scrolling, you are looking for the holy grail: a message declaring your drive /dev/sda2: clean.

Step 4: The Boot Menu & The Investigation

With the filesystem repaired, type reboot (or hard reset again if the shell is stubborn) and unplug the USB.

Do not let it boot normally. As soon as the PC turns on, spam Esc to bring up the GRUB boot menu. Go to Advanced options for Ubuntu and manually select an older kernel version (e.g., 6.14.x instead of the newest one).

Once you are successfully booted into your desktop, open a terminal. It is time to find the killer. List all installed kernels:

dpkg --list | grep linux-image

If you read the output carefully, you'll likely spot the culprit. Look at the two letters on the far left of the list:

ii means the package is marked for install and is fully installed (This is your stable, older kernel).

it means the package is marked for install but is stuck with triggers pending (This is a half-baked, broken kernel).
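
If the list is long, you can let a few lines of Python pick out the suspicious entries. This is a sketch under assumptions: unfinished_kernels is a hypothetical helper, and the sample text is made up to match the shape of dpkg --list rows, not copied from real output.

```python
def unfinished_kernels(dpkg_output):
    """Return (status, package) pairs for linux-image packages not in state 'ii'."""
    problems = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Package rows look like: <status> <name> <version> <arch> <description...>
        if len(fields) < 2 or not fields[1].startswith("linux-image"):
            continue
        status, package = fields[0], fields[1]
        if status != "ii":
            problems.append((status, package))
    return problems


# Made-up sample in the shape of `dpkg --list` rows (not real output)
sample = """\
ii  linux-image-6.14.0-1-generic  6.14.0-1.1  amd64  Signed kernel image generic
it  linux-image-6.17.0-19-generic  6.17.0-19.19  amd64  Signed kernel image generic
ii  linux-base  4.5  all  Linux image base package
"""

print(unfinished_kernels(sample))
# → [('it', 'linux-image-6.17.0-19-generic')]
```

Entries marked rc (removed, config files remain) will also show up here. Those are leftovers from kernels you already uninstalled and are harmless.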

In my case, the system had tried to automatically update to the 6.17 kernel in the background. But a third-party module, specifically VirtualBox DKMS, failed to compile against the new kernel. That build failure halted the entire kernel installation halfway through, leaving my machine with an unbootable OS.

Step 5: The Execution (Purging the Rot)

Now we just need to clean up the mess and permanently delete the broken kernel so the system stops trying to default to it.

First, get the blocking software out of the way (you can reinstall it later):

sudo apt remove --purge virtualbox-dkms

Next, tell the package manager to unjam itself and fix any half-installed dependencies:

sudo apt --fix-broken install

Now, drop the hammer on the broken kernel (replace the version numbers with the broken one from your dpkg list):

sudo apt purge linux-image-6.17.0-19-generic linux-headers-6.17.0-19-generic

Finally, sweep up the orphaned packages and update your boot menu to lock in your stable kernel as the new default:

sudo apt autoremove
sudo update-grub

All good 😊✨

A little note...

Isn't it funny how things completely falling apart is usually the best way to figure out how they actually work?

But hey, in case you found yourself in this situation, found your way here, and this still doesn't work... you should probably just delete that OS 😂 or contact your local AI agent.

RAG for Dummies

Naomi Jepkorir — Thu, 18 Sep 2025 17:42:36 +0000

If you’ve been following AI news, you’ve probably heard the term RAG popping up everywhere.
No, it’s not about cleaning your house. RAG stands for Retrieval-Augmented Generation, and it’s one of the most exciting techniques in AI right now.

Let’s break it down so it’s easy to understand, no technical jargon required.

🤖 What is RAG?

Think of RAG as an AI that does its homework before giving you an answer.

Here’s what the name means:

Retrieval – The AI first looks up relevant information from a database, knowledge base, or document collection.
Augmented Generation – It then uses that information to generate a complete, accurate answer.

So instead of just guessing based on what it was trained on months or years ago, RAG can stay up to date and grounded in real facts.

✨ Why RAG Matters

This approach solves some big problems with AI:

Up-to-date knowledge – It can pull in the latest information, instead of relying only on what it “remembers.”
Better accuracy – By using real sources, it reduces those “hallucinations” where AI just makes stuff up.
Customizable – You can feed it your own data (like company manuals or research papers), and it will actually use them to answer questions.

🛠 How RAG Works (Simple Version)

Here’s the process in three steps:

1️⃣ You ask a question.

2️⃣ The AI searches through a set of documents for the most relevant pieces of information.

3️⃣ It writes a clear answer using what it just found.

In short:

💡 Search + Smart Writing = RAG
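
For the curious, here is a toy version of those three steps in Python. It is a deliberately simplified sketch: the documents are made up, retrieval is plain word overlap (real systems usually use embeddings and a vector database), and the "generation" step stops at building the prompt you would hand to a language model.

```python
import re

# Made-up mini knowledge base for illustration
DOCS = [
    "Employees get 21 vacation days per year.",
    "The office wifi password is rotated every month.",
    "Expense reports are due by the fifth of each month.",
]


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, docs, top_k=1):
    """Step 2: rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, passages):
    """Step 3 (first half): give the model the retrieved text to ground its answer."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "How many vacation days do employees get?"  # Step 1: the user asks
print(build_prompt(question, retrieve(question, DOCS)))
```

The printed prompt contains the vacation policy line, which is exactly the "homework" the model gets to read before it answers.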

🌍 Real-World Examples

You’ve probably already seen RAG in action:

💬 Customer support chatbots that know about your account and can answer detailed questions.
📚 Research tools that summarize recent studies for you.
🏢 Internal company assistants that help employees find policies or technical documentation instantly.

The Bottom Line

RAG makes AI smarter, more reliable and more helpful by letting it look things up before answering.

So next time you hear someone talk about “RAG,” you can confidently say:

“It’s when AI searches for relevant info first, then writes a better answer, like having Google and ChatGPT work together.”

Understanding Classification in Supervised Learning

Naomi Jepkorir — Thu, 28 Aug 2025 06:08:31 +0000

Machine learning is everywhere today, from Netflix recommendations to fraud detection.

One of the most important techniques behind these systems is supervised learning, and within that, classification shines as one of the most practical approaches.

In this article, I’ll break down:

✨ What supervised learning is

✨ How classification works

✨ Common models for classification

✨ My personal views and insights

✨ Challenges I’ve faced along the way

📘 What is Supervised Learning?

Supervised learning is a type of machine learning where the model is trained on a labeled dataset.

Inputs (features): The data we feed into the model.
Outputs (labels): The known answers we want the model to predict.

Think of it like teaching a student with flashcards: you show the input (a picture of a cat) and the correct label (“cat”). After enough examples, the student (our model) learns to generalize and can correctly label new, unseen inputs.

Supervised learning has two main branches:

Regression – Predicting continuous values (e.g., house prices).
Classification – Predicting categories (e.g., spam vs. not spam).

Here, we’ll focus on classification.

🏷️ How Classification Works

Classification is all about sorting data into categories. Some everyday examples:

Email: spam or not spam
Medical scan: benign or malignant
Handwritten digit: 0–9

The process usually looks like this:

Collect labeled data 🗂️
Extract features 🔎
Train the model 🤖
Test/validate 📊
Make predictions ✅

At its heart, classification is about drawing boundaries between groups: some models literally draw a line, while others compare similarities like a “nearest neighbor.”

⚙️ Models Used for Classification

There’s no one-size-fits-all solution. Here are some popular models:

Logistic Regression 📉 – Despite its name, it’s a classification model. Predicts probabilities and assigns labels.
Decision Trees 🌲 – Splits data by asking “yes/no” questions.
Random Forests 🌲🌲🌲 – A team of decision trees that vote together.
Support Vector Machines (SVMs) – Finds the best dividing line (or hyperplane).
k-Nearest Neighbors (k-NN) – Looks at the neighbors and goes with the majority.
Neural Networks 🧠⚡ – Powerful for images, text and speech, though often harder to interpret.

Each comes with trade-offs, some are simple and easy to explain, others are powerful but feel like a black box.
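
To make one of these concrete, here is k-NN written from scratch in a few lines. It is a minimal sketch for intuition, with made-up 2D points; in practice you would use a library implementation and scale your features first.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k closest training points.

    train: list of (features, label) pairs; query: a feature tuple.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: two loose clusters
train = [
    ((0.0, 0.0), "a"), ((1.0, 0.5), "a"), ((0.5, 1.0), "a"),
    ((5.0, 5.0), "b"), ((4.5, 5.5), "b"), ((5.5, 4.5), "b"),
]

print(knn_predict(train, (0.5, 0.5)))  # → a
print(knn_predict(train, (4.8, 5.0)))  # → b
```

There is no training step at all: the model is the data, which is why k-NN is easy to explain but slow on large datasets.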

💡 My Personal Views and Insights

Over time, I learned:

Data quality matters more than the model. If the data is messy or biased, results will be too.
Feature engineering is underrated. A simple model with great features can beat a complex one with poor inputs.
Accuracy isn’t everything. In real-world cases, metrics like precision, recall and F1-score often matter more, especially when classes are imbalanced.
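
That last point is easier to believe with numbers. Below is a small sketch with made-up labels: 1 fraud case in 100 transactions, and a lazy "model" that always predicts "not fraud". The function name is my own; libraries such as scikit-learn ship equivalents.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up data: 1 fraudulent transaction out of 100
y_true = [1] + [0] * 99
y_pred = [0] * 100  # a "model" that never flags fraud

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                             # → 0.99
print(precision_recall_f1(y_true, y_pred))  # → (0.0, 0.0, 0.0)
```

99% accuracy, and it catches zero fraud. Recall exposes what accuracy hides.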

🚧 Challenges I’ve Faced

Here are some hurdles I’ve personally run into:

Overfitting – When the model memorizes the training data but fails on new inputs.
Feature selection – Choosing the right features is tricky: too many = noise, too few = missed signals.
Class imbalance – Sometimes one class dominates the dataset, making it harder for the model to detect the minority class (e.g., detecting fraud when only 1% of transactions are fraudulent).

Conclusion

Classification is one of the most practical parts of supervised learning. From filtering spam to diagnosing diseases, it’s everywhere.

For me, working with classification has been both challenging and rewarding. The key lessons?

Good data beats fancy models.
Evaluation metrics must match the real-world problem.
Interpretability matters, especially in sensitive applications.

Despite the hurdles, classification continues to be one of the most impactful tools in machine learning.

⚖️ Choosing Between Type I and Type II Errors

Naomi Jepkorir — Mon, 11 Aug 2025 20:56:55 +0000

In statistics, making a decision is a bit like crossing a busy street without traffic lights, you have to weigh the risk of moving too soon against the risk of waiting too long. In hypothesis testing, those two risks are called Type I and Type II errors, and you can’t avoid them both entirely.

🕵️ Meet the Errors

Type I Error (False Positive) – Rejecting the null hypothesis when it’s actually true. In medicine, this might mean diagnosing a patient with a disease they don’t have.
Type II Error (False Negative) – Failing to reject the null hypothesis when it’s false. In medicine, this might mean missing a diagnosis when the disease is present.

They’re like opposite sides of a see-saw ⚖️: lowering one usually raises the other.

🦟 The Malaria Example

Picture yourself in a clinic in a malaria-endemic region. A patient walks in with fever, chills and body aches. You suspect malaria, and you have a rapid test 🧪.

If you make a Type I error, you say they have malaria when they don’t. They take unnecessary medicine, maybe get mild side effects, and the real cause of illness is missed.
If you make a Type II error, you say they don’t have malaria when they do. Without treatment, the disease can worsen quickly, and in severe cases, become life-threatening.

🔄 The Trade-off

In this setting, Type II errors are generally more dangerous 🚨. Why?

Malaria progresses fast, especially in children and pregnant women.
Anti-malarial treatment is relatively safe and inexpensive.
Missing a real case can have far worse consequences than treating a false one.

That’s why some clinics treat suspected malaria even when the test is negative but symptoms are strong, better to risk a false positive than lose a life ❤️.

🖥️ Simulating Type I and Type II Errors in Python

Here’s a small simulation showing how false positives and false negatives might look in malaria testing:

import numpy as np

# Seed for reproducibility
np.random.seed(42)

# Number of patients
n_patients = 1000

# True malaria status (1 = has malaria, 0 = no malaria)
true_malaria = np.random.binomial(1, 0.3, n_patients)  # 30% prevalence

# Test characteristics
sensitivity = 0.95  # correctly detect malaria (reduces Type II errors)
specificity = 0.90  # correctly detect no malaria (reduces Type I errors)

# Simulated test outcomes
test_positive = []
for case in true_malaria:
    if case == 1:
        # True malaria case: test is positive with probability = sensitivity
        test_positive.append(np.random.rand() < sensitivity)
    else:
        # No malaria: false positive occurs with probability = (1 - specificity)
        test_positive.append(np.random.rand() > specificity)

# Convert to NumPy array
test_positive = np.array(test_positive)

# Count errors
type_I_errors = np.sum((test_positive == 1) & (true_malaria == 0))
type_II_errors = np.sum((test_positive == 0) & (true_malaria == 1))

print(f"Type I errors (False Positives): {type_I_errors}")
print(f"Type II errors (False Negatives): {type_II_errors}")

Output:

Type I errors (False Positives): 73
Type II errors (False Negatives): 18

How to use it:

Change the sensitivity to see what happens when you try to catch every malaria case.
Change the specificity to see what happens when you try to avoid false alarms.
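In this simulation the two settings are independent knobs, but in a real test both depend on a single positivity threshold, so improving one tends to worsen the other: the eternal trade-off.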
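
To watch the trade-off come out of a single knob, here is a sketch where each patient gets a test score and one threshold decides who counts as positive. The score distributions (healthy centered at 0, infected centered at 2) are assumptions picked for illustration, not real malaria data.

```python
import random


def count_errors(healthy_scores, sick_scores, threshold):
    """A score at or above the threshold is called 'positive'."""
    false_pos = sum(s >= threshold for s in healthy_scores)  # Type I errors
    false_neg = sum(s < threshold for s in sick_scores)      # Type II errors
    return false_pos, false_neg


rng = random.Random(42)
healthy = [rng.gauss(0.0, 1.0) for _ in range(700)]  # assumed score distribution
sick = [rng.gauss(2.0, 1.0) for _ in range(300)]     # assumed score distribution

for threshold in (0.0, 0.5, 1.0, 1.5, 2.0):
    fp, fn = count_errors(healthy, sick, threshold)
    print(f"threshold={threshold:.1f}  Type I={fp:4d}  Type II={fn:4d}")
```

As the threshold rises, the Type I column can only shrink and the Type II column can only grow. Picking the threshold is picking which mistake you would rather make.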

📊 The Balancing Act

The choice between minimizing Type I or Type II errors depends on the context:

When the cost of a false positive is high (e.g., invasive surgery, expensive drugs), reduce Type I errors.
When the cost of a false negative is high (e.g., fast-progressing diseases), reduce Type II errors.

You can’t eliminate both, so you set your alpha (the Type I error rate) and your power (1 − β, where β is the Type II error rate) based on the stakes.

💡 Final Thought

Choosing between Type I and Type II errors isn’t about perfection, it’s about priorities. In malaria diagnosis, the priority is saving lives, even if it means some people take medicine they don’t actually need.

In every field, the key question is:

“Which mistake can we live with, and which can we not afford to make?” 🤔

⚽ Can We Predict the Next Premier League Champion with Binomial Probability?

Naomi Jepkorir — Wed, 06 Aug 2025 12:21:06 +0000

What are the chances your favorite EPL team wins the league next season? Time to let math do the talking! 🎲

🧠 Idea Behind the Madness

Every football fan has asked it:

"Can my team win the league next season?"

Instead of relying on blind hope, I decided to use binomial probability to calculate each team's chances of taking the crown in the next Premier League season, based entirely on how they performed last time.

We’ll:

Fetch last season’s final standings from an API.
Use binomial distribution to simulate two things:
- The probability of a team repeating its exact win total.
- The probability of a team reaching the typical championship threshold, which is about 27.6 wins, rounded up to 28.
Rank them accordingly.

🛰️ Step 1: Fetching EPL Data Using an API

I used the football-data.org API to pull the standings. You’ll need a free API token. Save it in a .env file like this:

API_TOKEN=your_football_data_token

Now fetch the standings:

import requests
import os
from dotenv import load_dotenv

load_dotenv()

def fetch_epl_standings():
    token = os.getenv("API_TOKEN")
    if not token:
        raise ValueError("API_TOKEN not found in env")

    uri = "https://api.football-data.org/v4/competitions/PL/standings?season=2024"
    headers = { 'X-Auth-Token': token }

    # A timeout stops the script from hanging forever if the API is unreachable
    response = requests.get(uri, headers=headers, timeout=30)

    if response.status_code != 200:
        raise Exception(f"API request failed with status code {response.status_code}: {response.text}")

    data = response.json()
    return data["standings"][0]["table"]

standings = fetch_epl_standings()

Convert to DataFrame:

import pandas as pd

data_rows = []
for team in standings:
    data_rows.append({
        "Pos": team["position"],
        "Team": team["team"]["name"],
        "Matches": team["playedGames"],
        "Wins": team["won"],
        "Draws": team["draw"],
        "Losses": team["lost"],
        "Points": team["points"],
        "+/-": team["goalDifference"],
        "Goals": f'{team["goalsFor"]}:{team["goalsAgainst"]}'
    })

df = pd.DataFrame(data_rows)
df.to_csv('epl_standings.csv', index=False)

🎯 Step 2: Binomial Probability of Exact Win Count

Now let's calculate the probability of each team repeating the exact number of wins they had last season.

import math

# Loop through each row and calculate binomial probability
for index, row in df.iterrows():
    team = row['Team']
    n = int(row['Matches'])  # total games
    k = int(row['Wins'])     # wins
    p = k / n                # estimated win probability

    try:
        binom_prob = math.comb(n, k) * (p**k) * ((1 - p)**(n - k))
    except OverflowError:
        binom_prob = 0.0

    print(f"{team}: P( {k} wins)  = {binom_prob:.6f}")

Sample Output:

Liverpool FC: P( 25 wins)  = 0.135388
Arsenal FC: P( 20 wins)  = 0.128761
Ipswich Town FC: P( 4 wins)  = 0.206486
Southampton FC: P( 2 wins)  = 0.278054

📉 What These Results Tell Us

Top and mid-table teams like Liverpool and Arsenal have lower exact probabilities: the closer a team's win rate is to 50%, the more ways its season can turn out (binomial variance n·p·(1−p) peaks at p = 0.5).
Lower-table teams tend to have higher repeat chances, but don't celebrate just yet...

🏆 Step 3: Probability of Title-Winning Season (≥ 28 Wins)

Next, we model the probability of each team reaching 28 or more wins, a common threshold to win the league.

We'll use the cumulative binomial distribution:

from scipy.stats import binom

def title_probability(wins, matches=38, threshold=28):
    p = wins / matches
    return 1 - binom.cdf(threshold - 1, matches, p)

for index, row in df.iterrows():
    team = row['Team']
    wins = int(row['Wins'])
    prob = title_probability(wins, threshold=28)
    print(f"{team}: P(Wins ≥ 28) = {prob:.6f}")

Sample Output (shown as percentages):

Team	P(Wins ≥ 28)
Liverpool FC	19.78%
Manchester City FC	1.54%
Arsenal FC	0.66%
Chelsea FC	0.66%
Newcastle United	0.66%
Manchester United FC	0.00%

📊 Interpretation

Liverpool is most likely to hit 28+ wins based on last season's win rate.

City, Chelsea and the others trail behind, possibly due to more draws or inconsistent performances.

Man United? Their chance rounds to zero. Ouch 😬.

🫣 United fans, this model says your 11-win season gives you a statistically negligible shot at the title. You might want to pray harder than you code.

⚠️ Limitations

Let’s be honest, binomial probability isn’t a crystal ball. Here's why:

It ignores real-world dynamics: transfers, injuries, managerial changes.
It assumes independent, identically distributed matches (which football is not).
Based on one season, not a large enough sample for deep insight.

But hey, it’s fun and statistically grounded!

🧪 Want to Take This Further?

Here’s how you can level up the model:

Use Poisson regression to simulate goals per match.
Integrate Elo ratings or other power metrics.
Run full Monte Carlo simulations of future fixtures.
Track the model live across the season for dynamic probabilities.
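
As a first step toward the Monte Carlo idea, here is a stripped-down sketch. It does not simulate real fixtures: it replays a 38-match season many times with one fixed win probability per match, so it keeps the same independence assumption as the binomial model. The function name and defaults are my own choices.

```python
import random


def simulate_title_chance(win_prob, matches=38, threshold=28,
                          n_seasons=10_000, seed=0):
    """Estimate P(wins >= threshold) by replaying many simulated seasons."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_seasons):
        # Each match is an independent coin flip with probability win_prob
        wins = sum(rng.random() < win_prob for _ in range(matches))
        if wins >= threshold:
            hits += 1
    return hits / n_seasons


# Liverpool's 25 wins from 38 matches last season
print(simulate_title_chance(25 / 38))
```

The estimate should land close to the 19.78% from the binomial formula, since both describe the same model. The payoff comes later: once you swap the coin flip for per-fixture probabilities, the formula stops working but the simulation still does.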

💭 Final Thoughts

While this model won’t help you win your fantasy league, it does give a math-driven glimpse into who’s statistically positioned to succeed. Liverpool fans? You have reason to dream. Southampton? Maybe next year...

Football is unpredictable, and that's what makes it beautiful. But every now and then, it's fun to let the math have a shot at calling the game. ⚽📊

Understanding Measures of Central Tendency in Data Science

Naomi Jepkorir — Sun, 20 Jul 2025 21:11:12 +0000

When you think of "mean", "median" or "mode", chances are your brain flashes back to a math class you didn’t think you'd ever use again. 😅

But here I am, knee-deep in datasets, and those three little words keep showing up. Not just as formulas, but as powerful tools that help tell the story behind the numbers.

This post is part of my continued journey into data science. After exploring tools like Excel and Power BI, I started digging into core concepts, and measures of central tendency are some of the first I’ve truly appreciated in the real world.

Let’s break it down in plain English 👇

What Are Measures of Central Tendency? 🤔

Measures of central tendency help us understand the “center” or “typical” value in a dataset. Basically, they summarize what’s "normal" in your data, and that's a huge help when you’re making sense of hundreds (or millions) of numbers.

The three most common ones are:

Mean - the average value
Median - the middle value
Mode - the most frequently occurring value

They each tell you something slightly different, and choosing the right one depends on the situation.

Why Do They Matter in Data Science? 🎯

When you're working with data, you're usually trying to:

Understand trends
Compare groups
Make decisions
Build predictive models

Measures of central tendency give you a quick pulse check on your dataset. For example:

If you’re analyzing income data, the median might be better than the mean because of outliers (like billionaires).
If you're reviewing customer ratings from 1 to 5 stars, the mode could show you the most common sentiment.
If your data is pretty clean and normally distributed, the mean gives a solid summary.
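
The income case is easy to see with a handful of made-up salaries. In this sketch, one very large value joins five ordinary ones:

```python
import statistics

# Made-up annual incomes for illustration
incomes = [30_000, 35_000, 40_000, 45_000, 50_000]
with_outlier = incomes + [1_000_000]

print("mean:  ", statistics.mean(incomes), "->", statistics.mean(with_outlier))
print("median:", statistics.median(incomes), "->", statistics.median(with_outlier))
```

The mean jumps from 40,000 to 200,000, a figure nobody in the list actually earns, while the median only moves from 40,000 to 42,500.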

Real-World Examples 🔍

Here are a few situations where these measures pop up:

📈 Business Reporting
Companies use the mean to summarize average sales, costs or customer satisfaction scores over time.
🏥 Healthcare
Hospitals might use the median to report wait times, since a few extreme cases can skew the average.
🛍️ Retail and Marketing
The mode helps track the most popular product sizes, colors or price points.

A Quick Python Example 🐍

If you’ve got a list of numbers, you can calculate all three super easily:

import statistics

data = [1, 2, 2, 3, 4, 4, 4, 5, 6]

mean = statistics.mean(data)     # 3.44
median = statistics.median(data) # 4
mode = statistics.mode(data)     # 4

print(mean, median, mode)

These tiny lines of code can give you a huge amount of insight.

My Reflection 💭

At first, I thought central tendency was just for passing stats exams. Now, I see it as one of the first things you should check when exploring a new dataset. It gives you a quick overview, helps spot data issues and sets the stage for deeper analysis or modeling.

Plus, it’s foundational. Whether you're in Excel, Python or SQL, you'll use these concepts everywhere.

If you're just getting started in data science like I am, don't overlook the basics. They’re called “central” for a reason. 😉

How Excel is Used in Real-World Data Analysis

Naomi Jepkorir — Wed, 11 Jun 2025 12:04:58 +0000

When I started my journey in Data Science & Analytics, I knew Excel was a common tool in the workplace, but I didn’t realize just how powerful and versatile it really is. After just one week of learning Excel, I’ve already seen how it plays a major role in real-world data analysis and decision-making across many industries.

What is Excel? 🤔

Microsoft Excel is a spreadsheet program that allows users to organize, analyze, and visualize data efficiently. It's widely used by professionals in fields like finance, marketing, operations and beyond. While it may seem simple at first glance, Excel offers a rich set of features that make it a go-to tool for data analysts around the world.

Real-World Uses of Excel in Data Analysis 🔍

Here are just a few examples of how Excel is used in real-world data analysis:

Business Decision-Making

Excel helps companies track performance metrics and make data-driven decisions. Dashboards built with Excel can show KPIs (Key Performance Indicators), trends and summaries that guide strategy and planning.

Financial Reporting

Financial analysts rely on Excel for budgeting, forecasting and generating reports. Excel’s formulas, templates and automation features reduce errors and save time on repetitive tasks.

Marketing Performance Analysis

Marketing teams use Excel to analyze campaign data, track conversions, segment audiences and measure ROI (Return on Investment). With features like pivot tables and filters, they can drill down into specific data segments easily.

Powerful Excel Features That Make Data Analysis Easy ⚙️

In just a week, I've learned a few advanced Excel features and formulas that really opened my eyes to what’s possible:

VLOOKUP() and XLOOKUP()

These functions help find and connect data across large datasets. Whether matching IDs to names or merging data from multiple sources, they simplify complex lookups.

Data Validation

This feature helps ensure clean, consistent data entry. For example, limiting entries in a column to a specific list (like “Low,” “Medium,” “High”) helps prevent typos and standardizes the data for more accurate analysis.

Conditional Formatting

This makes your data visually dynamic. You can highlight trends, outliers, or duplicates using color scales, icons, or rules. It’s especially helpful when you need to quickly spot the values that stand out in a dataset.

Filters and Slicers

Filters help focus on specific data without deleting anything. When paired with pivot tables or tables, they allow for interactive exploration and quick insights, like segmenting sales by region or category.

My Reflection 💭

Learning Excel has changed the way I view data. Before, I saw spreadsheets as static and kind of boring — just tables of numbers. Now, I see them as dynamic tools for storytelling, insight and strategy. It’s amazing how much you can learn about a situation just by organizing the data correctly and applying the right formula. I’m excited to keep building my skills and see how Excel fits into more advanced analytics tools down the road.