<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Josiah Nyamai</title>
    <description>The latest articles on DEV Community by Josiah Nyamai (@joe_siah).</description>
    <link>https://dev.to/joe_siah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2401081%2F894303aa-35ec-4fc5-a7e4-5949619d800a.jpg</url>
      <title>DEV Community: Josiah Nyamai</title>
      <link>https://dev.to/joe_siah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joe_siah"/>
    <language>en</language>
    <item>
      <title>Supervised Learning — The Heart of Modern AI</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Mon, 25 Aug 2025 05:29:44 +0000</pubDate>
      <link>https://dev.to/joe_siah/supervised-learning-the-heart-of-modern-ai-gpe</link>
      <guid>https://dev.to/joe_siah/supervised-learning-the-heart-of-modern-ai-gpe</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“If data is the new oil, supervised learning is the engine that refines it.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Artificial Intelligence and Machine Learning (AI/ML) are transforming industries — from healthcare to finance to entertainment. At the core of most of these intelligent systems lies a foundational technique called Supervised Learning.&lt;/p&gt;

&lt;p&gt;Whether you’re a data scientist in training, a software developer branching into ML, or just curious about how machines “learn,” this guide is for you. We’ll explore what supervised learning is, how it works, common algorithms, real-world use-cases, and even write a little code.&lt;/p&gt;

&lt;h1&gt;
  
  
  📌 What Is Supervised Learning?
&lt;/h1&gt;

&lt;p&gt;Supervised learning is a type of machine learning where the model is trained using labeled data.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You give the algorithm input data (X) and the correct output (y).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The algorithm tries to learn the mapping between inputs and outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once trained, it can predict the output for new, unseen inputs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📦 Think of it like this:&lt;br&gt;
You’re the teacher. You give the model a bunch of math problems (inputs) and answers (labels). Over time, the model learns how to solve similar problems on its own.&lt;/p&gt;
&lt;h1&gt;
  
  
  🎯 The Goal
&lt;/h1&gt;

&lt;p&gt;The goal of supervised learning is to minimize the error between the predicted output and the actual (true) output. It does this by adjusting internal parameters (called weights) during training.&lt;/p&gt;
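
&lt;p&gt;To make "adjusting weights" concrete, here is a minimal NumPy sketch (the toy data and learning rate are invented for illustration) that fits a one-feature linear model by gradient descent on squared error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Toy data: one input feature (X) and the true outputs (y)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

w, b = 0.0, 0.0   # the internal parameters ("weights") to learn
lr = 0.01         # learning rate

for epoch in range(1000):
    y_pred = w * X + b    # current predictions
    error = y_pred - y    # how far off we are
    # Gradients of mean squared error with respect to w and b
    w -= lr * 2 * np.mean(error * X)
    b -= lr * 2 * np.mean(error)

print(f"Learned w={w:.2f}, b={b:.2f}")  # approaches w ≈ 2, b ≈ 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;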
&lt;h2&gt;
  
  
  🧩 Types of Supervised Learning
&lt;/h2&gt;

&lt;p&gt;There are two major branches:&lt;/p&gt;
&lt;h3&gt;
  
  
  1️⃣ Regression
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Output: Continuous values (e.g., real numbers)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Goal: Predict “how much” or “how many”&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Predicting house prices 🏠&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Forecasting stock prices 📈&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Estimating temperature 🌡️&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🧮 Output Example: y = 250,000 (price in USD)&lt;/p&gt;
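
&lt;p&gt;A quick hedged illustration of producing such a continuous output with scikit-learn (the sizes and prices are invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import LinearRegression

# Invented data: house size in square feet vs. price in USD
sizes = [[800], [1000], [1500], [2000]]
prices = [120_000, 150_000, 210_000, 275_000]

model = LinearRegression()
model.fit(sizes, prices)

# Predict the price of an unseen 1,800 sq ft house
print(model.predict([[1800]]))  # a continuous value near 250,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;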
&lt;h3&gt;
  
  
  2️⃣ Classification
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Output: Discrete categories or classes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Goal: Predict “which class” an input belongs to&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Spam or Not Spam 📧&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cat vs. Dog 🐱🐶&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disease diagnosis (positive/negative) 🧬&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🎯 Output Example: y = "Spam"&lt;/p&gt;
&lt;h1&gt;
  
  
  🧠 How Does It Work? (Step-by-Step)
&lt;/h1&gt;

&lt;p&gt;Here’s the general pipeline of supervised learning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Collect Data&lt;br&gt;
Gather labeled examples: each has input features (X) and a known label (y).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Split the Data&lt;br&gt;
Training set (usually ~70–80%)&lt;br&gt;
Test set (~20–30%)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose an Algorithm&lt;br&gt;
Decide what type of model you want to train (e.g., Linear Regression, Decision Tree, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Train the Model&lt;br&gt;
Feed the training data into the algorithm so it learns patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluate the Model&lt;br&gt;
Test the model on unseen data and measure performance using metrics like accuracy, precision, recall, RMSE, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tune &amp;amp; Improve&lt;br&gt;
Adjust parameters, try different algorithms, add more data, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  🧮 Common Algorithms in Supervised Learning
&lt;/h2&gt;

&lt;p&gt;Here are some popular supervised learning algorithms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Use-case Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linear Regression&lt;/td&gt;
&lt;td&gt;Regression&lt;/td&gt;
&lt;td&gt;Predicting house prices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;Spam detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision Trees&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Customer segmentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Credit scoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support Vector Machines (SVM)&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Face recognition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K-Nearest Neighbors (KNN)&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Medical diagnosis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradient Boosting (XGBoost, LightGBM)&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Fraud detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neural Networks&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Image classification, speech analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  🧪 Quick Python Example
&lt;/h2&gt;

&lt;p&gt;Let’s use a simple example: predicting whether a person will buy a product based on their age and income.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Sample dataset
data = pd.DataFrame({
    'age': [22, 25, 47, 52, 46, 56, 55, 60],
    'income': [15000, 29000, 48000, 60000, 52000, 65000, 58000, 72000],
    'buy': ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes']
})

X = data[['age', 'income']]
y = data['buy']

# Step 2: Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Step 3: Train model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Step 4: Predict
y_pred = clf.predict(X_test)

# Step 5: Evaluate
print("Predictions:", y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
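
&lt;p&gt;One caveat: with only eight rows, a 25% test split leaves just two test samples, so the printed accuracy is illustrative rather than meaningful. Real projects need far more data before these metrics say much.&lt;/p&gt;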



&lt;h2&gt;
  
  
  📊 How Do We Measure Performance?
&lt;/h2&gt;

&lt;p&gt;Metrics depend on the task:&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ For Classification:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Accuracy – % of correct predictions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Precision – Of the predicted positives, how many were correct?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recall – Of the actual positives, how many were found?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;F1-score – Balance between precision &amp;amp; recall&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Confusion Matrix – Table showing TP, FP, TN, FN&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
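
&lt;p&gt;Each of these is a one-liner in scikit-learn. A minimal sketch, reusing y_test and y_pred from the decision-tree example above (pos_label names the class treated as "positive"):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label='Yes'))
print("Recall:   ", recall_score(y_test, y_pred, pos_label='Yes'))
print("F1-score: ", f1_score(y_test, y_pred, pos_label='Yes'))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;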

&lt;h3&gt;
  
  
  📈 For Regression:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mean Squared Error (MSE)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Root Mean Squared Error (RMSE)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mean Absolute Error (MAE)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R² Score (Coefficient of Determination)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
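
&lt;p&gt;These are equally short in scikit-learn; a sketch with invented true and predicted values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [250_000, 180_000, 320_000]   # invented actual prices
y_hat = [240_000, 200_000, 310_000]    # invented predictions

mse = mean_squared_error(y_true, y_hat)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))  # back in the target's units
print("MAE: ", mean_absolute_error(y_true, y_hat))
print("R²:  ", r2_score(y_true, y_hat))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;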

&lt;h2&gt;
  
  
  🏭 Real-World Applications
&lt;/h2&gt;

&lt;p&gt;Supervised learning is used virtually everywhere:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Application Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Healthcare&lt;/td&gt;
&lt;td&gt;Disease prediction, drug response modeling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance&lt;/td&gt;
&lt;td&gt;Credit scoring, fraud detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;Customer churn prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail&lt;/td&gt;
&lt;td&gt;Product recommendation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agriculture&lt;/td&gt;
&lt;td&gt;Crop disease classification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transportation&lt;/td&gt;
&lt;td&gt;Traffic flow prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email&lt;/td&gt;
&lt;td&gt;Spam detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NLP&lt;/td&gt;
&lt;td&gt;Sentiment analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  ⚠️ Challenges &amp;amp; Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Need for labeled data: Labeled data is often expensive or time-consuming to get.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Overfitting: Model memorizes training data but fails on new data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bias in data: Garbage in, garbage out — biased data leads to biased models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computational cost: Some algorithms are slow with large datasets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ✅ Tips for Success
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🧹 Clean your data. Missing values, duplicates, and wrong types can ruin your model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;📊 Explore your data using visualizations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;📦 Use scikit-learn or other libraries to avoid reinventing the wheel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🧪 Experiment! Try multiple algorithms and compare results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;⚖️ Balance your dataset when classes are imbalanced (especially in classification).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧠 TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Supervised learning uses labeled data to train models to predict outcomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It has two main branches: Regression (continuous outputs) and Classification (categorical outputs).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It's used in almost every industry today.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With Python and scikit-learn, you can build supervised models in just a few lines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🙌 Conclusion
&lt;/h2&gt;

&lt;p&gt;Supervised learning is the bread and butter of modern AI. From predicting your next Netflix show to detecting credit card fraud, it’s the quiet workhorse behind the scenes.&lt;/p&gt;

&lt;p&gt;If you're starting out in machine learning, mastering supervised learning is non-negotiable. Once you understand the concepts and build a few models, you’ll unlock a whole new world of intelligent applications.&lt;/p&gt;

&lt;p&gt;Happy learning, and may your loss always go down 📉 and your accuracy go up 📈!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>supervisedlearning</category>
    </item>
    <item>
      <title>From Messy to Meaningful: Cleaning and ETL on a Real-World Cancer Lab Dataset</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Mon, 28 Jul 2025 10:34:46 +0000</pubDate>
      <link>https://dev.to/joe_siah/from-messy-to-meaningful-cleaning-and-etl-on-a-real-world-cancer-lab-dataset-1680</link>
      <guid>https://dev.to/joe_siah/from-messy-to-meaningful-cleaning-and-etl-on-a-real-world-cancer-lab-dataset-1680</guid>
      <description>&lt;p&gt;Real-world datasets rarely come neat and tidy. During a recent technical assessment for a Data Engineering &amp;amp; AI position, I was challenged to transform a messy, inconsistent cancer diagnostics dataset into clean, structured data ready for analysis and potential machine learning applications. This article walks through the steps I took to clean the data, standardize values, and build an ETL pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧪 The Dataset
&lt;/h2&gt;

&lt;p&gt;The dataset, sourced from Kaggle, simulates test orders in a cancer diagnostics lab. It includes records across departments like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Histology&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cytology&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Haematology&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Immunohistochemistry&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But just like in the real world, the data was messy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Missing values&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Invalid or mixed date formats&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Typos and inconsistent labels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Non-numeric prices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ambiguous values (e.g., “N/A”, “Unknown”)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🔍 Step 1: Data Cleaning &amp;amp; Standardization
&lt;/h2&gt;

&lt;p&gt;I started by downloading the dataset directly from Kaggle using the kagglehub library, then loaded it into a pandas DataFrame for inspection.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Importing and Inspecting the Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

import kagglehub
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Download the dataset and keep the returned directory,
# instead of hard-coding a machine-specific cache path
path = kagglehub.dataset_download("eustusmurea/labtest-dataset")
print("Path to dataset files:", path)

data = pd.read_csv(os.path.join(path, "messy_cancer_lab_dataset.csv"))
data.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gave me a first look at the structure of the dataset — including messy values, inconsistent formats, and missing data — and laid the groundwork for cleaning and transformation.&lt;/p&gt;

&lt;h3&gt;
  
  
  📅 Fixing Date Columns
&lt;/h3&gt;

&lt;p&gt;Many rows had invalid or mixed date formats. I converted them using errors='coerce', which sets invalid entries to NaT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;date_columns = ['creation_date', 'signout_date']
for col in date_columns:
    data[col] = pd.to_datetime(data[col], format='mixed', errors='coerce')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Then I filled missing dates with a safe placeholder: 2020-01-01.&lt;/li&gt;
&lt;/ul&gt;
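
&lt;p&gt;In sketch form, that fill is one line per column (assuming the placeholder date above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace NaT entries with the safe placeholder date
for col in date_columns:
    data[col] = data[col].fillna(pd.Timestamp('2020-01-01'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;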

&lt;h3&gt;
  
  
  💵 Cleaning the Price Column
&lt;/h3&gt;

&lt;p&gt;Prices came in multiple formats like "KES 2,500" or "ksh3000". I created a custom function to clean these values and convert them into floats.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def clean_price(value):
    # Strip currency markers ("ksh", "KES") and thousands separators,
    # then coerce to float; anything unparseable becomes None
    if pd.isnull(value):
        return None
    value = str(value).lower().replace('ksh', '').replace('kes', '').replace(',', '').strip()
    try:
        return float(value)
    except ValueError:
        return None

data['price'] = data['price'].apply(clean_price)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;After confirming there were no outliers via a boxplot, I filled missing values with the mean.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.boxplot(y='price', data=data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['price'] = data['price'].fillna(data['price'].mean())
data.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧹 Filling Missing Object Values
&lt;/h3&gt;

&lt;p&gt;For other columns with object dtype (e.g., text columns), I replaced nulls with "Unknown":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].fillna('Unknown')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🏷️ Standardizing Text Columns
&lt;/h3&gt;

&lt;p&gt;Inconsistent labels like "St. Marys" and "st. mary's oncology" needed normalization. I used .replace() and string functions for standardization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Normalize whitespace and casing first, then map known variants
# (running .str.title() after the replace would mangle apostrophes:
# "St. Mary's Oncology" would become "St. Mary'S Oncology").
data['facility'] = data['facility'].str.strip().str.title()
data['facility'] = data['facility'].replace({
    'St. Marys': "St. Mary's Oncology",
    "St. Mary'S Oncology": "St. Mary's Oncology"
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;I applied similar methods to clean up categories, tests, and other text columns.&lt;/li&gt;
&lt;/ul&gt;
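
&lt;p&gt;In sketch form, the same strip/title pattern extends naturally (the column names here mirror the lab_tests schema shown later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Apply the same normalization to other text columns
for col in ['test_category', 'sub_category', 'service']:
    data[col] = data[col].str.strip().str.title()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;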

&lt;h2&gt;
  
  
  ⚙️ Step 2: ETL Pipeline
&lt;/h2&gt;

&lt;p&gt;After transforming the raw data, I moved on to building the ETL pipeline. The final step involved loading the clean dataset into a PostgreSQL database — making it ready for querying, reporting, or integration with BI tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  🛢️ Load: Saving to PostgreSQL
&lt;/h3&gt;

&lt;p&gt;🔹 Step 1: Export Cleaned Data to CSV&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data.to_csv("Cleaned_dataset.csv", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Establish PostgreSQL Connection
&lt;/h3&gt;

&lt;p&gt;I used SQLAlchemy and psycopg2 to securely connect to a PostgreSQL database using environment variables for credentials.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dotenv import load_dotenv
import os
import psycopg2
import pandas as pd
from sqlalchemy import create_engine

# Load .env variables
load_dotenv()

# 1. Connect to Postgres using psycopg2
connection = psycopg2.connect(  
    host=os.getenv("db_host"),
    database=os.getenv("db_name"),
    user=os.getenv("db_user"),
    password=os.getenv("db_pass"),
    port=os.getenv("db_port")   
)
cursor = connection.cursor()

# Create a table in the database
create_table_query = """
CREATE TABLE IF NOT EXISTS lab_tests (
    order_id TEXT PRIMARY KEY,
    signout_date DATE,
    lab_number TEXT,
    assigned_to_pathologist TEXT,
    patient_name TEXT,
    facility TEXT,
    test_category TEXT,
    test TEXT,
    sub_category TEXT,
    receiving_centers TEXT,
    processing_centers TEXT,
    creation_date DATE,
    creation_year TEXT,
    creation_weekday TEXT,
    service TEXT,
    price FLOAT,
    payment_method TEXT,
    delay_days INT
);
"""
cursor.execute(create_table_query)
connection.commit()

# 2. Create SQLAlchemy engine and use pandas.to_sql()
db_url = f"postgresql://{os.getenv('db_user')}:{os.getenv('db_pass')}@" \
         f"{os.getenv('db_host')}:{os.getenv('db_port')}/{os.getenv('db_name')}"

engine = create_engine(db_url)

# Load DataFrame into PostgreSQL; 'append' keeps the explicit schema
# created above (if_exists='replace' would drop and recreate the table
# from inferred dtypes, discarding the primary key)
data.to_sql('lab_tests', engine, if_exists='append', index=False)

print("Data successfully loaded into PostgreSQL database.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🤖 Bonus: ML Use Case Proposal
&lt;/h2&gt;

&lt;p&gt;As part of the assessment, I proposed an ML task:&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 Predicting Lab Test Delays
&lt;/h3&gt;

&lt;p&gt;Target: Days between creation_date and signout_date&lt;/p&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Facility&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test category&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Price&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creation weekday/month&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models: Linear Regression, XGBoost&lt;br&gt;
Metrics: MAE, RMSE, R²&lt;/p&gt;
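
&lt;p&gt;A hedged baseline sketch of that task, assuming the cleaned data DataFrame and the column names from the lab_tests schema above (the preprocessing choices here are illustrative, not part of the assessment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# One-hot encode the categorical features; keep price numeric
features = pd.get_dummies(data[['facility', 'test_category', 'creation_weekday']])
features['price'] = data['price']
target = data['delay_days']  # assumed already computed and numeric

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MAE: ", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R²:  ", r2_score(y_test, pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;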

&lt;h2&gt;
  
  
  📌 GitHub Repository
&lt;/h2&gt;

&lt;p&gt;You can find the full code, cleaned dataset, and pipeline on GitHub:&lt;/p&gt;

&lt;h3&gt;
  
  
  🔗 &lt;a href="https://github.com/josiahnyamai/AI-Data-Engineering-Project" rel="noopener noreferrer"&gt;GitHub – Cancer Lab ETL &amp;amp; Cleaning Project&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Feel free to clone it, run it, or build on top of it.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>📊 Understanding Measures of Central Tendency and Their Importance in Data Science</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Sat, 26 Jul 2025 06:57:17 +0000</pubDate>
      <link>https://dev.to/joe_siah/understanding-measures-of-central-tendency-and-their-importance-in-data-science-242n</link>
      <guid>https://dev.to/joe_siah/understanding-measures-of-central-tendency-and-their-importance-in-data-science-242n</guid>
      <description>&lt;p&gt;When diving into the world of data science, one of the first statistical concepts you encounter is measures of central tendency. These foundational tools help data scientists make sense of data by summarizing it with a single representative value. Whether you’re building machine learning models, cleaning data, or generating business insights, understanding central tendency is crucial.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore what measures of central tendency are, how they work, and why they are vital in data science.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔍 What Are Measures of Central Tendency?
&lt;/h2&gt;

&lt;p&gt;Measures of central tendency are statistical metrics used to determine the center or typical value of a dataset. They give a sense of where data points tend to cluster and are useful for summarizing large datasets in a meaningful way.&lt;/p&gt;

&lt;p&gt;There are three main measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mean – The arithmetic average&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Median – The middle value&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mode – The most frequently occurring value&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break each one down.&lt;/p&gt;

&lt;h3&gt;
  
  
  📐 1. Mean (Arithmetic Average)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Formula:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mean = ∑𝑥/𝑛
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Where:
&lt;/h4&gt;

&lt;p&gt;∑x = sum of all values&lt;br&gt;
n = number of values&lt;/p&gt;

&lt;p&gt;The mean is widely used due to its simplicity and mathematical properties. However, it can be sensitive to outliers.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Consider the ages: [22, 24, 25, 23, 100]
Mean = (22 + 24 + 25 + 23 + 100) / 5 = 38.8
In this case, the mean is skewed by the outlier (100).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  🔢 2. Median (Middle Value)
&lt;/h3&gt;

&lt;p&gt;The median is the middle number when the dataset is ordered. If there’s an even number of values, it’s the average of the two middle numbers.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[22, 23, 24, 25, 100]
Median = 24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The median is more robust to outliers and gives a better sense of central location in skewed data.&lt;/p&gt;
&lt;h3&gt;
  
  
  🔁 3. Mode (Most Frequent Value)
&lt;/h3&gt;

&lt;p&gt;The mode represents the value that occurs most often in the dataset.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[22, 23, 23, 24, 25]
Mode = 23
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Mode is especially useful for categorical data, where mean and median may not be applicable.&lt;/p&gt;
&lt;h2&gt;
  
  
  📌 Why Are Measures of Central Tendency Important in Data Science?
&lt;/h2&gt;

&lt;p&gt;Understanding and correctly applying these measures can lead to better decisions, cleaner data, and more accurate models. Here's how they impact the data science workflow:&lt;/p&gt;
&lt;h3&gt;
  
  
  🧹 1. Data Cleaning &amp;amp; Preprocessing
&lt;/h3&gt;

&lt;p&gt;Missing data is a common issue in real-world datasets. Measures of central tendency are often used to impute missing values. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use the mean to fill in missing numerical values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the mode to impute missing categorical data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures the dataset remains useful and statistically representative.&lt;/p&gt;
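
&lt;p&gt;In pandas this imputation is a couple of lines; a minimal sketch with a hypothetical toy DataFrame (the column names are invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Hypothetical toy frame: 'age' is numeric, 'city' is categorical
df = pd.DataFrame({
    'age': [22, None, 25, 23],
    'city': ['Nairobi', 'Mombasa', None, 'Nairobi']
})

df['age'] = df['age'].fillna(df['age'].mean())        # mean for numbers
df['city'] = df['city'].fillna(df['city'].mode()[0])  # mode for categories
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;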
&lt;h3&gt;
  
  
  📊 2. Exploratory Data Analysis (EDA)
&lt;/h3&gt;

&lt;p&gt;During EDA, understanding the distribution of data is key. Central tendency measures help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Detect skewness&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identify outliers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compare different features&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A quick look at mean and median can reveal whether data is symmetrical, left-skewed, or right-skewed, helping you choose the right transformation methods.&lt;/p&gt;
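
&lt;p&gt;A quick sketch of that check, reusing the ages from the earlier example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

ages = pd.Series([22, 24, 25, 23, 100])

# When the mean sits well above the median, the data is right-skewed
print("Mean:  ", ages.mean())    # 38.8
print("Median:", ages.median())  # 24.0
print("Mean minus median:", ages.mean() - ages.median())  # positive here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;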
&lt;h3&gt;
  
  
  📈 3. Feature Engineering
&lt;/h3&gt;

&lt;p&gt;When engineering new features for machine learning models, it’s common to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Normalize data using the mean&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replace noisy data with the median&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create flags or indicators based on the mode&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These practices help improve model accuracy and interpretability.&lt;/p&gt;
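
&lt;p&gt;For instance, z-score scaling is built directly from the mean (a generic sketch, reusing the toy df from the imputation example above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Z-score: center on the mean, scale by the standard deviation
df['age_scaled'] = (df['age'] - df['age'].mean()) / df['age'].std()
print(df[['age', 'age_scaled']])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;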
&lt;h3&gt;
  
  
  🤖 4. Model Evaluation &amp;amp; Bias Detection
&lt;/h3&gt;

&lt;p&gt;Understanding the central tendencies of predictions and actual values can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reveal systematic bias&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Help diagnose model drift&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support performance comparison across different segments&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, if the mean prediction is significantly higher than the mean actual, your model may be overestimating.&lt;/p&gt;
&lt;h3&gt;
  
  
  🧠 5. Communicating Insights
&lt;/h3&gt;

&lt;p&gt;Data scientists are often required to present their findings to non-technical stakeholders. Measures of central tendency are intuitive, easy to understand, and widely accepted in business environments.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;p&gt;“The average customer age is 34”&lt;br&gt;
is much more digestible than&lt;br&gt;
“Customer age is normally distributed with a standard deviation of 6.2”.&lt;/p&gt;
&lt;h2&gt;
  
  
  🚫 Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;While useful, measures of central tendency can be misleading if not used appropriately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Skewed Distributions: In heavily skewed data, mean might not represent the "center" accurately. Prefer the median.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multimodal Data: If a dataset has multiple peaks, relying on a single mode can oversimplify.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Outliers: Outliers can distort the mean drastically. Always visualize your data before relying solely on these measures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🛠 Tools in Python
&lt;/h2&gt;

&lt;p&gt;In Python, libraries like pandas and numpy make it easy to calculate these measures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

data = pd.Series([22, 24, 25, 23, 100])

mean = data.mean()
median = data.median()
mode = data.mode().values[0]

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧩 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Measures of central tendency are simple yet powerful tools in a data scientist's toolkit. Whether you're just getting started or working on advanced models, they provide essential insights into your data's structure and behavior.&lt;/p&gt;

&lt;p&gt;Always pair these statistics with data visualization (like histograms or box plots) for a more complete understanding. The better you understand your data's center, the more informed your analyses and decisions will be.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>statistics</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>🧑‍💻 Web Scraping Indeed Remote Jobs and Storing in PostgreSQL</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Tue, 17 Jun 2025 13:31:31 +0000</pubDate>
      <link>https://dev.to/joe_siah/web-scraping-indeed-remote-jobs-and-storing-in-postgresql-35ca</link>
      <guid>https://dev.to/joe_siah/web-scraping-indeed-remote-jobs-and-storing-in-postgresql-35ca</guid>
      <description>&lt;h2&gt;
  
  
  📌 Project Overview
&lt;/h2&gt;

&lt;p&gt;In this project, I built a Python-based pipeline to automate the collection of remote job listings from Indeed. The project involved using Selenium and BeautifulSoup for web scraping, cleaning the extracted data with Pandas, and storing the final structured dataset into a PostgreSQL database for analysis or reporting.&lt;/p&gt;

&lt;p&gt;This article walks you through each step of the pipeline, including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Setting up the environment&lt;/li&gt;
&lt;li&gt;Automating job searches with Selenium&lt;/li&gt;
&lt;li&gt;Parsing job data using BeautifulSoup&lt;/li&gt;
&lt;li&gt;Data cleaning and transformation&lt;/li&gt;
&lt;li&gt;Storing the final dataset into PostgreSQL&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🛠️ Step 1: Setting Up the Environment
&lt;/h2&gt;

&lt;p&gt;First, I installed the necessary libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install selenium beautifulsoup4 pandas psycopg2-binary

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also downloaded the appropriate WebDriver (e.g., ChromeDriver) and ensured it was added to the system PATH.&lt;/p&gt;
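
&lt;p&gt;If the driver is not on the PATH, Selenium 4 can be pointed at it explicitly; a small sketch (the path below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder path; adjust to wherever chromedriver lives on your machine
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;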

&lt;h2&gt;
  
  
  🌐 Step 2: Navigating to the Website with Selenium
&lt;/h2&gt;

&lt;p&gt;Using Selenium, I navigated to the Indeed search page and triggered a search for remote jobs in tech-related fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://www.indeed.com")

# Search for remote jobs
search_job = driver.find_element(By.NAME, "q")
search_job.send_keys("Data Analyst Remote")
search_job.submit()

time.sleep(5)  # Wait for the results to load

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧩 Step 3: Parsing the HTML with BeautifulSoup
&lt;/h2&gt;

&lt;p&gt;After the search results loaded, I passed the page source to BeautifulSoup to extract job details like title, company, location, and summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")
job_cards = soup.find_all("div", class_="job_seen_beacon")

jobs = []

def grab(card, tag, cls):
    # Guard each lookup: sponsored or malformed cards can miss a field,
    # and calling .text on None would raise an AttributeError
    el = card.find(tag, class_=cls)
    return el.text.strip() if el else ""

for card in job_cards:
    title = grab(card, "h2", "jobTitle")
    company = grab(card, "span", "companyName")
    location = grab(card, "div", "companyLocation")
    summary = grab(card, "div", "job-snippet")
    jobs.append([title, company, location, summary])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once done, I closed the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📊 Step 4: Storing Data in a DataFrame and Cleaning It
&lt;/h2&gt;

&lt;p&gt;I converted the list of jobs into a Pandas DataFrame and performed light cleaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame(jobs, columns=["Job Title", "Company", "Location", "Summary"])
df.drop_duplicates(inplace=True)
df.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🗃️ Step 5: Storing the Data in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Finally, I connected to a PostgreSQL database using psycopg2 and inserted the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="job_scraper",
    user="your_username",
    password="your_password"
)
cur = conn.cursor()

cur.execute("""
CREATE TABLE IF NOT EXISTS remote_jobs (
    id SERIAL PRIMARY KEY,
    job_title TEXT,
    company TEXT,
    location TEXT,
    summary TEXT
)
""")

for index, row in df.iterrows():
    cur.execute("""
        INSERT INTO remote_jobs (job_title, company, location, summary)
        VALUES (%s, %s, %s, %s)
    """, (row['Job Title'], row['Company'], row['Location'], row['Summary']))

conn.commit()
cur.close()
conn.close()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how web scraping can automate data collection from job platforms and store insights in a relational database for deeper analysis. Whether you're building a job analytics dashboard or tracking market demand, this approach scales well and integrates with tools like Power BI or Tableau.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔗 Explore the Full Code on GitHub
&lt;/h2&gt;

&lt;p&gt;Want to see the full source code for this project, including the complete Jupyter Notebook, scraping logic, and PostgreSQL integration?&lt;/p&gt;

&lt;p&gt;👉 Check it out here: &lt;a href="https://github.com/josiahnyamai/Scrapping-Indeed-Jobs-" rel="noopener noreferrer"&gt;GitHub Repository - Scraping Indeed Remote Jobs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to clone it, star it ⭐, or fork it and customize for your own job data project!&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>data</category>
      <category>python</category>
    </item>
    <item>
      <title>How Excel is Used in Real-World Data Analysis</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Tue, 10 Jun 2025 06:38:56 +0000</pubDate>
      <link>https://dev.to/joe_siah/how-excel-is-used-in-real-world-data-analysis-3n0e</link>
      <guid>https://dev.to/joe_siah/how-excel-is-used-in-real-world-data-analysis-3n0e</guid>
<description>&lt;p&gt;As a data analyst, returning to Excel during the first week of the LuxDevHQ Data Analytics course has reminded me just how powerful and flexible the tool actually is. Far more than a simple spreadsheet program, Excel remains an indispensable platform for real-world data analysis across a wide range of industries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Excel: What is it?&lt;/strong&gt;&lt;br&gt;
Microsoft Excel is a spreadsheet program used to organize, analyze, and present data. It enables users to build dashboards, make data-driven decisions, run computations, and produce charts. Its intuitive interface and extensive functionality make it a preferred tool for analysts, finance professionals, marketers, and decision-makers worldwide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Excel is used in Data Analysis in Everyday Situations&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Business Decision-Making:&lt;br&gt;
Companies use Excel to analyze sales trends, monitor stock, and understand consumer behavior. By organizing data into tables and charts, business leaders can spot patterns and make informed decisions that drive the company forward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Financial Reporting:&lt;br&gt;
Budgeting, forecasting, and creating financial reports are critical in finance and accounting. With built-in financial formulas and pivot tables, financial analysts can summarize large datasets to examine performance and plan ahead efficiently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marketing Performance Measurement:&lt;br&gt;
Marketing teams rely on Excel to track campaign performance, measure ROI, and analyze customer engagement. Excel’s charting tools and conditional formatting make it easier to interpret data visually and spot trends over time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key Excel Features and Formulas I’ve Learned&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;SUM Function:&lt;br&gt;
Adds the values in a range, useful for totaling sales, expenses, or any other numeric data. For example, =SUM(B2:B10) quickly calculates the total revenue in a column of sales.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IF Function:&lt;br&gt;
A logical function that returns one of two values depending on whether a condition is met. For example, =IF(C2&amp;gt;10000, "High", "Low") classifies sales performance against a threshold.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pivot Tables:&lt;br&gt;
Pivot tables allow dynamic summarization of large amounts of data. I learned to sort, filter, and analyze data in just a few clicks, a real time-saver for pulling insights without typing out complicated formulas.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;My Reflection&lt;/strong&gt;&lt;br&gt;
Although I’ve used Excel extensively in my data analysis work, this week has deepened my appreciation for its capabilities. It’s not just a tool for organizing data — it’s a strategic asset for uncovering insights, validating assumptions, and driving decisions. Revisiting key features with a fresh perspective reminded me how crucial Excel is for building strong analytical foundations, even in complex data environments.&lt;/p&gt;

&lt;p&gt;This experience has only served to convince me that a mastery of the basics — such as those presented by Excel — is critical for every data professional, regardless of how sophisticated their toolset becomes.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>data</category>
    </item>
  </channel>
</rss>
