Real-world datasets rarely come neat and tidy. During a recent technical assessment for a Data Engineering & AI position, I was challenged to transform a messy, inconsistent cancer diagnostics dataset into clean, structured data ready for analysis and potential machine learning applications. This article walks through the steps I took to clean the data, standardize values, and build an ETL pipeline.
🧪 The Dataset
The dataset, sourced from Kaggle, simulates test orders in a cancer diagnostics lab. It includes records across departments like:
Histology
Cytology
Haematology
Immunohistochemistry
But just like in the real world, the data was messy:
Missing values
Invalid or mixed date formats
Typos and inconsistent labels
Non-numeric prices
Ambiguous values (e.g., "N/A", "Unknown")
🔄 Step 1: Data Cleaning & Standardization
I started by downloading the dataset directly from Kaggle using the kagglehub library, then loaded it into a pandas DataFrame for inspection.
✅ Importing and Inspecting the Data
import os

import kagglehub
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Download the dataset from Kaggle; kagglehub returns the local cache directory
path = kagglehub.dataset_download("eustusmurea/labtest-dataset")
print("Path to dataset files:", path)

# Load the CSV from the returned directory instead of a hardcoded absolute path
data = pd.read_csv(os.path.join(path, "messy_cancer_lab_dataset.csv"))
data.head()
This gave me a first look at the structure of the dataset, including messy values, inconsistent formats, and missing data, and laid the groundwork for cleaning and transformation.
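Before touching anything, it helps to quantify the mess. A minimal inspection sketch (these calls aren't shown in the original assessment code, but use only standard pandas):

# Column dtypes and non-null counts
data.info()

# Missing values per column
print(data.isnull().sum())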
📅 Fixing Date Columns
Many rows had invalid or mixed date formats. I converted them using errors='coerce', which sets invalid entries to NaT.
date_columns = ['creation_date', 'signout_date']
for col in date_columns:
    # 'mixed' parses each element individually (requires pandas >= 2.0)
    data[col] = pd.to_datetime(data[col], format='mixed', errors='coerce')
Then I filled missing dates with a safe placeholder, 2020-01-01, as sketched below.
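The fill step itself isn't shown above; a minimal sketch, reusing the date_columns list (the placeholder value comes from the text):

# Replace NaT entries left by errors='coerce' with the placeholder date
placeholder = pd.Timestamp('2020-01-01')
for col in date_columns:
    data[col] = data[col].fillna(placeholder)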
💵 Cleaning the Price Column
Prices came in multiple formats like "KES 2,500" or "ksh3000". I created a custom function to clean these values and convert them into floats.
def clean_price(value):
    # Strip currency markers ("ksh"/"kes"), thousands separators, and whitespace
    if pd.isnull(value):
        return None
    value = str(value).lower().replace('ksh', '').replace('kes', '').replace(',', '').strip()
    try:
        return float(value)
    except ValueError:  # anything still non-numeric becomes missing
        return None

data['price'] = data['price'].apply(clean_price)
After confirming via a boxplot that there were no extreme outliers, I filled missing values with the mean.
sns.boxplot(y='price', data=data)
data['price'] = data['price'].fillna(data['price'].mean())
data.info()
🧹 Filling Missing Object Values
For other columns with object dtype (e.g., text columns), I replaced nulls with "Unknown":
for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].fillna('Unknown')
🏷️ Standardizing Text Columns
Inconsistent labels like "St. Marys" and "st. mary's oncology" needed normalization. I used .replace() and string functions for standardization.
# Normalize whitespace and casing first, so variants like 'mercy cancer center'
# collapse automatically
data['facility'] = data['facility'].str.strip().str.title()

# Then map the remaining known variants to one canonical label
# (str.title() capitalizes the letter after an apostrophe, hence the second key)
data['facility'] = data['facility'].replace({
    'St. Marys': "St. Mary's Oncology",
    "St. Mary'S Oncology": "St. Mary's Oncology",
})
I applied similar methods to clean up categories, tests, and other text columns, as sketched below.
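A hedged sketch of that generalization; the exact column names here are assumptions taken from the table schema in Step 2:

# Hypothetical: apply the same strip/title normalization to the other text columns
for col in ['test_category', 'test', 'sub_category', 'service', 'payment_method']:
    data[col] = data[col].str.strip().str.title()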
⚙️ Step 2: ETL Pipeline
After transforming the raw data, the final step of the ETL pipeline was loading the clean dataset into a PostgreSQL database, making it ready for querying, reporting, or integration with BI tools.
🛢️ Load: Saving to PostgreSQL
🔹 Step 1: Export Cleaned Data to CSV
data.to_csv("Cleaned_dataset.csv", index=False)
🔹 Step 2: Establish PostgreSQL Connection
I used SQLAlchemy and psycopg2 to securely connect to a PostgreSQL database using environment variables for credentials.
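Here the connection parameters live in a .env file next to the script. A hypothetical example with placeholder values (the variable names match the os.getenv() calls below):

db_host=localhost
db_port=5432
db_name=lab_db
db_user=postgres
db_pass=your_password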
from dotenv import load_dotenv
import os
import psycopg2
import pandas as pd
from sqlalchemy import create_engine

# Load credentials from the .env file
load_dotenv()

# 1. Connect to PostgreSQL using psycopg2
connection = psycopg2.connect(
    host=os.getenv("db_host"),
    database=os.getenv("db_name"),
    user=os.getenv("db_user"),
    password=os.getenv("db_pass"),
    port=os.getenv("db_port")
)
cursor = connection.cursor()

# Create the target table if it doesn't exist yet
create_table_query = """
CREATE TABLE IF NOT EXISTS lab_tests (
    order_id TEXT PRIMARY KEY,
    signout_date DATE,
    lab_number TEXT,
    assigned_to_pathologist TEXT,
    patient_name TEXT,
    facility TEXT,
    test_category TEXT,
    test TEXT,
    sub_category TEXT,
    receiving_centers TEXT,
    processing_centers TEXT,
    creation_date DATE,
    creation_year TEXT,
    creation_weekday TEXT,
    service TEXT,
    price FLOAT,
    payment_method TEXT,
    delay_days INT
);
"""
cursor.execute(create_table_query)
connection.commit()
# 2. Create a SQLAlchemy engine and load the data with pandas.to_sql()
db_url = (
    f"postgresql://{os.getenv('db_user')}:{os.getenv('db_pass')}@"
    f"{os.getenv('db_host')}:{os.getenv('db_port')}/{os.getenv('db_name')}"
)
engine = create_engine(db_url)

# Append into the table defined above; if_exists='replace' would drop that
# schema (including the PRIMARY KEY) and let pandas infer its own
data.to_sql('lab_tests', engine, if_exists='append', index=False)
print("Data successfully loaded into PostgreSQL database.")
🤖 Bonus: ML Use Case Proposal
As part of the assessment, I proposed an ML task (a code sketch follows the bullets below):
🎯 Predicting Lab Test Delays
Target: Days between creation_date and signout_date
Features:
Facility
Test category
Price
Creation weekday/month
Models: Linear Regression, XGBoost
Metrics: MAE, RMSE, RΒ²
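A minimal sketch of how this proposal could be framed with scikit-learn; it is illustrative only (the assessment asked for a proposal, not an implementation), and the feature list mirrors the bullets above:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Target: delay in days between creation and signout
y = (data['signout_date'] - data['creation_date']).dt.days

# Features: one-hot encode the categoricals, keep price numeric
X = pd.get_dummies(data[['facility', 'test_category', 'creation_weekday']])
X['price'] = data['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

print("MAE: ", mean_absolute_error(y_test, preds))
print("RMSE:", np.sqrt(mean_squared_error(y_test, preds)))
print("R²:  ", r2_score(y_test, preds))

XGBoost would slot in the same way via xgboost.XGBRegressor; a tree-based model usually handles the one-hot categoricals and any skew in price better than plain linear regression.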
📂 GitHub Repository
You can find the full code, cleaned dataset, and pipeline on GitHub:
🔗 GitHub – Cancer Lab ETL & Cleaning Project
Feel free to clone it, run it, or build on top of it.