Real-world datasets rarely come neat and tidy. During a recent technical assessment for a Data Engineering & AI position, I was challenged to transform a messy, inconsistent cancer diagnostics dataset into clean, structured data ready for analysis and potential machine learning applications. This article walks through the steps I took to clean the data, standardize values, and build an ETL pipeline.
🧪 The Dataset
The dataset, sourced from Kaggle, simulates test orders in a cancer diagnostics lab. It includes records across departments like:
Histology
Cytology
Haematology
Immunohistochemistry
But just like in the real world, the data was messy:
Missing values
Invalid or mixed date formats
Typos and inconsistent labels
Non-numeric prices
Ambiguous values (e.g., "N/A", "Unknown")
🔄 Step 1: Data Cleaning & Standardization
I started by downloading the dataset directly from Kaggle using the kagglehub library, then loaded it into a pandas DataFrame for inspection.
✅ Importing and Inspecting the Data
import os

import kagglehub
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Download the dataset from Kaggle; kagglehub returns the local cache directory
path = kagglehub.dataset_download("eustusmurea/labtest-dataset")
print("Path to dataset files:", path)

# Load the CSV from the returned directory instead of a hardcoded absolute path
data = pd.read_csv(os.path.join(path, "messy_cancer_lab_dataset.csv"))
data.head()
This gave me a first look at the structure of the dataset, including messy values, inconsistent formats, and missing data, and laid the groundwork for cleaning and transformation.
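Before touching anything, it helps to quantify the mess. A minimal inspection sketch (these calls aren't shown in the original assessment code, but use only standard pandas):

# Column dtypes and non-null counts
data.info()

# Missing values per column
print(data.isnull().sum())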
📅 Fixing Date Columns
Many rows had invalid or mixed date formats. I converted them using errors='coerce', which sets invalid entries to NaT.
date_columns = ['creation_date', 'signout_date']
for col in date_columns:
    # 'mixed' parses each element individually (requires pandas >= 2.0)
    data[col] = pd.to_datetime(data[col], format='mixed', errors='coerce')
Then I filled missing dates with a safe placeholder, 2020-01-01, as sketched below.
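The fill step itself isn't shown above; a minimal sketch, reusing the date_columns list (the placeholder value comes from the text):

# Replace NaT entries left by errors='coerce' with the placeholder date
placeholder = pd.Timestamp('2020-01-01')
for col in date_columns:
    data[col] = data[col].fillna(placeholder)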
💵 Cleaning the Price Column
Prices came in multiple formats like "KES 2,500" or "ksh3000". I created a custom function to clean these values and convert them into floats.
def clean_price(value):
    # Strip currency markers ("ksh"/"kes"), thousands separators, and whitespace
    if pd.isnull(value):
        return None
    value = str(value).lower().replace('ksh', '').replace('kes', '').replace(',', '').strip()
    try:
        return float(value)
    except ValueError:  # anything still non-numeric becomes missing
        return None

data['price'] = data['price'].apply(clean_price)
After confirming via a boxplot that there were no extreme outliers, I filled missing values with the mean.
sns.boxplot(y='price', data=data)
data['price'] = data['price'].fillna(data['price'].mean())
data.info()
🧹 Filling Missing Object Values
For other columns with object dtype (e.g., text columns), I replaced nulls with "Unknown":
for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].fillna('Unknown')
🏷️ Standardizing Text Columns
Inconsistent labels like "St. Marys" and "st. mary's oncology" needed normalization. I used .replace() and string functions for standardization.
# Normalize whitespace and casing first, so variants like 'mercy cancer center'
# collapse automatically
data['facility'] = data['facility'].str.strip().str.title()

# Then map the remaining known variants to one canonical label
# (str.title() capitalizes the letter after an apostrophe, hence the second key)
data['facility'] = data['facility'].replace({
    'St. Marys': "St. Mary's Oncology",
    "St. Mary'S Oncology": "St. Mary's Oncology",
})
I applied similar methods to clean up categories, tests, and other text columns, as sketched below.
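A hedged sketch of that generalization; the exact column names here are assumptions taken from the table schema in Step 2:

# Hypothetical: apply the same strip/title normalization to the other text columns
for col in ['test_category', 'test', 'sub_category', 'service', 'payment_method']:
    data[col] = data[col].str.strip().str.title()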
⚙️ Step 2: ETL Pipeline
After transforming the raw data, the final step of the ETL pipeline was loading the clean dataset into a PostgreSQL database, making it ready for querying, reporting, or integration with BI tools.
🛢️ Load: Saving to PostgreSQL
🔹 Step 1: Export Cleaned Data to CSV
data.to_csv("Cleaned_dataset.csv", index=False)
🔹 Step 2: Establish PostgreSQL Connection
I used SQLAlchemy and psycopg2 to securely connect to a PostgreSQL database using environment variables for credentials.
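Here the connection parameters live in a .env file next to the script. A hypothetical example with placeholder values (the variable names match the os.getenv() calls below):

db_host=localhost
db_port=5432
db_name=lab_db
db_user=postgres
db_pass=your_password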
from dotenv import load_dotenv
import os
import psycopg2
import pandas as pd
from sqlalchemy import create_engine

# Load credentials from the .env file
load_dotenv()

# 1. Connect to PostgreSQL using psycopg2
connection = psycopg2.connect(
    host=os.getenv("db_host"),
    database=os.getenv("db_name"),
    user=os.getenv("db_user"),
    password=os.getenv("db_pass"),
    port=os.getenv("db_port")
)
cursor = connection.cursor()

# Create the target table if it doesn't exist yet
create_table_query = """
CREATE TABLE IF NOT EXISTS lab_tests (
    order_id TEXT PRIMARY KEY,
    signout_date DATE,
    lab_number TEXT,
    assigned_to_pathologist TEXT,
    patient_name TEXT,
    facility TEXT,
    test_category TEXT,
    test TEXT,
    sub_category TEXT,
    receiving_centers TEXT,
    processing_centers TEXT,
    creation_date DATE,
    creation_year TEXT,
    creation_weekday TEXT,
    service TEXT,
    price FLOAT,
    payment_method TEXT,
    delay_days INT
);
"""
cursor.execute(create_table_query)
connection.commit()
# 2. Create a SQLAlchemy engine and load the data with pandas.to_sql()
db_url = (
    f"postgresql://{os.getenv('db_user')}:{os.getenv('db_pass')}@"
    f"{os.getenv('db_host')}:{os.getenv('db_port')}/{os.getenv('db_name')}"
)
engine = create_engine(db_url)

# Append into the table defined above; if_exists='replace' would drop that
# schema (including the PRIMARY KEY) and let pandas infer its own
data.to_sql('lab_tests', engine, if_exists='append', index=False)
print("Data successfully loaded into PostgreSQL database.")
🤖 Bonus: ML Use Case Proposal
As part of the assessment, I proposed an ML task (a code sketch follows the bullets below):
🎯 Predicting Lab Test Delays
Target: Days between creation_date and signout_date
Features:
Facility
Test category
Price
Creation weekday/month
Models: Linear Regression, XGBoost
Metrics: MAE, RMSE, RΒ²
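A minimal sketch of how this proposal could be framed with scikit-learn; it is illustrative only (the assessment asked for a proposal, not an implementation), and the feature list mirrors the bullets above:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Target: delay in days between creation and signout
y = (data['signout_date'] - data['creation_date']).dt.days

# Features: one-hot encode the categoricals, keep price numeric
X = pd.get_dummies(data[['facility', 'test_category', 'creation_weekday']])
X['price'] = data['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

print("MAE: ", mean_absolute_error(y_test, preds))
print("RMSE:", np.sqrt(mean_squared_error(y_test, preds)))
print("R²:  ", r2_score(y_test, preds))

XGBoost would slot in the same way via xgboost.XGBRegressor; a tree-based model usually handles the one-hot categoricals and any skew in price better than plain linear regression.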
📂 GitHub Repository
You can find the full code, cleaned dataset, and pipeline on GitHub:
🔗 GitHub – Cancer Lab ETL & Cleaning Project
Feel free to clone it, run it, or build on top of it.