Introduction
Cleaning dirty data is an essential yet complex part of data quality work, and it directly affects downstream analytics and decision-making. As QA engineers, we often encounter large datasets riddled with inconsistencies, missing values, and anomalies. Leveraging SQL through open source tools offers a robust, scalable, and cost-effective way to rectify these issues. This post walks through a systematic method for cleaning dirty data with an open source relational database management system, focusing on PostgreSQL, though the principles apply broadly.
Understanding the Challenge
Dirty data can manifest as duplicate records, inconsistent formats, nulls, or outliers. These issues hinder accurate analysis, reporting, and machine learning models. The key is to identify the core problems within your dataset and craft SQL queries to resolve them.
Setting Up the Environment
For this example, we'll use PostgreSQL, an open source RDBMS renowned for its extensibility and standards compliance. Install PostgreSQL locally or via Docker:
docker run -d --name postgres-cleaning -p 5432:5432 -e POSTGRES_PASSWORD=yourpassword postgres
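Once the container is running, you can open a psql shell inside it (postgres is the image's default superuser):
docker exec -it postgres-cleaning psql -U postgres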
Create a sample staging table that mimics real-world dirty data. Every non-key column is deliberately VARCHAR: if signup_date were a TIMESTAMP, PostgreSQL would reject the invalid '2021-13-01' value below at insert time, and we want the bad values to land so we can clean them in place:
CREATE TABLE raw_data (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100),
    date_of_birth VARCHAR(20), -- text on purpose: may hold invalid dates
    signup_date VARCHAR(30),   -- text on purpose: may hold invalid timestamps
    score VARCHAR(10)          -- text on purpose: may hold 'N/A'
);
INSERT INTO raw_data (name, email, date_of_birth, signup_date, score) VALUES
('John Doe', 'john@example.com', '1985-02-30', '2021-05-15 10:00:00', '85'),
('Jane Smith', 'jane@sample', NULL, '2021-13-01 09:30:00', 'N/A'),
('Sam Johnson', 'sam.j@domain.com', '1990-07-12', NULL, '92'),
('Emily Davis', 'emily@domain.com', 'not a date', '2021-06-20 14:20:00', '88');
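Before cleaning anything, profile the damage so you know which problems affect how many rows. A quick aggregate using PostgreSQL's FILTER clause:
SELECT
    count(*) AS total_rows,
    count(*) FILTER (WHERE date_of_birth IS NULL) AS null_dob,
    count(*) FILTER (WHERE signup_date IS NULL) AS null_signup,
    count(*) FILTER (WHERE score !~ '^\d+$') AS non_numeric_scores
FROM raw_data;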
Cleaning Techniques Using SQL
1. Handling Nulls and Missing Data
Replace nulls and empty strings with a documented sentinel value. COALESCE supplies the fallback, NULLIF folds empty strings into the null case, and the WHERE clause limits the rewrite to the affected rows (make sure downstream consumers know that 1970-01-01 means "unknown", not a real birthday):
UPDATE raw_data
SET date_of_birth = COALESCE(NULLIF(date_of_birth, ''), '1970-01-01')
WHERE date_of_birth IS NULL OR date_of_birth = '';
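If you would rather not mutate the staging table, the same defaulting can happen at read time with CASE WHEN; a minimal sketch:
SELECT id,
       name,
       CASE
           WHEN date_of_birth IS NULL OR date_of_birth = '' THEN '1970-01-01'
           ELSE date_of_birth
       END AS date_of_birth_filled
FROM raw_data;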
2. Correcting Data Types and Formats
Converting date_of_birth to a proper DATE column is where the dirty values bite: a direct cast fails not only on 'not a date' but also on the superficially well-formed '1985-02-30' (February has no 30th). Null out anything that is not a real date first, then change the column type. Note that pg_input_is_valid requires PostgreSQL 16 or newer:
UPDATE raw_data
SET date_of_birth = NULL
WHERE date_of_birth !~ '^(19|20)\d{2}-\d{2}-\d{2}$'
   OR NOT pg_input_is_valid(date_of_birth, 'date');
ALTER TABLE raw_data
ALTER COLUMN date_of_birth TYPE DATE
USING date_of_birth::DATE;
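On PostgreSQL versions older than 16, pg_input_is_valid is not available, but a small PL/pgSQL helper that swallows cast errors does the same job. A sketch (try_cast_date is a name invented here; the same pattern would recover signup_date's invalid '2021-13-01' timestamp):
CREATE OR REPLACE FUNCTION try_cast_date(txt TEXT) RETURNS DATE AS $$
BEGIN
    RETURN txt::DATE;   -- succeeds only for well-formed, real dates
EXCEPTION WHEN OTHERS THEN
    RETURN NULL;        -- anything uncastable becomes NULL
END;
$$ LANGUAGE plpgsql;
ALTER TABLE raw_data
ALTER COLUMN date_of_birth TYPE DATE
USING try_cast_date(date_of_birth);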
3. Standardizing Categorical Data
Normalize casing so that comparisons, joins, and de-duplication behave predictably:
UPDATE raw_data
SET email = LOWER(email);
UPDATE raw_data
SET name = INITCAP(name); -- Capitalize first letters
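Case-folding alone will not catch structurally broken addresses like 'jane@sample' from the sample data. A deliberately loose shape check (a sanity filter, not full RFC 5321 validation) can null those out:
UPDATE raw_data
SET email = NULL
WHERE email !~ '^[^@\s]+@[^@\s]+\.[^@\s]+$';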
4. Removing Duplicates and Outliers
Duplicates often share a natural key such as email. Before deleting anything, identify which emails appear more than once:
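SELECT email, COUNT(*) AS copies
FROM raw_data
GROUP BY email
HAVING COUNT(*) > 1;
Then delete every copy except the row with the lowest id: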
DELETE FROM raw_data a
USING raw_data b
WHERE a.id < b.id
AND a.email = b.email;
Scores like 'N/A' are not outliers so much as junk; clear out anything non-numeric first:
UPDATE raw_data
SET score = NULL
WHERE score !~ '^\d+$'; -- anything that is not purely digits
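With the junk nulled, the column can become a real INTEGER, and genuine outliers can then be flagged statistically. A minimal sketch using a three-standard-deviation cutoff (an arbitrary threshold; tune it to your data):
ALTER TABLE raw_data
ALTER COLUMN score TYPE INTEGER
USING score::INTEGER;
SELECT id, score
FROM raw_data
WHERE score IS NOT NULL
  AND abs(score - (SELECT avg(score) FROM raw_data)) >
      3 * (SELECT stddev_samp(score) FROM raw_data);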
5. Automating Cleaning Pipelines
Combine the steps into a single transaction so the cleaning is all-or-nothing; if any statement fails, a ROLLBACK leaves the data untouched:
BEGIN;
-- Remove duplicates
DELETE FROM raw_data a
USING raw_data b
WHERE a.id < b.id AND a.email = b.email;
-- Fix null dates
UPDATE raw_data
SET date_of_birth = '1970-01-01'
WHERE date_of_birth IS NULL;
-- Standardize emails
UPDATE raw_data
SET email = LOWER(email);
COMMIT;
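Saved to a file, the pipeline can run non-interactively from cron or CI. A sketch (clean_raw_data.sql is a hypothetical filename; ON_ERROR_STOP makes psql abort on the first error instead of ploughing on):
psql -U postgres -d postgres -v ON_ERROR_STOP=1 -f clean_raw_data.sql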
Conclusion
Employing SQL for data cleaning empowers QA teams to address data issues directly within open source database environments. It promotes automation, transparency, and integration with existing data workflows. By combining techniques for null handling, format correction, de-duplication, and validation, you can significantly enhance data quality, ensuring reliable insights and reporting.
Continuous refinement and validation remain crucial. Incorporate validation scripts and monitor data quality metrics regularly to maintain the health of your datasets in production environments.
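As a starting point, here is one possible invariant check; both columns should read 0 once the pipeline has run, and a scheduled job can alert when they do not:
SELECT
    count(*) FILTER (WHERE email IS NOT NULL AND email <> LOWER(email)) AS unnormalized_emails,
    count(email) - count(DISTINCT email) AS duplicate_emails
FROM raw_data;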
🛠️ QA Tip
To test this safely without using real user data, I use TempoMail USA.