Mohammad Waseem

Cleaning Legacy Codebases: QA Strategies for Sanity in Dirty Data

Introduction

Maintaining and evolving legacy codebases often involves dealing with the fallout of accumulated data discrepancies—what we commonly refer to as 'dirty data.' As a Lead QA Engineer, one of my critical roles is to ensure data integrity through rigorous testing, especially when legacy systems are involved. In this post, I will share strategies and practical approaches, including code snippets, to systematically identify, clean, and validate dirty data in legacy environments.

The Challenge of Dirty Data in Legacy Systems

Legacy systems frequently harbor inconsistent, incomplete, or corrupted data due to outdated data entry practices, schema changes, or integration issues over time. These inconsistencies can cause failures, inaccurate reporting, or system crashes. Testing these systems requires careful planning to avoid unintended consequences.

Approach to 'Cleaning' Dirty Data

My approach revolves around creating a robust testing framework that simulates data anomalies, detects them with precise assertions, and verifies the effectiveness of data cleaning routines.
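
Before running anything against production extracts, it helps to simulate those anomalies in a small fixture. The sketch below uses invented column names and values; you can point the later snippets at this file instead of the real CSV for a safe dry run.

import pandas as pd

# Hypothetical fixture reproducing common anomalies: mixed phone formats,
# missing values, and malformed emails. All values are invented for illustration.
dirty_fixture = pd.DataFrame({
    'email': ['Alice@Example.com', None, 'not-an-email', 'bob@mail.com'],
    'phone': ['+1 (555) 010-0199', '5550100', None, '001-555-010-0123'],
})
dirty_fixture.to_csv('dirty_fixture.csv', index=False)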

Step 1: Data Discovery and Profiling

Begin by understanding data patterns and anomalies. For example, in a customer database, inconsistent phone number formats or missing email addresses are common issues.

import pandas as pd

# Load the legacy extract and profile it: describe() surfaces counts, unique
# values, and top frequencies per column; isnull().sum() counts missing values
df = pd.read_csv('legacy_customer_data.csv')
print(df.describe(include='all'))
print(df.isnull().sum())

This snippet helps identify null values and statistical anomalies that need addressing.
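
Beyond null counts, it often pays to profile the shape of the values themselves. The following sketch (assuming the phone column from the example above) buckets phone numbers by format signature so inconsistent styles surface immediately:

import re

# Reduce each value to a format signature, e.g. '+1 (555) 010-0199' -> '+D (DDD) DDD-DDDD',
# so every distinct formatting style shows up as its own bucket
def format_signature(value):
    if not isinstance(value, str):
        return '<missing>'
    return re.sub(r'\d', 'D', value.strip())

print(df['phone'].apply(format_signature).value_counts())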

Step 2: Define Data Validation Rules

Establish validation rules to flag dirty data: for instance, that phone numbers follow a consistent format and that email addresses are syntactically valid.

import re

def is_valid_email(email):
    # Guard against NaN/None values, which are common in legacy extracts
    if not isinstance(email, str):
        return False
    pattern = r"[^@]+@[^@]+\.[^@]+"
    return re.match(pattern, email) is not None

def is_valid_phone(phone):
    if not isinstance(phone, str):
        return False
    # fullmatch (not match) so trailing junk like extensions is flagged too
    pattern = r"\+?\d{10,15}"
    return re.fullmatch(pattern, phone) is not None

# Apply validations
df['email_valid'] = df['email'].apply(is_valid_email)
df['phone_valid'] = df['phone'].apply(is_valid_phone)
print(df[['email', 'email_valid', 'phone', 'phone_valid']])

This helps isolate records with invalid data.
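
With the boolean flags in place, isolating the suspect records for review is a one-liner:

# Records failing either rule, queued for manual review or cleaning
suspect = df[~df['email_valid'] | ~df['phone_valid']]
print(f"{len(suspect)} of {len(df)} records flagged as dirty")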

Step 3: Establish Cleaning Scripts

Develop scripts that cleanse the data, such as normalizing formats.

def clean_email(email):
    # Normalize valid emails to lowercase; drop anything invalid
    if is_valid_email(email):
        return email.lower()
    return None

def clean_phone(phone):
    if not isinstance(phone, str):
        return None
    # Strip everything but digits, then re-add a leading '+'
    digits = re.sub(r"\D", "", phone)
    if len(digits) >= 10:
        return '+' + digits
    return None

df['cleaned_email'] = df['email'].apply(clean_email)
df['cleaned_phone'] = df['phone'].apply(clean_phone)

This ensures data conforms to expected standards, ready for re-integration.
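
Once the cleaned columns look right, persist them for the re-integration step. A minimal sketch, assuming downstream consumers take CSV (the output file name and the drop-incomplete-rows policy are illustrative choices, not a fixed recipe):

# Keep only rows where both cleaned values survived, then hand off for re-integration
cleaned = df.dropna(subset=['cleaned_email', 'cleaned_phone'])
cleaned.to_csv('customer_data_cleaned.csv', index=False)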

Step 4: Automated Testing & Validation

Create test cases that run on baseline and cleaned data, ensuring no valid data is removed or corrupted.

# Cleaning nulls out invalid values, so cleaned counts can only shrink,
# but every value that passed validation must survive the cleaning step
assert df['cleaned_email'].notnull().sum() <= df['email'].notnull().sum()
assert df['cleaned_email'].notnull().sum() == df['email_valid'].sum()
assert df['cleaned_phone'].notnull().sum() >= df['phone_valid'].sum()

This verifies that the cleaning process preserves every value that passed validation, while still allowing invalid values to be dropped.

Continuous Validation in CI/CD Pipelines

Integrate these tests into your CI/CD pipeline to catch dirty data issues as early as possible, preventing legacy problems from proliferating.
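
A sketch of what tests/test_data_cleaning.py might contain is below; the cleaning module import and the fixture path are assumptions for illustration, not an established project layout.

# tests/test_data_cleaning.py -- a minimal sketch; module and paths are assumed
import pandas as pd
import pytest

from cleaning import clean_email, clean_phone, is_valid_email  # hypothetical module

@pytest.fixture
def df():
    return pd.read_csv('legacy_customer_data.csv')

def test_cleaning_preserves_valid_emails(df):
    valid = df['email'].apply(is_valid_email)
    cleaned = df['email'].apply(clean_email)
    # Every email that passed validation must survive cleaning
    assert cleaned.notnull().sum() == valid.sum()

def test_cleaned_phones_are_normalized(df):
    cleaned = df['phone'].apply(clean_phone).dropna()
    # All cleaned phones follow the '+digits' convention
    assert cleaned.str.match(r'\+\d{10,}\Z').all()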

# Example: Running tests in CI
pytest tests/test_data_cleaning.py

Final Thoughts

Cleaning dirty data in legacy systems isn't just about patchwork fixes. It requires a structured, test-driven approach to identify issues, apply precise cleaning routines, and verify correctness. QA plays a pivotal role in transforming dirty, unreliable data into a trusted asset—ultimately ensuring system stability, data accuracy, and smoother migrations.

By systematically applying these testing strategies, you can significantly reduce legacy data issues and pave the way for scalable, reliable system evolution.


