DEV Community

Brad

Python Data Validation: Catch Bad Data Before It Breaks Everything

Bad data is silent. It slips into your pipeline, corrupts your database, and breaks your ML models weeks later. Here's how to catch it at the source.

Why Validation Matters

# Without validation:
user_age = int(user_input)  # Raises ValueError on "twenty-five"
price = float(csv_field)    # Raises ValueError on "$9.99"
date = parse_date(raw)      # May silently assume the wrong timezone
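
One way to guard the three conversions above is to wrap each in a small parser that returns None rather than raising. This is a minimal sketch using only the standard library: the helper names `parse_age`, `parse_price`, and `parse_utc` are hypothetical, and treating timestamps without an offset as UTC is an assumption you should adjust to your data source.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_age(text: str) -> Optional[int]:
    """Return the age as an int, or None if text is not a plain integer."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def parse_price(text: str) -> Optional[float]:
    """Strip '$' and thousands separators, then parse; None if that fails."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_utc(text: str) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp and normalize it to UTC.

    Assumption: values with no UTC offset are already in UTC.
    """
    try:
        parsed = datetime.fromisoformat(text.strip())
    except ValueError:
        return None
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(parse_age("twenty-five"))
print(parse_price("$9.99"))
print(parse_utc("2024-01-15T10:30:00+02:00"))
```

Returning None keeps the caller in control: it can skip the record, log it, or substitute a default instead of crashing mid-pipeline.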

Schema Validation with Pydantic

Pydantic lets you declare a schema as a class with type hints, then validates and coerces incoming data whenever you construct an instance:

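# Pydantic v2 API; on v1, use `validator` in place of `field_validator`.
# EmailStr needs the optional dependency: pip install "pydantic[email]"
from pydantic import BaseModel, EmailStr, field_validator
from datetime import datetime

class UserRecord(BaseModel):
    name: str
    email: EmailStr
    age: int
    signup_date: datetime  # ISO 8601 strings are parsed into datetime

    @field_validator('age')
    @classmethod
    def age_must_be_realistic(cls, v):
        if not 0 < v < 150:
            raise ValueError(f'Age {v} is not realistic')
        return v

    @field_validator('name')
    @classmethod
    def name_must_not_be_empty(cls, v):
        # Reject whitespace-only names and store the trimmed value
        if not v.strip():
            raise ValueError('Name cannot be empty')
        return v.strip()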

# Usage
try:
    user = UserRecord(
        name="John Doe",
        email="john@example.com",
        age=25,
        signup_date="2024-01-15T10:30:00"
    )
    print(f"Valid: {user.name}")
except ValueError as e:
    print(f"Invalid: {e}")
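
When a record fails, it often helps to know every field that was wrong, not just the first. The sketch below catches Pydantic's `ValidationError` and reads its `errors()` list. `SignupRecord` and `collect_errors` are hypothetical names used only for illustration, and the age range is expressed with `Field(gt=..., lt=...)` instead of a custom validator so the example stays short.

```python
from pydantic import BaseModel, Field, ValidationError


class SignupRecord(BaseModel):  # hypothetical model, for illustration only
    name: str
    age: int = Field(gt=0, lt=150)


def collect_errors(payload: dict) -> list:
    """Return a (field, message) pair for every problem found in payload."""
    try:
        SignupRecord(**payload)
    except ValidationError as exc:
        # Each entry in exc.errors() is a dict with 'loc', 'msg' and 'type' keys
        return [
            (".".join(str(part) for part in err["loc"]), err["msg"])
            for err in exc.errors()
        ]
    return []


# Both problems are reported in a single pass: missing name, non-numeric age
for field, message in collect_errors({"age": "twenty-five"}):
    print(f"{field}: {message}")
```

The exact wording of each message depends on the installed Pydantic version, so match on the field name rather than the message text.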

CSV Data Validation Pipeline

import csv
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ValidationResult:
    row: int
    field: str
    value: str
    error: str

def validate_csv(filepath: str) -> Tuple[List[dict], List[ValidationResult]]:
    valid_rows = []
    errors = []

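    # newline='' lets the csv module handle line endings inside quoted fields
    with open(filepath, newline='') as f:
        reader = csv.DictReader(f)
        # row_num counts data rows from 1; the header line is not counted
        for row_num, row in enumerate(reader, 1):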
            row_errors = []

            # Validate required fields (DictReader fills missing columns with None)
            for field in ['name', 'email', 'amount']:
                value = row.get(field) or ''
                if not value.strip():
                    row_errors.append(ValidationResult(
                        row=row_num, field=field,
                        value=value,
                        error=f'{field} is required'
                    ))

            # Validate numeric fields (blank amounts are already reported above)
            if (row.get('amount') or '').strip():
                try:
                    # Strip currency formatting such as "$1,234.50" before parsing
                    amount = float(row['amount'].replace('$', '').replace(',', ''))
                    if amount < 0:
                        row_errors.append(ValidationResult(
                            row=row_num, field='amount',
                            value=row['amount'], error='Amount must not be negative'
                        ))
                    row['amount_clean'] = amount
                except ValueError:
                    row_errors.append(ValidationResult(
                        row=row_num, field='amount',
                        value=row['amount'], error='Invalid number format'
                    ))

            if row_errors:
                errors.extend(row_errors)
            else:
                valid_rows.append(row)

    return valid_rows, errors

valid, errors = validate_csv('data.csv')
print(f"Valid rows: {len(valid)}, Errors: {len(errors)}")
for e in errors[:5]:
    print(f"Row {e.row}, {e.field}: {e.error}")
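
Printing the first five errors is fine for a quick look, but for larger files you may want a summary and a report you can hand back to whoever owns the data. Here is a minimal sketch of both. `summarize_errors` and `write_error_report` are hypothetical helpers, the sample errors are made up for the demo, and `ValidationResult` is repeated with the same fields as above so the block runs on its own.

```python
import csv
import io
from collections import Counter
from dataclasses import asdict, dataclass, fields


@dataclass
class ValidationResult:  # same fields as the dataclass in the pipeline above
    row: int
    field: str
    value: str
    error: str


def summarize_errors(errors):
    """Count errors per (field, error) pair, most frequent first."""
    return Counter((e.field, e.error) for e in errors).most_common()


def write_error_report(errors, out):
    """Write one CSV line per error to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(ValidationResult)])
    writer.writeheader()
    for e in errors:
        writer.writerow(asdict(e))


# Made-up sample errors, standing in for the list validate_csv returns
sample = [
    ValidationResult(2, "amount", "abc", "Invalid number format"),
    ValidationResult(5, "email", "", "email is required"),
    ValidationResult(9, "amount", "12x", "Invalid number format"),
]

for (field, error), count in summarize_errors(sample):
    print(f"{count} x {field}: {error}")

buffer = io.StringIO()  # swap for open('errors.csv', 'w', newline='') to write a file
write_error_report(sample, buffer)
print(buffer.getvalue())
```

Writing to a file-like object keeps the report function easy to test and lets you send the same output to a file, a web response, or an email attachment.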

The Full Automation Toolkit

Want 47 production-ready Python scripts including complete data validation frameworks for APIs, databases, and CSV pipelines?

👉 Python Automation Toolkit

What validation problems are you solving? Share in the comments!
