Data transformation is the unglamorous middle step of every data pipeline. You have raw data from an API, a database, or a CSV export. You need it cleaned, reshaped, validated, and loaded somewhere useful. The transformation logic itself is usually straightforward: filter rows, rename columns, convert types, join datasets, compute aggregates. What makes it painful is the boilerplate, the edge cases, and the debugging.
Python's ecosystem for data transformation has matured significantly. These five libraries each solve a specific part of the transformation problem. Some overlap in capability, but each has a sweet spot where it outperforms the alternatives. Here are the ones I reach for most often.
1. Pandas
Pandas is the foundational library for tabular data manipulation in Python. If you are doing any kind of data transformation work, you are almost certainly using Pandas already. Its DataFrame API provides a consistent interface for filtering, grouping, merging, pivoting, and reshaping data.
Where Pandas shines is expressiveness. Operations that would take 20 lines of raw Python collapse into a single chained method call. The IO tools handle reading and writing CSV, JSON, Excel, Parquet, SQL databases, and dozens of other formats with minimal configuration.
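As a minimal sketch of that chaining style (the column names here are illustrative, not from any particular dataset):

```python
import pandas as pd

# Toy sales data standing in for a CSV export
df = pd.DataFrame({
    'category': ['books', 'games', 'books', 'games'],
    'amount_cents': [1250, 4999, 800, 1500],
    'status': ['completed', 'refunded', 'completed', 'completed'],
})

# Filter, derive a column, and aggregate in one chained expression
revenue = (
    df[df['status'] == 'completed']
    .assign(amount_usd=lambda d: d['amount_cents'] / 100)
    .groupby('category')['amount_usd']
    .sum()
)
print(revenue)
```

Each step returns a new DataFrame or Series, which is what makes the chain readable top to bottom: filter, derive, group, aggregate.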
Where Pandas struggles is scale. DataFrames live in memory, so datasets larger than available RAM require workarounds. For most business data workflows (thousands to low millions of rows), Pandas handles the load without issues. Beyond that, you need one of the alternatives below.
The Pandas documentation is among the best in the Python ecosystem. The 10 Minutes to Pandas guide is a genuinely useful introduction, and the cookbook section covers common transformation patterns with copy-paste examples.

2. Polars
Polars is the newer, faster alternative to Pandas. Written in Rust with a Python API, Polars processes data significantly faster than Pandas for most operations. It uses lazy evaluation by default, meaning it builds a query plan and optimizes it before executing, similar to how SQL query optimizers work.
The API is deliberately different from Pandas. Polars uses expressions rather than index-based operations, which eliminates an entire category of common Pandas bugs:
```python
import polars as pl

# Lazy evaluation - builds an optimized query plan
result = (
    pl.scan_csv('sales_data.csv')
    .filter(pl.col('status') == 'completed')
    .with_columns([
        (pl.col('amount_cents') / 100).alias('amount_usd'),
        pl.col('sale_date').str.to_date('%Y-%m-%d'),
    ])
    .group_by('product_category')
    .agg([
        pl.col('amount_usd').sum().alias('total_revenue'),
        pl.col('transaction_id').count().alias('order_count'),
    ])
    .sort('total_revenue', descending=True)
    .collect()  # Execute the plan
)
```
Polars handles datasets larger than memory through its lazy evaluation and streaming capabilities. If your current Pandas pipeline runs out of memory or takes too long, Polars is the first alternative to evaluate. The Polars user guide covers the migration path from Pandas and explains the philosophical differences in API design.
3. Pydantic
Pydantic is not a data manipulation library. It is a data validation library. But in practice, data validation is inseparable from data transformation, and Pydantic handles the validation layer better than anything else in the Python ecosystem.
Define a schema for your data, and Pydantic validates, coerces, and normalizes incoming records automatically:
```python
from pydantic import BaseModel, field_validator
from datetime import date
from typing import Optional

class SalesRecord(BaseModel):
    transaction_id: str
    customer_email: str
    amount_cents: int
    sale_date: date
    product_category: str
    discount_pct: Optional[float] = 0.0

    @field_validator('amount_cents')
    @classmethod
    def amount_must_be_positive(cls, v):
        if v < 0:
            raise ValueError('Amount cannot be negative')
        return v

    @field_validator('customer_email')
    @classmethod
    def email_must_be_valid(cls, v):
        if '@' not in v:
            raise ValueError('Invalid email format')
        return v.lower().strip()
```
When you parse raw API responses or CSV rows through a Pydantic model, you get type coercion (strings to dates, strings to integers), validation (rejecting invalid records with clear error messages), and normalization (lowercasing emails, stripping whitespace) in a single step. The Pydantic V2 documentation covers the full model API, including nested models, custom serializers, and JSON schema generation.
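Parsing a raw record through a model like the one above looks like this. The model is repeated here in trimmed form (a few fields, one validator) so the snippet runs on its own; the input values are made up:

```python
from datetime import date
from pydantic import BaseModel, field_validator

class SalesRecord(BaseModel):  # trimmed version of the model above
    transaction_id: str
    customer_email: str
    amount_cents: int
    sale_date: date

    @field_validator('customer_email')
    @classmethod
    def email_must_be_valid(cls, v):
        if '@' not in v:
            raise ValueError('Invalid email format')
        return v.lower().strip()

raw = {
    'transaction_id': 'TX-1001',
    'customer_email': '  Alice@Example.COM ',
    'amount_cents': '1250',       # string, coerced to int
    'sale_date': '2024-03-15',    # string, coerced to date
}

record = SalesRecord.model_validate(raw)
print(record.amount_cents)    # 1250
print(record.customer_email)  # alice@example.com
```

One call gives you coercion, validation, and normalization; a record that fails any check raises a `ValidationError` naming the offending field.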
"Data validation is the most underinvested part of most data pipelines. Every hour you spend on schema validation saves ten hours of debugging bad data downstream." - Dennis Traina, 137Foundry
4. SQLAlchemy
SQLAlchemy bridges the gap between Python objects and relational databases. For data pipelines that read from or write to PostgreSQL, MySQL, SQLite, or any other SQL database, SQLAlchemy handles connection management, query building, and result mapping.
The Core API (as opposed to the ORM) is particularly useful for data pipelines because it generates SQL without the overhead of an object-relational mapping layer:
```python
from sqlalchemy import create_engine, MetaData, select, func

engine = create_engine('postgresql://user:pass@host/dbname')
metadata = MetaData()
metadata.reflect(bind=engine)

orders = metadata.tables['orders']
customers = metadata.tables['customers']

# Build a query programmatically
query = (
    select(
        customers.c.name,
        func.sum(orders.c.total).label('lifetime_value'),
        func.count(orders.c.id).label('order_count'),
    )
    .join_from(orders, customers, orders.c.customer_id == customers.c.id)
    .where(orders.c.status == 'completed')
    .group_by(customers.c.name)
    .having(func.count(orders.c.id) > 5)
)

with engine.connect() as conn:
    result = conn.execute(query).fetchall()
```
SQLAlchemy integrates naturally with Pandas through the read_sql function and the DataFrame's to_sql method, making it easy to pull data from a database into a DataFrame, transform it, and write the results back. The SQLAlchemy 2.0 tutorial covers the modern API patterns that replaced the legacy 1.x interface.
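That round trip can be sketched like this, using an in-memory SQLite database so the snippet is self-contained (the table and column names are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('sqlite://')  # in-memory database for illustration

# Write a DataFrame into the database...
pd.DataFrame({
    'id': [1, 2, 3],
    'total': [120.0, 75.5, 300.0],
    'status': ['completed', 'pending', 'completed'],
}).to_sql('orders', engine, index=False)

# ...then pull a filtered slice back out for transformation
df = pd.read_sql(
    text("SELECT id, total FROM orders WHERE status = 'completed'"),
    engine,
)
print(df)
```

The same pattern works against PostgreSQL or MySQL by swapping the connection URL; Pandas handles the cursor and result mapping through the engine.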

5. Great Expectations
Great Expectations is a data quality framework that lets you define "expectations" about your data and validates them automatically during pipeline execution. Think of it as unit tests for your data.
Instead of writing custom validation logic for each pipeline, you define expectations declaratively:
```python
import great_expectations as gx

context = gx.get_context()

# Define expectations for your sales data
validator = context.sources.pandas_default.read_csv("sales_data.csv")
validator.expect_column_to_exist("transaction_id")
validator.expect_column_values_to_be_unique("transaction_id")
validator.expect_column_values_to_not_be_null("customer_email")
validator.expect_column_values_to_be_between(
    "amount_cents", min_value=0, max_value=10_000_000
)
validator.expect_column_values_to_be_in_set(
    "status", ["completed", "pending", "refunded"]
)
```
Great Expectations generates data quality reports that show which expectations passed and failed, with sample failing records for debugging. The expectation gallery lists all available expectation types, from simple null checks to statistical distribution tests.
For data pipelines that feed into business-critical reports or dashboards, Great Expectations provides the safety net that prevents bad data from reaching decision-makers. It integrates with Airflow, Prefect, and other orchestration tools through checkpoint triggers.
Choosing the Right Combination
These libraries work well together. A typical data pipeline stack might use:
- Pandas or Polars for the core transformation logic
- Pydantic for validating individual records from API responses
- SQLAlchemy for database read/write operations
- Great Expectations for end-to-end data quality validation
Start with Pandas for transformation and Pydantic for validation. Add SQLAlchemy when you need database interactions. Add Great Expectations when data quality becomes a business-critical concern. Scale from Pandas to Polars when dataset size or performance demands it.
For a broader look at how these libraries fit into a complete data automation architecture, this guide on building automated data pipelines covers the full pipeline lifecycle from extraction through scheduling and monitoring. The team at 137Foundry regularly helps businesses select the right combination of tools for their specific data volume, complexity, and team capabilities.
One additional consideration when choosing between these libraries is how well they compose together. Pandas and Polars both integrate cleanly with SQLAlchemy for database operations. Pydantic models can validate individual records before they enter a Pandas DataFrame. Great Expectations can validate entire DataFrames after transformation. The composability of Python's data ecosystem is one of its strongest advantages over monolithic ETL platforms that lock you into a single vendor's approach.
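As a sketch of that composition, validated Pydantic records can feed a DataFrame directly. The `Order` model here is a hypothetical two-field schema, not from any of the examples above:

```python
import pandas as pd
from pydantic import BaseModel

class Order(BaseModel):  # hypothetical record schema
    order_id: int
    amount: float

raw_rows = [
    {'order_id': '1', 'amount': '19.99'},  # strings coerced by Pydantic
    {'order_id': '2', 'amount': '5.00'},
]

# Validate each record, then hand clean dicts to Pandas
records = [Order.model_validate(r).model_dump() for r in raw_rows]
df = pd.DataFrame(records)
print(df.dtypes)  # order_id int64, amount float64
```

Because validation happens before the DataFrame is built, bad records fail loudly at the boundary instead of surfacing as mixed-type columns mid-pipeline.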
The Python data transformation ecosystem is deep enough that you rarely need to build transformation logic from scratch. These libraries handle the heavy lifting. Your job is understanding which tool fits each layer of the problem.