Janne Sinivirta
Meet Daffy: A Lightweight Guardian for Your DataFrames

The Data Validation Dilemma

Most DataFrame breakages are boring: a column got renamed, a join introduced nulls, a dtype changed, or a value showed up that you didn't expect.

In notebooks, we do validate — just informally. We inspect .head(), run .info(), do a quick value_counts(), and add a couple of ad-hoc asserts when something looks suspicious. That’s often enough to move forward.
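
A typical cell might look like this (a sketch; the file and column names are made up):

import pandas as pd

df = pd.read_csv("products.csv")  # hypothetical input

df.head()                   # eyeball a few rows
df.info()                   # dtypes and null counts
df["Brand"].value_counts()  # spot unexpected categories

# ad-hoc asserts once something looks suspicious
assert "Price" in df.columns
assert (df["Price"] > 0).all()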

The problem is what happens when the notebook turns into “real code”. Those checks either:

  • stay behind in the exploration phase, or
  • get mixed into the transformation logic itself.

Either way, the assumptions become hard to see later:

  • What columns are required on input?
  • What constraints do we assume (non-null, ranges, allowed values)?
  • What does the function guarantee on output?

That’s the gap Daffy is trying to close: keep the transformation code clean, while making the DataFrame “contract” explicit at the function boundary.

What Daffy does

Daffy is a small library for validating pandas and Polars DataFrames at runtime using decorators.

You annotate a data-processing function with what you expect to receive (input) and what you promise to return (output). When the function runs, Daffy checks those expectations and raises a clear error if something doesn’t match — close to where the data is transformed, not several steps later.
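
In shape, that looks like this (a minimal sketch with placeholder column names):

from daffy import df_in, df_out

@df_in(columns=["a", "b"])         # what the function expects to receive
@df_out(columns=["a", "b", "c"])   # what it promises to return
def transform(df):
    ...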

Key features:

  • Column and type checks: Ensure required columns exist and have expected dtypes.
  • Value constraints: Enforce rules like non-null columns, unique keys, allowed categories, or numeric ranges.
  • Row-level validation: For cross-column business rules, validate rows with Pydantic models (see the sketch after this list).
  • Multiple backends: Works with pandas, Polars, Modin, and PyArrow tables.
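
To give a flavor of the row-level idea, here is a conceptual sketch using plain Pydantic. The model and field names are made up, and the loop shows only the concept; see Daffy's README for how to attach a model to the decorators.

import pandas as pd
from pydantic import BaseModel, model_validator

class ProductRow(BaseModel):
    Brand: str
    Price: float
    Discount: float

    @model_validator(mode="after")
    def discount_below_price(self) -> "ProductRow":
        # cross-column business rule
        if self.Discount >= self.Price:
            raise ValueError("Discount must be smaller than Price")
        return self

df = pd.DataFrame({"Brand": ["Acme"], "Price": [10.0], "Discount": [1.0]})

# validate each row against the model
for row in df.to_dict(orient="records"):
    ProductRow.model_validate(row)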

The main idea is separation of concerns:

  • Your function stays focused on the transformation.
  • The assumptions live next to it, but outside the transformation code.

Before and After: A Validation Story

To make this concrete, here’s a simple example: apply a discount to a products DataFrame.

Before Daffy – manual checks mixed into code:

def apply_discount(df):
    # Manual validation
    assert "Price" in df.columns, "Missing Price column!"
    assert "Brand" in df.columns, "Missing Brand column!"
    # We assume Price should be numeric;
    # you might add more checks here

    # Perform transformation
    df = df.copy()
    df["Discount"] = df["Price"] * 0.1
    return df

This works, but the validation and business logic are coupled. If this function grows (or gets copied), the checks tend to drift, get removed, or become inconsistent.

After Daffy – validation declared at the boundary:

from daffy import df_in, df_out

@df_in(columns=["Brand", "Price"])
@df_out(columns=["Brand", "Price", "Discount"])
def apply_discount(df):
    df = df.copy()
    df["Discount"] = df["Price"] * 0.1
    return df
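
Calling the decorated function is unchanged; the checks run at the boundary. A sketch with a made-up DataFrame:

import pandas as pd

products = pd.DataFrame({
    "Brand": ["Acme", "Globex"],
    "Price": [10.0, 20.0],
})

result = apply_discount(products)  # both decorators pass
print(result.columns.tolist())     # ['Brand', 'Price', 'Discount']

# A missing required column fails fast at the decorator,
# before the function body ever runs:
apply_discount(products.drop(columns=["Price"]))  # raises a validation error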

You can also add constraints without turning the function into a pile of checks:

@df_in(columns={
    "Price": {"checks": {"gt": 0}},
    "Brand": {"checks": {"notnull": True}},
})
def apply_discount(df):
    ...
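
When a check fails, the error surfaces at the boundary, pointing at the violated assumption instead of letting bad data flow through. A sketch (the failing input trips the df_in check before the function body runs; Daffy's docs name the exact exception type):

import pandas as pd

bad = pd.DataFrame({"Brand": ["Acme"], "Price": [-5.0]})

try:
    apply_discount(bad)   # Price violates the gt: 0 check
except Exception as err:  # catching broadly; see Daffy's docs for the exact type
    print(f"Validation failed: {err}")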

Why Choose Daffy over Pandera or Others?

You might ask: why not use Pandera or Great Expectations?

They’re both good tools — they’re just aimed at slightly different workflows:

  • Pandera is strong when you want a schema-first approach with rich validation, and you’re okay maintaining schemas/classes alongside the code.
  • Great Expectations shines at broader pipeline / warehouse-style data quality: expectation suites, reporting, and monitoring.

Daffy is intentionally narrower in scope. It’s for cases where you want lightweight checks right where you transform data, with minimal ceremony:

  • define the input/output expectations next to the function
  • keep the function body focused on transformations
  • fail early with an error that points to the violated assumption

Final Thoughts

If you’re mostly doing DataFrame work in Python (notebooks, scripts, ETL steps, small-to-medium pipelines) and your pain point is “assumptions are scattered and easy to forget”, Daffy is a practical middle ground.

It won’t replace heavier validation/monitoring frameworks for every scenario — and it shouldn’t try to. But if you want clearer function boundaries and faster feedback when your DataFrame shape changes, it fits nicely.

Try it out here: https://github.com/vertti/daffy
