Abinesh N

Posted on May 20

Stop adding print statements to debug your data pipeline — use watcher instead

I built a Python decorator that watches your DataFrame pipelines automatically

You know this moment:

Input rows  : 1,000,000
Output rows :   263,979

Somewhere in your pipeline, 736k rows disappeared.

Which step caused it?

A bad merge?
A silent dropna()?
A duplicate join key?
A dtype issue?
A filter you forgot existed?

So you start adding:

print(df.shape)
print(df.columns)
print(df.isnull().sum())

…everywhere.

Then rerun the entire pipeline again.

That frustration is why I built dfwatcher.

What is dfwatcher?

dfwatcher is a lightweight decorator for pandas pipelines that automatically tracks:

row count changes
null deltas
schema drift
dtype changes
join explosions
memory usage
pipeline summaries

with zero config.

Just decorate your functions.

from watcher import watch

@watch
def clean(df):
    return df.dropna()

That’s it.
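If you're curious how a decorator like this can work, here is a minimal sketch. It is not dfwatcher's source code: `watch_rows` and its output format are hypothetical, and it only tracks row counts.

```python
import functools

import pandas as pd


def watch_rows(func):
    """Hypothetical sketch: report the row count before and after a step."""

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        rows_in = len(df)
        result = func(df, *args, **kwargs)
        rows_out = len(result)
        delta = rows_out - rows_in
        print(f"{func.__name__}()  {rows_in:,} → {rows_out:,}  ({delta:+,} rows)")
        return result

    return wrapper


@watch_rows
def clean(df):
    return df.dropna()


# Made-up data: two of the four rows contain a null.
raw = pd.DataFrame({"customer_id": [1, 2, 3, 4], "amount": [10.0, None, 7.5, None]})
cleaned = clean(raw)
```

The wrapper never touches the function body, which is why the pipeline code itself stays free of print statements.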

Example

@watch
def merge_orders(df):
    return df.merge(orders, on="customer_id", how="left")

Output:

merge_orders()  964,203 → 1,069,104  ▲ +104,901 rows (+10.9%) ⚠

  columns added : +tier

  💥 join explosion · duplication ratio 10.9%

  key column     top value    repeat count
  customer_id    9182               184

Instead of just telling you rows increased…

…it tells you why.
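The same diagnosis can be reproduced by hand with plain pandas. The sketch below uses small made-up `customers` and `orders` frames (not data from dfwatcher) to show how a repeated join key inflates a left merge, and how the `validate` argument of `merge` turns that into an error.

```python
import pandas as pd

# Made-up data: customer 9182 appears twice on the right side of the join.
customers = pd.DataFrame({"customer_id": [9182, 1001, 1002]})
orders = pd.DataFrame(
    {"customer_id": [9182, 9182, 1001], "tier": ["gold", "silver", "basic"]}
)

merged = customers.merge(orders, on="customer_id", how="left")
extra_rows = len(merged) - len(customers)

# Count how often each key repeats on the right side; any count above 1
# multiplies the matching left rows.
key_counts = orders["customer_id"].value_counts()
repeated_keys = key_counts[key_counts > 1]

# validate="many_to_one" asks pandas to check that right-side keys are unique.
try:
    customers.merge(orders, on="customer_id", how="left", validate="many_to_one")
    explosion_caught = False
except pd.errors.MergeError:
    explosion_caught = True
```

`validate` is useful when you already know which merge is risky; a watcher-style report helps when you don't.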

Why I built it

Most pipeline bugs are not syntax bugs.

They’re data drift bugs.

The code runs successfully.
The tests pass.
The pipeline completes.

But the data quietly changes shape somewhere in the middle.

Those are the hardest bugs to debug because:

they’re silent
they propagate downstream
and they’re usually discovered hours later

I wanted something that behaves like:

“git diff for DataFrames”

but automatically during execution.

Features

Row tracking

clean()  1,000,000 → 964,203  ▼ -35,797 rows

Null tracking

nulls -35,797  status  (35,797 → 0)
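A null delta like this is just the per-column null count after a step minus the count before it. Here is a rough sketch with made-up data; `null_delta` is a hypothetical helper, not a dfwatcher function.

```python
import pandas as pd


def null_delta(before, after):
    """Per-column change in null count, for columns present in both frames."""
    shared = [col for col in before.columns if col in after.columns]
    delta = after[shared].isnull().sum() - before[shared].isnull().sum()
    # Keep only the columns whose null count actually moved.
    return delta[delta != 0]


before = pd.DataFrame({"status": ["ok", None, None], "amount": [1.0, 2.0, 3.0]})
after = before.dropna()

changes = null_delta(before, after)
```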

Schema drift detection

columns added : +revenue_band

Dtype change detection

dtype change : customer_id  int64 → object
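Schema drift and dtype changes both come down to comparing column names and dtypes before and after a step. A minimal sketch, assuming a hypothetical `schema_diff` helper (not part of dfwatcher) and made-up data:

```python
import pandas as pd


def schema_diff(before, after):
    """Compare column names and dtypes of two DataFrames."""
    added = [col for col in after.columns if col not in before.columns]
    removed = [col for col in before.columns if col not in after.columns]
    changed = {
        col: (str(before[col].dtype), str(after[col].dtype))
        for col in before.columns
        if col in after.columns and before[col].dtype != after[col].dtype
    }
    return {"added": added, "removed": removed, "dtype_changed": changed}


before = pd.DataFrame({"customer_id": [1, 2, 3], "revenue": [10.0, 250.0, 40.0]})
after = before.assign(
    customer_id=before["customer_id"].astype(object),  # simulate a silent cast
    revenue_band=["low", "high", "low"],
)

diff = schema_diff(before, after)
```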

Join explosion detection

💥 join explosion

Threshold guards

@watch(
    warn_on_loss=0.05,
    raise_on_loss=0.20
)
def clean(df):
    return df.dropna()

Turn silent data corruption into CI failures.
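To show the idea behind a threshold guard, here is a sketch that treats both thresholds as fractions of input rows lost. The names `guard_loss` and `RowLossError`, and the choice of `>=` for the comparison, are assumptions made for illustration; dfwatcher's own behavior may differ.

```python
import functools
import warnings

import pandas as pd


class RowLossError(Exception):
    """Hypothetical exception for a step that dropped too many rows."""


def guard_loss(warn_on_loss=None, raise_on_loss=None):
    """Hypothetical decorator factory; thresholds are fractions of input rows."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            result = func(df, *args, **kwargs)
            rows_in = len(df)
            loss = (rows_in - len(result)) / rows_in if rows_in else 0.0
            # Check the stricter threshold first so a large loss always raises.
            if raise_on_loss is not None and loss >= raise_on_loss:
                raise RowLossError(f"{func.__name__}() lost {loss:.1%} of rows")
            if warn_on_loss is not None and loss >= warn_on_loss:
                warnings.warn(f"{func.__name__}() lost {loss:.1%} of rows")
            return result

        return wrapper

    return decorator


@guard_loss(warn_on_loss=0.05, raise_on_loss=0.20)
def clean(df):
    return df.dropna()


# Made-up data: three of four rows are null, well past the 20% limit.
mostly_null = pd.DataFrame({"amount": [1.0, None, None, None]})
try:
    clean(mostly_null)
    failed = False
except RowLossError:
    failed = True
```

Because the guard raises an ordinary exception, any test runner or CI job that calls the step fails on its own, with no extra wiring.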

Session summaries

with session("nightly ETL"):
    df = clean(df)
    df = merge(df)
    df = score(df)

At the end you get a full pipeline summary automatically.
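A session can be pictured as a context manager that collects one record per decorated step and prints them on exit. The sketch below is illustrative only: `tracked`, `pipeline_session`, and the module-level `_records` list are hypothetical and say nothing about how dfwatcher stores its state.

```python
import contextlib
import functools

import pandas as pd

_records = []  # (step name, rows in, rows out) for the active session


def tracked(func):
    """Hypothetical decorator that records row counts for each call."""

    @functools.wraps(func)
    def wrapper(df):
        result = func(df)
        _records.append((func.__name__, len(df), len(result)))
        return result

    return wrapper


@contextlib.contextmanager
def pipeline_session(name):
    """Hypothetical context manager that prints a summary on exit."""
    _records.clear()
    try:
        yield _records
    finally:
        print(f"session: {name}")
        for step, rows_in, rows_out in _records:
            print(f"  {step:<12} {rows_in:>6,} → {rows_out:>6,}")


@tracked
def clean(df):
    return df.dropna()


@tracked
def dedupe(df):
    return df.drop_duplicates()


# Made-up data: one row with a null id, and one exact duplicate.
df = pd.DataFrame({"id": [1, 1, 2, None], "amount": [5.0, 5.0, 7.0, 9.0]})
with pipeline_session("nightly ETL") as steps:
    df = dedupe(clean(df))
    summary = list(steps)
```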

What surprised me while building it

The hardest part wasn’t row tracking.

It was making the output useful without becoming noisy.

A debugging tool that prints too much becomes another thing developers ignore.

So I focused heavily on:

readable terminal formatting
meaningful warnings
showing only the most important changes
zero-config defaults

The goal was:

install → decorate → immediately useful

Roadmap

Currently supported:

pandas support
memory tracking
custom handlers
CI-friendly summaries

Planned:

Polars backend
DuckDB backend
HTML / notebook renderer
structured JSON logging
global config system

Install

pip install dfwatcher

GitHub:
https://github.com/Abineshabee/watcher

PyPI:
https://pypi.org/project/dfwatcher/

I’d genuinely love feedback from data engineers, ML engineers, analytics engineers, and pandas users.

Especially:

features you wish pipeline tools had
debugging pain points
weird merge bugs you’ve experienced
ideas for Polars / DuckDB support

DEV Community