Abinesh N

Stop adding print statements to debug your data pipeline — use watcher instead

I built a Python decorator that watches your DataFrame pipelines automatically

You know this moment:

Input rows  : 1,000,000
Output rows :   263,979

Somewhere in your pipeline, 736k rows disappeared.

Which step caused it?

A bad merge?
A silent dropna()?
A duplicate join key?
A dtype issue?
A filter you forgot existed?

So you start adding:

print(df.shape)
print(df.columns)
print(df.isnull().sum())

…everywhere.

Then rerun the entire pipeline again.

That frustration is why I built dfwatcher.


What is dfwatcher?

dfwatcher is a lightweight decorator for pandas pipelines that automatically tracks:

  • row count changes
  • null deltas
  • schema drift
  • dtype changes
  • join explosions
  • memory usage
  • pipeline summaries

with zero config.

Just decorate your functions.

from watcher import watch

@watch
def clean(df):
    return df.dropna()

That’s it.
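
If you're curious how a decorator can see both sides of a step, here is a minimal sketch of the general idea. It is not dfwatcher's implementation: `watch_rows` is a hypothetical name, and it only compares row counts before and after the wrapped function runs.

```python
import functools

import pandas as pd


def watch_rows(func):
    """Hypothetical stand-in for a watcher-style decorator (not dfwatcher's code)."""

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        rows_before = len(df)
        result = func(df, *args, **kwargs)
        rows_after = len(result)
        delta = rows_after - rows_before
        # Report the step name and the row count on each side of the call.
        print(f"{func.__name__}()  {rows_before:,} -> {rows_after:,}  ({delta:+,} rows)")
        return result

    return wrapper


@watch_rows
def clean(df):
    return df.dropna()


raw = pd.DataFrame({"customer_id": [1, 2, 3, 4], "status": ["a", None, "c", None]})
cleaned = clean(raw)
```

Because the wrapper holds both the input and the result, every other check (nulls, dtypes, memory) can be computed from the same pair of frames.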


Example

@watch
def merge_orders(df):
    return df.merge(orders, on="customer_id", how="left")

Output:

merge_orders()  964,203 → 1,069,104  ▲ +104,901 rows (+10.9%) ⚠

  columns added : +tier

  💥 join explosion · duplication ratio 10.9%

  key column     top value    repeat count
  customer_id    9182               184

Instead of just telling you rows increased…

…it tells you why.
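
To reproduce this kind of diagnosis by hand in plain pandas, you can count repeated keys on the right-hand table and let `merge` validate the relationship. A small sketch, assuming toy `customers` and `orders` frames invented for illustration (how dfwatcher computes its own report may differ):

```python
import pandas as pd

# Toy data, made up for this example.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 2, 3],
        "tier": ["gold", "silver", "silver", "basic", "gold"],
    }
)

# Count how often each join key repeats on the right-hand side.
key_counts = orders["customer_id"].value_counts()
repeated_keys = key_counts[key_counts > 1]

# A left join emits one output row per matching right-hand row.
merged = customers.merge(orders, on="customer_id", how="left")
extra_rows = len(merged) - len(customers)

# pandas can also refuse the merge when the right-hand keys are not unique.
try:
    customers.merge(orders, on="customer_id", how="left", validate="many_to_one")
    merge_is_safe = True
except pd.errors.MergeError:
    merge_is_safe = False
```

Here `repeated_keys` points at the key that multiplies rows, and `validate="many_to_one"` turns the same condition into an exception.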


Why I built it

Most pipeline bugs are not syntax bugs.

They’re data drift bugs.

The code runs successfully.
The tests pass.
The pipeline completes.

But the data quietly changes shape somewhere in the middle.

Those are the hardest bugs to debug because:

  • they’re silent
  • they propagate downstream
  • and they’re usually discovered hours later

I wanted something that behaves like:

“git diff for DataFrames”

but generated automatically while the pipeline runs.


Features

Row tracking

clean()  1,000,000 → 964,203  ▼ -35,797 rows

Null tracking

nulls -35,797  status  (35,797 → 0)
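
A null delta like this can be computed by comparing per-column null counts before and after a step. A rough sketch, where `null_delta` is a hypothetical helper and not part of dfwatcher's API:

```python
import pandas as pd


def null_delta(before, after):
    """Per-column change in null count, for columns present in both frames."""
    shared = before.columns.intersection(after.columns)
    delta = after[shared].isna().sum() - before[shared].isna().sum()
    # Keep only the columns whose null count actually changed.
    return delta[delta != 0]


before = pd.DataFrame({"status": ["a", None, None], "amount": [1.0, 2.0, 3.0]})
after = before.dropna()
changes = null_delta(before, after)
```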

Schema drift detection

columns added : +revenue_band

Dtype change detection

dtype change : customer_id  int64 → object
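
Schema drift and dtype changes both fall out of comparing the columns and dtypes of two frames. A minimal sketch, with `schema_diff` as a hypothetical helper name and toy data invented for the example:

```python
import pandas as pd


def schema_diff(before, after):
    """Return (added columns, removed columns, {column: (old dtype, new dtype)})."""
    added = [c for c in after.columns if c not in before.columns]
    removed = [c for c in before.columns if c not in after.columns]
    retyped = {
        c: (str(before[c].dtype), str(after[c].dtype))
        for c in before.columns
        if c in after.columns and before[c].dtype != after[c].dtype
    }
    return added, removed, retyped


before = pd.DataFrame(
    {
        "customer_id": pd.Series([1, 2], dtype="int64"),
        "amount": [10.0, 20.0],
    }
)
after = before.assign(
    customer_id=before["customer_id"].astype(object),  # int64 -> object
    revenue_band=["low", "high"],
).drop(columns="amount")

added, removed, retyped = schema_diff(before, after)
```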

Join explosion detection

💥 join explosion

Threshold guards

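@watch(
    warn_on_loss=0.05,   # assumed: fraction of input rows lost (5%)
    raise_on_loss=0.20,  # assumed: fraction of input rows lost (20%)
)
def clean(df):
    return df.dropna()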

Turn silent data corruption into CI failures.
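
The guard idea itself is simple enough to sketch. The version below is an illustration under stated assumptions, not dfwatcher's code: `guard_loss` and `RowLossError` are hypothetical names, thresholds are treated as fractions of input rows, and a loss equal to the threshold counts as a breach.

```python
import functools
import warnings

import pandas as pd


class RowLossError(RuntimeError):
    """Hypothetical exception type; dfwatcher may raise something else."""


def guard_loss(warn_on_loss=0.05, raise_on_loss=0.20):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            result = func(df, *args, **kwargs)
            rows_before = len(df)
            # Fraction of input rows that did not survive the step.
            loss = (rows_before - len(result)) / rows_before if rows_before else 0.0
            if loss >= raise_on_loss:
                raise RowLossError(f"{func.__name__}() lost {loss:.1%} of rows")
            if loss >= warn_on_loss:
                warnings.warn(f"{func.__name__}() lost {loss:.1%} of rows")
            return result

        return wrapper

    return decorator


@guard_loss(warn_on_loss=0.05, raise_on_loss=0.20)
def clean(df):
    return df.dropna()


frame = pd.DataFrame({"status": ["a", "b", "c", None]})
try:
    clean(frame)
    outcome = "passed"
except RowLossError as exc:
    outcome = str(exc)
```

In CI, an uncaught exception like this fails the job, which is what makes the data problem visible.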


Session summaries

with session("nightly ETL"):
    df = clean(df)
    df = merge(df)
    df = score(df)

At the end you get a full pipeline summary automatically.
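
For intuition, a session-style summary can be built from a context manager that collects one record per step and prints them on exit. This is a simplified sketch, not how dfwatcher's `session` is implemented; `record`, `summary`, and the module-level `_steps` list are hypothetical names.

```python
import functools
from contextlib import contextmanager

import pandas as pd

_steps = []  # one (step name, rows before, rows after) tuple per decorated call


def record(func):
    @functools.wraps(func)
    def wrapper(df):
        result = func(df)
        _steps.append((func.__name__, len(df), len(result)))
        return result

    return wrapper


@contextmanager
def summary(name):
    _steps.clear()
    try:
        yield _steps
    finally:
        # Print the collected steps even if the pipeline raised midway.
        print(f"== {name} ==")
        for step, before, after in _steps:
            print(f"{step:<12} {before:>8,} -> {after:>8,}")


@record
def clean(df):
    return df.dropna()


@record
def dedupe(df):
    return df.drop_duplicates()


df = pd.DataFrame({"customer_id": [1, 1, None]})
with summary("nightly ETL") as steps:
    df = clean(df)
    df = dedupe(df)
```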


What surprised me while building it

The hardest part wasn’t row tracking.

It was making the output useful without becoming noisy.

A debugging tool that prints too much becomes another thing developers ignore.

So I focused heavily on:

  • readable terminal formatting
  • meaningful warnings
  • showing only the most important changes
  • zero-config defaults

The goal was:

install → decorate → immediately useful


Roadmap

Currently:

  • pandas support
  • memory tracking
  • custom handlers
  • CI-friendly summaries

Planned:

  • Polars backend
  • DuckDB backend
  • HTML / notebook renderer
  • structured JSON logging
  • global config system

Install

pip install dfwatcher

GitHub:
https://github.com/Abineshabee/watcher

PyPI:
https://pypi.org/project/dfwatcher/


I’d genuinely love feedback from data engineers, ML engineers, analytics engineers, and pandas users.

Especially:

  • features you wish pipeline tools had
  • debugging pain points
  • weird merge bugs you’ve experienced
  • ideas for Polars / DuckDB support
