I built a Python decorator that watches your DataFrame pipelines automatically
You know this moment:
Input rows : 1,000,000
Output rows : 263,979
Somewhere in your pipeline, 736k rows disappeared.
Which step caused it?
A bad merge?
A silent dropna()?
A duplicate join key?
A dtype issue?
A filter you forgot existed?
So you start adding:
print(df.shape)
print(df.columns)
print(df.isnull().sum())
…everywhere.
Then rerun the entire pipeline again.
That frustration is why I built dfwatcher.
What is dfwatcher?
dfwatcher is a lightweight decorator for pandas pipelines that automatically tracks:
- row count changes
- null deltas
- schema drift
- dtype changes
- join explosions
- memory usage
- pipeline summaries
with zero config.
Just decorate your functions.
from watcher import watch

@watch
def clean(df):
    return df.dropna()
That’s it.
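For a rough sense of what a decorator like this does under the hood, here is a minimal sketch of row tracking alone. It is not dfwatcher's actual implementation: `watch_rows` is a hypothetical name, and the output format is only an approximation of the real tool's.

```python
import functools

import pandas as pd


def watch_rows(func):
    """Hypothetical stand-in for a row-tracking decorator (not dfwatcher's code)."""

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        rows_in = len(df)
        result = func(df, *args, **kwargs)
        rows_out = len(result)
        delta = rows_out - rows_in
        # Guard against division by zero on an empty input frame.
        pct = (delta / rows_in * 100) if rows_in else 0.0
        print(f"{func.__name__}() {rows_in:,} → {rows_out:,} ({delta:+,} rows, {pct:+.1f}%)")
        return result

    return wrapper


@watch_rows
def clean(df):
    return df.dropna()


df = pd.DataFrame({"customer_id": [1, 2, 3, 4], "amount": [10.0, None, 30.0, None]})
cleaned = clean(df)
```

The wrapper only measures the frame before and after the call, so the decorated function itself stays unchanged.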
Example
@watch
def merge_orders(df):
    # orders is another DataFrame defined elsewhere
    return df.merge(orders, on="customer_id", how="left")
Output:
merge_orders()  964,203 → 1,069,104  ▲ +104,901 rows (+10.9%) ⚠
  columns added : +tier
  💥 join explosion · duplication ratio 10.9%
  key column    top value    repeat count
  customer_id   9182         184
Instead of just telling you rows increased…
…it tells you why.
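The "why" behind a join explosion is usually a key that repeats on the right-hand side of the merge. The sketch below shows one way to surface that with plain pandas. `explain_join_growth` is a hypothetical helper written for illustration, and the tiny `customers` and `orders` frames are made-up data; how dfwatcher computes its duplication ratio internally is not shown here.

```python
import pandas as pd


def explain_join_growth(left, right, key):
    """Merge left with right and report which right-side keys repeat.

    Hypothetical helper for illustration; not part of dfwatcher.
    """
    merged = left.merge(right, on=key, how="left")
    # Fractional row growth caused by the merge (0.10 means +10% rows).
    growth = (len(merged) - len(left)) / len(left) if len(left) else 0.0
    # Keys appearing more than once on the right multiply matching left rows.
    counts = right[key].value_counts()
    repeated = counts[counts > 1]
    return merged, growth, repeated


customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"customer_id": [1, 1, 1, 2], "tier": ["a", "b", "c", "a"]})

merged, growth, repeated = explain_join_growth(customers, orders, "customer_id")
print(f"rows: {len(customers)} → {len(merged)} ({growth:+.1%})")
print(repeated)
```

If the join is supposed to be many-to-one, pandas can also enforce that directly: `merge(..., validate="many_to_one")` raises when the right-hand keys are not unique.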
Why I built it
Most pipeline bugs are not syntax bugs.
They’re data drift bugs.
The code runs successfully.
The tests pass.
The pipeline completes.
But the data quietly changes shape somewhere in the middle.
Those are the hardest bugs to debug because:
- they’re silent
- they propagate downstream
- and they’re usually discovered hours later
I wanted something that behaves like:
“git diff for DataFrames”
but running automatically during execution.
Features
Row tracking
clean() 1,000,000 → 964,203 ▼ -35,797 rows
Null tracking
nulls -35,797 status (35,797 → 0)
Schema drift detection
columns added : +revenue_band
Dtype change detection
dtype change : customer_id int64 → object
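The null, schema, and dtype checks above all come down to comparing the frame before and after a step. A minimal sketch of that comparison, assuming nothing about dfwatcher's internals (`diff_frames` and the sample data are hypothetical):

```python
import pandas as pd


def diff_frames(before, after):
    """Compare two DataFrames by columns, dtypes, and total null count."""
    added = [c for c in after.columns if c not in before.columns]
    removed = [c for c in before.columns if c not in after.columns]
    shared = [c for c in before.columns if c in after.columns]
    # Compare dtype names as strings so numpy and extension dtypes mix safely.
    dtype_changes = {
        c: (str(before[c].dtype), str(after[c].dtype))
        for c in shared
        if str(before[c].dtype) != str(after[c].dtype)
    }
    null_delta = int(after.isna().sum().sum() - before.isna().sum().sum())
    return {
        "added": added,
        "removed": removed,
        "dtype_changes": dtype_changes,
        "null_delta": null_delta,
    }


before = pd.DataFrame({"customer_id": [1, 2, 3], "revenue": [10.0, None, 30.0]})
after = before.dropna().assign(
    customer_id=lambda d: d["customer_id"].astype(str),  # int64 becomes a string dtype
    revenue_band="low",
)
report = diff_frames(before, after)
print(report)
```

The exact name of the string dtype depends on the pandas version, so the sketch reports whatever pandas returns instead of hard-coding `object`.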
Join explosion detection
💥 join explosion
Threshold guards
@watch(
    warn_on_loss=0.05,
    raise_on_loss=0.20
)
def clean(df):
    return df.dropna()
Turn silent data corruption into CI failures.
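To illustrate the idea of a threshold guard, here is a small sketch that warns or raises based on the fraction of rows a step drops. The parameter names mirror the ones above, but everything else is an assumption: `guard_loss` and `RowLossError` are hypothetical names, and treating the thresholds as inclusive fractions (`>=`) is a guess, not dfwatcher's documented behavior.

```python
import functools
import warnings

import pandas as pd


class RowLossError(RuntimeError):
    """Hypothetical exception type for this sketch."""


def guard_loss(warn_on_loss=None, raise_on_loss=None):
    """Decorator factory: warn or raise when a step drops too many rows."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            result = func(df, *args, **kwargs)
            rows_in = len(df)
            # Fraction of input rows lost; negative if the step added rows.
            loss = (rows_in - len(result)) / rows_in if rows_in else 0.0
            message = f"{func.__name__}() lost {loss:.1%} of rows"
            if raise_on_loss is not None and loss >= raise_on_loss:
                raise RowLossError(message)
            if warn_on_loss is not None and loss >= warn_on_loss:
                warnings.warn(message)
            return result

        return wrapper

    return decorator


@guard_loss(warn_on_loss=0.05, raise_on_loss=0.20)
def clean(df):
    return df.dropna()


# 3 of 10 rows are null, so dropna() loses 30% and crosses the raise threshold.
df = pd.DataFrame({"amount": [1.0, 2.0, None, None, None, 6.0, 7.0, 8.0, 9.0, 10.0]})
try:
    clean(df)
    outcome = "passed"
except RowLossError as exc:
    outcome = f"failed: {exc}"
print(outcome)
```

In CI, an uncaught exception like this fails the job, which is what makes the data loss visible instead of silent.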
Session summaries
with session("nightly ETL"):
    df = clean(df)
    df = merge(df)
    df = score(df)
At the end you get a full pipeline summary automatically.
What surprised me while building it
The hardest part wasn’t row tracking.
It was making the output useful without becoming noisy.
A debugging tool that prints too much becomes another thing developers ignore.
So I focused heavily on:
- readable terminal formatting
- meaningful warnings
- showing only the most important changes
- zero-config defaults
The goal was:
install → decorate → immediately useful
Roadmap
Currently:
- pandas support
- memory tracking
- custom handlers
- CI-friendly summaries
Planned:
- Polars backend
- DuckDB backend
- HTML / notebook renderer
- structured JSON logging
- global config system
Install
pip install dfwatcher
GitHub:
https://github.com/Abineshabee/watcher
PyPI:
https://pypi.org/project/dfwatcher/
I’d genuinely love feedback from data engineers, ML engineers, analytics engineers, and pandas users.
Especially:
- features you wish pipeline tools had
- debugging pain points
- weird merge bugs you’ve experienced
- ideas for Polars / DuckDB support
