Harry Wynn

Posted on Apr 19

Why don’t my database numbers match? I built a tool to find out.

#webdev #database #programming #dataengineering

If you’ve worked with databases, pipelines, or migrations long enough, you’ve hit this moment:

“Why don’t these numbers match?”

Production says:
10,291 orders

Warehouse says:
10,248 orders

Now you’re stuck figuring out why.

The usual process (aka pain)

Every time this happens, the workflow looks something like:

check row counts
write a LEFT JOIN
hunt for missing rows
compare columns
export CSVs when it gets too messy
repeat for way longer than you expected

You end up writing queries like:

SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
WHERE a.amount != b.amount;

And every time, it’s slightly different… but basically the same problem.

The real problem

It’s not just comparing data.

It’s:

comparing across different databases
comparing across environments
dealing with schema drift
figuring out if differences are random or systemic

Most tools either:

don’t handle this well, or
require way more setup than the problem deserves

So I built something focused

I wanted a tool where I could just run one command:

diffgnome compare --left mysql://prod/orders --right postgres://warehouse/orders --key id

…and immediately see:

Table: orders

Row count mismatch
prod:      10,291
warehouse: 10,248

Missing rows: 43
Changed rows: 88

Columns affected:
amount (17 rows)
status (9 rows)

No giant queries.

No exporting data.

No digging through joins just to get started.

What it actually does

At a high level, it:

compares schema differences
compares row counts
splits data into chunks
hashes each chunk
drills into mismatched ranges
shows sample row differences

So instead of scanning everything manually, it quickly narrows down where the data diverges.

Where this is useful

This comes up constantly in real systems:

validating ETL pipelines
verifying migrations
debugging production issues
comparing environments
reconciling warehouse data

If data moves, this problem eventually shows up.

Why I built it

I’ve spent a lot of time working with:

ETL pipelines
data warehouses
messy production databases

And I kept running into the same situation:

I know something is wrong — I just need to find where.

So I built a tool that answers that question directly.

Current status

I’ve been using this against real datasets while building it, and it’s already replaced a lot of one-off SQL and manual comparison work.

There’s still more I want to add, but the core workflow is solid.

If this sounds familiar

If you’ve ever:

compared prod vs staging
verified a migration
debugged a broken pipeline
tried to reconcile two systems

…then you’ve probably run into this exact problem.

👉 You can check it out here: https://diffgnome.com

DEV Community