Joins don’t discover relationships.
They assume relationships already exist.
Signal 1: Null Distribution
· Fields filled in almost every row → behave like identifiers
· Sparse fields → contextual attributes
Null patterns tell you more than column names.
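As a rough illustration of the null-distribution signal, here is a minimal sketch in plain Python. The function names, the 2% threshold, and the sample `orders` rows are assumptions made for the example, not part of any particular tool.

```python
def null_fraction(rows, field):
    """Share of rows where `field` is missing, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


def classify_by_nulls(rows, dense_threshold=0.02):
    """Label each field by how often it is filled (threshold is an assumption)."""
    fields = sorted({name for row in rows for name in row})
    labels = {}
    for field in fields:
        frac = null_fraction(rows, field)
        labels[field] = "identifier-like" if frac <= dense_threshold else "contextual"
    return labels


# Hypothetical sample data: ids are always present, coupon codes rarely are.
orders = [
    {"order_id": 1, "customer_id": 10, "coupon_code": None},
    {"order_id": 2, "customer_id": 11, "coupon_code": "SPRING"},
    {"order_id": 3, "customer_id": 10, "coupon_code": None},
    {"order_id": 4, "customer_id": 12, "coupon_code": ""},
]
labels = classify_by_nulls(orders)
```

Here `order_id` and `customer_id` come out as identifier-like and `coupon_code` as contextual, without looking at the column names at all.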
Signal 2: Cardinality (and Why It Saves You from Bugs)
· High cardinality → identifiers
· Low cardinality → states / enums
Skip this check and two low-cardinality columns can match on every value, which is how people accidentally join unrelated status fields.
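The status-field trap can be shown with a small sketch. The helper names, the 0.9 cutoff, and the sample columns below are hypothetical, chosen only to make the point concrete.

```python
def cardinality_ratio(values):
    """Distinct non-null values divided by non-null count (1.0 means all unique)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return len(set(present)) / len(present)


def looks_like_key(values, min_ratio=0.9):
    """Treat a column as a join-key candidate only if it is nearly unique."""
    return cardinality_ratio(values) >= min_ratio


# Two unrelated status columns that happen to share the same vocabulary.
order_status = ["open", "closed", "open", "open", "closed", "open"]
ticket_status = ["closed", "open", "closed", "closed", "open", "open"]
order_ids = [101, 102, 103, 104, 105, 106]

# The value sets are identical, so a naive value-overlap check would pair them.
overlap = set(order_status) == set(ticket_status)
```

The overlap is perfect, yet neither status column passes `looks_like_key`, so a cardinality check would rule the pair out before any join is attempted.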
Signal 3: Inclusion Beats Equality
· Real systems are asymmetric
· One table is usually upstream
· Another is delayed / filtered
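To see why inclusion is a better test than equality, compare a directional containment score with a symmetric one. This is a sketch under assumed data: a `customers` table upstream and an `orders` table that only references customers who have ordered.

```python
def containment(child, parent):
    """Fraction of distinct child values that also appear in parent."""
    child_set, parent_set = set(child), set(parent)
    if not child_set:
        return 0.0
    return len(child_set & parent_set) / len(child_set)


def jaccard(a, b):
    """Symmetric overlap: intersection size over union size."""
    a_set, b_set = set(a), set(b)
    union = a_set | b_set
    return len(a_set & b_set) / len(union) if union else 0.0


customer_ids = list(range(1, 101))         # upstream: 100 customers
order_customer_ids = [3, 7, 7, 42, 99, 3]  # downstream: only customers who ordered

forward = containment(order_customer_ids, customer_ids)   # orders -> customers
backward = containment(customer_ids, order_customer_ids)  # customers -> orders
symmetric = jaccard(order_customer_ids, customer_ids)
```

Every order points at a real customer, so the forward score is 1.0, while the backward and symmetric scores are both 0.04. An equality-style measure would reject a relationship that is actually clean; the asymmetry itself also hints at which table is upstream.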
Why Brute-Force Comparison Doesn’t Scale
· Millions of values
· Exponential field growth
· Compute explosion
What scales instead:
· Feature extraction
· Sampling
· Staged comparison
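Feature extraction, sampling, and staged comparison can be combined into one pipeline. The sketch below is one possible arrangement, not a reference implementation: the `discover` function, its thresholds, and the sample fields are all assumptions. Cheap per-field features prune most pairs, a small sample prunes more, and only the survivors get a full inclusion check.

```python
import random


def profile(values):
    """Cheap per-field features, computed once per field rather than per pair."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    return {
        "type": type(present[0]).__name__ if present else "empty",
        "distinct": distinct,
        "unique_ratio": len(distinct) / len(present) if present else 0.0,
    }


def hit_rate(child_values, parent_set):
    """Fraction of child values found in the parent's distinct set."""
    if not child_values:
        return 0.0
    return sum(1 for v in child_values if v in parent_set) / len(child_values)


def discover(fields, min_unique_ratio=0.5, sample_size=50, min_hit=0.9, seed=0):
    rng = random.Random(seed)
    profiles = {name: profile(values) for name, values in fields.items()}
    found = []
    for child, cp in profiles.items():
        for parent, pp in profiles.items():
            if child == parent:
                continue
            # Stage 1: features only. Parent must share the type and look key-like.
            if cp["type"] != pp["type"] or pp["unique_ratio"] < min_unique_ratio:
                continue
            # A child with more distinct values cannot be fully included.
            if len(cp["distinct"]) > len(pp["distinct"]):
                continue
            # Stage 2: test a small sample of the child's distinct values.
            pool = sorted(cp["distinct"], key=repr)
            sample = rng.sample(pool, min(sample_size, len(pool)))
            if hit_rate(sample, pp["distinct"]) < min_hit:
                continue
            # Stage 3: full inclusion check, only for pairs that survived.
            found.append((child, parent, hit_rate(pool, pp["distinct"])))
    return found


# Hypothetical fields from three tables.
fields = {
    "customers.id": list(range(1, 1001)),
    "orders.customer_id": [(i * 7) % 1000 + 1 for i in range(300)],
    "orders.status": ["open", "closed"] * 150,
    "tickets.status": ["closed", "open"] * 100,
}
found = discover(fields)
```

Of the twelve ordered pairs, eleven are rejected in stage 1 on type, uniqueness, or distinct count alone. Only `orders.customer_id` → `customers.id` reaches the full comparison, and the two status fields never get compared value by value.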
Practical Checklist
In the final post, I’ll argue why relationship discovery shouldn’t be treated as analysis work at all — but as infrastructure.

