Alex Aslam

Posted on Mar 25

The Art of the Primary Key: Surrogate (Auto-increment) vs. Natural Keys

#webdev #programming #productivity #beginners

I still remember the exact moment I learned that primary keys are not just technical constraints, they are philosophical statements about how you view your data.

We were migrating a legacy e‑commerce system. The original developers bless their pragmatic hearts had used the product SKU as the primary key for the products table. It was a natural key: SKU-1234-AB. It was human‑readable, unique, and meaningful. Queries felt intuitive. Joins were straightforward.

Then the marketing team decided to rebrand.

Suddenly, every SKU had to change. Not just the prefix the entire product identifier logic. We were looking at updating millions of rows, cascading to hundreds of foreign key relationships. The database groaned. The application broke. And the team spent a week of sleepless nights untangling the mess.

That’s when I understood: choosing a primary key is choosing your data’s identity system. Get it wrong, and you’re not just renaming a column, you’re rewriting history.

The Two Schools of Thought

If you’ve been in this industry long enough, you’ve witnessed the Holy War. On one side, the Surrogate Camp: “Always use an auto‑incrementing integer (or UUID). It’s simple, immutable, and performant.” On the other side, the Natural Camp: “Keys should have meaning. Why introduce an artificial ID when the data already has a unique, stable identifier?”

Both are right. Both are wrong. It depends entirely on what you’re building, who will use it, and how long it needs to live.

Natural Keys: The Temptation of Meaning

Natural keys feel clean. They emerge from the domain itself, a user’s email, an ISBN, a government ID, a product code. They are self‑documenting. When you see WHERE user_id = 'alice@example.com', you instantly know what’s happening. There’s no need to join to another table just to find out who user_id = 1423 is.

I’ve used natural keys successfully in certain contexts. When I built a small internal tool for tracking employee training, the employee’s HR‑issued ID was perfect. It was stable, guaranteed unique, and nobody was going to rename it. Queries were simple, and the database footprint was smaller because we didn’t have an extra synthetic column.

But natural keys have a hidden cost: they assume the real world is stable.

The real world is not stable. People change emails. Companies rebrand product codes. Governments reassign IDs. And when a natural key changes, the cost is astronomical, unless you’ve built your entire system to handle cascading updates (which most ORMs and application layers are terrible at).

I once consulted for a startup that used the user’s email as the primary key for everything. When they introduced team accounts and users started using aliases, they realized they couldn’t change an email without breaking every associated record. They ended up building a migration that added a surrogate key and left the old email as a unique constraint but the damage was done. The schema was fragile, and the code was littered with workarounds.

Surrogate Keys: The Comfort of Abstraction

Surrogate keys auto‑increment integers, UUIDs, Snowflake IDs are the safe choice. They have no meaning outside the database. They are immutable by design. When a user changes their email, you update a single column in the users table, and the foreign keys stay perfectly intact.

I’ve leaned heavily on surrogate keys in most systems I’ve built over the past decade. They make migrations safer, they keep foreign key relationships simple, and they decouple your internal identity from external business logic.

But surrogate keys are not a free lunch.

First, they can hide data quality issues. If your natural key (e.g., email) has duplicates, a surrogate key will let those duplicates exist without complaint, because the uniqueness is only on the artificial ID. You end up needing to add unique constraints anyway, and then you’re back to managing natural key constraints.

Second, they can make debugging a nightmare. When a user calls support saying “my order is wrong,” and your logs show user_id = 847291 and order_id = 39284, you’re constantly looking up that ID in another system to figure out who it is. In systems with natural keys, the logs are often self‑contained.

Third, auto‑increment integers leak information. If you expose them in URLs, competitors can guess your growth rate. They also cause contention in distributed systems which is why UUIDs and distributed ID generators exist.

The Art of Choosing: A Strategic Framework

So how do you decide? Let me share the framework I’ve developed after 15 years of making and fixing these decisions.

1. Immutability Is King

Ask yourself: Can this value ever change in the real world?

If the answer is “never” (e.g., a government‑issued permanent identifier, a Git commit SHA), a natural key might be appropriate. But be absolutely certain. I’ve seen “never” become “well, maybe once” become “every six months.”

If there’s any chance of change, use a surrogate. The cost of updating a natural key cascade is rarely worth the convenience.

2. Context Matters

A primary key is not a universal concept. You can mix strategies in the same database.

For core domain entities (users, orders, products) that have long lives and many relationships: surrogate keys. Protect your future self.
For lookup tables (country codes, status types) where the natural value is stable and meaningful: natural keys. Using 'US' as the primary key for countries is perfect, it’s self‑describing and never changes.
For join tables (many‑to‑many relationships): surrogate keys are often overkill. A composite key of the two foreign keys is natural, ensures uniqueness, and avoids an extra index.

3. The UUID Trade‑off

When you need surrogate keys in distributed systems, UUIDs are the default. But they come with costs: they are 16 bytes (vs. 4 bytes for an integer), they fragment indexes if not sequential, and they are harder to type.

I’ve settled on using bigint auto‑increment for most centralized databases. For distributed systems, I use sequential UUIDs (UUIDv7) or Snowflake‑style IDs. The key is to keep them sortable and index‑friendly.

4. Don’t Forget Application Semantics

Your primary key choice affects developers and operators.

If you expose IDs in APIs, do you want users to see user/123 or user/alice@example.com? The latter is user‑friendly but ties your API to a mutable value.
If you use natural keys in URLs, consider using them as slugs (separate unique, indexed column) rather than the actual primary key. That gives you the best of both worlds: a meaningful identifier that can be changed without cascading updates.

5. The “Just Add an ID” Fallacy

I’ve seen teams add an auto‑increment primary key to every table, including join tables, without thinking. That’s cargo‑culting. Join tables often don’t need a surrogate, the composite key is perfectly fine and more efficient.

Similarly, a table that stores configuration key‑value pairs can use the key as a natural primary key. Adding an extra ID column just adds clutter.

The Journey: From Dogma to Discernment

I started my career believing that auto‑increment IDs were the only correct answer. Then I built systems where they made debugging painful and where I regretted not using a natural key for simple lookup tables.

Later, I went through a natural‑key phase, feeling intellectually superior because my schema was “self‑describing.” That ended when I spent three days fixing a broken migration because a client’s product codes had changed.

Now, I’ve arrived at a place of discernment. I treat each table as a work of art, carefully considering:

Longevity – How long will this data live? Decades? Surrogate. Months? Natural might be fine.
Relationship depth – How many foreign keys point to this table? Many? Surrogate. Few? Consider natural.
Domain stability – Is this value controlled by business users (volatile) or by the system (stable)?

I also document these decisions. I leave comments in the schema explaining why I chose a surrogate or natural key. Because the next person or my future self will appreciate knowing the reasoning, not just seeing the outcome.

Conclusion: Embrace the Gray

There’s no single right answer to the primary key question. There never was. The best engineers don’t follow a dogma, they understand the trade‑offs and choose based on context.

So next time you’re designing a table, resist the urge to auto‑pilot. Ask yourself: What is the true identity of this data? How will it be used? What will change over time? And then, with that clarity, make your choice.

And remember: whatever you choose, you can always add a surrogate later. But if you start with a natural key and it changes, you’ll pay the price. So when in doubt, lean toward the surrogate but leave room for the natural where it truly belongs.

That’s the art. That’s the journey.

DEV Community