Multi-site schemas: lessons from evolving one large system over time
I worked on a large multi-site system where the data model had to evolve as the product grew.
It started out simple, but as always, with progress the demand also grew to something that allowed flexibility.
The progression was:
site_id- many-to-many relationships
scope_id
Each stage solved the limitations of the previous one.
Looking back, I would skip directly to the final model.
1. Start simple: site_id
The original model was:
products
- id
- site_id
- name
Queries were simple:
WHERE site_id = ?
In practice, the active site was usually resolved from the request domain.
Examples:
-
us.example.com→ site1 -
eu.example.com→ site2
So each request enters the system with a current site_id, and that drives filtering, permissions, and configuration.
This model is easy to understand, easy to query, and operationally simple.
Where it stopped being enough
Once entities needed selective sharing across sites, site_id became restrictive.
Typical workaround:
site_id = 0
Meaning global.
That handles:
- one site
- all sites
But not:
- site A + B only
- region-specific sharing
- custom groups of sites
2. Add relationships: many-to-many
The next idea was relationship tables.
products
- id
- name
product_sites
- product_id
- site_id
Now entities could belong to any set of sites.
That solved the sharing problem.
Why I turned away from this approach
Only some data benefits from many-to-many visibility.
Examples:
Good candidates:
- products
- templates
- assets
Poor candidates:
- orders
- logs
- analytics events
- transactions
So the system would be using two patterns:
WHERE site_id = ?
and
JOIN product_sites ...
That inconsistency spreads over time.
Different repositories, query builders, exports, reports, and jobs all need to know which model applies.
3. The real requirement was visibility
The actual question most features were asking was not:
Which site owns this?
It was:
Which sites can access this?
That is a visibility problem.
Visibility is naturally represented as a reusable set.
4. Introduce scope_id
Instead of attaching sites directly to every entity, entities reference a scope.
- id
- scope_id
- name
scopes
- id
- hash
scope_members
- scope_id
- site_id
With:
scope_id = 0
Reserved for global visibility.
What a scope represents
Examples:
-
0→ visible everywhere - scope
12→ site[1] - scope
27→ sites[1,2,5]
The entity only stores visibility.
No direct ownership rules. No per-entity join tables.
5. Why this improved the system
The biggest benefit was consistency.
Reads normalize to one pattern:
WHERE scope_id IN (0, ...)
Where the allowed scopes for the active site are resolved once and cached.
That means:
- one read model
- no switching between joins and direct filters
- no special handling for global records
- simpler shared infrastructure code
6. Writes become explicit
Visibility is assigned intentionally:
scope_id = Scope::forSite(1);
scope_id = Scope::forSites([1,2,5]);
scope_id = 0;
That centralizes the logic into one helper layer.
7. Why duplication becomes expensive
A common workaround for multi-region systems is cloning records into each site.
That seems fine initially.
Later it creates ongoing problems:
- global identifiers
- syncing edits across copies
- partial overrides
- having to worry about the source of truth
- larger data size that could directly affect performance
- rollout coordination
At scale, managing copies becomes harder than managing access.
A shared entity with scoped visibility avoids most of that.
One identity, multiple sites.
8. What I would do now
If I were building this again, I would start with scope_id.
It is close enough to site_id in simplicity:
scope_id = Scope::forSite(1)
But it avoids the later refactor when sharing becomes necessary.
Single-site behavior still works naturally.
Shared visibility is already built in.
Global visibility already exists.
There is little benefit in starting with a narrower model if the broader one carries almost no real penalty.
Final thought
The strongest argument for scope_id is not just flexibility.
It is that it handles:
- isolated sites
- shared entities
- global entities
- future growth
using one consistent model.
If a design solves today’s requirements and tomorrow’s likely requirements without meaningful tradeoffs, it is usually the better starting point.
Top comments (0)