The Illusion of Scale, Part 3: Access Control Doesn't Scale Linearly

#security #distributedsystems #architecture #backend

One day you look up and realize your permissions model is something only two people on the team can explain. One of them just put in their notice.

Nobody planned to be in that position. It happened one exception at a time. One "just add a role for this" at a time. One "we'll clean this up later" at a time. Later never comes. It never comes.

This is Part 3 of a series about assumptions that quietly break systems at scale.

How 15 roles become 340 (a horror story in slow motion)

When we built out the permission model for one of the systems I worked on, we had 15 roles. Clean, well-defined, each with a clear purpose. You could explain the whole model in ten minutes to anyone new on the team. I was proud of it, honestly.

Two years later there were 340 roles. Three. Hundred. And forty.

Nobody planned for that. Nobody woke up one morning and said "you know what this system needs? 340 roles." It happened like this: a team needed access to one resource but not another, so a new role was created. A contractor role was almost identical to the standard role but needed one extra permission, so another role was created. An emergency access role was supposed to be temporary but was kept "just in case" and never revisited.

Each decision made perfect sense at the time. Collectively they produced a permission model that no single person could fully explain, audit, or reason about confidently. Including me, and I'd been there since the beginning.

That is role explosion. It's not a failure of discipline. It's what happens when a model designed for a clean set of cases gets pushed, one reasonable exception at a time, into a reality more complex than it was designed for.

Why simple RBAC always eventually breaks

Role-based access control works great when access decisions are binary: you either have the role or you don't. Clean, auditable, easy to reason about.

The problem is that real-world access decisions are almost never that clean.

You need a user who can access their own records but not others. You need access that expires after a project ends. You need a decision that depends on the current state of the resource, not just who's asking. Each of these requirements pushes you either toward more roles (which gets unwieldy fast) or toward a richer model that can express context-aware decisions.

Most teams take the path of more roles because it's faster in the moment. I've done this. You've probably done this. The second path -- attribute-based or policy-based access control -- is more work upfront and dramatically less work over time. But "more work upfront" loses to "we need this shipped by Friday" approximately 100% of the time.
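
To make the second path concrete, here is a minimal sketch of a policy-style check that covers the three cases above (own records only, expiring access, resource state) without adding a role for each. Everything in it is hypothetical and for illustration only: the `User`, `Record`, and `Grant` types, the `"locked"` status, and the `can_edit` function are not from any real system or library.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class User:
    id: str
    role: str


@dataclass(frozen=True)
class Record:
    owner_id: str
    status: str  # hypothetical states, e.g. "draft" or "locked"


@dataclass(frozen=True)
class Grant:
    expires_at: Optional[datetime]  # None means the grant never expires


def can_edit(user: User, record: Record, grant: Grant, now: datetime) -> bool:
    """One policy function instead of one extra role per exception."""
    # Access that expires when the project ends.
    if grant.expires_at is not None and now >= grant.expires_at:
        return False
    # Decision depends on the current state of the resource.
    if record.status == "locked":
        return False
    # Admins can edit any unlocked record (an assumption of this sketch).
    if user.role == "admin":
        return True
    # Everyone else can edit their own records but not others'.
    return record.owner_id == user.id
```

The point of the sketch is the shape: the decision takes the user, the resource, and the current time as inputs, so a new exception becomes a new condition rather than role number 341.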

The 10-minute incident (or: why caching permissions is terrifying)

Even a well-designed permission model has to be evaluated, and at scale the evaluation cost matters.

The usual answer is caching. Cache the authorization decision with a TTL. Fast, cheap, easy to implement. But during that TTL window, you're making decisions based on permissions that may no longer be current. This is fine. This is a reasonable tradeoff. Until it isn't.

We had a 10-minute TTL on cached permission decisions. The security team had asked what would happen if they needed to revoke access immediately. We said: up to 10 minutes. They accepted that.

Then a credential was compromised.

The security team revoked access and watched the logs. The system kept serving that user's requests for another eight minutes. Eight minutes is not long in most contexts. Standing in front of a security team watching real-time access logs during an active incident, trying to explain why the revocation hasn't taken effect yet, is a very different experience of those eight minutes. How did I eventually get around that problem? Take a guess in the comments.

Anyway, I have never forgotten what that room felt like. I will never set a cache TTL on permissions without thinking about that room.

That tradeoff -- cache TTL versus revocation speed -- exists whether or not your team has discussed it. The only variable is whether you made it consciously or discovered it during an incident.
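
Here is a small sketch of that tradeoff in code. It is not necessarily the fix used in the incident above. It is just one common shape: a TTL cache plus an explicit revocation hook that evicts a user's cached decisions. The `PermissionCache` class, its `lookup` callback, and the injectable `clock` are all hypothetical names for this example.

```python
import time
from typing import Callable, Dict, Tuple


class PermissionCache:
    """Caches allow/deny decisions for ttl_seconds, with explicit revocation."""

    def __init__(
        self,
        lookup: Callable[[str, str], bool],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup  # the slow, authoritative permission check
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the staleness window is testable
        self._entries: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def is_allowed(self, user_id: str, permission: str) -> bool:
        key = (user_id, permission)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now < cached[1]:
            # This answer may be stale for up to ttl_seconds.
            return cached[0]
        decision = self._lookup(user_id, permission)
        self._entries[key] = (decision, now + self._ttl)
        return decision

    def revoke(self, user_id: str) -> None:
        """Evict every cached decision for a user so the next check is fresh."""
        for key in [k for k in self._entries if k[0] == user_id]:
            del self._entries[key]
```

Without a call to `revoke`, the worst-case delay between revoking access and the system noticing is the full TTL. Note that this sketch only evicts from one in-process cache; in a distributed system the eviction has to reach every node holding a copy, which is its own design problem.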

Audit trails at volume (the compliance conversation from hell)

Every access decision needs to be attributable: who requested it, what they were authorized to do, what decision was made, and why. At 100,000 decisions per second, that's substantial write volume to your audit store.

Synchronous writes add latency. Asynchronous writes mean you have to handle the failure case where a decision is made but the audit entry is lost -- which is a compliance conversation nobody wants to have. I've been in that conversation. It's not fun.

I've worked on systems where the requirement was "log first, then execute." That constraint reshapes your entire architecture -- your latency budget, your failure handling, your storage design. It's buildable, but it needs to be in the design from the start. Retrofitting "log before execute" onto an existing system is expensive and almost never goes cleanly. Ask me how I know.
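
A minimal sketch of what "log first, then execute" means at the call site, assuming a synchronous audit write. The `execute_with_audit` function, the `AuditWriteError` exception, and the `write_audit` callback are hypothetical names for this example, not part of any particular framework.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class AuditWriteError(Exception):
    """Raised when the audit entry could not be written."""


def execute_with_audit(
    entry: dict,
    write_audit: Callable[[dict], None],
    action: Callable[[], T],
) -> T:
    """Write the audit entry first; run the action only if the write succeeded."""
    try:
        write_audit(entry)
    except Exception as exc:
        # Fail closed: no audit record means the action never runs.
        raise AuditWriteError("audit write failed; action not executed") from exc
    return action()
```

The ordering is the whole design: the audit write sits on the request's critical path, so its latency and availability become the system's latency and availability. That is why the constraint is so hard to retrofit.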

Granting is easy. Revocation is the real test.

Granting access is trivial. Write a row somewhere. Done. Ship it.

Revocation is where the design quality shows.

Access needs to be revoked across every cache, every replica, every long-running process that may have loaded a stale copy of that permission. A batch job that started before the revocation happened, loaded permissions at startup, and is still running an hour later -- technically, every individual check it made was valid at the time. But the aggregate behavior is wrong.

Explaining that gap to a compliance team is not a conversation you want. "Well, technically, at the time of each individual check..." doesn't land the way you hope it will.

Designing revocation that actually works means deciding explicitly what "immediately" means in your system and then building infrastructure to deliver it. Not assuming it'll sort itself out. It won't sort itself out.
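
For the long-running batch job case, one way to bound the gap is to re-check authorization periodically during the run instead of only at startup. A rough sketch, where `run_batch`, `AccessRevoked`, the `still_authorized` callback, and the `recheck_every` parameter are all assumptions made for this example:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AccessRevoked(Exception):
    """Raised when authorization disappears partway through a batch."""


def run_batch(
    items: Sequence[T],
    process: Callable[[T], None],
    still_authorized: Callable[[], bool],
    recheck_every: int = 100,
) -> int:
    """Process items, re-checking authorization every recheck_every items."""
    done = 0
    for index, item in enumerate(items):
        # Re-check at the start and then at fixed intervals, so a revocation
        # takes effect within at most recheck_every items.
        if index % recheck_every == 0 and not still_authorized():
            raise AccessRevoked(f"access revoked after {done} items")
        process(item)
        done += 1
    return done
```

Here "immediately" is defined as "within `recheck_every` items." That number is the explicit decision: smaller means faster revocation and more load on the permission check.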

What I'd do differently

Authorize close to the data, not just at the API boundary. Edge authorization is necessary but not sufficient.

Design the hot-path permission check to require no joins. It should be cheap by construction, not by optimization. Optimization after the fact is harder and less reliable than just designing it right.

Treat the cache staleness window as a product decision, not a technical one. Write it down. Make sure the people responsible for security incidents know what it is before the incident happens.

Build the audit trail into the design before anyone writes application code. Retrofitting it under compliance pressure is one of the more unpleasant engineering experiences I can describe. And I've had some unpleasant ones.
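
To illustrate the "no joins on the hot path" point from the list above: one option is to flatten the user-to-role and role-to-permission mappings ahead of time, so the per-request check is a single set lookup. This is a sketch under assumptions; `flatten`, `is_allowed`, and the permission strings like `"doc:write"` are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set


def flatten(
    user_roles: Dict[str, Iterable[str]],
    role_permissions: Dict[str, Iterable[str]],
) -> Dict[str, FrozenSet[str]]:
    """Precompute each user's effective permissions (runs off the hot path)."""
    flat: Dict[str, FrozenSet[str]] = {}
    for user_id, roles in user_roles.items():
        perms: Set[str] = set()
        for role in roles:
            perms.update(role_permissions.get(role, ()))
        flat[user_id] = frozenset(perms)
    return flat


def is_allowed(flat: Dict[str, FrozenSet[str]], user_id: str, permission: str) -> bool:
    """Hot-path check: one dict lookup and one set membership test, no joins."""
    return permission in flat.get(user_id, frozenset())
```

The cost moves to write time: the flattened table has to be rebuilt whenever a role or assignment changes, and how quickly that happens is the same staleness window discussed earlier.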

Next week: latency. We've all seen websites sit on "loading..." before they respond. What was that experience like? Not great, right? So let's talk about the culprit behind that.

What access control decision do you wish you'd made differently? The 340 roles story is mine. I want to hear yours. The worse, the better.