Akarshan Gandotra
Part 6 — Authorization at Scale: Access Levels, Roles, and Compact Decisions

Authentication answers "who are you?" Authorization answers the harder question: "are you allowed to do this?"

By the time a request reaches this stage, we've already validated the token and confirmed the tenant. Now we need to decide — before the request touches any upstream service — whether this specific identity has permission to call this specific endpoint. That decision runs hundreds of millions of times a day. It needs to be fast, correct, and cheap to reason about when something goes wrong.

This post is about the model we use, the simpler approach that served us for a year, and the optimization we eventually built — and why we kept the old path around anyway.


The model: three layers, one question

Our authorization model has three layers:

A role is what a user is granted — something like clinic_admin or billing_specialist. An access level is a coarse permission — something like user:admin or schedule:write. An endpoint declares which access levels are sufficient to reach it.

Roles are bags of access levels. Endpoints are protected by lists of access levels. A user can call an endpoint if any of their roles' access levels appear in the endpoint's required list. That's the whole model.
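To make that concrete, here's a minimal sketch of the three layers in Go. The type and field names are illustrative assumptions, not our real code:

```go
package main

import "fmt"

// A role is a bag of access levels.
type Role struct {
	Name         string
	AccessLevels []string
}

// An endpoint is protected by a list of access levels.
type Endpoint struct {
	Path     string
	Method   string
	Requires []string
}

// effectiveLevels flattens a user's roles into one set of access levels;
// the user can call an endpoint if this set overlaps the endpoint's
// Requires list.
func effectiveLevels(roles []Role) map[string]struct{} {
	set := make(map[string]struct{})
	for _, role := range roles {
		for _, lvl := range role.AccessLevels {
			set[lvl] = struct{}{}
		}
	}
	return set
}

func main() {
	roles := []Role{
		{Name: "clinic_admin", AccessLevels: []string{"user:admin", "schedule:write"}},
		{Name: "billing_specialist", AccessLevels: []string{"billing:read"}},
	}
	levels := effectiveLevels(roles)
	_, ok := levels["schedule:write"]
	fmt.Println(len(levels), ok) // 3 true
}
```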

We deliberately keep it coarse. There are dozens of access levels in the system, not thousands. Questions like "can this user delete this specific patient record?" belong to the upstream service that owns that data — it has the context the gateway doesn't. The gateway's job is to filter on the order of "is this user even an admin at all?" — a check that catches the vast majority of misuse and runs at edge speeds.


Who defines access levels and endpoints?

Here's something that might seem surprising: the gateway doesn't own the access level definitions. Individual product services do.

Each service ships an access_levels.json alongside its code. This file declares which access levels the service recognizes and which endpoints require which levels. A scheduling service owns schedule:write. A billing service owns billing:read. The gateway is a consumer — it doesn't make editorial decisions about what permissions mean.

// access_levels.json (owned and maintained by the upstream service)
{
  "access_levels": [
    { "name": "schedule:write", "description": "Create and modify appointments" },
    { "name": "schedule:read",  "description": "View appointments" }
  ],
  "endpoints": [
    { "path": "/api/appointments",  "method": "POST", "requires": ["schedule:write"] },
    { "path": "/api/appointments/*", "method": "GET",  "requires": ["schedule:read", "schedule:write"] }
  ]
}

The publish flow runs through CI/CD. When a service merges a change to its access level definitions, a pipeline step pushes the updated file to a well-known S3 path. The gateway picks up the change on its next refresh cycle — no gateway deploy required, no manual registry edits.

service-repo/
  access_levels.json   ← owned by the service team
  .github/workflows/publish.yml

# publish.yml (simplified)
- name: Publish access levels
  run: |
    aws s3 cp access_levels.json \
      s3://registry/services/${{ env.SERVICE_NAME }}/access_levels.json

This keeps ownership aligned: the team that builds the feature decides what permission protects it. The gateway team owns the mechanism; product teams own the policy. Changes are auditable through git history, reviewable via pull request, and rollbackable the same way code is.


What the token carries — and what it doesn't

The JWT includes the user's granted access levels as a bitmap — a compact byte slice — along with the version of the registry used when the token was issued. It does not contain the full permission graph, and it does not contain endpoint requirements. Those live in the database, loaded into memory at boot.

A decoded JWT payload looks like this:

{
  "sub": "user_abc123",
  "tenant": "acme-health",
  "policy_bitmap": "__________________8H",
  "policy_bitmap_version": 114,
  "exp": 1714000000
}

policy_bitmap is a base64url-encoded byte slice — each bit position corresponds to one access level in the registry at version 114. policy_bitmap_version tells the gateway exactly which registry snapshot to use when interpreting the bits. If the gateway's current registry is at version 114, it uses the fast bitmap path. If the versions differ, it falls back to string matching (more on that below).
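Decoding the claim is a one-liner with Go's standard library. A sketch, assuming the unpadded base64url form shown in the payload above and an LSB-first bit order within each byte (both are implementation details we're assuming):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeBitmap turns the base64url policy_bitmap claim back into a byte
// slice; bit i then corresponds to access level index i in the registry
// snapshot named by policy_bitmap_version.
func decodeBitmap(claim string) ([]byte, error) {
	// RawURLEncoding matches the unpadded base64url form in the payload.
	return base64.RawURLEncoding.DecodeString(claim)
}

// hasBit reports whether bit index i is set. Bits are checked LSB-first
// within each byte (an assumed convention for this sketch).
func hasBit(bm []byte, i int) bool {
	return i/8 < len(bm) && bm[i/8]&(1<<(uint(i)%8)) != 0
}

func main() {
	bm, err := decodeBitmap("__________________8H")
	if err != nil {
		panic(err)
	}
	fmt.Println(hasBit(bm, 0)) // true
}
```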

This is a deliberate tradeoff. The stateless alternative — put everything in the token, make every decision without a database — sounds clean until users accumulate permissions. Tokens balloon to 4–8 KB. Cookies start failing at network edges. Mobile clients cache tokens aggressively and get stuck with stale permission sets. Every role change requires re-issuing every affected token immediately.

The compromise: the JWT carries coarse access levels (a small, stable set encoded as a bitmap), and the database carries endpoint requirements (queried once at startup, refreshed on demand). Per-request authorization is a fast in-memory lookup on both sides.
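The issuance side is symmetric: set one bit per granted access level index. A sketch, again assuming an LSB-first bit order within each byte:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// buildBitmap sets one bit per granted access level, using the stable
// integer index each level holds in the registry snapshot.
func buildBitmap(grantedIndexes []int, registrySize int) []byte {
	bm := make([]byte, (registrySize+7)/8) // round up to whole bytes
	for _, i := range grantedIndexes {
		bm[i/8] |= 1 << (uint(i) % 8)
	}
	return bm
}

func main() {
	// A user granted access levels 0, 1, and 9 in a 16-level registry.
	bm := buildBitmap([]int{0, 1, 9}, 16)
	fmt.Printf("%x %s\n", bm, base64.RawURLEncoding.EncodeToString(bm))
}
```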

The payoff on token size is significant. Before the bitmap rework, heavily-permissioned admins had tokens approaching 3 KB; after, the same tokens came in under 1 KB.


The original approach: string matching

The first implementation is what you'd sketch on a whiteboard. Take the user's access levels, take the endpoint's required access levels, check if they overlap.

It's O(n + m) — linear in the number of user permissions and required permissions. With typical values (a user might have 20–80 access levels, an endpoint usually requires 1–3), this runs in nanoseconds. It's correct, it's readable, and it worked fine in production for over a year.
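The check itself is a few lines. A sketch of the string-matching path, with an illustrative function name:

```go
package main

import "fmt"

// legacyAllowed is the string-matching path: O(n + m) in the number of
// user and required access levels, using a set built from the user's side.
func legacyAllowed(userLevels, required []string) bool {
	granted := make(map[string]struct{}, len(userLevels))
	for _, lvl := range userLevels {
		granted[lvl] = struct{}{}
	}
	for _, req := range required {
		if _, ok := granted[req]; ok {
			return true // any overlap is sufficient
		}
	}
	return false
}

func main() {
	user := []string{"schedule:read", "billing:read"}
	fmt.Println(legacyAllowed(user, []string{"schedule:read", "schedule:write"})) // true
	fmt.Println(legacyAllowed(user, []string{"user:admin"}))                      // false
}
```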

The reason we eventually replaced it had nothing to do with speed.

The first reason was token size. As the platform grew and senior users accumulated more access levels, tokens stretched. We had admins with tokens approaching 3 KB. That's uncomfortable but manageable — until it isn't.

The second reason was density of signal. String matching tells you that the user was authorized, but the log entry just says granted: ["user:admin"]. We wanted richer per-permission metrics — which access levels are actually being exercised, which ones are granted but never hit anything — without adding another pass over the data.


The bitmap approach: compress the representation, keep the logic

The idea is simple: assign every access level a stable integer index. Represent a user's granted permissions as a bit vector — one bit per access level. Represent each endpoint's requirements the same way. Authorization becomes a bitwise AND.

If the result is nonzero, the user has at least one of the required permissions. Allow. If the result is zero, deny. That's the entire hot path.

The anchoring snippet — the intersection check at the core of it:

func intersects(a, b []byte) bool {
    // Compare only the overlapping prefix; lengths can differ when the
    // registry has grown since one side's bitmap was encoded.
    n := len(a)
    if len(b) < n {
        n = len(b)
    }
    for i := 0; i < n; i++ {
        // A single shared bit means the user holds at least one of the
        // endpoint's required access levels.
        if a[i]&b[i] != 0 {
            return true
        }
    }
    return false
}

For a typical 32-byte bitmap (covering 256 possible access levels), this is a handful of CPU instructions. Decision time dropped from ~3 microseconds in the worst legacy case to under 200 nanoseconds. Not visible to end users. Very visible in CPU costs at 50,000 requests per second.

Token size dropped too — from ~3 KB for a heavily-permissioned admin to under 1 KB. The access levels that used to be a long string array became the policy_bitmap field: a base64url-encoded byte slice.

The two paths side by side:

|  | String matching | Bitmap intersection |
| --- | --- | --- |
| Decision time | ~3 µs worst case | <200 ns |
| Token size (heavy admin) | ~3 KB | <1 KB |
| Registry version | agnostic | must match the token's |
| Role in production | fallback path | hot path |


The version problem — and why we kept the old path

Here's the catch: bit indexes have to be stable. If access level user:admin is bit 0 today, it must still be bit 0 when old tokens are being validated. This is managed through a versioned registry — each snapshot of the bit assignments carries a version number, and the JWT records which version was used when it was issued via policy_bitmap_version.

{
  "policy_bitmap": "__________________8H",
  "policy_bitmap_version": 114
}

When the gateway boots, it loads the current registry — say, version 114 — and builds an in-memory lookup from version number to bit-index map. When a token arrives, the gateway reads policy_bitmap_version and checks:

  • Version matches current registry (114 == 114): decode the bitmap, run intersects(), done.
  • Version is older (e.g., 112): fall back to string matching against the access level names.
  • No policy_bitmap_version field: legacy token predating the bitmap feature — fall back to string matching.

The fallback uses the access level names embedded in the token (carried as a separate claim for exactly this purpose) and checks them against the endpoint's required list. Same outcome, no bitmap needed.

This fallback isn't a temporary measure. It's load-bearing. Long-lived service account tokens might be weeks old. We can't deny them just because they predate a registry update. The string-based check is version-agnostic: it doesn't care about bit indexes at all. As long as both sides agree on what the access level strings mean, it works.
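Putting the dispatch together, a self-contained sketch (the type names and claim layout are assumptions):

```go
package main

import "fmt"

// Token holds the claims relevant to authorization. AccessLevels is the
// separate name claim carried for the fallback path.
type Token struct {
	Bitmap        []byte
	BitmapVersion int
	AccessLevels  []string
}

// Endpoint carries its requirements in both representations.
type Endpoint struct {
	RequiredBitmap []byte
	RequiredNames  []string
}

// intersects is the bitmap hot path from the article.
func intersects(a, b []byte) bool {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i]&b[i] != 0 {
			return true
		}
	}
	return false
}

// nameFallback is the version-agnostic string-matching path.
func nameFallback(userLevels, required []string) bool {
	granted := make(map[string]struct{}, len(userLevels))
	for _, l := range userLevels {
		granted[l] = struct{}{}
	}
	for _, r := range required {
		if _, ok := granted[r]; ok {
			return true
		}
	}
	return false
}

// authorize takes the fast bitmap path only when the token's registry
// version matches the gateway's current one; otherwise it falls back.
func authorize(tok Token, ep Endpoint, currentVersion int) bool {
	if tok.BitmapVersion == currentVersion && len(tok.Bitmap) > 0 {
		return intersects(tok.Bitmap, ep.RequiredBitmap)
	}
	return nameFallback(tok.AccessLevels, ep.RequiredNames)
}

func main() {
	ep := Endpoint{RequiredBitmap: []byte{0x02}, RequiredNames: []string{"schedule:write"}}
	fresh := Token{Bitmap: []byte{0x02}, BitmapVersion: 114}
	stale := Token{BitmapVersion: 112, AccessLevels: []string{"schedule:write"}}
	fmt.Println(authorize(fresh, ep, 114), authorize(stale, ep, 114)) // true true
}
```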

New registry versions are created whenever a service publishes new access level definitions through the CI/CD pipeline. The version number increments, new bit positions are assigned to new access levels, and existing assignments are preserved verbatim. Old tokens stay valid — they just take the slightly slower path until they expire naturally.
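The invariant that makes this safe is "existing assignments are preserved verbatim." A sketch of what minting a new registry version might look like, assuming indexes are assigned densely in grant order (all names here are hypothetical):

```go
package main

import "fmt"

// Registry maps access level names to stable bit indexes for one version.
type Registry struct {
	Version int
	Index   map[string]int
}

// nextVersion mints version n+1: existing bit indexes are copied
// unchanged, and only genuinely new levels receive fresh indexes.
func nextVersion(reg Registry, newLevels []string) Registry {
	out := Registry{Version: reg.Version + 1, Index: make(map[string]int, len(reg.Index))}
	for name, idx := range reg.Index {
		out.Index[name] = idx // never reshuffled, so old bitmaps stay decodable
	}
	next := len(out.Index) // assumes dense 0..n-1 assignment
	for _, name := range newLevels {
		if _, exists := out.Index[name]; !exists {
			out.Index[name] = next
			next++
		}
	}
	return out
}

func main() {
	v114 := Registry{Version: 114, Index: map[string]int{"user:admin": 0, "schedule:write": 1}}
	v115 := nextVersion(v114, []string{"billing:read"})
	fmt.Println(v115.Version, v115.Index["user:admin"], v115.Index["billing:read"]) // 115 0 2
}
```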

We track the fallback rate with a metric. When it's near zero, things are healthy. A spike tells us something is wrong — maybe a token issuer is behind on registry versions, maybe a test fixture has stale data, or maybe a new service published access levels without updating the issuer to match.


A few things we deliberately didn't do

Some approaches we considered and rejected:

Caching the authorization decision. "Same token plus same endpoint equals same answer" feels right. It's wrong — role changes, revocation, and tenant changes all invalidate it. We cache the token decode result (the identity and access levels), not the decision.

Per-tenant access level definitions. Letting each tenant define what user:admin means sounds flexible. In practice, it means the registry forks and any cross-tenant reasoning breaks. Access levels are platform-wide; role assignments are per-tenant. That's the line. Individual services define access levels globally — they don't get per-tenant variants.

Hierarchical permissions. "user:admin implies user:read" is elegant on paper. It complicates bitmap encoding and makes rollback harder. We grant both explicitly. A few extra access levels per role is not a real cost.

A central registry team as the bottleneck. Early on, a single team owned all access level definitions. This created a queue — every new feature needed a registry PR to land before it could ship. Moving ownership to service teams via the CI/CD publish flow eliminated that queue entirely. The gateway team reviews the mechanism; service teams review each other's access level semantics in their own PRs.


The pattern underneath the optimization

The bitmap is a performance and density win. But the deeper idea is the same one from the last chapter: make the implicit explicit and keep the decision structure visible.

String matching and bitmap intersection both produce the same outcome — allow or deny. What the bitmap adds isn't correctness, it's compactness: a cheaper wire representation, a faster runtime check, and a version-aware fallback that degrades gracefully instead of breaking.

The CI/CD publish flow adds a different kind of compactness: it removes the coordination overhead of centralized registry management. Services declare what they need. The pipeline handles the distribution. The gateway consumes whatever's in the registry. No tickets, no handoffs.

The fallback is worth lingering on. Most optimizations in auth systems are irreversible — once you commit to a new token format, old tokens become a problem. Keeping the legacy path as a first-class citizen, with its own metrics and log fields, meant we could ship the optimization without a flag day. Old tokens kept working. New tokens got faster. The two paths converged over time on their own.


What the deployment actually looked like

Theory is one thing. Here's the Datadog dashboard from the bitmap deployment on Apr 27 at 17:30.

Datadog dashboard showing request hits, error rate, p99 latency, and execution time breakdown across the bitmap deployment boundary

The real win shows up in p99 latency: it drops from a spiky 5–14 ms pattern to a stable ~4–5 ms, eliminating GC-induced variance from string allocations.

Execution time stayed in the 100–400 µs range, with a one-time spike during the in-memory bitmap rebuild. Fallback usage decayed naturally as tokens rotated, confirming the migration was seamless.


Next up: Chapter 7 — token revocation. JWTs are stateless by design, which makes "log this user out right now" genuinely hard. We solved it with a Redis-backed revocation list, an in-process cache, and two startup races we had to fix the painful way.
