Akarshan Gandotra

Part 3 — Inside the Auth Service: From Token Validator to Policy Decision Point

Most auth services start simple — verify the token, return 200 or 401. Then requirements accumulate. Tenant isolation. Service accounts. Token revocation. Access levels per endpoint. And suddenly what was a lightweight validator is carrying a lot of weight, without a clear structure to hold it.

This post is about how we structured ours — the ideas that shaped it, and the ones we got wrong before landing here.


One job, lots of supporting infrastructure

The Auth Service does exactly one thing from the outside: receive a subrequest from NGINX, inspect the headers, and return a decision. Under a millisecond, every time.

But a single HTTP handler that does that reliably at scale has a lot underneath it — caching, revocation checks, routing logic, identity propagation. The structural challenge is keeping the handler small while the infrastructure grows. We landed on a controller that reads like a flowchart:

  1. Extract the request metadata (URI, method, tenant).
  2. Resolve the endpoint to find out what kind of auth it needs.
  3. Based on that: allow it openly, run authentication only, or run authentication and authorization.

That's the whole thing. Everything else is a service the controller delegates to.
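
Here's roughly what that controller looks like in Go. This is a minimal sketch with illustrative names: extractMeta, Resolve, authenticate, and authorize stand in for the services described below, and the default case is invented for the sketch, not a claim about our actual code.

type Decision struct {
    Allow  bool
    Reason string
}

func (s *AuthService) decide(r *http.Request) Decision {
    // 1. Extract the request metadata forwarded by NGINX.
    meta := extractMeta(r) // URI, method, tenant

    // 2. Resolve the endpoint to find out what kind of auth it needs.
    ep := s.endpoints.Resolve(meta.Method, meta.URI)

    // 3. Route on the classification. Nothing else happens here.
    switch ep.Classification {
    case Open:
        return Decision{Allow: true, Reason: "OPEN_ENDPOINT"}
    case Authenticated:
        return s.authenticate(r)
    case AccessControlled:
        d := s.authenticate(r)
        if !d.Allow {
            return d
        }
        return s.authorize(d, ep)
    default:
        return Decision{Allow: false, Reason: "UNCLASSIFIED_ENDPOINT"}
    }
}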


The insight that changed how we think about routing: endpoint classification is data, not code

Early on, we made auth decisions in code. A route was open because someone wrote if path == "/health" { return 200 }. Access control lived in conditionals scattered across handlers.

This breaks the moment your product team adds a new endpoint, or you need to temporarily open a route for a partner integration, or you realize a route that was open should have been authenticated all along.

We flipped it: every endpoint in the system has a classification stored in the database — OPEN, AUTHENTICATED, or ACCESS_CONTROLLED — along with a permission list if it's access-controlled. The auth service resolves the incoming request to an endpoint record and reads that classification. The decision logic then becomes a simple switch:

  • OPEN: allow, log it, done.
  • AUTHENTICATED: run token validation.
  • ACCESS_CONTROLLED: run token validation, then check permissions.

The consequence is that we never recompile or redeploy the Auth Service to change how a route is protected. That's a database update. It also means non-engineers can reason about the access model without reading code.
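
Concretely, the record the service resolves to looks something like this (a sketch; the field names are illustrative and the real schema carries more):

type EndpointRecord struct {
    Method         string   // e.g. "GET"
    PathPattern    string   // e.g. "/api/v1/orders/:id"
    Classification string   // "OPEN", "AUTHENTICATED", or "ACCESS_CONTROLLED"
    Permissions    []string // consulted only when ACCESS_CONTROLLED
}

Reclassifying a route is an update to that row, not a release.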


Naming every failure: the decision-reason contract

The second structural idea that shaped everything else: every outcome has an explicit name.

We maintain an enumerated list of decision reasons — constants like MISSING_TOKEN, TENANT_MISMATCH, TOKEN_REVOKED, SA_VERSION_MISMATCH, OPEN_ENDPOINT, ACCESS_LEVEL_MATCH. Every code path in the service must set one before returning. There's no exit that doesn't produce a named reason.

// Decision reasons: every exit path in the service must set exactly one
// of these before returning.
const (
    ReasonOpenEndpoint       = "OPEN_ENDPOINT"
    ReasonMissingToken       = "MISSING_TOKEN"
    ReasonTokenRevoked       = "TOKEN_REVOKED"
    ReasonTenantMismatch     = "TENANT_MISMATCH"
    ReasonSAVersionMismatch  = "SA_VERSION_MISMATCH"
    ReasonTokenTypeMismatch  = "TOKEN_TYPE_MISMATCH"
    // ... and so on
)
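
To keep that invariant from resting on discipline alone, every exit can be funneled through constructors that demand a reason. A sketch of the shape, reusing the Decision struct from the controller sketch above (the enforcement details here are illustrative):

func allow(reason string) Decision { return Decision{Allow: true, Reason: reason} }
func deny(reason string) Decision  { return Decision{Allow: false, Reason: reason} }

// An empty reason at emit time is a bug in the service, not a runtime
// condition, so the final check before responding can afford to be loud.
func (d Decision) mustHaveReason() Decision {
    if d.Reason == "" {
        panic("auth: decision returned without a named reason")
    }
    return d
}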

This sounds like a minor logging detail. It isn't.

When a token fails, why it fails tells a completely different story depending on the reason. TOKEN_REVOKED means the user logged out or was disabled. SA_VERSION_MISMATCH means a service account was rotated and the calling service hasn't caught up. TOKEN_TYPE_MISMATCH means something is trying to authenticate with a refresh token where it should use an access token — usually a buggy SDK, occasionally something worth investigating.

If all of these collapsed into a generic 401 Unauthorized, you'd lose all of that signal. Dashboards would be useless. On-call would be guessing.

The list itself is a contract with the log pipeline. New reasons go through code review. Old reasons can't be deleted without checking dashboards and alerts first. It's one of the few places in the codebase where "this is more rigid than it needs to be" is actually correct.
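
One cheap way to give that rigidity teeth is a pinned test. This is a sketch, assuming dashboards query by the literal reason strings; it is an illustration, not our actual CI setup:

func TestAlertingReasonsStillDefined(t *testing.T) {
    // Built from the constants, so deleting one is a compile error here.
    known := map[string]bool{
        ReasonTokenRevoked:      true,
        ReasonTenantMismatch:    true,
        ReasonSAVersionMismatch: true,
    }
    // Copied verbatim from dashboard and alert queries. If a constant's
    // value is renamed, the literal falls out of the known set and this
    // fails before the dashboards silently go dark.
    for _, q := range []string{"TOKEN_REVOKED", "TENANT_MISMATCH"} {
        if !known[q] {
            t.Errorf("reason %q is used in alerting but no longer defined", q)
        }
    }
}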


One log line per request — and why that matters more than it sounds

Our first approach was to emit log lines at each stage of the pipeline — one when we resolved the route, one when we validated the token, one when we made the authorization decision. We could stitch them together by request ID.

We abandoned this. The stitching was always slightly wrong. Correlation IDs got dropped. Fields you needed were in a different log line than the one you found first. Debugging a production incident meant reconstructing a timeline from fragments.

Now there's one structured log record per request. It's built up incrementally — every handler in the pipeline writes into the same struct. By the time the response goes out, the record has every field: URI, method, tenant, identity, cache hit status, decision reason, outcome. It emits once, at the end.
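
In code, the shape is roughly this. A sketch: decide stands in for the controller from earlier, extended to thread the record through, and the field set is illustrative.

import (
    "encoding/json"
    "log"
    "net/http"
)

type AccessLog struct {
    URI, Method, Tenant string
    Identity            string
    CacheHit            bool
    Reason              string
    Allowed             bool
}

func (s *AuthService) handle(w http.ResponseWriter, r *http.Request) {
    rec := &AccessLog{URI: r.RequestURI, Method: r.Method}

    // Exactly one emit per request, no matter which path it took.
    defer func() {
        b, _ := json.Marshal(rec)
        log.Println(string(b))
    }()

    // Every stage downstream receives rec and writes its fields into it.
    d := s.decide(r, rec)
    rec.Allowed, rec.Reason = d.Allow, d.Reason

    if d.Allow {
        w.WriteHeader(http.StatusOK)
    } else {
        w.WriteHeader(http.StatusUnauthorized)
    }
}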

The operational improvement was immediate. Grepping for a user's identity ID gives you a complete picture of every request they made — what was allowed, what was denied, and exactly why. No joining, no reconstruction.

If you're designing an auth service, this is the first structural decision we'd recommend getting right. Everything else can be refactored. The logging model tends to calcify early.


How we handle JWT verification at scale

Validating a JWT sounds cheap. For HS256 with a shared secret, it mostly is. For RS256 with asymmetric keys — which is what we use for user-facing tokens — the RSA verification step sits in the hundreds of microseconds. At meaningful request volume, that becomes a real CPU cost.

Our solution is a cache in front of the decode step. The cache key is a hash of the raw token string (not the string itself — the hash is 8 bytes versus potentially hundreds, which adds up at scale). The TTL matches the token's expiry. When a token comes in that we've already verified recently, we skip the RSA verification entirely.
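
A sketch of that cache, assuming an in-process map and the standard library's FNV-1a hash for the 8-byte key (the production details can differ):

import (
    "hash/fnv"
    "sync"
    "time"
)

type Claims struct {
    Subject string
    Tenant  string
    Expiry  time.Time
}

type tokenCache struct {
    mu sync.RWMutex
    m  map[uint64]Claims
}

// tokenKey hashes the raw token down to 8 bytes, so the map stores a
// uint64 per entry instead of a token string that can run to hundreds.
func tokenKey(raw string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(raw))
    return h.Sum64()
}

func (c *tokenCache) get(raw string) (Claims, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    cl, ok := c.m[tokenKey(raw)]
    if !ok || time.Now().After(cl.Expiry) {
        return Claims{}, false // expired entries count as misses
    }
    return cl, true
}

func (c *tokenCache) put(raw string, cl Claims) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.m[tokenKey(raw)] = cl // the TTL rides on the claims' own expiry
}

On a hit, the RSA verification is skipped entirely. What deliberately isn't cached is covered next.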

A few things we were careful not to cache:

Revocation state. Whether a token has been revoked can change at any moment, independent of the token's validity. We cache the decode result — the claims, the identity — but we always check revocation live. These are different questions.

The auth decision itself. The decision depends on the endpoint, the tenant, and the required access level, none of which the token cache knows about. Caching decisions would mean a user who got their access level changed mid-session would still see stale decisions until cache expiry. Unacceptable.

The principle here generalizes: cache the facts (what the token says), not the decisions (what we're going to do about it).
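
Put together, the hot path reads like this. Again a sketch, reusing names from the earlier snippets; bearerToken, verifyRSA, revoked, and the INVALID_TOKEN and TOKEN_VALID reasons are illustrative stand-ins:

func (s *AuthService) authenticate(r *http.Request) Decision {
    raw := bearerToken(r)
    if raw == "" {
        return deny(ReasonMissingToken)
    }

    // Cached fact: what the token says.
    claims, hit := s.cache.get(raw)
    if !hit {
        var err error
        claims, err = s.verifyRSA(raw) // the expensive step
        if err != nil {
            return deny("INVALID_TOKEN")
        }
        s.cache.put(raw, claims)
    }

    // Never cached: whether the token has since been revoked.
    if s.revoked(claims) {
        return deny(ReasonTokenRevoked)
    }

    return allow("TOKEN_VALID")
}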


The boundary the Auth Service deliberately doesn't cross

The clearest sign a service is well-designed is what it refuses to do.

Our Auth Service handles coarse-grained access: does this identity have the level of access required to reach this endpoint category? That's it. It does not answer questions like "can this user delete this specific record?" or "does this account have permission to access this tenant's billing history?"

Those are business policy questions. They belong in the services that own that data, where the full context exists.

Every time we've been tempted to push business logic into the Auth Service — usually because it would be convenient, or because a product requirement seemed auth-adjacent — we've regretted it. Business policy changes frequently. Auth infrastructure should be boring and stable. Keeping them separate means changes to one don't put the other at risk.

The Auth Service also doesn't store sessions, doesn't issue tokens, and doesn't look up users. Tokens carry enough identity for upstream services to do that themselves. The Auth Service only validates.


The pattern underneath all of this

Looking back, the decisions that held up over time share a common shape: make the implicit explicit.

Endpoint classification pulled auth rules out of code and into data. Decision reasons named every outcome instead of letting them collapse into status codes. The single log line made the request lifecycle visible as a single artifact instead of scattered fragments. The cache/decision boundary separated "what the token says" from "what we're going to do about it."

None of these are particularly novel ideas. But they compound. A service where every decision is named, every outcome is logged atomically, and every boundary is deliberate is a service you can actually operate.

That's the goal.


Next up: Part 4 — the path trie that resolves incoming URIs to endpoint records in O(path length), without a database call on the hot path.
