The first eight chapters of this series have been about building an Auth Gateway. This one is about living with one.
A gateway in front of every authenticated request is a force multiplier — for your platform and for whoever is oncall. If something breaks, it breaks everywhere at once. So observability isn't a Chapter 9 thing. It's a Chapter 0 thing. We describe it last only because the machinery it observes needed the rest of the series to explain.
This chapter covers the four things you need to be able to do at 3 AM:
- Read a single log line and understand what happened.
- Trace a slow request from edge to upstream.
- Tell whether a pod is alive, ready, or in deep trouble.
- Get an alert once — not once per pod per second — when something degrades.
The log line
There are exactly two structured log lines per protected request: one from NGINX, one from the Auth Service. They share request_id, so you can join them.
The NGINX line
Every request is logged as JSON to stdout via NGINX's log_format main:
log_format main escape=json '{
"logType":"NGINX_LOGS",
"request_id":"$request_id",
"time_local":"$time_iso8601",
"remote_addr":"$remote_addr",
"request_method":"$request_method",
"request_uri":"$request_uri",
"request_path":"$uri",
"slug":"$location_path",
"product":"$product",
"microservice":"$microservice",
"status":"$status",
"status_class":"$status_class",
"request_time_ms":"$request_time_millis",
"service_request_time_ms":"$upstream_response_time_millis",
"service_connect_time":"$upstream_connect_time",
"auth_request_time_ms":"$auth_time_millis",
"auth_connect_time":"$auth_connect_time",
"body_bytes_sent":"$body_bytes_sent",
"http_referer":"$http_referer",
"http_user_agent":"$http_user_agent",
"http_x_forwarded_for":"$http_x_forwarded_for",
"http_host":"$http_host",
"tenant_id":"$tenant_id",
"tenant_namespace":"$tenant_namespace",
"identity_id":"$identity_id",
"identity_type":"$identity_type",
"auth_error_message":"$auth_error_message",
"auth_error_code":"$auth_error_code"
}';
A few fields that matter more than they look:
- auth_request_time_ms — how long the auth subrequest took. We graph this. We page on the p99 going above 50 ms.
- service_request_time_ms — how long the upstream took, excluding the auth subrequest. Sequential, not overlapping (Chapter 2).
- status_class — 1xx/2xx/3xx/4xx/5xx. Faster than parsing status for dashboard breakdowns.
- tenant_id — the resolved tenant. Always grep by tenant first.
- auth_error_code + auth_error_message — populated on deny. Empty on allow.
Two extra log formats for the fail paths
When NGINX hits @auth_unavailable or @upstream_unavailable, we log to a different format:
log_format auth_unavailable '...auth-specific fields...';
log_format upstream_unavailable '...upstream-specific fields...';
location @auth_unavailable {
internal;
access_log /dev/stdout auth_unavailable;
return 503 ...;
}
The reason: when something is on fire, you want it isolated in its own log stream. Dashboards built off main get drowned by 200s; a dashboard against auth_unavailable shows you exactly the broken bucket without filtering.
The Auth Service line
The Auth Service emits exactly one AUTH_DECISION log per request:
{
"level":"info",
"ts":"2026-05-01T12:34:56.789Z",
"logger":"AUTH_DECISION",
"request_id":"550e8400-e29b-41d4-a716-446655440000",
"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736",
"span_id":"00f067aa0ba902b7",
"uri":"/api/v1/users",
"method":"GET",
"slug":"user-service",
"tenant_id":"mt_prod",
"endpoint_type":"ACCESS_CONTROLLED",
"identity_type":"USER",
"identity_id":"user@example.com",
"auth_method":"bearer_token",
"decision_reason":"ACCESS_LEVEL_MATCH",
"outcome":"allow",
"status":200,
"duration_ms":2.5,
"authn_ms":1.8,
"authz_ms":0.7,
"jwt_cache_hit":true,
"bitmap_authz_used":true,
"granted_access_levels":["product:admin"],
"token_revoked":false,
"slow":false
}
This is the single most important artifact in the whole gateway. It is, in dry terms, our auth audit log. In practical terms it's the thing oncall greps when anything goes weird.
Design rules we apply to it religiously:
- Exactly one line per request. No "starting auth", "authenticated", "authorizing", "decided" — those make the storyline split across N entries that you have to stitch back together. One line, one decision, every field.
- Every field, every time. If a field doesn't apply (e.g., bitmap_authz_used for an OPEN endpoint), it's false or empty, not omitted. Optional fields make ad-hoc queries painful (see the struct sketch below).
- decision_reason is enum-only. Free-form strings here would be the death of dashboards. New reasons require code review (Chapter 3).
- Trace IDs are present. trace_id and span_id are pulled from the OpenTelemetry context, so the log line stitches to traces in our backend without join logic.
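The "every field, every time" rule is easy to enforce mechanically. A sketch of the record as a Go struct, mirroring the sample line above — the struct name is illustrative, but note there is no omitempty anywhere, so not-applicable fields serialize as false or empty instead of disappearing:

// AuthDecision is a sketch of the one-line-per-request record.
// No `omitempty`: zero values are always serialized, so ad-hoc queries
// never have to handle missing keys.
type AuthDecision struct {
    RequestID           string   `json:"request_id"`
    TraceID             string   `json:"trace_id"`
    SpanID              string   `json:"span_id"`
    URI                 string   `json:"uri"`
    Method              string   `json:"method"`
    Slug                string   `json:"slug"`
    TenantID            string   `json:"tenant_id"`
    EndpointType        string   `json:"endpoint_type"`
    IdentityType        string   `json:"identity_type"`
    IdentityID          string   `json:"identity_id"`
    AuthMethod          string   `json:"auth_method"`
    DecisionReason      string   `json:"decision_reason"` // enum values only
    Outcome             string   `json:"outcome"`
    Status              int      `json:"status"`
    DurationMs          float64  `json:"duration_ms"`
    AuthnMs             float64  `json:"authn_ms"`
    AuthzMs             float64  `json:"authz_ms"`
    JWTCacheHit         bool     `json:"jwt_cache_hit"`
    BitmapAuthzUsed     bool     `json:"bitmap_authz_used"`
    GrantedAccessLevels []string `json:"granted_access_levels"`
    TokenRevoked        bool     `json:"token_revoked"`
    Slow                bool     `json:"slow"`
}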
Joining NGINX and Auth Service
request_id is generated by NGINX ($request_id) and forwarded to the Auth Service via the subrequest header X-Request-ID. Both log lines carry it. A typical investigation:
1. user reports 401 at 12:34:56 UTC
2. grep tenant_id="mt_prod" identity_id="..." in NGINX logs around the time
3. capture request_id
4. grep that request_id in Auth Service logs
5. read decision_reason — full story
The whole graph of "client → ingress → NGINX → Auth Service → upstream" stitches back together by request_id.
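On the Auth Service side, picking up that join key is one piece of middleware. A minimal sketch, assuming gin and google/uuid; the context key and the fallback behaviour are illustrative, not the exact production code:

import (
    "context"

    "github.com/gin-gonic/gin"
    "github.com/google/uuid"
)

type ctxKey string

const requestIDKey ctxKey = "request_id"

// RequestID copies the NGINX-generated $request_id (forwarded as X-Request-ID
// on the auth subrequest) into the request context, so the AUTH_DECISION log
// line and the span attributes carry the same join key.
func RequestID() gin.HandlerFunc {
    return func(c *gin.Context) {
        id := c.GetHeader("X-Request-ID")
        if id == "" {
            id = uuid.NewString() // fallback for calls that bypass NGINX (tests, port-forwards)
        }
        ctx := context.WithValue(c.Request.Context(), requestIDKey, id)
        c.Request = c.Request.WithContext(ctx)
        c.Writer.Header().Set("X-Request-ID", id) // echo back for client-side correlation
        c.Next()
    }
}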
Tracing
OpenTelemetry instrumentation covers the Auth Service today; NGINX will join via the otel module once it's compiled in (we're tracking that as a future improvement). The Auth Service span looks like:
SPAN: pth-auth-service POST /auth
├── ATTRIBUTES
│ ├── http.route = "/auth"
│ ├── auth.tenant_id = "mt_prod"
│ ├── auth.endpoint_type = "ACCESS_CONTROLLED"
│ ├── auth.outcome = "allow"
│ └── auth.decision_reason = "ACCESS_LEVEL_MATCH"
├── EVENTS
│ ├── jwt.cache.hit
│ ├── route.cache.hit
│ └── bitmap.match
└── DURATION 2.5ms
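The span gets its attributes and events through plain OTel API calls inside the handler. A sketch, with hardcoded values standing in for the real decision data:

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func authSpan(ctx context.Context) {
    ctx, span := otel.Tracer("pth-auth-service").Start(ctx, "POST /auth")
    defer span.End()

    // request-scoped attributes, known up front
    span.SetAttributes(
        attribute.String("http.route", "/auth"),
        attribute.String("auth.tenant_id", "mt_prod"),
        attribute.String("auth.endpoint_type", "ACCESS_CONTROLLED"),
    )

    // cache behaviour is recorded as events rather than child spans:
    // cheap, and visible on the single auth span
    span.AddEvent("jwt.cache.hit")
    span.AddEvent("route.cache.hit")
    span.AddEvent("bitmap.match")

    // outcome attributes are set once the decision is made
    span.SetAttributes(
        attribute.String("auth.outcome", "allow"),
        attribute.String("auth.decision_reason", "ACCESS_LEVEL_MATCH"),
    )
}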
The trace context propagates to upstream services via standard W3C trace headers. So a single trace shows: client → ingress → NGINX (eventually, via otel module) → Auth Service span → upstream service span(s) → DB calls inside the upstream. The whole story.
We don't turn on full sampling. 1% sampling at edges, 100% sampling for spans tagged auth.outcome="deny". The deny path is where the interesting investigations happen; sampling it fully gives us forensic detail without exploding storage.
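The 1% baseline is just a ratio sampler on the tracer provider. The deny override can't be decided at span start, since the outcome isn't known yet, so it is typically applied downstream, for example as a tail-sampling policy in the collector. A sketch of the in-process half, assuming the standard OTel Go SDK:

import (
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// 1% head sampling as the baseline; child spans follow their parent's decision.
// The 100%-on-deny rule is applied after spans finish, once auth.outcome is known.
func newTracerProvider() *sdktrace.TracerProvider {
    return sdktrace.NewTracerProvider(
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01))),
    )
}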
Health probes
Three K8s probe endpoints, each with a different purpose.
/livez — process is alive
func Liveness(c *gin.Context) {
c.JSON(http.StatusOK, gin.H{"status": "ok"})
}
Returns 200, always. The contract: as long as this handler runs, the process isn't deadlocked. K8s only kills the pod if the request times out (handler doesn't run at all).
What /livez does not check:
- It does not check the trie.
- It does not check Redis.
- It does not check Postgres.
This is intentional. A pod whose Redis connection died can still serve cache-hot requests correctly. Killing it on a Redis outage is exactly the wrong thing to do — you turn a cache-hit-100%-but-Redis-down state into a cluster-wide rolling restart.
/readyz — pod can take traffic
func Readiness(c *gin.Context) {
if !trieLoaded() {
c.JSON(503, gin.H{"status": "trie not loaded"})
return
}
if revocationExpected() && !revocationServiceReady() {
c.JSON(503, gin.H{"status": "revocation cache not ready"})
return
}
c.JSON(200, gin.H{"status": "ok"})
}
Two gates:
- Trie loaded. Without it, we can't classify any endpoint. The pod is useless until the trie is in memory.
- Revocation cache ready (if revocation is enabled). Without it, fail-closed designs would deny everything; fail-open designs would miss every revocation. Either way, not ready.
Notably not gated on Redis health: a pod that lost Redis after readiness flips green stays ready. Refreshes fail loudly via Slack, but live traffic isn't disrupted.
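A sketch of how those gates might be tracked, assuming simple latched atomics behind the trieLoaded() / revocationServiceReady() helpers (the real service may gate differently):

import "sync/atomic"

var (
    trieReady       atomic.Bool // set once the initial trie load from Postgres succeeds
    revocationReady atomic.Bool // set once the revocation cache warm-up completes
)

func trieLoaded() bool             { return trieReady.Load() }
func revocationServiceReady() bool { return revocationReady.Load() }

// Both flags are set exactly once, at the end of a successful startup step,
// and never cleared. That is why a pod that loses Redis after readiness flips
// green stays ready: /readyz gates on startup state, not on live dependencies.
func markTrieLoaded()      { trieReady.Store(true) }
func markRevocationReady() { revocationReady.Store(true) }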
/healthz — deep health
func Health(c *gin.Context) {
pgErr := pingPostgres(c.Request.Context())
rdErr := pingRedis(c.Request.Context())
if pgErr != nil || rdErr != nil {
c.JSON(503, gin.H{
"status": "degraded",
"postgres": errString(pgErr),
"redis": errString(rdErr),
})
return
}
c.JSON(200, gin.H{"status": "ok"})
}
This one does depend on Redis and Postgres. It's not used by Kubernetes — it's used by external monitoring (Pingdom, status pages). The distinction matters:
- K8s probes determine traffic routing. They should be tolerant — every false-positive failure pulls a pod out of service.
- Monitoring probes determine alerting. They should be strict — they tell humans something is wrong, not the load balancer.
A common mistake is wiring /healthz to readinessProbe. Don't. You will pull pods out of the rotation on a transient Redis blip and convert a degraded state into an outage.
stateDiagram-v2
[*] --> Booting
Booting --> NotReady: trie empty
NotReady --> Ready: trie loaded AND<br/>(revocation disabled OR revocation cache warm)
Ready --> Degraded: Redis stream XREAD failure
Degraded --> Ready: reconnect
Ready --> Live: /livez always 200
Ready --> DeepCheck: /healthz pings PG + Redis
Degraded mode
The gateway is built to tolerate three specific kinds of trouble:
Redis is slow or down
- Revocation stream consumer fails. The streamDegraded atomic flips true. Slack alert fires once. Hot path keeps working — local cache is in memory. (See the consumer sketch after this list.)
- Pub/Sub subscriber reconnects in the background. On reconnect, the subscriber re-subscribes. The periodic cleanup goroutine resyncs the ZSET when it next runs.
- SA version sync fails. Local SA version map is unchanged. Tokens continue to validate against the last known versions until the next successful sync. Slack alert fires.
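A sketch of what the stream consumer's degrade/recover loop might look like, assuming go-redis and the degradeFlag type shown later in this chapter; the stream name and applyRevocation are illustrative:

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

func consumeRevocations(ctx context.Context, rdb *redis.Client, flag *degradeFlag) {
    lastID := "$" // start from new entries; the periodic ZSET resync covers anything missed
    for ctx.Err() == nil {
        res, err := rdb.XRead(ctx, &redis.XReadArgs{
            Streams: []string{"token-revocations", lastID},
            Block:   5 * time.Second,
        }).Result()
        if err != nil && err != redis.Nil {
            flag.MarkDegraded(err.Error()) // one Slack alert; hot path keeps serving from local cache
            time.Sleep(time.Second)        // back off, then try again
            continue
        }
        flag.MarkRecovered() // no-op unless we were degraded
        for _, stream := range res {
            for _, msg := range stream.Messages {
                applyRevocation(msg.Values) // update the in-memory revocation cache
                lastID = msg.ID
            }
        }
    }
}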
Postgres is slow or down
- Trie reload fails. Existing trie remains in memory. Slack alert. Hot path is unaffected.
- New pods cannot start (initial trie load blocks readiness). Existing pods serve.
- This is a partial outage: scaling up is broken, current capacity still works.
Slack is slow or down
- Alerts are fire-and-forget goroutines. We don't block the hot path on Slack.
- If Slack is down, the posting goroutine gives up after a configured timeout (5s) and the alert is dropped. We don't retry forever and OOM the pod. (Sketch below.)
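A minimal sketch of that fire-and-forget path, assuming a plain incoming-webhook POST; alertAsync and postToSlack are illustrative names, not the production package:

import (
    "context"
    "log"
    "net/http"
    "strconv"
    "strings"
    "time"
)

var webhookURL = "https://hooks.slack.com/services/..." // injected from config in practice

// alertAsync never blocks the caller: the post runs in its own goroutine with
// a hard 5s budget, and on failure the alert is dropped rather than retried,
// so a Slack outage can't pile up goroutines and OOM the pod.
func alertAsync(msg string) {
    go func() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        if err := postToSlack(ctx, msg); err != nil {
            log.Printf("slack alert dropped: %v", err)
        }
    }()
}

func postToSlack(ctx context.Context, msg string) error {
    body := strings.NewReader(`{"text":` + strconv.Quote(msg) + `}`)
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, webhookURL, body)
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    return resp.Body.Close()
}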
The alert tree
flowchart TD
Boot[startup] --> A1{revocation Redis OK?}
A1 -->|no| Alert1[Slack: TokenRevocationService Redis client not initialized<br/>readyz=503]
Run[runtime] --> A2{XREAD error?}
A2 -->|yes| Alert2[Slack: stream degraded once]
A2 -->|recover| Alert3[Slack: stream recovered]
Run --> A3{localCache size > MAX?}
A3 -->|yes| Alert4[Slack: cache overflow once<br/>fall back to ZSCORE]
Run --> A4{RSA key missing for tenant?}
A4 -->|yes| Alert5[Slack: per-tenant dedup]
The Slack-alerter pattern
Three rules we apply rigorously to alert code, because we learned them the hard way:
Rule 1: Atomic-bool dedup per state transition
type degradeFlag struct {
on atomic.Bool
}
func (d *degradeFlag) MarkDegraded(reason string) {
if !d.on.Swap(true) {
slack.Alert("auth.degraded", reason)
}
}
func (d *degradeFlag) MarkRecovered() {
if d.on.Swap(false) {
slack.Alert("auth.recovered", "")
}
}
The Swap returns the previous value. If it was already true, we don't re-alert. We alert once on the transition, and once on the recovery.
Without this, a Redis outage produces thousands of Slack messages per pod per minute. The first time it happened, oncall threw their phone across the room.
Rule 2: Per-(pod × cause) dedup, not per-cause
A 100-pod deployment hitting the same RSA key misconfiguration alerts 100 times. That's correct: each pod is a separate runtime, each could have its own state, each is a separate alert source. We tag every alert with the pod's hostname so you can see in Slack whether it's a single-pod or fleet-wide problem.
Rule 3: Deployment-tag prefix
Every alert is prefixed with [customer-env]:
[acme-prod] auth.degraded TokenRevocationService Redis stream degraded
This makes a single Slack channel viable for many environments. Without the prefix, you can't tell at a glance whether the alert is from staging or prod.
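Put together, a sketch of the alert envelope, assuming the deployment tag comes from config and reusing the alertAsync sketch above; alertOnce and the dedup map are illustrative:

import (
    "fmt"
    "os"
    "sync"
)

var (
    deploymentTag = os.Getenv("DEPLOYMENT_TAG") // e.g. "acme-prod"
    alertedCauses sync.Map                      // cause -> struct{}, scoped to this pod
)

// alertOnce fires at most one message per cause for the lifetime of this pod.
// Every pod alerts independently (Rule 2), every message carries the
// [deployment] prefix (Rule 3), and the hostname makes single-pod vs
// fleet-wide problems obvious at a glance in the channel.
func alertOnce(cause, msg string) {
    if _, already := alertedCauses.LoadOrStore(cause, struct{}{}); already {
        return
    }
    host, _ := os.Hostname()
    alertAsync(fmt.Sprintf("[%s] %s %s (pod %s)", deploymentTag, cause, msg, host))
}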
What we don't alert on
- Individual auth failures. Login throttling, expired tokens, denied requests — those are normal and high-volume. They're in dashboards, not Slack.
- High latency on a single request. Latency alerts go on rolling p99, not on individual outliers.
- Anything below "the gateway behaves correctly but degraded." If the system self-heals quietly, we don't page humans.
NGINX-specific operations
A few NGINX-isms worth calling out.
Graceful shutdown
The chart's deployment does this on preStop:
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- |
echo "[preStop] draining 15 seconds..."
sleep 15
echo "[preStop] nginx -s quit..."
nginx -s quit
while pgrep -x nginx > /dev/null; do sleep 1; done
Fifteen seconds of sleep first, so endpoint removal has time to propagate and new requests stop arriving; then nginx -s quit (graceful — finish in-flight requests, then exit); then wait for all worker processes to exit. Combined with terminationGracePeriodSeconds: 60, we have ~60 seconds of total budget for a clean shutdown. Without this, rolling deploys produced visible spikes of 502s in client logs.
Worker tuning
worker_processes auto;
worker_rlimit_nofile 65535;
worker_shutdown_timeout 30s;
events {
worker_connections 10240;
use epoll;
multi_accept on;
accept_mutex off;
}
worker_processes auto scales to CPU count. accept_mutex off is the modern default — let kernel epoll handle accept distribution. worker_connections 10240 is per worker, so a 4-core pod handles ~40k concurrent connections.
Upstream keepalive
upstream auth_service {
server auth-service-golang.<ns>.svc.cluster.local:80
max_fails=3 fail_timeout=30s;
keepalive 64;
keepalive_requests 1000;
keepalive_timeout 60s;
}
keepalive 64 is per worker and caps the idle connections each worker keeps open, so a 4-worker NGINX holds up to 256 warm connections to the Auth Service. Without keepalive, every subrequest opens a fresh TCP connection — fatal at any real RPS. The first time we deployed without it, p99 auth time was 80 ms. With it: 3 ms.
max_fails=3 fail_timeout=30s marks the upstream server "down" after 3 failed attempts within a 30-second window and stops sending it traffic for the next 30 seconds. Combined with the retry config in auth.conf, this gives smooth failover when one auth pod is sick.
A picture of where the time goes
gantt
title Single request as seen by NGINX
dateFormat X
axisFormat %s ms
section NGINX
receive + route match : 0, 1
auth subrequest : 1, 5
proxy upstream : 6, 18
log JSON : 24, 1
The auth subrequest is the smallest thing on the timeline. That's not by accident. Chapter 8's caches are the reason it stays small. Chapter 9 is what tells you when it stops staying small.
What's next
Chapter 10 is the retrospective: what we'd build differently if we did it from scratch, what tools we wish we'd added on day one, and the maturity progression from "auth library in every service" to "production-grade gateway." It's the chapter you write after operating the thing for two years, and it's the one I'd want to read if I were starting from scratch.