Akarshan Gandotra

Part 9 — Operating the gateway: logs, traces, health, and degraded mode

The first eight chapters of this series have been about building an Auth Gateway. This one is about living with one.

A gateway in front of every authenticated request is a force multiplier — for your platform and for your oncall pager. If something is broken, it's broken everywhere at once. So observability isn't a Chapter 9 thing; it's a Chapter 0 thing. We just describe it last because there's enough mechanism to talk about that you need the rest of the series first.

This chapter covers the four things you need to be able to do at 3 AM:

  1. Read a single log line and understand what happened.
  2. Trace a slow request from edge to upstream.
  3. Tell whether a pod is alive, ready, or in deep trouble.
  4. Get an alert once — not once per pod per second — when something degrades.

The log line

There are exactly two structured log lines per protected request: one from NGINX, one from the Auth Service. They share request_id, so you can join them.

The NGINX line

Every request is logged as JSON to stdout via NGINX's log_format main:

log_format main escape=json '{
  "logType":"NGINX_LOGS",
  "request_id":"$request_id",
  "time_local":"$time_iso8601",
  "remote_addr":"$remote_addr",
  "request_method":"$request_method",
  "request_uri":"$request_uri",
  "request_path":"$uri",
  "slug":"$location_path",
  "product":"$product",
  "microservice":"$microservice",
  "status":"$status",
  "status_class":"$status_class",
  "request_time_ms":"$request_time_millis",
  "service_request_time_ms":"$upstream_response_time_millis",
  "service_connect_time":"$upstream_connect_time",
  "auth_request_time_ms":"$auth_time_millis",
  "auth_connect_time":"$auth_connect_time",
  "body_bytes_sent":"$body_bytes_sent",
  "http_referer":"$http_referer",
  "http_user_agent":"$http_user_agent",
  "http_x_forwarded_for":"$http_x_forwarded_for",
  "http_host":"$http_host",
  "tenant_id":"$tenant_id",
  "tenant_namespace":"$tenant_namespace",
  "identity_id":"$identity_id",
  "identity_type":"$identity_type",
  "auth_error_message":"$auth_error_message",
  "auth_error_code":"$auth_error_code"
}';

A few fields that matter more than they look:

  • auth_request_time_ms — how long the auth subrequest took. We graph this. We page on the p99 going above 50 ms.
  • service_request_time_ms — how long the upstream took, excluding the auth subrequest. Sequential, not overlapping (Chapter 2).
  • status_class — 1xx/2xx/3xx/4xx/5xx. Faster than parsing status for dashboard breakdowns.
  • tenant_id — the resolved tenant. Always grep by tenant first.
  • auth_error_code + auth_error_message — populated on deny. Empty on allow.
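The derived variables above don't come for free; a field like $status_class has to be computed somewhere in the config. A minimal sketch of one way to do it with an http-level map block (the regexes and the fallthrough-to-5xx choice are assumptions, not our exact config):

```nginx
# http context — derive $status_class from the response status.
map $status $status_class {
    ~^1     "1xx";
    ~^2     "2xx";
    ~^3     "3xx";
    ~^4     "4xx";
    default "5xx";
}
```

Because map is evaluated lazily, this costs nothing on requests whose logs are never written.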

Two extra log formats for the fail paths

When NGINX hits @auth_unavailable or @upstream_unavailable, we log to a different format:

log_format auth_unavailable     '...auth-specific fields...';
log_format upstream_unavailable '...upstream-specific fields...';

location @auth_unavailable {
  internal;
  access_log /dev/stdout auth_unavailable;
  return 503 ...;
}

The reason: when something is on fire, you want it isolated in its own log stream. Dashboards built off main get drowned by 200s; a dashboard against auth_unavailable shows you exactly the broken bucket without filtering.

The Auth Service line

The Auth Service emits exactly one AUTH_DECISION log per request:

{
  "level":"info",
  "ts":"2026-05-01T12:34:56.789Z",
  "logger":"AUTH_DECISION",
  "request_id":"550e8400-e29b-41d4-a716-446655440000",
  "trace_id":"4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id":"00f067aa0ba902b7",
  "uri":"/api/v1/users",
  "method":"GET",
  "slug":"user-service",
  "tenant_id":"mt_prod",
  "endpoint_type":"ACCESS_CONTROLLED",
  "identity_type":"USER",
  "identity_id":"user@example.com",
  "auth_method":"bearer_token",
  "decision_reason":"ACCESS_LEVEL_MATCH",
  "outcome":"allow",
  "status":200,
  "duration_ms":2.5,
  "authn_ms":1.8,
  "authz_ms":0.7,
  "jwt_cache_hit":true,
  "bitmap_authz_used":true,
  "granted_access_levels":["product:admin"],
  "token_revoked":false,
  "slow":false
}

This is the single most important artifact in the whole gateway. It is, in dry terms, our auth audit log. In practical terms it's the thing oncall greps when anything goes weird.

Design rules we apply to it religiously:

  • Exactly one line per request. No "starting auth", "authenticated", "authorizing", "decided" — those make the storyline split across N entries that you have to stitch back together. One line, one decision, every field.
  • Every field, every time. If a field doesn't apply (e.g., bitmap_authz_used for an OPEN endpoint), it's false or empty, not omitted. Optional fields make ad-hoc queries painful.
  • decision_reason is enum-only. Free-form strings here would be the death of dashboards. New reasons require code review (Chapter 3).
  • Trace IDs are present. trace_id and span_id are pulled from the OpenTelemetry context, so the log line stitches to traces in our backend without join logic.

Joining NGINX and Auth Service

request_id is generated by NGINX ($request_id) and forwarded to the Auth Service via the subrequest header X-Request-ID. Both log lines carry it. A typical investigation:

1. user reports 401 at 12:34:56 UTC
2. grep tenant_id="mt_prod" identity_id="..." in NGINX logs around the time
3. capture request_id
4. grep that request_id in Auth Service logs
5. read decision_reason — full story

The whole graph of "client → ingress → NGINX → Auth Service → upstream" stitches back together by request_id.
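The plumbing that makes the join possible is a couple of lines in the auth subrequest location — a sketch assuming an internal /auth location proxying to the auth_service upstream (the location name is illustrative):

```nginx
location = /auth {
    internal;
    proxy_pass http://auth_service;
    # Forward the NGINX-generated request ID so both log lines share it.
    proxy_set_header X-Request-ID $request_id;
    # The auth subrequest only needs headers, not the request body.
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
}
```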

Tracing

OpenTelemetry instrumentation runs across both NGINX (via the otel module if compiled in; we're tracking that as a future improvement) and the Auth Service. The Auth Service span looks like:

SPAN: pth-auth-service POST /auth
├── ATTRIBUTES
│   ├── http.route = "/auth"
│   ├── auth.tenant_id = "mt_prod"
│   ├── auth.endpoint_type = "ACCESS_CONTROLLED"
│   ├── auth.outcome = "allow"
│   └── auth.decision_reason = "ACCESS_LEVEL_MATCH"
├── EVENTS
│   ├── jwt.cache.hit
│   ├── route.cache.hit
│   └── bitmap.match
└── DURATION 2.5ms

The trace context propagates to upstream services via standard W3C trace headers. So a single trace shows: client → ingress → NGINX (eventually, via otel module) → Auth Service span → upstream service span(s) → DB calls inside the upstream. The whole story.

We don't turn on full sampling. 1% sampling at edges, 100% sampling for spans tagged auth.outcome="deny". The deny path is where the interesting investigations happen; sampling it fully gives us forensic detail without exploding storage.

Health probes

Three K8s probe endpoints, each with a different purpose.

/livez — process is alive

func Liveness(c *gin.Context) {
    c.JSON(http.StatusOK, gin.H{"status": "ok"})
}

Returns 200, always. The contract: as long as this handler runs, the process isn't deadlocked. K8s only kills the pod if the request times out (handler doesn't run at all).

What /livez does not check:

  • It does not check the trie.
  • It does not check Redis.
  • It does not check Postgres.

This is intentional. A pod whose Redis connection died can still serve cache-hot requests correctly. Killing it on a Redis outage is exactly the wrong thing to do — you turn a cache-hit-100%-but-Redis-down state into a cluster-wide rolling restart.

/readyz — pod can take traffic

func Readiness(c *gin.Context) {
    if !trieLoaded() {
        c.JSON(503, gin.H{"status": "trie not loaded"})
        return
    }
    if revocationExpected() && !revocationServiceReady() {
        c.JSON(503, gin.H{"status": "revocation cache not ready"})
        return
    }
    c.JSON(200, gin.H{"status": "ok"})
}

Two gates:

  1. Trie loaded. Without it, we can't classify any endpoint. The pod is useless until the trie is in memory.
  2. Revocation cache ready (if revocation is enabled). Without it, fail-closed designs would deny everything; fail-open designs would miss every revocation. Either way, not ready.

Notably not gated on Redis health: a pod that lost Redis after readiness flips green stays ready. Refreshes fail loudly via Slack, but live traffic isn't disrupted.

/healthz — deep health

func Health(c *gin.Context) {
    pgErr := pingPostgres(c.Request.Context())
    rdErr := pingRedis(c.Request.Context())
    if pgErr != nil || rdErr != nil {
        c.JSON(503, gin.H{
            "status": "degraded",
            "postgres": errString(pgErr),
            "redis":    errString(rdErr),
        })
        return
    }
    c.JSON(200, gin.H{"status": "ok"})
}

This one does depend on Redis and Postgres. It's not used by Kubernetes — it's used by external monitoring (Pingdom, status pages). The distinction matters:

  • K8s probes determine traffic routing. They should be tolerant — every false-positive failure pulls a pod out of service.
  • Monitoring probes determine alerting. They should be strict — they tell humans something is wrong, not the load balancer.

A common mistake is wiring /healthz to readinessProbe. Don't. You will pull pods out of the rotation on a transient Redis blip and convert a degraded state into an outage.
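In the pod spec, that means each probe gets its matching endpoint and nothing more — a hedged sketch, since the real chart's port, periods, and thresholds may differ:

```yaml
livenessProbe:
  httpGet: { path: /livez, port: 8080 }   # port is an assumption
  periodSeconds: 10
  timeoutSeconds: 2      # only a hung handler fails this probe
readinessProbe:
  httpGet: { path: /readyz, port: 8080 }
  periodSeconds: 5
  failureThreshold: 3
# /healthz is deliberately absent here — it belongs to external monitoring.
```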

stateDiagram-v2
    [*] --> Booting
    Booting --> NotReady: trie empty
    NotReady --> Ready: trie loaded AND<br/>(revocation disabled OR revocation cache warm)
    Ready --> Degraded: Redis stream XREAD failure
    Degraded --> Ready: reconnect
    Ready --> Live: /livez always 200
    Ready --> DeepCheck: /healthz pings PG + Redis

Degraded mode

The gateway is built to tolerate three specific kinds of trouble:

Redis is slow or down

  • Revocation stream consumer fails. streamDegraded atomic flips true. Slack alert fires once. Hot path keeps working — local cache is in memory.
  • The Pub/Sub subscriber reconnects and re-subscribes in the background; the periodic cleanup goroutine resyncs the ZSET the next time it runs.
  • SA version sync fails. Local SA version map is unchanged. Tokens continue to validate against the last known versions until the next successful sync. Slack alert fires.

Postgres is slow or down

  • Trie reload fails. Existing trie remains in memory. Slack alert. Hot path is unaffected.
  • New pods cannot start (initial trie load blocks readiness). Existing pods serve.
  • This is a partial outage: scaling up is broken, current capacity still works.

Slack is slow or down

  • Alerts are fire-and-forget goroutines. We don't block the hot path on Slack.
  • If Slack is down, alerts are queued in goroutines for a configured timeout (5s) then dropped. We don't retry forever and OOM the pod.

The alert tree

flowchart TD
    Boot[startup] --> A1{revocation Redis OK?}
    A1 -->|no| Alert1[Slack: TokenRevocationService Redis client not initialized<br/>readyz=503]
    Run[runtime] --> A2{XREAD error?}
    A2 -->|yes| Alert2[Slack: stream degraded once]
    A2 -->|recover| Alert3[Slack: stream recovered]
    Run --> A3{localCache size > MAX?}
    A3 -->|yes| Alert4[Slack: cache overflow once<br/>fall back to ZSCORE]
    Run --> A4{RSA key missing for tenant?}
    A4 -->|yes| Alert5[Slack: per-tenant dedup]

The Slack-alerter pattern

Three rules we apply rigorously to alert code, because we learned them the hard way:

Rule 1: Atomic-bool dedup per state transition

type degradeFlag struct {
    on atomic.Bool
}

func (d *degradeFlag) MarkDegraded(reason string) {
    if !d.on.Swap(true) {
        slack.Alert("auth.degraded", reason)
    }
}

func (d *degradeFlag) MarkRecovered() {
    if d.on.Swap(false) {
        slack.Alert("auth.recovered", "")
    }
}

The Swap returns the previous value. If it was already true, we don't re-alert. We alert once on the transition, and once on the recovery.

Without this, a Redis outage produces thousands of Slack messages per pod per minute. The first time it happened, oncall threw their phone across the room.

Rule 2: Per-(pod × cause) dedup, not per-cause

A 100-pod deployment hitting the same RSA key misconfiguration alerts 100 times. That's correct: each pod is a separate runtime, each could have its own state, each is a separate alert source. We tag every alert with the pod's hostname so you can see in Slack whether it's a single-pod or fleet-wide problem.

Rule 3: Deployment-tag prefix

Every alert is prefixed with [customer-env]:

[acme-prod] auth.degraded TokenRevocationService Redis stream degraded

This makes a single Slack channel viable for many environments. Without the prefix, you can't tell at a glance whether the alert is from staging or prod.
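Rules 2 and 3 together mean every message carries both the environment tag and the pod hostname. A trivial sketch (the format and the hostname source are assumptions):

```go
package main

import (
	"fmt"
	"os"
)

// formatAlert applies the [customer-env] prefix plus the pod hostname
// tag, so Slack shows at a glance which env and which pod fired.
func formatAlert(env, host, kind, msg string) string {
	return fmt.Sprintf("[%s] %s %s %s", env, host, kind, msg)
}

func main() {
	host, _ := os.Hostname()
	fmt.Println(formatAlert("acme-prod", host, "auth.degraded",
		"TokenRevocationService Redis stream degraded"))
}
```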

What we don't alert on

  • Individual auth failures. Login throttling, expired tokens, denied requests — those are normal and high-volume. They're in dashboards, not Slack.
  • High latency on a single request. Latency alerts go on rolling p99, not on individual outliers.
  • Anything below "the gateway behaves correctly but degraded." If the system self-heals quietly, we don't page humans.

NGINX-specific operations

A few NGINX-isms worth calling out.

Graceful shutdown

The chart's deployment does this on preStop:

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          echo "[preStop] draining 15 seconds..."
          sleep 15
          echo "[preStop] nginx -s quit..."
          nginx -s quit
          while pgrep -x nginx > /dev/null; do sleep 1; done

15 seconds of drain, then nginx -s quit (graceful — drain in-flight requests, then exit), then wait for all worker processes to finish. Combined with terminationGracePeriodSeconds: 60, we have ~60 seconds total budget for clean shutdown. Without this, rolling deploys produced visible spikes of 502s in client logs.

Worker tuning

worker_processes  auto;
worker_rlimit_nofile 65535;
worker_shutdown_timeout 30s;

events {
  worker_connections 10240;
  use epoll;
  multi_accept on;
  accept_mutex off;
}

worker_processes auto scales to CPU count. accept_mutex off is the modern default — let kernel epoll handle accept distribution. worker_connections 10240 is per worker, so a 4-core pod handles ~40k concurrent connections.

Upstream keepalive

upstream auth_service {
  server auth-service-golang.<ns>.svc.cluster.local:80
         max_fails=3 fail_timeout=30s;
  keepalive 64;
  keepalive_requests 1000;
  keepalive_timeout 60s;
}

keepalive 64 is per worker. A 4-worker NGINX with 64 keepalive maintains 256 keepalive connections to the Auth Service. Without keepalive, every subrequest opens a fresh TCP connection — fatal at any real RPS. The first time we deployed without it, p99 auth time was 80 ms. With it: 3 ms.

max_fails=3 fail_timeout=30s marks the upstream "down" after 3 consecutive failures and stops sending traffic for 30 seconds. Combined with the retry config in auth.conf, this gives smooth failover when one auth pod is sick.

A picture of where the time goes

gantt
    title Single request as seen by NGINX
    dateFormat  X
    axisFormat %s ms
    section NGINX
    receive + route match : 0, 1
    auth subrequest       : 1, 6
    proxy upstream        : 6, 24
    log JSON              : 24, 25

The auth subrequest is the smallest thing on the timeline. That's not by accident. Chapter 8's caches are the reason it stays small. Chapter 9 is what tells you when it stops staying small.

What's next

Chapter 10 is the retrospective: what we'd build differently if we did it from scratch, what tools we wish we'd added on day one, and the maturity progression from "auth library in every service" to "production-grade gateway." It's the chapter you write after operating the thing for two years, and it's the one I'd want to read if I were starting from scratch.
