DEV Community: Oleksandr Kuryzhev

kubectl rollout restart vs delete pod: Which Is Safer?

Oleksandr Kuryzhev — Mon, 27 Jul 2026 07:01:41 +0000

Originally published on kuryzhev.cloud

When you face this choice

It's 2 AM, a pod is wedged — stale config loaded into memory, a leaked DB connection pool, a sidecar that hung on startup and never recovered. The image hasn't changed. The manifest hasn't changed. You just need the process to come back to life without a full redeploy. This is the exact moment where the kubectl rollout restart vs delete pod decision matters, and most engineers reach for whichever command muscle memory gives them first.

There are two instinctive moves here. You either run kubectl rollout restart deployment/<name>, or you go blunt with kubectl delete pod -l app=<name>. Both "fix" the stuck pod. Both look identical in a Slack incident channel. But they behave completely differently under the hood, and picking the wrong one at the wrong replica count is how a debugging session turns into a customer-facing outage.

I've watched both mistakes happen in production — a rollout restart that silently did nothing useful on a single-replica deployment, and a delete-pod command with a loose label selector that took out a canary deployment nobody meant to touch. Neither is inherently wrong. They're just wrong for different situations, and knowing which is which under pressure is the actual skill.

Option A — kubectl rollout restart

Under the hood, kubectl rollout restart doesn't "restart" anything directly. It patches spec.template.metadata.annotations on the Deployment with a fresh kubectl.kubernetes.io/restartedAt timestamp. That change to the pod template triggers the Deployment controller to roll out a brand-new ReplicaSet, exactly the same way a new image tag would — respecting maxUnavailable and maxSurge along the way.

That's the whole trick, and it's why this is the mechanism I default to. It's auditable: you can see the new revision in kubectl rollout history. It's reversible: kubectl rollout undo deployment/<name> takes you straight back. And it's paced: the controller never takes down more pods at once than maxUnavailable allows. One correction to a common belief, though: a rolling update does not consult your PodDisruptionBudget. PDBs only gate the Eviction API (node drains, autoscaler scale-downs), so the protection during a restart comes from the Deployment's own strategy settings, not from the PDB.

Pros: zero-downtime when replicas ≥ 2, paced by maxUnavailable/maxSurge, full rollback path, works cleanly in CI/CD with kubectl rollout status as a gate.

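Cons: on a replicas: 1 deployment it can still cause a brief outage. With the default RollingUpdate settings (maxSurge 25% and maxUnavailable 25%, which round to 1 and 0 for a single replica) the replacement pod starts before the old one terminates, but with strategy Recreate, maxSurge: 0, or no readiness probe there is no ready pod to shift traffic to. It's also slower when you genuinely just want one specific pod dead right now. And it only works on Deployments, DaemonSets, and StatefulSets; you can't run it against a bare Pod or a Job.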

Watch out: if you're on kubectl older than v1.15, rollout restart doesn't exist as a subcommand. You'll need the manual annotation patch instead — same effect, uglier syntax:

# Manual equivalent for kubectl < v1.15 without the rollout restart subcommand
kubectl patch deployment/<name> -p '{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"'$(date +%Y-%m-%dT%H:%M:%S%z)'"}}}}}'

Option B — kubectl delete pod

This one is exactly as blunt as it sounds. kubectl delete pod -l app=foo deletes every pod matching that label: each one gets SIGTERM and up to terminationGracePeriodSeconds to exit, or is killed immediately if you override that with --force --grace-period=0. The owning controller, usually a ReplicaSet, notices the pod count dropped and recreates it from the existing spec. No new revision. No history entry. No rollback path.

It's justified more often than purists admit. If a rollout is already stuck — say a bad readiness probe is blocking progress — rollout restart won't help you; you need to kill the offending pod directly. It also works on any pod regardless of what's managing it, which makes it the tool for single-replica debugging where you just need a fresh process.

Pros: instant, controller-agnostic, useful as an escape hatch when rollout mechanisms are already jammed.

Cons: zero rollout tracking, bypasses your PDB entirely (a direct delete never goes through the Eviction API, with or without --force --grace-period=0), and, the classic footgun, a loose label selector can match pods well beyond the one deployment you intended.

# The label-selector footgun in action
$ kubectl delete pod -l app=api -n prod
pod "api-7d9f6c5-abc12" deleted
pod "api-7d9f6c5-def34" deleted
pod "api-canary-99xz1" deleted   # <-- oops, canary shared the same label

# No rollout history entry, no easy undo — just recreated pods
$ kubectl get rs -n prod -l app=api
# same ReplicaSet hash as before — proves this was NOT a real rollout

I stopped aliasing kdel='kubectl delete pod' after exactly this scenario happened to a teammate mid-incident — the canary deployment shared the app=api label and got wiped along with the real target. Always run --dry-run=client first when you're not 100% sure your selector is scoped tightly enough.
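
A minimal sketch of a stricter pre-flight check, assuming the selector should match exactly as many pods as the Deployment has replicas (which will not hold mid-rollout). The function name selector_scope_ok is hypothetical; it reads pod names from stdin so it can sit behind kubectl get pods -o name:

```shell
#!/usr/bin/env bash
# selector_scope_ok EXPECTED: hypothetical helper, not part of kubectl.
# Reads pod names (one per line) on stdin and succeeds only if the number of
# matched pods equals EXPECTED, i.e. the selector did not catch extra pods.
selector_scope_ok() {
  local expected="$1" count
  count=$(grep -c . || true)   # count non-empty lines; grep exits 1 on zero matches
  if [[ "$count" -ne "$expected" ]]; then
    echo "selector matched $count pod(s), expected $expected" >&2
    return 1
  fi
}

# Intended usage against a cluster (names are placeholders):
#   want=$(kubectl get deployment api -n prod -o jsonpath='{.spec.replicas}')
#   kubectl get pods -l app=api -n prod -o name | selector_scope_ok "$want" \
#     && kubectl delete pod -l app=api -n prod --dry-run=client
```

In the canary incident above, the selector would have matched three pods against an expected two, and the check would have stopped the delete.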

Decision matrix

Under pressure, run through this quickly rather than defaulting on autopilot:

Replica count ≥ 2, no urgency: rollout restart. Zero downtime, fully audited.
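Replica count = 1: delete pod always leaves a gap while the replacement starts; rollout restart avoids it only if maxSurge lets the new pod come up first (the default does). Either way rollout restart wins because it's tracked and reversible.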
Need an audit trail (compliance, postmortems): rollout restart, always. Delete pod leaves nothing in history.
PDB configured: neither command consults it, because PDBs only gate evictions. rollout restart is still paced by maxUnavailable/maxSurge; delete pod has no pacing at all and removes every matching pod at once.
Rollout already stuck/blocked: delete pod is your escape hatch — sometimes the controller needs a nudge before a restart can even proceed.
StatefulSet involved: almost always rollout restart. Deleting pods out of ordinal order on something like Kafka or etcd can trigger re-election storms — ordinal-dependent init logic doesn't like being skipped.
CI/CD pipeline context: rollout restart is scriptable with kubectl rollout status --timeout=60s as a gate. Delete pod isn't built for that — there's no clean success signal to wait on.

One more thing worth checking before you run either command in production: RBAC scope. delete on pods and patch/get on deployments are different verbs, and if your on-call engineers are sitting on overly broad cluster-admin bindings, either command becomes riskier than it needs to be. Scope RBAC to namespace and verb — it's cheap insurance against a 2 AM mistake with cluster-wide blast radius. See the official Kubernetes RBAC docs if you haven't audited this in a while.

My pick

I default to kubectl rollout restart unless the rollout mechanism itself is already broken. It's auditable, it's paced by the Deployment's maxUnavailable/maxSurge settings, it's scriptable in pipelines, and it gives you rollout undo as a safety net if the restart itself goes sideways. kubectl delete pod stays in my toolkit strictly as a break-glass option: for when a rollout is already stuck, or I'm debugging a single throwaway pod that isn't behind any serious traffic.

Here's the helper script I actually run before touching anything in prod. It checks replica count, checks for a matching PDB, restarts via rollout, waits for it to succeed, and auto-rolls-back on failure:

#!/usr/bin/env bash
# safe_restart.sh — decision helper for restarting a workload safely
# Usage: ./safe_restart.sh <deployment-name> <namespace>

set -euo pipefail

NAME="$1"
NAMESPACE="${2:-default}"

# 1. Check replica count: warn if the restart may not be zero-downtime
REPLICAS=$(kubectl get deployment "$NAME" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
if [[ "$REPLICAS" -le 1 ]]; then
  echo "⚠️  Only $REPLICAS replica(s): zero-downtime depends on maxSurge >= 1 and a working readiness probe."
fi

# 2. Check for a PodDisruptionBudget covering this deployment
#    (informational: a PDB guards evictions such as node drains, not rolling updates)
PDB=$(kubectl get pdb -n "$NAMESPACE" -o json | \
  jq -r --arg name "$NAME" '.items[] | select(.spec.selector.matchLabels.app == $name) | .metadata.name')
if [[ -z "$PDB" ]]; then
  echo "⚠️  No PDB found for app=$NAME: pods are unprotected against voluntary evictions (e.g. node drains)."
fi

# 3. Prefer rollout restart, the safe default
echo "Restarting via rollout (auditable, paced by maxUnavailable/maxSurge)..."
kubectl rollout restart deployment/"$NAME" -n "$NAMESPACE"

# 4. Wait and verify, bail out with rollback if it fails
if ! kubectl rollout status deployment/"$NAME" -n "$NAMESPACE" --timeout=90s; then
  echo "❌ Rollout failed — rolling back automatically."
  kubectl rollout undo deployment/"$NAME" -n "$NAMESPACE"
  exit 1
fi

echo "✅ Restart complete. New ReplicaSet:"
kubectl get rs -n "$NAMESPACE" -l app="$NAME" --sort-by=.metadata.creationTimestamp | tail -n1

One caveat before you close this tab: if you're restarting many deployments at once (say kubectl rollout restart deployment -n <namespace> with no name, which restarts every Deployment in that namespace) you can spike node CPU and memory from parallel image pulls and readiness probes firing simultaneously. Stagger it, or tune maxUnavailable down so the blast radius stays predictable. Check the official kubectl rollout reference for the full flag set before you script this into a pipeline.

So the short answer to kubectl rollout restart vs delete pod: default to rollout restart, keep delete pod as your break-glass tool, and know exactly why you're reaching for either one before you hit enter. We cover more of these "which command actually wins in production" tradeoffs over at kuryzhev.cloud if you want the rest of the series.

Fixing n8n Bedrock Automation: Throttling, Duplicates, Cost Blowouts

Oleksandr Kuryzhev — Sun, 26 Jul 2026 07:01:46 +0000

Originally published on kuryzhev.cloud

Your n8n Bedrock automation pipeline works fine in testing. You trigger it manually, watch the Claude response come back in two seconds, ship the demo to your team. Then it goes live, a webhook retries three times because a downstream CMS timed out, and you've published the same blog draft twice while Bedrock throws ThrottlingException at every fourth request. We've rebuilt this exact pipeline for three different clients now, and the failure modes are almost identical every time.

This post is about what actually happens when n8n talks to AWS Bedrock for content generation — drafts, summaries, repurposing — and where n8n Bedrock automation setups quietly fall apart once they leave the sandbox.

What this actually does

Strip away the marketing framing and the data path is simple: an n8n trigger (webhook, cron, or manual) fires, an HTTP Request node or AWS-credentialed node calls Bedrock's InvokeModel or InvokeModelWithResponseStream endpoint, a parsing node extracts the completion, and a sink node pushes the result somewhere — a CMS API, a Slack channel, an S3 bucket for archival.

The important thing to internalize: n8n does zero inference. It's purely an orchestrator that signs and sends AWS requests. Every millisecond of latency and every cent of cost lives inside Bedrock, not inside your n8n instance. This matters because people debug n8n performance when the actual bottleneck is a model's cold-start variance or a regional TPS quota.

There's also a credential distinction that trips people up immediately. n8n's built-in "AWS" credential type — the one used for S3 and Lambda nodes — does not correctly sign Bedrock runtime requests out of the box. Bedrock runtime lives on a separate endpoint (https://bedrock-runtime.<region>.amazonaws.com/model/<model-id>/invoke), and you need either the newer "Generic Credential Type: AWS" available in n8n 1.6x+ HTTP Request nodes, or manual SigV4 signing if you're stuck on an older version or a community node. Check the Bedrock API setup docs for the exact signing requirements — it's not optional, it's a hard 403 if you get it wrong.

How people use it wrong

The most common anti-pattern I see is one giant workflow per content type. Summarization, drafting, and social repurposing all crammed into a single n8n canvas with twenty nodes. It works until you need to change one prompt — then you're redeploying the entire workflow, retesting every branch, and hoping you didn't break the Slack notification three nodes downstream. Modular sub-workflows exist for exactly this reason, and ignoring them is the single biggest maintainability killer I've seen in these builds.

Second: no idempotency. A webhook fails, n8n retries, and now you've called Bedrock twice and published the same article twice to your CMS. There's no dedupe key, no execution-id check, nothing preventing a duplicate. I watched a client publish the same LinkedIn post three times in one morning because their webhook trigger had zero request deduplication — embarrassing, and entirely avoidable with a simple idempotency check against a Redis set or a database row.
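
A minimal sketch of that idempotency check, assuming the dedupe key is a hash of the webhook body. The function names are hypothetical, and a plain file of seen keys stands in for the Redis set so the example is self-contained:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: a file of seen keys stands in for the Redis set.
# With Redis, the equivalent atomic claim would be: SET "dedupe:$key" 1 NX EX 86400

# Derive a stable key from the request body, so a retried webhook hashes the same.
idempotency_key() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# claim_once KEY STORE: succeed the first time a key is seen, fail on repeats.
# Note: check-then-append is not atomic; fine for a sketch, not for parallel workers.
claim_once() {
  local key="$1" store="$2"
  touch "$store"
  if grep -qxF -- "$key" "$store"; then
    return 1            # duplicate: skip the Bedrock call and the publish step
  fi
  echo "$key" >> "$store"
}
```

The point of the design is that the claim happens before the Bedrock call, so a retry costs one lookup instead of a second generation and a second publish.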

Gotcha: people treat Bedrock like a stateless, infinitely scalable API. It isn't. On-demand throughput has real per-model TPS quotas — sometimes as low as 5-10 requests per second per region — and batch content jobs that fan out fifty items at once will hit ThrottlingException almost immediately. Check your actual quota in the Service Quotas console before you architect around an assumption.

The correct approach

Separate your "trigger/queue" workflow from your "generation" workflow, connected via n8n's sub-workflow call node. The trigger workflow handles dedupe, batching, and queuing. The generation workflow does one thing: call Bedrock, parse the response, hand it back. Each gets its own error workflow attached, so a Bedrock failure doesn't cascade into your entire pipeline throwing a generic red X.

Use IAM role-based credentials, not static access keys, scoped to specific model ARNs rather than bedrock:*. I've lost count of how many times AccessDeniedException: User is not authorized to perform bedrock:InvokeModel turned out to be an IAM policy that granted the action but forgot to scope the resource to the actual model ARN, which for a model ID like anthropic.claude-3-sonnet-20240229-v1:0 takes the form arn:aws:bedrock:<region>::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0.

Retry logic matters too. n8n's default "Retry On Fail" is 0 — off. You need to explicitly set retries with a wait time, or better, write exponential backoff yourself in a Function node so you can branch specifically on ThrottlingException versus a genuine model timeout. And keep prompt templates in S3 or Parameter Store instead of hardcoding them in Set nodes — that way editorial can iterate on prompts without you touching the workflow at all.
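
The backoff arithmetic itself is small. Here is a sketch in shell rather than n8n's node JavaScript, with hypothetical function names and illustrative defaults (1 second base, 30 second cap); production code would normally add random jitter on top:

```shell
#!/usr/bin/env bash
# backoff_delay ATTEMPT [BASE] [CAP]: seconds to wait before retry number ATTEMPT
# (0-based), doubling each time and capped. Defaults are illustrative assumptions.
backoff_delay() {
  local attempt="$1" base="${2:-1}" cap="${3:-30}"
  local delay=$(( base * (1 << attempt) ))
  if (( delay > cap )); then
    delay="$cap"
  fi
  echo "$delay"
}

# should_retry ERROR_TYPE: retry only on throttling, not on other error types.
should_retry() {
  [[ "$1" == "ThrottlingException" ]]
}
```

Branching on the error type first and computing the delay second keeps a ValidationException from being retried five times when it will never succeed.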

Here's the queue-mode setup we run in production. Main-process mode chokes fast under concurrent Bedrock calls, so this is non-negotiable past a handful of parallel executions:

# docker-compose.yml — n8n queue mode setup for concurrent Bedrock workflows
version: "3.8"

services:
  postgres:
    image: postgres:15
    restart: always
    environment:
      POSTGRES_USER: n8n
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: n8n
    volumes:
      - pg_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    restart: always
    command: ["redis-server", "--appendonly", "yes"]

  n8n-main:
    image: n8nio/n8n:1.62.1
    restart: always
    ports:
      - "5678:5678"
    environment:
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: postgres
      DB_POSTGRESDB_USER: n8n
      DB_POSTGRESDB_PASSWORD: ${DB_PASSWORD}
      EXECUTIONS_MODE: queue          # required for async Bedrock calls
      QUEUE_BULL_REDIS_HOST: redis    # n8n's queue is Bull-based, hence QUEUE_BULL_*
      EXECUTIONS_DATA_PRUNE: "true"   # avoid Postgres bloat from LLM payloads
      EXECUTIONS_DATA_MAX_AGE: "168"  # hours, prune after 7 days
      N8N_ENCRYPTION_KEY: ${ENCRYPTION_KEY}
    depends_on:
      - postgres
      - redis

  n8n-worker:
    image: n8nio/n8n:1.62.1
    restart: always
    command: worker
    environment:
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: postgres
      DB_POSTGRESDB_USER: n8n
      DB_POSTGRESDB_PASSWORD: ${DB_PASSWORD}
      EXECUTIONS_MODE: queue
      QUEUE_BULL_REDIS_HOST: redis
      N8N_ENCRYPTION_KEY: ${ENCRYPTION_KEY}
    deploy:
      replicas: 3   # scale workers to control Bedrock concurrency, not n8n itself
    depends_on:
      - postgres
      - redis

volumes:
  pg_data:

Watch out for: storing raw API keys or secrets inside a "Set" node in the workflow JSON. That JSON gets exported, committed to git, and now your Bedrock credentials are sitting in a public repo history. Always use n8n's credential vault. It's not extra effort, it's one dropdown.

Advanced patterns

Once a single workflow works reliably, the next problem is scale. For batch jobs — say 500 items needing summaries — don't blast them all at Bedrock at once. Use n8n's "Split In Batches" node combined with queue-mode worker concurrency caps. We've seen a 500-item fan-out with no concurrency limit trigger a regional quota lockout that affected unrelated production workloads on the same AWS account. That's not a theoretical risk, that happened to a client's checkout service because a content job saturated the shared Bedrock quota.

Model routing is worth building explicitly rather than hardcoding one model. A router node picks Claude 3 Sonnet for long-form drafts, a cheaper Llama 3 variant for short summaries, and falls back to a secondary model if the primary throttles repeatedly. This keeps cost proportional to content complexity instead of paying Sonnet rates for a two-sentence summary.

For editorial previews where waiting on a full completion feels sluggish, switch to InvokeModelWithResponseStream and pipe chunks into a WebSocket or SSE node. It changes the perceived latency dramatically even though total generation time is the same.

Guardrails deserve their own mention. With InvokeModel the guardrail ID and version travel as request headers (X-Amzn-Bedrock-GuardrailIdentifier and X-Amzn-Bedrock-GuardrailVersion), not inside the model body, and the response body carries an amazon-bedrock-guardrailAction field. Branch on that field instead of trusting raw model output blindly, especially if this content is publishing anywhere public-facing without human review.

// Guardrail headers to set on the HTTP Request node (headers, not body fields)
// X-Amzn-Bedrock-GuardrailIdentifier: gr-content-safety-01
// X-Amzn-Bedrock-GuardrailVersion: 1

// Bedrock InvokeModel payload used inside n8n Function node before HTTP Request
{
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 1024,
  "temperature": 0.4,
  "messages": [
    {
      "role": "user",
      "content": "{{ $json.promptTemplate }}"
    }
  ]
}

// Expected throttle error shape to branch on in the next node
{
  "__type": "ThrottlingException",
  "message": "Too many requests, please wait before trying again.",
  "$statusCode": 429
}

Miss the anthropic_version field on a Claude payload and Bedrock returns ValidationException instead — an easy mistake when copy-pasting between model families that expect different request shapes.

Performance notes

Main-process mode in n8n starts choking somewhere around 10-15 parallel Bedrock executions. Past that, queue mode with Redis and multiple workers isn't optional — it's the only thing keeping your webhook triggers from blocking on long-running generations while waiting for a busy worker.

On-demand throughput throttles hard during bursts. If you're running scheduled batch content jobs — nightly summaries, weekly digests — Provisioned Throughput on Bedrock removes the 429s entirely, at a higher fixed cost. We switched a client with a daily 200-article batch job to provisioned and their throttle rate dropped from roughly 12% of requests to zero, worth the extra spend for their volume.

Token limits and payload size directly affect latency. We measured p95 latency on Claude 3 Sonnet nearly double when passing full multi-thousand-token context versus a truncated prompt — if your use case tolerates summarized context instead of raw source text, it's a real latency win, not just a cost one. Speaking of cost: Claude 3 Sonnet runs roughly $3 per million input tokens and $15 per million output tokens at current Bedrock pricing (check the official Bedrock pricing page since this shifts). A batch job without token-limit guards can blow through hundreds of dollars in a single unattended overnight run — we've seen it happen.
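
A rough budget guard makes that concrete. This is a sketch with hypothetical function names, and the $3 and $15 per-million rates quoted above are hardcoded as an assumption that needs checking against current pricing:

```shell
#!/usr/bin/env bash
# estimate_cost INPUT_TOKENS OUTPUT_TOKENS: USD estimate at the per-million rates
# quoted above ($3 input, $15 output). Rates are hardcoded here as an assumption.
estimate_cost() {
  awk -v in_tok="$1" -v out_tok="$2" \
    'BEGIN { printf "%.2f\n", in_tok / 1e6 * 3 + out_tok / 1e6 * 15 }'
}

# within_budget INPUT_TOKENS OUTPUT_TOKENS BUDGET_USD: guard to run before a batch.
within_budget() {
  awk -v cost="$(estimate_cost "$1" "$2")" -v budget="$3" \
    'BEGIN { exit !(cost <= budget) }'
}
```

For instance, a hypothetical run of 200 articles at 8,000 input and 1,024 output tokens each is estimate_cost 1600000 204800, which comes to about $7.87. The same job with an unbounded max_tokens is where the overnight surprises come from.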

Finally, don't ignore your Postgres execution log. n8n stores full JSON responses for every execution by default, and Bedrock responses aren't small. Without EXECUTIONS_DATA_PRUNE=true and a sane EXECUTIONS_DATA_MAX_AGE, that table grows for weeks until query performance degrades silently — no error, just slowly worsening dashboard load times until someone notices. I check this on every n8n Bedrock automation deployment now, after watching one client's execution history table hit 40GB in under two months. If you want more on structuring infra config so it doesn't drift like this, we've covered related patterns over on our AWS automation posts.

n8n Bedrock automation is genuinely powerful once it's wired correctly — modular workflows, scoped IAM, explicit retry logic, and queue mode aren't optional extras, they're the difference between a demo and something that survives real traffic without duplicating content or draining your AWS bill overnight.

Fix nginx-proxy SSL Certs Not Issuing in Docker Compose

Oleksandr Kuryzhev — Sat, 25 Jul 2026 07:02:00 +0000

Originally published on kuryzhev.cloud

Your app is up, DNS resolves, and yet the browser throws NET::ERR_CERT_AUTHORITY_INVALID instead of a green padlock. This is the single most common failure mode when running nginx-proxy Let's Encrypt certs through acme-companion in Docker Compose, and it almost always comes down to one of three things: DNS, environment variable mismatches, or a broken volume contract between the two containers. I've hit all three in production, usually at the worst possible time — right before a client demo. Here's the runbook I use now to diagnose and fix it fast, without accidentally torching my Let's Encrypt rate limit in the process.

Symptoms

Before you touch any config, confirm you're actually looking at this problem and not something else. The tell-tale signs:

Browser shows NET::ERR_CERT_AUTHORITY_INVALID, or the cert details show CN=nginx-proxy-default — that's the bundled self-signed fallback cert, not a real issued one.
docker logs nginx-proxy shows repeated nginx: [emerg] errors, or the app returns a 502 Bad Gateway even though the container itself is healthy.
acme-companion logs show Error creating new order :: too many certificates already issued or Timeout during connect (likely firewall problem).

Secondary symptoms worth noting: an infinite HTTP→HTTPS redirect loop right after you add LETSENCRYPT_HOST, an /etc/nginx/certs directory that's empty or only contains default.crt/default.key, or certs vanishing entirely after a redeploy. There's also a pattern I've learned to trust immediately: "works on staging.example.com, fails on app.example.com." That's almost never a config bug — it's DNS or rate-limiting, and no amount of restarting containers will fix it.

Root cause

To debug this properly you need to understand the three-container chain. nginx-proxy watches Docker events via docker-gen and rewrites /etc/nginx/conf.d/default.conf whenever a container starts or stops. acme-companion watches for LETSENCRYPT_HOST env vars and performs the HTTP-01 challenge on port 80, through the same shared network. Both containers must mount the same /etc/nginx/certs volume, or nginx-proxy quietly falls back to its bundled self-signed cert — with no obvious error in the logs.

The two most common root causes I see, in order of frequency:

DNS doesn't point at the host yet, but the ACME request fires anyway. Every failed HTTP-01 challenge counts against Let's Encrypt's rate limits, and repeated attempts snowball into a lockout that takes an hour or more to clear.
VIRTUAL_HOST / LETSENCRYPT_HOST mismatch — a typo, a trailing dot, or a www vs non-www inconsistency means docker-gen never generates a server block for that domain, so nginx serves the default placeholder.

Watch out for: acme-companion needs read-write access to the certs volume. nginx-proxy only needs read access. If you accidentally invert that — mounting certs as :ro on acme-companion — renewal breaks silently with no obvious error, and you won't notice until the cert expires.

Fix #1: Verify network path and DNS before touching config

Rule out infrastructure first. Config debugging on top of broken DNS wastes time and burns rate limit budget. Run the dig and port-80 checks from a machine outside the host, and the ifconfig.me check on the server itself:

# 1. Confirm DNS points at this host
dig +short app.example.com
curl -s ifconfig.me     # run on the server — output must match dig result

# 2. Confirm port 80 is reachable externally for the ACME challenge
curl -v http://app.example.com/.well-known/acme-challenge/probe

If the IPs don't match, stop — fix DNS propagation before anything else. If port 80 times out or connection is refused, check your cloud security group or host firewall (ufw status, iptables -L). This is a frequent silent killer: everything "looks" configured correctly, but the ACME HTTP-01 challenge simply never reaches the container.
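
The IP comparison can be scripted so it gates the deploy rather than relying on eyeballing two outputs. A minimal sketch, with a hypothetical function name and the same placeholder domain as above:

```shell
#!/usr/bin/env bash
# dns_points_here SERVER_IP: reads resolved addresses (one per line) on stdin
# and succeeds only if SERVER_IP is among them. Hypothetical helper name.
# -x forces a whole-line match so 203.0.113.1 does not match 203.0.113.10.
dns_points_here() {
  [[ -n "$1" ]] && grep -qxF -- "$1"
}

# Intended usage on the server (domain is a placeholder):
#   dig +short app.example.com | dns_points_here "$(curl -s ifconfig.me)" \
#     || echo "DNS does not point at this host yet: do not request a cert"
```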

Also confirm all three containers — nginx-proxy, acme-companion, and the app — are on the same user-defined bridge network, not Docker's default bridge:

docker network inspect proxy-net

The default bridge network doesn't do container name resolution the way user-defined networks do. If docker-gen can't resolve the app container by name, it can't build a correct upstream block, and you'll see intermittent 502s that look like a cert problem but aren't.

Fix #2: Correct the env var contract between containers

This is the most common misconfiguration, and it's almost always a copy-paste typo. Audit every app service for exact matching pairs:

environment:
  VIRTUAL_HOST: app.example.com
  LETSENCRYPT_HOST: app.example.com
  LETSENCRYPT_EMAIL: ops@example.com
  VIRTUAL_PORT: "8080"   # must match the app's actual listen port

A single trailing dot, or a www/non-www mismatch between VIRTUAL_HOST and LETSENCRYPT_HOST, breaks the whole chain — docker-gen and acme-companion have to agree on the exact string. After fixing env vars, force a config regeneration and check what actually got written:

docker exec nginx-proxy cat /etc/nginx/conf.d/default.conf | grep -A2 server_name
docker exec nginx-proxy nginx -t          # validate syntax without reloading
docker exec nginx-proxy nginx -s reload   # apply

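If you're running multiple domains behind one app, use comma-separated values in both variables, for example VIRTUAL_HOST=a.example.com,b.example.com and LETSENCRYPT_HOST=a.example.com,b.example.com. A gotcha I got burned by: by default acme-companion issues one certificate for the whole comma-separated list, with the first entry as the base name and the rest as SANs, and it only issues a separate cert per domain if you set LETSENCRYPT_SINGLE_DOMAIN_CERTS=true. Check the logs to confirm which SANs actually landed on the issued cert before assuming it worked.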

Another gotcha: if VIRTUAL_PORT is missing and the app exposes multiple ports, nginx-proxy guesses wrong more often than you'd expect, producing intermittent 502s that read exactly like a TLS problem but have nothing to do with certs.
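
The VIRTUAL_HOST/LETSENCRYPT_HOST contract described in this section can be linted before a deploy. This is a deliberately strict sketch for the single-domain case: the function name is hypothetical, it is not part of nginx-proxy or acme-companion, and it assumes the two values are meant to be identical:

```shell
#!/usr/bin/env bash
# host_pair_ok VIRTUAL_HOST LETSENCRYPT_HOST: hypothetical lint.
# Fails on an empty value, a trailing dot, embedded whitespace, or any
# difference between the two values (including www vs non-www).
host_pair_ok() {
  local vhost="$1" lehost="$2" h
  for h in "$vhost" "$lehost"; do
    if [[ -z "$h" || "$h" == *. || "$h" == *[[:space:]]* ]]; then
      echo "bad host value: '$h'" >&2
      return 1
    fi
  done
  if [[ "$vhost" != "$lehost" ]]; then
    echo "mismatch: VIRTUAL_HOST='$vhost' LETSENCRYPT_HOST='$lehost'" >&2
    return 1
  fi
}
```

Run against the values in a .env file in CI, a check like this catches the typo before acme-companion spends a validation attempt on it.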

Fix #3: Fix volume persistence and force a clean cert reissue

If certs vanish after every redeploy, your volumes aren't actually persistent. Make sure certs, vhost.d, and html are named volumes — not anonymous volumes, and definitely not bind mounts to an ephemeral CI runner filesystem — and that they're declared identically in both the nginx-proxy and acme-companion service blocks. Here's the full working stack I use as a baseline:

# docker-compose.yml — nginx-proxy + acme-companion + sample web app
version: "3.9"

services:
  nginx-proxy:
    image: nginxproxy/nginx-proxy:1.4
    container_name: nginx-proxy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - certs:/etc/nginx/certs:ro          # read-only: proxy only serves certs
      - vhost:/etc/nginx/vhost.d
      - html:/usr/share/nginx/html
      - /var/run/docker.sock:/tmp/docker.sock:ro
    networks:
      - proxy-net
    labels:
      - "com.github.nginx-proxy.nginx"     # required for acme-companion discovery

  acme-companion:
    image: nginxproxy/acme-companion:2.2
    container_name: acme-companion
    restart: unless-stopped
    depends_on:
      - nginx-proxy
    volumes:
      - certs:/etc/nginx/certs:rw          # read-write: needs to place new certs
      - vhost:/etc/nginx/vhost.d
      - html:/usr/share/nginx/html
      - acme:/etc/acme.sh
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      DEFAULT_EMAIL: ops@example.com
      # ACME_CA_URI: https://acme-staging-v02.api.letsencrypt.org/directory  # uncomment while testing
    networks:
      - proxy-net

  webapp:
    image: myorg/webapp:1.2.3
    container_name: webapp
    restart: unless-stopped
    environment:
      VIRTUAL_HOST: app.example.com
      LETSENCRYPT_HOST: app.example.com
      LETSENCRYPT_EMAIL: ops@example.com
      VIRTUAL_PORT: "8080"                 # must match app's actual listen port
    networks:
      - proxy-net

volumes:
  certs:
  vhost:
  html:
  acme:

networks:
  proxy-net:
    driver: bridge                          # never use default bridge here

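If certs are stuck in a bad state, don't nuke the whole volume; that wastes reissue attempts. Instead, stop the stack, inspect the volume, and remove only the affected domain's .crt/.key/.json files. Compose prefixes named volumes with the project name, so the volume is usually <project>_certs rather than plain certs:

docker compose down
docker volume ls --filter name=certs     # find the real name, e.g. <project>_certs
docker volume inspect <project>_certs    # Mountpoint shows where the files live
# manually remove app.example.com.crt / .key / .json from the volume mountpoint
docker compose up -d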

And while you're iterating, point ACME_CA_URI at the Let's Encrypt staging endpoint (https://acme-staging-v02.api.letsencrypt.org/directory). Production limits include 5 failed validations per hostname per hour, 5 duplicate certificates per week, and 50 certificates per registered domain per week, all trivially easy to burn through while debugging a typo. Flip back to production only once staging confirms the chain works end to end. Run the full diagnostic sequence below to confirm everything's actually resolved:

# Diagnostic sequence for "cert not issuing": run in order
# (steps 1 and 2 are the DNS and port-80 checks from Fix #1)

# 3. Check what nginx-proxy actually generated
docker exec nginx-proxy cat /etc/nginx/conf.d/default.conf | grep -A2 server_name

# 4. Check acme-companion's last attempt
docker logs acme-companion --tail 50

# Example failing output:
# 2024-05-11 acme-companion  | Registering domain 'app.example.com'
# 2024-05-11 acme-companion  | Error creating new order :: too many failed authorizations recently

# 5. Confirm the served cert (should be Let's Encrypt, not "nginx-proxy-default")
openssl s_client -connect app.example.com:443 -servername app.example.com 2>/dev/null \
  | openssl x509 -noout -issuer -dates

Prevention

Once you've got nginx-proxy Let's Encrypt certs issuing correctly, lock it in so it doesn't regress on the next deploy:

Pin exact image tags — nginxproxy/nginx-proxy:1.4 and nginxproxy/acme-companion:2.2 — instead of :latest. I stopped using :latest on these images after a minor docker-gen template bump silently broke vhost generation across an entire fleet overnight.
Add an independent monitoring check that hits https://yourdomain.com and alerts if expiry drops under 14 days, separate from acme-companion's internal renewal logic. Don't trust the renewal daemon to also monitor itself.
Document the env var contract — VIRTUAL_HOST/LETSENCRYPT_HOST/VIRTUAL_PORT — in a .env.example or README so the next service added to the stack doesn't silently miss it.
Always test new domains against Let's Encrypt staging first. This one habit alone prevents the vast majority of rate-limit lockouts I've seen teams hit.
If you ever set NETWORK_ACCESS=internal, make sure acme-companion's challenge path is still reachable — I've seen teams misconfigure this and then "fix" it by switching to DNS-01 with API keys stored in plaintext env vars, which is a worse security tradeoff than the original problem.
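
The independent expiry check from the list above can be as small as a cron job. A sketch, assuming GNU date and a hypothetical function name; the second argument exists only so the arithmetic can be tested against a fixed clock:

```shell
#!/usr/bin/env bash
# days_left NOT_AFTER [NOW_EPOCH]: whole days until the given expiry timestamp.
# NOT_AFTER is any string GNU date can parse, including openssl's notAfter format.
days_left() {
  local expiry now
  expiry=$(date -u -d "$1" +%s) || return 2
  now="${2:-$(date -u +%s)}"
  echo $(( (expiry - now) / 86400 ))
}

# Intended usage from cron (domain is a placeholder, threshold matches the 14 days above):
#   end=$(openssl s_client -connect app.example.com:443 -servername app.example.com \
#           </dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
#   [ "$(days_left "$end")" -ge 14 ] || echo "ALERT: cert for app.example.com expires soon"
```

Because the check reads the certificate actually being served, it also catches the case where renewal succeeded but nginx-proxy is still handing out the old or default cert.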

For more Docker Compose patterns we run in production, check the DevOps_DayS archive — there's a related writeup on trimming bloated images that pairs well with this proxy setup. And if you want the canonical reference on ACME rate limits, Let's Encrypt's own rate limits documentation is worth bookmarking before your next debugging session.

Terraform AWS Modules vs Custom: VPC and IAM Done Right

Oleksandr Kuryzhev — Fri, 24 Jul 2026 07:01:54 +0000

Originally published on kuryzhev.cloud

When you face this choice

Every new AWS account starts the same way. You spin up the account, get IAM access sorted, and someone on the platform team opens a terminal and types terraform init for the first time in that account. Then comes the question that quietly shapes everything for the next two years: do you pull the terraform-aws-modules VPC and IAM modules straight from the Registry, or do you write your own?

This isn't a throwaway decision. VPC, IAM, and S3 are almost always the first modules a platform team writes or adopts, because everything else — EKS clusters, RDS instances, Lambda functions — depends on them. Whatever pattern you pick here becomes the template every other module in the repo copies. Get it wrong and you're refactoring twelve downstream modules eighteen months later.

The trigger points I see most often: onboarding a new AWS account into an existing landing zone, standing up a multi-account Terraform Cloud or Enterprise workspace structure, or — my personal favorite — a security audit that suddenly demands every IAM role in the org has a permissions boundary attached. That last one has a way of surfacing exactly how much you trusted upstream module defaults without reading them.

I've built this decision three times now for three different companies, and I land on a different answer each time depending on the resource type. That's actually the point of this post — it's not "Registry vs custom," it's "Registry and custom, applied deliberately per module."

Option A — terraform-aws-modules (Registry modules)

The terraform-aws-modules org on the Terraform Registry is genuinely excellent engineering. The VPC module handles NAT gateway HA, VPC endpoints, subnet CIDR math, and Flow Logs: all the stuff that's tedious to get right and easy to get subtly wrong.

Pros are real: battle-tested edge cases, weekly-ish release cadence, and enough GitHub stars and production usage that "did I miss a flag" risk drops dramatically. You write twenty lines of HCL and get a fully wired 3-tier VPC. Speed is the headline benefit. A junior engineer can stand up a compliant-looking VPC in an afternoon.

But there are real costs that show up later. Version drift is the first one. Bumping terraform-aws-modules/vpc/aws from v3 to v5 renames variables and changes default behavior around enable_nat_gateway. Upgrade without reading CHANGELOG.md and your plan silently changes what it's provisioning. I've watched a terraform plan show zero changes when it should have shown forty, because a variable rename meant the new setting was silently ignored and the old default kicked back in.

Second cost: bloated state. A "simple" 3-AZ VPC module run can create 40-60 resources: subnets × three types × three AZs, route tables, associations, NACLs you never touch. That inflates plan time on large state files and increases blast radius if something in that module block goes wrong. Run this before any upgrade:

terraform state list | grep module.vpc | wc -l

Third: enforcing org tagging and naming conventions on a Registry module means either passing every tag through every variable, or wrapping it. There's no "org mode" switch. And pinning to an exact version is mandatory — version = ">= 5.0" in a prod root module is how you inherit an unreviewed upstream change on a Tuesday afternoon.

Option B — Hand-rolled internal modules

The alternative is writing your own thin modules from raw aws_vpc, aws_subnet, aws_iam_role resources. I've done this for IAM at every company I've worked with, and I'd do it again.

Full control is the big win. You decide exactly which resources get created: no default NACLs, no Flow Logs you didn't ask for, no surprise route table associations. State stays small, blast radius stays small, and a new hire reading terraform plan output can actually reason about what's about to happen. You also control tagging and naming at the resource level instead of threading it through a wrapper's tag-merging logic.

Security-sensitive logic benefits the most from this. A hand-rolled IAM module can enforce permissions_boundary on every role as a required variable: no default, no null shortcut, and terraform plan fails if someone forgets it. Registry IAM modules almost always leave that as optional, which means it gets skipped under deadline pressure, which is exactly how you end up with an over-permissioned role six months later.

The cons are just as real, though. You own every edge case: multi-AZ NAT failover logic, S3 cross-region replication configs, the CIDR math for subnetting. That's reinventing wheels that terraform-aws-modules already spent years getting right. Initial delivery is slower, so budget for it. And as your team grows and AWS's API surface changes, you're the one maintaining that logic, not a community of maintainers pushing weekly releases.

My rule of thumb from production: modules under ~10 resources are cheap to hand-roll. Anything with cross-resource dependency math, like VPC subnetting or NAT routing, is expensive to hand-roll correctly and rarely worth it.

Decision matrix

Here's how I score the two approaches across the criteria that actually matter once you're running this in production, not just in a demo. Scores run from 1 to 5, and higher is better.

Criteria                     | Registry | Hand-rolled
-----------------------------|----------|------------
Setup speed                  |    5     |     2
Customization / tag control  |    2     |     5
Maintenance burden           |    3     |     4
Security posture out-of-box  |    3     |     4  (only if you enforce it)
Blast radius / state size    |    2     |     5
Learning curve for new hires |    4     |     3
Cost of ongoing upgrades     |    2     |     4

Registry wins on speed and, honestly, on baseline security defaults: someone already thought about VPC endpoints and S3 lifecycle rules so you don't have to. The hand-rolled 4 for security posture only holds if you actually enforce it. Hand-rolled wins on control and state size, which matters more than people expect once your state file crosses a few hundred resources and every terraform plan takes ninety seconds.

The pattern I see repeatedly: teams start 100% Registry, then around the third or fourth AWS account they start wrapping the VPC module and fully replacing the IAM module. That's not indecision; that's the natural maturity curve. If you're a one-account startup, just use the Registry modules and move on. If you're building a landing zone for a company with a compliance team, start hand-rolling IAM now.

My pick

I use terraform-aws-modules for VPC. I hand-roll IAM and S3. That split isn't arbitrary: it maps directly to variance.

VPC networking logic (route tables, NAT HA, subnet CIDR math) has low variance across organizations. A VPC in fintech and a VPC in a gaming startup look almost identical structurally. Reusing a well-maintained module here is just good engineering. Giving in to the instinct to "own everything" wastes engineering time on a solved problem.

IAM and S3 policy logic has high variance. Least-privilege requirements, permissions boundaries, compliance frameworks (SOC 2, PCI, whatever your auditors care about this quarter) differ wildly between orgs. A generic Registry module optimizes for flexibility, which in practice means permissive defaults. That's exactly the wrong direction for security-sensitive infra. Reuse here invites over-permissioning, not because the module authors did anything wrong, but because "works for everyone" and "least privilege for you" are opposing goals.

The practical compromise, and what I actually ship, is a thin wrapper around the Registry VPC module that pins the exact version and injects org tagging/naming, giving you Registry speed without losing control:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.8.1" # pin exact version — never a range in prod

  name = "${var.org}-${var.env}-vpc"
  cidr = var.vpc_cidr

  azs             = var.azs
  private_subnets = [for k, az in var.azs : cidrsubnet(var.vpc_cidr, 4, k)]
  public_subnets  = [for k, az in var.azs : cidrsubnet(var.vpc_cidr, 4, k + 10)]

  # Prod: one NAT gateway per AZ, so losing an AZ doesn't cut egress everywhere.
  # Non-prod: a single shared NAT gateway to keep costs down.
  enable_nat_gateway     = true
  one_nat_gateway_per_az = var.env == "prod"
  single_nat_gateway     = var.env != "prod"

  enable_dns_hostnames = true
  enable_dns_support   = true

  # Org-mandated tagging — this is why we wrap instead of calling raw module
  tags = merge(var.default_tags, {
    Module     = "vpc-wrapper"
    ManagedBy  = "terraform"
    CostCenter = var.cost_center
  })
}

# --- Hand-rolled IAM role module instead of Registry iam-assumable-role ---
# Full control over permissions_boundary enforcement (security requirement)
resource "aws_iam_role" "app_role" {
  name                 = "${var.org}-${var.env}-${var.role_name}"
  assume_role_policy   = data.aws_iam_policy_document.trust.json
  permissions_boundary = var.permissions_boundary_arn # non-negotiable, no default = null shortcut

  tags = var.default_tags
}

resource "aws_iam_role_policy_attachment" "app_role_policies" {
  for_each   = toset(var.managed_policy_arns)
  role       = aws_iam_role.app_role.name
  policy_arn = each.value
}
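
The cidrsubnet() calls in the wrapper above are easier to trust once you see the ranges they produce. Here is a rough Python equivalent built on the standard ipaddress module; the 10.0.0.0/16 CIDR and the AZ names are hypothetical stand-ins for var.vpc_cidr and var.azs, and the helper only mirrors the behavior needed for this example.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Rough equivalent of Terraform's cidrsubnet(): the netnum-th subnet
    obtained by extending the prefix length by newbits."""
    network = ipaddress.ip_network(prefix)
    subnets = list(network.subnets(prefixlen_diff=newbits))
    return str(subnets[netnum])  # IndexError if netnum does not fit in newbits


vpc_cidr = "10.0.0.0/16"  # hypothetical value for var.vpc_cidr
azs = ["eu-central-1a", "eu-central-1b", "eu-central-1c"]  # hypothetical var.azs

private_subnets = [cidrsubnet(vpc_cidr, 4, k) for k in range(len(azs))]
public_subnets = [cidrsubnet(vpc_cidr, 4, k + 10) for k in range(len(azs))]

print(private_subnets)  # ['10.0.0.0/20', '10.0.16.0/20', '10.0.32.0/20']
print(public_subnets)   # ['10.0.160.0/20', '10.0.176.0/20', '10.0.192.0/20']
```

Four new bits give sixteen /20 slots (0 to 15), so the k + 10 offset keeps public and private ranges apart and has room for at most six AZs before the public index runs past slot 15.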

Watch out: single_nat_gateway = true in production is a cost-saving move that buys a lower bill at the expense of reliability. If that one NAT's AZ goes down, every private subnet loses egress at once. one_nat_gateway_per_az = true costs more: each NAT gateway runs ~$0.045/hr plus ~$0.045/GB processed, so three AZs come to roughly $100/month (3 × $0.045 × ~730 hours) before real traffic, but it's the difference between an AZ blip and a full outage. Dev environments should stay on single-NAT, which keeps the baseline around $32/month instead.

Second watch out: the S3 module. Don't assume the Registry's terraform-aws-modules/s3-bucket/aws module turns on everything you need. Its defaults for acl and the public access block settings (block_public_acls, restrict_public_buckets, and friends) depend on the version you pin, and versioning and attach_deny_insecure_transport_policy stay off unless you ask for them. I've seen this exact gap in a prod logs bucket:

# --- WRONG: relies on module defaults, misses two security controls ---
module "s3_bucket_wrong" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "4.2.1"
  bucket  = "kuryzhev-logs-prod"
  acl     = "private"
  # versioning left unset — object deletion is unrecoverable
  # attach_deny_insecure_transport_policy left unset — HTTP allowed
}

# --- CORRECT: explicit security controls ---
module "s3_bucket_correct" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "4.2.1"
  bucket  = "kuryzhev-logs-prod"
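  acl     = "private"

  # New buckets have ACLs disabled by default (AWS change, April 2023), so
  # setting an ACL needs explicit object ownership or the apply fails.
  control_object_ownership = true
  object_ownership         = "ObjectWriter"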

  versioning = {
    enabled = true
  }

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true

  attach_deny_insecure_transport_policy = true

  tags = {
    Environment = "prod"
    Owner       = "platform-team"
  }
}

# Audit before any module version upgrade:
# terraform state list | grep module.s3_bucket_correct | wc -l

Notice both blocks use the Registry module. My "hand-roll S3" preference in practice often means hardening the Registry module with explicit security blocks rather than writing raw aws_s3_bucket resources from scratch. Same principle, less boilerplate. What matters is that nothing security-relevant is left to an implicit default.

If you're just starting a landing zone, don't overthink this: use terraform-aws-modules for networking, wrap or replace the Registry modules for IAM and S3, pin every version, and run terraform-docs (v0.17+) against your hand-rolled modules so the README stays honest about what variables actually exist.

We cover more of this workspace structure over in our Terraform category if you want the multi-account variant of this setup. For the canonical module docs, the Terraform module sources documentation is worth bookmarking before you fork anything to a private Git repo, because the source syntax changes completely once you do.

CI/CD Checklist: Quality Gates, Approvals, and Rollback Paths

Oleksandr Kuryzhev — Thu, 23 Jul 2026 07:28:49 +0000

Originally published on kuryzhev.cloud

Why this checklist

Last month a teammate pushed a hotfix that dropped test coverage from 82% to 68%. SonarQube flagged it. The pipeline turned green anyway, deployed to prod, and nobody noticed until customer support started forwarding 500 errors two hours later. That's the failure mode this CI/CD quality gate checklist exists to prevent: a pipeline that reports quality problems but doesn't actually enforce them.

Most teams I've worked with have some version of three control points — a quality gate, a manual approval step, and a rollback plan. The problem is these are almost always implemented in isolation, by different people, at different times, so nobody checks whether they actually connect. You end up with a SonarQube integration that scans but never blocks, an approval step that gates the wrong action, or a rollback command that's never been run outside someone's laptop. Each piece looks fine on its own. Together they leave a gap wide enough to drive a broken release through.

This checklist is for teams running GitHub Actions, GitLab CI, or Jenkins to deploy into Kubernetes or VM-based production environments — basically anyone who has a "deploy to prod" job and more than one person touching it. If you're the only one who ever deploys, you probably don't need half of this. If you have a team, an on-call rotation, or compliance requirements, you need all fifteen items below wired together, not just present.

We cover related patterns — GitOps rollback flows, drift detection, and alerting on failed deploys — in more detail on the DevOps_DayS blog if you want the deeper dives after this.

The checklist

Quality gate (items 1–5)

Coverage threshold is enforced in CI, not just reported. A SonarQube dashboard showing "68.4% coverage" in red means nothing if the pipeline exit code is still 0. The scanner has to fail the job.
Static analysis blocks the merge, not just the deploy. Catching bad code at PR time is cheaper than catching it after a deploy job runs.
Dependency and vulnerability scanning is a gate, not a report. A critical CVE in a transitive dependency should stop the pipeline the same way a failing unit test does.
The gate result is polled synchronously. This is the mistake that bit us: SonarQube runs asynchronously by default. Without sonar.qualitygate.wait=true, the scanner step exits before the gate computation finishes, and the pipeline moves on regardless of the result.
Gate configuration lives in version control alongside the pipeline. If the coverage threshold is a checkbox in a web UI that only the team lead can see, it will drift silently. Keep it in the repo.

Here's what a properly wired quality gate looks like in GitHub Actions, running against SonarQube 10.4 with the gate result actually blocking the downstream deploy job:


# .github/workflows/deploy-with-gate-approval-rollback.yml
name: deploy-prod

on:
  push:
    branches: [main]

jobs:
  quality-gate:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # required for accurate SonarQube blame/coverage diff

      - name: Run tests with coverage
        run: |
          pytest --cov=app --cov-report=xml

      - name: SonarQube scan and wait for gate
        uses: SonarSource/sonarqube-scan-action@v2
        with:
          args: >
            -Dsonar.qualitygate.wait=true
            -Dsonar.qualitygate.timeout=300
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
        # This step FAILS the job if the gate fails, blocking downstream deploy

  deploy-prod:
    needs: quality-gate
    runs-on: ubuntu-22.04
    environment:
      name: production  # required reviewers configured in repo settings
    timeout-minutes: 30  # caps how long the job runs once it starts; not an approval expiry
    steps:
      - uses: actions/checkout@v4

      - name: Deploy immutable image
        run: |
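          # digest.txt is assumed to come from an earlier build job (e.g. a downloaded artifact)
          IMAGE_DIGEST=$(cat digest.txt)  # never deploy by :latest tag
          kubectl set image deployment/myapp \
            myapp=registry.example.com/myapp@${IMAGE_DIGEST}
          # --record is deprecated; set the change cause explicitly instead
          kubectl annotate deployment/myapp --overwrite \
            kubernetes.io/change-cause="deploy ${IMAGE_DIGEST} (commit ${GITHUB_SHA})"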

      - name: Post-deploy smoke test
        id: smoke
        run: ./scripts/smoke-test.sh
        continue-on-error: true

      - name: Auto-rollback on smoke test failure
        if: steps.smoke.outcome == 'failure'
        run: |
          echo "Smoke test failed, rolling back..."
          kubectl rollout undo deployment/myapp
          kubectl rollout status deployment/myapp --timeout=120s
          exit 1  # still fail the pipeline so it's visible, but prod is safe

If the gate fails, the error looks exactly like a test failure: ERROR: Quality gate failed: Coverage 68.4% is less than required 80.0%. You can also check it manually against the SonarQube API with GET /api/qualitygates/project_status?projectKey=myapp before trusting the CI output.
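
For that manual check, a short script is easier to read than raw JSON. The sketch below calls the project_status endpoint mentioned above with the standard library and lists the failed conditions; the base URL, project key, and token handling are assumptions, and the response shape (projectStatus.status plus a conditions list) should be verified against your SonarQube version.

```python
import base64
import json
import urllib.parse
import urllib.request


def gate_failures(payload: dict) -> list:
    """Return human-readable failed conditions from a project_status response."""
    status = payload["projectStatus"]
    if status["status"] == "OK":
        return []
    failed = [c for c in status.get("conditions", []) if c.get("status") == "ERROR"]
    messages = [
        f'{c["metricKey"]}: actual {c.get("actualValue")} vs threshold {c.get("errorThreshold")}'
        for c in failed
    ]
    # Fall back to the overall status if no individual condition is marked as failed.
    return messages or [f'gate status is {status["status"]}']


def fetch_project_status(base_url: str, project_key: str, token: str) -> dict:
    """GET /api/qualitygates/project_status for one project."""
    query = urllib.parse.urlencode({"projectKey": project_key})
    request = urllib.request.Request(f"{base_url}/api/qualitygates/project_status?{query}")
    # A user token is sent as the basic-auth username with an empty password.
    credentials = base64.b64encode(f"{token}:".encode()).decode()
    request.add_header("Authorization", f"Basic {credentials}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Example (hypothetical URL and key):
# failures = gate_failures(fetch_project_status("https://sonar.example.com", "myapp", token))
# if failures: raise SystemExit("\n".join(failures))
```

Exiting non-zero when the list is non-empty gives you a second, independent confirmation that the gate result matches what CI reported.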

Manual approval (items 6–10)

Protected environment with required reviewers. In GitHub Actions environments, this is Settings → Environments → production → Required reviewers. Note the cap: max 6 reviewers per environment.
Approval is scoped to the specific environment. An approval on staging should never carry over to prod. GitLab CI does this with environment: plus deployment_tier; Jenkins uses an input step scoped inside the relevant stage.
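Pending approvals expire. GitHub Actions only fails a job that has waited 30 days for review, which is far too long for a production deploy. timeout-minutes limits how long a job runs, not how long it waits for approval, so don't rely on it to clean up. Cancel stale pending runs on a schedule (after, say, 60 minutes) so a forgotten deploy doesn't sit in the queue for three days and then ship an outdated commit.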
Audit log of who approved and when. CI logs aren't enough — GitHub Actions defaults to 90 days of log retention, which is often shorter than what compliance asks for. Ship approver identity, timestamp, and commit SHA to an external system.
Approval gates the mutating step, not the step after it. This is the one I've seen go wrong the most: teams put the approval gate after the deploy job has already applied database migrations. By the time someone clicks "approve," the mutation already happened. The gate has to sit in front of anything that changes state.

Rollback (items 11–15)

Every deploy tags an immutable artifact. Deploy by digest — myapp@sha256:... — never by :latest. If you roll back while the tag still points at the new build, you'll re-pull the broken image.
Rollback command tested in staging quarterly. A rollback path nobody has run since it was written is a rollback path that fails during the actual incident.
Rollback triggers an automated smoke test. Don't assume the revert worked — verify it the same way you'd verify a fresh deploy.
Previous N revisions are retained. Kubernetes Deployments default to revisionHistoryLimit: 10. You can lower it to cut clutter, but I wouldn't go below 3: with fewer, two bad deploys in a row can leave you with no known-good revision to roll back to.
Rollback is runnable without pipeline access. Keep a break-glass script that works even if CI is down.

To pick the right target revision, check history first, then roll back with kubectl rollout undo:


# Example: kubectl rollout history output used to pick a rollback target

$ kubectl rollout history deployment/myapp
deployment.apps/myapp
REVISION  CHANGE-CAUSE
4         kubectl set image deployment/myapp myapp=myapp@sha256:1a2b3c...
5         kubectl set image deployment/myapp myapp=myapp@sha256:4d5e6f...
6         kubectl set image deployment/myapp myapp=myapp@sha256:7a8b9c... (current, broken)

$ kubectl rollout undo deployment/myapp --to-revision=5
deployment.apps/myapp rolled back

$ kubectl rollout status deployment/myapp
Waiting for deployment "myapp" rollout to finish: 1 old replicas pending termination...
deployment "myapp" successfully rolled out

If you're on Helm, the equivalent is helm rollback myapp 0 --timeout 5m (revision 0 means "previous release"). If you're running canary or blue-green through Argo Rollouts 1.6, use kubectl argo rollouts undo myapp-rollout — it triggers the same automated analysis checks that gated the original rollout.

Commonly missed items

The gaps that only show up during an incident, not during a code review:

The gate runs but nobody waits for it. This is the SonarQube case from the intro — the scan happens against an external tool, but the CI job doesn't block on the result. It's the single most common thing I find when auditing someone else's pipeline. Always check for sonar.qualitygate.wait=true explicitly; don't assume it's there.

Approval granted to a bot or a single human. I've seen service accounts configured as the "required reviewer" for a production environment — which means the pipeline can approve itself. Separate the approver identity from the pipeline execution identity, full stop, or you've built a self-approval bypass with extra YAML. And if the only human approver is on vacation, deploys just stop. Always have a backup reviewer.

Rollback assumes rebuilding from source. If your "rollback" is actually a rebuild of an old commit, check whether the build environment has drifted since then — a base image update, a removed package mirror, a changed lockfile resolution. We got burned by this once: the rollback commit built fine three weeks earlier but failed against a newer Alpine base image during the actual incident. That's exactly why immutable digests matter — you're not rebuilding, you're re-deploying something that already ran successfully.

Silent rollbacks. If a rollback fires and nothing alerts, nobody investigates the root cause, and the same bad deploy goes out again next sprint. Wire rollback execution into your alerting pipeline the same way you'd alert on a failed deploy.

Automation ideas

None of this has to be manual toil once the controls exist. A few things worth automating without weakening any of the gates above:

Auto-generate approval requests with context. A Slack message with a diff summary, a link to the SonarQube gate report, and an approve/deny button turns a five-minute Slack hunt into a ten-second decision. Reviewers approve faster when they don't have to go dig for context.

Trigger rollback from health checks, not humans. The workflow above already does this — a failed smoke test triggers kubectl rollout undo automatically. Extend it with HTTP error rate or latency SLO breaches from your monitoring stack instead of waiting for someone to notice a spike on a dashboard.

Watch out for rollback loops. If the previous version shares the same underlying issue — a bad database migration, for example — an error-rate-triggered auto-rollback can flap between two broken versions indefinitely. Always pair automated rollback with a circuit breaker or a max-attempts counter so it fails loud instead of looping quiet.
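
A max-attempts counter can be as simple as a sliding window over recent rollback timestamps. This is a minimal in-memory sketch; the class name, the limits, and the idea of calling it before kubectl rollout undo are assumptions, and a real pipeline would need to persist the attempts somewhere that survives between runs.

```python
import time
from collections import deque


class RollbackBreaker:
    """Allow at most max_attempts automated rollbacks inside a sliding time window."""

    def __init__(self, max_attempts: int = 2, window_seconds: float = 3600.0, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._clock = clock
        self._attempts = deque()  # timestamps of rollbacks that were allowed

    def allow(self) -> bool:
        now = self._clock()
        # Forget attempts that have aged out of the window.
        while self._attempts and now - self._attempts[0] > self.window_seconds:
            self._attempts.popleft()
        if len(self._attempts) >= self.max_attempts:
            return False  # breaker open: stop auto-rollback and page a human
        self._attempts.append(now)
        return True


breaker = RollbackBreaker(max_attempts=2, window_seconds=3600)
if breaker.allow():
    print("rollback permitted")  # here the pipeline would run kubectl rollout undo
else:
    print("rollback limit reached, escalating instead of flapping")
```

When allow() returns False the pipeline should fail loudly and alert rather than retry, which is the "fails loud instead of looping quiet" behavior described above.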

Run scheduled rollback drills. A weekly job that rolls a staging namespace back to its previous revision and runs the smoke test suite catches rollback path rot before it's the middle of an incident. It's the same idea as a chaos engineering game day, just scoped to one specific failure mode.

On cost: re-running a full coverage suite on every quality gate check burns CI minutes fast on large repos. Switch to PR-scoped analysis with sonar.pullrequest.key — it typically cuts scan runtime by 40-60% since it only analyzes the diff instead of the whole codebase.

Put together, a working CI/CD quality gate checklist isn't about adding more steps — it's about making sure the steps you already have actually connect to each other. Gate, approval, rollback: each one is only as good as its handoff to the next.

Validate AI-Generated Kubernetes Manifests Before They Hit Git

Oleksandr Kuryzhev — Wed, 22 Jul 2026 07:02:10 +0000

Originally published on kuryzhev.cloud

We started letting an internal tool generate boilerplate Deployments and Services from a short intent spec, and within two weeks a teammate almost merged a manifest with policy/v1beta1, an apiVersion removed in Kubernetes 1.25. That's the moment we realized validating AI-generated Kubernetes manifests isn't optional review theater; it's a hard gate that has to sit before the commit, not just before the deploy. Here's what we've locked into our pipeline since, tip by tip.

Pin the model version and temperature — stop manifest drift between runs

If you're generating YAML from an LLM and re-running the same prompt gives you a different resources: block twice in a row, that's not a fluke — it's temperature. Set temperature: 0 (or the provider's closest equivalent) for anything manifest-related; you want deterministic output, not creative writing.

Also pin the exact model string. gpt-4o as an alias will silently shift behavior when the provider updates "latest" mid-sprint — use gpt-4o-2024-08-06 instead. Log the model string plus a hash of the prompt alongside the generated commit so you can actually reproduce a bad manifest six weeks later when someone asks "why does this Deployment have three replicas."
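
The logging step is only a few lines. This sketch builds a small provenance record you could store next to the generated manifest or in the commit trailer; the field names and the choice of SHA-256 are assumptions, not something our generator script is shown to emit.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(model: str, temperature: float, prompt: str, manifest: str) -> dict:
    """Describe how a manifest was generated so the run can be reproduced later."""
    return {
        "model": model,
        "temperature": temperature,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "manifest_sha256": hashlib.sha256(manifest.encode("utf-8")).hexdigest(),
        "generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


record = provenance_record(
    model="gpt-4o-2024-08-06",
    temperature=0,
    prompt="Generate a Deployment for the intent spec ...",  # placeholder prompt text
    manifest="apiVersion: apps/v1\nkind: Deployment\n",      # placeholder manifest text
)
print(json.dumps(record, indent=2))
```

Hashing the prompt instead of storing it keeps the record small and avoids copying any sensitive prompt context into git history.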

Validate against your cluster's real API surface, not generic schemas

LLMs are trained on a snapshot of the internet, which means they happily emit deprecated kinds like extensions/v1beta1 or CRD versions that don't exist in your live cluster. Run kubeconform with -kubernetes-version 1.29.0 against your actual target version, not whatever ships as the default schema set.

For CRDs — ArgoCD's Application, Cilium policies, whatever you run — pull the live OpenAPI schema straight from the cluster instead of trusting a public catalog:

# export the cluster's real OpenAPI schema for CRD-aware validation
kubectl get --raw /openapi/v2 > schema.json

Watch out for this error: no matches for kind "Application" in version "argoproj.io/v1alpha1". Nine times out of ten that's not a real problem with the manifest — it's a stale CRD schema cache on your validation side, and it'll waste an hour if you assume the LLM is wrong.

Enforce a pre-commit hook, not just a CI check

We used to only validate in CI, which meant a rejected PR still left a "clean-looking" diff sitting in git history. Reviewers would open manifests that were never actually valid Kubernetes objects and waste time reasoning about them. Shift the check left — block the commit itself.

#!/usr/bin/env bash
# .git/hooks/pre-commit (or wired via .pre-commit-config.yaml local hook)
# Generates a manifest via LLM, validates it, blocks commit on failure.

set -euo pipefail

SPEC_FILE="deploy/intent.yaml"          # human-authored high-level spec
OUT_FILE="deploy/manifests/deployment.yaml"
CACHE_FILE=".ai-gitops-cache.json"
CLUSTER_VERSION="1.29.0"

# Skip LLM call if spec hasn't changed since last validated run
CURRENT_HASH=$(sha256sum "$SPEC_FILE" | awk '{print $1}')
CACHED_HASH=$(jq -r '.hash // ""' "$CACHE_FILE" 2>/dev/null || echo "")

if [[ "$CURRENT_HASH" == "$CACHED_HASH" ]]; then
  echo "[ai-gitops] No spec change detected, skipping regeneration."
  exit 0
fi

echo "[ai-gitops] Spec changed, calling LLM to (re)generate manifest..."
python3 scripts/generate_manifest.py \
  --spec "$SPEC_FILE" \
  --model "gpt-4o-2024-08-06" \
  --temperature 0 \
  --out "$OUT_FILE"

echo "[ai-gitops] Validating against schema for k8s ${CLUSTER_VERSION}..."
if ! kubeconform -strict -kubernetes-version "${CLUSTER_VERSION}" "$OUT_FILE"; then
  echo "[ai-gitops] Schema validation FAILED. Commit blocked."
  exit 1
fi

echo "[ai-gitops] Checking resource/security policy with Kyverno..."
# Rely on kyverno's exit code: the summary line always contains "pass: N", so grepping for "pass" never fails.
if ! kyverno apply policies/resource-limits.yaml --resource "$OUT_FILE"; then
  echo "[ai-gitops] Policy validation FAILED. Commit blocked."
  exit 1
fi

echo "[ai-gitops] Building through kustomize to catch merge/indentation issues..."
kustomize build deploy/overlays/staging | kubeconform -strict -kubernetes-version "${CLUSTER_VERSION}" -

# All good — update cache and stage the generated file
jq -n --arg h "$CURRENT_HASH" '{hash: $h}' > "$CACHE_FILE"
git add "$OUT_FILE" "$CACHE_FILE"

echo "[ai-gitops] Manifest generated and validated. Proceeding with commit."

If you use kustomize overlays, add a post-build validation step too — patches from an LLM can silently merge wrong due to indentation, and pre-build validation won't catch that.

Never trust the LLM's resource requests/limits — enforce with policy

Left unchecked, LLMs either omit the resources: block entirely or copy suspiciously round numbers — cpu: 1000m, memory: 1Gi — regardless of what the workload actually needs. We've seen teams over-provision nodes by 2-3x within a month just from trusting these guesses.

Run a policy engine as a second, independent pass. We landed on Kyverno 1.11+ for this because kyverno apply --resource gives you a fast local dry-run before anything reaches the cluster:

# policies/resource-limits.yaml: Kyverno policy that rejects missing or oversized LLM resource guesses
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: >-
          LLM-generated manifest is missing resource requests/limits,
          or exceeds max allowed values (cpu: 2, memory: 4Gi).
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      requests:
                        cpu: "?*"
                        memory: "?*"
                      limits:
                        cpu: "<=2"
                        memory: "<=4Gi"

# Example failure output when the LLM omits resources entirely:
# policy require-resource-limits/check-container-resources fail:
#   validation error: LLM-generated manifest is missing resource
#   requests/limits, or exceeds max allowed values (cpu: 2, memory: 4Gi)

OPA/Conftest works just as well if that's already in your stack — the point is that schema validation and cost/security policy are two separate concerns and neither one substitutes for the other.

Strip secrets and internal context from prompts before sending to hosted models

Treat every prompt as an egress channel. The most common accidental data leak we've seen is someone pasting a real Secret value, an internal hostname, or a VPC CIDR into a prompt like "generate a manifest based on our existing setup" — and that text now lives on a third-party API log.

Put a redaction preprocessor in front of anything going to a hosted model — a regex or a small Python filter that masks env values and internal domains before the request goes out. If your compliance policy is strict about production namespace names, route that generation through a self-hosted model like Ollama running CodeLlama instead. I stopped sending anything touching prod-adjacent context to hosted APIs after one near-miss with a staging hostname that mapped 1:1 to an internal DNS record — not worth the risk for a boilerplate Deployment.
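
A redaction preprocessor along those lines might look like the sketch below. The patterns are hypothetical and deliberately coarse (IPv4 addresses and CIDRs, hostnames under a few internal-looking suffixes, and key/value lines whose key looks secret-like); treat them as a starting point to adapt to your own naming conventions, not a complete filter.

```python
import re

# Hypothetical patterns; extend them for your own domains and key names.
REDACTIONS = [
    # IPv4 addresses with an optional CIDR suffix, e.g. 10.20.0.0/16
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?:/\d{1,2})?\b"), "<REDACTED_IP>"),
    # Hostnames ending in an internal-looking suffix
    (re.compile(r"\b[\w.-]+\.(?:internal|corp|local)\b"), "<REDACTED_HOST>"),
    # key: value or KEY=value lines where the key looks like a credential
    (
        re.compile(
            r"^([ \t]*[\w-]*(?:password|secret|token|api[_-]?key)[\w-]*[ \t]*[:=][ \t]*).+$",
            re.IGNORECASE | re.MULTILINE,
        ),
        r"\1<REDACTED_SECRET>",
    ),
]


def redact(prompt: str) -> str:
    """Mask sensitive-looking values before the prompt leaves the network."""
    for pattern, replacement in REDACTIONS:
        prompt = pattern.sub(replacement, prompt)
    return prompt


sample = "DB_PASSWORD=hunter2\nhost: db01.staging.internal\ncidr: 10.20.0.0/16\nreplicas: 3"
print(redact(sample))
```

Run it as the last step before the API call and log the redacted prompt, not the original, so the filter's output is what ends up in your own logs too.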

Force schema-constrained output instead of free-form YAML

Free-form generation invites hallucinated fields — we've seen spec.replicaCount show up instead of the real spec.replicas, which parses fine as YAML and fails silently downstream. Use JSON mode or function-calling (OpenAI's response_format: json_schema) or a grammar-constrained approach via guidance/outlines so the model literally can't invent a field that doesn't exist in the schema.

This doesn't replace kubeconform, though. Schema-constrained output reduces structural hallucination but says nothing about semantic correctness — a field can be perfectly valid and still hold the wrong value. "It parsed as YAML" and "it's a valid Kubernetes object" are two separate validation layers, and conflating them is the mistake we see most often on teams new to this. Check the Kubernetes API concepts docs if you want to understand exactly why schema and semantics diverge.
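
To make the gap between those two layers concrete, here is a toy check that takes an already-parsed manifest and flags spec fields that are not part of an apps/v1 Deployment spec. The allow-list is hand-written for illustration and the function name is hypothetical; in a real pipeline kubeconform with the full schema does this job.

```python
# Top-level fields of an apps/v1 Deployment spec, written out by hand for this sketch.
KNOWN_DEPLOYMENT_SPEC_FIELDS = {
    "replicas",
    "selector",
    "template",
    "strategy",
    "minReadySeconds",
    "revisionHistoryLimit",
    "paused",
    "progressDeadlineSeconds",
}


def unknown_spec_fields(manifest: dict) -> list:
    """Return spec keys that parse fine as YAML but are not Deployment fields."""
    return sorted(set(manifest.get("spec", {})) - KNOWN_DEPLOYMENT_SPEC_FIELDS)


# A manifest as it would look after YAML parsing, with a hallucinated field.
generated = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "myapp"},
    "spec": {"replicaCount": 3, "selector": {}, "template": {}},
}

print(unknown_spec_fields(generated))  # ['replicaCount']
```

The manifest above loads without any parser error, which is exactly why a parse check alone tells you nothing about whether the object is valid.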

Cache prompts and diffs to control LLM API cost on repeated pushes

Regenerating the full manifest tree on every single commit burns tokens for no reason. Hash the source intent file and only call the LLM when that hash actually changes — and when it does, send a diff-style prompt ("here's the current manifest, apply this change") instead of full regeneration. That alone cut our token usage roughly 40-60% on iterative PRs.

Use a cheaper model like gpt-4o-mini for the validation/fix-suggestion loop and reserve the larger model for initial generation only — there's no reason to pay premium-model prices to check whether a field name is right. We keep the last-validated manifest and its input hash in .ai-gitops-cache.json, which is exactly what lets the pre-commit hook above skip regeneration entirely when nothing meaningful has changed. If you're building out a broader GitOps pipeline around this pattern, we cover more of the plumbing over at kuryzhev.cloud's GitOps posts.

None of this is exotic tooling — kubeconform, Kyverno, and a pre-commit hook are things most teams already have lying around. The only real shift is treating AI-generated Kubernetes manifests as untrusted input by default, the same way you'd treat a PR from someone who's never seen your cluster before.

GPT Slack Bot for CI Failures: 3 Mistakes We Made

Oleksandr Kuryzhev — Tue, 21 Jul 2026 07:02:12 +0000

Originally published on kuryzhev.cloud

Context

A GPT Slack bot for CI failures sounds like a weekend project, and honestly, it was — the first version shipped in an afternoon. We had 40+ CI/CD pipelines split between GitHub Actions and GitLab CI failing daily, and engineers were burning 10-15 minutes per failure just scrolling raw logs to find the actual error line buried under dependency install noise and retry spam. Someone pitched it in standup: webhook fires on pipeline failure, we grab the job trace, feed it to GPT, post a 3-bullet summary in the channel. Simple.

We built it with Slack Bolt and the OpenAI Python SDK, demoed it against a handful of sample failures, and it looked genuinely great. Clean summaries, right in the thread, no more log-diving. We shipped it to the team channel the same week. It fell apart within seven days — not dramatically at first, just quietly wrong, then loudly wrong. What follows is the honest version of what broke and why, because the demo-to-production gap on this project was bigger than anything else we'd deployed that quarter.

Mistake 1: We Piped the Entire Raw Log Into the Prompt

Our failed CI jobs produced logs anywhere from 20,000 to 80,000 lines — verbose test runners, retried package installs, the usual noise. Our first instinct was "GPT is smart, just send it the whole trace." So we did, straight into gpt-4-turbo, no filtering.

This broke in a way we didn't expect. The model has a context limit, and our naive client-side truncation cut logs from the top to fit. The problem is that the actual failure — the stack trace, the exit code, the real error — is almost always near the end of a CI log, not the beginning. So we were feeding GPT the "npm install" spam and dropping the part that mattered. Summaries came back describing something that looked like a successful build. Engineers started ignoring the bot within days because it was confidently wrong.

Then the cost landed. Multi-thousand-line prompts on gpt-4-turbo pushed a single summary to $0.30–$0.80. At 40+ failures a day, that's a real line item, not a rounding error, and it showed up on the OpenAI billing dashboard a month later as a line someone in finance actually asked about. We were paying premium prices for summaries that were wrong roughly a third of the time. That's the worst combination you can build.

Mistake 2: We Processed Everything Synchronously Inside the Slack Handler

Slack's Events API requires an HTTP 200 within 3 seconds of receiving an event, or it assumes the request failed. Our handler was calling OpenAI — 2 to 6 seconds of latency depending on load — before responding to Slack. Every slow failure meant Slack marked our webhook as failed.

Slack then retried the same event, up to 3 times, using the X-Slack-Retry-Num header to tell us it was a retry. We weren't checking that header, and we had no idempotency logic at all. So a single CI failure could trigger our handler two or three times, each one kicking off its own OpenAI call and posting its own summary into the channel — same failure, slightly reworded, three times over.

Nobody flagged it for almost two weeks because the duplicates were plausible. GPT doesn't produce identical output on every call, so the three posts read like three independent (if oddly redundant) summaries rather than an obvious bug. It took someone finally asking "why does the bot post twice for the same job" in the channel before we went digging and found the retry storm. This is the kind of bug that's embarrassing precisely because the fix is one line — check the retry header, or better, dedupe on job ID — and we just hadn't written it.
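For reference, deduping on job ID with a short TTL can be as small as the sketch below. The class name, the 10-minute default, and the injectable clock are choices made for the example; a shared store such as Redis would replace the dict once more than one process handles events.

```python
import threading
import time

class TTLDedupe:
    """Remember keys for ttl_seconds; first_seen() is True only once per key in that window."""

    def __init__(self, ttl_seconds: float = 600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the behaviour can be tested
        self._seen = {}              # key -> expiry timestamp
        self._lock = threading.Lock()

    def first_seen(self, key: str) -> bool:
        now = self._clock()
        with self._lock:
            # drop expired entries so the dict does not grow without bound
            self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
            if key in self._seen:
                return False
            self._seen[key] = now + self._ttl
            return True
```

The handler then starts with something like `if not dedupe.first_seen(job_id): return`, which covers both Slack redeliveries and a pipeline that fires its failure webhook twice.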

Mistake 3: We Sent Secrets Straight to OpenAI Without Scrubbing

This is the one that actually hurt. A failing deploy job had a debug step that ran env on error to help with troubleshooting — a completely reasonable thing to do in isolation. That output included AWS_SECRET_ACCESS_KEY and a database connection string with an embedded password. That raw text went straight into the GPT prompt, and worse, it came back out in the summary and got posted to a Slack channel with over 200 members.

Watch out for this one specifically: we had zero scrubbing or redaction between "fetch log" and "send to OpenAI." We'd been treating CI logs as trusted internal text, which they are not — they're one of the most credential-dense artifacts in the entire pipeline, and we were forwarding them to a third-party API and then broadcasting them internally on top of that.

The root issue wasn't a code bug, it was process. We shipped an internal automation that touches CI logs without any security review. Nobody on the team asked "what happens if a log contains a secret" until it actually happened. We rotated the AWS key within the hour, but the bigger fix was structural: never trust a log's contents, no matter how internal the source feels.

What We Do Differently Now

The rebuilt version fixes all three problems together, and it's what we run in production now. First, log preprocessing: grep for ERROR|FAIL|Traceback|exit code, keep the last 300 relevant lines, count tokens with tiktoken before the API call, and hard-cap input around 4,000 tokens. Second, async everything: ack Slack in under a second with no OpenAI call in that path, process in a background thread or queue, and dedupe on job ID with a short TTL cache so retries don't produce duplicate posts. Third, and non-negotiable now: a regex scrubbing pass runs on the log before it touches either the OpenAI request or the Slack payload — before, not after.

We also switched the default model from gpt-4-turbo to gpt-4o-mini, which dropped cost roughly 95% for this use case, and only escalate to a bigger model when someone explicitly asks for a deeper analysis. Here's the current version of the handler, scrubbing, trimming, and async ack included:

# slack_ci_summarizer.py
# Minimal Slack Bolt app: acks fast, processes async, scrubs secrets, caps tokens.

import os, re, hashlib, threading
import requests
import tiktoken
from slack_bolt import App
from openai import OpenAI

app = App(token=os.environ["SLACK_BOT_TOKEN"], signing_secret=os.environ["SLACK_SIGNING_SECRET"])
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
seen_jobs = {}  # simple in-memory dedupe cache; use Redis in production

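SECRET_PATTERNS = [
    r"AKIA[0-9A-Z]{16}",  # AWS access key IDs
    r"-----BEGIN [A-Z ]+PRIVATE KEY-----[\s\S]+?-----END [A-Z ]+PRIVATE KEY-----",
    r"Bearer\s+[A-Za-z0-9._-]+",
    # NAME=value where the variable name contains a sensitive word,
    # e.g. AWS_SECRET_ACCESS_KEY=... or db_password = ...
    r"(?i)[\w-]*(?:password|passwd|token|secret)[\w-]*\s*=\s*\S+",
    # credentials embedded in a URL, e.g. postgres://user:pass@host/db
    r"[A-Za-z][A-Za-z0-9+.-]*://[^\s:/@]+:[^\s/@]+@\S+",
]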

def scrub(log_text: str) -> str:
    for pattern in SECRET_PATTERNS:
        log_text = re.sub(pattern, "[REDACTED]", log_text)
    return log_text

def trim_relevant_lines(log_text: str, max_lines: int = 300) -> str:
    # keep lines that matter; fall back to last N lines if nothing matches
    hits = [l for l in log_text.splitlines() if re.search(r"error|fail|traceback|exit code", l, re.I)]
    if not hits:
        return "\n".join(log_text.splitlines()[-max_lines:])
    return "\n".join(hits[-max_lines:])

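def cap_tokens(text: str, limit: int = 4000) -> str:
    # cl100k_base only approximates gpt-4o-mini's tokenizer; close enough for a hard cap
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    # keep the tail: the real failure is usually at the end of a CI log
    return enc.decode(tokens[-limit:]) if len(tokens) > limit else text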

def process_failure(job_id: str, log_url: str, channel: str, thread_ts: str):
    log_hash = hashlib.sha256((job_id).encode()).hexdigest()
    if log_hash in seen_jobs:
        return  # dedupe retries / repeated triggers
    seen_jobs[log_hash] = True

    raw_log = requests.get(log_url, timeout=10).text
    cleaned = cap_tokens(trim_relevant_lines(scrub(raw_log)))

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap default; escalate manually if needed
        messages=[
            {"role": "system", "content": "Summarize this CI failure in 3 bullets: root cause, failed step, suggested fix."},
            {"role": "user", "content": cleaned},
        ],
    )
    summary = resp.choices[0].message.content
    app.client.chat_postMessage(channel=channel, thread_ts=thread_ts, text=summary)

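@app.event("app_mention")
def handle_mention(body, ack):
    ack()  # respond within 3s, no OpenAI calls here
    event = body["event"]
    # Slack wraps links as <https://...>; pull the log URL out of the mention text
    match = re.search(r"<(https?://[^>|]+)", event["text"])
    if not match:
        return
    job_id = event.get("client_msg_id") or event["ts"]  # stable across Slack retries
    threading.Thread(
        target=process_failure,
        args=(job_id, match.group(1), event["channel"], event["ts"]),
    ).start()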

if __name__ == "__main__":
    app.start(port=3000)

And here's what the scrubbing pass actually does to a real (sanitized) log excerpt before anything leaves our network:

# Example: before/after scrubbing on a real log excerpt

BEFORE:
Deploying to production...
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
DB_CONN=postgres://admin:S3cr3tPass@db.internal:5432/app
Error: connection refused, retrying...
Traceback (most recent call last):
  File "deploy.py", line 42, in run
    conn = psycopg2.connect(DB_CONN)
psycopg2.OperationalError: could not connect to server

AFTER (sent to GPT + posted in Slack):
Deploying to production...
export [REDACTED]
DB_CONN=[REDACTED]
Error: connection refused, retrying...
Traceback (most recent call last):
  File "deploy.py", line 42, in run
    conn = psycopg2.connect(DB_CONN)
psycopg2.OperationalError: could not connect to server

# Resulting GPT summary (structured):
{
  "root_cause": "Deploy step could not reach the database (connection refused)",
  "failed_step": "deploy.py:42 - psycopg2.connect",
  "suggested_fix": "Verify DB host is reachable from the deploy runner and check network/security group rules"
}

We also moved from free-form prose to a structured JSON prompt (root cause, failed step, suggested fix), which is far easier to parse reliably and render as Slack Block Kit sections instead of a wall of text. A summary cache keyed by log hash means an identical retriggered failure doesn't cost a second API call. Cost per summary dropped from $0.30–$0.80 down to roughly $0.01–$0.02, and, more importantly, nothing sensitive leaves our infrastructure unredacted anymore.
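Rendering that JSON as Block Kit is mostly mechanical. The sketch below assumes the three keys shown in the example summary above and falls back to plain text if the model returns something that isn't a JSON object; the function name is ours for illustration.

```python
import json

FIELDS = [
    ("root_cause", "Root cause"),
    ("failed_step", "Failed step"),
    ("suggested_fix", "Suggested fix"),
]

def summary_to_blocks(raw: str) -> list:
    """Turn the model's JSON summary into Slack Block Kit section blocks."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # fall back to posting the raw text rather than dropping the summary
        return [{"type": "section", "text": {"type": "mrkdwn", "text": raw}}]
    return [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*{title}:* {data.get(key, 'n/a')}"},
        }
        for key, title in FIELDS
    ]
```

The result goes into chat_postMessage via the blocks argument, with a short text value kept alongside it as the notification fallback.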

If you're pulling job traces yourself, GitLab exposes them via GET /api/v4/projects/:id/jobs/:job_id/trace and GitHub Actions via GET /repos/{owner}/{repo}/actions/jobs/{job_id}/logs — check the GitHub Actions REST API docs for auth scopes before wiring this into anything that touches production secrets. We cover more of our CI/CD automation experiments over at kuryzhev.cloud's CI/CD category if you want to see how other pieces of this pipeline evolved.

A GPT Slack bot for CI failures is genuinely useful once it's async, scoped, and scrubbed — but none of those three things are optional, and we learned that the hard way, one at a time, over about three weeks.

Fix Silent Loki SSH Brute-Force Detection Gaps in Grafana

Oleksandr Kuryzhev — Mon, 20 Jul 2026 07:01:47 +0000

Originally published on kuryzhev.cloud

Your Grafana dashboard shows hundreds of failed SSH logins per minute from a single IP. The graph spikes at 3 a.m. like clockwork. And yet — nobody's phone buzzes. No PagerDuty ping, no Slack message, nothing. This is exactly the failure mode we hit last quarter while running Loki SSH brute-force detection across a fleet of bastion hosts, and it took longer than I'd like to admit to figure out why the pipeline was silently blind despite the data being right there in Loki.

Symptoms

The first sign wasn't an incident — it was a curious engineer poking around in Grafana Explore after a routine audit. They ran a quick query against the auth logs and saw a clear pattern: bursts of sshd[...]: Failed password for entries, clustered from a handful of rotating source IPs, mostly hitting during off-hours. Classic credential-stuffing behavior against SSH. Nothing subtle about it.

What was subtle: zero alerts had fired. No rule had triggered, no receiver had been paged, and the on-call rotation had no idea this had been happening for days. We went to the host directly — journalctl -u ssh showed hundreds of Invalid user and Failed password lines per minute during the attack windows. But the equivalent LogQL query in Loki returned either zero results or a partial subset, missing most of the actual volume.

On top of that, when we tried tightening the query to catch more of it, the Loki ruler started throwing too many outstanding requests and rule evaluation timeouts. That's usually the tell that a query is scanning far more data than it needs to, which is its own symptom worth flagging separately from the missing-alert problem.

Put together: the raw logs exist, the attack is real, but the detection layer between "logs in Loki" and "human gets paged" has multiple silent failure points. That's the scenario this post walks through.

Root cause

Loki wasn't broken. The pipeline was just built with three independent gaps that each looked fine in isolation but combined into total silence.

First, sshd log formats vary depending on distro and OpenSSH version. Failed password for invalid user X from IP port P ssh2 and Failed password for X from IP port P ssh2 are both valid lines from the same daemon, and a regex written to match one variant quietly ignores the other. We had a rule that only caught invalid user attempts — meaning attackers guessing valid usernames like root or ubuntu sailed through undetected while we thought coverage was complete.
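The two variants are easy to check outside Loki. This small Python sketch uses an optional "invalid user " prefix so one pattern covers both forms; the helper name is illustrative, and the sample lines in the usage note follow the format described above.

```python
import re

# The optional "invalid user " prefix lets one pattern cover both sshd variants.
FAILED_PASSWORD = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port (?P<port>\d+)"
)

def parse_failed_login(line: str):
    """Return (user, ip) for a failed-password line, or None if it does not match."""
    m = FAILED_PASSWORD.search(line)
    return (m.group("user"), m.group("ip")) if m else None
```

Running a handful of real lines from journalctl through a check like this before writing the alert rule is a cheap way to confirm that both the invalid-user and known-user forms are covered.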

Second, our Promtail pipeline never extracted source_ip as a field. Without that, LogQL had no cheap way to group and count attempts per attacker — every query fell back to expensive full-text line filtering across the whole stream, which is exactly what triggered the ruler timeouts.

Third — and this is the one nobody checks until it's too late — there was no alerting rule in the ruler config at all for one of the two log variants, and the rule that did exist had a team: security label that didn't match anything in the Alertmanager route.match tree. Alertmanager doesn't complain about unmatched labels; it just routes to whatever default receiver exists (or nothing), silently.

None of these are exotic bugs. They're the standard gap between "the logs are in Loki" and "someone actually gets notified," and I've seen variations of this exact gap in three different environments now.

Fix #1

Start with the LogQL query itself. The goal is to reliably match failed SSH attempts regardless of variant, extract the source IP without promoting it to a static label, and turn raw line counts into a rate per attacker.


# Base query: match failed password lines and extract IP via the LogQL regexp parser
{unit="ssh.service"} |= "Failed password" | regexp `from (?P<ip>\S+) port`

# Wrapped for rate detection: count attempts per IP over a 5m window
sum by (ip) (
  count_over_time(
    {unit="ssh.service"}
      |= "Failed password"
      | regexp `from (?P<ip>\S+) port`
    [5m]
  )
) > 10

This is query-time extraction, not ingestion-time labeling. That distinction matters, and I'll come back to it in Prevention. Gotcha: it's tempting to promote ip to an index label in Promtail because it makes queries feel snappier in Grafana. Don't. On Loki 3.0+, attach it as structured metadata instead if you need to filter on it without a regexp pass; labels keyed on attacker IPs will wreck your index size fast (more on that below).

Fix #2

The query fix is useless if Promtail's pipeline is silently dropping or mis-parsing lines before they even hit Loki. We rebuilt the pipeline stage to handle both sshd variants in a single regex, and — critically — tested it locally before touching production.


# promtail-config.yaml snippet — pipeline stage to reliably parse sshd journald output
scrape_configs:
  - job_name: sshd-journal
    journal:
      json: false
      max_age: 12h
      labels:
        job: varlogs
        unit: ssh.service          # matches Loki query label filter above
    pipeline_stages:
      - match:
          selector: '{unit="ssh.service"}'
          stages:
            - regex:
                # single regex handles both invalid-user and known-user failures
                expression: 'Failed password for (invalid user )?(?P<user>\S+) from (?P<ip>\S+) port (?P<port>\d+)'
            - labels:
                user:               # ok as label - low-ish cardinality vs raw IP

# --- Test output from logcli after deploying the fix ---
# $ logcli query '{unit="ssh.service"} |= "Failed password"' --limit=5
# 2024-05-14T03:12:41Z {unit="ssh.service"} sshd[1922]: Failed password for invalid user admin from 203.0.113.5 port 51422 ssh2
# 2024-05-14T03:12:43Z {unit="ssh.service"} sshd[1923]: Failed password for root from 203.0.113.5 port 51430 ssh2
# 2024-05-14T03:12:45Z {unit="ssh.service"} sshd[1924]: Failed password for admin from 198.51.100.9 port 40211 ssh2

We ran promtail --dry-run --config.file=/etc/promtail/config.yaml --client.url=http://localhost:3100/loki/api/v1/push against a sample log file before deploying — this caught a broken capture group the first time around. Watch out for the classic mistake of bolting a json stage onto plaintext syslog output. sshd doesn't emit JSON, so a stray json stage just no-ops silently. No error in the Promtail logs, no crash — the pipeline just does nothing useful, and you don't find out until you go looking for data that isn't there.

Fix #3

Detection is worthless if it doesn't route to a human. We defined the rule group directly in the Loki ruler and made sure the labels lined up exactly with Alertmanager's routing tree.


# loki-rules.yaml — Loki ruler rule group for SSH brute-force detection
# Path: /etc/loki/rules/security-tenant/ssh-bruteforce.yaml
groups:
  - name: ssh-bruteforce
    interval: 1m                     # evaluation interval - keep low-cost queries at 1m
    rules:
      - alert: SSHBruteForceDetected
        expr: |
          sum by (ip, instance) (
            count_over_time(
              {unit="ssh.service"}
                |= "Failed password"
                | regexp `from (?P<ip>\S+) port`
              [5m]
            )
          ) > 10
        for: 2m                      # require sustained activity, avoid single-spike false positives
        labels:
          severity: critical
          team: security             # must match Alertmanager route.match tree exactly
        annotations:
          summary: "Possible SSH brute-force from {{ $labels.ip }} on {{ $labels.instance }}"
          description: >
            {{ $value }} failed SSH login attempts detected from {{ $labels.ip }}
            in the last 5 minutes on host {{ $labels.instance }}.
          runbook_url: "https://kuryzhev.cloud/runbooks/ssh-bruteforce"

      - alert: SSHBruteForceInvalidUser
        expr: |
          sum by (ip, instance) (
            count_over_time(
              {unit="ssh.service"}
                |= "Invalid user"
                | regexp `from (?P<ip>\S+) port`
              [5m]
            )
          ) > 5
        for: 1m
        labels:
          severity: warning
          team: security
        annotations:
          summary: "Repeated invalid-user SSH attempts from {{ $labels.ip }}"

Two details here matter more than they look. The for: 2m requirement stops a single retry burst (say, a CI job rotating deploy keys badly) from paging someone at 3 a.m. And grouping by ip, instance in the sum by clause is what collapses the count into one series, and therefore one alert, per attacker and host; skip it and every distinct stream and label combination becomes its own alert, flooding the receiver with duplicates for a single ongoing attack. We learned that one the hard way after a test run generated 40 near-identical pages in under a minute.

Prevention

Getting Loki SSH brute-force detection working once is easy. Keeping it working — and keeping Loki's storage bill sane — takes a few standing rules.

Never promote raw attacker IPs to Loki labels at ingestion time. With thousands of unique IPs hitting a public bastion host, that can blow up index size by 10-50x and your storage cost right along with it. Extract IP at query time with regexp, or use structured metadata (introduced as experimental in Loki 2.9, enabled by default in 3.0) if you genuinely need to filter on it without parsing every line. See the official Loki structured metadata docs.

Add inhibition rules in Alertmanager so a critical brute-force alert suppresses the lower-severity duplicate for the same host; check the Alertmanager inhibit_rules reference for the exact syntax. Pair detection with action: forward the matched IP from an Alertmanager webhook receiver into a script running fail2ban-client set sshd banip <ip> so alerts trigger a block, not just a dashboard glance. And audit your sshd log format after every OpenSSH package bump, because a minor version upgrade can silently reformat log lines and break your regex without any error surfacing anywhere.
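The webhook-to-fail2ban step could look roughly like this. It is a sketch, not our production receiver: it assumes the standard Alertmanager webhook payload (a top-level alerts list whose items carry status and labels), an ip label on each alert, and a jail named sshd. The function only builds the commands, so the part that touches the host stays separate and easy to review.

```python
import ipaddress

def ban_commands(payload: dict, jail: str = "sshd") -> list:
    """Build one fail2ban-client argv per firing alert that carries a valid ip label."""
    commands = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # ignore resolved notifications
        raw_ip = alert.get("labels", {}).get("ip", "")
        try:
            ip = ipaddress.ip_address(raw_ip)
        except ValueError:
            continue  # never pass unvalidated, log-derived text to a command
        commands.append(["fail2ban-client", "set", jail, "banip", str(ip)])
    return commands
```

Each returned list can be handed to subprocess.run(cmd, check=True). Passing an argv list rather than a shell string, plus the ipaddress validation, matters here because the ip value ultimately comes from attacker-influenced log lines.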

We've covered more log pipeline war stories like this over at kuryzhev.cloud — worth a browse if you're building out your own observability stack from scratch.

PostgreSQL Archive Old Rows Script: 3 Mistakes We Made

Oleksandr Kuryzhev — Sun, 19 Jul 2026 07:02:01 +0000

Originally published on kuryzhev.cloud

Context

Our PostgreSQL archive old rows script started as a five-line cron job and ended up as the reason our on-call engineer got paged at 2 a.m. three separate times in one month. The setup was simple enough on paper: an events table on Postgres 14.9 (RDS, db.r5.xlarge) was growing roughly 40 million rows a month, dashboards were timing out, and query planning on that table had gotten noticeably slower. The fix seemed obvious — archive anything older than 90 days into a separate table, then delete it from the live one.

The first version was a single Python script using psycopg2 2.9.9, run via cron once a day. It copied matching rows into events_archive, then deleted them from events, all inside one big transaction. It ran fine in staging against a 200k-row test dataset. It ran fine in production too — for about two weeks, while the volume of "old" rows was still manageable. Then the backlog caught up, the batch sizes ballooned into the millions, and everything we hadn't tested for showed up at once.

I want to be upfront: none of these mistakes were exotic. They're the kind of thing you'd catch in a code review if you'd been burned by them before. We hadn't been. This is the retrospective on what broke and what we changed.

Mistake 1: One giant transaction to delete everything

The original script did an INSERT INTO events_archive SELECT * FROM events WHERE created_at < now() - interval '90 days', followed by the matching DELETE, both inside a single transaction, no LIMIT, no batching. On a day where the backlog had grown to 6 million qualifying rows, that transaction held a lock on events for 18 minutes straight.

During those 18 minutes, every app write against that table queued up behind it. The app logs filled with canceling statement due to statement timeout — connections from the pool were dying waiting for a lock that never released in time. Worse, the WAL volume from deleting 6M rows in one shot spiked hard enough that replication lag on our read replica jumped past 40 minutes. Nothing errored on the replica side — it just silently served stale data to whatever was reading from it, which in our case included a reporting pipeline and an ETL job. Nobody noticed until someone asked why a dashboard showed numbers from an hour ago.

Root cause, in hindsight, was obvious: no LIMIT, no batching, and no lock_timeout or statement_timeout set on the connection. A single unbatched DELETE against millions of rows is going to hold locks for as long as it takes, and Postgres will happily let you do it. Watch out for this if you're writing a "simple" cleanup script against any table that also takes live writes — the size of the row count is the only variable that decides whether your script is invisible or an incident.

Mistake 2: No protection against overlapping cron runs

The job was scheduled hourly. Under normal conditions each run finished in a couple of minutes, so overlap was never a concern — until Mistake 1 made a run take 18 minutes, which meant the next scheduled run started while the first one was still working. Cron doesn't check whether the previous invocation finished. It just fires on the interval.

Two processes ended up selecting the same "old" rows before either had deleted them. Both inserted the same rows into events_archive, and the second insert hit a primary key violation. That would have been a loud, obvious failure — except the exception handling in that section of the script was, and I'm not proud of this, a bare except Exception: pass. It was swallowing duplicate-key errors silently. We found it during a code review months later, after someone noticed archive row counts didn't reconcile with the source table's delete count.

This is a classic gotcha: cron does not guarantee non-overlapping execution. A fixed interval plus a variable-duration job will eventually overlap — it's not a matter of if, only when. The permanent fix was a Postgres advisory lock (pg_try_advisory_lock), which we'll cover below. At the time, we had literally nothing preventing two copies of the script from running at once, and the silent exception handler meant we had no visibility into it happening for weeks.

Mistake 3: Trusting archive-then-delete as atomic across two systems

Even after fixing the lock and the overlap issue, the script logic was still: copy rows to events_archive, commit, then delete from events, commit. Two separate transactions. That gap between them is where things went wrong.

A network blip between the insert-commit and the delete-commit left around 12,000 rows sitting in both tables — archived and not yet deleted — for several days before anyone caught it, because nothing failed loudly. We only noticed when a monthly reconciliation script found the archive count didn't match the delete count.

The worse near-miss came from an earlier version of the script, before a code review comment made us reorder the steps. That version deleted from events first, then archived second. If the process had crashed between those two steps — a deploy, an OOM kill, an RDS failover — those rows would have been gone with zero recovery path. We got lucky that this version never actually crashed at the wrong moment in production. It could have, and there was no way to tell from the logs afterward that it almost had.

The lesson here is that "archive then delete" only sounds atomic. Unless both operations commit as one transaction, you're one interrupted process away from either duplicated data or permanently lost data, and which failure mode you get depends entirely on which write you did first.

What we do differently now

The current version batches everything into small, short transactions instead of one long one. Each batch selects up to 5,000 rows with FOR UPDATE SKIP LOCKED so we never fight with concurrent app writers touching the same rows, and the insert-plus-delete for that batch happens in a single transaction. A crash mid-batch just means re-running that batch, which is safe because the insert into the archive table uses ON CONFLICT (id) DO NOTHING against its primary key.


# archive_old_rows.py
# Batched, lock-safe archival of rows older than N days from `events` -> `events_archive`
import os
import sys
import time
import logging
import psycopg2

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("archiver")

DSN = os.environ["ARCHIVE_DB_DSN"]  # pulled from Secrets Manager at deploy time, never hardcoded
BATCH_SIZE = 5000
RETENTION_DAYS = 90
LOCK_KEY = "archive_events_job"

def acquire_lock(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT pg_try_advisory_lock(hashtext(%s))", (LOCK_KEY,))
        return cur.fetchone()[0]

def archive_batch(conn):
    """Move a single batch. Returns rows moved (0 means done)."""
    with conn.cursor() as cur:
        # session-level guard: never let one batch block writers for long
        cur.execute("SET lock_timeout = '5s'")
        cur.execute("SET statement_timeout = '30s'")
        cur.execute("""
            WITH batch AS (
                SELECT id FROM events
                WHERE created_at < now() - make_interval(days => %s)
                ORDER BY id
                LIMIT %s
                FOR UPDATE SKIP LOCKED
            ),
            moved AS (
                INSERT INTO events_archive
                SELECT e.* FROM events e JOIN batch b ON e.id = b.id
                ON CONFLICT (id) DO NOTHING
                RETURNING id
            )
            DELETE FROM events
            WHERE id IN (SELECT id FROM batch)
            RETURNING id;
        """, (RETENTION_DAYS, BATCH_SIZE))
        rows = cur.rowcount
    conn.commit()  # commit per-batch, not per-run — keeps locks short
    return rows

def main():
    conn = psycopg2.connect(DSN)
    conn.autocommit = False
    total = 0
    try:
        if not acquire_lock(conn):
            log.info("Another archive run holds the lock, exiting cleanly.")
            return
        while True:
            try:
                moved = archive_batch(conn)
            except psycopg2.errors.LockNotAvailable:
                conn.rollback()  # clear the aborted transaction before retrying
                log.warning("Batch skipped due to lock contention, retrying in 5s")
                time.sleep(5)
                continue
            total += moved
            if moved == 0:
                break
            time.sleep(0.1)  # brief pause to avoid saturating IOPS
        log.info(f"Archived {total} rows.")
    finally:
        conn.close()  # also releases the session-level advisory lock

if __name__ == "__main__":
    main()

The advisory lock wraps the whole run so an overlapping cron or systemd-timer execution just exits cleanly instead of racing: pg_try_advisory_lock returns false immediately if another run already holds it, no polling required. We also moved scheduling off plain crontab to a systemd timer with Persistent=true, so a run missed while the host was down fires once at the next boot instead of being silently skipped.


# /etc/systemd/system/archive-events.timer
# Runs every hour, catches up on missed runs if host was down
[Unit]
Description=Timer for events archival job

[Timer]
OnCalendar=hourly
Persistent=true
RandomizedDelaySec=60

[Install]
WantedBy=timers.target

# /etc/systemd/system/archive-events.service
[Unit]
Description=Archive old rows from events table

[Service]
Type=oneshot
EnvironmentFile=/etc/archiver/env
ExecStart=/usr/bin/python3 /opt/archiver/archive_old_rows.py
User=archiver_svc

# --- observed output during the Mistake 1 incident (for comparison) ---
# ERROR: canceling statement due to statement timeout
# app-server-3 | psycopg2.errors.QueryCanceled: canceling statement due to statement timeout
# pg_stat_replication.replay_lag: 00:41:12
#
# --- after fix, typical run log ---
# INFO:archiver:Archived 6023841 rows.
# real    9m14.221s

The numbers speak for themselves: the same 6M-row backlog that used to take 18 minutes and block every writer now takes about 9 minutes total, with each batch holding a lock for roughly 200ms. Zero write blocking, zero timeout errors in the app logs.

A few other things changed alongside the code. The archiver now runs under a dedicated archiver Postgres role with only the privileges the script needs: INSERT and SELECT on events_archive (the RETURNING clause reads id back), and SELECT, UPDATE, and DELETE on events (SELECT ... FOR UPDATE requires the UPDATE privilege). It cannot DROP or ALTER anything, and credentials come from AWS Secrets Manager instead of being hardcoded in a config file. We also dropped three unused indexes on events_archive since it's append-only and rarely queried; that alone cut insert time noticeably. And after any large archive run, we manually kick off VACUUM (ANALYZE) events; during a low-traffic window, because autovacuum can fall behind on a table with heavy write traffic, and dead tuples pile up fast after a mass delete. We track n_dead_tup from pg_stat_user_tables in Grafana and alert if a run takes more than 10 minutes.

One quiet gotcha worth calling out separately: created_at < now() - interval '90 days' is evaluated in the database's timezone, not the application's. If your RDS instance isn't set to UTC, "90 days old" can be off by hours depending on daylight saving — easy to miss, annoying to debug after the fact. We double-checked this against the Postgres date/time function docs and the RDS parameter group settings and normalized everything to UTC explicitly instead of trusting defaults.
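
To make the drift concrete, here is a small Python sketch of the same day arithmetic. It assumes Postgres subtracts whole days in the session time zone and keeps the local wall-clock time, which is how the date/time docs describe interval arithmetic on timestamptz. The function name cutoff_utc and the example date and zone are illustrative, not taken from our job.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def cutoff_utc(now_utc: datetime, days: int, session_tz: str) -> datetime:
    """Compute 'now - N days' the way day-granular interval arithmetic would
    in session_tz, then express the result in UTC for comparison."""
    local_now = now_utc.astimezone(ZoneInfo(session_tz))
    # Subtracting whole days keeps the local wall-clock time unchanged,
    # even if a daylight saving transition sits inside the window.
    local_cutoff = local_now - timedelta(days=days)
    return local_cutoff.astimezone(timezone.utc)


now = datetime(2026, 5, 1, 16, 0, tzinfo=timezone.utc)
utc_cutoff = cutoff_utc(now, 90, "UTC")
ny_cutoff = cutoff_utc(now, 90, "America/New_York")
drift = ny_cutoff - utc_cutoff

print(utc_cutoff.isoformat())  # 2026-01-31T16:00:00+00:00
print(ny_cutoff.isoformat())   # 2026-01-31T17:00:00+00:00
print(drift)                   # 1:00:00
```

An hour of drift is harmless for a 90-day retention window on most days, but it is exactly the kind of difference that makes two environments disagree about which rows are eligible.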

If you're running a similar cleanup job today, I'd rather you learn this from a retrospective than from a 2 a.m. page. We wrote up more of our production database patterns over on kuryzhev.cloud if you want the broader context on how we run Postgres on RDS day to day.

Stop Credential Stuffing on Login with Cloudflare WAF Rules

Oleksandr Kuryzhev — Sat, 18 Jul 2026 07:01:52 +0000

Originally published on kuryzhev.cloud

Cloudflare WAF credential stuffing attacks don't look like a DDoS, and that's exactly why our first response made things worse. At 3am, PagerDuty woke me up because /api/v1/auth/login latency had jumped from 80ms to 4 seconds and origin CPU was pinned at 100% on a 4-vCPU box sitting behind Cloudflare. By the time I had coffee in hand, support tickets were piling up from real users locked out of their own accounts.

The problem we hit

The Cloudflare Analytics dashboard told the real story fast: 340k requests per minute hitting /api/v1/auth/login, coming from roughly 9,000 unique IPs, almost all in residential proxy ranges. Every request had a valid-looking JSON body and a rotating User-Agent header. Nothing screamed "bot" at a glance: no malformed payloads, no obvious SQLi patterns, just a flood of plausible login attempts.

Our first instinct was to trust the app's existing rate limiter, a Redis-backed counter keyed per-endpoint. It did exactly what it was built to do: it started blocking after N failed attempts on the login path. Problem was, it counted by endpoint, not by identity or session, so once the attack volume crossed the threshold, it started rejecting legitimate mobile users too — anyone behind shared NAT or CGNAT got caught in the crossfire. We had turned a targeted attack into a self-inflicted outage for real customers, which is arguably worse.
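
The keying problem is easier to see in a toy model. The sketch below is not our Redis limiter; FixedWindowLimiter, endpoint_key and identity_key are hypothetical names, and the window never resets because a single window is enough to show the effect.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Toy in-memory counter for one window; stands in for a Redis-backed limiter."""

    def __init__(self, limit: int):
        self.limit = limit
        self.counts = defaultdict(int)

    def allow(self, key: str) -> bool:
        self.counts[key] += 1
        return self.counts[key] <= self.limit


def endpoint_key(path: str, client_id: str) -> str:
    # Everyone hitting the same path shares one counter.
    return path


def identity_key(path: str, client_id: str) -> str:
    # Each client gets its own counter for the path.
    return f"{path}|{client_id}"


def real_user_allowed(key_fn, limit: int = 100) -> bool:
    limiter = FixedWindowLimiter(limit)
    # 500 bot attempts, each from a different client, inside one window.
    for n in range(500):
        limiter.allow(key_fn("/api/v1/auth/login", f"bot-{n}"))
    # One real user tries to log in during the same window.
    return limiter.allow(key_fn("/api/v1/auth/login", "real-user"))


print(real_user_allowed(endpoint_key))  # False
print(real_user_allowed(identity_key))  # True
```

Note that the identity-keyed variant lets the real user in but does nothing to slow the bots, since each bot also gets a fresh counter. Keying alone is not a defense; it only stops the limiter from punishing the wrong people.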

We briefly flipped on Cloudflare's "I'm Under Attack" mode, hoping for a quick win. It didn't touch the traffic — because this wasn't a volumetric flood, it was slow, distributed, and looked like normal HTTP traffic to anything watching for spikes in raw request rate per IP.

Why it happens

Credential stuffing replays leaked username/password pairs against a login form at scale, using botnets or proxy pools to spread requests across thousands of source IPs. It's not trying to overwhelm bandwidth — it's trying to blend in. That single fact defeats most default protections out of the box.

Cloudflare's managed WAF rulesets are tuned to catch known exploit signatures: SQL injection, XSS, path traversal. A well-formed login POST with a fake username and password doesn't trip any of those signatures because, structurally, it's a completely valid request. Bot Fight Mode helps a little on lower plans, but without Bot Management scoring, you're relying on heuristics that rotating-proxy botnets are specifically built to evade.

App-level rate limiting breaks down for a more mechanical reason: Cloudflare terminates TLS at the edge, so your origin only sees Cloudflare's IPs unless you're correctly trusting CF-Connecting-IP or X-Forwarded-For. Even when that's configured right, keying purely on ip.src is a losing game — if the attacker has 9,000 IPs and your threshold is 10 requests per IP, they just spread the load. One IP per handful of requests defeats any per-IP counter, no matter how tight you set the window.
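
For the "correctly trusting CF-Connecting-IP" part, a minimal sketch of the idea: only honor the header when the TCP peer is itself a Cloudflare address. The helper name client_ip is hypothetical, and the two ranges listed are an illustrative subset; a real deployment should load the current list from Cloudflare's published IP ranges.

```python
import ipaddress

# Illustrative subset only. Load the full, current list from Cloudflare's
# published IP ranges instead of hardcoding it.
TRUSTED_PROXY_RANGES = [
    ipaddress.ip_network(cidr) for cidr in ("173.245.48.0/20", "104.16.0.0/13")
]


def client_ip(peer_addr: str, headers: dict) -> str:
    """Return the address to use for logging and rate limiting.

    headers is a plain dict here; real frameworks normalize header names,
    so the lookup would go through the framework's own accessor.
    """
    peer = ipaddress.ip_address(peer_addr)
    if any(peer in net for net in TRUSTED_PROXY_RANGES):
        forwarded = headers.get("CF-Connecting-IP", "").strip()
        try:
            return str(ipaddress.ip_address(forwarded))
        except ValueError:
            pass  # missing or malformed header: fall back to the peer address
    # Direct connections never get to choose their own identity.
    return peer_addr


print(client_ip("104.16.1.1", {"CF-Connecting-IP": "203.0.113.7"}))    # 203.0.113.7
print(client_ip("198.51.100.5", {"CF-Connecting-IP": "203.0.113.7"}))  # 198.51.100.5
```

The second call is the important one: a client that connects to the origin directly and sends its own CF-Connecting-IP header is still counted under its real address.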

The fix (with code)

The fix needed two layers: a WAF custom rule to challenge suspicious traffic based on behavior, not just IP reputation, and a rate limiting rule keyed on a composite identifier that rotating IPs can't easily fake. We used cf.bot_management.score (available with Bot Management on Business/Enterprise, heuristic fallback otherwise) combined with cf.unique_visitor_id as a secondary rate-limit key — this ties the count to a Cloudflare-derived visitor fingerprint, not just the source IP.

We deployed this with Terraform using the cloudflare/cloudflare provider v4.x. Note: v4 rewrote rule resources into the unified cloudflare_ruleset — mixing v3's deprecated cloudflare_rate_limit or cloudflare_filter with v4 syntax is a fast way to get confusing plan diffs. Don't mix them in the same repo.


# terraform/cloudflare_waf_credential_stuffing.tf
# Provider: cloudflare/cloudflare v4.20+
terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.20"
    }
  }
}

variable "zone_id" {
  description = "Cloudflare Zone ID for the protected domain"
  type        = string
}

# Layer 1: WAF custom rule - challenge low bot-score traffic on login endpoint
resource "cloudflare_ruleset" "login_bot_challenge" {
  zone_id     = var.zone_id
  name        = "credential-stuffing-bot-challenge"
  description = "Managed challenge for low bot-score requests hitting login"
  kind        = "zone"
  phase       = "http_request_firewall_custom"

  rules {
    action = "managed_challenge"
    expression = "(http.request.uri.path eq \"/api/v1/auth/login\" and http.request.method eq \"POST\" and cf.bot_management.score lt 30)"
    description = "Challenge suspected bots on login"
    enabled = true
  }
}

# Layer 2: Rate limiting rule - keyed on IP + unique visitor ID, not IP alone
resource "cloudflare_ruleset" "login_rate_limit" {
  zone_id     = var.zone_id
  name        = "credential-stuffing-rate-limit"
  description = "Rate limit login attempts per composite key"
  kind        = "zone"
  phase       = "http_ratelimit"

  rules {
    action = "managed_challenge"
    expression = "(http.request.uri.path eq \"/api/v1/auth/login\" and http.request.method eq \"POST\")"
    description = "Throttle login POSTs beyond threshold"
    enabled = true

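    ratelimit {
      # cf.colo.id is listed because the Rulesets API expects it among the
      # characteristics of rate limiting rules created outside the dashboard
      characteristics     = ["cf.colo.id", "ip.src", "cf.unique_visitor_id"]
      period              = 60
      requests_per_period = 5
      mitigation_timeout  = 600
    }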
  }
}

We deliberately set the action to managed_challenge, not block. Hard-blocking on IP reputation alone caused roughly 2% false-positive lockouts of real mobile users behind CGNAT during the incident — that's a "watch out for" I'd pin to a whiteboard. A challenge gives real humans a way through; a hard block just moves your outage from "attackers win" to "you locked out paying customers."

Before touching production, we validated the expression via the API directly, then simulated distinct source IPs against staging using CF-Connecting-IP headers to confirm the composite key actually throttled correctly:


# Manual verification via API before Terraform apply — dry-run the expression
# Requires token scope: Zone.Firewall Services:Edit

ZONE_ID="your_zone_id_here"
CF_API_TOKEN="your_scoped_token_here"

# 1. Create the ruleset via API (equivalent to the Terraform resource above)
curl -s -X PUT "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/rulesets/phases/http_ratelimit/entrypoint" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{
    "rules": [{
      "action": "managed_challenge",
      "expression": "(http.request.uri.path eq \"/api/v1/auth/login\" and http.request.method eq \"POST\")",
      "ratelimit": {
        "characteristics": ["cf.colo.id", "ip.src", "cf.unique_visitor_id"],
        "period": 60,
        "requests_per_period": 5,
        "mitigation_timeout": 600
      },
      "description": "Throttle login POSTs beyond threshold",
      "enabled": true
    }]
  }'

# 2. Simulate distinct source IPs to confirm keying works (staging only)
for i in 1 2 3 4 5 6; do
  curl -s -o /dev/null -w "attempt $i -> %{http_code}\n" \
    -X POST "https://staging.example.com/api/v1/auth/login" \
    -H "Content-Type: application/json" \
    -H "CF-Connecting-IP: 203.0.113.$i" \
    -d '{"username":"test","password":"wrong"}'
done
# Expected: first 5 return 401/200, 6th onward returns 429 or challenge page (HTTP 403/503 w/ CF-Mitigated header)

# 3. Confirm which rule fired via Logpush FirewallEvents dataset (async, ~1-5min delay)

One error we hit while iterating on the expression: "1101: rule execution failed". It turned out to be a typo: cf.bot_management.score written as cf.bot_managment.score. The dashboard's "Test Rule" preview catches this before you deploy, so use it. Also worth knowing: rule quotas differ by plan. Free allows 5 custom WAF rules per zone, Pro 20, Business 100. Hit the ceiling and the API rejects the new rule with HTTP 400: "reached maximum number of rules", which is a nasty surprise mid-incident when you're trying to add one more rule fast.

We also learned that phase order matters. WAF custom rules in http_request_firewall_custom execute before rate limiting rules in http_ratelimit by default. If your rate limit rule seems to be silently skipped, check whether an earlier phase rule — including an old "Skip" allowlist rule for CI/CD or a partner integration — is exempting the traffic before it ever reaches the rate limiter. We had exactly this: a legacy skip rule scoped to /api/* for a partner webhook was inadvertently exempting attacker traffic hitting /api/v1/auth/login. Audit your skip rules before layering in new blocking logic; the details are covered well in the Cloudflare WAF documentation and the Rulesets API reference.
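
A skip-rule audit like that can be roughed out in a few lines. The sketch below assumes you have already exported the rules as a list of dicts with action, expression, description and enabled fields, which is the shape the Rulesets API returns. The helper name skip_rules_covering is hypothetical, and the matching is a plain string heuristic rather than a real expression parser, so it will over-report.

```python
import re


def skip_rules_covering(rules: list, target_path: str) -> list:
    """Return enabled skip rules whose expression mentions a path literal
    that is a prefix of target_path. Heuristic only: it does not evaluate
    the expression, so treat every hit as something to review by hand."""
    hits = []
    for rule in rules:
        if rule.get("action") != "skip" or not rule.get("enabled", True):
            continue
        # Collect quoted string literals that look like paths.
        literals = re.findall(r'"(/[^"]*)"', rule.get("expression", ""))
        if any(target_path.startswith(lit.rstrip("*")) for lit in literals):
            hits.append(rule)
    return hits


rules = [
    {"action": "skip", "enabled": True,
     "description": "partner webhook allowlist",
     "expression": 'starts_with(http.request.uri.path, "/api/")'},
    {"action": "managed_challenge", "enabled": True,
     "description": "Challenge suspected bots on login",
     "expression": 'http.request.uri.path eq "/api/v1/auth/login"'},
    {"action": "skip", "enabled": True,
     "description": "CI health checks",
     "expression": 'http.request.uri.path eq "/healthz"'},
]

for rule in skip_rules_covering(rules, "/api/v1/auth/login"):
    print(rule["description"])  # partner webhook allowlist
```

Running this against every auth-adjacent path before adding a new rule would have surfaced our legacy /api/* exemption in seconds.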

Prevention checklist

Once the fire was out, we built guardrails so this incident class doesn't come back disguised as something else.

Enable Bot Fight Mode or Bot Management, and watch the score distribution. Pull cf.bot_management.score weekly via Logpush so you notice drift before an attack, not during one.
Protect the pivot endpoints too. Once login is hardened, attackers move to password-reset and registration. We added the same challenge + rate-limit pattern to both within a week of the original fix.
Alert on challenge counts, not just blocks. Query the Security Events GraphQL Analytics API for spikes in action = "challenge" — a rising challenge rate is your early warning before volume escalates to something that needs a 3am page.
Scope and rotate WAF automation tokens. Use least-privilege tokens (Zone:Firewall Services:Edit only) for Terraform CI, and keep state remote — plaintext rule IDs and zone tokens in git history are a liability you don't want to discover during an audit.
Never leave a rate limit rule set to log only in production. Logging-only gives attackers a free window to keep hammering while you "observe." Pair it with at least managed_challenge from day one on any auth-adjacent endpoint.

Cloudflare WAF credential stuffing defenses only work if you treat them as a layered, monitored system rather than a one-time rule you set and forget. We revisit thresholds monthly using 30-day Logpush data pulled into R2, since dashboard analytics only retain granular detail for 24 hours on lower plans — not enough to spot slow-building attacks. If you're setting up your infra repo structure for this kind of Terraform-managed security config, our DevOps_DayS archive has more patterns worth stealing.

Prometheus Game Server Exporter Checklist for Player Metrics

Oleksandr Kuryzhev — Fri, 17 Jul 2026 07:02:01 +0000

Originally published on kuryzhev.cloud

Your Prometheus dashboard says everything is green while players are rage-quitting in Discord over lag. That disconnect happens because node_exporter and blackbox_exporter have no idea what's happening inside your game server's UDP query protocol. Building a proper Prometheus game server exporter is the only way to close that gap, and it's easy to get wrong in ways that only show up under real player load.

Why this checklist

Every gaming infra team I've worked with eventually hits the same wall: CPU and memory graphs look fine, but support tickets say "the game feels laggy." That's because player_count and query_latency are the actual SLIs players feel — not the generic host metrics your exporters already scrape. A server can sit at 20% CPU and still have a broken matchmaking queue or a query protocol timing out under load.

The problem is that most game engines don't expose Prometheus-friendly metrics out of the box. Source engine servers, Unreal dedicated servers, and custom UDP-based protocols speak A2S, RCON, or proprietary JSON — none of which Prometheus understands natively. You need glue code: something that polls the game server's own query protocol and translates it into metric types Prometheus can scrape.

I've reviewed enough of these exporters, written by different teams at different studios, to notice a pattern: the mistakes are rarely about missing knowledge. They're small oversights — a Counter where a Gauge belongs, a missing timeout, a label that quietly explodes cardinality. That's exactly the kind of failure a checklist catches better than a tutorial does. You don't need to relearn Prometheus fundamentals every time you ship a new exporter; you need something to run down before merging the PR. That's what follows.

The checklist (numbered)

Pick the right metric type. Use Gauge for gameserver_player_count — it goes up and down. Use Histogram for gameserver_query_latency_seconds with buckets matched to your game genre (tight buckets like [0.05, 0.1, 0.25] for FPS, wider ones for MMO tick rates). Never use Summary for latency if you plan to aggregate across instances in PromQL — quantiles from a Summary can't be averaged mathematically across servers.
Follow naming conventions. Stick to gameserver_<noun>_<unit>, e.g. gameserver_query_latency_seconds. This matches the Prometheus metric naming guidelines and keeps your Grafana queries predictable.
Bound your label set. Label by region, shard, or map — never by player_id, session_id, or IP. Labels must be a known, finite set decided at design time, not something that grows with player count.
Make the query layer non-blocking with a hard timeout. Poll the game server's query protocol on a background thread or async loop, with a timeout shorter than Prometheus's scrape_timeout. The HTTP handler serving /metrics should only read from an in-memory cache, never make a live network call.
Separate /healthz from /metrics. A liveness probe checking exporter process health shouldn't depend on the game server being reachable — otherwise a game server outage takes down your monitoring too.
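Package it properly. Run as a non-root user, bind to 127.0.0.1:9105 by default (exporters conventionally take ports from 9100 upward, so check the Prometheus default port allocations list for conflicts before settling on one), and let a sidecar or reverse proxy handle external exposure.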
Self-instrument. Expose gameserver_exporter_build_info and gameserver_exporter_scrape_duration_seconds so you can tell "game server is down" from "exporter is down."
Validate output in CI. Run promtool check metrics against a live curl of /metrics before shipping — this catches missing #TYPE/#HELP lines that a plain curl check would miss.
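
The Summary warning in the first item is worth a quick worked example. The sketch below uses made-up latency samples for two servers and a simple nearest-rank quantile; the function names are illustrative. It shows that averaging per-server p95 values gives a number that matches neither server nor the fleet, while cumulative histogram buckets can simply be added together.

```python
import math


def nearest_rank_quantile(samples: list, q: float) -> float:
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def cumulative_buckets(samples: list, bounds: list) -> list:
    # Same shape as Prometheus histogram buckets: count of samples <= bound.
    return [sum(1 for s in samples if s <= b) for b in bounds]


BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0]
busy_server = [0.05] * 100  # many fast queries
slow_server = [1.0] * 10    # few slow queries

avg_of_p95 = (nearest_rank_quantile(busy_server, 0.95)
              + nearest_rank_quantile(slow_server, 0.95)) / 2
fleet_p95 = nearest_rank_quantile(busy_server + slow_server, 0.95)

# Buckets with identical bounds can be summed across servers.
merged = [a + b for a, b in zip(cumulative_buckets(busy_server, BOUNDS),
                                cumulative_buckets(slow_server, BOUNDS))]
target = 0.95 * (len(busy_server) + len(slow_server))
p95_bucket = next(b for b, count in zip(BOUNDS, merged) if count >= target)

print(round(avg_of_p95, 3))  # 0.525
print(fleet_p95)             # 1.0
print(merged)                # [100, 100, 100, 100, 110]
print(p95_bucket)            # 1.0
```

The averaged value of 0.525 seconds describes no request that ever happened, while the merged buckets land in the same place as the true fleet-wide p95.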

Here's a minimal exporter that satisfies most of the checklist above — it polls a Source-style query protocol on a background thread and exposes the metrics on a local-only port. Pin prometheus-client==0.20.0; pre-0.9 versions had registry behavior that broke silently on re-registration.

#!/usr/bin/env python3
# gameserver_exporter.py - polls a UDP query protocol and exposes Prometheus metrics
import socket
import time
import threading
from prometheus_client import start_http_server, Gauge, Histogram, Counter

# --- Metric definitions (checklist item: correct types + labels) ---
PLAYER_COUNT = Gauge(
    "gameserver_player_count",
    "Current number of connected players",
    ["region", "shard"]
)
QUERY_LATENCY = Histogram(
    "gameserver_query_latency_seconds",
    "Latency of the query protocol round-trip",
    ["region", "shard"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2, 5]
)
QUERY_FAILURES = Counter(
    "gameserver_query_failures_total",
    "Number of failed queries to the game server",
    ["region", "shard"]
)

SERVERS = [
    {"host": "10.0.1.10", "port": 27015, "region": "eu-west", "shard": "shard-1"},
    {"host": "10.0.1.11", "port": 27015, "region": "eu-west", "shard": "shard-2"},
]

QUERY_TIMEOUT = 3.0  # must stay well below Prometheus scrape_timeout (10s)

def query_server(host, port):
    """Send a minimal A2S-style query, return (player_count, latency_seconds)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(QUERY_TIMEOUT)
    start = time.monotonic()
    try:
        sock.sendto(b"\xFF\xFF\xFF\xFFTSource Engine Query\x00", (host, port))
        data, _ = sock.recvfrom(1024)
        latency = time.monotonic() - start
        # NOTE: real parsing depends on protocol; this is illustrative
        players = data[-1] if data else 0
        return players, latency
    finally:
        sock.close()

def poll_loop():
    while True:
        for srv in SERVERS:
            labels = {"region": srv["region"], "shard": srv["shard"]}
            try:
                players, latency = query_server(srv["host"], srv["port"])
                PLAYER_COUNT.labels(**labels).set(players)
                QUERY_LATENCY.labels(**labels).observe(latency)
            except (socket.timeout, OSError):
                QUERY_FAILURES.labels(**labels).inc()
        time.sleep(15)  # matches scrape_interval to avoid stale-vs-fresh mismatch

if __name__ == "__main__":
    threading.Thread(target=poll_loop, daemon=True).start()
    # bind to localhost only; expose via reverse proxy/service mesh in prod
    start_http_server(9105, addr="127.0.0.1")
    while True:
        time.sleep(3600)

And the matching scrape config, plus what a healthy scrape should actually look like when you validate it with promtool:

# prometheus.yml snippet - scrape config for the exporter above
scrape_configs:
  - job_name: game_servers
    scrape_interval: 15s
    scrape_timeout: 10s   # /metrics is served from the in-memory cache, so this only has to cover the HTTP round-trip
    static_configs:
      - targets: ["10.0.2.5:9105"]
        labels:
          env: production

# --- expected /metrics output, abbreviated to a few buckets (validate with promtool check metrics) ---
# HELP gameserver_player_count Current number of connected players
# TYPE gameserver_player_count gauge
gameserver_player_count{region="eu-west",shard="shard-1"} 42
# HELP gameserver_query_latency_seconds Latency of the query protocol round-trip
# TYPE gameserver_query_latency_seconds histogram
gameserver_query_latency_seconds_bucket{region="eu-west",shard="shard-1",le="0.1"} 12
gameserver_query_latency_seconds_bucket{region="eu-west",shard="shard-1",le="0.5"} 30
gameserver_query_latency_seconds_bucket{region="eu-west",shard="shard-1",le="+Inf"} 34
gameserver_query_latency_seconds_sum{region="eu-west",shard="shard-1"} 5.32
gameserver_query_latency_seconds_count{region="eu-west",shard="shard-1"} 34

Run promtool check metrics < <(curl -s localhost:9105/metrics) before every deploy. It's a five-second check that catches malformed output long before Grafana starts mis-rendering panels.

Commonly missed items

Cardinality is the big one. I've seen a 50-server fleet turn into millions of active time series because someone added a player_id or session_id label "for debugging." Prometheus doesn't age out label combinations gracefully — it just keeps ingesting until memory runs out. Watch out for this: it doesn't show up in dev with 3 test players, only in prod with real traffic.
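
A back-of-the-envelope calculation shows how fast this gets out of hand. The numbers below are invented for illustration, and the result is a worst-case upper bound that assumes every label combination eventually appears; worst_case_series is a hypothetical helper, and it ignores extra series such as the _created samples some client libraries emit.

```python
from math import prod


def worst_case_series(label_cardinalities: dict, series_per_combination: int = 1) -> int:
    """Upper bound on time series: product of label cardinalities times the
    number of series each label combination produces."""
    return prod(label_cardinalities.values()) * series_per_combination


# A histogram with 7 finite buckets exposes 7 buckets + the +Inf bucket
# + _sum + _count = 10 series per label combination.
HISTOGRAM_SERIES = 7 + 1 + 2

bounded = worst_case_series({"region": 5, "shard": 10}, HISTOGRAM_SERIES)
unbounded = worst_case_series(
    {"region": 5, "shard": 10, "player_id": 20_000}, HISTOGRAM_SERIES
)

print(bounded)    # 500
print(unbounded)  # 10000000
```

One extra label turns 500 series into a potential ten million for a single histogram, which is why the label set has to be fixed at design time.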

Metric type confusion is the second-most common bug. A Counter is only supposed to go up, so PromQL functions like rate() and increase() treat any drop in the value as a counter reset. Model player_count as a Counter and every dip, whether from players leaving or a server restart, gets misread as a reset, and alerting rules built on those functions fire false "anomaly" alerts every time you deploy. Gauge is correct here; the value can legitimately go down when players leave.

Blocking scrapes are a quieter failure. If the /metrics handler queries the game server synchronously with no timeout, a single hung UDP request blocks the entire scrape past the 10s scrape_timeout. The result is up == 0 flapping exactly when you need visibility most — during a load spike or DDoS attempt.

And don't skip the security angle. An unauthenticated /metrics endpoint leaks player counts and internal server topology to anyone scanning ports. I stopped exposing exporters directly to the internet after a griefer group used exposed metrics to time raids on low-population shards. Put it behind mTLS, a firewall rule, or federate it through a Prometheus instance with basic_auth configured in the scrape job — see the Prometheus scrape configuration docs for the auth options.

Automation ideas

Once the exporter pattern works for one game server, the goal is fleet-wide repeatability, not another hand-deployed binary. Bake the exporter into the same pod as the game server as a sidecar, and auto-register it with a Prometheus Operator PodMonitor. Keep noisy Kubernetes metadata such as the pod UID out of your TSDB: __meta_kubernetes_* labels are discarded after target relabeling unless a relabel_configs rule copies them into a regular label, so audit any labelmap rules that do. That alone can cut series count meaningfully across a large fleet.

Generate dashboards and alert rules as code — jsonnet or Terraform templated per game title — so spinning up a new shard doesn't mean someone manually clones a Grafana dashboard and forgets to update a variable. We covered a similar templated-alerting pattern in our DevOps_DayS monitoring notes if you want the broader pattern beyond game servers.

Finally, add a CI job that spins up the exporter against a mock UDP server, curls /metrics, runs promtool check metrics, and asserts histogram bucket boundaries with a Python unit test. Fail the build if #HELP or #TYPE lines are missing — that's the cheapest bug to catch in CI and the most expensive one to catch in a 3am page. And never tag exporter images as latest; a silent metric rename in a new build breaks every downstream dashboard without a visible diff in your deploy log.
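
The bucket-boundary assertion from that CI job could look roughly like this. It is a sketch that parses the text exposition format with regular expressions; bucket_bounds and missing_metadata are hypothetical helper names, and in a real pipeline the text would come from curling the exporter rather than from an inline sample.

```python
import re

SAMPLE = """\
# HELP gameserver_query_latency_seconds Latency of the query protocol round-trip
# TYPE gameserver_query_latency_seconds histogram
gameserver_query_latency_seconds_bucket{region="eu-west",shard="shard-1",le="0.1"} 12
gameserver_query_latency_seconds_bucket{region="eu-west",shard="shard-1",le="0.5"} 30
gameserver_query_latency_seconds_bucket{region="eu-west",shard="shard-1",le="+Inf"} 34
gameserver_query_latency_seconds_sum{region="eu-west",shard="shard-1"} 5.32
gameserver_query_latency_seconds_count{region="eu-west",shard="shard-1"} 34
"""


def bucket_bounds(metrics_text: str, metric: str) -> list:
    """Return the sorted, de-duplicated le boundaries exposed for a histogram."""
    pattern = re.compile(
        r"^" + re.escape(metric) + r'_bucket\{[^}]*le="([^"]+)"[^}]*\}', re.M
    )
    # float() accepts "+Inf", so the overflow bucket sorts last.
    return sorted({float(value) for value in pattern.findall(metrics_text)})


def missing_metadata(metrics_text: str, metric: str) -> list:
    """Return which of HELP / TYPE are absent for the metric family."""
    return [kw for kw in ("HELP", "TYPE")
            if f"# {kw} {metric} " not in metrics_text]


print(bucket_bounds(SAMPLE, "gameserver_query_latency_seconds"))
print(missing_metadata(SAMPLE, "gameserver_query_latency_seconds"))  # []
```

This complements promtool rather than replacing it: promtool validates the format, while a test like this pins the specific boundaries your dashboards and alerts depend on.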

Getting a Prometheus game server exporter right isn't about clever engineering; it's about running down the same short list every time before it hits production. The checklist above is exactly what I run before merging any exporter PR now, and it's caught real bugs every single time.

3 WireGuard VPN Mistakes That Broke Our Remote Server Access

Oleksandr Kuryzhev — Thu, 16 Jul 2026 07:01:50 +0000

Originally published on kuryzhev.cloud

Context

WireGuard VPN mistakes rarely announce themselves. That's the whole problem — the tunnel just quietly stops working, or quietly does the wrong thing, and you find out from a Slack message that says "hey, is the VPN down for anyone else?" We learned this the hard way over about six weeks last year, and I want to walk through exactly what broke and why, because I think we're not the only team that got bitten by the same shortcuts.

We were migrating off an aging OpenVPN bastion. It worked, technically, but reconnects took 15-20 seconds, the logs were a mess to audit, and every new hire meant another round of certificate generation nobody enjoyed. WireGuard looked like the obvious upgrade: in-kernel, ChaCha20, dead simple config format. We stood up a single hub server running wireguard-tools 1.0.20210914 on Ubuntu 22.04, using the kernel module rather than the userspace implementation, with six remote peers — four engineers and two CI runners that needed access to internal build artifacts.

Day one, it worked. wg show showed green handshakes, SSH over the tunnel was fast, everyone was happy. That's exactly the trap. WireGuard's simplicity means a config mistake doesn't throw an error — it just silently does something you didn't intend, and you don't notice until it costs you. This isn't a "we nailed it" post. Three mistakes cost us an outage, a security exposure, and a weekend. Here's what actually happened.

Mistake 1: Reusing the same key pair across multiple peers

To speed up onboarding, we cloned a "template" client config for new engineers — Address, DNS, AllowedIPs, all pre-filled, just swap the hostname and go. Except the template also included a pre-generated private key. Two engineers ended up with the exact same keypair on two different laptops.

WireGuard doesn't error on this. It just lets the most recently connected peer win. wg show on the server showed only one peer online at a time — whichever laptop had connected last simply kicked the other off the tunnel, silently, with zero error message on either end. We burned a few hours chasing "why did my VPN drop for no reason" before someone noticed both configs had identical PrivateKey values.

The fix we applied at the time was wrong: we told people to "just generate your own keys." No script, no enforcement, just a Slack message and good intentions. A month later a new hire cloned the same template from an old onboarding doc, and we had the exact same collision again. The lesson wasn't "generate unique keys" — everyone already agreed with that in principle. The lesson was that without a script or a process gate, someone will always take the fast path, especially under onboarding pressure. Per-machine key generation needs to be the only path available, not the recommended one.
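
One possible shape for such a process gate, sketched under the assumption that every issued public key is recorded in a central registry before a peer is added to the server. The names register_peer and DuplicateKeyError and the placeholder key strings are hypothetical; in practice the registry would live in whatever store holds your peer inventory.

```python
class DuplicateKeyError(ValueError):
    """Raised when a public key is already issued to a different peer."""


def register_peer(registry: dict, peer_name: str, public_key: str) -> None:
    """Record public_key for peer_name, refusing keys another peer already owns.

    registry maps public key -> peer name.
    """
    owner = registry.get(public_key)
    if owner is not None and owner != peer_name:
        raise DuplicateKeyError(
            f"public key already issued to {owner!r}; generate a fresh keypair"
        )
    registry[public_key] = peer_name


registry = {}
register_peer(registry, "alice-laptop", "PLACEHOLDER_PUBKEY_FROM_TEMPLATE")
try:
    register_peer(registry, "bob-laptop", "PLACEHOLDER_PUBKEY_FROM_TEMPLATE")
except DuplicateKeyError as err:
    print(f"rejected: {err}")
```

The point of the gate is that a cloned template fails loudly at onboarding time instead of silently kicking a colleague off the tunnel a week later.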

Mistake 2: AllowedIPs set to 0.0.0.0/0 without a real routing plan

We wanted full-tunnel security for engineers on public WiFi — route everything through the VPN, not just internal traffic. Reasonable goal. So we set AllowedIPs = 0.0.0.0/0 on the client side, intending WireGuard to route all outbound traffic through the tunnel and out via the server.

What we forgot: our server-side NAT rule was scoped only to the internal subnet.

# server-side NAT — only masquerades traffic from the VPN subnet
# this was fine for split-tunnel, but broke full-tunnel silently
iptables -t nat -A POSTROUTING -s 10.10.10.0/24 -o eth0 -j MASQUERADE

With 0.0.0.0/0 on the client, all traffic — including general internet browsing — got routed into the tunnel. But since the MASQUERADE rule only covered the VPN's own subnet range for legitimate internal routing, anything outside that scope hit a dead end on the server. Clients lost general internet access entirely. Traffic went into the tunnel and never came back out.

The nasty part: SSH over the tunnel kept working fine, because that traffic stayed within the 10.10.10.0/24 range the whole time. So from an infra perspective, everything looked healthy. It took two days before someone reported "my internet is down, but only when I'm on the VPN, and only on WiFi" — because their 4G hotspot bypassed the VPN app on mobile and worked fine. We had never actually decided whether we wanted split-tunnel or full-tunnel; we just typed 0.0.0.0/0 because it felt like "more security." Full-tunnel and split-tunnel are different architectures with different NAT requirements, and mixing them without a plan is how you get two days of confused bug reports instead of one clear failure.
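
A small lint check can force that decision to be explicit. The sketch below classifies an AllowedIPs value as full-tunnel or split-tunnel using only the standard library; is_full_tunnel is a hypothetical helper and only looks at IPv4, which is an assumption that matches our setup rather than a general rule.

```python
import ipaddress


def is_full_tunnel(allowed_ips: str) -> bool:
    """True if the IPv4 entries of an AllowedIPs value cover the whole
    address space, including the common 0.0.0.0/1 + 128.0.0.0/1 split."""
    networks = [
        ipaddress.ip_network(part.strip(), strict=False)
        for part in allowed_ips.split(",")
        if part.strip()
    ]
    ipv4 = [net for net in networks if net.version == 4]
    # collapse_addresses merges adjacent networks, so two /1 halves become /0.
    return any(net.prefixlen == 0 for net in ipaddress.collapse_addresses(ipv4))


print(is_full_tunnel("0.0.0.0/0"))               # True
print(is_full_tunnel("10.10.10.0/24"))           # False
print(is_full_tunnel("0.0.0.0/1, 128.0.0.0/1"))  # True
```

Wired into config review, a True result becomes a prompt to confirm that the server-side NAT and forwarding rules were actually designed for full-tunnel traffic.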

Mistake 3: No PersistentKeepalive + firewall left wide open on 51820/udp

This one was actually two separate mistakes that happened to surface around the same time, both stemming from the same root cause: WireGuard fails silently, and we weren't watching closely enough.

First, peers behind carrier-grade NAT — home routers, 4G modems — would drop after roughly two minutes of idle. No error, no disconnect event, the tunnel would just stop passing traffic. The cause was a stale NAT mapping on the client's router. Without PersistentKeepalive, WireGuard never sends anything to keep that mapping alive, so once the router's NAT table entry expired (usually somewhere between 30 and 120 seconds depending on the device), the server had no way to reach the client anymore. The only fix was manually restarting wg-quick on the client, which of course nobody understood the first few times it happened — they just thought the VPN was "flaky."

Second, separately, we'd applied ufw allow 51820/udp with zero source restriction — wide open to the entire internet, undocumented and unapproved by anyone reviewing the change. WireGuard doesn't respond to unauthenticated packets, which is genuinely one of its best security properties compared to OpenVPN's TLS handshake surface. But the port being open at all, with no rate limiting, still got flagged during a security review after a masscan-style sweep hit it repeatedly in our logs. The exposure wasn't exploitable, but it was invisible and unapproved — which is its own problem in a compliance-sensitive environment.

Both issues sat unnoticed for weeks. There's no auth log for WireGuard to grep, no failed-connection event to alert on — just connections that quietly stop reconnecting, and a port that's open but never logs an unauthorized attempt.

What we do differently now

Key generation is scripted, not templated. Every peer gets a unique keypair generated at onboarding time and stored in Vault — never copy-pasted, never cloned from a template file.

#!/bin/bash
# generate-peer.sh — per-peer key generation, run once per new client
# lesson learned: never copy an existing privatekey file between machines

set -euo pipefail

PEER_NAME="${1:?Usage: generate-peer.sh <peer-name>}"
KEY_DIR="/etc/wireguard/peers/${PEER_NAME}"
SERVER_PUBKEY_FILE="/etc/wireguard/server_public.key"
SERVER_ENDPOINT="vpn.kuryzhev.cloud:51820"
CLIENT_SUBNET_BASE="10.10.10"

umask 077  # keys must never be world-readable; set before anything is created
mkdir -p "${KEY_DIR}"

# generate unique keypair for this peer only
wg genkey | tee "${KEY_DIR}/private.key" | wg pubkey > "${KEY_DIR}/public.key"
PRIVATE_KEY=$(cat "${KEY_DIR}/private.key")
PUBLIC_KEY=$(cat "${KEY_DIR}/public.key")
SERVER_PUBKEY=$(cat "${SERVER_PUBKEY_FILE}")

# assign next available IP in the /24 — naive counter, replace with real IPAM in prod
NEXT_OCTET=$(( $(ls /etc/wireguard/peers | wc -l) + 1 ))
CLIENT_IP="${CLIENT_SUBNET_BASE}.${NEXT_OCTET}/32"

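# NOTE: minimal client config sketched from the variables above; split-tunnel by default
cat > "${KEY_DIR}/client.conf" <<EOF
[Interface]
PrivateKey = ${PRIVATE_KEY}
Address = ${CLIENT_IP}

[Peer]
PublicKey = ${SERVER_PUBKEY}
Endpoint = ${SERVER_ENDPOINT}
AllowedIPs = ${CLIENT_SUBNET_BASE}.0/24
PersistentKeepalive = 25
EOF

echo "Peer ${PEER_NAME} created; register public key ${PUBLIC_KEY} on the server"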

Split-tunnel is now the default — AllowedIPs scoped to 10.10.10.0/24 only. Full-tunnel is an explicit opt-in flag, documented per use case, and we double-check the routing table with ip route after every peer add. PersistentKeepalive = 25 is mandatory for anything not sitting on a static IP, per the official WireGuard quickstart docs. The firewall rule for 51820/udp now goes through rate-limiting via iptables, since WireGuard has no auth log for fail2ban to hook into — restricting source ranges in the cloud security group does more real work than a generic allow rule ever did.

We also added a standing on-call check using wg show and journalctl -u wg-quick@wg0 -f, with a simple rule of thumb: alert if a peer's latest handshake exceeds 180 seconds when PersistentKeepalive is configured. That single threshold would have caught every one of these WireGuard VPN mistakes days earlier instead of weeks.

# wg show output — how we now spot dead peers before users complain
$ wg show wg0

interface: wg0
  public key: AbCdEf...server_pub...==
  private key: (hidden)
  listening port: 51820

peer: xY12ab...client1_pub...=
  endpoint: 203.0.113.44:51820
  allowed ips: 10.10.10.2/32
  latest handshake: 42 seconds ago       # healthy
  transfer: 1.24 MiB received, 3.87 MiB sent

peer: qR98zt...client2_pub...=
  endpoint: 198.51.100.9:51820
  allowed ips: 10.10.10.3/32
  latest handshake: 18 minutes ago       # stale — likely NAT timeout, no keepalive
  transfer: 890 KiB received, 1.1 MiB sent

# rule of thumb we now enforce: alert if latest handshake > 180s
# for a peer with PersistentKeepalive set — indicates dropped tunnel, not just idle
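
The 180-second rule is simple enough to automate. The sketch below parses text in the format printed by wg show wg0 latest-handshakes, which is one line per peer with the public key and a Unix timestamp, where 0 means no handshake yet. The function name stale_peers and the placeholder keys are hypothetical, and the sample text is inlined so the logic can be tested without a live interface.

```python
def stale_peers(latest_handshakes: str, now: int, max_age: int = 180) -> list:
    """Return (public_key, age_seconds) for peers whose last handshake is
    older than max_age. age_seconds is None if the peer never completed one."""
    stale = []
    for line in latest_handshakes.splitlines():
        if not line.strip():
            continue
        public_key, timestamp = line.split()
        last_seen = int(timestamp)
        age = None if last_seen == 0 else now - last_seen
        if age is None or age > max_age:
            stale.append((public_key, age))
    return stale


sample = "PEER_A_PUBKEY\t1000\nPEER_B_PUBKEY\t100\nPEER_C_PUBKEY\t0\n"
print(stale_peers(sample, now=1042))
# [('PEER_B_PUBKEY', 942), ('PEER_C_PUBKEY', None)]
```

In production the input would come from running the wg command on the hub and passing int(time.time()) as now. The check should only be applied to peers that have PersistentKeepalive set, since an idle peer without keepalive legitimately goes quiet.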

Worth noting: none of this made WireGuard slower or more expensive. The hub still runs on a $5/mo VPS — the low overhead of in-kernel ChaCha20 handles our peer count fine, though we know a single-core box gets CPU-bound somewhere past 15 concurrent peers doing sustained throughput, so that's on our radar for the next capacity review. If you're weighing a similar migration, I've written more on VPN and network hardening decisions over on kuryzhev.cloud, and the official WireGuard project docs are worth reading end to end before you touch a production config — they're short, and every gotcha above is technically documented there, we just didn't read closely enough the first time.
