DEV Community: david

The .gitleaks-baseline.json That Suppressed Live Production Secrets

david — Mon, 13 Jul 2026 09:35:43 +0000

A previous article here covered setting up gitleaks for homelab secret scanning - the setup, the pre-commit hook, getting CI to fail on new commits that contain secrets. The setup was correct. The tool was running. The CI was green. And it had been quietly suppressing a live production credential for months.

This is the follow-on story: not about getting gitleaks running, but about the specific way a baseline file breaks the guarantees you think you have once it's in place.

View the complete homelab infrastructure source on GitHub 🐙

What a Baseline File Does (and Is Supposed to Do)

When gitleaks first runs on an existing repo, it finds every secret-shaped string in the full git history - including secrets that were introduced years ago, rotated long since, and are completely inert. Flagging those in CI creates noise that causes developers to tune out gitleaks entirely, which is worse than not having it.

The baseline workflow is the standard answer: run gitleaks on the current state, export all findings to a JSON file, commit that file to the repo, and tell gitleaks to suppress any finding that already appears in the baseline. Future commits that introduce new secrets still fail; old known-inert findings don't.

# Generate the baseline from the repo's full git history as it stands today
gitleaks detect --report-format json --report-path .gitleaks-baseline.json

# Tell gitleaks to use it
gitleaks detect --baseline-path .gitleaks-baseline.json

The assumption embedded in this workflow: findings that appear in the baseline are inert. They were there before the baseline was generated; they've been there; they're known.

The Assumption That Broke It

The baseline was generated at a point when the repo contained Garage's rpc_secret and admin_token committed in a YAML file. Those were real production values - the cluster was live, using those exact secrets - but the baseline suppression treated them as "known, reviewed, not a problem."

The commit that introduced them had happened a few weeks before the baseline was generated. The baseline capture happened to include them. From that point forward, gitleaks considered them baseline findings and did not alert on them.

// .gitleaks-baseline.json - what it actually contained
{
  "Description": "Generic API Key",
  "StartLine": 8,
  "Match": "rpc_secret: \"REDACTED\"",
  "Secret": "REDACTED",
  "File": "kubernetes/system/garage/garage.yml",
  ...
}

The finding was correctly identified as a secret. It just wasn't being acted on because the baseline said it was known.

How the Finding Was Caught

The finding surfaced during a security audit pass - specifically, a manual re-review of the baseline file itself to check what it was suppressing, not a gitleaks run. The review question was: "for each finding in this baseline, has the secret actually been rotated, or did we just suppress it?"

Garage's rpc_secret and admin_token had not been rotated. The values in the baseline were the same values currently running in production. The baseline was suppressing an active, live credential committed to a public repository.

The audit entry:

SEC-013: Garage rpc_secret and admin_token committed to public repo
Status: CRITICAL - values unrotated since 2026-06-03 commit
Suppressed by: .gitleaks-baseline.json (line 47, line 89)
Action: rotate both via Vault, force ExternalSecret resync, restart Garage

The Remediation

Rotation via Vault (where the secrets were already supposed to live, per the ExternalSecret setup):

# Write new secrets to Vault
vault kv patch secret/garage \
  rpc_secret="$(openssl rand -hex 32)" \
  admin_token="$(openssl rand -hex 32)"

# Force ExternalSecret to re-sync
kubectl annotate externalsecret garage-secret -n system \
  force-sync="$(date +%s)" --overwrite

# Verify new secret picked up
kubectl get secret garage-secret -n system -o jsonpath='{.data.rpc_secret}' | base64 -d

# Restart Garage to pick up new RPC secret
kubectl rollout restart deployment/garage -n system

# Verify RPC handshake succeeded on new secret
kubectl logs -l app=garage -n system | grep -i "rpc\|handshake\|connect" | tail -20

After rotation, the baseline file was updated to remove the (now-rotated) findings. The new baseline only contains findings where the secret is confirmed rotated or confirmed not a real secret.

The Structural Fix: Baseline Review as Part of the Rotation Workflow

The root problem isn't that baseline files exist - they're genuinely useful. The problem is that "add to baseline" and "rotated and safe" are two completely different states that the workflow treats identically. Every finding in a baseline should have a documented reason it's suppressed, and "it was already there when I set up gitleaks" is not a sufficient reason.

The updated workflow:

# Never auto-add to baseline without a verified status
# Each entry in .gitleaks-baseline.json now has a comment in the adjacent audit doc:
# - ROTATED: secret value changed, old value in history is inert
# - FALSE_POSITIVE: not actually a secret (test value, placeholder, etc.)
# - PENDING_ROTATION: known issue, rotation scheduled, date logged

The last category - PENDING_ROTATION - is the honest version of what the original baseline was doing. The difference is that it's explicit, it has a date, and it shows up in the audit document as an open finding rather than a silently-closed one.
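
The rule that every suppressed finding carries an explicit status can be checked mechanically. This is a minimal sketch, not the author's tooling: it assumes the audit document can be loaded as a mapping keyed by (File, StartLine), and both that format and the `docs/example.env` entry are made up for illustration.

```python
import json

ALLOWED_STATUSES = {"ROTATED", "FALSE_POSITIVE", "PENDING_ROTATION"}


def unreviewed_findings(baseline, audit):
    """Return (File, StartLine) for baseline findings with no valid audit status.

    baseline: list of finding dicts, as loaded from the baseline JSON.
    audit: dict mapping (File, StartLine) -> status string (hypothetical format).
    """
    missing = []
    for finding in baseline:
        key = (finding["File"], finding["StartLine"])
        if audit.get(key) not in ALLOWED_STATUSES:
            missing.append(key)
    return missing


# Illustrative data: the second finding is invented for the example.
baseline = json.loads("""
[
  {"File": "kubernetes/system/garage/garage.yml", "StartLine": 8, "Secret": "REDACTED"},
  {"File": "docs/example.env", "StartLine": 3, "Secret": "changeme"}
]
""")
audit = {("docs/example.env", 3): "FALSE_POSITIVE"}

print(unreviewed_findings(baseline, audit))
# → [('kubernetes/system/garage/garage.yml', 8)]
```

Anything this returns is a finding that was suppressed without a recorded reason, which is exactly the state the original baseline was in.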

The concrete check to run against any existing gitleaks baseline:

# For each secret in the baseline, check if it's still in the live cluster:
# 1. Get the secret value from the baseline
# 2. Check Vault / ExternalSecret / live Secret for the current value
# 3. If they match → not rotated, baseline is suppressing a live credential
jq -r '.[].Secret' .gitleaks-baseline.json | while IFS= read -r secret; do
  echo "Checking: ${secret:0:8}..."
  # grep your Vault / k8s secrets for this value
done

This check doesn't fit neatly into automated CI - it requires access to the live secret values, not just the git history. It belongs in a periodic manual audit, not a commit hook. But it's the only check that answers the actual question: not "does gitleaks find new secrets" but "are the secrets it's ignoring actually safe to ignore."
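
The comparison step itself is small once the live values are in hand. A hedged sketch, assuming the current values have already been exported from Vault or the cluster into a list by some other means; the function names and sample values are made up, and digests are compared so the raw secrets never need to be printed.

```python
import hashlib


def digest(value):
    """SHA-256 hex digest, so raw secret values are never echoed."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def live_suppressions(baseline, live_values):
    """Return (File, StartLine) for baseline findings whose Secret is still in use."""
    live = {digest(v) for v in live_values}
    return [
        (f["File"], f["StartLine"])
        for f in baseline
        if digest(f["Secret"]) in live
    ]


# Made-up values for illustration only.
baseline = [
    {"File": "kubernetes/system/garage/garage.yml", "StartLine": 8, "Secret": "old-rpc-secret"},
    {"File": "kubernetes/system/garage/garage.yml", "StartLine": 9, "Secret": "old-admin-token"},
]
live_values = ["old-rpc-secret", "freshly-rotated-token"]

print(live_suppressions(baseline, live_values))
# → [('kubernetes/system/garage/garage.yml', 8)]
```

A non-empty result means the baseline is hiding a credential that is still live.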

The same class of problem appears in Azure at scale: Azure Policy compliance reports count a resource toward the compliant percentage because it was exempted when the policy was first assigned, not because the underlying condition was remediated. The exemption and the fix look identical in the headline number. Periodic review of exemptions - exactly like periodic review of a gitleaks baseline - is the only way to tell the difference.

Redis Killed Nextcloud and Nobody Noticed for Hours

david — Mon, 06 Jul 2026 14:08:48 +0000

The Nextcloud outage looked like a capabilities problem at first. Kubernetes had just received a securityContext hardening pass — readOnlyRootFilesystem: true across most containers — and the timing fit. Pods were up, they passed readiness checks, but PHP requests were hanging and returning zero bytes. The capabilities hardening had broken things before. It was a reasonable assumption.

It was wrong. The capabilities issue was real, but it wasn't what took Nextcloud down. This is the Redis story.

The Setup That Created the Problem

Nextcloud's session data and file lock cache both run through a Redis instance deployed as a sidecar-style container in the same manifest. That Redis container had no PersistentVolumeClaim — it was intentionally ephemeral, purely in-memory, with sessions expected to survive only as long as the pod did.

The problem: Redis doesn't know it's running without a PVC. Its default configuration ships with persistence enabled:

save 3600 1   # write RDB snapshot if 1 key changed in the last hour
save 300 100  # write RDB snapshot if 100 keys changed in the last 5 minutes
save 60 10000 # write RDB snapshot if 10000 keys changed in the last minute

Nextcloud is an active application. It easily generates 100 session and lock writes within 300 seconds of normal use. When the 300-second threshold triggered, Redis attempted to write a snapshot — and hit a read-only filesystem:

Failed opening the temp RDB file temp-repl-16307.rdb (in server root dir /data)
for saving: Read-only file system
Background saving error

That error alone would be logged and ignored, and Redis would retry on the next threshold. The issue is what a failed background save does to writes: stop-writes-on-bgsave-error is yes by default, and it means exactly what it says. Once the most recent background save has failed, Redis refuses write commands until a later save succeeds.

Why the Logs Were Misleading

The Redis log showed the RDB error. But Redis was still running, still responding to pings, still reachable via the network — just refusing writes. From Nextcloud's PHP perspective, the Redis session handler was calling SET and getting back a Redis error object instead of OK. PHP's error handling in that path doesn't surface a visible 500 — it hangs the request or returns an empty response, depending on the session lock timeout.

The result from outside: pages loaded up to the authentication stage, then returned nothing. Not a 500, not a timeout message — nothing. That looks exactly like what capabilities.drop: [ALL] does when it breaks an entrypoint: the process starts, gets partway through initialization, then stops.

The actual diagnostic path that found it:

# Redis container still running, pod shows Running — check the logs directly
kubectl logs nextcloud-<pod> -n apps -c redis-nextcloud

# Key line:
# "Background saving error" / "MISCONF Redis is configured to save RDB snapshots,
# but it's currently not able to persist on disk."

# Confirm Redis is refusing writes:
kubectl exec -it nextcloud-<pod> -n apps -c redis-nextcloud -- redis-cli SET test_key test_val
# → (error) MISCONF Redis is configured to save RDB snapshots...

Once the actual Redis error surfaced, the fix was immediate and obvious.

The Fix: Invoke redis-server Directly with Persistence Off

The fix used the same pattern already in place for the cluster's other ephemeral Redis instances (immich-redis, valkey) — invoke redis-server directly with the relevant flags rather than relying on the image's default config:

# Before: redis-nextcloud relied on image defaults (persistence on)
containers:
  - name: redis-nextcloud
    image: redis:8

# After: explicit command disabling both RDB and AOF
  - name: redis-nextcloud
    image: redis:8
    command:
      - redis-server
      - "--save"
      - ""          # empty string disables all save thresholds
      - "--appendonly"
      - "no"

The same change applied to paperless's Redis, which had the same no-PVC configuration. The authelia Redis was left untouched — it actually has a real PVC (kubernetes/system/redis) so its --appendonly yes configuration is intentional and works correctly.

The Asymmetry That Makes This Hard to Catch

Redis with persistence enabled and no PVC doesn't fail immediately. It runs, accepts connections, handles traffic, and appears healthy in every Kubernetes signal:

Pod status: Running
Readiness probe: passing (Redis responds to TCP or PING checks)
ArgoCD application: Synced/Healthy
Restart count: 0

The failure only manifests after the first persistence threshold is hit — which for the 300 100 threshold (100 writes in 5 minutes) could be minutes into normal operation, or could be hours if traffic is light. After that threshold fires and the first failed RDB save triggers stop-writes-on-bgsave-error, the failure is silent at the Redis level and confusing at the application level.

The version of this that would have prevented the outage:

# Before deploying any Redis instance, check whether it has a PVC:
grep -A 20 "redis" kubernetes/apps/nextcloud/nextcloud.yml | grep -i "persistentVolumeClaim\|claimName"
# If nothing → persistence must be explicitly disabled

Or more defensively: any Redis container without a PVC in its volume list gets --save "" --appendonly no as a matter of policy, not an afterthought.
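
That policy can be expressed as a small lint over a pod spec. This is a sketch under stated assumptions, not part of the repo: the spec is taken as an already-parsed dict, a container counts as Redis if its image name contains "redis" (so a valkey image would need its own rule), and the function names are hypothetical. It follows the article's policy of requiring both flags.

```python
def _persistence_disabled(argv):
    """True if argv carries both --save "" and --appendonly no."""
    def value_after(flag):
        if flag in argv:
            i = argv.index(flag)
            if i + 1 < len(argv):
                return argv[i + 1]
        return None

    return value_after("--save") == "" and value_after("--appendonly") == "no"


def redis_persistence_violations(pod_spec):
    """Names of Redis containers that mount no PVC yet keep default persistence."""
    pvc_volumes = {
        v["name"]
        for v in pod_spec.get("volumes", [])
        if "persistentVolumeClaim" in v
    }
    violations = []
    for container in pod_spec.get("containers", []):
        if "redis" not in container.get("image", ""):
            continue
        mounts = {m["name"] for m in container.get("volumeMounts", [])}
        if mounts & pvc_volumes:
            continue  # durable storage attached, persistence may be intentional
        argv = container.get("command", []) + container.get("args", [])
        if not _persistence_disabled(argv):
            violations.append(container["name"])
    return violations


before = {"containers": [{"name": "redis-nextcloud", "image": "redis:8"}]}
after = {
    "containers": [
        {
            "name": "redis-nextcloud",
            "image": "redis:8",
            "command": ["redis-server", "--save", "", "--appendonly", "no"],
        }
    ]
}

print(redis_persistence_violations(before))  # → ['redis-nextcloud']
print(redis_persistence_violations(after))   # → []
```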

Why This Shows Up in Hardened Clusters More Than Default Ones

The readOnlyRootFilesystem: true security hardening is what surfaced this. Without it, Redis would have written its RDB snapshots to the container's writable root filesystem — wasteful, potentially filling the node's ephemeral storage, but functionally harmless in the short term. The write would succeed, stop-writes-on-bgsave-error would never trigger, and the outage would never happen.

In that sense, the security hardening exposed a pre-existing misconfiguration that was only dormant because the filesystem was writable. The hardening was correct. The Redis configuration was wrong, and the hardening just removed the thing that had been quietly absorbing the mistake.

The same scenario applies to any Redis instance in Kubernetes where sessions, cache, or ephemeral data don't need to survive a pod restart — which, in a cluster with properly-sized application replicas, is most of them. If the Redis deployment doesn't have a PVC, it should have --save "" --appendonly no as explicit arguments rather than inheriting defaults designed for a server with durable storage.

The Disaster Recovery Runbook Nobody Had Actually Run

david — Mon, 06 Jul 2026 14:03:52 +0000

DISASTER-RECOVERY.md had a Tier 1 procedure: restore the latest Proxmox Backup Server snapshot for a given container, boot it, confirm the application comes back up with real data. It was written carefully, reviewed, and had sat untested for months. Writing a recovery runbook and running a recovery are two different acts of verification, and only one of them had actually happened.

This is what running it for the first time found.

The Test, Designed to Be Safe

The container chosen was ct-srv-atlantis-01 — not because it was low-stakes exactly, but because its data (Terraform plan history, lock state) is reconstructible if something went wrong, unlike, say, Nextcloud's actual file storage.

The test design: restore the latest PBS snapshot to a scratch VMID, distinct from the running original, boot it there, verify it comes up with its Docker stack auto-starting and real data present, then destroy the scratch instance. The live original was never touched, never stopped, never at risk.

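# Restore to a new, unused VMID -- not overwriting the original
# (pct restore is the Proxmox VE command for container backups;
# the storage name and snapshot ID are placeholders)
pct restore 9901 <pbs-storage>:backup/<snapshot-id>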

# Original ct-srv-atlantis-01 keeps running the entire time

This is the correct way to test a restore procedure without creating the exact outage you're testing recovery from. The design held up.

The Gotcha the Documentation Never Mentioned

The restored container came up. Its Docker stack auto-started correctly, per the systemd unit configuration already in place. Its data was intact — Terraform state, PR history, everything as expected.

Then it started fighting with the original over the network.

PBS restores duplicate the source's configuration exactly, which includes its static IP and MAC address. The scratch container ct-9901 booted with the identical IP and MAC as the still-running ct-srv-atlantis-01. Two devices claiming the same IP and MAC on the same VLAN produces exactly the failure mode you'd expect — ARP conflicts, unpredictable routing of traffic to whichever host answered last, and depending on switch/router ARP table behavior, either host could silently lose connectivity for a period.

# What actually needs to happen before booting a restored scratch instance:
# 1. Restore to new VMID (already being done)
# 2. Before first boot: change the network config to avoid IP/MAC collision
pct set 9901 -net0 name=eth0,bridge=vmbr0,ip=<scratch-ip>/24,hwaddr=<new-mac>
# 3. Only then start it
pct start 9901

This wasn't caught by planning, only by actually running the test and watching the original container's connectivity briefly degrade. No amount of reading the PBS documentation surfaces this — it's not a PBS bug, it's an inherent property of what "restore" means: a byte-for-byte duplicate of the source, network identity included.
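
The hwaddr value for the scratch instance has to come from somewhere. A minimal sketch, assuming a random locally administered unicast address is acceptable on this network; the helper name is made up, and it does not check the result against addresses already in use.

```python
import secrets


def random_local_mac():
    """Random MAC with the locally-administered bit set and the multicast bit clear."""
    first = (secrets.randbits(8) | 0x02) & 0xFE
    rest = [secrets.randbits(8) for _ in range(5)]
    return ":".join(f"{octet:02X}" for octet in [first] + rest)


mac = random_local_mac()
print(mac)
```

Setting the locally-administered bit keeps the address out of vendor-assigned ranges, so the only collision risk is with other locally generated addresses on the same segment.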

The Second Finding: A Runbook Step That Would Have Failed

While writing up this test, a second problem surfaced — not from testing, but from cross-checking Tier 1's other listed recovery path against a separate finding from the same week (REL-051, covered separately). The runbook's fallback for "total PBS storage loss" pointed at the Google Drive offsite sync as the recovery source.

That offsite sync had been deliberately disabled months earlier — the destination Google Drive account didn't have enough free space for the datastore, so the sync was turned off rather than left to fail daily. The runbook still described it as an active recovery tier. Following it during a real total-loss event would have sent someone down a dead path while genuinely searching for a way to restore.

<!-- DISASTER-RECOVERY.md, before -->
### Tier 1: Total PBS Storage Loss
Restore from the Google Drive offsite backup: [procedure]

<!-- After -->
### Tier 1: Total PBS Storage Loss
**No working recovery path currently exists for this scenario.**
Google Drive offsite sync is deliberately disabled (insufficient storage
quota for this datastore's chunk format — see docs/backup-strategy.md).
This is a known, deferred gap, not an oversight.

Documenting "this doesn't work" is a strictly better outcome than documenting a procedure that looks complete but silently fails during the one moment it matters. A runbook that's honest about its gaps is more useful in an actual incident than one that's optimistically wrong.

Why "Documented" and "Tested" Are Different Claims

The instinct after writing a recovery procedure is to consider the job done — the steps are correct, they were reviewed, they match how the backup tool is supposed to work. But "the steps are theoretically correct" and "these steps, run against a real host, actually produce a working restore" are different claims, and only the second one matters during an actual incident.

The IP/MAC conflict specifically would only ever be discovered by running the restore against infrastructure that has other live devices on the same network — which is exactly the situation a real disaster recovery event is in. A test performed in isolation (a lab VLAN with nothing else running) wouldn't have found it either.

The concrete practice this argues for: any documented recovery procedure gets a scheduled dry run, on a cadence that matches how often the underlying infrastructure changes — new network segments, new backup targets, new container images. A runbook that hasn't been executed in the last few months should be treated as unverified, regardless of how carefully it reads.

The same gap shows up in Azure disaster recovery plans built around ASR (Azure Site Recovery) or storage account geo-redundancy: the failover procedure is documented, the RTO/RPO numbers are in the compliance doc, and nobody has actually triggered a failover drill against a resource with the same network topology as production. The IP/MAC duplication issue here has a direct Azure analogue — a failed-over VM retaining its original NIC configuration can conflict with resources still running in the primary region during a partial failover test. Test the actual failover, not just the plan for one.

kubectl Said Everything Was Correct. Traefik 404'd Anyway.

david — Mon, 29 Jun 2026 11:09:55 +0000

Jellyfin's k3s Deployment had no GPU passthrough — pure software transcoding on a cluster with no GPU access. Moving it to a dedicated LXC with VAAPI hardware transcode access to the host's APU is straightforward in principle: stand up the LXC, run Jellyfin there via Ansible, and point the existing Traefik IngressRoute at the new location instead of a cluster pod.

That last part — pointing a Kubernetes Service at an external IP without changing anything downstream — should be one of the more boring parts of a migration like this. It produced the more interesting bug of the two covered here.

Routing a Service to an IP Outside the Cluster

Traefik's IngressRoute for Jellyfin references a Service by name — services: [{name: jellyfin, port: 8096}]. To avoid touching that IngressRoute (and its Authelia middleware) at all, the plan was to keep the Service object, but back it with the LXC's IP instead of a pod selector. Kubernetes has a mechanism for exactly this: skip the Service's selector field, and manually create an object that lists the actual backend addresses.

The modern way to do this is EndpointSlice:

apiVersion: v1
kind: Service
metadata:
  name: jellyfin
  namespace: apps
spec:
  ports:
    - port: 8096
      targetPort: 8096
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: jellyfin
  namespace: apps
  labels:
    kubernetes.io/service-name: jellyfin
addressType: IPv4
ports:
  - name: ""
    port: 8096
endpoints:
  - addresses:
      - 10.0.20.254

Every check via kubectl confirmed this was correct: the Service existed, the EndpointSlice existed with the right label linking it to the Service, the endpoint address listed correctly. By every signal kubectl could give, this was done.

Every request to media.woitzik.dev 404'd.

Reading the Actual Error Instead of Re-Checking kubectl

The instinct when kubectl says everything is fine is to re-check kubectl — verify the label matches exactly, check for typos, look for a missing field. None of that was the problem. The actual answer was sitting in Traefik's own logs the entire time:

subset not found for apps/jellyfin

Traefik's Kubernetes CRD provider resolves a Service's backend addresses through the legacy v1/Endpoints API — the object with subsets, not the newer EndpointSlice. This is true regardless of the fact that EndpointSlice is the object Kubernetes itself prefers and recommends for new code, and regardless of the fact that kubectl has no opinion about which one any particular consumer actually reads. Kubernetes ships both APIs side by side specifically for this kind of provider-compatibility gap — and Traefik's ingress controller, as of the version in use here, is one of the consumers that hasn't moved to the newer one.

The fix is a straightforward swap to the older object:

apiVersion: v1
kind: Endpoints
metadata:
  name: jellyfin
  namespace: apps
subsets:
  - addresses:
      - ip: 10.0.20.254
    ports:
      - port: 8096

Confirmed immediately via curl against the public hostname, and via Traefik's own logs going quiet for that route. Same conceptual object, different API, and only one of the two is the one this specific consumer actually reads.

The general lesson: "kubectl shows it's configured correctly" answers whether the Kubernetes API server accepted and stored the object — it says nothing about whether the specific consumer reading that object supports the API version you used. For anything involving a controller or ingress provider reading Kubernetes objects indirectly (CRD-based routing being the clearest example), check that controller's own logs and documented compatibility before assuming a kubectl-clean object is a working one.
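
The two objects carry the same information in different shapes, which makes the swap mechanical. A hedged sketch of that mapping over already-parsed manifests: the function name is hypothetical, and it ignores endpoint conditions, multiple slices per Service, and non-IPv4 address types.

```python
def endpoints_from_slice(slice_obj):
    """Build a legacy v1 Endpoints manifest from a single EndpointSlice dict."""
    meta = slice_obj["metadata"]
    # The Endpoints object must share the Service's name to be associated with it.
    service = meta["labels"]["kubernetes.io/service-name"]
    addresses = [
        {"ip": ip}
        for endpoint in slice_obj.get("endpoints", [])
        for ip in endpoint.get("addresses", [])
    ]
    ports = []
    for p in slice_obj.get("ports", []):
        port = {"port": p["port"]}
        if p.get("name"):  # omit empty names, as in the manifest above
            port["name"] = p["name"]
        ports.append(port)
    return {
        "apiVersion": "v1",
        "kind": "Endpoints",
        "metadata": {"name": service, "namespace": meta["namespace"]},
        "subsets": [{"addresses": addresses, "ports": ports}],
    }


slice_obj = {
    "apiVersion": "discovery.k8s.io/v1",
    "kind": "EndpointSlice",
    "metadata": {
        "name": "jellyfin",
        "namespace": "apps",
        "labels": {"kubernetes.io/service-name": "jellyfin"},
    },
    "addressType": "IPv4",
    "ports": [{"name": "", "port": 8096}],
    "endpoints": [{"addresses": ["10.0.20.254"]}],
}

endpoints = endpoints_from_slice(slice_obj)
print(endpoints["subsets"])
# → [{'addresses': [{'ip': '10.0.20.254'}], 'ports': [{'port': 8096}]}]
```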

The Second Gotcha: A PVC Shared by Reference Across Unrelated Files

While removing Jellyfin's old in-cluster Deployment and its associated PVC, a second issue surfaced — this one with real data-loss potential, caught only because Kubernetes' own safety mechanism bought time.

The media PVC had been defined in jellyfin.yml. It looked, from that file alone, like it belonged to Jellyfin and only Jellyfin. It didn't: four other Deployments in a completely different file — usenet.yml's Sonarr, Radarr, Bazarr, and SABnzbd — referenced the exact same PVC by name:

# usenet.yml — nothing in this file defines "media", it just claims it
volumes:
  - name: media
    persistentVolumeClaim:
      claimName: media

Deleting the PVC's defining resource from jellyfin.yml put it into Terminating state. It did not actually disappear, because of pvc-protection, a built-in Kubernetes finalizer that blocks PVC deletion while any pod is still using it. The four usenet pods kept running, the PVC stayed Bound from their perspective, and nothing broke in that moment. But that's a temporary window, not a safe state: the finalizer only holds while pods are actively using the claim. Every pod replacement (a node drain, an eviction, a routine rollout) removes one of those users, and a replacement pod cannot be scheduled against a claim that is already being deleted. Once the last of the four was gone, the finalizer would have cleared and the claim would have gone with it, taking the actual media library out from under four applications.

# What would have caught this before deleting anything:
grep -rn "claimName: media" kubernetes/
# → jellyfin.yml (the file being edited)
# → usenet.yml (NOT obvious from looking at jellyfin.yml alone)

The fix: recreate the same NFS-backed volume under new names, owned by usenet.yml — the file whose Deployments actually still needed it — and repoint all four claimName references to the new names. The old PVC finished terminating cleanly once nothing referenced it anymore.

The general lesson: before deleting any PersistentVolumeClaim (or any named Kubernetes object that other resources reference indirectly — Secrets, ConfigMaps, Services), grep the entire manifest tree for that name, not just the file that appears to define it. claimName, secretName, and similar cross-references are invisible from the defining file alone, and Kubernetes' own protective finalizers — while genuinely useful — can create a false sense of safety: "nothing broke yet" is not the same claim as "this was safe."
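
The grep can be tightened so that similarly named claims do not show up as false hits. A sketch, assuming manifests are .yml files under one root and that references are written one per line; the function name, the key list, and the demo files are illustrative only.

```python
import re
import tempfile
from pathlib import Path

REFERENCE_KEYS = ("claimName", "secretName")


def find_references(root, name, keys=REFERENCE_KEYS):
    """Return (relative path, line number) for each line referencing `name` exactly."""
    key_alt = "|".join(re.escape(k) for k in keys)
    quote = "[\"']?"
    pattern = re.compile(
        r"^\s*(?:-\s*)?(?:" + key_alt + r"):\s*"
        + quote + re.escape(name) + quote
        + r"\s*(?:#.*)?$"
    )
    hits = []
    for path in sorted(Path(root).rglob("*.yml")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, 1):
            if pattern.match(line):
                hits.append((path.relative_to(root).as_posix(), lineno))
    return hits


# Demo against a throwaway directory with two small manifests.
with tempfile.TemporaryDirectory() as tmp:
    claim = "volumes:\n  - name: media\n    persistentVolumeClaim:\n      claimName: media\n"
    other = "  - name: cache\n    persistentVolumeClaim:\n      claimName: media-cache\n"
    Path(tmp, "jellyfin.yml").write_text(claim, encoding="utf-8")
    Path(tmp, "usenet.yml").write_text(claim + other, encoding="utf-8")
    hits = find_references(tmp, "media")

print(hits)
# → [('jellyfin.yml', 4), ('usenet.yml', 4)]
```

The anchored pattern is the only real difference from the plain grep: `media-cache` is not reported as a reference to `media`.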

Both Gotchas, Side by Side

	EndpointSlice/Endpoints	Shared PVC
What `kubectl` showed	Completely correct	Completely correct (until deletion)
What actually mattered	Whether the consuming controller (Traefik) supports the object's API version	Whether other files reference the same object by name
Where the real signal was	The controller's own logs, not the API server	A repo-wide grep, not the file being edited
Safety net that bought time	None — direct outage	`pvc-protection` finalizer — temporary, not a fix

Both failures share the same shape: kubectl (or any direct API inspection) confirms an object exists and is well-formed, but the actual question — does this specific consumer read it correctly, does this specific name have other dependents — lives somewhere kubectl doesn't look. The fix in both cases was checking a different source of truth: the consuming controller's logs in one case, a full-tree text search in the other.

The Traefik EndpointSlice/Endpoints gap specifically is worth checking against your own ingress controller version before relying on it — provider compatibility for newer Kubernetes APIs varies and changes between releases. For Azure environments running AKS with an external service (an on-prem system, a VM outside the cluster), the same external-IP-backed-Service pattern applies, and the same "check what your specific ingress controller actually supports" caveat applies just as directly.

SLO Burn-Rate Alerting with Prometheus: Beyond Threshold Alerts

david — Wed, 24 Jun 2026 19:03:59 +0000

Most uptime alerts look like this:

- alert: ServiceDown
  expr: probe_success == 0
  for: 2m

That fires when a service is completely down for two minutes. It won't fire when a service is responding to 95% of requests for 48 hours straight — even though that's silently consuming your entire monthly error budget.

Burn-rate alerting is a different model. Instead of alerting on current state, it alerts on how fast you're spending your error budget. A 30x burn rate means a full month of tolerance is gone in about a day. A 6x burn rate means you have about five days. Both warrant action, just different kinds of action.

This is the implementation running on my bare-metal k3s cluster, based directly on the multi-window multi-burn-rate approach from the Google SRE Workbook.

Error Budgets, Briefly

If your SLO is 99.9% availability, your monthly error budget is the allowed downtime: 43.8 minutes per month (0.1% of 43,800 minutes).

The core insight: not all errors are the same urgency. A service that's been returning errors at 30x the allowed rate for the past two hours has already spent about 8% of the month's budget and will exhaust the rest in under a day: that's a page. A service burning at 6x for the past six hours has spent about 5% and has roughly five days left at that pace: that's a ticket, handled during the shift.

Threshold alerting conflates these. Burn-rate alerting separates them.
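
The budget arithmetic is easy to check directly. A minimal sketch using the 43,800-minute month from above; the function names are made up for illustration, and burn rate is taken to mean the observed error rate divided by the allowed error rate.

```python
MINUTES_PER_MONTH = 43_800  # 30.4-day month, as used in the text


def error_budget_minutes(slo, period_minutes=MINUTES_PER_MONTH):
    """Minutes of full downtime the SLO tolerates over the period."""
    return (1 - slo) * period_minutes


def minutes_to_exhaustion(burn_rate, period_minutes=MINUTES_PER_MONTH):
    """Wall-clock minutes until the whole budget is spent at a constant burn rate."""
    return period_minutes / burn_rate


print(f"{error_budget_minutes(0.999):.1f} budget minutes")       # → 43.8 budget minutes
print(f"{minutes_to_exhaustion(30) / 60:.1f} hours at 30x")      # → 24.3 hours at 30x
print(f"{minutes_to_exhaustion(6) / 60 / 24:.1f} days at 6x")    # → 5.1 days at 6x
```

A burn rate of 1 spends the budget in exactly one period, so any multiple simply divides the period by that factor.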

The SLI: HTTP Probe Success Rate

Everything is built on a single Service Level Indicator: the fraction of successful HTTP probes from the Prometheus blackbox exporter.

The blackbox exporter probes each public service endpoint on a fixed interval. probe_success is 1 for a successful probe and 0 for a failure. The SLI is the average over a time window:

# kubernetes/system/monitoring/slo-rules.yml

- record: job_instance:probe_success:rate5m
  expr: avg_over_time(probe_success[5m])

- record: job_instance:probe_error:rate5m
  expr: 1 - avg_over_time(probe_success[5m])

1 - success_rate = error_rate. At 99.9% SLO, the allowed steady-state error rate is 0.001 (0.1%).
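
For intuition, the same calculation over a plain list of probe results looks like this. A sketch only: the 15-second probe interval behind the 20 samples is an assumption, not something the rules above specify, and the helper names are made up.

```python
def availability(probe_results):
    """Mean of 0/1 probe results, mirroring avg_over_time(probe_success[window])."""
    if not probe_results:
        raise ValueError("no probe samples in window")
    return sum(probe_results) / len(probe_results)


def error_rate(probe_results):
    """Fraction of failed probes in the window."""
    return 1 - availability(probe_results)


# 20 probes in a 5-minute window (assumed 15s interval), one of them failed.
window = [1] * 19 + [0]
print(f"availability={availability(window):.3f} error_rate={error_rate(window):.3f}")
# → availability=0.950 error_rate=0.050
```

One failed probe out of twenty is already 50 times the 0.001 steady-state allowance, which is why short windows are noisy and the alerts below pair them with longer ones.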

Recording Rules: Pre-Computing the Windows

Multi-window alerting needs error rates computed over multiple time windows. Prometheus can do this inline in alert expressions, but pre-computing them as recording rules keeps the alert expressions readable and reduces query load.

- name: slo.availability.windows
  interval: 1m
  rules:
    # Short windows (fast-burn detection)
    - record: job_instance:probe_success:rate1h
      expr: avg_over_time(probe_success[1h])
    - record: job_instance:probe_success:rate2h
      expr: avg_over_time(probe_success[2h])

    # Medium windows
    - record: job_instance:probe_success:rate6h
      expr: avg_over_time(probe_success[6h])
    - record: job_instance:probe_success:rate30m
      expr: avg_over_time(probe_success[30m])

    # Long windows (slow-burn detection)
    - record: job_instance:probe_success:rate24h
      expr: avg_over_time(probe_success[24h])

These evaluate every minute. The result is a set of pre-computed availability metrics across five time windows, from 30 minutes (most sensitive) to 24 hours (catches slow bleeds).

The Alert Rules

Fast Burn: Page Immediately

- alert: SLOAvailabilityFastBurn
  expr: |
    (1 - job_instance:probe_success:rate2h) > (30 * (1 - 0.999))
    and
    (1 - job_instance:probe_success:rate1h) > (30 * (1 - 0.999))
  for: 2m
  labels:
    severity: critical
    slo: availability
  annotations:
    summary: "SLO fast burn: {{ $labels.instance }}"
    description: >
      {{ $labels.instance }} error rate is burning through the monthly error budget
      at ≥30x the allowed rate. At this pace the 99.9% budget is exhausted in ~24h.
      Current 2h error rate: {{ $value | humanizePercentage }}

The math: A 99.9% SLO means 0.1% of requests can fail. The threshold for 30x burn is 30 × 0.001 = 0.03 — a 3% error rate. If both the 2-hour window and the 1-hour window exceed 3%, this fires.

Why two windows? The short window (1h) catches fast-developing incidents. The long window (2h) provides confirmation — it prevents a single spike from paging. Both must exceed the threshold simultaneously. This dual-window check is the key difference from naive threshold alerting: a two-minute blip won't page you, but a sustained fast burn will.
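
The dual-window condition reduces to two comparisons against the same threshold. A sketch of that logic outside Prometheus, with made-up function names; it models only the expression, not the `for:` duration.

```python
SLO = 0.999


def burn_rate(error_rate, slo=SLO):
    """How many times faster than allowed the budget is being spent."""
    return error_rate / (1 - slo)


def fires(long_window_error, short_window_error, factor, slo=SLO):
    """True when both windows exceed factor x the allowed error rate."""
    threshold = factor * (1 - slo)
    return long_window_error > threshold and short_window_error > threshold


# A two-minute total outage: 2 of 120 minutes (2h) and 2 of 60 minutes (1h).
print(fires(2 / 120, 2 / 60, factor=30))  # → False
# A sustained 5% error rate fills both windows.
print(fires(0.05, 0.05, factor=30))       # → True
```

The two-minute outage clears the 1h threshold (3.3%) but not the 2h one (1.7%), so the confirmation window is what keeps it from paging.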

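Burn-rate math at 30x:

Monthly budget: 43.8 minutes of full downtime (0.1% of 43,800 minutes)
At 30x burn: 30 × 0.001 = 0.03 budget-minutes consumed per minute
Budget exhausted in: 43.8 ÷ 0.03 = 1,460 minutes ≈ 24 hours

About a day to act, and the two hours it took to confirm have already cost roughly 8% of the month. Page.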

Slow Burn: Create a Ticket

- alert: SLOAvailabilitySlowBurn
  expr: |
    (1 - job_instance:probe_success:rate6h) > (6 * (1 - 0.999))
    and
    (1 - job_instance:probe_success:rate30m) > (6 * (1 - 0.999))
  for: 15m
  labels:
    severity: warning
    slo: availability
  annotations:
    summary: "SLO slow burn: {{ $labels.instance }}"
    description: >
      {{ $labels.instance }} error rate is burning through the monthly error budget
      at ≥6x the allowed rate. At this pace the 99.9% budget is exhausted in ~5 days.
      Current 6h error rate: {{ $value | humanizePercentage }}

The math: 6 × 0.001 = 0.006, a 0.6% error rate. Budget exhaustion at 6x burn: 730 h ÷ 6 ≈ 122 hours, about 5 days. The for: 15m means it must sustain this rate for 15 minutes before firing, which filters transient dips.

6h (long) + 30m (short) windows. A slow degradation is visible over 6 hours; the 30m short window confirms the burn is still happening, so the alert doesn't keep firing on stale data after recovery.

Severity: warning. This goes to a Slack channel, not a pager. Fix it during the shift.

Comparing Against Threshold Alerting

Scenario	Threshold alert (`< 99%`)	Burn-rate alert
Service down for 5 minutes	✅ Fires	✅ Fires (fast burn)
Service at 95% for 48h	❌ Fires then resolves	✅ Fires slow burn, escalates
4% error rate for 2h	❌ May not fire	✅ Fast burn fires
0.7% error rate for 6h	❌ Never fires	✅ Slow burn fires
Single 10-second blip	✅ Fires (false positive)	❌ Below the burn-rate threshold

The pattern: burn-rate alerting catches slow degradations that threshold alerting misses, and it filters the transient blips that threshold alerting over-alerts on.

Deploying as a PrometheusRule

The rules deploy as a PrometheusRule CRD, picked up automatically by the Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-slo-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: slo.burn-rate.page
      rules:
        - alert: SLOAvailabilityFastBurn
          # ... (see above)

The prometheus: kube-prometheus label has to match the ruleSelector of the Prometheus resource for the Prometheus Operator to load this rule. kubectl get prometheusrule -n monitoring should show it; kubectl get --raw /api/v1/namespaces/monitoring/pods/prometheus-kube-prometheus-prometheus-0/proxy/api/v1/rules lets you query the loaded rules directly.

What the Error Budget Dashboard Shows

The complementary Grafana dashboard (slo-dashboard.yml) renders three panels:

Availability over time: job_instance:probe_success:rate5m across all probed services
Error budget remaining: 1 - avg(avg_over_time(probe_success[30d])) relative to the 0.1% budget (probe_success is a 0/1 gauge, so avg_over_time is the right function here, not rate)
Burn rate: current consumption rate, coloured by severity tier

The budget panel is the most useful. When it's dropping steeply, something is consuming the budget faster than an even spend across the month would. That's a signal even before an alert fires.
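To make "relative to the 0.1% budget" concrete, here is a minimal sketch of the remaining-budget calculation. It assumes the input is the measured 30-day availability as a ratio; `budget_remaining` is an illustrative name, not something taken from the dashboard.

```python
def budget_remaining(availability: float, slo: float = 0.999) -> float:
    """Fraction of the error budget still left; negative means overspent."""
    allowed = 1 - slo          # 0.001 for a 99.9% SLO
    spent = 1 - availability   # observed error ratio over the window
    return 1 - spent / allowed


print(f"{budget_remaining(0.9995):.0%}")  # half the budget left
print(f"{budget_remaining(0.9980):.0%}")  # budget overspent by 100%
```

A 99.95% month has used half the budget; a 99.8% month has spent it twice over, which shows up as a negative value.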

Limitations

This implementation measures black-box availability only: HTTP probes issued from inside the cluster. It won't catch:

Increased latency that doesn't fail probes (need histogram SLIs for that)
Internal service-to-service degradation (need distributed tracing or internal probes)
Correctness issues — a 200 OK with wrong data doesn't fail a probe

For most homelab services — Nextcloud, Authelia, Jellyfin, Gitea — availability is the right SLI. For a production API, you'd want to add latency SLOs (P99 < 500ms) using histogram recording rules.

The same pattern applies directly to enterprise environments. If you're running Azure Load Balancer health probes or Application Gateway, the SLI is the same: probe success rate. The recording rules and alert thresholds are identical. The only difference is where the metrics come from.

I Hardened Pod securityContext and Broke 9 Containers in Production

david — Wed, 24 Jun 2026 19:03:35 +0000

kubeconform passed. kubectl --dry-run passed. The PR looked exactly like what every Kubernetes security checklist tells you to do: capabilities.drop: [ALL], runAsNonRoot: true, allowPrivilegeEscalation: false across every container that was missing a securityContext. Schema-valid, reviewed, merged.

Within minutes — because this cluster runs ArgoCD with selfHeal: true, where merge is deploy — nine containers were down. Two of them were Postgres, backing Paperless and Nextcloud. That's not a degraded non-critical service; that's an outage.

This is the failure analysis, the two wrong assumptions that caused it, the trap that bit during recovery, and the lesson for the next time anyone — including future me — is tempted to do a blanket securityContext pass across a manifest tree.

View the complete homelab infrastructure source on GitHub 🐙

The Two Wrong Assumptions

Assumption 1: capabilities.drop: [ALL] is always safe if the container doesn't need special privileges at runtime.

Wrong. It's not about what the final running process needs — it's about what the entrypoint script needs before it execs into that process. A huge number of container images follow the same pattern: start as root, chown/chmod the data directory so it's owned by an unprivileged user, then drop privileges via su-exec or setpriv before launching the actual application. That privilege-drop step itself requires CAP_CHOWN, CAP_SETUID, and CAP_SETGID — capabilities that drop: [ALL] removes before the entrypoint ever runs.

# What looked like the safe, recommended hardening:
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]

This broke gitea, authelia, headscale, mealie, and both Postgres instances (paperless and nextcloud) — every one of them runs this exact root-then-drop-privileges pattern in its entrypoint. It also broke the Paperless and Nextcloud Redis instances — but, tellingly, not the Authelia Redis instance, because that one has an explicit command: redis-server ... override that bypasses the image's normal entrypoint script entirely. Same image, same securityContext, different outcome — because the actual code path that runs is different.

Assumption 2: runAsNonRoot: true is safe to set on any container, since "obviously" you want it not running as root.

Wrong in the opposite direction. runAsNonRoot: true doesn't change anything about how the container runs. It's a check the kubelet performs when it creates the container, and it fails outright if the image's actual default user is root and nothing in the pod spec overrides it. vault-unseal (hashicorp/vault), the Nextcloud and Paperless Redis instances, and cloudflare-ddns (curlimages/curl) all default to root. These containers didn't crash-loop; they never started at all:

Error: container has runAsNonRoot and image will run as root

That's a CreateContainerConfigError, a clean failure with a clear message — which made it one of the easier categories to diagnose. The crash-looping containers from Assumption 1 were the harder half.
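The decision behind that error can be modelled roughly as below. This is a simplified sketch of my understanding of the check, not the kubelet's code: `image_user` stands for the image's USER value, `run_as_user` for an explicit runAsUser in the pod spec, and only the first message is taken from the error quoted above (the others are paraphrased).

```python
from typing import Optional


def run_as_non_root_error(image_user: str,
                          run_as_user: Optional[int] = None) -> Optional[str]:
    """Return an error string if runAsNonRoot: true would block the container."""
    if run_as_user is not None:
        # An explicit runAsUser wins over whatever the image declares.
        return "runAsUser 0 breaks non-root policy" if run_as_user == 0 else None

    user = image_user.split(":")[0]  # USER may be "uid:gid"
    if user == "":
        user = "0"                   # no USER directive means root
    if user.isdigit():
        if int(user) == 0:
            return "container has runAsNonRoot and image will run as root"
        return None
    # A user name cannot be verified as non-root without reading the image's passwd.
    return f"image has non-numeric user ({user}), cannot verify user is non-root"


print(run_as_non_root_error(""))                     # root image: blocked
print(run_as_non_root_error("", run_as_user=1000))   # None: override makes it pass
print(run_as_non_root_error("1000:1000"))            # None: numeric non-root user
```

The practical takeaway is that an explicit numeric runAsUser is what satisfies the check for images that default to root, provided the image actually works as that user.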

Catching It: Why "Application: Synced/Healthy" Lied

The first instinct when something looks wrong is to check ArgoCD. kubectl get application -n argocd showed Synced and Healthy. That was stale — ArgoCD's poll interval meant the Application object hadn't refreshed its view of the cluster yet, even though the new pods were already failing underneath it.

# Don't trust Application status alone during an active incident
argocd app get <name> --hard-refresh
# or:
kubectl annotate application <name> -n argocd \
  argocd.argoproj.io/refresh=hard --overwrite

The only thing that actually told the truth was looking directly at pod status and the pod's own creationTimestamp:

kubectl get pods -n apps -o wide
kubectl get pod <name> -n apps -o jsonpath='{.metadata.creationTimestamp}{"\n"}{.status.containerStatuses[0].restartCount}'
kubectl logs <name> -n apps --previous

A restart count that "looks survivable" is not proof of health. Two failures in this incident (paperless-ngx and uptime-kuma) surfaced only on a slower ReplicaSet rollout and weren't caught in the first sweep immediately after merge. They were found ~30 minutes later during an extended verification pass, specifically because someone went back and checked for a clean creationTimestamp with zero restarts since, not just "fewer restarts than expected." The bar for "this is actually fixed" has to be zero restarts on the current generation, not a restart count that happens to look low.
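The "zero restarts on the current generation" bar is easy to automate. The sketch below assumes it is fed the output of kubectl get pods -n apps -o json; the function name and the sample data are made up for illustration.

```python
import json


def unhealthy_pods(pod_list_json: str) -> list[str]:
    """Names of pods that are not fully ready or have restarted at least once."""
    bad = []
    for pod in json.loads(pod_list_json)["items"]:
        statuses = pod.get("status", {}).get("containerStatuses", [])
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        all_ready = bool(statuses) and all(s.get("ready", False) for s in statuses)
        if restarts > 0 or not all_ready:
            bad.append(pod["metadata"]["name"])
    return bad


# Hypothetical sample shaped like `kubectl get pods -o json` output.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "gitea-abc"},
     "status": {"containerStatuses": [{"ready": True, "restartCount": 0}]}},
    {"metadata": {"name": "authelia-def"},
     "status": {"containerStatuses": [{"ready": True, "restartCount": 3}]}},
    {"metadata": {"name": "vault-unseal-ghi"},
     "status": {"containerStatuses": [{"ready": False, "restartCount": 0}]}},
]})

print(unhealthy_pods(SAMPLE))  # ['authelia-def', 'vault-unseal-ghi']
```

A pod that is ready right now but has restarted since its creationTimestamp still lands on the list, which is exactly the case the first sweep missed.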

The Logs Told the Real Story Every Time

Once you're looking at the right pod, kubectl logs on the crashing container is unambiguous:

chown: /config: Operation not permitted
su-exec: setgroups(0): Operation not permitted
setpriv: setresuid failed: Operation not permitted

Three different error message formats, same root cause: the entrypoint tried to drop privileges and couldn't, because the capability that does that had been dropped first. This is the single most useful debugging fact from the whole incident — if you see any of these three error patterns after a securityContext change, the fix is "give the capability back," not "investigate the application."
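A small sketch of how those three signatures could be matched when sweeping logs from many pods at once. The patterns are derived only from the three lines quoted above, so treat them as a starting point rather than an exhaustive list; the function name is illustrative.

```python
import re

# One pattern per tool seen in this incident: chown, su-exec, setpriv.
CAPABILITY_ERRORS = [
    re.compile(r"^chown: .*Operation not permitted$"),
    re.compile(r"^su-exec: .*Operation not permitted$"),
    re.compile(r"^setpriv: .*Operation not permitted$"),
]


def looks_like_dropped_capability(log_text: str) -> bool:
    """True if any log line matches one of the known privilege-drop failures."""
    return any(
        pattern.search(line.strip())
        for line in log_text.splitlines()
        for pattern in CAPABILITY_ERRORS
    )


print(looks_like_dropped_capability("chown: /config: Operation not permitted"))  # True
print(looks_like_dropped_capability("level=info msg=\"listening on :9091\""))     # False
```

Feeding it the output of kubectl logs --previous for each crashing pod separates "capability problem" from "application problem" without reading every log by hand.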

The Recovery Trap: selfHeal Undoes Manual Fixes

To restore service faster than waiting on a PR review cycle, the instinct during an active outage is to patch the live cluster directly:

kubectl patch deployment authelia -n apps --type=json \
  -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/securityContext/capabilities"}]'

This works — for about as long as it takes ArgoCD's next reconciliation loop to notice the drift. With selfHeal: true, ArgoCD's entire job is to make the live cluster match Git. A manual kubectl patch that diverges from the committed manifest is drift, by definition, and gets silently reverted back to the still-broken state.

With selfHeal enabled, Git is the only place a fix can actually stick. During this incident, the real fix had to land as a committed, merged change before it survived — the manual patch bought a few minutes at best, and gave a false sense of "it's fixed" that evaporated on the next sync cycle. For an incident under selfHeal, the fastest real path to recovery is a fast-tracked PR, not a live patch.

The Fix, Applied Selectively

Five follow-up PRs, each fixing a specific verified failure mode as it was confirmed live, rather than a blanket revert of everything:

# Reverted only what was proven to break — capabilities stay dropped
# wherever the image's entrypoint doesn't need them:
securityContext:
  allowPrivilegeEscalation: false
  # capabilities.drop: ["ALL"]  ← removed for this specific image,
  # see inline comment for the verified failure mode

# kubernetes/apps/authelia/authelia.yml
securityContext:
  allowPrivilegeEscalation: false
  # SEC-012: image entrypoint runs as root and needs CAP_CHOWN/CAP_SETGID/
  # CAP_SETUID to chown /config and su-exec into its runtime user —
  # confirmed live, dropping all capabilities crash-loops it
  # ("su-exec: setgroups(0): Operation not permitted").

Each reverted file got an inline comment recording the specific verified failure mode — not a vague "this broke things." The next person (or future me) who's tempted to re-attempt a blanket capability drop across this manifest tree has the actual evidence sitting right there, rather than rediscovering it the same way.

Final state: allowPrivilegeEscalation: false everywhere. That one's genuinely always safe; it has no entrypoint-behavior dependency. capabilities.drop: [ALL] kept only where verified safe (cloudflared, gitea after its own fix, and several others). runAsNonRoot: true kept only where the image's actual default user is verifiably non-root. Net result: Trivy's misconfiguration finding count went from 215 to 171, which is real progress, just not the full sweep the first PR claimed.

The Lesson

kubeconform and kubectl --dry-run validate that a manifest is schema-valid. They say nothing about whether the container's actual entrypoint will survive the constraints you just imposed on it. Those are two completely different questions, and passing the first one tells you nothing about the second.

For any image you don't control, the actual behavior of its entrypoint — does it run as root and drop privileges, does it default to a non-root user, does anything in its startup sequence need a specific capability — has to be verified live, one image at a time, before a blanket security hardening change goes anywhere near a cluster with auto-sync enabled. The pattern to specifically watch for: anything that does chown/chmod on a data directory before launching the real process almost certainly needs CAP_CHOWN and friends, regardless of how harmless the final running process looks.

Addendum: Handing the Capability Back Wasn't the Best Fix Available

A reader pointed out, correctly, that "give the capability back" gets a service out of an outage but quietly gives up the thing the hardening pass was for in the first place. There's a fix that keeps both.

For the root-then-drop-privileges images, the chown in the entrypoint exists to fix ownership of a mounted volume so the unprivileged runtime user can write to it. fsGroup in the pod's securityContext does that job at mount time: the kubelet recursively changes the volume's group ownership to the given GID before the container ever starts. If the entrypoint skips its chown when ownership is already correct, it no longer needs CAP_CHOWN, and capabilities.drop: ["ALL"] can stay in place on the main container. One caveat: an entrypoint that still starts as root and calls su-exec or setpriv needs CAP_SETUID/CAP_SETGID for that step, so this route generally also means starting the container directly as the runtime user (runAsUser/runAsGroup), and whether an image supports that has to be verified per image:

securityContext:
  fsGroup: 1000  # kubelet chowns the mounted volume(s) to this GID before container start
containers:
  - name: authelia
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]  # no longer needs CAP_CHOWN/SETUID/SETGID —
                        # fsGroup already made the volume writable

For the handful of images that chown unconditionally regardless of existing ownership — no flag to skip it — the same idea applies one level up: move the chown into an initContainer that runs as root with only CAP_CHOWN, and lock the long-lived app container down fully. The privileged moment still happens, but it's a few seconds during pod startup instead of a capability held by the process serving traffic for the pod's entire lifetime:

initContainers:
  - name: fix-ownership
    image: busybox
    command: ["chown", "-R", "1000:1000", "/config"]
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["CHOWN"]
        drop: ["ALL"]
    volumeMounts:
      - name: config
        mountPath: /config
containers:
  - name: authelia
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]

One trap on the fsGroup route for volumes the size of Nextcloud's or Paperless's: the default behavior recursively chowns the entire volume on every pod start, not just the first. On a large PV that's a full-tree walk before the app can even begin starting — startup time falls off a cliff as the dataset grows. Set fsGroupChangePolicy: OnRootMismatch so Kubernetes only does the recursive chown when the volume's root ownership doesn't already match, and skips the walk on every subsequent restart once it's correct:

securityContext:
  fsGroup: 1000
  fsGroupChangePolicy: OnRootMismatch

This is the better fix for every one of the five images this incident's follow-up PRs reverted. Those PRs restored capabilities to get the fastest possible recovery during an active incident, which was the right call in the moment. But fsGroup (plus the initContainer pattern for the stubborn unconditional-chown images) is the version worth landing as the actual final state, since it gets the hardening back without reopening the exposure it was meant to close.

The same blast-radius problem exists in Azure — a blanket Pod Security Standard or Azure Policy applied across an AKS cluster's namespaces can break exactly this class of container for exactly this reason, just with kubectl apply replaced by a policy assignment that enforces on the next pod restart instead of immediately. Verify per-workload before enforcing cluster-wide, not after.

Hardening Unattended Raspberry Pi Edge Nodes: Watchdog, fail2ban, nftables, and the Mistakes That Take Down DNS

david — Mon, 22 Jun 2026 12:14:01 +0000

Two Raspberry Pi 4Bs run AdGuard Home and Unbound for an entire home network, in an active/passive pair via Keepalived. They're physical hardware sitting on a shelf, not VMs or LXCs — no Proxmox snapshot, no PBS backup, no terraform destroy && apply to recover from a bad state. If one hangs hard at 2am, nobody notices until someone's phone can't resolve a hostname.

This is the hardening pass that closed every gap I found in that setup: a hardware watchdog for total-system-freeze recovery, fail2ban for the one SSH-exposed surface, an nftables host firewall that's careful not to fight with Docker's own iptables rules, log size caps to stop slow SD-card death, and a DNS health check that works even on the day the rest of the monitoring stack is offline — which, as it turned out, was exactly the day it mattered.

View the complete homelab infrastructure source on GitHub 🐙

Why "It's Just DNS" Needs More Hardening, Not Less

The instinct with a small, single-purpose device is to leave it alone — fewer moving parts, fewer ways to break it. That's backwards for a device with no operator watching it and no automated recovery path. A k3s pod that crashes gets rescheduled in seconds. A Raspberry Pi that hard-hangs stays hung until a human walks over and pulls the power.

Everything below is about closing that gap: detecting failure independently, recovering from total freezes without intervention, and not introducing a new failure mode in the process of doing any of this.

Hardware Watchdog: Recovering From a Hang Software Can't See

A crashed container gets restarted by Docker. A kernel deadlock — the whole system stops responding, nothing crashes, nothing logs anything — doesn't. Nothing is left running to notice the problem or act on it.

The Broadcom SoC in a Raspberry Pi has a hardware watchdog timer: a circuit that resets the board if it isn't periodically "petted." As long as something pets it, the system is presumed alive. If petting stops — because the kernel is deadlocked and nothing can run — the watchdog fires and power-cycles the board.

# /boot/firmware/config.txt
dtparam=watchdog=on

# /etc/systemd/system.conf
[Manager]
RuntimeWatchdogSec=15s
RebootWatchdogSec=10min

RuntimeWatchdogSec=15s programs the hardware watchdog with a 15-second timeout; while the system is healthy, systemd pets it well within that window (at roughly half the timeout). If systemd itself stops running (the actual deadlock case this exists for), the pets stop, and the watchdog circuit force-resets the board. RebootWatchdogSec=10min is a second, independent safety net: if a reboot itself hangs (stuck somewhere in shutdown), the watchdog fires again after 10 minutes rather than leaving the board hung mid-reboot indefinitely.

This requires a reboot to take effect — the config.txt change only applies at boot. I gated the actual reboot behind an explicit flag (rpi_optimize_reboot, default false) rather than auto-rebooting a DNS server as a side effect of an Ansible run.

fail2ban: The One Exposed Surface

These Pis are reachable from the entire server VLAN and, via the Keepalived VIP, present a single consistent address that's an obvious target for anything scanning the network. The only network-facing attack surface that matters here is SSH.

# /etc/fail2ban/jail.d/sshd.local
[sshd]
enabled = true
port = ssh
filter = sshd
maxretry = 5
findtime = 10m
bantime = 1h

Five failed attempts within ten minutes bans the source IP for an hour. fail2ban only watches sshd auth logs — it has zero interaction with the DNS path (AdGuard, Unbound, Docker). That isolation matters: a misconfigured fail2ban jail watching the wrong log file, or banning based on the wrong filter, is a self-inflicted outage risk on a box where outages are expensive. Scoping it to exactly one well-understood log source keeps the blast radius of a fail2ban misconfiguration limited to "SSH access," never to DNS itself.
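The maxretry/findtime interaction is a sliding window, which the sketch below models with plain timestamps in seconds. It is a simplified illustration of the jail settings above, not fail2ban's implementation; boundary handling in particular is an assumption.

```python
from typing import Iterable, Optional

MAX_RETRY = 5          # maxretry
FIND_TIME = 10 * 60    # findtime = 10m, in seconds
BAN_TIME = 60 * 60     # bantime = 1h, in seconds


def ban_start(failure_times: Iterable[float]) -> Optional[float]:
    """Timestamp at which the source gets banned, or None if it never does."""
    window: list[float] = []
    for t in sorted(failure_times):
        # Keep only failures that are still inside the findtime window.
        window = [x for x in window if t - x < FIND_TIME] + [t]
        if len(window) >= MAX_RETRY:
            return t
    return None


burst = [0, 60, 120, 180, 240]     # five failures within four minutes
slow = [0, 300, 600, 900, 1200]    # one failure every five minutes

print(ban_start(burst))  # 240
print(ban_start(slow))   # None
start = ban_start(burst)
print(start + BAN_TIME)  # 3840: the ban lifts one hour later
```

A slow, patient guesser never trips this jail, which is one reason key-only SSH authentication matters more than the ban itself.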

The nftables Trap: Don't Touch /etc/nftables.conf

This is the part that could have caused the exact outage the rest of this hardening pass exists to prevent.

The obvious way to add a host firewall on Debian is to edit /etc/nftables.conf and enable nftables.service. The problem: that file conventionally starts with flush ruleset — and Docker manages its own NAT and FORWARD chains via iptables-nft (the nftables-backed iptables compatibility layer). Enabling the stock nftables.service would flush ruleset on every boot, wiping out Docker's NAT rules along with it, and silently break every published container port. On a box running AdGuard with network_mode: host specifically so it can bind port 53 directly — but also running other containers in bridge mode with published ports — that's not a hypothetical, it's the actual topology.

The fix: don't touch /etc/nftables.conf or the stock service at all. Use a separate ruleset file and a separate, custom systemd service:

# /etc/nftables-hostfw.conf
table inet hostfw {
  chain input {
    type filter hook input priority filter; policy drop;
    iif "lo" accept
    ct state established,related accept
    ip protocol icmp accept
    meta l4proto ipv6-icmp accept
    tcp dport 22 accept
    tcp dport 53 accept
    udp dport 53 accept
    tcp dport 3001 accept
    tcp dport { 80, 443 } accept
    udp dport 41641 accept
    ip protocol vrrp accept
  }
}

# /etc/systemd/system/hostfw.service
[Unit]
Description=Host firewall (inet hostfw table, additive — does not touch Docker's tables)
After=network.target docker.service
Wants=docker.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/usr/sbin/nft -f /etc/nftables-hostfw.conf
ExecStop=/usr/sbin/nft delete table inet hostfw

[Install]
WantedBy=multi-user.target

A named table (inet hostfw) in its own namespace, with policy drop only on that table's input chain — it's additive to whatever else nftables is doing, not a replacement of the ruleset. After=docker.service and Wants=docker.service ensure ordering: this table gets applied after Docker has already set up its own rules, so there's no race where this firewall's policy drop briefly applies before Docker's accept rules for its own traffic exist.

What this firewall covers: SSH (22), DNS (53 — AdGuard runs network_mode: host, so this is genuinely host-stack traffic, not Docker-NAT'd), AdGuard's web UI (3001), the HAProxy VIP (80/443), Tailscale (41641/udp), Keepalived VRRP.

What it deliberately doesn't cover: bridge-mode containers like Unbound (5335) and node_exporter (9100). Docker DNATs traffic to these before it ever reaches the host's INPUT chain — this firewall's table never sees that traffic, confirmed by live testing, not just by reading documentation about how Docker's iptables integration works. Restricting bridge-mode container ports would require rules in Docker's own DOCKER-USER chain, with careful IPv4/IPv6 handling to avoid breaking container egress. I deferred this: MikroTik already segments these Pis from the wider internet at the network layer, and the mistake-risk of getting DOCKER-USER chain rules wrong on a live DNS server outweighed the marginal security benefit of restricting traffic that's already internal-only.

Validation that actually validates the deployment path, not just the live change: live-tested on the replica Pi first, with a systemd-run safety-rollback timer staged before every individual change (the same dead-man's-switch pattern as the MikroTik cleanup). Then re-tested via the actual Ansible run — a separate code path from the manual live test, since a playbook can have a templating bug that a manual nft -f test wouldn't catch. Then validated with an actual reboot, to confirm the systemd service correctly reapplies the ruleset on boot, rather than only working because it happened to still be live-applied from the manual test. Only after the replica was fully green did the same sequence run against the primary DNS node.

Stopping Slow SD-Card Death

Docker's default json-file log driver has no size limit. On a box with a real disk, that's eventually a problem; on a Pi with an SD card as its only storage, it's a slow-motion outage that looks like nothing is wrong until the card is full and everything stops:

// /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Existing container logs were already at 17MB and 2.7MB by the time I checked — not catastrophic yet, but on a trajectory toward "disk full" with zero warning beforehand, months out. This setting only caps logs for containers created or recreated after the daemon restart — it doesn't retroactively truncate what's already there. Existing oversized logs needed a manual one-time cleanup; the daemon-wide default just stops the problem from recurring.
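The worst-case disk usage under these log-opts is simple to bound. The sketch below assumes Docker interprets the k/m/g suffixes as binary units and uses the five services on these Pis as the container count; `parse_size` and `max_log_bytes` are illustrative helpers.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}  # assumption: binary units


def parse_size(value: str) -> int:
    """Convert a size such as '10m' into bytes."""
    value = value.strip().lower()
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def max_log_bytes(max_size: str, max_file: str, containers: int = 1) -> int:
    """Upper bound on json-file log usage: size per file * files * containers."""
    return parse_size(max_size) * int(max_file) * containers


per_container = max_log_bytes("10m", "3")
print(per_container // 1024 ** 2)                               # 30 (MiB per container)
print(max_log_bytes("10m", "3", containers=5) // 1024 ** 2)     # 150 (MiB in total)
```

On an SD card, a hard ceiling of roughly 150 MiB for all container logs is the difference between a known cost and an open-ended one.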

Memory Limits: Catching a Leak Before It Takes the Whole Pi Down

# docker-compose, per service
adguardhome:
  mem_limit: 512m
unbound:
  mem_limit: 256m
promtail:
  mem_limit: 256m
node_exporter:
  mem_limit: 128m
autoheal:
  mem_limit: 64m

These are generous numbers, chosen from actual observed usage with real headroom — the goal isn't to constrain normal operation, it's to make sure a genuine memory leak or runaway process in one container gets killed by Docker's OOM handling for that container before it starves every other process on the Pi, including the DNS resolver everything depends on. Tested incrementally on the replica first, verified via docker inspect that limits were actually enforced, confirmed all containers came back Up after restart, with DNS unaffected throughout — the kind of change where "looks fine" isn't sufficient confirmation on a box this important.

Local Config Backup: The Gap Nobody Noticed

These Pis are physical hardware. Proxmox Backup Server only covers VMs and LXCs, and Velero only covers the Kubernetes cluster, so neither one was ever backing these up. The gap had existed since the Pis were first deployed, just never surfaced, because nothing had ever required restoring from a backup yet.

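#!/bin/bash
# /usr/local/bin/backup-rpi-configs.sh
set -euo pipefail
DEST=/opt/backups
STAMP=$(date +%Y%m%d-%H%M%S)
# Make sure the destination exists before the first run.
mkdir -p "${DEST}"
# Archive the config directories. A tar error (for example a missing path) is
# tolerated so pruning still runs, but it is logged rather than discarded.
tar czf "${DEST}/configs-${STAMP}.tar.gz" \
  -C / opt/adguardhome/conf opt/unbound \
  || echo "backup-rpi-configs: tar reported errors for ${STAMP}" >&2
# Keep the 14 newest snapshots and delete everything older.
ls -t "${DEST}"/configs-*.tar.gz 2>/dev/null | tail -n +15 | xargs -r rm --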

Daily, via a systemd timer with randomized delay (to avoid both Pis hitting disk I/O at the exact same instant), keeping the 14 most recent snapshots. Deliberately local-only, with no NFS or git dependency: the NFS server runs as an LXC on the Proxmox host, and a backup that depends on the very host whose failure it is meant to survive defeats the purpose. AdGuard's config also contains a bcrypt password hash; pushing that into git history, even encrypted-at-rest on a private remote, is an unnecessary exposure for a snapshot whose only job is "let me recover the last known-good config after an accidental change."
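The retention step can be expressed as a small function, which also makes it testable away from the Pi. This sketch sorts by file name rather than modification time, which is equivalent here only because the timestamp in the name (YYYYmmdd-HHMMSS) sorts chronologically; `prune_snapshots` is an illustrative name.

```python
from pathlib import Path


def prune_snapshots(dest: Path, keep: int = 14) -> list[Path]:
    """Delete all but the `keep` newest configs-*.tar.gz files; return what was removed."""
    snapshots = sorted(dest.glob("configs-*.tar.gz"),
                       key=lambda p: p.name, reverse=True)
    stale = snapshots[keep:]
    for path in stale:
        path.unlink()
    return stale


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        dest = Path(tmp)
        for day in range(1, 17):  # sixteen daily snapshots
            (dest / f"configs-202606{day:02d}-030000.tar.gz").touch()
        removed = prune_snapshots(dest)
        print(sorted(p.name for p in removed))  # the two oldest snapshots
```

With sixteen snapshots present, the two oldest are removed and fourteen remain, matching the tail -n +15 behaviour of the shell version.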

Alerting That Survives the Main Alerting Stack Being Down

This is the piece that mattered in practice, not just in theory. The homelab's primary alerting path (Prometheus → Alertmanager → Discord) runs on the k3s cluster, which runs on the Proxmox host. On the day I built this, the Proxmox host itself was down for hardware repair — which meant the entire alerting pipeline was also down, on exactly the day DNS health mattered most, since DNS was now also the only thing left running unsupervised.

#!/bin/bash
# Independent DNS health check — ZERO dependency on k3s/Prometheus/Alertmanager
WEBHOOK_URL="..."
STATE_FILE="/var/lib/dns-healthcheck.state"
HOSTNAME=$(hostname)

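resolves() {
  # dig exits 0 even on SERVFAIL or NXDOMAIN, so a successful exit is not
  # enough: also require a non-empty answer from the resolver on port $1.
  local answer
  answer=$(dig +short +timeout=3 google.com @127.0.0.1 -p "$1" 2>/dev/null) && [ -n "$answer" ]
}

check_dns() {
  resolves 53 && resolves 5335
}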

PREV_STATE="unknown"
[ -f "$STATE_FILE" ] && PREV_STATE=$(cat "$STATE_FILE")

if check_dns; then CURRENT_STATE="healthy"; else CURRENT_STATE="unhealthy"; fi

if [ "$CURRENT_STATE" != "$PREV_STATE" ]; then
  if [ "$CURRENT_STATE" = "unhealthy" ]; then
    MESSAGE="🔴 **${HOSTNAME}**: DNS resolution failing. This alert is independent of the main monitoring stack."
  else
    MESSAGE="🟢 **${HOSTNAME}**: DNS resolution recovered."
  fi
  curl -s -X POST -H "Content-Type: application/json" \
    -d "{\"content\": \"${MESSAGE}\"}" "${WEBHOOK_URL}" > /dev/null 2>&1 || true
fi

echo "$CURRENT_STATE" > "$STATE_FILE"

Run every two minutes via a systemd timer. Two design choices that matter more than the script's mechanics:

It tests both layers independently — AdGuard on port 53 and Unbound directly on port 5335. AdGuard forwards to Unbound; testing only the front door (53) wouldn't distinguish "AdGuard is fine but its upstream resolver died" from "everything's fine." && between the two dig calls means both have to succeed for the overall state to be healthy.

It only posts on a state change, not on every run. A naive healthcheck that posts every two minutes regardless of state either spams a channel into being muted (defeating the purpose) or gets its messages ignored after the first few identical ones. Tracking previous state in a file and diffing against it means the alert fires exactly twice per incident: once when it breaks, once when it recovers — and nothing in between.
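The post-on-transition logic is small enough to model directly. The sketch below mirrors the script's state handling, including the initial "unknown" state; the function names and the shortened message texts are illustrative.

```python
from typing import Iterable, Optional


def message_for(prev: str, current: str) -> Optional[str]:
    """Message to post for a state transition, or None when nothing changed."""
    if current == prev:
        return None
    if current == "unhealthy":
        return "DNS resolution failing"
    return "DNS resolution recovered"


def run(checks: Iterable[bool], prev: str = "unknown") -> list[str]:
    """Feed a series of check results through the state machine; collect posts."""
    sent = []
    for ok in checks:
        current = "healthy" if ok else "unhealthy"
        msg = message_for(prev, current)
        if msg is not None:
            sent.append(msg)
        prev = current  # equivalent to writing the state file
    return sent


# One incident spanning three failed checks produces exactly two posts.
print(run([True, True, False, False, False, True], prev="healthy"))
# First ever run: "unknown" -> "healthy" also counts as a transition.
print(run([True]))
```

One side effect worth knowing: on the very first run there is no state file, so a healthy check posts a "recovered" message once. Seeding the state file with "healthy" avoids that if it is unwanted.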

The webhook URL reuses the same Discord webhook Alertmanager already posts to — found, while wiring this up, to have been committed in plaintext in the cluster's own monitoring config. Worth its own fix, but explicitly out of scope for this change; noted rather than silently expanded into a second unrelated remediation in the same commit.

What Actually Got Tested, Not Just Written

Every change here got the same validation discipline, because the box matters too much to skip it: replica first, primary only after the replica was fully green; a manual live test and a separate Ansible-driven test, since they're different code paths; and for anything that should survive a reboot, an actual reboot — not just trusting that a systemd unit file is correct.

The pattern generalizes past Raspberry Pis: any unattended edge device — a branch-office router, an IoT gateway, a remote sensor node — has the same shape of problem. No operator watching it, no automated platform-level recovery, and a failure mode (hard hang) that ordinary application-level monitoring can't see because the monitoring agent itself is also hung. A hardware watchdog plus an alerting path with zero dependency on the thing being monitored is the minimum bar for "I'll find out if this breaks," regardless of what the device actually does.

IPv6 NAT66 Behind a FritzBox: The RouterOS 7 Bug That Broke WiFi Clients

david — Mon, 22 Jun 2026 12:13:49 +0000

Most homelab IPv6 guides assume you have native IPv6 from your ISP: a delegated /56 prefix, clean RA on the WAN, no NAT. That describes maybe 30% of actual deployments in Germany.

The other 70% sits behind a FritzBox with DS-Lite or CGN, gets a GUA on the WAN interface via SLAAC, and has no delegated prefix to distribute internally. If you want IPv6 inside your network, you build it yourself.

This is the setup I run: ULA addressing internally, NAT66 masquerade for outbound, everything Terraform-managed. It worked until RouterOS 7's router advertisement defaults caused every FritzBox WiFi client to route IPv6 through MikroTik — and then get dropped.

View the complete homelab infrastructure source on GitHub 🐙

The Topology

Internet
    │
FritzBox (CGN / DS-Lite)
    │  ether1 (WAN) — gets GUA via SLAAC from FritzBox
MikroTik RB5009
    ├── vlan10-mgmt   fd10::1/64
    ├── vlan20-srv    fd20::1/64
    ├── vlan30-dmz    fd30::1/64
    ├── vlan40-iot    fd40::1/64
    └── vlan100-admin fd64::1/64

The FritzBox provides:

IPv4 via CGN/DS-Lite (no public IPv4)
IPv6 GUA prefix via RA on its LAN port — MikroTik's ether1 picks this up via SLAAC

Internally, I use ULA (fd00::/8, RFC 4193). ULA is the IPv6 equivalent of RFC1918 private addressing. It's stable — it doesn't change when the ISP rotates the GUA prefix — and it works for all internal communication. The NAT66 rule masquerades ULA sources to the GUA when leaving ether1.

Step 1: ULA Addresses Per VLAN

Each VLAN gets a /64 from the fd00::/8 space. For readability, the second octet echoes the VLAN number (10 through 40 as written, VLAN 100 as hex 0x64):

# terraform/stacks/network/ipv6_network.tf

locals {
  ipv6_ula_prefixes = {
    "vlan10-mgmt"   = "fd10::/64"
    "vlan20-srv"    = "fd20::/64"
    "vlan30-dmz"    = "fd30::/64"
    "vlan40-iot"    = "fd40::/64"
    "vlan100-admin" = "fd64::/64"
  }
}

resource "routeros_ipv6_address" "vlan_ula" {
  for_each = local.ipv6_ula_prefixes

  address   = replace(each.value, "::/64", "::1/64")
  interface = each.key
  advertise = true
  comment   = "ULA gateway for ${each.key}"
}

advertise = true tells RouterOS to include that /64 in the Router Advertisements it sends on the interface. Hosts on each VLAN receive the RA and auto-configure a ULA address via SLAAC. No DHCPv6 needed.

The router address is ::1 in each /64: fd10::1/64, fd20::1/64, etc.
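
As a quick sanity check outside Terraform, the same prefix-to-gateway mapping can be sketched with Python's standard ipaddress module. This is only an illustration: the helper name gateway_for is hypothetical, and the prefixes are copied from the locals block above.

```python
import ipaddress

# Prefixes copied from the Terraform locals block.
ULA_PREFIXES = {
    "vlan10-mgmt": "fd10::/64",
    "vlan20-srv": "fd20::/64",
    "vlan30-dmz": "fd30::/64",
    "vlan40-iot": "fd40::/64",
    "vlan100-admin": "fd64::/64",
}

ULA_RANGE = ipaddress.ip_network("fd00::/8")


def gateway_for(prefix: str) -> str:
    """Return the ::1 router address for a /64, e.g. fd10::/64 -> fd10::1/64."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen != 64 or not net.subnet_of(ULA_RANGE):
        raise ValueError(f"{prefix} is not a /64 inside fd00::/8")
    return f"{net.network_address + 1}/{net.prefixlen}"


if __name__ == "__main__":
    for vlan, prefix in ULA_PREFIXES.items():
        print(vlan, gateway_for(prefix))
```

The range check rejects anything outside fd00::/8, which would catch a mistyped prefix before it reaches the router.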

Step 2: Accept RA on ether1

MikroTik defaults to ignoring Router Advertisements when forward = true (i.e., when acting as a router). You have to explicitly enable RA acceptance on the WAN interface:

resource "routeros_ipv6_settings" "global" {
  accept_router_advertisements = "yes"
  forward                      = true
}

With this, ether1 accepts the RA from the FritzBox and configures its GUA via SLAAC. /ipv6 address print will show the GUA on ether1, alongside any manually configured IPv6 addresses on that interface.

Step 3: NAT66 Masquerade

The NAT66 rule masquerades outbound IPv6 from ULA sources to the GUA on ether1:

resource "routeros_ipv6_firewall_nat" "nat66_masquerade" {
  chain         = "srcnat"
  action        = "masquerade"
  src_address   = "fd00::/8"
  out_interface = "ether1"
  comment       = "NAT66: ULA → WAN GUA (FritzBox upstream)"
}

The src_address = "fd00::/8" constraint is critical. Without it, the rule matches ALL IPv6 traffic leaving ether1 — including traffic from FritzBox WiFi clients that happens to transit MikroTik. This is one half of the bug that caused problems (more on that below).

Step 4: IPv6 Firewall

The IPv6 firewall mirrors the IPv4 firewall philosophy: default-drop, explicit allows, place_before for deterministic rule ordering.

# INPUT chain
resource "routeros_ipv6_firewall_filter" "v6_in_00_established" {
  action           = "accept"
  chain            = "input"
  connection_state = "established,related,untracked"
  place_before     = routeros_ipv6_firewall_filter.v6_in_01_icmpv6.id
  comment          = "V6-IN-00: Allow established/related"
}

resource "routeros_ipv6_firewall_filter" "v6_in_01_icmpv6" {
  action       = "accept"
  chain        = "input"
  protocol     = "icmpv6"
  place_before = routeros_ipv6_firewall_filter.v6_input_drop_all.id
  comment      = "V6-IN-01: Allow ICMPv6 (NDP, RA, ping6)"
}

resource "routeros_ipv6_firewall_filter" "v6_input_drop_all" {
  action  = "drop"
  chain   = "input"
  comment = "V6-IN-DROP: Drop all other IPv6 input"
}

# FORWARD chain
resource "routeros_ipv6_firewall_filter" "v6_fwd_00_established" {
  action           = "accept"
  chain            = "forward"
  connection_state = "established,related,untracked"
  place_before     = routeros_ipv6_firewall_filter.v6_fwd_01_icmpv6.id
  comment          = "V6-FWD-00: Allow established/related"
}

resource "routeros_ipv6_firewall_filter" "v6_fwd_01_icmpv6" {
  action       = "accept"
  chain        = "forward"
  protocol     = "icmpv6"
  place_before = routeros_ipv6_firewall_filter.v6_fwd_02_internal_out.id
  comment      = "V6-FWD-01: Allow ICMPv6"
}

resource "routeros_ipv6_firewall_filter" "v6_fwd_02_internal_out" {
  action        = "accept"
  chain         = "forward"
  src_address   = "fd00::/8"
  out_interface = "ether1"
  place_before  = routeros_ipv6_firewall_filter.v6_forward_drop_all.id
  comment       = "V6-FWD-02: Allow internal ULA to WAN"
}

resource "routeros_ipv6_firewall_filter" "v6_forward_drop_all" {
  action  = "drop"
  chain   = "forward"
  comment = "V6-FWD-DROP: Drop all other IPv6 forward"
}

The forward rule v6_fwd_02_internal_out only allows ULA sources (fd00::/8) to exit via ether1. That's intentional — and it's what exposed the RouterOS 7 bug.

The Bug: RouterOS 7 Sends RA on All Interfaces

After deploying this configuration, FritzBox WiFi clients started losing IPv6 connectivity.

The symptom: devices on the FritzBox WiFi (SSID, not the MikroTik VLANs) had IPv6 addresses but couldn't reach the internet via IPv6. traceroute6 on an affected device showed the path going through MikroTik — not the FritzBox.

The cause: RouterOS 7 enables Router Advertisement on all interfaces by default, including ether1 (WAN).

Here's the sequence:

MikroTik receives a GUA prefix from FritzBox via RA on ether1
RouterOS 7 then re-advertises a Router Advertisement on ether1 — back towards the FritzBox
The FritzBox sees MikroTik advertising itself as an IPv6 router on the LAN
FritzBox WiFi clients pick up MikroTik's RA and install it as their default IPv6 gateway
IPv6 traffic from WiFi clients routes through MikroTik's FORWARD chain
FORWARD chain only accepts fd00::/8 sources — GUA addresses from WiFi clients don't match
Traffic dropped. IPv6 broken for all FritzBox WiFi clients.
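
The last two steps of that sequence can be sketched in a few lines of Python. This is a simplified model, not RouterOS behaviour in full: it ignores the established/related and ICMPv6 rules, the function name forward_verdict is hypothetical, and 2001:db8::/32 (the documentation prefix) stands in for a WiFi client's GUA.

```python
import ipaddress

ULA = ipaddress.ip_network("fd00::/8")


def forward_verdict(src: str, out_interface: str) -> str:
    """Simplified model of V6-FWD-02 followed by V6-FWD-DROP."""
    src_addr = ipaddress.ip_address(src)
    if src_addr in ULA and out_interface == "ether1":
        return "accept"  # V6-FWD-02: internal ULA to WAN
    return "drop"        # V6-FWD-DROP: everything else


if __name__ == "__main__":
    # A server on vlan20-srv leaving via the WAN port.
    print(forward_verdict("fd20::50", "ether1"))
    # A FritzBox WiFi client with a GUA that was steered through MikroTik.
    print(forward_verdict("2001:db8:1::50", "ether1"))
```

The firewall is doing exactly what it was told to do; the fault is that the WiFi client's traffic arrives at MikroTik at all.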

The fix is to disable RA on ether1. In RouterOS, find the ether1 entry under /ipv6/nd and set advertise=no.

The problem: as of terraform-routeros provider version 1.99.1 (latest at time of writing), there is no routeros_ipv6_nd resource to manage this via Terraform. The fix has to be applied manually:

/ipv6/nd set [find interface=ether1] advertise=no

This is documented as a comment in the Terraform configuration so the manual step isn't lost. Terraform doesn't manage this setting, so a future terraform apply neither applies nor reverts it:

# RouterOS 7 enables RA advertisement on ALL interfaces by default — including
# ether1 (WAN). Once ether1 gets a GUA via SLAAC, MikroTik starts sending RAs
# on the FritzBox LAN. FritzBox WiFi clients then use MikroTik as their IPv6
# gateway, but the FORWARD chain only allows fd00::/8 sources → GUA clients
# are dropped → IPv6 broken on FritzBox WiFi.
# RA on ether1 is disabled in RouterOS: /ipv6/nd set [find interface=ether1] advertise=no
# routeros_ipv6_nd is not exposed in terraform-routeros/routeros ≤ 1.99.1 (latest as of 2026-06).

Once routeros_ipv6_nd is added to the provider (tracked upstream), this should be managed as:

resource "routeros_ipv6_nd" "ether1_no_ra" {
  interface = "ether1"
  advertise = false
}

ULA vs. GUA: Why Not Just Use the ISP Prefix?

The obvious alternative: use the GUA prefix the FritzBox receives from the ISP, delegate a /64 to each VLAN, and skip NAT66 entirely. IPv6 was designed to eliminate NAT.

The problem: German ISPs frequently rotate GUA prefixes. A prefix change means every device on every VLAN gets a new address — breaking DNS records, Ansible inventory, firewall rules, and anything else that references addresses directly.

ULA solves this. The fd00::/8 prefix is locally assigned and never changes. Internal addressing is stable forever. The NAT66 rule handles the GUA ↔ ULA translation at the WAN boundary transparently.

The trade-off: ULA + NAT66 breaks end-to-end IPv6 reachability (GUA hosts on the internet can't initiate connections to your ULA hosts). For a homelab where all inbound connections come through a Cloudflare Tunnel or Traefik ingress anyway, that's not a problem.

Verifying the Setup

After applying the Terraform config and the manual RA fix:

# From a device on vlan20-srv (should have fd20::/64 address)
ip -6 addr show
# Should see: fd20::xxx/64

# Test outbound IPv6
ping6 -c 3 ipv6.google.com
# Should succeed (NAT66 masquerades the ULA source to the GUA)

# From a FritzBox WiFi device
ip -6 route show
# Default via should point to FritzBox, not MikroTik

If WiFi clients still route through MikroTik after setting advertise=no, run /ipv6/nd print on the RouterOS terminal to verify the change persisted. RouterOS can be slow to propagate ND configuration changes.

The same ULA-vs-GUA stability trade-off shows up in Azure networking — except there it's RFC1918 address space behind NAT Gateway or Azure Firewall instead of a CGN ISP. If you're designing the equivalent zero-trust network layer for Azure, the same default-deny-plus-explicit-allow philosophy applies.

My Firewall Had 77 Rules. Terraform Knew About 22 of Them.

david — Sun, 21 Jun 2026 15:43:57 +0000

I wrote an article about building a zero-trust MikroTik firewall with Terraform — default-deny chains, explicit allow rules, place_before for deterministic ordering. The Terraform code was correct. I'd run terraform plan regularly and it showed no drift.

The live router had 77 firewall filter rules. The Terraform configuration tracked 22.

This is the story of how that happened, why terraform plan showing clean didn't catch it, and how a security tightening I'd made — and verified, and considered done — had been silently undone for weeks.

How You End Up With Four Generations of the Same Firewall

The pattern, in hindsight, is obvious: every time I did a significant firewall rework, I wrote a fresh, complete set of rules in firewall_deterministic.tf and ran terraform apply. Terraform created the new rules. It did not remove the old generation, because by then those rules were no longer resources Terraform knew about. Most had been created by a previous terraform apply of an earlier version of the same file, whose resource definitions were later edited or replaced rather than removed cleanly. A couple had been created directly via the RouterOS API during a debugging session and never imported.

Terraform only manages what's in its state. A rule that exists on the router but isn't a resource in the current configuration is invisible to terraform plan — there's no diff to show, because there's nothing in the config to compare it against. terraform plan reporting "no changes" means the resources Terraform knows about match reality. It says nothing about resources Terraform was never told to track.

The Bug This Actually Caused

This wasn't just clutter. It actively undid a real security fix.

At some point I'd tightened a monitoring rule from "Prometheus can reach all internal VLANs" to "Prometheus can reach only port 9100 on the management VLAN":

# The narrow, intentional version — added to fix an overly broad rule
resource "routeros_ip_firewall_filter" "fwd_04a_srv_monitoring" {
  action       = "accept"
  chain        = "forward"
  src_address  = "10.0.20.0/24"
  dst_address  = "10.0.10.0/24"
  dst_port     = "9100"
  protocol     = "tcp"
  place_before = routeros_ip_firewall_filter.fwd_08_allow_dns.id
  comment      = "04a: SRV - Prometheus scrape to MGMT node_exporter (port 9100)"
}

This rule existed in Terraform. terraform plan showed it as applied, no drift. I had every reason to believe the network was scoped exactly this way.

But RouterOS evaluates firewall rules in order and stops at the first match. Buried earlier in the live ruleset — a leftover from a previous generation — was the old, broad version:

"04a: SRV - Allow monitoring to all internal VLANs"
src=10.0.20.0/24 dst=10.0.10.0/24 action=accept

No port restriction. No protocol restriction. And because RouterOS hit this rule first, traffic matching it was accepted before the router ever evaluated the narrower, newer rule. The port-9100-only restriction I'd written, tested, and confirmed in Terraform had never actually been enforced on the live device — the older, broader rule was silently winning every time.
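
A minimal Python sketch of first-match evaluation shows the shadowing. It is a toy model under stated assumptions: the Rule class and first_match helper are hypothetical, only terminal accept rules are modelled, and the two rules are the broad and narrow 04a variants described above.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    comment: str
    src: str
    dst: str
    action: str
    protocol: Optional[str] = None   # None matches any protocol
    dst_port: Optional[int] = None   # None matches any port

    def matches(self, src: str, dst: str, protocol: str, dst_port: int) -> bool:
        return (
            ipaddress.ip_address(src) in ipaddress.ip_network(self.src)
            and ipaddress.ip_address(dst) in ipaddress.ip_network(self.dst)
            and self.protocol in (None, protocol)
            and self.dst_port in (None, dst_port)
        )


def first_match(rules, src, dst, protocol, dst_port) -> str:
    """Return the comment of the first matching rule, or 'default drop'."""
    for rule in rules:
        if rule.matches(src, dst, protocol, dst_port):
            return rule.comment
    return "default drop"


# Live order on the router: the leftover broad rule sits above the narrow one.
LIVE_RULES = [
    Rule("old broad 04a", "10.0.20.0/24", "10.0.10.0/24", "accept"),
    Rule("new narrow 04a", "10.0.20.0/24", "10.0.10.0/24", "accept", "tcp", 9100),
]

if __name__ == "__main__":
    # SSH from the server VLAN to the management VLAN: should be dropped.
    print(first_match(LIVE_RULES, "10.0.20.5", "10.0.10.7", "tcp", 22))
```

With the broad rule present, the SSH packet is accepted by "old broad 04a"; with only the narrow rule in the list, the same packet falls through to the default drop.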

This is the sharpest version of the general problem with ordered rule lists: a rule that looks dead (superseded by a newer one) isn't dead unless it's actually removed. It's just sitting there, waiting for the day its broader match happens to fire first.

Finding the Actual Scope of the Problem

# Pull live rules via the RouterOS REST API
curl -s -k -u "admin:$PASS" https://10.0.10.1/rest/ip/firewall/filter | jq length
# → 77

# Count Terraform-managed resources
grep -c 'resource "routeros_ip_firewall_filter"' terraform/stacks/network/firewall_deterministic.tf
# → 22

55 rules existed on the router with no corresponding Terraform resource. Diffing live rules against the 22 known-good ones by exact field match (action, chain, src/dst address, port, protocol — not just comment text, since comments had also drifted across generations) split that 55 into two groups:

36 rules were exact or near-exact duplicates of a currently-tracked rule — leftover generations of the same intent, just stale.
19 rules were legitimate, distinct, and still in active use — VPN access tiers, Atlantis/MikroDash API access, WireGuard, a Minecraft server port-forward, OIDC redirect routes. These had been created manually at some point and simply never added to Terraform in the first place. Not drift in the dangerous sense — just infrastructure that was never brought under IaC.
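
The exact-field comparison can be sketched as follows. This is an illustration, not the script I ran: it assumes the live rules come from the REST call above as a list of dicts, and that the Terraform-managed rules are available in the same shape with their RouterOS .id values (for example, exported from state). The function names are hypothetical, and it only finds exact matches; near-exact duplicates still need a manual look.

```python
FIELDS = ("action", "chain", "src-address", "dst-address", "dst-port", "protocol")


def rule_key(rule: dict) -> tuple:
    """Identity of a rule by its match fields, ignoring comment and .id."""
    return tuple(rule.get(field, "") for field in FIELDS)


def classify(live_rules: list, tracked_rules: list):
    """Split live rules without a Terraform resource into two groups.

    Returns (duplicates, untracked):
      duplicates - same match fields as a tracked rule, but a different .id
      untracked  - no tracked rule has the same match fields
    """
    managed_ids = {rule[".id"] for rule in tracked_rules}
    tracked_keys = {rule_key(rule) for rule in tracked_rules}

    duplicates, untracked = [], []
    for rule in live_rules:
        if rule[".id"] in managed_ids:
            continue  # this is the Terraform-managed rule itself
        if rule_key(rule) in tracked_keys:
            duplicates.append(rule)
        else:
            untracked.append(rule)
    return duplicates, untracked
```

Keying on the match fields rather than the comment is the important choice here, since the comments were the part that had drifted.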

Deleting Firewall Rules Without Locking Yourself Out

This is the highest-blast-radius device in the network. A mistake deleting the wrong rule doesn't get fixed by SSHing back in — if the rule that breaks is the one allowing SSH, there's no way back in remotely. Before deleting anything, I staged a full per-rule restore as a one-shot RouterOS scheduler entry — a dead man's switch:

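# Staged with disabled=yes; enabled only right before the deletions.
# interval=5m repeats, so the restore script should remove this entry after it runs.
/system scheduler add name="restore-firewall-failsafe" \
  start-time=startup interval=5m disabled=yes \
  on-event="/system script run restore-firewall-rules"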

The restore script re-creates every rule about to be deleted, scheduled to fire automatically in five minutes unless cancelled. The procedure:

1. Stage the restore script and the scheduler entry (not yet running: disabled=yes).
2. Enable the scheduler.
3. Delete the 36 orphaned rules via direct REST API calls.
4. Immediately verify DNS, SSH, and WAN connectivity from a separate, already-open session.
5. Only if everything checks out: disable and remove the scheduler entry.

If step 4 had failed — if deleting a rule had broken something — the scheduler would have restored the deleted rules automatically within five minutes, without requiring any further access to the router. This pattern generalizes to any change where the failure mode is "I can no longer reach the device to fix my mistake": stage the rollback to fire automatically on a timer, and only cancel the timer after confirming success through a separate channel.
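
The general shape of the pattern, independent of RouterOS, looks roughly like this in Python. It is a sketch only: the class and function names are hypothetical, and an in-process timer is a stand-in for a timer that, in the real case, has to live on the device being changed so that it still fires when you are locked out.

```python
import threading


class DeadMansSwitch:
    """Run `rollback` after `timeout` seconds unless cancel() is called first."""

    def __init__(self, timeout: float, rollback):
        self._timer = threading.Timer(timeout, rollback)

    def arm(self) -> None:
        self._timer.start()

    def cancel(self) -> None:
        self._timer.cancel()

    def wait(self) -> None:
        """Block until the timer has fired or been cancelled."""
        self._timer.join()


def risky_change(apply, verify, rollback, timeout: float) -> bool:
    """Arm the rollback, apply the change, and disarm only if verify() passes."""
    switch = DeadMansSwitch(timeout, rollback)
    switch.arm()          # rollback is scheduled before anything is touched
    apply()
    ok = verify()         # ideally checked through a separate channel
    if ok:
        switch.cancel()   # success confirmed: disarm
    switch.wait()         # otherwise let the rollback fire
    return ok
```

The ordering is the point: the rollback is armed before the change, and cancelling it is the only step that requires things to still be working.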

What's Left

41 rules remain: the 22 Terraform-managed ones, plus the 19 legitimate manual rules — now tracked as a known gap (docs/OPERATIONS.md) rather than invisible clutter. Bringing those 19 under Terraform via import blocks is the obvious next step, but it's explicitly not urgent — they're working, intentional, and visible in documentation now. The 36 that mattered (because they were actively undermining a security control) are gone.

The General Lesson

terraform plan showing no drift is not the same claim as "the live device matches my intent." It only means the resources Terraform is tracking match their last-applied state. Anything created outside that tracked set — via a prior version of the config that got edited rather than cleanly replaced, or via direct API/CLI access during a debugging session — is invisible to the diff, indefinitely, until someone goes and looks at the live device directly.

For an ordered rule list specifically (firewalls, but also things like Azure Firewall Policy rule collections, NSG priority-ordered rules, or any first-match system), an orphaned broad rule isn't neutral clutter — it can silently take precedence over a narrower rule you believe supersedes it. Periodically diffing live state against Terraform state by direct query — not just trusting plan — is the only way to catch this class of bug.

The same risk exists in Azure NSGs and Azure Firewall Policy: priority-ordered rules where an old, broad rule with a lower priority number can silently win over a newer, narrower one if it was never cleaned up after a security tightening. If you're managing NSG rule sets at scale, periodically pulling live rule state via az network nsg rule list and diffing it against your Terraform state catches exactly this class of drift before it becomes a finding in someone else's audit.

Kyverno: Supply Chain Security as Admission Control on Kubernetes

david — Sun, 21 Jun 2026 15:43:51 +0000

Kubernetes has no opinion about what you run. You can deploy a container with no resource limits, no security context, root access to the host filesystem, and an image tagged :latest that changes every week — and the scheduler will place it without complaint.

In a homelab that's annoying. In a production cluster, it's a compliance failure.

Kyverno is a Kubernetes-native policy engine. It runs as an admission webhook — every kubectl apply, every ArgoCD sync, every Helm install is evaluated against your policies before it reaches the scheduler. Violations are either blocked (Enforce) or logged (Audit).

This post covers the three policies running on my k3s cluster and the Audit-first rollout strategy that lets you enforce gradually without breaking existing workloads.

Why Admission Control

The alternative to admission control is runtime enforcement: scan running containers, alert on violations, remediate manually. This works, but it's reactive. A misconfigured deployment reaches the scheduler, requests a node, pulls an image, and starts running before anything flags it.

Admission control is preventive. The webhook intercepts the API request before the object is created. A rejected request never touches the scheduler.

For supply chain security specifically — controlling what can run, not just what is running — admission control is the right layer.

Installing Kyverno via ArgoCD

Kyverno deploys via Helm:

# kubernetes/system/kyverno/application.yml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kyverno
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://kyverno.github.io/kyverno/
    targetRevision: 3.2.6
    chart: kyverno
    helm:
      values: |
        admissionController:
          replicas: 1
        backgroundController:
          replicas: 1
  destination:
    server: https://kubernetes.default.svc
    namespace: kyverno
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

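Background scans of existing resources, and the PolicyReport objects they populate, are handled by Kyverno's reports controller, which the 3.x chart deploys as a separate component by default; the backgroundController configured above processes generate and mutate-existing rules. Without background scanning, you only catch violations at admission time, and existing non-compliant resources stay invisible.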

Policy 1: Require Resource Limits (Audit)

Containers without resource limits are a noisy-neighbour problem. A single container that consumes unbounded memory can push the node into memory pressure and trigger the OOM killer, affecting every other pod on it.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
  annotations:
    policies.kyverno.io/title: Require Resource Limits
    policies.kyverno.io/description: >
      Pods without resource limits can starve other workloads on the same node.
      Run `kubectl get policyreport -A` to see violations before enforcing.
spec:
  validationFailureAction: Audit
  background: true
  rules:
    - name: check-container-limits
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [apps, monitoring, database]
      validate:
        message: "Container '{{ request.object.spec.containers[0].name }}' must define resources.limits (cpu and memory)."
        foreach:
          - list: "request.object.spec.containers"
            deny:
              conditions:
                any:
                  - key: "{{ element.resources.limits | length(@) }}"
                    operator: Equals
                    value: 0

This is in Audit mode — violations are logged to PolicyReport but not blocked. That's intentional. Before enforcing, you need to know what would break.

Check current violations:

kubectl get policyreport -A
kubectl describe policyreport -n apps

The report shows every pod in apps, monitoring, and database namespaces that's missing resource limits. Fix those, then flip validationFailureAction to Enforce.
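
To make the deny condition concrete, here is roughly what it checks, written as plain Python over a pod manifest loaded as a dict. The function name is hypothetical and this is not how Kyverno evaluates it internally. Like the policy, it treats any non-empty limits map as sufficient, so a container with only a memory limit passes even though the message asks for both cpu and memory.

```python
def containers_missing_limits(pod: dict) -> list:
    """Names of containers in spec.containers with no resources.limits entries."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits") or {}
        if len(limits) == 0:
            missing.append(container["name"])
    return missing


if __name__ == "__main__":
    pod = {
        "spec": {
            "containers": [
                {"name": "app", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
                {"name": "sidecar"},
            ]
        }
    }
    print(containers_missing_limits(pod))
```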

Policy 2: Disallow Privileged Containers (Enforce)

This one is in Enforce mode from day one. No homelab service needs privileged mode — if something requires privileged: true, that's a flag to investigate, not accommodate.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: check-privileged
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [apps, monitoring, database, external-secrets]
      validate:
        message: "Privileged containers are not allowed."
        foreach:
          - list: "request.object.spec.containers"
            deny:
              conditions:
                any:
                  - key: "{{ element.securityContext.privileged || `false` }}"
                    operator: Equals
                    value: true

The || `false` fallback handles the case where securityContext is not set: the expression evaluates to false, which correctly passes validation (no security context means non-privileged). The backticks matter, because JMESPath treats a bare false as a field name rather than a boolean literal.

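This blocks containers with securityContext.privileged: true.

It does not, on its own, block hostPID, hostNetwork, or hostPath: those are separate pod-level fields, and this rule only inspects securityContext.privileged on spec.containers. Covering them, or initContainers and ephemeralContainers, needs additional rules.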

Policy 3: Disallow `:latest` Image Tag (Audit)

Images tagged :latest are non-deterministic. What nginx:latest points to today may be different from what it pointed to last week. Rollbacks become unreliable because the tag no longer identifies the previous image. Reproducible deployments require pinned tags: either semver (4.39.20) or digest (sha256:abc123).

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Audit
  background: true
  rules:
    - name: check-image-tag
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [apps, monitoring, database]
      validate:
        message: "Image '{{ element.image }}' must use a specific tag, not :latest or untagged."
        foreach:
          - list: "request.object.spec.containers"
            deny:
              conditions:
                any:
                  - key: "{{ element.image }}"
                    operator: Equals
                    value: "*:latest"
                  - key: "{{ element.image | contains(@, ':') }}"
                    operator: Equals
                    value: false

The second condition catches images with no tag at all — nginx without :latest still resolves to latest under the hood.
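
The two conditions can be approximated in Python to see what they do and do not catch. This is a sketch: policy_flags mirrors the policy's logic using fnmatch for the wildcard, is_unpinned is a hypothetical stricter variant, and registry.local:5000 is a made-up registry name used to show one edge case.

```python
from fnmatch import fnmatchcase


def policy_flags(image: str) -> bool:
    """Mirror of the two deny conditions in disallow-latest-tag."""
    return fnmatchcase(image, "*:latest") or ":" not in image


def is_unpinned(image: str) -> bool:
    """Stricter variant: look at the tag of the last path segment only."""
    if "@sha256:" in image:
        return False                      # pinned by digest
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True                       # no tag, resolves to latest
    return last_segment.rsplit(":", 1)[1] == "latest"


if __name__ == "__main__":
    for image in (
        "nginx",
        "ghcr.io/authelia/authelia:latest",
        "ghcr.io/authelia/authelia:4.39.20",
        "registry.local:5000/nginx",      # untagged, but contains a port colon
    ):
        print(image, policy_flags(image), is_unpinned(image))
```

The last case is the one worth knowing about: an untagged image from a registry with an explicit port contains a colon, so the "no colon means no tag" check lets it through.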

Real-World Example: The Authelia `:latest` Violation

When I first enabled the disallow-latest-tag policy in Audit mode and ran kubectl get policyreport -n apps, Authelia showed up immediately:

PASS  disallow-privileged-containers  authelia-xxx    apps
FAIL  disallow-latest-tag             authelia-xxx    apps
  → ghcr.io/authelia/authelia:latest

The Authelia deployment had been running latest from day one. The fix was straightforward:

# Before
image: ghcr.io/authelia/authelia:latest

# After
image: ghcr.io/authelia/authelia:4.39.20

Once pinned, the PolicyReport cleared. This is the Audit → Enforce workflow in practice: enable, observe, fix violations, then enforce.

The Audit → Enforce Rollout Strategy

The pattern that makes Kyverno safe to adopt on a running cluster:

1. Deploy all policies in validationFailureAction: Audit
2. Wait for background scans to populate PolicyReports (5-10 min)
3. kubectl get policyreport -A → review violations
4. Fix violations in affected deployments
5. Change policy to validationFailureAction: Enforce
6. Verify no new violations in PolicyReport

Going directly to Enforce on a running cluster breaks things. ArgoCD sync jobs, Helm hooks, system daemonsets — they all create pods and will hit the policy. Audit first gives you visibility without the blast radius.

Scoping Policies to Specific Namespaces

Notice that the policies match only user workload namespaces: [apps, monitoring, database], plus external-secrets for the privileged-container rule. System namespaces (kube-system, kube-public, argocd, kyverno) are excluded deliberately.

System components often need exception behaviour — hostPath volumes for kubelet, privileged containers for CNI plugins, no resource limits on critical infrastructure. Scoping your policies to user workload namespaces avoids blocking cluster internals while still enforcing what matters.

PolicyReport: The Visibility Layer

# All reports across all namespaces
kubectl get policyreport -A

# Detail for a specific namespace
kubectl describe policyreport polr-ns-apps -n apps

# All violations
kubectl get policyreport -A -o json | \
  jq '.items[].results[] | select(.result == "fail")'

PolicyReports are Kubernetes-native objects. You can build Grafana dashboards against them (Kyverno exports metrics to Prometheus), alert on policy violations, and track compliance trends over time.

The require-resource-limits violation count is a useful metric: as you fix deployments, it should trend towards zero. When it hits zero and stays there, flip to Enforce.
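
If you want that count per policy without a dashboard, a small script over the JSON output is enough. This sketch assumes the standard PolicyReport shape (items[].results[] with policy and result fields, as used in the jq filter above); the script name policy_failures.py is hypothetical.

```python
import json
import sys
from collections import Counter


def failures_by_policy(report_list: dict) -> Counter:
    """Count results with result == "fail", grouped by policy name."""
    counts = Counter()
    for report in report_list.get("items", []):
        for result in report.get("results") or []:
            if result.get("result") == "fail":
                counts[result.get("policy", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Reads the output of `kubectl get policyreport -A -o json` from stdin.
    for policy, count in failures_by_policy(json.load(sys.stdin)).most_common():
        print(f"{count:4d}  {policy}")
```

Usage would look like: kubectl get policyreport -A -o json | python3 policy_failures.py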

Enterprise Bridge

These three policies map directly to supply chain security requirements under ISO 27001 (A.12.6, Technical Vulnerability Management, in the 2013 Annex A numbering) and NIS2 Article 21(2) (point (b), incident handling; point (d), supply chain security).

In Azure environments, the equivalent layer is Azure Policy + Defender for Containers — the same concept, different implementation. For teams deploying Kubernetes workloads in regulated environments, Kyverno policies committed to Git provide the audit trail that compliance frameworks require: every policy change is a pull request, every violation is logged.

I Ran Gitleaks Against My Own Repo and Found 12 Real Secrets

david — Sun, 21 Jun 2026 15:14:09 +0000

I assumed my homelab repo was clean. No one had ever flagged anything in review (there is no one else reviewing it), CI was green, and I generally try to use Vault and ExternalSecrets for anything sensitive.

Then I ran a full-history gitleaks detect against it. It found 12 distinct secrets committed in plaintext — including the OIDC private key that signs SSO tokens for half the cluster.

This is the scanning setup I put in place afterward, the baseline strategy that let me adopt secret scanning without getting blocked by my own history on every commit, and the remediation plan for the leaks themselves.

What Gitleaks Found

gitleaks detect --no-banner -v

Twelve real findings, plus one already-hashed password (lower severity but still shouldn't be hand-committed) and one false positive in ROADMAP.md (documentation text that happened to match a generic API key pattern).

The real findings, by severity:

File	Secret	Why It Matters
`kubernetes/apps/authelia/configmap.yml`	OIDC issuer private key	Signs SSO tokens for ArgoCD, Vault, Grafana — highest blast radius
`kubernetes/apps/garage/config.yml`	RPC secret + admin token	Storage backend for Velero/Loki/CNPG backups
`kubernetes/apps/garage/secrets.yml`	Admin token (duplicate)	Same secret committed twice in two files
`terraform/stacks/network/local_backend.hcl`	Garage S3 access key	This is the Terraform state backend's own credential
`kubernetes/system/postgres/cnpg-backup-secret.yml`	Garage S3 secret key	Used for WAL archiving
`kubernetes/apps/paperless/secrets.yml`	Postgres password + AI API token
`kubernetes/apps/cloudflared/secrets.yml`	Cloudflare Tunnel token
`kubernetes/apps/headscale/config.yml`	OIDC client secret	Must match Authelia's client config
`kubernetes/system/monitoring/loki.yml`	Minio/S3 password
`kubernetes/apps/mikrodash/secrets.yml`	Dashboard password	Lowest priority — internal tool only

None of these were exposed by a public repo (this one is private), but "private repo" is not a security control — it's a single permission setting away from being public, and anyone with read access to the repo (or its history, forever) has all of this regardless.

Why a Private Repo Doesn't Make This Fine

The honest reason these accumulated: early in the project, before Vault and ExternalSecrets were set up, every new service got a quick secrets.yml with the actual values inline, "just to get it working." Once Vault was running, new services went through it — but nobody went back and migrated the old ones. Each individually felt low-risk at the time. Twelve of them, four months later, is a real exposure if the repo's access list ever changes.

This is the same drift pattern as the Terraform-vs-RouterOS-firewall divergence I wrote about separately: each shortcut is locally reasonable, the accumulated state is not.

Setting Up Gitleaks Without Getting Blocked by History

The naive approach, wiring a full gitleaks detect scan into the hooks and CI and calling it done, fails immediately. Every run gets blocked by the 12 pre-existing leaks, because detect scans the entire history, not just your diff. You'd have to fix all 12 before anything else could pass, including the commit that adds the scanning.

The fix is a baseline file:

gitleaks detect --baseline-path .gitleaks-baseline.json --no-banner -v

A baseline is a snapshot of currently-known findings. Anything in the baseline is allowed to keep existing; anything new fails the hook. Generate it once:

gitleaks detect --report-format json --report-path .gitleaks-baseline.json

Commit that baseline file. From this point forward, gitleaks only blocks genuinely new secrets — exactly what you want when adopting scanning on a repo with history older than the scanning itself.
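
Conceptually, the baseline is a set-difference filter. The sketch below is a simplification and not gitleaks' actual comparison logic: it assumes findings can be matched on the Fingerprint field of the JSON report, and the function name and sample fingerprints are hypothetical.

```python
def new_findings(findings: list, baseline: list) -> list:
    """Return findings whose Fingerprint is absent from the baseline."""
    known = {entry["Fingerprint"] for entry in baseline}
    return [f for f in findings if f["Fingerprint"] not in known]


if __name__ == "__main__":
    baseline = [
        {"Fingerprint": "abc123:kubernetes/apps/garage/config.yml:generic-api-key:12"},
    ]
    scan = baseline + [
        {"Fingerprint": "def456:kubernetes/apps/newapp/secrets.yml:generic-api-key:4"},
    ]
    for finding in new_findings(scan, baseline):
        print("NEW:", finding["Fingerprint"])
```

Note what the filter does not know: whether a baselined credential is still valid. An entry is suppressed because it is listed, not because it is inert.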

The Three-Layer Hook Setup

One scan layer is not enough — a single missed git commit --no-verify or a commit made from a machine without the hooks installed slips through. Three layers, increasing scope, decreasing frequency:

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: gitleaks-staged
        name: Gitleaks (staged changes)
        description: >-
          Blocks committing NEW secrets. Uses .gitleaks-baseline.json so the
          12 pre-existing leaks don't block every commit until they're fully
          remediated; only genuinely new secrets fail this.
        entry: bash -c 'gitleaks protect --staged --baseline-path .gitleaks-baseline.json --no-banner -v'
        language: system
        always_run: true
        pass_filenames: false
        stages: [pre-commit]

      - id: gitleaks-full-repo
        name: Gitleaks (full history, pre-push only)
        description: Re-scans the entire repo and history before any push, against the same baseline.
        entry: bash -c 'gitleaks detect --baseline-path .gitleaks-baseline.json --no-banner -v'
        language: system
        always_run: true
        pass_filenames: false
        stages: [pre-push]

gitleaks protect --staged at commit time — fast, scans only what's staged, catches a secret before it ever enters history.

gitleaks detect at push time — re-scans the entire repo (slower, but only runs once per push, not once per commit). This catches anything that slipped past the first layer, for example a commit made with git commit --no-verify.

CI runs the same gitleaks detect command as a third, environment-independent layer — catches anything pushed from a machine that never had the hooks installed at all.

Allowlisting Real False Positives

The ROADMAP.md false positive needed an explicit allowlist entry, not a baseline bypass — baseline entries are meant for things you intend to fix, allowlist entries are for things that were never secrets in the first place:

# .gitleaks.toml
[extend]
useDefault = true

[allowlist]
description = "Known false positives"
regexes = [
  # ROADMAP.md doc text listing which services use which OIDC client auth
  # method — matches the generic-api-key pattern but is plain documentation,
  # not a secret.
  '''Proxmox/PBS/Grafana/Headscale use `client_secret_basic`''',
]

Be specific with allowlist regexes. A broad pattern here defeats the entire point of scanning — match the exact false-positive string, not a category of strings that happens to include it.
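To make the difference concrete, here is a minimal sketch comparing the exact allowlist pattern from the config above with a hypothetical over-broad one. The broad pattern and both sample lines are invented for illustration. Python's re module stands in for the regex engine gitleaks uses, and the sketch assumes the pattern is checked against the whole line, so it approximates the behavior rather than reproducing it.

```python
import re

# The exact pattern from .gitleaks.toml above.
SPECIFIC = re.compile(r"Proxmox/PBS/Grafana/Headscale use `client_secret_basic`")
# Hypothetical over-broad alternative, shown only as a counterexample.
BROAD = re.compile(r"client_secret")

# Illustrative lines, not taken from the real repo.
DOC_LINE = "Proxmox/PBS/Grafana/Headscale use `client_secret_basic` for OIDC."
REAL_LEAK = 'client_secret: "placeholder-shaped-like-a-credential"'


def suppressed(pattern: re.Pattern, line: str) -> bool:
    """Return True if an allowlist pattern matches this line (and would hide it)."""
    return pattern.search(line) is not None


if __name__ == "__main__":
    for name, pattern in (("specific", SPECIFIC), ("broad", BROAD)):
        print(
            name,
            "| hides doc line:", suppressed(pattern, DOC_LINE),
            "| hides credential line:", suppressed(pattern, REAL_LEAK),
        )
```

The specific pattern hides only the documentation sentence. The broad one also hides the credential-shaped line, which is exactly the failure an allowlist should never cause.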

The Remediation Plan

Finding the leaks and remediating them are two different projects. Remediation means: rotate the actual credential (not just remove it from the file — the old value is still valid until rotated), and move the new value into Vault behind an ExternalSecret so it never gets hand-committed again.

The tricky part is ordering. Some of these credentials are dependencies of each other:

1. Garage RPC secret + admin token + S3 keys
   ↳ Everything else's backups depend on Garage being internally consistent.
     Rotating the S3 key also invalidates Terraform's own state backend
     credential (terraform/stacks/network/local_backend.hcl uses the same
     key) — update both in the same pass or Terraform loses access to its
     own state.

2. Authelia OIDC issuer private key
   ↳ Highest blast radius if left exposed (signs every SSO session).
     After rotating, every service trusting the old key should be checked
     for unexpected active sessions.

3. Everything else, any order
   ↳ Cloudflare Tunnel token (rotate in Cloudflare dashboard first, update
     second — order matters for tokens with an external source of truth).
   ↳ Headscale OIDC client secret must be rotated in lockstep with
     Authelia's matching client config — they're a pair.

A secret with downstream dependents must be rotated with the dependents in mind, not in isolation. Rotating Garage's S3 key without immediately updating the Terraform backend config doesn't remove a vulnerability — it breaks Terraform's access to its own state.
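One way to keep the ordering honest is to write it down as a dependency graph and let a topological sort produce the batches. The sketch below is an illustration only: the step labels are shorthand for the plan above, and the edges encode that plan's order as read here, not a verified model of the real services. It relies on graphlib from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Short labels for the steps in the plan above (illustrative, not real resource names).
GARAGE = "garage-secrets+terraform-backend"  # S3 key and Terraform backend change together
AUTHELIA = "authelia-oidc-issuer-key"
CLOUDFLARE = "cloudflare-tunnel-token"
HEADSCALE = "headscale-oidc-secret+authelia-client"  # rotated as a pair

# step -> steps that must be finished first, following the order of the plan.
PREREQUISITES = {
    GARAGE: set(),
    AUTHELIA: {GARAGE},
    CLOUDFLARE: {AUTHELIA},
    HEADSCALE: {AUTHELIA},
}


def rotation_batches(prereqs: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches; steps within one batch can run in any order."""
    sorter = TopologicalSorter(prereqs)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(rotation_batches(PREREQUISITES), start=1):
        print(number, ", ".join(batch))
```

Credentials that must change in lockstep are modeled as a single step on purpose, so the graph cannot schedule one half without the other.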

Confirming Remediation Actually Worked

After moving a secret to Vault and rotating the credential, re-run the same scan:

gitleaks detect --baseline-path .gitleaks-baseline.json --no-banner -v

Because this command still passes --baseline-path, the old finding is suppressed rather than reported: the value is still in history, and the baseline still lists it. That's expected; the baseline isn't meant to disappear until every listed item has actually been fixed. A clean run at this stage confirms only that remediation introduced no new secret. Only regenerate the baseline once all 12 are addressed, as a final confirmation step that nothing was missed in the process, not as a way to make individual items "go away" faster.
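Since the baseline is the list of work still owed, it can double as a remediation checklist. The sketch below assumes the baseline is a gitleaks JSON report, meaning an array of finding objects that each carry a Fingerprint field. The set of remediated fingerprints is a hypothetical record you would maintain by hand, and the fingerprint values in the usage note are made up.

```python
import json
from pathlib import Path


def fingerprints(findings: list[dict]) -> set[str]:
    """Collect the Fingerprint of every finding in a parsed gitleaks report."""
    return {finding["Fingerprint"] for finding in findings}


def outstanding(baseline: set[str], remediated: set[str]) -> set[str]:
    """Baseline findings not yet marked as rotated and moved out of the repo."""
    return baseline - remediated


def safe_to_regenerate(baseline: set[str], remediated: set[str]) -> bool:
    """True only when every baseline finding has been remediated."""
    return not outstanding(baseline, remediated)


if __name__ == "__main__":
    path = Path(".gitleaks-baseline.json")
    if path.exists():
        baseline = fingerprints(json.loads(path.read_text(encoding="utf-8")))
        remediated: set[str] = set()  # fill in by hand as credentials are rotated
        print(len(outstanding(baseline, remediated)), "findings still outstanding")
```

With a baseline of two findings and one of them marked remediated, safe_to_regenerate returns False and outstanding returns the remaining fingerprint, which keeps the "regenerate only at the end" rule mechanical instead of relying on memory.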

What This Doesn't Fix

Scanning catches secrets in files. It does not:

Scrub git history. The old values remain readable to anyone with repo access, forever, unless you rewrite history (git filter-repo) — which has its own risks if anyone else has a clone.
Replace rotation. A secret found and removed from the current file tree is still valid until you change the actual credential at its source (Cloudflare dashboard, Garage admin CLI, Postgres ALTER USER, etc.).
Catch secrets gitleaks' default ruleset doesn't recognize. Custom internal token formats need custom regex rules — useDefault = true covers known formats (AWS keys, generic API key patterns, JWTs) but not everything.

The same baseline-adoption pattern applies directly to any enterprise repo with years of history and no prior secret scanning — which describes most codebases that predate a security initiative. The Vault + ExternalSecrets target architecture this remediation moves toward is the same pattern covered in External Secrets Operator + HashiCorp Vault — that's where these 12 secrets are headed.

ArgoCD Gotchas: Cache Staleness and the SharedResourceWarning Nobody Explains

david — Sun, 21 Jun 2026 15:14:06 +0000

kubectl apply reports success. You check the resource — the field you just changed is back to its old value. No error. No event. kubectl get shows the change applied, then a few seconds later shows it gone, like it never happened.

This isn't a typo or a YAML indentation bug. It's ArgoCD's selfHeal doing exactly what it's designed to do — re-applying from its own cached understanding of what the resource should be, which can lag behind a change you just made by hand, or even behind a fresh git push.

This hit the same homelab three times in one day, across three unrelated resources. Here's the pattern, the fix, and a second, related gotcha that produces a different symptom from a similar root cause.

View the complete homelab infrastructure source on GitHub 🐙

The Symptom

Three separate incidents, same shape:

A Tempo PersistentVolumeClaim's storageClassName kept reverting after being changed.
Traefik's tlsStore and dashboard configuration reverted after a Helm values update.
A paperless-gpt deployment's volumeMounts reverted after a direct edit.

Each time, the sequence was: edit the live resource or push a change to Git → confirm the change is live → come back later → the old value is back, with no error logged anywhere obvious.

Why This Happens: `selfHeal` Plus a Stale Cache

ArgoCD's selfHeal: true continuously reconciles the live cluster state against ArgoCD's rendered understanding of what the Application's manifests/Helm chart should produce. That's the entire point of GitOps — drift gets corrected automatically, so a manual kubectl edit doesn't silently become the new permanent state.

The bug isn't that selfHeal exists. It's that the rendered understanding ArgoCD reconciles against comes from the argocd-repo-server's manifest/Helm chart cache, and that cache doesn't always get invalidated promptly after a fresh git push or a fresh kubectl apply made outside ArgoCD. For a window of time — usually short, but long enough to be confusing — ArgoCD's source of truth for "what should this look like" is stale, and selfHeal faithfully reverts your change back to match it.

This is functionally indistinguishable, from the outside, from "ArgoCD is ignoring my change" — but the actual mechanism is "ArgoCD is enforcing an outdated cached version of what it thinks I want."

The Fix: Force a Hard Refresh

kubectl patch application <name> -n argocd --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

The hard refresh value (as opposed to normal) tells ArgoCD to bypass the repo-server's manifest cache entirely and re-render from source. Wait roughly 15 seconds, then re-check.

If that alone doesn't resolve it, the repo-server itself may need restarting, rather than just having its cache invalidated for one Application:

kubectl rollout restart deployment argocd-repo-server -n argocd

This is a bigger hammer — it affects every Application's next reconciliation, not just the one you're debugging — so try the targeted hard refresh annotation first.

The StatefulSet Exception

For the Tempo PVC specifically, neither of the above fully resolved it on the first try, because volumeClaimTemplates on a StatefulSet are immutable — Kubernetes rejects any attempt to change them on an existing object. Clearing ArgoCD's stale cache fixes ArgoCD's intent going forward, but it can't retroactively fix a field that was never mutable on the live object in the first place.

The fix there is to delete and recreate the StatefulSet itself (the underlying PVC and its data survive deleting the StatefulSet, as long as you don't also delete the PVC):

kubectl delete statefulset <name> -n <namespace> --cascade=orphan
# re-sync from ArgoCD to recreate the StatefulSet with the new template

--cascade=orphan deletes the StatefulSet object without deleting the Pods or PVCs it owns — letting ArgoCD's next sync recreate the StatefulSet (now with the corrected, non-stale template) and re-adopt the existing PVC.

A Second, Different-Looking Bug With a Related Cause: SharedResourceWarning

A related but distinct symptom: a resource flickers between two different specs, or gets pruned entirely, and .status.conditions on one of the Applications shows a SharedResourceWarning.

This isn't a cache problem; it's an ownership conflict. Two different ArgoCD Applications are both trying to manage a resource with the same name and namespace. In this case, the Helm chart's own ingressRoute.dashboard.enabled flag was creating a Traefik dashboard IngressRoute, while a separate, manually defined IngressRoute with the same name existed in a different Application's manifest set. Both claimed ownership of the same object.

ArgoCD has no way to know which one is "correct" — it just observes that the live object doesn't match what either Application individually expects, and flags the conflict rather than guessing.

The fix is to pick exactly one owner and have the other stop claiming the resource:

# kubernetes/system/traefik/application.yml — Helm chart's own dashboard route, disabled
helm:
  values: |
    ingressRoute:
      dashboard:
        enabled: false  # the manual, Authelia-protected route below is canonical

# kubernetes/system/other-ingressroute.yml — the manually-defined route, kept
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: traefik-dashboard
  namespace: traefik
spec:
  # ... Authelia-protected route — this is the one that stays

Once only one Application's manifest set defines the object, delete the live object and let the remaining owner's next sync recreate it cleanly; the warning then clears.
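This kind of conflict can also be caught before it reaches the cluster by comparing what each Application renders. The sketch below is a simplified illustration: it takes already-rendered manifests as plain dictionaries, identifies a resource by kind, namespace, and name (ignoring the API group), and uses "system-routes" as a hypothetical name for the Application holding the manual route.

```python
from collections import defaultdict

ResourceKey = tuple[str, str, str]  # (kind, namespace, name)


def resource_key(manifest: dict) -> ResourceKey:
    """Identify a resource by kind, namespace, and name (API group ignored)."""
    meta = manifest.get("metadata", {})
    return (manifest.get("kind", ""), meta.get("namespace", ""), meta.get("name", ""))


def shared_resources(apps: dict[str, list[dict]]) -> dict[ResourceKey, list[str]]:
    """Return every resource identity defined by more than one Application."""
    owners: dict[ResourceKey, set[str]] = defaultdict(set)
    for app_name, manifests in apps.items():
        for manifest in manifests:
            owners[resource_key(manifest)].add(app_name)
    return {key: sorted(names) for key, names in owners.items() if len(names) > 1}


# Illustrative input mirroring the Traefik dashboard case described above.
APPS = {
    "traefik": [  # what the chart renders while its dashboard route is enabled
        {"kind": "IngressRoute", "metadata": {"name": "traefik-dashboard", "namespace": "traefik"}},
    ],
    "system-routes": [  # hypothetical Application holding the manual route
        {"kind": "IngressRoute", "metadata": {"name": "traefik-dashboard", "namespace": "traefik"}},
        {"kind": "Middleware", "metadata": {"name": "authelia", "namespace": "traefik"}},
    ],
}

if __name__ == "__main__":
    for key, names in shared_resources(APPS).items():
        print(key, "is claimed by", names)
```

An empty result means each resource identity has exactly one owner, which is the state the fix above is aiming for.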

Telling the Two Apart

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| A field reverts within seconds of a manual or git-pushed change; no error anywhere | Repo-server cache staleness | `hard` refresh annotation; restart `argocd-repo-server` if that's not enough |
| A field reverts but `volumeClaimTemplates` is involved on a StatefulSet | Cache staleness plus an immutable field that can't be patched in place | Same cache fix, plus delete-and-recreate the StatefulSet with `--cascade=orphan` |
| A resource flickers between two different specs, or gets pruned; `SharedResourceWarning` in `.status.conditions` | Two Applications both claim ownership of the same resource | Disable one owner's claim (Helm flag or manifest removal), keep the other |

The diagnostic tell: cache staleness is temporal. The same Application reverts a change made moments ago, and a refresh fixes it. Ownership conflict is structural, so check .status.conditions for SharedResourceWarning first. If it's there, refreshing the cache won't help, because there's nothing stale about either Application's understanding: both are correctly rendering their own manifests, and the manifests themselves conflict.
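That first check is easy to script. The sketch below assumes the Application has been exported as JSON (for example with kubectl get application <name> -n argocd -o json) and parsed into a dictionary. The function names are illustrative, and the sample message text in the usage note is made up.

```python
def shared_resource_warnings(application: dict) -> list[str]:
    """Return the messages of any SharedResourceWarning conditions on an Application."""
    conditions = application.get("status", {}).get("conditions") or []
    return [
        condition.get("message", "")
        for condition in conditions
        if condition.get("type") == "SharedResourceWarning"
    ]


def likely_cause(application: dict) -> str:
    """Rough triage: ownership conflict if the warning is present, else suspect the cache."""
    if shared_resource_warnings(application):
        return "ownership conflict"
    return "possible stale repo-server cache"


if __name__ == "__main__":
    import json
    import sys

    # Usage: kubectl get application <name> -n argocd -o json | python triage.py
    print(likely_cause(json.load(sys.stdin)))
```

For an Application whose status.conditions contains an entry of type SharedResourceWarning, likely_cause returns "ownership conflict"; for one with no conditions at all it falls back to the cache hypothesis, which is where the hard refresh belongs.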

The cache-staleness pattern is specific to ArgoCD's repo-server architecture, but the ownership-conflict pattern is universal to any GitOps tool managing Kubernetes resources — Flux has the same failure mode if two Kustomizations or HelmReleases both define a resource with the same identity. Checking .status.conditions before assuming a sync or cache problem saves a lot of time chasing the wrong fix.
