
JOE

Originally published at github.com

K8s Necromancer — a Black Box Flight Recorder for dead Kubernetes pods.

Every K8s operator knows the feeling: a pod dies, the kubelet garbage-collects it in 30 seconds, and you're left staring at a CrashLoopBackOff with no context. The logs have rotated. The events are gone. You're switching between kubectl describe, Loki, Grafana, and deploy histories, trying to reconstruct what happened.

I got tired of this, so I built K8s Necromancer — a controller that intercepts pod deaths before GC and freezes the entire forensic state.

What it does

When a pod crashes, it captures:

  • Container logs (previous container + fallback)
  • K8s events timeline
  • Resolved ENV vars (reads ConfigMaps and Secrets via K8s API)
  • Full pod spec snapshot
  • ConfigMap volume snapshots
  • CPU/Memory sparklines from Prometheus (optional)

All of this is stored as a Tomb custom resource (lightweight metadata in etcd) plus a PersistentVolume (heavy data). SHA-256 dedup prevents duplicate tombs.
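As a rough illustration of how SHA-256 dedup can work, here is a minimal sketch in Go. It is not the code from the repo: the `DeathEvent` fields, the `DedupKey` function, and the in-memory `TombIndex` are all hypothetical names, and the real controller may hash a different set of fields.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DeathEvent is a hypothetical record of one pod death.
// The real Tomb schema may use different fields.
type DeathEvent struct {
	Namespace    string
	PodUID       string
	Container    string
	RestartCount int32
	ExitCode     int32
}

// DedupKey hashes the fields that identify a single death, so observing
// the same crash twice yields the same key.
func DedupKey(e DeathEvent) string {
	// %q quotes each string so field boundaries stay unambiguous.
	canonical := fmt.Sprintf("%q|%q|%q|%d|%d",
		e.Namespace, e.PodUID, e.Container, e.RestartCount, e.ExitCode)
	sum := sha256.Sum256([]byte(canonical))
	return hex.EncodeToString(sum[:])
}

// TombIndex is an in-memory stand-in for whatever store tracks existing tombs.
type TombIndex map[string]struct{}

// ShouldCreate reports whether the event is new, and records it if so.
func (idx TombIndex) ShouldCreate(e DeathEvent) bool {
	key := DedupKey(e)
	if _, seen := idx[key]; seen {
		return false
	}
	idx[key] = struct{}{}
	return true
}

func main() {
	idx := TombIndex{}
	crash := DeathEvent{
		Namespace: "prod", PodUID: "uid-123", Container: "api",
		RestartCount: 3, ExitCode: 137,
	}
	fmt.Println(idx.ShouldCreate(crash)) // true: first sighting
	fmt.Println(idx.ShouldCreate(crash)) // false: duplicate
}
```

Including the restart count in the hashed fields means each new crash of the same pod gets its own key, while repeated observations of one crash collapse into a single tomb.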

6 capture triggers

  1. Restart count increase
  2. Pod phase = Failed
  3. Image pull error
  4. Non-zero exit code
  5. First restart detected
  6. Pending timeout (>10m, configurable)

Skips kube-system, necromancer namespaces, and clean job exits (exit=0).

The CLI

  • necromancer autopsy <id> - generates a coroner's report with forensic timeline, resource sparklines, and ENV DIFF (compares against last healthy pod). Outputs to terminal or Markdown.

  • necromancer resurrect <id> - spins up a ghost pod in a sandboxed namespace so you can inspect the dead pod's filesystem and config. The goal is to investigate, not to run the app.

  • necromancer list / inspect / bury - browse, query, and clean up old tombs with dry-run support.

Safety (this was important to me)

Ghost pods run in a locked-down namespace:

  • deny-all-egress NetworkPolicy (DNS allowed only)
  • Secrets stripped by default (opt-in with --include-secret-volumes)
  • No SA tokens, no host namespaces, probes removed
  • Entrypoint overridden to sleep infinity
  • LimitRange (500m/512Mi) + ResourceQuota (10 pods) enforced
  • Controller runs as non-root (UID 1000, drops ALL capabilities)
  • SSRF protection on Prometheus URL (blocks loopback, cloud metadata)
  • Path traversal prevention on all API inputs
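To make the last two items concrete, here is a minimal Go sketch of both checks using only the standard library. The function names `ValidatePrometheusURL` and `SafeJoin` and the `/data/tombs` path are hypothetical and not taken from the repo, and the actual checks may be stricter. In particular, this sketch only inspects literal IPs and the name `localhost`; a hostname that resolves to a blocked address would need a check at dial time.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"path/filepath"
	"strings"
)

// ValidatePrometheusURL rejects URLs that point at loopback or link-local
// addresses. Link-local covers 169.254.169.254, the usual cloud metadata IP.
func ValidatePrometheusURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return errors.New("scheme must be http or https")
	}
	host := u.Hostname()
	if host == "" {
		return errors.New("missing host")
	}
	if strings.EqualFold(host, "localhost") {
		return errors.New("loopback host not allowed")
	}
	// Only literal IPs are checked here; DNS names pass through unresolved.
	if ip := net.ParseIP(host); ip != nil {
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
			return fmt.Errorf("address %s not allowed", ip)
		}
	}
	return nil
}

// SafeJoin joins a caller-supplied name onto base and rejects any result
// that escapes base (for example via "../" segments).
func SafeJoin(base, name string) (string, error) {
	full := filepath.Join(base, name) // Join also cleans the path
	rel, err := filepath.Rel(base, full)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errors.New("path escapes base directory")
	}
	return full, nil
}

func main() {
	for _, raw := range []string{
		"http://prometheus.monitoring.svc:9090",
		"http://127.0.0.1:9090",
		"http://169.254.169.254/latest/meta-data",
	} {
		fmt.Println(raw, "->", ValidatePrometheusURL(raw))
	}

	_, err := SafeJoin("/data/tombs", "../../etc/passwd")
	fmt.Println("traversal attempt ->", err)
}
```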

HA

Default deployment runs 2 replicas with leader election enabled + ReadWriteMany PVC (EFS, Filestore, CephFS, etc). Dev overlay for Kind/Minikube runs 1 replica with ReadWriteOnce.

Stack

Go 1.26, controller-runtime, Cobra CLI, Prometheus integration, Kustomize overlays, Kind-based e2e tests.

Repo: https://github.com/privjoesrepos/k8s-necromancer

Docker: https://hub.docker.com/r/privjoesrepos/k8s-necromancer

MIT licensed.

Happy to answer questions or take feedback.
