Every K8s operator knows the feeling, a pod dies, kubelet garbage-collects it in 30 seconds, and you're left staring at a CrashLoopBackOff with no context. The logs have rotated. The events are gone. You're switching between kubectl describe, Loki, Grafana, and deploy histories trying to reconstruct what happened.
I got tired of this, so I built K8s Necromancer — a controller that intercepts pod deaths before GC and freezes the entire forensic state.
- What it does
When a pod crashes, it captures:
- Container logs (previous container + fallback)
- K8s events timeline
- Resolved ENV vars (reads ConfigMaps and Secrets via K8s API)
- Full pod spec snapshot
- ConfigMap volume snapshots
- CPU/Memory sparklines from Prometheus (optional)
All of this goes into a Tomb CRD (lightweight metadata in etcd) + PersistentVolume (heavy data). SHA-256 dedup prevents duplicate tombs.
- 6 capture triggers
- Restart count increase
- Pod phase = Failed
- Image pull error
- Non-zero exit code
- First restart detected
- Pending timeout (>10m, configurable)
Skips kube-system, necromancer namespaces, and clean job exits (exit=0).
The CLI
necromancer autopsy <id>- generates a coroner's report with forensic timeline, resource sparklines, and ENV DIFF (compares against last healthy pod). Outputs to terminal or Markdown.necromancer resurrect <id>- spins up a ghost pod in a sandboxed namespace so you can inspect the dead pod's filesystem and config. Not to run the app but to investigate.necromancer list / inspect / bury- browse, query, and clean up old tombs with dry-run support.Safety (this was important to me)
Ghost pods run in a locked-down namespace:
- deny-all-egress NetworkPolicy (DNS allowed only)
- Secrets stripped by default (opt-in with --include-secret-volumes)
- No SA tokens, no host namespaces, probes removed
- Entrypoint overridden to sleep infinity
- LimitRange (500m/512Mi) + ResourceQuota (10 pods) enforced
- Controller runs as non-root (UID 1000, drops ALL capabilities)
- SSRF protection on Prometheus URL (blocks loopback, cloud metadata)
Path traversal prevention on all API inputs
HA
Default deployment runs 2 replicas with leader election enabled + ReadWriteMany PVC (EFS, Filestore, CephFS, etc). Dev overlay for Kind/Minikube runs 1 replica with ReadWriteOnce.
- Stack
Go 1.26, controller-runtime, Cobra CLI, Prometheus integration, Kustomize overlays, Kind-based e2e tests.
Repo: https://github.com/privjoesrepos/k8s-necromancer
Docker: https://hub.docker.com/r/privjoesrepos/k8s-necromancer
MIT licensed.
Thrilled to answer questions or take feedback.
Top comments (0)