K8s Necromancer — a Black Box Flight Recorder for dead Kubernetes pods.

#kubernetes #go #docker #github

Every K8s operator knows the feeling: a pod dies, kubelet garbage-collects it in 30 seconds, and you're left staring at a CrashLoopBackOff with no context. The logs have rotated. The events are gone. You're switching between kubectl describe, Loki, Grafana, and deploy histories, trying to reconstruct what happened.

I got tired of this, so I built K8s Necromancer — a controller that intercepts pod deaths before GC and freezes the entire forensic state.

What it does

When a pod crashes, it captures:

Container logs (previous container + fallback)
K8s events timeline
Resolved ENV vars (reads ConfigMaps and Secrets via K8s API)
Full pod spec snapshot
ConfigMap volume snapshots
CPU/Memory sparklines from Prometheus (optional)

All of this goes into a Tomb custom resource (lightweight metadata in etcd) plus a PersistentVolume (heavy data). SHA-256 dedup prevents duplicate tombs.
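To make the dedup idea concrete, here is a minimal sketch of how a SHA-256 fingerprint could keep the same death from producing two tombs. The DeathRecord fields, the choice of which fields identify a death, and the in-memory Deduper are all assumptions for illustration; the actual controller may hash different inputs and track them elsewhere.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// DeathRecord is a hypothetical summary of one pod death. The field set is an
// assumption for illustration, not the project's actual schema.
type DeathRecord struct {
	Namespace    string
	PodUID       string
	Container    string
	RestartCount int32
	ExitCode     int32
}

// Fingerprint hashes the identifying fields into a stable hex digest, so two
// observations of the same death yield the same value.
func Fingerprint(r DeathRecord) string {
	// A NUL separator keeps ("a", "bc") and ("ab", "c") from hashing alike.
	parts := []string{
		r.Namespace, r.PodUID, r.Container,
		fmt.Sprint(r.RestartCount), fmt.Sprint(r.ExitCode),
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])
}

// Deduper remembers fingerprints it has already seen (not goroutine-safe).
type Deduper struct{ seen map[string]struct{} }

func NewDeduper() *Deduper { return &Deduper{seen: make(map[string]struct{})} }

// ShouldCreate reports true the first time a death is seen and false after.
func (d *Deduper) ShouldCreate(r DeathRecord) bool {
	fp := Fingerprint(r)
	if _, dup := d.seen[fp]; dup {
		return false
	}
	d.seen[fp] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	crash := DeathRecord{Namespace: "shop", PodUID: "uid-123", Container: "api", RestartCount: 3, ExitCode: 137}
	fmt.Println(d.ShouldCreate(crash)) // first observation of this death
	fmt.Println(d.ShouldCreate(crash)) // same death observed again
	crash.RestartCount = 4
	fmt.Println(d.ShouldCreate(crash)) // a later restart counts as a new death
}
```

Including the restart count in the hash is a design choice in this sketch: it lets repeated crashes of the same pod each get their own tomb while repeated watch events for one crash collapse into one.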

6 capture triggers

Restart count increase
Pod phase = Failed
Image pull error
Non-zero exit code
First restart detected
Pending timeout (>10m, configurable)

The controller skips kube-system, the necromancer namespaces, and clean Job exits (exit code 0).
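The trigger and skip rules above can be sketched as a single decision function. This is an illustration under stated assumptions, not the controller's code: PodObservation is a simplified stand-in for the real pod status, the namespace name "necromancer" is assumed, the evaluation order is arbitrary, and "first restart" is interpreted here as a pod that already has restarts the first time it is observed.

```go
package main

import (
	"fmt"
	"time"
)

// DefaultPendingTimeout mirrors the 10-minute default mentioned above.
const DefaultPendingTimeout = 10 * time.Minute

// "kube-system" comes from the post; "necromancer" is an assumed name for
// the tool's own namespace.
var skippedNamespaces = map[string]bool{"kube-system": true, "necromancer": true}

// PodObservation is a simplified, hypothetical view of a pod's status.
type PodObservation struct {
	Namespace        string
	Phase            string // "Pending", "Running", "Succeeded", "Failed"
	WaitingReason    string // e.g. "ErrImagePull", "ImagePullBackOff"
	Seen             bool   // whether this pod was observed before
	PrevRestartCount int32
	RestartCount     int32
	Terminated       bool
	ExitCode         int32
	PendingFor       time.Duration
}

// CaptureReason returns the trigger that fired, or false if none did.
func CaptureReason(o PodObservation, pendingTimeout time.Duration) (string, bool) {
	if skippedNamespaces[o.Namespace] {
		return "", false
	}
	if o.Phase == "Succeeded" && o.ExitCode == 0 {
		return "", false // clean Job exit
	}
	switch {
	case o.Phase == "Failed":
		return "pod phase Failed", true
	case o.WaitingReason == "ErrImagePull" || o.WaitingReason == "ImagePullBackOff":
		return "image pull error", true
	case o.Terminated && o.ExitCode != 0:
		return "non-zero exit code", true
	case !o.Seen && o.RestartCount > 0:
		return "first restart detected", true
	case o.Seen && o.RestartCount > o.PrevRestartCount:
		return "restart count increase", true
	case o.Phase == "Pending" && o.PendingFor > pendingTimeout:
		return "pending timeout", true
	}
	return "", false
}

func main() {
	observations := []PodObservation{
		{Namespace: "shop", Phase: "Running", Terminated: true, ExitCode: 137},
		{Namespace: "shop", Phase: "Pending", PendingFor: 12 * time.Minute},
		{Namespace: "kube-system", Phase: "Failed"},
		{Namespace: "batch", Phase: "Succeeded", Terminated: true, ExitCode: 0},
	}
	for _, o := range observations {
		reason, ok := CaptureReason(o, DefaultPendingTimeout)
		fmt.Printf("%-12s capture=%-5t %s\n", o.Namespace, ok, reason)
	}
}
```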

The CLI

necromancer autopsy <id> - generates a coroner's report with a forensic timeline, resource sparklines, and an ENV DIFF (a comparison against the last healthy pod). Outputs to the terminal or to Markdown.
necromancer resurrect <id> - spins up a ghost pod in a sandboxed namespace so you can inspect the dead pod's filesystem and config. The goal is to investigate, not to run the app.
necromancer list / inspect / bury - browse, query, and clean up old tombs, with dry-run support.
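As an illustration of what the ENV DIFF in the autopsy report involves, here is a minimal sketch that compares the resolved environment of a dead pod against that of the last healthy one. The EnvChange type and DiffEnv function are hypothetical names, and the sample variables are made up; the real report format may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// EnvChange describes one difference between two environments.
type EnvChange struct {
	Key     string
	Kind    string // "added", "removed", or "changed"
	Healthy string // value in the last healthy pod, if any
	Dead    string // value in the dead pod, if any
}

// DiffEnv returns the differences between the two maps, sorted by key so the
// output is stable across runs.
func DiffEnv(healthy, dead map[string]string) []EnvChange {
	var changes []EnvChange
	for k, dv := range dead {
		hv, ok := healthy[k]
		switch {
		case !ok:
			changes = append(changes, EnvChange{Key: k, Kind: "added", Dead: dv})
		case hv != dv:
			changes = append(changes, EnvChange{Key: k, Kind: "changed", Healthy: hv, Dead: dv})
		}
	}
	for k, hv := range healthy {
		if _, ok := dead[k]; !ok {
			changes = append(changes, EnvChange{Key: k, Kind: "removed", Healthy: hv})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Key < changes[j].Key })
	return changes
}

func main() {
	healthy := map[string]string{"DB_HOST": "db.prod", "LOG_LEVEL": "info", "CACHE_URL": "redis:6379"}
	dead := map[string]string{"DB_HOST": "db.staging", "LOG_LEVEL": "info", "FEATURE_X": "on"}
	for _, c := range DiffEnv(healthy, dead) {
		fmt.Printf("%-8s %s: %q -> %q\n", c.Kind, c.Key, c.Healthy, c.Dead)
	}
}
```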

Safety (this was important to me)

Ghost pods run in a locked-down namespace:

deny-all-egress NetworkPolicy (DNS allowed only)
Secrets stripped by default (opt-in with --include-secret-volumes)
No SA tokens, no host namespaces, probes removed
Entrypoint overridden to sleep infinity
LimitRange (500m/512Mi) + ResourceQuota (10 pods) enforced
Controller runs as non-root (UID 1000, drops ALL capabilities)
SSRF protection on Prometheus URL (blocks loopback, cloud metadata)
Path traversal prevention on all API inputs
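For the SSRF item in the list above, a check along these lines is one way to reject a Prometheus URL that points at loopback or at the cloud metadata endpoint. It is a sketch under assumptions: ValidatePrometheusURL and StaticResolver are hypothetical names, and the resolver is passed in so the example runs offline (net.LookupIP has the matching signature for real use). Private ranges are deliberately not blocked here, since an in-cluster Prometheus normally lives on one.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// ValidatePrometheusURL rejects URLs whose host is, or resolves to, a
// loopback, link-local, or unspecified address. Link-local covers
// 169.254.169.254, the usual cloud metadata endpoint.
func ValidatePrometheusURL(raw string, lookup func(string) ([]net.IP, error)) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse url: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return errors.New("scheme must be http or https")
	}
	host := u.Hostname()
	if host == "" {
		return errors.New("missing host")
	}
	// Use the literal IP if the host is one; otherwise resolve the name.
	ips := []net.IP{net.ParseIP(host)}
	if ips[0] == nil {
		ips, err = lookup(host)
		if err != nil {
			return fmt.Errorf("resolve %s: %w", host, err)
		}
	}
	if len(ips) == 0 {
		return fmt.Errorf("no addresses for %s", host)
	}
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
			return fmt.Errorf("address %s is not allowed", ip)
		}
	}
	return nil
}

// StaticResolver builds a fake resolver from a host-to-IP table, for demos
// and tests that must not touch the network.
func StaticResolver(table map[string]string) func(string) ([]net.IP, error) {
	return func(host string) ([]net.IP, error) {
		ip := net.ParseIP(table[host])
		if ip == nil {
			return nil, fmt.Errorf("no such host: %s", host)
		}
		return []net.IP{ip}, nil
	}
}

func main() {
	resolve := StaticResolver(map[string]string{
		"prometheus.monitoring.svc": "10.96.0.15",
		"sneaky.example":            "169.254.169.254",
	})
	for _, raw := range []string{
		"http://prometheus.monitoring.svc:9090",
		"http://127.0.0.1:9090",
		"http://sneaky.example/latest/meta-data",
		"file:///etc/passwd",
	} {
		fmt.Printf("%-40s -> %v\n", raw, ValidatePrometheusURL(raw, resolve))
	}
}
```

A check done once at validation time does not protect against a hostname that later resolves to a different address, so a stricter variant would repeat the check when the connection is dialed.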

HA

The default deployment runs 2 replicas with leader election enabled and a ReadWriteMany PVC (EFS, Filestore, CephFS, etc.). The dev overlay for Kind/Minikube runs 1 replica with a ReadWriteOnce PVC.