Aashish Bajpai

Originally published at perfsage.com

5-Minute Post-Deploy Postmortem with SignalPilot

Field Notes #5 · TL;DR: SignalPilot v1.0 is live. Install with pip install perfsage-signalpilot, apply read-only RBAC, and run signalpilot analyze to get a ranked HTML report with cited evidence and copy-paste kubectl fixes in under five minutes. Not another dashboard: analysis you can act on. Landing page · Sample report.


The MTTR gap nobody talks about

Deploy reviews often fail on one question:

"Why did errors spike after my last deployment?"

Not "what's the error rate?" — you can see that in Grafana. The hard part is defensible correlation: linking OOMKilled on pod api-7f3c to a memory limit change in the deploy diff, a new log fingerprint, and optionally the git commit that touched the heap allocator.

That correlation used to cost me 2–3 hours of tab-switching. SignalPilot targets under five minutes for typical post-deploy regressions.

| Stage | Manual war room | SignalPilot |
| --- | --- | --- |
| T+0 | Deploy completes | Deploy completes |
| T+5 min | Someone opens kubectl | signalpilot analyze starts collectors |
| T+20 min | Grafana dashboard shared | Deploy diff + events + metrics fused |
| T+60 min | "Maybe it's memory?" | Ranked finding: oom_killed with evidence |
| T+120 min | Still debating rollback | Copy-paste kubectl fix on screen |
| T+180 min | Postmortem doc started | HTML report exported; gate ready for CI |

Install (v1.0.0)

pip install perfsage-signalpilot

kubectl apply -f https://raw.githubusercontent.com/perfsage/signalpilot/v1.0.0/deploy/signalpilot-rbac.yaml

signalpilot analyze my-namespace --deployment my-app --output report.html

Preview output without a cluster: sample HTML report on GitHub.
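If you would rather trigger the analysis from a script than from a shell, the invocation above can be wrapped in a few lines of Python. This is a minimal sketch that assumes only the CLI arguments shown in this post; the helper names `build_analyze_command` and `run_analyze` are hypothetical and not part of SignalPilot.

```python
import shutil
import subprocess
from typing import List


def build_analyze_command(namespace: str, deployment: str, output: str) -> List[str]:
    """Build the argument list for the `signalpilot analyze` call shown above."""
    return [
        "signalpilot", "analyze", namespace,
        "--deployment", deployment,
        "--output", output,
    ]


def run_analyze(namespace: str, deployment: str, output: str = "report.html") -> int:
    """Run the analysis and return the process exit code.

    Returns 127 when the CLI is not on PATH. That value is a shell
    convention chosen for this sketch, not documented SignalPilot behaviour.
    """
    if shutil.which("signalpilot") is None:
        return 127
    cmd = build_analyze_command(namespace, deployment, output)
    # A list (not a shell string) avoids quoting issues in namespace names.
    return subprocess.run(cmd, check=False).returncode
```

Calling `run_analyze("my-namespace", "my-app")` should behave like the shell command above, provided the package is installed and the RBAC manifest has been applied.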


Walkthrough: oom_killed after deploy

Symptom: Error rate jumps after a deploy. Pods restarting.

What SignalPilot correlates:

| Signal source | Evidence |
| --- | --- |
| K8s API | Container app OOMKilled, 4 restarts in 10 min |
| metrics-server | Memory working-set at 96% of limit |
| Deploy diff | resources.limits.memory changed 512Mi → 256Mi |
| Logs | New fingerprint: java.lang.OutOfMemoryError: Java heap space |

Rule fired: oom_killed — confidence ranked HIGH.

Recommended fix (copy-paste from report):

kubectl set resources deployment/my-app -n my-namespace \
  --limits=memory=512Mi --requests=memory=256Mi

Each finding cites multiple signal types — not a single chart anomaly. That's the difference from staring at one Grafana panel.
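To make "multiple signal types" concrete, here is a rough sketch of how the four signals from the walkthrough could be combined into one finding. It is not SignalPilot's implementation: the function names, the 90% usage threshold, and the "three or more sources means HIGH" rule are all assumptions made for illustration, and the quantity parser handles only a subset of Kubernetes memory suffixes.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Subset of the binary and decimal suffixes Kubernetes accepts for memory.
_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' to bytes."""
    # Check two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # plain byte count


@dataclass
class Evidence:
    source: str
    detail: str


def collect_oom_evidence(
    termination_reason: str,
    working_set_bytes: int,
    old_limit: str,
    new_limit: str,
    new_log_fingerprints: Sequence[str],
) -> List[Evidence]:
    """Gather one piece of evidence per signal source that points at an OOM."""
    evidence: List[Evidence] = []
    if termination_reason == "OOMKilled":
        evidence.append(Evidence("K8s API", "container terminated with reason OOMKilled"))
    usage = working_set_bytes / parse_memory(new_limit)
    if usage >= 0.9:  # hypothetical threshold
        evidence.append(Evidence("metrics-server", f"working set at {usage:.0%} of limit"))
    if parse_memory(new_limit) < parse_memory(old_limit):
        evidence.append(Evidence("Deploy diff", f"memory limit lowered {old_limit} -> {new_limit}"))
    oom_logs = [f for f in new_log_fingerprints if "OutOfMemoryError" in f]
    if oom_logs:
        evidence.append(Evidence("Logs", oom_logs[0]))
    return evidence


def confidence(evidence: Sequence[Evidence]) -> str:
    """Rank by number of distinct corroborating sources (assumed scheme)."""
    sources = len({e.source for e in evidence})
    if sources >= 3:
        return "HIGH"
    return "MEDIUM" if sources == 2 else "LOW"
```

Fed the walkthrough's values, all four sources contribute evidence. A single chart anomaly would contribute at most one.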


CI gate: catch regressions before traffic fully shifts

Complement load-test SLO gates from SLO Reporter with a post-deploy sanity check:

signalpilot gate my-namespace --deployment my-app --junit-xml results.xml

GitHub Actions example:

- name: Post-deploy RCA gate
  run: |
    pip install perfsage-signalpilot
    signalpilot gate production-namespace \
      --deployment api \
      --junit-xml signalpilot-results.xml

Exits non-zero on HIGH+ findings — same severity model as your SLO gates, different signal layer.
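The "HIGH+" behaviour amounts to a threshold over an ordered severity scale. The sketch below illustrates that idea only. This post names just the HIGH level, so the full scale (LOW, MEDIUM, HIGH, CRITICAL) and the function `gate_exit_code` are assumptions, not SignalPilot's actual model or API.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    # Assumed ordering; only HIGH is named in the text.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def gate_exit_code(severities: Iterable[str], threshold: Severity = Severity.HIGH) -> int:
    """Return 1 if any finding is at or above the threshold, else 0."""
    worst = max((Severity[s.upper()] for s in severities), default=None)
    if worst is not None and worst >= threshold:
        return 1
    return 0
```

A CI step fails on any non-zero exit code, so returning 1 here is enough to block the pipeline. An empty findings list passes the gate.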


Deterministic rules first, optional LLM polish

I'm not building "AI that fixes prod." SignalPilot's core RCA runs deterministic rules: oom_killed, cpu_throttled, crash_loop, image_pull_error, probe_failure, code_regression, and more. Optional LLM narrative polish is there if you want it; no API key is required for ranked findings and kubectl recommendations.
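"Deterministic" here means the same cluster snapshot always yields the same findings, with no model call in the loop. A minimal sketch of that shape follows. The rule names come from the list above, but the snapshot keys (`last_termination_reason`, `waiting_reason`) and the registry layout are hypothetical, not SignalPilot internals.

```python
from typing import Callable, Dict, List, Optional, Tuple

Snapshot = Dict[str, str]
Rule = Callable[[Snapshot], Optional[str]]


def oom_killed(snapshot: Snapshot) -> Optional[str]:
    if snapshot.get("last_termination_reason") == "OOMKilled":
        return "container terminated with reason OOMKilled"
    return None


def crash_loop(snapshot: Snapshot) -> Optional[str]:
    if snapshot.get("waiting_reason") == "CrashLoopBackOff":
        return "container waiting with reason CrashLoopBackOff"
    return None


def image_pull_error(snapshot: Snapshot) -> Optional[str]:
    if snapshot.get("waiting_reason") in ("ErrImagePull", "ImagePullBackOff"):
        return "image could not be pulled"
    return None


# Fixed evaluation order keeps the output reproducible run to run.
RULES: List[Tuple[str, Rule]] = [
    ("oom_killed", oom_killed),
    ("crash_loop", crash_loop),
    ("image_pull_error", image_pull_error),
]


def evaluate(snapshot: Snapshot) -> List[Tuple[str, str]]:
    """Run every rule and return (rule_name, evidence) for each one that fires."""
    findings = []
    for name, rule in RULES:
        evidence = rule(snapshot)
        if evidence is not None:
            findings.append((name, evidence))
    return findings
```

Because each rule is a pure function of the snapshot, a finding can be reproduced and audited later, which is harder to guarantee when a language model produces the diagnosis.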


The PerfSage ladder: test → gate → RCA

  1. Reveal — JMeter JTL analysis in the lab
  2. SLO Reporter — CI gates on load tests
  3. SignalPilot — post-deploy RCA in production

Same DNA across all three: each tool reports the data, then explains what to do next.


Try it

War-room stories and feedback welcome on GitHub Issues.

Field Notes #5 · By Aashish Bajpai


Originally published at https://perfsage.com/blog/5-minute-post-deploy-postmortem-signalpilot/
