Field Notes #5 · TL;DR — SignalPilot v1.0 is live. Install with
pip install perfsage-signalpilot, apply read-only RBAC, run signalpilot analyze, and get a ranked HTML report with cited evidence and copy-paste kubectl fixes in under five minutes. Not another dashboard. Analysis you can act on. Landing page · Sample report.
The MTTR gap nobody talks about
Deploy reviews often fail on one question:
"Why did errors spike after my last deployment?"
Not "what's the error rate?" — you can see that in Grafana. The hard part is defensible correlation: linking OOMKilled on pod api-7f3c to a memory limit change in the deploy diff, a new log fingerprint, and optionally the git commit that touched the heap allocator.
That correlation used to cost me 2–3 hours of tab-switching. SignalPilot targets under five minutes for typical post-deploy regressions.
| Stage | Manual war room | SignalPilot |
|---|---|---|
| T+0 | Deploy completes | Deploy completes |
| T+5 min | Someone opens kubectl | signalpilot analyze starts collectors |
| T+20 min | Grafana dashboard shared | Deploy diff + events + metrics fused |
| T+60 min | "Maybe it's memory?" | Ranked finding: oom_killed with evidence |
| T+120 min | Still debating rollback | Copy-paste kubectl fix on screen |
| T+180 min | Postmortem doc started | HTML report exported; gate ready for CI |
Install (v1.0.0)
pip install perfsage-signalpilot
kubectl apply -f https://raw.githubusercontent.com/perfsage/signalpilot/v1.0.0/deploy/signalpilot-rbac.yaml
signalpilot analyze my-namespace --deployment my-app --output report.html
Preview output without a cluster: sample HTML report on GitHub.
Walkthrough: oom_killed after deploy
Symptom: Error rate jumps after a deploy. Pods restarting.
What SignalPilot correlates:
| Signal source | Evidence |
|---|---|
| K8s API | Container app OOMKilled, 4 restarts in 10 min |
| metrics-server | Memory working-set at 96% of limit |
| Deploy diff | resources.limits.memory changed 512Mi → 256Mi |
| Logs | New fingerprint: java.lang.OutOfMemoryError: Java heap space |
Rule fired: oom_killed — confidence ranked HIGH.
Recommended fix (copy-paste from report):
kubectl set resources deployment/my-app -n my-namespace \
--limits=memory=512Mi --requests=memory=256Mi
Each finding cites multiple signal types — not a single chart anomaly. That's the difference from staring at one Grafana panel.
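To make the correlation concrete, here is a minimal Python sketch of what a multi-signal rule could look like. It is not SignalPilot's actual implementation: the Signals fields, the 90% working-set threshold, and the "three or more signal types means HIGH" cutoff are all assumptions chosen to mirror the walkthrough above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Binary-suffix multipliers for Kubernetes memory quantities (subset only).
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as plain bytes


@dataclass
class Signals:
    """Hypothetical evidence bundle; not SignalPilot's real schema."""
    last_termination_reason: str
    restarts_in_window: int
    working_set_ratio: float  # working set divided by memory limit
    old_memory_limit: str
    new_memory_limit: str
    new_log_fingerprints: list = field(default_factory=list)


def evaluate_oom_killed(s: Signals) -> Optional[dict]:
    """Return a finding if the container was OOM-killed, with corroborating evidence."""
    if s.last_termination_reason != "OOMKilled":
        return None
    evidence = [f"k8s: OOMKilled, {s.restarts_in_window} restarts"]
    if s.working_set_ratio >= 0.9:  # assumed threshold
        evidence.append(f"metrics: working set at {s.working_set_ratio:.0%} of limit")
    if parse_memory(s.new_memory_limit) < parse_memory(s.old_memory_limit):
        evidence.append(
            f"deploy diff: memory limit {s.old_memory_limit} -> {s.new_memory_limit}"
        )
    if any("OutOfMemoryError" in fp for fp in s.new_log_fingerprints):
        evidence.append("logs: new OutOfMemoryError fingerprint")
    # Assumed cutoff: three or more independent signal types rank as HIGH.
    confidence = "HIGH" if len(evidence) >= 3 else "MEDIUM"
    return {"rule": "oom_killed", "confidence": confidence, "evidence": evidence}
```

Feeding in the walkthrough's numbers (4 restarts, 96% working set, 512Mi lowered to 256Mi, a new OutOfMemoryError fingerprint) produces a HIGH finding backed by four pieces of evidence. A lone OOMKilled event with no corroboration would only rank MEDIUM under these assumed cutoffs.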
CI gate: catch regressions before traffic fully shifts
Complement load-test SLO gates from SLO Reporter with a post-deploy sanity check:
signalpilot gate my-namespace --deployment my-app --junit-xml results.xml
GitHub Actions example:
- name: Post-deploy RCA gate
  run: |
    pip install perfsage-signalpilot
    signalpilot gate production-namespace \
      --deployment api \
      --junit-xml signalpilot-results.xml
Exits non-zero on HIGH+ findings — same severity model as your SLO gates, different signal layer.
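The exit-code behaviour can be pictured with a small Python sketch. This is an illustration of a severity-threshold gate, not SignalPilot's source. The severity ordering (including a CRITICAL level above HIGH), the finding dictionaries, and the gate_exit_code name are assumptions.

```python
import sys

# Assumed ordering; the post only states that HIGH and above fail the gate.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def gate_exit_code(findings, threshold="HIGH"):
    """Return 1 if any finding is at or above the threshold, else 0."""
    limit = SEVERITY_ORDER.index(threshold)
    blocking = [
        f for f in findings if SEVERITY_ORDER.index(f["severity"]) >= limit
    ]
    for f in blocking:
        # Report each blocking finding on stderr so it shows up in CI logs.
        print(f"BLOCKING {f['severity']}: {f['rule']}", file=sys.stderr)
    return 1 if blocking else 0
```

A CI wrapper would pass the result to sys.exit, so a single HIGH oom_killed finding fails the job while a MEDIUM-only run passes.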
Deterministic rules first, optional LLM polish
I'm not building "AI that fixes prod." SignalPilot's core RCA runs deterministic rules — oom_killed, cpu_throttled, crash_loop, image_pull_error, probe_failure, code_regression, and more. Optional LLM narrative polish is there if you want it; no API key required for ranked findings and kubectl recommendations.
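As a rough picture of what "deterministic rules" means in practice, the sketch below registers rules as plain functions and ranks their findings in a fixed order, so the same input always yields the same output. The registry, the signals dictionary, and its waiting_reason key are hypothetical. Only the rule names and the Kubernetes waiting reasons (CrashLoopBackOff, ErrImagePull, ImagePullBackOff) come from real systems.

```python
# Hypothetical rule registry: rule name -> function(signals) -> finding or None.
RULES = {}


def rule(name):
    """Decorator that registers a rule function under the given name."""
    def register(fn):
        RULES[name] = fn
        return fn
    return register


@rule("crash_loop")
def crash_loop(signals):
    if signals.get("waiting_reason") == "CrashLoopBackOff":
        return {"confidence": "HIGH", "evidence": ["k8s: CrashLoopBackOff"]}
    return None


@rule("image_pull_error")
def image_pull_error(signals):
    reason = signals.get("waiting_reason")
    if reason in ("ErrImagePull", "ImagePullBackOff"):
        return {"confidence": "HIGH", "evidence": [f"k8s: {reason}"]}
    return None


_RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def analyze(signals):
    """Run every rule in name order and return findings ranked by confidence."""
    findings = []
    for name in sorted(RULES):
        result = RULES[name](signals)
        if result is not None:
            findings.append({"rule": name, **result})
    # Ties are broken by rule name, so the ordering is fully deterministic.
    return sorted(findings, key=lambda f: (_RANK[f["confidence"]], f["rule"]))
```

Because no model call sits on this path, an optional LLM step could only rewrite the narrative around the findings, never change which rules fired.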
The PerfSage ladder: test → gate → RCA
- Reveal — JMeter JTL analysis in the lab
- SLO Reporter — CI gates on load tests
- SignalPilot — post-deploy RCA in production
Same DNA across all three: each tool reports the data, then explains what to do next.
Try it
- Install: pip install perfsage-signalpilot
- Repo: github.com/perfsage/signalpilot
- Release: v1.0.0
- Background: Field Notes #4 — why I built it · Field Notes #3 — quick start
War-room stories and feedback welcome on GitHub Issues.
Field Notes #5 · By Aashish Bajpai
Originally published at https://perfsage.com/blog/5-minute-post-deploy-postmortem-signalpilot/