This is a production-focused follow-up to my original post: pgwd: A Watchdog for Your PostgreSQL Connections. In this one, I show what pgwd looked like in action, how we responded, and what we changed in a controlled way.
When PostgreSQL connection pressure builds up, the real problem is not just crossing `max_connections`; it is crossing it without operational context.
That is where pgwd helped us most: not as "just another alert sender," but as a signal-to-action layer for on-call decisions.
What happened (anonymized timeline)
In one of our production environments, we saw a fast escalation pattern:
- `attention` (75%), then
- `alert` (85%), then repeated
- `danger` (95%+)
All within a relatively short window.
The key signal from Slack was not only the threshold level, but also the breakdown:
- `total`, `active`, `idle`, `max_connections`
- plus `cluster`, `database`, `namespace`, and `client`
That context let us decide quickly whether we were seeing true workload pressure, connection churn, or an idle-heavy pattern that still threatened capacity.
Why threshold levels matter in production
A 3-tier model (75/85/95) maps well to real operations:
- Attention (75%): observe trend, prepare people
- Alert (85%): start mitigation planning
- Danger (95%): execute containment now
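The 3-tier mapping can be sketched as a tiny shell helper. The function name `level` and this standalone form are illustrative, not part of pgwd itself; pgwd applies its thresholds internally.

```shell
# Classify a connection-usage percentage into the post's three tiers.
# `level` is a hypothetical helper for illustration, not a pgwd command.
level() {
  pct=$1
  if   [ "$pct" -ge 95 ]; then echo "danger"
  elif [ "$pct" -ge 85 ]; then echo "alert"
  elif [ "$pct" -ge 75 ]; then echo "attention"
  else echo "ok"
  fi
}

level 76   # attention
level 97   # danger
```

The point of encoding the cutoffs explicitly is that each tier maps to a distinct operator action, not just a different alert color.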
This prevented us from waiting for `FATAL: sorry, too many clients already` as the first real signal.
The runbook we used (manual-first)
For this rollout, we intentionally chose controlled, operator-present execution.
No unattended automation for critical steps yet.
1) Attention (>=75%)
- Confirm trend across intervals (not just one spike)
- Check the `active` vs `idle` ratio
- Verify affected DB scope (single DB vs multiple)
- Open an observation incident thread
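To confirm the `active` vs `idle` split manually, a standard `pg_stat_activity` query works. The query is stored in a variable here so the check itself is visible; the actual `psql` call is shown commented because it needs a reachable database.

```shell
# Group current backends by database and state (active, idle, idle in transaction, ...).
SQL="SELECT datname, state, count(*)
     FROM pg_stat_activity
     WHERE datname IS NOT NULL
     GROUP BY datname, state
     ORDER BY datname, state;"

# Run against the same URL pgwd watches:
# psql "$PGWD_DB_URL" -c "$SQL"
```

An idle-heavy breakdown points at pooling or leak issues rather than true workload pressure, which changes the mitigation you prepare.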
2) Alert (>=85%)
- Engage app + platform on-call
- Correlate with scheduled jobs, batch windows, and maintenance events
- Reduce non-critical pressure if possible
- Prepare containment action
3) Danger (>=95%)
- Execute mitigation immediately (controlled maintenance/throttling based on internal SOP)
- Prioritize availability restoration
- Capture timestamps for post-incident learning
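One common containment lever, only if your internal SOP allows it, is terminating long-idle backends to free capacity. The 10-minute cutoff below is illustrative, and the `psql` call is commented because this must never run unattended.

```shell
# Hypothetical containment: free slots by terminating backends idle > 10 minutes.
# The interval is an example; follow your own SOP before running this in production.
SQL="SELECT pg_terminate_backend(pid)
     FROM pg_stat_activity
     WHERE state = 'idle'
       AND state_change < now() - interval '10 minutes';"

# psql "$PGWD_DB_URL" -c "$SQL"
```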
What we are changing now: max_connections to 3192
Based on this run, one concrete action in our runbook is:
- Increase PostgreSQL `max_connections` from `2048` to `3192`
- Apply the change in a controlled session with operators present
- Monitor closely after the change
This is not "increase and forget."
It is "increase, observe, validate, and adjust."
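For reference, the setting change itself is a one-liner, but `max_connections` only takes effect after a server restart, which is exactly why the controlled session matters. The `psql` calls are shown commented because they need a live session.

```shell
# ALTER SYSTEM writes the new value to postgresql.auto.conf;
# a server restart is still required for max_connections to take effect.
SQL="ALTER SYSTEM SET max_connections = 3192;"

# psql "$PGWD_DB_URL" -c "$SQL"
# ...then restart PostgreSQL in the controlled session, and verify:
# psql "$PGWD_DB_URL" -Atc "SHOW max_connections;"
```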
Important guardrail: do not solve pressure by over-sizing blindly
Raising connection limits without infrastructure awareness can create a different failure mode:
- more connections -> more backend memory/CPU pressure
- more pressure -> noisy performance and unstable pods/nodes
- teams then over-allocate resources reactively
So our runbook explicitly includes this guardrail:
- Track infrastructure headroom (CPU/memory) after increasing limits
- Validate DB and app behavior under the new ceiling
- Avoid resource over-sizing unless data supports it
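A quick back-of-the-envelope check helps with the headroom question. The 10 MB per-backend figure below is a placeholder assumption, not a measurement; profile your own workload before trusting any ceiling.

```shell
# Worst-case memory if every connection slot is used.
# per_backend_mb is an assumed figure for illustration; measure your own workload.
per_backend_mb=10
max_conns=3192
worst_case_mb=$(( per_backend_mb * max_conns ))
echo "${worst_case_mb} MB"   # 31920 MB
```

If that worst case does not fit comfortably inside node memory alongside shared buffers and the OS, raising the limit just moves the failure from the connection layer to the kernel.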
Minimal commands (hybrid style)
```shell
# Basic daemon mode
export PGWD_DB_URL="postgres://..."
export PGWD_SLACK_WEBHOOK="https://hooks.slack.com/..."
export PGWD_INTERVAL=60
pgwd

# Verify notifier delivery before/after changes
pgwd -force-notification

# Optional: run against a Postgres service in Kubernetes
pgwd -kube-postgres <namespace>/svc/postgres \
  -db-url "postgres://user:...@localhost:5432/db"
```
Post-change verification checklist (24h / 72h)
After increasing to 3192, we track:
- Alert frequency by level (`attention`/`alert`/`danger`)
- `active`/`idle` behavior by database
- Peak total connections vs new headroom
- App error rates and latency around peak windows
- Infrastructure utilization trend (not just point-in-time)
Success criteria:
- No sustained `danger` periods
- Fewer repeated escalations
- Stable app behavior
- No unjustified resource inflation
Lessons learned
- `pgwd` is most valuable when tied to a runbook, not only to notifications.
- Alert levels should map to explicit operator actions.
- Controlled, manual-first execution is safer for critical production changes.
- Increasing `max_connections` can be the right call when paired with disciplined capacity monitoring.
Community note
If you want a complementary intro in French (installation + quick setup), this community write-up is also useful:
If you run PostgreSQL in production, I would love to hear your threshold strategy and runbook design.
- Original intro post: pgwd: A Watchdog for Your PostgreSQL Connections
- Install: `go install github.com/hrodrig/pgwd@latest`
- Repo/docs/releases: github.com/hrodrig/pgwd
Disclosure: This post was drafted with AI assistance and reviewed by the author.