Hermes Rodríguez
pgwd in Production: From Alerts to Runbook

This is a production-focused follow-up to my original post: pgwd: A Watchdog for Your PostgreSQL Connections. In this one, I show how pgwd behaved during a real incident, how we responded, and what we changed in a controlled way.

When PostgreSQL connection pressure builds up, the real problem is not just crossing max_connections; it is crossing it without operational context.

That is where pgwd helped us most: not as "just another alert sender," but as a signal-to-action layer for on-call decisions.

What happened (anonymized timeline)

In one of our production environments, we saw a fast escalation pattern:

  • attention (75%)
  • then alert (85%)
  • then repeated danger (95%+)

All within a relatively short window.

The key signal from Slack was not only the threshold level, but also the breakdown:

  • total
  • active
  • idle
  • max_connections
  • plus cluster, database, namespace, and client

That context let us decide quickly whether we were seeing true workload pressure, connection churn, or an idle-heavy pattern that still threatened capacity.
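As a rough sketch, that decision can be expressed in code. The function below is a hypothetical helper, not part of pgwd; the 70% ratios are illustrative cutoffs, not values from the tool:

```python
# Hypothetical helper: classify the pressure pattern from a pgwd-style
# breakdown (total / active / idle). The 0.7 ratio thresholds are
# illustrative, not taken from pgwd itself.

def classify_pattern(total: int, active: int, idle: int) -> str:
    """Return a rough label for what is driving connection pressure."""
    if total == 0:
        return "no-load"
    if idle / total >= 0.7:
        # Capacity is threatened mostly by idle sessions (e.g. pool leaks)
        return "idle-heavy"
    if active / total >= 0.7:
        # Most sessions are doing real work: true workload pressure
        return "workload-pressure"
    # Mixed picture: sessions cycling quickly between states
    return "connection-churn"

print(classify_pattern(total=1900, active=300, idle=1600))
# idle-heavy
```

An idle-heavy result usually points at pooling or leak issues rather than genuine load, which changes the mitigation you reach for.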

Anonymized pgwd Slack alerts timeline

Why threshold levels matter in production

A 3-tier model (75/85/95) maps well to real operations:

  • Attention (75%): observe trend, prepare people
  • Alert (85%): start mitigation planning
  • Danger (95%): execute containment now

This prevented us from waiting for FATAL: sorry, too many clients already as the first real signal.
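The 3-tier mapping above is simple enough to state precisely. A minimal sketch, assuming the function name and labels (pgwd's internal logic may differ):

```python
# Minimal sketch of the 3-tier model (75/85/95). Names and labels are
# illustrative; this is not pgwd's actual implementation.

def alert_level(total: int, max_connections: int) -> str:
    """Map current utilization to an operational alert level."""
    pct = 100 * total / max_connections
    if pct >= 95:
        return "danger"     # execute containment now
    if pct >= 85:
        return "alert"      # start mitigation planning
    if pct >= 75:
        return "attention"  # observe trend, prepare people
    return "ok"

print(alert_level(1600, 2048))  # 78.1% -> attention
```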

The runbook we used (manual-first)

For this rollout, we intentionally chose controlled, operator-present execution.

No unattended automation for critical steps yet.

1) Attention (>=75%)

  • Confirm trend across intervals (not just one spike)
  • Check active vs idle ratio
  • Verify affected DB scope (single DB vs multiple)
  • Open an observation incident thread

2) Alert (>=85%)

  • Engage app + platform on-call
  • Correlate with scheduled jobs, batch windows, and maintenance events
  • Reduce non-critical pressure if possible
  • Prepare containment action

3) Danger (>=95%)

  • Execute mitigation immediately (controlled maintenance/throttling based on internal SOP)
  • Prioritize availability restoration
  • Capture timestamps for post-incident learning

What we are changing now: max_connections to 3192

Based on this run, one concrete action in our runbook is:

  • Increase PostgreSQL max_connections from 2048 to 3192
  • Apply the change in a controlled session with operators present
  • Monitor closely after the change

This is not "increase and forget."

It is "increase, observe, validate, and adjust."

Configuration update with max_connections and resource values

Important guardrail: do not solve pressure by over-sizing blindly

Raising connection limits without infrastructure awareness can create a different failure mode:

  • more connections -> more backend memory/CPU pressure
  • more pressure -> noisy performance and unstable pods/nodes
  • teams then over-allocate resources reactively

So our runbook explicitly includes this guardrail:

  • Track infrastructure headroom (CPU/memory) after increasing limits
  • Validate DB and app behavior under the new ceiling
  • Avoid resource over-sizing unless data supports it
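The headroom check in the guardrail above is back-of-the-envelope arithmetic. A sketch, assuming an average per-backend memory figure (real values depend on work_mem, extensions, and workload; measure on your own cluster rather than trusting this constant):

```python
# Back-of-the-envelope memory headroom check before raising max_connections.
# PER_BACKEND_MB is an ASSUMED average resident memory per backend; measure
# your own cluster's value instead of trusting this constant.

PER_BACKEND_MB = 10

def extra_memory_mb(old_limit: int, new_limit: int,
                    per_backend_mb: int = PER_BACKEND_MB) -> int:
    """Worst-case additional memory if every new connection slot is used."""
    return (new_limit - old_limit) * per_backend_mb

print(extra_memory_mb(2048, 3192))  # 11440 MB (~11 GiB) in the worst case
```

If that worst-case number does not fit inside current node headroom, raising the limit only moves the failure from the database to the infrastructure.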

Minimal commands (hybrid style)

# Basic daemon mode
export PGWD_DB_URL="postgres://..."
export PGWD_SLACK_WEBHOOK="https://hooks.slack.com/..."
export PGWD_INTERVAL=60
pgwd

# Verify notifier delivery before/after changes
pgwd -force-notification

# Optional: run against Postgres service in Kubernetes
pgwd -kube-postgres <namespace>/svc/postgres \
     -db-url "postgres://user:...@localhost:5432/db"

Post-change verification checklist (24h / 72h)

After increasing to 3192, we track:

  • Alert frequency by level (attention / alert / danger)
  • active / idle behavior by database
  • Peak total connections vs new headroom
  • App error rates and latency around peak windows
  • Infrastructure utilization trend (not just point-in-time)

Success criteria:

  • No sustained danger periods
  • Fewer repeated escalations
  • Stable app behavior
  • No unjustified resource inflation
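The "no sustained danger periods" criterion can be checked mechanically. A sketch, assuming a simple (timestamp, level) alert-history format (not pgwd's actual log schema):

```python
# Sketch: detect sustained danger periods in an alert history.
# The (timestamp_minute, level) tuple format is an ASSUMED log shape.

def sustained_danger(events, min_consecutive: int = 3) -> bool:
    """True if `min_consecutive` or more back-to-back events are 'danger'."""
    streak = 0
    for _timestamp, level in events:
        streak = streak + 1 if level == "danger" else 0
        if streak >= min_consecutive:
            return True
    return False

history = [(0, "attention"), (1, "alert"), (2, "danger"), (3, "alert")]
print(sustained_danger(history))  # False: danger never persisted
```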

Lessons learned

  1. pgwd is most valuable when tied to a runbook, not only to notifications.
  2. Alert levels should map to explicit operator actions.
  3. Controlled, manual-first execution is safer for critical production changes.
  4. Increasing max_connections can be right if paired with disciplined capacity monitoring.

Community note

If you want a complementary intro in French (installation + quick setup), this community write-up is also useful:


If you run PostgreSQL in production, I would love to hear your threshold strategy and runbook design.


Disclosure: This post was drafted with AI assistance and reviewed by the author.
