Hermes Rodríguez
pgwd in Production: From Alerts to Runbook

This is a production-focused follow-up to my original post: pgwd: A Watchdog for Your PostgreSQL Connections. In this one, I show how pgwd behaved during a real incident, how we responded, and what we changed in a controlled way.

When PostgreSQL connection pressure builds up, the real problem is not just crossing max_connections; it is crossing it without operational context.

That is where pgwd helped us most: not as "just another alert sender," but as a signal-to-action layer for on-call decisions.

What happened (anonymized timeline)

In one of our production environments, we saw a fast escalation pattern:

  • attention (75%)
  • then alert (85%)
  • then repeated danger (95%+)

All within a relatively short window.

The key signal from Slack was not only the threshold level, but also the breakdown:

  • total
  • active
  • idle
  • max_connections
  • plus cluster, database, namespace, and client

That context let us decide quickly whether we were seeing true workload pressure, connection churn, or an idle-heavy pattern that still threatened capacity.
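As a rough sketch, that decision can be expressed in code. The function below is a hypothetical helper, not part of pgwd; the 70% ratios are illustrative cutoffs, not values from the tool:

```python
# Hypothetical helper: classify the pressure pattern from a pgwd-style
# breakdown (total / active / idle). The 0.7 ratio thresholds are
# illustrative, not taken from pgwd itself.

def classify_pattern(total: int, active: int, idle: int) -> str:
    """Return a rough label for what is driving connection pressure."""
    if total == 0:
        return "no-load"
    if idle / total >= 0.7:
        # Capacity is threatened mostly by idle sessions (e.g. pool leaks)
        return "idle-heavy"
    if active / total >= 0.7:
        # Most sessions are doing real work: true workload pressure
        return "workload-pressure"
    # Mixed picture: sessions cycling quickly between states
    return "connection-churn"

print(classify_pattern(total=1900, active=300, idle=1600))
# idle-heavy
```

An idle-heavy result usually points at pooling or leak issues rather than genuine load, which changes the mitigation you reach for.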

Anonymized pgwd Slack alerts timeline

Why threshold levels matter in production

A 3-tier model (75/85/95) maps well to real operations:

  • Attention (75%): observe trend, prepare people
  • Alert (85%): start mitigation planning
  • Danger (95%): execute containment now

This prevented us from waiting for FATAL: sorry, too many clients already as the first real signal.
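The 3-tier mapping above is simple enough to state precisely. A minimal sketch, assuming the function name and labels (pgwd's internal logic may differ):

```python
# Minimal sketch of the 3-tier model (75/85/95). Names and labels are
# illustrative; this is not pgwd's actual implementation.

def alert_level(total: int, max_connections: int) -> str:
    """Map current utilization to an operational alert level."""
    pct = 100 * total / max_connections
    if pct >= 95:
        return "danger"     # execute containment now
    if pct >= 85:
        return "alert"      # start mitigation planning
    if pct >= 75:
        return "attention"  # observe trend, prepare people
    return "ok"

print(alert_level(1600, 2048))  # 78.1% -> attention
```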

The runbook we used (manual-first)

For this rollout, we intentionally chose controlled, operator-present execution.

No unattended automation for critical steps yet.

1) Attention (>=75%)

  • Confirm trend across intervals (not just one spike)
  • Check active vs idle ratio
  • Verify affected DB scope (single DB vs multiple)
  • Open an observation incident thread

2) Alert (>=85%)

  • Engage app + platform on-call
  • Correlate with scheduled jobs, batch windows, and maintenance events
  • Reduce non-critical pressure if possible
  • Prepare containment action

3) Danger (>=95%)

  • Execute mitigation immediately (controlled maintenance/throttling based on internal SOP)
  • Prioritize availability restoration
  • Capture timestamps for post-incident learning

What we are changing now: max_connections to 3192

Based on this run, one concrete action in our runbook is:

  • Increase PostgreSQL max_connections from 2048 to 3192
  • Apply the change in a controlled session with operators present
  • Monitor closely after the change

This is not "increase and forget."

It is "increase, observe, validate, and adjust."

Configuration update with max_connections and resource values

Important guardrail: do not solve pressure by over-sizing blindly

Raising connection limits without infrastructure awareness can create a different failure mode:

  • more connections -> more backend memory/CPU pressure
  • more pressure -> noisy performance and unstable pods/nodes
  • teams then over-allocate resources reactively

So our runbook explicitly includes this guardrail:

  • Track infrastructure headroom (CPU/memory) after increasing limits
  • Validate DB and app behavior under the new ceiling
  • Avoid resource over-sizing unless data supports it
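The headroom check in the guardrail above is back-of-the-envelope arithmetic. A sketch, assuming an average per-backend memory figure (real values depend on work_mem, extensions, and workload; measure on your own cluster rather than trusting this constant):

```python
# Back-of-the-envelope memory headroom check before raising max_connections.
# PER_BACKEND_MB is an ASSUMED average resident memory per backend; measure
# your own cluster's value instead of trusting this constant.

PER_BACKEND_MB = 10

def extra_memory_mb(old_limit: int, new_limit: int,
                    per_backend_mb: int = PER_BACKEND_MB) -> int:
    """Worst-case additional memory if every new connection slot is used."""
    return (new_limit - old_limit) * per_backend_mb

print(extra_memory_mb(2048, 3192))  # 11440 MB (~11 GiB) in the worst case
```

If that worst-case number does not fit inside current node headroom, raising the limit only moves the failure from the database to the infrastructure.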

Minimal commands (hybrid style)

# Basic daemon mode
export PGWD_DB_URL="postgres://..."
export PGWD_SLACK_WEBHOOK="https://hooks.slack.com/..."
export PGWD_INTERVAL=60
pgwd

# Verify notifier delivery before/after changes
pgwd -force-notification

# Optional: run against Postgres service in Kubernetes
pgwd -kube-postgres <namespace>/svc/postgres \
     -db-url "postgres://user:...@localhost:5432/db"

Post-change verification checklist (24h / 72h)

After increasing to 3192, we track:

  • Alert frequency by level (attention / alert / danger)
  • active / idle behavior by database
  • Peak total connections vs new headroom
  • App error rates and latency around peak windows
  • Infrastructure utilization trend (not just point-in-time)

Success criteria:

  • No sustained danger periods
  • Fewer repeated escalations
  • Stable app behavior
  • No unjustified resource inflation
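The "no sustained danger periods" criterion can be checked mechanically. A sketch, assuming a simple (timestamp, level) alert-history format (not pgwd's actual log schema):

```python
# Sketch: detect sustained danger periods in an alert history.
# The (timestamp_minute, level) tuple format is an ASSUMED log shape.

def sustained_danger(events, min_consecutive: int = 3) -> bool:
    """True if `min_consecutive` or more back-to-back events are 'danger'."""
    streak = 0
    for _timestamp, level in events:
        streak = streak + 1 if level == "danger" else 0
        if streak >= min_consecutive:
            return True
    return False

history = [(0, "attention"), (1, "alert"), (2, "danger"), (3, "alert")]
print(sustained_danger(history))  # False: danger never persisted
```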

Lessons learned

  1. pgwd is most valuable when tied to a runbook, not only to notifications.
  2. Alert levels should map to explicit operator actions.
  3. Controlled, manual-first execution is safer for critical production changes.
  4. Increasing max_connections can be right if paired with disciplined capacity monitoring.

Community note

If you want a complementary intro in French (installation + quick setup), this community write-up is also useful:


If you run PostgreSQL in production, I would love to hear your threshold strategy and runbook design.


Disclosure: This post was drafted with AI assistance and reviewed by the author.
