Kazu

Posted on Jun 23

The first 30 seconds of a Postgres incident: why they take 30 minutes

#opensource #postgres #database

2 a.m. PagerDuty goes off. "Production is slow."

You open your laptop, fire up psql with bleary eyes, and connect to production. The prompt comes up. The cursor blinks after production=>.

And your hands stop.

Now, where was I supposed to look first? Do I get an overview of the whole DB, or do I start drilling into individual queries? It's been a while since the last incident, and the first move doesn't come to me.

"What kind of incident is this even?" — and you freeze

"Production is slow." If you could act on that much information, this would be easy. Is the slowness in the code or the DB? Is one runaway slow query thrashing, or is the overall load up? Are connections exhausted? Is a chain of lock waits piling up somewhere? Is an idle in transaction session holding a transaction open?

These are all different incidents. Different places to look, different moves to make. Do you check pg_stat_activity? pg_locks? Cache hit ratio in pg_stat_database? Before you can decide which pg_stat_* to hit, you need a hunch about which category of incident this is.

But you don't have anything yet to form that hunch.

What you need, really, is the big picture. The goal is to narrow it down. "Ah, this is connection exhaustion." "No, this is a lock." Only once you've placed the incident into a category does the first query become obvious. And for that, you first want to see — on one screen — what the entire database looks like at this exact moment. How are connections doing? Is TPS spiking or dropping? Is anything waiting? You don't want the overview for its own sake. It's the footing you need before you can make the call.

But what psql gives you is one query, one snapshot. Fire off a SELECT and you get back a single slice. No big picture. You want the whole picture, and you have to start by guessing the right query to get a view of it.

So you pick the first query on instinct. Usually it's "pg_stat_activity for now." Not wrong. But you're running it without knowing whether it's the right call. With no hunch yet, you're running the very query that was supposed to give you one.

Guess wrong and you pay for it in extra time. Once, I saw the connection count was unusually high, decided "connection exhaustion," suspected the app's connection pool, and nearly started a conversation about adding servers. The actual culprit was a single long query thrown by a nightly batch. It had settled in and clogged everything behind it, and the connections were just stacking up as a result. I'd looked at the result (connection count) and mistaken it for the cause (the long query). If I'd been able to see the whole thing on one screen first, I'd probably have caught it in a couple of minutes.

These few minutes — from the alert firing to the first keystroke — are the most nerve-wracking part.

You don't have the queries memorized

Say you commit to "pg_stat_activity first." The next problem arrives.

You can't write that query, properly filtered, from memory.

SELECT * FROM pg_stat_activity you can write. You can — but run it and 143 rows scroll past, and you can't tell which one is the culprit. What you want is "only the active, long-running queries, ordered by duration descending, only the columns I need." Can you write that SELECT, with the state = 'active' filter and the now() - query_start math, at 2 a.m. with nothing in front of you? I can't.

So you open a browser. You search "postgres long running queries." You open Stack Overflow. You copy a query. You paste it into psql. ERROR: column "waiting" does not exist. An old answer referencing a column that was removed in PG 9.6. You fix it.

Locks are even worse. JOIN pg_locks and pg_stat_activity to produce the pairs of which session is blocking which — how many people have that query memorized? You search again. You paste again.

Then you remember, "we put a query collection in the team wiki," and go looking for it. It does exist. But it was last updated two years ago, and half of it doesn't run as-is. Figuring out which ones still work and which are stale is, again, more searching. In the middle of an incident, you're bouncing between search results and a wiki instead of your own repository.

Is that the cause, or the result?

So you go through all that trouble and finally produce a list of long-running queries. At the top sits a query that's been running for 2 minutes 14 seconds.

Is this the culprit? You don't know.

Is the query itself heavy and slow, or is it being blocked somewhere else and just sitting there long as a result of waiting? One snapshot can't separate the two. You think you've grabbed the cause, when you might be looking at the victim.

To separate them, you also need to look at the lock side. Which means another query. And then you reconcile both in your head. At 2 a.m.

And you blink, and 30 minutes are gone

Queries are snapshots, so if you want to follow how things change, you have to re-run them by hand. Hit pg_stat_activity once. The situation shifts. Hit it again. Recall the history with ↑, hit enter. Run it again.

Written out, the first response looks like this:

Search "postgres running queries"
→ Copy a query from Stack Overflow
→ Fix it for a PG version mismatch
→ Sort by duration
→ Wonder: is this the cause or the result?
→ Search for a separate locks query
→ Paste and run
→ The situation changed, so re-run everything
→ …

It feels like an instant. Your hands never stop. Searching, pasting, fixing, re-running. And yet you glance at the clock and 30 minutes have passed.

For those 30 minutes, the incident channel in Slack keeps growing. When someone asks "what's the status?", all you can say is "still investigating." Your hands haven't stopped. You've been hitting something the whole time. And yet you haven't even taken the first step toward recovery. You don't even have a hunch about the cause.

One tool has a tagline: "the first 30 seconds of a Postgres incident." But with nothing in hand, those "30 seconds" quietly balloon into 30 minutes of manual queries and searching. Where it should have taken 30 seconds, 30 minutes go by. That gap is what hits hardest in a late-night incident.

What I actually wanted was "now, on one screen"

Looking back, what I wanted in the first 30 seconds was always the same thing.

"What does the database look like right now" — on one screen. Are connections filling up? Has cache hit ratio dropped? Are there long-running queries? Is a lock queue forming? Is a session sitting with a transaction open? I want to see this as one picture first, without memorizing queries and without searching.

Then, having placed it — "this is connection exhaustion," "no, this is a lock" — I want to drill down only in that direction. Overview first, then dig. The order was always supposed to be this, and yet with psql I had no choice but to start from the one query for "digging."

Because I was digging without an overview, I dug holes on instinct and filled them back in, and 30 minutes went by.

pgincident — putting that into one TUI

pgincident is a tool that packs that "overview first, then dig" into a single TUI in your terminal. Instead of the many queries you'd hit in psql, you take your guess from an overall-health overview screen, then drop into a category dashboard to dig.

Setup is nothing dramatic. On macOS, you install it with Homebrew.

brew tap shinagawa-web/tap
brew install pgincident

On Linux / macOS, a one-liner works too.

curl -fsSL https://raw.githubusercontent.com/shinagawa-web/pgincident/main/install.sh | sh

Getting started is three steps. First, generate a config file.

pgincident --init
# Created /your/project/.pgincident.toml

Write your connection details into the generated .pgincident.toml.

[connections.default]
dsn = "postgres://user:password@localhost:5432/mydb"

[thresholds]
long_running        = "5s"
idle_in_transaction = "30s"

Then just launch it.

pgincident

The connecting role doesn't have to be a superuser. As long as it's a member of the pg_monitor role, it works. This is a quiet detail, but it matters — on managed Postgres like RDS or Cloud SQL, you often can't get a superuser at all.

What you see in the first 30 seconds

When you launch it, the first thing you get is the Overview screen. That "now, on one screen" is right here.

primary  10.0.1.42:5432  PG 16.1                              interval: 5.0s
──────────────────────────────────────────────────────────────────────────
  DB Health Overview
──────────────────────────────────────────────────────────────────────────

  Metric                Value                 Status
  ──────────────────────────────────────────────────
  Connections           142 / 200 (71%)       OK
  TPS                   2340                  OK
  Cache hit             99.2%                 OK
  Checkpoints           req: 0                OK
  Autovacuum            0 workers             OK

──────────────────────────────────────────────────────────────────────────
[o]dashboard  [q]uit  [+/-]interval  [?]help

Connections, TPS, Cache hit, Checkpoints, Autovacuum. Each has a threshold set, and the badge reads OK when healthy, WARN when shaky, CRIT when bad. If your setup has a replication standby, a Replication lag row joins them.

This is exactly what I wanted at 2 a.m. If Connections is red at 90%, this looks like connection exhaustion. If Cache hit has dropped, another lead opens up. Without memorizing queries, without searching, the hunch about "which category of incident" forms right here. And this screen refreshes itself on the interval you set, so there's no re-running it by hand.

Once you've placed it, press o to drop into the Dashboard screen. Here, the things you'd hit with several queries in psql line up in three categories.

Long-running queries — active queries still running past the threshold (default 5s)
Locks — pairs of the blocking and the blocked sessions
Idle in transaction — sessions left holding a transaction open past the threshold (default 30s)

That "cause or result?" separation you were stuck on moves forward here. The list of long-running queries and the blocking relationships of locks are on the same screen at the same time. If the top long-running query shows up on the blocked side, it's not the culprit — it's the victim. No need to re-run a separate query and reconcile it in your head.

Press Enter on a long-running query row and its full text opens in an overlay. Formatted with line breaks, keywords highlighted.

┌─ Query Detail ──────────────────────────────────────────────────────────┐
│ PID: 12345   user: app_user   duration: 00:02:14.32   state: active     │
│ ─────────────────────────────────────────────────────────────────────── │
│ SELECT                                                                   │
│   u.id, u.name, u.email,                                                │
│   o.id AS order_id, o.status, o.total_amount                            │
│ FROM users u                                                             │
│ JOIN orders o ON o.user_id = u.id                                       │
│ WHERE u.status = 'active'                                               │
│   AND o.created_at > NOW() - INTERVAL '7 days'                         │
│ ORDER BY o.created_at DESC                                              │
│ LIMIT 100                                                               │
└─────────────────────────────────────────────────────────────────────────┘
[any key] close

Beyond that: if you've configured multiple DBs, c switches the connection (the use case being to move between a primary and a replica). You adjust the polling interval with + / -, and Tab moves between sections.

Getting the "first 30 seconds" back

What you actually needed in the first response to an incident wasn't a new query collection, or pg_stat_* syntax to re-learn. It was being able to take in "what's happening right now" as an overview on one screen, place your guess, and then dig — and to walk that order without searching or re-pasting. Because I had no way to get the overview, I kept digging on instinct — holes I'd just fill back in. What pgincident gives back is that first overview.

This article covers only the first 30 seconds, from launch — the core experience. Deeper dives into lock chains, and the investigation further down the line, are for another post.

2 a.m., those few minutes where your hands stopped after production=>. So that they don't turn into 30 minutes, have the one-screen now on hand first — that alone changes the first response quite a bit.