DEV Community

Kazu
Kazu

Posted on

"Couldn't you just use pg_activity? — staking out one corner of Postgres incident triage"

Earlier, I wrote about a 2 a.m. incident. Production is slow, you open psql, the cursor blinks, and your hand freezes on the very first query. I introduced pgincident as a tool that breaks that "first-response paralysis" by giving you the whole picture on one screen.

When I show that article to people, there's almost always one thing that comes back first.

"Couldn't you just use pg_activity for that?"

Or "isn't pgcenter enough?" Fair question. Terminal-based Postgres monitors have been around forever. Anyone can picture "that kind of screen" the moment you say top, and I've personally been bailed out by both of those tools in the field more times than I can count. If I'm going to add one more TUI to the pile, I owe people a real explanation of what makes it different from the established standards.

This article is that explanation. It's not a case that pgincident beats everything. After taking honest stock of where the existing tools are strong and where they have a harder time reaching, where pgincident stands turns out to be a fairly narrow corner: managed Postgres × the Mac in front of you × first-response triage. That's where this lands. Read it as a stock-take, not a sales pitch.

First, let me give credit — the existing tools are good

I don't want to open by talking anything down, so let me say it up front: pg_activity and pgcenter are both well-made tools.

(One caveat before we start: everything I say in this article about how the tools behave is as of pg_activity v3.6.2 / pgcenter v0.10.1 / pgincident v0.6.0. OS support, required privileges, and managed-environment behavior can all change between versions, so read every "works / doesn't work / gets stripped" here as a statement about those three specific versions.)

pg_activity is a top-style activity viewer written in Python. It's been around a long time, it's mature, and if you're SSH'd into a production host it's still first-rate. Instead of re-typing pg_stat_activity by hand, the running queries line up sorted by duration and refresh on their own. Part of what I said I "wanted" in the last article has been sitting right here all along.

pgcenter is written in Go and is also of the top lineage. What's genuinely impressive about pgcenter is that it can correlate system stats — CPU, I/O — with Postgres stats on one screen. "Is the DB slow because of the queries, or because the host's I/O is saturated?" You can make that call by looking at the OS and the DB side by side on a single screen. If you want to do this on a self-managed Linux box, pgcenter is still a top pick.

This isn't me propping it up in the abstract. There's a concrete time it saved me. On a self-managed Linux production box (Postgres I'd stood up myself on EC2), API latency one day ballooned to about three times its usual. Staring at the slow query log, I couldn't single out one bad query — everything was uniformly slow. That kind of slowness is the nasty kind, because you can't pin the blame on a single query. I SSH'd into the production host and launched pgcenter as the postgres user. On that familiar top-lineage screen, the first thing that jumped out was the I/O wait row, with disk utilization basically pegged. The bottom half of the same screen had the Postgres-side stats, and I could see checkpoints firing far more often than usual. OS-side I/O saturation and DB-side checkpoint flurry, stacked one above the other, visible at the same time. "This isn't the queries — the host's I/O is jammed, and the trigger is checkpoints." I had that hunch within about ten seconds of opening the screen. If I'd been cross-referencing pg_stat_bgwriter and iostat in separate terminals, it would have taken a lot longer.

That's what correlating OS and DB on one screen buys you. It's pgcenter's strongest suit, and where it works, it really works.

So this isn't a piece about killing the old thing to sell the new one. The question is where "where it works" actually is.

What the descendants of top assume

pg_activity and pgcenter are similar at the root of their design. Both are descendants of top, and the basic shape is "show what's running right now as a single sorted list." And both deliver their best experience under one assumption.

That assumption, roughly put, is that you're sitting on the host where the DB is running.

Concretely, it comes down to this. pgcenter's signature feature — correlation with system stats — pulls CPU and I/O information from the OS. So the tool has to run on the same host as the DB. pg_activity, too, if you want it to show system information and temp-file information, assumes you run it on the same host as the OS user running postgres (or root).

A word on privileges — this one's easy to get wrong, so let me be precise. Both tools deliver their best experience when the connecting role is a SUPERUSER. But it's not that they refuse to run without it. pg_activity drops into degraded mode if you're not a superuser and stops showing system and temp-file information — but you can still see the activity list itself. pgcenter is also happiest with a superuser, but it'll run with privileges sufficient to read the stats. It's not "SUPERUSER required," it's "without it, some of it gets stripped." That's a distinction worth keeping.

Either way, the precondition for the best experience is "on the DB host, on Linux, with sufficient privileges." As long as that holds, both tools are strong.

Where the assumption falls apart — RDS, Cloud SQL, Aurora

The problem is that in a lot of recent incidents, that assumption falls apart wholesale.

When your production Postgres runs on a managed service (RDS, Cloud SQL, Aurora), that precondition breaks in three ways at once. (1) you can't become the OS user, (2) you can't get SUPERUSER, (3) you can't get system stats. Let's take them in order.

First, (1), you can't become the OS user of the DB host. You don't get to touch the internals of a managed service. The whole premise of SSH'ing in and launching the tool as the postgres user simply doesn't hold.

Next, (2), you can't get SUPERUSER. Managed services generally won't hand you a SUPERUSER-equivalent role. So you're structurally locked out of that "SUPERUSER gives you the best experience" case from earlier.

And (3), this is the one that stings most. The correlation with system stats — pgcenter's killer feature — simply doesn't work on managed services, because the mechanism for pulling CPU and I/O from the host can't reach inside the walls of a managed DB. pg_activity has an --rds flag, but with it, system stats don't show up.

Here, though, the damage lands differently on the two tools, so let me write it out fairly. pgcenter has its single biggest selling point — correlating OS and DB — vanish. The top half of that side-by-side screen, the one that let me catch the I/O saturation on EC2 in one shot, stops meaning anything at all. pg_activity, by contrast, loses the add-on information (system stats, temp-file info), but its core as an activity viewer survives. The main act — watching running queries sorted by duration, auto-refreshing — works just fine on a managed service too. pgcenter loses a pillar; pg_activity loses a branch. Both "get weaker on managed services," but that difference is not small.

In fact, I once got burned by exactly this. During a morning incident with a slow Aurora (PostgreSQL-compatible), I typed out pgcenter on my Mac out of habit. It wouldn't even start — pgcenter is Linux-only and the Mac is unsupported (this isn't degradation, it just plain doesn't run). So I pointed pg_activity at the Aurora instance with the --rds flag, and this time it started up fine. The activity list showed up too. But the system-information columns I always leaned on were completely blank. CPU, I/O, temp files, all empty. Which makes sense when you think about it — you can't reach the OS inside the managed wall — so this is what degraded looks like under --rds. Not "doesn't run," but "comes back as half of itself."

That swing-and-a-miss — it starts, but the half you most wanted to see is blank — wore on me more than I'd have expected. On EC2 I'd had a hunch within ten seconds of opening the screen, and now the same tool just wouldn't do the same work. In the end, that morning I couldn't use either of my usual standards in its proper form, and I had to go form a hunch off the activity list plus guesswork in my head. The first thing you open during a late-night or early-morning incident is usually not the production host but the terminal on your own Mac. And in that "first thing you open," both tools fall short of their best at the same time.

For the record, this isn't a defect in either tool. It's just that the world their design assumed (self-managed Linux hosts) and the world incidents now happen in (managed × the machine in front of you) have drifted apart. The world moved over these past few years; the tools didn't degrade.

A difference in trade-offs — deliberately dropping system stats

Here's the design call pgincident makes. To be fair, this isn't "pgincident has more features." If anything, it's throwing features away.

pgincident deliberately does not carry system stats (CPU / I/O / memory). The README even marks this as an explicit non-goal: "that's pgcenter's job; this one focuses on Postgres internals."

This might look like a weakness. And it is — for someone who wants to correlate OS and DB on self-managed Linux, pgcenter is clearly the better choice. I'll grant that.

But as a design decision it's coherent. On managed services you can't get system stats anyway. A feature built on top of something you can't get is dead on arrival if managed is your main case. So instead of depending on it, you pour everything into the range you can get with the pg_monitor role: activity, locks, idle in transaction, and the big-picture overview of overall health. Unlike SUPERUSER, pg_monitor is a role that can be granted on managed services, so you could say pgincident shifts the starting point of its design so that the best case "still holds on a managed service."

Why can pg_monitor be granted on managed services while SUPERUSER is withheld? This is worth one paragraph of context. SUPERUSER bypasses permission checks wholesale. It's a sweeping privilege that reaches the filesystem, server configuration, and other people's data. Hand that to a user and AWS or GCP can no longer hold the management boundary between the OS and the infrastructure. So managed services structurally don't issue SUPERUSER. pg_monitor, on the other hand, is a narrowly scoped role that bundles up only the reads you need for monitoring — all columns of pg_stat_activity, the various stats views. Because what it can see is limited to monitoring stats, the managed side can grant it without worry. By planting its starting point here, pgincident stands from the outset on the side of "the privilege a managed service can comfortably issue," rather than "the privilege a managed service structurally can't."

One honest note: pgincident's managed-service support itself is, on the repo's SQL catalog, still largely unverified (most of the testing so far has been on local PG16). So I won't write "battle-tested on RDS." What I'm pointing at here isn't a verification track record but a choice of privilege model and architecture — the design's starting point sits on managed services and pg_monitor. Same kind of TUI, but the best case it assumes faces a different direction.

The other difference is the "shape" — one list vs. overview-then-dig

Apart from the privilege and platform story, the shape of the screen itself is different. And personally I think this is the more important connection back to the last article.

The basis of the top family is "show what's running right now as a single sorted list." Sort by duration, sort by resource. This is good for seeing "what's heaviest." But it doesn't directly help with that paralysis I wrote about last time, the one where your hand stops at "what kind of incident is this even?" A list gives you a "heaviest-things ranking," but it doesn't give you the one-frame picture of "what's the overall state right now." Are connections exhausted? Are locks chaining? Is an idle in transaction session holding on? You need that hunch before the list.

pgincident reverses the order. The first thing you get is the Overview — a one-frame picture of overall health. Connection count, TPS, cache hit ratio, checkpoints, autovacuum. Each has a threshold, and the shaky metrics get highlighted in color. Here you get a read on it first — "this is the connection-exhaustion line," "no, this is a lock" — and only then drill down into that direction's category (activity / locks / idle in transaction). Overview first, then dig.

Last time, I wrote this: "Because I was digging without an overview, I dug holes on instinct and filled them back in, and 30 minutes went by." A single top-family list is an excellent tool for the "digging" side. What pgincident is trying to add is the one "overview" frame that comes before it, and that's the direct answer to the first-response paralysis I described in the first article. The comparison is only a means; the theme has been "the quality of the first response" the whole time.

Just one comparison table

I'll pick the axes carefully and lay down a single table. Think of it not as a binary win/lose table but as one for reading "under which assumption does which tool work." (Restated: the behavior here is as of pg_activity v3.6.2 / pgcenter v0.10.1 / pgincident v0.6.0.)

pg_activity pgcenter pgincident
Language Python Go Go
OS Runs on Mac (system stats need a Linux host) Linux-only (won't start on Mac) Linux + macOS
Privileges Best with SUPERUSER / degrades without it Ideal with SUPERUSER / works with stats-read privileges Designed around pg_monitor
System-stats correlation (needs same host + OS privileges) best-in-class (needs same host, Linux) deliberately absent (non-goal)
Managed DB as the main arena secondary secondary (system stats vanish) the primary design assumption
Maturity / adoption high, proven track record high new
Screen shape one sorted list OS+DB correlated on one screen overview → drill into a category

The bold cells are where each tool clearly wins. For system-stats correlation, pgcenter; for maturity and track record, pg_activity; for putting managed services at the center from the start, pgincident. Read the table this way and you can see that what pgincident fills is the bottom-right corner — only the case of "managed is the main arena, pg_monitor is the starting point, and I want to come in from an overview."

A quick word on tools in other categories

Let me also address the "but we already run Datadog for monitoring" question. This is less a comparison than a sorting-out of layers.

Metrics platforms (Datadog, Grafana + postgres_exporter, pganalyze) are for continuously accumulating metrics and watching trends and anomalies over time. Catching "it's been gradually getting worse since last week" is this layer's job, and that's where it earns its keep. But that's the "look back later / keep a standing watch" layer, and it lives on a different time axis from "the first 30 seconds right after the alert fires, what's happening in the terminal in front of me right now." I think of it as a different layer, not a competitor.

GUIs (pgAdmin, DBeaver) are tools for browsing schemas and writing queries. They're not what you open in that moment when the cursor is blinking at the start of an incident. Different category too.

Conclusion — "the best tool" depends on the situation

The conclusion of an honest stock-take comes out like this.

  • If you're on self-managed Linux and want to look at OS and DB together to sort out "is it the host or the DB," then pgcenter. Its system-stats correlation is still first-rate.
  • If you're SSH'd into the production host and want a mature, proven standard, then pg_activity. It's ahead on maturity and adoption.
  • If you want long-term trends and standing monitoring, then Datadog / Grafana / pganalyze. That's a different layer.
  • And if you want to get oriented right after the alert fires, on managed Postgres, from the Mac in front of you, then the tool trying to fill that corner is pgincident.

To repeat: this is a story about "the design's starting point sits on managed × pg_monitor," not about "a track record of field verification piling up on RDS." Verification has mostly progressed on local PG16 so far, and that honestly remains a weak point. Even so, as far as I know there's no other first-response TUI whose assumed best case faces managed services from the start. What I'm asserting is the design's direction, not a track record.

pgincident isn't a tool I built to beat everything. By dropping system stats, narrowing the feature set, and shifting its assumed best case to "managed × pg_monitor," it's trying to take on just the one corner that the existing tools have a structurally harder time reaching.

My answer to "couldn't you just use pg_activity?" is "in many cases, yes, exactly." If you can put it on the production host, on Linux, and want to see all the way down to system stats, the existing tools are the better call. But if it's 2 a.m. and you want to get into RDS from the Mac in front of you and form a hunch about which category of incident this is within 30 seconds — for that one narrow case, and that case alone, pgincident exists.

So the conclusion of this stock-take isn't "use this," it's "choose by the situation." If you want to SSH into EC2 and look at the whole host, pgcenter; for a mature standard living on the production host, pg_activity; for tracking trends, Datadog; and for managed × the machine in front of you × first response, pgincident. Tools are stuck to their situations, and no single one covers all of them. Pick based on the situation you most often find yourself in. For me, that just happened to be "RDS from the Mac, late at night."

pgincident is on GitHub.

Top comments (0)