DEV Community

Damaso Sanoja

SQL database architecture, use cases, and monitoring: a practitioner's guide

Most SQL performance problems trace back to a handful of knobs, a handful of metrics, and the architecture that connects them. This guide covers all three across PostgreSQL, MySQL InnoDB, and SQL Server, starting with the cheat sheet you can act on today and working backward through the justification for every number in it.

If you are setting up a new SQL deployment or auditing one you inherited, the next two tables are the answer. Screenshot them, calibrate the numbers against your own baseline (next section), and read on for the architecture that explains why each number sits where it does.

The tuning cheat sheet

| Knob | PostgreSQL | MySQL (InnoDB) | SQL Server | Starting point |
| --- | --- | --- | --- | --- |
| Buffer pool size | shared_buffers | innodb_buffer_pool_size | max server memory | PostgreSQL: 25% of host RAM on a dedicated database host, diminishing returns above 8-10 GB unless host has >32 GB RAM. MySQL: 70-80% of host RAM on a dedicated host. SQL Server: set max server memory leaving ~10-15% of host RAM for the OS. |
| Planner cache hint | effective_cache_size | n/a | n/a | 50-75% of host RAM; update alongside shared_buffers so the planner accounts for OS page cache |
| Commit durability | synchronous_commit | innodb_flush_log_at_trx_commit | (always on) | Leave strict for financial data. Relax to off on PostgreSQL (up to ~200 ms crash-loss window, bounded by wal_writer_delay) or 2 on MySQL (up to ~1 second crash-loss window, bounded by the once-per-second log flush) on event logs and session stores. |
| Autovacuum aggressiveness | autovacuum_vacuum_scale_factor | purge thread (tuned via innodb_purge_batch_size, innodb_purge_threads) | n/a | Drop PG from the 0.2 default to 0.01-0.05 on any table receiving millions of updates per day; apply per-table (see §4.2) rather than globally |
| Connection ceiling | max_connections | max_connections | max worker threads (default 0 = auto) | Size so (app pool size) × (app servers) × 1.2 stays under the ceiling; add a pooler if the math doesn't close. SQL Server has no max_connections analog; for finer control use workload groups under Resource Governor. |
| Snapshot isolation | (on by default via MVCC) | (on by default via MVCC) | ALTER DATABASE ... SET READ_COMMITTED_SNAPSHOT ON | Enable RCSI on SQL Server databases carrying mixed OLTP and reporting, and budget for tempdb write pressure |
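The buffer pool row's percentages reduce to a few lines of arithmetic. A minimal sketch (the function name is illustrative, and the MySQL and SQL Server figures are midpoints chosen from the ranges in the table, not fixed constants):

```python
# Hedged sketch: first-pass buffer pool sizes for a dedicated database host,
# mirroring the percentages in the cheat sheet above.
def buffer_pool_starting_points(host_ram_gb: float) -> dict:
    """Return starting-point buffer pool sizes in GB per engine."""
    return {
        # PostgreSQL: 25% of RAM; diminishing returns above ~8-10 GB
        # unless the host has >32 GB.
        "postgresql_shared_buffers": round(host_ram_gb * 0.25, 1),
        # MySQL InnoDB: 70-80% of RAM; 75% midpoint used here.
        "mysql_innodb_buffer_pool": round(host_ram_gb * 0.75, 1),
        # SQL Server: max server memory = RAM minus ~10-15% for the OS;
        # 12.5% reserved here as a midpoint.
        "sqlserver_max_server_memory": round(host_ram_gb * 0.875, 1),
    }

print(buffer_pool_starting_points(32))
```

On a 32 GB host this yields 8 GB for PostgreSQL, 24 GB for InnoDB, and 28 GB for SQL Server's ceiling; calibrate against your baseline before treating any of them as final.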

The alerting cheat sheet

| Signal | Threshold that should page you | What it is actually telling you |
| --- | --- | --- |
| Query p95 execution time, per query pattern | 2× the pattern's own two-week baseline | A plan regression, stale statistics, or a new callsite running without an index |
| Buffer cache hit ratio | Below 95% sustained on OLTP; below 99% on hot-data-heavy PG deployments; OLAP workloads may tolerate lower | Working set exceeds buffer pool, or a cold cache after restart, or a sequential scan that should not be happening |
| Deadlock count | Any non-zero count in a 5-minute window | Lock ordering inconsistency in application code, not a database bug |
| Lock wait count | Rising trend, not an absolute number | A long transaction holding row locks against OLTP traffic; usually surfaces upstream as HTTP 503s |
| Connection usage | Sustained above 80% of the connection ceiling (75-90% range is defensible depending on risk tolerance) | Pooling is undersized, missing, or the app is leaking connections |
| Replication lag | Above your RPO target, not a universal number | WAL sender saturation, slow replica consumer, network, or a long-running query on the replica |
| Commit latency | Above 10 ms on NVMe, 50 ms on SATA SSD | fsync contention on the log volume, usually because data and log share the same disk |

Baselining: capturing a fingerprint before the first incident

Every number in the cheat sheet is a starting point, not a verdict. A 95% cache hit ratio is healthy on one workload and a disaster on another. The only way to know which side your deployment sits on is to capture a fingerprint before production traffic arrives, so that when it does, you have something to compare against.

Synthetic load is the entry point. On PostgreSQL, pgbench ships in the contrib package and runs a TPC-B-like workload out of the box. pgbench -i -s 50 creates a dataset large enough that the working set pushes buffer pool behavior into realistic territory, and pgbench -c 20 -j 4 -T 600 drives it for ten minutes. Its final output gives you tps, latency average, and (with --report-per-command) per-statement latency; the latency average line and stddev map directly to the p95 query time fingerprint. On MySQL, sysbench plays the same role, and its oltp_read_write profile is a reasonable first cut. Read the 95th percentile line under Latency (ms) in the summary. Tool setup is covered in vendor documentation; the signal you should capture once it's running is what matters here. Neither tool replaces an application-shaped load test, but they produce enough signal to detect whether your configuration is sane before real users expose the places it is not.

A minimum viable baseline fingerprint covers six numbers, captured over a window long enough to include your full cycle of cron jobs and batch work (two weeks is the usable lower bound for p95 query time):

  • p95 query execution time per significant query pattern
  • buffer cache hit ratio, on average and at its worst fifteen-minute window
  • WAL or redo log write rate in bytes per second
  • lock wait count per hour
  • deadlock count per day (you are hoping for zero)
  • replication lag peak, if you have replicas

Record each one, note the day and hour of its worst value, and keep the file somewhere your on-call rotation can find it. Every alert threshold in the cheat sheet becomes defensible once you can say "yes, we crossed 2× our own baseline" rather than "yes, we crossed a number we read on the internet."
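The fingerprint file itself can be as simple as a structured record plus the 2× rule. A sketch of what that might look like; the field names and every value are illustrative, not a standard schema:

```python
import json

# Hedged sketch of the minimum viable baseline fingerprint described above.
# All names and numbers here are made-up examples.
baseline = {
    "p95_query_ms": {"orders_lookup": 12.4, "session_upsert": 3.1},
    "cache_hit_pct_avg": 98.7,
    "cache_hit_pct_worst_15min": 94.2,
    "wal_write_bytes_per_s": 18_500_000,
    "lock_waits_per_hour": 40,
    "deadlocks_per_day": 0,
    "replication_lag_peak_s": 4.0,
    "worst_window": "Sunday 02:00-02:15 (weekly batch)",
}

def p95_alert(pattern: str, observed_ms: float) -> bool:
    """Page when a pattern crosses 2x its own baseline, per the cheat sheet."""
    return observed_ms > 2 * baseline["p95_query_ms"][pattern]

# Serialize to a file your on-call rotation can find.
print(json.dumps(baseline, indent=2))
print(p95_alert("orders_lookup", 30.0))  # crosses 2 x 12.4 ms
```

The exact storage does not matter; what matters is that the alert rule compares against this record rather than a number from the internet.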

With a baseline in hand, the rest of the article explains why each number sits where it does.

Why those numbers: the four components that dictate them

Every SQL database, whether PostgreSQL 17+, MySQL 8.0, or SQL Server 2022, shares four components that each drive a specific row in the cheat sheet. The query processor parses, plans, and executes queries. The storage engine handles physical reads and writes. The transaction log (WAL in PostgreSQL, redo log in MySQL InnoDB) persists changes before commit. The buffer pool caches data pages in memory.

The query processor and the stale-statistics failure mode

The query processor drives the "query p95 2× baseline" alert. Its optimizer chooses a plan based on table statistics, and those statistics go stale the moment a batch load changes the row count without triggering a stats refresh. A table with 10 million rows whose stored statistics still claim 500,000 gets a full sequential scan where an index seek would have sufficed, and execution cost multiplies by orders of magnitude. What the monitoring dashboard shows is latency; what the database is doing is reading the entire heap.

This is why a post-deployment p95 spike is worth checking for statistics invalidation before other root causes: a schema migration or large insert is a common statistics-invalidation event in a team's weekly rhythm.

The buffer pool and the hit-ratio threshold

Properly sized memory is what separates a database that answers in milliseconds from one that answers in seconds, and the cache hit ratio alert measures exactly that. PostgreSQL's shared_buffers defaults to 128 MB, which is adequate for a laptop and absurd on a host with 50 GB of hot data. MySQL InnoDB's innodb_buffer_pool_size defaults to 128 MB for the same historical reason, though it can be resized dynamically since MySQL 5.7. SQL Server sizes its buffer pool automatically under the max server memory ceiling.

Every cache miss under an undersized buffer pool becomes a disk read, and the cost depends entirely on the medium:

| Medium | Read latency | Penalty vs. RAM (sub-1 µs) |
| --- | --- | --- |
| NVMe SSD | ~25 µs | ~25× |
| SATA SSD | 100–200 µs | 100–200× |
| 15,000 RPM enterprise HDD | 4,000–6,000 µs (4–6 ms) | 4,000–6,000× |
| 7,200 RPM consumer HDD | 10,000–15,000 µs (10–15 ms) | 10,000–15,000× |

At 10,000 queries per second, the difference between a 97% and an 87% hit ratio is the difference between a healthy database and a queue of backed-up requests.
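The penalty compounds as a simple weighted average. A sketch, assuming ~0.1 µs for a RAM hit and the NVMe figure from the table (both assumptions, not measurements):

```python
# Expected per-page-read latency as a function of buffer pool hit ratio.
def expected_read_us(hit_ratio: float, miss_latency_us: float,
                     ram_latency_us: float = 0.1) -> float:
    """Weighted average of RAM hits and storage misses, in microseconds."""
    return hit_ratio * ram_latency_us + (1 - hit_ratio) * miss_latency_us

NVME_US = 25.0  # from the table above
print(expected_read_us(0.97, NVME_US))  # ~0.85 us per read
print(expected_read_us(0.87, NVME_US))  # ~3.3 us per read, ~4x worse
```

A ten-point drop in hit ratio roughly quadruples per-read cost on NVMe; on spinning disk the same drop is a two-order-of-magnitude cliff, which is why the medium matters as much as the ratio.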

Changing shared_buffers requires a PostgreSQL restart, and you should update effective_cache_size at the same time so the planner accounts for the OS page cache on top of the buffer pool. Above roughly 8-10 GB the marginal return drops, so throwing RAM at the problem past that point is not the fix it looks like.

The transaction log and fsync latency

Every major engine writes durability records before acknowledging a commit, and that fsync cost is what the commit latency alert is measuring. If the log volume sits on the same disk as the data volume, and that disk is under I/O pressure from buffer pool flushes or cache-miss reads, transaction commits queue behind every other operation on the disk and commit times spike while the query execution clock looks fine.

The rule is boringly mechanical: put the log on its own volume, or verify that your cloud storage class gives the log volume headroom independent of the data volume.

The pipeline end-to-end


The four components do not fail independently. A stale-statistics problem generates a sequential scan, which blows through the buffer pool, which triggers disk I/O that contends with transaction log writes, which inflates commit latency. One regression, four cheat-sheet rows lit up at once. This component-to-component cascade is why concurrency, the other layer of runtime behavior, is the next piece of the justification.

Concurrency: the second layer of the 'why'

The cheat sheet's deadlock, lock wait, and autovacuum thresholds all come from how the database enforces isolation and durability under concurrent load.

What ACID actually costs

Atomicity pays for rollback capability with WAL writes on every transaction. Durability pays for crash safety with an fsync on commit, which is the reason the PostgreSQL synchronous_commit = off row in the cheat sheet exists. With async commit, writes return to the application before the fsync completes, and the exposure window on a crash is bounded by wal_writer_delay (default 200 ms). For event logs and session stores that is fine; for financial records it is not. MySQL exposes an equivalent lever through innodb_flush_log_at_trx_commit = 2, which flushes to the OS buffer once per second rather than on every commit and carries a crash-loss window of up to ~1 second.

Isolation pays in one of two currencies: lock contention or MVCC bookkeeping. You do not get to opt out of both.

MVCC and the dead tuple tax

PostgreSQL and MySQL InnoDB both use Multi-Version Concurrency Control. Readers get a consistent snapshot as of their transaction start; writers create new row versions rather than overwriting in place. The side effect, and the reason autovacuum is on the cheat sheet, is dead tuple accumulation. Every UPDATE or DELETE leaves an old row version behind, and that version stays reachable until no active snapshot still references it.

The default autovacuum_vacuum_scale_factor of 0.2 waits until 20% of a table has changed before vacuuming runs. On a table receiving millions of updates per day, 20% is a long time, and bloat pushes sequential scan cost upward while evicting live pages from the buffer pool (which is how the cache hit ratio row and the autovacuum row on the cheat sheet are really the same row, seen from two angles). The trigger is actually autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor × reltuples). On a 20-million-row table the 50-row default threshold is irrelevant, but on a 5,000-row lookup table the threshold dominates and should be scaled down proportionally.
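The trigger formula is easy to sanity-check numerically. A sketch using the defaults and the aggressive per-table values discussed in this section (scale factor 0.01, threshold 100):

```python
# Dead tuples required before autovacuum fires, per the formula in the text:
# autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples.
def autovacuum_trigger(reltuples: int, scale_factor: float = 0.2,
                       threshold: int = 50) -> int:
    return int(threshold + scale_factor * reltuples)

# 20M-row table on defaults: ~4 million dead tuples accumulate first.
print(autovacuum_trigger(20_000_000))             # 4000050
# Same table with the aggressive per-table settings: ~200k dead tuples.
print(autovacuum_trigger(20_000_000, 0.01, 100))  # 200100
# 5,000-row lookup table on defaults: the flat threshold is a visible
# share of the total, so scale it down alongside the scale factor.
print(autovacuum_trigger(5_000))                  # 1050
```

The 20× gap between the first two numbers is the bloat window you are buying back with the per-table override.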

In production, apply the aggressive scale factor per-table rather than globally:

ALTER TABLE high_churn_table
  SET (autovacuum_vacuum_scale_factor = 0.01,
       autovacuum_vacuum_threshold = 100);

This avoids triggering frequent vacuums on small or rarely-updated tables that a global change would also hit. MySQL handles the same problem through its purge thread, and a growing "History list length" in SHOW ENGINE INNODB STATUS is the canary that purging is falling behind.

SQL Server defaults to pessimistic row-level locking under READ COMMITTED, which means readers and writers compete for the same locks on the same rows. Read Committed Snapshot Isolation swaps this for a version-store model closer to PostgreSQL's MVCC, and on databases carrying mixed OLTP and reporting traffic it typically cuts reader-writer lock wait counts visibly, at the cost of additional tempdb write pressure.

Reading a deadlock graph

The deadlock row on the cheat sheet ("any non-zero count should page you") is defensible only if you know what to do with the graph when it fires. The classic two-transaction cycle looks like this when MySQL's SHOW ENGINE INNODB STATUS reports it:

------------------------
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
TRANSACTION 4212, ACTIVE 3 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 21, query id 112 localhost app updating
UPDATE accounts SET balance = balance - 100 WHERE id = 2

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 42 page no 4 n bits 72 index PRIMARY of table `shop`.`accounts`
trx id 4212 lock_mode X locks rec but not gap waiting

*** (2) TRANSACTION:
TRANSACTION 4213, ACTIVE 2 sec starting index read
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 22, query id 113 localhost app updating
UPDATE accounts SET balance = balance + 100 WHERE id = 1

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 42 page no 4 n bits 72 index PRIMARY of table `shop`.`accounts`
trx id 4213 lock_mode X locks rec but not gap

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 42 page no 4 n bits 72 index PRIMARY of table `shop`.`accounts`
trx id 4213 lock_mode X locks rec but not gap waiting

*** WE ROLL BACK TRANSACTION (1)

Read the graph in four steps. First, confirm both transactions touch the same table and index (both rows above live in PRIMARY of table 'shop'.'accounts'). Second, identify which rows each transaction already holds and which it is waiting for (transaction 2 holds id = 1 and wants id = 2; transaction 1 is the mirror). Third, note the query id of each waiter and walk it back through the application logs to find the callsite. Fourth, look at the order the rows are touched: one transaction updates id = 2 first, the other updates id = 1 first, and the inconsistent ordering is the actual bug. The fix is application-side, usually sorting locked keys before the transaction opens so that every caller acquires them in the same order.

PostgreSQL does not print a waits-for graph; it reports each deadlock as an ERROR: deadlock detected message with a DETAIL block per process. (Setting log_lock_waits = on additionally logs any lock wait that exceeds deadlock_timeout, which surfaces the near-deadlocks that resolve on their own.)

ERROR:  deadlock detected
DETAIL:  Process 18422 waits for ShareLock on transaction 9911; blocked by process 18423.
         Process 18423 waits for ShareLock on transaction 9912; blocked by process 18422.
         Process 18422: UPDATE accounts SET balance = balance - 100 WHERE id = 2;
         Process 18423: UPDATE accounts SET balance = balance + 100 WHERE id = 1;
HINT:  See server log for query details.

The same four-step read applies: same table (extract from the UPDATE fragments in DETAIL), held vs. waiting (each ShareLock on transaction line names the blocking PID), callsite (cross-reference the PID against pg_stat_activity at the time of the error), and access order (compare the row IDs across the two UPDATEs). Each PostgreSQL deadlock entry is complete but separate per process; on InnoDB the monitor only reports the most recent cycle.

The database is not broken. It detected the cycle, killed the cheaper transaction, and returned ERROR 1213: Deadlock found when trying to get lock; try restarting transaction (InnoDB) or a 40P01 SQLSTATE (PostgreSQL). Teams often spend days debugging application logic when the answer is a two-line change in the function that opens the transaction.

Long transactions as a connection-pool killer

A batch job that opens a transaction, processes 50,000 rows, and holds row-level locks for 90 seconds blocks concurrent OLTP writes against those same rows for the entire duration. Those writes do not fail. They queue behind the lock wait timeout, and while they queue, the connections they hold fill the application pool. A common first visible symptom is HTTP 503s at the load balancer, and the database-side lock wait often does not surface as an explicit error in the application logs. This is why the cheat sheet treats lock wait count as a rising-trend alert rather than a single-number threshold: the database is patient, and the pool dies first.
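The arithmetic behind "the pool dies first" is Little's law: blocked requests in flight equal arrival rate times lock hold time. A sketch with illustrative numbers (the 30 req/s rate and 50-connection pool are assumptions, not figures from the text):

```python
# Little's law sketch: L = arrival_rate x hold_time gives the number of
# requests stuck behind the long transaction's row locks.
def pool_exhaustion_seconds(arrival_per_s: float, pool_size: int) -> float:
    """Seconds until a fully-blocked request stream drains the pool."""
    return pool_size / arrival_per_s

# 30 req/s hitting rows locked for the batch job's 90 s hold:
blocked_in_flight = 30 * 90
print(blocked_in_flight)                        # 2700 queued requests
print(round(pool_exhaustion_seconds(30, 50), 1))  # a 50-connection pool is gone in ~1.7 s
```

The pool is exhausted in under two seconds while the database reports nothing but a rising lock wait count, which is exactly why the alert is a trend, not a threshold.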

Replication topology and lag as a first-class metric

Replication was a footnote in most database guides a decade ago. It is now the standard way to isolate analytics from OLTP, and in managed environments replication lag is second only to query latency among the alert categories that matter. Before you can reason about what lag signals, the topology itself has to justify its place in the runtime model, so start there.

Read replicas and materialized views for analytics isolation

Analytics queries create the opposite pressure from OLTP. A GROUP BY over 200 million rows, or a three-way join against a fact table, produces a plan that runs for minutes on OLTP-class hardware and scans so many pages that it evicts everything else from the buffer pool. Running that kind of query against the primary is how you destroy the cache hit ratio for every other workload at once.

Two architectural answers, used together more often than apart. A read replica takes the analytics traffic off the primary entirely; the primary's buffer pool stays warm with its real working set, and the replica can have its own planner settings tuned for long scans. A materialized view precomputes the aggregation so the analytics query reads kilobytes instead of gigabytes, and PostgreSQL's REFRESH MATERIALIZED VIEW CONCURRENTLY lets the refresh run without blocking concurrent reads on the view (though it does require a unique index on the view, and will error loudly if one is missing).

Replication lag as its own alert category

Once you have replicas, lag is a metric in its own right. The cheat sheet leaves the threshold blank on purpose: a 10-second lag is fine on an analytics replica and catastrophic on a read-your-writes OLTP replica, so the number is whatever your RPO says it is.

On PostgreSQL, the diagnostic query is:

SELECT application_name,
       client_addr,
       state,
       sync_state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;

The three _lag columns (available since PostgreSQL 10) return intervals, so the output reads directly as time and maps straight to RPO-based alert thresholds. The columns are cumulative stages (the standby writes WAL, then flushes it, then replays it), so read the gaps between them rather than each number in isolation. A high write_lag points at the network or a saturated WAL receiver on the replica. A flush_lag well above write_lag points at slow replica disk I/O. A replay_lag well above flush_lag usually means a long-running query on the replica is blocking WAL replay (PostgreSQL applies WAL on a single process, and a conflicting reader can hold it off).

When you need to pinpoint the bottleneck at byte granularity (for example, to estimate how many WAL segments a replica is behind), use pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) from the same view.

On MySQL, SHOW REPLICA STATUS gives you Seconds_Behind_Source, which is a reasonable first-cut metric with two failure modes to know. First, it returns NULL when the I/O thread is disconnected, so a disconnected replica shows no lag rather than infinite lag, and an alerting rule that pages on high values only will miss the outage entirely. Second, with GTID-based replication the value can understate real lag when the replica executes transactions out of commit-timestamp order. For anything beyond the first cut, compare GTID_SUBTRACT(@@GLOBAL.gtid_executed, Executed_Gtid_Set) between the source and the replica, or diff binlog positions directly.
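An alert rule that survives the first failure mode has to treat NULL as a page rather than as zero lag. A hedged sketch (the function name and 30-second RPO are illustrative):

```python
from typing import Optional

# Alert rule for Seconds_Behind_Source that handles the NULL failure mode:
# a disconnected I/O thread reports NULL, which is an outage, not zero lag.
def replica_lag_pages(seconds_behind: Optional[int], rpo_s: int = 30) -> bool:
    if seconds_behind is None:
        return True  # I/O thread disconnected: worst case, page immediately
    return seconds_behind > rpo_s

print(replica_lag_pages(None))  # True: disconnected replica must page
print(replica_lag_pages(5))     # False: within RPO
print(replica_lag_pages(120))   # True: beyond RPO
```

A rule written as `lag > threshold` alone silently passes on NULL in most alerting systems, which is precisely the miss described above.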

Rising lag is almost never a database bug. It is usually WAL sender saturation, a slow replica consumer, a network event, or a long-running replica query, in that order of likelihood.

Managed services: what you can and cannot tune

Every row of the opening cheat sheet assumes you can actually change the knob. On RDS, Cloud SQL, and Azure SQL, several of them are gated, and a few are gone.

Parameter group lockouts

Managed services expose their tuning surface through parameter groups (RDS, Cloud SQL) or database-scoped configuration (Azure SQL). The surface overlaps heavily with a self-managed deployment but is not identical: some parameters are dynamic and changeable at any time, some are static and require an instance reboot, and some are marked read-only and cannot be changed at all regardless of permissions.

On RDS PostgreSQL, shared_buffers, effective_cache_size, and work_mem are available but require a parameter group change and, for the first one, a reboot. wal_level is a static parameter: standard PostgreSQL only reads it at server startup, and in RDS it is controlled indirectly via the static rds.logical_replication parameter, which also requires an instance reboot. Changing it has cascading effects on replication topology. A handful of parameters that exist in self-managed PostgreSQL are not exposed at all; verify any cheat-sheet row against your parameter group before committing to a remediation plan in an incident.

On Azure SQL, the DTU model hides the concept of individual tuning knobs entirely in favor of a blended performance tier, while the vCore model exposes more traditional sizing levers. If you inherited a DTU-model database and the cheat sheet tells you to resize the buffer pool, the answer is "move to vCore or resize the tier."

Connection budgets and pooling

Managed services cap max_connections based on the instance class memory. An RDS db.t3.medium (4 GB RAM) lands around 450 connections, following a memory-derived formula that effectively divides available memory by a per-connection overhead constant. If your application opens pools of 50 threads per app server and you run 10 app servers, you have consumed the entire connection budget with a single tier before any batch jobs or admin sessions show up. The cheat sheet's "keep usage under 80%" row assumes you did this math during deployment; on managed services, that math is not optional.
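That budget math is worth encoding as a deploy-time check. A sketch using the cheat sheet's ×1.2 headroom factor and the instance-class figures above:

```python
# Deploy-time connection budget check from the cheat sheet:
# (app pool size) x (app servers) x 1.2 must stay under max_connections.
def connection_budget_ok(pool_size: int, app_servers: int,
                         max_connections: int, headroom: float = 1.2) -> bool:
    return pool_size * app_servers * headroom < max_connections

# 50-thread pools on 10 app servers against a db.t3.medium's ~450 ceiling:
print(connection_budget_ok(50, 10, 450))  # False: 600 demanded, add a pooler
print(connection_budget_ok(20, 10, 450))  # True: 240 fits with headroom
```

When the check fails, the fix is a pooler in front of the database, not a bigger instance class: the next section covers exactly that.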

PgBouncer or RDS Proxy sits between the app and the database and multiplexes connections so the backend count stays flat while the client count grows. Use transaction pooling mode rather than session pooling mode; session pooling holds a connection for the entire client session and saves nothing worth having. The trade-off used to be that transaction mode broke server-side prepared statements, forcing applications that relied on PREPARE/EXECUTE to move preparation client-side or accept session mode's ceiling. PgBouncer 1.21+ on PostgreSQL 14+ added protocol-level support for named prepared statements in transaction mode (via max_prepared_statements), removing that constraint for teams on current versions. On older PgBouncer or pre-PG14, the trade-off still stands.

Storage tiers and IOPS cliffs

The buffer cache hit ratio row on the cheat sheet assumes the working set either fits in RAM or falls back to consistently-fast storage. On RDS, "consistently fast" is a storage class, not a given. gp3 volumes provide a baseline 3,000 IOPS for volumes under 400 GB and 12,000 IOPS above that threshold for most database engines, with provisioned IOPS decoupled from storage size. io2 volumes provide provisioned IOPS contractually and are the right choice when your cache miss rate is high enough that fallback-to-storage is a hot path rather than a rare event. The older gp2 class uses a burst credit model where cross-medium deployments can hit a cliff when credits drain, and the symptom looks exactly like a buffer cache regression (latency climbs, hit ratio stays flat) even though the root cause is storage throttling.

Check the storage class before you act on a cache hit ratio alert on a managed deployment, because the right remediation is sometimes a storage class upgrade rather than a memory one. Cloud-native monitoring tools like Site24x7's database monitoring can surface this distinction automatically across RDS, Azure SQL, and Google Cloud SQL by correlating I/O metrics against buffer pool behavior in a single view.

Metrics and diagnostic follow-through

The architecture sections above walked through what each component does and named the cheat-sheet row each one drives. This section pairs each of those rows with the exact diagnostic query you run when the alert fires, so measurement and investigation stop being two separate stages. Treat the subsections below as the operator's checklist; the architectural explanation lives upstream.

Query execution time, with an EXPLAIN ANALYZE walkthrough

Track query execution time per query pattern, not as a single dashboard aggregate. A p95 number that blends every query in the system hides the one that actually regressed.

The sources:

  • PostgreSQL: pg_stat_statements, enabled via shared_preload_libraries = 'pg_stat_statements' in postgresql.conf followed by a restart, and then CREATE EXTENSION pg_stat_statements; in each database where you want visibility. The shared_preload_libraries change loads the module; the view is not queryable until the extension is installed.
  • MySQL: the slow query log, enabled via slow_query_log = ON and long_query_time = 1 in my.cnf.
  • SQL Server: sys.dm_exec_query_stats joined to sys.dm_exec_sql_text, available out of the box.

Once a query pattern trips the alert, EXPLAIN ANALYZE is the next command you run. An 800 ms query against a large members table looks like this:

EXPLAIN ANALYZE
SELECT * FROM members
WHERE subscription_state = 'active_paid'
  AND last_seen_at < NOW() - INTERVAL '90 days';

A plan that opens with Seq Scan on members (cost=0.00..45231.00 rows=2847182 width=...) (actual time=0.031..823.400 rows=2841000 loops=1) tells you the engine is reading every row. The fields that matter:

  • cost=0.00..45231.00 is the planner's estimated startup and total cost in arbitrary units, useful for comparing plans rather than reading as absolute time.
  • rows=2847182 is the planner's row estimate; compare it against the actual rows number in the parentheses to detect stale statistics.
  • actual time=0.031..823.400 is the real execution time in milliseconds, first row to last row.
  • The node with the highest actual time spread is where optimization effort should go, and in this plan it is the Seq Scan.
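The estimate-vs-actual comparison in the second bullet reduces to a single ratio. A sketch using the numbers from this plan and the earlier stale-statistics example (the function name and the 10× rule of thumb are illustrative choices, not PostgreSQL features):

```python
# Stale-statistics check: ratio of actual rows to the planner's estimate.
# A ratio far from 1.0 in either direction suggests the stats need a refresh.
def estimate_skew(planned_rows: int, actual_rows: int) -> float:
    return actual_rows / max(planned_rows, 1)

# The plan above: 2,847,182 estimated vs 2,841,000 actual. Estimates fresh.
print(round(estimate_skew(2_847_182, 2_841_000), 3))  # 0.998
# The earlier failure mode: stats claim 500k rows, the table holds 10M.
print(estimate_skew(500_000, 10_000_000))             # 20.0 -> run ANALYZE
```

Anything beyond roughly an order of magnitude of skew is a stronger lead than the node timings themselves, because the planner chose the whole plan shape from that bad number.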

A composite index changes the access pattern:

CREATE INDEX idx_members_state_seen
  ON members(subscription_state, last_seen_at);

After the index exists, EXPLAIN ANALYZE returns Index Scan using idx_members_state_seen with actual execution time orders of magnitude lower than the sequential scan. Same schema, same query, different access pattern. MySQL 8.0.18+ supports EXPLAIN ANALYZE FORMAT=TREE for equivalent runtime detail; EXPLAIN FORMAT=JSON gives plan structure without runtime timing.

One important caution the tooling does not remind you about: EXPLAIN ANALYZE on INSERT, UPDATE, or DELETE statements actually executes the statement. Always wrap the call in a transaction and roll back, or you will quietly modify live data.

Buffer cache hit ratio

When the cheat sheet's hit ratio alert fires, anything below 90% on OLTP warrants immediate investigation: insufficient memory, a cold cache after restart, or working-set growth past buffer pool capacity. OLAP workloads may tolerate lower ratios, so calibrate against your baseline rather than a universal number.

The diagnostic query for PostgreSQL pulls per-table hit ratios from pg_statio_user_tables so you can see exactly which tables are generating the disk reads:

SELECT ROUND(100.0 * heap_blks_hit / NULLIF(heap_blks_hit + heap_blks_read, 0), 1)
         AS cache_hit_pct,
       relname,
       heap_blks_read,
       heap_blks_hit
FROM pg_statio_user_tables
ORDER BY cache_hit_pct ASC
LIMIT 10;

Ordering by cache_hit_pct ascending surfaces the worst offenders first, which is the view you actually want during an incident. SQL Server's equivalent visibility comes through sys.dm_os_buffer_descriptors, aggregated by database_id for a per-database view.

Lock waits and deadlock counts

The deadlock row on the cheat sheet fires on any non-zero count in a 5-minute window. The lock wait row is a rising-trend alert because absolute values vary too much by workload to set a universal threshold.

For PostgreSQL, the live view of who is blocked on what comes from pg_stat_activity:

SELECT wait_event_type,
       wait_event,
       state,
       usename,
       application_name,
       query_start,
       query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND state <> 'idle'
ORDER BY query_start ASC NULLS LAST;

The usename and application_name columns give you attribution back to the source tier, which matters more than the PID during an incident. SQL Server's equivalent is sys.dm_os_wait_stats filtered on LCK_M_ wait types for the aggregate wait-class view, and sys.dm_exec_requests filtered on blocking_session_id IS NOT NULL for the live blocked sessions.

Connection usage

Sustained usage above 80% of the connection ceiling is the number to page on (some teams set 75-90% depending on risk tolerance). In PostgreSQL, SELECT count(*) FROM pg_stat_activity gives the live number; in MySQL, SHOW STATUS LIKE 'Threads_connected' returns the same value. On managed services, plug the current number into the max_connections ceiling from the parameter group and check against the 80% line before the alert ever fires.

Wait statistics on SQL Server

SQL Server's sys.dm_os_wait_stats classifies accumulated wait time by type and lets you answer "am I CPU-bound, I/O-bound, lock-bound, or memory-bound" as a first cut before committing to a deeper investigation. The wait classes that matter most:

  • SOS_SCHEDULER_YIELD for CPU waits
  • PAGEIOLATCH_* for I/O waits
  • LCK_M_* for lock waits
  • RESOURCE_SEMAPHORE for memory grant waits

Raw DMV output is dominated by benign background waits, so a filtered query is the one worth keeping. Paul Randal's widely-cited exclusion list filters the idle types and returns signal:

SELECT TOP 10
       wait_type,
       wait_time_ms / 1000.0 AS wait_s,
       (wait_time_ms - signal_wait_time_ms) / 1000.0 AS resource_s,
       signal_wait_time_ms / 1000.0 AS signal_s,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
        'SLEEP_TASK','BROKER_TO_FLUSH','SQLTRACE_BUFFER_FLUSH',
        'CLR_AUTO_EVENT','CLR_MANUAL_EVENT','LAZYWRITER_SLEEP',
        'SLEEP_SYSTEMTASK','WAITFOR','BROKER_EVENTHANDLER',
        'BROKER_RECEIVE_WAITFOR','BROKER_TASK_STOP','DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT','XE_DISPATCHER_WAIT','XE_TIMER_EVENT',
        'REQUEST_FOR_DEADLOCK_SEARCH','CHECKPOINT_QUEUE','TRACEWRITE')
  AND wait_time_ms > 0
ORDER BY wait_time_ms DESC;

The top row of this output is the bottleneck class, and it changes every subsequent diagnostic step.

Start collecting signal today

If you have not instrumented any of the above, three commands per engine get you to "I can see the slowest queries" without any external tooling.

PostgreSQL, after pg_stat_statements is enabled and the extension is created:

SELECT query,
       calls,
       mean_exec_time,
       stddev_exec_time,
       rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Surfacing mean_exec_time and stddev_exec_time directly (instead of computing an average from total_exec_time / calls) makes regressions jump out: a high standard deviation on a query that used to run flat is usually a parameter-sensitive plan or a missing index on a newly-common parameter value.
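A variation-focused cut of the same view ranks queries by coefficient of variation, which surfaces exactly that parameter-sensitive pattern. This is a sketch assuming the PG13+ column names; the calls > 100 filter is an arbitrary floor to exclude one-off noise:

```sql
-- Queries whose run-to-run spread is large relative to their mean:
-- the signature of a parameter-sensitive plan.
SELECT query,
       calls,
       mean_exec_time,
       stddev_exec_time,
       stddev_exec_time / NULLIF(mean_exec_time, 0) AS cv
FROM pg_stat_statements
WHERE calls > 100
ORDER BY cv DESC NULLS LAST
LIMIT 10;
```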

MySQL, after enabling the slow query log:

mysqldumpslow -s at /var/log/mysql/mysql-slow.log

The -s at flag sorts by average time per query rather than total time. Average time is the right sort for spotting regressions (the query that used to be fast and now is not); total time is the right sort for spotting high-frequency cost hogs that were always a little slow. You can run both; this version picks average-time because it catches the kind of incident this article is about.

SQL Server, directly against the DMVs:

SELECT TOP 20
       (qs.total_elapsed_time / 1000.0) / qs.execution_count AS avg_elapsed_ms,
       qs.execution_count,
       qs.last_execution_time,
       qt.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY avg_elapsed_ms DESC;

Dividing by 1000.0 gets you milliseconds rather than microseconds (easier to eyeball), and including last_execution_time lets you spot recently-compiled plans that may still be in their post-deployment shakedown window.

SQL as its own observability instrument

The five metric categories above cover operational threshold failures. They do not cover the class of failure where the data itself is silently wrong, and that class requires a different instrument.

Consider a high-churn table where rows are supposed to receive a refresh event within a short time of creation. The alerting tooling can tell you the write rate, the read rate, the cache hit ratio, and the query latency. None of them can tell you that a subset of rows is being inserted and then never updated, which is the failure mode of a downstream worker that quietly stopped processing a specific partition:

SELECT DATE(stale_since)             AS day,
       COUNT(*)                       AS stale_count
FROM accounts
WHERE stale_since >= NOW() - INTERVAL '14 days'
  AND last_touched_at < NOW() - INTERVAL '24 hours'
GROUP BY DATE(stale_since)
ORDER BY day DESC;

A row whose stale_since is inside the expected window but whose last_touched_at has not advanced in 24 hours is a missed refresh event, not a latency spike. The query is cheap when stale_since and last_touched_at are indexed and expensive when they are not, and running it on a schedule catches the kind of incident the dashboard is structurally unable to see.
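Assuming the accounts schema above, the supporting index is a single composite; a sketch:

```sql
-- Composite index supporting the staleness scan. CONCURRENTLY avoids
-- blocking writes on a high-churn table (it cannot run inside a transaction).
CREATE INDEX CONCURRENTLY idx_accounts_staleness
    ON accounts (stale_since, last_touched_at);
```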

The second pattern worth running is the orphan-row check. Foreign key constraints catch orphans at write time when they exist, but schemas that grew under application-layer integrity enforcement often lack constraints in some places. The anti-join surfaces the rows that should not exist:

SELECT i.id,
       i.customer_id,
       i.created_at
FROM invoices i
LEFT JOIN customers c ON c.id = i.customer_id
WHERE c.id IS NULL
  AND i.created_at >= NOW() - INTERVAL '7 days';

A non-empty result set here is almost always a bug in the delete path of the parent table: someone deleted customers without cleaning up their invoices, and the integrity of every downstream report that joins on customer_id is quietly wrong. Alerting tools do not generate this query. You have to write it.
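Where the schema can tolerate write-time enforcement, the durable fix is adding the missing constraint itself. NOT VALID is PostgreSQL syntax that skips scanning existing rows, so the constraint can go in before the existing orphans are cleaned up:

```sql
-- Guard new writes immediately; existing rows are not checked yet.
ALTER TABLE invoices
    ADD CONSTRAINT fk_invoices_customer
    FOREIGN KEY (customer_id) REFERENCES customers (id)
    NOT VALID;

-- After remediating the existing orphans, validate the backlog.
ALTER TABLE invoices VALIDATE CONSTRAINT fk_invoices_customer;
```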

These patterns are not examples of what SQL can do. They are signal you cannot get any other way.

Three incidents, easiest to hardest

Three worked examples, ordered by diagnostic difficulty rather than frequency. The easiest is the most mechanical, and the hardest involves the most engine-specific knowledge. Each scenario closes with the cheat-sheet row that would have caught it earlier, which is the point of having a cheat sheet in the first place.

Scenario A: buffer cache hit ratio drops from 98% to 87% overnight

Run the per-table hit-ratio query from the buffer cache section above. Tables with low hit ratios are generating disk reads; cross-reference them against pg_stat_statements for queries whose blks_read climbed after the last deployment. The usual culprits are a new query doing a full sequential scan on a large table, data growth that pushed the working set past shared_buffers, or a missing index introduced by a schema migration.

If a newly-deployed query is the cause, add a covering index. If data growth is the cause, raise shared_buffers to 25% of available system RAM (PostgreSQL's dedicated-host guideline), keeping in mind that the change requires a restart and that effective_cache_size needs to move with it.
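On PostgreSQL the two settings move together; a sketch using ALTER SYSTEM with placeholder sizes for a hypothetical 32 GB dedicated host:

```sql
-- Placeholder sizes for a 32 GB dedicated host; calibrate to yours.
ALTER SYSTEM SET shared_buffers = '8GB';         -- 25% of RAM; requires a restart
ALTER SYSTEM SET effective_cache_size = '24GB';  -- ~75% of RAM; a reload suffices
-- Both land in postgresql.auto.conf; restart to apply shared_buffers.
```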

The cheat-sheet row that would have caught it earlier: the cache hit ratio alert, set to page on "below 95% sustained." If the ratio had been paging the team at 94% instead of being noticed at 87%, the investigation would have started half a day earlier with a much smaller blast radius.

Scenario B: application returns lock wait timeout errors during peak traffic

On PostgreSQL, run the pg_stat_activity query from the lock waits section. Identify the session holding the lock that the waiting sessions need, then look at its query_start. A transaction open for 45 minutes during a window where the scheduled batch job runs for 5 minutes tells you the batch job never committed, and the batch job's held row-level locks are what the OLTP traffic is queueing behind.
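On PostgreSQL 9.6+, pg_blocking_pids() gives a compact holder/waiter pairing without joining pg_locks by hand; a sketch:

```sql
-- Each waiting session alongside the PIDs blocking it and its query age.
SELECT pid,
       usename,
       state,
       now() - query_start       AS query_age,
       pg_blocking_pids(pid)     AS blocked_by
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```

A query_age measured in tens of minutes on the PID everyone is blocked_by is the never-committed batch transaction described above.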

On SQL Server, the equivalent path is sys.dm_exec_requests filtered on blocking_session_id IS NOT NULL for the blocked-sessions view and sys.dm_os_wait_stats filtered on LCK_M_ for the wait-type distribution.

The remediation is one of three, in increasing order of intrusiveness: isolate batch processing to a maintenance window; enable RCSI on SQL Server so readers proceed against the version store while writers continue updating live rows; or split the batch transaction into smaller units so no single commit window is wide enough to queue the OLTP traffic behind it.
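The middle option is a single statement on SQL Server. The database name is a placeholder, and the ALTER needs exclusive access for a moment; WITH ROLLBACK IMMEDIATE terminates other sessions rather than waiting on them:

```sql
-- Enable Read Committed Snapshot Isolation for the whole database.
ALTER DATABASE YourDatabase
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;
```

Plan the cutover in a quiet window: ROLLBACK IMMEDIATE rolls back every open transaction in that database, and RCSI adds row-versioning overhead in tempdb.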

Earlier detection: the lock wait count rising-trend alert. Lock waits do not go from zero to crisis in a single minute; they climb for the length of the batch job, and the rising trend is visible for twenty to thirty minutes before the first HTTP 503 shows up in the load balancer logs.

Scenario C: query p95 latency doubles after a deployment

On SQL Server, the prime suspect is parameter sniffing. The optimizer caches an execution plan on first execution using the literal parameter values passed at that moment. If those values are skewed against the overall distribution, every subsequent call runs the suboptimal cached plan and latency climbs without a corresponding change in workload.

Start with the sys.dm_exec_query_stats query from the "Start collecting signal today" section. To isolate the exact statement rather than the full batch text, replace qt.text with SUBSTRING(qt.text, (qs.statement_start_offset/2)+1, ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(qt.text) ELSE qs.statement_end_offset END - qs.statement_start_offset)/2)+1). Look for queries whose avg_elapsed_ms climbed while execution_count stayed flat or grew.
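Substituted into the DMV query, the statement-level version reads as follows; the offset arithmetic follows the documented convention (offsets are in bytes over Unicode text, hence the divide-by-two):

```sql
SELECT TOP 20
       (qs.total_elapsed_time / 1000.0) / qs.execution_count AS avg_elapsed_ms,
       qs.execution_count,
       qs.last_execution_time,
       SUBSTRING(qt.text,
                 (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                     WHEN -1 THEN DATALENGTH(qt.text)
                     ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY avg_elapsed_ms DESC;
```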

Then retrieve the cached plan via sys.dm_exec_query_plan to compare against a recompile:

-- @handle: the sql_handle of the suspect query, taken from the
-- dm_exec_query_stats output in the previous step
SELECT qp.query_plan
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
WHERE qs.sql_handle = @handle;

The returned query_plan is XML. SSMS renders it as a graphical plan, and you can compare it directly against the plan produced by re-running the same query with OPTION (RECOMPILE). If the recompiled plan is meaningfully different and meaningfully faster, sniffing is confirmed. Remediation options:

  • Immediate incident mitigation: DBCC FREEPROCCACHE(plan_handle) evicts the bad plan so the next call recompiles.
  • Permanent per-query fix: add OPTION (RECOMPILE) as a query hint, accepting the compilation cost on every execution.
  • Plan stability alternative: OPTION (OPTIMIZE FOR UNKNOWN) tells the optimizer to use average distribution statistics rather than first-call parameter values, which avoids the worst-case skew without paying the per-execution recompile cost.
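The three options map to T-SQL as follows. This is a sketch: dbo.orders, @region, and the truncated handle are placeholders, and the parameterized queries are assumed to live inside a procedure or sp_executesql batch:

```sql
-- 1. Immediate mitigation: evict only the bad plan. DBCC FREEPROCCACHE
--    takes a literal, so paste the plan_handle from sys.dm_exec_query_stats.
DBCC FREEPROCCACHE (0x060005...);  -- hypothetical handle, truncated

-- 2. Per-query fix: fresh plan on every execution, paying compile cost each time.
SELECT order_id, total
FROM dbo.orders
WHERE region = @region
OPTION (RECOMPILE);

-- 3. Plan-stability alternative: one plan built from average statistics.
SELECT order_id, total
FROM dbo.orders
WHERE region = @region
OPTION (OPTIMIZE FOR UNKNOWN);
```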

On PostgreSQL, the same symptom more often traces to the stale-statistics failure mode described in the query processor section. Run ANALYZE tablename after any large data load so the planner picks a correct plan on the next execution.

Prevention point: a per-pattern p95 alert set to 2× baseline would have flagged the regression on the first post-deployment execution, rather than at whatever arbitrary threshold the aggregate dashboard happened to cross.

Operational hazards and compatibility notes

Small-print items that would have bloated earlier sections but are worth knowing. Several of these cause incidents rather than mere confusion, so read the section as risk, not trivia.

Reading deadlock graphs gets harder with three or more transactions. The two-transaction case in the concurrency section is the textbook shape; real production deadlocks often involve a third transaction holding a shared lock that neither cycle participant can bypass, and the InnoDB deadlock monitor only reports the most recent cycle rather than the full waits-for graph. On SQL Server, capture the full graph via Extended Events with the xml_deadlock_report event rather than relying on the system health session alone. On PostgreSQL, each deadlock log entry stands alone per process, so capturing a cycle with three or more participants means joining the pg_stat_activity history for the PIDs listed in each DETAIL block.
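A minimal Extended Events session for that capture might look like this; the session and file names are placeholders:

```sql
-- Capture full deadlock graphs to an event file that survives restarts.
CREATE EVENT SESSION deadlock_capture ON SERVER
    ADD EVENT sqlserver.xml_deadlock_report
    ADD TARGET package0.event_file (SET filename = N'deadlock_capture.xel')
    WITH (STARTUP_STATE = ON);  -- restart the session automatically with the server

ALTER EVENT SESSION deadlock_capture ON SERVER STATE = START;
```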

All pg_stat_statements queries in this article use PG13+ column names (total_exec_time, mean_exec_time, stddev_exec_time). If you are still on a pre-13 version, the older names are total_time, mean_time, and stddev_time.

For teams that want execution plan context correlated against APM trace IDs without writing the correlation layer manually, commercial tooling exists that handles this as a built-in view. ManageEngine Applications Manager covers the on-premise side, while Site24x7's database monitoring provides the cloud/SaaS counterpart for RDS, Aurora, Azure SQL, and self-managed instances. Both surface the correlation next to the sys.dm_exec_query_stats join from Scenario C, rather than replacing it.

The cheat sheet rows are not independent alerts. They form a causal chain: stale statistics trigger sequential scans, which blow through the buffer pool, which contend with transaction log writes, which inflate commit latency. When one row fires, the diagnostic path starts by checking whether the upstream component caused it. Think in chains, not rows, and the right fix surfaces faster.

Pick one row from the alerting cheat sheet and turn it into a live signal by Friday. If you have paging infrastructure, wire the threshold into your on-call rotation. If you do not, schedule the matching diagnostic query as a cron job that writes to a log file you check daily. One row, one threshold, one query.
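A sketch of the cron variant, assuming psql, a database named appdb, and the staleness query from earlier saved to a file (all three names are placeholders):

```shell
# Hypothetical crontab entry: run the staleness check hourly, append to a log
# reviewed daily. Connection details and paths must match your host.
0 * * * *  psql -d appdb -f /opt/checks/stale_rows.sql >> /var/log/db-checks.log 2>&1
```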
