DEV Community

Damaso Sanoja


Stop Tuning Blind: Query Observability as the Foundation for Database Optimization

A team notices a checkout endpoint slowing down. Response times have crept from 80ms to 900ms over two weeks, but the infrastructure dashboard shows nothing abnormal. So the engineer does what most teams do first: adds an index on the column mentioned in the ticket, deploys, and moves on.

Two weeks later, the same endpoint is slow again. A different engineer adds another index. Then another. The table now carries 23 indexes. Every INSERT pays write amplification across all of them. The original slow query is still slow, because the root cause was never the missing index. Stale statistics after a schema migration had triggered a plan regression, and no one caught it because no one was watching query-level execution data.

This guide inverts the usual approach. Instead of starting with indexing techniques and treating observability as an afterthought, it starts with the telemetry pipeline: how to capture query-level execution data, correlate it with application traces, and build the feedback loop that makes every subsequent optimization decision measurable. From there, it moves into execution plan analysis, indexing strategies, and resource management, each one grounded in the signals your pipeline surfaces. The principles apply across PostgreSQL, MySQL, and most relational engines. It assumes working knowledge of SQL and basic database administration.

Instrumenting before you optimize

Database optimization requires three categories of signals, and most teams have at best one of them in place.

The first is query execution metrics: per-query call count, mean latency, execution-time standard deviation, rows scanned versus rows returned, and cache hit ratio. In PostgreSQL, pg_stat_statements captures these directly, with one gap: it exposes no percentiles, and stddev_exec_time is the closest proxy it provides. Precise p99 figures require pg_stat_monitor (which adds histogram-based latency distributions) or an external metrics store. Enable pg_stat_statements by adding the extension to shared_preload_libraries, restarting the server, and creating the extension in each target database:

-- postgresql.conf (restart required after saving)
-- In managed clouds like AWS RDS or GCP Cloud SQL, enable via Parameter Groups or database flags
shared_preload_libraries = 'pg_stat_statements'
-- pg_stat_statements.track = top   -- default: tracks only top-level statements
-- Set to 'all' if your workload runs queries inside functions or stored procedures
-- After restart, run in each target database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top consumers by total execution time
SELECT query, calls, total_exec_time, rows,
       mean_exec_time, stddev_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

In MySQL, the Performance Schema is enabled by default and provides equivalent data. Sort by total time consumed, not worst-case single execution. A query that takes 20ms per call but runs 50,000 times per hour contributes 1,000 seconds of database time, far more than a 5-second query that runs twice a day.
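A sketch of the equivalent Performance Schema query, aggregated by statement digest (note that the timer columns are in picoseconds, hence the conversion to seconds):

```sql
-- Top consumers by total time in MySQL (timer columns are in picoseconds)
SELECT DIGEST_TEXT,
       COUNT_STAR                AS calls,
       SUM_TIMER_WAIT / 1e12     AS total_exec_seconds,
       SUM_ROWS_EXAMINED,
       SUM_ROWS_SENT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
```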

The second signal is infrastructure-level database metrics: connection counts, operation rates, and table I/O. The OpenTelemetry Collector (otelcol-contrib, not the core distribution) scrapes these on a configurable interval with no application code changes:

First, create the monitoring user with the required permissions:

-- Create monitoring user (PostgreSQL 10+)
CREATE USER otel_monitor WITH PASSWORD 'your_password';
GRANT pg_monitor TO otel_monitor;  -- covers pg_stat_statements, pg_stat_activity, etc.
-- If pg_monitor is unavailable (pre-10), grant individually:
-- GRANT SELECT ON pg_stat_statements TO otel_monitor;
-- GRANT SELECT ON pg_stat_user_tables TO otel_monitor;
-- On AWS RDS and GCP Cloud SQL, pg_monitor is available and the preferred approach.

Then configure the collector:

receivers:
  postgresql:
    endpoint: localhost:5432
    username: otel_monitor
    password: ${env:PG_PASSWORD}
    collection_interval: 30s
    databases:
      - myapp_prod

processors:
  batch:

exporters:
  otlp:
    endpoint: your-backend:4317

service:
  pipelines:
    metrics:
      receivers: [postgresql]
      processors: [batch]
      exporters: [otlp]

The third signal is application traces. Auto-instrumentation libraries for most languages and database clients (Python and Java have the most mature support; Go and Rust require more manual setup) emit a trace span for every database call, carrying the query text and operation type as span attributes. Without application-level tracing, you can identify slow queries but not which service, endpoint, or user action generated them.

With all three in place, build a baseline dashboard before changing anything. Run four panels for at least one full business cycle (24 to 48 hours): top queries by total execution time, active connections over time, cache hit ratio, and index scan versus sequential scan ratio per table. Grafana works well for this. The baseline is what you compare against after every optimization. Skip it, and you can't confirm whether a change helped or quantify by how much.
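Two of those panels can be fed straight from the statistics views; a sketch using the standard pg_statio_user_tables and pg_stat_user_tables views:

```sql
-- Cache hit ratio per table (share of heap reads served from shared buffers)
SELECT relname,
       heap_blks_hit::numeric
         / NULLIF(heap_blks_hit + heap_blks_read, 0) AS cache_hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;

-- Index scan vs. sequential scan ratio per table
SELECT relname, idx_scan, seq_scan,
       idx_scan::numeric
         / NULLIF(idx_scan + seq_scan, 0) AS index_scan_ratio
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 20;
```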

The next section uses these signals to read execution plans and identify what needs fixing.

Reading what your telemetry surfaces

Your pipeline is collecting query metrics, infrastructure signals, and application traces. The next step is interpreting what they reveal. Three patterns account for the majority of production database problems, and each one leaves a distinct signature in your telemetry before it becomes a user-facing incident.

Plan regressions

Plan regressions appear as a sudden or gradual increase in execution time for a specific query fingerprint, with no corresponding change in query text. The query planner makes cost-based decisions using statistics about row counts and value distributions. When those statistics go stale after a bulk load, a migration, or months of organic growth, the planner's row estimate diverges from reality, and the planner picks a worse access path. Your pg_stat_statements data will show the regression as a jump in mean_exec_time for that fingerprint. The execution plan confirms it.

Running EXPLAIN ANALYZE on the offending query produces the actual execution, not just the planner's estimate. Here is what a plan regression looks like in practice:

EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 42;

-- Output (simplified):
-- Seq Scan on events  (cost=0.00..18450.00 rows=50 width=64)
--                     (actual time=0.042..312.7 rows=180000 loops=1)
--   Filter: (user_id = 42)
--   Rows Removed by Filter: 320000
-- Planning Time: 0.08 ms
-- Execution Time: 458.3 ms

The planner estimated 50 rows; the actual count was 180,000, a 3,600x divergence. The Seq Scan node confirms no index was used, even though one exists on user_id. The Rows Removed by Filter line shows 320,000 rows were read and discarded. Refreshing statistics manually after large data changes is standard practice:

-- PostgreSQL: refresh statistics for a specific table
ANALYZE events;

-- MySQL: equivalent command
ANALYZE TABLE events;

After running ANALYZE, re-execute the EXPLAIN ANALYZE. If the row estimate now matches reality and the planner switches to an index scan, stale statistics were the root cause.

Stale statistics are the most common trigger, but plan regressions can also surface through changes in join strategy or CTE materialization. Nested loop joins are efficient when one side is small and indexed; hash joins handle larger unindexed sets, and merge joins work best on pre-sorted input. When the planner switches strategy between deploys, your execution plan will show the new join node and your pg_stat_statements data will show the performance delta. The same diagnostic applies: compare estimated versus actual rows and check whether stale statistics or data growth changed the cost calculation.

A related case is Common Table Expression materialization. In PostgreSQL 12 and later, CTEs are inlined by default if they are non-recursive, referenced only once, and free of side-effects. In PostgreSQL 11 and earlier, all CTEs are materialized as optimization fences, preventing predicate pushdown into the CTE body. When a CTE is referenced multiple times, PostgreSQL still materializes it to avoid duplicate computation unless you explicitly specify NOT MATERIALIZED. If your telemetry shows a query scanning far more rows than expected through a CTE, check whether materialization is forcing a full scan where a filtered one would suffice. The first diagnostic question is whether the CTE executes once per query or once per row in a join.
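Both directions can be forced explicitly; a minimal sketch for PostgreSQL 12+, using a hypothetical seven-day window over the events table:

```sql
-- Force the fence, e.g. to compute an expensive result exactly once:
WITH recent AS MATERIALIZED (
  SELECT * FROM events WHERE created_at > now() - interval '7 days'
)
SELECT * FROM recent WHERE user_id = 42;

-- Force inlining so the user_id predicate pushes down into the scan:
WITH recent AS NOT MATERIALIZED (
  SELECT * FROM events WHERE created_at > now() - interval '7 days'
)
SELECT * FROM recent WHERE user_id = 42;
```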

Contention

Contention shows a different signature. Instead of one query getting slower, many connections wait on the same resource simultaneously. A SHOW PROCESSLIST (MySQL) or SELECT * FROM pg_stat_activity (PostgreSQL) during the incident might show 140 connections blocked on a table-level lock held by a single long-running transaction.

Your telemetry surfaces this pattern through execution time variance. The same query fingerprint alternates between 5ms and 4 seconds depending on whether it hits the lock window, producing a high stddev_exec_time relative to mean_exec_time in pg_stat_statements. When you see that ratio spike, investigate lock waits before assuming a plan problem. Contention-driven variance affects multiple unrelated fingerprints at the same time; if only a single fingerprint shows high stddev, the cause is more likely an inherently variable workload than a locking issue.
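A quick check for that signature, flagging fingerprints whose variance dwarfs their mean; the 2x cutoff and 100-call floor here are arbitrary starting points, not standards:

```sql
-- Fingerprints whose execution time varies far more than their mean suggests
SELECT query, calls, mean_exec_time, stddev_exec_time,
       stddev_exec_time / NULLIF(mean_exec_time, 0) AS variance_ratio
FROM pg_stat_statements
WHERE calls > 100
  AND stddev_exec_time > 2 * mean_exec_time
ORDER BY variance_ratio DESC;
```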

To identify the blocking session, use pg_blocking_pids() (PostgreSQL 9.6+):

-- Find blocking sessions and what they are running
SELECT
  blocked.pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;

The MySQL equivalent joins performance_schema.data_lock_waits with performance_schema.threads.
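A sketch of that join for MySQL 8.0, mapping each lock-waiting thread to the thread that blocks it:

```sql
-- Find blocking sessions in MySQL 8.0 via the Performance Schema
SELECT
  rt.PROCESSLIST_ID   AS waiting_connection,
  rt.PROCESSLIST_INFO AS waiting_query,
  bt.PROCESSLIST_ID   AS blocking_connection,
  bt.PROCESSLIST_INFO AS blocking_query
FROM performance_schema.data_lock_waits w
JOIN performance_schema.threads rt ON rt.THREAD_ID = w.REQUESTING_THREAD_ID
JOIN performance_schema.threads bt ON bt.THREAD_ID = w.BLOCKING_THREAD_ID;
```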

Maintenance drift

Maintenance drift is the slowest-moving pattern, and the hardest to notice because no single event triggers it. Over weeks and months, dead index entries accumulate from row updates and deletes, statistics go stale as migrations reshape data distributions, and indexes that once matched hot access patterns quietly fall out of alignment with what the application actually queries. None of this shows up on a standard infrastructure dashboard.

What your telemetry does surface is a gradual increase in the rows-scanned-to-rows-returned ratio across multiple query fingerprints, often paired with a declining cache hit ratio. When a query scans 200,000 rows to return 40, the planner is telling you it can't satisfy that predicate with any existing index. A partial or expression index often closes the gap.

Diagnostic triage: from signal to action

The following decision tree maps each telemetry pattern to its diagnostic path and the section that addresses the fix:

flowchart TD
    A["Telemetry signal detected"] --> B{"Signal pattern?"}
    B -->|"mean_exec_time jump,<br>single fingerprint"| C["Plan regression"]
    B -->|"High stddev_exec_time,<br>multiple fingerprints"| D["Contention"]
    B -->|"Gradual scan ratio rise,<br>cache hit ratio decline"| E["Maintenance drift"]
    C --> F["EXPLAIN ANALYZE: compare<br>estimated vs. actual rows"]
    F -->|"Stale statistics"| G["ANALYZE table, re-check plan"]
    F -->|"Wrong access path"| H["See: Indexing decisions"]
    D --> I["pg_stat_activity /<br>SHOW PROCESSLIST"]
    I -->|"Connection saturation"| J["See: Connection pooling"]
    I -->|"Single lock holder"| K["Identify blocking transaction"]
    E --> L["pgstatindex for bloat /<br>table size for growth"]
    L -->|"Index bloat"| M["See: Index maintenance"]
    L -->|"Unbounded table growth"| N["See: Table partitioning"]

Once you know which queries need attention and why the planner chose poorly, the next question is what structural change fixes it. Indexing decisions, grounded in the signals your telemetry just surfaced, are where that answer starts.

Indexing decisions driven by what the data shows

The next step is the structural change that fixes what the planner got wrong. Indexing is the most common response to a slow query, and the most commonly misconfigured one. A well-chosen index can cut execution time by orders of magnitude; a poorly chosen one adds write overhead with no measurable read benefit. The difference depends on matching the index design to what your signals actually showed.

Composite index column ordering

An Index Scan in the execution plan does not guarantee efficiency. If the planner is still reading far more rows than it returns, the index exists but its column order doesn't match the query's predicate structure. The general rule for multi-column indexes: equality predicates go first, then sorting columns (for ORDER BY or GROUP BY), and range predicates go last.

Consider a query filtering by user_id and ranging on created_at:

-- Suboptimal: range predicate on the leading column
-- The index can only be used for the created_at range;
-- user_id filtering happens after the scan, not during it
CREATE INDEX idx_events_ts_user ON events (created_at, user_id);
SELECT * FROM events WHERE created_at > '2024-01-01' AND user_id = 42;

-- Correct: equality first, range last
-- The index narrows to all rows for user 42, then scans only the timestamp range
CREATE INDEX idx_events_user_ts ON events (user_id, created_at);
SELECT * FROM events WHERE user_id = 42 AND created_at > '2024-01-01';

Putting user_id first collapses the initial scan to a single user's rows before the range scan begins. The same principle extends to sorting: placing a range predicate before the sort column can force an expensive in-memory sort instead of using the index's native ordering.
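The same idx_events_user_ts index can absorb a sort as well; assuming it exists as defined above, the planner can walk the index backward and skip the sort node entirely:

```sql
-- Served by idx_events_user_ts (user_id, created_at): rows for user 42
-- are already stored in created_at order, so no separate sort is needed
SELECT * FROM events
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 50;
```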

Partial (filtered) indexes

When the scan ratio is high only for queries targeting a narrow subset, like the few thousand pending rows in a million-row job queue, a full index wastes I/O on rows those queries never touch.

-- Only index rows where work still needs to happen
CREATE INDEX idx_jobs_pending ON jobs (created_at) WHERE status = 'pending';

The resulting index is orders of magnitude smaller than the full alternative. Because the query planner recognizes the predicate, it uses the partial index directly for queries that include WHERE status = 'pending'. The trade-off is specificity: if your application queries other status values with similar frequency, you'll need separate partial indexes or a full one.

Expression (functional) indexes

Sometimes the predicate itself is the problem. When a query filters on a transformed column like LOWER(email), a standard B-tree index on the raw column is useless because the planner cannot match the transformation to the stored index entries. An expression index indexes the output of the function, not the column itself:

-- Case-insensitive email lookup
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';

-- JSON field extraction
CREATE INDEX idx_events_payload_type ON events ((payload->>'event_type'));
SELECT * FROM events WHERE payload->>'event_type' = 'checkout';

The query predicate must match the indexed expression exactly. WHERE LOWER(email) = '...' hits idx_users_email_lower, while WHERE email ILIKE '...' does not, because the planner treats them as distinct operations. MySQL supports expression indexes from version 8.0 with the same identity requirement.

Covering indexes

The heap fetch is one of the most overlooked performance bottlenecks. Even when the planner picks the right index and row estimates are accurate, each index hit triggers a random I/O back to the table to retrieve columns not stored in the index. A covering index eliminates that secondary lookup by including every column the query needs.

-- Hot path query on a multi-tenant SaaS table
SELECT user_id, email, created_at FROM users
WHERE tenant_id = 12 AND active = true;

-- Covering index satisfies the full query from the index alone
CREATE INDEX idx_users_tenant_active
  ON users (tenant_id, active)
  INCLUDE (user_id, email, created_at);

The INCLUDE clause attaches non-key columns to the index leaf pages without affecting the B-tree structure. PostgreSQL and SQL Server support it directly. MySQL (InnoDB) has no INCLUDE keyword, but every secondary index already carries the Primary Key at its leaf nodes, so you achieve the same effect by appending the extra columns to a standard index definition.
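A sketch of that MySQL workaround (the index name is hypothetical):

```sql
-- MySQL (InnoDB): no INCLUDE keyword, so append the needed columns as trailing keys
CREATE INDEX idx_users_tenant_active_covering
  ON users (tenant_id, active, email, created_at);
-- user_id comes along implicitly if it is the Primary Key, since InnoDB
-- secondary indexes store the PK at their leaf nodes
```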

The payoff is most pronounced on frequently executed queries where the heap fetch accounts for a measurable share of execution time. The cost is a larger index and added write overhead per row change, so covering indexes make sense for critical hot paths, not general use.

Index bloat and maintenance

Your telemetry shows a pattern consistent with maintenance drift: cache hit ratio declining gradually, scan times rising across multiple query fingerprints with no corresponding change in query text or data volume. Dead index entries from row updates and deletes are a common cause. In PostgreSQL, the pgstattuple extension provides the pgstatindex function to measure B-tree bloat directly via page density:

-- Install the extension once per database (required before pgstatindex is available)
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT * FROM pgstatindex('idx_events_user_ts');
-- avg_leaf_density dropping significantly below its baseline is a signal worth investigating;
-- no single universal threshold applies, but sustained readings below ~70% are a commonly
-- cited starting point; treat it as a prompt to investigate trends, not a hard trigger

When bloat reaches the point where rebuilds are warranted, most engines can do it online. PostgreSQL offers REINDEX CONCURRENTLY (available since PostgreSQL 12); MySQL's InnoDB rebuilds indexes in-place via ALTER TABLE ... FORCE or OPTIMIZE TABLE. How often you need to rebuild depends on write volume.

Both engines include automatic maintenance, but the defaults assume moderate write loads. PostgreSQL's autovacuum fires when the fraction of dead rows in a table crosses autovacuum_vacuum_scale_factor, which defaults to 0.2 (20%). For a 1,000-row lookup table, that threshold is fine. For a 10-million-row events table, it means 2 million dead rows can accumulate before cleanup begins. MySQL's InnoDB purge thread handles dead-row cleanup continuously, but under heavy update workloads the purge lag (History list length in SHOW ENGINE INNODB STATUS) can grow faster than the thread drains it, producing similar bloat symptoms.
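On the MySQL side, the purge lag is queryable without parsing SHOW ENGINE INNODB STATUS output; this sketch assumes the trx_rseg_history_len counter is enabled, as it is by default in 8.0:

```sql
-- InnoDB history list length (purge lag) as a plain query
SELECT name, count
FROM information_schema.INNODB_METRICS
WHERE name = 'trx_rseg_history_len';
-- A count that grows faster than it drains means the purge
-- thread is falling behind the update workload
```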

In PostgreSQL, you can identify tables where autovacuum is falling behind:

-- Identify tables where autovacuum is not keeping up
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

-- Override autovacuum threshold for a specific high-churn table (no restart required)
ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.01);
-- Now autovacuum fires after 1% dead rows instead of 20%

Unused index audit

Every index adds overhead to every write operation and the overhead compounds silently. The intro scenario's 23-index table is an extreme case, but smaller versions of the same problem are common. Auditing for indexes your query workload never uses is as important as adding new ones. In PostgreSQL, pg_stat_user_indexes exposes idx_scan counts per index.

Any index with zero or near-zero scans after weeks of production traffic is a candidate for removal, with two caveats. First, make sure the index isn't enforcing a UNIQUE constraint or Primary Key, since these do critical work enforcing data integrity on every write, even if never explicitly scanned by a SELECT. Second, make sure your observation window doesn't miss heavy seasonal queries, such as end-of-month reporting or quarterly rollups.

Indexing addresses the query path. The next layer is the infrastructure around it: connection management, data layout, and write throughput.

Managing the infrastructure on which your queries run

Indexing optimized the query path. Three infrastructure-level bottlenecks can negate those gains: connection exhaustion under load, scan costs that grow with table size despite correct indexes, and write latency amplified by row-at-a-time inserts. Each surfaces in your telemetry before it becomes a production incident.

Connection pooling and routing

The contention pattern from the previous sections, where 140 connections were blocked on a table-level lock, often starts as a connection management problem. Most relational databases carry overhead per connection: process or thread creation, memory allocation, and authentication. In PostgreSQL, idle connections share most memory pages with the parent process via Copy-on-Write, but actual overhead ranges from under 2 MB (with huge pages and minimal prior activity) to over 10 MB, depending on shared_buffers size and prior query activity. Active connections cost far more: work_mem is allocated per sort or hash node in the query plan (default 4 MB each), so a complex query with multiple such nodes can consume a multiple of that figure. Connection poolers like PgBouncer (PostgreSQL) and ProxySQL (MySQL and PostgreSQL) multiplex many application connections onto a smaller pool of database connections.

The architectural decision is the pooling mode. Session mode maps each application connection to a dedicated database connection for its lifetime, preserving session state (prepared statements, advisory locks). Transaction mode returns connections to the pool after each commit, enabling higher concurrency, but breaks any session-scoped feature. Audit your application's session-level usage before migrating modes. For read-heavy workloads with replicas, ProxySQL can route SELECT queries to replicas and writes to the primary at the proxy layer. The trade-off is replication lag: reads immediately after writes may not reflect the latest state.
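A minimal PgBouncer sketch of the transaction-mode trade-off, in the same spirit as the collector config above; the values are illustrative, not recommendations:

```ini
; pgbouncer.ini (illustrative values; tune to your workload)
[databases]
myapp_prod = host=localhost port=5432 dbname=myapp_prod

[pgbouncer]
pool_mode = transaction     ; breaks session-scoped features; audit the app first
default_pool_size = 20      ; database connections per user/database pair
max_client_conn = 500       ; application connections multiplexed onto the pool
```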

Table partitioning

Your telemetry shows correct index usage, the planner picks the right index, row estimates are accurate, but execution time still grows month over month. The table itself is growing, and even a good index scan takes longer when the underlying B-tree is larger. Range partitioning on a timestamp column addresses this by enabling partition pruning: when a query includes a predicate on the partition key, the database scans only the relevant partitions.

-- Parent table: partitioned by month on created_at
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    user_id    bigint NOT NULL,
    action     text NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- One child partition per month
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

A query filtering to the last 30 days on a table partitioned by month typically scans 2 partitions rather than the full table. The execution plan confirms pruning via a Partitions field or equivalent. Teams typically automate partition maintenance (creating future partitions in advance and detaching old ones) with pg_partman, a PostgreSQL extension that manages partition creation and retention on a configurable schedule. Without this automation, INSERT statements targeting a date range with no corresponding partition will fail at runtime.
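As a sketch of that automation, assuming pg_partman 4.x (the create_parent signature changed in 5.0) installed in the conventional partman schema, registering the parent table looks roughly like:

```sql
-- pg_partman 4.x: register the parent and let maintenance pre-create partitions
CREATE EXTENSION IF NOT EXISTS pg_partman;
SELECT partman.create_parent(
  p_parent_table => 'public.events',
  p_control      => 'created_at',
  p_type         => 'native',
  p_interval     => 'monthly',
  p_premake      => 4   -- keep four future monthly partitions ready
);
-- partman.run_maintenance() is then scheduled (cron or pg_cron)
-- to create future partitions and detach expired ones
```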

Batch write throughput

Row-at-a-time inserts pay two costs per statement: a network round trip to the server and index maintenance across every index on the table. Batching rows into a single INSERT pays both costs once per statement instead of once per row. Hundreds to thousands of rows per statement typically deliver 10 to 20x throughput improvement on bulk loads, depending on row width and network latency.

Each engine imposes a ceiling on batch size. SQL Server caps parameterized queries at 2,100 parameters. MySQL's max_allowed_packet rejects oversized payloads and closes the connection entirely; check the current limit with SHOW VARIABLES LIKE 'max_allowed_packet' and increase it globally in my.cnf or via SET GLOBAL max_allowed_packet = 134217728 (existing connections pick up the new default on reconnection). PostgreSQL's extended query protocol caps any single parameterized statement at 65,535 bind parameters. In practice, chunking into batches of 1,000 to 5,000 rows is the sweet spot across all three engines.
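The shape of the batched statement itself, with hypothetical values; drivers typically generate this form from a chunked array of rows:

```sql
-- One statement, many rows: one round trip, one index-maintenance
-- pass per index, instead of one of each per row
INSERT INTO events (user_id, action, created_at) VALUES
  (42, 'page_view',   now()),
  (42, 'add_to_cart', now()),
  (87, 'checkout',    now());
-- Keep chunks in the 1,000-5,000 row range to stay under
-- the per-engine parameter and packet limits above
```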

With the query path and infrastructure tuned, the remaining question is where automation can reduce the ongoing maintenance burden.

Automating optimization and anomaly detection

The telemetry pipeline, execution plan analysis, indexing strategy, and infrastructure tuning covered so far are manual disciplines. Each requires an engineer to interpret signals and decide on a change. Two categories of automation can reduce that burden without replacing the judgment behind it.

Workload-aware index recommendations

Tools like EverSQL ingest production query logs or slow query exports, build a workload model from query fingerprints, simulate execution plans, and generate index recommendations ranked by estimated improvement. Some also suggest query rewrites. The value is prioritization: instead of manually reviewing pg_stat_statements output to decide which query to optimize first, the tool ranks candidates by aggregate impact and proposes a specific structural change. But no recommendation should go straight to production. Treat these recommendations as a starting point, not a deployment-ready output. Check whether the recommended index covers a write-heavy table, since read performance gains come at the cost of write amplification across every INSERT and UPDATE. Confirm that any rewritten query produces identical results under edge-case data distributions, not just the common case the tool optimized for.

Anomaly detection on query metrics

ML-based anomaly detection on time-series query execution metrics can flag plan regressions post-deployment without requiring manual baseline comparison. This addresses the intro scenario directly: the checkout endpoint's latency crept from 80ms to 900ms over two weeks, with no alert firing because no static threshold was breached. An anomaly detector trained on per-fingerprint latency distributions would flag a 10x deviation from the rolling baseline within hours, not weeks.

This is more useful than static thresholds because it adapts to traffic patterns. A query that naturally runs slower during batch jobs at 2 AM shouldn't generate a 3 AM alert. However, effective anomaly detection requires long-term retention of per-fingerprint query metrics, either through your database's built-in statistics views or the external metrics store your OTel pipeline already feeds.
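Even without an ML model, a rolling z-score over retained metrics approximates this; a sketch assuming a hypothetical query_latency table (queryid, captured_at, mean_exec_time) fed by the pipeline, with an arbitrary 3-sigma cutoff:

```sql
-- Flag fingerprints whose last hour deviates >3 sigma from a 14-day baseline
WITH baseline AS (
  SELECT queryid,
         avg(mean_exec_time)         AS base_mean,
         stddev_samp(mean_exec_time) AS base_stddev
  FROM query_latency
  WHERE captured_at BETWEEN now() - interval '14 days'
                        AND now() - interval '1 hour'
  GROUP BY queryid
)
SELECT l.queryid, l.mean_exec_time,
       (l.mean_exec_time - b.base_mean)
         / NULLIF(b.base_stddev, 0) AS z_score
FROM query_latency l
JOIN baseline b USING (queryid)
WHERE l.captured_at > now() - interval '1 hour'
  AND (l.mean_exec_time - b.base_mean)
        / NULLIF(b.base_stddev, 0) > 3;
```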

Managed database automation

Cloud-managed databases increasingly bundle automatic index recommendations (Azure SQL Database, Amazon RDS Performance Insights) and compute auto-scaling. These autonomous features reduce operational overhead but operate within the bounds set by schema structure and access patterns, both of which require human decisions upstream. They handle the maintenance loop. They don't replace the diagnostic skill of reading an execution plan or the architectural judgment of choosing a partitioning strategy.

Building a measurable feedback cycle

Whether you automate parts of the maintenance cycle or handle every step manually, the principle is the same: every optimization needs a closed feedback loop to prove it worked.

With the pipeline described in this guide, the opening scenario plays out differently. The pg_stat_statements baseline catches the mean_exec_time regression within a day. The EXPLAIN ANALYZE output reveals a 3,600x row estimate divergence, pointing to stale statistics after the schema migration. Running ANALYZE on the affected table restores the correct execution plan. The unused index audit flags 19 of those 23 indexes as candidates for removal. The baseline dashboard confirms the fix: execution time drops, write throughput recovers, and the next regression, whenever it arrives, will surface in the same pipeline before a user files a ticket.

The underlying shift is structural: from reacting to symptoms toward building a system that surfaces causes. Query-level telemetry provides the signals. Execution plan analysis reveals what the planner decided and whether it decided well. From there, indexing and infrastructure changes become the levers, and the baseline dashboard closes the loop by confirming whether pulling a lever worked. Each piece feeds the next.

Database optimization is not a one-time project. It's a feedback loop. The teams that maintain fast, reliable databases over time are not the ones with the best indexing intuition. They're the ones whose instrumentation tells them where to look next. Start with pg_stat_statements or Performance Schema, build the four-panel baseline, and let the data show you where your first optimization should land.
