Nobody plans a three-dashboard monitoring setup. It grows on its own. You deploy MySQL, so you add mysqld_exporter. The team moves a workload to RDS, so you wire up a CloudWatch integration. Then MongoDB Atlas enters the stack, and Atlas ships its own metrics view. Three databases, three dashboards, three alert pipelines, zero correlation between them.
At 2:47am, that fragmentation has a price. A p99 latency spike fires an alert, and you spend fifteen minutes switching between tools before tracing it to a missing index. The data existed in three places. The relationship between those data points existed in none.
That gap is the difference between metric collection and observability. Metric collection tells you something crossed a threshold. Observability gives you the distributed trace connecting an application service, a SQL statement, host disk I/O, and a slow query log entry into one causal chain, so you can answer why without adding new instrumentation after the incident starts.
Most production environments already run this kind of mixed stack. PostgreSQL handles transactional writes, MongoDB stores document data, Aurora or RDS manages read-heavy workloads, and a Redis or Memcached caching layer sits adjacent to all of it. This guide focuses on primary data stores: SQL, NoSQL, and cloud-managed databases. Caching layers have a different telemetry profile and are outside scope here. Each engine has a different telemetry model, a different collection method, and a different set of signals that actually predict trouble. Stitching observability across the full mix is the hard part, and it starts with knowing which signals to watch per engine.
What to actually monitor, by database type
A single mysqld_exporter instance can publish hundreds of Prometheus series. PostgreSQL's statistics collector exposes a comparable volume. During an incident, almost none of that matters. What matters is the handful of signals that predict user-facing degradation before it becomes a page.
SQL databases: PostgreSQL and MySQL
The signals worth watching for PostgreSQL and MySQL:
- Query latency at p50, p95, and p99. Average latency hides the outliers your users actually feel. A mean of 12ms tells you nothing if the p99 is 800ms, because that 1% of slow requests lands on real user sessions and drives timeout errors, retry storms, and SLA breaches.
- Active connections versus connection limit. On PostgreSQL, compare numbackends in pg_stat_database against max_connections. On MySQL, compare Threads_connected from SHOW GLOBAL STATUS against the max_connections system variable. Connection saturation causes query queuing before it causes timeouts.
- Cache hit ratio. On PostgreSQL, that's heap_blks_hit / (heap_blks_hit + heap_blks_read) from pg_statio_user_tables. A ratio below 95% signals trouble; aim for 99%. On MySQL, the equivalent is the InnoDB buffer pool hit ratio: 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests) from SHOW GLOBAL STATUS, where the same 99%+ target applies.
- Replication lag in seconds. On PostgreSQL, query pg_stat_replication for replay_lag. Lag that climbs steadily means replicas are falling behind on writes, and read queries hitting those replicas will return stale data.
- Lock wait count. Rising lock contention is the precursor to deadlocks. A sustained increase in waiting locks means transactions are blocking each other, and throughput will degrade before any single query times out.
- Slow query rate over a rolling window. A sudden increase in the proportion of queries exceeding your slow-query threshold (typically 100ms-1s depending on workload) signals a regression, whether from a bad deployment, plan change, or resource contention.
Most of these signals aren't surfaced in default dashboards. You need to query them directly to establish a baseline before automating collection.
The PostgreSQL cache hit ratio from pg_statio_user_tables:
SELECT
round(
sum(heap_blks_hit)::numeric / nullif(sum(heap_blks_hit + heap_blks_read), 0),
4
) AS hit_ratio
FROM pg_statio_user_tables;
The nullif call guards against division-by-zero on a cold instance where no blocks have been read yet. The round wrapper gives you a clean four-decimal ratio instead of a long float.
For query-level performance, pg_stat_statements is where the data lives on PostgreSQL. Once the extension is enabled (see the implementation section), this query pulls the top 15 queries by total execution time:
SELECT
left(query, 80) AS query_preview,
calls,
round((total_exec_time / 1000)::numeric, 2) AS total_time_sec,
round((mean_exec_time)::numeric, 2) AS avg_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 15;
The ordering matters. A query called 50,000 times at 2ms each burns far more total database time than one called 10 times at 500ms, yet only the latter trips a slow-query alert. Ranking by cumulative time surfaces both patterns.
On MySQL, the equivalent lives in the Performance Schema. The events_statements_summary_by_digest table provides normalized query fingerprints with execution counts, total latency, and lock time:
SELECT
LEFT(DIGEST_TEXT, 120) AS query_digest,
COUNT_STAR AS exec_count,
ROUND(SUM_TIMER_WAIT / 1e12, 3) AS total_sec,
ROUND(AVG_TIMER_WAIT / 1e12, 3) AS avg_sec,
SUM_ROWS_EXAMINED,
SUM_ROWS_SENT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 15;
MySQL's Performance Schema stores timer values in picoseconds, so the / 1e12 conversion gives you seconds. The SUM_ROWS_EXAMINED versus SUM_ROWS_SENT comparison is useful too: a large gap between examined and sent rows often points to missing indexes.
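That examined-versus-sent gap can be ranked directly. A sketch of the check (the 1,000-row floor is an arbitrary cutoff to skip trivial queries; tune it to your workload):

```sql
-- Rows examined per row sent; high ratios are missing-index candidates.
-- NULLIF guards against digests that returned zero rows.
SELECT
  LEFT(DIGEST_TEXT, 120) AS query_digest,
  COUNT_STAR AS exec_count,
  ROUND(SUM_ROWS_EXAMINED / NULLIF(SUM_ROWS_SENT, 0), 1) AS examined_per_sent
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_ROWS_EXAMINED > 1000
ORDER BY examined_per_sent DESC
LIMIT 10;
```

A query scanning 500,000 rows to return 20 will surface at the top here long before it dominates total execution time.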
MySQL replication lag is available via SHOW REPLICA STATUS\G under the Seconds_Behind_Source field. If you're still on a version before 8.0.22, the command is SHOW SLAVE STATUS and the field is Seconds_Behind_Master; both old names were dropped entirely in MySQL 8.4. One caveat: this metric measures delay at the SQL apply thread, not end-to-end data freshness. Under multi-source replication or GTID-based topologies, it can report zero while a channel is actually stalled. Percona's pt-heartbeat (or a custom heartbeat table that your application writes to and replicas read from) gives you a ground-truth lag measurement independent of the replication thread's self-reporting.
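A minimal version of that heartbeat pattern, assuming a dedicated monitoring schema (the schema and table names are illustrative) and a source-side job that refreshes the row every second or so:

```sql
-- On the source: the application or a cron job updates the timestamp.
CREATE TABLE IF NOT EXISTS monitoring.heartbeat (
  id INT PRIMARY KEY,
  ts TIMESTAMP(3) NOT NULL
);
REPLACE INTO monitoring.heartbeat (id, ts) VALUES (1, NOW(3));

-- On each replica: lag is wall-clock now minus the last replicated heartbeat.
SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(3)) / 1e6 AS lag_seconds
FROM monitoring.heartbeat
WHERE id = 1;
```

One caveat of this approach: it assumes source and replica clocks are NTP-synchronized, since clock skew shows up directly in the measurement.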
NoSQL databases: MongoDB
MongoDB's signals that matter:
- Operation latency from serverStatus.opLatencies, broken down by reads, writes, and commands. Separating read and write latency is critical because MongoDB workloads are often asymmetric, and a write latency spike won't show up in a combined average if reads dominate throughput.
- Queue depth via globalLock.currentQueue.total. A rising queue means operations are arriving faster than the engine can execute them. Sustained queue growth precedes the latency cliff where response times go nonlinear.
- Replication oplog window in hours. This is your buffer before a lagging secondary falls off the oplog and needs a full resync. An oplog window under 4 hours on a write-heavy deployment leaves little recovery margin (community discussion on oplog sizing shows operators typically target 24+ hours). Your safe minimum depends on how long a full resync takes in your environment.
- WiredTiger cache utilization as a ratio of bytes in cache to the configured maximum (default: the larger of 50% of (RAM minus 1 GB) or 256 MB). When the internal cache fills, eviction pressure forces the engine to discard and re-read pages more frequently. The resulting latency pattern looks like disk-bound behavior but originates inside the storage engine's own memory management, not the OS page cache. You won't identify this eviction-driven latency from host-level memory metrics alone.
All four signals come from a single shell command. Run db.runCommand({ serverStatus: 1 }) and extract what you need:
const s = db.runCommand({ serverStatus: 1 });
// Operation latency (microseconds) — split by read/write/command
printjson(s.opLatencies);
// Queue depth — operations waiting for execution
print("Queued ops:", s.globalLock.currentQueue.total);
// WiredTiger cache pressure — ratio approaching 1.0 means eviction trouble
const used = s.wiredTiger.cache["bytes currently in the cache"];
const max = s.wiredTiger.cache["maximum bytes configured"];
print("Cache fill:", (used / max).toFixed(3));
For the oplog window, db.getReplicationInfo().timeDiff / 3600 gives you hours of runway before a lagging secondary needs a full resync.
Atlas users: On MongoDB Atlas, serverStatus access depends on your cluster tier (M10+ for full stats). Atlas exposes metrics through its own Monitoring UI and the Atlas Administration API. The OTel mongodb receiver connects to Atlas clusters via SRV connection strings (mongodb+srv://) with SCRAM authentication.
Cloud-managed databases: RDS, Aurora, and Cloud SQL
With managed databases, you don't have SSH access or direct access to system views. The signals that matter are the same (connections, IOPS, replication, storage), but collection runs through cloud provider APIs instead.
The signals to watch (metric names below use AWS CloudWatch conventions; Azure Monitor and GCP Cloud Monitoring expose equivalents under different names, e.g., connection_count on Cloud SQL, connection_successful on Azure SQL):
- DatabaseConnections versus the engine's max connection limit. Managed instances enforce the same connection ceiling as self-hosted engines, but you can't tune OS-level socket limits to buy time. When you hit the cap, new connections are refused outright.
- ReadIOPS and WriteIOPS versus provisioned IOPS limits. Exceeding provisioned IOPS triggers throttling at the storage layer, adding latency that looks like slow queries but originates below the engine. The queries themselves haven't changed; the disk can't keep up.
- FreeStorageSpace. Alert before autoscaling triggers, not after. Autoscaling events cause a brief I/O pause on some instance types, and if autoscaling is disabled, a full volume means writes stop entirely.
- ReplicaLag. Same concern as self-managed replication: read replicas serving stale data. The difference is that you can't inspect the replication thread directly, so this CloudWatch metric is your only visibility into how far behind a replica has fallen.
- CPUCreditBalance on burstable instance types (T3, T4g). A depleted credit balance is a hidden latency trigger that looks like a CPU spike but is actually credit exhaustion. Once credits hit zero, the instance is capped at baseline CPU, and every query slows down uniformly.
Collection runs through CloudWatch GetMetricData for RDS and Aurora, the Azure Monitor REST API for Azure SQL, and the Cloud Monitoring API for Cloud SQL.
The resolution tradeoff with CloudWatch matters. Standard RDS metrics publish at 1-minute intervals. AWS Enhanced Monitoring drops that to 1-second granularity for OS-level metrics, and Performance Insights adds DB load sampling at 1-second resolution with query-level attribution (the per-second samples are aggregated to produce the Top SQL view; query statistics themselves come from engine-level stats). Note: AWS has announced the Performance Insights console experience will reach end-of-life on June 30, 2026, with functionality migrating to CloudWatch Database Insights. Native engine-level metrics through CloudWatch stay at 1-minute resolution, so transient sub-minute anomalies at the engine level are invisible by default.
Universal signals across all database types
Regardless of engine, four metrics travel across any database and make cross-database comparison possible: query error rate, connection pool saturation (used / max), query throughput (QPS or TPS), and disk I/O wait percentage.
| Signal | PostgreSQL | MySQL | MongoDB | AWS RDS / Aurora |
|---|---|---|---|---|
| Query latency | pg_stat_statements (total_exec_time) | events_statements_summary_by_digest | opLatencies (reads/writes/commands) | ReadLatency, WriteLatency |
| Connection pressure | numbackends vs max_connections | Threads_connected vs max_connections | currentQueue.total | DatabaseConnections vs engine max |
| Cache health | heap_blks_hit ratio (target ≥99%) | InnoDB buffer pool hit ratio | WiredTiger cache fill ratio | BufferCacheHitRatio |
| Replication delay | pg_stat_replication.replay_lag | Seconds_Behind_Source (or pt-heartbeat) | oplog window in hours | ReplicaLag (seconds) |
| Slow query signal | pg_stat_statements + slow log | slow_query_log + Perf Schema | currentOp + database profiler | Performance Insights / Database Insights |
| Storage / I/O pressure | blks_read, I/O wait % | Innodb_data_reads, I/O wait % | WiredTiger eviction rate | WriteIOPS vs provisioned IOPS |
Knowing which signals matter is the first step. Collecting them consistently across every engine in a single pipeline is the next.
Building a unified telemetry pipeline
Three approaches exist for collecting database telemetry in production, each with a different tradeoff between setup speed, vendor independence, and long-term maintenance cost:
- Vendor agents with proprietary instrumentation. Fastest to deploy and lowest initial maintenance since the vendor manages the agent lifecycle. The cost is vendor independence: switching backends means re-instrumenting everything.
- Prometheus exporters (postgres_exporter, mysqld_exporter, mongodb_exporter). Moderate setup, vendor-neutral, and battle-tested. Maintenance stays low once running, but they're metric-only. They don't share a data model with your application traces, so correlation requires stitching across separate pipelines.
- OpenTelemetry Collector with database-specific receivers. Its postgresql, mysql, and mongodb receivers normalize metrics into shared semantic conventions, so telemetry from different engines lands in a comparable format. Fully vendor-portable and trace-aware, but the most setup effort upfront and the highest ongoing maintenance (config drift, biweekly releases, semantic convention changes).
This guide uses the OTel Collector path. As of 2026, OpenTelemetry is the de facto standard for new observability instrumentation, and it's the only option above that unifies database metrics and application traces under the same data model. Building on proprietary agents now means repeating this work at the next platform migration.
Two common deployment patterns exist. In agent mode, a Collector runs on each database host, collects local metrics, and forwards them to a central gateway or directly to the backend. In gateway mode, a centralized Collector reaches out to remote database endpoints. Agent mode gives you host-level correlation for free (the Collector inherits host.id). Gateway mode reduces the number of Collector instances to manage. Most production setups use agent mode for self-managed databases and gateway mode for cloud-managed instances where you can't deploy locally.
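In agent mode, host attributes don't appear by magic: the usual approach is the resourcedetection processor from the contrib distribution, which stamps host.name and host.id onto everything the local Collector emits. A sketch of the relevant fragment, reusing the receiver and exporter names from the configs in this guide:

```yaml
processors:
  resourcedetection:
    detectors: [system, env]   # detect host attributes from the OS and env vars
    system:
      hostname_sources: ["os"]

service:
  pipelines:
    metrics:
      receivers: [postgresql]
      processors: [resourcedetection]
      exporters: [otlp/primary]
```

With those attributes in place, database metrics from this Collector and host metrics from the same machine share a resource identity, which is what the correlation section later relies on.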
The following sections walk through receiver configuration for each database type, starting with PostgreSQL.
Setting up the PostgreSQL receiver
One gotcha before the first receiver config: the postgresql, mysql, and mongodb receivers ship in the contrib distribution, not the core binary. Download otelcol-contrib (also available as Docker image otel/opentelemetry-collector-contrib) or the receivers won't be available. The configs below were validated against otelcol-contrib v0.115.0. Receiver config schemas can change between releases; check the receiver README for your installed version if you encounter validation errors.
Create a dedicated monitoring user on your PostgreSQL instance (PostgreSQL 10+):
CREATE ROLE otel_reader WITH LOGIN PASSWORD 'change_me';
GRANT pg_monitor TO otel_reader;
pg_monitor is a built-in role (introduced in PostgreSQL 10) that bundles read access to every statistics view the receiver needs: activity stats, background writer stats, database-level stats, and pg_stat_statements if the extension is loaded. On PostgreSQL 9.x, you'll need to grant SELECT on each view individually since the bundled role doesn't exist.
A minimal OTel Collector configuration:
receivers:
postgresql:
endpoint: localhost:5432
username: otel_reader
password: "${env:PGMON_PASS}"
databases:
- app_prod
- app_analytics
collection_interval: 20s
tls:
insecure: true # disable for production; configure certs instead
exporters:
otlp/primary:
endpoint: "otel-gateway.internal:4317"
service:
pipelines:
metrics:
receivers: [postgresql]
exporters: [otlp/primary]
Two details worth noting. The tls: insecure: true flag disables TLS verification, acceptable for local development but not production. The ${env:VAR_NAME} syntax is the Collector's built-in expansion for OS environment variables. The Collector doesn't read .env files, so set them before starting the process (e.g., export PGMON_PASS=secret && ./otelcol-contrib --config config.yaml).
The postgresql receiver pulls metrics from pg_stat_bgwriter, pg_stat_database, and related system views. At the span level, verify that db.system.name, db.operation.name, and db.query.text attributes are populating (these are the current names per OTel Semantic Conventions v1.33.0). Older documentation may reference the deprecated db.system, db.operation, and db.statement attributes, so check which version your instrumentation library implements.
Setting up the MySQL receiver
The same pattern applies: create a monitoring user, then point the receiver at it.
-- MySQL 8.0+ monitoring role
CREATE USER 'otel_reader'@'localhost' IDENTIFIED BY 'change_me';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'otel_reader'@'localhost';
GRANT SELECT ON performance_schema.* TO 'otel_reader'@'localhost';
receivers:
mysql:
endpoint: localhost:3306
username: otel_reader
password: "${env:MYMON_PASS}"
collection_interval: 20s
tls:
insecure: true
service:
pipelines:
metrics:
receivers: [mysql]
exporters: [otlp/primary]
The mysql receiver collects from SHOW GLOBAL STATUS, SHOW REPLICA STATUS, and performance_schema tables. Enable Performance Schema (performance_schema=ON in my.cnf) for query-level metrics. It has been on by default since MySQL 5.6.6, so most installations already have it active.
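A quick way to confirm before pointing the receiver at the instance:

```sql
-- Expect: ON. The variable is read-only at runtime, so changing it
-- means editing my.cnf and restarting mysqld.
SHOW VARIABLES LIKE 'performance_schema';
```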
Collecting CloudWatch metrics for RDS
Cloud-managed databases don't allow local agent deployment, so the collection path differs. The OTel Collector's awscloudwatch receiver only supports logs, not metrics. For RDS metric collection through the OTel pipeline, the proven approach is YACE (Yet Another CloudWatch Exporter), a Prometheus exporter maintained under the prometheus-community org. YACE polls CloudWatch's GetMetricData API and exposes the results as Prometheus metrics, which the Collector scrapes via its prometheus receiver.
YACE uses the standard AWS credential chain (instance profile, AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, or ~/.aws/credentials). The IAM principal requires cloudwatch:GetMetricData, cloudwatch:ListMetrics, and tag:GetResources permissions.
YACE configuration (yace-config.yml):
apiVersion: v1alpha1
discovery:
jobs:
- type: AWS/RDS
regions:
- eu-west-1
metrics:
- name: DatabaseConnections
statistics: [Average]
period: 300
length: 300
- name: ReadIOPS
statistics: [Average]
period: 300
length: 300
- name: ReplicaLag
statistics: [Maximum]
period: 300
length: 300
YACE auto-discovers all RDS instances in the specified region. To limit to specific instances, add a searchTags filter with a tag key/value pair you've applied to your RDS instances.
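For example, assuming you've applied a hypothetical monitoring=enabled tag to the instances you want scraped, the job entry gains:

```yaml
      searchTags:
        - key: monitoring
          value: enabled
```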
YACE exposes metrics on port 5000 by default. Point the OTel Collector's prometheus receiver at it:
receivers:
prometheus/cloudwatch:
config:
scrape_configs:
- job_name: yace-rds
scrape_interval: 300s
static_configs:
- targets: ["localhost:5000"]
service:
pipelines:
metrics:
receivers: [prometheus/cloudwatch]
exporters: [otlp/primary]
The scrape_interval should match YACE's period to avoid gaps or duplicate data points.
Setting up the MongoDB receiver
Back to the standard pattern for self-managed instances. Create a monitoring user with the clusterMonitor role:
// Run in mongosh connected to the admin database
use admin;
db.createUser({
user: "otel_reader",
pwd: "change_me",
roles: [
{ role: "clusterMonitor", db: "admin" },
{ role: "read", db: "local" } // needed for oplog access
]
});
receivers:
mongodb:
hosts:
- endpoint: mongo-primary.internal:27017
username: otel_reader
password: "${env:MONGOMON_PASS}"
collection_interval: 20s
tls:
insecure: true
This receiver collects the serverStatus metrics covered earlier (operation latency, queue depth, WiredTiger cache utilization, and replication oplog data) without requiring manual shell queries. For Atlas clusters, the same receiver connects via SRV connection strings (mongodb+srv://) with SCRAM authentication; replace the endpoint with your Atlas SRV URI.
The complete pipeline
With all four receivers configured, the pipeline routes through a single Collector.
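A sketch of the combined service section, assuming the receiver and exporter names used in the earlier fragments:

```yaml
service:
  pipelines:
    metrics:
      receivers: [postgresql, mysql, mongodb, prometheus/cloudwatch]
      exporters: [otlp/primary]
```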
All telemetry, from a PostgreSQL instance on-prem, a MongoDB Atlas cluster, or an RDS replica in us-east-1, routes through the same collector, lands in the same backend, and shares the same resource attributes (host.id, service.name, db.name). Those shared attributes are what make cross-signal correlation possible, which is where the real incident-resolution speed comes from.
Cross-signal correlation: three axes that close incidents
A unified pipeline gives you the raw material. But collection alone doesn't explain why a latency spike happened. A PostgreSQL dashboard showing elevated p95 tells you something is wrong. It doesn't tell you whether the cause is a bad query, a contended host, or a deployment that changed application behavior. Answering that requires correlating database metrics with signals from outside the database.
Three correlation axes progressively narrow the search space during an incident.
Axis 1: Database metrics + APM traces = which query caused it. Slow database spans in distributed traces carry db.query.text attributes that link directly to the responsible statement. When p95 spikes, the span shows the exact SQL. That span-to-query linkage automates what EXPLAIN ANALYZE does manually, across every query variant, on every request.
Axis 2: Database metrics + infrastructure metrics = what constrained it. CPU steal, disk I/O wait, and network throughput on the database host reveal whether a slowdown is a resource contention issue. A report query that normally completes in 25ms but suddenly takes 1.2 seconds, with no deployment in between, is usually competing for disk or CPU on a shared host rather than running a degraded plan (though lock contention, stale statistics, or index bloat can look similar). Without the infrastructure layer, you'd waste time chasing query-level explanations for a host-level problem.
Axis 3: Database metrics + logs = what sequence of events led to it. Slow query logs, error logs, and lock contention events provide the narrative that metric time series cannot. Metrics show what changed. Logs explain what happened. For example, lock contention is one of the most common incident triggers, and the metric alone (rising lock wait count) doesn't tell you which session is blocking. Querying pg_stat_activity with pg_blocking_pids() (PostgreSQL 9.6+; for earlier versions, query pg_locks directly) pinpoints the blocking session, its query, and how long it's been holding the lock:
SELECT
blocker.pid AS blocker_pid,
left(blocker.query, 100) AS blocker_query,
waiting.pid AS waiting_pid,
left(waiting.query, 100) AS waiting_query,
now() - blocker.state_change AS lock_held_for
FROM pg_stat_activity waiting
JOIN pg_stat_activity blocker
ON blocker.pid = ANY(pg_blocking_pids(waiting.pid))
WHERE waiting.wait_event_type = 'Lock';
Together, these three axes turn an alert into a causal chain: the trace identifies responsible queries, infrastructure metrics rule out host-level bottlenecks, and log correlation surfaces the trigger. Whether that chain resolves in one interface or across three separate tools depends on your platform and your alerting setup.
Correlation closes the gap between alert and cause, but only if the alerts that wake you up are actually worth investigating.
Alert fatigue is a design problem, platform choice is the fix
Static thresholds on database metrics produce high false-positive rates. Query patterns vary by hour and day of week. A batch job that pushes p95 latency to 600ms every Tuesday at 3am is normal, not an incident. A static alert at 500ms pages you every Tuesday.
Dynamic baselining eliminates this false-positive pattern. Instead of a hardcoded threshold, the alert fires when a metric deviates from its own rolling historical pattern for that time window. p95 at 600ms on Tuesday at 3am is expected. p95 at 600ms on Wednesday at 2pm is a deviation worth investigating.
But dynamic baselining is only one piece. Whether you can actually implement it, and whether the alerts it produces are actionable, depends on what your observability platform supports. Alert quality is inseparable from platform choice. Six criteria separate a platform that sounds good in a demo from one that holds up at 3am:
Coverage breadth. Native support for your actual database mix (PostgreSQL, MySQL, MongoDB, RDS, Aurora, Azure SQL, and whatever else you run) is non-negotiable. Community plugins with no SLA add risk in production.
Query-level visibility. CPU and connection counts are necessary but insufficient. You need per-query latency distributions, execution counts, and normalized query fingerprinting that aggregates variants of the same logical query. Without fingerprinting, you're scrolling through raw query strings instead of seeing the handful of patterns that account for most of your total execution time.
Cross-signal correlation. If database metrics, APM traces, and infrastructure metrics live in separate tools, you're doing the correlation manually. That context switch is where time evaporates during incidents.
Alert quality. Static thresholds versus dynamic baselining is the dividing line. Platforms that support rolling historical baselines eliminate most false positives from cyclical workload patterns.
Pricing model. Per-host pricing behaves differently at 80 nodes than per-metric or per-GB pricing. Project the numbers against your current and expected fleet size before signing.
Operational overhead. Agent deployment and upgrades across 80+ nodes compound over time. Centralized configuration, auto-upgrade, and agentless collection for cloud-managed databases (where agent deployment isn't an option) matter more than they appear in an initial evaluation.
Criterion #4 (dynamic baselining) is where AI-driven features are pushing the boundary, moving beyond rolling averages into pattern detection that no human would configure manually.
AI-assisted database monitoring: faster triage, not fewer engineers
AI-driven features are gaining traction in observability platforms. The Grafana Observability Survey 2025 found that the two most sought-after AI capabilities were training-based alerts that fire on pattern deviations and faster root cause analysis through automated signal interpretation. These two ranked at the top across nearly every demographic surveyed. Autonomous remediation drew interest but with significant practitioner skepticism. The pattern is clear: engineers want faster triage, not hands-off automation.
Where AI adds the most value is in catching what no human would wire up manually: co-occurring metric changes across signals (a replication lag spike alongside a batch job CPU spike on the same host) that only correlate under specific conditions. Capacity forecasting is the other win, spotting growth trends that will cause pressure weeks before the pressure becomes a production incident.
The judgment call that follows still requires a person. Deciding whether a flagged query needs a composite index, a denormalized read path, or a move to a different storage engine depends on access patterns, consistency requirements, and how the data model will evolve over the next two quarters. No anomaly detector has that context. AI narrows the search; an engineer who understands the domain decides what to do with what it finds.
These capabilities come from the platform, not the pipeline. If you've built the OTel collection layer yourself, the question becomes what that self-assembled stack actually costs to maintain.
The operational cost of a self-assembled stack
If you've followed along this far, you've assembled a capable observability pipeline: OTel Collector with four receivers, application SDK instrumentation, alerting rules, and cross-signal correlation. It works. But it's worth tallying what you're now maintaining.
The Collector itself needs upgrades. Core and contrib release together every two weeks, and each release can bring receiver config changes and semantic convention updates (the db.statement to db.query.text rename is a recent example). Across a fleet of 20+ database nodes, that's 20+ Collector configs to keep in sync. YAML drift is quiet until it causes a gap in your telemetry during an incident.
Alert tuning is ongoing. Static thresholds need manual adjustment as workloads evolve. Dynamic baselines, if your backend supports them, need their own validation. Each new database instance means another set of receiver configs, user grants, and alert rules.
Cloud-managed databases add a different kind of overhead. IAM policies, CloudWatch API rate limits, and the resolution gaps between standard and enhanced monitoring all require attention that scales with the number of instances.
None of this is unreasonable for a team with dedicated platform engineering capacity. But for teams where observability is one responsibility among many, the assembly and maintenance cost is the real expense, not the software licenses. The next section walks through the implementation sequence; the managed alternative follows at the end.
Getting started: a concrete implementation sequence
You can get the first piece of actionable data quickly. Run the pg_stat_statements query from the PostgreSQL section above and see which queries dominate your database's total execution time. The full setup depends on your environment, but each step below is individually small.
Step 1: Enable pg_stat_statements
Check what's already loaded with SHOW shared_preload_libraries;. If the result is empty, run ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';. If other libraries are already loaded (e.g., timescaledb), append rather than replace: ALTER SYSTEM SET shared_preload_libraries = 'timescaledb, pg_stat_statements';. This requires a full PostgreSQL restart, which means a maintenance window in production. After the restart, run CREATE EXTENSION pg_stat_statements; in your target database and query it immediately to get your baseline.
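The commands from this step, collected in order:

```sql
-- 1. See what's already loaded (an empty result means nothing is).
SHOW shared_preload_libraries;

-- 2. Append rather than replace if other libraries are listed.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';

-- 3. Restart PostgreSQL, then in each target database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```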
Step 2: Instrument your application with an OTel SDK
The Collector pipeline in Step 3 collects infrastructure-level database metrics. Application-level database spans (the ones carrying db.query.text that link to APM traces) require your application to emit them via an OTel SDK. Each language and database driver combination needs its own instrumentation library, SDK initialization, and exporter configuration. The OTel Instrumentation Registry covers the specific packages. For a team running multiple services across multiple languages, this step alone touches every application in the stack.
Step 3: Deploy the OTel Collector
Deploy the Collector with the postgresql receiver on the same host, using the configuration from the pipeline section above. Point it at your backend via Prometheus remote write or an OTLP endpoint. Verify that db.system.name, db.name, and db.query.text attributes are populating on spans from your application's database client library.
Step 4: Set baseline alerts
Three non-negotiable alerts to start with. If your platform supports dynamic baselining, use these:
- p95 SELECT latency more than 2x the 7-day rolling baseline for the same hour-of-week
- Connection utilization (active / max) above 80% sustained for 5 minutes
- Replication lag above 30 seconds
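If your backend is Prometheus-compatible, the second and third alerts map to static rules directly, and the first can be approximated with a week-over-week offset, which captures the same hour-of-week comparison. The metric names below are placeholders, not what any receiver is guaranteed to emit; substitute whatever your pipeline actually produces:

```yaml
groups:
  - name: database-baseline-alerts
    rules:
      # Placeholder metric names; adjust to your pipeline's output.
      - alert: ConnectionSaturation
        expr: db_connections_active / db_connections_max > 0.8
        for: 5m
      - alert: ReplicationLagHigh
        expr: db_replication_lag_seconds > 30
        for: 2m
      # Week-over-week approximation of an hour-of-week baseline.
      - alert: P95LatencyBaselineDeviation
        expr: |
          db_query_p95_latency_seconds
            > 2 * (db_query_p95_latency_seconds offset 1w)
        for: 10m
```

The offset 1w comparison is a crude baseline (one reference point, no smoothing); platforms with true rolling baselines average several prior weeks, which is more robust against one anomalous reference week.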
Step 5: Verify cross-signal correlation
Trigger a slow query manually with SELECT pg_sleep(3); and confirm the resulting database span in your APM traces carries the db.query.text attribute (or db.statement if your library uses the older convention) and links back to the metric spike. If it doesn't, your pipeline has a tagging gap that will cost you during the next real incident. Fix it now while the system is quiet.
Step 6: Repeat for your next database
Once PostgreSQL is fully instrumented and alerting is stable, repeat Steps 1 through 5 for your next database type. Each engine means a different receiver config, different monitoring user grants, different signal verification, and a different set of edge cases. A three-database stack means running this sequence three times, each with its own failure modes.
What the DIY path delivers
If you've followed the implementation sequence above, the 2:47am scenario from the introduction looks different now. Instead of fifteen minutes switching between dashboards, you have a single correlated timeline where the responsible query, the host contention, and the triggering event are already connected.
That's the DIY path. It works, and it's entirely vendor-neutral. The tradeoff is the assembly and ongoing maintenance cost that scales with every database you add to the fleet.
Managed alternative: same criteria, less assembly
For teams where that tradeoff doesn't pencil out, ManageEngine Applications Manager is one option worth evaluating. Here's how it maps against the six criteria from the alerting section:
- Coverage breadth: Out-of-the-box monitoring for 50+ database types, from PostgreSQL and MongoDB to managed offerings like Aurora and Azure SQL. No per-engine receiver assembly or contrib binary juggling.
- Query-level visibility: Latency distributions, execution frequency, and fingerprinted query grouping that rolls up thousands of raw statements into the patterns that actually drive load.
- Cross-signal correlation: Database, application, and host telemetry share a single interface. During an incident, you click from a slow query span to the host's CPU timeline without opening a second tool.
- Alert quality: ML-driven baselines that learn your workload's weekly rhythm, so the Tuesday 3am batch job doesn't page anyone but a Wednesday 2pm anomaly does.
- Pricing model: Priced per monitor rather than per GB of ingested telemetry. At 80+ database nodes, this distinction determines whether the bill scales linearly or exponentially.
- Operational overhead: Cloud-managed databases connect via JDBC and cloud APIs with no local agent. Self-managed instances use centralized config pushed from the server, so there's no per-node YAML to maintain or drift to chase.
Whether the managed path or the DIY pipeline is the better fit depends on your team's platform engineering capacity and how many database types you're running. The six criteria give you a framework to evaluate either approach, or any other platform, on equal footing.
What does your current database monitoring setup look like? If you're running a mixed stack, I'd be curious to hear how you're handling cross-signal correlation today, and where it still breaks down.
