Damaso Sanoja

Posted on May 22

Database Maintenance: Tracing Production Incidents to Their Root Cause

#database #devops #performance #sql

Database maintenance fails when it runs on a calendar instead of on signal. Fragmentation, stale statistics, log growth, and lock contention are functions of write workload, not weekly schedules. Scheduled maintenance skips the tables that need it most, and the resulting incident fires before anyone notices the gap.

This article replaces the cron job with a response system. Four observable symptoms (I/O degradation, query plan regression, storage pressure, and lock contention) each trace back to a specific maintenance root cause, with fixes for SQL Server, PostgreSQL, and MySQL. Silent corruption, the one failure mode that produces no precursor signal, gets its own detection-first treatment. A closing scorecard lets you self-assess.

First Response: Wait State Triage Across Engines

When a slow query alert fires, the first diagnostic step is the same regardless of engine: check what the query is waiting on. Wait states are the universal entry point for database incident triage. They tell you whether the problem is I/O bound, lock bound, or CPU bound, and that classification determines which section of this article contains your fix.

SQL Server wait types

PAGEIOLATCH_SH means the query is waiting for data pages to be read from disk into the buffer pool. This points to index fragmentation, buffer cache pressure, or storage subsystem saturation. LCK_M_S and LCK_M_X indicate row or table-level lock contention from a concurrent transaction or a maintenance operation holding locks. CXPACKET (visible in sys.dm_exec_requests) signals parallelism skew, which typically traces to stale statistics or a missing index causing the optimizer to choose an expensive parallel plan.

PostgreSQL and MySQL equivalents

PostgreSQL exposes wait diagnostics through pg_stat_activity. The query below is your triage entry point:

-- PostgreSQL: active session wait events
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
  AND state != 'idle'
  AND backend_type = 'client backend';

The diagram above maps each value to its target section. One non-obvious case is worth calling out: a NULL wait_event while state = 'active' indicates the query is compute-bound (the PostgreSQL equivalent of CPU pressure), which can point toward stale statistics or a plan regression rather than I/O.

For MySQL, performance_schema.events_waits_current is the source for the values shown in the diagram. Verify performance_schema = ON in my.cnf first, as it is disabled by default in some MySQL 5.x builds and carries non-zero overhead; on MySQL 8.0+ it is enabled by default. SHOW PROCESSLIST gives a quicker but less granular view.

Once you have identified the wait type, the sections below trace each category to its maintenance root cause and prescribe the fix. For hybrid topologies that span on-prem and cloud-managed instances, ManageEngine OpManager Nexus surfaces wait-state and slow-query data across both in a single triage view through its SaaS delivery for managed databases.

Symptom: I/O Degradation and Read Amplification

A buffer cache hit ratio drifting below the 95-99% range that healthy OLTP workloads maintain is the cross-engine signal that the engine is reading more pages from disk than memory can satisfy.

SQL Server practitioners typically treat 90% as a warning and 85% as an action threshold; PostgreSQL and MySQL expose equivalents in pg_statio_user_tables and information_schema.INNODB_BUFFER_POOL_STATS (or SHOW ENGINE INNODB STATUS). The most common cause is index fragmentation: pages split, B-tree leaves scatter across non-contiguous extents, and one logical read becomes several physical I/Os. Read amplification surfaces as PAGEIOLATCH waits on SQL Server, DataFileRead on PostgreSQL, and elevated innodb_data_file waits on MySQL.

On cloud-managed instances where DMV access is restricted (RDS, Azure SQL Managed Instance), OpManager Nexus's SaaS delivery surfaces the same buffer-pool visibility through its agent.

Diagnosing index bloat

SQL Server: sys.dm_db_index_physical_stats is the authoritative source for fragmentation data. The query below returns indexes above 5% fragmentation with more than 1,000 pages (the page count filter matters because rebuilding very small indexes produces negligible performance improvement):

SELECT
    OBJECT_NAME(ips.object_id) AS tbl_name,
    i.name AS idx_name,
    ips.index_type_desc,
    ips.avg_fragmentation_in_percent,
    ips.page_count
FROM sys.dm_db_index_physical_stats(
    DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes i
    ON ips.object_id = i.object_id
    AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 5
    AND ips.page_count > 1000
ORDER BY ips.avg_fragmentation_in_percent DESC;

The 'LIMITED' scan mode traverses only the index allocation structure, making it safe and fast on production systems. 'SAMPLED' reads a statistical sample of data pages for more accurate numbers at moderate I/O cost on very large tables or partitioned indexes. 'DETAILED' performs a full scan; reserve it for offline assessment.

PostgreSQL: The pg_stat_user_tables view provides the first signal. A dead_pct above 10-20% on a high-write table is a common trigger for manual VACUUM (this range aligns with practitioner guidance, with the autovacuum default kicking in at 20%):

SELECT schemaname, relname,
       n_dead_tup,
       n_live_tup,
       round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

For index-level bloat (physical B-tree bloat that VACUUM does not reclaim), the pgstattuple extension exposes two functions. pgstattuple() returns free_percent, the wasted-space ratio that is the PostgreSQL equivalent of avg_fragmentation_in_percent:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders_created_at_idx');

pgstatindex() returns the B-tree-specific metrics: leaf_fragmentation (percentage of leaf pages not in logical order, indicating physical scatter) and avg_leaf_density (below 50% suggests the index has many near-empty pages):

SELECT * FROM pgstatindex('orders_created_at_idx');

Both functions perform a full scan of the target relation, so on a multi-hundred-GB index expect runtime and I/O comparable to a sequential read of the entire object — schedule them like any other heavy diagnostic, not in a hot loop.

High free_percent with low leaf_fragmentation may indicate space reclaimable by VACUUM rather than a full rebuild. Values of free_percent in the 20-30% range are a widely used trigger for REINDEX; consult your workload and current community guidance to calibrate the threshold.

MySQL: Query information_schema.TABLES for InnoDB tablespace fragmentation:

SELECT table_schema, table_name,
       round(data_length / 1024 / 1024, 2) AS data_mb,
       round(data_free / 1024 / 1024, 2) AS free_mb,
       round(data_free / (data_length + index_length + data_free) * 100, 2) AS frag_pct
FROM information_schema.TABLES
WHERE engine = 'InnoDB'
  AND data_free > 0
ORDER BY data_free DESC
LIMIT 10;

This metric is meaningful only with per-table tablespaces (innodb_file_per_table = ON, the default since MySQL 5.6); on shared-tablespace deployments, data_free reflects unused space in the global ibdata file and is repeated identically across every InnoDB row.

Tables with frag_pct above 20% are commonly treated as candidates for OPTIMIZE TABLE or pt-online-schema-change (this threshold is a practitioner guideline rather than a MySQL-documented limit).

Remediation by engine and downtime tolerance

Microsoft's documentation on index reorganization and rebuild maps fragmentation levels to two SQL Server operations:

5-30% fragmentation: ALTER INDEX idx_name ON tbl_name REORGANIZE compacts leaf-level pages incrementally as an online operation. It can be interrupted mid-run without corrupting the index.
Above 30%: ALTER INDEX idx_name ON tbl_name REBUILD recreates the index. Offline by default (acquires a schema modification lock that blocks concurrent access). Add WITH (ONLINE = ON) on Enterprise edition to keep the index available during the rebuild. Note that even online rebuilds acquire a brief Schema Modification (Sch-M) lock at the beginning and end of the operation, typically milliseconds, but long enough to cause noticeable waits on extremely high-concurrency workloads.

On SQL Server 2017+, combine ONLINE = ON with RESUMABLE = ON and a configurable MAX_DURATION to pause and resume long rebuilds: ALTER INDEX idx_name ON tbl_name REBUILD WITH (ONLINE = ON, RESUMABLE = ON, MAX_DURATION = 60). Resume with ALTER INDEX idx_name ON tbl_name REBUILD WITH (RESUME). RESUMABLE = ON requires ONLINE = ON and is Enterprise-edition-only on SQL Server 2017; SQL Server 2019+ also enables it on Standard and Web editions, so verify your edition before scripting against this syntax.

The 5% floor matters equally. Running REORGANIZE on a 3% fragmented index generates log activity, consumes I/O, and produces no measurable query improvement.

For PostgreSQL, VACUUM reclaims dead tuple storage and updates the visibility map. ANALYZE updates planner statistics. REINDEX rebuilds the B-tree structure when physical index bloat is confirmed:

VACUUM VERBOSE ANALYZE transactions;

-- Blocking rebuild (requires maintenance window):
REINDEX INDEX transactions_created_at_idx;

-- Non-blocking rebuild (PostgreSQL 12+):
REINDEX INDEX CONCURRENTLY transactions_created_at_idx;

REINDEX CONCURRENTLY cannot run inside a transaction block and takes longer than the standard form, but it allows writes to continue during the rebuild. Beyond immediate remediation, VACUUM VERBOSE output is worth reviewing regularly on your heaviest-write tables. It provides dead tuple counts, page recycling data, and cleanup statistics that give indirect signals of table health. PostgreSQL's autovacuum handles routine dead tuple cleanup automatically, but under high-velocity delete workloads it can fall behind. The official PostgreSQL documentation on routine vacuuming covers tuning autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold for tables where the defaults prove too conservative.

For MySQL, OPTIMIZE TABLE defragments the tablespace and rebuilds statistics in a single operation. In MySQL 8.0+, this runs online for regular InnoDB tables with only brief metadata locks at prepare and commit phases, but the full copy can take significant time on large tables:

OPTIMIZE TABLE events;
ANALYZE TABLE events;

Internally, InnoDB maps OPTIMIZE TABLE to ALTER TABLE ... FORCE, rebuilding the clustered index and all secondary indexes. For zero-downtime execution on large tables, pt-online-schema-change from Percona Toolkit performs the same rebuild while keeping the original table live:

pt-online-schema-change \
  --alter "ENGINE=InnoDB" \
  --execute \
  D=app_prod,t=events,h=127.0.0.1,F=$HOME/.my.cnf

This maintains a shadow copy and replays writes via triggers throughout the rebuild. The --execute flag is required; without it the tool runs in dry-run mode only.

Remediation lookup by symptom severity:

Symptom Severity	Engine	Downtime Tolerance	Recommended Action
Mild (frag < 5% / dead_pct < 10%)	All	N/A	None
Moderate (5-30%)	SQL Server	Any	ALTER INDEX ... REORGANIZE
Severe (> 30%)	SQL Server	Required	ALTER INDEX ... REBUILD WITH (ONLINE=ON) [Enterprise]
Severe (> 30%)	SQL Server	Available	ALTER INDEX ... REBUILD
Elevated (dead_pct > 10%)	PostgreSQL	Any	VACUUM ANALYZE
High bloat (free_percent > 30%)	PostgreSQL	Required	REINDEX CONCURRENTLY
Elevated (frag_pct > 20%)	MySQL	Available	OPTIMIZE TABLE
Elevated (frag_pct > 20%)	MySQL	Required	pt-online-schema-change

With fragmentation addressed, the next failure category that produces slow queries is stale statistics, which causes the optimizer to choose a scan where an index seek would be orders of magnitude faster.

Symptom: Query Plan Regression

The execution plan shows a table scan where an index seek ran yesterday. The optimizer has not changed; the data it relies on has. This is a statistics problem.

Diagnosing stale statistics

The SQL Server optimizer uses row count estimates and data distribution histograms to choose between index seeks and table scans. When those statistics are weeks out of date on a fast-growing table, the optimizer picks a scan where a seek would be dramatically faster. Run UPDATE STATISTICS table_name WITH FULLSCAN on any table that receives large batch loads. The WITH SAMPLE variant uses a row sampling percentage that can miss skewed distributions on large tables, producing statistics that look current but reflect an unrepresentative subset.

To detect indexes suffering from stale statistics or poor plan choices, query sys.dm_db_index_usage_stats:

SELECT OBJECT_NAME(object_id) AS tbl_name,
       index_id,
       user_seeks,
       user_scans,
       user_lookups
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID()
ORDER BY user_scans DESC;

Indexes with zero seeks but high scans are candidates for statistics updates or missing index evaluation.

PostgreSQL's ANALYZE command and MySQL's ANALYZE TABLE update planner statistics independently from VACUUM and OPTIMIZE TABLE respectively. On PostgreSQL, autovacuum runs ANALYZE automatically after a configurable percentage of rows change (controlled by autovacuum_analyze_scale_factor, default 0.1 or 10%), but that default is too high for large tables. A 200-million-row table would need 20 million row changes to trigger autovacuum's ANALYZE pass, by which point the query plan may have been wrong for hours. Lowering autovacuum_analyze_scale_factor to 0.01 or using autovacuum_analyze_threshold with per-table overrides addresses this.

Updating statistics without disruption

On SQL Server, UPDATE STATISTICS generally does not block queries (it runs with NOLOCK semantics on data reads), though asynchronous statistics updates can cause brief schema lock contention during query compilation in high-workload scenarios. It does invalidate cached execution plans for the affected table: immediately after, SQL Server will recompile plans on next execution, which can briefly spike CPU on systems with many concurrent queries against the updated table. Run during low-traffic windows on heavily queried tables. The choice between FULLSCAN and SAMPLE depends on table size and distribution skew.

For tables in the small-to-medium range, FULLSCAN typically completes quickly enough to run during off-peak hours (the practical upper bound depends on hardware, but many teams use roughly 100M rows as a rule-of-thumb cutoff). For larger tables, a higher sample percentage (such as SAMPLE 20 PERCENT or SAMPLE 30 PERCENT) typically provides a better tradeoff between accuracy and duration than the default sample, though the optimal percentage varies by workload.

On PostgreSQL, ANALYZE reads a configurable sample (default default_statistics_target = 100, meaning 30,000 rows per column) and does not lock the table. Run it manually after any bulk load or partition swap.

On MySQL, ANALYZE TABLE is a lightweight operation on InnoDB that reads the index tree's random dive samples. It is a fast operation: in MySQL 8.0+, ANALYZE TABLE uses online DDL semantics, avoiding the full read lock that earlier versions required. Capture EXPLAIN for representative queries before and after to confirm the planner picked up the new statistics.

OpManager Nexus automates detection of query plan regression on-prem through historical baseline comparison and anomaly flagging. The same capability extends to cloud-managed databases through its SaaS delivery, where slow query log analysis drills into queries exceeding a configurable execution-time threshold. The Automated Remediation section below covers how to wire that detection into corrective workflows.

Statistics failures are invisible until the query plan degrades. Storage failures are equally silent, until a disk fills and takes the database offline.

Symptom: Storage Pressure and Runaway Growth

A disk usage alert fires at 85% capacity. The database server has been running for months without anyone checking how fast the log files or tablespaces are growing. The root cause splits into two categories: unmanaged transaction log growth and missing archiving strategy. Both are maintenance failures that monitoring should have caught weeks earlier.

Transaction log and WAL management

SQL Server: A full recovery model database without regular transaction log backups will grow its log file until the disk fills, and a full data volume is an immediate production outage. To check current log space usage across all databases, run DBCC SQLPERF(LOGSPACE);, which returns log size, space used percentage, and status for every database. For a single database, query sys.databases for the log_reuse_wait_desc column, which tells you exactly why the log cannot be truncated (e.g., LOG_BACKUP, ACTIVE_TRANSACTION). Schedule log backups at an interval matching your Recovery Point Objective (RPO): for most OLTP workloads, intervals in the range of 5-30 minutes are commonly used, with tighter intervals for high-transaction systems, though the right frequency is workload-specific.

DBCC SHRINKFILE on the log file is a last resort for reclaiming space after an unexpected log growth event. The reason it is a last resort, rather than a routine cleanup tool, is the side effect on Virtual Log Files (VLFs), the internal segments SQL Server divides the transaction log into. Each shrink-then-regrow cycle adds a new VLF, so a log that has been shrunk repeatedly ends up fragmented into many small VLFs instead of a few large ones. That fragmentation degrades sequential log write throughput and increases recovery time. The fix is to address the root cause (missing log backups, long-running transactions) rather than shrinking on a schedule.

PostgreSQL: WAL (Write-Ahead Log) management serves the same function as SQL Server's transaction log. The archive_mode and archive_command settings control whether completed WAL segments are shipped to archive storage. Without archiving enabled, WAL segments accumulate in pg_wal/ until disk fills. The wal_keep_size parameter (PostgreSQL 13+, replacing wal_keep_segments) sets a floor for retained WAL data, but does not cap growth. For production systems, configure continuous archiving with archive_mode = on and point archive_command to your backup infrastructure (pgBackRest, Barman, or cloud-native equivalents).

To verify archiving is active and current: SELECT * FROM pg_stat_archiver; Check last_archived_wal timestamp and failed_count. A non-zero failed_count or a stale last_archived_time means WAL segments are accumulating. Also: SELECT count(*), pg_size_pretty(sum(size)) FROM pg_ls_waldir(); (PostgreSQL 10+) shows total WAL directory size.

MySQL: Binary logs (binlogs) serve replication and point-in-time recovery. Without rotation, they grow indefinitely. expire_logs_days (deprecated in MySQL 8.0.3) or binlog_expire_logs_seconds (MySQL 8.0+) controls automatic purge. Setting binlog_expire_logs_seconds = 604800 retains seven days of binary logs, which is sufficient for most replication topologies. Run PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY for one-time cleanup.

Capacity forecasting with OpManager Nexus

Reacting to a disk alert at 85% leaves little room for planned action. OpManager Nexus's AI/ML-based storage forecasting uses up to 14 days of history to predict when storage will hit 80%, 90%, and 100%, giving your team a "disk full in N days" signal once it has at least 3 days of data. Its adaptive thresholds learn baseline behavior so alerts fire on genuine anomalies rather than every batch job, and the Database Tab surfaces individual database size, data and log file utilization, and growth trends.

Note: OpManager Nexus's own monitoring data retention (configured under Settings > General Settings > Database Maintenance) is independent of your production database storage. Defaults are 7, 30, and 365 days for detailed, hourly, and daily statistics.

Use OpManager Nexus's forecast reports to verify your archiving cadence keeps pace with growth: if the forecast shows 80% capacity in 30 days but your archive job runs monthly, increase frequency or provision more storage.

Storage pressure is a passive failure that accumulates over time. Lock contention is an active failure: the maintenance operation meant to fix the database becomes the source of the incident.

Symptom: Lock Contention from Maintenance Operations

A spike in blocked sessions immediately after a scheduled maintenance run is direct evidence that the REBUILD or REORGANIZE collided with production traffic and created lock contention. The maintenance job is supposed to fix performance, but index REBUILDs running without ONLINE = ON during peak traffic or without a maintenance window hold locks that block concurrent queries, turning the fix into the incident.

Identifying maintenance-induced blocking

Correlating maintenance timing with OpManager Nexus's Sessions Tab is how you distinguish maintenance-induced blocking from application-level contention. If blocked session counts spike within minutes of a maintenance window opening, the maintenance job is the cause. On SQL Server, check sys.dm_exec_requests for sessions with wait_type values starting with LCK_M_*, then look up the head-of-chain blocker and inspect its command column for ALTER INDEX or DBCC operations.

On PostgreSQL, pg_stat_activity shows Lock wait events with wait_event values like relation or transactionid. If the blocking PID is running REINDEX or VACUUM FULL, that is maintenance-induced contention. For cloud-managed instances where Sessions Tab access is unavailable, OpManager Nexus's SaaS delivery surfaces lock contention and blocking session counts on its database performance dashboard for the same triage signal.

Online and resumable operations

The fix is operational: use online operations and schedule them outside peak traffic windows.

SQL Server: Use ALTER INDEX ... REBUILD WITH (ONLINE = ON, RESUMABLE = ON, MAX_DURATION = 60) as described in the I/O Degradation section. The duration is any positive integer in minutes; set it based on your maintenance window. REORGANIZE is always online and interruptible.

PostgreSQL: REINDEX INDEX CONCURRENTLY (introduced in the I/O Degradation section) avoids exclusive locks. VACUUM without FULL does not block reads or writes.

MySQL: Standard OPTIMIZE TABLE already runs as online DDL on MySQL 8.0+ (introduced in the I/O Degradation section). Reach for pt-online-schema-change when you need finer control over lock duration on very large tables, or when you want triggered shadow-copy semantics that OPTIMIZE TABLE does not offer.

The four symptom categories above all produce observable performance signals before they become outages. Corruption is different: it produces no signal until it surfaces as query failures or data loss.

Symptom: Silent Corruption and Integrity Failures

Because corruption produces no precursor wait events or latency drift, detection is a deliberate scheduled act, not an alert response. Regular integrity checks are the primary detection mechanism, supplemented by storage-level checksums, page verification, and reliable backups.

SQL Server: DBCC CHECKDB catches page corruption, allocation errors, and consistency violations.

-- Recommended production form: suppresses informational messages, shows only errors
DBCC CHECKDB('ProductionDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;

For large databases where a full DBCC CHECKDB is too slow for a maintenance window, DBCC CHECKDB ... WITH PHYSICAL_ONLY checks page and record header integrity without logical consistency checks and completes significantly faster. Corruption surfaces in the SQL Server error log as messages Msg 823, 824, or 825. To proactively check for known corruption events, query the suspect pages table:

SELECT db_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
WHERE event_type IN (1, 2, 3);

Event_type 1 = 823/824 errors, 2 = bad checksum, 3 = torn page. A non-empty result requires immediate DBCC CHECKDB and restore planning.

Running DBCC CHECKDB as frequently as your maintenance windows allow is the safe path. Many experts recommend daily on all databases; if that is impractical, prioritize critical databases and shorten the interval on large ones using WITH PHYSICAL_ONLY.

PostgreSQL: The pg_amcheck utility (PostgreSQL 14+) verifies B-tree index integrity by checking that every heap tuple referenced by an index entry actually exists and that index entries are in the correct sort order. The default invocation is fast enough for routine scheduled checks and catches most corruption:

pg_amcheck mydb

After an unexpected crash, storage event, or replication failure, run the thorough variant on critical tables:

pg_amcheck --heapallindexed --parent-check mydb

--heapallindexed performs a deeper check that every heap tuple has a corresponding index entry; --parent-check verifies cross-level B-tree invariants. Both flags increase runtime substantially, so reserve them for incident response or post-event verification rather than the routine schedule.

MySQL: mysqlcheck provides table-level integrity verification:

mysqlcheck --check --all-databases -u root -p

For individual tables, CHECK TABLE table_name within the MySQL client performs the same operation. InnoDB tables benefit from CHECK TABLE ... FOR UPGRADE after major version upgrades to verify storage format compatibility.

Running these checks manually is the safety net. The next section shows how to automate the response so the platform acts before the on-call engineer logs in.

From Alert to Fix: Automated Remediation Across Engines

When the alert fires at 3 AM, having the platform execute the remediation automatically matters far more than knowing the fix. OpManager Nexus's IT Workflow Automation triggers a custom monitoring script when an alert threshold is breached: the script queries the symptom's diagnostic surface (fragmentation, dead tuples, log space), evaluates severity, and runs the remediation.

SQL Server: wiring remediation into OpManager Nexus

OpManager Nexus accepts PowerShell or shell scripts as custom monitors (Custom Script Monitors require build 12.7 or later). The integration pattern matches the PostgreSQL and MySQL examples below: query sys.dm_db_index_physical_stats for fragmentation, branch on the threshold, issue ALTER INDEX REORGANIZE or REBUILD WITH (ONLINE = ON) accordingly, and emit one log line per action so the run shows up in the monitor's history. Run the script under a service account with at least db_ddladmin on the target database; for SQL authentication or cross-domain setups, pull credentials from a secrets store rather than embedding them.

PostgreSQL and MySQL shell automation

For PostgreSQL, a cron-driven shell script can query pg_stat_user_tables for bloated tables and trigger remediation:

#!/usr/bin/env bash
# PostgreSQL automated vacuum/reindex for tables exceeding dead tuple threshold.
# Credentials sourced from ~/.pgpass (chmod 600); export PGPASSFILE if non-default.
PGHOST="localhost"
PGPORT="5432"
PGDATABASE="app_prod"
PGUSER="maintenance_user"
export PGPASSFILE="${PGPASSFILE:-$HOME/.pgpass}"

DEAD_THRESHOLD=15
BLOAT_THRESHOLD=30

# VACUUM tables with high dead tuple ratio
psql -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" -t -A -F'|' -c "
  SELECT schemaname, relname, round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2)
  FROM pg_stat_user_tables
  WHERE n_live_tup > 10000
    AND round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) > $DEAD_THRESHOLD
" | while IFS='|' read -r schema table dead_pct; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') | VACUUM ANALYZE ${schema}.${table} | dead_pct=${dead_pct}%"
  psql -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" -c "VACUUM ANALYZE ${schema}.${table};"
done

For MySQL, a similar approach queries information_schema.TABLES and triggers OPTIMIZE TABLE. Use a MySQL option file instead of embedding credentials in the script (create ~/.my.cnf with [client] credentials and restrict permissions to 600):

#!/usr/bin/env bash
# MySQL automated optimize for InnoDB tables exceeding fragmentation threshold
MYSQL_HOST="localhost"
MYSQL_DB="app_prod"

FRAG_THRESHOLD=20

mysql --defaults-extra-file="$HOME/.my.cnf" -h "$MYSQL_HOST" -N -B -e "
  SELECT table_name, round(data_free / (data_length + index_length + data_free) * 100, 2) AS frag_pct
  FROM information_schema.TABLES
  WHERE table_schema = '${MYSQL_DB}'
    AND engine = 'InnoDB'
    AND data_free > 0
    AND round(data_free / (data_length + index_length + data_free) * 100, 2) > ${FRAG_THRESHOLD}
" | while read -r table frag_pct; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') | OPTIMIZE TABLE ${table} | frag_pct=${frag_pct}%"
  mysql --defaults-extra-file="$HOME/.my.cnf" -h "$MYSQL_HOST" "$MYSQL_DB" -e "OPTIMIZE TABLE ${table};"
done

Schedule either script via cron (e.g., 0 3 * * * /opt/scripts/pg_maintenance.sh >> /var/log/db_maintenance.log 2>&1) and monitor the log output through OpManager Nexus's custom monitor integration.

Cloud-managed database automation

For databases running on Amazon RDS, Aurora, or Azure SQL, OpManager Nexus's SaaS delivery provides the cloud-side counterpart of the PowerShell and shell automation patterns above. Its IT Automation module triggers corrective actions from threshold breaches and anomaly detections, and AI-powered baselines replace the manual threshold tuning that self-managed instances require. For RDS specifically, service actions like start, stop, and reboot with failover are surfaced directly. Engine-specific monitor setup for SQL Server, PostgreSQL, and MySQL is documented separately. Threshold profiles let you apply equivalent alert configurations across dev, staging, and production monitors, so a query that fragments an index under realistic staging load surfaces in slow query detection before it reaches production scale.

Maintenance Health Scorecard: Assessing Your Current Posture

Instead of running through the diagnostic queries from scratch, use this scorecard to assess your maintenance posture. Each item references the diagnostic approach covered in its corresponding section above.

I/O health (see: I/O Degradation section)

[ ] SQL Server: Run the sys.dm_db_index_physical_stats query (filter the results at 30% fragmentation). Count of indexes returned: ___
[ ] PostgreSQL: Run the pg_stat_user_tables dead tuple query. Tables with dead_pct above 10-20% are candidates for immediate attention: ___
[ ] MySQL: Run the information_schema.TABLES fragmentation query. Tables with frag_pct above 20%: ___

Statistics freshness (see: Query Plan Regression section)

[ ] SQL Server: Check sys.dm_db_index_usage_stats for indexes with zero seeks but high scans (plan regression or poorly matched index)
[ ] PostgreSQL: Verify autovacuum_analyze_scale_factor is set below 0.1 for tables above 100 million rows
[ ] MySQL: Run ANALYZE TABLE on your top 10 tables by write volume; capture EXPLAIN output for representative queries before and after to confirm planner statistics changed as expected

Storage trajectory (see: Storage Pressure section)

[ ] OpManager Nexus forecast report confirms sufficient capacity runway before any threshold crossing: Yes / No
[ ] Transaction log backup job (SQL Server) or WAL archiving (PostgreSQL) is confirmed running and last backup verified: Yes / No
[ ] Binary log rotation (MySQL) is configured with binlog_expire_logs_seconds set to an explicit value: Yes / No

Integrity baseline (see: Silent Corruption section)

[ ] SQL Server: DBCC CHECKDB last run date on critical databases: ___
[ ] PostgreSQL: pg_amcheck last run date (or equivalent manual check): ___
[ ] MySQL: mysqlcheck --check last run date: ___

Automation coverage (see: Automated Remediation section)

[ ] At least one automated remediation script is deployed, scheduled, and confirmed to be producing output logs: Yes / No
[ ] OpManager Nexus alert thresholds are configured and tested for key database health metrics (BCHR, disk utilization, blocked sessions): Yes / No
[ ] Maintenance windows are scheduled based on monitoring signals, not calendar dates: Yes / No

Cross-reference those results against OpManager Nexus's slow query and session data (on-prem Performance Tab or SaaS Database Metrics dashboard). If a table in the top results by size also appears as a source of slow query detections, that is your highest-priority maintenance target.

DEV Community