Damaso Sanoja

Posted on May 20

Oracle Database Performance Monitoring: A Practitioner's Decision Framework

#backend #database #monitoring #performance

Oracle exposes a deep diagnostic surface: AWR snapshots, ASH samples, wait event histograms, ADDM recommendations, alert log entries, and hundreds of V$ dynamic performance views. Every signal stops at the database boundary, which is where the hardest production cases tend to live.

A Concurrency wait spike during WebLogic connection pool exhaustion produces the same AWR output as genuine latch contention under steady-state load. A db file sequential read climbing on index range scans could mean a bad execution plan, or a storage array adding tens of milliseconds of latency because a batch ETL job on a separate system saturated the backend. Each example follows the same pattern: Oracle tells you what the database is waiting for; infrastructure and application data tells you why. Closing that gap means putting database events on the same timeline as the surrounding stack.

ManageEngine OpManager Nexus does exactly that, surfacing WebLogic, OCI, storage, and network signals alongside Oracle metrics in a single console.

This guide is a decision framework for Oracle 19c performance monitoring: wait event triage, V$ metric interpretation, tablespace capacity, and alert routing.

AWR report navigation

A typical AWR report runs to dozens of sections, but three carry the diagnostic weight for most investigations.

Load Profile gives you execution rate, transaction rate, and logical/physical read rates normalized per second and per transaction. Comparing a healthy snapshot against a degraded one with these metrics is the fastest way to tell whether a performance change is driven by workload volume or by execution efficiency degradation on the same volume. Two of its values, DB Time and DB CPU, feed the triage ratio in the next section.

Top 5 Timed Events ranks the wait events that consumed the most DB Time during the snapshot, in absolute seconds and as a percentage of total DB Time. Map the dominant events to the wait class decision table for routing.

SQL Ordered by Elapsed Time identifies the individual SQL statements responsible for the highest DB Time consumption. Cross-reference with Top 5 Events to separate query-specific bottlenecks from systemic ones.

AWR is historical by design. During an active incident, pair AWR findings with ASH data to see which sessions are contributing to wait time right now:

SELECT wait_class, event, COUNT(*) AS samples
FROM V$ACTIVE_SESSION_HISTORY
WHERE sample_time   > SYSTIMESTAMP - INTERVAL '15' MINUTE
  AND session_state = 'WAITING'
GROUP BY wait_class, event
ORDER BY samples DESC
FETCH FIRST 10 ROWS ONLY;

When the workflow itself is the bottleneck (generating an HTML report mid-incident, or reading AWR for a cloud-managed Oracle deployment where the provider gates direct access), OpManager Nexus streams these same metrics continuously and exposes equivalent collection through cloud APIs. The manual generation path remains available when needed, most often for Oracle Support cases or side-by-side snapshot comparison: @$ORACLE_HOME/rdbms/admin/awrrpt.sql (prompts for report format and begin/end snapshot IDs from DBA_HIST_SNAPSHOT) or DBMS_WORKLOAD_REPOSITORY.AWR_REPORT_HTML(l_dbid, l_inst_num, l_bid, l_eid).

Two caveats before applying any of this. AWR, ASH, and ADDM require the Oracle Diagnostics Pack license (bundled with Enterprise Edition); the V$ queries used throughout this guide do not. Oracle 23ai Autonomous Database manages AWR automatically, so the snapshot mechanics above do not apply there.

From here, these signals feed the triage path the next section lays out.

The triage decision framework

Performance triage follows two branches: a first cut on CPU-bound versus wait-heavy, then wait-class routing for the wait-heavy case.

From AWR snapshot to action

DB Time is the sum of all elapsed time across foreground sessions; DB CPU is the on-CPU subset. Their ratio is the first triage signal.

When DB CPU dominates DB Time (above ~75% on OLTP as a starting point, calibrated to your environment), the workload is CPU-bound. SQL tuning or resource contention is the investigation path, and SQL Ordered by Elapsed Time identifies the statements consuming the most DB Time.

When the share drops well below that, the workload is wait-heavy and Top 5 Timed Events becomes the primary diagnostic surface. Read the %DB Time column first; raw wait counts mislead. Events consuming a high proportion are the ones to triage, smaller fractions are background noise.

Cross-reference both views before acting. A db file sequential read dominant event paired with a top SQL doing millions of single-block reads is a query-specific candidate. The same wait dominant with a simple-lookup top SQL points at a systemic bottleneck (storage, cache pressure) rather than the query itself.

With the dominant wait class identified, the next subsection routes each class to its investigation pathway and infrastructure check.

Wait class decision table

Of Oracle 19c's 13 wait classes, the table below covers the ones that surface in production OLTP triage. The Idle row is included for diagnostic context, not as a bottleneck.

Wait Class	Common Events	Root Cause Pathway	Infrastructure Check
System I/O	`db file sequential read`, `db file scattered read`	Index scan latency or full table scan I/O	Storage IOPS, latency, bandwidth utilization
Concurrency	`library cache lock`, `buffer busy waits`	Hard-parse storms, hot segment blocks, DDL contention	Application deployment timeline, WebLogic thread pool state
Commit	`log file sync`	Redo write latency, log writer contention	Storage throughput on redo log volumes
Application	`enq: TX - row lock contention`	Application-level lock design	Transaction duration in application logs
Cluster (RAC)	`gc buffer busy acquire`, `gc cr block 2-way`	Interconnect saturation, cross-instance data block contention	Private interconnect throughput and latency
Idle (diagnostic)	`SQL*Net message from client`	Application think time, connection pool sizing	Connection pool metrics, application round-trip count

The query below produces a class-level wait distribution as triage input:

SELECT
  wait_class,
  COUNT(*)                                     AS session_count,
  ROUND(AVG(seconds_in_wait), 2)               AS avg_wait_sec,
  COUNT(DISTINCT event)                        AS distinct_events
FROM V$SESSION
WHERE state      = 'WAITING'
  AND wait_class <> 'Idle'
GROUP BY wait_class
HAVING COUNT(*) > 2  -- adjust threshold for your concurrency level
ORDER BY session_count DESC;

From here, each dominant class has its own triage pathway, covered in the deep dives that follow.

Wait event deep dives by class

Each dive follows the same shape: Oracle wait data first, then the infrastructure check that names the actual root cause. Oracle's wait event descriptions reference catalogs every event below; the focus here is triage, not definitions.

System I/O: physical read events

db file sequential read fires on index range scans, where Oracle reads one block at a time from a specific index entry to its corresponding table block. High wait time with well-tuned execution plans points to storage latency rather than query structure.

db file scattered read fires on full table scans, where Oracle reads multiple contiguous blocks in a single I/O. Elevated wait time here means either the full scans are expected (large analytical queries with db_file_multiblock_read_count tuned for the workload) or missing indexes are forcing full scans where range scans would be more selective.

Both events can spike from storage subsystem saturation with no change in query execution paths. When System I/O waits climb alongside elevated storage latency on the host, the fix belongs at the infrastructure layer rather than in SQL. OpManager Nexus shortens that diagnostic by surfacing Oracle wait data and host storage metrics on the same timeline.

Concurrency: library cache and hot block contention

library cache lock typically signals hard-parse storms where sessions compete to parse SQL statements that could be shared. First rule out DDL operations (ALTER TABLE, CREATE INDEX); these take an exclusive lock on affected objects and produce the same wait. If no DDL is concurrent, the fix is on the cursor_sharing side: setting it to FORCE converts literals to bind variables and reduces hard parses, at the cost of potential plan instability where literal values affect cardinality estimates.

buffer busy waits indicate multiple sessions competing to access the same buffer in the cache, often a symptom of hot segment blocks in high-concurrency OLTP workloads. Query V$WAITSTAT for the block class with the highest counts to determine whether contention is in undo blocks, data blocks, or segment headers:

SELECT class, count AS wait_count, time AS wait_time
FROM V$WAITSTAT
WHERE count > 0
ORDER BY count DESC
FETCH FIRST 10 ROWS ONLY;

A Concurrency wait spike immediately after a new application deployment is a regression indicator pointing at code that generates unshared cursors. The same event during a maintenance window is background noise. Application-layer context (deployment timeline, WebLogic thread pool state) resolves the ambiguity.

Commit: redo write latency

log file sync fires every time a session issues a COMMIT and waits for the log writer (LGWR) to flush the redo buffer to disk. High log file sync times correlate directly with redo write latency. If redo log files sit on slow storage or share I/O bandwidth with datafiles, commit-heavy workloads stall here.

Check redo log placement and storage throughput before tuning application commit frequency. The most direct measurement is the average wait on log file parallel write from V$SYSTEM_EVENT:

SELECT event,
       total_waits,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 2) AS avg_wait_ms
FROM V$SYSTEM_EVENT
WHERE event = 'log file parallel write';

Sustained avg waits above the low-tens-of-milliseconds range indicate a log writer bottleneck or storage constraint on the redo volume. V$SYSTEM_EVENT values are cumulative since instance startup; read them as deltas (see the V$ metrics reference for the cumulative-vs-delta rule).

Application: row lock contention

enq: TX - row lock contention fires when one session holds a row lock and another waits to modify the same row. The blocking pair is visible in V$LOCK joined with V$SESSION:

SELECT
  l.sid        AS waiting_sid,
  s.username   AS waiting_user,
  l.type       AS lock_type,
  l.id1, l.id2,
  bk.sid       AS blocking_sid,
  bs.username  AS blocking_user,
  bs.status    AS blocking_status
FROM V$LOCK l
JOIN V$SESSION s  ON l.sid    = s.sid
JOIN V$LOCK bk    ON bk.type  = l.type
                 AND bk.id1   = l.id1
                 AND bk.id2   = l.id2
                 AND bk.request = 0
JOIN V$SESSION bs ON bk.sid   = bs.sid
WHERE l.request > 0;

The bk.type = l.type predicate prevents a TX waiter from being paired with an unrelated TM or UL holder that happens to share id1/id2. For simpler diagnostics, V$SESSION.BLOCKING_SESSION (10g+) returns the blocker's SID directly without the self-join, at the cost of losing the per-lock detail above.

The fix belongs in the application layer: reduce transaction duration, reorder DML operations to minimize lock hold time, or redesign the data access pattern.

Idle: client-side wait events

SQL*Net message from client (classified by Oracle as Idle, not Network) records time Oracle spends waiting for the client application to send the next request. A spike during a deployment window often indicates the new application version is doing more round-trips or holding connections open longer between statements. Seeing this event consume a significant fraction of DB Time (a quarter or more) with no application change is worth investigating as a potential connection leak.

For network-layer correlation, OpManager Nexus surfaces network device metrics alongside the application and server data already in view. A SQL*Net message from client spike coinciding with network saturation on the segment connecting application servers to the database tier is a diagnostic that requires both views.

Beyond individual wait events, the V$ dynamic performance views provide a continuous metrics signal that complements AWR-based triage.

V$ metrics reference

The table below consolidates the V$ metrics that correlate with actionable production incidents. Treat the values as starting points and calibrate against your environment's steady-state baseline.

Threshold reference table

Metric	Normal Range	Warning	Critical	V$ Source
Buffer Cache Hit Ratio	Mid-90s % or above	Drops below the mid-90s	Below the high-80s	V$SYSSTAT
Physical Reads/sec	Workload baseline	Elevated above baseline (e.g., 2x)	Significantly elevated (e.g., 4x), calibrate against observed steady state	V$SYSSTAT
Logical Reads/sec	Workload baseline	Elevated above baseline (e.g., 3x)	Significantly elevated (e.g., 5x), calibrate against observed steady state	V$SYSSTAT
Active User Sessions	Operational baseline (commonly < 70% of max as a heuristic)	Approaching limit (e.g., > 70% of max)	Near limit (e.g., > 90% of max), calibrate for your environment	V$SESSION
DB CPU % of DB Time	High ratio indicates CPU-bound OLTP workload (specific thresholds are practitioner heuristics, calibrate against your steady state)	Significantly below steady state	Very low	V$SYSMETRIC
Tablespace Used %	< 80%	> 80% (Oracle default: 85%)	> 90% (Oracle default: 97%)	V$TABLESPACE / DBA_DATA_FILES
ASM Diskgroup Used %	< 75% (operational heuristic for headroom; Oracle default critical: 90%)	> 75%	> 85%	V$ASM_DISKGROUP
Redo Write Latency (`log file parallel write` avg)	< 10ms	Sustained 15-20ms+	> 50ms (environment-dependent)	V$SYSTEM_EVENT
Parse CPU / Parse Elapsed	Close to 1.0	Drops into the 0.7-0.8 range	< 0.50 (indicative)	V$SYSSTAT
Datafile Status	ONLINE	Any OFFLINE	Any RECOVER	V$DATAFILE

Cumulative statistics like physical reads accumulate across the instance lifetime. AWR captures them as deltas between snapshots (default interval: 60 minutes, default retention: 8 days on Oracle 19c). Real-time polling tools calculate these as per-second rates by tracking the delta across consecutive polls.

Metrics that need context

When Buffer Cache Hit Ratio falls into the warning band, sessions are doing more physical reads than the SGA can absorb; expect correlated spikes in db file sequential read and db file scattered read. A very high ratio (above 99%) can mask SQL inefficiency in workloads with small working sets, so the exact inflection point is workload-dependent. Calculate it from the cache-specific counters:

WITH stats AS (
  SELECT name, value
  FROM V$SYSSTAT
  WHERE name IN ('physical reads cache',
                 'db block gets from cache',
                 'consistent gets from cache')
)
SELECT
  ROUND(
    (1 - (MAX(CASE WHEN name = 'physical reads cache' THEN value END)
          / NULLIF(MAX(CASE WHEN name = 'db block gets from cache'   THEN value END)
                 + MAX(CASE WHEN name = 'consistent gets from cache' THEN value END), 0)
    )) * 100, 2
  ) AS cache_hit_ratio
FROM stats;

The legacy formula using bare physical reads counts direct-path reads (full table scans, parallel query, large LOB reads) as misses even though those reads bypass the buffer cache entirely; on mixed workloads, that depresses the ratio without indicating a real cache problem.

Active sessions approaching the sessions parameter limit (Oracle's default formula is 1.5 x PROCESSES + 22) is a leading indicator of connection pool misconfiguration or a connection leak. Check current utilization:

SELECT current_utilization                                AS curr_sessions,
       limit_value                                         AS limit_value,
       CASE WHEN limit_value = 'UNLIMITED' THEN NULL
            ELSE ROUND(TO_NUMBER(current_utilization)
                       / NULLIF(TO_NUMBER(limit_value), 0) * 100, 1)
       END                                                 AS pct_used
FROM V$RESOURCE_LIMIT
WHERE resource_name = 'sessions';

Physical Reads and Logical Reads should be tracked as rates per second. A 5-minute polling interval catches transient spikes that a 60-minute AWR window would average out.

Redo Write Latency is covered in the Commit subsection above (query and interpretation).

Parse CPU / Parse Elapsed is the ratio of CPU time spent parsing to total elapsed parse time. A ratio near 1.0 means parses complete on CPU without waiting; that's the soft-parse signal. When the ratio drops well below 1.0, sessions are waiting on parse latches. No Oracle-documented standard exists for the cutoff, so calibrate against your environment's steady-state baseline.

WITH parse_stats AS (
  SELECT name, value
  FROM V$SYSSTAT
  WHERE name IN ('parse time cpu', 'parse time elapsed')
)
SELECT
  ROUND(
    MAX(CASE WHEN name = 'parse time cpu'     THEN value END)
    / NULLIF(MAX(CASE WHEN name = 'parse time elapsed' THEN value END), 0)
  , 2) AS parse_cpu_to_elapsed_ratio
FROM parse_stats;

V$SYSSTAT values for parse time are in centiseconds. Both counters are cumulative since instance startup, so this query returns the lifetime average and will hide a 30-minute parse-latch storm completely. For an actionable real-time signal, capture two readings 5-10 minutes apart and compute the ratio of the deltas.

ASM Statistics expose TOTAL_MB, FREE_MB, and USABLE_FILE_MB per diskgroup. For mirrored diskgroups, USABLE_FILE_MB is the number that matters: a diskgroup showing 30% free by raw space may have far less usable capacity once mirror overhead is factored in.

Operational baseline establishment

Establishing those baselines takes a structured collection period, segmentation by workload window, and percentile-based threshold derivation.

Collection period. A baseline period of two to four weeks captures enough variation to account for weekly batch cycles, month-end processing, and workload fluctuations.

Workload window segmentation. OLTP daytime hours and overnight batch windows produce different metric profiles. A buffer cache hit ratio that holds in the mid-90s during OLTP hours can drop significantly during a legitimate batch ETL run that scans large tables. Treat these as separate baselines rather than averaging them together; thresholds set against a blended average miss real anomalies during batch windows and generate false positives during OLTP hours.

Threshold derivation. For metrics like physical reads/sec and active session count where "normal" varies by workload, derive thresholds from observed percentiles. A Warning threshold at the 95th percentile of your baseline period and a Critical threshold at the 99th percentile catches genuine anomalies while tolerating normal variance. The multipliers in the reference table are a reasonable starting point when no baseline data is available yet.

To calculate percentiles from AWR data (requires Diagnostics Pack):

SELECT
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) AS p95,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY value) AS p99
FROM DBA_HIST_SYSMETRIC_HISTORY
WHERE metric_name = 'Physical Reads Per Sec'
  AND end_time   >= SYSTIMESTAMP - INTERVAL '30' DAY;

Use the same pattern for each metric in your baseline period; filter on end_time by time-of-day to segment OLTP versus batch windows.

Dynamic vs. static baselines. Static baselines work when workload patterns are stable. For environments where workload volume shifts over time (growing user bases, seasonal traffic patterns, migration phases), the SaaS delivery of OpManager Nexus offers AI-driven dynamic thresholds that adjust automatically without manual recalibration.

With baselines in place, the same calibration logic carries into the capacity monitoring covered next.

Tablespace and storage capacity

Utilization percentage alone misses the failure mode that actually pages people: growth running into a ceiling between polling cycles. The right primary signal is growth rate against the nearest ceiling, whether that's filesystem, MAXSIZE, or ASM diskgroup capacity.

Permanent tablespace monitoring

Autoextend is the silent-failure mode. A tablespace with autoextend enabled will consume disk space until either the filesystem fills, the tablespace hits its MAXSIZE limit, or the underlying ASM diskgroup runs out of usable capacity. When MAXSIZE is set to UNLIMITED (indicated by MAXBYTES=0 in DBA_DATA_FILES), Oracle grows the datafile until the filesystem is full or the smallfile/bigfile platform maximum is reached (whichever comes first), with no Oracle-space threshold alert at the datafile level. By the time an ORA-01653 (unable to extend table) or ORA-01688 (unable to extend table partition) error appears in the alert log, sessions have already failed.

This query exposes remaining capacity per tablespace, not just current consumption:

SELECT
  m.tablespace_name,
  ROUND(m.tablespace_size * t.block_size / 1073741824, 2)       AS total_gb,
  ROUND(m.used_space      * t.block_size / 1073741824, 2)       AS used_gb,
  ROUND((m.tablespace_size - m.used_space) * t.block_size
        / 1073741824, 2)                                         AS remaining_gb,
  ROUND(m.used_percent, 2)                                      AS used_pct,
  ROUND(100 - m.used_percent, 2)                                AS remaining_pct
FROM DBA_TABLESPACE_USAGE_METRICS m
JOIN DBA_TABLESPACES t ON m.tablespace_name = t.tablespace_name
ORDER BY m.used_percent DESC;

Track the growth trend (how many gigabytes per day a tablespace is growing) and alert before utilization reaches the autoextend ceiling. For licensed environments, calculate daily growth from AWR history:

SELECT
  v.name AS tablespace_name,
  ROUND(
    (MAX(h.tablespace_usedsize) - MIN(h.tablespace_usedsize))
    * t.block_size / 1073741824
    / NULLIF(EXTRACT(DAY FROM MAX(s.end_interval_time) - MIN(s.end_interval_time)), 0)
  , 2) AS avg_growth_gb_per_day
FROM DBA_HIST_TBSPC_SPACE_USAGE h
JOIN DBA_HIST_SNAPSHOT s
  ON  h.snap_id = s.snap_id
  AND h.dbid    = s.dbid
JOIN V$TABLESPACE    v ON h.tablespace_id = v.ts#
JOIN DBA_TABLESPACES t ON v.name          = t.tablespace_name
WHERE s.end_interval_time >= SYSDATE - 30
GROUP BY v.name, t.block_size
ORDER BY avg_growth_gb_per_day DESC NULLS LAST;

Without the Diagnostics Pack, capture periodic snapshots of DBA_TABLESPACE_USAGE_METRICS to an external tracking table on a scheduled basis and compute deltas between rows.

To identify datafiles with unbounded growth potential:

SELECT file_name, tablespace_name,
  ROUND(bytes / 1073741824, 2)                        AS current_size_gb,
  CASE WHEN maxbytes = 0 THEN 'UNLIMITED'
       ELSE TO_CHAR(ROUND(maxbytes / 1073741824, 2))
  END                                                  AS max_size_gb,
  CASE WHEN maxbytes = 0 THEN NULL
       ELSE ROUND((maxbytes - bytes) / 1073741824, 2)
  END                                                  AS growth_headroom_gb
FROM DBA_DATA_FILES
WHERE autoextensible = 'YES'
ORDER BY tablespace_name;

MAXBYTES = 0 indicates no explicit autoextend ceiling rather than literally unlimited capacity; Oracle still bounds the datafile at the smallfile-versus-bigfile platform maximum (~32 GB and ~32 TB respectively at 8 KB blocks), so factor that ceiling into capacity planning rather than treating "UNLIMITED" as truly unbounded.

TEMP tablespace monitoring

TEMP tablespace exhaustion (ORA-01652: unable to extend temp segment) is a frequent production incident. Large sort operations, hash joins, and global temporary table usage can exhaust TEMP space without advance warning. Set a warning threshold at 75-80% of configured TEMP size:

SELECT tablespace_name,
       ROUND(tablespace_size / 1048576, 2)              AS temp_total_mb,
       ROUND(allocated_space / 1048576, 2)              AS temp_allocated_mb,
       ROUND(free_space      / 1048576, 2)              AS temp_free_mb,
       ROUND((tablespace_size - free_space)
             / NULLIF(tablespace_size, 0) * 100, 2)     AS temp_used_pct
FROM DBA_TEMP_FREE_SPACE;

V$TEMP_SPACE_HEADER reports the allocation high-water mark from the tempfile header bitmap and does not reflect reclaimable extents. Oracle's lazy reclamation means freed sort segments stay marked used until a tempfile shrink or instance restart, so a query against it tends to alarm on healthy systems. DBA_TEMP_FREE_SPACE accounts for free-but-not-released space and is the right source for capacity thresholds; for currently-active sort/hash usage during an incident, V$SORT_SEGMENT or V$TEMPSEG_USAGE shows what individual sessions are holding.

Tablespace growth changes slowly enough that polling every 30-60 minutes is sufficient for most environments.

From single-instance scope, the next section adds the RAC and Multitenant scoping rules that change how the V$ queries above return data.

RAC and multitenant monitoring

In RAC environments, the V$ queries shown throughout this guide return data for the local instance only. Use GV$ views (e.g., GV$SESSION, GV$WAITSTAT) and filter by INST_ID to query across all nodes. For single-instance diagnostics on a specific RAC node, V$ queries remain correct but will not reflect wait events or sessions active on other nodes.

Cluster wait class events. gc buffer busy acquire and gc cr block 2-way indicate sessions waiting for blocks to transfer across the private interconnect between RAC nodes. Elevated wait times point to interconnect saturation or cross-instance contention for the same data blocks. Check private interconnect throughput and consider partitioning or rebalancing workloads across nodes to reduce inter-node block shipping. OpManager Nexus surfaces these RAC metrics alongside node state and ASM diskgroup capacity in one view.

CDB/PDB V$ view scoping. In Oracle 19c Multitenant architecture (non-CDB was deprecated in 21c), V$SESSION, V$SYSSTAT, and tablespace views return data scoped to the current container by default. Monitoring at the CDB root level shows aggregate metrics across all PDBs. For per-PDB visibility, configure a separate monitor for each PDB or use CON_ID filtering in your queries. For example, to scope the session wait query to a specific PDB:

SELECT wait_class, COUNT(*) AS session_count,
       ROUND(AVG(seconds_in_wait), 2) AS avg_wait_sec
FROM V$SESSION
WHERE state = 'WAITING'
  AND wait_class <> 'Idle'
  AND con_id = (SELECT con_id FROM V$PDBS WHERE name = 'YOUR_PDB_NAME')
GROUP BY wait_class
HAVING COUNT(*) > 2
ORDER BY session_count DESC;

OpManager Nexus supports automatic PDB discovery (set Discover Pluggable Database to Yes during monitor creation).

Once the monitor is collecting from the right scope, the next section covers what to do when the values cross threshold.

Threshold configuration and alert routing

Group-level or overall health alerts collapse multiple attribute states into a single signal; the result is alert noise and ambiguous routing. Per-attribute thresholds are the right unit for Oracle environments, where buffer cache hit ratio, tablespace utilization, physical reads, and wait event states each warrant a different response. OpManager Nexus supports this model directly.

Setting per-attribute thresholds

OpManager Nexus uses four severity states for Oracle monitor attributes:

Critical: a confirmed issue requiring immediate action
Warning: a potential issue that warrants attention but has not yet caused operational impact
Clear: a previously triggered condition that has resolved
Unknown: displayed when the attribute value does not match any configured severity condition

Use the reference table from the V$ Metrics section as your starting point for per-attribute thresholds. Oracle's own tablespace defaults (85%/97%) and ASM defaults (75%/90%) are more permissive than the table's recommendations, so decide whether your environment can tolerate that extra margin. Physical reads and session count thresholds require a baseline period before they are meaningful (see the Operational baseline establishment section).

Tablespace statistics collection is configured at Settings > Performance Polling > Optimize Data Collection in OpManager Nexus. Select Oracle from the Monitor Type dropdown, then TableSpace Statistics from the metric dropdown. Two scheduling options: "Collect data in every polling" runs tablespace collection on every poll cycle (appropriate for high-growth OLTP environments); "Collect data at customized time interval" schedules collection at a fixed time (sufficient for stable OLAP or data warehouse tablespaces with predictable growth).

Alert log collection follows the same path: Settings > Performance Polling > Optimize Data Collection, then select Oracle Alert Log from the metric dropdown. OpManager Nexus collects alert log entries on each poll and stores alert log history for a configurable retention period, which is useful for correlating a metrics anomaly with a specific Oracle error. You can suppress specific error patterns that are known-benign in your environment by entering them in the Errors to Ignore field under Settings > Performance Polling > Database Servers.

Alert log monitoring

The Oracle Alert Log surfaces errors that no metric will catch on its own. ORA-600 (internal errors that typically warrant Oracle Support involvement), ORA-4031 (shared pool memory exhaustion, which can trigger cascading parse failures), ORA-27xxx (I/O and OS errors), and media recovery events indicating datafile or redo log corruption are all signals that surface in the alert log before they appear in V$ metrics.

Configure thresholds at the individual error pattern level where possible. ORA-600 and ORA-4031 warrant Critical severity and immediate escalation. ORA-12514 (TNS listener errors) may warrant Warning severity during maintenance windows but Critical at other times.

Webhook and incident management integration

For routing alerts to your incident management platform, configure a webhook action. In OpManager Nexus, go to Admin > Alarm/Action > Actions and create a RestAPI Action. Provide your incident platform's webhook URL, set the form submission method to POST, and configure a JSON payload using OpManager Nexus's replaceable tags:

{
  "source": "$MONITORNAME",
  "host": "$HOSTNAME",
  "attribute": "$ATTRIBUTE",
  "severity": "$SEVERITY",
  "value": "$ATTRIBUTEVALUE",
  "message": "$RCAMSG_PLAINTEXT",
  "timestamp": "$STRMODIFIEDTIME"
}

In OpManager Nexus's webhook configuration UI, tags are entered without backslashes (e.g., $MONITORNAME, $SEVERITY). The backslashes shown in some documentation sources are a rendering artifact.

This payload includes $SEVERITY, which passes the current alarm severity (Critical, Warning, or Clear) to the receiving system. When paired with a Clear event, this enables auto-resolution of tickets in any incident platform that accepts incoming webhooks.

The ServiceDesk Plus integration creates tickets automatically when threshold conditions are met and resolves them when the alarm clears. It uses a dedicated REST API integration rather than the generic webhook action, but the auto-create/auto-close behavior is the same.

With thresholds and routing handled, the closing section walks through the monitor setup that puts them into effect.

Monitor setup and initial configuration

To add your first Oracle Database monitor in OpManager Nexus, go to New Monitor and select Oracle DB Server under Database Servers. Enter the host IP or hostname, port, username, and a valid SID or host connection string, then set your polling interval.

The monitoring user requires at minimum: CONNECT privilege, SELECT_CATALOG_ROLE (covers DBA_* views and most V$ views in 19c), and explicit grants on the underlying V_$ tables for any tooling that runs without role inheritance (definer's-rights stored procedures, for example, where roles are disabled at execution):

GRANT SELECT ON V_$SESSION         TO monitor_user;
GRANT SELECT ON V_$SYSSTAT         TO monitor_user;
GRANT SELECT ON V_$SYSMETRIC       TO monitor_user;
GRANT SELECT ON V_$SYSTEM_EVENT    TO monitor_user;
GRANT SELECT ON V_$WAITSTAT        TO monitor_user;
GRANT SELECT ON V_$RESOURCE_LIMIT  TO monitor_user;

The list above is representative rather than exhaustive; other queries in this guide also touch V$LOCK, V$ACTIVE_SESSION_HISTORY, V$TABLESPACE, and V$PDBS, which SELECT_CATALOG_ROLE already covers in role-aware contexts. Add the corresponding V_$ grants if your tooling cannot inherit role privileges, and GRANT SELECT ON V_$TEMP_SPACE_HEADER TO monitor_user; if you retain the legacy TEMP query for ad-hoc debugging.

For Multitenant environments, set Discover Pluggable Database to Yes to enumerate PDBs automatically. For RAC, select Oracle RAC Server instead of Oracle DB Server during monitor creation, provide either the Scan Host Name or the SCAN IP, and grant GV_$ equivalents to the monitoring user. OpManager Nexus's documentation lists the full grant set for its specific Oracle monitor implementation.

After the monitor is active:

Enable TableSpace Statistics and Oracle Alert Log collection (Settings > Performance Polling > Optimize Data Collection)
Using the V$ metrics reference table as your starting point, configure per-attribute thresholds for buffer cache hit ratio, tablespace utilization, and physical reads
Set up webhook or ServiceDesk Plus integration for alert routing
Collect baseline data for a sufficient period (typically two to four weeks) before tightening multiplier-based thresholds for physical reads and session count

For polling cadence: V$ metrics benefit from sub-AWR-interval polling, tablespace statistics tolerate lower-frequency intervals, and the alert log is best polled on every cycle so errors are captured within the polling window.

The triage framework in this guide (from DB Time ratio to wait class routing to V$ metric thresholds) gives you a repeatable path from symptom to corrective action. OpManager Nexus places Oracle events on the same timeline as the surrounding stack across on-prem and SaaS deployments, which cuts the context-switching that extends incident resolution time.

DEV Community