DEV Community

Bonthu Durga Prasad

OCI Monitoring & Alarms: Practical Guide with Real-Time Testing, Architecture, and Troubleshooting

Introduction

Modern cloud environments require proactive monitoring to detect issues before they impact users.

In production, inadequate monitoring delays incident response and prolongs downtime. OCI Monitoring addresses this with real-time observability and alerting.

Oracle Cloud Infrastructure Monitoring enables you to:

  • Collect real-time metrics
  • Define intelligent alarms
  • Trigger automated notifications

In this guide, we will:

  • Understand OCI Monitoring architecture
  • Configure alarms using the Console
  • Validate alerts with a real test
  • Apply the same setup across multiple OCI services

Architecture Overview

Flow Explanation:

  • OCI services emit metrics (Compute, Load Balancer, Autonomous Database)
  • Metrics are collected by 👉 OCI Monitoring
  • Alarms evaluate conditions
  • Notifications are sent via 👉 OCI Notifications

Understanding Metrics in OCI

Metrics are:

  • Time-series data
  • Automatically generated by OCI services

Examples:

  • CPU Utilization (Compute)
  • HTTP Errors (Load Balancer)
  • Storage Usage (DB)
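
To make "time-series data" concrete, here is a minimal sketch of how raw data points can be averaged into fixed one-minute windows, which is roughly what an interval plus a mean statistic does. The sample values and the function name `mean_per_interval` are hypothetical and are not OCI output or an OCI API.

```python
from collections import defaultdict
from statistics import mean


def mean_per_interval(points, interval_seconds=60):
    """Group (epoch_seconds, value) points into fixed windows and average each window."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of its window.
        window_start = timestamp - (timestamp % interval_seconds)
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical CPU utilization samples in percent, keyed by seconds since start.
samples = [(0, 10.0), (30, 20.0), (60, 90.0), (90, 70.0)]
print(mean_per_interval(samples))  # {0: 15.0, 60: 80.0}
```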

Understanding Alarms

Alarms:

  • Continuously evaluate metrics
  • Trigger when thresholds are breached

Example:

  • CPU > 80%
  • Error rate > 5%
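
As a rough illustration of threshold evaluation, the sketch below labels each evaluated value as OK or FIRING against a fixed threshold. It is a simplified model, not the OCI implementation, and the per-minute CPU values are made up.

```python
def alarm_states(values, threshold):
    """Return 'FIRING' for each value above the threshold, otherwise 'OK'."""
    return ["FIRING" if value > threshold else "OK" for value in values]


# Hypothetical per-minute mean CPU utilization in percent.
cpu_means = [35.0, 62.0, 91.0, 88.0, 40.0]
print(alarm_states(cpu_means, 80))
# ['OK', 'OK', 'FIRING', 'FIRING', 'OK']
```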

Step-by-Step: Creating Alarm (Console)

👉 Observability & Management → Monitoring → Alarm Definitions → Create Alarm

Key Configuration:

  • Metric Namespace (for example, oci_computeagent)
  • Interval (1m / 5m)
  • Threshold condition
  • Severity
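
Behind the Console form, the metric, interval, and threshold are combined into a single Monitoring Query Language (MQL) expression. The helper below is a hypothetical sketch that assembles such a string. The metric name `CpuUtilization` and the `resourceId` dimension are assumptions not stated in this guide, so confirm them in Metrics Explorer for your instance.

```python
def build_alarm_query(metric, interval, statistic, operator, threshold, dimensions=None):
    """Assemble an MQL-style expression such as 'CpuUtilization[1m].mean() > 80'."""
    selector = ""
    if dimensions:
        # Optional dimension filter, e.g. {resourceId = "<instance OCID>"}.
        pairs = ", ".join(f'{key} = "{value}"' for key, value in dimensions.items())
        selector = "{" + pairs + "}"
    return f"{metric}[{interval}]{selector}.{statistic}() {operator} {threshold}"


print(build_alarm_query("CpuUtilization", "1m", "mean", ">", 80))
# CpuUtilization[1m].mean() > 80
```

Passing `dimensions={"resourceId": "<your instance OCID>"}` would scope the same query to a single instance.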

Understanding Metric Namespaces in OCI

OCI metrics are organized into namespaces based on the source of data:

  • oci_computeagent → Provides instance metrics such as CPU utilization, memory utilization, disk I/O, and network throughput. These are emitted by the Compute Instance Monitoring plugin of Oracle Cloud Agent, so the plugin must be enabled and running on the instance.

  • oci_compute_infrastructure_health → Provides infrastructure-level health metrics such as instance status. These are emitted by the platform and do not depend on a guest agent.

Notifications Setup

Using 👉 OCI Notifications

Steps:

  • Create a topic
  • Add a subscription (Email / HTTPS)
  • Confirm the subscription

In the alarm definition, select the topic you created as the notification destination. When the alarm fires, OCI Notifications delivers a message to every confirmed subscription on that topic, for example your email address.

Practical Validation

👉 This is where a test compute instance comes in.

Even though OCI Monitoring is service-agnostic, we validate using a compute instance.

Triggering a Real Alert

SSH into instance:

  • sudo yum install stress -y
  • stress --cpu 2 --timeout 120
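
If the stress package is not available on your image, a small Python script can generate comparable load. This is a minimal sketch that assumes Python 3 is installed on the instance. The function names `burn_cpu` and `generate_load` are hypothetical, and the defaults mirror the two workers and 120 seconds used above.

```python
import multiprocessing
import time


def burn_cpu(seconds):
    """Busy-loop for roughly `seconds` seconds and return the loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def generate_load(workers=2, seconds=120):
    """Run one busy-loop process per worker, similar to `stress --cpu 2 --timeout 120`."""
    processes = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(workers)
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
```

To use it, save the code as a script, call `generate_load()` under an `if __name__ == "__main__":` guard, and run it with `python3`. Separate processes are used rather than threads so that each worker can occupy its own core.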

Expected Outcome:

  • CPU utilization spikes on the metrics graph
  • Alarm moves to the FIRING state
  • Notification is received

[Screenshots: metrics graph spike, alarm state]

Multi-Service Use Cases

This same setup works for:

  • 🖥️ Compute
    CPU, Memory

  • 🌐 Load Balancer
    HTTP 5xx errors
    Latency

  • 🗄️ Databases
    Storage thresholds
    Active sessions

👉 One monitoring system → multiple services

Troubleshooting

  • ❌ Alarm not triggering
    Wrong metric namespace
    Incorrect interval

  • ❌ No notifications
    Subscription not confirmed
    Topic mismatch

  • ❌ Metrics missing
    Service delay
    Agent/plugin disabled (for compute)

⚑ Best Practices

  • Use different severities
  • Avoid alert noise (don't set thresholds too low)
  • Always validate alarms manually
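
Besides raising thresholds, one common way to cut alert noise is to require a sustained breach before firing, which OCI alarms expose as a trigger delay. The sketch below is a simplified model of that idea using made-up per-minute values. The function name and the consecutive-breach logic are illustrative assumptions, not OCI's exact evaluation algorithm.

```python
def fires_with_delay(values, threshold, required_breaches):
    """Return the index where the alarm would fire, or None if it never fires."""
    streak = 0
    for index, value in enumerate(values):
        # Count consecutive breaches and reset on any value at or below the threshold.
        streak = streak + 1 if value > threshold else 0
        if streak >= required_breaches:
            return index
    return None


spiky = [85, 40, 90, 45, 88, 30]      # brief, isolated spikes
sustained = [85, 90, 88, 92, 50, 40]  # a genuine sustained load

print(fires_with_delay(spiky, 80, 1))      # 0
print(fires_with_delay(spiky, 80, 3))      # None
print(fires_with_delay(sustained, 80, 3))  # 2
```

With a single required breach the spiky series fires immediately, while requiring three consecutive breaches ignores the spikes and still catches the sustained load.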

Validation Checklist

  • Metrics visible ✅
  • Alarm configured ✅
  • Notification received ✅
  • Real test performed ✅

🏁 Conclusion

OCI Monitoring and Alarms provide a powerful and unified observability solution across all OCI services. By combining real-time metrics, flexible alarm configurations, and integrated notifications, teams can proactively detect and respond to issues before they impact users.

This guide demonstrated not just configuration but also real-time validation through practical testing, a critical step for production readiness.

With these practices, organizations can significantly improve system reliability, reduce downtime, and enhance operational visibility across cloud environments.
