Alerting and Notification for Monoliths: Setting Up Early Warning Systems for Potential Issues

Monolithic applications remain common in many organizations, and they need robust monitoring and alerting to run reliably. Platform engineering teams are typically responsible for the early warning systems that detect problems before they affect end users. This post walks through how to set up alerting and notification for a monolithic application.

Monitoring Metrics

The first step in setting up an alerting and notification system is to identify the metrics that need to be monitored. These metrics will vary depending on the application and its use case. However, some common metrics that should be monitored include:

CPU usage
Memory usage
Disk usage
Network traffic
Response time
Error rates

Once these metrics have been identified, they need to be collected and analyzed. This can be done using various monitoring tools such as Prometheus, Nagios, or Datadog.
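To make the "collected and analyzed" step concrete, here is a minimal sketch of how two of the metrics above, error rate and response time, might be derived from raw request records. The `RequestSample` record shape and the function names are assumptions made for illustration; in practice a monitoring tool would compute these for you.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One handled request (hypothetical record shape for this sketch)."""
    duration_ms: float
    status_code: int


def error_rate(samples: list[RequestSample]) -> float:
    """Fraction of requests that ended in a server error (HTTP 5xx)."""
    if not samples:
        return 0.0
    errors = sum(1 for s in samples if s.status_code >= 500)
    return errors / len(samples)


def percentile_ms(samples: list[RequestSample], percent: int) -> float:
    """Nearest-rank percentile of request duration, e.g. percent=95 for p95."""
    if not samples:
        raise ValueError("no samples to summarize")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    durations = sorted(s.duration_ms for s in samples)
    # Integer form of ceil(percent * n / 100), avoiding float rounding.
    rank = (percent * len(durations) + 99) // 100
    return durations[rank - 1]


# Example window: 19 successful requests taking 1..19 ms, plus one failing request.
window = [RequestSample(duration_ms=float(i), status_code=200) for i in range(1, 20)]
window.append(RequestSample(duration_ms=20.0, status_code=503))

print(f"error rate: {error_rate(window):.2%}")
print(f"p95 latency: {percentile_ms(window, 95)} ms")
```

Percentiles are usually more useful than averages for response time, because a small number of very slow requests can hide behind a healthy-looking mean.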

Setting Up Alerts

Once the metrics have been collected, alerts need to be set up to notify the platform engineering team when there are potential issues. Alerts can be triggered based on threshold values or anomaly detection.

Threshold-based alerts are triggered when a metric exceeds or falls below a predefined value. For example, an alert can be set up to notify the team if CPU usage exceeds 80% for more than 5 minutes.
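The "for more than 5 minutes" part matters as much as the threshold itself, because it filters out short spikes. The sketch below shows one way that logic could work; the function name and the `(timestamp, value)` sample format are assumptions for illustration, not the API of any particular tool.

```python
def sustained_breach(samples, threshold, min_duration_s):
    """Return True if the metric is currently above `threshold` and has been
    continuously above it for at least `min_duration_s` seconds.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            # Remember when the current run of breaching samples began.
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample at or below the threshold resets the timer.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration_s


# CPU at 85% for a full 5 minutes, sampled once per minute.
steady = [(t, 85.0) for t in range(0, 301, 60)]

# Same window, but one sample dips below 80% and resets the timer.
dipped = [(0, 85.0), (60, 85.0), (120, 70.0), (180, 85.0), (240, 85.0), (300, 85.0)]

print(sustained_breach(steady, threshold=80.0, min_duration_s=300))  # True
print(sustained_breach(dipped, threshold=80.0, min_duration_s=300))  # False
```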

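Anomaly detection alerts are triggered when a metric deviates unexpectedly from its normal behavior, for example when response time is several standard deviations above its recent average. These alerts can catch issues that a fixed threshold would miss, such as a sudden drop in traffic at a time of day when load is normally high.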
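One simple form of anomaly detection is a z-score check against recent history. The sketch below is only an illustration of the idea, assuming a short list of recent values and a cutoff of three standard deviations; real tools typically use more sophisticated models that account for seasonality.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than `z_threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        # Not enough data to estimate normal behavior.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Recent response times in milliseconds (mean 100, standard deviation 2).
recent_ms = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_anomalous(recent_ms, 104))  # False: only 2 standard deviations away
print(is_anomalous(recent_ms, 150))  # True: 25 standard deviations away
```

Note that a static threshold of, say, 200 ms would not have fired for the 150 ms reading, even though it is far outside the normal range for this application.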

Alert notifications can be sent via email, SMS, or through a collaboration tool such as Slack or Microsoft Teams. It is essential to ensure that the alert notifications are sent to the right people and that they contain enough information to enable the team to take action quickly.
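As a rough illustration of "the right people" and "enough information", the sketch below picks notification channels by severity and builds a message that names the alert, the affected instance, and what is wrong. The routing table, the channel names, and the alert dictionary layout are all hypothetical; a real setup would usually express this in the alerting tool's own routing configuration.

```python
# Hypothetical routing table: which channels receive alerts of each severity.
ROUTES = {
    "critical": ["sms:on-call", "slack:#platform-alerts"],
    "warning": ["slack:#platform-alerts"],
}
# Fallback for alerts without a known severity.
DEFAULT_ROUTE = ["email:platform-team@example.com"]


def build_notification(alert: dict) -> dict:
    """Choose channels by severity and build an actionable message."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    severity = labels.get("severity", "unknown")
    text = (
        f"[{severity.upper()}] {labels.get('alertname', 'UnnamedAlert')} "
        f"on {labels.get('instance', 'unknown instance')}: "
        f"{annotations.get('description', 'no description provided')}"
    )
    return {"channels": ROUTES.get(severity, DEFAULT_ROUTE), "text": text}


example_alert = {
    "labels": {
        "alertname": "HighCPUUsage",
        "instance": "localhost:9100",
        "severity": "critical",
    },
    "annotations": {"description": "CPU usage is above 80%."},
}

notification = build_notification(example_alert)
print(notification["channels"])
print(notification["text"])
```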

Example: Setting Up Alerts in Prometheus

Prometheus is an open-source monitoring and alerting tool that can be used to monitor monolithic applications. Here is an example of how to set up alerts in Prometheus:

1. Define the alert rule in a Prometheus rule file:

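groups:
- name: example-alerts
  rules:
  - alert: HighCPUUsage
    # node_cpu_seconds_total is a counter of seconds spent in each CPU mode,
    # so derive the busy percentage from the per-second rate of the idle mode.
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage on instance {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 5 minutes on instance {{ $labels.instance }}."

In this example, the expression computes the average CPU usage per instance from the node exporter's node_cpu_seconds_total counter, and the alert fires when that usage stays above 80% for more than 5 minutes. The alert is labeled as critical and carries a summary and description of the issue.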
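Because node_cpu_seconds_total is a cumulative counter of CPU seconds per mode, a usage percentage comes from how fast the idle counter grows over a time window, not from the counter's raw value. The arithmetic for a single core can be sketched as follows; the function name and sample numbers are made up for illustration.

```python
def cpu_usage_percent(idle_start_s, idle_end_s, interval_s):
    """CPU usage for one core over an interval, from two readings of its
    cumulative idle-seconds counter taken `interval_s` seconds apart."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    # Fraction of the interval the core spent idle (the per-second rate).
    idle_fraction = (idle_end_s - idle_start_s) / interval_s
    return 100.0 * (1.0 - idle_fraction)


# The idle counter advanced by 45 seconds during a 300-second window,
# so the core was idle 15% of the time and busy the remaining 85%.
usage = cpu_usage_percent(idle_start_s=10_000.0, idle_end_s=10_045.0, interval_s=300.0)
print(f"{usage:.1f}% busy")
```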

2. Load the alert rule file from the main Prometheus configuration file (typically prometheus.yml):

rule_files:
  - /etc/prometheus/rules/example-alerts.yaml

3. Configure Prometheus to forward firing alerts to Alertmanager, which sends the notifications:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

In this example, Prometheus forwards alerts to an Alertmanager instance reachable at alertmanager:9093. Alertmanager then groups the alerts and routes notifications to receivers such as email or Slack.

4. Verify that the alerts are working:

$ curl http://localhost:9090/api/v1/alerts
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {
          "alertname": "HighCPUUsage",
          "instance": "localhost:9100",
          "severity": "critical"
        },
        "annotations": {
          "summary": "High CPU usage on instance localhost:9100",
          "description": "CPU usage is above 80% for more than 5 minutes on instance localhost:9100."
        },
        "state": "firing",
        "activeAt": "2022-03-01T12:34:56Z"
      }
    ]
  }
}

In this example, the alert is in the firing state, and the response shows its labels and annotations. The output is formatted and abridged for readability. The same information is available in the Prometheus web UI on the /alerts page.
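The verification step can also be scripted. The sketch below assumes the Prometheus HTTP API endpoint /api/v1/alerts, which wraps its results in a JSON envelope with a "status" field and a "data.alerts" list; the function name and the sample payload are illustrative. It parses a response body rather than making a network call, so the HTTP request itself is left to the reader.

```python
import json


def firing_alerts(response_body: str) -> list[dict]:
    """Return the alerts in the firing state from an /api/v1/alerts response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")
    alerts = payload.get("data", {}).get("alerts", [])
    return [alert for alert in alerts if alert.get("state") == "firing"]


# Illustrative response body with one firing and one pending alert.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "HighCPUUsage"}, "state": "firing"},
            {"labels": {"alertname": "HighMemoryUsage"}, "state": "pending"},
        ]
    },
})

for alert in firing_alerts(sample_body):
    print(alert["labels"]["alertname"])
```

An alert in the pending state has met its condition but not yet for the full "for" duration, so it is excluded here.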

Conclusion

Setting up alerting and notification systems for monolithic applications is essential to ensure their smooth operation. Platform engineering teams need to identify the metrics that need to be monitored, set up alerts based on threshold values or anomaly detection, and configure alert notifications to be sent to the right people. By following these steps, platform engineering teams can set up early warning systems that detect potential issues before they impact end-users.