Sajja Sudhakararao

Build an Alert Decision Layer CLI in Python

We talk a lot about alerting, but not enough about deciding.

This weekend project builds a small Alert Decision Layer as a Python CLI called alertdecider:

  • Input: alerts JSON (think Alertmanager or PagerDuty export).
  • Engine: a clear rule set that considers severity, environment, service tier, and flapping history.
  • Output: Markdown + JSON with decisions (page, ticket, aggregate, suppress) and reasons.
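To make the input concrete, here is what one entry in an Alertmanager-style alerts.json might look like (the exact field names are an assumption, based on the labels the loader extracts later in the post):

```json
[
  {
    "fingerprint": "a1b2c3d4",
    "status": "firing",
    "startsAt": "2025-01-10T03:12:00Z",
    "labels": {
      "alertname": "HighErrorRate",
      "service": "checkout-api",
      "severity": "critical",
      "env": "prod"
    },
    "annotations": {
      "summary": "5xx rate above 5% for 10 minutes"
    }
  }
]
```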

If you liked project-based posts like "AI trading bot in Python" or "Self-healing containers with Bash", this sits in the same category: you end up with a tool you can run and extend.


What you'll build

By the end of this tutorial you’ll have:

  • A Python package alertdecider-agent/ with:
    • Dataclasses for Alert, ServiceProfile, History.
    • A rule-based AlertDecisionEngine.
    • A CLI entry point.
  • An examples/ folder with sample alerts, services, and alert history.
  • A command you can run locally to triage alerts and generate a report.

Prerequisites

  • Python 3.10+
  • Basic familiarity with JSON/YAML
  • A terminal where you can run python -m ...

1. Clone and set up the project

git clone https://github.com/AutoShiftOps/alertdecider.git
cd alertdecider
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The requirements.txt is intentionally small:

rich==13.9.4
PyYAML==6.0.2

2. Model the domain: alerts, services, history

In alertdecider-agent/models.py we define three dataclasses:

from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Alert:
    id: str
    source: str
    name: str
    service: str
    severity: str
    environment: str
    starts_at: str
    fingerprint: str
    raw: Dict[str, Any]

@dataclass
class ServiceProfile:
    name: str
    tier: str
    slo_critical: bool
    owner: str | None = None

@dataclass
class History:
    fingerprint: str
    count_24h: int
    last_status: str | None = None

This gives us a normalized view of alerts and some context we can use to make better decisions.
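The engine in step 4 also returns Decision objects. That model isn't shown in the excerpt above; a minimal sketch, consistent with how the engine constructs it as Decision(alert, action, reason), would be:

```python
from dataclasses import dataclass

# Placed alongside the other models; field names follow the positional
# arguments the engine passes: Decision(alert, action, reason).
@dataclass
class Decision:
    alert: "Alert"   # the alert being triaged
    action: str      # one of: page, ticket, aggregate, suppress
    reason: str      # human-readable explanation for the report
```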


3. Load alerts, services, and history

alertdecider-agent/loader.py contains helpers to turn raw files into those models.

  • load_alerts(path) reads alerts.json and extracts labels like alertname, service, severity, env, and fingerprint.
  • load_services(path) reads services.yml and builds ServiceProfile objects.
  • load_history(path) reads history.json and tracks how many times each fingerprint fired in the last 24h.
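As one example, load_alerts might look roughly like this (a sketch; the label names alertname, service, severity, and env are assumptions about the input format, and the Alert dataclass is repeated so the snippet runs on its own):

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List

# Repeated from models.py so this sketch is self-contained.
@dataclass
class Alert:
    id: str
    source: str
    name: str
    service: str
    severity: str
    environment: str
    starts_at: str
    fingerprint: str
    raw: Dict[str, Any]

def load_alerts(path: str) -> List[Alert]:
    """Parse an Alertmanager-style JSON export into Alert models."""
    with open(path) as f:
        raw_alerts = json.load(f)

    alerts = []
    for item in raw_alerts:
        labels = item.get("labels", {})
        alerts.append(Alert(
            id=item.get("fingerprint", ""),
            source="alertmanager",
            name=labels.get("alertname", "unknown"),
            service=labels.get("service", "unknown"),
            severity=labels.get("severity", "info"),
            environment=labels.get("env", "dev"),
            starts_at=item.get("startsAt", ""),
            fingerprint=item.get("fingerprint", ""),
            raw=item,
        ))
    return alerts
```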

Example services.yml:

services:
  checkout-api:
    tier: tier1
    slo_critical: true
    owner: team-checkout
  notification-service:
    tier: tier1
    slo_critical: false
    owner: team-notify
  internal-reporting:
    tier: tier2
    slo_critical: false
    owner: team-data

4. Implement the AlertDecisionEngine

Now the interesting part: turn alerts + context into decisions.

In alertdecider-agent/engine.py we implement AlertDecisionEngine with a few rules. Conceptually:

class AlertDecisionEngine:
    def __init__(self, services, history):
        self.services = services
        self.history = history

    def decide(self, alerts: List[Alert]) -> List[Decision]:
        return [self._decide_one(a) for a in alerts]

    def _decide_one(self, alert: Alert) -> Decision:
        service_profile = self.services.get(alert.service)
        hist = self.history.get(alert.fingerprint)
        sev = alert.severity.lower()

        # 1) Suppress noisy low-severity alerts in non-prod
        if sev in {"info", "debug"} and alert.environment not in {"prod", "production"}:
            return Decision(alert, "suppress", "low-severity alert in non-prod environment")

        # 2) Page for tier1, slo-critical services on critical alerts
        if (sev == "critical" and service_profile and
                service_profile.tier == "tier1" and service_profile.slo_critical):
            return Decision(alert, "page", "critical alert on tier1 slo-critical service")

        # 3) Aggregate flapping alerts (lots of repeats in 24h)
        if hist and hist.count_24h >= 20:
            return Decision(alert, "aggregate", "alert fingerprint is flapping/noisy")

        # 4) Warnings in prod for tier1 services become tickets
        if sev == "warning" and service_profile and service_profile.tier == "tier1":
            return Decision(alert, "ticket", "warning on tier1 service; track as ticket")

        # 5) Default: ticket for prod, suppress for non-prod
        if alert.environment in {"prod", "production"}:
            return Decision(alert, "ticket", "prod alert without more specific rule")

        return Decision(alert, "suppress", "non-prod alert without more specific rule")

These rules are not perfect – they’re a starting point you can tweak.

The key is that they’re explicit. Anyone on your team can read, discuss, and change them.
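One subtlety worth discussing: rule order matters. Because the paging rule (2) runs before the flapping rule (3), a critical alert on a tier1 SLO-critical service still pages even when its fingerprint is flapping. A condensed, runnable version of just the rule ordering (plain parameters instead of the repo's model classes):

```python
def decide(severity, tier, slo_critical, count_24h, environment="prod"):
    """Condensed rules 1-5 from the engine, returning only the action."""
    sev = severity.lower()
    if sev in {"info", "debug"} and environment not in {"prod", "production"}:
        return "suppress"
    if sev == "critical" and tier == "tier1" and slo_critical:
        return "page"  # checked before the flapping rule on purpose
    if count_24h >= 20:
        return "aggregate"
    if sev == "warning" and tier == "tier1":
        return "ticket"
    return "ticket" if environment in {"prod", "production"} else "suppress"

# A flapping critical on a tier1 SLO-critical service still pages:
print(decide("critical", "tier1", True, count_24h=50))  # page
# ...while a flapping warning on the same service aggregates:
print(decide("warning", "tier1", True, count_24h=50))   # aggregate
```

If you would rather aggregate flapping criticals too, swapping rules 2 and 3 is a one-line change, which is exactly the kind of discussion an explicit rule set makes possible.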


5. Wire up the CLI

alertdecider-agent/cli.py glues everything together:

ap = argparse.ArgumentParser(
    prog="alertdecider-agent",
    description="Alert Decision Layer CLI",
)
ap.add_argument("--alerts", required=True)
ap.add_argument("--services", default="")
ap.add_argument("--history", default="")
ap.add_argument("--out-dir", default="out")

args = ap.parse_args()

alerts = load_alerts(args.alerts)
services = load_services(args.services)
history = load_history(args.history)

engine = AlertDecisionEngine(services, history)
decisions = engine.decide(alerts)

write_reports(args.out_dir, decisions)
render_console(decisions)
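write_reports isn't shown above; a minimal sketch of it, assuming the output filenames from section 6 and decisions that expose .alert, .action, and .reason, might be:

```python
import json
from pathlib import Path

def write_reports(out_dir: str, decisions: list) -> None:
    """Write decision_report.md and decision_report.json into out_dir.

    Each decision is assumed to expose .alert, .action, and .reason,
    matching the Decision model used by the engine.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    rows = [
        {"alert": d.alert.name, "service": d.alert.service,
         "action": d.action, "reason": d.reason}
        for d in decisions
    ]
    (out / "decision_report.json").write_text(json.dumps(rows, indent=2))

    lines = ["# Alert Decisions", "",
             "| Alert | Service | Action | Reason |",
             "| --- | --- | --- | --- |"]
    lines += [f"| {r['alert']} | {r['service']} | {r['action']} | {r['reason']} |"
              for r in rows]
    (out / "decision_report.md").write_text("\n".join(lines) + "\n")
```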

We also have __main__.py so you can use python -m alertdecider-agent.


6. Run it with sample data

The examples/ folder contains a simple dataset:

python -m alertdecider-agent \
  --alerts examples/alerts.json \
  --services examples/services.yml \
  --history examples/history.json \
  --out-dir out
  • Check the CLI table output.
  • Open out/decision_report.md to see the human-friendly report.
  • Open out/decision_report.json if you want to wire this into another tool.

Try changing severity, env, or service tier in the examples and see how decisions change.


7. Make it yours

Here are some ideas for adapting this to your environment:

  • Add time-of-day logic (e.g., don’t page at 03:00 for non-critical stuff).
  • Add SLO signals (e.g., error budget burn rate) into the decision rules.
  • Replace history.json with a real datastore of past alerts.
  • Call alertdecider from your Alertmanager/PagerDuty webhook pipeline.
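As a taste of the first idea, a quiet-hours guard could downgrade a page to a ticket overnight. A sketch (the 22:00 to 07:00 window and the downgrade behaviour are assumptions you would tune):

```python
from datetime import datetime, time

QUIET_START = time(22, 0)  # assumed quiet-hours window; tune for your team
QUIET_END = time(7, 0)

def in_quiet_hours(now: datetime) -> bool:
    """True between 22:00 and 07:00, a window that wraps past midnight."""
    t = now.time()
    return t >= QUIET_START or t < QUIET_END

def apply_quiet_hours(action: str, slo_critical: bool, now: datetime) -> str:
    """Downgrade non-SLO-critical pages to tickets during quiet hours."""
    if action == "page" and not slo_critical and in_quiet_hours(now):
        return "ticket"
    return action
```

This slots in naturally as a post-processing step over the engine's decisions, so the core rules stay simple.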

You now have a small, understandable alert decision layer you can evolve, and a much better foundation for plugging AI into later.

If you build on this, drop a link – I’d love to see different rule sets and architectures.
