We talk a lot about alerting, but not enough about deciding.
This weekend project builds a small Alert Decision Layer as a Python CLI called alertdecider:
- Input: alerts JSON (think Alertmanager or PagerDuty export).
- Engine: a clear rule set that considers severity, environment, service tier, and flapping history.
- Output: Markdown + JSON with decisions (page, ticket, aggregate, suppress) and reasons.
If you liked project-based posts like "AI trading bot in Python" or "Self-healing containers with Bash", this sits in the same category: you end up with a tool you can run and extend.
What you'll build
By the end of this tutorial you’ll have:
- A Python package alertdecider-agent/ with:
  - Dataclasses for Alert, ServiceProfile, History.
  - A rule-based AlertDecisionEngine.
  - A CLI entry point.
- An examples/ folder with sample alerts, services, and alert history.
- A command you can run locally to triage alerts and generate a report.
Prerequisites
- Python 3.10+
- Basic familiarity with JSON/YAML
- A terminal where you can run python -m ...
1. Clone and set up the project
git clone https://github.com/AutoShiftOps/alertdecider.git
cd alertdecider
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The requirements.txt is intentionally small:
rich==13.9.4
PyYAML==6.0.2
2. Model the domain: alerts, services, history
In alertdecider-agent/models.py we define three dataclasses:
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Alert:
    id: str
    source: str
    name: str
    service: str
    severity: str
    environment: str
    starts_at: str
    fingerprint: str
    raw: Dict[str, Any]

@dataclass
class ServiceProfile:
    name: str
    tier: str
    slo_critical: bool
    owner: str | None = None

@dataclass
class History:
    fingerprint: str
    count_24h: int
    last_status: str | None = None
This gives us a normalized view of alerts and some context we can use to make better decisions.
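As a quick sanity check, here's a minimal sketch of how a raw Alertmanager-style payload could be mapped onto the Alert model. The normalize helper and the exact label names are assumptions for illustration, not part of the repo:

```python
from dataclasses import dataclass
from typing import Any, Dict

# Alert definition matching models.py above.
@dataclass
class Alert:
    id: str
    source: str
    name: str
    service: str
    severity: str
    environment: str
    starts_at: str
    fingerprint: str
    raw: Dict[str, Any]

def normalize(raw: Dict[str, Any]) -> Alert:
    """Map an Alertmanager-style payload onto the normalized Alert model (hypothetical)."""
    labels = raw.get("labels", {})
    return Alert(
        id=raw.get("fingerprint", ""),
        source="alertmanager",
        name=labels.get("alertname", "unknown"),
        service=labels.get("service", "unknown"),
        severity=labels.get("severity", "info"),
        environment=labels.get("env", "dev"),
        starts_at=raw.get("startsAt", ""),
        fingerprint=raw.get("fingerprint", ""),
        raw=raw,
    )

alert = normalize({
    "fingerprint": "abc123",
    "startsAt": "2024-05-01T00:00:00Z",
    "labels": {"alertname": "HighErrorRate", "service": "checkout-api",
               "severity": "critical", "env": "prod"},
})
```

Everything downstream (the engine, the reports) only ever touches these normalized fields, so swapping in a different alert source means changing only this mapping.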
3. Load alerts, services, and history
alertdecider-agent/loader.py contains helpers to turn raw files into those models.
- load_alerts(path) reads alerts.json and extracts labels like alertname, service, severity, env, and fingerprint.
- load_services(path) reads services.yml and builds ServiceProfile objects.
- load_history(path) reads history.json and tracks how many times each fingerprint fired in the last 24h.
Example services.yml:
services:
  checkout-api:
    tier: tier1
    slo_critical: true
    owner: team-checkout
  notification-service:
    tier: tier1
    slo_critical: false
    owner: team-notify
  internal-reporting:
    tier: tier2
    slo_critical: false
    owner: team-data
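For reference, here's a hedged sketch of how load_services might turn the parsed YAML mapping into ServiceProfile objects. The build_services name and defaults are assumptions; the real loader.py would first read the file with PyYAML (yaml.safe_load) and then do something like this:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ServiceProfile:
    name: str
    tier: str
    slo_critical: bool
    owner: Optional[str] = None

def build_services(parsed: Dict) -> Dict[str, ServiceProfile]:
    """Build ServiceProfile objects from a parsed services.yml mapping (sketch)."""
    out: Dict[str, ServiceProfile] = {}
    for name, cfg in parsed.get("services", {}).items():
        out[name] = ServiceProfile(
            name=name,
            tier=cfg.get("tier", "tier2"),          # assumed default tier
            slo_critical=bool(cfg.get("slo_critical", False)),
            owner=cfg.get("owner"),
        )
    return out

services = build_services({
    "services": {
        "checkout-api": {"tier": "tier1", "slo_critical": True, "owner": "team-checkout"},
        "internal-reporting": {"tier": "tier2"},
    }
})
```

Keying the dict by service name makes the engine's lookup (self.services.get(alert.service)) a cheap O(1) operation.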
4. Implement the AlertDecisionEngine
Now the interesting part: turn alerts + context into decisions.
In alertdecider-agent/engine.py we implement AlertDecisionEngine with a few rules. Conceptually:
class AlertDecisionEngine:
    def __init__(self, services, history):
        self.services = services
        self.history = history

    def decide(self, alerts: List[Alert]) -> List[Decision]:
        return [self._decide_one(a) for a in alerts]

    def _decide_one(self, alert: Alert) -> Decision:
        service_profile = self.services.get(alert.service)
        hist = self.history.get(alert.fingerprint)
        sev = alert.severity.lower()

        # 1) Suppress noisy low-severity alerts in non-prod
        if sev in {"info", "debug"} and alert.environment not in {"prod", "production"}:
            return Decision(alert, "suppress", "low-severity alert in non-prod environment")

        # 2) Page for tier1, slo-critical services on critical alerts
        if (sev == "critical" and service_profile and
                service_profile.tier == "tier1" and service_profile.slo_critical):
            return Decision(alert, "page", "critical alert on tier1 slo-critical service")

        # 3) Aggregate flapping alerts (lots of repeats in 24h)
        if hist and hist.count_24h >= 20:
            return Decision(alert, "aggregate", "alert fingerprint is flapping/noisy")

        # 4) Warnings on tier1 services become tickets
        if sev == "warning" and service_profile and service_profile.tier == "tier1":
            return Decision(alert, "ticket", "warning on tier1 service; track as ticket")

        # 5) Default: ticket for prod, suppress for non-prod
        if alert.environment in {"prod", "production"}:
            return Decision(alert, "ticket", "prod alert without more specific rule")
        return Decision(alert, "suppress", "non-prod alert without more specific rule")
These rules are not perfect – they’re a starting point you can tweak.
The key is that they’re explicit. Anyone on your team can read, discuss, and change them.
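To see the rules in action without pulling in the whole package, here's a condensed, self-contained version of the decision logic exercised against stand-in objects. The decide_one function and the SimpleNamespace fixtures are illustrative only; the real code lives in the engine class and uses the dataclasses from models.py:

```python
from types import SimpleNamespace

def decide_one(alert, services, history):
    # Condensed copy of AlertDecisionEngine._decide_one, returning just the action.
    profile = services.get(alert.service)
    hist = history.get(alert.fingerprint)
    sev = alert.severity.lower()
    if sev in {"info", "debug"} and alert.environment not in {"prod", "production"}:
        return "suppress"
    if sev == "critical" and profile and profile.tier == "tier1" and profile.slo_critical:
        return "page"
    if hist and hist.count_24h >= 20:
        return "aggregate"
    if sev == "warning" and profile and profile.tier == "tier1":
        return "ticket"
    return "ticket" if alert.environment in {"prod", "production"} else "suppress"

svc = {"checkout-api": SimpleNamespace(tier="tier1", slo_critical=True)}
crit = SimpleNamespace(service="checkout-api", fingerprint="f1",
                       severity="critical", environment="prod")
noisy = SimpleNamespace(service="checkout-api", fingerprint="f2",
                        severity="info", environment="staging")

print(decide_one(crit, svc, {}))   # page
print(decide_one(noisy, svc, {}))  # suppress
```

Note that rule order matters: a critical alert on a flapping fingerprint still pages, because rule 2 fires before the aggregation rule.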
5. Wire up the CLI
alertdecider-agent/cli.py glues everything together:
import argparse

ap = argparse.ArgumentParser(
    prog="alertdecider-agent",
    description="Alert Decision Layer CLI",
)
ap.add_argument("--alerts", required=True)
ap.add_argument("--services", default="")
ap.add_argument("--history", default="")
ap.add_argument("--out-dir", default="out")
args = ap.parse_args()

alerts = load_alerts(args.alerts)
services = load_services(args.services)
history = load_history(args.history)

engine = AlertDecisionEngine(services, history)
decisions = engine.decide(alerts)

write_reports(args.out_dir, decisions)
render_console(decisions)
We also have __main__.py so you can use python -m alertdecider-agent.
6. Run it with sample data
The examples/ folder contains a simple dataset:
python -m alertdecider-agent --alerts examples/alerts.json --services examples/services.yml --history examples/history.json --out-dir out
- Check the CLI table output.
- Open out/decision_report.md to see the human-friendly report.
- Open out/decision_report.json if you want to wire this into another tool.
Try changing severity, env, or service tier in the examples and see how decisions change.
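If you want to experiment without the repo's example files, here's a sketch of a plausible alerts.json in the Alertmanager-export shape. The field and label names are assumptions matching the loader description above; the real examples/alerts.json may differ:

```python
import json

# Hypothetical sample input; adjust labels to match your own sources.
sample_alerts = [
    {
        "fingerprint": "abc123",
        "startsAt": "2024-05-01T00:00:00Z",
        "labels": {
            "alertname": "HighErrorRate",
            "service": "checkout-api",
            "severity": "critical",
            "env": "prod",
        },
    },
    {
        "fingerprint": "def456",
        "startsAt": "2024-05-01T00:05:00Z",
        "labels": {
            "alertname": "PodRestarting",
            "service": "internal-reporting",
            "severity": "info",
            "env": "staging",
        },
    },
]

with open("alerts_sample.json", "w") as f:
    json.dump(sample_alerts, f, indent=2)
```

Point --alerts at the generated file and the first alert should page while the second gets suppressed, per the rules above.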
7. Make it yours
Here are some ideas for adapting this to your environment:
- Add time-of-day logic (e.g., don’t page at 03:00 for non-critical stuff).
- Add SLO signals (e.g., error budget burn rate) into the decision rules.
- Replace history.json with a real datastore of past alerts.
- Call alertdecider from your Alertmanager/PagerDuty webhook pipeline.
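As a concrete example of the first idea, a quiet-hours rule could post-process decisions before they're written out. The apply_quiet_hours helper and the chosen hour range are assumptions, not part of the repo:

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical quiet window: 00:00-06:59 UTC.
QUIET_HOURS = range(0, 7)

def apply_quiet_hours(action: str, severity: str,
                      now: Optional[datetime] = None) -> str:
    """Downgrade non-critical pages to tickets during quiet hours (sketch)."""
    now = now or datetime.now(timezone.utc)
    if action == "page" and severity != "critical" and now.hour in QUIET_HOURS:
        return "ticket"  # track it, but don't wake anyone up
    return action

night = datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc)
print(apply_quiet_hours("page", "warning", night))   # ticket
print(apply_quiet_hours("page", "critical", night))  # page
```

Passing now explicitly keeps the rule testable; in production you'd let it default to the current time.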
You now have a small, understandable alert decision layer you can evolve – and a much better place to plug AI into in the future.
If you build on this, drop a link – I’d love to see different rule sets and architectures.