I spent my career in accounting and finance before building infrastructure in Zimbabwe.
In accounting, every transaction has three properties:
- Authorization: no entry without approval
- Immutability: once recorded, never altered
- Reconciliation: every debit has a corresponding credit, provable by audit
When I started building FastAPI AlertEngine, I applied the same discipline to production incidents. The result is not a monitoring tool. It's an operational governance system.
Monitoring Tools Are for Forensics. Governance Tools Are for Control.
Monitoring tools tell you what broke after it broke. Datadog, Grafana, Sentry — they produce beautiful post-mortems.
Governance tools enforce that nothing executes without authorization, and they prove it afterward.
Most teams conflate the two. They buy monitoring, assume governance, and get surprised when auditors ask: "Who approved that deploy?"
AlertEngine separates them explicitly:
```text
Detection → Policy (deterministic, no AI)
Diagnosis → AI (explains, recommends, does not decide)
Authorization → Human (engineer taps approve)
Execution → Webhook (your infrastructure, your control)
Audit → Ledger (immutable, replayable, actor-attributed)
```
This is not a feature list. It's an architectural hierarchy enforced by code.
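One way to picture "enforced by code" is a lookup that gives each step exactly one owner and rejects anyone else. The sketch below is illustrative only: `STEP_OWNER`, `WrongActor`, and `perform` are hypothetical names, not taken from the AlertEngine source, and the ledger is a plain list standing in for the real audit store.

```python
from typing import Dict, List

# Hypothetical mapping that mirrors the hierarchy above: one owner per step.
STEP_OWNER: Dict[str, str] = {
    "detection": "policy",
    "diagnosis": "ai",
    "authorization": "engineer",
    "execution": "webhook",
}


class WrongActor(PermissionError):
    """Raised when an actor attempts a step it does not own."""


def perform(step: str, actor: str, ledger: List[dict]) -> dict:
    """Record a step in the ledger, but only if the actor owns that step."""
    owner = STEP_OWNER[step]
    if actor != owner:
        raise WrongActor(f"{step} belongs to {owner!r}, not {actor!r}")
    entry = {"step": step, "actor": actor}
    ledger.append(entry)  # the audit step: every accepted action is recorded
    return entry
```

Under these assumptions, `perform("execution", "ai", ledger)` raises `WrongActor` and leaves the ledger untouched, which is the property the hierarchy is meant to guarantee.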
The Zimbabwe Constraint
Engineers in Zimbabwe aren't always at laptops when things break. WhatsApp is ubiquitous, so it can serve as the operational control plane.
That constraint produces something better than a dashboard: alerts that find you, with a single tap to authorise recovery. No SSH. No runbooks. No "log into Grafana and interpret the graph."
Just: "Something broke. Here's why. Tap approve. Nothing runs without you."
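The "nothing runs without you" rule can be sketched as a pending action that refuses to execute until an approval is recorded. This is a minimal illustration, not the AlertEngine implementation: `PendingAction` and its fields are hypothetical, and the `run` callable stands in for the webhook call that would follow a tap on approve.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PendingAction:
    incident_id: str
    description: str
    run: Callable[[], str]             # stand-in for the recovery webhook call
    approved_by: Optional[str] = None  # set only by a human approval

    def approve(self, engineer: str) -> None:
        """Record who authorised the action."""
        self.approved_by = engineer

    def execute(self) -> str:
        """Run the action, refusing if no approval has been recorded."""
        if self.approved_by is None:
            raise PermissionError("no human approval recorded; refusing to execute")
        return self.run()
```

Calling `execute()` before `approve()` raises `PermissionError`; after an approval it returns whatever the recovery callable returns.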
The Ledger Philosophy
In finance, a ledger has two sides: what happened, and who authorized it.
AlertEngine's audit trail has the same structure:
```json
{
  "timestamp": 1717344000,
  "incident_id": "inc-abc123-1685000000",
  "stage": "AUTHORIZED",
  "actor": "engineer",
  "decision": "approve",
  "reason": "Database connection pool exhausted — restart recommended",
  "confidence": 0.87,
  "policy_version": "1.0.0",
  "tenant_id": "tenant-xyz789"
}
```
Every entry is append-only. Every entry has an actor. Every entry is replayable.
This is not logging. Logging tells you what the system did. A ledger tells you who authorized it and why.
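To make "append-only, actor-attributed, replayable" concrete, here is a small in-memory sketch. It is an assumption-laden stand-in: the real audit trail is described later as a Redis LIST, while this version uses a Python list, and the `Ledger` and `LedgerEntry` names are hypothetical. The fields follow the JSON entry shown above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterator, List, Optional


@dataclass(frozen=True)  # frozen: an entry cannot be altered once created
class LedgerEntry:
    timestamp: int
    incident_id: str
    stage: str
    actor: str
    decision: str
    reason: str
    policy_version: str
    tenant_id: str
    confidence: Optional[float] = None


class Ledger:
    """In-memory stand-in for an append-only store: no update, no delete."""

    def __init__(self) -> None:
        self._rows: List[str] = []

    def append(self, entry: LedgerEntry) -> None:
        if not entry.actor:
            raise ValueError("every ledger entry needs an actor")
        # Serialise on write so later mutation of objects cannot change history.
        self._rows.append(json.dumps(asdict(entry), sort_keys=True))

    def replay(self, incident_id: str) -> Iterator[dict]:
        """Yield one incident's entries in the order they were recorded."""
        for row in self._rows:
            record = json.loads(row)
            if record["incident_id"] == incident_id:
                yield record
```

Replaying an incident then amounts to `list(ledger.replay("inc-abc123-1685000000"))`, which returns each decision with its actor in recorded order.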
Policy Is the Floor. AI Is the Ceiling.
The most important architectural decision in AlertEngine is this:
Claude cannot trigger a state transition.
Policy decides whether an incident exists. Policy decides when a system has recovered. Claude diagnoses and explains — but the state machine doesn't listen to Claude. It listens to incident_policy.py.
When health metrics recover, the pipeline doesn't ask Claude what to do. It calls should_recover(score, err) and if the threshold is met, it transitions to RECOVERED with actor="policy". Claude's recommendation is irrelevant.
This means:
- A confident wrong AI diagnosis cannot cause an incident to escalate
- A policy recovery override is logged as actor: "policy", so auditors can see exactly when and why
- Changing thresholds is a one-line edit in one file, versioned, and logged in every subsequent audit entry
- The audit trail never lies about who made the decision
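A rough sketch of how a policy-owned recovery check might look. Only `should_recover(score, err)`, the `RECOVERED` stage, and `actor="policy"` come from the description above. The threshold values, the environment variable names, the `next_entry` helper, and the reading of `score` as a 0 to 1 health score and `err` as an error rate are all assumptions made for illustration.

```python
import os
from typing import Optional

# Hypothetical thresholds and environment variable names; the real values
# live in incident_policy.py.
POLICY = {
    "version": "1.0.0",
    "recover_min_score": float(os.environ.get("ALERT_RECOVER_MIN_SCORE", "0.9")),
    "recover_max_error_rate": float(os.environ.get("ALERT_RECOVER_MAX_ERROR_RATE", "0.01")),
}


def should_recover(score: float, err: float) -> bool:
    """Deterministic check: healthy enough and error rate low enough."""
    return (
        score >= POLICY["recover_min_score"]
        and err <= POLICY["recover_max_error_rate"]
    )


def next_entry(score: float, err: float, ai_recommendation: str) -> Optional[dict]:
    """Build the audit entry for a recovery, or None if thresholds are not met.

    ai_recommendation is accepted but deliberately never read: the decision
    depends only on the policy thresholds.
    """
    if should_recover(score, err):
        return {
            "stage": "RECOVERED",
            "actor": "policy",
            "policy_version": POLICY["version"],
        }
    return None
```

Because the recommendation never enters the condition, the same metrics always produce the same transition, and the entry records the policy version that made the call.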
Why This Matters Now
Three forces are converging:
- Regulators are tightening. SOC 2, PCI DSS, HIPAA, GDPR — all require documented authorisation for production changes. "The AI did it" is not a compliant answer.
- AI is getting faster. Claude can diagnose an incident in 3 seconds. Without governance, the temptation is to let it act autonomously. That's how a confident wrong diagnosis ends up restarting your database at peak traffic.
- Engineers are burning out. 3 AM alerts with no context, no authorisation trail, and no proof of what happened. The answer isn't better dashboards; it's better workflows.

AlertEngine addresses all three: policy gates prevent AI from acting alone, human authorisation prevents burnout, and the audit trail prevents regulatory surprises.
The Honest Part
I'm also building a payment orchestration platform for the African "hustler" context. Getting infrastructure funding in Zimbabwe is genuinely hard.
So I packaged the operational governance layer as a standalone product. It solves a real problem — I needed it myself at 2am. It also funds the bigger build.
That felt worth being honest about.
The Code
The orchestrator is source-available. Every claim in this post is verifiable:
orchestrator/pipeline.py — policy hierarchy, actor="policy" on recovery override
orchestrator/incident_policy.py — single POLICY dict, versioned, env-configurable
orchestrator/audit.py — append-only Redis LIST, full actor attribution, replayable
Read the code. Audit the architecture. Then decide if your infrastructure deserves the same discipline as your accounting.
GitHub: github.com/Tandem-Media/fastapi-alertengine
Install:
```bash
pip install fastapi-alertengine
```
Managed orchestrator: anchorflowalertengine@outlook.com
Built in Harare, Zimbabwe. 🇿🇼