Lenard Francis

Posted on Jun 11

Why no one has built what AlertEngine builds — and why it took a bookkeeper to see the gap

I want to be honest about something.

When I started building AlertEngine, I assumed I was late. Monitoring is a crowded space. PagerDuty has been around since 2009. AWS has remediation tools. There are well-funded AI SRE startups launching every month.

I kept waiting for someone to tell me it already existed.

Nobody did. Because it doesn't.

Not with this specific combination. Not with this philosophy.

Here is what I found when I looked carefully at the landscape.

The Alerting Giants stop at the notification

PagerDuty and Opsgenie are excellent at telling you something broke. They will wake you up. They will escalate. They will page the on-call engineer.

Then they stop.

They assume you will open a laptop, find a terminal, run a script, and fix it yourself. There is no diagnosis in the alert. There is no recovery button. There is no audit trail of what you did next.

AlertEngine picks up exactly where they stop. The alert contains the diagnosis and a one-tap recovery link. The audit trail records what happened after the alert fired.
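
As a rough illustration of the idea, here is a minimal sketch of an alert that carries its own diagnosis and a one-tap recovery link. It is not AlertEngine's actual payload format: the field names, the `build_alert`, `sign`, and `verify` helpers, the `recover.example.com` URL, and the HMAC-signed, expiring link are all assumptions made for this example.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Hypothetical shared secret; a real deployment would load this from a secret store.
SECRET = b"demo-secret"


def sign(incident_id: str, action: str, expires: int) -> str:
    """Return an HMAC-SHA256 signature binding the action to the incident and expiry."""
    message = f"{incident_id}|{action}|{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def build_alert(incident_id: str, diagnosis: str, action: str, now: int, ttl: int = 900) -> dict:
    """Build an alert that carries the diagnosis and a one-tap recovery link."""
    expires = now + ttl
    query = urlencode({
        "incident": incident_id,
        "action": action,
        "expires": expires,
        "sig": sign(incident_id, action, expires),
    })
    return {
        "incident_id": incident_id,
        "diagnosis": diagnosis,
        # Hypothetical endpoint, used here only for illustration.
        "recovery_link": f"https://recover.example.com/approve?{query}",
    }


def verify(incident_id: str, action: str, expires: int, sig: str, now: int) -> bool:
    """Accept the tap only if the link is unexpired and the signature matches."""
    if now > expires:
        return False
    return hmac.compare_digest(sig, sign(incident_id, action, expires))


alert = build_alert("inc-42", "p95 latency spike after deploy", "rollback", now=int(time.time()))
print(alert["diagnosis"])
print(alert["recovery_link"])
```

Signing the link means the tap can only trigger the exact action that was proposed, and only until the link expires. That keeps the "one tap" convenient without turning it into an open door.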

The Auto-Remediation tools are the problem, not the solution

Shoreline, AWS Systems Manager, and the new wave of autonomous remediation platforms are built on a premise I fundamentally disagree with: that the goal is to remove the human from the loop entirely.

Peer-reviewed research (Demirbas et al., ACM CAIS 2026) shows that AI agents create approximately 50x more rollbacks than human clients. Their aggressive retry behaviour can push a degraded service into a metastable state: a feedback loop in which the retries themselves sustain the outage and make it worse.

I call this the Metastability problem. The auto-remediation tools are its primary cause.
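
The arithmetic behind that feedback loop can be sketched with a toy model. This is an illustration under stated assumptions, not a result from the cited paper: every failed attempt is retried immediately, failures are independent with a fixed `failure_rate`, and `expected_attempts` is a name invented for this example.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per logical request when every failure is retried immediately."""
    # Attempt k+1 only happens if the first k attempts all failed,
    # which has probability failure_rate ** k.
    return sum(failure_rate ** k for k in range(max_retries + 1))


for failure_rate in (0.1, 0.5, 0.9):
    for max_retries in (0, 3, 10):
        load = expected_attempts(failure_rate, max_retries)
        print(f"failure_rate={failure_rate:.1f} max_retries={max_retries:>2} -> {load:.2f}x load")
```

In this model a healthy service barely notices retries, but at a 90% failure rate with 10 retries each request turns into roughly 6.9 attempts. In a real system the failure rate is not fixed: the extra load pushes it higher, which triggers still more retries.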

AlertEngine is the opposite philosophy. The AI diagnoses. The human decides. The system proves it happened.
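
Those three steps can be sketched as a small approval gate. This is a hypothetical illustration, not AlertEngine's implementation: `RecoveryRequest`, `NotAuthorised`, and the in-memory `log` list are names assumed for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


class NotAuthorised(Exception):
    """Raised when a recovery action is attempted without human approval."""


@dataclass
class RecoveryRequest:
    incident_id: str
    diagnosis: str                      # produced by the AI advisor
    action: str                         # the recovery action it proposes
    approved_by: Optional[str] = None   # set only by a human decision
    log: List[Tuple[str, str]] = field(default_factory=list)

    def approve(self, actor: str) -> None:
        """Record which human authorised the action."""
        self.approved_by = actor
        self.log.append(("approved", actor))

    def execute(self, runner: Callable[[str], None]) -> None:
        """Run the action only if a human has approved it; log either outcome."""
        if self.approved_by is None:
            self.log.append(("blocked", "no approval"))
            raise NotAuthorised(f"{self.action} for {self.incident_id} has no approver")
        runner(self.action)
        self.log.append(("executed", self.approved_by))


request = RecoveryRequest("inc-42", "connection pool exhausted", "restart-workers")
try:
    request.execute(print)              # refused: nobody has approved yet
except NotAuthorised as exc:
    print("blocked:", exc)

request.approve("on-call engineer")
request.execute(print)                  # now the action runs
print(request.log)
```

The refused attempt is logged as well as the successful one, so the record shows what was tried and not only what ran.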

The AI SRE startups are built by the wrong people

There is a new wave of LLM-powered SRE tools. They are impressive. They are well-funded. They are built by engineers who deeply understand AI.

None of them have an immutable audit trail.

None of them treat recovery as a financial transaction.

None of them ask "who authorised that?" because that question has never kept a Silicon Valley engineer awake at night.

It kept me awake every night for 30 years. I spent my career in accounting and finance. In that world, no transaction executes without authorisation and every action leaves a trail. That is not bureaucracy. That is governance.

The AI SRE startups use AI as the product. AlertEngine uses AI as an advisor. The audit trail is the product.
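
The post does not say how AlertEngine makes its trail immutable. One common way to make an append-only log tamper-evident is hash chaining, where each entry includes the hash of the one before it. The sketch below assumes that technique; the `AuditTrail` class and its field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry in the chain


def _digest(body: dict) -> str:
    """Hash an entry body deterministically (keys sorted before serialising)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditTrail:
    """Append-only log in which each entry commits to the previous entry's hash."""

    def __init__(self) -> None:
        self._entries = []

    def append(self, stage: str, actor: str, policy_version: str) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {
            "stage": stage,
            "actor": actor,
            "policy_version": policy_version,
            "prev_hash": prev_hash,
        }
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev_hash = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


trail = AuditTrail()
trail.append("diagnosed", "ai-advisor", "policy-v1")
trail.append("approved", "on-call engineer", "policy-v1")
trail.append("executed", "system", "policy-v1")
print(trail.verify())
```

A chain like this does not prevent someone from rewriting the whole log, so in practice the latest hash would also need to be stored somewhere the writer cannot alter.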

The Enterprise Workflow tools cost $100,000 and take six months

Tines and Torq will let you build sophisticated recovery workflows. They are genuinely powerful.

They also cost $50,000–$100,000 per year and require a dedicated implementation team to set up.

A seed-stage fintech in Lagos or a payment platform in Harare cannot buy that. A solo founder running a SaaS doing $10K MRR cannot buy that.

AlertEngine is two lines of code:

from fastapi_alertengine import instrument
instrument(app)  # app is your existing FastAPI() instance

That is the entire SDK installation. You are running in minutes, not months.

The specific blind spot

Silicon Valley thinks the goal is autonomy. No humans. Full automation. The system fixes itself.

But in the real world of money, trade, and regulation — the world I come from — the goal is accountability. Traceable humans. Provable decisions.

The specific combination that does not exist anywhere else:

FastAPI-native SDK: two lines of code, running in minutes
Dual-model AI Diagnostic Council: two models reason independently, and a dissent alert fires when they disagree
WhatsApp and Telegram control plane: because in Africa and emerging markets, WhatsApp is where people actually are
Immutable append-only audit trail: every stage, every actor, every policy version
Shadow Mode: observe governance decisions without executing them; the default for all new tenants
The Accountant's Brake: human authorisation as a resilience mechanism, not a bottleneck
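
Two of the items above, the dissent alert and Shadow Mode, can be sketched in a few lines. This is a hypothetical illustration, not AlertEngine's code: the two "models" are simple rule-based stand-ins for independent AI models, and `convene`, `dispatch`, `Verdict`, and the metric names are assumed for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A model maps metrics to (diagnosis, proposed action).
Model = Callable[[Dict[str, float]], Tuple[str, str]]


@dataclass
class Verdict:
    diagnoses: List[str]
    action: str
    dissent: bool


def convene(metrics: Dict[str, float], model_a: Model, model_b: Model) -> Verdict:
    """Ask both models independently and flag dissent when their actions differ."""
    diag_a, action_a = model_a(metrics)
    diag_b, action_b = model_b(metrics)
    dissent = action_a != action_b
    # On dissent, propose no fix; escalate both opinions to the human instead.
    return Verdict([diag_a, diag_b], "escalate" if dissent else action_a, dissent)


def dispatch(verdict: Verdict, shadow_mode: bool, audit: List[str]) -> bool:
    """Return True if the action may proceed to human approval; never in shadow mode."""
    if shadow_mode:
        audit.append(f"suppressed:{verdict.action}")
        return False
    audit.append(f"awaiting_approval:{verdict.action}")
    return True


def latency_model(m: Dict[str, float]) -> Tuple[str, str]:
    if m["p95_ms"] > 1000:
        return ("latency regression", "rollback")
    return ("healthy", "none")


def error_model(m: Dict[str, float]) -> Tuple[str, str]:
    if m["error_rate"] > 0.05:
        return ("elevated 5xx rate", "restart")
    if m["p95_ms"] > 1000:
        return ("slow dependency", "rollback")
    return ("healthy", "none")


audit: List[str] = []
verdict = convene({"p95_ms": 1500, "error_rate": 0.2}, latency_model, error_model)
print(verdict)
print(dispatch(verdict, shadow_mode=True, audit=audit), audit)
```

In shadow mode the decision path still runs end to end and is recorded; only the final step is suppressed. That lets a new tenant review what would have happened before allowing anything to execute.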

I have taken the governance model of a $10 billion bank's internal incident system and put it in a Python package that installs in 30 seconds.

Why it took a bookkeeper from Zimbabwe

I did not see this gap because I am a great engineer. I am not a traditional engineer at all. I came to code through AI tools, building solutions to my own problems — first a WhatsApp batch invitation system for my own wedding with over 1,000 guests, then a payment orchestration platform for informal traders in Zimbabwe.

I saw this gap because I spent 30 years with two familiar questions: "who authorised that?" and "where is the audit trail?"

Those questions are not engineering questions. They are governance questions.

And nobody in the infrastructure tooling space was asking them.

Until now.

A final thought

I have been describing AlertEngine as an incident recovery tool. That is accurate but incomplete.

What I am actually building is a governance layer for operational decisions.

The strongest lines in this product are not about latency metrics or health scores. They are about authorisation, evidence, and accountability.

"Nothing executes without approval."

"Every action is logged immutably."

"'The system fixed itself' is not an acceptable answer."

Those are governance statements. And that is the category AlertEngine is creating.

Most engineers ask, "How do we automate this?"

I started with, "Who approved this?"

That's a different mental model. And it turns out production infrastructure needs it more than most engineers realise.

Top comments (1)

Luis Cruz • Jun 11

This is an excellent example of user-centric design meeting regulatory reality. Shadow Mode brilliantly addresses the fundamental trust barrier for production-critical systems: giving teams full observability without executing risky actions. The way you’ve preserved the state machine behavior while logging every suppressed action shows careful engineering and a deep understanding of governance requirements. It’s not just a feature—it’s a strategic bridge from proof-of-concept to regulated production adoption.

I’m curious: would you be open to sharing more about how you validated Shadow Mode with early adopters? I’d love to learn if there were unexpected insights or if you’d like some help thinking through additional audit visualization or adoption feedback.