DEV Community: Lenard Francis

# I Built a Legal OS for a Zimbabwean Law Firm — Here's What AI-Assisted Legal Research Actually Looks Like

Lenard Francis — Tue, 07 Jul 2026 11:05:55 +0000

My wife is a lawyer and partner at a law firm in Harare, Zimbabwe. She has been practising for over two decades, juggles her caseload with speaking engagements, and runs legal education programs on two radio stations three days a week. She was tracking 118 active matters in a diary and a notepad.

That's not a criticism: she's been practising for over two decades and the system worked. But when I started building with AI tools, I kept thinking: what happens when she's sick? What happens when a junior associate needs to know the deadline on a matter she's handling? What happens when someone needs to research a legal question at 10pm before a morning hearing?

So I built MutemoOS. Here's what I learned.

The problem with generic legal tech

Most legal research tools are built for BigLaw in London or New York. They're expensive, bandwidth-heavy, and tuned for English common law or US federal jurisdiction. They don't know what a dies induciae is. They've never heard of the Labour Court of Zimbabwe.

The interesting engineering problems only show up when you're building for a specific jurisdiction:

Case law is scattered across ZimLII, Veritas, ZLHR, and Law Reports of Zimbabwe
Statutes get amended without warning — the Labour Amendment Act 2023 changed everything
Lawyers use colloquial terms that don't appear in legislation ("small houses", "lobola") but map to specific statutory provisions
Court deadlines are calculated in working days, excluding Zimbabwe's specific public holidays, and the High Court recess suspends dies induciae only for Heads of Argument — not for other steps

If you're building a legal AI for Zimbabwe, you can't just use a generic RAG pipeline and call it done.

Architecture overview

MutemoOS is a FastAPI backend with a React-style HTML frontend (no framework, just vanilla JS), deployed on Railway. The core components:

PostgreSQL          — matters, documents, chunks, calendar events
ChromaDB            — vector store for semantic search (firm collection)
Laws.Africa KB API  — live Zimbabwe legislation + judgments
sentence-transformers — all-MiniLM-L6-v2 embeddings
Anthropic Claude    — synthesis, grounding, document drafting
Cloudflare R2       — document storage

The Legal Intelligence Feed is a separate FastAPI service that scrapes ZimLII, Veritas, ZLHR, and NewsDay daily and pushes content to MutemoOS via a multi-instance pusher — built for multiple law firm clients.

The grounded synthesis problem

The first version of search was straightforward RAG: embed query → retrieve chunks → synthesise answer. It worked well when the right content was indexed. But it failed in two important ways:

1. Hallucinated citations. The model would confidently cite "section 12(3) of the Labour Act" when the retrieved chunks didn't actually contain that section. Lawyers can't use hallucinated citations.

2. Silent failures. When no relevant content was indexed, the model would fall back to general knowledge and present it as if it came from the sources. A lawyer reading the output had no way to know which claims were grounded and which weren't.

The fix was a two-stage synthesis pipeline:

# Stage 1 — Ground check (Claude Haiku, fast and cheap)
grounding = ground_check_sync(query, context)
# Returns: sources_sufficient, source_gap, grounding_note

# Stage 2 — Constrained synthesis (Claude Sonnet)
# If sources insufficient, Sonnet is explicitly told what's missing
# and instructed to prefix unsupported claims with:
# "[General principle — verify in source]"
answer = synthesise_answer_sync(query, context, grounding)

The frontend renders a ✓ green badge when sources are sufficient, and a ⚠ yellow badge with a specific gap description when they're not:

⚠ Partial sources — Missing: Insolvency Act, Companies Act, Labour Act 
provisions on employee rights during liquidation

This single change transformed how lawyers interact with the system. They now know exactly which parts of an answer to verify before relying on it.
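
The two-stage flow can be sketched roughly as follows. This is a minimal illustration, not the MutemoOS code: the `Grounding` dataclass, `answer_with_grounding`, and the two stand-in callables are hypothetical names, and the model calls are replaced with plain functions so the sketch runs without an API key.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Grounding:
    sources_sufficient: bool
    source_gap: Optional[str] = None  # what is missing, in plain words
    grounding_note: str = ""


def answer_with_grounding(
    query: str,
    context: str,
    ground_check: Callable[[str, str], Grounding],     # stage 1: fast, cheap model
    synthesise: Callable[[str, str, Grounding], str],  # stage 2: stronger model
) -> dict:
    """Run the ground check first, then pass its verdict into synthesis."""
    grounding = ground_check(query, context)
    answer = synthesise(query, context, grounding)
    if grounding.sources_sufficient:
        badge = "✓ Sources sufficient"
    else:
        badge = f"⚠ Partial sources. Missing: {grounding.source_gap}"
    return {"answer": answer, "badge": badge, "grounding": grounding}


# Stand-in callables (assumptions): real ones would call the two models.
def fake_ground_check(query: str, context: str) -> Grounding:
    if "Insolvency Act" in context:
        return Grounding(True)
    return Grounding(False, source_gap="Insolvency Act",
                     grounding_note="No insolvency provisions retrieved")


def fake_synthesise(query: str, context: str, grounding: Grounding) -> str:
    # A real prompt would tell the model to flag unsupported claims itself.
    prefix = "" if grounding.sources_sufficient else "[unverified] "
    return prefix + "Draft answer for: " + query


result = answer_with_grounding(
    "employee rights during liquidation",
    "Labour Act section 12 ...",
    fake_ground_check,
    fake_synthesise,
)
print(result["badge"])
```

Keeping the verdict as structured data, rather than burying it in the answer text, is what lets the frontend choose a badge without parsing prose.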

The query expansion problem

Zimbabwe legal terminology has a mismatch problem. A lawyer asks about "small houses" — a colloquial term for informal second relationships. The legislation calls them "civil partnerships" under section 41 of the Marriages Act [Chapter 5:15]. The embedding model (all-MiniLM-L6-v2) doesn't know this mapping.

The problem runs deeper with customary law. Zimbabwe operates a dual legal system — general (Roman-Dutch) law and customary law — and a significant portion of legal practice involves both simultaneously. A client walks in and says "we paid lobola but never registered the marriage." That's not just a cultural statement — it maps to a specific legal regime under the Customary Marriages Act [Chapter 5:07], with distinct inheritance rules, maintenance claims, and estate administration procedures under the Administration of Estates Act. The system needs to understand that "lobola" and "roora" and "unregistered customary union" are all entries into the same legal framework, and that the applicable law differs materially from a civil marriage registered under the Marriages Act.

The solution is a two-layer fallback:

Layer 1 — Similarity threshold filtering. ChromaDB returns results below a similarity threshold (0.35) which are likely noise. Filter them out. When ChromaDB returns nothing above threshold, the FTS fallback fires:

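# FTS (full-text search) fallback: word overlap plus a long-word match bonus
# STOPWORDS is the application's own stopword set (not shown here)
query_words = set(query.lower().split()) - STOPWORDS
scored = []
for chunk in chunks:
    text_lower = chunk["text"].lower()
    # Fraction of query words that appear as whole words in the chunk
    word_score = len(query_words & set(text_lower.split())) / max(len(query_words), 1)
    # Flat bonus when any query word longer than 4 characters occurs as a substring
    phrase_bonus = 0.5 if any(w in text_lower for w in query_words if len(w) > 4) else 0
    scored.append((word_score + phrase_bonus, chunk))
scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first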

Layer 2 — Zimbabwe-specific query expansion. When both semantic and FTS fail, Claude Haiku expands the query with Zimbabwe legal synonyms:

QUERY_EXPANSION_PROMPT = """You are a Zimbabwean legal research assistant.
Key Zimbabwe-specific mappings:
- "civil partner" / "unmarried couple" → "section 41 Marriages Act Chapter 5:15 unregistered union"
- "Islamic marriage" → "qualified civil marriage section 44 Marriages Act polygamous union"
- "small houses" → "section 41 civil partnership unregistered union Marriages Act"
- "lobola" / "roora" / "customary marriage" → "Customary Marriages Act Chapter 5:07 unregistered union"
- "unfair dismissal" → "Labour Act Chapter 28:01 due inquiry section 12"
- "estate" / "deceased estate" → "Administration of Estates Act Chapter 6:01 Master of High Court"
...
Expand this query with Zimbabwe legal synonyms. Return ONE line only.
Query: {query}"""

The result: searching "small houses" returns section 41 of the Marriages Act with a ✓ green badge and a complete answer about civil partnership protections under Zimbabwe law.
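
How the layers chain together might look something like this. It is a sketch under stated assumptions: `retrieve_with_fallback` and the three toy callables are hypothetical, and it assumes the expanded query is fed back into semantic search, which the post implies but does not spell out.

```python
from typing import Callable, List, Tuple

Hit = Tuple[float, str]  # (score, chunk text)

SIMILARITY_THRESHOLD = 0.35  # value quoted in the post


def retrieve_with_fallback(
    query: str,
    semantic_search: Callable[[str], List[Hit]],
    keyword_search: Callable[[str], List[Hit]],
    expand_query: Callable[[str], str],
) -> Tuple[str, List[Hit]]:
    """Return (strategy_used, hits), trying the cheapest strategy first."""
    hits = [h for h in semantic_search(query) if h[0] >= SIMILARITY_THRESHOLD]
    if hits:
        return "semantic", hits

    hits = [h for h in keyword_search(query) if h[0] > 0]
    if hits:
        return "fts", hits

    # Last resort: rewrite the query in statutory language and search again.
    expanded = expand_query(query)
    hits = [h for h in semantic_search(expanded) if h[0] >= SIMILARITY_THRESHOLD]
    return "expanded", hits


# Toy stand-ins: one indexed chunk and a hand-written expansion.
CHUNK = "Section 41 civil partnership: rights of parties to an unregistered union"


def toy_semantic(q: str) -> List[Hit]:
    # Pretend the embedding model only matches statutory wording.
    return [(0.8, CHUNK)] if "civil partnership" in q else [(0.1, CHUNK)]


def toy_keyword(q: str) -> List[Hit]:
    overlap = set(q.lower().split()) & set(CHUNK.lower().split())
    return [(float(len(overlap)), CHUNK)] if overlap else []


def toy_expand(q: str) -> str:
    return q + " section 41 civil partnership unregistered union Marriages Act"


strategy, hits = retrieve_with_fallback("small houses", toy_semantic, toy_keyword, toy_expand)
print(strategy, hits[0][0])
```

Returning the strategy name alongside the hits makes it easy to log how often each layer is actually needed.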

The dies induciae calculator

This is the feature that matters most to practitioners. Dies induciae is the period the court rules allow for a procedural step. Every such deadline in Zimbabwe is counted in working days, excluding weekends, public holidays, and (for Heads of Argument only) High Court recess periods.

The High Court recess rule is specific: recess suspends the dies only for Heads of Argument. All other deadlines (NITDs, plea, discovery) run straight through recess. I built this as a hardcoded rule set:

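from datetime import timedelta

class CourtProcedureEngine:
    # (month, day) pairs for public holidays that fall on fixed dates
    FIXED_HOLIDAYS = {
        (1,1),(2,21),(4,18),(5,1),(5,25),
        (8,11),(8,12),(12,22),(12,25),(12,26)
    }

    @classmethod
    def _add_working_days(cls, start, days, recess_periods=None, suspend_in_recess=False):
        """Return (deadline, number of working days that fell inside a recess)."""
        count = 0
        recess_hit = 0
        d = start + timedelta(days=1)  # counting starts the day after service
        while count < days:
            # _is_working_day (not shown) excludes weekends and public holidays
            if cls._is_working_day(d):
                in_recess = suspend_in_recess and any(
                    r['start'] <= d <= r['end'] for r in (recess_periods or [])
                )
                if in_recess:
                    recess_hit += 1  # day is skipped, the clock is suspended
                else:
                    count += 1
            if count < days:
                d += timedelta(days=1)
        return d, recess_hit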

The recess periods are user-supplied (the Registrar publishes the court calendar annually) rather than hardcoded — a law firm inputs the recess dates once at the start of the court year and they apply to all matters automatically.
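
To see the recess rule in action, here is a small self-contained variant that can be run as-is. It is a sketch, not the production class: `is_working_day` is an assumed rule (Monday to Friday, minus the fixed holidays listed above), movable holidays are ignored, and the recess dates are invented for illustration.

```python
from datetime import date, timedelta

# Month/day pairs copied from FIXED_HOLIDAYS in the post.
FIXED_HOLIDAYS = {(1, 1), (2, 21), (4, 18), (5, 1), (5, 25),
                  (8, 11), (8, 12), (12, 22), (12, 25), (12, 26)}


def is_working_day(d: date) -> bool:
    """Assumed rule: Monday to Friday, and not a fixed public holiday."""
    return d.weekday() < 5 and (d.month, d.day) not in FIXED_HOLIDAYS


def add_working_days(start, days, recess_periods=(), suspend_in_recess=False):
    """Return (deadline, working days skipped because of recess)."""
    count, recess_hit, d = 0, 0, start
    while count < days:
        d += timedelta(days=1)
        if not is_working_day(d):
            continue
        if suspend_in_recess and any(r["start"] <= d <= r["end"] for r in recess_periods):
            recess_hit += 1  # clock is suspended for this day
        else:
            count += 1
    return d, recess_hit


# Hypothetical recess dates, for illustration only.
recess = [{"start": date(2026, 9, 7), "end": date(2026, 9, 11)}]
served = date(2026, 9, 1)  # a Tuesday

plea_due, _ = add_working_days(served, 5, recess)
heads_due, skipped = add_working_days(served, 5, recess, suspend_in_recess=True)
print(plea_due, heads_due, skipped)
```

With these made-up dates, a five-day step that runs through recess falls due on 8 September, while the same five days for Heads of Argument move out to 15 September because the five recess days do not count.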

When a lawyer selects "High Court Application" and enters a service date, the system calculates all deadlines, creates calendar events automatically, and flags critical deadlines in red:

Labour Court Heads of Argument: BAR RISK — failure to file = barred from oral submissions (Rule 19)
Magistrates Court plea: DEFAULT JUDGMENT RISK — 7 days after appearance to defend

The Laws.Africa integration

The most impactful single change was integrating the Laws.Africa Knowledge Base API. Instead of scraping Zimbabwe legislation and uploading PDFs manually, we now query their vector index directly:

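# Assumes `http` is a shared async HTTP client (for example httpx.AsyncClient)
# and that LAWS_AFRICA_API and `token` are loaded from configuration.
async def search_laws_africa(query: str, top_k: int = 3) -> list:
    payload = {
        "text": query,
        "top_k": top_k,
        "filters": {"commenced": True, "repealed": False, "principal": True}
    }
    results = []
    for kb_code in ["legislation-zw", "judgments-zw"]:
        resp = await http.post(
            f"{LAWS_AFRICA_API}/{kb_code}/retrieve",
            json=payload,
            headers={"Authorization": f"Bearer {token}"}
        )
        resp.raise_for_status()
        # Returns section-level chunks with source URL, Act title, section name
        results.append({"kb": kb_code, "response": resp.json()})
    return results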

Every search now queries three sources simultaneously: local ChromaDB (firm precedents + uploaded documents), Laws.Africa legislation-zw (all Zimbabwe Acts), and Laws.Africa judgments-zw (all Zimbabwe judgments). The grounding check runs across all three.
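
One way to fan a query out to several sources at once is `asyncio.gather`. The sketch below is illustrative only: `search_all` and the three stand-in coroutines are hypothetical, and treating a failed source as an empty result is a design assumption rather than something the post states.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Search = Callable[[str], Awaitable[List[dict]]]


async def search_all(query: str, sources: Dict[str, Search]) -> Dict[str, List[dict]]:
    """Query every source concurrently; a failing source yields an empty list."""
    names = list(sources)
    results = await asyncio.gather(
        *(sources[name](query) for name in names), return_exceptions=True
    )
    return {
        name: [] if isinstance(res, BaseException) else res
        for name, res in zip(names, results)
    }


# Stand-in coroutines; real ones would call ChromaDB and the Laws.Africa API.
async def firm_docs(q: str) -> List[dict]:
    return [{"source": "firm", "text": "precedent on " + q}]


async def legislation(q: str) -> List[dict]:
    return [{"source": "legislation-zw", "text": "Act section on " + q}]


async def judgments(q: str) -> List[dict]:
    raise TimeoutError("upstream timed out")  # simulate an outage


merged = asyncio.run(search_all("unfair dismissal", {
    "firm": firm_docs,
    "legislation-zw": legislation,
    "judgments-zw": judgments,
}))
print({name: len(found) for name, found in merged.items()})
```

An empty list from one source is exactly the kind of gap the grounding check can then surface as a partial-sources badge.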

What surprised me

Colloquial language works. The query expansion approach means a lawyer can type exactly how they'd describe a problem to a colleague. "My client is a small house" gets the right legal framework. This was the biggest UX win — it turns out the barrier to using legal AI isn't the AI, it's the expectation that you need to speak like a statute.

Grounding is more valuable than accuracy. A system that gives you a 90% accurate answer with no indication of what it doesn't know is more dangerous than a system that gives you a 75% accurate answer with a clear ⚠ badge saying "I'm missing the Insolvency Act." Lawyers can work with uncertainty. They can't work with invisible uncertainty.

The embedding model is the ceiling. `all-MiniLM-L6-v2` is a general-purpose model. It doesn't know that "qualified civil marriage" and "Islamic marriage" are the same thing in Zimbabwe law. The query expansion workaround helps, but the right long-term answer is a legal-domain embedding model. This is the next migration.

What's next

PostgreSQL migration for multi-tenancy (currently per-firm Railway instances)
Better embedding model (OpenAI text-embedding-3-small or legal-BERT)
LSZ Tariff integration — auto-suggest fees from the Law Society of Zimbabwe tariff
IECMS bridge — Zimbabwe's new electronic court filing system

The product is live. My wife uses it daily. A second firm is onboarding next month.

If you're building legal tech for Africa or any underserved legal market, I'd genuinely like to compare notes. The problems are interesting and the competition is thin.

Lenard Francis is the founder of Tofamba Technology, building MutemoOS (legal practice management) and AlertEngine (human-authorized incident recovery for FastAPI). Based in Harare, Zimbabwe. @leoofharare

# Why no one has built what AlertEngine builds — and why it took a bookkeeper to see the gap

Lenard Francis — Thu, 11 Jun 2026 13:31:38 +0000

I want to be honest about something.

When I started building AlertEngine, I assumed I was late. Monitoring is a crowded space. PagerDuty has been around since 2009. AWS has remediation tools. There are well-funded AI SRE startups launching every month.

I kept waiting for someone to tell me it already existed.

Nobody did. Because it doesn't.

Not with this specific combination. Not with this philosophy.

Here is what I found when I looked carefully at the landscape.

The Alerting Giants stop at the notification

PagerDuty and Opsgenie are excellent at telling you something broke. They will wake you up. They will escalate. They will page the on-call engineer.

Then they stop.

They assume you will open a laptop, find a terminal, run a script, and fix it yourself. There is no diagnosis in the alert. There is no recovery button. There is no audit trail of what you did next.

AlertEngine picks up exactly where they stop. The alert contains the diagnosis and a one-tap recovery link. The audit trail records what happened after the alert fired.

The Auto-Remediation tools are the problem, not the solution

Shoreline, AWS Systems Manager, and the new wave of autonomous remediation platforms are built on a premise I fundamentally disagree with: that the goal is to remove the human from the loop entirely.

Peer-reviewed research (Demirbas et al., ACM CAIS 2026) shows that AI agents create approximately 50x more rollbacks than human clients. Their aggressive retry behaviour turns a degraded service into a metastable feedback loop that makes the outage worse.

I call this the Metastability problem. The auto-remediation tools are its primary cause.

AlertEngine is the opposite philosophy. The AI diagnoses. The human decides. The system proves it happened.

The AI SRE startups are built by the wrong people

There is a new wave of LLM-powered SRE tools. They are impressive. They are well-funded. They are built by engineers who deeply understand AI.

None of them have an immutable audit trail.

None of them treat recovery as a financial transaction.

None of them ask "who authorised that?" because that question has never kept a Silicon Valley engineer awake at night.

It kept me awake every night for 30 years. I spent my career in accounting and finance. In that world, no transaction executes without authorisation and every action leaves a trail. That is not bureaucracy. That is governance.

The AI SRE startups use AI as the product. AlertEngine uses AI as an advisor. The audit trail is the product.

The Enterprise Workflow tools cost $100,000 and take six months

Tines and Torq will let you build sophisticated recovery workflows. They are genuinely powerful.

They are also $50,000–$100,000 per year and require a dedicated implementation team to set up.

A seed-stage fintech in Lagos or a payment platform in Harare cannot buy that. A solo founder running a SaaS doing $10K MRR cannot buy that.

AlertEngine is two lines of code:

from fastapi_alertengine import instrument
instrument(app)

That is the entire SDK installation. You are running in minutes, not months.

The specific blind spot

Silicon Valley thinks the goal is autonomy. No humans. Full automation. The system fixes itself.

But in the real world of money, trade, and regulation — the world I come from — the goal is accountability. Traceable humans. Provable decisions.

The specific combination that does not exist anywhere else:

FastAPI-native SDK — two lines of code, runs in minutes
Dual-model AI Diagnostic Council — two models reason independently, dissent alerts when they disagree
WhatsApp and Telegram control plane — because in Africa and emerging markets, WhatsApp is where people actually are
Immutable append-only audit trail — every stage, every actor, every policy version
Shadow Mode — observe governance without executing, the default for all new tenants
The Accountant's Brake — human authorisation as a resilience mechanism, not a bottleneck

I have taken the governance model of a $10 billion bank's internal incident system and put it in a Python package that installs in 30 seconds.

Why it took a bookkeeper from Zimbabwe

I did not see this gap because I am a great engineer. I am not a traditional engineer at all. I came to code through AI tools, building solutions to my own problems — first a WhatsApp batch invitation system for my own wedding with over 1,000 guests, then a payment orchestration platform for informal traders in Zimbabwe.

I saw this gap because I spent 30 years with two familiar questions: "who authorised that?" and "where is the audit trail?"

Those questions are not engineering questions. They are governance questions.

And nobody in the infrastructure tooling space was asking them.

Until now.

A final thought

I have been describing AlertEngine as an incident recovery tool. That is accurate but incomplete.

What I am actually building is a governance layer for operational decisions.

The strongest lines in this product are not about latency metrics or health scores. They are about authorization, evidence, and accountability.

"Nothing executes without approval."

"Every action is logged immutably."

"The system fixed itself is not an acceptable answer."

Those are governance statements. And that is the category AlertEngine is creating.

Most engineers ask, "How do we automate this?"

I started with, "Who approved this?"

That's a different mental model. And it turns out production infrastructure needs it more than most engineers realise.

# Why I made Shadow Mode the default for my FastAPI incident recovery tool

Lenard Francis — Thu, 11 Jun 2026 13:06:11 +0000

I didn't plan to build Shadow Mode.
I built AlertEngine to solve a specific problem: when a production API fails at 2am, most monitoring tools tell you what broke. None of them tell you who authorised the fix, or leave a record an auditor can replay.
That's the gap AlertEngine fills. AI diagnoses the incident. A human taps approve on WhatsApp. Every stage is logged to an immutable audit trail. Nothing executes without explicit authorisation.
The architecture works. The tests pass. The audit trail is real.
But when I started reaching out to potential customers in African fintech — payment platforms, cross-border rails, compliance-sensitive APIs — I kept hitting the same wall.
"How do we trust this around production?"
That question stopped me.
Because they were right. No regulated team should hand production recovery authority to a tool they've known for five minutes. That's not caution. That's governance.
So I asked a different question.
What if they didn't have to trust it yet?

What Shadow Mode does
Shadow Mode is the default evaluation state for all new AlertEngine tenants.
When Shadow Mode is active:

Health polling runs every 5 seconds
Incident detection runs via deterministic policy gates
AI diagnosis runs — Diagnostic Council (dual-model) or single model
Full pipeline state transitions: DETECTED → PROPOSED → VALIDATED
Complete audit trail written with actor attribution

What doesn't run:

WhatsApp and Telegram notifications
Recovery token generation
Webhook execution
Voice escalation

Every suppressed action is logged to the audit trail with actor: "shadow_mode" so the tenant can see exactly what would have happened.

The implementation
The change was surgical. pipeline.py needed zero modifications — the state machine runs normally in Shadow Mode. All the gates are in loop.py.
I added a shadow_mode flag to the tenant schema, read it at the top of _process_tenant(), and passed it through every _execute_actions() call:
```python
shadow_mode = bool(tenant.get("shadow_mode", False))
```
In _execute_actions(), every external call checks the flag first:
```python
if action_type == "SEND_NOTIFICATION":
    if shadow_mode:
        append_event(
            incident_id=incident_id,
            stage=stage,
            decision="shadow",
            reason=f"[SHADOW] Would have sent {action.get('payload', {}).get('type')} notification",
            confidence=0.0,
            actor="shadow_mode",
            tenant_id=tenant_id,
            metadata={"shadow_mode": True, "suppressed_action": action},
        )
        continue
    # ... normal notification flow
```
The audit trail gets fully populated. The state machine advances normally. Nothing external fires.

The Shadow Mode API
Four endpoints manage the evaluation lifecycle:
```bash
# Enable shadow mode (default for new tenants)
POST /tenant/{tenant_id}/shadow

# Check current status
GET /tenant/{tenant_id}/shadow

# Get governance report
GET /tenant/{tenant_id}/shadow/report

# Go live
DELETE /tenant/{tenant_id}/shadow
```
The governance report is the sales tool. After 30 days of observation it returns:

"23 incidents observed, 23 notifications suppressed, 23 recovery tokens suppressed — all logged to the immutable audit trail."

That's what you show a risk committee before going live.
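
A report like that can be derived purely from the audit entries. The sketch below is a guess at the shape, not AlertEngine's implementation: `shadow_report` is a hypothetical helper, and it assumes each suppressed action dict carries a `"type"` key.

```python
from collections import Counter
from typing import Iterable


def shadow_report(events: Iterable[dict]) -> dict:
    """Summarise what Shadow Mode observed and what it suppressed."""
    incidents = set()
    suppressed = Counter()
    for event in events:
        incidents.add(event["incident_id"])
        if event.get("actor") == "shadow_mode":
            action = event.get("metadata", {}).get("suppressed_action", {})
            suppressed[action.get("type", "UNKNOWN")] += 1  # "type" key is assumed
    return {"incidents_observed": len(incidents), "suppressed": dict(suppressed)}


# Invented audit entries, shaped like the append_event() call in the post.
events = [
    {"incident_id": "inc-1", "actor": "policy", "stage": "DETECTED"},
    {"incident_id": "inc-1", "actor": "shadow_mode", "stage": "VALIDATED",
     "metadata": {"shadow_mode": True,
                  "suppressed_action": {"type": "SEND_NOTIFICATION"}}},
    {"incident_id": "inc-2", "actor": "shadow_mode", "stage": "VALIDATED",
     "metadata": {"shadow_mode": True,
                  "suppressed_action": {"type": "SEND_NOTIFICATION"}}},
]

report = shadow_report(events)
print(report)
```

Because the report is only a fold over the audit trail, it cannot claim anything the trail does not already record.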

What changed strategically
Before Shadow Mode the sales conversation was:

"Install AlertEngine and trust it."

After Shadow Mode it became:

"Run AlertEngine in observation mode. Here's the governance report of everything it would have done. Now decide."

That's a completely different risk profile for a regulated buyer.
Shadow Mode shipped on Thursday. It wasn't on the roadmap on Tuesday.
Sometimes the best features come from asking "what's the real objection?" rather than "what's the next feature?"

# I Spent Years Balancing Ledgers. Now I Balance Redis Connections.

Lenard Francis — Wed, 03 Jun 2026 09:57:54 +0000

I spent my career in accounting and finance before building infrastructure in Zimbabwe.
In accounting, every transaction has three properties:
Authorization — no entry without approval
Immutability — once recorded, never altered
Reconciliation — every debit has a corresponding credit, provable by audit
When I started building FastAPI AlertEngine, I applied the same discipline to production incidents. The result is not a monitoring tool. It's an operational governance system.

Monitoring Tools Are for Forensics. Governance Tools Are for Control.

Monitoring tools tell you what broke after it broke. Datadog, Grafana, Sentry — they produce beautiful post-mortems.
Governance tools enforce that nothing executes without authorization, and they prove it afterward.
Most teams conflate the two. They buy monitoring, assume governance, and get surprised when auditors ask: "Who approved that deploy?"
AlertEngine separates them explicitly:
```text
Detection → Policy (deterministic, no AI)
Diagnosis → AI (explains, recommends, does not decide)
Authorization → Human (engineer taps approve)
Execution → Webhook (your infrastructure, your control)
Audit → Ledger (immutable, replayable, actor-attributed)
```
This is not a feature list. It's an architectural hierarchy enforced by code.

The Zimbabwe Constraint

Engineers in Zimbabwe aren't always at laptops when things break. WhatsApp is ubiquitous and can be the operational control plane.
That constraint produces something better than a dashboard: alerts that find you, with a single tap to authorise recovery. No SSH. No runbooks. No "log into Grafana and interpret the graph."
Just: "Something broke. Here's why. Tap approve. Nothing runs without you."

The Ledger Philosophy
In finance, a ledger has two sides: what happened, and who authorized it.
AlertEngine's audit trail has the same structure:
```json
{
  "timestamp": 1717344000,
  "incident_id": "inc-abc123-1685000000",
  "stage": "AUTHORIZED",
  "actor": "engineer",
  "decision": "approve",
  "reason": "Database connection pool exhausted — restart recommended",
  "confidence": 0.87,
  "policy_version": "1.0.0",
  "tenant_id": "tenant-xyz789"
}
```
Every entry is append-only. Every entry has an actor. Every entry is replayable.
This is not logging. Logging tells you what the system did. A ledger tells you who authorized it and why.
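
The three properties (append-only, actor-attributed, replayable) can be illustrated with a tiny in-memory stand-in. This is a sketch only: the `Ledger` class and its method names are hypothetical, and the post lists the real store as a Redis LIST rather than a Python list.

```python
import json
import time
from typing import Iterator, List, Optional


class Ledger:
    """In-memory stand-in for an append-only audit log."""

    def __init__(self) -> None:
        self._entries: List[str] = []  # serialised once, never rewritten

    def append(self, incident_id: str, stage: str, actor: str, decision: str,
               reason: str, timestamp: Optional[float] = None) -> None:
        if not actor:
            raise ValueError("every entry needs an actor")
        entry = {
            "timestamp": time.time() if timestamp is None else timestamp,
            "incident_id": incident_id,
            "stage": stage,
            "actor": actor,
            "decision": decision,
            "reason": reason,
        }
        self._entries.append(json.dumps(entry, sort_keys=True))

    def replay(self, incident_id: str) -> Iterator[dict]:
        """Yield an incident's entries in the order they were written."""
        for raw in self._entries:
            entry = json.loads(raw)
            if entry["incident_id"] == incident_id:
                yield entry


ledger = Ledger()
ledger.append("inc-1", "PROPOSED", "ai", "recommend", "pool exhausted", timestamp=1)
ledger.append("inc-1", "AUTHORIZED", "engineer", "approve", "restart approved", timestamp=2)
print([(e["stage"], e["actor"]) for e in ledger.replay("inc-1")])
```

The class deliberately exposes no update or delete method, so the only way to correct a mistake is to append another entry that says so.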

Policy Is the Floor. AI Is the Ceiling.

The most important architectural decision in AlertEngine is this:
Claude cannot trigger a state transition.
Policy decides whether an incident exists. Policy decides when a system has recovered. Claude diagnoses and explains — but the state machine doesn't listen to Claude. It listens to incident_policy.py.
When health metrics recover, the pipeline doesn't ask Claude what to do. It calls should_recover(score, err) and if the threshold is met, it transitions to RECOVERED with actor="policy". Claude's recommendation is irrelevant.

This means:

A confident wrong AI diagnosis cannot cause an incident to escalate
A policy recovery override is logged as actor: "policy" — auditors can see exactly when and why
Changing thresholds is a one-line edit in one file, versioned, and logged in every subsequent audit entry
The audit trail never lies about who made the decision
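
A policy gate of this kind might look like the sketch below. Only the name `should_recover(score, err)` and the `actor="policy"` convention come from the post; the `POLICY` keys, the threshold values, and `next_transition` are assumptions made up for illustration.

```python
# Hypothetical thresholds; the real values live in incident_policy.py.
POLICY = {
    "version": "1.0.0",
    "recover_min_score": 80.0,
    "recover_max_error_rate": 0.01,
}


def should_recover(score: float, err: float) -> bool:
    """Deterministic check: health score high enough, error rate low enough."""
    return (score >= POLICY["recover_min_score"]
            and err <= POLICY["recover_max_error_rate"])


def next_transition(state: str, score: float, err: float, ai_recommendation: str) -> dict:
    """Pick the next state from metrics alone; the AI text is recorded, never obeyed."""
    if state != "RECOVERED" and should_recover(score, err):
        new_state, reason = "RECOVERED", "health thresholds met"
    else:
        new_state, reason = state, "thresholds not met"
    return {
        "stage": new_state,
        "actor": "policy",
        "reason": reason,
        "policy_version": POLICY["version"],
        "ai_recommendation": ai_recommendation,  # kept for the audit trail only
    }


healthy = next_transition("AUTHORIZED", 92.0, 0.002, "restart the database")
degraded = next_transition("DETECTED", 40.0, 0.2, "system has recovered")
print(healthy["stage"], degraded["stage"])
```

In the second call the model claims recovery, but the metrics do not meet the thresholds, so the state does not move.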

Why This Matters Now

Three forces are converging:

Regulators are tightening. SOC 2, PCI DSS, HIPAA, GDPR — all require documented authorisation for production changes. "The AI did it" is not a compliant answer.
AI is getting faster. Claude can diagnose an incident in 3 seconds. Without governance, the temptation is to let it act autonomously. That's how a confident wrong diagnosis ends up restarting your database at peak traffic.
Engineers are burning out. 3 AM alerts with no context, no authorisation trail, and no proof of what happened. The answer isn't better dashboards; it's better workflows.

AlertEngine addresses all three: policy gates prevent AI from acting alone, human authorisation prevents burnout, and the audit trail prevents regulatory surprises.

The Honest Part

I'm also building a payment orchestration platform for the African "hustler" context. Getting infrastructure funding in Zimbabwe is genuinely hard.
So I packaged the operational governance layer as a standalone product. It solves a real problem — I needed it myself at 2am. It also funds the bigger build.
That felt worth being honest about.

The Code
The orchestrator is source-available. Every claim in this post is verifiable:
orchestrator/pipeline.py — policy hierarchy, actor="policy" on recovery override
orchestrator/incident_policy.py — single POLICY dict, versioned, env-configurable
orchestrator/audit.py — append-only Redis LIST, full actor attribution, replayable
Read the code. Audit the architecture. Then decide if your infrastructure deserves the same discipline as your accounting.
GitHub: github.com/Tandem-Media/fastapi-alertengine
Install:
```bash
pip install fastapi-alertengine
```
Managed orchestrator: anchorflowalertengine@outlook.com
Built in Harare, Zimbabwe. 🇿🇼

# From Eclipses to P95 Latency: What the Joseon Dynasty Can Teach Us About Incident Response

Lenard Francis — Sun, 31 May 2026 16:22:51 +0000

The Joseon Dynasty ruled Korea for more than five centuries, from 1392 to 1897.

That is roughly twice as long as the United States has existed. Five hundred years of one government, one bureaucracy, one record-keeping system.

And they documented everything.

The 朝鮮王朝實錄 (Veritable Records of the Joseon Dynasty) is one of the most extensive continuous historical records ever produced. Royal decrees, diplomatic correspondence, criminal cases, military campaigns, natural disasters, celestial observations, agricultural conditions, and administrative decisions were meticulously recorded.

Every eclipse. Every comet. Every drought. Every flood. Every tiger that wandered into a village.

At first glance, it reads like a civilisation obsessed with omens. Look closer and it begins to resemble something else: an accountability system operating at a national scale.

The Mandate of Heaven as an Accountability Mechanism

Joseon inherited the concept of the Mandate of Heaven from Chinese political philosophy.

The basic premise was simple: Heaven's approval of a ruler could be inferred from events in the natural world. Stable harvests, favorable weather, and orderly skies suggested good governance. Floods, eclipses, unusual celestial events, and other disruptions demanded attention.

Whether or not one accepted the underlying cosmology, the system functioned as a powerful accountability mechanism.

When an eclipse occurred, someone had to observe it. Someone had to record it. Officials had to discuss its significance. The court had to determine whether action was required. The response had to be documented. And the entire process became part of a permanent historical record.

A king could not plausibly claim ignorance of a reported eclipse.
An official could not quietly invent a justification for a policy years later. The record existed. The deliberation existed. The decision existed.

Accountability was structural.

https://ajin.im/is/building/omen.ops/

A Dynasty Rendered as Telemetry

Recently, Ajin built omen.ops, a project that renders the Veritable Records as a modern observability dashboard.

It is one of the most serious pieces of digital scholarship I have encountered. Every entry is sourced, annotated, and presented with the gravity the original record-keepers intended. Rather than merely digitising the records, the project reinterprets them through the lens of modern operations, observability, and incident management.

Suddenly, centuries-old historical events look remarkably familiar. Eclipses appear as system alerts. Comets register as anomaly spikes. Droughts become degradation events. The Mandate of Heaven itself is represented as a system health score.

The effect is both humorous and strangely illuminating.
A guest star observed over thirteen consecutive night-watches in 1592 appears as a P1 incident. The court astronomers of the Gwansanggam—the royal bureau responsible for celestial observation—tracked its position relative to known stars, recorded its persistence, and noted the absence of any established remediation procedure.

In modern operations language, the alert was acknowledged, classified, documented, and ultimately deemed unactionable.

Every engineer who has ever been paged at 3 AM has encountered the same category of problem.

The dashboard also presents a derived metric called the Mandate Volatility Index. It compresses centuries of recorded anomalies into a single score relative to a reign's baseline conditions.

The historical court and the modern SRE team face the same challenge: overwhelming amounts of signal with limited human attention. Different centuries. Different tools. Same problem.
They needed summaries. We use dashboards. They had volatility indexes. We have P95 latency graphs.

The Tiger Incident

The most memorable entry may not involve the heavens at all.

In 1571, a white-browed tiger reportedly killed hundreds of people and livestock near the capital. The response escalated rapidly. The court mobilised a specialised tiger-catching commander and launched a coordinated hunt.

Then a secondary problem emerged. The soldiers sent to eliminate the tiger began looting civilians. The court was suddenly managing two incidents instead of one. The tiger threat was eventually mitigated. Multiple animals were killed. The military operation was scaled back. Reports of civilian misconduct were documented.

Incident opened. Response initiated. Unexpected side effects detected. Mitigation adjusted. Incident closed. The terminology changes. The workflow does not.

The Governance Problem Hasn't Changed

The tools have changed completely. The governance problem has not. What fascinates me about the Joseon system is not the astronomy. It is the process.

Observe - Record - Deliberate - Authorize - Act - Document

That sequence appears repeatedly throughout the Veritable Records. It is also the sequence behind every mature operational system. Modern monitoring platforms are excellent at collecting signals. They can detect latency spikes, memory pressure, queue backlogs, failed deployments, and infrastructure degradation in seconds.

What many systems still struggle with is everything that happens after detection. Who saw the alert? Who approved the response?
What evidence was considered? Why was a particular action taken?
Can someone reconstruct the decision six months later?
Detection is only the beginning. Accountability begins when decisions become traceable.

Building for Active Control

This idea sits at the centre of what I am building with FastAPI AlertEngine.

FastAPI AlertEngine is incident intelligence for FastAPI services. A free SDK adds health monitoring to your application with two lines of code. When degradation is detected, a managed orchestrator investigates the likely cause and sends you a WhatsApp or Telegram notification containing a single-use approval link. Nothing executes without your authorisation.

The goal is not simply to collect more alerts. Most organisations already have more alerts than they can reasonably process. The goal is to preserve the decision chain.

Observe the signal. Capture the evidence. Present the context.
Recommend an action. Require authorisation. Execute the response.
Record everything.

The Joseon court performed this process with a brush, ink, astronomers, and royal historians. We perform it with telemetry pipelines, machine reasoning, and APIs. The difference is speed. The court might take days to process an eclipse.
We can detect a P95 latency spike, identify likely causes, generate remediation options, and request approval in seconds.

But the underlying governance problem remains unchanged.
How do you ensure that the people responsible for a system are confronted with evidence, required to make a decision, and unable to rewrite history afterward?

Five hundred years ago, the Joseon Dynasty answered that question with brush and ink. We are still figuring it out with Redis and JWT tokens.

Explore the historical dashboard: https://ajin.im/is/building/omen.ops/

If you run FastAPI in production and want incident response that asks before it acts, the free SDK is available through FastAPI AlertEngine.

How I Taught My Incident Alerts to Say "This Broke 3 Minutes After Your Last Deploy"

Lenard Francis — Sat, 30 May 2026 19:50:31 +0000

You're staring at a P95 latency spike.

The alert says: "Database pool exhausted. P95: 2847ms." You know what broke. You don't know why.
So you open your git log, check when the spike started, scroll through commits, and try to figure out what changed in the 10 minutes before everything went sideways.
That archaeology takes 20 minutes on a good day. At 2am it takes longer.

The Problem with Context-Free Alerts
Most incident alerts are great at telling you the “what”. Few of them tell you the “when” in relation to your codebase.
The question every engineer asks during an incident isn't "what is the P95?" — they already know that. It's "Did we just deploy something?"

The Insight: Incidents Have a Deployment Shadow
The way I see it, the majority of production incidents fall into one of two categories:
• Infrastructure events — upstream dependency failure, Redis outage, traffic spike
• Deployment shadows — something changed in the last deploy that didn't show up in testing

For category 2, the fastest path to resolution is knowing exactly what changed and when — down to the commit level.
If your alert says:
Database pool exhausted (P95: 2847ms)
Recent deployments before incident:
3m ago — a1b2c3d: "Fix checkout query isolation level" (John, +12/-3)
1 recent commit touched database/query files
You've just saved 20 minutes of log archaeology.

How to Build It
The implementation is simpler than it sounds. Three components:
• A commit store — Redis sorted set, scored by timestamp
• A GitHub webhook — receives push events, stores commits
• An incident correlator — maps incident start time to nearby commits

The Commit Store
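import json

def store_commit(tenant_id, sha, message, author, timestamp, files_changed):
    key = f"orchestrator:commits:{tenant_id}"
    # Assumed serialisation: each commit is stored as a JSON string member
    entry = json.dumps({"sha": sha, "message": message, "author": author, "timestamp": timestamp, "files_changed": files_changed})
    redis.zadd(key, {entry: timestamp})  # scored by commit timestamp; redis is an existing client
    redis.expire(key, 86400 * 7)  # 7 day TTL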
A Redis sorted set gives you O(log N) insertion and O(log N + K) range queries — perfect for "give me commits in the 10 minutes before this timestamp."
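
The third component, the incident correlator, is not shown here. A minimal sketch of the range query might look like the following, assuming each sorted-set member is a JSON-encoded commit scored by its Unix timestamp. The function name commits_before, the 10-minute default window, and the FakeSortedSet stand-in (there only so the example runs without a Redis server) are illustrative assumptions; with redis-py, the same zrangebyscore(key, min, max) call would go to a real client.

```python
import json


class FakeSortedSet:
    """In-memory stand-in for the two Redis calls used below (illustration only)."""

    def __init__(self):
        self._data = {}  # key -> {member: score}

    def zadd(self, key, mapping):
        self._data.setdefault(key, {}).update(mapping)

    def zrangebyscore(self, key, min, max):
        members = self._data.get(key, {})
        hits = [(score, m) for m, score in members.items() if min <= score <= max]
        return [m for score, m in sorted(hits)]  # ascending by score, like Redis


def commits_before(client, tenant_id, incident_ts, window_s=600):
    """Return commits scored within window_s seconds before incident_ts, newest first."""
    key = f"orchestrator:commits:{tenant_id}"
    raw = client.zrangebyscore(key, incident_ts - window_s, incident_ts)
    commits = [json.loads(entry) for entry in raw]
    return list(reversed(commits))


client = FakeSortedSet()
key = "orchestrator:commits:acme"  # "acme" is a made-up tenant
client.zadd(key, {json.dumps({"sha": "a1b2c3d", "timestamp": 1000}): 1000})
client.zadd(key, {json.dumps({"sha": "0ldc0de", "timestamp": 100}): 100})
recent = commits_before(client, "acme", incident_ts=1180)
```

Only the commit scored 180 seconds before the incident falls inside the window; the older one is ignored.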

The GitHub Webhook
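from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/commits/webhook")
async def github_webhook(request: Request):
    body = await request.json()
    # A GitHub push event lists its commits under "commits"
    for commit in body.get("commits", []):
        store_commit(...)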
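
The store_commit(...) arguments are elided in the snippet. As a sketch of how they could be derived, the helper below flattens a GitHub push event, whose commits entries carry id, message, timestamp, author, and added/removed/modified file lists. The helper name parse_push_commits and the returned dict layout are assumptions for illustration, not the AlertEngine implementation.

```python
from datetime import datetime


def parse_push_commits(body):
    """Turn a GitHub push payload into flat commit records (hypothetical helper)."""
    records = []
    for commit in body.get("commits", []):
        # ISO 8601 timestamp; normalise a trailing "Z" for Python versions before 3.11
        iso = commit["timestamp"].replace("Z", "+00:00")
        files = (
            commit.get("added", [])
            + commit.get("removed", [])
            + commit.get("modified", [])
        )
        first_line = (commit["message"].splitlines() or [""])[0]
        records.append({
            "sha": commit["id"][:7],
            "message": first_line,
            "author": commit.get("author", {}).get("name", "unknown"),
            "timestamp": datetime.fromisoformat(iso).timestamp(),
            "files_changed": files,
        })
    return records


# Sample payload trimmed to the fields used above (values are made up)
sample = {"commits": [{
    "id": "a1b2c3d4e5f60718293a4b5c6d7e8f9012345678",
    "message": "Fix checkout query isolation level\n\nLonger body text",
    "timestamp": "2026-05-30T19:47:00Z",
    "author": {"name": "John"},
    "modified": ["database/query.py"],
}]}
records = parse_push_commits(sample)
```

Each record then maps directly onto the store_commit parameters.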

Injecting Context into AI Diagnosis
Without commit context, Claude sees raw metrics. With commit context, Claude sees the metrics AND what changed 3 minutes before the incident — shifting the diagnosis from "likely database connection issue" to "checkout query isolation level change likely caused connection pool exhaustion."
That's a different quality of diagnosis entirely.
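
One way to hand that context to the model is to render the nearby commits as plain text and append it to the metrics. The sketch below is an assumption about how that could look, not the actual prompt used by AlertEngine: format_commit_context is a hypothetical name, and the "database or query in the file path" check is a deliberately crude heuristic.

```python
def format_commit_context(commits, incident_ts):
    """Render recent commits as plain-text lines for a diagnosis prompt (illustrative)."""
    if not commits:
        return "No deployments recorded before the incident."
    lines = ["Recent deployments before incident:"]
    for c in commits:
        minutes = int((incident_ts - c["timestamp"]) // 60)
        lines.append(f'{minutes}m ago - {c["sha"]}: "{c["message"]}" ({c["author"]})')
    # Crude heuristic: count commits whose changed paths mention database or query
    db_touch = sum(
        1 for c in commits
        if any("database" in f or "query" in f for f in c.get("files_changed", []))
    )
    if db_touch:
        lines.append(f"{db_touch} recent commit(s) touched database/query files")
    return "\n".join(lines)


commits = [{
    "sha": "a1b2c3d",
    "message": "Fix checkout query isolation level",
    "author": "John",
    "timestamp": 1000,
    "files_changed": ["database/query.py"],
}]
context = format_commit_context(commits, incident_ts=1180)
```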

What the WhatsApp Message Looks Like
⚠️ Action Recommended
Service: Payment API
Issue: Database pool exhausted — P95 2.8s
Likely cause: Checkout query isolation level change
(commit a1b2c3d, 3m ago)
Confidence: 87%
👉 Approve fix: [link]
Nothing will run without your approval.

Three Setup Options
• GitHub webhook (recommended) — POST /commits/webhook with header X-AlertEngine-Tenant-ID
• Manual push from CI — curl from your GitHub Actions workflow
• GitHub API polling — set GITHUB_TOKEN and GITHUB_REPO, AlertEngine fetches automatically

The Broader Pattern
This feature is an instance of a broader pattern: enrich your incident context with everything that changed recently, not just the metrics at the moment of failure.
Future extensions of the same idea:
• Feature flag changes in the 10 minutes before an incident
• Infrastructure changes (Terraform applies, Docker image updates)
• Database migration executions
• Config changes

The alert that says, "Here's what broke, here's what changed right before it broke, here's the fix" is the alert worth building.

─────────────────────────────────────────
This is now live in FastAPI AlertEngine as commit_context.py.
GitHub: github.com/Tandem-Media/fastapi-alertengine
Docs: tandem-media.github.io/fastapi-alertengine/
pip install fastapi-alertengine

Why P95 Latency Is the Only Metric That Matters at 3 AM

Lenard Francis — Thu, 21 May 2026 18:58:30 +0000

If your checkout endpoint serves 10,000 requests per minute and the slowest 5% of them spike, that is 500 users having a bad experience every minute.

Averages compress that pain into a single comfortable number.
P95 latency — the latency at the 95th percentile — tells you what your slowest users are actually experiencing.

It's the metric that catches the spike that averages hide.
This is why I track P95 as the primary health signal, not averages.
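
A small worked example makes the gap concrete. The sample below is invented for illustration (94 fast requests, 6 slow ones), and the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
from statistics import mean


def p95(latencies_ms):
    """Nearest-rank 95th percentile: smallest sample with at least 95% of values at or below it."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


# Illustrative sample: 94 requests at 100 ms, 6 requests at 2000 ms
sample = [100] * 94 + [2000] * 6
avg = mean(sample)   # 214 ms: looks tolerable
tail = p95(sample)   # 2000 ms: what the slowest users actually see
```

The average moves from 100 ms to 214 ms, while P95 jumps to 2000 ms.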

How Latency Spikes Actually Propagate
A latency spike rarely starts in your application. It usually starts somewhere else and cascades inward.

The typical pattern looks like this:

Slow upstream dependency
↓
Connection pool saturation
↓
Request queue growth
↓
Latency spike propagation
↓
Timeouts and failures

The Cascade Pattern
1. An upstream dependency (database, payment gateway, third-party API) slows down.
2. Your FastAPI app keeps accepting requests while waiting for responses.
3. Your connection pool fills up, and new requests queue behind existing ones.
4. Queue depth grows and memory pressure builds.
5. Response times climb across all endpoints, not just the affected one. Eventually requests start timing out or failing entirely.

By stage 3, you have a problem. By stage 5, your customers know about it before you do.
The cascade failure pattern is particularly nasty. A slow database query holds a connection.

That held connection blocks another request. That blocked request ties up execution capacity. Multiply that by concurrent users and you get full service degradation from a single slow dependency.

Under async workloads, the failure mode becomes especially deceptive because the application continues accepting requests while upstream awaits accumulate in the background.
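
The queueing effect can be reproduced with a small deterministic simulation. This is a sketch under simplifying assumptions (every request needs exactly one pooled connection and holds it for a fixed time); the function name and the numbers are made up for illustration.

```python
import heapq


def simulate_pool(arrivals_ms, pool_size, hold_ms):
    """Return per-request latency when each request holds one pooled connection for hold_ms."""
    free_at = [0] * pool_size  # time at which each connection next becomes free
    heapq.heapify(free_at)
    latencies = []
    for arrival in sorted(arrivals_ms):
        earliest = heapq.heappop(free_at)
        start = max(arrival, earliest)  # wait if no connection is free yet
        finish = start + hold_ms
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


arrivals = [0, 10, 20, 30]
healthy = simulate_pool(arrivals, pool_size=2, hold_ms=5)     # dependency answers in 5 ms
degraded = simulate_pool(arrivals, pool_size=2, hold_ms=100)  # dependency answers in 100 ms
```

With a healthy dependency every request takes 5 ms. When the dependency slows to 100 ms, the third and fourth requests take 180 ms each, because they also wait for a connection to free up.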

High Traffic Spikes Make This Worse

Under normal load, a slow upstream dependency is annoying.
Under a traffic spike, it's catastrophic.

Here's why:

Connection pool saturation happens faster. If you have 20 database connections and traffic doubles, you hit the ceiling twice as fast.
Queue depth explodes. Requests piling up behind a slow dependency compound each other's wait time.
Memory pressure builds. Each queued request holds state. Enough of them and you drift toward OOM territory.
Recovery is non-linear. Once a connection pool is saturated, it often stays saturated even after the upstream issue resolves — because the backlog keeps it full.
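
Little's Law gives a quick way to estimate when a pool saturates: the average number of connections in use equals the arrival rate multiplied by how long each request holds a connection. The request rates and hold times below are illustrative assumptions; only the pool size of 20 comes from the example above.

```python
def connections_needed(requests_per_s, hold_ms):
    """Little's Law: average connections in use = arrival rate x time each one is held."""
    return requests_per_s * hold_ms / 1000


POOL_SIZE = 20  # pool size from the example above

normal = connections_needed(100, 50)           # 5.0: plenty of headroom
slow = connections_needed(100, 500)            # 50.0: pool saturated by the slowdown alone
slow_and_spike = connections_needed(200, 500)  # 100.0: slowdown plus doubled traffic

saturated = slow > POOL_SIZE
```

A tenfold slowdown in the dependency needs ten times the connections at the same traffic, which is why the backlog builds so quickly.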

The cruel irony is that traffic spikes happen when your service matters most.

A flash sale. A viral moment. A major announcement.
Exactly the wrong time to be debugging latency from a dashboard.

What Didn't Work For Me

Monitoring sounds easy in theory. In practice, most setups failed me in one of four ways.

Prometheus + Grafana

Powerful, but operationally heavy.

Setting up exporters, configuring dashboards, maintaining the stack — all before writing a single alert rule.

And when the alert fires at 3am, you still have to log in and interpret charts under pressure.

Simple Health Checks

GET /health → 200 OK tells you the service is alive.
It doesn't tell you it's running at 8x normal latency while technically responding.

Average Latency Monitoring

Averages mask the spikes that actually hurt users.

In one case, a payment provider slowdown pushed P95 latency from roughly 180 ms to over 2 seconds within minutes — while average latency still looked acceptable.

By the time averages reflected the issue, checkout failures had already started.

Alert Fatigue

I added more monitors to catch more things. Which meant more alerts. Most of them were noise. When everything is urgent, nothing is.

Monitoring systems usually optimise for data collection. Operators actually need decision compression.

What I Built Instead

I wanted something that:
Tracked P95, not averages
Produced a single health score instead of 15 metrics to interpret
Caught degradation trends early, before full failure
Required zero config to add to an existing FastAPI app

The result is a FastAPI middleware that continuously computes degradation signals directly from live request traffic.

from fastapi import FastAPI
from fastapi_alertengine import instrument

app = FastAPI()
instrument(app)
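
To give a feel for what computing a signal from live traffic can involve, here is a minimal rolling-window P95 tracker. It is a sketch only and makes no claim about how fastapi-alertengine does it internally; the class name RollingP95 and the window size are assumptions.

```python
from collections import deque


class RollingP95:
    """Keep the most recent request latencies and report their 95th percentile."""

    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def value(self):
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        rank = (95 * len(ordered) + 99) // 100  # nearest-rank, 1-based
        return ordered[rank - 1]


tracker = RollingP95(window=100)
for _ in range(100):
    tracker.record(180)
baseline = tracker.value()

for _ in range(10):
    tracker.record(2000)  # a burst of slow requests displaces the oldest samples
spiking = tracker.value()
```

Ten slow requests out of the last hundred are enough to move the reading from 180 ms to 2000 ms.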

The middleware exposes a structured /health/alerts endpoint:

{
  "status": "warning",
  "health_score": {
    "score": 61,
    "trend": "degrading"
  },
  "metrics": {
    "overall_p95_ms": 1847.3,
    "error_rate": 0.08,
    "anomaly_score": 0.9
  }
}

One status. One score. One trend direction. No dashboards to configure. No agents to run. No Prometheus exporters.

The Human-in-the-Loop Layer

Once I had a reliable health signal, the next question was:
What do I do with it?

I built a managed orchestration layer that polls /health/alerts every 5 seconds. When the score drops below the threshold, it:

Runs Claude AI diagnosis on the metric context
Sends a WhatsApp, Telegram, or Slack message with a plain-English summary
Generates a single-use recovery link
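
The polling step can be sketched as a single function that reads the health payload and decides whether to escalate. The threshold of 70 and the poll_once, fetch, and notify names are assumptions for illustration; the managed orchestrator's real logic is not described here. Passing fetch in as a callable keeps the sketch runnable without a live service.

```python
def poll_once(fetch, notify, threshold=70):
    """Fetch one /health/alerts payload and notify if the score is below threshold."""
    payload = fetch()
    score = payload["health_score"]["score"]
    if score < threshold:
        notify(payload)
        return True
    return False


# Stub fetcher returning the example payload shown earlier
def fetch_example():
    return {
        "status": "warning",
        "health_score": {"score": 61, "trend": "degrading"},
        "metrics": {"overall_p95_ms": 1847.3, "error_rate": 0.08, "anomaly_score": 0.9},
    }


sent = []
escalated = poll_once(fetch_example, sent.append, threshold=70)
```

In a real loop, fetch would make the HTTP request and the caller would sleep 5 seconds between calls.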

Most AI incident tooling jumps straight to autonomous remediation. I intentionally didn't.

Production systems deserve human authorisation before recovery actions execute. I read the diagnosis, preview the recovery action, and tap approve – all from my phone.

Nothing executes automatically. Every action is logged immutably.
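
The single-use property of the approval link can be illustrated with a small in-memory sketch. This is an assumption about one possible design, not the service's implementation: the ApprovalLinks class, the 15-minute expiry, and the action name are all hypothetical, and a production version would need a shared store rather than a dict.

```python
import secrets
import time


class ApprovalLinks:
    """Issue approval tokens that can be redeemed at most once (in-memory sketch)."""

    def __init__(self, ttl_s=900):
        self.ttl_s = ttl_s
        self._pending = {}  # token -> (action, expiry timestamp)

    def issue(self, action, now=None):
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)  # unguessable, URL-safe token
        self._pending[token] = (action, now + self.ttl_s)
        return token

    def redeem(self, token, now=None):
        """Return the approved action, or None if the token is unknown, used, or expired."""
        now = time.time() if now is None else now
        entry = self._pending.pop(token, None)  # pop makes the token single-use
        if entry is None:
            return None
        action, expires_at = entry
        return action if now <= expires_at else None


links = ApprovalLinks(ttl_s=900)
token = links.issue("restart-db-pool", now=1000)  # hypothetical action name
first = links.redeem(token, now=1060)   # approved within the window
second = links.redeem(token, now=1061)  # same link tapped again
```

The first tap returns the action to execute; the second returns None, so a forwarded or replayed link does nothing.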

I built the mobile-first delivery because I work in Zimbabwe, where engineers aren't always at laptops when things break.

WhatsApp is the operational control plane here.

That constraint produced something better than I expected:

Alerts that find you, rather than dashboards you have to find.

The Open Source Core
The telemetry middleware is free and MIT licensed.
pip install fastapi-alertengine

The managed orchestration layer (AI diagnosis, WhatsApp/Telegram alerts, and human-authorised recovery) is a commercial service.

GitHub: https://github.com/Tandem-Media/fastapi-alertengine

Most monitoring stacks are good at detecting incidents.
Very few are good at reducing operator uncertainty during one.
How are you handling that gap today?