<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lenard Francis</title>
    <description>The latest articles on DEV Community by Lenard Francis (@tandemmedia).</description>
    <link>https://dev.to/tandemmedia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3944281%2F75480865-6efb-4519-b055-c468650b1a8f.jpg</url>
      <title>DEV Community: Lenard Francis</title>
      <link>https://dev.to/tandemmedia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tandemmedia"/>
    <language>en</language>
    <item>
      <title>Why no one has built what AlertEngine builds — and why it took a bookkeeper to see the gap</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Thu, 11 Jun 2026 13:31:38 +0000</pubDate>
      <link>https://dev.to/tandemmedia/why-no-one-has-built-what-alertengine-builds-and-why-it-took-a-bookkeeper-to-see-the-gap-2i3e</link>
      <guid>https://dev.to/tandemmedia/why-no-one-has-built-what-alertengine-builds-and-why-it-took-a-bookkeeper-to-see-the-gap-2i3e</guid>
      <description>&lt;p&gt;I want to be honest about something.&lt;/p&gt;

&lt;p&gt;When I started building AlertEngine, I assumed I was late. Monitoring is a crowded space. PagerDuty has been around since 2009. AWS has remediation tools. There are well-funded AI SRE startups launching every month.&lt;/p&gt;

&lt;p&gt;I kept waiting for someone to tell me it already existed.&lt;/p&gt;

&lt;p&gt;Nobody did. Because it doesn't.&lt;/p&gt;

&lt;p&gt;Not with this specific combination. Not with this philosophy.&lt;/p&gt;

&lt;p&gt;Here is what I found when I looked carefully at the landscape.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Alerting Giants stop at the notification
&lt;/h2&gt;

&lt;p&gt;PagerDuty and Opsgenie are excellent at telling you something broke. They will wake you up. They will escalate. They will page the on-call engineer.&lt;/p&gt;

&lt;p&gt;Then they stop.&lt;/p&gt;

&lt;p&gt;They assume you will open a laptop, find a terminal, run a script, and fix it yourself. There is no diagnosis in the alert. There is no recovery button. There is no audit trail of what you did next.&lt;/p&gt;

&lt;p&gt;AlertEngine picks up exactly where they stop. The alert contains the diagnosis and a one-tap recovery link. The audit trail records what happened after the alert fired.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Auto-Remediation tools are the problem, not the solution
&lt;/h2&gt;

&lt;p&gt;Shoreline, AWS Systems Manager, and the new wave of autonomous remediation platforms are built on a premise I fundamentally disagree with: that the goal is to remove the human from the loop entirely.&lt;/p&gt;

&lt;p&gt;Peer-reviewed research (Demirbas et al., ACM CAIS 2026) shows that AI agents create approximately 50x more rollbacks than human clients. Their aggressive retry behaviour turns a degraded service into a metastable feedback loop that makes the outage worse.&lt;/p&gt;

&lt;p&gt;I call this the Metastability problem. The auto-remediation tools are its primary cause.&lt;/p&gt;

&lt;p&gt;AlertEngine is the opposite philosophy. The AI diagnoses. The human decides. The system proves it happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The AI SRE startups are built by the wrong people
&lt;/h2&gt;

&lt;p&gt;There is a new wave of LLM-powered SRE tools. They are impressive. They are well-funded. They are built by engineers who deeply understand AI.&lt;/p&gt;

&lt;p&gt;None of them have an immutable audit trail.&lt;/p&gt;

&lt;p&gt;None of them treat recovery as a financial transaction.&lt;/p&gt;

&lt;p&gt;None of them ask "who authorised that?" because that question has never kept a Silicon Valley engineer awake at night.&lt;/p&gt;

&lt;p&gt;It kept me awake every night for 30 years. I spent my career in accounting and finance. In that world, no transaction executes without authorisation and every action leaves a trail. That is not bureaucracy. That is governance.&lt;/p&gt;

&lt;p&gt;The AI SRE startups use AI as the product. AlertEngine uses AI as an advisor. The audit trail is the product.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Enterprise Workflow tools cost $100,000 and take six months
&lt;/h2&gt;

&lt;p&gt;Tines and Torq will let you build sophisticated recovery workflows. They are genuinely powerful.&lt;/p&gt;

&lt;p&gt;They are also $50,000–$100,000 per year and require a dedicated implementation team to set up.&lt;/p&gt;

&lt;p&gt;A seed-stage fintech in Lagos or a payment platform in Harare cannot buy that. A solo founder running a SaaS doing $10K MRR cannot buy that.&lt;/p&gt;

&lt;p&gt;AlertEngine is two lines of code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi_alertengine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;instrument&lt;/span&gt;
&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire SDK installation. You are running in minutes, not months.&lt;/p&gt;




&lt;h2&gt;
  
  
  The specific blind spot
&lt;/h2&gt;

&lt;p&gt;Silicon Valley thinks the goal is autonomy. No humans. Full automation. The system fixes itself.&lt;/p&gt;

&lt;p&gt;But in the real world of money, trade, and regulation — the world I come from — the goal is accountability. Traceable humans. Provable decisions.&lt;/p&gt;

&lt;p&gt;The specific combination that does not exist anywhere else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI-native SDK&lt;/strong&gt; — two lines of code, runs in minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-model AI Diagnostic Council&lt;/strong&gt; — two models reason independently, dissent alerts when they disagree&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WhatsApp and Telegram control plane&lt;/strong&gt; — because in Africa and emerging markets, WhatsApp is where people actually are&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable append-only audit trail&lt;/strong&gt; — every stage, every actor, every policy version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow Mode&lt;/strong&gt; — observe governance without executing, the default for all new tenants&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Accountant's Brake&lt;/strong&gt; — human authorisation as a resilience mechanism, not a bottleneck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have taken the governance model of a $10 billion bank's internal incident system and put it in a Python package that installs in 30 seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why it took a bookkeeper from Zimbabwe
&lt;/h2&gt;

&lt;p&gt;I did not see this gap because I am a great engineer. I am not a traditional engineer at all. I came to code through AI tools, building solutions to my own problems — first a WhatsApp batch invitation system for my own wedding with over 1,000 guests, then a payment orchestration platform for informal traders in Zimbabwe.&lt;/p&gt;

&lt;p&gt;I saw this gap because I spent 30 years with two familiar questions: "who authorised that?" and "where is the audit trail?" &lt;/p&gt;

&lt;p&gt;Those questions are not engineering questions. They are governance questions.&lt;/p&gt;

&lt;p&gt;And nobody in the infrastructure tooling space was asking them.&lt;/p&gt;

&lt;p&gt;Until now.&lt;/p&gt;




&lt;h2&gt;
  
  
  A final thought
&lt;/h2&gt;

&lt;p&gt;I have been describing AlertEngine as an incident recovery tool. That is accurate but incomplete.&lt;/p&gt;

&lt;p&gt;What I am actually building is a governance layer for operational decisions.&lt;/p&gt;

&lt;p&gt;The strongest lines in this product are not about latency metrics or health scores. They are about authorization, evidence, and accountability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Nothing executes without approval."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Every action is logged immutably."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"'The system fixed itself' is not an acceptable answer."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Those are governance statements. And that is the category AlertEngine is creating.&lt;/p&gt;

&lt;p&gt;Most engineers ask, "How do we automate this?"&lt;/p&gt;

&lt;p&gt;I started with, "Who approved this?"&lt;/p&gt;

&lt;p&gt;That's a different mental model. And it turns out production infrastructure needs it more than most engineers realise.&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Why I made Shadow Mode the default for my FastAPI incident recovery tool</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Thu, 11 Jun 2026 13:06:11 +0000</pubDate>
      <link>https://dev.to/tandemmedia/why-i-made-shadow-mode-the-default-for-my-fastapi-incident-recovery-tool-3966</link>
      <guid>https://dev.to/tandemmedia/why-i-made-shadow-mode-the-default-for-my-fastapi-incident-recovery-tool-3966</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;I didn't plan to build Shadow Mode.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
I built AlertEngine to solve a specific problem: when a production API fails at 2am, most monitoring tools tell you what broke. None of them tell you who authorised the fix, or leave a record an auditor can replay.&lt;br&gt;
That's the gap AlertEngine fills. AI diagnoses the incident. A human taps approve on WhatsApp. Every stage is logged to an immutable audit trail. Nothing executes without explicit authorisation.&lt;br&gt;
The architecture works. The tests pass. The audit trail is real.&lt;br&gt;
But when I started reaching out to potential customers in African fintech — payment platforms, cross-border rails, compliance-sensitive APIs — I kept hitting the same wall.&lt;br&gt;
"How do we trust this around production?"&lt;br&gt;
That question stopped me.&lt;br&gt;
Because they were right. No regulated team should hand production recovery authority to a tool they've known for five minutes. That's not caution. That's governance.&lt;br&gt;
So I asked a different question.&lt;br&gt;
What if they didn't have to trust it yet?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Shadow Mode does&lt;/strong&gt;&lt;br&gt;
Shadow Mode is the default evaluation state for all new AlertEngine tenants.&lt;br&gt;
When Shadow Mode is active:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Health polling runs every 5 seconds&lt;/li&gt;
&lt;li&gt;Incident detection runs via deterministic policy gates&lt;/li&gt;
&lt;li&gt;AI diagnosis runs: Diagnostic Council (dual-model) or single model&lt;/li&gt;
&lt;li&gt;Full pipeline state transitions: DETECTED → PROPOSED → VALIDATED&lt;/li&gt;
&lt;li&gt;Complete audit trail written with actor attribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What doesn't run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WhatsApp and Telegram notifications&lt;/li&gt;
&lt;li&gt;Recovery token generation&lt;/li&gt;
&lt;li&gt;Webhook execution&lt;/li&gt;
&lt;li&gt;Voice escalation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every suppressed action is logged to the audit trail with actor: "shadow_mode" so the tenant can see exactly what would have happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The implementation&lt;/strong&gt;&lt;br&gt;
The change was surgical. pipeline.py needed zero modifications because the state machine runs normally in Shadow Mode. All the gates are in loop.py.&lt;br&gt;
I added a shadow_mode flag to the tenant schema, read it at the top of _process_tenant(), and passed it through every _execute_actions() call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;shadow_mode = bool(tenant.get("shadow_mode", False))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In _execute_actions(), every external call checks the flag first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;if action_type == "SEND_NOTIFICATION":
    if shadow_mode:
        append_event(
            incident_id=incident_id,
            stage=stage,
            decision="shadow",
            reason=f"[SHADOW] Would have sent {action.get('payload', {}).get('type')} notification",
            confidence=0.0,
            actor="shadow_mode",
            tenant_id=tenant_id,
            metadata={"shadow_mode": True, "suppressed_action": action},
        )
        continue
    # ... normal notification flow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The audit trail gets fully populated. The state machine advances normally. Nothing external fires.&lt;/p&gt;
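&lt;p&gt;To make the gate easier to try out, here is a minimal, self-contained sketch of the same pattern. It is not the code from loop.py: the in-memory AUDIT list, the execute_actions signature, the injected send callable, and the action["type"] key are all assumptions made so the example runs on its own.&lt;/p&gt;

```python
AUDIT = []  # in-memory stand-in for the audit trail (assumption)


def append_event(**fields):
    """Append one audit entry; entries are never edited afterwards."""
    AUDIT.append(dict(fields))


def execute_actions(actions, incident_id, tenant_id, stage, shadow_mode, send):
    """Run each action, or only log it when shadow_mode is on.

    Returns the number of notifications actually sent.
    """
    sent = 0
    for action in actions:
        if action.get("type") == "SEND_NOTIFICATION":
            if shadow_mode:
                kind = action.get("payload", {}).get("type")
                append_event(
                    incident_id=incident_id,
                    stage=stage,
                    decision="shadow",
                    reason=f"[SHADOW] Would have sent {kind} notification",
                    confidence=0.0,
                    actor="shadow_mode",
                    tenant_id=tenant_id,
                    metadata={"shadow_mode": True, "suppressed_action": action},
                )
                continue  # skip the external call entirely
            send(action)  # normal flow: the external call happens here
            sent += 1
    return sent


# Demo: the same action in shadow mode, then live.
calls = []
action = {"type": "SEND_NOTIFICATION", "payload": {"type": "whatsapp"}}
shadow_sent = execute_actions([action], "inc-1", "tenant-1", "VALIDATED", True, calls.append)
live_sent = execute_actions([action], "inc-2", "tenant-1", "VALIDATED", False, calls.append)
```

&lt;p&gt;In the shadow run nothing reaches send, but one audit entry with actor "shadow_mode" is written. In the live run the notification goes out and no shadow entry is added.&lt;/p&gt;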

&lt;p&gt;&lt;strong&gt;The Shadow Mode API&lt;/strong&gt;&lt;br&gt;
Four endpoints manage the evaluation lifecycle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Enable shadow mode (default for new tenants)
POST /tenant/{tenant_id}/shadow

# Check current status
GET /tenant/{tenant_id}/shadow

# Get governance report
GET /tenant/{tenant_id}/shadow/report

# Go live
DELETE /tenant/{tenant_id}/shadow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The governance report is the sales tool. After 30 days of observation it returns:&lt;/p&gt;

&lt;p&gt;"23 incidents observed, 23 notifications suppressed, 23 recovery tokens suppressed — all logged to the immutable audit trail."&lt;/p&gt;

&lt;p&gt;That's what you show a risk committee before going live.&lt;/p&gt;
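&lt;p&gt;A report like that can be derived purely from the audit trail. The sketch below shows one way to do it, assuming each suppressed action was logged with actor "shadow_mode" and a metadata.suppressed_action dict. The shadow_report function, the action["type"] key, and the GENERATE_RECOVERY_TOKEN name are hypothetical, and the real endpoint may return a different shape.&lt;/p&gt;

```python
from collections import Counter


def shadow_report(events):
    """Summarise what Shadow Mode suppressed, from raw audit entries."""
    incidents = {e["incident_id"] for e in events}
    suppressed = Counter(
        e["metadata"]["suppressed_action"]["type"]
        for e in events
        if e.get("actor") == "shadow_mode"
    )
    return {
        "incidents_observed": len(incidents),
        "suppressed": dict(suppressed),
    }


def entry(incident_id, action_type):
    """Build one shadow audit entry for the demo below."""
    return {
        "incident_id": incident_id,
        "actor": "shadow_mode",
        "metadata": {"shadow_mode": True, "suppressed_action": {"type": action_type}},
    }


events = [
    entry("inc-1", "SEND_NOTIFICATION"),
    entry("inc-1", "GENERATE_RECOVERY_TOKEN"),
    entry("inc-2", "SEND_NOTIFICATION"),
    {"incident_id": "inc-2", "actor": "policy", "metadata": {}},  # not a shadow entry
]
report = shadow_report(events)
```

&lt;p&gt;Because the report is computed from the same entries an auditor would read, the summary and the underlying evidence cannot drift apart.&lt;/p&gt;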

&lt;p&gt;&lt;strong&gt;What changed strategically&lt;/strong&gt;&lt;br&gt;
Before Shadow Mode the sales conversation was:&lt;/p&gt;

&lt;p&gt;"Install AlertEngine and trust it."&lt;/p&gt;

&lt;p&gt;After Shadow Mode it became:&lt;/p&gt;

&lt;p&gt;"Run AlertEngine in observation mode. Here's the governance report of everything it would have done. Now decide."&lt;/p&gt;

&lt;p&gt;That's a completely different risk profile for a regulated buyer.&lt;br&gt;
Shadow Mode shipped on Thursday. It wasn't on the roadmap on Tuesday.&lt;br&gt;
Sometimes the best features come from asking "what's the real objection?" rather than "what's the next feature?"&lt;/p&gt;

</description>
      <category>api</category>
      <category>devops</category>
      <category>python</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Spent Years Balancing Ledgers. Now I Balance Redis Connections.</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Wed, 03 Jun 2026 09:57:54 +0000</pubDate>
      <link>https://dev.to/tandemmedia/i-spent-years-balancing-ledgers-now-i-balance-redis-connections-pb</link>
      <guid>https://dev.to/tandemmedia/i-spent-years-balancing-ledgers-now-i-balance-redis-connections-pb</guid>
      <description>&lt;p&gt;I spent my career in accounting and finance before building infrastructure in Zimbabwe.&lt;br&gt;
In accounting, every transaction has three properties:&lt;br&gt;
Authorization — no entry without approval&lt;br&gt;
Immutability — once recorded, never altered&lt;br&gt;
Reconciliation — every debit has a corresponding credit, provable by audit&lt;br&gt;
When I started building FastAPI AlertEngine, I applied the same discipline to production incidents. The result is not a monitoring tool. It's an operational governance system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Tools Are for Forensics. Governance Tools Are for Control.
&lt;/h2&gt;

&lt;p&gt;Monitoring tools tell you what broke after it broke. Datadog, Grafana, and Sentry produce beautiful post-mortems.&lt;br&gt;
Governance tools enforce that nothing executes without authorization, and they prove it afterward.&lt;br&gt;
Most teams conflate the two. They buy monitoring, assume governance, and get surprised when auditors ask: "Who approved that deploy?"&lt;br&gt;
AlertEngine separates them explicitly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Detection     →  Policy  (deterministic, no AI)
Diagnosis     →  AI      (explains, recommends, does not decide)
Authorization →  Human   (engineer taps approve)
Execution     →  Webhook (your infrastructure, your control)
Audit         →  Ledger  (immutable, replayable, actor-attributed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is not a feature list. It's an architectural hierarchy enforced by code.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Zimbabwe Constraint&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Engineers in Zimbabwe aren't always at laptops when things break. WhatsApp is ubiquitous and can be the operational control plane.&lt;br&gt;
That constraint produces something better than a dashboard: alerts that find you, with a single tap to authorise recovery. No SSH. No runbooks. No "log into Grafana and interpret the graph."&lt;br&gt;
Just: "Something broke. Here's why. Tap approve. Nothing runs without you."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Ledger Philosophy&lt;/strong&gt;&lt;br&gt;
In finance, a ledger has two sides: what happened, and who authorized it.&lt;br&gt;
AlertEngine's audit trail has the same structure:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "timestamp": 1717344000,
  "incident_id": "inc-abc123-1685000000",
  "stage": "AUTHORIZED",
  "actor": "engineer",
  "decision": "approve",
  "reason": "Database connection pool exhausted, restart recommended",
  "confidence": 0.87,
  "policy_version": "1.0.0",
  "tenant_id": "tenant-xyz789"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every entry is append-only. Every entry has an actor. Every entry is replayable.&lt;br&gt;
This is not logging. Logging tells you what the system did. A ledger tells you who authorized it and why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy Is the Floor. AI Is the Ceiling.&lt;/strong&gt;&lt;br&gt;
The most important architectural decision in AlertEngine is this:&lt;br&gt;
Claude cannot trigger a state transition.&lt;br&gt;
Policy decides whether an incident exists. Policy decides when a system has recovered. Claude diagnoses and explains, but the state machine doesn't listen to Claude. It listens to incident_policy.py.&lt;br&gt;
When health metrics recover, the pipeline doesn't ask Claude what to do. It calls should_recover(score, err) and if the threshold is met, it transitions to RECOVERED with actor="policy". Claude's recommendation is irrelevant.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;This means:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A confident wrong AI diagnosis cannot cause an incident to escalate&lt;/li&gt;
&lt;li&gt;A policy recovery override is logged as actor: "policy", so auditors can see exactly when and why&lt;/li&gt;
&lt;li&gt;Changing thresholds is a one-line edit in one file, versioned, and logged in every subsequent audit entry&lt;/li&gt;
&lt;li&gt;The audit trail never lies about who made the decision&lt;/li&gt;
&lt;/ul&gt;
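&lt;p&gt;The policy-over-AI hierarchy can be sketched in a few lines. This is an illustration, not the contents of incident_policy.py or pipeline.py: the threshold names and values, the next_stage helper, and the assumption that a higher score means a healthier service are all mine. Only should_recover(score, err), the POLICY dict, and actor="policy" come from the real code.&lt;/p&gt;

```python
POLICY = {
    "version": "1.0.0",
    "recover_min_score": 80,         # hypothetical threshold
    "recover_max_error_rate": 0.01,  # hypothetical threshold
}


def should_recover(score, err):
    """Deterministic gate: healthy score and low error rate."""
    return score >= POLICY["recover_min_score"] and POLICY["recover_max_error_rate"] >= err


def next_stage(stage, score, err, ai_recommendation, ledger):
    """Return the next stage; only policy can move it to RECOVERED.

    ai_recommendation is accepted but deliberately never consulted.
    """
    if should_recover(score, err):
        ledger.append({
            "stage": "RECOVERED",
            "actor": "policy",
            "decision": "recover",
            "policy_version": POLICY["version"],
        })
        return "RECOVERED"
    return stage  # the AI's opinion alone changes nothing


ledger = []
# Metrics still unhealthy: the AI says "recovered", policy disagrees.
still_open = next_stage("VALIDATED", 40, 0.20, "mark as recovered", ledger)
# Metrics healthy: policy recovers regardless of what the AI suggests.
recovered = next_stage("VALIDATED", 95, 0.0, "restart the database", ledger)
```

&lt;p&gt;The first call leaves the incident open and writes nothing. The second transitions to RECOVERED and records actor "policy" with the policy version, which is the entry an auditor would later replay.&lt;/p&gt;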

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters Now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Three forces are converging:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Regulators are tightening.&lt;/strong&gt; SOC 2, PCI DSS, HIPAA, GDPR — all require documented authorisation for production changes. "The AI did it" is not a compliant answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI is getting faster.&lt;/strong&gt; Claude can diagnose an incident in 3 seconds. Without governance, the temptation is to let it act autonomously. That's how a confident wrong diagnosis ends up restarting your database at peak traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers are burning out.&lt;/strong&gt; 3 AM alerts with no context, no authorisation trail, and no proof of what happened. The answer isn't better dashboards. It's better workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AlertEngine addresses all three: policy gates prevent AI from acting alone, human authorisation prevents burnout, and the audit trail prevents regulatory surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Honest Part&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I'm also building a payment orchestration platform for the African "hustler" context. Getting infrastructure funding in Zimbabwe is genuinely hard.&lt;br&gt;
So I packaged the operational governance layer as a standalone product. It solves a real problem — I needed it myself at 2am. It also funds the bigger build.&lt;br&gt;
That felt worth being honest about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The Code&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The orchestrator is source-available. Every claim in this post is verifiable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orchestrator/pipeline.py: policy hierarchy, actor="policy" on recovery override&lt;/li&gt;
&lt;li&gt;orchestrator/incident_policy.py: single POLICY dict, versioned, env-configurable&lt;/li&gt;
&lt;li&gt;orchestrator/audit.py: append-only Redis LIST, full actor attribution, replayable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read the code. Audit the architecture. Then decide if your infrastructure deserves the same discipline as your accounting.&lt;br&gt;
GitHub: github.com/Tandem-Media/fastapi-alertengine&lt;br&gt;
Install:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install fastapi-alertengine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Managed orchestrator: &lt;a href="mailto:anchorflowalertengine@outlook.com"&gt;anchorflowalertengine@outlook.com&lt;/a&gt;&lt;br&gt;
Built in Harare, Zimbabwe. 🇿🇼&lt;/p&gt;

</description>
      <category>devops</category>
      <category>fintech</category>
      <category>fastapi</category>
      <category>observability</category>
    </item>
    <item>
      <title>From Eclipses to P95 Latency: What the Joseon Dynasty Can Teach Us About Incident Response</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Sun, 31 May 2026 16:22:51 +0000</pubDate>
      <link>https://dev.to/tandemmedia/from-eclipses-to-p95-latency-what-the-joseon-dynasty-can-teach-us-about-incident-response-49db</link>
      <guid>https://dev.to/tandemmedia/from-eclipses-to-p95-latency-what-the-joseon-dynasty-can-teach-us-about-incident-response-49db</guid>
      <description>&lt;p&gt;The Joseon Dynasty ruled Korea for more than five centuries, from 1392 to 1897.&lt;/p&gt;

&lt;p&gt;That is about twice as long as the United States has existed. Five hundred years of one government, one bureaucracy, one record-keeping system.&lt;/p&gt;

&lt;p&gt;And they documented everything.&lt;/p&gt;

&lt;p&gt;The 朝鮮王朝實錄 (Veritable Records of the Joseon Dynasty) is one of the most extensive continuous historical records ever produced. Royal decrees, diplomatic correspondence, criminal cases, military campaigns, natural disasters, celestial observations, agricultural conditions, and administrative decisions were meticulously recorded.&lt;/p&gt;

&lt;p&gt;Every eclipse. Every comet. Every drought. Every flood. Every tiger that wandered into a village.&lt;/p&gt;

&lt;p&gt;At first glance, it reads like a civilisation obsessed with omens. Look closer and it begins to resemble something else: an accountability system operating at a national scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mandate of Heaven as an Accountability Mechanism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Joseon inherited the concept of the Mandate of Heaven from Chinese political philosophy.&lt;/p&gt;

&lt;p&gt;The basic premise was simple: Heaven's approval of a ruler could be inferred from events in the natural world. Stable harvests, favorable weather, and orderly skies suggested good governance. Floods, eclipses, unusual celestial events, and other disruptions demanded attention.&lt;/p&gt;

&lt;p&gt;Whether or not one accepted the underlying cosmology, the system functioned as a powerful accountability mechanism.&lt;/p&gt;

&lt;p&gt;When an eclipse occurred, someone had to observe it. Someone had to record it. Officials had to discuss its significance. The court had to determine whether action was required. The response had to be documented. And the entire process became part of a permanent historical record.&lt;/p&gt;

&lt;p&gt;A king could not plausibly claim ignorance of a reported eclipse.&lt;br&gt;
An official could not quietly invent a justification for a policy years later. The record existed. The deliberation existed. The decision existed.&lt;/p&gt;

&lt;p&gt;Accountability was structural.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://ajin.im/is/building/omen.ops/" rel="noopener noreferrer"&gt;https://ajin.im/is/building/omen.ops/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Dynasty Rendered as Telemetry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recently, Ajin built omen.ops, a project that renders the Veritable Records as a modern observability dashboard.&lt;/p&gt;

&lt;p&gt;It is one of the most serious pieces of digital scholarship I have encountered. Every entry is sourced, annotated, and presented with the gravity the original record-keepers intended. Rather than merely digitising the records, the project reinterprets them through the lens of modern operations, observability, and incident management.&lt;/p&gt;

&lt;p&gt;Suddenly, centuries-old historical events look remarkably familiar. Eclipses appear as system alerts. Comets register as anomaly spikes. Droughts become degradation events. The Mandate of Heaven itself is represented as a system health score.&lt;/p&gt;

&lt;p&gt;The effect is both humorous and strangely illuminating.&lt;br&gt;
A guest star observed over thirteen consecutive night-watches in 1592 appears as a P1 incident. The court astronomers of the Gwansanggam—the royal bureau responsible for celestial observation—tracked its position relative to known stars, recorded its persistence, and noted the absence of any established remediation procedure.&lt;/p&gt;

&lt;p&gt;In modern operations language, the alert was acknowledged, classified, documented, and ultimately deemed unactionable.&lt;/p&gt;

&lt;p&gt;Every engineer who has ever been paged at 3 AM has encountered the same category of problem.&lt;/p&gt;

&lt;p&gt;The dashboard also presents a derived metric called the Mandate Volatility Index. It compresses centuries of recorded anomalies into a single score relative to a reign's baseline conditions.&lt;/p&gt;

&lt;p&gt;The historical court and the modern SRE team face the same challenge: overwhelming amounts of signal with limited human attention. Different centuries. Different tools. Same problem.&lt;br&gt;
They needed summaries. We use dashboards. They had volatility indexes. We have P95 latency graphs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tiger Incident&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most memorable entry may not involve the heavens at all.&lt;/p&gt;

&lt;p&gt;In 1571, a white-browed tiger reportedly killed hundreds of people and livestock near the capital. The response escalated rapidly. The court mobilised a specialised tiger-catching commander and launched a coordinated hunt.&lt;/p&gt;

&lt;p&gt;Then a secondary problem emerged. The soldiers sent to eliminate the tiger began looting civilians. The court was suddenly managing two incidents instead of one. The tiger threat was eventually mitigated. Multiple animals were killed. The military operation was scaled back. Reports of civilian misconduct were documented.&lt;/p&gt;

&lt;p&gt;Incident opened. Response initiated. Unexpected side effects detected. Mitigation adjusted. Incident closed. The terminology changes. The workflow does not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Governance Problem Hasn't Changed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The tools have changed completely. The governance problem has not. What fascinates me about the Joseon system is not the astronomy. It is the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Observe - Record - Deliberate - Authorize - Act - Document&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That sequence appears repeatedly throughout the Veritable Records. It is also the sequence behind every mature operational system. Modern monitoring platforms are excellent at collecting signals. They can detect latency spikes, memory pressure, queue backlogs, failed deployments, and infrastructure degradation in seconds.&lt;/p&gt;

&lt;p&gt;What many systems still struggle with is everything that happens after detection. Who saw the alert? Who approved the response?&lt;br&gt;
What evidence was considered? Why was a particular action taken?&lt;br&gt;
Can someone reconstruct the decision six months later?&lt;br&gt;
Detection is only the beginning. Accountability begins when decisions become traceable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building for Active Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This idea sits at the centre of what I am building with FastAPI AlertEngine.&lt;/p&gt;

&lt;p&gt;FastAPI AlertEngine is incident intelligence for FastAPI services. A free SDK adds health monitoring to your application with two lines of code. When degradation is detected, a managed orchestrator investigates the likely cause and sends you a WhatsApp or Telegram notification containing a single-use approval link. Nothing executes without your authorisation.&lt;/p&gt;

&lt;p&gt;The goal is not simply to collect more alerts. Most organisations already have more alerts than they can reasonably process. The goal is to preserve the decision chain.&lt;/p&gt;

&lt;p&gt;Observe the signal. Capture the evidence. Present the context.&lt;br&gt;
Recommend an action. Require authorisation. Execute the response.&lt;br&gt;
Record everything.&lt;/p&gt;

&lt;p&gt;The Joseon court performed this process with a brush, ink, astronomers, and royal historians. We perform it with telemetry pipelines, machine reasoning, and APIs. The difference is speed. The court might take days to process an eclipse.&lt;br&gt;
We can detect a P95 latency spike, identify likely causes, generate remediation options, and request approval in seconds.&lt;/p&gt;

&lt;p&gt;But the underlying governance problem remains unchanged.&lt;br&gt;
How do you ensure that the people responsible for a system are confronted with evidence, required to make a decision, and unable to rewrite history afterward?&lt;/p&gt;

&lt;p&gt;Five hundred years ago, the Joseon Dynasty answered that question with brush and ink. We are still figuring it out with Redis and JWT tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explore the historical dashboard:&lt;/em&gt;&lt;/strong&gt; &lt;a href="https://ajin.im/is/building/omen.ops/" rel="noopener noreferrer"&gt;https://ajin.im/is/building/omen.ops/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you run FastAPI in production and want incident response that asks before it acts, the free SDK is available through FastAPI AlertEngine.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How I Taught My Incident Alerts to Say "This Broke 3 Minutes After Your Last Deploy"</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Sat, 30 May 2026 19:50:31 +0000</pubDate>
      <link>https://dev.to/tandemmedia/how-i-taught-my-incident-alerts-to-say-this-broke-3-minutes-after-your-last-deploy-641</link>
      <guid>https://dev.to/tandemmedia/how-i-taught-my-incident-alerts-to-say-this-broke-3-minutes-after-your-last-deploy-641</guid>
      <description>&lt;h2&gt;
  
  
  You're staring at a P95 latency spike.
&lt;/h2&gt;

&lt;p&gt;The alert says: "Database pool exhausted. P95: 2847ms."&lt;br&gt;
You know what broke. You don't know why.&lt;br&gt;
So you open your git log, check when the spike started, scroll through commits, and try to figure out what changed in the 10 minutes before everything went sideways.&lt;br&gt;
That archaeology takes 20 minutes on a good day. At 2am it takes longer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with Context-Free Alerts&lt;/strong&gt;&lt;br&gt;
Most incident alerts are great at telling you the “what”. None of them tell you the “when” in relation to your codebase.&lt;br&gt;
The question every engineer asks during an incident isn't "what is the P95?" — they already know that. It's "Did we just deploy something?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Insight: Incidents Have a Deployment Shadow&lt;/strong&gt;&lt;br&gt;
The way I see it, the majority of production incidents fall into one of two categories:&lt;br&gt;
• Infrastructure events — upstream dependency failure, Redis outage, traffic spike&lt;br&gt;
• Deployment shadows — something changed in the last deploy that didn't show up in testing&lt;/p&gt;

&lt;p&gt;For category 2, the fastest path to resolution is knowing exactly what changed and when — down to the commit level.&lt;br&gt;
If your alert says:&lt;br&gt;
 Database pool exhausted (P95: 2847ms)&lt;br&gt;
Recent deployments before incident:&lt;br&gt;
  3m ago — a1b2c3d: "Fix checkout query isolation level" (John, +12/-3)&lt;br&gt;
    1 recent commit touched database/query files&lt;br&gt;
You've just saved 20 minutes of log archaeology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Build It&lt;/strong&gt;&lt;br&gt;
The implementation is simpler than it sounds. Three components:&lt;br&gt;
• A commit store — Redis sorted set, scored by timestamp&lt;br&gt;
• A GitHub webhook — receives push events, stores commits&lt;br&gt;
• An incident correlator — maps incident start time to nearby commits&lt;/p&gt;

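&lt;p&gt;The Commit Store&lt;br&gt;
import json&lt;br&gt;
def store_commit(tenant_id, sha, message, author, timestamp, files_changed):&lt;br&gt;
    key = f"orchestrator:commits:{tenant_id}"&lt;br&gt;
    # Member format is an assumption: one JSON string per commit&lt;br&gt;
    entry = json.dumps({"sha": sha, "message": message, "author": author, "files_changed": files_changed})&lt;br&gt;
    redis.zadd(key, {entry: timestamp})  # redis: an existing client; score = commit timestamp&lt;br&gt;
    redis.expire(key, 86400 * 7)  # 7 day TTL&lt;br&gt;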
A Redis sorted set gives you O(log N) insertion and O(log N + K) range queries — perfect for "give me commits in the 10 minutes before this timestamp."&lt;/p&gt;

&lt;p&gt;The GitHub Webhook&lt;br&gt;
@app.post("/commits/webhook")&lt;br&gt;
async def github_webhook(request: Request):&lt;br&gt;
    body = await request.json()&lt;br&gt;
    for commit in body.get("commits", []):&lt;br&gt;
        store_commit(...)&lt;/p&gt;
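
&lt;p&gt;The third component, the incident correlator, isn't shown above. Here is a minimal sketch of the idea, assuming commits are kept as (timestamp, commit) pairs sorted by timestamp. An in-memory list with bisect stands in for the Redis sorted set, where the same lookup would be a ZRANGEBYSCORE call. The function names, the 10-minute default window and the commit dict shape are illustrative assumptions, not AlertEngine's actual API.&lt;/p&gt;

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory stand-in for the Redis sorted set:
# a list of (unix_timestamp, commit_dict) pairs kept sorted by timestamp.
COMMITS = [
    (1000.0, {"sha": "0f9e8d7", "message": "Update README"}),
    (1700.0, {"sha": "a1b2c3d", "message": "Fix checkout query isolation level"}),
    (2500.0, {"sha": "b2c3d4e", "message": "Bump dependency versions"}),
]


def commits_before(incident_ts, window_seconds=600, commits=COMMITS):
    """Return commits whose timestamp falls in [incident_ts - window, incident_ts]."""
    timestamps = [ts for ts, _ in commits]
    lo = bisect_left(timestamps, incident_ts - window_seconds)
    hi = bisect_right(timestamps, incident_ts)
    return commits[lo:hi]


def describe(incident_ts, ts, commit):
    """Render one commit as a relative-time line for the alert body."""
    minutes = int((incident_ts - ts) // 60)
    return f'{minutes}m ago - {commit["sha"]}: "{commit["message"]}"'


incident_ts = 1880.0  # incident started 3 minutes after commit a1b2c3d
lines = [describe(incident_ts, ts, c) for ts, c in commits_before(incident_ts)]
print("\n".join(lines))
```

&lt;p&gt;Only the commit inside the window is returned, so the alert stays short even when the store holds a week of history.&lt;/p&gt;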

&lt;p&gt;&lt;strong&gt;Injecting Context into AI Diagnosis&lt;/strong&gt;&lt;br&gt;
Without commit context, Claude sees raw metrics. With commit context, Claude sees the metrics AND what changed 3 minutes before the incident — shifting the diagnosis from "likely database connection issue" to "checkout query isolation level change likely caused connection pool exhaustion."&lt;br&gt;
That's a different quality of diagnosis entirely.&lt;/p&gt;
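
&lt;p&gt;One way the enrichment step might look: a minimal sketch that appends recent commits to the metrics before they are sent to the model. The function name, the prompt wording and the dict keys are assumptions for illustration; the actual prompt AlertEngine sends isn't shown in this post.&lt;/p&gt;

```python
def build_diagnosis_prompt(metrics, recent_commits):
    """Combine raw metrics with recent commit context into one prompt string."""
    parts = ["Incident metrics:"]
    for name, value in metrics.items():
        parts.append(f"- {name}: {value}")

    if recent_commits:
        parts.append("Commits deployed shortly before the incident:")
        for commit in recent_commits:
            parts.append(
                f'- {commit["minutes_ago"]}m ago, {commit["sha"]}: {commit["message"]}'
            )
    else:
        parts.append("No commits were deployed shortly before the incident.")

    parts.append("What is the most likely cause?")
    return "\n".join(parts)


prompt = build_diagnosis_prompt(
    {"p95_ms": 2847, "alert": "Database pool exhausted"},
    [{"minutes_ago": 3, "sha": "a1b2c3d", "message": "Fix checkout query isolation level"}],
)
print(prompt)
```

&lt;p&gt;Stating explicitly when there were no recent commits matters too: it nudges the diagnosis toward the infrastructure category instead of leaving the model to guess.&lt;/p&gt;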

&lt;p&gt;What the WhatsApp Message Looks Like&lt;br&gt;
⚠️ Action Recommended&lt;br&gt;
Service: Payment API&lt;br&gt;
Issue: Database pool exhausted — P95 2.8s&lt;br&gt;
Likely cause: Checkout query isolation level change&lt;br&gt;
(commit a1b2c3d, 3m ago)&lt;br&gt;
Confidence: 87%&lt;br&gt;
👉 Approve fix: [link]&lt;br&gt;
Nothing will run without your approval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three Setup Options&lt;/strong&gt;&lt;br&gt;
• GitHub webhook (recommended) — POST /commits/webhook with header X-AlertEngine-Tenant-ID&lt;br&gt;
• Manual push from CI — curl from your GitHub Actions workflow&lt;br&gt;
• GitHub API polling — set GITHUB_TOKEN and GITHUB_REPO, AlertEngine fetches automatically&lt;/p&gt;
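
&lt;p&gt;For the manual-push option, here is a rough Python equivalent of the curl call, built but not sent. Only the path and the header name come from the list above. The host, the tenant ID and the payload shape (a "commits" list, mirroring what the webhook handler reads) are placeholders and assumptions.&lt;/p&gt;

```python
import json
import urllib.request


def build_commit_push(base_url, tenant_id, commits):
    """Build (but do not send) a POST to the commit webhook."""
    body = json.dumps({"commits": commits}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/commits/webhook",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-AlertEngine-Tenant-ID": tenant_id,
        },
    )


request = build_commit_push(
    "https://alertengine.example.com",  # placeholder host
    "tenant-123",                       # placeholder tenant ID
    [{"id": "a1b2c3d", "message": "Fix checkout query isolation level"}],
)
# In CI you would send it with: urllib.request.urlopen(request, timeout=10)
print(request.get_method(), request.full_url)
```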

&lt;p&gt;&lt;strong&gt;The Broader Pattern&lt;/strong&gt;&lt;br&gt;
This feature is an instance of a broader pattern: enrich your incident context with everything that changed recently, not just the metrics at the moment of failure.&lt;br&gt;
Future extensions of the same idea:&lt;br&gt;
• Feature flag changes in the 10 minutes before an incident&lt;br&gt;
• Infrastructure changes (Terraform applies, Docker image updates)&lt;br&gt;
• Database migration executions&lt;br&gt;
• Config changes&lt;/p&gt;

&lt;p&gt;The alert that says, "Here's what broke, here's what changed right before it broke, here's the fix"—that's the alert worth building for.&lt;/p&gt;

&lt;p&gt;─────────────────────────────────────────&lt;br&gt;
This is now live in FastAPI AlertEngine as commit_context.py.&lt;br&gt;
GitHub: github.com/Tandem-Media/fastapi-alertengine&lt;br&gt;
Docs: tandem-media.github.io/fastapi-alertengine/&lt;br&gt;
pip install fastapi-alertengine&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>Why P95 Latency Is the Only Metric That Matters at 3 AM</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Thu, 21 May 2026 18:58:30 +0000</pubDate>
      <link>https://dev.to/tandemmedia/why-p95-latency-is-the-only-metric-that-matters-at-3-am-2b2c</link>
      <guid>https://dev.to/tandemmedia/why-p95-latency-is-the-only-metric-that-matters-at-3-am-2b2c</guid>
      <description>&lt;p&gt;If your checkout endpoint serves 10,000 requests per minute and 5% of them are slow, that is 500 bad experiences every minute.&lt;/p&gt;

&lt;p&gt;Averages compress that pain into a single comfortable number.&lt;br&gt;
P95 latency — the latency at the 95th percentile — tells you what your slowest users are actually experiencing.&lt;/p&gt;

&lt;p&gt;It's the metric that catches the spike the average hides.&lt;br&gt;
This is why I track P95 as the primary health signal, not averages.&lt;/p&gt;
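
&lt;p&gt;A small worked example makes the gap concrete. The latencies are made up for illustration: 94 fast requests and 6 slow ones out of 100.&lt;/p&gt;

```python
import statistics

# Illustrative latencies in milliseconds: 94 fast requests and 6 slow ones.
latencies_ms = [120] * 94 + [2000] * 6

mean_ms = statistics.mean(latencies_ms)

# quantiles(n=100) returns the 1st..99th percentile cut points;
# index 94 is the 95th percentile.
p95_ms = statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]

print(f"mean = {mean_ms:.1f} ms, p95 = {p95_ms:.1f} ms")
```

&lt;p&gt;The mean comes out at 232.8 ms, which looks tolerable, while P95 is 2000 ms. Same traffic, very different story.&lt;/p&gt;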

&lt;p&gt;&lt;strong&gt;How Latency Spikes Actually Propagate&lt;/strong&gt;&lt;br&gt;
A latency spike rarely starts in your application. It usually starts somewhere else and cascades inward.&lt;/p&gt;

&lt;p&gt;The typical pattern looks like this:&lt;/p&gt;

&lt;p&gt;Slow upstream dependency&lt;br&gt;
        ↓&lt;br&gt;
Connection pool saturation&lt;br&gt;
        ↓&lt;br&gt;
Request queue growth&lt;br&gt;
        ↓&lt;br&gt;
Latency spike propagation&lt;br&gt;
        ↓&lt;br&gt;
Timeouts and failures&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cascade Pattern&lt;/strong&gt;&lt;br&gt;
1. An upstream dependency (database, payment gateway, third-party API) slows down.&lt;br&gt;
2. Your FastAPI app keeps accepting requests while waiting for responses.&lt;br&gt;
3. Your connection pool fills up, and new requests queue behind existing ones.&lt;br&gt;
4. Queue depth grows and memory pressure builds.&lt;br&gt;
5. Response times climb across all endpoints, not just the affected one. Eventually requests start timing out or failing entirely.&lt;/p&gt;

&lt;p&gt;By stage 3, you have a problem. By stage 5, your customers know about it before you do.&lt;br&gt;
The cascade failure pattern is particularly nasty. A slow database query holds a connection.&lt;/p&gt;

&lt;p&gt;That held connection blocks another request. That blocked request ties up execution capacity. Multiply that by concurrent users and you get full service degradation from a single slow dependency.&lt;/p&gt;

&lt;p&gt;Under async workloads, the failure mode becomes especially deceptive because the application continues accepting requests while pending awaits on the upstream accumulate in the background.&lt;/p&gt;
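
&lt;p&gt;The effect is easy to reproduce in miniature. In this sketch an asyncio.Semaphore stands in for a connection pool and asyncio.sleep stands in for the slow upstream; the pool size, the delay and the function names are all illustrative.&lt;/p&gt;

```python
import asyncio
import time

POOL_SIZE = 2          # illustrative connection pool size
UPSTREAM_DELAY = 0.05  # seconds each "query" holds a connection


async def handle_request(pool, started):
    """Accept immediately, then wait for a pooled connection before the slow call."""
    async with pool:                      # queues here once the pool is full
        await asyncio.sleep(UPSTREAM_DELAY)
    return time.perf_counter() - started


async def main():
    pool = asyncio.Semaphore(POOL_SIZE)
    started = time.perf_counter()
    # All six requests are accepted at once; only two can hold a connection.
    return await asyncio.gather(*(handle_request(pool, started) for _ in range(6)))


latencies = asyncio.run(main())
print([f"{latency * 1000:.0f} ms" for latency in latencies])
```

&lt;p&gt;Every request is accepted instantly, yet the later ones take roughly two and three times as long as the first pair, purely from waiting for a connection.&lt;/p&gt;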

&lt;p&gt;&lt;strong&gt;High Traffic Spikes Make This Worse&lt;/strong&gt;&lt;br&gt;
Under normal load, a slow upstream dependency is annoying.&lt;br&gt;
Under a traffic spike, it's catastrophic.&lt;/p&gt;

&lt;p&gt;Here's why:&lt;/p&gt;

&lt;p&gt;Connection pool saturation happens faster. If you have 20 database connections and traffic doubles, you hit the ceiling twice as fast.&lt;br&gt;
Queue depth explodes. Requests piling up behind a slow dependency compound each other's wait time.&lt;br&gt;
Memory pressure builds. Each queued request holds state. Enough of them and you drift toward OOM territory.&lt;br&gt;
Recovery is non-linear. Once a connection pool is saturated, it often stays saturated even after the upstream issue resolves — because the backlog keeps it full.&lt;/p&gt;
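
&lt;p&gt;The pool ceiling can be estimated with Little's law: connections in use = arrival rate × time each request holds a connection. A back-of-envelope sketch follows. The 20 connections come from the example above, while the query times and the traffic figure are made-up numbers.&lt;/p&gt;

```python
POOL_SIZE = 20  # database connections, as in the example above


def max_throughput(pool_size, hold_ms):
    """Requests per second the pool can sustain (Little's law rearranged)."""
    return pool_size * 1000 / hold_ms


def connections_needed(requests_per_second, hold_ms):
    """Average connections in use at a given arrival rate."""
    return requests_per_second * hold_ms / 1000


# Healthy upstream: 100 ms per query.
print(max_throughput(POOL_SIZE, 100))    # 200.0 requests/second
# Slow upstream: 1000 ms per query.
print(max_throughput(POOL_SIZE, 1000))   # 20.0 requests/second
# 100 requests/second against the slow upstream needs 100 connections.
print(connections_needed(100, 1000))     # 100.0
```

&lt;p&gt;A tenfold slowdown upstream cuts sustainable throughput tenfold. Anything above that rate queues, which is why the backlog outlives the original slowdown.&lt;/p&gt;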

&lt;p&gt;The cruel irony is that traffic spikes happen when your service matters most.&lt;/p&gt;

&lt;p&gt;A flash sale. A viral moment. A major announcement.&lt;br&gt;
Exactly the wrong time to be debugging latency from a dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Didn't Work For Me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring sounds easy in theory. In practice, most setups failed me in one of four ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus + Grafana&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Powerful, but operationally heavy.&lt;/p&gt;

&lt;p&gt;Setting up exporters, configuring dashboards, maintaining the stack — all before writing a single alert rule.&lt;/p&gt;

&lt;p&gt;And when the alert fires at 3am, you still have to log in and interpret charts under pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Health Checks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GET /health → 200 OK tells you the service is alive.&lt;br&gt;
It doesn't tell you it's running at 8x normal latency while technically responding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Average Latency Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Averages mask the spikes that actually hurt users.&lt;/p&gt;

&lt;p&gt;In one case, a payment provider slowdown pushed P95 latency from roughly 180 ms to over 2 seconds within minutes — while average latency still looked acceptable.&lt;/p&gt;

&lt;p&gt;By the time averages reflected the issue, checkout failures had already started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert Fatigue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I added more monitors to catch more things. Which meant more alerts. Most of them were noise. When everything is urgent, nothing is. Monitoring systems usually optimise for data collection. &lt;/p&gt;

&lt;p&gt;Operators actually need decision compression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I Built Instead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted something that:&lt;br&gt;
Tracked P95, not averages&lt;br&gt;
Produced a single health score instead of 15 metrics to interpret&lt;br&gt;
Caught degradation trends early, before full failure&lt;br&gt;
Required zero config to add to an existing FastAPI app&lt;/p&gt;

&lt;p&gt;The result is a FastAPI middleware that continuously computes degradation signals directly from live request traffic.&lt;/p&gt;

&lt;p&gt;from fastapi import FastAPI&lt;br&gt;
from fastapi_alertengine import instrument&lt;/p&gt;

&lt;p&gt;app = FastAPI()&lt;br&gt;
instrument(app)&lt;/p&gt;

&lt;p&gt;The middleware exposes a structured /health/alerts endpoint:&lt;/p&gt;

&lt;p&gt;{&lt;br&gt;
  "status": "warning",&lt;br&gt;
  "health_score": {&lt;br&gt;
    "score": 61,&lt;br&gt;
    "trend": "degrading"&lt;br&gt;
  },&lt;br&gt;
  "metrics": {&lt;br&gt;
    "overall_p95_ms": 1847.3,&lt;br&gt;
    "error_rate": 0.08,&lt;br&gt;
    "anomaly_score": 0.9&lt;br&gt;
  }&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;One status. One score. One trend direction. No dashboards to configure. No agents to run. No Prometheus exporters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Human-in-the-Loop Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once I had a reliable health signal, the next question was:&lt;br&gt;
What do I do with it?&lt;/p&gt;

&lt;p&gt;I built a managed orchestration layer that polls /health/alerts every 5 seconds. When the score drops below the threshold, it:&lt;/p&gt;

&lt;p&gt;Runs Claude AI diagnosis on the metric context&lt;br&gt;
Sends a WhatsApp, Telegram, or Slack message with a plain-English summary&lt;br&gt;
Generates a single-use recovery link&lt;/p&gt;
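
&lt;p&gt;The threshold check at the heart of that loop is small. A minimal sketch, using the sample /health/alerts response from earlier in this post; the threshold of 70 and the function names are assumptions, and a real poller would fetch the JSON over HTTP every 5 seconds instead of parsing a literal.&lt;/p&gt;

```python
import json

SCORE_THRESHOLD = 70  # assumed value; the post doesn't state the real threshold


def should_escalate(payload, threshold=SCORE_THRESHOLD):
    """Return True when the health score has dropped below the threshold."""
    return threshold > payload["health_score"]["score"]


def summarise(payload):
    """Build the one-line context passed on to diagnosis and alerting."""
    metrics = payload["metrics"]
    return (
        f'status={payload["status"]} '
        f'score={payload["health_score"]["score"]} '
        f'trend={payload["health_score"]["trend"]} '
        f'p95_ms={metrics["overall_p95_ms"]}'
    )


# Sample response copied from the /health/alerts example earlier in the post.
sample = json.loads("""
{
  "status": "warning",
  "health_score": {"score": 61, "trend": "degrading"},
  "metrics": {"overall_p95_ms": 1847.3, "error_rate": 0.08, "anomaly_score": 0.9}
}
""")

if should_escalate(sample):
    print("escalate:", summarise(sample))
```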

&lt;p&gt;Most AI incident tooling jumps straight to autonomous remediation. I intentionally didn't.&lt;/p&gt;

&lt;p&gt;Production systems deserve human authorisation before recovery actions execute. I read the diagnosis, preview the recovery action, and tap approve – all from my phone.&lt;/p&gt;

&lt;p&gt;Nothing executes automatically. Every action is logged immutably.&lt;/p&gt;
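
&lt;p&gt;A single-use link can be as simple as an unguessable token that is deleted the moment it is redeemed. This is a sketch of the idea only, not the managed service's implementation: the URL is a placeholder, the store is an in-memory dict, and a production version would also need expiry and persistent storage.&lt;/p&gt;

```python
import secrets

BASE_URL = "https://alertengine.example.com/recover"  # placeholder URL

# Hypothetical in-memory store: token mapped to the pending recovery action.
_pending = {}


def create_recovery_link(action):
    """Issue an unguessable token for one pending action and return its link."""
    token = secrets.token_urlsafe(32)
    _pending[token] = action
    return f"{BASE_URL}/{token}"


def approve(token):
    """Consume the token. Returns the action once, then None on any reuse."""
    return _pending.pop(token, None)


link = create_recovery_link("restart payment-api workers")
token = link.rsplit("/", 1)[1]
print(approve(token))  # the approved action
print(approve(token))  # None: the link has already been used
```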

&lt;p&gt;I built the mobile-first delivery because I work in Zimbabwe, where engineers aren't always at laptops when things break.&lt;/p&gt;

&lt;p&gt;WhatsApp is the operational control plane here.&lt;/p&gt;

&lt;p&gt;That constraint produced something better than I expected:&lt;/p&gt;

&lt;p&gt;Alerts that find you, rather than dashboards you have to find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Open Source Core&lt;/strong&gt;&lt;br&gt;
The telemetry middleware is free and MIT licensed.&lt;br&gt;
pip install fastapi-alertengine&lt;/p&gt;

&lt;p&gt;The managed orchestration layer (AI diagnosis, WhatsApp/Telegram alerts, and human-authorised recovery) is a commercial service.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Tandem-Media/fastapi-alertengine" rel="noopener noreferrer"&gt;https://github.com/Tandem-Media/fastapi-alertengine&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;YouTube: &lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/vKLqcVdSMO8"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Most monitoring stacks are good at detecting incidents.&lt;br&gt;
Very few are good at reducing operator uncertainty during one.&lt;br&gt;
How are you handling that gap today?&lt;/p&gt;

</description>
      <category>backend</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
