<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lenard Francis</title>
    <description>The latest articles on DEV Community by Lenard Francis (@tandemmedia).</description>
    <link>https://dev.to/tandemmedia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3944281%2F75480865-6efb-4519-b055-c468650b1a8f.jpg</url>
      <title>DEV Community: Lenard Francis</title>
      <link>https://dev.to/tandemmedia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tandemmedia"/>
    <language>en</language>
    <item>
      <title>I Spent Years Balancing Ledgers. Now I Balance Redis Connections.</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Wed, 03 Jun 2026 09:57:54 +0000</pubDate>
      <link>https://dev.to/tandemmedia/i-spent-years-balancing-ledgers-now-i-balance-redis-connections-pb</link>
      <guid>https://dev.to/tandemmedia/i-spent-years-balancing-ledgers-now-i-balance-redis-connections-pb</guid>
      <description>&lt;p&gt;I spent my career in accounting and finance before building infrastructure in Zimbabwe.&lt;br&gt;
In accounting, every transaction has three properties:&lt;br&gt;
Authorization — no entry without approval&lt;br&gt;
Immutability — once recorded, never altered&lt;br&gt;
Reconciliation — every debit has a corresponding credit, provable by audit&lt;br&gt;
When I started building FastAPI AlertEngine, I applied the same discipline to production incidents. The result is not a monitoring tool. It's an operational governance system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Tools Are for Forensics. Governance Tools Are for Control.
&lt;/h2&gt;

&lt;p&gt;Monitoring tools tell you what broke after it broke. Datadog, Grafana, Sentry — they produce beautiful post-mortems.&lt;br&gt;
Governance tools enforce that nothing executes without authorization, and they prove it afterward.&lt;br&gt;
Most teams conflate the two. They buy monitoring, assume governance, and get surprised when auditors ask: "Who approved that deploy?"&lt;br&gt;
AlertEngine separates them explicitly:&lt;br&gt;
Detection    →  Policy  (deterministic, no AI)&lt;br&gt;
Diagnosis    →  AI      (explains, recommends, does not decide)&lt;br&gt;
Authorization →  Human  (engineer taps approve)&lt;br&gt;
Execution    →  Webhook (your infrastructure, your control)&lt;br&gt;
Audit        →  Ledger  (immutable, replayable, actor-attributed)&lt;br&gt;
This is not a feature list. It's an architectural hierarchy enforced by code.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Zimbabwe Constraint&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Engineers in Zimbabwe aren't always at laptops when things break. WhatsApp is ubiquitous and can serve as the operational control plane.&lt;br&gt;
That constraint produces something better than a dashboard: alerts that find you, with a single tap to authorise recovery. No SSH. No runbooks. No "log into Grafana and interpret the graph."&lt;br&gt;
Just: "Something broke. Here's why. Tap approve. Nothing runs without you."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Ledger Philosophy&lt;/strong&gt;&lt;br&gt;
In finance, a ledger has two sides: what happened, and who authorized it.&lt;br&gt;
AlertEngine's audit trail has the same structure:&lt;br&gt;
{&lt;br&gt;
  "timestamp": 1717344000,&lt;br&gt;
  "incident_id": "inc-abc123-1685000000",&lt;br&gt;
  "stage": "AUTHORIZED",&lt;br&gt;
  "actor": "engineer",&lt;br&gt;
  "decision": "approve",&lt;br&gt;
  "reason": "Database connection pool exhausted — restart recommended",&lt;br&gt;
  "confidence": 0.87,&lt;br&gt;
  "policy_version": "1.0.0",&lt;br&gt;
  "tenant_id": "tenant-xyz789"&lt;br&gt;
}&lt;br&gt;
Every entry is append-only. Every entry has an actor. Every entry is replayable.&lt;br&gt;
This is not logging. Logging tells you what the system did. A ledger tells you who authorized it and why.&lt;br&gt;
&lt;strong&gt;Policy Is the Floor. AI Is the Ceiling.&lt;/strong&gt;&lt;br&gt;
The most important architectural decision in AlertEngine is this:&lt;br&gt;
Claude cannot trigger a state transition.&lt;br&gt;
Policy decides whether an incident exists. Policy decides when a system has recovered. Claude diagnoses and explains — but the state machine doesn't listen to Claude. It listens to incident_policy.py.&lt;br&gt;
When health metrics recover, the pipeline doesn't ask Claude what to do. It calls should_recover(score, err) and if the threshold is met, it transitions to RECOVERED with actor="policy". Claude's recommendation is irrelevant.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;This means:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A confident wrong AI diagnosis cannot cause an incident to escalate&lt;br&gt;
A policy recovery override is logged as actor: "policy" — auditors can see exactly when and why&lt;br&gt;
Changing thresholds is a one-line edit in one file, versioned, and logged in every subsequent audit entry&lt;br&gt;
The audit trail never lies about who made the decision&lt;/p&gt;
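
&lt;p&gt;The policy-first recovery path described above can be sketched roughly as follows. This is a minimal illustration, not the contents of incident_policy.py or audit.py: the threshold names and values in POLICY, the in-memory AUDIT_LOG list standing in for the Redis LIST, and the record_recovery helper are all assumptions made for the example. Only should_recover(score, err), the RECOVERED stage, and actor="policy" come from the post itself.&lt;/p&gt;

```python
import time

# Hypothetical thresholds; the real POLICY dict lives in incident_policy.py.
POLICY = {
    "version": "1.0.0",
    "recover_min_score": 80,         # assumed: health score needed to recover
    "recover_max_error_rate": 0.01,  # assumed: error-rate ceiling to recover
}

AUDIT_LOG = []  # stand-in for the append-only Redis LIST


def should_recover(score, err):
    """Deterministic recovery check: no AI input is consulted."""
    return (score >= POLICY["recover_min_score"]
            and POLICY["recover_max_error_rate"] >= err)


def record_recovery(incident_id, score, err, ai_recommendation=None):
    """Append a RECOVERED entry attributed to policy.

    ai_recommendation is accepted but deliberately never read, to show
    that the AI's view cannot influence the state transition.
    """
    if not should_recover(score, err):
        return None
    entry = {
        "timestamp": int(time.time()),
        "incident_id": incident_id,
        "stage": "RECOVERED",
        "actor": "policy",
        "policy_version": POLICY["version"],
    }
    AUDIT_LOG.append(entry)  # append only: existing entries are never edited
    return entry
```

&lt;p&gt;Calling record_recovery("inc-1", 92, 0.0, ai_recommendation="escalate") still records a RECOVERED entry with actor "policy", because the recommendation is never part of the decision.&lt;/p&gt;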

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters Now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Three forces are converging:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Regulators are tightening.&lt;/strong&gt; SOC 2, PCI DSS, HIPAA, GDPR — all require documented authorisation for production changes. "The AI did it" is not a compliant answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI is getting faster.&lt;/strong&gt; Claude can diagnose an incident in 3 seconds. Without governance, the temptation is to let it act autonomously. That's how a confident but wrong diagnosis ends up restarting your database at peak traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers are burning out.&lt;/strong&gt; 3 AM alerts arrive with no context, no authorisation trail, and no proof of what happened. The answer isn't better dashboards; it's better workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AlertEngine addresses all three: policy gates prevent AI from acting alone, human authorisation prevents burnout, and the audit trail prevents regulatory surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Honest Part&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I'm also building a payment orchestration platform for the African "hustler" context. Getting infrastructure funding in Zimbabwe is genuinely hard.&lt;br&gt;
So I packaged the operational governance layer as a standalone product. It solves a real problem — I needed it myself at 2am. It also funds the bigger build.&lt;br&gt;
That felt worth being honest about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The Code&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The orchestrator is source-available. Every claim in this post is verifiable:&lt;br&gt;
orchestrator/pipeline.py — policy hierarchy, actor="policy" on recovery override&lt;br&gt;
orchestrator/incident_policy.py — single POLICY dict, versioned, env-configurable&lt;br&gt;
orchestrator/audit.py — append-only Redis LIST, full actor attribution, replayable&lt;br&gt;
Read the code. Audit the architecture. Then decide if your infrastructure deserves the same discipline as your accounting.&lt;br&gt;
GitHub: github.com/Tandem-Media/fastapi-alertengine&lt;br&gt;
Install:&lt;br&gt;
pip install fastapi-alertengine&lt;br&gt;
Managed orchestrator: &lt;a href="mailto:anchorflowalertengine@outlook.com"&gt;anchorflowalertengine@outlook.com&lt;/a&gt;&lt;br&gt;
Built in Harare, Zimbabwe. 🇿🇼&lt;/p&gt;

</description>
      <category>devops</category>
      <category>fintech</category>
      <category>fastapi</category>
      <category>observability</category>
    </item>
    <item>
      <title>From Eclipses to P95 Latency: What the Joseon Dynasty Can Teach Us About Incident Response</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Sun, 31 May 2026 16:22:51 +0000</pubDate>
      <link>https://dev.to/tandemmedia/from-eclipses-to-p95-latency-what-the-joseon-dynasty-can-teach-us-about-incident-response-49db</link>
      <guid>https://dev.to/tandemmedia/from-eclipses-to-p95-latency-what-the-joseon-dynasty-can-teach-us-about-incident-response-49db</guid>
      <description>&lt;p&gt;The Joseon Dynasty ruled Korea for more than five centuries, from 1392 to 1897.&lt;/p&gt;

&lt;p&gt;That is roughly twice as long as the United States has existed. Five hundred years of one dynasty, one bureaucracy, one record-keeping system.&lt;/p&gt;

&lt;p&gt;And they documented everything.&lt;/p&gt;

&lt;p&gt;The 朝鮮王朝實錄 (Veritable Records of the Joseon Dynasty) is one of the most extensive continuous historical records ever produced. Royal decrees, diplomatic correspondence, criminal cases, military campaigns, natural disasters, celestial observations, agricultural conditions, and administrative decisions were meticulously recorded.&lt;/p&gt;

&lt;p&gt;Every eclipse. Every comet. Every drought. Every flood. Every tiger that wandered into a village.&lt;/p&gt;

&lt;p&gt;At first glance, it reads like a civilisation obsessed with omens. Look closer and it begins to resemble something else: an accountability system operating at a national scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mandate of Heaven as an Accountability Mechanism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Joseon inherited the concept of the Mandate of Heaven from Chinese political philosophy.&lt;/p&gt;

&lt;p&gt;The basic premise was simple: Heaven's approval of a ruler could be inferred from events in the natural world. Stable harvests, favorable weather, and orderly skies suggested good governance. Floods, eclipses, unusual celestial events, and other disruptions demanded attention.&lt;/p&gt;

&lt;p&gt;Whether or not one accepted the underlying cosmology, the system functioned as a powerful accountability mechanism.&lt;/p&gt;

&lt;p&gt;When an eclipse occurred, someone had to observe it. Someone had to record it. Officials had to discuss its significance. The court had to determine whether action was required. The response had to be documented. And the entire process became part of a permanent historical record.&lt;/p&gt;

&lt;p&gt;A king could not plausibly claim ignorance of a reported eclipse.&lt;br&gt;
An official could not quietly invent a justification for a policy years later. The record existed. The deliberation existed. The decision existed.&lt;/p&gt;

&lt;p&gt;Accountability was structural.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://ajin.im/is/building/omen.ops/" rel="noopener noreferrer"&gt;https://ajin.im/is/building/omen.ops/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Dynasty Rendered as Telemetry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recently, Ajin built omen.ops, a project that renders the Veritable Records as a modern observability dashboard.&lt;/p&gt;

&lt;p&gt;It is one of the most serious pieces of digital scholarship I have encountered. Every entry is sourced, annotated, and presented with the gravity the original record-keepers intended. Rather than merely digitising the records, the project reinterprets them through the lens of modern operations, observability, and incident management.&lt;/p&gt;

&lt;p&gt;Suddenly, centuries-old historical events look remarkably familiar. Eclipses appear as system alerts. Comets register as anomaly spikes. Droughts become degradation events. The Mandate of Heaven itself is represented as a system health score.&lt;/p&gt;

&lt;p&gt;The effect is both humorous and strangely illuminating.&lt;br&gt;
A guest star observed over thirteen consecutive night-watches in 1592 appears as a P1 incident. The court astronomers of the Gwansanggam—the royal bureau responsible for celestial observation—tracked its position relative to known stars, recorded its persistence, and noted the absence of any established remediation procedure.&lt;/p&gt;

&lt;p&gt;In modern operations language, the alert was acknowledged, classified, documented, and ultimately deemed unactionable.&lt;/p&gt;

&lt;p&gt;Every engineer who has ever been paged at 3 AM has encountered the same category of problem.&lt;/p&gt;

&lt;p&gt;The dashboard also presents a derived metric called the Mandate Volatility Index. It compresses centuries of recorded anomalies into a single score relative to a reign's baseline conditions.&lt;/p&gt;

&lt;p&gt;The historical court and the modern SRE team face the same challenge: overwhelming amounts of signal with limited human attention. Different centuries. Different tools. Same problem.&lt;br&gt;
They needed summaries. We use dashboards. They had volatility indexes. We have P95 latency graphs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tiger Incident&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most memorable entry may not involve the heavens at all.&lt;/p&gt;

&lt;p&gt;In 1571, a white-browed tiger reportedly killed hundreds of people and livestock near the capital. The response escalated rapidly. The court mobilised a specialised tiger-catching commander and launched a coordinated hunt.&lt;/p&gt;

&lt;p&gt;Then a secondary problem emerged. The soldiers sent to eliminate the tiger began looting civilians. The court was suddenly managing two incidents instead of one. The tiger threat was eventually mitigated. Multiple animals were killed. The military operation was scaled back. Reports of civilian misconduct were documented.&lt;/p&gt;

&lt;p&gt;Incident opened. Response initiated. Unexpected side effects detected. Mitigation adjusted. Incident closed. The terminology changes. The workflow does not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Governance Problem Hasn't Changed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The tools have changed completely. The governance problem has not. What fascinates me about the Joseon system is not the astronomy. It is the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Observe - Record - Deliberate - Authorize - Act - Document&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That sequence appears repeatedly throughout the Veritable Records. It is also the sequence behind every mature operational system. Modern monitoring platforms are excellent at collecting signals. They can detect latency spikes, memory pressure, queue backlogs, failed deployments, and infrastructure degradation in seconds.&lt;/p&gt;

&lt;p&gt;What many systems still struggle with is everything that happens after detection. Who saw the alert? Who approved the response?&lt;br&gt;
What evidence was considered? Why was a particular action taken?&lt;br&gt;
Can someone reconstruct the decision six months later?&lt;br&gt;
Detection is only the beginning. Accountability begins when decisions become traceable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building for Active Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This idea sits at the centre of what I am building with FastAPI AlertEngine.&lt;/p&gt;

&lt;p&gt;FastAPI AlertEngine is incident intelligence for FastAPI services. A free SDK adds health monitoring to your application with a single line of code. When degradation is detected, a managed orchestrator investigates the likely cause and sends you a WhatsApp or Telegram notification containing a single-use approval link. Nothing executes without your authorisation.&lt;/p&gt;

&lt;p&gt;The goal is not simply to collect more alerts. Most organisations already have more alerts than they can reasonably process. The goal is to preserve the decision chain.&lt;/p&gt;

&lt;p&gt;Observe the signal. Capture the evidence. Present the context.&lt;br&gt;
Recommend an action. Require authorisation. Execute the response.&lt;br&gt;
Record everything.&lt;/p&gt;

&lt;p&gt;The Joseon court performed this process with a brush, ink, astronomers, and royal historians. We perform it with telemetry pipelines, machine reasoning, and APIs. The difference is speed. The court might take days to process an eclipse.&lt;br&gt;
We can detect a P95 latency spike, identify likely causes, generate remediation options, and request approval in seconds.&lt;/p&gt;

&lt;p&gt;But the underlying governance problem remains unchanged.&lt;br&gt;
How do you ensure that the people responsible for a system are confronted with evidence, required to make a decision, and unable to rewrite history afterward?&lt;/p&gt;

&lt;p&gt;Five hundred years ago, the Joseon Dynasty answered that question with brush and ink. We are still figuring it out with Redis and JWT tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explore the historical dashboard:&lt;/em&gt;&lt;/strong&gt; &lt;a href="https://ajin.im/is/building/omen.ops/" rel="noopener noreferrer"&gt;https://ajin.im/is/building/omen.ops/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you run FastAPI in production and want incident response that asks before it acts, the free SDK is available through FastAPI AlertEngine.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How I Taught My Incident Alerts to Say "This Broke 3 Minutes After Your Last Deploy"</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Sat, 30 May 2026 19:50:31 +0000</pubDate>
      <link>https://dev.to/tandemmedia/how-i-taught-my-incident-alerts-to-say-this-broke-3-minutes-after-your-last-deploy-641</link>
      <guid>https://dev.to/tandemmedia/how-i-taught-my-incident-alerts-to-say-this-broke-3-minutes-after-your-last-deploy-641</guid>
      <description>&lt;h2&gt;
  
  
  You're staring at a P95 latency spike.
&lt;/h2&gt;

&lt;p&gt;The alert says: "Database pool exhausted. P95: 2847ms." You know what broke. You don't know why.&lt;br&gt;
So you open your git log, check when the spike started, scroll through commits, and try to figure out what changed in the 10 minutes before everything went sideways.&lt;br&gt;
That archaeology takes 20 minutes on a good day. At 2am it takes longer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with Context-Free Alerts&lt;/strong&gt;&lt;br&gt;
Most incident alerts are great at telling you the “what”. Few of them tell you the “when” in relation to your codebase.&lt;br&gt;
The question every engineer asks during an incident isn't "what is the P95?" — they already know that. It's "Did we just deploy something?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Insight: Incidents Have a Deployment Shadow&lt;/strong&gt;&lt;br&gt;
The way I see it, the majority of production incidents fall into one of two categories:&lt;br&gt;
• Infrastructure events — upstream dependency failure, Redis outage, traffic spike&lt;br&gt;
• Deployment shadows — something changed in the last deploy that didn't show up in testing&lt;/p&gt;

&lt;p&gt;For category 2, the fastest path to resolution is knowing exactly what changed and when — down to the commit level.&lt;br&gt;
If your alert says:&lt;br&gt;
 Database pool exhausted (P95: 2847ms)&lt;br&gt;
Recent deployments before incident:&lt;br&gt;
  3m ago — a1b2c3d: "Fix checkout query isolation level" (John, +12/-3)&lt;br&gt;
    1 recent commit touched database/query files&lt;br&gt;
You've just saved 20 minutes of log archaeology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Build It&lt;/strong&gt;&lt;br&gt;
The implementation is simpler than it sounds. Three components:&lt;br&gt;
• A commit store — Redis sorted set, scored by timestamp&lt;br&gt;
• A GitHub webhook — receives push events, stores commits&lt;br&gt;
• An incident correlator — maps incident start time to nearby commits&lt;/p&gt;

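&lt;p&gt;The Commit Store&lt;br&gt;
import json&lt;br&gt;
def store_commit(tenant_id, sha, message, author, timestamp, files_changed):&lt;br&gt;
    key = f"orchestrator:commits:{tenant_id}"&lt;br&gt;
    # Assumed entry format: JSON-encoded commit metadata as the sorted-set member&lt;br&gt;
    entry = json.dumps({"sha": sha, "message": message, "author": author, "files_changed": files_changed})&lt;br&gt;
    redis.zadd(key, {entry: timestamp})  # score = commit timestamp&lt;br&gt;
    redis.expire(key, 86400 * 7)  # 7 day TTL&lt;br&gt;
A Redis sorted set gives you O(log N) insertion and O(log N + K) range queries, which is perfect for "give me commits in the 10 minutes before this timestamp."&lt;/p&gt;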
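
&lt;p&gt;The third component, the incident correlator, could look something like the sketch below. It is an assumption-laden illustration rather than the code in commit_context.py: the function name commits_before, the 600-second default window, and the minutes_before field are hypothetical, and a sorted Python list stands in for the Redis sorted set so the example runs on its own. Against Redis, the same window could be fetched with a score-range query such as redis-py's zrangebyscore(key, incident_ts - 600, incident_ts).&lt;/p&gt;

```python
import bisect
import json


def commits_before(commits, incident_ts, window_seconds=600):
    """Return commits with a timestamp in [incident_ts - window, incident_ts].

    `commits` is a list of (timestamp, entry_json) pairs sorted by timestamp,
    mirroring what a sorted-set range query by score would return.
    """
    timestamps = [ts for ts, _ in commits]
    lo = bisect.bisect_left(timestamps, incident_ts - window_seconds)
    hi = bisect.bisect_right(timestamps, incident_ts)
    result = []
    for ts, raw in commits[lo:hi]:
        commit = json.loads(raw)
        # Whole minutes between the commit and the start of the incident.
        commit["minutes_before"] = int((incident_ts - ts) // 60)
        result.append(commit)
    # Most recent commit first, since it is the likeliest suspect.
    result.reverse()
    return result
```

&lt;p&gt;Each returned commit carries a minutes_before value, which is what an alert line such as "3m ago" would be rendered from.&lt;/p&gt;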

&lt;p&gt;The GitHub Webhook&lt;br&gt;
@app.post("/commits/webhook")&lt;br&gt;
async def github_webhook(request: Request):&lt;br&gt;
    body = await request.json()&lt;br&gt;
    for commit in body.get("commits", []):&lt;br&gt;
        store_commit(...)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Injecting Context into AI Diagnosis&lt;/strong&gt;&lt;br&gt;
Without commit context, Claude sees raw metrics. With commit context, Claude sees the metrics AND what changed 3 minutes before the incident — shifting the diagnosis from "likely database connection issue" to "checkout query isolation level change likely caused connection pool exhaustion."&lt;br&gt;
That's a different quality of diagnosis entirely.&lt;/p&gt;

&lt;p&gt;What the WhatsApp Message Looks Like&lt;br&gt;
⚠️ Action Recommended&lt;br&gt;
Service: Payment API&lt;br&gt;
Issue: Database pool exhausted — P95 2.8s&lt;br&gt;
Likely cause: Checkout query isolation level change&lt;br&gt;
(commit a1b2c3d, 3m ago)&lt;br&gt;
Confidence: 87%&lt;br&gt;
👉 Approve fix: [link]&lt;br&gt;
Nothing will run without your approval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three Setup Options&lt;/strong&gt;&lt;br&gt;
• GitHub webhook (recommended) — POST /commits/webhook with header X-AlertEngine-Tenant-ID&lt;br&gt;
• Manual push from CI — curl from your GitHub Actions workflow&lt;br&gt;
• GitHub API polling — set GITHUB_TOKEN and GITHUB_REPO, AlertEngine fetches automatically&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Broader Pattern&lt;/strong&gt;&lt;br&gt;
This feature is an instance of a broader pattern: enrich your incident context with everything that changed recently, not just the metrics at the moment of failure.&lt;br&gt;
Future extensions of the same idea:&lt;br&gt;
• Feature flag changes in the 10 minutes before an incident&lt;br&gt;
• Infrastructure changes (Terraform applies, Docker image updates)&lt;br&gt;
• Database migration executions&lt;br&gt;
• Config changes&lt;/p&gt;

&lt;p&gt;The alert that says, "Here's what broke, here's what changed right before it broke, here's the fix"—that's the alert worth building for.&lt;/p&gt;

&lt;p&gt;─────────────────────────────────────────&lt;br&gt;
This is now live in FastAPI AlertEngine as commit_context.py.&lt;br&gt;
GitHub: github.com/Tandem-Media/fastapi-alertengine&lt;br&gt;
Docs: tandem-media.github.io/fastapi-alertengine/&lt;br&gt;
pip install fastapi-alertengine&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>Why P95 Latency Is the Only Metric That Matters at 3 AM</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Thu, 21 May 2026 18:58:30 +0000</pubDate>
      <link>https://dev.to/tandemmedia/why-p95-latency-is-the-only-metric-that-matters-at-3-am-2b2c</link>
      <guid>https://dev.to/tandemmedia/why-p95-latency-is-the-only-metric-that-matters-at-3-am-2b2c</guid>
      <description>&lt;p&gt;If your checkout endpoint serves 10,000 requests per minute and 5% of them are slow, 500 users are having a bad experience every minute.&lt;/p&gt;

&lt;p&gt;Averages compress that pain into a single comfortable number.&lt;br&gt;
P95 latency (the latency at the 95th percentile, which only the slowest 5% of requests exceed) tells you what your slowest users are actually experiencing.&lt;/p&gt;

&lt;p&gt;It's the metric that catches the spikes that averages hide.&lt;br&gt;
This is why I track P95 as the primary health signal, not averages.&lt;/p&gt;
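
&lt;p&gt;A small worked example makes the gap between the two numbers concrete. The sample data is invented for illustration, and the nearest-rank method used here is only one common way to define a percentile; it is not necessarily how fastapi-alertengine computes P95.&lt;/p&gt;

```python
import math
import statistics


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(len(ordered) * 95 / 100)  # 1-based rank
    return ordered[rank - 1]


# Invented sample: 90 fast requests at 120 ms, 10 slow ones at 2000 ms.
samples = [120] * 90 + [2000] * 10

mean_ms = statistics.mean(samples)
p95_ms = p95(samples)
print(f"mean={mean_ms} ms, p95={p95_ms} ms")
```

&lt;p&gt;For this sample the mean is 308 ms while the P95 is 2000 ms: the average still looks tolerable, and the percentile exposes the two-second requests.&lt;/p&gt;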

&lt;p&gt;&lt;strong&gt;How Latency Spikes Actually Propagate&lt;/strong&gt;&lt;br&gt;
A latency spike rarely starts in your application. It usually starts somewhere else and cascades inward.&lt;/p&gt;

&lt;p&gt;The typical pattern looks like this:&lt;/p&gt;

&lt;p&gt;Slow upstream dependency&lt;br&gt;
        ↓&lt;br&gt;
Connection pool saturation&lt;br&gt;
        ↓&lt;br&gt;
Request queue growth&lt;br&gt;
        ↓&lt;br&gt;
Latency spike propagation&lt;br&gt;
        ↓&lt;br&gt;
Timeouts and failures&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cascade Pattern&lt;/strong&gt;&lt;br&gt;
1. An upstream dependency (database, payment gateway, third-party API) slows down.&lt;br&gt;
2. Your FastAPI app keeps accepting requests while waiting for responses.&lt;br&gt;
3. Your connection pool fills up, and new requests queue behind existing ones.&lt;br&gt;
4. Queue depth grows and memory pressure builds.&lt;br&gt;
5. Response times climb across all endpoints, not just the affected one. Eventually requests start timing out or failing entirely.&lt;/p&gt;

&lt;p&gt;By stage 3, you have a problem. By stage 5, your customers know about it before you do.&lt;br&gt;
The cascade failure pattern is particularly nasty. A slow database query holds a connection.&lt;/p&gt;

&lt;p&gt;That held connection blocks another request. That blocked request ties up execution capacity. Multiply that by concurrent users and you get full service degradation from a single slow dependency.&lt;/p&gt;

&lt;p&gt;Under async workloads, the failure mode becomes especially deceptive because the application continues accepting requests while awaits on the slow upstream accumulate in the background.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Traffic Spikes Make This Worse&lt;/strong&gt;&lt;br&gt;
Under normal load, a slow upstream dependency is annoying.&lt;br&gt;
Under a traffic spike, it's catastrophic.&lt;/p&gt;

&lt;p&gt;Here's why:&lt;/p&gt;

&lt;p&gt;Connection pool saturation happens faster. If you have 20 database connections and traffic doubles, you hit the ceiling twice as fast.&lt;br&gt;
Queue depth explodes. Requests piling up behind a slow dependency compound each other's wait time.&lt;br&gt;
Memory pressure builds. Each queued request holds state. Enough of them and you drift toward OOM territory.&lt;br&gt;
Recovery is non-linear. Once a connection pool is saturated, it often stays saturated even after the upstream issue resolves — because the backlog keeps it full.&lt;/p&gt;

&lt;p&gt;The cruel irony is that traffic spikes happen when your service matters most.&lt;/p&gt;

&lt;p&gt;A flash sale. A viral moment. A major announcement.&lt;br&gt;
Exactly the wrong time to be debugging latency from a dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Didn't Work For Me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring sounds easy in theory. In practice, most setups failed me in one of four ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus + Grafana&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Powerful, but operationally heavy.&lt;/p&gt;

&lt;p&gt;Setting up exporters, configuring dashboards, maintaining the stack — all before writing a single alert rule.&lt;/p&gt;

&lt;p&gt;And when the alert fires at 3am, you still have to log in and interpret charts under pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Health Checks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GET /health → 200 OK tells you the service is alive.&lt;br&gt;
It doesn't tell you it's running at 8x normal latency while technically responding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Average Latency Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Averages mask the spikes that actually hurt users.&lt;/p&gt;

&lt;p&gt;In one case, a payment provider slowdown pushed P95 latency from roughly 180 ms to over 2 seconds within minutes — while average latency still looked acceptable.&lt;/p&gt;

&lt;p&gt;By the time averages reflected the issue, checkout failures had already started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert Fatigue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I added more monitors to catch more things. Which meant more alerts. Most of them were noise. When everything is urgent, nothing is. Monitoring systems usually optimise for data collection. &lt;/p&gt;

&lt;p&gt;Operators actually need decision compression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I Built Instead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted something that:&lt;br&gt;
Tracked P95, not averages&lt;br&gt;
Produced a single health score instead of 15 metrics to interpret&lt;br&gt;
Caught degradation trends early, before full failure&lt;br&gt;
Required zero config to add to an existing FastAPI app&lt;/p&gt;

&lt;p&gt;The result is a FastAPI middleware that continuously computes degradation signals directly from live request traffic.&lt;/p&gt;

&lt;p&gt;from fastapi import FastAPI&lt;br&gt;
from fastapi_alertengine import instrument&lt;/p&gt;

&lt;p&gt;app = FastAPI()&lt;br&gt;
instrument(app)&lt;/p&gt;

&lt;p&gt;The middleware exposes a structured /health/alerts endpoint:&lt;/p&gt;

&lt;p&gt;{&lt;br&gt;
  "status": "warning",&lt;br&gt;
  "health_score": {&lt;br&gt;
    "score": 61,&lt;br&gt;
    "trend": "degrading"&lt;br&gt;
  },&lt;br&gt;
  "metrics": {&lt;br&gt;
    "overall_p95_ms": 1847.3,&lt;br&gt;
    "error_rate": 0.08,&lt;br&gt;
    "anomaly_score": 0.9&lt;br&gt;
  }&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;One status. One score. One trend direction. No dashboards to configure. No agents to run. No Prometheus exporters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Human-in-the-Loop Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once I had a reliable health signal, the next question was:&lt;br&gt;
What do I do with it?&lt;/p&gt;

&lt;p&gt;I built a managed orchestration layer that polls /health/alerts every 5 seconds. When the score drops below the threshold, it:&lt;/p&gt;

&lt;p&gt;Runs Claude AI diagnosis on the metric context&lt;br&gt;
Sends a WhatsApp, Telegram, or Slack message with a plain-English summary&lt;br&gt;
Generates a single-use recovery link&lt;/p&gt;

&lt;p&gt;Most AI incident tooling jumps straight to autonomous remediation. I intentionally didn't.&lt;/p&gt;

&lt;p&gt;Production systems deserve human authorisation before recovery actions execute. I read the diagnosis, preview the recovery action, and tap approve – all from my phone.&lt;/p&gt;

&lt;p&gt;Nothing executes automatically. Every action is logged immutably.&lt;/p&gt;
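
&lt;p&gt;The single-use property of the recovery link can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the orchestrator's actual mechanism: it uses random opaque tokens and an in-memory dict, and the names PENDING, issue_approval_token, and approve are hypothetical. A production version would also need expiry and persistent storage.&lt;/p&gt;

```python
import secrets

# token -> pending action; a stand-in for whatever store the orchestrator uses.
PENDING = {}


def issue_approval_token(incident_id, action):
    """Create an unguessable token tied to exactly one recovery action."""
    token = secrets.token_urlsafe(32)
    PENDING[token] = {"incident_id": incident_id, "action": action}
    return token


def approve(token):
    """Consume the token; a second use (or an unknown token) returns None."""
    return PENDING.pop(token, None)
```

&lt;p&gt;Because approve removes the token as it reads it, tapping the same link twice cannot run the recovery action a second time.&lt;/p&gt;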

&lt;p&gt;I built the mobile-first delivery because I work in Zimbabwe, where engineers aren't always at laptops when things break.&lt;/p&gt;

&lt;p&gt;WhatsApp is the operational control plane here.&lt;/p&gt;

&lt;p&gt;That constraint produced something better than I expected:&lt;/p&gt;

&lt;p&gt;Alerts that find you, rather than dashboards you have to find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Open Source Core&lt;/strong&gt;&lt;br&gt;
The telemetry middleware is free and MIT licensed.&lt;br&gt;
pip install fastapi-alertengine&lt;/p&gt;

&lt;p&gt;The managed orchestration layer (AI diagnosis, WhatsApp/Telegram alerts, and human-authorised recovery) is a commercial service.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Tandem-Media/fastapi-alertengine" rel="noopener noreferrer"&gt;https://github.com/Tandem-Media/fastapi-alertengine&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;YouTube: &lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/vKLqcVdSMO8"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Most monitoring stacks are good at detecting incidents.&lt;br&gt;
Very few are good at reducing operator uncertainty during one.&lt;br&gt;
How are you handling that gap today?&lt;/p&gt;

</description>
      <category>backend</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
