<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lenard Francis</title>
    <description>The latest articles on DEV Community by Lenard Francis (@tandemmedia).</description>
    <link>https://dev.to/tandemmedia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3944281%2Fe53139d9-000f-4c56-8edb-96c657bae1c9.png</url>
      <title>DEV Community: Lenard Francis</title>
      <link>https://dev.to/tandemmedia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tandemmedia"/>
    <language>en</language>
    <item>
      <title>Why P95 Latency Is the Only Metric That Matters at 3 AM</title>
      <dc:creator>Lenard Francis</dc:creator>
      <pubDate>Thu, 21 May 2026 18:58:30 +0000</pubDate>
      <link>https://dev.to/tandemmedia/why-p95-latency-is-the-only-metric-that-matters-at-3-am-2b2c</link>
      <guid>https://dev.to/tandemmedia/why-p95-latency-is-the-only-metric-that-matters-at-3-am-2b2c</guid>
      <description>&lt;p&gt;If your checkout endpoint serves 10,000 requests per minute and the slowest 5% of them spike, that is 500 bad experiences every minute.&lt;/p&gt;

&lt;p&gt;Averages compress that pain into a single comfortable number.&lt;br&gt;
P95 latency, the value that 95% of requests complete at or under, tells you what your slowest users are actually experiencing.&lt;/p&gt;

&lt;p&gt;It's the metric that catches the spikes averages hide.&lt;br&gt;
This is why I track P95 as the primary health signal, not averages.&lt;/p&gt;
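
&lt;p&gt;To make the gap between the two numbers concrete, here is a minimal sketch in plain Python. The sample latencies are invented for illustration, and the nearest-rank percentile used here is only one common definition; monitoring tools may interpolate differently.&lt;/p&gt;

```python
import math
import statistics


def p95(samples):
    """Nearest-rank 95th percentile: the smallest sample that 95% of values are at or below."""
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical minute of traffic: 94 fast requests and 6 slow ones (milliseconds).
latencies_ms = [120] * 94 + [2500] * 6

mean_ms = statistics.mean(latencies_ms)
p95_ms = p95(latencies_ms)

print(f"mean = {mean_ms:.1f} ms, p95 = {p95_ms} ms")
```

&lt;p&gt;With these made-up numbers the mean comes out at 262.8 ms, which looks tolerable, while P95 reports the 2,500 ms that the slow requests actually saw.&lt;/p&gt;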

&lt;p&gt;&lt;strong&gt;How Latency Spikes Actually Propagate&lt;/strong&gt;&lt;br&gt;
A latency spike rarely starts in your application. It usually starts somewhere else and cascades inward.&lt;/p&gt;

&lt;p&gt;The typical pattern looks like this:&lt;/p&gt;

&lt;p&gt;Slow upstream dependency&lt;br&gt;
        ↓&lt;br&gt;
Connection pool saturation&lt;br&gt;
        ↓&lt;br&gt;
Request queue growth&lt;br&gt;
        ↓&lt;br&gt;
Latency spike propagation&lt;br&gt;
        ↓&lt;br&gt;
Timeouts and failures&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cascade Pattern&lt;/strong&gt;&lt;br&gt;
1. An upstream dependency (database, payment gateway, third-party API) slows down.&lt;br&gt;
2. Your FastAPI app keeps accepting requests while waiting for responses.&lt;br&gt;
3. Your connection pool fills up – new requests queue behind existing ones.&lt;br&gt;
4. Queue depth grows and memory pressure builds.&lt;br&gt;
5. Response times climb across all endpoints, not just the affected one. Eventually requests start timing out or failing entirely.&lt;/p&gt;

&lt;p&gt;By stage 3, you have a problem. By stage 5, your customers know about it before you do.&lt;br&gt;
The cascade failure pattern is particularly nasty. A slow database query holds a connection.&lt;/p&gt;

&lt;p&gt;That held connection blocks another request. That blocked request ties up execution capacity. Multiply that by concurrent users and you get full service degradation from a single slow dependency.&lt;/p&gt;

&lt;p&gt;Under async workloads, the failure mode becomes especially deceptive because the application continues accepting requests while pending awaits on the slow upstream accumulate in the background.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Traffic Spikes Make This Worse&lt;/strong&gt;&lt;br&gt;
Under normal load, a slow upstream dependency is annoying. Under a traffic spike, it's catastrophic.&lt;/p&gt;

&lt;p&gt;Here's why:&lt;/p&gt;

&lt;p&gt;Connection pool saturation happens faster. If you have 20 database connections and traffic doubles, you hit the ceiling twice as fast.&lt;br&gt;
Queue depth explodes. Requests piling up behind a slow dependency compound each other's wait time.&lt;br&gt;
Memory pressure builds. Each queued request holds state. Enough of them and you drift toward OOM territory.&lt;br&gt;
Recovery is non-linear. Once a connection pool is saturated, it often stays saturated even after the upstream issue resolves — because the backlog keeps it full.&lt;/p&gt;
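
&lt;p&gt;The arithmetic behind these points can be sketched with Little's law: a pool of N connections can serve at most N divided by the time each request holds a connection. The 20 connections and 10,000 requests per minute come from the examples above; the 50 ms and 2 s hold times are assumptions chosen for illustration, not measurements.&lt;/p&gt;

```python
POOL_SIZE = 20               # connections, as in the example above
ARRIVAL_RPS = 10_000 / 60    # 10,000 requests per minute


def max_throughput_rps(pool_size, hold_seconds):
    """Upper bound on requests/second when each request holds one connection for hold_seconds."""
    return pool_size / hold_seconds


def backlog_after(seconds, arrival_rps, capacity_rps):
    """Requests queued after `seconds` when arrivals exceed capacity (never negative)."""
    return max(0.0, (arrival_rps - capacity_rps) * seconds)


healthy = max_throughput_rps(POOL_SIZE, 0.05)   # assumed: 50 ms per query
degraded = max_throughput_rps(POOL_SIZE, 2.0)   # assumed: 2 s per query

print(f"healthy capacity:  {healthy:.0f} req/s")
print(f"degraded capacity: {degraded:.0f} req/s")
print(f"backlog after 30 s degraded: {backlog_after(30, ARRIVAL_RPS, degraded):.0f} requests")
```

&lt;p&gt;Under these assumptions the pool drops from roughly 400 requests per second of headroom to 10, and about 4,700 requests pile up in 30 seconds. That backlog is what keeps the pool saturated after the upstream recovers.&lt;/p&gt;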

&lt;p&gt;The cruel irony is that traffic spikes happen when your service matters most.&lt;/p&gt;

&lt;p&gt;A flash sale. A viral moment. A major announcement.&lt;br&gt;
Exactly the wrong time to be debugging latency from a dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Didn't Work For Me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring sounds easy in theory. In practice, most setups failed me in one of four ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus + Grafana&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Powerful, but operationally heavy. You have to set up exporters, configure dashboards, and maintain the stack, all before writing a single alert rule.&lt;/p&gt;

&lt;p&gt;And when the alert fires at 3 AM, you still have to log in and interpret charts under pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Health Checks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GET /health → 200 OK tells you the service is alive.&lt;br&gt;
It doesn't tell you it's running at 8x normal latency while technically responding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Average Latency Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Averages mask the spikes that actually hurt users.&lt;/p&gt;

&lt;p&gt;In one case, a payment provider slowdown pushed P95 latency from roughly 180 ms to over 2 seconds within minutes — while average latency still looked acceptable.&lt;/p&gt;

&lt;p&gt;By the time averages reflected the issue, checkout failures had already started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert Fatigue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I added more monitors to catch more things. Which meant more alerts. Most of them were noise. When everything is urgent, nothing is. Monitoring systems usually optimise for data collection.&lt;/p&gt;

&lt;p&gt;Operators actually need decision compression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I Built Instead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted something that:&lt;br&gt;
Tracked P95, not averages&lt;br&gt;
Produced a single health score instead of 15 metrics to interpret&lt;br&gt;
Caught degradation trends early, before full failure&lt;br&gt;
Required zero config to add to an existing FastAPI app&lt;/p&gt;

&lt;p&gt;The result is a FastAPI middleware that continuously computes degradation signals directly from live request traffic.&lt;/p&gt;

&lt;p&gt;from fastapi import FastAPI&lt;br&gt;
from fastapi_alertengine import instrument&lt;/p&gt;

&lt;p&gt;app = FastAPI()&lt;br&gt;
instrument(app)&lt;/p&gt;

&lt;p&gt;The middleware exposes a structured /health/alerts endpoint:&lt;/p&gt;

&lt;p&gt;{&lt;br&gt;
  "status": "warning",&lt;br&gt;
  "health_score": {&lt;br&gt;
    "score": 61,&lt;br&gt;
    "trend": "degrading"&lt;br&gt;
  },&lt;br&gt;
  "metrics": {&lt;br&gt;
    "overall_p95_ms": 1847.3,&lt;br&gt;
    "error_rate": 0.08,&lt;br&gt;
    "anomaly_score": 0.9&lt;br&gt;
  }&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;One status. One score. One trend direction. No dashboards to configure. No agents to run. No Prometheus exporters.&lt;/p&gt;
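
&lt;p&gt;As a rough illustration of what a consumer can do with that payload, the sketch below collapses it into a single line of text. The summarize helper is hypothetical and not part of fastapi-alertengine, and it assumes error_rate is a fraction (0.08 meaning 8%), which the response above does not state.&lt;/p&gt;

```python
import json

# Sample payload copied from the response shown above.
SAMPLE = """
{
  "status": "warning",
  "health_score": {"score": 61, "trend": "degrading"},
  "metrics": {"overall_p95_ms": 1847.3, "error_rate": 0.08, "anomaly_score": 0.9}
}
"""


def summarize(payload):
    """Collapse the health payload into one line that is readable on a phone (hypothetical helper)."""
    health = payload["health_score"]
    metrics = payload["metrics"]
    return (
        f"[{payload['status'].upper()}] score {health['score']} ({health['trend']}), "
        f"p95 {metrics['overall_p95_ms']:.0f} ms, "
        f"errors {metrics['error_rate']:.0%}"  # assumes error_rate is a fraction
    )


print(summarize(json.loads(SAMPLE)))
```

&lt;p&gt;For the sample payload this prints: [WARNING] score 61 (degrading), p95 1847 ms, errors 8%&lt;/p&gt;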

&lt;p&gt;&lt;strong&gt;The Human-in-the-Loop Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once I had a reliable health signal, the next question was:&lt;br&gt;
What do I do with it?&lt;/p&gt;

&lt;p&gt;I built a managed orchestration layer that polls /health/alerts every 5 seconds. When the score drops below the threshold, it:&lt;/p&gt;

&lt;p&gt;Runs a Claude-based AI diagnosis on the metric context&lt;br&gt;
Sends a WhatsApp, Telegram, or Slack message with a plain-English summary&lt;br&gt;
Generates a single-use recovery link&lt;/p&gt;

&lt;p&gt;Most AI incident tooling jumps straight to autonomous remediation. I intentionally didn't.&lt;/p&gt;

&lt;p&gt;Production systems deserve human authorisation before recovery actions execute. I read the diagnosis, preview the recovery action, and tap approve – all from my phone.&lt;/p&gt;

&lt;p&gt;Nothing executes automatically. Every action is logged immutably.&lt;/p&gt;
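
&lt;p&gt;The single-use approval idea can be sketched in a few lines. This is not the managed service's implementation; the RecoveryLinks class, its in-memory storage, and the action name are all hypothetical, and a real system would also need expiry, authentication, and durable audit logging.&lt;/p&gt;

```python
import secrets


class RecoveryLinks:
    """In-memory store of one-time approval tokens (illustrative only)."""

    def __init__(self):
        self._pending = {}

    def issue(self, action):
        """Create an unguessable token for a proposed recovery action."""
        token = secrets.token_urlsafe(32)
        self._pending[token] = action
        return token

    def redeem(self, token):
        """Return the approved action once; any later or unknown token yields None."""
        return self._pending.pop(token, None)


links = RecoveryLinks()
token = links.issue("restart-worker")  # hypothetical action name

print(links.redeem(token))  # first tap: the action is released for execution
print(links.redeem(token))  # second tap: the token is already spent
```

&lt;p&gt;Because redeem removes the token as it reads it, a forwarded or replayed link cannot trigger the same action twice.&lt;/p&gt;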

&lt;p&gt;I built the mobile-first delivery because I work in Zimbabwe, where engineers aren't always at laptops when things break.&lt;/p&gt;

&lt;p&gt;WhatsApp is the operational control plane here.&lt;/p&gt;

&lt;p&gt;That constraint produced something better than I expected:&lt;/p&gt;

&lt;p&gt;Alerts that find you, rather than dashboards you have to find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Open Source Core&lt;/strong&gt;&lt;br&gt;
The telemetry middleware is free and MIT-licensed. Install it with:&lt;br&gt;
pip install fastapi-alertengine&lt;/p&gt;

&lt;p&gt;The managed orchestration layer (AI diagnosis, WhatsApp/Telegram alerts, and human-authorised recovery) is a commercial service.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Tandem-Media/fastapi-alertengine" rel="noopener noreferrer"&gt;https://github.com/Tandem-Media/fastapi-alertengine&lt;/a&gt;&lt;br&gt;
Docs: &lt;a href="https://tandem-media.github.io/fastapi-alertengine/" rel="noopener noreferrer"&gt;https://tandem-media.github.io/fastapi-alertengine/&lt;/a&gt;&lt;br&gt;
YouTube: &lt;a href="https://youtu.be/vKLqcVdSMO8?si=eMU3Fm_WPmJTQi2Y" rel="noopener noreferrer"&gt;https://youtu.be/vKLqcVdSMO8?si=eMU3Fm_WPmJTQi2Y&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most monitoring stacks are good at detecting incidents.&lt;br&gt;
Very few are good at reducing operator uncertainty during one.&lt;br&gt;
How are you handling that gap today?&lt;/p&gt;

</description>
      <category>backend</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
