Rodrigo Escorsim

Posted on Jun 30 • Originally published at cachesnap.com

I skipped the LLM and built a 9-rule deterministic diagnosis engine for my performance monitoring SaaS

#rust #webdev #performance #showdev

Most developers don't know if their CDN is actually caching in Tokyo.

They check the dashboard, see a green dot, assume everything is fine. Meanwhile, every request from Asia is hitting their origin in Frankfurt because a CDN config never propagated. TTFB is 800ms. Users are leaving. Nobody noticed because the uptime monitor only checks "is the site up?", not "is it fast from where your users actually are?"

That's what I built CacheSnap to fix. It probes your URLs from 8 AWS Lambda regions every few minutes and, instead of showing you raw headers, tells you what's wrong and what to do about it. This post is about the two pieces of engineering that make that work: the deterministic diagnosis engine and the Redis-gated scheduler.

What the user actually sees

Before getting into implementation, it helps to understand what we're targeting. When CacheSnap detects a problem, a card like this appears on the dashboard:

⚠ CRITICAL · sa-east (São Paulo)
Cache MISS: origin server is being consulted for every request in sa-east.
Action: Add `Cache-Control: public, s-maxage=300` to your response headers.
        For Next.js, use `export const revalidate = 300` in your page.
Estimated gain: ~450ms

No headers to decode. No raw JSON to interpret. The cause is a sentence. The fix is two lines of config. The gain is a number.

Getting from raw probe data to that card is the job of the diagnosis engine.

The translation problem

A probe returns something like this:

TTFB: 480ms
Cache-Status: MISS
HTTP: HTTP/2
Redirects: 0
Served-By: 87c1d4a2b3c4d5e6-IAD  ← Cloudflare CF-Ray header

That's a measurement. It tells you what happened, not why or what to do. The gap between "480ms TTFB" and "your CDN isn't caching: here's the exact config line to fix it" is where most monitoring tools stop.

The obvious path is to feed the data to an LLM and let it generate the explanation. I spent a week thinking seriously about this and decided against it.

Three reasons:

1. Volume and latency. Diagnosis runs on every probe ingest. With 50 monitors × 8 regions × 1-minute intervals, that's 400 diagnosis calls per minute at steady state. An LLM call averaging 800ms would add more latency to the pipeline than the performance problems it's diagnosing. Diagnosis needs to be sub-millisecond.

2. Correctness. An LLM will generate plausible advice regardless of whether it's applicable. It might say "try adding a Cache-Control header" when one already exists and the problem is a CDN misconfiguration. A rule engine is wrong in known, fixable ways: you can write a test for every mistake it makes.

3. Testability. I want diagnose(input) to be a pure function with deterministic output I can run in CI. The priority between rules is a product decision: "cache MISS beats anycast mismatch" is something I can assert and lock down. With an LLM that's not possible.

The alternative: a priority-ordered rule table. Each rule maps an observable condition to a structured diagnosis. Rules evaluate top-to-bottom; first match wins.

The diagnosis engine

The core types:

pub struct DiagnosisInput {
    pub ttfb_ms: Option<f64>,
    pub cache_status: Option<String>,
    pub baseline_ttfb_ms: Option<f64>,  // 7-day rolling average for this URL + region
    pub redirect_count: Option<i32>,
    pub http_version: Option<String>,
    pub region: String,
    pub error: Option<String>,
    pub served_by: Option<String>,  // CF-Ray header, used for anycast audit
    pub age_s: Option<i32>,         // Age response header
}

pub struct Diagnosis {
    pub severity: String,           // "critical" | "warning" | "info" | "ok"
    pub cause: String,
    pub action: String,
    pub summary: String,
    pub estimated_gain_ms: Option<f64>,
}

The diagnose() function evaluates 9 rules in fixed priority order:

1. Connectivity error              → critical  (site unreachable)
2. Cache MISS/BYPASS + TTFB>200ms → critical  (highest actionable impact)
3. TTFB > 2× 7-day baseline       → warning   (regression vs. normal)
4. Redirect count > 1              → warning   (redirect chain cost)
5. Cache HIT but TTFB > 150ms     → warning   (slow edge function)
6. HTTP/1.1 + TTFB > 100ms        → info      (upgrade available)
7. Cache Age > 86400s              → warning   (stale content risk)
8. Anycast routing mismatch        → warning   (cross-region routing)
9. (fallthrough)                   → ok

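Here's rule 2 verbatim from the source. Note that, beyond the MISS/BYPASS shorthand in the list above, it also treats EXPIRED and DYNAMIC as uncached responses: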

if (cache.contains("MISS") || cache.contains("BYPASS")
    || cache.contains("EXPIRED") || cache.contains("DYNAMIC"))
    && ttfb > 200.0
{
    let gain = ttfb - 30.0; // estimate: a cache HIT would cost ~30ms at the edge
    return Diagnosis {
        severity: "critical".into(),
        cause: format!(
            "Cache {}: origin server is being consulted for every request in {}.",
            cache, input.region
        ),
        action: "Add `Cache-Control: public, s-maxage=300` to your response headers. \
                 For Next.js, use `export const revalidate = 300` in your page.".into(),
        summary: format!("Cache {} in {} is adding ~{:.0}ms", cache, input.region, gain),
        estimated_gain_ms: Some(gain),
    };
}

The estimated_gain_ms field is worth pausing on. "Your site is slow" is vague. "Fixing this saves 450ms in São Paulo" is a business case. The number is an estimate (actual gain depends on origin latency after fix), but even a rough estimate turns a warning into a prioritizable action.
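To make the first-match-wins ordering concrete, here is a minimal sketch of how such a rule table could be laid out. It is not the CacheSnap source: `Probe`, `Finding`, `Rule`, `RULES`, and `diagnose_sketch` are hypothetical names, only rules 1 to 3 are modeled, and the rule 3 gain (current TTFB minus baseline) is inferred from the numbers in the example messages.

```rust
#[derive(Debug, PartialEq)]
enum Severity {
    Critical,
    Warning,
    Ok,
}

// Simplified stand-in for DiagnosisInput (hypothetical, fewer fields).
struct Probe {
    ttfb_ms: f64,
    cache_status: &'static str,
    baseline_ttfb_ms: Option<f64>,
    error: Option<&'static str>,
}

struct Finding {
    severity: Severity,
    rule: u8, // position in the priority list
    estimated_gain_ms: Option<f64>,
}

// A rule either matches and yields a finding, or passes.
type Rule = fn(&Probe) -> Option<Finding>;

fn connectivity(p: &Probe) -> Option<Finding> {
    p.error.map(|_| Finding { severity: Severity::Critical, rule: 1, estimated_gain_ms: None })
}

fn cache_miss(p: &Probe) -> Option<Finding> {
    let uncached = ["MISS", "BYPASS", "EXPIRED", "DYNAMIC"]
        .iter()
        .any(|s| p.cache_status.contains(*s));
    if uncached && p.ttfb_ms > 200.0 {
        // Same estimate as the post: an edge HIT would cost roughly 30ms.
        let gain = p.ttfb_ms - 30.0;
        return Some(Finding { severity: Severity::Critical, rule: 2, estimated_gain_ms: Some(gain) });
    }
    None
}

fn baseline_regression(p: &Probe) -> Option<Finding> {
    let baseline = p.baseline_ttfb_ms?;
    if p.ttfb_ms > 2.0 * baseline {
        let gain = p.ttfb_ms - baseline; // assumption: recovery back to baseline
        return Some(Finding { severity: Severity::Warning, rule: 3, estimated_gain_ms: Some(gain) });
    }
    None
}

// Array order is the priority order; moving an entry changes which rule wins.
const RULES: [Rule; 3] = [connectivity, cache_miss, baseline_regression];

fn diagnose_sketch(p: &Probe) -> Finding {
    RULES
        .iter()
        .find_map(|rule| rule(p))
        .unwrap_or(Finding { severity: Severity::Ok, rule: 9, estimated_gain_ms: None })
}
```

With this layout, a probe that is both a cache MISS and 4× over baseline reports rule 2, because `find_map` stops at the first rule that returns `Some`.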

Why rule priority matters

The order isn't arbitrary. Rules 2–4 surface "fix this now" problems. Rules 7–8 are audit signals: real issues, but lower urgency than a production cache miss that's affecting every user right now.

Without explicit priority, overlapping signals create ambiguity. A cache MISS and an anycast routing mismatch can both be true simultaneously. The engine needs to surface the most actionable one. The tests lock this down:

// Cache MISS must win over anycast mismatch (rule 2 > rule 8)
#[test]
fn cache_miss_beats_anycast() {
    let d = diagnose(&DiagnosisInput {
        ttfb_ms: Some(500.0),
        cache_status: Some("MISS".into()),
        served_by: Some("abc123-IAD".into()),
        region: "sa-east".into(),
        ..base_input()
    });
    assert_eq!(d.severity, "critical");
    assert!(d.cause.contains("MISS"));
}

// Baseline regression must win over stale cache age (rule 3 > rule 7)
#[test]
fn baseline_anomaly_beats_cache_age_breach() {
    let d = diagnose(&DiagnosisInput {
        ttfb_ms: Some(800.0),
        cache_status: Some("HIT".into()),
        baseline_ttfb_ms: Some(100.0), // 8× slower than normal
        age_s: Some(172_800),           // content also 2 days old
        ..base_input()
    });
    assert_eq!(d.severity, "warning");
    assert!(d.cause.contains("7-day baseline"));
}

These tests document the intended priority as much as they verify correctness. When I change a rule's position, a failing test tells me exactly what got displaced and forces an explicit decision about whether that's right.
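The tests above lean on a `base_input()` helper that the post doesn't show. One plausible shape, offered as an assumption rather than the real helper, is a "healthy probe" whose values sit below every threshold in the rule list, so each test only overrides the fields it cares about:

```rust
// Copy of the struct from the post so the sketch is self-contained.
pub struct DiagnosisInput {
    pub ttfb_ms: Option<f64>,
    pub cache_status: Option<String>,
    pub baseline_ttfb_ms: Option<f64>,
    pub redirect_count: Option<i32>,
    pub http_version: Option<String>,
    pub region: String,
    pub error: Option<String>,
    pub served_by: Option<String>,
    pub age_s: Option<i32>,
}

/// Hypothetical test helper: a fast cache HIT over HTTP/2 with no redirects,
/// no baseline, and no CF-Ray value. The specific values are assumptions.
fn base_input() -> DiagnosisInput {
    DiagnosisInput {
        ttfb_ms: Some(50.0),
        cache_status: Some("HIT".into()),
        baseline_ttfb_ms: None,
        redirect_count: Some(0),
        http_version: Some("HTTP/2".into()),
        region: "us-east".into(),
        error: None,
        served_by: None,
        age_s: None,
    }
}
```

Struct update syntax (`..base_input()`) then fills every field a test leaves out, which keeps each priority test focused on the two signals it pits against each other.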

The insight most devs miss: anycast mismatch

Rule 8 is the most unusual, and it catches a problem users consistently didn't know they had.

CDNs use anycast to route requests to the nearest PoP. When it works, a user in São Paulo gets served by GRU or GIG. When something is misconfigured (geo-steering rules, load balancer health checks, origin pull settings), the same request travels to IAD (Dulles, Virginia) instead. That's an extra 100–150ms RTT on every request, invisible unless you're explicitly probing from the right region.

Cloudflare exposes which PoP served a request via the CF-Ray header: 87c1d4a2b3c4d5e6-GRU. The suffix is the IATA airport code of the serving PoP. The engine extracts it and checks whether that PoP belongs to the probe's expected region:

pub fn extract_iata_from_served_by(served_by: &str) -> Option<&str> {
    // CF-Ray format: hexdigest-IATA
    if let Some(pos) = served_by.rfind('-') {
        let candidate = &served_by[pos + 1..];
        if candidate.len() >= 2
            && candidate.len() <= 4
            && candidate.chars().all(|c| c.is_ascii_uppercase())
        {
            return Some(candidate);
        }
    }
    None
}

fn iata_to_region(iata: &str) -> Option<&'static str> {
    match iata {
        "GRU" | "GIG" | "EZE" | "SCL" | "BOG" | "LIM" => Some("sa-east"),
        "IAD" | "JFK" | "EWR" | "ORD" | "LAX" | "DFW" => Some("us-east"),
        "LHR" | "AMS" | "FRA" | "CDG" | "MXP" | "MAD" => Some("eu-west"),
        "NRT" | "SIN" | "HKG" | "BOM" | "DEL" | "ICN" => Some("ap-east"),
        "SYD" | "MEL" | "BNE" | "PER" | "AKL"         => Some("oc"),
        // ... full table covers ~80 codes
        _ => None,
    }
}

If the probe is from sa-east and the PoP is IAD, that's a mismatch. The diagnosis tells the user exactly which PoP answered and which CDN config to check. Without multi-region probing this problem is nearly impossible to notice. Uptime monitors that check from a single location or from the same region as the CDN PoP will never see it.
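Putting the two lookups together, the mismatch check itself might look like the sketch below. `anycast_mismatch` is a hypothetical name, the IATA table is abbreviated to a handful of codes, and treating an unknown PoP code as "no verdict" is an assumption about how the real engine behaves.

```rust
/// Same contract as the post's extractor, written with rsplit_once.
fn extract_iata_from_served_by(served_by: &str) -> Option<&str> {
    let (_, candidate) = served_by.rsplit_once('-')?;
    let plausible = (2..=4).contains(&candidate.len())
        && candidate.chars().all(|c| c.is_ascii_uppercase());
    plausible.then_some(candidate)
}

/// Abbreviated table; the post says the full one covers ~80 codes.
fn iata_to_region(iata: &str) -> Option<&'static str> {
    match iata {
        "GRU" | "GIG" => Some("sa-east"),
        "IAD" | "JFK" => Some("us-east"),
        "LHR" | "FRA" => Some("eu-west"),
        _ => None,
    }
}

/// Returns (serving PoP, PoP's region) when the PoP lies outside the
/// region the probe ran in. Returns None when the regions match or when
/// the header or PoP code can't be interpreted.
fn anycast_mismatch<'a>(
    probe_region: &str,
    served_by: &'a str,
) -> Option<(&'a str, &'static str)> {
    let iata = extract_iata_from_served_by(served_by)?;
    let pop_region = iata_to_region(iata)?;
    (pop_region != probe_region).then_some((iata, pop_region))
}
```

Returning the PoP code alongside its region gives the diagnosis text both pieces it needs: which PoP answered and where that PoP actually is.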

The adaptive baseline

Rule 3 (TTFB anomaly vs. 7-day baseline) is where the engine personalizes to your specific URL.

A fixed threshold like "warn if TTFB > 400ms" is meaningless without context. A CDN-cached static page at 400ms is broken. A database-backed API at 400ms is completely normal. Using a threshold calibrated to what's normal for that URL means the engine warns about actual regressions, not just "slowness" in the abstract.

Every 15 minutes, a background worker updates a (monitor_id, region, mean, stddev) record from a 7-day sliding window:

SELECT
    AVG(ttfb_ms)    AS mean_ttfb,
    STDDEV(ttfb_ms) AS stddev_ttfb
FROM probe_results
WHERE monitor_id = $1
  AND region     = $2
  AND time > NOW() - INTERVAL '7 days'
  AND ttfb_ms IS NOT NULL
  AND error IS NULL

When a new probe comes in, the engine receives baseline_ttfb_ms and computes the factor. Factor > 2.0 → warning. The diagnosis includes the factor, the raw numbers, and the estimated recovery:

TTFB is 7.5× slower than your 7-day baseline (600ms vs 80ms normal) in eu-west.
Action: Check for recent deploys, increased origin load, or cold starts.
Estimated gain: ~520ms

The first time CacheSnap detects this on a real site it feels like magic. In practice it's just a TimescaleDB window query and a ratio check, but the framing as "your normal" rather than "some threshold" is what makes the alert actionable.
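The ratio check is small enough to sketch in full. This is an illustration under assumptions, not the engine's code: `Regression`, `baseline_regression`, and `describe` are hypothetical names, and skipping a zero or missing baseline is a guess at sensible behavior for a monitor with no history yet.

```rust
#[derive(Debug, PartialEq)]
struct Regression {
    factor: f64,
    estimated_gain_ms: f64,
}

/// Flags a probe whose TTFB is more than 2x the 7-day baseline.
fn baseline_regression(ttfb_ms: f64, baseline_ttfb_ms: Option<f64>) -> Option<Regression> {
    // No baseline yet (or a degenerate one): nothing to compare against.
    let baseline = baseline_ttfb_ms.filter(|b| *b > 0.0)?;
    let factor = ttfb_ms / baseline;
    (factor > 2.0).then(|| Regression {
        factor,
        // Assumed recovery target: getting back to the baseline.
        estimated_gain_ms: ttfb_ms - baseline,
    })
}

/// Renders the cause line in the shape shown in the example above.
fn describe(r: &Regression, ttfb_ms: f64, region: &str) -> String {
    let baseline = ttfb_ms - r.estimated_gain_ms;
    format!(
        "TTFB is {:.1}× slower than your 7-day baseline ({:.0}ms vs {:.0}ms normal) in {}.",
        r.factor, ttfb_ms, baseline, region
    )
}
```

For a 600ms probe against an 80ms baseline this yields a factor of 7.5 and a 520ms estimated gain, matching the example message.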

The Redis-gated scheduler

The diagnosis engine is fast and pure, but the scheduler that triggers probes has a harder problem: ensuring each monitor fires exactly once per interval across multiple API instances.

The naive approach, where each instance tracks last-check time in memory, breaks immediately under horizontal scale. Two instances both fire at t=0, both note in memory that the next run is due at t=300, and both fire again at t=300. You get duplicate probe rows, double Lambda invocations, and corrupted baselines.

The fix is a Redis Lua script that atomically reads and writes the last-check timestamp:

local last_check   = redis.call('GET', KEYS[1])
local warmup_index = redis.call('GET', KEYS[2])
local now          = tonumber(ARGV[1])
local interval     = tonumber(ARGV[2])

local effective_interval = interval
if warmup_index ~= false then
    local idx = tonumber(warmup_index)
    local warmup_gap
    if     idx == 0 then warmup_gap = 60
    elseif idx == 1 then warmup_gap = 60
    end
    if warmup_gap and warmup_gap < interval then
        effective_interval = warmup_gap
    end
end

if not last_check or (now - tonumber(last_check)) >= effective_interval then
    redis.call('SET', KEYS[1], now)
    if not last_check then
        redis.call('SET', KEYS[2], '0', 'EX', '300')
    elseif warmup_index ~= false and tonumber(warmup_index) < 2 then
        redis.call('INCR', KEYS[2])
        redis.call('EXPIRE', KEYS[2], '300')
    end
    return 1
end
return 0

Redis executes Lua scripts atomically, so no other command runs between the GET and SET. If the script returns 1, this instance won the race and dispatches the probe. Any other instance evaluating at the same millisecond returns 0 and skips.

The warmup logic solves a UX problem: with a 5-minute interval, a newly added monitor would otherwise sit on a single data point for its first 5 minutes. Instead, the two checks that follow a monitor's first run each use a 60-second gap (warmup_index 0 and 1). By the time you refresh the dashboard, data is already there. The warmup only accelerates: if your configured interval is already 60s or shorter, the warmup gap is ignored.
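The gate's behavior is easier to reason about as a timeline. The following is an in-memory Rust model of the Lua script, written for illustration and not taken from the CacheSnap code: `GateState`, `should_run`, and `fire_times` are hypothetical names, timestamps are assumed to be in seconds, and key expiry is emulated with a stored deadline.

```rust
const WARMUP_GAP_S: u64 = 60;
const WARMUP_TTL_S: u64 = 300;

/// Stand-in for the two Redis keys the script reads and writes.
#[derive(Default)]
struct GateState {
    last_check: Option<u64>,    // KEYS[1]
    warmup: Option<(u32, u64)>, // KEYS[2]: (index, expiry time)
}

/// Mirrors the script: returns true when this caller should dispatch.
fn should_run(state: &mut GateState, now: u64, interval: u64) -> bool {
    // Emulate the TTL: an expired warmup key reads as missing.
    if matches!(state.warmup, Some((_, expires_at)) if now >= expires_at) {
        state.warmup = None;
    }

    // Warmup can only shorten the interval, never lengthen it.
    let mut effective = interval;
    if let Some((idx, _)) = state.warmup {
        if idx < 2 && WARMUP_GAP_S < interval {
            effective = WARMUP_GAP_S;
        }
    }

    let due = match state.last_check {
        None => true,
        Some(last) => now.saturating_sub(last) >= effective,
    };
    if !due {
        return false;
    }

    let first_run = state.last_check.is_none();
    state.last_check = Some(now);
    if first_run {
        state.warmup = Some((0, now + WARMUP_TTL_S));
    } else if let Some((idx, _)) = state.warmup {
        if idx < 2 {
            // INCR followed by EXPIRE refreshes the TTL.
            state.warmup = Some((idx + 1, now + WARMUP_TTL_S));
        }
    }
    true
}

/// Evaluates the gate once per second and collects the firing times.
fn fire_times(interval: u64, until: u64) -> Vec<u64> {
    let mut state = GateState::default();
    (0..=until)
        .filter(|&t| should_run(&mut state, t, interval))
        .collect()
}
```

Under this model a 300-second monitor fires at t = 0, 60, 120, 420, and 720 seconds: an immediate first check, two accelerated follow-ups, then the configured cadence. A 30-second monitor is unaffected by warmup and fires every 30 seconds from the start.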

Dispatch concurrency: two semaphores and a hash

Once should_run_now (the scheduler's call into the gate above) returns true, the scheduler dispatches probes to all 8 regions in parallel using Tokio. But unbounded parallelism at scale would immediately draw 429 throttling responses from every Lambda region: 500 monitors × 8 regions = 4,000 simultaneous invocations per tick.

The solution is two levels of Tokio semaphores, one global cap and one cap per region:

pub struct ProbeDispatchLimits {
    global: Arc<Semaphore>,
    per_region: RwLock<HashMap<String, Arc<Semaphore>>>,
    cfg: Arc<ProbeDispatchConfig>,
}

Before each Lambda POST, the dispatcher acquires both permits with a timeout:

let global_permit = match acquire_with_timeout(&self.global, acquire_wait).await {
    Ok(p) => p,
    Err(_) => {
        warn!("global semaphore acquire timed out, skipping region for this cycle");
        return;
    }
};

let rsem = self.region_semaphore(ctx.region_id).await;
let region_permit = match acquire_with_timeout(&rsem, acquire_wait).await {
    Ok(p) => p,
    Err(_) => {
        drop(global_permit);
        return;
    }
};

The timeout is the important part. Without it, a stuck Lambda region holds a permit indefinitely and starves every other monitor waiting for that region. With a timeout, the region is simply skipped for that cycle and retried on the next tick.

When Lambda returns 429, both permits are released before the sleep, not after. If we held permits during backoff, we'd block the entire dispatch queue waiting on one throttled region.
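The acquire-or-skip pattern can be illustrated without Tokio. The sketch below swaps in a blocking semaphore built from `std::sync::Mutex` and `Condvar`, so it shows the idea rather than the async implementation: `BlockingSemaphore`, `Permit`, and `acquire_both` are hypothetical names, and this `acquire_with_timeout` is a stand-in for the post's helper of the same name.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::{Duration, Instant};

/// Std-only counting semaphore (the post uses tokio::sync::Semaphore).
struct BlockingSemaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl BlockingSemaphore {
    fn new(permits: usize) -> Arc<Self> {
        Arc::new(Self { permits: Mutex::new(permits), available: Condvar::new() })
    }

    fn available_permits(&self) -> usize {
        *self.permits.lock().unwrap()
    }
}

/// RAII permit: dropping it returns the slot and wakes one waiter.
struct Permit {
    sem: Arc<BlockingSemaphore>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.available.notify_one();
    }
}

/// Waits up to `wait` for a permit; None means "skip this cycle".
fn acquire_with_timeout(sem: &Arc<BlockingSemaphore>, wait: Duration) -> Option<Permit> {
    let deadline = Instant::now() + wait;
    let mut permits = sem.permits.lock().unwrap();
    while *permits == 0 {
        // Returns None from the function once the deadline has passed.
        let remaining = deadline.checked_duration_since(Instant::now())?;
        permits = sem.available.wait_timeout(permits, remaining).unwrap().0;
    }
    *permits -= 1;
    Some(Permit { sem: Arc::clone(sem) })
}

/// Global first, then region. If the region permit times out, the early
/// return drops the global permit so it is not held while nothing runs.
fn acquire_both(
    global: &Arc<BlockingSemaphore>,
    region: &Arc<BlockingSemaphore>,
    wait: Duration,
) -> Option<(Permit, Permit)> {
    let global_permit = acquire_with_timeout(global, wait)?;
    let region_permit = acquire_with_timeout(region, wait)?;
    Some((global_permit, region_permit))
}
```

The same RAII property is what makes the 429 handling work: dropping both permits before the backoff sleep frees the slots for other monitors while the throttled region cools down.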

Bursts are smoothed with a deterministic jitter function (no rand crate required):

fn dispatch_jitter(monitor_id: Uuid, region_id: &str) -> Duration {
    let mut h: u32 = (monitor_id.as_u128() & 0xffff_ffff) as u32;
    for b in region_id.bytes() {
        h = h.wrapping_mul(31).wrapping_add(b as u32);
    }
    Duration::from_millis((h % 72) as u64)
}

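The same monitor and region always produce the same jitter, so the spread comes from differences between monitors rather than from randomness over time. Because the jitter is always below 72ms, the entire burst window fits within a single scheduler tick, so no probe is ever delayed into the next cycle.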

What this architecture gets you

Combined, these pieces give you a system that:

- Explains problems in plain text, not raw headers, so non-technical stakeholders can act on the output directly
- Fires exactly once per interval regardless of how many API instances are running
- Degrades gracefully when Lambda regions throttle or go cold: it skips, logs, and retries next tick instead of cascading

For a solo project, keeping the diagnosis deterministic and the scheduler atomic were the two decisions that eliminated the most production incidents. Both feel obvious in retrospect. Neither is what you reach for when moving fast.

CacheSnap is live at cachesnap.com.

Free tier: 3 monitors, 7-day data retention, 10-minute check intervals. No credit card required. If you want to see what the diagnosis engine finds on your own URLs, it takes about 90 seconds to add a monitor and see the first results.

Questions about the architecture, the Redis scheduling approach, or the Lambda probe design? Happy to discuss in the comments.
