The Problem We Were Actually Solving
It started with a single Slack alert on a Tuesday at 3:47 PM. Our in-house treasure hunt engine, basically a graph traversal service that crawled 2.8 million user-generated routes every night, began receiving HTTP 410 Gone for 12% of its target URLs. That was bad because the hunt scoreboard depended on those links staying alive for 36 hours. Worse, the failures weren't clustered on any single CDN; they were spread across five different hosts running in Kubernetes with identical resource limits. The on-call engineer rerouted traffic via a circuit breaker and watched the error rate drop back to 0%, but the episode revealed a latent failure mode: our engine treated a single 410 as a node failure and would detach the entire subtree, wiping out hundreds of downstream routes in one shot. That was the problem we were actually solving: eventual consistency under noisy input.
What We Tried First (And Why It Failed)
We bolted on a retry budget in the first 20 minutes, setting max_retries to 3 with exponential backoff. Within an hour we hit another problem: tail latency on the retry path spiked to 8.2 seconds, and we started missing our 95th-percentile deadline of 5 seconds. The retry logic lived in the Node.js worker that also computed shortest-path scores, so adding sleeps inside the async queue crushed throughput from 14k URLs/min to 3k. Next we tried moving retries to a sidecar using Envoy's retry policy, but the sidecar introduced 150ms of additional hop time, and the engine still missed deadlines when upstream L7 load balancers were under pressure.
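To see why three retries with exponential backoff can blow through a 5-second deadline, here is a minimal Python sketch of the arithmetic. Only max_retries=3 comes from our setup; the 0.5 s base delay, the doubling factor, the 1.0 s per-request timeout, and the function names are assumptions chosen for illustration, not our production values.

```python
def backoff_delays(max_retries: int = 3, base_s: float = 0.5, factor: float = 2.0) -> list[float]:
    """Sleep before each retry: base, base*factor, base*factor^2, ..."""
    return [base_s * factor ** attempt for attempt in range(max_retries)]


def worst_case_latency(request_timeout_s: float, delays: list[float]) -> float:
    """Wall time if the first try and every retry all run to the timeout."""
    attempts = len(delays) + 1  # the original request plus one per retry
    return attempts * request_timeout_s + sum(delays)


delays = backoff_delays()
print(delays)                           # [0.5, 1.0, 2.0]
print(worst_case_latency(1.0, delays))  # 7.5
```

Even with these modest assumed numbers, the worst case lands well past a 5-second deadline, and every second of it is spent holding a slot in whatever queue runs the retry.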
Then we tried circuit breakers. We wrapped each outbound HTTP call in a breaker with failure_threshold=5 and timeout=1.2s. On paper it looked sane, but the breaker didn't know about the semantic weight of a 410; it just counted it as a failure. So when 700k URLs in the Tokyo region started returning 410 in a cascade after a CDN purge, every breaker tripped, and the engine switched to backup endpoints that had even older data. Ten minutes later the scoreboard update came in with 28% of the routes showing stale scores because the backup endpoints were five hours behind. The circuit breaker solved one problem and created another.
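The failure is easy to reproduce in miniature. The sketch below is a hypothetical consecutive-failure breaker in Python, not our production code and not any library's API; `NaiveBreaker` and `record` are made-up names, and only failure_threshold=5 comes from the setup above. It treats every status of 400 or higher as a failure, which is the behavior that bit us.

```python
class NaiveBreaker:
    """Consecutive-failure breaker that counts any status >= 400 as a failure."""

    def __init__(self, failure_threshold: int = 5) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def record(self, status: int) -> None:
        if status >= 400:
            # A 410 lands here too, even though the host answered promptly.
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True
        else:
            self.failures = 0  # any success resets the streak


breaker = NaiveBreaker()
for status in [410, 410, 410, 410, 410]:
    breaker.record(status)
print(breaker.open)  # True
```

Five purged URLs in a row are enough to open the breaker, even though a 410 says the resource is gone, not that the host is unhealthy.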
The Architecture Decision
We killed the circuit breaker and the sidecar retries in one merge request at 10:23 PM. Instead, we built a two-stage pipeline:
Stage 1: Pre-validation
Every URL gets a HEAD request with max_timeout=800ms and strict_status_filter=[200,301,404]. If the response is 410, we immediately mark the node as dead and prune it from the graph without propagating failure. This stage runs in a separate Go worker pool sized at 4× the CPU cores, so it never contends with score computation.
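A minimal Python sketch of the Stage 1 decision, contrasted with the old subtree detach. The real pool is written in Go and the HEAD request itself is omitted here; the function names, the "recheck" bucket for statuses outside the filter, and the toy graph are assumptions for illustration.

```python
STRICT_STATUS_FILTER = {200, 301, 404}  # values from strict_status_filter


def classify(status: int) -> str:
    """Map a HEAD status to a Stage 1 decision."""
    if status == 410:
        return "dead"
    if status in STRICT_STATUS_FILTER:
        return "keep"
    return "recheck"  # assumption: handling of other statuses is not specified


def detach_subtree(children: dict[str, list[str]], root: str) -> set[str]:
    """Old behavior: drop the node and everything reachable below it."""
    removed: set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in removed:
            removed.add(node)
            stack.extend(children.get(node, []))
    return removed


def prune_node(root: str) -> set[str]:
    """New behavior: only the dead node itself is removed."""
    return {root}


children = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(classify(410))                         # dead
print(sorted(detach_subtree(children, "a")))  # ['a', 'b', 'c', 'd']
print(sorted(prune_node("a")))                # ['a']
```

In the toy graph a single 410 on "a" costs four nodes under the old behavior and one under the new one.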
Stage 2: Real-time reconciliation
A separate reconcile loop wakes every 30 seconds. It queries the database for all dead nodes that were added after the last crawl cycle. It then re-queues only those URLs into Stage 1, but with a jittered delay to avoid thundering-herd retries. We also added a Bloom filter on the crawl frontier so we never re-queue a URL that Stage 1 has already rejected.
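The two mechanisms in this stage can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our implementation: the 5-second base delay, the 10-second jitter window, the filter size, the hash count, and all names are hypothetical.

```python
import hashlib
import random


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def jittered_delays(urls, base_s=5.0, jitter_s=10.0, rng=None):
    """Give each URL its own delay so re-queued checks do not fire at once."""
    rng = rng or random.Random()
    return {url: base_s + rng.uniform(0.0, jitter_s) for url in urls}


rejected = BloomFilter()
rejected.add("https://example.com/route/1")
print("https://example.com/route/1" in rejected)  # True
print(jittered_delays(["https://example.com/route/2"]))
```

The filter trades a small false-positive rate for constant memory, which matters when the frontier holds millions of URLs; the jitter spreads re-validation across the window so a purge does not turn into a synchronized burst.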
The decision came down to cost and correctness: adding more compute to the validation path was cheaper than adding latency to the scoring path. The Go pool runs on spot instances that cost $0.012 per thousand URLs; the tripped circuit breakers used to cost us $0.08 per thousand due to cascade-induced 5xxs and pager duty burn.
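For a sense of scale, here is the arithmetic as a short Python sketch. It assumes each of the 2.8 million nightly routes counts as one validated URL, which is a simplification; the per-thousand costs are the figures quoted above, and the function name is hypothetical.

```python
URLS_PER_NIGHT = 2_800_000     # nightly crawl size, assumed one URL per route
POOL_COST_PER_K = 0.012        # dollars per thousand URLs, Go validation pool
BREAKER_COST_PER_K = 0.08      # dollars per thousand URLs, tripped breakers


def nightly_cost(cost_per_thousand: float, urls: int = URLS_PER_NIGHT) -> float:
    """Dollars per night at a given per-thousand-URL cost."""
    return cost_per_thousand * urls / 1000


print(round(nightly_cost(POOL_COST_PER_K), 2))            # 33.6
print(round(nightly_cost(BREAKER_COST_PER_K), 2))         # 224.0
print(round(BREAKER_COST_PER_K / POOL_COST_PER_K, 2))     # 6.67
```

Under that assumption the validation pool costs about $34 a night against roughly $224 for the breaker path, a bit under a sevenfold difference.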
What The Numbers Said After
Two weeks later:
- Pre-validation false-positive rate: 0.08% (all were temporary redirects misclassified as 410).
- 95th percentile Stage 1 latency: 415ms.
- Pipeline throughput: 22k URLs/min, up from 3k.
- Memory usage per worker: 180MB, down from 290MB.
- Scoreboard freshness variance: ±2.4 minutes, which met our SLO.
We also stopped waking the on-call team for 410 avalanches. The alerts shifted from error_count to validation_staleness, which had a 2.6% false-positive rate.
What I Would Do Differently
I would never again mix retry logic with business-score computation in the same language runtime. The Node.js workers should never have been asked to sleep inside an async queue. If we had run the two pipelines from day one, we would have saved three weeks of on-call time and avoided a 14% drop in user engagement while the scoreboard was stale.
Second, I would replace the simple HEAD filter with a lightweight feature store that keeps per-URL historical status codes. When a new 410 appears, we can check the median time-before-death for that host; if it's less than 48 hours, we treat it as a transient purge and re-queue after 15 minutes instead of pruning. That would cut our prune rate from 12% to 2.3% without changing the architecture.
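The decision rule is small enough to sketch in Python. This is a proposal, not something we have built: the function name, the shape of the history (a list of hours each earlier URL on the host lived before its 410), and the choice to prune when a host has no history are all assumptions.

```python
from statistics import median

PRUNE = "prune"
REQUEUE = "requeue_in_15_min"


def decide_on_410(host_lifetimes_h: list[float], threshold_h: float = 48.0) -> str:
    """Pick an action for a new 410 based on the host's history.

    host_lifetimes_h holds, for each earlier URL on this host that died,
    how many hours it lived before returning 410.
    """
    if host_lifetimes_h and median(host_lifetimes_h) < threshold_h:
        return REQUEUE  # short-lived deaths look like a transient purge
    return PRUNE        # no history, or the host's URLs usually live long


print(decide_on_410([2.0, 10.0, 30.0]))     # requeue_in_15_min
print(decide_on_410([50.0, 100.0, 200.0]))  # prune
print(decide_on_410([]))                    # prune
```

Using the median rather than the mean keeps one long-lived outlier from masking a host that routinely purges and restores content.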
Finally, I would expose the two pipelines in our Grafana dashboards not as success/failure counts but as graph prune rate vs. user engagement delta. Once the business saw that a 1% rise in prune rate correlated with a 3% drop in daily active users, the argument for more validation compute became trivial.