We built a free status monitor for 77 AI APIs. Here's what 6 weeks of data taught us.

#ai #webdev #cloudflare #devops

Every AI developer has been here: your app is throwing 503s, users are pinging you, and you have 12 browser tabs open (the OpenAI status page, the Anthropic status page, the GitHub Copilot health page, three different Discord servers), all to answer one question: is it me, or is it them?

That's the problem we set out to solve. Prismix aggregates status from 77 AI services in one place. Six weeks of running it in production taught us some things that might save you time.

The problem is worse than you think

AI APIs don't fail like traditional infrastructure. They fail in weird, partial ways:

Degraded performance that passes your health checks but makes your product feel broken
Regional outages — OpenAI US-East is down while EU is fine, so half your users are affected
Silent rate-limit cascades — the API returns 429s but their status page says "operational" for another 20 minutes
Incident lag — providers often post status updates 10–30 minutes after engineers are already aware

The official status pages are optimistic by design. They're customer-facing communications tools, not real-time engineering dashboards. There's nothing wrong with this — but it means you need a different mental model for "is this service down?"

What 77 status pages look like in aggregate

When you watch 77 AI services simultaneously, patterns emerge fast.

OpenAI is the most-watched service (and has the most incidents to watch). The pattern is almost always the same: investigating → identified → monitoring → resolved, typically in 45–90 minutes. The investigating phase is where most developers panic — it looks bad but usually resolves without action on your end.

Anthropic runs noticeably clean given how fast its API usage has grown. Incidents are rarer and shorter, and when they do happen, updates arrive faster than they do from most providers.

The long tail is interesting. Services like Replicate, Runway, ElevenLabs, and Suno have incident patterns that don't correlate with OpenAI at all. If you're routing across multiple providers for redundancy, these are genuinely independent failure domains — worth knowing.

The "silent degradation" problem is real. Multiple times we've seen a service show "operational" on its status page while our uptime probe was timing out. This is the main reason Prismix shows a latency sparkline per service: the status page is authoritative for announced incidents, but the probe catches the ones that haven't been announced yet.
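The silent-degradation case can be expressed as a small rule that combines the two signals. This is a minimal sketch, not Prismix's actual logic: the type names, the 5-second latency budget, and the "at least half of recent probes are bad" threshold are all assumptions chosen for illustration.

```typescript
// Hypothetical types: field names are illustrative, not Prismix's schema.
type PageStatus = "operational" | "degraded" | "outage";

interface ProbeSample {
  ok: boolean;       // false if the probe request timed out or errored
  latencyMs: number; // elapsed time for the probe request
}

type Verdict = "healthy" | "announced-incident" | "silent-degradation";

// ASSUMPTION: a service counts as silently degraded when the status page says
// "operational" but at least half of the recent probes failed or ran over budget.
function classify(
  page: PageStatus,
  recent: ProbeSample[],
  latencyBudgetMs = 5000,
): Verdict {
  if (page !== "operational") return "announced-incident";
  if (recent.length === 0) return "healthy"; // no probe data, so trust the page
  const bad = recent.filter(
    (s) => !s.ok || s.latencyMs > latencyBudgetMs,
  ).length;
  return bad * 2 >= recent.length ? "silent-degradation" : "healthy";
}
```

Requiring several bad samples rather than one keeps a single slow request from flagging a service that is otherwise fine.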

What Prismix built (and why it's free)

Prismix pulls from official status pages, aggregates them into a single dashboard, and adds a few things that the individual pages don't have:

Per-service latency probes — 24-hour sparklines showing actual response times, not just announced incidents. This catches the "silent degradation" cases.

Cross-service incident timeline — /incidents shows everything that happened across all 77 services in one scrollable feed. Useful for postmortems ("was anything else degraded when our error rate spiked at 3pm Tuesday?").
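The postmortem question above boils down to an interval-overlap check. Here is a rough sketch of that check; the `Incident` shape and the `incidentsDuring` helper are hypothetical and do not describe the actual format of the /incidents feed.

```typescript
// Hypothetical incident shape, for illustration only.
interface Incident {
  service: string;
  startedAt: string;         // ISO 8601 timestamp
  resolvedAt: string | null; // null while the incident is still open
}

// Returns every incident that was active at any point inside the window.
function incidentsDuring(
  incidents: Incident[],
  windowStart: string,
  windowEnd: string,
): Incident[] {
  const from = Date.parse(windowStart);
  const to = Date.parse(windowEnd);
  return incidents.filter((i) => {
    const start = Date.parse(i.startedAt);
    // An open incident is treated as lasting indefinitely.
    const end = i.resolvedAt === null ? Infinity : Date.parse(i.resolvedAt);
    // Two intervals overlap when each one starts before the other ends.
    return start <= to && end >= from;
  });
}
```

Calling it with the ten minutes around your own error spike gives the list of providers worth mentioning in the postmortem.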

Embeddable status badges — put a live "OpenAI: operational" badge in your own app's status page with one line of HTML.

Public REST API — GET /api/v1/statuses returns current status for all 77 services as JSON. No auth, no rate limit for reasonable use, CORS open. Free forever.
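A consumer of that endpoint might look something like the sketch below. The response schema is not documented in this post, so the `ServiceStatus` shape (a JSON array of objects with `name` and `status` strings) is purely an assumption, which is why the code validates the payload at runtime instead of trusting it. The base URL is taken from the live site address.

```typescript
// ASSUMPTION: the endpoint returns a JSON array of objects shaped like this.
// Check the real response before relying on these field names.
interface ServiceStatus {
  name: string;
  status: string; // e.g. "operational"
}

// Runtime guard, since the real payload shape is not confirmed here.
function isServiceStatus(x: unknown): x is ServiceStatus {
  if (typeof x !== "object" || x === null) return false;
  const r = x as Record<string, unknown>;
  return typeof r.name === "string" && typeof r.status === "string";
}

// Keeps only well-formed entries whose status is not "operational".
function nonOperational(payload: unknown): ServiceStatus[] {
  if (!Array.isArray(payload)) return [];
  return payload
    .filter(isServiceStatus)
    .filter((s) => s.status !== "operational");
}

// Fetches the public endpoint and returns the services that need attention.
async function fetchTroubled(
  baseUrl = "https://prismix.dev",
): Promise<ServiceStatus[]> {
  const res = await fetch(`${baseUrl}/api/v1/statuses`);
  if (!res.ok) throw new Error(`status API returned HTTP ${res.status}`);
  return nonOperational(await res.json());
}
```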

RSS feed — /incidents.rss if you want AI incident updates in your feed reader.

It's free because it runs entirely on Cloudflare's free tier (Workers + KV). The Pro tier ($10/mo) adds email and webhook alerts for services you care about, but the core dashboard stays free.

The technical part (because this is dev.to)

The stack is Astro 5 SSR + Cloudflare Workers + KV. We wrote about the performance walls we hit in a previous post. The short version: issuing 77 parallel KV reads per request is a bad idea, and reading a single pre-aggregated snapshot blob is much better.
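The difference between the two read patterns is easy to see in code. The sketch below uses a minimal `KVLike` interface that covers only the string `get` and `put` calls; the key names (`status:<id>`, `snapshot`) and the helper functions are hypothetical, not the ones Prismix uses.

```typescript
// Minimal subset of a KV namespace: string get and put only.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Slow path: one KV read per service on every request.
async function readPerService(
  kv: KVLike,
  ids: string[],
): Promise<Record<string, string>> {
  const values = await Promise.all(ids.map((id) => kv.get(`status:${id}`)));
  const out: Record<string, string> = {};
  ids.forEach((id, i) => {
    const v = values[i];
    if (typeof v === "string") out[id] = v;
  });
  return out;
}

// Fast path, write side: the cron job stores every status under one key.
async function writeSnapshot(
  kv: KVLike,
  statuses: Record<string, string>,
): Promise<void> {
  await kv.put("snapshot", JSON.stringify(statuses));
}

// Fast path, read side: each request costs exactly one KV read.
async function readSnapshot(kv: KVLike): Promise<Record<string, string>> {
  const raw = await kv.get("snapshot");
  return raw === null ? {} : (JSON.parse(raw) as Record<string, string>);
}
```

The aggregation work moves from the request path to the cron job, so request latency no longer depends on the number of tracked services.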

One thing that surprised us: KV's free tier gives you 100,000 reads per day but only 1,000 writes per day. The cron job that refreshes status runs every 5 minutes, so every write is conditional: only write if the content actually changed. That dropped writes from ~8,400/day to ~600/day. Monitoring infrastructure has to be cheap to run, otherwise the incentive to keep it free disappears.
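A conditional write can be as simple as spending one cheap read to avoid one scarce write. The following is a minimal sketch under that assumption; `KVStore`, `serializeStatuses`, and `putIfChanged` are hypothetical names, and the real job may compare hashes or use a different key layout.

```typescript
// Minimal subset of a KV namespace: string get and put only.
interface KVStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Serialize with sorted keys so the same statuses always produce the same
// string, regardless of the order in which they were collected.
function serializeStatuses(statuses: Record<string, string>): string {
  const entries = Object.entries(statuses).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0,
  );
  const ordered: Record<string, string> = {};
  for (const [key, value] of entries) ordered[key] = value;
  return JSON.stringify(ordered);
}

// Writes only when the stored value differs. Returns true if a write happened.
// A 5-minute cron fires 24 * 60 / 5 = 288 times a day, so skipping unchanged
// writes is what keeps the job inside a 1,000-writes-per-day budget.
async function putIfChanged(
  kv: KVStore,
  key: string,
  statuses: Record<string, string>,
): Promise<boolean> {
  const next = serializeStatuses(statuses);
  const current = await kv.get(key);
  if (current === next) return false; // unchanged: no write spent
  await kv.put(key, next);
  return true;
}
```

Note that a "last checked" timestamp stored inside the compared value would make every run look like a change, so anything that changes on every run has to stay out of the comparison.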

What we don't know yet — and why we're writing this

Six weeks in, Prismix tracks 77 services with a clean incident timeline and growing usage. What we don't have yet is signal on what matters to you.

Some things we're genuinely uncertain about:

Which services are missing? The list is opinionated — mostly LLM APIs, popular AI tools, and infrastructure adjacent to them. We've probably missed something obvious in your stack.
Is the latency probe useful? It tells you "this service is slow right now" but not "slow compared to what" — no historical baseline yet.
What would make you actually use this every day? A Slack bot? A PagerDuty integration? Something in your terminal?

If any of that resonates, drop a comment. Honest feedback shapes what gets built next.

Live at prismix.dev.

Also at Prismix: an MCP server directory with 500+ servers and a curated AI news feed — but the status monitoring is the part we're most curious to hear about.