Retell Logged 31 Outages in 11 Months. What Fallback Architecture Should Look Like.

#ai #architecture #voice #devops

Originally published on the BuildWithHermes blog, with the full incident table and fallback checklist.

Per StatusGator, more than 31 outages have affected Retell AI users in the past 12 months. IsDown, tracking since January 2026, has logged roughly 50 incidents in about five months. The most recently acknowledged outage on Retell's official status page was April 13, 2026.

These numbers do not mean Retell is a bad product. They mean it is an infrastructure dependency, and every infrastructure dependency fails. The question is whether your architecture assumes that.

Most agencies running voice AI on Retell do not have a fallback plan. They have a single platform, a set of client agents pointing at it, and no answer for what happens to a live client call when the platform goes down mid-day.

The incident record

The incidents break into three categories:

TTS provider dependency failures. Most visibly the March 14, 2025 incident titled "TTS provider openai is down," where Retell's dependency on OpenAI's TTS propagated directly into agent failures.
Cloud infrastructure incidents. Including October 20, 2025, when an AWS outage caused Retell login and analytics failures for 4 hours 49 minutes.
Platform component degradations. Dashboard, web call, and end-to-end calling each show separate incident histories, meaning an incident may not take the whole platform offline but may take out the component your clients are actively using.

How "99.99% uptime" and 31 incidents coexist

The math does not reconcile on first read. Four nines allows roughly 52 minutes of downtime per year (0.01% of 525,600 minutes), and the October 20, 2025 incident alone lasted 4 hours 49 minutes. The answer is definitional: vendors measure uptime against full service unavailability, while third-party monitors flag any detectable degradation, including partial component failures, elevated error rates, and upstream dependency failures that cause degraded but nonzero call completion.

An event where 15% of calls fail due to an upstream TTS issue is a monitoring incident but may not trigger the uptime SLA, because most calls still complete.
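The gap between the two definitions is easy to see in a few lines of arithmetic. This is a minimal sketch: the function names, the 5% monitor threshold, and the 90-minute window are illustrative assumptions, not Retell's SLA terms or any monitor's actual rules, and "full unavailability" is assumed to mean every call failing.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(uptime_pct: float) -> float:
    """Minutes per year a service can be fully down and still meet uptime_pct."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)


def sla_downtime_minutes(failure_rates: list[float]) -> int:
    """Count minutes as downtime only when every call fails (assumed SLA definition)."""
    return sum(1 for rate in failure_rates if rate >= 1.0)


def monitor_incident_minutes(failure_rates: list[float], threshold: float = 0.05) -> int:
    """Count minutes a degradation-sensitive monitor would flag (assumed threshold)."""
    return sum(1 for rate in failure_rates if rate > threshold)


# Hypothetical data: 90 minutes in which 15% of calls fail every minute.
window = [0.15] * 90

print(round(downtime_budget_minutes(99.99), 1))  # 52.6
print(sla_downtime_minutes(window))              # 0
print(monitor_incident_minutes(window))          # 90
```

Under these assumptions the same 90 minutes consume none of the four-nines budget and still register as a 90-minute incident for anyone watching call completion.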

For an agency, that distinction is irrelevant. If 15% of your client's calls fail on a Thursday afternoon during an outbound campaign, you have a problem regardless of what the status page says. The client does not see the SLA. They see calls not connecting, and they ask you what happened.
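One way to act on that, independent of any status page, is to alert on the failure rate you observe in your own call results. The sketch below is a hypothetical helper, not part of any vendor SDK; the window size and the 10% threshold are assumptions you would tune per client.

```python
from collections import deque


class FailureRateAlert:
    """Fire when the failure rate over the last N call results exceeds a threshold."""

    def __init__(self, window_size: int = 20, threshold: float = 0.10):
        # Only the most recent window_size results are kept.
        self.results = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, succeeded: bool) -> bool:
        """Record one call result; return True if the alert should fire."""
        self.results.append(succeeded)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge
        failures = self.results.count(False)
        return failures / len(self.results) > self.threshold
```

Fed with results from real or synthetic calls, a 15% failure rate trips this alert even though most calls still complete.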

What breaks during an incident

Four things, in order of client-impact severity:

Mid-queue outbound calls do not go out. Depending on retry logic, some contacts receive no call during their designated window. For time-sensitive campaigns (appointment reminders, lead follow-up inside the 5-minute window), those contacts are simply lost.
Inbound calls go to dead air. An AI receptionist that does not pick up is worse than no receptionist, because the client turned off the human fallback when they bought the agent.
Visibility disappears. If the dashboard and analytics are part of the degradation, you cannot even tell your client what failed or how many contacts were affected.
The blame lands on you. The end-client has no relationship with Retell. Their contract is with the agency. Every upstream incident is, commercially, your incident.

What real fallback architecture looks like

Minimum viable fallback for an agency running revenue-bearing voice agents:

Independent uptime monitoring on your own agents (synthetic test calls on a schedule), not the vendor status page.
Telephony-level failover: if the agent does not answer within N seconds, the carrier routes to a human number or voicemail-with-callback. This lives at the number layer, not inside the failed platform.
Campaign engines that checkpoint and resume, so a mid-run incident produces delayed calls, not silently dropped contacts.
An incident comms template for clients, pre-written, with a defined trigger. Agencies lose accounts over silence, not over outages.
A written answer to "what is our RTO (recovery time objective)" per client. If you cannot state how long until calls flow again, you do not have an architecture; you have a hope.
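The checkpoint-and-resume item is the one most easily shown in code. What follows is a minimal sketch under stated assumptions: `run_campaign`, `place_call`, and the JSON checkpoint file are hypothetical names for illustration, contacts are plain string IDs, and `place_call` stands in for whatever function actually dials through your voice platform.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def run_campaign(
    contacts: Iterable[str],
    place_call: Callable[[str], bool],
    checkpoint_path: Path,
) -> List[str]:
    """Call each contact at most once across runs; return contacts still pending."""
    # Load the IDs completed by earlier runs, if a checkpoint exists.
    done = set(json.loads(checkpoint_path.read_text())) if checkpoint_path.exists() else set()
    pending: List[str] = []
    for contact in contacts:
        if contact in done:
            continue  # already called in a previous run
        try:
            ok = place_call(contact)
        except Exception:
            ok = False  # treat a platform error as "not called yet"
        if ok:
            done.add(contact)
            # Persist after every success so a crash loses at most one call.
            # (Not an atomic write; a production version would want one.)
            checkpoint_path.write_text(json.dumps(sorted(done)))
        else:
            pending.append(contact)
    return pending
```

Rerunning the same function with the same checkpoint file after the platform recovers dials only the contacts that did not get through, so an incident shows up as a delay and a pending list rather than as contacts that silently vanished.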

The platform-layer point

An agency can build all of this on top of a bare API. It takes engineering time the typical 1-5 person agency does not have. That is the argument for running on an operating platform where campaign checkpointing, monitoring, and client comms live in the same system as the agents, one layer above any single voice vendor.

That is the design position Hermes takes: the agency-facing operating layer (multi-tenant workspaces, campaign orchestration with resume, native CRM, one usage ledger) so a vendor incident is an infrastructure event you manage, not a client relationship you lose. Plans from $149/month with 300 included minutes.

Full incident table, sources, and the fallback checklist: buildwithhermes.com.