Every AI system eventually fails.
The interesting question isn't whether failure happens.
The interesting question is:
What remains usable after failure occurs?
When I review AI architectures, I often see teams investing heavily in model quality, retrieval performance, agent capabilities, and automation workflows.
Far fewer teams spend time designing failure paths.
That's a mistake.
In production, users experience both the success path and the failure path.
If you've only designed one of them, you've only designed half the system.
The Architecture Test I Use
I use a simple exercise during architecture reviews.
I remove one dependency.
Then I ask the team what happens next.
For example:
Remove the LLM.
Remove the vector database.
Remove the CRM connection.
Remove the authentication provider.
Remove the document store.
Most architecture diagrams look great until this exercise begins.
That's when hidden assumptions start appearing.
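The exercise can also be run on paper before anyone touches infrastructure. Here is a minimal Python sketch, assuming you can list which dependencies each user-facing capability needs. The capability names and the dependency map are hypothetical and only illustrate the idea.

```python
# Hypothetical capability map: each user-facing capability lists the
# dependencies it cannot work without. All names are illustrative.
CAPABILITIES = {
    "generate_answer": {"llm", "auth"},
    "search_documents": {"vector_db", "auth"},
    "view_past_reports": {"document_store", "auth"},
    "lookup_customer": {"crm", "auth"},
}

ALL_DEPENDENCIES = {"llm", "vector_db", "crm", "auth", "document_store"}


def surviving_capabilities(removed: str) -> list[str]:
    """Return the capabilities that still work when one dependency is removed."""
    return sorted(
        name for name, needs in CAPABILITIES.items() if removed not in needs
    )


def removal_report() -> dict[str, list[str]]:
    """Run the exercise once per dependency."""
    return {dep: surviving_capabilities(dep) for dep in sorted(ALL_DEPENDENCIES)}


if __name__ == "__main__":
    for dep, alive in removal_report().items():
        print(f"remove {dep:<15} -> still usable: {alive or 'nothing'}")
```

In this made-up map, removing the authentication provider leaves nothing usable. That is exactly the kind of hidden assumption the exercise is meant to surface.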
A Typical AI Stack
A simplified enterprise AI system usually looks something like this:
User
│
▼
Gateway
│
▼
Agent Layer
│
├── Retrieval Layer
│
├── Tool Layer
│
└── Model Layer
│
▼
Business Systems
The problem is obvious.
Every box is a potential failure point.
The mistake many teams make is assuming every component must succeed for the user to receive value.
That's rarely true.
Failure Mode #1: The Model Is Unavailable
Most teams panic here.
I don't.
Because not every task requires generation.
Imagine an outage affects your primary model provider.
Can users still:
- Search documents?
- Access previous reports?
- View historical outputs?
- Export data?
If the answer is no, you've coupled too much functionality to the model.
A model outage shouldn't automatically become a platform outage.
Those are different events.
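A rough sketch of that decoupling, assuming a single request handler that routes by action. `call_model`, `search_documents`, and `ModelUnavailable` are hypothetical stand-ins, not any provider's real API. Only the generation path touches the model, so an outage degrades one action instead of the whole platform.

```python
from dataclasses import dataclass


class ModelUnavailable(Exception):
    """Raised by the (hypothetical) model client during a provider outage."""


@dataclass
class Response:
    ok: bool
    body: str


def call_model(prompt: str) -> str:
    # Stand-in for a real provider call; here it simulates an outage.
    raise ModelUnavailable("primary model provider is down")


def search_documents(query: str) -> str:
    # Stand-in for a search path that never touches the model.
    return f"3 documents match '{query}'"


def handle(action: str, payload: str) -> Response:
    """Route a request; only the generation path depends on the model."""
    if action == "search":
        return Response(True, search_documents(payload))
    if action == "generate":
        try:
            return Response(True, call_model(payload))
        except ModelUnavailable:
            return Response(
                False,
                "Generation is temporarily unavailable. "
                "Search, past reports, and exports still work.",
            )
    return Response(False, f"Unknown action: {action}")
```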
Failure Mode #2: Retrieval Stops Working
This one is more dangerous.
Because many systems continue answering.
The user sees a response.
The response looks confident.
The response may be completely disconnected from company knowledge.
That's a trust problem.
My preferred behavior is:
Retrieval Status: Failed
Answer Mode: General Knowledge Only
The system should become less capable.
Not less honest.
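One way to carry that status alongside the answer, sketched in Python. The `retrieve` and `generate` callables, the `RetrievalError` exception, and the field names are assumptions made for illustration. The point is that the answer object records how it was produced, so the UI can show it.

```python
from dataclasses import dataclass


class RetrievalError(Exception):
    """Raised by the (hypothetical) retriever when the index is unreachable."""


@dataclass
class Answer:
    text: str
    retrieval_status: str  # "ok" or "failed"
    answer_mode: str       # "grounded" or "general_knowledge_only"


def answer(question, retrieve, generate) -> Answer:
    """Answer a question, recording whether company knowledge was used."""
    try:
        passages = retrieve(question)
    except RetrievalError:
        # Less capable, not less honest: answer without context and say so.
        return Answer(generate(question, []), "failed", "general_knowledge_only")
    # An empty result is not a failure, but the answer is still ungrounded.
    mode = "grounded" if passages else "general_knowledge_only"
    return Answer(generate(question, passages), "ok", mode)
```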
Failure Mode #3: Tool Execution Fails
Agent systems introduce another challenge.
Consider this flow:
Agent
│
├── CRM
├── Ticketing
├── Billing
└── Knowledge Base
What happens if billing becomes unavailable?
A fragile architecture fails the whole request.
A resilient architecture returns partial results.
Example:
✓ Customer Profile
✓ Support History
✗ Billing Information
The user still receives value.
The system remains operational.
Only one capability degrades.
That's the outcome we want.
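A minimal sketch of partial results using Python's standard `asyncio`. The three fetch functions are hypothetical stand-ins for real tool calls, and the billing one simulates an outage. `asyncio.gather(..., return_exceptions=True)` hands failures back as values instead of raising, which is what lets the other results survive.

```python
import asyncio


async def fetch_profile(customer_id: str) -> dict:
    return {"name": "Example Customer"}


async def fetch_support_history(customer_id: str) -> dict:
    return {"open_tickets": 2}


async def fetch_billing(customer_id: str) -> dict:
    # Simulated outage in the billing tool.
    raise ConnectionError("billing service unavailable")


TOOLS = {
    "Customer Profile": fetch_profile,
    "Support History": fetch_support_history,
    "Billing Information": fetch_billing,
}


async def gather_customer_view(customer_id: str, timeout: float = 2.0) -> dict:
    """Call every tool concurrently and keep whatever succeeds."""
    calls = [asyncio.wait_for(tool(customer_id), timeout) for tool in TOOLS.values()]
    # return_exceptions=True returns failures as values instead of raising,
    # so one failing tool does not discard the other results.
    outcomes = await asyncio.gather(*calls, return_exceptions=True)
    view = {"results": {}, "failed": []}
    for name, outcome in zip(TOOLS, outcomes):
        if isinstance(outcome, Exception):
            view["failed"].append(name)
        else:
            view["results"][name] = outcome
    return view


if __name__ == "__main__":
    view = asyncio.run(gather_customer_view("c-123"))
    for name in TOOLS:
        print("✗" if name in view["failed"] else "✓", name)
```

The per-call timeout matters as much as the exception handling: a tool that hangs is also a failed tool, and without a deadline it would stall the whole response.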
Designing For Partial Success
One architectural principle I strongly believe in:
Partial success is usually better than complete failure.
Yet many AI workflows are designed as all-or-nothing chains.
One step fails.
Everything fails.
This works in demos.
It performs poorly in production.
Real systems should be designed around survivability.
Not perfection.
The Importance Of Trust Signals
Users don't expect technology to be perfect.
They expect transparency.
The fastest way to destroy trust is hiding degradation.
The system knows retrieval failed.
The user doesn't.
The system knows a tool timed out.
The user doesn't.
The system knows context is incomplete.
The user doesn't.
That creates invisible risk.
Instead, degradation should be visible.
Not alarming.
Just visible.
Something as simple as:
"Response generated without access to internal knowledge sources."
can dramatically improve trust.
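A small sketch of how such notices could be attached, assuming the backend reports degradations as short codes. The codes and all wording other than the notice quoted above are hypothetical.

```python
# Hypothetical degradation codes mapped to calm, user-facing notices.
NOTICES = {
    "retrieval_failed": "Response generated without access to internal knowledge sources.",
    "tool_timeout": "Some live data could not be loaded, so parts of this answer may be incomplete.",
    "context_truncated": "Only part of the relevant context was available for this answer.",
}


def render(answer_text: str, degradations: list[str]) -> str:
    """Append a visible notice for every known degradation, in order."""
    notes = [NOTICES[code] for code in degradations if code in NOTICES]
    if not notes:
        return answer_text
    return answer_text + "\n\n" + "\n".join(f"Note: {n}" for n in notes)
```

Keeping the wording in one table makes it easy to review the tone: visible, but not alarming.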
What I Look For In Architecture Reviews
When evaluating AI systems, I look for six things:
- Dependency isolation: Can one failure remain isolated?
- Fallback behavior: What happens next?
- User visibility: Can users see degraded states?
- Partial execution: Can workflows continue?
- Recovery mechanisms: How does the system return to normal?
- Monitoring: Can operators detect degradation before users complain?
Most systems have answers for the first question.
Very few have answers for all six.
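Several of these questions can be answered by one well-known pattern, the circuit breaker. What follows is a minimal Python sketch of it, not a production implementation. The thresholds, the injected `clock`, and the `fallback` callable are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, func, fallback):
        """Run func unless the circuit is open; use fallback on failure."""
        if self.state == "open":
            return fallback()  # fail fast and keep the failure isolated
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            return fallback()
        # Success: return to normal operation.
        self.failures = 0
        self.opened_at = None
        return result
```

The `state` property doubles as a monitoring signal: exporting it per dependency lets operators see an open circuit before users report anything.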
The Wrong Goal
The goal is not preventing failure.
That isn't realistic.
Cloud providers fail.
Models fail.
APIs fail.
People fail.
The goal is preventing failure from becoming catastrophe.
There is a huge difference.
One is an incident.
The other is a business outage.
Good architecture understands that difference.
Closing Note
When people evaluate AI systems, they usually ask:
"How smart is it?"
I think a more useful question is:
"How useful is it on its worst day?"
Because production environments don't reward perfection.
They reward resilience.
And resilience is not something you add later.
It's something you design from the beginning.