Every time an enterprise AI system feels slow, somebody eventually says the same thing:
"We need a faster model."
Maybe.
But after reviewing enough production deployments, I've noticed something interesting.
The model is rarely the first problem.
It's usually the most visible problem.
There is a difference.
A team spends months debating GPT versus Claude versus open-source alternatives.
Meanwhile, nobody can explain where the first three seconds of latency are coming from.
That's backwards.
Before discussing models, I want to see a latency budget.
If there isn't one, we're guessing.
The Question I Ask First
Imagine a user submits a query.
The answer appears six seconds later.
What happened during those six seconds?
Most teams can't answer that precisely.
They know the system feels slow.
They don't know which component is responsible.
That's like trying to reduce fuel consumption without knowing whether the engine, tires, or driver is causing the problem.
You cannot optimize what you haven't measured.
Where The Time Actually Goes
A typical enterprise AI request is not a single operation.
It's a chain.
Query arrives.
Authentication happens.
Retrieval starts.
Results get ranked.
Context gets assembled.
The model generates.
The response gets formatted.
The answer is delivered.
Every step consumes part of the budget.
The mistake is assuming the model owns most of it.
Sometimes it does.
Sometimes it doesn't.
I've reviewed systems where retrieval consumed more time than generation.
I've reviewed others where logging pipelines were slower than inference.
The model got blamed anyway.
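Here is a minimal sketch of what measuring the chain can look like, so the blame lands on the right stage. The `StageTimer` class, the stage names, and the sleep durations are all illustrative assumptions, not taken from any real system; `time.sleep` stands in for actual work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock time per named stage of a single request."""

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def breakdown(self):
        """Return (stage, ms, share_of_total) rows, slowest stage first."""
        total = sum(self.durations_ms.values()) or 1.0
        rows = [(name, ms, ms / total) for name, ms in self.durations_ms.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


def handle_request(timer):
    # Hypothetical stages; the sleeps are made-up stand-ins for real work.
    with timer.stage("retrieval"):
        time.sleep(0.03)
    with timer.stage("generation"):
        time.sleep(0.01)
    with timer.stage("logging"):
        time.sleep(0.02)


timer = StageTimer()
handle_request(timer)
for name, ms, share in timer.breakdown():
    print(f"{name:<12}{ms:8.1f} ms {share:7.1%}")
```

In a real service the same numbers would more likely come from distributed tracing spans, but even a crude per-stage timer like this answers the question of which component owns the time.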
The Most Expensive 500 Milliseconds In AI
If I had to pick one place where teams accidentally destroy latency budgets, it would be re-ranking.
That's because re-ranking usually enters the architecture late.
The conversation often goes like this:
Retrieval quality isn't good enough.
Someone suggests a re-ranker.
The quality improves.
Everyone celebrates.
Then response times suddenly increase.
Nobody updated the budget.
The architecture absorbed another dependency without accounting for its cost.
The quality gain was real.
The latency cost was real too.
Only one of those was measured.
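One way to make sure both get measured is to write the budget down as data and make every new stage ask for an allocation. The sketch below assumes a hypothetical `LatencyBudget` class, a 3-second total, and invented per-stage numbers; only the 500 ms re-ranker cost echoes the heading above.

```python
from dataclasses import dataclass, field


@dataclass
class LatencyBudget:
    """A total latency target plus the share each stage is allowed to spend."""

    total_ms: float
    allocations_ms: dict = field(default_factory=dict)

    def remaining_ms(self):
        return self.total_ms - sum(self.allocations_ms.values())

    def add_stage(self, name, cost_ms):
        """Allocate cost_ms to a new stage, or refuse if nothing can pay for it."""
        remaining = self.remaining_ms()
        if cost_ms > remaining:
            raise ValueError(
                f"{name} needs {cost_ms} ms but only {remaining} ms is unallocated"
            )
        self.allocations_ms[name] = cost_ms


# Illustrative numbers only; a real budget comes from measurement.
budget = LatencyBudget(
    total_ms=3000,
    allocations_ms={
        "auth": 50,
        "retrieval": 400,
        "context_assembly": 100,
        "generation": 2200,
        "formatting": 50,
    },
)

try:
    budget.add_stage("rerank", 500)
except ValueError as err:
    print(err)  # the re-ranker does not fit until another stage gives up time

# Someone has to pay: here generation is trimmed by 300 ms to make room.
budget.allocations_ms["generation"] = 1900
budget.add_stage("rerank", 500)
print(budget.remaining_ms())
```

The point of the refusal is not to block the re-ranker. It is to force the conversation about which stage gives up time before the dependency ships.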
Why Averages Are Dangerous
One metric I almost never trust is average latency.
Averages make bad systems look healthy.
Imagine this:
90% of requests complete in two seconds.
10% take fifteen seconds.
The average is 3.3 seconds, which looks acceptable.
The user experience doesn't.
Users remember the frustrating interactions.
Not the average.
This is why I care about p95 and p99 much more than the average or p50.
Production trust is built at the edges.
Not in the middle.
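The 90/10 example above is easy to check. This sketch uses a hand-written nearest-rank `percentile` helper (one of several common percentile definitions, chosen here for simplicity) on 100 synthetic requests that match the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data: 90 requests at 2 seconds, 10 requests at 15 seconds.
latencies_s = [2.0] * 90 + [15.0] * 10

mean_s = sum(latencies_s) / len(latencies_s)
print(f"mean: {mean_s:.1f} s")                      # mean: 3.3 s
print(f"p50:  {percentile(latencies_s, 50):.1f} s")  # p50:  2.0 s
print(f"p95:  {percentile(latencies_s, 95):.1f} s")  # p95:  15.0 s
print(f"p99:  {percentile(latencies_s, 99):.1f} s")  # p99:  15.0 s
```

The mean and the median both describe a system that one in ten users never experiences. The p95 and p99 describe the one they remember.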
Latency Is An Architecture Problem
This is the part many teams miss.
Latency is not a model problem.
Latency is not a retrieval problem.
Latency is not an infrastructure problem.
Latency is an architecture problem.
Because architecture determines how those pieces interact.
A slow component can be acceptable.
Five acceptable components chained together often aren't.
That's why latency budgets need to exist before implementation begins.
Not after users start complaining.
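A rough way to see why five acceptable components can add up to an unacceptable request: suppose each stage is slow only 5% of the time, and suppose the stages misbehave independently of each other. Both are simplifying assumptions for illustration, and the function name is hypothetical; real stages often share causes of slowness.

```python
def chance_of_slow_request(stage_slow_rates):
    """Probability that at least one stage in a sequential chain is slow,
    assuming each stage is slow independently of the others."""
    all_on_time = 1.0
    for rate in stage_slow_rates:
        all_on_time *= 1.0 - rate
    return 1.0 - all_on_time


one_stage = chance_of_slow_request([0.05])
five_stages = chance_of_slow_request([0.05] * 5)

print(f"one stage:   {one_stage:.1%}")    # one stage:   5.0%
print(f"five stages: {five_stages:.1%}")  # five stages: 22.6%
```

Under those assumptions, a chain of five stages that are each fine 95% of the time delivers a fully on-time request only about 77% of the time. No single component is the problem. The way they are chained is.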
My Rule
Before adding any new capability to an AI system, I ask one question:
"Which part of the latency budget will pay for this?"
If nobody knows the answer, the feature probably isn't ready.
Because every feature consumes resources.
Every dependency introduces cost.
Every architectural decision spends part of the user's patience.
And user patience is usually the smallest budget in the entire system.