The Hidden Cost of Calling AI Too Early
I stopped calling AI on every request — and everything got better.
The Problem
In one of my projects, I was generating AI-based insights from user activity.
The initial design was simple:
Every request for today’s insight → call the AI model → return a fresh response.
```
GET /api/insights/today
```
At first, this felt clean and correct.
But in practice, it created serious problems:
- 429 rate limit errors within hours
- Daily quota exhausted before noon
- Random failures affecting users
- Costs scaling linearly with traffic
The system was working — but it wasn’t sustainable.
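As a minimal sketch of that naive design (the names here, like `NaiveInsightService`, are illustrative, not the project's real code), every request translated directly into a model call:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Naive design: every request triggers a model call, with no gating.
class NaiveInsightService {
    // Stand-in for the real AI client; a counter makes the cost visible.
    final AtomicInteger aiCalls = new AtomicInteger();

    String getTodayInsight(String userId) {
        aiCalls.incrementAndGet();            // one model call per request
        return "fresh insight for " + userId; // placeholder response
    }
}
```

Ten requests mean ten model calls, so cost and rate-limit pressure scale one-to-one with traffic.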
The Real Issue
The problem wasn’t the AI provider.
It was the trigger model.
The system never asked basic questions before making an expensive call:
- Has anything actually changed?
- Did I already generate a response recently?
- Is the user even active today?
Without these checks, every request was treated as:
“Generate a new insight now.”
That assumption was the real bug.
The New Approach
Instead of adding caching on top, I redesigned the system into an event-driven pipeline.
AI became the last step, not the default.
System Flow
Here’s the simplified request flow:
```mermaid
flowchart TD
    A[Request for today's insight] --> B{Activity today?}
    B -- No --> C[Reuse latest insight or fallback]
    B -- Yes --> D{Meaningful change?}
    D -- No --> C
    D -- Yes --> E{Cooldown passed?}
    E -- No --> C
    E -- Yes --> F{Daily cap reached?}
    F -- Yes --> C
    F -- No --> G{Global AI limit reached?}
    G -- Yes --> H[Use deterministic fallback]
    G -- No --> I[Call AI model]
    I --> J[Persist insight]
    H --> J
    C --> J
```
Most requests now end at a simple database read — not an AI call.
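The decision chain above can be condensed into a single guard. This is a sketch, not the project's actual class; the field and method names are illustrative:

```java
// Mirrors the flowchart: AI runs only when every cheap check passes, in order:
// activity -> meaningful change -> cooldown -> per-user cap -> global limit.
class InsightGate {
    final boolean activityToday;
    final boolean meaningfulChange;
    final boolean cooldownPassed;
    final boolean underDailyCap;
    final boolean underGlobalLimit;

    InsightGate(boolean activityToday, boolean meaningfulChange,
                boolean cooldownPassed, boolean underDailyCap,
                boolean underGlobalLimit) {
        this.activityToday = activityToday;
        this.meaningfulChange = meaningfulChange;
        this.cooldownPassed = cooldownPassed;
        this.underDailyCap = underDailyCap;
        this.underGlobalLimit = underGlobalLimit;
    }

    // A single false anywhere short-circuits to the cached/fallback path.
    boolean shouldCallAi() {
        return activityToday && meaningfulChange && cooldownPassed
            && underDailyCap && underGlobalLimit;
    }
}
```

Because the checks are ordered cheapest-first, most requests never get past the first one or two.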
The Five-Layer Redesign
1. Activity Gate
Start with the cheapest check:
```java
boolean hasActivity = activityService.hasActivityToday(userId, context);
if (!hasActivity) {
    return getLatestOrFallback(userId, today);
}
```
If nothing happened → don’t call AI.
2. Event-Driven Triggers
AI should only run when something meaningful changes.
Examples:
- user updates intent
- significant behavior change
- threshold crossed
No change → reuse previous insight.
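One way to sketch the "meaningful change" test (the threshold of 30 matches the `activity-delta` setting shown later; the score inputs are hypothetical):

```java
// Regenerate only when the activity delta crosses a configured threshold.
class ChangeDetector {
    static final int ACTIVITY_DELTA_THRESHOLD = 30;

    // previousScore: activity score at last generation;
    // currentScore: activity score now. Both are illustrative metrics.
    static boolean isMeaningfulChange(int previousScore, int currentScore) {
        return Math.abs(currentScore - previousScore) >= ACTIVITY_DELTA_THRESHOLD;
    }
}
```

Anything below the threshold reuses the previous insight instead of spending a model call.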
3. Cooldown Window
Avoid frequent re-generation:
```java
Duration cooldown = Duration.ofMinutes(30);
Duration elapsed = Duration.between(lastGeneratedAt, Instant.now());
if (elapsed.compareTo(cooldown) < 0) {
    return getLatestOrFallback(userId, today);
}
```
This prevents unnecessary repeated calls.
4. Per-User Daily Cap
```java
if (todayCount >= 10) {
    return getLatestOrFallback(userId, today);
}
```
Even active users shouldn’t trigger unlimited AI calls.
5. Global AI Guard
```java
if (dailyAiCalls.get() >= 50) {
    useFallback = true;
}
```
This acts as a system-wide circuit breaker.
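A fuller sketch of the guard, assuming a shared `AtomicInteger` budget (class and method names are illustrative; a real service would also reset the counter at midnight):

```java
import java.util.concurrent.atomic.AtomicInteger;

// System-wide circuit breaker: once the daily budget is spent,
// every caller switches to the deterministic fallback path.
class GlobalAiGuard {
    static final int MAX_AI_CALLS_PER_DAY = 50;
    final AtomicInteger dailyAiCalls = new AtomicInteger();

    // Returns true if the caller may spend one AI call; false means fallback.
    boolean tryAcquire() {
        // incrementAndGet is atomic, so concurrent requests cannot overshoot
        // the budget by racing between a separate read and write.
        return dailyAiCalls.incrementAndGet() <= MAX_AI_CALLS_PER_DAY;
    }
}
```

Checking and incrementing in one atomic step matters here; a `get()` followed by a separate increment would let concurrent requests slip past the cap.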
Configuration
All thresholds are configurable:
```yaml
insight:
  activity-delta: 30
  cooldown-minutes: 30
  daily-cap-per-user: 10
  max-ai-calls-per-day: 50
  freshness-window-hours: 8
```
This allows tuning without redeploying code.
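On the code side, these settings can be bound to a single typed object. The record below is a sketch whose defaults mirror the sample values; in a Spring app this would more likely be a `@ConfigurationProperties` bean:

```java
// Typed view of the insight thresholds; field names mirror the YAML keys.
record InsightConfig(int activityDelta, int cooldownMinutes,
                     int dailyCapPerUser, int maxAiCallsPerDay,
                     int freshnessWindowHours) {
    // Defaults matching the sample configuration above.
    static InsightConfig defaults() {
        return new InsightConfig(30, 30, 10, 50, 8);
    }
}
```

Keeping every threshold in one place makes it easy to tune the gates in production without touching the gating logic itself.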
What Changed
After this redesign:
- AI calls dropped from ~100/day → ~5–10/day
- Rate limit errors disappeared
- Most requests became fast database reads
- Free-tier usage became sustainable
- System behavior became more predictable
Engineering Takeaway
AI should be the exception, not the rule.
A well-designed backend should first decide:
“Is this request even worth sending to the model?”
That decision layer — gating, triggers, cooldowns — is where the real engineering happens.
Final Thought
If most requests can be handled using deterministic logic or cached state:
Do that first.
Use AI only when it actually adds value.
That single shift can make your system:
- cheaper
- faster
- more reliable
- and much easier to scale
## Blog link
https://anupamkushwaha.me/blog/stopped-calling-ai-on-every-request