How I cut AI calls by 95% without losing quality

The Hidden Cost of Calling AI Too Early

I stopped calling AI on every request — and everything got better.


The Problem

In one of my projects, I was generating AI-based insights from user activity.

The initial design was simple:

Every request for today’s insight → call the AI model → return a fresh response.

```
GET /api/insights/today
```

At first, this felt clean and correct.

But in practice, it created serious problems:

  • 429 rate limit errors within hours
  • Daily quota exhausted before noon
  • Random failures affecting users
  • Costs scaling linearly with traffic

The system was working — but it wasn’t sustainable.


The Real Issue

The problem wasn’t the AI provider.

It was the trigger model.

The system never asked basic questions before making an expensive call:

  • Has anything actually changed?
  • Did I already generate a response recently?
  • Is the user even active today?

Without these checks, every request was treated as:

“Generate a new insight now.”

That assumption was the real bug.


The New Approach

Instead of adding caching on top, I redesigned the system into an event-driven pipeline.

AI became the last step, not the default.


System Flow

Here’s the simplified request flow:

```mermaid
flowchart TD
    A[Request for today's insight] --> B{Activity today?}
    B -- No --> C[Reuse latest insight or fallback]
    B -- Yes --> D{Meaningful change?}
    D -- No --> C
    D -- Yes --> E{Cooldown passed?}
    E -- No --> C
    E -- Yes --> F{Daily cap reached?}
    F -- Yes --> C
    F -- No --> G{Global AI limit reached?}
    G -- Yes --> H[Use deterministic fallback]
    G -- No --> I[Call AI model]
    I --> J[Persist insight]
    H --> J
    C --> J
```

Most requests now end at a simple database read — not an AI call.
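The whole decision chain can be sketched as one pure function. This is an illustrative condensation of the flow above, not the project's actual code; all parameter names, thresholds, and the `REUSE`/`FALLBACK`/`CALL_AI` labels are my own:

```java
import java.time.Duration;

// Sketch of the gating pipeline: every check is cheap, and the expensive
// AI call is only reachable after all gates pass.
public class InsightGate {
    static final Duration COOLDOWN = Duration.ofMinutes(30);
    static final int DAILY_CAP_PER_USER = 10;
    static final int MAX_AI_CALLS_PER_DAY = 50;

    public static String decide(boolean activityToday,
                                boolean meaningfulChange,
                                Duration sinceLastRun,
                                int userCallsToday,
                                int globalCallsToday) {
        if (!activityToday) return "REUSE";                       // nothing happened
        if (!meaningfulChange) return "REUSE";                    // nothing worth regenerating
        if (sinceLastRun.compareTo(COOLDOWN) < 0) return "REUSE"; // too soon
        if (userCallsToday >= DAILY_CAP_PER_USER) return "REUSE"; // per-user budget spent
        if (globalCallsToday >= MAX_AI_CALLS_PER_DAY) return "FALLBACK"; // system budget spent
        return "CALL_AI";
    }

    public static void main(String[] args) {
        // Active user, real change, cooldown elapsed, under both caps:
        System.out.println(decide(true, true, Duration.ofMinutes(45), 3, 12));  // CALL_AI
        // No activity today -> cheap reuse, no AI call:
        System.out.println(decide(false, true, Duration.ofMinutes(45), 3, 12)); // REUSE
        // Global budget exhausted -> deterministic fallback:
        System.out.println(decide(true, true, Duration.ofMinutes(45), 3, 50));  // FALLBACK
    }
}
```

Ordering the gates from cheapest to most expensive means the common case (no activity, or no change) never touches the cooldown or quota bookkeeping at all.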



The Five-Layer Redesign

1. Activity Gate

Start with the cheapest check:

```java
boolean hasActivity = activityService.hasActivityToday(userId, context);

if (!hasActivity) {
    return getLatestOrFallback(userId, today);
}
```

If nothing happened → don’t call AI.


2. Event-Driven Triggers

AI should only run when something meaningful changes.

Examples:

  • user updates intent
  • significant behavior change
  • threshold crossed

No change → reuse previous insight.
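One way to make "meaningful change" concrete is a threshold on the delta of a tracked activity metric (the `activity-delta` value in the configuration section hints at this). This is a hypothetical sketch; the scoring model and class names are mine:

```java
// Hypothetical delta check: a change only counts as "meaningful" when the
// tracked activity score has moved past a configured threshold since the
// last generated insight.
public class ChangeDetector {
    private final int activityDelta; // threshold, e.g. 30 from configuration

    public ChangeDetector(int activityDelta) {
        this.activityDelta = activityDelta;
    }

    public boolean isMeaningful(int lastScore, int currentScore) {
        return Math.abs(currentScore - lastScore) >= activityDelta;
    }

    public static void main(String[] args) {
        ChangeDetector detector = new ChangeDetector(30);
        System.out.println(detector.isMeaningful(100, 110)); // small drift -> false
        System.out.println(detector.isMeaningful(100, 150)); // crossed threshold -> true
    }
}
```

The point is that the trigger is data-driven and deterministic: small drift reuses the previous insight, and only a genuine shift in behavior reaches the next gate.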


3. Cooldown Window

Avoid frequent re-generation:

```java
Duration cooldown = Duration.ofMinutes(30);
Duration elapsed = Duration.between(lastGeneratedAt, Instant.now());

if (elapsed.compareTo(cooldown) < 0) {
    return getLatestOrFallback(userId, today);
}
```

This prevents unnecessary repeated calls.


4. Per-User Daily Cap

```java
if (todayCount >= 10) {
    return getLatestOrFallback(userId, today);
}
```

Even active users shouldn’t trigger unlimited AI calls.


5. Global AI Guard

```java
if (dailyAiCalls.get() >= 50) {
    useFallback = true;
}
```

This acts as a system-wide circuit breaker.
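A guard like this needs two details the snippet above leaves implicit: the counter must reset each day, and check-and-increment should be atomic so concurrent requests can't overspend the budget. A minimal sketch, assuming a single-instance deployment (a multi-node setup would need a shared counter, e.g. in the database or Redis):

```java
import java.time.LocalDate;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a process-wide daily AI budget with automatic reset.
public class GlobalAiGuard {
    private final int maxCallsPerDay;
    private final AtomicInteger dailyAiCalls = new AtomicInteger();
    private LocalDate window = LocalDate.now();

    public GlobalAiGuard(int maxCallsPerDay) {
        this.maxCallsPerDay = maxCallsPerDay;
    }

    /** Returns true if the caller may spend one AI call from today's budget. */
    public synchronized boolean tryAcquire() {
        LocalDate today = LocalDate.now();
        if (!today.equals(window)) {   // new day: reset the budget
            window = today;
            dailyAiCalls.set(0);
        }
        return dailyAiCalls.incrementAndGet() <= maxCallsPerDay;
    }

    public static void main(String[] args) {
        GlobalAiGuard guard = new GlobalAiGuard(2);
        System.out.println(guard.tryAcquire()); // true
        System.out.println(guard.tryAcquire()); // true
        System.out.println(guard.tryAcquire()); // false, budget spent
    }
}
```

When `tryAcquire()` returns false, the service falls through to the deterministic fallback rather than failing the request.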


Configuration

All thresholds are configurable:

```yaml
insight:
  activity-delta: 30
  cooldown-minutes: 30
  daily-cap-per-user: 10
  max-ai-calls-per-day: 50
  freshness-window-hours: 8
```

This allows tuning without redeploying code.
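In a Spring Boot service these keys would typically bind to an `@ConfigurationProperties` class; a framework-free sketch of the same idea, with defaults matching the values above (class and field names are mine):

```java
import java.util.Properties;

// Binds the tunable thresholds from external configuration,
// falling back to sensible defaults when a key is absent.
public class InsightConfig {
    final int cooldownMinutes;
    final int dailyCapPerUser;
    final int maxAiCallsPerDay;

    InsightConfig(Properties p) {
        cooldownMinutes  = Integer.parseInt(p.getProperty("insight.cooldown-minutes", "30"));
        dailyCapPerUser  = Integer.parseInt(p.getProperty("insight.daily-cap-per-user", "10"));
        maxAiCallsPerDay = Integer.parseInt(p.getProperty("insight.max-ai-calls-per-day", "50"));
    }

    public static void main(String[] args) {
        InsightConfig cfg = new InsightConfig(new Properties()); // empty -> defaults
        System.out.println(cfg.cooldownMinutes); // 30
    }
}
```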


What Changed

After this redesign:

  • AI calls dropped from ~100/day → ~5–10/day
  • Rate limit errors disappeared
  • Most requests became fast database reads
  • Free-tier usage became sustainable
  • System behavior became more predictable

Engineering Takeaway

AI should be the exception, not the rule.

A well-designed backend should first decide:

“Is this request even worth sending to the model?”

That decision layer — gating, triggers, cooldowns — is where the real engineering happens.


Final Thought

If most requests can be handled using deterministic logic or cached state:

Do that first.

Use AI only when it actually adds value.

That single shift can make your system:

  • cheaper
  • faster
  • more reliable

—and much easier to scale.

Blog link: https://anupamkushwaha.me/blog/stopped-calling-ai-on-every-request
