Not every query deserves a frontier model. A user asking "what is your cancellation policy?" does not need GPT-4o to generate the answer. A rules engine or a simple database lookup handles it in 5 milliseconds at zero token cost.
We learned this the hard way. Our first production deployment sent everything through GPT-4o. The quality was great. The bill was $7,200/month for a feature that should have cost $2,000. Worse, 60% of those queries were simple enough that a smaller model (or no model at all) would have produced identical output.
This article covers the three-tier fallback system we built: a rules engine for deterministic queries, a cheap model (Claude Haiku) for simple generation, and a frontier model (GPT-4o) for complex reasoning. Stack: Node.js 20, TypeScript.
## The three tiers
Here is the routing logic:
```text
Incoming query
      ↓
┌─────────────────────┐
│ Tier 0: Rules       │ → deterministic lookup, no LLM
│ (FAQ, status, data) │   cost: $0, latency: <10ms
└─────────┬───────────┘
          ↓ not matched
┌─────────────────────┐
│ Tier 1: Haiku       │ → simple generation
│ (summaries, format) │   cost: $1/$5 per 1M tokens
└─────────┬───────────┘
          ↓ quality check fails
┌─────────────────────┐
│ Tier 2: GPT-4o      │ → complex reasoning
│ (analysis, compare) │   cost: $2.50/$10 per 1M tokens
└─────────────────────┘
```
The classifier that decides the tier is itself a cheap LLM call. We use GPT-4o-mini with a short system prompt. The classification costs roughly $0.0001 per request, which is negligible.
## Tier 0: skip the LLM entirely
This is the highest-ROI tier because it costs nothing. Before any query hits an LLM, we check if it matches a deterministic pattern:
```typescript
interface TierZeroRule {
  patterns: RegExp[];
  handler: (query: string, context: any) => string;
}

const tierZeroRules: TierZeroRule[] = [
  {
    // FAQ lookups
    patterns: [
      /cancellation\s*policy/i,
      /refund\s*policy/i,
      /check.?in\s*time/i,
      /check.?out\s*time/i,
    ],
    handler: (query, context) => {
      return faqDatabase.findBestMatch(query);
    },
  },
  {
    // Booking status (pure data lookup)
    patterns: [
      /booking\s*(status|confirmation)/i,
      /order\s*#?\d+/i,
    ],
    handler: (query, context) => {
      const bookingId = extractBookingId(query);
      return bookingService.getStatus(bookingId);
    },
  },
  {
    // Price checks (structured data, no generation needed)
    patterns: [
      /how much.*(cost|price)/i,
      /price\s*(for|of)/i,
    ],
    handler: (query, context) => {
      return pricingService.lookup(query, context);
    },
  },
];

function tryTierZero(query: string, context: any): string | null {
  for (const rule of tierZeroRules) {
    if (rule.patterns.some(p => p.test(query))) {
      return rule.handler(query, context);
    }
  }
  return null; // no match, proceed to LLM tiers
}
```
In our system, Tier 0 catches 22% of all queries. That is 22% of traffic that never touches an LLM, never costs a token, and returns in under 10 milliseconds.
The key insight: do not overthink the pattern matching. Simple regex works. If a query contains "cancellation policy", you know the answer. You do not need embeddings or a classifier for this.
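The `extractBookingId` helper referenced in the booking-status rule is not shown above. A minimal sketch, assuming booking IDs are the first run of four or more digits in queries like `order #48213` (the real extractor may be stricter):

```typescript
// Minimal sketch: pull the first 4+ digit run out of a query such as
// "what's the status of order #48213?". Returns null when no ID is present.
function extractBookingId(query: string): string | null {
  const match = query.match(/#?\s*(\d{4,})/);
  return match ? match[1] : null;
}
```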
## The classifier: deciding between Tier 1 and Tier 2
For queries that pass Tier 0, we need to decide: cheap model or expensive model? We use a lightweight classifier:
```typescript
async function classifyComplexity(query: string): Promise<'simple' | 'complex'> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `Classify the user query as SIMPLE or COMPLEX.
SIMPLE: factual questions, summaries, formatting, translation, single-step tasks.
COMPLEX: comparisons, multi-step reasoning, analysis, recommendations with tradeoffs, ambiguous intent.
Reply with one word only.`,
      },
      { role: 'user', content: query },
    ],
    max_tokens: 3, // "COMPLEX" can span more than one token; 1 would truncate it
    temperature: 0,
  });

  const classification = response.choices[0].message.content?.trim().toUpperCase();
  return classification === 'COMPLEX' ? 'complex' : 'simple';
}
```
This adds about 200ms latency and costs $0.0001 per call. At 10,000 queries/day, the classifier costs $1/day total. The savings from routing 60% of queries to Haiku instead of GPT-4o are roughly $150/day.
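To make the per-query math concrete, here is the cost calculation using the per-1M-token prices quoted above. The token counts (3,000 in / 800 out) are illustrative assumptions, not measurements from our system; plug in your own averages:

```typescript
// Per-1M-token prices from the article (input/output, USD).
const PRICE = {
  'gpt-4o': { in: 2.5, out: 10.0 },
  'claude-haiku': { in: 1.0, out: 5.0 },
} as const;

// Cost of a single request given token counts.
function queryCost(model: keyof typeof PRICE, inTok: number, outTok: number): number {
  const p = PRICE[model];
  return (inTok * p.in + outTok * p.out) / 1_000_000;
}

// Assumed averages: 3,000 prompt tokens, 800 completion tokens.
const gpt4oCost = queryCost('gpt-4o', 3000, 800);       // ≈ $0.0155 per query
const haikuCost = queryCost('claude-haiku', 3000, 800); // ≈ $0.0070 per query

// At 10,000 queries/day with 60% routed to Haiku instead of GPT-4o:
const dailySavings = 10_000 * 0.6 * (gpt4oCost - haikuCost); // ≈ $51/day
```

Heavier prompts (more retrieved context per query) widen the gap further, which is how the savings climb toward the figures we saw in production.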
## Tier 1 and Tier 2: the fallback chain
Here is where the fallback logic lives. We call Tier 1 first. If the response fails a quality check, we escalate to Tier 2:
```typescript
async function processQuery(query: string, context: any) {
  // Tier 0: deterministic
  const tierZeroResult = tryTierZero(query, context);
  if (tierZeroResult) {
    return { response: tierZeroResult, tier: 0, cost: 0 };
  }

  // Classify
  const complexity = await classifyComplexity(query);

  if (complexity === 'simple') {
    // Tier 1: Haiku
    const haiku = await callModel('claude-haiku', query, context);
    if (passesQualityCheck(haiku, query)) {
      return { response: haiku, tier: 1, cost: haiku.tokenCost };
    }
    // Fallback to Tier 2 if Haiku output is low quality
    const gpt4o = await callModel('gpt-4o', query, context);
    return { response: gpt4o, tier: 2, cost: haiku.tokenCost + gpt4o.tokenCost };
  }

  // Complex queries go straight to Tier 2
  const gpt4o = await callModel('gpt-4o', query, context);
  return { response: gpt4o, tier: 2, cost: gpt4o.tokenCost };
}
```
The quality check is the most important function in the entire system. A bad quality check either wastes money (escalating when unnecessary) or serves bad responses (not escalating when needed).
Our quality check is simple:
```typescript
function passesQualityCheck(response: ModelResponse, query: string): boolean {
  // Check 1: response is not empty or too short
  if (!response.text || response.text.length < 20) return false;

  // Check 2: model did not refuse or hedge excessively
  const hedgePatterns = [
    /i'm not sure/i,
    /i don't have.*information/i,
    /i cannot.*determine/i,
  ];
  if (hedgePatterns.some(p => p.test(response.text))) return false;

  // Check 3: response addresses the query topic
  // (simple keyword overlap check, not semantic)
  const queryKeywords = extractKeywords(query);
  const responseKeywords = extractKeywords(response.text);
  const overlap = queryKeywords.filter(k => responseKeywords.includes(k));
  if (overlap.length < queryKeywords.length * 0.3) return false;

  return true;
}
```
We intentionally kept this rule-based, not LLM-based. Using another LLM call to check quality would add latency and cost that defeats the purpose.
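The `extractKeywords` helper used in check 3 is not shown above. A minimal sketch, assuming lowercase tokenization, a small stopword list, and deduplication (the production version may be more elaborate):

```typescript
// Minimal sketch of the keyword extractor assumed by passesQualityCheck:
// lowercase, split on non-word characters, drop short tokens and stopwords,
// dedupe. Good enough for a rough overlap check, not semantic matching.
const STOPWORDS = new Set(['the', 'and', 'for', 'with', 'what', 'your', 'how', 'that', 'this']);

function extractKeywords(text: string): string[] {
  const tokens = text.toLowerCase().split(/\W+/);
  return [...new Set(tokens.filter(t => t.length > 2 && !STOPWORDS.has(t)))];
}
```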
## Results after 6 weeks
| Metric | Before (GPT-4o only) | After (3-tier) |
|---|---|---|
| Monthly LLM cost | $7,200 | $2,100 |
| Avg response latency | 1.8s | 0.6s |
| Tier 0 (no LLM) | 0% | 22% |
| Tier 1 (Haiku) | 0% | 51% |
| Tier 2 (GPT-4o) | 100% | 27% |
| Fallback rate (Tier 1→2) | n/a | 4.2% |
| User satisfaction (CSAT) | 4.3/5 | 4.2/5 |
The CSAT dropped by 0.1 points. We consider that acceptable for a 71% cost reduction. The drop came entirely from the 4.2% of queries where Haiku's response was served but was slightly less thorough than what GPT-4o would have produced. The quality check catches the obvious failures, but subtle quality differences slip through.
## What we would do differently
**Track cost per tier from day one.** We only started logging which tier handled each query after week 2. Those first two weeks of data were lost, making it harder to benchmark improvements.

**Start with a stricter quality check and loosen it.** Our first quality check was too lenient. It let through some Haiku responses that should have been escalated. We tightened the keyword overlap threshold from 0.2 to 0.3, which increased the fallback rate from 2.1% to 4.2% but eliminated the worst-quality responses.

**Consider a Tier 1.5.** We now think there is room for a middle tier between Haiku ($1/$5 per 1M tokens) and GPT-4o ($2.50/$10). Something like GPT-4o-mini or Claude Sonnet for queries that are too complex for Haiku but do not need frontier-level reasoning. We are testing this now.
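One way to prototype a middle tier is to widen the classifier's label set and map each label to a model. The sketch below shows the shape of that change; the model choices and the `medium` label are candidates under test, not a settled design:

```typescript
type Complexity = 'simple' | 'medium' | 'complex';

// Candidate routing table for a three-label classifier. GPT-4o-mini as the
// middle tier is an experiment, not a recommendation.
const MODEL_FOR: Record<Complexity, string> = {
  simple: 'claude-haiku',
  medium: 'gpt-4o-mini',
  complex: 'gpt-4o',
};

function pickModel(complexity: Complexity): string {
  return MODEL_FOR[complexity];
}
```

The classifier prompt would grow a third label definition, and the fallback chain would escalate `simple → medium → complex` instead of jumping straight to the frontier model.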
## Wrapping up
The single biggest cost optimization in our LLM stack was not caching, not prompt compression, not fine-tuning. It was routing queries to the right tier. 22% of queries never needed an LLM. 51% needed only a cheap model. Only 27% actually required GPT-4o.
If you are running everything through a single frontier model, start by logging your queries for a week and categorizing them by complexity. You will likely find that the majority do not need frontier reasoning. Build Tier 0 first (it is free), add a classifier, and let the data tell you where the boundaries should be.
Built by the engineering team at Adamo Software. We build AI-powered platforms for travel, healthcare, and enterprise applications.