How I Cut My AI API Costs by 70% Without Sacrificing Quality

#discuss #ai #webdev #python

A few months ago, I was building a chatbot for a client that needed to handle customer support queries. The requirements were straightforward: answer common questions, escalate complex issues, and keep latency under 2 seconds. I started with OpenAI’s API because it’s easy, but after a week of testing, the bill was already climbing into triple digits. That’s when I realized I couldn’t just throw more money at the problem—I needed a smarter architecture.

The Problem: Every query costs money

I had a list of about 200 common support questions that covered 80% of what users asked. But my naive implementation sent every single user message to GPT-4. Even with prompt caching and trimmed prompts, each turn cost $0.03–$0.10. Multiply that by hundreds of users, and it became unsustainable fast.

An even bigger issue: latency. For simple questions like “What are your business hours?” a full round-trip to the API took 1–3 seconds. Users expected instant answers, not a spinning loader.

What I tried that didn’t work

First, I tried using a cheaper model (GPT-3.5 Turbo). The cost dropped by 80%, but the accuracy suffered. It often hallucinated instructions or gave outdated information. Clients complained.

Then I built a hard-coded FAQ with exact keyword matching. It was fast and free, but it failed on typos, synonyms, and paraphrased questions. Maintenance was a nightmare—adding a single new Q&A required merging updates with the existing logic.

I also experimented with simple embeddings + cosine similarity (semantic search). That worked okay for retrieval, but it couldn’t handle multi-turn conversations or vague queries like “I have a problem with my order.” Users still needed a fallback to the LLM.
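To make the retrieval step concrete, here is a minimal sketch of cosine-similarity matching against an FAQ. It is not the code I ran: the `embed` function is a toy word-count stand-in for a real embedding model, and the `FAQ` entries, the answer placeholders, and the `min_score` cutoff of 0.5 are all assumptions for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical FAQ entries; the answers are placeholders, not real policy text.
FAQ = {
    "What are your business hours?": "<hours answer>",
    "How do I get a refund?": "<refund answer>",
    "Where is my order?": "<tracking answer>",
}

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_match(query: str, faq: dict, min_score: float = 0.5):
    """Return (question, score) for the closest FAQ entry, or (None, score) if too weak."""
    q = embed(query)
    score, question = max((cosine(q, embed(question)), question) for question in faq)
    return (question, score) if score >= min_score else (None, score)

print(best_match("what are the business hours", FAQ))
print(best_match("I have a problem with my order", FAQ))
```

With this toy embedding, the paraphrased hours question clears the cutoff, while the vague "I have a problem with my order" matches nothing strongly enough. That second case is the kind of query that still needs an LLM behind it.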

What finally worked: A hybrid approach

After days of reading papers and GitHub repos, I landed on a pattern that many production systems use: route simple queries to a local or lightweight model, and only escalate to a full cloud LLM when necessary.

Here’s the architecture:

Rule-based classifier – a tiny regex + keyword mapper that tags intent (e.g., “hours”, “refund”, “tracking”). It catches exact matches and common patterns instantly.
Small local model fallback – For intents that aren’t matched exactly, I use a small quantized model (like Llama 3.2 1B or Phi-3-mini) running locally via Ollama. This handles paraphrases and typos at near-zero cost and sub-second latency.
Cloud API as last resort – Only queries that the local model flags as low confidence (or that are explicitly tagged “complex”) get sent to OpenAI/GPT-4.

Code: Putting it together

Here’s a simplified Python version of the router. I use this in a FastAPI endpoint.

import re

# Local model inference (assumes Ollama running locally)
import ollama

# Cloud API (OpenAI)
from openai import OpenAI

def classify_intent(user_input: str) -> str:
    """Rule-based fast classification."""
    input_lower = user_input.lower().strip()

    if re.search(r'(hours|open|close|business time)', input_lower):
        return 'hours'
    if re.search(r'(refund|return|money back)', input_lower):
        return 'refund'
    if re.search(r'(order status|tracking|where is my)', input_lower):
        return 'tracking'
    # ... more rules ...
    return 'unknown'

def query_local_model(user_input: str) -> dict:
    """Use a small local LLM to answer, and also get a confidence score."""
    response = ollama.chat(
        model='phi3:mini',
        messages=[
            {'role': 'system', 'content': 'You are a helpful support assistant. Keep answers brief and factual.'},
            {'role': 'user', 'content': user_input}
        ]
    )
    answer = response['message']['content']
    # Crude confidence estimate: short answers without hedging score high.
    # The hedging check is case-insensitive so "i think" is caught too.
    confidence = 0.9 if len(answer) < 200 and 'i think' not in answer.lower() else 0.5
    return {'answer': answer, 'confidence': confidence}

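def query_cloud_api(user_input: str, conversation_history: list) -> str:
    """Send the query plus the last few turns of history to the cloud model."""
    # Reads the API key from the OPENAI_API_KEY environment variable
    # instead of hard-coding it in source.
    client = OpenAI()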
    messages = [{'role': 'system', 'content': 'You are a support agent.'}]
    messages.extend(conversation_history[-4:])  # last few turns
    messages.append({'role': 'user', 'content': user_input})
    response = client.chat.completions.create(
        model='gpt-4',
        messages=messages,
        max_tokens=300
    )
    return response.choices[0].message.content

def handle_query(user_input: str, conversation_history: list) -> dict:
    intent = classify_intent(user_input)

    # Step 1: Known intents can be answered directly from a predefined response.
    # The answer string here is a placeholder for that lookup.
    # Note: every latency_ms value in this function is a nominal figure, not a measurement.
    if intent != 'unknown':
        return {'answer': f'Quick answer for {intent}', 'source': 'rule', 'latency_ms': 5}

    # Step 2: Try local LLM
    local_result = query_local_model(user_input)
    if local_result['confidence'] > 0.7:
        return {'answer': local_result['answer'], 'source': 'local_llm', 'latency_ms': 350}

    # Step 3: Fall back to cloud
    answer = query_cloud_api(user_input, conversation_history)
    return {'answer': answer, 'source': 'cloud', 'latency_ms': 1500}

Lessons learned and trade-offs

Cost: My bill dropped from ~$200/week to ~$60/week. About 70% of queries are handled by the rule engine or the local model, and only 10% hit the cloud API by design. The remaining 20% are misclassifications that still cost money, but that’s tolerable.
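As a sanity check on those numbers, here is the arithmetic as a small sketch. It rests on two assumptions of mine: that the 20% of misrouted queries end up on the paid path, and that the rule engine and local model cost nothing per query (the hosting cost of the local model is ignored). `blended_weekly_cost` is a hypothetical helper name.

```python
def blended_weekly_cost(baseline_weekly_cost: float, cloud_fraction: float) -> float:
    """Weekly API cost if only `cloud_fraction` of queries still reach the paid API."""
    if not 0.0 <= cloud_fraction <= 1.0:
        raise ValueError("cloud_fraction must be between 0 and 1")
    return baseline_weekly_cost * cloud_fraction

baseline = 200.0               # weekly bill when every query went to the cloud
cloud_fraction = 0.10 + 0.20   # intended escalations plus misrouted queries (assumption)

new_cost = blended_weekly_cost(baseline, cloud_fraction)
savings = 1 - new_cost / baseline
print(f"${new_cost:.0f}/week, {savings:.0%} saved")  # $60/week, 70% saved
```

Under those assumptions, 30% of traffic on the paid path lines up with the drop from ~$200 to ~$60.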

Latency: Rule-based answers take about 5 ms, the local model 300–500 ms, and the cloud API 1–3 s. The user rarely notices the slow path because most queries are fast.
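The router code earlier reports fixed latency numbers per path. A minimal sketch of measuring them instead, assuming each path can be wrapped in a zero-argument callable; `timed` and `fake_local_model` are hypothetical names, and the sleep only stands in for a real model call.

```python
import time
from typing import Callable

def timed(source: str, fn: Callable[[], str]) -> dict:
    """Run fn and return its answer with the path name and measured wall-clock latency."""
    start = time.perf_counter()
    answer = fn()
    latency_ms = (time.perf_counter() - start) * 1000
    return {"answer": answer, "source": source, "latency_ms": round(latency_ms, 1)}

def fake_local_model() -> str:
    """Stand-in for a real backend call; sleeps briefly to simulate inference time."""
    time.sleep(0.05)
    return "stub answer"

result = timed("local_llm", fake_local_model)
print(result["source"], result["latency_ms"])
```

Logging the returned dict for every request would also give a per-path breakdown of where traffic goes.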

Accuracy: The local model (Phi-3-mini) is surprisingly good for simple support tasks, but it occasionally gives wrong info (e.g., “Yes we accept Bitcoin” when we don’t). To mitigate, I added a confidence heuristic based on answer length and hedging words. It’s not perfect, but it reduces errors from ~10% to ~3%.
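Here is a slightly fuller sketch of that kind of heuristic. The hedging phrases, the 200-character limit, and the 0.9/0.5 scores are assumptions that mirror the simplified router, not tuned values.

```python
import re

# Hypothetical list of hedging phrases; extend it from your own logs.
HEDGE_PATTERN = re.compile(r"\b(i think|i'm not sure|i am not sure|probably|maybe|might)\b")

def estimate_confidence(answer: str, max_len: int = 200) -> float:
    """Return 0.9 for short answers with no hedging phrases, otherwise 0.5."""
    hedged = HEDGE_PATTERN.search(answer.lower()) is not None
    too_long = len(answer) >= max_len
    return 0.5 if hedged or too_long else 0.9

print(estimate_confidence("I think we open at 9."))   # 0.5
print(estimate_confidence("Yes we accept Bitcoin"))   # 0.9
```

The second call shows the limit of the approach: a short, confident, wrong answer still scores high, so the heuristic only catches answers where the model sounds unsure.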

Maintenance: The rule set grows over time, but it’s a flat file – easy to edit. The local model needs periodic updates if the business changes (e.g., new products). The cloud model stays the same.
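One way to keep the rules in a flat file is sketched below, with the JSON inlined as a string so the example runs on its own. The schema (`intent`, `pattern`, `response`) and the response placeholders are assumptions; the patterns are the ones from the simplified router.

```python
import json
import re
from typing import Optional

# In practice this would live in a file such as rules.json (hypothetical name).
RULES_JSON = """
[
  {"intent": "hours", "pattern": "hours|open|close|business time", "response": "<hours answer>"},
  {"intent": "refund", "pattern": "refund|return|money back", "response": "<refund answer>"},
  {"intent": "tracking", "pattern": "order status|tracking|where is my", "response": "<tracking answer>"}
]
"""

def load_rules(raw: str) -> list:
    """Parse the rule list and precompile each regex once."""
    return [(r["intent"], re.compile(r["pattern"]), r["response"]) for r in json.loads(raw)]

def answer_from_rules(user_input: str, rules: list) -> Optional[dict]:
    """Return the first matching rule's canned response, or None if nothing matches."""
    text = user_input.lower().strip()
    for intent, pattern, response in rules:
        if pattern.search(text):
            return {"intent": intent, "answer": response, "source": "rule"}
    return None

rules = load_rules(RULES_JSON)
print(answer_from_rules("Where is my package?", rules))
print(answer_from_rules("hello", rules))
```

Adding a new Q&A then means appending one JSON object, with no change to the routing code.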

What I’d do differently next time

Better logging: I wish I had instrumented each path from day one. Now I have to retrofit metrics to see which intents are misrouted.
A/B test the cutoff: My confidence threshold of 0.7 was a guess. I should run a randomized trial to find the optimal balance between cost and accuracy.
Dedicated classifier: I’d replace the crude confidence heuristic with a dedicated classification model. A tiny BERT classifier (e.g., distilbert-base-uncased fine-tuned on my intents) would be more reliable and still cheap to run.
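Before running a live experiment on the cutoff, logged data can give a first estimate offline. The sketch below is an offline sweep rather than a randomized trial, and it assumes you have logged pairs of (confidence, whether the local answer was correct). The sample values are made up, and `sweep_thresholds` is a hypothetical helper.

```python
def sweep_thresholds(samples: list, thresholds: list) -> list:
    """For each threshold, report how much traffic escalates and how accurate the rest is.

    samples: list of (confidence, local_answer_was_correct) pairs from logs.
    """
    if not samples:
        raise ValueError("need at least one logged sample")
    rows = []
    for t in thresholds:
        # Same rule as the router: answer locally only when confidence > threshold.
        kept = [ok for conf, ok in samples if conf > t]
        rows.append({
            "threshold": t,
            "escalation_rate": (len(samples) - len(kept)) / len(samples),
            "local_accuracy": sum(kept) / len(kept) if kept else None,
        })
    return rows

# Made-up log entries for illustration only.
samples = [(0.9, True), (0.9, True), (0.9, False), (0.5, True), (0.5, False), (0.5, False)]
for row in sweep_thresholds(samples, [0.4, 0.7]):
    print(row)
```

Note that a two-valued score like the 0.9/0.5 heuristic in the router makes every cutoff between 0.5 and 0.9 behave identically, so a sweep like this only becomes informative once the confidence score is finer-grained.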

When NOT to use this approach

If your queries require deep reasoning or multi-step logic, a local model will fail. Don’t bother.
If your traffic is very low (<100 queries/day), the cost of building the hybrid system might not be worth it. Just use the cloud API.
If latency isn’t a concern (e.g., batch processing overnight), then simplicity wins. Stick with one model.

Final thoughts

I’m not a machine learning engineer – I’m a regular backend dev who needed to pay the bills. This approach let me keep the quality of a top-tier LLM while making it affordable. It’s not revolutionary; it’s just good engineering: use the right tool for the job.

One more thing – I ended up hosting my local model on a small rented GPU instance (like a $40/month box). You could also use an edge device or even a laptop, depending on traffic. The key is to keep the fallback path as thin as possible.

So, what’s your setup? Are you going all-in on cloud APIs, or have you found a clever hybrid that keeps costs down? I’d love to hear about your routing strategy.

Top comments (1)

CapeStart • Jun 9

Many companies are overpaying for intelligence they don't need while underinvesting in orchestration, caching, and retrieval quality.