DEV Community

zhongqiyue

How I Cut My AI API Costs by 70% Without Sacrificing Quality

A few months ago, I was building a chatbot for a client that needed to handle customer support queries. The requirements were straightforward: answer common questions, escalate complex issues, and keep latency under 2 seconds. I started with OpenAI’s API because it’s easy, but after a week of testing, the bill was already climbing into triple digits. That’s when I realized I couldn’t just throw more money at the problem—I needed a smarter architecture.

The Problem: Every query costs money

I had a list of about 200 common support questions that covered 80% of what users asked. But my naive implementation sent every single user message to GPT-4. Even with prompt caching and reduced tokens, each conversation was racking up $0.03–$0.10 per turn. Multiply that by hundreds of users, and it became unsustainable fast.

An even bigger issue: latency. For simple questions like “What are your business hours?” a full round-trip to the API took 1–3 seconds. Users expected instant answers, not a spinning loader.

What I tried that didn’t work

First, I tried using a cheaper model (GPT-3.5 Turbo). The cost dropped by 80%, but the accuracy suffered. It often hallucinated instructions or gave outdated information. Clients complained.

Then I built a hard-coded FAQ with exact keyword matching. It was fast and free, but it failed on typos, synonyms, and paraphrased questions. Maintenance was a nightmare—adding a single new Q&A required merging updates with the existing logic.
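To make that failure mode concrete, here is a minimal sketch of exact keyword matching. The FAQ table and its answers are hypothetical placeholders, not the real support data.

```python
# Hypothetical FAQ table; keys and answers are placeholders for illustration.
FAQ = {
    "business hours": "We are open 9am-5pm, Monday to Friday.",
    "refund": "Refunds are processed within 5 business days.",
}


def exact_keyword_answer(message: str):
    """Return the first FAQ answer whose keyword appears verbatim in the message."""
    text = message.lower()
    for keyword, answer in FAQ.items():
        if keyword in text:
            return answer
    return None  # no keyword matched


print(exact_keyword_answer("What are your business hours?"))  # matches "business hours"
print(exact_keyword_answer("When do you opne?"))               # typo and paraphrase: None
print(exact_keyword_answer("Can I get my money back?"))        # synonym for refund: None
```

Every missed phrasing needs another keyword, which is where the maintenance burden comes from.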

I also experimented with simple embeddings + cosine similarity (semantic search). That worked okay for retrieval, but it couldn’t handle multi-turn conversations or vague queries like “I have a problem with my order.” Users still needed a fallback to the LLM.
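For reference, the retrieval step looks roughly like this. It is a sketch only: the 3-dimensional vectors are toy stand-ins for real embeddings, and the names FAQ_VECTORS, best_match, and the 0.8 cutoff are assumptions made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of each FAQ intent (hypothetical values).
FAQ_VECTORS = {
    "hours": [0.9, 0.1, 0.0],
    "refund": [0.1, 0.9, 0.1],
    "tracking": [0.0, 0.2, 0.9],
}


def best_match(query_vec, min_score=0.8):
    """Return (intent, score); intent is None when nothing is similar enough."""
    best_intent, best_score = None, 0.0
    for intent, vec in FAQ_VECTORS.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_intent, best_score = intent, score
    if best_score < min_score:
        return None, best_score
    return best_intent, best_score


print(best_match([0.8, 0.2, 0.1]))  # close to "hours"
print(best_match([0.5, 0.5, 0.5]))  # vague query: no intent clears the cutoff
```

The second call is the "I have a problem with my order" case: the query sits between several intents, so retrieval alone has nothing confident to return.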

What finally worked: A hybrid approach

After days of reading papers and GitHub repos, I landed on a pattern that many production systems use: route simple queries to a local or lightweight model, and only escalate to a full cloud LLM when necessary.

Here’s the architecture:

  1. Rule-based classifier – a tiny regex + keyword mapper that tags intent (e.g., “hours”, “refund”, “tracking”). It catches exact matches and common patterns instantly.
  2. Small local model fallback – For intents that aren’t matched exactly, we use a small quantized model (like Llama 3.2 1B or Phi-3-mini) running locally via Ollama. This handles paraphrases and typos at near-zero cost and sub-second latency.
  3. Cloud API as last resort – Only queries that the local model flags as low confidence (or that are explicitly tagged “complex”) get sent to OpenAI/GPT-4.

Code: Putting it together

Here’s a simplified Python version of the router. I use this in a FastAPI endpoint.

import re

# Local model inference (assumes Ollama running locally)
import ollama

# Cloud API (OpenAI)
from openai import OpenAI

def classify_intent(user_input: str) -> str:
    """Rule-based fast classification."""
    input_lower = user_input.lower().strip()

    if re.search(r'(hours|open|close|business time)', input_lower):
        return 'hours'
    if re.search(r'(refund|return|money back)', input_lower):
        return 'refund'
    if re.search(r'(order status|tracking|where is my)', input_lower):
        return 'tracking'
    # ... more rules ...
    return 'unknown'

def query_local_model(user_input: str) -> dict:
    """Use a small local LLM to answer, and also get a confidence score."""
    response = ollama.chat(
        model='phi3:mini',
        messages=[
            {'role': 'system', 'content': 'You are a helpful support assistant. Keep answers brief and factual.'},
            {'role': 'user', 'content': user_input}
        ]
    )
    answer = response['message']['content']
    # Dummy confidence estimation: check if answer is short and doesn't contain hedging
    confidence = 0.9 if len(answer) < 200 and 'I think' not in answer else 0.5
    return {'answer': answer, 'confidence': confidence}

def query_cloud_api(user_input: str, conversation_history: list) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment instead of hard-coding the key
    messages = [{'role': 'system', 'content': 'You are a support agent.'}]
    messages.extend(conversation_history[-4:])  # last few turns
    messages.append({'role': 'user', 'content': user_input})
    response = client.chat.completions.create(
        model='gpt-4',
        messages=messages,
        max_tokens=300
    )
    return response.choices[0].message.content

def handle_query(user_input: str, conversation_history: list) -> dict:
    intent = classify_intent(user_input)

    # Step 1: Known intents can be answered directly from a predefined response
    if intent != 'unknown':
        return {'answer': f'Quick answer for {intent}', 'source': 'rule', 'latency_ms': 5}

    # Step 2: Try local LLM
    local_result = query_local_model(user_input)
    if local_result['confidence'] > 0.7:
        return {'answer': local_result['answer'], 'source': 'local_llm', 'latency_ms': 350}

    # Step 3: Fall back to cloud
    answer = query_cloud_api(user_input, conversation_history)
    return {'answer': answer, 'source': 'cloud', 'latency_ms': 1500}

Lessons learned and trade-offs

Cost: My bill dropped from ~$200/week to ~$60/week, a 70% reduction. The bulk of queries (about 70%) are handled by the rule engine or the local model. Only 10% hit the cloud API. (The remaining 20% are misclassifications that still cost money, but that’s tolerable.)
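The arithmetic behind those numbers can be sketched as follows. The weekly totals and traffic shares come from the figures above; the $0.05 per-call cloud cost is a hypothetical value picked from the $0.03–$0.10 range, and treating every misclassified query as a full cloud call is an assumption. The rented GPU box is not included.

```python
def savings_fraction(before: float, after: float) -> float:
    """Fraction of the original spend that was eliminated."""
    return (before - after) / before


def blended_cost_per_query(path_shares: dict, path_costs: dict) -> float:
    """Average cost per query, weighting each path's cost by its traffic share."""
    return sum(path_shares[path] * path_costs[path] for path in path_shares)


# Weekly bill before and after, as reported in the post.
print(f"{savings_fraction(200.0, 60.0):.0%}")  # 70%

# Traffic shares from the post; per-query costs are assumptions.
shares = {"rule_or_local": 0.70, "cloud": 0.10, "misclassified": 0.20}
costs = {"rule_or_local": 0.00, "cloud": 0.05, "misclassified": 0.05}

hybrid = blended_cost_per_query(shares, costs)
all_cloud = 0.05  # every query pays for a cloud call
print(f"hybrid: ${hybrid:.3f}/query, all-cloud: ${all_cloud:.3f}/query")
print(f"{1 - hybrid / all_cloud:.0%}")  # 70%
```

Under these assumptions the per-query view and the weekly bill agree: only the 30% of traffic that still reaches a paid path costs anything.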

Latency: Rule-based answers take about 5 ms, the local model 300–500 ms, and the cloud API 1–3 s. Users rarely notice the slow path because most queries take a fast one.
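The router above returns fixed latency_ms values as placeholders. A minimal sketch of measuring each path instead, assuming a hypothetical timed helper and a stand-in handler (neither is part of the original router):

```python
import time


def timed(source: str, fn, *args) -> dict:
    """Call fn(*args) and return its answer with the measured wall-clock latency."""
    start = time.perf_counter()
    answer = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {"answer": answer, "source": source, "latency_ms": round(elapsed_ms, 1)}


def fake_rule_answer(user_input: str) -> str:
    """Stand-in for the rule-based path; a real handler would look up a canned reply."""
    return "Quick answer for hours"


result = timed("rule", fake_rule_answer, "What are your business hours?")
print(result)
```

Wrapping the local and cloud calls the same way would give comparable numbers for all three paths.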

Accuracy: The local model (Phi-3-mini) is surprisingly good for simple support tasks, but it occasionally gives wrong info (e.g., “Yes we accept Bitcoin” when we don’t). To mitigate, I added a confidence heuristic based on answer length and hedging words. It’s not perfect, but it reduces errors from ~10% to ~3%.
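A slightly more careful version of that heuristic might look like the sketch below. The phrase list and the 200-character limit are assumptions, and the 0.9/0.5 scores simply mirror the router code above.

```python
# Hypothetical list of hedging phrases; extend it from real transcripts.
HEDGING_PHRASES = ("i think", "i'm not sure", "i am not sure", "probably", "maybe", "might")


def estimate_confidence(answer: str, max_length: int = 200) -> float:
    """Return 0.9 for short, unhedged answers and 0.5 otherwise."""
    text = answer.lower()  # case-insensitive, so "I THINK" is caught too
    hedged = any(phrase in text for phrase in HEDGING_PHRASES)
    too_long = len(answer) >= max_length
    return 0.5 if (hedged or too_long) else 0.9


print(estimate_confidence("We are open 9am-5pm on weekdays."))  # 0.9
print(estimate_confidence("I THINK we open at nine."))          # 0.5
print(estimate_confidence("Yes we accept Bitcoin"))             # 0.9
```

The last call shows the limit of the approach: a short, confident, wrong answer still scores high, so a heuristic like this catches hesitation rather than factual errors.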

Maintenance: The rule set grows over time, but it’s a flat file – easy to edit. The local model needs periodic updates if the business changes (e.g., new products). The cloud model stays the same.

What I’d do differently next time

  • Better logging: I wish I had instrumented each path from day one. Now I have to retrofit metrics to see which intents are misrouted.
  • A/B test the cutoff: My confidence threshold of 0.7 was a guess. I should run a randomized trial to find the optimal balance between cost and accuracy.
  • Use a dedicated classification model instead of a crude confidence heuristic. A tiny BERT classifier (e.g., distilbert-base-uncased fine-tuned on my intents) would be more reliable and still cheap to run.
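Before running a live A/B test, an offline sweep over logged queries can give a first read on the cutoff. This is a sketch under assumptions: the LABELED samples are invented, each pair is (confidence score, whether the local answer was judged correct), and the comparison uses the same `>` rule as the router.

```python
# Invented sample data: (confidence, local_answer_was_correct).
LABELED = [
    (0.9, True), (0.9, True), (0.9, False),
    (0.5, True), (0.5, False), (0.5, False),
]


def evaluate_threshold(samples, threshold: float) -> dict:
    """Share of queries escalated to the cloud and error rate of those kept local."""
    kept = [correct for confidence, correct in samples if confidence > threshold]
    escalated_share = 1 - len(kept) / len(samples)
    local_error_rate = kept.count(False) / len(kept) if kept else 0.0
    return {
        "threshold": threshold,
        "escalated_share": escalated_share,
        "local_error_rate": local_error_rate,
    }


for threshold in (0.4, 0.7, 0.95):
    print(evaluate_threshold(LABELED, threshold))
```

With a two-valued heuristic like the one above, only a few cutoffs behave differently; the sweep becomes more useful once the score is finer-grained, for example a probability from a dedicated classifier.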

When NOT to use this approach

  • If your queries require deep reasoning or multi-step logic, a small local model will usually fail. Don’t bother.
  • If your traffic is very low (<100 queries/day), the cost of building the hybrid system might not be worth it. Just use the cloud API.
  • If latency isn’t a concern (e.g., batch processing overnight), then simplicity wins. Stick with one model.

Final thoughts

I’m not a machine learning engineer – I’m a regular backend dev who needed to pay the bills. This approach let me keep the quality of a top-tier LLM while making it affordable. It’s not revolutionary; it’s just good engineering: use the right tool for the job.

One more thing – I ended up hosting my local model on a small rented GPU instance (like a $40/month box). You could also use an edge device or even a laptop, depending on traffic. The key is to keep the fallback path as thin as possible.


So, what’s your setup? Are you going all-in on cloud APIs, or have you found a clever hybrid that keeps costs down? I’d love to hear about your routing strategy.

Top comments (1)

CapeStart

Many companies are overpaying for intelligence they don't need while underinvesting in orchestration, caching, and retrieval quality.