A few months ago, I was building a chatbot for a client that needed to handle customer support queries. The requirements were straightforward: answer common questions, escalate complex issues, and keep latency under 2 seconds. I started with OpenAI’s API because it’s easy, but after a week of testing, the bill was already climbing into triple digits. That’s when I realized I couldn’t just throw more money at the problem—I needed a smarter architecture.
The Problem: Every query costs money
I had a list of about 200 common support questions that covered 80% of what users asked. But my naive implementation sent every single user message to GPT-4. Even with prompt caching and reduced tokens, each turn was costing $0.03–$0.10. Multiply that by hundreds of users, and it became unsustainable fast.
An even bigger issue: latency. For simple questions like “What are your business hours?” a full round-trip to the API took 1–3 seconds. Users expected instant answers, not a spinning loader.
What I tried that didn’t work
First, I tried using a cheaper model (GPT-3.5 Turbo). The cost dropped by 80%, but the accuracy suffered. It often hallucinated instructions or gave outdated information. Clients complained.
Then I built a hard-coded FAQ with exact keyword matching. It was fast and free, but it failed on typos, synonyms, and paraphrased questions. Maintenance was a nightmare: adding a single new Q&A meant editing the matching logic itself and merging the change with the existing rules.
I also experimented with simple embeddings + cosine similarity (semantic search). That worked okay for retrieval, but it couldn’t handle multi-turn conversations or vague queries like “I have a problem with my order.” Users still needed a fallback to the LLM.
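To make the retrieval step concrete, here is a minimal sketch of cosine-similarity matching. It uses toy bag-of-words vectors as a stand-in for real embeddings so that it runs with the standard library only. The FAQ entries, the names embed, cosine and best_match, and the 0.3 cutoff are all assumptions for illustration, not the code from this project.

```python
import math
import re
from collections import Counter

# Hypothetical FAQ entries; the answers are placeholders, not real policy text.
FAQ = {
    "What are your business hours?": "<hours answer>",
    "How do I get a refund?": "<refund answer>",
    "Where is my order?": "<tracking answer>",
}


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_match(query: str, faq: dict, min_score: float = 0.3):
    """Return (question, score) for the closest FAQ entry, or (None, score) below the cutoff."""
    q = embed(query)
    score, question = max((cosine(q, embed(k)), k) for k in faq)
    return (question, score) if score >= min_score else (None, score)


print(best_match("what hours are you open", FAQ))
print(best_match("do you ship to canada", FAQ))
print(best_match("I have a problem with my order", FAQ))
```

In this toy version the vague message about an order still clears the cutoff against "Where is my order?" purely because of shared words. Real embeddings will score differently, but the same kind of plausible-looking wrong match is one reason retrieval alone tends to need a fallback.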
What finally worked: A hybrid approach
After days of reading papers and GitHub repos, I landed on a pattern that many production systems use: route simple queries to a local or lightweight model, and only escalate to a full cloud LLM when necessary.
Here’s the architecture:
- Rule-based classifier – A tiny regex + keyword mapper that tags intent (e.g., “hours”, “refund”, “tracking”). It catches exact matches and common patterns instantly.
- Small local model fallback – For intents that aren’t matched exactly, I use a small quantized model (like Llama 3.2 1B or Phi-3-mini) running locally via Ollama. This handles paraphrases and typos at near-zero cost and sub-second latency.
- Cloud API as last resort – Only queries that the local model flags as low confidence (or that are explicitly tagged “complex”) get sent to OpenAI/GPT-4.
Code: Putting it together
Here’s a simplified Python version of the router. I use this in a FastAPI endpoint.
import re
import time

# Local model inference (assumes Ollama is running locally)
import ollama
# Cloud API (OpenAI)
from openai import OpenAI


def classify_intent(user_input: str) -> str:
    """Rule-based fast classification."""
    input_lower = user_input.lower().strip()
    if re.search(r'(hours|open|close|business time)', input_lower):
        return 'hours'
    if re.search(r'(refund|return|money back)', input_lower):
        return 'refund'
    if re.search(r'(order status|tracking|where is my)', input_lower):
        return 'tracking'
    # ... more rules ...
    return 'unknown'


def query_local_model(user_input: str) -> dict:
    """Use a small local LLM to answer, and also get a confidence score."""
    response = ollama.chat(
        model='phi3:mini',
        messages=[
            {'role': 'system', 'content': 'You are a helpful support assistant. Keep answers brief and factual.'},
            {'role': 'user', 'content': user_input},
        ],
    )
    answer = response['message']['content']
    # Dummy confidence estimation: check if answer is short and doesn't contain hedging
    confidence = 0.9 if len(answer) < 200 and 'I think' not in answer else 0.5
    return {'answer': answer, 'confidence': confidence}


def query_cloud_api(user_input: str, conversation_history: list) -> str:
    """Send the query plus recent history to the cloud model."""
    # Reads the key from the OPENAI_API_KEY environment variable instead of hard-coding it
    client = OpenAI()
    messages = [{'role': 'system', 'content': 'You are a support agent.'}]
    messages.extend(conversation_history[-4:])  # last few turns
    messages.append({'role': 'user', 'content': user_input})
    response = client.chat.completions.create(
        model='gpt-4',
        messages=messages,
        max_tokens=300,
    )
    return response.choices[0].message.content


def handle_query(user_input: str, conversation_history: list) -> dict:
    """Route a query through rules, then the local model, then the cloud API."""
    start = time.perf_counter()

    def elapsed_ms() -> int:
        # Measured latency rather than a hard-coded estimate
        return round((time.perf_counter() - start) * 1000)

    intent = classify_intent(user_input)
    # Step 1: Known intents can be answered directly from a predefined response
    if intent != 'unknown':
        # Placeholder text; a real system would look up the canned answer for this intent
        return {'answer': f'Quick answer for {intent}', 'source': 'rule', 'latency_ms': elapsed_ms()}
    # Step 2: Try local LLM
    local_result = query_local_model(user_input)
    if local_result['confidence'] > 0.7:
        return {'answer': local_result['answer'], 'source': 'local_llm', 'latency_ms': elapsed_ms()}
    # Step 3: Fall back to cloud
    answer = query_cloud_api(user_input, conversation_history)
    return {'answer': answer, 'source': 'cloud', 'latency_ms': elapsed_ms()}
Lessons learned and trade-offs
Cost: My bill dropped from ~$200/week to ~$60/week. The bulk of queries (about 70%) are handled by the rule engine or local model. Only 10% hit the cloud API. (The remaining 20% are misclassifications that still cost money, but that’s tolerable.)
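As a rough sanity check on those numbers, here is a small sketch of a proportional cost model. It assumes the bill scales linearly with the share of queries that reach the paid API, that the misclassified 20% also end up on the paid path, and that local inference costs nothing per query (hosting is ignored). The function name is hypothetical.

```python
def weekly_cloud_cost(baseline_weekly_cost: float, paid_fraction: float) -> float:
    """Estimate the weekly bill if only `paid_fraction` of queries reach the paid API."""
    return baseline_weekly_cost * paid_fraction


baseline = 200.0              # approximate weekly bill when every query went to the cloud
paid_fraction = 0.10 + 0.20   # intended cloud traffic plus misclassified queries (assumption)
estimate = weekly_cloud_cost(baseline, paid_fraction)
print(f"Estimated weekly bill: ${estimate:.2f}")
```

Under those assumptions the estimate comes out at about $60, which matches the observed bill.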
Latency: Rule-based answers take about 5 ms, the local model 300–500 ms, and the cloud 1–3 s. The user rarely notices the slow path because most queries are fast.
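Per-path figures like these can be collected with a very small recorder. The sketch below assumes each routed result is a dict with 'source' and 'latency_ms' keys, as returned by handle_query. The PathMetrics class and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


class PathMetrics:
    """Collects latencies per routing path and reports traffic share and mean latency."""

    def __init__(self) -> None:
        self._latencies = defaultdict(list)

    def record(self, result: dict) -> None:
        """Store the latency of one routed result under its source path."""
        self._latencies[result['source']].append(result['latency_ms'])

    def summary(self) -> dict:
        """Return {source: {'share': ..., 'mean_latency_ms': ...}} for all recorded paths."""
        total = sum(len(values) for values in self._latencies.values())
        return {
            source: {'share': len(values) / total, 'mean_latency_ms': mean(values)}
            for source, values in self._latencies.items()
        }


# Illustrative samples only, shaped like handle_query results.
metrics = PathMetrics()
for sample in [
    {'answer': '...', 'source': 'rule', 'latency_ms': 5},
    {'answer': '...', 'source': 'rule', 'latency_ms': 5},
    {'answer': '...', 'source': 'local_llm', 'latency_ms': 350},
    {'answer': '...', 'source': 'cloud', 'latency_ms': 1500},
]:
    metrics.record(sample)

report = metrics.summary()
print(report)
```

Calling metrics.record(handle_query(...)) in the endpoint would give the same breakdown for live traffic.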
Accuracy: The local model (Phi-3-mini) is surprisingly good for simple support tasks, but it occasionally gives wrong info (e.g., “Yes we accept Bitcoin” when we don’t). To mitigate, I added a confidence heuristic based on answer length and hedging words. It’s not perfect, but it reduces errors from ~10% to ~3%.
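A heuristic along those lines could look like the sketch below. The hedging phrases, the 200-character limit, and the 0.9/0.5 scores mirror the router code where possible, but the phrase list and the function name are assumptions for illustration.

```python
# Hypothetical list of hedging phrases; tune it against real transcripts.
HEDGING_PHRASES = ("i think", "i'm not sure", "maybe", "possibly", "it depends")


def estimate_confidence(answer: str, max_length: int = 200) -> float:
    """Return 0.9 for short answers without hedging phrases, otherwise 0.5."""
    text = answer.lower()
    is_short = len(answer) < max_length
    is_hedged = any(phrase in text for phrase in HEDGING_PHRASES)
    return 0.9 if is_short and not is_hedged else 0.5


print(estimate_confidence("Refunds are processed within the original payment method."))
print(estimate_confidence("I think we might accept that, but I'm not sure."))
print(estimate_confidence("Yes we accept Bitcoin"))
```

The last call shows the limit of this kind of check: a short, confidently worded wrong answer still scores high, so the heuristic only filters answers where the model itself signals doubt.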
Maintenance: The rule set grows over time, but it’s a flat file – easy to edit. The local model needs periodic updates if the business changes (e.g., new products). The cloud model stays the same.
What I’d do differently next time
- Better logging: I wish I had instrumented each path from day one. Now I have to retrofit metrics to see which intents are misrouted.
- A/B test the cutoff: My confidence threshold of 0.7 was a guess. I should run a randomized trial to find the optimal balance between cost and accuracy.
- Use a dedicated classification model instead of a crude confidence heuristic. A tiny BERT classifier (e.g., distilbert-base-uncased fine-tuned on my intents) would be more reliable and still cheap to run.
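Before running a live A/B test on the cutoff, an offline sweep over logged, hand-labelled answers can narrow the range worth testing. This is a sketch of that simpler technique, not a randomized trial. The sample data, the (confidence, was_correct) format, and the function name are hypothetical.

```python
def sweep_thresholds(samples: list, thresholds: list) -> list:
    """For each threshold, report how much traffic escalates and how often local answers are wrong.

    `samples` is a list of (confidence, was_correct) pairs for local-model answers.
    An answer is served locally when confidence > threshold, matching the router's check.
    """
    rows = []
    for threshold in thresholds:
        served_locally = [ok for confidence, ok in samples if confidence > threshold]
        escalated = len(samples) - len(served_locally)
        error_rate = served_locally.count(False) / len(served_locally) if served_locally else 0.0
        rows.append({
            'threshold': threshold,
            'escalation_rate': escalated / len(samples),
            'local_error_rate': error_rate,
        })
    return rows


# Made-up labelled samples for illustration only.
samples = [
    (0.95, True), (0.9, True), (0.85, False), (0.8, True),
    (0.6, True), (0.55, False), (0.5, False), (0.4, False),
]
rows = sweep_thresholds(samples, [0.5, 0.7, 0.9])
for row in rows:
    print(row)
```

A sweep like this is only informative with a graded confidence score. The dummy heuristic in the router returns either 0.9 or 0.5, so every threshold between those two values behaves identically.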
When NOT to use this approach
- If your queries require deep reasoning or multi-step logic, a local model will fail. Don’t bother.
- If your traffic is very low (<100 queries/day), the cost of building the hybrid system might not be worth it. Just use the cloud API.
- If latency isn’t a concern (e.g., batch processing overnight), then simplicity wins. Stick with one model.
Final thoughts
I’m not a machine learning engineer – I’m a regular backend dev who needed to pay the bills. This approach let me keep the quality of a top-tier LLM while making it affordable. It’s not revolutionary; it’s just good engineering: use the right tool for the job.
One more thing – I ended up hosting my local model on a small rented GPU instance (like a $40/month box). You could also use an edge device or even a laptop, depending on traffic. The key is to keep the fallback path as thin as possible.
So, what’s your setup? Are you going all-in on cloud APIs, or have you found a clever hybrid that keeps costs down? I’d love to hear about your routing strategy.