The problem: endless API bills
A few months back, our team started using GPT-4 for an internal knowledge base Q&A tool. It worked beautifully—until the invoice arrived. Within two weeks we had racked up $2,300 in API calls. My CTO looked at me and said, "We need something cheaper, or we need to own it."
That's when I dove headfirst into building our own lightweight AI assistant. A hosted service like the one at https://ai.interwestinfo.com/ was among the options we evaluated, but I wanted to understand what it actually takes to host our own model. Here's what I learned.
What I tried first (and failed at)
Thinking I could fine-tune everything
I initially thought, "Let's just fine-tune a small open-source model on our docs." I spent three days collecting Q&A pairs, formatting them, and running LoRA on a Mistral 7B. The results? Hallucinations everywhere. The model kept making up facts with total confidence. It was worse than GPT-4—and slower.
Expecting a cloud service to be cheap
Then I tried using a serverless inference API (like Replicate or Banana). The per-token cost was lower than GPT-4, but the cold starts made every call take 8 seconds. Our team hated it.
The naive local model trap
I even tried running Llama 2 locally on a beefy workstation. It worked for one user, but as soon as two people asked questions simultaneously, the RAM filled up and the process crashed. Not production-ready.
What eventually worked: a hybrid retrieval + small model approach
Here's the key insight: you don't need a giant model if you can give it the right context. Instead of asking the model to remember everything, I built a retrieval-augmented generation (RAG) pipeline:
- Index all internal docs into a vector database (ChromaDB).
- For each user question, retrieve the top 3 most relevant document chunks.
- Stuff those chunks into a prompt, then call a small 7B model (Mistral-7B-Instruct) that runs on a single GPU.
The small model only needs to reason over the provided context, not memorize facts. Accuracy jumped to 94% on our test set—and the cost dropped to near zero (just the GPU electricity).
The code: a minimal RAG server in Flask
Here's the core logic. I stripped it down for clarity.
from flask import Flask, request, jsonify
from sentence_transformers import SentenceTransformer
import chromadb
from llama_cpp import Llama

app = Flask(__name__)

# Load the embedding model (must match the one used to index the docs)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# ChromaDB client backed by an on-disk store
chroma_client = chromadb.PersistentClient(path="./doc_store")
collection = chroma_client.get_or_create_collection(name="docs")

# Load quantized Mistral 7B. n_gpu_layers=-1 offloads all layers to the GPU
# (needs a GPU-enabled llama-cpp-python build); drop it to run on CPU.
llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,
)


def retrieve_context(query, top_k=3):
    """Return the top_k most similar chunks, joined into one context string."""
    query_embedding = embedder.encode(query).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return "\n\n".join(results["documents"][0])


def build_prompt(query, context):
    # llama-cpp-python prepends the BOS token itself, so the template starts at [INST]
    return f"""[INST] Use the following context to answer the question.
Context:
{context}
Question: {query}
If the context doesn't contain the answer, say "I don't know." [/INST]"""


@app.route("/ask", methods=["POST"])
def ask():
    # silent=True returns None instead of raising on a missing or malformed JSON body
    data = request.get_json(silent=True) or {}
    query = data.get("question")
    if not query:
        return jsonify({"error": "no question"}), 400
    context = retrieve_context(query)
    prompt = build_prompt(query, context)
    response = llm(prompt, max_tokens=256, temperature=0.1)
    answer = response["choices"][0]["text"].strip()
    return jsonify({"answer": answer, "context_used": context})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
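One thing the stripped-down server glosses over is what happens when two requests arrive at once, which is exactly what killed my local Llama 2 experiment. The sketch below is one cautious way to handle it, not what we necessarily run in production: it assumes a single model instance should not be entered by two threads at the same time (check the llama-cpp-python docs for your version) and simply serializes calls behind a lock. `SerializedModel` and `FakeModel` are hypothetical names for illustration; `FakeModel` stands in for the real `Llama` object so the snippet runs without a GPU.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SerializedModel:
    """Wrap a callable model so only one thread runs inference at a time."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def __call__(self, prompt, **kwargs):
        # Concurrent requests queue up here instead of entering the model together.
        with self._lock:
            return self._model(prompt, **kwargs)


class FakeModel:
    """Hypothetical stand-in for llama_cpp.Llama that records overlapping calls."""

    def __init__(self):
        self.active = 0
        self.max_active = 0
        self._guard = threading.Lock()

    def __call__(self, prompt, **kwargs):
        with self._guard:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # pretend to generate tokens
        with self._guard:
            self.active -= 1
        return {"choices": [{"text": f"echo: {prompt}"}]}


if __name__ == "__main__":
    fake = FakeModel()
    guarded = SerializedModel(fake)
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(guarded, [f"q{i}" for i in range(8)]))
    print("max concurrent calls:", fake.max_active)
```

In the Flask app this would amount to wrapping the model once, e.g. `llm = SerializedModel(Llama(...))`, and leaving the `/ask` handler unchanged. The trade-off is that requests wait in line, so latency grows with the number of simultaneous users.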
Before the server can answer anything, you'll need to pre-populate ChromaDB with your document chunks. That's a separate script:
from sentence_transformers import SentenceTransformer
import chromadb

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chroma_client = chromadb.PersistentClient(path="./doc_store")
collection = chroma_client.get_or_create_collection(name="docs")

# Assuming you have a list of text chunks
chunks = ["Document chunk 1...", "Document chunk 2..."]

# Embed each chunk and store it under a stable string id
for i, chunk in enumerate(chunks):
    emb = embedder.encode(chunk).tolist()
    collection.add(ids=[str(i)], embeddings=[emb], documents=[chunk])
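The script above takes the `chunks` list as given. As a rough sketch of how such a list could be produced, here is a simple sliding-window splitter. It is an illustration under stated assumptions, not our exact chunker: whitespace-separated words stand in for model tokens, the default size of 256 comes from the chunk size that worked best for us, and the `overlap` of 32 and the function name `chunk_text` are made up for this example.

```python
def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into overlapping chunks of at most chunk_size words.

    Words are a crude proxy for tokens; swap in a real tokenizer if you
    need the chunk size to match the model's token count.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(600))
    pieces = chunk_text(sample)
    print(len(pieces), [len(p.split()) for p in pieces])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of a slightly larger index.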
Trade-offs and lessons learned
- Small model hallucinations are less frequent when context is good – but they still happen. We added a confidence check using the model's own logprobs to detect uncertain answers.
- You need a GPU for reasonable latency – CPU inference on a 7B model takes 30+ seconds per answer. We ended up renting a single A10G (about $0.50/hr) which handles 20 concurrent users fine.
- The embedding model matters more than you think – I started with all-MiniLM-L6-v2 but later switched to bge-small-en-v1.5 for better retrieval quality.
- Prompt engineering is still critical – tiny changes in how you format the context can swing accuracy by 10%.
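For the logprob-based confidence check mentioned in the first bullet, here is a minimal sketch of one way to do it, not necessarily the exact check we run. It scores an answer by the geometric mean of its token probabilities and flags it when that falls below a threshold. The function names and the 0.5 threshold are assumptions for illustration. To my understanding, llama-cpp-python returns an OpenAI-style `token_logprobs` list under `response["choices"][0]["logprobs"]` when you pass `logprobs=` to the completion call (and may need `logits_all=True` on the `Llama` constructor), but verify that against the docs for your version.

```python
import math


def answer_confidence(token_logprobs):
    """Return the geometric-mean token probability of an answer, in [0, 1].

    token_logprobs is a list of natural-log probabilities, one per generated
    token. None entries (some APIs emit them for the first token) are skipped.
    """
    scores = [lp for lp in token_logprobs if lp is not None]
    if not scores:
        return 0.0  # nothing to score, so treat the answer as unreliable
    return math.exp(sum(scores) / len(scores))


def is_uncertain(token_logprobs, threshold=0.5):
    """Flag answers whose average token probability falls below threshold."""
    return answer_confidence(token_logprobs) < threshold


if __name__ == "__main__":
    confident = [math.log(0.9)] * 4
    shaky = [math.log(0.2), math.log(0.3)]
    print(round(answer_confidence(confident), 3), is_uncertain(confident))
    print(round(answer_confidence(shaky), 3), is_uncertain(shaky))
```

An uncertain answer could then be replaced with "I don't know" or routed to a human. The right threshold depends on the model and the prompt, so it is worth tuning on a labelled test set rather than trusting the default.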
One alternative we didn't try: using a very cheap external API like the one at https://ai.interwestinfo.com/ for the generation step while keeping our own vector store. That could be a good middle ground if you don't want to manage GPUs.
What I'd do differently next time
- Start with a leaner retrieval step – I over-indexed at first. Use BM25 as a fallback when embeddings fail.
- Monitor costs upfront. Even GPU electricity adds up.
- Automate the chunking strategy earlier. I spent hours tuning chunk size (256 tokens worked best for us).
- Consider a hosted vector database (Pinecone, Weaviate) to avoid managing ChromaDB persistence.
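To make the BM25 fallback idea from the first bullet concrete, here is a small pure-Python sketch. It is an illustration, not code we shipped. `BM25Index` and `retrieve_with_fallback` are hypothetical names, tokenization is a plain lowercase whitespace split, `k1=1.5` and `b=0.75` are commonly used defaults rather than tuned values, and "embeddings fail" is interpreted here as the dense search returning no results, which is only one possible definition.

```python
import math
from collections import Counter


class BM25Index:
    """Tiny in-memory Okapi BM25 index over a list of text chunks."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.documents = list(documents)
        self.k1 = k1
        self.b = b
        tokenized = [doc.lower().split() for doc in self.documents]
        self.term_freqs = [Counter(tokens) for tokens in tokenized]
        self.doc_lengths = [len(tokens) for tokens in tokenized]
        total = sum(self.doc_lengths)
        self.avg_length = total / len(self.documents) if self.documents else 0.0
        # Number of documents each term appears in
        self.doc_freq = Counter()
        for tokens in tokenized:
            self.doc_freq.update(set(tokens))

    def _idf(self, term):
        n = self.doc_freq.get(term, 0)
        return math.log((len(self.documents) - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, index):
        if self.avg_length == 0:
            return 0.0
        norm = 1 - self.b + self.b * self.doc_lengths[index] / self.avg_length
        total = 0.0
        for term in query.lower().split():
            freq = self.term_freqs[index].get(term, 0)
            if freq:
                total += self._idf(term) * freq * (self.k1 + 1) / (freq + self.k1 * norm)
        return total

    def search(self, query, top_k=3):
        scored = [(self.score(query, i), i) for i in range(len(self.documents))]
        ranked = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
        return [self.documents[i] for _, i in ranked[:top_k]]


def retrieve_with_fallback(query, dense_search, bm25, top_k=3):
    """Use dense_search(query, top_k) first; fall back to BM25 if it returns nothing."""
    hits = dense_search(query, top_k)
    return hits if hits else bm25.search(query, top_k)


if __name__ == "__main__":
    index = BM25Index([
        "reset your vpn password in the portal",
        "expense reports are due on friday",
        "the vpn client requires version 5",
    ])
    print(retrieve_with_fallback("vpn password", lambda q, k: [], index, top_k=2))
```

Keyword scoring tends to help with exact identifiers such as error codes or product names, which embedding models can blur together. A stricter fallback rule, for example a similarity cutoff on the dense results, would need the distances that the vector store returns.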
Wrapping up
Building your own AI assistant isn't rocket science, but it's not plug-and-play either. The real win was the RAG pattern – decoupling retrieval from generation. It gives you control, privacy, and a manageable cost ceiling.
Now I'm curious: what does your setup look like? Are you all-in on cloud APIs, or do you self-host? What tricks have you found to keep costs sane?