zhongqiyue

Posted on Jun 22

I built an internal AI assistant: lessons on cost and control

#python #ai #webdev #api

The problem: endless API bills

A few months back, our team started using GPT-4 for an internal knowledge base Q&A tool. It worked beautifully—until the invoice arrived. Within two weeks we had racked up $2,300 in API calls. My CTO looked at me and said, "We need something cheaper, or we need to own it."

That's when I dove headfirst into building our own lightweight AI assistant. The hosted service at https://ai.interwestinfo.com/ was one of the options we evaluated, but I wanted to understand what it takes to host our own model. Here's what I learned.

What I tried first (and failed at)

Thinking I could fine-tune everything

I initially thought, "Let's just fine-tune a small open-source model on our docs." I spent three days collecting Q&A pairs, formatting them, and running LoRA on a Mistral 7B. The results? Hallucinations everywhere. The model kept making up facts with total confidence. It was worse than GPT-4—and slower.

Expecting a cloud service to be cheap

Then I tried a serverless inference API (like Replicate or Banana). The per-token cost was lower than GPT-4, but the cold starts made every call take 8 seconds. Our team hated it.

The naive local model trap

I even tried running Llama 2 locally on a beefy workstation. It worked for one user, but as soon as two people asked questions simultaneously, the RAM filled up and the process crashed. Not production-ready.

What eventually worked: a hybrid retrieval + small model approach

Here's the key insight: you don't need a giant model if you can give it the right context. Instead of asking the model to remember everything, I built a retrieval-augmented generation (RAG) pipeline:

Index all internal docs into a vector database (ChromaDB).
For each user question, retrieve the top 3 most relevant document chunks.
Stuff those chunks into a prompt, then call a small 7B model (Mistral-7B-Instruct) that runs on a single GPU.

The small model only needs to reason over the provided context, not memorize facts. Accuracy jumped to 94% on our test set, and the per-query cost dropped to near zero (just the cost of keeping the GPU running).

The code: a minimal RAG server in Flask

Here's the core logic. I stripped it down for clarity.

from flask import Flask, request, jsonify
from sentence_transformers import SentenceTransformer
import chromadb
from llama_cpp import Llama

app = Flask(__name__)

# Load embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# ChromaDB client
chroma_client = chromadb.PersistentClient(path="./doc_store")
collection = chroma_client.get_or_create_collection(name="docs")

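# Load quantized Mistral 7B
# n_gpu_layers=-1 offloads every layer to the GPU (needs a GPU-enabled
# llama-cpp-python build); the default of 0 keeps inference on the CPU.
llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=2048, n_gpu_layers=-1)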

def retrieve_context(query, top_k=3):
    query_embedding = embedder.encode(query).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return "\n\n".join(results["documents"][0])

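def build_prompt(query, context):
    # No leading "<s>": llama-cpp-python prepends the BOS token itself when it
    # tokenizes the prompt, so writing it here would duplicate it.
    return f"""[INST] Use the following context to answer the question.
Context:
{context}

Question: {query}

If the context doesn't contain the answer, say "I don't know." [/INST]"""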

@app.route("/ask", methods=["POST"])
def ask():
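    # silent=True returns None instead of raising on a missing or malformed JSON body
    data = request.get_json(silent=True)
    query = data.get("question") if isinstance(data, dict) else None
    if not query:
        return jsonify({"error": "no question"}), 400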

    context = retrieve_context(query)
    prompt = build_prompt(query, context)
    response = llm(prompt, max_tokens=256, temperature=0.1)
    answer = response["choices"][0]["text"].strip()

    return jsonify({"answer": answer, "context_used": context})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
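
Once the server is up, you can exercise the endpoint from any HTTP client. Here's a minimal sketch using only the standard library. The helper names `build_ask_request` and `ask`, the `http://localhost:5000` base URL (matching the port in `app.run` above), and the 60-second timeout are assumptions for illustration.

```python
import json
import urllib.request


def build_ask_request(question, base_url="http://localhost:5000"):
    """Build a POST request for the /ask endpoint defined above."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:5000", timeout=60):
    """Send the question and return the decoded JSON response as a dict."""
    req = build_ask_request(question, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `ask("How do I request VPN access?")` should return a dict with the `answer` and `context_used` keys the route produces, assuming the server is reachable and the collection has been populated.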

To get this working, you'll need to pre-populate ChromaDB with your document chunks. That's another script:

from sentence_transformers import SentenceTransformer
import chromadb

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chroma_client = chromadb.PersistentClient(path="./doc_store")
collection = chroma_client.get_or_create_collection(name="docs")

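# Assuming you have a list of text chunks
chunks = ["Document chunk 1...", "Document chunk 2..."]

# Encode all chunks in one batch and write them in a single call.
# upsert (rather than add) makes the script safe to re-run with the same IDs.
embeddings = embedder.encode(chunks).tolist()
collection.upsert(
    ids=[str(i) for i in range(len(chunks))],
    embeddings=embeddings,
    documents=chunks,
)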
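
The script above assumes the chunks already exist. One way to produce them is a sliding window over the text, sketched below. It counts whitespace-separated words as a rough stand-in for model tokens, so a 256-word chunk is only an approximation of the 256-token size mentioned later in this post. The `chunk_text` name and the 32-word overlap are assumptions, not values from our setup.

```python
def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into overlapping chunks of roughly `chunk_size` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of indexing some text twice.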

Trade-offs and lessons learned

Small model hallucinations are less frequent when context is good – but they still happen. We added a confidence check using the model's own logprobs to detect uncertain answers.
You need a GPU for reasonable latency – CPU inference on a 7B model takes 30+ seconds per answer. We ended up renting a single A10G (about $0.50/hr) which handles 20 concurrent users fine.
The embedding model matters more than you think – I started with all-MiniLM-L6-v2 but later switched to bge-small-en-v1.5 for better retrieval quality.
Prompt engineering is still critical – tiny changes in how you format the context can swing accuracy by 10%.
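
For the logprob-based confidence check mentioned in the first point, here is one way it could look. Treat it as a sketch under assumptions rather than the exact check we run: it takes the per-token log-probabilities of the generated answer, converts their average into a geometric-mean probability, and compares that to a threshold. The function names and the 0.6 threshold are hypothetical and would need tuning on your own data.

```python
import math


def mean_token_probability(token_logprobs):
    """Geometric-mean probability of the generated tokens (0.0 if there are none)."""
    values = [lp for lp in token_logprobs if lp is not None]
    if not values:
        return 0.0
    # Averaging log-probabilities and exponentiating gives the geometric mean.
    return math.exp(sum(values) / len(values))


def is_confident(token_logprobs, threshold=0.6):
    """Return True when the average token probability clears the threshold."""
    return mean_token_probability(token_logprobs) >= threshold
```

With llama-cpp-python, passing `logprobs=1` to the completion call should return the values under `response["choices"][0]["logprobs"]["token_logprobs"]`. Depending on the library version, the model may also need to be loaded with `logits_all=True`, so check the docs for the version you have installed.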

One alternative we didn't try: using a very cheap external API like the one at https://ai.interwestinfo.com/ for the generation step while keeping our own vector store. That could be a good middle ground if you don't want to manage GPUs.

What I'd do differently next time

Start with a leaner retrieval step – I over-indexed at first. Use BM25 as a fallback when embeddings fail.
Monitor costs upfront. Even GPU electricity adds up.
Automate the chunking strategy earlier. I spent hours tuning chunk size (256 tokens worked best for us).
Consider a hosted vector database (Pinecone, Weaviate) to avoid managing ChromaDB persistence.
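
To make the BM25 fallback idea concrete, here's a small pure-Python sketch of Okapi BM25 over an in-memory list of chunks. The class and method names are illustrative, the tokenizer is a naive lowercase split, and `k1=1.5`, `b=0.75` are common defaults rather than values we tuned. For real use, a maintained library would be a better choice.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (no punctuation handling)."""
    return text.lower().split()


class BM25:
    """Okapi BM25 scoring over a small in-memory list of documents."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(d) for d in documents]
        self.term_freqs = [Counter(d) for d in self.docs]
        self.doc_lens = [len(d) for d in self.docs]
        self.avg_len = sum(self.doc_lens) / len(self.docs) if self.docs else 0.0
        # Document frequency: how many documents contain each term.
        self.doc_freq = Counter(term for tf in self.term_freqs for term in tf)

    def idf(self, term):
        n = len(self.docs)
        df = self.doc_freq.get(term, 0)
        # The +1 inside the log keeps IDF non-negative for very common terms.
        return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

    def score(self, query, index):
        tf = self.term_freqs[index]
        length_norm = 1.0 - self.b + self.b * self.doc_lens[index] / (self.avg_len or 1.0)
        total = 0.0
        for term in tokenize(query):
            freq = tf.get(term, 0)
            if freq:
                total += self.idf(term) * freq * (self.k1 + 1) / (freq + self.k1 * length_norm)
        return total

    def top_k(self, query, k=3):
        """Indices of the k best-matching documents, skipping zero-score ones."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for _, i in scored[:k]]
```

A fallback could then call `BM25(chunks).top_k(question)` whenever the vector query returns nothing useful, for example on exact identifiers or error codes that embeddings tend to blur.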

Wrapping up

Building your own AI assistant isn't rocket science, but it's not plug-and-play either. The real win was the RAG pattern – decoupling retrieval from generation. It gives you control, privacy, and a manageable cost ceiling.
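
To put that cost ceiling in numbers, here's a back-of-the-envelope comparison using only figures from this post: the $2,300 two-week API bill and the roughly $0.50/hr A10G. It assumes the GPU runs around the clock and ignores engineering time, storage, and bandwidth, so read it as a rough bound rather than a full cost model.

```python
HOURS_PER_DAY = 24

api_spend_two_weeks = 2300.00  # USD, the API bill from the first two weeks
gpu_rate_per_hour = 0.50       # USD, approximate A10G rental rate

# Assumption: the GPU stays up 24/7, which is the worst case for a rented instance.
gpu_cost_two_weeks = gpu_rate_per_hour * HOURS_PER_DAY * 14
gpu_cost_30_days = gpu_rate_per_hour * HOURS_PER_DAY * 30
spend_ratio = api_spend_two_weeks / gpu_cost_two_weeks

print(f"GPU, 14 days: ${gpu_cost_two_weeks:.2f}")   # GPU, 14 days: $168.00
print(f"GPU, 30 days: ${gpu_cost_30_days:.2f}")     # GPU, 30 days: $360.00
print(f"API bill was about {spend_ratio:.1f}x the GPU cost")  # about 13.7x
```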

Now I'm curious: what does your setup look like? Are you all-in on cloud APIs or do you self-host? What's the trick you've found to keep costs sane?

DEV Community