Ali Afana
How I Cut My AI Chatbot Costs by 55% With One Architecture Change

TL;DR: I split one big GPT-4o-mini call into two small, specialized calls. Tokens per message dropped from ~1,820 to ~830. Projected cost went from $300/1M messages to $140/1M messages. Here's exactly how.


The $300 Problem

I'm building Provia, an AI-powered e-commerce platform where an AI sales chatbot handles customer conversations — discovery, product search, objection handling, closing. The AI model is GPT-4o-mini, which is already one of the cheapest options out there.

After my first real end-to-end test — a 42-API-call conversation that consumed 30,654 tokens and cost $0.0054 — I sat down and did the math. At scale, my architecture would cost $30 per 100K messages and $300 per 1M messages. For an indie SaaS product, that's a margin killer.

The worst part? Most of those tokens were wasted. The AI was looping through the same searches, re-reading old context it didn't need, and writing responses three times longer than necessary. The problem wasn't the model. It was my architecture.

One structural change cut costs by 54.4%. No model downgrade. No quality loss. Actually, response quality went up because the AI stopped confusing itself with stale context.


The Before: One Big Call Per Message

My original architecture was the obvious one. Every time a customer sent a message, I made a single OpenAI call that looked like this:

| Component | Token Cost |
| --- | --- |
| System prompt (persona, instructions, rules) | ~500 tokens |
| Conversation history (last 20 messages) | ~1,000 tokens |
| Conversation summary (AI-generated recap) | ~200 tokens |
| Model response (avg) | ~120 tokens |
| Total per message | ~1,820 tokens |

The system prompt was verbose — 500+ tokens of instructions covering persona, tone, sales stage logic, search rules, and formatting guidelines. The history window was the last 20 messages, both customer and bot. And a conversation summary was injected into every call to give the AI "memory" of earlier topics.

On paper, it seems reasonable. In practice, it created three expensive problems.


The Three Problems That Were Burning Money

1. Summary Pollution

The conversation summary was supposed to help the AI remember context. Instead, it poisoned every interaction.

Here's what happened: a customer asks about red dresses in message #3. The summary captures "customer is looking for red dresses." Ten messages later, the customer asks about shoes. But the summary still says "red dresses." So the AI searches for red dresses and shoes. Then the summary updates to include both. Next message, the customer asks about a specific shoe, and the AI searches for red dresses, shoes, and that specific shoe.

The summary accumulated topics like a snowball. Every search included ghosts of old queries. More searches meant more tool calls, more tokens, more cost.

2. History Bloat

Loading the last 20 messages sounds like a safe default. But in a sales conversation, most of those messages are irrelevant to the current question. If the customer is asking "do you have this in size 8?" they don't need the AI to re-read the greeting, the initial product discovery, and the three messages where they discussed shipping.

Twenty messages at ~50 tokens each (both sides) is 1,000 tokens of context. Most of it noise. The model has to read all of it, process all of it, and pay for all of it.

3. Search Loops

This was the most expensive bug. Because the summary and history contained references to previous searches, the AI would frequently re-trigger searches it had already done. The conversation summary would say "customer was shown product X" and the AI would interpret that as a reason to search for product X again.

In my 42-call test conversation, I counted multiple redundant search cycles — the AI searching for the same products it had already found, because the context told it those products were relevant.

Each unnecessary search cycle costs a tool call round-trip: the model generates search parameters, the function executes, results come back, and the model processes them. That's easily 300-500 extra tokens per loop.


The Fix: Two Small Calls Instead of One Big One

The core insight was simple: searching and responding are different jobs. They need different context.

A search call needs to know what the customer just said. That's it. It doesn't need conversation history, personality instructions, or a summary of past topics. Adding those things actively hurts search quality.

A response call needs personality, recent context, and search results. But it doesn't need 20 messages of history — the last 6 from the current session are enough.

Call #1: The Search Call

```javascript
// SEARCH CALL — minimal, focused
const searchSys = `You are a product search assistant for "${store.name}".
The customer just said: "${message}"
Call search_products with what they want.`;

const { result: r1 } = await loggedChatCompletion({
  model: "gpt-4o-mini",
  messages: [{ role: "system", content: searchSys }],
  tools,
  max_tokens: 150,
}, ...);
```

Input: Only the customer's latest message (~60 tokens).
Job: Decide whether to search, and if so, what to search for.
max_tokens: 150 (hard cap — it either calls a tool or it doesn't).
History: Zero. None. Impossible to pollute.

This call is almost free. Sixty tokens in, 100 tokens out at most. And because it has zero history, it can never loop on old searches. It only sees the current message.

Call #2: The Response Call

```javascript
// RESPONSE CALL — context-aware but bounded
const { result: r2 } = await loggedChatCompletion({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: responseSys },
    ...toChat(responseCtx), // last 6 session messages
    choice,                 // search call's tool decision
    ...toolMsgs,            // search results
  ],
  max_tokens: 250,
}, ...);
```

Input: System prompt + customer profile + last 6 session messages + search results (~500 tokens).
Job: Write the actual reply to the customer.
max_tokens: 250 (prevents essay-length responses).
History: Last 6 messages from the current session only.

This call has enough context to write a good, personalized response, but not so much that it drowns in irrelevant history.


The Math

Here's the token breakdown, before and after:

Before (Single Call)

| Component | Tokens |
| --- | --- |
| System prompt | ~500 |
| History (20 messages) | ~1,000 |
| Summary | ~200 |
| Response output | ~120 |
| Total | ~1,820 |

After (Two Calls)

| Component | Tokens |
| --- | --- |
| Search call input | ~60 |
| Search call output | ~100 |
| Response call input | ~500 |
| Response call output | ~170 |
| Total | ~830 |

Token reduction: 54.4%

Cost at Scale

Using GPT-4o-mini pricing ($0.15/1M input tokens, $0.60/1M output tokens):

| Metric | Before | After | Savings |
| --- | --- | --- | --- |
| Tokens per message | ~1,820 | ~830 | 54.4% |
| Cost per message | ~$0.0003 | ~$0.00014 | 53.3% |
| Cost per 100K messages | ~$30 | ~$14 | $16 saved |
| Cost per 1M messages | ~$300 | ~$140 | $160 saved |

At 1M messages a month, that's $160 back in your pocket every month. For an indie SaaS, that's the difference between profitable and not.


Bonus Optimizations That Stacked

The two-call split was the biggest win, but three other changes compounded the savings.

Session-Based Memory Instead of Fixed Window

Instead of always loading the last 20 messages regardless of when they were sent, I switched to session-based windowing. If there's a gap of 30+ minutes between messages, that's a new session. The response call only sees messages from the current session (last 6 max).

This means if a customer comes back the next day, the AI doesn't reload yesterday's entire conversation. It starts fresh with their profile data, which contains everything it needs to personalize.

Impact: Eliminated 60-80% of irrelevant history tokens in returning-customer conversations.
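The windowing rule is simple enough to sketch. This is an illustrative implementation, not Provia's actual code; the message shape and function names are assumptions:

```javascript
// Session-based windowing sketch: a gap of 30+ minutes starts a new
// session, and only the last 6 messages of the current session are kept.
const SESSION_GAP_MS = 30 * 60 * 1000;
const MAX_SESSION_MESSAGES = 6;

function currentSessionWindow(messages) {
  // messages: oldest-first [{ role, content, ts }], ts in epoch ms
  let start = 0;
  for (let i = 1; i < messages.length; i++) {
    if (messages[i].ts - messages[i - 1].ts >= SESSION_GAP_MS) {
      start = i; // gap found: everything before it is a previous session
    }
  }
  return messages.slice(start).slice(-MAX_SESSION_MESSAGES);
}
```

A customer returning after lunch gets a fresh window; a rapid-fire conversation is still capped at six messages of context.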

Customer Profile Instead of Summary

The conversation summary was unstructured text — a paragraph the AI generated after each exchange. It was expensive to generate, expensive to include, and caused the search loop problem.

I replaced it with a structured customer profile: bullet points covering name, archetype, preferences, and current intent. This profile is updated incrementally, not regenerated from scratch. It's smaller (~80 tokens vs ~200), more precise, and doesn't accumulate stale search topics.

Impact: 60% reduction in "memory" token cost, plus elimination of search pollution.
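For illustration, here's what such a profile might look like and how it could be rendered into prompt bullets. The field names are my guesses, not the article's exact schema:

```javascript
// Hypothetical structured customer profile: small, precise, updated
// incrementally instead of regenerated like a free-text summary.
const profile = {
  name: "Sara",
  archetype: "deal hunter",
  preferences: ["dresses", "size 8", "under $60"],
  currentIntent: "comparing two shoe options",
};

// Render the profile as bullet points for the response call's prompt.
function profileToPrompt(p) {
  return [
    `- Name: ${p.name}`,
    `- Archetype: ${p.archetype}`,
    `- Preferences: ${p.preferences.join(", ")}`,
    `- Current intent: ${p.currentIntent}`,
  ].join("\n");
}
```

Because each field is overwritten rather than appended to, old search topics can't snowball the way they did in the prose summary.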

Product Card Filtering

In the old architecture, when the AI searched for products, all results were sent back to the customer as product cards — even if the AI only mentioned one of them in its response. This didn't affect token cost directly, but it confused customers and led to follow-up messages asking about products the AI didn't recommend.

Now, the frontend only renders product cards for items the AI explicitly referenced in its response text. Fewer confused follow-ups means fewer total messages, which means fewer API calls.

Impact: Hard to quantify, but anecdotally reduced "what about this one?" follow-up messages.
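A naive sketch of the filtering step, assuming the frontend matches product names in the reply text (the real implementation may match product IDs instead):

```javascript
// Only keep products the AI actually mentioned in its reply, so the
// frontend renders cards for those and silently drops the rest.
function filterMentionedProducts(replyText, searchResults) {
  const text = replyText.toLowerCase();
  return searchResults.filter((p) => text.includes(p.name.toLowerCase()));
}
```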


Why This Works (The Principle)

The underlying principle is context isolation. Different tasks need different context windows. When you shove everything into one call, you're paying for context that actively degrades output quality.

Think of it like database queries. You wouldn't write SELECT * FROM every_table when you only need one column from one table. But that's exactly what a single-call architecture does with LLM context.

The two-call pattern works because:

  1. The search call is stateless. It doesn't know or care about conversation history. This makes it immune to context pollution and extremely cheap.
  2. The response call is bounded. It has enough context to be helpful (6 recent messages, customer profile, fresh search results) but not so much that it wastes tokens on noise.
  3. max_tokens caps prevent runaway costs. The search call can't exceed 150 tokens. The response call can't exceed 250. This eliminates the long tail of expensive responses.
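A quick sanity check on point 3, using the article's pricing and typical input sizes: even with both caps fully hit, per-message cost stays bounded. This is back-of-envelope arithmetic, not production code:

```javascript
// Worst-case cost per message implied by the two calls' caps, at the
// GPT-4o-mini prices quoted in the article.
const PRICE_IN = 0.15 / 1e6;  // $ per input token
const PRICE_OUT = 0.60 / 1e6; // $ per output token

function worstCasePerMessage() {
  const inputTokens = 60 + 500;   // search input + response input
  const outputTokens = 150 + 250; // both max_tokens caps maxed out
  return inputTokens * PRICE_IN + outputTokens * PRICE_OUT;
}
```

Even the worst case tops out around $0.00033 per message; the uncapped single call had no such ceiling on long replies.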

The Tradeoffs

This isn't free. There are real tradeoffs:

Two API calls means two round-trips. Latency increases by the duration of the search call (~200-400ms for GPT-4o-mini). In practice, users don't notice because the search call is fast and the total response time stays under 2 seconds.

The search call can't reference history. If a customer says "show me more like the last one," the search call doesn't know what "the last one" is. I handle this by having the response call detect anaphoric references and include the last-shown product ID in the search context. It's an edge case, but it needs handling.
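One way to sketch that workaround is to thread the last-shown product ID into the otherwise history-free search prompt. The prompt wording and names here are illustrative, not Provia's exact code:

```javascript
// Build the search call's system prompt. When a last-shown product ID is
// available, append it so "more like the last one" resolves correctly.
function buildSearchPrompt(storeName, message, lastShownProductId) {
  let prompt =
    `You are a product search assistant for "${storeName}".\n` +
    `The customer just said: "${message}"\n` +
    `Call search_products with what they want.`;
  if (lastShownProductId) {
    prompt +=
      `\nIf they refer to a previous item ("that one", "more like it"), ` +
      `it was product ${lastShownProductId}.`;
  }
  return prompt;
}
```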

Two calls means two points of failure. If the search call fails, you need fallback logic. I default to skipping search and letting the response call work without product results — the AI can still have a conversation, it just can't recommend products until search recovers.
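That fallback can be as small as a try/catch around the search call. A sketch with assumed return shapes:

```javascript
// If the search call throws, degrade gracefully: return an empty tool
// decision so the response call still runs, just without product results.
async function safeSearchCall(runSearchCall) {
  try {
    return await runSearchCall();
  } catch (err) {
    return { toolChoice: null, toolMessages: [] };
  }
}
```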

None of these tradeoffs have been deal-breakers. The cost savings far outweigh the added complexity.


Try This Today

If you're running an AI chatbot with a single-call architecture, here's a checklist to estimate your own savings:

  1. Measure your current tokens per message. Log input and output tokens for 100+ real messages. Calculate the average.
  2. Identify what context each task actually needs. List every component in your prompt (system instructions, history, summaries, tool results). For each one, ask: "Does the model need this to do its current job?"
  3. Split calls by responsibility. If your model is both deciding what to do (search, lookup, API call) and generating a response, those are two different jobs. Separate them.
  4. Set max_tokens aggressively. For tool-calling decisions, 100-200 tokens is usually enough. For responses, set a cap based on your desired response length. A chatbot reply rarely needs more than 250 tokens.
  5. Replace summaries with structured data. If you're generating text summaries to maintain context, switch to structured profiles or key-value pairs. They're smaller, more precise, and less likely to cause context pollution.
  6. Use session windows, not fixed windows. Don't load the last N messages blindly. Detect session boundaries (time gaps, topic changes) and only load relevant recent context.
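Step 1 of the checklist above is a few lines if you're already logging OpenAI's `response.usage` objects (`prompt_tokens` / `completion_tokens`). A sketch, assuming a simple array of usage records:

```javascript
// Average tokens per message across logged OpenAI usage objects.
function avgTokensPerMessage(usageLog) {
  const total = usageLog.reduce(
    (sum, u) => sum + u.prompt_tokens + u.completion_tokens,
    0
  );
  return usageLog.length ? total / usageLog.length : 0;
}
```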

The two-call pattern isn't specific to e-commerce or sales bots. Any chatbot that does retrieval + response can benefit from this split. RAG pipelines, customer support bots, coding assistants — if your model is searching and responding in the same call, you're probably paying 40-60% more than you need to.


Final Numbers

| | Before | After |
| --- | --- | --- |
| Architecture | 1 call per message | 2 calls per message |
| Tokens per message | ~1,820 | ~830 |
| Cost per message | $0.0003 | $0.00014 |
| Cost per 1M messages | $300 | $140 |
| Search pollution | Frequent loops | Eliminated |
| Response quality | Verbose, unfocused | Concise, on-topic |

One architecture change. Two smaller calls. 55% cost reduction. Ship it.


I'm documenting my entire journey building an AI sales platform from Gaza. Follow me @AliMAfana for more real bugs from a real product.
