Keeping a chat app's token bill flat as conversations grow

#ai #llm #firebase #performance

Every chat feature has the same quiet problem. The first message costs almost nothing. The hundredth message costs a fortune, because by then you are re-sending the entire backlog on every single turn.

We hit this building Meme Chat AI, a chat app where the assistant talks back in memes. A conversation that ran long enough would start sending five, ten, twenty thousand tokens of history with each reply, most of it old and irrelevant to what the user just typed. The model still has to read all of it, you still pay for all of it, and latency creeps up the whole time. Here is what we did about it, and the rate limiter we put in front of it so a single client can't run the bill up on its own.

The shape of the fix

The naive options are both bad. You can send the whole transcript (cost grows without bound) or send only the last few messages (the model forgets what happened earlier in the chat). We wanted neither.

The pattern we landed on is a rolling summary plus a verbatim window. Every prompt looks like this:

[ stable system / persona prompt ]
[ summary of older turns ]
[ last N turns, word for word ]
[ the current user message ]

Older turns don't get dropped. They get folded into a running summary. Recent turns stay exactly as written, because that's the part the model actually needs at full fidelity to answer the next message. Nothing is ever silently lost: a message is either inside the verbatim window or inside the summary.
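As a rough illustration of that layout, here is a minimal assembly sketch. The `ChatMessage` and `PromptParts` shapes and the `buildPrompt` name are assumptions made for this example, not the app's actual types, and the way the summary is labeled is just one option.

```typescript
// Hypothetical message shape; a real provider SDK will have its own types.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface PromptParts {
  personaPrompt: string; // stable system / persona prompt
  summary: string;       // running summary of older turns ("" if none yet)
  recent: ChatMessage[]; // last N turns, word for word
  userMessage: string;   // the current user message
}

// Assemble the four slots in a fixed order so the stable parts stay in front.
function buildPrompt(parts: PromptParts): ChatMessage[] {
  const messages: ChatMessage[] = [
    { role: "system", content: parts.personaPrompt },
  ];
  // Skip the summary slot entirely while the conversation is still short.
  if (parts.summary.trim().length > 0) {
    messages.push({
      role: "system",
      content: `Summary of the earlier conversation:\n${parts.summary}`,
    });
  }
  messages.push(...parts.recent);
  messages.push({ role: "user", content: parts.userMessage });
  return messages;
}
```

The order matters more than the exact wording: the parts that change least sit at the front, and the part that changes every turn sits at the end.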

Sizing the window by tokens, not message count

Our first version capped the window at a flat message count. That turned out to be the wrong knob.

A flat count punishes everyone equally, which means it punishes the wrong people. A user on a higher tier has a much larger input budget to work with, so there's no reason to start summarizing their conversation as aggressively as a free user's. But a fixed "keep the last 12 messages" rule did exactly that.

So we size the window from the token budget instead. Take the plan's input allowance, subtract the fixed overhead that rides along in every prompt (the persona prompt, the summary slot, the current turn), and let the verbatim tail fill most of what's left:

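function verbatimBudgetTokens(maxInputTokens: number): number {
  // PROMPT_OVERHEAD_TOKENS covers the persona prompt, the summary slot, and the current turn.
  const headroom = maxInputTokens - PROMPT_OVERHEAD_TOKENS;
  if (headroom <= 0) return 0;
  // Give the verbatim tail 85% of what's left and keep the rest as margin.
  return Math.round(headroom * 0.85);
}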

That 0.85 is deliberate. Our token count is an estimate, and the provider's count is the one that bills you. Leaving a margin means a small drift between our estimate and the provider's count never pushes the assembled prompt over the model's actual input limit. There's also a hard ceiling on message count sitting on top of the token budget, purely as a safety bound so a flood of tiny one-word turns can't balloon the prompt or the database reads. In normal use the token budget is what gates; the count cap almost never bites.
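To make the arithmetic concrete with made-up numbers: an 8,000-token input allowance and 1,500 tokens of overhead leave 6,500 tokens of headroom, so `verbatimBudgetTokens` returns 5,525. The sketch below shows one way the tail could then be picked against that budget. The `Turn` shape, the four-characters-per-token estimate, the `MAX_VERBATIM_MESSAGES` value, and the `selectVerbatimWindow` name are all assumptions for illustration.

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Crude estimate (about 4 characters per token). An assumption for this
// sketch, not the provider's tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical hard ceiling on message count, used only as a safety bound.
const MAX_VERBATIM_MESSAGES = 60;

// Walk backwards from the newest turn, keeping turns until either the token
// budget or the message cap is reached. Everything older is handed back as
// "agedOut" so it can be folded into the summary.
function selectVerbatimWindow(
  history: Turn[],
  budgetTokens: number,
  maxMessages: number = MAX_VERBATIM_MESSAGES,
): { verbatim: Turn[]; agedOut: Turn[] } {
  let used = 0;
  let start = history.length;
  while (start > 0 && history.length - start < maxMessages) {
    const turn = history[start - 1];
    if (!turn) break;
    const cost = estimateTokens(turn.content);
    if (used + cost > budgetTokens) break;
    used += cost;
    start -= 1;
  }
  return { verbatim: history.slice(start), agedOut: history.slice(0, start) };
}
```

Because the walk stops at the first turn that doesn't fit, the window is always a contiguous tail of the conversation, never a set of cherry-picked messages.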

Truncation is a fallback, not the main mechanism

The summary handles the long-term growth. But assembly still does a final check before anything goes to the model: build the prompt, count it, and if it's somehow over budget, drop the oldest verbatim message and recount. Repeat until it fits.

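// Start with the full verbatim tail and measure the assembled prompt.
let current = recent.slice();
let messages = build(current);
let inputTokens = countMessagesTokens(messages);

// Still over budget: drop the oldest verbatim message, rebuild, and recount.
while (inputTokens > maxInputTokens && current.length > 0) {
  current = current.slice(1);
  messages = build(current);
  inputTokens = countMessagesTokens(messages);
}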

The system prompt, the summary, and the current turn are never candidates for dropping. They're load-bearing. Only the recent-history tail gets trimmed, oldest first. In practice this loop rarely does anything, because the window was already sized to fit. It exists for the edge case where a single pasted wall of text blows past the estimate, and it guarantees we never hand the API a prompt it will reject.

The cheapest token is the one you stop re-sending

A subtle source of bloat was attachments. When a user sends an image or a GIF, that turn is expensive. The image parts alone can be a couple hundred tokens for one still and several times that for a GIF that gets sampled into frames. The model needs all of that on the turn the image arrives. It does not need it five turns later.

So once an attachment turn ages into history, we collapse it to a short text placeholder instead of re-sending the pixels:

// historical turn that once carried an image
"[User sent an image]"

The model keeps the thread of "the user showed me something here" without paying the visual token cost on every subsequent turn. Only the current turn is ever allowed to carry real image data.
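A minimal sketch of that collapse step, assuming stored turns hold a list of typed parts. The `Part` and `StoredTurn` shapes, the function names, and the separate GIF placeholder are hypothetical; only the "[User sent an image]" text comes from the example above.

```typescript
// Hypothetical shapes for a stored turn with multimodal parts.
type Part =
  | { kind: "text"; text: string }
  | { kind: "image"; mimeType: string; data: string };

interface StoredTurn {
  role: "user" | "assistant";
  parts: Part[];
}

// Replace every image part in a historical turn with a short text placeholder.
// Returns a new turn and leaves the stored one untouched.
function collapseAttachments(turn: StoredTurn): StoredTurn {
  const who = turn.role === "user" ? "User" : "Assistant";
  const parts = turn.parts.map((part): Part => {
    if (part.kind !== "image") return part;
    const label = part.mimeType === "image/gif" ? "a GIF" : "an image";
    return { kind: "text", text: `[${who} sent ${label}]` };
  });
  return { ...turn, parts };
}

// Only the last entry (the current turn) keeps its real image data.
function prepareHistory(turns: StoredTurn[]): StoredTurn[] {
  return turns.map((turn, index) =>
    index === turns.length - 1 ? turn : collapseAttachments(turn),
  );
}
```

Doing this at prompt-assembly time rather than at write time keeps the original attachment in storage, so the UI can still render it.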

Two things worth knowing about caching

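Two design choices are really about the prompt cache. Many providers now bill input tokens at a steep discount when they match a prompt prefix the provider has recently processed, and a prefix only matches if it is identical from the first token on.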

First, the big static persona prompt goes first and stays byte-identical across every turn and every user. Anything user-specific (their name, their language, any per-user memory) lives in a second block after it, so the expensive cacheable prefix never changes shape from one user to the next.

Second, the summary only changes when we actually re-summarize. As long as it's stable, the [persona][summary] prefix stays cacheable between turns. That's also why we don't re-summarize on every message. We batch it: the background summarizer only folds aged-out turns into the summary once enough of them have accumulated, by count or by token volume. Re-summarizing constantly would churn the prefix and throw away cache hits to save a trivial amount of summary length, which is a bad trade.

The summarizer itself runs as a background job on a cheaper utility model, decoupled from the request path. The user's reply never waits on it.
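The batching rule can be sketched as a small predicate the background job checks before doing any work. The threshold values, the `AgedTurn` shape, and the `shouldResummarize` name are assumptions for this example; the real cutoffs would be tuned against observed traffic.

```typescript
interface AgedTurn {
  content: string;
}

// Hypothetical thresholds: re-summarize once either one is reached.
const RESUMMARIZE_MIN_TURNS = 8;
const RESUMMARIZE_MIN_TOKENS = 1500;

// Crude estimate (about 4 characters per token), an assumption for this sketch.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// True when enough aged-out turns have piled up, by count or by token volume,
// to justify rewriting the summary and invalidating the cached prefix.
function shouldResummarize(
  pending: AgedTurn[],
  minTurns: number = RESUMMARIZE_MIN_TURNS,
  minTokens: number = RESUMMARIZE_MIN_TOKENS,
): boolean {
  if (pending.length === 0) return false;
  if (pending.length >= minTurns) return true;
  const tokens = pending.reduce((sum, turn) => sum + estimateTokens(turn.content), 0);
  return tokens >= minTokens;
}
```

Until the predicate fires, the existing summary is left alone, which is what keeps the [persona][summary] prefix stable between turns.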

Rate limiting, kept boring

Token discipline controls cost per conversation. It does nothing about a client hammering the endpoint. For that we put a small per-IP limiter in front of the streaming function, backed by the database we already had rather than a new piece of infrastructure.

It's a fixed window: one document per IP per hour, an atomic increment, reject once the count crosses the threshold.

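// WINDOW_MS is one hour in milliseconds, so each hour maps to one bucket number.
const hourBucket = Math.floor(Date.now() / WINDOW_MS);
const docId = `${ipKey(ip)}_${hourBucket}`;
// "rateLimits" is a placeholder collection name.
const ref = db.collection("rateLimits").doc(docId);

return db.runTransaction(async (tx) => {
  const snap = await tx.get(ref);
  const count = snap.data()?.count ?? 0;
  // Already at the cap: reject without writing.
  if (count >= REQUESTS_PER_HOUR) return false;
  tx.set(ref, {
    count: FieldValue.increment(1),
    // Keep the document for one extra window, then let the TTL policy delete it.
    expireAt: Timestamp.fromMillis((hourBucket + 2) * WINDOW_MS),
  }, { merge: true });
  return true;
});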

A few details that matter more than the algorithm:

The IP is hashed before it ever touches storage, so we're not keeping a log of raw client addresses. The bucket carries an expireAt, so a TTL policy sweeps old documents and the collection doesn't grow forever. And the limiter fails open when there's no IP to key on or when it's running locally, so development against a single localhost address doesn't trip the cap every few minutes. The cost is one read and one write per request, which is cheap next to an LLM call.
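For illustration, the hashing and fail-open checks might look something like the sketch below. The `ipKey` name appears in the snippet above, but this body is a guess at one reasonable implementation: the salt, the truncation to 32 hex characters, the loopback list, and the `shouldBypassLimiter` helper are all assumptions.

```typescript
import { createHash } from "node:crypto";

// Hash the IP with a salt so raw addresses never reach storage.
// The salt is an assumption here; in practice it would come from a secret,
// since an unsalted hash of an IPv4 address is easy to brute-force.
function ipKey(ip: string, salt: string = "dev-only-salt"): string {
  return createHash("sha256")
    .update(`${salt}:${ip}`)
    .digest("hex")
    .slice(0, 32);
}

// Fail open: skip the limiter when there is no IP to key on, or when the
// request comes from a loopback address during local development.
function shouldBypassLimiter(ip: string | undefined): boolean {
  if (!ip) return true;
  return ip === "127.0.0.1" || ip === "::1" || ip === "::ffff:127.0.0.1";
}
```

The same IP always maps to the same key, so the per-hour document ID stays stable without the address itself being recoverable from a casual look at the collection.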

A fixed window has a known weakness: a client can fire a full window's worth of requests at 1:59 and another full window at 2:00, so close to twice the hourly limit can land within a couple of minutes. A sliding window or token bucket smooths that out. For our traffic the simple version was the right amount of engineering, and you can always tighten it later without touching anything upstream.
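For comparison, here is a small in-memory token bucket. It is not what runs in the app, which uses the database-backed fixed window above; it is a sketch of the alternative, and the `createTokenBucket` name, the per-process `Map`, and the injectable clock are assumptions made to keep it short and testable.

```typescript
interface Bucket {
  tokens: number;    // tokens currently available (may be fractional)
  updatedAt: number; // timestamp of the last refill, in milliseconds
}

// Returns a tryConsume(key) function. Each key gets a bucket that holds up to
// `capacity` tokens and refills continuously at `refillPerMs` tokens per ms.
function createTokenBucket(
  capacity: number,
  refillPerMs: number,
  now: () => number = Date.now,
): (key: string) => boolean {
  const buckets = new Map<string, Bucket>();

  return function tryConsume(key: string): boolean {
    const t = now();
    const prev = buckets.get(key) ?? { tokens: capacity, updatedAt: t };
    // Credit the tokens earned since the last call, capped at capacity.
    const refilled = Math.min(
      capacity,
      prev.tokens + (t - prev.updatedAt) * refillPerMs,
    );
    if (refilled < 1) {
      buckets.set(key, { tokens: refilled, updatedAt: t });
      return false; // not enough for one request yet
    }
    buckets.set(key, { tokens: refilled - 1, updatedAt: t });
    return true;
  };
}
```

Because tokens trickle back continuously, there is no boundary at which the full allowance resets at once. The trade-off is state: this version lives in one process's memory, so a multi-instance deployment would still need shared storage behind it.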

What it bought us

Long conversations stopped getting linearly more expensive. Cost per turn flattened into a band set by the plan's budget instead of climbing with the message count. Older context survives as a summary rather than vanishing, recent context stays exact, and the persona prompt stays cached across turns. The rate limiter caps the blast radius of any single client for the price of one extra read and write.

None of this is exotic. It's a summary buffer, a token budget, a placeholder for old attachments, and a counter in a database. The useful part was picking the token budget as the thing to scale on, and treating the cache prefix as something to protect rather than an afterthought.

All of it runs in production behind Meme Chat AI if you want to see where it ended up.