Last week, finance dropped a screenshot into the group chat: this month’s LLM API bill was ¥5,368, up 4x month-over-month. “Do you tech people not feel anything if you don’t spend money?” In that moment, I suddenly understood every algorithm team that’s ever had their budget slashed.
We run an intelligent customer-service system with three or four large clients. Daily active users aren’t huge, but conversations are extremely long. Some users chat with the bot for hundreds of turns, and every request has to stuff the entire message history into the context. Every single token the model generates forces it to re-read that mountain of chat logs. Tokens flow like water.
I knew right away we had to implement caching. Not Redis caching, not a CDN, but context caching. The idea is to de-duplicate model inputs at the prefix level: if a conversation prefix has already been processed once, don’t blindly re-compute it the next time. After we shipped this, daily token consumption dropped from ~1 million to ~100k, cutting costs by 90%. Median API latency fell from 3.2s to 0.4s. This post walks through the full approach and the code, and flags the two landmines that almost blew us up.
Where exactly are tokens being wasted?
First, some background. We use the Chat Completions API. Each turn of a conversation constructs a very long messages list and sends it to the model. Suppose a user’s conversation has already gone 30 rounds. The current request looks like this:
messages = [
    {"role": "system", "content": "你是客服,请友好回答..."},        # "You are a customer-service agent; answer in a friendly way..."
    {"role": "user", "content": "你好"},                             # "Hi"
    {"role": "assistant", "content": "您好,请问有什么可以帮您的?"},  # "Hello, how can I help you?"
    {"role": "user", "content": "我的订单没收到"},                    # "I haven't received my order"
    {"role": "assistant", "content": "请提供订单号..."},              # "Please share your order number..."
    ...
    {"role": "user", "content": "还是没收到,已经三天了"}              # "Still nothing, it's been three days"
]
Every new request has 90% of the content identical to the previous round, yet the model still processes all those tokens from scratch, and billing counts every one of them as input tokens. The typical “cache responses in Redis” trick doesn’t help here, because the messages list changes every time (one new round appended), so the cache key never matches.
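To see why the key never matches, here is a minimal sketch; naive_cache and naive_key are illustrative names, not part of our stack. The turn-31 request contains one extra round, so its hash differs from the turn-30 one and the lookup misses every time.

import hashlib, json

naive_cache = {}  # hypothetical exact-match response cache

def naive_key(messages):
    # Hash the entire messages list, the way a straightforward Redis-style cache would.
    return hashlib.md5(json.dumps(messages, sort_keys=True, ensure_ascii=False).encode('utf-8')).hexdigest()

turn_30 = [{"role": "user", "content": "我的订单没收到"}]
turn_31 = turn_30 + [
    {"role": "assistant", "content": "请提供订单号..."},
    {"role": "user", "content": "还是没收到,已经三天了"},
]

naive_cache[naive_key(turn_30)] = "some cached reply"
print(naive_key(turn_31) in naive_cache)  # False: one extra round means a brand-new key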
The root cause is clear: the prefix that has already been computed is never carved out of the billing or the computation. If we could recognise that a prefix has been processed before and reuse the model’s intermediate state from last time, we’d save a ton of tokens. But at the time, OpenAI’s API didn’t expose an explicit Prompt Caching control the way Anthropic does (automatic prompt caching only arrived for some models in late 2024). We had to simulate it ourselves.
Design: why not vector search, and why we rolled our own prefix cache
We had three paths in front of us:
- Full-messages response caching: only return a cached answer when the entire messages list is identical. Hit rate is practically zero, because every new request has one extra round.
- Vector-database semantic matching: embed historical messages, find “semantically similar” questions, and reuse previous answers. But this introduces semantic drift, and reusing an answer on a partial mismatch in a fast-moving conversation is risky; a customer-service bot can’t afford to hallucinate.
- Prefix caching: extract the prefix of the conversation (all but the latest user message), compute a deterministic hash, and on a hit reuse the model’s “intermediate result” for that prefix to answer the follow-up. The problem is that OpenAI’s API doesn’t expose intermediate states. So we compromise: cache the prefix of the messages (excluding the last user message) and store the model’s final assistant reply for that prefix. If the prefix is identical, the conversation has reached the same fork, so we can reuse that assistant reply and send only the newest user message to the model. We lose a bit of flexibility, but in a deterministic customer-service scenario it’s more than enough.
I chose path three. The core idea: use a hash table (persisted to disk) to store the mapping from "prefix messages → last assistant reply". Specifically, we take messages[:-1] as the cache key, and the value is the last assistant message. The next time a request comes in with the same first N messages, we instantly retrieve that assistant reply and only send the latest one or two rounds to the model. Input tokens drop from thousands to dozens in one shot.
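Before the code, here is what a single cache entry looks like for a short conversation, matching the implementation below (the messages are the same sample as above):

# Key: hash of messages[:-1], i.e. everything before the newest user message.
# Value: the assistant reply the model generates for this request.
messages = [
    {"role": "system", "content": "你是客服,请友好回答..."},
    {"role": "user", "content": "我的订单没收到"},
    {"role": "assistant", "content": "请提供订单号..."},
    {"role": "user", "content": "还是没收到,已经三天了"},   # newest user message
]
prefix = messages[:-1]   # the cache key is derived from this slice
# After the API call returns:
#   set_cached_reply(prefix, response.choices[0].message.content)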
Core implementation: building a real-world context cache in three steps
Step 1: compute a stable hash for the message list
This code turns the messages list, a list of Python dicts with no guaranteed key order, into a stable string key. We use json.dumps with fixed options, then MD5.
import json
import hashlib
from typing import Dict, List

def messages_hash(messages: List[Dict[str, str]]) -> str:
    """
    Deterministic hash of a messages list.
    Note: sort_keys and a fixed ensure_ascii setting are required so the
    hash stays identical across processes and environments.
    """
    serialized = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    # MD5 is fine here: we only need a stable, compact cache key, not cryptographic strength.
    return hashlib.md5(serialized.encode('utf-8')).hexdigest()
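A quick sanity check of what sort_keys buys us: key order inside each message dict no longer matters, while any change in content still produces a different key.

a = [{"role": "user", "content": "你好"}]
b = [{"content": "你好", "role": "user"}]       # same message, different key order
assert messages_hash(a) == messages_hash(b)     # identical hash
assert messages_hash(a) != messages_hash([{"role": "user", "content": "你好!"}])  # content change, new hash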
Step 2: the disk cache layer — LRU and persistence
We need to store hash -> assistant_message without blowing up the disk. I used the diskcache library, which comes with built-in expiry and LRU. Way cleaner than hand-rolling pickle.
from diskcache import Cache

# On-disk cache directory; entries expire after 7 days.
context_cache = Cache("./context_cache")
CACHE_TTL = 7 * 24 * 3600

def get_cached_reply(messages_prefix: List[Dict[str, str]]) -> str | None:
    key = messages_hash(messages_prefix)
    return context_cache.get(key)

def set_cached_reply(messages_prefix: List[Dict[str, str]], assistant_reply: str) -> None:
    key = messages_hash(messages_prefix)
    context_cache.set(key, assistant_reply, expire=CACHE_TTL)
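If you’re worried about the cache directory growing without bound, diskcache can also cap the on-disk size and choose an eviction policy. A sketch under illustrative settings (the 512 MB limit is a made-up number, not a recommendation):

# Optional: cap the on-disk size and evict least-recently-used entries first.
context_cache = Cache(
    "./context_cache",
    size_limit=512 * 1024 * 1024,            # start culling once the cache exceeds ~512 MB
    eviction_policy="least-recently-used",   # the default is "least-recently-stored"
)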
Step 3: inserting the cache logic before the API call
The actual function that calls OpenAI looks like this. Every time, we take messages[:-1] as the prefix and check the cache. If it hits, we wrap the cached assistant reply back into an assistant message, append the latest user message, and send only that slim payload (plus the system prompt) to the model.
from openai import OpenAI

client = OpenAI()

def chat_with_cache(messages: List[Dict[str, str]]) -> str:
    # The prefix is everything except the latest user message.
    prefix = messages[:-1]
    cached = get_cached_reply(prefix)
    if cached:
        # Cache hit: send only the system prompt, the cached assistant reply,
        # and the newest user message. The cached value is a plain string, so
        # wrap it back into an assistant message dict.
        latest_turn = [
            messages[0],  # system prompt
            {"role": "assistant", "content": cached},
            messages[-1],
        ]
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=latest_turn,
        )
    else:
        # Cache miss: send the full conversation, then cache the assistant
        # reply under the prefix so an identical fork can reuse it.
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )
        set_cached_reply(prefix, response.choices[0].message.content)
    return response.choices[0].message.content
That’s the core. In production, we added a few safety checks (e.g. ensure the last message is from the user, handle streaming, etc.), but the skeleton above already delivers the 90% token reduction.
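For completeness, here is roughly what one of those checks looks like; can_use_prefix_cache is an illustrative helper name, not our production code:

def can_use_prefix_cache(messages: List[Dict[str, str]]) -> bool:
    # Only apply the prefix trick when the conversation has the shape
    # chat_with_cache expects: a system prompt first and a user message last.
    return (
        len(messages) >= 2
        and messages[0]["role"] == "system"
        and messages[-1]["role"] == "user"
    )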
The two “landmines” I mentioned, long-prefix hash collisions and cache stampedes under concurrency, are stories for another post. But with this architecture, our customer-service system now handles long conversations without bleeding tokens, and the finance group chat has been blissfully quiet.