I built an LLM caching library to test what AI-assisted development actually looks like
I've been spending my evenings on a personal side project, mostly to learn by building. The latest experiment was wiring an AI agent into it.
While testing, I caught myself sending almost the same prompts over and over. Same intent, slightly different wording. And every test run cost me real money.
Then a thought hit me: if I'm doing this while testing, real users in production absolutely will too. The first 1000 users of an AI chatbot tend to ask the same handful of questions, and the LLM provider charges you for every single one.
I looked for a good caching solution and didn't find one that ticked all my boxes. So I built llm-cacher — and used it as an excuse to try something I hadn't done before: work with an AI assistant as a real collaborator throughout the entire build. I'd drive, it would implement, and I'd review everything that came out.
Here's what almost every LLM integration looks like:
import OpenAI from "openai";

const openai = new OpenAI();

async function summarize(text: string) {
  const res = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Summarize the following text." },
      { role: "user", content: text },
    ],
  });
  return res.choices[0].message.content;
}
If summarize() gets called with the same text twice, you pay twice. Run an eval suite a hundred times? Pay a hundred times.
You could roll your own cache:
const cache = new Map();

async function summarize(text: string) {
  if (cache.has(text)) return cache.get(text);
  // ...
}
But now you're maintaining cache logic for every API call. And it only handles exact matches — "Summarize this article" and "Summarize this article please" become different cache keys, even though the model returns essentially the same answer.
That's the gap I wanted to close. Three things drove the design: zero code changes to existing code, multiple storage backends to fit any stack, and semantic matching so near-identical prompts share the same cache entry.
What started as "just cache the response" turned out to be more involved than I expected — streaming, semantic search, distributed storage, and index management each brought their own surprises.
Who is this for
There are a few other caching options in the Node.js ecosystem worth knowing about:
LangChain.js has built-in caching, but only if you write your entire integration against the LangChain abstraction layer. If you're already using it — great, use theirs. If you're not, adopting LangChain just for caching is a lot.
Helicone and Portkey are SaaS proxies that include caching as part of a broader observability platform. If you need cost tracking, rate limiting, and request logging alongside caching, they're worth looking at. The trade-off is that your requests go through their servers.
GPTCache is the closest open-source equivalent with semantic caching, but it's Python-first and runs as a Docker sidecar — not a direct npm install.
Upstash Semantic Cache is a JavaScript SDK with semantic caching, but it's tied to Upstash's managed service.
Anthropic's built-in prompt caching is worth mentioning separately because it's easy to confuse with what llm-cacher does. Anthropic's feature caches the processed form of a repeated prompt prefix, such as a long system prompt, which reduces the cost of re-processing that prefix on later requests. The model still generates a fresh response each time. llm-cacher caches the full response instead. The two are complementary, so you can use both.
llm-cacher is for when you want self-hosted caching, you're using the OpenAI or Anthropic SDK directly, and you don't want to adopt a framework or sign up for a service to get there.
Quick start
npm install llm-cacher
import OpenAI from "openai";
import { createCachedClient } from "llm-cacher";

const openai = createCachedClient(new OpenAI(), {
  ttl: "24h",
  storage: "memory",
});

// First call hits the API
const res1 = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What is 2+2?" }],
});

// Second identical call is served from cache instantly
const res2 = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What is 2+2?" }],
});
createCachedClient returns a Proxy with the same TypeScript type as the original client. The rest of your code stays identical.
How it works under the hood
The cache key is a SHA-256 hash of the request parameters: model, messages, temperature, top_p, and so on. The stream flag is excluded from the key, so streaming and non-streaming calls for the same request share the same cache key.
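To make the key derivation concrete, here is a minimal sketch using Node's built-in crypto module. The helper names `canonicalize` and `cacheKey` are illustrative, and sorting object keys before hashing is an assumption on my part, not something the library is confirmed to do:

```typescript
import { createHash } from "node:crypto";

// Recursively sort object keys so logically equal requests serialize identically.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonicalize(source[key]);
    }
    return sorted;
  }
  return value;
}

// Hypothetical helper: SHA-256 over every request parameter except `stream`.
function cacheKey(request: Record<string, unknown>): string {
  const keyed: Record<string, unknown> = { ...request };
  delete keyed.stream; // streaming and non-streaming calls share one key
  return createHash("sha256")
    .update(JSON.stringify(canonicalize(keyed)))
    .digest("hex");
}
```

Because `stream` is dropped before hashing, `cacheKey({ model, messages })` and `cacheKey({ model, messages, stream: true })` produce the same 64-character hex string, while changing the model or temperature produces a different one.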
When a streaming request is cached, the chunks are accumulated and stored as a list. On a cache hit, they're replayed as an AsyncGenerator — your for await loop never knows the difference:
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  stream: true,
})

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
}
// Works whether the response came from the API or from cache
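The replay side of a cache hit can be very small. A minimal sketch, assuming the chunks were stored as a plain array; `replayChunks` is an illustrative name, not the library's internal function:

```typescript
// Replays previously stored chunks, in their original order, as an async iterable.
async function* replayChunks<T>(chunks: readonly T[]): AsyncGenerator<T> {
  for (const chunk of chunks) {
    yield chunk;
  }
}
```

A consumer iterates it with the same `for await` loop it would use on a live stream, which is why the calling code does not need a separate branch for cache hits.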
Storage backends
// Memory: default, zero deps
createCachedClient(client, { storage: "memory", maxSize: 500 });

// File: useful for CI and local dev
createCachedClient(client, { storage: "file", storagePath: "./cache.json" });

// SQLite: great for single-process apps
import { SQLiteStorage } from "llm-cacher";
createCachedClient(client, {
  storage: new SQLiteStorage({ path: "./cache.db" }),
});

// Redis: for multi-instance production
import { RedisStorage } from "llm-cacher";
import Redis from "ioredis";
createCachedClient(client, {
  storage: new RedisStorage({ client: new Redis() }),
});

// DynamoDB: for serverless
import { DynamoDBStorage } from "llm-cacher";
createCachedClient(client, {
  storage: new DynamoDBStorage({ tableName: "llm-cache", region: "us-east-1" }),
});
The backends target different environments: memory for tests, SQLite when you need persistence without a server, Redis for multi-instance production, and DynamoDB when you're serverless and want expiry handled at the infrastructure level. All backends are optional peer dependencies, so you only install what you actually use.
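To give a feel for what the simplest backend has to do with options like `maxSize` and `ttl`, here is a rough sketch of a bounded in-memory store. It assumes least-recently-used eviction and a synchronous API, neither of which is confirmed for llm-cacher; `BoundedTtlCache` and its constructor parameters are hypothetical:

```typescript
interface Entry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

class BoundedTtlCache<V> {
  private readonly entries = new Map<string, Entry<V>>();
  private readonly maxSize: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxSize: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | null {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return null;
    }
    // Re-insert so the Map's insertion order doubles as recency order.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxSize) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

The injectable `now` function is only there to make expiry testable without waiting on a real clock.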
Semantic caching
Exact-match caching misses a lot of real-world hits. Consider:
"Summarize this article."
"Summarize the article above."
"Can you summarize this article please?"
To a hash function, these are three completely different requests. The model, however, returns nearly identical output for all three.
llm-cacher solves this by computing embeddings for each prompt and comparing them with cosine similarity. If the similarity is above your threshold, it's a cache hit.
import { LocalEmbedder } from "llm-cacher";

const openai = createCachedClient(new OpenAI(), {
  storage: "sqlite",
  semantic: {
    embedder: new LocalEmbedder(), // ~25MB model, runs locally, no API key
    threshold: 0.92, // higher = stricter matching
  },
});
LocalEmbedder uses all-MiniLM-L6-v2 via @huggingface/transformers. No API key, no extra cost. For higher accuracy, you can switch to OpenAI embeddings:
import { OpenAIEmbedder } from 'llm-cacher'

// Inside the createCachedClient options:
semantic: {
  embedder: new OpenAIEmbedder({ client: new OpenAI() }),
  threshold: 0.95,
  indexType: 'hnsw', // roughly O(log n) lookup for large caches
}
By default, similarity search does a linear scan across all cached embeddings — which is fine for most use cases. If your cache grows into the tens of thousands of unique entries, indexType: 'hnsw' switches to an HNSW graph index and keeps lookups fast.
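The default linear scan can be pictured as a loop over every stored embedding. A minimal sketch, assuming embeddings are plain number arrays and that a score equal to the threshold counts as a hit; `cosineSimilarity`, `findBestMatch`, and `IndexedEntry` are illustrative names, not llm-cacher's API:

```typescript
interface IndexedEntry {
  key: string; // cache key of the stored response
  embedding: number[]; // embedding of the prompt that produced it
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Linear scan: O(n) comparisons, returning the best key at or above the threshold.
function findBestMatch(
  query: readonly number[],
  entries: readonly IndexedEntry[],
  threshold: number,
): string | null {
  let bestKey: string | null = null;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score >= bestScore) {
      bestScore = score;
      bestKey = entry.key;
    }
  }
  return bestKey;
}
```

Every lookup touches every entry, which is why the cost grows with cache size and why a graph index becomes attractive for large caches.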
Framework integrations
Each framework has its own way of sharing state across requests — Express augments req, Hono uses typed context variables, NestJS uses dependency injection. The integrations follow those conventions so withCache feels native to whatever stack you're in.
NestJS
// app.module.ts
@Module({
  imports: [
    LlmCacheModule.forRoot({
      ttl: "24h",
      storage: new RedisStorage({ client: new Redis() }),
    }),
  ],
})
export class AppModule {}

// chat.service.ts
@Injectable()
export class ChatService {
  private readonly openai: OpenAI;

  constructor(@InjectLlmCache() private readonly llmCache: LlmCacheService) {
    this.openai = this.llmCache.wrap(new OpenAI());
  }
}
Express
app.use(llmCacheMiddleware({ ttl: "24h", storage: "memory" }));

app.post("/chat", async (req, res) => {
  const openai = req.withCache(new OpenAI());
  // ...
});
Hono
app.use(llmCacheMiddleware({ ttl: "24h", storage: "sqlite" }));

app.post("/chat", async (c) => {
  const openai = c.get("withCache")(new OpenAI());
  // ...
});
Things I learned along the way
Streaming is harder to cache than it looks. You can't just intercept the response — you have to yield each chunk to the caller in real time while simultaneously collecting them into an array. And if storage fails after the stream has fully delivered, you can't throw: the caller already received their data. That's why the set() call after a stream ends is wrapped in .catch(() => undefined). It's not lazy error handling — it's deliberate. A storage failure at that point is not the caller's problem.
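The pass-through-and-collect pattern described above could look roughly like this. It is a sketch under assumptions: `teeAndStore` and its `store` callback are hypothetical names standing in for the library's internals, and awaiting the swallowed write is my choice, not a confirmed detail:

```typescript
// Yields each chunk to the caller immediately while keeping a copy for the cache.
async function* teeAndStore<T>(
  source: AsyncIterable<T>,
  store: (chunks: T[]) => Promise<void>,
): AsyncGenerator<T> {
  const collected: T[] = [];
  for await (const chunk of source) {
    collected.push(chunk);
    yield chunk; // the caller sees the chunk as soon as it arrives
  }
  // The caller already has every chunk, so a storage failure is swallowed, not thrown.
  await store(collected).catch(() => undefined);
}
```

One side effect of this shape: if the consumer stops iterating early, the code after the loop never runs, so a partial stream is never written to the cache.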
The similarity index needs active cleanup. When a cache entry expires in storage, its embedding stays in the in-memory index indefinitely if you don't do anything about it. Left unchecked, the index keeps growing and starts returning keys that no longer exist in storage. The fix is to remove the key from the index whenever a get() returns null — whether that's on a direct lookup or after a semantic match comes back empty.
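That cleanup rule fits in a few lines. In this sketch the `EntryStore` and `SimilarityIndex` interfaces are hypothetical and synchronous for brevity (the real storage calls are presumably asynchronous), and `semanticGet` is an illustrative name:

```typescript
// Hypothetical minimal shapes; the library's real interfaces are not shown in this post.
interface EntryStore {
  get(key: string): string | null; // null means missing or expired
}

interface SimilarityIndex {
  nearest(embedding: number[]): string | null; // best-matching key, if any
  remove(key: string): void;
}

function semanticGet(
  embedding: number[],
  index: SimilarityIndex,
  store: EntryStore,
): string | null {
  const key = index.nearest(embedding);
  if (key === null) return null;

  const value = store.get(key);
  if (value === null) {
    // The entry expired in storage, so its embedding is stale: evict it from the index.
    index.remove(key);
    return null;
  }
  return value;
}
```

The index is only pruned lazily, when a lookup actually lands on a dead key, so no background sweep is needed.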
HNSW doesn't delete — it marks. hnswlib doesn't support removing a vector from the index outright. Instead it uses markDelete(), which flags the entry but leaves it in memory. To reclaim those slots, you track a deletedCount and pass replaceDeleted: true on the next addPoint() — which lets the library reuse a marked slot instead of allocating a new one. It's not obvious from the docs and easy to get wrong.
Proxy over a class wrapper. The obvious approach to wrapping an SDK is subclassing or a decorator class. The problem: you'd have to declare every method statically, and the return type would diverge from the original. A Proxy intercepts only chat.completions.create and passes everything else through to the real client untouched — so the TypeScript type stays identical to the original OpenAI instance. No re-declarations, no type casting.
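A stripped-down version of the idea, assuming nothing about llm-cacher's actual implementation: `wrapPath` is a hypothetical helper that proxies each object along a property path and intercepts only the final method.

```typescript
type AnyFn = (...args: any[]) => any;

// Returns a Proxy typed exactly like `target`; only the method at `path` is intercepted.
function wrapPath<T extends object>(
  target: T,
  path: readonly string[],
  intercept: (original: AnyFn, args: unknown[]) => unknown,
): T {
  const [head, ...rest] = path;
  return new Proxy(target, {
    get(obj, prop) {
      const value = Reflect.get(obj, prop);
      if (prop !== head) {
        // Everything off the path passes through, bound to the real object.
        return typeof value === "function" ? value.bind(obj) : value;
      }
      if (rest.length === 0) {
        const original = (value as AnyFn).bind(obj);
        return (...args: unknown[]) => intercept(original, args);
      }
      // Still on the path: wrap the next level (for example `chat`, then `completions`).
      return wrapPath(value as object, rest, intercept);
    },
  });
}
```

Calling `wrapPath(client, ["chat", "completions", "create"], handler)` returns a value with the same static type as `client`, so call sites compile unchanged while `handler` decides whether to call through or answer from a cache.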
On using AI to build this
One of my goals going into this was to test how far an AI assistant could get without much hand-holding — give it a direction, see what it produces.
The honest answer: it produced a lot of code quickly, and a lot of that code had bugs. Not obvious crashes, but subtle logic errors — a TTL check that was off by up to a second, a mock in a test that never actually exercised the code it was supposed to test, edge cases in the similarity index that only showed up when I read the implementation carefully. Each one took me sitting down, understanding what the code was doing, and explaining back to the AI where it went wrong.
Using an AI assistant genuinely speeds up development — I wouldn't have built this as fast on my own. But the speed only works if you understand what it's generating. If you accept the output without reading it, the bugs ship with the code. The AI is confident whether it's right or wrong, and it's on you to tell the difference.
I'd use it again. But I'd go in knowing that "review everything that comes out" isn't optional. I'd also suggest running the assistant in a virtual machine with restricted access rights.
What's next
- Cost tracking — show how much you've saved compared to always hitting the API
- A dashboard for inspecting cache contents and hit rates
- Gemini and Mistral adapters
If any of this sounds useful, or you want something completely different, open an issue — I'm genuinely open to feedback on direction.
Try it out
npm install llm-cacher
If you hit a weird edge case or want to plug in a new storage backend, PRs are welcome.