The framing question developers ask when standing up LLM caching is wrong. It's not "Redis or vector database?" — it's "which layer of caching does this backend serve?" Redis is the right backend for exact-match caching: sub-millisecond lookups, simple key-value semantics, dirt cheap at any scale. Vector databases are the right backend for semantic caching: HNSW-indexed similarity search, ~30ms p95 lookups including embedding inference, dollars-not-cents per GB of stored embeddings. Production LLM caches run both, side by side, serving different request slices. This post walks through the latency math, the cost model, and the pick-list per use case — including when pgvector on your existing Postgres is the right call vs when a dedicated managed vector DB is.
The parent guide AI API caching covers the three-layer cache strategy at the system level; this post is the infrastructure-choice level below that.
Why both, not one
The two caching layers solve overlapping but distinct problems.
Exact-match caching stores responses keyed by a deterministic fingerprint of the request — typically a SHA-256 hash. New request arrives, you hash it, look up the key. If the key exists, return. Sub-8ms p95 lookup. Hit rate in production AI traffic is 5-15% — the byte-identical-request slice (cron jobs, regression tests, duplicate-submit user actions).
Semantic caching embeds the user's prompt with a sentence-embedding model, looks up the nearest stored embedding in a vector index, returns the cached response if cosine similarity exceeds a threshold. 20-40ms p95 including the embedding inference. Hit rate is 25-50% on top of whatever exact-match caught — the paraphrasable-intent slice (customer support, FAQ, documentation Q&A).
The numbers above are why production caches run both. Exact-match alone leaves 30-45 percentage points of total traffic uncached. Semantic alone pays embedding latency and infrastructure cost on requests that would have hit the cheap exact cache. Stacked, exact-match short-circuits the byte-identical slice in sub-10ms and semantic catches the rest.
The infrastructure question is which backend serves which layer. Redis serves exact-match; a vector database serves semantic. They don't substitute for each other.
The Redis layer (Layer 1)
What you need from the Layer 1 backend:
- Sub-10ms p95 GET latency on a key lookup. Redis delivers ~1-3ms p95 even on remote managed deployments.
- Atomic INCR + EXPIRE primitives for cache statistics + TTL.
- Eviction policy support (LRU is the typical choice — drop entries that haven't been read recently when the storage cap binds).
- Pub/sub or similar for invalidation (optional — most LLM caches rely on TTL-only invalidation rather than explicit purge).
- Persistence is optional. Cache data is recoverable on a restart by simply repopulating from new requests; durability isn't a hard requirement.
Redis hits all of these natively. The interesting questions are which Redis to run.
The Redis options
Managed Redis Cloud (Redis Inc.) — the canonical choice. Pay-as-you-go, decent latency, 99.9% SLA on paid tiers. Geographic placement matters; co-locate the cache with your origin region.
Upstash Redis — serverless Redis with REST API. Lower base cost than Redis Cloud at low-to-moderate scale, scales well at high QPS. The REST interface adds a few milliseconds of HTTP latency over native TCP but eliminates connection-pool management. Default choice for serverless deployments. This is what Prism uses for the Layer 1 cache.
ElastiCache (AWS) / Memorystore (GCP) / Azure Cache for Redis — cloud-native managed offerings. Generally cheaper than the third-party managed services at scale but with worse multi-region story (you're locked to one cloud's region topology).
Self-hosted Redis — straightforward to run; one binary. Operationally simple at small scale; gets harder at scale (replication, failover, monitoring). Reasonable choice if you have infrastructure capacity and want to avoid managed pricing.
KeyDB / DragonflyDB — Redis-protocol-compatible alternatives with higher throughput per core. DragonflyDB specifically claims 10-25x throughput over Redis for some workloads via a multi-threaded architecture. Worth considering at high QPS; otherwise standard Redis is fine.
Sizing the Redis cache
Two parameters: storage (how big can the cache get) and ops/sec (how many lookups + writes per second).
Storage is dominated by response size. A typical LLM response is ~500-2000 bytes serialised (JSON envelope + content + usage block). A cache holding 100,000 entries at 1KB each is 100MB. A cache holding 1,000,000 entries is 1GB. Most production caches at meaningful traffic land in the 500MB-5GB range. Cheap on any managed Redis offering.
Ops/sec scales with traffic. Each request does ~2 operations (lookup + write on miss; lookup + INCR on hit for stats). 100K requests per day = ~1.2 ops/sec average; 100K requests per hour = ~30 ops/sec; 100 requests per second = ~200 ops/sec. Redis handles 100K+ ops/sec on a single shard without breaking a sweat; most production caches never come close to needing horizontal scaling.
Bottom-line cost: ~$10-30/month for 5GB managed Redis at moderate traffic. Negligible against avoided LLM cost.
The vector layer (Layer 2)
What you need from the Layer 2 backend:
- HNSW or equivalent approximate-nearest-neighbour index. Brute-force cosine-similarity scans don't scale past ~10K vectors; HNSW indices support millions of vectors with sub-10ms index lookup.
- Insert + query with vector + metadata. Each entry stores the embedding (384 or 1536 dimensions) plus the associated cached response. The query returns nearest-neighbour vectors plus their metadata.
- Configurable distance metric (cosine similarity is the standard for sentence-embedding cache lookups; L2 distance and inner product also valid for some embedding models).
- Namespacing or filtering — production deployments usually scope the cache per project (avoid serving Project A's response to Project B's query).
Vector databases vary substantially on infrastructure shape, pricing model, and operational requirements.
The vector database options
Upstash Vector — serverless, REST-API-based, namespace-scoped, runs on the same architecture as Upstash Redis. Built for AI workloads specifically; pricing scales linearly with vectors stored + queries per month. This is what Prism uses for semantic caching. Default choice for serverless AI deployments.
Pinecone — the canonical managed vector database. Production-grade, well-instrumented, multi-region. Pricing is higher than Upstash at small-to-moderate scale; comparable at large scale. Strong fit if you're already on Pinecone for other vector workloads.
Qdrant — open-source vector database, self-hostable. Managed Qdrant Cloud also available. Strong feature set; lower managed pricing than Pinecone. Good choice for teams that want flexibility between self-host and managed.
Weaviate — similar shape to Qdrant; OSS with managed cloud. Heavier than needed for pure caching workloads (it ships document-storage features as well); fine if you're using it for other vector workloads alongside.
pgvector (Postgres extension) — runs inside your existing Postgres. The right call if you're already on Postgres and want to consolidate operational surface area, IF your vector volume stays modest. Performance is fine up to ~1-5 million vectors per table with proper indexing; beyond that, dedicated vector DBs pull ahead.
LanceDB / Chroma / Milvus / Weaviate (self-hosted) — additional self-host options. Each has its own performance profile; Chroma in particular is popular for prototype work but isn't widely deployed at serious scale yet.
Sizing the vector cache
Storage is dominated by embedding size. A 384-dimensional float32 embedding is 1.5KB raw; with HNSW index overhead the effective storage is ~3-4KB per vector. 100,000 vectors ≈ 400MB. 1,000,000 vectors ≈ 4GB. Plus the metadata (the cached response itself, similar size to the Redis layer).
Query rate matches the cache-miss rate from Layer 1 — semantic only runs when exact-match misses. If exact catches 10% and total traffic is 100K req/day, the semantic layer handles 90K req/day ≈ 1 op/sec average; bursts to ~20-30 ops/sec peak. Well within the operating range of any managed vector DB.
Embedding inference cost is the other dimension. BGE-small-en-v1.5 at 384 dimensions runs on CPU at ~10-30ms per embedding; on a small GPU at sub-5ms. OpenAI text-embedding-3-small is ~$0.00002 per embedding (1536 dimensions; slightly higher accuracy but adds network latency and per-call cost). At 100K embeddings per day, the OpenAI cost is $0.60/day; BGE-small on a dedicated small VM is ~$15-30/month for the compute.
Bottom-line cost: ~$30-50/month for the vector index + embedding inference at moderate traffic. Stacks meaningfully against Layer 1 cost; still trivial against avoided LLM spend.
The latency math
Per-layer latency breakdown for a typical production setup (managed Upstash Redis + Upstash Vector + BGE-small embedding sidecar, all co-located in one region):
| Stage | Layer 1 (Redis exact-match) | Layer 2 (vector semantic) |
|---|---|---|
| Fingerprint / canonicalise | <1ms | <1ms (also runs to short-circuit on exact hit) |
| Embedding inference | n/a | ~10-20ms (BGE-small CPU); ~5ms (managed embedding API like OpenAI) |
| Index lookup | 1-3ms p95 | 5-15ms p95 (HNSW with default ef) |
| Deserialise + return | <1ms | <1ms |
| Total round-trip p95 | ~5-8ms | ~20-40ms |
The 4-6x latency gap is why Layer 1 runs first and short-circuits when it hits. Layer 2's 20-40ms is acceptable for a cache hit (compared to the 500-2000ms cache miss + LLM call it avoids), but you don't want to pay it on every request when Layer 1 would have caught the same request faster.
VERIFY (founder): confirm the Prism-specific p95 numbers above against current telemetry. The "~5-8ms exact lookup" and "~20-40ms semantic lookup" should map to actual production p95 figures from usage_logs / cache analytics.
The cost model
For a representative production deployment running 100K LLM requests/day at $0.015/request baseline (50K input + 30K output tokens):
| Component | Monthly cost | Notes |
|---|---|---|
| Layer 1 (Upstash Redis, 5GB) | ~$15 | Storage cap + ops volume well within scale |
| Layer 2 (Upstash Vector, 500K vectors) | ~$30 | HNSW indexed, namespace-scoped |
| Embedding inference (BGE-small on sidecar) | ~$15 | t3.small CPU running embedding service |
| Total caching infra | ~$60/mo | |
| Baseline LLM spend uncached | ~$3,000/mo | 100K req/day × $0.015 × 30 days |
| Caching savings (50% bill reduction) | ~$1,500/mo | Net positive impact on cost |
| ROI | 25x infra cost |
The math is favourable across most production scales. Even at 10x lower traffic (10K req/day), the infrastructure cost stays roughly constant while the savings drop proportionally — break-even still arrives below 5K req/day on a workload where caching applies.
When pgvector is the right call
Three conditions favour pgvector over a dedicated vector database:
- You're already on Postgres and want to consolidate operational surface area to one database.
- Your vector volume is bounded (probably under 2 million entries for the semantic cache). pgvector performance degrades non-linearly above this threshold; HNSW dedicated vector DBs are designed to scale.
- You don't have separate scaling concerns for the vector workload. Putting the cache in your primary Postgres means a cache-side burst can pressure your application's database. Acceptable for moderate workloads; dangerous at high QPS.
The pgvector advantage at small scale: no new managed service, no new operational expertise, no separate per-month bill, transactions span cache + application data, the embedding column is just another Postgres column. The downside: at scale, dedicated vector DBs (Pinecone, Qdrant, Upstash Vector) are substantially faster per query and isolate the workload from your primary database.
A reasonable starting heuristic: under 500K vectors → pgvector if you're on Postgres anyway. Above 1M → dedicated vector DB. Middle ground depends on your team's preference for operational consolidation vs scaling headroom.
When Redis isn't the right Layer 1 backend
Three edge cases where you might pick something other than Redis for Layer 1:
Memcached — if your cache is purely GET/SET (no TTL semantics, no stats counters), Memcached has marginally lower latency than Redis. Rarely worth the switch in 2026 because most production caches use the richer Redis primitives (TTL, INCR, EXPIRE-on-write).
SQLite / DuckDB / in-process KV — if you have a single-process application that doesn't scale horizontally, an in-process cache (Python dict, lru_cache, SQLite) is faster than any network round-trip. The constraint is "single process" — the moment you scale to multiple workers, you need a shared cache, which means a network hop, which means Redis.
S3 / object storage — only sensible for very large responses (multi-MB blobs of generated content, video, etc.) where the entry size exceeds typical Redis comfort. Most LLM responses are small enough that Redis is fine.
For 99%+ of production LLM caching workloads, Redis is the right Layer 1.
How Prism implements both
Prism's Layer 1 runs Upstash Redis (Mumbai region, single-replica). Layer 2 runs Upstash Vector with BGE-small-en-v1.5 embeddings (384-dim, cosine similarity, namespace-scoped per account by default — per-project on Pro+). The embedding inference runs on a sidecar container co-located with the API process so an embedding-side spike can't take the API down.
Specific design choices worth calling out for teams building their own:
- Layer 1 fingerprint via shared canonicaliser (covered in prompt cache fingerprinting pitfalls) — every cache write and every cache lookup goes through the same function.
- Layer 2 namespace per project — keeps Project A's responses out of Project B's cache. Scoping at the API-key level was the original v1.1 default; moved to project scoping in v1.2 when workspaces shipped.
-
Threshold tuning on Pro+ via the
X-Prism-Cache-Thresholdheader. Default 0.95; customers tune per workload. - Edge replication — Layer 1 entries propagate to Cloudflare Workers KV globally so cache hits at edge PoPs don't round-trip to Mumbai origin. Layer 2 stays at origin (embedding-at-edge isn't worth it today; covered in multi-region LLM API).
- Failure isolation — embedding service failure falls through to "cache miss, dispatch to provider" rather than blocking the request. Cache infrastructure failure is degraded but never fatal.
The total Prism cache infrastructure on EC2 + Upstash + a small embedding sidecar runs under $60/month even at meaningful customer traffic. The math holds up.
Decision framework
If you're standing up the LLM cache backend for your application:
- Always run both layers. Don't try to pick one; the layers solve different problems.
- Layer 1 = Redis. Managed Upstash or Redis Cloud at small/medium scale; KeyDB/DragonflyDB or self-host at high scale if pricing matters.
- Layer 2 = vector DB. Upstash Vector or Pinecone for managed at any scale; pgvector if you're already on Postgres and volume stays under ~1M vectors.
- Co-locate backends with origin. Cross-region cache latency dominates the savings; pick backends in the same cloud region as your application.
- Don't over-engineer. Even at meaningful production traffic, the cache infrastructure cost is rounding-error against the LLM spend it avoids. Pick a reasonable managed offering and ship.
The two-backend pattern (Redis + vector DB) is the production-tested shape. Variations on the components are fine; the architectural split between the layers isn't optional.
Where to go next
For the broader caching framework: AI API caching. For the discipline that makes Layer 1 actually hit: prompt cache fingerprinting pitfalls. For threshold tuning on Layer 2: exact vs semantic caching for LLMs.
For modelling cache impact on your workload: cache hit rate estimator + savings calculator.
FAQ
Can I use Redis Stack for both layers (Redis as both KV and vector index)?
Yes — Redis Stack ships RediSearch + RedisJSON + RedisVL, including vector similarity search via HNSW. It's a credible alternative if you want a single backend. The trade-offs: Redis Stack's vector search is newer than Pinecone/Qdrant/Weaviate; the operational complexity of a single Redis Stack instance handling both layers is higher than two separate (simpler) backends; at scale, dedicated vector DBs typically still outperform on pure query latency. Reasonable starting point for teams who want operational consolidation; revisit if performance becomes a constraint.
Is BGE-small the right embedding model?
For caching specifically, yes — it's fast (sub-30ms CPU inference), accurate enough for similarity matching, 384-dimensional (small storage footprint), and runs anywhere (no managed API dependency). Alternatives: text-embedding-3-small from OpenAI (more accurate but adds network hop and per-call cost), gte-small (similar profile to BGE-small), and BGE-base or text-embedding-3-large for higher-fidelity matching at higher cost. For most LLM caching workloads BGE-small is the right default.
Do I need to re-embed my entire cache when I switch embedding models?
Yes. Embeddings from different models live in different vector spaces; cosine similarity across models is meaningless. If you switch from BGE-small to text-embedding-3-small, you need to re-embed every cached entry. Production migrations either do a one-shot reindex job (downtime cost: a few hours of degraded hit rate while the new index warms) or run both indexes in parallel for a transition window. Plan for it before deploying a new embedding model.
What's the right TTL for Layer 1 and Layer 2?
Default Layer 1 TTL: 1 hour for time-sensitive workloads (real-time prices, user-specific context) and 24 hours for stable workloads (FAQ, documentation Q&A). Default Layer 2 TTL: similar range; some teams set it higher because semantic-cache entries are more valuable per-entry (each catches more variations). Prism defaults to 1 hour on both with per-project tuning on Pro+.
Can I run the embedding inference in the same process as the API?
You can; you shouldn't. An embedding-inference spike can pressure your API process and degrade non-embedding requests. Run the embedding service as a sidecar (separate container, separate process) so resource contention stays isolated. Prism's v1.6.5 architecture split moved embedding off the API process for exactly this reason.
What happens if Redis is down?
The cache layer returns "miss" and the request falls through to the provider. Cache miss isn't a hard error — just lost savings. The downside of Redis-down is a hit-rate cliff (suddenly every request pays full provider cost) until Redis recovers. Mitigation: pick a managed Redis with 99.9%+ SLA; monitor Redis health; alert on extended outages so you can take action.
Should the vector index store the response inline or just a reference?
Both patterns work. Inline (entire response stored as metadata on the vector) is simpler — one round-trip retrieves the response on a hit. Reference (the vector stores a key into a separate KV store that holds the response) is more storage-efficient if responses are large or you want to share cache entries across multiple vector indexes. Prism uses inline; the response size is small enough that the storage savings of separation aren't worth the second round-trip.
For the layered cache strategy at the system level, read AI API caching. For the production-shape Prism uses, see how Prism handles caching.
Top comments (0)