This is a submission for the Redis AI Challenge: Real-Time AI Innovators.
What I Built
Latency Slayer is a tiny Rust reverse-proxy that sits in front of any LLM API.
It uses embeddings + vector search in Redis 8 to detect “repeat-ish” prompts and return a cached answer instantly. New prompts are answered once by the LLM and stored with per-field TTLs, so only the response expires while metadata persists.
Why it matters: dramatically lower latency and cost, with transparent drop-in integration for any chat or RAG app.
Core tricks
- Redis Query Engine + HNSW vectors (COSINE) to find semantically similar earlier prompts (see the lookup sketch below).
- Hash field expiration (`HSETEX`/`HGETEX`) so we can expire just the “response” field without deleting the whole hash.
- Redis Streams for real-time hit-rate & latency metrics, rendered in a tiny dashboard.
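To make the first trick concrete, here's a rough sketch of the hit path using the synchronous `redis` crate. The index name (`idx:prompts`), the `embedding` field name, and the 0.15 cosine-distance threshold are illustrative assumptions, not the proxy's actual values:

```rust
const HIT_THRESHOLD: f32 = 0.15; // cosine distance; lower = more similar (assumed value)

// Redis expects vectors as a raw little-endian FP32 byte blob.
fn f32s_to_blob(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|f| f.to_le_bytes()).collect()
}

// Returns the key of a semantically similar cached prompt, if one is close enough.
fn lookup(con: &mut redis::Connection, embedding: &[f32]) -> redis::RedisResult<Option<String>> {
    let raw: redis::Value = redis::cmd("FT.SEARCH")
        .arg("idx:prompts")                           // assumed index name
        .arg("*=>[KNN 1 @embedding $vec AS score]")   // nearest neighbor + its distance
        .arg("PARAMS").arg(2).arg("vec").arg(f32s_to_blob(embedding))
        .arg("RETURN").arg(1).arg("score")
        .arg("DIALECT").arg(2)                        // KNN param syntax needs dialect 2
        .query(con)?;
    // RESP2 reply: [total, key, ["score", "<distance>"]]; zero results yields
    // just [0], so the tuple conversion fails and we report a miss.
    if let Ok((_total, key, fields)) =
        redis::from_redis_value::<(i64, String, Vec<String>)>(&raw)
    {
        if let Some(dist) = fields.get(1).and_then(|s| s.parse::<f32>().ok()) {
            if dist <= HIT_THRESHOLD {
                return Ok(Some(key));
            }
        }
    }
    Ok(None)
}

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1:6379/")?;
    let mut con = client.get_connection()?;
    let embedding = vec![0.0f32; 1536]; // stand-in for the real OpenAI embedding
    match lookup(&mut con, &embedding)? {
        Some(key) => println!("cache hit: {key}"),
        None => println!("miss: call the LLM, then store + index the answer"),
    }
    Ok(())
}
```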
Demo
Screenshots: the live dashboard showing hit rate, token savings, and latency deltas.
How I Used Redis 8
- Vector search (HNSW, COSINE) on a HASH document that stores an embedding field (FP32, 1536-d from OpenAI `text-embedding-3-small`); the index bootstrap is sketched after this list.
- Per-field TTL on hashes: `HSETEX` to set the response field and its TTL in a single step; `HGETEX` to read and optionally refresh TTLs. This gives granular cache lifetimes without deleting other fields (like usage or model metadata); see the second sketch below.
- Redis Streams: `XADD analytics:cache` per request; the dashboard subscribes and renders hit rate, token savings, and latency deltas in real time (see the stream sketch after the data model).
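Bootstrapping that index looks roughly like this. A sketch: the index name is assumed, while the `vec:` prefix and TAG fields mirror the data model below:

```rust
// One-time index bootstrap (sketch). Matches the fields described above:
// FP32 vectors, 1536 dims, cosine distance, plus TAG fields for filtering.
fn create_index(con: &mut redis::Connection) -> redis::RedisResult<()> {
    redis::cmd("FT.CREATE")
        .arg("idx:prompts")                  // assumed index name
        .arg("ON").arg("HASH")
        .arg("PREFIX").arg(1).arg("vec:")    // indexes the vec:{fingerprint} hashes
        .arg("SCHEMA")
        .arg("embedding").arg("VECTOR").arg("HNSW").arg(6)
        .arg("TYPE").arg("FLOAT32")
        .arg("DIM").arg(1536)
        .arg("DISTANCE_METRIC").arg("COSINE")
        .arg("model").arg("TAG")
        .arg("route").arg("TAG")
        .arg("user").arg("TAG")
        .query(con)
}
```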
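And the per-field TTL dance, issued as raw commands via `redis::cmd`. This is a sketch assuming Redis 8's documented `HSETEX`/`HGETEX` argument order and an example 1-hour lifetime, not the project's actual setting:

```rust
// Cache write (miss path): set `resp` and its TTL in one atomic step.
// Metadata fields written with plain HSET never expire.
fn store_response(con: &mut redis::Connection, fp: &str, resp: &str) -> redis::RedisResult<()> {
    redis::cmd("HSETEX")
        .arg(format!("cache:{fp}"))
        .arg("EX").arg(3600)        // example: response lives one hour
        .arg("FVS").arg(1)          // one field/value pair follows
        .arg("resp").arg(resp)
        .query(con)
}

// Cache read (hit path): fetch `resp` and slide its TTL forward.
fn read_response(con: &mut redis::Connection, fp: &str) -> redis::RedisResult<Option<String>> {
    let vals: Vec<Option<String>> = redis::cmd("HGETEX")
        .arg(format!("cache:{fp}"))
        .arg("EX").arg(3600)        // refresh the per-field TTL on every hit
        .arg("FIELDS").arg(1).arg("resp")
        .query(con)?;
    Ok(vals.into_iter().next().flatten())
}
```

Refreshing the TTL on read keeps frequently repeated prompts warm, while rarely repeated ones age out on their own without touching the rest of the hash.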
Data model (simplified)
- `cache:{fingerprint}` → Hash fields: `prompt`, `resp`, `meta`, `usage`, `created_at` (with `resp` having its own TTL)
- `vec:{fingerprint}` → Vector field + tags (`model`, `route`, `user`)
- Stream: `analytics:cache` with `{event, hit, latency_ms, tokens_saved}`
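Feeding that stream is one `XADD` per request. A minimal sketch, with field names from the entry above (the `event` value is an assumption):

```rust
// Producer: one XADD per proxied request.
fn log_event(
    con: &mut redis::Connection,
    hit: bool,
    latency_ms: u64,
    tokens_saved: u64,
) -> redis::RedisResult<()> {
    redis::cmd("XADD")
        .arg("analytics:cache").arg("*")   // let Redis assign the entry ID
        .arg("event").arg("lookup")        // assumed event name
        .arg("hit").arg(hit as i64)
        .arg("latency_ms").arg(latency_ms)
        .arg("tokens_saved").arg(tokens_saved)
        .query(con)
    // Dashboard side: tail new entries with
    //   XREAD BLOCK 0 STREAMS analytics:cache $
    // and fold them into running hit-rate / latency aggregates.
}
```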
Why Redis 8?
- New field-level expiration commands on hashes make cache lifecycle clean and safe.
- New int8 vector support promises lower memory and faster search once we quantize (we currently store FP32; see “What’s next”).
- Battle-tested Streams/PubSub give us real-time observability with a tiny footprint.
What’s next
- Prefetch: predict likely next prompts and warm them proactively.
- Hybrid filters: combine vector similarity + tags (model/route) for stricter cache hits.
- Cold-start tuning: adapt hit threshold by route and user cohort.
- Currently storing FP32 vectors for simplicity; INT8 quantization is planned to lower memory and speed up search.


