This is a submission for the Redis AI Challenge: Real-Time AI Innovators.
What I Built
Latency Slayer is a tiny Rust reverse-proxy that sits in front of any LLM API.
It uses embeddings + vector search in Redis 8 to detect “repeat-ish” prompts and return a cached answer instantly. New prompts are answered once by the LLM and stored with per-field TTLs, so only the response expires while metadata persists.
Why it matters: dramatically lower latency and cost, with transparent drop-in integration for any chat or RAG app.
Core tricks
- Redis Query Engine + HNSW vectors (COSINE) to find semantically similar earlier prompts (see the lookup sketch below).
- Hash field expiration (`HSETEX`/`HGETEX`) so we can expire just the “response” field without deleting the whole hash.
- Redis Streams for real-time hit-rate & latency metrics, rendered in a tiny dashboard.
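To make the first trick concrete, here's a rough sketch of the hit path using the synchronous `redis` crate. The index name (`idx:prompts`), the `embedding` field name, and the 0.15 cosine-distance threshold are illustrative assumptions, not the proxy's actual values:

```rust
const HIT_THRESHOLD: f32 = 0.15; // cosine distance; lower = more similar (assumed value)

// Redis expects vectors as a raw little-endian FP32 byte blob.
fn f32s_to_blob(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|f| f.to_le_bytes()).collect()
}

// Returns the key of a semantically similar cached prompt, if one is close enough.
fn lookup(con: &mut redis::Connection, embedding: &[f32]) -> redis::RedisResult<Option<String>> {
    let raw: redis::Value = redis::cmd("FT.SEARCH")
        .arg("idx:prompts")                           // assumed index name
        .arg("*=>[KNN 1 @embedding $vec AS score]")   // nearest neighbor + its distance
        .arg("PARAMS").arg(2).arg("vec").arg(f32s_to_blob(embedding))
        .arg("RETURN").arg(1).arg("score")
        .arg("DIALECT").arg(2)                        // KNN param syntax needs dialect 2
        .query(con)?;
    // RESP2 reply: [total, key, ["score", "<distance>"]]; zero results yields
    // just [0], so the tuple conversion fails and we report a miss.
    if let Ok((_total, key, fields)) =
        redis::from_redis_value::<(i64, String, Vec<String>)>(&raw)
    {
        if let Some(dist) = fields.get(1).and_then(|s| s.parse::<f32>().ok()) {
            if dist <= HIT_THRESHOLD {
                return Ok(Some(key));
            }
        }
    }
    Ok(None)
}

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1:6379/")?;
    let mut con = client.get_connection()?;
    let embedding = vec![0.0f32; 1536]; // stand-in for the real OpenAI embedding
    match lookup(&mut con, &embedding)? {
        Some(key) => println!("cache hit: {key}"),
        None => println!("miss: call the LLM, then store + index the answer"),
    }
    Ok(())
}
```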
Demo
Screenshots: the live dashboard showing hit rate, token savings, and latency deltas.
How I Used Redis 8
- Vector search (HNSW, COSINE) on a HASH document that stores an embedding field (FP32, 1536-d from OpenAI `text-embedding-3-small`); the index bootstrap is sketched after this list.
- Per-field TTL on hashes: `HSETEX` to set the response field and its TTL in a single step; `HGETEX` to read and optionally refresh TTLs. This gives granular cache lifetimes without deleting other fields (like usage or model metadata); see the second sketch below.
- Redis Streams: `XADD analytics:cache` per request; the dashboard subscribes and renders hit rate, token savings, and latency deltas in real time (see the stream sketch after the data model).
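Bootstrapping that index looks roughly like this. A sketch: the index name is assumed, while the `vec:` prefix and TAG fields mirror the data model below:

```rust
// One-time index bootstrap (sketch). Matches the fields described above:
// FP32 vectors, 1536 dims, cosine distance, plus TAG fields for filtering.
fn create_index(con: &mut redis::Connection) -> redis::RedisResult<()> {
    redis::cmd("FT.CREATE")
        .arg("idx:prompts")                  // assumed index name
        .arg("ON").arg("HASH")
        .arg("PREFIX").arg(1).arg("vec:")    // indexes the vec:{fingerprint} hashes
        .arg("SCHEMA")
        .arg("embedding").arg("VECTOR").arg("HNSW").arg(6)
        .arg("TYPE").arg("FLOAT32")
        .arg("DIM").arg(1536)
        .arg("DISTANCE_METRIC").arg("COSINE")
        .arg("model").arg("TAG")
        .arg("route").arg("TAG")
        .arg("user").arg("TAG")
        .query(con)
}
```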
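And the per-field TTL dance, issued as raw commands via `redis::cmd`. This is a sketch assuming Redis 8's documented `HSETEX`/`HGETEX` argument order and an example 1-hour lifetime, not the project's actual setting:

```rust
// Cache write (miss path): set `resp` and its TTL in one atomic step.
// Metadata fields written with plain HSET never expire.
fn store_response(con: &mut redis::Connection, fp: &str, resp: &str) -> redis::RedisResult<()> {
    redis::cmd("HSETEX")
        .arg(format!("cache:{fp}"))
        .arg("EX").arg(3600)        // example: response lives one hour
        .arg("FVS").arg(1)          // one field/value pair follows
        .arg("resp").arg(resp)
        .query(con)
}

// Cache read (hit path): fetch `resp` and slide its TTL forward.
fn read_response(con: &mut redis::Connection, fp: &str) -> redis::RedisResult<Option<String>> {
    let vals: Vec<Option<String>> = redis::cmd("HGETEX")
        .arg(format!("cache:{fp}"))
        .arg("EX").arg(3600)        // refresh the per-field TTL on every hit
        .arg("FIELDS").arg(1).arg("resp")
        .query(con)?;
    Ok(vals.into_iter().next().flatten())
}
```

Refreshing the TTL on read keeps frequently repeated prompts warm, while rarely repeated ones age out on their own without touching the rest of the hash.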
Data model (simplified)
- `cache:{fingerprint}` → Hash fields: `prompt`, `resp`, `meta`, `usage`, `created_at` (with `resp` having its own TTL)
- `vec:{fingerprint}` → Vector field + tags (`model`, `route`, `user`)
- Stream: `analytics:cache` with `{event, hit, latency_ms, tokens_saved}`
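Feeding that stream is one `XADD` per request. A minimal sketch, with field names from the entry above (the `event` value is an assumption):

```rust
// Producer: one XADD per proxied request.
fn log_event(
    con: &mut redis::Connection,
    hit: bool,
    latency_ms: u64,
    tokens_saved: u64,
) -> redis::RedisResult<()> {
    redis::cmd("XADD")
        .arg("analytics:cache").arg("*")   // let Redis assign the entry ID
        .arg("event").arg("lookup")        // assumed event name
        .arg("hit").arg(hit as i64)
        .arg("latency_ms").arg(latency_ms)
        .arg("tokens_saved").arg(tokens_saved)
        .query(con)
    // Dashboard side: tail new entries with
    //   XREAD BLOCK 0 STREAMS analytics:cache $
    // and fold them into running hit-rate / latency aggregates.
}
```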
Why Redis 8?
- New field-level expiration commands on hashes make cache lifecycle clean and safe.
- New int8 vector support promises lower memory and faster search once we quantize (we currently store FP32; see “What’s next”).
- Battle-tested Streams/PubSub give us real-time observability with a tiny footprint.
What’s next
- Prefetch: predict likely next prompts and warm them proactively.
- Hybrid filters: combine vector similarity + tags (model/route) for stricter cache hits.
- Cold-start tuning: adapt hit threshold by route and user cohort.
- Currently storing FP32 vectors for simplicity; INT8 quantization is planned to lower memory and speed up search.


