
Cut Amazon Bedrock Costs with a 3-Layer Caching Pipeline on AWS Lambda + ElastiCache

If you're building AI-powered apps on AWS, you've probably felt the sting of Bedrock inference costs. Every token counts — and when users hammer your app with similar or identical questions, you're paying for the same answer over and over again.

In this post I'll walk through a three-layer caching and optimization pipeline I built inside a single Lambda function backed by ElastiCache (Redis). By the end, you'll have a pattern that can dramatically reduce Bedrock calls in any support chatbot, internal knowledge assistant, or document Q&A tool you're shipping.

Here's what we're building:

User prompt → Hash Check → Semantic Check → Prompt Compression → Bedrock → Cache Write
                  ↓               ↓
             hash_hit        semantic_hit

Architecture at a Glance

| Component | Role |
| --- | --- |
| AWS Lambda (Python) | Caching logic, embedding, compression |
| Amazon ElastiCache (Redis 7.1) | Persistent shared memory across invocations |
| Amazon Bedrock (Nova Micro) | Foundation model, only called on a true miss |
| Titan Embeddings v2 | Converts prompts to semantic vectors |

Because Lambda is stateless, every invocation starts fresh with zero memory of prior calls. ElastiCache fills that gap — it's the shared brain that persists across invocations and across different users hitting your function simultaneously.
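
The snippets that follow lean on some module-level setup that isn't shown inline. Here's a minimal sketch of what that could look like; the environment variable names, the embed: prefix, and the TTL default are my assumptions, chosen to match how the constants are used below:

import hashlib
import json
import os
import struct

import boto3
import numpy as np
import redis

# Created once at module scope so warm Lambda invocations reuse the connection
redis_client = redis.Redis(
    host=os.environ["REDIS_HOST"],  # assumed env var: your ElastiCache endpoint
    port=int(os.environ.get("REDIS_PORT", "6379")),
)

HASH_PREFIX = "hash:"
EMBED_PREFIX = "embed:"  # assumed prefix for stored embedding entries
EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"
SIMILARITY_THRESHOLD = float(os.environ.get("SIMILARITY_THRESHOLD", "0.90"))
CACHE_TTL_SECONDS = int(os.environ.get("CACHE_TTL_SECONDS", "3600"))  # assumed default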


Layer 1 — Hash-Based Caching: The Fastest Win

Before anything touches Bedrock, we check whether we've already answered this exact question.

The trick is normalizing the prompt first — lowercase, collapse whitespace — so " What is MACHINE LEARNING? " and "what is machine learning?" produce the same SHA-256 fingerprint and share one cache entry.

def normalize(prompt: str) -> str:
    return " ".join(prompt.lower().split())

def compute_hash(prompt: str) -> str:
    return hashlib.sha256(normalize(prompt).encode()).hexdigest()
Enter fullscreen mode Exit fullscreen mode

On every invocation, we check Redis with the hash: prefix before doing anything else:

hash_key = HASH_PREFIX + compute_hash(prompt)
cached = redis_client.get(hash_key)
if cached:
    return {"response": cached, "cache": "hash_hit"}

A hash hit costs you a single Redis GET — no embedding call, no Bedrock invocation, no tokens burned. This is the fastest and cheapest path through the entire pipeline.

When does this shine? Any FAQ-style workload where users repeatedly ask the same questions. Support bots. Help center chatbots. Internal HR assistants.


Layer 2 — Semantic Similarity Caching: Catching Paraphrases

Hash-based caching misses paraphrases. "What is machine learning?" and "How would you define machine learning?" are semantically identical but produce completely different hashes.

Semantic caching solves this with vector embeddings. We convert every prompt to a list of floats that encodes its meaning, then compare incoming prompts to stored vectors using cosine similarity.

def embed(prompt: str) -> np.ndarray:
    bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
    response = bedrock.invoke_model(
        modelId=EMBED_MODEL_ID,
        body=json.dumps({"inputText": prompt}),
        contentType="application/json",
        accept="application/json"
    )
    body = json.loads(response["body"].read())
    return np.array(body["embedding"], dtype=np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))

Since Redis stores bytes, not arrays, we serialize the vector with struct.pack before writing and unpack it on read:

def serialize_embedding(vector: np.ndarray) -> bytes:
    return struct.pack(f"{len(vector)}f", *vector)

def deserialize_embedding(data: bytes) -> np.ndarray:
    n = len(data) // 4
    return np.array(struct.unpack(f"{n}f", data), dtype=np.float32)

In the handler, after a hash miss we embed the incoming prompt and scan stored vectors:

query_vector = embed(prompt)
stored = load_embeddings()
best_score, best_response = 0.0, None

for _, vector, response in stored:
    score = cosine_similarity(query_vector, vector)
    if score > best_score:
        best_score = score
        best_response = response

if best_score >= SIMILARITY_THRESHOLD:
    return {"response": best_response, "cache": "semantic_hit", "score": round(best_score, 4)}
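The snippet above calls load_embeddings() and store_embedding(), which the post doesn't define. A minimal sketch, assuming each entry lives in a Redis hash that holds the packed vector alongside the cached response (the field names are my choice):

def store_embedding(key: str, vector: np.ndarray, response: str) -> None:
    # One Redis hash per cached prompt: packed float32 vector + response text
    redis_client.hset(key, mapping={
        "vector": serialize_embedding(vector),
        "response": response,
    })
    redis_client.expire(key, CACHE_TTL_SECONDS)

def load_embeddings() -> list:
    # Linear scan over every stored entry; fine at low volume (see the
    # production notes at the end for a real vector index)
    entries = []
    for key in redis_client.scan_iter(match=EMBED_PREFIX + "*"):
        data = redis_client.hgetall(key)
        if b"vector" in data and b"response" in data:
            entries.append((
                key,
                deserialize_embedding(data[b"vector"]),
                data[b"response"].decode(),
            ))
    return entries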

The SIMILARITY_THRESHOLD environment variable (default 0.90) is your dial for how aggressive the matching should be. Lower it to 0.80 and you'll catch more paraphrases at the risk of serving a slightly off response. Tune it against your own traffic.

💡 In practice, I've seen semantic_hit catch prompts like "Explain ML to me" against a cached answer for "What is machine learning?" with a score around 0.94 — well above threshold, and a completely avoided Bedrock call.


Layer 3 — Prompt Compression: Saving Tokens on Every Miss

Even with two cache layers, some prompts will always be new. Prompt compression squeezes cost out of every genuine cache miss by stripping filler language before the prompt reaches Bedrock.

Filler phrases like "Could you please", "I was wondering if", or "As an AI language model" consume tokens without improving the model's response. We maintain a simple list and strip them at runtime:

FILLER_PHRASES = [
    "please could you",
    "i was wondering if",
    "could you please",
    "i would like you to",
    "as an ai",
    "can you please",
    # ... extend this list based on your traffic patterns
]

def compress(prompt: str) -> str:
    compressed = prompt.lower()
    for phrase in FILLER_PHRASES:
        compressed = compressed.replace(phrase, "")
    compressed = " ".join(compressed.split())

    # Word counts are a rough proxy for billed tokens
    original_tokens = len(prompt.split())
    compressed_tokens = len(compressed.split())
    print(f"[compression] original: {original_tokens} tokens, "
          f"compressed: {compressed_tokens} tokens, "
          f"saved: {original_tokens - compressed_tokens}")

    return compressed

The CloudWatch log line gives you a measurable view of the savings on every miss — you can query these logs over time to identify your most common filler patterns and keep optimizing the list.
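
As one illustration, a Logs Insights query kicked off with boto3 can sum those savings over a week. This is a sketch under assumptions: the log group name is hypothetical, and a real script would poll get_query_results until the query status is Complete:

import time
import boto3

logs = boto3.client("logs")

# Sum the per-invocation savings logged by compress()
query = logs.start_query(
    logGroupName="/aws/lambda/bedrock-cache-pipeline",  # hypothetical name
    startTime=int(time.time()) - 7 * 24 * 3600,
    endTime=int(time.time()),
    queryString=(
        'filter @message like "[compression]"'
        ' | parse @message "saved: *" as saved'
        ' | stats sum(saved) as tokens_saved'
    ),
)
results = logs.get_query_results(queryId=query["queryId"])  # poll until complete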

One critical design decision: compression runs after both cache checks, not before.

If you compressed first, the cache key would depend on the filler list: the same prompt would hash to one key today and a different key after you edit FILLER_PHRASES, silently orphaning every existing cache entry. So the original prompt is always used for cache lookups; compression is purely a token cost optimization that only fires when a Bedrock call is actually going to happen.


The Full Pipeline in lambda_handler

Putting it all together, the handler becomes a clean sequential pipeline:

def lambda_handler(event, context):
    prompt = event.get("prompt", "").strip()
    if not prompt:
        return {"error": "prompt is required"}

    # Layer 1: Exact hash match — fastest path, zero AI calls
    hash_key = HASH_PREFIX + compute_hash(prompt)
    cached = redis_client.get(hash_key)
    if cached:
        return {"response": cached.decode(), "cache": "hash_hit"}

    # Layer 2: Semantic similarity — catches paraphrases
    query_vector = embed(prompt)
    stored = load_embeddings()
    best_score, best_response = 0.0, None
    for _, vector, response in stored:
        score = cosine_similarity(query_vector, vector)
        if score > best_score:
            best_score, best_response = score, response
    if best_score >= SIMILARITY_THRESHOLD:
        return {"response": best_response, "cache": "semantic_hit", "score": round(best_score, 4)}

    # Layer 3: Compress before sending to Bedrock
    compressed_prompt = compress(prompt)
    response_text = call_bedrock(compressed_prompt)

    # Write back both hash and embedding for future hits
    redis_client.set(hash_key, response_text.encode(), ex=CACHE_TTL_SECONDS)
    embed_key = EMBED_PREFIX + compute_hash(prompt)
    store_embedding(embed_key, query_vector, response_text)

    return {"response": response_text, "cache": "miss"}
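One piece the handler relies on but the post doesn't show is call_bedrock(). A minimal sketch using the Converse API, assuming the Nova Micro cross-region inference profile mentioned in the footer and a fixed token budget:

def call_bedrock(prompt: str) -> str:
    bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
    response = bedrock.converse(
        modelId="us.amazon.nova-micro-v1:0",  # US cross-region inference profile
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},  # assumed budget
    )
    return response["output"]["message"]["content"][0]["text"]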

Key Observations from Testing

| Prompt | Cache Result | Bedrock Call? |
| --- | --- | --- |
| "What is machine learning?" (1st call) | miss | ✅ Yes |
| "What is machine learning?" (2nd call) | hash_hit | ❌ No |
| "  What is MACHINE LEARNING?  " | hash_hit | ❌ No |
| "How would you define machine learning?" | semantic_hit (0.94) | ❌ No |
| "Could you please explain what machine learning is?" | miss → compressed | ✅ Yes (fewer tokens) |

When to Use This Pattern

This three-layer pipeline is most valuable when:

  • Query volume is high — the cost savings on cache hits compound quickly at scale
  • Users tend to ask similar questions — support bots, knowledge bases, FAQ tools
  • Prompts are verbose — compression delivers more savings when users write long-winded queries
  • Latency matters — a Redis GET is orders of magnitude faster than a Bedrock roundtrip

It's less impactful for highly creative or unique queries (content generation, code synthesis) where every prompt is genuinely different and semantic similarity won't trigger often.


What I'd Do Differently in Production

A few things worth considering as you take this pattern to prod:

  • Replace the linear embedding scan with a proper vector search (Redis Stack's HNSW index, or OpenSearch with k-NN). Scanning every stored embedding is fine at low volume but doesn't scale; a sketch follows this list.
  • Instrument cache hit rates with CloudWatch metrics so you can track ROI over time and justify the ElastiCache spend.
  • Tune SIMILARITY_THRESHOLD per use case. A support bot can be aggressive (0.85); a medical or legal assistant should be conservative (0.95+).
  • Analyze your CloudWatch compression logs weekly and update FILLER_PHRASES based on real traffic patterns.
  • Add a warm-up step for known common queries — pre-populate the cache on deploy so the very first user gets a cache hit.
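
For the first bullet, here's a sketch of what the Redis route could look like with redis-py's search commands. It assumes an engine that ships the search module (plain ElastiCache Redis does not, so this means Redis Stack or a compatible service) and Titan v2's default 1024-dimension vectors:

from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

# One-time index over the embedding hashes (run at deploy, not per request)
redis_client.ft("cache_idx").create_index(
    fields=[
        TextField("response"),
        VectorField("vector", "HNSW", {
            "TYPE": "FLOAT32",
            "DIM": 1024,  # Titan Embeddings v2 default dimension
            "DISTANCE_METRIC": "COSINE",
        }),
    ],
    definition=IndexDefinition(prefix=[EMBED_PREFIX], index_type=IndexType.HASH),
)

def nearest_cached(query_vector: np.ndarray):
    # Replaces the linear scan: ask Redis for the single nearest neighbor
    q = (
        Query("*=>[KNN 1 @vector $vec AS dist]")
        .sort_by("dist")
        .return_fields("response", "dist")
        .dialect(2)
    )
    result = redis_client.ft("cache_idx").search(
        q, query_params={"vec": serialize_embedding(query_vector)}
    )
    if not result.docs:
        return None, 0.0
    doc = result.docs[0]
    return doc.response, 1.0 - float(doc.dist)  # cosine distance -> similarity

The returned similarity plugs straight into the existing SIMILARITY_THRESHOLD check, so the rest of the handler stays unchanged.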

Wrapping Up

Three layers, one Lambda function, one ElastiCache cluster. Together they cover the most common sources of Bedrock cost:

| Layer | What it eliminates |
| --- | --- |
| Hash caching | Exact duplicate calls |
| Semantic caching | Paraphrased duplicate calls |
| Prompt compression | Excess tokens on every genuine miss |

The pattern is modular — you can adopt any one layer independently, and each one pays for itself at a different traffic threshold. Start with hash caching (zero additional AWS cost beyond ElastiCache), add semantic caching once you see recurring paraphrases in your logs, and layer in prompt compression as your prompt corpus grows longer.

If you're building on Amazon Bedrock, this is one of the highest-ROI architectural patterns you can drop into an existing Lambda-based backend with minimal rework.


Built and tested as part of an AWS hands-on lab. All code runs on Python 3.12, Redis 7.1, and Amazon Bedrock Nova Micro via a cross-region inference profile.

Have questions or want to share your own caching numbers? Drop them in the comments below 👇
