DEV Community

Sachin m
How to Implement Prompt Caching on Amazon Bedrock and Cut Inference Costs in Half

Introduction

You're running a multi-turn support agent on Amazon Bedrock. Every API call sends a ~2,100-token system prompt — your agent's persona, rules, and the product documentation — along with the growing conversation history. The model doesn't remember any of this between calls. It reprocesses those tokens fresh every single turn, and you pay for every one of them.

For a single five-turn conversation on Nova Pro, that adds up to 12,834 input tokens. Over 80% of that is the static system prompt, repeated identically across all five turns. Scale to 1,000 conversations a day and your monthly bill hits $384. Most of that is money spent processing the same static text, over and over.

Amazon Bedrock's prompt caching fixes this. You mark a cache point in your prompt where the static content ends. Bedrock stores everything before that marker. On subsequent calls within the cache window, it reads from cache instead of reprocessing. Cache reads cost 90% less than regular input tokens.

I ran benchmarks across three Amazon Nova models to measure the real impact. Adding a single cachePoint to the system block cut Nova Pro's monthly projection roughly in half. And when I combined prompt caching with switching from Nova Pro to Nova Micro, the total reduction hit 97%. From $384 a month to under $10.

By the end of this tutorial, you will have:

  1. Built a multi-turn customer support agent on Bedrock
  2. Measured what it actually costs per conversation — baseline numbers
  3. Added prompt caching (one line of code) and seen the difference
  4. Run the same benchmark across Nova Pro, Lite, and Micro to compare
  5. Set up CloudWatch monitoring so you know caching is actually working

All code is available in the companion repository: bedrock-prompt-caching-distillation-tutorial.

Without caching, every API call reprocesses the full system prompt at full price. With caching, the first call stores it, and subsequent calls read it back at a 90% discount:

How Prompt Caching Works on Amazon Bedrock — before and after comparison showing cached tokens read at 90% discount


Prerequisites

  • An AWS account with Amazon Bedrock access enabled in us-east-1
  • Model access granted for Amazon Nova Pro, Nova Lite, and Nova Micro (enabled by default for new accounts). Open the Bedrock Model Catalog to confirm they're listed:

Amazon Bedrock Model Catalog showing available foundation models including Nova Pro, Nova Lite, and Nova Micro

  • Python 3.11+ with boto3 >= 1.35.76 (prompt caching support requires recent versions)
  • AWS CLI v2 configured with credentials that have bedrock-runtime:* permissions
  • Estimated cost to complete this tutorial: under $0.15

Verify your environment before starting:

python3 --version   # 3.11+
pip show boto3 | grep Version   # 1.35.76+
aws bedrock list-foundation-models \
  --query "modelSummaries[?modelId=='amazon.nova-pro-v1:0'].modelId" \
  --output text
# Expected: amazon.nova-pro-v1:0

Step 1 — Building a Realistic Baseline

Before optimizing anything, you need to know what you're spending. We'll build a customer support agent that answers questions using product documentation — basically what every Bedrock-powered support bot looks like under the hood.

The scenario

Your agent has a system prompt (~200 tokens) defining its persona and rules, plus product documentation (~1,900 tokens) pasted right into the system block — features, pricing, troubleshooting, the works. Then there's the conversation history, which grows with every turn.

The combined system content is 2,130 tokens. Every API call resends all of it, plus the growing conversation history. By Turn 5, the model is processing 2,860 input tokens per call — and the 2,130-token prefix hasn't changed since Turn 1.
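The growth pattern is easy to model with back-of-envelope arithmetic. The 10-token question and 170-token reply sizes below are illustrative assumptions, so the total won't match the measured run exactly (real reply lengths vary), but the per-turn shape does:

```python
SYSTEM_TOKENS = 2_130   # static system prompt + product docs

def projected_inputs(turns=5, question=10, reply=170):
    """Model per-turn input tokens: static prefix + accumulated history + question."""
    history = 0
    per_turn = []
    for _ in range(turns):
        per_turn.append(SYSTEM_TOKENS + history + question)
        history += question + reply   # both sides of each turn join the history
    return per_turn

print(projected_inputs())  # [2140, 2320, 2500, 2680, 2860]
```

Only the last ~180 tokens per turn are new; the 2,130-token prefix is dead weight on every call.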

The baseline code

The product documentation is an ~1,900-token spec for a fictional SaaS product called SmartWidget Pro — features, pricing, API docs, migration guides, troubleshooting. The full text is in product_docs.txt in the companion repo.

# 01_baseline_no_cache.py

import boto3
import time
from pathlib import Path

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "amazon.nova-pro-v1:0"

SYSTEM_PROMPT = """You are a senior customer support agent for SmartWidget, a SaaS company.

Rules:
- Always be polite, professional, and concise
- Reference the product documentation when answering
- If the answer is not in the documentation, say so honestly
- Format responses with bullet points for clarity
- Never make up product features or pricing that isn't documented
- Keep responses under 150 words unless the question requires more detail
"""

# Load product docs (~1,900 tokens)
PRODUCT_DOCS = Path("product_docs.txt").read_text()
FULL_SYSTEM = SYSTEM_PROMPT + "\n\n--- PRODUCT DOCUMENTATION ---\n\n" + PRODUCT_DOCS

# Realistic customer support conversation
QUESTIONS = [
    "What are the main features of SmartWidget Pro?",
    "How do I configure the API integration? Give me a quick start guide.",
    "What's the pricing for enterprise customers?",
    "My API is returning 429 errors. How do I fix this?",
    "How do I migrate from v3.x to v4.2? What are the breaking changes?",
]

def ask_question(question, conversation_history):
    messages = conversation_history + [
        {"role": "user", "content": [{"text": question}]}
    ]

    start_time = time.time()
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": FULL_SYSTEM}],
        messages=messages,
        inferenceConfig={"maxTokens": 512, "temperature": 0.1},
    )
    elapsed = time.time() - start_time
    usage = response["usage"]

    print(f"  Latency: {elapsed:.2f}s | Input: {usage['inputTokens']} | Output: {usage['outputTokens']}")
    return response, elapsed

# Run 5-turn conversation
history = []
for i, question in enumerate(QUESTIONS):
    print(f"Turn {i+1}: {question}")
    response, latency = ask_question(question, history)
    history.append({"role": "user", "content": [{"text": question}]})
    history.append({"role": "assistant", "content": response["output"]["message"]["content"]})

Real output

Here's what a five-turn conversation looks like without caching. These are real numbers from my AWS account:

Baseline benchmark turns 1-2 — input tokens start at 2,140 and grow as conversation history accumulates

Baseline benchmark turns 3-5 — input tokens reach 2,860 by the final turn

Total across five turns: 12,834 input tokens and 796 output tokens in 8.78 seconds.

Notice how input tokens grow each turn — from 2,140 to 2,860. That's the conversation history accumulating. But the first ~2,130 tokens in every call are the same system prompt and product docs, unchanged from Turn 1.

What this costs at scale

Nova Pro charges $0.80 per million input tokens and $3.20 per million output tokens. For this single conversation:

  • Input cost: 12,834 tokens × $0.80/M = $0.0103
  • Output cost: 796 tokens × $3.20/M = $0.0025
  • Total: $0.0128 per conversation

At 1,000 conversations per day, 30 days a month: $384 per month. For one model running one workflow.
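The projection above takes a few lines to reproduce, which is handy for plugging in your own traffic assumptions:

```python
def monthly_cost(input_tokens, output_tokens, in_rate, out_rate,
                 convos_per_day=1_000, days=30):
    """Project monthly spend from one conversation's token totals.
    Rates are dollars per million tokens."""
    per_convo = (input_tokens * in_rate + output_tokens * out_rate) / 1e6
    return per_convo * convos_per_day * days

# Nova Pro, using the five-turn conversation measured above:
print(f"${monthly_cost(12_834, 796, 0.80, 3.20):.0f}/month")  # $384/month
```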


Step 2 — Adding Prompt Caching

The change is small — one additional element in the system array. Everything before the cachePoint marker gets stored. Subsequent calls that send the same prefix read from cache instead of reprocessing it.

If you want to try this visually first, the Bedrock Chat Playground has a prompt caching toggle built in. Select your model, scroll down in the left panel, and flip the Prompt caching switch:

Bedrock Chat Playground with Nova Pro selected and the Prompt caching toggle enabled

Good for quick experiments, but you'll want this in code for production.

The code change

Compare this to the baseline — the only change is one extra element in the system array:

# 02_with_prompt_caching.py

def ask_question_cached(question, conversation_history):
    messages = conversation_history + [
        {"role": "user", "content": [{"text": question}]}
    ]

    start_time = time.time()
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[
            {"text": FULL_SYSTEM},
            {"cachePoint": {"type": "default"}},   # <-- this is the only change
        ],
        messages=messages,
        inferenceConfig={"maxTokens": 512, "temperature": 0.1},
    )
    elapsed = time.time() - start_time
    usage = response["usage"]

    cache_read = usage.get("cacheReadInputTokens", 0)
    cache_write = usage.get("cacheWriteInputTokens", 0)
    print(f"  Latency: {elapsed:.2f}s | Input: {usage['inputTokens']} | "
          f"Output: {usage['outputTokens']} | Cache read: {cache_read} | Cache write: {cache_write}")

    return response, elapsed

That {"cachePoint": {"type": "default"}} tells Bedrock: everything above this marker is static content. Cache it.

Real output with caching

Cached benchmark turns 1-2 — Turn 1 writes 2,130 tokens to cache, Turn 2 reads from cache at 90% discount

Cached benchmark turns 3-5 — every subsequent turn shows cache HIT with ~2,147 tokens read from cache

Look at Turn 1 first. Cache WRITE tokens: 2130 and Cache READ tokens: 0 — the prefix has to be stored before it can be reused. This first call is the setup cost.

From Turn 2 onwards, every call shows Cache READ tokens: ~2,145 and Cache WRITE tokens: 0. The system prefix is being read from cache at 90% discount instead of reprocessed.

The input token counts tell the same story. Turn 1 reports only 10 input tokens (just the user question) instead of 2,140 — the other 2,130 show up in cacheWriteInputTokens. Turns 2–5 only count the conversation history and new question as input. The prefix tokens move to cacheReadInputTokens.
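A quick way to sanity-check this accounting: the three usage fields partition the full prompt, so their sum should equal the uncached prompt size. The per-turn payloads below are illustrative numbers shaped like the run above, not copied API responses:

```python
def total_prompt_tokens(usage):
    """inputTokens excludes cached tokens, so the full prompt size is
    the sum of all three usage fields."""
    return (usage["inputTokens"]
            + usage.get("cacheReadInputTokens", 0)
            + usage.get("cacheWriteInputTokens", 0))

# Illustrative per-turn usage payloads:
turn1 = {"inputTokens": 10, "cacheWriteInputTokens": 2_130}   # write turn
turn2 = {"inputTokens": 190, "cacheReadInputTokens": 2_145}   # read turn

print(total_prompt_tokens(turn1))  # 2140 — same prompt size the baseline reported
```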

Cost breakdown

With caching, the billing splits into four components:

Component           Tokens    Rate (per 1M)          Cost
Non-cached input     1,738    $0.80                  $0.001390
Cache read           8,581    $0.08 (90% off)        $0.000686
Cache write          2,130    $1.00 (25% premium)    $0.002130
Output                 588    $3.20                  $0.001882
Total                                                $0.006089

Cache reads are billed at $0.08 per million — one-tenth of the regular $0.80 input price. Cache writes carry a 25% premium at $1.00 per million, so Turn 1 actually costs slightly more than a non-cached call. But by Turn 2, the write premium is already paid off. Each subsequent cache read saves over $0.001 compared to reprocessing those tokens.
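The four-component breakdown above reduces to one function, with the read and write multipliers (10% and 125% of the input rate) applied per component — swap in another tier's rates to redo the math:

```python
def cached_conversation_cost(non_cached, cache_read, cache_write, output,
                             in_rate, out_rate):
    """Cost of one cached conversation, in dollars. Cache reads bill at
    10% of the input rate; cache writes at 125%."""
    return (non_cached * in_rate
            + cache_read * in_rate * 0.10
            + cache_write * in_rate * 1.25
            + output * out_rate) / 1e6

# The Step 2 token totals at Nova Pro rates:
print(f"${cached_conversation_cost(1_738, 8_581, 2_130, 588, 0.80, 3.20):.4f}")
# $0.0061
```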

Side-by-side comparison

Metric                          Baseline    Cached      Change
Non-cached input tokens         12,834      1,738       -86%
Cost per conversation           $0.0128     $0.0061     -52%
Monthly cost (1K convos/day)    $384        $183        -52%
Monthly savings                                         $201/month

Don't expect a latency win — caching saves money, not time. The model still processes user messages and generates responses at the same speed. Turn 1 actually runs a bit slower due to the cache write overhead. The win is purely in token billing: those 2,130 tokens of system content go from $0.80/M to $0.08/M on every cache hit.

Key rule: The content before the cache point must be byte-for-byte identical across requests. If you change even one character in the system prompt, it's a cache miss and a new cache write. Keep your cached prefix genuinely static.
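The most common way to violate this rule is embedding a dynamic value — a timestamp, a request ID — in the system text. Here's a sketch of the failure mode and the fix (move dynamic context after the cache point, e.g. into the uncached user message); `FULL_SYSTEM_TEXT` stands in for your real prompt:

```python
from datetime import datetime, timezone

FULL_SYSTEM_TEXT = "...your static system prompt + product docs..."

def build_bad_system():
    # BAD: the timestamp changes every call, so the prefix is never
    # byte-identical -- every request is a cache miss plus a fresh
    # cache write billed at the 25% premium.
    now = datetime.now(timezone.utc).isoformat()
    return [
        {"text": f"Current time: {now}\n\n{FULL_SYSTEM_TEXT}"},
        {"cachePoint": {"type": "default"}},
    ]

def build_good_request(question):
    # GOOD: the cached prefix stays static; dynamic context rides along
    # in the user message, after the cache point.
    now = datetime.now(timezone.utc).isoformat()
    system = [
        {"text": FULL_SYSTEM_TEXT},
        {"cachePoint": {"type": "default"}},
    ]
    messages = [{
        "role": "user",
        "content": [{"text": f"(Current time: {now})\n\n{question}"}],
    }]
    return system, messages
```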


Step 3 — Comparing Across Model Tiers

Prompt caching works on all Amazon Nova models. But per-token pricing is wildly different between tiers, and so are the absolute savings. I ran the same five-turn benchmark on Nova Pro, Nova Lite, and Nova Micro.

Full comparison table

These numbers are from a single benchmark run across all three models, using identical system prompts and questions:

Model         Baseline Monthly    Cached Monthly    Cost Reduction    Input Price
Nova Pro      $334.61             $169.99           49%               $0.80/M
Nova Lite     $30.33              $18.41            39%               $0.06/M
Nova Micro    $16.99              $9.47             44%               $0.035/M

Caching saves 39–49% regardless of model tier. You might notice the Nova Pro number here (49%) differs slightly from the 52% in Step 2 — that's because these are separate benchmark runs with slightly different response lengths. The pattern holds: caching cuts costs by roughly half on Nova Pro and 40–45% on the cheaper tiers. The percentage is lower on cheaper models because the cache write premium ($1.00/M write vs $0.80/M regular input for Pro) has a larger relative impact when the base price is already low.

The real optimization: caching + model selection

Here's the number that matters for production. If your baseline uses Nova Pro without caching, and you switch to Nova Micro with caching, the combined savings are:

$334.61/month → $9.47/month — a 97% reduction.

That's not a typo. Most of the savings come from the model switch — Nova Micro's input tokens cost 23x less than Nova Pro's. Caching cuts the remaining bill roughly in half on top of that. For many customer support and document Q&A workloads, Nova Micro with caching delivers sufficient quality at a fraction of the cost.

Obviously, model quality matters. You shouldn't blindly switch from Pro to Micro. Test your specific use case, measure output quality, pick the cheapest model that meets your bar. But model selection and prompt caching together get you far bigger savings than either one alone.
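One way to make that decision systematic — this is a hypothetical helper, not part of the benchmark scripts: score each model on your own eval set (e.g. an LLM-as-a-judge pass rate), then pick the cheapest tier that clears your bar. The monthly costs are the cached projections from the table above.

```python
MONTHLY_COST = {"nova-pro": 169.99, "nova-lite": 18.41, "nova-micro": 9.47}

def pick_cheapest(quality_scores, threshold):
    """Return the cheapest model whose eval score meets the threshold."""
    qualified = [m for m, score in quality_scores.items() if score >= threshold]
    if not qualified:
        raise ValueError("no model meets the quality bar")
    return min(qualified, key=MONTHLY_COST.__getitem__)

# If Micro scores 0.88 on your eval but your bar is 0.90, Lite wins:
print(pick_cheapest({"nova-pro": 0.95, "nova-lite": 0.91, "nova-micro": 0.88}, 0.90))
# nova-lite
```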

Monthly cost comparison across Nova model tiers with and without prompt caching


Step 4 — Caching Tool Definitions for Agentic Workflows

If your agent uses tools, those JSON schema definitions are resent with every API call — just like the system prompt. Cache them too.

You can place up to four cache points per request. A common pattern is two: one after the system content, one after the tool definitions.

# Cache both system content and tool definitions

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",
    system=[
        {"text": FULL_SYSTEM},
        {"cachePoint": {"type": "default"}},      # Cache point 1: system + docs
    ],
    messages=messages,
    toolConfig={
        "tools": [
            {
                "toolSpec": {
                    "name": "lookup_order",
                    "description": "Look up a customer order by order ID",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "order_id": {
                                    "type": "string",
                                    "description": "The order ID to look up"
                                }
                            },
                            "required": ["order_id"]
                        }
                    }
                }
            },
            {
                "toolSpec": {
                    "name": "check_inventory",
                    "description": "Check inventory for a product SKU",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "sku": {"type": "string", "description": "Product SKU"}
                            },
                            "required": ["sku"]
                        }
                    }
                }
            },
            {"cachePoint": {"type": "default"}},   # Cache point 2: tool definitions
        ]
    },
    inferenceConfig={"maxTokens": 512},
)

Cache point rules to know

The content before a cache point needs at least 1,024 tokens for Nova and Claude Sonnet, or 2,048 tokens for Haiku. Below that, the cache point is silently ignored — no error, just no caching. I missed this initially and spent 10 minutes wondering why my short test prompt wasn't caching.

You can place up to four cache points per request — system, tools, and up to two more in your messages. The default TTL is 5 minutes; if no calls hit the same prefix in that window, the entry expires.

One detail that matters at scale: cached tokens don't count against your rate limits. They bypass the per-model token-per-minute throttle, which is useful if you're running near your provisioned limit.
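Since an undersized prefix fails silently, a cheap guard in your client code can save debugging time. The ~4-characters-per-token heuristic below is a rough assumption, not a real tokenizer — treat it as an early-warning check only:

```python
MIN_CACHEABLE_TOKENS = 1_024  # Nova / Claude Sonnet; Haiku needs 2,048

def rough_token_estimate(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English prose.
    return len(text) // 4

def check_cacheable(prefix: str) -> bool:
    """Warn if the prefix is likely below the minimum cacheable size."""
    est = rough_token_estimate(prefix)
    if est < MIN_CACHEABLE_TOKENS:
        print(f"WARNING: prefix ~{est} tokens < {MIN_CACHEABLE_TOKENS}; "
              "the cachePoint will be silently ignored")
        return False
    return True
```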


Step 5 — Monitoring Cache Performance in Production

In production, you need to track whether caching is actually working. A misconfigured prompt that changes the prefix on every call will silently miss the cache, and you'll pay full price without realizing it.

Publishing cache metrics to CloudWatch

# 08_observability.py

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def publish_cache_metrics(usage, model_id):
    """Publish cache performance metrics after each Bedrock call."""
    cache_read = usage.get("cacheReadInputTokens", 0)
    cache_write = usage.get("cacheWriteInputTokens", 0)
    total_input = usage["inputTokens"] + cache_read + cache_write

    hit_rate = (cache_read / total_input * 100) if total_input > 0 else 0

    cloudwatch.put_metric_data(
        Namespace="BedrockLLMOps",
        MetricData=[
            {
                "MetricName": "CacheHitTokens",
                "Value": cache_read,
                "Unit": "Count",
                "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            },
            {
                "MetricName": "CacheHitRate",
                "Value": hit_rate,
                "Unit": "Percent",
                "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            },
        ],
    )
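To see what the published metric looks like per call, here's the hit-rate formula applied to sample usage payloads shaped like the Step 2 run (illustrative numbers, not copied API responses):

```python
def cache_hit_rate(usage):
    """Share of the full prompt served from cache, as a percentage."""
    cache_read = usage.get("cacheReadInputTokens", 0)
    cache_write = usage.get("cacheWriteInputTokens", 0)
    total = usage["inputTokens"] + cache_read + cache_write
    return cache_read / total * 100 if total else 0.0

turn1 = {"inputTokens": 10, "cacheWriteInputTokens": 2_130}   # write turn: 0% hit
turn2 = {"inputTokens": 190, "cacheReadInputTokens": 2_145}   # steady state

print(round(cache_hit_rate(turn1)), round(cache_hit_rate(turn2)))  # 0 92
```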

Setting up alerts

A sudden drop in cache hit rate usually means something changed in your prompt prefix — maybe a deployment modified the system prompt, or a dynamic value crept into what should be static content.

cloudwatch.put_metric_alarm(
    AlarmName="LowCacheHitRate",
    MetricName="CacheHitRate",
    Namespace="BedrockLLMOps",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=70,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:llmops-alerts"],
)

This fires if the average cache hit rate drops below 70% for three consecutive five-minute windows. In a healthy multi-turn system, you should see hit rates above 80% — the only misses should be Turn 1 of each conversation (the initial cache write).


Conclusion

One line of code, roughly half off. That's the short version. Adding a cachePoint to our system block cut Nova Pro's monthly bill from $384 to $183, no infrastructure changes, no output quality difference. Across all three Nova tiers, savings landed between 39% and 52%.

The bigger win is combining caching with model selection. Nova Pro without caching to Nova Micro with caching: $384 down to under $10. A 97% reduction that mostly comes from the model switch, with caching shaving off the rest.

A few gotchas before you ship this:

Cache writes cost 25% more than regular input. Caching only pays for itself after the second call using the same prefix — so for single-turn workflows with no reuse, skip it.

The default TTL is 5 minutes. Works fine for steady traffic. Bursty conversations with long gaps between turns will see more cache misses. Claude models on Bedrock support a 1-hour TTL if you need it.

Monitor your cache hit rate. A misconfigured prompt that changes the prefix between calls will miss the cache every time, silently. The CloudWatch alert from Step 5 catches this.

Where to go next

If you want to push the cost savings further, model distillation on Bedrock lets you train a smaller student model on a larger teacher's outputs — purpose-built for your specific task. I'll cover that in a follow-up tutorial.

For production hardening, look into Bedrock Guardrails for content policy enforcement and LLM-as-a-Judge pipelines to continuously validate output quality as you swap models and tune prompts.

The complete code for this tutorial is available at: bedrock-prompt-caching-distillation-tutorial


Appendix: IAM Policy for This Tutorial

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "cloudwatch:PutMetricAlarm"
            ],
            "Resource": "*"
        }
    ]
}

Appendix: Cost Reference

All pricing is per 1 million tokens as of February 2026. Verify current pricing at the Amazon Bedrock pricing page.

Model         Input     Output    Cache Read (90% off)    Cache Write (25% premium)
Nova Pro      $0.80     $3.20     $0.08                   $1.00
Nova Lite     $0.06     $0.24     $0.006                  $0.075
Nova Micro    $0.035    $0.14     $0.0035                 $0.044
