Mukunda Rao Katta

Posted on May 19

AWS Bedrock prompt caching has a hidden cost most people miss

#hermesagent #ai #aws #python

I moved a workload from the Anthropic API to AWS Bedrock for compliance reasons. Same Claude model. Same prompts. My monthly bill went up about 35%.

I assumed Bedrock just costs more. It does, a little, but not 35% more. The real story was prompt caching.

On the Anthropic API, prompt cache writes cost 1.25x the input price and reads cost 0.1x. On Bedrock, the cache-write price for Claude is the same multiplier but reads are slightly more expensive per token, and the cache TTL is shorter on some configurations. Combine that with the way Bedrock charges for cache-aware models and the breakeven hit ratio is around 30%.

If your cache hit ratio is below 30%, every cache write is costing you more than the reads save.

I was at 14%.

How I figured it out

I built bedrock-kit to wrap my Bedrock calls and surface the cost breakdown per call. It is opinionated. It does three things:

Pulls input, output, cache-read, and cache-write token counts from every response.
Computes USD cost using the current Bedrock price sheet for that model and region.
Tries to repair malformed JSON output before raising.

That last one is unrelated but worth its weight. More below.

The hit ratio surface

from bedrock_kit import BedrockClient

client = BedrockClient(model="anthropic.claude-3-5-sonnet-20241022-v2:0")

result = client.complete(
    system=long_system_prompt,
    messages=[{"role": "user", "content": "..."}],
    cache=True,
)

print(result.cost)
print(result.cache_stats)

Output:

cost: $0.0042
  input:        $0.0008  (2700 tokens)
  output:       $0.0030  (900 tokens)
  cache_read:   $0.0001  (100 tokens, 1% of input)
  cache_write:  $0.0003  (2600 tokens)
cache_stats:
  read_ratio: 0.04
  status: COLD  (write > read, you are paying the tax)

The status field is the part that woke me up. It tells you whether the cache is actually helping. COLD means writes exceed reads. WARM means roughly breakeven. HOT means reads dominate and you are getting your money's worth.

Over a few days of running this, I saw that my system prompts were getting evicted faster than I expected on Bedrock. The 5-minute TTL was the problem. My traffic was bursty and a lot of my prompts were arriving 6-7 minutes apart.

I batched my requests into tighter windows and the hit ratio went from 14% to 71%. Bill dropped to about 5% above where it was on the Anthropic API. That I could live with.

The JSON repair part

The other thing in this library is JSON repair. Bedrock will sometimes return Claude output wrapped in fenced code blocks even when you ask for JSON only. Or with a trailing comma. Or with the JSON object truncated because you hit max_tokens.

result = client.complete(
    messages=[...],
    response_format="json",
)

data = result.json  # repaired, parsed dict

The repair runs three passes:

Strip markdown fences ( `

`and `

json `).

Find the largest balanced {...} or [...] substring.
Remove trailing commas inside objects/arrays.

If all three fail, you get the raw text and a JsonRepairError you can handle. About 95% of the malformed outputs I have seen are recoverable with these three passes. The remaining 5% are usually a truncated response, which is a real bug, not a parse bug, and you want to know about it.

Throttle handling

Bedrock throttles aggressively in some regions. The library has a built-in exponential backoff with jitter for ThrottlingException. Same defaults that work for me: 5 retries, base 1s, cap 30s, full jitter.

python client = BedrockClient( model="anthropic.claude-3-5-sonnet-20241022-v2:0", retry_throttle=True, # default max_retries=5, )

If you want to handle it yourself, set retry_throttle=False and you get the raw boto3 error.

What this is not

This is not a full SDK. It is a thin wrapper around boto3 with a few quality-of-life things. If you want streaming, raw byte access, or anything past the basic complete-and-cost flow, drop down to boto3 directly. The client exposes the underlying session so you can mix.

It also does not call out the cache cost story automatically. You have to log the cache_stats field and look at it. I would like to add a background warning when status is COLD for more than N requests in a row. That is on the list.

Repo

GitHub: https://github.com/MukundaKatta/bedrock-kit
PyPI: pip install bedrock-kit

If you are on Bedrock and you turned on prompt caching because it was free on the Anthropic side, check your hit ratio. The breakeven on Bedrock is higher than you think.

DEV Community