I shipped three different Bedrock clients in the same quarter. Three different teams, three different products, three different "production" Python wrappers around boto3.client('bedrock-runtime'). Each one ended up reinventing the same three things from scratch.
- Retry logic that respected Bedrock's ThrottlingException and ServiceQuotaExceededException without burning the quota on every retry.
- Cost accounting that knew about prompt caching, because the published per-1k-token rates are misleading once you turn on cache_control.
- JSON parse-and-repair, because models return JSON with explanations around it and your code has to handle that or crash.
After the third client I extracted those three things into one library. bedrock-kit is on PyPI. It is what I wish I had had the first time.
The three things
1. Throttle-aware retry
Bedrock has two relevant error types you will hit at any non-trivial scale.
ThrottlingException means you are over your per-model TPM or RPM quota. Naive retry burns the quota faster.
ServiceQuotaExceededException means a higher-tier quota (provisioned throughput, per-account quota) is exhausted. Retrying immediately does literally nothing.
bedrock-kit handles both correctly. Throttle errors get exponential backoff with jitter, max attempts configurable, and a small dampening factor so a fleet retrying together does not synchronize. Service-quota errors fail fast with a clear message, because they are a config problem not a transient one.
from bedrock_kit import BedrockClient
client = BedrockClient(
    region="us-east-1",
    throttle_max_attempts=5,
    throttle_initial_delay_ms=200,
)
response = client.converse(
    model="us.anthropic.claude-sonnet-4-7-v1:0",
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)
The retry happens inside the call. The caller code looks normal.
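To make the two policies concrete, here is a minimal sketch of the idea, not the kit's actual code. The exception classes and `call_with_backoff` are hypothetical stand-ins, and the jitter is plain "full jitter" rather than the kit's dampening factor. With raw boto3, both errors arrive as `botocore.exceptions.ClientError` and you would branch on `e.response["Error"]["Code"]` instead.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for Bedrock's ThrottlingException (transient)."""


class QuotaExhaustedError(Exception):
    """Hypothetical stand-in for ServiceQuotaExceededException (not transient)."""


def call_with_backoff(fn, max_attempts=5, initial_delay_ms=200,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying throttle errors with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except QuotaExhaustedError:
            # A config problem, not a transient one: fail fast, no retry.
            raise
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the throttle to the caller
            # The ceiling doubles on each attempt; the actual wait is a random
            # fraction of it, so callers retrying together spread out.
            ceiling_s = (initial_delay_ms / 1000) * (2 ** attempt)
            sleep(ceiling_s * rng())
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting or randomness.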
2. Cache-aware cost
Bedrock charges differently for cached input tokens than for fresh ones. When you turn on cache_control and your prompt repeats, the input cost can drop by up to 90%. But the per-1k-token rate AWS publishes is the fresh rate. If you bill every input token at that rate you over-report cost.
bedrock-kit reads the cache_creation_input_tokens and cache_read_input_tokens fields from the response, applies the cached discount per model, and returns a single cost_usd float that is what you actually pay.
print(response.cost_usd)
# 0.0034
print(response.cost_breakdown)
# {"cached_input": 0.0001, "fresh_input": 0.0009, "output": 0.0024}
The breakdown is what you want for a per-callsite cost dashboard. The single float is what you want for a budget cap.
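The arithmetic behind that breakdown can be sketched as follows. This is an illustration under assumptions, not the kit's code: `Rates`, `cost_breakdown`, and `naive_cost` are hypothetical names, the rates in any example are made up, and the sketch assumes `input_tokens` counts only uncached input while cache reads and writes are reported in their own fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per 1k tokens for one model (illustrative values only)."""
    fresh_input: float
    cache_read: float
    cache_write: float
    output: float


def _usd(tokens, rate_per_1k):
    return tokens / 1000 * rate_per_1k


def cost_breakdown(usage, rates):
    """Split one call's cost into cached input, fresh input, and output."""
    cached = (
        _usd(usage.get("cache_read_input_tokens", 0), rates.cache_read)
        + _usd(usage.get("cache_creation_input_tokens", 0), rates.cache_write)
    )
    return {
        "cached_input": cached,
        "fresh_input": _usd(usage["input_tokens"], rates.fresh_input),
        "output": _usd(usage["output_tokens"], rates.output),
    }


def naive_cost(usage, rates):
    """The over-reported figure: every input token billed at the fresh rate."""
    all_input = (
        usage["input_tokens"]
        + usage.get("cache_read_input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0)
    )
    return _usd(all_input, rates.fresh_input) + _usd(usage["output_tokens"], rates.output)
```

Summing the values of the breakdown gives the single figure for a budget cap; comparing it against `naive_cost` shows how far a fresh-rate estimate drifts once most of the prompt is served from cache.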
3. JSON parse-and-repair
Bedrock models often return JSON with prose around it: "Here is the JSON you requested: ```json {...} ```. Let me know if you want anything else." If your code does json.loads(response.text) you get a JSONDecodeError you have to handle.
bedrock-kit ships a parse_json(response) helper that:
- Tries json.loads(response.text) directly.
- If that fails, walks the text looking for the largest valid JSON object or array.
- If that fails, runs a three-pass repair (strip code fences, balance braces, drop trailing commas).
- If that fails, raises a clear JsonRepairFailed with the exact text the model returned.
from bedrock_kit import parse_json
data = parse_json(response) # never raises JSONDecodeError directly
The repair logic is its own standalone library (llm-json-repair) so non-Bedrock callers can use it too.
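The four steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the code in llm-json-repair: `parse_json_text` and its helpers are hypothetical names, it takes a plain string rather than a response object, and `JsonRepairFailed` here is a local stand-in for the kit's exception.

```python
import json
import re


class JsonRepairFailed(ValueError):
    """Local stand-in: raised when nothing works; keeps the raw model text."""

    def __init__(self, text):
        super().__init__("could not recover JSON from model output")
        self.text = text


def _largest_embedded(text):
    """Return (found, value) for the longest decodable object or array."""
    decoder = json.JSONDecoder()
    found, best, best_len = False, None, 0
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if end - start > best_len:
            found, best, best_len = True, value, end - start
    return found, best


def _repair(text):
    # Pass 1: strip Markdown code-fence markers (three backticks, optional "json").
    text = re.sub(r"`{3}(?:json)?", "", text).strip()
    # Pass 2: append closers for braces and brackets left open, skipping strings.
    closers, in_string, escaped = [], False, False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers:
            closers.pop()
    text += "".join(reversed(closers))
    # Pass 3: drop trailing commas (naive: also touches ",}" inside strings).
    return re.sub(r",\s*([}\]])", r"\1", text)


def parse_json_text(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    for candidate in (text, _repair(text)):
        found, value = _largest_embedded(candidate)
        if found:
            return value
    raise JsonRepairFailed(text)
```

One consequence of searching before repairing: if a truncated reply still contains a small complete array or object, the search step returns that fragment and the repair never runs. Whether that trade-off is acceptable depends on how strictly you validate the parsed result afterwards.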
What it does NOT do
- It does not replace boto3. It is a thin wrapper. If you need an obscure Bedrock API surface, drop to boto3 directly.
- It does not handle streaming. The streaming API has its own quirks; if you need streaming use the SDK directly and pipe responses through parse_json at the end.
- It does not multiplex providers. If you want to fall back from Bedrock to a direct Anthropic API call, use llm-fallback-router.
- It does not handle prompt caching itself. You set cache_control on your prompt. The kit only reads the resulting cache fields back out.
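For the streaming case mentioned in the list above, a rough sketch of "collect first, parse at the end" looks like this. `collect_stream_text` is a hypothetical helper; the event shape follows boto3's `converse_stream` response, where text arrives in `contentBlockDelta` events. Whether the kit's parse_json accepts a plain string is an assumption to check before wiring the two together.

```python
def collect_stream_text(events):
    """Join the text deltas from an iterable of ConverseStream events."""
    parts = []
    for event in events:
        # Only contentBlockDelta events carry generated text; messageStart,
        # contentBlockStop, messageStop, and metadata events are skipped.
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            parts.append(delta["text"])
    return "".join(parts)


# Usage with the SDK directly (needs AWS credentials, so left as comments):
#
#   import boto3
#   runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
#   response = runtime.converse_stream(modelId=MODEL_ID, messages=messages)
#   text = collect_stream_text(response["stream"])
#   # hand `text` to your JSON parse-and-repair step here
```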
Inside the lib: one design choice worth showing
The hard call was whether to ship pricing data inline or have callers configure it.
Inline means the kit becomes stale every time AWS updates a rate. Caller-configured means every team has to maintain a rates file or import one from somewhere.
I picked inline. The rates file ships with the kit and is versioned. The BedrockClient reads it at startup and prints a warning if the version is more than 90 days old. Callers who want to override can pass rates_path=....
The kit-shipped rates are accurate as of the release date. Bedrock pricing does not change often enough for the inline approach to be wrong in practice. The warning makes the staleness visible.
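The staleness check is small enough to sketch. Everything here is an assumption for illustration: the rates file layout (an `as_of` ISO date next to a `rates` mapping), the function name `rates_from_json`, and the use of `warnings.warn` rather than whatever the kit prints.

```python
import json
import warnings
from datetime import date

MAX_RATES_AGE_DAYS = 90


def rates_from_json(text, today=None):
    """Parse a rates document and warn when its `as_of` date is stale."""
    data = json.loads(text)
    as_of = date.fromisoformat(data["as_of"])
    age_days = ((today or date.today()) - as_of).days
    if age_days > MAX_RATES_AGE_DAYS:
        warnings.warn(
            f"pricing data is {age_days} days old (as of {as_of}); "
            "consider upgrading or passing your own rates file",
            stacklevel=2,
        )
    return data["rates"]
```

A caller-supplied override then amounts to reading a different file and passing its contents through the same function, so the warning applies to overrides as well.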
When this is useful
- You call Bedrock from a production service and want one wrapper that handles the three boring failure modes.
- You run agents on Bedrock-hosted Claude or Llama and want per-call cost without writing the math.
- You ask Bedrock models for structured JSON and have been bitten by JSONDecodeError more than once.
When this is NOT what you want
- If you call Bedrock through a higher-level framework that already wraps boto3. Stack the kit underneath that framework or pick one path.
- If you only care about Claude on the Anthropic API directly. Use claude-cost for the cost layer and the Anthropic SDK directly.
Install
pip install bedrock-kit
Repo: https://github.com/MukundaKatta/bedrock-kit
Sibling libraries
| Lib | Boundary | Repo |
|---|---|---|
| bedrock-kit | Opinionated Bedrock client | this repo |
| bedrock-cost | Cross-vendor Bedrock pricing (Rust) | https://github.com/MukundaKatta/bedrock-cost |
| llm-json-repair | The JSON repair pass, standalone | https://github.com/MukundaKatta/llm-json-repair |
| llm-retry | Standalone retry with jitter | https://github.com/MukundaKatta/llm-retry |
What's next
A streaming wrapper that handles ConverseStream with the same cost accounting. A BatchBedrockClient that pools requests into Bedrock's batch API where available, similar to how llmfleet pools Anthropic calls.