Mukunda Rao Katta

Posted on May 25

I rewrote my Bedrock client three times. Then I extracted the boring parts into one library.

#hermeschallenge #aws #bedrock #python

I shipped three different Bedrock clients in the same quarter. Three different teams, three different products, three different "production" Python wrappers around boto3.client('bedrock-runtime'). Each one ended up reinventing the same three things from scratch.

Retry logic that respected Bedrock's ThrottlingException and ServiceQuotaExceededException without burning the quota on every retry.
Cost accounting that knew about prompt caching, because the published per-1k-token rates are misleading once you turn on cache_control.
JSON parse-and-repair, because models return JSON with explanations around it and your code has to handle that or crash.

After the third client I extracted those three things into one library. bedrock-kit is on PyPI. It is what I wish I had had the first time.

The three things

1. Throttle-aware retry

Bedrock has two relevant error types you will hit at any non-trivial scale.

ThrottlingException means you are over your per-model TPM or RPM quota. Naive retry burns the quota faster.

ServiceQuotaExceededException means a higher-tier quota (provisioned throughput, per-account quota) is exhausted. Retrying immediately does literally nothing.

bedrock-kit handles both correctly. Throttle errors get exponential backoff with jitter, a configurable maximum number of attempts, and a small dampening factor so a fleet retrying together does not synchronize. Service-quota errors fail fast with a clear message, because they are a config problem, not a transient one.

from bedrock_kit import BedrockClient

client = BedrockClient(
    region="us-east-1",
    throttle_max_attempts=5,
    throttle_initial_delay_ms=200,
)

response = client.converse(
    model="us.anthropic.claude-sonnet-4-7-v1:0",
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)

The retry happens inside the call. The caller code looks normal.
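For readers who want to see the shape of throttle-aware retry in isolation, here is a minimal sketch. It is not bedrock-kit's implementation: call_with_retry, backoff_delay, and error_code are hypothetical names, and plain "full jitter" stands in for the kit's dampening factor, which this post does not spell out. The only convention borrowed from botocore is that a ClientError exposes its code at exc.response["Error"]["Code"].

```python
import random
import time

RETRYABLE = {"ThrottlingException"}            # transient: back off and retry
FAIL_FAST = {"ServiceQuotaExceededException"}  # config problem: do not retry


def error_code(exc):
    """Read the AWS error code the way botocore's ClientError exposes it."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "")


def backoff_delay(attempt, initial_ms=200, cap_ms=20_000, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based), full jitter."""
    # The ceiling doubles each attempt up to a cap; the actual wait is a
    # random fraction of it, so clients that fail together spread out.
    ceiling_ms = min(cap_ms, initial_ms * (2 ** attempt))
    return rng() * ceiling_ms / 1000.0


def call_with_retry(fn, max_attempts=5, initial_ms=200, sleep=time.sleep):
    """Call fn(); retry throttles with backoff, surface quota errors at once."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            code = error_code(exc)
            if code in FAIL_FAST:
                raise RuntimeError(
                    "Service quota exhausted; raise the quota instead of retrying"
                ) from exc
            # Re-raise anything that is not a throttle, and the last throttle.
            if code not in RETRYABLE or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, initial_ms))
```

With boto3 this would wrap a call such as lambda: runtime.converse(modelId=..., messages=...). The sleep and rng parameters exist only so the backoff can be tested without waiting.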

2. Cache-aware cost

Bedrock charges differently for cached input tokens than for fresh ones. When you turn on cache_control and your prompt repeats, the cost of the cached part of the input can drop by as much as 90%. But the per-1k-token rate AWS publishes is the fresh rate. If you price every input token at that rate, you over-report cost.

bedrock-kit reads the cache_creation_input_tokens and cache_read_input_tokens fields from the response, applies the cached discount per model, and returns a single cost_usd float that is what you actually pay.

print(response.cost_usd)
# 0.0034
print(response.cost_breakdown)
# {"cached_input": 0.0001, "fresh_input": 0.0009, "output": 0.0024}

The breakdown is what you want for a per-callsite cost dashboard. The single float is what you want for a budget cap.
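The arithmetic behind that breakdown can be sketched in a few lines. This is an illustration, not the kit's code: the rates below are placeholders rather than published AWS prices, cost_breakdown and naive_cost_usd are hypothetical names, and the sketch assumes input_tokens counts only the uncached tokens. The cache_write bucket is also my addition, since cache creation is usually priced separately from cache reads.

```python
# Illustrative per-1k-token rates in USD. Placeholders only: not AWS's
# published prices and not the rates file that ships with the kit.
EXAMPLE_RATES = {
    "fresh_input": 0.003,
    "cache_write": 0.00375,
    "cached_input": 0.0003,
    "output": 0.015,
}


def cost_breakdown(usage, rates=EXAMPLE_RATES):
    """Price each token bucket at its own rate instead of one flat input rate."""
    def price(tokens, key):
        return tokens / 1000.0 * rates[key]

    return {
        "fresh_input": price(usage.get("input_tokens", 0), "fresh_input"),
        "cache_write": price(usage.get("cache_creation_input_tokens", 0), "cache_write"),
        "cached_input": price(usage.get("cache_read_input_tokens", 0), "cached_input"),
        "output": price(usage.get("output_tokens", 0), "output"),
    }


def cost_usd(usage, rates=EXAMPLE_RATES):
    """Single number for a budget cap: the sum of the buckets."""
    return sum(cost_breakdown(usage, rates).values())


def naive_cost_usd(usage, rates=EXAMPLE_RATES):
    """What you get if every input token is billed at the fresh rate."""
    all_input = (
        usage.get("input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0)
        + usage.get("cache_read_input_tokens", 0)
    )
    return (
        all_input / 1000.0 * rates["fresh_input"]
        + usage.get("output_tokens", 0) / 1000.0 * rates["output"]
    )


usage = {
    "input_tokens": 1000,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 10000,
    "output_tokens": 200,
}
print(cost_breakdown(usage))
print(cost_usd(usage), "vs naive", naive_cost_usd(usage))
```

With these placeholder rates, a call that reads 10,000 tokens from cache comes out at roughly a quarter of the naive figure, which is the gap a per-callsite dashboard would otherwise hide.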

3. JSON parse-and-repair

Bedrock models often return JSON with prose around it. A typical reply reads "Here is the JSON you requested:", then {...} inside a json code fence, then "Let me know if you want anything else." If your code does json.loads(response.text), you get a JSONDecodeError you have to handle.

bedrock-kit ships a parse_json(response) helper that:

Tries json.loads(response.text) directly.
If that fails, walks the text looking for the largest valid JSON object or array.
If that fails, runs a three-pass repair (strip code fences, balance braces, drop trailing commas).
If that fails, raises a clear JsonRepairFailed with the exact text the model returned.

from bedrock_kit import parse_json

data = parse_json(response)  # never raises JSONDecodeError directly

The repair logic is its own library (llm-json-repair) so non-Bedrock callers can use it too.
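To make the four-step cascade concrete, here is a rough standalone sketch. It is not the code in bedrock-kit or llm-json-repair: parse_json_text is a hypothetical function that takes plain text rather than a response object, and the repair pass is deliberately simple (its trailing-comma regex does not check whether a comma sits inside a string).

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the listing clean


class JsonRepairFailed(ValueError):
    """Raised when no JSON can be recovered; keeps the original text."""

    def __init__(self, text):
        super().__init__(f"could not recover JSON from model output: {text!r}")
        self.text = text


def _largest_embedded(text):
    """Return (found, value) for the longest valid JSON object or array."""
    decoder = json.JSONDecoder()
    found, best, best_len = False, None, 0
    for i, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            value, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            continue
        if end - i > best_len:
            found, best, best_len = True, value, end - i
    return found, best


def _repair(text):
    """Three passes: strip code fences, balance braces, drop trailing commas."""
    text = re.sub(re.escape(FENCE) + r"(?:json)?", "", text)
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        return text
    text = text[min(starts):]
    closers, in_string, escaped = [], False, False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers:
            closers.pop()
    text += "".join(reversed(closers))  # close whatever was left open
    return re.sub(r",\s*([}\]])", r"\1", text)


def parse_json_text(text):
    try:
        return json.loads(text)                      # step 1: direct parse
    except json.JSONDecodeError:
        pass
    found, value = _largest_embedded(text)           # step 2: scan for JSON
    if found:
        return value
    found, value = _largest_embedded(_repair(text))  # step 3: repair, rescan
    if found:
        return value
    raise JsonRepairFailed(text)                     # step 4: give up loudly
```

Balancing before dropping trailing commas matters for truncated output: a reply cut off after "[1, 2," only exposes its trailing comma once the missing bracket has been appended.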

What it does NOT do

It does not replace boto3. It is a thin wrapper. If you need an obscure Bedrock API surface, drop to boto3 directly.
It does not handle streaming. The streaming API has its own quirks; if you need streaming use the SDK directly and pipe responses through parse_json at the end.
It does not multiplex providers. If you want to fall back from Bedrock to a direct Anthropic API call, use llm-fallback-router.
It does not handle prompt caching itself. You set cache_control on your prompt. The kit only reads the resulting cache fields back out.
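For the streaming case, "pipe responses through parse_json at the end" means collecting the text yourself first. A minimal sketch of that step follows. collect_stream_text is a hypothetical helper, not part of the kit, and it assumes the boto3 converse_stream event shapes: text arrives in contentBlockDelta events and token usage in a final metadata event.

```python
def collect_stream_text(events):
    """Join text deltas from a ConverseStream event iterable.

    Returns (text, usage), where usage is the dict from the metadata
    event, or None if the stream ended without one.
    """
    parts = []
    usage = None
    for event in events:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            parts.append(delta["text"])
        if "metadata" in event:
            usage = event["metadata"].get("usage")
    return "".join(parts), usage


# Usage with boto3 (not run here; needs AWS credentials):
#   runtime = boto3.client("bedrock-runtime")
#   response = runtime.converse_stream(modelId=..., messages=...)
#   text, usage = collect_stream_text(response["stream"])
# The joined text can then go through whatever JSON repair step you use.
```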

Inside the lib: one design choice worth showing

The hard call was whether to ship pricing data inline or have callers configure it.

Inline means the kit becomes stale every time AWS updates a rate. Caller-configured means every team has to maintain a rates file or import one from somewhere.

I picked inline. The rates file ships with the kit and is versioned. The BedrockClient reads it at startup and prints a warning if the version is more than 90 days old. Callers who want to override can pass rates_path=....

The kit-shipped rates are accurate as of the release date. Bedrock pricing does not change often enough for the inline approach to be wrong in practice. The warning makes the staleness visible.
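A sketch of how a staleness check like this can work is below. The as_of field, the BUNDLED_RATES layout, and load_rates are all invented for the example; only the 90-day threshold and the rates_path override come from the description above.

```python
import json
import warnings
from datetime import date, timedelta

MAX_RATES_AGE = timedelta(days=90)

# Stand-in for the rates file bundled with a release. The field names and
# the numbers are made up for this example.
BUNDLED_RATES = {
    "as_of": "2025-01-15",
    "models": {"example-model": {"fresh_input": 0.003, "output": 0.015}},
}


def load_rates(rates_path=None, today=None):
    """Return caller-supplied rates if a path is given, else the bundled ones.

    Warns (but does not fail) when the chosen rates are older than 90 days.
    """
    if rates_path is not None:
        with open(rates_path, encoding="utf-8") as fh:
            rates = json.load(fh)
    else:
        rates = BUNDLED_RATES
    age = (today or date.today()) - date.fromisoformat(rates["as_of"])
    if age > MAX_RATES_AGE:
        warnings.warn(
            f"pricing data is {age.days} days old (as of {rates['as_of']}); "
            "pass rates_path to override",
            stacklevel=2,
        )
    return rates
```

Warning instead of raising keeps an old release usable while still making the drift visible in logs.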

When this is useful

You call Bedrock from a production service and want one wrapper that handles the three boring failure modes.
You run agents on Bedrock-hosted Claude or Llama and want per-call cost without writing the math.
You ask Bedrock models for structured JSON and have been bitten by JSONDecodeError more than once.

When this is NOT what you want

If you call Bedrock through a higher-level framework that already wraps boto3. Stack the kit underneath that framework or pick one path.
If you only care about Claude on the Anthropic API directly. Use claude-cost for the cost layer and the Anthropic SDK directly.

Install

pip install bedrock-kit

Repo: https://github.com/MukundaKatta/bedrock-kit

Sibling libraries

Lib	Boundary	Repo
bedrock-kit	Opinionated Bedrock client	this repo
bedrock-cost	Cross-vendor Bedrock pricing (Rust)	https://github.com/MukundaKatta/bedrock-cost
llm-json-repair	The JSON repair pass, standalone	https://github.com/MukundaKatta/llm-json-repair
llm-retry	Standalone retry with jitter	https://github.com/MukundaKatta/llm-retry

What's next

A streaming wrapper that handles ConverseStream with the same cost accounting. A BatchBedrockClient that pools requests into Bedrock's batch API where available, similar to how llmfleet pools Anthropic calls.
