Neeraja Khanapure

Claude on AWS Bedrock was throttling requests and the billing dashboard showed zero issues

Most teams running Claude on Bedrock watch latency and cost.
Neither shows you when you are about to get throttled.
Claude Sonnet output tokens cost 5x more compute to generate than input tokens cost to process, so AWS counts each output token 5x against your tokens-per-minute (TPM) quota. Your bill charges for real tokens. Your quota gate reflects real compute.
Your bill shows 100 output tokens.
Bedrock counted 500 against your limit.
Throttling hits. Dashboard looks clean.
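To make the gap concrete, here is a minimal sketch of the arithmetic. It assumes a flat 5x multiplier on output tokens, as described above; the function names are made up for illustration, and Bedrock's real quota accounting may include details this ignores.

```python
# Assumption: output tokens burn quota at a flat 5x, per the post.
OUTPUT_BURNDOWN = 5


def billed_tokens(input_tokens: int, output_tokens: int) -> int:
    """Tokens as they appear on the bill: every token counted once."""
    return input_tokens + output_tokens


def quota_tokens(input_tokens: int, output_tokens: int,
                 burndown: int = OUTPUT_BURNDOWN) -> int:
    """Tokens counted against the TPM limit: output weighted by the multiplier."""
    return input_tokens + output_tokens * burndown


print(billed_tokens(0, 100))  # 100
print(quota_tokens(0, 100))   # 500
```

Same request, two very different numbers. The second one is the one that gets you throttled.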
What AWS just shipped
AWS just released two CloudWatch metrics that fix this blind spot. Both are free, automatic, and already in your AWS/Bedrock CloudWatch namespace. No code changes. No opt-in.
EstimatedTPMQuotaUsage
Real quota consumed per request, burndown multipliers included. Not what you were billed. What Bedrock actually counted against your limit.
TimeToFirstToken
Server-side metric. Measures time from request to first Claude response token. Tells you if slowness lives in Bedrock or your own stack. Stops the guessing. Narrows the debugging in seconds.
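If you want to look at both metrics before setting any alarms, something like the following sketch could pull the last hour with boto3's `get_metric_data`. The `ModelId` dimension, the `Sum` and `p95` statistics, and the 60-second period are assumptions here, not something the post confirms, so check them against what you actually see in your CloudWatch console.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "AWS/Bedrock"


def build_metric_queries(model_id: str, period_seconds: int = 60) -> list:
    """Build MetricDataQueries for the two new metrics (no AWS call made here)."""

    def stat(query_id: str, metric_name: str, statistic: str) -> dict:
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": NAMESPACE,
                    "MetricName": metric_name,
                    # Assumption: metrics are published per ModelId.
                    "Dimensions": [{"Name": "ModelId", "Value": model_id}],
                },
                "Period": period_seconds,
                "Stat": statistic,
            },
            "ReturnData": True,
        }

    return [
        # Assumption: summing per-request usage over a minute approximates TPM.
        stat("quota", "EstimatedTPMQuotaUsage", "Sum"),
        stat("ttft", "TimeToFirstToken", "p95"),
    ]


def fetch_last_hour(model_id: str) -> dict:
    """Query CloudWatch for the last hour. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    end = datetime.now(timezone.utc)
    client = boto3.client("cloudwatch")
    return client.get_metric_data(
        MetricDataQueries=build_metric_queries(model_id),
        StartTime=end - timedelta(hours=1),
        EndTime=end,
    )
```

Calling `fetch_last_hour("<your model id>")` returns the raw `MetricDataResults`; the builder is kept separate so you can inspect the query without touching AWS.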
3 alarms worth setting today
80% of TPM limit on EstimatedTPMQuotaUsage
Warning before throttle hits. You get runway instead of a surprise.
P95 threshold on TimeToFirstToken
Catch Claude response degradation before users feel it.
Compare TimeToFirstToken (TTFT) vs InvocationLatency
If TTFT is fine but total latency is high, the problem is output generation, not model startup. Narrows the debug surface immediately.
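Here is a rough sketch of how those three alarms might look as boto3 `put_metric_alarm` arguments. Treat the details as assumptions: the `ModelId` dimension, the statistics, the periods, the alarm names, and millisecond units for both latency metrics are illustrative guesses, and the thresholds are yours to pick. The third alarm uses CloudWatch metric math to alarm on the gap between total latency and TTFT.

```python
NAMESPACE = "AWS/Bedrock"


def _dims(model_id: str) -> list:
    # Assumption: metrics are published per ModelId.
    return [{"Name": "ModelId", "Value": model_id}]


def quota_alarm(model_id: str, tpm_limit: int, fraction: float = 0.8) -> dict:
    """Alarm 1: quota usage per minute reaches a fraction of the TPM limit."""
    return {
        "AlarmName": f"bedrock-tpm-quota-{round(fraction * 100)}pct",  # hypothetical
        "Namespace": NAMESPACE,
        "MetricName": "EstimatedTPMQuotaUsage",
        "Dimensions": _dims(model_id),
        "Statistic": "Sum",  # assumption: per-minute sum approximates TPM
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": tpm_limit * fraction,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


def ttft_p95_alarm(model_id: str, threshold_ms: float) -> dict:
    """Alarm 2: p95 TimeToFirstToken stays above a threshold you choose."""
    return {
        "AlarmName": "bedrock-ttft-p95",  # hypothetical
        "Namespace": NAMESPACE,
        "MetricName": "TimeToFirstToken",
        "Dimensions": _dims(model_id),
        "ExtendedStatistic": "p95",  # percentiles use ExtendedStatistic
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def generation_time_alarm(model_id: str, threshold_ms: float) -> dict:
    """Alarm 3: InvocationLatency minus TimeToFirstToken, via metric math."""

    def stat(query_id: str, metric_name: str) -> dict:
        return {
            "Id": query_id,
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": NAMESPACE,
                    "MetricName": metric_name,
                    "Dimensions": _dims(model_id),
                },
                "Period": 60,
                "Stat": "Average",
            },
        }

    return {
        "AlarmName": "bedrock-generation-time",  # hypothetical
        "Metrics": [
            stat("latency", "InvocationLatency"),
            stat("ttft", "TimeToFirstToken"),
            {"Id": "gen", "Expression": "latency - ttft",
             "Label": "Latency after first token", "ReturnData": True},
        ],
        "EvaluationPeriods": 3,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_alarms(model_id: str, tpm_limit: int, ttft_ms: float, gen_ms: float) -> None:
    """Create all three alarms. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above run without AWS access

    cloudwatch = boto3.client("cloudwatch")
    for kwargs in (quota_alarm(model_id, tpm_limit),
                   ttft_p95_alarm(model_id, ttft_ms),
                   generation_time_alarm(model_id, gen_ms)):
        cloudwatch.put_metric_alarm(**kwargs)
```

None of these alarms notify anyone as written. Add `AlarmActions` with an SNS topic ARN of your own if you want a page instead of a red dot in the console.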
Were you tracking billing tokens thinking that was your quota? Most teams are.
Source: https://aws.amazon.com/blogs/machine-learning/improve-operational-visibility-for-inference-workloads-on-amazon-bedrock-with-new-cloudwatch-metrics-for-ttft-and-estimated-quota-consumption/
