Harish Aravindan
Your Bedrock Bill Is a Ticking Clock — Here's How to Stop It

You deploy a Lambda that calls Bedrock. It works beautifully in testing.

Then someone runs a batch job, a retry loop goes wrong, or traffic spikes and your AWS bill at the end of the month looks like a phone number.

Bedrock has no built-in spend cap. No circuit breaker. No "stop after $X." It will happily invoke your model ten thousand times before you notice anything is wrong.

This post is about the patterns that prevent that, applied specifically to serverless AI workloads on AWS.


Why Bedrock Cost Blowups Happen

Bedrock charges per input token and output token. The pricing varies by model:

| Model | Input (per 1K tokens) | Output (per 1K tokens) |
| --- | --- | --- |
| Claude Haiku | ~$0.00025 | ~$0.00125 |
| Claude Sonnet | ~$0.003 | ~$0.015 |
| Claude Opus | ~$0.015 | ~$0.075 |

Haiku looks cheap, and it is, until you're running it at scale with large prompts. A 2,000 token prompt + 500 token response at Haiku pricing is about $0.0011 per call. At 100,000 calls per day that's over $110/day, roughly $3,400/month. From a single Lambda function.
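To keep that arithmetic honest in your own pipeline, a tiny helper makes per-call cost explicit. This is illustrative, using the published rates from the table above:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_per_1k: float, out_per_1k: float) -> float:
    """Dollar cost of one model call at per-1K-token pricing."""
    return input_tokens / 1000 * in_per_1k + output_tokens / 1000 * out_per_1k

# Haiku rates from the table above: 2,000 in + 500 out
per_call = call_cost(2000, 500, 0.00025, 0.00125)
per_month = per_call * 100_000 * 30   # 100K calls/day for a month
print(f"${per_call:.6f} per call, ~${per_month:,.0f}/month")
# → $0.001125 per call, ~$3,375/month
```

Run this against your own prompt sizes before deploying; the answer is often surprising.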

The three failure modes that turn a reasonable bill into a bad one:

1. Unbounded retry loops. Lambda retries failed asynchronous invocations automatically. If your Bedrock call fails and you don't handle it properly, Lambda will retry it twice, tripling your token spend on every failure.

2. Prompt size creep. You add context, history, or document content to your prompt over time. Input tokens grow. You don't notice because the latency stays roughly the same, but the cost per call has doubled.

3. No model fallback logic. You default to Sonnet for everything because it performs better. You never switch to Haiku for the 80% of calls where Haiku would have been fine.


Pattern 1: Model Tiering (Use the Cheapest Model That's Good Enough)

This is the most impactful cost control you can add. Route calls to the cheapest model that can handle the task, with automatic escalation when confidence is low.

```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="ap-south-1")

# Model IDs vary by region and release; confirm against your Bedrock console.
HAIKU  = "anthropic.claude-haiku-4-5-20251001"
SONNET = "anthropic.claude-sonnet-4-6"

def invoke_with_tiering(prompt: str, require_confidence: bool = True) -> dict:
    """
    Always try Haiku first.
    If confidence score < threshold, escalate to Sonnet.
    Returns: {"result": str, "model_used": str, "escalated": bool}
    """

    haiku_prompt = f"""{prompt}

After your response, on a new line write exactly:
CONFIDENCE: <score between 0.0 and 1.0>"""

    haiku_response = invoke_bedrock(HAIKU, haiku_prompt)
    confidence     = extract_confidence(haiku_response)

    if not require_confidence or confidence >= 0.75:
        return {
            "result":     clean_response(haiku_response),
            "model_used": "haiku",
            "escalated":  False,
        }

    # Escalate to Sonnet
    sonnet_response = invoke_bedrock(SONNET, prompt)
    return {
        "result":     sonnet_response,
        "model_used": "sonnet",
        "escalated":  True,
    }


def invoke_bedrock(model_id: str, prompt: str) -> str:
    response = bedrock.invoke_model(
        modelId=model_id,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens":        1024,
            "messages": [{"role": "user", "content": prompt}]
        })
    )
    body = json.loads(response["body"].read())
    return body["content"][0]["text"]


def extract_confidence(text: str) -> float:
    for line in reversed(text.strip().split("\n")):
        if line.startswith("CONFIDENCE:"):
            try:
                return float(line.split(":")[1].strip())
            except ValueError:
                pass
    return 1.0  # No parseable score: assume high confidence, stay on Haiku


def clean_response(text: str) -> str:
    lines = text.strip().split("\n")
    return "\n".join(
        l for l in lines if not l.startswith("CONFIDENCE:")
    ).strip()
```

In practice, Haiku handles the majority of straightforward tasks when you classify by confidence. The cost difference between Haiku and Sonnet is roughly 12x per call, so even a 70/30 split produces significant savings at scale.
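To estimate what a given escalation rate does to your blended cost, here is a quick sketch. The pricing comes from the table in the first section; the token counts are illustrative:

```python
HAIKU_IN, HAIKU_OUT = 0.00025, 0.00125   # $ per 1K tokens, from the table above
SONNET_IN, SONNET_OUT = 0.003, 0.015

def blended_cost(haiku_share: float, in_toks: int = 1000, out_toks: int = 300) -> float:
    """Average cost per call for a Haiku-first tiering scheme.

    Escalated calls pay for the failed Haiku attempt *and* the Sonnet call.
    """
    haiku = in_toks / 1000 * HAIKU_IN + out_toks / 1000 * HAIKU_OUT
    sonnet = in_toks / 1000 * SONNET_IN + out_toks / 1000 * SONNET_OUT
    return haiku_share * haiku + (1 - haiku_share) * (haiku + sonnet)

# Compare a 70/30 split to sending everything straight to Sonnet
pure_sonnet = 1000 / 1000 * SONNET_IN + 300 / 1000 * SONNET_OUT
print(f"{1 - blended_cost(0.7) / pure_sonnet:.0%} saved")   # → 62% saved
```

Note the model: because escalated calls pay for both attempts, the savings are a little lower than a naive "30% at Sonnet prices" estimate suggests.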


Pattern 2: Token Counting Before You Invoke

Bedrock charges for tokens you send, not just tokens you receive. A prompt that accidentally includes a full document when it only needed a summary can cost 10x more than intended.

Count your tokens before invoking. If the prompt is above a threshold, truncate or summarize first.

```python
def estimate_tokens(text: str) -> int:
    """
    Rough estimate: ~4 characters per token for English text.
    Use this as a pre-flight check, not for billing accuracy.
    """
    return len(text) // 4


MAX_INPUT_TOKENS = 1500   # Your cost-control threshold
HARD_MAX_TOKENS  = 4000   # Bedrock model limit buffer


def safe_invoke(prompt: str, context: str = "") -> dict:
    full_prompt    = f"{prompt}\n\nContext:\n{context}" if context else prompt
    estimated_toks = estimate_tokens(full_prompt)

    if estimated_toks > HARD_MAX_TOKENS:
        # Truncate context, keep prompt intact (clamped at zero in case
        # the prompt alone exceeds the hard cap)
        max_context_chars = max(0, (HARD_MAX_TOKENS - estimate_tokens(prompt)) * 4)
        context           = context[:max_context_chars] + "... [truncated]"
        full_prompt       = f"{prompt}\n\nContext:\n{context}"
        estimated_toks    = estimate_tokens(full_prompt)

    if estimated_toks > MAX_INPUT_TOKENS:
        # Log a warning — this call is more expensive than expected
        print(f"[COST WARNING] Large prompt: ~{estimated_toks} tokens estimated")

    return invoke_with_tiering(full_prompt)
```

This catches the most common cause of unexpected cost spikes: context that grew over time without anyone noticing.


Pattern 3: Lambda-Level Rate Limiting with DynamoDB

Bedrock has service-level quotas, but they're per-account, not per-function. If you have multiple Lambda functions all calling Bedrock, one runaway function can exhaust your quota and spike your bill before the others even notice.

Add a lightweight rate limiter using DynamoDB atomic counters:

```python
import time

dynamodb    = boto3.resource("dynamodb", region_name="ap-south-1")
rate_table  = dynamodb.Table("bedrock-rate-limits")

MAX_CALLS_PER_MINUTE = 60   # Per function, per minute window


def check_rate_limit(function_name: str) -> bool:
    """
    Returns True if call is allowed, False if rate limit exceeded.
    Uses DynamoDB atomic increment + TTL for automatic window reset.
    """
    minute_key = f"{function_name}#{int(time.time() // 60)}"

    response = rate_table.update_item(
        Key={"rate_key": minute_key},
        UpdateExpression=(
            "SET call_count = if_not_exists(call_count, :zero) + :one, "
            "expiry_ttl = :ttl"
        ),
        ExpressionAttributeValues={
            ":zero": 0,
            ":one":  1,
            ":ttl":  int(time.time()) + 120,  # 2-minute TTL, auto-cleanup
        },
        ReturnValues="UPDATED_NEW"
    )

    count = int(response["Attributes"]["call_count"])
    return count <= MAX_CALLS_PER_MINUTE


def rate_limited_invoke(function_name: str, prompt: str) -> dict:
    if not check_rate_limit(function_name):
        raise Exception(
            f"Rate limit exceeded for {function_name}. "
            f"Max {MAX_CALLS_PER_MINUTE} Bedrock calls/minute."
        )
    return safe_invoke(prompt)
```

The DynamoDB TTL means the counter auto-resets every window. No cron, no cleanup Lambda. Cost for this table at moderate usage is under $1/month.
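One design decision the snippet above leaves implicit: what happens when the rate-limit check itself fails? For a cost control, failing open (allowing the Bedrock call when DynamoDB is unreachable) is usually the right trade, since an outage in the limiter shouldn't take the whole pipeline down. A sketch using a generic callable, so it works with any check function:

```python
from typing import Callable

def fail_open(check: Callable[[str], bool], function_name: str) -> bool:
    """Run a rate-limit check; if the check itself errors, allow the call.

    A DynamoDB outage should degrade to "no rate limiting",
    not "no Bedrock calls at all".
    """
    try:
        return check(function_name)
    except Exception as exc:   # narrow to botocore's ClientError in real code
        print(f"[RATE LIMIT] check failed, failing open: {exc}")
        return True
```

Wrap `check_rate_limit` with this in the caller. The opposite choice (fail closed) is defensible for strict budgets; the point is to make it an explicit decision.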


Pattern 4: CloudWatch Alarm on Bedrock Invocation Spend

All three patterns above are reactive at the code level. You also need a proactive alert before the bill hits.

Bedrock publishes InvocationCount and InputTokenCount metrics to CloudWatch. Set an alarm on invocation count as a leading indicator: it reacts within minutes, while billing alerts can lag by hours.

```hcl
# Terraform — alert when Bedrock invocations exceed threshold
resource "aws_cloudwatch_metric_alarm" "bedrock_invocation_spike" {
  alarm_name          = "bedrock-invocation-spike"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "InvocationCount"
  namespace           = "AWS/Bedrock"
  period              = 300          # 5-minute window
  statistic           = "Sum"
  threshold           = 500          # Adjust to your expected volume
  alarm_description   = "Bedrock invocations unusually high — check for runaway loops"

  dimensions = {
    ModelId = "anthropic.claude-haiku-4-5-20251001"
  }

  alarm_actions = [var.sns_alert_topic_arn]
}
```

Set the threshold at roughly 2x your expected peak volume. The alarm fires before cost becomes a problem, not after.
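If you don't manage infrastructure with Terraform, the same alarm can be created via boto3's `put_metric_alarm`. The parameters below mirror the Terraform alarm; the SNS topic ARN is a placeholder:

```python
def alarm_params(model_id: str, sns_topic_arn: str, threshold: int = 500) -> dict:
    """Build put_metric_alarm arguments mirroring the Terraform alarm above."""
    return {
        "AlarmName": "bedrock-invocation-spike",
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 2,
        "MetricName": "InvocationCount",
        "Namespace": "AWS/Bedrock",
        "Period": 300,                      # 5-minute window
        "Statistic": "Sum",
        "Threshold": float(threshold),
        "AlarmDescription": "Bedrock invocations unusually high",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "AlarmActions": [sns_topic_arn],
    }

# To apply (requires AWS credentials):
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params(
#     "anthropic.claude-haiku-4-5-20251001",
#     "arn:aws:sns:ap-south-1:123456789012:bedrock-alerts",  # placeholder ARN
# ))
```

Either route works; just make sure the alarm exists before the first production deploy, not after.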


Pattern 5: Disable Lambda Retries for Bedrock Callers

This one is often overlooked. By default, Lambda retries asynchronous invocations twice on failure. If your Bedrock call times out or returns a throttling error, Lambda will invoke your function two more times automatically, tripling the number of tokens consumed for that failure.

For Bedrock-calling Lambdas, set maximum retries to zero:

```hcl
# Note: bisect_batch_on_function_error applies only to Kinesis and
# DynamoDB stream event sources, not to SQS or S3 triggers.
resource "aws_lambda_event_source_mapping" "bedrock_processor" {
  # ... your stream trigger config
  bisect_batch_on_function_error = true
}

# Applies to asynchronous invocations (e.g. S3 event notifications)
resource "aws_lambda_function_event_invoke_config" "bedrock_caller" {
  function_name = aws_lambda_function.bedrock_processor.function_name

  maximum_retry_attempts = 0   # No automatic retries for Bedrock callers

  destination_config {
    on_failure {
      destination = aws_sqs_queue.bedrock_dlq.arn   # Failed events go to DLQ
    }
  }
}
```

Handle retries explicitly in your code with backoff logic, so you control when and how many times a Bedrock call is retried, rather than leaving it to Lambda's default behaviour.
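A minimal sketch of that explicit retry, with capped exponential backoff and jitter. It takes a zero-argument callable so it can wrap whatever invoke helper you use:

```python
import random
import time

def invoke_with_backoff(call, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a Bedrock invocation with capped exponential backoff + jitter.

    `call` is a zero-argument function wrapping the actual invoke.
    The cap means a persistent failure costs at most `max_attempts`
    invocations' worth of tokens, never an unbounded loop.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:   # narrow to ThrottlingException in real code
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            print(f"[RETRY] attempt {attempt} failed ({exc}); waiting {delay:.1f}s")
            time.sleep(delay)
```

Usage: `invoke_with_backoff(lambda: invoke_bedrock(HAIKU, prompt))`. The bare `except Exception` is deliberately broad for the sketch; in production, retry only throttling and transient errors, and send everything else straight to the DLQ.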


Putting It Together

A production-ready Bedrock caller in a serverless AI pipeline needs all five layers:

```
Request
  → rate_limited_invoke()          # Pattern 3: per-function rate limit
      → safe_invoke()              # Pattern 2: token count pre-flight
          → invoke_with_tiering()  # Pattern 1: Haiku first, Sonnet on escalation
              → CloudWatch alarm   # Pattern 4: spike detection
  Lambda retry = 0                 # Pattern 5: no automatic retry blowup
```

None of these are complex individually. The value is in having all five in place before you hit production traffic, not after the bill arrives.


Cost Reference: What This Saves

Assuming a pipeline processing 10,000 documents/day with an average 1,500 input tokens and 400 output tokens per call:

| Setup | Model mix | Daily cost | Monthly cost |
| --- | --- | --- | --- |
| All Sonnet, no controls | 100% Sonnet | ~$105 | ~$3,150 |
| Tiered | 80% Haiku / 20% Sonnet | ~$30 | ~$900 |
| Tiered + token control (avg 10% reduction) | Mixed | ~$27 | ~$810 |

The tiering alone is roughly a 70% cost reduction. Token control and rate limiting are the safety net that keeps the tiering from being undone by a bad day.
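These figures are easy to re-derive from the pricing table in the first section; your exact numbers will depend on call volume and token counts. A quick check (escalated calls pay for the Haiku attempt as well as the Sonnet call):

```python
def daily_cost(calls: int, in_toks: int, out_toks: int,
               in_per_1k: float, out_per_1k: float) -> float:
    """Daily spend for a fleet of identical calls at per-1K-token pricing."""
    per_call = in_toks / 1000 * in_per_1k + out_toks / 1000 * out_per_1k
    return calls * per_call

# 10,000 calls/day, 1,500 input / 400 output tokens, table rates
sonnet = daily_cost(10_000, 1500, 400, 0.003, 0.015)
haiku  = daily_cost(10_000, 1500, 400, 0.00025, 0.00125)
tiered = 0.8 * haiku + 0.2 * (haiku + sonnet)   # escalated calls pay both
print(f"all-Sonnet ${sonnet:.0f}/day, tiered ${tiered:.0f}/day")
# → all-Sonnet $105/day, tiered $30/day
```

Plug in your own volumes before trusting anyone's savings table, including this one.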


Final Thought

These five patterns are cheap to add and expensive to skip. The DynamoDB rate limiter costs under $1/month. The CloudWatch alarm is free under AWS free tier limits. The model tiering requires no infrastructure changes at all.


If you're running Bedrock in production and have hit a cost gotcha not covered here, drop it in the comments; it would be good to build out this list further.
