Mukunda Rao Katta

Posted on May 25

Why Your Bedrock Cost Estimate Was Wrong: Cross-Region Inference Multipliers

#hermeschallenge #ai #rust #agents

The wrong cost estimate

We ran a model comparison. Four models, all hosted on AWS Bedrock, all accessed through cross-region inference profiles. The task was to pick the most cost-effective model for a specific classification workload. We measured accuracy, latency, and cost per 1,000 requests.

The cost column was wrong.

We had used the same per-token rate for all four models. One rate for input tokens, one for output tokens, applied uniformly. The problem: cross-region inference profiles do not have uniform multipliers across vendors. Anthropic's cross-region multiplier on Bedrock is different from Llama's. Mistral's is different again. And the multiplier itself varies by region tier: us. prefix, eu. prefix, and ap. prefix each have their own factor.

By the time we caught it, we had already written the cost comparison into a report. The model we recommended was the second cheapest, not the cheapest. The cheapest was a Llama model where we had over-estimated the cross-region cost.

That mistake prompted bedrock-cost.

The shape of the fix

use bedrock_cost::{estimate, ModelId, Tokens};

// On-demand pricing
let cost = estimate(
    ModelId::parse("anthropic.claude-sonnet-4-5-v1:0")?,
    Tokens { input: 1_500, output: 400 },
)?;
println!("Input: ${:.6}", cost.input_usd);
println!("Output: ${:.6}", cost.output_usd);
println!("Total: ${:.6}", cost.total_usd);

// Cross-region inference profile (us. prefix)
let cost = estimate(
    ModelId::parse("us.anthropic.claude-sonnet-4-5-v1:0")?,
    Tokens { input: 1_500, output: 400 },
)?;
// Applies the Anthropic cross-region multiplier for the us. tier

// Llama on cross-region, different multiplier
let cost = estimate(
    ModelId::parse("us.meta.llama3-70b-instruct-v1:0")?,
    Tokens { input: 1_500, output: 400 },
)?;

The ModelId::parse function extracts the region prefix (if any) and the vendor from the model string. From those two pieces it selects the right base rate and the right multiplier.

What this is NOT

This is not a billing API wrapper. It does not call AWS Cost Explorer or the Bedrock pricing API. The rates are embedded in the crate as a static table, current as of the crate publish date. For production billing reconciliation, use the actual AWS bill. This crate is for pre-flight cost estimation and model comparison.

This is not a token counter. You supply the token counts. If you need to estimate token counts before making the call, pair this with a token counter crate. bedrock-cost just does the arithmetic.

This is not multi-turn aware. The estimate is for a single request. Multi-turn conversations accumulate cost across turns because the full history is re-submitted each turn. Track cumulative cost across turns in your own loop.

Inside the lib

The pricing table structure keeps things compact. Instead of a flat table of every (model, region) pair, the approach is:

One base on-demand rate per model (input USD per 1M tokens, output USD per 1M tokens).
One cross-region multiplier per (vendor, region_tier) combination.

struct BaseRate {
    input_per_1m: f64,
    output_per_1m: f64,
}

struct CrossRegionMultipliers {
    us: f64,
    eu: f64,
    ap: f64,
}

For a us.anthropic.claude-sonnet-4-5-v1:0 request:

Strip the us. prefix, look up anthropic.claude-sonnet-4-5-v1:0 base rate.
Look up Anthropic's us tier multiplier.
Multiply base rate by multiplier.
Apply token counts.

This keeps the table size proportional to num_models + num_vendor_region_combinations, not num_models * num_regions. For six vendors and three region tiers, that is 18 multiplier entries instead of hundreds of model-region pairs.

The ModelId parser handles the common model string formats found in Bedrock:

let id = ModelId::parse("eu.mistral.mistral-large-2402-v1:0")?;
assert_eq!(id.region_tier, Some(RegionTier::EU));
assert_eq!(id.vendor, Vendor::Mistral);

Unknown model IDs return a CostError::UnknownModel rather than guessing. Better to fail loudly than silently apply the wrong rate.

When this is useful

Model selection and cost comparison. Before committing to a model for a production workload, run a sample batch through each candidate, measure actual token counts, and use bedrock-cost to compute the full cost picture including region tier.

Agent loop budget tracking. At the end of each turn, compute the turn cost and add it to a running total. Stop the loop when the total exceeds a configured budget. This requires your token usage from the API response, which Bedrock includes in the response metadata.

let usage = response.usage;  // from Bedrock response
let turn_cost = estimate(
    model_id.clone(),
    Tokens { input: usage.input_tokens, output: usage.output_tokens },
)?;
cumulative_cost += turn_cost.total_usd;
if cumulative_cost > budget_usd {
    return Err(BudgetExceeded { spent: cumulative_cost });
}

Multi-region deployment cost modeling. If you are deciding whether to serve a workload from us. vs eu. profiles (for latency reasons), this gives you the cost delta per request to factor into the tradeoff.

When NOT to use this

When you need real-time pricing. AWS changes Bedrock pricing periodically. The embedded table is a snapshot. For billing-critical applications, pull the current rates from the AWS pricing API and do your own math.

When you are using provisioned throughput. Provisioned throughput billing is hourly, not per-token. The on-demand and cross-region rates in this crate do not apply to provisioned throughput purchases.

When you are using a model not yet in the table. The crate covers Anthropic, Meta Llama, Mistral, Cohere, Amazon Titan, and AI21 as of the ship date. New models added to Bedrock after that date will return UnknownModel. Check the GitHub issue tracker for update PRs.

Install

[dependencies]
bedrock-cost = "0.1"

GitHub: MukundaKatta/bedrock-cost

crates.io: bedrock-cost

Siblings

Crate	What it does
claude-cost	Cache-aware cost for Anthropic API (direct, not Bedrock)
token-budget-pool	Thread-safe shared USD/token cap across concurrent agents
llm-budget-window	Sliding time-window budget (per minute/hour/day)
llm-cost-cap	Pre-flight cost gate: estimate and reject before the API call

What is next

The most requested addition is a compare function that takes a list of model IDs and a token pair and returns them sorted by cost. Right now you call estimate in a loop and sort yourself. A convenience function would make the model comparison workflow cleaner.

Rate freshness is the bigger problem. The embedded table will drift from actual AWS pricing over time. A bedrock-cost-fetch companion crate that pulls live pricing from the AWS pricing API and serializes it to the same format as the embedded table is worth building. The base crate stays zero-dependency and fast; the fetch crate handles the network call and can regenerate the embedded table.

For now, the static table is accurate for the models and region tiers that existed at ship date, and it covers the calculation that was wrong in the comparison that created this crate.

DEV Community