You've deployed the agent. It works. Now let's make sure it doesn't cost you a surprise at the end of the month.
This is the part that most tutorials skip. Real production systems need cost visibility before incidents — not after. Here's everything I've done to keep costs predictable and to save money where it counts.
## The cost components
An AgentCore deployment has several cost drivers:
| Component | Pricing model |
|---|---|
| Bedrock model invocations | Per token (input + output) |
| AgentCore Runtime | Per container-hour (when active) |
| AgentCore Memory | Per memory operation |
| ECR | Per GB stored + data transfer |
| CloudWatch Logs | Per GB ingested |
| S3 (if used) | Negligible for this setup |
The dominant cost is almost always Bedrock model invocations. Everything else is small by comparison.
## Prompt caching: the biggest lever
If you haven't read Part 3 carefully, go back and re-read the prompt caching section. It's the highest-impact optimisation in the system.
Quick recap: mark your system prompt block with `cache_control: {"type": "ephemeral"}`, and Bedrock caches those tokens, charging the much cheaper cache-read price on subsequent calls.
For Claude Sonnet 4.6:
- Cache write: $3.00 / 1M input tokens
- Cache read: $0.30 / 1M input tokens (10x cheaper)
- Output tokens: $15.00 / 1M output tokens (not cached)
For a 1,500-token system prompt:
| Scenario | Cost per turn |
|---|---|
| Without caching | $0.0045 (system prompt) + output tokens |
| With caching (turns 2+) | $0.00045 (system prompt) + output tokens |
| Saving per turn | ~$0.004 |
That sounds small. Scale it:
- 100 users × 10 conversations/day × 5 turns each = 5,000 turns/day
- 4,000 of those turns are turns 2+ (caching applies)
- Saving: 4,000 × $0.004 = $16/day → $480/month on system prompt tokens alone
The saving scales linearly with session depth and volume.
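A quick back-of-the-envelope check of that arithmetic, using the prices and traffic numbers above (adjust to your own volumes):

```python
# Prompt-caching savings, using the prices quoted above for Claude Sonnet 4.6.
CACHE_WRITE_PER_M = 3.00   # $/1M input tokens (uncached / cache write)
CACHE_READ_PER_M = 0.30    # $/1M input tokens (cache read)
PROMPT_TOKENS = 1_500      # system prompt size

saving_per_turn = PROMPT_TOKENS * (CACHE_WRITE_PER_M - CACHE_READ_PER_M) / 1_000_000

# 100 users x 10 conversations x 5 turns = 5,000 turns/day; 4,000 are turns 2+
cached_turns_per_day = 4_000
daily_saving = cached_turns_per_day * saving_per_turn
monthly_saving = daily_saving * 30

print(f"per turn: ${saving_per_turn:.5f}, "
      f"per day: ${daily_saving:.2f}, per month: ${monthly_saving:.0f}")
```

Running this gives roughly $0.004 per turn, ~$16/day, and ~$486/month, matching the figures above.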
Enable prompt caching:
```python
primary_model = BedrockModel(
    model_id="anthropic.claude-sonnet-4-6-20251001-v1:0",
    additional_request_fields={"anthropic_beta": ["prompt-caching-2024-07-31"]},
)

cached_system_prompt = [{"text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}]
```
## Model selection strategy
Not every task needs Claude Sonnet 4.6. Using the right model for each task type dramatically reduces costs.
| Task | Recommended model | Reason |
|---|---|---|
| Main conversation | Claude Sonnet 4.6 | Best reasoning, multi-turn, complex tool use |
| Intent classification | Amazon Nova Pro | Simple classification, roughly 4x cheaper per token |
| Session summarisation | Amazon Nova Pro | Structured output, no complex reasoning needed |
| FAQ matching | Amazon Nova Pro or embedding model | Simple retrieval pattern |
| Billing dispute analysis | Claude Sonnet 4.6 | Complex reasoning required |
Current pricing comparison (us-east-1):
| Model | Input ($/1M) | Output ($/1M) |
|---|---|---|
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Amazon Nova Pro | $0.80 | $3.20 |
| Amazon Nova Lite | $0.06 | $0.24 |
For a classification task that returns 1-2 tokens and processes 500 input tokens:
- Claude Sonnet 4.6: $0.0015 per call
- Amazon Nova Pro: $0.0004 per call
- Saving: ~75% just by routing to the right model
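The same comparison as a tiny helper. The prices come from the table above; the 500-input / 2-output token counts are the example's assumption:

```python
# Per-call cost comparison for a short classification task.
PRICES = {  # $ per 1M tokens: (input, output), from the pricing table above
    "claude-sonnet-4-6": (3.00, 15.00),
    "nova-pro": (0.80, 3.20),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

claude = call_cost("claude-sonnet-4-6", 500, 2)
nova = call_cost("nova-pro", 500, 2)
print(f"Claude: ${claude:.5f}  Nova Pro: ${nova:.5f}  saving: {1 - nova / claude:.0%}")
```

With output tokens included, the saving works out to about 73%, close enough to the ~75% quoted above.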
In `agent.py`, the Nova model is available alongside the primary model:

```python
nova_model = BedrockModel(
    model_id="amazon.nova-pro-v1:0",
    boto_config=boto_config,
)
```
Use it when you need a cheap background task before or after the main conversation.
AgentCore lifecycle configuration
AgentCore has two lifecycle settings that affect cost:
**Idle timeout** (`IdleTimeoutInSeconds`): how long AgentCore waits after the last request before pausing a container instance. Set in the CDK stack:
```typescript
LifecycleConfiguration: {
  IdleTimeoutInSeconds: 900,           // 15 minutes
  MaxSessionDurationInSeconds: 28800,  // 8 hours
}
```
- Lower idle timeout = containers paused sooner = lower cost for bursty workloads
- Higher idle timeout = containers stay warm longer = better latency for returning users
- The sweet spot depends on your session gap pattern. 15 minutes is a reasonable default.
**Max session duration** (`MaxSessionDurationInSeconds`): the hard limit per session. Eight hours is appropriate for a long-running assistant; for short transactional interactions you could reduce it.
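To get a feel for the idle-timeout tradeoff, here's a rough sketch of how it affects active container time. It assumes sessions don't overlap (each keeps its own instance warm), so treat it as an upper bound, and the per-hour runtime rate is deliberately left out since it changes; check current AgentCore pricing:

```python
# Rough upper bound on daily active container-hours for a given idle timeout.
# Assumes non-overlapping sessions, each keeping an instance warm for its
# duration plus the idle window.

def active_hours_per_day(sessions: int, avg_session_min: float, idle_timeout_s: int) -> float:
    per_session_min = avg_session_min + idle_timeout_s / 60
    return sessions * per_session_min / 60

five_min = active_hours_per_day(100, 5, 300)     # aggressive 5-minute timeout
fifteen_min = active_hours_per_day(100, 5, 900)  # the 15-minute default
print(f"5-min idle: {five_min:.1f} h/day, 15-min idle: {fifteen_min:.1f} h/day")
```

For this hypothetical workload the 15-minute default roughly doubles warm time versus a 5-minute timeout, which is why the sweet spot depends on your session gap pattern.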
## CloudFront PriceClass_100
For the blog/portfolio site, `PriceClass.PRICE_CLASS_100` restricts the CloudFront distribution to North American and European edge locations, which cuts CloudFront cost roughly in half compared with the global price class.
For a personal portfolio with mostly English-speaking traffic, the vast majority of users are in the US and Europe anyway.
```typescript
// infra/lib/hosting-stack.ts
priceClass: cloudfront.PriceClass.PRICE_CLASS_100,
```
For the AgentCore endpoint itself, there's no CloudFront in front — AgentCore is a regional service.
## CloudWatch alarms: catch runaway costs before they hit your bill
Two alarms are critical for an AgentCore deployment.
### Alarm 1: OutputTokenCount spike
An agentic loop that gets stuck (tool keeps failing, model keeps retrying) can generate thousands of output tokens per minute. This alarm fires when output tokens per 5 minutes exceed a threshold:
```typescript
new cloudwatch.Alarm(this, 'OutputTokenAlarm', {
  alarmName: `customerServiceAgent-OutputTokenCount-dev`,
  metric: new cloudwatch.Metric({
    namespace: 'AWS/Bedrock',
    metricName: 'OutputTokenCount',
    dimensionsMap: { ModelId: 'anthropic.claude-sonnet-4-6-20251001-v1:0' },
    statistic: 'Sum',
    period: cdk.Duration.minutes(5),
  }),
  threshold: 50_000, // Tune to your expected usage
  evaluationPeriods: 2,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});
```
Set the threshold to 2-3x your normal peak. Monitor for a week after launch to establish a baseline, then tune.
### Alarm 2: InvocationLatency P99
High P99 latency indicates your agent is taking too long — possibly waiting on a tool timeout, or the model is iterating excessively:
```typescript
new cloudwatch.Alarm(this, 'LatencyAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/Bedrock',
    metricName: 'InvocationLatency',
    statistic: 'p99',
    period: cdk.Duration.minutes(5),
  }),
  threshold: 30_000, // 30 seconds
  evaluationPeriods: 3,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
});
```
Both alarms publish to the SNS topic (also in the CDK stack), which sends you an email. For production, replace email with a PagerDuty or Slack notification via SNS → Lambda → webhook.
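A minimal sketch of the Lambda in that SNS → Lambda → webhook hop. The `SLACK_WEBHOOK_URL` environment variable and the `format_alarm` helper are illustrative (not from the repo); the event shape is the standard CloudWatch-alarm-over-SNS payload:

```python
import json
import os
import urllib.request

def format_alarm(sns_message: dict) -> str:
    """Turn a CloudWatch alarm notification (delivered via SNS) into one line."""
    return (f"[{sns_message['NewStateValue']}] {sns_message['AlarmName']}: "
            f"{sns_message['NewStateReason']}")

def handler(event, context):
    # SNS delivers the alarm as a JSON string in Records[0].Sns.Message
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    payload = json.dumps({"text": format_alarm(message)}).encode()
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # set as a Lambda environment variable
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Slack's incoming webhooks accept a simple `{"text": ...}` body, so no extra dependencies are needed; for PagerDuty you'd swap the payload for its Events API format.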
## Actual cost estimates
For a moderately used customer service agent at ~500 conversations/day, 5 turns each:
| Component | Monthly estimate |
|---|---|
| Bedrock (Claude Sonnet 4.6, with caching) | $120-180 |
| Bedrock (Nova Pro for classification) | $5-10 |
| AgentCore Runtime | $15-30 (depends on idle config) |
| AgentCore Memory operations | $5-10 |
| ECR storage | $1-2 |
| CloudWatch Logs | $3-5 |
| Total | ~$150-240/month |
Without prompt caching: add ~$60-80/month to the Bedrock line.
Without the dual-model strategy (Claude Sonnet 4.6 for everything): add ~$20-30/month to the Bedrock line.
These numbers will vary significantly based on your conversation length and output token counts. The alarms will tell you when something is outside the expected range.
## Quick optimisation checklist
Before going to production:
- [ ] Prompt caching enabled (`anthropic_beta: ["prompt-caching-2024-07-31"]`)
- [ ] System prompt marked with `cache_control: ephemeral`
- [ ] Nova Pro used for background tasks (not Claude for everything)
- [ ] Idle timeout set appropriately (900s is a good default)
- [ ] OutputTokenCount alarm configured and tested
- [ ] InvocationLatency alarm configured and tested
- [ ] SNS topic with email subscription (or PagerDuty) set up
- [ ] CloudFront PriceClass_100 set (blog site)
- [ ] Model invocation logging enabled (for debugging cost spikes)
## Wrapping up the series
Over 6 parts, we built a complete production AI agent on AWS:
- Part 1: Why AgentCore — the Lambda limitations and what AgentCore solves
- Part 2: CDK infrastructure — the full stack + 9 gotchas documented
- Part 3: The Python agent — Strands SDK, prompt caching, AgentCore Memory
- Part 4: Local dev loop — Docker, platform flags, .env pattern
- Part 5: CI/CD — GitHub Actions OIDC, ECR dual-tag strategy, Runtime updates
- Part 6 (this post): Cost and performance — prompt caching savings, model selection, alarms
The full demo repo is at github.com/rajmurugan01/bedrock-agentcore-starter. Every pattern in this series maps to real code in that repo.
If this series saved you some debugging time (or a surprise AWS bill), star the repo and share it. If I got something wrong or you've found a better pattern, open an issue — I'll update the posts.
← Back to Part 5: CI/CD with GitHub Actions OIDC
Originally published at rajmurugan.com. This is Part 6 of the Ultimate Guide to Building AI Agents on AWS with Bedrock AgentCore series.