This article walks through a project I built on Amazon Bedrock AgentCore: an agent that turns campaign briefings into ranked email subject lines, and improves across sessions as it learns from the user.
The goal here isn't to cover every AgentCore primitive, but to show how a few of them (Runtime and Memory) fit together in a real loop, and to be honest about which ones I deliberately left out. The project itself is intentionally small. The interesting part is the architecture around it: where scoring runs, why the optimization loop stays framework-agnostic, what memory actually stores and which tradeoffs come with each decision.
*You can access the project on GitHub.*
Contents
- Agents-to-AgentCore Evolution
- The use case: briefing in, subject lines out
- Drafting a solution
- I. The entrypoint (main.py)
- II. The imperative shell (agent/builder.py)
- III. The functional core (agent/iteration.py)
- IV. Scoring
- V. The two agents and Bedrock
- VI. AgentCore Memory: four strategies
- VII. Observability
- VIII. What I didn't use this time
- IX. How to deploy and run it
- Closing thoughts
Agents-to-AgentCore Evolution
Bedrock was launched in 2023 as AWS's response to the rapid growth in foundation model use: a single API to call models from providers such as Anthropic, AI21 Labs and Cohere, as well as Amazon's own Titan family, without managing inference infrastructure. Later that year AWS added Bedrock Agents, a configuration-driven product that bolted tool-calling, knowledge bases, and memory onto a Bedrock model.
It works for many cases, but it is a closed product: the ecosystem revolves around Lambda-based tool execution, retrieval has to go through Knowledge Bases, models have to be Bedrock-hosted, and you can't see or control how the agent decides what to do at each step. For more ambitious use cases, teams ended up bypassing Bedrock Agents and writing their own harness on EC2 or Lambda, which meant every team rebuilt the same plumbing: session management, sandboxing, identity, memory and observability.
AgentCore, announced in 2025, was AWS's answer to that pattern. Instead of a single "agent product," it broke the harness apart into composable services and made them framework-agnostic, so you can bring Strands, LangGraph, CrewAI or anything else. April 2026 added the managed Harness, which closed the loop: Harness offers the same easy, configuration-based approach as Bedrock Agents, but it runs on the AgentCore platform and lets you switch to code when you need more control.
AWS continues to maintain both Bedrock Agents and AgentCore in parallel. Bedrock Agents remains available for teams that already use it or prefer a fully managed, configuration-only approach, while AgentCore is positioned as the path forward for new projects that need flexibility, framework choice or production-grade infrastructure.
The use case: briefing in, subject lines out
Email subject lines have an outsized effect on open rates and they're often the only impression a campaign makes. Marketers who have the volume and the time for A/B testing can ship two or more variants and let the data decide. Many marketers don't, so they write a subject line, second-guess it and hit send.
Imagine that you run email marketing for a specialty coffee brand. You open the optimizer, fill in the briefing and set a few constraints: nothing longer than 55 characters, no discount language, no emojis. You hit optimize and watch the rounds come in. Round one produces 8 candidates covering the full stylistic range, from urgency-led to curiosity-led to plain and direct. The scorer immediately tells you which ones carry spam risk, which ones are the right length, which ones align with a retention audience. The weakest three get dropped and the Critic explains why. Round two regenerates those slots with that guidance in mind. By round three the scores have converged and you have a ranked shortlist of five, each with a predicted open-rate band and a breakdown of what drove the score.
Next week you run a cross-sell campaign for the same brand. The briefing is different but the session ID carries your name. Before generating a single candidate, the optimizer reads what it learned from your prior sessions: urgency-led lines consistently underperformed for this brand; premium and exclusivity framing reached the shortlist every time. Round one already looks different from what a first-time user would see.
Input: a campaign briefing
A generic JSON brief with the objective, audience, offer, brand voice and constraints. The kind of structure you'd find in any agency template, nothing platform-specific.
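For illustration, a minimal briefing might look like this (the field names are illustrative, not the repo's exact schema; the project's examples directory has the real ones):

{
  "campaign_name": "Q3 Lapsed Customer Reactivation",
  "objective": "reactivation",
  "audience": "customers with no purchase in the last 120 days",
  "offer": "25% off the next order, valid 14 days",
  "brand_voice": "warm, premium, no hard sell",
  "constraints": {
    "max_length": 55,
    "avoid_words": ["act now", "last chance"],
    "no_emojis": true
  }
}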
Output: a ranked shortlist of 5
Each variant comes with predicted open-rate range, the dimensions where it scored highest and any flagged risks (spam triggers, length penalty, audience mismatch). The user can ask for follow-ups in the same session ("give me shorter versions of #2 and #4") and the agent refines while preserving what made those variants score well.
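The shape of one shortlist entry, sketched as JSON (field names are illustrative; the real serialization lives in main.py's _serialize_result):

{
  "subject_line": "Claim your 25% loyalty reward - 14 days to save",
  "score": 91.2,
  "predicted_open_rate_band": "42.0-50.0%",
  "top_dimensions": ["length", "value_signals", "brand_voice"],
  "risks": []
}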
Drafting a solution
Three tiers, top to bottom.
The caller sends a campaign briefing JSON and gets a streamed response back. That exchange happens through a single @app.entrypoint function inside AgentCore Runtime, a managed AWS service that handles the HTTP transport, session lifecycle, and streaming framing so the agent code doesn't have to.
Inside the Runtime, the architecture splits into two layers. The imperative shell (agent/builder.py) owns everything that touches a framework: two Strands agents (Generator and Critic), an in-process scorer, and a memory recall helper. It wires these into four plain Python callables and injects them into the functional core (agent/iteration.py). The core runs the generate/score/critique/regenerate loop and returns a ranked shortlist.
At the bottom, two managed AWS services sit off the request path. Bedrock serves every LLM call via Strands' BedrockModel. AgentCore Memory receives session events automatically from the Strands session manager, and returns extracted patterns when the loop asks for them at the start of each run. The async strategy extraction that makes cross-session learning possible runs roughly 60 seconds after each session ends.
I. The entrypoint (main.py)
The entrypoint has one job: receive a campaign briefing, run the optimization loop and stream results back as they arrive.
AgentCore Runtime gives you this as a single decorator. You don't write a router, configure middleware or manage a server process; instead, you hand it an async generator and it handles the rest: HTTP transport, session lifecycle, streaming framing, session_id and user_id extraction from request headers.
app = BedrockAgentCoreApp()

@app.entrypoint
async def invoke(payload, context):
    session_id = getattr(context, "session_id", None) or "default-session"
    user_id = getattr(context, "user_id", None) or "default-user"

    try:
        briefing = validate_briefing(payload.get("prompt") or "")
    except (ValueError, json.JSONDecodeError) as exc:
        yield f"Invalid briefing: {exc}\n"
        return

    result = run_initial_optimization(briefing, session_id, user_id)

    for round_log in result.rounds:
        for line in _format_round_lines(round_log):
            yield line + "\n"

    yield _format_shortlist(result) + "\n"
    yield json.dumps(_serialize_result(result), ensure_ascii=False)
Every yield sends a chunk to the caller immediately. This matters because an optimization run takes from 30 to 90 seconds and the caller sees each round's scores as they complete, not a blank screen followed by a wall of text.
Refinement works by re-submitting a modified briefing with the same session_id: a tighter length constraint, a different brand voice or added avoid-words, for example. AgentCore Memory carries learned patterns from prior sessions forward automatically; the entrypoint doesn't need to know about that.
II. The imperative shell (agent/builder.py)
The shell is the wiring layer. It knows about Strands, AgentCore and Bedrock while the functional core doesn't (it receives callables). The shell is what turns those framework dependencies into plain Python functions the core can call without importing anything.
Four things live here: a generator agent, a critic agent, an in-process scorer and a memory recall helper. At the start of each optimization run, all four get injected into the loop as callables.
def run_initial_optimization(briefing, session_id, user_id):
    generator = _make_agent(session_id, user_id, GENERATOR_SYSTEM_PROMPT)
    critic = _make_agent(session_id, user_id, CRITIQUE_SYSTEM_PROMPT)

    def generate(prompt, _n):
        return _strip_json_array(str(generator(prompt)))

    def critique(scored, to_drop):
        return str(critic(build_critique_prompt(scored, to_drop))).strip()

    return run_optimization(
        briefing,
        generate=generate,
        score=score_candidates,
        critique=critique,
        recall=recall_for_user,
        on_round=_emit_round_telemetry,
        actor_id=user_id,
        round_one_prompt_builder=round_one_prompt,
        regenerate_prompt_builder=regenerate_prompt,
    )
The agents
Generator and Critic are separate Strands Agent instances with separate system prompts. The generator is told to produce strict JSON arrays of strings and nothing else. The critic is told to produce two to four sentences of explicit, actionable guidance.
Both share the same AgentCoreMemorySessionManager, so they write to the same session namespace and see the same conversation history.
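A sketch of what _make_agent could look like, assuming Strands' Agent and BedrockModel; the model id is illustrative and _make_session_manager is a hypothetical stand-in for the AgentCore memory session manager wiring:

from strands import Agent
from strands.models import BedrockModel

# Illustrative model id; the project points at a cross-region inference profile.
_model = BedrockModel(model_id="us.amazon.nova-lite-v1:0")

def _make_agent(session_id: str, user_id: str, system_prompt: str) -> Agent:
    # Both agents receive the same AgentCore memory session manager, keyed by
    # session_id and user_id, so every turn lands in the same session namespace.
    session_manager = _make_session_manager(session_id, user_id)  # hypothetical helper
    return Agent(
        model=_model,
        system_prompt=system_prompt,
        session_manager=session_manager,
    )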
Scoring stays in-process
score_candidates is not an agent and not a service call; it's a direct import of score_subject_line from scoring/score.py, with the heuristic rules loaded once. No network, no latency, no failure mode beyond a bad CSV row.
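A minimal sketch of that wiring; load_rules and the exact score_subject_line signature are assumptions here, the point is simply that everything happens in-process:

from functools import lru_cache
from scoring.score import load_rules, score_subject_line

@lru_cache(maxsize=1)
def _rules():
    # Parse the heuristics CSV once per process; nothing to retry, nothing to time out.
    return load_rules()

def score_candidates(candidates, briefing):
    # One dict per candidate: composite score, open-rate band, per-dimension breakdown.
    return [score_subject_line(line, briefing, _rules()) for line in candidates]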
The observability hook
on_round=_emit_round_telemetry is the last injection. After each completed round, the loop calls it with a RoundLog. The shell opens an OpenTelemetry span, records seven attributes (round number, candidate count, pruned count, top-3 average, top score, top subject line, guidance excerpt), closes the span and emits a structured log line.
III. The functional core (agent/iteration.py)
Keeping the loop free of framework imports means it can be read, tested and reasoned about without knowing anything about the surrounding infrastructure. You can swap the LLM, replace the scoring backend or drop AgentCore entirely and the loop doesn't change. The entire contract is in the signature:
def run_optimization(
    briefing: Any,
    generate: Callable[[str, int], list[str]],
    score: Callable[[list[str], Any], list[dict]],
    critique: Callable[[list[dict], list[dict]], str],
    recall: Callable[[str, Any], MemoryContext] | None = None,
    on_round: Callable[[RoundLog], None] | None = None,
    actor_id: str = "anonymous",
    max_rounds: int = 4,
    plateau_epsilon: float = 0.5,
    *,
    round_one_prompt_builder: ...,
    regenerate_prompt_builder: ...,
) -> IterationResult:
Round one generates eight candidates across eight archetypes and scores them all. Each subsequent round generates only the slots vacated by pruning. New and surviving candidates are merged, deduped by subject line and sorted by score.
After sorting, the loop checks whether the top-3 average improved by at least plateau_epsilon points over the prior round. If not, the scores have converged and the loop stops early to avoid wasting LLM calls.
If there's still room to improve and rounds remain, the bottom 40% are pruned and handed to the critic. The critic explains what patterns made them lose in two to four sentences, referencing specific signals. That guidance feeds into the next round's regeneration prompt.
After at most four rounds, the loop returns the top five candidates with the full per-round log and the memory context read at the start.
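Condensed, the loop body looks roughly like this; it reuses the parameter names from the signature above, approximates the prompt-builder signatures and skips the RoundLog bookkeeping and the on_round callback:

memory_context = recall(actor_id, briefing) if recall else None
prompt = round_one_prompt_builder(briefing, memory_context)
candidates = score(generate(prompt, 8), briefing)
prev_top3 = None

for round_number in range(1, max_rounds + 1):
    # Dedupe by subject line, then rank best-first.
    unique = {c["subject_line"]: c for c in candidates}
    ranked = sorted(unique.values(), key=lambda c: c["score"], reverse=True)

    # Plateau check: stop early when the top-3 average stops improving.
    top3_avg = sum(c["score"] for c in ranked[:3]) / min(3, len(ranked))
    if prev_top3 is not None and top3_avg - prev_top3 < plateau_epsilon:
        break
    prev_top3 = top3_avg
    if round_number == max_rounds:
        break

    # Prune the bottom 40%, ask the Critic why those lost, regenerate only the vacated slots.
    cut = max(1, int(len(ranked) * 0.4))
    survivors, pruned = ranked[:-cut], ranked[-cut:]
    guidance = critique(ranked, pruned)
    prompt = regenerate_prompt_builder(
        briefing, memory_context,
        [c["subject_line"] for c in survivors], guidance, cut,
    )
    candidates = survivors + score(generate(prompt, cut), briefing)

# The real implementation wraps ranked[:5], the per-round logs and the
# memory context in an IterationResult.
shortlist = ranked[:5]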
IV. Scoring
Every candidate produced by the generator gets a score before the loop decides what to keep and what to prune. The scorer runs 45 heuristics against each subject line and returns a composite score between 0 and 100, a predicted open-rate band, per-dimension breakdowns and an explanation of the top contributions. The heuristics are stored in a simple CSV and cover nine dimensions: length, urgency, spam risk, curiosity triggers, value signals, personalization, style, audience fit and brand voice.
A heuristic is not a rule. A rule is binary: pass or fail, always. A heuristic is a signal that contributes positively or negatively to a score based on what the literature says tends to correlate with open rates. "Subject lines between 30 and 50 characters score better" is an empirical observation, not a constraint.
The same urgency words that lift acquisition open rates by +1.0 point hurt retention campaigns by -2.0 and are inappropriate for regulatory notices at -5.0. The audience_modifier column captures that context-dependence per rule; audience tags are inferred automatically from briefing free-text.
rule_id,category,pattern,match_type,weight,audience_modifier
LEN_SWEET_SPOT_30_50,length,30-50,range,8.0,
URGENCY_BASE_LIFT,urgency,urgent|hurry|now|today,word_any,3.5,acquisition:+1.0;retention:-2.0;regulatory:-5.0
SPAM_TRIGGER_FREE,spam_risk,free|100% free,word_any,-6.0,
VALUE_FREE_SHIPPING,value_signals,free shipping,phrase,4.0,acquisition:+1.0
These weights are starting points; a production system would more likely use a model trained on the company's own send history.
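As a sketch of how the audience modifier plays out, the function below applies one row's weight with the per-audience adjustment parsed from that column (the function name and parsing are mine, simplified from what the repo's loader would need to do):

def audience_adjusted_weight(rule: dict, audience_tags: set[str]) -> float:
    """Base weight plus any per-audience adjustment from the audience_modifier column."""
    weight = float(rule["weight"])
    modifiers = rule.get("audience_modifier") or ""
    # e.g. "acquisition:+1.0;retention:-2.0;regulatory:-5.0"
    for pair in filter(None, modifiers.split(";")):
        tag, delta = pair.split(":")
        if tag in audience_tags:
            weight += float(delta)
    return weight

# URGENCY_BASE_LIFT is worth 3.5 + 1.0 for an acquisition campaign,
# but 3.5 - 2.0 for a retention audience.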
V. The two agents and Bedrock
The Generator and Critic share a model and a session manager but have nothing else in common.
Both agents are Strands Agent instances backed by Amazon Nova 2 Lite, invoked via Bedrock's cross-region inference profile. Both are wired to AgentCoreMemorySessionManager, which writes each turn as a session event automatically. Neither agent needs to know about memory because the session manager handles it transparently.
The Generator enforces its output shape through Strands' structured output support rather than prompt engineering. Instead of instructing the model to "output strict JSON arrays only" and then parsing whatever it produces, you pass a Pydantic schema to the agent call and Bedrock enforces it at the model level:
class SubjectLineList(BaseModel):
    subject_lines: list[str] = Field(
        description="Email subject line candidates, one per requested slot."
    )

def generate(prompt: str, _n: int) -> list[str]:
    response = generator(prompt, structured_output_model=SubjectLineList)
    return response.structured_output.subject_lines
The Critic is told to produce two to four sentences of explicit, actionable guidance. No generic advice: it must reference specific patterns in the candidates it's reviewing. "These lines are too long" is not useful guidance for regeneration. "The urgency phrasing in candidates 3 and 5 reads as promotional spam rather than genuine time pressure. Try anchoring to a specific benefit instead" is.
That guidance becomes part of the next round's prompt directly:
prompt = regenerate_prompt_builder(
    briefing,
    memory_context,
    [c["subject_line"] for c in survivors],
    guidance,
    n_to_generate,
)
The surviving candidates and the Critic's diagnosis arrive together. The Generator sees what worked, what didn't and why.
VI. AgentCore Memory: four strategies
Memory touches this project in two distinct ways. The Strands session manager writes conversation events automatically and every Generator and Critic turn lands in the session namespace without any code in the agent to make that happen. Separately, recall_for_user reads extracted patterns explicitly at the start of each optimization run, before any candidate is generated.
Memory also works on two timescales:
Short-term: Within a session, the Strands session manager keeps the full conversation in context, making sure the Critic sees every prior round and the Generator sees every prior critique.
Long-term: Across sessions, AgentCore's four extraction strategies run asynchronously, roughly 60 seconds after a session ends, and populate long-term namespaces:
| Strategy | Description | Namespace |
|---|---|---|
| SEMANTIC | General facts inferred from session content | /users/{actor_id}/facts |
| USER_PREFERENCE | Behavioral patterns inferred from what consistently scored well or was pruned | /users/{actor_id}/preferences |
| SUMMARIZATION | A compressed summary of the session | /summaries/{actor_id}/{session_id} |
| EPISODIC | A record of what happened | /episodes/{actor_id}/{session_id} |
The agent never writes subject lines or briefings directly to long-term storage. It writes session events and the strategies decide what is worth keeping and in what form.
What the loop reads
At the start of each run, recall_for_user queries the facts and preferences namespaces using the briefing as the retrieval query. It returns up to five patterns per namespace, ranked by relevance score. Those patterns flow into the round-one generation prompt:
Patterns observed across this user's prior sessions:
- Urgency-led subject lines were pruned in 3 of 4 prior sessions
- Premium and exclusivity framing consistently reached the final shortlist
Inferred preferences for this user:
- Discount and price-led language does not align with observed brand voice
- Sentence case outperformed title case across recent campaigns
The Generator sees what worked and what didn't before producing a single candidate.
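A sketch of recall_for_user; briefing_to_query and retrieve are hypothetical helpers standing in for the actual AgentCore Memory retrieval call, and the MemoryContext fields are assumed:

FACTS_NS = "/users/{actor_id}/facts"
PREFS_NS = "/users/{actor_id}/preferences"

def recall_for_user(actor_id, briefing):
    query = briefing_to_query(briefing)  # hypothetical: flattens the briefing into search text
    facts = retrieve(namespace=FACTS_NS.format(actor_id=actor_id), query=query, top_k=5)
    prefs = retrieve(namespace=PREFS_NS.format(actor_id=actor_id), query=query, top_k=5)
    # Both lists come back ranked by relevance; the round-one prompt builder
    # renders them as the "Patterns observed..." block shown above.
    return MemoryContext(facts=facts, preferences=prefs)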
VII. Observability
The optimization loop runs for 30 to 90 seconds across up to four rounds. Without instrumentation, a slow run and a failing run look identical from the outside.
After each completed round, the loop calls an on_round callback with a RoundLog. The shell's implementation emits two concurrent signals:
def _emit_round_telemetry(round_log: RoundLog) -> None:
    # candidates, top3_avg, top_score and top_subject are derived from
    # round_log's scored candidates earlier in the function (elided here).
    span = _tracer.start_span("optimization_round")
    try:
        span.set_attribute("round.number", round_log.round_number)
        span.set_attribute("round.candidate_count", len(candidates))
        span.set_attribute("round.pruned_count", len(round_log.pruned))
        span.set_attribute("round.top3_average", round(top3_avg, 2))
        span.set_attribute("round.top_score", round(top_score, 2))
        span.set_attribute("round.top_subject_line", top_subject[:200])
        span.set_attribute("round.guidance_excerpt", round_log.guidance[:300])
    finally:
        span.end()

    log.info("optimization_round_complete", extra={...})
An optimization_round OTel span with seven attributes appears as a child span under each invocation in AgentCore traces. The round.top3_average and round.top_score attributes show whether scores are improving across rounds. round.guidance_excerpt shows what the critic said before each regeneration step; it is the most useful signal when a run plateaus unexpectedly.
The same fields appear as a structured log.info event, queryable in CloudWatch Logs Insights:
fields @timestamp, round_number, top3_average, top_score, guidance_excerpt
| filter @message = "optimization_round_complete"
| sort @timestamp asc
The traceId field on the log event matches the trace ID in the traces UI, so you can move between the two surfaces without losing context.
VIII. What I didn't use this time
AgentCore ships with more primitives than this project uses. Some of the ones that didn't make it in are worth naming.
Code Interpreter is the right choice when code has to run in a sandbox: the LLM authors it at runtime, it carries untrusted dependencies or it comes from a source outside the agent's own deployment artifact. The scoring script here has 250 lines of standard library Python, authored by the agent's owner, deployed in the same place as main.py. There is no organizational or security boundary for Code Interpreter to enforce. Adding a managed sandbox would introduce spin-up latency, a separate billing line, a separate failure mode and a service dependency.
Policy manages what agents are allowed to do β which tools they can call, which users can invoke them, which actions are gated behind approval. It earns its place in multi-tenant deployments where different users have different permissions or in agentic workflows where the consequences of a wrong action are hard to reverse.
Gateway exposes an agent as a managed API endpoint with authentication, rate limiting, and request routing. It's the right choice when the agent is a shared service consumed by multiple callers across organizational boundaries.
Every primitive on this list is genuinely useful for the problems it was designed to solve. The real challenge is identifying which problem you actually have. Adopting every available managed service does not make an agent more sophisticated, and it often makes the system harder to reason about, more expensive to run or even more fragile.
When evaluating a new service or capability, ask three things: what complexity it introduces alongside it, what failure modes come with it, and whether it meaningfully simplifies the system.
IX. How to deploy and run it
Prerequisites
Before deploying, make sure you have:
- An AWS account with Amazon Bedrock access
- The agentcore CLI installed (npm install -g @aws/agentcore-cli)
- AWS credentials configured with permissions for Bedrock, CloudFormation, IAM, ECR, and S3
To confirm your credentials are working:
aws sts get-caller-identity
Testing locally before deploying
To iterate on the agent without incurring a full deploy cycle, you can run the local dev server:
agentcore dev
This starts the runtime at http://localhost:8080. Bedrock model calls still go to AWS, so you need valid credentials, but no CloudFormation changes are made. Memory is not active in dev mode unless you export MEMORY_ID manually with the ID from your deployed memory resource.
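Once the dev server is running you can also hit it with plain HTTP. A sketch, assuming the standard Runtime contract of POST /invocations on port 8080 and using jq to wrap the briefing under the prompt key the entrypoint expects:

curl -s -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(jq -Rs . < app/subject_line_optimizer/briefing/examples/reactivation.json)}"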
Deploying
From the project root:
agentcore deploy
The CLI runs CDK under the hood, synthesizing a CloudFormation stack, bootstrapping the CDK environment if needed, and provisioning the runtime, the memory resource, and the IAM roles. The first deploy takes a couple of minutes. When it completes you'll see a runtime ARN in the output; that ARN is the deployed endpoint.
Running an invocation
The project ships with four example briefings under app/subject_line_optimizer/briefing/examples/. Each is a self-contained campaign briefing JSON ready to send.
To invoke the deployed agent with the reactivation example:
agentcore invoke \
--target default \
--prompt-file app/subject_line_optimizer/briefing/examples/reactivation.json \
--session-id "create-an-id-here-001" \
--user-id "your-user-id" \
--stream
A few things to note:
- --target default routes to the deployed endpoint, not the local dev server
- --prompt-file reads the briefing directly; the CLI wraps it in {"prompt": "..."} before sending, so pass the raw briefing file
- --session-id must be at least 33 characters; a UUID works well
- --user-id scopes the AgentCore Memory namespaces; use a consistent identifier across sessions so the agent accumulates preferences for that user over time
- --stream prints each chunk as it arrives, so you see the rounds progressing in real time
Example output
Optimizing subject lines for: Q3 Lapsed Customer Reactivation
[round 1]
83.0 Ready to rediscover your favorite things? Claim 25% off now
(LEN_ACCEPTABLE_20_60, URGENCY_BASE_LIFT, VALUE_PERCENT_OFF)
80.0 Unlock 25% off - your loyalty reward is ready to claim
(LEN_ACCEPTABLE_20_60, VALUE_PERCENT_OFF, LOYALTY_LANGUAGE)
...
pruned: 3
guidance for next round: Avoid overly conversational openings...
[round 2]
91.2 Claim your 25% loyalty reward - 14 days to save
(LEN_SWEET_SPOT_30_50, LOYALTY_LANGUAGE)
...
=== Final shortlist ===
1. Claim your 25% loyalty reward - 14 days to save
score 91.2 open-rate band 42.0-50.0%
...
The final chunk is a machine-readable JSON object with the full shortlist, per-round logs, and plateau status.
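Because every earlier chunk ends in a newline and the final payload is a single json.dumps line, pulling it out of a captured run is a short Python sketch (the top-level keys depend on _serialize_result):

import json
import pathlib

text = pathlib.Path("run_output.txt").read_text(encoding="utf-8")
result = json.loads(text.strip().splitlines()[-1])  # the last line is the JSON chunk
print(list(result))  # inspect the top-level keys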
There is currently no built-in UI in the AWS console to browse memory contents. A community-built tool called AgentCore Memory Browser fills this gap.
Closing thoughts
The thing I keep coming back to is that AgentCore doesn't have an opinion about what the agent should look like. Bedrock Agents did, and the opinion was reasonable for a lot of cases. AgentCore gives you a Runtime, a Memory service, a set of primitives, and trusts you to assemble them. That trust is the feature.