DEV Community: Ana Silva

Partition and Sort Keys on DynamoDB: Modeling data for batch-and-stream convergence

Ana Silva — Fri, 22 May 2026 15:11:53 +0000

How to design partition keys, sort keys and a serving layer when your data has two sources with different SLAs

Batch pipelines are reliable, auditable, and built to handle complexity at scale, but they come with a cost: by the time data is consolidated, validated, and ready to serve, hours or days might have passed. For internal analytics that’s often fine, but for a customer staring at a mobile app wondering whether their action was registered, it is not.

This article is about a DynamoDB modeling problem that lives at that boundary. The case study below is fictional and designed to keep the focus on the technical problem rather than on any specific domain. Business rules and delays are simplified accordingly.

The core challenge is one that appears frequently in data-intensive systems: a batch pipeline owns the authoritative view of the data, but its consolidation cycle introduces a lag that makes the user-facing experience feel stale. The goal is to bridge that gap with a near-real-time layer without replacing the batch, without duplicating business logic, and without turning the serving layer into a consistency problem.

The Use Case

I've covered this use case in my previous articles about DynamoDB. Here’s a quick recap:

We’re a financial institution developing a new feature in our mobile app to promote our customers’ financial health, offering them monthly saving goals. As a reward for reaching these goals, the saved amount is automatically invested under special conditions and higher interest rates.

The goals are calculated in batch by a data platform and then loaded into the database to be displayed in the app. The only information the app writes to the database is which goal the customer chose.

For this article, we’re extending the use case. The batch pipeline remains the source of truth, but business needs a new capability: when a customer makes a deposit toward their goal, it should appear in the app within minutes, not the next day.

Each customer has a monthly saving goal per investment type: savings account, government bonds and a certificate of deposit. Each goal has a target amount and accumulates progress as the customer makes deposits over the course of the month. The batch pipeline computes the consolidated state of each goal daily and writes it to DynamoDB.

The catch is that a contribution is not immediately final; it goes through reconciliation across multiple systems before it can be considered confirmed. The new capability is about making contributions visible before the batch catches up and displaying them with a status of “pending confirmation.” If reconciliation fails, the contribution disappears. If it succeeds, the batch pipeline will eventually pick it up and reflect it in the authoritative view. This gives us two data sources with different freshness and different levels of trust, both feeding the same API response. That is the problem this article addresses.

Solution diagram

The diagram below shows how the two pipelines are organized and where they converge.

The batch pipeline reads from consolidated gold tables stored as files in object storage (S3 in this case), and a Spark-based job applies business rules to compute the updated state for each customer. Those rules live in a relational database shared between both pipelines. Once processed, the results are bulk loaded into DynamoDB.

The stream pipeline consumes customer activity events from a managed Kafka cluster and a containerized consumer service enriches each event before writing to a second DynamoDB table in near real-time. The two pipelines never write to the same table. They converge only at read time, where a single service merges both into one API response.

This is a practical application of the lambda architecture pattern, but what matters here is what that pattern demands from the data modeling and serving layer, and that is what the next section addresses.

Why two tables

The instinct when modeling in DynamoDB is to reach for the single-table design. In this case, two reasons pull the tables apart.

The first is TTL. DynamoDB’s time-to-live is enabled per table and points at a single attribute that holds each item’s expiration timestamp. The NRT table needs items to expire automatically after a few days, while the batch table does not; its lifecycle is controlled by the batch job itself, which rewrites the data on each run. Items without the TTL attribute are never expired, so a shared table would technically work, but expiration would then depend on every writer setting or omitting that attribute correctly, which is exactly the kind of cross-pipeline coupling worth avoiding.

The second is the write pattern. The batch pipeline performs a bulk load on a daily schedule, replacing all existing records. The NRT consumer writes continuously, one event at a time, with idempotency requirements that bulk loads do not need. These are different operational profiles, and mixing them in a single table introduces coupling between two pipelines that can be independent. It also sounds like a nightmare for scenarios where reprocessing might be needed.

Splitting the tables keeps each pipeline simple, isolated and easier to operate. The complexity of merging them is contained in the serving layer.

Modeling the tables

A DynamoDB table does not have a fixed schema in the traditional sense, since each item can have different attributes, but every item must have a primary key. That key can be a single attribute, the partition key, or a composite of two, the partition key plus a sort key. DynamoDB uses the partition key to distribute data across storage nodes, and when a sort key is present, it orders items within the same partition.

*The choice of key matters more than anything else in DynamoDB modeling, because it determines how data is distributed, how it is accessed and what queries are possible.*

The batch table

The batch table holds the consolidated view of each customer: personal attributes, preferences, the goals they have chosen, the progress computed by the last batch run, and a few other fields that the API exposes to the mobile app. It is, effectively, the full customer profile plus information about the goals the customer has already accomplished.

Every read starts with a customer. The API asks “what does this person look like right now?” and expects one consolidated response. Given that access pattern, there is no reason to spread the data across multiple items. The partition key is personId.

{
  "personId": "person-uuid",
  "name": "Ana Silva",
  "preferences": { "notifications": true, "language": "pt-BR" },
  "goals": [
    {
      "goalId": "SAVINGS_ACCOUNT",
      "targetAmount": 500.00,
      "savedAmount": 320.00,
      "status": "IN_PROGRESS",
      "specialRate": 0.08
    },
    {
      "goalId": "GOVERNMENT_BONDS",
      "targetAmount": 1500.00,
      "savedAmount": 0.00,
      "status": "NOT_STARTED",
      "specialRate": 0.12
    },
    {
      "goalId": "CERTIFICATE_OF_DEPOSIT",
      "targetAmount": 2000.00,
      "savedAmount": 2000.00,
      "status": "COMPLETED",
      "specialRate": 0.14
    }
  ],
  "lastUpdated": "2026-05-15T03:00:00Z"
}

There is no TTL on this table because nothing needs to expire passively between runs. The batch job owns the lifecycle of every item, rewriting records on each run.

The NRT table

The NRT table has a narrower scope. It holds only what the stream pipeline produces: the most recent balance update per investment type per customer, written as deposit events arrive. There is no need to model the rest of the customer here, because the serving layer will combine this data with the batch table at read time.

The partition key is personId and the sort key is goalId. Every Kafka event carries the customer’s updated balance for one investment type, and the consumer writes it to the NRT table using PutItem. Because there can be at most one pending balance per investment type at any moment, newer events naturally overwrite older ones for the same (personId, goalId) pair. The serving layer can retrieve all pending balances for a customer in a single query.

{
  "personId": "person-uuid",
  "goalId": "SAVINGS_ACCOUNT",
  "currentBalance": 370.00,
  "status": "PENDING_CONFIRMATION",
  "eventId": "event-uuid-123",
  "updatedAt": "2026-05-16T14:23:00Z",
  "ttl": 1779373380
}

The ttl attribute is a Unix timestamp, in epoch seconds, telling DynamoDB when to delete this item automatically. That is how time-to-live works in DynamoDB: you designate an attribute to hold the expiration timestamp, enable TTL on the table pointing to that attribute and DynamoDB handles deletion in the background. Items in the NRT table expire after five days, long enough for the batch pipeline to process the contribution and take ownership, short enough to prevent stale pending records from accumulating unnecessarily.
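A minimal sketch of how the consumer could build that write, assuming each event carries an ISO-8601 updatedAt. The helper name build_nrt_put and the condition expression are illustrative assumptions, not code from the actual consumer: the condition is one way to stop a late, out-of-order event from overwriting a newer snapshot, which a plain PutItem would allow. Keep in mind that TTL deletion happens in the background some time after the expiry timestamp, so an expired item can still be returned by reads for a while.

```python
from datetime import datetime, timedelta
from decimal import Decimal

TTL_DAYS = 5  # retention window described above


def build_nrt_put(event: dict) -> dict:
    """Build put_item keyword arguments for one balance event (hypothetical helper)."""
    # fromisoformat() only accepts a trailing "Z" from Python 3.11, so normalize it.
    event_time = datetime.fromisoformat(event["updatedAt"].replace("Z", "+00:00"))
    expires_at = event_time + timedelta(days=TTL_DAYS)
    item = {
        "personId": event["personId"],
        "goalId": event["goalId"],
        # boto3's resource API rejects float, so numbers go in as Decimal.
        "currentBalance": Decimal(str(event["currentBalance"])),
        "status": "PENDING_CONFIRMATION",
        "eventId": event["eventId"],
        "updatedAt": event["updatedAt"],
        "ttl": int(expires_at.timestamp()),  # epoch seconds, as TTL requires
    }
    return {
        "Item": item,
        # Assumption: reject the write when a snapshot at least as new is stored.
        "ConditionExpression": "attribute_not_exists(personId) OR updatedAt < :ts",
        "ExpressionAttributeValues": {":ts": event["updatedAt"]},
    }
```

With a boto3 Table resource this would be called as table.put_item(**build_nrt_put(event)). A rejected write surfaces as a botocore ClientError with the code ConditionalCheckFailedException, which the consumer can treat as "already applied" and skip.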

The updatedAt attribute carries the event timestamp from Kafka and plays a role in the merge logic: the serving layer compares it against the batch's lastUpdated attribute to decide whether the NRT snapshot is fresher than the batch view. If the batch has already caught up, the NRT value is ignored; otherwise it is used and the goal is displayed with a "pending confirmation" indicator.
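To summarize the two key designs side by side, here is a sketch of the table definitions as plain dictionaries in the shape that boto3's create_table and update_time_to_live calls accept. The table names come from the serving code later in the article; the helper name and the on-demand billing mode are assumptions made only for illustration.

```python
from typing import Optional


def table_definition(name: str, partition_key: str, sort_key: Optional[str] = None) -> dict:
    """Keyword arguments in the shape boto3's client.create_table() expects."""
    key_schema = [{"AttributeName": partition_key, "KeyType": "HASH"}]
    attributes = [{"AttributeName": partition_key, "AttributeType": "S"}]
    if sort_key is not None:
        key_schema.append({"AttributeName": sort_key, "KeyType": "RANGE"})
        attributes.append({"AttributeName": sort_key, "AttributeType": "S"})
    return {
        "TableName": name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attributes,
        "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
    }


# Batch table: one item per customer, so a partition key is enough.
BATCH_TABLE = table_definition("CustomerBatch", "personId")

# NRT table: one item per (customer, investment type).
NRT_TABLE = table_definition("CustomerNRT", "personId", sort_key="goalId")

# TTL is switched on per table and names the attribute holding the expiry time.
NRT_TTL = {
    "TableName": "CustomerNRT",
    "TimeToLiveSpecification": {"Enabled": True, "AttributeName": "ttl"},
}
```

Only key attributes appear in AttributeDefinitions; everything else on the items stays schemaless.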

Serving the data

The two tables exist to be merged. Every read from the API does the same thing: fetch the customer’s batch view, fetch any pending updates from the NRT table and combine them into a single response before returning it to the app.

The two reads are independent and can run in parallel. The first is a GetItem on the batch table, keyed by personId. It returns the customer profile and the consolidated state of every goal as of the last batch run. The second is a Query on the NRT table, also keyed by personId. It returns every pending balance update for that customer across all investment types, typically zero to three items.

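import boto3

dynamodb = boto3.resource("dynamodb")
batch_table = dynamodb.Table("CustomerBatch")
nrt_table = dynamodb.Table("CustomerNRT")

# One item holds the whole consolidated profile for the customer.
batch_response = batch_table.get_item(Key={"personId": person_id})

# Every pending balance for the customer, one item per goalId.
nrt_response = nrt_table.query(
    KeyConditionExpression="personId = :pid",
    ExpressionAttributeValues={":pid": person_id}
)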

batch_item = batch_response.get("Item", {})
pending_by_goal = {
    item["goalId"]: item
    for item in nrt_response.get("Items", [])
}

response_goals = []
for goal in batch_item.get("goals", []):
    pending = pending_by_goal.get(goal["goalId"])
    if pending and pending["updatedAt"] > batch_item["lastUpdated"]:
        response_goals.append({
            **goal,
            "savedAmount": pending["currentBalance"],
            "status": "PENDING_CONFIRMATION"
        })
    else:
        response_goals.append(goal)
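The snippet above issues the two reads one after the other. Since they are independent, one way to overlap them is a small thread pool. This is a sketch under assumptions: fetch_both, get_batch and query_nrt are hypothetical names, and the two callables stand in for whatever wraps the GetItem and Query calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_both(
    person_id: str,
    get_batch: Callable[[str], dict],
    query_nrt: Callable[[str], list],
) -> tuple:
    """Run the two independent reads concurrently and return (batch_item, nrt_items)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        batch_future = pool.submit(get_batch, person_id)
        nrt_future = pool.submit(query_nrt, person_id)
        # result() re-raises any exception raised inside the worker thread.
        return batch_future.result(), nrt_future.result()
```

Passing the reads in as callables keeps the merge testable without AWS. One caveat when wiring real calls: boto3 documents its low-level clients as thread-safe but not its resource objects, so the callables are better backed by a client or by one resource per thread.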

A subtle but important property of this design is that the API does not need to know which events have already been incorporated by the batch, nor does it need to track any history of pending updates. The timestamp comparison is enough and the TTL eventually removes the NRT item.
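A small worked example makes the two outcomes concrete. The function below wraps the same merge rule; merge_goals is a hypothetical name, and comparing parsed datetimes instead of raw strings is my own precaution in case the two pipelines ever format timestamps differently.

```python
from datetime import datetime


def _parse(ts: str) -> datetime:
    # fromisoformat() only accepts a trailing "Z" from Python 3.11, so normalize it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def merge_goals(batch_item: dict, nrt_items: list) -> list:
    """Overlay NRT balances on batch goals when the NRT snapshot is fresher."""
    pending_by_goal = {item["goalId"]: item for item in nrt_items}
    batch_time = _parse(batch_item["lastUpdated"])
    merged = []
    for goal in batch_item.get("goals", []):
        pending = pending_by_goal.get(goal["goalId"])
        if pending and _parse(pending["updatedAt"]) > batch_time:
            merged.append({
                **goal,
                "savedAmount": pending["currentBalance"],
                "status": "PENDING_CONFIRMATION",
            })
        else:
            merged.append(goal)
    return merged


nrt = [{"goalId": "SAVINGS_ACCOUNT", "currentBalance": 370.0,
        "updatedAt": "2026-05-16T14:23:00Z"}]

# Batch ran before the deposit: the NRT balance wins and is flagged as pending.
before = {"lastUpdated": "2026-05-15T03:00:00Z", "goals": [
    {"goalId": "SAVINGS_ACCOUNT", "savedAmount": 320.0, "status": "IN_PROGRESS"}]}

# Batch ran after the deposit: the same NRT item is still there but is ignored.
after = {"lastUpdated": "2026-05-17T03:00:00Z", "goals": [
    {"goalId": "SAVINGS_ACCOUNT", "savedAmount": 370.0, "status": "IN_PROGRESS"}]}
```

Calling merge_goals(before, nrt) yields a savings goal with the pending balance, while merge_goals(after, nrt) returns the batch goal untouched even though the NRT item has not expired yet.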

About the costs

The GetItem on the batch table consumes 1 RCU for a strongly consistent read or 0.5 RCU for an eventually consistent one, as long as the item is no larger than 4 KB.

The Query on the NRT table consumes RCUs proportional to the total size of the items returned, rounded up to the nearest 4KB block. With at most 3 items per customer (one per investment type) and small payloads, this almost always lands within a single 4KB block, meaning 1 RCU strongly consistent or 0.5 RCU eventually consistent.

For this use case, eventually consistent reads are acceptable on both tables. The batch item is small enough to fit in a single read unit and the NRT query returns at most 3 small items per customer, so the total read cost per API request lands at roughly 1 RCU. This is a direct consequence of the modeling choices: bounded goal count, single-item batch view, one NRT item per investment type.
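The arithmetic behind that estimate can be written down in a few lines. The byte sizes used below are made-up illustrations, not measurements from a real table.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # reads are metered in 4 KB blocks


def read_capacity_units(total_bytes: int, strongly_consistent: bool = False) -> float:
    """RCUs for one GetItem (item size) or one Query (sum of returned item sizes)."""
    blocks = max(1, math.ceil(total_bytes / READ_UNIT_BYTES))
    return float(blocks) if strongly_consistent else blocks / 2


batch_read = read_capacity_units(3_000)    # hypothetical 3 KB profile item
nrt_read = read_capacity_units(3 * 300)    # three hypothetical 300-byte items
total = batch_read + nrt_read
```

The same function shows how quickly the picture changes if the batch item grows: a profile just over 4 KB would double the cost of the first read.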

Closing thoughts

What makes this work in DynamoDB is the modeling, and most of it comes down to how you use partition keys and sort keys. They are not just identifiers, they are the access path. The partition key decides how data is distributed and which queries are cheap. The sort key decides how items are organized within a partition and which queries are even possible.

Sort keys can do much more than what is depicted here. They support range queries with begins_with, between and the comparison operators <, <=, > and >=, which means you can model hierarchies, timelines or status transitions directly in the key. Composite sort keys like STATUS#ACTIVE#DATE#2026-05-16 let a single Query retrieve, sort and filter items in ways that would otherwise require a secondary index or a scan.
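To see why a prefix condition on a composite sort key is cheap, it helps to simulate it locally. The sketch below does not call DynamoDB; it only mimics the idea that items in a partition are stored in sort-key order, so everything sharing a prefix sits in one contiguous run. The helper names are hypothetical.

```python
from bisect import bisect_left


def sort_key(status: str, date: str) -> str:
    """Compose a sort key in the STATUS#...#DATE#... layout mentioned above."""
    return f"STATUS#{status}#DATE#{date}"


def begins_with(sorted_keys: list, prefix: str) -> list:
    """Return the contiguous run of keys sharing a prefix, as a range read would."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


keys = sorted([
    sort_key("ACTIVE", "2026-05-16"),
    sort_key("CLOSED", "2026-04-30"),
    sort_key("ACTIVE", "2026-05-02"),
    sort_key("PENDING", "2026-05-10"),
])
active = begins_with(keys, "STATUS#ACTIVE#")
```

Because ISO dates sort lexicographically, the active items come back already ordered by date, with no filtering step after the read.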

If you are working on something similar or have approached the same problem differently, I would love to hear about it.

Cracking the Bedrock, Reaching the Core: Building Agents with AWS AgentCore Runtime and Memory

Ana Silva — Fri, 08 May 2026 23:49:40 +0000

This article walks through a project I built on Amazon Bedrock AgentCore: an agent that turns campaign briefings into ranked email subject lines, and improves across sessions as it learns from the user.

The goal here isn’t to cover every AgentCore primitive, but to show how a few of them (Runtime and Memory) fit together in a real loop, and to be honest about which ones I deliberately left out. The project itself is intentionally small. The interesting part is the architecture around it: where scoring runs, why the optimization loop stays framework-agnostic, what memory actually stores and which tradeoffs come with each decision.

*👉 You can access the project on GitHub.*

Agents-to-AgentCore Evolution
The use case: briefing in, subject lines out
Drafting a solution
I. The entrypoint (main.py)
II. The imperative shell (agent/builder.py)
III. The functional core (agent/iteration.py)
IV. Scoring
V. The two agents and Bedrock
VI. AgentCore Memory: four strategies
VII. Observability
VIII. What I didn't use this time
IX. How to deploy and run it
Closing thoughts

Agents-to-AgentCore Evolution

Bedrock was launched in 2023 as AWS's response to the rapid growth in foundation model usage: a single API to call models from companies such as Anthropic, AI21 Labs, Cohere or even Amazon's own Titan family, without managing inference infrastructure. Later that year AWS added Bedrock Agents, a configuration-driven product that bolted tool-calling, knowledge bases, and memory onto a Bedrock model.

It works for many cases, but it is a closed product: the ecosystem strongly revolves around Lambda-based tool execution, retrieval has to be Knowledge Bases, models have to be Bedrock-hosted, and you can't see or control how the agent decides what to do at each step. For more ambitious use cases, teams ended up bypassing Bedrock Agents and writing their own harness on EC2 or Lambda, which meant rebuilding the same plumbing every time: session management, sandboxing, identity, memory and observability.

AgentCore, announced in 2025, was AWS's answer to that pattern. Instead of a single "agent product," it broke the harness apart into composable services and made them framework-agnostic, so you could bring Strands, LangGraph, CrewAI or anything else. April 2026 added the managed Harness, which closed the loop: Harness offers the same easy, configuration-based approach as Bedrock Agents, but it runs on the AgentCore platform and lets you switch to code when you need more control.

AWS continues to maintain both Bedrock Agents and AgentCore in parallel. Bedrock Agents remains available for teams that already use it or prefer a fully managed, configuration-only approach, while AgentCore is positioned as the path forward for new projects that need flexibility, framework choice or production-grade infrastructure.

The use case: briefing in, subject lines out

Email subject lines have an outsized effect on open rates and they're often the only impression a campaign makes. Marketers who have the volume and the time for A/B testing can ship two or more variants and let the data decide. Many marketers don't, so they write a subject line, second-guess it and hit send.

Imagine that you run email marketing for a specialty coffee brand. You open the optimizer, fill in the briefing and set a few constraints: nothing longer than 55 characters, no discount language, no emojis. You hit optimize and watch the rounds come in. Round one produces 8 candidates covering the full stylistic range, from urgency-led to curiosity-led to plain and direct. The scorer immediately tells you which ones carry spam risk, which ones are the right length, which ones align with a retention audience. The weakest three get dropped and the Critic explains why. Round two regenerates those slots with that guidance in mind. By round three the scores have converged and you have a ranked shortlist of five, each with a predicted open-rate band and a breakdown of what drove the score.

Next week you run a cross-sell campaign for the same brand. The briefing is different but the session ID carries your name. Before generating a single candidate, the optimizer reads what it learned from your prior sessions: urgency-led lines consistently underperformed for this brand; premium and exclusivity framing reached the shortlist every time. Round one already looks different from what a first-time user would see.

Input: a campaign briefing

A generic JSON brief with the objective, audience, offer, brand voice and constraints. The kind of structure you'd find in any agency template, nothing platform-specific.

Output: a ranked shortlist of 5

Each variant comes with predicted open-rate range, the dimensions where it scored highest and any flagged risks (spam triggers, length penalty, audience mismatch). The user can ask for follow-ups in the same session — "give me shorter versions of #2 and #4" — and the agent refines while preserving what made those variants score well.

Drafting a solution

Three tiers, top to bottom.

The caller sends a campaign briefing JSON and gets a streamed response back. That exchange happens through a single @app.entrypoint function inside AgentCore Runtime, a managed AWS service that handles the HTTP transport, session lifecycle, and streaming framing so the agent code doesn't have to.

Inside the Runtime, the architecture splits into two layers. The imperative shell (agent/builder.py) owns everything that touches a framework: two Strands agents (Generator and Critic), an in-process scorer, and a memory recall helper. It wires these into four plain Python callables and injects them into the functional core (agent/iteration.py). The core runs the generate/score/critique/regenerate loop and returns a ranked shortlist.

At the bottom, two managed AWS services sit off the request path. Bedrock serves every LLM call via Strands' BedrockModel. AgentCore Memory receives session events automatically from the Strands session manager, and returns extracted patterns when the loop asks for them at the start of each run. The async strategy extraction that makes cross-session learning possible runs roughly 60 seconds after each session ends.

I. The entrypoint (main.py)

The entrypoint has one job: receive a campaign briefing, run the optimization loop and stream results back as they arrive.

AgentCore Runtime gives you this as a single decorator. You don't write a router, configure middleware or manage a server process; instead, you hand it an async generator and it handles the rest: HTTP transport, session lifecycle, streaming framing, session_id and user_id extraction from request headers.

app = BedrockAgentCoreApp()

@app.entrypoint
async def invoke(payload, context):
    session_id = getattr(context, "session_id", None) or "default-session"
    user_id = getattr(context, "user_id", None) or "default-user"
    try:
        briefing = validate_briefing(payload.get("prompt") or "")
    except (ValueError, json.JSONDecodeError) as exc:
        yield f"Invalid briefing: {exc}\n"
        return

    result = run_initial_optimization(briefing, session_id, user_id)
    for round_log in result.rounds:
        for line in _format_round_lines(round_log):
            yield line + "\n"
    yield _format_shortlist(result) + "\n"
    yield json.dumps(_serialize_result(result), ensure_ascii=False)

Every yield sends a chunk to the caller immediately. This matters because an optimization run takes from 30 to 90 seconds and the caller sees each round's scores as they complete, not a blank screen followed by a wall of text.
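One caveat about the excerpt as shown: run_initial_optimization returns before the first yield, so the round lines would only reach the caller once the whole loop has finished. A possible way to surface each round while the loop is still running is to run the blocking loop in a worker thread and bridge its on_round callback to the async generator through a queue. This is a sketch under assumptions, not the project's code: stream_as_rounds_complete and run_loop are hypothetical names, and run_loop stands for any blocking function that reports progress through a callback.

```python
import asyncio
from typing import AsyncIterator, Callable

_DONE = object()  # sentinel marking the end of the run


async def stream_as_rounds_complete(
    run_loop: Callable[[Callable[[str], None]], None],
) -> AsyncIterator[str]:
    """Yield each line as soon as the blocking loop reports it via on_round."""
    loop = asyncio.get_running_loop()
    lines: asyncio.Queue = asyncio.Queue()

    def on_round(line: str) -> None:
        # Called from the worker thread, so hand the line over thread-safely.
        loop.call_soon_threadsafe(lines.put_nowait, line)

    def worker() -> None:
        try:
            run_loop(on_round)
        finally:
            # Always unblock the consumer, even if the loop raised.
            loop.call_soon_threadsafe(lines.put_nowait, _DONE)

    task = asyncio.create_task(asyncio.to_thread(worker))
    while (item := await lines.get()) is not _DONE:
        yield item
    await task  # re-raise any exception from the worker
```

Inside the entrypoint, this would replace the "compute everything, then yield" sequence with an async for over the generator, yielding each formatted round line as it arrives.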

Refinement works by re-submitting a modified briefing with the same session_id — with, for example, a tighter length constraint, different brand voice or added avoid-words. AgentCore Memory carries learned patterns from prior sessions forward automatically; the entrypoint doesn't need to know about that.

II. The imperative shell (agent/builder.py)

The shell is the wiring layer. It knows about Strands, AgentCore and Bedrock while the functional core doesn't (it receives callables). The shell is what turns those framework dependencies into plain Python functions the core can call without importing anything.

Four things live here: a generator agent, a critic agent, an in-process scorer and a memory recall helper. At the start of each optimization run, all four get injected into the loop as callables.

def run_initial_optimization(briefing, session_id, user_id):
    generator = _make_agent(session_id, user_id, GENERATOR_SYSTEM_PROMPT)
    critic = _make_agent(session_id, user_id, CRITIQUE_SYSTEM_PROMPT)

    def generate(prompt, _n):
        return _strip_json_array(str(generator(prompt)))

    def critique(scored, to_drop):
        return str(critic(build_critique_prompt(scored, to_drop))).strip()

    return run_optimization(
        briefing,
        generate=generate,
        score=score_candidates,
        critique=critique,
        recall=recall_for_user,
        on_round=_emit_round_telemetry,
        actor_id=user_id,
        round_one_prompt_builder=round_one_prompt,
        regenerate_prompt_builder=regenerate_prompt,
    )

The agents

Generator and Critic are separate Strands Agent instances with separate system prompts. The generator is told to produce strict JSON arrays of strings and nothing else. The critic is told to produce two to four sentences of explicit, actionable guidance.

Both share the same AgentCoreMemorySessionManager, so they write to the same session namespace and see the same conversation history.

Scoring stays in-process

score_candidates is not an agent and not a service call — it's a direct import of score_subject_line from scoring/score.py, with the heuristic rules loaded once. No network, no latency, no failure mode beyond a bad CSV row.

The observability hook

on_round=_emit_round_telemetry is the last injection. After each completed round, the loop calls it with a RoundLog. The shell opens an OpenTelemetry span, records seven attributes (round number, candidate count, pruned count, top-3 average, top score, top subject line and guidance excerpt), closes the span and emits a structured log line.

III. The functional core (agent/iteration.py)

Keeping the loop free of framework imports means it can be read, tested and reasoned about without knowing anything about the surrounding infrastructure. You can swap the LLM, replace the scoring backend or drop AgentCore entirely and the loop doesn't change. The entire contract is in the signature:

def run_optimization(
    briefing: Any,
    generate: Callable[[str, int], list[str]],
    score: Callable[[list[str], Any], list[dict]],
    critique: Callable[[list[dict], list[dict]], str],
    recall: Callable[[str, Any], MemoryContext] | None = None,
    on_round: Callable[[RoundLog], None] | None = None,
    actor_id: str = "anonymous",
    max_rounds: int = 4,
    plateau_epsilon: float = 0.5,
    *,
    round_one_prompt_builder: ...,
    regenerate_prompt_builder: ...,
) -> IterationResult:

Round one generates eight candidates across eight archetypes and scores them all. Each subsequent round generates only the slots vacated by pruning. New and surviving candidates are merged, deduped by subject line and sorted by score.

After sorting, the loop checks whether the top-3 average improved by at least plateau_epsilon points over the prior round. If not, the scores have converged and the loop stops early to avoid wasting LLM calls.

If there's still room to improve and rounds remain, the bottom 40% are pruned and handed to the critic. The critic explains what patterns made them lose in two to four sentences, referencing specific signals. That guidance feeds into the next round's regeneration prompt.

After at most four rounds, the loop returns the top five candidates with the full per-round log and the memory context read at the start.
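The control flow described above fits in a short function. What follows is a simplified sketch written from the description, not the code in agent/iteration.py: it drops the briefing, memory recall, prompt builders and telemetry, passes the critic's guidance straight to the generator, and assumes each scored candidate is a dict with "subject_line" and "score" keys.

```python
from typing import Callable


def top3_average(ranked: list) -> float:
    top = [c["score"] for c in ranked[:3]]
    return sum(top) / len(top) if top else 0.0


def run_loop_sketch(
    generate: Callable[[str, int], list],
    score: Callable[[list], list],
    critique: Callable[[list, list], str],
    max_rounds: int = 4,
    plateau_epsilon: float = 0.5,
    pool_size: int = 8,
    prune_fraction: float = 0.4,
    shortlist_size: int = 5,
) -> list:
    survivors: list = []
    guidance = ""
    previous_avg = None
    for round_number in range(1, max_rounds + 1):
        # Round one fills the whole pool; later rounds refill pruned slots only.
        fresh = score(generate(guidance, pool_size - len(survivors)))
        # Merge, dedupe by subject line (first occurrence wins), sort by score.
        merged: dict = {}
        for candidate in survivors + fresh:
            merged.setdefault(candidate["subject_line"], candidate)
        ranked = sorted(merged.values(), key=lambda c: c["score"], reverse=True)
        survivors = ranked
        avg = top3_average(ranked)
        plateaued = previous_avg is not None and avg - previous_avg < plateau_epsilon
        if plateaued or round_number == max_rounds:
            break
        previous_avg = avg
        # Prune the weakest candidates and ask the critic why they lost.
        keep = len(ranked) - int(len(ranked) * prune_fraction)
        survivors, dropped = ranked[:keep], ranked[keep:]
        guidance = critique(ranked, dropped)
    return survivors[:shortlist_size]
```

Because every dependency arrives as a callable, the loop can be exercised with plain functions that return canned candidates and scores, with no LLM or AWS service involved.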

IV. Scoring

Every candidate produced by the generator gets a score before the loop decides what to keep and what to prune. The scorer runs 45 heuristics against each subject line and returns a composite score between 0 and 100, a predicted open-rate band, per-dimension breakdowns and an explanation of the top contributions. The heuristics are stored in a simple CSV and cover 9 dimensions: length, urgency, spam risk, curiosity triggers, value signals, personalization, style, audience fit and brand voice.

A heuristic is not a rule. A rule is binary — pass or fail, always. A heuristic is a signal that contributes positively or negatively to a score based on what the literature says tends to correlate with open rates. "Subject lines between 30 and 50 characters score better" is an empirical observation, not a constraint.

The same urgency words that lift acquisition open rates by +1.0 point hurt retention campaigns by −2.0 and are inappropriate for regulatory notices at −5.0. The audience_modifier column captures that context-dependence per rule; audience tags are inferred automatically from briefing free-text.

rule_id,category,pattern,match_type,weight,audience_modifier
LEN_SWEET_SPOT_30_50,length,30-50,range,8.0,
URGENCY_BASE_LIFT,urgency,urgent|hurry|now|today,word_any,3.5,acquisition:+1.0;retention:-2.0;regulatory:-5.0
SPAM_TRIGGER_FREE,spam_risk,free|100% free,word_any,-6.0,
VALUE_FREE_SHIPPING,value_signals,free shipping,phrase,4.0,acquisition:+1.0

These weights are starting points. A real company would probably use a model trained on their own send history or something similar.
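A minimal sketch of how rows in this format could be loaded and matched, using only the standard library. It is an illustration under assumptions rather than the project's scoring/score.py: the function names are hypothetical, and the matching semantics for range, word_any and phrase are my reading of the column names. How a modifier is combined with the base weight is not spelled out here, so the sketch stops at parsing it.

```python
import csv
import io
import re

RULES_CSV = """rule_id,category,pattern,match_type,weight,audience_modifier
LEN_SWEET_SPOT_30_50,length,30-50,range,8.0,
URGENCY_BASE_LIFT,urgency,urgent|hurry|now|today,word_any,3.5,acquisition:+1.0;retention:-2.0;regulatory:-5.0
"""


def parse_modifiers(raw: str) -> dict:
    """Turn 'acquisition:+1.0;retention:-2.0' into {'acquisition': 1.0, ...}."""
    pairs = (part.split(":", 1) for part in raw.split(";") if part)
    return {audience: float(delta) for audience, delta in pairs}


def rule_matches(rule: dict, subject: str) -> bool:
    pattern, kind = rule["pattern"], rule["match_type"]
    if kind == "range":
        # Assumed meaning: subject length falls inside the inclusive range.
        low, high = (int(n) for n in pattern.split("-"))
        return low <= len(subject) <= high
    if kind == "word_any":
        # Assumed meaning: any alternative appears as a whole word.
        words = "|".join(re.escape(w) for w in pattern.split("|"))
        return re.search(rf"\b(?:{words})\b", subject, re.IGNORECASE) is not None
    if kind == "phrase":
        return pattern.lower() in subject.lower()
    raise ValueError(f"unknown match_type: {kind}")


rules = list(csv.DictReader(io.StringIO(RULES_CSV)))
```

Matching on word boundaries matters for short triggers: "now" should fire on "Order now" but not on "Know your beans".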

V. The two agents and Bedrock

The Generator and Critic share a model and a session manager but have nothing else in common.

Both agents are Strands Agent instances backed by Amazon Nova 2 Lite, invoked via Bedrock's cross-region inference profile. Both are wired to AgentCoreMemorySessionManager, which writes each turn as a session event automatically. Neither agent needs to know about memory because the session manager handles it transparently.

The Generator enforces its output shape through Strands' structured output support rather than prompt engineering alone. Instead of instructing the model to "output strict JSON arrays only" and then parsing whatever it produces, you pass a Pydantic schema to the agent call and get back a validated instance of it:

class SubjectLineList(BaseModel):
    subject_lines: list[str] = Field(
        description="Email subject line candidates, one per requested slot."
    )

def generate(prompt: str, _n: int) -> list[str]:
    response = generator(prompt, structured_output_model=SubjectLineList)
    return response.structured_output.subject_lines

The Critic is told to produce two to four sentences of explicit, actionable guidance. No generic advice — it must reference specific patterns in the candidates it's reviewing. "These lines are too long" is not useful guidance for regeneration. "The urgency phrasing in candidates 3 and 5 reads as promotional spam rather than genuine time pressure. Try anchoring to a specific benefit instead" is.

That guidance becomes part of the next round's prompt directly:

prompt = regenerate_prompt_builder(
    briefing,
    memory_context,
    [c["subject_line"] for c in survivors],
    guidance,
    n_to_generate,
)

The surviving candidates and the Critic's diagnosis arrive together. The Generator sees what worked, what didn't and why.

VI. AgentCore Memory: four strategies

Memory touches this project in two distinct ways. The Strands session manager writes conversation events automatically and every Generator and Critic turn lands in the session namespace without any code in the agent to make that happen. Separately, recall_for_user reads extracted patterns explicitly at the start of each optimization run, before any candidate is generated.

Memory also works on two timescales:

Short-term: Within a session, the Strands session manager keeps the full conversation in context, making sure the Critic sees every prior round and the Generator sees every prior critique.
Long-term: Across sessions, AgentCore's four extraction strategies run asynchronously, roughly 60 seconds after a session ends, and populate long-term namespaces:

Strategy	Description	Namespace
`SEMANTIC`	General facts inferred from session content	`/users/{actor_id}/facts`
`USER_PREFERENCE`	Behavioral patterns inferred from what consistently scored well or was pruned	`/users/{actor_id}/preferences`
`SUMMARIZATION`	A compressed summary of the session	`/summaries/{actor_id}/{session_id}`
`EPISODIC`	A record of what happened	`/episodes/{actor_id}/{session_id}`

The agent never writes subject lines or briefings directly to long-term storage. It writes session events and the strategies decide what is worth keeping and in what form.

What the loop reads

At the start of each run, recall_for_user queries the facts and preferences namespaces using the briefing as the retrieval query. It returns up to five patterns per namespace, ranked by relevance score. Those patterns flow into the round-one generation prompt:

Patterns observed across this user's prior sessions:
- Urgency-led subject lines were pruned in 3 of 4 prior sessions
- Premium and exclusivity framing consistently reached the final shortlist

Inferred preferences for this user:
- Discount and price-led language does not align with observed brand voice
- Sentence case outperformed title case across recent campaigns

The Generator sees what worked and what didn't before producing a single candidate.
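Turning recalled patterns into that prompt section is plain string formatting. A sketch, assuming the recall step has already produced two lists of strings ordered by relevance; format_memory_context is a hypothetical name, not a function from the project.

```python
def format_memory_context(facts: list, preferences: list, limit: int = 5) -> str:
    """Render recalled patterns as a prompt section; empty for a first-time user."""
    sections = []
    if facts:
        bullets = "\n".join(f"- {text}" for text in facts[:limit])
        sections.append(
            "Patterns observed across this user's prior sessions:\n" + bullets
        )
    if preferences:
        bullets = "\n".join(f"- {text}" for text in preferences[:limit])
        sections.append("Inferred preferences for this user:\n" + bullets)
    return "\n\n".join(sections)
```

Returning an empty string when nothing was recalled keeps the round-one prompt identical to what a first-time user would get.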

VII. Observability

The optimization loop runs for 30–90 seconds across up to four rounds. Without instrumentation, a slow run and a failing run look identical from the outside.

After each completed round, the loop calls an on_round callback with a RoundLog. The shell's implementation emits two complementary signals:

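def _emit_round_telemetry(round_log: RoundLog) -> None:
    # Assumption: round_log.candidates holds this round's scored candidates,
    # best first, as dicts with "score" and "subject_line" keys.
    candidates = round_log.candidates
    top3 = [c["score"] for c in candidates[:3]]
    top3_avg = sum(top3) / len(top3) if top3 else 0.0
    top_score = top3[0] if top3 else 0.0
    top_subject = candidates[0]["subject_line"] if candidates else ""
    span = _tracer.start_span("optimization_round")
    try:
        span.set_attribute("round.number", round_log.round_number)
        span.set_attribute("round.candidate_count", len(candidates))
        span.set_attribute("round.pruned_count", len(round_log.pruned))
        span.set_attribute("round.top3_average", round(top3_avg, 2))
        span.set_attribute("round.top_score", round(top_score, 2))
        span.set_attribute("round.top_subject_line", top_subject[:200])
        span.set_attribute("round.guidance_excerpt", round_log.guidance[:300])
    finally:
        span.end()
    log.info("optimization_round_complete", extra={...})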

An optimization_round OTel span with seven attributes appears as a child span under each invocation in AgentCore traces. The round.top3_average and round.top_score attributes show whether scores are improving across rounds. round.guidance_excerpt shows what the critic said before each regeneration step — the most useful signal when a run plateaus unexpectedly.

The same fields appear as a structured log.info event, queryable in CloudWatch Logs Insights:

fields @timestamp, round_number, top3_average, top_score, guidance_excerpt
| filter @message = "optimization_round_complete"
| sort @timestamp asc

The traceId field on the log event matches the trace ID shown in the traces UI, so you can move between the two surfaces without losing context.
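The telemetry excerpt earlier in this section uses top3_avg, top_score and top_subject without showing how they are computed. Below is a minimal sketch of how such values could be derived. It assumes a hypothetical list of (score, subject_line) pairs, since the real RoundLog structure is not shown here, so treat the names as illustrative only.

```python
from statistics import mean


def summarize_round(scored: list[tuple[float, str]]) -> dict:
    """Derive the summary values a round span might carry.

    `scored` is a hypothetical list of (score, subject_line) pairs;
    it is not the project's actual RoundLog layout.
    """
    if not scored:
        return {"top_score": 0.0, "top3_average": 0.0, "top_subject": ""}
    # Highest score first.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top3 = [score for score, _ in ranked[:3]]
    return {
        "top_score": ranked[0][0],
        "top3_average": mean(top3),
        "top_subject": ranked[0][1],
    }
```

Computing these once and passing them to both the span and the log event keeps the two signals consistent with each other.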

VIII. What I didn't use this time

AgentCore ships with more primitives than this project uses. Some of the ones that didn't make it in are worth naming.

Code Interpreter is the right choice when code has to run in a sandbox: the LLM authors it at runtime, it carries untrusted dependencies or it comes from a source outside the agent's own deployment artifact. The scoring script here has 250 lines of standard library Python, authored by the agent's owner, deployed in the same place as main.py. There is no organizational or security boundary for Code Interpreter to enforce. Adding a managed sandbox would introduce spin-up latency, a separate billing line, a separate failure mode and a service dependency.

Policy manages what agents are allowed to do — which tools they can call, which users can invoke them, which actions are gated behind approval. It earns its place in multi-tenant deployments where different users have different permissions or in agentic workflows where the consequences of a wrong action are hard to reverse.

Gateway exposes an agent as a managed API endpoint with authentication, rate limiting, and request routing. It's the right choice when the agent is a shared service consumed by multiple callers across organizational boundaries.

Every primitive on this list is genuinely useful for the problems it was designed to solve. The real challenge is identifying which problem you actually have. Adopting every available managed service does not make an agent more sophisticated, and it often makes the system harder to reason about, more expensive to run or even more fragile.

When evaluating a new service or capability, ask three questions: what complexity does it introduce, what failure modes come with it, and does it meaningfully simplify the system?

IX. How to deploy and run it

Prerequisites

Before deploying, make sure you have:

An AWS account with Amazon Bedrock access
The agentcore CLI installed (npm install -g @aws/agentcore-cli)
AWS credentials configured with permissions for Bedrock, CloudFormation, IAM, ECR, and S3

To confirm your credentials are working:

aws sts get-caller-identity

Testing locally before deploying

To iterate on the agent without incurring a full deploy cycle, you can run the local dev server:

agentcore dev

This starts the runtime at http://localhost:8080. Bedrock model calls still go to AWS, so you need valid credentials, but no CloudFormation changes are made. Memory is not active in dev mode unless you export MEMORY_ID manually with the ID from your deployed memory resource.

Deploying

From the project root:

agentcore deploy

The CLI runs CDK under the hood, synthesizing a CloudFormation stack, bootstrapping the CDK environment if needed, and provisioning the runtime, the memory resource, and the IAM roles. The first deploy takes a couple of minutes. When it completes you'll see a runtime ARN in the output — that ARN is the deployed endpoint.

Running an invocation

The project ships with four example briefings under app/subject_line_optimizer/briefing/examples/. Each is a self-contained campaign briefing JSON ready to send.

To invoke the deployed agent with the reactivation example:

agentcore invoke \
  --target default \
  --prompt-file app/subject_line_optimizer/briefing/examples/reactivation.json \
  --session-id "replace-with-a-session-id-of-33-plus-chars" \
  --user-id "your-user-id" \
  --stream

A few things to note:

--target default routes to the deployed endpoint, not the local dev server
--prompt-file reads the briefing directly; the CLI wraps it in {"prompt": "..."} before sending, so pass the raw briefing file
--session-id must be at least 33 characters; a UUID works well
--user-id scopes the AgentCore Memory namespaces — use a consistent identifier across sessions so the agent accumulates preferences for that user over time
--stream prints each chunk as it arrives, so you see the rounds progressing in real time

Example output

Optimizing subject lines for: Q3 Lapsed Customer Reactivation

[round 1]
  83.0  Ready to rediscover your favorite things? Claim 25% off now
        (LEN_ACCEPTABLE_20_60, URGENCY_BASE_LIFT, VALUE_PERCENT_OFF)
  80.0  Unlock 25% off – your loyalty reward is ready to claim
        (LEN_ACCEPTABLE_20_60, VALUE_PERCENT_OFF, LOYALTY_LANGUAGE)
  ...
  pruned: 3
  guidance for next round: Avoid overly conversational openings...

[round 2]
  91.2  Claim your 25% loyalty reward – 14 days to save
        (LEN_SWEET_SPOT_30_50, LOYALTY_LANGUAGE)
  ...

=== Final shortlist ===
1. Claim your 25% loyalty reward – 14 days to save
   score 91.2   open-rate band 42.0–50.0%
   ...

The final chunk is a machine-readable JSON object with the full shortlist, per-round logs, and plateau status.

There is currently no built-in UI in the AWS console to browse memory contents. A community-built tool called AgentCore Memory Browser fills this gap.

Closing thoughts

The thing I keep coming back to is that AgentCore doesn't have an opinion about what the agent should look like. Bedrock Agents did, and the opinion was reasonable for a lot of cases. AgentCore gives you a Runtime, a Memory service, a set of primitives, and trusts you to assemble them. That trust is the feature.

Upload, Describe, Discover: Architecting a Marketing Assets Library

Ana Silva — Sat, 02 May 2026 00:46:13 +0000

Glórund crossing heterogeneous terrain in search of Túrin felt like an apt opener for an article about searching across heterogeneous assets!

Is this too much of a stretch? Well, anyway…

If you search for "digital asset management software," you'll find many mature solutions. Adobe Experience Manager — probably the most recognizable name in enterprise marketing infrastructure — handles digital assets as part of a broader content management platform. Cloudinary and Bynder represent the more focused end of the spectrum: purpose-built DAMs with polished interfaces, rich metadata management, and integrations designed for marketing teams. These are mature, well-funded products with years of iteration behind them.

So why build one from scratch?

The honest answer: I didn't build this because the market had a gap. I built it because I had some questions:

How do you model metadata for creative assets that are structurally heterogeneous: a PNG, an HTML email and a push notification living in the same library?
How do you integrate an LLM into an indexing pipeline without making uploads feel slow?
How do you expose a single search endpoint that handles both rigid filter-based queries and natural language, without the interface becoming a mess?

These are questions that appear the moment you try to build anything resembling a searchable content repository. Whether you're integrating with an off-the-shelf DAM via API, building a lightweight internal tool or extending an existing platform, the underlying mechanics are the same. Understanding them gives you leverage regardless of which path you choose.

The fictional system I built — Orqestra Assets — is a DAM focused on marketing creative pieces: app banners (PNG), email templates (HTML), and SMS/push payloads (JSON). It's not a production system, it's a deliberate architecture built to answer those questions, with real code, real tradeoffs, and a stack that maps directly to what you'd use in an AWS environment.

It's also part of a larger platform I've been working on, so there may be more parts to come. Here, I'll walk through the architecture for Assets: how assets are ingested, how they're indexed asynchronously with LLM-generated descriptions and how search works across both structured filters and natural language queries.

The code is available on GitHub.

The solution draft

"Orqestra Assets" is built around three distinct flows that happen in sequence but are deliberately decoupled from each other:

Upload: a client uploads an asset or submits a text payload; the API stores the file in S3, registers a row in PostgreSQL, and publishes a message to an SQS queue.
Describe: a worker consumes the queue, generates a description if needed, creates an embedding and upserts a document into OpenSearch.
Discover: a client queries the library, either through structured filters resolved in SQL, or through natural language resolved via hybrid search in OpenSearch, enriched with data from Postgres.

The stack maps directly to AWS primitives you'd use in production: S3 for object storage, SQS for async decoupling, OpenSearch for vector and full-text search, PostgreSQL as the source of truth for structured metadata, and the OpenAI API for both description generation and embeddings.

1. Upload

The upload layer's job is to accept an asset, persist it reliably and hand it off to the indexing pipeline without blocking the client.

"Without blocking" is the key constraint. A multimodal LLM call for a PNG can take several seconds, so if the upload endpoint waited for indexing to complete before responding, the client experience would be unacceptable. Because of this, the API does the minimum necessary synchronously, and delegates everything else to a queue.

The API receives the file, generates an S3 key, stores the object, writes a row to PostgreSQL and publishes a message to SQS. The response returns immediately with the asset ID and indexing_status: pending.

The S3 key encodes the asset's channel and type in the path — a PNG uploaded to a campaign might land at campaigns/{id}/App/{space}/{uuid}.png — and uses a UUID as the filename. Separately, the API computes a SHA-256 of the content and stores it in the asset's metadata, giving you a foundation for deduplication logic if you need it later.

The OpenSearch document ID is derived from a hash of the S3 key. This means that if the same object triggers multiple indexing attempts — a duplicate queue message, an S3 notification racing with an explicit publish — the upsert always lands on the same document. Re-indexing is safe; OpenSearch doesn't accumulate duplicates.
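To make the idempotency argument concrete, here is a minimal sketch of a deterministic document ID. SHA-256 is an assumption (the text only says "a hash"), the key is made up, and a plain dict stands in for the OpenSearch index.

```python
import hashlib


def document_id_for(s3_key: str) -> str:
    """Return a stable document ID derived from the S3 key.

    SHA-256 is an assumption; any stable hash would give the same property.
    """
    return hashlib.sha256(s3_key.encode("utf-8")).hexdigest()


# A plain dict stands in for the search index to show the upsert behaviour.
index: dict[str, dict] = {}


def upsert(s3_key: str, body: dict) -> None:
    """Write the document under its derived ID, replacing any earlier copy."""
    index[document_id_for(s3_key)] = body


key = "campaigns/42/App/home/example.png"  # hypothetical key
upsert(key, {"description": "first attempt"})
upsert(key, {"description": "retry after a duplicate message"})
print(len(index))  # still a single document after two indexing attempts
```

Because the ID depends only on the key, a duplicate message overwrites the earlier document instead of creating a second one.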

PNG and HTML come as file uploads — POST /assets/upload-app and POST /assets/upload-email respectively. The API validates the format, reads the bytes, and writes the object to S3. SMS and push work differently: the client submits the message text as a JSON body to POST /assets/text, and the API itself serialises it into a .json file before writing it to the bucket. There is no file to upload; the file is constructed server-side.

All three paths write a row to the assets table with a channel and format field — App/png, E-mail/html, SMS/text, Push/text — and then publish to the same queue. By the time the worker picks up the message, it knows what it's dealing with: the combination of channel and format is enough to decide whether to call the vision model, the text completion model, or neither and go straight to embedding.

2. Describe

Once a message lands in the queue, a background worker takes over. Its job is to do everything the upload endpoint deliberately skipped: fetch the asset from S3, generate a description if the asset type requires one, create an embedding, and push the result to OpenSearch.

The worker is a long-polling SQS consumer. It receives batches of up to ten messages, processes them concurrently using a thread pool and deletes each message from the queue only after its asset has been successfully indexed. If processing fails, the message is not deleted: SQS makes it visible again after the visibility timeout, and the worker retries it on a later poll. Messages that exhaust all retries land in the DLQ (dead-letter queue).
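The delete-only-on-success rule can be sketched as a single polling step. This is an illustration under assumptions, not the project's worker: poll_once and handle are hypothetical names, and sqs is any object that behaves like a boto3 SQS client (receive_message and delete_message with the keyword arguments shown).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def poll_once(sqs, queue_url: str, handle: Callable[[str], None],
              max_workers: int = 10) -> int:
    """Receive one batch, process it concurrently, delete only successes.

    Returns the number of messages that were processed and deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS caps a single receive at ten messages
        WaitTimeSeconds=20,      # long polling
    )
    messages = response.get("Messages", [])
    if not messages:
        return 0

    def process(message: dict) -> bool:
        try:
            handle(message["Body"])
        except Exception:
            # Do not delete: the message becomes visible again after the
            # visibility timeout and is retried or eventually sent to the DLQ.
            return False
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
        return True

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(process, messages))
```

A real worker would call this in a loop and log the failures; the point of the sketch is only that deletion happens after successful handling, never before.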

For each message, the worker reads the s3_key from the payload, downloads the object from S3 and decides what to do based on channel and format. The decision tree from that point is straightforward. For PNG app banners, the worker encodes the image in base64 and sends it to a multimodal model with a prompt asking for a concise marketing description: dominant colours, visible text, campaign theme, appropriate channel. For HTML email templates, it decodes the file and sends it to the same model with a different prompt focused on the email's call to action, tone and campaign fit. For SMS and push payloads there is no LLM call: the text is extracted directly from the JSON stored in S3 and used as-is.
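The decision tree above could be expressed as a small routing function. The Strategy enum and choose_strategy are hypothetical names introduced for illustration; the channel and format values follow the pairs listed in the Upload section (App/png, E-mail/html, SMS/text, Push/text).

```python
from enum import Enum


class Strategy(Enum):
    VISION = "vision"            # multimodal call with the base64 image
    TEXT_LLM = "text_llm"        # same model, HTML-focused prompt
    PASSTHROUGH = "passthrough"  # no LLM call, payload text used as-is


def choose_strategy(channel: str, fmt: str) -> Strategy:
    """Map an asset's channel/format pair to a description strategy."""
    if fmt == "png":
        return Strategy.VISION
    if fmt == "html":
        return Strategy.TEXT_LLM
    if channel in {"SMS", "Push"} and fmt == "text":
        return Strategy.PASSTHROUGH
    # Failing loudly keeps unknown asset types from being indexed silently.
    raise ValueError(f"unsupported asset type: {channel}/{fmt}")
```

Keeping the routing in one pure function makes it easy to unit test without touching S3 or a model provider.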

In our case, minor overruns are acceptable, so a 500-character prompt instruction is sufficient to keep descriptions within a reasonable size without needing hard truncation or other techniques in code.

_SYSTEM_PROMPT = (
    "You are a digital marketing expert specialized in creative asset cataloguing. "
    "Generate concise, retrieval-optimized descriptions of marketing assets. "
    # Trailing space keeps this sentence from running into the next one.
    "Reply only with the description text, in English, in at most 500 characters. "
    "No introduction, no title, no 'here is', no numbered lists, no meta-commentary."
)

_PNG_PROMPT = (
    "Describe this creative asset for search retrieval. "
    "Include: dominant colors, main visual elements, visible text, campaign theme, and suitable channel."
)

_HTML_PROMPT = (
    "Describe this HTML email template for search retrieval. "
    "Include: main theme, call to action, message tone, and suitable campaign type."
)

This is the asymmetry the queueing design was built to absorb. A push notification costs one fast JSON parse. An app banner costs a vision model call that might take several seconds. From the upload client's perspective, both are the same: post the asset, get a response, check back later.

Once a description exists, the worker prepends the asset's display title if one was set, and passes the combined text to OpenAI's text-embedding-3-small to generate a vector. That vector, along with the description and the asset's structured fields (channel, format, locale, lifecycle status, campaign id), is upserted into OpenSearch under the document ID derived from the S3 key.
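As a rough sketch of that step, the two helpers below compose the text to embed and assemble the document body. The function names, the newline separator and the field names are all assumptions; the embedding call itself is left out because it depends on the provider client.

```python
from typing import Optional


def build_embedding_text(description: str, title: Optional[str] = None) -> str:
    """Prepend the display title, if one was set, to the description."""
    cleaned = (title or "").strip()
    # The newline separator is an assumption; no separator is specified.
    return f"{cleaned}\n{description}" if cleaned else description


def build_search_document(asset: dict, description: str,
                          vector: list[float]) -> dict:
    """Assemble the body to upsert (field names are hypothetical)."""
    return {
        "description": description,
        "embedding": vector,
        "channel": asset["channel"],
        "format": asset["format"],
        "locale": asset["locale"],
        "lifecycle_status": asset["lifecycle_status"],
        "campaign_id": asset["campaign_id"],
    }
```

Keeping the structured fields on the document is what lets OpenSearch apply filters and ranking in the same query.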

The final step is a write back to Postgres: the description and embedding_id columns are updated and indexing_status is set to indexed. If anything fails before that point, the status is set to error instead, and the message stays in the queue for retry.

One thing this pipeline doesn't do is validate description quality before indexing. A description that's technically successful but semantically weak lands in OpenSearch indistinguishably from a good one. The practical consequence is that recall degrades silently: users searching for "bold red promotional banner" may not surface an asset that matches visually, if the model described it as "a marketing creative with promotional messaging." Validating description quality without a reference set is hard. The most honest mitigation at this stage is observability: log every description, monitor length distributions across batches and treat significant anomalies as a signal to inspect manually.
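One simple way to act on the length-distribution idea is to flag descriptions whose length sits far from the batch average. The sketch below uses a z-score with an arbitrary threshold of 2.0; both the function name and the threshold are assumptions, and length is only a crude proxy for quality.

```python
from statistics import mean, pstdev


def flag_length_outliers(descriptions: dict[str, str],
                         z_threshold: float = 2.0) -> list[str]:
    """Return asset IDs whose description length deviates strongly from the batch."""
    lengths = {asset_id: len(text) for asset_id, text in descriptions.items()}
    if len(lengths) < 2:
        return []
    mu = mean(lengths.values())
    sigma = pstdev(lengths.values())
    if sigma == 0:
        return []  # every description has the same length
    return [asset_id for asset_id, n in lengths.items()
            if abs(n - mu) / sigma > z_threshold]
```

Flagged assets are candidates for manual inspection, not automatic rejection: a short description can still be a good one.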

⚠️ The local approach was to make embedding generation sit outside OpenSearch entirely. The trade-off of this option is that you own the orchestration: every indexing job and every search request carries an outbound API call to a model provider, with the associated latency and failures.

One alternative, available in production on Amazon OpenSearch Service, is to register a model connector either pointing to Amazon Bedrock or to an external provider, and delegate embedding generation to OpenSearch itself via an ingest pipeline processor at index time and a neural query at search time. In that setup, the worker would send plain text and OpenSearch would handle the vector internally, removing the custom embedding code from the application entirely.

3. Discover

On the library page, assets appear as a card grid: PNG banners render with a thumbnail, while HTML and text assets show an icon and a type badge. Clicking any card opens a detail sheet with the full metadata, the generated description, the indexing status and a download link.

Above the grid sits a search bar and a row of filters: channel, format, locale, lifecycle status, tags and campaign partition. They all coexist on the same view and feed the same request. A user can narrow to all active push notifications in Brazilian Portuguese or type a natural language query like "summer promotion with red background" and let the ranking handle the rest. The user can also do both at once, combining structured filters with semantic search in a single call. Typing triggers a debounced query, so the grid updates as the user types without hammering the API on every keystroke.

When no query text is provided, the request goes entirely through Postgres. Assets are filtered by the supplied fields, ordered by creation date and paginated. It's a straightforward SQL query and returns quickly. When a query string is present, the path is different.
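The filter-only path can be sketched with the standard library. SQLite stands in for Postgres here (so the placeholder is ? rather than %s), and the table layout, column names and list_assets function are assumptions made for illustration.

```python
import sqlite3


def list_assets(conn: sqlite3.Connection, filters: dict,
                limit: int = 20, offset: int = 0) -> list[str]:
    """Filter by the supplied fields, newest first, paginated."""
    allowed = {"channel", "format", "locale", "lifecycle_status"}
    clauses, params = [], []
    for column, value in filters.items():
        if column not in allowed:
            raise ValueError(f"unsupported filter: {column}")
        # Column names come from the allow-list; values are bound parameters.
        clauses.append(f"{column} = ?")
        params.append(value)
    where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = (f"SELECT id FROM assets {where} "
           "ORDER BY created_at DESC LIMIT ? OFFSET ?")
    return [row[0] for row in conn.execute(sql, (*params, limit, offset))]
```

The allow-list matters because column names cannot be bound as parameters, so they must never come straight from user input.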

How hybrid search works

To understand why hybrid search matters here, it helps to understand what each component does on its own.

BM25 is the algorithm behind traditional keyword search. It ranks documents by how often the query terms appear in them, adjusted for document length and for how rare each term is across the corpus. It's fast, interpretable and works well when the user knows the right words. But it's brittle: a query for "urgent promotional tone" returns nothing if none of those exact words appear in the indexed descriptions, even if a perfectly relevant asset exists.
kNN (k-nearest neighbors) operates on embeddings — vector representations that encode semantic meaning rather than surface text. When you embed the query and search for the nearest vectors in the index, you're finding assets that are conceptually similar, regardless of whether they share any words with the query. This is what makes "something warm and summery for a mobile audience" a valid search. kNN is indifferent to exact matches, though, so a query for a specific campaign name or a precise tag will often return semantically adjacent but wrong results.

Hybrid search combines both. In this project, when a natural language query arrives, the API embeds it in real time using text-embedding-3-small, the same model used during indexing, and sends both the query string and the embedding to OpenSearch as a hybrid query. OpenSearch runs the BM25 and kNN sub-queries in parallel, normalizes each score set independently using min-max normalization, and combines them into a single ranking via weighted arithmetic mean. The weights favor the vector component slightly, on the assumption that semantic similarity is more useful than keyword overlap for creative asset retrieval.
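To make the score combination concrete, here is a small standalone sketch of min-max normalization followed by a weighted arithmetic mean. It illustrates the arithmetic only and is not OpenSearch's implementation; the function names are hypothetical, the 0.35 / 0.65 weights are the ones reported in the evaluation section, and treating a document missing from one sub-query as scoring 0.0 there is a simplifying assumption.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one sub-query's scores to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(bm25: dict[str, float], knn: dict[str, float],
                w_lexical: float = 0.35,
                w_vector: float = 0.65) -> list[tuple[str, float]]:
    """Combine normalized scores with a weighted arithmetic mean."""
    lex, vec = min_max(bm25), min_max(knn)
    total = w_lexical + w_vector
    combined = {
        # A document missing from one sub-query contributes 0.0 for it.
        doc: (w_lexical * lex.get(doc, 0.0) + w_vector * vec.get(doc, 0.0)) / total
        for doc in set(lex) | set(vec)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Normalizing each score set first is what makes the weights meaningful: raw BM25 scores are unbounded, while vector similarity scores usually live in a narrow range.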

The matching asset IDs come back from OpenSearch without the full metadata. The API then fetches the corresponding rows from Postgres, applying whatever SQL-only filters remain (such as tags), and re-orders them to match the ranking OpenSearch produced. What the client receives is a page of fully hydrated asset objects, ordered by relevance, with pagination driven by the original limit and offset parameters.
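The re-ordering step is worth spelling out, because a SQL query that filters by a list of IDs does not return rows in relevance order. A minimal sketch, with a hypothetical function name and rows represented as plain dicts:

```python
def hydrate_in_rank_order(ranked_ids: list[str],
                          rows: list[dict]) -> list[dict]:
    """Return database rows in the order the search engine ranked them.

    IDs with no matching row (for example, filtered out by a SQL-only
    filter) are skipped rather than raising.
    """
    by_id = {row["id"]: row for row in rows}
    return [by_id[asset_id] for asset_id in ranked_ids if asset_id in by_id]
```

Building the lookup dict once keeps the step linear in the number of results.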

The separation between the two stores is intentional. OpenSearch owns relevance ranking, while Postgres owns the "facts" metadata.

Does it work?

Hybrid search returns results, but that doesn't mean it returns the right results. Without a way to measure retrieval quality, tuning the pipeline is guesswork: you don't know whether changing parameters helped or whether a prompt revision improved description usefulness. Evaluation doesn't need to be elaborate to be useful, but it needs to be systematic.

What I built in this project is a lightweight evaluation pipeline that tests natural-language retrieval quality only. That means no deterministic UI filters (channel, format, locale, etc.) are allowed to influence the score. Each test query is sent as plain language, and the system must rank relevant assets using the same hybrid search path the application itself uses.

What was built

The evaluation flow is split into three scripts:

evals/generate_eval_dataset.py
evals/upload_assets.py
evals/run_eval.py

Together, they form a reproducible loop from synthetic asset generation to scored retrieval results.

1) Dataset generation (generate_eval_dataset.py)

This script creates a controlled benchmark corpus:

synthetic creative assets (PNG, HTML, SMS, Push);
a manifest describing each asset;
a query specification file (query_specs.json) containing query, expected_ids and type.

The query types were reorganized around the search intent:

exact_intent
paraphrase_intent
cross_channel_intent
ambiguous_intent

This makes reporting easier to interpret: you can see whether the engine performs differently on literal requests, paraphrased requests, cross-channel intents or harder ambiguous intents.

2) Upload + dataset resolution (upload_assets.py)

This script uploads generated assets to the DAM API, waits for indexing, and resolves logical IDs into real API asset IDs. It then builds eval_dataset.json, which is what the runner consumes.

3) Evaluation runner (run_eval.py)

The runner reads eval_dataset.json, sends each query to POST /assets/search and compares ranked results against expected relevant assets.

What it measures

The evaluation reports quality per query, per intent category and globally.

The metrics are:

Success@1 — Did the first result match any expected relevant asset?
Success@3 — Did at least one relevant asset appear in top 3?
Recall@3 — How much of the relevant set appears in top 3?
MRR — How early does the first relevant result appear?
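For reference, the per-query versions of these metrics fit in a few lines. This is a sketch of the standard definitions, not the code in run_eval.py; the function names are hypothetical, and the reciprocal rank here looks at the whole returned list, which is an assumption. MRR is then the mean of the reciprocal rank across queries.

```python
def success_at_k(ranked: list[str], expected: set[str], k: int) -> float:
    """1.0 if any expected asset appears in the top k, else 0.0."""
    return 1.0 if any(asset_id in expected for asset_id in ranked[:k]) else 0.0


def recall_at_k(ranked: list[str], expected: set[str], k: int) -> float:
    """Share of the expected assets that appear in the top k."""
    if not expected:
        raise ValueError("expected set must not be empty")
    return len(set(ranked[:k]) & expected) / len(expected)


def reciprocal_rank(ranked: list[str], expected: set[str]) -> float:
    """1 / position of the first expected asset, or 0.0 if none is returned."""
    for position, asset_id in enumerate(ranked, start=1):
        if asset_id in expected:
            return 1.0 / position
    return 0.0
```

Averaging each function over all queries, or over the queries of one intent category, gives the global and per-category figures.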

How to run it

From the repository root:

# 1) Start the stack
docker compose up -d --build

# 2) Build eval dataset from existing uploaded mapping
docker compose --profile eval run --rm eval-upload --build-dataset-only

# 3) Run evaluation
docker compose --profile eval run --rm eval-run

Outputs:

evals/output/eval_results.json for per-query details
evals/output/eval_summary.json for aggregate metrics

Interpreting results

With K=3, the benchmark produced:

Success@1: 0.7143
Success@3: 1.0000
Recall@3:  0.8988
MRR:       0.8393

The big picture: the right asset is always somewhere in the top 3, but it only shows up in first place about 71% of the time. The system is good at finding the right assets, but not always at ranking them first.

1. Paraphrase intent — perfect

S@1 = 1.00, S@3 = 1.00, Recall@3 = 1.00, MRR = 1.00

When a query describes what is wanted in natural language (pink floral banner for mother's day, abandoned cart recovery email), the system gets it right every time. All 12 queries in this category landed the correct asset at rank 1.

This is the category the system is built for: the LLM-generated descriptions and the embeddings are doing exactly what they should, bridging the gap between how a user phrases a request and how the asset was originally described. With the hybrid weighting set at 0.35 lexical / 0.65 vector, this is also the category that benefits most from the current configuration.

2. Cross-channel intent — mostly limited by the K=3 cap

S@1 = 1.00, S@3 = 1.00, MRR = 1.00, Recall@3 = 0.7083

When a query expects 4 assets (one per channel) and we only look at the top 3, we can never get full recall. That's a limitation of how we chose to measure, not of the system itself. Three of the four queries hit this ceiling cleanly.

The exception is reactivation of inactive customers with offer: the email shows up first, but the SMS and Push versions don't make the top 3, even though they exist and the system finds them on other queries. This one query is dragging the average down.

It's also worth noting that K=3 is a deliberate choice, not the only sensible one. It reflects what users actually see in the first row of results, but for cross-channel queries it under-rewards the system. A small refinement worth considering would be reporting Recall@N (where N matches the number of expected assets).
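The ceiling mentioned above is simple arithmetic: with k result slots and n expected assets, at most min(k, n) of them can be retrieved. A tiny sketch, with a hypothetical function name:

```python
def recall_ceiling(expected_count: int, k: int) -> float:
    """Best achievable Recall@k when expected_count assets are relevant."""
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    return min(k, expected_count) / expected_count


print(recall_ceiling(4, 3))  # 0.75: four expected assets, three slots
print(recall_ceiling(4, 4))  # 1.0: Recall@N removes the cap
```

So a cross-channel query with four expected assets tops out at 0.75 under K=3, however good the ranking is.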

3. Exact intent — the weak spot

S@1 = 0.125, S@3 = 1.00, Recall@3 = 1.00, MRR = 0.5208

For keyword-style queries (free shipping, order tracking, black friday urgency countdown) the right asset is always in the top 3, but almost never at rank 1. It usually lands at rank 2 or 3, behind thematically similar assets.

The cause is fairly direct: the hybrid search currently weights lexical matches at 0.35 and vector matches at 0.65. That bias works beautifully for paraphrased queries, but for short, literal queries it lets thematically related assets outrank the one that matches the exact words. Since the right asset still lands in the top 3 every time, this is acceptable for now.

4. Ambiguous intent — better than expected

S@1 = 0.75, S@3 = 1.00, Recall@3 = 0.5833, MRR = 0.8333

Vague queries actually do better at S@1 than the exact ones above, which reinforces the idea that the system favours semantic matches. For example, creative with warm tone and soft visual elements correctly surfaces the Mother's Day pink floral banner at rank 1, even though nothing in the query mentions Mother's Day or florals.

Recall@3 is the lowest of all categories, but that's expected: when a query is broad, more assets could plausibly be relevant, and not all of them fit in the 3 slots.

Wrapping up

This project forced a series of decisions that documentation tends to skip over: where exactly to place filters, why description quality matters, how the failure model of a queue-based pipeline is fundamentally different from a synchronous one. Those decisions only become visible when you have to make them yourself.

One thing I'd revisit is embedding ownership. Generating embeddings in the application layer works fine at this scale, but it's something that Amazon OpenSearch Service can absorb in production through model connectors and neural queries. Whether that tradeoff is worth it depends on how much you want to own.

Evaluation showed that the search reliably finds the right things. The main area that could be improved is ranking, especially for short keyword queries. One option worth exploring could be to give literal word matches a bit more weight in the hybrid search.

If you've built something similar or made different trade-offs around indexing, search or evaluation, I'd be curious to hear how you approached it.
