DEV Community: Maisie Ouyang

S3 Annotations: Three Practical Use Cases in Agent Scenarios

Maisie Ouyang — Wed, 24 Jun 2026 02:12:12 +0000

S3 objects have historically carried three types of native metadata: system-defined metadata, user-defined metadata (2 KB, immutable after upload), and object tags. With Amazon S3 Metadata — a separate bucket-level feature — you can aggregate all metadata types into managed Iceberg tables for SQL-based querying.

In June 2026, AWS introduced another metadata type: S3 Annotations — mutable, structured payloads (up to 1,000 per object, each up to 1 MB, totaling 1 GB) in JSON, XML, YAML, or plain text that can be added, updated, or deleted at any time without touching the underlying object.

Having worked closely with Agent infrastructure and being familiar with Agent architectures, I'd like to share three concrete use cases based on my experience.

Scenario 1: Meeting Summary — Keeping Processing Results Attached to the Source

📌 Core point: S3 Annotations let a file's context (transcripts, summaries, action items) live on the file itself — so deletion, replication, and migration don't require separate mapping or cleanup logic. The meeting summary example illustrates this, but the pattern applies to many scenarios where processing context needs to stay bound to its source object. 📌

Most Agent products today offer some form of meeting transcription and summarization. The Agent takes a recording — either captured live via a streaming model, or uploaded after the fact for batch processing — produces a full transcript, and then generates structured meeting minutes with action items.

In practice, companies building these Agents retain the recording alongside its transcript and minutes for future search, knowledge building, and audit purposes. That raises a straightforward question: how should these files live together, and how do you keep them linked?

The Usual Approach

Typically, the recording, transcript, and minutes end up as separate S3 objects grouped by a shared path prefix:

s3://agent-data/meetings/
├── 2026-06-15-standup/
│   ├── recording.mp4
│   ├── transcript.json
│   └── minutes.json
├── 2026-06-16-review/
│   ├── recording.mp4
│   ├── transcript.json
│   └── minutes.json
└── ...

Or their relationship is tracked in an external database (DynamoDB, RDS, etc.).

Either way, the context (transcript, minutes) lives separately from the data (recording), and their association depends on the application layer. At scale — thousands or tens of thousands of recordings — this starts to cost you:

Deletion requires coordination: removing a recording means you also need to clean up its associated files or database mappings. Miss one, and you've got orphaned data accruing storage charges.
Replication and migration need extra care: moving data across regions or buckets means ensuring every related file travels together, or updating path mappings in your external system.
The link isn't native to S3: S3 has no idea that recording.mp4 and transcript.json belong together. That relationship exists only in your path convention or external database — any gap in that layer, and the association breaks silently.

The underlying issue: context doesn't live with the data. They're separate objects, held together by convention.

How S3 Annotations Change This

With S3 Annotations, you attach the transcript and minutes directly to the recording object itself:

transcript: the full transcript document
meeting_minutes: the generated minutes
action_items: structured action items

Each is an independent, named annotation that can be written or updated separately. What this gives you:

Lifecycle is automatic — delete the recording, and its annotations go with it. No orphans.
Copy and replication just work — annotations follow the object by default. No extra rules to configure.

Why User-Defined Metadata Doesn't Work Here

Take the offline case: a user uploads a 1 GB recording, and only then does the Agent begin transcription. S3's user-defined metadata (x-amz-meta-*) can't help because:

It's write-once: you set it at upload time, and it's locked. The transcript and minutes don't exist yet at that point — there's nothing to write.
It's tiny: the 2 KB total limit can't hold a transcript that runs tens of KB.
Updating means copying the object: to change metadata, you'd have to copy the entire 1 GB file. That's not practical.

S3 Annotations solve all three: they're writable at any time, hold up to 1 MB each, and are mutable without touching the object.

Scenario 2: Process Once, Search Forever — Structured Retrieval Without Vectorization

📌 Core point: When Agents output structured JSON as annotations — whether as processing results (summaries, transcripts) or generation context (prompts, outlines) — those files become precisely searchable via annotation tables without any additional vectorization pipeline. This isn't a full replacement for semantic search, but it means vector-based retrieval is no longer the only option for multimodal file discovery.📌

Today's Agent products handle multimodal content across the board — generating images, building presentations, summarizing PDFs, transcribing recordings. In doing so, Agents naturally produce structured outputs: summaries, classifications, outlines, prompts. If you write these as structured JSON annotations at processing time, those files become searchable immediately — no separate vectorization pipeline needed, and no repeated analysis when the file comes up again later.

Examples Across Modalities

File Type	What the Agent Does	JSON Annotation	How You Search Later
Recording	Transcribes and generates structured minutes	`{"date": "...", "topics": [...], "action_items": [...]}`	"Meetings about EKS upgrade last week" — filter by topic, date, owner
PDF	Extracts summary, entities, key insights	`{"summary": "...", "entities": [...], "keywords": [...]}`	"Reports mentioning customer X" — filter by entity, no re-analysis
Image/Video	User generates via prompts with iterative refinement	`{"prompt": "...", "iterations": [...], "style": "..."}`	"That cyberpunk logo I made" — search prompt keywords directly
PPT	Generates presentation from user requirements, producing an outline	`{"outline": [...], "topics": [...], "slide_count": 12}`	"My cost optimization deck" — search by outline topics

The Key Design Decision: Output JSON, Not Plain Text

These scenarios split into two types:

Processing results (recordings, PDFs): the Agent's job is to analyze the file and produce output. That output becomes the annotation.
Generation context (images, PPT): the Agent's job is to create something for the user. The intermediate context — prompts, iteration history, outlines — is a byproduct that's worth preserving.

In both cases, the architectural principle is the same: design the Agent workflow to emit structured JSON rather than plain text for anything that should be searchable later. Structured JSON enables precise field-level queries through annotation tables; plain text limits you to fuzzy LIKE matching.

For meeting minutes, this means having the Agent produce:

// annotation name: "meeting_minutes"
{
  "date": "2026-06-15",
  "topics": ["EKS upgrade", "cost optimization"],
  "action_items": [
    {"owner": "Alice", "task": "Complete EKS 1.30 upgrade plan"},
    {"owner": "Bob", "task": "Submit RI purchase request"}
  ],
  "summary": "Discussed EKS cluster upgrade timeline and Q3 cost optimization goals..."
}

With structured fields, you can run json_extract_scalar queries — filtering by date, grouping by topic, searching by assignee — things that are impossible against freeform text.

Cross-File Search via Annotation Tables

To query across objects, you need S3 Metadata — a bucket-level feature (not the object-level user-defined metadata discussed earlier; the naming is confusing, but they're completely different things). Once enabled, S3 automatically streams all annotations into a managed Apache Iceberg table (annotation tables).

From there, Agents can query with Athena SQL:

-- Find meetings with EKS-related action items from last week
SELECT bucket, object_key, text_value
FROM annotation_table
WHERE name = 'meeting_minutes'
AND text_value LIKE '%EKS%'
AND json_extract_scalar(text_value, '$.date') >= '2026-06-09'

Or via S3 Tables MCP Server, Agents can simply ask: "Find all meeting minutes that discussed EKS upgrade last week." The MCP Server handles the SQL translation.

How This Relates to Vector Search

Traditional multimodal search requires a vectorization pipeline — embedding model, vector store (like S3 Vectors), index maintenance. It's great for semantic similarity. But in the scenarios above, the Agent is already producing structured content as part of its job. Writing that content as a JSON annotation gives you queryability for free — no extra pipeline.

These aren't substitutes; they solve different problems. Vector search handles semantic queries (fuzzy, open-ended). Annotation tables handle structured queries (exact field matching, aggregation). Use them together or independently, depending on the scenario.

Scenario 3: Context Offloading and Long-Term Archival — Self-Contained Traces

📌 Core point: When Agents archive execution traces to S3, offloaded context can be attached as annotations rather than stored at separate paths — making each trace a self-contained unit that doesn't depend on external file references remaining valid over time.📌

In Agent context engineering, context offloading is standard practice. An Agent working on a document or code task pulls in content from many external files. Once that content is no longer needed in the active window, the Agent swaps it out — replacing the actual content with a file path or link. If it's needed again, the Agent reads from that path.

When the session ends, the full execution trace goes to S3 for long-term storage. At that point, the trace contains only path references; the offloaded content lives at a separate S3 location. This creates the same problem as Scenario 1: the trace and its referenced content are decoupled. Paths can go stale after bucket reorganization, and lifecycle management requires extra care.

How S3 Annotations Help

Attach the offloaded content directly as annotations on the trace object. The relationship goes from "two separate objects linked by a path string" to "one self-contained object." The trace carries everything needed to reconstruct the full execution context, independent of any external file paths.

This applies specifically to post-session archival — active sessions still manage context in memory or a database. S3 Annotations address the archival problem: how to make stored traces fully self-contained.

The payoff is simple: when you need to retrace a past execution — for debugging, auditing, or session restoration — you don't need to chase down external paths and verify they still resolve. Just read the annotations on the trace object. Self-contained archives are dramatically easier to maintain long-term.

How to Use S3 Annotations in Practice

S3 Annotations offer three levels of interaction, each suited to different needs:

Method	Best For	How It Works
Direct API	Single-object read/write	CRUD operations on individual annotations
Annotation Tables + Athena	Cross-object batch queries	Requires S3 Metadata enabled; annotations auto-flow into Iceberg tables
MCP Server	Natural language retrieval	Agent queries without writing SQL

Direct API

The foundation. Agents use these when they know exactly which object to work with. Amazon S3 supports the following API operations for annotations:

PutObjectAnnotation – Creates or overwrites an annotation on an object. You specify the annotation name and payload in the request.
GetObjectAnnotation – Returns the payload of a specific annotation by name.
ListObjectAnnotations – Returns the list of annotations on an object. The response includes each annotation's name, size, ETag, and last modified date.
DeleteObjectAnnotation – Removes a specific annotation by name.

Example — writing a transcript annotation:

PUT /meeting-001.mp4?annotation&name=transcript HTTP/1.1
Content-Type: application/json

{"date": "2026-06-15", "speakers": [...], "segments": [...]}

These suit the case where the Agent just finished processing a file and needs to persist results, or a user referenced a specific file and the Agent needs to pull its annotation.

Annotation Tables + Athena SQL

When you need to search across objects — "which recordings are missing minutes," "all meetings about EKS last week" — you can't call single-object APIs one by one.

Enable S3 Metadata (bucket-level) and annotations automatically flow into managed Iceberg tables:

Journal table: near-real-time, good for detecting fresh annotations quickly
Annotation table: ~1 hour refresh, good for batch analysis and auditing

Query with Athena:

-- Recordings that have a transcript but no minutes yet
SELECT bucket, object_key
FROM annotation_table a
WHERE a.name = 'transcript'
AND NOT EXISTS (
  SELECT 1 FROM annotation_table b
  WHERE b.object_key = a.object_key AND b.name = 'meeting_minutes'
)

-- Find all meetings about EKS from last week
SELECT bucket, object_key, text_value
FROM annotation_table
WHERE name = 'meeting_minutes'
AND text_value LIKE '%EKS%'
AND json_extract_scalar(text_value, '$.date') >= '2026-06-09'

This is ideal for batch scheduling, pipeline monitoring, and any query that needs precise conditions.

MCP Server + Natural Language

S3 Tables MCP Server puts a natural language interface on top of annotation tables. Agents describe what they want:

"Find all meeting minutes that discussed EKS upgrade last week"
"Which PDFs haven't been summarized yet?"
"List all open action items assigned to Alice"

The MCP Server generates the SQL, runs it, and returns results. This fits conversational Agent scenarios where users ask questions in natural language and the Agent needs to search across its historical outputs.

Conclusion

This article explored three distinct ways S3 Annotations can be applied in Agent scenarios, each with a different focus:

Scenario 1 — Lifecycle binding: processing results (transcripts, summaries) stay attached to the source file. No separate mapping, no orphaned files, no manual cleanup on deletion or migration.
Scenario 2 — Structured multimodal retrieval: by outputting structured JSON during processing or generation, Agents gain searchable metadata for free — enabling precise retrieval across recordings, PDFs, images, and presentations without a dedicated vectorization pipeline. Semantic vector search is no longer the only option for multimodal file discovery.
Scenario 3 — Self-contained archival: offloaded context lives on the trace object itself, making long-term archives independent of external file paths.

The three scenarios address different needs, but share one insight: Agents already produce structured context as part of their work — S3 Annotations give that context a place to live that's natively bound to the data, lifecycle-managed, and queryable. If you're building on S3 as your Agent's persistence layer, it's worth evaluating which outputs could benefit from this pattern.

Some Caveats

Two behaviors worth noting in specific configurations (source):

Multipart upload copies (objects > ~8 MB): when the AWS CLI or SDK uses Transfer Manager to copy large objects, annotations are not copied by default. Specify --copy-props all in the CLI or the equivalent SDK configuration to include them.
Versioned buckets: a simple DELETE (without specifying a version ID) creates a delete marker, but annotations on the underlying version remain intact. This doesn't break the lifecycle binding described in Scenario 1 — it simply means that in versioned buckets, you need to specify the version ID when deleting to fully remove both the object and its annotations.

References:

Amazon S3 annotations: attach rich, queryable context directly to your objects — AWS Blog, 2026-06-16
Annotating your objects - Amazon S3 User Guide
S3 Metadata annotation table schema
Building persistent memory for multi-agent AI systems with Amazon S3 Vectors — AWS Storage Blog, 2026-06-08
Derive intelligent storage insights using S3 Metadata and MCP — AWS Storage Blog
S3 Metadata Features - Data Discovery Accelerator

#AmazonS3 #AIAgents #ContextEngineering #CloudArchitecture #AWS

From Agent Loop to Durable Execution: An Architecture Guide for Production Agents

Maisie Ouyang — Wed, 10 Jun 2026 07:35:19 +0000

Agent frameworks like Claude SDK, LangGraph, and Strands already give us a lot: the Agent Loop, error handling, task dispatch, memory management, and engineering scaffolding for getting agents to production quickly.

So when I first encountered orchestration engines like Temporal and Lambda Durable Functions in production architectures, my reaction was: why do we need yet another layer?

After researching multiple implementations and reading through source code, I realized the confusion stems from three questions that rarely get addressed together:

Frameworks already handle the Agent Loop and basic production needs. Why add orchestration on top?
With orchestration managing execution, is the result still an "Agent" or just a "Workflow"?
What does the orchestration engine actually do — what problems does it solve that frameworks don't?

This post is my attempt to untangle these. Chapter 1 addresses questions 1 and 2; Chapter 2 addresses question 3 with concrete scenarios.

Chapter 1: Agent Loop and Agent Orchestration

1.1 Three Concepts: Agent Loop, Agent Framework, and Orchestration Engine

To answer question 1, we first need to distinguish three concepts that often get conflated:

Agent Loop — application-level reasoning logic. How the agent selects tools, manages context, decides whether to plan or act, and when to stop. This is what makes an agent an agent.

Agent Framework (Claude SDK, LangGraph, Strands) — application-level engineering layer. Wraps the Agent Loop with production-ready scaffolding: error handling, memory ingestion, tool registration, logging, tracing, multi-agent task dispatch. Its value is acceleration — turning an Agent Loop into something deployable fast.

Orchestration Engine (Temporal, Lambda Durable Functions) — infrastructure layer. Externalizes execution state to solve problems that emerge at production scale: long waits (days for human approval), crash recovery across process boundaries, distributed coordination between agents, and fine-grained retry policies.

┌─────────────────────────────────────────────────────────┐
│  Application Logic: Agent Loop                           │
│  "How the agent thinks and acts"                         │
│  LLM reasoning → select tool → execute → observe → loop │
│  Implemented by: developer-defined logic                 │
├─────────────────────────────────────────────────────────┤
│  Application Engineering: Agent Framework                │
│  "Get this agent to production fast"                     │
│  Error handling / memory / tracing / multi-agent dispatch│
│  Implemented by: Claude SDK / LangGraph / Strands        │
└─────────────────────────────────────────────────────────┘
                        ↕ complementary layers
┌─────────────────────────────────────────────────────────┐
│  Infrastructure: Orchestration Engine                    │
│  "Reliably run this agent at production scale"           │
│  State persistence / crash recovery / long waits (HITL) /│
│  distributed coordination / fine-grained retry           │
│  Implemented by: Temporal / Lambda Durable Functions     │
└─────────────────────────────────────────────────────────┘

Frameworks simplify development and accelerate time-to-production; orchestration engines ensure stability and reliability at scale. They are complementary — not competing.

Why do agents specifically need this infrastructure layer? As Temporal's blog puts it:

"AI applications and agents are distributed systems. I even suggest they are distributed systems on steroids because your app may end up making an order of magnitude more remote requests to fulfill a user experience." — Temporal Blog

A single agent task might involve 10+ LLM calls, external APIs, tool executions, and human approval waits spanning days. All can fail, timeout, or be interrupted. Framework-level error handling covers basic retries, but it can't persist state across process boundaries, survive pod evictions, or coordinate multi-agent workflows over days. This is why we need orchestration engines — they fill the reliability gap that frameworks alone cannot close.

1.2 Does Orchestration Kill Agent Autonomy?

When I first encountered orchestration engines, before fully understanding how they work, my immediate worry was: does orchestration mean the agent loses its autonomy?

Think about what makes an agent an agent. Consider a PPT-generation Agent: you give it tools (create slides, search images, write text) and a goal ("make a presentation about Q2 results"). The agent autonomously decides its own plan — maybe it drafts an outline first, then generates content slide by slide, iterating and revising as it goes. The tool selection, execution order, and loop count are all determined by the LLM in real-time. This autonomous decision-making is the essence of "agent."

Now introduce an orchestration engine. Does that mean the agent's decision-making has been pre-orchestrated — each step defined in advance, like "Step 1: generate outline → Step 2: write content → Step 3: produce file"? If so, the agent has lost its autonomy — what we have is no longer an "Agent" but a pre-defined "Workflow." Is the orchestrated result still an Agent, or has it become a Workflow?

After studying multiple production implementations and reading through their source code, I found three patterns. My conclusion: orchestration does NOT have to kill agent autonomy. In all three patterns, the agent can retain full decision-making capability.

Pattern A — External Orchestration (Agent Loop as Black Box)

This is the most intuitive pattern. The orchestration layer treats each Agent invocation as a single, opaque step. It manages the flow between agents but never looks inside.

Orchestration Layer (Temporal / Lambda DF)
  Step 1: Invoke Budget Agent  
    [Agent Loop (black box): LLM→Tool→LLM→Done]
  Step 2: Wait for user approval (Signal/waitForCallback)
  Step 3: Invoke Analysis Agent  
    [Agent Loop (black box)]
  Step 4: Merge outputs

Each step internally runs a complete Agent Loop — the agent decides what tools to call, how many iterations to run, when to stop. The orchestration layer only sees "Budget Agent started → Budget Agent finished." It manages sequencing, retry on failure, and the human-in-the-loop wait between steps.

Agent autonomy is fully preserved because orchestration operates entirely outside the reasoning loop. The trade-off is observability — it's coarse-grained: you can see "Agent-1 started → 5.1s → success," but you can't see what happened inside that 5.1s.

This pattern has a concrete public example. The AWS APN Blog published a multi-agent financial advisor built with AgentCore + Temporal. In the original source code, the architecture looks like this — an Orchestrator Agent dispatching tasks to specialist agents (Budget Agent, Financial Analysis Agent), each running its own complete Agent Loop:

The blog author then refactored this into a Temporal-orchestrated version, where each agent invocation becomes a Temporal Activity — the entire Agent Loop wrapped as a single durable step:

This is exactly Pattern A: orchestration manages the flow between agents (sequencing, retry, HITL waits), while each agent internally retains full autonomous decision-making.

Pattern B — Internal Orchestration (Every Step Visible)

Pattern B fuses the Agent Loop with the orchestration engine — every LLM inference and every tool call is wrapped as an independent orchestration step:

┌─ Orchestration Layer ──────────────────────────────────────────┐
│  Step 1: LLM inference → returns "call calculate_budget"       │
│  Step 2: Execute calculate_budget → {income: 6000}             │
│  Step 3: LLM inference → returns "call create_chart"           │
│  Step 4: Execute create_chart → chart_url                      │
│  Step 5: LLM inference → "Done"                                │
└────────────────────────────────────────────────────────────────┘

Looking at this, it immediately feels like a pre-defined workflow — step 1, step 2, step 3... in fixed sequence. Hasn't the agent been reduced to a sequential pipeline? Is this just a Workflow?

The answer is no — because Pattern B can be fully dynamic. Here's what it actually looks like in code:

// Dynamic Pattern B (Lambda Durable Functions)
export const handler = withDurableExecution(
  async (event, context) => {
    let messages = [{ role: "user", content: event.prompt }];
    let stepCount = 0;
    while (true) {  // ← loop count decided by LLM, not pre-defined
      const llmResult = await context.step(`llm-${stepCount}`, async () => {
        return await invokeBedrock(messages);
      });
      if (llmResult.hasToolCall) {
        const toolResult = await context.step(
          `tool-${llmResult.toolName}-${stepCount}`,  // ← step name generated dynamically
          async () => executeTool(llmResult.toolName, llmResult.toolInput)
        );
        messages.push(/* tool result */); stepCount++;
      } else { break; }  // LLM decides to stop
    }
});

The while(true) loop + dynamically generated step names mean: the LLM retains complete autonomy. It decides which tool to call next, how many iterations to run, and when to stop. Nothing is pre-defined.

What Pattern B actually does is move the Agent Loop from a black-box inside the framework to an explicit loop written with orchestration APIs. The behavior is identical to a framework-managed Agent Loop. The infrastructure benefit is that every step gets checkpointed — if the process crashes after step 3, it resumes from step 4 instead of starting over.

It's still an Agent. Just a durable one.

Official examples:
Temporal AI Cookbook (Agentic Loop with Claude);
community project temporal-ai-agent (545 Stars).

Pattern 3 — Hybrid (Composite)

Production systems often mix both patterns in the same workflow: black-box agent calls (Pattern A) for specialist sub-agents + fine-grained steps (Pattern B) for critical operations + HITL waits. This is the "Workflow-orchestrated Agent" pattern.

Temporal Workflow (Composite Pattern)

┌─ Step 1: call_llm() — Initial analysis ─────────────────┐
│  Fine-grained (Pattern B)                                │
└──────────────────────────────────────────────────────────┘
                            ↓
┌─ Step 2: invoke_agent() — Call specialist Agent ─────────┐
│  Black-box (Pattern A) — Agent runs full Loop internally │
└──────────────────────────────────────────────────────────┘
                            ↓
┌─ Step 3: wait_signal("approval") ────────────────────────┐
│  HITL pause, zero resources, can wait days               │
└──────────────────────────────────────────────────────────┘
                            ↓
┌─ Step 4: while loop — Fine-grained execution ────────────┐
│  Every LLM/Tool Call = independent Activity              │
│  Pattern B, individually retryable and auditable         │
└──────────────────────────────────────────────────────────┘
                            ↓
┌─ Step 5: send_notification() — Deterministic step ───────┐
└──────────────────────────────────────────────────────────┘

Comparison and Decision Guide

Having walked through all three patterns, here's how they compare across key dimensions:

Dimension	Pattern A (External)	Pattern B (Internal)	Hybrid
Agent autonomy	✔ Fully preserved	✔ Preserved (dynamic)	✔ Autonomy + controlled checkpoints
Observability	Coarse (agent-level)	Fine (tool-call-level)	Adjustable
Fault tolerance	Agent-level retry	Tool-call-level retry	Activity-level retry
Complexity	Low (add decorator)	High (decompose Agent Loop)	Medium-High
Flexibility	High (agent self-adapts)	Medium (depends on code dynamism)	High

Table 1: Pattern comparison across key dimensions

In practice, the choice often comes down to your specific scenario. The following table summarizes which pattern fits which situation:

Scenario	Recommended	Why
Multi-agent distributed collaboration	Pattern A	Agents deploy independently; orchestration handles coordination
Compliance audit every tool call	Pattern B	Tool-level visibility; every operation recorded and queryable
High-risk tools (payment/deletion)	Pattern B (partial)	Only wrap critical steps to reduce overall complexity
Quick prototype / PoC	No orchestration	Premature introduction adds complexity before value is validated
Production general purpose	Hybrid	Balance reliability + flexibility; agents retain autonomy

Table 2: Pattern selection guide by scenario

Of course, real-world systems rarely fit neatly into one box. Many production deployments evolve — starting with Pattern A for simplicity, then gradually wrapping critical steps with Pattern B as compliance or reliability requirements grow. The right pattern depends on your team's maturity, regulatory context, and where failures actually hurt.

Chapter 2: Core Mechanism of the Orchestration Engine

Chapter 1 established why we need orchestration and how it integrates with agents without killing autonomy. But what does the orchestration engine actually do at the infrastructure level? This chapter answers question 3.

Step → Checkpoint → Replay

Regardless of which pattern you choose, all orchestration engines solve problems through the same fundamental mechanism: record each step's result in storage external to the process; on interruption, resume from records rather than re-executing.

① Step (mark key operations) — Developers use context.step() or Activity to mark "meaningful operations": each LLM call, each tool execution, each external API request. This tells the engine: "this step's result is worth remembering."

② Checkpoint (automatic persistence) — After each step completes, the engine automatically writes the result to persistent storage external to the process (Temporal: Event History; Lambda DF: Checkpoint Store). Developers write zero checkpoint code — the engine handles it at the infrastructure level.

③ Replay (deterministic recovery) — After interruption (crash/pause/timeout): the engine replays the code, returns cached results for completed steps without re-executing them, skips all completed steps, and continues from the first incomplete step. Side effects don't repeat. Zero extra tokens.

Now that we understand the core mechanism, let's look at five production scenarios to see how orchestration engines deliver reliability guarantees in practice.

Scenario 1: Crash Recovery — Don't Start Over

Imagine that an agent ran for 20 minutes, completed 8 steps, and crashed on step 9.

Without orchestration: You have three options: (1) start over from zero (re-call every LLM, re-execute every tool, re-pay all tokens), (2) try reloading from trace data and re-construct context, hoping the LLM picks up correctly, or (3) build your own checkpoint logic.

With orchestration: Every context.step() is a checkpoint. Step 9 crashes → steps 1-8 results read directly from Event History → no LLM calls, no API calls → continues directly from step 9. Already-spent tokens are not re-spent; side effects don't repeat.

How this differs from "save trace and reload context":

Some teams attempt crash recovery by saving the agent's conversation history and feeding it back as context on restart. The fundamental difference: Checkpoint + Replay is an infrastructure-layer mechanism — like a database's Write-Ahead Log or a game's save point, process-independent and business-logic-independent. Reloading context is an application-layer mechanism — like human "recall," dependent on trace completeness and LLM comprehension. This manifests in two concrete ways:

Zero extra LLM tokens — Checkpoint + Replay returns cached results directly for completed steps without re-feeding history to the LLM. Reloading context requires sending the entire conversation back as input, consuming tokens proportional to history length.
Code-level guarantee against re-execution — The orchestration engine guarantees at the infrastructure layer: completed steps return cached values, function bodies never re-execute. This is deterministic and independent of LLM judgment. Reloading context depends on the LLM correctly understanding history and "choosing not to repeat" — works most of the time, but isn't a guarantee.

Scenario 2: Long-Running Tasks — Surviving Beyond Process Lifecycle

Agents can run in various runtime environments — Lambda, EKS Pods, AgentCore, etc. But regardless of which runtime you choose, they all have lifecycle constraints:

Lambda: max 15 minutes, forced termination on timeout
EKS Pod: can be evicted anytime (Spot instance reclaim, scaling, OOM)
AgentCore: max 8 hours, idle 15 minutes auto-terminates

Yet agent tasks can span days or weeks — continuous monitoring, multi-round approvals, complex research. If the process dies, the agent task dies with it. The solution is state externalization: the process is just a temporary carrier, while execution state and checkpoints persist at the infrastructure layer via the orchestration engine — independent of any single process.

Temporal Server / Lambda DF Backend (permanent state storage)

Worker A (Pod 1)              Worker B (Pod 2)              Worker C (Pod 3)
Execute Steps 1-3     →      Replay Steps 1-3 (cache)  →   Replay (cache)
Results report to Server      Continue Steps 4-6            Continue Step 7 → Done ✔
⚡ Pod evicted                ⏸ HITL pause (wait 3 days)

The execution state lives in the server, not in any process. Processes are fungible workers — any worker can pick up any workflow by replaying from its checkpoints. A workflow can survive unlimited pod restarts, Lambda timeouts, and infrastructure disruptions.

Scenario 3: Human-in-the-Loop — Wait Without Burning Resources

Many agent workflows require human approval at critical points — a manager sign-off before executing a payment, a compliance review before publishing content, or a multi-round approval chain that takes days. The challenge: how does the agent "pause" and wait without wasting resources or risking process death?

Without orchestration: Either keep the process alive waiting (wasting compute, risking being killed by timeout), or build your own "pause → serialize state → persist → wait for callback → deserialize → restore" logic from scratch.

With orchestration: The engine provides a waitForCallback primitive. When the agent reaches a point requiring approval, it calls this function — the current process immediately ends (zero compute cost), the manager is notified, and a configurable timeout is set (e.g., 7 days). The execution state persists externally. When the manager clicks "approve," the engine replays the code from the beginning, skips completed steps from cache, and resumes precisely at the approval point.

Agent executing Steps 1-3...
        │
        ▼
Step 4: waitForCallback("manager-approval")
        │
        ├──→ Process ends (zero cost)
        │    Manager notified, timeout: 7 days
        │
        │    ... 3 days pass ...
        │
        ├──← Manager clicks "Approve"
        │
        ▼
Engine replays → Steps 1-4 from cache → Continue Step 5

The wait can last days or months with zero resource consumption — no thread held, no compute billed.

Scenario 4: Stability — Retry + Error Isolation

Agent tasks involve many external calls — LLM APIs, third-party services, databases. Transient failures (API timeouts, rate limits, network jitter) are inevitable at scale. The question is: when one step fails, does it take down the entire agent run?

Without orchestration: You need to manually implement retry logic for each call: is-it-retryable checks, backoff strategies, max attempt counts, and whether context needs reloading after failure. This means scattering error-handling code across every tool function, every API wrapper, and every agent step — mixing infrastructure concerns into business logic. Without this tedious work, a single transient error kills the entire agent run.

With orchestration: Each step can be configured with a declarative retry policy (e.g., max 3 attempts, exponential backoff). If a tool call fails transiently, the engine automatically retries that specific step — without re-executing any previously completed steps, and without consuming additional LLM tokens for the unaffected portions. More importantly, error handling is unified at the infrastructure layer: you define retry policies once, declaratively, rather than writing try/catch logic in every function. Your business code stays focused on what the agent actually does, not on how to survive failures.

Step 1 ✔  →  Step 2 ✔  →  Step 3 ✘↻↻✔  →  Step 4 ✔  →  Done

Step 3 fails then succeeds after 2 retries. Steps 1-2 are completely unaffected. Each step's retry policy is independent — one failure doesn't cascade. The policy is declarative configuration, not code mixed into business logic.

Scenario 5: Observability — Event History + Visibility + LLM Tracing

Most teams today rely on tracing tools (Langfuse, Braintrust) for agent observability — tracking prompts, completions, token usage, and latency. This is valuable, but in my view it's only half the picture. Production agent observability should have two complementary layers:

Layer 1: Execution-level observability — from the orchestration engine (Event History + Visibility)

Event History records every step's start/complete/fail/retry, Signal send/receive, timing, and return values — a complete audit trail of execution flow. On top of this, a Visibility layer provides SQL-like queries across all workflows (e.g., "find all Failed workflows for customer C123"). This tells you what happened during execution: which steps ran, which failed, how long the wait was, how many retries occurred. (Temporal Events Docs)

Layer 2: Inference-level observability — from LLM tracing tools (Langfuse/Braintrust)

This records prompt/completion content, token usage + cost, latency, and evaluation scores. This tells you how the agent reasoned: what it was asked, what it answered, how much it cost, and whether the output quality was good. It does NOT record workflow execution state, retry logic, or Signal handling.

Connecting the two layers

The two layers can be connected via OpenTelemetry and shared trace_id, enabling full-stack agent observability:

Agent output quality is poor → check Langfuse for prompt/completion to debug reasoning
Agent execution stuck/failed → check Temporal UI for execution state and retry logs
Jump between them via trace correlation — from Temporal's Activity directly into Langfuse to see the specific LLM call content

(Source: Langfuse Temporal Integration)

Closing

Let me circle back to the three questions that started this research:

Why orchestration on top of frameworks? — Because they operate at different layers. The Agent Loop defines reasoning logic; frameworks simplify development; orchestration ensures reliability at scale. They complement, not compete.
Agent or Workflow? — Orchestration does not kill agent autonomy. Whether Pattern A, B, or Hybrid, the LLM retains full decision-making. Orchestration provides durability, not rigidity.
What does orchestration actually do? — Through Step → Checkpoint → Replay, it delivers crash recovery, long-running execution, zero-cost human waits, failure isolation, and multi-layer observability.

Not every agent needs orchestration. But if yours coordinates multiple specialists, waits for human decisions, or simply needs to survive beyond a single process — it's worth understanding deeply.

References

Temporal Blog — Durable Execution Meets AI: Why Temporal Is the Perfect Foundation for AI. temporal.io/blog
AWS APN Blog — How Temporal Uses Amazon Bedrock AgentCore to Create Robust AI Systems. aws.amazon.com/blogs/apn
Temporal Documentation — Workflow Execution Events. docs.temporal.io
Langfuse — Temporal Integration. langfuse.com/integrations
Temporal — AI Cookbook (Agentic Loop with Claude). github.com/temporalio/ai-cookbook
temporal-community — temporal-ai-agent. github.com/temporal-community/temporal-ai-agent