Most AI agent frameworks share a quiet assumption: the process will stay alive. Set off a multi-step research agent, and the code assumes the LLM API will respond, the network will behave, and your machine will keep running until the job finishes. In practice, none of those hold for jobs that run for hours or days.
Temporal is built for exactly that gap. It is a durable execution platform that records every step of a workflow in an event log. If the worker process crashes, the network partitions, or an LLM rate limit kicks in, Temporal restores state and resumes execution from the last completed step: no lost progress, no orphaned agent loops.
This guide covers what Temporal is, why it matters specifically for long-running AI workflows, and how to structure a Python-based durable agent today. Effloow Lab installed temporalio 1.27.0, validated the Workflow + Activity + Signal pattern locally, and documented what works and what requires a running server. The lab-run notes are at data/lab-runs/temporal-ai-agent-workflows-durable-execution-2026.md.
Why Stateless Agent Frameworks Break at Scale
Popular frameworks like LangChain, LangGraph, and CrewAI are optimized for short-lived reasoning loops. They handle the cognitive layer — planning, tool selection, iteration — extremely well. What they do not handle is the infrastructure layer: what happens when a 4-hour research agent crashes on step 47 of 60.
If you have worked through our LangGraph Python tutorial, you have seen how LangGraph's checkpointers save state between graph nodes. The critical limitation: if a crash happens inside a node — mid-loop, mid-LLM call — the entire node restarts from zero. For tasks that take seconds, that is acceptable. For tasks that take minutes per step, it becomes expensive and unpredictable.
The problems that surface at production scale cluster around three categories:
Durability gap. A single process crash wipes in-memory state. Retrying the entire multi-hour job from scratch wastes money and time. There is no concept of "resume from step N."
Retry inconsistency. Different tools and LLM providers have different rate limits and error modes. Hand-rolled retry logic tends to be inconsistent and untested under partial failure.
Human-in-the-loop friction. Pausing an agent to wait for a human approval — for hours or days — requires either polling loops that burn compute, or external state storage that you build and maintain yourself.
Temporal addresses all three through its core abstraction: the durable workflow.
How Temporal Works: The Event History Model
Every Temporal workflow generates an append-only event log called the event history. When a worker completes an activity (a discrete unit of work like an LLM call or a database write), the result is written to this log on the Temporal server. The code itself is treated as deterministic instructions for replaying state.
If the worker process dies and restarts, Temporal replays the event history to reconstruct exactly where execution was. Previously completed activities are not re-executed — their results are read from the log. Execution picks up at the first incomplete step.
This model is often described as event sourcing applied to code execution, and it has several practical consequences for AI agent design:
- LLM calls must live inside Activities, not inside the Workflow function. The Workflow function must be deterministic; LLM responses are not.
- Activities get their own retry policies. A rate-limited OpenAI call can back off and retry independently without affecting the rest of the workflow.
- Workflows can wait indefinitely — days or weeks — for an external signal (human approval, webhook callback, async job completion) without holding a process open or burning compute.
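The replay mechanic is easy to see in miniature. The sketch below is a toy model, not Temporal's implementation or API: an append-only log records each completed step's result, and a rerun reads logged results instead of re-executing the steps that produced them.

```python
# Toy model of event-history replay. Completed steps are read back from
# the log; only steps missing from history actually execute.
event_log: dict[str, str] = {}   # append-only history: step name -> result
executions: list[str] = []       # tracks which steps actually ran

def run_step(name: str, work) -> str:
    if name in event_log:        # already in history: replay, don't re-run
        return event_log[name]
    result = work()              # first time: execute and record
    executions.append(name)
    event_log[name] = result
    return result

def toy_workflow() -> str:
    a = run_step("search", lambda: "search results")
    b = run_step("draft", lambda: f"draft from {a}")
    return run_step("finalize", lambda: f"final: {b}")

toy_workflow()              # first run executes all three steps
first = list(executions)
toy_workflow()              # "crash and restart": replays entirely from the log
assert executions == first  # no step executed a second time
```

The second call returns the identical result without repeating any work, which is exactly why expensive LLM calls are safe to place behind this model.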
The Temporal server stores this event history. In development you run temporal server start-dev locally. In production, Temporal Cloud handles it with a 99.99% SLA guaranteed by Multi-Region Replication, which reached General Availability in early 2026.
Setting Up the Temporal Python SDK
Effloow Lab confirmed that the temporalio package installs without compilation on Python 3.12:
```shell
mkdir ai-agent-project && cd ai-agent-project
python3 -m venv venv && source venv/bin/activate
pip install temporalio
# Installed: temporalio-1.27.0, nexus-rpc-1.4.0, protobuf-6.33.6
```
For the local development server, Temporal provides a CLI (Homebrew formula temporal 1.7.0):
```shell
brew install temporal
temporal server start-dev
# Temporal UI available at http://localhost:8233
# gRPC on localhost:7233
```
The dev server stores workflow state in memory, so it resets on restart. For persistent local state, pass --db-filename temporal.db to use SQLite.
Structuring a Durable AI Agent
The core pattern separates concerns into three types: Activities, the Workflow, and Workers.
Activities: The Impure Layer
Activities are where side effects live — LLM calls, database reads, web requests. Each activity is retried independently according to its retry policy. They are regular async functions decorated with @activity.defn:
```python
from temporalio import activity


@activity.defn
async def call_llm(prompt: str) -> str:
    """Call OpenAI, Claude, or any LLM API. This activity can be
    retried on its own without replaying the whole workflow."""
    import openai

    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""


@activity.defn
async def search_web(query: str) -> str:
    """Tool call: web search, database lookup, API call.
    Any network I/O belongs here, not in the Workflow."""
    ...


@activity.defn
async def write_to_storage(key: str, data: str) -> bool:
    """Persist results to S3, Postgres, or any external store."""
    ...
```
The Workflow: The Durable Orchestrator
The Workflow function is the durable skeleton. It calls activities, waits for signals, and branches on results. The key constraint: no side effects, no randomness, no external I/O. All orchestration goes through deterministic APIs such as workflow.execute_activity() and workflow.wait_condition().
```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# call_llm, search_web, and write_to_storage are the activities defined above


@workflow.defn
class DurableResearchAgentWorkflow:
    def __init__(self) -> None:
        self._approved = False
        self._approval_notes = ""

    @workflow.run
    async def run(self, research_goal: str) -> str:
        retry = RetryPolicy(
            initial_interval=timedelta(seconds=5),
            maximum_attempts=10,
            non_retryable_error_types=["AuthenticationError"],
        )

        # Step 1: search phase (retried independently on failure)
        search_context = await workflow.execute_activity(
            search_web,
            research_goal,
            start_to_close_timeout=timedelta(minutes=2),
            retry_policy=retry,
        )

        # Step 2: LLM reasoning (result cached in event history on completion)
        draft = await workflow.execute_activity(
            call_llm,
            f"Research goal: {research_goal}\n\nContext:\n{search_context}",
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=retry,
        )

        # Step 3: human-in-the-loop. Waits up to 3 days at zero compute cost;
        # raises a timeout error if no approval signal arrives in time.
        await workflow.wait_condition(
            lambda: self._approved,
            timeout=timedelta(days=3),
        )

        # Step 4: refine using the human feedback
        final_output = await workflow.execute_activity(
            call_llm,
            f"Refine based on feedback '{self._approval_notes}':\n\n{draft}",
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=retry,
        )

        # Multiple arguments must go through args=, not extra positionals
        await workflow.execute_activity(
            write_to_storage,
            args=[research_goal, final_output],
            start_to_close_timeout=timedelta(seconds=30),
        )
        return final_output

    @workflow.signal
    async def approve(self, notes: str) -> None:
        """Unblocks step 3. Called from a webhook, Slack bot, or admin UI."""
        self._approval_notes = notes
        self._approved = True
```
Effloow Lab ran this skeleton through Python's import machinery and confirmed all decorators, RetryPolicy, and wait_condition instantiate correctly with temporalio 1.27.0.
The Worker: Running It
A Worker is the process that polls Temporal's Task Queue and executes Workflows and Activities:
```python
import asyncio

from temporalio.client import Client
from temporalio.worker import Worker


async def main() -> None:
    client = await Client.connect("localhost:7233")
    async with Worker(
        client,
        task_queue="ai-research-queue",
        workflows=[DurableResearchAgentWorkflow],
        activities=[call_llm, search_web, write_to_storage],
    ):
        print("Worker running - polling task queue...")
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())
```
Start a workflow from any service, CLI, or schedule:
```python
from temporalio.client import Client


async def trigger() -> None:
    client = await Client.connect("localhost:7233")
    handle = await client.start_workflow(
        DurableResearchAgentWorkflow.run,
        "Analyze competitive landscape for vector database market",
        id="research-agent-001",
        task_queue="ai-research-queue",
    )
    print(f"Started: {handle.id}")

    # Later: send the human approval signal, then collect the result
    await handle.signal(DurableResearchAgentWorkflow.approve, "Looks good, publish it")
    result = await handle.result()
    print(result)
```
The OpenAI Agents SDK Integration
As of March 23, 2026, the integration between Temporal's Python SDK and OpenAI's Agents SDK reached General Availability. If your team already uses OpenAI's agent primitives, you can add durable execution with minimal code changes.
The key helper is activity_as_tool, which wraps any Temporal Activity as an OpenAI-compatible tool schema:
```python
from datetime import timedelta

from agents import Agent  # the OpenAI Agents SDK ships as the `agents` package
from temporalio import activity
from temporalio.contrib.openai_agents import activity_as_tool


# Existing Temporal activity
@activity.defn
async def query_database(query: str) -> str:
    """Your DB call here."""
    ...


# Wrap it as an agent tool; the schema is generated automatically.
# Like any activity invocation, the wrapper needs a timeout.
db_tool = activity_as_tool(
    query_database,
    start_to_close_timeout=timedelta(seconds=30),
)

# Hand it to an Agents SDK agent
agent = Agent(
    name="ResearchAgent",
    model="gpt-4o",
    tools=[db_tool],
)
```
Every time the agent invokes query_database, it runs as a Temporal Activity — with automatic retries, event history, and crash recovery. The three OpenAI Agents SDK demos in temporal-community/openai-agents-demos on GitHub show production-ready patterns including web search, code execution, and multi-agent orchestration.
Temporal vs LangGraph: Not a Competition
A comparison that comes up frequently is Temporal vs LangGraph. The 2026 perspective from production teams is that they are complementary rather than competing.
| Dimension | Temporal | LangGraph |
|---|---|---|
| Primary abstraction | Durable workflow + event history | Stateful graph traversal |
| Control flow | Deterministic code | Cyclic graph (loops, branches) |
| Crash recovery | Activity-level granularity | Node-level (intra-node work lost) |
| Human-in-the-loop wait | Days/weeks, no compute cost | Requires external checkpointer + polling |
| Best for | Long-running orchestration, retry logic | Reasoning loops, dynamic agent logic |
| 2026 pattern | Orchestrates the durable lifecycle; activities wrap LangGraph agents | Runs inside a Temporal activity for reasoning-intensive subtasks |
The architecture that emerges in production: Temporal handles the macro-level durable lifecycle of a multi-hour job, while LangGraph handles the micro-level dynamic reasoning inside individual activities. A Temporal activity spins up a LangGraph agent, lets it reason through a subtask, and returns the result. If the LangGraph run fails partway through, Temporal retries the whole activity. State between activities is safe in the event history.
This is not a case of choosing the better tool. It is a case of using each at the right layer of abstraction.
2026 Feature Highlights from Replay 2026
Temporal held its Replay 2026 conference in San Francisco with several announcements directly relevant to AI workloads:
External Storage (Public Preview — Python, Go). Workflows and Activities can now store and retrieve large payloads — LLM outputs, embeddings, retrieved documents — directly to Amazon S3 or a custom storage driver. Previously, teams had to manage external storage themselves or hit Temporal's event history size limits with large payloads. External Storage solves both problems.
Serverless Workers on AWS Lambda (Pre-release). Workers can now run on AWS Lambda. Temporal Cloud manages invocation, scaling, and graceful shutdown based on workload volume. For bursty AI agent pipelines — many agents triggered simultaneously, then quiet — this eliminates the need to provision always-on worker fleets. Workers scale to zero when there is no work.
Standalone Activity. A new primitive that lets Activities run independently without a parent Workflow. Useful for one-shot, durable tasks that do not need full workflow orchestration overhead.
Temporal Nexus (GA). Nexus connects Workflows across isolated Namespaces. Teams can expose their Workflows as versioned, discoverable endpoints that other teams call across namespace boundaries. For multi-team AI platforms, this means a data team's retrieval workflow and an engineering team's code-generation workflow can compose without sharing a namespace.
Multi-Region Replication (GA, 99.99% SLA). Workflows are asynchronously replicated to a secondary region and automatically fail over if the primary region has an outage. For long-running AI agent jobs, regional failure no longer means job loss.
Common Mistakes When Using Temporal with AI Agents
Calling LLM APIs inside the Workflow function. The Workflow must be deterministic. LLM responses are not. Put every LLM call in an Activity. If you call an LLM inside the Workflow, replay after a crash may produce different results and corrupt the event history.
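To see why this corrupts replay, consider a toy decision driven by randomness. This is plain Python, not Temporal code, and the seeds are arbitrary; the point is that a replayed run can take a different branch than the one the history recorded.

```python
import random

def pick_branch(history: list[str], rng: random.Random) -> str:
    # A "workflow" decision driven by a nondeterministic source
    choice = "path_a" if rng.random() < 0.5 else "path_b"
    history.append(choice)
    return choice

history: list[str] = []
first = pick_branch(history, random.Random(1))   # original execution
replay = pick_branch(history, random.Random(2))  # replay sees different randomness
# When first != replay, the replayed run has diverged from recorded history:
# Temporal detects this as a nondeterminism error. LLM outputs vary between
# calls in exactly this way, which is why they must live in activities.
```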
Setting start_to_close_timeout too short for LLM calls. Models like GPT-4o can take 30–90 seconds on long prompts. A 30-second timeout causes unnecessary retries. Use 2–5 minutes for reasoning steps, shorter for embedding or classification calls.
Ignoring Workflow ID uniqueness. If you start a workflow with an existing ID, Temporal either rejects it or reuses the running workflow depending on your WorkflowIdReusePolicy setting. For agent pipelines, set TERMINATE_IF_RUNNING when you want each trigger to start fresh, or ALLOW_DUPLICATE_FAILED_ONLY to retry only previously failed runs.
Building human-in-the-loop with polling instead of signals. Polling a database every 60 seconds to check for human approval wastes compute and adds latency. Use workflow.wait_condition() with a signal handler. The workflow sleeps with zero resource usage until the signal arrives.
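The difference is the same as asyncio's Event versus a sleep-and-check loop. A minimal stdlib analogy (not Temporal code) of how a signal unblocks a parked waiter:

```python
import asyncio

async def wait_for_approval(approved: asyncio.Event) -> str:
    # Analogue of workflow.wait_condition: the coroutine suspends here,
    # consuming no cycles until someone sets the event.
    await approved.wait()
    return "approved"

async def main() -> str:
    approved = asyncio.Event()
    waiter = asyncio.create_task(wait_for_approval(approved))
    await asyncio.sleep(0)   # let the waiter park on the event
    approved.set()           # analogue of handle.signal(...)
    return await waiter

result = asyncio.run(main())
print(result)
```

Temporal extends the same idea across process restarts: the "parked" workflow survives worker crashes and stays waiting for days.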
Not using non_retryable_error_types. By default, Temporal retries on any exception. If your LLM API raises AuthenticationError (wrong API key), you do not want 10 automatic retries before failing. List non-retryable error types explicitly in the retry policy.
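A retry loop that honors a non-retryable list behaves roughly like this sketch. It is a simplification of what Temporal does server-side, and the exception class names are illustrative:

```python
class AuthenticationError(Exception): ...
class RateLimitError(Exception): ...

NON_RETRYABLE = {"AuthenticationError"}
MAX_ATTEMPTS = 10

def attempt_with_retries(fn):
    """Retry until success, the attempt budget runs out,
    or a non-retryable error type is raised."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return fn()
        except Exception as exc:
            if type(exc).__name__ in NON_RETRYABLE or attempt == MAX_ATTEMPTS:
                raise

auth_calls = 0
def bad_key():
    global auth_calls
    auth_calls += 1
    raise AuthenticationError("invalid API key")

try:
    attempt_with_retries(bad_key)
except AuthenticationError:
    pass
# bad_key ran exactly once: AuthenticationError short-circuits the budget

rate_calls = 0
def flaky():
    global rate_calls
    rate_calls += 1
    if rate_calls < 3:
        raise RateLimitError("429")
    return "ok"

assert attempt_with_retries(flaky) == "ok"  # transient errors retry to success
```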
FAQ
Q: Does Temporal replace my LLM framework (LangChain, LangGraph, CrewAI)?
No. Temporal handles the infrastructure layer — durability, retries, state persistence, scheduling. LLM frameworks handle the cognitive layer — planning, memory, tool selection. They solve different problems and are often used together in production.
Q: How does Temporal handle LLM rate limits?
Rate limit errors are transient. Configure a RetryPolicy with exponential backoff on the activity that calls the LLM. Temporal retries that activity independently, backing off with increasing intervals. The rest of the workflow is unaffected while retries happen.
Q: What is the difference between a Temporal Activity timeout and a workflow timeout?
start_to_close_timeout limits how long a single activity execution attempt can run. schedule_to_close_timeout limits the total time across all retries. Workflow execution has its own execution_timeout. For LLM activities, set start_to_close_timeout to 2–5 minutes to allow for slow model responses.
Q: Can I run Temporal on-premise or is Temporal Cloud required?
You can self-host using the open-source Temporal server. temporal server start-dev is a single-binary local server. The Docker Compose setup supports production self-hosting with PostgreSQL persistence. Temporal Cloud is the managed option with Multi-Region Replication and the 99.99% SLA.
Q: Does Temporal work with models other than OpenAI?
Yes. The core Temporal Python SDK is model-agnostic. You call any LLM API inside an Activity. The OpenAI Agents SDK integration (temporalio.contrib.openai_agents) is specific to OpenAI's SDK, but Google's Gemini documentation also includes a Temporal integration example, and the pattern works identically with Claude, Mistral, or any other API.
Key Takeaways
Temporal's durable execution model fills a genuine gap in the AI agent stack. Most frameworks handle reasoning well but assume a stable, short-lived process. Temporal assumes the opposite — that any step can fail and that jobs may run for days — and it builds the entire execution model around that assumption.
The Python SDK (temporalio 1.27.0) installs in seconds and the core Workflow + Activity + Signal pattern is straightforward to understand. The cognitive overhead is low; the operational benefit for long-running, production-critical agents is significant.
The 2026 releases — External Storage, Serverless Workers, Temporal Nexus GA, Multi-Region Replication GA — position Temporal as a mature platform for enterprise agentic pipelines, not just a developer tool for resilient microservices.
For teams building AI agents that run for more than a few minutes, need human approval gates, or must survive infrastructure failures, Temporal is now the default starting point for the orchestration layer.
Bottom Line
If your AI agent runs for more than a few minutes or needs human-in-the-loop gates, add Temporal. The Python SDK is a single pip install, the Workflow/Activity split is straightforward, and crash recovery comes for free. It does not replace LangGraph or CrewAI — it makes them production-safe at the orchestration layer.