Marc Newstead

Posted on Jun 15

Building Persistent AI Agents: A Dev's Guide to State Management and Long-Running Workflows

#ai #architecture #agenticai #softwareengineering

The Problem with Stateless Agents

Most AI agents we build today are essentially fancy request-response systems. User asks, agent responds, context dies. Rinse and repeat. But what happens when you need an agent that can start a workflow on Monday, wait for external approval on Wednesday, and resume execution on Friday, all without losing context along the way?

That's the shift from chatbots to persistent agents. And it changes everything about how we architect AI systems.

What Makes an Agent "Persistent"?

A persistent agent isn't just a chatbot with better memory. It's a system that:

Maintains state across sessions — not just conversation history, but workflow position, pending actions, and decision context
Can pause and resume — waiting on external events, human input, or scheduled triggers without losing its place
Initiates actions autonomously — checking conditions, triggering workflows, and making decisions without explicit user prompts
Recovers gracefully — handling failures, retries, and state corruption without manual intervention

Think less "chat interface" and more "background worker with reasoning capabilities".

The Hard Part Isn't the LLM Calls

Here's what surprised me: building the agent logic itself is relatively straightforward. Frameworks like LangGraph, CrewAI, and AutoGen handle the orchestration. Calling GPT-4 or Claude is trivial. Integrating tools and APIs is just... normal backend work.

The hard part is state management.

State Storage: More Than Just JSON

You need to persist:

{
  "workflow_id": "claim_review_001",
  "current_step": "awaiting_manager_approval",
  "context": {
    "claim_amount": 1250,
    "supporting_docs": [...],
    "previous_decisions": [...]
  },
  "pending_actions": [
    {"type": "wait_for_approval", "timeout": "2024-02-15T17:00:00Z"}
  ],
  "llm_state": {
    "reasoning_trace": [...],
    "tool_call_history": [...]
  }
}

But here's the thing: this state evolves. The agent needs atomic updates. You need versioning for rollbacks. You need to handle concurrent modifications if multiple agents or humans interact with the same workflow.

Suddenly you're designing a state machine with persistence, not just prompt engineering.
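To make that concrete, here's a minimal sketch of versioned, conflict-aware state storage using Python's built-in sqlite3 module. The table name workflow_versions and the helpers open_store, load, and save are my own illustrative names, not part of any framework. The idea is to append a new row per version instead of overwriting: the primary key on (workflow_id, version) rejects a second writer working from a stale copy, and old versions stay around for rollbacks.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open a store with one row per (workflow, version). Names are illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflow_versions ("
        " workflow_id TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " state TEXT NOT NULL,"
        " PRIMARY KEY (workflow_id, version))"
    )
    return conn


def load(conn, workflow_id, version=None):
    """Return (version, state) for the latest version, or for a specific one."""
    if version is None:
        row = conn.execute(
            "SELECT version, state FROM workflow_versions"
            " WHERE workflow_id = ? ORDER BY version DESC LIMIT 1",
            (workflow_id,),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT version, state FROM workflow_versions"
            " WHERE workflow_id = ? AND version = ?",
            (workflow_id, version),
        ).fetchone()
    if row is None:
        raise KeyError(workflow_id)
    return row[0], json.loads(row[1])


def save(conn, workflow_id, expected_version, state):
    """Append version expected_version + 1; return False if someone beat us to it."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO workflow_versions VALUES (?, ?, ?)",
                (workflow_id, expected_version + 1, json.dumps(state)),
            )
    except sqlite3.IntegrityError:
        return False  # primary-key clash: our copy of the state was stale
    return True
```

A caller loads the state, remembers the version it saw, and passes that version back to save. If save returns False, another agent or human got there first, and the caller should reload and decide again rather than overwrite. A new workflow starts by saving with expected_version 0.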

Interruption as a First-Class Concept

In traditional software, interruption is the exceptional case. In persistent agents, interruption is the normal case.

Your agent will:

Pause to wait for human approval
Stop because a rate limit was hit
Yield while waiting for an external system
Get interrupted because a higher-priority task arrived

Each interruption point needs explicit handling:

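from dataclasses import dataclass

@dataclass
class PauseState:  # tells the runner to persist state and suspend
    resume_trigger: str
    context: dict
    timeout_hours: int

def process_claim_step(state):
    if state.requires_human_review():
        return PauseState(
            resume_trigger="approval_received",
            context=state.to_dict(),
            timeout_hours=72,
        )
    # Continue processing...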

You're not building a linear function anymore. You're building a resumable state machine.
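Here's a rough sketch of what "resumable state machine" can mean in practice. Everything in it is hypothetical: the step names (validate, review, payout), the Pause and Workflow classes, and the run and resume helpers are made up for illustration. Each step returns either the name of the next step or a Pause, and the runner stops as soon as it sees a Pause so the caller can persist the workflow.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Pause:
    """Returned by a step that cannot continue until an event arrives."""
    resume_trigger: str


@dataclass
class Workflow:
    current_step: str = "validate"
    context: dict = field(default_factory=dict)
    waiting_for: Optional[str] = None


def validate(wf):
    wf.context["valid"] = wf.context.get("claim_amount", 0) > 0
    return "review" if wf.context["valid"] else "done"


def review(wf):
    if "approved" not in wf.context:
        return Pause("approval_received")  # no decision yet: yield
    return "payout" if wf.context["approved"] else "done"


def payout(wf):
    wf.context["paid"] = True
    return "done"


STEPS = {"validate": validate, "review": review, "payout": payout}


def run(wf):
    """Advance step by step until the workflow finishes or a step pauses."""
    while wf.current_step != "done":
        outcome = STEPS[wf.current_step](wf)
        if isinstance(outcome, Pause):
            wf.waiting_for = outcome.resume_trigger
            return wf  # the caller persists wf at this point
        wf.current_step = outcome
    return wf


def resume(wf, event, payload):
    """Feed the awaited event into a paused workflow and keep going."""
    if wf.waiting_for != event:
        raise ValueError(f"not waiting for {event!r}")
    wf.context.update(payload)
    wf.waiting_for = None
    return run(wf)


# Pause, round-trip through JSON as a stand-in for storage, then resume.
paused = run(Workflow(context={"claim_amount": 1250}))
restored = Workflow(**json.loads(json.dumps(asdict(paused))))
finished = resume(restored, "approval_received", {"approved": True})
```

The JSON round-trip matters more than it looks. Because the workflow is plain data plus a step name, the process that resumes it on Friday doesn't need to be the one that paused it on Monday.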

Human-in-the-Loop Isn't Optional

For any agent doing real work (approving expenses, modifying production data, sending customer communications), human oversight isn't a nice-to-have. It's table stakes, for regulatory, ethical, and practical reasons.

But "human-in-the-loop" means different things:

Approval gates — agent pauses, human approves/rejects, agent continues
Suggested actions — agent proposes, human edits, agent executes
Monitoring dashboards — humans can intervene at any point

This isn't a UI concern. It's an architectural decision that affects your state model, your event system, and your error handling.
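As one way to see how this touches the state model, here's a sketch of an approval gate that covers the first two patterns: approve or reject, and approve with edits. It assumes a state dict shaped like the claim example earlier. The Decision enum, the apply_decision function, and the step names "execute_payout" and "rejected" are invented for illustration.

```python
import copy
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    EDIT = "edit"  # approve with changes


def apply_decision(state, decision, reviewer, edits=None):
    """Return a new state reflecting a human decision on a paused workflow."""
    if state["current_step"] != "awaiting_manager_approval":
        raise ValueError("workflow is not waiting for approval")
    if decision is Decision.EDIT and not edits:
        raise ValueError("an edit decision needs at least one changed field")

    new_state = copy.deepcopy(state)  # never mutate the stored copy in place
    context = new_state["context"]
    if decision is Decision.EDIT:
        context.update(edits)
    # Record who decided what, so later steps and audits can see it.
    context.setdefault("previous_decisions", []).append(
        {"reviewer": reviewer, "decision": decision.value, "edits": edits or {}}
    )
    # The approval wait is resolved either way.
    new_state["pending_actions"] = [
        a for a in new_state.get("pending_actions", [])
        if a["type"] != "wait_for_approval"
    ]
    new_state["current_step"] = (
        "rejected" if decision is Decision.REJECT else "execute_payout"
    )
    return new_state
```

Notice that the human's decision becomes part of the workflow state, not a side effect in the UI. That's the sense in which human-in-the-loop shapes the state model and the event system.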

If you're building systems that combine AI automation with conventional software, design for human intervention from day one rather than bolting it on later.

Observability Gets Weird

How do you debug an agent that's been running for three days and is currently paused?

Standard APM tools are built around short-lived requests, so they won't get you far. You need:

Workflow visualisation — where is each agent in its process?
State inspection — what decisions has it made? What's it waiting for?
Reasoning traces — why did it take action X instead of Y?
Replay capability — can you rerun from a checkpoint with different conditions?

Logging becomes critical. Every LLM call, every tool invocation, every state transition needs to be traceable.
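A minimal sketch of that kind of traceability, assuming an append-only log with one JSON entry per event. The TransitionLog class, the event kinds, and the step_at helper are hypothetical names of mine, and the in-memory list stands in for a file or table. It only reconstructs where a workflow was at a given log position. Rerunning with different conditions would build on top of this.

```python
import json
import time


class TransitionLog:
    """Append-only log; each entry is one JSON line, as it might be on disk."""

    def __init__(self):
        self._lines = []

    def record(self, workflow_id, kind, data):
        entry = {
            "seq": len(self._lines) + 1,
            "ts": time.time(),
            "workflow_id": workflow_id,
            "kind": kind,  # e.g. "state_transition", "llm_call", "tool_call"
            "data": data,
        }
        self._lines.append(json.dumps(entry))
        return entry["seq"]

    def events(self, workflow_id, upto_seq=None):
        """Yield this workflow's entries in order, optionally up to a checkpoint."""
        for line in self._lines:
            entry = json.loads(line)
            if entry["workflow_id"] != workflow_id:
                continue
            if upto_seq is not None and entry["seq"] > upto_seq:
                break
            yield entry


def step_at(log, workflow_id, upto_seq=None):
    """Rebuild the workflow's step as of a given log position by replaying."""
    step = None
    for entry in log.events(workflow_id, upto_seq):
        if entry["kind"] == "state_transition":
            step = entry["data"]["to"]
    return step
```

With this in place, "what is it waiting for?" is a query over the log, and a checkpoint is just a sequence number you can replay up to.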

Where to Start

If you're building your first persistent agent:

Choose a state backend early — Redis, PostgreSQL, or a proper workflow engine like Temporal
Design your state schema first — before you write agent logic
Build pause/resume into every step — don't assume linear execution
Make state transitions explicit — log everything, version your state
Test interruption scenarios — not just happy paths
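One interruption scenario worth testing early is the approval that never arrives. Here's a small sketch, assuming pending actions carry an ISO 8601 timeout like the one in the state example above. The expired_actions helper and the test are hypothetical. Passing the current time in as an argument keeps the test deterministic.

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def expired_actions(state, now):
    """Return the pending actions whose timeout is at or before `now`."""
    return [
        action
        for action in state.get("pending_actions", [])
        if "timeout" in action and parse_utc(action["timeout"]) <= now
    ]


def test_approval_timeout():
    state = {
        "workflow_id": "claim_review_001",
        "current_step": "awaiting_manager_approval",
        "pending_actions": [
            {"type": "wait_for_approval", "timeout": "2024-02-15T17:00:00Z"}
        ],
    }
    before = datetime(2024, 2, 15, 16, 59, tzinfo=timezone.utc)
    after = datetime(2024, 2, 15, 17, 0, 1, tzinfo=timezone.utc)
    assert expired_actions(state, before) == []
    assert expired_actions(state, after) == state["pending_actions"]


test_approval_timeout()
```

What the agent does with an expired action (escalate, retry, or fail the workflow) is a design decision. The point is that the timeout path gets exercised before production does it for you.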

The Shift in Thinking

Persistent agents force us to think differently. We're not building APIs or microservices anymore. We're building systems that think in the background — agents that are always on, always aware, and always ready to pick up where they left off.

The code isn't harder. The architecture is just... different. And that's the real challenge.

DEV Community