Henry Li

Posted on Jun 21 • Originally published at exesolution.com

When Your AI Agent Restarts Mid-Task: Building Durable Workflows in Spring Boot

#java #springboot #ai #llm

The first agentic feature I shipped looked great in demos. The LLM picked a tool, called it, looked at the result, decided what to do next. Three tool calls, clean output, happy stakeholders.

Then we put it in front of real users.

Within a week we had three incidents. A user retried a request because it felt slow, and we created two support tickets instead of one. A deploy restarted the pod mid-workflow, and the agent simply forgot what it was doing — the user got a half-finished response and no error. Someone needed to approve a refund the agent recommended, but our "approval" was a Thread.sleep waiting on a Redis flag, and the pod that was sleeping got rescheduled. The approval just vanished.

Looking back, every one of those bugs came from the same root cause: we treated tool-calling as a chat completion loop with in-memory state. It works in a notebook. It does not work in production.

This post walks through what changed when we rebuilt the agent runtime as a proper workflow engine on Spring Boot and PostgreSQL — runs persisted in the database, retries with backoff, idempotency keys, human-in-the-loop checkpoints as a real wait state. The full runnable code, Docker Compose setup, and execution evidence are at exesolution.com. This post covers the design choices that matter and how to run it locally.

The Three Failures Worth Naming

Most agent demos hide three real problems behind small token counts and happy paths.

Duplicate side effects. Your agent calls "create_jira_ticket". The HTTP call to Jira succeeds, but the response packet is lost. Your retry logic kicks in. Now you have two tickets. This is not theoretical — it happens any time the network is involved, which is every time.

Restart amnesia. Mid-workflow, the pod gets rescheduled. If the agent's state lives in a Java object on the heap, it is gone. The client polling for results either hangs forever or gets a confusing error. Your monitoring stays quiet, because "workflow disappeared" is not a metric you can alert on.

Approval state as code, not data. When the agent wants to do something sensitive, you want a human to approve it first. The naive implementation: pause execution, wait for a webhook. The problem: "pause execution" means a thread sleeping in memory, which dies on restart. The human approves something nobody is waiting for anymore.

All three are solvable. None of them are solved by adding @Retryable and hoping.

The Shape of the Fix

The architecture is straightforward once you stop thinking about it as a chat loop and start thinking about it as a workflow.

A run is a row in PostgreSQL. It has a status (queued, running, waiting_for_approval, completed, failed), an input, a tenant ID, an idempotency key from the client. It does not have a Java object in memory.

A step is the unit of work — the agent's next decision, a tool call, an approval gate, the final response. Each step is also a row, with its own state machine. When something needs to retry, the step row has attempt, next_attempt_at, last_error. When the process restarts, nothing changes about the step — it just sits there until a dispatcher picks it up again.
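
To make the attempt and next_attempt_at columns concrete, here is a minimal sketch of how a failed step could be rescheduled. The post only says "retries with backoff", so the exponential doubling, the cap, and the RetryPolicy class itself are assumptions for illustration, not the code from the solution bundle.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: computes the next_attempt_at value for a failed step.
// Exponential backoff with a cap is an assumed policy, not confirmed by the post.
final class RetryPolicy {
    private final Duration baseDelay;
    private final Duration maxDelay;
    private final int maxAttempts;

    RetryPolicy(Duration baseDelay, Duration maxDelay, int maxAttempts) {
        if (baseDelay.isNegative() || baseDelay.isZero() || maxDelay.compareTo(baseDelay) < 0) {
            throw new IllegalArgumentException("need 0 < baseDelay <= maxDelay");
        }
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
        this.maxAttempts = maxAttempts;
    }

    /** Delay before the retry that follows the given failed attempt (1-based). */
    Duration delayAfter(int attempt) {
        long base = baseDelay.toMillis();
        long max = maxDelay.toMillis();
        // Doubling factor: 1, 2, 4, ... The shift is bounded so the factor stays positive.
        long factor = 1L << Math.min(Math.max(attempt - 1, 0), 30);
        // Compare before multiplying so base * factor cannot overflow.
        long millis = (base > max / factor) ? max : base * factor;
        return Duration.ofMillis(millis);
    }

    /** Value to write into next_attempt_at after a failure at failedAt. */
    Instant nextAttemptAt(Instant failedAt, int attempt) {
        return failedAt.plus(delayAfter(attempt));
    }

    /** True once the step should move to a terminal failed state instead of retrying. */
    boolean exhausted(int attempt) {
        return attempt >= maxAttempts;
    }
}
```

With a 2-second base and a 1-minute cap, the delays run 2s, 4s, 8s and so on until they hit the cap. A production policy would probably add jitter so that steps that failed together do not all retry on the same tick.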

A dispatcher runs on a Spring @Scheduled tick. Every few seconds it looks for steps that are QUEUED or RETRY_PENDING with next_attempt_at <= now(). It claims them using database leases (so two instances don't pick the same step), executes them, writes the result back, and moves on.
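
The claim logic is easier to reason about in isolation. The sketch below uses an in-memory map as a stand-in for the workflow_step table so that it runs without a database; in the real service the same check-and-set would be a conditional UPDATE inside a transaction. The lease_owner and lease_expires_at fields, and the rule that an expired lease makes a RUNNING step claimable again, are assumptions about what "database leases" means here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for the workflow_step table. A real dispatcher would
// express claimDue as a conditional UPDATE rather than a synchronized method.
final class StepLeaseTable {
    enum Status { QUEUED, RETRY_PENDING, RUNNING, WAITING_FOR_APPROVAL, COMPLETED, FAILED }

    static final class Step {
        final String id;
        Status status;
        Instant nextAttemptAt;
        String leaseOwner;       // hypothetical column: instance holding the lease
        Instant leaseExpiresAt;  // hypothetical column: when the lease lapses

        Step(String id, Status status, Instant nextAttemptAt) {
            this.id = id;
            this.status = status;
            this.nextAttemptAt = nextAttemptAt;
        }
    }

    private final Map<String, Step> rows = new LinkedHashMap<>();

    synchronized void insert(Step step) { rows.put(step.id, step); }

    synchronized Status statusOf(String id) { return rows.get(id).status; }

    /** Claims up to limit steps for owner and returns their ids in insertion order. */
    synchronized List<String> claimDue(String owner, Instant now, Duration leaseTtl, int limit) {
        List<String> claimed = new ArrayList<>();
        for (Step step : rows.values()) {
            if (claimed.size() >= limit) break;
            // Due: QUEUED or RETRY_PENDING with next_attempt_at <= now.
            boolean due = (step.status == Status.QUEUED || step.status == Status.RETRY_PENDING)
                    && !step.nextAttemptAt.isAfter(now);
            // Assumed recovery rule: a RUNNING step whose lease lapsed was abandoned.
            boolean abandoned = step.status == Status.RUNNING
                    && step.leaseExpiresAt != null
                    && step.leaseExpiresAt.isBefore(now);
            if (due || abandoned) {
                step.status = Status.RUNNING;
                step.leaseOwner = owner;
                step.leaseExpiresAt = now.plus(leaseTtl);
                claimed.add(step.id);
            }
        }
        return claimed;
    }
}
```

Two instances calling claimDue at the same moment never receive the same step, and a step in WAITING_FOR_APPROVAL is never returned at all. Reclaiming an abandoned step can re-run work that already happened, which is why tool-level idempotency matters as much as the lease.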

Tool calls are recorded as their own rows in tool_execution, with the actual request and response stored as JSON. When something goes wrong six hours later and a customer wants to know what happened, you read those rows. You do not parse logs.

That is the entire mental model. Everything else is implementation detail.

Why PostgreSQL Instead of a Queue

We considered Redis Streams and RabbitMQ. We chose PostgreSQL because it was already in the stack and because the same transaction that advances step state needs to record the tool execution. With a separate queue and database, you get the dual-write problem, the one the transactional outbox pattern exists to solve: your queue update and database update can drift apart on failure, and you spend weeks writing reconciliation code.

With Postgres-as-queue (or close enough), the dispatcher's UPDATE workflow_step SET status='running' WHERE id=... AND status='queued' happens in the same transaction as the tool execution record. If the transaction rolls back, both roll back. There is no drift.

The trade-off is throughput. If you need to process more than a few thousand steps per second, you outgrow this design and want a real queue. For most agentic features that is years away.

Idempotency Done Twice

We enforce idempotency at two boundaries, and this distinction matters.

Run creation uses a client-supplied Idempotency-Key header. Same key, same input, same response, even if the client retries because of a timeout. The first request creates the run; the second returns the run ID from the first. The client does not need to know whether its first attempt landed; from its perspective the operation succeeded once.
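
The run-creation rule can be sketched in a few lines. This version keeps runs in a map instead of the workflow_run table, scopes keys per tenant, and rejects a reused key that arrives with a different input. The last two behaviours are assumptions; the post does not say how the service handles either case.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for workflow_run with a uniqueness rule on (tenant, key).
final class RunCreation {
    static final class Run {
        final String runId;
        final String tenantId;
        final String input;

        Run(String runId, String tenantId, String input) {
            this.runId = runId;
            this.tenantId = tenantId;
            this.input = input;
        }
    }

    // tenantId -> (Idempotency-Key -> run). Nested maps avoid delimiter collisions.
    private final Map<String, Map<String, Run>> runsByTenant = new ConcurrentHashMap<>();

    /** Creates a run on first sight of the key and returns the stored run on every retry. */
    Run createOrGet(String tenantId, String idempotencyKey, String input) {
        Map<String, Run> runs = runsByTenant.computeIfAbsent(tenantId, t -> new ConcurrentHashMap<>());
        // computeIfAbsent is atomic per key, so concurrent retries still create one run.
        Run run = runs.computeIfAbsent(idempotencyKey,
                k -> new Run(UUID.randomUUID().toString(), tenantId, input));
        if (!run.input.equals(input)) {
            // Assumed policy: the same key with a different body is a client bug.
            throw new IllegalStateException("Idempotency-Key reused with a different input");
        }
        return run;
    }

    int runCount(String tenantId) {
        Map<String, Run> runs = runsByTenant.get(tenantId);
        return runs == null ? 0 : runs.size();
    }
}
```

In the database-backed version the atomic step would be a unique constraint on the key plus an insert that falls back to a select when the constraint fires.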

Tool execution uses a derived key: tenant_id + run_id + step_id + tool_name + hash(input). When the dispatcher picks up a step to execute, it checks this key in the idempotency_record table. If the key exists with a completed status, it returns the stored response without calling the tool again. This protects against the case where the tool succeeded but the database update failed — on the next dispatcher tick, we don't re-create the Jira ticket.
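
As an illustration of the derived key, the sketch below builds it from the five parts named above and guards a tool call with it. SHA-256 as the hash, ':' as the separator, and the in-memory map standing in for idempotency_record are all assumptions; ToolIdempotency is a hypothetical class name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical helper: derives the tool-level idempotency key and replays
// stored responses. The map stands in for the idempotency_record table.
final class ToolIdempotency {
    private final Map<String, String> completedResponses = new ConcurrentHashMap<>();

    static String deriveKey(String tenantId, String runId, String stepId,
                            String toolName, String inputJson) {
        for (String part : new String[] {tenantId, runId, stepId, toolName}) {
            // Reject the separator so two different tuples cannot produce the same key.
            if (part.indexOf(':') >= 0) {
                throw new IllegalArgumentException("':' is not allowed in a key component: " + part);
            }
        }
        return String.join(":", tenantId, runId, stepId, toolName, sha256Hex(inputJson));
    }

    static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java runtime ships SHA-256, so this should not happen.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    /** Runs the tool once per key; later calls return the stored response instead. */
    String executeOnce(String key, Supplier<String> toolCall) {
        // If toolCall throws, nothing is stored and a later attempt will call it again.
        return completedResponses.computeIfAbsent(key, k -> toolCall.get());
    }
}
```

Hashing the input only helps if the same logical input always serializes to the same bytes, so the JSON would need a stable field order before it is hashed.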

The key insight: idempotency is not a library you import. It is a discipline applied at every place a side effect can happen.

Approvals as a Wait State

This is where most agent implementations fall apart. The naive version:

if (action.requiresApproval()) {
    waitForApproval(action);
    executeAction(action);
}

That waitForApproval call has to live somewhere. If it blocks the request thread, your client times out. If it spawns a background thread, the thread dies on restart. If it uses Redis pub/sub, the subscriber connection dies on restart.

The fix is to model the approval as a step like any other:

Step type=APPROVAL, status=WAITING_FOR_APPROVAL, request_json={action: ...}

The dispatcher sees WAITING_FOR_APPROVAL and just skips it — there is nothing to do. When the approver calls POST /api/runs/{id}/approve, the API endpoint updates the step status to QUEUED. On the next dispatcher tick, the step gets picked up and execution continues.

There is no process waiting on anything. There is no in-memory state. The pod can restart 50 times during the approval window and it makes no difference. When the human eventually decides, the work continues from exactly where it stopped.

This is the design pattern most agent frameworks miss entirely.
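
The whole approval mechanism reduces to one guarded state transition. The sketch below models it with a map in place of the workflow_step table and keys it by step id for brevity, whereas the real endpoint takes a run id. ApprovalGate and its method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for approval steps: nothing blocks, the state just sits there.
final class ApprovalGate {
    enum Status { QUEUED, RUNNING, WAITING_FOR_APPROVAL, COMPLETED }

    private final Map<String, Status> stepStatus = new ConcurrentHashMap<>();

    /** Called when the agent reaches a sensitive action: record the wait and return. */
    void park(String stepId) {
        stepStatus.put(stepId, Status.WAITING_FOR_APPROVAL);
    }

    /**
     * Called by the approve endpoint. The compare-and-set only succeeds while the
     * step is still waiting, so a double click or a stale approval is a no-op.
     */
    boolean approve(String stepId) {
        return stepStatus.replace(stepId, Status.WAITING_FOR_APPROVAL, Status.QUEUED);
    }

    /** The dispatcher only ever considers QUEUED steps; waiting steps are skipped. */
    boolean isDispatchable(String stepId) {
        return stepStatus.get(stepId) == Status.QUEUED;
    }
}
```

Against PostgreSQL the approve method would be an UPDATE with the expected status in its WHERE clause, and the affected row count tells the endpoint whether the approval actually applied.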

Running It Locally

docker compose up -d --build

This starts the Spring Boot service, PostgreSQL with the schema migrated, and a small mock LLM endpoint so you do not need an API key to see the workflow execute.

Check the service is up:

curl -s http://localhost:8080/actuator/health

The expected response shows status: UP.

Create a run with an idempotency key:

curl -s -X POST http://localhost:8080/api/runs \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Idempotency-Key: run-demo-001" \
  -H "Content-Type: application/json" \
  -d '{
    "tenantId": "t-001",
    "input": { "goal": "Create a ticket via tool, requires approval" }
  }'

The response includes a runId. Poll it:

curl -s http://localhost:8080/api/runs/<RUN_ID> \
  -H "Authorization: Bearer <TOKEN>"

You will see the status progress through QUEUED, RUNNING, then WAITING_FOR_APPROVAL once the agent decides it needs human sign-off. Approve it:

curl -s -X POST http://localhost:8080/api/runs/<RUN_ID>/approve \
  -H "Authorization: Bearer <APPROVER_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "reason": "Approved for demo" }'

On the next dispatcher tick, the run continues and completes.

Test the idempotency: repeat the original create call with the same Idempotency-Key. You get the same runId back, not a new one. Check the database — there is exactly one row in workflow_run for that key.

Then, the test that surprised the team the most: while a run is WAITING_FOR_APPROVAL, kill and restart the container:

docker compose restart app

The run state is in PostgreSQL. The approval call still works. The run still completes. Nothing in-memory was lost because nothing important was ever in memory.

What's in the Full Solution

The verified solution at exesolution.com includes:

Complete Spring Boot project: REST API, workflow engine, dispatcher, tool registry, planner adapter
PostgreSQL schema with Flyway migrations for all six tables: workflow_run, workflow_step, tool_execution, idempotency_record, approval_checkpoint, run_event
Lease-based dispatcher implementation that is safe to run on multiple instances later
A mock LLM that lets you test the full workflow without spending API credits
OpenTelemetry tracing wired through: HTTP request → run → step → tool call appear as a single trace
Docker Compose stack with the app, PostgreSQL, and optional OTel collector
7 evidence screenshots: code structure, build, health check, run creation, polling, history endpoint, approval flow

Full solution + runnable code + evidence at exesolution.com

Free registration required to access the code bundle and evidence images.

What I Would Tell Past Me

If you are building anything where an LLM gets to call tools that affect the real world — create tickets, send emails, charge cards, update inventory — you are building a workflow system whether you realize it or not. The question is whether you do it on purpose, with durable state and idempotency from day one, or whether you discover the requirements one production incident at a time.

The setup in this solution is maybe a thousand lines of Java plus the schema. It is not a lot of code. It is the minimum amount of code that lets you sleep through a deploy while an agent run is in flight.

Have a specific scenario you want to talk through — long-running approvals, multi-tenant isolation, scaling the dispatcher? Drop a comment.
