Your AI agent crashes after 2 minutes of work. You lose everything. It doesn't have to be this way.
I've seen it happen in every production AI project.
You build a beautiful agent. It works perfectly on your local machine. You ship it.
Then it runs for 90 seconds, hits a timeout, crashes — and all context is gone.
Your user stares at a spinner. Your logs show nothing useful.
The problem isn't your LLM. It's your architecture.
AI agents are stateful, long-running processes. And we've been treating them like stateless HTTP handlers.
Let me show you how to fix this in 10 lines.
The Problem: Agents Are Ephemeral by Default
// Standard agent — everything lives in RAM
const agent = new StreamingToolAgent({ goal: 'Research assistant' }, llm)
const result = await agent.run('Research the top 10 AI papers of 2025')
// If this crashes at step 8/10... you restart from zero.
The issue is that AI agent state lives only in memory. The moment the process dies — OOM, timeout, deployment, network blip — you lose everything.
The Solution: @orka-js/durable
import { DurableAgent, MemoryDurableStore } from '@orka-js/durable'
import { StreamingToolAgent } from '@orka-js/agent'
import { OpenAIAdapter } from '@orka-js/openai'
const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! })
const agent = new StreamingToolAgent({ goal: 'Research assistant', tools: [] }, llm)
const store = new MemoryDurableStore()
const durable = new DurableAgent(agent, store, { maxRetries: 3 })
// That's it. Your agent is now durable.
const job = await durable.run('job-001', 'Research the top 10 AI papers of 2025')
console.log(job.status) // 'completed'
console.log(job.result) // The full research
That's literally 10 lines. Let me break down what just happened.
What DurableAgent Actually Does
1. Every job has a persistent ID
const job = await durable.run('job-001', 'Research AI papers')
The job-001 identifier is the key. If this job has already run to completion, calling run('job-001', ...) again returns the cached result immediately, without making another LLM call.
This is idempotency. It's the foundation of reliable distributed systems, now applied to AI agents.
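The mechanics behind this are worth seeing in miniature. Here's a minimal sketch of idempotent job execution, assuming an in-memory job map; this illustrates the pattern, not @orka-js/durable's actual internals, and the Job type is my invention:

```typescript
// Illustrative sketch of idempotent job execution (not the library's code).
type Job = { id: string; status: 'completed' | 'failed'; result?: string }

class IdempotentRunner {
  private jobs = new Map<string, Job>()

  async run(id: string, task: () => Promise<string>): Promise<Job> {
    // If this job ID already completed, return the stored result without re-executing.
    const existing = this.jobs.get(id)
    if (existing?.status === 'completed') return existing

    const result = await task()
    const job: Job = { id, status: 'completed', result }
    this.jobs.set(id, job)
    return job
  }
}
```

Swap the Map for a shared store like Redis and the same check gives you idempotency across processes and restarts.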
2. Automatic retry with backoff
const durable = new DurableAgent(agent, store, {
maxRetries: 3,
retryDelayMs: 2000, // wait 2s between retries
})
If your agent fails (rate limit, network error, LLM timeout), DurableAgent retries automatically. No try/catch soup. No manual retry loops.
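The retry loop itself amounts to something like the generic helper below. This is a sketch of the pattern, not the library's implementation; I'm assuming a fixed delay between attempts, matching the retryDelayMs config shown above:

```typescript
// Generic retry helper (illustrative sketch).
// maxRetries = 3 means up to 4 total attempts: the initial call plus 3 retries.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  retryDelayMs: number,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      if (attempt < maxRetries) {
        // Wait before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, retryDelayMs))
      }
    }
  }
  throw lastError
}
```

For true exponential backoff you'd scale the wait, e.g. retryDelayMs * 2 ** attempt, so repeated failures back off progressively rather than hammering a rate-limited API.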
3. Human-in-the-loop: pause and resume
This is where it gets genuinely powerful.
// Start a long research job
const job = await durable.run('analysis-42', 'Analyze Q4 financial reports')
// Pause it — maybe you need human approval before continuing
await durable.pause('analysis-42')
// Hours later, after your manager reviews the intermediate output...
const resumed = await durable.resume('analysis-42')
console.log(resumed.result)
You can build approval workflows where an agent pauses waiting for a human decision, then resumes exactly where it left off. No state reconstruction. No prompt re-injection hacks.
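Conceptually, pause/resume is just a job state machine plus a persisted checkpoint. Here's a tiny synchronous sketch of that idea; the status names and checkpoint field are my assumptions, not the library's schema:

```typescript
// Sketch of a pausable job: status + checkpoint are what get persisted.
type JobStatus = 'running' | 'paused' | 'completed'

class PausableJob {
  status: JobStatus = 'running'
  checkpoint = 0 // index of the next step to execute

  constructor(private steps: Array<() => void>) {}

  // Run one step; returns false if the job is paused or already done.
  step(): boolean {
    if (this.status !== 'running' || this.checkpoint >= this.steps.length) return false
    this.steps[this.checkpoint]()
    this.checkpoint++
    if (this.checkpoint === this.steps.length) this.status = 'completed'
    return true
  }

  pause(): void {
    if (this.status === 'running') this.status = 'paused'
  }

  // Resume picks up at the saved checkpoint, not from step 0.
  resume(): void {
    if (this.status === 'paused') this.status = 'running'
  }
}
```

Because the checkpoint is saved alongside the status, "hours later" is no different from "milliseconds later": resuming just flips the status and continues from the recorded index.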
Production: Switch to Redis in One Line
MemoryDurableStore is great for local dev. For production:
import { RedisDurableStore } from '@orka-js/durable'
import { createClient } from 'redis'
const redis = createClient({ url: process.env.REDIS_URL })
await redis.connect()
const store = new RedisDurableStore(redis) // <- one line change
const durable = new DurableAgent(agent, store, { maxRetries: 3 })
Your agent jobs now survive:
- Server restarts
- Deployments
- OOM crashes
- Cloud instance preemptions
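The reason the swap is a one-line change is that both stores satisfy the same persistence contract. The post doesn't show that interface, so the shape below is a plausible guess (the method names, JobRecord fields, and statuses are all my assumptions):

```typescript
// Hypothetical shape of a durable store contract (the real interface may differ).
interface JobRecord {
  id: string
  status: 'pending' | 'running' | 'paused' | 'completed' | 'failed'
  result?: unknown
}

interface DurableStore {
  get(id: string): Promise<JobRecord | undefined>
  set(id: string, job: JobRecord): Promise<void>
}

// In-memory implementation: same contract, no external dependency.
class InMemoryStore implements DurableStore {
  private jobs = new Map<string, JobRecord>()

  async get(id: string): Promise<JobRecord | undefined> {
    return this.jobs.get(id)
  }

  async set(id: string, job: JobRecord): Promise<void> {
    this.jobs.set(id, job)
  }
}
```

A Redis-backed version would implement the same two methods, serializing each JobRecord to JSON under its ID, which is why the calling code never changes.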
Scheduled Agents: Cron for AI
One more thing. You can schedule an agent to run on a cron schedule:
const durable = new DurableAgent(agent, store)
// Run every day at 9am
durable.schedule('daily-briefing', '0 9 * * *', 'Generate the daily news summary')
No cron infrastructure to set up. No separate worker process. Just declare it and forget it.
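Part of why no cron infrastructure is needed is that matching a cron expression against the current time is cheap enough to do in-process, once a minute. A stripped-down matcher, supporting only plain numbers and * (real cron also handles ranges, lists, and steps), looks like this:

```typescript
// Minimal cron-field matcher (numbers and '*' only; a simplification of real cron).
function cronMatches(expr: string, date: Date): boolean {
  const [min, hour, dayOfMonth, month, dayOfWeek] = expr.split(' ')
  const field = (spec: string, value: number) => spec === '*' || Number(spec) === value
  return (
    field(min, date.getMinutes()) &&
    field(hour, date.getHours()) &&
    field(dayOfMonth, date.getDate()) &&
    field(month, date.getMonth() + 1) && // cron months are 1-12; JS months are 0-11
    field(dayOfWeek, date.getDay()) // 0 = Sunday in both
  )
}
```

A scheduler built on this just ticks every minute, checks each registered expression, and enqueues the matching jobs through the same durable run path as everything else.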
Stream Events While Persisting State
If your agent supports streaming (it should), DurableAgent preserves it:
const stream = durable.stream('job-002', 'Write a full market analysis')
for await (const event of stream) {
if (event.type === 'text') process.stdout.write(event.content)
if (event.type === 'tool_call') console.log('Using tool:', event.name)
if (event.type === 'done') console.log('Job saved:', event.jobId)
}
The user sees a streaming response. The job state is persisted in the background. Best of both worlds.
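The combination of streaming and persistence falls out naturally from an async generator that forwards each event to the consumer while writing it to the store as a side effect. A rough sketch, using the event shapes from the example above (the save callback and checkpoint-before-yield ordering are my assumptions):

```typescript
// Event shapes mirroring the streaming example above.
type AgentEvent =
  | { type: 'text'; content: string }
  | { type: 'tool_call'; name: string }
  | { type: 'done'; jobId: string }

// Yield each event to the caller and persist it as a side effect.
async function* persistingStream(
  source: AsyncIterable<AgentEvent>,
  save: (event: AgentEvent) => Promise<void>,
): AsyncGenerator<AgentEvent> {
  for await (const event of source) {
    await save(event) // checkpoint before the consumer sees the event
    yield event
  }
}
```

Saving before yielding means that if the consumer's connection drops mid-stream, every event it saw is already durable, so a resumed job never replays further back than necessary.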
The Mental Model Shift
| Before | After |
|---|---|
| Agent = function call | Agent = persistent job |
| Failure = start over | Failure = resume from checkpoint |
| Long tasks = scary | Long tasks = no problem |
| Human review = impossible | Human review = pause/resume |
| Cron = separate infra | Cron = built-in |
Get Started
npm install @orka-js/durable @orka-js/agent @orka-js/openai
import { DurableAgent, MemoryDurableStore } from '@orka-js/durable'
import { StreamingToolAgent } from '@orka-js/agent'
import { OpenAIAdapter } from '@orka-js/openai'
const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! })
const agent = new StreamingToolAgent({ goal: 'Assistant', tools: [] }, llm)
const durable = new DurableAgent(agent, new MemoryDurableStore(), { maxRetries: 2 })
const job = await durable.run('my-first-durable-job', 'Hello, world!')
console.log(job.status, job.result)
That's it. Your agents are now production-grade.
OrkaJS is a TypeScript-first framework for building production AI agents. Modular, typed, and provider-agnostic.
- GitHub: github.com/orka-ai/orkajs
- Docs: orkajs.com
- npm:
npm install @orka-js/<package>
If this solved a problem you've had, share it — other devs are fighting the same battle right now.