A real-world engineering experiment on checkpointing, persistence, and why your storage backend matters more than your AI model.
Everyone talks about LLMs.
Bigger models.
Better prompts.
Smarter agents.
But almost nobody talks about what happens after an AI agent has been running for hours.
- Where does its state live?
- How do you resume after a crash?
- How do hundreds of agents save progress simultaneously?
- What happens when persistence becomes the bottleneck?
Those questions led me to build Living AI, an experimental checkpointing engine for long-running AI agents.
Its purpose isn't to replace agent frameworks.
It's to solve a simpler problem:
Persist agent state quickly without letting storage stall execution.
The Experiment
I built Living AI around a few simple ideas:
- Pluggable storage backends
- In-memory hot cache
- Compression
- Recovery API
- Execution-budgeted persistence
- Built-in performance metrics
The architecture looks like this:
Agent
  │
  ▼
Checkpoint Engine
  │
  ├── Compress State
  ├── Update Hot Cache
  └── Persist to Storage
        │
        ├── SQLite
        └── Redis-compatible Store
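To make the flow above concrete, here is a minimal sketch of how such an engine could be wired together. It is not Living AI's actual code: the names `StorageBackend`, `DictBackend`, and `CheckpointEngine`, the use of `pickle` and `zlib`, and the LRU-style hot cache are all assumptions chosen for illustration.

```python
import pickle
import zlib
from collections import OrderedDict
from typing import Any, Optional, Protocol


class StorageBackend(Protocol):
    """Hypothetical interface: anything with put/get can be plugged in."""

    def put(self, key: str, blob: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...


class DictBackend:
    """Stand-in backend for this sketch; a real one would wrap SQLite or Redis."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, blob: bytes) -> None:
        self._data[key] = blob

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class CheckpointEngine:
    def __init__(self, backend: StorageBackend, cache_size: int = 2) -> None:
        self.backend = backend
        self.cache_size = cache_size
        self.hot = OrderedDict()  # agent_id -> most recent state

    def save(self, agent_id: str, state: Any) -> None:
        blob = zlib.compress(pickle.dumps(state))  # compress state
        self._remember(agent_id, state)            # update hot cache
        self.backend.put(agent_id, blob)           # persist to storage

    def recover(self, agent_id: str) -> Any:
        if agent_id in self.hot:                   # hot path: no storage I/O
            self.hot.move_to_end(agent_id)
            return self.hot[agent_id]
        blob = self.backend.get(agent_id)          # cold path: read + decompress
        if blob is None:
            raise KeyError(agent_id)
        state = pickle.loads(zlib.decompress(blob))
        self._remember(agent_id, state)
        return state

    def _remember(self, agent_id: str, state: Any) -> None:
        # Keeps a reference to the state object; a production engine would
        # probably store a copy so later mutations cannot leak into the cache.
        self.hot[agent_id] = state
        self.hot.move_to_end(agent_id)
        while len(self.hot) > self.cache_size:
            self.hot.popitem(last=False)           # evict least recently used


engine = CheckpointEngine(DictBackend(), cache_size=1)
engine.save("agent-a", {"step": 1})
engine.save("agent-b", {"step": 2})    # evicts agent-a from the hot cache
restored = engine.recover("agent-a")   # served from the backend (cold path)
```

Because the engine only depends on `put` and `get`, swapping the backend does not touch the save or recovery logic.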
Instead of benchmarking models, I benchmarked the infrastructure beneath them.
Benchmark #1 — Single Agent
First I tested a single long-running workflow.
Workload
- 500 checkpoints
- 250 KB state
- Forced cache eviction
Results
✅ Average save: 13.45 ms
✅ Hot-cache recovery: 1.11 ms
✅ Cold recovery (SQLite + decompression): 1.48 ms
At this point everything looked healthy.
Benchmark #2 — Then I Added 50 Concurrent Agents
This is where things became interesting.
Configuration:
- 50 concurrent agents
- 1000 checkpoint attempts
- SQLite backend
- 50 ms persistence budget
Results:
| Metric | SQLite |
|---|---|
| Successful writes | 5 |
| Timed-out writes | 995 |
| Average write latency | 282 ms |
| Maximum latency | 735 ms |
At first glance, this looked terrible.
Then I realized something important.
The checkpoint engine wasn't slow.
SQLite had become the bottleneck.
Because SQLite allows only one writer at a time, concurrent checkpoint requests began waiting on database locks.
The engine's timeout policy skipped slow persistence attempts rather than blocking agent execution.
That trade-off kept the agents responsive under load, at the cost of durability: only 5 of the 1,000 checkpoint attempts were persisted successfully.
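One way to express a persistence budget like this is to hand the write to a worker thread and stop waiting once the budget is spent. The sketch below is an assumption about how such a policy could look, not the engine's real implementation; `BudgetedWriter` and its counters are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


class BudgetedWriter:
    """Hypothetical helper: wait at most `budget_s` for a write, then move on."""

    def __init__(self, budget_s: float = 0.050, workers: int = 4) -> None:
        self.budget_s = budget_s
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.ok = 0
        self.timed_out = 0

    def persist(self, write: Callable[[], None]) -> bool:
        future = self.pool.submit(write)
        try:
            future.result(timeout=self.budget_s)
        except FutureTimeout:
            # Budget exceeded: stop waiting so the agent keeps running.
            # cancel() only helps if the write has not started yet; a write
            # that is already running still finishes in the background.
            future.cancel()
            self.timed_out += 1
            return False
        self.ok += 1
        return True

    def close(self) -> None:
        self.pool.shutdown(wait=True)


writer = BudgetedWriter(budget_s=0.05)
fast = writer.persist(lambda: None)             # returns well within the budget
slow = writer.persist(lambda: time.sleep(0.3))  # exceeds the 50 ms budget
writer.close()
```

The caller gets a boolean back instead of blocking, so the agent loop can decide whether to retry later or carry on with the last durable checkpoint.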
The Real Test
Without changing the checkpoint engine...
Without changing compression...
Without changing agent logic...
I swapped only the storage backend.
The SQLite backend was replaced with a Redis-compatible one.
Exactly the same workload.
Exactly the same checkpoint engine.
Here were the results.
| Metric | SQLite | Redis-compatible |
|---|---|---|
| SLA compliance (writes within the 50 ms budget) | 0.5% | 100% |
| Average Write | 282 ms | 0.64 ms |
| p99 Write | 735 ms | 1.23 ms |
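For reference, metrics like these can be derived from raw latency samples in a few lines. This is a sketch under assumptions: it treats SLA compliance as the share of writes that finish within the 50 ms budget, uses the nearest-rank definition of a percentile, and runs on synthetic samples rather than the benchmark's data.

```python
import math
from typing import Sequence


def sla_compliance(latencies_ms: Sequence[float], budget_ms: float = 50.0) -> float:
    """Percentage of writes that finished within the budget."""
    if not latencies_ms:
        raise ValueError("no samples")
    within = sum(1 for x in latencies_ms if x <= budget_ms)
    return 100.0 * within / len(latencies_ms)


def percentile(latencies_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with p% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic samples: 5 writes inside the budget, 995 outside it.
samples = [10.0] * 5 + [300.0] * 995
compliance = sla_compliance(samples)   # 0.5 (percent)
p99 = percentile(samples, 99)          # 300.0
```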
That single experiment completely changed where I thought optimization effort should go.
The bottleneck wasn't checkpointing.
It wasn't serialization.
It wasn't caching.
It was storage contention.
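The single-writer behaviour behind that contention is easy to reproduce with Python's standard `sqlite3` module. The snippet below is a standalone illustration, not part of the benchmark suite; the table layout and file name are made up for the example.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")

# isolation_level=None gives autocommit mode, so transactions are controlled
# explicitly with BEGIN / COMMIT.
writer_a = sqlite3.connect(path, isolation_level=None)
writer_a.execute("CREATE TABLE checkpoints (agent_id TEXT PRIMARY KEY, state BLOB)")

# timeout=0: fail immediately instead of waiting for the lock to clear.
writer_b = sqlite3.connect(path, timeout=0, isolation_level=None)

writer_a.execute("BEGIN IMMEDIATE")  # writer A takes the write lock
writer_a.execute("INSERT INTO checkpoints VALUES ('agent-1', x'00')")

try:
    writer_b.execute("BEGIN IMMEDIATE")  # writer B asks for it as well
    blocked, message = False, ""
except sqlite3.OperationalError as exc:
    blocked, message = True, str(exc)    # SQLite reports "database is locked"

writer_a.execute("COMMIT")               # lock released

writer_b.execute("BEGIN IMMEDIATE")      # now succeeds
writer_b.execute("INSERT INTO checkpoints VALUES ('agent-2', x'01')")
writer_b.execute("COMMIT")

rows = writer_b.execute("SELECT COUNT(*) FROM checkpoints").fetchone()[0]
writer_a.close()
writer_b.close()
```

With a non-zero `timeout`, the second writer waits instead of failing, which is where queueing latency under many concurrent writers comes from.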
One More Surprise
After removing storage contention, another bottleneck appeared.
Compression.
Large checkpoints (~793 KB) spent far more time compressing data than writing it.
In other words:
Once storage became fast enough, CPU work became the limiting factor.
That's exactly the kind of bottleneck you want to discover through benchmarking.
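A simple way to check where the time goes is to time compression and the write separately. The sketch below uses `zlib` and a plain dictionary as a stand-in for a fast store; both are assumptions, the payload is synthetic, and the absolute numbers it prints say nothing about Living AI itself.

```python
import os
import time
import zlib


def time_call(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


# Synthetic payload of roughly 793 KB: half incompressible random bytes,
# half repetitive text, to mimic mixed agent state.
size = 793 * 1024
payload = os.urandom(size // 2) + b"agent-state " * (size // 2 // 12)

store = {}  # in-memory stand-in for a fast storage backend
for level in (1, 6, 9):
    blob, compress_ms = time_call(zlib.compress, payload, level)
    _, write_ms = time_call(store.__setitem__, f"level-{level}", blob)
    print(
        f"level={level} size={len(blob)} B "
        f"compress={compress_ms:.2f} ms write={write_ms:.4f} ms"
    )
```

Running the same loop against a real backend and a real checkpoint shows whether a lower compression level or a faster codec is worth the larger blob.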
What Living AI Is
Living AI is an experiment in building infrastructure for long-running AI systems.
Current components include:
- Pluggable persistence layer
- Compression abstraction
- Hot memory cache
- Recovery API
- Performance metrics
- Benchmark suite
Rather than optimizing prompts, the focus is on making agent execution more resilient and observable.
Lessons Learned
This project reinforced three engineering lessons:
1. Architecture matters more than micro-optimizations.
A clean storage abstraction made it possible to compare backends without rewriting checkpoint logic.
2. Benchmarks often reveal a different bottleneck than you expect.
I started by trying to optimize checkpointing.
I ended up learning much more about storage systems.
3. Infrastructure deserves as much attention as models.
As AI agents become longer-lived and more autonomous, persistence, recovery, and state management become increasingly important parts of the stack.
What's Next
I'm currently exploring:
- Additional storage backends
- Faster compression algorithms
- Async persistence queues
- Framework integrations
- Larger-scale reproducible benchmarks
If you're building AI agents, workflow engines, or distributed systems, I'd love to hear how you're approaching checkpointing and recovery.
The code is still evolving, and feedback from other engineers would be incredibly valuable.
Check the code: LivingAI