LIKKI SAMARTH REDDY

Posted on Jul 2

I Thought SQLite Was Fast—Until 50 AI Agents Started Writing at Once

#ai #python #sql #opensource

A real-world engineering experiment on checkpointing, persistence, and why your storage backend matters more than your AI model.

Everyone talks about LLMs.

Bigger models.
Better prompts.
Smarter agents.

But almost nobody talks about what happens after an AI agent has been running for hours.

Where does its state live?
How do you resume after a crash?
How do hundreds of agents save progress simultaneously?
What happens when persistence becomes the bottleneck?

Those questions led me to build Living AI—an experimental checkpointing engine for long-running AI agents.

Its purpose isn't to replace agent frameworks.

It's to solve a simpler problem:

Persist agent state quickly without letting storage stall execution.

The Experiment

I built Living AI around a few simple ideas:

Pluggable storage backends
In-memory hot cache
Compression
Recovery API
Execution-budgeted persistence
Built-in performance metrics

The architecture looks like this:

Agent
   │
   ▼
Checkpoint Engine
   │
   ├── Compress State
   ├── Update Hot Cache
   └── Persist to Storage
          │
          ├── SQLite
          └── Redis-compatible Store
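
The flow in the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not Living AI's actual code: the `CheckpointEngine` class, its `save` and `recover` methods, the choice of `pickle` plus `zlib`, and the plain dict standing in for the storage backend are all hypothetical.

```python
import pickle
import zlib
from collections import OrderedDict


class CheckpointEngine:
    """Hypothetical engine: compress, update hot cache, persist."""

    def __init__(self, backend, cache_size=2):
        self.backend = backend          # any mapping of agent_id -> bytes
        self.cache_size = cache_size
        self.hot_cache = OrderedDict()  # agent_id -> state, oldest first

    def save(self, agent_id, state):
        blob = zlib.compress(pickle.dumps(state))   # 1. compress state
        self.hot_cache[agent_id] = state            # 2. update hot cache
        self.hot_cache.move_to_end(agent_id)
        while len(self.hot_cache) > self.cache_size:
            self.hot_cache.popitem(last=False)      # evict least recently used
        self.backend[agent_id] = blob               # 3. persist to storage

    def recover(self, agent_id):
        if agent_id in self.hot_cache:              # hot path: no storage read
            self.hot_cache.move_to_end(agent_id)
            return self.hot_cache[agent_id]
        blob = self.backend[agent_id]               # cold path: read + decompress
        return pickle.loads(zlib.decompress(blob))


engine = CheckpointEngine(backend={}, cache_size=2)
for i in range(3):
    engine.save(f"agent-{i}", {"step": i})

evicted = "agent-0" not in engine.hot_cache   # pushed out by the two later saves
cold_state = engine.recover("agent-0")        # served from storage
hot_state = engine.recover("agent-2")         # served from the cache
```

The small `cache_size` forces an eviction, which mirrors the "forced cache eviction" condition in the first benchmark: one recovery takes the hot path and one takes the cold path.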

Instead of benchmarking models, I benchmarked the infrastructure beneath them.

Benchmark #1 — Single Agent

First I tested a single long-running workflow.

Workload

500 checkpoints
250 KB state
Forced cache eviction

Results

✅ Average save: 13.45 ms

✅ Hot-cache recovery: 1.11 ms

✅ Cold recovery (SQLite + decompression): 1.48 ms

At this point everything looked healthy.

Benchmark #2 — Then I Added 50 Concurrent Agents

This is where things became interesting.

Configuration:

50 concurrent agents
1000 checkpoint attempts
SQLite backend
50 ms persistence budget

Results:

Metric	SQLite
Successful writes	5
Timed-out writes	995
Average write latency	282 ms
Maximum latency	735 ms

At first glance, this looked terrible.

Then I realized something important.

The checkpoint engine wasn't slow.

SQLite had become the bottleneck.

Because SQLite allows only one writer at a time, concurrent checkpoint requests began waiting on database locks.
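
The single-writer behavior is easy to reproduce with Python's built-in `sqlite3` module. The sketch below is not the benchmark itself: the table name, the two connections, and the 50 ms `timeout` are assumptions chosen to mirror the persistence budget. One connection holds the write lock while a second one tries to write.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")

# isolation_level=None means autocommit, so the explicit BEGIN/COMMIT
# statements below are the only transaction control.
writer_a = sqlite3.connect(path, isolation_level=None)
writer_a.execute("CREATE TABLE checkpoints (agent_id TEXT PRIMARY KEY, state BLOB)")

# timeout=0.05: wait roughly 50 ms for a lock before giving up.
writer_b = sqlite3.connect(path, isolation_level=None, timeout=0.05)

writer_a.execute("BEGIN IMMEDIATE")  # take the write lock and hold it
writer_a.execute("INSERT INTO checkpoints VALUES ('agent-a', x'00')")

try:
    writer_b.execute("INSERT INTO checkpoints VALUES ('agent-b', x'01')")
    second_write = "ok"
except sqlite3.OperationalError as exc:
    second_write = str(exc)  # the second writer could not get the lock in time

writer_a.execute("COMMIT")  # release the lock
writer_b.execute("INSERT INTO checkpoints VALUES ('agent-b', x'01')")  # now succeeds
rows = writer_a.execute("SELECT COUNT(*) FROM checkpoints").fetchone()[0]
```

With 50 agents, every writer except the one holding the lock is in the position of `writer_b`, which is why waiting time dominates the latency numbers.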

The engine's timeout policy skipped slow persistence attempts rather than blocking agent execution.

That trade-off kept the agents responsive under load.
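
One way to implement that kind of budget is to hand the write to a worker thread and stop waiting once the budget expires. This is a sketch of the idea, not the engine's actual policy: `persist_with_budget`, the thread pool, and the two fake write functions are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def persist_with_budget(write, blob, budget_s=0.05):
    """Return True if write(blob) finished within the budget, else False."""
    future = _pool.submit(write, blob)
    try:
        future.result(timeout=budget_s)
        return True
    except FutureTimeout:
        # The agent moves on. A write that has already started keeps running
        # in the worker thread; cancel() only stops one that is still queued.
        future.cancel()
        return False


def fast_write(blob):
    pass  # stands in for an uncontended store


def slow_write(blob):
    time.sleep(0.3)  # stands in for a write stuck behind a lock


results = [
    persist_with_budget(fast_write, b"state"),
    persist_with_budget(slow_write, b"state"),
]
```

The design choice is that a missed checkpoint is cheaper than a stalled agent. The cost is that a timed-out write may still land late, so a real implementation would need to decide whether late writes are acceptable or must be discarded.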

The Real Test

Without changing the checkpoint engine...

Without changing compression...

Without changing agent logic...

I swapped only the storage backend.

I replaced SQLite with a Redis-compatible store.

Exactly the same workload.

Exactly the same checkpoint engine.

Here are the results.

Metric	SQLite	Redis-compatible
SLA compliance (writes within the 50 ms budget)	0.5%	100%
Average Write	282 ms	0.64 ms
p99 Write	735 ms	1.23 ms
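
The 0.5% figure follows directly from the Benchmark #2 counts: 5 successful writes out of 1000 attempts. The helpers below show one way such metrics could be computed from raw latencies. The function names and the nearest-rank percentile method are assumptions, and the sample latencies are synthetic values for illustration, not measurements.

```python
import math


def sla_compliance(latencies_ms, budget_ms=50.0):
    """Percentage of writes that finished within the budget."""
    within = sum(1 for x in latencies_ms if x <= budget_ms)
    return 100.0 * within / len(latencies_ms)


def percentile(latencies_ms, pct):
    """Nearest-rank percentile of a list of latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# From the reported counts: 5 of 1000 writes met the budget.
reported_compliance = 100.0 * 5 / 1000

# Synthetic sample with the same shape: 5 fast writes, 995 slow ones.
sample = [10.0] * 5 + [300.0] * 995
sample_compliance = sla_compliance(sample)
sample_p99 = percentile(sample, 99)
```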

That single experiment changed where I thought the optimization effort should go.

The bottleneck wasn't checkpointing.

It wasn't serialization.

It wasn't caching.

It was storage contention.

One More Surprise

After removing storage contention, another bottleneck appeared.

Compression.

For large checkpoints (~793 KB), the engine spent far more time compressing data than writing it.

In other words:

Once storage became fast enough, CPU work became the limiting factor.
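
A quick way to see this effect is to time compression and the write separately. The sketch below assumes `zlib` as the compressor and a plain dict as the "fast store"; neither is confirmed as what Living AI uses. The payload is synthetic, sized to match the ~793 KB checkpoints.

```python
import os
import time
import zlib

# Synthetic ~793 KB payload: part repetitive text, part random bytes.
payload = (b"agent-state " * 40_000)[:400_000] + os.urandom(393_000)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


store = {}    # stands in for a fast key-value backend
timings = {}  # level -> (compressed bytes, compress ms, write ms)
for level in (1, 6, 9):
    blob, compress_ms = timed(zlib.compress, payload, level)
    _, write_ms = timed(store.__setitem__, f"level-{level}", blob)
    timings[level] = (len(blob), compress_ms, write_ms)
    assert zlib.decompress(blob) == payload  # round-trip sanity check

for level, (size, compress_ms, write_ms) in timings.items():
    print(f"level={level} size={size} compress={compress_ms:.2f} ms write={write_ms:.4f} ms")
```

Actual numbers depend on the machine and the data, so none are shown here. The point of the comparison is the ratio between the two columns and how it shifts with the compression level.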

That's exactly the kind of bottleneck you want to discover through benchmarking.

What Living AI Is

Living AI is an experiment in building infrastructure for long-running AI systems.

Current components include:

Pluggable persistence layer
Compression abstraction
Hot memory cache
Recovery API
Performance metrics
Benchmark suite

Rather than optimizing prompts, the focus is on making agent execution more resilient and observable.

Lessons Learned

This project reinforced three engineering lessons:

1. Architecture matters more than micro-optimizations.

A clean storage abstraction made it possible to compare backends without rewriting checkpoint logic.
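
As an illustration of what such an abstraction can look like, here is a minimal sketch. The `StorageBackend` protocol, both backend classes, and `run_workload` are hypothetical names, and the dict-based backend only stands in for a Redis-compatible store.

```python
import sqlite3
from typing import Optional, Protocol


class StorageBackend(Protocol):
    def put(self, key: str, blob: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...


class SQLiteBackend:
    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(agent_id TEXT PRIMARY KEY, state BLOB)"
        )

    def put(self, key, blob):
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO checkpoints (agent_id, state) VALUES (?, ?)",
                (key, blob),
            )

    def get(self, key):
        row = self._db.execute(
            "SELECT state FROM checkpoints WHERE agent_id = ?", (key,)
        ).fetchone()
        return row[0] if row else None


class DictBackend:
    """In-process key-value store standing in for a Redis-compatible backend."""

    def __init__(self):
        self._data = {}

    def put(self, key, blob):
        self._data[key] = blob

    def get(self, key):
        return self._data.get(key)


def run_workload(backend: StorageBackend, n: int = 100) -> Optional[bytes]:
    """The same checkpoint loop, regardless of which backend is plugged in."""
    for i in range(n):
        backend.put("agent-1", f"checkpoint-{i}".encode())
    return backend.get("agent-1")


results = [run_workload(backend) for backend in (SQLiteBackend(), DictBackend())]
```

Because `run_workload` only depends on `put` and `get`, swapping the backend is a one-line change at the call site.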

2. Benchmarks often reveal a different bottleneck than you expect.

I started by trying to optimize checkpointing.

I ended up learning much more about storage systems.

3. Infrastructure deserves as much attention as models.

As AI agents become longer-lived and more autonomous, persistence, recovery, and state management become increasingly important parts of the stack.

What's Next

I'm currently exploring:

Additional storage backends
Faster compression algorithms
Async persistence queues
Framework integrations
Larger-scale reproducible benchmarks

If you're building AI agents, workflow engines, or distributed systems, I'd love to hear how you're approaching checkpointing and recovery.

The code is still evolving, and feedback from other engineers would be incredibly valuable.

Check the code: LivingAI
