LIKKI SAMARTH REDDY

Why AI Agents Need a 50ms SLA Checkpoint Engine (and How We Built One)

Building AI agents that survive production is a different problem than building AI agents that work in development.

In development, your agent runs once, on your machine, with no concurrent users and a database that responds in milliseconds. In production, you have fifty agents running simultaneously, conversation histories that grow to hundreds of kilobytes, and a database that occasionally locks, times out, or becomes briefly unavailable.

Most agent frameworks were not designed for this reality. And the gap shows up in one specific place: checkpointing.


The silent killer in production agent architectures

Checkpointing is how an agent saves its state between steps. Every major framework does it. LangGraph has SqliteSaver and PostgresSaver. CrewAI has its own persistence layer. The OpenAI Agents SDK has thread state management.

What almost none of them account for is what happens when the database is slow.

The standard implementation looks roughly like this:

async def save_checkpoint(state):
    await database.write(state)  # blocks until complete
    continue_execution()

Under normal conditions, this is fine. Under concurrent load with SQLite, this is catastrophic. SQLite uses file-level write locking. When fifty agents try to write simultaneously, they queue behind each other. Write latencies spike from under a millisecond to over seven hundred milliseconds. Your agent, which was supposed to respond in two seconds, is now waiting three quarters of a second just to save its state at each step.

We ran this exact scenario. Fifty concurrent agents, one thousand total writes, payloads growing from five to one hundred kilobytes as conversation histories accumulated. With SQLite as the backing store, average write latency was 282ms. The p99 was 735ms. SLA compliance, defined as completing the write within 50ms, was 0.5%.

That is not a configuration problem. That is a fundamental architectural mismatch between SQLite's single-writer model and concurrent agent workloads.
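To make the two headline metrics concrete, here is a minimal sketch of how SLA compliance and a p99 could be computed from a list of write latencies. The function names and the nearest-rank percentile method are assumptions for illustration; the benchmark harness may compute these differently.

```python
import math

def sla_compliance(latencies_ms, budget_ms=50.0):
    """Fraction of writes that completed within the latency budget."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for ms in latencies_ms if ms <= budget_ms)
    return within / len(latencies_ms)

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

With only a handful of samples, the p99 is simply the slowest write, which is why tail numbers only become meaningful at the thousand-write scale used in the tests.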


The architecture we built

Living AI is our open-source solution to this problem. The core insight is that checkpointing should never block the agent execution thread, regardless of what the database is doing.

The architecture has three components.

The first is a hot RAM cache. When an agent saves state, it writes synchronously to an in-process LRU cache with a configurable TTL. This write is always sub-millisecond because it never touches disk or network. Reads check this cache first. In a running agent, the most recent state is almost always in the cache, which means the common read path resolves in microseconds.
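As a rough illustration of the idea, an in-process LRU cache with a TTL fits in a few lines of standard-library Python. The class name `HotCache` and its parameters are hypothetical, not Living AI's actual API.

```python
import time
from collections import OrderedDict

class HotCache:
    """In-process LRU cache with a per-entry TTL (illustrative sketch)."""

    def __init__(self, max_entries=1024, ttl_seconds=300.0, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (expires_at, value), oldest first
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)  # mark as most recently used
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)  # a read also refreshes recency
        return value
```

Both operations are dictionary lookups plus a pointer move, which is why this path stays far below the cost of any disk or network write.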

The second is a budgeted durable write. After updating the RAM cache, the engine attempts to write to the backing database. This write runs inside asyncio.wait_for with a hard timeout, fifty milliseconds by default. If the database cannot complete the write within budget, the engine drops the write, logs it as a missed checkpoint, and continues. The agent thread is never blocked.
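A minimal sketch of the budgeted write, assuming the store exposes an async `write(key, blob)` coroutine. That signature, the `budgeted_write` name, and the log message are assumptions for illustration; only `asyncio.wait_for` and the 50ms default come from the description above.

```python
import asyncio
import logging

log = logging.getLogger("checkpoint")

async def budgeted_write(write, key, blob, budget_s=0.050):
    """Attempt a durable write, giving up once the budget is spent.

    Returns True if the write completed in time, False if it was dropped.
    """
    try:
        await asyncio.wait_for(write(key, blob), timeout=budget_s)
        return True
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending write at this point.
        log.warning("missed checkpoint for %s (budget %.0f ms)", key, budget_s * 1000)
        return False
```

One caveat with this pattern: `asyncio.wait_for` can only interrupt a coroutine at an `await` point. If the store makes a blocking synchronous call inside the coroutine, the timeout cannot fire until that call returns, so blocking drivers need to run in an executor for the budget to hold.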

The third is a self-describing compression layer. Every state blob is compressed with zlib at level six and prepended with a one-byte codec header. The header value 0x00 means uncompressed, 0x01 means zlib. This detail matters more than it sounds: it means you can change the compression algorithm to zstd in the future without breaking any existing checkpoints. Old blobs read their own header and decompress correctly regardless of the current default.
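The header scheme can be sketched as follows. The function and constant names are hypothetical; the header values 0x00 and 0x01 and zlib level six are the ones described above.

```python
import zlib

CODEC_RAW = 0x00   # payload stored uncompressed
CODEC_ZLIB = 0x01  # payload compressed with zlib

def encode_blob(state: bytes, compress: bool = True) -> bytes:
    """Prefix the payload with a one-byte codec header."""
    if compress:
        return bytes([CODEC_ZLIB]) + zlib.compress(state, 6)
    return bytes([CODEC_RAW]) + state

def decode_blob(blob: bytes) -> bytes:
    """Read the header and decode accordingly, whatever the current default is."""
    if not blob:
        raise ValueError("empty blob")
    codec, payload = blob[0], blob[1:]
    if codec == CODEC_RAW:
        return payload
    if codec == CODEC_ZLIB:
        return zlib.decompress(payload)
    raise ValueError(f"unknown codec header: {codec:#04x}")
```

Adding a new codec then means assigning it a new header value and one more branch in the decoder, while blobs written under the old default keep decoding unchanged.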

The ordering of the first two components is the critical design decision. The RAM cache is updated before the database write is attempted. This means even if every database write times out, the agent still has access to its current state through the cache, and crash recovery still works. We stress tested this directly: in our hyperscale test with 150 concurrent agents, 1500 total writes, and 0.77MB state payloads, 99.3% of SQLite writes missed the 50ms budget, and recovery success rate was still 100%.


What the benchmark numbers actually show

We ran three test tiers and want to be transparent about what each one measures.

The single-agent benchmark, which is what the README headline numbers come from, uses the SQLite store with one writer and 50KB compressed blobs. Checkpoint write latency at p50 is 0.3ms, at p95 is 0.8ms, at p99 is approximately 1ms. Hot cache reads resolve in around 4 microseconds.

The production workload test uses 50 concurrent agents, 1000 total writes, and payloads growing from 5KB to 100KB. This is where the SQLite versus Redis comparison becomes meaningful:

| Metric | SQLite | Redis |
| --- | --- | --- |
| SLA compliance within 50ms | 0.5% | 100% |
| Average write latency | 282ms | 0.64ms |
| p99 write latency | 735ms | 1.23ms |
| Recovery success rate | 100% | 100% |

The hyperscale test uses 150 concurrent agents, 1500 total writes, and 0.77MB payloads representing large context windows with long histories and extensive tool call records:

| Metric | SQLite | Redis |
| --- | --- | --- |
| SLA compliance within 50ms | 0.7% | 100% |
| p99 write latency | above 800ms for successes | 62ms |
| Recovery success rate | 100% | 100% |
| p99 recovery read latency | 8.84ms | 6.61ms |
| Total execution time | 85.81s | 68.54s |

One honest observation about the Redis hyperscale p99 of 62ms: this is slightly above the 50ms SLA, and all writes still completed because asyncio loop scheduling allowed them through. The bottleneck at this scale is not the database. It is CPU. Compressing a 0.77MB blob with zlib is a CPU-bound operation that runs under Python's GIL. At that payload size, the compression itself takes approximately 40ms, which leaves little budget for I/O. Teams hitting this ceiling have two options: switch to zstd, which compresses significantly faster, or offload compression to a process pool executor. We will add both as configuration options in a future release.
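The second option, moving compression off the event loop, might look roughly like this. It is a sketch under assumptions: the function name is hypothetical, and this is not a released Living AI configuration option.

```python
import asyncio
import zlib

async def compress_off_loop(data: bytes, executor=None, level: int = 6) -> bytes:
    """Run zlib compression in an executor so the event loop stays responsive.

    executor=None uses the loop's default thread pool. Pass a
    concurrent.futures.ProcessPoolExecutor to use worker processes instead.
    """
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, zlib.compress, data, level)
```

With a process pool, the blob has to be pickled to the worker and the result pickled back, so it is worth measuring whether that transfer cost is smaller than the compression time you are trying to hide.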

The important pattern across all three tiers is that recovery success rate is 100% regardless of SLA compliance. The two metrics are independent because recovery reads from the RAM cache, not the database. SLA compliance tells you how much of your state made it to durable storage. Recovery success tells you whether your agents can resume after a crash. Both matter, but they are not the same number.


How Living AI fits with LangGraph and CrewAI

Living AI is not a replacement for agent frameworks. LangGraph handles graph compilation, conditional routing, state schemas, and the execution model that makes complex multi-agent workflows possible. CrewAI handles crew orchestration, role assignment, and agent collaboration. These are problems Living AI does not solve and does not try to solve.

What Living AI adds is the production reliability layer that sits underneath the framework:

Your agent logic
    ↓
LangGraph / CrewAI / OpenAI Agents
    ↓
Living AI runtime
    ↓
Redis / PostgreSQL / SQLite

The framework decides where the agent goes. Living AI makes sure it gets there reliably, can recover if it crashes, and leaves a complete execution record for debugging and compliance.

The adapter layer makes this composable. Each framework adapter is a thin translation layer that maps framework execution events to Living AI's ExecutionNode model. The core runtime has zero framework dependencies. Swapping from LangGraph to CrewAI does not change how checkpointing, recovery, or replay works.
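To give a feel for what a thin translation layer means here, this is a purely hypothetical sketch. The `Node` dataclass is a stand-in that borrows the field names from the Getting started example further down, and the event dictionary shape (`run_id`, `kind`, `output`, `error`) is invented for illustration; real framework callbacks and the real ExecutionNode will differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class NodeType(Enum):
    PROMPT = "prompt"
    TOOL = "tool"

@dataclass
class Node:
    """Stand-in for an execution node; not the library's ExecutionNode."""
    execution_id: str
    type: NodeType
    status: str
    output: Any

def from_framework_event(event: dict) -> Node:
    """Translate one (hypothetical) framework event into a runtime node."""
    kind = NodeType.TOOL if event["kind"] == "tool_end" else NodeType.PROMPT
    status = "failed" if event.get("error") else "success"
    return Node(event["run_id"], kind, status, event.get("output"))
```

The point of the shape is that everything framework-specific lives in one small function, so the runtime underneath only ever sees nodes.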


The replay capability

Crash recovery is the obvious use case. But the more interesting capability for day-to-day development is replay.

When an agent produces a wrong answer, or books the wrong flight, or sends the wrong message, the question you want to answer is: what exactly happened, and why did the model make that decision? With a standard observability tool, you have logs. You can see what happened. But you cannot re-run the execution with the exact same inputs to reproduce and debug the behavior.

Living AI stores every prompt, every response, every tool call, and every intermediate state in an append-only execution graph. The replay engine can re-execute any recorded run in four modes.

FULL replay re-executes every node from scratch, making real API calls and tool invocations. FROM_NODE replay re-executes from a specific node, skipping the work that preceded it. MOCK_TOOLS replay is the most useful for debugging: it re-runs the LLM reasoning with recorded tool responses served from the execution history, so you can iterate on prompt changes without making real API calls or triggering real side effects. COUNTERFACTUAL replay re-executes with modified input at a specific node, letting you test what would have happened if a particular tool had returned a different value.

The MOCK_TOOLS mode is what makes Living AI useful beyond just crash recovery. If a customer reports that the AI booked the wrong flight, you can replay that exact execution, inspect the LLM's reasoning at each step with the recorded context, and identify where the decision went wrong, all without touching a live system.
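The core of a MOCK_TOOLS-style replay can be sketched as a tool dispatcher that answers from the recording instead of calling anything real. The class below is an illustrative assumption, not the library's replay engine.

```python
class RecordedTools:
    """Serves tool results from a recorded run, in the original call order."""

    def __init__(self, recorded_calls):
        # recorded_calls: list of (tool_name, args, result) tuples
        self._calls = list(recorded_calls)
        self._cursor = 0

    def call(self, name, args):
        if self._cursor >= len(self._calls):
            raise LookupError(f"no recorded call left for {name!r}")
        rec_name, rec_args, result = self._calls[self._cursor]
        if (rec_name, rec_args) != (name, args):
            # The replayed reasoning asked for something the original run did not.
            raise LookupError(
                f"replay diverged at call {self._cursor}: "
                f"recorded {rec_name!r}, requested {name!r}"
            )
        self._cursor += 1
        return result
```

Failing loudly on divergence is a deliberate choice in this sketch: if a prompt change makes the model call a different tool, the recording can no longer answer truthfully, and that divergence is usually the thing you wanted to find.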


Getting started

The core library has zero runtime dependencies. Everything uses the Python standard library.

pip install livingai

For Redis:

pip install "livingai[redis]"

For PostgreSQL:

pip install "livingai[postgres]"

A minimal crash recovery example:

import asyncio
from livingai import (
    CheckpointEngine, SQLiteStore, ExecutionNode,
    RecoveryEngine, NodeType, Status
)

async def main():
    engine = CheckpointEngine(SQLiteStore("agent.db"))

    step = ExecutionNode(
        execution_id="run-1",
        type=NodeType.PROMPT,
        status=Status.SUCCESS,
        output="plan ready"
    )
    await engine.save(step, state=b"serialized agent state")

    charge = ExecutionNode(
        execution_id="run-1",
        type=NodeType.TOOL,
        status=Status.SUCCESS,
        output={"receipt": "R-1"}
    )
    await engine.save(charge)

    recovery = RecoveryEngine(CheckpointEngine(SQLiteStore("agent.db")))
    plan = await recovery.plan("run-1")

    print("resume from:", plan.resume_node_id)
    print("skip effects:", len(plan.skipped_nodes))

asyncio.run(main())

The skip effects line is the one that matters. Tool nodes are marked non-idempotent by default. The recovery engine will never re-run them. If your agent charged a card on step six and crashed on step eight, the card is not charged again on recovery.
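The rule behind that behavior can be sketched in a few lines. This is a simplified stand-in, assuming an ordered list of recorded steps with a `kind` and a success flag; the `Step` type and `plan_recovery` function are hypothetical and much simpler than a real recovery engine.

```python
from dataclasses import dataclass

@dataclass
class Step:
    node_id: str
    kind: str        # "prompt" or "tool"
    succeeded: bool

def plan_recovery(steps):
    """Return (resume_node_id, skipped_node_ids) for an ordered run history.

    Completed tool steps are treated as non-idempotent: they are listed as
    skipped and never re-run. Execution resumes at the first step that did
    not complete, or nowhere (None) if every step finished.
    """
    skipped = []
    for step in steps:
        if not step.succeeded:
            return step.node_id, skipped
        if step.kind == "tool":
            skipped.append(step.node_id)
    return None, skipped
```

For a history of plan, charge, confirm where confirm never finished, this yields a resume point of confirm and a skip list containing only charge.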

The examples directory in the repository has five runnable demos, including crash recovery, MOCK_TOOLS debugging, cost tracking, and the LangGraph adapter. None of them require an LLM API key or network access.


Choosing a store for your workload

One lesson from the benchmarks worth making explicit: the right store depends on your concurrency level, not your preference.

SQLite is the right default for local development and single-agent workloads. It requires zero configuration, ships with Python, and performs well under low concurrency. The single-agent benchmark numbers, with p99 at approximately 1ms, are real and achievable in this scenario.

Redis is the right choice for production workloads with multiple concurrent agents. The switch is one import change and a connection URL. No agent logic changes. No core configuration changes. In our production workload test, SLA compliance went from 0.5% with SQLite to 100% with Redis.

PostgreSQL is the right choice when you need long-term durable storage with query capabilities, cost aggregation across runs, and the ability to reconstruct execution history after a process restart that evicted the Redis cache.

You can also layer them: Redis as the hot tier for active executions, PostgreSQL as the cold tier for historical records. This is the configuration we recommend for teams running agents at scale.
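One way to picture the layering is a small wrapper that writes to both tiers and reads from the hot tier first. This is an illustrative sketch, not how Living AI wires stores together: the `TieredStore` name is hypothetical, both tiers are plain dict-like mappings standing in for Redis and PostgreSQL clients, and write-through is only one of several possible policies.

```python
class TieredStore:
    """Hot tier for active runs, cold tier for history (illustrative only)."""

    def __init__(self, hot, cold):
        self.hot = hot    # stand-in for a Redis-backed store
        self.cold = cold  # stand-in for a PostgreSQL-backed store

    def put(self, key, blob):
        # Write-through: both tiers see every checkpoint.
        self.hot[key] = blob
        self.cold[key] = blob

    def get(self, key):
        if key in self.hot:
            return self.hot[key]
        blob = self.cold.get(key)
        if blob is not None:
            self.hot[key] = blob  # re-warm the hot tier after an eviction
        return blob
```

A real deployment would more likely write to the cold tier asynchronously so that its latency never counts against the checkpoint budget.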


The project is Apache-2.0 licensed and completely open source.

GitHub: github.com/likkisamarthreddy/livingai

If you are running agents in production and have hit reliability problems we have not covered here, open a GitHub Discussion. We are actively building the next milestone, which is a FastAPI cloud backend with a web-based replay UI, and real production feedback is shaping what gets built.

