I kill -9'd a running AI agent mid-task. It resumed without re-spending a cent.

#ai #llm #opensource #devops

The premise is simple: a long-running AI agent crashes mid-task. What should happen next?

The common answer: your orchestrator checkpoints the agent's context so it can resume from the last step. LangGraph does this with a checkpointer (MemorySaver for in-process state, SqliteSaver for state that survives a restart). Temporal replays the workflow's event history. These work, for the agent's state.

But there's a second thing nobody's checkpointing: the budget envelope around the agent.

What did this run spend before it crashed? How many loops did it complete? How much wall-clock time has elapsed? If those limits live only in memory and the process dies, your safety layer vanishes with it, and a "resumed" agent is really a new agent with a fresh, reset budget.

The failure mode

Here's what that looks like in practice:

Agent is mid-run: loop 47 of a research task, $2.50 of a $5 budget spent.
The runtime crashes -- OOM kill, deployment restart, a developer kill -9s the process to redeploy a fix.
Your agent orchestrator resumes from its checkpoint. Good.
Your enforcement layer? No crash persistence. It restarts fresh, budget at $5. The run is now allowed to spend another $5, not $2.50.

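You've just raised your effective ceiling by 50%: $2.50 spent before the crash plus a fresh $5 after it. A $5 limit became $7.50 in practice, and a crash later in the run pushes it toward $10. You won't know until the bill arrives.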
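The arithmetic behind that $7.50 figure, as a small sketch. The function name and the `persisted` flag are made up for illustration and are not part of any library; the model assumes exactly one crash per run.

```python
from decimal import Decimal


def effective_ceiling(ceiling: Decimal, spent_before_crash: Decimal, persisted: bool) -> Decimal:
    """Most a run can spend in total across one crash and one resume.

    Hypothetical helper for illustration only.
    """
    if persisted:
        # The spend survives the crash, so the original ceiling still applies.
        return ceiling
    # The spend is forgotten, so the resumed process grants a full new budget.
    return spent_before_crash + ceiling


in_memory = effective_ceiling(Decimal("5.00"), Decimal("2.50"), persisted=False)
durable = effective_ceiling(Decimal("5.00"), Decimal("2.50"), persisted=True)
print(in_memory, durable)
```

The later the crash, the worse the in-memory case gets: a crash at $4.99 spent leaves the run free to spend $9.99 in total.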

What crash-resumable enforcement looks like

RiskKernel stores the complete run state in a SQLite database rather than in process memory: budget allocated, budget spent (from the cost ledger), loop counter, wall-clock start time, halt reason. When the runtime process dies and restarts, the state is still there. A call to Runtime.resume_run() picks up the same run ID -- the same budget remaining, the same loop counter, the same ceiling.
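To make the idea concrete, here is a minimal sketch of run state kept in SQLite using only Python's standard library. The table names, column names, and helper functions are assumptions for illustration; this is not RiskKernel's actual schema or API.

```python
import sqlite3
import time

# Hypothetical schema: one row per run, plus an append-only cost ledger.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id       TEXT PRIMARY KEY,
    budget_cents INTEGER NOT NULL,
    loops        INTEGER NOT NULL DEFAULT 0,
    started_at   REAL NOT NULL,
    halt_reason  TEXT
);
CREATE TABLE IF NOT EXISTS cost_ledger (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id     TEXT NOT NULL REFERENCES runs(run_id),
    cost_cents INTEGER NOT NULL
);
"""


def open_store(path):
    """Open (or create) the on-disk store."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def start_run(conn, run_id, budget_cents):
    """Create the run row once; a second call with the same ID is a no-op."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO runs (run_id, budget_cents, started_at) VALUES (?, ?, ?)",
            (run_id, budget_cents, time.time()),
        )


def record_call(conn, run_id, cost_cents):
    """Append to the ledger and bump the loop counter in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO cost_ledger (run_id, cost_cents) VALUES (?, ?)",
            (run_id, cost_cents),
        )
        conn.execute("UPDATE runs SET loops = loops + 1 WHERE run_id = ?", (run_id,))


def remaining_cents(conn, run_id):
    """Budget minus the ledger total, computed from disk rather than memory."""
    (budget,) = conn.execute(
        "SELECT budget_cents FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    (spent,) = conn.execute(
        "SELECT COALESCE(SUM(cost_cents), 0) FROM cost_ledger WHERE run_id = ?", (run_id,)
    ).fetchone()
    return budget - spent
```

Because the ledger insert and the loop increment share a transaction, a kill between the two statements leaves neither applied, so the counter and the spend cannot drift apart.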

The demo (examples/kill-9-resume/demo.sh) is deliberately hostile:

# 1. Start the runtime and kick off an agent run
riskkernel start &
python3 examples/quickstart.py &   # agent starts consuming budget

# 2. Mid-run: kill -9 the runtime process (not the agent -- the proxy)
sleep 4
kill -9 $(pgrep riskkernel)

# 3. Restart and resume the same run
riskkernel start &
python3 - <<'EOF'
from riskkernel import Runtime
rt = Runtime()
rt.resume_run("run_abc123")     # same run ID
EOF

When it completes: the total spend equals what a clean, uninterrupted run would have spent. The loop counter doesn't double -- loops completed before the crash are still logged. The budget remaining at the time of the crash is the budget remaining on resume. If the run was going to halt at call 20 under a $0.25 ceiling, it still halts at call 20 -- not call 20+N because the counter reset.

That last point is the one worth stating clearly: the budget is a property of the run, not the process.
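A toy simulation of that property, under stated assumptions: a $0.25 ceiling, a flat hypothetical cost of $0.0125 per call (so exactly 20 calls fit), and a "crash" modelled by closing the database connection and reopening it from disk. It does not use RiskKernel and does not send a real kill -9; it only shows why a counter kept on disk halts at the same call.

```python
import os
import sqlite3
import tempfile

CEILING = 250_000       # $0.25 in micro-dollars (hypothetical)
COST_PER_CALL = 12_500  # hypothetical flat cost per call


def open_db(path):
    """Open the store and create the single demo run if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS run "
        "(id TEXT PRIMARY KEY, spent INTEGER NOT NULL, calls INTEGER NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO run VALUES ('run_demo', 0, 0)")
    conn.commit()
    return conn


def try_call(conn):
    """Charge one call if it fits under the ceiling; return False once halted."""
    (spent,) = conn.execute("SELECT spent FROM run WHERE id = 'run_demo'").fetchone()
    if spent + COST_PER_CALL > CEILING:
        return False
    with conn:
        conn.execute(
            "UPDATE run SET spent = spent + ?, calls = calls + 1 WHERE id = 'run_demo'",
            (COST_PER_CALL,),
        )
    return True


def run_until_halt(path, crash_after=None):
    """Drive calls until the ceiling halts the run, optionally crashing once."""
    conn = open_db(path)
    made = 0
    while try_call(conn):
        made += 1
        if crash_after is not None and made == crash_after:
            conn.close()          # simulated crash: the connection is dropped
            conn = open_db(path)  # restart: state is read back from disk
    (calls,) = conn.execute("SELECT calls FROM run WHERE id = 'run_demo'").fetchone()
    conn.close()
    return calls


def total_calls(crash_after=None):
    """Run the simulation against a fresh temporary database."""
    with tempfile.TemporaryDirectory() as tmp:
        return run_until_halt(os.path.join(tmp, "run.db"), crash_after)


print(total_calls(), total_calls(crash_after=7))
```

Both runs stop after 20 calls. If `spent` and `calls` lived in a Python object instead of the table, the crashed run would start again from zero and make 27.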

What's complementary, what's different

Temporal checkpoints your workflow's state -- which steps executed, what data they returned, what the replay log looks like. LangGraph checkpoints your agent's context -- conversation history, tool outputs, scratchpad. Both are right and necessary.

RiskKernel checkpoints the enforcement envelope: the hard limits that keep the run bounded and resumable. These are orthogonal layers and they compose:

Your agent resumes from its own checkpoint at step 47 of 100.
RiskKernel resumes from the same run ID, with $2.50 remaining and 47 loops logged.
Together: durability of intent and durability of constraint.

One thing to be explicit about: RiskKernel doesn't checkpoint your agent's LLM context. It doesn't know what your agent was thinking at loop 47. That's your orchestrator's job. What it does know -- durably -- is that this run ID has spent $2.50, completed 47 loops, and the ceiling is $5. The API proxy enforces that on every subsequent call, regardless of what the agent's internal state looks like.
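A per-call check of this kind could look like the sketch below. The `RunState` fields, the `halt_reason` function, and the reason strings are hypothetical and do not come from RiskKernel's API. The check reads only the durable run record and never looks at the agent's context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunState:
    """Durable facts about a run (hypothetical field names)."""
    budget_cents: int
    spent_cents: int
    loops: int
    max_loops: int
    started_at: float   # wall-clock epoch seconds, persisted at run start
    max_seconds: float


def halt_reason(state: RunState, next_cost_cents: int, now: float) -> Optional[str]:
    """Return why the next call must be refused, or None if it may proceed."""
    if state.spent_cents + next_cost_cents > state.budget_cents:
        return "budget_exceeded"
    if state.loops >= state.max_loops:
        return "loop_limit"
    if now - state.started_at > state.max_seconds:
        return "wall_clock_exceeded"
    return None


# The run from the example: $5 ceiling, $2.50 spent, 47 loops logged.
state = RunState(budget_cents=500, spent_cents=250, loops=47,
                 max_loops=100, started_at=0.0, max_seconds=3600.0)
print(halt_reason(state, next_cost_cents=100, now=60.0))
print(halt_reason(state, next_cost_cents=300, now=60.0))
```

One design note on the assumption about `started_at`: it has to be a wall-clock timestamp such as `time.time()`, because a monotonic clock reading is not comparable across process restarts.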

Why this matters more than the setup cost

A one-shot agent that stays under budget on a clean run is easy. The hard problem is a long-running agent: a code-review pipeline, a research sweep, an overnight batch job. These are exactly the agents that crash, get OOM-killed, or get redeployed while they're mid-run.

Without crash-resumable enforcement, your safety layer is episodic: it runs until the process dies, and then it's gone. With it, the budget is durable -- a crashed run is a paused run, not a reset one.

The runtime is RiskKernel: open-source (Apache-2.0), self-hosted, pip install riskkernel. The full demo is in examples/kill-9-resume/ -- the kill -9 step is in the script, not just described. Run it, then run it again with a different budget and confirm the ceiling held. If it doesn't hold, open an issue; the only way to know it works is for someone to try to break it.