DEV Community: ADARSH PRASHAR

I kill -9'd a running AI agent mid-task. It resumed without re-spending a cent.

ADARSH PRASHAR — Wed, 10 Jun 2026 16:23:45 +0000

The premise is simple: a long-running AI agent crashes mid-task. What should happen next?

The common answer is: your orchestrator checkpoints the agent's context, so it can resume from the last step. LangGraph does this with MemorySaver/SQLite. Temporal replays the event log. These work. For the agent's state.

But there's a second thing nobody's checkpointing: the budget envelope around the agent.

What did this run spend before it crashed? How many loops did it complete? How much wall-clock time has elapsed? If those limits live only in memory and the process dies, your safety layer vanishes with it, and a "resumed" agent is really a new agent with a fresh, reset budget.

The failure mode

Here's what that looks like in practice:

Agent is mid-run: loop 47 of a research task, $2.50 of a $5 budget spent.
The runtime crashes -- OOM kill, deployment restart, a developer kill -9s the process to redeploy a fix.
Your agent orchestrator resumes from its checkpoint. Good.
Your enforcement layer? No crash persistence. It restarts fresh, budget at $5. The run is now allowed to spend another $5, not $2.50.

You've just raised your effective ceiling by 50%. A $5 limit became $7.50 in practice ($2.50 spent before the crash plus a fresh $5 after it), and you won't know it until the bill arrives.
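The arithmetic behind that failure fits in a few lines. The class below is a hypothetical in-memory enforcer written for illustration, not RiskKernel's code; it only shows why a limit that lives in process memory resets when the process does.

```python
from decimal import Decimal


class InMemoryBudget:
    """Hypothetical enforcer whose state lives only in process memory."""

    def __init__(self, ceiling: Decimal) -> None:
        self.ceiling = ceiling
        self.spent = Decimal("0")

    def charge(self, amount: Decimal) -> bool:
        """Record a charge if it fits under the ceiling; refuse it otherwise."""
        if self.spent + amount > self.ceiling:
            return False
        self.spent += amount
        return True


ceiling = Decimal("5.00")

# First process: the run spends $2.50, then the process is killed.
before_crash = InMemoryBudget(ceiling)
before_crash.charge(Decimal("2.50"))
spent_before_crash = before_crash.spent

# Restart: a new object is built, so the spend counter starts from zero again.
after_restart = InMemoryBudget(ceiling)
still_allowed = after_restart.ceiling - after_restart.spent

# Worst-case total spend for the run across both processes.
effective_ceiling = spent_before_crash + still_allowed
print(effective_ceiling)  # 7.50
```

The later the crash happens, the worse this gets: a run killed just under its ceiling could spend almost twice the configured limit.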

What crash-resumable enforcement looks like

RiskKernel stores the complete run state in a SQLite database, not in memory: budget allocated, budget spent (from the cost ledger), loop counter, wall-clock start time, halt reason. When the runtime process dies and restarts, the state is still there. A call to Runtime.resume_run() picks up the same run ID -- the same budget remaining, the same loop counter, the same ceiling.
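A minimal sketch of what durable run state could look like, using Python's built-in sqlite3 module. The table name, columns, and helper functions are assumptions made for illustration; they are not RiskKernel's actual schema or API, and the wall-clock and halt-reason fields are left out for brevity. Amounts are stored as integer micro-dollars to avoid floating-point drift.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema: one row per run, keyed by run ID.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id        TEXT PRIMARY KEY,
    budget_micros INTEGER NOT NULL,
    spent_micros  INTEGER NOT NULL DEFAULT 0,
    loops         INTEGER NOT NULL DEFAULT 0
)
"""


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the on-disk store; stands in for a process start."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def start_run(conn: sqlite3.Connection, run_id: str, budget_micros: int) -> None:
    conn.execute(
        "INSERT INTO runs (run_id, budget_micros) VALUES (?, ?)",
        (run_id, budget_micros),
    )
    conn.commit()


def try_call(conn: sqlite3.Connection, run_id: str, cost_micros: int) -> bool:
    """Pre-call check: refuse a call that would cross the ceiling, else record it."""
    budget, spent = conn.execute(
        "SELECT budget_micros, spent_micros FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if spent + cost_micros > budget:
        return False
    conn.execute(
        "UPDATE runs SET spent_micros = spent_micros + ?, loops = loops + 1 "
        "WHERE run_id = ?",
        (cost_micros, run_id),
    )
    conn.commit()  # the charge is on disk before anything else happens
    return True


db_path = os.path.join(tempfile.mkdtemp(), "runs.db")

# Process 1: start a $0.25 run, make 8 calls at $0.0125 each, then "crash".
conn = open_store(db_path)
start_run(conn, "run_abc123", 250_000)
for _ in range(8):
    try_call(conn, "run_abc123", 12_500)
conn.close()  # stands in for the process dying; committed rows stay on disk

# Process 2: reopen the same file and keep calling until refused.
conn = open_store(db_path)
calls_after_resume = 0
while try_call(conn, "run_abc123", 12_500):
    calls_after_resume += 1
spent, loops = conn.execute(
    "SELECT spent_micros, loops FROM runs WHERE run_id = ?", ("run_abc123",)
).fetchone()
conn.close()
print(calls_after_resume, loops, spent)  # 12 20 250000
```

Eight calls before the restart plus twelve after it gives the same twenty calls a clean run would have made, because the counter was read back from disk rather than rebuilt.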

The demo (examples/kill-9-resume/demo.sh) is deliberately hostile:

# 1. Start the runtime and kick off an agent run
riskkernel start &
python3 examples/quickstart.py &   # agent starts consuming budget

# 2. Mid-run: kill -9 the runtime process (not the agent -- the proxy)
sleep 4
kill -9 $(pgrep riskkernel)

# 3. Restart and resume the same run
riskkernel start &
python3 - <<'EOF'
from riskkernel import Runtime
rt = Runtime()
rt.resume_run("run_abc123")     # same run ID
EOF

When it completes: the total spend equals what a clean, uninterrupted run would have spent. The loop counter doesn't double -- loops completed before the crash are still logged. The budget remaining at the time of the crash is the budget remaining on resume. If the run was going to halt at call 20 under a $0.25 ceiling, it still halts at call 20 -- not call 20+N because the counter reset.

That last point is the one worth stating clearly: the budget is a property of the run, not the process.

What's complementary, what's different

Temporal checkpoints your workflow's state -- which steps executed, what data they returned, what the replay log looks like. LangGraph checkpoints your agent's context -- conversation history, tool outputs, scratchpad. Both are right and necessary.

RiskKernel checkpoints the enforcement envelope: the hard limits that keep the run bounded and resumable. These are orthogonal layers and they compose:

Your agent resumes from its own checkpoint at step 47 of 100.
RiskKernel resumes from the same run ID, with $2.50 remaining and 47 loops logged.
Together: durability of intent and durability of constraint.

One thing to be explicit about: RiskKernel doesn't checkpoint your agent's LLM context. It doesn't know what your agent was thinking at loop 47. That's your orchestrator's job. What it does know -- durably -- is that this run ID has spent $2.50, completed 47 loops, and the ceiling is $5. The API proxy enforces that on every subsequent call, regardless of what the agent's internal state looks like.

Why this matters more than the setup cost

A one-shot agent that stays under budget on a clean run is easy. The hard problem is a long-running agent: a code-review pipeline, a research sweep, an overnight batch job. These are exactly the agents that crash, get OOM-killed, or get redeployed while they're mid-run.

Without crash-resumable enforcement, your safety layer is episodic: it runs until the process dies, and then it's gone. With it, the budget is durable -- a crashed run is a paused run, not a reset one.

The runtime is RiskKernel: open-source (Apache-2.0), self-hosted, pip install riskkernel. The full demo is in examples/kill-9-resume/ -- the kill -9 step is in the script, not just described. Run it, then run it again with a different budget and confirm the ceiling held. If it doesn't hold, open an issue; the only way to know it works is for someone to try to break it.

Benchmarking a kill switch for runaway AI agents -- and why the real number is a ceiling, not a %

ADARSH PRASHAR — Mon, 08 Jun 2026 20:39:12 +0000

Claims about AI cost control are cheap. "Cut your agent spend by 60%!" is on every landing page. So instead of a claim, here's a benchmark you can run yourself in one command -- and an honest reading of what its number actually means, because the headline percentage is the least interesting part.

The short version: I ran the same looping agent twice -- once unguarded, once behind a hard dollar budget -- against a deterministic provider, and measured the spend. Then I'll show you why the "% saved" framing undersells it, and why a flat ceiling is the number that matters.

The setup

I wrote about why a runaway agent slips past logging, monitoring, and max_tokens: it's not one anomalous call, it's a thousand individually-valid ones, and the only thing that stops it is a deterministic, pre-call, per-run limit. This benchmark measures exactly that limit doing its job.

The harness (benchmark/ in the repo) is built so the only variable between the two runs is whether the budget fired:

Deterministic provider. A mock of the Chat Completions API returns a fixed token usage (1000 in / 1000 out) on every call. No network variance, no real money, exactly reproducible.
Real prices, pinned. gpt-4o at its list price ($2.50 / $10.00 per 1M input/output tokens). That makes one call cost 1000·2.50/1e6 + 1000·10.00/1e6 = $0.0125.
Measured, not modeled. The governed run's spend is read straight from the runtime's own cost ledger (GET /v1/runs/{id} -> usage.dollars), not computed by the benchmark. The runtime meters each call and halts the run before the call that would cross the ceiling.
Same per-call price on both sides, so the two numbers are directly comparable.
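The per-call figure can be checked with a small helper. The function name and signature are illustrative; the prices and token counts are the ones stated above. Decimal is used so the result is exact rather than a binary-float approximation.

```python
from decimal import Decimal


def call_cost(tokens_in: int, tokens_out: int,
              price_in_per_m: Decimal, price_out_per_m: Decimal) -> Decimal:
    """Dollar cost of one call, given prices per 1M input and output tokens."""
    million = Decimal(1_000_000)
    return (tokens_in * price_in_per_m / million
            + tokens_out * price_out_per_m / million)


per_call = call_cost(1000, 1000, Decimal("2.50"), Decimal("10.00"))
print(per_call)  # 0.0125
```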

A 50-iteration runaway, with a $0.25 ceiling on the governed run:

  RiskKernel cost benchmark -- runaway loop
  ------------------------------------------------------
  loop length (N)            50
  dollar budget              $0.25
  per-call cost              $0.0125   (gpt-4o, from the ledger)
  ------------------------------------------------------
                            calls        spend
  baseline (no governance)     50      $0.6250
  governed (RiskKernel)        20      $0.2500
  ------------------------------------------------------
  dollars saved              $0.3750   (60%)
  stopped by                 dollar_budget_exceeded

20 calls × $0.0125 = exactly $0.25. The 21st call was refused before it left the process. The baseline ran all 50.
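The two rows can be reproduced with a toy model of the harness. This is a sketch under stated assumptions: a fixed per-call cost, a hypothetical run_loop helper, and a pre-call check that refuses any call that would push spend past the ceiling. It is not the code in benchmark/.

```python
from decimal import Decimal
from typing import Optional, Tuple

PER_CALL = Decimal("0.0125")


def run_loop(n_iterations: int, ceiling: Optional[Decimal]) -> Tuple[int, Decimal]:
    """Return (calls made, dollars spent) for a loop that wants n_iterations calls."""
    calls, spent = 0, Decimal("0")
    for _ in range(n_iterations):
        # Pre-call check: refuse before the request would leave the process.
        if ceiling is not None and spent + PER_CALL > ceiling:
            break
        spent += PER_CALL
        calls += 1
    return calls, spent


baseline = run_loop(50, ceiling=None)
governed = run_loop(50, ceiling=Decimal("0.25"))
saved = baseline[1] - governed[1]
print("baseline:", baseline[0], "calls,", baseline[1])
print("governed:", governed[0], "calls,", governed[1])
print("saved:   ", saved)
```

Decimal matters here: summing 0.0125 twenty times in binary floating point is not guaranteed to land exactly on 0.25, which could shift the cutoff by one call.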

Why "60%" is the wrong number

Sixty percent looks like the headline. It isn't -- it's an artifact of where I set N. I chose a 50-call loop; the budget caught it at 20. Make the loop longer and the percentage climbs, because the governed spend doesn't move:

If the runaway loops…	Baseline spend	Governed spend	Saved
50×	$0.63	$0.25	$0.38 (60%)
1,000×	$12.50	$0.25	$12.25 (98%)
10,000×	$125.00	$0.25	$124.75 (99.8%)

The governed column is flat. That's the whole point. A runaway loop has no natural stopping condition -- that's what makes it a runaway -- so the baseline grows until a human notices, which in the canonical $47K incident took eleven days. The thing you're buying isn't a percentage discount. It's a number that cannot exceed what you set, no matter how badly the agent misbehaves or how long before anyone looks.
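The table follows from two formulas: baseline spend is N times the per-call cost, and governed spend is the smaller of that and the ceiling. A short sketch with a hypothetical savings helper, assuming the ceiling is a whole multiple of the per-call cost as it is in this benchmark:

```python
from decimal import Decimal
from typing import Tuple

PER_CALL = Decimal("0.0125")
CEILING = Decimal("0.25")


def savings(n_calls: int) -> Tuple[Decimal, Decimal, Decimal]:
    """Return (baseline spend, governed spend, percent saved) for an n-call runaway."""
    baseline = n_calls * PER_CALL
    governed = min(baseline, CEILING)
    percent = (baseline - governed) / baseline * 100
    return baseline, governed, percent


for n in (50, 1_000, 10_000):
    baseline, governed, percent = savings(n)
    print(f"{n:>6}x  baseline ${baseline:.4f}  governed ${governed:.4f}  "
          f"saved {percent:.1f}%")
```

The percentage is a function of N alone once the ceiling is fixed, which is why it says more about the benchmark's loop length than about the governor.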

So I distrust "X% cheaper" claims in this space, including ones I could make. The percentage depends entirely on the failure you benchmark against. The honest guarantee is the ceiling: spend is bounded by the budget, full stop.

Why this benchmark is honest (and where it isn't the whole story)

I'd rather you trust the harness than the author, so:

It's one command, key-free, no real spend: python3 benchmark/benchmark.py. The mock and the pricing file are right there -- inspect them, change them, break them.
It deliberately removes provider latency and variance to isolate the governance effect. This is a benchmark about dollars, not milliseconds. The enforcement overhead the runtime itself adds is small and belongs in a separate latency benchmark -- I won't smuggle it into this one.
It measures one dimension: the cost ceiling. The other half of "safe to leave running" is crash-recovery -- kill -9 a long run and resume it without re-spending -- which is demonstrated end-to-end in examples/kill-9-resume, not in this harness. A timed recovery benchmark is next.

The takeaway

If you're evaluating anything that claims to control agent cost, ask it for two things: the harness (so you can reproduce the number) and the ceiling (so you know the worst case, not the average case). A percentage without a reproducible loop length is marketing. A flat, enforced ceiling -- refused pre-call, in compiled code, read back from a ledger -- is an SLA you can reason about.

The runtime is RiskKernel: open-source (Apache-2.0), self-hosted, pip install riskkernel or docker run, one env var in front of an agent you already have. Run the benchmark, then tell me where you'd push on it -- a benchmark only earns trust if people try to break it.

The $47K agent loop: why logging, monitoring, and max_tokens all failed to stop it

ADARSH PRASHAR — Sun, 07 Jun 2026 15:28:36 +0000

In November 2025, four AI agents ran for eleven days and produced a $47,000 bill.

You've probably seen the story. A market-research pipeline: four LangChain agents coordinating over A2A. Two of them, an Analyzer and a Verifier, started ping-ponging. The Analyzer produced analysis, the Verifier asked for more, the Analyzer produced more. No termination condition, no budget cap. Eleven days later the invoice showed up.

The number is what makes it travel. But the number is not the interesting part. This is the interesting part, from the post-mortem:

They had logging. They had monitoring. They did not have a hard limit.

Sit with that. The team was not flying blind. The data was all there: every call, every token, every dollar, streaming into a dashboard the whole time. None of it mattered, because observability is a witness, not a circuit breaker. It can tell you the building is on fire. It cannot close the gas valve.

I want to walk through why the usual defenses don't catch this, and what the thing that actually catches it has to look like. Not the product pitch, the mechanism. If you run agents in production, this is the failure mode that should keep you up at night, and it's more structural than it looks.

The failure mode: a thousand perfectly valid calls

A runaway agent is not one big anomalous request. It's a thousand small, individually reasonable ones.

Every single call the Analyzer and Verifier made was well-formed. Each was under its max_tokens. Each returned a 200. Each looked, in isolation, exactly like a healthy agent doing its job. The pathology only exists at the level of the run (the loop that never closes), and almost nothing in a normal stack is watching at that level.

That's why the obvious guards slide right off it:

max_tokens is per-call, not per-run. It bounds the size of one response. It has nothing to say about ten thousand responses. A loop is the size of a tweet, ten thousand times.
Cost dashboards are post-hoc. They render spend after the calls have already happened and the money is already gone. By design they trail reality. The fire has to start before the smoke detector has anything to detect.
Alerts need a human in the loop, awake, watching. "Spend exceeded $X" fires into a Slack channel at 3am on a Saturday. Nobody saw it for eleven days because seeing it was a person's job, and people sleep, and go on vacation, and assume the thing that's been fine is still fine.

Each of these is useful. None of them is a stop. They are all instruments on the dashboard; not one of them is the brake pedal.

What a stop actually requires

If you want to stop a runaway run, not narrate it, the enforcement has to satisfy three properties. Miss any one and you're back to writing post-mortems.

1. It has to be deterministic. No model in the decision path. The whole problem is an unbounded non-deterministic system; you do not get to bound it with another non-deterministic system and call it safe. "We added an LLM that decides when to stop the LLM" is not a control, it's a second thing that can fail. The limit is total_cost > ceiling evaluated in compiled code, or it is not a limit.

2. It has to be pre-call. The check runs before the next request leaves your process, and refuses it. Anything that runs after the call has, by definition, already let the call happen and the dollars leave. Post-hoc enforcement is a contradiction: the enforcement and the spend race, and the spend wins.

3. It has to be per-run, not per-call. The unit that goes wrong is the run: total dollars, total loop iterations, total wall-clock, accumulated across every call the run makes, plus a kill switch you can pull from outside. That's the altitude the pathology lives at, so that's the altitude the budget has to live at.

Deterministic, pre-call, per-run. That's the shape of a brake pedal. I'll say the obvious thing: this is not novel computer science. It's a hard-coded resource governor, the kind of thing operating systems and databases and trading systems have had for decades. The novelty is purely that the agent world skipped it.
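The three properties fit in a small class. This is a from-scratch illustration with hypothetical names (RunGovernor, Limits, check, record), not RiskKernel's implementation, and the halt reasons other than dollar_budget_exceeded are invented for the example. The clock is injectable so the wall-clock limit can be exercised deterministically.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Limits:
    max_micros: int      # dollar ceiling, in micro-dollars
    max_loops: int       # total loop iterations for the run
    max_seconds: float   # wall-clock budget for the run


class RunGovernor:
    """Per-run limits, checked before each call, with no model in the decision."""

    def __init__(self, limits: Limits,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limits = limits
        self.clock = clock
        self.started = clock()
        self.spent_micros = 0
        self.loops = 0
        self.killed = False

    def kill(self) -> None:
        """External kill switch."""
        self.killed = True

    def check(self) -> Optional[str]:
        """Call before every request. Returns a halt reason, or None to proceed."""
        if self.killed:
            return "killed"
        # >= so a run sitting exactly at its ceiling is refused its next call.
        if self.spent_micros >= self.limits.max_micros:
            return "dollar_budget_exceeded"
        if self.loops >= self.limits.max_loops:
            return "loop_limit_exceeded"
        if self.clock() - self.started >= self.limits.max_seconds:
            return "wall_clock_exceeded"
        return None

    def record(self, cost_micros: int) -> None:
        """Call after a request completes, with its metered cost."""
        self.spent_micros += cost_micros
        self.loops += 1


governor = RunGovernor(Limits(max_micros=250_000, max_loops=1_000, max_seconds=3600))
halt_reason = None
while halt_reason is None:
    halt_reason = governor.check()
    if halt_reason is None:
        governor.record(12_500)  # stand-in for forwarding the call and metering it
print(governor.loops, halt_reason)  # 20 dollar_budget_exceeded
```

Every branch in check is an integer or float comparison against accumulated run totals, so the same inputs always produce the same decision.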

I've built this before, in a less forgiving domain

I've spent years building deterministic risk engines that wrap non-deterministic systems, the kind where a mistake costs real money, in real time, with no undo. And the lesson was always the same: the thing that kept those systems safe was never the smart part. It was a dumb, hard-coded, deterministic layer wrapped around the smart part, with the authority to say no.

The smart part proposes. The deterministic layer disposes. Every irreversible action is gated. You do not let the clever, probabilistic component hold the kill switch, because the whole reason you need a kill switch is that the clever component is the thing that goes wrong.

Agents are exactly this pattern wearing new clothes. The LLM is the strategy. The risk engine belongs in compiled, statically-typed code that the LLM cannot talk its way past.

What it looks like in practice

That conviction is why I've been building RiskKernel, an open-source, self-hosted runtime that puts that deterministic layer in front of an agent you already have. I'll keep this concrete rather than salesy, because the shape is the point and you could build a version of it yourself.

You point an existing agent at it with one environment variable:

OPENAI_BASE_URL=http://localhost:7070/v1

Now every call routes through a governor. You set a per-run budget (dollars, loop count, wall-clock seconds), and the moment the run crosses a ceiling, the next call is refused with an HTTP 402 instead of being forwarded. Enforced in Go, never by a model. Bring your own provider key; nothing leaves your machine except the call you were already making.

One detail I think is worth stealing regardless of what you use: a call that already reached the provider is never silently discarded. You paid for it, so it's returned to you, and it's the following call that gets refused. The ledger stays honest; the budget never double-counts the request that tipped it over. The brake engages on the next rotation of the loop, not by throwing away work you already paid for.
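That return-then-refuse ordering can be sketched as a small in-process proxy. Everything here is hypothetical (the GoverningProxy class, the usage_micros field, the error body); only the 402 status and the ordering come from the description above. The mock provider charges $0.10 per call so that the budget is crossed mid-call rather than landed on exactly.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict]


class GoverningProxy:
    """Hypothetical stand-in for a budget-enforcing proxy in front of a provider."""

    def __init__(self, upstream: Callable[[Dict], Response],
                 budget_micros: int) -> None:
        self.upstream = upstream
        self.budget_micros = budget_micros
        self.spent_micros = 0
        self.forwarded = 0

    def handle(self, request: Dict) -> Response:
        # Pre-call check: once the ceiling is reached, refuse without forwarding.
        if self.spent_micros >= self.budget_micros:
            return 402, {"error": "dollar_budget_exceeded"}
        status, body = self.upstream(request)
        self.forwarded += 1
        # Meter the real cost exactly once. The response is returned even if
        # this call is the one that tipped the run over its budget.
        self.spent_micros += body["usage_micros"]
        return status, body


def fake_provider(request: Dict) -> Response:
    """Mock upstream whose calls cost $0.10 each (100,000 micro-dollars)."""
    return 200, {"text": "ok", "usage_micros": 100_000}


proxy = GoverningProxy(fake_provider, budget_micros=250_000)
statuses = [proxy.handle({"prompt": "hi"})[0] for _ in range(5)]
print(statuses, proxy.forwarded, proxy.spent_micros)
# [200, 200, 200, 402, 402] 3 300000
```

In this sketch the third call takes spend to $0.30 against a $0.25 budget and is still returned to the caller; the fourth and fifth never reach the provider. Whether a real runtime also estimates a call's cost before forwarding it is outside what this sketch assumes.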

And because the failure mode I actually lose sleep over is a long, legitimate run dying halfway through, it checkpoints: you can kill -9 a run and resume it without re-spending or restarting from zero. That's the part that makes a long agent run safe to leave running, which is the whole game.

The honest part

The honest edges, today: single instance on SQLite, one API token, no streaming yet (the proxy returns a clean 501 for stream: true; mid-stream enforcement is genuinely hard and I'd rather ship it right than ship it loud). Native providers are Anthropic and OpenAI, with the long tail via an upstream gateway. It's Apache-2.0, and it phones home to no one. The only outbound traffic is to your provider and the backends you point it at. I'd rather you know the edges than discover them.

It is also not trying to be your observability stack or your policy firewall. It emits OpenTelemetry to whatever you already run; it interoperates with the gateways and the dashboards. It competes on exactly one thing: deterministically stopping a run before it hurts you.

The takeaway

If you run agents, the question to ask isn't "will I find out when one goes runaway?" You will, eventually: in the logs, in the bill, in the post-mortem. The question is "what stops it before I find out?"

Logging is not that thing. Monitoring is not that thing. max_tokens is not that thing. A deterministic, pre-call, per-run limit with a kill switch is. And it's a few hours of work to put one in front of an agent you already have, whether you use what I built or roll your own.

Eleven days. Forty-seven thousand dollars. A dashboard that saw all of it. Don't be the dashboard.

RiskKernel is open-source (Apache-2.0) and self-hosted (pip install riskkernel or docker run). If you put it in front of an agent and the guardrails are too strict or too loose, I'd genuinely like to hear where. That feedback is what the next release is made of.