DEV Community

saltxd

I replaced 1,000 lines of Python with a 500-word prompt

Or: what building a $0/month autonomous AI librarian taught me about running LLM agents in production.

My documentation wiki was rotting. Not dramatically — just the usual entropy: pages with no tags, books filed on the wrong shelf, duplicated titles, stale metadata blocks pasted from templates. I had written governance rules. Nobody was enforcing them, because the only "body" available was me at 11pm.

So I built an autonomous agent to do it. Twice, it turned out.

Version 1: the framework reflex

The first version was what most engineers would build in 2026: a Python service. Rule engine for the mechanical checks, an LLM API call for the judgment calls, an executor module to apply fixes, an SDK wrapper, retry logic. About a thousand lines.

It worked. It also:

  • cost ~$0.70 per run, because one badly-scoped rule sent every page to the model without a pre-filter,
  • died mid-sweep the first live night when API credits ran out,
  • and hid two genuine integration bugs (including a skipped protocol handshake that made tool calls silently no-op) inside plumbing I had written myself and therefore trusted.
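
The first bullet is the kind of cost a cheap local pre-filter avoids. The following is a minimal sketch of that idea, not the v1 code: the `Page` shape, the placeholder-title prefixes, and the rule itself are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical page shape; a real wiki API will expose different fields.
    title: str
    tags: list = field(default_factory=list)


# Assumed heuristic: titles that look like leftovers from a template or copy.
PLACEHOLDER_PREFIXES = ("Untitled", "Copy of")


def needs_model_review(page: Page) -> bool:
    """Cheap local check that runs before any metered model call."""
    untagged = not page.tags
    placeholder_title = page.title.startswith(PLACEHOLDER_PREFIXES)
    return untagged or placeholder_title


def select_for_model(pages: list) -> list:
    """Keep only the pages the mechanical check could not clear."""
    return [p for p in pages if needs_model_review(p)]


pages = [
    Page("Backup runbook", ["ops", "backup"]),
    Page("Untitled notes"),
    Page("Copy of VPN setup", ["network"]),
]
to_review = select_for_model(pages)
print([p.title for p in to_review])
```

With a filter like this in front of the model call, spend scales with the number of suspicious pages rather than with the size of the wiki.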

I retired it the same day it shipped.

Version 2: the prompt is the program

The realization: an agentic CLI (I use Claude Code, but the shape generalizes) already is the rule engine, the tool executor, the retry loop, and the judgment module. I had rebuilt, worse, the thing I was paying for.

Version 2 is a Kubernetes CronJob. The container has the CLI installed. The entrypoint is ~50 lines of shell: load a prompt from a ConfigMap, point the CLI at an MCP server that exposes my wiki's API as tools, run headless, post the summary to my chat channel. The prompt — about 500 words — is the curator logic.

CronJob (weekly, Sunday 04:00)
  └─ agent pod (fresh per run)
       ├─ agentic CLI + auth
       ├─ ConfigMap: system-prompt.md   ← the "codebase"
       ├─ MCP server: wiki API as tools
       └─ entrypoint: load prompt → run → post summary
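
The "load prompt → run → post summary" flow is small enough to sketch. The real entrypoint is shell; this Python version is only an illustration under stated assumptions: the CLI command is passed in as a list so no real flag names are implied, the prompt is fed on stdin, and `post_summary` is a hypothetical callback standing in for the chat integration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def run_agent(
    prompt_path: Path,
    cli_command: Sequence[str],
    post_summary: Callable[[str], None],
    timeout_s: int = 3600,
) -> int:
    """Load the prompt, run the CLI headless, forward its output to chat."""
    prompt = prompt_path.read_text(encoding="utf-8")
    # Assumption: the CLI reads its prompt from stdin. Adjust for the real tool.
    result = subprocess.run(
        list(cli_command),
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    summary = result.stdout.strip() or result.stderr.strip()
    status = "ok" if result.returncode == 0 else f"failed ({result.returncode})"
    post_summary(f"[curator {status}]\n{summary}")
    return result.returncode


# Demo with a stand-in command that only counts the words it receives.
fake_cli = [
    sys.executable,
    "-c",
    "import sys; print(len(sys.stdin.read().split()), 'words')",
]
posted = []
with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "system-prompt.md"
    prompt_file.write_text("You are the wiki curator.", encoding="utf-8")
    exit_code = run_agent(prompt_file, fake_cli, posted.append)
print(exit_code, posted)
```

Everything that varies between agents (the prompt file, the command, where the summary goes) is an argument, which is what lets one image serve several jobs.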

Because it authenticates with the subscription I already pay for rather than metered API keys, the marginal cost per run is zero. That's not an accounting footnote — it changed the architecture. When runs are free, you can afford to re-scan everything weekly instead of building incremental-state machinery. Half of v1's code existed to avoid API spend.

And the prompt-based version finds more real problems than the Python version did, with better judgment — it flags near-miss duplicate titles for human review instead of blindly "fixing" them.

The part that actually matters: letting an agent edit things

Giving an LLM write access to a system you care about is the whole question. Permissions alone don't answer it. Three design choices did the work:

1. Two lanes, not one. Every finding is classified: Tier 1 (mechanically obvious — auto-fix) or Tier 2 (judgment — propose to the human, touch nothing). Tag-case normalization is Tier 1. "This page might belong in a different book" is Tier 2 forever, because moving someone's content is disruptive even when you're right.

2. An edit-review gate the agent runs against itself. Before any write, the prompt requires the agent to state the exact change and ask: "Could a reasonable maintainer object to this?" If yes — or anywhere near yes — the item is demoted to the propose lane with a reason. The bias is explicit: a bad auto-fix costs more than another week of backlog. In months of weekly runs, zero regressions.

3. Undo as a first-class dependency. The wiki keeps full page revision history, so every agent edit is one click from reverted. I would not point this pattern at a system without native versioning. Reversibility, not permission scoping, is what makes autonomy safe.
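
In v2 these three choices live in the prompt, not in code. The sketch below only restates their decision order in Python so the bias toward "propose" is easy to see; the `Finding` fields and lane names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    AUTO_FIX = "tier1-auto-fix"
    PROPOSE = "tier2-propose"


@dataclass
class Finding:
    description: str
    mechanical: bool               # is the fix mechanically obvious?
    maintainer_might_object: bool  # the agent's own answer to the gate question
    target_is_versioned: bool      # can the edit be reverted from history?


def route(finding: Finding) -> tuple:
    """Return (lane, reason). Any doubt sends the item to the propose lane."""
    if not finding.mechanical:
        return Lane.PROPOSE, "judgment call"
    if finding.maintainer_might_object:
        return Lane.PROPOSE, "demoted: a maintainer could object"
    if not finding.target_is_versioned:
        return Lane.PROPOSE, "demoted: no revision history to undo from"
    return Lane.AUTO_FIX, "mechanical, uncontroversial, reversible"


tag_case = Finding(
    "normalize tag 'Network' to 'network'",
    mechanical=True, maintainer_might_object=False, target_is_versioned=True,
)
move_page = Finding(
    "page might belong in a different book",
    mechanical=False, maintainer_might_object=False, target_is_versioned=True,
)
risky_fix = Finding(
    "merge two near-duplicate titles",
    mechanical=True, maintainer_might_object=True, target_is_versioned=True,
)

for finding in (tag_case, move_page, risky_fix):
    lane, reason = route(finding)
    print(f"{lane.value}: {finding.description} ({reason})")
```

Only the first finding is auto-fixed. The other two are reported with a reason and left untouched.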

One more habit that earned its place: the agent reads its own previous report before starting, so every summary carries a trend line — "12 items awaiting review, was 14 last run, net −2." A snapshot tells you nothing; the derivative tells you whether the system is healing.
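
The trend line is simple arithmetic over two counts. A minimal sketch, assuming the previous run's count has already been parsed out of the last report (the function name and wording are illustrative):

```python
from typing import Optional


def trend_line(current: int, previous: Optional[int]) -> str:
    """Format the backlog count together with its change since the last run."""
    if previous is None:
        return f"{current} items awaiting review (first run, no trend yet)."
    delta = current - previous
    # The :+d format always prints a sign, so growth and shrinkage are explicit.
    return f"{current} items awaiting review, was {previous} last run, net {delta:+d}."


print(trend_line(12, 14))
print(trend_line(12, None))
```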

Lessons I'd generalize

  1. Delete the framework. If your agent code is mostly orchestration — routing, retries, tool dispatch — you've rebuilt the harness. Ship a prompt and a schedule instead. Mine went from ~1,000 lines to ~50 plus 500 words of English.
  2. Prompts are code. The curator prompt lives in git, deploys through the same chart as everything else, and gets reviewed like a module — because it is one.
  3. Give agents an undo, not just an allowlist. Scoped credentials limit blast radius; versioned targets make mistakes cheap. You want both, but if I had to pick one, I'd pick undo.
  4. Make the agent defend each write. A self-applied "would a maintainer object?" gate sounds soft. Empirically it's the difference between an agent you trust weekly and one you babysit.
  5. Marginal cost shapes architecture. Free re-runs beat clever state. Before optimizing an agent, check whether the economics that forced the complexity still exist.

The same skeleton now runs other jobs in my cluster — different prompt, same image, same MCP tools, same report-to-chat pattern. Each new agent is a config change, not a codebase.

The wiki, for the record, is clean. The librarian doesn't sleep, doesn't get bored, and — unlike v1 — has never once billed me.


I'm writing a series on AI-native, self-hosted IT automation: running LLM agents in production on infrastructure you own — safely, cheaply, and without a SaaS bill. Follow along if that's your kind of problem.
