DEV Community: Antoinette C. Lennox

Guardsman SKILL; From Lazy to Loyal: Why My AI Agent Needed a Promotion

Antoinette C. Lennox — Fri, 17 Jul 2026 08:22:25 +0000

Ponytail made my agent lazy. Guardsman made it loyal. The difference is everything.

The Day I Met Ponytail

It was a Tuesday in June. I was pair-programming with Claude Code on a FastAPI project, and I made the mistake of asking for a date picker.

What I got back was a masterclass in over-engineering: flatpickr installed via npm, a React wrapper component, a CSS import, and a three-paragraph essay on timezone edge cases I never asked about. By the time I deleted all of it, I had lost twenty minutes and a fair bit of faith in AI-assisted coding.

Then a colleague DM'd me a link to Ponytail.

The README opens with a drawing of a guy with a ponytail and oval glasses. The tagline reads: "Makes your AI agent think like the laziest senior dev in the room." You know the type. He's been at the company longer than the version control. You show him fifty lines; he looks at them, says nothing, and replaces them with one.

I installed it. I asked for the same date picker.

<!-- ponytail: browser has one -->
<input type="date">

That was it. No dependencies. No wrappers. No manifesto. Just the platform feature that had been sitting there since HTML5.

I was sold. The benchmarks backed it up: ~54% less code, ~20% cheaper, ~27% faster. I told my team. I put it in every repo. For about three weeks, I thought I had found the holy grail.

The Refund Webhook Incident

The problem showed up quietly.

I asked my agent to add a small utility for processing refund webhooks. Five lines of Python. Elegant. Ponytail-approved. It compiled on the first try. The demo looked clean. I merged it.

Three days later, a race condition in those five lines double-charged a customer.

Here's the thing: the code was small. It was correct in the sense that it handled the happy path. But it was also dangerous—it touched money, ran in an async context, and had zero failure-path coverage. Ponytail's ladder had guided the agent to write the minimum code. It had not guided the agent to ask whether that minimum was safe.
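The original five lines aren't reproduced here. Purely as an illustration, this is a hypothetical sketch of the kind of check-then-act race an async webhook handler can hide, plus a failure-path check that exposes it. Every name in it (`issue_refund`, `process_refund_naive`, `process_refund_guarded`, the in-memory `processed` set) is invented for the example, and a real service would need a durable guard such as a database constraint rather than a Python set.

```python
import asyncio


async def issue_refund(ledger: list, event_id: str) -> None:
    # Hypothetical stand-in for a slow payment-provider call.
    await asyncio.sleep(0.01)
    ledger.append(event_id)


async def process_refund_naive(processed: set, ledger: list, event_id: str) -> None:
    # Check-then-act: two deliveries of the same webhook can both pass
    # this check before either one records the event as handled.
    if event_id in processed:
        return
    await issue_refund(ledger, event_id)
    processed.add(event_id)


async def process_refund_guarded(processed: set, ledger: list, event_id: str) -> None:
    # Claim the event before the first await, so a duplicate delivery
    # sees it as already handled.
    if event_id in processed:
        return
    processed.add(event_id)
    try:
        await issue_refund(ledger, event_id)
    except Exception:
        processed.discard(event_id)  # release the claim so a retry can succeed
        raise


async def deliver_twice(handler) -> int:
    # Failure-path check: the same event arrives twice, concurrently.
    processed, ledger = set(), []
    await asyncio.gather(
        handler(processed, ledger, "evt_1"),
        handler(processed, ledger, "evt_1"),
    )
    return len(ledger)


naive_count = asyncio.run(deliver_twice(process_refund_naive))
guarded_count = asyncio.run(deliver_twice(process_refund_guarded))
print(naive_count, guarded_count)  # 2 1
```

The happy path (one delivery) passes for both handlers. Only the duplicate-delivery check tells them apart, which is the whole point of testing the failure path.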

I spent the weekend writing post-mortem docs and fixing the logic. And I started wondering: what if "write less" isn't the whole story?

The Guard at the Gate

That's when I found Guardsman.

The README opens differently. No ponytail. No glasses. Just a guard in a bearskin cap, boots planted, eyes forward. The tagline: "No diff ships unchallenged."

The philosophy clicked immediately. Ponytail asks: "Can this be smaller?" Guardsman asks: "What happens if this is wrong?"

Where Ponytail channels a lazy senior dev who deletes your abstractions, Guardsman stations a royal guard who challenges whatever approaches the codebase—friend or stranger, five lines or five hundred—and demands proof it belongs there.

I installed it. I asked for the same refund webhook utility.

The agent paused. It ran a convention detection script. It grepped the codebase for how we already handled webhooks. It assigned the change a risk tier. Then it wrote six lines instead of five—and included a runnable check that exercised the failure path, executed in the same turn, with output shown.

[code]
→ skipped: async retry loop, add when volume > 100/min.
  verified: failure-path test run, output below. tier: sensitive.

No essays. No feature tours. Just code, then three lines of accountability.

Three Things Guardsman Does That Ponytail Doesn't

1. It Reads the Standing Orders First

Before writing anything, Guardsman runs a deterministic script to detect your repo's real conventions. Language version. Formatter. Linter. Actual test command. Naming style. Error-handling patterns.

Then it greps how your codebase already solves the nearest similar problem.

This matters more than it sounds. In my team, we use Biome, not Prettier. We use unittest, not pytest. Ponytail's static rules never knew that. Guardsman's standing orders mean the agent finally respects the house style instead of imposing its defaults.
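Guardsman's actual detection script isn't shown in this post. A minimal sketch of the idea, assuming conventions can be inferred from well-known marker files, might look like the following. The `MARKERS` table and the `detect_conventions` function are my own illustration, not the tool's API, and a real detector would also have to read things like `package.json` scripts or `pyproject.toml` sections.

```python
import tempfile
from pathlib import Path

# Marker file -> (category, tool). Illustrative mapping only.
MARKERS = {
    "biome.json": ("formatter", "biome"),
    ".prettierrc": ("formatter", "prettier"),
    "pytest.ini": ("test_runner", "pytest"),
    "ruff.toml": ("linter", "ruff"),
    ".eslintrc.json": ("linter", "eslint"),
}


def detect_conventions(repo: Path) -> dict:
    """Return {category: tool} for every marker file found in the repo root."""
    found = {}
    for name in sorted(MARKERS):  # sorted so the result is deterministic
        if (repo / name).is_file():
            category, tool = MARKERS[name]
            found.setdefault(category, tool)
    return found


# Usage: a throwaway repo that only has a Biome config.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "biome.json").write_text("{}")
    conventions = detect_conventions(repo)

print(conventions)  # {'formatter': 'biome'}
```

The value of doing this in a script rather than in the prompt is that the answer is the same on every run: the agent is handed facts about the repo instead of being asked to remember to look.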

2. It Sizes Danger by Blast Radius, Not Line Count

Guardsman's core insight is simple and brutal: a five-line change to a payment path is more dangerous than a four-hundred-line internal script run once.

Code size is not risk. Blast radius is risk.

So every change gets a tier before a single line is written:

Trivial — internal one-off, nothing downstream. One manual run, output shown.
Standard — everyday feature work. One runnable check, written and executed this turn.
Sensitive — money, auth, or user data in the blast radius. A real test exercising the failure path, with coverage notes.
Critical — getting it wrong is an incident. Full coverage, or the absence of harness is surfaced as a blocker.

Two rules keep the tiers honest. When signals disagree, the higher tier wins. And the agent never infers a downgrade just because the code looks simple. A confident walk is not a countersign.

That second rule is the one that would have saved my refund webhook. Five lines touching money? That's sensitive tier, minimum. No exceptions.

3. Verification Runs Before the Diff Ships

In Ponytail, verification is a suggestion. "Non-trivial logic leaves ONE runnable check behind." But there's no enforcement. No requirement it runs this turn. No tier scaling.

In Guardsman, verification is a gate. The code isn't done when it's written. It's done when it has answered the challenge—the check behind it actually run, in this turn, by the agent, with the output shown. Not left as homework for future-you.

The output format enforces this discipline:

[code]
→ skipped: <X>, add when <Y>. verified: <how>. tier: <T>.

At most three lines. No design memoirs. No "here's what I considered." Just the facts.
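Because the footer has a fixed shape, it can be checked mechanically. Here is a rough sketch of such a check, assuming the footer always follows the template above. The `FOOTER` pattern and `parse_footer` are hypothetical helpers written for this post, not something Guardsman ships.

```python
import re

# Matches: "→ skipped: <X>, add when <Y>. verified: <how>. tier: <T>."
FOOTER = re.compile(
    r"→ skipped: (?P<skipped>.+?), add when (?P<add_when>.+?)\.\s+"
    r"verified: (?P<verified>.+?)\.\s+tier: (?P<tier>\w+)\.",
    re.DOTALL,
)


def parse_footer(text: str):
    """Return the footer fields, or None if the footer is too long or malformed."""
    if len(text.strip().splitlines()) > 3:
        return None  # the format allows at most three lines
    match = FOOTER.search(text)
    return match.groupdict() if match else None


footer = (
    "→ skipped: async retry loop, add when volume > 100/min.\n"
    "  verified: failure-path test run, output below. tier: sensitive."
)
fields = parse_footer(footer)
print(fields["tier"])  # sensitive
```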

The Build Orders Compared

Both tools use a ladder. But the ladders have different personalities.

Ponytail's ladder is pure minimalism:

1. Does this need to exist?       → YAGNI
2. Already in this codebase?      → Reuse it
3. Stdlib does it?                → Use it
4. Native platform feature?       → Use it
5. Installed dependency?          → Use it
6. Can it be one line?            → One line
7. Only then: minimum code that works

Guardsman's ladder adds accountability:

1. Does this need to exist?       → YAGNI
2. Already in this codebase?      → Reuse it
3. Stdlib does it?                → Use it
4. Native platform feature?       → Use it
5. Installed dependency?          → Use it (count onboarding cost)
6. Can it be one line?            → One line
7. Only then: minimum code, scaled to the tier

Notice rung 5. Guardsman weighs the onboarding-token cost for every future reader—human or agent—not just the maintenance tail. And step 7 scales the solution to the risk tier. Minimum code, yes. But minimum safe code.
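The "stop at the first rung that holds" behaviour is easy to picture as an ordered list of checks. This is only a sketch under my own assumptions: the `facts` dictionary and its keys (`needed`, `in_codebase`, and so on) are hypothetical inputs I made up, since neither tool exposes its ladder as code.

```python
from typing import Callable, Dict, List, Tuple

# (question, predicate over known facts, action to take if the rung holds)
Rung = Tuple[str, Callable[[Dict[str, bool]], bool], str]

LADDER: List[Rung] = [
    ("Does this need to exist?", lambda f: not f.get("needed", True), "skip it (YAGNI)"),
    ("Already in this codebase?", lambda f: f.get("in_codebase", False), "reuse it"),
    ("Stdlib does it?", lambda f: f.get("in_stdlib", False), "use the stdlib"),
    ("Native platform feature?", lambda f: f.get("native", False), "use the platform"),
    ("Installed dependency?", lambda f: f.get("installed_dep", False),
     "use it (count onboarding cost)"),
    ("Can it be one line?", lambda f: f.get("one_liner", False), "write one line"),
]


def climb(facts: Dict[str, bool]) -> str:
    """Walk the rungs in order and stop at the first one that holds."""
    for _question, holds, action in LADDER:
        if holds(facts):
            return action
    # Rung 7: nothing above applied.
    return "write minimum code, scaled to the tier"


# The date picker: the browser already has one, so the climb stops at rung 4.
print(climb({"native": True}))  # use the platform
```

Order matters here. A request that is both a native feature and a possible one-liner stops at the native rung and never reaches rung 6.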

What the SKILL.md Files Reveal

If you want to understand the architectural difference, read the condensed rules.

Ponytail's personality:

You are a lazy senior developer. Lazy means efficient, not careless.

Rules:
- No abstractions not explicitly requested
- No new dependency if avoidable
- Deletion over addition
- Mark intentional simplifications with `ponytail:` comment

Guardsman's protocol:

## Engineering rules (Guardsman)

### 1. Read before you build
- Detect and match this repo's conventions
- Grep how this codebase already solves the nearest similar problem
- Grep every caller of any function you change

### 2. Tier by blast radius (not code size)
- trivial / standard / sensitive / critical
- Higher tier wins when signals disagree
- Never downgrade just because code looks simple

### 3. Build order — stop at first rung
- Need exist? → skip
- In codebase? → reuse
- Stdlib? → use
- Native platform? → use
- Installed dependency? → use (count onboarding cost)
- One line? → one line
- Only then: minimum, scaled to tier

### 4. Verification floors
- standard: one runnable check, written AND executed this turn
- sensitive: real test with failure-path coverage
- critical: full coverage, or absence is a blocker

Ponytail gives you a personality. Guardsman gives you a protocol. One is a vibe. The other is a contract.

The Logbook: Where "Later" Becomes "Now"

Guardsman ships with a persistent log called GUARDSMAN-LOGBOOK.md. Every shortcut, every deferred optimization, every tier decision gets recorded with a # guardsman: marker.

Ponytail has something similar—/ponytail-debt harvests ponytail: comments into a ledger. But Guardsman's logbook is structured. It tracks severity, cost implications, and the condition under which the shortcut should be revisited.

# GUARDSMAN-LOGBOOK.md

## 2026-07-15
- `refund_webhook.py:42` — skipped async retry loop
  - severity: medium
  - cost: latency
  - add when: volume > 100/min
  - tier at skip: sensitive

"Later" doesn't become "never" when it's written down.

Why I Made the Switch

I didn't uninstall Ponytail. I demoted it.

Ponytail still runs on my prototyping repos, my throwaway scripts, my greenfield experiments. It's genuinely brilliant at stopping over-engineering. The date picker example will never not be satisfying.

But when I'm merging to main—when the diff touches money, auth, or user data—Guardsman is at the post. Because I've learned that correctness and size are independent variables. A minimal diff that breaks production is worse than a verbose diff that works.

The risk tier system changed how I think about AI-assisted coding. I no longer measure success by lines deleted. I measure it by confidence at merge time. A utility script gets a quick run. A payment handler gets a failure-path test. Nothing ships on "it compiled" alone.

And the standing orders detection finally solved the house-style problem. My agent doesn't suggest pytest when we use unittest. It doesn't propose Prettier when we use Biome. It reads the repo first, then writes.

Final Thoughts

If your biggest pain point is AI agents that install npm packages to do what the browser shipped in 2014, start with Ponytail. It will save you tokens, time, and sanity. The "lazy senior dev" archetype is real, and Ponytail captures it perfectly.

But if you're shipping production code—if your diffs walk paths where a mistake is an incident—you need more than minimal. You need accountable.

Guardsman stands on Ponytail's shoulders. It keeps the YAGNI reflex, the stdlib-first instinct, the one-line preference. Then it adds the missing piece: a verification system that scales with danger, not code size.

Use Ponytail for speed. Use Guardsman for trust.

Because the best code isn't just the code you never wrote.

It's the code that proves it belongs there.

Guardsman: github.com/hedimanai-pro/guardsman

Ponytail: github.com/DietrichGebert/ponytail

Why I Switched from Ponytail to Guardsman for AI Coding

Antoinette C. Lennox — Fri, 17 Jul 2026 08:05:54 +0000

TL;DR: Ponytail taught my AI agent to write less code. Guardsman taught it to write accountable code. If you're tired of deleting AI-generated bloat or debugging AI-generated "minimal" disasters three weeks later, read this.

Introduction: The Date Picker Moment

You know the story. You ask Claude Code for a date picker. Without blinking, it installs flatpickr, writes a React wrapper, adds a CSS import, and starts a thoughtful essay about timezone edge cases you didn't ask for.

Then someone shows you Ponytail.

You install it. You ask for the same date picker. Claude pauses, climbs the "decision ladder," and returns:

<!-- ponytail: browser has one -->
<input type="date">

No dependency. No wrapper. No manifesto. Just the native platform feature doing exactly what you asked.

I was hooked immediately. Ponytail gave me something I didn't know I needed: an AI agent that thought like the senior engineer who replaces fifty lines with one and says nothing. The benchmarks were real—~54% less code, ~20% cheaper, ~27% faster. I told everyone. I wrote about it. I put it in every project.

But after three weeks of daily use, I noticed something. Ponytail was excellent at stopping failure mode #1: the over-build. But it had nothing to say about failure mode #2—the one that keeps you up at night.

The Discovery: When "Minimal" Becomes Dangerous

It happened on a Tuesday. I asked my agent to add a small utility for processing refund webhooks. It was five lines. Elegant. Ponytail-approved minimal. It compiled, it demoed, it merged.

Three days later, a race condition in that "minimal" diff double-charged a customer.

The code was small. The blast radius was not.

I started looking for something that could enforce both sides of the equation: write less, and verify what you write. That's when I found Guardsman.

What Ponytail Does Well

Let's be fair. Ponytail is brilliant at what it sets out to do.

The "lazy senior dev" archetype is instantly recognizable to anyone who's worked with great engineers. The decision ladder is simple, memorable, and effective:

1. Does this need to exist?       → YAGNI
2. Already in this codebase?      → Reuse it
3. Stdlib does it?                → Use it
4. Native platform feature?       → Use it
5. Installed dependency?          → Use it
6. Can it be one line?            → One line
7. Only then: minimum code that works

The results speak for themselves. Ponytail ships with plugins for Claude Code, Codex, GitHub Copilot CLI, Gemini CLI, OpenCode, and more. It has modes (lite, full, ultra, off). It has useful commands like /ponytail-review to audit diffs for over-engineering and /ponytail-debt to track deferred shortcuts.

For greenfield projects and backend code where the main risk is unnecessary complexity, Ponytail is a genuine productivity multiplier.

Where Ponytail Starts Showing Limits

The problem isn't what Ponytail does. It's what it doesn't do.

First, Ponytail's ladder is static. It doesn't read your repo's existing conventions before deciding. In a codebase with shadcn/ui installed, the "correct" answer for a date picker isn't <input type="date">; it's your design system's <DatePicker />. Ponytail's rung 5 should catch this, but only if the agent actually reads package.json and your component files first. The instruction says "read the code it touches," but instruction-following is probabilistic, not guaranteed.

Second, and more critically: Ponytail has no concept of risk. A five-line change to a payment path and a five-hundred-line internal script get the same treatment. The ladder optimizes for code size, not blast radius. It will happily approve "minimal" code that touches auth, money, or user data—without demanding any verification beyond "it compiles."

Third, verification is an afterthought. Ponytail says "non-trivial logic leaves ONE runnable check behind," but there's no enforcement mechanism. No tier system. No requirement that the check actually runs this turn. It's a suggestion, not a gate.

These aren't design flaws in Ponytail. They're simply outside its scope. Ponytail solves over-building. It doesn't solve under-verifying.
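To make the first limit concrete: the design-system check only works if something actually opens package.json and the component folder. A minimal sketch of that read step is below, with the caveat that `installed_packages`, `find_component`, the `some-ui-kit` dependency, and the `components/ui/date-picker.tsx` path are all hypothetical names I picked for the example. A real version would also need to skip `node_modules`.

```python
import json
import tempfile
from pathlib import Path


def installed_packages(repo: Path) -> set:
    """Names listed under dependencies/devDependencies in package.json."""
    manifest = repo / "package.json"
    if not manifest.is_file():
        return set()
    data = json.loads(manifest.read_text())
    return set(data.get("dependencies", {})) | set(data.get("devDependencies", {}))


def find_component(repo: Path, name: str) -> list:
    """Files whose stem matches the component name, ignoring case and hyphens."""
    wanted = name.lower()
    return sorted(
        p for p in repo.rglob("*")
        if p.is_file() and p.stem.lower().replace("-", "") == wanted
    )


# Usage: a throwaway repo with one dependency and one existing component.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "package.json").write_text(
        json.dumps({"dependencies": {"some-ui-kit": "1.0.0"}})
    )
    (repo / "components" / "ui").mkdir(parents=True)
    (repo / "components" / "ui" / "date-picker.tsx").write_text("// stub\n")
    packages = installed_packages(repo)
    component_hits = [p.name for p in find_component(repo, "DatePicker")]

print(packages, component_hits)  # {'some-ui-kit'} ['date-picker.tsx']
```

If `component_hits` is non-empty, the "minimal" answer is to reuse that component, not to reach for the native input.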

Discovering Guardsman

Guardsman's tagline hit me immediately:

"No diff ships unchallenged."

Where Ponytail puts a "lazy senior dev" in your agent, Guardsman puts a royal guard at the post. Bearskin down over his eyes. Boots planted. He challenges whatever approaches—friend or stranger, five lines or five hundred: prove you belong here.

The philosophy is different. Ponytail asks "can this be smaller?" Guardsman asks "what happens if this is wrong?"

Why Guardsman Feels Better

1. Standing Orders: Your Repo, Your Rules

Before writing anything, Guardsman runs a deterministic script to detect your repo's real conventions—language version, formatter, linter, test command, naming style. Then it greps how your codebase already solves the nearest similar problem.

Your patterns win. Always.

This fixes the design-system problem I hit with Ponytail. Guardsman doesn't just check if a dependency exists; it understands that your established conventions take precedence over generic "minimal" advice.

2. Risk Tiers: Blast Radius, Not Code Size

This is Guardsman's core insight. A five-line change to a payment path is more dangerous than a four-hundred-line internal script run once. Code size is not risk. Blast radius is risk.

Guardsman assigns every change a risk tier before a single line is written:

Trivial — Internal one-off, nothing downstream. The challenge: one manual run, output shown.
Standard — Everyday feature work. The challenge: one runnable check, written AND executed this turn.
Sensitive — Money, auth, or user data in the blast radius. The challenge: a real test exercising the failure path, with explicit coverage notes.
Critical — Getting it wrong is an incident. The challenge: full coverage—and if the repo has no harness reaching this path, that absence is surfaced as a blocker, never routed around.

Two rules keep the tiers honest:

When signals disagree, the higher tier wins.
You can downgrade a tier explicitly. The agent never infers a downgrade just because the code looks simple.

That second rule is the killer feature. It stops "failure mode #2"—the confidently small diff—from marching past the post with a confident walk.
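Those two rules are small enough to write down as code. This is a sketch of how I read them, not Guardsman's implementation: the `Tier` enum, the `assign_tier` function, and the choice of `STANDARD` as the fallback when there are no signals are my assumptions.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Tier(IntEnum):
    TRIVIAL = 0
    STANDARD = 1
    SENSITIVE = 2
    CRITICAL = 3


def assign_tier(
    signals: Iterable[Tier], explicit_downgrade: Optional[Tier] = None
) -> Tier:
    # Rule 1: when signals disagree, the higher tier wins.
    tier = max(signals, default=Tier.STANDARD)
    # Rule 2: only an explicit downgrade from the user can lower the tier.
    # Nothing about how the code looks (size, simplicity) is an input here.
    if explicit_downgrade is not None and explicit_downgrade < tier:
        tier = explicit_downgrade
    return tier


# A tiny diff (looks trivial) that touches a payment path (sensitive).
print(assign_tier([Tier.TRIVIAL, Tier.SENSITIVE]).name)  # SENSITIVE
```

Note what the function signature leaves out: there is no `lines_changed` parameter, so a small diff has no way to argue its own tier down.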

3. The Build Order: Minimalism With Accountability

Guardsman has its own ladder, but it's not just "write less." It's "write less, then prove it works":

1. Does this need to exist?       → YAGNI
2. Already in this codebase?      → Reuse it
3. Stdlib does it?                → Use it
4. Native platform feature?       → Use it
5. Installed dependency?          → Use it (count onboarding cost)
6. Can it be one line?            → One line
7. Only then: minimum code, scaled to the tier

Notice rung 5: Guardsman weighs the onboarding-token cost for every future reader (human or agent), not just maintenance. And step 7 scales the solution to the risk tier, not just "make it small."

4. Verification That Actually Runs

Here's the difference that matters. In Guardsman, non-trivial logic isn't "done" when the code is written. It's done when it has answered the challenge—the check behind it actually run, in this turn, by the agent, with the output shown.

Not left behind as homework for future-you.

The output format enforces this:

[code]
→ skipped: <X>, add when <Y>. verified: <how>. tier: <T>.

No essays. No feature tours. Just code, then at most three lines of accountability.

Side-by-Side: Ponytail vs. Guardsman

Primary Goal

Ponytail eliminates over-engineering. Guardsman eliminates over-engineering and under-verification.

Philosophy

Ponytail channels a lazy senior dev. Guardsman stations a royal guard at the post.

Risk Awareness

Ponytail has none—code size is the only metric. Guardsman uses blast-radius tiers (trivial, standard, sensitive, critical).

Repo Conventions

Ponytail uses static rules and may miss project context. Guardsman detects conventions dynamically—your patterns always win.

Verification

Ponytail suggests checks but doesn't enforce them. Guardsman applies tier-scaled floors and demands execution this turn.

Failure Mode #1: The Over-Build

Both handle this well. Ponytail is excellent. Guardsman is excellent.

Failure Mode #2: The Confident Minimal

Ponytail offers no protection. Guardsman blocks it with the tier system.

Modes

Ponytail offers lite, full, ultra, and off. Guardsman offers build, review, audit, logbook, and post-report.

Adapters

Ponytail supports 15+ agent platforms. Guardsman ships a Claude Code plugin plus a universal AGENTS.md adapter.

Learning Curve

Ponytail is low—install and forget. Guardsman is medium—understanding tiers takes one session.

Best For

Ponytail shines on greenfield projects, backend code, and API billing work. Guardsman excels on production code, user-facing features, and team workflows.

Real Examples: The SKILL.md Difference

Ponytail's SKILL.md (condensed)

You are a lazy senior developer. Lazy means efficient, not careless.

Before writing any code, stop at the first rung that holds:
1. Does this need to be built at all?
2. Already in codebase? Reuse it.
3. Stdlib? Use it.
4. Native platform? Use it.
5. Installed dependency? Use it.
6. One line? Make it one line.
7. Only then: minimum code.

Rules:
- No abstractions not explicitly requested
- No new dependency if avoidable
- Deletion over addition
- Mark intentional simplifications with `ponytail:` comment

Guardsman's Standing Orders (condensed)

## Engineering rules (Guardsman)

### 1. Read before you build
- Detect and match this repo's conventions: language, formatter,
  linter, test command, naming, error-handling.
- Grep how this codebase already solves the nearest similar problem.
- Grep every caller of any function you change.

### 2. Tier by blast radius (not code size)
- trivial / standard / sensitive / critical
- Higher tier wins when signals disagree.
- Never downgrade just because code looks simple.

### 3. Build order — stop at first rung
1. Need exist? → skip
2. In codebase? → reuse
3. Stdlib? → use
4. Native platform? → use
5. Installed dependency? → use (count onboarding cost)
6. One line? → one line
7. Only then: minimum, scaled to tier

### 4. Verification floors
- standard: one runnable check, written AND executed this turn
- sensitive: real test with failure-path coverage
- critical: full coverage, or absence is a blocker

The difference is architectural. Ponytail gives you a personality. Guardsman gives you a protocol.

Why I Switched

I switched because I got tired of playing roulette with my AI agent.

Ponytail made my agent write less code. But "less" isn't always "better." A minimal diff that breaks production is worse than a verbose diff that works. Guardsman understands that correctness and size are independent variables.

The risk tier system changed how I work. Now, when I ask for a feature, the agent stops and asks itself: "What happens if this is wrong?" The answer determines the verification floor. A utility script gets a quick run. A payment handler gets a failure-path test. Nothing ships on "it compiled" alone.

The standing orders detection means my agent finally respects my team's conventions. It doesn't suggest pytest when we use unittest. It doesn't propose Prettier when we use Biome. It reads the repo first.

And the logbook (GUARDSMAN-LOGBOOK.md) is genuinely useful. Every shortcut, every deferred optimization, every tier decision is recorded. "Later" doesn't become "never."

Final Thoughts

Ponytail is not a bad tool. If your main problem is AI agents that install npm packages to do what the browser has done natively since 2014, Ponytail will save you tokens and sanity. It's a genuine contribution to the minimal-code movement.

But if you're shipping production code—if your diffs touch money, auth, or user data—you need more than minimal. You need accountable.

Guardsman gives you that. It stands on Ponytail's shoulders (the YAGNI ladders, the stdlib-first reflexes) and adds the missing piece: a verification system that scales with danger, not code size.

My recommendation? Use both. Let Ponytail handle your prototypes, your scripts, your greenfield experiments. But when you're merging to main, put Guardsman at the post.

Because the best code isn't just the code you never wrote.

It's the code that proves it belongs there.

Guardsman: github.com/hedimanai-pro/guardsman

Ponytail: github.com/DietrichGebert/ponytail

Why I put the Guardsman SKILL on duty in front of every repo I touch

Antoinette C. Lennox — Mon, 13 Jul 2026 13:48:59 +0000

I almost skipped this one. Another "AI coding skill," another README promising your agent will finally behave. I've installed six of those this year and uninstalled six of those this year.

Then I actually read what Guardsman does, and I stopped scrolling.

The problem it names correctly

Most of these tools optimize for one thing: stop the agent from over-building. Fair enough — we've all watched an agent turn a two-line ask into a new dependency, a config layer, and an unprompted essay on edge cases nobody has.

But Guardsman names a second failure mode I hadn't seen anyone else call out directly: the confidently small diff. Five clean lines. No test. Written after skimming half the flow. Landing straight on a path that touches money, auth, or user data. It compiles. It demos. It's the one that actually gets you, three weeks later, because nothing about it looked dangerous.

That's the sentence that sold me: code size is not risk, blast radius is risk. I'd been unconsciously grading diffs by how they looked, not by what they touched. Once you say it out loud like that, you can't unsee it.

What made me actually install it

It's not a vibe check, it's a mechanism. Every change gets a risk tier — trivial, standard, sensitive, critical — decided by blast radius before a line is written. Money, auth, or user data in scope pushes the tier up automatically. The agent can be told to downgrade a tier explicitly. It is never allowed to infer a downgrade just because the code looks simple. That one rule is the whole point: "looks simple" is exactly the disguise the dangerous diff wears, so the tool refuses to take the bait on your behalf.

Underneath the tiers is a build ladder whose order I actually agree with: does this need to exist at all → is it already in the codebase → does the stdlib do it → does the platform already do it → does something already installed solve it → can it be one line → only then, the minimum code the tier calls for. It's the YAGNI checklist I keep meaning to run in my head and don't, made mandatory instead of aspirational.

And the part I didn't expect to care about: TODO comments get replaced with structured, severity-tagged logbook entries, and a scanner can actually harvest them and flag what's overdue. I have TODOs in production from two years ago that nobody, including me, remembers the context for. That's not a hypothetical problem for me.

The detail that made me trust the rest of it

Buried near the bottom of the README is a line most tool authors would never write:

A tool that invents numbers about your codebase has no business guarding it.

No "saved you 12,000 lines," no unverifiable percentage. The author states plainly that there's no baseline version of your repo to diff against, so no honest tool can produce that number — and the real benchmark protocol is public, not yet finished, and will ship with its method and limits when it lands. A tool that's this disciplined about not overclaiming its own value is the same kind of tool that'll actually enforce discipline on your diffs. That consistency is what moved this from "interesting README" to "installed in my project."

Getting it running

/plugin marketplace add hedimanai-pro/guardsman
/plugin install guardsman@guardsman

Five modes if you don't want it running persistently: build (default, on every turn), review (diff-only, changes nothing), audit (whole-repo pass), logbook (surfaces open shortcuts by severity), post-report (only what's directly countable, right now, no invented figures). Not on Claude Code? The condensed ruleset in adapters/AGENTS.md ports the tiers to any agent that reads custom instructions.

I'm running it in build mode on my current work. I'll follow up with what I actually observe once I've got real mileage on it — not before.

💂

I found the one AI coding skill that refuses to lie to you about its numbers

Antoinette C. Lennox — Mon, 13 Jul 2026 08:35:11 +0000

I wasn't looking for this. I was digging through Claude Code skill repos at 1am, the way you do, and I almost scrolled past a small project called Guardsman. Two stars. Eighteen commits. No flashy benchmark chart at the top of the README.

That absence is exactly what made me stop scrolling.

The pattern I'm tired of

Every "AI coding skill" README follows the same script now: a bold claim in the title ("73% fewer tokens!", "10x faster shipping!"), a chart with no visible methodology, and a repo that's three weeks old. You can't reproduce the number. You can't even tell what it was measured against. You just have to believe it.

Guardsman does the opposite, and it says so out loud, in its own README:

A tool that invents numbers about your codebase has no business guarding it.

There's no unbuilt parallel-universe version of your repo to diff against, so the author — Hedi Manai — just doesn't claim a savings number. The benchmarks/ folder is there, with a real protocol (a live open-source repo, twelve feature tickets, multiple arms, scored on lines changed, guards kept, and checks actually executed) — but the results aren't in yet, and the repo says so plainly instead of backfilling a plausible-looking figure.

That's a strange thing to lead with in a space full of overclaiming. It's also why I kept reading.

What it actually does

Guardsman is a Claude Code / Codex / Cursor skill built around one observation: code size is not risk. Blast radius is risk.

Most "write less code" tools optimize for one failure mode — the agent that over-builds, adds a dependency and a config layer for a five-line ask. Guardsman targets that and the failure mode that's much harder to catch: the confidently small diff. Five clean lines, no tests, straight onto a path that touches money or auth. It compiles. It demos. It breaks in week three.

Guardsman's answer is a risk tier, assigned before a line of code is written:

| Tier | When | The check |
| --- | --- | --- |
| trivial | one-off, no downstream | manual run, output shown |
| standard | everyday feature work | one check written and executed this turn |
| sensitive | money, auth, user data in scope | a real test on the failure path |
| critical | getting it wrong is an incident | full coverage, or the gap is surfaced, never routed around |

Higher tier always wins on disagreement. The agent can be told to downgrade a tier — it's never allowed to infer one just because the code looks simple. That's the detail that sold me: "looks simple" is precisely the disguise the dangerous diff wears.

Before any of that, it reads your repo's actual conventions — formatter, linter, real test command, how your codebase already solves the nearest similar problem — with a deterministic detection script, not a guess.

The build order

Once the tier is set, it climbs a fixed ladder and stops at the first rung that holds: does this need to exist → is it already in the codebase → does the stdlib do it → does the platform already do it (a DB constraint instead of app-level validation, CSS instead of JS) → does an already-installed dependency solve it → can it be one line → only then, the minimum code the tier calls for.
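The "DB constraint instead of app-level validation" rung is worth a concrete picture. Here is a small sketch using Python's built-in `sqlite3`, with a made-up `refunds` table and a `try_insert` helper of my own. The point is that the database itself rejects a duplicate event or a non-positive amount, so no caller can forget the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
with conn:  # commits on success
    conn.execute(
        "CREATE TABLE refunds ("
        " event_id TEXT PRIMARY KEY,"  # one row per webhook event
        " amount_cents INTEGER NOT NULL CHECK (amount_cents > 0)"
        ")"
    )
    conn.execute("INSERT INTO refunds VALUES (?, ?)", ("evt_1", 500))


def try_insert(event_id: str, amount_cents: int) -> str:
    """Insert a refund row; report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO refunds VALUES (?, ?)", (event_id, amount_cents)
            )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


print(try_insert("evt_1", 500))  # duplicate event: rejected by PRIMARY KEY
print(try_insert("evt_2", -5))   # bad amount: rejected by CHECK
print(try_insert("evt_2", 700))  # ok
```

Two lines of schema replace validation code that would otherwise have to be repeated, and tested, at every call site.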

Two refinements stood out. Bug fixes are chased to root cause — it greps every caller before touching a shared function, instead of patching the one call site the ticket mentioned. And YAGNI isn't treated as an excuse to ignore work that's already scheduled: if a second consumer is near-certain, it adds the seam and says why in one line, because the rewrite next week costs more than the extra line today.
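"Grep every caller" can be approximated for Python code with the standard `ast` module. The sketch below is my own illustration of the idea, not how Guardsman does it: `find_callers` and the `SAMPLE` source are hypothetical, and matching by name alone will miss dynamic calls and can over-match unrelated functions that share the name.

```python
import ast


def find_callers(source: str, func_name: str) -> list:
    """Return the line numbers of every call to func_name in the source."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Name):
                name = target.id  # plain call: parse_amount(...)
            else:
                name = getattr(target, "attr", None)  # attribute call: billing.parse_amount(...)
            if name == func_name:
                lines.append(node.lineno)
    return sorted(lines)


SAMPLE = '''\
def parse_amount(raw):
    return int(raw)

def handle_refund(payload):
    return parse_amount(payload["amount"])

def handle_charge(payload):
    return billing.parse_amount(payload["amount"])
'''

callers = find_callers(SAMPLE, "parse_amount")
print(callers)  # [5, 8]
```

A ticket that mentions only `handle_refund` would still surface `handle_charge` as a second call site before the shared function is touched.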

The logbook

Instead of TODO comments that nobody greps and nobody triages, deliberate shortcuts get written as structured, severity-tagged entries:

# guardsman: retry capped at 3, no backoff config | severity:med | revisit:second-caller-appears | cost:none

A bundled scanner reads the whole repo and flags anything overdue or malformed. Technical debt stops being folklore in someone's head and becomes something a script can actually check.
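The bundled scanner itself isn't shown in the README excerpt above, but the marker format is regular enough to sketch a parser for it. In this sketch, `parse_marker` and the assumption that `severity`, `revisit`, and `cost` are all required are mine, inferred from the single example entry.

```python
import re

MARKER = re.compile(r"#\s*guardsman:\s*(?P<body>.+)$")
# Assumed required fields, based on the one example entry.
REQUIRED = {"severity", "revisit", "cost"}


def parse_marker(line: str):
    """Parse one '# guardsman:' comment into a dict, flagging malformed entries."""
    match = MARKER.search(line)
    if not match:
        return None  # not a guardsman marker at all
    note, *fields = [part.strip() for part in match.group("body").split("|")]
    entry = {"note": note}
    for field in fields:
        key, sep, value = field.partition(":")
        if not sep or not value.strip():
            entry["malformed"] = True  # field without a key:value shape
            continue
        entry[key.strip()] = value.strip()
    if not REQUIRED.issubset(entry):
        entry["malformed"] = True  # a required field is missing
    return entry


good = parse_marker(
    "# guardsman: retry capped at 3, no backoff config"
    " | severity:med | revisit:second-caller-appears | cost:none"
)
bad = parse_marker("# guardsman: quick hack | severity:high")
print(good["severity"], bad.get("malformed"))  # med True
```

Run over every file in a repo, a function like this is enough to list open shortcuts by severity and to catch entries that someone typed by hand and got wrong.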

Five modes, one skill

build (persistent, default), review (diff-only, no changes applied), audit (whole-repo pass for duplicate deps, dead config, single-implementation interfaces), logbook (surfaces open shortcuts by severity), and post-report — which only ever prints what's directly countable on your repo, right now. If nothing's been scanned yet, it says so instead of showing a number.

Try it

/plugin marketplace add hedimanai-pro/guardsman
/plugin install guardsman@guardsman

Not on Claude Code? The condensed ruleset in adapters/AGENTS.md ports the tiers and build ladder to any agent that reads custom instructions.

Why I'm writing this up now, before the benchmarks land

Normally I'd wait for the numbers. Here, the absence of numbers is the actual signal worth sharing — a tool built by someone who'd rather ship "we haven't measured this yet" than a chart that doesn't survive scrutiny. In a category full of unverifiable 10x claims, that restraint is the differentiator.

I'll be watching benchmarks/ for when the real, reproducible results land. Starring the repo bookmarks it, but to be notified you need to watch it on GitHub.

💂

ToolOps - Most Developers Building AI Agents Are Solving the Wrong Problem. I Was One of Them

Antoinette C. Lennox — Mon, 01 Jun 2026 08:21:33 +0000

A genuine note to the community — not a product review.

There's a particular kind of developer frustration that doesn't have a name yet.

It's not a bug. It's not a deployment failure. It's not a model hallucination or a broken API contract. It's the feeling you get when you've built something technically correct — something that works, something users actually want — and you're still losing. Slowly, quietly, in ways that don't show up in your error logs.

You're losing to your own architecture.

I want to talk about that. And somewhere in the middle of talking about it, I'm going to mention a tool. When I do, I want you to notice something: you'll probably feel a reflex to discount what I'm saying, the way you discount anything that sounds like a recommendation. That reflex is correct. It has kept you from wasting time on overhyped libraries a hundred times.

But I'm going to ask you to hold it for a few minutes. Not because I want to sell you anything — I'm not affiliated with this project, I receive nothing for writing this — but because I spent months with that reflex firmly intact, solving the wrong problem in my own agent infrastructure, and I want to spare you the same detour.

The Problem That Doesn't Look Like a Problem

Here's what production AI agent development actually looks like, once you're past the demo phase.

You're making external calls — LLMs, APIs, databases, third-party tools. Those calls are slow, expensive, and unreliable. You know this. Every developer building in this space knows this. The standard response is to optimize the obvious things: compress your prompts, choose the right model tier, cache where you can.

The trap is that these optimizations feel sufficient. Your error rate is low. Your latency is acceptable. Your system, by most observable measures, is performing.

What you're not seeing — because it doesn't surface as a failure — is the structural waste underneath. In a multi-agent system, multiple agents fire identical or semantically equivalent queries to the same endpoints, independently, simultaneously, with no shared memory between them. Each one pays the full price for a result that already exists. The system isn't broken. It's just forgetting, constantly, at scale, and you're paying for every instance of that forgetting.

The reason this doesn't get talked about enough is simple: it doesn't produce errors. It produces invoices.

And because invoices are a business problem rather than an engineering problem, engineers often don't feel responsible for solving them — until the number gets large enough that someone asks a question in a meeting that's hard to answer.

I've been in that meeting. I've watched other developers sit through it. And I've noticed that every time, the real answer — the architectural answer — wasn't part of the conversation.

What the Architectural Answer Looks Like

The correct fix operates at the layer between your business logic and your external calls.

Not at the prompt level. Not at the model selection level. At the infrastructure layer — the one that manages what happens when a call is made, how results are stored, whether a redundant call is even necessary, and what happens when an endpoint fails.

Most teams build this layer themselves, from scratch, for every project. Custom cache managers. Hand-rolled retry logic. Circuit breakers copy-pasted from a Stack Overflow answer three projects ago. Pages of scaffolding that wraps three lines of actual work, grows beyond anyone's full understanding, and has to be rebuilt the next time.

A few months ago, I stopped rebuilding it.

The tool is called ToolOps. It's an open-source Python middleware SDK — a single decorator that wraps any async function and provides the full resilience layer automatically. Caching, retry logic, circuit breaking, request coalescing, semantic cache for natural language inputs, observability. Framework-agnostic. One install command.

pip install toolops

I'm not going to spend the rest of this article listing features. You can read the documentation. What I want to do instead is tell you what I think is actually interesting about this project — and why I've been thinking about it long after I integrated it.

The Part That's Worth Thinking About

Here's what stayed with me.

When I added ToolOps to a client's multi-agent system — a chatbot handling over ten thousand conversations per day, running paid tool integrations across a network of sub-agents — the cost reduction was significant. Real numbers, real impact. But that's not what I keep thinking about.

What I keep thinking about is that the fix took a weekend.

Everything my client needed — caching, resilience, request coalescing across concurrent agents — was already built, already tested, already production-ready. The integration was decorator placement and backend configuration. The agents didn't change. The business logic didn't change. The entire infrastructure problem, which had been bleeding money for months, was resolved in two days.

And then I thought: how many production AI systems are running right now with this exact problem unfixed, not because the solution is hard, but because nobody told the team it existed?

That question is why I'm writing this.

What I'm Actually Asking You to Do

Nothing dramatic.

If you're building AI agents in Python — with LangChain, CrewAI, LlamaIndex, raw OpenAI calls, anything — I'm asking you to spend twenty minutes with ToolOps. Install it. Wrap one tool function. Watch what happens to your observability output.

Not because you'll be amazed. Maybe you won't be. Maybe your infrastructure is already handled. Maybe this solves a problem you don't have yet.

But if you're at the scale where external call costs are becoming real — or you're heading there — I think you'll recognize the problem it solves immediately. And that recognition is usually followed by a specific thought:

I wish someone had told me about this six months ago.

That thought is the one I'm trying to put in front of you now, rather than later.

And If You Recognize the Problem

Tell someone.

Not as a product recommendation. As a practitioner. In the same way you'd mention a debugging technique that saved you hours, or an architectural pattern that finally made a messy codebase legible. The developer community runs on that kind of knowledge transfer — the informal, peer-to-peer kind that doesn't show up in documentation.

The teams who'd benefit most from knowing this exists are the ones currently hand-rolling their own infrastructure, burning through API credits, sitting through that billing meeting, wondering if the problem is their model choice when the problem is actually the layer underneath.

They're not going to find it through a product page. They're going to find it because someone on their team read something, tried something, and said — in a Slack message, a forum comment, a short post, a conversation at a conference — have you looked at this?

That's all I'm doing here.

GitHub: github.com/hedimanai-pro/toolops

Documentation: hedimanai.vercel.app/projects/toolops.html

If you try it and form a view — positive, skeptical, or somewhere in between — I'd genuinely like to read it in the comments. The most useful thing you can do for a project this early isn't to endorse it. It's to engage with it honestly, publicly, where other developers can find the conversation.

I Just Upgraded My AI Agent Stack — ToolOps Is Now the First Thing I Install on Every New Project

Antoinette C. Lennox — Mon, 01 Jun 2026 08:12:17 +0000

I don't usually write posts like this. I'm not a library evangelist. But every once in a while something in the tooling layer changes meaningfully enough that I feel like I'd be doing the community a disservice by staying quiet about it.

ToolOps just hit a stable release, and it's worth knowing about.

Quick background, if you're not familiar

ToolOps is a Python middleware SDK for AI agent infrastructure. The concept is simple: you wrap any async tool function in a single decorator, and it instantly gets production-grade caching, retry logic, circuit breaking, request coalescing, and observability — without touching your business logic.

I've been using it in client work for a few months. I've written about specific use cases before — a startup handling 10,000+ conversations a day that was quietly bleeding money on redundant API calls, multi-agent systems where sub-agents were independently firing the same paid tool queries with no shared memory between them. In both cases, ToolOps was the fix.

The stable release changes the installation story significantly, and that's what I want to talk about.

What just changed — and why it matters

Up until now, using ToolOps with specific database backends meant managing optional extras. PostgreSQL required toolops[postgres]. Semantic caching required toolops[semantic]. If you wanted everything, you installed toolops[all] and hoped your dependency resolver didn't complain.

That's gone now.

pip install toolops

That's the whole command. One install gives you:

PostgreSQL via asyncpg
SQLite via aiosqlite
Valkey / Redis via redis
MySQL and MariaDB via aiomysql
Semantic caching via sentence-transformers and numpy
OpenAI embeddings support
OpenTelemetry and Prometheus telemetry

All of it. Out of the box. No extras flags, no dependency juggling, no "why is semantic caching not working" debugging sessions at midnight because you forgot to install the right variant.

For anyone who has spent time managing Python dependency extras in CI pipelines or Docker images, you know how much hidden friction this removes.

The new backends are the real headline

Four new first-class cache backends shipped alongside this release, and the expansion matters more than it sounds.

SQLite is the one I'm most immediately glad to have. For local development, single-process tools, or serverless deployments where standing up a Redis instance is overkill — SQLite now works out of the box with full tag-based invalidation. It uses a two-table relational schema with indexed lookups, so it's not a toy implementation. It's genuinely fast for the use cases it fits.
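For readers who want to picture what a two-table layout with tag-based invalidation looks like, here is a rough sketch using the standard-library `sqlite3` module. The table names, column names, and helper functions are assumptions for illustration; ToolOps's real schema and its async `aiosqlite` code may look quite different.

```python
import sqlite3
from typing import Iterable, Optional

# Hypothetical schema: one table for entries, one mapping tags to entry keys.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cache_entries (
    key   TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS cache_tags (
    tag TEXT NOT NULL,
    key TEXT NOT NULL REFERENCES cache_entries(key) ON DELETE CASCADE,
    PRIMARY KEY (tag, key)
);
CREATE INDEX IF NOT EXISTS idx_cache_tags_key ON cache_tags(key);
"""


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.executescript(SCHEMA)
    return conn


def put(conn: sqlite3.Connection, key: str, value: str, tags: Iterable[str]) -> None:
    with conn:  # one transaction for the entry and its tags
        conn.execute(
            "INSERT INTO cache_entries (key, value) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
            (key, value),
        )
        conn.executemany(
            "INSERT OR IGNORE INTO cache_tags (tag, key) VALUES (?, ?)",
            [(tag, key) for tag in tags],
        )


def get(conn: sqlite3.Connection, key: str) -> Optional[str]:
    row = conn.execute("SELECT value FROM cache_entries WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def invalidate_tag(conn: sqlite3.Connection, tag: str) -> int:
    """Delete every entry carrying the tag; return how many entries were removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM cache_entries WHERE key IN "
            "(SELECT key FROM cache_tags WHERE tag = ?)",
            (tag,),
        )
    return cur.rowcount
```

The composite primary key on `(tag, key)` makes the tag lookup an index scan, and the cascade removes the tag rows of deleted entries so the mapping table cannot drift out of sync.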

Valkey is the open-source Redis fork that's been gaining serious traction since Redis changed its licensing. If your infrastructure team has already migrated — or is planning to — ToolOps now supports it natively with an async connection pool and O(1) tag-based invalidation using Sets.

RedisCache is provided as a clean alias that inherits from the Valkey backend. If your existing deployment scripts reference Redis by name, nothing breaks. The nomenclature is preserved; the backend is unified.

MySQL and MariaDB round out the database support. Compatible with MySQL 8.0+ and MariaDB 10.5+, normalized dual-table schema, transactional commits, upsert semantics via ON DUPLICATE KEY UPDATE. For teams already running MySQL in production — which is most teams, if you're honest about the industry — this removes the last remaining reason to reach for a separate caching solution.

What this means in practice

Before this release, choosing a cache backend was a deployment decision that also became an install decision. PostgreSQL for one project, Redis for another — each one required a different install command, different CI configuration, different Dockerfile lines.

Now you pick the backend at configuration time, not install time:

ToolOpsManager.register_backend(
    "main_cache",
    MySQLCache(
        host="localhost",
        database="toolops_cache",
        user="root",
        password="password"
    )
)

The decorator stays identical regardless of backend:

@readonly(cache_backend="main_cache", cache_ttl=3600, retry_count=3)
async def run_tool(query: str) -> str:
    return await paid_tool.call(query)

Backend is a configuration choice. Decorator is business logic. They don't bleed into each other.

Who should care about this right now

Teams already using ToolOps: Upgrade is a single command — pip install --upgrade toolops — and requires zero code changes. Your existing decorators, backends, and CLI commands work exactly as before. You can simplify your requirements.txt or pyproject.toml immediately.

Teams running MySQL or MariaDB in production: You now have a first-class, native cache backend that integrates directly into the infrastructure you already operate. No Redis sidecar, no extra managed service, no additional monthly cost.

Teams doing local AI development: SQLite backend means you can run a fully-featured, properly cached agent pipeline with zero external infrastructure. It's the fastest possible path from "I want to test this tool" to a working, resilient, observable local environment.

Teams building multi-agent systems at scale: None of the core features changed. Request coalescing, semantic caching, circuit breaking — all still there, all still the core reason to use this. The stable release just means you can trust the foundation underneath them.

A note on the codebase itself

For anyone who evaluates libraries by looking at the internals before adopting them — the caching subsystem was refactored into a clean modular package structure as part of this release. Each backend lives in its own module. The interface contracts are properly typed and pass mypy strict mode across all nine cache modules with zero errors. The test matrix covers each new backend with unit and integration tests.

It's the kind of release that signals a project transitioning from "promising experiment" to "something you can actually build on." The internals match the ambitions.

Final thought

The tooling layer for AI agents has been a mess for a long time, not because people weren't trying, but because the problem surface is broad and the solutions kept arriving piecemeal. Caching here, retry logic there, a circuit breaker you wrote yourself two projects ago and have copy-pasted ever since.

ToolOps has been quietly assembling those pieces into a coherent whole. The stable release, and the database expansion that came with it, is the point where I'd say it's no longer a library you evaluate — it's a library you add to the stack and stop thinking about.

That's the highest compliment I can give infrastructure.

GitHub: github.com/hedimanai-pro/toolops

Happy to answer questions about specific integration patterns or use cases in the comments — particularly around multi-agent setups and high-volume pipelines.

ToolOps, The Night the Dashboard Turned Red

Antoinette C. Lennox — Mon, 25 May 2026 08:23:36 +0000

The alert came at 2:17 a.m.

Marcus didn't hear it at first. He was already awake, sitting at the kitchen table in the blue light of his laptop, watching numbers refresh that he no longer wanted to see. Three months since launch. Forty thousand users. A waitlist that kept growing. By every measure that was supposed to matter, Aria — his AI-powered research assistant — was a success.

The billing dashboard said otherwise.

He'd built Aria to do what human researchers did, only faster. You asked her a question, she deployed a team of sub-agents — each one specialized, each one pulling from a different paid data source — and within seconds you had an answer that would have taken a junior analyst two hours to compile. The product was elegant. The demo converted. Investors had used words like inevitable.

What investors don't model for, Marcus had learned, is what happens when something inevitable reaches ten thousand conversations in a single day.

His phone buzzed. Then buzzed again. Then held a continuous, low vibration that meant the alert had escalated from warning to critical. He looked at the screen.

API spend: $4,340. Today.

He set the phone face-down on the table.

The problem, he knew, wasn't that the system was broken. It was that the system was working exactly as designed. Every sub-agent was doing its job. Every tool call was legitimate. Somewhere inside those ten thousand daily conversations, the same searches were being fired independently, simultaneously, by agents that had no way of knowing another agent had asked the same question four seconds earlier. Three agents, three API calls, three invoices — for one piece of information that had already been retrieved.

At scale, that math became its own kind of catastrophe.

He'd tried to fix it himself, six weeks earlier. Stayed up three nights writing a custom cache layer, proud of the architecture, satisfied with the elegance of the solution. It held for eleven days. Then a memory leak he hadn't anticipated took down the entire pipeline at peak traffic, and he spent the following morning explaining to users why Aria had gone silent for four hours.

The custom fix was now disabled. The billing clock was running again.

At 3:05 a.m., he sent me a message. We'd worked together briefly the year before, on an earlier project that never launched. The message was two sentences.

I think I've built something people actually want. I'm not sure I can afford to keep running it.

I called him the next morning.

He walked me through the architecture slowly, with the particular exhaustion of someone who has explained a problem so many times that the explanation itself has started to feel like the problem. The sub-agent network. The paid tool integrations. The volume. The redundancy he couldn't eliminate without building infrastructure he didn't have time to build.

I let him finish. Then I asked him one question.

"What's sitting between your agents and your APIs?"

Silence.

"Nothing," he said. "Just the calls."

That was the problem. Not the product, not the model choices, not the architecture of the agents themselves. The layer between the business logic and the external world was empty — no caching, no coalescing, no circuit breaking, no shared memory. Every call landed cold. Every duplicate query cost real money as if it were the first time it had ever been asked.

I told him about ToolOps. I'd been using it in my own work for a few months — a Python middleware SDK that wraps tool functions in a single decorator and handles the entire resilience layer automatically. Caching, retry logic, circuit breaking, and observability. For a multi-agent system like his, the critical feature was request coalescing: when multiple agents fire the same endpoint simultaneously, ToolOps executes the call once and distributes the result. Semantic caching meant that queries with identical intent but different phrasing — the kind a chatbot generates by the thousands — hit the same cache entry rather than triggering separate calls.

He was quiet for a moment.

"How long to integrate?"

"A decorator per tool function," I said. "The agents don't change. The business logic doesn't change. You're wrapping the calls, not rewriting the system."

We shipped the integration over a weekend.

I remember watching Marcus go quiet on the call as the first full day of data came in. Not the anxious quiet of someone bracing for bad news. Something slower and more private — the particular stillness of a person watching a problem they'd carried for months simply stop.

The duplicate calls that had been firing in parallel were coalescing into single upstream requests. The semantic cache was catching intent matches his exact-match logic had never seen. The circuit breaker — which he'd never had before — flagged one unreliable third-party endpoint that had been silently degrading his response quality for weeks, long before it showed up as an error.

His spend that day was a fraction of what it had been.

He didn't say much. He didn't need to. He took a screenshot of the dashboard — the same dashboard he'd been watching turn red every morning for three months — and sent it to me without a caption.

The numbers were green.

A few weeks later, over coffee, he told me what he'd done with the savings.

Not pocketed them. Not used them to extend runway. He'd hired a front-end developer he'd been putting off, shipped a feature that had been sitting in the backlog since January, and started building an integration his enterprise users had been asking about since launch.

The infrastructure efficiency hadn't just saved the product. It had funded the next version of it.

He said something I've thought about since.

"I kept thinking the problem was that we were growing too fast. But it wasn't that. The problem was that nothing was remembering anything."

That's it, really. That's the whole lesson, told better than I could have told it.

At scale, memory is money. And the systems that forget — the ones that fire every call cold, that treat every question as if it's never been asked before — pay for that forgetting, every single day, in ways that only show up when you're staring at a dashboard at 2 a.m. trying to understand how something this successful can feel this fragile.

Aria is still running. Still growing. Still handling her ten thousand conversations a day.

She just remembers now.

GitHub: https://github.com/hedimanai-pro/toolops

ToolOps Saved My Client’s Startup. Here’s the Architecture Problem Nobody Talks About.

Antoinette C. Lennox — Fri, 22 May 2026 07:17:14 +0000

A field report from the production layer.

The call came at a bad time — or maybe exactly the right time.

My client had built something that was actually working. An AI-powered chatbot handling web searches, pulling from multiple paid tool integrations, serving real users at real volume. The product was live. Users were engaged. By every surface metric, the startup was on track.

Except the infrastructure was silently bleeding money.

I've spent years helping teams build production-ready AI applications. I've seen the full range: systems that collapse under their first real traffic spike, systems that work beautifully at demo scale and become unmanageable at ten times that, and systems like my client's — architecturally sound, functionally impressive, and quietly unsustainable because of a single layer nobody had addressed.

When we got on the call and he walked me through the numbers, it clicked immediately.

The Architecture Behind the Problem

The system wasn't simple. It was never going to be simple — the product didn't allow for it.

The chatbot operated through a network of sub-agents. Each conversation didn't trigger one process; it triggered a cascade. Each sub-agent had its own set of tools — search APIs, data services, third-party integrations — and every single one of those tools billed per call. The architecture was correct for the product requirements. But there was no shared intelligence between the agents. No layer that could recognize when the same query had already been answered sixty seconds ago. No mechanism to prevent three sub-agents in three parallel conversations from independently firing the same API call, paying three times for one piece of information.

At 10,000 conversations a day, that redundancy compounds fast.

Here's what makes this problem invisible until it isn't: every individual call looks justified. The sub-agent needed that data. The tool returned the right result. Nothing failed. The system log shows clean executions from top to bottom. The billing dashboard tells a different story — one that only becomes legible when you step back and look at the aggregate, at the patterns, at the sheer volume of duplicate intent spread across thousands of simultaneous conversations.

This is the infrastructure problem nobody talks about, because it doesn't produce errors. It produces invoices.

The Standard Fix — And Why It Doesn't Scale

Before I found a better solution, I would have approached this the way I'd always approached it: write a custom cache layer per tool.

I've done it enough times to know the real cost of that approach. A proper cache implementation for a single tool — one that handles cache logic correctly, manages TTL, deals with edge cases, and doesn't introduce new failure modes — requires at minimum 20 lines of code. For a system with multiple paid tools spread across multiple sub-agents, you're writing that infrastructure over and over again, for every tool, maintained separately, tested separately, debugged separately.
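To make that cost concrete, this is roughly what the hand-rolled version looks like for one tool. It is a sketch under my own assumptions (the decorator name `ttl_cache`, positional hashable arguments only, an injectable clock for testing), not code from the client's system.

```python
import functools
import time
from typing import Any, Callable, Dict, Tuple


def ttl_cache(ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
    """Cache an async function's results per argument tuple for ttl_seconds."""

    def decorator(fn):
        store: Dict[Tuple[Any, ...], Tuple[float, Any]] = {}  # args -> (expires_at, value)

        @functools.wraps(fn)
        async def wrapper(*args):
            hit = store.get(args)
            if hit is not None and hit[0] > clock():
                return hit[1]  # fresh entry: skip the paid call
            value = await fn(*args)
            store[args] = (clock() + ttl_seconds, value)
            return value

        return wrapper

    return decorator
```

Even at this length it has gaps that matter in production: expired entries are never evicted, so the dict grows without bound; two concurrent misses for the same arguments both pay for the call; and the store lives in one process, so separate sub-agent processes share nothing.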

That's weeks of engineering time that produces no product value. It makes the system more complex. It gives you more surface area for failure. And it still doesn't solve the multi-agent problem cleanly, because hand-rolled cache layers don't naturally share state across independently running sub-agents.

The deeper issue is philosophical: caching, retry logic, circuit breaking, and observability aren't features you bolt onto a production AI system after the fact. They're the foundation. But the tooling to implement that foundation properly hadn't existed in a form that was fast to integrate — until recently.

Why ToolOps Was the Right Call

I'd been using ToolOps in my own work before this client came to me. It's a Python middleware SDK built specifically for AI agent infrastructure — it wraps any async function in a single decorator and handles caching, retry logic, circuit breaking, and observability automatically, without touching your business logic.

For a multi-agent system running paid tools at high volume, the critical feature is request coalescing: when multiple agents call the same endpoint simultaneously, ToolOps executes the actual API call once and distributes the result across all callers. In a system handling thousands of daily conversations with overlapping query patterns — which is exactly what my client had — this collapses cascading duplicate calls into a fraction of the original volume.
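The idea behind request coalescing can be sketched in a few lines of `asyncio`. This is a generic single-flight pattern written for illustration; the `Coalescer` class and its `run` method are hypothetical names, and I am not claiming this is how ToolOps implements it internally.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple


class Coalescer:
    """Share one in-flight call among concurrent callers that use the same key."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, "asyncio.Future[Any]"] = {}

    async def run(self, key: str, fn: Callable[[], Awaitable[Any]]) -> Any:
        existing = self._in_flight.get(key)
        if existing is not None:
            # Join the call that is already running. shield() keeps one
            # cancelled waiter from cancelling the shared call.
            return await asyncio.shield(existing)
        task = asyncio.ensure_future(fn())
        self._in_flight[key] = task
        try:
            return await task
        finally:
            self._in_flight.pop(key, None)


async def demo() -> Tuple[List[str], int]:
    calls = 0

    async def paid_lookup() -> str:
        nonlocal calls
        calls += 1  # count how many times the upstream is actually hit
        await asyncio.sleep(0.01)
        return "result"

    coalescer = Coalescer()
    results = await asyncio.gather(
        *(coalescer.run("same-query", paid_lookup) for _ in range(3))
    )
    return list(results), calls
```

In `demo`, three callers ask for the same key at the same time and the upstream function runs once. Note that this only merges calls that overlap in time; results are forgotten as soon as the call finishes, which is why coalescing is normally paired with a cache.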

The semantic caching layer compounds the effect. Unlike exact-match caching, it recognizes intent rather than literal string matches. A chatbot fielding 10,000 conversations a day generates enormous natural language variety around a relatively finite set of underlying queries. Most caching systems miss that entirely. Semantic caching catches it.
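A semantic cache lookup is, at its core, a nearest-neighbour search over embeddings with a similarity threshold. The sketch below shows that mechanism only. The word-count "embedding" is a deliberately crude stand-in so the example runs without any dependencies; a real system would use a sentence-embedding model, and the class name, threshold, and linear scan are all assumptions of mine.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: lowercase word counts. Not a real semantic model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector] = toy_embed, threshold: float = 0.8):
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, value: str) -> None:
        self._entries.append((self._embed(query), value))

    def get(self, query: str) -> Optional[str]:
        """Return the value of the most similar stored query, if similar enough."""
        q = self._embed(query)
        best_score, best_value = 0.0, None
        for vector, value in self._entries:
            score = cosine(q, vector)
            if score > best_score:
                best_score, best_value = score, value
        return best_value if best_score >= self._threshold else None
```

The threshold is the knob that matters: set it too low and unrelated questions share an answer, too high and the cache degrades to exact matching.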

The integration required no architectural overhaul. One decorator per tool function:

@readonly(cache_backend="semantic", cache_ttl=3600, retry_count=3)
async def run_tool(query: str) -> str:
    return await paid_tool.call(query)

Every tool in the system, wrapped. The sub-agents kept running exactly as before. The layer between them and the APIs changed everything.

What Actually Changed

The cost reduction was significant — significant enough that my client didn't just stabilize the unit economics of his existing system. He had runway he hadn't had before.

What he did with it matters more than the savings themselves: he reinvested directly into the product. Better capabilities. Improvements that had been on the roadmap for months, waiting for budget that kept getting consumed by infrastructure overhead. The efficiency gain at the tooling layer funded the next stage of the build.

That's the outcome that's hard to explain to someone who hasn't seen it happen. Optimizing your token count gets you incremental savings on one line of the bill. Fixing the infrastructure layer changes what the business can do.

There's something else that changed, quieter but just as real: the operational experience of running the system. Fewer unexpected spikes. A circuit breaker that detects failing endpoints and stops hammering them before the errors cascade. A single CLI command — toolops doctor — that validates backend health and reports state without digging through logs. For a startup at this scale, that kind of operational clarity isn't a convenience. It's the difference between a system you can manage and one that manages you.
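For anyone who has not worked with one, a circuit breaker is a small state machine: count consecutive failures, refuse calls for a cooldown period once a limit is hit, then let one trial call through. The sketch below is a generic version with names and defaults I picked for the example; it is not ToolOps's implementation.

```python
import time
from typing import Any, Awaitable, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the endpoint is cooling down."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    async def call(self, fn: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("endpoint is cooling down")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = await fn(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0  # any success closes the circuit
        self._opened_at = None
        return result
```

While the circuit is open the failing endpoint receives no traffic at all, which is what stops one degraded dependency from dragging down every request that touches it.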

The Pattern I Keep Seeing

This client's situation wasn't unusual. It's representative of a failure mode I encounter consistently in production AI systems: the product architecture is solid, the model selection is thoughtful, and the infrastructure layer — the one that sits between the business logic and the external world — is either missing entirely or stitched together from custom code that's grown beyond anyone's full understanding.

The mistake isn't negligence. It's sequencing. You build the product first, which is correct. You defer the infrastructure, which is understandable. And then the system scales, and the infrastructure debt becomes the most expensive line on the bill.

Multi-agent architectures make this worse by nature. Every agent you add multiplies the external call volume. Every paid tool you integrate adds another billing surface. The redundancy that's invisible at demo scale becomes structurally significant at production scale — not because anything broke, but because nothing in the system was built to recognize and eliminate it.

The teams that will run efficiently at scale — as models get cheaper, as agent architectures grow more complex, as API-dependent products become the norm — are the ones who treat the infrastructure layer as a first-class concern from the beginning. Not an afterthought, not a future sprint, not something to fix when the bill becomes impossible to ignore.

The caching layer is not a performance optimization. It's an architectural decision about how much of your operating cost you're willing to pay twice.

I work with teams building production AI systems and help them move from prototype to production-ready architecture. If this pattern sounds familiar in your own stack, I'd be glad to hear about it in the comments.

Stack: ToolOps (github.com/hedimanai-pro/toolops)

ToolOps: The Python Middleware That's Quietly Cutting AI Infrastructure Costs for Teams Running at Scale

Antoinette C. Lennox — Wed, 20 May 2026 09:20:13 +0000

There's a number most AI teams discover too late.

It's not in the documentation. It's not in the LLM provider's pricing FAQ. It shows up on the bill — usually during a routine review, usually after a production deployment that "went well." According to CloudZero's research, average monthly AI spend jumped from $63,000 in 2024 to $85,500 in 2025 — a 36% increase. And for the teams that figure out what's actually driving that number, the culprit is almost never the model they chose. It's the calls they didn't need to make.

This article is about a Python SDK called ToolOps that I started using a few months ago. I'm not affiliated with the project. I'm a developer who was burning through LLM credits faster than I should have been, tried a few solutions, and eventually found one that actually worked.

The Real Cost of Production AI Agents

Token prices are falling. LLM API prices dropped approximately 80% between early 2025 and early 2026 — GPT-4o input pricing fell from $5.00 to $2.50 per million tokens, and newer models offer input at just $0.55/MTok. On paper, that sounds like great news for anyone building AI systems.

In practice, it barely moves the needle if your architecture is inefficient.

Here's why: each tool call in an agent adds the full message history back into the prompt. A 5-step agent with a 30,000-token system prompt can pay for that prompt five or more times per request. Now multiply that by concurrent agents, parallel pipelines, and repetitive queries that ask effectively the same thing in slightly different words. The token price per million is irrelevant. You're paying for the same computation over and over.
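The arithmetic is worth writing out once. This sketch uses the figures from the text (a 30,000-token system prompt, 5 steps, and the $2.50 per million input tokens quoted earlier); the function names are mine, and the result is a lower bound because it ignores the growing message history and all output tokens.

```python
def prompt_tokens_billed(system_prompt_tokens: int, steps: int) -> int:
    """Lower bound: the system prompt is re-sent once per agent step."""
    return system_prompt_tokens * steps


def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    return tokens * usd_per_million_tokens / 1_000_000


# 30,000 tokens x 5 steps = 150,000 input tokens for a single request
billed = prompt_tokens_billed(system_prompt_tokens=30_000, steps=5)

# At $2.50 per million input tokens that is about $0.375 per request,
# before any history growth or output tokens are counted.
per_request = input_cost_usd(billed, usd_per_million_tokens=2.50)
```

A cached or coalesced call removes the whole per-request amount, whereas a price cut only shrinks the multiplier.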

The cheapest API call is the one you don't make. Efficient prompts, smart caching, and appropriate model selection matter more than provider choice. That principle sounds obvious until you're the one writing the infrastructure to enforce it — at which point you realize it's neither simple nor fast.

What Most Teams Do (And Why It Doesn't Scale)

The standard approach to managing these costs involves writing custom infrastructure: a cache layer, retry logic, a circuit breaker for when APIs go down, observability hooks so you can debug what's happening, and concurrency controls to prevent 40 agents from hammering the same endpoint in parallel.

Every piece of that is necessary. And every piece of it is code you write yourself, from scratch, for each project.

When you build AI agents, external calls — LLMs, APIs, databases — are expensive, unreliable, and slow. ToolOps eliminates the boilerplate: it's a framework-agnostic middleware SDK that wraps any Python function in a single decorator, instantly upgrading it with caching, resilience, observability, and concurrency control.

That's the pitch. Here's what it actually looks like in code.

One Decorator. Everything Else Is Handled.

The before/after is stark.

Before ToolOps, a properly resilient LLM tool call involves cache management, retry logic, circuit breaker state, timeout handling, and tracing — spread across dozens of lines of infrastructure code that wraps three lines of actual work.

After:

@readonly(cache_backend="semantic", cache_ttl=3600, retry_count=3)
async def ask_llm(query: str) -> str:
    return await llm.complete(query)

Automatically cached, retried, and traced. Every agent developer hits a wall when moving from demo to production — and that one decorator is what stands between a clean codebase and an unmaintainable nest of infrastructure scaffolding.

The @readonly decorator marks the function as idempotent, which is what makes it safe to cache and retry. The @readonly / @sideeffect split is opinionated in a good way: it forces you to state up front whether a tool call has side effects, and that decides what can be cached or retried safely.

The Feature That Makes the Biggest Difference at Scale

For teams running multi-agent systems — which is increasingly the default architecture for any serious AI workflow — there's one ToolOps feature that changes the economics of high-volume operations more than anything else.

Request coalescing.

If 50 agents call the same endpoint simultaneously, ToolOps executes the real API call once and multicasts the result.

At first glance, this sounds like a minor optimization. It's not. In a production pipeline where multiple agents are processing similar inputs concurrently, this collapses what would be dozens of identical upstream requests into a single one. In a 50-concurrent-call benchmark, 50 calls collapsed to 1 upstream request. The thundering herd problem on cache miss is real, and this handles it cleanly.

One request. One credit charge. One failure to handle instead of fifty.

For large-scale document processing, RAG pipelines, customer-facing AI products, or any architecture that handles bursty, repetitive loads — this is a structural cost reduction that no amount of model-switching will replicate.

Semantic Caching: Catching Costs That Exact-Match Misses

Standard caching is binary: the input either matches a cached key or it doesn't. That works well for structured data. For natural language queries — which is most of what LLM-powered agents process — it misses an enormous opportunity.

The semantic caching in ToolOps uses an intent-matching approach that's genuinely useful for NLP tool inputs. Queries like "Check status of invoice #442" and "Is invoice 442 paid?" hit the same cache entry, reducing LLM token usage noticeably.

This matters more than it might seem. In customer support agents, document analysis pipelines, and data extraction workflows, users phrase the same underlying question dozens of different ways. Every variation that misses an exact-match cache is a redundant API call. Semantic caching removes most of that waste, provided the similarity threshold is tight enough that genuinely different questions don't share an entry.
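The mechanics are easier to see in code. The sketch below is a toy, not how ToolOps does it: word counts stand in for a real embedding model, and `SemanticCache`, `embed`, and the 0.4 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries = []  # list of (vector, answer) pairs

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def get(self, query: str):
        """Return the answer of the most similar stored query, or None below threshold."""
        vector = embed(query)
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine(vector, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.4)
cache.put("Check status of invoice #442", "paid")
print(cache.get("Is invoice 442 paid?"))        # paid
print(cache.get("What is the weather today?"))  # None
```

With this toy similarity, "Check status of invoice #443" would also hit the entry for invoice 442, because four of its five words match. That is the trade-off to watch with any semantic cache: a looser threshold saves more calls and returns more wrong answers.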

Production-Grade Resilience Without the Ceremony

Beyond cost reduction, there's the reliability side of production AI infrastructure.

LLM APIs go down. External services rate-limit. Downstream databases return transient errors. The naive response is to let your agent fail. The correct response is a circuit breaker that detects consistent failures, temporarily halts calls to the affected service, and allows recovery — without you having to build that logic yourself.
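For readers who haven't built one, here is a minimal synchronous sketch of the pattern. It is a generic illustration under my own assumptions (the class name, the thresholds, and the injectable clock are hypothetical), not the breaker ToolOps ships.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is refusing calls."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_seconds: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_seconds:
                # Open: fail fast without touching the struggling service.
                raise CircuitOpenError("circuit open; skipping call")
            # Recovery window elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap each outbound call in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or skip. An async version needs the same state machine around an awaited call, plus care that only one trial request runs while half-open.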

ToolOps includes this out of the box. A single CLI command — toolops doctor — validates all your backends and reports circuit breaker state. It's exactly what you want to wire into a health check endpoint.

That kind of operational visibility — knowing the status of every backend, every circuit breaker, without digging through logs — is the difference between an agent that fails silently and one you can actually run in production with confidence.

Framework Compatibility: It Works With What You Already Use

The natural concern when evaluating any new piece of infrastructure is migration cost. How much do I have to change?

ToolOps decorates plain Python async functions, making it 100% compatible with your favorite agent frameworks. It works across LangGraph, CrewAI, LlamaIndex, and MCP natively.

You don't rewrite your agents. You don't change your business logic. You add a decorator to the functions that make external calls and configure backends once at startup.

Backends are referenced by name, and ToolOps supports several at once: Redis for persistent caching, in-memory for low-latency hot paths, semantic backends for NLP tools. You configure the combination that fits your architecture, then you stop thinking about it.

The core package has zero external dependencies. You only install what you need. No forced opinions on your stack, no transitive dependency conflicts on day one, no bloat.

Who Benefits Most From This

ToolOps is most valuable in three specific situations.

High-volume production pipelines. If your system makes thousands or tens of thousands of API calls per day, even modest cache hit rates translate to significant cost reductions. At scale, cost reductions of 50% to 90% are achievable, depending on how repetitive the workload is, while maintaining or even improving the quality of the AI application.

Multi-agent architectures. The request coalescing feature was built for this. The more agents you run in parallel on overlapping workloads, the more redundant upstream calls you're generating without it.

Teams who've been hand-rolling infrastructure. If your codebase currently has a custom retry wrapper, a homemade cache manager, and a circuit breaker you wrote yourself — that's infrastructure debt ToolOps replaces directly. The integration is one decorator per function, with zero changes to business logic.

Getting Started

pip install "toolops[all]"

From there, it's backend configuration at startup and decorator placement on your tool functions. The GitHub repository covers the full setup, and the official documentation walks through backend configuration and the decorator API in detail.

The project is early — a web dashboard and budget control features are still on the roadmap — but the core resilience layer is solid. It's Apache 2.0 licensed. Open source, production-ready for its current feature set, actively developed.

The Architecture Principle It Enforces

There's something more fundamental happening here than a useful library.

ToolOps is built on the idea that every external call an AI agent makes should be treated as a first-class operation — not an afterthought. Caching, retry logic, circuit breaking, observability, and concurrency control aren't optional production concerns you bolt on later. They're the minimum viable infrastructure for anything that talks to an LLM or an external API.

Most teams know this. Most teams also don't have time to build it properly for every project. ToolOps packages that infrastructure into a decorator and gets out of the way.

Don't over-optimize for today's prices. What matters is building the architecture that can take advantage of future pricing improvements. The teams that will operate efficiently as models get cheaper, APIs multiply, and agent systems scale are the ones that built the right plumbing early. ToolOps is that plumbing.

If you're building production AI agents and you've hit the credit-burn problem, I'd genuinely like to hear how you've handled it. Drop a comment below.

GitHub: github.com/hedimanai-pro/toolops