DEV Community: Hedi Manai

Your AI agent's smallest diffs are its most dangerous

Hedi Manai — Mon, 13 Jul 2026 00:18:31 +0000

Last month, an AI coding agent handed me a beautiful fix. Five lines. Elegant. It reused an existing helper, matched the codebase style, compiled on the first try. Exactly the kind of diff we've all learned to praise since "make the agent write less code" became the standard advice.

It was also completely untested, and it sat on a password-recovery path.

That diff taught me something I now consider the central problem of AI-assisted coding in 2026: we've spent a year teaching agents to write less code, and almost no time teaching them to prove the code they kept actually holds.

The two failure modes

Every AI coding agent fails in one of two directions.

Failure mode #1: the over-build. You ask for a date comparison; you get a new dependency, a ValidationService class, and a config layer. This one is well known — it's why minimal-code prompts and skills became popular, and they genuinely work on it.

Failure mode #2: the confidently small diff. Minimal, clean, written after reading half the flow, verified never — dropped onto a path that handles money, auth, or user data. It compiles. It demos. It detonates in week three.

Here's the uncomfortable part: fixing #1 aggressively makes #2 more likely. When the objective function is "shortest diff," the first things to quietly disappear are edge-case handling, failure-path tests, and the guard clause that looked optional. The diff gets smaller. The blast radius doesn't.

A five-line change to a payment path is more dangerous than a four-hundred-line internal script that runs once. Code size is not risk. Blast radius is risk. Yet almost every skill and prompt in this category optimizes for size alone.

What a guard does differently

This is why I built Guardsman 💂 — an open-source skill that behaves less like a minimalist and more like the royal guard in front of the palace: nothing passes the post unchallenged, and the level of challenge depends on what's behind the gate.

Three duties, on every task:

1. Read the standing orders. Before writing anything, a deterministic script detects the repo's real conventions — language, formatter, linter, actual test command — and the agent greps how this codebase already solves the nearest similar problem. Your patterns win over its defaults. Always.

2. Set the threat level. Every change gets a risk tier before a single line is written:

Tier	When	The challenge
trivial	internal one-off	one manual run, output shown
standard	everyday feature work	one check written and executed this turn
sensitive	money, auth, user data in the blast radius	a test exercising the failure path
critical	getting it wrong is an incident	full coverage, or the gap is surfaced as a blocker

Two rules keep the tiers honest: when signals disagree, the higher tier wins — and the agent never infers a downgrade just because the code looks simple. A confident walk is not a countersign.

3. Issue the challenge. Non-trivial logic isn't "done" when the code is written. It's done when the check behind it was actually run, in the session, with output shown. A test that was written but never executed doesn't count. A promise to test later doesn't count.

The logbook: tech debt that can't hide

TODO comments rot. Nobody greps them, nobody triages them. Guardsman replaces them with structured watch-log entries:

# guardsman: retry capped at 3, no backoff config | severity:med | revisit:second-caller-appears | cost:none

Severity and revisit trigger are both required — a shortcut nobody can triage is worse than an unmarked one, because it creates false confidence that it's tracked. A bundled scanner harvests the whole repo and reports open entries by severity, overdue first. Your deferred work stops being folklore and becomes a list with dates on it.

The part I expect skepticism about

Guardsman ships with a rule I haven't seen in this category: it never invents numbers about your repo. It will not tell you it "saved you 12,000 lines" — there is no unbuilt baseline of your codebase to diff against, so no tool can honestly compute that figure. Its status mode reports only what is directly countable, live, or says "nothing computed yet."

And that rule applies to the project itself, which is why I'm launching with zero benchmark numbers. What the repo contains instead is a fully pre-registered protocol: 12 realistic tickets against a pinned open-source FastAPI + React codebase, 43 pre-registered guard checks (no user enumeration on the recovery endpoint, webhook signature verified before any field is trusted, idempotent retries…), four arms including the most popular minimal-code skill, and frozen scoring rules — including the disclosure rule that any published result must include cases where Guardsman did worse, if they exist.

The tasks and rubric were committed before any run, so the report can't be bent around the results afterward. Numbers get published when they're measured, with raw diffs and full transcripts. And if you don't want to wait for mine: third-party runs are welcome via PR.

Try it

Claude Code, two commands:

/plugin marketplace add hedimanai-pro/guardsman
/plugin install guardsman@guardsman

Codex, Cursor, or any agent that reads an AGENTS.md: a condensed, self-contained adapter ships in adapters/AGENTS.md — the tiers, the ladder, and the challenge floors travel everywhere.

Repo: https://github.com/hedimanai-pro/guardsman

If you've been burned by a confidently small diff — I'd genuinely like to hear the story in the comments. It's the failure mode nobody benchmarks, and I'm collecting cases.

No diff ships unchallenged. 💂

I'm Hedi Manai — AI & backend engineer building reliability tooling for AI agents. Also the author of ToolOps, middleware for agent tool-calling. Find me on LinkedIn and X.

I really appreciate it!

Hedi Manai — Tue, 12 May 2026 21:16:27 +0000

Bessie Gannon

May 12

My LangGraph agent was hammering the same API endpoints 40 per run. Solved it with ToolOps

#agents #ai #api #performance

2 min read

One Decorator Away From Production-Ready AI Agents

Hedi Manai — Sun, 10 May 2026 11:22:17 +0000

Every agent developer hits the same wall.

The demo works. Then it goes to production — and the cracks show up fast. No retry logic when APIs fail. Identical queries hammering your LLM endpoint over and over. No visibility into what's actually happening. And before long, you're writing the same cache managers, retry decorators, and circuit-breaker wrappers you wrote on the last project.

ToolOps is built to make that boilerplate disappear.

What It Is:
ToolOps is a framework-agnostic Python SDK that wraps any async function in a single decorator, adding caching, retries, circuit breakers, request coalescing, and observability — with zero changes to your business logic.

Think of it the way a service mesh works for microservices: the infrastructure wraps around your code without touching it.

What It Does:
Caching that actually fits production. ToolOps supports in-memory caching for speed, file-based for lightweight persistence, and PostgreSQL for durable, distributed caching shared across processes. Pick the backend that fits the function.

Semantic caching for LLM calls. Standard caches match on exact strings — so "weather in Paris" and "Paris weather" hit the LLM twice. ToolOps uses vector embeddings to match by meaning, collapsing semantically similar queries into a single cached result. For agents handling natural language, this can cut LLM calls by up to 90%.

Request coalescing. When dozens of agents request the same data at once during a cache miss, ToolOps fires one real API call and returns the result to all of them. The thundering herd problem, solved automatically.

Stale-if-error fallback. When an upstream service goes down, ToolOps can serve the last known good value instead of crashing your agent — exactly what you want for slowly-changing data like exchange rates or configuration.

Observability out of the box. Every cache hit, miss, retry, and circuit-breaker event is logged as structured JSON. Add the optional OpenTelemetry extra and you get full distributed tracing and Prometheus metrics — production-grade visibility in a few lines of setup.

Works With Your Stack:
ToolOps wraps plain Python async functions, so it drops into any framework without friction: LangChain, LangGraph, CrewAI, LlamaIndex, PydanticAI, Agno, and more. It also has native MCP support, letting you expose any decorated function as a typed MCP tool definition compatible with Claude Desktop, Cursor, and any MCP-compatible host.

When you migrate frameworks — and most teams eventually do — your infrastructure layer doesn't budge.

The Practical Upside:
The core package installs with a single pip command and has zero external dependencies. Postgres, semantic caching, and OpenTelemetry support are optional extras you add only when you need them. A built-in CLI gives you live hit rates, latency stats, and backend health checks without touching your application code.

Two decorators cover every case: @readonly for functions that read data, @sideeffect for functions that act on the world. That's the entire model.

Try It:
ToolOps is open source, Apache 2.0 licensed, and actively maintained.
If you're building agents that need to survive real traffic, real API failures, and real costs, it's worth your time.

GitHub: https://github.com/hedimanai-pro/toolops
PyPI: https://pypi.org/project/toolops/
Documentation: https://hedimanai.vercel.app/projects/toolops.html

ToolOps: Stop Rewriting the Same Boilerplate Every Time You Build an AI Agent

Hedi Manai — Sat, 09 May 2026 23:04:33 +0000

You've built the demo. It works. The LLM responds, the tools fire, the output looks great.

Then you push it to production — and everything breaks.

API calls fail with no retry logic. Identical queries hammer your LLM endpoint ten times per minute, burning through credits. A single bad response cascades into an agent loop. You have no idea what's happening inside because there's nothing to look at.

So you start writing infrastructure. A retry decorator here. A cache manager there. A circuit-breaker wrapper you found on Stack Overflow. Eighty lines of boilerplate — just to make one tool call production-safe.

This is the problem ToolOps was built to solve.

The Production Gap Nobody Talks About

Building AI agents has never been easier. Frameworks like LangChain, CrewAI, and LlamaIndex get you from idea to working prototype in an afternoon. But moving that prototype to production exposes a gap that frameworks don't fill: the reliability, cost, and observability layer that every real agent needs.

Every external call your agent makes — to an LLM, an API, a database — is a tool call. In production, those calls are expensive, slow, and unreliable. Without proper infrastructure around them, you're flying blind.

Most developers solve this by copy-pasting the same boilerplate across every project. ToolOps solves it with a single decorator.

What ToolOps Actually Does

ToolOps is a framework-agnostic middleware SDK for Python. It sits between your agent and the external world, wrapping any async function with caching, retries, circuit breakers, request coalescing, and observability — without touching your business logic.

The core idea is elegant:

# Before: 80+ lines of custom infrastructure
# After:

@readonly(cache_backend="memory", cache_ttl=3600, retry_count=3)
async def get_market_data(ticker: str) -> dict:
    return await api.fetch(ticker)

One decorator. Your function is now cached for an hour, automatically retried on failure, and fully traced. That's the entire API surface for most use cases.

Two Decorators, Every Case Covered

ToolOps makes a clean architectural distinction between two types of tool calls:

@readonly — for functions that read data. API lookups, database queries, LLM calls, file reads. These get full caching + retry support.

@sideeffect — for functions that write or act. Sending emails, executing trades, posting messages. These are never cached (you genuinely want them to run), but they're protected by retries and circuit breakers.

# Read: cache it, retry it, trace it
@readonly(cache_ttl=3600, retry_count=3, stale_if_error=True)
async def fetch_stock_price(ticker: str) -> dict:
    return await market_api.fetch(ticker)

# Write: protect it, but always execute it
@sideeffect(circuit_breaker=True, timeout=5.0, retry_count=2)
async def execute_trade(order: dict) -> dict:
    return await broker_api.submit(order)

This separation is intentional and surprisingly useful. It forces you to think clearly about what your agent is actually doing — and gives each class of operation exactly the protection it needs.

The Features That Matter in Production

Semantic Caching

Standard caches match on exact strings. "weather in Paris" and "Paris weather" hit different cache keys, so your LLM gets called twice for the same answer.

ToolOps includes a semantic cache that matches by meaning using vector embeddings. Queries above a configurable similarity threshold share the same cached result:

embedder = SentenceTransformerEmbedder("all-MiniLM-L6-v2")
cache_manager.register("semantic", SemanticCache(embedder=embedder, threshold=0.92))

@readonly(cache_backend="semantic", cache_ttl=7200)
async def ask_llm(prompt: str) -> str:
    return await openai_client.chat(prompt)

# Three prompts, one real LLM call:
await ask_llm("Summarize the latest AI news")
await ask_llm("Give me a summary of recent AI news")        # Cache hit ✅
await ask_llm("What's happening in AI recently?")           # Cache hit ✅

For agents that handle natural language queries, this can cut LLM calls by up to 90%.

Request Coalescing

When 50 concurrent agents request the same data during a cache miss, ToolOps fires one real API call and returns the result to all 50. Without this, a thundering herd can overwhelm your API rate limits instantly. With it, the problem simply doesn't exist.

Stale-If-Error Fallback

If your upstream API goes down, ToolOps can serve the last known good cached value instead of throwing an exception. For slowly-changing data like exchange rates or configuration, this is often exactly the right behavior:

@readonly(
    cache_ttl=3600,
    stale_if_error=True,
    stale_ttl=86400,  # Accept data up to 24 hours old if the API is down
)
async def get_exchange_rates(base: str = "USD") -> dict:
    return await forex_api.fetch(base)

Multiple Cache Backends

Backend	Best For
`MemoryCache`	Development, single-process, low-latency hot data
`FileCache`	Local scripts, lightweight persistence
`PostgresCache`	Production, distributed, durable across restarts
`SemanticCache`	NLP queries, RAG pipelines, LLM cost reduction

A hot-cold cache pattern — in-memory for frequent reads, Postgres for expensive computations — is a single configuration call.

Built-In Observability

Every cache hit, miss, retry, timeout, and circuit-breaker event is logged as structured JSON, compatible with Datadog, Loki, CloudWatch, and any log aggregator. Add the [otel] extra and you get full OpenTelemetry tracing and Prometheus metrics with zero extra code:

agent_run (450ms)
  ├── get_market_data (12ms)  [cache: hit]
  ├── get_news_feed (310ms)   [cache: miss, retries: 1]
  └── send_report (128ms)     [circuit: closed]

Going from zero insight to full distributed tracing takes about five lines.

Framework Agnostic by Design

ToolOps wraps plain Python async functions. That means it works with whatever agent framework you're using — no special integration required:

LangChain / LangGraph — stack @readonly under @tool
CrewAI — apply it directly to BaseTool._run()
LlamaIndex — decorate then pass to FunctionTool.from_defaults()
MCP — generate a fully typed MCP tool definition with MCPIntegration.to_mcp_definition()
PydanticAI, Agno, AutoGPT, Haystack — any framework that calls Python async functions

When you migrate frameworks (and you will), your infrastructure layer stays the same.

Getting Started in Under 2 Minutes

Install:

pip install "toolops[all]"

Verify:

toolops doctor

Use:

from toolops import readonly, cache_manager
from toolops.cache import MemoryCache

cache_manager.register("memory", MemoryCache(), is_default=True)

@readonly(cache_backend="memory", cache_ttl=3600, retry_count=3)
async def fetch_weather(city: str) -> dict:
    return await weather_api.fetch(city)

The modular install system means zero required external dependencies for the core package. Add [postgres], [semantic], or [otel] only when you need them.

The CLI (toolops stats, toolops clear, toolops doctor) gives you a live view into cache hit rates, latency, and backend health without touching your code.

Why This Matters Now

AI agents are moving fast from demos to production. The infrastructure gap between "it works on my machine" and "it's running reliably at scale" is real, and it's expensive to rebuild from scratch every time.

ToolOps is a clean answer to a problem that every agent developer hits eventually. It's not a framework — it's the layer beneath your framework, the one that makes your tools trustworthy.

The code is open source, Apache 2.0 licensed, and actively maintained.

If you're building agents that need to survive real traffic, real failures, and real costs, it's worth ten minutes of your time.

GitHub · PyPI · Documentation