Hedi Manai

Originally published at hedimanai.vercel.app

ToolOps: Stop Rewriting the Same Boilerplate Every Time You Build an AI Agent

You've built the demo. It works. The LLM responds, the tools fire, the output looks great.

Then you push it to production — and everything breaks.

API calls fail with no retry logic. Identical queries hammer your LLM endpoint ten times per minute, burning through credits. A single bad response cascades into an agent loop. You have no idea what's happening inside because there's nothing to look at.

So you start writing infrastructure. A retry decorator here. A cache manager there. A circuit-breaker wrapper you found on Stack Overflow. Eighty lines of boilerplate — just to make one tool call production-safe.

This is the problem ToolOps was built to solve.


The Production Gap Nobody Talks About

Building AI agents has never been easier. Frameworks like LangChain, CrewAI, and LlamaIndex get you from idea to working prototype in an afternoon. But moving that prototype to production exposes a gap that frameworks don't fill: the reliability, cost, and observability layer that every real agent needs.

Every external call your agent makes — to an LLM, an API, a database — is a tool call. In production, those calls are expensive, slow, and unreliable. Without proper infrastructure around them, you're flying blind.

Most developers solve this by copy-pasting the same boilerplate across every project. ToolOps solves it with a single decorator.


What ToolOps Actually Does

ToolOps is a framework-agnostic middleware SDK for Python. It sits between your agent and the external world, wrapping any async function with caching, retries, circuit breakers, request coalescing, and observability — without touching your business logic.

The core idea is elegant:

# Before: 80+ lines of custom infrastructure
# After:

@readonly(cache_backend="memory", cache_ttl=3600, retry_count=3)
async def get_market_data(ticker: str) -> dict:
    return await api.fetch(ticker)

One decorator. Your function is now cached for an hour, automatically retried on failure, and fully traced. That's the entire API surface for most use cases.


Two Decorators, Every Case Covered

ToolOps makes a clean architectural distinction between two types of tool calls:

@readonly — for functions that read data. API lookups, database queries, LLM calls, file reads. These get full caching + retry support.

@sideeffect — for functions that write or act. Sending emails, executing trades, posting messages. These are never cached (you genuinely want them to run), but they're protected by retries and circuit breakers.

# Read: cache it, retry it, trace it
@readonly(cache_ttl=3600, retry_count=3, stale_if_error=True)
async def fetch_stock_price(ticker: str) -> dict:
    return await market_api.fetch(ticker)

# Write: protect it, but always execute it
@sideeffect(circuit_breaker=True, timeout=5.0, retry_count=2)
async def execute_trade(order: dict) -> dict:
    return await broker_api.submit(order)

This separation is intentional and surprisingly useful. It forces you to think clearly about what your agent is actually doing — and gives each class of operation exactly the protection it needs.


The Features That Matter in Production

Semantic Caching

Standard caches match on exact strings. "weather in Paris" and "Paris weather" hit different cache keys, so your LLM gets called twice for the same answer.

ToolOps includes a semantic cache that matches by meaning using vector embeddings. Queries above a configurable similarity threshold share the same cached result:

embedder = SentenceTransformerEmbedder("all-MiniLM-L6-v2")
cache_manager.register("semantic", SemanticCache(embedder=embedder, threshold=0.92))

@readonly(cache_backend="semantic", cache_ttl=7200)
async def ask_llm(prompt: str) -> str:
    return await openai_client.chat(prompt)

# Three prompts, one real LLM call:
await ask_llm("Summarize the latest AI news")
await ask_llm("Give me a summary of recent AI news")        # Cache hit ✅
await ask_llm("What's happening in AI recently?")           # Cache hit ✅

For agents that handle natural language queries, this can cut LLM calls by up to 90%.

Request Coalescing

When 50 concurrent agents request the same data during a cache miss, ToolOps fires one real API call and returns the result to all 50. Without this, a thundering herd can overwhelm your API rate limits instantly. With it, the problem simply doesn't exist.
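Here's a minimal sketch of that behavior, assuming coalescing is active by default on @readonly (the slow upstream call and the counter are illustrative, not part of the ToolOps API):

import asyncio

from toolops import readonly, cache_manager
from toolops.cache import MemoryCache

cache_manager.register("memory", MemoryCache(), is_default=True)

upstream_calls = 0

@readonly(cache_backend="memory", cache_ttl=60)
async def get_market_data(ticker: str) -> dict:
    # Counts how many times the upstream API is actually hit
    global upstream_calls
    upstream_calls += 1
    await asyncio.sleep(0.5)  # simulate a slow upstream API
    return {"ticker": ticker, "price": 101.5}

async def main():
    # 50 concurrent requests for the same ticker during a cache miss...
    await asyncio.gather(*(get_market_data("AAPL") for _ in range(50)))
    print(upstream_calls)  # ...should print 1: one real call, 50 results

asyncio.run(main())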

Stale-If-Error Fallback

If your upstream API goes down, ToolOps can serve the last known good cached value instead of throwing an exception. For slowly changing data like exchange rates or configuration, this is often exactly the right behavior:

@readonly(
    cache_ttl=3600,
    stale_if_error=True,
    stale_ttl=86400,  # Accept data up to 24 hours old if the API is down
)
async def get_exchange_rates(base: str = "USD") -> dict:
    return await forex_api.fetch(base)

Multiple Cache Backends

Register as many backends as you need and route different functions to the right one:

Backend         Best For
MemoryCache     Development, single-process, low-latency hot data
FileCache       Local scripts, lightweight persistence
PostgresCache   Production, distributed, durable across restarts
SemanticCache   NLP queries, RAG pipelines, LLM cost reduction

A hot-cold cache pattern — in-memory for frequent reads, Postgres for expensive computations — is a single configuration call.
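A sketch of how that routing might look (the PostgresCache import path and dsn argument are assumptions; check the docs for the real signature):

from toolops import readonly, cache_manager
from toolops.cache import MemoryCache, PostgresCache  # PostgresCache path assumed

# Hot tier: in-process memory for frequent, cheap reads
cache_manager.register("hot", MemoryCache(), is_default=True)

# Cold tier: durable Postgres for expensive, slow-changing results
# (the dsn keyword is a guess at the constructor signature)
cache_manager.register("cold", PostgresCache(dsn="postgresql://localhost/toolops"))

@readonly(cache_backend="hot", cache_ttl=60)
async def get_live_quote(ticker: str) -> dict:
    return await market_api.fetch(ticker)  # illustrative upstream client

@readonly(cache_backend="cold", cache_ttl=86400)
async def run_expensive_analysis(ticker: str) -> dict:
    return await analytics_api.run(ticker)  # illustrative upstream client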

Built-In Observability

Every cache hit, miss, retry, timeout, and circuit-breaker event is logged as structured JSON, compatible with Datadog, Loki, CloudWatch, and any log aggregator. Add the [otel] extra and you get full OpenTelemetry tracing and Prometheus metrics with zero extra code:

agent_run (450ms)
  ├── get_market_data (12ms)  [cache: hit]
  ├── get_news_feed (310ms)   [cache: miss, retries: 1]
  └── send_report (128ms)     [circuit: closed]

Going from zero insight to full distributed tracing takes about five lines.
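Those five lines are plausibly just standard OpenTelemetry SDK wiring. This sketch assumes ToolOps (installed with the [otel] extra) emits spans through the global tracer provider, which is the usual convention for instrumented libraries:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Standard OTel setup; assumes ToolOps picks up the global
# tracer provider registered here
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)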


Framework Agnostic by Design

ToolOps wraps plain Python async functions. That means it works with whatever agent framework you're using — no special integration required:

  • LangChain / LangGraph — stack @readonly under @tool (sketched after this list)
  • CrewAI — apply it directly to BaseTool._run()
  • LlamaIndex — decorate then pass to FunctionTool.from_defaults()
  • MCP — generate a fully typed MCP tool definition with MCPIntegration.to_mcp_definition()
  • PydanticAI, Agno, AutoGPT, Haystack — any framework that calls Python async functions
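For the LangChain case, that stacking is literally two decorators. This uses langchain_core's @tool; weather_api is an illustrative upstream client, as in the earlier examples:

from langchain_core.tools import tool
from toolops import readonly

@tool
@readonly(cache_backend="memory", cache_ttl=3600, retry_count=3)
async def get_weather(city: str) -> dict:
    """Fetch current weather for a city."""
    return await weather_api.fetch(city)  # illustrative upstream call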

When you migrate frameworks (and you will), your infrastructure layer stays the same.


Getting Started in Under 2 Minutes

Install:

pip install "toolops[all]"

Verify:

toolops doctor

Use:

from toolops import readonly, cache_manager
from toolops.cache import MemoryCache

cache_manager.register("memory", MemoryCache(), is_default=True)

@readonly(cache_backend="memory", cache_ttl=3600, retry_count=3)
async def fetch_weather(city: str) -> dict:
    return await weather_api.fetch(city)

The modular install system means zero required external dependencies for the core package. Add [postgres], [semantic], or [otel] only when you need them.

The CLI (toolops stats, toolops clear, toolops doctor) gives you a live view into cache hit rates, latency, and backend health without touching your code.


Why This Matters Now

AI agents are moving fast from demos to production. The infrastructure gap between "it works on my machine" and "it's running reliably at scale" is real, and it's expensive to rebuild from scratch every time.

ToolOps is a clean answer to a problem that every agent developer hits eventually. It's not a framework — it's the layer beneath your framework, the one that makes your tools trustworthy.

The code is open source, Apache 2.0 licensed, and actively maintained.

If you're building agents that need to survive real traffic, real failures, and real costs, it's worth ten minutes of your time.

GitHub · PyPI · Documentation
