Mukunda Rao Katta

Posted on May 25

Stop Paying for the Same Tool Call Twice: Memoize LLM Tool Calls with tool-call-cache

#hermeschallenge #ai #python #agents

The problem showed up in a log file

I was debugging a multi-turn research agent. The agent had a tool called fetch_url. It was supposed to retrieve documentation pages and summarize them.

I grepped the trace log for fetch_url. Twelve calls. Same URL, twelve times.

The agent had fetched the exact same page once at the start of the conversation, once per reasoning step, and once more at the end for a "final verification." The page had not changed. The HTTP response was identical every time. The LLM processed twelve identical blobs of HTML and produced twelve summaries of the same content.

That is twelve HTTP calls. Twelve prompt expansions. Twelve sets of output tokens. Eleven of them were waste.

The root cause is that LLMs do not track their own tool call history at the argument level. The model knows what it called, but it does not know that call number seven returned the same bytes as call number one. It has no memoization layer. Each call goes through as if it were new.

The fix is not complicated. Cache the result the first time. Return the cached result on subsequent calls with identical arguments. Standard memoization, applied to LLM tool calls.
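Before reaching for a library, here is the idea in plain Python. This is a minimal sketch, not the library's code: memoize is a hypothetical helper, fetch_url is a stand-in that never touches the network, and the key assumes hashable arguments.

```python
import functools

call_count = 0  # counts how many times the "real" tool body runs


def memoize(fn):
    """Cache results keyed on the positional and keyword arguments."""
    results = {}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Sorting the keyword items makes the key independent of their order.
        key = (args, tuple(sorted(kwargs.items())))
        if key not in results:
            results[key] = fn(*args, **kwargs)
        return results[key]

    return wrapper


@memoize
def fetch_url(url: str) -> str:
    # Stand-in for a real HTTP request; no network access happens here.
    global call_count
    call_count += 1
    return f"<html>contents of {url}</html>"


pages = [fetch_url("https://docs.example.com/api") for _ in range(12)]
print(call_count)  # 1
```

Twelve calls, one execution of the tool body. Everything else in this post is about doing the same thing with bounded size, expiry, and persistence.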

tool-call-cache does exactly that.

The shape of the fix

Install the library:

pip install tool-call-cache

Wrap the tool function with @cacheable:

import requests

from tool_call_cache import ToolCallCache, cacheable

cache = ToolCallCache(max_size=256, ttl_seconds=3600)

@cacheable(cache)
def fetch_url(url: str) -> str:
    # real HTTP call here
    return requests.get(url).text

Now the agent can call fetch_url("https://docs.example.com/api") twelve times. Only the first call hits the network. The other eleven are served from the cache, as long as the entry is still within its one-hour TTL and has not been evicted.
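The max_size and ttl_seconds arguments point at two separate limits: how many entries are kept, and how long each one stays valid. I am not showing the library's internals here. The sketch below is one common way to combine the two, under my own assumptions: TinyTTLCache is a hypothetical class, and the clock is injectable so expiry can be shown without sleeping.

```python
import time
from collections import OrderedDict


class TinyTTLCache:
    """Hypothetical sketch of an LRU cache with per-entry expiry."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # key -> (stored_at, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired: treat as a miss
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._entries[key] = (self.clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used


now = [0.0]  # fake clock, advanced by hand
cache = TinyTTLCache(max_size=2, ttl_seconds=3600, clock=lambda: now[0])
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")             # touch "a" so "b" becomes the eviction candidate
cache.set("c", 3)          # over capacity: "b" is evicted
evicted = cache.get("b")   # None
fresh = cache.get("a")     # 1
now[0] = 3600.0            # advance the fake clock to the TTL boundary
expired = cache.get("a")   # None
```

Size eviction and time expiry are independent. An entry can disappear because the cache is full even though its TTL has not run out, which is worth remembering when picking max_size for a long-running agent.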

For persistence across agent runs, swap in a JsonFileStore:

from tool_call_cache import ToolCallCache, JsonFileStore, cacheable

store = JsonFileStore(path="/tmp/tool-call-cache.json")
cache = ToolCallCache(max_size=256, ttl_seconds=3600, store=store)

@cacheable(cache)
async def search_docs(query: str, section: str = "all") -> list[dict]:
    # real search call here; `api` stands in for your own async search client
    return await api.search(query=query, section=section)

The @cacheable decorator works on both sync and async functions. The cache key is a SHA-256 hash of the function name together with the canonical JSON representation of the arguments.
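If you are wondering how one decorator can cover both cases, the usual trick is to check whether the wrapped function is a coroutine function and return a matching wrapper. This is a sketch of that general technique, not the library's source: memoize_any is a hypothetical name, and the key is a simple tuple rather than the hashed key described above.

```python
import asyncio
import functools
import inspect


def memoize_any(fn):
    """Hypothetical decorator that caches both sync and async callables."""
    results = {}

    def key_for(args, kwargs):
        return (args, tuple(sorted(kwargs.items())))

    if inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            key = key_for(args, kwargs)
            if key not in results:
                # Cache the awaited value, not the coroutine object,
                # because a coroutine can only be awaited once.
                results[key] = await fn(*args, **kwargs)
            return results[key]
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        key = key_for(args, kwargs)
        if key not in results:
            results[key] = fn(*args, **kwargs)
        return results[key]
    return sync_wrapper


calls = {"sync": 0, "async": 0}


@memoize_any
def add(a, b):
    calls["sync"] += 1
    return a + b


@memoize_any
async def search_docs(query, section="all"):
    calls["async"] += 1
    await asyncio.sleep(0)  # stand-in for real async I/O
    return [{"query": query, "section": section}]


async def main():
    first = await search_docs("asyncio", section="api")
    second = await search_docs("asyncio", section="api")
    return first, second


add(1, 2)
add(1, 2)
first, second = asyncio.run(main())
print(calls)  # {'sync': 1, 'async': 1}
```

The async wrapper stays a coroutine function, so callers still await it exactly as before. Note that this sketch does not deduplicate two concurrent calls that start before the first one finishes; both would run the tool.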

You can also call the cache directly without the decorator:

# fetch_url_impl is the undecorated tool function
args = {"url": "https://docs.example.com/api"}
result = cache.get("fetch_url", args)
if result is None:
    result = fetch_url_impl(**args)
    cache.set("fetch_url", args, result)

What it does NOT do

Before going further, here is what the library intentionally skips:

No semantic similarity matching. search("python asyncio") and search("asyncio python") hash to different keys. The library does not call an LLM to decide if two queries are "basically the same." That would require a network call to cache a network call, which defeats the purpose.
No cross-function deduplication. fetch_url and get_page are separate namespaces. If two tools return the same data by different paths, the cache does not know that.
No distributed cache backend. The built-in backends are in-process LRUCache and a single-machine JsonFileStore. Redis, Memcached, and similar are out of scope for this library.
No result validation. The cache stores and returns whatever the tool returned the first time. If the first call produced an error response, the error gets cached too. Callers should validate results before caching or use the direct API to skip caching on error.
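The last point deserves a concrete shape. Here is a sketch of "validate before caching" using a plain dict in place of the cache. Everything in it is an assumption for illustration: flaky_fetch is a hypothetical tool that fails once, and is_cacheable encodes a rule you would replace with your own.

```python
results = {}   # plain dict standing in for a cache
attempts = []  # records every time the real tool runs


def flaky_fetch(url):
    # Hypothetical tool: fails on the first attempt, succeeds afterwards.
    attempts.append(url)
    if len(attempts) == 1:
        return {"status": 503, "body": ""}
    return {"status": 200, "body": "<html>docs</html>"}


def is_cacheable(response):
    # Assumed validation rule: only successful responses are worth keeping.
    return response["status"] == 200


def fetch_with_validation(url):
    if url in results:
        return results[url]
    response = flaky_fetch(url)
    if is_cacheable(response):
        results[url] = response
    return response


r1 = fetch_with_validation("https://docs.example.com/api")  # 503, not cached
r2 = fetch_with_validation("https://docs.example.com/api")  # 200, cached
r3 = fetch_with_validation("https://docs.example.com/api")  # served from cache
print(len(attempts))  # 2
```

The error response reaches the caller but never lands in the cache, so the next call gets a fresh attempt instead of a replayed failure.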

Inside the lib: canonical JSON keying

The cache key is computed like this:

import hashlib
import json

def make_key(fn_name: str, kwargs: dict) -> str:
    canonical = json.dumps(kwargs, sort_keys=True, ensure_ascii=False)
    payload = f"{fn_name}:{canonical}"
    return hashlib.sha256(payload.encode()).hexdigest()

sort_keys=True is the important part. It means:

make_key("search", {"query": "python", "limit": 10})
make_key("search", {"limit": 10, "query": "python"})

Both produce the same hash. Argument key order does not matter. Argument values do.

This matches how LLM tool calls behave in practice. The model sometimes reorders keyword arguments. The canonical representation absorbs that variation without any special logic.

What does NOT hash to the same key:

make_key("search", {"query": "python asyncio"})
make_key("search", {"query": "asyncio python"})

These are different strings. They get different hashes. The library treats them as different calls. Semantic equivalence is a research problem, not a caching problem.
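You can check these properties yourself. The snippet below repeats the make_key function shown above and compares a few pairs. The last two comparisons are my own additions: a number versus the same number as a string, and the same arguments under two tool names.

```python
import hashlib
import json


def make_key(fn_name: str, kwargs: dict) -> str:
    # Same construction as the function shown above.
    canonical = json.dumps(kwargs, sort_keys=True, ensure_ascii=False)
    payload = f"{fn_name}:{canonical}"
    return hashlib.sha256(payload.encode()).hexdigest()


reordered_match = (
    make_key("search", {"query": "python", "limit": 10})
    == make_key("search", {"limit": 10, "query": "python"})
)
different_words = (
    make_key("search", {"query": "python asyncio"})
    == make_key("search", {"query": "asyncio python"})
)
different_types = (
    make_key("search", {"limit": 10}) == make_key("search", {"limit": "10"})
)
different_tools = (
    make_key("fetch_url", {"url": "https://docs.example.com/api"})
    == make_key("get_page", {"url": "https://docs.example.com/api"})
)

print(reordered_match, different_words, different_types, different_tools)
# True False False False
```

The third result is the one that tends to surprise people. With this keying scheme, limit=10 and limit="10" are different calls, so a model that flips between the two will miss the cache unless you normalize types before the lookup.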

When this is useful

The library earns its keep in a few specific patterns:

Repeated tool calls within a conversation. Multi-turn agents re-fetch context they already have. A research agent that checks the same documentation page at each reasoning step is a common example.

Tool calls across agent runs. With JsonFileStore, the cache survives process restarts. An agent that resumes a long job does not re-fetch pages it already retrieved in a previous session.

Expensive but stable tools. Database queries, API calls to third-party services, file reads on large documents. If the data does not change between calls, there is no reason to fetch it twice.

Testing and development. Cache real API responses locally. Run the agent against cached results without hitting rate limits or paying for API calls during debugging.
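The cross-run pattern above comes down to one thing: the cache contents live in a file that the next process can read. A minimal sketch of that idea follows. TinyJsonStore is a hypothetical class written for this post, not the library's JsonFileStore, and it assumes every cached value is JSON-serializable.

```python
import json
import os
import tempfile


class TinyJsonStore:
    """Hypothetical file-backed store; not the library's JsonFileStore."""

    def __init__(self, path):
        self.path = path
        self._data = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self._data = json.load(f)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value
        # Write to a temp file and rename, so a crash mid-write
        # does not leave a half-written cache file behind.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(self._data, f)
        os.replace(tmp_path, self.path)


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "cache.json")

first_run = TinyJsonStore(path)
first_run.set("fetch_url:docs", "<html>docs</html>")

# A new object reading the same file simulates a process restart.
second_run = TinyJsonStore(path)
restored = second_run.get("fetch_url:docs")
print(restored)  # <html>docs</html>
```

A JSON round trip changes some types, for example tuples come back as lists. If a tool's callers depend on the exact return type, that is worth testing before turning on persistence.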

When NOT to use it

Some tool calls should not be cached:

Side-effecting tools. send_email, post_message, write_file. Caching these would silently skip the action on the second call.
Time-sensitive data. Stock prices, live sensor readings, anything where staleness matters. Set TTL to zero or do not use the cache for these tools.
Tools where identical args produce intentionally different results. generate_uuid(), get_current_time(), random sampling. The cache would return the same value every time, which breaks the contract.

The library does not try to detect side effects. That judgment belongs to the caller. Wrap only the tools that are safe to memoize.
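One way to keep that judgment explicit is an allowlist at the point where tools are registered. The sketch below uses a hypothetical memoize helper and two stand-in tools to show the difference in behavior; none of these names come from the library.

```python
import functools

log = []  # records every time a real tool body runs


def memoize(fn):
    """Hypothetical keyword-only memoizer used for this illustration."""
    results = {}

    @functools.wraps(fn)
    def wrapper(**kwargs):
        key = tuple(sorted(kwargs.items()))
        if key not in results:
            results[key] = fn(**kwargs)
        return results[key]

    return wrapper


def fetch_url(url):
    log.append(("fetch_url", url))
    return f"contents of {url}"


def send_email(to, body):
    # Stand-in for a side-effecting tool; nothing is actually sent.
    log.append(("send_email", to))
    return "sent"


# Explicit allowlist: the caller decides which tools are safe to memoize.
SAFE_TO_CACHE = {"fetch_url"}

raw_tools = {"fetch_url": fetch_url, "send_email": send_email}
tools = {
    name: memoize(fn) if name in SAFE_TO_CACHE else fn
    for name, fn in raw_tools.items()
}

for _ in range(3):
    tools["fetch_url"](url="https://docs.example.com/api")
    tools["send_email"](to="team@example.com", body="report ready")

fetches = sum(1 for name, _ in log if name == "fetch_url")
emails = sum(1 for name, _ in log if name == "send_email")
print(fetches, emails)  # 1 3
```

The read-only tool runs once and the side-effecting tool runs every time. Defaulting to "not cached" means a new tool has to be added to the allowlist on purpose.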

Install

pip install tool-call-cache

Zero runtime dependencies. Python 3.9 and up. 54 tests.

The library has no opinion about how you run your agent, which LLM SDK you use, or how you structure your tool functions. It works on any callable. Wrap it with @cacheable or call the cache API directly.

Source: MukundaKatta/tool-call-cache

Siblings

Lib	Boundary	Repo
tool-call-budgets	Per-tool call-count cap, stops runaway loops	MukundaKatta/tool-call-budgets
llm-message-hash-py	Same canonical hashing idea applied to full LLM requests	MukundaKatta/llm-message-hash-py
tool-result-cache	Similar memoization, result-oriented API surface	MukundaKatta/tool-result-cache
agent-resume	Checkpoint and resume long runs, complements persistent caching	MukundaKatta/agent-resume

What is next

The library covers the basic memoization case well. A few things on the list:

Cache invalidation hooks. A callback that lets the caller decide at runtime whether to bypass or evict a cached entry, without having to wrap the decorator.
Async file store. JsonFileStore uses blocking I/O. An async-native store would avoid blocking the event loop in async agents.
Hit/miss metrics. A simple counter surface so callers can measure cache effectiveness without instrumenting the decorator themselves.

If any of those would be useful for your agent stack, open an issue or PR on the repo.

This is part of the Hermes Agent Challenge, a sprint to build and ship practical agent infrastructure libraries. The goal is a library per day covering the gaps between LLM SDK calls and production-ready agent behavior. Each library is small, focused, and ships with a full test suite.
