Mukunda Rao Katta

Posted on May 25

LRU Memoization for Agent Tool Calls with tool-result-cache

#hermeschallenge #ai #python #agents

The bill that showed up at the end of the session

I ran a research agent that used a geocoding tool. The tool called a paid geocoding API and returned lat/lng for an address string. Cost: $0.005 per call.

The agent ran for a session. At the end, I pulled the tool call log and counted.

800 calls to the geocoding tool. 40 unique addresses. That is 20 calls per address on average.

760 of those calls were duplicates. The agent had re-geocoded addresses it had already resolved earlier in the same session. Same string in, same coordinates out, 760 times over.

Cost from unique work: $0.20. Cost from duplicate calls: $3.80. Four dollars total, $3.80 of it wasted.
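The same arithmetic as a quick check. The numbers come straight from the session log above; the variable names are only for illustration.

```python
# Session numbers from the tool call log described above.
total_calls = 800
unique_addresses = 40
cost_per_call = 0.005  # dollars per geocoding call

duplicate_calls = total_calls - unique_addresses
unique_cost = unique_addresses * cost_per_call
wasted_cost = duplicate_calls * cost_per_call
wasted_share = wasted_cost / (total_calls * cost_per_call)

print(duplicate_calls)         # 760
print(round(unique_cost, 2))   # 0.2
print(round(wasted_cost, 2))   # 3.8
print(round(wasted_share, 2))  # 0.95
```

So 95% of the spend bought results the agent already had.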

The fix is not complicated. Memoize the tool function. Return the cached result when the same args come in again. Do not call the API a second time.

tool-result-cache is that memoizer.

The shape of the fix

Install the library:

pip install tool-result-cache

Wrap the tool function with @cache_result:

from tool_result_cache import ToolResultCache, cache_result

cache = ToolResultCache(max_size=256)

@cache_result(cache)
def geocode(address: str) -> dict:
    # real API call here
    return geocoding_api.lookup(address)

Now the agent can call geocode("1600 Pennsylvania Ave NW, Washington DC") 20 times. The first call hits the API. The other 19 return from cache instantly.

With an optional TTL:

cache = ToolResultCache(max_size=512, ttl_seconds=3600)

@cache_result(cache)
def geocode(address: str) -> dict:
    return geocoding_api.lookup(address)

Entries older than one hour are treated as misses and re-fetched. Useful for data that changes on a known schedule.
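To make the max_size and ttl_seconds behavior concrete, here is a minimal sketch of an LRU store with a TTL. This is not the library's source. The TinyLRU class, its clock parameter, and the method bodies are assumptions written for illustration; only the general idea (evict the least recently used entry, treat expired entries as misses) comes from the description above.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TinyLRU:
    """Illustrative LRU + TTL store. Hypothetical, NOT tool-result-cache itself."""

    def __init__(self, max_size: int, ttl_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._clock = clock              # injectable so expiry can be tested
        self._entries = OrderedDict()    # key -> (stored_at, value)

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return None                  # miss
        stored_at, value = entry
        if self.ttl_seconds is not None and self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]       # expired: treat as a miss
            return None
        self._entries.move_to_end(key)   # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used


now = [0.0]                               # fake clock, in seconds
store = TinyLRU(max_size=2, ttl_seconds=3600, clock=lambda: now[0])
store.set("a", 1)
store.set("b", 2)
store.get("a")                            # touch "a"; "b" is now the oldest
store.set("c", 3)                         # over capacity, evicts "b"
print(store.get("b"))                     # None
now[0] = 3601.0                           # just over one hour later
print(store.get("a"))                     # None
```

The fake clock is there only so the expiry path can be exercised without sleeping for an hour.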

You can also call the cache directly without the decorator:

address = "1600 Pennsylvania Ave NW"
args = {"address": address}

result = cache.get(args)
if result is None:  # miss or expired entry: call the API and store the result
    result = geocoding_api.lookup(address)
    cache.set(args, result)

The direct API is useful when you need to skip the cache conditionally or when the tool function is not under your control.

What it does NOT do

Before going further, here is what the library intentionally skips:

No semantic similarity matching. geocode("1600 Pennsylvania Ave") and geocode("1600 Pennsylvania Avenue") hash to different keys. The library does not try to decide if two argument strings are "close enough." That would require another model call to cache a tool call, which is backwards.
No cross-session persistence. The cache is in-memory. When the process exits, cached results are gone. If you need results to survive restarts, pair this with agent-resume or write your own persistence layer around the direct API.
No distributed backend. This is a single-process LRU cache. Redis, Memcached, and similar are out of scope.
No side-effect detection. The library does not know whether your tool has side effects. It is the caller's job to only wrap tools that are safe to memoize. Do not wrap send_email or write_file with this decorator.

Inside the lib: SHA-256 canonical key design

The cache key is computed from the JSON-serialized, key-sorted args dict:

import json
import hashlib

def make_key(kwargs: dict) -> str:
    canonical = json.dumps(kwargs, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode()).hexdigest()

sort_keys=True is the load-bearing part. It means:

make_key({"address": "1600 Pennsylvania Ave", "country": "US"})
make_key({"country": "US", "address": "1600 Pennsylvania Ave"})

Both produce the same hash. Argument insertion order does not affect the key. The values are what matter.

This handles a real quirk of LLM tool calls. The model sometimes generates keyword arguments in different orders across turns. The canonical representation absorbs that variation without any special handling.

What does NOT hash to the same key:

make_key({"address": "1600 Pennsylvania Ave"})
make_key({"address": "1600 Pennsylvania Avenue"})

Different strings, different hashes, treated as different calls. The library does structural identity, not semantic similarity. That boundary is intentional and important. Crossing it would mean the cache can produce incorrect results if the similarity judgment is wrong.
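Both properties are easy to check by running the make_key function from above. The third dict and the variable names a, b, and c are mine, added for the comparison.

```python
import hashlib
import json


def make_key(kwargs: dict) -> str:
    # Same function as shown earlier in this section.
    canonical = json.dumps(kwargs, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode()).hexdigest()


a = make_key({"address": "1600 Pennsylvania Ave", "country": "US"})
b = make_key({"country": "US", "address": "1600 Pennsylvania Ave"})
c = make_key({"address": "1600 Pennsylvania Avenue", "country": "US"})

print(a == b)  # True: same args, different insertion order
print(a == c)  # False: "Ave" and "Avenue" are different strings
print(len(a))  # 64 hex characters, the SHA-256 digest
```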

When this is useful

The library earns its keep in a few specific patterns:

High-cost external APIs with stable data. Geocoding, currency conversion, taxonomy lookups, anything billed per call where the underlying data does not change between agent turns. Cache on the first call, skip the API on every repeat.

Multi-turn conversations with context re-fetching. Agents tend to re-establish context at each reasoning step. A tool that retrieves a customer record or a product description may get called once per turn even though the record has not changed. A result cache cuts that to one call per unique set of args per session.

Development and testing. Wrap the real API tool with @cache_result during development. The first test run hits the live API and populates the cache. Subsequent runs return cached results without burning rate limit or budget. Useful when you are iterating quickly on prompt logic and the tool outputs are not what you are testing.

Reducing latency in multi-agent setups. When multiple agents share a process and can call the same tools with the same args, a shared ToolResultCache instance prevents redundant parallel calls to the same endpoint.

When NOT to use it

Some tool calls should not be cached:

Side-effecting tools. send_notification, charge_card, post_to_slack. The second call to a cached side-effecting tool would silently do nothing. The agent would think the action happened. It did not.

Time-sensitive data. Live stock prices, current queue depths, real-time availability. Use TTL to bound staleness if you need any caching here at all, or skip the cache entirely.

Tools where identical args should produce different results. Random sampling, UUID generation, "get current time." The cache would return the same value on every call, which defeats the purpose of the tool.

Stateful tools. Tools that depend on server-side session state or that perform streaming reads. The first result may not represent what a second call would return.

The library has no way to detect any of these cases automatically. That judgment belongs to the code that wires up the agent. Wrap only the tools where same-args-same-result is the correct contract.
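One way to keep that judgment explicit is an allowlist at the point where tools are registered. The sketch below is an assumption about how you might wire it, not something the library provides: register_tools is a hypothetical helper, and memoize is a plain stand-in for @cache_result(cache) so the example runs on its own.

```python
from functools import wraps
from typing import Any, Callable, Dict, Set


def memoize(fn: Callable) -> Callable:
    """Stand-in for @cache_result(cache); hypothetical, for illustration only."""
    results: Dict[str, Any] = {}

    @wraps(fn)
    def wrapper(**kwargs):
        key = repr(sorted(kwargs.items()))  # order-independent key
        if key not in results:
            results[key] = fn(**kwargs)
        return results[key]

    return wrapper


def register_tools(tools: Dict[str, Callable], cacheable: Set[str]) -> Dict[str, Callable]:
    """Wrap only the tools explicitly listed as safe to memoize."""
    return {name: memoize(fn) if name in cacheable else fn
            for name, fn in tools.items()}


calls = {"geocode": 0, "send_notification": 0}


def geocode(address: str) -> dict:
    calls["geocode"] += 1
    return {"lat": 0.0, "lng": 0.0}  # placeholder coordinates


def send_notification(message: str) -> str:
    calls["send_notification"] += 1  # side effect: must run every time
    return "sent"


tools = register_tools(
    {"geocode": geocode, "send_notification": send_notification},
    cacheable={"geocode"},
)
for _ in range(3):
    tools["geocode"](address="1600 Pennsylvania Ave")
    tools["send_notification"](message="done")

print(calls)  # {'geocode': 1, 'send_notification': 3}
```

The default here is "not cached." A tool has to be named in the allowlist to get memoized, which is the safer direction to fail in.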

Install

pip install tool-result-cache

Zero runtime dependencies. Python 3.9 and up. 25 tests.

The library has no opinion about your LLM SDK, your agent framework, or how you define tool functions. It works on any callable that takes JSON-serializable keyword arguments (the cache key is built with json.dumps) and returns a JSON-serializable value. Wrap it with @cache_result or use the direct ToolResultCache API.

Source: MukundaKatta/tool-result-cache

Siblings

Lib	Boundary	Repo
tool-call-cache	Similar idea, intercepts at the LLM call layer rather than the tool function layer	MukundaKatta/tool-call-cache
tool-result-cache-rs	Rust port of this library, same 25 tests, LRU plus pure-Rust SHA-256	MukundaKatta/tool-result-cache-rs
llm-message-hash-py	Canonical JSON hashing they share, applied to full LLM request structs	MukundaKatta/llm-message-hash-py
agent-resume	Persistence across runs, a different concern, complements in-session caching	MukundaKatta/agent-resume

What is next

A few things worth adding once the basics are solid:

Hit and miss counters. A simple stats surface so callers can see cache effectiveness without adding their own instrumentation around the decorator. Knowing your hit rate is useful when deciding whether to increase max_size.

Eviction hooks. A callback that fires when an entry is evicted from the LRU. Useful for logging or for writing evicted entries to an external store before they disappear.

Async decorator support. The current @cache_result decorator works on sync functions. An async-native variant would avoid awkward wrapping in asyncio agent loops.
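Until those land, the counters and the async case can be approximated on the caller side. This is a rough sketch under my own assumptions: async_memoize and its stats attribute are hypothetical names, the cache is a plain unbounded dict rather than the library's LRU, and nothing here reflects a planned API.

```python
import asyncio
from functools import wraps


def async_memoize(fn):
    """Hypothetical caller-side helper; not part of tool-result-cache."""
    results = {}
    stats = {"hits": 0, "misses": 0}

    @wraps(fn)
    async def wrapper(**kwargs):
        key = repr(sorted(kwargs.items()))  # order-independent key
        if key in results:
            stats["hits"] += 1
        else:
            stats["misses"] += 1
            results[key] = await fn(**kwargs)
        return results[key]

    wrapper.stats = stats  # expose counters to the caller
    return wrapper


@async_memoize
async def geocode(address: str) -> dict:
    await asyncio.sleep(0)             # stands in for the network round trip
    return {"lat": 0.0, "lng": 0.0}    # placeholder coordinates


async def main() -> dict:
    for _ in range(20):
        await geocode(address="1600 Pennsylvania Ave NW, Washington DC")
    return geocode.stats


print(asyncio.run(main()))  # {'hits': 19, 'misses': 1}
```

One limitation worth naming: this sketch does not deduplicate calls that are in flight at the same time. Two concurrent awaits on the same args would both miss and both hit the API.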

If any of those would be useful for your agent stack, open an issue or PR on the repo.

This is part of the Hermes Agent Challenge, a sprint to build and ship practical agent infrastructure libraries. The goal is a library per day covering the gaps between LLM SDK calls and production-ready agent behavior. Each library is small, focused, and ships with a full test suite.

DEV Community