Mukunda Rao Katta

Posted on May 25

When to Move an Agent Library From Python to Rust

#hermeschallenge #ai #python #agents

Most of Your Agent is Not the Bottleneck

When Python agent developers think about performance, they often look at the wrong thing. The LLM call takes 800ms to 3 seconds. The tool calls take 50ms to 500ms each. Your Python orchestration code takes microseconds. Rewriting that in Rust does not move the needle.
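To make the proportions concrete, here is a rough back-of-the-envelope calculation. The LLM and tool latencies are the low ends of the ranges quoted above; the three tool calls per request and the 200 microseconds of orchestration are assumptions picked purely for illustration.

```python
# Rough latency budget for one agent request (illustrative numbers only).
llm_call_ms = 800.0          # low end of the 800 ms to 3 s range above
tool_call_ms = 50.0          # low end of the 50 ms to 500 ms range above
tool_calls_per_request = 3   # assumption for this example
orchestration_ms = 0.2       # assumption: 200 microseconds of Python glue code

total_ms = llm_call_ms + tool_call_ms * tool_calls_per_request + orchestration_ms
orchestration_share = orchestration_ms / total_ms

print(f"total: {total_ms:.1f} ms")
print(f"orchestration share: {orchestration_share:.4%}")
```

Even with the fastest LLM and tool latencies in those ranges, the orchestration share stays well under a tenth of a percent, so making that code 10x faster would not be visible in the request time.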

But there are three specific cases where Python becomes the actual constraint. In those cases, a Rust port exposed to Python through PyO3 bindings (typically built and packaged with Maturin) changes the numbers.

This post walks through three real migrations from the sprint, shows what changed, and explains the decision criteria you should use before committing to a Rust port.

Signal 1: Tool Cache Lookup in a High-QPS Service

The Python version

# tool_result_cache.py
import hashlib
import json
from collections import OrderedDict
from threading import Lock
from time import monotonic

class ToolResultCache:
    """Thread-safe LRU cache with a per-entry TTL for tool results."""

    def __init__(self, max_size=512, ttl_seconds=300):
        self._cache = OrderedDict()
        self._lock = Lock()
        self._max = max_size
        self._ttl = ttl_seconds

    def _key(self, tool, args):
        # sort_keys makes the key independent of dict ordering in args
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, tool, args):
        key = self._key(tool, args)
        with self._lock:
            entry = self._cache.get(key)
            if entry is None:
                return None
            # monotonic() is immune to wall-clock adjustments
            if monotonic() - entry["ts"] > self._ttl:
                del self._cache[key]  # expired
                return None
            self._cache.move_to_end(key)  # mark as most recently used
            return entry["value"]

    def set(self, tool, args, value):
        key = self._key(tool, args)
        with self._lock:
            # Evict the least recently used entry only when adding a new key
            if key not in self._cache and len(self._cache) >= self._max:
                self._cache.popitem(last=False)
            self._cache[key] = {"ts": monotonic(), "value": value}
            self._cache.move_to_end(key)

At 100 RPS with 10 concurrent agents, the GIL becomes the problem. Each cache lookup serializes the arguments to JSON, runs SHA-256 over the result, and then acquires the lock. For small payloads, all of that work holds the GIL. Under load, threads pile up waiting on the GIL and on the lock. Latency at p99 climbs from 2ms to 40ms for cache hits that should be near-instant.
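One way to check whether contention is what you are seeing is to time the same operation from several threads and look at the tail. The sketch below is a minimal closed-loop harness, not a reproduction of the 100 RPS setup: `measure_p99_ms` and `cache_hit` are hypothetical names, and the lock-guarded dict is only a stand-in for a real cache.

```python
import threading
import time
from statistics import quantiles

def measure_p99_ms(operation, n_threads=10, calls_per_thread=200):
    """Run `operation` from several threads and return the p99 latency in ms."""
    latencies = []
    latencies_lock = threading.Lock()
    start_gate = threading.Barrier(n_threads)

    def worker():
        local = []
        start_gate.wait()  # release all threads together to maximize contention
        for _ in range(calls_per_thread):
            t0 = time.perf_counter()
            operation()
            local.append((time.perf_counter() - t0) * 1000.0)
        with latencies_lock:
            latencies.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    return quantiles(latencies, n=100)[-1]

# Hypothetical stand-in for a cache hit: a lock-guarded dict lookup.
_lock = threading.Lock()
_store = {"k": "v"}

def cache_hit():
    with _lock:
        return _store.get("k")

p99 = measure_p99_ms(cache_hit)
print(f"p99: {p99:.3f} ms")
```

Passing `cache.get` from the Python and Rust caches as `operation` (wrapped in a lambda with fixed arguments) gives a like-for-like tail comparison on your own hardware.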

The Rust version (same API shape)

# With tool-result-cache-rs installed:
from tool_result_cache_rs import ToolResultCache

cache = ToolResultCache(max_size=512, ttl_seconds=300)
result = cache.get("search", {"query": "python asyncio"})
if result is None:
    result = search_tool(query="python asyncio")
    cache.set("search", {"query": "python asyncio"}, result)

The Rust implementation uses DashMap (a concurrent hashmap with fine-grained shard locking) and releases the GIL during lock acquisition. At 100 RPS, p99 drops back to under 3ms. The API is identical. Migration is one import swap and a pip install.
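If you want the swap to be reversible, a small import fallback keeps the Python version as a safety net on platforms where the Rust wheel is not available. This is a sketch: `load_first_available` is a hypothetical helper, and it assumes the Python package is importable as `tool_result_cache` (the file name shown earlier).

```python
import importlib

def load_first_available(module_names):
    """Return the first module in `module_names` that imports cleanly."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {module_names!r} could be imported")

# Intended use (module names taken from this post, import names assumed):
#   cache_mod = load_first_available(["tool_result_cache_rs", "tool_result_cache"])
#   cache = cache_mod.ToolResultCache(max_size=512, ttl_seconds=300)

# Demonstration with a stdlib module so the snippet runs anywhere.
mod = load_first_available(["module_that_does_not_exist_xyz", "json"])
print(mod.__name__)  # prints "json"
```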

Signal 2: Arg Validation in the Hot Path

The Python version (agentvet)

from agentvet import ArgValidator

validator = ArgValidator(schema={
    "query": {"type": "string", "min_length": 1, "max_length": 500},
    "top_k": {"type": "integer", "min": 1, "max": 100},
})

def validated_search(query, top_k=10):
    errors = validator.validate({"query": query, "top_k": top_k})
    if errors:
        raise ValueError(errors)
    return search_tool(query=query, top_k=top_k)

This is fine for low-QPS use. At 500 RPS, the validator runs on every tool call, and the Python dict traversal and regex compilation (for string pattern matching) add up. On a profiler trace, validation accounts for 8% of total request time.

The Rust version (agentvet-rs)

from agentvet_rs import ArgValidator

# Same constructor, same validate() call, no code change needed
validator = ArgValidator(schema={...})
errors = validator.validate({"query": "...", "top_k": 5})

The Rust validator pre-compiles regex patterns at ArgValidator construction time. Validation runs in a tight Rust loop without GIL pressure. At 500 RPS, that 8% drops to under 0.5%.
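The pre-compilation idea is language-independent, so it can be illustrated in Python. The class below is a sketch, not the agentvet or agentvet-rs implementation: the `type`, `min_length`, `max_length`, `min`, and `max` keys come from the schema above, while the `pattern` key and the class name `PrecompiledValidator` are assumptions.

```python
import re

class PrecompiledValidator:
    """Illustrative validator; not the agentvet or agentvet-rs implementation."""

    def __init__(self, schema):
        self._schema = schema
        # Compile every pattern once, at construction time.
        # The "pattern" key is an assumption made for this sketch.
        self._patterns = {
            field: re.compile(rules["pattern"])
            for field, rules in schema.items()
            if "pattern" in rules
        }

    def validate(self, args):
        errors = []
        for field, rules in self._schema.items():
            value = args.get(field)
            if rules["type"] == "string":
                if not isinstance(value, str):
                    errors.append(f"{field}: expected string")
                    continue
                min_len = rules.get("min_length", 0)
                max_len = rules.get("max_length", float("inf"))
                if not min_len <= len(value) <= max_len:
                    errors.append(f"{field}: length out of range")
                pattern = self._patterns.get(field)
                if pattern is not None and pattern.fullmatch(value) is None:
                    errors.append(f"{field}: does not match pattern")
            elif rules["type"] == "integer":
                # bool is a subclass of int, so reject it explicitly
                if not isinstance(value, int) or isinstance(value, bool):
                    errors.append(f"{field}: expected integer")
                    continue
                low = rules.get("min", value)
                high = rules.get("max", value)
                if not low <= value <= high:
                    errors.append(f"{field}: value out of range")
        return errors

validator = PrecompiledValidator({
    "query": {"type": "string", "min_length": 1, "max_length": 500,
              "pattern": r"[\w\s]+"},
    "top_k": {"type": "integer", "min": 1, "max": 100},
})
print(validator.validate({"query": "python asyncio", "top_k": 10}))
print(validator.validate({"query": "", "top_k": 0}))
```

Python's `re` module already keeps an internal cache of compiled patterns, so in pure Python the saving from compiling up front is the per-call cache lookup rather than a full recompile. The larger win in a Rust port comes from running the whole loop outside the interpreter.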

The rule: if a validation or computation step shows up on your profiler at more than 3-5% of request time, and it is called on every request, it is a Rust candidate.
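To apply that rule you need a per-step share of request time. A full profiler gives you this, but a coarse timer is often enough for a first pass. The sketch below uses hypothetical names (`StepTimer`, `is_hot`), a 3% default threshold taken from the low end of the range above, and stand-in workloads in place of real validation and tool calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StepTimer:
    """Accumulates wall-clock time per named step."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def step(self, name):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - t0

    def share(self, name):
        """Fraction of all recorded time spent in `name`."""
        total = sum(self.totals.values())
        return self.totals[name] / total if total else 0.0

def is_hot(timer, name, threshold=0.03):
    """True if the step exceeds the share threshold from the rule above."""
    return timer.share(name) > threshold

timer = StepTimer()
for _ in range(20):
    with timer.step("validate"):
        sum(range(1_000))        # stand-in for validation work
    with timer.step("tool_call"):
        time.sleep(0.002)        # stand-in for a 2 ms tool call

print(f"validate share: {timer.share('validate'):.2%}")
print("Rust candidate by this rule:", is_hot(timer, "validate"))
```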

Signal 3: A Native Desktop App Without a Python Runtime

This one is different. The motivation is not speed. It is dependency footprint.

If you are building a macOS or Windows native app that embeds agent logic, shipping a Python runtime adds 30-100MB to the installer. The alternative is to require that the user already has Python installed. Neither is great.

token-budget-pool started as a Python library. The Rust port (token-budget-pool crate) exposes the same logic as a native library with no runtime dependency. You link it into a Swift or C++ desktop app directly.

// Rust usage in a native app
use token_budget_pool::{BudgetPool, BudgetConfig};

let pool = BudgetPool::new(BudgetConfig {
    max_tokens: 100_000,
    max_usd: 5.0,
    window_seconds: 3600,
});

let permit = pool.acquire(estimated_tokens, estimated_usd)?;
// run the LLM call
permit.commit(actual_tokens, actual_usd);

From Python you would still use the PyO3 binding. But the key difference is that the core library has no Python dependency. The Python binding is additive.
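For readers who do not write Rust, the reserve-then-commit pattern in that snippet can be sketched in plain Python. This is an illustration of the pattern only, not the token-budget-pool API or its binding: the names `BudgetPool`, `acquire`, and `commit` mirror the Rust snippet, `BudgetExceeded` and `Permit` are assumptions, and the sketch leaves out the time window and any thread safety.

```python
class BudgetExceeded(Exception):
    """Raised when an estimate does not fit in the remaining budget."""

class Permit:
    def __init__(self, pool, tokens, usd):
        self._pool, self._tokens, self._usd = pool, tokens, usd
        self._committed = False

    def commit(self, actual_tokens, actual_usd):
        """Replace the reserved estimate with what the call actually used."""
        if self._committed:
            raise RuntimeError("permit already committed")
        self._pool._settle(self._tokens, self._usd, actual_tokens, actual_usd)
        self._committed = True

class BudgetPool:
    def __init__(self, max_tokens, max_usd):
        self.max_tokens, self.max_usd = max_tokens, max_usd
        self.used_tokens, self.used_usd = 0, 0.0

    def acquire(self, estimated_tokens, estimated_usd):
        """Reserve the estimate up front, or refuse if it would overshoot."""
        if (self.used_tokens + estimated_tokens > self.max_tokens
                or self.used_usd + estimated_usd > self.max_usd):
            raise BudgetExceeded("estimate does not fit in remaining budget")
        self.used_tokens += estimated_tokens
        self.used_usd += estimated_usd
        return Permit(self, estimated_tokens, estimated_usd)

    def _settle(self, est_tokens, est_usd, actual_tokens, actual_usd):
        # Swap the reservation for the real usage.
        self.used_tokens += actual_tokens - est_tokens
        self.used_usd += actual_usd - est_usd

pool = BudgetPool(max_tokens=1000, max_usd=5.0)
permit = pool.acquire(400, 1.0)   # reserve before the LLM call
permit.commit(300, 0.5)           # settle with actual usage afterwards
print(pool.used_tokens, pool.used_usd)
```

Reserving before the call means concurrent callers cannot collectively overshoot the budget while their requests are in flight. Committing afterwards returns any unused part of the estimate to the pool.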

What Migration Does NOT Solve

LLM call latency is the network round-trip to the model provider. Rust cannot help with that. If your p99 is 4 seconds, it is because the model is slow, not because your Python is slow.

I/O-bound tool calls are the same. A search tool that calls a remote API takes as long as that API takes. Rust will not speed up the API.

Algorithmic complexity does not disappear. If your Python cache has O(n) lookup because it uses a list instead of a hashmap, a Rust port of the same algorithm is still O(n). Fix the algorithm first in Python. Port to Rust only if profiling shows it is still a bottleneck.
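A quick way to see this is to compare a list scan with a dict lookup in Python before any port is on the table. The sketch below uses made-up data and hypothetical function names. Absolute timings depend on your machine, so none are quoted here.

```python
import timeit

n = 10_000
entries_list = [(f"key-{i}", i) for i in range(n)]   # O(n) scan per lookup
entries_dict = dict(entries_list)                    # O(1) average per lookup

def lookup_list(key):
    for k, v in entries_list:
        if k == key:
            return v
    return None

def lookup_dict(key):
    return entries_dict.get(key)

target = f"key-{n - 1}"   # worst case for the list scan
assert lookup_list(target) == lookup_dict(target)

list_s = timeit.timeit(lambda: lookup_list(target), number=200)
dict_s = timeit.timeit(lambda: lookup_dict(target), number=200)
print(f"list scan: {list_s:.4f} s, dict lookup: {dict_s:.6f} s")
```

A compiled language shrinks the constant factor of the list scan, but the scan still grows with `n`. The dict version does not, in either language.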

Memory layout bugs survive porting. Unsafe Rust is harder to get right than Python. If you are not sure what you are doing with ownership, stick to Python until you are.

Decision Criteria

Before porting to Rust, answer these questions:

1. Does a profiler show this code at more than 3% of request time? If no, stop. The effort is not worth it.
2. Is it GIL-bound? If the bottleneck is waiting for other threads, Rust with GIL release helps. If it is a single-threaded loop, Rust helps proportionally less.
3. Is the API stable? Porting then changing the API is costly. Wait until the Python version's API is settled.
4. Do you have good Python tests? The Rust port should pass the same test suite via the Python binding. If you do not have tests, write them in Python first.
5. Is this a native app constraint? If yes, that is a valid non-performance reason to port.

If you answer yes to 1 and either 2 or 5, you have a real candidate.
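The rule is small enough to write down as code, which can be handy in a review checklist. The function names are hypothetical, the 3% threshold is the one from the first question, and treating the third and fourth questions as prerequisites rather than signals is one reading of the list, not something it states outright.

```python
def is_rust_candidate(profile_share, gil_bound, native_app_constraint,
                      threshold=0.03):
    """Apply the rule above: yes to question 1 and to either 2 or 5."""
    hot = profile_share > threshold                        # question 1
    return hot and (gil_bound or native_app_constraint)    # questions 2 and 5

def ready_to_port(api_stable, has_python_tests):
    """Questions 3 and 4, read here as prerequisites (an interpretation)."""
    return api_stable and has_python_tests

# Validation at 8% of request time, GIL-bound, not a native app.
print(is_rust_candidate(0.08, gil_bound=True, native_app_constraint=False))
# Same code at 1% of request time: below the threshold, so not a candidate.
print(is_rust_candidate(0.01, gil_bound=True, native_app_constraint=False))
print(ready_to_port(api_stable=True, has_python_tests=False))
```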

Install or Quick-Start

Python versions:

pip install tool-result-cache  # Python LRU+TTL cache
pip install agentvet           # Python arg validator
pip install token-budget-pool  # Python budget enforcer

Rust crate versions (same APIs via PyO3):

pip install tool-result-cache-rs
pip install agentvet-rs
# token-budget-pool is a native Rust crate; use cargo for native apps

Siblings Table

Library	Language	When to use the Rust version
`tool-result-cache` / `tool-result-cache-rs`	Python / Rust+PyO3	100+ RPS with concurrent agents hitting the cache
`agentvet` / `agentvet-rs`	Python / Rust+PyO3	Arg validation on every tool call at high QPS
`token-budget-pool`	Rust (PyO3 binding for Python)	Native app without Python runtime, or budget enforcement at extreme QPS
`llm-retry` / `llm-retry-rs`	Python / Rust	Retry logic with backoff in a tight loop (rare: retries should be slow by design)
`agentguard` / `agentguard-rs`	Python / Rust+PyO3	Egress allowlist enforcement at gateway-level QPS

What is Next

The migration tooling is the next investment. Porting requires writing a PyO3 wrapper, testing it against the Python test suite, and publishing to both PyPI and crates.io. That is about 4-6 hours of work per library. The goal is to create a template that reduces that to 2 hours.
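Testing a port against the Python test suite is easiest when the tests take the implementation as a parameter. The sketch below shows that shape for the cache. `check_cache_contract` and `ReferenceCache` are hypothetical names, the stand-in class exists only so the snippet runs without either package installed, and the eviction check assumes the Rust port keeps LRU semantics.

```python
import json
from collections import OrderedDict

class ReferenceCache:
    """Tiny stand-in implementation; ttl_seconds is accepted but ignored."""

    def __init__(self, max_size=512, ttl_seconds=300):
        self._data = OrderedDict()
        self._max = max_size

    def _key(self, tool, args):
        return json.dumps({"tool": tool, "args": args}, sort_keys=True)

    def get(self, tool, args):
        key = self._key(tool, args)
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, tool, args, value):
        key = self._key(tool, args)
        if key not in self._data and len(self._data) >= self._max:
            self._data.popitem(last=False)
        self._data[key] = value
        self._data.move_to_end(key)

def check_cache_contract(make_cache):
    """Assertions any ToolResultCache implementation is expected to satisfy."""
    cache = make_cache(max_size=2, ttl_seconds=300)
    args = {"query": "python asyncio"}

    assert cache.get("search", args) is None          # miss before set
    cache.set("search", args, "result-1")
    assert cache.get("search", args) == "result-1"    # hit after set

    # Key order inside args must not matter.
    cache.set("t", {"a": 1, "b": 2}, "x")
    assert cache.get("t", {"b": 2, "a": 1}) == "x"

    # A third distinct entry evicts the least recently used one ("search").
    cache.set("other", {}, "y")
    assert cache.get("search", args) is None

# Swap in tool_result_cache.ToolResultCache and
# tool_result_cache_rs.ToolResultCache to run the same checks on both.
check_cache_contract(ReferenceCache)
print("contract ok")
```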

A benchmark suite that runs both the Python and Rust versions against the same workload and outputs a comparison table is also planned. Right now, the before/after numbers above are from manual profiling. An automated benchmark would make the migration decision more objective.

If you are working through the Hermes Agent Challenge sprint and you have a performance-sensitive component, run a profiler first. If the bottleneck is your own code and not the LLM or network, reach for the Rust version. Otherwise, the Python version is simpler and easier to maintain.
