Rob

Posted on Jun 3 • Originally published at strake.dev

Cross-Machine Memory Query: About 20 Milliseconds, Most Days

#performance #benchmarks #machinelearning #wireguard

I wrote about hardware benchmarks twice this week. Different problem this time. Same machines.

I have a Mac for daily work, a Linux box that runs a few media services and a GPU, and a Windows desktop I keep for gaming and AMD testing. They are all on the same Tailscale-managed WireGuard mesh. Each one runs a local memory store I use with my AI coding tools.

The store is local-first by design. No vendor cloud. No memory-sharing API. When I move from the Mac to the Linux box for some weekend project, the context I built up on the Mac stays on the Mac, and the new context I build on Linux stays on Linux. That has always been a feature for me. Until last weekend, when I realized it was also a constraint I had built for myself.

I wanted to query the Linux box's memory store from my Mac.

The simplest version of the problem

The local sidecar that powers the memory store exposes an HTTP API. Locally, my agent hits POST /memory/search and gets back the relevant memories. The sidecar binds to 0.0.0.0:8321 so household-mesh peers can dial it. A middleware enforces that non-loopback callers can only reach /peer/* and /health. Everything else returns 403.
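The access rule described above can be sketched as a pure function. This is a minimal illustration, not the sidecar's actual middleware: the names `is_loopback` and `status_for` are hypothetical, and 200 is used only as a stand-in for "let the request through to its handler".

```python
import ipaddress

# Paths a non-loopback (mesh) caller may reach, per the rule described above.
PEER_PREFIX = "/peer/"
OPEN_PATHS = {"/health"}


def is_loopback(client_host: str) -> bool:
    """True when the caller's address is a loopback address (127.0.0.0/8 or ::1)."""
    try:
        return ipaddress.ip_address(client_host).is_loopback
    except ValueError:
        # Unparseable host: treat it as remote, which is the safer default.
        return False


def status_for(client_host: str, path: str) -> int:
    """Return 403 for a blocked request, 200 as a stand-in for 'pass it on'."""
    if is_loopback(client_host):
        return 200  # local callers can reach every route
    if path in OPEN_PATHS or path.startswith(PEER_PREFIX):
        return 200  # mesh peers are limited to /health and /peer/*
    return 403
```

With this rule, a mesh address such as the hypothetical 100.64.0.7 gets 403 for /memory/search and is passed through for /peer/memory/search, while 127.0.0.1 reaches both.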

This is the right default. The local user's memories, tasks, chat history, and observability data should not be reachable from any other machine, including machines I own, until I explicitly opt that in.

The result is a clean privacy boundary and exactly zero cross-machine memory features.

The design fork before any code

Before I wrote a line of code I had to settle one thing. Three real options.

A. Memory stays strictly local. The cross-machine claim was an aspiration that does not match the actual product. Remove the claim, do not build the feature.

B. Own-fleet memory aggregation gated behind a per-machine opt-in environment variable. When the user sets it, household-mesh peers can query that machine's full memory store. Trust is mesh-IP reachability.

C. Per-project sharing flags. Memories tagged for shared projects are queryable; everything else stays strictly local. Cleaner privacy model, more code.

A is the cleanest privacy story, but it walks back a published claim. C is the right long-run answer but it is at least three times the code to ship. B is the pragmatic prototype.

I picked B with an opt-in default of off. The privacy posture stays correct for users who never opt in. The architecture stays correct for users who do.

What the endpoint looks like

# Reached as /peer/memory/search; this assumes the router is mounted with prefix="/peer".
@router.post("/memory/search", response_model=PeerMemorySearchResponse)
async def peer_memory_search(request: PeerMemorySearchRequest):
    if not _peer_memory_enabled():
        raise HTTPException(
            status_code=503,
            detail="Peer memory search is opt-in and disabled on this machine. "
                   "Set POTLUCK_PEER_MEMORY_ENABLED=1 to enable household-fleet "
                   "memory aggregation.",
        )
    if not _memory_active():
        raise HTTPException(status_code=403, detail=_DISABLED_DETAIL)

    pairs = await semantic_search(
        query=request.query,
        project=request.project,
        limit=request.limit,
        min_confidence=request.min_confidence,
        include_pinned=True,
    )
    return PeerMemorySearchResponse(
        memories=[m.model_dump() for m, _ in pairs],
        count=len(pairs),
    )

That is the whole endpoint. Same retriever as the local call. Read-only. The env var gate reads the variable per request so the user can toggle without restarting the sidecar.

There is also a header comment block that documents the new trust model so the next person who reads the file does not have to guess at what is intentional.

The first measurement, and why I did not trust it

Three machines on Tailscale. Each runs the sidecar bound to 0.0.0.0:8321. Opt-in env var set on each one. I issued 30 warm-embedder queries from the Mac to each peer's /peer/memory/search and to my own local /memory/search for comparison.

Path	p50	p95
Mac local `/memory/search` (baseline)	14.8 ms	18.3 ms
Mac to Linux peer `/peer/memory/search`	29.0 ms	37.8 ms
Mac to Windows peer `/peer/memory/search`	40.8 ms	53.5 ms

The Linux peer added 14 ms over local. The minimum cross-machine call was inside the local p95. The shape of the numbers was clean and the conclusion was easy. It feels instant.

I had a draft post written around that 14 ms. I did not publish it.

The numbers felt too generous. The probe was a single ad-hoc script against whatever store happened to be on the Linux box at the time, which was nearly empty. That is not a measurement. That is a vibe check.

I wrote a real bench.

The second measurement: the actual bench script

bench/run_peer_retrieval.py does two things.

First, it populates the peer's store with a known set of 10 synthetic memories, queries them with semantically distinct natural-language questions, and verifies recall@1, recall@3, and recall@10. This catches silent breakage in the wire path: truncation, reordering, dropped fields. All three recall numbers should be 1.00 by construction. The point of the probe is the negative result: confirming the architecture introduces no silent corruption.
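The recall check can be sketched as follows, assuming each synthetic query has exactly one expected memory and that results come back as ranked lists of memory ids. The function name and the id format are illustrative and are not taken from bench/run_peer_retrieval.py.

```python
from typing import Sequence


def recall_at_k(results: Sequence[Sequence[str]], expected: Sequence[str], k: int) -> float:
    """Fraction of queries whose expected memory id appears in the top k results."""
    if len(results) != len(expected):
        raise ValueError("need one expected id per query")
    if not expected:
        raise ValueError("no queries to score")
    hits = sum(1 for ranked, want in zip(results, expected) if want in ranked[:k])
    return hits / len(expected)


# Three hypothetical queries: the first ranks its target first, the others rank it lower.
ranked = [["m1", "m2", "m3"], ["m5", "m2", "m9"], ["m7", "m3", "m1"]]
expected = ["m1", "m2", "m3"]
print(recall_at_k(ranked, expected, 1), recall_at_k(ranked, expected, 3))
```

A reordering or truncation bug in the wire path would show up as recall@1 or recall@3 dropping below 1.00 even though the store itself is intact.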

Second, it populates the peer at three store sizes (10, 100, 500 memories), then issues 30 warm-embedder queries against each. The point is to characterize how latency scales with the size of the embedding scan, which is the part of the system most likely to degrade gracelessly as memory stores grow.
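For reference, p50 and p95 over a batch of 30 timings can be computed as in the sketch below. It uses the nearest-rank definition, which is an assumption; the bench script may interpolate differently. Both helper names are hypothetical.

```python
import math
import time
from typing import Callable, List, Sequence


def time_calls_ms(call: Callable[[], object], n: int = 30) -> List[float]:
    """Run call() n times and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Under this definition, the p95 of 30 samples is the 29th smallest value, so two slow requests in a batch are enough to set it. That is worth keeping in mind when reading p95 figures from small batches.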

The populate step uses SSH local-forwarding from Mac to peer so the writes hit the peer's loopback-only /memory endpoint, satisfying the peer_access_middleware loopback check. Memory writes stay strictly local even during the bench. The query step uses the cross-machine /peer/memory/search directly.

I ran it the same evening as the first measurement. The numbers came back higher than the ad-hoc probe.

Store size on peer	p50	p95
10 memories	41.8 ms	52.6 ms
100 memories	39.5 ms	56.6 ms
500 memories	45.0 ms	85.2 ms

Overhead vs Mac local jumped from 14 ms to roughly 27 ms. Correctness was 1.00 across the board.

Two questions immediately. First: which number is right, 14 ms or 27 ms? Second: why did 500 memories show a p95 of 85 ms when 10 memories showed 53 ms? The linear-scan answer explains 500-vs-10 in principle, but only by a few milliseconds at this scale, not thirty.

The third measurement: the next morning

I ran the same bench script the next morning. Twice in a row, about ten minutes apart.

Run	10 mem p50	100 mem p50	500 mem p50	worst p95
First (evening)	41.8 ms	39.5 ms	45.0 ms	85.2 ms
Second (morning)	35.9 ms	36.1 ms	36.3 ms	46.0 ms
Third (morning, ten min later)	34.7 ms	34.9 ms	41.3 ms	50.4 ms

The two morning runs agree to within about 1.2 ms p50 at 10 and 100 memories; at 500 memories they differ by 5 ms (36.3 vs 41.3). The evening run was roughly 3 to 9 ms higher than either morning run at every store size. Same code. Same machines. Same Tailscale mesh.

The variance is the Tailscale path. Tailscale prefers direct UDP between peers when the network conditions allow; if that fails, traffic relays through Tailscale's DERP servers, which adds a hop and a few milliseconds of geographic latency. Whether a given session lands direct vs DERP can flip based on residential ISP behavior, NAT state, and time of day. The 3 to 9 ms band in my morning-vs-evening numbers is what that flip looks like from the Mac's HTTP stopwatch. The evening's worst p95 (85 ms at 500 memories) is the same flip plus the long tail of a slower path.

What I publish

For a stable headline: cross-machine memory query adds about 20 milliseconds of overhead over local on Mac, give or take 5 to 7 milliseconds of Tailscale jitter, plus a few more milliseconds of p95 widening as the peer's memory store grows past a few hundred entries. The 14 ms number from the ad-hoc probe was the low end of that band. The 27 ms from the first bench was the high end. About 20 ms is the honest middle.
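The arithmetic behind that band is the peer p50 minus the Mac-local p50. The sketch below recomputes it from the tables above, assuming the single local baseline of 14.8 ms applies to every run and using the 10-memory row of each bench run.

```python
LOCAL_P50_MS = 14.8  # Mac local /memory/search baseline from the first table

# Peer p50 values (ms) taken from the tables above.
peer_p50_ms = {
    "ad-hoc probe, Linux peer": 29.0,
    "bench, evening, 10 memories": 41.8,
    "bench, morning run 1, 10 memories": 35.9,
    "bench, morning run 2, 10 memories": 34.7,
}

# Overhead is simply peer p50 minus the local baseline, rounded to 0.1 ms.
overhead_ms = {name: round(p50 - LOCAL_P50_MS, 1) for name, p50 in peer_p50_ms.items()}

for name, extra in overhead_ms.items():
    print(f"{name}: +{extra} ms over local")
```

This gives 14.2 ms for the ad-hoc probe, 27.0 ms for the evening bench, and 21.1 ms and 19.9 ms for the two morning runs. The morning pair is where the "about 20 ms" figure sits.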

The architectural conclusion does not depend on which number you pick. Twenty milliseconds is well inside the "feels instant" range for interactive coding-agent workflows. Even at the worst measurement I have on file (85 ms p95 on a 500-memory store on a slow-Tailscale night), it is faster than most users can perceive as a pause.

What that 20 ms is made of

Roughly: WireGuard tunnel round-trip plus HTTP request and response serialization plus FastAPI middleware plus the actual retriever call on the peer side. The WireGuard round-trip alone is around 9 to 12 ms when Tailscale lands a direct path, 14 to 18 ms when it relays through DERP. The retriever and serialization are the rest.

This is meaningfully lower overhead than I expected when I started. I had been mentally budgeting cross-machine as a 50 to 100 ms operation that I would have to design around. At 20 to 30 ms most days, it just works.

Things that surprised me along the way

I started a sidecar on Linux from an SSH session via nohup ... &. The process died as soon as the SSH session closed. Sessions through Tailscale's built-in SSH server do not behave like normal OpenSSH sessions in this respect. setsid works. nohup does not. Half an hour of debugging that I would rather have back.
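For anyone launching the process from Python rather than a shell, the same session detachment that setsid provides is available through `subprocess.Popen` with `start_new_session=True`. This is a swapped-in equivalent, not the command used above, and `launch_detached` is a hypothetical helper name. It is POSIX only.

```python
import subprocess
import sys
from typing import Sequence


def launch_detached(cmd: Sequence[str]) -> subprocess.Popen:
    """Start cmd in its own session so it is not tied to the launching terminal."""
    return subprocess.Popen(
        list(cmd),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # the child calls setsid() before exec (POSIX only)
    )
```

The child becomes the leader of a new session, so closing the SSH session that started it does not take it down with the session's process group.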

On Windows I tried to launch the sidecar by passing set X=1 && set Y=path && python -m ... through a single cmd.exe /c chain via WMI Win32_Process.Create. The env vars did not survive. The fix was to write a .bat wrapper file and invoke that. Cleaner. Reliable. Should have been my first move.

The first peer query took 22 seconds. That was the sentence-transformers embedder lazy-loading on the peer the first time /peer/memory/search was called. Subsequent calls were ~36 ms. Worth pre-warming after a sidecar restart.
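A pre-warm step can be as small as one throwaway search sent right after the sidecar starts. The sketch below uses only the standard library. The request body is an assumption: `query` and `limit` appear in the endpoint code, but whether `project` and `min_confidence` can be omitted is not stated. The mesh address in the usage note is hypothetical.

```python
import json
import urllib.request


def build_warmup_request(base_url: str) -> urllib.request.Request:
    """Build a POST to the peer search endpoint with a throwaway query."""
    # Assumption: the peer accepts a body with only "query" and "limit".
    body = json.dumps({"query": "warm-up", "limit": 1}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/peer/memory/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def prewarm(base_url: str, timeout_s: float = 60.0) -> int:
    """Send one search so the embedder loads before real queries arrive."""
    with urllib.request.urlopen(build_warmup_request(base_url), timeout=timeout_s) as resp:
        return resp.status
```

Calling `prewarm("http://100.64.0.7:8321")` once after a restart would absorb the slow first call; the 60 second timeout is chosen to leave headroom over the 22 second load observed above.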

The privacy guard middleware works exactly as designed. It returned 403 from /memory/search to my Mac, returned 200 from /peer/memory/search after I set the env var, and stayed 503 from peers where I had not opted in. No accidental data leaks during all my probing.

And the headline number from my own first probe came in roughly 30% to 50% below the rigorous measurements (14 ms against 20 to 27 ms), which is why the rigorous measurement exists. The 14 ms number would have aged badly the first time a user with a real store ran the same probe and reported back the truth.

Honest limits of the prototype

Opt-in is all-or-nothing per machine. If I set the env var on Linux, every peer in the mesh that can dial my Linux box's IP can query the whole memory store. There is no per-project sharing flag. The cleaner project-scoped sharing model is the obvious next step.

Trust is mesh-IP reachability. Whoever can dial my mesh IP can call the peer endpoint if I have opted in. Signed-nonce challenge replacing mesh-IP-as-credential is the next hardening pass.

There is no graceful federation. If I query my Mac and it forwards to Linux, I get Linux's results back. If I want Mac's local results merged with Linux's, I do that in the client. A peer-aware retriever that automatically aggregates across all opted-in peers is the next product step.

The endpoint is read-only. Memory writes stay strictly local. That is deliberate; turning the cross-machine endpoint into a write path needs more careful trust modeling than I want to do in a prototype.

What this changed

Until last weekend, the Linux box was a peer that could serve me inference but not memory. The memory layer was strictly local. That was a clean privacy story but also a real product limit.

After two hours of work, the same architecture has a new opt-in endpoint that lets me query any of my machines' memory stores from any other. The default privacy posture is unchanged. The published architecture invariants still hold. The only change is that the people who want this can have it, by opting in on the machines they want to participate.

I measured it three times before I published a number because the first answer felt too clean. The truth turned out to be a band, not a point. About 20 ms most days. About 27 ms on the one slow-Tailscale night I measured. Scaling gracefully through 500 memories on the peer. That is a more honest claim than 14 ms, and it took two extra runs to learn it.

The architecture was right. The trust boundary was already where it needed to be. The thing I was missing was an env var, one new path on the peer surface, and the discipline to measure twice before publishing once.

Rob is building Strake, a GitHub Action deploy gate for engineering teams that run production without a full SRE bench. Strake posts a GO / HOLD / CRITICAL verdict in the pull request using context from incidents, deploy history, dependency changes, service health, and runbooks.
