We just shipped a hackathon project called CafeTwin. It's an AI agent that watches cafe CCTV, spots operational patterns like queue crossings, table blockages, and staff detours, and recommends one geometry-checked layout change at a time. Two PydanticAI agents in a pipeline. Logfire for tracing. And the reason for this post: MuBit as the persistent memory layer underneath all of it.
This is a deep review. We wired MuBit into two distinct roles, hit a few sharp edges, and came out with strong opinions.
The short version: MuBit is the first execution memory product we've used that does what it says on the tin. Wire it up on day one of any agent project that needs to remember anything between runs.
What MuBit actually is
Most "agent memory" products fall into one of two camps. They're either chat history wrappers storing conversational turns, or they're vector stores in a trench coat. MuBit is neither. It's an execution memory layer. It captures what an agent did, what worked, what failed, and what the user did with the output, then makes that available on the next run.
For CafeTwin, that distinction mattered. Our OptimizationAgent proposes a layout change. The operator accepts or rejects it. On the next session, when a similar pattern shows up, we don't want a semantically similar recommendation. We want the exact prior recommendation for that exact pattern, plus the operator's reaction to it.
A vector store would have given us "kind of like this." MuBit gave us "this same thing, last time you saw it, here's what happened."
That's the gap it fills. Once you start thinking in those terms, you notice how many agent products are pretending to have memory when they actually have search.
How we wired it in
Two roles, both behind a single env var (MUBIT_API_KEY), both with a local JSONL mirror so the demo never breaks if MuBit is offline.
Role 1: Memory store
Recommendations, accept/reject feedback, and detected patterns get written to /v2/control/ingest and recalled on the next run. The "seen 3× before" chip in our UI is powered by this. When the agent proposes a layout change for a pattern it has seen, the operator sees the prior recommendation and what they decided last time.
```python
# Simplified write path: the ingest payload for one recommendation.
# session_id, lane, intent, and record come from our app state.
import json
import time

record_json = json.dumps(record)

payload = {
    "run_id": session_id,
    "agent_id": resolved_agent_id,
    "items": [{
        # Stable ID: same pattern + same change => same item_id,
        # so MuBit's dedup does the right thing on re-runs.
        "item_id": f"cafetwin_{lane}_{layout_change_fingerprint}",
        "content_type": "text",
        # Canonical JSON rides along inside the text field behind a
        # marker, so recall can round-trip the exact record.
        "text": f"CafeTwin {intent} memory...\nCAFETWIN_MEMORY_RECORD_JSON={record_json}",
        "intent": intent,
        "lane": lane,
        "agent_id": agent_id,
        "metadata_json": json.dumps(record),
        "occurrence_time": int(time.time()),
    }],
}
```
Two design choices in our integration are worth calling out, because they're the kind of thing the docs don't tell you and that you only figure out by hitting the wall.
First, we use stable item_ids based on the LayoutChange fingerprint. Re-running the same recommendation for the same pattern produces the same item_id, which means MuBit's own dedup logic does the right thing automatically. We don't have to track "have we written this before" ourselves. This also sidestepped a 422 we hit early on for missing item_id and content_type fields. The API is right to enforce it. It's still a footgun if you skim the docs.
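For the curious, here's roughly how a fingerprint like that can be computed. This is an illustrative sketch rather than our exact code: the fields on the change dict are made up, and hash-then-truncate is one reasonable scheme among several.

```python
import hashlib
import json

def layout_change_fingerprint(change: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace), hashed and truncated,
    # so the same change always yields the same fingerprint regardless
    # of dict ordering.
    canonical = json.dumps(change, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Same pattern + same proposed change => same item_id on every run.
change = {"pattern": "queue_crossing", "move": "table_7", "to": "window_bay"}
item_id = f"cafetwin_recommendations_{layout_change_fingerprint(change)}"
```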
Second, we stuff the canonical JSON into the human-readable text field with a marker. We embed CAFETWIN_MEMORY_RECORD_JSON= inside the text so recall can round-trip the exact MemoryRecord even if MuBit's response only echoes the text field. This was defensive engineering on our part. See the rough edges section below. But it worked, and our recall logic is bulletproof against partial responses.
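The recall side of that round-trip is a few lines. A sketch, assuming the marker always starts its own line, which the \n in the write path guarantees:

```python
import json

MARKER = "CAFETWIN_MEMORY_RECORD_JSON="

def extract_record(text: str) -> dict | None:
    # Recover the canonical record even when a response only echoes
    # the human-readable text field.
    for line in text.splitlines():
        if line.startswith(MARKER):
            return json.loads(line[len(MARKER):])
    return None
```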
Role 2: Agent registry and prompt versioning
This is the part of MuBit that got less press in the marketing. It's also the more interesting half of the product.
When CAFETWIN_MUBIT_AGENTS=1 is set, on FastAPI startup we register PatternAgent, OptimizationAgent, and SimAgent in MuBit's control plane as AgentDefinitions, with their full system prompts attached. The bootstrap flow (sketched in code after the list):
POST /v2/control/projects/list to find or create the cafetwin project.
POST /v2/control/projects/agents/get for each agent to check if it exists.
If new: POST /v2/control/projects/agents mints prompt v1.
If existing: POST /v2/control/prompt/get, compare the active prompt to our in-code instructions, and if they've drifted, POST /v2/control/prompt/set mints a new version and retires the previous one.
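Condensed into code, the flow looks roughly like this. The endpoint paths are the ones above; the request and response field names (slug, agent_id, prompt) are our guesses at a minimal shape, not documented MuBit schema, and the client is an httpx.Client with base_url and auth headers already configured.

```python
import httpx

def bootstrap_agent(client: httpx.Client, project_id: str, agent: dict) -> None:
    resp = client.post("/v2/control/projects/agents/get",
                       json={"project_id": project_id, "slug": agent["slug"]})
    if resp.status_code == 404:
        # New agent: registering it mints prompt v1.
        client.post("/v2/control/projects/agents",
                    json={"project_id": project_id, **agent})
        return
    resp.raise_for_status()
    agent_id = resp.json()["agent_id"]
    active = client.post("/v2/control/prompt/get",
                         json={"agent_id": agent_id}).json()
    if active.get("prompt") != agent["system_prompt"]:
        # Drift detected: mint a new version; the previous one is retired.
        client.post("/v2/control/prompt/set",
                    json={"agent_id": agent_id,
                          "prompt": agent["system_prompt"]})
```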
What this gives you, in practice, is a console where every agent is named, every prompt is versioned, and every memory write is tagged with the correct agent_id so you can see which agent learned what.
We've seen prompt versioning solved by hand in approximately every agent project we've worked on. Usually with a mix of git, a prompts/ folder, and prayer. MuBit just absorbs that problem.
The prompt drift detection is the bit that quietly impressed us most. We changed the OptimizationAgent's instructions mid-project, redeployed, and the next bootstrap minted a new prompt version automatically. No manual versioning. No "did we remember to bump this." It just worked. (Yes, we double-checked the console. Twice.)
The fallback design is the unsung hero
Every MuBit call in our codebase has a JSONL mirror at demo_data/mubit_fallback.jsonl. We append on every write, even when MuBit succeeds. We read from both on every recall and merge with deduplication. The PriorRecommendationMemory.source field gets tagged as mubit, jsonl, or merged so the UI knows where the data came from.
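The merge itself is small. A simplified version, assuming both sides yield dicts keyed by the same stable item_id:

```python
def merged_recall(mubit_items: list[dict], jsonl_items: list[dict]) -> list[dict]:
    # Dedup on the stable item_id; tag each record with its provenance
    # so the UI can show mubit, jsonl, or merged.
    by_id: dict[str, dict] = {}
    for item in mubit_items:
        by_id[item["item_id"]] = {**item, "source": "mubit"}
    for item in jsonl_items:
        if item["item_id"] in by_id:
            by_id[item["item_id"]]["source"] = "merged"
        else:
            by_id[item["item_id"]] = {**item, "source": "jsonl"}
    return list(by_id.values())
```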
This wasn't MuBit's idea. It's our defensive demo engineering. But it's worth describing because it shaped how confidently we could rely on the API. We could ship knowing that conference wifi could die, MuBit could have a bad five minutes, our API key could expire, and the demo would still tell a coherent story.
That kind of fallback is only possible because MuBit's data model is conceptually simple enough to mirror locally. If the product were more opinionated about how memory works, we couldn't have built this safety net.
The honest tradeoff: in offline or fallback mode, the "seen N× before" chip is fed entirely from JSONL, so we can't fairly attribute that demo behavior to MuBit alone. With the API key set, it's a true merge of MuBit and JSONL, and you can see the source tag flip to merged in the UI. We mention this for accuracy.
What worked
Sub-100ms recall in practice. Marketing claims sub-80ms. We didn't benchmark systematically, but our Logfire spans for memory.recall.mubit consistently came in under 100ms during the demo, including network round-trip from London to wherever MuBit's API lives. For an agent that runs every few seconds, that's effectively free.
The async ingest pattern is the right call. When the API returns a job_id, we poll GET /v2/control/ingest/jobs/{job_id} up to four times at 150ms intervals waiting for status=completed. This makes "write a recommendation, immediately recall it on the next run" deterministic for the UI, which is the actual hard problem in any memory system. Most products in this space either go fully async (and you get race conditions) or fully sync (and you eat latency on every write). Job polling with a short window is the sensible middle.
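The polling loop, sketched with the same assumed httpx client as above:

```python
import time
import httpx

def wait_for_ingest(client: httpx.Client, job_id: str,
                    attempts: int = 4, interval: float = 0.15) -> bool:
    # Bounded polling window: long enough to make write-then-recall
    # deterministic for the UI, short enough not to feel like a sync write.
    for _ in range(attempts):
        job = client.get(f"/v2/control/ingest/jobs/{job_id}").json()
        if job.get("status") == "completed":
            return True
        time.sleep(interval)
    return False  # not fatal: the JSONL mirror already has the write
```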
Per-lane agent routing falls out naturally. Our memory has three lanes: patterns, recommendations, feedback. We route patterns to PatternAgent's slug, and recommendations and feedback to OptimizationAgent's slug, on the principle that the feedback teaches the agent that emitted the proposal. MuBit's data model accommodates this without complaint. agent_id is a first-class field, lane is metadata, and the activity API filters on both.
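In code, the routing is just a lookup table. The slugs here are illustrative, not our real ones:

```python
# Feedback teaches the agent that emitted the proposal, so it shares
# OptimizationAgent's slug.
LANE_TO_AGENT_SLUG = {
    "patterns": "pattern-agent",
    "recommendations": "optimization-agent",
    "feedback": "optimization-agent",
}
```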
The control plane is real product, not a marketing artifact. A lot of agent platforms have a console that's a thin viewer over a database. MuBit's console actually represents the things we registered. Agents are agents. Prompts are versioned. Memory writes are scoped to runs and tagged with agent IDs. When we onboarded a teammate, we sent them the console URL and they understood the system in five minutes. Try doing that with a vector store schema.
The failure modes are sensible. 401, 403, and 404 don't crash the bootstrap. They leave the registry empty and agents keep running with their in-code prompts. The API distinguishes "this agent doesn't exist yet" (404, create it) from "auth failed" (401, log and skip) cleanly enough that we could write tolerant bootstrap code without playing whack-a-mole with status codes.
Rough edges, honestly
A balanced review needs the friction. Here's what we ran into.
Recall has two endpoints and the difference matters. /v2/control/activity is the primary recall path. Give it a run_id and it returns recent items. /v2/control/query is the semantic fallback if activity returns too few results. We had to write our recall logic to try activity first, count the results, and fall back to query with a budget cap. This works, but the docs could do more to explain when to reach for which. We figured it out from API behavior, not from the docs.
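Our recall path ended up shaped like this. The thresholds and the query/limit field names are ours or assumed; _mubit_items is the defensive parser described in the next point, stubbed here so the sketch stands alone:

```python
import httpx

def _mubit_items(payload: dict) -> list[dict]:
    # Stub of the defensive parser; full version in the next section.
    for key in ("entries", "results", "items"):
        if isinstance(payload.get(key), list):
            return payload[key]
    return []

MIN_ACTIVITY_RESULTS = 3   # below this, activity is "too sparse"
QUERY_BUDGET = 10          # cap on semantic fallback results

def recall(client: httpx.Client, run_id: str, query_text: str) -> list[dict]:
    # Primary path: recent items for this run.
    items = _mubit_items(client.post("/v2/control/activity",
                                     json={"run_id": run_id}).json())
    if len(items) >= MIN_ACTIVITY_RESULTS:
        return items
    # Sparse activity: fall back to semantic query, budget-capped.
    return items + _mubit_items(client.post("/v2/control/query",
                                            json={"query": query_text,
                                                  "limit": QUERY_BUDGET}).json())
```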
Response shapes are inconsistent enough that we wrote a parser. Our _mubit_items function walks through entries, evidence, results, items, records, memories, and data keys defensively, because different endpoints return different shapes. Same for IDs. We look for mubit_id, memory_id, record_id, entry_id, node_id, id, reference_id, and job_id depending on which call we made. None of this is fatal. But it means a thin SDK wouldn't have saved us much work. We needed to write a parser anyway. A more uniform response envelope across endpoints would be a real quality-of-life improvement.
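A condensed version of that parser, with the key lists from our code:

```python
ITEM_KEYS = ("entries", "evidence", "results", "items",
             "records", "memories", "data")
ID_KEYS = ("mubit_id", "memory_id", "record_id", "entry_id",
           "node_id", "id", "reference_id", "job_id")

def _mubit_items(payload: dict) -> list[dict]:
    # Different endpoints nest items under different keys; take the
    # first known key that holds a list.
    for key in ITEM_KEYS:
        value = payload.get(key)
        if isinstance(value, list):
            return value
    return []

def _mubit_id(item: dict):
    # Same story for IDs: first known key wins.
    for key in ID_KEYS:
        if key in item:
            return item[key]
    return None
```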
The 422 on missing item_id and content_type is correct but unfriendly. First time you hit ingest without those fields, you get a validation error that's accurate but not particularly self-explaining. We figured it out by reading the error response carefully. A code sample in the docs that explicitly shows the minimum viable ingest call would have saved us 20 minutes.
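For anyone else hitting that wall, here's the minimum viable ingest call as we understand it. The base URL is a placeholder, and item_id and content_type are the two fields the 422 names when they're missing:

```python
import os
import time
import httpx

client = httpx.Client(
    base_url="https://api.mubit.example",  # placeholder, not the real host
    headers={"Authorization": f"Bearer {os.environ['MUBIT_API_KEY']}"},
)
client.post("/v2/control/ingest", json={
    "run_id": "demo-run-1",
    "items": [{
        "item_id": "demo-item-1",   # required
        "content_type": "text",     # required
        "text": "hello, memory",
        "occurrence_time": int(time.time()),
    }],
})
```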
Three writes, three paths. This is on us as much as on MuBit, but: recommendation writes, feedback writes, and AgentDefinition registration go through three different code paths in our integration. They use different endpoints and different request shapes. Our app/memory.py and app/mubit_agents.py are two separate modules for this reason. The API surface is internally consistent (everything is /v2/control/*) but the cognitive load of "which endpoint for which write" is higher than it could be. Worth a refactor on our end once the surface settles. And worth keeping in mind on MuBit's end if there's an opportunity to unify.
No native client library at the time we built this. We rolled our own HTTP client with httpx. That was fine. But a Python SDK would have absorbed the response-shape parsing and the polling loop. The blog post on MuBit's homepage shows mubit.learn.init() and @mubit.learn.run decorators, which suggests a higher-level SDK is the intended path for many users. We needed lower-level control for the agent registry use case, so the raw HTTP API was the right choice for us. Most teams will want the SDK.
What MuBit unlocks that we couldn't have built ourselves in 24 hours
This is the test that matters. Anyone can wrap an API. The question is whether the thing on the other end of the API is doing work you couldn't easily replicate.
For us, three things passed that test.
Cross-run recommendation history with operator feedback baked in. We could have stored layout changes in Postgres. We couldn't have built the recall pattern in 24 hours: fingerprint-stable IDs, agent-scoped activity feeds, semantic fallback when activity is sparse, dedup across writes. That's a week of work, minimum. The result would have been worse.
Versioned system prompts as a managed service. We could have built prompt versioning ourselves. We've all built prompt versioning ourselves. It's always worse than you want it to be. Usually a folder of .md files with timestamps, sometimes a database table, occasionally a YAML registry that nobody updates. MuBit's prompt versioning is built into the same control plane as the agent definitions, which means the prompt history and the agent it belongs to are never out of sync. We'd happily use this even if we didn't need the memory layer.
The "seen 3× before" chip is a product, not a feature. When operators see that a recommendation has been made before and rejected, the conversation changes. This isn't an AI capability. It's an institutional memory capability. MuBit gave us this in roughly 200 lines of integration code. If we had to build it from scratch (schema design, recall logic, dedup, fallback, observability), we'd still be building it.
Verdict
If you're shipping an agent that needs to remember anything between runs, MuBit is the right call. It's the first execution memory product we've used that understood the problem correctly. Not chat history. Not vector search. "What did the agent do last time and what happened."
The product is opinionated in the right places: sub-100ms recall, stable item IDs, prompt versioning as a control plane primitive. And unopinionated in the right places: writes are flexible enough that we could mirror locally without fighting the schema.
The rough edges are real but small. Docs could be clearer on activity vs query. Response shapes could be more uniform. The SDK story is still maturing for low-level integrations. None of these are reasons to avoid the product. They're the kind of thing that gets fixed in the next quarter, and we'd rather use a sharp product with a roadmap than a blunt one without.
If we were starting another agent project tomorrow, MuBit would be the first dependency we'd add. Before the LLM provider. Before the tracing layer. Before the framework.
The agent registry alone is worth the price of admission. The memory layer is the bonus.