<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samuel Amin</title>
    <description>The latest articles on DEV Community by Samuel Amin (@samy_amin).</description>
    <link>https://dev.to/samy_amin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3902421%2F9c98970f-bdb7-46d7-a2a0-0371a6f372b7.png</url>
      <title>DEV Community: Samuel Amin</title>
      <link>https://dev.to/samy_amin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samy_amin"/>
    <language>en</language>
    <item>
      <title>MuBit Review: Execution Memory That Actually Earns Its Keep</title>
      <dc:creator>Samuel Amin</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:56:09 +0000</pubDate>
      <link>https://dev.to/samy_amin/mubit-review-execution-memory-that-actually-earns-its-keep-4h0b</link>
      <guid>https://dev.to/samy_amin/mubit-review-execution-memory-that-actually-earns-its-keep-4h0b</guid>
      <description>&lt;p&gt;We just shipped a hackathon project called CafeTwin. It's an AI agent that watches cafe CCTV, spots operational patterns like queue crossings, table blockages, and staff detours, and recommends one geometry-checked layout change at a time. Two PydanticAI agents in a pipeline. Logfire for tracing. And the reason for this post: MuBit as the persistent memory layer underneath all of it.&lt;/p&gt;

&lt;p&gt;This is a deep review. We wired MuBit into two distinct roles, hit a few sharp edges, and came out with strong opinions.&lt;/p&gt;

&lt;p&gt;The short version: MuBit is the first execution memory product we've used that does what it says on the tin. Wire it up on day one of any agent project that needs to remember anything between runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What MuBit actually is&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most "agent memory" products fall into one of two camps. They're either chat history wrappers storing conversational turns, or they're vector stores in a trench coat. MuBit is neither. It's an execution memory layer. It captures what an agent did, what worked, what failed, and what the user did with the output, then makes that available on the next run.&lt;/p&gt;

&lt;p&gt;For CafeTwin, that distinction mattered. Our OptimizationAgent proposes a layout change. The operator accepts or rejects it. On the next session, when a similar pattern shows up, we don't want a semantically-similar recommendation. We want the exact prior recommendation for that exact pattern, plus the operator's reaction to it.&lt;/p&gt;

&lt;p&gt;A vector store would have given us "kind of like this." MuBit gave us "this same thing, last time you saw it, here's what happened."&lt;br&gt;
That's the gap it fills. Once you start thinking in those terms, you notice how many agent products are pretending to have memory when they actually have search.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How we wired it in&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two roles, both behind a single env var (MUBIT_API_KEY), both with a local JSONL mirror so the demo never breaks if MuBit is offline.&lt;/p&gt;
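&lt;p&gt;For context, the client itself is nothing exotic. A minimal sketch, assuming httpx and a bearer-token header (the base URL and auth header format here are our placeholders, not documented MuBit values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import httpx

# MUBIT_API_KEY is the single env var both roles share.
MUBIT_API_KEY = os.environ.get("MUBIT_API_KEY", "")

client = httpx.Client(
    base_url="https://api.mubit.example",  # placeholder URL
    headers={"Authorization": f"Bearer {MUBIT_API_KEY}"},  # assumed header format
    timeout=10.0,
)

def mubit_enabled() -&gt; bool:
    # Every call site checks this and drops to the JSONL mirror when unset.
    return bool(MUBIT_API_KEY)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;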

&lt;p&gt;&lt;strong&gt;Role 1: Memory store&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommendations, accept/reject feedback, and detected patterns get written to /v2/control/ingest and recalled on the next run. The "seen 3× before" chip in our UI is powered by this. When the agent proposes a layout change for a pattern it has seen, the operator sees the prior recommendation and what they decided last time.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Simplified&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;write&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;path&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"run_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"agent_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;resolved_agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"cafetwin_{lane}_{layout_change_fingerprint}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"content_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"CafeTwin {intent} memory...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;CAFETWIN_MEMORY_RECORD_JSON={record_json}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"lane"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"agent_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"metadata_json"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;json.dumps(record)&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"occurrence_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;int(time.time())&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two design choices in our integration are worth calling out, because they're the kind of thing the docs don't tell you and you figure out by hitting the wall.&lt;/p&gt;

&lt;p&gt;First, we use stable item_ids based on the LayoutChange fingerprint. Re-running the same recommendation for the same pattern produces the same item_id, which means MuBit's own dedup logic does the right thing automatically. We don't have to track "have we written this before" ourselves. This also sidestepped a 422 we hit early on for missing item_id and content_type fields. The API is right to enforce it. It's still a footgun if you skim the docs.&lt;/p&gt;
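&lt;p&gt;A minimal sketch of the fingerprint-stable ID (the helper names are ours; any canonical serialization that's stable across runs would do):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json

def layout_change_fingerprint(change: dict) -&gt; str:
    # Sorted keys + no whitespace: identical changes hash identically.
    canonical = json.dumps(change, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def item_id_for(lane: str, change: dict) -&gt; str:
    # Same pattern + same proposed change =&gt; same item_id =&gt; MuBit dedups.
    return f"cafetwin_{lane}_{layout_change_fingerprint(change)}"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;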

&lt;p&gt;Second, we stuff the canonical JSON into the human-readable text field with a marker. We embed CAFETWIN_MEMORY_RECORD_JSON= inside the text so recall can round-trip the exact MemoryRecord even if MuBit's response only echoes the text field. This was defensive engineering on our part. See the rough edges section below. But it worked, and our recall logic is bulletproof against partial responses.&lt;/p&gt;
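&lt;p&gt;The round-trip is a dozen lines. A sketch, using the marker from our integration (helper names are ours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

MARKER = "CAFETWIN_MEMORY_RECORD_JSON="

def embed_record(summary: str, record: dict) -&gt; str:
    # Human-readable summary first, canonical JSON after the marker.
    return f"{summary}\n{MARKER}{json.dumps(record)}"

def extract_record(text: str) -&gt; dict | None:
    # Recall rebuilds the exact record even if only `text` is echoed back.
    _, sep, payload = text.partition(MARKER)
    if not sep:
        return None
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;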

&lt;p&gt;&lt;strong&gt;Role 2: Agent registry and prompt versioning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the part of MuBit that got less press in the marketing. It's also the more interesting half of the product.&lt;/p&gt;

&lt;p&gt;When CAFETWIN_MUBIT_AGENTS=1 is set, on FastAPI startup we register PatternAgent, OptimizationAgent, and SimAgent in MuBit's control plane as AgentDefinitions, with their full system prompts attached. The bootstrap flow:&lt;/p&gt;

&lt;p&gt;POST /v2/control/projects/list to find or create the cafetwin project.&lt;br&gt;
POST /v2/control/projects/agents/get for each agent to check if it exists.&lt;br&gt;
If new: POST /v2/control/projects/agents mints prompt v1.&lt;br&gt;
If existing: POST /v2/control/prompt/get, compare the active prompt to our in-code instructions, and if they've drifted, POST /v2/control/prompt/set mints a new version and retires the previous one.&lt;/p&gt;
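&lt;p&gt;In code, the per-agent half of that loop looks roughly like this (the endpoint paths are from our integration; the request and response field names are our assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def bootstrap_agent(client, project_id: str, slug: str, instructions: str) -&gt; None:
    # Does this agent already exist in the control plane?
    resp = client.post("/v2/control/projects/agents/get",
                       json={"project_id": project_id, "slug": slug})
    if resp.status_code == 404:
        # New agent: registering it mints prompt v1.
        client.post("/v2/control/projects/agents",
                    json={"project_id": project_id, "slug": slug,
                          "instructions": instructions})
        return
    resp.raise_for_status()
    agent_id = resp.json()["agent_id"]  # field name assumed

    # Existing agent: compare the active prompt to the in-code instructions.
    active = client.post("/v2/control/prompt/get", json={"agent_id": agent_id}).json()
    if active.get("text") != instructions:
        # Drift detected: mint a new version, retiring the previous one.
        client.post("/v2/control/prompt/set",
                    json={"agent_id": agent_id, "text": instructions})
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;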

&lt;p&gt;What this gives you, in practice, is a console where every agent is named, every prompt is versioned, and every memory write is tagged with the correct agent_id so you can see which agent learned what.&lt;br&gt;
We've seen prompt versioning solved by hand in approximately every agent project we've worked on. Usually with a mix of git, a prompts/ folder, and prayer. MuBit just absorbs that problem.&lt;br&gt;
The prompt drift detection is the bit that quietly impressed us most. We changed the OptimizationAgent's instructions mid-project, redeployed, and the next bootstrap minted a new prompt version automatically. No manual versioning. No "did we remember to bump this." It just worked. (Yes, we double-checked the console. Twice.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fallback design is the unsung hero&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every MuBit call in our codebase has a JSONL mirror at demo_data/mubit_fallback.jsonl. We append on every write, even when MuBit succeeds. We read from both on every recall and merge with deduplication. The PriorRecommendationMemory.source field gets tagged as mubit, jsonl, or merged so the UI knows where the data came from.&lt;/p&gt;
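&lt;p&gt;The mirror itself is small enough to show whole. A sketch with our paths and field names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

FALLBACK = Path("demo_data/mubit_fallback.jsonl")

def mirror_write(record: dict) -&gt; None:
    # Append on every write, even when the MuBit call succeeded.
    FALLBACK.parent.mkdir(parents=True, exist_ok=True)
    with FALLBACK.open("a") as f:
        f.write(json.dumps(record) + "\n")

def merged_recall(mubit_items: list[dict]) -&gt; list[dict]:
    # Read both sources, dedup on item_id, tag where each record came from.
    local = ([json.loads(line) for line in FALLBACK.read_text().splitlines() if line.strip()]
             if FALLBACK.exists() else [])
    seen: dict[str, dict] = {}
    for item in mubit_items:
        seen[item["item_id"]] = {**item, "source": "mubit"}
    for item in local:
        if item["item_id"] in seen:
            seen[item["item_id"]]["source"] = "merged"
        else:
            seen[item["item_id"]] = {**item, "source": "jsonl"}
    return list(seen.values())
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;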

&lt;p&gt;This wasn't MuBit's idea. It's our defensive demo engineering. But it's worth describing because it shaped how confidently we could rely on the API. We could ship knowing that conference wifi could die, MuBit could have a bad five minutes, our API key could expire, and the demo would still tell a coherent story.&lt;/p&gt;

&lt;p&gt;That kind of fallback is only possible because MuBit's data model is conceptually simple enough to mirror locally. If the product were more opinionated about how memory works, we couldn't have built this safety net.&lt;/p&gt;

&lt;p&gt;The honest tradeoff: in offline or fallback mode, the "seen N× before" chip is fed entirely from JSONL, so we can't fairly attribute that demo behavior to MuBit alone. With the API key set, it's a true merge of MuBit and JSONL, and you can see the source tag flip to merged in the UI. We mention this for accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What worked&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sub-100ms recall in practice. Marketing claims sub-80ms. We didn't benchmark systematically, but our Logfire spans for memory.recall.mubit consistently came in under 100ms during the demo, including network round-trip from London to wherever MuBit's API lives. For an agent that runs every few seconds, that's effectively free.&lt;br&gt;
The async ingest pattern is the right call. When the API returns a job_id, we poll GET /v2/control/ingest/jobs/{job_id} up to four times at 150ms intervals waiting for status=completed. This makes "write a recommendation, immediately recall it on the next run" deterministic for the UI, which is the actual hard problem in any memory system. Most products in this space either go fully async (and you get race conditions) or fully sync (and you eat latency on every write). Job polling with a short window is the sensible middle.&lt;/p&gt;
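&lt;p&gt;The polling loop, with our numbers (four attempts, 150ms apart; the status field name is what we observed in responses):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

def wait_for_ingest(client, job_id: str, attempts: int = 4, delay: float = 0.15) -&gt; bool:
    # Block briefly until the write is queryable, so the next recall sees it.
    for _ in range(attempts):
        resp = client.get(f"/v2/control/ingest/jobs/{job_id}")
        if resp.status_code == 200 and resp.json().get("status") == "completed":
            return True
        time.sleep(delay)
    return False  # caller leans on the JSONL mirror for this run instead
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;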

&lt;p&gt;Per-lane agent routing falls out naturally. Our memory has three lanes: patterns, recommendations, feedback. We route patterns to PatternAgent's slug, and recommendations and feedback to OptimizationAgent's slug, on the principle that the feedback teaches the agent that emitted the proposal. MuBit's data model accommodates this without complaint. agent_id is a first-class field, lane is metadata, and the activity API filters on both.&lt;br&gt;
The control plane is real product, not a marketing artifact. A lot of agent platforms have a console that's a thin viewer over a database. MuBit's console actually represents the things we registered. Agents are agents. Prompts are versioned. Memory writes are scoped to runs and tagged with agent IDs. When we onboarded a teammate, we sent them the console URL and they understood the system in five minutes. Try doing that with a vector store schema.&lt;br&gt;
The failure modes are sensible. 401, 403, and 404 don't crash the bootstrap. They leave the registry empty and agents keep running with their in-code prompts. The API distinguishes "this agent doesn't exist yet" (404, create it) from "auth failed" (401, log and skip) cleanly enough that we could write tolerant bootstrap code without playing whack-a-mole with status codes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rough edges, honestly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A balanced review needs the friction. Here's what we ran into.&lt;br&gt;
Recall has two endpoints and the difference matters. /v2/control/activity is the primary recall path. Give it a run_id and it returns recent items. /v2/control/query is the semantic fallback if activity returns too few results. We had to write our recall logic to try activity first, count the results, and fall back to query with a budget cap. This works, but the docs could do more to explain when to reach for which. We figured it out from API behavior, not from the docs.&lt;/p&gt;
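&lt;p&gt;The shape of the logic we ended up with (the threshold and payload fields beyond run_id are ours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MIN_RESULTS = 3  # below this, the activity feed counts as "too sparse"

def recall_items(client, run_id: str, query_text: str) -&gt; list[dict]:
    # Primary path: recent items for this run, exact, no semantics involved.
    items = client.post("/v2/control/activity",
                        json={"run_id": run_id}).json().get("items", [])
    if len(items) &gt;= MIN_RESULTS:
        return items
    # Sparse activity: semantic fallback with a budget cap.
    extra = client.post("/v2/control/query",
                        json={"text": query_text, "limit": 10}).json().get("items", [])
    return items + extra
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;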

&lt;p&gt;Response shapes are inconsistent enough that we wrote a parser. Our _mubit_items function walks through entries, evidence, results, items, records, memories, and data keys defensively, because different endpoints return different shapes. Same for IDs. We look for mubit_id, memory_id, record_id, entry_id, node_id, id, reference_id, and job_id depending on which call we made. None of this is fatal. But it means a thin SDK wouldn't have saved us much work. We needed to write a parser anyway. A more uniform response envelope across endpoints would be a real quality-of-life improvement.&lt;/p&gt;
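&lt;p&gt;The parser is boring by design. A condensed sketch of it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;ITEM_KEYS = ("entries", "evidence", "results", "items", "records", "memories", "data")
ID_KEYS = ("mubit_id", "memory_id", "record_id", "entry_id",
           "node_id", "id", "reference_id", "job_id")

def _mubit_items(payload: dict) -&gt; list[dict]:
    # Walk every known container key; different endpoints nest differently.
    for key in ITEM_KEYS:
        value = payload.get(key)
        if isinstance(value, list):
            return [v for v in value if isinstance(v, dict)]
    return []

def _mubit_item_id(item: dict) -&gt; str | None:
    # First ID-ish field wins, in rough order of specificity.
    for key in ID_KEYS:
        if item.get(key):
            return str(item[key])
    return None
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;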

&lt;p&gt;The 422 on missing item_id and content_type is correct but unfriendly. First time you hit ingest without those fields, you get a validation error that's accurate but not particularly self-explaining. We figured it out by reading the error response carefully. A code sample in the docs that explicitly shows the minimum viable ingest call would have saved us 20 minutes.&lt;/p&gt;
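&lt;p&gt;For anyone about to hit the same wall, this is the minimum viable ingest call as we reconstructed it from the error responses (our reading, not official docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

payload = {
    "run_id": "session-001",
    "items": [{
        "item_id": "my-stable-id",  # omit this and you get the 422
        "content_type": "text",     # same
        "text": "what the agent did and what happened",
        "occurrence_time": int(time.time()),
    }],
}
client.post("/v2/control/ingest", json=payload).raise_for_status()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;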
&lt;p&gt;Three writes, three paths. This is on us as much as on MuBit, but: recommendation writes, feedback writes, and AgentDefinition registration go through three different code paths in our integration. They use different endpoints and different request shapes. Our app/memory.py and app/mubit_agents.py are two separate modules for this reason. The API surface is internally consistent (everything is /v2/control/*) but the cognitive load of "which endpoint for which write" is higher than it could be. Worth a refactor on our end once the surface settles. And worth keeping in mind on MuBit's end if there's an opportunity to unify.&lt;/p&gt;

&lt;p&gt;No native client library at the time we built this. We rolled our own HTTP client with httpx. That was fine. But a Python SDK would have absorbed the response-shape parsing and the polling loop. The blog post on MuBit's homepage shows mubit.learn.init() and @mubit.learn.run decorators, which suggests a higher-level SDK is the intended path for many users. We needed lower-level control for the agent registry use case, so the raw HTTP API was the right choice for us. Most teams will want the SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What MuBit unlocks that we couldn't have built ourselves in 24 hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the test that matters. Anyone can wrap an API. The question is whether the thing on the other end of the API is doing work you couldn't easily replicate. For us, three things passed that test.&lt;/p&gt;

&lt;p&gt;Cross-run recommendation history with operator feedback baked in. We could have stored layout changes in Postgres. We couldn't have built the recall pattern in 24 hours: fingerprint-stable IDs, agent-scoped activity feeds, semantic fallback when activity is sparse, dedup across writes. That's a week of work, minimum. The result would have been worse.&lt;/p&gt;

&lt;p&gt;Versioned system prompts as a managed service. We could have built prompt versioning ourselves. We've all built prompt versioning ourselves. It's always worse than you want it to be. Usually a folder of .md files with timestamps, sometimes a database table, occasionally a YAML registry that nobody updates. MuBit's prompt versioning is built into the same control plane as the agent definitions, which means the prompt history and the agent it belongs to are never out of sync. We'd happily use this even if we didn't need the memory layer.&lt;/p&gt;

&lt;p&gt;The "seen 3× before" chip is a product, not a feature. When operators see that a recommendation has been made before and rejected, the conversation changes. This isn't an AI capability. It's an institutional memory capability. MuBit gave us this in roughly 200 lines of integration code. If we had to build it from scratch, schema design, recall logic, dedup, fallback, observability, we'd still be building it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're shipping an agent that needs to remember anything between runs, MuBit is the right call. It's the first execution memory product we've used that understood the problem correctly. Not chat history. Not vector search. "What did the agent do last time and what happened."&lt;/p&gt;

&lt;p&gt;The product is opinionated in the right places: sub-100ms recall, stable item IDs, prompt versioning as a control plane primitive. And unopinionated in the right places: writes are flexible enough that we could mirror locally without fighting the schema.&lt;br&gt;
The rough edges are real but small. Docs could be clearer on activity vs query. Response shapes could be more uniform. The SDK story is still maturing for low-level integrations. None of these are reasons to avoid the product. They're the kind of thing that gets fixed in the next quarter, and we'd rather use a sharp product with a roadmap than a blunt one without.&lt;/p&gt;

&lt;p&gt;If we were starting another agent project tomorrow, MuBit would be the first dependency we'd add. Before the LLM provider. Before the tracing layer. Before the framework.&lt;/p&gt;

&lt;p&gt;The agent registry alone is worth the price of admission. The memory layer is the bonus.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Building CafeTwin: what we shipped, and how Logfire + PydanticAI carried the weekend</title>
      <dc:creator>Samuel Amin</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:37:09 +0000</pubDate>
      <link>https://dev.to/samy_amin/building-cafetwin-what-we-shipped-and-how-logfire-pydanticai-carried-the-weekend-4ibd</link>
      <guid>https://dev.to/samy_amin/building-cafetwin-what-we-shipped-and-how-logfire-pydanticai-carried-the-weekend-4ibd</guid>
      <description>&lt;p&gt;CafeTwin is a live simulation platform for cafes. You point it at your floor plan and your CCTV, and it gives you back a working twin of the room: every table, every queue, every staff path, replayed and re-runnable. Operators use it to spot what's quietly costing them throughput, test layout changes against real footfall before moving a single chair, and track how the room actually performs week over week instead of trusting POS numbers to tell the whole story.The pitch is simple. POS systems tell you what sold. &lt;br&gt;
CafeTwin watches the room and tells you why throughput stalled, then proposes a single, geometry-checked layout change with predicted KPI impact, evidence, and a memory of how the operator responded last time. The twin is the surface. &lt;/p&gt;

&lt;p&gt;The agent layer is what turns it from a dashboard into something that actually moves the room. This write-up is about the hackathon slice. We had 24 hours, a strong opinion that "AI agent" should mean more than a chat box that occasionally hallucinates a chair, and two pieces of plumbing that did most of the work: PydanticAI and Logfire.&lt;/p&gt;

&lt;p&gt;What follows is what we shipped, what worked, and why those two tools are the reason the demo held together on stage.&lt;/p&gt;

&lt;p&gt;Three layers of the build are real, not mocked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Intelligence is real. Two PydanticAI agents in sequence: PatternAgent reads the bundle and emits a typed OperationalPattern ("queue crossing" / "staff detour" / "table blockage" / "pickup congestion"), then OptimizationAgent picks one geometry-safe move from a deterministic candidate set and emits a typed LayoutChange.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory is real. MuBit as the primary store, with a local JSONL file as a fallback mirror. Recommendations and accept/reject feedback are persisted and recalled, scoped to (session_id, pattern_id) so cafes never see each other's history.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Observability is real. Every /api/run produces one Logfire trace, end-to-end, with a clickable URL on the top bar.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The frontend is the deliberately scrappy bit: Babel-in-browser JSX, no build step, an iso-twin we already had. We bound real data into it additively rather than rewriting it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why PydanticAI was the right call&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've all written the same boilerplate before: call an LLM, get a string back, pray it parses, write defensive JSON parsing, write retry logic, write a fallback for when the parse fails the third time, give up and ship it anyway. With two agents in a pipeline, that compounds.&lt;/p&gt;

&lt;p&gt;PydanticAI removed a category of work entirely. The agent declaration looks roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonoptimization_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;_agent_model_spec&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;deps_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CafeEvidencePack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OptimizationChoice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INSTRUCTIONS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@optimization_agent.output_validator&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_agent_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OptimizationChoice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;OptimizationChoice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_optimization_choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ModelRetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fix these errors:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things this buys you that turned out to matter:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The output type is the contract. OptimizationChoice is a strict Pydantic model with extra="forbid". The agent cannot invent fields. It cannot return a selected_candidate_id that isn't a string. We didn't write a single line of "what if the JSON is malformed" code in the whole project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;output_validator + ModelRetry is the part that earns its keep. The agent's job is selection, not invention. It picks one candidate from a deterministic, geometry-checked list we generated in code. The validator enforces semantic constraints (the candidate ID you picked must actually exist, the evidence IDs you cited must come from the pattern), and on failure it doesn't crash. It raises ModelRetry with the error list, the model gets to try again with explicit feedback, and it usually succeeds on the second try. We watched this fire exactly once in testing, fix itself, and produce a valid output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deps are typed. deps_type=CafeEvidencePack means the validator gets ctx.deps already parsed and validated. We never touched a dict.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The fallback path is dead simple. If no LLM key is configured, optimization_agent is None, we fall back to a cached recommendation, and the demo still works offline. That's the same code path that runs when an exception bubbles up from the agent. One flag flips between "live" and "cached", which is useful when wifi at the venue is what wifi at venues always is.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
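
&lt;p&gt;Concretely, the gate in that last point is a null check. A sketch using the agent and schemas defined above (cached_choice stands in for our cached-recommendation path):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;async def choose_optimization(evidence: CafeEvidencePack) -&gt; OptimizationChoice:
    # No LLM key configured =&gt; optimization_agent is None =&gt; cached path.
    if optimization_agent is None:
        return cached_choice(evidence)
    try:
        result = await optimization_agent.run(
            "Pick one geometry-safe layout change.", deps=evidence)
        return result.output
    except Exception:
        # Agent exceptions land on the same cached path as offline mode.
        return cached_choice(evidence)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;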

&lt;p&gt;The pattern that emerged for both agents: let the LLM do the bit only an LLM can do (judgment, prioritization, prose), and let typed code do everything else. Geometry checks, candidate generation, KPI deltas, fingerprinting are all deterministic Python. The LLM sees a JSON list of pre-vetted options and picks one. That made the agent reliable enough to demo without a safety net.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Logfire was worth wiring up at hour two&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The temptation in a hackathon is to wire observability last, if at all. We did the opposite. Roughly thirty lines of setup, once, at boot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonlogfire&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cafetwin-backend-tier1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;send_to_logfire&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;if-token-present&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scrubbing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logfire&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ScrubbingOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_scrub_callback&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logfire&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;instrument_pydantic_ai&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;logfire&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;instrument_httpx&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;logfire&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;instrument_fastapi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things came out of that.&lt;/p&gt;

&lt;p&gt;You see the agent thinking. instrument_pydantic_ai() automatically captures every model call, including the prompt, the parsed output, the retry loop when the validator fires, token counts, and latency. We didn't have to instrument it ourselves. When a teammate asked "why did the agent pick that table?" we had a URL to send them, not a conversation.&lt;/p&gt;

&lt;p&gt;The trace tree is the architecture diagram. We wrapped the pipeline stages in named spans (evidence_pack, pattern_agent, optimization_agent, memory.write.mubit, memory.write.jsonl, memory.recall.mubit). When we then drove the timings of the front-end's "agent flow" animation off RunResponse.stages[], the animation matched reality because both came from the same span tree. The five glowing nodes in the UI aren't a loading spinner. They're a stripped-down view of a real Logfire trace.&lt;br&gt;
The "Logfire" button in the top bar is what made the demo land. Every /api/run returns a logfire_trace_url filtered to that trace's ID. During the pitch we clicked it and showed the full trace: the prompt, the tool call, the validator retry (when there was one), the memory write to MuBit, the JSONL mirror, the timings. That's harder to fake than a screenshot, and the judges noticed.&lt;br&gt;
The cost of all this: one config block, one with span(...) per logical stage, and a scrub callback so we don't accidentally publish a session ID that looks like a secret. There is no version of "we'll add observability later" that beats wiring it up at the start.&lt;/p&gt;
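&lt;p&gt;For reference, the per-stage spans are one line each. A sketch with our span names (the helper functions are stand-ins):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logfire

async def run_pipeline(bundle) -&gt; None:
    # Span names match the stages the frontend animation reads back.
    with logfire.span("evidence_pack"):
        evidence = build_evidence_pack(bundle)  # stand-in helper
    with logfire.span("pattern_agent"):
        pattern = await pattern_agent.run("Find the dominant pattern.", deps=evidence)
    with logfire.span("optimization_agent"):
        choice = await optimization_agent.run("Pick one change.", deps=evidence)
    with logfire.span("memory.write.mubit"):
        write_recommendation(choice.output)  # stand-in helper
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;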

&lt;p&gt;&lt;strong&gt;The bit nobody warns you about: ordering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logfire has one footgun in a FastAPI app. logfire.configure() has to run before you import anything that constructs an Agent, and instrument_fastapi(app) has to run after the app object exists. We learned this the irritating way. The fix was to move all the configuration into a single helper module that the FastAPI app factory imports first, and to make configure_logfire() idempotent so it's safe to call from anywhere.&lt;br&gt;
If you take one piece of advice from this write-up, it's this: configure Logfire in a module that gets imported before any PydanticAI agent is constructed. Otherwise your traces silently drop the model spans and you'll spend an hour wondering where they went.&lt;/p&gt;
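&lt;p&gt;The shape of the fix, as an idempotent helper (the module layout is ours; the configure kwargs are the ones from the block above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# observability.py -- import this before any module that constructs an Agent.
import logfire

_configured = False

def configure_logfire() -&gt; None:
    global _configured
    if _configured:
        return  # safe to call from anywhere; runs once
    logfire.configure(
        service_name="cafetwin-backend-tier1",
        environment="demo",
        send_to_logfire="if-token-present",
    )
    logfire.instrument_pydantic_ai()
    logfire.instrument_httpx()
    _configured = True

# instrument_fastapi(app) still runs later, in the app factory, once app exists.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;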

&lt;p&gt;&lt;strong&gt;What we'd do again&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Type the boundaries. The CafeEvidencePack schema is the contract between perception and intelligence. The LayoutChange schema is the contract between intelligence and the UI. PydanticAI made those two boundaries enforceable. Everything else was free to be messy.&lt;br&gt;
Make the agent select from a pre-built list. Generating geometry-safe candidates in code and asking the LLM to pick one is the move. The agent gets to be smart about prioritization. It doesn't get to hallucinate coordinates that put a table in the wall.&lt;br&gt;
Wire Logfire on day one. It paid for itself before lunch.&lt;br&gt;
Ship with an offline mode. CAFETWIN_FORCE_FALLBACK=1 runs the entire demo without an LLM key. Conference wifi has opinions. So should your demo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we'd do differently&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The handoff between PatternAgent and OptimizationAgent is currently a synchronous chain. With more time, we'd stream stage events to the frontend so the agent flow animation reflects real-time progress instead of post-hoc timings. PydanticAI supports this; we just didn't get there.&lt;br&gt;
The MuBit integration grew tendrils. Recommendation/feedback writes go through one path, AgentDefinition registration through another, and prior-memory recall through a third. Worth a refactor once the API surface settles.&lt;br&gt;
We treated the iso-twin as decoration. With another day, the simulated layout change would actually re-render the twin and re-compute the synthesized KPIs, closing the "before / after" loop visually rather than narratively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The shape of this project (typed schemas, two narrow agents, deterministic candidate generation, structured memory, end-to-end tracing) wasn't original. It's the boring version of an agent system. PydanticAI made the boring version cheap to build, and Logfire made it cheap to debug and demo. Twenty-four hours later we had something that did one thing well and could prove its own work with a click.&lt;/p&gt;

&lt;p&gt;If you're building a small agent for the first time and you're not reaching for these two tools, reach for them. The boilerplate they delete is the boilerplate that costs you the weekend.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>pydantic</category>
      <category>logfire</category>
      <category>hackathon</category>
    </item>
  </channel>
</rss>
