CafeTwin is a live simulation platform for cafes. You point it at your floor plan and your CCTV, and it gives you back a working twin of the room: every table, every queue, every staff path, replayed and re-runnable. Operators use it to spot what's quietly costing them throughput, test layout changes against real footfall before moving a single chair, and track how the room actually performs week over week instead of trusting POS numbers to tell the whole story.

The pitch is simple. POS systems tell you what sold.
CafeTwin watches the room and tells you why throughput stalled, then proposes a single, geometry-checked layout change with predicted KPI impact, evidence, and a memory of how the operator responded last time. The twin is the surface.
The agent layer is what turns it from a dashboard into something that actually moves the room.

This write-up is about the hackathon slice. We had 24 hours, a strong opinion that "AI agent" should mean more than a chat box that occasionally hallucinates a chair, and two pieces of plumbing that did most of the work: PydanticAI and Logfire.
What follows is what we shipped, what worked, and why those two tools are the reason the demo held together on stage.
- **Intelligence is real.** Two PydanticAI agents in sequence: PatternAgent reads the bundle and emits a typed OperationalPattern ("queue crossing" / "staff detour" / "table blockage" / "pickup congestion"); OptimizationAgent then picks one geometry-safe move from a deterministic candidate set and emits a typed LayoutChange. (The shape of that handoff is sketched just after this list.)
- **Memory is real.** MuBit as the primary store, with a local JSONL file as a fallback mirror. Recommendations and accept/reject feedback are persisted and recalled, scoped to (session_id, pattern_id) so cafes never see each other's history.
- **Observability is real.** Every /api/run produces one Logfire trace, end-to-end, with a clickable URL in the top bar.
- **The frontend is the deliberately scrappy bit:** Babel-in-browser JSX, no build step, an iso-twin we already had. We bound real data into it additively rather than rewriting it.
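Before digging into the tools, here is a minimal sketch of that two-agent handoff. The agent names and typed models come from the list above; the prompts, return shape, and wiring are illustrative assumptions, not the production code.

```python
# Sketch only: pattern_agent / optimization_agent are the two Agent instances
# described above; prompts and return shape are assumptions for illustration.
async def run_agents(evidence: CafeEvidencePack) -> OptimizationChoice:
    # Stage 1: classify what the room is doing right now.
    pattern_run = await pattern_agent.run(
        "Name the dominant operational pattern in this evidence bundle.",
        deps=evidence,
    )
    pattern: OperationalPattern = pattern_run.output

    # Stage 2: hand the typed pattern to the second agent, which only selects
    # from the deterministic, geometry-checked candidate list (details below).
    choice_run = await optimization_agent.run(
        f"Pick one geometry-safe move that addresses: {pattern.model_dump_json()}",
        deps=evidence,
    )
    return choice_run.output
```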
## Why PydanticAI was the right call
We've all written the same boilerplate before: call an LLM, get a string back, pray it parses, write defensive JSON parsing, write retry logic, write a fallback for when the parse fails the third time, give up and ship it anyway. With two agents in a pipeline, that compounds.
PydanticAI removed a category of work entirely. The agent declaration looks roughly like this:
```python
from pydantic_ai import Agent, ModelRetry

optimization_agent = Agent(
    _agent_model_spec(),
    deps_type=CafeEvidencePack,
    output_type=OptimizationChoice,
    instructions=INSTRUCTIONS,
    retries=1,
    output_retries=1,
)

@optimization_agent.output_validator
async def validate_agent_output(ctx, output: OptimizationChoice) -> OptimizationChoice:
    errors = validate_optimization_choice(output, ctx.deps)
    if errors:
        raise ModelRetry("Fix these errors:\n- " + "\n- ".join(errors))
    return output
```
A few things this buys you that turned out to matter:

- **The output type is the contract.** OptimizationChoice is a strict Pydantic model with extra="forbid". The agent cannot invent fields. It cannot return a selected_candidate_id that isn't a string. We didn't write a single line of "what if the JSON is malformed" code in the whole project. (A minimal sketch of what such a model looks like follows this list.)
- **output_validator + ModelRetry is the part that earns its keep.** The agent's job is selection, not invention. It picks one candidate from a deterministic, geometry-checked list we generated in code. The validator enforces semantic constraints (the candidate ID you picked must actually exist, the evidence IDs you cited must come from the pattern), and on failure it doesn't crash. It raises ModelRetry with the error list, the model gets to try again with explicit feedback, and it usually succeeds on the second try. We watched this fire exactly once in testing, fix itself, and produce a valid output.
- **Deps are typed.** deps_type=CafeEvidencePack means the validator gets ctx.deps already parsed and validated. We never touched a dict.
- **The fallback path is dead simple.** If no LLM key is configured, optimization_agent is None, we fall back to a cached recommendation, and the demo still works offline. That's the same code path that runs when an exception bubbles up from the agent. One flag flips between "live" and "cached", which is useful when wifi at the venue is what wifi at venues always is.
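To make the first point concrete, here is roughly what a strict output model like OptimizationChoice can look like. The field names below are illustrative stand-ins; the extra="forbid" behaviour is the part that matters.

```python
from pydantic import BaseModel, ConfigDict

class OptimizationChoice(BaseModel):
    # Unknown keys in the model's JSON become a validation error instead of
    # silently passing through; PydanticAI feeds that error back to the model.
    model_config = ConfigDict(extra="forbid")

    selected_candidate_id: str      # must name a pre-generated candidate
    cited_evidence_ids: list[str]   # must come from the detected pattern
    rationale: str                  # short, human-readable justification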
The pattern that emerged for both agents: let the LLM do the bit only an LLM can do (judgment, prioritization, prose), and let typed code do everything else. Geometry checks, candidate generation, KPI deltas, and fingerprinting are all deterministic Python. The LLM sees a JSON list of pre-vetted options and picks one. That made the agent reliable enough to demo without a safety net.
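A rough sketch of that split. The geometry helpers (free_spots_near, estimate_kpi_delta) and the evidence fields are hypothetical stand-ins for the real ones:

```python
import json

def generate_candidates(evidence: CafeEvidencePack) -> list[dict]:
    """Deterministic, geometry-checked moves computed in plain Python."""
    candidates = []
    for table in evidence.tables:                                  # field name assumed
        for spot in free_spots_near(table, evidence.floor_plan):   # hypothetical geometry helper
            candidates.append({
                "candidate_id": f"move-{table.id}-to-{spot.id}",
                "table_id": table.id,
                "target_spot_id": spot.id,
                "predicted_kpi_delta": estimate_kpi_delta(table, spot),  # hypothetical helper
            })
    return candidates

def candidate_prompt(evidence: CafeEvidencePack) -> str:
    # The agent only ever sees this pre-vetted list; the output validator
    # rejects any selected_candidate_id that isn't in it.
    return "Pick exactly one candidate:\n" + json.dumps(generate_candidates(evidence), indent=2)
```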
## Why Logfire was worth wiring up at hour two
The temptation in a hackathon is to wire observability last, if at all. We did the opposite. Roughly thirty lines of setup, once, at boot:
```python
import logfire

logfire.configure(
    service_name="cafetwin-backend-tier1",
    environment="demo",
    send_to_logfire="if-token-present",
    scrubbing=logfire.ScrubbingOptions(callback=_scrub_callback),
)
logfire.instrument_pydantic_ai()
logfire.instrument_httpx()
logfire.instrument_fastapi(app)
```
Three things came out of that.
- **You see the agent thinking.** instrument_pydantic_ai() automatically captures every model call, including the prompt, the parsed output, the retry loop when the validator fires, token counts, and latency. We didn't have to instrument it ourselves. When a teammate asked "why did the agent pick that table?" we had a URL to send them, not a conversation.
- **The trace tree is the architecture diagram.** We wrapped the pipeline stages in named spans (evidence_pack, pattern_agent, optimization_agent, memory.write.mubit, memory.write.jsonl, memory.recall.mubit); a sketch of that wrapping follows this list. When we then drove the timings of the front-end's "agent flow" animation off RunResponse.stages[], the animation matched reality because both came from the same span tree. The five glowing nodes in the UI aren't a loading spinner. They're a stripped-down view of a real Logfire trace.
- **The "Logfire" button in the top bar is what made the demo land.** Every /api/run returns a logfire_trace_url filtered to that trace's ID. During the pitch we clicked it and showed the full trace: the prompt, the tool call, the validator retry (when there was one), the memory write to MuBit, the JSONL mirror, the timings. That's harder to fake than a screenshot, and the judges noticed.
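The span names above are the real ones; the sketch below shows the shape of that wrapping, plus one way to pull the trace ID from the current OpenTelemetry context to build the logfire_trace_url. The build_evidence_pack helper, the prompts, and the UI URL format are assumptions, and the memory spans are omitted for brevity.

```python
import logfire
from opentelemetry import trace

async def api_run(evidence: CafeEvidencePack) -> dict:
    # instrument_fastapi(app) has already opened a request span, so everything
    # below nests under one trace.
    with logfire.span("evidence_pack"):
        pack = build_evidence_pack(evidence)  # hypothetical stage helper
    with logfire.span("pattern_agent"):
        pattern = (await pattern_agent.run("Classify the pattern.", deps=pack)).output
    with logfire.span("optimization_agent"):
        choice = (await optimization_agent.run("Pick one candidate.", deps=pack)).output

    # Pull the trace ID from the active OpenTelemetry span and hand the UI a link.
    trace_id = format(trace.get_current_span().get_span_context().trace_id, "032x")
    return {
        "recommendation": choice,
        "logfire_trace_url": f"https://logfire.pydantic.dev/<org>/<project>?q=trace_id%3D%27{trace_id}%27",  # URL format assumed
    }
```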
The cost of all this: one config block, one with span(...) per logical stage, and a scrub callback so we don't accidentally publish a session ID that looks like a secret. There is no version of "we'll add observability later" that beats wiring it up at the start.
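For completeness, the callback hook Logfire exposes looks roughly like this. The rule below (keep one known-harmless field, let everything else, session IDs included, fall through to redaction) is illustrative, not our exact policy.

```python
import logfire

def _scrub_callback(match: logfire.ScrubMatch):
    # Logfire calls this for every value its scrubbing patterns flag.
    # Returning the value keeps it; returning None (falling through) lets the
    # default redaction happen, which is what we want for session IDs.
    if match.path and match.path[-1] == "pattern_id":  # known-harmless field, assumed
        return match.value
```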
## The bit nobody warns you about: ordering
Logfire has one footgun in a FastAPI app. logfire.configure() has to run before you import anything that constructs an Agent, and instrument_fastapi(app) has to run after the app object exists. We learned this the irritating way. The fix was to move all the configuration into a single helper module that the FastAPI app factory imports first, and to make configure_logfire() idempotent so it's safe to call from anywhere.
If you take one piece of advice from this write-up, it's this: configure Logfire in a module that gets imported before any PydanticAI agent is constructed. Otherwise your traces silently drop the model spans and you'll spend an hour wondering where they went.
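A sketch of the shape we ended up with. The guard flag and module name are ours to illustrate; the call set mirrors the config block above.

```python
# observability.py (illustrative module name)
import logfire

_configured = False

def configure_logfire() -> None:
    """Idempotent: safe to call from the app factory, tests, or scripts."""
    global _configured
    if _configured:
        return
    logfire.configure(
        service_name="cafetwin-backend-tier1",
        environment="demo",
        send_to_logfire="if-token-present",
    )
    logfire.instrument_pydantic_ai()
    logfire.instrument_httpx()
    _configured = True

# In the FastAPI app factory: import this module and call configure_logfire()
# before importing anything that constructs an Agent, then call
# logfire.instrument_fastapi(app) once the app object exists.
```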
## What we'd do again
- **Type the boundaries.** The CafeEvidencePack schema is the contract between perception and intelligence. The LayoutChange schema is the contract between intelligence and the UI. PydanticAI made those two boundaries enforceable. Everything else was free to be messy.
- **Make the agent select from a pre-built list.** Generating geometry-safe candidates in code and asking the LLM to pick one is the move. The agent gets to be smart about prioritization. It doesn't get to hallucinate coordinates that put a table in the wall.
- **Wire Logfire on day one.** It paid for itself before lunch.
- **Ship with an offline mode.** CAFETWIN_FORCE_FALLBACK=1 runs the entire demo without an LLM key. Conference wifi has opinions. So should your demo. (The switch is sketched just below.)
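A minimal sketch of that switch. The env var name is real; the cached-recommendation helper and the surrounding shape are assumptions.

```python
import os

FORCE_FALLBACK = os.getenv("CAFETWIN_FORCE_FALLBACK") == "1"

async def recommend(evidence: CafeEvidencePack) -> OptimizationChoice:
    if FORCE_FALLBACK or optimization_agent is None:  # forced offline, or no LLM key configured
        return load_cached_recommendation(evidence)    # hypothetical helper
    try:
        run = await optimization_agent.run("Pick one candidate.", deps=evidence)
        return run.output
    except Exception:
        # Same degradation path as "no key": the demo keeps working.
        return load_cached_recommendation(evidence)
```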
## What we'd do differently
- The handoff between PatternAgent and OptimizationAgent is currently a synchronous chain. With more time, we'd stream stage events to the frontend so the agent flow animation reflects real-time progress instead of post-hoc timings. PydanticAI supports this; we just didn't get there.
- The MuBit integration grew tendrils. Recommendation/feedback writes go through one path, AgentDefinition registration through another, and prior-memory recall through a third. Worth a refactor once the API surface settles.
- We treated the iso-twin as decoration. With another day, the simulated layout change would actually re-render the twin and re-compute the synthesized KPIs, closing the "before / after" loop visually rather than narratively.
## The takeaway
The shape of this project (typed schemas, two narrow agents, deterministic candidate generation, structured memory, end-to-end tracing) wasn't original. It's the boring version of an agent system. PydanticAI made the boring version cheap to build, and Logfire made it cheap to debug and demo. Twenty-four hours later we had something that did one thing well and could prove its own work with a click.
If you're building a small agent for the first time and you're not reaching for these two tools, reach for them. The boilerplate they delete is the boilerplate that costs you the weekend.