> Plan a 5-day trip to Japan. Tokyo + Kyoto. $3,000 budget. Love food and temples, hate crowds.
That single sentence is the input. The output is a validated, day-by-day itinerary with real POIs, neighborhood-level stays, transport legs between cities, a budget breakdown that adds up, and a Critic that says passed or sends specific agents back to revise.
Most "AI travel planner" demos are a single mega-prompt that hallucinates fluently. I wanted to find out what changes when you treat it as a systems problem instead of a prompting problem — typed contracts, specialized agents, deterministic validation, retries, and observability.
This post walks through the architecture, the agents, the Critic loop, the schemas, and the production concerns (cost, tracing, edge cases). All code is Python, all LLM calls go through Gemini Flash on the free tier (~$0 per run), and the same LLMClient interface swaps to Claude or OpenAI in one file.
Repo layout follows a 9-phase build plan — every file's docstring names the phase it implements, so contributors can read the codebase in build order.
## 1. Why multi-agent?
A single prompt can produce a plausible itinerary. It can't reliably:
- Stay inside the budget. LLMs don't add up.
- Match every preference. "Love food, hate crowds" — does each recommendation actually justify itself against those?
- Fail loudly. If the user said "somewhere warm" without a destination, a single prompt invents one. You want it to ask.
- Tell you what it cost. Per-trip token/cost/latency tracking matters once you ship.
So instead of one agent, I split the work into seven specialists + an orchestrator, where each agent has a narrow job, a typed input, and a typed output. The boundary between agents is a Pydantic schema, not a string.
The seven agents:
| Agent | Job | Output schema |
|---|---|---|
| Intent | Parse free text → structured brief | `TripBrief` |
| Destination | Find 4–8 POIs per city, justified by `match_reasons` | `DestinationCatalog` |
| Accommodation | 1–2 stay options per city, ~30–40% of budget | `AccommodationPlan` |
| Transport | Connected legs (flight/train/bus/ferry) | `TransportPlan` |
| Budget | Allocate across lodging/transport/food/activities/buffer | `BudgetBreakdown` |
| Itinerary | Stitch everything into a day-by-day plan | `Itinerary` |
| Critic | Validate against 5 mechanical rules | `CriticVerdict` |
The Critic is the one that doesn't call an LLM at all — and it's the most important one in the system. More on that in a minute.
## 2. The contract: Pydantic schemas as the inter-agent API
Every agent's output is a strict Pydantic model with `extra="forbid"`. If the LLM hallucinates a field or omits a required one, we catch it at the boundary, not three agents downstream.
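A minimal sketch of the pattern — the exact field set is my reconstruction from the examples in this post, not the repo's literal schema:

```python
from pydantic import BaseModel, ConfigDict

class Money(BaseModel):
    model_config = ConfigDict(extra="forbid")
    amount: float
    currency: str  # explicit currency code, never inferred

class TripBrief(BaseModel):
    model_config = ConfigDict(extra="forbid")
    destinations: list[str]    # empty when the user didn't name one
    budget: Money
    likes: list[str]           # e.g. ["food", "temples"]
    dislikes: list[str]        # e.g. ["crowds"]
    open_questions: list[str]  # non-empty => orchestrator short-circuits
```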
open_questions is how the system surfaces ambiguity. If the user wrote "somewhere warm next month, $500", Intent doesn't pick a destination — it returns:
```json
{
  "destinations": [],
  "budget": {"amount": 500, "currency": "USD"},
  "open_questions": [
    "What region or continent are you considering?",
    "$500 is tight for international travel — is this a domestic trip?"
  ]
}
```
The orchestrator sees a non-empty open_questions and short-circuits before running the expensive specialist fan-out. The UI prompts the user; the answers go back into Intent. No tokens wasted planning a trip the user didn't actually ask for.
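In orchestrator terms the short-circuit is a few lines. This is a hedged sketch — the agent object and return shape are illustrative, not the repo's actual API:

```python
def plan(user_request: str) -> dict:
    brief, _usage = intent_agent.run(user_request)  # illustrative agent object
    if brief.open_questions:
        # Short-circuit: no specialist fan-out, no wasted tokens.
        return {"status": "needs_clarification", "questions": brief.open_questions}
    ...  # otherwise run the specialist fan-out (section 4)
```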
Every downstream schema follows the same pattern. A POI carries `match_reasons: list[str]`, so a POI labeled "Tsukiji Outer Market" must explicitly justify itself with `["food: street-food legend"]` — not just exist. The Critic later checks that every user like shows up somewhere in `match_reasons` across the plan. This gives traceability for free: every recommendation tells you why it's there.
## 3. The Critic: where typed validation beats more LLM calls
The single highest-leverage decision in the design was making the Critic deterministic Python, not another LLM call: five rules, ~200 lines of code, running in milliseconds.
- **Budget rule** — `sum(BudgetBreakdown.lines)` ≤ `TripBrief.budget.amount` (currency-normalized via a static rates table, so a JPY budget against USD line items still validates; sketched after this list).
- **Coverage rule** — every user `like` appears in ≥1 `match_reasons` somewhere in the plan. If the user said "love temples" and no POI / activity references temples, that's a fail.
- **Avoidance rule** — every user `dislike` that shows up in a recommendation must have an explicit mitigation. "Hate crowds" + a recommendation for Senso-ji is fine if the activity says "early morning to avoid crowds."
- **Geo-feasibility rule** — within a single day, no two activities are >50 km apart without a `transport_leg`. Cross-day city changes also require a transport leg. Uses Haversine on POI coordinates.
- **Day-balance rule** — total activity minutes per day ≤ 600 (10 hours). Prevents the LLM's natural tendency to pack 14 activities into "Day 2."
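Here's what the budget rule looks like as plain Python — a hedged sketch of the shape of the check, not the repo's actual code (the rate values and model fields are assumptions):

```python
# Static conversion rates to USD — illustrative values, not live data.
RATES_TO_USD = {"USD": 1.0, "JPY": 0.0067, "EUR": 1.08}

def to_usd(amount: float, currency: str) -> float:
    return amount * RATES_TO_USD[currency]

def check_budget(breakdown, brief) -> list[dict]:
    """Return violations (an empty list means the rule passed)."""
    total = sum(to_usd(line.amount, line.currency) for line in breakdown.lines)
    cap = to_usd(brief.budget.amount, brief.budget.currency)
    if total > cap:
        return [{
            "rule": "budget",
            "message": f"plan totals ${total:.0f} against a ${cap:.0f} budget",
            "agent": "budget",
        }]
    return []
```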
Each violation carries the guilty agent name:
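A plausible reconstruction of the violation record — the post only guarantees the `agent` field; the others are assumed:

```python
from pydantic import BaseModel

class Violation(BaseModel):
    rule: str     # e.g. "budget", "coverage", "geo_feasibility"
    message: str  # human-readable explanation of the failure
    agent: str    # which agent the orchestrator should re-run
```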
The orchestrator reads the agent field, re-runs only the agents named in failed violations, and re-runs the Critic. Capped at 2 revisions — beyond that, we surface the violations to the user instead of looping forever.
This is the part I'd encourage anyone building agentic systems to copy: anything you can validate in code, validate in code. LLM-as-judge is a fallback for fuzzy properties, not a primary check for "does this number equal that number."
## 4. Parallel fan-out, then sequential resolution
After Intent succeeds, three specialists are independent of each other — Destination, Accommodation, and Transport all consume TripBrief and produce their own output. The orchestrator runs them in parallel via concurrent.futures:
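A hedged sketch of that fan-out — `concurrent.futures` is what the post names, but the agent objects and method names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Destination, Accommodation, and Transport each consume only the TripBrief,
# so they can run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {
        "destination": pool.submit(destination_agent.run, brief),
        "accommodation": pool.submit(accommodation_agent.run, brief),
        "transport": pool.submit(transport_agent.run, brief),
    }
    results = {name: f.result() for name, f in futures.items()}
```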
Budget then runs sequentially (it needs the others' outputs to allocate sensibly), then Itinerary stitches them all together, then Critic validates. On the Gemini Flash free tier (15 RPM), this lands a full plan in 20–40 seconds.
## 5. Provider-agnostic LLM client with cost tracking
`src/llm/client.py` is a thin wrapper over `google-genai` that exposes two tiers:
```python
from typing import Literal
from pydantic import BaseModel

class LLMClient:
    def call(
        self,
        *,
        system: str,
        user: str,
        schema: type[BaseModel],  # ← structured output
        tier: Literal["fast", "smart"] = "smart",
    ) -> tuple[BaseModel, Usage]:  # Usage: tokens + dollars, defined alongside the client
        ...
```
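Called from an agent, that looks something like this (the prompt constant and the `Usage` field names are my assumptions):

```python
catalog, usage = client.call(
    system=DESTINATION_SYSTEM_PROMPT,  # hypothetical prompt constant
    user=brief.model_dump_json(),
    schema=DestinationCatalog,
    tier="fast",
)
print(f"{usage.input_tokens} in / {usage.output_tokens} out → ${usage.dollars:.4f}")
```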
Two things this gets right:
- Structured output is required, not optional. Every call passes a Pydantic schema; the response is parsed and validated before returning. No JSON-parsing-from-markdown hacks.
- Cost is computed per call. The client carries a pricing table (Flash = $0.075/M in, $0.30/M out; Pro = $1.25/M in, $5.00/M out) and returns a `Usage` object with input tokens, output tokens, and dollars. The tracer logs all three.
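The per-call dollar math is trivial but worth making explicit — a sketch assuming `fast` maps to Flash and `smart` to Pro:

```python
# $ per 1M tokens (input, output), from the pricing above.
PRICING = {"fast": (0.075, 0.30), "smart": (1.25, 5.00)}

def cost_dollars(tier: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICING[tier]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
```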
Swapping providers is a single-file change because every agent imports LLMClient, never the underlying SDK.
## 6. Observability: a JSONL trace per trip
Each planning run writes one line per span to traces/<trip_id>.jsonl:
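The mechanics are append-only JSON lines; the span fields below are a representative sketch, not the tracer's exact schema:

```python
import json, time

def log_span(trip_id: str, span: dict) -> None:
    # One JSON object per line; append-only, so spans never clobber each other.
    with open(f"traces/{trip_id}.jsonl", "a") as f:
        f.write(json.dumps(span) + "\n")

log_span("a1b2c3", {          # illustrative values throughout
    "ts": time.time(),
    "span": "destination_agent",
    "status": "ok",
    "input_tokens": 1843,
    "output_tokens": 912,
    "dollars": 0.0004,
    "latency_ms": 2310,
})
```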
## 7. Edge cases I actually had to handle
The catalog in `doc/edgeCase.md` has 60+ entries split into P0/P1/P2 priorities. The ones that shaped the architecture:
- Underspecified requests ("somewhere warm next month") → Intent emits `open_questions` instead of guessing. The clarification loop is a real product feature, not an error path.
- Contradictions ("$500 budget, luxury hotels") → Intent records both signals and flags the conflict in `open_questions`.
- Currency ambiguity ("¥300,000") → Money schema requires explicit currency; Critic normalizes via static rates before comparing.
- Invented origins → If the user didn't say where they're flying from, Transport must scrub any leg whose origin came from the LLM's imagination. Specifically prompted for, then validated.
- Prompt injection in the user request → Intent system prompt frames the user input as data, not instructions. Defense is shallow but explicit.
- Obscure destinations → If the LLM has no real knowledge, it must mark `confidence: low` and explain. The UI surfaces low-confidence rows with a warning badge instead of failing.
- Budget overruns → Budget Agent flags; Critic catches; Accommodation/Itinerary re-run with tighter constraints. After 2 revisions, the user sees the violations and chooses what to relax.
## 8. Two front-ends: CLI and Streamlit
CLI (`python -m src.ui.cli "Plan a 5-day trip..."`) — uses `rich` for tree-rendered output, prompts on stdin when Intent emits `open_questions`. `--no-clarify` skips the prompts and proceeds with assumptions, for scripting / batch evaluation.
Streamlit (`streamlit run src/ui/streamlit_app.py`) — single-page form. Each agent runs inside an `st.status` container so the user sees the pipeline live: Intent ✓ → Destination ✓ → Accommodation ⏳ … The final plan renders as collapsible sections with tables, metrics, and the Critic verdict at the top.
Both front-ends are thin — they call into the same orchestrator. All planning logic lives in src/orchestrator/graph.py, not in the UI.
## 9. Testing strategy: mock by default, live on demand
Tests are organized by build phase (`tests/phase2/`, `tests/phase4/`, etc.). Every agent has:

- Unit tests with a mocked `LLMClient` (deterministic, run in CI, <1 s).
- Live tests marked `@pytest.mark.live`, gated by an env var (sketched below).
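One common way to wire that gate — hedged, since the repo's exact conftest may differ and `RUN_LIVE_TESTS` is an assumed variable name:

```python
# conftest.py — skip @pytest.mark.live tests unless the env var opens the gate.
import os
import pytest

def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_LIVE_TESTS") == "1":
        return  # gate open: live tests run too
    skip_live = pytest.mark.skip(reason="set RUN_LIVE_TESTS=1 to run live tests")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```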
This keeps the dev loop fast and CI cheap, while still letting you run a live integration sweep before merging.
For end-to-end quality, there's a fixture file with ~10 representative requests (happy path, underspecified, contradictory, obscure destination, tight budget, multi-city zigzag, prompt injection, non-Latin destination, accessibility requirement). The live integration suite runs all of them and asserts critic pass-rate ≥ 80%.
## 10. What I'd do differently
A few honest takes after building this:
- Don't reach for "LLM-as-judge" first. I started with an LLM-based critic and it was inconsistent and expensive. Rewriting it as five deterministic Python checks made the whole system more reliable and cut per-trip cost ~30%. Reach for the LLM only for things you genuinely can't express in code.
- Schemas first, prompts second. I started by writing system prompts and learned the hard way that the right design is to nail down the Pydantic schemas first, then write prompts whose only job is to produce something that validates against the schema. The schema is the spec; the prompt is just a way to hit it.
- `open_questions` is more valuable than I expected. Telling the user "I'm not going to guess your destination" is better UX than guessing wrong, and it saves serious tokens by short-circuiting before the fan-out.
- Trace everything from day one. Adding observability later is much harder than wiring it in early. The `trip_summary` event in particular cost about 30 lines of code and gave me weekly quality reporting basically for free.
- 2 revisions is enough. I considered uncapped retries and decided against it. If two passes don't fix the plan, the user wants to see the conflict and decide, not watch me burn 50K more tokens hoping.
## 11. Where it goes next
v1 scope is closed. v2 candidates from the edge-case catalog:
- Real-time pricing & booking integrations (Skyscanner, Booking.com APIs).
- Cross-session memory — remember user preferences across trips.
- Multi-traveler constraint solving — when the group has conflicting preferences.
- LLM-based "vibe" critic — for qualities like "this itinerary feels rushed" that resist deterministic rules.
A travel planner is a great vehicle for learning agentic-system design because the problem is just structured enough to demand contracts, just fuzzy enough to need LLMs, and just constrained enough (a budget, a date range, a location) that you can actually validate the output.
The takeaways generalize:
- One agent per responsibility. Specialists beat generalists.
- Pydantic at every boundary. Catch hallucinations as schema-validation errors, not three steps downstream.
- Critic in Python, not in another LLM call. Validate what you can validate; LLM-as-judge is a fallback, not a default.
- Surface ambiguity instead of guessing. `open_questions` is a feature.
- Trace everything. Per-trip JSONL + a quality report script answers "is the system getting better?"
- Cap retries. Bounded revision loops; let the user see the conflict.
If you're building anything multi-agent in Python, I hope some of this saves you a few iterations.