<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Palash1417</title>
    <description>The latest articles on DEV Community by Palash1417 (@palash1417).</description>
    <link>https://dev.to/palash1417</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3910525%2Fff904fce-b35a-4387-b615-e547f322a273.png</url>
      <title>DEV Community: Palash1417</title>
      <link>https://dev.to/palash1417</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/palash1417"/>
    <language>en</language>
    <item>
      <title>Building a Multi-Agent Travel Planner: From a One-Sentence Prompt to a Validated, Budget-Aware Itinerary</title>
      <dc:creator>Palash1417</dc:creator>
      <pubDate>Sun, 03 May 2026 16:30:24 +0000</pubDate>
      <link>https://dev.to/palash1417/building-a-multi-agent-travel-planner-from-a-one-sentence-prompt-to-a-validated-budget-aware-5f8i</link>
      <guid>https://dev.to/palash1417/building-a-multi-agent-travel-planner-from-a-one-sentence-prompt-to-a-validated-budget-aware-5f8i</guid>
      <description>&lt;p&gt;&lt;strong&gt;Plan a 5-day trip to Japan. Tokyo + Kyoto. $3,000 budget. Love food and temples, hate crowds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That single sentence is the input. The output is a validated, day-by-day itinerary with real POIs, neighborhood-level stays, transport legs between cities, a budget breakdown that adds up, and a Critic that says &lt;strong&gt;passed&lt;/strong&gt; or sends specific agents back to revise.&lt;/p&gt;

&lt;p&gt;Most "AI travel planner" demos are a single mega-prompt that hallucinates fluently. I wanted to find out what changes when you treat it as a &lt;strong&gt;systems&lt;/strong&gt; problem instead of a prompting problem — typed contracts, specialized agents, deterministic validation, retries, and observability.&lt;/p&gt;

&lt;p&gt;This post walks through the architecture, the agents, the Critic loop, the schemas, and the production concerns (cost, tracing, edge cases). All code is Python, all LLM calls go through &lt;strong&gt;Gemini Flash&lt;/strong&gt; on the free tier (~$0 per run), and the same &lt;code&gt;LLMClient&lt;/code&gt; interface swaps to Claude or OpenAI in one file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo layout follows a 9-phase build plan&lt;/strong&gt; — every file's docstring names the phase it implements, so contributors can read the codebase in build order.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Why multi-agent?
&lt;/h2&gt;

&lt;p&gt;A single prompt can produce a plausible itinerary. It can't reliably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stay inside the budget.&lt;/strong&gt; LLMs can't reliably do arithmetic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match every preference.&lt;/strong&gt; "Love food, hate crowds" — does each recommendation actually justify itself against those?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail loudly.&lt;/strong&gt; If the user said "somewhere warm" without a destination, a single prompt invents one. You want it to ask.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell you what it cost.&lt;/strong&gt; Per-trip token/cost/latency tracking matters once you ship.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of one agent, I split the work into &lt;strong&gt;seven specialists + an orchestrator&lt;/strong&gt;, where each agent has a narrow job, a typed input, and a typed output. The boundary between agents is a Pydantic schema, not a string.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fto6a5mt5qx8zd2yo8u23.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fto6a5mt5qx8zd2yo8u23.jpg" alt="orchestrator" width="705" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The seven agents:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;Output schema&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Intent&lt;/td&gt;
&lt;td&gt;Parse free text → structured brief&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TripBrief&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Destination&lt;/td&gt;
&lt;td&gt;Find 4–8 POIs per city, justified by &lt;code&gt;match_reasons&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DestinationCatalog&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accommodation&lt;/td&gt;
&lt;td&gt;1–2 stay options per city, ~30–40% of budget&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AccommodationPlan&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transport&lt;/td&gt;
&lt;td&gt;Connected legs (flight/train/bus/ferry)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TransportPlan&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget&lt;/td&gt;
&lt;td&gt;Allocate across lodging/transport/food/activities/buffer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BudgetBreakdown&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Itinerary&lt;/td&gt;
&lt;td&gt;Stitch everything into day-by-day plan&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Itinerary&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Critic&lt;/td&gt;
&lt;td&gt;Validate against 5 mechanical rules&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CriticVerdict&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Critic is the one that doesn't call an LLM at all — and it's the most important one in the system. More on that in a minute.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The contract: Pydantic schemas as the inter-agent API
&lt;/h2&gt;

&lt;p&gt;Every agent's output is a strict Pydantic model with &lt;code&gt;extra="forbid"&lt;/code&gt;. If the LLM hallucinates a field or omits a required one, we catch it at the boundary, not three agents downstream.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F604xgqu9c87f33ysv6x2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F604xgqu9c87f33ysv6x2.jpg" alt="TripBrief" width="746" height="296"&gt;&lt;/a&gt;&lt;/p&gt;
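&lt;p&gt;A minimal sketch of what such a strict model can look like, assuming Pydantic v2 (the field names here are illustrative, not the repo's actual definitions):&lt;/p&gt;

```python
# Hypothetical sketch of a strict inter-agent model; field names are
# illustrative, not the repo's actual TripBrief definition.
from pydantic import BaseModel, ConfigDict, ValidationError

class Money(BaseModel):
    model_config = ConfigDict(extra="forbid")
    amount: float
    currency: str

class TripBrief(BaseModel):
    model_config = ConfigDict(extra="forbid")
    destinations: list[str]
    budget: Money
    likes: list[str] = []
    dislikes: list[str] = []
    open_questions: list[str] = []

# A hallucinated extra field fails at the boundary:
try:
    TripBrief(destinations=["Tokyo"],
              budget={"amount": 3000, "currency": "USD"},
              vibe="relaxed")          # not in the schema
except ValidationError as e:
    print("rejected:", e.error_count(), "error(s)")
```

&lt;p&gt;With &lt;code&gt;extra="forbid"&lt;/code&gt;, an unexpected key is a &lt;code&gt;ValidationError&lt;/code&gt; at parse time rather than silent garbage three agents downstream.&lt;/p&gt;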

&lt;p&gt;&lt;code&gt;open_questions&lt;/code&gt; is how the system surfaces ambiguity. If the user wrote "somewhere warm next month, $500", Intent doesn't pick a destination — it returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"destinations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"open_questions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"What region or continent are you considering?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"$500 is tight for international travel — is this a domestic trip?"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The orchestrator sees a non-empty &lt;code&gt;open_questions&lt;/code&gt; and short-circuits &lt;strong&gt;before&lt;/strong&gt; running the expensive specialist fan-out. The UI prompts the user; the answers go back into Intent. No tokens wasted planning a trip the user didn't actually ask for.&lt;/p&gt;

&lt;p&gt;Every downstream schema follows the same pattern. A &lt;code&gt;POI&lt;/code&gt; carries &lt;code&gt;match_reasons: list[str]&lt;/code&gt; so a POI labeled "Tsukiji Outer Market" must explicitly justify itself with &lt;code&gt;["food: street-food legend"]&lt;/code&gt; — not just exist. The Critic later checks that every user &lt;code&gt;like&lt;/code&gt; shows up somewhere in &lt;code&gt;match_reasons&lt;/code&gt; across the plan. This gives &lt;strong&gt;traceability&lt;/strong&gt; for free: every recommendation tells you why it's there.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The Critic: where typed validation beats more LLM calls
&lt;/h2&gt;

&lt;p&gt;The single highest-leverage decision in the design was making the Critic &lt;strong&gt;deterministic Python&lt;/strong&gt;, not another LLM call. Five rules, ~200 lines of code, run in milliseconds:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5grbx90h8xvovpde5lze.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5grbx90h8xvovpde5lze.jpg" alt="deterministic Python" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The five rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Budget rule&lt;/strong&gt; — &lt;code&gt;sum(BudgetBreakdown.lines) ≤ TripBrief.budget.amount&lt;/code&gt; (currency-normalized via a static rates table, so a JPY budget against USD line items still validates).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage rule&lt;/strong&gt; — every user &lt;code&gt;like&lt;/code&gt; appears in ≥1 &lt;code&gt;match_reasons&lt;/code&gt; somewhere in the plan. If the user said "love temples" and no POI / activity references temples, that's a fail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoidance rule&lt;/strong&gt; — every user &lt;code&gt;dislike&lt;/code&gt; that shows up in a recommendation must have an explicit mitigation. "Hate crowds" + a recommendation for Senso-ji is fine &lt;em&gt;if&lt;/em&gt; the activity says "early morning to avoid crowds."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo-feasibility rule&lt;/strong&gt; — within a single day, no two activities are &amp;gt;50km apart without a &lt;code&gt;transport_leg&lt;/code&gt;. Cross-day city changes also require a transport leg. Uses Haversine on POI coordinates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day-balance rule&lt;/strong&gt; — total activity minutes per day ≤ 600 (10 hours). Prevents the LLM's natural tendency to pack 14 activities into "Day 2."&lt;/li&gt;
&lt;/ol&gt;
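&lt;p&gt;Two of the rules sketched in plain Python (names and thresholds mirror the post, not the actual repo code):&lt;/p&gt;

```python
# Illustrative sketch of the budget and day-balance checks; the violation
# shape and threshold values follow the post, not the repo's real Critic.

def check_budget(budget_lines, budget_amount, agent="budget"):
    """Rule 1: allocated lines must sum to no more than the user's budget."""
    total = sum(budget_lines.values())
    if total > budget_amount:
        return [{"rule": "budget", "agent": agent,
                 "detail": f"allocated {total} exceeds budget {budget_amount}"}]
    return []

def check_day_balance(days, max_minutes=600, agent="itinerary"):
    """Rule 5: no day may pack more than 600 activity-minutes (10 hours)."""
    violations = []
    for i, activities in enumerate(days, start=1):
        minutes = sum(a["duration_min"] for a in activities)
        if minutes > max_minutes:
            violations.append({"rule": "day_balance", "agent": agent,
                               "detail": f"day {i}: {minutes} min over {max_minutes}"})
    return violations

verdict = check_budget({"lodging": 1200, "transport": 900, "food": 700}, 3000)
verdict += check_day_balance([[{"duration_min": 300}], [{"duration_min": 700}]])
print("passed" if not verdict else verdict)
```

&lt;p&gt;Note that each violation names an &lt;code&gt;agent&lt;/code&gt;, which is what makes targeted re-runs possible.&lt;/p&gt;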

&lt;p&gt;Each violation carries the &lt;strong&gt;guilty agent name&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglkzdy5a7jz0kmau57ka.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglkzdy5a7jz0kmau57ka.jpg" alt="guilty agent name" width="732" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The orchestrator reads the &lt;code&gt;agent&lt;/code&gt; field, re-runs only the agents named in failed violations, and re-runs the Critic. &lt;strong&gt;Capped at 2 revisions&lt;/strong&gt; — beyond that, we surface the violations to the user instead of looping forever.&lt;/p&gt;
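&lt;p&gt;The loop itself is small; here is a hedged sketch where &lt;code&gt;run_agent&lt;/code&gt; and &lt;code&gt;run_critic&lt;/code&gt; stand in for the real orchestrator calls:&lt;/p&gt;

```python
# Sketch of the bounded revision loop described above; run_agent and
# run_critic are stand-ins for the real orchestrator machinery.
MAX_REVISIONS = 2

def plan_with_revisions(state, run_agent, run_critic):
    for attempt in range(MAX_REVISIONS + 1):
        violations = run_critic(state)
        if not violations:
            return state, "passed"
        if attempt == MAX_REVISIONS:
            # Give up looping; surface the violations to the user instead.
            return state, violations
        # Re-run only the agents named as guilty in the failed checks.
        for agent_name in {v["agent"] for v in violations}:
            state = run_agent(agent_name, state, feedback=violations)
```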

&lt;p&gt;This is the part I'd encourage anyone building agentic systems to copy: &lt;strong&gt;anything you can validate in code, validate in code.&lt;/strong&gt; LLM-as-judge is a fallback for fuzzy properties, not a primary check for "does this number equal that number."&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Parallel fan-out, then sequential resolution
&lt;/h2&gt;

&lt;p&gt;After Intent succeeds, three specialists are independent of each other — Destination, Accommodation, and Transport all consume &lt;code&gt;TripBrief&lt;/code&gt; and produce their own output. The orchestrator runs them in parallel via &lt;code&gt;concurrent.futures&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nnaqwwa6llkkiplevyn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nnaqwwa6llkkiplevyn.jpg" alt="Parallel fan-out" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Budget then runs &lt;strong&gt;sequentially&lt;/strong&gt; (it needs the others' outputs to allocate sensibly), then Itinerary stitches them all together, then Critic validates. On the Gemini Flash free tier (15 RPM), this lands a full plan in 20–40 seconds.&lt;/p&gt;
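&lt;p&gt;The fan-out step can be sketched like this (the agent callables are placeholders for the real specialists):&lt;/p&gt;

```python
# Sketch of the parallel fan-out under the structure the post describes;
# the agent callables here are placeholders for the real specialists.
from concurrent.futures import ThreadPoolExecutor

def fan_out(brief, destination_agent, accommodation_agent, transport_agent):
    # The three specialists only depend on TripBrief, so they can run
    # concurrently; LLM calls are I/O-bound, so threads are enough.
    with ThreadPoolExecutor(max_workers=3) as pool:
        dest_f = pool.submit(destination_agent, brief)
        stay_f = pool.submit(accommodation_agent, brief)
        move_f = pool.submit(transport_agent, brief)
        return dest_f.result(), stay_f.result(), move_f.result()
```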




&lt;h2&gt;
  
  
  5. Provider-agnostic LLM client with cost tracking
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;src/llm/client.py&lt;/code&gt; is a thin wrapper over &lt;code&gt;google-genai&lt;/code&gt; that exposes two tiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# ← structured output
&lt;/span&gt;        &lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;smart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;smart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things this gets right:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Structured output is required, not optional.&lt;/strong&gt; Every call passes a Pydantic schema; the response is parsed and validated before returning. No JSON-parsing-from-markdown hacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost is computed per call.&lt;/strong&gt; The client carries a pricing table (Flash = $0.075/M in, $0.30/M out; Pro = $1.25/M in, $5.00/M out) and returns a &lt;code&gt;Usage&lt;/code&gt; object with input tokens, output tokens, and dollars. The tracer logs all three.&lt;/li&gt;
&lt;/ol&gt;
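&lt;p&gt;The cost math is simple enough to sketch; the &lt;code&gt;Usage&lt;/code&gt; shape below is illustrative, but the prices match the table above:&lt;/p&gt;

```python
# Back-of-the-envelope per-call cost computation using the pricing table
# quoted above; the Usage class here is an illustrative stand-in.
from dataclasses import dataclass

PRICES_PER_M = {                      # USD per million tokens (input, output)
    "flash": (0.075, 0.30),
    "pro": (1.25, 5.00),
}

@dataclass
class Usage:
    input_tokens: int
    output_tokens: int
    cost_usd: float

def make_usage(model, input_tokens, output_tokens):
    price_in, price_out = PRICES_PER_M[model]
    cost = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return Usage(input_tokens, output_tokens, round(cost, 6))

u = make_usage("flash", 12_000, 3_000)
print(u)   # a 12k-in / 3k-out Flash call costs a fraction of a cent
```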

&lt;p&gt;Swapping providers is a single-file change because every agent imports &lt;code&gt;LLMClient&lt;/code&gt;, never the underlying SDK.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Observability: a JSONL trace per trip
&lt;/h2&gt;

&lt;p&gt;Each planning run writes one line per span to &lt;code&gt;traces/&amp;lt;trip_id&amp;gt;.jsonl&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88foduml8pb0ujza857b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88foduml8pb0ujza857b.jpg" alt="JSONL trace" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
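&lt;p&gt;A tracer like this needs very little code; the span fields below are assumptions in the spirit of the screenshot, not the repo's exact trace schema:&lt;/p&gt;

```python
# Minimal JSONL tracer sketch; the span field names are assumptions,
# not the repo's actual trace schema.
import json
import time
from pathlib import Path

def log_span(trip_id, agent, usage, status="ok", traces_dir="traces"):
    Path(traces_dir).mkdir(exist_ok=True)
    span = {
        "ts": time.time(),
        "trip_id": trip_id,
        "agent": agent,
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "cost_usd": usage.get("cost_usd", 0.0),
        "status": status,
    }
    # One JSON object per line keeps the file append-only and greppable.
    with open(Path(traces_dir) / f"{trip_id}.jsonl", "a") as f:
        f.write(json.dumps(span) + "\n")
```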

&lt;h2&gt;
  
  
  7. Edge cases I actually had to handle
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;doc/edgeCase.md&lt;/code&gt; catalog has 60+ entries split into P0/P1/P2. The ones that shaped the architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Underspecified requests&lt;/strong&gt; ("somewhere warm next month") → Intent emits &lt;code&gt;open_questions&lt;/code&gt; instead of guessing. The clarification loop is a real product feature, not an error path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contradictions&lt;/strong&gt; ("$500 budget, luxury hotels") → Intent records both signals and flags the conflict in &lt;code&gt;open_questions&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Currency ambiguity&lt;/strong&gt; ("¥300,000") → Money schema requires explicit currency; Critic normalizes via static rates before comparing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invented origins&lt;/strong&gt; → If the user didn't say where they're flying from, Transport must not make an origin up. The prompt forbids it explicitly, and validation scrubs any leg whose origin came from the LLM's imagination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt injection&lt;/strong&gt; in the user request → Intent system prompt frames the user input as data, not instructions. Defense is shallow but explicit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Obscure destinations&lt;/strong&gt; → If the LLM has no real knowledge, it must mark &lt;code&gt;confidence: low&lt;/code&gt; and explain. The UI surfaces low-confidence rows with a warning badge instead of failing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget overruns&lt;/strong&gt; → Budget Agent flags; Critic catches; Accommodation/Itinerary re-run with tighter constraints. After 2 revisions, the user sees the violations and chooses what to relax.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Two front-ends: CLI and Streamlit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CLI&lt;/strong&gt; (&lt;code&gt;python -m src.ui.cli "Plan a 5-day trip..."&lt;/code&gt;) — uses &lt;code&gt;rich&lt;/code&gt; for tree-rendered output, prompts on stdin when Intent emits &lt;code&gt;open_questions&lt;/code&gt;. &lt;code&gt;--no-clarify&lt;/code&gt; skips the prompts and proceeds with assumptions for scripting / batch evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streamlit&lt;/strong&gt; (&lt;code&gt;streamlit run src/ui/streamlit_app.py&lt;/code&gt;) — single-page form. Each agent runs inside an &lt;code&gt;st.status&lt;/code&gt; container so the user sees the pipeline live: Intent ✓ → Destination ✓ → Accommodation ⏳ … . The final plan renders as collapsible sections with tables, metrics, and the Critic verdict at the top.&lt;/p&gt;

&lt;p&gt;Both front-ends are thin — they call into the same orchestrator. All planning logic lives in &lt;code&gt;src/orchestrator/graph.py&lt;/code&gt;, not in the UI.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Testing strategy: mock by default, live on demand
&lt;/h2&gt;

&lt;p&gt;Tests are organized by build phase (&lt;code&gt;tests/phase2/&lt;/code&gt;, &lt;code&gt;tests/phase4/&lt;/code&gt;, etc.). Every agent has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit tests&lt;/strong&gt; with a mocked &lt;code&gt;LLMClient&lt;/code&gt; (deterministic, run in CI, &amp;lt; 1s).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live tests&lt;/strong&gt; marked &lt;code&gt;@pytest.mark.live&lt;/code&gt;, gated by an env var.&lt;/li&gt;
&lt;/ul&gt;
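&lt;p&gt;The post uses a custom &lt;code&gt;@pytest.mark.live&lt;/code&gt; marker; a &lt;code&gt;skipif&lt;/code&gt;-based equivalent looks like this (the &lt;code&gt;RUN_LIVE_TESTS&lt;/code&gt; variable name and wiring are my assumptions, not the repo's exact setup):&lt;/p&gt;

```python
# Sketch of env-var-gated live tests; the RUN_LIVE_TESTS name and the
# skipif wiring are assumptions about the setup, not the repo's code.
import os
import pytest

live = pytest.mark.skipif(
    os.environ.get("RUN_LIVE_TESTS") != "1",
    reason="set RUN_LIVE_TESTS=1 to hit the real Gemini API",
)

@live
def test_intent_agent_live():
    # Would construct a real LLMClient and assert on the parsed TripBrief.
    ...
```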

&lt;p&gt;This keeps the dev loop fast and CI cheap, while still letting you run a live integration sweep before merging.&lt;/p&gt;

&lt;p&gt;For end-to-end quality, there's a fixture file with ~10 representative requests (happy path, underspecified, contradictory, obscure destination, tight budget, multi-city zigzag, prompt injection, non-Latin destination, accessibility requirement). The live integration suite runs all of them and asserts critic pass-rate ≥ 80%.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. What I'd do differently
&lt;/h2&gt;

&lt;p&gt;A few honest takes after building this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't reach for "LLM-as-judge" first.&lt;/strong&gt; I started with an LLM-based critic and it was inconsistent and expensive. Rewriting it as five deterministic Python checks made the whole system more reliable and cut per-trip cost by ~30%. Reach for the LLM only for things you genuinely can't express in code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schemas first, prompts second.&lt;/strong&gt; I started by writing system prompts and learned the hard way that the right design is to nail down the Pydantic schemas first, then write prompts whose only job is to produce something that validates against the schema. The schema is the spec; the prompt is just a way to hit it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;open_questions&lt;/code&gt; is more valuable than I expected.&lt;/strong&gt; Telling the user "I'm not going to guess your destination" is better UX than guessing wrong, and it saves serious tokens by short-circuiting before the fan-out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace everything from day one.&lt;/strong&gt; Adding observability later is much harder than wiring it in early. The &lt;code&gt;trip_summary&lt;/code&gt; event in particular cost about 30 lines of code and gave me weekly quality reporting basically for free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 revisions is enough.&lt;/strong&gt; I considered uncapped retries and decided against it. If two passes don't fix the plan, the user wants to see the conflict and decide, not watch me burn 50K more tokens hoping.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  11. Where it goes next
&lt;/h2&gt;

&lt;p&gt;v1 scope is closed. v2 candidates from the edge-case catalog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time pricing &amp;amp; booking integrations&lt;/strong&gt; (Skyscanner, Booking.com APIs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-session memory&lt;/strong&gt; — remember user preferences across trips.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-traveler constraint solving&lt;/strong&gt; — when the group has conflicting preferences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-based "vibe" critic&lt;/strong&gt; — for qualities like "this itinerary feels rushed" that resist deterministic rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A travel planner is a great vehicle for learning agentic-system design because the problem is &lt;strong&gt;just structured enough&lt;/strong&gt; to demand contracts, &lt;strong&gt;just fuzzy enough&lt;/strong&gt; to need LLMs, and &lt;strong&gt;just constrained enough&lt;/strong&gt; (a budget, a date range, a location) that you can actually validate the output.&lt;/p&gt;

&lt;p&gt;The takeaways generalize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One agent per responsibility.&lt;/strong&gt; Specialists beat generalists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic at every boundary.&lt;/strong&gt; Catch hallucinations as schema-validation errors, not three steps downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critic in Python, not in another LLM call.&lt;/strong&gt; Validate what you can validate; LLM-as-judge is a fallback, not a default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surface ambiguity instead of guessing.&lt;/strong&gt; &lt;code&gt;open_questions&lt;/code&gt; is a feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace everything.&lt;/strong&gt; Per-trip JSONL + a quality report script answers "is the system getting better?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cap retries.&lt;/strong&gt; Bounded revision loops; let the user see the conflict.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building anything multi-agent in Python, I hope some of this saves you a few iterations.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
