How I stopped my AI trip planner from inventing addresses

#ai #llm #rag #softwaredevelopment

The first time I demoed the trip planner, it wrote a three-day Chengdu itinerary that read like a glossy travel magazine. Everyone nodded along. Then a teammate tapped "Navigate" on one of the restaurants, and the map dropped a pin in the middle of a ring road. The restaurant was real. The address was not. The model had made it up, and it made it up beautifully.

I work on the AI trip planner inside MyChinaGuide, an app for foreigners traveling around China. A wrong address there isn't a typo you laugh off. It's someone standing on a highway shoulder in a city where they can't read the signs or ask a stranger for directions. So I spent a good chunk of last year on one question: how do you let an LLM plan a trip without letting it lie about the facts?

The rule I eventually settled on is boring, and it works. The model gets to write the words and decide the shape of the day. It never gets to be the source of truth for a fact. Names, coordinates, prices, opening hours, all of that comes from somewhere I trust, not from whatever the model felt like generating that second.

Everything below is how that rule turns into actual code.

Why one big prompt isn't enough

The obvious first attempt is a single prompt. "You're a travel planner, plan three days in Chengdu, give me JSON." It demos great. Then real users show up and it breaks the same handful of ways every time.

It invents geography, which is a problem because the address is the one thing a navigation button actually needs. It hands you a wall of markdown when you wanted something your app can render as a map pin or a tappable card. And it pads every plan with advice nobody asked for: download Alipay, carry cash, wear comfortable shoes. Reads like a brochure written by someone who's never been.

So instead of one prompt doing everything, a single request walks through a short pipeline where each step has exactly one job:

quota check
  -> build_context   (pull in facts: RAG + real DB places + real questions)
  -> build_prompt    (stable rules first, the changing stuff last)
  -> stream          (structured output + web search + parse as it arrives)
  -> validate        (a safety net after generation)
  -> enrich          (overwrite the geography with database truth)
  -> save + notify

The parts worth talking about are the grounding up front, the enrichment at the end, and the streaming in the middle. Let me go through them.

One skill, one place to look

Each thing the planner can do, plan a route, recommend food, work out how to get around, is its own self-contained handler. All the logic for that one skill lives together: how it fetches context, how it builds the prompt, what shape of output it's allowed to produce, how it cleans up afterward.

That sounds obvious until you've lived without it. My first version had the per-skill logic smeared across five files: an orchestrator here, prompt templates there, a parser somewhere else, post-processing in yet another place. Changing how trip plans worked meant touching all five and praying. Pulling it into one handler per skill, basically John Ousterhout's deep module idea, small surface and a lot hidden behind it, was the refactor that made the rest of this possible.

Feed it facts before it makes them up

Before the model writes a single word, the handler goes and fetches real things about the city you asked for. Three kinds:

Knowledge from our own database, retrieved with embeddings and a reranker, filtered by city so a Chengdu plan can't pull in Beijing context.
Real places we've already verified, actual attractions and restaurants straight from our tables, with real coordinates. The model picks from these instead of inventing options.
Questions real travelers have asked about this city, dropped in as hints. This is where "the museum's closed Mondays" and "you can't get a cab near there at 6pm" come from.

Two small things matter about how this gets stitched into the prompt. The stable rules go first and the changing data goes last, which plays nicely with prompt caching. And anything that came from a user or a database gets wrapped in tags and escaped, so the model reads it as data instead of accidentally treating it as a new instruction:

def wrap_context_block(tag: str, body: str) -> str:
    body = (body or "").strip()
    if not body:
        return ""
    # escape angle brackets so injected text can't become an instruction
    safe = body.replace("<", "&lt;").replace(">", "&gt;")
    return f"<{tag}>\n{safe}\n</{tag}>"

One more thing: if any of this fetching fails, it falls back to an empty string and the plan still ships. A vector search timing out should never take down someone's itinerary. Grounding is a bonus, not a dependency.

Tell it the goal, not every step

My first prompt was a forest of if-statements. If the travelers are a couple, do this. If there are elderly people in the group, do that. If it's a family with kids, this other thing. It got unmaintainable fast, and worse, the model didn't even follow it well. Long branching instructions are easy for it to lose track of.

The version I run now states a goal and a few constraints, then trusts the model to weigh them:

- 2-3 attractions per day as a baseline; adapt to style
  (relaxed = fewer; with elders = at most 2, low walking, prefer DiDi).
- English name as the primary title, 中文 as secondary;
  use English addresses for navigation.
- Verify ticket price, opening hours and addresses via web search;
  mark anything you could not verify.

That last line is the honesty valve. The model is told to look things up live and flag whatever it couldn't confirm, instead of papering over the gap with a confident guess.

My favorite constraint is the one that kills filler advice. Every plan ends with a tips section, and the prompt is blunt about what a tip is not allowed to be:

Tips must NOT be generic basics. Never write "download Alipay/WeChat Pay/VPN",
"carry cash", "learn Mandarin", or "wear comfortable shoes". The app already
covers these and every tourist knows them.

Match this caliber (example for Chongqing):
"Hongya Cave lights up at 6:30pm but Qianximen Bridge only at 7:30pm,
 arrive 7:30 to catch both lit."
"Buildings here have entrances on several floors; give DiDi the building
 name, not the street address."

Handing the model one concrete example of the bar I wanted did more than any amount of "be specific and helpful." It needed to see the quality, not be told about it.

Stop parsing prose

The planner doesn't emit markdown. It emits a typed list of blocks under a strict schema: a header, days, points of interest, restaurants, transport legs, a transition to the next day, tips. The schema is assembled per skill from the blocks that skill is allowed to produce, and it's enforced while the model decodes, not cleaned up with regex afterward.

Because the output is a JSON array, I parse it as it streams and push each finished block to the app over SSE. You see day one render while day three is still being written. There's also a validation pass at the end as a backstop: it checks a header exists, checks at least one real content block came through, and throws out anything with a block type that shouldn't be there.

The payoff is that the app renders a map pin, a card, and a "Navigate" button straight from typed fields. It never has to read the model's prose to figure out what to draw.

Overwriting the model's geography

This is the part that actually fixed my highway-pin problem.

After the model finishes, every block goes through one more pass that quietly swaps in geography from our own database, the entity reference, the coordinates, the navigation action. Whatever the model wrote for those fields gets thrown away.

The model is free to say "Jinli Ancient Street is a lovely evening walk." It does not get to decide the latitude and longitude the Navigate button fires off. That comes from our places table. If the model mentions somewhere we have on file, the real data wins and the model's guess is gone.

This is the whole "words from the model, facts from the database" rule made literal, and honestly it did more for trust than any prompt tweak. You can spend forever trying to talk a model out of inventing coordinates. It's far easier to just let it write, then overwrite the parts that have to be true.

Don't stream a 40-second answer over the request

Plain chat streams inline. It's a conversation, latency matters, keep it on the connection.

A full multi-day plan is different. With web search and reasoning on, it can take tens of seconds. If you tie that to the request connection, a backgrounded app or a flaky station-platform signal loses the whole thing. So the heavy skills run as background jobs instead: kick one off, get a task id back right away, and the app polls for it. You can close the app, come back, and get a notification when the plan's done. On a durable queue, a worker restart doesn't drop work that was mid-flight. Matching the execution model to the workload turned out to matter as much as any prompt.

If you're building something similar

Decide early what your model is allowed to be right about, and route everything else around it. For me that's the prose and the sequencing, nothing more. Ground it with real data before it generates, then overwrite the fields that have to be true after it's done. Make the output typed so your app renders fields instead of parsing paragraphs. And don't be precious about prompts. A goal with a couple of sharp constraints and one good example beat every clever template I tried.

None of this is fancy. It's grounding, a schema, a database backfill, and picking the right place to draw an async boundary. But it's the difference between a demo that wows the room and something you'd actually hand to a traveler standing on a street corner, in a city they can't read, trusting it to get them to dinner.