Vincent Tran

Posted on Jun 30 • Originally published at 0xgosu.dev on Jun 25

You Can Test the Pipeline, Not the Taste

#ai #productivity #software

A map can be correct and still feel wrong.

The project sounds straightforward at first: enrich a virtual-running app with interesting points of interest along famous routes. A runner logs real mileage through Strava, the app maps that cumulative distance onto a long route, and the user gets the slow satisfaction of moving across a country, a continent, or an iconic road.

The missing feature was discovery. If the app can show that a runner is 400 kilometers into a route, it should also be able to show what is nearby: a national park, an old fort, a mountain, a monument, a historic building, a strange landmark, or a city worth noticing.

That sounds like a data problem. Download a global places database, filter for interesting rows, join it to route geometry, rank the results, and render markers on the map.

Then reality arrives. A global dataset is not a curated travel guide. Wikipedia coverage is not the same thing as importance. A densely populated route can become a map of every town and village. A remote route can need more natural landmarks and fewer administrative names. A model can write smoother prose while inventing facts. A route that looks good under one ranking formula can get worse when that same formula is applied somewhere else.

The problem is not that tests are useless. The problem is that the most important question is not binary. “Is this a good set of places to show a runner?” is a product judgment, a data judgment, and a taste judgment.

Start With Boring Data

The right place to begin was not an LLM. It was GeoNames, a large geographical database with downloadable dumps, feature categories, alternate names, coordinates, population, elevation, and links. That matters because points of interest need a factual spine. A map marker has to have a location before it can have a personality.

The pipeline used Python for processing, Apache Parquet for local columnar storage, and DuckDB as the query layer. That is a sensible stack for this shape of work. Parquet keeps large intermediate datasets compact and queryable. DuckDB makes local analysis feel like database work without requiring a server. Python has the geospatial and file-processing ecosystem to glue the steps together.

The first pass reduced the world.

GeoNames contains many categories that are useful for geography but not useful for a recreational route map. Countries, regions, states, and other administrative divisions are not the same as sights. A route map should not show every boundary object simply because the dataset knows about it.

So the pipeline filtered toward feature codes that sounded more likely to be interesting: parks, historic sites, castles, monuments, mountains, populated places above a threshold, and similar categories. It also used elevation filters for mountains and population filters for settlements. That kind of first cut is crude, but it is necessary. If the initial candidate set is too wide, every later step is forced to rank noise.
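A minimal sketch of that first cut, assuming rows have already been loaded into simple records. The feature classes and codes (A, P, PRK, HSTS, CSTL, MNMT, MT) are real GeoNames values, but the allow-list, the thresholds, the `Place` type, and the sample rows are hypothetical stand-ins, not the project's actual configuration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed allow-list and thresholds; the real values are not given in the source.
INTERESTING_CODES = {"PRK", "HSTS", "CSTL", "MNMT"}
MIN_POPULATION = 50_000
MIN_ELEVATION_M = 2_000


@dataclass(frozen=True)
class Place:
    name: str
    feature_class: str   # GeoNames class, e.g. "A", "P", "S", "T", "L"
    feature_code: str    # GeoNames code, e.g. "PPL", "MT", "PRK"
    population: int = 0
    elevation_m: Optional[int] = None


def is_candidate(p: Place) -> bool:
    """Return True if the row survives the crude first-pass filter."""
    if p.feature_class == "A":          # administrative divisions are not sights
        return False
    if p.feature_class == "P":          # settlements need a population floor
        return p.population >= MIN_POPULATION
    if p.feature_code == "MT":          # mountains need an elevation floor
        return p.elevation_m is not None and p.elevation_m >= MIN_ELEVATION_M
    return p.feature_code in INTERESTING_CODES


# Illustrative rows only, not real GeoNames records.
places = [
    Place("Example Region", "A", "ADM1"),
    Place("Example Park", "L", "PRK"),
    Place("Example Village", "P", "PPL", population=800),
    Place("Example City", "P", "PPL", population=120_000),
    Place("Example Hill", "T", "MT", elevation_m=300),
    Place("Example Peak", "T", "MT", elevation_m=3_100),
]
candidates = [p.name for p in places if is_candidate(p)]
```

The same rules translate directly into a SQL `WHERE` clause over Parquet files. The point of writing them down explicitly is that each threshold stays visible and easy to change.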

Even here, the project exposes a useful lesson about AI-assisted coding. Tryggvason worked with Claude while building the pipeline, but the useful pattern was not “ask the model to solve geography.” It was using the agent to help build one step at a time, then checking each intermediate artifact with domain-specific sanity checks.

That distinction matters. Agents are most useful when the human still owns the shape of the work.

Bias Hides in Useful Signals

One of the early useful signals came from Wikipedia links inside GeoNames alternate names. That sounds odd, but it is practical: if a place has a relevant Wikipedia page, it is more likely to be worth showing than an obscure row with no supporting context.

It is also biased.

English Wikipedia coverage tells you something about notability, but it also tells you where English-speaking editors have spent time. A route through the United States or the United Kingdom may be densely covered. A route through places with less English-language coverage may look emptier, even when the real world is not less interesting.

The pipeline had to handle both false positives and false negatives. The example from the source article is clean: a first draft could find Stonehenge, New South Wales, while missing the prehistoric Stonehenge most users would expect. That is not a small bug in a route-discovery feature. It is the kind of mistake that makes users stop trusting the map.

This is where “data cleaning” becomes product work. The team had to join multiple GeoNames files, select useful feature codes, preserve relevant alternate names, cross-reference Wikipedia URLs, and inspect whether famous landmarks survived the filters.

The result was a much smaller global candidate set. The original GeoNames data had roughly 13 million entries. The filtered point-of-interest dataset had about 725,000 rows. That is still large enough to be useful, but small enough to reason about.

The next step was route matching. For each route, the pipeline took a GeoJSON path, built a bounding box to avoid scanning the whole world, then checked which candidate places fell within a chosen distance of the actual route. It also calculated distance along the route, so the app could decide when a runner should encounter each point.
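The matching step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: it uses a flat equirectangular approximation (reasonable for corridors of a few tens of kilometers, poor near the poles), ignores routes that cross the antimeridian, and the function names `to_xy` and `match_to_route` are made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def to_xy(lon, lat, lat0):
    """Project lon/lat to approximate kilometers around reference latitude lat0."""
    x = math.radians(lon) * EARTH_RADIUS_KM * math.cos(math.radians(lat0))
    y = math.radians(lat) * EARTH_RADIUS_KM
    return x, y


def match_to_route(route, point, max_km):
    """Match one place to a route.

    route: list of (lon, lat) vertices, as in a GeoJSON LineString.
    point: (lon, lat) of the candidate place.
    Returns (offset_km, along_km) or None if the place is outside the corridor.
    """
    lons = [p[0] for p in route]
    lats = [p[1] for p in route]
    lat0 = sum(lats) / len(lats)

    # Cheap bounding-box rejection before any per-segment math.
    pad_lat = math.degrees(max_km / EARTH_RADIUS_KM)
    pad_lon = pad_lat / max(math.cos(math.radians(lat0)), 1e-6)
    lon, lat = point
    if not (min(lons) - pad_lon <= lon <= max(lons) + pad_lon):
        return None
    if not (min(lats) - pad_lat <= lat <= max(lats) + pad_lat):
        return None

    px, py = to_xy(lon, lat, lat0)
    best = None
    walked = 0.0  # route length covered by the segments already visited
    for a, b in zip(route, route[1:]):
        ax, ay = to_xy(a[0], a[1], lat0)
        bx, by = to_xy(b[0], b[1], lat0)
        dx, dy = bx - ax, by - ay
        seg_len = math.hypot(dx, dy)
        if seg_len == 0:
            t = 0.0
        else:
            # Fraction along the segment of the closest point, clamped to [0, 1].
            t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len**2))
        offset = math.hypot(px - (ax + t * dx), py - (ay + t * dy))
        if best is None or offset < best[0]:
            best = (offset, walked + t * seg_len)
        walked += seg_len

    if best is None or best[0] > max_km:
        return None
    return best


route = [(0.0, 0.0), (1.0, 0.0)]          # one degree along the equator
hit = match_to_route(route, (0.5, 0.1), max_km=20.0)
miss = match_to_route(route, (0.5, 0.1), max_km=5.0)
```

Here `hit` carries both numbers the app needs: how far the place sits from the route, and how far along the route a runner must travel to reach it.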

This is the boring part of the system in the best sense. It is deterministic. It can be inspected. It can be profiled. It can be rerun. If a route returns no candidates, or too many, or places far outside the intended corridor, that is a bug you can chase.

But the output still was not a product. It was a pile of candidates.

The LLM Was Bad at Facts

The tempting move was to ask an LLM to make the data feel like a guidebook.

That is reasonable. A model can turn raw records into readable prose. It can compare landmarks in a way that simple counts cannot. It can recognize that a place sounds culturally or historically interesting even when a single database field does not capture it.

The project tried that. Wikipedia summaries and Wikidata signals were fetched for route candidates. The number of language editions for a Wikipedia topic became another relevance signal: if a subject has articles in many languages, it is probably more globally notable than a page that exists only in English. The data could be cached so that later routes did not need to refetch the same wiki metadata.

Then an LLM-powered step was added. The system used Anthropic’s tool-calling support for structured output and batch processing for cheaper large runs. That fits the workload: many independent rows, similar prompts, and no need for instant interactive latency.

The model produced useful judgments, but it also lied.

The first version was not grounded tightly enough in the input data. Central Park in Decatur, Illinois, could get treated like Central Park in Manhattan. Town populations could come back different from the input record. Mountains could become larger than they were. The model’s prose often read better than a Wikipedia summary, but it carried a worse failure mode: it sounded confident while mutating facts.

That is the wrong trade for a map.

Wikipedia can be wrong too, but it is a known and attributable source. If the app shows a Wikipedia-derived summary, the failure mode is at least visible. If a generated blurb invents a detail, the application has silently created misinformation and presented it as product knowledge.

The smart move was to demote the model. It stopped being the writer of record. Wikipedia summaries won for factual text.

That is the kind of decision teams should make more often. The best use of an LLM is not always the flashiest one. Sometimes the correct role is smaller, cheaper, and less visible.

The LLM Was Useful at Taste

The model still had a job: rating points of interest.

That sounds contradictory until you separate factual generation from subjective scoring. Asking a model to invent a summary creates a correctness problem. Asking it to produce a bounded significance score from supplied context creates a different problem. The score can still be wrong, biased, or inconsistent, but it does not pretend to be a paragraph of facts.
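One way to enforce "bounded" is to validate whatever the model returns before it touches the ranking. The sketch below assumes the structured output arrives as a dict with `place_id` and `score` keys and that the scale is 1 to 10. Both the field names and the scale are assumptions for illustration, and no model API is called here.

```python
SCORE_MIN = 1   # assumed scale
SCORE_MAX = 10


def parse_rating(payload: dict, known_ids: set[str]) -> tuple[str, int]:
    """Accept a model rating only if it refers to a supplied place and is in range."""
    place_id = payload.get("place_id")
    score = payload.get("score")

    # The model may only rate places we gave it, never ones it made up.
    if not isinstance(place_id, str) or place_id not in known_ids:
        raise ValueError(f"unknown place_id: {place_id!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not SCORE_MIN <= score <= SCORE_MAX:
        raise ValueError(f"score out of range: {score}")
    return place_id, score


def is_valid(payload: dict, known_ids: set[str]) -> bool:
    """Convenience wrapper for filtering a batch of model outputs."""
    try:
        parse_rating(payload, known_ids)
        return True
    except ValueError:
        return False
```

A rejected rating can be retried or dropped. Either way, the worst a bad score can do is misrank a real place, which is a smaller failure than a blurb that invents one.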

The pipeline combined several signals:

GeoNames feature class and feature code.
Wikipedia availability.
Wikidata language count.
Population and elevation thresholds where appropriate.
A model-provided subjective rating.
Route-specific filters and weights.
Geographic spacing so one dense area does not crowd out the rest of the route.

That mix is more interesting than “AI ranks the landmarks.” It is a traditional data pipeline with one subjective input. The model is not the system. The model is one instrument in the system.
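As a rough sketch of how such signals might be blended, the example below computes a weighted sum and then picks markers greedily with a minimum gap. The `Candidate` fields, the weights, the normalization caps, and the use of distance along the route as the spacing measure are all assumptions made for illustration. The source does not give the actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    along_km: float       # distance along the route
    has_wikipedia: bool
    language_count: int   # number of Wikipedia language editions
    llm_score: int        # assumed 1..10 subjective rating
    code_weight: float    # assumed 0..1 weight for the feature code


def combined_score(c, w_wiki=1.0, w_lang=1.0, w_llm=1.0, w_code=1.0):
    """Blend the signals into one number; every term is scaled to roughly 0..1."""
    lang = min(c.language_count, 50) / 50   # cap so huge counts do not dominate
    llm = c.llm_score / 10
    wiki = 1.0 if c.has_wikipedia else 0.0
    return w_wiki * wiki + w_lang * lang + w_llm * llm + w_code * c.code_weight


def pick_spaced(candidates, min_gap_km, limit):
    """Greedy selection: best score first, skipping anything too close to a pick."""
    chosen = []
    for c in sorted(candidates, key=combined_score, reverse=True):
        if len(chosen) >= limit:
            break
        if all(abs(c.along_km - k.along_km) >= min_gap_km for k in chosen):
            chosen.append(c)
    return sorted(chosen, key=lambda c: c.along_km)


# Illustrative candidates only.
pool = [
    Candidate("A", 10.0, True, 50, 9, 1.0),
    Candidate("B", 12.0, True, 25, 7, 0.5),
    Candidate("C", 100.0, False, 0, 5, 0.5),
]
markers = [c.name for c in pick_spaced(pool, min_gap_km=20.0, limit=2)]
```

With these inputs, B scores higher than C but loses its slot because it sits two kilometers from A. That is the spacing rule doing its job.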

This is where the phrase “you can’t unit test for taste” becomes concrete.

A unit test can tell you whether a function sorts descending. It can tell you whether a distance calculation returns a known value. It can tell you whether the output JSON matches a schema. It can tell you whether a route endpoint returns markers.
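Those mechanical checks are cheap to write. The sketch below shows three of them with plain `assert` statements. The helper functions and the marker field names are hypothetical stand-ins for whatever the real pipeline exposes.

```python
import math


def rank_descending(scores: dict[str, float]) -> list[str]:
    """Return keys ordered from highest score to lowest."""
    return sorted(scores, key=scores.get, reverse=True)


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Assumed marker shape; the real schema is not given in the source.
REQUIRED_FIELDS = {"name": str, "lat": float, "lon": float, "along_km": float}


def matches_schema(marker: dict) -> bool:
    """Check that every required field is present with the expected type."""
    return all(isinstance(marker.get(k), t) for k, t in REQUIRED_FIELDS.items())


def test_rank_descending():
    assert rank_descending({"a": 0.2, "b": 0.9, "c": 0.5}) == ["b", "c", "a"]


def test_haversine_quarter_equator():
    # A quarter of the way around the equator is (pi / 2) * radius.
    expected = math.pi / 2 * 6371.0
    assert math.isclose(haversine_km(0, 0, 0, 90), expected, rel_tol=1e-9)


def test_schema():
    good = {"name": "Example", "lat": 51.0, "lon": -1.0, "along_km": 12.5}
    assert matches_schema(good)
    assert not matches_schema({"name": "Example"})


if __name__ == "__main__":
    test_rank_descending()
    test_haversine_quarter_equator()
    test_schema()
```

Every assertion here has a single right answer, which is exactly what the next paragraph's questions lack.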

It cannot tell you whether Route 66 should show a small-town museum instead of another nearby populated place. It cannot tell you whether a trail through Iceland should emphasize waterfalls, villages, volcanoes, historic sites, or a balance of all four. It cannot tell you whether a runner will feel delighted, bored, or confused.

Those are not excuses to stop testing. They are reasons to test the mechanical layers harder, then evaluate the product layer differently.

Per-Route Tuning Is Not Failure

The project eventually produced route-specific JSON artifacts that could be version controlled. That is a strong boundary. Raw source dumps are too large and too noisy. Generated per-route outputs are small enough to review, diff, and adjust.

That is also where one-size-fits-all ranking broke down.

Different routes have different personalities. A route through dense cities can become a population map unless populated places are downweighted or spacing rules are applied. A rural or wilderness-heavy route may need natural features to rank higher. A route with famous monuments clustered in one city needs a way to avoid spending all its marker budget in one area.

The answer was not a universal formula. It was parameters: population filters, feature-code weights, LLM-score weights, wiki-count weights, and geographic radius rules. The pipeline needed knobs because the product needed taste.
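Those knobs could be captured in a small, reviewable config object. The sketch below is an assumption about shape only: the field names, default values, and the two example routes are invented here to show per-route overrides on top of shared defaults.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RouteParams:
    # All defaults are illustrative, not the project's real values.
    min_population: int = 50_000
    populated_place_weight: float = 1.0
    natural_feature_weight: float = 1.0
    llm_weight: float = 1.0
    wiki_count_weight: float = 1.0
    min_spacing_km: float = 25.0


DEFAULTS = RouteParams()

# A dense urban route: fewer, larger settlements so towns do not flood the map.
dense_city_route = replace(DEFAULTS, min_population=250_000, populated_place_weight=0.4)

# A wilderness-heavy route: favor natural features, allow smaller settlements.
wilderness_route = replace(DEFAULTS, min_population=5_000, natural_feature_weight=1.5)


def to_json(params: RouteParams) -> str:
    """Stable serialization so parameter changes show up as clean diffs."""
    return json.dumps(asdict(params), indent=2, sort_keys=True)
```

Keeping overrides as explicit differences from `DEFAULTS` makes each route's tuning easy to see in review: anything not mentioned behaves the same as everywhere else.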

That is not a hack. It is an honest representation of the domain.

Recommendation systems, search ranking, fraud filters, map labeling, feed algorithms, and moderation queues all end up here. You can measure pieces of the system, but the final judgment involves tradeoffs. Precision and recall are useful, but they are not the same thing as “this feels right to a human using the product.”

What Engineers Should Take From It

The most useful part of this story is not that an LLM hallucinated. Everyone knows that by now.

The useful part is the architecture that survived the hallucination.

The factual substrate came from public datasets and known sources. The expensive subjective step was isolated. The outputs became artifacts that could be inspected. Debug tooling existed alongside the pipeline, including SQL queries and map visualizations. The model was allowed to help, but not allowed to own truth.

That is the pattern worth copying.

If an AI feature is operating over real-world data, split the problem into layers:

Facts should come from sources you can name.
Transformations should create intermediate artifacts you can inspect.
Generated text should be treated as risky unless it is grounded and checked.
Subjective ranking can use model judgment, but it should be bounded and combined with other signals.
Product quality needs human review, route samples, visual inspection, and feedback loops.

The lesson is not anti-AI. It is anti-magical-thinking.

LLMs are useful because they contain a lot of fuzzy judgment. That same fuzziness makes them dangerous when a product needs factual precision. In this project, the model failed as a guidebook writer but helped as a taste signal. The difference is the contract.

When the contract is “tell me what is true,” the model needs grounding, citations, and verification. When the contract is “help me rank which of these already-known places might be more interesting,” the model can be useful even when it is not authoritative.

The hard part is knowing which contract you are signing.

That is why this kind of system will never be finished by unit tests alone. The tests can protect the pipeline. They can keep geometry, joins, schemas, and API responses from breaking. They can make the work repeatable.

But the last mile is taste: what to show, what to hide, what to weight, what to override, and what kind of journey the map is trying to create.

You can test the machinery. You still have to look at the map.