Cheapest checks first: A 64-cent, multi-model pipeline that audits itself.
When your hotel database thinks "Game Room, Deck & Yard: Chicago Home" is a hotel, you have a data quality problem. When it happens across 212 cities in 25 countries, it isn't a travel problem; it's an automated-systems problem. You need machines checking machines.
I'm a solo founder building Tripvento, a B2B hotel ranking API. My pipeline ingests hotel data from multiple third-party sources, enriches it with points of interest, scores everything across 14 traveler personas, and publishes rankings. Every step of that pipeline produces errors. Vacation rentals disguised as hotels. Fake property names. Hostels mixed in with five-star resorts. Hotels scoring well for "family with toddlers" despite having no playgrounds within a mile.
No single model catches everything. So I built a pipeline where models audit each other, and where the cheapest checks run first.
The Pipeline
The architecture is straightforward. Data flows through four phases, and every phase has a gate that can halt the pipeline:
Phase 1 — Ingest and Transform. Raw hotel data comes in from multiple sources. A lightweight LLM structures the messy metadata: normalizes hotel names, extracts amenities, assigns star ratings when they're missing or inconsistent. This model is cheap and fast because it runs on every single hotel record.
Phase 2 — Enrich. Points of interest get loaded from geospatial sources. Each hotel gets scored on what's physically around it — restaurants, parks, transit stops, nightlife, grocery stores — using PostGIS spatial queries. A separate scoring pass uses a different LLM to evaluate each hotel against 14 traveler personas based on the hotel's own description and attributes.
Phase 3 — Fuse and Rank. Geospatial scores and semantic scores get fused into a single Smart Score per hotel per persona. Market signals, like rating trends and price positioning relative to the neighborhood, get layered on top.
Phase 4 — Validate. This is where the real quality control happens, and it's where the multi-model architecture earns its keep.
Here's the skeleton of how the pipeline chains steps and gates together:
```python
def run(self, skip_ingest=False):
    """Run the full pipeline."""
    # phase 1: ingest
    if not self.step_ingest_hotels():
        raise PipelineError("Hotel ingestion failed")
    if not self.step_transform_hotels():
        raise PipelineError("Hotel transformation failed")
    if not self.step_load_staging_hotels():  # ← gate: min hotel count, dupe check
        raise PipelineError("Staging hotel loading failed")

    # phase 2: enrich
    if not self.step_load_pois():            # ← gate: min POI count, type diversity
        raise PipelineError("POI loading failed")
    if not self.step_llm_scoring():          # ← gate: zero rate, completeness
        raise PipelineError("LLM scoring failed")
    if not self.step_geo_scoring():          # ← gate: score count vs expected
        raise PipelineError("Geo scoring failed")

    # phase 3: fuse
    if not self.step_fuse_scores():          # ← gate: variance, distribution
        raise PipelineError("Score fusion failed")

    # phase 4: validate
    if not self.step_sniff_test():           # ← AI validates final rankings
        raise PipelineError("Sniff test failed")
```
Every `step_` method runs a command, then calls a validator. If the validator fails, the step returns `False` and the pipeline halts. No bad data makes it downstream.
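The run-then-validate pattern inside each step can be sketched like this. This is a minimal, self-contained illustration, not Tripvento's actual code: the `Step` class, its stub `run_command`, and the toy score list are all stand-ins.

```python
class Step:
    """Sketch of the pattern each step_ method follows:
    run a command, then gate on a validator (illustrative, not production code)."""

    def __init__(self, dest: str, scores: list[float]):
        self.dest = dest
        self.scores = scores

    def run_command(self, name: str) -> bool:
        # Stand-in for invoking a pipeline command; assume it succeeds here.
        return True

    def validate_score_distribution(self) -> tuple[bool, str]:
        # Toy version of the zero-rate gate shown later in the article.
        zeros = sum(1 for s in self.scores if s == 0)
        rate = zeros / len(self.scores)
        if rate > 0.3:
            return False, f"Too many zeros ({rate:.0%})"
        return True, f"Distribution OK ({rate:.0%} zeros)"

    def step_fuse_scores(self) -> bool:
        """Run the command, then gate on the validator."""
        if not self.run_command("fuse_scores"):
            return False          # the command itself failed
        ok, msg = self.validate_score_distribution()
        print(f"[fuse_scores] {msg}")
        return ok                 # False halts the pipeline at this gate
```

The point is the shape, not the details: every step returns a boolean, and the caller never has to know whether the command or the validator was the thing that failed.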
Layer 1: Rule-Based Gates (Free)
Before any AI touches the output, rule-based validators run at every stage. They check the basics:
Are there enough hotels? Is the address enrichment rate above 70%? Do we have at least 10 POI categories with 5+ entries each? Is the score variance high enough, or is everything suspiciously uniform?
These checks are instant and cost nothing. They catch about 60% of problems: the obvious ones, like empty datasets, broken ingestion, or scoring runs that produced all zeros.
```python
def validate_score_distribution(self, dest) -> tuple[bool, str]:
    """Check the score distribution isn't degenerate."""
    total = StagingHotelIntent.objects.filter(
        hotel__destination=dest
    ).count()
    if total == 0:
        return False, "No intents"

    zeros = StagingHotelIntent.objects.filter(
        hotel__destination=dest, final_score=0
    ).count()
    zero_rate = zeros / total
    if zero_rate > 0.3:
        return False, f"Too many zeros: {zeros}/{total} ({zero_rate:.0%})"

    max_scores = StagingHotelIntent.objects.filter(
        hotel__destination=dest, final_score__gte=99
    ).count()
    max_rate = max_scores / total
    if max_rate > 0.3:
        return False, f"Too many max scores: {max_scores}/{total}"

    return True, f"Distribution OK: {zero_rate:.0%} zeros, {max_rate:.0%} max"
```
If more than 30% of scores are zero, something broke. If the standard deviation is below 5, the scoring logic isn't differentiating hotels. Both are cheap to detect, and they halt the pipeline before expensive AI validation wastes money on garbage data.
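The standard-deviation gate mentioned above isn't shown in the listing, but a plain-Python sketch of it might look like this (the function name and the list-of-floats interface are illustrative; the 5.0 threshold matches the orchestrator's `min_score_std` rule shown later):

```python
import statistics

MIN_SCORE_STD = 5.0  # matches the orchestrator's min_score_std rule

def validate_score_variance(scores: list[float]) -> tuple[bool, str]:
    """Halt if scores are suspiciously uniform (sketch of the std-dev gate)."""
    if len(scores) < 2:
        return False, "Not enough scores to measure variance"
    std = statistics.pstdev(scores)  # population std dev over all scores
    if std < MIN_SCORE_STD:
        return False, f"Scores too uniform: std={std:.1f} < {MIN_SCORE_STD}"
    return True, f"Variance OK: std={std:.1f}"
```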
Other gates check POI coverage, address enrichment rates, and semantic score completeness:
```python
from django.db.models import Count

def validate_poi_coverage(self, dest) -> tuple[bool, str]:
    """Check POI type diversity and location coverage."""
    total = StagingPoi.objects.filter(destination=dest).count()
    type_counts = StagingPoi.objects.filter(
        destination=dest
    ).values('poi_type').annotate(count=Count('id')).order_by('-count')

    MIN_POI_PER_TYPE = 5
    MIN_TYPES_REQUIRED = 10
    types_with_data = [t for t in type_counts if t['count'] >= MIN_POI_PER_TYPE]
    if len(types_with_data) < MIN_TYPES_REQUIRED:
        return False, f"Only {len(types_with_data)} POI types with 5+ entries"

    with_location = StagingPoi.objects.filter(
        destination=dest, location__isnull=False
    ).count()
    if with_location < total:
        return False, f"{total - with_location} POIs missing location points"

    return True, f"{len(types_with_data)} POI types, all with coordinates"
```
Every validator returns a pass/fail tuple with a human-readable message. The pipeline checks these after each stage and halts on failure, because there's no point running expensive LLM scoring on a dataset with missing coordinates.
Simple as they are, these gates do most of the heavy lifting before a single model is called.
Layer 2: The AI Auditor (Runs Once Per City)
The rule-based gates can't catch a vacation rental pretending to be a hotel. For that, I use a more capable model that reviews every hotel in the destination and flags anything suspicious.
Here's what it caught in Chicago: 28 flags out of roughly 200 hotels. That's 14% of the data that would have polluted the rankings:
| name | reason | reason_detail | confidence | source |
|---|---|---|---|---|
| Game Room, Deck & Yard: Chicago Home | vacation_rental | Amenity-focused name typical of Airbnb/VRBO | high | ai |
| Kasa Magnificent Mile Chicago | vacation_rental_company | Kasa is a known managed rental company | high | ai |
| Logan Square SRO Hotel | not_a_hotel | SRO is long-term housing, not hotel | high | ai |
| Hotel BnB-3 | invented_name | Generic name | high | ai |
| Loews hotel chicago | duplicate | Duplicate of Loews Chicago Hotel (id: 3843) | high | ai |
| Sentral Michigan Avenue Chicago Apartments | not_a_hotel | Explicitly labeled as apartments | high | ai |
Vacation rentals that leaked in. "Game Room, Deck & Yard: Chicago Home" is an Airbnb listing with an amenity-focused name. "Phill hill mansion" is a private residence. "New & Modern Lux City Escape" is marketing copy with a unit number, a classic VRBO pattern.
Known rental companies. Kasa had three properties in the dataset. The auditor recognized the brand and flagged all three as a managed rental company, not a hotel.
Institutional housing masquerading as hotels. "Logan Square SRO Hotel" and "northmere the sro hotel" — SRO stands for Single Room Occupancy, which is long-term housing. The auditor caught the designation.
Invented names. "Hotel BnB-3": a generic name with a number suffix that doesn't correspond to any real property.
Duplicates. "Loews hotel chicago" flagged as a duplicate of "Loews Chicago Hotel" — same property, different casing and word order.
This model is expensive, but it only runs once per destination. At 212 cities, that's 212 auditor calls total, not 212 times the hotel count.
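A once-per-city audit pass might be wired up like this. Everything here is illustrative: `call_llm` stands in for whatever client you use, and the prompt wording and flag schema merely mirror the table above, not Tripvento's actual prompt.

```python
import json

def audit_destination(hotels: list[dict], call_llm) -> list[dict]:
    """Sketch: one capable-model call per city, returning a list of flags.
    `call_llm` is a stand-in for your LLM client (prompt in, text out)."""
    names = "\n".join(f"- {h['name']}" for h in hotels)
    prompt = (
        "Review these hotel names for a single city. Flag vacation rentals, "
        "rental companies, SRO/long-term housing, invented names, and "
        "duplicates. Respond with a JSON list of objects with keys "
        '"name", "reason", "reason_detail", "confidence".\n' + names
    )
    flags = json.loads(call_llm(prompt))
    # One call per destination, regardless of how many hotels it has.
    return [{**f, "source": "ai"} for f in flags]
```

Batching the whole city into one call is what keeps this layer affordable: the cost scales with destinations, not with records.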
Layer 3: The AI Sniffer (Validates Rankings)
The auditor catches bad input. The sniffer catches bad output — rankings that don't make sense even though the individual scores look fine.
It reviews the final rankings for each of the 14 traveler personas and flags anomalies. Here's what a sniffer report looks like:
```json
{
  "overall_status": "PASS",
  "overall_score": 85,
  "intent_results": [
    {
      "intent": "family_with_toddlers",
      "status": "WARN",
      "score": 70,
      "issues": [
        "Top hotel AXIS in Elsdon shows very limited family amenities - only 12 parks, no playgrounds, museums, or family attractions"
      ],
      "verdict": "Hotel with minimal family POIs scores highest. Other hotels show better family infrastructure but lower scores."
    },
    {
      "intent": "wellness_retreat",
      "status": "PASS",
      "score": 88,
      "issues": [],
      "verdict": "Correctly shows low scores (36-43) reflecting Chicago's limited wellness resort options."
    }
  ]
}
```
Across 8 cities I audited, it caught 6 warnings:
A hotel with no toddler amenities ranking #1 for families. In Chicago, a hotel called AXIS in the Elsdon neighborhood scored highest for "family with toddlers" despite having only 12 parks nearby, no playgrounds, no museums, and no family attractions. The sniffer flagged it: the geo data was technically valid, but the ranking didn't make sense for that persona.
The same pattern in a different city. In Providence, multiple hotels had empty geospatial details but were still scoring above 50 for the toddler persona. The sniffer caught the data gap that the rule based checks missed. The scores existed, they just weren't backed by real location data.
An algorithm over-penalizing a category. In St. Louis, the sniffer flagged that family hotels were showing "surprisingly low proximity scores to family relevant amenities" — not a data problem, but a scoring logic problem. The algorithm was weighting certain POI types too heavily. This is something no rule-based system would catch, because the numbers were technically valid.
Validating that low scores are correct. In both Chicago and Milwaukee, the sniffer confirmed that wellness retreat scores were appropriately low since these are urban cities, not spa destinations. A max score of 43 out of 100 for wellness in Chicago is correct, not a bug. This prevents false positives from triggering unnecessary investigations.
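Consuming a sniffer report like the one above could be as simple as the sketch below. The function name and the fail-below-60 threshold are my assumptions for illustration; the report schema comes from the example.

```python
def evaluate_sniff_report(report: dict, fail_below: int = 60) -> tuple[bool, list[str]]:
    """Sketch: FAIL statuses or low per-intent scores halt the pipeline;
    WARNs pass through but get surfaced for review. Threshold is illustrative."""
    warnings = []
    for result in report.get("intent_results", []):
        if result["status"] == "FAIL" or result["score"] < fail_below:
            # a hard failure halts publishing for this city
            return False, [f"{result['intent']}: {result['verdict']}"]
        if result["status"] == "WARN":
            warnings.append(f"{result['intent']}: {'; '.join(result['issues'])}")
    return True, warnings
```

The asymmetry is deliberate: a WARN costs a human a glance, while a FAIL costs a blocked publish, so the bar for FAIL is higher.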
Layer 4: The Orchestrator (LLM on Failures Only)
When a pipeline run fails, the orchestrator decides what to do. In manual mode, it just reports. In auto mode, it applies fixed rules: retry on transient errors like timeouts, roll back on data corruption, skip after max retries.
In smart mode, it sends the failure context to a cheap, fast model that investigates and recommends one of four actions: retry, rollback, skip, or escalate to a human. This costs roughly two cents per failure investigation, and most pipeline runs don't fail, so the total cost is negligible.
```python
# rule-based thresholds (no LLM needed)
RULES = {
    'max_retries': 2,
    'min_hotels': 20,
    'min_score_std': 5.0,          # flag if scores too uniform
    'max_zero_rate': 0.3,          # flag if >30% zeros
    'auto_rollback_on_fail': True,
}

def _rule_based_decision(self, slug, error, retry_count):
    """Make a decision based on rules (no LLM)."""
    transient_keywords = ['timeout', 'connection', 'rate limit', '503', '502']
    is_transient = any(kw in error.lower() for kw in transient_keywords)
    if is_transient and retry_count < RULES['max_retries']:
        return {'action': 'RETRY', 'reason': 'Transient error detected'}
    if RULES['auto_rollback_on_fail']:
        return {'action': 'ROLLBACK', 'reason': 'Auto-rollback on failure'}
    return {'action': 'SKIP', 'reason': 'Max retries exceeded'}
```
The LLM only gets involved when the rules can't decide. The investigation prompt includes the destination name, the step that failed, the error message, current database state, and recent pipeline history. The model responds with a JSON recommendation:
```json
{
  "action": "RETRY",
  "confidence": 0.85,
  "reason": "Ingest returned 503 — likely transient rate limit",
  "details": "Previous run for this destination succeeded 2 days ago with same config"
}
```
Below a certain confidence threshold, the orchestrator ignores the recommendation and escalates to me instead.
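The confidence gate can be sketched in a few lines. The 0.7 floor and the function name are illustrative, not the production values:

```python
CONFIDENCE_FLOOR = 0.7  # illustrative threshold, not the production value

def apply_llm_decision(decision: dict) -> dict:
    """Sketch: accept the model's recommendation only above a confidence
    floor; otherwise escalate to a human."""
    confidence = decision.get("confidence", 0.0)
    if confidence < CONFIDENCE_FLOOR:
        return {"action": "ESCALATE",
                "reason": f"Low confidence ({confidence:.2f}), needs a human"}
    return decision
```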
What This Architecture Actually Costs
The economics of the layered approach matter. Rule-based gates are free. The lightweight model that transforms and scores hotel data costs fractions of a cent per hotel. The expensive auditor model runs once per city. The sniffer validates rankings once per city. The orchestrator investigates only on failures.
For a city like Chicago with 200 hotels (at the time of that run; 283 today) and 14 personas, the total LLM cost for a full pipeline run is 64 cents. The scoring pass is the most expensive step at 36 cents, because it runs a lightweight model across every hotel for every persona. The auditor and sniffer combined cost less than 20 cents, because each runs once. The transformer that structures raw hotel data is about 8 cents. At 212 cities, the entire validation infrastructure costs less than what most startups spend on a single Jira board.
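The arithmetic above fits in a back-of-envelope script. The per-step figures come from the article; the auditor/sniffer split is not given, so their combined "less than 20 cents" is treated as 20 here:

```python
# Back-of-envelope cost model using the Chicago figures from the article.
# The auditor/sniffer number is their combined upper bound, not a measured split.
COSTS_USD = {
    "transform": 0.08,            # lightweight model, every hotel record
    "persona_scoring": 0.36,      # lightweight model x hotels x 14 personas
    "auditor_and_sniffer": 0.20,  # two capable-model calls, once per city
}

per_city = sum(COSTS_USD.values())
print(f"per city: ~${per_city:.2f}, all 212 cities: ~${per_city * 212:.0f}")
```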
The Pattern
The lesson isn't about travel data or hotel rankings. It's about layered trust in automated systems.
Don't run your most expensive model on every record. Stack your checks from cheapest to most expensive: rule-based validators first, then lightweight AI for high-volume tasks, then capable models for low-frequency audits, then human review for edge cases.
Most problems get caught at the cheapest layer. The expensive layers exist for the subtle issues: vacation rentals pretending to be hotels, algorithms that technically work but produce nonsensical rankings, and data gaps that look like valid scores.
And when you find something the rules should have caught, add the rule. The AI auditor's job is to get smaller over time, not bigger. Every pattern it detects should eventually become a rule based check that runs for free.
I scaled from 3 cities to 212 without a QA team because each layer catches what the layer below it misses. If you're running a single model with no validation layer, you're not shipping fast; you're shipping garbage with confidence.
The pipeline doesn't need to be perfect. It just needs to know when it's wrong.
For the record — it's not perfect and still very much a WIP. I still have hallucinated hotel pages and bad data that slipped through. But I know where they are, and the layers are getting tighter with every run. That's the point. You don't build a flawless pipeline on day one. You build one that tells you where it's failing, then you fix the cheapest layer first.
Perfect systems don’t scale. Layered systems do.
Ioan Istrate is the founder of Tripvento, a B2B hotel ranking API that scores properties by traveler intent using geospatial intelligence. He previously worked on ranking systems at U.S. News & World Report, and has served as Head TA for Georgia Tech’s Graduate Operating Systems course. Connect with him on LinkedIn.