Ioan G. Istrate

Posted on • Originally published at blog.tripvento.com

Scaling to 200+ Cities by Deleting 90% of My Database

The Problem

I'm building Tripvento, a hotel ranking API that scores properties against 14 traveler personas using geospatial intelligence. The core question the engine answers: what's within walking distance of this hotel, and does it match what this type of traveler actually needs?

To answer that, I was storing every hotel to POI (point of interest) relationship as a row in my database. Hotel near a restaurant? That's a row. Hotel near a park? Another row. Hotel near a nightclub, a gym, a subway station? More rows.

Tripvento started with 3 cities — Charleston, Savannah, and Asheville — with 298 hotels. Even at that scale, the edge table was already at 180,000 rows and growing fast. By the time I hit 33 cities with 4,607 hotels, my HotelPOI table had 55.6 million rows and weighed 11GB — 91% of my entire 12GB database.

I did the napkin math on scaling to 200+ cities. At ~1.7 million edges per city, 212 destinations would mean roughly 360 million rows and a table north of 70GB. I stopped writing.
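
That napkin math, as a quick sanity check using the counts above:

```python
# Back-of-the-envelope projection from the observed 33-city numbers.
rows_33_cities = 55_589_063           # HotelPOI rows at 33 cities
edges_per_city = rows_33_cities / 33  # ~1.7 million edges per city
projected_rows = edges_per_city * 212

print(f"{projected_rows / 1e6:.0f} million rows")  # prints "357 million rows"
```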

The Naive Architecture

Here's the model I eventually deleted:

class StagingHotelPoi(models.Model):
    """
    Pre-calculated proximity relationships between hotels and POIs.
    This should make queries fast: a key lookup instead of a
    spatial calculation.
    """
    hotel = models.ForeignKey(StagingHotel, on_delete=models.CASCADE)
    poi = models.ForeignKey(StagingPoi, on_delete=models.CASCADE)

    # distance in meters (calculated once, stored forever)
    distance_meters = models.FloatField()
    distance_km = models.FloatField(editable=False)

    # for intent-based weighting
    relevance_score = models.FloatField(default=1.0)
    calculated_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        unique_together = [['hotel', 'poi']]
        indexes = [
            models.Index(fields=['hotel', 'distance_meters']),
            models.Index(fields=['poi']),
            models.Index(fields=['distance_km']),
        ]

Read that docstring again: "calculated once, stored forever." That was the problem in one line. I was treating a many-to-many spatial relationship as a static materialized view because I feared the overhead of computing distances on the fly. So I persisted every hotel-to-POI edge — with three indexes on top — for a computation that only happened once during batch scoring.

The pipeline worked like this:

  1. Ingest hotels and POIs for a city

  2. For every hotel, calculate the Haversine distance to every POI within a radius

  3. Store each relationship as a row with distance, category, and a relevance score

  4. At scoring time, look up the pre-stored edges and compute the geo score
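
Step 2, the Haversine distance, can be sketched in plain Python (an illustrative helper, not Tripvento's actual pipeline code):

```python
import math

def haversine_meters(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lng points, in meters."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

Running this for every hotel against every POI in a radius is exactly the O(hotels × POIs) work the edge table was trying to amortize.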

It was fast at query time because everything was pre-joined. The geo scorer just did a filtered lookup on StagingHotelPoi, grouped by category, and weighted the distances. Simple.

But the storage was brutal. Each row carried two foreign keys, two distance fields, a relevance score, a timestamp, a unique constraint, and three indexes. At 55.6 million rows, the table and its indexes ate 11GB — 91% of my entire database. Everything else — hotels, POIs, scores, images — fit in the remaining 1GB.

And the scaling math was ugly. POI density doesn't grow linearly with city count — it explodes. My 33 cities were mostly mid-size markets. New York City alone has over 30,000 restaurants. If I'd stayed on this path, adding a few major metros would have pushed the table past 100GB — forcing a vertical tier jump on my droplet just to keep the indexes in memory. I was a handful of cities away from an infrastructure-forced pivot, and I hadn't even launched yet.

The Insight

Here's what I realized: I don't need to store that a hotel is 437 meters from a Thai restaurant. I need to ask that question once, at scoring time, and then throw away the intermediate data.

The only thing that matters downstream is the final geo score — a single float per hotel per persona. Everything between "here's a hotel" and "here's its geo score" is intermediate computation. I was materializing millions of rows of intermediate state that got consumed once and never queried again.

The fix was obvious once I saw it: let PostGIS do what it's built for. I'd had PostGIS enabled from day one — I just hadn't needed spatial indexing yet because the materialized edge table worked fine at 3 cities. At 33 cities, the storage model forced the decision.

The Migration

I replaced the stored edge table with spatial queries using PostGIS's ST_DWithin backed by a GiST index on the geometry columns.

Hotels already had lat/lng. I added a PostGIS PointField alongside them:

# note: GeoDjango's models (django.contrib.gis.db), not plain django.db models
latitude = models.DecimalField(max_digits=9, decimal_places=6)
longitude = models.DecimalField(max_digits=9, decimal_places=6)
location = models.PointField()

The old geo scoring step looked something like:

# look up pre-stored edges
nearby = StagingHotelPoi.objects.filter(
    hotel=hotel,
    distance_meters__lte=radius
).select_related('poi')

The new version queries POIs directly with ST_DWithin on the geography type:

SELECT 
    id, name, poi_type, quality_tier, popularity_tier,
    ST_Distance(location::geography, %s::geography) AS distance_meters
FROM staging_poi
WHERE destination_id = %s
  AND location IS NOT NULL
  AND ST_DWithin(location::geography, %s::geography, %s)

Django's PointField creates this automatically, but this is what makes ST_DWithin fast under the hood:

CREATE INDEX idx_staging_poi_location ON staging_poi USING GIST (location);

Same logic. Same output. The ::geography cast means ST_DWithin works in meters natively — no Haversine math, no unit conversion. The GiST spatial index on location makes the bounding-box pre-filter fast, and ST_Distance gives me the exact distance for scoring.
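
Conceptually that's a two-phase lookup: the index answers a cheap bounding-box question first, and the exact distance check runs only on the survivors. A toy plain-Python illustration of the idea (not the production query path; the helper names are mine):

```python
import math

EARTH_R = 6_371_000  # mean Earth radius, meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Exact great-circle distance in meters (the ST_Distance analogue)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_R * math.asin(math.sqrt(a))

def pois_within(pois, lat, lon, radius_m):
    """Box pre-filter (the part a GiST index accelerates), then exact check."""
    pad_lat = radius_m / 111_000  # ~meters per degree of latitude
    pad_lon = pad_lat / max(math.cos(math.radians(lat)), 1e-6)
    box = [p for p in pois
           if abs(p["lat"] - lat) <= pad_lat and abs(p["lon"] - lon) <= pad_lon]
    return [p for p in box if haversine_m(lat, lon, p["lat"], p["lon"]) <= radius_m]
```

The index makes the first phase cheap enough that recomputing the second phase on every batch run costs almost nothing.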

The hard part wasn't the migration. It was convincing myself to DROP TABLE on 55+ million rows.

The Tradeoff

My first instinct was to partition the edge table instead of dropping it. But partitioning 55.6 million rows across city-based partitions still meant the same storage overhead — I'd just be organizing the bloat, not eliminating it.

I want to be honest about what changed and what didn't.

What got slower: Precompute time. When the pipeline scores a new city, each hotel now triggers a spatial query against the POI table instead of a simple lookup on pre-stored edges. The geo scoring batch job takes more CPU cycles per hotel. I benchmarked ST_DWithin + GiST against the materialized edge lookup and confirmed it stayed well within batch SLOs.

What stayed the same: API response time. The travel platform integrating our API hits the rankings endpoint and gets a sorted list of hotels with Smart Scores in under 250ms. That response comes from pre-computed scores stored on the hotel record, not from live spatial queries. The ST_DWithin work happens once during ingestion.

What got dramatically better: Everything else.

This was a system rebalancing: I traded disk and RAM (keeping 55M rows and their indexes in memory) for slightly more CPU cycles during a non-critical batch window. That's a trade any founder should make 10 out of 10 times. Disk and RAM cost money every second; CPU cycles during a batch job at 4 AM cost nothing.

The entire HotelPOI table — 55.6 million rows, 11GB — is gone. The database went from 12GB at 33 cities to 5.4GB at 212 cities with 24,000+ hotels.

Let me say that differently: I scaled the number of destinations by 6.4x and the database got smaller by more than half.

Why It Worked

The key insight is about where you put the computational cost.

Precompute time is a batch job. It runs once when a new city is ingested. Nobody is waiting on it in real time. If it takes 20 minutes instead of 8 minutes, nobody cares. It's a cron job running at 4 AM.

Query time is what the customer feels. That has to be fast. And it's just as fast as before, because the API serves pre-computed scores, not spatial queries.

I was optimizing the wrong side of the pipeline. I had fast reads on data I didn't need to persist, at the cost of storing millions of rows that were consumed exactly once.

The Numbers

| Metric | Before (Stored Edges) | After (PostGIS Spatial) |
|---|---|---|
| Cities | 33 | 212 |
| Hotels | 4,607 | 24,096 |
| HotelPOI rows | 55,589,063 | 0 |
| HotelPOI table size | 11 GB | 0 |
| Total database size | 12 GB | 5.4 GB |
| API response time | <250ms | <250ms |
| Infrastructure cost | Growing | Stable despite 6.4x scale |

The Lesson

Sometimes scaling means storing less, not more.

This pattern shows up everywhere. Precomputing edges in recommendation systems. Materializing joins in analytics pipelines. Caching intermediate state because it "feels faster." Sometimes the real optimization is deleting the table and trusting the index.

The broader pattern is misplacing state. If intermediate computation is consumed once and discarded, persisting it is often architectural debt disguised as optimization.

Every pre-computed table is a bet that the cost of storage and maintenance is worth the read-time savings. For my use case, it wasn't. The reads happened once during a batch job, and I was paying for 55.6 million rows of intermediate state that had zero value after scoring completed.

PostGIS didn't make my system faster. It made my system leaner — which let me scale 6x on the same infrastructure a solo founder can manage and afford.

If you're building something that does heavy spatial computation, think carefully about what you're materializing. Not every join needs to be a table.


I'm Ioan Istrate, founder of Tripvento — a ranking API that scores hotels by traveler intent using geospatial intelligence. Previously worked on ranking systems at U.S. News & World Report. If you're working on something similar or want to nerd out about PostGIS, find me on LinkedIn.

Top comments (2)

chovy

Great write-up. Hit the same wall with spatial data on a park discovery tool. The napkin math on POI density in major metros is spot on -- catches people off guard every time. Smart move catching it at 33 cities. Curious about scoring accuracy post-migration.

Ioan G. Istrate

Thank you! The major metros are what break the model. Scoring accuracy is identical, or nearly so: we're computing the same spatial relationships, just with ST_DWithin instead of pre-calculated Haversine distances. The results are equivalent; we simply stopped storing them. Curious, what is your park discovery tool built on?