Part 2 of 5. Part 1 covered the RAM crashes and data ingestion nightmare. This part is about what happens after the data is in the database — and why having data is not the same as having a site.
So the data was in PostgreSQL. The RAM crashes were behind me. 105,757 locations, deduplicated, sitting in a clean table.
I made the mistake of feeling good about this.
Within a day I was staring at the database thinking: these are just rows. A name, a coordinate, a source tag. There's nothing here that would make a person stop scrolling. I hadn't built a resource — I had built a very expensive spreadsheet.
The gap between "data in a table" and "page worth reading" turned out to be the actual project.
Everything needs an anchor
The plan was to enrich every location. What species live there. What plants. What wildlife hazards. What the geology looks like. What's happened in that area that's strange or interesting. All of it needs to be attached to a location somehow — and that "somehow" turns out to matter enormously.
For most enrichment data, the answer is a foreign key. If I have a coordinate for a trailhead and a coordinate for a recorded species sighting, I calculate the distance. If it's within a reasonable threshold, that sighting belongs to that location. Clean, precise, fast to query.
I built 16 enrichment tables on this model, among them species, plants, wildlife, astronomy ratings, sun data, climate normals, flood zones, elevation profiles, weather stations, native land territories, tidal data, and tick risk zones. The biological tables alone ended up at over 100,000 rows each — location_species at 102,719, location_plants at 100,610, location_wildlife at 101,222. Link them via FK, and a page for any trail can instantly surface what lives there.
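A minimal sketch of what that distance-based FK assignment looks like. The table and column names here are hypothetical (the post doesn't show the actual schema), the 1 km threshold is a made-up example, and geography-typed columns are assumed so that PostGIS distances come out in metres:

```sql
-- Hypothetical sketch: link each recorded sighting to every location
-- within a threshold distance. With geography-typed geom columns,
-- ST_DWithin's third argument and ST_Distance's result are in metres.
INSERT INTO location_species (location_id, species_id, distance_m)
SELECT l.id, s.id, ST_Distance(l.geom, s.geom)
FROM locations l
JOIN species_sightings s          -- assumed staging table of raw sightings
  ON ST_DWithin(l.geom, s.geom, 1000);  -- 1 km: illustrative, not the real value
```

The join-on-ST_DWithin form lets a spatial index do the filtering, rather than computing every pairwise distance and discarding most of them.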
The architecture felt elegant. I was pleased with myself in the way that only precedes something going wrong.
When a coordinate is a lie
Then I got to the data that doesn't have coordinates.
UFO sightings. Bigfoot reports. Filming locations. Historical wildfire perimeters. Fossil dig sites. These sources give you a county, a town, sometimes just a general region. "Reported in Northern Arizona." "Sighting near the Colorado border." There's no GPS pin. There's no precise point.
I spent an embarrassing amount of time trying to force these into the foreign key model anyway. Geocode the county centroid, find the nearest location, assign it. It worked technically. It was also completely dishonest — telling a user that a UFO was sighted at a specific trailhead because that trailhead happened to be the closest row in my locations table.
I scrapped it and built a second model: grid cells.
Instead of mapping data to a location, I divided the map into cells and mapped data to whichever cell it fell into. Then the question changed from "is this species at this trail?" to "is this trail inside a grid cell that has a recorded UFO sighting?" Thirteen grid tables in total — grid_paranormal (3,814 rows), grid_filming, grid_geology, grid_meteorites, grid_shipwrecks, grid_karst, grid_watershed, grid_fire, grid_endangered_species, grid_minerals, grid_fossils, grid_phenology, grid_temperature_extremes.
It's a looser association. But it's honest. The site can say "this area has a history of paranormal reports" without claiming a ghost is standing at the 1.2-mile marker of a specific path. The distinction matters — you're not fabricating precision you don't have.
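The grid model can be sketched in a few lines. Everything specific here is an assumption: the cell size (0.1°, roughly 11 km of latitude), the `paranormal_reports` staging table, and the `cell_x`/`cell_y` columns — the post doesn't state how its grid is keyed:

```sql
-- Hypothetical sketch: bin reports into cells by flooring coordinates
-- to a fixed cell size. A cell claims only "somewhere in this area",
-- never a precise point.
INSERT INTO grid_paranormal (cell_x, cell_y, report_count)
SELECT floor(lon / 0.1)::int, floor(lat / 0.1)::int, count(*)
FROM paranormal_reports
GROUP BY 1, 2;

-- At render time the question becomes: does the trail's cell have reports?
SELECT g.report_count
FROM grid_paranormal g
WHERE g.cell_x = floor(:trail_lon / 0.1)::int   -- :params are placeholders
  AND g.cell_y = floor(:trail_lat / 0.1)::int;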
The overnight query I didn't time
Once I had the enrichment model working — FKs for precision, grids for general areas — I wanted to solve the discovery problem. If someone is looking at a park, they want to know what else is nearby.
Real-time spatial queries across 140,000 pages would kill the server. So I pre-calculated everything into a location_nearby table and just do indexed lookups at render time.
INSERT INTO location_nearby (location_id, nearby_id, distance)
SELECT a.id, b.id, ST_Distance(a.geom, b.geom)
FROM locations a, locations b
WHERE a.id != b.id
AND ST_DWithin(a.geom, b.geom, 16093.4); -- 10 mile radius (16,093.4 m)
Running this self-join across 105,000 locations is not a quick operation. I kicked it off, went and did other things, came back a few hours later to find it done. I didn't measure how long it actually took — which is the most honest thing I can say about a query you only run once. The table ended up at 1,396,359 rows. Every page render is now a simple indexed lookup rather than a live spatial calculation.
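The render-time side of that trade is just an indexed read against the precalculated table. The columns come from the post's own INSERT; the named parameter, the index, and the LIMIT are illustrative assumptions:

```sql
-- Assumes an index like: CREATE INDEX ON location_nearby (location_id, distance);
SELECT n.nearby_id, n.distance
FROM location_nearby n
WHERE n.location_id = :page_location_id  -- placeholder parameter
ORDER BY n.distance
LIMIT 10;
```

No PostGIS functions at all at this point — the spatial work was paid for once, up front.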
I did spend an entire afternoon before this trying to optimise the index type. Trying different strategies, benchmarking, reading PostGIS documentation. Then I remembered it was a static site. The fastest query is the one I already ran. I got that afternoon back in the sense that I learned something, and lost it in the sense that I'll never get those hours back.
The part I didn't expect
I had spent the majority of my time on the biological and geological data. The species tables, the plant taxonomy, making sure the mappings were tight and the data quality was reasonable. That was the "value" of the site in my head — scientific, useful, something a serious hiker would rely on.
Then I started testing actual pages.
The paranormal grid data — which took a fraction of the time to implement because lower resolution means fewer edge cases — was by far the most interesting part of every page it appeared on. People read past the species counts and the climate normals and stopped at the UFO section. The "curiosity" data that I'd treated as a fun side experiment was doing more for engagement than everything I'd agonised over.
I hadn't planned for that. I couldn't have planned for that. You can theorise about what users find interesting, but sometimes you just have to put the page in front of a person and watch where their eyes go.
Turns out people are more likely to click on "haunted" than on "karst geology." I'm choosing not to take this personally.
Where the model broke
The FK and grid approach worked for nearly everything. If it had a coordinate or a region, I could anchor it. Parks, species, paranormal reports, filming locations, fossils — all handled.
Then I ran into climate data. Not a point. Not a region. A continuous variable that shifted every few miles based on elevation, proximity to water, and a dozen other factors. I tried to force it into the FK model. The results were absurd — trails at 8,000 feet getting the climate profile of a weather station 40 miles away in a valley. I tried the grid approach. The resolution was too coarse to be useful.
The anchor model, which had handled everything else cleanly, had nothing to say about climate.
That's where Part 3 starts — how I spent three days building something wrong before I found the approach that actually worked.
Lessons:
- Having data in a database and having a site are two completely different problems. The first one took weeks. The second one took months.
- Don't force precision onto imprecise data. A grid cell that's honest beats a foreign key that's lying.
- Pre-calculate anything spatial that you'd otherwise query at render time. Your server will thank you. Your patience during the build will not.
- The feature you spent the least time on will be the one users care about most. Build the weird stuff.