<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Velk</title>
    <description>The latest articles on DEV Community by Velk (@velk).</description>
    <link>https://dev.to/velk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3846327%2Fe4223970-0732-41ef-b0ac-e6894bc92a5d.png</url>
      <title>DEV Community: Velk</title>
      <link>https://dev.to/velk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/velk"/>
    <language>en</language>
    <item>
      <title>Location as the Anchor: Mapping Species, Geology, UFOs, and Bigfoot to 105,000 Coordinates</title>
      <dc:creator>Velk</dc:creator>
      <pubDate>Sun, 19 Apr 2026 04:13:10 +0000</pubDate>
      <link>https://dev.to/velk/location-as-the-anchor-mapping-species-geology-ufos-and-bigfoot-to-105000-coordinates-2o4f</link>
      <guid>https://dev.to/velk/location-as-the-anchor-mapping-species-geology-ufos-and-bigfoot-to-105000-coordinates-2o4f</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of 5. &lt;a href="https://dev.to/velk/16gb-of-ram-12gb-of-json-and-one-very-loud-fan-63i"&gt;Part 1 covered the RAM crashes and data ingestion nightmare.&lt;/a&gt; This part is about what happens after the data is in the database — and why having data is not the same as having a site.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;So the data was in PostgreSQL. The RAM crashes were behind me. 105,757 locations, deduplicated, sitting in a clean table.&lt;/p&gt;

&lt;p&gt;I made the mistake of feeling good about this.&lt;/p&gt;

&lt;p&gt;Within a day I was staring at the database thinking: these are just rows. A name, a coordinate, a source tag. There's nothing here that would make a person stop scrolling. I hadn't built a resource — I had built a very expensive spreadsheet.&lt;/p&gt;

&lt;p&gt;The gap between "data in a table" and "page worth reading" turned out to be the actual project.&lt;/p&gt;

&lt;h2&gt;Everything needs an anchor&lt;/h2&gt;

&lt;p&gt;The plan was to enrich every location. What species live there. What plants. What wildlife hazards. What the geology looks like. What's happened in that area that's strange or interesting. All of it needs to be attached to a location somehow — and that "somehow" turns out to matter enormously.&lt;/p&gt;

&lt;p&gt;For most enrichment data, the answer is a foreign key. If I have a coordinate for a trailhead and a coordinate for a recorded species sighting, I calculate the distance. If it's within a reasonable threshold, that sighting belongs to that location. Clean, precise, fast to query.&lt;/p&gt;

&lt;p&gt;I built 16 enrichment tables on this model, among them species, plants, wildlife, astronomy ratings, sun data, climate normals, flood zones, elevation profiles, weather stations, native land territories, tidal data, and tick risk zones. The biological tables alone ended up at over 100,000 rows each — &lt;code&gt;location_species&lt;/code&gt; at 102,719, &lt;code&gt;location_plants&lt;/code&gt; at 100,610, &lt;code&gt;location_wildlife&lt;/code&gt; at 101,222. Link them via FK, and a page for any trail can instantly surface what lives there.&lt;/p&gt;
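
&lt;p&gt;To make that concrete, here is a minimal sketch of one FK assignment pass, assuming a &lt;code&gt;species_sightings&lt;/code&gt; staging table and a 5 km threshold. Both are illustrative stand-ins, not the actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2

# Sketch only: table names, columns, and the 5 km threshold are
# illustrative, not the real schema.
ASSIGN_SQL = """
INSERT INTO location_species (location_id, species_id, distance_m)
SELECT l.id, s.id, ST_Distance(l.geom::geography, s.geom::geography)
FROM locations l
JOIN species_sightings s
  ON ST_DWithin(l.geom::geography, s.geom::geography, 5000);
"""

with psycopg2.connect("dbname=trails") as conn:
    with conn.cursor() as cur:
        # The connection context manager commits on a clean exit.
        cur.execute(ASSIGN_SQL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;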

&lt;p&gt;The architecture felt elegant. I was pleased with myself in the way that only precedes something going wrong.&lt;/p&gt;

&lt;h2&gt;When a coordinate is a lie&lt;/h2&gt;

&lt;p&gt;Then I got to the data that doesn't have coordinates.&lt;/p&gt;

&lt;p&gt;UFO sightings. Bigfoot reports. Filming locations. Historical wildfire perimeters. Fossil dig sites. These sources give you a county, a town, sometimes just a general region. "Reported in Northern Arizona." "Sighting near the Colorado border." There's no GPS pin. There's no precise point.&lt;/p&gt;

&lt;p&gt;I spent an embarrassing amount of time trying to force these into the foreign key model anyway. Geocode the county centroid, find the nearest location, assign it. It worked technically. It was also completely dishonest — telling a user that a UFO was sighted at a specific trailhead because that trailhead happened to be the closest row in my locations table.&lt;/p&gt;

&lt;p&gt;I scrapped it and built a second model: grid cells.&lt;/p&gt;

&lt;p&gt;Instead of mapping data to a location, I divided the map into cells and mapped data to whichever cell it fell into. Then the question changed from "is this species at this trail?" to "is this trail inside a grid cell that has a recorded UFO sighting?" Thirteen grid tables in total — &lt;code&gt;grid_paranormal&lt;/code&gt; (3,814 rows), &lt;code&gt;grid_filming&lt;/code&gt;, &lt;code&gt;grid_geology&lt;/code&gt;, &lt;code&gt;grid_meteorites&lt;/code&gt;, &lt;code&gt;grid_shipwrecks&lt;/code&gt;, &lt;code&gt;grid_karst&lt;/code&gt;, &lt;code&gt;grid_watershed&lt;/code&gt;, &lt;code&gt;grid_fire&lt;/code&gt;, &lt;code&gt;grid_endangered_species&lt;/code&gt;, &lt;code&gt;grid_minerals&lt;/code&gt;, &lt;code&gt;grid_fossils&lt;/code&gt;, &lt;code&gt;grid_phenology&lt;/code&gt;, &lt;code&gt;grid_temperature_extremes&lt;/code&gt;.&lt;/p&gt;
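
&lt;p&gt;The bucketing itself is simple. A minimal sketch, with an illustrative 0.5-degree cell size (the site's real resolution may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Bucket any lat/lon into a fixed-size cell and use the cell ID as
# the join key. CELL_SIZE is an illustrative choice.
CELL_SIZE = 0.5  # degrees

def grid_cell(lat, lon):
    """Return a stable cell ID like '34.0_-111.5' for a coordinate."""
    cell_lat = (lat // CELL_SIZE) * CELL_SIZE
    cell_lon = (lon // CELL_SIZE) * CELL_SIZE
    return f"{cell_lat}_{cell_lon}"

# A vaguely-placed report and a precisely-pinned trailhead land in
# the same cell, so the association never overclaims:
assert grid_cell(34.2, -111.3) == grid_cell(34.4, -111.1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At page-build time the trail's coordinate is bucketed once, and each grid table is checked for that cell ID.&lt;/p&gt;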

&lt;p&gt;It's a looser association. But it's honest. The site can say "this area has a history of paranormal reports" without claiming a ghost is standing at the 1.2-mile marker of a specific path. The distinction matters — you're not fabricating precision you don't have.&lt;/p&gt;

&lt;h2&gt;The overnight query I didn't time&lt;/h2&gt;

&lt;p&gt;Once I had the enrichment model working — FKs for precision, grids for general areas — I wanted to solve the discovery problem. If someone is looking at a park, they want to know what else is nearby.&lt;/p&gt;

&lt;p&gt;Real-time spatial queries across 140,000 pages would kill the server. So I pre-calculated everything into a &lt;code&gt;location_nearby&lt;/code&gt; table and just do indexed lookups at render time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;location_nearby&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nearby_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_Distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;ST_DWithin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16093&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- 10km radius&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this self-join across 105,000 locations is not a quick operation. I kicked it off, went and did other things, came back a few hours later to find it done. I didn't measure how long it actually took — which is the most honest thing I can say about a query you only run once. The table ended up at 1,396,359 rows. Every page render is now a simple indexed lookup rather than a live spatial calculation.&lt;/p&gt;
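
&lt;p&gt;The render-time side is then trivial: one indexed read per page. A sketch, assuming a btree index on &lt;code&gt;(location_id, distance)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def nearby(cur, location_id, limit=10):
    """Fetch pre-computed neighbours for one page render.

    No live spatial math: the heavy ST_DWithin work already
    happened in the one-off build query above.
    """
    cur.execute(
        "SELECT nearby_id, distance FROM location_nearby "
        "WHERE location_id = %s ORDER BY distance LIMIT %s;",
        (location_id, limit),
    )
    return cur.fetchall()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;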

&lt;p&gt;I did spend an entire afternoon before this trying to optimise the index type. Trying different strategies, benchmarking, reading PostGIS documentation. Then I remembered it was a static site. The fastest query is the one I already ran. I got that afternoon back in the sense that I learned something, and lost it in the sense that I'll never get those hours back.&lt;/p&gt;

&lt;h2&gt;The part I didn't expect&lt;/h2&gt;

&lt;p&gt;I had spent the majority of my time on the biological and geological data. The species tables, the plant taxonomy, making sure the mappings were tight and the data quality was reasonable. That was the "value" of the site in my head — scientific, useful, something a serious hiker would rely on.&lt;/p&gt;

&lt;p&gt;Then I started testing actual pages.&lt;/p&gt;

&lt;p&gt;The paranormal grid data — which took a fraction of the time to implement because lower resolution means fewer edge cases — was by far the most interesting part of every page it appeared on. People read past the species counts and the climate normals and stopped at the UFO section. The "curiosity" data that I'd treated as a fun side experiment was doing more for engagement than everything I'd agonised over.&lt;/p&gt;

&lt;p&gt;I hadn't planned for that. I couldn't have planned for that. You can theorise about what users find interesting, but sometimes you just have to put the page in front of a person and watch where their eyes go.&lt;/p&gt;

&lt;p&gt;Turns out people are more likely to click on "haunted" than on "karst geology." I'm choosing not to take this personally.&lt;/p&gt;

&lt;h2&gt;Where the model broke&lt;/h2&gt;

&lt;p&gt;The FK and grid approach worked for nearly everything. If it had a coordinate or a region, I could anchor it. Parks, species, paranormal reports, filming locations, fossils — all handled.&lt;/p&gt;

&lt;p&gt;Then I ran into climate data. Not a point. Not a region. A continuous variable that shifted every few miles based on elevation, proximity to water, and a dozen other factors. I tried to force it into the FK model. The results were absurd — trails at 8,000 feet getting the climate profile of a weather station 40 miles away in a valley. I tried the grid approach. The resolution was too coarse to be useful.&lt;/p&gt;

&lt;p&gt;The anchor model, which had handled everything else cleanly, had nothing to say about climate.&lt;/p&gt;

&lt;p&gt;That's where Part 3 starts — how I spent three days building something wrong before I found the approach that actually worked.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Lessons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Having data in a database and having a site are two completely different problems. The first one took weeks. The second one took months.&lt;/li&gt;
&lt;li&gt;Don't force precision onto imprecise data. A grid cell that's honest beats a foreign key that's lying.&lt;/li&gt;
&lt;li&gt;Pre-calculate anything spatial that you'd otherwise query at render time. Your server will thank you. Your patience during the build will not.&lt;/li&gt;
&lt;li&gt;The feature you spent the least time on will be the one users care about most. Build the weird stuff.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgresql</category>
      <category>postgis</category>
      <category>dataengineering</category>
      <category>seo</category>
    </item>
    <item>
      <title>16GB of RAM, 12GB of JSON, and One Very Loud Fan</title>
      <dc:creator>Velk</dc:creator>
      <pubDate>Mon, 13 Apr 2026 15:31:02 +0000</pubDate>
      <link>https://dev.to/velk/16gb-of-ram-12gb-of-json-and-one-very-loud-fan-63i</link>
      <guid>https://dev.to/velk/16gb-of-ram-12gb-of-json-and-one-very-loud-fan-63i</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 1 of a 5-part series documenting the build of &lt;a href="https://velktrails.com" rel="noopener noreferrer"&gt;velktrails.com&lt;/a&gt; — a programmatic outdoor recreation resource covering 105,000+ locations across the US. Each part covers a real technical problem I ran into and how I worked around it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This one starts not with a bang but with a crash. A freeze, rather.&lt;/p&gt;

&lt;p&gt;In the midst of my daily brain-rot session, something felt off. A tiny jerk. A stutter. The YouTube video I had open froze mid-frame. Then the freeze stretched longer. Then my mouse pointer just disappeared from the screen and into the abyss. I had no choice but to press and hold the power button for ten seconds. Screen went blank. Booted back to desktop. Opened a terminal and ran &lt;code&gt;python3 import_gov_apis.py&lt;/code&gt; — a script I had named with all the creative energy of someone who just wanted it to work.&lt;/p&gt;

&lt;p&gt;This time I didn't entertain myself while it ran. I opened the system monitor and watched. Within minutes, Python had consumed all 16GB of RAM and was deep into 14GB of swap. My mouse pointer — which had just returned from its journey into the void — disappeared again.&lt;/p&gt;

&lt;p&gt;The culprit was obvious. The size of the JSON files.&lt;/p&gt;

&lt;p&gt;To understand why the files were that large, you need to understand what I was trying to build — and how much data it actually takes to describe 105,757 outdoor locations in any meaningful depth.&lt;/p&gt;

&lt;h2&gt;What I was actually pulling&lt;/h2&gt;

&lt;p&gt;The starting point was the federal agencies: National Park Service (NPS), US Forest Service (USFS), Recreation Information Database (RIDB), Bureau of Land Management (BLM), US Army Corps of Engineers (USACE), US Fish &amp;amp; Wildlife Service. That's the backbone — parks, trails, campgrounds, recreation areas.&lt;/p&gt;

&lt;p&gt;But location names and coordinates alone don't make useful pages. So I kept going. Birds from eBird. Plant species and native flora. Fungi. Amphibians. Insects. Danger species and tick risk zones. Trees. Wildflowers. Stargazing data — constellations, meteor shower windows, Bortle ratings. Indigenous land and language territories. Fossils. Mineral deposits. Wildfire history. Watershed boundaries. Flood zones. Geology. Meteorite impact sites. Filming locations. And yes — UFO sightings and Bigfoot reports, because why not.&lt;/p&gt;

&lt;p&gt;Each of these came as its own data source. Some were APIs. Some were CSV dumps. Many were JSON files. By the time I had pulled everything, the total across all source files exceeded 12GB. JSON format inflates size significantly — the same data that fits in 2GB of PostgreSQL can easily be 8GB when it's nested JSON with repeated field names on every record. So 12GB of source files isn't 12GB of unique information. But it's still 12GB that Python has to deal with.&lt;/p&gt;

&lt;h2&gt;The RAM wall&lt;/h2&gt;

&lt;p&gt;My initial approach was the naive one. Fetch the JSON, load it into a Python dict, run processing over the whole set. This worked fine when I was testing with a single state. It failed completely when I pointed it at the full national dataset.&lt;/p&gt;

&lt;p&gt;The problem is &lt;code&gt;json.load()&lt;/code&gt;. It reads the entire file into memory before you can access a single record. When you're dealing with 6 to 8GB of nested JSON — hundreds of thousands of records, each carrying arrays of attributes — Python's object model turns that raw file size into somewhere between 12 and 20GB of heap. On 16GB of RAM, that's not a risk. It's a countdown.&lt;/p&gt;

&lt;p&gt;I switched to &lt;code&gt;ijson&lt;/code&gt;, which streams through a JSON file without loading the whole structure into memory. One record comes in, gets transformed, gets written to a PostgreSQL staging table, memory clears. Then the next one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ijson&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;massive_gov_data.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ijson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations.item&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Transform and insert immediately —
&lt;/span&gt;        &lt;span class="c1"&gt;# never accumulate into a list
&lt;/span&gt;        &lt;span class="nf"&gt;db_insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern carried through the rest of the pipeline. Streaming when processing PRISM climate tiles. Streaming when joining species records to coordinates. Streaming when pulling weather grids. The rule became simple: if the file is over a few hundred MB, stream it. Python's object overhead will catch you eventually if you don't.&lt;/p&gt;
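
&lt;p&gt;One refinement on that rule: writing rows one at a time from a streaming loop is slow. Batching the inserts keeps memory flat and throughput reasonable. A sketch using &lt;code&gt;psycopg2.extras.execute_values&lt;/code&gt;, with illustrative file, key, and table names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ijson
import psycopg2
from psycopg2.extras import execute_values

INSERT_SQL = "INSERT INTO staging_species (name, lat, lon) VALUES %s"
BATCH_SIZE = 1000  # illustrative

with psycopg2.connect("dbname=trails") as conn, conn.cursor() as cur:
    with open('species_dump.json', 'rb') as f:
        batch = []
        # Same streaming pattern: one record in memory at a time,
        # flushed to PostgreSQL in fixed-size batches.
        for rec in ijson.items(f, 'records.item'):
            batch.append((rec['name'], float(rec['lat']), float(rec['lon'])))
            if len(batch) &gt;= BATCH_SIZE:
                execute_values(cur, INSERT_SQL, batch)
                batch.clear()
        if batch:
            execute_values(cur, INSERT_SQL, batch)  # flush the tail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;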

&lt;h2&gt;The deduplication mess&lt;/h2&gt;

&lt;p&gt;Once the data was in PostgreSQL, I hit the next wall: the same location appearing multiple times across different sources.&lt;/p&gt;

&lt;p&gt;The NPS calls a place a "Park." The USFS calls the same patch of dirt a "Recreation Area." The RIDB calls it a "Site." They all have different naming conventions, different coordinate formats, and different primary keys. Blindly importing them doesn't give you a comprehensive map — it gives you five slightly different versions of the same trailhead with no way to tell they're the same place.&lt;/p&gt;

&lt;p&gt;I couldn't match on names alone because "Grand Canyon Trailhead" in one source becomes "GC Trailhead - North" in another. I couldn't match on coordinates either, because one API places the point at the park entrance and another places it at the geographic centre.&lt;/p&gt;

&lt;p&gt;I settled on a two-condition heuristic: are these two points within 100 metres of each other, and is there at least 80% string similarity between the names? If both, merge them — the most official source wins on name, the most detailed source wins on metadata.&lt;/p&gt;
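
&lt;p&gt;In plain Python, the heuristic looks something like the sketch below. &lt;code&gt;difflib.SequenceMatcher&lt;/code&gt; is one way to get a name-similarity ratio; treat this as the shape of the logic, not the actual pipeline code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

def metres_between(lat1, lon1, lat2, lon2):
    """Haversine distance in metres between two WGS84 points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371000 * asin(sqrt(a))  # mean Earth radius in metres

def same_place(a, b):
    """Merge heuristic: within 100 m AND at least 80% name similarity."""
    if metres_between(a['lat'], a['lon'], b['lat'], b['lon']) &gt; 100:
        return False
    names = (a['name'].lower(), b['name'].lower())
    return SequenceMatcher(None, *names).ratio() &gt;= 0.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;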

&lt;p&gt;The UFO and Bigfoot data was actually the easiest to handle here, which says something. Those sources are coordinate-stamped observations, not location claims. They don't assert "this is a place" — they assert "something was reported near here." No deduplication needed. Just attach to the nearest anchor and move on.&lt;/p&gt;

&lt;h2&gt;The 3,033 records I refused to delete&lt;/h2&gt;

&lt;p&gt;After the deduplication passes, I had 3,033 records that didn't match anything in the primary datasets but existed in the pipeline staging area.&lt;/p&gt;

&lt;p&gt;The clean move was obvious: delete them, keep the database lean, reduce noise. But these felt like the edge cases — the small local spots that the federal APIs missed but the open-source sources happened to capture. Hard to verify, but also hard to dismiss.&lt;/p&gt;

&lt;p&gt;I moved them to a backup table instead. They don't render as pages. But they're still in the system.&lt;/p&gt;
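
&lt;p&gt;Mechanically it is two statements. A sketch, with illustrative table and column names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2

with psycopg2.connect("dbname=trails") as conn, conn.cursor() as cur:
    # Clone the staging layout into an empty backup table
    # (all names here are illustrative).
    cur.execute(
        "CREATE TABLE IF NOT EXISTS locations_unmatched AS "
        "SELECT * FROM staging_locations WHERE false;"
    )
    # Park every staging row that never matched a canonical location.
    cur.execute(
        "INSERT INTO locations_unmatched "
        "SELECT s.* FROM staging_locations s "
        "LEFT JOIN locations l ON l.source_ref = s.source_ref "
        "WHERE l.id IS NULL;"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;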

&lt;p&gt;The principle: never delete source data from the pipeline. The thing you're convinced is noise today is sometimes the only accurate record tomorrow. Storage is cheap. Re-sourcing deleted data is not.&lt;/p&gt;

&lt;h2&gt;Where things stood&lt;/h2&gt;

&lt;p&gt;After all of this: 4.8GB database, 105,757 unique location anchors, and roughly 40% of total development time spent arguing with data about where a park starts and ends rather than writing site code.&lt;/p&gt;

&lt;p&gt;I had the "where" for every location. A coordinate, a canonical name, a source attribution.&lt;/p&gt;

&lt;p&gt;That's not enough to make a useful page. A coordinate tells you where something is. It says nothing about what lives there, what the climate does across seasons, or what's been reported in the area.&lt;/p&gt;

&lt;p&gt;At this point I was naive enough to think the hard part was done. Data in the database, RAM crashes behind me, deduplication solved. I couldn't have been more wrong — being confident your woes are over is the highest degree of wrong there is. As it turned out, the ghost in the machine was just getting started with me.&lt;/p&gt;

&lt;p&gt;Part 2 covers how I turned 105,000 coordinate points into a spatial index using PostGIS, and started mapping species, geology, and stranger things to specific location IDs.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Lessons, if you can call them that:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It takes a system crash to awaken one from a brain-rot trance.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;ijson&lt;/code&gt; if your data source is over 300MB. You will not win a fight against &lt;code&gt;json.load()&lt;/code&gt; at scale.&lt;/li&gt;
&lt;li&gt;Priorities in life: Food. Shelter. RAM.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>postgres</category>
      <category>dataengineering</category>
      <category>seo</category>
    </item>
  </channel>
</rss>
