16GB of RAM, 12GB of JSON, and One Very Loud Fan

This is Part 1 of a 5-part series documenting the build of velktrails.com — a programmatic outdoor recreation resource covering 105,000+ locations across the US. Each part covers a real technical problem I ran into and how I worked around it.

This one starts not with a bang but with a crash. A freeze, rather.

In the midst of my daily brain-rot session, something felt off. A tiny jerk. A stutter. The YouTube video I had open froze mid-frame. Then the freeze stretched longer. Then my mouse pointer just disappeared from the screen and into the abyss. I had no choice but to press and hold the power button for ten seconds. Screen went blank. Booted back to desktop. Opened a terminal and ran python3 import_gov_apis.py — a script I had named with all the creative energy of someone who just wanted it to work.

This time I didn't entertain myself while it ran. I opened the system monitor and watched. Within minutes, Python had consumed all 16GB of RAM and was deep into 14GB of swap. My mouse pointer — which had just returned from its journey into the void — disappeared again.

The culprit was obvious. The size of the JSON files.

To understand why the files were that large, you need to understand what I was trying to build — and how much data it actually takes to describe 105,757 outdoor locations in any meaningful depth.

What I was actually pulling

The starting point was the federal agencies: National Park Service (NPS), US Forest Service (USFS), Recreation Information Database (RIDB), Bureau of Land Management (BLM), US Army Corps of Engineers (USACE), US Fish & Wildlife Service. That's the backbone — parks, trails, campgrounds, recreation areas.

But location names and coordinates alone don't make useful pages. So I kept going. Birds from eBird. Plant species and native flora. Fungi. Amphibians. Insects. Danger species and tick risk zones. Trees. Wildflowers. Stargazing data — constellations, meteor shower windows, Bortle ratings. Indigenous land and language territories. Fossils. Mineral deposits. Wildfire history. Watershed boundaries. Flood zones. Geology. Meteorite impact sites. Filming locations. And yes — UFO sightings and Bigfoot reports, because why not.

Each of these came as its own data source. Some were APIs. Some were CSV dumps. Many were JSON files. By the time I had pulled everything, the total across all source files exceeded 12GB. JSON format inflates size significantly — the same data that fits in 2GB of PostgreSQL can easily be 8GB when it's nested JSON with repeated field names on every record. So 12GB of source files isn't 12GB of unique information. But it's still 12GB that Python has to deal with.

The RAM wall

My initial approach was the naive one. Fetch the JSON, load it into a Python dict, run processing over the whole set. This worked fine when I was testing with a single state. It failed completely when I pointed it at the full national dataset.

The problem is json.load(). It reads the entire file into memory before you can access a single record. When you're dealing with 6 to 8GB of nested JSON — hundreds of thousands of records, each carrying arrays of attributes — Python's object model turns that raw file size into somewhere between 12 and 20GB of heap. On 16GB of RAM, that's not a risk. It's a countdown.

I switched to ijson, which streams through a JSON file without loading the whole structure into memory. One record comes in, gets transformed, gets written to a PostgreSQL staging table, memory clears. Then the next one.

```python
import ijson
import psycopg2

with open('massive_gov_data.json', 'rb') as f:
    locations = ijson.items(f, 'locations.item')
    for location in locations:
        # Transform and insert immediately;
        # never accumulate records into a list
        db_insert(transform(location))
```

This pattern carried through the rest of the pipeline. Streaming when processing PRISM climate tiles. Streaming when joining species records to coordinates. Streaming when pulling weather grids. The rule became simple: if the file is over a few hundred MB, stream it. Python's object overhead will catch you eventually if you don't.
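One refinement that pairs naturally with the stream-everything rule: instead of one database round-trip per record, flush small fixed-size chunks, so memory stays bounded but insert overhead drops. A minimal stdlib sketch (the batch size of 500 is an arbitrary assumption, and the bulk-insert call it feeds is whatever your database layer provides):

```python
from itertools import islice

def batched(stream, size=500):
    """Yield lists of up to `size` records from any iterator,
    without ever materialising the full stream in memory."""
    it = iter(stream)
    while chunk := list(islice(it, size)):
        yield chunk
```

Each chunk can then go to an executemany() or COPY; the stream itself (an ijson iterator, a csv reader) never has to fit in RAM.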

The deduplication mess

Once the data was in PostgreSQL, I hit the next wall: the same location appearing multiple times across different sources.

The NPS calls a place a "Park." The USFS calls the same patch of dirt a "Recreation Area." The RIDB calls it a "Site." They all have different naming conventions, different coordinate formats, and different primary keys. Blindly importing them doesn't give you a comprehensive map — it gives you five slightly different versions of the same trailhead with no way to tell they're the same place.

I couldn't match on names alone because "Grand Canyon Trailhead" in one source becomes "GC Trailhead - North" in another. I couldn't match on coordinates either, because one API places the point at the park entrance and another places it at the geographic centre.

I settled on a two-condition heuristic: are these two points within 100 metres of each other, and is there at least 80% string similarity between the names? If both, merge them — the most official source wins on name, the most detailed source wins on metadata.
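The heuristic is small enough to sketch in full. Here it is as plain Python, using the stdlib's SequenceMatcher for the name similarity; the dict record shape is my assumption, not the actual schema:

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in metres."""
    r = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(rec_a, rec_b, max_dist_m=100.0, min_sim=0.8):
    """Same place if within 100 m AND names are at least 80% similar."""
    dist = haversine_m(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"])
    return dist <= max_dist_m and name_similarity(rec_a["name"], rec_b["name"]) >= min_sim
```

Both conditions have to hold: two campgrounds 60 metres apart with unrelated names stay separate, and two records with identical names 5 km apart stay separate too.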

The UFO and Bigfoot data was actually the easiest to handle here, which says something. Those sources are coordinate-stamped observations, not location claims. They don't assert "this is a place" — they assert "something was reported near here." No deduplication needed. Just attach to the nearest anchor and move on.
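"Attach to the nearest anchor" is conceptually a one-liner. A naive linear-scan version looks like this; in practice you would push this into the database as a spatial nearest-neighbour query, but the logic is the same (again, the record shape is an assumption):

```python
import math

def approx_dist_sq(lat1, lon1, lat2, lon2):
    # Equirectangular approximation: not a true distance, but it
    # preserves ordering, which is all nearest-neighbour ranking needs
    x = (lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = lat2 - lat1
    return x * x + y * y

def nearest_anchor(lat, lon, anchors):
    """Return the anchor record closest to a sighting. O(n) scan."""
    return min(anchors, key=lambda a: approx_dist_sq(lat, lon, a["lat"], a["lon"]))
```

Every sighting gets a location ID this way, and no sighting ever spawns a new location of its own.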

The 3,033 records I refused to delete

After the deduplication passes, I had 3,033 records that didn't match anything in the primary datasets but existed in the pipeline staging area.

The clean move was obvious: delete them, keep the database lean, reduce noise. But these felt like the edge cases — the small local spots that the federal APIs missed but the open-source sources happened to capture. Hard to verify, but also hard to dismiss.

I moved them to a backup table instead. They don't render as pages. But they're still in the system.

The principle: never delete source data from the pipeline. The thing you're convinced is noise today is sometimes the only accurate record tomorrow. Storage is cheap. Re-sourcing deleted data is not.

Where things stood

After all of this: 4.8GB database, 105,757 unique location anchors, and roughly 40% of total development time spent arguing with data about where a park starts and ends rather than writing site code.

I had the "where" for every location. A coordinate, a canonical name, a source attribution.

That's not enough to make a useful page. A coordinate tells you where something is. It says nothing about what lives there, what the climate does across seasons, or what's been reported in the area.

At this point I was naive enough to think the hard part was done. Data in the database, RAM crashes behind me, deduplication solved. I couldn't have been more wrong — and I mean that in the fullest possible sense of the word. Being confident your woes are over is the highest degree of wrong there is. As it turned out, the ghost in the machine was just getting started with me.

Part 2 covers how I turned 105,000 coordinate points into a spatial index using PostGIS, and started mapping species, geology, and stranger things to specific location IDs.


Lessons, if you can call them that:

  • It takes a system crash to awaken one from a brain-rot trance.
  • Use ijson if your data source is over 300MB. You will not win a fight against json.load() at scale.
  • Priorities in life: Food. Shelter. RAM.
