Building a world-history map: PostGIS time+bbox search, a full-screen MapLibre UI, and Wikidata ETL lessons

#postgres #webdev #showdev #datascience

Annalium is a free, no-login map + multi-track timeline of world history. You pan a full-bleed world map, drag a timeline, and the map fills with events and redraws era-appropriate borders for whatever moment you land on. No account, every view is a shareable URL.

This post is the engineering behind it: the one query that does most of the work, why the data is pre-indexed instead of live, how view state lives in the URL, and the honest lessons from pulling history out of Wikidata. Every snippet below is real code from the repo, trimmed for the page.

1. The core query: time + bbox in one request

The hot path is GET /events: "give me the events inside this viewport, overlapping this time range." That's a spatial predicate AND a range-overlap predicate, and it has to stay fast while you drag the map around.

The first thing to get right is time. Events store their span as two signed integers, not a date range:

# tsrange is the sub-year-precision window for CE events. Nullable because
# Python's datetime can't bind BCE values — for BCE rows we rely on the
# signed-int year columns below and leave this NULL. See alembic 0008.
valid_time: Mapped[Range[datetime] | None] = mapped_column(TSRANGE, nullable=True)
valid_time_year_start: Mapped[int] = mapped_column(Integer, nullable=False)
valid_time_year_end: Mapped[int] = mapped_column(Integer, nullable=False)

This is the "don't fix the negatives" gotcha. Python's datetime can't represent 500 BCE, so a tsrange column silently drops every pre-year-1 event. The canonical time filter is a pair of signed ints (-500 means 500 BCE, astronomical year numbering). The tsrange sticks around only for sub-year precision on CE events. Anyone new to the codebase wants to "clean up" the negative years — don't.

With that, the query is a plain intersection plus a textbook range-overlap:

w, s, e, n = _parse_bbox(bbox) if bbox else (-180.0, -90.0, 180.0, 90.0)
envelope = ST_MakeEnvelope(w, s, e, n, 4326)
year_start = t_start if t_start is not None else -3000
year_end = t_end if t_end is not None else 3000

# Range overlap: an event's [start, end] intersects the query's
# [year_start, year_end) iff event.year_start < year_end AND event.year_end >= year_start.
filters = [
    Event.location.intersects(envelope),
    Event.valid_time_year_start < year_end,
    Event.valid_time_year_end >= year_start,
    Event.deprecated.is_(False),
]

The index strategy is two indexes, not one, and it's worth being precise about which half does what. The spatial predicate rides a GiST index on the geometry; the year overlap rides a plain btree composite over the two int columns:

# migration 0004 — the spatial half (GiST over the PostGIS geometry)
op.create_index(
    "ix_events_location_valid_time", "events",
    ["location", "valid_time"], postgresql_using="gist",
)

# migration 0008 — the time half (btree over the signed-int years)
op.create_index(
    "ix_events_year_range", "events",
    ["valid_time_year_start", "valid_time_year_end"],
)

Planner reality: on a viewport-shaped query PostGIS narrows on the GiST first (a bbox usually cuts the candidate set far harder than a millennium-wide time window), then filters the survivors on the year columns. The btree earns its keep on the global search bar, which drops the bbox entirely and scans by time.

Last piece: pagination. Result counts aren't stable as you pan and scrub, so OFFSET is meaningless and gets slow as it grows. Keyset (cursor) pagination instead, keyed on (valid_time_year_start, id):

if cursor:
    last_year, last_id = decode_cursor(cursor)
    # (y > last_y) OR (y == last_y AND id > last_id)
    filters.append(
        or_(
            Event.valid_time_year_start > last_year,
            and_(
                Event.valid_time_year_start == last_year,
                Event.id > last_id,
            ),
        )
    )

The cursor is just base64-encoded JSON ({"y": ..., "id": ...}) — opaque to the client, cheap to the database, and it scales to any result size because there's no offset to count past.

2. Why pre-index instead of live-querying Wikidata

The data comes from Wikidata, but requests never touch it. A nightly job pulls events into our own Postgres and that's the source of truth. The reason is latency: the live Wikidata Query Service is ~1s+ per query and rate-limited, which is unworkable for a map you scrub in real time. The design target was sub-100ms p95 on the /events hot path — you can't get there over a network hop to a shared public endpoint. Pre-indexing also buys schema control (we can add our own columns like event_type and categories) and resilience to upstream outages. The cost is an ETL pipeline and a database. Fair trade.

The pull itself is a VALUES-based SPARQL query. The obvious version — "everything that is a subclass of occurrence" — times out on WDQS because the transitive subclass traversal explodes, so we enumerate the event classes we care about directly and relax the time/space requirements:

VALUES ?eventClass { wd:Q178561 wd:Q198 wd:Q188055 ... }   # battle, war, siege, …
?event wdt:P31 ?eventClass .
# Time: prefer P585 (point in time), fall back to P580 (start time).
OPTIONAL { ?event wdt:P585 ?pt585 }
OPTIONAL { ?event wdt:P580 ?pt580 }
BIND(COALESCE(?pt585, ?pt580) AS ?ptInTime)
FILTER(BOUND(?ptInTime))
# Coord: prefer direct P625, fall back to the P276 location entity's own P625.
OPTIONAL { ?event wdt:P625 ?coord625 }
BIND(COALESCE(?coord625, ?coord276) AS ?coord)
FILTER(BOUND(?coord))

That P580 fallback matters more than it looks: the original strict "must have P585 (point in time) AND P625 (coordinates)" pattern was excluding well-formed events with a duration — every multi-day religious council, a lot of treaties. Relaxing it roughly doubled the corpus. Even the enumerated class list gets sliced into small batches per century window, because a single query with all ~50 classes plus a stack of per-language label OPTIONALs hits the 60-second WDQS ceiling about half the time.

Because WDQS is flaky near that ceiling, the loader does not do long exponential backoff (a single stalled century would freeze a 50-chunk run for half an hour). It runs one attempt per chunk, skips failures, continues, then does one retry pass over the windows that failed — which gives WDQS time to recover. Anything that still won't load goes to an S3 dead-letter file, and crucially, a dead-letter failure can never take down the ETL:

def write_chunk(self, *, run_id, chunk_index, records, error) -> str | None:
    """Write failed chunk to S3 as JSONL. Returns the s3:// URI on success,
    or None if the dead-letter destination is unavailable (bucket missing in
    local dev, no creds). A dead-letter failure must NEVER crash the caller."""
    annotated = [{**rec, "_error": error} for rec in records]
    ...
    except (ClientError, BotoCoreError) as exc:
        log.warning("deadletter_write_failed", ...)
        return None

The failed records land as JSONL with the exception stapled on, so a bad night is inspectable instead of lost.

3. A map-first, full-screen MapLibre UI

The map is the whole canvas. Every panel — events list, timeline, filters, status bar — floats as translucent glass over it rather than boxing it into a column. MapLibre GL renders vector tiles, no API key, fully themeable.

The interesting part is state. There's no client store for view state and no accounts; the URL is the state, encoded as a clean path with no query string:

//   /{west},{south},{east},{north}/{start}/{end}[/{category}]
//   e.g. /68.6426,20.7972,132.8467,53.0412/-5/295/Conflict
export function buildPath(v: ViewState, locale = DEFAULT_LOCALE): string {
  const bbox = v.bbox.map(round4).join(",");
  let path = `/${bbox}/${v.tStart}/${v.tEnd}`;
  if (v.category) path += `/${encodeURIComponent(v.category)}`;
  return locale === DEFAULT_LOCALE ? path : `/${locale}${path}`;
}

Coordinates round to 4 decimals (~11 m) to keep the URL short; years are signed ints so BCE round-trips cleanly. Sharing a view is copying the address bar. Server state — the actual events for a viewport — is TanStack Query, keyed on the same view.

One trap worth flagging for anyone doing this in Next's App Router: don't use router.replace to keep the address bar in sync while the user drags. Navigating the catch-all route remounts the page, which wipes local component state and reinitializes the map on every frame. We write the URL cosmetically instead:

// replaceState updates the URL without any Next navigation, so the app stays
// mounted (map drag / timeline scrub stay cheap). A popstate listener re-reads
// the view on back/forward. Debounced ~200ms.
window.history.replaceState(window.history.state, "", path);

The catch-all route still serves hard-loaded and shared URLs on cold load; replaceState only handles the live in-session updates.

4. Wikidata ETL lessons (the honest part)

The data isn't the hard part. The bias is. Pulling events from Wikipedia/Wikidata gives you a corpus that visibly skews European, military/political, and post-1500. A map makes that impossible to hide — pan to sub-Saharan Africa in the year 800 and it's empty in a way a list would quietly paper over. We chose to ship with the limitation stated rather than fake density. If anything the map is a feature here: it's the most honest audit of your own coverage gaps you can build, and it tells us exactly which regions and centuries to backfill next.

Multilingual is a JSONB column, not a translations table. Titles and descriptions are stored as a blob keyed by language code — {"en": "...", "zh": "...", "ja": "..."} — right on the event row. The ETL harvests every label Wikidata has (currently up to ~19 languages); the UI ships nine. A separate event_translations table would multiply the row count and force a JOIN on every read for no real benefit at our scale. The cost is that free-text search can't just ILIKE one column, so a small SQL helper flattens the JSONB into delimited text for a pg_trgm GIN index, and search runs across all languages instead of just English.

Classification is deterministic first, LLM only where it has to be. Most events get their category for free from the Wikidata class that matched them — a battle (Q178561) is Conflict, a treaty (Q131569) is Politics, no model involved. Only the fuzzy buckets that have no clean Wikidata class fall to a local Ollama model. The taxonomy is a closed set:

CATEGORIES = ("Conflict", "Politics", "Disasters", "Exploration", "Economy",
              "Religion", "Science", "Transportation", "Agriculture", "Trade",
              "Culture", "Migration", "Environment", "Other")

The model runs at temperature: 0.0 for reproducibility, and anything it hallucinates outside that set is filtered out before it can reach the database:

normalized = c.strip().title()
if normalized in _CATEGORY_SET and normalized not in seen:
    out.append(normalized)

Keeping the model on the deterministic buckets' terms — closed label set, temperature zero, output validated against a frozenset — is what makes an LLM safe to put in an ingestion pipeline at all.

5. The borders problem

Era-accurate political borders are the hardest data in the whole project. Modern map tiles are everywhere; polygons that are correct for a given year and change as you scrub time barely exist as clean open data. Tang China at 700 CE is not Tang China at 850 CE, and most sources only have one or the other.

So v1 ships a small hand-curated set — roughly 50 polygons for the major polities — rather than a large auto-generated one that looks precise and is quietly wrong. The roadmap is a three-phase grow: hand-curated now, then an OpenHistoricalMap (ODbL) importer to reach a few hundred polities, then an approximate bbox-from-capital fallback rendered with a deliberately dashed border so "we're guessing this frontier" is visible on the map instead of implied. Quality over quantity for launch: fifty good ones, honestly labelled, beats five hundred that lie.

Try it

The live app is at annalium.com — a good first view is the world in 1492, or scrub to 700 CE and watch the Tang, the Abbasids, and the Carolingians all going at once. It's free and needs no login.

The app itself is closed, but the one open, contributable surface is the theme kit — the per-era visual layer that re-skins the UI, which designers can submit without touching product code. Event data is Wikidata/Wikipedia (CC-BY-SA); the base map is © OpenStreetMap (ODbL), with the small hand-curated v1 era borders described above (OpenHistoricalMap import planned). If you poke at it, the most useful feedback I can get is where the coverage feels emptiest — that's the map telling both of us the truth.