Arque Nova
Building a Sub-10ms Australian Address Autocomplete API in Rust

Australia has one of the best open address datasets in the world. The Geocoded National Address File (GNAF) contains 15.8 million addresses — every house, apartment, rural property, and Indigenous community in the country — published by the federal government under a free commercial licence.

The catch: it ships as a ~1.5 GB zip of pipe-delimited files across eight relational tables, one set per state. If you want to actually search it, you need to parse it, join it, index it, and build a search API on top of it. That's what I did. The result is DingoFind — a hosted Australian address autocomplete API with typo tolerance, reverse geocoding, and ABS boundary enrichment.

This post covers the technical architecture: how the data pipeline works, how the search index is built, and how the API achieves sub-10ms p99 latency on a single ARM server.


The Data Problem: GNAF's Schema

GNAF is a normalised relational dataset. An address like "1/123 Collins Street Melbourne VIC 3000" is spread across multiple files:

  • ADDRESS_DETAIL — the core record (address_detail_pid, flat_number, number_first, locality_pid, street_locality_pid)
  • STREET_LOCALITY — street name, type, and suffix (joined via street_locality_pid)
  • LOCALITY — suburb name (joined via locality_pid)
  • STATE — state name
  • ADDRESS_MESH_BLOCK_2021 — joins address to mesh block PID for ABS enrichment

On top of that, there are alias tables for streets and localities (different names for the same place), and address default geocode files for lat/lon.

To get a single flat address record you need to join across five or six tables. I do this once, at pipeline time, and materialise the result into a Tantivy index. The API itself never touches the raw schema.


The Pipeline

The pipeline is a one-shot Rust binary that runs weekly (or on demand). It does five things in sequence:

1. Download

```
GNAF zip:        ~1.5 GB from data.gov.au (updated quarterly)
ABS MB xlsx:     ~30 MB  (mesh block → SA1 mapping)
ABS LGA xlsx:    ~30 MB  (mesh block → LGA name)
ABS CED xlsx:    ~30 MB  (mesh block → federal electoral division)
```

Downloads are skipped if files already exist in /tmp/gnaf_staging, making re-runs after partial failures fast.
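The skip-if-present check might look something like this. It's a std-only sketch: the actual HTTP client (presumably reqwest or similar in the real pipeline) is stubbed out, and `fetch_if_missing` is a hypothetical name.

```rust
use std::fs;
use std::path::Path;

/// Download `url` to `dest` unless a previous run already staged it.
/// The real HTTP fetch is stubbed out here; only the skip logic is shown.
fn fetch_if_missing(url: &str, dest: &Path) -> std::io::Result<bool> {
    if dest.exists() && fs::metadata(dest)?.len() > 0 {
        // Already staged from an earlier (possibly partial) run: skip.
        return Ok(false);
    }
    // Placeholder for the real download; here we just write a marker file.
    fs::write(dest, format!("downloaded from {url}"))?;
    Ok(true)
}

fn main() -> std::io::Result<()> {
    let staging = std::env::temp_dir().join("gnaf_staging_demo");
    fs::create_dir_all(&staging)?;
    let dest = staging.join("gnaf.zip");
    let first = fetch_if_missing("https://data.gov.au/...", &dest)?;
    let second = fetch_if_missing("https://data.gov.au/...", &dest)?;
    println!("first run downloaded: {first}, re-run downloaded: {second}");
    Ok(())
}
```

A zero-length file is treated as a failed download and retried, which is the case a partial failure tends to leave behind.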

2. Parse

Each state's PSV files are parsed in parallel with Rayon. The join sequence:

```
ADDRESS_DETAIL
  → STREET_LOCALITY    (street_locality_pid)
  → LOCALITY           (locality_pid)
  → STATE              (state_abbreviation)
  → ADDRESS_DEFAULT_GEOCODE (address_detail_pid → lat/lon)
```

Output: a Vec<Address> with flat fields: full_address, street_number, street_name, street_type, suburb, state, postcode, lat, lon, gnaf_pid, mesh_block_pid.
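The per-row join reduces to hash lookups against pre-built maps. A simplified, single-threaded sketch (real field layout and the Rayon parallelism omitted; `join_row` and the sample values are illustrative):

```rust
use std::collections::HashMap;

/// Flattened output record (a simplified subset of the real fields).
#[derive(Debug, PartialEq)]
struct Address {
    full_address: String,
    suburb: String,
    state: String,
}

/// Join one ADDRESS_DETAIL row against pre-built lookup maps.
fn join_row(
    detail: &[&str], // [address_detail_pid, number_first, street_locality_pid, locality_pid]
    streets: &HashMap<&str, &str>,            // street_locality_pid -> street name
    localities: &HashMap<&str, (&str, &str)>, // locality_pid -> (suburb, state)
) -> Option<Address> {
    let street = streets.get(detail[2])?;
    let (suburb, state) = localities.get(detail[3])?;
    Some(Address {
        full_address: format!("{} {} {} {}", detail[1], street, suburb, state),
        suburb: suburb.to_string(),
        state: state.to_string(),
    })
}

fn main() {
    // Pipe-delimited row as it comes out of a PSV file (illustrative values).
    let streets = HashMap::from([("SL1", "COLLINS STREET")]);
    let localities = HashMap::from([("L1", ("MELBOURNE", "VIC"))]);
    let detail: Vec<&str> = "GAVIC1|123|SL1|L1".split('|').collect();
    let addr = join_row(&detail, &streets, &localities).unwrap();
    println!("{}", addr.full_address); // 123 COLLINS STREET MELBOURNE VIC
}
```

Rows that fail a lookup fall out as `None`, which is also how orphaned PIDs get filtered.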

3. Enrich

The ABS enrichment adds statistical boundary codes. The join chain:

```
mesh_block_pid → MB_2021_PID
MB_2021_PID    → MB_CODE_2021 (11-digit code)
MB_CODE_2021   → SA1_CODE_2021
MB_CODE_2021   → LGA_NAME_2021
MB_CODE_2021   → CED_NAME_2021 (federal electorate)
```

This runs entirely in-memory using HashMap lookups built from the ABS xlsx files. At peak, the enricher holds ~800 MB of lookup tables for the 350,000+ mesh block codes.

4. Build the Index

The search index is built with Tantivy — a full-text search library in Rust, broadly similar to Lucene. One Tantivy document per address, with these fields:

| Field | Type | Purpose |
| --- | --- | --- |
| full_address | TEXT (tokenised) | Primary search field |
| suburb | TEXT (tokenised) | Suburb-only searches |
| postcode | TEXT | Postcode lookup |
| gnaf_pid | TEXT (stored) | Unique address ID |
| lat, lon | F64 (stored) | For response payload |
| sa1_code | TEXT (stored) | ABS enrichment |
| lga_name | TEXT (stored) | ABS enrichment |
| federal_elec | TEXT (stored) | ABS enrichment |
The index is written into a staging directory (/var/lib/ausaddress/index_staging). At ~15.8M documents it comes out around 6.6 GB on disk.

5. Swap

When the index is built and validated (spot-check: document count must be > 14 million), the pipeline does an atomic rename:

```
index_staging → index_live
```

The pipeline then sends SIGUSR1 to the running API process, which reopens the index from disk without restarting. Zero-downtime index updates.
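The swap can be sketched with `std::fs::rename`, which is atomic on the same filesystem but cannot overwrite a non-empty directory, so the old live index has to be moved aside first. The two-rename detail and the `promote` name are assumptions for illustration; the validation and SIGUSR1 steps are omitted:

```rust
use std::fs;
use std::path::Path;

/// Promote a freshly built index: move the old live directory aside,
/// then rename staging into place. Each rename is atomic.
fn promote(base: &Path) -> std::io::Result<()> {
    let staging = base.join("index_staging");
    let live = base.join("index_live");
    let old = base.join("index_old");
    if live.exists() {
        let _ = fs::remove_dir_all(&old); // discard the previous generation
        fs::rename(&live, &old)?;
    }
    fs::rename(&staging, &live)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let base = std::env::temp_dir().join("promote_demo");
    let _ = fs::remove_dir_all(&base);
    fs::create_dir_all(base.join("index_staging"))?;
    fs::write(base.join("index_staging/meta.json"), "{}")?;
    promote(&base)?;
    assert!(base.join("index_live/meta.json").exists());
    Ok(())
}
```

Keeping `index_old` around also gives a trivial rollback path if the new index turns out to be bad.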


The API

The HTTP server is axum with axum-server for direct TLS termination using rustls. No nginx in front of the API — it binds port 443 directly. Nginx only serves the static landing page on port 8080.

Hot Path

For an autocomplete request, the hot path is:

```
Request arrives → rustls decrypts
  → axum routes to handler
  → DashMap: look up API key hash    ~50 ns
  → AtomicU32: check daily counter   ~5 ns
  → moka: check query cache          ~0.3 ms (hit)
  → Tantivy search                   ~5–15 ms (miss)
  → serialize JSON response
```

About 85% of requests are cache hits (moka LRU, 500k capacity, 5-minute TTL). For cache misses, Tantivy searches the mmapped index. Because the index is memory-mapped, the OS keeps hot segments in RAM — cold-start latency after a restart is higher until pages warm up.
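The cache check is conceptually a TTL'd map lookup. A dependency-free stand-in for what moka provides (moka additionally handles LRU eviction at the 500k capacity and lock-free concurrent access, none of which this sketch attempts):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy TTL cache: entries expire after `ttl`, mimicking the 5-minute
/// query-cache behaviour. Not a moka replacement, just the hot-path shape.
struct QueryCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>,
}

impl QueryCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }
    fn get(&self, q: &str) -> Option<&str> {
        match self.entries.get(q) {
            Some((at, json)) if at.elapsed() < self.ttl => Some(json.as_str()),
            _ => None, // miss or expired: fall through to Tantivy
        }
    }
    fn put(&mut self, q: String, json: String) {
        self.entries.insert(q, (Instant::now(), json));
    }
}

fn main() {
    let mut cache = QueryCache::new(Duration::from_secs(300));
    assert!(cache.get("1 george st").is_none()); // miss: would hit Tantivy
    cache.put("1 george st".into(), "[{\"full_address\":\"...\"}]".into());
    assert!(cache.get("1 george st").is_some()); // hit: the ~0.3 ms path
}
```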

Typo Tolerance

Tantivy supports fuzzy matching out of the box, but I added a pre-processing layer for Australian address quirks:

  • Levenshtein distance 1–2 for street names and suburbs
  • Phonetic normalisation for common sound-alike misspellings ("woolloomooloo" vs "wooloomooloo")
  • Token reordering — "sydney george st 1" finds "1 George St Sydney"
  • Abbreviation expansion — "st" → Street/Saint, "ave" → Avenue, "rd" → Road

The trickiest case is long suburb names. "Woolloomooloo" has 13 characters and people reliably get the vowel and consonant runs wrong. A Levenshtein distance of 2 handles most variants.
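For reference, the edit-distance computation itself is the classic dynamic-programming recurrence (in production this is Tantivy's fuzzy matching rather than a hand-rolled function, so treat this as an illustration of the metric):

```rust
/// Classic dynamic-programming Levenshtein distance, row by row.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            cur.push(substitute.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // The dropped 'l' is one edit away: comfortably inside the distance-2 budget.
    assert_eq!(levenshtein("woolloomooloo", "wooloomooloo"), 1);
    // A heavily mangled variant falls outside it and won't match.
    assert!(levenshtein("woolloomooloo", "woolamaloo") > 2);
}
```

Distance 2 over a 13-character token is a tight budget in relative terms, which is why the phonetic normalisation pass runs first.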

Rate Limiting

Rate limiting is done with an in-process DashMap<String, AtomicU32> — one counter per API key per day. A background task flushes the counters to SQLite every 5 minutes and resets them at midnight. No Redis.

This means counters are approximate (up to 5 minutes of requests could be lost on crash), but for the use case (daily limits, not strict quotas) it's fine and avoids a network round-trip per request.
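The counter check itself might look like the sketch below. To stay std-only it wraps the map in a `Mutex` where the real service uses DashMap's sharded locking; the SQLite flush and midnight reset are omitted, and `RateLimiter` is a hypothetical name:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

/// Per-API-key daily request counters.
struct RateLimiter {
    daily_limit: u32,
    counters: Mutex<HashMap<String, AtomicU32>>,
}

impl RateLimiter {
    fn new(daily_limit: u32) -> Self {
        Self { daily_limit, counters: Mutex::new(HashMap::new()) }
    }

    /// Returns true if the request is allowed: one atomic increment,
    /// no network round-trip.
    fn check(&self, api_key: &str) -> bool {
        self.counters
            .lock()
            .unwrap()
            .entry(api_key.to_string())
            .or_insert_with(|| AtomicU32::new(0))
            .fetch_add(1, Ordering::Relaxed)
            < self.daily_limit
    }
}

fn main() {
    let rl = RateLimiter::new(2);
    assert!(rl.check("key1"));
    assert!(rl.check("key1"));
    assert!(!rl.check("key1")); // third request of the "day" is refused
}
```

`fetch_add` returns the previous value, so the comparison admits exactly `daily_limit` requests per key.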

Index Hot-Reload

When the pipeline finishes and swaps the index, it sends SIGUSR1 to the API process. The signal handler reopens the Tantivy index from disk on a background thread and atomically replaces the Arc<Index> in AppState. Existing in-flight requests complete against the old index; new requests pick up the new one immediately.
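The swap itself is just replacing a shared `Arc`. A std-only sketch of the idea (a crate like arc-swap would avoid the lock entirely; `SearchIndex` and `generation` are stand-ins for the real `Arc<Index>` in AppState):

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for the reopened Tantivy index.
struct SearchIndex { generation: u64 }

/// Requests clone the Arc out under a brief read lock, then search
/// without holding any lock; the reload path swaps the Arc in place.
struct AppState { index: RwLock<Arc<SearchIndex>> }

impl AppState {
    fn current(&self) -> Arc<SearchIndex> {
        Arc::clone(&self.index.read().unwrap())
    }
    fn hot_reload(&self, fresh: SearchIndex) {
        *self.index.write().unwrap() = Arc::new(fresh);
    }
}

fn main() {
    let state = AppState { index: RwLock::new(Arc::new(SearchIndex { generation: 1 })) };

    let in_flight = state.current(); // a request starts against generation 1
    state.hot_reload(SearchIndex { generation: 2 }); // SIGUSR1 handler swaps

    assert_eq!(in_flight.generation, 1);       // old request is unaffected
    assert_eq!(state.current().generation, 2); // new requests see the new index
}
```

The old index stays alive until the last in-flight clone drops, which is exactly the "existing requests complete against the old index" behaviour described above.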


Infrastructure

Everything runs on a single Oracle Cloud Ampere A1 instance:

  • CPU: 4 OCPU (ARM Neoverse N1)
  • RAM: 24 GB
  • Disk: 48 GB
  • OS: Ubuntu 24.04

RAM breakdown at steady state:

```
Tantivy index (mmap)    ~6.6 GB
moka query cache        ~2 GB
RTree (reverse geocode) ~830 MB
DashMap + counters      ~100 MB
OS + page cache         ~1 GB
Total                   ~10–11 GB
```

The pipeline peaks at ~18 GB RAM (the enricher holds all lookup tables in memory while indexing). ZRAM swap provides ~6 GB of extra headroom for pipeline runs.


What I'd Do Differently

Distributed from day one: Right now if the single instance dies, the API is down. A second read-only replica with DNS failover would fix this and isn't expensive. I'll add it when there's real traffic to justify it.

Smarter fuzzy at query time: The current fuzzy matching is decent but doesn't handle transposed words well ("Collins Melbourne St" vs "Collins St Melbourne"). A query-expansion step before Tantivy would help.

Better enrichment coverage: About 2% of addresses don't have a mesh block assignment in GNAF (mostly very new addresses). Those records come back without SA1/LGA/CED codes. A fallback polygon lookup for those cases would close the gap.


Try It

DingoFind — 100,000 free requests first month, no signup required for the live demo.

```bash
curl "https://www.dingofind.com/v1/autocomplete?q=1+george+st+sydney&limit=3" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Full API docs at dingofind.com/docs.html.
