Arque Nova
Building a Sub-10ms Australian Address Autocomplete API in Rust

Australia has one of the best open address datasets in the world. The Geocoded National Address File (GNAF) contains 15.8 million addresses — every house, apartment, rural property, and Indigenous community in the country — published by the federal government under a free commercial licence.

The catch: it ships as a ~1.5 GB zip of pipe-delimited files across eight relational tables, one set per state. If you want to actually search it, you need to parse it, join it, index it, and build a search API on top of it. That's what I did. The result is DingoFind — a hosted Australian address autocomplete API with typo tolerance, reverse geocoding, and ABS boundary enrichment.

This post covers the technical architecture: how the data pipeline works, how the search index is built, and how the API achieves sub-10ms p99 latency on a single ARM server.


The Data Problem: GNAF's Schema

GNAF is a normalised relational dataset. An address like "1/123 Collins Street Melbourne VIC 3000" is spread across multiple files:

  • ADDRESS_DETAIL — the core record (address_detail_pid, flat_number, number_first, locality_pid, street_locality_pid)
  • STREET_LOCALITY — street name, type, and suffix (joined via street_locality_pid)
  • LOCALITY — suburb name (joined via locality_pid)
  • STATE — state name
  • ADDRESS_MESH_BLOCK_2021 — joins address to mesh block PID for ABS enrichment

On top of that, there are alias tables for streets and localities (different names for the same place), and address default geocode files for lat/lon.

To get a single flat address record you need to join across five or six tables. I do this once, at pipeline time, and materialise the result into a Tantivy index. The API itself never touches the raw schema.


The Pipeline

The pipeline is a one-shot Rust binary that runs weekly (or on demand). It does five things in sequence:

1. Download

```
GNAF zip:        ~1.5 GB from data.gov.au (updated quarterly)
ABS MB xlsx:     ~30 MB  (mesh block → SA1 mapping)
ABS LGA xlsx:    ~30 MB  (mesh block → LGA name)
ABS CED xlsx:    ~30 MB  (mesh block → federal electoral division)
```

Downloads are skipped if files already exist in /tmp/gnaf_staging, making re-runs after partial failures fast.
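The skip-if-present check might look something like this. It's a std-only sketch: the actual HTTP client (presumably reqwest or similar in the real pipeline) is stubbed out, and `fetch_if_missing` is a hypothetical name.

```rust
use std::fs;
use std::path::Path;

/// Download `url` to `dest` unless a previous run already staged it.
/// The real HTTP fetch is stubbed out here; only the skip logic is shown.
fn fetch_if_missing(url: &str, dest: &Path) -> std::io::Result<bool> {
    if dest.exists() && fs::metadata(dest)?.len() > 0 {
        // Already staged from an earlier (possibly partial) run: skip.
        return Ok(false);
    }
    // Placeholder for the real download; here we just write a marker file.
    fs::write(dest, format!("downloaded from {url}"))?;
    Ok(true)
}

fn main() -> std::io::Result<()> {
    let staging = std::env::temp_dir().join("gnaf_staging_demo");
    fs::create_dir_all(&staging)?;
    let dest = staging.join("gnaf.zip");
    let first = fetch_if_missing("https://data.gov.au/...", &dest)?;
    let second = fetch_if_missing("https://data.gov.au/...", &dest)?;
    println!("first run downloaded: {first}, re-run downloaded: {second}");
    Ok(())
}
```

A zero-length file is treated as a failed download and retried, which is the case a partial failure tends to leave behind.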

2. Parse

Each state's PSV files are parsed in parallel with Rayon. The join sequence:

```
ADDRESS_DETAIL
  → STREET_LOCALITY    (street_locality_pid)
  → LOCALITY           (locality_pid)
  → STATE              (state_abbreviation)
  → ADDRESS_DEFAULT_GEOCODE (address_detail_pid → lat/lon)
```

Output: a Vec<Address> with flat fields: full_address, street_number, street_name, street_type, suburb, state, postcode, lat, lon, gnaf_pid, mesh_block_pid.
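The per-row join reduces to hash lookups against pre-built maps. A simplified, single-threaded sketch (real field layout and the Rayon parallelism omitted; `join_row` and the sample values are illustrative):

```rust
use std::collections::HashMap;

/// Flattened output record (a simplified subset of the real fields).
#[derive(Debug, PartialEq)]
struct Address {
    full_address: String,
    suburb: String,
    state: String,
}

/// Join one ADDRESS_DETAIL row against pre-built lookup maps.
fn join_row(
    detail: &[&str], // [address_detail_pid, number_first, street_locality_pid, locality_pid]
    streets: &HashMap<&str, &str>,            // street_locality_pid -> street name
    localities: &HashMap<&str, (&str, &str)>, // locality_pid -> (suburb, state)
) -> Option<Address> {
    let street = streets.get(detail[2])?;
    let (suburb, state) = localities.get(detail[3])?;
    Some(Address {
        full_address: format!("{} {} {} {}", detail[1], street, suburb, state),
        suburb: suburb.to_string(),
        state: state.to_string(),
    })
}

fn main() {
    // Pipe-delimited row as it comes out of a PSV file (illustrative values).
    let streets = HashMap::from([("SL1", "COLLINS STREET")]);
    let localities = HashMap::from([("L1", ("MELBOURNE", "VIC"))]);
    let detail: Vec<&str> = "GAVIC1|123|SL1|L1".split('|').collect();
    let addr = join_row(&detail, &streets, &localities).unwrap();
    println!("{}", addr.full_address); // 123 COLLINS STREET MELBOURNE VIC
}
```

Rows that fail a lookup fall out as `None`, which is also how orphaned PIDs get filtered.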

3. Enrich

The ABS enrichment adds statistical boundary codes. The join chain:

```
mesh_block_pid → MB_2021_PID
MB_2021_PID    → MB_CODE_2021 (11-digit code)
MB_CODE_2021   → SA1_CODE_2021
MB_CODE_2021   → LGA_NAME_2021
MB_CODE_2021   → CED_NAME_2021 (federal electorate)
```

This runs entirely in-memory using HashMap lookups built from the ABS xlsx files. At peak, the enricher holds ~800 MB of lookup tables for the 350,000+ mesh block codes.

4. Build the Index

The search index is built with Tantivy — a full-text search library in Rust, broadly similar to Lucene. One Tantivy document per address, with these fields:

| Field | Type | Purpose |
| --- | --- | --- |
| full_address | TEXT (tokenised) | Primary search field |
| suburb | TEXT (tokenised) | Suburb-only searches |
| postcode | TEXT | Postcode lookup |
| gnaf_pid | TEXT (stored) | Unique address ID |
| lat, lon | F64 (stored) | For response payload |
| sa1_code | TEXT (stored) | ABS enrichment |
| lga_name | TEXT (stored) | ABS enrichment |
| federal_elec | TEXT (stored) | ABS enrichment |
The index is written into a staging directory (/var/lib/ausaddress/index_staging). At ~15.8M documents it comes out around 6.6 GB on disk.

5. Swap

When the index is built and validated (spot-check: document count must be > 14 million), the pipeline does an atomic rename:

```
index_staging → index_live
```

The pipeline then sends SIGUSR1 to the running API process, which reopens the index from disk without restarting. Zero-downtime index updates.
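The swap can be sketched with `std::fs::rename`, which is atomic on the same filesystem but cannot overwrite a non-empty directory, so the old live index has to be moved aside first. The two-rename detail and the `promote` name are assumptions for illustration; the validation and SIGUSR1 steps are omitted:

```rust
use std::fs;
use std::path::Path;

/// Promote a freshly built index: move the old live directory aside,
/// then rename staging into place. Each rename is atomic.
fn promote(base: &Path) -> std::io::Result<()> {
    let staging = base.join("index_staging");
    let live = base.join("index_live");
    let old = base.join("index_old");
    if live.exists() {
        let _ = fs::remove_dir_all(&old); // discard the previous generation
        fs::rename(&live, &old)?;
    }
    fs::rename(&staging, &live)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let base = std::env::temp_dir().join("promote_demo");
    let _ = fs::remove_dir_all(&base);
    fs::create_dir_all(base.join("index_staging"))?;
    fs::write(base.join("index_staging/meta.json"), "{}")?;
    promote(&base)?;
    assert!(base.join("index_live/meta.json").exists());
    Ok(())
}
```

Keeping `index_old` around also gives a trivial rollback path if the new index turns out to be bad.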


The API

The HTTP server is axum with axum-server for direct TLS termination using rustls. No nginx in front of the API — it binds port 443 directly. Nginx only serves the static landing page on port 8080.

Hot Path

For an autocomplete request, the hot path is:

```
Request arrives → rustls decrypts
  → axum routes to handler
  → DashMap: look up API key hash    ~50 ns
  → AtomicU32: check daily counter   ~5 ns
  → moka: check query cache          ~0.3 ms (hit)
  → Tantivy search                   ~5–15 ms (miss)
  → serialize JSON response
```

About 85% of requests are cache hits (moka LRU, 500k capacity, 5-minute TTL). For cache misses, Tantivy searches the mmapped index. Because the index is memory-mapped, the OS keeps hot segments in RAM — cold-start latency after a restart is higher until pages warm up.
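The cache check is conceptually a TTL'd map lookup. A dependency-free stand-in for what moka provides (moka additionally handles LRU eviction at the 500k capacity and lock-free concurrent access, none of which this sketch attempts):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy TTL cache: entries expire after `ttl`, mimicking the 5-minute
/// query-cache behaviour. Not a moka replacement, just the hot-path shape.
struct QueryCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>,
}

impl QueryCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }
    fn get(&self, q: &str) -> Option<&str> {
        match self.entries.get(q) {
            Some((at, json)) if at.elapsed() < self.ttl => Some(json.as_str()),
            _ => None, // miss or expired: fall through to Tantivy
        }
    }
    fn put(&mut self, q: String, json: String) {
        self.entries.insert(q, (Instant::now(), json));
    }
}

fn main() {
    let mut cache = QueryCache::new(Duration::from_secs(300));
    assert!(cache.get("1 george st").is_none()); // miss: would hit Tantivy
    cache.put("1 george st".into(), "[{\"full_address\":\"...\"}]".into());
    assert!(cache.get("1 george st").is_some()); // hit: the ~0.3 ms path
}
```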

Typo Tolerance

Tantivy supports fuzzy matching out of the box, but I added a pre-processing layer for Australian address quirks:

  • Levenshtein distance 1–2 for street names and suburbs
  • Phonetic normalisation for common sound-alike misspellings ("woolloomooloo" vs "wooloomooloo")
  • Token reordering — "sydney george st 1" finds "1 George St Sydney"
  • Abbreviation expansion — "st" → Street/Saint, "ave" → Avenue, "rd" → Road

The trickiest case is long suburb names. "Woolloomooloo" has 13 characters and people reliably get the vowel and consonant runs wrong. A Levenshtein distance of 2 handles most variants.
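For reference, the edit-distance computation itself is the classic dynamic-programming recurrence (in production this is Tantivy's fuzzy matching rather than a hand-rolled function, so treat this as an illustration of the metric):

```rust
/// Classic dynamic-programming Levenshtein distance, row by row.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            cur.push(substitute.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // The dropped 'l' is one edit away: comfortably inside the distance-2 budget.
    assert_eq!(levenshtein("woolloomooloo", "wooloomooloo"), 1);
    // A heavily mangled variant falls outside it and won't match.
    assert!(levenshtein("woolloomooloo", "woolamaloo") > 2);
}
```

Distance 2 over a 13-character token is a tight budget in relative terms, which is why the phonetic normalisation pass runs first.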

Rate Limiting

Rate limiting is done with an in-process DashMap<String, AtomicU32> — one counter per API key per day. A background task flushes the counters to SQLite every 5 minutes and resets them at midnight. No Redis.

This means counters are approximate (up to 5 minutes of requests could be lost on crash), but for the use case (daily limits, not strict quotas) it's fine and avoids a network round-trip per request.
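The counter check itself might look like the sketch below. To stay std-only it wraps the map in a `Mutex` where the real service uses DashMap's sharded locking; the SQLite flush and midnight reset are omitted, and `RateLimiter` is a hypothetical name:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

/// Per-API-key daily request counters.
struct RateLimiter {
    daily_limit: u32,
    counters: Mutex<HashMap<String, AtomicU32>>,
}

impl RateLimiter {
    fn new(daily_limit: u32) -> Self {
        Self { daily_limit, counters: Mutex::new(HashMap::new()) }
    }

    /// Returns true if the request is allowed: one atomic increment,
    /// no network round-trip.
    fn check(&self, api_key: &str) -> bool {
        self.counters
            .lock()
            .unwrap()
            .entry(api_key.to_string())
            .or_insert_with(|| AtomicU32::new(0))
            .fetch_add(1, Ordering::Relaxed)
            < self.daily_limit
    }
}

fn main() {
    let rl = RateLimiter::new(2);
    assert!(rl.check("key1"));
    assert!(rl.check("key1"));
    assert!(!rl.check("key1")); // third request of the "day" is refused
}
```

`fetch_add` returns the previous value, so the comparison admits exactly `daily_limit` requests per key.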

Index Hot-Reload

When the pipeline finishes and swaps the index, it sends SIGUSR1 to the API process. The signal handler reopens the Tantivy index from disk on a background thread and atomically replaces the Arc<Index> in AppState. Existing in-flight requests complete against the old index; new requests pick up the new one immediately.
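The swap itself is just replacing a shared `Arc`. A std-only sketch of the idea (a crate like arc-swap would avoid the lock entirely; `SearchIndex` and `generation` are stand-ins for the real `Arc<Index>` in AppState):

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for the reopened Tantivy index.
struct SearchIndex { generation: u64 }

/// Requests clone the Arc out under a brief read lock, then search
/// without holding any lock; the reload path swaps the Arc in place.
struct AppState { index: RwLock<Arc<SearchIndex>> }

impl AppState {
    fn current(&self) -> Arc<SearchIndex> {
        Arc::clone(&self.index.read().unwrap())
    }
    fn hot_reload(&self, fresh: SearchIndex) {
        *self.index.write().unwrap() = Arc::new(fresh);
    }
}

fn main() {
    let state = AppState { index: RwLock::new(Arc::new(SearchIndex { generation: 1 })) };

    let in_flight = state.current(); // a request starts against generation 1
    state.hot_reload(SearchIndex { generation: 2 }); // SIGUSR1 handler swaps

    assert_eq!(in_flight.generation, 1);       // old request is unaffected
    assert_eq!(state.current().generation, 2); // new requests see the new index
}
```

The old index stays alive until the last in-flight clone drops, which is exactly the "existing requests complete against the old index" behaviour described above.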


Infrastructure

Everything runs on a single Oracle Cloud Ampere A1 instance:

  • CPU: 4 OCPU (ARM Neoverse N1)
  • RAM: 24 GB
  • Disk: 48 GB
  • OS: Ubuntu 24.04

RAM breakdown at steady state:

```
Tantivy index (mmap)    ~6.6 GB
moka query cache        ~2 GB
RTree (reverse geocode) ~830 MB
DashMap + counters      ~100 MB
OS + page cache         ~1 GB
Total                   ~10–11 GB
```

The pipeline peaks at ~18 GB RAM (the enricher holds all lookup tables in memory while indexing). ZRAM swap provides ~6 GB of extra headroom for pipeline runs.


What I'd Do Differently

Distributed from day one: Right now if the single instance dies, the API is down. A second read-only replica with DNS failover would fix this and isn't expensive. I'll add it when there's real traffic to justify it.

Smarter fuzzy at query time: The current fuzzy matching is decent but doesn't handle transposed words well ("Collins Melbourne St" vs "Collins St Melbourne"). A query-expansion step before Tantivy would help.

Better enrichment coverage: About 2% of addresses don't have a mesh block assignment in GNAF (mostly very new addresses). Those records come back without SA1/LGA/CED codes. A fallback polygon lookup for those cases would close the gap.


Try It

DingoFind — 100,000 free requests first month, no signup required for the live demo.

```bash
curl "https://www.dingofind.com/v1/autocomplete?q=1+george+st+sydney&limit=3" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Full API docs at dingofind.com/docs.html.
