<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arque Nova</title>
    <description>The latest articles on DEV Community by Arque Nova (@arque_nova).</description>
    <link>https://dev.to/arque_nova</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898808%2F70fcbdc7-6bda-4703-9646-dbb632f35a48.png</url>
      <title>DEV Community: Arque Nova</title>
      <link>https://dev.to/arque_nova</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arque_nova"/>
    <language>en</language>
    <item>
      <title>Building a Sub-10ms Australian Address Autocomplete API in Rust</title>
      <dc:creator>Arque Nova</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:36:32 +0000</pubDate>
      <link>https://dev.to/arque_nova/building-a-sub-10ms-australian-address-autocomplete-api-in-rust-3bh</link>
      <guid>https://dev.to/arque_nova/building-a-sub-10ms-australian-address-autocomplete-api-in-rust-3bh</guid>
      <description>&lt;p&gt;Australia has one of the best open address datasets in the world. The &lt;a href="https://data.gov.au/dataset/ds-dga-19432f89-dc3a-4ef3-b943-5326ef1dbecc" rel="noopener noreferrer"&gt;Geocoded National Address File (GNAF)&lt;/a&gt; contains 15.8 million addresses — every house, apartment, rural property, and indigenous community in the country — published by the federal government under a free commercial licence.&lt;/p&gt;

&lt;p&gt;The catch: it ships as a ~1.5 GB zip of pipe-delimited files across eight relational tables, one set per state. If you want to actually search it, you need to parse it, join it, index it, and build a search API on top of it. That's what I did. The result is &lt;a href="https://www.dingofind.com" rel="noopener noreferrer"&gt;DingoFind&lt;/a&gt; — a hosted Australian address autocomplete API with typo tolerance, reverse geocoding, and ABS boundary enrichment.&lt;/p&gt;

&lt;p&gt;This post covers the technical architecture: how the data pipeline works, how the search index is built, and how the API achieves sub-10ms p99 latency on a single ARM server.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Data Problem: GNAF's Schema
&lt;/h2&gt;

&lt;p&gt;GNAF is a normalised relational dataset. An address like "1/123 Collins Street Melbourne VIC 3000" is spread across multiple files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ADDRESS_DETAIL&lt;/code&gt; — the core record (address_detail_pid, flat_number, number_first, locality_pid, street_locality_pid)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;STREET_LOCALITY&lt;/code&gt; — street name, type, and suffix (joined via street_locality_pid)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOCALITY&lt;/code&gt; — suburb name (joined via locality_pid)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;STATE&lt;/code&gt; — state name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ADDRESS_MESH_BLOCK_2021&lt;/code&gt; — joins address to mesh block PID for ABS enrichment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On top of that, there are alias tables for streets and localities (different names for the same place), and address default geocode files for lat/lon.&lt;/p&gt;

&lt;p&gt;To get a single flat address record you need to join across five or six tables. I do this once, at pipeline time, and materialise the result into a Tantivy index. The API itself never touches the raw schema.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;

&lt;p&gt;The pipeline is a one-shot Rust binary that runs weekly (or on demand). It does five things in sequence:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Download
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GNAF zip:        ~1.5 GB from data.gov.au (updated quarterly)
ABS MB xlsx:     ~30 MB  (mesh block → SA1 mapping)
ABS LGA xlsx:    ~30 MB  (mesh block → LGA name)
ABS CED xlsx:    ~30 MB  (mesh block → federal electoral division)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Downloads are skipped if files already exist in &lt;code&gt;/tmp/gnaf_staging&lt;/code&gt;, making re-runs after partial failures fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Parse
&lt;/h3&gt;

&lt;p&gt;Each state's PSV files are parsed in parallel with Rayon. The join sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ADDRESS_DETAIL
  → STREET_LOCALITY    (street_locality_pid)
  → LOCALITY           (locality_pid)
  → STATE              (state_abbreviation)
  → ADDRESS_DEFAULT_GEOCODE (address_detail_pid → lat/lon)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: a &lt;code&gt;Vec&amp;lt;Address&amp;gt;&lt;/code&gt; with flat fields: &lt;code&gt;full_address&lt;/code&gt;, &lt;code&gt;street_number&lt;/code&gt;, &lt;code&gt;street_name&lt;/code&gt;, &lt;code&gt;street_type&lt;/code&gt;, &lt;code&gt;suburb&lt;/code&gt;, &lt;code&gt;state&lt;/code&gt;, &lt;code&gt;postcode&lt;/code&gt;, &lt;code&gt;lat&lt;/code&gt;, &lt;code&gt;lon&lt;/code&gt;, &lt;code&gt;gnaf_pid&lt;/code&gt;, &lt;code&gt;mesh_block_pid&lt;/code&gt;.&lt;/p&gt;
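
&lt;p&gt;A minimal sketch of this kind of join, using only the standard library. The struct names, the trimmed-down columns, and the sample IDs are illustrative assumptions, not the actual pipeline code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashMap;

// Hypothetical, trimmed-down row types; the real GNAF tables have many more columns.
struct AddressDetail {
    pid: String,
    number_first: String,
    street_locality_pid: String,
    locality_pid: String,
}

struct Street {
    name: String,
    street_type: String,
}

// Split one pipe-delimited line into its fields.
fn split_psv(line: &amp;str) -&gt; Vec&lt;&amp;str&gt; {
    line.split('|').collect()
}

// Join one address row against pre-built lookup maps and flatten it.
// Returns None if a foreign key has no match.
fn flatten(
    addr: &amp;AddressDetail,
    streets: &amp;HashMap&lt;String, Street&gt;,
    localities: &amp;HashMap&lt;String, String&gt;,
) -&gt; Option&lt;String&gt; {
    let street = streets.get(&amp;addr.street_locality_pid)?;
    let suburb = localities.get(&amp;addr.locality_pid)?;
    Some(format!(
        "{} {} {} {}",
        addr.number_first, street.name, street.street_type, suburb
    ))
}

fn main() {
    // Made-up sample row and IDs, for illustration only.
    let fields = split_psv("PID0001|123|SL1|LOC1");
    let addr = AddressDetail {
        pid: fields[0].to_string(),
        number_first: fields[1].to_string(),
        street_locality_pid: fields[2].to_string(),
        locality_pid: fields[3].to_string(),
    };
    let mut streets = HashMap::new();
    streets.insert(
        "SL1".to_string(),
        Street { name: "COLLINS".to_string(), street_type: "STREET".to_string() },
    );
    let mut localities = HashMap::new();
    localities.insert("LOC1".to_string(), "MELBOURNE".to_string());

    println!("{} -&gt; {:?}", addr.pid, flatten(&amp;addr, &amp;streets, &amp;localities));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Building the lookup maps once and probing them per address row keeps the join linear in the number of addresses.&lt;/p&gt;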

&lt;h3&gt;
  
  
  3. Enrich
&lt;/h3&gt;

&lt;p&gt;The ABS enrichment adds statistical boundary codes. The join chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mesh_block_pid → MB_2021_PID
MB_2021_PID    → MB_CODE_2021 (11-digit code)
MB_CODE_2021   → SA1_CODE_2021
MB_CODE_2021   → LGA_NAME_2021
MB_CODE_2021   → CED_NAME_2021 (federal electorate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs entirely in-memory using &lt;code&gt;HashMap&lt;/code&gt; lookups built from the ABS xlsx files. At peak, the enricher holds ~800 MB of lookup tables for the 350,000+ mesh block codes.&lt;/p&gt;
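
&lt;p&gt;The chained lookup can be sketched roughly as follows. The &lt;code&gt;Enricher&lt;/code&gt; and &lt;code&gt;Boundaries&lt;/code&gt; types and all sample values are hypothetical placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashMap;

// Boundary attributes keyed by mesh block code; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct Boundaries {
    sa1_code: String,
    lga_name: String,
    ced_name: String,
}

struct Enricher {
    // mesh_block_pid -&gt; MB_CODE_2021
    pid_to_code: HashMap&lt;String, String&gt;,
    // MB_CODE_2021 -&gt; SA1 / LGA / CED attributes
    code_to_boundaries: HashMap&lt;String, Boundaries&gt;,
}

impl Enricher {
    // Two chained lookups. Returns None when either link is missing,
    // which leaves the address without boundary codes.
    fn enrich(&amp;self, mesh_block_pid: &amp;str) -&gt; Option&lt;&amp;Boundaries&gt; {
        let code = self.pid_to_code.get(mesh_block_pid)?;
        self.code_to_boundaries.get(code)
    }
}

fn main() {
    // Placeholder identifiers and names, not real ABS values.
    let mut pid_to_code = HashMap::new();
    pid_to_code.insert("MB_PID_1".to_string(), "00000000001".to_string());
    let mut code_to_boundaries = HashMap::new();
    code_to_boundaries.insert(
        "00000000001".to_string(),
        Boundaries {
            sa1_code: "00000000001".to_string(),
            lga_name: "Example LGA".to_string(),
            ced_name: "Example Division".to_string(),
        },
    );
    let enricher = Enricher { pid_to_code, code_to_boundaries };

    println!("{:?}", enricher.enrich("MB_PID_1"));
    println!("{:?}", enricher.enrich("UNKNOWN_PID"));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;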

&lt;h3&gt;
  
  
  4. Build the Index
&lt;/h3&gt;

&lt;p&gt;The search index is built with &lt;a href="https://github.com/quickwit-oss/tantivy" rel="noopener noreferrer"&gt;Tantivy&lt;/a&gt; — a full-text search library in Rust, broadly similar to Lucene. One Tantivy document per address, with these fields:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;full_address&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEXT (tokenised)&lt;/td&gt;
&lt;td&gt;Primary search field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;suburb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEXT (tokenised)&lt;/td&gt;
&lt;td&gt;Suburb-only searches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;postcode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEXT&lt;/td&gt;
&lt;td&gt;Postcode lookup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gnaf_pid&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEXT (stored)&lt;/td&gt;
&lt;td&gt;Unique address ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;lat&lt;/code&gt;, &lt;code&gt;lon&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;F64 (stored)&lt;/td&gt;
&lt;td&gt;For response payload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sa1_code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEXT (stored)&lt;/td&gt;
&lt;td&gt;ABS enrichment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;lga_name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEXT (stored)&lt;/td&gt;
&lt;td&gt;ABS enrichment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;federal_elec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEXT (stored)&lt;/td&gt;
&lt;td&gt;ABS enrichment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The index is written into a staging directory (&lt;code&gt;/var/lib/ausaddress/index_staging&lt;/code&gt;). At ~15.8M documents it comes out around 6.6 GB on disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Swap
&lt;/h3&gt;

&lt;p&gt;When the index is built and validated (spot-check: document count must be &amp;gt; 14 million), the pipeline does an atomic rename:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;index_staging → index_live
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline then sends &lt;code&gt;SIGUSR1&lt;/code&gt; to the running API process, which reopens the index from disk without restarting, so index updates cause no downtime.&lt;/p&gt;
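
&lt;p&gt;A rough sketch of a validate-then-rename step using only &lt;code&gt;std::fs&lt;/code&gt;. The function name, the backup directory, and the way the document count is obtained are assumptions; signalling the API process is left out:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::fs;
use std::io;
use std::path::Path;

const MIN_DOCS: u64 = 14_000_000;

// Promote the staging index only if it passes the document-count check.
// `doc_count` is assumed to come from the freshly built index.
fn swap_index(staging: &amp;Path, live: &amp;Path, doc_count: u64) -&gt; io::Result&lt;bool&gt; {
    if doc_count &lt;= MIN_DOCS {
        return Ok(false); // keep serving the old index
    }
    // rename() cannot replace a non-empty directory, so move the old one aside first.
    let backup = live.with_extension("old");
    if live.exists() {
        if backup.exists() {
            fs::remove_dir_all(&amp;backup)?;
        }
        fs::rename(live, &amp;backup)?;
    }
    fs::rename(staging, live)?;
    Ok(true)
}

fn main() -&gt; io::Result&lt;()&gt; {
    // Scratch directories under the system temp dir, for demonstration only.
    let base = std::env::temp_dir().join("index_swap_sketch");
    let _ = fs::remove_dir_all(&amp;base);
    let staging = base.join("index_staging");
    let live = base.join("index_live");
    fs::create_dir_all(&amp;staging)?;
    fs::write(staging.join("meta.json"), "{}")?;

    let swapped = swap_index(&amp;staging, &amp;live, 15_800_000)?;
    println!("swapped: {}, live exists: {}", swapped, live.exists());
    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this sketch, moving the old directory aside leaves a brief window in which no live directory exists. Swapping a symlink instead is one way to close that window.&lt;/p&gt;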




&lt;h2&gt;
  
  
  The API
&lt;/h2&gt;

&lt;p&gt;The HTTP server is &lt;a href="https://github.com/tokio-rs/axum" rel="noopener noreferrer"&gt;axum&lt;/a&gt; with &lt;a href="https://github.com/programatik29/axum-server" rel="noopener noreferrer"&gt;axum-server&lt;/a&gt; for direct TLS termination using rustls. No nginx in front of the API — it binds port 443 directly. Nginx only serves the static landing page on port 8080.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hot Path
&lt;/h3&gt;

&lt;p&gt;For an autocomplete request, the hot path is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request arrives → rustls decrypts
  → axum routes to handler
  → DashMap: look up API key hash    ~50 ns
  → AtomicU32: check daily counter   ~5 ns
  → moka: check query cache          ~0.3 ms (hit)
  → Tantivy search                   ~5–15 ms (miss)
  → serialize JSON response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 85% of requests are cache hits (moka LRU, 500k capacity, 5-minute TTL). For cache misses, Tantivy searches the mmapped index. Because the index is memory-mapped, the OS keeps hot segments in RAM — cold-start latency after a restart is higher until pages warm up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Typo Tolerance
&lt;/h3&gt;

&lt;p&gt;Tantivy supports fuzzy matching out of the box, but I added a pre-processing layer for Australian address quirks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Levenshtein distance 1–2&lt;/strong&gt; for street names and suburbs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phonetic normalisation&lt;/strong&gt; for common sound-alike misspellings ("woolloomooloo" vs "wooloomooloo")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token reordering&lt;/strong&gt; — "sydney george st 1" finds "1 George St Sydney"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abbreviation expansion&lt;/strong&gt; — "st" → Street/Saint, "ave" → Avenue, "rd" → Road&lt;/li&gt;
&lt;/ul&gt;
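
&lt;p&gt;Abbreviation expansion of this kind could look roughly like the sketch below. The table holds only the three abbreviations mentioned above, and the function names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Hypothetical abbreviation table; "st" is ambiguous, so it expands to both readings.
fn expand(token: &amp;str) -&gt; Vec&lt;&amp;str&gt; {
    match token {
        "st" =&gt; vec!["street", "saint"],
        "ave" =&gt; vec!["avenue"],
        "rd" =&gt; vec!["road"],
        other =&gt; vec![other],
    }
}

// Lowercase and split the query, then list the candidate expansions per token.
fn expand_query(query: &amp;str) -&gt; Vec&lt;Vec&lt;String&gt;&gt; {
    query
        .to_lowercase()
        .split_whitespace()
        .map(|t| expand(t).into_iter().map(String::from).collect())
        .collect()
}

fn main() {
    // Each inner list holds the alternatives for one query token.
    println!("{:?}", expand_query("1 George St Sydney"));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping every alternative for an ambiguous token lets the search stage decide which reading matches an indexed address.&lt;/p&gt;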

&lt;p&gt;The trickiest case is long suburb names. "Woolloomooloo" has 13 characters and people reliably get the vowel runs wrong. A Levenshtein distance of 2 handles most variants.&lt;/p&gt;
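
&lt;p&gt;To make the distance concrete, here is a small standalone Levenshtein implementation. It only illustrates the metric and is not how the search index evaluates fuzzy queries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Two-row dynamic-programming Levenshtein distance over chars.
fn levenshtein(a: &amp;str, b: &amp;str) -&gt; usize {
    let b_chars: Vec&lt;char&gt; = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and the first j chars of `b`.
    let mut prev: Vec&lt;usize&gt; = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let cost = if ca == *cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitute or match
                .min(prev[j + 1] + 1)   // delete from `a`
                .min(cur[j] + 1);       // insert into `a`
            cur.push(best);
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

fn main() {
    // One dropped "l" is a single edit, so this prints 1.
    println!("{}", levenshtein("woolloomooloo", "wooloomooloo"));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;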

&lt;h3&gt;
  
  
  Rate Limiting
&lt;/h3&gt;

&lt;p&gt;Rate limiting is done with an in-process &lt;code&gt;DashMap&amp;lt;String, AtomicU32&amp;gt;&lt;/code&gt; — one counter per API key per day. A background task flushes the counters to SQLite every 5 minutes and resets them at midnight. No Redis.&lt;/p&gt;

&lt;p&gt;This means counters are approximate (up to 5 minutes of requests could be lost on crash), but for the use case (daily limits, not strict quotas) it's fine and avoids a network round-trip per request.&lt;/p&gt;
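
&lt;p&gt;A dependency-free sketch of the counter check. It swaps in a plain &lt;code&gt;HashMap&lt;/code&gt; built up front in place of &lt;code&gt;DashMap&lt;/code&gt;, and the &lt;code&gt;Limiter&lt;/code&gt; type, its fields, and the limit value are assumptions. The SQLite flush is omitted:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};

// Stand-in for the concurrent map: keys are registered before serving starts.
struct Limiter {
    daily_limit: u32,
    counters: HashMap&lt;String, AtomicU32&gt;,
}

impl Limiter {
    // Returns true if the request is allowed. fetch_add only needs &amp;self,
    // so handlers can share the limiter without taking a lock.
    fn allow(&amp;self, key_hash: &amp;str) -&gt; bool {
        match self.counters.get(key_hash) {
            // fetch_add returns the value before the increment.
            Some(counter) =&gt; counter.fetch_add(1, Ordering::Relaxed) &lt; self.daily_limit,
            None =&gt; false, // unknown API key
        }
    }

    // What a midnight reset could look like.
    fn reset(&amp;self) {
        for counter in self.counters.values() {
            counter.store(0, Ordering::Relaxed);
        }
    }
}

fn main() {
    let mut counters = HashMap::new();
    counters.insert("key-hash-1".to_string(), AtomicU32::new(0));
    let limiter = Limiter { daily_limit: 2, counters };

    for _ in 0..3 {
        println!("allowed: {}", limiter.allow("key-hash-1"));
    }
    limiter.reset();
    println!("after reset: {}", limiter.allow("key-hash-1"));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;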

&lt;h3&gt;
  
  
  Index Hot-Reload
&lt;/h3&gt;

&lt;p&gt;When the pipeline finishes and swaps the index, it sends &lt;code&gt;SIGUSR1&lt;/code&gt; to the API process. The signal handler reopens the Tantivy index from disk on a background thread and atomically replaces the &lt;code&gt;Arc&amp;lt;Index&amp;gt;&lt;/code&gt; in &lt;code&gt;AppState&lt;/code&gt;. Existing in-flight requests complete against the old index; new requests pick up the new one immediately.&lt;/p&gt;
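
&lt;p&gt;One way to get this behaviour with only the standard library is an &lt;code&gt;RwLock&lt;/code&gt; around the &lt;code&gt;Arc&lt;/code&gt;. The sketch below is an assumption about the mechanism, uses a placeholder &lt;code&gt;Index&lt;/code&gt; type, and leaves out signal handling:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::sync::{Arc, RwLock};

// Placeholder for the real index handle.
struct Index {
    generation: u32,
}

struct AppState {
    index: RwLock&lt;Arc&lt;Index&gt;&gt;,
}

impl AppState {
    // Handlers clone the Arc and release the read lock immediately.
    fn current(&amp;self) -&gt; Arc&lt;Index&gt; {
        Arc::clone(&amp;self.index.read().unwrap())
    }

    // Reload path: publish a new index. The old one is freed once the
    // last in-flight request drops its Arc.
    fn replace(&amp;self, new_index: Index) {
        *self.index.write().unwrap() = Arc::new(new_index);
    }
}

fn main() {
    let state = AppState {
        index: RwLock::new(Arc::new(Index { generation: 1 })),
    };
    let in_flight = state.current(); // a request that started before the swap
    state.replace(Index { generation: 2 });

    println!(
        "in-flight sees generation {}, new requests see generation {}",
        in_flight.generation,
        state.current().generation
    );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;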




&lt;h2&gt;
  
  
  Infrastructure
&lt;/h2&gt;

&lt;p&gt;Everything runs on a single Oracle Cloud Ampere A1 instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; 4 OCPU (ARM Neoverse N1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 24 GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; 48 GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Ubuntu 24.04&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAM breakdown at steady state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tantivy index (mmap)    ~6.6 GB
moka query cache        ~2 GB
RTree (reverse geocode) ~830 MB
DashMap + counters      ~100 MB
OS + page cache         ~1 GB
Total                   ~10–11 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline peaks at ~18 GB RAM (the enricher holds all lookup tables in memory while indexing). ZRAM swap provides ~6 GB of extra headroom for pipeline runs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Distributed from day one:&lt;/strong&gt; Right now if the single instance dies, the API is down. A second read-only replica with DNS failover would fix this and isn't expensive. I'll add it when there's real traffic to justify it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smarter fuzzy at query time:&lt;/strong&gt; The current fuzzy matching is decent but doesn't handle transposed words well ("Collins Melbourne St" vs "Collins St Melbourne"). A query-expansion step before Tantivy would help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better enrichment coverage:&lt;/strong&gt; About 2% of addresses don't have a mesh block assignment in GNAF (mostly very new addresses). Those records come back without SA1/LGA/CED codes. A fallback polygon lookup for those cases would close the gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.dingofind.com" rel="noopener noreferrer"&gt;DingoFind&lt;/a&gt; offers 100,000 free requests in the first month, and the live demo needs no signup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"https://www.dingofind.com/v1/autocomplete?q=1+george+st+sydney&amp;amp;limit=3"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_API_KEY"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full API docs at &lt;a href="https://www.dingofind.com/docs.html" rel="noopener noreferrer"&gt;dingofind.com/docs.html&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>australia</category>
      <category>api</category>
      <category>rust</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
