bennaceur walid

Building a French Address Validation API with 26M Addresses

The French government's Base Adresse Nationale (BAN) contains 26 million addresses — every street, every house number, every hamlet across mainland France and overseas territories. We built GEOREFER to make this data accessible through a single REST API, combined with company lookup from the SIRENE database.

This is the technical story of how we did it.

The Problem: Fragmented French Geographic Data

If you're building a FinTech product in France, you need to validate customer addresses for KYC compliance. Sounds simple, right?

Here's what the landscape looks like in 2026:

  • API Adresse (BAN) — Free, but no SLA, rate-limited to 50 req/s, and no company data
  • La Poste RNVP — The gold standard for postal validation, but no public REST API
  • Google Address Validation — Global coverage but $0.005/request adds up fast, and no SIRENE integration
  • INSEE API SIRENE — Company data, but separate authentication, slow responses (~500ms), and no address validation

To do proper KYC, you need at least two of these APIs, with different auth mechanisms, different response formats, and different rate limits.

We decided to build one API that does it all.

Architecture Overview

GEOREFER is built on a straightforward Java stack:

Java 11 + Spring Boot 2.7.5
PostgreSQL 16 (42M+ rows across 12 tables)
Redis 7 (API key cache, TTL 5min)
Elasticsearch 7.17 (city autocomplete, fuzzy search)

The architecture follows a clean layered approach:

REST Controllers (17 controllers, 39 endpoints)
    |
Business Services (12 interfaces, 16 implementations)
    |
Repositories (JPA + Elasticsearch)
    |
PostgreSQL + Redis + Elasticsearch

Importing 26M Addresses from the BAN

The BAN publishes its data as CSV files, updated monthly. The full dataset is around 3.5 GB compressed.

Our import strategy:

  1. Download the latest BAN CSV export
  2. Parse it with a streaming CSV reader, so the full file is never held in memory
  3. Insert rows with JDBC batch operations (batch size = 5000)
  4. Index city data into Elasticsearch for autocomplete
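
Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not GEOREFER's actual importer: the class name `BatchingCsvReader`, the `;` delimiter, and the naive `split` (no quoted-field handling) are assumptions made for the example. It streams the file line by line and hands rows to a caller-supplied sink in fixed-size batches.

```java
import java.io.BufferedReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: streams a delimited file and emits rows in fixed-size batches. */
final class BatchingCsvReader {

    private BatchingCsvReader() {}

    /**
     * Skips the header line, then passes rows to the sink in batches of at most batchSize.
     * Assumption: fields are separated by ';' and never contain a quoted delimiter.
     * Returns the number of data rows read.
     */
    static long readInBatches(BufferedReader reader, int batchSize,
                              Consumer<List<String[]>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        // lines() is lazy, so only the current batch is held in memory.
        Iterator<String> lines = reader.lines().iterator();
        if (lines.hasNext()) {
            lines.next(); // header row
        }
        List<String[]> batch = new ArrayList<>(batchSize);
        long rows = 0;
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.isEmpty()) {
                continue;
            }
            batch.add(line.split(";", -1)); // -1 keeps trailing empty fields
            rows++;
            if (batch.size() == batchSize) {
                sink.accept(batch);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // final partial batch
        }
        return rows;
    }
}
```

In a real import the sink would typically call `PreparedStatement.addBatch()` for each row and `executeBatch()` once per flush, with the batch size of 5000 mentioned above.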

The key challenge was handling the French administrative hierarchy:

Region (18) → Department (101) → Commune (35,000+) → Address (26M)

Each commune has an INSEE code (5 digits), one or more postal codes, and belongs to exactly one department. Paris, Lyon, and Marseille have arrondissements that function as sub-communes with their own INSEE codes.
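
Because the department is encoded in the INSEE code itself, the commune-to-department link can be derived directly. The helper below is an illustrative sketch with a hypothetical class name; it assumes the standard layout where the first two characters are the department (including Corsica's 2A and 2B) and overseas codes starting with 97 or 98 use a three-character prefix.

```java
/** Hypothetical helper for reading the department out of an INSEE commune code. */
final class InseeCodes {

    private InseeCodes() {}

    /** Returns the department code embedded in a 5-character INSEE commune code. */
    static String departmentOf(String inseeCode) {
        // Two digits (or 2A / 2B for Corsica) followed by three digits.
        if (inseeCode == null || !inseeCode.matches("(\\d{2}|2[AB])\\d{3}")) {
            throw new IllegalArgumentException("Not a valid INSEE commune code: " + inseeCode);
        }
        // Overseas departments (971, 974, ...) and, by assumption, overseas
        // collectivities (98x) are identified by a three-character prefix.
        if (inseeCode.startsWith("97") || inseeCode.startsWith("98")) {
            return inseeCode.substring(0, 3);
        }
        return inseeCode.substring(0, 2);
    }
}
```

For example, Paris (75056) and its arrondissements (75101 to 75120) all resolve to department 75, while Ajaccio (2A004) resolves to 2A.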

We store communes in a french_town_desc table with full hierarchy:

-- Case-insensitive prefix match on the commune name, with its department and region
SELECT f.name, f.insee_code, f.postal_code,
       d.name AS department, r.name AS region
FROM georefer.french_town_desc f
JOIN georefer.department d ON f.department_code = d.code
JOIN georefer.region r ON d.region_code = r.code
WHERE f.name ILIKE 'paris%';

Address Validation with GeoTrust Scoring

The core feature is POST /addresses/validate. You send a French address, and we return:

  • Confidence score (0-100) — how sure we are the address exists
  • GeoTrust Score (0-100) — composite reliability score for KYC
  • Validated address — normalized, corrected, with GPS coordinates
  • AFNOR format — postal-standard NF Z 10-011 formatting

The GeoTrust Score is a weighted composite:

| Component | Weight | What it measures |
| --- | --- | --- |
| Confidence | 35% | Street-level address matching |
| Geo Consistency | 25% | Cross-validation: postal code vs commune vs department |
| Postal Match | 20% | Postal code precision (exact, partial, invalid) |
| Country Risk | 20% | FATF/GAFI country risk rating |

Request:

curl -X POST 'https://georefer.io/geographical_repository/v1/addresses/validate' \
  -H 'Content-Type: application/json' \
  -H 'X-Georefer-API-Key: YOUR_API_KEY' \
  -d '{
    "street_line": "15 Rue de la Paix",
    "postal_code": "75002",
    "city": "Paris",
    "country_code": "FR"
  }'

Response:

{
  "success": true,
  "data": {
    "validated_address": {
      "street_line": "15 Rue de la Paix",
      "postal_code": "75002",
      "city": "PARIS",
      "country": "France"
    },
    "confidence_score": 95,
    "geotrust_score": {
      "overall": 92,
      "level": "LOW",
      "components": {
        "confidence": 95,
        "geo_consistency": 100,
        "postal_match": 100,
        "country_risk": 0
      }
    }
  }
}
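
One plausible reading of the GeoTrust weights is a plain weighted sum, sketched below. The class and method names are hypothetical, and the exact formula GEOREFER applies is not spelled out in this article, so treat this as an illustration of the weighting only. In particular, a plain sum does not reproduce the sample's overall score of 92: the component values shown give 78.25 if `country_risk` is fed in directly and 98.25 if it is inverted (100 minus risk), so the production scoring presumably includes further adjustments.

```java
/** Hypothetical illustration of a 35/25/20/20 weighted composite. */
final class GeoTrustSketch {

    private GeoTrustSketch() {}

    /**
     * Weighted sum of four 0-100 components.
     * countryComponent is whatever 0-100 value feeds the "Country Risk" term;
     * whether that is the raw risk or its inverse is an open assumption.
     */
    static double weightedSum(int confidence, int geoConsistency,
                              int postalMatch, int countryComponent) {
        // Integer weights in percent keep the arithmetic exact until the final division.
        int weighted = 35 * confidence
                     + 25 * geoConsistency
                     + 20 * postalMatch
                     + 20 * countryComponent;
        return weighted / 100.0;
    }
}
```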

Elasticsearch for City Autocomplete

City autocomplete needs to be fast — under 50ms for a good UX. We use Elasticsearch's Completion Suggester with a custom analyzer:

city_analyzer: edge_ngram (min=2, max=15) + ascii_folding
city_search_analyzer: standard + ascii_folding

ASCII folding is critical for French cities. Users type "Beziers" but the official name is "Béziers". Our analyzer handles both.

The GET /cities/autocomplete?q=marseil&limit=5 endpoint returns results in under 50ms, even with 35,000+ communes indexed.
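
To make the analyzer's effect concrete, the sketch below mimics folding and edge n-gram generation in plain Java. It is not Elasticsearch code: the class name is hypothetical, and `java.text.Normalizer` only strips combining accents, whereas Elasticsearch's folding filter also maps characters such as "œ" that do not decompose. Lowercasing is included on the assumption that the analyzer chain lowercases as well.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for what the index-time analyzer produces. */
final class CityTokens {

    private CityTokens() {}

    /** Lowercases and strips combining accents, so accented and plain spellings agree. */
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    /** Prefixes of the folded input with lengths from minGram to maxGram. */
    static List<String> edgeNgrams(String input, int minGram, int maxGram) {
        String folded = fold(input);
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, folded.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(folded.substring(0, len));
        }
        return grams;
    }
}
```

With min=2 and max=15, "Marseille" yields the prefixes "ma", "mar", and so on up to "marseille", which is why the partial query "marseil" can match an indexed term exactly.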

We also support fuzzy search with GET /cities/search?q=Monplier — using Elasticsearch's fuzziness AUTO parameter, this correctly returns "Montpellier" despite the typos.
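
For intuition on what `AUTO` permits, here is a small sketch of its edit budget together with a plain Levenshtein distance. The class name is hypothetical. The thresholds follow Elasticsearch's documented behaviour for `AUTO` (terms of up to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two). Elasticsearch also counts a swap of two adjacent characters as a single edit by default, which this simplified distance does not.

```java
/** Hypothetical illustration of the AUTO edit budget and a basic edit distance. */
final class Fuzziness {

    private Fuzziness() {}

    /** Maximum edits AUTO grants for a term of the given length. */
    static int autoMaxEdits(int termLength) {
        if (termLength <= 2) {
            return 0;
        }
        if (termLength <= 5) {
            return 1;
        }
        return 2;
    }

    /** Levenshtein distance (insert, delete, substitute) using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Under this measure "monplier" is three insertions away from "montpellier", one more than the two edits `AUTO` allows for a whole term, so a match like this one probably also relies on the shorter edge n-gram terms in the index rather than on fuzziness alone.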

Multi-Tenant API Keys & Rate Limiting

GEOREFER is a SaaS with 5 subscription plans:

| Plan | Daily limit | Rate/min | Price |
| --- | --- | --- | --- |
| DEMO | 50 | 10 | Free |
| FREE | 100 | 10 | Free |
| STARTER | 5,000 | 30 | 49 EUR/mo |
| PRO | 50,000 | 60 | 199 EUR/mo |
| ENTERPRISE | Unlimited | 200 | Custom |

Each API key gets its own token bucket (Bucket4j) for rate limiting. Authentication goes through a Spring filter chain:

Request → API Key validation (Redis cache) → Quota check → Rate limit → Feature gate → Controller
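
The token-bucket idea behind the rate-limit step can be illustrated with a small hand-rolled class. This is deliberately not the Bucket4j API, only a sketch of the same technique; the class name and the injectable clock (there to make the behaviour testable) are assumptions of the example.

```java
import java.util.function.LongSupplier;

/** Hypothetical minimal token bucket: holds up to `capacity` tokens, refilled evenly over each period. */
final class TokenBucket {

    private final long capacity;
    private final long periodNanos;
    private final LongSupplier nanoClock;
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, long periodNanos, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.periodNanos = periodNanos;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full
        this.lastRefill = nanoClock.getAsLong();
    }

    /** Takes one token if available; returns false when the caller should be throttled. */
    synchronized boolean tryConsume() {
        long now = nanoClock.getAsLong();
        // Add tokens in proportion to elapsed time, capped at capacity.
        double refill = (double) (now - lastRefill) * capacity / periodNanos;
        tokens = Math.min(capacity, tokens + refill);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

A per-key setup could keep one bucket per API key in a `ConcurrentHashMap`, created on first use with the plan's rate (for example 10 tokens per minute for DEMO) and `System::nanoTime` as the clock.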

The Feature Gate controls which endpoints each plan can access. For example, company search (/companies) requires PRO or higher, while city search is available on all plans.
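
A gate like this can be modelled as a minimum plan per path prefix. The sketch below is illustrative only: the class, the prefix-matching rule, and the "no rule means open to all plans" default are assumptions, while the plan names and the PRO requirement for /companies come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical feature gate: each path prefix may require a minimum plan. */
final class FeatureGate {

    /** Plans in ascending order, so ordinal() doubles as the tier rank. */
    enum Plan { DEMO, FREE, STARTER, PRO, ENTERPRISE }

    private final Map<String, Plan> minimumPlanByPrefix = new LinkedHashMap<>();

    /** Registers a rule; returns this so rules can be chained. */
    FeatureGate require(String pathPrefix, Plan minimum) {
        minimumPlanByPrefix.put(pathPrefix, minimum);
        return this;
    }

    /** True when the plan meets the first matching rule, or when no rule matches. */
    boolean isAllowed(Plan plan, String path) {
        for (Map.Entry<String, Plan> rule : minimumPlanByPrefix.entrySet()) {
            if (path.startsWith(rule.getKey())) {
                return plan.ordinal() >= rule.getValue().ordinal();
            }
        }
        return true; // no rule registered: available on every plan
    }
}
```

With a single rule `require("/companies", Plan.PRO)`, a STARTER key is refused on /companies while a DEMO key can still reach /cities/search.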

What's Next

We're currently at 16.8 million SIRENE establishments imported and 35,000+ communes indexed. The API handles 39 endpoints across geographic data, address validation, company search, and admin/billing.

If you're building anything that touches French addresses or company data, give it a try:

In the next article, we'll deep-dive into how we query 16.8M SIRENE establishments in 66ms using PostgreSQL trigram indexes.


AZMORIS Engineering — "Software that Endures"
