bennaceur walid

How We Query 16.8M SIRENE Establishments in 66ms

The French SIRENE database contains information about every registered business in France — over 30 million establishments. We imported 16.8 million active ones into GEOREFER and needed to make them searchable by name in under 100ms.

Here's how we did it with PostgreSQL 16 and pg_trgm.

The Challenge

Our establishment table has 16.8 million rows. Users need to search by:

  • SIREN (9 digits) — exact match, trivial with a B-tree index
  • SIRET (14 digits) — exact match, same
  • Company name — fuzzy match, this is where it gets interesting

The name search needs to handle:

  • Partial matches: "Total" should find "TotalEnergies SE"
  • Typos: "Miclein" should find "Michelin"
  • Accent insensitivity: "Societe Generale" should match "Société Générale"
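Accent insensitivity is usually handled by normalizing at write time, for example with PostgreSQL's unaccent extension or an application-side pass before indexing. A minimal Python sketch of the idea (not necessarily GEOREFER's actual pipeline):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop combining marks (the accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Société Générale"))  # Societe Generale
```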

The Naive Approach: ILIKE

First attempt:

SELECT * FROM georefer.establishment
WHERE company_name ILIKE '%total%'
LIMIT 25;

EXPLAIN ANALYZE result:

Seq Scan on establishment
  Filter: (company_name ~~* '%total%')
  Rows Removed by Filter: 16799975
  Planning Time: 0.1ms
  Execution Time: 12,847ms

12.8 seconds. Full sequential scan on 16.8M rows. Unusable.

Enter pg_trgm

PostgreSQL's pg_trgm extension breaks strings into trigrams (3-character sequences) and uses GIN indexes to find similar strings efficiently.
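To make the mechanics concrete, here is a rough Python approximation of what pg_trgm does: lowercase, split into alphanumeric words, pad each word with two leading spaces and one trailing space, take every 3-character window, and score similarity as shared trigrams over total distinct trigrams. (The real extension's % operator only returns rows whose similarity clears pg_trgm.similarity_threshold, which defaults to 0.3.)

```python
import re

def trigrams(text: str) -> set[str]:
    """Approximation of pg_trgm's decomposition: lowercase, keep
    alphanumeric words, pad each with two leading and one trailing
    space, then slide a 3-character window."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    """Shared trigrams / total distinct trigrams, like pg_trgm's similarity()."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(sorted(trigrams("total")))
# By this approximation, the typo still clears the 0.3 default threshold:
print(round(similarity("miclein", "michelin"), 3))  # 0.308
```

This is why "Miclein" finds "Michelin": enough 3-character windows survive the typo for the pair to score above the threshold.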

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_establishment_name_trgm
ON georefer.establishment
USING GIN (company_name gin_trgm_ops);

Now with trigram similarity:

SELECT *, similarity(company_name, 'total') as sim
FROM georefer.establishment
WHERE company_name % 'total'
ORDER BY sim DESC
LIMIT 25;

EXPLAIN ANALYZE:

Bitmap Heap Scan on establishment
  Recheck Cond: (company_name % 'total')
  -> Bitmap Index Scan on idx_establishment_name_trgm
     Index Cond: (company_name % 'total')
  Planning Time: 0.3ms
  Execution Time: 66ms

66ms. From 12.8 seconds to 66ms — a 194x improvement.

The Import Strategy: 3 Phases

Importing 16.8M rows isn't straightforward. We use a 3-phase approach:

Phase 1: Schema + Staging

CREATE TABLE georefer.establishment (
    id SERIAL PRIMARY KEY,
    siren VARCHAR(9) NOT NULL,
    siret VARCHAR(14) NOT NULL UNIQUE,
    company_name VARCHAR(255),
    commercial_name VARCHAR(255),
    legal_form VARCHAR(10),
    naf_code VARCHAR(6),
    employee_range VARCHAR(5),
    postal_code VARCHAR(5),
    city VARCHAR(100),
    department_code VARCHAR(3),
    is_headquarters BOOLEAN DEFAULT FALSE,
    is_active BOOLEAN DEFAULT TRUE,
    created_date DATE,
    last_update DATE
);

Phase 2: Bulk COPY

We use PostgreSQL COPY for maximum throughput:

COPY georefer.establishment
(siren, siret, company_name, ...)
FROM '/tmp/sirene_active.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');

COPY is 10-50x faster than batch INSERT for bulk loading. The 16.8M rows load in about 8 minutes.
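One practical note: COPY ... FROM '/tmp/...' reads the file on the database server and needs elevated privileges, so from an application you'd typically stream the same CSV through COPY ... FROM STDIN instead. A hedged sketch, with psycopg2 assumed as the driver and an illustrative subset of columns:

```python
def build_copy_sql(table: str, columns: list[str]) -> str:
    """Build a COPY ... FROM STDIN statement so the CSV is streamed
    over the client connection rather than read server-side."""
    return (
        f"COPY {table} ({', '.join(columns)}) "
        "FROM STDIN WITH (FORMAT csv, HEADER true, DELIMITER ',')"
    )

# Usage with psycopg2 (assumed), streaming the file from the client:
# with conn.cursor() as cur, open("sirene_active.csv") as f:
#     cur.copy_expert(build_copy_sql("georefer.establishment",
#                                    ["siren", "siret", "company_name"]), f)
```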

Phase 3: Index Creation

We create indexes AFTER the bulk import (creating them before would slow the import):

-- Exact lookups
CREATE INDEX idx_establishment_siren ON georefer.establishment(siren);
CREATE INDEX idx_establishment_siret ON georefer.establishment(siret);

-- Geographic filtering
CREATE INDEX idx_establishment_postal ON georefer.establishment(postal_code);
CREATE INDEX idx_establishment_dept ON georefer.establishment(department_code);
CREATE INDEX idx_establishment_city ON georefer.establishment(city);

-- Industry code filtering
CREATE INDEX idx_establishment_naf ON georefer.establishment(naf_code);

-- Fuzzy name search
CREATE INDEX idx_establishment_name_trgm ON georefer.establishment USING GIN (company_name gin_trgm_ops);

Combined Query: Name + Geographic Filter

The real power comes from combining trigram search with geographic filters:

SELECT *, similarity(company_name, 'boulangerie') as sim
FROM georefer.establishment
WHERE company_name % 'boulangerie'
  AND department_code = '75'
  AND is_active = true
ORDER BY sim DESC
LIMIT 25;

This returns all bakeries in Paris in ~45ms, even across 16.8M rows.

The API Layer

The Spring Boot service exposes this via REST:

# Search by SIREN
curl 'https://georefer.io/geographical_repository/v1/companies?siren=552120222' \
  -H 'X-Georefer-API-Key: YOUR_API_KEY'

# Search by name + department
curl 'https://georefer.io/geographical_repository/v1/companies/search?name=michelin&department_code=63' \
  -H 'X-Georefer-API-Key: YOUR_API_KEY'

The response includes all SIRENE fields:

{
  "success": true,
  "data": [
    {
      "siren": "855200507",
      "siret": "85520050700046",
      "company_name": "MANUFACTURE FRANCAISE DES PNEUMATIQUES MICHELIN",
      "naf_code": "22.11Z",
      "employee_range": "5000+",
      "postal_code": "63000",
      "city": "CLERMONT-FERRAND",
      "is_headquarters": true
    }
  ]
}
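Consuming that envelope from a client is straightforward. A small sketch; the success/data shape is taken from the sample response above, while the failure handling is an assumption:

```python
import json

def extract_companies(payload: str) -> list[dict]:
    """Unwrap the {success, data} envelope and return the rows."""
    body = json.loads(payload)
    if not body.get("success"):
        raise RuntimeError("GEOREFER API reported failure")  # assumed failure mode
    return body["data"]

sample = '{"success": true, "data": [{"siren": "855200507", "city": "CLERMONT-FERRAND"}]}'
print(extract_companies(sample)[0]["siren"])  # 855200507
```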

Performance Summary

Query Type            Before (no index)   After (indexed)   Improvement
Name search           12,847 ms           66 ms             194x
Name + dept filter    13,102 ms           45 ms             291x
SIREN exact           8,200 ms            0.3 ms            27,333x
SIRET exact           8,150 ms            0.2 ms            40,750x

Lessons Learned

  1. Always create indexes after bulk import — creating them before can make the import 10x slower
  2. pg_trgm GIN indexes use a lot of disk — our 16.8M row trigram index is ~2.3 GB
  3. Set maintenance_work_mem high during index creation: SET maintenance_work_mem = '1GB' cuts index creation time roughly in half
  4. COPY beats INSERT every time for bulk loading — use COPY for anything over 10K rows

Try It

GEOREFER exposes 16.8M SIRENE establishments through a simple REST API.


AZMORIS Engineering — "Software that Endures"
