bennaceur walid

How We Query 16.8M SIRENE Establishments in 66ms

The French SIRENE database contains information about every registered business in France — over 30 million establishments. We imported 16.8 million active ones into GEOREFER and needed to make them searchable by name in under 100ms.

Here's how we did it with PostgreSQL 16 and pg_trgm.

The Challenge

Our establishment table has 16.8 million rows. Users need to search by:

  • SIREN (9 digits) — exact match, trivial with a B-tree index
  • SIRET (14 digits) — exact match, same
  • Company name — fuzzy match, this is where it gets interesting

The name search needs to handle:

  • Partial matches: "Total" should find "TotalEnergies SE"
  • Typos: "Miclein" should find "Michelin"
  • Accent insensitivity: "Societe Generale" should match "Société Générale"
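Accent insensitivity is usually handled by normalizing at write time, for example with PostgreSQL's unaccent extension or an application-side pass before indexing. A minimal Python sketch of the idea (not necessarily GEOREFER's actual pipeline):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop combining marks (the accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Société Générale"))  # Societe Generale
```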

The Naive Approach: ILIKE

First attempt:

SELECT * FROM georefer.establishment
WHERE company_name ILIKE '%total%'
LIMIT 25;

EXPLAIN ANALYZE result:

Seq Scan on establishment
  Filter: (company_name ~~* '%total%')
  Rows Removed by Filter: 16799975
  Planning Time: 0.1ms
  Execution Time: 12,847ms

12.8 seconds. Full sequential scan on 16.8M rows. Unusable.

Enter pg_trgm

PostgreSQL's pg_trgm extension breaks strings into trigrams (3-character sequences) and uses GIN indexes to find similar strings efficiently.
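To make the mechanics concrete, here is a rough Python approximation of what pg_trgm does: lowercase, split into alphanumeric words, pad each word with two leading spaces and one trailing space, take every 3-character window, and score similarity as shared trigrams over total distinct trigrams. (The real extension's % operator only returns rows whose similarity clears pg_trgm.similarity_threshold, which defaults to 0.3.)

```python
import re

def trigrams(text: str) -> set[str]:
    """Approximation of pg_trgm's decomposition: lowercase, keep
    alphanumeric words, pad each with two leading and one trailing
    space, then slide a 3-character window."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    """Shared trigrams / total distinct trigrams, like pg_trgm's similarity()."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(sorted(trigrams("total")))
# By this approximation, the typo still clears the 0.3 default threshold:
print(round(similarity("miclein", "michelin"), 3))  # 0.308
```

This is why "Miclein" finds "Michelin": enough 3-character windows survive the typo for the pair to score above the threshold.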

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_establishment_name_trgm
ON georefer.establishment
USING GIN (company_name gin_trgm_ops);

Now with trigram similarity:

SELECT *, similarity(company_name, 'total') as sim
FROM georefer.establishment
WHERE company_name % 'total'
ORDER BY sim DESC
LIMIT 25;

EXPLAIN ANALYZE:

Bitmap Heap Scan on establishment
  Recheck Cond: (company_name % 'total')
  -> Bitmap Index Scan on idx_establishment_name_trgm
     Index Cond: (company_name % 'total')
  Planning Time: 0.3ms
  Execution Time: 66ms

66ms. From 12.8 seconds to 66ms — a 194x improvement.

The Import Strategy: 3 Phases

Importing 16.8M rows isn't straightforward. We use a 3-phase approach:

Phase 1: Schema + Staging

CREATE TABLE georefer.establishment (
    id SERIAL PRIMARY KEY,
    siren VARCHAR(9) NOT NULL,
    siret VARCHAR(14) NOT NULL UNIQUE,
    company_name VARCHAR(255),
    commercial_name VARCHAR(255),
    legal_form VARCHAR(10),
    naf_code VARCHAR(6),
    employee_range VARCHAR(5),
    postal_code VARCHAR(5),
    city VARCHAR(100),
    department_code VARCHAR(3),
    is_headquarters BOOLEAN DEFAULT FALSE,
    is_active BOOLEAN DEFAULT TRUE,
    created_date DATE,
    last_update DATE
);

Phase 2: Bulk COPY

We use PostgreSQL COPY for maximum throughput:

COPY georefer.establishment
(siren, siret, company_name, ...)
FROM '/tmp/sirene_active.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');

COPY is 10-50x faster than batch INSERT for bulk loading. The 16.8M rows load in about 8 minutes.
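One practical note: COPY ... FROM '/tmp/...' reads the file on the database server and needs elevated privileges, so from an application you'd typically stream the same CSV through COPY ... FROM STDIN instead. A hedged sketch, with psycopg2 assumed as the driver and an illustrative subset of columns:

```python
def build_copy_sql(table: str, columns: list[str]) -> str:
    """Build a COPY ... FROM STDIN statement so the CSV is streamed
    over the client connection rather than read server-side."""
    return (
        f"COPY {table} ({', '.join(columns)}) "
        "FROM STDIN WITH (FORMAT csv, HEADER true, DELIMITER ',')"
    )

# Usage with psycopg2 (assumed), streaming the file from the client:
# with conn.cursor() as cur, open("sirene_active.csv") as f:
#     cur.copy_expert(build_copy_sql("georefer.establishment",
#                                    ["siren", "siret", "company_name"]), f)
```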

Phase 3: Index Creation

We create indexes AFTER the bulk import (creating them before would slow the import):

-- Exact lookups
CREATE INDEX idx_establishment_siren ON georefer.establishment(siren);
CREATE INDEX idx_establishment_siret ON georefer.establishment(siret);

-- Geographic filtering
CREATE INDEX idx_establishment_postal ON georefer.establishment(postal_code);
CREATE INDEX idx_establishment_dept ON georefer.establishment(department_code);
CREATE INDEX idx_establishment_city ON georefer.establishment(city);

-- Industry code filtering
CREATE INDEX idx_establishment_naf ON georefer.establishment(naf_code);

-- Fuzzy name search
CREATE INDEX idx_establishment_name_trgm ON georefer.establishment USING GIN (company_name gin_trgm_ops);

Combined Query: Name + Geographic Filter

The real power comes from combining trigram search with geographic filters:

SELECT *, similarity(company_name, 'boulangerie') as sim
FROM georefer.establishment
WHERE company_name % 'boulangerie'
  AND department_code = '75'
  AND is_active = true
ORDER BY sim DESC
LIMIT 25;

This returns all bakeries in Paris in ~45ms, even across 16.8M rows.

The API Layer

The Spring Boot service exposes this via REST:

# Search by SIREN
curl 'https://georefer.io/geographical_repository/v1/companies?siren=552120222' \
  -H 'X-Georefer-API-Key: YOUR_API_KEY'

# Search by name + department
curl 'https://georefer.io/geographical_repository/v1/companies/search?name=michelin&department_code=63' \
  -H 'X-Georefer-API-Key: YOUR_API_KEY'

The response includes all SIRENE fields:

{
  "success": true,
  "data": [
    {
      "siren": "855200507",
      "siret": "85520050700046",
      "company_name": "MANUFACTURE FRANCAISE DES PNEUMATIQUES MICHELIN",
      "naf_code": "22.11Z",
      "employee_range": "5000+",
      "postal_code": "63000",
      "city": "CLERMONT-FERRAND",
      "is_headquarters": true
    }
  ]
}
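Consuming that envelope from a client is straightforward. A small sketch; the success/data shape is taken from the sample response above, while the failure handling is an assumption:

```python
import json

def extract_companies(payload: str) -> list[dict]:
    """Unwrap the {success, data} envelope and return the rows."""
    body = json.loads(payload)
    if not body.get("success"):
        raise RuntimeError("GEOREFER API reported failure")  # assumed failure mode
    return body["data"]

sample = '{"success": true, "data": [{"siren": "855200507", "city": "CLERMONT-FERRAND"}]}'
print(extract_companies(sample)[0]["siren"])  # 855200507
```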

Performance Summary

Query Type            Before (no index)   After (indexed)   Improvement
Name search           12,847 ms           66 ms             194x
Name + dept filter    13,102 ms           45 ms             291x
SIREN exact           8,200 ms            0.3 ms            27,333x
SIRET exact           8,150 ms            0.2 ms            40,750x

Lessons Learned

  1. Always create indexes after bulk import — creating them before can make the import 10x slower
  2. pg_trgm GIN indexes use a lot of disk — our 16.8M row trigram index is ~2.3 GB
  3. Set maintenance_work_mem high during index creation: SET maintenance_work_mem = '1GB' cuts index creation time roughly in half
  4. COPY beats INSERT every time for bulk loading — use COPY for anything over 10K rows

Try It

GEOREFER exposes 16.8M SIRENE establishments through a simple REST API.


AZMORIS Engineering — "Software that Endures"
