The French SIRENE database contains information about every registered business in France — over 30 million establishments. We imported 16.8 million active ones into GEOREFER and needed to make them searchable by name in under 100ms.
Here's how we did it with PostgreSQL 16 and pg_trgm.
The Challenge
Our establishment table has 16.8 million rows. Users need to search by:
- SIREN (9 digits) — exact match, trivial with a B-tree index
- SIRET (14 digits) — exact match, same
- Company name — fuzzy match, this is where it gets interesting
The name search needs to handle:
- Partial matches: "Total" should find "TotalEnergies SE"
- Typos: "Miclein" should find "Michelin"
- Accent insensitivity: "Societe Generale" should match "Societe Generale"
The Naive Approach: ILIKE
First attempt:
SELECT * FROM georefer.establishment
WHERE company_name ILIKE '%total%'
LIMIT 25;
EXPLAIN ANALYZE result:
Seq Scan on establishment
Filter: (company_name ~~* '%total%')
Rows Removed by Filter: 16799975
Planning Time: 0.1ms
Execution Time: 12,847ms
12.8 seconds. Full sequential scan on 16.8M rows. Unusable.
Enter pg_trgm
PostgreSQL's pg_trgm extension breaks strings into trigrams (3-character sequences) and uses GIN indexes to find similar strings efficiently.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_establishment_name_trgm
ON georefer.establishment
USING GIN (company_name gin_trgm_ops);
Now with trigram similarity:
SELECT *, similarity(company_name, 'total') as sim
FROM georefer.establishment
WHERE company_name % 'total'
ORDER BY sim DESC
LIMIT 25;
EXPLAIN ANALYZE:
Bitmap Heap Scan on establishment
Recheck Cond: (company_name % 'total')
-> Bitmap Index Scan on idx_establishment_name_trgm
Index Cond: (company_name % 'total')
Planning Time: 0.3ms
Execution Time: 66ms
66ms. From 12.8 seconds to 66ms — a 194x improvement.
The Import Strategy: 3 Phases
Importing 16.8M rows isn't straightforward. We use a 3-phase approach:
Phase 1: Schema + Staging
CREATE TABLE georefer.establishment (
id SERIAL PRIMARY KEY,
siren VARCHAR(9) NOT NULL,
siret VARCHAR(14) NOT NULL UNIQUE,
company_name VARCHAR(255),
commercial_name VARCHAR(255),
legal_form VARCHAR(10),
naf_code VARCHAR(6),
employee_range VARCHAR(5),
postal_code VARCHAR(5),
city VARCHAR(100),
department_code VARCHAR(3),
is_headquarters BOOLEAN DEFAULT FALSE,
is_active BOOLEAN DEFAULT TRUE,
created_date DATE,
last_update DATE
);
Phase 2: Bulk COPY
We use PostgreSQL COPY for maximum throughput:
COPY georefer.establishment
(siren, siret, company_name, ...)
FROM '/tmp/sirene_active.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');
COPY is 10-50x faster than batch INSERT for bulk loading. The 16.8M rows load in about 8 minutes.
Phase 3: Index Creation
We create indexes AFTER the bulk import (creating them before would slow the import):
-- Exact lookups
CREATE INDEX idx_establishment_siren ON georefer.establishment(siren);
CREATE INDEX idx_establishment_siret ON georefer.establishment(siret);
-- Geographic filtering
CREATE INDEX idx_establishment_postal ON georefer.establishment(postal_code);
CREATE INDEX idx_establishment_dept ON georefer.establishment(department_code);
CREATE INDEX idx_establishment_city ON georefer.establishment(city);
-- Fuzzy name search
CREATE INDEX idx_establishment_naf ON georefer.establishment(naf_code);
CREATE INDEX idx_establishment_name_trgm ON georefer.establishment USING GIN (company_name gin_trgm_ops);
Combined Query: Name + Geographic Filter
The real power comes from combining trigram search with geographic filters:
SELECT *, similarity(company_name, 'boulangerie') as sim
FROM georefer.establishment
WHERE company_name % 'boulangerie'
AND department_code = '75'
AND is_active = true
ORDER BY sim DESC
LIMIT 25;
This returns all bakeries in Paris in ~45ms, even across 16.8M rows.
The API Layer
The Spring Boot service exposes this via REST:
# Search by SIREN
curl 'https://georefer.io/geographical_repository/v1/companies?siren=552120222' \
-H 'X-Georefer-API-Key: YOUR_API_KEY'
# Search by name + department
curl 'https://georefer.io/geographical_repository/v1/companies/search?name=michelin&department_code=63' \
-H 'X-Georefer-API-Key: YOUR_API_KEY'
The response includes all SIRENE fields:
{
"success": true,
"data": [
{
"siren": "855200507",
"siret": "85520050700046",
"company_name": "MANUFACTURE FRANCAISE DES PNEUMATIQUES MICHELIN",
"naf_code": "22.11Z",
"employee_range": "5000+",
"postal_code": "63000",
"city": "CLERMONT-FERRAND",
"is_headquarters": true
}
]
}
Performance Summary
| Query Type | Before (no index) | After (pg_trgm) | Improvement |
|---|---|---|---|
| Name search | 12,847ms | 66ms | 194x |
| Name + dept filter | 13,102ms | 45ms | 291x |
| SIREN exact | 8,200ms | 0.3ms | 27,333x |
| SIRET exact | 8,150ms | 0.2ms | 40,750x |
Lessons Learned
- Always create indexes after bulk import — creating them before can make the import 10x slower
- pg_trgm GIN indexes use a lot of disk — our 16.8M row trigram index is ~2.3 GB
-
Set
maintenance_work_memhigh during index creation —SET maintenance_work_mem = '1GB'cuts index creation time in half -
COPYbeatsINSERTevery time for bulk loading — useCOPYfor anything over 10K rows
Try It
GEOREFER exposes 16.8M SIRENE establishments through a simple REST API:
- Free tier: 100 req/day, no credit card
- Docs: https://georefer.io/docs
- Sign up: https://georefer.io/#signup
AZMORIS Engineering — "Software that Endures"
Top comments (0)