France has a unique system for quality-certifying training organizations: the Qualiopi certification. Every training provider that wants public funding must hold this certification, and the government publishes the data in various formats. We built Annuaire Qualiopi to make this data searchable and useful.
This post covers the technical challenges of building a directory from government open data.
The Data Source Problem
The French government publishes Qualiopi certification data through several channels:
- data.gouv.fr: CSV exports, updated monthly (sometimes)
- Individual certifying bodies: Each publishes their own list in their own format
- The official Qualiopi list: A PDF that is surprisingly hard to parse programmatically
No single source is complete or consistently formatted. Our first job was building a data reconciliation pipeline.
Data Ingestion Pipeline
class QualiopiDataPipeline:
    def __init__(self):
        # One adapter per upstream publisher; each exposes fetch_and_parse()
        self.sources = [
            DataGouvSource(),
            AfnorSource(),
            CofracSource(),
            QualianorSource(),
        ]

    def ingest(self):
        all_records = []
        for source in self.sources:
            try:
                records = source.fetch_and_parse()
                all_records.extend(records)
            except SourceUnavailableError as e:
                # One unavailable source should not abort the whole run
                log_warning(f"Source {source.name} unavailable: {e}")
        return self.deduplicate_and_merge(all_records)

    def deduplicate_and_merge(self, records):
        # Key by SIRET (unique French business identifier)
        by_siret = {}
        for record in records:
            # Tolerate a missing or null SIRET as well as spaced formatting
            siret = (record.get("siret") or "").replace(" ", "")
            if len(siret) != 14 or not siret.isdigit():
                continue  # Invalid SIRET
            if siret in by_siret:
                by_siret[siret] = self.merge_records(by_siret[siret], record)
            else:
                by_siret[siret] = record
        return list(by_siret.values())
The SIRET number (a 14-digit French business identifier) is our primary deduplication key. Without it, we would have thousands of duplicate entries from different sources with slightly different formatting of the company name.
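The pipeline calls `merge_records` without showing it. Here is a minimal sketch of what such a merge could look like, written as a standalone function and assuming records are plain dicts. The "keep the first non-empty value" policy and the `sources` provenance field are illustrative assumptions, not the rules the site actually applies.

```python
def merge_records(existing, incoming):
    """Merge two records that share the same SIRET.

    Assumed policy (hypothetical): keep the existing value when it is
    non-empty, otherwise fill the gap from the incoming record.
    """
    merged = dict(existing)  # copy so the caller's record is not mutated
    for key, value in incoming.items():
        if key == "sources":
            continue  # provenance is combined separately below
        if merged.get(key) in (None, "", [], {}):
            merged[key] = value

    # Hypothetical provenance field: every source that listed this provider
    sources = list(existing.get("sources", []))
    for name in incoming.get("sources", []):
        if name not in sources:
            sources.append(name)
    if sources:
        merged["sources"] = sources
    return merged
```

With this policy, the order of `self.sources` decides which spelling of a company name wins, so the most trusted source should come first.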
The SIRET Validation Challenge
Not all published SIRETs are valid. Some have typos, some reference closed businesses, some are test entries. We validate against the INSEE SIRENE API:
import requests

# INSEE_TOKEN and format_address() are defined elsewhere in the project

def validate_siret(siret):
    # Luhn checksum (La Poste establishments are a known exception to it)
    if not luhn_check(siret):
        return {"valid": False, "reason": "checksum_failed"}
    # Cross-reference with INSEE API
    try:
        resp = requests.get(
            f"https://api.insee.fr/entreprises/sirene/V3/siret/{siret}",
            headers={"Authorization": f"Bearer {INSEE_TOKEN}"},
            timeout=5,
        )
    except requests.Timeout:
        return {"valid": None, "reason": "api_timeout"}
    if resp.status_code == 200:
        data = resp.json()
        establishment = data["etablissement"]
        return {
            "valid": True,
            "name": establishment["uniteLegale"]["denominationUniteLegale"],
            "address": format_address(establishment["adresseEtablissement"]),
            "active": establishment["periodesEtablissement"][0]["etatAdministratifEtablissement"] == "A",
        }
    if resp.status_code == 404:
        return {"valid": False, "reason": "not_found"}
    # Any other status (rate limiting, server error): unknown, retry later
    return {"valid": None, "reason": f"http_{resp.status_code}"}

def luhn_check(siret):
    # Reject anything that is not exactly 14 digits before doing arithmetic
    if len(siret) != 14 or not siret.isdigit():
        return False
    checksum = 0
    for i, d in enumerate(int(c) for c in siret):
        # With 14 digits, even indexes are every second digit from the right
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
The INSEE API has rate limits (30 requests/minute on the free tier), so we batch our validation runs during off-peak hours.
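To make the batching concrete, here is a rough sketch of how a validation run could stay under 30 requests per minute by spacing calls evenly. `MinIntervalLimiter` and `validate_batch` are hypothetical names introduced for illustration; the `validate` argument stands in for a function such as `validate_siret`. The clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time


class MinIntervalLimiter:
    """Space calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute=30, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        now = self.clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self.sleep(self._next_allowed - now)
            now = max(self.clock(), self._next_allowed)
        self._next_allowed = now + self.min_interval


def validate_batch(sirets, validate, limiter):
    """Validate each SIRET in turn, pausing between API calls."""
    results = {}
    for siret in sirets:
        limiter.wait()
        results[siret] = validate(siret)
    return results
```

At 30 requests per minute this means one call every two seconds, so a batch of 10,000 SIRETs takes about five and a half hours, which is why an off-peak window makes sense.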
Search Architecture
Users search for training providers by:
- Location (city, department, region)
- Training category (management, IT, safety, languages...)
- Certification scope (actions de formation, bilan de competences, VAE, apprentissage)
We chose MySQL full-text search over Elasticsearch for simplicity. With under 100,000 records, MySQL handles it fine:
CREATE TABLE training_providers (
    id INT PRIMARY KEY AUTO_INCREMENT,
    siret CHAR(14) UNIQUE,
    name VARCHAR(300),
    city VARCHAR(100),
    department CHAR(3),
    region VARCHAR(100),
    categories JSON,
    qualiopi_scopes JSON,
    certification_date DATE,
    certifying_body VARCHAR(200),
    website VARCHAR(500),
    phone VARCHAR(20),
    FULLTEXT INDEX ft_search (name, city)
);

-- Search query
-- The JSON_CONTAINS parameter must itself be valid JSON, e.g. '"VAE"'
SELECT *, MATCH(name, city) AGAINST(? IN NATURAL LANGUAGE MODE) AS relevance
FROM training_providers
WHERE department = ?
  AND JSON_CONTAINS(qualiopi_scopes, ?)
ORDER BY relevance DESC
LIMIT 20;
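One detail of the search query above is easy to get wrong: `JSON_CONTAINS` expects its candidate argument to be a JSON document, so a scope such as VAE has to be bound as the JSON string `"VAE"`, quotes included. The sketch below shows one way to prepare the three parameters. It assumes a Python DB-API driver that uses `%s` placeholders (the site itself runs on PHP), and `build_search_params` is a hypothetical helper.

```python
import json

# Same query as above, with %s placeholders for a DB-API style driver (assumption)
SEARCH_SQL = """
SELECT *, MATCH(name, city) AGAINST(%s IN NATURAL LANGUAGE MODE) AS relevance
FROM training_providers
WHERE department = %s
  AND JSON_CONTAINS(qualiopi_scopes, %s)
ORDER BY relevance DESC
LIMIT 20
"""


def build_search_params(text, department, scope):
    """Return the parameter tuple for SEARCH_SQL.

    Assumes departments are stored as '01', '2A', '971' and that
    qualiopi_scopes holds a JSON array of strings.
    """
    department = department.strip().upper().zfill(2)
    # json.dumps adds the surrounding quotes that JSON_CONTAINS needs
    scope_json = json.dumps(scope, ensure_ascii=False)
    return (text.strip(), department, scope_json)
```

The tuple would then be passed as `cursor.execute(SEARCH_SQL, build_search_params("formation management", "69", "VAE"))`.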
Geolocation Without Google Maps
We needed to show providers on a map without the cost of Google Maps API. Our solution:
- Geocoding: We use the French government's free geocoding API (api-adresse.data.gouv.fr) to convert addresses to coordinates
- Map rendering: Leaflet.js with OpenStreetMap tiles
- Clustering: For areas with many providers, we use marker clustering to keep the map readable
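// L.markerClusterGroup comes from the Leaflet.markercluster plugin
const map = L.map("map").setView([46.603354, 1.888334], 6);
L.tileLayer("https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png", {
  maxZoom: 19,
  // OpenStreetMap tiles require visible attribution
  attribution: "&copy; OpenStreetMap contributors",
}).addTo(map);

const markers = L.markerClusterGroup();
providers.forEach(p => {
  // Skip providers that could not be geocoded
  if (p.lat && p.lng) {
    // Build the popup from DOM nodes so names are never parsed as HTML
    const popup = document.createElement("div");
    const name = document.createElement("b");
    name.textContent = p.name;
    popup.append(name, document.createElement("br"), p.city);
    markers.addLayer(L.marker([p.lat, p.lng]).bindPopup(popup));
  }
});
map.addLayer(markers);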
Total cost for geocoding and mapping: zero. The French government provides excellent free geospatial APIs.
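The geocoding step can be sketched in a few lines. This version assumes the `/search/` endpoint of api-adresse.data.gouv.fr returns a GeoJSON FeatureCollection whose first feature is the best match and carries a `score` property; the `min_score` threshold of 0.5 is an arbitrary illustrative value, and both function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api-adresse.data.gouv.fr/search/"


def parse_geocode_response(payload, min_score=0.5):
    """Extract coordinates from a GeoJSON FeatureCollection, or return None."""
    features = payload.get("features") or []
    if not features:
        return None
    best = features[0]
    score = best.get("properties", {}).get("score", 0.0)
    if score < min_score:
        return None  # match too weak to place on the map
    lng, lat = best["geometry"]["coordinates"]  # GeoJSON order is [lng, lat]
    return {"lat": lat, "lng": lng, "score": score}


def geocode(address, timeout=5):
    """Query the geocoding API for one address (performs a network call)."""
    query = urllib.parse.urlencode({"q": address, "limit": 1})
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=timeout) as resp:
        return parse_geocode_response(json.load(resp))
```

Keeping the parsing separate from the HTTP call makes the coordinate handling testable offline, and the GeoJSON `[lng, lat]` order is worth the explicit comment because Leaflet expects `[lat, lng]`.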
Data Freshness: The Ongoing Challenge
Qualiopi certifications expire and get renewed. Training providers change addresses, phone numbers, and specialties. Keeping the directory current requires:
- Monthly full re-import from government sources
- Weekly delta checks against the SIRENE API for business status changes
- User-submitted corrections with manual verification
We show a "last verified" date on every listing. If a listing has not been verified in 6 months, we flag it with a warning.
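The staleness flag comes down to a date comparison. A minimal sketch, assuming each listing stores its last verification as a `date` and approximating 6 months as 183 days (the exact cutoff is an assumption); `is_stale` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Roughly 6 months; the exact cutoff used by the site is an assumption
STALE_AFTER = timedelta(days=183)


def is_stale(last_verified, today=None):
    """Return True when a listing should carry the 'not recently verified' warning."""
    if last_verified is None:
        return True  # never verified counts as stale
    today = today or date.today()
    return today - last_verified > STALE_AFTER
```

Passing `today` explicitly keeps the check deterministic in tests and in batch jobs that run across midnight.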
SEO for a French Directory
The site targets French searches like "organisme formation Qualiopi Lyon" or "centre formation certifie Paris." Key technical SEO decisions:
- URL structure: /formation/{department}/{city}/{slug} for individual listings
- Department landing pages: /formation/{department}/ with aggregate data and filters
- Structured data: LocalBusiness and EducationalOrganization schema on every listing
- Hreflang: Not needed; the site is French-only, targeting France
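French provider and city names are full of accents, apostrophes, and ampersands, so the slug in the URL pattern above needs normalizing. The following is one possible approach in Python (the site itself is PHP); `slugify` and `listing_url` are hypothetical helpers, and the exact slug rules are an assumption.

```python
import re
import unicodedata


def slugify(text):
    """Lowercase ASCII slug: accents stripped, other characters become hyphens."""
    # NFKD splits 'É' into 'E' plus a combining accent, which the ASCII encode drops
    normalized = unicodedata.normalize("NFKD", text)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def listing_url(department, city, name):
    """Build /formation/{department}/{city}/{slug} for one listing."""
    return f"/formation/{department.lower()}/{slugify(city)}/{slugify(name)}"
```

Characters with no ASCII decomposition, such as the ligature in "œuvre", are simply dropped by this approach, so a production version would likely map them explicitly first. Two providers with the same name in the same city would also collide, so a suffix such as part of the SIRET may be needed.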
Server-side rendering in PHP means every page is immediately crawlable. No JavaScript rendering required.
Lessons Learned
- Government open data is valuable but messy. Budget significant time for data cleaning.
- SIRET is the anchor. Without a unique identifier, deduplication across sources would be nearly impossible.
- Free government APIs are underrated. The French ecosystem (api-adresse, INSEE SIRENE, data.gouv) is genuinely excellent.
- Directory SEO is a long game. Individual listing pages take months to index, but department-level pages rank faster.
Find Qualiopi-certified training providers across France at annuairequaliopi.fr