The French government publishes a dataset called the Liste Publique des Organismes de Formation — a CSV file listing every officially registered training organization in the country. 148,000+ entries. Updated monthly. Free to download, no authentication required.
I used it to build AnnuaireQualiopi.fr, a searchable directory of Qualiopi-certified training centers in France. Here's the full technical story — including the parts where things broke.
The French Open Data Ecosystem
data.gouv.fr is France's official open data portal, managed by Etalab (a government agency under the Prime Minister's office). It hosts thousands of datasets from ministries, agencies, local authorities, and public bodies.
The interesting part for developers: most datasets come with a proper API. You can query the catalog programmatically, subscribe to resource update events, and download files via stable URIs. The liste-publique-des-organismes-de-formation dataset URL doesn't change between updates — the file is just replaced in-place. That makes automated pipelines straightforward.
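The catalog query the paragraph describes can be sketched like this. It's a minimal Python illustration assuming the data.gouv.fr v1 API shape (a `resources` array with `format`, `url`, and `last_modified` fields); the dataset slug matches the dataset's page URL.

```python
import json
from urllib.request import urlopen

# Dataset metadata endpoint (API v1); the slug matches the dataset page URL.
API = ("https://www.data.gouv.fr/api/1/datasets/"
       "liste-publique-des-organismes-de-formation/")

def latest_resource_url(dataset: dict, fmt: str = "csv") -> str:
    """Pick the most recently modified resource of a given format
    from a data.gouv.fr dataset metadata payload."""
    candidates = [r for r in dataset.get("resources", [])
                  if (r.get("format") or "").lower() == fmt]
    if not candidates:
        raise ValueError(f"no {fmt} resource in dataset")
    return max(candidates, key=lambda r: r.get("last_modified") or "")["url"]

# Usage (makes a network call):
#   dataset = json.load(urlopen(API))
#   url = latest_resource_url(dataset)
```

Since the resource URL is stable between monthly updates, resolving it through the catalog is optional — but it protects the pipeline if the resource is ever re-uploaded under a new URL.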
Formats vary by dataset. Training organizations come as CSV. Geographic datasets tend to be GeoJSON or Shapefile. Some newer datasets use Parquet. The inconsistency is real, but the documentation is usually honest about it.
The Dataset Itself
The main file is a semicolon-delimited CSV with roughly 30 columns per row. Key fields:
- `numerodeclarationactivite` — the unique identifier, an 11-digit number assigned by the DREETS (regional labor authority)
- `denomination` — organization name
- `adresseorganisme`, `codepostal`, `ville` — postal address
- `certifications.actionsdeformation` — boolean, whether they offer training actions
- `certifications.bilan` — professional skills assessments
- `certifications.vae` — validation of prior learning
- `certifications.apprentissage` — apprenticeship
- `siret` — business registration number (14 digits)
- `naf` — activity code (NAF/APE classification)
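Reading the semicolon-delimited export is straightforward; the one thing worth doing from the start is filtering out rows without the mandatory identifier (more on that below). A Python sketch — the column names match the listing above, the sample values are made up:

```python
import csv
import io

# Illustrative sample: header subset plus one of the near-empty rows
# the real export sometimes contains.
sample = (
    "numerodeclarationactivite;denomination;codepostal;ville\n"
    "11755555575;FORMABIEN;75010;Paris\n"
    ";;;\n"
)

def read_organismes(fp):
    """Yield rows from the semicolon-delimited CSV, skipping rows
    that lack the mandatory declaration number."""
    for row in csv.DictReader(fp, delimiter=";"):
        if (row.get("numerodeclarationactivite") or "").strip():
            yield row

rows = list(read_organismes(io.StringIO(sample)))
```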
The Qualiopi angle specifically required cross-referencing with a second dataset: the official list of Qualiopi-certified bodies. Qualiopi is a quality certification required since January 2022 for training organizations to access public funding. The certification dataset is maintained separately, which meant I had to join two CSV files on the numerodeclarationactivite field.
About 85,000 of the 148,000+ organizations hold at least one Qualiopi certification at any given time.
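The cross-reference is a plain left join on the shared key. A minimal in-memory sketch in Python (the certification-dataset field shown in the test is hypothetical; only the join key is taken from the article):

```python
def join_qualiopi(organismes, certifications):
    """Left-join the Qualiopi certification dataset onto the main
    registry on numerodeclarationactivite, the shared key."""
    by_id = {c["numerodeclarationactivite"]: c for c in certifications}
    for org in organismes:
        cert = by_id.get(org["numerodeclarationactivite"])
        # Keep a boolean flag for filtering plus the raw joined row.
        yield {**org, "qualiopi": cert is not None, "qualiopi_detail": cert}
```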
The Cleaning Problem
Raw government data is never clean. This is not a complaint — it's just reality. A few things I hit:
Encoding chaos. The file is officially UTF-8, but some entries contain Latin-1 artifacts. Classic sign: `Ã©` where `é` should be. Probably upstream systems that were never migrated properly. I ran `iconv` with `//IGNORE` as a first pass, then a custom PHP pass to normalize remaining oddities.
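The second-pass normalization boils down to reversing a wrong decode: UTF-8 bytes that were read as Latin-1 can be re-encoded to Latin-1 and decoded as UTF-8. A Python sketch of the idea (the site's pass is PHP); the artifact check is a heuristic:

```python
def fix_mojibake(text: str) -> str:
    """Repair UTF-8 text that was mis-decoded as Latin-1 upstream
    ('é' shows up as 'Ã©'). Leaves clean strings untouched."""
    if "Ã" not in text and "Â" not in text:
        return text          # fast path: no typical artifact characters
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text          # not actually mojibake; keep as-is
```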
Duplicate SIRET numbers. An organization can have multiple declaration numbers across different regional authorities, all pointing to the same legal entity. About 3% of records are genuine duplicates by SIRET. My deduplication logic: keep the record with the most recent datedernieredeclaration, merge the certification flags.
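That deduplication rule can be sketched as follows — Python for illustration, with simplified flag field names, and relying on ISO dates comparing correctly as strings:

```python
def dedupe_by_siret(records,
                    flags=("actionsdeformation", "bilan",
                           "vae", "apprentissage")):
    """One record per SIRET: the most recent datedernieredeclaration
    wins, but certification flags are OR-merged across duplicates."""
    best = {}
    for rec in records:
        kept = best.get(rec["siret"])
        if kept is None:
            best[rec["siret"]] = dict(rec)
            continue
        merged = {f: kept[f] or rec[f] for f in flags}
        if rec["datedernieredeclaration"] > kept["datedernieredeclaration"]:
            kept.update(rec)      # newer declaration wins for all fields...
        kept.update(merged)       # ...except flags, which are merged
    return list(best.values())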
Incomplete addresses. Around 8% of entries have a postal code but no street address — just the city. I flagged these separately rather than trying to geocode from incomplete data. Better to show honest partial information than invented coordinates.
Name normalization. "SARL FORMABIEN" and "Formabien SARL" are the same thing. I lowercase everything, strip legal form suffixes (SARL, SAS, EURL, SASU, SA, etc.) for comparison purposes only — the display name stays as-is from the source.
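A comparison-key sketch in Python (the legal-form list here is a subset; the real pass covers more suffixes):

```python
import re

# Subset of French legal-form suffixes; extend with what the data contains.
LEGAL_FORMS = re.compile(r"\b(sarl|sasu|sas|eurl|sa)\b")

def name_key(name: str) -> str:
    """Build a comparison key only — the display name stays as-is
    from the source."""
    key = LEGAL_FORMS.sub(" ", name.lower())
    key = re.sub(r"[^\w]+", " ", key)   # drop punctuation, keep accents
    return " ".join(key.split())
```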
Empty rows. The CSV occasionally has empty rows or rows where only the identifier column is populated. Filter early, filter often.
Geocoding at Scale
The dataset has postal addresses but no coordinates. For a directory where "find training near me" is a core use case, I needed lat/lng.
France has an excellent free geocoding API: api-adresse.data.gouv.fr. Again from Etalab. It accepts batch geocoding via POST with a CSV payload — up to 5,000 addresses per call, rate limit is reasonable for bulk work.
I split the full dataset into batches of 4,000, posted each one, and parsed the response. The match rate was around 91% for full addresses. The remaining 9% mostly fell into two buckets:
- Addresses in overseas territories (DOM-TOM) where the geocoder coverage is spottier
- PO box addresses (`BP XXXX`), which aren't geocodable by design
For the overseas territories, I fell back to centroid coordinates per postal code using the official postal code dataset (also on data.gouv.fr). Not precise, but good enough for display purposes.
The whole geocoding pipeline ran in about 4 hours for 148K entries on a basic VPS. I store the results in the database and re-geocode only new or modified records during monthly updates.
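The batching plus the multipart POST described above can be sketched in Python (stdlib only; per the adresse.data.gouv.fr docs the batch endpoint takes a CSV file part named `data`, and optional `columns` fields to restrict which columns build the address — double-check the current docs before relying on this):

```python
import io
import uuid
from itertools import islice
from urllib.request import Request, urlopen

GEOCODER = "https://api-adresse.data.gouv.fr/search/csv/"

def batches(rows, size=4000):
    """Chunk an iterable, staying under the 5,000-line batch limit."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def geocode_batch(csv_bytes: bytes) -> bytes:
    """POST one CSV chunk as multipart/form-data; the geocoder returns
    the same CSV with latitude/longitude/result_score appended."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write((
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="data"; filename="batch.csv"\r\n'
        "Content-Type: text/csv\r\n\r\n"
    ).encode())
    body.write(csv_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    req = Request(GEOCODER, data=body.getvalue(), headers={
        "Content-Type": f"multipart/form-data; boundary={boundary}"})
    with urlopen(req, timeout=300) as resp:
        return resp.read()
```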
WordPress Architecture for Data-Driven Directories
The site runs on WordPress, which is a deliberate choice. The editorial layer (117 in-depth articles, category pages, guides) integrates naturally with a CMS. But the directory search is built on custom tables, not WP post types.
Here's why: storing 85,000+ organizations as WordPress posts would make the database sluggish for the kind of filtered queries I need — filter by certification type, by département, by NAF code, radius search. WordPress's WP_Query is fine for hundreds of posts. For tens of thousands with multi-column filtering, it's not the right tool.
Instead:
- Custom MySQL table `wp_of_organismes` with indexed columns on `departement`, `certification_flags` (bitmask), `naf`, and a spatial index on `(lat, lng)` using MySQL's native spatial functions
- A thin PHP layer wraps raw `$wpdb->get_results()` queries
- The WordPress REST API exposes endpoints for the JavaScript search frontend
- AJAX pagination, no full page reloads
For radius search, MySQL's ST_Distance_Sphere() function handles the math without needing PostGIS:
```sql
-- %f / %d are $wpdb->prepare() placeholders:
-- user lng/lat, certification flag mask, radius in meters
SELECT *,
       ST_Distance_Sphere(
           POINT(lng, lat),
           POINT(%f, %f)
       ) AS distance_m
FROM wp_of_organismes
WHERE certif_bitmask & %d > 0
HAVING distance_m < %d
ORDER BY distance_m ASC
LIMIT 20;
```
Not elegant, but it works at this scale.
SEO for a Data-Driven Site
A directory of 85,000 organizations is not inherently useful to Google. Thin pages with just a name, address, and phone number get ignored or deindexed fast. The approach that actually moves the needle:
Programmatic pages with editorial signal. Each département (French administrative division) has a landing page that's not just a list — it includes context about training activity in that area, notable certifications, and statistics pulled from the dataset (number of certified organizations, breakdown by certification type). Generated programmatically but substantial.
The editorial layer does the heavy lifting. The 117 articles average 1,541 words. They cover things like "what Qualiopi certification means for employees", "how to choose a training organization for a specific career path", "understanding CPF funding". These pages attract organic traffic and pass link equity to the directory.
Internal linking at scale. Articles link to relevant département pages and specific organization profiles when they appear in content. The directory pages link to relevant articles. The graph is intentional, not random.
No duplicate content between directory entries. Each organization page has a structured layout, but the text describing the organization is generated from structured data fields (certifications held, training domains, geographic scope) in ways that produce genuinely varied content.
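The field-driven text generation can be sketched like this — a deliberately simple Python illustration (labels, field names, and the English output are illustrative; the real site writes French copy from more fields):

```python
def describe(org: dict) -> str:
    """Assemble a short, varied description from structured fields,
    so directory pages don't share identical boilerplate."""
    labels = {
        "actionsdeformation": "training actions",
        "bilan": "skills assessments",
        "vae": "validation of prior learning (VAE)",
        "apprentissage": "apprenticeship programs",
    }
    held = [label for flag, label in labels.items() if org.get(flag)]
    if not held:
        scope = "is registered as a training organization"
    elif len(held) == 1:
        scope = f"is certified for {held[0]}"
    else:
        scope = f"is certified for {', '.join(held[:-1])} and {held[-1]}"
    return f'{org["denomination"]} ({org["ville"]}) {scope}.'
```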
The Update Pipeline
data.gouv.fr updates the training organization dataset monthly. My update pipeline:
- Cron job downloads the current CSV at month start
- Diff against previous version by `numerodeclarationactivite`
- New records: geocode and insert
- Modified records: update in place, re-geocode only if address changed
- Records present in old dataset but absent in new: soft delete (keep in DB, mark inactive)
- Re-cross-reference with updated Qualiopi certification dataset
- Regenerate département statistics pages
The diff step matters. Re-processing 148K records monthly when only 2-3K actually changed is wasteful. The CSV doesn't include a modification timestamp, so the diff is field-level comparison on a hash of the key fields.
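The hash-based diff can be sketched as follows — Python for illustration, with an illustrative subset of the key fields:

```python
import hashlib

# Illustrative subset of the ~30 columns; the real pipeline hashes
# every field that should trigger a re-process when it changes.
KEY_FIELDS = ("denomination", "adresseorganisme", "codepostal",
              "ville", "siret", "naf")

def row_hash(row: dict) -> str:
    payload = "\x1f".join(row.get(f, "") for f in KEY_FIELDS)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def diff(old_rows, new_rows, key="numerodeclarationactivite"):
    """Classify records as added / modified / removed between exports.
    'removed' keys are candidates for soft delete, not hard deletion."""
    old = {r[key]: row_hash(r) for r in old_rows}
    new = {r[key]: r for r in new_rows}
    added    = [r for k, r in new.items() if k not in old]
    modified = [r for k, r in new.items()
                if k in old and row_hash(r) != old[k]]
    removed  = [k for k in old if k not in new]
    return added, modified, removed
```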
What Actually Worked and What Didn't
Worked well:
- The batch geocoding API from data.gouv.fr is genuinely solid. Uptime is good, the batch mode is fast.
- MySQL spatial indexes perform well enough that I never needed to migrate to PostGIS.
- Editorial content driving traffic to directory pages is the right model. Pure directories without editorial context don't rank.
What I'd do differently:
- Start with a proper ETL pipeline instead of a collection of PHP scripts. I ended up with something functional but hard to maintain — classic second-system problem waiting to happen.
- The certification bitmask encoding is a premature optimization I now regret. A proper many-to-many relation table would be cleaner and barely slower.
- I underestimated how much time the data quality work would take. Plan for at least twice as long as you think.
Replicating This Pattern
The general pattern works for any domain with a public government registry:
- Business registration data (SIRENE dataset — 10M+ companies)
- Healthcare professionals (RPPS — all licensed practitioners)
- Public market contracts (DECP)
- Schools and universities (RAMSESE dataset)
France's open data ecosystem is genuinely underused by developers. Most of these datasets get downloaded by academic researchers and policy analysts. The developer angle — building something useful on top of the data — is wide open.
The data.gouv.fr API is documented at doc.data.gouv.fr. The geocoding batch API documentation is at adresse.data.gouv.fr/api-doc. Both are in French, but the endpoints themselves are self-explanatory.