SIKOUTRIS
Building a French Training Directory: Parsing Government Certification Data at Scale

France has a unique system for quality-certifying training organizations: the Qualiopi certification. Every training provider that wants public funding must hold this certification, and the government publishes the data in various formats. We built Annuaire Qualiopi to make this data searchable and useful.

This post covers the technical challenges of building a directory from government open data.

The Data Source Problem

The French government publishes Qualiopi certification data through several channels:

  • data.gouv.fr: CSV exports, updated monthly (sometimes)
  • Individual certifying bodies: Each publishes their own list in their own format
  • The official Qualiopi list: A PDF that is surprisingly hard to parse programmatically

No single source is complete or consistently formatted. Our first job was building a data reconciliation pipeline.

Data Ingestion Pipeline

class QualiopiDataPipeline:
    def __init__(self):
        self.sources = [
            DataGouvSource(),
            AfnorSource(),
            CofracSource(),
            QualianorSource(),
        ]

    def ingest(self):
        all_records = []
        for source in self.sources:
            try:
                records = source.fetch_and_parse()
                all_records.extend(records)
            except SourceUnavailableError as e:
                log_warning(f"Source {source.name} unavailable: {e}")

        return self.deduplicate_and_merge(all_records)

    def deduplicate_and_merge(self, records):
        # Key by SIRET (unique French business identifier)
        by_siret = {}
        for record in records:
            siret = record.get("siret", "").replace(" ", "")
            if len(siret) != 14 or not siret.isdigit():
                continue  # Invalid SIRET (must be exactly 14 digits)

            if siret in by_siret:
                by_siret[siret] = self.merge_records(
                    by_siret[siret], record
                )
            else:
                by_siret[siret] = record

        return list(by_siret.values())

The SIRET number (a 14-digit French business identifier) is our primary deduplication key. Without it, we would have thousands of duplicate entries from different sources with slightly different formatting of the company name.
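The `merge_records` method is not shown above; a minimal sketch of one possible merge policy, assuming records are plain dicts (a real merge might also weigh source freshness or trust):

```python
def merge_records(existing, incoming):
    """Merge two records for the same SIRET, preferring non-empty values.

    Policy sketch: keep the existing value unless it is missing or empty,
    in which case take the incoming one.
    """
    merged = dict(existing)
    for key, value in incoming.items():
        if value and not merged.get(key):
            merged[key] = value
    return merged
```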

The SIRET Validation Challenge

Not all published SIRETs are valid. Some have typos, some reference closed businesses, some are test entries. We validate against the INSEE SIRENE API:

import requests

def validate_siret(siret):
    # Luhn checksum over the 14 digits (La Poste SIRETs are a known exception)
    if not luhn_check(siret):
        return {"valid": False, "reason": "checksum_failed"}

    # Cross-reference with INSEE API
    try:
        resp = requests.get(
            f"https://api.insee.fr/entreprises/sirene/V3/siret/{siret}",
            headers={"Authorization": f"Bearer {INSEE_TOKEN}"},
            timeout=5
        )
        if resp.status_code == 200:
            data = resp.json()
            establishment = data["etablissement"]
            return {
                "valid": True,
                "name": establishment["uniteLegale"]["denominationUniteLegale"],
                "address": format_address(establishment["adresseEtablissement"]),
                "active": establishment["periodesEtablissement"][0]["etatAdministratifEtablissement"] == "A"
            }
        elif resp.status_code == 404:
            return {"valid": False, "reason": "not_found"}
        return {"valid": None, "reason": f"http_{resp.status_code}"}
    except requests.Timeout:
        return {"valid": None, "reason": "api_timeout"}

def luhn_check(siret):
    digits = [int(d) for d in siret]
    checksum = 0
    for i, d in enumerate(digits):
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

The INSEE API has rate limits (30 requests/minute on the free tier), so we batch our validation runs during off-peak hours.
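At 30 requests/minute, simple pacing between calls is enough; a sketch of one batching approach (the `validate` parameter stands in for `validate_siret` above, and a token bucket would work just as well):

```python
import time

def validate_batch(sirets, validate, max_per_minute=30):
    """Validate SIRETs sequentially, pacing calls to stay under the
    API rate limit (one call every 60 / max_per_minute seconds)."""
    interval = 60.0 / max_per_minute
    results = {}
    for i, siret in enumerate(sirets):
        if i > 0:
            time.sleep(interval)  # simple pacing between consecutive calls
        results[siret] = validate(siret)
    return results
```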

Search Architecture

Users search for training providers by:

  • Location (city, department, region)
  • Training category (management, IT, safety, languages...)
  • Certification scope (actions de formation, bilan de compétences, VAE, apprentissage)

We chose MySQL full-text search over Elasticsearch for simplicity. With under 100,000 records, MySQL handles it fine:

CREATE TABLE training_providers (
    id INT PRIMARY KEY AUTO_INCREMENT,
    siret CHAR(14) UNIQUE,
    name VARCHAR(300),
    city VARCHAR(100),
    department CHAR(3),
    region VARCHAR(100),
    categories JSON,
    qualiopi_scopes JSON,
    certification_date DATE,
    certifying_body VARCHAR(200),
    website VARCHAR(500),
    phone VARCHAR(20),

    FULLTEXT INDEX ft_search (name, city)
);

-- Search query
SELECT *, MATCH(name, city) AGAINST(? IN NATURAL LANGUAGE MODE) as relevance
FROM training_providers
WHERE department = ?
AND JSON_CONTAINS(qualiopi_scopes, ?)
ORDER BY relevance DESC
LIMIT 20;

Geolocation Without Google Maps

We needed to show providers on a map without the cost of Google Maps API. Our solution:

  1. Geocoding: We use the French government's free geocoding API (api-adresse.data.gouv.fr) to convert addresses to coordinates
  2. Map rendering: Leaflet.js with OpenStreetMap tiles
  3. Clustering: For areas with many providers, we use marker clustering to keep the map readable
const map = L.map("map").setView([46.603354, 1.888334], 6);
L.tileLayer("https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png", {
    attribution: "&copy; OpenStreetMap contributors"  // OSM tiles require attribution
}).addTo(map);

const markers = L.markerClusterGroup();
providers.forEach(p => {
    if (p.lat && p.lng) {
        const marker = L.marker([p.lat, p.lng])
            .bindPopup(`<b>${p.name}</b><br>${p.city}`);
        markers.addLayer(marker);
    }
});
map.addLayer(markers);

Total cost for geocoding and mapping: zero. The French government provides excellent free geospatial APIs.
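The geocoding step can be sketched in a few lines. The api-adresse.data.gouv.fr endpoint returns GeoJSON, which orders coordinates [longitude, latitude]; the helper names below are illustrative:

```python
import requests

def extract_coords(geojson):
    """Pull (lat, lng) from a GeoJSON FeatureCollection, or None if
    there are no matches. GeoJSON coordinates are [longitude, latitude]."""
    features = geojson.get("features", [])
    if not features:
        return None
    lon, lat = features[0]["geometry"]["coordinates"]
    return (lat, lon)

def geocode(address):
    """Geocode a French address via api-adresse.data.gouv.fr (free, no API key)."""
    resp = requests.get(
        "https://api-adresse.data.gouv.fr/search/",
        params={"q": address, "limit": 1},
        timeout=5,
    )
    resp.raise_for_status()
    return extract_coords(resp.json())
```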

Data Freshness: The Ongoing Challenge

Qualiopi certifications expire and get renewed. Training providers change addresses, phone numbers, and specialties. Keeping the directory current requires:

  1. Monthly full re-import from government sources
  2. Weekly delta checks against the SIRENE API for business status changes
  3. User-submitted corrections with manual verification

We show a "last verified" date on every listing. If a listing has not been verified in 6 months, we flag it with a warning.
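The staleness check itself is a one-liner; a sketch, with the 6-month threshold approximated as 183 days (the exact cutoff is a product choice, not something mandated by the data):

```python
from datetime import date, timedelta

def needs_verification(last_verified, today=None, max_age_days=183):
    """True if a listing's last verification is older than ~6 months."""
    today = today or date.today()
    return (today - last_verified) > timedelta(days=max_age_days)
```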

SEO for a French Directory

The site targets French searches like "organisme formation Qualiopi Lyon" or "centre formation certifie Paris." Key technical SEO decisions:

  • URL structure: /formation/{department}/{city}/{slug} for individual listings
  • Department landing pages: /formation/{department}/ with aggregate data and filters
  • Structured data: LocalBusiness and EducationalOrganization schema on every listing
  • Hreflang: Not needed — the site is French-only, targeting France

Server-side rendering in PHP means every page is immediately crawlable. No JavaScript rendering required.
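The structured-data markup mentioned above can be generated per listing. A sketch of building LocalBusiness JSON-LD, where the provider dict keys (name, city, website, phone) are assumptions standing in for whatever the real listing model uses:

```python
import json

def listing_jsonld(provider):
    """Build schema.org LocalBusiness JSON-LD for a listing page.

    Only fields that are actually present end up in the output;
    the key names on `provider` here are illustrative.
    """
    data = {
        "@context": "https://schema.org",
        "@type": "LocalBusiness",
        "name": provider["name"],
        "address": {
            "@type": "PostalAddress",
            "addressLocality": provider["city"],
            "addressCountry": "FR",
        },
    }
    if provider.get("website"):
        data["url"] = provider["website"]
    if provider.get("phone"):
        data["telephone"] = provider["phone"]
    return json.dumps(data, ensure_ascii=False)
```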

Lessons Learned

  1. Government open data is valuable but messy. Budget significant time for data cleaning.
  2. SIRET is the anchor. Without a unique identifier, deduplication across sources would be nearly impossible.
  3. Free government APIs are underrated. The French ecosystem (api-adresse, INSEE SIRENE, data.gouv) is genuinely excellent.
  4. Directory SEO is a long game. Individual listing pages take months to index, but department-level pages rank faster.

Find Qualiopi-certified training providers across France at annuairequaliopi.fr
