DEV Community

Tagg

How I Verified 5,000 Regulation Records Against an EU Database — And What 99.2% Accuracy Actually Means

The Trust Problem with Scraped Data

I built an API that serves cosmetic ingredient regulation data from the Korean government (MFDS). It includes regulatory status for 10 countries — which ingredients are banned, which are restricted, and under what conditions.

But here's the problem: how do you know the data is accurate?

I scraped it from a Korean government website. The data claims to show EU regulations, Chinese regulations, ASEAN regulations. But MFDS is a Korean agency — they're reporting their interpretation of other countries' rules. What if they got it wrong? What if the data is outdated?

If a cosmetic company relies on my API to check whether an ingredient is legal in the EU, and the data is wrong, that's a real problem. Not a "bug in production" problem — a "product recalled at customs" problem.

I needed to verify.


The Verification Target: EU Data

I chose to verify the EU data because:

  1. The EU publishes its own official database — CosIng (Cosmetic Ingredient Database), maintained by the European Commission
  2. CosIng data is downloadable — Annex II (prohibited) and Annex III (restricted) are available as CSV files
  3. The EU has the most regulation records in my dataset — 5,301 entries, the largest country segment

If the Korean government's version of EU data matches the EU's own database, I can have reasonable confidence in the rest of the dataset.


Building the Verification Script

The approach was straightforward:

  1. Download CosIng Annex II (prohibited) and Annex III (restricted) CSVs
  2. Load my MFDS EU data
  3. Match by CAS number + ingredient name
  4. Compare regulation types (prohibited vs. restricted)
  5. Report match rates and mismatches
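The core of those steps can be sketched as a single compare function. This is a minimal sketch, not the actual script — the field names (`cas`, `name`, `type`) and CSV column names are placeholders you'd adapt to the real CosIng and MFDS exports:

```python
import csv

def load_records(path, cas_col, name_col, type_col):
    """Load a CSV into a list of dicts with just the fields we compare.
    Column names are hypothetical -- adjust to the real export headers."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {"cas": row[cas_col].strip(),
             "name": row[name_col].strip().lower(),
             "type": row[type_col].strip().lower()}
            for row in csv.DictReader(f)
        ]

def compare(mfds, cosing):
    """Match MFDS records to CosIng by CAS number, then compare
    regulation types (prohibited vs. restricted) on the matches."""
    by_cas = {r["cas"]: r for r in cosing if r["cas"]}
    matched = type_agree = 0
    mismatches = []
    for rec in mfds:
        ref = by_cas.get(rec["cas"])
        if ref is None:
            continue  # unmatched != wrong; may be grouping/naming differences
        matched += 1
        if rec["type"] == ref["type"]:
            type_agree += 1
        else:
            mismatches.append((rec["name"], rec["type"], ref["type"]))
    return {
        "match_rate": matched / len(mfds),
        "type_accuracy": type_agree / matched if matched else 0.0,
        "mismatches": mismatches,
    }
```

The key design point: match rate and type accuracy are computed separately, because they answer different questions (coverage vs. correctness).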

The Matching Challenge

This sounds simple until you realize:

  • CAS numbers aren't always present in both databases
  • Ingredient names differ — MFDS uses one naming convention, CosIng uses another
  • One CosIng entry can cover multiple CAS numbers (grouped substances)
  • Character encoding differences — dashes, spaces, special characters

I used a multi-pass matching strategy:

```
Pass 1: Exact CAS number match
Pass 2: Normalized CAS match (strip leading zeros, standardize dashes)
Pass 3: Name-based fuzzy match (for entries without CAS numbers)
```
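A sketch of those three passes using only the standard library. The normalization rules and the 0.85 similarity threshold are my assumptions for illustration, not values from the actual script:

```python
import re
from difflib import SequenceMatcher

def normalize_cas(cas):
    """Standardize unicode dash variants, drop whitespace, strip leading
    zeros, so '050-00-0' and '50–00–0' both become '50-00-0'."""
    cas = re.sub(r"[\u2010-\u2015\u2212]", "-", cas)  # any dash -> hyphen
    cas = re.sub(r"\s+", "", cas)
    return cas.lstrip("0") or "0"

def normalize_name(name):
    """Lowercase and collapse punctuation/whitespace for fuzzy comparison."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def match(rec, refs, threshold=0.85):
    """Multi-pass match: exact CAS, normalized CAS, then fuzzy name.
    Returns the matching reference record, or None."""
    # Pass 1: exact CAS number match
    for ref in refs:
        if rec["cas"] and rec["cas"] == ref["cas"]:
            return ref
    # Pass 2: normalized CAS match
    if rec["cas"]:
        target = normalize_cas(rec["cas"])
        for ref in refs:
            if ref["cas"] and normalize_cas(ref["cas"]) == target:
                return ref
    # Pass 3: fuzzy name match (for entries without usable CAS numbers)
    name = normalize_name(rec["name"])
    best, best_score = None, threshold
    for ref in refs:
        score = SequenceMatcher(None, name, normalize_name(ref["name"])).ratio()
        if score >= best_score:
            best, best_score = ref, score
    return best
```

Ordering the passes from strictest to loosest matters: a record only falls through to fuzzy name matching when the cheaper, more reliable CAS passes have failed.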

The Results

Metric                     Value
-------------------------  --------------
Total MFDS EU records      5,248
Matched against CosIng     4,693 (89.4%)
Regulation type accuracy   99.2%
Type mismatches            38
Unmatched records          555

What 89.4% Match Rate Means

It does NOT mean 10.6% of the data is wrong. The 555 unmatched records fall into predictable categories:

Grouped substances: CosIng lists "Hydroquinone and its derivatives" as one entry. MFDS lists each derivative separately. The data is the same — just structured differently.

Naming differences: One database calls it "Retinol palmitate," the other calls it "Retinyl palmitate." Same substance, different naming convention.

Recent additions: Some MFDS entries reflect regulation updates that hadn't been published in the CosIng CSV version I downloaded.
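The naming-difference cases are exactly where string similarity earns its keep. A quick illustration with Python's standard-library difflib (the 0.9 cutoff here is just an example value):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Same substance under two naming conventions -- very high similarity
print(similarity("Retinol palmitate", "Retinyl palmitate"))  # ~0.94

# Genuinely different substances -- much lower
print(similarity("Retinol palmitate", "Zinc oxide"))
```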

What 99.2% Type Accuracy Means

Of the 4,693 successfully matched records, 4,655 had identical regulation types (both said "prohibited" or both said "restricted").

The 38 mismatches were not errors — they were context-dependent classifications:

An ingredient might be prohibited as a hair dye but restricted for general cosmetic use. MFDS classifies it based on one context, CosIng based on another. Both are correct for their respective scope.


What I Learned About Data Quality

1. "Accuracy" is contextual

99.2% sounds great, but accuracy depends on what you're measuring. My data accurately reflects what the Korean government reports about EU regulations. Whether that perfectly mirrors the EU's own interpretation is a different question — and one that even human regulatory experts disagree on.

2. Unmatched ≠ incorrect

The biggest trap in data verification is assuming unmatched records are errors. Most of mine were structural differences in how the two databases organize the same information.

3. Verification builds trust (and documentation)

Running this verification gave me two things: confidence that the data is solid, and a concrete number I can share with users. "Cross-verified against EU CosIng database" is more convincing than "sourced from official Korean government data."

4. Automated verification enables ongoing quality

I can re-run this script whenever I update my data. If the match rate drops significantly, something changed — either in my data pipeline or in the source. It's an early warning system.
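That early-warning check can be as small as a threshold comparison in the pipeline. The baseline figure and the 2-point tolerance below are illustrative, not the values I actually use:

```python
def check_regression(current_rate, baseline_rate, tolerance=0.02):
    """Fail loudly if the match rate drops more than `tolerance` below
    the last verified baseline -- a sign that either the data pipeline
    or the source format changed."""
    drop = baseline_rate - current_rate
    if drop > tolerance:
        raise RuntimeError(
            f"Match rate fell {drop:.1%} below baseline "
            f"({current_rate:.1%} vs {baseline_rate:.1%}) -- investigate."
        )
    return True

# Example: baseline from the last verified run (89.4%)
check_regression(current_rate=0.891, baseline_rate=0.894)  # within tolerance
```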


The Disclaimer Problem

Even with 99.2% accuracy, I can't call this "regulatory advice." Here's why:

  • Regulations change constantly
  • My data reflects what MFDS reports, which may lag behind actual EU updates
  • Classification can vary by product type, concentration, and usage context
  • A 0.8% error rate across 5,000 records means ~40 potential edge cases

So my API docs include a clear disclaimer: data is for reference only, verify with official sources before making compliance decisions.

This isn't just legal protection — it's honest. And it's what professional users expect. A cosmetic formulator isn't going to blindly trust any single data source. They use multiple references. My API is one of them.


Applying This to Your Own Projects

If you're building a data product, here's the verification framework:

  1. Find an authoritative reference for at least one segment of your data
  2. Build automated matching with multiple passes (exact → normalized → fuzzy)
  3. Measure match rate AND accuracy separately — they tell different stories
  4. Investigate mismatches manually before assuming they're errors
  5. Document your methodology — users trust data more when they can see how you validated it
  6. Make it repeatable — verification should run on every data update

The API

The verified data is live at:

🔗 K-Beauty Cosmetic Ingredients API

21,796 ingredients, 30,960 regulation records across 10 countries, EU data cross-verified against CosIng at 99.2% accuracy.

If you're working with regulatory data or building data verification pipelines, I'd love to hear how you approach accuracy. What's "good enough" for your use case?
