The Backstory
A few weeks ago, I launched a REST API for Korean cosmetic ingredients — 21,000+ ingredients from official Korean government sources, searchable by INCI name, CAS number, and Korean name.
The API worked. But I kept getting the same question from people in the cosmetics industry:
"Cool, but what about EU regulations? What about China?"
Fair point. If you're formulating cosmetics for global markets, knowing that an ingredient is restricted in Korea is only part of the puzzle. You need to know if it's also banned in the EU, restricted in China, or flagged in ASEAN.
So I went down the rabbit hole.
What I Found: A Hidden Goldmine
Korea's Ministry of Food and Drug Safety (MFDS) doesn't just track Korean regulations. Their database at nedrug.mfds.go.kr contains regulation data for 10 countries, all in one place:
| Country | Records |
|---|---|
| EU | 5,301 |
| ASEAN | 4,843 |
| China | 4,145 |
| South Korea | 4,046 |
| Brazil | 4,022 |
| Argentina | 4,022 |
| Taiwan | 2,137 |
| Canada | 1,947 |
| Japan | 386 |
| USA | 111 |
30,960 regulation records total. Each one tells you whether an ingredient is prohibited or restricted in that country, with detailed conditions — concentration limits, product type restrictions, and usage warnings.
The catch? The data is:
- Embedded in HTML pages as JavaScript JSON arrays
- Behind a Korean-language interface
- Spread across 7,257 individual pages
- No API, no download button
Sound familiar?
The Scraping Challenge
7,257 pages. One request at a time.
The MFDS detail pages have an interesting structure. Each page contains a JavaScript variable called arCountry — a JSON array with all the regulation data for that ingredient, across all countries. No AJAX calls needed. One page request = all countries.
But there's a catch within the catch: some ingredients have both restricted and prohibited data, stored in an if/else branch in the JavaScript. A naive regex extraction misses half the data. I had to write a bracket-depth counter to properly extract both arrays.
```python
def extract_json_array(text, start_pos):
    """Bracket counting instead of regex --
    correctly handles nested arrays inside the JSON."""
    bracket_start = text.index('[', start_pos)
    depth = 0
    for i in range(bracket_start, len(text)):
        if text[i] == '[':
            depth += 1
        elif text[i] == ']':
            depth -= 1
            if depth == 0:
                return text[bracket_start:i + 1]
    return None
```
Small bug, big difference: one ingredient went from 0 regulation records to 5 after fixing this.
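To show why the naive regex approach loses data, here's a sketch of how the extractor can be called at every occurrence of `arCountry`, so both the if-branch and else-branch arrays are captured. The page fragment and the `extract_all_arrays` helper are hypothetical illustrations (the bracket-depth function is repeated so the snippet runs standalone):

```python
import json

def extract_json_array(text, start_pos):
    """Same bracket-depth counter as above, repeated so this runs standalone."""
    bracket_start = text.index('[', start_pos)
    depth = 0
    for i in range(bracket_start, len(text)):
        if text[i] == '[':
            depth += 1
        elif text[i] == ']':
            depth -= 1
            if depth == 0:
                return text[bracket_start:i + 1]
    return None

def extract_all_arrays(text, marker="arCountry"):
    """Collect the JSON array following every occurrence of `marker`."""
    arrays, pos = [], 0
    while (hit := text.find(marker, pos)) != -1:
        raw = extract_json_array(text, hit)
        if raw:
            arrays.append(json.loads(raw))
        pos = hit + len(marker)
    return arrays

# Hypothetical fragment mimicking the if/else structure on MFDS detail pages:
page = """
if (hasLimit) {
    var arCountry = [{"country": "EU", "type": "restricted"}];
} else {
    var arCountry = [{"country": "EU", "type": "prohibited"}];
}
"""
print(len(extract_all_arrays(page)))  # 2 -- both branches captured
```

A single regex match against this fragment would return only the first array; walking every occurrence of the variable name is what recovers the second branch.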
Verifying the Data
Here's the thing about scraping government data from one country about other countries: how do you know it's accurate?
I cross-checked our EU data against the CosIng database — the EU's official cosmetic ingredient database. CosIng publishes their Annex II (prohibited) and Annex III (restricted) lists as downloadable CSVs.
Verification results:
| Metric | Result |
|---|---|
| Total EU records from MFDS | 5,248 |
| Matched against CosIng (by CAS number + name) | 4,693 (89.4%) |
| Regulation type accuracy | 99.2% |
| Type mismatches | 38 |
The 38 mismatches weren't errors — they were edge cases where an ingredient is prohibited when used as hair dye but restricted for other uses. Different classification logic, same underlying data.
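The cross-check itself can be sketched as a simple join on normalized CAS numbers. This is a simplified illustration: the field names (`cas_no`, `annex`, `regulate_type`) are assumptions, and CosIng actually publishes Annex II and Annex III as separate files rather than one table with an `annex` column.

```python
def normalize_cas(cas):
    """Strip whitespace so CAS numbers compare cleanly across sources."""
    return (cas or "").strip().replace(" ", "")

def cross_check(mfds_records, cosing_rows):
    """Match MFDS EU records against CosIng rows by CAS number.

    Returns (matched, type_mismatches). Field names are hypothetical.
    """
    cosing_by_cas = {normalize_cas(r["cas_no"]): r["annex"] for r in cosing_rows}
    matched, mismatches = [], []
    for rec in mfds_records:
        annex = cosing_by_cas.get(normalize_cas(rec["cas_no"]))
        if annex is None:
            continue  # no CosIng entry for this CAS number
        matched.append(rec)
        # Annex II = prohibited, Annex III = restricted
        expected = "prohibited" if annex == "II" else "restricted"
        if rec["regulate_type"] != expected:
            mismatches.append(rec)
    return matched, mismatches
```

Running this over the two datasets is what produced the match rate and mismatch counts in the table above; in practice the real script also matched on ingredient name where CAS numbers were missing.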
Good enough to ship.
The New API (v3.0.0)
New endpoint: /v1/ingredient/{code}/regulations
```python
import requests

response = requests.get(
    "https://k-beauty-cosmetic-ingredients.p.rapidapi.com/v1/ingredient/9/regulations",
    params={"country": "EU"},
    headers={
        "X-RapidAPI-Key": "YOUR_API_KEY",
        "X-RapidAPI-Host": "k-beauty-cosmetic-ingredients.p.rapidapi.com"
    }
)
print(response.json())
```
Response:
```json
{
  "success": true,
  "ingredient": {
    "code": 9,
    "kr_name": "리날룰",
    "inci_name": "Linalool"
  },
  "count": 1,
  "available_countries": ["한국", "EU"],
  "data": [
    {
      "country": "EU",
      "regulate_type": "제한",
      "notice_ingr_name": "1,6-Octadien-3-ol, 3,7-dimethyl-",
      "limit_condition": null,
      "source_type": "limit"
    }
  ]
}
```
One API call. One ingredient. Regulations across multiple countries. No PDF digging, no Google Translate, no guesswork. (In the response above, 제한 is Korean for "restricted" and 한국 for "Korea".)
Country Access by Tier
Not everyone needs all 10 countries. So I tiered it:
| Tier | Price | Countries | Monthly Requests |
|---|---|---|---|
| BASIC | Free | Ingredients only (no regulations) | 100 |
| PRO | $29 | South Korea, EU | 2,000 |
| ULTRA | $79 | + China, USA, Japan, ASEAN | 5,000 |
| MEGA | $199 | All 10 countries | 15,000 |
Interesting Findings From the Data
After collecting all 30,960 regulation records, some patterns jumped out:
1. The EU bans the most ingredients
EU leads with 5,301 regulation records. They're the strictest regulatory body for cosmetics — many other countries reference EU decisions when updating their own lists.
2. "Prohibited" doesn't always mean "dangerous"
Some ingredients are prohibited in cosmetics simply because they're classified as pharmaceuticals. Not because they're toxic — because they're too effective and fall under drug regulation instead.
3. The same ingredient, different rules everywhere
Take silver compounds: restricted in Canada (allowed in mouthwash up to 0.04%), but prohibited in the EU when in nano form. A global cosmetic brand needs to track these differences per-market.
4. Most MFDS-regulated substances aren't in the KCIA ingredient dictionary
Only 1,269 of the 7,257 MFDS-regulated substances matched KCIA ingredients by CAS number. The rest are chemicals banned from cosmetics outright; they were never cosmetic ingredients to begin with.
Technical Decisions
Why not add all MFDS substances to the main ingredients table?
KCIA tracks what can be used in cosmetics. MFDS tracks what can't be (or has conditions). Mixing them would pollute the ingredient search results with non-cosmetic chemicals. Instead, I kept them in a separate regulations table, linked by CAS number where possible.
Why SQLite, still?
I added 30,960 rows to the existing 21,796-ingredient database. SQLite handles it fine: the regulations table has indexes on ingredient_code, country, and regulate_type, and query time is still under 50ms.
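A minimal sketch of that schema, using Python's built-in sqlite3 module. The column set beyond `ingredient_code`, `country`, and `regulate_type` is an assumption based on the response fields shown earlier, and the real API uses a file-backed database rather than `:memory:`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real API uses a file-backed DB
conn.executescript("""
CREATE TABLE regulations (
    id              INTEGER PRIMARY KEY,
    ingredient_code INTEGER,           -- links to the KCIA ingredients table
    country         TEXT NOT NULL,
    regulate_type   TEXT NOT NULL,     -- e.g. 'prohibited' / 'restricted'
    cas_no          TEXT,
    limit_condition TEXT
);
CREATE INDEX idx_reg_code    ON regulations (ingredient_code);
CREATE INDEX idx_reg_country ON regulations (country);
CREATE INDEX idx_reg_type    ON regulations (regulate_type);
""")

conn.execute(
    "INSERT INTO regulations (ingredient_code, country, regulate_type) "
    "VALUES (?, ?, ?)",
    (9, "EU", "restricted"),
)
rows = conn.execute(
    "SELECT country, regulate_type FROM regulations WHERE ingredient_code = ?",
    (9,),
).fetchall()
print(rows)  # [('EU', 'restricted')]
```

Keeping regulations in their own table means the lookup by `ingredient_code` hits an index directly, and non-cosmetic chemicals never appear in ingredient search results.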
The rate limiting rabbit hole
I wanted per-tier rate limits (BASIC: 10/min, MEGA: 40/min). Turns out the Python slowapi library doesn't support dynamic rate limits based on request context. The decorator function gets called without access to the request object.
Solution: two-layer approach. slowapi handles the global ceiling (40/min), and a custom in-memory counter in the middleware enforces per-tier limits after the tier is detected from the subscription header.
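The second layer can be sketched as a fixed 60-second-window counter keyed by API key. This is a simplified stand-in for the actual middleware, not slowapi code; the PRO and ULTRA per-minute values are assumptions (only BASIC 10/min and MEGA 40/min are stated above), and a production version would also purge expired windows:

```python
import time
from collections import defaultdict

# Per-minute limits. BASIC and MEGA come from the text; PRO/ULTRA are assumed.
TIER_LIMITS = {"BASIC": 10, "PRO": 20, "ULTRA": 30, "MEGA": 40}

class TierRateLimiter:
    """Fixed 60-second-window counter, keyed by (api_key, window index)."""

    def __init__(self, limits=TIER_LIMITS):
        self.limits = limits
        self.counts = defaultdict(int)  # note: old windows are never purged here

    def allow(self, api_key, tier, now=None):
        now = time.time() if now is None else now
        key = (api_key, int(now // 60))  # same key for all calls in one minute
        if self.counts[key] >= self.limits.get(tier, 10):
            return False  # over this tier's per-minute budget
        self.counts[key] += 1
        return True
```

In the middleware, `allow()` runs after the tier is read from the subscription header; slowapi's global 40/min ceiling catches anything before tier detection.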
What's Next
- Automated weekly updates — KCIA change detection is already built, MFDS full re-scrape takes ~10 hours
- API name/description SEO optimization on RapidAPI
- More search filters for the regulations endpoint
Try It
The API is live on RapidAPI with a free tier.
🔗 K-Beauty Cosmetic Ingredients API
If you're building cosmetic tech, regulatory tools, or just curious about what's actually in your skincare products across different countries — give it a shot.
Questions? Drop a comment below.