I run two regulatory data APIs — one for Korean cosmetic ingredients (live on RapidAPI) and one for Korean chemical substance regulations (in development). The data comes from government sources: Korea's MFDS (Ministry of Food and Drug Safety) and the EU's CosIng database.
Government databases don't send you a notification when they update. They just change. If I miss an update, my API serves stale data, and my paying subscribers get wrong answers. That's the kind of problem that kills a data business.
So I built a bot that checks 8 regulatory pages every week and tells me when something changes.
The problem with naive approaches
My first attempt was simple: fetch the page HTML, hash it, compare it to the previous hash. If the hash changed, send an alert.
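That first version fit in a few lines. A minimal sketch, assuming a plain httpx fetch (the function and variable names here are mine, not from the real script):

```python
import hashlib

import httpx

def naive_check(url: str, old_hash: str | None) -> str:
    """Hash the raw HTML and compare it against the stored hash."""
    html = httpx.get(url, timeout=30).text
    new_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if old_hash is not None and new_hash != old_hash:
        print(f"CHANGE DETECTED: {url}")  # stand-in for the alert
    return new_hash
```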
It worked for about a week. Then I started getting false positives every single run. The pages had dynamic elements — session tokens, timestamps, CSRF tokens, analytics scripts — that changed on every visit without any actual data update. I was getting 3 AM Telegram alerts for nothing.
What actually works: targeted extraction
Instead of hashing the entire page, I extract only the data that matters and hash that.
For the Korean MFDS pages, the real indicator of an update is the Excel download button. When they update the data, the download link changes. So I extract the href and onclick attributes from the download button, plus the metadata table that shows the last update date:
# MFDS: extract only the download button and metadata table
excel_btn = page.locator("span.icon_xls").first
await excel_btn.wait_for(state="attached", timeout=10000)

parent_a = page.locator("a", has=page.locator("span.icon_xls")).first
href = await parent_a.get_attribute("href") or ""
onclick = await parent_a.get_attribute("onclick") or ""

# "제공형태" ("provision format") anchors the metadata table
table_text = await page.locator("table").filter(
    has_text="제공형태"
).first.inner_text()

extracted_data = f"HREF:{href}|ONCLICK:{onclick}|TABLE:{table_text}"
For the EU CosIng database, I extract download link URLs, the number of table rows, and the first data row:
# EU CosIng: extract download links + row count + first row
download_links = []
for fmt in ["PDF", "XLS", "CSV"]:
    link = page.locator(f"a:has-text('{fmt}')").first
    href = await link.get_attribute("href") or ""
    if href:
        download_links.append(f"{fmt}:{href}")

main_table = page.locator("table").first
row_count = await main_table.locator("tbody tr").count()
first_row = await main_table.locator("tbody tr").first.inner_text()

extracted_data = (
    f"LINKS:{'|'.join(download_links)}"
    f"|ROWS:{row_count}"
    f"|FIRST:{first_row[:200]}"
)
Then I hash only the extracted data:
import hashlib

new_hash = hashlib.sha256(extracted_data.encode("utf-8")).hexdigest()
False positives dropped to zero.
The 8 targets
I monitor 3 Korean MFDS pages and 5 EU CosIng Annexes:
| # | Source | What it contains |
|---|---|---|
| 1 | MFDS Raw Materials | Full ingredient list for Korean cosmetics |
| 2 | MFDS Restricted Ingredients | Concentration limits by country |
| 3 | MFDS Banned by Country | Which ingredients are banned where |
| 4 | EU Annex II | Prohibited substances (~1,600 entries) |
| 5 | EU Annex III | Restricted substances with conditions |
| 6 | EU Annex IV | Permitted colorants |
| 7 | EU Annex V | Permitted preservatives |
| 8 | EU Annex VI | Permitted UV filters |
If any of these change, my API data might be out of date.
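In the script, these live in a plain list of dicts that the main loop iterates over. A sketch of the shape (the URLs are placeholders, and every key except "url", which the retry loop below uses, is my guess):

```python
TARGETS = [
    # "name" doubles as the key in the state file shown later
    {"name": "식약처 원료성분", "url": "https://...", "type": "mfds"},
    {"name": "식약처 사용제한원료", "url": "https://...", "type": "mfds"},
    {"name": "EU Annex II (금지)", "url": "https://...", "type": "cosing"},
    # ... the remaining MFDS page and EU Annexes III to VI
]
```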
Handling failures
Government websites go down. They redirect to maintenance pages. They time out. The bot handles this:
Up to 3 attempts per target. Each attempt creates a fresh browser page to avoid stale connections:
for attempt in range(1, MAX_RETRIES + 1):
    page = await context.new_page()
    try:
        await page.goto(target["url"], wait_until="networkidle", timeout=45000)
        # ... extraction logic
        break  # success: stop retrying
    except Exception as e:
        if attempt < MAX_RETRIES:
            await asyncio.sleep(3)
    finally:
        await page.close()
Minimum data length check. If the extracted data is shorter than 20 characters, it's probably a block page or error page, not real data:
if len(extracted_data) < 20:
    raise ValueError("Extracted data too short — possible block/maintenance page")
First-run guard. On the first run, there's no previous hash to compare against. Instead of sending 8 "CHANGE DETECTED" alerts, it silently saves the initial hashes:
if old_hash is not None:
    if new_hash != old_hash:
        # Real change: send alert
        send_telegram_message(msg)
else:
    # First run: just save the hash, no alert
    pass
Running it on a t2.micro
Playwright + Chromium is heavy. On a t2.micro (1 GB RAM), it will crash without memory optimization:
BROWSER_ARGS = [
    "--disable-gpu",
    "--disable-dev-shm-usage",
    "--no-sandbox",
]
--disable-dev-shm-usage is the one that matters. By default, Chromium uses /dev/shm for shared memory, which is tiny on small instances. This flag makes it use /tmp instead.
I also close each page after extraction and add a 2-second delay between targets. Without the delay, memory spikes and the instance freezes.
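Wired together, the launch and pacing look roughly like this. `check_target` is a hypothetical stand-in for the per-target retry-and-extract logic shown above:

```python
import asyncio

from playwright.async_api import async_playwright

async def run_checks() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(args=BROWSER_ARGS)
        context = await browser.new_context()
        for target in TARGETS:
            await check_target(context, target)  # retry loop from above
            await asyncio.sleep(2)  # let memory settle between targets
        await browser.close()

asyncio.run(run_checks())
```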
State management
State is a JSON file. Each target stores its last known hash and the timestamp of the last detected change:
{
  "식약처 원료성분": {
    "hash": "a3f2b8c1d9e7...",
    "last_updated": "2026-04-15 03:00:12"
  },
  "EU Annex II (금지)": {
    "hash": "7b4e9a2f1c3d...",
    "last_updated": "2026-03-22 03:00:45"
  }
}
No database needed. The file is small and gets overwritten on each run.
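Reading and writing it is a few lines of stdlib json. A sketch, with a hypothetical path and helper names:

```python
import json
from pathlib import Path

STATE_FILE = Path("state.json")  # hypothetical location

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text(encoding="utf-8"))
    return {}  # first run: no previous hashes yet

def save_state(state: dict) -> None:
    # ensure_ascii=False keeps the Korean target names human-readable
    STATE_FILE.write_text(
        json.dumps(state, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```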
Alerts
When a change is detected, I get a Telegram message:
🚨 [Source Data Update Detected!]
식약처 사용제한원료 data has changed.
Check the file/DB and update.
- Detection time: 2026-04-15 03:00:12
- Link: [Go to page]
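send_telegram_message, called in the first-run guard above, only needs the Bot API's sendMessage endpoint. A sketch, assuming the token and chat ID come from environment variables:

```python
import os

import httpx

def send_telegram_message(text: str) -> None:
    token = os.environ["TELEGRAM_BOT_TOKEN"]
    chat_id = os.environ["TELEGRAM_CHAT_ID"]
    httpx.post(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data={"chat_id": chat_id, "text": text},
        timeout=15,
    )
```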
At the end of each run, I also get a weekly summary:
📋 [Weekly Source Data Check Complete]
Check time: 2026-04-15 03:05
Changes detected: 0 / 8
✅ 식약처 원료성분 — No change
✅ 식약처 사용제한원료 — No change
✅ EU Annex II (금지) — No change
...
The summary is important. Without it, silence is ambiguous — is nothing changing, or is the bot broken?
Scheduling
A cron job runs the script weekly:
0 3 * * 1 cd /home/ubuntu/compliance && python3 compliance_checker.py >> logs/compliance.log 2>&1
Monday at 3 AM. Regulatory databases rarely update more than once a month, so weekly is enough.
What I'd change
Playwright is overkill for most targets. The EU CosIng pages could probably be fetched with httpx and parsed with BeautifulSoup. I use Playwright because some MFDS pages require JavaScript rendering, and I'd rather have one tool than two. But if memory is tight, switching the EU targets to a lighter fetcher would save a lot of RAM.
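For reference, the lighter EU fetcher would look something like this. An untested sketch that mirrors the Playwright extraction above, and that only works as long as CosIng serves the table without JavaScript:

```python
import httpx
from bs4 import BeautifulSoup

def extract_cosing_light(url: str) -> str:
    html = httpx.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    links = [
        f"{a.get_text(strip=True)}:{a.get('href', '')}"
        for a in soup.find_all("a")
        if a.get_text(strip=True) in ("PDF", "XLS", "CSV")
    ]
    rows = soup.select("table tbody tr")
    first = rows[0].get_text(" ", strip=True)[:200] if rows else ""
    return f"LINKS:{'|'.join(links)}|ROWS:{len(rows)}|FIRST:{first}"
```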
Slack instead of Telegram. I started with Telegram because it was easier to set up. I'm migrating to Slack for better integration with my other monitoring (server health checks, API error rates).
Wrapping up
One Python file, a JSON state file, and a cron job. No database, no queue, no extra infrastructure.
Three things I learned: don't hash the whole page, don't trust that silence means the bot is working, and don't run Chromium without --disable-dev-shm-usage on a small instance.
I write about building data services from Korean regulatory sources at Decoded Korea. The cosmetic ingredients API is on RapidAPI.