I run two regulatory data APIs — one for Korean cosmetic ingredients (live on RapidAPI) and one for Korean chemical substance regulations (in development). The data comes from government sources: Korea's MFDS (Ministry of Food and Drug Safety) and the EU's CosIng database.
Government databases don't send you a notification when they update. They just change. If I miss an update, my API serves stale data, and my paying subscribers get wrong answers. That's the kind of problem that kills a data business.
So I built a bot that checks 8 regulatory pages every week and tells me when something changes.
The problem with naive approaches
My first attempt was simple: fetch the page HTML, hash it, compare it to the previous hash. If the hash changed, send an alert.
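That first version fit in a few lines. A minimal sketch, assuming a plain httpx fetch (the function and variable names here are mine, not from the real script):

```python
import hashlib

import httpx

def naive_check(url: str, old_hash: str | None) -> str:
    """Hash the raw HTML and compare it against the stored hash."""
    html = httpx.get(url, timeout=30).text
    new_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if old_hash is not None and new_hash != old_hash:
        print(f"CHANGE DETECTED: {url}")  # stand-in for the alert
    return new_hash
```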
It worked for about a week. Then I started getting false positives every single run. The pages had dynamic elements — session tokens, timestamps, CSRF tokens, analytics scripts — that changed on every visit without any actual data update. I was getting 3 AM Telegram alerts for nothing.
What actually works: targeted extraction
Instead of hashing the entire page, I extract only the data that matters and hash that.
For the Korean MFDS pages, the real indicator of an update is the Excel download button. When they update the data, the download link changes. So I extract the href and onclick attributes from the download button, plus the metadata table that shows the last update date:
# MFDS: extract only the download button and metadata table
excel_btn = page.locator("span.icon_xls").first
await excel_btn.wait_for(state="attached", timeout=10000)

parent_a = page.locator("a", has=page.locator("span.icon_xls")).first
href = await parent_a.get_attribute("href") or ""
onclick = await parent_a.get_attribute("onclick") or ""

# "제공형태" ("provision format") anchors the metadata table
table_text = await page.locator("table").filter(
    has_text="제공형태"
).first.inner_text()

extracted_data = f"HREF:{href}|ONCLICK:{onclick}|TABLE:{table_text}"
For the EU CosIng database, I extract download link URLs, the number of table rows, and the first data row:
# EU CosIng: extract download links + row count + first row
download_links = []
for fmt in ["PDF", "XLS", "CSV"]:
    link = page.locator(f"a:has-text('{fmt}')").first
    href = await link.get_attribute("href") or ""
    if href:
        download_links.append(f"{fmt}:{href}")

main_table = page.locator("table").first
row_count = await main_table.locator("tbody tr").count()
first_row = await main_table.locator("tbody tr").first.inner_text()

extracted_data = (
    f"LINKS:{'|'.join(download_links)}"
    f"|ROWS:{row_count}"
    f"|FIRST:{first_row[:200]}"
)
Then I hash only the extracted data:
import hashlib

new_hash = hashlib.sha256(extracted_data.encode("utf-8")).hexdigest()
False positives dropped to zero.
The 8 targets
I monitor 3 Korean MFDS pages and 5 EU CosIng Annexes:
| # | Source | What it contains |
|---|---|---|
| 1 | MFDS Raw Materials | Full ingredient list for Korean cosmetics |
| 2 | MFDS Restricted Ingredients | Concentration limits by country |
| 3 | MFDS Banned by Country | Which ingredients are banned where |
| 4 | EU Annex II | Prohibited substances (~1,600 entries) |
| 5 | EU Annex III | Restricted substances with conditions |
| 6 | EU Annex IV | Permitted colorants |
| 7 | EU Annex V | Permitted preservatives |
| 8 | EU Annex VI | Permitted UV filters |
If any of these change, my API data might be out of date.
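In the script, these live in a plain list of dicts that the main loop iterates over. A sketch of the shape (the URLs are placeholders, and every key except "url", which the retry loop below uses, is my guess):

```python
TARGETS = [
    # "name" doubles as the key in the state file shown later
    {"name": "식약처 원료성분", "url": "https://...", "type": "mfds"},
    {"name": "식약처 사용제한원료", "url": "https://...", "type": "mfds"},
    {"name": "EU Annex II (금지)", "url": "https://...", "type": "cosing"},
    # ... the remaining MFDS page and EU Annexes III to VI
]
```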
Handling failures
Government websites go down. They redirect to maintenance pages. They time out. The bot handles this:
Up to 3 attempts per target. Each attempt creates a fresh browser page to avoid stale connections:
for attempt in range(1, MAX_RETRIES + 1):
    page = await context.new_page()
    try:
        await page.goto(target["url"], wait_until="networkidle", timeout=45000)
        # ... extraction logic
        break  # success: stop retrying
    except Exception as e:
        if attempt < MAX_RETRIES:
            await asyncio.sleep(3)
    finally:
        await page.close()
Minimum data length check. If the extracted data is shorter than 20 characters, it's probably a block page or error page, not real data:
if len(extracted_data) < 20:
    raise ValueError("Extracted data too short — possible block/maintenance page")
First-run guard. On the first run, there's no previous hash to compare against. Instead of sending 8 "CHANGE DETECTED" alerts, it silently saves the initial hashes:
if old_hash is not None:
    if new_hash != old_hash:
        # Real change: send alert
        send_telegram_message(msg)
else:
    # First run: just save the hash, no alert
    pass
Running it on a t2.micro
Playwright + Chromium is heavy. On a t2.micro (1 GB RAM), it will crash without memory optimization:
BROWSER_ARGS = [
    "--disable-gpu",
    "--disable-dev-shm-usage",
    "--no-sandbox",
]
--disable-dev-shm-usage is the one that matters. By default, Chromium uses /dev/shm for shared memory, which is tiny on small instances. This flag makes it use /tmp instead.
I also close each page after extraction and add a 2-second delay between targets. Without the delay, memory spikes and the instance freezes.
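Wired together, the launch and pacing look roughly like this. `check_target` is a hypothetical stand-in for the per-target retry-and-extract logic shown above:

```python
import asyncio

from playwright.async_api import async_playwright

async def run_checks() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(args=BROWSER_ARGS)
        context = await browser.new_context()
        for target in TARGETS:
            await check_target(context, target)  # retry loop from above
            await asyncio.sleep(2)  # let memory settle between targets
        await browser.close()

asyncio.run(run_checks())
```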
State management
State is a JSON file. Each target stores its last known hash and the timestamp of the last detected change:
{
  "식약처 원료성분": {
    "hash": "a3f2b8c1d9e7...",
    "last_updated": "2026-04-15 03:00:12"
  },
  "EU Annex II (금지)": {
    "hash": "7b4e9a2f1c3d...",
    "last_updated": "2026-03-22 03:00:45"
  }
}
No database needed. The file is small and gets overwritten on each run.
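Reading and writing it is a few lines of stdlib json. A sketch, with a hypothetical path and helper names:

```python
import json
from pathlib import Path

STATE_FILE = Path("state.json")  # hypothetical location

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text(encoding="utf-8"))
    return {}  # first run: no previous hashes yet

def save_state(state: dict) -> None:
    # ensure_ascii=False keeps the Korean target names human-readable
    STATE_FILE.write_text(
        json.dumps(state, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```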
Alerts
When a change is detected, I get a Telegram message:
🚨 [Source Data Update Detected!]
식약처 사용제한원료 data has changed.
Check the file/DB and update.
- Detection time: 2026-04-15 03:00:12
- Link: [Go to page]
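send_telegram_message, called in the first-run guard above, only needs the Bot API's sendMessage endpoint. A sketch, assuming the token and chat ID come from environment variables:

```python
import os

import httpx

def send_telegram_message(text: str) -> None:
    token = os.environ["TELEGRAM_BOT_TOKEN"]
    chat_id = os.environ["TELEGRAM_CHAT_ID"]
    httpx.post(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data={"chat_id": chat_id, "text": text},
        timeout=15,
    )
```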
At the end of each run, I also get a weekly summary:
📋 [Weekly Source Data Check Complete]
Check time: 2026-04-15 03:05
Changes detected: 0 / 8
✅ 식약처 원료성분 — No change
✅ 식약처 사용제한원료 — No change
✅ EU Annex II (금지) — No change
...
The summary is important. Without it, silence is ambiguous — is nothing changing, or is the bot broken?
Scheduling
A cron job runs the script weekly:
0 3 * * 1 cd /home/ubuntu/compliance && python3 compliance_checker.py >> logs/compliance.log 2>&1
Monday at 3 AM. Regulatory databases rarely update more than once a month, so weekly is enough.
What I'd change
Playwright is overkill for most targets. The EU CosIng pages could probably be fetched with httpx and parsed with BeautifulSoup. I use Playwright because some MFDS pages require JavaScript rendering, and I'd rather have one tool than two. But if memory is tight, switching the EU targets to a lighter fetcher would save a lot of RAM.
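For reference, the lighter EU fetcher would look something like this. An untested sketch that mirrors the Playwright extraction above, and that only works as long as CosIng serves the table without JavaScript:

```python
import httpx
from bs4 import BeautifulSoup

def extract_cosing_light(url: str) -> str:
    html = httpx.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    links = [
        f"{a.get_text(strip=True)}:{a.get('href', '')}"
        for a in soup.find_all("a")
        if a.get_text(strip=True) in ("PDF", "XLS", "CSV")
    ]
    rows = soup.select("table tbody tr")
    first = rows[0].get_text(" ", strip=True)[:200] if rows else ""
    return f"LINKS:{'|'.join(links)}|ROWS:{len(rows)}|FIRST:{first}"
```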
Slack instead of Telegram. I started with Telegram because it was easier to set up. I'm migrating to Slack for better integration with my other monitoring (server health checks, API error rates).
Wrapping up
One Python file, a JSON state file, and a cron job. No database, no queue, no extra infrastructure.
Three things I learned: don't hash the whole page, don't trust that silence means the bot is working, and don't run Chromium without --disable-dev-shm-usage on a small instance.
I write about building data services from Korean regulatory sources at Decoded Korea. The cosmetic ingredients API is on RapidAPI.