How I Collected 47,000 Chemical Substances From a Korean Government API

I built an API for Korean chemical substance regulations. I wrote about the why and the architecture in a previous post. This post is about how I actually collected the data.

The source is data.go.kr, Korea's public data portal — the equivalent of data.gov but with Korean-language documentation and some quirks that took weeks to work through.

The target dataset: every chemical substance registered under K-REACH (Korea's chemical regulation framework). About 47,000 substances with regulatory classifications, CAS numbers, Korean/English names, and GHS hazard data.

There is no bulk download. The API only supports search queries. You send a search term and get back matching results, paginated at 100 per page. To get everything, I had to figure out a search strategy that would cover the entire database without missing substances and without burning through the daily API call limit.

The search problem

The API has one useful parameter: searchGubun. Set it to 1 and you can search by substance name. The search is a substring match — searching "benz" returns benzene, benzaldehyde, ethylbenzene, and everything else with "benz" in the name.

There is no wildcard search. You cannot send an empty string and get all results. Every query needs at least one character.

My first idea was to search for common chemical name fragments — "methyl", "ethyl", "propyl", "chloro" — and union the results. But that would miss every substance whose name contains none of those fragments. Substances like "Zinc oxide" or "Lead" would fall through.

The alphabet approach

The solution is simple: search for every single character. Search "a", then "b", then "c", all the way to "z", then "0" through "9". Thirty-six single-character queries, each returning every substance whose name contains that character.

ROWS_PER_PAGE = 100
search_chars = list("abcdefghijklmnopqrstuvwxyz0123456789")
all_items = {}  # sbstnId -> item, deduplicated across searches (see below)

for char in search_chars:
    # The first request for each character tells us how many pages exist.
    body, err = api_call(CHEM_BASE, search_gubun=1, search_nm=char, page=1)
    total_count = int(body.get("totalCount", 0))
    total_pages = (total_count + ROWS_PER_PAGE - 1) // ROWS_PER_PAGE  # ceiling division

    for page in range(1, total_pages + 1):
        body, err = api_call(CHEM_BASE, search_gubun=1, search_nm=char, page=page)
        items = body.get("items", [])
        for item in items:
            sid = item.get("sbstnId")
            if sid and sid not in all_items:
                all_items[sid] = item
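
The api_call helper is elided above, so here is a minimal sketch of the shape it takes. Only searchGubun is confirmed by the portal's docs for this API; serviceKey, searchNm, pageNo, numOfRows, and the JSON envelope are assumptions based on typical data.go.kr endpoints.

import time
import requests

API_KEY = "YOUR-DATA-GO-KR-KEY"            # issued per account on data.go.kr
CHEM_BASE = "https://apis.data.go.kr/..."  # actual endpoint path elided
ROWS_PER_PAGE = 100

def api_call(base_url, search_gubun, search_nm, page=1):
    """One GET against the portal; returns (body, error)."""
    # searchGubun is the documented parameter; the other names follow
    # common data.go.kr conventions and are assumptions, not confirmed.
    params = {
        "serviceKey": API_KEY,
        "searchGubun": search_gubun,
        "searchNm": search_nm,
        "pageNo": page,
        "numOfRows": ROWS_PER_PAGE,
        "returnType": "json",
    }
    time.sleep(0.5)  # the fixed 0.5-second spacing between calls
    try:
        resp = requests.get(base_url, params=params, timeout=30)
        resp.raise_for_status()
        # Envelope unwrapping is also an assumption; the collection code
        # only needs a dict exposing "totalCount" and "items".
        return resp.json().get("body", {}), None
    except (requests.RequestException, ValueError) as exc:
        return {}, str(exc)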

Searching "a" returns tens of thousands of results — hundreds of pages. Searching "e" returns even more. Most substances appear in multiple searches — benzene shows up in "b", "e", "n", and "z". Deduplication is mandatory.

Deduplication

Every substance in the database has a unique sbstnId. I store results in a Python dict keyed by sbstnId. If a substance was already collected from an earlier search, it gets silently skipped. No duplicates, no extra storage.

all_items = {}  # sbstnId -> item

for item in items:
    sid = item.get("sbstnId")
    if sid and sid not in all_items:
        all_items[sid] = item

By the time you finish all 36 characters, you have every substance in the database — each stored exactly once.

The API call limit problem

data.go.kr gives each API key a daily call limit. The exact number depends on your approval level, but it is not unlimited. My dataset needed tens of thousands of API calls to collect in full.

I set a per-run cap of 8,000 calls:

MAX_CALLS_PER_RUN = 8000
calls_this_run = 0  # incremented after every request

for char in remaining_chars:
    if calls_this_run >= MAX_CALLS_PER_RUN:
        print(f"[PAUSE] Call limit reached ({calls_this_run})")
        break
    # ... collect this character, counting each API call ...

When the limit is hit, the script saves progress and stops. The next run picks up where it left off.

Resume logic

Progress is tracked by which characters have been fully collected:

progress = {"completed_chars": ["a", "b", "c", "d"]}
# Next run starts from "e"

remaining = [c for c in search_chars if c not in completed]

After each character is done, the completed list and the collected data are both saved to disk. If the script crashes mid-character, the worst case is re-collecting that one character — and deduplication means no data corruption.
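
A minimal sketch of that save/load step, assuming plain JSON files; the file names here are illustrative, not the script's actual paths:

import json

PROGRESS_FILE = "progress.json"  # illustrative names
DATA_FILE = "chem_items.json"

def save_progress(completed_chars, all_items):
    # Called after every finished character, so a crash can only lose
    # the character currently in flight.
    with open(PROGRESS_FILE, "w", encoding="utf-8") as f:
        json.dump({"completed_chars": completed_chars}, f)
    with open(DATA_FILE, "w", encoding="utf-8") as f:
        json.dump(all_items, f, ensure_ascii=False)

def load_progress():
    try:
        with open(PROGRESS_FILE, encoding="utf-8") as f:
            progress = json.load(f)
        with open(DATA_FILE, encoding="utf-8") as f:
            all_items = json.load(f)
    except FileNotFoundError:
        progress, all_items = {"completed_chars": []}, {}
    return progress, all_items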

The early skip optimization

Later characters produce fewer new results. By the time you search "x" or "9", almost every substance has already been collected from earlier searches. Paging through 200 pages to find zero new items wastes API calls.

The fix: if 5 consecutive pages return zero new items, skip to the next character.

SKIP_AFTER_EMPTY_PAGES = 5
empty_streak = 0

for page in range(2, total_pages + 1):
    # ... fetch page ...
    new_in_page = 0
    for item in items:
        sid = item.get("sbstnId")
        if sid and sid not in all_items:
            all_items[sid] = item
            new_in_page += 1

    if new_in_page == 0:
        empty_streak += 1
    else:
        empty_streak = 0

    if empty_streak >= SKIP_AFTER_EMPTY_PAGES:
        break  # move to next character

This cut total API calls significantly. Without it, the script would burn thousands of calls paging through results it already had.

Collecting GHS data separately

The chemical substance API and the GHS hazard API are separate endpoints. The GHS API also uses substring search — but I already had all the CAS numbers from the first collection.

Instead of repeating the alphabet search, I queried the GHS API directly with each CAS number (searchGubun=2):

ghs_items = {}  # sbstnId -> GHS record

for cas in cas_list:
    body, err = api_call(GHS_BASE, search_gubun=2, search_nm=cas)
    items = body.get("items", [])
    for item in items:
        sid = item.get("sbstnId")
        if sid:
            ghs_items[sid] = item

Same resume logic, same per-run call cap, same deduplication. But much more efficient — one API call per substance instead of paging through overlapping search results.

What I ended up with

After multiple runs spread over several days:

  • 47,000+ chemical substances with regulatory classifications
  • GHS hazard data for substances that have it
  • 9 regulatory flags per substance (toxic, restricted, prohibited, priority management, accident preparedness, CMR, registration required, persistent organic pollutant, Rotterdam Convention)
  • All stored as JSON, later loaded into SQLite for the API
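
The JSON-to-SQLite load at the end is a one-shot step. A sketch of the idea; the database path, table name, and schema are invented for illustration:

import json
import sqlite3

conn = sqlite3.connect("kreach.db")  # illustrative path and schema
conn.execute("""
    CREATE TABLE IF NOT EXISTS substances (
        sbstn_id TEXT PRIMARY KEY,
        data     TEXT NOT NULL  -- the full item kept as a JSON blob
    )
""")

with open("chem_items.json", encoding="utf-8") as f:
    all_items = json.load(f)

conn.executemany(
    "INSERT OR REPLACE INTO substances (sbstn_id, data) VALUES (?, ?)",
    ((sid, json.dumps(item, ensure_ascii=False))
     for sid, item in all_items.items()),
)
conn.commit()
conn.close()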

The whole collection pipeline is a single Python file with three commands:

python kreach_phase1.py chem-remaining   # collect missing substances
python kreach_phase1.py ghs-by-cas       # collect GHS data
python kreach_phase1.py validate         # verify completeness

Things that went wrong

Rate limiting without clear feedback. When you exceed the daily limit, the API does not return a clean "rate limited" error. It returns a generic error code that looks like a server failure. It took a few failed runs to figure out what was happening.
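
Since the error body itself is ambiguous, the practical signal is positional: a burst of consecutive "server" errors partway through an otherwise healthy run usually means the quota is gone. A hedged sketch of that heuristic, not the script's actual handling:

CONSECUTIVE_ERROR_LIMIT = 5
consecutive_errors = 0

for char in remaining_chars:
    body, err = api_call(CHEM_BASE, search_gubun=1, search_nm=char, page=1)
    if err:
        consecutive_errors += 1
        if consecutive_errors >= CONSECUTIVE_ERROR_LIMIT:
            # Several generic errors in a row mid-run are more likely the
            # daily quota than a real outage: save progress and stop.
            print("[STOP] Repeated errors, assuming daily limit reached")
            break
        continue
    consecutive_errors = 0
    # ... page through this character as usual ...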

Ghost substances. Some sbstnId values appeared in GHS results but had no matching entry in the chemical substance API. The validate step catches these, but I still do not know why they exist.
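
The check itself is just a set difference between the two collections. Roughly what the validate step does for this case (my reconstruction, not the actual script):

chem_ids = set(all_items)  # sbstnIds from the substance collection
ghs_ids = set(ghs_items)   # sbstnIds from the GHS collection

ghosts = ghs_ids - chem_ids  # GHS records with no parent substance
if ghosts:
    print(f"[WARN] {len(ghosts)} GHS entries have no matching substance")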

What I'd do differently

If I were starting over, I would negotiate for bulk access first. data.go.kr has a process for requesting elevated API access or direct database dumps for approved research/commercial use. I went straight to coding and spent days on a problem that might have been solvable with an email.

I would also use asyncio + aiohttp instead of synchronous requests. The 0.5-second delay between calls would still apply, but concurrent handling of retries and error recovery would be cleaner.
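
A rough sketch of that shape: a semaphore plus a fixed sleep keeps the pacing, while retries with backoff live inside the coroutine. This is the idea, not code from the project, and the endpoint and parameters are placeholders as before:

import asyncio
import aiohttp

GHS_BASE = "https://apis.data.go.kr/..."  # endpoint path elided, as above
SEM = asyncio.Semaphore(1)  # pacing still serializes the actual requests

async def api_call_async(session, base_url, params):
    """Same (body, error) contract as the sync helper, plus retries."""
    err = None
    async with SEM:
        for attempt in range(3):
            await asyncio.sleep(0.5)  # keep the 0.5-second spacing
            try:
                async with session.get(base_url, params=params) as resp:
                    resp.raise_for_status()
                    return (await resp.json()).get("body", {}), None
            except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
                err = str(exc)
                await asyncio.sleep(2 ** attempt)  # back off, then retry
    return {}, err

async def main():
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        body, err = await api_call_async(
            session, GHS_BASE, {"searchGubun": 2, "searchNm": "50-00-0"}
        )
        print(len(body.get("items", [])), err)

asyncio.run(main())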


The collected data powers the K-REACH Chemical Substance API on RapidAPI. I also write about Korean regulatory data at Decoded Korea, and the cosmetic ingredients API is at K-Beauty Cosmetic Ingredients.

The alphabet search trick works for any API that supports substring matching but no bulk export. Brute-force, but effective.
