Korean entertainment has become a global phenomenon with shows such as Squid Game breaking records and K-dramas topping global charts. And yet, the data infrastructure behind it is fragmented.
Getting complete data on a single Korean show or film — cast, ratings (Korean and international), episode viewership numbers, where to stream it, what awards it won, its OST albums — requires hopscotching across several different websites.
The issue is that dominant platforms like NAVER and Melon lack English-first APIs. As Session Zero points out in this article, Korean data is heavily underserved in MCP ecosystems because when Western developer tools and AI systems are built, Korean platforms are often invisible by default.
The data exists. But it’s trapped behind language barriers, undocumented endpoints, JavaScript-rendered pages, and closed ecosystems. So while AI agents can easily retrieve structured information about Hollywood movies, Spotify charts, or IMDb ratings, asking the same systems about Korean dramas, OSTs, or Korean audience sentiment often returns incomplete results or nothing at all.
So I decided to build a unified database to fix it.
The Data Landscape
Korean entertainment data splits along two axes: language (Korean vs. English sources) and type (official vs. community vs. streaming).
English-language sources
TMDB is the closest thing to a comprehensive English-language database for Korean content. It has structured data on tens of thousands of Korean films and shows, a stable API, and community ratings. But it lacks Korea-specific data: no verified Korean audience scores, no Nielsen viewership ratings, no Korean box office data, no OST information.
MyDramaList fills a critical gap that TMDB misses entirely: community tags. MDL users have tagged Korean dramas with labels such as "Bromance", "Time Travel", "CEO Male Lead", and "Found Family." No official database captures that taxonomy. MDL also tracks airing status more accurately than TMDB for Korean dramas.
HanCinema has the deepest historical coverage of Korean content in English, including films from the 1950s through 1990s that TMDB barely covers.
JustWatch is the most reliable real-time source for streaming availability. TMDB's streaming data lags reality by weeks. JustWatch checks 364 services daily.
Wikipedia has rich content for major Korean films and shows, including detailed plot summaries, production histories, and cultural reception sections that no English entertainment database captures.
Korean-language sources
Here's where things get interesting and painful.
NAVER is Korea's dominant search engine and entertainment portal. Search for any Korean film on NAVER and you'll get a rich information card with two ratings that don't exist anywhere else:
- 실관람객 평점 (Verified ticket buyer rating): Only people who purchased cinema tickets through affiliated platforms can rate. This is Korea's equivalent of a verified purchase review.
- 네티즌 평점 (Netizen rating): Korean general public rating.
These ratings often diverge significantly from international scores. Parasite has a 9.08 verified buyer rating on NAVER versus 8.5 on TMDB. The Korean audience who saw it in theaters rated it exceptionally highly.
NAVER also has per-episode Nielsen Korea viewership ratings for TV dramas, which are the official broadcast ratings that Korean media report on weekly. No other English-language source has this data structured and queryable.
The critical catch: NAVER has no public API for any of this. Their entertainment data is rendered dynamically in JavaScript, served through their search interface, and entirely undocumented. Every data point requires a browser.
KOBIS (Korean Film Council) is the exception. It has an official government API that provides authoritative weekly and daily box office rankings. It's the only Korean government data source with a proper REST API.
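A KOBIS request is a plain GET with an API key and a target date. A minimal sketch of building one (the endpoint path follows the public KOBIS open-API docs; `MY_KEY` is a placeholder for a free developer key, and exact parameter names should be checked against the current docs):

```python
from urllib.parse import urlencode

KOBIS_DAILY = (
    "http://www.kobis.or.kr/kobisopenapi/webservice/rest/"
    "boxoffice/searchDailyBoxOfficeList.json"
)


def kobis_daily_url(api_key: str, target_date: str) -> str:
    # targetDt is a yyyymmdd string; the response is JSON with a
    # dailyBoxOfficeList of ranked films.
    return f"{KOBIS_DAILY}?{urlencode({'key': api_key, 'targetDt': target_date})}"


kobis_daily_url("MY_KEY", "20240101")
```

Fetching that URL with any HTTP client returns structured box office rankings — no browser, no scraping.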
Building the Scrapers
The Playwright Problem
Most of the Korean data sources render content through JavaScript. Static HTML requests return empty shells. This meant nearly every scraper needed a real browser.
To address this, I used Playwright with Chromium headless across all JS-rendered sources. The setup is consistent:
```python
import time

from playwright.sync_api import sync_playwright


def _get_page_html(url: str, wait_selector: str = "body") -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
            locale="ko-KR",
        )
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded")
        page.wait_for_selector(wait_selector)
        time.sleep(2)  # let lazy content settle
        html = page.content()
        browser.close()
        return html
```
The `locale="ko-KR"` setting matters for NAVER: it ensures the Korean version of each page is served rather than a region-localized variant.
The NAVER Genre Problem
One of the more unexpected parsing challenges came from NAVER's movie information card. Genre, country, and runtime appeared concatenated: 공포대한민국95분 (Horror South Korea 95min). They were in a single dd tag separated by invisible span.cm_bar_info elements.
The fix was to replace the separator spans with pipe characters before splitting:
```python
first_dd = info_groups[0].select_one("dd")
if first_dd:
    for span in first_dd.select("span"):
        span.replace_with("|")
    segments = [s.strip() for s in first_dd.get_text().split("|") if s.strip()]
    result["genre"] = segments[0] if segments else None
    result["country"] = segments[1] if len(segments) > 1 else None
```
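The same trick can be sketched standalone with stdlib `re` in place of BeautifulSoup (the markup below mimics the shape of NAVER's card; the real scraper works on the parsed DOM instead):

```python
import re


def split_info_line(dd_html: str) -> list[str]:
    # Replace the invisible separator spans with a pipe, then strip all
    # remaining tags and split on the pipes.
    text = re.sub(r"<span[^>]*>.*?</span>", "|", dd_html)
    text = re.sub(r"<[^>]+>", "", text)
    return [s.strip() for s in text.split("|") if s.strip()]


sample = (
    '<dd>공포<span class="cm_bar_info"></span>'
    '대한민국<span class="cm_bar_info"></span>95분</dd>'
)
split_info_line(sample)
# → ['공포', '대한민국', '95분']
```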
Extracting Nielsen Ratings from SVG
The trickiest scraping problem was NAVER's episode viewership chart. The data is rendered as an interactive SVG chart where viewership percentages, episode numbers, and air dates are all inside SVG text elements.
```python
def _parse_episode_chart(soup: BeautifulSoup) -> list[dict]:
    # Rating values from bb-text elements inside the SVG
    rating_texts = soup.select("g.bb-texts-rank text.bb-text")
    ratings = []
    for t in rating_texts:
        val = t.get_text(strip=True)
        try:
            f = float(val)
            if f > 0:
                ratings.append(f)
        except ValueError:
            pass

    # Episode numbers and dates from x-axis ticks
    x_ticks = soup.select("g.bb-axis-x g.tick")
    ep_labels = []
    for tick in x_ticks:
        tspans = tick.select("tspan")
        if len(tspans) >= 2:
            ep_num = _parse_episode_num(tspans[0].get_text(strip=True))
            date_text = tspans[1].get_text(strip=True)
            if ep_num and date_text:
                ep_labels.append({"episode": ep_num, "date": date_text})

    return [
        {"episode": ep["episode"], "air_date": ep["date"], "rating": ratings[i]}
        for i, ep in enumerate(ep_labels)
        if i < len(ratings)
    ]
```
This gives us per-episode Nielsen ratings:
```
Ep 1  (12.02.):  6.3%
Ep 8  (12.24.): 12.3%
Ep 16 (01.21.): 20.5%
```
For Goblin. No English-language API has this data.
The JustWatch Shadow DOM Problem
JustWatch uses Web Components with Shadow DOM for their streaming offer cards. The score and provider data that appears in the browser is inside <slot> elements that don't render in server-side HTML:
```html
<div class="score-wrap">
  <div class="critics-score-wrap">
    <slot name="critics-score"></slot> <!-- empty in scraped HTML -->
  </div>
</div>
```
However, the streaming offers themselves (provider names, prices, monetization types) render in the regular DOM inside div.buybox-selector a.offer elements. The key insight was that the offers were accessible even though the score slots weren't.
Extracting the actual streaming URLs from JustWatch's redirect links required parsing the r= parameter:
```python
from urllib.parse import parse_qs, unquote, urlparse


def _extract_redirect_url(href: str) -> str:
    parsed = urlparse(href)
    params = parse_qs(parsed.query)
    r = params.get("r", [None])[0]
    return unquote(r) if r else href
```
Awards Parsing: Five Ceremonies, Three Formats
Korean drama and film awards span five major ceremonies, each with slightly different HTML structure. I scraped all of them from AsianWiki plus the official Baeksang Awards site.
The unexpected challenge was that award categories use different winner formats depending on ceremony type:
- Drama awards: `Person ("Show Title")` → `links[0]` is the person, `links[1]` is the show
- Film awards: `"Film Title"` → title only, no person/show split
- Blue Dragon Series: `"Title" (Platform)` → title plus streaming platform
The format detection:
```python
value_text = item_text.replace(bold_text, "").strip().lstrip(":").strip()
if value_text.startswith('"'):
    # Film/series format: title only
    current_category["winner"] = links[0].get_text(strip=True)
    current_category["winner_show"] = None
else:
    # Drama format: person + show
    current_category["winner"] = links[0].get_text(strip=True)
    if len(links) > 1:
        current_category["winner_show"] = links[1].get_text(strip=True)
```
And the search function deduplicates winners who also appear in the nominees list. AsianWiki includes the winner in the nominees list, so a naive search returns double entries:
```python
# Skip if this is the same person/title as the winner (dedup)
if nom_name == winner_name and winner_matches:
    continue
```
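In isolation, the dedup amounts to the following (names are illustrative, and the real code additionally checks `winner_matches` against the search term):

```python
def merge_results(winner_name: str, nominees: list[str]) -> list[str]:
    # AsianWiki lists the winner again among the nominees, so a naive
    # merge double-counts; drop nominees matching the winner.
    return [winner_name] + [n for n in nominees if n != winner_name]


merge_results("Kim Tae-ri", ["Kim Tae-ri", "Park Eun-bin"])
# → ["Kim Tae-ri", "Park Eun-bin"]
```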
The Wikipedia Section Problem
Wikipedia articles don't have standardized section names. "Plot" might be called "Synopsis", "Series overview", "Story", or "Episodes" depending on who wrote the article and when.
We built a section alias system:
```python
SECTION_ALIASES = {
    "Plot": ["Plot", "Synopsis", "Story", "Series overview", "Premise", "Episodes"],
    "Cast": ["Cast", "Cast and characters", "Characters", "Main cast"],
    "Ratings": ["Ratings", "Viewership ratings", "Television ratings", "Viewership"],
    "Reception": ["Reception", "Critical response", "Critical reception"],
}
```
The smart lookup tries each alias in order until it finds content. Crash Landing on You uses "Episodes" for its episode list and "Viewership" for its ratings section — both non-standard names that the alias system handles automatically.
Cross-Source ID Management
One of the harder database design decisions was how to link data across sources. A single show like Crash Landing on You has:
- TMDB ID: `94796`
- MDL ID: `70` (from slug `70-crash-landing-on-you`)
- NAVER show OS ID: `3522952`
- JustWatch slug: `crash-landing-on-you`
- Wikipedia title: `Crash Landing on You`
The database stores all of these as columns on the tv_shows table. No single external ID covers every source, so rows get a surrogate UUID primary key, with the TMDB ID as the canonical unique external identifier because TMDB has the broadest coverage and the most stable IDs.
```sql
create table tv_shows (
    id uuid primary key default uuid_generate_v4(),
    tmdb_id text unique,
    mdl_id text unique,
    mdl_slug text,
    naver_show_id text,
    justwatch_slug text,
    wikipedia_title text
    -- ... ratings, metadata, etc.
);
```
The pipeline links sources progressively: TMDB runs first to seed core titles, then MDL enriches with ratings and tags, then NAVER TV adds episode ratings using the Korean title stored by TMDB.
Rating Field Naming Convention
With eight different rating sources covering different audiences and methodologies, naming discipline was essential:
```python
tmdb_rating            # Global community (0-10)
mdl_rating             # International K-drama fans (0-10)
naver_audience_rating  # Korean verified ticket buyers (0-10)
naver_netizen_rating   # Korean general public (0-10)
naver_latest_rating    # Nielsen Korea latest episode (%)
naver_highest_rating   # Nielsen Korea peak episode (%)
rt_tomatometer         # Western professional critics (0-100)
rt_audience_score      # Western RT users (0-100)
```
These are never stored as a generic "rating" field. The naming makes the source and audience type explicit at the schema level, preventing any ambiguity in downstream queries or API responses.
What This Unlocks
The unified database enables queries that weren't previously possible:
Find dramas where Korean audiences loved it but Western critics were lukewarm:
```sql
SELECT title_english, naver_audience_rating, rt_tomatometer
FROM tv_shows
WHERE naver_audience_rating > 8.5
  AND rt_tomatometer < 60;
```
Find all content with OST albums by IU:
```sql
SELECT ts.title_english, oa.album_name, oa.vibe_url
FROM ost_albums oa
JOIN tv_shows ts ON oa.show_id = ts.id
WHERE oa.artist ILIKE '%IU%' OR oa.artist ILIKE '%아이유%';
```
Find award winners available on Netflix US:
```sql
SELECT DISTINCT ts.title_english, a.category, a.year
FROM awards a
JOIN tv_shows ts ON a.show_id = ts.id
JOIN streaming_availability sa ON sa.show_id = ts.id
WHERE sa.provider = 'Netflix'
  AND sa.region = 'us'
  AND sa.monetization_type = 'Subscription'
  AND a.won = true;
```
Find the drama with the biggest episode-to-episode rating jump:
```sql
SELECT show_id, episode_number, nielsen_rating,
       nielsen_rating - LAG(nielsen_rating) OVER (
           PARTITION BY show_id ORDER BY episode_number
       ) AS jump
FROM episodes
ORDER BY jump DESC NULLS LAST
LIMIT 10;
```
None of these queries are possible against any single existing source.
What's Next
The goal of this project is to build an MCP server for Korean entertainment data, making Korean movies and TV shows accessible to AI agents and developer tooling in a structured, searchable way.
Instead of forcing developers to manually scrape different sites just to answer a basic query, the MCP server will expose unified tools that support natural language requests like “Find me a political thriller Korean audiences loved that’s available on Netflix and maintained strong episode ratings throughout its run.”
Under the hood, that means reconciling fragmented metadata across 10,000+ titles from APIs, scrapers, streaming providers, audience ratings, box office systems, and community-driven sources.
Part 2 will dive into the pipeline architecture itself: the automated sync system, GitHub Actions orchestration, incremental updates, scraper failures, rate limits, deduplication headaches, and the very questionable debugging decisions made at 2AM.