DEV Community: Cara Jung

From Scrapers to MCP Server: Serving Korean Entertainment Data to AI Agents

Cara Jung — Mon, 11 May 2026 13:00:00 +0000

Korean entertainment data is surprisingly fragmented. Information about a single drama or film is often scattered across multiple platforms.

To solve that, I built a unified Korean entertainment database powered by APIs, web scrapers, and automated sync pipelines. By the end of the project, I had a Supabase database containing nearly 10,000 Korean movies, 3,500 TV shows, per-episode Nielsen Korea ratings, award histories, and streaming availability across four regions.

The next problem was figuring out how to expose it to AI agents in a way that was actually useful, secure, monetizable, and maintainable.

This is the story of building the MCP server and the errors I encountered.

Designing the Tools

Before writing any code, I thought carefully about what an AI agent would actually need from a Korean entertainment database. The answer wasn't "expose every database column as a query parameter." That's an API, not a tool. MCP tools should be opinionated about what they return and why.

I ended up with 17 tools organized into three categories:

Discovery tools answer "what should I watch?" — get_trending_dramas, browse_by_genre, browse_by_tag. The tag tool is the most distinctive: MyDramaList's community taxonomy ("Bromance", "Enemies to Lovers", "Time Travel", "CEO Male Lead") doesn't exist anywhere else in structured form, and it's exactly how K-drama fans actually think about recommendations.

Detail tools answer "tell me everything about this title" like get_movie, get_drama, get_episode_ratings, get_ost_albums. The episode ratings tool is the one I'm most proud of: it returns Nielsen Korea per-episode viewership percentages scraped from SVG chart elements on Naver. No English-language API has this data.

Utility tools answer cross-cutting questions — find_where_to_watch, get_weekly_boxoffice, get_actor_filmography, compare_ratings. The compare_ratings tool is genuinely novel: it shows you Naver's verified Korean ticket buyer score, MDL's international fan score, TMDB's global community score, and RT's Western critic score side by side, with labels explaining what each audience represents.

Building the Server with FastMCP

FastMCP makes building MCP servers surprisingly clean. Each tool is a decorated Python function with a docstring that becomes the tool description:

from fastmcp import FastMCP

mcp = FastMCP(
    name="Korean Entertainment",
    instructions="""
You have access to a comprehensive database of Korean movies and TV shows.
Rating fields and what they mean:
- mdl_rating: International K-drama fans (0-10)
- naver_audience_rating: Korean verified ticket buyers (0-10)
- naver_latest_rating: Nielsen Korea latest episode viewership (%)
""",
)

@mcp.tool
def browse_by_tag(tag: str, limit: int = 20) -> list[dict]:
    """
    Browse Korean dramas by MyDramaList community tag.
    Common tags: "Bromance", "Time Travel", "CEO Male Lead",
    "Enemies to Lovers", "Revenge", "Found Family"
    """
    return _supabase.table("tv_shows") \
        .select("id, title_english, title_korean, year, mdl_rating, tags") \
        .contains("tags", [tag]) \
        .order("mdl_rating", desc=True) \
        .limit(limit) \
        .execute().data or []

The tools call db/queries.py directly, the same query layer the pipeline uses to write data. No intermediate API layer needed. When Claude calls browse_by_tag(tag="Revenge", limit=5), it goes straight to Supabase.

Choosing Authentication: Descope

For a monetizable MCP server, I needed real OAuth 2.1 authentication. Without auth, anyone with the URL can use your server, which fine for testing, but not for marketplace listings where you might want to gate access or track usage per user.

I chose Descope for three reasons:

FastMCP has a first-class DescopeProvider integration
Descope supports Dynamic Client Registration (DCR), which lets MCP clients like Claude register automatically without manual configuration
Their free tier is generous enough for an early-stage project

The final auth setup in server.py is just four lines:

from fastmcp.server.auth.providers.descope import DescopeProvider

_auth = DescopeProvider(
    config_url=os.environ["DESCOPE_CONFIG_URL"],
    base_url=os.environ["SERVER_URL"],
)

mcp = FastMCP(name="Korean Entertainment", auth=_auth)

Getting to those four lines took about seven failed deployments.

The Deployment Errors Worth Knowing About

Every deployment has gotchas. Here are the ones that will actually save you time if you're building something similar.

Corrupted file from incremental edits

When making multiple edits to server.py, one automated string replacement accidentally jammed code into the middle of an import block:

from db.queries import (
    get_movie_by_tmdb_id,
    get_movie_by_title,auth=DescopeProvider(  # ← corrupted by bad replacement
    get_movies,

Railway reported this as SyntaxError: '(' was never closed a confusing error that had nothing to do with parentheses. The actual problem was a botched edit 20 lines earlier.

The lesson: when making multiple changes to the same file, regenerate the whole file from scratch rather than applying incremental patches. A clean file beats a patched one every time.

Missing `https://` prefix

The SERVER_URL environment variable was set to kr-movie-tv-mcp-production.up.railway.app without the https:// prefix. Pydantic rejected it immediately:

Input should be a valid URL, relative URL without a base
[type=url_parsing, input_value='kr-movie-tv-mcp-production.up.railway.app']

Simple fix, but it costs a full Railway deployment cycle (about 3 minutes) to discover. Always include the schema when setting URL environment variables.

How Descope's token validation actually works

One thing worth knowing from Descope's documentation: the MCP Server URL field in the console is optional. When set, it adds an aud (audience) claim to access tokens, scoping them to your specific server. When left unset, no audience claim is added and tokens are validated purely against the .well-known config URL.
This means you can get a fully working OAuth setup (complete handshake, Dynamic Client Registration, and tool discovery) with just the .well-known URL configured. The audience field is an additional security layer for production environments where you want strict token scoping, not a prerequisite for the integration to work.

Deploying to Railway

I chose Railway over the free alternatives for one reason: no cold starts. Render's free tier spins down after 15 minutes of inactivity and takes 30-60 seconds to restart. For an MCP server that needs to respond to tool calls quickly, a cold start on the first request would produce a bad user experience and potentially cause claude.ai to show a timeout error.

Railway at $5/month gives you an always-on container with automatic deploys from GitHub. The configuration is minimal, a railway.toml that specifies the start command:

[build]
builder = "nixpacks"

[deploy]
startCommand = "python server.py"
restartPolicyType = "always"

The server runs with streamable-http transport, which is what claude.ai and other MCP clients expect for remote servers:

if __name__ == "__main__":
    port = int(os.environ.get("PORT", 8000))
    mcp.run(
        transport="streamable-http",
        host="0.0.0.0",
        port=port,
    )

Railway injects its own PORT environment variable, in this case 8080, so the server binds to whatever port Railway assigns. The 8000 default in the code is only used when running locally.

Testing End-to-End

Once the server was running, I tested it through Claude Code locally first:

claude mcp add korean-entertainment -- python server.py
claude

The first query I ran:

"Which Korean drama had the biggest Nielsen viewership jump between its first and last episode?"

Claude called get_top_dramas to get a list of candidates, then called get_episode_ratings for each one, computed the delta, and returned:

Crash Landing on You jumped from 6.1% → 21.7% (+15.6 percentage points), beating My Love from the Star's 15.6% → 28.1% jump (+12.5pp).

That's real Nielsen Korea data, pulled from SVG chart elements on NAVER, stored in Supabase, served through FastMCP, reasoned over by Claude. The full pipeline working end-to-end.

After connecting to claude.ai directly, I tested a more complex query:

"Find me a Korean thriller drama rated above 8.5 with the Revenge tag, and show me where I can watch it in the US"

The response correctly identified The Glory (더 글로리) at 8.9 MDL rating, confirmed it streams on Netflix US, and included context about the writing and direction. Cross-source, cross-tool reasoning working exactly as designed.

What the MCP Server Unlocks

The combination of data sources creates queries that weren't previously possible from any single API:

"Find dramas where Korean audiences loved it but Western audiences didn't":
This requires naver_audience_rating (Korean verified buyers) vs rt_tomatometer (Western critics): two fields from two different scrapers on the same row.

"Show me the episode rating trajectory for currently airing dramas":
This requires get_trending_dramas + get_episode_ratings: airing status from MDL, viewership from Naver's SVG charts.

"What Korean films won awards and are now on Netflix?":
This requires joining awards, movies, and streaming_availability: three tables from three different sources.

None of these queries are possible against any existing Korean entertainment API, because no existing API has all three pieces. That's the value proposition.

Getting It In Front of Users

With the server running, the next priority was distribution, which is getting it listed everywhere developers and AI users look for MCP servers.

Smithery

Smithery was the fastest listing. Paste your Railway URL, complete the OAuth flow, and their inspector automatically discovers all 17 tools and generates the listing. The whole process took under 10 minutes. Smithery is worth listing on because it's where Claude Code users browse for MCP servers such as developers who are already in an agentic workflow and actively looking for new tools to add.

Glama

Glama has over 23,000 MCP servers listed and is one of the most-searched directories in the ecosystem. Submitting is straightforward, just add your URL and GitHub repo. They have a private notes field for their review team; since the server uses OAuth, I left instructions explaining how to connect via claude.ai rather than providing API keys. Glama is worth listing on because it indexes for search relevance and recent usage, so active servers rise in rankings over time.

mcp.so

mcp.so has over 20,000 servers and accepts submissions via GitHub issue. The server config field expects the standard MCP JSON format:

{
  "mcpServers": {
    "korean-entertainment": {
      "url": "https://kr-movie-tv-mcp-production.up.railway.app/mcp"
    }
  }
}

Worth listing because it's one of the most-linked directories in MCP ecosystem articles and gets significant organic search traffic.

Cline Marketplace

Cline is a popular AI coding assistant with millions of users and their own MCP marketplace backed by a GitHub repo. Submission is a GitHub issue with your server name, repo URL, endpoint, tool list, and what makes it unique. The review team checks for code quality and documentation before approving. Worth the effort because Cline users are developers who tend to build on top of tools they discover: potential integrators, not just users.

MCP-Hive

MCP-Hive launched May 11, 2026 as the first marketplace with actual pay-per-call revenue sharing for server providers. MCP-Hive is the main monetization play in this ecosystem right now; every other directory is free discovery with no revenue component.

What's Next

The distribution is in place. What comes next:

Complete initial population — ~6,500 TV shows still need TMDB sync as GitHub Actions minutes reset monthly
Add per-user billing — Descope supports scope-based access control, enabling a free tier (basic search) vs paid tier (Nielsen data, awards)
KMDb integration — API membership pending, will add Korean film archive data when approved

The database gets richer every night. The server is listed where it needs to be. Time to let the data do the work.

Inside the Pipeline Powering a Korean Entertainment MCP Server

Cara Jung — Sun, 10 May 2026 13:00:00 +0000

Korean entertainment has become global, but the infrastructure behind its data is still surprisingly broken. Information about a single show is often scattered across multiple platforms: one site for Korean ratings, another for streaming availability, another for cast data, another for box office numbers, and several with no public APIs at all.

This is Part 2 of a series. In Part 1, I covered how I unified data across 10 APIs and web scrapers into a single database designed to power an MCP server.

But collecting the data was only half the problem. Once you have 10 independent sources feeding into the same system, the real engineering work begins: how do you run all of them reliably, on a schedule, writing to a shared database, without the entire pipeline breaking every time one site changes its HTML structure?

This is the story of building the pipeline that keeps the Korean entertainment database alive.

The Database: Why Supabase

The database needed to be:

Queryable via REST (the MCP server reads directly from it — no separate API layer)
PostgreSQL-native (arrays, jsonb, GIN indexes for genre/tag filtering)
Free to start (we're pre-revenue)
Always-on (the nightly pipeline writes to it every day)

Supabase checked all four boxes. Its free tier includes 500MB of storage, unlimited API requests via PostgREST, and Row Level Security for access control. The built-in REST API means I can query the database directly from the MCP server without building a middleware layer.

The one gotcha: Supabase free tier pauses the database after 1 week of inactivity. The nightly pipeline prevents this — activity every day keeps it awake. If the pipeline goes down for more than a week, it needs to be manually unpaused from the dashboard.

Schema Design Decisions

Separate tables for movies and TV shows. The data shapes are different: movies have runtime, box office revenue, and KOBIS codes while TV shows have episode counts, airing schedules, MDL-specific tags, and per-episode ratings. One table with a type column would mean half the columns are always NULL.

Both a cast table and JSON summary. The normalized movie_cast and show_cast tables enable queries like "all films with Song Kang-ho". The JSON summary in the main record enables fast single-title lookups without a join.

Separate streaming_availability table. Storing streaming data as a jsonb column would make queries like "all Korean films on Netflix US" require a full table scan. A proper table with a row per title/region/provider enables indexed lookups.

Explicit rating field naming. Every rating field is named by source and audience type — naver_audience_rating, tmdb_rating, rt_tomatometer — never generic "rating". This is a naming convention enforced at the schema level that prevents ambiguity at the application layer.

The NULL Constraint Issue

PostgreSQL's unique constraints treat NULL as not equal to NULL, which caused a subtle bug in the streaming availability table. The unique constraint on (movie_id, show_id, region, provider, monetization_type) failed to prevent duplicates when show_id was NULL. Postgres saw NULL != NULL and allowed duplicate rows.

The fix was a partial index using COALESCE:

create unique index streaming_availability_unique
  on streaming_availability (
    coalesce(movie_id::text, ''),
    coalesce(show_id::text, ''),
    region,
    provider,
    coalesce(monetization_type, '')
  );

This converts NULLs to empty strings before comparison, making the uniqueness check work correctly.

The Pipeline: Prefect + GitHub Actions

Why Not Prefect Cloud?

Prefect is genuinely excellent orchestration software. Its flow/task model gives you retry logic, caching, dependency management, and a beautiful UI, all from simple Python decorators:

@task(retries=3, retry_delay_seconds=10)
def sync_movie(tmdb_id: int) -> dict | None:
    raw = run_with_retry(get_movie_details, tmdb_id)
    if not raw:
        return None
    movie_row = upsert_movie(_transform_movie(raw))
    return movie_row

@flow(name="sync_tmdb", log_prints=True)
def sync_tmdb_flow(movie_limit: int | None = None):
    movie_ids = fetch_korean_movie_ids()
    for tmdb_id in movie_ids[:movie_limit]:
        sync_movie(tmdb_id)

The problem was Prefect Cloud's free tier. When I tried to create a work pool:

prefect work-pool create kr-mcp-pool --type process
# Your plan does not support hybrid or push work pools.

Hybrid work pools, which let you run jobs on your own infrastructure, require a paid plan starting at $20/month. Without a work pool, there's no way to schedule flows to run automatically.

I kept the Prefect flow/task structure since it gives retry logic, task-level logging, and caching for free even without Prefect Cloud. But for the scheduling layer, I went with GitHub Actions.

Why GitHub Actions

GitHub Actions gives 2,000 free minutes per month on private repositories. Each workflow job runs on a fresh Ubuntu runner with Python, pip, and dependencies installed from scratch.

The scheduling syntax is standard cron:

on:
  schedule:
    - cron: "0 2 * * *"    # 2am UTC daily
  workflow_dispatch:         # also triggerable manually

Secrets are injected as environment variables:

env:
  TMDB_API_KEY: ${{ secrets.TMDB_API_KEY }}
  SUPABASE_SERVICE_ROLE_KEY: ${{ secrets.SUPABASE_SERVICE_ROLE_KEY }}

The workflows call simple Python scripts that invoke Prefect flows:

# scripts/run_sync_tmdb.py
from pipeline.jobs.sync_tmdb import sync_tmdb_flow
sync_tmdb_flow(movie_limit=200, show_limit=200)

Estimated monthly usage:

Nightly sync (TMDB + MDL + Naver TV airing): ~37 min × 30 days = ~1,110 min
Weekly sync (KOBIS + JustWatch + Wikipedia + Awards): ~75 min × 4 = ~300 min
Total: ~1,410 min/month — safely under the 2,000 minute limit

The tradeoff is runner startup time (~30 seconds per job for pip install and Playwright browser install) and the 6-hour job timeout, which became a significant constraint during the initial population phase.

The Errors

Error 1: Prefect Type Validation is Strict

Prefect 3.x validates flow parameter types at runtime. Optional integer parameters typed as int = None fail validation:

ParameterTypeError: Flow run received invalid parameters:
 - movie_limit: Input should be a valid integer

The fix is explicit Optional typing:

# Before (fails)
def sync_tmdb_flow(movie_limit: int = None):

# After (works)
def sync_tmdb_flow(movie_limit: int | None = None):

This affected every pipeline job. I had to audit all int, str, float, and list optional parameters across 7 files. Python's int | None syntax (PEP 604) is required; Optional[int] from typing also works.

Error 2: The TMDB Response Shape Mismatch

The data_sources/tmdb.py module normalizes TMDB's raw API response before returning it. The pipeline assumed it was getting raw TMDB format:

# Pipeline assumed raw TMDB format:
genres = normalize_genres([g["name"] for g in raw.get("genres", [])])
# Actual normalized format from tmdb.py:
# genres is already ["Drama", "Romance"] — a list of strings, not dicts

This caused TypeError: string indices must be integers when trying to do g["name"] on a string.

The fix was simple once diagnosed:

genres = normalize_genres(raw.get("genres", []))

But finding it required running the job, reading the stack trace, debugging the actual response shape with a quick Python one-liner, and tracing back to the source.

Error 3: The Initial Population Timeout

GitHub Actions has a maximum job runtime of 6 hours. TMDB has ~10,000 Korean movies and ~10,000 Korean TV shows. Fetching full details for each at ~0.25 seconds per title = ~83 minutes for movies, ~83 minutes for shows. That's within the limit.

But each title also has cast: 20 cast members per title × 2 DB writes each × 20,000 titles = 800,000 database operations. The actual runtime exceeded 6 hours and the job was cancelled mid-run.

The job has exceeded the maximum execution time of 6h0m0s

The solution was accepting partial initial population and relying on the nightly sync to fill gaps over time. After the initial run, I had:

9,983 movies (nearly complete)
3,536 shows (35% since the timeout hit mid-show-sync)

Subsequent nightly runs continue adding shows. The upsert pattern means no data is lost or duplicated since each run picks up new IDs and updates existing records.

Error 4: ON CONFLICT DO UPDATE with Duplicates

The upsert_bulk functions failed when the input list contained duplicate rows with the same unique key:

ON CONFLICT DO UPDATE command cannot affect row a second time
Hint: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.

This happened for two reasons:

OST albums: Goblin has two editions of "Part 1", which is same album name but different recordings. The scraper returned both and the upsert tried to update the same row twice in one batch.
Streaming availability: Amazon Prime Video appeared as both a None monetization type entry and a regular entry for the same title/region. Two rows with the same effective unique key.

The fix was deduplication before upsert in both cases:

def upsert_ost_albums_bulk(show_id: str, albums: list[dict]) -> list[dict]:
    seen = set()
    unique_albums = []
    for a in albums:
        key = a.get("album_name", "")
        if key not in seen:
            seen.add(key)
            unique_albums.append(a)
    rows = [{"show_id": show_id, **_clean(a)} for a in unique_albums]
    result = _supabase.table("ost_albums").upsert(
        rows, on_conflict="show_id,album_name"
    ).execute()
    return result.data or []

Error 5: Missing Schema Columns

The MDL scraper stored content_rating and mdl_votes, fields that existed in the scraper output but weren't in the initial database schema. Supabase's PostgREST returns a clear error:

PGRST204: Could not find the 'content_rating' column of 'tv_shows' in the schema cache

The fix was two ALTER TABLE statements in Supabase SQL Editor:

alter table tv_shows add column if not exists content_rating text;
alter table tv_shows add column if not exists mdl_votes integer;

This is a schema drift problem. The scraper evolved during development but the SQL schema wasn't updated to match. The lesson is to keep schema.sql as the single source of truth and run it against a fresh database before deploying.

Error 6: requirements.txt Bloat

pip freeze in an Anaconda environment captures hundreds of unrelated packages — machine learning libraries, game engines, PDF parsers. The resulting requirements.txt had dependency conflicts:

ERROR: Cannot install -r requirements.txt (line 28) and pdfminer.six==20200517
because these package versions have conflicting dependencies.

The fix was a minimal hand-written requirements file containing only what the project actually uses:

playwright
beautifulsoup4
httpx
requests
python-dotenv
supabase
prefect

GitHub Actions installs from scratch on every run, keeping dependencies minimal reduces install time from ~90 seconds to ~22 seconds per job.

Error 7: Rotten Tomatoes Package Breakage

The rottentomatoes-python package parses RT's search results page to find movie scores. RT changed their HTML schema, breaking the parser:

meter = tomato_snip.split('"')[1]
# IndexError: list index out of range

RT actively fights scraping and changes their HTML regularly. I excluded RT from the pipeline rather than maintain a brittle scraper. The justwatch.py scraper already captures IMDB scores via JustWatch's sidebar — a partial substitute for Western critic consensus.

The Architecture That Emerged

After all the debugging, the pipeline architecture is:

GitHub Actions (scheduling)
    ↓
Prefect flows (retry logic, task caching, logging)
    ↓
data_sources/* (Playwright + httpx + REST APIs)
    ↓
pipeline/utils.py (data normalization)
    ↓
db/queries.py (upsert operations)
    ↓
Supabase (PostgreSQL + PostgREST)

Each layer has a clear responsibility. The data sources are pure functions that return normalized Python dicts. The pipeline jobs transform those dicts to match the DB schema. The queries module is the only file that touches Supabase. This separation made debugging much faster. When something breaks, it's usually immediately obvious which layer is responsible.

GitHub Actions Minutes: A Practical Guide

For anyone building a similar pipeline, here's how to think about GitHub Actions free tier minutes:

What's fast (< 5 min per run):

REST API sources (TMDB, KOBIS, Wikipedia) — no browser overhead
Small record counts with efficient batch upserts

What's slow (10-60+ min per run):

Playwright scrapers (browser startup + page load per request)
Multi-region lookups (JustWatch × 4 regions = 4× the time)
Large initial populations (10,000 titles × N seconds each)

Optimization strategies:

Only sync what changes frequently. Airing shows need daily updates, historical movies don't
Use Prefect's task caching for expensive discovery operations (fetching all TMDB IDs)
Keep JustWatch to weekly syncs of 100-200 titles rather than full catalog runs
Separate initial population (run once, manually) from ongoing sync (automated)

The nightly sync in steady state runs in under 45 minutes. The weekly sync runs in under 90 minutes. Combined, that's ~1,400 minutes/month, which is safely within the 2,000 minute free limit even accounting for occasional failures and re-runs.

What I'd Do Differently

Design the schema before writing the scrapers. Schema drift (scrapers adding fields that don't exist in the database) caused several production failures. The right approach is schema-first: define all columns upfront, then write scrapers that output exactly those fields.

Test upserts with duplicates explicitly. The ON CONFLICT DO UPDATE duplicate error was predictable in retrospect. Any bulk upsert should be tested with intentionally duplicated input rows.

Start with a hand-written requirements.txt. pip freeze in a development environment always captures too much. Start minimal and add packages as needed.

Accept that initial population and ongoing sync are different problems. Initial population of 10,000+ records from browser-scraped sources takes days, not hours. Design for incremental population from the start rather than trying to do it all in one run.

What's Next

The database feeds into a FastMCP server that exposes all of this as tools for AI agents — structured queries like get_drama_details, find_where_to_watch, get_actor_filmography, and get_weekly_boxoffice that leverage the cross-source enrichment I've built.

The combination of Korean-specific data (Nielsen ratings, Naver scores, KOBIS box office, award history, OST albums) with international enrichment (TMDB, MDL community tags, JustWatch streaming, Wikipedia context) creates a database that genuinely doesn't exist anywhere else in a queryable form.

That's what makes the scraping work worth it.

Building a Unified Korean Entertainment Database from 10 APIs and Web Scrapers

Cara Jung — Sat, 09 May 2026 20:56:22 +0000

Korean entertainment has become a global phenomenon with shows such as Squid Game breaking records and K-dramas topping global charts. And yet, the data infrastructure behind it is fragmented.

Getting complete data on a single Korean show or film — cast, ratings (Korean and international), episode viewership numbers, where to stream it, what awards it won, its OST albums — requires hopscotching different websites.

The issue is that dominant platforms like NAVER and Melon lack English-first APIs. As Session Zero points out in this article, Korean data is heavily underserved in MCP ecosystems because when Western developer tools and AI systems are built, Korean platforms are often invisible by default.

The data exists. But it’s trapped behind language barriers, undocumented endpoints, JavaScript-rendered pages, and closed ecosystems. So while AI agents can easily retrieve structured information about Hollywood movies, Spotify charts, or IMDb ratings, asking the same systems about Korean dramas, OSTs, or Korean audience sentiment often returns incomplete results or nothing at all.

So I decided to build a unified database to fix it.

The Data Landscape

Korean entertainment data splits along two axes: language (Korean vs. English sources) and type (official vs. community vs. streaming).

English-language sources

TMDB is the closest thing to a comprehensive English-language database for Korean content. It has structured data on tens of thousands of Korean films and shows, a stable API, and community ratings. But it lacks Korea-specific data: no verified Korean audience scores, no Nielsen viewership ratings, no Korean box office data, no OST information.

MyDramaList fills a critical gap that TMDB misses entirely: community tags. MDL users have tagged Korean dramas with labels such as "Bromance", "Time Travel", "CEO Male Lead", and "Found Family." No official database captures that taxonomy. MDL also tracks airing status more accurately than TMDB for Korean dramas.

HanCinema has the deepest historical coverage of Korean content in English, including films from the 1950s through 1990s that TMDB barely covers.

JustWatch is the most reliable real-time source for streaming availability. TMDB's streaming data lags reality by weeks. JustWatch checks 364 services daily.

Wikipedia has rich content for major Korean films and shows including detailed plot summaries, production history, cultural reception sections that no English entertainment database captures.

Korean-language sources

Here's where things get interesting and painful.

NAVER is Korea's dominant search engine and entertainment portal. Search for any Korean film on NAVER and you'll get a rich information card with two ratings that don't exist anywhere else:

실관람객 평점 (Verified ticket buyer rating): Only people who purchased cinema tickets through affiliated platforms can rate. This is Korea's equivalent of a verified purchase review.
네티즌 평점 (Netizen rating): Korean general public rating.

These ratings often diverge significantly from international scores. Parasite has a 9.08 verified buyer rating on NAVER versus 8.5 on TMDB. The Korean audience who saw it in theaters rated it exceptionally highly.

NAVER also has per-episode Nielsen Korea viewership ratings for TV dramas, which is the official broadcast ratings that Korean media reports on weekly. No other English-language source has this data structured and queryable.

The critical catch: NAVER has no public API for any of this. Their entertainment data is rendered dynamically in JavaScript, served through their search interface, and entirely undocumented. Every data point requires a browser.

KOBIS (Korean Film Council) is the exception. It has an official government API that provides authoritative weekly and daily box office rankings. It's the only Korean government data source with a proper REST API.

Building the Scrapers

The Playwright Problem

Most of the Korean data sources render content through JavaScript. Static HTML requests return empty shells. This meant nearly every scraper needed a real browser.

To address this, I used Playwright with Chromium headless across all JS-rendered sources. The setup is consistent:

from playwright.sync_api import sync_playwright

def _get_page_html(url: str, wait_selector: str = "body") -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
            locale="ko-KR",
        )
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded")
        page.wait_for_selector(wait_selector)
        time.sleep(2)  # let lazy content settle
        html = page.content()
        browser.close()
    return html

The locale="ko-KR" matters for NAVER since it ensures Korean content is served rather than any region-specific variant.

The NAVER Genre Problem

One of the more unexpected parsing challenges came from NAVER's movie information card. Genre, country, and runtime appeared concatenated: 공포대한민국95분 (Horror South Korea 95min). They were in a single dd tag separated by invisible span.cm_bar_info elements.

The fix was to replace the separator spans with pipe characters before splitting:

first_dd = info_groups[0].select_one("dd")
if first_dd:
    for span in first_dd.select("span"):
        span.replace_with("|")
    segments = [s.strip() for s in first_dd.get_text().split("|") if s.strip()]
    result["genre"] = segments[0] if segments else None
    result["country"] = segments[1] if len(segments) > 1 else None

Extracting Nielsen Ratings from SVG

The trickiest scraping problem was NAVER's episode viewership chart. The data is rendered as an interactive SVG chart where viewership percentages, episode numbers, and air dates are all inside SVG text elements.

def _parse_episode_chart(soup: BeautifulSoup) -> list[dict]:
    # Rating values from bb-text elements inside the SVG
    rating_texts = soup.select("g.bb-texts-rank text.bb-text")
    ratings = []
    for t in rating_texts:
        val = t.get_text(strip=True)
        try:
            f = float(val)
            if f > 0:
                ratings.append(f)
        except ValueError:
            pass

    # Episode numbers and dates from x-axis ticks
    x_ticks = soup.select("g.bb-axis-x g.tick")
    ep_labels = []
    for tick in x_ticks:
        tspans = tick.select("tspan")
        if len(tspans) >= 2:
            ep_num = _parse_episode_num(tspans[0].get_text(strip=True))
            date_text = tspans[1].get_text(strip=True)
            if ep_num and date_text:
                ep_labels.append({"episode": ep_num, "date": date_text})

    return [
        {"episode": ep["episode"], "air_date": ep["date"], "rating": ratings[i]}
        for i, ep in enumerate(ep_labels)
        if i < len(ratings)
    ]

This gives us per-episode Nielsen ratings:

Ep 1 (12.02.): 6.3%
Ep 8 (12.24.): 12.3%
Ep 16 (01.21.): 20.5%

For Goblin. No English-language API has this data.

The JustWatch Shadow DOM Problem

JustWatch uses Web Components with Shadow DOM for their streaming offer cards. The score and provider data that appears in the browser is inside <slot> elements that don't render in server-side HTML:

<div class="score-wrap">
  <div class="critics-score-wrap">
    <slot name="critics-score"></slot>  <!-- empty in scraped HTML -->
  </div>
</div>

However, the streaming offers themselves (provider names, prices, monetization types) render in the regular DOM inside div.buybox-selector a.offer elements. The key insight was that the offers were accessible even though the score slots weren't.

Extracting the actual streaming URLs from JustWatch's redirect links required parsing the r= parameter:

def _extract_redirect_url(href: str) -> str:
    parsed = urlparse(href)
    params = parse_qs(parsed.query)
    r = params.get("r", [None])[0]
    return unquote(r) if r else href

Awards Parsing: Five Ceremonies, Three Formats

Korean drama and film awards span five major ceremonies, each with slightly different HTML structure. I scraped all of them from AsianWiki plus the official Baeksang Awards site.

The unexpected challenge was that award categories use different winner formats depending on ceremony type:

Drama awards: Person ("Show Title") → links[0] is person, links[1] is show
Film awards: "Film Title" → title-only, no person/show split
Blue Dragon Series: "Title" (Platform) → title plus streaming platform

The format detection:

value_text = item_text.replace(bold_text, "").strip().lstrip(":").strip()
if value_text.startswith('"'):
    # Film/series format: title only
    current_category["winner"] = links[0].get_text(strip=True)
    current_category["winner_show"] = None
else:
    # Drama format: person + show
    current_category["winner"] = links[0].get_text(strip=True)
    if len(links) > 1:
        current_category["winner_show"] = links[1].get_text(strip=True)

And the search function deduplicates winners who also appear in the nominees list. AsianWiki includes the winner in the nominees list, so a naive search returns double entries:

# Skip if this is the same person/title as winner (dedup)
if nom_name == winner_name and winner_matches:
    continue

The Wikipedia Section Problem

Wikipedia articles don't have standardized section names. "Plot" might be called "Synopsis", "Series overview", "Story", or "Episodes" depending on who wrote the article and when.

We built a section alias system:

SECTION_ALIASES = {
    "Plot": ["Plot", "Synopsis", "Story", "Series overview", "Premise", "Episodes"],
    "Cast": ["Cast", "Cast and characters", "Characters", "Main cast"],
    "Ratings": ["Ratings", "Viewership ratings", "Television ratings", "Viewership"],
    "Reception": ["Reception", "Critical response", "Critical reception"],
}

The smart lookup tries each alias in order until it finds content. Crash Landing on You uses "Episodes" for its episode list and "Viewership" for its ratings section — both non-standard names that the alias system handles automatically.

Cross-Source ID Management

One of the harder database design decisions was how to link data across sources. A single show like Crash Landing on You has:

TMDB ID: 94796
MDL ID: 70 (from slug 70-crash-landing-on-you)
Naver show OS ID: 3522952
JustWatch slug: crash-landing-on-you
Wikipedia title: Crash Landing on You

The database stores all of these as columns on the tv_shows table. No single ID links all sources — the TMDB ID is the primary key because TMDB has the broadest coverage and most stable IDs.

create table tv_shows (
  id           uuid primary key default uuid_generate_v4(),
  tmdb_id      text unique,
  mdl_id       text unique,
  mdl_slug     text,
  naver_show_id text,
  justwatch_slug text,
  wikipedia_title text,
  -- ... ratings, metadata, etc.
);

The pipeline links sources progressively: TMDB runs first to seed core titles, then MDL enriches with ratings and tags, then NAVER TV adds episode ratings using the Korean title stored by TMDB.

Rating Field Naming Convention

With eight different rating sources covering different audiences and methodologies, naming discipline was essential:

tmdb_rating              # Global community (0-10)
mdl_rating               # International K-drama fans (0-10)
naver_audience_rating    # Korean verified ticket buyers (0-10)
naver_netizen_rating     # Korean general public (0-10)
naver_latest_rating      # Nielsen Korea latest episode (%)
naver_highest_rating     # Nielsen Korea peak episode (%)
rt_tomatometer           # Western professional critics (0-100)
rt_audience_score        # Western RT users (0-100)

These are never stored as a generic "rating" field. The naming makes the source and audience type explicit at the schema level, preventing any ambiguity in downstream queries or API responses.

What This Unlocks

The unified database enables queries that weren't previously possible:

Find dramas where Korean audiences loved it but Western critics were lukewarm:

SELECT title_english, naver_audience_rating, rt_tomatometer
FROM tv_shows
WHERE naver_audience_rating > 8.5
AND rt_tomatometer < 60;

Find all content with OST albums by IU:

SELECT ts.title_english, oa.album_name, oa.vibe_url
FROM ost_albums oa
JOIN tv_shows ts ON oa.show_id = ts.id
WHERE oa.artist ILIKE '%IU%' OR oa.artist ILIKE '%아이유%';

Find award winners available on Netflix US:

SELECT DISTINCT ts.title_english, a.category, a.year
FROM awards a
JOIN tv_shows ts ON a.show_id = ts.id
JOIN streaming_availability sa ON sa.show_id = ts.id
WHERE sa.provider = 'Netflix'
AND sa.region = 'us'
AND sa.monetization_type = 'Subscription'
AND a.won = true;

Find the drama with the biggest episode-to-episode rating jump:

SELECT show_id, episode_number, nielsen_rating,
       nielsen_rating - LAG(nielsen_rating) OVER (PARTITION BY show_id ORDER BY episode_number) AS jump
FROM episodes
ORDER BY jump DESC NULLS LAST
LIMIT 10;

None of these queries are possible against any single existing source.

What's Next

The goal of this project is to build an MCP server for Korean entertainment data so that makes Korean movies and TV shows are accessible to AI agents and developer tooling in a structured, searchable way.

Instead of forcing developers to manually scrape different sites just to answer a basic query, the MCP server will expose unified tools that support natural language requests like “Find me a political thriller Korean audiences loved that’s available on Netflix and maintained strong episode ratings throughout its run.”

Under the hood, that means reconciling fragmented metadata across 10,000+ titles from APIs, scrapers, streaming providers, audience ratings, box office systems, and community-driven sources.

Part 2 will dive into the pipeline architecture itself: the automated sync system, GitHub Actions orchestration, incremental updates, scraper failures, rate limits, deduplication headaches, and the very questionable debugging decisions made at 2AM.

Pick Your Auth: An Interactive Guide

Cara Jung — Mon, 13 Apr 2026 16:00:00 +0000

Most auth tutorials focus on how authentication works such as how to drop in a component, spin up a dev server, and get a login screen running. There's no shortage of guides that tell you which method to use for your use case. What's missing is the hands-on part: actually experiencing each flow the way your users do, so you can feel the friction, see the session it produces, and make an informed decision from the ground up.

Magic link or passkey? Social login or OTP? The answer changes depending on whether you're building a consumer app, a fintech product, a B2B SaaS, or an internal tool. The choice is a product decision that affects activation, security posture, compliance, and long-term maintainability.

To tackle this dilemma, I built an interactive demo called Auth Decision Kit that lets you try three Descope auth flows live: magic link, social login, and passkey. This demo focuses on how each approach fits different product contexts and the tradeoffs you need to consider.

Demo: auth-decision-kit.vercel.app
GitHub: github.com/carasjung/auth-decision-kit

Each method has five tabs:

01 Auth Flow
Authenticate for real using a live Descope integration. See the actual UX users experience.

02 Session Inspector
After authenticating, inspect every claim in your JWT payload. Each field is annotated with what it means, why it matters, and when you'd use it.

03 Decision Matrix
Green / yellow / red ratings across six product contexts: B2B, consumer app, developer tool, internal tool, fintech, SaaS, and mobile-first.

04 Failure Simulator
Trigger each failure mode and see the Descope error code and the correct handling code.

05 Code
Copy-ready implementation snippets for Next.js

The Session Inspector: JWT Breakdown

One of the most useful things I learned building this is how different the JWT payload looks depending on which auth method you used and why they matter for your backend logic.

After a magic link auth, your session contains authenticationMethod: "magiclink" and verifiedEmail: true. The email verification is implicit, clicking the link is proof of inbox access. This is a meaningful signal for risk scoring and it also shows that magic link is a single factor (access to your inbox). For products that require two factors like healthcare and fintech, magic link on its own won’t satisfy.

After social login, you get the provider's access token nested under oauth.google.accessToken (or whichever provider). You also get externalIds.google, a stable provider-specific user ID that won't change even if the user changes their email address on Google's side. That's the field you want for account linking. Since you get free profile data, this is effective for consumer and developer tools.

After a passkey auth, the amr (Authentication Methods References) claim contains "hwk" (hardware key) and "user". This is the claim compliance teams care about. It's proof that a hardware-bound credential was used, not just a password or a link. Passkey is also the only method here where the private key never leaves the user’s device. Even a full Descope breach couldn’t expose user credentials.

The Decision Matrix

Here's a condensed version of what I found after thinking through six product contexts for each method:

Magic link is the sweet spot for B2B SaaS and early-stage products. Zero password management, implicit email verification, and simple implementation. However, it falls apart on mobile (context switch to email app kills conversion) and in high-security contexts where email as a sole factor isn't enough.

Social login is the fastest path to activation for consumer and developer tools. GitHub login in particular gives you free org and repo data via the OAuth token, which is useful for developer-focused products. Avoid it for fintech and banking where regulations often require you to own the identity directly.

Passkey is genuinely the best option for mobile-first and high-security context. Phishing-proof by design, the private key never leaves the device. The catch: users still need education on what a passkey is and you need a fallback for older browsers and lost devices.

Most products should offer at least two methods where one can be the default while the other an alternative. For instance, using magic link as the default and passkey as the upgrade path once users are comfortable.

The Failure Simulator

Auth flows break in predictable ways. Understanding those failure points from day one lets you design a seamless recovery experience so users can continue without friction and avoid escalating to support.

The failure simulator surfaces these scenarios using real Descope error codes and responses. While it doesn’t make live network calls, it replays actual API error outputs so you can explore failure cases without having to intentionally break a real session.

Magic links expire (Descope's default is 2 minutes). When they do, the onError callback fires with error code E011303. Your UI should catch this and offer to send a new link, not show a generic error message.

Social login gets cancelled. Users click "Continue with Google," see the permissions screen, and hit Cancel. That fires E062503. The right response is to return the user silently to the login screen and treat a cancellation as a choice, not an error.

Passkeys on new devices fire E083002 (WebAuthn NotAllowedError). The recovery flow is: fall back to magic link or OTP to verify identity, then offer to enroll a passkey on the new device. This is also why you should never make passkey the only auth method since you always need a fallback for device loss.

Stack and Setup

Next.js 15 with App Router
Descope Next.js SDK (@descope/nextjs-sdk) for auth flows and session management
Framer Motion for tab transitions
Tailwind CSS for layout

The entire setup is about 800 lines of TypeScript across nine files. All core data (steps, session highlights, decision matrix scores, failure scenarios, and code snippets) lives in a single lib/auth.ts file. Adding a new auth method requires only a single entry point, keeping the system easy to extend.

To run it yourself:

git clone https://github.com/carasjung/auth-decision-kit
cd auth-decision-kit
npm install
cp .env.local.example .env.local
# add your NEXT_PUBLIC_DESCOPE_PROJECT_ID
npm run dev

You'll need a free Descope account. Once you’ve created your account, grab your Project ID from app.descope.com/settings/project and configure a sign-up-or-in flow with whichever methods you want to test.

From Demo to Decision

There are plenty of great auth demos that show how things work. This one focuses on how to choose between them.

Auth is infrastructure and like many infrastructure decisions, the cost of getting it wrong rarely shows up immediately. It appears later through conversion drop-offs, security tradeoffs, compliance constraints, and migrations.

While modern tools make it easier to support multiple methods and evolve your approach over time, the decision of what to use and when still requires good judgement upfront. This project is designed to help make that choice more intentional.

The live demo is at auth-decision-kit.vercel.app and the full source is on GitHub.

What Predicts a Hit? I Trained 3 ML Models to Find Out

Cara Jung — Mon, 06 Apr 2026 07:00:00 +0000

In many entertainment adaptation decisions, content selections are still instinct-driven. Maybe a producer was vibing with a story or overheard their Gen Alpha nephew mentioning a GOAT title. This subjective approach has often led to expensive missteps and wasted resources for studios when the feature or show turns into a flop.

As someone who has worked in the breeding ground of popular webcomics, I asked: what if there was a system that could measure “success potential” of IPs based on real user behavior? Using ML, I wanted to see if I could build a forecasting model that could rank unadapted titles by their predicted commercial success.

The Data

For my endeavor, I worked with three datasets:

Source material metadata of roughly 1,500 titles that included engagement metrics such as views, likes, subscribers, genre, release schedule, and creator usernames
Produced show metadata of 1,977 titles including ratings, watcher counts, genre, episode count, and cast
Historical webcomic adaptation records of 424 cross-referenced titles that went from source material to screen, with data pulled from both sides

Before any modeling, I ran exploratory data analysis on all three and found a few things:

Engagement metrics (likes, views, subscribers) were strongly correlated with each other and overall popularity
Genre and tags correlated with watcher counts in the produced show data
Creator frequency showed no statistically significant impact on adaptation success, which directly contradicted what studios commonly assume

Engineering the Target Variable

One hurdle I ran into was that I couldn't directly measure adaptation “success" from the source material side alone. So I engineered a composite Popularity Score by normalizing and combining views, likes, and subscribers into a single metric representing audience appeal, which became the target variable for prediction.

For the produced show data, I created a parallel score using rating and watcher count.

Since correlation analysis confirmed that source popularity and show popularity moved together in historical adaptations, I used source popularity as a proxy target.

Simple vs Complex Models

I implemented three models: Random Forest, XGBoost, and Ridge Regression. If you worked with ML models, there’s an expectation that the more complex models will win. However, this wasn’t the case. Ridge Regression became the unexpected underdog model that won:

I cross-validated all three models to reduce overfitting and validate stability.

Likes = Success

Using standardized coefficients for feature importance in the Ridge model, the ranking was as follows:

Likes (strongest predictor by a significant margin)
Views
Subscribers

The factors that studios often focus on such as creator reputation, genre, rating, and engagement rate showed weak or no statistical significance.

I validated this further using Mann-Whitney U tests comparing adapted titles against the general pool. Adapted titles showed significantly higher “likes” than non-adapted ones and the difference was meaningful.

Feature Importance for Ridge regression(standardized coefficients)

So why “likes”?

One interpretation is that likes are intentional. A view can be passive while a subscription can be habitual. But giving a “like” is an act of emotional investment and this behavior is exactly what translates from IP to screen.

The Output

The final model produced a ranked list of the top 10 unadapted webcomic titles by predicted success, along with contextual signals for each including genre appeal, subscriber trends, engagement consistency, and creator track record.

Qualitative review of the top 10 confirmed alignment with the engagement patterns seen in historically successful adaptations. Cliff's Delta calculations showed that the predicted top titles had significantly higher likes than past adaptations.

Limitations on the Model

Part of doing good data work is being honest about the limitations. There were a few things that fell short:

Small adaptation dataset. 424 entries is workable, but more data would reduce overfitting risk and better generalization.
Proxy target variable. Using source popularity instead of actual show performance is a justified simplification, but it means the model can't fully capture real-world production quality, casting, or distribution reach.
Categorical features dropped. Creator and genre have too many levels and their coefficients dominated the model without adding significance. Excluding them improved interpretability but at the cost of losing some nuance.

What I'd Do Next

If I extended this project, I'd rethink how signal is captured and focus on the following:

Use NLP for deeper context
- Synopsis embeddings or sentiment analysis on reader reviews could capture thematic richness that raw engagement metrics miss.
Take a hybrid ranking approach
- Combining regression with a learning-to-rank algorithm could improve recommendation quality at the top of the list, where small differences actually matter.
Longitudinal validation
- The real test is tracking what happens when predicted titles actually get produced. Building a feedback loop into the model would sharpen it over time.

Final Thoughts

The core insight here doesn’t only strictly apply to entertainment. It can apply to decisions that are being made by intuition or legacy practice. As the models showed, behavioral signals from real users outperform assumptions about what will succeed.

Likes beat creator prestige. Engagement beat genre conventions. The audience’s preferences, not the ones from industry decision makers, predicted outcomes more reliably.

Whether you're choosing which content to produce, which features to build, or which markets to enter, the same principle applies. The answers are within the data, but we often overlook the right signals.

DEV Community: Cara Jung

From Scrapers to MCP Server: Serving Korean Entertainment Data to AI Agents

Designing the Tools

Building the Server with FastMCP

Choosing Authentication: Descope

The Deployment Errors Worth Knowing About

Corrupted file from incremental edits

Missing https:// prefix

How Descope's token validation actually works

Deploying to Railway

Testing End-to-End

What the MCP Server Unlocks

Getting It In Front of Users

Smithery

Glama

mcp.so

Cline Marketplace

MCP-Hive

What's Next

Inside the Pipeline Powering a Korean Entertainment MCP Server

The Database: Why Supabase

Schema Design Decisions

The NULL Constraint Issue

The Pipeline: Prefect + GitHub Actions

Why Not Prefect Cloud?

Why GitHub Actions

The Errors

Error 1: Prefect Type Validation is Strict

Error 2: The TMDB Response Shape Mismatch

Error 3: The Initial Population Timeout

Error 4: ON CONFLICT DO UPDATE with Duplicates

Error 5: Missing Schema Columns

Error 6: requirements.txt Bloat

Error 7: Rotten Tomatoes Package Breakage

The Architecture That Emerged

GitHub Actions Minutes: A Practical Guide

What I'd Do Differently

What's Next

Building a Unified Korean Entertainment Database from 10 APIs and Web Scrapers

The Data Landscape

English-language sources

Korean-language sources

Building the Scrapers

The Playwright Problem

The NAVER Genre Problem

Extracting Nielsen Ratings from SVG

The JustWatch Shadow DOM Problem

Awards Parsing: Five Ceremonies, Three Formats

The Wikipedia Section Problem

Cross-Source ID Management

Rating Field Naming Convention

What This Unlocks

What's Next

Pick Your Auth: An Interactive Guide

The Session Inspector: JWT Breakdown

The Decision Matrix

The Failure Simulator

Stack and Setup

From Demo to Decision

What Predicts a Hit? I Trained 3 ML Models to Find Out

Missing `https://` prefix