Cara Jung

Inside the Pipeline Powering a Korean Entertainment MCP Server

Korean entertainment has become global, but the infrastructure behind its data is still surprisingly broken. Information about a single show is often scattered across multiple platforms: one site for Korean ratings, another for streaming availability, another for cast data, another for box office numbers, and several with no public APIs at all.

This is Part 2 of a series. In Part 1, I covered how I unified data across 10 APIs and web scrapers into a single database designed to power an MCP server.

But collecting the data was only half the problem. Once you have 10 independent sources feeding into the same system, the real engineering work begins: how do you run all of them reliably, on a schedule, writing to a shared database, without the entire pipeline breaking every time one site changes its HTML structure?

This is the story of building the pipeline that keeps the Korean entertainment database alive.


The Database: Why Supabase

The database needed to be:

  • Queryable via REST (the MCP server reads directly from it — no separate API layer)
  • PostgreSQL-native (arrays, jsonb, GIN indexes for genre/tag filtering)
  • Free to start (we're pre-revenue)
  • Always-on (the nightly pipeline writes to it every day)

Supabase checked all four boxes. Its free tier includes 500MB of storage, unlimited API requests via PostgREST, and Row Level Security for access control. The built-in REST API means I can query the database directly from the MCP server without building a middleware layer.
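
To make that concrete, reading from the MCP server side is just the Supabase Python client talking to PostgREST. A minimal sketch, assuming a movies table and the env var names used later in this post (SUPABASE_URL is a conventional placeholder):

import os

from supabase import create_client

# Service-role key stays server-side; the MCP server is the only consumer.
supabase = create_client(
    os.environ["SUPABASE_URL"],
    os.environ["SUPABASE_SERVICE_ROLE_KEY"],
)

# Single-title lookup straight over PostgREST, no middleware layer in between.
result = (
    supabase.table("movies")
    .select("title, naver_audience_rating, tmdb_rating")
    .ilike("title", "%Parasite%")
    .limit(1)
    .execute()
)
print(result.data)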

The one gotcha: Supabase free tier pauses the database after 1 week of inactivity. The nightly pipeline prevents this — activity every day keeps it awake. If the pipeline goes down for more than a week, it needs to be manually unpaused from the dashboard.

Schema Design Decisions

Separate tables for movies and TV shows. The data shapes are different: movies have runtime, box office revenue, and KOBIS codes, while TV shows have episode counts, airing schedules, MDL-specific tags, and per-episode ratings. One table with a type column would mean half the columns are always NULL.

Both a cast table and JSON summary. The normalized movie_cast and show_cast tables enable queries like "all films with Song Kang-ho". The JSON summary in the main record enables fast single-title lookups without a join.

Separate streaming_availability table. Storing streaming data as a jsonb column would make queries like "all Korean films on Netflix US" require a full table scan. A proper table with a row per title/region/provider enables indexed lookups.
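
Here is the kind of lookup that shape enables, as a sketch. The column names match the unique constraint discussed below; the "US"/"Netflix" values are placeholders for however the provider strings are stored:

import os

from supabase import create_client

supabase = create_client(
    os.environ["SUPABASE_URL"],
    os.environ["SUPABASE_SERVICE_ROLE_KEY"],
)

# "All Korean films on Netflix US": two indexed equality filters,
# no jsonb unpacking and no full table scan.
avail = (
    supabase.table("streaming_availability")
    .select("movie_id, monetization_type")
    .eq("region", "US")
    .eq("provider", "Netflix")
    .execute()
)
movie_ids = [row["movie_id"] for row in avail.data if row["movie_id"]]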

Explicit rating field naming. Every rating field is named by source and audience type — naver_audience_rating, tmdb_rating, rt_tomatometer — never generic "rating". This is a naming convention enforced at the schema level that prevents ambiguity at the application layer.

The NULL Constraint Issue

PostgreSQL's unique constraints treat NULLs as distinct values (NULL is never equal to NULL), which caused a subtle bug in the streaming availability table. The unique constraint on (movie_id, show_id, region, provider, monetization_type) failed to prevent duplicates whenever show_id was NULL: Postgres considered the NULL values distinct and allowed duplicate rows.

The fix was a unique expression index using COALESCE:

create unique index streaming_availability_unique
  on streaming_availability (
    coalesce(movie_id::text, ''),
    coalesce(show_id::text, ''),
    region,
    provider,
    coalesce(monetization_type, '')
  );

This converts NULLs to empty strings before comparison, making the uniqueness check work correctly.


The Pipeline: Prefect + GitHub Actions

Why Not Prefect Cloud?

Prefect is genuinely excellent orchestration software. Its flow/task model gives you retry logic, caching, dependency management, and a beautiful UI, all from simple Python decorators:

from prefect import flow, task

@task(retries=3, retry_delay_seconds=10)
def sync_movie(tmdb_id: int) -> dict | None:
    raw = run_with_retry(get_movie_details, tmdb_id)
    if not raw:
        return None
    movie_row = upsert_movie(_transform_movie(raw))
    return movie_row

@flow(name="sync_tmdb", log_prints=True)
def sync_tmdb_flow(movie_limit: int | None = None):
    movie_ids = fetch_korean_movie_ids()
    for tmdb_id in movie_ids[:movie_limit]:
        sync_movie(tmdb_id)

The problem was Prefect Cloud's free tier. When I tried to create a work pool:

prefect work-pool create kr-mcp-pool --type process
# Your plan does not support hybrid or push work pools.

Hybrid work pools, which let you run jobs on your own infrastructure, require a paid plan starting at $20/month. Without a work pool, there's no way to schedule flows to run automatically.

I kept the Prefect flow/task structure since it gives retry logic, task-level logging, and caching for free even without Prefect Cloud. But for the scheduling layer, I went with GitHub Actions.

Why GitHub Actions

GitHub Actions gives 2,000 free minutes per month on private repositories. Each workflow job runs on a fresh Ubuntu runner with Python, pip, and dependencies installed from scratch.

The scheduling syntax is standard cron:

on:
  schedule:
    - cron: "0 2 * * *"    # 2am UTC daily
  workflow_dispatch:         # also triggerable manually

Secrets are injected as environment variables:

env:
  TMDB_API_KEY: ${{ secrets.TMDB_API_KEY }}
  SUPABASE_SERVICE_ROLE_KEY: ${{ secrets.SUPABASE_SERVICE_ROLE_KEY }}

The workflows call simple Python scripts that invoke Prefect flows:

# scripts/run_sync_tmdb.py
from pipeline.jobs.sync_tmdb import sync_tmdb_flow
sync_tmdb_flow(movie_limit=200, show_limit=200)

Estimated monthly usage:

  • Nightly sync (TMDB + MDL + Naver TV airing): ~37 min × 30 days = ~1,110 min
  • Weekly sync (KOBIS + JustWatch + Wikipedia + Awards): ~75 min × 4 = ~300 min
  • Total: ~1,410 min/month — safely under the 2,000 minute limit

The tradeoff is runner startup time (~30 seconds per job for pip install and Playwright browser install) and the 6-hour job timeout, which became a significant constraint during the initial population phase.


The Errors

Error 1: Prefect Type Validation is Strict

Prefect 3.x validates flow parameter types at runtime, and a parameter annotated as int with a default of None fails validation:

ParameterTypeError: Flow run received invalid parameters:
 - movie_limit: Input should be a valid integer

The fix is explicit Optional typing:

# Before (fails)
def sync_tmdb_flow(movie_limit: int = None):

# After (works)
def sync_tmdb_flow(movie_limit: int | None = None):

This affected every pipeline job. I had to audit all int, str, float, and list optional parameters across 7 files. Either PEP 604's int | None syntax or Optional[int] from typing works; a bare int annotation with a None default does not.

Error 2: The TMDB Response Shape Mismatch

The data_sources/tmdb.py module normalizes TMDB's raw API response before returning it. The pipeline assumed it was getting raw TMDB format:

# Pipeline assumed raw TMDB format:
genres = normalize_genres([g["name"] for g in raw.get("genres", [])])
# Actual normalized format from tmdb.py:
# genres is already ["Drama", "Romance"] — a list of strings, not dicts

This caused TypeError: string indices must be integers when trying to do g["name"] on a string.

The fix was simple once diagnosed:

genres = normalize_genres(raw.get("genres", []))

But finding it required running the job, reading the stack trace, debugging the actual response shape with a quick Python one-liner, and tracing back to the source.
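
The shape check itself was trivial, roughly this (the module path and helper are the ones from Part 1; the TMDB ID is just a placeholder):

from data_sources.tmdb import get_movie_details

# Inspect what the normalized helper actually returns for one title.
print(get_movie_details(496243)["genres"])
# prints a plain list of strings like ['Drama', 'Thriller'], not {"name": ...} dicts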

Error 3: The Initial Population Timeout

GitHub Actions has a maximum job runtime of 6 hours. TMDB has ~10,000 Korean movies and ~10,000 Korean TV shows. Fetching full details at ~0.25 seconds per title works out to roughly 40 minutes per catalog. That's well within the limit.

But each title also has cast: 20 cast members per title × 2 DB writes each × 20,000 titles = 800,000 database operations. The actual runtime exceeded 6 hours and the job was cancelled mid-run.

The job has exceeded the maximum execution time of 6h0m0s

The solution was accepting partial initial population and relying on the nightly sync to fill gaps over time. After the initial run, I had:

  • 9,983 movies (nearly complete)
  • 3,536 shows (~35%, since the timeout hit partway through the show sync)

Subsequent nightly runs continue adding shows. The upsert pattern means no data is lost or duplicated since each run picks up new IDs and updates existing records.

Error 4: ON CONFLICT DO UPDATE with Duplicates

The upsert_bulk functions failed when the input list contained duplicate rows with the same unique key:

ON CONFLICT DO UPDATE command cannot affect row a second time
Hint: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.

This happened for two reasons:

  1. OST albums: Goblin has two editions of "Part 1", the same album name but different recordings. The scraper returned both, and the upsert tried to update the same row twice in one batch.

  2. Streaming availability: Amazon Prime Video appeared for the same title/region both as an entry with a None monetization type and as a regular entry, producing two rows with the same effective unique key.

The fix was deduplication before upsert in both cases:

def upsert_ost_albums_bulk(show_id: str, albums: list[dict]) -> list[dict]:
    # Keep only the first row per album_name so a single upsert batch
    # never touches the same row twice.
    seen = set()
    unique_albums = []
    for a in albums:
        key = a.get("album_name", "")
        if key not in seen:
            seen.add(key)
            unique_albums.append(a)
    rows = [{"show_id": show_id, **_clean(a)} for a in unique_albums]
    result = _supabase.table("ost_albums").upsert(
        rows, on_conflict="show_id,album_name"
    ).execute()
    return result.data or []
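
The streaming availability fix follows the same pattern with a composite key. A generic version of the dedupe step looks like this (a hypothetical helper, not the repo's actual code; the key columns should mirror whatever the upsert's on_conflict lists):

def dedupe_on_conflict_key(rows: list[dict], key_columns: tuple[str, ...]) -> list[dict]:
    # Keep the first row per conflict key so one upsert batch never
    # proposes two updates to the same row.
    seen: set[tuple] = set()
    unique_rows = []
    for row in rows:
        key = tuple(row.get(col) for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows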

Error 5: Missing Schema Columns

The MDL scraper stored content_rating and mdl_votes, fields that existed in the scraper output but weren't in the initial database schema. Supabase's PostgREST returns a clear error:

PGRST204: Could not find the 'content_rating' column of 'tv_shows' in the schema cache

The fix was two ALTER TABLE statements in the Supabase SQL Editor:

alter table tv_shows add column if not exists content_rating text;
alter table tv_shows add column if not exists mdl_votes integer;

This is a schema drift problem. The scraper evolved during development but the SQL schema wasn't updated to match. The lesson is to keep schema.sql as the single source of truth and run it against a fresh database before deploying.
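
One cheap guard I'd add in retrospect: check the scraper's output keys against schema.sql before upserting, so drift fails locally instead of as a PGRST204 in CI. A sketch, assuming a simply formatted schema file at db/schema.sql (the path and helper names are illustrative, not current repo code):

import re
from pathlib import Path

def table_columns(schema_sql: str, table: str) -> set[str]:
    # Pull column names out of the `create table <name> (...)` block.
    # Assumes one column definition per line, as in this project's schema file.
    match = re.search(
        rf"create table(?: if not exists)?\s+{table}\s*\((.*?)\);",
        schema_sql,
        re.S | re.I,
    )
    if not match:
        return set()
    columns = set()
    for line in match.group(1).splitlines():
        line = line.strip().rstrip(",")
        if line and not line.lower().startswith(
            ("primary key", "unique", "constraint", "foreign key", "check")
        ):
            columns.add(line.split()[0])
    return columns

def check_row_against_schema(row: dict, table: str, schema_path: str = "db/schema.sql") -> None:
    # Fail fast before the upsert instead of waiting for PGRST204 at runtime.
    known = table_columns(Path(schema_path).read_text(), table)
    missing = set(row) - known
    if missing:
        raise ValueError(f"{table}: scraper emits columns not in schema.sql: {sorted(missing)}")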

Error 6: requirements.txt Bloat

pip freeze in an Anaconda environment captures hundreds of unrelated packages — machine learning libraries, game engines, PDF parsers. The resulting requirements.txt had dependency conflicts:

ERROR: Cannot install -r requirements.txt (line 28) and pdfminer.six==20200517
because these package versions have conflicting dependencies.

The fix was a minimal hand-written requirements file containing only what the project actually uses:

playwright
beautifulsoup4
httpx
requests
python-dotenv
supabase
prefect

GitHub Actions installs from scratch on every run, so keeping dependencies minimal cuts install time from ~90 seconds to ~22 seconds per job.

Error 7: Rotten Tomatoes Package Breakage

The rottentomatoes-python package parses RT's search results page to find movie scores. RT changed their HTML structure, breaking the parser:

meter = tomato_snip.split('"')[1]
# IndexError: list index out of range

RT actively fights scraping and changes their HTML regularly. I excluded RT from the pipeline rather than maintain a brittle scraper. The justwatch.py scraper already captures IMDb scores via JustWatch's sidebar — a partial substitute for Western critic consensus.


The Architecture That Emerged

After all the debugging, the pipeline architecture is:

GitHub Actions (scheduling)
    ↓
Prefect flows (retry logic, task caching, logging)
    ↓
data_sources/* (Playwright + httpx + REST APIs)
    ↓
pipeline/utils.py (data normalization)
    ↓
db/queries.py (upsert operations)
    ↓
Supabase (PostgreSQL + PostgREST)

Each layer has a clear responsibility. The data sources are pure functions that return normalized Python dicts. The pipeline jobs transform those dicts to match the DB schema. The queries module is the only file that touches Supabase. This separation made debugging much faster. When something breaks, it's usually immediately obvious which layer is responsible.
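
In code terms, the contract between layers is roughly this (signatures simplified to show the shape of the separation rather than copied from the repo):

# data_sources/*: pure fetch-and-normalize functions, no DB access
def get_movie_details(tmdb_id: int) -> dict | None: ...

# pipeline/utils.py: shared normalization helpers
def normalize_genres(genres: list[str]) -> list[str]: ...

# pipeline/jobs/*: transform normalized dicts into rows matching the schema
def _transform_movie(raw: dict) -> dict: ...

# db/queries.py: the only module that talks to Supabase
def upsert_movie(row: dict) -> dict: ...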


GitHub Actions Minutes: A Practical Guide

For anyone building a similar pipeline, here's how to think about GitHub Actions free tier minutes:

What's fast (< 5 min per run):

  • REST API sources (TMDB, KOBIS, Wikipedia) — no browser overhead
  • Small record counts with efficient batch upserts

What's slow (10-60+ min per run):

  • Playwright scrapers (browser startup + page load per request)
  • Multi-region lookups (JustWatch × 4 regions = 4× the time)
  • Large initial populations (10,000 titles × N seconds each)

Optimization strategies:

  1. Only sync what changes frequently. Airing shows need daily updates, historical movies don't
  2. Use Prefect's task caching for expensive discovery operations (fetching all TMDB IDs); a sketch follows this list
  3. Keep JustWatch to weekly syncs of 100-200 titles rather than full catalog runs
  4. Separate initial population (run once, manually) from ongoing sync (automated)
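
The caching hook from point 2 is just decorator arguments. A minimal sketch using Prefect 3's cache-policy API, shown on the discovery step; note that on a throwaway CI runner the cache only helps within a single run unless results are persisted somewhere shared:

from datetime import timedelta

from prefect import task
from prefect.cache_policies import INPUTS

@task(cache_policy=INPUTS, cache_expiration=timedelta(hours=24))
def fetch_korean_movie_ids() -> list[int]:
    # The real implementation pages through TMDB's discover endpoint.
    # With an INPUTS cache policy, repeated calls with the same arguments
    # reuse the stored result instead of re-fetching.
    ...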

The nightly sync in steady state runs in under 45 minutes. The weekly sync runs in under 90 minutes. Combined, that's ~1,400 minutes/month, which is safely within the 2,000 minute free limit even accounting for occasional failures and re-runs.


What I'd Do Differently

Design the schema before writing the scrapers. Schema drift (scrapers adding fields that don't exist in the database) caused several production failures. The right approach is schema-first: define all columns upfront, then write scrapers that output exactly those fields.

Test upserts with duplicates explicitly. The ON CONFLICT DO UPDATE duplicate error was predictable in retrospect. Any bulk upsert should be tested with intentionally duplicated input rows.

Start with a hand-written requirements.txt. pip freeze in a development environment always captures too much. Start minimal and add packages as needed.

Accept that initial population and ongoing sync are different problems. Initial population of 10,000+ records from browser-scraped sources takes days, not hours. Design for incremental population from the start rather than trying to do it all in one run.


What's Next

The database feeds into a FastMCP server that exposes all of this as tools for AI agents — structured queries like get_drama_details, find_where_to_watch, get_actor_filmography, and get_weekly_boxoffice that leverage the cross-source enrichment I've built.
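
As a preview, a tool like find_where_to_watch is a thin wrapper over the same Supabase queries. A rough sketch with FastMCP (simplified, with illustrative column names, not the server's actual code):

import os

from fastmcp import FastMCP
from supabase import create_client

supabase = create_client(
    os.environ["SUPABASE_URL"],
    os.environ["SUPABASE_SERVICE_ROLE_KEY"],
)
mcp = FastMCP("korean-entertainment")

@mcp.tool()
def find_where_to_watch(title: str, region: str = "US") -> list[dict]:
    """Return streaming providers for a Korean title in a given region."""
    # Assumes the movies table's primary key column is named id.
    movie = (
        supabase.table("movies")
        .select("id")
        .ilike("title", f"%{title}%")
        .limit(1)
        .execute()
    )
    if not movie.data:
        return []
    avail = (
        supabase.table("streaming_availability")
        .select("provider, monetization_type")
        .eq("movie_id", movie.data[0]["id"])
        .eq("region", region)
        .execute()
    )
    return avail.data or []

if __name__ == "__main__":
    mcp.run()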

The combination of Korean-specific data (Nielsen ratings, Naver scores, KOBIS box office, award history, OST albums) with international enrichment (TMDB, MDL community tags, JustWatch streaming, Wikipedia context) creates a database that genuinely doesn't exist anywhere else in a queryable form.

That's what makes the scraping work worth it.
