<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cara Jung</title>
    <description>The latest articles on DEV Community by Cara Jung (@carasjung).</description>
    <link>https://dev.to/carasjung</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3448967%2F4a49be6a-32ba-4490-aa57-3617e659db09.png</url>
      <title>DEV Community: Cara Jung</title>
      <link>https://dev.to/carasjung</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/carasjung"/>
    <language>en</language>
    <item>
      <title>From Scrapers to MCP Server: Serving Korean Entertainment Data to AI Agents</title>
      <dc:creator>Cara Jung</dc:creator>
      <pubDate>Mon, 11 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/carasjung/from-scrapers-to-mcp-server-serving-korean-entertainment-data-to-ai-agents-1npj</link>
      <guid>https://dev.to/carasjung/from-scrapers-to-mcp-server-serving-korean-entertainment-data-to-ai-agents-1npj</guid>
      <description>&lt;p&gt;Korean entertainment data is surprisingly fragmented. Information about a single drama or film is often scattered across multiple platforms.&lt;/p&gt;

&lt;p&gt;To solve that, I built a unified Korean entertainment database powered by APIs, web scrapers, and automated sync pipelines. By the end of the project, I had a Supabase database containing nearly 10,000 Korean movies, 3,500 TV shows, per-episode Nielsen Korea ratings, award histories, and streaming availability across four regions.&lt;/p&gt;

&lt;p&gt;The next problem was figuring out how to expose it to AI agents in a way that was actually useful, secure, monetizable, and maintainable.&lt;/p&gt;

&lt;p&gt;This is the story of building the MCP server and the errors I encountered.&lt;/p&gt;




&lt;h2&gt;
  
  
  Designing the Tools
&lt;/h2&gt;

&lt;p&gt;Before writing any code, I thought carefully about what an AI agent would actually need from a Korean entertainment database. The answer wasn't "expose every database column as a query parameter." That's an API, not a tool. MCP tools should be opinionated about what they return and why.&lt;/p&gt;

&lt;p&gt;I ended up with 17 tools organized into three categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discovery tools&lt;/strong&gt; answer "what should I watch?" — &lt;code&gt;get_trending_dramas&lt;/code&gt;, &lt;code&gt;browse_by_genre&lt;/code&gt;, &lt;code&gt;browse_by_tag&lt;/code&gt;. The tag tool is the most distinctive: MyDramaList's community taxonomy ("Bromance", "Enemies to Lovers", "Time Travel", "CEO Male Lead") doesn't exist anywhere else in structured form, and it's exactly how K-drama fans actually think about recommendations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detail tools&lt;/strong&gt; answer "tell me everything about this title" like &lt;code&gt;get_movie&lt;/code&gt;, &lt;code&gt;get_drama&lt;/code&gt;, &lt;code&gt;get_episode_ratings&lt;/code&gt;, &lt;code&gt;get_ost_albums&lt;/code&gt;. The episode ratings tool is the one I'm most proud of: it returns Nielsen Korea per-episode viewership percentages scraped from SVG chart elements on Naver. No English-language API has this data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility tools&lt;/strong&gt; answer cross-cutting questions — &lt;code&gt;find_where_to_watch&lt;/code&gt;, &lt;code&gt;get_weekly_boxoffice&lt;/code&gt;, &lt;code&gt;get_actor_filmography&lt;/code&gt;, &lt;code&gt;compare_ratings&lt;/code&gt;. The compare_ratings tool is genuinely novel: it shows you Naver's verified Korean ticket buyer score, MDL's international fan score, TMDB's global community score, and RT's Western critic score side by side, with labels explaining what each audience represents.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Server with FastMCP
&lt;/h2&gt;

&lt;p&gt;FastMCP makes building MCP servers surprisingly clean. Each tool is a decorated Python function with a docstring that becomes the tool description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Korean Entertainment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You have access to a comprehensive database of Korean movies and TV shows.
Rating fields and what they mean:
- mdl_rating: International K-drama fans (0-10)
- naver_audience_rating: Korean verified ticket buyers (0-10)
- naver_latest_rating: Nielsen Korea latest episode viewership (%)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;browse_by_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Browse Korean dramas by MyDramaList community tag.
    Common tags: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bromance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Time Travel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CEO Male Lead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enemies to Lovers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Revenge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found Family&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_supabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tv_shows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id, title_english, title_korean, year, mdl_rating, tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mdl_rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tools call &lt;code&gt;db/queries.py&lt;/code&gt; directly, the same query layer the pipeline uses to write data. No intermediate API layer needed. When Claude calls &lt;code&gt;browse_by_tag(tag="Revenge", limit=5)&lt;/code&gt;, it goes straight to Supabase.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing Authentication: Descope
&lt;/h2&gt;

&lt;p&gt;For a monetizable MCP server, I needed real OAuth 2.1 authentication. Without auth, anyone with the URL can use your server, which fine for testing, but not for marketplace listings where you might want to gate access or track usage per user.&lt;/p&gt;

&lt;p&gt;I chose Descope for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;FastMCP has a first-class &lt;code&gt;DescopeProvider&lt;/code&gt; integration&lt;/li&gt;
&lt;li&gt;Descope supports Dynamic Client Registration (DCR), which lets MCP clients like Claude register automatically without manual configuration&lt;/li&gt;
&lt;li&gt;Their free tier is generous enough for an early-stage project&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The final auth setup in server.py is just four lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastmcp.server.auth.providers.descope&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DescopeProvider&lt;/span&gt;

&lt;span class="n"&gt;_auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DescopeProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;config_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DESCOPE_CONFIG_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERVER_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Korean Entertainment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_auth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Getting to those four lines took about seven failed deployments.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Deployment Errors Worth Knowing About
&lt;/h2&gt;

&lt;p&gt;Every deployment has gotchas. Here are the ones that will actually save you time if you're building something similar.&lt;/p&gt;

&lt;h3&gt;
  
  
  Corrupted file from incremental edits
&lt;/h3&gt;

&lt;p&gt;When making multiple edits to &lt;code&gt;server.py&lt;/code&gt;, one automated string replacement accidentally jammed code into the middle of an import block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;db.queries&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;get_movie_by_tmdb_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;get_movie_by_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DescopeProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  &lt;span class="c1"&gt;# ← corrupted by bad replacement
&lt;/span&gt;    &lt;span class="n"&gt;get_movies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Railway reported this as &lt;code&gt;SyntaxError: '(' was never closed&lt;/code&gt; a confusing error that had nothing to do with parentheses. The actual problem was a botched edit 20 lines earlier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; when making multiple changes to the same file, regenerate the whole file from scratch rather than applying incremental patches. A clean file beats a patched one every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Missing &lt;code&gt;https://&lt;/code&gt; prefix
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;SERVER_URL&lt;/code&gt; environment variable was set to &lt;code&gt;kr-movie-tv-mcp-production.up.railway.app&lt;/code&gt; without the &lt;code&gt;https://&lt;/code&gt; prefix. Pydantic rejected it immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input should be a valid URL, relative URL without a base
[type=url_parsing, input_value='kr-movie-tv-mcp-production.up.railway.app']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple fix, but it costs a full Railway deployment cycle (about 3 minutes) to discover. Always include the schema when setting URL environment variables.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Descope's token validation actually works
&lt;/h3&gt;

&lt;p&gt;One thing worth knowing from Descope's documentation: the MCP Server URL field in the console is optional. When set, it adds an aud (audience) claim to access tokens, scoping them to your specific server. When left unset, no audience claim is added and tokens are validated purely against the .well-known config URL.&lt;br&gt;
This means you can get a fully working OAuth setup (complete handshake, Dynamic Client Registration, and tool discovery) with just the .well-known URL configured. The audience field is an additional security layer for production environments where you want strict token scoping, not a prerequisite for the integration to work.&lt;/p&gt;


&lt;h2&gt;
  
  
  Deploying to Railway
&lt;/h2&gt;

&lt;p&gt;I chose Railway over the free alternatives for one reason: no cold starts. Render's free tier spins down after 15 minutes of inactivity and takes 30-60 seconds to restart. For an MCP server that needs to respond to tool calls quickly, a cold start on the first request would produce a bad user experience and potentially cause claude.ai to show a timeout error.&lt;/p&gt;

&lt;p&gt;Railway at $5/month gives you an always-on container with automatic deploys from GitHub. The configuration is minimal, a &lt;code&gt;railway.toml&lt;/code&gt; that specifies the start command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[build]&lt;/span&gt;
&lt;span class="py"&gt;builder&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"nixpacks"&lt;/span&gt;

&lt;span class="nn"&gt;[deploy]&lt;/span&gt;
&lt;span class="py"&gt;startCommand&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"python server.py"&lt;/span&gt;
&lt;span class="py"&gt;restartPolicyType&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"always"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server runs with streamable-http transport, which is what claude.ai and other MCP clients expect for remote servers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streamable-http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Railway injects its own PORT environment variable, in this case 8080, so the server binds to whatever port Railway assigns. The 8000 default in the code is only used when running locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing End-to-End
&lt;/h2&gt;

&lt;p&gt;Once the server was running, I tested it through Claude Code locally first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add korean-entertainment &lt;span class="nt"&gt;--&lt;/span&gt; python server.py
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first query I ran:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Which Korean drama had the biggest Nielsen viewership jump between its first and last episode?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude called &lt;code&gt;get_top_dramas&lt;/code&gt; to get a list of candidates, then called &lt;code&gt;get_episode_ratings&lt;/code&gt; for each one, computed the delta, and returned:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Crash Landing on You jumped from 6.1% → 21.7% (+15.6 percentage points), beating My Love from the Star's 15.6% → 28.1% jump (+12.5pp).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's real Nielsen Korea data, pulled from SVG chart elements on NAVER, stored in Supabase, served through FastMCP, reasoned over by Claude. The full pipeline working end-to-end.&lt;/p&gt;

&lt;p&gt;After connecting to claude.ai directly, I tested a more complex query:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Find me a Korean thriller drama rated above 8.5 with the Revenge tag, and show me where I can watch it in the US"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The response correctly identified The Glory (더 글로리) at 8.9 MDL rating, confirmed it streams on Netflix US, and included context about the writing and direction. Cross-source, cross-tool reasoning working exactly as designed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the MCP Server Unlocks
&lt;/h2&gt;

&lt;p&gt;The combination of data sources creates queries that weren't previously possible from any single API:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Find dramas where Korean audiences loved it but Western audiences didn't":&lt;/strong&gt;&lt;br&gt;
This requires &lt;code&gt;naver_audience_rating&lt;/code&gt; (Korean verified buyers) vs &lt;code&gt;rt_tomatometer&lt;/code&gt; (Western critics): two fields from two different scrapers on the same row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Show me the episode rating trajectory for currently airing dramas":&lt;/strong&gt;&lt;br&gt;
This requires &lt;code&gt;get_trending_dramas&lt;/code&gt; + &lt;code&gt;get_episode_ratings&lt;/code&gt;: airing status from MDL, viewership from Naver's SVG charts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What Korean films won awards and are now on Netflix?":&lt;/strong&gt;&lt;br&gt;
This requires joining &lt;code&gt;awards&lt;/code&gt;, &lt;code&gt;movies&lt;/code&gt;, and &lt;code&gt;streaming_availability&lt;/code&gt;: three tables from three different sources.&lt;/p&gt;

&lt;p&gt;None of these queries are possible against any existing Korean entertainment API, because no existing API has all three pieces. That's the value proposition.&lt;/p&gt;


&lt;h2&gt;
  
  
  Getting It In Front of Users
&lt;/h2&gt;

&lt;p&gt;With the server running, the next priority was distribution, which is getting it listed everywhere developers and AI users look for MCP servers.&lt;/p&gt;
&lt;h3&gt;
  
  
  Smithery
&lt;/h3&gt;

&lt;p&gt;Smithery was the fastest listing. Paste your Railway URL, complete the OAuth flow, and their inspector automatically discovers all 17 tools and generates the listing. The whole process took under 10 minutes. Smithery is worth listing on because it's where Claude Code users browse for MCP servers such as developers who are already in an agentic workflow and actively looking for new tools to add.&lt;/p&gt;
&lt;h3&gt;
  
  
  Glama
&lt;/h3&gt;

&lt;p&gt;Glama has over 23,000 MCP servers listed and is one of the most-searched directories in the ecosystem. Submitting is straightforward, just add your URL and GitHub repo. They have a private notes field for their review team; since the server uses OAuth, I left instructions explaining how to connect via claude.ai rather than providing API keys. Glama is worth listing on because it indexes for search relevance and recent usage, so active servers rise in rankings over time.&lt;/p&gt;
&lt;h3&gt;
  
  
  mcp.so
&lt;/h3&gt;

&lt;p&gt;mcp.so has over 20,000 servers and accepts submissions via GitHub issue. The server config field expects the standard MCP JSON format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"korean-entertainment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://kr-movie-tv-mcp-production.up.railway.app/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Worth listing because it's one of the most-linked directories in MCP ecosystem articles and gets significant organic search traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cline Marketplace
&lt;/h3&gt;

&lt;p&gt;Cline is a popular AI coding assistant with millions of users and their own MCP marketplace backed by a GitHub repo. Submission is a GitHub issue with your server name, repo URL, endpoint, tool list, and what makes it unique. The review team checks for code quality and documentation before approving. Worth the effort because Cline users are developers who tend to build on top of tools they discover: potential integrators, not just users.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP-Hive
&lt;/h3&gt;

&lt;p&gt;MCP-Hive launched May 11, 2026 as the first marketplace with actual pay-per-call revenue sharing for server providers. MCP-Hive is the main monetization play in this ecosystem right now; every other directory is free discovery with no revenue component.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The distribution is in place. What comes next:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complete initial population&lt;/strong&gt; — ~6,500 TV shows still need TMDB sync as GitHub Actions minutes reset monthly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add per-user billing&lt;/strong&gt; — Descope supports scope-based access control, enabling a free tier (basic search) vs paid tier (Nielsen data, awards)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KMDb integration&lt;/strong&gt; — API membership pending, will add Korean film archive data when approved&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The database gets richer every night. The server is listed where it needs to be. Time to let the data do the work.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>agents</category>
      <category>data</category>
    </item>
    <item>
      <title>Inside the Pipeline Powering a Korean Entertainment MCP Server</title>
      <dc:creator>Cara Jung</dc:creator>
      <pubDate>Sun, 10 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/carasjung/inside-the-pipeline-powering-a-korean-entertainment-mcp-server-3d2h</link>
      <guid>https://dev.to/carasjung/inside-the-pipeline-powering-a-korean-entertainment-mcp-server-3d2h</guid>
      <description>&lt;p&gt;Korean entertainment has become global, but the infrastructure behind its data is still surprisingly broken. Information about a single show is often scattered across multiple platforms: one site for Korean ratings, another for streaming availability, another for cast data, another for box office numbers, and several with no public APIs at all.&lt;/p&gt;

&lt;p&gt;This is Part 2 of a series. In &lt;a href="https://dev.to/carasjung/building-a-unified-korean-entertainment-database-from-10-apis-and-web-scrapers-3n91"&gt;Part 1&lt;/a&gt;, I covered how I unified data across 10 APIs and web scrapers into a single database designed to power an MCP server.&lt;/p&gt;

&lt;p&gt;But collecting the data was only half the problem. Once you have 10 independent sources feeding into the same system, the real engineering work begins: how do you run all of them reliably, on a schedule, writing to a shared database, without the entire pipeline breaking every time one site changes its HTML structure?&lt;/p&gt;

&lt;p&gt;This is the story of building the pipeline that keeps the Korean entertainment database alive.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Database: Why Supabase
&lt;/h2&gt;

&lt;p&gt;The database needed to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Queryable via REST&lt;/strong&gt; (the MCP server reads directly from it — no separate API layer)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL-native&lt;/strong&gt; (arrays, jsonb, GIN indexes for genre/tag filtering)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free to start&lt;/strong&gt; (we're pre-revenue)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always-on&lt;/strong&gt; (the nightly pipeline writes to it every day)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Supabase checked all four boxes. Its free tier includes 500MB of storage, unlimited API requests via PostgREST, and Row Level Security for access control. The built-in REST API means I can query the database directly from the MCP server without building a middleware layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The one gotcha:&lt;/strong&gt; Supabase free tier pauses the database after 1 week of inactivity. The nightly pipeline prevents this — activity every day keeps it awake. If the pipeline goes down for more than a week, it needs to be manually unpaused from the dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema Design Decisions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Separate tables for movies and TV shows.&lt;/strong&gt; The data shapes are different: movies have runtime, box office revenue, and KOBIS codes while TV shows have episode counts, airing schedules, MDL-specific tags, and per-episode ratings. One table with a type column would mean half the columns are always NULL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Both a cast table and JSON summary.&lt;/strong&gt; The normalized &lt;code&gt;movie_cast&lt;/code&gt; and &lt;code&gt;show_cast&lt;/code&gt; tables enable queries like "all films with Song Kang-ho". The JSON summary in the main record enables fast single-title lookups without a join.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate streaming_availability table.&lt;/strong&gt; Storing streaming data as a jsonb column would make queries like "all Korean films on Netflix US" require a full table scan. A proper table with a row per title/region/provider enables indexed lookups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicit rating field naming.&lt;/strong&gt; Every rating field is named by source and audience type — &lt;code&gt;naver_audience_rating&lt;/code&gt;, &lt;code&gt;tmdb_rating&lt;/code&gt;, &lt;code&gt;rt_tomatometer&lt;/code&gt; — never generic "rating". This is a naming convention enforced at the schema level that prevents ambiguity at the application layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The NULL Constraint Issue
&lt;/h3&gt;

&lt;p&gt;PostgreSQL's unique constraints treat NULL as not equal to NULL, which caused a subtle bug in the streaming availability table. The unique constraint on &lt;code&gt;(movie_id, show_id, region, provider, monetization_type)&lt;/code&gt; failed to prevent duplicates when &lt;code&gt;show_id&lt;/code&gt; was NULL. Postgres saw &lt;code&gt;NULL != NULL&lt;/code&gt; and allowed duplicate rows.&lt;/p&gt;

&lt;p&gt;The fix was a partial index using COALESCE:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;unique&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;streaming_availability_unique&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;streaming_availability&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monetization_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This converts NULLs to empty strings before comparison, making the uniqueness check work correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline: Prefect + GitHub Actions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Not Prefect Cloud?
&lt;/h3&gt;

&lt;p&gt;Prefect is genuinely excellent orchestration software. Its flow/task model gives you retry logic, caching, dependency management, and a beautiful UI, all from simple Python decorators:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry_delay_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sync_movie&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmdb_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_movie_details&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmdb_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;movie_row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;upsert_movie&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_transform_movie&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;movie_row&lt;/span&gt;

&lt;span class="nd"&gt;@flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sync_tmdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_prints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sync_tmdb_flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;movie_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_korean_movie_ids&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tmdb_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;movie_ids&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;movie_limit&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;sync_movie&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmdb_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem was Prefect Cloud's free tier. When I tried to create a work pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prefect work-pool create kr-mcp-pool &lt;span class="nt"&gt;--type&lt;/span&gt; process
&lt;span class="c"&gt;# Your plan does not support hybrid or push work pools.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hybrid work pools, which let you run jobs on your own infrastructure, require a paid plan starting at $20/month. Without a work pool, there's no way to schedule flows to run automatically.&lt;/p&gt;

&lt;p&gt;I kept the Prefect flow/task structure since it gives retry logic, task-level logging, and caching for free even without Prefect Cloud. But for the scheduling layer, I went with GitHub Actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why GitHub Actions
&lt;/h3&gt;

&lt;p&gt;GitHub Actions gives 2,000 free minutes per month on private repositories. Each workflow job runs on a fresh Ubuntu runner with Python, pip, and dependencies installed from scratch.&lt;/p&gt;

&lt;p&gt;The scheduling syntax is standard cron:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;    &lt;span class="c1"&gt;# 2am UTC daily&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="c1"&gt;# also triggerable manually&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secrets are injected as environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;TMDB_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.TMDB_API_KEY }}&lt;/span&gt;
  &lt;span class="na"&gt;SUPABASE_SERVICE_ROLE_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.SUPABASE_SERVICE_ROLE_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflows call simple Python scripts that invoke Prefect flows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scripts/run_sync_tmdb.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pipeline.jobs.sync_tmdb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_tmdb_flow&lt;/span&gt;
&lt;span class="nf"&gt;sync_tmdb_flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Estimated monthly usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nightly sync (TMDB + MDL + Naver TV airing): ~37 min × 30 days = ~1,110 min&lt;/li&gt;
&lt;li&gt;Weekly sync (KOBIS + JustWatch + Wikipedia + Awards): ~75 min × 4 = ~300 min&lt;/li&gt;
&lt;li&gt;Total: ~1,410 min/month — safely under the 2,000 minute limit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is runner startup time (~30 seconds per job for pip install and Playwright browser install) and the 6-hour job timeout, which became a significant constraint during the initial population phase.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Errors
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Error 1: Prefect Type Validation is Strict
&lt;/h3&gt;

&lt;p&gt;Prefect 3.x validates flow parameter types at runtime. Optional integer parameters typed as &lt;code&gt;int = None&lt;/code&gt; fail validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ParameterTypeError: Flow run received invalid parameters:
 - movie_limit: Input should be a valid integer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix is explicit Optional typing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (fails)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sync_tmdb_flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

&lt;span class="c1"&gt;# After (works)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sync_tmdb_flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This affected every pipeline job. I had to audit all &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;str&lt;/code&gt;, &lt;code&gt;float&lt;/code&gt;, and &lt;code&gt;list&lt;/code&gt; optional parameters across 7 files. Python's &lt;code&gt;int | None&lt;/code&gt; syntax (PEP 604) is required; &lt;code&gt;Optional[int]&lt;/code&gt; from typing also works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error 2: The TMDB Response Shape Mismatch
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;data_sources/tmdb.py&lt;/code&gt; module normalizes TMDB's raw API response before returning it. The pipeline assumed it was getting raw TMDB format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pipeline assumed raw TMDB format:
&lt;/span&gt;&lt;span class="n"&gt;genres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_genres&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;genres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])])&lt;/span&gt;
&lt;span class="c1"&gt;# Actual normalized format from tmdb.py:
# genres is already ["Drama", "Romance"] — a list of strings, not dicts
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This caused &lt;code&gt;TypeError: string indices must be integers&lt;/code&gt; when trying to do &lt;code&gt;g["name"]&lt;/code&gt; on a string.&lt;/p&gt;

&lt;p&gt;The fix was simple once diagnosed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;genres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_genres&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;genres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But finding it required running the job, reading the stack trace, debugging the actual response shape with a quick Python one-liner, and tracing back to the source.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error 3: The Initial Population Timeout
&lt;/h3&gt;

&lt;p&gt;GitHub Actions has a maximum job runtime of 6 hours. TMDB has ~10,000 Korean movies and ~10,000 Korean TV shows. Fetching full details for each at ~0.25 seconds per title = ~83 minutes for movies, ~83 minutes for shows. That's within the limit.&lt;/p&gt;

&lt;p&gt;But each title also has cast: 20 cast members per title × 2 DB writes each × 20,000 titles = 800,000 database operations. The actual runtime exceeded 6 hours and the job was cancelled mid-run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The job has exceeded the maximum execution time of 6h0m0s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The solution was accepting partial initial population and relying on the nightly sync to fill gaps over time. After the initial run, I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;9,983 movies (nearly complete)&lt;/li&gt;
&lt;li&gt;3,536 shows (35% since the timeout hit mid-show-sync)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Subsequent nightly runs continue adding shows. The upsert pattern means no data is lost or duplicated since each run picks up new IDs and updates existing records.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error 4: ON CONFLICT DO UPDATE with Duplicates
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;upsert_bulk&lt;/code&gt; functions failed when the input list contained duplicate rows with the same unique key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="n"&gt;cannot&lt;/span&gt; &lt;span class="n"&gt;affect&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;second&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;Hint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Ensure&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="k"&gt;no&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt; &lt;span class="n"&gt;proposed&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;insertion&lt;/span&gt; &lt;span class="n"&gt;within&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;same&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt; &lt;span class="n"&gt;duplicate&lt;/span&gt; &lt;span class="n"&gt;constrained&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happened for two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OST albums:&lt;/strong&gt; Goblin has two editions of "Part 1", which is same album name but different recordings. The scraper returned both and the upsert tried to update the same row twice in one batch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Streaming availability:&lt;/strong&gt; Amazon Prime Video appeared as both a &lt;code&gt;None&lt;/code&gt; monetization type entry and a regular entry for the same title/region. Two rows with the same effective unique key.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fix was deduplication before upsert in both cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upsert_ost_albums_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;albums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;unique_albums&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;albums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;album_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;unique_albums&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;show_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="nf"&gt;_clean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_albums&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_supabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ost_albums&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_conflict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;show_id,album_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Error 5: Missing Schema Columns
&lt;/h3&gt;

&lt;p&gt;The MDL scraper stored &lt;code&gt;content_rating&lt;/code&gt; and &lt;code&gt;mdl_votes&lt;/code&gt;, fields that existed in the scraper output but weren't in the initial database schema. Supabase's PostgREST returns a clear error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;PGRST204&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Could&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;find&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="s1"&gt;'content_rating'&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="s1"&gt;'tv_shows'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="k"&gt;cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix was two ALTER TABLE statements in Supabase SQL Editor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tv_shows&lt;/span&gt; &lt;span class="k"&gt;add&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;content_rating&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tv_shows&lt;/span&gt; &lt;span class="k"&gt;add&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;mdl_votes&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a schema drift problem. The scraper evolved during development but the SQL schema wasn't updated to match. The lesson is to keep &lt;code&gt;schema.sql&lt;/code&gt; as the single source of truth and run it against a fresh database before deploying.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error 6: requirements.txt Bloat
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pip freeze&lt;/code&gt; in an Anaconda environment captures hundreds of unrelated packages — machine learning libraries, game engines, PDF parsers. The resulting &lt;code&gt;requirements.txt&lt;/code&gt; had dependency conflicts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR: Cannot install -r requirements.txt (line 28) and pdfminer.six==20200517
because these package versions have conflicting dependencies.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix was a minimal hand-written requirements file containing only what the project actually uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;playwright
beautifulsoup4
httpx
requests
python-dotenv
supabase
prefect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub Actions installs from scratch on every run, keeping dependencies minimal reduces install time from ~90 seconds to ~22 seconds per job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error 7: Rotten Tomatoes Package Breakage
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;rottentomatoes-python&lt;/code&gt; package parses RT's search results page to find movie scores. RT changed their HTML schema, breaking the parser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tomato_snip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'"'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# IndexError: list index out of range
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RT actively fights scraping and changes their HTML regularly. I excluded RT from the pipeline rather than maintain a brittle scraper. The &lt;code&gt;justwatch.py&lt;/code&gt; scraper already captures IMDB scores via JustWatch's sidebar — a partial substitute for Western critic consensus.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture That Emerged
&lt;/h2&gt;

&lt;p&gt;After all the debugging, the pipeline architecture is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitHub Actions (scheduling)
    ↓
Prefect flows (retry logic, task caching, logging)
    ↓
data_sources/* (Playwright + httpx + REST APIs)
    ↓
pipeline/utils.py (data normalization)
    ↓
db/queries.py (upsert operations)
    ↓
Supabase (PostgreSQL + PostgREST)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer has a clear responsibility. The data sources are pure functions that return normalized Python dicts. The pipeline jobs transform those dicts to match the DB schema. The queries module is the only file that touches Supabase. This separation made debugging much faster. When something breaks, it's usually immediately obvious which layer is responsible.&lt;/p&gt;




&lt;h2&gt;
  
  
  GitHub Actions Minutes: A Practical Guide
&lt;/h2&gt;

&lt;p&gt;For anyone building a similar pipeline, here's how to think about GitHub Actions free tier minutes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's fast (&amp;lt; 5 min per run):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;REST API sources (TMDB, KOBIS, Wikipedia) — no browser overhead&lt;/li&gt;
&lt;li&gt;Small record counts with efficient batch upserts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's slow (10-60+ min per run):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Playwright scrapers (browser startup + page load per request)&lt;/li&gt;
&lt;li&gt;Multi-region lookups (JustWatch × 4 regions = 4× the time)&lt;/li&gt;
&lt;li&gt;Large initial populations (10,000 titles × N seconds each)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimization strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Only sync what changes frequently. Airing shows need daily updates, historical movies don't&lt;/li&gt;
&lt;li&gt;Use Prefect's task caching for expensive discovery operations (fetching all TMDB IDs)&lt;/li&gt;
&lt;li&gt;Keep JustWatch to weekly syncs of 100-200 titles rather than full catalog runs&lt;/li&gt;
&lt;li&gt;Separate initial population (run once, manually) from ongoing sync (automated)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The nightly sync in steady state runs in under 45 minutes. The weekly sync runs in under 90 minutes. Combined, that's ~1,400 minutes/month, which is safely within the 2,000 minute free limit even accounting for occasional failures and re-runs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Design the schema before writing the scrapers.&lt;/strong&gt; Schema drift (scrapers adding fields that don't exist in the database) caused several production failures. The right approach is schema-first: define all columns upfront, then write scrapers that output exactly those fields.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test upserts with duplicates explicitly.&lt;/strong&gt; The &lt;code&gt;ON CONFLICT DO UPDATE&lt;/code&gt; duplicate error was predictable in retrospect. Any bulk upsert should be tested with intentionally duplicated input rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with a hand-written requirements.txt.&lt;/strong&gt; &lt;code&gt;pip freeze&lt;/code&gt; in a development environment always captures too much. Start minimal and add packages as needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accept that initial population and ongoing sync are different problems.&lt;/strong&gt; Initial population of 10,000+ records from browser-scraped sources takes days, not hours. Design for incremental population from the start rather than trying to do it all in one run.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The database feeds into a FastMCP server that exposes all of this as tools for AI agents — structured queries like &lt;code&gt;get_drama_details&lt;/code&gt;, &lt;code&gt;find_where_to_watch&lt;/code&gt;, &lt;code&gt;get_actor_filmography&lt;/code&gt;, and &lt;code&gt;get_weekly_boxoffice&lt;/code&gt; that leverage the cross-source enrichment I've built.&lt;/p&gt;

&lt;p&gt;The combination of Korean-specific data (Nielsen ratings, Naver scores, KOBIS box office, award history, OST albums) with international enrichment (TMDB, MDL community tags, JustWatch streaming, Wikipedia context) creates a database that genuinely doesn't exist anywhere else in a queryable form.&lt;/p&gt;

&lt;p&gt;That's what makes the scraping work worth it.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>githubactions</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building a Unified Korean Entertainment Database from 10 APIs and Web Scrapers</title>
      <dc:creator>Cara Jung</dc:creator>
      <pubDate>Sat, 09 May 2026 20:56:22 +0000</pubDate>
      <link>https://dev.to/carasjung/building-a-unified-korean-entertainment-database-from-10-apis-and-web-scrapers-3n91</link>
      <guid>https://dev.to/carasjung/building-a-unified-korean-entertainment-database-from-10-apis-and-web-scrapers-3n91</guid>
      <description>&lt;p&gt;Korean entertainment has become a global phenomenon with shows such as Squid Game breaking records and K-dramas topping global charts. And yet, the data infrastructure behind it is fragmented.&lt;/p&gt;

&lt;p&gt;Getting complete data on a single Korean show or film — cast, ratings (Korean and international), episode viewership numbers, where to stream it, what awards it won, its OST albums — requires hopscotching different websites. &lt;/p&gt;

&lt;p&gt;The issue is that dominant platforms like NAVER and Melon lack English-first APIs. As Session Zero points out in this &lt;a href="https://dev.to/sessionzero_ai/i-built-an-mcp-server-for-korean-data-heres-why-e92"&gt;article&lt;/a&gt;, Korean data is heavily underserved in MCP ecosystems because when Western developer tools and AI systems are built, Korean platforms are often invisible by default.&lt;/p&gt;

&lt;p&gt;The data exists. But it’s trapped behind language barriers, undocumented endpoints, JavaScript-rendered pages, and closed ecosystems. So while AI agents can easily retrieve structured information about Hollywood movies, Spotify charts, or IMDb ratings, asking the same systems about Korean dramas, OSTs, or Korean audience sentiment often returns incomplete results or nothing at all. &lt;/p&gt;

&lt;p&gt;So I decided to build a unified database to fix it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Data Landscape
&lt;/h2&gt;

&lt;p&gt;Korean entertainment data splits along two axes: &lt;strong&gt;language&lt;/strong&gt; (Korean vs. English sources) and &lt;strong&gt;type&lt;/strong&gt; (official vs. community vs. streaming).&lt;/p&gt;

&lt;h3&gt;
  
  
  English-language sources
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;TMDB&lt;/strong&gt; is the closest thing to a comprehensive English-language database for Korean content. It has structured data on tens of thousands of Korean films and shows, a stable API, and community ratings. But it lacks Korea-specific data: no verified Korean audience scores, no Nielsen viewership ratings, no Korean box office data, no OST information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MyDramaList&lt;/strong&gt; fills a critical gap that TMDB misses entirely: community tags. MDL users have tagged Korean dramas with labels such as "Bromance", "Time Travel", "CEO Male Lead", and "Found Family." No official database captures that taxonomy. MDL also tracks airing status more accurately than TMDB for Korean dramas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HanCinema&lt;/strong&gt; has the deepest historical coverage of Korean content in English, including films from the 1950s through 1990s that TMDB barely covers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JustWatch&lt;/strong&gt; is the most reliable real-time source for streaming availability. TMDB's streaming data lags reality by weeks. JustWatch checks 364 services daily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wikipedia&lt;/strong&gt; has rich content for major Korean films and shows including detailed plot summaries, production history, cultural reception sections that no English entertainment database captures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Korean-language sources
&lt;/h3&gt;

&lt;p&gt;Here's where things get interesting and painful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NAVER&lt;/strong&gt; is Korea's dominant search engine and entertainment portal. Search for any Korean film on NAVER and you'll get a rich information card with two ratings that don't exist anywhere else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;실관람객 평점&lt;/strong&gt; (Verified ticket buyer rating): Only people who purchased cinema tickets through affiliated platforms can rate. This is Korea's equivalent of a verified purchase review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;네티즌 평점&lt;/strong&gt; (Netizen rating): Korean general public rating.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These ratings often diverge significantly from international scores. Parasite has a 9.08 verified buyer rating on NAVER versus 8.5 on TMDB. The Korean audience who saw it in theaters rated it exceptionally highly.&lt;/p&gt;

&lt;p&gt;NAVER also has per-episode Nielsen Korea viewership ratings for TV dramas, which is the official broadcast ratings that Korean media reports on weekly. No other English-language source has this data structured and queryable.&lt;/p&gt;

&lt;p&gt;The critical catch: &lt;strong&gt;NAVER has no public API for any of this&lt;/strong&gt;. Their entertainment data is rendered dynamically in JavaScript, served through their search interface, and entirely undocumented. Every data point requires a browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KOBIS&lt;/strong&gt; (Korean Film Council) is the exception. It has an official government API that provides authoritative weekly and daily box office rankings. It's the only Korean government data source with a proper REST API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Scrapers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Playwright Problem
&lt;/h3&gt;

&lt;p&gt;Most of the Korean data sources render content through JavaScript. Static HTML requests return empty shells. This meant nearly every scraper needed a real browser.&lt;/p&gt;

&lt;p&gt;To address this, I used &lt;strong&gt;Playwright&lt;/strong&gt; with Chromium headless across all JS-rendered sources. The setup is consistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_page_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;locale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ko-KR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domcontentloaded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# let lazy content settle
&lt;/span&gt;        &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;locale="ko-KR"&lt;/code&gt; matters for NAVER since it ensures Korean content is served rather than any region-specific variant.&lt;/p&gt;

&lt;h3&gt;
  
  
  The NAVER Genre Problem
&lt;/h3&gt;

&lt;p&gt;One of the more unexpected parsing challenges came from NAVER's movie information card. Genre, country, and runtime appeared concatenated: &lt;code&gt;공포대한민국95분&lt;/code&gt; (Horror South Korea 95min). They were in a single &lt;code&gt;dd&lt;/code&gt; tag separated by invisible &lt;code&gt;span.cm_bar_info&lt;/code&gt; elements.&lt;/p&gt;

&lt;p&gt;The fix was to replace the separator spans with pipe characters before splitting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;first_dd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;info_groups&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;first_dd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;first_dd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;segments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;first_dd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;genre&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;segments&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Extracting Nielsen Ratings from SVG
&lt;/h3&gt;

&lt;p&gt;The trickiest scraping problem was NAVER's episode viewership chart. The data is rendered as an interactive SVG chart where viewership percentages, episode numbers, and air dates are all inside SVG text elements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_episode_chart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Rating values from bb-text elements inside the SVG
&lt;/span&gt;    &lt;span class="n"&gt;rating_texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;g.bb-texts-rank text.bb-text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ratings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rating_texts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;ratings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="c1"&gt;# Episode numbers and dates from x-axis ticks
&lt;/span&gt;    &lt;span class="n"&gt;x_ticks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;g.bb-axis-x g.tick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ep_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tick&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;x_ticks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tspans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tick&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tspan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tspans&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;ep_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_parse_episode_num&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tspans&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;date_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tspans&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ep_num&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;date_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;ep_labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;episode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ep_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date_text&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;episode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;episode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;air_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ratings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ep_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ratings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us per-episode Nielsen ratings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ep 1 (12.02.): 6.3%
Ep 8 (12.24.): 12.3%
Ep 16 (01.21.): 20.5%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Goblin. No English-language API has this data.&lt;/p&gt;

&lt;h3&gt;
  
  
  The JustWatch Shadow DOM Problem
&lt;/h3&gt;

&lt;p&gt;JustWatch uses Web Components with Shadow DOM for their streaming offer cards. The score and provider data that appears in the browser is inside &lt;code&gt;&amp;lt;slot&amp;gt;&lt;/code&gt; elements that don't render in server-side HTML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"score-wrap"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"critics-score-wrap"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;slot&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"critics-score"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/slot&amp;gt;&lt;/span&gt;  &lt;span class="c"&gt;&amp;lt;!-- empty in scraped HTML --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the streaming offers themselves (provider names, prices, monetization types) render in the regular DOM inside &lt;code&gt;div.buybox-selector a.offer&lt;/code&gt; elements. The key insight was that the offers were accessible even though the score slots weren't.&lt;/p&gt;

&lt;p&gt;Extracting the actual streaming URLs from JustWatch's redirect links required parsing the &lt;code&gt;r=&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_redirect_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_qs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;unquote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;href&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Awards Parsing: Five Ceremonies, Three Formats
&lt;/h3&gt;

&lt;p&gt;Korean drama and film awards span five major ceremonies, each with slightly different HTML structure. I scraped all of them from AsianWiki plus the official Baeksang Awards site.&lt;/p&gt;

&lt;p&gt;The unexpected challenge was that award categories use different winner formats depending on ceremony type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drama awards:&lt;/strong&gt; &lt;code&gt;Person ("Show Title")&lt;/code&gt; → &lt;code&gt;links[0]&lt;/code&gt; is person, &lt;code&gt;links[1]&lt;/code&gt; is show&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Film awards:&lt;/strong&gt; &lt;code&gt;"Film Title"&lt;/code&gt; → title-only, no person/show split&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blue Dragon Series:&lt;/strong&gt; &lt;code&gt;"Title" (Platform)&lt;/code&gt; → title plus streaming platform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The format detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;value_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bold_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;value_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'"'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Film/series format: title only
&lt;/span&gt;    &lt;span class="n"&gt;current_category&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;current_category&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winner_show&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Drama format: person + show
&lt;/span&gt;    &lt;span class="n"&gt;current_category&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;current_category&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winner_show&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the search function deduplicates winners who also appear in the nominees list. AsianWiki includes the winner in the nominees list, so a naive search returns double entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Skip if this is the same person/title as winner (dedup)
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;nom_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;winner_name&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;winner_matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Wikipedia Section Problem
&lt;/h2&gt;

&lt;p&gt;Wikipedia articles don't have standardized section names. "Plot" might be called "Synopsis", "Series overview", "Story", or "Episodes" depending on who wrote the article and when.&lt;/p&gt;

&lt;p&gt;We built a section alias system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SECTION_ALIASES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Plot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Plot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Synopsis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Story&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Series overview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Premise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Episodes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cast and characters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Characters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Main cast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ratings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ratings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Viewership ratings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Television ratings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Viewership&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reception&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reception&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Critical response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Critical reception&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The smart lookup tries each alias in order until it finds content. Crash Landing on You uses "Episodes" for its episode list and "Viewership" for its ratings section — both non-standard names that the alias system handles automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cross-Source ID Management
&lt;/h2&gt;

&lt;p&gt;One of the harder database design decisions was how to link data across sources. A single show like Crash Landing on You has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TMDB ID: &lt;code&gt;94796&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;MDL ID: &lt;code&gt;70&lt;/code&gt; (from slug &lt;code&gt;70-crash-landing-on-you&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Naver show OS ID: &lt;code&gt;3522952&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;JustWatch slug: &lt;code&gt;crash-landing-on-you&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Wikipedia title: &lt;code&gt;Crash Landing on You&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The database stores all of these as columns on the &lt;code&gt;tv_shows&lt;/code&gt; table. No single ID links all sources — the TMDB ID is the primary key because TMDB has the broadest coverage and most stable IDs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tv_shows&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;           &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="n"&gt;uuid_generate_v4&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="n"&gt;tmdb_id&lt;/span&gt;      &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;mdl_id&lt;/span&gt;       &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;mdl_slug&lt;/span&gt;     &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;naver_show_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;justwatch_slug&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;wikipedia_title&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;-- ... ratings, metadata, etc.&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline links sources progressively: TMDB runs first to seed core titles, then MDL enriches with ratings and tags, then NAVER TV adds episode ratings using the Korean title stored by TMDB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rating Field Naming Convention
&lt;/h2&gt;

&lt;p&gt;With eight different rating sources covering different audiences and methodologies, naming discipline was essential:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;tmdb_rating&lt;/span&gt;              &lt;span class="c"&gt;# Global community (0-10)
&lt;/span&gt;&lt;span class="err"&gt;mdl_rating&lt;/span&gt;               &lt;span class="c"&gt;# International K-drama fans (0-10)
&lt;/span&gt;&lt;span class="err"&gt;naver_audience_rating&lt;/span&gt;    &lt;span class="c"&gt;# Korean verified ticket buyers (0-10)
&lt;/span&gt;&lt;span class="err"&gt;naver_netizen_rating&lt;/span&gt;     &lt;span class="c"&gt;# Korean general public (0-10)
&lt;/span&gt;&lt;span class="err"&gt;naver_latest_rating&lt;/span&gt;      &lt;span class="c"&gt;# Nielsen Korea latest episode (%)
&lt;/span&gt;&lt;span class="err"&gt;naver_highest_rating&lt;/span&gt;     &lt;span class="c"&gt;# Nielsen Korea peak episode (%)
&lt;/span&gt;&lt;span class="err"&gt;rt_tomatometer&lt;/span&gt;           &lt;span class="c"&gt;# Western professional critics (0-100)
&lt;/span&gt;&lt;span class="err"&gt;rt_audience_score&lt;/span&gt;        &lt;span class="c"&gt;# Western RT users (0-100)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are never stored as a generic "rating" field. The naming makes the source and audience type explicit at the schema level, preventing any ambiguity in downstream queries or API responses.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Unlocks
&lt;/h2&gt;

&lt;p&gt;The unified database enables queries that weren't previously possible:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find dramas where Korean audiences loved it but Western critics were lukewarm:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title_english&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;naver_audience_rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rt_tomatometer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tv_shows&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;naver_audience_rating&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;rt_tomatometer&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Find all content with OST albums by IU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title_english&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;album_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vibe_url&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ost_albums&lt;/span&gt; &lt;span class="n"&gt;oa&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;tv_shows&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;oa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;oa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;artist&lt;/span&gt; &lt;span class="k"&gt;ILIKE&lt;/span&gt; &lt;span class="s1"&gt;'%IU%'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;oa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;artist&lt;/span&gt; &lt;span class="k"&gt;ILIKE&lt;/span&gt; &lt;span class="s1"&gt;'%아이유%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Find award winners available on Netflix US:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title_english&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;awards&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;tv_shows&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;streaming_availability&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Netflix'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'us'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monetization_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Subscription'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;won&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Find the drama with the biggest episode-to-episode rating jump:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;episode_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nielsen_rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;nielsen_rating&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nielsen_rating&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;show_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;episode_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;jump&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;episodes&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;jump&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="n"&gt;NULLS&lt;/span&gt; &lt;span class="k"&gt;LAST&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;None of these queries are possible against any single existing source.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The goal of this project is to build an MCP server for Korean entertainment data so that makes Korean movies and TV shows are accessible to AI agents and developer tooling in a structured, searchable way.&lt;/p&gt;

&lt;p&gt;Instead of forcing developers to manually scrape different sites just to answer a basic query, the MCP server will expose unified tools that support natural language requests like “Find me a political thriller Korean audiences loved that’s available on Netflix and maintained strong episode ratings throughout its run.”&lt;/p&gt;

&lt;p&gt;Under the hood, that means reconciling fragmented metadata across 10,000+ titles from APIs, scrapers, streaming providers, audience ratings, box office systems, and community-driven sources.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part 2 will dive into the pipeline architecture itself: the automated sync system, GitHub Actions orchestration, incremental updates, scraper failures, rate limits, deduplication headaches, and the very questionable debugging decisions made at 2AM.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>webscraping</category>
      <category>korea</category>
      <category>ai</category>
    </item>
    <item>
      <title>Pick Your Auth: An Interactive Guide</title>
      <dc:creator>Cara Jung</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/carasjung/pick-your-auth-an-interactive-guide-44</link>
      <guid>https://dev.to/carasjung/pick-your-auth-an-interactive-guide-44</guid>
      <description>&lt;p&gt;Most auth tutorials focus on how authentication works such as how to drop in a component, spin up a dev server, and get a login screen running. There's no shortage of guides that tell you which method to use for your use case. What's missing is the hands-on part: actually experiencing each flow the way your users do, so you can feel the friction, see the session it produces, and make an informed decision from the ground up.&lt;/p&gt;

&lt;p&gt;Magic link or passkey? Social login or OTP? The answer changes depending on whether you're building a consumer app, a fintech product, a B2B SaaS, or an internal tool. The choice is a product decision that affects activation, security posture, compliance, and long-term maintainability.&lt;/p&gt;

&lt;p&gt;To tackle this dilemma, I built an interactive demo called &lt;strong&gt;Auth Decision Kit&lt;/strong&gt; that lets you try three Descope auth flows live: magic link, social login, and passkey. This demo focuses on how each approach fits different product contexts and the tradeoffs you need to consider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo: &lt;a href="https://auth-decision-kit.vercel.app/" rel="noopener noreferrer"&gt;auth-decision-kit.vercel.app&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;GitHub: &lt;a href="https://github.com/carasjung/auth-decision-kit" rel="noopener noreferrer"&gt;github.com/carasjung/auth-decision-kit&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each method has five tabs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01 Auth Flow&lt;/strong&gt;&lt;br&gt;
Authenticate for real using a live Descope integration. See the actual UX users experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;02 Session Inspector&lt;/strong&gt;&lt;br&gt;
After authenticating, inspect every claim in your JWT payload. Each field is annotated with what it means, why it matters, and when you'd use it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;03 Decision Matrix&lt;/strong&gt;&lt;br&gt;
Green / yellow / red ratings across six product contexts: B2B, consumer app, developer tool, internal tool, fintech, SaaS, and mobile-first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;04 Failure Simulator&lt;/strong&gt; &lt;br&gt;
Trigger each failure mode and see the Descope error code and the correct handling code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;05 Code&lt;/strong&gt; &lt;br&gt;
Copy-ready implementation snippets for Next.js&lt;/p&gt;


&lt;h2&gt;
  
  
  The Session Inspector: JWT Breakdown
&lt;/h2&gt;

&lt;p&gt;One of the most useful things I learned building this is how different the JWT payload looks depending on which auth method you used and why they matter for your backend logic.&lt;/p&gt;

&lt;p&gt;After a &lt;strong&gt;magic link&lt;/strong&gt; auth, your session contains &lt;code&gt;authenticationMethod: "magiclink"&lt;/code&gt; and &lt;code&gt;verifiedEmail: true&lt;/code&gt;. The email verification is implicit, clicking the link is proof of inbox access. This is a meaningful signal for risk scoring and it also shows that magic link is a single factor (access to your inbox). For products that require two factors like healthcare and fintech, magic link on its own won’t satisfy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdgj5efcxw28gvywwfou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdgj5efcxw28gvywwfou.png" alt="Magic link session" width="800" height="814"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After &lt;strong&gt;social login&lt;/strong&gt;, you get the provider's access token nested under &lt;code&gt;oauth.google.accessToken&lt;/code&gt; (or whichever provider). You also get &lt;code&gt;externalIds.google&lt;/code&gt;, a stable provider-specific user ID that won't change even if the user changes their email address on Google's side. That's the field you want for account linking. Since you get free profile data, this is effective for consumer and developer tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaisvibzux3vf9xzg3lz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaisvibzux3vf9xzg3lz.png" alt="Social login session" width="800" height="1211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a &lt;strong&gt;passkey&lt;/strong&gt; auth, the &lt;code&gt;amr&lt;/code&gt; (Authentication Methods References) claim contains &lt;code&gt;"hwk"&lt;/code&gt; (hardware key) and &lt;code&gt;"user"&lt;/code&gt;. This is the claim compliance teams care about. It's proof that a hardware-bound credential was used, not just a password or a link. Passkey is also the only method here where the private key never leaves the user’s device. Even a full Descope breach couldn’t expose user credentials. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hsm5v3asdr9722ij2gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hsm5v3asdr9722ij2gy.png" alt="Passkey session" width="800" height="892"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Decision Matrix
&lt;/h2&gt;

&lt;p&gt;Here's a condensed version of what I found after thinking through six product contexts for each method:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s1by4v9a4f8ms740j2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s1by4v9a4f8ms740j2f.png" alt="Decision matrix" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Magic link&lt;/strong&gt; is the sweet spot for B2B SaaS and early-stage products. Zero password management, implicit email verification, and simple implementation. However, it falls apart on mobile (context switch to email app kills conversion) and in high-security contexts where email as a sole factor isn't enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Social login&lt;/strong&gt; is the fastest path to activation for consumer and developer tools. GitHub login in particular gives you free org and repo data via the OAuth token, which is useful for developer-focused products. Avoid it for fintech and banking where regulations often require you to own the identity directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Passkey&lt;/strong&gt; is genuinely the best option for mobile-first and high-security context. Phishing-proof by design, the private key never leaves the device. The catch: users still need education on what a passkey is and you need a fallback for older browsers and lost devices.&lt;/p&gt;

&lt;p&gt;Most products should offer at least two methods where one can be the default while the other an alternative. For instance, using magic link as the default and passkey as the upgrade path once users are comfortable.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Failure Simulator
&lt;/h2&gt;

&lt;p&gt;Auth flows break in predictable ways. Understanding those failure points from day one lets you design a seamless recovery experience so users can continue without friction and avoid escalating to support.&lt;/p&gt;

&lt;p&gt;The failure simulator surfaces these scenarios using real Descope error codes and responses. While it doesn’t make live network calls, it replays actual API error outputs so you can explore failure cases without having to intentionally break a real session.&lt;/p&gt;

&lt;p&gt;Magic links expire (Descope's default is 2 minutes). When they do, the &lt;code&gt;onError&lt;/code&gt; callback fires with error code &lt;code&gt;E011303&lt;/code&gt;. Your UI should catch this and offer to send a new link, not show a generic error message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu90m6m8cyp4cr8woo4ap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu90m6m8cyp4cr8woo4ap.png" alt="Error code for expired magic link" width="800" height="710"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Social login gets cancelled. Users click "Continue with Google," see the permissions screen, and hit Cancel. That fires &lt;code&gt;E062503&lt;/code&gt;. The right response is to return the user silently to the login screen and treat a cancellation as a choice, not an error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobkmy2bwzlh81m8690e4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobkmy2bwzlh81m8690e4.png" alt="User denied session error for social login" width="800" height="779"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Passkeys on new devices fire &lt;code&gt;E083002&lt;/code&gt; (WebAuthn NotAllowedError). The recovery flow is: fall back to magic link or OTP to verify identity, then offer to enroll a passkey on the new device. This is also why you should never make passkey the only auth method since you always need a fallback for device loss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fersspagsktjwj64cobgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fersspagsktjwj64cobgb.png" alt="Passkey failure error" width="800" height="873"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Stack and Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Next.js 15&lt;/strong&gt; with App Router&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Descope Next.js SDK&lt;/strong&gt; (&lt;code&gt;@descope/nextjs-sdk&lt;/code&gt;) for auth flows and session management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framer Motion&lt;/strong&gt; for tab transitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailwind CSS&lt;/strong&gt; for layout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire setup is about 800 lines of TypeScript across nine files. All core data (steps, session highlights, decision matrix scores, failure scenarios, and code snippets) lives in a single &lt;code&gt;lib/auth.ts&lt;/code&gt; file. Adding a new auth method requires only a single entry point, keeping the system easy to extend.&lt;/p&gt;

&lt;p&gt;To run it yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/carasjung/auth-decision-kit
&lt;span class="nb"&gt;cd &lt;/span&gt;auth-decision-kit
npm &lt;span class="nb"&gt;install
cp&lt;/span&gt; .env.local.example .env.local
&lt;span class="c"&gt;# add your NEXT_PUBLIC_DESCOPE_PROJECT_ID&lt;/span&gt;
npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll need a free Descope account. Once you’ve created your account, grab your Project ID from &lt;a href="https://app.descope.com/settings/project" rel="noopener noreferrer"&gt;app.descope.com/settings/project&lt;/a&gt; and configure a &lt;code&gt;sign-up-or-in&lt;/code&gt; flow with whichever methods you want to test.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Demo to Decision
&lt;/h2&gt;

&lt;p&gt;There are plenty of great auth demos that show how things work. This one focuses on how to choose between them.&lt;/p&gt;

&lt;p&gt;Auth is infrastructure and like many infrastructure decisions, the cost of getting it wrong rarely shows up immediately. It appears later through conversion drop-offs, security tradeoffs, compliance constraints, and migrations.&lt;/p&gt;

&lt;p&gt;While modern tools make it easier to support multiple methods and evolve your approach over time, the decision of what to use and when still requires good judgement upfront. This project is designed to help make that choice more intentional.&lt;/p&gt;

&lt;p&gt;The live demo is at &lt;strong&gt;&lt;a href="https://auth-decision-kit.vercel.app/" rel="noopener noreferrer"&gt;auth-decision-kit.vercel.app&lt;/a&gt;&lt;/strong&gt; and the full source is on &lt;strong&gt;&lt;a href="https://github.com/carasjung/auth-decision-kit" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt;. &lt;/p&gt;

</description>
      <category>authentication</category>
      <category>nextjs</category>
      <category>webdev</category>
      <category>security</category>
    </item>
    <item>
      <title>What Predicts a Hit? I Trained 3 ML Models to Find Out</title>
      <dc:creator>Cara Jung</dc:creator>
      <pubDate>Mon, 06 Apr 2026 07:00:00 +0000</pubDate>
      <link>https://dev.to/carasjung/what-predicts-a-hit-i-trained-3-ml-models-to-find-out-31mj</link>
      <guid>https://dev.to/carasjung/what-predicts-a-hit-i-trained-3-ml-models-to-find-out-31mj</guid>
      <description>&lt;p&gt;In many entertainment adaptation decisions, content selections are still instinct-driven. Maybe a producer was vibing with a story or overheard their Gen Alpha nephew mentioning a GOAT title. This subjective approach has often led to expensive missteps and wasted resources for studios when the feature or show turns into a flop. &lt;/p&gt;

&lt;p&gt;As someone who has worked in the breeding ground of popular webcomics, I asked: what if there was a system that could measure “success potential” of IPs based on real user behavior? Using ML, I wanted to see if I could build a forecasting model that could rank unadapted titles by their predicted commercial success. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For my endeavor, I worked with three datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source material metadata of roughly 1,500 titles that included engagement metrics such as views, likes, subscribers, genre, release schedule, and creator usernames&lt;/li&gt;
&lt;li&gt;Produced show metadata of 1,977 titles including ratings, watcher counts, genre, episode count, and cast&lt;/li&gt;
&lt;li&gt;Historical webcomic adaptation records of 424 cross-referenced titles that went from source material to screen, with data pulled from both sides&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before any modeling, I ran exploratory data analysis on all three and found a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engagement metrics (likes, views, subscribers) were strongly correlated with each other and overall popularity&lt;/li&gt;
&lt;li&gt;Genre and tags correlated with watcher counts in the produced show data&lt;/li&gt;
&lt;li&gt;Creator frequency showed no statistically significant impact on adaptation success, which directly contradicted what studios commonly assume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirqnr4guj1s9aabpv0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirqnr4guj1s9aabpv0v.png" alt="Modeling Pipeline" width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering the Target Variable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One hurdle I ran into was that I couldn't directly measure adaptation “success" from the source material side alone. So I engineered a composite Popularity Score by normalizing and combining views, likes, and subscribers into a single metric representing audience appeal, which became the target variable for prediction.&lt;/p&gt;

&lt;p&gt;For the produced show data, I created a parallel score using rating and watcher count.&lt;/p&gt;

&lt;p&gt;Since correlation analysis confirmed that source popularity and show popularity moved together in historical adaptations, I used source popularity as a proxy target.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4zeu7ke58soe0baxo2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4zeu7ke58soe0baxo2z.png" alt="Close overlap between actual and predicted curves" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple vs Complex Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I implemented three models: Random Forest, XGBoost, and Ridge Regression. If you worked with ML models, there’s an expectation that the more complex models will win. However, this wasn’t the case. Ridge Regression became the unexpected underdog model that won:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dqerf5x3rq1u3tnxs0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dqerf5x3rq1u3tnxs0d.png" alt="Cross-validation applied across all three models to reduce overfitting risk and validate stability on the adaptation dataset." width="800" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I cross-validated all three models to reduce overfitting and validate stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Likes = Success&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using standardized coefficients for feature importance in the Ridge model, the ranking was as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Likes (strongest predictor by a significant margin)&lt;/li&gt;
&lt;li&gt;Views&lt;/li&gt;
&lt;li&gt;Subscribers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The factors that studios often focus on such as creator reputation, genre, rating, and engagement rate showed weak or no statistical significance.&lt;/p&gt;

&lt;p&gt;I validated this further using Mann-Whitney U tests comparing adapted titles against the general pool. Adapted titles showed significantly higher “likes” than non-adapted ones and the difference was meaningful.&lt;/p&gt;

&lt;p&gt;Feature Importance for Ridge regression(standardized coefficients)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pnlaapyh3eq148fjh7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pnlaapyh3eq148fjh7m.png" alt="Creator, genre, and rating showed no statistically significant impact and were excluded from the final model" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So why “likes”? &lt;/p&gt;

&lt;p&gt;One interpretation is that likes are intentional. A view can be passive while a subscription can be habitual. But giving a “like” is an act of emotional investment and this behavior is exactly what translates from IP to screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The final model produced a ranked list of the top 10 unadapted webcomic titles by predicted success, along with contextual signals for each including genre appeal, subscriber trends, engagement consistency, and creator track record.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fj4jgpy0pa8lwvh94wa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fj4jgpy0pa8lwvh94wa.png" alt="Top unadapted titles" width="800" height="603"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qualitative review of the top 10 confirmed alignment with the engagement patterns seen in historically successful adaptations. Cliff's Delta calculations showed that the predicted top titles had significantly higher likes than past adaptations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations on the Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Part of doing good data work is being honest about the limitations. There were a few things that fell short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small adaptation dataset. 424 entries is workable, but more data would reduce overfitting risk and better generalization.&lt;/li&gt;
&lt;li&gt;Proxy target variable. Using source popularity instead of actual show performance is a justified simplification, but it means the model can't fully capture real-world production quality, casting, or distribution reach.&lt;/li&gt;
&lt;li&gt;Categorical features dropped. Creator and genre have too many levels and their coefficients dominated the model without adding significance. Excluding them improved interpretability but at the cost of losing some nuance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I'd Do Next&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If I extended this project, I'd rethink how signal is captured and focus on the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NLP for deeper context

&lt;ul&gt;
&lt;li&gt;Synopsis embeddings or sentiment analysis on reader reviews could capture thematic richness that raw engagement metrics miss.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Take a hybrid ranking approach

&lt;ul&gt;
&lt;li&gt;Combining regression with a learning-to-rank algorithm could improve recommendation quality at the top of the list, where small differences actually matter.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Longitudinal validation

&lt;ul&gt;
&lt;li&gt;The real test is tracking what happens when predicted titles actually get produced. Building a feedback loop into the model would sharpen it over time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core insight here doesn’t only strictly apply to entertainment. It can apply to decisions that are being made by intuition or legacy practice. As the models showed, behavioral signals from real users outperform assumptions about what will succeed.&lt;/p&gt;

&lt;p&gt;Likes beat creator prestige. Engagement beat genre conventions. The audience’s preferences, not the ones from industry decision makers, predicted outcomes more reliably.&lt;/p&gt;

&lt;p&gt;Whether you're choosing which content to produce, which features to build, or which markets to enter, the same principle applies. The answers are within the data, but we often overlook the right signals. &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
