<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alper Ortac</title>
    <description>The latest articles on DEV Community by Alper Ortac (@alp).</description>
    <link>https://dev.to/alp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F583286%2F388dcc77-638a-47db-9dee-8f54f037b45c.jpg</url>
      <title>DEV Community: Alper Ortac</title>
      <link>https://dev.to/alp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alp"/>
    <language>en</language>
    <item>
      <title>A Data Pipeline for 1 million movies and 10 million streaming links</title>
      <dc:creator>Alper Ortac</dc:creator>
      <pubDate>Sun, 01 Dec 2024 19:01:36 +0000</pubDate>
      <link>https://dev.to/alp/a-data-pipeline-for-1-million-movies-and-10-million-streaming-links-1h6a</link>
      <guid>https://dev.to/alp/a-data-pipeline-for-1-million-movies-and-10-million-streaming-links-1h6a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Feb 2023&lt;/strong&gt;: I wanted to see all scores for movies + tv shows and where to stream them on one page but couldn't find an aggregator that included all sources that were relevant for me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mar 2023&lt;/strong&gt;: So, I built an MVP that grabbed scores on the fly and &lt;a href="https://goodwatch.app/" rel="noopener noreferrer"&gt;put the site online&lt;/a&gt;. It worked, but was slow (10 seconds to display scores).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Oct 2023&lt;/strong&gt;: Realizing that storing data on my side is a necessity, I discovered &lt;a href="https://www.windmill.dev/" rel="noopener noreferrer"&gt;windmill.dev&lt;/a&gt;. It eclipses similar orchestration engines easily - at least for my needs.&lt;/p&gt;




&lt;p&gt;Fast forward to today and &lt;strong&gt;after 12 months of continuous data munching&lt;/strong&gt;, I want to share how the pipeline works in detail. You'll learn how to build a complex system that grabs data from many different sources, normalizes it and combines it into an optimized format for querying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pics or didn't happen!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ghcu3xmku5sf8v0p09p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ghcu3xmku5sf8v0p09p.png" alt="GoodWatch Flow Runs on Windmill" width="800" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the Runs view. Every dot represents a flow run. A flow can be anything, for example a simple one-step script:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6nwikmmr0iqpb1lkzxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6nwikmmr0iqpb1lkzxx.png" alt="GoodWatch Flow to fetch Daily Data Dump from TMDB" width="640" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The block in the center contains a script like this (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;tmdb_extract_daily_dump_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tmdb_extract_daily_dump_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checking TMDB for latest daily dumps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;init_mongodb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;daily_dump_infos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_daily_dump_infos&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;daily_dump_info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;daily_dump_infos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;download_zip_and_store_in_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;daily_dump_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;close_mongodb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_mongo&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;daily_dump_infos&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following &lt;strong&gt;beast&lt;/strong&gt; is also a flow (remember, this is only one of the green dots):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6w0e0dvdp0pr5k47g9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6w0e0dvdp0pr5k47g9m.png" alt="GoodWatch Priority Flow to fetch and store all related data to a movie or show" width="800" height="899"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(higher resolution image: &lt;a href="https://i.imgur.com/LGhTUGG.png" rel="noopener noreferrer"&gt;https://i.imgur.com/LGhTUGG.png&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Let's break this one down:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get the &lt;strong&gt;next prioritized&lt;/strong&gt; movie or tv show (see next section)&lt;/li&gt;
&lt;li&gt;Get up-to-date data from &lt;strong&gt;TMDB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Scrape &lt;strong&gt;IMDb&lt;/strong&gt;, &lt;strong&gt;Metacritic&lt;/strong&gt; and &lt;strong&gt;Rotten Tomatoes&lt;/strong&gt; for current scores&lt;/li&gt;
&lt;li&gt;Scrape &lt;strong&gt;TV Tropes&lt;/strong&gt; for... tropes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Huggingface API&lt;/strong&gt; to gather DNA data (will explain below)&lt;/li&gt;
&lt;li&gt;Store high dimensional &lt;strong&gt;vectors for DNA data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store relational data&lt;/strong&gt; for movies, shows and streaming links&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of those steps is more or less complex and involves asynchronous processing.&lt;/p&gt;
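
&lt;p&gt;To give a feeling for what a single step can look like, here is a minimal, hypothetical sketch of fetching two independent sources concurrently with asyncio (the function names and data are made up, not the actual GoodWatch code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def fetch_imdb_score(tmdb_id: int):
    # Placeholder for the real step that scrapes the IMDb page of the title.
    await asyncio.sleep(0.1)
    return {"source": "imdb", "tmdb_id": tmdb_id, "score": 8.1}

async def fetch_metacritic_score(tmdb_id: int):
    # Placeholder for the real step that scrapes the Metacritic page.
    await asyncio.sleep(0.1)
    return {"source": "metacritic", "tmdb_id": tmdb_id, "score": 77}

async def refresh_title(tmdb_id: int):
    # Independent sources can be fetched concurrently instead of one by one.
    results = await asyncio.gather(
        fetch_imdb_score(tmdb_id),
        fetch_metacritic_score(tmdb_id),
    )
    return list(results)

def main(tmdb_id: int = 603):
    return asyncio.run(refresh_title(tmdb_id))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;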

&lt;h2&gt;
  
  
  Where do you start? Priority Queue
&lt;/h2&gt;

&lt;p&gt;To determine which titles to pick next, two lanes are processed in parallel. This is another area where Windmill shines: parallelization and orchestration work flawlessly with its architecture.&lt;/p&gt;

&lt;p&gt;The two lanes to pick the next item are:&lt;/p&gt;

&lt;h3&gt;
  
  
  Lane 1: Flows for each data source separately
&lt;/h3&gt;

&lt;p&gt;First of all, &lt;strong&gt;titles that don't have any data attached will be selected for each data source&lt;/strong&gt;. That means if the Metacritic pipeline has a movie that wasn't scraped yet, it will be selected next. This makes sure that every title is processed at least once, including new ones.&lt;/p&gt;

&lt;p&gt;Once every title has attached data, &lt;strong&gt;the pipeline selects those with the least recent data&lt;/strong&gt;.&lt;/p&gt;
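
&lt;p&gt;As a rough sketch, the selection for a single source could look like this with pymongo (collection and field names are illustrative, not the actual schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pymongo import ASCENDING, MongoClient

def select_next_titles(batch_size: int = 50):
    # Connection string and collection name are placeholders.
    client = MongoClient("mongodb://localhost:27017")
    collection = client["goodwatch"]["metacritic_ratings"]

    # 1) Titles that were never scraped for this source come first.
    never_scraped = list(collection.find({"scraped_at": None}).limit(batch_size))
    if never_scraped:
        return never_scraped

    # 2) Otherwise pick the titles with the least recent data.
    return list(
        collection.find({})
        .sort("scraped_at", ASCENDING)
        .limit(batch_size)
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;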

&lt;p&gt;Here is an example of such a flow run, this one with an error because the rate limit was hit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuccw6zhpf75yw0a08szo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuccw6zhpf75yw0a08szo.png" alt="Flow to grab scores from Metacritic" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Windmill makes it easy to define retries for each step in a flow. In this case, the logic is to retry three times in case of errors. If the rate limit was hit (usually signaled by a specific status code or error message), we stop immediately instead.&lt;/p&gt;
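
&lt;p&gt;Inside a step, telling a retryable error apart from a rate limit could look roughly like this (a sketch; the exception name and status check are my assumptions, not Windmill's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

class RateLimitHit(Exception):
    """Signals that we should stop instead of letting the retry policy run."""

def fetch_page(url: str):
    response = requests.get(url, timeout=30)

    if response.status_code == 429:
        # Rate limit: abort immediately instead of retrying and hammering the site.
        raise RateLimitHit(f"rate limited by {url}")

    # Any other error may be transient, so the flow's retry policy
    # (three attempts in this case) is allowed to try again.
    response.raise_for_status()
    return response.text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;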

&lt;h3&gt;
  
  
  Lane 2: Priority Flow for each movie/show separately
&lt;/h3&gt;

&lt;p&gt;The above works, but has a serious issue: &lt;strong&gt;recent releases are not updated quickly enough&lt;/strong&gt;. It can take weeks or even months until every data aspect has been successfully fetched. For example, a movie might have a recent IMDb score while the other scores are outdated and the streaming links are missing completely. Especially for scores and streaming availability, I wanted to achieve much better accuracy.&lt;/p&gt;

&lt;p&gt;To solve this problem, the second lane focuses on a different prioritization strategy: &lt;strong&gt;The most popular and trending movies/shows are selected for a complete data refresh across all data sources.&lt;/strong&gt; I showed this flow before; it's the one I referred to as the &lt;strong&gt;beast&lt;/strong&gt; earlier.&lt;/p&gt;

&lt;p&gt;Titles that are shown more often on the app get a priority boost as well. That means that every time a movie or show comes up in the top search results or its details view is opened, it will likely be refreshed soon.&lt;/p&gt;

&lt;p&gt;Every title can only be &lt;strong&gt;refreshed once per week&lt;/strong&gt; using the priority lane to ensure that we don't fetch data that likely hasn't changed in the meantime.&lt;/p&gt;
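
&lt;p&gt;Expressed as a simplified sketch (the field names and weights are invented for illustration, not the real scoring):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def priority_score(title: dict):
    # Popular and trending titles rank highest; every search result impression
    # or opened details view increases recent_views and boosts them further.
    return title.get("popularity", 0.0) + 10.0 * title.get("recent_views", 0)

def pick_next_priority_titles(candidates: list, batch_size: int = 10):
    # candidates are assumed to be pre-filtered so that titles refreshed
    # within the last week are already excluded (the once-per-week rule).
    ranked = sorted(candidates, key=priority_score, reverse=True)
    return ranked[:batch_size]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;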

&lt;h2&gt;
  
  
  Are you allowed to do this? Scraping Considerations
&lt;/h2&gt;

&lt;p&gt;You might ask: Is scraping legal? The act of grabbing the data is normally fine. What you do with the data needs careful consideration though. &lt;strong&gt;As soon as you make profit from a service that uses scraped data, you are probably violating their terms and conditions.&lt;/strong&gt; (see &lt;a href="https://www.quinnemanuel.com/the-firm/publications/the-legal-landscape-of-web-scraping/" rel="noopener noreferrer"&gt;The Legal Landscape of Web Scraping&lt;/a&gt; and &lt;a href="https://www.eff.org/deeplinks/2018/04/scraping-just-automated-access-and-everyone-does-it" rel="noopener noreferrer"&gt;‘Scraping’ Is Just Automated Access, and Everyone Does It&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Laws around scraping are new, often untested, and leave a lot of legal gray area. I'm determined to cite every source accordingly, respect rate limits and avoid unnecessary requests to minimize the impact on their services.&lt;/p&gt;

&lt;p&gt;The fact is, the data will not be used to make a profit. GoodWatch will be free to use for everyone, forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Work? Yes, Milord
&lt;/h2&gt;

&lt;p&gt;Windmill uses workers to distribute code execution across multiple processes. &lt;strong&gt;Each step in a flow is sent to a worker, which keeps workers independent of the actual business logic.&lt;/strong&gt; Only the main app orchestrates the jobs; workers just receive input data and the code to execute, then return the result.&lt;/p&gt;

&lt;p&gt;It's an efficient architecture that scales nicely. Currently, there are 12 workers splitting the work. They're all hosted on Hetzner.&lt;/p&gt;

&lt;p&gt;Each worker has a maximum resource consumption of 1 vCPU and 2 GB of RAM. Here is an overview:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1k2yhpc42z9ijsm99va.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1k2yhpc42z9ijsm99va.png" alt="Workers Dashboard" width="800" height="757"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Windmill Editor
&lt;/h2&gt;

&lt;p&gt;Windmill offers an in-browser IDE-like editor experience with &lt;strong&gt;linting&lt;/strong&gt;, &lt;strong&gt;auto-formatting&lt;/strong&gt;, an &lt;strong&gt;AI assistant&lt;/strong&gt; and even &lt;strong&gt;collaborative editing&lt;/strong&gt; (last one is a paid feature). The best thing is this button though:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85s37rykht327p7zcfbm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85s37rykht327p7zcfbm.png" alt="Windmill Script Editor with highlighted Test Button" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It allows me to quickly iterate and test scripts before deploying them. I usually edit and test files in the browser and push them to git when I'm finished.&lt;/p&gt;

&lt;p&gt;The only thing missing for an optimal coding environment is &lt;strong&gt;debugging tools&lt;/strong&gt; (breakpoints &amp;amp; variable context). For now, I'm debugging scripts in my local IDE to work around this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers. I like Numbers
&lt;/h2&gt;

&lt;p&gt;Me too!&lt;/p&gt;

&lt;p&gt;Currently GoodWatch requires around &lt;strong&gt;100 GB of persistent data storage&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;15 GB&lt;/strong&gt; for raw preprocessing data (MongoDB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;23 GB&lt;/strong&gt; for processed relational data (Postgres)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;67 GB&lt;/strong&gt; for vector data (Postgres)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every day, &lt;strong&gt;6,500 flows&lt;/strong&gt; run through Windmill's orchestration engine. This results in a daily volume of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30,000&lt;/strong&gt; IMDb pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9,000&lt;/strong&gt; TV Tropes pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5,000&lt;/strong&gt; Rotten Tomatoes pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,500&lt;/strong&gt; Huggingface prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;600&lt;/strong&gt; Metacritic pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These numbers differ so much because each source enforces different rate limit policies.&lt;/p&gt;

&lt;p&gt;Once per day, data is cleaned up and combined into the final data format. Currently the database that powers the GoodWatch webapp stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10 million&lt;/strong&gt; streaming links&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 million&lt;/strong&gt; movies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;300k&lt;/strong&gt; DNA values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;200k&lt;/strong&gt; tv shows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;70k&lt;/strong&gt; movies/shows with DNA&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's that DNA you keep talking about?
&lt;/h2&gt;

&lt;p&gt;Imagine you could only distinguish movies by their genre. Extremely limiting, right?&lt;/p&gt;

&lt;p&gt;That's why I started the DNA project. It allows categorizing movies and shows by other attributes like &lt;strong&gt;Mood&lt;/strong&gt;, &lt;strong&gt;Plot Element&lt;/strong&gt;s, &lt;strong&gt;Character Types&lt;/strong&gt;, &lt;strong&gt;Dialog&lt;/strong&gt; or &lt;strong&gt;Key Props&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here are the top 10 DNA values across all items:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhy1g36b1w8utxhexw1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhy1g36b1w8utxhexw1p.png" alt="Top 10 DNA values on GoodWatch" width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It allows two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filter by DNA values (using relational data)&lt;/li&gt;
&lt;li&gt;Search by similarity (using vector data)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://goodwatch.app/discover?similarDNA=Mood_Melancholic&amp;amp;similarDNACombinationType=all" rel="noopener noreferrer"&gt;Melancholic Mood&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://goodwatch.app/discover?similarTitles=693134_movie_Sub-Genres|Mood|Themes|Plot" rel="noopener noreferrer"&gt;Similar Story as Dune: Part Two&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
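
&lt;p&gt;Under the hood, the similarity search boils down to a nearest-neighbor lookup over those high-dimensional vectors. Conceptually it works like this tiny NumPy sketch (in production it runs as an indexed vector query inside Postgres, not as a Python loop):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray):
    # 1.0 means the two DNA embeddings point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query_vector: np.ndarray, catalog: dict, top_k: int = 10):
    # catalog maps a title id to its stored DNA vector.
    scored = [
        (title_id, cosine_similarity(query_vector, vector))
        for title_id, vector in catalog.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;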

&lt;p&gt;I'm planning a dedicated blog post about the DNA in much more detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deeper Dive into the Data Pipeline
&lt;/h2&gt;

&lt;p&gt;To fully understand how the data pipeline works, here is a breakdown of what happens for each data source:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Once a day, a MongoDB collection is updated with all required input data
&lt;/h3&gt;

&lt;p&gt;For each data source there is an &lt;code&gt;init&lt;/code&gt; flow that prepares a MongoDB collection with all required data. For IMDb, that's just the &lt;code&gt;imdb_id&lt;/code&gt;. For Rotten Tomatoes, the &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;release_year&lt;/code&gt; are required. That's because the ID is unknown and we need to guess the correct URL based on the name.&lt;/p&gt;
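
&lt;p&gt;The URL guessing for Rotten Tomatoes could look roughly like this (a hypothetical sketch; the real matching handles far more edge cases):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def rotten_tomatoes_url_candidates(title: str, release_year: int):
    # There is no stable external ID for Rotten Tomatoes, so we build slug
    # guesses from the title and release year and try them in order.
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return [
        f"https://www.rottentomatoes.com/m/{slug}_{release_year}",
        f"https://www.rottentomatoes.com/m/{slug}",
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;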

&lt;h3&gt;
  
  
  2. Continuously fetch data and write it into the MongoDB collection
&lt;/h3&gt;

&lt;p&gt;Based on the priority selection explained above, items in the prepared collections are updated with the data that is fetched. Each data source has its own collection, which becomes more and more complete over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Once a day, various flows collect the data from the MongoDB collections and write them into Postgres
&lt;/h3&gt;

&lt;p&gt;There is a flow for movies, one for tv shows and another one for streaming links. They collect all necessary data from various collections and store them in their respective Postgres tables, which are then queried by the web application.&lt;/p&gt;

&lt;p&gt;Here is an excerpt of the copy movies flow and script:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhw55e5r9pw88l5lkxpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhw55e5r9pw88l5lkxpl.png" alt="Copy movies script" width="800" height="619"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of these flows take a long time to execute, sometimes even longer than 6 hours. This can be optimized by flagging all items that were updated and only copying those instead of batch processing the whole data set. One of many TODO items on my list 😅&lt;/p&gt;
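
&lt;p&gt;One way that incremental copy could look, assuming every document carries an &lt;code&gt;updated_at&lt;/code&gt; timestamp (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta

def fetch_changed_movies(collection, since_hours: int = 24):
    # Only copy documents that were touched since the last run instead of
    # batch-processing the whole collection every day.
    cutoff = datetime.utcnow() - timedelta(hours=since_hours)
    return list(collection.find({"updated_at": {"$gte": cutoff}}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;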

&lt;h2&gt;
  
  
  Scheduling
&lt;/h2&gt;

&lt;p&gt;Scheduling is as easy as defining cron expressions for each flow or script that needs to be executed automatically:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbkt0h84co6qtams47bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbkt0h84co6qtams47bn.png" alt="Schedule definition for Priority flow" width="800" height="933"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an excerpt of all schedules that are defined for GoodWatch:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruqkgr3oz5g9mqzqxwow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruqkgr3oz5g9mqzqxwow.png" alt="Schedules Overview" width="800" height="706"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In total there are around 50 schedules defined.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;With great data comes great responsibility. Lots can go wrong. And it did.&lt;/p&gt;

&lt;h3&gt;
  
  
  Very slow processing
&lt;/h3&gt;

&lt;p&gt;Early versions of my scripts took ages to update all entries in a collection or table. That was because I upserted every item individually, which causes a lot of overhead and slows down the process significantly.&lt;/p&gt;

&lt;p&gt;A much better approach is to collect data to be upserted and batch the database queries. Here is an example for MongoDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# Process movies in batches
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_movies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_movies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing movies &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;tmdb_movie_cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;tmdb_movie_collection&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imdb_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$ne&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tmdb_movie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tmdb_movie_cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_operation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmdb_entry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tmdb_movie&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;movie_operations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Do all operations in bulk
&lt;/span&gt;    &lt;span class="n"&gt;bulk_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bulk_write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie_operations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
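
&lt;p&gt;For completeness, the &lt;code&gt;build_operation&lt;/code&gt; helper boils down to one upsert per document, roughly like this (a simplified sketch; the real helper sets more fields):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pymongo import UpdateOne

def build_operation(tmdb_entry: dict):
    # Upsert keyed by the TMDB id: update the document if it already exists,
    # insert it otherwise. All collected operations go into one bulk_write call.
    return UpdateOne(
        {"tmdb_id": tmdb_entry["id"]},
        {"$set": tmdb_entry},
        upsert=True,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;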



&lt;h3&gt;
  
  
  Memory hungry scripts
&lt;/h3&gt;

&lt;p&gt;Even with batch processing, some scripts consumed so much memory that the workers crashed. The solution was to carefully fine-tune the batch size for every use case.&lt;/p&gt;

&lt;p&gt;Some batches are fine to run in steps of &lt;code&gt;5000&lt;/code&gt;; others store much more data in memory and run better with &lt;code&gt;500&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Windmill has a great feature to observe the memory while a script is running:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v02yzaqxdnuca23rrgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v02yzaqxdnuca23rrgr.png" alt="Script run memory display" width="505" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Windmill is a great asset in any developer's toolkit for automating tasks. It's been an invaluable productivity booster for me, allowing me to focus on the flow structure and business logic while outsourcing the heavy lifting of task orchestration, error handling, retries and caching.&lt;/p&gt;

&lt;p&gt;Handling large volumes of data is still challenging, and optimizing the pipeline is an ongoing process - but I'm really happy with how everything has turned out so far.&lt;/p&gt;

&lt;h2&gt;
  
  
  Okay, okay. That's enough
&lt;/h2&gt;

&lt;p&gt;Thought so. Just let me link a few resources and we're finished:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://goodwatch.app/" rel="noopener noreferrer"&gt;GoodWatch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/TVAcrfQzcA" rel="noopener noreferrer"&gt;GoodWatch Discord Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.windmill.dev/" rel="noopener noreferrer"&gt;Windmill&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.com/invite/V7PM2YHsPB" rel="noopener noreferrer"&gt;Windmill Discord Community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Did you know that &lt;strong&gt;GoodWatch is open-source&lt;/strong&gt;? You can take a look at all scripts and flow definitions in this repository: &lt;a href="https://github.com/alp82/goodwatch-monorepo/tree/main/goodwatch-flows/windmill/f" rel="noopener noreferrer"&gt;https://github.com/alp82/goodwatch-monorepo/tree/main/goodwatch-flows/windmill/f&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me know if you have any questions.&lt;/p&gt;




&lt;p&gt;This is a series of blog posts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/alp/what-do-you-want-to-watch-next-this-is-why-i-built-goodwatch-1cn2"&gt;What do you want to watch next? This is why I built GoodWatch.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A Data Pipeline for 1 million movies and 10 million streaming links &lt;em&gt;(you are here)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>windmill</category>
      <category>python</category>
      <category>movies</category>
      <category>goodwatch</category>
    </item>
    <item>
      <title>What do you want to watch next? This is why I built GoodWatch.</title>
      <dc:creator>Alper Ortac</dc:creator>
      <pubDate>Mon, 06 May 2024 03:19:13 +0000</pubDate>
      <link>https://dev.to/alp/what-do-you-want-to-watch-next-this-is-why-i-built-goodwatch-1cn2</link>
      <guid>https://dev.to/alp/what-do-you-want-to-watch-next-this-is-why-i-built-goodwatch-1cn2</guid>
      <description>&lt;p&gt;Ever scrolled through Netflix, Disney+ or Hulu and wondered what to watch next?&lt;/p&gt;

&lt;p&gt;You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Curate your own watchlist&lt;/li&gt;
&lt;li&gt;Delve into recommendations from streaming services&lt;/li&gt;
&lt;li&gt;Use one of countless websites and apps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Like many, I tried all of the above. They worked, to some extent. But I wanted more.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;What if you could find all scores and streaming information for movies and TV shows on one single page? This was the vision for &lt;a href="https://goodwatch.app/" rel="noopener noreferrer"&gt;GoodWatch&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6thgazq1ki2hk5frg0z8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6thgazq1ki2hk5frg0z8.png" alt="The Boys with scores and streaming info" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looks nice, I thought. So I built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prototype
&lt;/h2&gt;

&lt;p&gt;I quickly built a &lt;a href="https://remix.run" rel="noopener noreferrer"&gt;Remix&lt;/a&gt; web app that fetched movie and TV data from &lt;a href="https://tmdb.com" rel="noopener noreferrer"&gt;TMDB&lt;/a&gt; and grabbed scores from different sites in real-time. After only three weeks, I was proud to deliver a working prototype that brought immediate value.&lt;/p&gt;

&lt;p&gt;I used it almost daily. Loading a movie page took a few seconds though, so I needed a better solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Need for Data
&lt;/h2&gt;

&lt;p&gt;I compared many workflow engines and ended up choosing the excellent &lt;a href="https://windmill.dev" rel="noopener noreferrer"&gt;Windmill&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It allows me to define data pipelines like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs80j2ix2jsy7eaox84rf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs80j2ix2jsy7eaox84rf.png" alt="Data Pipeline for fetching movie and tv data" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Baseline data is fetched from the wonderful &lt;a href="https://developer.themoviedb.org/docs/getting-started" rel="noopener noreferrer"&gt;TMDB API&lt;/a&gt;. Then it gets enriched with ratings from IMDb, Metacritic, and Rotten Tomatoes, plus metadata from TV Tropes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenges
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges is to map the correct URLs for each data source. Some (not so) clever techniques are needed to find the correct matches.&lt;/p&gt;
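
&lt;p&gt;For illustration, one such technique is fuzzy matching between the title we are looking for and the candidate titles found on the target site (a hypothetical sketch, not the exact implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from difflib import SequenceMatcher

def best_title_match(wanted_title: str, candidates: list):
    # Pick the candidate page title that is closest to the one we are looking for.
    def similarity(candidate: str):
        return SequenceMatcher(None, wanted_title.lower(), candidate.lower()).ratio()

    return max(candidates, key=similarity)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;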

&lt;p&gt;Another trade-off is how to deal with update frequencies. The database is around 15 GB in size, and data updates need to be carefully orchestrated.&lt;/p&gt;

&lt;p&gt;It goes without saying that each service's rate limiting policies are respected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Developed with &lt;a href="https://remix.run" rel="noopener noreferrer"&gt;Remix&lt;/a&gt;, hosted on &lt;a href="https://vercel.com" rel="noopener noreferrer"&gt;Vercel&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Handling&lt;/strong&gt;: Utilizes &lt;a href="https://windmill.dev" rel="noopener noreferrer"&gt;Windmill&lt;/a&gt; for data pipelines, with a primary database powered by &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;. Auxiliary data storage is handled by &lt;a href="https://www.mongodb.com/" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt;, with &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; for caching to optimize performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosting and Deployment&lt;/strong&gt;: All services are deployed on &lt;a href="https://www.hetzner.com/cloud/" rel="noopener noreferrer"&gt;Hetzner Cloud&lt;/a&gt; using Docker Compose&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Obvious
&lt;/h2&gt;

&lt;p&gt;Visit &lt;a href="https://goodwatch.app/" rel="noopener noreferrer"&gt;GoodWatch&lt;/a&gt; and see how it makes movies and TV shows much easier to discover.&lt;/p&gt;

&lt;p&gt;The project is open-source and available on GitHub: &lt;a href="https://github.com/alp82/goodwatch-monorepo" rel="noopener noreferrer"&gt;goodwatch-monorepo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join the community on Discord: &lt;a href="https://discord.com/invite/TVAcrfQzcA" rel="noopener noreferrer"&gt;GoodWatch on Discord&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future
&lt;/h2&gt;

&lt;p&gt;The ultimate goal is to create the best recommendation engine since &lt;a href="https://en.wikipedia.org/wiki/Jinni_(search_engine)" rel="noopener noreferrer"&gt;Jinni&lt;/a&gt; shut down in 2015.&lt;/p&gt;

&lt;p&gt;I'm aiming to build a tool that understands your unique taste to help you discover hidden gems.&lt;/p&gt;

&lt;p&gt;Feel free to ask any questions in the comments or suggest features you’d love to see!&lt;/p&gt;




&lt;p&gt;This is a series of blog posts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What do you want to watch next? This is why I built GoodWatch. &lt;em&gt;(you are here)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/alp/a-data-pipeline-for-1-million-movies-and-10-million-streaming-links-1h6a"&gt;A Data Pipeline for 1 million movies and 10 million streaming links&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>webdev</category>
      <category>typescript</category>
      <category>python</category>
      <category>goodwatch</category>
    </item>
  </channel>
</rss>
