<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shaan Satsangi</title>
    <description>The latest articles on DEV Community by Shaan Satsangi (@shaanalpha).</description>
    <link>https://dev.to/shaanalpha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3968891%2Fde013e6c-25f4-4f17-846a-8ca0c349c0d1.jpeg</url>
      <title>DEV Community: Shaan Satsangi</title>
      <link>https://dev.to/shaanalpha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shaanalpha"/>
    <language>en</language>
    <item>
      <title>I built a Databricks medallion lakehouse to roast my own YouTube history (Bronze → Silver → Gold → Existential Dread)</title>
      <dc:creator>Shaan Satsangi</dc:creator>
      <pubDate>Thu, 04 Jun 2026 22:54:47 +0000</pubDate>
      <link>https://dev.to/shaanalpha/i-built-a-databricks-medallion-lakehouse-to-roast-my-own-youtube-history-bronze-silver-gold--2m9m</link>
      <guid>https://dev.to/shaanalpha/i-built-a-databricks-medallion-lakehouse-to-roast-my-own-youtube-history-bronze-silver-gold--2m9m</guid>
      <description>&lt;p&gt;There's a normal way to analyze your YouTube watch history. You export it from Google Takeout, open a Jupyter notebook, call &lt;code&gt;pd.read_json()&lt;/code&gt;, run a couple of &lt;code&gt;value_counts()&lt;/code&gt;, feel a brief flicker of shame, and close the laptop.&lt;/p&gt;

&lt;p&gt;I did not do that.&lt;/p&gt;

&lt;p&gt;Instead I built a full &lt;strong&gt;Bronze → Silver → Gold medallion lakehouse&lt;/strong&gt; on Databricks — Delta Lake, PySpark, an enrichment layer that calls the YouTube Data API, a FastAPI serving tier, a Neon Postgres warehouse, and a Next.js 16 frontend with animated cards — to discover that I watch a concerning amount of YouTube at 2 AM.&lt;/p&gt;

&lt;p&gt;It's called &lt;strong&gt;YouTube Wrapped&lt;/strong&gt;: Spotify Wrapped, but for the platform you actually spend your life on. &lt;a href="https://youtube-wrapped-by-shaan.vercel.app" rel="noopener noreferrer"&gt;Live demo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The premise&lt;/h2&gt;

&lt;p&gt;Google Takeout hands you your watch history as a &lt;code&gt;watch-history.json&lt;/code&gt; that's somehow both enormous and useless. Each record is basically &lt;code&gt;{title, titleUrl, time}&lt;/code&gt; — no genre, no artist, no duration. Just a timestamp and a vibe.&lt;/p&gt;

&lt;p&gt;The goal: turn that raw shame-export into a year-in-review with top artists, genres, "binge sessions," a night-owl score, and a "main character" artist (the one you cannot stop replaying).&lt;/p&gt;

&lt;h2&gt;The architecture (a.k.a. the overkill)&lt;/h2&gt;

&lt;p&gt;This is the part where a sane person uses pandas and I use enterprise data engineering for a personal hobby question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🥉 Bronze — land the raw JSON.&lt;/strong&gt; Dump Takeout exactly as-is into Delta. Zero transformations. If I break something downstream, the source of truth never moved. (Also: never trust your own parser on the first run.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🥈 Silver — clean, type, deduplicate.&lt;/strong&gt; Parse timestamps into real datetimes, normalize titles (YouTube prefixes everything with "Watched "), drop dupes, toss ads and deleted videos. Now it's a table instead of a JSON crime scene.&lt;/p&gt;
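
&lt;p&gt;A minimal sketch of the Silver rules in plain Python, not the PySpark job from the repo. It assumes Takeout-style records with &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;titleUrl&lt;/code&gt;, and an ISO 8601 &lt;code&gt;time&lt;/code&gt; ending in "Z", treats a missing &lt;code&gt;titleUrl&lt;/code&gt; as an ad or deleted video, and dedupes on URL plus timestamp. The function name, the sample URLs, and those two rules are assumptions for illustration.&lt;/p&gt;

```python
from datetime import datetime

PREFIX = "Watched "


def clean_history(records):
    """Return typed, de-duplicated watch events from Takeout-style records."""
    seen = set()
    cleaned = []
    for rec in records:
        url = rec.get("titleUrl")
        # Assumption: entries without a titleUrl are ads or deleted videos.
        if not url:
            continue
        title = rec.get("title", "")
        if title.startswith(PREFIX):
            title = title[len(PREFIX):]
        # Swap the trailing "Z" for an explicit UTC offset before parsing.
        watched_at = datetime.fromisoformat(rec["time"].replace("Z", "+00:00"))
        # Assumption: the same URL at the same instant is a duplicate row.
        key = (url, watched_at)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"title": title, "url": url, "watched_at": watched_at})
    return cleaned


# Hypothetical sample data; the URLs are placeholders.
sample = [
    {"title": "Watched Pasoori", "titleUrl": "https://example.com/watch?v=a1",
     "time": "2024-01-05T02:13:45.000Z"},
    {"title": "Watched Pasoori", "titleUrl": "https://example.com/watch?v=a1",
     "time": "2024-01-05T02:13:45.000Z"},
    {"title": "Watched a video that has been removed",
     "time": "2024-01-05T02:20:00.000Z"},
    {"title": "Watched Some lo-fi mix", "titleUrl": "https://example.com/watch?v=b2",
     "time": "2024-01-05T02:30:00.000Z"},
]
rows = clean_history(sample)
```

&lt;p&gt;Four raw records go in and two clean rows come out: the duplicate and the record with no URL are dropped, and the "Watched " prefix is gone from the titles.&lt;/p&gt;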

&lt;p&gt;&lt;strong&gt;🜚 Enrichment — make the data &lt;em&gt;mean&lt;/em&gt; something.&lt;/strong&gt; The genuinely hard layer. "Watched Pasoori" tells you nothing structured. I hit the YouTube Data API for channel + metadata, then did artist mapping and genre classification — including a &lt;strong&gt;Desi / Western / untagged&lt;/strong&gt; split, because my listening is ~60% Bollywood and off-the-shelf genre tags had no idea what to do with that.&lt;/p&gt;
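
&lt;p&gt;To make the Desi / Western / untagged split concrete, here is a rough sketch of the tagging step once the channel title has already been fetched from the API. The lookup table, its entries, and the &lt;code&gt;enrich&lt;/code&gt; helper are all hypothetical; the real mapping is hand-curated and much larger.&lt;/p&gt;

```python
# Hypothetical lookup; a real table would be curated from API channel metadata.
CHANNEL_TAGS = {
    "Example Desi Label": {"artist": "Example Artist A", "region": "Desi"},
    "Example Western Label": {"artist": "Example Artist B", "region": "Western"},
}

# Fallback for channels the mapping does not know about.
UNTAGGED = {"artist": None, "region": "untagged"}


def enrich(event, channel_title):
    """Attach artist and region to a watch event, falling back to untagged."""
    tags = CHANNEL_TAGS.get(channel_title, UNTAGGED)
    return {**event, "channel": channel_title, **tags}


events = [
    ({"title": "Pasoori"}, "Example Desi Label"),
    ({"title": "Some lo-fi mix"}, "Unknown Channel"),
]
enriched = [enrich(event, channel) for event, channel in events]
```

&lt;p&gt;Keeping "untagged" as an explicit bucket, instead of guessing, means a bad mapping shows up in the numbers rather than hiding inside one of the two real categories.&lt;/p&gt;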

&lt;p&gt;&lt;strong&gt;🥇 Gold — aggregate into fact tables.&lt;/strong&gt; Pre-computed analytics the dashboard reads instantly: listening rhythm by hour and day-of-week, binge sessions (consecutive runs + durations), loyal artists ranked by span, the night-owl score, and the main-character artist.&lt;/p&gt;
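
&lt;p&gt;A plain-Python sketch of three of those Gold metrics, under definitions that are assumptions rather than the repo's exact rules: a session is a run of plays with no gap longer than 30 minutes, a binge is a session of at least 3 plays, and the night-owl score is the share of plays between midnight and 5 AM. Timestamps are assumed to be in local time already, otherwise the score measures someone else's night.&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime, timedelta


def sessions(timestamps, max_gap=timedelta(minutes=30)):
    """Group timestamps into runs where each consecutive gap is at most max_gap."""
    runs = []
    for ts in sorted(timestamps):
        if not runs or ts - runs[-1][-1] > max_gap:
            runs.append([ts])  # gap too large (or first play): start a new run
        else:
            runs[-1].append(ts)
    return runs


def binge_sessions(timestamps, min_videos=3):
    """Summarize runs with at least min_videos plays: start, count, duration."""
    return [
        {"start": run[0], "videos": len(run), "duration": run[-1] - run[0]}
        for run in sessions(timestamps)
        if len(run) >= min_videos
    ]


def night_owl_score(timestamps, night_hours=range(0, 5)):
    """Share of plays whose hour falls in night_hours (0.0 for no plays)."""
    if not timestamps:
        return 0.0
    late = sum(1 for ts in timestamps if ts.hour in night_hours)
    return late / len(timestamps)


# Hypothetical plays: one late-night run, one stray afternoon play, one short run.
plays = [
    datetime(2024, 1, 5, 1, 50), datetime(2024, 1, 5, 2, 0),
    datetime(2024, 1, 5, 2, 20), datetime(2024, 1, 5, 2, 45),
    datetime(2024, 1, 5, 14, 0),
    datetime(2024, 1, 6, 20, 0), datetime(2024, 1, 6, 20, 10),
]
binges = binge_sessions(plays)
score = night_owl_score(plays)
# Listening rhythm: play counts keyed by (weekday, hour), Monday == 0.
rhythm = Counter((ts.weekday(), ts.hour) for ts in plays)
```

&lt;p&gt;On this sample there is one binge (4 plays over 55 minutes) and 4 of the 7 plays land in the night window. The 30-minute gap and the 3-play threshold are knobs, and moving them changes the results a lot, which is part of why the grain of these tables is worth arguing about.&lt;/p&gt;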

&lt;p&gt;All on &lt;strong&gt;Databricks Free Edition&lt;/strong&gt; with Unity Catalog, because the budget was zero dollars and pure spite.&lt;/p&gt;

&lt;h2&gt;The serving layer&lt;/h2&gt;

&lt;p&gt;Gold tables export to CSV and load into &lt;strong&gt;Neon Postgres&lt;/strong&gt; via a little &lt;code&gt;load_to_neon.py&lt;/code&gt;. A &lt;strong&gt;FastAPI&lt;/strong&gt; backend (SQLAlchemy + Uvicorn) exposes 15+ endpoints — overview totals, top artists/channels/genres, the rhythm heatmap, binge stats. (&lt;a href="https://youtube-wrapped-api.onrender.com/docs" rel="noopener noreferrer"&gt;API docs are live&lt;/a&gt; if you want to poke them.)&lt;/p&gt;

&lt;p&gt;Frontend is &lt;strong&gt;Next.js 16 / React 19 / Tailwind 4&lt;/strong&gt;, Recharts for graphs, Framer Motion for the Wrapped-style card reveals. Vercel + Render + Neon.&lt;/p&gt;

&lt;h2&gt;What I actually learned&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The medallion pattern earns its keep even when it's overkill.&lt;/strong&gt; Every time my enrichment logic was wrong (often), I re-ran from Silver, and Bronze never flinched. That immutable raw layer saved me more times than I'll admit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enrichment is 80% of the work.&lt;/strong&gt; Ingesting and cleaning are easy. Turning "Watched [title]" into "Punjabi track, this artist, this genre" is where the real engineering hides.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your data will roast you for free.&lt;/strong&gt; The night-owl score does not lie.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The honest part&lt;/h2&gt;

&lt;p&gt;Could this have been a notebook and 40 lines of pandas? Absolutely. But I wanted real end-to-end data-engineering reps — Delta Lake, medallion layering, an enrichment API, a serving warehouse, a real frontend — on a dataset I actually cared about. Building it on something personal made every architecture decision stick harder than another Titanic clone ever would.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔴 Demo: &lt;a href="https://youtube-wrapped-by-shaan.vercel.app" rel="noopener noreferrer"&gt;https://youtube-wrapped-by-shaan.vercel.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 Code: &lt;a href="https://github.com/shaan-alpha/Youtube-Wrapped" rel="noopener noreferrer"&gt;https://github.com/shaan-alpha/Youtube-Wrapped&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📑 API: &lt;a href="https://youtube-wrapped-api.onrender.com/docs" rel="noopener noreferrer"&gt;https://youtube-wrapped-api.onrender.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data engineers: I'd genuinely love a sanity check on the Gold grain: am I pre-aggregating at the right level, or should the rhythm/binge tables stay lower-level and let the API roll them up? Roast the architecture in the comments; roasting is what this project is for, after all.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>databricks</category>
    </item>
  </channel>
</rss>
