DEV Community

Cover image for I built a Databricks medallion lakehouse to roast my own YouTube history (Bronze Silver Gold Existential Dread)
Shaan Satsangi
Shaan Satsangi

Posted on

I built a Databricks medallion lakehouse to roast my own YouTube history (Bronze Silver Gold Existential Dread)

There's a normal way to analyze your YouTube watch history. You export it from Google Takeout, open a Jupyter notebook, pd.read_json(), run a couple of value_counts(), feel a brief flicker of shame, and close the laptop.

I did not do that.

Instead I built a full Bronze → Silver → Gold medallion lakehouse on Databricks — Delta Lake, PySpark, an enrichment layer that calls the YouTube Data API, a FastAPI serving tier, a Neon Postgres warehouse, and a Next.js 16 frontend with animated cards — to discover that I watch a concerning amount of YouTube at 2 AM.

It's called YouTube Wrapped: Spotify Wrapped, but for the platform you actually spend your life on. Live demo.

The premise

Google Takeout hands you your watch history as a watch-history.json that's somehow both enormous and useless. Each record is basically {title, titleUrl, time} — no genre, no artist, no duration. Just a timestamp and a vibe.

The goal: turn that raw shame-export into a year-in-review with top artists, genres, "binge sessions," a night-owl score, and a "main character" artist (the one you cannot stop replaying).

The architecture (a.k.a. the overkill)

This is the part where a sane person uses pandas and I use enterprise data engineering for a personal hobby question.

🥉 Bronze — land the raw JSON. Dump Takeout exactly as-is into Delta. Zero transformations. If I break something downstream, the source of truth never moved. (Also: never trust your own parser on the first run.)

🥈 Silver — clean, type, deduplicate. Parse timestamps into real datetimes, normalize titles (YouTube prefixes everything with "Watched "), drop dupes, toss ads and deleted videos. Now it's a table instead of a JSON crime scene.

🜚 Enrichment — make the data mean something. The genuinely hard layer. "Watched Pasoori" tells you nothing structured. I hit the YouTube Data API for channel + metadata, then did artist mapping and genre classification — including a Desi / Western / untagged split, because my listening is ~60% Bollywood and off-the-shelf genre tags had no idea what to do with that.

🥇 Gold — aggregate into fact tables. Pre-computed analytics the dashboard reads instantly: listening rhythm by hour and day-of-week, binge sessions (consecutive runs + durations), loyal artists ranked by span, the night-owl score, and the main-character artist.

All on Databricks Free Edition with Unity Catalog, because the budget was zero dollars and pure spite.

The serving layer

Gold tables export to CSV and load into Neon Postgres via a little load_to_neon.py. A FastAPI backend (SQLAlchemy + Uvicorn) exposes 15+ endpoints — overview totals, top artists/channels/genres, the rhythm heatmap, binge stats. (API docs are live if you want to poke them.)

Frontend is Next.js 16 / React 19 / Tailwind 4, Recharts for graphs, Framer Motion for the Wrapped-style card reveals. Vercel + Render + Neon.

What I actually learned

  • The medallion pattern earns its keep even when it's overkill. Every time my enrichment logic was wrong (often), I re-ran from Silver and Bronze never flinched. That immutable raw layer saved me more times than I'll admit.
  • Enrichment is 80% of the work. Ingest and clean are easy. Turning "Watched [title]" into "Punjabi track, this artist, this genre" is where the real engineering hides.
  • Your data will roast you for free. The night-owl score does not lie.

The honest part

Could this have been a notebook and 40 lines of pandas? Absolutely. But I wanted real end-to-end data-engineering reps — Delta Lake, medallion layering, an enrichment API, a serving warehouse, a real frontend — on a dataset I actually cared about. Building it on something personal made every architecture decision stick harder than another Titanic clone ever would.

Links

Data engineers: I'd genuinely love a sanity check on the Gold grain — am I pre-aggregating at the right level, or should the rhythm/binge tables stay lower-level and let the API roll them up? Roast the architecture in the comments. That's what the project's named after, after all.

Top comments (0)