The Data Engineering Skills Gap Nobody Fills, and the Side Project I Finally Finished to Fill It

#devchallenge #githubchallenge

GitHub “Finish-Up-A-Thon” Challenge Submission

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

Petascale Labs: a data engineering learning platform that teaches the stack from the bytes up. Most DE curricula show you which button to click. We teach you why it breaks in production and how to reason about it from first principles.

What makes it ours:

- The Strata model: the data platform as layers: storage & file formats → ingestion → open table formats → compute engines → orchestration → query engines/OLAP → semantic layer. A mental map for the whole stack.
- Incident-driven lessons: every lesson is a real production failure and its fix. You learn the way you actually grow at work.
- An Incident-Response Arcade: interactive, time-pressured sims where you diagnose and resolve infra failures (the phantom lag, shuffle spills, broken CDC) under a budget and a cluster-health clock. Play it at https://petascalelabs.com/arcade/games
- Free, client-side DE tools: a Parquet Inspector, an SCD Playground, and a PII Masking Policy Generator that run entirely in your browser. Find them at https://petascalelabs.com/tools

Demo

🔗 Live: https://petascalelabs.com

Things to try:

- The Incident-Response Arcade: pick a scenario, work the terminal, and ship a post-mortem before the cluster falls over (timer + budget + cluster-health clock).
- Free DE Tools (https://petascalelabs.com/tools), fast, 100% client-side utilities for working data engineers:
  - Parquet Inspector: drop in a .parquet file and read its schema, row groups, column stats, and metadata, all in-browser (DuckDB-WASM), with nothing uploaded anywhere.
  - SCD Playground: a customer relocates, a tier gets upgraded, and every historical fact is suddenly at risk of silently re-stating under today's attributes. Replay the timeline and watch the dimension transform under each Slowly Changing Dimension type.
  - PII Masking Policy Generator: paste a sample, auto-detect the PII, and generate ready-to-run dynamic data masking policies for Snowflake, Databricks, and BigQuery, while you learn what hashing, tokenization, redaction, and generalization each actually protect.
- The Strata map: browse the data platform layer by layer, from storage & file formats up to the semantic layer.
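
To make the SCD Playground scenario concrete, here is a minimal sketch of how the same customer timeline lands under Type 1 (overwrite) versus Type 2 (versioned rows). The types, field names, and functions are hypothetical and written for illustration only; this is not the playground's actual code.

```typescript
// Hypothetical shapes for illustration; not the playground's data model.
interface CustomerChange {
  customerId: string;
  city: string;
  tier: string;
  effectiveDate: string; // ISO date, so string comparison orders correctly
}

interface Type1Row {
  customerId: string;
  city: string;
  tier: string;
}

interface Type2Row extends Type1Row {
  validFrom: string;
  validTo: string | null; // null marks the open-ended current version
  isCurrent: boolean;
}

// SCD Type 1: overwrite in place. Old attribute values are lost.
function applyType1(dim: Type1Row[], change: CustomerChange): Type1Row[] {
  const next: Type1Row = {
    customerId: change.customerId,
    city: change.city,
    tier: change.tier,
  };
  const exists = dim.some((r) => r.customerId === change.customerId);
  return exists
    ? dim.map((r) => (r.customerId === change.customerId ? next : r))
    : [...dim, next];
}

// SCD Type 2: close the current version, then append a new current row.
function applyType2(dim: Type2Row[], change: CustomerChange): Type2Row[] {
  const closed = dim.map((r): Type2Row =>
    r.customerId === change.customerId && r.isCurrent
      ? { ...r, validTo: change.effectiveDate, isCurrent: false }
      : r
  );
  const current: Type2Row = {
    customerId: change.customerId,
    city: change.city,
    tier: change.tier,
    validFrom: change.effectiveDate,
    validTo: null,
    isCurrent: true,
  };
  return [...closed, current];
}

// Point-in-time lookup: which version was valid on a given date?
function versionAsOf(
  dim: Type2Row[],
  customerId: string,
  date: string
): Type2Row | undefined {
  return dim.find(
    (r) =>
      r.customerId === customerId &&
      r.validFrom <= date &&
      (r.validTo === null || date < r.validTo)
  );
}

// Made-up sample timeline: the customer relocates and gets a tier upgrade.
const timeline: CustomerChange[] = [
  { customerId: "c1", city: "Austin", tier: "silver", effectiveDate: "2023-01-01" },
  { customerId: "c1", city: "Denver", tier: "gold", effectiveDate: "2024-03-01" },
];

const type1 = timeline.reduce(applyType1, [] as Type1Row[]);
const type2 = timeline.reduce(applyType2, [] as Type2Row[]);
```

Under Type 1, a 2023 fact joined to this dimension would report Denver and gold, which is the silent re-statement described above. Under Type 2, `versionAsOf(type2, "c1", "2023-06-15")` still resolves to the Austin, silver version.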

The Comeback Story

This started as scattered notes and a half-built course engine — an idea
buried under "I'll finish it later." The bones existed: a lesson renderer, a few
Strata, a rough game loop. None of it hung together.

The finish-up sprint closed the gap:

Shipped the Incident-Response Arcade end to end — game engine, HUD (timer/credits/health), terminal, Slack-style alert stream, and the post-mortem screen.
Built a free tools hub — Parquet Inspector, SCD Playground, and PII Masking Policy Generator — all client-side, each one shippable on its own.
Wired content authoring into a real contract so new incidents and lessons drop in as data, not code.
Fixed the unglamorous-but-fatal stuff: production SSR/routing, auth, and the rough edges that keep a side project from ever feeling "done."
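
As a rough illustration of "data, not code," an incident can be described as plain JSON and checked against a contract before the engine loads it. This is only a sketch: the `IncidentScenario` fields and the `validateScenario` helper are hypothetical names I'm using here, not the platform's real schema.

```typescript
// Hypothetical contract; field names are assumptions for illustration.
interface IncidentScenario {
  id: string;
  title: string;
  layer: string; // which Strata layer the incident belongs to
  timeLimitSeconds: number;
  startingCredits: number;
  steps: { prompt: string; expectedCommand: string }[];
}

type ValidationResult =
  | { ok: true; scenario: IncidentScenario }
  | { ok: false; errors: string[] };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Checks untrusted JSON against the contract and collects every violation.
function validateScenario(raw: unknown): ValidationResult {
  if (!isRecord(raw)) {
    return { ok: false, errors: ["scenario must be an object"] };
  }
  const errors: string[] = [];
  for (const key of ["id", "title", "layer"]) {
    const v = raw[key];
    if (typeof v !== "string" || v === "") {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  for (const key of ["timeLimitSeconds", "startingCredits"]) {
    const v = raw[key];
    if (typeof v !== "number" || !Number.isFinite(v) || v <= 0) {
      errors.push(`${key} must be a positive number`);
    }
  }
  const steps = raw["steps"];
  if (!Array.isArray(steps) || steps.length === 0) {
    errors.push("steps must be a non-empty array");
  } else {
    steps.forEach((s, i) => {
      if (
        !isRecord(s) ||
        typeof s["prompt"] !== "string" ||
        typeof s["expectedCommand"] !== "string"
      ) {
        errors.push(`steps[${i}] needs string prompt and expectedCommand`);
      }
    });
  }
  return errors.length > 0
    ? { ok: false, errors }
    : { ok: true, scenario: raw as unknown as IncidentScenario };
}

// A made-up scenario file; the content is illustrative, not a real lesson.
const scenarioJson = `{
  "id": "phantom-lag",
  "title": "The Phantom Lag",
  "layer": "ingestion",
  "timeLimitSeconds": 600,
  "startingCredits": 100,
  "steps": [
    { "prompt": "Consumer lag is climbing. Where do you look first?",
      "expectedCommand": "describe consumer group" }
  ]
}`;

const result = validateScenario(JSON.parse(scenarioJson));
const rejected = validateScenario({ id: "broken-cdc" });
```

With a contract like this, a new incident is a new JSON file: a malformed one is rejected with a list of errors instead of failing mid-game.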

It went from a folder I was embarrassed to share to something I'll put a demo
link next to.

My Experience with GitHub Copilot

Copilot was most useful in the glue and grind — the parts that stall a finishing sprint. Concretely:

Boilerplate velocity — React component scaffolds, TypeScript interfaces for the game state, and repetitive handlers came out fast from a comment or a type signature, so I could spend attention on the game design, not the plumbing.
In-editor pattern-matching — once one phase component (e.g. the HUD) had a shape, Copilot inferred the next ones from context, keeping the codebase consistent.
Unblocking the boring last 20% — Go handler stubs, JSON scaffolds for new incident scenarios, and small refactors where momentum matters more than novelty.
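
For a sense of the kind of typed game state this scaffolding is about, here is a small sketch of a HUD state (timer, credits, health) driven by a pure reducer. Every name in it (`GameState`, `GameEvent`, `step`) is hypothetical and written for this post; it is not lifted from the Arcade's codebase.

```typescript
// Hypothetical HUD state; names and rules are assumptions for illustration.
interface GameState {
  secondsLeft: number;
  credits: number;
  clusterHealth: number; // 0 to 100
  status: "running" | "resolved" | "failed";
}

type GameEvent =
  | { type: "tick"; seconds: number; healthDecay: number }
  | { type: "runCommand"; cost: number }
  | { type: "resolve" };

// Pure reducer: takes the current state and an event, returns the next state.
function step(state: GameState, event: GameEvent): GameState {
  // Once the game has ended, further events are ignored.
  if (state.status !== "running") return state;

  if (event.type === "tick") {
    const secondsLeft = Math.max(0, state.secondsLeft - event.seconds);
    const clusterHealth = Math.max(0, state.clusterHealth - event.healthDecay);
    const failed = secondsLeft === 0 || clusterHealth === 0;
    return {
      ...state,
      secondsLeft,
      clusterHealth,
      status: failed ? "failed" : "running",
    };
  }

  if (event.type === "runCommand") {
    const credits = state.credits - event.cost;
    // Overspending the budget ends the run.
    return credits < 0
      ? { ...state, credits: 0, status: "failed" }
      : { ...state, credits };
  }

  // Remaining case: the player resolved the incident.
  return { ...state, status: "resolved" };
}

const start: GameState = {
  secondsLeft: 60,
  credits: 100,
  clusterHealth: 100,
  status: "running",
};

const events: GameEvent[] = [
  { type: "tick", seconds: 10, healthDecay: 5 },
  { type: "runCommand", cost: 30 },
  { type: "resolve" },
];

const finalState = events.reduce(step, start);
```

Because the event type is a discriminated union, the compiler narrows `event` in each branch, which is the sort of shape an assistant can extend consistently once the first case exists.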

Where I stayed hands-on: the architecture, the incident pedagogy, and anything
touching correctness in production. Copilot is a force multiplier on the typing,
not a substitute for the thinking — which is exactly the philosophy we teach.

Petascale Labs — understand the data stack from the bytes up.