DEV Community

Cover image for Data Engineering Projects: 8 Portfolio Projects to Land Your First DE Job
Gowtham Potureddi
Gowtham Potureddi

Posted on

Data Engineering Projects: 8 Portfolio Projects to Land Your First DE Job

The single biggest leverage point in an entry-level data engineering job search is not another certification or another LinkedIn course completion — it is a data engineering portfolio of two to four projects that a hiring manager can clone, run with one command, and read in under ten minutes. Recruiters and tech screens have stopped treating certs as signal; what they want is evidence. They want a GitHub link with a clean README, a docker-compose up that boots a working pipeline, and a one-paragraph description that names the problem, the stack, and the trade-off you defended.

This guide walks you through eight data engineer projects arranged as a difficulty pyramid: two foundation projects that prove SQL and dimensional modelling fluency, two orchestration projects that prove you can schedule and stream, two modern-stack projects that prove you can wield distributed compute and a cloud warehouse, and two end-to-end projects that prove business judgement and breadth. Each project ships as a #### block with a stack list, a short build path, a real code snippet you can paste, and an outcome list of the exact things hiring managers will mention back to you in interviews. By the end you will have a concrete de projects for resume plan, complete with etl project ideas and data engineering project ideas you can start tonight.

PipeCode blog header for 8 data engineering portfolio projects — bold white headline 'Data Engineering · 8 Portfolio Projects' with subtitle 'Foundation · Orchestration · Modern stack · End-to-end' and a stylised 4-tier project pyramid with tool-pill icons on a dark gradient with purple, green, orange, and amber accents and a small pipecode.ai attribution.

When you want hands-on reps alongside the build, drill the ETL practice library →, warm up with aggregation problems →, and rehearse the streaming pipeline drills → so each project ships paired with the interview muscles it demonstrates.


On this page


1. Why portfolio beats certifications for your first DE role

Hiring managers verify hands-on with GitHub, not certs

The one-line invariant for every entry-level data engineering projects conversation is this: a hiring manager would rather scroll your GitHub for ten minutes than read a stack of certificate PDFs. Certificates prove that you sat through a course; they do not prove you can wire an API to a warehouse, schedule a daily refresh, write a test that fails when a column goes null, and document the whole thing for a teammate. A portfolio proves all four in a single repo.

Five reasons a portfolio out-signals certifications.

  • Self-evident depth. Reading 200 lines of dbt is faster, more honest, and more diagnostic than reading a certificate. Anyone can pass a multiple-choice quiz; very few can ship a pipeline that does not crash overnight.
  • Reproducibility is the test. A README that says "clone, docker-compose up, hit localhost:3000 for the dashboard" tells the reader, "this person has thought about onboarding." It is the same skill that lets you onboard a new teammate at work.
  • Trade-off articulation. The README's "Trade-offs I made" section is the artefact most senior engineers actually click — it is the strongest signal of judgement. A cert hides judgement; a portfolio surfaces it.
  • Storytelling at the loop. When a behavioural round asks "walk me through a project," candidates with portfolios produce a structured five-minute story. Cert-only candidates produce a rehash of the course outline.
  • Reverse leverage. A portfolio project pulls you into the loop instead of the other way around. Recruiters routinely DM candidates whose GitHubs surface relevant projects when filtering for a specific stack.

The single most-asked screening prompt.

"Show me a pipeline you built end-to-end. Walk me through the architecture, what you chose, and why."

If you cannot answer that question with a link and a five-minute narrative, you are competing on resume keywords alone — a losing battle when every fresh-graduate resume lists the same SQL, Python, and Spark bullets. Your data engineering portfolio is the conversation that resume bullets cannot have on your behalf.

What an entry-level reviewer is scanning for in 90 seconds.

  • A pinned repo with a thumbnail in the README so the GitHub homepage looks intentional.
  • A one-sentence problem statement above the architecture diagram.
  • A docker-compose.yml or Makefile so the project boots locally without manual setup.
  • Tests — even one. Empty tests imply "I have heard of testing"; a passing test on a not_null constraint implies "I have used testing."
  • A "What I would build next" section showing where the project is heading — proves you have an opinion about the system, not just a snapshot.

Detailed sub-topic — the 2-strong-vs-5-weak rule

Detailed explanation. New candidates often pile eight half-finished projects onto a resume thinking that quantity demonstrates effort. The opposite signal lands. A hiring manager who clicks two projects and finds shallow, copy-pasted tutorials assumes the other six are worse and bounces. Two deep projects — each with tests, a README, a docker-compose, and a trade-off section — beat five shallow ones every single time. The rule of thumb: ship the smallest portfolio you can defend in interview, then iterate breadth from there.

Code (resume snippet that reflects the rule).

## Projects

### nyc-taxi-warehouse  [GitHub →]
PostgreSQL star schema + dbt models + Metabase dashboard.
- 8 GB of trip data ingested with idempotent COPY; partitioned by month.
- Star schema with `fact_trips`, `dim_zone`, `dim_time`, `dim_payment`.
- 14 dbt tests (not_null, unique, accepted_values, custom row-count freshness).
- `docker-compose up` boots Postgres + dbt-core + Metabase in <60s.
- Trade-off chosen: monthly partitions over daily — fewer files, simpler vacuum
  schedule, acceptable for the dashboard SLA of "fresh by 09:00 next day."

### dbt-analytics-duckdb  [GitHub →]
Seeds + sources + 12 staging + 6 marts + GitHub Actions CI.
- DuckDB + dbt-core, runs in <30s on a laptop, zero cloud cost.
- Tests run on every PR; merge blocked on test failure.
- Trade-off chosen: DuckDB over Snowflake — same dbt code, zero cloud spend,
  small enough to onboard a reviewer in five minutes.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Two projects, not eight. Each one gets its own headline, a one-line architecture statement, and a clear trade-off paragraph.
  2. Numbers anchor the bullets. "8 GB" and "14 dbt tests" and "<60s boot" are concrete; "comprehensive testing" is vapour.
  3. The trade-off line is the senior signal. "Monthly partitions over daily" tells the reader you considered alternatives and picked one for a reason.
  4. No hand-waving language. Avoid "leveraged" and "utilized"; prefer "ingested", "modelled", "tested", "shipped" — verbs the reader can verify.
  5. Repo links open in a new tab and land on a README that mirrors the bullets above. The bullets must be falsifiable in 60 seconds of scrolling.

Output.

Profile Avg interview-conversion rate (informal sample) Why
5 shallow projects low reviewer bounces on first shallow click
2 deep projects high reviewer clicks both, reads the README, opens the issues tab
0 projects very low recruiter sorts you below candidates with even one repo

Rule of thumb. Ship two projects you can defend cold against any "why did you do it that way?" question before you start a third. The deepest two-project portfolio in the world is still tighter than a six-project portfolio with one trade-off paragraph among them.

Data engineering interview question on portfolio strategy

A senior interviewer might open the loop with: "We have ten candidates with similar resumes. Why should I open your GitHub first?" — testing whether you have a sharp, one-sentence framing of your portfolio's strongest project and the trade-off it defends.

Solution Using a single defensible flagship project + supporting repos

# Hi — I'm Alex. Open this repo first.

## Flagship: product-analytics-platform
End-to-end Segment → Kafka → Snowflake → dbt → Looker, all dockerised.
- 1.2 M synthetic events/day; rolling DAU, WAU, retention cohorts.
- 38 dbt tests; CI on every PR; Looker dashboard auto-refreshes nightly.
- Trade-off chosen: Kafka over batch ingest — proves I can defend streaming
  semantics in an interview without needing the production traffic.

## Supporting projects
- nyc-taxi-warehouse — proves SQL + dimensional modelling.
- airflow-daily-etl — proves DAG design + error handling.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Reviewer action Time spent Signal received
1 Lands on GitHub profile 5s pinned flagship repo with thumbnail, clear name
2 Clicks flagship README 20s sees architecture diagram + one-line problem statement
3 Scrolls to "Trade-offs" 15s reads two paragraphs naming the alternative they rejected
4 Opens docker-compose.yml 10s sees five clean services + comments
5 Opens GitHub Actions tab 5s green check on main, last run yesterday
6 Reads "What I would build next" 15s sees roadmap — Spark feature engineering branch

Output:

Outcome Before flagship strategy After flagship strategy
Recruiter first-call response rate low — 8% high — 31%
Loop conversion to onsite medium — 22% high — 48%
Reviewer-named project in onsite rare every loop

Why this works — concept by concept:

  • Single flagship as the anchor — the reviewer's attention is scarce; one repo that absorbs 80% of that attention compounds far more than five repos that fight for it.
  • Architecture-diagram-first README — the first scroll above the fold either earns the next ten minutes or loses them; a diagram earns them.
  • Trade-off section as the senior signal — "I picked X over Y because Z" is the single sentence every interviewer is listening for, on every loop.
  • Green-check CI — a passing GitHub Actions badge in the README signals "I take repeatability seriously" — a free reputation point.
  • Roadmap section — proves the project is alive in your head, not abandoned; future-tense bullets are interview gold.
  • Cost — flagship build cost = O(3–4 weekends); ongoing maintenance cost = O(1 hour/month); interview leverage = O(every loop you take in the next 12 months).

ETL
Topic — ETL
End-to-end ETL practice problems

Practice →


2. The portfolio difficulty pyramid — Foundation to End-to-End

Pick one project from each of four tiers — that is the entire data engineering portfolio

The mental model in one sentence: a strong portfolio is a four-tier pyramid that walks the reviewer up from SQL + modelling to a real product pipeline, with exactly one project per tier. Skipping a tier creates a credibility gap — a reviewer who sees a Spark + Snowflake project but no underlying warehouse repo wonders whether the candidate actually wrote the SQL or copy-pasted a tutorial. Climbing the pyramid in order builds and demonstrates the curriculum at the same time.

Visual diagram of the 4-tier portfolio project pyramid — each tier (Foundation, Orchestration, Modern stack, End-to-end) drawn as a horizontal bar with 2 project tiles inside, a difficulty rating, an effort-week chip, and a skills-demonstrated pill list; a 'pick 1 from each tier' annotation on the side; on a light PipeCode card.

The four tiers in five bullets.

  • Tier 1 — Foundation. SQL warehouse + dbt analytics. Proves schema design, SQL fluency, dimensional modelling, and version control on data assets. Roughly 1–2 weeks of work per project.
  • Tier 2 — Orchestration. Airflow daily ETL + Kafka/Flink streaming. Proves scheduling discipline, DAG design, streaming semantics, and operational hygiene (alerting, retries, backfills). Roughly 2–3 weeks per project.
  • Tier 3 — Modern stack. Spark batch + dbt-on-Snowflake medallion. Proves distributed compute and a cloud warehouse layered together. Roughly 2–3 weeks per project.
  • Tier 4 — End-to-end. Product analytics platform + ML feature pipeline. Proves breadth and business judgement — the senior-track signal even for an entry-level loop. Roughly 3–4 weeks per project.
  • Pick one project per tier for a four-project portfolio that covers the full data engineering project ideas surface every loop touches.

Skills demonstrated per tier — the interview-side mapping.

  • Tier 1 → SQL screen, modelling round. "Walk me through a star schema you built." "How did you decide your grain?"
  • Tier 2 → ETL system design round. "How does your DAG handle a partial failure on Task 3?" "What is your backfill strategy?"
  • Tier 3 → distributed compute round. "How would you handle skew in this Spark job?" "Why medallion?"
  • Tier 4 → senior signal + business judgement round. "Why did you put the ML feature path on the same Kafka topic as analytics?"

Why an ordered curriculum beats a random walk.

  • A Tier 4 project that pretends Tier 1 has been mastered (no tests on the warehouse, no schema docs) collapses under interviewer probing.
  • The Tier 1 → Tier 2 jump teaches you the difference between "running a SQL query" and "running it on schedule when the input changes." This is the single biggest mental shift for a fresh DE.
  • The Tier 2 → Tier 3 jump teaches you the difference between Python on one machine and Python on a cluster. Most candidates assume this gap is small; it is not.
  • The Tier 3 → Tier 4 jump teaches you the difference between a pipeline that runs and a pipeline that serves a business outcome. This is the gap that earns the offer.

Detailed sub-topic — sizing each project to a calendar week

Detailed explanation. Each tier comes with a recommended effort budget. Going over budget means the project has grown beyond a "portfolio piece" into a side-business — fine, but stop letting it block the next tier. Going under budget usually means the project lacks tests, docs, or a defensible trade-off. The budget below assumes 8–12 hours/week of focused build time.

Code (project-tracker JSON you can copy into Notion/Linear).

{
  "portfolio_plan": [
    { "tier": 1, "project": "sql-warehouse",          "effort_weeks": 2, "stack": ["postgres","dbt","metabase"] },
    { "tier": 1, "project": "dbt-analytics-duckdb",   "effort_weeks": 1, "stack": ["duckdb","dbt","gh-actions"] },
    { "tier": 2, "project": "airflow-daily-etl",      "effort_weeks": 2, "stack": ["airflow","s3","snowflake","dbt","looker-studio"] },
    { "tier": 2, "project": "kafka-flink-realtime",   "effort_weeks": 3, "stack": ["kafka","flink-sql","clickhouse","grafana"] },
    { "tier": 3, "project": "spark-user-behaviour",   "effort_weeks": 2, "stack": ["pyspark","s3","delta-lake","minio"] },
    { "tier": 3, "project": "dbt-snowflake-medallion","effort_weeks": 3, "stack": ["dbt","snowflake","gh-actions"] },
    { "tier": 4, "project": "product-analytics",      "effort_weeks": 4, "stack": ["segment","kafka","snowflake","dbt","looker"] },
    { "tier": 4, "project": "ml-feature-pipeline",    "effort_weeks": 4, "stack": ["spark","feast","airflow","mlflow"] }
  ]
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. List every project up-front so the calendar work is visible. Aspirations that live only in your head shrink.
  2. Budget each project in calendar weeks, not "hours" — calendar weeks include the inevitable real-life interruptions, which make the plan honest.
  3. Stack list per project doubles as resume bullets and as Google-search queries during the build.
  4. Total realistic budget for two projects per tier = ~20 weeks of evening work. For the recommended four-project portfolio (one per tier) it is closer to 10 weeks. Both are achievable inside a focused job search.
  5. Track project state in three bucketsplanning, building, polishing. "Polishing" is where README, tests, and screen-recording demos live; it is the bucket beginners skip.

Output.

Tier Recommended one-of-two Effort weeks Interview rounds it covers
1 dbt-analytics-duckdb 1 SQL screen, modelling round
2 airflow-daily-etl 2 ETL system design round
3 dbt-snowflake-medallion 3 warehouse + cloud round
4 product-analytics 4 senior signal + business judgement

Rule of thumb. "Four strong projects beats eight weak ones — always." If a project is taking more than 1.5× its budget, ship what you have, write the README, and move on to the next tier. The reviewer rewards breadth across tiers more than depth within one.

Data engineering interview question on portfolio sequencing

A senior interviewer often probes: "If you had eight weeks to build a portfolio from scratch, in what order would you tackle the projects and why?" — testing whether the candidate sees the dependencies between tiers (you cannot build a Tier 3 medallion if you cannot build a Tier 1 star schema first).

Solution Using strict tier order with one project per tier

Week 1     → Tier 1: dbt-analytics-duckdb (lightweight, low risk, builds dbt muscle).
Week 2-3   → Tier 2: airflow-daily-etl     (the first project that *runs on a schedule*).
Week 4-6   → Tier 3: dbt-snowflake-medallion (introduces the cloud warehouse).
Week 7-10  → Tier 4: product-analytics     (assembles everything into one repo).
Polish     → Add screen-recordings, READMEs, two-line trade-off paragraphs.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Week Project Output artefact New skill demonstrated
1 dbt-analytics-duckdb README + GH Actions green check dbt model layering + CI
2 airflow-daily-etl DAG screenshot + Looker Studio embed scheduling discipline
3 airflow-daily-etl dead-letter + retries patterns operational hygiene
4 dbt-snowflake-medallion bronze/silver/gold schema diagram medallion architecture
5 dbt-snowflake-medallion warehouse-cost-per-run metric cloud-cost awareness
6 dbt-snowflake-medallion dbt exposures + Looker tile downstream lineage
7 product-analytics architecture diagram system design breadth
8 product-analytics Segment → Kafka ingest streaming ingest
9 product-analytics dbt marts + Looker dashboard analytics surfacing
10 product-analytics screen-recording demo storytelling rehearsal

Output:

Metric Random-order portfolio Pyramid-order portfolio
Reviewer perceives depth low — projects feel disjointed high — projects build on each other
Story coherence in onsite low — separate stories per project high — one narrative across four projects
Cross-project test coverage none Tier 4 reuses Tier 3 dbt patterns

Why this works — concept by concept:

  • Tier order respects skill dependencies — Tier 4's medallion warehouse reuses Tier 1's dbt code muscle; skipping Tier 1 means re-learning under deadline.
  • One project per tier — keeps total scope at ~10 weeks; lets you reach Tier 4 before motivation flags.
  • Polish week — README, screen-recording, and trade-off paragraph are the 20% of effort that returns 80% of reviewer-perceived quality.
  • Reviewer narrative — four ordered projects let you say "I built this, then this, then this" in the behavioural round; random projects force four separate stories.
  • Calendar realism — assumes 8–12 hours/week, the real budget of a fresh-graduate searching for work.
  • Cost — total build cost = O(10 weeks × 10 hours); interview leverage = O(every loop for the next 12 months).

SQL
Topic — aggregation
SQL aggregation problems for the foundation tier

Practice →


3. Foundation projects — SQL warehouse and dbt analytics

Foundation projects prove SQL + modelling fluency — the bedrock of every data engineer projects portfolio

The mental model: 80% of entry-level rounds test SQL and dimensional modelling; the foundation tier exists to make those rounds easy. A warehouse with a star schema and a dbt repo with seeds, sources, models, tests, and docs cover the entire syllabus the typical screen pulls from. Two projects in this tier, each ~5–10 days, give a reviewer everything they need to believe you can handle a "build me a fact_orders table" whiteboard round without flinching.

Visual diagram of the two foundation projects — Project 1 SQL warehouse (PostgreSQL + Python ingest + dbt + Metabase) and Project 2 dbt analytics (dbt-core + DuckDB + GitHub Actions); each project is a panel with a small architecture diagram, a tool-icon row, a skills-demonstrated chip, and a difficulty rating; on a light PipeCode card.

Why foundation matters more than the flashier tiers.

  • It is the only tier you cannot fake. A reviewer can ask a SQL question on the spot; you cannot bluff it. The foundation tier ensures you do not have to.
  • dbt is the lingua franca of modern analytics engineering. A dbt repo with sources + models + tests + docs is the canonical entry-level artefact in 2026.
  • A star schema is the canonical interview drawing. Practising the four dimensional-modelling decisions (grain, facts, dims, slowly-changing strategy) on real data is the only way to defend them under pressure.
  • The foundation projects double as datasets for higher tiers. Your Tier 1 warehouse becomes the source-of-truth for the Tier 3 medallion and the Tier 4 product analytics platform — a compounding return.

The four dimensional-modelling decisions every foundation project should defend.

  • Grain — what does one row in the fact table represent? Be ruthless: "one row = one taxi trip from pickup to dropoff."
  • Conformed dimensions — which dims are shared across facts? dim_date, dim_zone, dim_user are the usual suspects.
  • SCD strategy — type 1 (overwrite), type 2 (history rows), or type 6 (combined). Pick one per dim and write it on the README.
  • Testsnot_null on grain columns, unique on surrogate keys, relationships on FKs, accepted_values on enums. Aim for ≥10 tests per project.

Project 1 — SQL data warehouse from an open dataset

Detailed explanation. Start with a well-known public dataset — NYC Taxi (yellow cabs, ~8 GB raw), Brazil e-commerce (Olist, ~2 GB), or a generated TPC-H sample (~1 GB). Ingest the raw CSVs into PostgreSQL with idempotent COPY, model a star schema in dbt, write ~10 analytics queries on the marts, and surface the results in Metabase. Total wall-clock budget: 7–10 days. The result is a repo a reviewer can clone, docker-compose up, and have a working dashboard at localhost:3000 in 60 seconds.

Code.

# docker-compose.yml — 3 services, 90s boot
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: warehouse
      POSTGRES_USER: dbt
      POSTGRES_PASSWORD: dbt
    volumes:
      - ./data:/data:ro
    ports: ["5432:5432"]

  dbt:
    build: ./dbt
    depends_on: [postgres]
    volumes:
      - ./dbt:/usr/app
    command: ["bash", "-c", "dbt seed && dbt run && dbt test"]

  metabase:
    image: metabase/metabase:v0.50.0
    depends_on: [postgres]
    ports: ["3000:3000"]
Enter fullscreen mode Exit fullscreen mode
-- models/marts/fact_trips.sql
{{ config(materialized='table') }}

SELECT
    {{ dbt_utils.generate_surrogate_key(['trip_id']) }}    AS trip_sk,
    trip_id,
    pickup_zone_sk,
    dropoff_zone_sk,
    payment_type_sk,
    pickup_datetime,
    dropoff_datetime,
    trip_distance_miles,
    fare_amount_usd,
    tip_amount_usd,
    total_amount_usd,
    EXTRACT(EPOCH FROM (dropoff_datetime - pickup_datetime))/60 AS trip_minutes
FROM   {{ ref('stg_trips') }}
WHERE  pickup_datetime >= '2024-01-01'
  AND  trip_distance_miles BETWEEN 0 AND 200
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Ingest — a Python loader.py reads each monthly CSV and COPYs it into a raw table with an idempotent (filename, row_hash) dedup key, so re-running is safe.
  2. Stagestg_*.sql models rename columns, cast types, and apply trivial filters. One staging model per source table.
  3. Markfact_trips, dim_zone, dim_payment, dim_date join staging tables into a star schema. Surrogate keys are generated via dbt_utils.generate_surrogate_key.
  4. Test — at minimum: not_null on grain columns, unique on surrogate keys, relationships from facts → dims, accepted_values on payment_type.
  5. Dashboard — Metabase points at marts.fact_trips joined to dims; nine charts covering trips-per-day, fare-per-mile distribution, top-10 pickup zones, weekday-vs-weekend split, etc.

Output.

Layer Artefact Rows What it proves
raw raw.trips_yellow_2024_01 … 12 monthly tables ~100 M idempotent ingest
staging stg_trips ~100 M type-cast + filter discipline
marts fact_trips, 4 dims ~98 M / dim small dimensional modelling
BI 9 Metabase charts n/a downstream consumption

Rule of thumb. Foundation projects must boot in under 60 seconds on a laptop. If the reviewer has to install something beyond Docker, you have already lost half the audience. Bundle the dataset (or a 100k-row sample) inside the repo.

What hiring managers see.

  • You can model a star schema. The four-decision README (grain, dims, SCD, tests) tells them you have an opinion.
  • You write idempotent ingest. Re-running the loader is safe — a senior signal even at entry level.
  • You test data, not just code. Eleven dbt tests in the project, all green in CI.
  • You think about onboarding. A docker-compose up README earns reviewer goodwill in the first 30 seconds.

Common beginner mistakes.

  • Modelling at the wrong grain. Mixing trip-level facts with day-level aggregates in one table — always split into fact_trip (per trip) and agg_daily_trips (per day, sourced from the fact).
  • Skipping not_null on grain columns. A fact_trips with a nullable trip_sk is silently broken; reviewers will spot it instantly.
  • Loading the dashboard against raw tables. Metabase pointed at raw.trips_yellow_2024_01 is a red flag — always route BI through marts.
  • Forgetting dim_date. Hard-coding date arithmetic in every fact query is the single biggest "junior" tell on dbt repos.
  • Hard-coding zone names in SQL. A WHERE pickup_zone = 'Manhattan' filter inline anywhere outside dim_zone means the dimension does not exist in your head yet.

Project 2 — dbt analytics with seeds + sources + models + tests + docs

Detailed explanation. Project 2 is the lighter cousin of Project 1 — pure dbt with DuckDB as the warehouse (zero cloud cost, runs in <30s on a laptop) and GitHub Actions running dbt test on every PR. The point is to demonstrate dbt as the lingua franca: every modern analytics team uses it, every interview asks about it, and very few candidates ship a real dbt repo. Total wall-clock budget: 5 days.

Code.

# dbt_project.yml
name: dbt_analytics_duckdb
version: '1.0.0'
profile: dbt_analytics_duckdb

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
macro-paths: ["macros"]

models:
  dbt_analytics_duckdb:
    staging:
      +materialized: view
    marts:
      +materialized: table
      +schema: marts
Enter fullscreen mode Exit fullscreen mode
# models/marts/_marts.yml — tests that block merge in CI
version: 2
models:
  - name: dim_customer
    columns:
      - name: customer_sk
        tests: [not_null, unique]
      - name: email
        tests:
          - not_null
          - dbt_utils.not_empty_string
  - name: fact_order
    columns:
      - name: order_sk
        tests: [not_null, unique]
      - name: customer_sk
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_sk
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Seeds — CSVs checked into seeds/ for tiny lookup tables (country codes, currency rates). dbt seed loads them as DuckDB tables.
  2. Sourcessources.yml registers the raw tables and their freshness expectations (error_after: { count: 24, period: hour }).
  3. Staging — one view per source, no business logic, just renaming and casting.
  4. Marts — narrow set of fact and dimension tables, materialised as tables (faster queries downstream).
  5. Tests — schema tests on every column you care about + at least three custom data tests (e.g. "no future-dated orders", "all orders have positive amounts").
  6. CI.github/workflows/dbt.yml installs dbt, runs dbt build, fails the PR on any test red.

Output.

Surface Count Note
Sources 4 freshness configured
Staging models 12 views
Mart models 6 tables
Schema tests 28 run on every PR
Custom data tests 3 future dates, negative amounts, orphan FKs

Rule of thumb. The dbt analytics project is the test-coverage showcase of the portfolio. Aim for ≥30 tests total. A reviewer who clicks _models.yml and sees a wall of green tests will trust the rest of the repo immediately.

What hiring managers see.

  • You ship dbt the way analytics engineers actually ship it — seeds, sources, staging, marts, tests, docs, CI.
  • You understand ref() and source() — the two foundational dbt primitives.
  • You wrote at least one custom data test — the line that separates "used dbt" from "understood dbt."
  • Your CI badge is green. Reviewer instinctively trusts the rest of the repo.

Common beginner mistakes.

  • No sources.yml. Skipping the source layer and reading from raw tables breaks lineage and disables freshness checks.
  • All models materialised as view. Marts must be table so dashboard queries do not re-run the underlying CTE chain every refresh.
  • Tests in YAML only. Add at least one Python or SQL custom test to prove you understand dbt's full test surface.
  • dbt run in CI instead of dbt build. dbt build runs models and tests in DAG order; dbt run skips tests, which is a junior tell.
  • No --select state:modified+ in CI. On a small project it does not matter, but reviewers love seeing the senior-tier deferral pattern even when the project is tiny.

Data engineering interview question on a foundation-tier build

A senior interviewer often probes: "Walk me through the schema you would model for a ride-hailing analytics dashboard, starting from the raw events, and where you would put your tests." — testing whether the candidate can articulate grain, fact/dim split, conformed dimensions, and the four canonical schema-tests, all on a familiar dataset.

Solution Using a NYC-Taxi-style star schema with dbt tests in CI

-- models/marts/fact_trips.sql  (Postgres / DuckDB / Snowflake — same DDL)
{{ config(materialized='table') }}

SELECT
    {{ dbt_utils.generate_surrogate_key(['trip_id']) }} AS trip_sk,
    trip_id,
    pickup_zone_sk,
    dropoff_zone_sk,
    payment_type_sk,
    date_sk,
    trip_distance_miles,
    fare_amount_usd,
    tip_amount_usd,
    total_amount_usd
FROM   {{ ref('stg_trips') }} t
JOIN   {{ ref('dim_zone') }}     pz ON t.pickup_zone_id  = pz.zone_id
JOIN   {{ ref('dim_zone') }}     dz ON t.dropoff_zone_id = dz.zone_id
JOIN   {{ ref('dim_payment') }}  p  ON t.payment_type    = p.payment_type
JOIN   {{ ref('dim_date') }}     d  ON DATE(t.pickup_datetime) = d.date_value
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Layer Action Output rows
1 raw COPY 12 monthly CSVs ~100 M
2 stg rename + cast in stg_trips view ~100 M (view)
3 dim build dim_zone (265 rows), dim_payment (6 rows), dim_date (366 rows) small lookup tables
4 fact join staging to dims to produce fact_trips table ~98 M
5 test not_null(trip_sk), unique(trip_sk), relationships(pickup_zone_sk → dim_zone) all pass
6 CI GitHub Actions runs dbt build on PR green

Output:

Mart table Rows Tests Materialisation
fact_trips 98,201,442 7 table
dim_zone 265 3 table
dim_payment 6 3 table
dim_date 366 4 table

Why this works — concept by concept:

  • Grain on one line of the README — "one row of fact_trips = one yellow-cab trip from pickup to dropoff." This single sentence answers half the modelling round.
  • Surrogate keys via dbt_utils — hashing trip_id gives a stable, deterministic key that survives schema changes downstream.
  • Conformed dim_date and dim_zone — reused in higher-tier projects (medallion, product analytics) — a compounding return on the foundation work.
  • Tests written as YAML, not Python — dbt's canonical idiom; immediately legible to any reviewer who has used dbt.
  • CI on every PR — a green badge in the README earns trust before the reviewer reads any SQL.
  • Cost — local run = O(seconds on DuckDB, minutes on Postgres with the full dataset); reviewer time = O(60s from clone to dashboard).

Data modelling
Topic — dimensional modelling
Dimensional-modelling practice problems

Practice →


4. Orchestration projects — Airflow ETL and Kafka streaming

Orchestration projects prove you can ship pipelines that run on a schedule — the mental shift senior interviews care about

The mental model in one line: a SQL query that runs by hand is a school assignment; a pipeline that runs on a schedule is a job. Tier 2 closes that gap. An Airflow daily ETL proves you can sequence tasks, handle partial failures, and refresh a dashboard nightly without supervision. A Kafka + Flink streaming aggregator proves you can reason about windowing, watermarks, and exactly-once semantics — the streaming half of the etl project ideas universe.

Visual diagram of the two orchestration projects — Project 3 Airflow daily ETL (API → S3 → Snowflake → dbt → Looker) and Project 4 Kafka + Flink real-time aggregations (Kafka → Flink SQL → ClickHouse → Grafana); each project shows its DAG / pipeline diagram with named operators / processors and a skills-demonstrated chip; on a light PipeCode card.

Why Tier 2 is the most under-built tier in entry-level portfolios.

  • It is intimidating. Airflow has a learning curve, and Kafka feels like infrastructure rather than code. Most candidates skip Tier 2 and jump to Spark — a tell that they have not actually run a pipeline.
  • It is the most-asked design round. "Walk me through your DAG" and "what is your retry strategy" are the two most-asked operational questions in entry-level loops.
  • It teaches the failure modes that matter. Schedules expose race conditions, idempotency bugs, and silent data drift that batch-on-a-laptop never reveals.
  • It is the entry point to the modern stack. Tier 3 medallion patterns assume an Airflow or dbt-cloud schedule already exists; building it once at Tier 2 removes that assumption.

Two orthogonal interview signals.

  • Airflow → "I can schedule and monitor batch workflows." Covers scheduling, dependencies, alerting, backfills.
  • Kafka + Flink → "I can reason about streaming semantics." Covers partitioning, windowing, watermarks, exactly-once.

Project 3 — Airflow daily ETL (API → S3 → Snowflake → dbt → dashboard)

Detailed explanation. Build a daily DAG that fetches a public API (e.g. OpenWeather, AlphaVantage, NYC 311, GitHub events), lands the response as Parquet on S3 (or MinIO locally), loads it into Snowflake (or DuckDB locally), runs a dbt transform, and finally refreshes a Looker Studio dashboard via a webhook. Five tasks, one DAG, daily schedule at 02:00 UTC. Total wall-clock budget: 10–14 days. The repo ships with docker-compose.yml that boots Airflow + MinIO + DuckDB + a Looker Studio embed for free local testing.

Code.

# dags/daily_etl.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from datetime import datetime, timedelta

default_args = {
    "owner": "alex",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "sla": timedelta(hours=2),
}

with DAG(
    "daily_etl",
    default_args=default_args,
    schedule="0 2 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    max_active_runs=1,
    tags=["portfolio", "etl"],
) as dag:

    fetch  = PythonOperator(task_id="fetch_api",      python_callable=fetch_api)
    land   = PythonOperator(task_id="land_s3",        python_callable=write_parquet_to_s3)
    load   = SnowflakeOperator(task_id="load_snowflake",
                               sql="COPY INTO raw.events FROM @stage/events/")
    dbt_run = PythonOperator(task_id="dbt_run",       python_callable=run_dbt_build)
    refresh = PythonOperator(task_id="refresh_dashboard", python_callable=ping_looker_webhook)

    fetch >> land >> load >> dbt_run >> refresh
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. fetch_api — calls the source API with retry-on-429 and exponential backoff. Result written to local disk first (so failure during S3 upload does not re-hit the API).
  2. land_s3 — partitions the response by dt=YYYY-MM-DD and writes as events_HHMM.parquet. Idempotent — re-runs of the same execution date overwrite the same file.
  3. load_snowflake — runs COPY INTO raw.events FROM @stage/events/ with ON_ERROR='ABORT_STATEMENT'. Failure surfaces the row count and offending row.
  4. dbt_run — calls dbt build --select +marts.daily_events to refresh marts plus tests.
  5. refresh_dashboard — POSTs to Looker Studio's webhook so the embedded dashboard refreshes immediately.
  6. Failure handling — each task retries three times with 5-minute spacing; SLA miss (>2h end-to-end) emails the owner; the entire DAG is max_active_runs=1 to prevent overlapping runs.

Output.

Task Avg duration Retries used (typical week) Failure cause when it fails
fetch_api 30s 0–1 upstream 429
land_s3 5s 0 rare network blip
load_snowflake 45s 0 bad row type — surfaces in log
dbt_run 90s 0 broken test — by design
refresh_dashboard 2s 0 webhook 4xx

Rule of thumb. Every Airflow project must have an explicit backfill strategy documented in the README. "To backfill 2025-12, run airflow dags backfill daily_etl -s 2025-12-01 -e 2025-12-31." Reviewers love it.

What hiring managers see.

  • You can design a DAG with five real tasks, not a toy BashOperator("echo hello").
  • You retry intelligently, with backoff and idempotent task design.
  • You have an SLA configured — proves you understand observability.
  • You can backfill — the senior-track operational muscle.

Common beginner mistakes.

  • DAG with catchup=True and a far-back start_date — on first deploy Airflow tries to backfill every day since start_date, hammering the API and breaking rate limits. Always start with catchup=False.
  • Non-idempotent tasks. A task that appends to a table without first deleting the same execution_date rows will double-load on retry.
  • No max_active_runs=1. Overlapping runs of the same DAG corrupt warehouse state when two COPY INTOs race.
  • Hard-coded credentials in the DAG. Use Airflow connections + environment variables; secrets in code are a red flag for any reviewer.
  • No README screenshot of the Graph view. A picture of a green DAG with five tasks is the single highest-leverage README artefact for Airflow projects.

Project 4 — Kafka + Flink real-time aggregations

Detailed explanation. Stand up a single-broker Kafka in Docker, generate a synthetic stream of click events with a Python publisher (1k events/sec on a laptop is plenty), and write a Flink SQL job that consumes the topic, computes per-user 5-minute hopping windows, and writes the result to ClickHouse. Wire ClickHouse to Grafana for a real-time chart. Total wall-clock budget: ~21 days — Tier 2's hardest project. Ship everything in docker-compose.yml so the reviewer can docker-compose up, python publish.py, and see Grafana ticking in 90 seconds.

Code.

-- flink_sql/clicks_5min_window.sql
CREATE TABLE clicks (
    user_id    STRING,
    event_time TIMESTAMP_LTZ(3),
    page       STRING,
    WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
    'connector'        = 'kafka',
    'topic'            = 'clicks',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id'          = 'clicks-agg',
    'scan.startup.mode'            = 'earliest-offset',
    'format'                       = 'json'
);

CREATE TABLE clicks_5m_agg (
    user_id      STRING,
    window_start TIMESTAMP(3),
    window_end   TIMESTAMP(3),
    clicks       BIGINT,
    PRIMARY KEY (user_id, window_start) NOT ENFORCED
) WITH (
    'connector' = 'clickhouse',
    'url'       = 'clickhouse://clickhouse:8123/default',
    'table-name'= 'clicks_5m_agg'
);

INSERT INTO clicks_5m_agg
SELECT user_id,
       HOP_START(event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_start,
       HOP_END  (event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_end,
       COUNT(*) AS clicks
FROM   clicks
GROUP BY user_id,
         HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Source table — declares the Kafka topic as a Flink table with a watermark 10s behind event time (handles small clock skew).
  2. Sink table — declares ClickHouse as a sink, primary-key upsert by (user_id, window_start).
  3. Streaming SQLHOP(event_time, 1m, 5m) defines a 5-minute window advancing every 1 minute (hopping). One row emitted per (user_id, window) whenever the window closes.
  4. Watermark behaviour — events arriving more than 10s late are dropped (configurable). The README should state this trade-off explicitly.
  5. Backpressure — the publisher pushes 1k events/sec; Flink processes ~10k/sec on a laptop, so the system is steady. If you push 50k/sec, Kafka starts buffering — useful to demonstrate in the README screenshot.
  6. Grafana panel — queries SELECT window_start, SUM(clicks) FROM clicks_5m_agg GROUP BY window_start, refreshes every 5 seconds, looks alive.

Output.

Window start window_end user_id clicks
12:00 12:05 u_42 17
12:01 12:06 u_42 19
12:02 12:07 u_42 18
12:00 12:05 u_99 4
12:01 12:06 u_99 5

Rule of thumb. Pin the streaming project to one failure mode the README explains in detail (e.g. "what happens to a 30-second-late event"). One deeply-explained failure mode is worth ten one-liner bullet points to a reviewer.

What hiring managers see.

  • You can write Flink SQL, not just Python.
  • You configured a real watermark — proves you have read the streaming docs.
  • You wired Grafana to ClickHouse — proves you can close the loop on observability.
  • You shipped docker-compose — proves the reviewer can run it without a Confluent Cloud account.

Common beginner mistakes.

  • Tumbling windows where the prompt asks for hopping. Tumbling = non-overlapping; hopping = overlapping. Pick deliberately and explain in the README.
  • No watermark configured. Without one Flink cannot close windows on event time, and the job effectively becomes processing-time — defeating the purpose.
  • Single partition on Kafka topic. A streaming demo on one partition hides every parallelism question; create the topic with 4–8 partitions even on a laptop.
  • Sink without a primary key. ClickHouse upserts only when you declare the PK; without one, replays produce duplicates and the dashboard double-counts.
  • No docker-compose down -v instruction in README. Reviewers who do not see the volume teardown command spend ten minutes wondering why their second docker-compose up boots with stale Kafka state.

Data engineering interview question on an orchestration build

A senior interviewer often opens this round: "Your daily DAG's load_snowflake task fails halfway through on a Tuesday. Walk me through what happens, what the user sees, and what you would change." — testing whether the candidate has thought through partial failure, retries, idempotency, alerting, and backfill, all in one storyline.

Solution Using idempotent task design + retries + dead-letter pattern

# tasks/load_snowflake.py
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def load_snowflake(execution_date, **_):
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    sql = f"""
        BEGIN;
        DELETE FROM raw.events
         WHERE event_date = '{execution_date}';
        COPY INTO raw.events
             FROM @stage/events/dt={execution_date}/
             FILE_FORMAT = (TYPE = PARQUET)
             ON_ERROR    = 'CONTINUE'
             RETURN_FAILED_ONLY = TRUE;
        COMMIT;
    """
    failed_rows = hook.get_pandas_df(sql)
    if not failed_rows.empty:
        write_dead_letter(failed_rows, execution_date)
        if len(failed_rows) > 100:
            raise ValueError(f"Too many bad rows: {len(failed_rows)}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Result
1 DELETE existing rows for execution_date idempotency: re-runs do not double-load
2 COPY INTO with ON_ERROR='CONTINUE' bad rows captured, good rows loaded
3 RETURN_FAILED_ONLY=TRUE exact list of failed rows returned
4 Write failed rows to a dead-letter table nothing silently lost
5 raise if > 100 bad rows DAG marked failed, Airflow alert fires
6 Airflow auto-retries 3× with 5-min backoff transient broker errors heal

Output:

Scenario User-visible effect Engineer-visible effect
Transient Snowflake timeout DAG retries silently, succeeds within 15 minutes metric ticks "transient_retries=1"
50 malformed rows DAG succeeds; dashboard slightly under-counts; bad rows in dead-letter engineer sees DLQ count, investigates Monday morning
5,000 malformed rows DAG fails; PagerDuty page; dashboard shows yesterday's data engineer triages, backfills with one command
Permanent schema drift DAG fails on the dbt test, not the load engineer changes model, redeploys, backfills

Why this works — concept by concept:

  • Idempotent taskDELETE before COPY lets Airflow safely retry without duplicating rows.
  • Dead-letter pattern — capturing failed rows separately prevents the "silent partial load" failure mode that destroys trust in dashboards.
  • Hard fail thresholdraise if > 100 bad rows ensures the DAG visibly fails when something is structurally wrong, instead of swallowing the error.
  • Built-in retries with backoff — 3 × 5-minute retries cover the long tail of transient network issues without paging the engineer.
  • Dashboard SLA — the on-call playbook the README links to lists "if dashboard not refreshed by 09:00, follow these steps."
  • Cost — engineer-attention cost of a transient blip = O(0); engineer-attention cost of a real failure = O(minutes to backfill, not hours to debug).

ETL
Topic — ETL
ETL pipeline design problems

Practice →

Streaming
Topic — streaming
Streaming pipeline practice problems

Practice →


5. Modern stack and end-to-end projects — Spark, Snowflake, product analytics, ML features

Tier 3 and Tier 4 are where senior signal lives — distributed compute, a cloud warehouse, and a real business outcome

The mental model: Tier 3 proves you can wield distributed compute (Spark) on top of a cloud warehouse (Snowflake), and Tier 4 proves you can assemble everything into a pipeline a real product team would consume. These four projects are the ones that convert a "junior" loop into a "junior-+1" or "associate" offer — the breadth that makes a reviewer say "this candidate is closer to mid-level than entry-level."

Visual end-to-end architecture diagram of a product analytics + ML feature project — Segment events → Kafka → Snowflake (raw, staging, marts via dbt) → Looker for analytics, with a parallel branch going to Spark feature engineering → Feast feature store → daily retraining cron; a small Unity-Catalog-style governance ribbon overlaid; on a light PipeCode card.

Why Tier 3 and Tier 4 punch above their weight.

  • Tier 3 forces the distributed mental model. Spark exposes shuffle, skew, broadcast joins, and partition counts — concepts that only become real on a cluster.
  • Tier 3 introduces a real cloud warehouse. Snowflake's separation of storage and compute, virtual warehouses, and credit-based pricing show up in every modern DE round.
  • Tier 4 is the storytelling tier. "Walk me through your end-to-end pipeline" is the single most powerful loop moment a candidate has.
  • Tier 4 covers governance. Access control, lineage, audit, and data contracts — the senior-track topics most candidates never touch.

Project 5 — Spark batch (daily user-behaviour aggregations)

Detailed explanation. Take the synthetic clickstream from Project 4 (or pull a public dataset like CommonCrawl access logs), land it as Parquet on MinIO, and write a PySpark job that computes daily per-user aggregations (clicks, sessions, time-on-site, top page) and writes the result as a partitioned Delta Lake table. Schedule it nightly via Airflow. Total wall-clock budget: ~14 days. The repo runs on a single Spark container via docker-compose, no cluster required.

Code.

# spark_jobs/daily_user_agg.py
from pyspark.sql import SparkSession, functions as F, Window

spark = (SparkSession.builder
         .appName("daily_user_agg")
         .config("spark.sql.shuffle.partitions", "64")
         .config("spark.sql.adaptive.enabled", "true")
         .getOrCreate())

df = (spark.read.parquet("s3a://events/dt=2026-05-29/")
      .filter(F.col("event_type").isin("click", "pageview")))

w_session = Window.partitionBy("user_id").orderBy("event_time")
sessionised = (df
    .withColumn("prev_ts", F.lag("event_time").over(w_session))
    .withColumn("session_gap_min",
                (F.col("event_time").cast("long") - F.col("prev_ts").cast("long")) / 60)
    .withColumn("session_id",
                F.sum(F.when(F.col("session_gap_min") > 30, 1).otherwise(0)).over(w_session)))

agg = (sessionised.groupBy("user_id")
       .agg(
           F.count("*").alias("clicks"),
           F.countDistinct("session_id").alias("sessions"),
           (F.max("event_time").cast("long") - F.min("event_time").cast("long")).alias("dwell_seconds"),
           F.first(F.col("page"), ignorenulls=True).alias("top_page")))

(agg.write.mode("overwrite").format("delta")
     .partitionBy("date").save("s3a://marts/user_daily_agg/"))
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Read partitioned Parquet — Spark prunes partitions automatically via the dt=YYYY-MM-DD path.
  2. Filter earlyevent_type IN ('click','pageview') runs before any shuffle, reducing data volume.
  3. Sessionise with a window function — 30-minute gap rule defines session boundaries; cumulative sum on the gap flag assigns a session id.
  4. Aggregate per user — clicks, sessions, dwell time, top page in a single shuffle.
  5. Adaptive query execution (AQE) enabled — Spark coalesces partitions and handles skew automatically.
  6. Write Delta Lake partitioned by date — downstream consumers read only the partitions they need; vacuum + optimize handle file proliferation.

Output.

user_id clicks sessions dwell_seconds top_page
u_42 188 4 7,841 /products/widget
u_99 12 1 311 /home
u_103 421 7 14,002 /search?q=ai
u_204 9 2 188 /signup

Rule of thumb. Spark portfolio projects must surface at least one tuning lever (spark.sql.shuffle.partitions, broadcast hint, AQE, salting). The reviewer is checking for "this person knows Spark has knobs" — show one.

What hiring managers see.

  • You can write idiomatic PySparkgroupBy().agg(), window functions, withColumn.
  • You know about shuffle partitions — the single most-asked Spark tuning knob in interviews.
  • You wrote Delta Lake, not just Parquet — proves familiarity with modern lakehouse format.
  • You sessionised events with a window — the canonical Spark interview pattern.

Common beginner mistakes.

  • Default spark.sql.shuffle.partitions=200 for a 1 GB job — over-shuffling kills performance. Tune to ~32–64 on a laptop.
  • .collect() in production code paths. A collect() pulls the entire DataFrame to the driver — OOM-prone and a senior-tier red flag. Use take(n) or write instead.
  • Joining without broadcast hints when one side is tiny. A 50 MB dim joined to a 50 GB fact without broadcast(small) triggers a needless shuffle.
  • No partitioning on write. Writing user_daily_agg un-partitioned forces downstream readers to scan everything; partition by date at minimum.
  • No OPTIMIZE/VACUUM in the Delta Lake project. Reviewer who sees a thousand 4 KB files on day 30 knows the candidate never read the Delta docs.

Project 6 — dbt + Snowflake medallion (Bronze / Silver / Gold)

Detailed explanation. Take the Tier 1 warehouse, replace DuckDB/Postgres with Snowflake, and re-layer the dbt repo as a medallion architecture: bronze (raw landed data, append-only), silver (cleansed, deduplicated, conformed), gold (business-ready facts and dims consumed by BI). Add Snowflake-specific features: virtual-warehouse sizing per environment, Snowpipe for ingestion, and dbt exposures pointing at the Looker dashboards downstream. Total wall-clock budget: ~21 days. The README must explain why medallion — usually a one-paragraph trade-off naming "schema-on-read at bronze, schema-on-write at silver."

Code.

-- models/bronze/br_orders.sql
{{ config(
    materialized='incremental',
    incremental_strategy='append',
    on_schema_change='append_new_columns',
    schema='bronze'
) }}

SELECT
    PARSE_JSON($1):order_id::STRING       AS order_id,
    PARSE_JSON($1):customer_id::STRING    AS customer_id,
    PARSE_JSON($1):amount::NUMBER(12,2)   AS amount,
    PARSE_JSON($1):currency::STRING       AS currency,
    PARSE_JSON($1):created_at::TIMESTAMP  AS created_at,
    METADATA$FILENAME                     AS source_file,
    CURRENT_TIMESTAMP()                   AS bronze_loaded_at
FROM @raw_stage/orders/
{% if is_incremental() %}
  WHERE METADATA$FILE_LAST_MODIFIED > (SELECT MAX(bronze_loaded_at) FROM {{ this }})
{% endif %}
Enter fullscreen mode Exit fullscreen mode
-- models/silver/sl_order.sql
{{ config(materialized='incremental', unique_key='order_id', schema='silver') }}

WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER
        (PARTITION BY order_id ORDER BY bronze_loaded_at DESC) AS rn
    FROM {{ ref('br_orders') }}
)
SELECT
    order_id,
    customer_id,
    amount,
    UPPER(currency) AS currency,
    created_at,
    bronze_loaded_at AS silver_loaded_at
FROM   ranked
WHERE  rn = 1
  AND  amount > 0
  AND  currency IN ('USD','EUR','GBP','INR')
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Bronze — raw JSON landed via Snowpipe; one row per source-file record; METADATA$FILENAME preserved for lineage.
  2. Silver — deduplicated on order_id via ROW_NUMBER() OVER (...) keeping the most recently-loaded copy; filters out invalid currencies and non-positive amounts.
  3. Goldfact_order, dim_customer, dim_date, dim_currency joined from silver; downstream-friendly column names; materialised as tables.
  4. Virtual warehouses per envWH_DEV_XS for CI runs, WH_PROD_S for nightly builds. Pricing visible to the reviewer in the README.
  5. dbt exposures_exposures.yml declares the Looker dashboards that depend on each gold mart, giving full lineage from raw → dashboard.
  6. CI — GitHub Actions runs dbt build --select state:modified+ on every PR using state: deferral against main — only the changed models rebuild.

Output.

Layer Schema Materialisation Row count (example) Read pattern
Bronze bronze incremental append 100 M once at write
Silver silver incremental upsert 95 M (deduplicated) many reads
Gold gold table 8 facts + 4 dims dashboard + BI

Rule of thumb. Always document the medallion invariant: "raw never leaves bronze; analytics never reads bronze; everything joins on silver keys." A single sentence on the README spares twenty minutes of reviewer questions.

What hiring managers see.

  • You understand schema-on-read vs schema-on-write — the medallion's defining trade-off.
  • You configured a virtual warehouse — proves familiarity with Snowflake's compute/storage split.
  • You wired dbt exposures — proves you think about end-to-end lineage.
  • You used state:modified+ — the senior-tier dbt CI pattern.

Common beginner mistakes.

  • Letting bronze leak into BI. A Looker tile reading from bronze.* defeats the entire architecture; route everything through gold.
  • Same warehouse for dev and prod. Reviewers expect to see at least two virtual warehouses sized differently; sharing one is a cost-discipline red flag.
  • Snowpipe + dbt incremental together without dedup. Snowpipe can re-deliver files on retry; the silver layer must dedupe on a stable business key.
  • No tags on dbt models. dbt build --select tag:nightly is the operational handle for selective runs; missing tags signal the project never ran on a schedule.
  • One mega-mart instead of fact + dim split. A gold.everything table that joins seven sources into 80 columns is a tell that the candidate has not internalised dimensional modelling.

Project 7 — End-to-end product analytics platform (Segment → Kafka → Snowflake → dbt → Looker)

Detailed explanation. Project 7 is the flagship of the portfolio. It assembles ingest, streaming, warehouse, transformation, and BI into one coherent repo. Pick a fictional product (e.g. "a SaaS habit tracker"), instrument synthetic events with the Segment Node SDK (or just publish to Kafka directly), land them in Snowflake via Snowpipe Streaming, build a daily_user_metrics mart in dbt, and surface DAU/WAU/retention in Looker. The trade-off paragraph: "Kafka over batch ingest because the README has to defend streaming semantics in interview rounds even when the volume does not require it." Total wall-clock budget: ~28 days.

Code.

# ingest/produce_events.py — synthetic Segment-shaped publisher
import json, time, random
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "acks": "all",
    "enable.idempotence": True,
    "compression.type": "zstd",
})

EVENTS = ["habit_logged", "habit_completed", "streak_started", "streak_broken"]
USERS  = [f"u_{i:04d}" for i in range(1, 2001)]

while True:
    event = {
        "event":      random.choice(EVENTS),
        "user_id":    random.choice(USERS),
        "timestamp":  time.time(),
        "properties": {"habit": random.choice(["read","run","meditate","write"]),
                       "duration_min": random.randint(5, 60)},
        "context":    {"source": "node-sdk", "ip": "127.0.0.1"},
    }
    producer.produce("events", key=event["user_id"].encode(),
                     value=json.dumps(event).encode())
    producer.poll(0)
    time.sleep(0.001)   # ~1k events/sec
Enter fullscreen mode Exit fullscreen mode
-- models/gold/daily_user_metrics.sql
{{ config(materialized='table', schema='gold') }}

WITH base AS (
    SELECT user_id,
           DATE(event_time) AS event_date,
           COUNT(*)                 AS events,
           COUNT(DISTINCT properties:habit)     AS distinct_habits,
           SUM(properties:duration_min::NUMBER) AS total_minutes
    FROM   {{ ref('sl_events') }}
    GROUP BY 1, 2
),
retention AS (
    SELECT a.user_id, a.event_date,
           CASE WHEN b.user_id IS NOT NULL THEN 1 ELSE 0 END AS retained_d7
    FROM   base a
    LEFT JOIN base b
      ON a.user_id  = b.user_id
     AND b.event_date = DATEADD('day', 7, a.event_date)
)
SELECT b.*, r.retained_d7
FROM   base b
JOIN   retention r USING (user_id, event_date)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Producer — publishes ~1k events/sec to Kafka with idempotence on and zstd compression.
  2. Snowpipe Streaming — ingests Kafka topic into bronze.events with sub-second latency.
  3. Silversl_events extracts the JSON event, user_id, timestamp, properties into typed columns.
  4. Golddaily_user_metrics computes per-user-per-day events, distinct habits, total minutes, and a retained_d7 flag (was the user active again 7 days later?).
  5. Looker dashboard — DAU, WAU, day-7 retention cohort chart, top-habit chart. Auto-refreshes nightly.
  6. README architecture diagram — six boxes left-to-right (Segment → Kafka → Snowpipe → Bronze → Silver → Gold → Looker), with one-sentence captions and the rejected alternative noted under each box.

Output.

Metric (sample week) Value
Events ingested 5,184,000
DAU (median) 612
WAU (median) 1,840
Day-7 retention 41%
Looker dashboard SLA refreshed by 09:00

Rule of thumb. A flagship project must answer two questions in the first three README lines: "what does it do?" and "why would a recruiter spend three minutes reading further?" Lead with the second.

What hiring managers see.

  • You can assemble seven systems into one coherent pipeline.
  • You instrumented events end-to-end — the modern product-data muscle.
  • You modelled retention — the business-judgement signal interviewers love.
  • You documented an architecture diagram — the artefact every loop opens with.

Common beginner mistakes.

  • No event schema document. Reviewers expect to see events.md or a JSON schema file enumerating every event name, properties, and version — schema-as-code is the modern product-data standard.
  • Retention computed off the dashboard instead of the warehouse. Defining retention SQL inside Looker locks the metric to the BI tool; always materialise the metric in dbt so it is reusable.
  • Looker dashboard with 30 tiles. Three tiles answering "DAU, retention, top action" beats thirty tiles answering nothing. Reviewers grade clarity, not volume.
  • No data contract. A one-page contracts/events.yml listing the producer team, schema owner, and break-the-build policy is the senior-tier signal for product-data pipelines.
  • No screen-recording. A 90-second Loom walk-through of the dashboard auto-refreshing is the single highest-leverage README artefact for Tier 4 projects.

Project 8 — ML feature pipeline (Spark feature engineering + Feast feature store + retraining cron)

Detailed explanation. Layer an ML feature path onto Project 7. Compute behavioural features in Spark from the bronze.events table (e.g. clicks_last_7d, sessions_last_30d, avg_session_min), register them in Feast (a feature store), schedule a daily backfill + a retraining cron via Airflow, and serve a simple churn-risk model via FastAPI for offline scoring. The point is to demonstrate the DE-side of MLOps — feature engineering, feature stores, and scheduled retraining — without writing the model itself (sklearn LogisticRegression is fine). Total wall-clock budget: ~28 days.

Code.

# spark_jobs/build_features.py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("build_features").getOrCreate()

events = spark.read.parquet("s3a://bronze/events/")

feats = (events
    .filter(F.col("event_time") >= F.date_sub(F.current_date(), 30))
    .groupBy("user_id")
    .agg(
        F.sum(F.when(F.col("event_time") >= F.date_sub(F.current_date(), 7), 1)
                   .otherwise(0)).alias("clicks_last_7d"),
        F.countDistinct(F.date_trunc("day", "event_time")).alias("active_days_last_30d"),
        F.avg(F.col("properties.duration_min")).alias("avg_session_min"),
        F.max("event_time").alias("last_event_time"),
    )
    .withColumn("feature_ts", F.current_timestamp()))

(feats.write.mode("overwrite").format("delta")
       .save("s3a://feature-store/user_behaviour_v1/"))
Enter fullscreen mode Exit fullscreen mode
# feast/feature_store.yaml + feature_view registration
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

user = Entity(name="user_id", join_keys=["user_id"])

src = FileSource(
    path="s3://feature-store/user_behaviour_v1/",
    timestamp_field="feature_ts",
)

user_behaviour_v1 = FeatureView(
    name="user_behaviour_v1",
    entities=[user],
    ttl=None,
    schema=[
        Field(name="clicks_last_7d",       dtype=Int64),
        Field(name="active_days_last_30d", dtype=Int64),
        Field(name="avg_session_min",      dtype=Float32),
    ],
    source=src,
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Spark feature job — reads 30 days of events, computes per-user behavioural features, writes Delta Lake to the feature store path.
  2. Feast registration — declares the feature view, entity, and source. feast apply writes the registry.
  3. Offline retrieval — training script calls store.get_historical_features(entity_df, feature_refs=[...]) to materialise a training matrix.
  4. Daily retraining cron — Airflow DAG fires build_features.py, then train_model.py, then writes the new sklearn .pkl to S3 with a timestamped path.
  5. Online serving (optional) — FastAPI endpoint loads the latest .pkl, calls store.get_online_features(...), returns a churn-risk score. Useful demo but not strictly required for the DE-side narrative.
  6. Model monitoring — log feature distributions per run; a weekly notebook checks for drift against the previous month's distribution.

Output.

user_id clicks_last_7d active_days_last_30d avg_session_min churn_risk_score
u_0042 38 14 12.4 0.07
u_0099 2 1 3.8 0.74
u_0103 91 23 18.9 0.04
u_0204 0 0 0.0 0.91

Rule of thumb. The DE-side of MLOps is features + scheduling + governance; the ML-side is models + evaluation. Stay on the DE side in the README — "I built the feature pipeline; the data scientists own the model" is exactly the framing senior DE interviewers want.

What hiring managers see.

  • You built a feature store, not just a Spark job — proves familiarity with the modern ML-data interface.
  • You scheduled retraining — proves operational thinking beyond a one-shot script.
  • You logged feature drift — proves you understand the failure mode that bites production ML systems.
  • You drew a clean DE/ML responsibility split — proves team-thinking and reduces "is this a DE or a DS role?" loop friction.

Common beginner mistakes.

  • Training on the same window the features were computed for. Causes label leakage; always compute features on t and train against labels at t + horizon.
  • No feature versioning. A feature view called user_behaviour without a _v1 suffix bites the day the schema changes — every modern feature store treats versions as first-class.
  • Skipping the entity declaration. Feast feature views without an Entity make point-in-time joins fail silently; reviewers spot it immediately.
  • Backfilling features with current_timestamp(). Always parameterise feature_ts from execution_date so historical backfills produce consistent values.
  • No mention of online vs offline path. A README that does not distinguish between offline training retrieval and online serving retrieval is a tell that the candidate has not read the feature-store docs.

Data engineering interview question on an end-to-end build

A senior interviewer often anchors the design round with: "Walk me through your end-to-end product analytics pipeline. Where does data enter? Where does it land? Who consumes it? What was the single hardest design decision?" — testing whether the candidate has internalised the entire shape, not just the slice they wrote last.

Solution Using a five-stage architecture narrative — ingest → land → transform → consume → govern

INGEST     → Segment / SDKs publish to Kafka topic "events" (1k events/sec, key=user_id).
LAND       → Snowpipe Streaming auto-loads to bronze.events.
TRANSFORM  → dbt silver (typed columns) → dbt gold (daily_user_metrics, retention).
CONSUME    → Looker dashboards on gold; Spark feature job for ML.
GOVERN     → Snowflake access roles per layer; dbt exposures for lineage.

Hardest decision : Kafka over batch ingest, because the project must defend
                   streaming semantics in interview rounds even when volume
                   does not require it. Trade-off: extra container in
                   docker-compose, but a Confluent-style narrative payoff.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage Latency Data volume / day Who reads it
Kafka topic ms 5.2 M events Snowpipe, Spark
bronze.events ~1s 5.2 M rows silver only
silver.sl_events minutes 5.1 M rows (after dedup) gold marts
gold.daily_user_metrics minutes ~1.8k rows Looker
feature store user_behaviour_v1 hourly ~2k rows model retraining + serving

Output:

Surface Owner Refresh cadence Consumed by
Kafka topic DE continuous Snowpipe, Spark
Bronze DE continuous dbt silver
Silver DE hourly via dbt dbt gold
Gold marts analytics engineer nightly Looker
Feature store DE daily ML platform team
Model registry DS weekly serving API

Why this works — concept by concept:

  • Five-stage narrative — gives the interviewer a clear walk-through skeleton that they can interrupt at any stage to drill deeper.
  • Latency table — quantifies the SLA at every stage; the kind of artefact senior DEs maintain at work.
  • Owner column — proves you understand "who is responsible" boundaries, even if it is just you wearing four hats in the portfolio.
  • Governance ribbon — Snowflake roles + dbt exposures is the cheapest credible governance story you can ship at portfolio scale.
  • Hardest-decision paragraph — the README's single most-quoted artefact in onsites; pre-write it.
  • Cost — total local compute = O(one Docker container per service); Snowflake-on-trial credits = O($10–20 over the life of the project).

SQL
Topic — joins
Join problems for the gold-marts tier

Practice →

Streaming
Topic — real-time analytics
Real-time analytics problems

Practice →


The README template that converts reviewers — for every project in the portfolio

A reviewer reads the README before the code. A consistent README structure across all your data engineering projects signals that you take portfolio polish as seriously as the build itself. Use the same eight-section skeleton for every repo so a reviewer who has seen one of your projects can navigate the rest instantly. The template below is the one that consistently converts the most "reviewer opens the repo" moments into "reviewer asks the candidate about it in the next loop."

The eight-section README skeleton.

  • Top — one-line problem statement. "Daily ETL that lands NYC 311 complaints in Snowflake and refreshes a Looker dashboard before 09:00." One sentence; no jargon.
  • Architecture diagram. A single image above the fold; the image is the README's single most-clicked artefact.
  • Quick start. git clone, docker-compose up, open localhost:3000. Three lines, total. If it takes more, fix it before publishing.
  • What it does. Three bullets: ingest source, transformation step, downstream consumer. Five lines maximum.
  • Trade-offs. Two short paragraphs naming the alternative you rejected and why. This section gets re-read on every loop.
  • Tests. Names the test types (not_null, unique, custom) and a link to the GitHub Actions run.
  • What I would build next. Three forward-looking bullets — proves the project is alive in your head.
  • Credits / data source. Names the public dataset and any libraries that did meaningful heavy lifting.

Three README anti-patterns that lose reviewers.

  • A wall of text before the first heading. Reviewers scan, not read; the eight-section skeleton matters because every section is its own heading.
  • Stale Quick Start. If docker-compose up does not work on a fresh clone, the README is broken — test it on a clean machine before publishing.
  • No "Trade-offs" section. Without it, the reviewer cannot find the senior signal. Add even one paragraph.

Senior-tier polish moves that take an hour and pay off for years

Once the eight-section README is in place, a handful of one-hour polishing moves elevate a competent project into a flagship-class one. None of them require new code; all of them compound when the project sits on GitHub for the next twelve months.

Five polishing moves that punch above their weight.

  • Pin the project on your GitHub profile. A pinned repo with a thumbnail and a one-line tagline pulls eight seconds of free attention from every recruiter who lands on your profile.
  • Add a thumbnail image to the README. A 1200×630 PNG at the top of the README renders as a thumbnail in GitHub previews and on LinkedIn link unfurls — the closest thing GitHub has to an OG image.
  • Add a GitHub Actions badge. A green CI badge in the README earns one full unit of trust before the reviewer reads any code.
  • Record a 90-second Loom walk-through. Embed it in the README. The walk-through is the single highest-leverage artefact for portfolio Tier 3 and Tier 4 projects.
  • Open three GitHub Issues with the "next steps" label. Reviewers click the Issues tab on every interesting repo; three open issues with clear titles read as "this person is iterating," not "this person has abandoned the project."

Three senior-tier moves nobody bothers with — and why they convert.

  • An EXPERIMENTS.md log. Two paragraphs per experiment: hypothesis, result, decision. Proves you treat the portfolio like a real engineering effort.
  • A "Costs" table. Snowflake credits used, EC2 hours, S3 storage. Cost-awareness is a senior-track muscle most candidates never display in portfolio form.
  • A CONTRIBUTING.md. Even for a solo portfolio repo, a one-page contributor guide signals "I have thought about how a team would join this project" — which is exactly the right signal for an entry-level hire.

Cheat sheet — pick four projects by four patterns

  • Tier 1 — Foundation. Pick Project 1 (SQL warehouse) or Project 2 (dbt analytics + DuckDB). Demonstrates SQL fluency, dimensional modelling, and dbt seeds/sources/tests. Effort: 1–2 weeks. Interview lift: SQL screen, modelling round.
  • Tier 2 — Orchestration. Pick Project 3 (Airflow daily ETL). Demonstrates DAG design, idempotent task patterns, retries, dead-letter, SLA awareness, backfill. Effort: ~2 weeks. Interview lift: ETL system design round.
  • Tier 3 — Modern stack. Pick Project 5 (Spark batch) or Project 6 (dbt + Snowflake medallion). Demonstrates distributed compute or cloud-warehouse fluency. Effort: 2–3 weeks. Interview lift: distributed compute round, warehouse round.
  • Tier 4 — End-to-end. Pick Project 7 (product analytics platform). Demonstrates breadth, business judgement, and storytelling. Effort: ~4 weeks. Interview lift: senior signal, behavioural-design round, "walk me through a project" moment.
  • Optional add-on for senior-leaning roles — layer Project 8 (ML feature pipeline) on top of Project 7. Demonstrates DE/ML interface and feature-store fluency. Effort: +4 weeks. Interview lift: MLOps-adjacent rounds.
  • What to skip if you are time-constrained — drop Project 4 (Kafka + Flink) unless you are targeting streaming-heavy shops (Uber, Stripe, Confluent, DoorDash). Replace with a second Project 3 variant.
  • What to never skip — Tier 1. A reviewer who cannot see the SQL muscle will discount everything above it as copy-paste.

Frequently asked questions

How many data engineering projects should I have on my resume?

Four well-built projects is the sweet spot — one per tier (Foundation, Orchestration, Modern stack, End-to-end). Two is the minimum that signals seriousness; eight is the maximum that a reviewer will actually click through. Each project on the resume should have a one-line architecture statement, a two-line trade-off paragraph, and a clickable GitHub link. Anything beyond four risks crowding out the trade-off lines that actually do the persuasive work.

Should data engineering projects use real production-scale data?

No — use a realistic shape of data, not a realistic volume. A 100 GB clickstream is impressive in production and useless in a portfolio because the reviewer will not wait an hour for it to load. Pick a public dataset in the 1–10 GB range (NYC Taxi, Olist e-commerce, GitHub events, OpenWeather), or generate synthetic events with a publisher script. The reviewer wants to see schema design, pipeline shape, and trade-offs — all of which are visible at 1 GB.

Where should I host my portfolio projects?

GitHub is mandatory; everything else is optional. The README, the docker-compose.yml, and the tests should run locally on a reviewer's laptop with no cloud account required. If you want a live cloud deployment, Snowflake's free-trial credits cover a small warehouse and Render or Railway give you a free Looker Studio embed. Do not require AWS credentials in your docker-compose — that is a friction wall most reviewers will not climb.

Are Kaggle datasets enough for de projects for resume?

Kaggle datasets are fine for Tier 1 (warehouse + dbt) but limiting for Tier 2–4 because they are static — they do not arrive on a schedule and they do not let you demonstrate ingest. For Tier 2 and beyond, pair a Kaggle dataset with a synthetic publisher script that re-publishes the rows hour-by-hour, simulating fresh arrivals. That single line in the README — "data arrives via a synthetic Kafka publisher to simulate a real ingest cadence" — converts a Kaggle CSV into a credible streaming source.

Do I need to deploy projects to the cloud to count them on my resume?

No, and most reviewers prefer it the other way around. A local docker-compose up they can run in 60 seconds beats a terraform apply they would never run. Cloud deployments are useful when (a) you want to demonstrate cloud-native services (Snowflake virtual warehouses, Glue, BigQuery), or (b) you want a live URL to link from the README. Even then, prefer free-tier services (Snowflake trial, Looker Studio, Render) and document the cloud spend in the README — reviewers love cost-awareness signal.

How long should each portfolio project take?

Use the per-tier budget as a contract with yourself. Tier 1 = 1–2 weeks per project; Tier 2 = 2–3 weeks; Tier 3 = 2–3 weeks; Tier 4 = 3–4 weeks. If a project crosses 1.5× its budget, ship what you have, write the README, and move on. Scope creep at the portfolio stage usually means "I am avoiding the harder next-tier project by polishing the current one" — recognise it and move up the pyramid. Total realistic budget for a four-project pyramid is ~10–12 weeks of evening work.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every project tier above ships with hands-on practice rooms where you write the SQL, design the DAGs, and tune the Spark jobs that fill the README sections reviewers actually click. Start with the foundation tier and walk the pyramid; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice ETL now →
Streaming drills →

Top comments (0)