The single biggest leverage point in an entry-level data engineering job search is not another certification or another LinkedIn course completion — it is a data engineering portfolio of two to four projects that a hiring manager can clone, run with one command, and read in under ten minutes. Recruiters and tech screens have stopped treating certs as signal; what they want is evidence. They want a GitHub link with a clean README, a docker-compose up that boots a working pipeline, and a one-paragraph description that names the problem, the stack, and the trade-off you defended.
This guide walks you through eight data engineer projects arranged as a difficulty pyramid: two foundation projects that prove SQL and dimensional modelling fluency, two orchestration projects that prove you can schedule and stream, two modern-stack projects that prove you can wield distributed compute and a cloud warehouse, and two end-to-end projects that prove business judgement and breadth. Each project ships as a #### block with a stack list, a short build path, a real code snippet you can paste, and an outcome list of the exact things hiring managers will mention back to you in interviews. By the end you will have a concrete de projects for resume plan, complete with etl project ideas and data engineering project ideas you can start tonight.
When you want hands-on reps alongside the build, drill the ETL practice library →, warm up with aggregation problems →, and rehearse the streaming pipeline drills → so each project ships paired with the interview muscles it demonstrates.
On this page
- Why portfolio beats certifications for your first DE role
- The portfolio difficulty pyramid — Foundation to End-to-End
- Foundation projects — SQL warehouse and dbt analytics
- Orchestration projects — Airflow ETL and Kafka streaming
- Modern stack and end-to-end projects — Spark, Snowflake, product analytics, ML features
- Cheat sheet — pick four projects by four patterns
- Frequently asked questions
- Practice on PipeCode
1. Why portfolio beats certifications for your first DE role
Hiring managers verify hands-on with GitHub, not certs
The one-line invariant for every entry-level data engineering projects conversation is this: a hiring manager would rather scroll your GitHub for ten minutes than read a stack of certificate PDFs. Certificates prove that you sat through a course; they do not prove you can wire an API to a warehouse, schedule a daily refresh, write a test that fails when a column goes null, and document the whole thing for a teammate. A portfolio proves all four in a single repo.
Five reasons a portfolio out-signals certifications.
- Self-evident depth. Reading 200 lines of dbt is faster, more honest, and more diagnostic than reading a certificate. Anyone can pass a multiple-choice quiz; very few can ship a pipeline that does not crash overnight.
-
Reproducibility is the test. A README that says "clone,
docker-compose up, hitlocalhost:3000for the dashboard" tells the reader, "this person has thought about onboarding." It is the same skill that lets you onboard a new teammate at work. - Trade-off articulation. The README's "Trade-offs I made" section is the artefact most senior engineers actually click — it is the strongest signal of judgement. A cert hides judgement; a portfolio surfaces it.
- Storytelling at the loop. When a behavioural round asks "walk me through a project," candidates with portfolios produce a structured five-minute story. Cert-only candidates produce a rehash of the course outline.
- Reverse leverage. A portfolio project pulls you into the loop instead of the other way around. Recruiters routinely DM candidates whose GitHubs surface relevant projects when filtering for a specific stack.
The single most-asked screening prompt.
"Show me a pipeline you built end-to-end. Walk me through the architecture, what you chose, and why."
If you cannot answer that question with a link and a five-minute narrative, you are competing on resume keywords alone — a losing battle when every fresh-graduate resume lists the same SQL, Python, and Spark bullets. Your data engineering portfolio is the conversation that resume bullets cannot have on your behalf.
What an entry-level reviewer is scanning for in 90 seconds.
- A pinned repo with a thumbnail in the README so the GitHub homepage looks intentional.
- A one-sentence problem statement above the architecture diagram.
- A
docker-compose.ymlorMakefileso the project boots locally without manual setup. - Tests — even one. Empty tests imply "I have heard of testing"; a passing test on a
not_nullconstraint implies "I have used testing." - A "What I would build next" section showing where the project is heading — proves you have an opinion about the system, not just a snapshot.
Detailed sub-topic — the 2-strong-vs-5-weak rule
Detailed explanation. New candidates often pile eight half-finished projects onto a resume thinking that quantity demonstrates effort. The opposite signal lands. A hiring manager who clicks two projects and finds shallow, copy-pasted tutorials assumes the other six are worse and bounces. Two deep projects — each with tests, a README, a docker-compose, and a trade-off section — beat five shallow ones every single time. The rule of thumb: ship the smallest portfolio you can defend in interview, then iterate breadth from there.
Code (resume snippet that reflects the rule).
## Projects
### nyc-taxi-warehouse [GitHub →]
PostgreSQL star schema + dbt models + Metabase dashboard.
- 8 GB of trip data ingested with idempotent COPY; partitioned by month.
- Star schema with `fact_trips`, `dim_zone`, `dim_time`, `dim_payment`.
- 14 dbt tests (not_null, unique, accepted_values, custom row-count freshness).
- `docker-compose up` boots Postgres + dbt-core + Metabase in <60s.
- Trade-off chosen: monthly partitions over daily — fewer files, simpler vacuum
schedule, acceptable for the dashboard SLA of "fresh by 09:00 next day."
### dbt-analytics-duckdb [GitHub →]
Seeds + sources + 12 staging + 6 marts + GitHub Actions CI.
- DuckDB + dbt-core, runs in <30s on a laptop, zero cloud cost.
- Tests run on every PR; merge blocked on test failure.
- Trade-off chosen: DuckDB over Snowflake — same dbt code, zero cloud spend,
small enough to onboard a reviewer in five minutes.
Step-by-step explanation.
- Two projects, not eight. Each one gets its own headline, a one-line architecture statement, and a clear trade-off paragraph.
- Numbers anchor the bullets. "8 GB" and "14 dbt tests" and "<60s boot" are concrete; "comprehensive testing" is vapour.
- The trade-off line is the senior signal. "Monthly partitions over daily" tells the reader you considered alternatives and picked one for a reason.
- No hand-waving language. Avoid "leveraged" and "utilized"; prefer "ingested", "modelled", "tested", "shipped" — verbs the reader can verify.
- Repo links open in a new tab and land on a README that mirrors the bullets above. The bullets must be falsifiable in 60 seconds of scrolling.
Output.
| Profile | Avg interview-conversion rate (informal sample) | Why |
|---|---|---|
| 5 shallow projects | low | reviewer bounces on first shallow click |
| 2 deep projects | high | reviewer clicks both, reads the README, opens the issues tab |
| 0 projects | very low | recruiter sorts you below candidates with even one repo |
Rule of thumb. Ship two projects you can defend cold against any "why did you do it that way?" question before you start a third. The deepest two-project portfolio in the world is still tighter than a six-project portfolio with one trade-off paragraph among them.
Data engineering interview question on portfolio strategy
A senior interviewer might open the loop with: "We have ten candidates with similar resumes. Why should I open your GitHub first?" — testing whether you have a sharp, one-sentence framing of your portfolio's strongest project and the trade-off it defends.
Solution Using a single defensible flagship project + supporting repos
# Hi — I'm Alex. Open this repo first.
## Flagship: product-analytics-platform
End-to-end Segment → Kafka → Snowflake → dbt → Looker, all dockerised.
- 1.2 M synthetic events/day; rolling DAU, WAU, retention cohorts.
- 38 dbt tests; CI on every PR; Looker dashboard auto-refreshes nightly.
- Trade-off chosen: Kafka over batch ingest — proves I can defend streaming
semantics in an interview without needing the production traffic.
## Supporting projects
- nyc-taxi-warehouse — proves SQL + dimensional modelling.
- airflow-daily-etl — proves DAG design + error handling.
Step-by-step trace.
| Step | Reviewer action | Time spent | Signal received |
|---|---|---|---|
| 1 | Lands on GitHub profile | 5s | pinned flagship repo with thumbnail, clear name |
| 2 | Clicks flagship README | 20s | sees architecture diagram + one-line problem statement |
| 3 | Scrolls to "Trade-offs" | 15s | reads two paragraphs naming the alternative they rejected |
| 4 | Opens docker-compose.yml
|
10s | sees five clean services + comments |
| 5 | Opens GitHub Actions tab | 5s | green check on main, last run yesterday |
| 6 | Reads "What I would build next" | 15s | sees roadmap — Spark feature engineering branch |
Output:
| Outcome | Before flagship strategy | After flagship strategy |
|---|---|---|
| Recruiter first-call response rate | low — 8% | high — 31% |
| Loop conversion to onsite | medium — 22% | high — 48% |
| Reviewer-named project in onsite | rare | every loop |
Why this works — concept by concept:
- Single flagship as the anchor — the reviewer's attention is scarce; one repo that absorbs 80% of that attention compounds far more than five repos that fight for it.
- Architecture-diagram-first README — the first scroll above the fold either earns the next ten minutes or loses them; a diagram earns them.
- Trade-off section as the senior signal — "I picked X over Y because Z" is the single sentence every interviewer is listening for, on every loop.
- Green-check CI — a passing GitHub Actions badge in the README signals "I take repeatability seriously" — a free reputation point.
- Roadmap section — proves the project is alive in your head, not abandoned; future-tense bullets are interview gold.
- Cost — flagship build cost = O(3–4 weekends); ongoing maintenance cost = O(1 hour/month); interview leverage = O(every loop you take in the next 12 months).
ETL
Topic — ETL
End-to-end ETL practice problems
2. The portfolio difficulty pyramid — Foundation to End-to-End
Pick one project from each of four tiers — that is the entire data engineering portfolio
The mental model in one sentence: a strong portfolio is a four-tier pyramid that walks the reviewer up from SQL + modelling to a real product pipeline, with exactly one project per tier. Skipping a tier creates a credibility gap — a reviewer who sees a Spark + Snowflake project but no underlying warehouse repo wonders whether the candidate actually wrote the SQL or copy-pasted a tutorial. Climbing the pyramid in order builds and demonstrates the curriculum at the same time.
The four tiers in five bullets.
- Tier 1 — Foundation. SQL warehouse + dbt analytics. Proves schema design, SQL fluency, dimensional modelling, and version control on data assets. Roughly 1–2 weeks of work per project.
- Tier 2 — Orchestration. Airflow daily ETL + Kafka/Flink streaming. Proves scheduling discipline, DAG design, streaming semantics, and operational hygiene (alerting, retries, backfills). Roughly 2–3 weeks per project.
- Tier 3 — Modern stack. Spark batch + dbt-on-Snowflake medallion. Proves distributed compute and a cloud warehouse layered together. Roughly 2–3 weeks per project.
- Tier 4 — End-to-end. Product analytics platform + ML feature pipeline. Proves breadth and business judgement — the senior-track signal even for an entry-level loop. Roughly 3–4 weeks per project.
-
Pick one project per tier for a four-project portfolio that covers the full
data engineering project ideassurface every loop touches.
Skills demonstrated per tier — the interview-side mapping.
- Tier 1 → SQL screen, modelling round. "Walk me through a star schema you built." "How did you decide your grain?"
- Tier 2 → ETL system design round. "How does your DAG handle a partial failure on Task 3?" "What is your backfill strategy?"
- Tier 3 → distributed compute round. "How would you handle skew in this Spark job?" "Why medallion?"
- Tier 4 → senior signal + business judgement round. "Why did you put the ML feature path on the same Kafka topic as analytics?"
Why an ordered curriculum beats a random walk.
- A Tier 4 project that pretends Tier 1 has been mastered (no tests on the warehouse, no schema docs) collapses under interviewer probing.
- The Tier 1 → Tier 2 jump teaches you the difference between "running a SQL query" and "running it on schedule when the input changes." This is the single biggest mental shift for a fresh DE.
- The Tier 2 → Tier 3 jump teaches you the difference between Python on one machine and Python on a cluster. Most candidates assume this gap is small; it is not.
- The Tier 3 → Tier 4 jump teaches you the difference between a pipeline that runs and a pipeline that serves a business outcome. This is the gap that earns the offer.
Detailed sub-topic — sizing each project to a calendar week
Detailed explanation. Each tier comes with a recommended effort budget. Going over budget means the project has grown beyond a "portfolio piece" into a side-business — fine, but stop letting it block the next tier. Going under budget usually means the project lacks tests, docs, or a defensible trade-off. The budget below assumes 8–12 hours/week of focused build time.
Code (project-tracker JSON you can copy into Notion/Linear).
{
"portfolio_plan": [
{ "tier": 1, "project": "sql-warehouse", "effort_weeks": 2, "stack": ["postgres","dbt","metabase"] },
{ "tier": 1, "project": "dbt-analytics-duckdb", "effort_weeks": 1, "stack": ["duckdb","dbt","gh-actions"] },
{ "tier": 2, "project": "airflow-daily-etl", "effort_weeks": 2, "stack": ["airflow","s3","snowflake","dbt","looker-studio"] },
{ "tier": 2, "project": "kafka-flink-realtime", "effort_weeks": 3, "stack": ["kafka","flink-sql","clickhouse","grafana"] },
{ "tier": 3, "project": "spark-user-behaviour", "effort_weeks": 2, "stack": ["pyspark","s3","delta-lake","minio"] },
{ "tier": 3, "project": "dbt-snowflake-medallion","effort_weeks": 3, "stack": ["dbt","snowflake","gh-actions"] },
{ "tier": 4, "project": "product-analytics", "effort_weeks": 4, "stack": ["segment","kafka","snowflake","dbt","looker"] },
{ "tier": 4, "project": "ml-feature-pipeline", "effort_weeks": 4, "stack": ["spark","feast","airflow","mlflow"] }
]
}
Step-by-step explanation.
- List every project up-front so the calendar work is visible. Aspirations that live only in your head shrink.
- Budget each project in calendar weeks, not "hours" — calendar weeks include the inevitable real-life interruptions, which make the plan honest.
- Stack list per project doubles as resume bullets and as Google-search queries during the build.
- Total realistic budget for two projects per tier = ~20 weeks of evening work. For the recommended four-project portfolio (one per tier) it is closer to 10 weeks. Both are achievable inside a focused job search.
-
Track project state in three buckets —
planning,building,polishing. "Polishing" is where README, tests, and screen-recording demos live; it is the bucket beginners skip.
Output.
| Tier | Recommended one-of-two | Effort weeks | Interview rounds it covers |
|---|---|---|---|
| 1 | dbt-analytics-duckdb | 1 | SQL screen, modelling round |
| 2 | airflow-daily-etl | 2 | ETL system design round |
| 3 | dbt-snowflake-medallion | 3 | warehouse + cloud round |
| 4 | product-analytics | 4 | senior signal + business judgement |
Rule of thumb. "Four strong projects beats eight weak ones — always." If a project is taking more than 1.5× its budget, ship what you have, write the README, and move on to the next tier. The reviewer rewards breadth across tiers more than depth within one.
Data engineering interview question on portfolio sequencing
A senior interviewer often probes: "If you had eight weeks to build a portfolio from scratch, in what order would you tackle the projects and why?" — testing whether the candidate sees the dependencies between tiers (you cannot build a Tier 3 medallion if you cannot build a Tier 1 star schema first).
Solution Using strict tier order with one project per tier
Week 1 → Tier 1: dbt-analytics-duckdb (lightweight, low risk, builds dbt muscle).
Week 2-3 → Tier 2: airflow-daily-etl (the first project that *runs on a schedule*).
Week 4-6 → Tier 3: dbt-snowflake-medallion (introduces the cloud warehouse).
Week 7-10 → Tier 4: product-analytics (assembles everything into one repo).
Polish → Add screen-recordings, READMEs, two-line trade-off paragraphs.
Step-by-step trace.
| Week | Project | Output artefact | New skill demonstrated |
|---|---|---|---|
| 1 | dbt-analytics-duckdb | README + GH Actions green check | dbt model layering + CI |
| 2 | airflow-daily-etl | DAG screenshot + Looker Studio embed | scheduling discipline |
| 3 | airflow-daily-etl | dead-letter + retries patterns | operational hygiene |
| 4 | dbt-snowflake-medallion | bronze/silver/gold schema diagram | medallion architecture |
| 5 | dbt-snowflake-medallion | warehouse-cost-per-run metric | cloud-cost awareness |
| 6 | dbt-snowflake-medallion | dbt exposures + Looker tile | downstream lineage |
| 7 | product-analytics | architecture diagram | system design breadth |
| 8 | product-analytics | Segment → Kafka ingest | streaming ingest |
| 9 | product-analytics | dbt marts + Looker dashboard | analytics surfacing |
| 10 | product-analytics | screen-recording demo | storytelling rehearsal |
Output:
| Metric | Random-order portfolio | Pyramid-order portfolio |
|---|---|---|
| Reviewer perceives depth | low — projects feel disjointed | high — projects build on each other |
| Story coherence in onsite | low — separate stories per project | high — one narrative across four projects |
| Cross-project test coverage | none | Tier 4 reuses Tier 3 dbt patterns |
Why this works — concept by concept:
- Tier order respects skill dependencies — Tier 4's medallion warehouse reuses Tier 1's dbt code muscle; skipping Tier 1 means re-learning under deadline.
- One project per tier — keeps total scope at ~10 weeks; lets you reach Tier 4 before motivation flags.
- Polish week — README, screen-recording, and trade-off paragraph are the 20% of effort that returns 80% of reviewer-perceived quality.
- Reviewer narrative — four ordered projects let you say "I built this, then this, then this" in the behavioural round; random projects force four separate stories.
- Calendar realism — assumes 8–12 hours/week, the real budget of a fresh-graduate searching for work.
- Cost — total build cost = O(10 weeks × 10 hours); interview leverage = O(every loop for the next 12 months).
SQL
Topic — aggregation
SQL aggregation problems for the foundation tier
3. Foundation projects — SQL warehouse and dbt analytics
Foundation projects prove SQL + modelling fluency — the bedrock of every data engineer projects portfolio
The mental model: 80% of entry-level rounds test SQL and dimensional modelling; the foundation tier exists to make those rounds easy. A warehouse with a star schema and a dbt repo with seeds, sources, models, tests, and docs cover the entire syllabus the typical screen pulls from. Two projects in this tier, each ~5–10 days, give a reviewer everything they need to believe you can handle a "build me a fact_orders table" whiteboard round without flinching.
Why foundation matters more than the flashier tiers.
- It is the only tier you cannot fake. A reviewer can ask a SQL question on the spot; you cannot bluff it. The foundation tier ensures you do not have to.
- dbt is the lingua franca of modern analytics engineering. A dbt repo with sources + models + tests + docs is the canonical entry-level artefact in 2026.
- A star schema is the canonical interview drawing. Practising the four dimensional-modelling decisions (grain, facts, dims, slowly-changing strategy) on real data is the only way to defend them under pressure.
- The foundation projects double as datasets for higher tiers. Your Tier 1 warehouse becomes the source-of-truth for the Tier 3 medallion and the Tier 4 product analytics platform — a compounding return.
The four dimensional-modelling decisions every foundation project should defend.
- Grain — what does one row in the fact table represent? Be ruthless: "one row = one taxi trip from pickup to dropoff."
-
Conformed dimensions — which dims are shared across facts?
dim_date,dim_zone,dim_userare the usual suspects. - SCD strategy — type 1 (overwrite), type 2 (history rows), or type 6 (combined). Pick one per dim and write it on the README.
-
Tests —
not_nullon grain columns,uniqueon surrogate keys,relationshipson FKs,accepted_valueson enums. Aim for ≥10 tests per project.
Project 1 — SQL data warehouse from an open dataset
Detailed explanation. Start with a well-known public dataset — NYC Taxi (yellow cabs, ~8 GB raw), Brazil e-commerce (Olist, ~2 GB), or a generated TPC-H sample (~1 GB). Ingest the raw CSVs into PostgreSQL with idempotent COPY, model a star schema in dbt, write ~10 analytics queries on the marts, and surface the results in Metabase. Total wall-clock budget: 7–10 days. The result is a repo a reviewer can clone, docker-compose up, and have a working dashboard at localhost:3000 in 60 seconds.
Code.
# docker-compose.yml — 3 services, 90s boot
services:
postgres:
image: postgres:16
environment:
POSTGRES_DB: warehouse
POSTGRES_USER: dbt
POSTGRES_PASSWORD: dbt
volumes:
- ./data:/data:ro
ports: ["5432:5432"]
dbt:
build: ./dbt
depends_on: [postgres]
volumes:
- ./dbt:/usr/app
command: ["bash", "-c", "dbt seed && dbt run && dbt test"]
metabase:
image: metabase/metabase:v0.50.0
depends_on: [postgres]
ports: ["3000:3000"]
-- models/marts/fact_trips.sql
{{ config(materialized='table') }}
SELECT
{{ dbt_utils.generate_surrogate_key(['trip_id']) }} AS trip_sk,
trip_id,
pickup_zone_sk,
dropoff_zone_sk,
payment_type_sk,
pickup_datetime,
dropoff_datetime,
trip_distance_miles,
fare_amount_usd,
tip_amount_usd,
total_amount_usd,
EXTRACT(EPOCH FROM (dropoff_datetime - pickup_datetime))/60 AS trip_minutes
FROM {{ ref('stg_trips') }}
WHERE pickup_datetime >= '2024-01-01'
AND trip_distance_miles BETWEEN 0 AND 200
Step-by-step explanation.
-
Ingest — a Python
loader.pyreads each monthly CSV andCOPYs it into a raw table with an idempotent(filename, row_hash)dedup key, so re-running is safe. -
Stage —
stg_*.sqlmodels rename columns, cast types, and apply trivial filters. One staging model per source table. -
Mark —
fact_trips,dim_zone,dim_payment,dim_datejoin staging tables into a star schema. Surrogate keys are generated viadbt_utils.generate_surrogate_key. -
Test — at minimum:
not_nullon grain columns,uniqueon surrogate keys,relationshipsfrom facts → dims,accepted_valuesonpayment_type. -
Dashboard — Metabase points at
marts.fact_tripsjoined to dims; nine charts covering trips-per-day, fare-per-mile distribution, top-10 pickup zones, weekday-vs-weekend split, etc.
Output.
| Layer | Artefact | Rows | What it proves |
|---|---|---|---|
| raw |
raw.trips_yellow_2024_01 … 12 monthly tables |
~100 M | idempotent ingest |
| staging | stg_trips |
~100 M | type-cast + filter discipline |
| marts |
fact_trips, 4 dims |
~98 M / dim small | dimensional modelling |
| BI | 9 Metabase charts | n/a | downstream consumption |
Rule of thumb. Foundation projects must boot in under 60 seconds on a laptop. If the reviewer has to install something beyond Docker, you have already lost half the audience. Bundle the dataset (or a 100k-row sample) inside the repo.
What hiring managers see.
- You can model a star schema. The four-decision README (grain, dims, SCD, tests) tells them you have an opinion.
- You write idempotent ingest. Re-running the loader is safe — a senior signal even at entry level.
- You test data, not just code. Eleven dbt tests in the project, all green in CI.
-
You think about onboarding. A
docker-compose upREADME earns reviewer goodwill in the first 30 seconds.
Common beginner mistakes.
-
Modelling at the wrong grain. Mixing trip-level facts with day-level aggregates in one table — always split into
fact_trip(per trip) andagg_daily_trips(per day, sourced from the fact). -
Skipping
not_nullon grain columns. Afact_tripswith a nullabletrip_skis silently broken; reviewers will spot it instantly. -
Loading the dashboard against raw tables. Metabase pointed at
raw.trips_yellow_2024_01is a red flag — always route BI through marts. -
Forgetting
dim_date. Hard-coding date arithmetic in every fact query is the single biggest "junior" tell on dbt repos. -
Hard-coding zone names in SQL. A
WHERE pickup_zone = 'Manhattan'filter inline anywhere outsidedim_zonemeans the dimension does not exist in your head yet.
Project 2 — dbt analytics with seeds + sources + models + tests + docs
Detailed explanation. Project 2 is the lighter cousin of Project 1 — pure dbt with DuckDB as the warehouse (zero cloud cost, runs in <30s on a laptop) and GitHub Actions running dbt test on every PR. The point is to demonstrate dbt as the lingua franca: every modern analytics team uses it, every interview asks about it, and very few candidates ship a real dbt repo. Total wall-clock budget: 5 days.
Code.
# dbt_project.yml
name: dbt_analytics_duckdb
version: '1.0.0'
profile: dbt_analytics_duckdb
model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
macro-paths: ["macros"]
models:
dbt_analytics_duckdb:
staging:
+materialized: view
marts:
+materialized: table
+schema: marts
# models/marts/_marts.yml — tests that block merge in CI
version: 2
models:
- name: dim_customer
columns:
- name: customer_sk
tests: [not_null, unique]
- name: email
tests:
- not_null
- dbt_utils.not_empty_string
- name: fact_order
columns:
- name: order_sk
tests: [not_null, unique]
- name: customer_sk
tests:
- not_null
- relationships:
to: ref('dim_customer')
field: customer_sk
Step-by-step explanation.
-
Seeds — CSVs checked into
seeds/for tiny lookup tables (country codes, currency rates).dbt seedloads them as DuckDB tables. -
Sources —
sources.ymlregisters the raw tables and their freshness expectations (error_after: { count: 24, period: hour }). - Staging — one view per source, no business logic, just renaming and casting.
- Marts — narrow set of fact and dimension tables, materialised as tables (faster queries downstream).
- Tests — schema tests on every column you care about + at least three custom data tests (e.g. "no future-dated orders", "all orders have positive amounts").
-
CI —
.github/workflows/dbt.ymlinstalls dbt, runsdbt build, fails the PR on any test red.
Output.
| Surface | Count | Note |
|---|---|---|
| Sources | 4 | freshness configured |
| Staging models | 12 | views |
| Mart models | 6 | tables |
| Schema tests | 28 | run on every PR |
| Custom data tests | 3 | future dates, negative amounts, orphan FKs |
Rule of thumb. The dbt analytics project is the test-coverage showcase of the portfolio. Aim for ≥30 tests total. A reviewer who clicks _models.yml and sees a wall of green tests will trust the rest of the repo immediately.
What hiring managers see.
- You ship dbt the way analytics engineers actually ship it — seeds, sources, staging, marts, tests, docs, CI.
-
You understand
ref()andsource()— the two foundational dbt primitives. - You wrote at least one custom data test — the line that separates "used dbt" from "understood dbt."
- Your CI badge is green. Reviewer instinctively trusts the rest of the repo.
Common beginner mistakes.
-
No
sources.yml. Skipping the source layer and reading from raw tables breaks lineage and disables freshness checks. -
All models materialised as
view. Marts must betableso dashboard queries do not re-run the underlying CTE chain every refresh. - Tests in YAML only. Add at least one Python or SQL custom test to prove you understand dbt's full test surface.
-
dbt runin CI instead ofdbt build.dbt buildruns models and tests in DAG order;dbt runskips tests, which is a junior tell. -
No
--select state:modified+in CI. On a small project it does not matter, but reviewers love seeing the senior-tier deferral pattern even when the project is tiny.
Data engineering interview question on a foundation-tier build
A senior interviewer often probes: "Walk me through the schema you would model for a ride-hailing analytics dashboard, starting from the raw events, and where you would put your tests." — testing whether the candidate can articulate grain, fact/dim split, conformed dimensions, and the four canonical schema-tests, all on a familiar dataset.
Solution Using a NYC-Taxi-style star schema with dbt tests in CI
-- models/marts/fact_trips.sql (Postgres / DuckDB / Snowflake — same DDL)
{{ config(materialized='table') }}
SELECT
{{ dbt_utils.generate_surrogate_key(['trip_id']) }} AS trip_sk,
trip_id,
pickup_zone_sk,
dropoff_zone_sk,
payment_type_sk,
date_sk,
trip_distance_miles,
fare_amount_usd,
tip_amount_usd,
total_amount_usd
FROM {{ ref('stg_trips') }} t
JOIN {{ ref('dim_zone') }} pz ON t.pickup_zone_id = pz.zone_id
JOIN {{ ref('dim_zone') }} dz ON t.dropoff_zone_id = dz.zone_id
JOIN {{ ref('dim_payment') }} p ON t.payment_type = p.payment_type
JOIN {{ ref('dim_date') }} d ON DATE(t.pickup_datetime) = d.date_value
Step-by-step trace.
| Step | Layer | Action | Output rows |
|---|---|---|---|
| 1 | raw | COPY 12 monthly CSVs | ~100 M |
| 2 | stg | rename + cast in stg_trips view |
~100 M (view) |
| 3 | dim | build dim_zone (265 rows), dim_payment (6 rows), dim_date (366 rows) |
small lookup tables |
| 4 | fact | join staging to dims to produce fact_trips table |
~98 M |
| 5 | test |
not_null(trip_sk), unique(trip_sk), relationships(pickup_zone_sk → dim_zone)
|
all pass |
| 6 | CI | GitHub Actions runs dbt build on PR |
green |
Output:
| Mart table | Rows | Tests | Materialisation |
|---|---|---|---|
| fact_trips | 98,201,442 | 7 | table |
| dim_zone | 265 | 3 | table |
| dim_payment | 6 | 3 | table |
| dim_date | 366 | 4 | table |
Why this works — concept by concept:
-
Grain on one line of the README — "one row of
fact_trips= one yellow-cab trip from pickup to dropoff." This single sentence answers half the modelling round. -
Surrogate keys via
dbt_utils— hashingtrip_idgives a stable, deterministic key that survives schema changes downstream. -
Conformed
dim_dateanddim_zone— reused in higher-tier projects (medallion, product analytics) — a compounding return on the foundation work. - Tests written as YAML, not Python — dbt's canonical idiom; immediately legible to any reviewer who has used dbt.
- CI on every PR — a green badge in the README earns trust before the reviewer reads any SQL.
- Cost — local run = O(seconds on DuckDB, minutes on Postgres with the full dataset); reviewer time = O(60s from clone to dashboard).
Data modelling
Topic — dimensional modelling
Dimensional-modelling practice problems
4. Orchestration projects — Airflow ETL and Kafka streaming
Orchestration projects prove you can ship pipelines that run on a schedule — the mental shift senior interviews care about
The mental model in one line: a SQL query that runs by hand is a school assignment; a pipeline that runs on a schedule is a job. Tier 2 closes that gap. An Airflow daily ETL proves you can sequence tasks, handle partial failures, and refresh a dashboard nightly without supervision. A Kafka + Flink streaming aggregator proves you can reason about windowing, watermarks, and exactly-once semantics — the streaming half of the etl project ideas universe.
Why Tier 2 is the most under-built tier in entry-level portfolios.
- It is intimidating. Airflow has a learning curve, and Kafka feels like infrastructure rather than code. Most candidates skip Tier 2 and jump to Spark — a tell that they have not actually run a pipeline.
- It is the most-asked design round. "Walk me through your DAG" and "what is your retry strategy" are the two most-asked operational questions in entry-level loops.
- It teaches the failure modes that matter. Schedules expose race conditions, idempotency bugs, and silent data drift that batch-on-a-laptop never reveals.
- It is the entry point to the modern stack. Tier 3 medallion patterns assume an Airflow or dbt-cloud schedule already exists; building it once at Tier 2 removes that assumption.
Two orthogonal interview signals.
- Airflow → "I can schedule and monitor batch workflows." Covers scheduling, dependencies, alerting, backfills.
- Kafka + Flink → "I can reason about streaming semantics." Covers partitioning, windowing, watermarks, exactly-once.
Project 3 — Airflow daily ETL (API → S3 → Snowflake → dbt → dashboard)
Detailed explanation. Build a daily DAG that fetches a public API (e.g. OpenWeather, AlphaVantage, NYC 311, GitHub events), lands the response as Parquet on S3 (or MinIO locally), loads it into Snowflake (or DuckDB locally), runs a dbt transform, and finally refreshes a Looker Studio dashboard via a webhook. Five tasks, one DAG, daily schedule at 02:00 UTC. Total wall-clock budget: 10–14 days. The repo ships with docker-compose.yml that boots Airflow + MinIO + DuckDB + a Looker Studio embed for free local testing.
Code.
# dags/daily_etl.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from datetime import datetime, timedelta
default_args = {
"owner": "alex",
"retries": 3,
"retry_delay": timedelta(minutes=5),
"email_on_failure": True,
"sla": timedelta(hours=2),
}
with DAG(
"daily_etl",
default_args=default_args,
schedule="0 2 * * *",
start_date=datetime(2026, 1, 1),
catchup=False,
max_active_runs=1,
tags=["portfolio", "etl"],
) as dag:
fetch = PythonOperator(task_id="fetch_api", python_callable=fetch_api)
land = PythonOperator(task_id="land_s3", python_callable=write_parquet_to_s3)
load = SnowflakeOperator(task_id="load_snowflake",
sql="COPY INTO raw.events FROM @stage/events/")
dbt_run = PythonOperator(task_id="dbt_run", python_callable=run_dbt_build)
refresh = PythonOperator(task_id="refresh_dashboard", python_callable=ping_looker_webhook)
fetch >> land >> load >> dbt_run >> refresh
Step-by-step explanation.
-
fetch_api— calls the source API with retry-on-429 and exponential backoff. Result written to local disk first (so failure during S3 upload does not re-hit the API). -
land_s3— partitions the response bydt=YYYY-MM-DDand writes asevents_HHMM.parquet. Idempotent — re-runs of the same execution date overwrite the same file. -
load_snowflake— runsCOPY INTO raw.events FROM @stage/events/withON_ERROR='ABORT_STATEMENT'. Failure surfaces the row count and offending row. -
dbt_run— callsdbt build --select +marts.daily_eventsto refresh marts plus tests. -
refresh_dashboard— POSTs to Looker Studio's webhook so the embedded dashboard refreshes immediately. -
Failure handling — each task retries three times with 5-minute spacing; SLA miss (>2h end-to-end) emails the owner; the entire DAG is
max_active_runs=1to prevent overlapping runs.
Output.
| Task | Avg duration | Retries used (typical week) | Failure cause when it fails |
|---|---|---|---|
| fetch_api | 30s | 0–1 | upstream 429 |
| land_s3 | 5s | 0 | rare network blip |
| load_snowflake | 45s | 0 | bad row type — surfaces in log |
| dbt_run | 90s | 0 | broken test — by design |
| refresh_dashboard | 2s | 0 | webhook 4xx |
Rule of thumb. Every Airflow project must have an explicit backfill strategy documented in the README. "To backfill 2025-12, run airflow dags backfill daily_etl -s 2025-12-01 -e 2025-12-31." Reviewers love it.
What hiring managers see.
-
You can design a DAG with five real tasks, not a toy
BashOperator("echo hello"). - You retry intelligently, with backoff and idempotent task design.
- You have an SLA configured — proves you understand observability.
- You can backfill — the senior-track operational muscle.
Common beginner mistakes.
-
DAG with
catchup=Trueand a far-backstart_date— on first deploy Airflow tries to backfill every day sincestart_date, hammering the API and breaking rate limits. Always start withcatchup=False. -
Non-idempotent tasks. A task that appends to a table without first deleting the same
execution_daterows will double-load on retry. -
No
max_active_runs=1. Overlapping runs of the same DAG corrupt warehouse state when twoCOPY INTOs race. - Hard-coded credentials in the DAG. Use Airflow connections + environment variables; secrets in code are a red flag for any reviewer.
- No README screenshot of the Graph view. A picture of a green DAG with five tasks is the single highest-leverage README artefact for Airflow projects.
Project 4 — Kafka + Flink real-time aggregations
Detailed explanation. Stand up a single-broker Kafka in Docker, generate a synthetic stream of click events with a Python publisher (1k events/sec on a laptop is plenty), and write a Flink SQL job that consumes the topic, computes per-user 5-minute hopping windows, and writes the result to ClickHouse. Wire ClickHouse to Grafana for a real-time chart. Total wall-clock budget: ~21 days — Tier 2's hardest project. Ship everything in docker-compose.yml so the reviewer can docker-compose up, python publish.py, and see Grafana ticking in 90 seconds.
Code.
-- flink_sql/clicks_5min_window.sql
CREATE TABLE clicks (
user_id STRING,
event_time TIMESTAMP_LTZ(3),
page STRING,
WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'clicks',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'clicks-agg',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json'
);
CREATE TABLE clicks_5m_agg (
user_id STRING,
window_start TIMESTAMP(3),
window_end TIMESTAMP(3),
clicks BIGINT,
PRIMARY KEY (user_id, window_start) NOT ENFORCED
) WITH (
'connector' = 'clickhouse',
'url' = 'clickhouse://clickhouse:8123/default',
'table-name'= 'clicks_5m_agg'
);
INSERT INTO clicks_5m_agg
SELECT user_id,
HOP_START(event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_start,
HOP_END (event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_end,
COUNT(*) AS clicks
FROM clicks
GROUP BY user_id,
HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE);
Step-by-step explanation.
- Source table — declares the Kafka topic as a Flink table with a watermark 10s behind event time (handles small clock skew).
-
Sink table — declares ClickHouse as a sink, primary-key upsert by
(user_id, window_start). -
Streaming SQL —
HOP(event_time, 1m, 5m)defines a 5-minute window advancing every 1 minute (hopping). One row emitted per(user_id, window)whenever the window closes. - Watermark behaviour — events arriving more than 10s late are dropped (configurable). The README should state this trade-off explicitly.
- Backpressure — the publisher pushes 1k events/sec; Flink processes ~10k/sec on a laptop, so the system is steady. If you push 50k/sec, Kafka starts buffering — useful to demonstrate in the README screenshot.
-
Grafana panel — queries
SELECT window_start, SUM(clicks) FROM clicks_5m_agg GROUP BY window_start, refreshes every 5 seconds, looks alive.
Output.
| Window start | window_end | user_id | clicks |
|---|---|---|---|
| 12:00 | 12:05 | u_42 | 17 |
| 12:01 | 12:06 | u_42 | 19 |
| 12:02 | 12:07 | u_42 | 18 |
| 12:00 | 12:05 | u_99 | 4 |
| 12:01 | 12:06 | u_99 | 5 |
Rule of thumb. Pin the streaming project to one failure mode the README explains in detail (e.g. "what happens to a 30-second-late event"). One deeply-explained failure mode is worth ten one-liner bullet points to a reviewer.
What hiring managers see.
- You can write Flink SQL, not just Python.
- You configured a real watermark — proves you have read the streaming docs.
- You wired Grafana to ClickHouse — proves you can close the loop on observability.
-
You shipped
docker-compose— proves the reviewer can run it without a Confluent Cloud account.
Common beginner mistakes.
- Tumbling windows where the prompt asks for hopping. Tumbling = non-overlapping; hopping = overlapping. Pick deliberately and explain in the README.
- No watermark configured. Without one Flink cannot close windows on event time, and the job effectively becomes processing-time — defeating the purpose.
- Single partition on Kafka topic. A streaming demo on one partition hides every parallelism question; create the topic with 4–8 partitions even on a laptop.
- Sink without a primary key. ClickHouse upserts only when you declare the PK; without one, replays produce duplicates and the dashboard double-counts.
-
No
docker-compose down -vinstruction in README. Reviewers who do not see the volume teardown command spend ten minutes wondering why their seconddocker-compose upboots with stale Kafka state.
Data engineering interview question on an orchestration build
A senior interviewer often opens this round: "Your daily DAG's load_snowflake task fails halfway through on a Tuesday. Walk me through what happens, what the user sees, and what you would change." — testing whether the candidate has thought through partial failure, retries, idempotency, alerting, and backfill, all in one storyline.
Solution Using idempotent task design + retries + dead-letter pattern
# tasks/load_snowflake.py
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
def load_snowflake(execution_date, **_):
hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
sql = f"""
BEGIN;
DELETE FROM raw.events
WHERE event_date = '{execution_date}';
COPY INTO raw.events
FROM @stage/events/dt={execution_date}/
FILE_FORMAT = (TYPE = PARQUET)
ON_ERROR = 'CONTINUE'
RETURN_FAILED_ONLY = TRUE;
COMMIT;
"""
failed_rows = hook.get_pandas_df(sql)
if not failed_rows.empty:
write_dead_letter(failed_rows, execution_date)
if len(failed_rows) > 100:
raise ValueError(f"Too many bad rows: {len(failed_rows)}")
Step-by-step trace.
| Step | Action | Result |
|---|---|---|
| 1 | DELETE existing rows for execution_date
|
idempotency: re-runs do not double-load |
| 2 | COPY INTO with ON_ERROR='CONTINUE'
|
bad rows captured, good rows loaded |
| 3 | RETURN_FAILED_ONLY=TRUE |
exact list of failed rows returned |
| 4 | Write failed rows to a dead-letter table | nothing silently lost |
| 5 |
raise if > 100 bad rows |
DAG marked failed, Airflow alert fires |
| 6 | Airflow auto-retries 3× with 5-min backoff | transient broker errors heal |
Output:
| Scenario | User-visible effect | Engineer-visible effect |
|---|---|---|
| Transient Snowflake timeout | DAG retries silently, succeeds within 15 minutes | metric ticks "transient_retries=1" |
| 50 malformed rows | DAG succeeds; dashboard slightly under-counts; bad rows in dead-letter | engineer sees DLQ count, investigates Monday morning |
| 5,000 malformed rows | DAG fails; PagerDuty page; dashboard shows yesterday's data | engineer triages, backfills with one command |
| Permanent schema drift | DAG fails on the dbt test, not the load | engineer changes model, redeploys, backfills |
Why this works — concept by concept:
-
Idempotent task —
DELETEbeforeCOPYlets Airflow safely retry without duplicating rows. - Dead-letter pattern — capturing failed rows separately prevents the "silent partial load" failure mode that destroys trust in dashboards.
-
Hard fail threshold —
raiseif > 100 bad rows ensures the DAG visibly fails when something is structurally wrong, instead of swallowing the error. - Built-in retries with backoff — 3 × 5-minute retries cover the long tail of transient network issues without paging the engineer.
- Dashboard SLA — the on-call playbook the README links to lists "if dashboard not refreshed by 09:00, follow these steps."
- Cost — engineer-attention cost of a transient blip = O(0); engineer-attention cost of a real failure = O(minutes to backfill, not hours to debug).
ETL
Topic — ETL
ETL pipeline design problems
Streaming
Topic — streaming
Streaming pipeline practice problems
5. Modern stack and end-to-end projects — Spark, Snowflake, product analytics, ML features
Tier 3 and Tier 4 are where senior signal lives — distributed compute, a cloud warehouse, and a real business outcome
The mental model: Tier 3 proves you can wield distributed compute (Spark) on top of a cloud warehouse (Snowflake), and Tier 4 proves you can assemble everything into a pipeline a real product team would consume. These four projects are the ones that convert a "junior" loop into a "junior-+1" or "associate" offer — the breadth that makes a reviewer say "this candidate is closer to mid-level than entry-level."
Why Tier 3 and Tier 4 punch above their weight.
- Tier 3 forces the distributed mental model. Spark exposes shuffle, skew, broadcast joins, and partition counts — concepts that only become real on a cluster.
- Tier 3 introduces a real cloud warehouse. Snowflake's separation of storage and compute, virtual warehouses, and credit-based pricing show up in every modern DE round.
- Tier 4 is the storytelling tier. "Walk me through your end-to-end pipeline" is the single most powerful loop moment a candidate has.
- Tier 4 covers governance. Access control, lineage, audit, and data contracts — the senior-track topics most candidates never touch.
Project 5 — Spark batch (daily user-behaviour aggregations)
Detailed explanation. Take the synthetic clickstream from Project 4 (or pull a public dataset like CommonCrawl access logs), land it as Parquet on MinIO, and write a PySpark job that computes daily per-user aggregations (clicks, sessions, time-on-site, top page) and writes the result as a partitioned Delta Lake table. Schedule it nightly via Airflow. Total wall-clock budget: ~14 days. The repo runs on a single Spark container via docker-compose, no cluster required.
Code.
# spark_jobs/daily_user_agg.py
from pyspark.sql import SparkSession, functions as F, Window
spark = (SparkSession.builder
.appName("daily_user_agg")
.config("spark.sql.shuffle.partitions", "64")
.config("spark.sql.adaptive.enabled", "true")
.getOrCreate())
df = (spark.read.parquet("s3a://events/dt=2026-05-29/")
.filter(F.col("event_type").isin("click", "pageview")))
w_session = Window.partitionBy("user_id").orderBy("event_time")
sessionised = (df
.withColumn("prev_ts", F.lag("event_time").over(w_session))
.withColumn("session_gap_min",
(F.col("event_time").cast("long") - F.col("prev_ts").cast("long")) / 60)
.withColumn("session_id",
F.sum(F.when(F.col("session_gap_min") > 30, 1).otherwise(0)).over(w_session)))
agg = (sessionised.groupBy("user_id")
.agg(
F.count("*").alias("clicks"),
F.countDistinct("session_id").alias("sessions"),
(F.max("event_time").cast("long") - F.min("event_time").cast("long")).alias("dwell_seconds"),
F.first(F.col("page"), ignorenulls=True).alias("top_page")))
(agg.write.mode("overwrite").format("delta")
.partitionBy("date").save("s3a://marts/user_daily_agg/"))
Step-by-step explanation.
-
Read partitioned Parquet — Spark prunes partitions automatically via the
dt=YYYY-MM-DDpath. -
Filter early —
event_type IN ('click','pageview')runs before any shuffle, reducing data volume. -
Sessionise with a window function — 30-minute gap rule defines session boundaries; cumulative
sumon the gap flag assigns a session id. - Aggregate per user — clicks, sessions, dwell time, top page in a single shuffle.
- Adaptive query execution (AQE) enabled — Spark coalesces partitions and handles skew automatically.
- Write Delta Lake partitioned by date — downstream consumers read only the partitions they need; vacuum + optimize handle file proliferation.
Output.
| user_id | clicks | sessions | dwell_seconds | top_page |
|---|---|---|---|---|
| u_42 | 188 | 4 | 7,841 | /products/widget |
| u_99 | 12 | 1 | 311 | /home |
| u_103 | 421 | 7 | 14,002 | /search?q=ai |
| u_204 | 9 | 2 | 188 | /signup |
Rule of thumb. Spark portfolio projects must surface at least one tuning lever (spark.sql.shuffle.partitions, broadcast hint, AQE, salting). The reviewer is checking for "this person knows Spark has knobs" — show one.
What hiring managers see.
-
You can write idiomatic PySpark —
groupBy().agg(), window functions,withColumn. - You know about shuffle partitions — the single most-asked Spark tuning knob in interviews.
- You wrote Delta Lake, not just Parquet — proves familiarity with modern lakehouse format.
- You sessionised events with a window — the canonical Spark interview pattern.
Common beginner mistakes.
-
Default
spark.sql.shuffle.partitions=200for a 1 GB job — over-shuffling kills performance. Tune to ~32–64 on a laptop. -
.collect()in production code paths. Acollect()pulls the entire DataFrame to the driver — OOM-prone and a senior-tier red flag. Usetake(n)orwriteinstead. -
Joining without broadcast hints when one side is tiny. A 50 MB dim joined to a 50 GB fact without
broadcast(small)triggers a needless shuffle. -
No partitioning on write. Writing
user_daily_aggun-partitioned forces downstream readers to scan everything; partition bydateat minimum. -
No
OPTIMIZE/VACUUMin the Delta Lake project. Reviewer who sees a thousand 4 KB files on day 30 knows the candidate never read the Delta docs.
Project 6 — dbt + Snowflake medallion (Bronze / Silver / Gold)
Detailed explanation. Take the Tier 1 warehouse, replace DuckDB/Postgres with Snowflake, and re-layer the dbt repo as a medallion architecture: bronze (raw landed data, append-only), silver (cleansed, deduplicated, conformed), gold (business-ready facts and dims consumed by BI). Add Snowflake-specific features: virtual-warehouse sizing per environment, Snowpipe for ingestion, and dbt exposures pointing at the Looker dashboards downstream. Total wall-clock budget: ~21 days. The README must explain why medallion — usually a one-paragraph trade-off naming "schema-on-read at bronze, schema-on-write at silver."
Code.
-- models/bronze/br_orders.sql
{{ config(
materialized='incremental',
incremental_strategy='append',
on_schema_change='append_new_columns',
schema='bronze'
) }}
SELECT
PARSE_JSON($1):order_id::STRING AS order_id,
PARSE_JSON($1):customer_id::STRING AS customer_id,
PARSE_JSON($1):amount::NUMBER(12,2) AS amount,
PARSE_JSON($1):currency::STRING AS currency,
PARSE_JSON($1):created_at::TIMESTAMP AS created_at,
METADATA$FILENAME AS source_file,
CURRENT_TIMESTAMP() AS bronze_loaded_at
FROM @raw_stage/orders/
{% if is_incremental() %}
WHERE METADATA$FILE_LAST_MODIFIED > (SELECT MAX(bronze_loaded_at) FROM {{ this }})
{% endif %}
-- models/silver/sl_order.sql
{{ config(materialized='incremental', unique_key='order_id', schema='silver') }}
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER
(PARTITION BY order_id ORDER BY bronze_loaded_at DESC) AS rn
FROM {{ ref('br_orders') }}
)
SELECT
order_id,
customer_id,
amount,
UPPER(currency) AS currency,
created_at,
bronze_loaded_at AS silver_loaded_at
FROM ranked
WHERE rn = 1
AND amount > 0
AND currency IN ('USD','EUR','GBP','INR')
Step-by-step explanation.
-
Bronze — raw JSON landed via Snowpipe; one row per source-file record;
METADATA$FILENAMEpreserved for lineage. -
Silver — deduplicated on
order_idviaROW_NUMBER() OVER (...)keeping the most recently-loaded copy; filters out invalid currencies and non-positive amounts. -
Gold —
fact_order,dim_customer,dim_date,dim_currencyjoined from silver; downstream-friendly column names; materialised as tables. -
Virtual warehouses per env —
WH_DEV_XSfor CI runs,WH_PROD_Sfor nightly builds. Pricing visible to the reviewer in the README. -
dbt exposures —
_exposures.ymldeclares the Looker dashboards that depend on each gold mart, giving full lineage from raw → dashboard. -
CI — GitHub Actions runs
dbt build --select state:modified+on every PR usingstate:deferral againstmain— only the changed models rebuild.
Output.
| Layer | Schema | Materialisation | Row count (example) | Read pattern |
|---|---|---|---|---|
| Bronze | bronze |
incremental append | 100 M | once at write |
| Silver | silver |
incremental upsert | 95 M (deduplicated) | many reads |
| Gold | gold |
table | 8 facts + 4 dims | dashboard + BI |
Rule of thumb. Always document the medallion invariant: "raw never leaves bronze; analytics never reads bronze; everything joins on silver keys." A single sentence on the README spares twenty minutes of reviewer questions.
What hiring managers see.
- You understand schema-on-read vs schema-on-write — the medallion's defining trade-off.
- You configured a virtual warehouse — proves familiarity with Snowflake's compute/storage split.
- You wired dbt exposures — proves you think about end-to-end lineage.
-
You used
state:modified+— the senior-tier dbt CI pattern.
Common beginner mistakes.
-
Letting bronze leak into BI. A Looker tile reading from
bronze.*defeats the entire architecture; route everything through gold. - Same warehouse for dev and prod. Reviewers expect to see at least two virtual warehouses sized differently; sharing one is a cost-discipline red flag.
- Snowpipe + dbt incremental together without dedup. Snowpipe can re-deliver files on retry; the silver layer must dedupe on a stable business key.
-
No
tagson dbt models.dbt build --select tag:nightlyis the operational handle for selective runs; missing tags signal the project never ran on a schedule. -
One mega-mart instead of fact + dim split. A
gold.everythingtable that joins seven sources into 80 columns is a tell that the candidate has not internalised dimensional modelling.
Project 7 — End-to-end product analytics platform (Segment → Kafka → Snowflake → dbt → Looker)
Detailed explanation. Project 7 is the flagship of the portfolio. It assembles ingest, streaming, warehouse, transformation, and BI into one coherent repo. Pick a fictional product (e.g. "a SaaS habit tracker"), instrument synthetic events with the Segment Node SDK (or just publish to Kafka directly), land them in Snowflake via Snowpipe Streaming, build a daily_user_metrics mart in dbt, and surface DAU/WAU/retention in Looker. The trade-off paragraph: "Kafka over batch ingest because the README has to defend streaming semantics in interview rounds even when the volume does not require it." Total wall-clock budget: ~28 days.
Code.
# ingest/produce_events.py — synthetic Segment-shaped publisher
import json, time, random
from confluent_kafka import Producer
producer = Producer({
"bootstrap.servers": "kafka:9092",
"acks": "all",
"enable.idempotence": True,
"compression.type": "zstd",
})
EVENTS = ["habit_logged", "habit_completed", "streak_started", "streak_broken"]
USERS = [f"u_{i:04d}" for i in range(1, 2001)]
while True:
event = {
"event": random.choice(EVENTS),
"user_id": random.choice(USERS),
"timestamp": time.time(),
"properties": {"habit": random.choice(["read","run","meditate","write"]),
"duration_min": random.randint(5, 60)},
"context": {"source": "node-sdk", "ip": "127.0.0.1"},
}
producer.produce("events", key=event["user_id"].encode(),
value=json.dumps(event).encode())
producer.poll(0)
time.sleep(0.001) # ~1k events/sec
-- models/gold/daily_user_metrics.sql
{{ config(materialized='table', schema='gold') }}
WITH base AS (
SELECT user_id,
DATE(event_time) AS event_date,
COUNT(*) AS events,
COUNT(DISTINCT properties:habit) AS distinct_habits,
SUM(properties:duration_min::NUMBER) AS total_minutes
FROM {{ ref('sl_events') }}
GROUP BY 1, 2
),
retention AS (
SELECT a.user_id, a.event_date,
CASE WHEN b.user_id IS NOT NULL THEN 1 ELSE 0 END AS retained_d7
FROM base a
LEFT JOIN base b
ON a.user_id = b.user_id
AND b.event_date = DATEADD('day', 7, a.event_date)
)
SELECT b.*, r.retained_d7
FROM base b
JOIN retention r USING (user_id, event_date)
Step-by-step explanation.
- Producer — publishes ~1k events/sec to Kafka with idempotence on and zstd compression.
-
Snowpipe Streaming — ingests Kafka topic into
bronze.eventswith sub-second latency. -
Silver —
sl_eventsextracts the JSONevent,user_id,timestamp,propertiesinto typed columns. -
Gold —
daily_user_metricscomputes per-user-per-day events, distinct habits, total minutes, and aretained_d7flag (was the user active again 7 days later?). - Looker dashboard — DAU, WAU, day-7 retention cohort chart, top-habit chart. Auto-refreshes nightly.
- README architecture diagram — six boxes left-to-right (Segment → Kafka → Snowpipe → Bronze → Silver → Gold → Looker), with one-sentence captions and the rejected alternative noted under each box.
Output.
| Metric (sample week) | Value |
|---|---|
| Events ingested | 5,184,000 |
| DAU (median) | 612 |
| WAU (median) | 1,840 |
| Day-7 retention | 41% |
| Looker dashboard SLA | refreshed by 09:00 |
Rule of thumb. A flagship project must answer two questions in the first three README lines: "what does it do?" and "why would a recruiter spend three minutes reading further?" Lead with the second.
What hiring managers see.
- You can assemble seven systems into one coherent pipeline.
- You instrumented events end-to-end — the modern product-data muscle.
- You modelled retention — the business-judgement signal interviewers love.
- You documented an architecture diagram — the artefact every loop opens with.
Common beginner mistakes.
-
No event schema document. Reviewers expect to see
events.mdor a JSON schema file enumerating every event name, properties, and version — schema-as-code is the modern product-data standard. - Retention computed off the dashboard instead of the warehouse. Defining retention SQL inside Looker locks the metric to the BI tool; always materialise the metric in dbt so it is reusable.
- Looker dashboard with 30 tiles. Three tiles answering "DAU, retention, top action" beats thirty tiles answering nothing. Reviewers grade clarity, not volume.
-
No data contract. A one-page
contracts/events.ymllisting the producer team, schema owner, and break-the-build policy is the senior-tier signal for product-data pipelines. - No screen-recording. A 90-second Loom walk-through of the dashboard auto-refreshing is the single highest-leverage README artefact for Tier 4 projects.
Project 8 — ML feature pipeline (Spark feature engineering + Feast feature store + retraining cron)
Detailed explanation. Layer an ML feature path onto Project 7. Compute behavioural features in Spark from the bronze.events table (e.g. clicks_last_7d, sessions_last_30d, avg_session_min), register them in Feast (a feature store), schedule a daily backfill + a retraining cron via Airflow, and serve a simple churn-risk model via FastAPI for offline scoring. The point is to demonstrate the DE-side of MLOps — feature engineering, feature stores, and scheduled retraining — without writing the model itself (sklearn LogisticRegression is fine). Total wall-clock budget: ~28 days.
Code.
# spark_jobs/build_features.py
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("build_features").getOrCreate()
events = spark.read.parquet("s3a://bronze/events/")
feats = (events
.filter(F.col("event_time") >= F.date_sub(F.current_date(), 30))
.groupBy("user_id")
.agg(
F.sum(F.when(F.col("event_time") >= F.date_sub(F.current_date(), 7), 1)
.otherwise(0)).alias("clicks_last_7d"),
F.countDistinct(F.date_trunc("day", "event_time")).alias("active_days_last_30d"),
F.avg(F.col("properties.duration_min")).alias("avg_session_min"),
F.max("event_time").alias("last_event_time"),
)
.withColumn("feature_ts", F.current_timestamp()))
(feats.write.mode("overwrite").format("delta")
.save("s3a://feature-store/user_behaviour_v1/"))
# feast/feature_store.yaml + feature_view registration
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
user = Entity(name="user_id", join_keys=["user_id"])
src = FileSource(
path="s3://feature-store/user_behaviour_v1/",
timestamp_field="feature_ts",
)
user_behaviour_v1 = FeatureView(
name="user_behaviour_v1",
entities=[user],
ttl=None,
schema=[
Field(name="clicks_last_7d", dtype=Int64),
Field(name="active_days_last_30d", dtype=Int64),
Field(name="avg_session_min", dtype=Float32),
],
source=src,
)
Step-by-step explanation.
- Spark feature job — reads 30 days of events, computes per-user behavioural features, writes Delta Lake to the feature store path.
-
Feast registration — declares the feature view, entity, and source.
feast applywrites the registry. -
Offline retrieval — training script calls
store.get_historical_features(entity_df, feature_refs=[...])to materialise a training matrix. -
Daily retraining cron — Airflow DAG fires
build_features.py, thentrain_model.py, then writes the new sklearn.pklto S3 with a timestamped path. -
Online serving (optional) — FastAPI endpoint loads the latest
.pkl, callsstore.get_online_features(...), returns a churn-risk score. Useful demo but not strictly required for the DE-side narrative. - Model monitoring — log feature distributions per run; a weekly notebook checks for drift against the previous month's distribution.
Output.
| user_id | clicks_last_7d | active_days_last_30d | avg_session_min | churn_risk_score |
|---|---|---|---|---|
| u_0042 | 38 | 14 | 12.4 | 0.07 |
| u_0099 | 2 | 1 | 3.8 | 0.74 |
| u_0103 | 91 | 23 | 18.9 | 0.04 |
| u_0204 | 0 | 0 | 0.0 | 0.91 |
Rule of thumb. The DE-side of MLOps is features + scheduling + governance; the ML-side is models + evaluation. Stay on the DE side in the README — "I built the feature pipeline; the data scientists own the model" is exactly the framing senior DE interviewers want.
What hiring managers see.
- You built a feature store, not just a Spark job — proves familiarity with the modern ML-data interface.
- You scheduled retraining — proves operational thinking beyond a one-shot script.
- You logged feature drift — proves you understand the failure mode that bites production ML systems.
- You drew a clean DE/ML responsibility split — proves team-thinking and reduces "is this a DE or a DS role?" loop friction.
Common beginner mistakes.
-
Training on the same window the features were computed for. Causes label leakage; always compute features on
tand train against labels att + horizon. -
No feature versioning. A feature view called
user_behaviourwithout a_v1suffix bites the day the schema changes — every modern feature store treats versions as first-class. -
Skipping the entity declaration. Feast feature views without an
Entitymake point-in-time joins fail silently; reviewers spot it immediately. -
Backfilling features with
current_timestamp(). Always parameterisefeature_tsfromexecution_dateso historical backfills produce consistent values. - No mention of online vs offline path. A README that does not distinguish between offline training retrieval and online serving retrieval is a tell that the candidate has not read the feature-store docs.
Data engineering interview question on an end-to-end build
A senior interviewer often anchors the design round with: "Walk me through your end-to-end product analytics pipeline. Where does data enter? Where does it land? Who consumes it? What was the single hardest design decision?" — testing whether the candidate has internalised the entire shape, not just the slice they wrote last.
Solution Using a five-stage architecture narrative — ingest → land → transform → consume → govern
INGEST → Segment / SDKs publish to Kafka topic "events" (1k events/sec, key=user_id).
LAND → Snowpipe Streaming auto-loads to bronze.events.
TRANSFORM → dbt silver (typed columns) → dbt gold (daily_user_metrics, retention).
CONSUME → Looker dashboards on gold; Spark feature job for ML.
GOVERN → Snowflake access roles per layer; dbt exposures for lineage.
Hardest decision : Kafka over batch ingest, because the project must defend
streaming semantics in interview rounds even when volume
does not require it. Trade-off: extra container in
docker-compose, but a Confluent-style narrative payoff.
Step-by-step trace.
| Stage | Latency | Data volume / day | Who reads it |
|---|---|---|---|
| Kafka topic | ms | 5.2 M events | Snowpipe, Spark |
| bronze.events | ~1s | 5.2 M rows | silver only |
| silver.sl_events | minutes | 5.1 M rows (after dedup) | gold marts |
| gold.daily_user_metrics | minutes | ~1.8k rows | Looker |
| feature store user_behaviour_v1 | hourly | ~2k rows | model retraining + serving |
Output:
| Surface | Owner | Refresh cadence | Consumed by |
|---|---|---|---|
| Kafka topic | DE | continuous | Snowpipe, Spark |
| Bronze | DE | continuous | dbt silver |
| Silver | DE | hourly via dbt | dbt gold |
| Gold marts | analytics engineer | nightly | Looker |
| Feature store | DE | daily | ML platform team |
| Model registry | DS | weekly | serving API |
Why this works — concept by concept:
- Five-stage narrative — gives the interviewer a clear walk-through skeleton that they can interrupt at any stage to drill deeper.
- Latency table — quantifies the SLA at every stage; the kind of artefact senior DEs maintain at work.
- Owner column — proves you understand "who is responsible" boundaries, even if it is just you wearing four hats in the portfolio.
- Governance ribbon — Snowflake roles + dbt exposures is the cheapest credible governance story you can ship at portfolio scale.
- Hardest-decision paragraph — the README's single most-quoted artefact in onsites; pre-write it.
- Cost — total local compute = O(one Docker container per service); Snowflake-on-trial credits = O($10–20 over the life of the project).
SQL
Topic — joins
Join problems for the gold-marts tier
Streaming
Topic — real-time analytics
Real-time analytics problems
The README template that converts reviewers — for every project in the portfolio
A reviewer reads the README before the code. A consistent README structure across all your data engineering projects signals that you take portfolio polish as seriously as the build itself. Use the same eight-section skeleton for every repo so a reviewer who has seen one of your projects can navigate the rest instantly. The template below is the one that consistently converts the most "reviewer opens the repo" moments into "reviewer asks the candidate about it in the next loop."
The eight-section README skeleton.
- Top — one-line problem statement. "Daily ETL that lands NYC 311 complaints in Snowflake and refreshes a Looker dashboard before 09:00." One sentence; no jargon.
- Architecture diagram. A single image above the fold; the image is the README's single most-clicked artefact.
-
Quick start.
git clone,docker-compose up, openlocalhost:3000. Three lines, total. If it takes more, fix it before publishing. - What it does. Three bullets: ingest source, transformation step, downstream consumer. Five lines maximum.
- Trade-offs. Two short paragraphs naming the alternative you rejected and why. This section gets re-read on every loop.
-
Tests. Names the test types (
not_null,unique, custom) and a link to the GitHub Actions run. - What I would build next. Three forward-looking bullets — proves the project is alive in your head.
- Credits / data source. Names the public dataset and any libraries that did meaningful heavy lifting.
Three README anti-patterns that lose reviewers.
- A wall of text before the first heading. Reviewers scan, not read; the eight-section skeleton matters because every section is its own heading.
-
Stale Quick Start. If
docker-compose updoes not work on a fresh clone, the README is broken — test it on a clean machine before publishing. - No "Trade-offs" section. Without it, the reviewer cannot find the senior signal. Add even one paragraph.
Senior-tier polish moves that take an hour and pay off for years
Once the eight-section README is in place, a handful of one-hour polishing moves elevate a competent project into a flagship-class one. None of them require new code; all of them compound when the project sits on GitHub for the next twelve months.
Five polishing moves that punch above their weight.
- Pin the project on your GitHub profile. A pinned repo with a thumbnail and a one-line tagline pulls eight seconds of free attention from every recruiter who lands on your profile.
- Add a thumbnail image to the README. A 1200×630 PNG at the top of the README renders as a thumbnail in GitHub previews and on LinkedIn link unfurls — the closest thing GitHub has to an OG image.
- Add a GitHub Actions badge. A green CI badge in the README earns one full unit of trust before the reviewer reads any code.
- Record a 90-second Loom walk-through. Embed it in the README. The walk-through is the single highest-leverage artefact for portfolio Tier 3 and Tier 4 projects.
- Open three GitHub Issues with the "next steps" label. Reviewers click the Issues tab on every interesting repo; three open issues with clear titles read as "this person is iterating," not "this person has abandoned the project."
Three senior-tier moves nobody bothers with — and why they convert.
-
An
EXPERIMENTS.mdlog. Two paragraphs per experiment: hypothesis, result, decision. Proves you treat the portfolio like a real engineering effort. - A "Costs" table. Snowflake credits used, EC2 hours, S3 storage. Cost-awareness is a senior-track muscle most candidates never display in portfolio form.
-
A
CONTRIBUTING.md. Even for a solo portfolio repo, a one-page contributor guide signals "I have thought about how a team would join this project" — which is exactly the right signal for an entry-level hire.
Cheat sheet — pick four projects by four patterns
- Tier 1 — Foundation. Pick Project 1 (SQL warehouse) or Project 2 (dbt analytics + DuckDB). Demonstrates SQL fluency, dimensional modelling, and dbt seeds/sources/tests. Effort: 1–2 weeks. Interview lift: SQL screen, modelling round.
- Tier 2 — Orchestration. Pick Project 3 (Airflow daily ETL). Demonstrates DAG design, idempotent task patterns, retries, dead-letter, SLA awareness, backfill. Effort: ~2 weeks. Interview lift: ETL system design round.
- Tier 3 — Modern stack. Pick Project 5 (Spark batch) or Project 6 (dbt + Snowflake medallion). Demonstrates distributed compute or cloud-warehouse fluency. Effort: 2–3 weeks. Interview lift: distributed compute round, warehouse round.
- Tier 4 — End-to-end. Pick Project 7 (product analytics platform). Demonstrates breadth, business judgement, and storytelling. Effort: ~4 weeks. Interview lift: senior signal, behavioural-design round, "walk me through a project" moment.
- Optional add-on for senior-leaning roles — layer Project 8 (ML feature pipeline) on top of Project 7. Demonstrates DE/ML interface and feature-store fluency. Effort: +4 weeks. Interview lift: MLOps-adjacent rounds.
- What to skip if you are time-constrained — drop Project 4 (Kafka + Flink) unless you are targeting streaming-heavy shops (Uber, Stripe, Confluent, DoorDash). Replace with a second Project 3 variant.
- What to never skip — Tier 1. A reviewer who cannot see the SQL muscle will discount everything above it as copy-paste.
Frequently asked questions
How many data engineering projects should I have on my resume?
Four well-built projects is the sweet spot — one per tier (Foundation, Orchestration, Modern stack, End-to-end). Two is the minimum that signals seriousness; eight is the maximum that a reviewer will actually click through. Each project on the resume should have a one-line architecture statement, a two-line trade-off paragraph, and a clickable GitHub link. Anything beyond four risks crowding out the trade-off lines that actually do the persuasive work.
Should data engineering projects use real production-scale data?
No — use a realistic shape of data, not a realistic volume. A 100 GB clickstream is impressive in production and useless in a portfolio because the reviewer will not wait an hour for it to load. Pick a public dataset in the 1–10 GB range (NYC Taxi, Olist e-commerce, GitHub events, OpenWeather), or generate synthetic events with a publisher script. The reviewer wants to see schema design, pipeline shape, and trade-offs — all of which are visible at 1 GB.
Where should I host my portfolio projects?
GitHub is mandatory; everything else is optional. The README, the docker-compose.yml, and the tests should run locally on a reviewer's laptop with no cloud account required. If you want a live cloud deployment, Snowflake's free-trial credits cover a small warehouse and Render or Railway give you a free Looker Studio embed. Do not require AWS credentials in your docker-compose — that is a friction wall most reviewers will not climb.
Are Kaggle datasets enough for de projects for resume?
Kaggle datasets are fine for Tier 1 (warehouse + dbt) but limiting for Tier 2–4 because they are static — they do not arrive on a schedule and they do not let you demonstrate ingest. For Tier 2 and beyond, pair a Kaggle dataset with a synthetic publisher script that re-publishes the rows hour-by-hour, simulating fresh arrivals. That single line in the README — "data arrives via a synthetic Kafka publisher to simulate a real ingest cadence" — converts a Kaggle CSV into a credible streaming source.
Do I need to deploy projects to the cloud to count them on my resume?
No, and most reviewers prefer it the other way around. A local docker-compose up they can run in 60 seconds beats a terraform apply they would never run. Cloud deployments are useful when (a) you want to demonstrate cloud-native services (Snowflake virtual warehouses, Glue, BigQuery), or (b) you want a live URL to link from the README. Even then, prefer free-tier services (Snowflake trial, Looker Studio, Render) and document the cloud spend in the README — reviewers love cost-awareness signal.
How long should each portfolio project take?
Use the per-tier budget as a contract with yourself. Tier 1 = 1–2 weeks per project; Tier 2 = 2–3 weeks; Tier 3 = 2–3 weeks; Tier 4 = 3–4 weeks. If a project crosses 1.5× its budget, ship what you have, write the README, and move on. Scope creep at the portfolio stage usually means "I am avoiding the harder next-tier project by polishing the current one" — recognise it and move up the pyramid. Total realistic budget for a four-project pyramid is ~10–12 weeks of evening work.
Practice on PipeCode
- Drill the ETL practice library → to rehearse the pipeline-design muscle every Tier 2 project demonstrates.
- Warm up with aggregation problems → so your Tier 1 SQL screen lands clean.
- Practise the streaming pipeline drills → to defend Tier 2 (Kafka + Flink) and Tier 4 (product analytics) ingest decisions.
- Sharpen dimensional modelling problems → before whiteboarding the star schema in your Tier 1 README.
- Drill join-heavy problems → so the silver/gold layers of your medallion stand up to interviewer probing.
- Stack real-time analytics drills → for the windowed-aggregation conversation that follows Project 4 and Project 7.
- Read the top data engineering interview questions → to map each project to the rounds it covers.
- Reinforce the SQL surface with SQL for DE interviews — zero to FAANG →.
- Cover the compute surface with Apache Spark internals for DE interviews → before shipping Project 5.
- For the design round, work through ETL system design for DE interviews →.
- Pair Tier 1 with data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every project tier above ships with hands-on practice rooms where you write the SQL, design the DAGs, and tune the Spark jobs that fill the README sections reviewers actually click. Start with the foundation tier and walk the pyramid; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)