De' Clerke

Posted on Jun 2

The Data Engineering Take-Home Assessment: How to Turn a 4-Hour Test Into a Job Offer

#dataengineering #career #sql #beginners

Most candidates treat the take-home assessment as a coding test. It is not. It is a professional communication test that happens to include coding. The evaluator is asking: can you scope your own work, make defensible decisions, write clean code, handle data that is not pristine, and explain what you did to someone who was not in the room?

I have submitted take-homes for Kenya-based DE roles and remote ones. The structure I use now, after getting through several of them, is in this article. It covers how to read the brief, allocate time, decide what to build, write tests that signal production thinking, structure the README that actually gets read, and avoid the mistakes that end reviews immediately regardless of how good the code is.

Before You Write a Single Line of Code

Read the brief twice before touching anything

Read the entire brief once, then close it and ask yourself what the deliverable actually is. Not the summary you assumed, but the exact words they used. Then re-read and annotate:

What is the actual output? Code only? Code with a README? A working dashboard with screenshots? A written design doc? These are different submissions requiring different time allocations.
What keywords did they use? "Production-ready," "scalable," "testable," "explain your design decisions," and "data quality" are not decoration. Each one is a scoring criterion. Find them and treat them as explicit requirements.
What does the data look like? Open it before planning anything. How many rows? Are there nulls, duplicates, malformed dates? Is there a schema dictionary or do you have to infer types? The shape of the data changes the transformation work.
What is the hidden scope trap? Common ones: "bonus points for streaming" (do not build a streaming pipeline), "feel free to add tests" (this is not optional), "use any tools you like" (means use tools you can defend in a follow-up).
What is the minimum viable submission? Define it before you start. If you run out of time, you ship the MVP with good documentation, not a half-finished ambitious version.

Send clarifying questions if communication is allowed

Many candidates assume they should not ask questions. This is wrong. If the brief allows contact, send 2 to 3 targeted questions immediately after reading. Not vague questions ("what format should the output be?") but specific ones ("the schema shows customer_id as nullable, but the join in the transformation would lose those rows. Should I treat null customer IDs as invalid records, or route them to a separate output?").

Asking good questions demonstrates the same thinking that senior engineers apply on the first day of a real project. If you cannot contact them, document every assumption in your README. This protects you if the evaluator meant something different from what you built.

Look up the company's tech stack before you start

If their job posting mentions dbt, use dbt if it fits. If they work in BigQuery and you have a choice of warehouse, pick BigQuery. Mirror their keywords in your Key Design Decisions section. An evaluator who uses dbt every day will notice that you wrote {{ config(materialized='incremental', unique_key='id') }} and will have more to discuss in the follow-up call than an evaluator looking at a pandas script from someone who clearly has not read the JD.

Time Allocation

The single most common failure mode is running out of time before the README is written. Lock in time for documentation before you start building.

2-hour assessment:

Block	Duration	Focus
Brief + data exploration	0:00 to 0:25	Read twice, open data, plan approach on paper
Core pipeline	0:25 to 1:15	The transformation they asked for. This is the grade.
Tests	1:15 to 1:35	3 to 5 meaningful assertions
README + cleanup	1:35 to 1:55	Design decisions, assumptions, how to run
Final check	1:55 to 2:00	Run from clean clone

4-hour assessment:

Block	Duration	Focus
Brief + data exploration	0:00 to 0:20	Data profiling, schema decisions written down
Core build	0:20 to 1:30	Extract, transform, load
Schema + modeling decisions	1:30 to 2:00	Document grain, dimensions, load strategy
Tests	2:00 to 2:45	5 to 8 tests covering schema, business logic, edge cases
Dashboard or API (if required)	2:45 to 3:30	Working, not polished
README + Key Design Decisions	3:30 to 3:50	This section wins or loses the assessment
Git cleanup + submit	3:50 to 4:00	Run from clean clone one more time

6-hour assessment:

Block	Duration	Focus
Hour 1	Brief + exploration + schema decisions written
Hours 2 to 4	Core pipeline (extract, transform, load)
Hour 5	Tests + edge cases + data quality checks
Hour 6	README + architecture diagram + polish

Rules that hold across all lengths:

Protect the last 20 minutes for README. Never let coding eat into it.
Stop adding features at the 70% mark. Polish what exists.
A working pipeline with 80% of the features beats a broken one with 100%.

Requesting an extension

If the timeline is tight and communication is allowed, ask for more time rather than rushing. "I want to give you my best work. I can deliver by Wednesday instead of Monday if that works." Most interviewers prefer a polished Thursday submission over a hurried Monday one. The ask itself signals that you value quality over optics.

What to Build

Always build

Working code that actually runs
The exact transformation or query they asked for
At least 3 to 5 meaningful tests
A README that explains what you built and why

Build if time allows (differentiates you)

Error handling at API and file ingestion boundaries
Idempotency (run twice without duplication)
Schema validation at ingestion
logging module instead of print statements
.env.example with every variable documented
A Makefile or run.sh so the evaluator can test in one command

.PHONY: run test clean

run:
    python pipelines/run.py

test:
    pytest tests/ -v

clean:
    find . -type f -name "*.pyc" -delete
    find . -type d -name "__pycache__" -delete

Skip unless explicitly required

Full Airflow orchestration (too much overhead in a take-home)
Kafka streaming (unless they said "streaming pipeline")
dbt full project (unless they said "use dbt")
Docker Compose (unless they said "containerize")
Multiple database options
CI/CD pipeline

Evaluators score on the quality of what you deliver, not the quantity of tools you list. Six tools used badly is worse than two tools used well.

Schema and Data Modeling Decisions

Write down your schema decisions before writing any SQL or transformation code. Evaluators look for evidence that you can think before you code.

Decision template:

1. Grain: one row per what?
   e.g. "One row per transaction, identified by transaction_id + timestamp"

2. Dimensions: stable lookup attributes
   e.g. customer, product, region, date

3. Facts/measures: numeric, aggregatable values
   e.g. amount, quantity, duration_seconds

4. Query pattern: how will this data be read?
   e.g. "mostly filtered by date range and region"

5. Load strategy: full refresh, incremental, or upsert?
   e.g. "Incremental on created_at — table grows, no updates to old rows"

6. Known data quality issues
   e.g. "Nulls in customer_id — treat as 'UNKNOWN' not drop"

Star schema vs flat table:

For take-homes under 4 hours, a flat wide table plus a mart view is almost always the right call. Star schema is appropriate when the brief asks you to design for a BI use case with multiple reporting dimensions. If the brief says "answer these 3 business questions," a flat table with good column names is faster and cleaner to explain.

Data Quality Checklist

Run through this before calling the build done:

# Quick profiling you should show in your README
print(f"Source rows: {len(raw_df)}")

# These assertions belong in your tests AND your README as a data profile
df.duplicated(subset=['id']).sum()          # dupe check
df.isnull().sum()                           # null audit per column
df.dtypes                                   # type check
(df['amount'] < 0).sum()                    # impossible values
df['city'].str.strip().ne(df['city']).sum() # hidden whitespace

-- Equivalent SQL checks
SELECT COUNT(*), COUNT(DISTINCT id) FROM target;           -- duplicate PK
SELECT COUNT(*) FROM target WHERE required_col IS NULL;    -- required nulls
SELECT MIN(amount), MAX(amount) FROM target;               -- range sanity
SELECT COUNT(*) FROM target WHERE created_at > CURRENT_DATE; -- future dates

The data profile in your README is a strong signal. Write it like this:

Source:          10,247 rows
After dedup:     10,104 rows  (143 duplicate transaction_ids removed)
After null drop:  9,882 rows  (222 rows missing required customer_id)
Loaded:           9,882 rows

One line per transformation stage with the delta explained. Evaluators running 20 submissions will remember the one candidate who documented what happened to their data end-to-end.

Tests That Signal Production Thinking

Aim for 5 to 10 tests. Evaluators count them and check whether they are meaningful.

Required tests (always):

import pytest
import pandas as pd
from pipelines.transform import transform_sales

@pytest.fixture
def sample_input():
    return pd.DataFrame({
        'id':         [1, 2, 3],
        'amount':     [100.0, 250.0, 75.5],
        'created_at': ['2024-01-01', '2024-01-02', '2024-01-03'],
        'region':     ['Nairobi', 'Mombasa', None]
    })

def test_output_has_expected_columns(sample_input):
    result = transform_sales(sample_input)
    assert 'total_amount' in result.columns
    assert 'region' in result.columns

def test_no_duplicate_ids(sample_input):
    result = transform_sales(sample_input)
    assert result['id'].duplicated().sum() == 0

def test_row_count_preserved(sample_input):
    result = transform_sales(sample_input)
    assert len(result) == len(sample_input)

def test_null_region_filled(sample_input):
    result = transform_sales(sample_input)
    assert result['region'].isnull().sum() == 0

def test_amount_is_non_negative(sample_input):
    result = transform_sales(sample_input)
    assert (result['amount'] >= 0).all()

Tests that differentiate you:

def test_idempotency(sample_input, db_connection):
    """Running the pipeline twice should produce the same row count."""
    load(transform_sales(sample_input), db_connection)
    load(transform_sales(sample_input), db_connection)
    result = db_connection.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    assert result == len(sample_input)

def test_empty_input_returns_empty():
    """Empty input should not crash — it should return an empty DataFrame."""
    empty = pd.DataFrame(columns=['id', 'amount', 'created_at', 'region'])
    result = transform_sales(empty)
    assert len(result) == 0
    assert 'total_amount' in result.columns  # schema still correct

def test_business_logic(sample_input):
    """Total revenue in output matches sum of input amounts."""
    result = transform_sales(sample_input)
    assert result['amount'].sum() == pytest.approx(sample_input['amount'].sum())

The idempotency test and the empty input test are the two that most junior candidates skip. Both show production awareness.

The README That Wins Assessments

The README is the first thing the evaluator opens. It is also the document they refer to during the follow-up call when they ask "walk me through your approach." Write it like a professional document, not like a GitHub hobby project.

Mandatory sections in this order:

Summary

One paragraph. What you built, what the data is, what the output is. No fluff.

Builds an incremental ELT pipeline that ingests daily M-Pesa transaction records from a CSV export, validates and transforms them using pandas, loads them into PostgreSQL, and exposes aggregate revenue metrics per region per day. Input: 10,247 rows across 3 months. Output: 9,882 clean rows in a fct_transactions table.

How to Run

Commands that work from a clean clone. Assume the evaluator has Python and Docker but not your database or your environment variables.

cp .env.example .env
# Edit .env with your DB credentials
pip install -r requirements.txt
python pipelines/run.py
pytest tests/ -v

Key Design Decisions

This section is where assessments are won or lost. Write 3 to 5 decisions with the reasoning behind each.

Why incremental load over full refresh:
The source table grows daily. Full refresh would reprocess all rows on every run. Incremental load on created_at reduces per-run cost by approximately 95% as the table grows. Trade-off: if a past record is corrected, it will not be picked up unless I add a lookback window or a CDC mechanism.

Why PostgreSQL over SQLite:
PostgreSQL supports window functions (required for the cohort analysis query in the mart layer) and has proper JSONB support if the events column schema expands. SQLite would have required workarounds for both.

Why I filled null regions with UNKNOWN instead of dropping the rows:
The region column was 12% null. Dropping those rows would have removed 12% of total revenue from the mart. I filled with UNKNOWN and added a filter in the mart view so analysts can exclude those rows if they need clean-only data. The raw rows are preserved.

Performance at scale note:
This pipeline processes the given 10K-row dataset in under 2 seconds. At 100M+ rows I would switch from pandas to DuckDB for in-process columnar processing, or to a Spark-based approach if the data requires distributed processing across multiple nodes.

Assumptions

List every assumption you made. This protects you if the evaluator meant something different.

transaction_id is assumed globally unique, not unique per account.
Dates are assumed UTC. If local time is needed, timezone conversion should be added at ingestion.
Rows where amount = 0 are treated as valid (refunds cleared to zero, not errors).
The region field maps to Kenyan counties based on string matching to the provided lookup.

If I Had More Time

Show ambition without making excuses. Be specific about what problem you would solve, not what feature you would add.

Add a Great Expectations suite on the raw ingest to catch schema drift before it reaches the mart. Currently any new column in the source CSV is silently dropped.
Parameterize the date range via CLI argument so the pipeline can be backfilled for any period without code changes.
Add row count monitoring: assert that each run loads within 20% of the previous run's row count, and alert if it deviates.

Project Structure

project/
├── pipelines/
│   ├── extract.py       # reads source CSV or API
│   ├── transform.py     # business logic and cleaning
│   └── load.py          # writes to PostgreSQL with ON CONFLICT
├── tests/
│   └── test_transform.py
├── .env.example
├── .gitignore
├── Makefile
├── requirements.txt
└── README.md

The Five Take-Home Types

Type A: "Here's a CSV, build a pipeline"

Focus: correctness of transformation, data quality, idempotency.

Deliverable: Python script with tests and README.

The trap: over-engineering. A clean pandas to PostgreSQL pipeline beats a half-working Airflow DAG. Write ON CONFLICT DO NOTHING on inserts so it is idempotent from day one.

Type B: "Write SQL to answer these business questions"

Focus: correctness, readable query structure, efficiency.

Deliverable: one .sql file per question, each with a comment on the grain.

-- Q3: Monthly revenue by region
-- Grain: one row per region per calendar month
SELECT
    DATE_TRUNC('month', transaction_date) AS month,
    region,
    SUM(amount)                           AS total_revenue,
    COUNT(DISTINCT customer_id)           AS unique_customers
FROM fct_transactions
GROUP BY 1, 2
ORDER BY 1, 2;

The trap: subqueries where a CTE is clearer. Every complex query should have a grain comment and a note explaining any non-obvious approach.

Root cause decomposition separates strong SQL submissions from average ones. When asked "why is revenue down?" the answer is not a single number. Break the metric into components:

-- Revenue = orders × average order value
-- Splitting reveals whether the problem is volume or value
SELECT
    month,
    COUNT(order_id)                        AS order_volume,
    ROUND(SUM(amount) / COUNT(order_id), 2) AS avg_order_value,
    SUM(amount)                             AS total_revenue
FROM fct_transactions
GROUP BY 1
ORDER BY 1;

This is the kind of query that makes an evaluator pause and read it twice.

Type C: "Design a data pipeline for X"

Focus: architecture decisions, trade-off discussion.

Deliverable: a written design doc with an ASCII architecture diagram.

The trap: using exotic tools to look smart. If the scale is 50,000 rows per day, Kafka is overkill. Say so explicitly. "At this volume, a cron-scheduled Python script writing to PostgreSQL is sufficient and operationally simpler than a streaming architecture. If volume grows to 1M+ events per hour with sub-minute latency requirements, I would introduce Kafka and a Flink consumer."

Type D: "Build a small API or dashboard"

Focus: does it work? Is it clean? Can the evaluator use it without calling you?

The trap: forgetting to include screenshots in the README. Many candidates lose points because the evaluator cannot run the visualization locally. Screenshot every page of the dashboard and embed it in the README before submitting.

Type E: "Review this pipeline and tell us what's wrong"

Focus: identifying bugs, edge cases, missing quality checks.

Do not just say "this is bad." Write structured feedback with severity levels:

CRITICAL (line 47): No error handling on the API call. If the API returns 
a 429 or 500, the pipeline fails silently with no retry and no alert. 
Fix: wrap in try/except with exponential backoff and a failure log.

MEDIUM (line 82): Full table scan on every run. The WHERE clause is 
missing a watermark filter, so the pipeline reprocesses all historical 
rows on each execution. Fix: add WHERE created_at > last_processed_ts.

LOW (line 104): print() used instead of logging module. In production 
this output would be invisible in any log aggregation system.

Code Quality Details Evaluators Notice

Format your code before submitting

Run a formatter before submitting. Inconsistent indentation and style signal that you do not work as part of a team.

# Install once
pip install black ruff

# Run before every submission
black pipelines/ tests/
ruff check pipelines/ tests/ --fix

black handles formatting. ruff catches common issues: unused imports, undefined variables, f-string without placeholders. Add both to your requirements.txt as dev dependencies so the evaluator can see you use them.

Use the logging module, not print

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)
logger = logging.getLogger(__name__)

logger.info(f"Loaded {len(df)} rows from source")
logger.warning(f"Found {null_count} null customer_ids — filling with UNKNOWN")
logger.info(f"Written {rows_written} rows to fct_transactions")

print() works for a script. logging works for a production system. The evaluator is assessing whether you would be comfortable in a production codebase.

AI tools in take-home assessments

AI tools exist and evaluators know you will use them. Using them is not a problem. The problem is submitting code you cannot explain.

Use AI to:

Clarify ambiguous requirements when you cannot contact the team
Check whether a SQL query handles a specific edge case correctly
Identify missing test scenarios you have not thought of

Do not use AI to:

Generate the entire pipeline without understanding it
Write test names that sound meaningful but assert nothing
Produce a README with five Key Design Decisions you cannot defend in a follow-up call

The follow-up call after a take-home always includes "walk me through this function" and "why did you choose this approach?" If you cannot answer, the submission score is irrelevant.

Disqualifying Mistakes

These do not lose points. They end the review immediately.

Code that does not run:

Clone your own repo into a temp folder and run it from scratch before submitting. This is not optional.

cd /tmp
git clone https://github.com/yourname/assessment-repo
cd assessment-repo
cp .env.example .env
# Edit with test credentials
pip install -r requirements.txt
python pipelines/run.py
pytest tests/ -v

Hardcoded secrets in any file:

# Check before submitting
git log --diff-filter=A -- .env          # check .env was never committed
git grep -i "password\|api_key\|secret"  # scan all tracked files

Missing or blank README: Even 10 lines of real README beats none. An evaluator who cannot understand what you built in 30 seconds moves to the next submission.

Broken imports or missing requirements.txt entries: After finishing installation, run pipreqs . --force to regenerate requirements.txt from your actual imports. Do not use pip freeze unless you want to submit 200 packages from your entire venv.

SQL that does not run on the specified engine: If they said PostgreSQL, test on PostgreSQL. Window functions, CTEs, JSONB, and type casting all differ between PostgreSQL, SQLite, and DuckDB. Test on the exact engine specified.

Large binary files committed to the repo:

# Check for large files before final push
git ls-files | xargs ls -la | sort -k5 -n | tail -20

Your .gitignore minimum:

.env
.venv/
__pycache__/
*.pyc
*.db
*.csv
*.parquet
*.egg-info/
.DS_Store
.idea/
*.log

When to Decline

Some take-homes are not worth doing. Red flags:

Scope beyond 6 hours for an entry-level role (you are being asked to do free work)
Unreasonable timelines where the brief is sent Friday afternoon due Monday morning
Repeated assessments after multiple interview rounds where you have already demonstrated technical ability
No feedback loop after submission (you will never know if you passed or failed)

Declining professionally is straightforward: "Thank you for considering me. Due to current bandwidth I'm unable to complete an assessment of this scope by that deadline. I'd be happy to discuss an alternative format or timeline if that's possible." This is better than a rushed submission that leaves a bad impression.

Pre-Submission Checklist (Last 10 Minutes)

Code:

Code runs end-to-end from a clean clone
No hardcoded paths (no C:/Users/Administrator/...)
No secrets in any tracked file
requirements.txt is up to date
.env.example has every required variable with a placeholder value
.gitignore covers .env, .venv/, __pycache__/, *.csv, *.db, *.parquet
Code formatted with black and linted with ruff

Tests:

All tests pass (pytest tests/ -v)
At least 3 meaningful assertions (not just test_import)
Idempotency test present if the pipeline writes to a database

README:

"How to Run" section works from a clean clone
Key Design Decisions written with reasoning, not placeholders
Assumptions documented
Data profile (source rows, after dedup, after cleaning, final loaded)
"If I Had More Time" with specific problems, not generic features
Screenshots embedded if there is a dashboard or API output

Git:

git status is clean
Commit messages are readable
No large binary files
Repo is public or evaluator's GitHub handle has access

Final:

You ran your own submission from scratch in a temp directory
You re-read the original brief to confirm you answered the actual question

My portfolio of 55 data engineering projects is on GitHub. If you found this useful, follow me on dev.to for more on pipelines, dbt, and Airflow.

DEV Community