DEV Community: De' Clerke

I Built a World Cup Prediction Model That Retrains Itself Daily and Can't Cheat Its Own Results ⚽

De' Clerke — Sat, 13 Jun 2026 17:31:08 +0000

Most sports prediction models have perfect hindsight. Mine is committed to git before kickoff.

That's the constraint I kept coming back to when building CupCast 2026, a machine learning system that forecasts every remaining World Cup fixture, refits on fresh data every morning, and records its predictions in an append-only log that cannot be edited after the match starts. When the result comes in, the model grades its frozen prediction. No retroactive edits. No cherry-picked accuracy claims.

This article is about two engineering decisions that make it actually honest: daily automated retraining and prediction freezing.

The Pipeline in Plain English

Here's what runs at 09:00 UTC every day:

GitHub Actions spins up a fresh ubuntu-latest runner
Fetches the latest fixtures and results from football-data.org
Recomputes World Football Elo over 49,410 historical matches (every international match since 1872)
Refits XGBoost using hyperparameters committed in best_params.json
Runs 10,000 Monte Carlo simulations of the remaining bracket
Validates output against 7 JSON schemas — if anything's malformed, the build fails and the previous data stays live
Appends frozen predictions for any fixture kicking off within 72 hours
Commits everything and pushes — Vercel picks it up and auto-deploys the frontend

The whole run takes under 2 minutes on a free GitHub Actions runner. Zero infrastructure cost.
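To make the simulation step concrete, here is a minimal sketch on a toy scale. It is not CupCast's simulator: the four-team bracket, the team names, and the `WIN_PROB` numbers are all made up for illustration. The idea is the same, though: replay the bracket many times with random draws and count how often each team lifts the trophy.

```python
import random
from collections import Counter

# Hypothetical head-to-head win probabilities (illustrative numbers only).
WIN_PROB = {
    ("A", "B"): 0.60, ("C", "D"): 0.55,
    ("A", "C"): 0.52, ("A", "D"): 0.65,
    ("B", "C"): 0.45, ("B", "D"): 0.50,
}

def p_win(x: str, y: str) -> float:
    """Probability that x beats y in a knockout tie (no draws)."""
    if (x, y) in WIN_PROB:
        return WIN_PROB[(x, y)]
    return 1.0 - WIN_PROB[(y, x)]

def play(x: str, y: str, rng: random.Random) -> str:
    """Sample the winner of a single tie."""
    return x if rng.random() < p_win(x, y) else y

def simulate_title_odds(n_sims: int = 10_000, seed: int = 0) -> dict[str, float]:
    """Replay semi-finals and final n_sims times; return title frequencies."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    titles = Counter()
    for _ in range(n_sims):
        finalist_1 = play("A", "B", rng)
        finalist_2 = play("C", "D", rng)
        titles[play(finalist_1, finalist_2, rng)] += 1
    return {team: titles[team] / n_sims for team in "ABCD"}

odds = simulate_title_odds()
print(odds)
```

With 10,000 runs the sampling error on each probability is roughly half a percentage point, which is one reason small day-to-day moves in simulated odds should be read with care.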

on:
  schedule:
    - cron: "0 9 * * *"   # 09:00 UTC, after North-American overnight kickoffs settle

jobs:
  forecast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run pipeline (refit on fresh data + 10k simulations)
        env:
          FOOTBALL_DATA_TOKEN: ${{ secrets.FOOTBALL_DATA_TOKEN }}
        run: uv run python run_pipeline.py --refit --sims 10000

      - name: Commit updated forecast
        run: |
          git add web/public/data pipeline/data/frozen pipeline/best_params.json
          if git diff --staged --quiet; then
            echo "No changes to commit."
          else
            git commit -m "Daily forecast update $(date -u +%Y-%m-%d)"
            git push
          fi

tune-once, refit-daily: Why These Are Different Things

My first instinct was to run Optuna on every CI push. Bad idea.

Optuna with 75 trials across 3 CV folds takes 5-10 minutes. It burns free-tier minutes fast. Worse, running it daily introduces search noise: you're not finding better parameters, you're finding parameters that overfit to the most recent week of results.

The right separation:

Tuning (Optuna): run on-demand when there's a structural reason to, such as new features, significant data drift, or an algorithm change. Output is best_params.json, committed to the repo.
Refitting (daily CI): load those committed params, fit on all historical data up to today. Takes seconds. Knowledge updates; architecture stays locked.

Here's how the code makes that explicit:

# train.py

def train_all(art: dict, n_trials: int = 75) -> dict:
    """Full tune: run Optuna, find best params, commit them, fit production model."""
    study = optuna.create_study(direction="minimize",
                                sampler=optuna.samplers.TPESampler(seed=C.SEED))
    study.optimize(lambda t: _objective(t, dev), n_trials=n_trials)
    best = study.best_params
    BEST_PARAMS_PATH.write_text(json.dumps(best, indent=2))  # committed for daily refit
    return _fit_production(art["train"], best)


def refit(art: dict) -> dict:
    """Daily CI path: load committed params, refit on fresh data. No Optuna."""
    best = json.loads(BEST_PARAMS_PATH.read_text())
    return _fit_production(art["train"], best)

best_params.json is versioned in git. When I add a feature I re-run train_all locally (Optuna included) and commit the updated params. The daily CI only ever calls refit. Clean separation between architecture decisions and knowledge updates.

32 Features, Zero Leakage

The classifier is XGBoost with objective="multi:softprob" for W/D/L as three classes. For scorelines, I pair it with two XGBoost count:poisson regressors (home goals and away goals separately), then combine the predicted goal rates into a scoreline probability matrix via the Poisson distribution.
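A rough sketch of the scoreline step, assuming the two goal counts are treated as independent Poisson variables (the article does not say whether the real pipeline adds any correlation adjustment). The rates 1.6 and 1.1 are invented inputs standing in for the two regressors' predictions.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def scoreline_matrix(lam_home: float, lam_away: float, max_goals: int = 10):
    """matrix[h][a] = P(home scores h, away scores a), assuming independence."""
    home = [poisson_pmf(h, lam_home) for h in range(max_goals + 1)]
    away = [poisson_pmf(a, lam_away) for a in range(max_goals + 1)]
    return [[ph * pa for pa in away] for ph in home]

def outcome_probs(matrix):
    """Collapse the scoreline grid into (home win, draw, away win)."""
    n = len(matrix)
    p_home = sum(matrix[h][a] for h in range(n) for a in range(n) if h > a)
    p_draw = sum(matrix[h][h] for h in range(n))
    p_away = sum(matrix[h][a] for h in range(n) for a in range(n) if h < a)
    total = p_home + p_draw + p_away  # slightly below 1: the grid is truncated
    return p_home / total, p_draw / total, p_away / total

matrix = scoreline_matrix(1.6, 1.1)  # made-up goal rates
p_home, p_draw, p_away = outcome_probs(matrix)
print(round(p_home, 3), round(p_draw, 3), round(p_away, 3))
```

The most likely scorelines are simply the largest cells of `matrix`, which is presumably where a "top scorelines" list would come from.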

32 features:

FEATURE_COLUMNS = [
    "elo_home", "elo_away", "elo_diff",
    "neutral", "home_is_host",
    "form5_win_h", "form5_draw_h", "form5_gf_h", "form5_ga_h",
    "form5_win_a", "form5_draw_a", "form5_gf_a", "form5_ga_a",
    "form10_win_h", "form10_draw_h", "form10_gf_h", "form10_ga_h",
    "form10_win_a", "form10_draw_a", "form10_gf_a", "form10_ga_a",
    "form10_oppelo_h", "form10_oppelo_a",
    "rest_h", "rest_a",
    "h2h_home_winrate", "h2h_mean_gd",
    "importance",
    "elo_trend_h", "elo_trend_a",
    "alt_gap_home", "alt_gap_away",
]

The last two (alt_gap_home and alt_gap_away) are the altitude feature. Each team has a baseline elevation derived from the stadiums where they typically play. These features capture how much each team ascends relative to that baseline to reach the match venue. Of WC 2026's 16 host cities, only Mexico City (2,240m) and Guadalajara (1,566m) are materially elevated. For those five fixtures the delta is real signal. For the other 63 matches it's effectively zero.
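One way to read that definition, as a sketch only: the gap is the venue elevation minus the team's baseline, clipped at zero so that descending counts as no climb. The clipping and the two baseline values below are assumptions for illustration; only the venue elevations come from the text above.

```python
def altitude_gap(venue_elevation_m: float, team_baseline_m: float) -> float:
    """Metres a team climbs above its usual playing elevation (never negative)."""
    return max(0.0, venue_elevation_m - team_baseline_m)

# Venue elevations quoted in the article.
VENUE_ELEVATION_M = {"Mexico City": 2240.0, "Guadalajara": 1566.0}

# Hypothetical team baselines, for illustration only.
TEAM_BASELINE_M = {"sea_level_side": 50.0, "highland_side": 2400.0}

alt_gap_home = altitude_gap(VENUE_ELEVATION_M["Mexico City"], TEAM_BASELINE_M["sea_level_side"])
alt_gap_away = altitude_gap(VENUE_ELEVATION_M["Mexico City"], TEAM_BASELINE_M["highland_side"])
print(alt_gap_home, alt_gap_away)
```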

The leakage problem in sports ML is subtle. A random train/test split means your training set will contain matches that happened after some test matches. The model's form features will implicitly encode future trajectory. This is fine in most ML tasks where samples are i.i.d. In temporal sports data it makes your backtests look better than they are.

The fix is walk-forward expanding-window CV:

CV_FOLDS = [  # (train_end, val_start, val_end)
    ("2017-12-31", "2018-01-01", "2019-12-31"),
    ("2019-12-31", "2020-01-01", "2021-12-31"),
    ("2021-12-31", "2022-01-01", "2023-12-31"),
]

def _objective(trial, train: pd.DataFrame) -> float:
    # Abridged: params comes from the trial's suggestions; Xtr/ytr, Xva/yva and
    # the sample weights w are built from tr and va (omitted here for brevity).
    losses = []
    for tr_end, va_start, va_end in CV_FOLDS:
        tr = train[train["date"] <= tr_end]
        va = train[(train["date"] >= va_start) & (train["date"] <= va_end)]
        model = XGBClassifier(objective="multi:softprob", num_class=3, **params)
        model.fit(Xtr, ytr, sample_weight=w)
        losses.append(log_loss(yva, model.predict_proba(Xva)))
    return float(np.mean(losses))

Each fold trains on everything before tr_end and validates on the following two years. Validation always starts after training ends. I also hold out 2024-01-01 through 2026-06-10 as an untouched test set; the tuner never sees it.
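The split logic on its own can be checked without any model. This is a standard-library sketch using the same fold boundaries; the ten toy match dates are invented. It asserts the property that matters: every training row is strictly earlier than every validation row.

```python
from datetime import date

CV_FOLDS = [  # (train_end, val_start, val_end), as in the article
    ("2017-12-31", "2018-01-01", "2019-12-31"),
    ("2019-12-31", "2020-01-01", "2021-12-31"),
    ("2021-12-31", "2022-01-01", "2023-12-31"),
]

def walk_forward_splits(match_dates, folds):
    """Yield (train_idx, val_idx) per fold from a list of match dates."""
    for train_end, val_start, val_end in folds:
        tr_end = date.fromisoformat(train_end)
        va_start = date.fromisoformat(val_start)
        va_end = date.fromisoformat(val_end)
        train_idx = [i for i, d in enumerate(match_dates) if d <= tr_end]
        val_idx = [i for i, d in enumerate(match_dates) if va_start <= d <= va_end]
        yield train_idx, val_idx

# Toy data: one match per year, 2014 through 2023.
dates = [date(year, 6, 15) for year in range(2014, 2024)]
splits = list(walk_forward_splits(dates, CV_FOLDS))

for train_idx, val_idx in splits:
    # No training match may be on or after the first validation match.
    assert max(dates[i] for i in train_idx) < min(dates[i] for i in val_idx)
    print(len(train_idx), "train rows ->", len(val_idx), "validation rows")
```

The training window grows from fold to fold while the validation window slides forward, which is what "expanding window" means here.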

Test-set results:

Log loss: 0.8583
Favourite accuracy: 60.0%
Brier score: 0.5043

Baselines on the same holdout: an Elo-only logistic regression scores a log loss of 0.9201, and historical base rates score 0.9864. The model beats both.

Prediction Freezing: The Part Most Sports Models Skip

Here's the problem with sports model accuracy claims: they're almost always backtested. The model already saw the outcome distribution when you built it. A "62% accuracy" headline is meaningless unless you can show it was generated before kickoff against predictions the model couldn't retroactively update.

My approach: write predictions to an append-only CSV before the match starts, then score them as results arrive. This is the same principle as event sourcing or an audit log: the log is immutable, and the current state (accuracy metrics) is derived from it.

The freezing function runs as part of every daily pipeline. It checks every upcoming fixture, sees if a prediction has already been logged for that match, and if not, appends one:

def freeze_due(fixtures, predictions, model_version, now=None):
    now = now or datetime.now(timezone.utc)
    horizon = now + timedelta(hours=72)
    existing = {int(r["match_id"]) for r in _read_csv(LOG_PATH)}

    new_rows = []
    for _, m in fixtures.iterrows():
        mid = int(m["match_id"])
        if mid in existing or mid not in predictions:
            continue   # already frozen, don't overwrite
        kickoff = datetime.fromisoformat(m["utc_date"].replace("Z", "+00:00"))
        if not (now <= kickoff <= horizon):
            continue
        p = predictions[mid]
        new_rows.append({
            "frozen_at_utc": now.isoformat(),
            "match_id": mid,
            "p_home": round(p["p_home"], 4),
            "p_draw": round(p["p_draw"], 4),
            "p_away": round(p["p_away"], 4),
            "model_version": model_version,
        })

    # append-only: open in 'a' mode, never truncate existing rows
    if new_rows:
        with open(LOG_PATH, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
            writer.writerows(new_rows)

Once a row exists for a match_id, the existing check on the next daily run skips it. The prediction is locked. When a result comes in, a separate score_resolved function reads the frozen log, computes Brier score and log loss for that row, and appends to scores.csv. Those scores accumulate live on the Model page.
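The score_resolved function is not shown, so the following is only a sketch of what scoring a single frozen row could look like. It assumes the multi-class Brier score is the sum of squared errors over the three outcomes (range 0 to 2) and that outcomes are encoded as 'H', 'D', 'A'; both are assumptions, as is the function name.

```python
import math

def score_frozen_row(p_home: float, p_draw: float, p_away: float, outcome: str):
    """Return (brier, log_loss) for one frozen prediction.

    outcome is 'H', 'D' or 'A' (a hypothetical encoding).
    """
    probs = {"H": p_home, "D": p_draw, "A": p_away}
    # Brier: squared distance between the forecast and the one-hot result.
    brier = sum((p - (1.0 if k == outcome else 0.0)) ** 2 for k, p in probs.items())
    # Log loss: negative log of the probability given to what actually happened.
    eps = 1e-15  # guards against log(0) for a forecast of exactly zero
    log_loss = -math.log(min(max(probs[outcome], eps), 1.0))
    return brier, log_loss

brier, ll = score_frozen_row(0.55, 0.25, 0.20, "H")
print(round(brier, 4), round(ll, 4))
```

Because the inputs come from the frozen log rather than from the current model, re-running this after a refit gives the same score every time.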

7 JSON Contracts (Making It a Product, Not a Script)

A model that only lives in Python is a script. A product needs a data contract.

I publish 7 validated JSON files on every daily run:

| File | Contents |
| --- | --- |
| `meta.json` | Run timestamp, model version, simulation count |
| `matches.json` | All 104 fixtures with W/D/L probabilities |
| `champion_odds.json` | Per-team tournament win probability + daily delta |
| `groups.json` | Group standings with qualification probabilities |
| `bracket.json` | Full knockout bracket with per-match probabilities |
| `accuracy.json` | Frozen prediction scores as they accumulate |
| `match_detail/{id}.json` | SHAP values, Elo trend, top scorelines, form, H2H |

Each is validated against a JSON schema before the commit step. If the check fails (a required field is null, probabilities don't sum correctly, an ID is missing), the pipeline errors out and the previous good data stays live. The frontend never receives a partial update.
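The kinds of checks listed there can be sketched in plain Python. This is not the project's JSON Schema validation; it is a hand-rolled stand-in that assumes each match entry carries `match_id`, `p_home`, `p_draw`, and `p_away` (field names borrowed from the frozen log, so their presence in matches.json is an assumption).

```python
import math

REQUIRED = ("match_id", "p_home", "p_draw", "p_away")

def validate_matches(matches: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the payload passes."""
    errors = []
    for i, m in enumerate(matches):
        missing = [k for k in REQUIRED if m.get(k) is None]
        if missing:
            errors.append(f"row {i}: missing or null {missing}")
            continue
        total = m["p_home"] + m["p_draw"] + m["p_away"]
        # Tolerance allows for probabilities rounded to four decimals.
        if not math.isclose(total, 1.0, abs_tol=1e-3):
            errors.append(f"row {i}: probabilities sum to {total:.4f}")
    return errors

good = [{"match_id": 1, "p_home": 0.5, "p_draw": 0.3, "p_away": 0.2}]
bad = [
    {"match_id": None, "p_home": 0.5, "p_draw": 0.3, "p_away": 0.2},
    {"match_id": 2, "p_home": 0.6, "p_draw": 0.3, "p_away": 0.2},
]
print(validate_matches(good))
print(validate_matches(bad))
```

In CI, a non-empty error list would map to a non-zero exit code, so the commit step never runs and the last good files stay in place.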

The React frontend (Vite + Tailwind v4 + Recharts + Framer Motion) is a pure read-only consumer of these contracts. No backend, no runtime server. The data contract is the interface, and the interface is versioned in git.

One Vercel SPA Gotcha Worth Noting

For direct URL access to match pages (/match/537330), I had this in vercel.json:

{
  "cleanUrls": true,
  "rewrites": [{ "source": "/:path*", "destination": "/index.html" }]
}

Direct navigation returned 404. The issue: cleanUrls: true interferes with how Vercel resolves the catch-all rewrite; it strips extensions first, then can't match a file. Fix: remove cleanUrls entirely. The rewrite handles all routing. Small thing, 20 minutes of my life I won't get back.

What I'd Do Differently

Add betting market odds as a feature. Markets are the single most efficient signal in football: they aggregate injury news, team selection, and weather that the model doesn't have. I avoided them to keep the pipeline free to run, but for a production system they'd be the first addition.

Red card and injury data. The model has no idea a team's starting goalkeeper is injured. Elo absorbs this over time, but pre-match it's a real blind spot in individual fixture predictions.

Form-weighted Elo. Self-computed Elo is reliable for ranking relative team strength across years. It's less reliable at capturing rapid momentum shifts. A team on an 8-game winning streak and a team grinding out draws can have identical Elo trajectories. A recency-weighted variant would be worth testing.

The Result

CupCast 2026 is live at world-cup-2026-forecast.vercel.app. The model updates daily as the tournament progresses. Current champion odds: Spain 26.3%, Argentina 18.9%, France 10.4%.

The code is open: github.com/declerke/World-Cup-2026-Forecast.

If you've built a similar system, or have opinions on the altitude feature or on why betting markets are hard to beat, drop a comment below. Follow me on dev.to for more data engineering from production.

I Shipped 12 BI Dashboards With 5 Different Tools. Here Is the Honest Comparison.

De' Clerke — Sun, 07 Jun 2026 11:03:40 +0000

Most BI tool comparisons are written by someone who spent a weekend with each option, deployed a toy dataset, and wrote up their impressions. This is not that.

Over the past three months I shipped 12 dashboards across 5 tools: Streamlit, Plotly Dash, Apache Superset, Evidence.dev, and Grafana. Each one was the visualization layer on a real data pipeline with Airflow, DuckDB, dbt, and live API data. I ran into real failures, real deployment constraints, and real differences in where each tool fits -- and where it does not.

This article is the write-up of those failures, constraints, and differences.

The Setup

Every project I built follows the same general pattern: an Airflow 3.0 pipeline pulls data from somewhere, dbt transforms it into mart tables, and a visualization layer sits on top. The question was always: what goes in that last layer?

Here is what I ended up using and why:

| Tool | Projects | Stack |
| --- | --- | --- |
| Streamlit | Kenya Fiscal Intelligence, Kenya Human Development, Kenya Agricultural Pulse, EAC Economic Lens, LoanRisk Analytics, Kenya Tenders | Python + WB API + DuckDB |
| Plotly Dash | Call Center Analytics | DuckDB + dbt |
| Apache Superset | Kenya Real Estate Pipeline, Ecommerce Analytics | PostgreSQL + DuckDB + dbt |
| Evidence.dev | BizPulse Kenya, Kenya Economic Pulse, LedgerSync | PostgreSQL + Delta Lake |
| Grafana | Saruk Electronics Tracker | PostgreSQL + dbt |

None of these were chosen at random. Each one came from a specific requirement -- deployment target, interactivity level, pipeline stack, or time constraints. Let me go through each one.

Streamlit: The Data Engineer's Default

Streamlit is where I start when the requirement is flexible and the timeline is tight. It is Python all the way down, which means I can query DuckDB, call an API, run a pandas transformation, and render a chart in the same file. No context switching.

For the Kenya BI Dashboards series I built four separate Streamlit apps, each pulling live data from the World Bank REST API:

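import pandas as pd
import requests
import streamlit as st

@st.cache_data(ttl=86400)
def fetch_indicator(country: str, indicator: str) -> pd.DataFrame:
    url = f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
    r = requests.get(url, params={"format": "json", "per_page": 100}, timeout=15)
    r.raise_for_status()
    payload = r.json()  # [paging metadata, observations]
    data = payload[1] if len(payload) > 1 and payload[1] else []
    rows = [
        {"year": int(d["date"]), "value": d["value"]}
        for d in data if d["value"] is not None  # keep genuine zeros
    ]
    return pd.DataFrame(rows, columns=["year", "value"]).sort_values("year")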

The ttl=86400 means the API is called once per day. On Streamlit Cloud, the parquet files I commit to the repo serve as the cold-start fallback. That pattern -- cache aggressively, commit a data snapshot -- is what makes Streamlit Cloud viable for production.
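The fallback half of that pattern can be sketched independently of Streamlit. The version below is an illustration under stated assumptions: it uses a JSON snapshot instead of parquet so that it runs on the standard library alone, and `load_with_snapshot` and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

def load_with_snapshot(fetch: Callable[[], list[dict]], snapshot: Path) -> list[dict]:
    """Try the live source first; on any failure use the committed snapshot."""
    try:
        rows = fetch()
        if rows:
            return rows
    except Exception:
        pass  # network error, bad payload, rate limit: fall through
    return json.loads(snapshot.read_text(encoding="utf-8"))

def failing_fetch() -> list[dict]:
    raise TimeoutError("API unreachable")

with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "indicator_snapshot.json"  # stands in for a committed file
    snap.write_text(json.dumps([{"year": 2023, "value": 1.5}]), encoding="utf-8")

    from_snapshot = load_with_snapshot(failing_fetch, snap)
    from_live = load_with_snapshot(lambda: [{"year": 2024, "value": 2.0}], snap)

print(from_snapshot, from_live)
```

Swallowing every exception is a deliberate simplification here; a real app would probably log the failure so a stale snapshot does not go unnoticed.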

What Streamlit does well: iteration speed, Python-native logic in the dashboard, @st.cache_data for heavy computations, st.session_state for multi-page state persistence, and one-click deploy to Streamlit Cloud.

Where it gets awkward: reactive filtering. In Streamlit, every widget interaction re-runs the entire script. For simple filters this is fine. Once you need dependent dropdowns, cross-filter behavior between charts, or real callback logic, the model starts to feel wrong. That is where Dash earns its place.

The Plotly 6 breaking changes. Every Streamlit project I built in 2026 hit these. Three things changed between Plotly 5 and 6:

First, numpy.bool_ is no longer accepted in layout parameters. If you compute a boolean from pandas and pass it directly to a Plotly call, you get a TypeError. Wrap with bool().

Second, titlefont is removed. The old syntax fig.update_layout(titlefont=dict(size=16)) silently does nothing in Plotly 6. The replacement is fig.update_layout(title=dict(font=dict(size=16))).

Third, 8-digit hex colors silently drop the alpha channel. fillcolor="#00d26a40" no longer works. You need fillcolor="rgba(0,210,106,0.25)". I wrote a small helper that I now copy into every project:

def hex_to_rgba(hex_color: str, alpha: float = 1.0) -> str:
    h = hex_color.lstrip("#")
    if len(h) == 3:
        h = "".join(c * 2 for c in h)
    r, g, b = int(h[0:2], 16), int(h[2:4], 16), int(h[4:6], 16)
    return f"rgba({r},{g},{b},{alpha})"

Apart from the TypeError, none of these fail loudly. They just produce wrong output. Check Plotly version first if your charts look off after an upgrade.
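When the old code base is full of 8-digit hex strings, a variant of the helper that reads the alpha from the hex itself can save some manual conversion. This is a sketch, and `hex8_to_rgba` is a hypothetical name, not part of Plotly.

```python
def hex8_to_rgba(hex_color: str) -> str:
    """Convert #RGB, #RRGGBB or #RRGGBBAA to an rgba() string."""
    h = hex_color.lstrip("#")
    if len(h) == 3:
        h = "".join(c * 2 for c in h)  # expand shorthand: abc -> aabbcc
    if len(h) == 6:
        h += "ff"  # no alpha given: fully opaque
    if len(h) != 8:
        raise ValueError(f"unsupported hex colour: {hex_color!r}")
    r, g, b, a = (int(h[i:i + 2], 16) for i in (0, 2, 4, 6))
    return f"rgba({r},{g},{b},{round(a / 255, 3)})"

print(hex8_to_rgba("#00d26a40"))  # rgba(0,210,106,0.251)
print(hex8_to_rgba("#fff"))       # rgba(255,255,255,1.0)
```

The alpha byte 0x40 is 64 of 255, which is why the result is 0.251 rather than exactly 0.25.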

Plotly Dash: When You Need Real Reactivity

The Call Center Analytics project needed something Streamlit could not deliver cleanly: a five-page dashboard where filtering by date range on page one should update the agent leaderboard on page three, and where clicking a row in a table should drill into that agent's trend line.

That is callbacks, and callbacks are Dash's native model.

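from dash import Input, Output, State, callback
import plotly.express as px
import plotly.graph_objects as go

@callback(
    Output("agent-trend", "figure"),
    Input("agent-table", "active_cell"),
    # active_cell["row"] is relative to the rows on screen after sorting,
    # filtering, and paging, so read the viewport data rather than "data"
    State("agent-table", "derived_viewport_data"),
)
def update_agent_trend(active_cell, visible_rows):
    if not active_cell or not visible_rows:
        return go.Figure()
    agent = visible_rows[active_cell["row"]]["agent_name"]
    dff = df[df["agent_name"] == agent]  # df: the module-level calls DataFrame
    return px.line(dff, x="date", y="calls_resolved", title=f"{agent} -- Daily Resolution")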

The Input / Output / State model is explicit about data flow in a way that Streamlit's implicit re-run is not. For complex interactivity, that explicitness is a feature.

What Dash does well: reactive multi-page apps, the DataTable component (sortable, filterable, paginated, no extra libraries), cross-filtering between charts, and fine-grained control over what triggers what.

Where it gets awkward: deployment. Dash is a Flask server. You need to manage process lifecycle, reverse proxying if you want HTTPS, and there is no equivalent of Streamlit Cloud's one-click deploy. For internal tools or Docker-hosted projects it is fine. For public-facing demos that need to be live at a URL, Streamlit Cloud wins on friction.

One gotcha with Dash Bootstrap Components: dbc.themes.DARKLY sets a dark theme, but Plotly figures still need template="plotly_dark" and paper_bgcolor set manually. The Bootstrap theme does not propagate into the Plotly canvas.

Apache Superset: When Your Stakeholders Use the Dashboard

The Kenya Real Estate Pipeline and the Ecommerce Analytics project both ended up on Superset, and for the same reason: the expected audience was non-technical. Superset has a point-and-click chart builder. You do not need to write code to add a filter or change a chart type. A business analyst can use it without opening a terminal.

Setup is Docker Compose:

superset:
  image: apache/superset:4.1.1
  environment:
    SUPERSET_SECRET_KEY: "change-this-in-production"
  ports:
    - "8088:8088"

Then three commands on first run:

docker exec -it superset superset db upgrade
docker exec -it superset superset fab create-admin \
  --username admin --email admin@admin.com --password admin
docker exec -it superset superset init

Connecting Superset to DuckDB requires duckdb-engine installed in the Superset container, and the connection string has one critical detail: use ?read_only=true. Airflow writes to the DuckDB file while Superset reads it. Without the read-only flag you will hit file lock errors mid-query.

duckdb:////app/data/analytics.duckdb?read_only=true

What Superset does well: chart library is deep (30+ chart types including time-series, heatmaps, treemaps, geospatial), SQL Lab for ad-hoc queries, role-based access control, and it looks polished out of the box.

Where it gets awkward: it is heavy. Seven Docker services minimum. Initial build takes several minutes. The Explore page loses unsaved changes on refresh (no autosave). And the DuckDB integration is second-class compared to PostgreSQL -- time grains do not work with DuckDB's date_trunc, so you end up writing custom SQL for what should be a dropdown option.

For PostgreSQL backends, Superset is near-perfect. For DuckDB, use it with that read-only caveat and accept the limitations.

Evidence.dev: When the Story Is in the Data

Evidence.dev takes a different approach from everything else on this list. You write SQL query blocks directly in Markdown files, and the results become available as variables that feed components:

```sql debt_trend
SELECT year, govt_debt_pct_gdp, interest_pct_revenue
FROM gold.fiscal_summary
WHERE country = 'KEN'
ORDER BY year
```

<LineChart data={debt_trend} x="year" y="govt_debt_pct_gdp" />

Kenya's debt-to-GDP ratio reached **{debt_trend[debt_trend.length-1].govt_debt_pct_gdp.toFixed(1)}%** in {debt_trend[debt_trend.length-1].year}.

The key insight is that the narrative and the data live in the same file. You write around the numbers. This is the right model for analytical reports -- fiscal briefings, quarterly reviews, data quality documentation -- where the goal is communication, not exploration.

I used Evidence.dev on three projects: BizPulse Kenya (weekly sentiment briefing), Kenya Economic Pulse (macro indicator report), and LedgerSync (fiscal reconciliation audit report). All three had the same shape: a data engineering pipeline produced the numbers, and Evidence.dev turned those numbers into a readable document with charts embedded in the prose.

What Evidence.dev does well: the SQL-in-Markdown model is genuinely fast for static reports, Svelte under the hood means the output is a fast static site, and deploy to Vercel is one command.

The production gotcha. Evidence.dev 40.x has a broken dev server. Running npm run dev throws a lodash ESM error:

Error [ERR_REQUIRE_ESM]: require() of ES Module .../lodash-es/lodash.js

The fix is to skip the dev server entirely and use build-and-serve:

npm run build && npx serve build

This is not documented prominently. It took me longer to find than it should have. The dev server issue is a known regression in the 40.x line.

Where Evidence.dev gets awkward: interactivity is limited. Dropdown filters and date pickers exist, but anything complex requires writing Svelte components. If your users need to explore the data -- not just read a pre-built narrative -- use one of the other tools.

Grafana: When Your Data Is Already in Postgres and You Need Operational Monitoring

The Saruk Electronics Tracker pipeline writes daily price history to PostgreSQL. The natural choice for monitoring that kind of time-series operational data is Grafana. Every chart is a SQL query against the database. The panels auto-refresh. There is no Python to write.

SELECT
    DATE_TRUNC('day', scraped_at) AS time,   -- Grafana uses the column named "time" as the time axis
    category,
    AVG(price_kes) AS avg_price
FROM price_history
WHERE scraped_at BETWEEN $__timeFrom() AND $__timeTo()
GROUP BY 1, 2   -- group by the truncated day, not the raw timestamp
ORDER BY 1

The $__timeFrom() and $__timeTo() macros are Grafana's time range variables. The time range picker in the dashboard header drives them automatically.

The Grafana 13 PostgreSQL gotcha. This one cost me hours. Grafana 13 rewrote the PostgreSQL plugin as grafana-postgresql-datasource, and it now requires the database name in jsonData.database in addition to the top-level database field. The health check passes either way. The error only surfaces when a panel runs its first query:

You do not currently have a default database configured.

The fix is in your provisioning YAML:

datasources:
  - name: PostgreSQL
    type: grafana-postgresql-datasource
    uid: postgres-ds
    url: postgres:5432
    database: analytics
    jsonData:
      database: analytics   # this line is the fix
      sslmode: disable
      postgresVersion: 1500

The second gotcha: do not use template variable substitution for the datasource UID in provisioned dashboards. ${DS_POSTGRESQL} does not resolve correctly when dashboards are loaded from a provisioning directory. Hardcode the UID in every panel's datasource field instead.
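Hardcoding the UID by hand across many panels is error-prone, so a small script can do the substitution before the dashboards are copied into the provisioning directory. This sketch assumes each `datasource` field is an object with a `uid` key, as in recent Grafana exports, and reuses the `postgres-ds` UID from the provisioning YAML above; `pin_datasource_uid` is a hypothetical helper.

```python
import json

def pin_datasource_uid(dashboard: dict, uid: str,
                       placeholder: str = "${DS_POSTGRESQL}") -> dict:
    """Return a copy of the dashboard with placeholder datasource UIDs replaced."""
    def walk(node):
        if isinstance(node, dict):
            out = {key: walk(value) for key, value in node.items()}
            ds = out.get("datasource")
            if isinstance(ds, dict) and ds.get("uid") == placeholder:
                out["datasource"] = {**ds, "uid": uid}
            return out
        if isinstance(node, list):
            return [walk(item) for item in node]
        return node
    return walk(dashboard)

ds_ref = {"type": "grafana-postgresql-datasource", "uid": "${DS_POSTGRESQL}"}
dashboard = {
    "title": "Price tracker",
    "panels": [
        {
            "title": "Avg price",
            "datasource": dict(ds_ref),
            "targets": [{"datasource": dict(ds_ref), "rawSql": "SELECT 1"}],
        }
    ],
}

pinned = pin_datasource_uid(dashboard, "postgres-ds")
print(json.dumps(pinned, indent=2))
```

The function walks panels and their query targets alike, since both levels carry their own datasource reference.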

What Grafana does well: time-series visualization is its native language, alerting is built-in, the dashboard JSON is version-controllable, and it runs on almost nothing resource-wise compared to Superset.

Where it gets awkward: non-time-series reports feel forced. Grafana is optimized for "how is this metric behaving over time." For cross-sectional analysis, category breakdowns, or anything that looks like a report rather than a monitoring panel, the other tools are better.

The Decision Framework

After 12 projects, here is how I think about the choice:

Who uses the dashboard?

You (the engineer) or a technical teammate: Streamlit or Grafana
A business analyst who needs to build their own charts: Superset
An exec or stakeholder reading a report: Evidence.dev
Someone who needs complex cross-filtering: Dash

What does the underlying data look like?

Time-series operational data in PostgreSQL: Grafana
Analytical mart tables in DuckDB: Streamlit or Superset (read-only)
Aggregate report data: Evidence.dev
Complex relational data requiring SQL exploration: Superset SQL Lab

Where does it need to run?

Publicly accessible URL, no server to manage: Streamlit Cloud or Evidence.dev on Vercel
Internal tool, Docker Compose is fine: any of them
Embedded in another application: Dash (Flask) or Streamlit (embeddable via iframe)

How long do you have?

Under a day: Streamlit
A few days, complex interactivity required: Dash
A few days, non-technical audience: Superset
Writing a data narrative: Evidence.dev

What I Would Change

If I were starting over, I would reach for Streamlit by default earlier and stop second-guessing it. It handles 80% of dashboard requirements and deploys in minutes. The re-run model is a constraint, but it is a constraint you work around once and then forget.

I would also set up the hex_to_rgba helper and the Plotly 6 compatibility checks at the start of every project instead of discovering them mid-build. Most of the Plotly 6 changes are not loud: they silently produce wrong output. That is the worst kind of bug in a visualization layer.

Evidence.dev is underused in the data engineering community. If you are building an end-of-sprint data summary, a pipeline audit report, or any kind of structured analytical document, it is faster than any of the other tools for that use case. The SQL-in-Markdown model is genuinely good.

Grafana is the right choice exactly when you are already using PostgreSQL and you need time-series monitoring. Outside that narrow case, the ergonomics work against you.

Superset is the right choice when the audience is non-technical and you need a real chart builder. The Docker setup cost is real but one-time. After that, analysts can build their own views without bothering you.

The code patterns behind all of these -- caching strategies, Plotly 6 compatibility, the WB API direct-request pattern, DuckDB connection strings, Evidence.dev build workarounds, Grafana 13 provisioning -- are in my BI and Data Analysis cheatsheet along with the rest of my reference docs.

If you have questions about any of these tools or want to see the full pipeline code, the repos are all public on my GitHub. Follow me on dev.to for more articles from real data engineering projects.

Machine Learning for Data Engineers: The Patterns I Actually Used Across 7 Projects

De' Clerke — Fri, 05 Jun 2026 18:02:37 +0000

Data engineers are not supposed to be machine learning engineers. But at some point every serious DE pipeline ends with a question the data alone cannot answer, and you end up building a model.

Over the past six months I've shipped seven ML-driven projects: price prediction on used Japanese cars, health outcome modelling across 53 African countries, 109 time-series forecasts for 15 African development indicators, financial news sentiment analysis, semantic job search with vector embeddings, inflation forecasting for the East African Community, and crop yield projections for East Africa. None of them were data science projects in the traditional sense. They were data engineering projects where the final step was a model instead of a dashboard.

This article is about what the ML stack actually looks like when a data engineer builds it, what each tool is genuinely good for, and the specific gotchas I hit in production that the documentation does not warn you about.

The Core Stack

Seven projects, four primary tools:

XGBoost for tabular regression and classification
Facebook Prophet for time-series forecasting at scale
SHAP for model explainability
HuggingFace Transformers for NLP (FinBERT for financial sentiment, sentence-transformers for semantic search)

Supporting cast: scikit-learn for preprocessing and clustering, MLflow for experiment tracking, joblib for model persistence, Optuna for hyperparameter search, and pgvector when embeddings need to be queryable.

Everything runs locally or on a free tier. No OpenAI API keys, no cloud ML platforms. The entire stack is reproducible with uv pip install.

XGBoost: The Workhorse for Tabular Data

I used XGBoost in two projects with very different datasets and got strong results on both.

Japan Car Advisory -- 541 listings scraped from BE FORWARD and SBT Japan. 291 rows after filtering. XGBoost vs LightGBM vs Random Forest:

XGBoost: MAE = $3,706, R² = 0.722
LightGBM: slightly worse MAE on this dataset
Random Forest: R² = 0.68

On a dataset with 291 rows, XGBoost won. The margin was not huge, but it held across multiple random seeds. The model ended up as the champion, trained as part of an Airflow DAG that scraped, validated, trained, and logged to MLflow every week.

Africa Health ML -- 1,219 rows, 53 African countries, 23 years of World Bank data. Three models predicting health outcomes from public investment indicators:

Life Expectancy: MAE = 1.22 years, R² = 0.934
Under-5 Mortality: MAE = 9.36 per 1,000, R² = 0.885
Maternal Mortality: MAE = 47.6 per 100,000, R² = 0.945

R² above 0.88 on all three. That is not a data science achievement -- it is a data quality and feature selection achievement. The World Bank REST API provides clean, consistent historical data. If your features are right, XGBoost does the rest.

The Setup That Actually Works

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# Hold out the test set first, then carve a validation set out of the rest
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    early_stopping_rounds=50,   # find the real optimal n_estimators
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],  # early stopping watches validation data, never the test set
    verbose=False,
)

y_pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):,.2f}")
print(f"R²:  {r2_score(y_test, y_pred):.3f}")
print(f"Best iteration: {model.best_iteration}")

early_stopping_rounds=50 is non-negotiable on small datasets. Without it, XGBoost will train all 500 trees even if optimal performance was reached at tree 120. On the Japan Car dataset the best iteration landed around 180. You also get model.best_iteration for free; it is zero-based, so best_iteration + 1 becomes your n_estimators when you retrain the final model.

The Data Leakage Trap

The single most common mistake in any ML project:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: scaler sees the test set before training
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)             # test statistics leak into training
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

# CORRECT: scaler only sees training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)      # same transform, no fitting

XGBoost does not need scaling (tree-based models are scale-invariant), but the principle applies to any preprocessing step -- target encoding, imputation statistics, polynomial feature generation. Always fit on training data only.

SHAP: Explainability Is Not Optional

The Africa Health project had a policy simulator: a user could set public health expenditure values for a specific country and see the predicted impact on life expectancy. That feature only works if the model's logic is interpretable.

SHAP (SHapley Additive exPlanations) gives you per-prediction feature attribution that is mathematically grounded. TreeExplainer is the right choice for XGBoost and LightGBM -- it uses the tree structure directly instead of approximation.

import matplotlib.pyplot as plt
import shap

explainer   = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: feature importance across all test predictions
shap.summary_plot(shap_values, X_test, show=False)
plt.savefig("plots/shap_summary.png", bbox_inches="tight")

# Waterfall plot: why did this specific country get this prediction?
i = 0  # index of the prediction to explain
shap.waterfall_plot(
    shap.Explanation(
        values=shap_values[i],
        base_values=explainer.expected_value,
        data=X_test.iloc[i],
        feature_names=list(X_test.columns),
    ),
    show=False,
)

What the Africa Health SHAP analysis confirmed: health expenditure as a percentage of GDP was the strongest driver of life expectancy in the model, followed by access to clean water and sanitation. The dependence plot showed the relationship was non-linear -- the gains from increasing health expenditure were steep below 5% of GDP and flattened above 8%. That is the kind of insight a dashboard cannot surface on its own.

For Streamlit integration, cache the SHAP computation:

import shap
import streamlit as st

@st.cache_data
def get_shap_values(_model, X_test_df):
    explainer = shap.TreeExplainer(_model)
    sv = explainer.shap_values(X_test_df)
    return sv, explainer.expected_value

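The underscore prefix on _model tells Streamlit not to hash the model object (which is not hashable). This is a framework quirk: without the prefix, Streamlit tries to hash the argument and typically fails with an UnhashableParamError instead of caching the result.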

Prophet: Time-Series Forecasting at Scale

This is where the data engineer mindset directly applies to ML. Prophet is not hard to use for a single series. The engineering challenge is scaling it to dozens or hundreds of series without the loop collapsing on sparse data.

The Numbers

Kenya Crop Yield Forecaster: 29 Prophet models across 5 countries and 6 agricultural indicators
EAC Inflation Forecaster: 50 models across 5 countries and 10 macroeconomic indicators
Africa Development Trajectory Forecaster: 109 models across 15 countries and 8 development indicators

188 Prophet models total, all trained in automated loops, all feeding Streamlit dashboards with 4 tabs each.

The Loop Pattern

from prophet import Prophet
import pandas as pd
import logging

logging.getLogger("prophet").setLevel(logging.ERROR)
logging.getLogger("cmdstanpy").setLevel(logging.ERROR)

models    = {}
forecasts = {}

for country in countries:
    for indicator in indicators:
        key    = f"{country}_{indicator}"
        subset = (
            df[(df["country"] == country) & (df["indicator"] == indicator)]
            .rename(columns={"year": "ds", "value": "y"})
            .copy()
        )
        subset["ds"] = pd.to_datetime(subset["ds"].astype(str))

        # Too few points and Prophet fits noise; 10 annual observations is the floor used here
        # Africa Dev: skipped Ethiopia/Adult Literacy and Nigeria/Trade Openness
        if len(subset) < 10:
            print(f"Skipping {key}: only {len(subset)} obs")
            continue

        m = Prophet(
            yearly_seasonality=True,
            weekly_seasonality=False,
            daily_seasonality=False,
            changepoint_prior_scale=0.1,
        )
        m.fit(subset)

        future = m.make_future_dataframe(periods=10, freq="YE")
        fc     = m.predict(future)

        models[key]    = m
        forecasts[key] = fc

print(f"Trained {len(models)} Prophet models")

Suppress the logging on lines 5-6. Prophet prints Stan convergence diagnostics on every fit. In a loop of 120 iterations that is thousands of lines of noise that obscure real errors.

The Sparse Series Problem

Prophet will fit a model on any dataset with at least 2 rows. It will not warn you that the result is meaningless. On the Africa Dev project, Ethiopia's adult literacy data had 6 observations. Prophet fit it, produced confident-looking 10-year forecasts with narrow confidence intervals, and the forecasts were nonsense.

The fix is the len(subset) < 10 guard. I settled on 10 as the minimum after testing. With annual data each year contributes a single point, so below roughly 10 observations Prophet has too little history to separate trend from noise, and the forecast extrapolates from noise rather than signal. You will know you hit this when the forecast shows a perfectly straight line with zero seasonality.
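
A small variation on the guard is to record what was skipped instead of only printing it, so the dashboard can say which series are missing and why. This is a sketch under assumed names: split_by_history and the example counts are hypothetical, and only the Ethiopia figure of 6 observations comes from the text above.

```python
MIN_OBS = 10  # same threshold as the guard in the loop


def split_by_history(series_lengths, min_obs=MIN_OBS):
    """Partition series keys into those worth fitting and those to skip."""
    to_fit, skipped = [], {}
    for key, n_obs in series_lengths.items():
        if n_obs < min_obs:
            skipped[key] = n_obs        # keep the count for reporting
        else:
            to_fit.append(key)
    return to_fit, skipped


# Hypothetical observation counts per country/indicator pair.
lengths = {
    "Ethiopia_adult_literacy": 6,
    "Kenya_gdp_growth": 24,
    "Uganda_inflation": 10,
}
to_fit, skipped = split_by_history(lengths)
print(to_fit)
print(skipped)
```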

The Frequency Trap

Annual data must use freq="YE" (year-end) or freq="YS" (year-start). Not "Y". Pandas deprecated bare "Y" in version 2.2, and make_future_dataframe passes the value straight to pd.date_range. What you see is not an error but a generic pandas FutureWarning that does not mention Prophet at all. If the history is stamped on 1 January, as in the loop above, "YS" keeps the forecast dates aligned with it.

# WRONG
future = model.make_future_dataframe(periods=10, freq="Y")

# CORRECT
future = model.make_future_dataframe(periods=10, freq="YE")

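A related trap: the frame returned by make_future_dataframe includes all historical dates, so the prediction covers history plus forecast. To isolate the actual forecast rows (prophet_df here is the frame the model was fitted on):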

last_historical = prophet_df["ds"].max()
forecast_only   = fc[fc["ds"] > last_historical]

Separating Forecast from Historical in Plotly Charts

The Streamlit dashboards use Plotly rather than Prophet's built-in matplotlib plots. The pattern for a forecast chart with shaded confidence intervals:

import plotly.graph_objects as go
import pandas as pd

hist_df = fc[fc["ds"] <= prophet_df["ds"].max()]
fc_df   = fc[fc["ds"] >  prophet_df["ds"].max()]

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=prophet_df["ds"], y=prophet_df["y"],
    mode="lines+markers", name="Historical",
))
fig.add_trace(go.Scatter(
    x=fc_df["ds"], y=fc_df["yhat"],
    mode="lines", name="Forecast",
))
fig.add_trace(go.Scatter(
    x=pd.concat([fc_df["ds"], fc_df["ds"][::-1]]),
    y=pd.concat([fc_df["yhat_upper"], fc_df["yhat_lower"][::-1]]),
    fill="toself",
    fillcolor="rgba(99,110,250,0.15)",
    line=dict(color="rgba(255,255,255,0)"),
    name="80% CI",
))

One Plotly 6 gotcha that appeared across multiple projects: showlegend raises an error if you pass a numpy.bool_ instead of a Python bool. Comparisons and reductions on pandas or numpy objects usually return numpy.bool_, not bool. The fix:

show = bool(some_condition)
fig.add_trace(go.Scatter(..., showlegend=show))

NLP for Data Engineers: FinBERT and Embeddings

Two different NLP patterns -- one for sentiment, one for semantic search.

FinBERT for Financial Sentiment (BizPulse Kenya)

BizPulse Kenya classified Kenyan business and financial news articles into positive, negative, and neutral sentiment using an ensemble of three models: FinBERT (financial domain BERT), VADER (rule-based, no GPU needed), and TextBlob (general polarity).

from transformers import pipeline
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import torch

device = 0 if torch.cuda.is_available() else -1

finbert = pipeline(
    "sentiment-analysis",
    model="ProsusAI/finbert",
    device=device,
    truncation=True,   # CRITICAL: BERT has a 512-token hard limit
    max_length=512,
)

analyzer = SentimentIntensityAnalyzer()

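from collections import Counter

def to_label(score: float, threshold: float = 0.05) -> str:
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

def classify(text: str) -> str:
    fb    = finbert(text)[0]["label"]
    vader = to_label(analyzer.polarity_scores(text)["compound"])   # score computed once
    tb    = to_label(TextBlob(text).sentiment.polarity)

    label, count = Counter([fb, vader, tb]).most_common(1)[0]
    # Three-way split: max(set(votes), ...) picked an arbitrary label here.
    # Falling back to FinBERT is an assumption; choose the tie-break you trust.
    return label if count > 1 else fb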

The truncation=True argument is not optional. FinBERT crashes with a cryptic index error on any text exceeding 512 tokens without it. Most financial news articles are within 512 tokens, but earnings reports and government gazettes are not.

FinBERT labels are lowercase (positive, negative, neutral) -- not the POSITIVE/NEGATIVE format used by general BERT sentiment models. This mismatch will silently break any code that does if label == "POSITIVE".
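
One way to make that mismatch loud instead of silent is to normalize every label at the point where it enters your code. The helper below is a hypothetical sketch, not part of BizPulse Kenya; it assumes the three-class scheme described above and rejects anything else.

```python
CANONICAL = {"positive", "negative", "neutral"}


def normalize_label(raw):
    """Lowercase and validate a sentiment label from any model."""
    label = str(raw).strip().lower()
    if label not in CANONICAL:
        # Fail fast rather than letting an unknown label fall through an if/else
        raise ValueError(f"Unexpected sentiment label: {raw!r}")
    return label


print(normalize_label("POSITIVE"))
print(normalize_label(" neutral "))
```

With this in place, comparisons are always against the lowercase form, and a model that emits something like LABEL_0 raises immediately.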

Sentence Transformers and pgvector (JobSense)

JobSense indexed 604 job listings as vector embeddings in PostgreSQL using pgvector, enabling semantic search rather than keyword matching.

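import os

from sentence_transformers import SentenceTransformer
import psycopg2
from pgvector.psycopg2 import register_vector

# Connection string comes from the environment (DATABASE_URL is a placeholder name)
conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn)   # registers the vector type with this connection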

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions, runs locally

embeddings = model.encode(job_descriptions, normalize_embeddings=True)

# Semantic search using cosine distance
def search(query: str, top_k: int = 10) -> list:
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    with conn.cursor() as cur:
        cur.execute("""
            SELECT title, company, source,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM job_embeddings
            JOIN jobs USING (job_id)
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (q_emb.tolist(), q_emb.tolist(), top_k))
        return cur.fetchall()

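normalize_embeddings=True scales every vector to unit length, so cosine similarity equals the dot product. The <=> operator (cosine distance) works either way; with unit vectors you can also use <#> (negative inner product), which returns the same ranking with less arithmetic. Use IVFFlat indexing for datasets up to ~100K rows; HNSW for anything larger or where recall matters more than index build time.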
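
The equivalence is easy to check by hand. This is a small pure-Python sketch with two made-up 2-dimensional vectors; real embeddings from all-MiniLM-L6-v2 have 384 dimensions, but the arithmetic is the same.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_n, b_n = normalize(a), normalize(b)

# After normalization the plain dot product is the cosine similarity.
print(cosine_similarity(a, b), dot(a_n, b_n))
```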

The Airflow Integration Pattern

In Japan Car Advisory, the ML training step was a task in an Airflow DAG:

scrape_beforward + scrape_sbt (parallel) → validate → train → log_to_mlflow

The key design decision: pass file paths through XCom, not model objects.

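import joblib
import pandas as pd
import xgboost as xgb
from airflow.decorators import task
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

@task
def train_model(validated_path: str) -> dict:
    df = pd.read_parquet(validated_path)
    X, y = preprocess(df)   # project-specific helper, not shown here
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.05)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)   # needed by the metrics below

    metrics = {
        "mae": float(mean_absolute_error(y_test, y_pred)),
        "r2":  float(r2_score(y_test, y_pred)),
    }
    joblib.dump(model, "/opt/airflow/models/xgb_latest.pkl")
    return metrics   # small dict via XCom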

@task
def log_to_mlflow(metrics: dict):
    with mlflow.start_run():
        mlflow.log_metrics(metrics)
        mlflow.log_artifact("/opt/airflow/models/xgb_latest.pkl")

XCom is for metadata. Models are binary blobs -- write them to a shared volume and pass the path. Trying to serialise a 10 MB XGBoost model through XCom will either fail silently or hit the database row size limit.
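
To see the size difference, the sketch below writes a stand-in "model" to disk and builds the small payload that would travel through XCom. It uses only the standard library; save_artifact, the file name, and the 1 MB fake model are assumptions for illustration, not the Japan Car Advisory code.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_artifact(obj, directory, name="model_latest.pkl"):
    """Write the object to disk and return a small, JSON-safe payload."""
    path = Path(directory) / name
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return {"model_path": str(path), "size_bytes": path.stat().st_size}


# Stand-in for a trained model: roughly 1 MB of bytes.
fake_model = {"weights": bytes(1_000_000)}

with tempfile.TemporaryDirectory() as tmp:
    payload = save_artifact(fake_model, tmp)
    xcom_size = len(json.dumps(payload))     # what would be stored in XCom
    restored = pickle.loads(Path(payload["model_path"]).read_bytes())

print(xcom_size, payload["size_bytes"])
```

The payload is a few hundred bytes at most while the artifact is about a megabyte, which is the whole argument for passing the path.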

The Checklist I Run Before Every Model Goes Live

Seven projects in, this is what I verify before any model result goes into a dashboard or gets pushed to GitHub:

1. No data leakage. Scaler and encoder fitted only on X_train. Target-based aggregations computed on training fold only if using cross-validation.

2. Metrics on held-out test, not training. A model with R² = 0.99 on training and R² = 0.72 on test is not a good model.

3. Scale before K-Means. On the Africa Dev project, K-Means without StandardScaler produced clusters dominated entirely by GDP (billions of dollars vs. ratios between 0 and 1). Always scaler.fit_transform(X) before KMeans.fit(X_scaled).

4. Guard sparse Prophet series. Any loop over Prophet models needs if len(subset) < 10: continue. The model will fit silently on 3 observations and produce confident-looking nonsense.

5. Log with MLflow. Even for quick experiments. Reproducing "what were the hyperparameters on the model that got R²=0.88" without MLflow means re-running the full training loop.

6. SHAP before shipping. If you cannot explain why the model made a prediction, you cannot defend it to anyone who asks. TreeExplainer on XGBoost/LightGBM takes seconds. There is no reason to skip it.

What I Would Do Differently

Use Optuna earlier. On the Japan Car project I tuned XGBoost with GridSearchCV. The search space was small enough that it worked, but Optuna's Bayesian optimisation consistently finds better hyperparameters in fewer trials. It should be the default now.

Cache Prophet models. Fitting 109 Prophet models takes 4-8 minutes depending on the machine. In a Streamlit app, this has to happen at startup. The pattern I settled on: compute at import time and store in a module-level dictionary, protected by @st.cache_resource. It works but it is fragile. A proper solution would pre-compute and serialise all models at pipeline time and load them from joblib files at app start.

Use MLflow earlier in the Prophet projects. The time-series projects tracked metrics manually (storing MAE in CSV files). MLflow would have made the experiment comparison much cleaner.

Conclusion

ML as a data engineer looks different from ML as a data scientist. You are not exploring in notebooks -- you are building pipelines that run on a schedule, produce reproducible results, and feed into dashboards that non-technical stakeholders will use to make decisions.

The tools that work for this are not glamorous: XGBoost for tabular data, Prophet for time series, SHAP for explainability, and a straightforward preprocessing pipeline from scikit-learn. They are boring in the best sense. They are predictable, well-documented, fast to iterate on, and genuinely good at what they do.

188 Prophet models, three XGBoost regressors with R² above 0.88, 604 job embeddings queryable in milliseconds, financial sentiment across Kenyan business news. All built in Python, all running locally, all integrated into Airflow pipelines and Streamlit dashboards. You do not need to be a data scientist to ship ML in production. You need to know which tools to reach for and which gotchas to watch out for.

The full cheatsheet with every code pattern referenced in this article is in my GitHub repository.

Follow for more data engineering from production experience.

Built across Japan Car Advisory, Africa Health ML, Kenya Crop Yield Forecaster, EAC Inflation Forecaster, Africa Development Trajectory Forecaster, BizPulse Kenya, and JobSense -- all pushed to GitHub with full source code.

FastAPI for Data Engineers: Building, Testing, and Debugging APIs That Don't Lie to You

De' Clerke — Tue, 02 Jun 2026 22:24:08 +0000

The JobSense project needed a FastAPI backend that served 604 job embeddings via semantic search, a Pydantic validation layer that stopped bad data before it reached pgvector, and a test suite that could be run without a live Ollama instance. Getting all three right took more time than the pipeline itself.

This article is the guide I wish I had then. It covers FastAPI setup for data engineering use cases, the Pydantic patterns that actually prevent bad data at the boundary, consuming external APIs without silent failures, testing patterns that catch real bugs, debugging the most common FastAPI errors, and the production patterns that most tutorials skip.

What FastAPI Is and Is Not in a Data Stack

Before building anything: FastAPI is a system boundary tool. It is not a scheduler, not a data processor, and not a database.

Use FastAPI for	Use something else for
Ingestion endpoint (receive events, files, JSON)	Orchestration: use Airflow, Dagster, Prefect
Serving processed data to dashboards	Heavy transformation: use pandas, DuckDB, Spark
Triggering pipeline runs via HTTP	Real-time streaming: use Kafka, Flink
Health and metadata endpoints	Batch processing: use a DAG task, not an endpoint
Feature serving (ML embeddings, predictions)	Message queuing: use SQS, RabbitMQ

The most common mistake I see in portfolio projects is using FastAPI where a dbt model and a BI tool would do the job in a third of the code. FastAPI belongs at the edges of your system where external clients need to push data in or pull data out.

App Setup

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(
    title="JobSense API",
    description="Kenyan jobs semantic search",
    version="1.0.0",
    debug=True,           # detailed error messages in dev — disable in prod
)

# CORS — required when a Streamlit or React frontend calls FastAPI
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:8501", "http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

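# Lifecycle events: connect DB and warm caches at startup.
# on_event still works but is deprecated in current FastAPI; new code should use a lifespan handler.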
@app.on_event("startup")
async def on_startup():
    await database.connect()

@app.on_event("shutdown")
async def on_shutdown():
    await database.disconnect()

API versioning from day one

If your API will have external consumers, prefix routes with a version. It costs nothing now and avoids breaking changes later.

from fastapi import APIRouter

v1 = APIRouter(prefix="/api/v1", tags=["v1"])
v2 = APIRouter(prefix="/api/v2", tags=["v2"])  # Future — add when needed

@v1.get("/jobs")
def list_jobs_v1():
    return []

app.include_router(v1)

The alternative — changing /api/jobs to a different response shape after clients depend on it — is a breaking change that requires coordination. Versioning upfront avoids this.

Development server

uvicorn main:app --reload               # dev: auto-reload on save
uvicorn main:app --host 0.0.0.0 --port 8000  # expose to network
uvicorn main:app --workers 4            # production: multiple workers

Interactive docs auto-generated at http://localhost:8000/docs (Swagger) and /redoc (ReDoc). Disable them in production:

app = FastAPI(docs_url=None, redoc_url=None, openapi_url=None)  # openapi_url=None also hides /openapi.json

Pydantic: The Data Contract at the Boundary

Pydantic models are the most important part of a FastAPI data engineering setup. They are the point where your pipeline says "this is the shape data must have to enter my system." Everything downstream assumes this contract was enforced.

from pydantic import BaseModel, Field, field_validator, model_validator
from typing import Optional
from datetime import datetime
from enum import Enum

class JobSource(str, Enum):
    brightermonday = "brightermonday"
    linkedin       = "linkedin"
    jobwebkenya    = "jobwebkenya"

class JobSchema(BaseModel):
    title:      str            = Field(..., min_length=2, max_length=200)
    company:    str            = Field(..., min_length=1)
    salary_min: Optional[float] = Field(None, ge=0)
    salary_max: Optional[float] = Field(None, ge=0)
    source:     JobSource
    posted_at:  Optional[datetime] = None

    @field_validator("title")
    @classmethod
    def strip_title(cls, v: str) -> str:
        return v.strip()

    @model_validator(mode="after")
    def salary_order(self):
        # Compare against None explicitly: a plain truthiness check skips 0
        if self.salary_min is not None and self.salary_max is not None:
            if self.salary_min > self.salary_max:
                raise ValueError("salary_min must be <= salary_max")
        return self

When validation fails, FastAPI returns a 422 with the exact field and reason. That is more useful than the silent data corruption you get when you skip validation.

Idempotency keys for ingestion endpoints

Production ingestion APIs add an idempotency key requirement. If the client retries a failed POST, you need to recognize the duplicate and return the same result rather than inserting twice.

import hashlib
from fastapi import Header, HTTPException
from typing import Optional

@app.post("/api/v1/events", status_code=201)
def ingest_event(
    event: EventSchema,
    x_idempotency_key: Optional[str] = Header(None),
):
    if x_idempotency_key:
        # Check if we already processed this key
        existing = event_repo.find_by_idempotency_key(x_idempotency_key)
        if existing:
            return existing  # Return previous result, no re-insert

    result = event_repo.create(event, idempotency_key=x_idempotency_key)
    return result

Without this pattern, a client that retries after a network timeout (which received no response but the insert succeeded) creates a duplicate. This is how pipelines end up with double-counted revenue.
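
The endpoint above leans on an event_repo that is not shown. As a rough sketch of what it has to do, here is a hypothetical in-memory version, plus one possible fallback that derives a key from the payload when the client sends none. The class, function names, and the fallback itself are assumptions for illustration; a real repository would sit on a database table with a unique constraint on the key.

```python
import hashlib
import json


class InMemoryEventRepo:
    """Hypothetical stand-in for event_repo, backed by a dict."""

    def __init__(self):
        self._by_key = {}
        self.inserts = 0

    def find_by_idempotency_key(self, key):
        return self._by_key.get(key)

    def create(self, event, idempotency_key):
        self.inserts += 1
        record = {"id": self.inserts, **event}
        self._by_key[idempotency_key] = record
        return record


def payload_key(event):
    """Derive a stable key from the payload (key order does not matter)."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(repo, event, key=None):
    key = key or payload_key(event)
    existing = repo.find_by_idempotency_key(key)
    if existing is not None:
        return existing                      # duplicate: no second insert
    return repo.create(event, idempotency_key=key)


repo = InMemoryEventRepo()
event = {"type": "sale", "amount": 100}
first = ingest(repo, event, key="abc-123")
retry = ingest(repo, event, key="abc-123")   # simulated client retry
print(first == retry, repo.inserts)
```

One caveat on the payload-hash fallback: two genuinely separate events with identical content would be treated as one, so a client-supplied key is the safer default.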

Common HTTP status codes

200 OK            — successful GET/PUT
201 Created       — successful POST
204 No Content    — successful DELETE
400 Bad Request   — client sent invalid data (use this for your own validation logic)
401 Unauthorized  — missing or invalid credentials
403 Forbidden     — authenticated but not permitted to access this resource
404 Not Found     — resource does not exist
422 Unprocessable — Pydantic validation failed (FastAPI default for bad body)
429 Too Many      — rate limited (from you or from upstream)
500 Server Error  — unhandled exception in your code
503 Unavailable   — your service is up but a dependency (DB) is down

Dependency Injection

Dependency injection lets you share resources (database sessions, auth checks, config) across route handlers without passing them around manually.

from fastapi import Depends, HTTPException
from sqlalchemy.orm import Session

def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

def get_job_or_404(job_id: int, db: Session = Depends(get_db)) -> JobModel:
    obj = db.get(JobModel, job_id)
    if not obj:
        raise HTTPException(status_code=404, detail=f"Job {job_id} not found")
    return obj

@app.get("/api/v1/jobs/{job_id}", response_model=JobResponse)
def get_job(job: JobModel = Depends(get_job_or_404)):
    return job

API key authentication

from fastapi.security import APIKeyHeader
import os

api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(key: str = Depends(api_key_header)):
    if key != os.getenv("API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")
    return key

@app.get("/admin/stats", dependencies=[Depends(verify_api_key)])
def admin_stats():
    return {"total_jobs": 604}
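
One hardening step worth considering for the key check: compare in constant time and refuse everything when no key is configured. The helper below is a sketch using only the standard library; api_key_matches is a hypothetical name, and wiring it into verify_api_key is left to the reader.

```python
import secrets


def api_key_matches(provided, expected):
    """Constant-time comparison; rejects everything if no key is configured."""
    if not expected or provided is None:
        return False
    return secrets.compare_digest(
        provided.encode("utf-8"), expected.encode("utf-8")
    )


print(api_key_matches("test-key-123", "test-key-123"))
print(api_key_matches("test-key-123", None))
```

secrets.compare_digest avoids leaking how many leading characters matched through response timing, which a plain != comparison does not guarantee.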

JWT authentication for multi-user APIs

API keys work for service-to-service auth. For user-facing APIs with multiple roles, use JWT.

import os
from datetime import datetime, timedelta, timezone

from fastapi import Depends, HTTPException
from fastapi.security import OAuth2PasswordBearer
from jose import JWTError, jwt

SECRET_KEY = os.getenv("JWT_SECRET_KEY")
ALGORITHM  = "HS256"

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="/auth/token")

def create_access_token(data: dict, expires_delta: timedelta = timedelta(hours=1)):
    payload = data.copy()
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    payload["exp"] = datetime.now(timezone.utc) + expires_delta
    return jwt.encode(payload, SECRET_KEY, algorithm=ALGORITHM)

def get_current_user(token: str = Depends(oauth2_scheme)):
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
        user_id = payload.get("sub")
        if user_id is None:
            raise HTTPException(status_code=401, detail="Invalid token")
        return user_id
    except JWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

@app.get("/api/v1/profile")
def get_profile(user_id: str = Depends(get_current_user)):
    return {"user_id": user_id}

Install python-jose[cryptography] for the JWT library.

Consuming External APIs Without Silent Failures

Every external API call is a failure point. The pattern that works in all my projects: a session with a retry adapter, explicit timeout, structured error handling, and logging that tells you exactly what failed and why.

requests: the complete setup

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import logging
import os

log = logging.getLogger(__name__)

def build_session() -> requests.Session:
    session = requests.Session()
    session.headers.update({
        "User-Agent": "DataPipeline/1.0",
        "Accept": "application/json",
    })
    retry = Retry(
        total=3,
        backoff_factor=2,             # exponential backoff; urllib3 retries once immediately, then waits 4s, 8s
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "POST"],   # only retry POST if the endpoint is idempotent
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://",  HTTPAdapter(max_retries=retry))
    return session

SESSION = build_session()

Error handling that tells you what actually happened

from requests.exceptions import HTTPError, ConnectionError, Timeout, JSONDecodeError

def fetch_eia_prices(fuel_type: str) -> list[dict]:
    url = "https://api.eia.gov/v2/petroleum/pri/gnd/data/"
    params = {
        "api_key":    os.getenv("EIA_API_KEY"),
        "frequency":  "monthly",
        "data[0]":    "value",
        "facets[product][]": fuel_type,
    }
    try:
        r = SESSION.get(url, params=params, timeout=15)
        r.raise_for_status()
        return r.json()["response"]["data"]
    except HTTPError as e:
        log.error(f"EIA API HTTP {e.response.status_code}: {e.response.text[:200]}")
        raise
    except Timeout:
        # Checked before ConnectionError because ConnectTimeout subclasses both
        log.error("EIA API timeout after 15s")
        raise
    except ConnectionError:
        log.error("EIA API unreachable: check network or service status")
        raise
    except (JSONDecodeError, KeyError) as e:
        log.error(f"EIA API response parse error: {e}")
        raise

Never call r.json() without catching JSONDecodeError. When an API returns a 200 with an HTML error page (maintenance mode, Cloudflare challenge), .json() raises an exception with a confusing message. Catch it explicitly.
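
A complementary guard is to look at the Content-Type before parsing, so an HTML maintenance page produces a readable error instead of a decode failure. The function below is a standalone sketch with a hypothetical name; it takes the header and body as plain strings so it can be tried without a network call.

```python
import json


def parse_json_body(content_type, text):
    """Return parsed JSON, or raise ValueError with a readable message."""
    if "json" not in (content_type or "").lower():
        # Typical for maintenance pages and bot challenges served with 200
        raise ValueError(f"Expected JSON, got {content_type!r}: {text[:80]!r}")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Body is not valid JSON: {exc.msg}") from exc


print(parse_json_body("application/json; charset=utf-8", '{"ok": true}'))
```

With requests, the two arguments would be r.headers.get("Content-Type") and r.text.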

Pagination patterns

Offset/limit (most REST APIs):

import time

def fetch_all_pages(base_url: str, params: dict, page_size: int = 100) -> list[dict]:
    all_results = []
    offset = 0
    while True:
        # Build a fresh dict per page so the caller's params are not mutated
        page_params = {**params, "limit": page_size, "offset": offset}
        r = SESSION.get(base_url, params=page_params, timeout=15)
        r.raise_for_status()
        data = r.json()

        # APIs use different response shapes: a bare list, or a dict wrapper.
        # Check for the list first, because a list has no .get().
        if isinstance(data, list):
            items = data
        else:
            items = data.get("results") or data.get("data") or []
        if not items:
            break
        all_results.extend(items)
        offset += len(items)
        if len(items) < page_size:
            break        # reached last page
        time.sleep(0.5)  # polite delay
    log.info(f"Fetched {len(all_results)} total records from {base_url}")
    return all_results

Cursor-based pagination:

def fetch_cursor_pages(base_url: str) -> list[dict]:
    all_results = []
    cursor = None
    while True:
        params = {"cursor": cursor} if cursor else {}
        r = SESSION.get(base_url, params=params, timeout=15)
        r.raise_for_status()   # fail loudly instead of parsing an error body
        data = r.json()
        all_results.extend(data["items"])
        cursor = data.get("next_cursor")
        if not cursor:
            break
    return all_results
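
Cursor loops have one failure mode the version above does not cover: an API that keeps returning a cursor forever. A sketch of the same idea as a generator with a page cap follows. The fetch function is injected so it can be exercised against a fake; iter_cursor_pages, max_pages, and the three-page fake are all assumptions made for this example.

```python
def iter_cursor_pages(fetch_page, max_pages=1000):
    """Yield items from a cursor-paginated source.

    fetch_page(cursor) must return a dict with "items" and an optional
    "next_cursor"; max_pages stops a misbehaving API from looping forever.
    """
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            return
    raise RuntimeError(f"Stopped after {max_pages} pages; cursor never ended")


# Fake three-page API used only for this illustration.
PAGES = {
    None: {"items": [1, 2], "next_cursor": "p2"},
    "p2": {"items": [3, 4], "next_cursor": "p3"},
    "p3": {"items": [5]},
}
items = list(iter_cursor_pages(PAGES.__getitem__))
print(items)
```

In a real pipeline fetch_page would wrap the SESSION.get call, and the generator lets you stream records into a loader without holding every page in memory.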

Handling 429: rate limit responses

import time

def request_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    delay = 1
    for attempt in range(max_retries):
        r = SESSION.get(url, timeout=15)
        if r.status_code == 429:
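            # Respect the Retry-After header if the API sends one. It may be an
            # HTTP date rather than seconds, so fall back to our own delay then.
            # Note: if SESSION's Retry adapter also lists 429, it retries first and
            # raises RetryError when exhausted, so this branch never runs.
            retry_after = r.headers.get("Retry-After", "")
            wait = int(retry_after) if retry_after.isdigit() else delay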
            log.warning(f"Rate limited (attempt {attempt + 1}/{max_retries}). Waiting {wait}s")
            time.sleep(wait)
            delay = min(delay * 2, 60)
            continue
        r.raise_for_status()
        return r
    raise RuntimeError(f"Exceeded {max_retries} retries for {url}")
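
Retry-After comes in two forms: a number of seconds or an HTTP date. A parser that handles both, using only the standard library, could look like the sketch below. The function name and the injected now argument are assumptions to keep the example testable.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default, now=None):
    """Convert a Retry-After header to a non-negative wait in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)                  # delta-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default                       # unparseable: use our own delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120", default=1))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", default=1, now=fixed_now))
```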

httpx for async pipelines

Use httpx.AsyncClient inside FastAPI async routes or asyncio-based pipelines. For the Ollama embedding calls in JobSense:

import httpx
import asyncio

async def fetch_embedding(text: str) -> list[float]:
    async with httpx.AsyncClient(timeout=30) as client:
        r = await client.post(
            "http://localhost:11434/api/embeddings",
            json={"model": "nomic-embed-text", "prompt": text},
        )
        r.raise_for_status()
        return r.json()["embedding"]

# Fetch many embeddings concurrently
async def fetch_all_embeddings(texts: list[str]) -> list[list[float]]:
    async with httpx.AsyncClient(timeout=30) as client:
        tasks = [
            client.post(
                "http://localhost:11434/api/embeddings",
                json={"model": "nomic-embed-text", "prompt": t},
            )
            for t in texts
        ]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        results = []
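        for r in responses:
            if isinstance(r, Exception):
                log.error(f"Embedding fetch failed: {r}")
                results.append([])
            elif r.status_code != 200:
                # gather() returns the response as-is; nobody called raise_for_status
                log.error(f"Embedding fetch returned HTTP {r.status_code}")
                results.append([])
            else:
                results.append(r.json()["embedding"])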
        return results
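
Firing every request at once can overwhelm a local embedding server when the list is long. A common remedy is to cap the number of calls in flight with a semaphore. The sketch below shows the pattern with a fake coroutine in place of the HTTP call; gather_bounded, fake_embed, and the limit of 2 are illustrative assumptions, not the JobSense implementation.

```python
import asyncio


async def gather_bounded(coro_fn, inputs, limit=8):
    """Run coro_fn over inputs with at most `limit` calls in flight.

    Failures come back in place as exception objects, preserving order.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await coro_fn(item)

    return await asyncio.gather(
        *(run_one(i) for i in inputs), return_exceptions=True
    )


# Fake embedding call standing in for the HTTP request.
state = {"in_flight": 0, "peak": 0}


async def fake_embed(text):
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)
    state["in_flight"] -= 1
    if not text:
        raise ValueError("empty text")
    return [float(len(text))]


results = asyncio.run(gather_bounded(fake_embed, ["a", "bb", "", "dddd"], limit=2))
print(results, state["peak"])
```

The right limit depends on the server; it is worth starting low and raising it while watching latency.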

Quick Manual Testing with curl

Before writing a test, reach for curl to verify the endpoint works at all.

# Basic GET
curl http://localhost:8000/api/v1/jobs
curl -s http://localhost:8000/api/v1/jobs | python3 -m json.tool  # pretty print

# GET with query params and auth
curl -H "X-API-Key: abc123" \
     "http://localhost:8000/api/v1/jobs?keyword=data+engineer&limit=10"

# POST with JSON body
curl -X POST http://localhost:8000/api/v1/jobs \
     -H "Content-Type: application/json" \
     -d '{"title": "Data Engineer", "company": "Safaricom", "source": "linkedin"}'

# POST from a file
curl -X POST http://localhost:8000/api/v1/jobs \
     -H "Content-Type: application/json" \
     -d @payload.json

# Verbose: show request and response headers
curl -v http://localhost:8000/api/v1/jobs

# Status code only
curl -o /dev/null -s -w "%{http_code}\n" http://localhost:8000/api/v1/jobs

# Test all services at once
for port in 8000 8080 8501; do
  echo -n ":$port → "
  curl -s --max-time 3 http://localhost:$port/health || echo "DOWN"
done

Testing FastAPI Endpoints

The conftest.py pattern

# tests/conftest.py
import pytest
from fastapi.testclient import TestClient
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from app.main import app
from app.database import get_db, Base

TEST_DB_URL = "postgresql+psycopg2://user:pass@localhost:5432/test_jobsense"

@pytest.fixture(scope="session")
def test_engine():
    engine = create_engine(TEST_DB_URL)
    Base.metadata.create_all(bind=engine)
    yield engine
    Base.metadata.drop_all(bind=engine)

@pytest.fixture
def db_session(test_engine):
    Session = sessionmaker(bind=test_engine)
    session = Session()
    yield session
    session.rollback()   # undo every test's changes
    session.close()

@pytest.fixture
def client(db_session):
    def override_get_db():
        yield db_session
    app.dependency_overrides[get_db] = override_get_db
    with TestClient(app) as c:
        yield c
    app.dependency_overrides.clear()

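The session.rollback() in the fixture is critical. Without it, data written by one test leaks into the next, causing flaky tests that pass in isolation but fail in sequence. One caveat: rollback only undoes uncommitted work. If your route handlers call db.commit(), bind the test session to a connection with an outer transaction and roll that back instead.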

Tests that actually catch bugs

# tests/test_jobs.py
class TestJobsEndpoint:

    def test_list_jobs_200(self, client):
        r = client.get("/api/v1/jobs")
        assert r.status_code == 200
        assert isinstance(r.json(), list)

    def test_create_job_201(self, client):
        payload = {"title": "Data Engineer", "company": "Safaricom", "source": "linkedin"}
        r = client.post("/api/v1/jobs", json=payload)
        assert r.status_code == 201
        assert r.json()["title"] == "Data Engineer"

    def test_missing_required_field_422(self, client):
        # Test that Pydantic validation catches missing company
        r = client.post("/api/v1/jobs", json={"title": "No company"})
        assert r.status_code == 422
        errors = r.json()["detail"]
        assert any("company" in str(e) for e in errors)

    def test_invalid_enum_422(self, client):
        payload = {"title": "DE", "company": "X", "source": "FAKE_SOURCE"}
        r = client.post("/api/v1/jobs", json=payload)
        assert r.status_code == 422

    def test_salary_validation(self, client):
        # salary_min > salary_max should fail
        payload = {
            "title": "DE", "company": "X", "source": "linkedin",
            "salary_min": 200_000, "salary_max": 100_000
        }
        r = client.post("/api/v1/jobs", json=payload)
        assert r.status_code == 422

    def test_job_not_found_404(self, client):
        r = client.get("/api/v1/jobs/99999")
        assert r.status_code == 404

    def test_delete_204(self, client):
        r = client.post("/api/v1/jobs", json={"title": "Temp", "company": "X", "source": "linkedin"})
        job_id = r.json()["id"]
        assert client.delete(f"/api/v1/jobs/{job_id}").status_code == 204
        assert client.get(f"/api/v1/jobs/{job_id}").status_code == 404

    def test_auth_required_401(self, client):
        r = client.get("/admin/stats")
        assert r.status_code in (401, 403)

    def test_auth_with_key(self, client, monkeypatch):
        monkeypatch.setenv("API_KEY", "test-key-123")
        r = client.get("/admin/stats", headers={"X-API-Key": "test-key-123"})
        assert r.status_code == 200

Mocking external API calls

Never call live external APIs in tests. They are slow, unreliable, often rate-limited, and make your CI dependent on a third-party service being up.

# Using pytest-mock
from unittest.mock import MagicMock

def test_eia_endpoint(client, mocker):
    mock_response = MagicMock()
    mock_response.json.return_value = {
        "response": {"data": [{"period": "2024-01", "value": "3.45"}]}
    }
    mock_response.status_code = 200
    mock_response.raise_for_status = lambda: None
    mocker.patch("app.services.eia.SESSION.get", return_value=mock_response)

    r = client.get("/api/v1/energy/prices?fuel=gasoline")
    assert r.status_code == 200

# Using the 'responses' library (cleaner for URL-level mocking)
import responses as mock_http

@mock_http.activate
def test_cbk_forex_fetch():
    mock_http.add(
        mock_http.GET,
        "https://www.centralbank.go.ke/api/forex",
        json={"rates": [{"pair": "USD/KES", "rate": 129.5}]},
        status=200,
    )
    from app.services.forex import fetch_rates
    data = fetch_rates()
    assert data[0]["rate"] == 129.5

Async endpoint tests

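import pytest
import httpx
from app.main import app

@pytest.mark.asyncio
async def test_semantic_search():
    # httpx 0.28 removed the app= shortcut; pass an ASGITransport instead
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://test") as client: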
        r = await client.post(
            "/api/v1/search",
            json={"text": "python data engineer nairobi", "top_k": 5}
        )
        assert r.status_code == 200
        assert len(r.json()) <= 5

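Install the pytest-asyncio plugin and add this to pytest.ini. In pyproject.toml, the table is [tool.pytest.ini_options] and the value must be quoted: asyncio_mode = "auto". In auto mode the @pytest.mark.asyncio marker is optional: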

[pytest]
asyncio_mode = auto

Useful pytest flags

pytest -v                              # verbose output
pytest -x                              # stop on first failure
pytest -s                              # show print/logging output
pytest -k "keyword"                    # run matching tests only
pytest -k "not slow"                   # skip tests with "slow" in the name (use -m "not slow" for markers)
pytest --cov=app --cov-report=term-missing  # coverage
pytest -m integration                  # run marked tests

Debugging Common Errors

422 Unprocessable Entity

This is FastAPI's most common error. Pydantic validation failed. The response body tells you exactly what and where:

curl -X POST http://localhost:8000/api/v1/jobs \
     -H "Content-Type: application/json" \
     -d '{"title": "DE"}' | python3 -m json.tool

{
  "detail": [
    {
      "loc": ["body", "company"],
      "msg": "Field required",
      "type": "missing"
    }
  ]
}

Common causes:

Required field missing in the request body
Wrong type (sending a string where a number is expected)
Enum value not in the allowed list
min_length or max_length constraint violated
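
When a 422 shows up in logs or a test failure, it helps to flatten the detail list into one readable line per error. The helper below is a minimal sketch, not part of FastAPI; summarize_validation_errors is a hypothetical name, and it assumes the response body has the shape shown above.

```python
def summarize_validation_errors(body: dict) -> list[str]:
    """Turn a FastAPI 422 body into 'location: message (type)' lines."""
    detail = body.get("detail", [])
    if not isinstance(detail, list):
        # HTTPException responses carry a plain string in "detail"
        return [str(detail)]
    lines = []
    for err in detail:
        location = ".".join(str(part) for part in err.get("loc", []))
        lines.append(f"{location}: {err.get('msg', '')} ({err.get('type', '')})")
    return lines


example = {
    "detail": [
        {"loc": ["body", "company"], "msg": "Field required", "type": "missing"}
    ]
}
summary = summarize_validation_errors(example)
print("\n".join(summary))
```

In a test, asserting on these lines gives a clearer failure message than dumping the raw JSON.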

500 Internal Server Error

Check the uvicorn terminal. The full Python traceback is printed there. For dev, add a global exception handler that returns the trace in the response:

from fastapi import Request
from fastapi.responses import JSONResponse
import traceback

@app.exception_handler(Exception)
async def generic_handler(request: Request, exc: Exception):
    return JSONResponse(
        status_code=500,
        content={"detail": str(exc), "trace": traceback.format_exc()},
    )

Disable this in production. Exposing tracebacks to external clients leaks implementation details.

Custom exception handlers (better than generic 500)

class DataQualityError(Exception):
    def __init__(self, message: str, field: str | None = None):
        self.message = message
        self.field   = field

@app.exception_handler(DataQualityError)
async def data_quality_handler(request: Request, exc: DataQualityError):
    return JSONResponse(
        status_code=400,
        content={"error": "data_quality_error", "message": exc.message, "field": exc.field},
    )

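# In your route (needs `from datetime import datetime` at the top of the module):
@app.post("/api/v1/events")
def ingest_event(event: EventSchema):
    # Assumes event.timestamp is a naive UTC datetime, matching utcnow()
    if event.timestamp > datetime.utcnow():
        raise DataQualityError("Event timestamp is in the future", field="timestamp")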

This pattern gives clients a structured error they can handle programmatically rather than a generic 500.

CORS errors in the browser

CORS errors appear in the browser console, not in the FastAPI terminal. The fix is nearly always the same:

app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:8501"],  # exact origin, no trailing slash
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Common CORS mistakes:

Including a trailing slash: "http://localhost:8501/" does not match "http://localhost:8501"
Using allow_origins=["*"] with allow_credentials=True (blocked by browsers for credentialed requests)
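Forgetting to register the middleware on the app that actually serves the requests (it must be added before the app starts handling traffic; its position relative to route definitions does not matter)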
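
The trailing-slash mistake is easy to rule out by normalizing origins before they reach allow_origins. This is a minimal sketch using only the standard library; normalize_origin is a hypothetical helper, not something FastAPI or Starlette provides.

```python
from urllib.parse import urlsplit


def normalize_origin(url: str) -> str:
    """Reduce a URL to scheme://host[:port], the form browsers send as Origin."""
    parts = urlsplit(url.strip())
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    # Path, query, and any trailing slash are dropped on purpose
    return f"{parts.scheme}://{parts.netloc}"


configured = ["http://localhost:8501/", "https://example.com/app"]
allowed_origins = [normalize_origin(origin) for origin in configured]
print(allowed_origins)
```

Passing allowed_origins to CORSMiddleware means a stray slash in an environment variable no longer breaks the exact-match comparison.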

Debugging what is running on a port

# Linux / WSL
lsof -i :8000
fuser -k 8000/tcp     # kill what is on port 8000

# PowerShell
netstat -ano | findstr ":8000"
# $PID is a read-only automatic variable in PowerShell, so use a different name
$procId = (Get-NetTCPConnection -LocalPort 8000).OwningProcess
Stop-Process -Id $procId -Force

Production Patterns

Health endpoint (add to every API)

from sqlalchemy import text

@app.get("/health")
def health(db: Session = Depends(get_db)):
    try:
        db.execute(text("SELECT 1"))
        return {"status": "ok", "database": "connected"}
    except Exception as e:
        raise HTTPException(status_code=503, detail=str(e))

Airflow's HttpSensor can poll this endpoint before triggering downstream tasks. Docker Compose health checks use it. It is a few lines of code that save real debugging time.

Rate limiting your own API

Protecting your API from abuse or accidental hammering requires a rate limiter. slowapi is a widely used third-party rate-limiting library for FastAPI and Starlette:

pip install slowapi

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/api/v1/search")
@limiter.limit("10/minute")
def semantic_search(request: Request, q: str):
    # The `request` parameter is required by slowapi
    return search_jobs(q)

Without rate limiting, a single script hitting /search in a tight loop can exhaust your database connections or your embedding model's memory.
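
To make a limit like "10/minute" concrete, here is a minimal sliding-window limiter in plain Python. It is an illustration of the idea only, not how slowapi is implemented; the class name and the injectable now parameter are assumptions made so the behaviour can be checked without waiting a real minute.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=10, window_seconds=60)
# Twelve requests from one client, one second apart
results = [limiter.allow("203.0.113.7", now=float(i)) for i in range(12)]
print(results)
```

An in-process limiter like this only protects a single worker; with several workers the counters need to live in shared storage.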

Streaming responses for large data exports

When an endpoint returns a large dataset (thousands of rows), do not load the entire result into memory before sending the response. Use StreamingResponse:

import csv
import io
from fastapi.responses import StreamingResponse

@app.get("/api/v1/export/jobs.csv")
def export_jobs_csv(db: Session = Depends(get_db)):
    def generate():
        output = io.StringIO()
        writer = csv.writer(output)
        writer.writerow(["id", "title", "company", "source", "posted_at"])
        yield output.getvalue()
        output.seek(0)
        output.truncate(0)

        for job in db.query(JobModel).yield_per(1000):
            writer.writerow([job.id, job.title, job.company, job.source, job.posted_at])
            yield output.getvalue()
            output.seek(0)
            output.truncate(0)

    return StreamingResponse(
        generate(),
        media_type="text/csv",
        headers={"Content-Disposition": "attachment; filename=jobs.csv"},
    )

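yield_per(1000) on the SQLAlchemy query fetches and builds ORM objects in batches of 1,000, so memory use stays roughly flat regardless of how large the table is, provided the database driver streams results rather than buffering the full result set on the client.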

Response caching for expensive queries

For endpoints that run the same expensive query repeatedly (dashboard metrics, aggregate counts), cache the result in memory:

from datetime import datetime, timedelta

_cache: dict = {}

def get_cached(key: str, ttl_seconds: int = 300):
    entry = _cache.get(key)
    if entry and datetime.utcnow() - entry["ts"] < timedelta(seconds=ttl_seconds):
        return entry["value"]
    return None

def set_cached(key: str, value):
    _cache[key] = {"value": value, "ts": datetime.utcnow()}

@app.get("/api/v1/stats")
def get_stats(db: Session = Depends(get_db)):
    cached = get_cached("stats", ttl_seconds=300)
    if cached is not None:
        return cached
    result = {
        "total_jobs":     db.query(JobModel).count(),
        "total_sources":  db.query(JobModel.source).distinct().count(),
    }
    set_cached("stats", result)
    return result

For production with multiple workers, replace the in-memory dict with Redis so all workers share the cache.

Database connection pool configuration

The default SQLAlchemy pool is fine for development. For production with multiple Uvicorn workers:

import os

from sqlalchemy import create_engine

engine = create_engine(
    os.getenv("DATABASE_URL"),
    pool_size=5,          # connections kept open per worker
    max_overflow=10,      # extra connections above pool_size allowed in burst
    pool_pre_ping=True,   # test connection before use (handles DB restarts)
    pool_recycle=3600,    # recycle connections older than 1 hour (avoids stale TCP)
)

pool_pre_ping=True is the one you need most. Without it, workers that have been idle may hold dead connections and throw OperationalError on the first request after a database restart.

Structured request logging

import time
import logging

log = logging.getLogger("api")

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    elapsed_ms = round((time.time() - start) * 1000, 2)
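    # request.client can be None (for example behind some test clients)
    client_ip = request.client.host if request.client else "unknown"
    log.info(
        f"{request.method} {request.url.path} "
        f"status={response.status_code} "
        f"duration={elapsed_ms}ms "
        f"ip={client_ip}"
    )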
    return response

This middleware gives you one log line per request with the information you need to debug production issues: method, path, status code, and how long it took.

The Profiling Section Nobody Reads Until They Need It

When an endpoint is slower than expected, do not guess. Measure.

import time
from functools import wraps

def timed(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        t = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = (time.perf_counter() - t) * 1000
        print(f"{func.__name__} took {elapsed:.2f}ms")
        return result
    return wrapper

@timed
def slow_query(db):
    return db.query(JobModel).filter(...).all()

# Full profiling with cProfile
import cProfile, pstats, io

pr = cProfile.Profile()
pr.enable()
slow_function()
pr.disable()

stream = io.StringIO()
pstats.Stats(pr, stream=stream).sort_stats("cumulative").print_stats(20)
print(stream.getvalue())

# Log all SQL queries from SQLAlchemy (enable during debugging, disable in prod)
import logging
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

The SQLAlchemy logging usually reveals the problem immediately: an N+1 query pattern where a route is running one query per row rather than a single JOIN.
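
If you have not seen an N+1 pattern in a query log before, this sketch reproduces one in miniature. It uses the standard-library sqlite3 module and its trace callback in place of SQLAlchemy's engine logging (an assumption to keep it self-contained); the companies and jobs tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, company_id INTEGER);
    INSERT INTO companies VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO jobs VALUES (1, 'DE', 1), (2, 'DA', 2), (3, 'MLE', 3);
""")

# Record every statement SQLite executes from here on
statements = []
conn.set_trace_callback(statements.append)

# N+1 pattern: one query for the jobs, then one more query per job
jobs = conn.execute(
    "SELECT id, title, company_id FROM jobs ORDER BY id"
).fetchall()
n_plus_one = [
    (title, conn.execute(
        "SELECT name FROM companies WHERE id = ?", (company_id,)
    ).fetchone()[0])
    for _, title, company_id in jobs
]
n_plus_one_queries = len(statements)

# Same result with a single JOIN
statements.clear()
joined = conn.execute(
    "SELECT j.title, c.name FROM jobs j "
    "JOIN companies c ON c.id = j.company_id ORDER BY j.id"
).fetchall()
join_queries = len(statements)

print(n_plus_one_queries, join_queries)
```

In a SQLAlchemy log the same shape appears as one SELECT followed by a run of near-identical SELECTs that differ only in the bound id.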

JobSense uses FastAPI for semantic job search with pgvector and Ollama. The Kenya Forex API uses FastAPI with DuckDB for sub-10ms query latency. Both are on GitHub.

Follow me on dev.to for more on data engineering, APIs, and pipelines.

The Data Engineering Take-Home Assessment: How to Turn a 4-Hour Test Into a Job Offer

De' Clerke — Tue, 02 Jun 2026 22:13:29 +0000

Most candidates treat the take-home assessment as a coding test. It is not. It is a professional communication test that happens to include coding. The evaluator is asking: can you scope your own work, make defensible decisions, write clean code, handle data that is not pristine, and explain what you did to someone who was not in the room?

I have submitted take-homes for Kenya-based DE roles and remote ones. The structure I use now, after getting through several of them, is in this article. It covers how to read the brief, allocate time, decide what to build, write tests that signal production thinking, structure the README that actually gets read, and avoid the mistakes that end reviews immediately regardless of how good the code is.

Before You Write a Single Line of Code

Read the brief twice before touching anything

Read the entire brief once, then close it and ask yourself what the deliverable actually is. Not the summary you assumed, but the exact words they used. Then re-read and annotate:

What is the actual output? Code only? Code with a README? A working dashboard with screenshots? A written design doc? These are different submissions requiring different time allocations.
What keywords did they use? "Production-ready," "scalable," "testable," "explain your design decisions," and "data quality" are not decoration. Each one is a scoring criterion. Find them and treat them as explicit requirements.
What does the data look like? Open it before planning anything. How many rows? Are there nulls, duplicates, malformed dates? Is there a schema dictionary or do you have to infer types? The shape of the data changes the transformation work.
What is the hidden scope trap? Common ones: "bonus points for streaming" (do not build a streaming pipeline), "feel free to add tests" (this is not optional), "use any tools you like" (means use tools you can defend in a follow-up).
What is the minimum viable submission? Define it before you start. If you run out of time, you ship the MVP with good documentation, not a half-finished ambitious version.

Send clarifying questions if communication is allowed

Many candidates assume they should not ask questions. This is wrong. If the brief allows contact, send 2 to 3 targeted questions immediately after reading. Not vague questions ("what format should the output be?") but specific ones ("the schema shows customer_id as nullable, but the join in the transformation would lose those rows. Should I treat null customer IDs as invalid records, or route them to a separate output?").

Asking good questions demonstrates the same thinking that senior engineers apply on the first day of a real project. If you cannot contact them, document every assumption in your README. This protects you if the evaluator meant something different from what you built.

Look up the company's tech stack before you start

If their job posting mentions dbt, use dbt if it fits. If they work in BigQuery and you have a choice of warehouse, pick BigQuery. Mirror their keywords in your Key Design Decisions section. An evaluator who uses dbt every day will notice that you wrote {{ config(materialized='incremental', unique_key='id') }} and will have more to discuss in the follow-up call than an evaluator looking at a pandas script from someone who clearly has not read the JD.

Time Allocation

The single most common failure mode is running out of time before the README is written. Lock in time for documentation before you start building.

2-hour assessment:

Block	Duration	Focus
Brief + data exploration	0:00 to 0:25	Read twice, open data, plan approach on paper
Core pipeline	0:25 to 1:15	The transformation they asked for. This is the grade.
Tests	1:15 to 1:35	3 to 5 meaningful assertions
README + cleanup	1:35 to 1:55	Design decisions, assumptions, how to run
Final check	1:55 to 2:00	Run from clean clone

4-hour assessment:

Block	Duration	Focus
Brief + data exploration	0:00 to 0:20	Data profiling, schema decisions written down
Core build	0:20 to 1:30	Extract, transform, load
Schema + modeling decisions	1:30 to 2:00	Document grain, dimensions, load strategy
Tests	2:00 to 2:45	5 to 8 tests covering schema, business logic, edge cases
Dashboard or API (if required)	2:45 to 3:30	Working, not polished
README + Key Design Decisions	3:30 to 3:50	This section wins or loses the assessment
Git cleanup + submit	3:50 to 4:00	Run from clean clone one more time

6-hour assessment:

Block	Focus
Hour 1	Brief + exploration + schema decisions written
Hours 2 to 4	Core pipeline (extract, transform, load)
Hour 5	Tests + edge cases + data quality checks
Hour 6	README + architecture diagram + polish

Rules that hold across all lengths:

Protect the last 20 minutes for README. Never let coding eat into it.
Stop adding features at the 70% mark. Polish what exists.
A working pipeline with 80% of the features beats a broken one with 100%.

Requesting an extension

If the timeline is tight and communication is allowed, ask for more time rather than rushing. "I want to give you my best work. I can deliver by Wednesday instead of Monday if that works." Most interviewers prefer a polished Wednesday submission over a hurried Monday one. The ask itself signals that you value quality over optics.

What to Build

Always build

Working code that actually runs
The exact transformation or query they asked for
At least 3 to 5 meaningful tests
A README that explains what you built and why

Build if time allows (differentiates you)

Error handling at API and file ingestion boundaries
Idempotency (run twice without duplication)
Schema validation at ingestion
logging module instead of print statements
.env.example with every variable documented
A Makefile or run.sh so the evaluator can test in one command

.PHONY: run test clean

run:
	python pipelines/run.py

test:
	pytest tests/ -v

clean:
	find . -type f -name "*.pyc" -delete
	find . -type d -name "__pycache__" -delete

Skip unless explicitly required

Full Airflow orchestration (too much overhead in a take-home)
Kafka streaming (unless they said "streaming pipeline")
dbt full project (unless they said "use dbt")
Docker Compose (unless they said "containerize")
Multiple database options
CI/CD pipeline

Evaluators score on the quality of what you deliver, not the quantity of tools you list. Six tools used badly is worse than two tools used well.

Schema and Data Modeling Decisions

Write down your schema decisions before writing any SQL or transformation code. Evaluators look for evidence that you can think before you code.

Decision template:

1. Grain: one row per what?
   e.g. "One row per transaction, identified by transaction_id + timestamp"

2. Dimensions: stable lookup attributes
   e.g. customer, product, region, date

3. Facts/measures: numeric, aggregatable values
   e.g. amount, quantity, duration_seconds

4. Query pattern: how will this data be read?
   e.g. "mostly filtered by date range and region"

5. Load strategy: full refresh, incremental, or upsert?
   e.g. "Incremental on created_at — table grows, no updates to old rows"

6. Known data quality issues
   e.g. "Nulls in customer_id — treat as 'UNKNOWN' not drop"

Star schema vs flat table:

For take-homes under 4 hours, a flat wide table plus a mart view is almost always the right call. Star schema is appropriate when the brief asks you to design for a BI use case with multiple reporting dimensions. If the brief says "answer these 3 business questions," a flat table with good column names is faster and cleaner to explain.

Data Quality Checklist

Run through this before calling the build done:

# Quick profiling you should show in your README (df is the raw source DataFrame)
print(f"Source rows: {len(df)}")

# These checks belong in your tests AND your README as a data profile
df.duplicated(subset=['id']).sum()          # dupe check
df.isnull().sum()                           # null audit per column
df.dtypes                                   # type check
(df['amount'] < 0).sum()                    # impossible values
df['city'].str.strip().ne(df['city']).sum() # hidden whitespace

-- Equivalent SQL checks
SELECT COUNT(*), COUNT(DISTINCT id) FROM target;           -- duplicate PK
SELECT COUNT(*) FROM target WHERE required_col IS NULL;    -- required nulls
SELECT MIN(amount), MAX(amount) FROM target;               -- range sanity
SELECT COUNT(*) FROM target WHERE created_at > CURRENT_DATE; -- future dates

The data profile in your README is a strong signal. Write it like this:

Source:          10,247 rows
After dedup:     10,104 rows  (143 duplicate transaction_ids removed)
After null drop:  9,882 rows  (222 rows missing required customer_id)
Loaded:           9,882 rows

One line per transformation stage with the delta explained. Evaluators running 20 submissions will remember the one candidate who documented what happened to their data end-to-end.
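
Generating that profile from the pipeline itself keeps the README numbers honest. The function below is a minimal sketch; format_profile and the (label, row_count, reason) tuple shape are hypothetical choices, and the figures are the ones from the example above.

```python
def format_profile(stages):
    """Render one line per stage and explain the delta when rows were dropped.

    stages: list of (label, row_count, reason) tuples in pipeline order.
    """
    lines = []
    previous = None
    for label, rows, reason in stages:
        name = f"{label}:"
        line = f"{name:<17}{rows:>7,} rows"
        if previous is not None and rows < previous:
            line += f"  ({previous - rows:,} {reason})"
        lines.append(line)
        previous = rows
    return lines


stages = [
    ("Source", 10247, ""),
    ("After dedup", 10104, "duplicate transaction_ids removed"),
    ("After null drop", 9882, "rows missing required customer_id"),
    ("Loaded", 9882, ""),
]
profile_lines = format_profile(stages)
print("\n".join(profile_lines))
```

Collect the row counts as each stage runs, log the result, and paste the same output into the README.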

Tests That Signal Production Thinking

Aim for 5 to 10 tests. Evaluators count them and check whether they are meaningful.

Required tests (always):

import pytest
import pandas as pd
from pipelines.transform import transform_sales

@pytest.fixture
def sample_input():
    return pd.DataFrame({
        'id':         [1, 2, 3],
        'amount':     [100.0, 250.0, 75.5],
        'created_at': ['2024-01-01', '2024-01-02', '2024-01-03'],
        'region':     ['Nairobi', 'Mombasa', None]
    })

def test_output_has_expected_columns(sample_input):
    result = transform_sales(sample_input)
    assert 'total_amount' in result.columns
    assert 'region' in result.columns

def test_no_duplicate_ids(sample_input):
    result = transform_sales(sample_input)
    assert result['id'].duplicated().sum() == 0

def test_row_count_preserved(sample_input):
    result = transform_sales(sample_input)
    assert len(result) == len(sample_input)

def test_null_region_filled(sample_input):
    result = transform_sales(sample_input)
    assert result['region'].isnull().sum() == 0

def test_amount_is_non_negative(sample_input):
    result = transform_sales(sample_input)
    assert (result['amount'] >= 0).all()

Tests that differentiate you:

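# Assumes load() is imported from your load module and db_connection is a
# fixture in conftest.py that yields a fresh connection with a sales table
def test_idempotency(sample_input, db_connection):
    """Running the pipeline twice should produce the same row count."""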
    load(transform_sales(sample_input), db_connection)
    load(transform_sales(sample_input), db_connection)
    result = db_connection.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    assert result == len(sample_input)

def test_empty_input_returns_empty():
    """Empty input should not crash — it should return an empty DataFrame."""
    empty = pd.DataFrame(columns=['id', 'amount', 'created_at', 'region'])
    result = transform_sales(empty)
    assert len(result) == 0
    assert 'total_amount' in result.columns  # schema still correct

def test_business_logic(sample_input):
    """Total revenue in output matches sum of input amounts."""
    result = transform_sales(sample_input)
    assert result['amount'].sum() == pytest.approx(sample_input['amount'].sum())

The idempotency test and the empty input test are the two that most junior candidates skip. Both show production awareness.
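
For the idempotency test to pass, the load step itself has to tolerate a second run. Here is a minimal sketch using the standard-library sqlite3 module in place of PostgreSQL (an assumption to keep it runnable anywhere); the sales table layout and the tuple input are hypothetical, and the ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3


def load(rows, conn):
    """Insert (id, amount) rows; rows whose id already exists are skipped."""
    conn.executemany(
        "INSERT INTO sales (id, amount) VALUES (?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

rows = [(1, 100.0), (2, 250.0), (3, 75.5)]
load(rows, conn)
load(rows, conn)  # second run must not duplicate anything

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```

The primary key is what makes the conflict detectable, so the uniqueness decision belongs in the schema, not only in the pandas code.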

The README That Wins Assessments

The README is the first thing the evaluator opens. It is also the document they refer to during the follow-up call when they ask "walk me through your approach." Write it like a professional document, not like a GitHub hobby project.

Mandatory sections in this order:

Summary

One paragraph. What you built, what the data is, what the output is. No fluff.

Builds an incremental ETL pipeline that ingests daily M-Pesa transaction records from a CSV export, validates and transforms them using pandas, loads them into PostgreSQL, and exposes aggregate revenue metrics per region per day. Input: 10,247 rows across 3 months. Output: 9,882 clean rows in a fct_transactions table.

How to Run

Commands that work from a clean clone. Assume the evaluator has Python and Docker but not your database or your environment variables.

cp .env.example .env
# Edit .env with your DB credentials
pip install -r requirements.txt
python pipelines/run.py
pytest tests/ -v

Key Design Decisions

This section is where assessments are won or lost. Write 3 to 5 decisions with the reasoning behind each.

Why incremental load over full refresh:
The source table grows daily. Full refresh would reprocess all rows on every run. Incremental load on created_at reduces per-run cost by approximately 95% as the table grows. Trade-off: if a past record is corrected, it will not be picked up unless I add a lookback window or a CDC mechanism.

Why PostgreSQL over SQLite:
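PostgreSQL enforces column types strictly, handles concurrent writers, and has native JSONB support if the events column schema expands. SQLite has supported window functions since version 3.25, so the cohort analysis query in the mart layer would run on either engine, but its flexible typing and single-writer model would have required workarounds.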

Why I filled null regions with UNKNOWN instead of dropping the rows:
The region column was 12% null. Dropping those rows would have removed 12% of total revenue from the mart. I filled with UNKNOWN and added a filter in the mart view so analysts can exclude those rows if they need clean-only data. The raw rows are preserved.

Performance at scale note:
This pipeline processes the given 10K-row dataset in under 2 seconds. At 100M+ rows I would switch from pandas to DuckDB for in-process columnar processing, or to a Spark-based approach if the data requires distributed processing across multiple nodes.

Assumptions

List every assumption you made. This protects you if the evaluator meant something different.

transaction_id is assumed globally unique, not unique per account.
Dates are assumed UTC. If local time is needed, timezone conversion should be added at ingestion.
Rows where amount = 0 are treated as valid (refunds cleared to zero, not errors).
The region field maps to Kenyan counties based on string matching to the provided lookup.

If I Had More Time

Show ambition without making excuses. Be specific about what problem you would solve, not what feature you would add.

Add a Great Expectations suite on the raw ingest to catch schema drift before it reaches the mart. Currently any new column in the source CSV is silently dropped.
Parameterize the date range via CLI argument so the pipeline can be backfilled for any period without code changes.
Add row count monitoring: assert that each run loads within 20% of the previous run's row count, and alert if it deviates.

Project Structure

project/
├── pipelines/
│   ├── extract.py       # reads source CSV or API
│   ├── transform.py     # business logic and cleaning
│   └── load.py          # writes to PostgreSQL with ON CONFLICT
├── tests/
│   └── test_transform.py
├── .env.example
├── .gitignore
├── Makefile
├── requirements.txt
└── README.md

The Five Take-Home Types

Type A: "Here's a CSV, build a pipeline"

Focus: correctness of transformation, data quality, idempotency.

Deliverable: Python script with tests and README.

The trap: over-engineering. A clean pandas to PostgreSQL pipeline beats a half-working Airflow DAG. Write ON CONFLICT DO NOTHING on inserts so it is idempotent from day one.

Type B: "Write SQL to answer these business questions"

Focus: correctness, readable query structure, efficiency.

Deliverable: one .sql file per question, each with a comment on the grain.

-- Q3: Monthly revenue by region
-- Grain: one row per region per calendar month
SELECT
    DATE_TRUNC('month', transaction_date) AS month,
    region,
    SUM(amount)                           AS total_revenue,
    COUNT(DISTINCT customer_id)           AS unique_customers
FROM fct_transactions
GROUP BY 1, 2
ORDER BY 1, 2;

The trap: subqueries where a CTE is clearer. Every complex query should have a grain comment and a note explaining any non-obvious approach.

Root cause decomposition separates strong SQL submissions from average ones. When asked "why is revenue down?" the answer is not a single number. Break the metric into components:

-- Revenue = orders × average order value
-- Splitting reveals whether the problem is volume or value
SELECT
    DATE_TRUNC('month', transaction_date)   AS month,
    COUNT(order_id)                        AS order_volume,
    ROUND(SUM(amount) / COUNT(order_id), 2) AS avg_order_value,
    SUM(amount)                             AS total_revenue
FROM fct_transactions
GROUP BY 1
ORDER BY 1;

This is the kind of query that makes an evaluator pause and read it twice.
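
Once the query returns volume and average order value per month, the change in revenue can be split into the part explained by volume and the part explained by value. The sketch below shows one common way to do that split; the function name and the sample figures are made up for illustration.

```python
def decompose_revenue_change(prev_orders, prev_revenue, cur_orders, cur_revenue):
    """Split a revenue change into a volume effect and a value effect.

    volume effect = change in orders x previous average order value
    value effect  = current orders x change in average order value
    The two effects sum to the total change in revenue.
    """
    prev_aov = prev_revenue / prev_orders
    cur_aov = cur_revenue / cur_orders
    volume_effect = (cur_orders - prev_orders) * prev_aov
    value_effect = cur_orders * (cur_aov - prev_aov)
    return {
        "total_change": cur_revenue - prev_revenue,
        "volume_effect": volume_effect,
        "value_effect": value_effect,
    }


# Hypothetical months: fewer orders, slightly higher order value
result = decompose_revenue_change(
    prev_orders=1000, prev_revenue=50_000,
    cur_orders=900, cur_revenue=46_800,
)
print(result)
```

In this made-up case revenue fell because of lost order volume, while average order value actually went up, which is a very different story from "revenue is down 6.4%".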

Type C: "Design a data pipeline for X"

Focus: architecture decisions, trade-off discussion.

Deliverable: a written design doc with an ASCII architecture diagram.

The trap: using exotic tools to look smart. If the scale is 50,000 rows per day, Kafka is overkill. Say so explicitly. "At this volume, a cron-scheduled Python script writing to PostgreSQL is sufficient and operationally simpler than a streaming architecture. If volume grows to 1M+ events per hour with sub-minute latency requirements, I would introduce Kafka and a Flink consumer."

Type D: "Build a small API or dashboard"

Focus: does it work? Is it clean? Can the evaluator use it without calling you?

The trap: forgetting to include screenshots in the README. Many candidates lose points because the evaluator cannot run the visualization locally. Screenshot every page of the dashboard and embed it in the README before submitting.

Type E: "Review this pipeline and tell us what's wrong"

Focus: identifying bugs, edge cases, missing quality checks.

Do not just say "this is bad." Write structured feedback with severity levels:

CRITICAL (line 47): No error handling on the API call. If the API returns 
a 429 or 500, the pipeline fails silently with no retry and no alert. 
Fix: wrap in try/except with exponential backoff and a failure log.

MEDIUM (line 82): Full table scan on every run. The WHERE clause is 
missing a watermark filter, so the pipeline reprocesses all historical 
rows on each execution. Fix: add WHERE created_at > last_processed_ts.

LOW (line 104): print() used instead of logging module. In production 
this output would be invisible in any log aggregation system.
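
If you recommend exponential backoff in a review, be ready to show what you mean. This is a minimal sketch using only the standard library; call_with_backoff and its parameters are hypothetical names, and the sleep function is injectable so the behaviour can be tested without waiting.

```python
import logging
import time

logger = logging.getLogger(__name__)


def call_with_backoff(func, max_attempts=4, base_delay=1.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call func(); on failure wait base_delay * 2**n and retry, then re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_attempts:
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            sleep(delay)


# Demo: a call that fails twice, then succeeds (delays recorded, not slept)
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated 500 from the API")
    return "ok"


recorded_delays = []
outcome = call_with_backoff(flaky_fetch, sleep=recorded_delays.append)
print(outcome, recorded_delays)
```

In a real pipeline you would narrow retry_on to transient errors (timeouts, 429, 5xx) so that a 400 caused by a bad request fails immediately.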

Code Quality Details Evaluators Notice

Format your code before submitting

Run a formatter before submitting. Inconsistent indentation and style signal that you do not work as part of a team.

# Install once
pip install black ruff

# Run before every submission
black pipelines/ tests/
ruff check pipelines/ tests/ --fix

black handles formatting. ruff catches common issues: unused imports, undefined variables, f-strings without placeholders. List both in a requirements-dev.txt (or a dev dependency group in pyproject.toml) so the evaluator can see you use them without having to install them to run the pipeline.

Use the logging module, not print

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)
logger = logging.getLogger(__name__)

logger.info(f"Loaded {len(df)} rows from source")
logger.warning(f"Found {null_count} null customer_ids — filling with UNKNOWN")
logger.info(f"Written {rows_written} rows to fct_transactions")

print() works for a script. logging works for a production system. The evaluator is assessing whether you would be comfortable in a production codebase.

AI tools in take-home assessments

AI tools exist and evaluators know you will use them. Using them is not a problem. The problem is submitting code you cannot explain.

Use AI to:

Clarify ambiguous requirements when you cannot contact the team
Check whether a SQL query handles a specific edge case correctly
Identify missing test scenarios you have not thought of

Do not use AI to:

Generate the entire pipeline without understanding it
Write test names that sound meaningful but assert nothing
Produce a README with five Key Design Decisions you cannot defend in a follow-up call

The follow-up call after a take-home always includes "walk me through this function" and "why did you choose this approach?" If you cannot answer, the submission score is irrelevant.

Disqualifying Mistakes

These do not lose points. They end the review immediately.

Code that does not run:

Clone your own repo into a temp folder and run it from scratch before submitting. This is not optional.

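cd /tmp
git clone https://github.com/yourname/assessment-repo
cd assessment-repo
python -m venv .venv && source .venv/bin/activate   # fresh environment, so missing requirements surface
cp .env.example .env
# Edit with test credentials
pip install -r requirements.txt
python pipelines/run.py
pytest tests/ -v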

Hardcoded secrets in any file:

# Check before submitting
git log --all --diff-filter=A -- .env    # check .env was never committed on any branch
git grep -i "password\|api_key\|secret"  # scan all tracked files

Missing or blank README: Even 10 lines of real README beats none. An evaluator who cannot understand what you built in 30 seconds moves to the next submission.

Broken imports or missing requirements.txt entries: After finishing the implementation, run pipreqs . --force to regenerate requirements.txt from your actual imports. pipreqs only lists packages your code imports, so re-add dev tools such as black and ruff afterwards. Do not use pip freeze unless you want to submit 200 packages from your entire venv.

SQL that does not run on the specified engine: If they said PostgreSQL, test on PostgreSQL. Window functions, CTEs, JSONB, and type casting all differ between PostgreSQL, SQLite, and DuckDB. Test on the exact engine specified.

Large binary files committed to the repo:

# Check for large files before final push
git ls-files -z | xargs -0 ls -la | sort -k5 -n | tail -20   # -z / -0 keeps filenames with spaces intact

Your .gitignore minimum:

.env
.venv/
__pycache__/
*.pyc
*.db
*.csv
*.parquet
*.egg-info/
.DS_Store
.idea/
*.log

When to Decline

Some take-homes are not worth doing. Red flags:

Scope beyond 6 hours for an entry-level role (you are being asked to do free work)
Unreasonable timelines where the brief is sent Friday afternoon due Monday morning
Repeated assessments after multiple interview rounds where you have already demonstrated technical ability
No feedback loop after submission (you will never know if you passed or failed)

Declining professionally is straightforward: "Thank you for considering me. Due to current bandwidth I'm unable to complete an assessment of this scope by that deadline. I'd be happy to discuss an alternative format or timeline if that's possible." This is better than a rushed submission that leaves a bad impression.

Pre-Submission Checklist (Last 10 Minutes)

Code:

Code runs end-to-end from a clean clone
No hardcoded paths (no C:/Users/Administrator/...)
No secrets in any tracked file
requirements.txt is up to date
.env.example has every required variable with a placeholder value
.gitignore covers .env, .venv/, __pycache__/, *.csv, *.db, *.parquet
Code formatted with black and linted with ruff

Tests:

All tests pass (pytest tests/ -v)
At least 3 meaningful assertions (not just test_import)
Idempotency test present if the pipeline writes to a database

README:

"How to Run" section works from a clean clone
Key Design Decisions written with reasoning, not placeholders
Assumptions documented
Data profile (source rows, after dedup, after cleaning, final loaded)
"If I Had More Time" with specific problems, not generic features
Screenshots embedded if there is a dashboard or API output

Git:

git status is clean
Commit messages are readable
No large binary files
Repo is public or evaluator's GitHub handle has access

Final:

You ran your own submission from scratch in a temp directory
You re-read the original brief to confirm you answered the actual question


Data Engineering Interviews with No Industry Experience: A Playbook That Actually Works

De' Clerke — Tue, 02 Jun 2026 22:06:20 +0000

I graduated in November 2025. My only formal work experience is a 6-month IT internship. I have never worked at a tech company, never contributed to a production pipeline with a team, and never had a senior data engineer review my code.

What I do have is 55 real data engineering projects on GitHub, each with specific metrics: row counts, test pass rates, pipeline success logs, and dashboard screenshots. Every project uses real data from real APIs. None of it is synthetic.

This article is the playbook I built around that situation. It covers the opener, the STAR answers drawn from project work, the technical questions interviewers actually ask, how to handle behavioral rounds honestly, and the Kenya market context that most interview guides skip entirely.

The Core Problem with Standard Interview Advice

Most interview guides assume you have 2 to 3 years of work experience to draw from. They say things like "describe a time when you had to scale a pipeline for a large dataset" or "tell me about a production incident you resolved." If you built your skills through self-directed projects, you can answer these questions just as well, but you need to translate your project work into the language interviewers expect.

The translation is not difficult. The key is precision. Interviewers are not checking whether you worked at a specific company. They are checking whether you understand what you built, why you made the decisions you did, and what you would do differently. A project where you processed 1.5 million rows, caught a date parsing bug with dbt tests before it corrupted 40,000 rows of fiscal year attribution, and generated 2,961 legitimate anomaly alerts is a stronger story than a vague answer about "working on a pipeline at a previous company."

Part 1: The Opener

"Tell me about yourself" is the first question in almost every interview. It is also the question most people underestimate.

The formula: who you are, what you have built, what you are targeting, why you are interested in this specific role.

Version A (general data engineering role):

"I'm a data engineer based in Nairobi. I graduated with a BSc in IT from Zetech University in late 2025, and since then I have been building a portfolio of production-style data pipelines, currently 55 projects on GitHub. My focus has been East African data contexts: I have built pipelines on Airflow, Kafka, dbt, and GCP, covering everything from real-time flight monitoring at JKIA to fiscal reconciliation of Kenya's budget data. I am looking for a junior data engineering role where I can contribute immediately to real pipelines and grow alongside an experienced team. I was drawn to this role specifically because [mirror 1 to 2 phrases from the JD]."

Version B (analytics or BI role):

"I am a data analyst and engineer based in Nairobi. I completed my BSCIT at Zetech University in November 2025, and I have spent the following months building an analytics portfolio covering SQL, Python, dbt, and BI tools including Power BI, Grafana, Streamlit, and Looker Studio. I have modeled Kenya economic data, NSE stock data, and Kenyan job market trends. My work is deliberately honest and measurable: every project has specific test pass rates and row counts I can speak to. I am looking for a role where analytical thinking and engineering overlap, like what your team does with [mention company context]."

Version C (technical deep-dive panel):

"I am a junior data engineer with a focused portfolio of 55 real-data pipelines. My stack centers on Apache Airflow for orchestration, dbt for transformation, Kafka for streaming, and PostgreSQL or BigQuery as the target. I also have experience with CDC via Debezium, medallion lakehouses on Delta Lake, and serverless pipelines on AWS. All projects use real public APIs or open datasets. I can walk you through any of them in detail, including the design decisions and what I would do differently now."

Execution notes:

Speak slowly. Most candidates rush this answer. Pause at the transition between each section. Have the JD in front of you the night before and mirror 1 to 2 phrases from it. End with a slight upward inflection on the last sentence, which signals you want them to respond rather than just waiting for the next question.

Part 2: STAR Answers from Real Projects

Behavioral and project questions follow STAR format: Situation, Task, Action, Result. The common prompts are "tell me about a challenging project," "describe your most complex data engineering work," and "give me an example of a pipeline you built." Here are five project stories ready to deliver.

JKIA Flight Traffic Monitor (streaming and GCP)

Situation: I wanted to build a real-time streaming pipeline over Kenya aviation data, something that would demonstrate Kafka, GCP, and dbt together in a single coherent project.

Task: Ingest live flight data from the OpenSky Network API, buffer it through Kafka, land it in Google Cloud Storage, transform it in BigQuery using dbt, and surface traffic patterns in Looker Studio.

Action:

Built an Airflow DAG to orchestrate the full workflow end-to-end
Configured Kafka with a Zookeeper cluster in Docker and used it as a buffer between the API poller and GCS writer; if GCS had a timeout, Kafka held the data and resumed delivery without loss
Wrote dbt models with a staging to mart layer and 22 dbt tests to validate output integrity
Used BigQuery window functions to detect peak traffic windows at JKIA
Surfaced insights in a Looker Studio dashboard

Result: All 22 dbt tests passing. The Kafka buffer successfully handled GCS timeouts during testing without data loss. The dashboard showed peak-traffic windows that aligned with published JKIA schedule data, which validated the pipeline's correctness.

What I would do differently: Add Schema Registry with Avro. If the OpenSky API changes a field type, the current pipeline would silently corrupt data. Schema Registry with compatibility rules would catch that before it reaches the mart layer.

LedgerSync (scale and dbt testing)

Situation: Kenya's national budget data (the BOOST dataset) is publicly available but rarely analyzed computationally. I wanted to build a fiscal reconciliation pipeline at scale and understand what the data actually shows.

Task: Ingest 1.5 million rows of Kenya fiscal transaction data, reconcile budget versus actuals, detect anomalies, and surface alerts in a dashboard.

Action:

Airflow 5-task DAG for orchestration with PostgreSQL as the warehouse
9 dbt models covering raw to staging to reconciliation to alerts
35 dbt tests; the suite caught a date parsing bug during development that would have corrupted approximately 40,000 rows of fiscal year attribution before it reached the mart layer
Built an alerting model that flags transactions deviating more than 20% from expected budget allocation

Result: 1,528,492 BOOST rows processed. All 5 Airflow tasks succeeded. 35 of 35 dbt tests passing. 2,961 legitimate fiscal anomalies surfaced. The most interesting finding: several ministry categories showed consistent overspend in Q4, which aligns with public reporting on year-end budget rushing in Kenya.

What I would do differently: Add incremental loading. The current pipeline does a full truncate-and-reload (WRITE_TRUNCATE-style) on every run. With a fiscal year partition key I could load only the current fiscal year's data on each run, which reduces cost and run time at scale.
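
The 20% deviation rule from the Action list can be illustrated in a few lines of Python. This is only a sketch: the real alerting model is a dbt model, the `budget` and `actual` field names are hypothetical, and it assumes "expected budget allocation" means the budgeted amount for the same line item.

```python
def flag_budget_anomalies(rows, threshold=0.20):
    """Return rows whose actual spend deviates from budget by more than threshold."""
    flagged = []
    for row in rows:
        budget, actual = row["budget"], row["actual"]
        if budget == 0:
            # A relative deviation is undefined; flag any spend against a zero budget
            if actual != 0:
                flagged.append({**row, "deviation": None})
            continue
        deviation = (actual - budget) / budget
        if abs(deviation) > threshold:
            flagged.append({**row, "deviation": deviation})
    return flagged


sample = [
    {"line": "A", "budget": 100.0, "actual": 125.0},  # +25%: flagged
    {"line": "B", "budget": 100.0, "actual": 110.0},  # +10%: within tolerance
    {"line": "C", "budget": 100.0, "actual": 70.0},   # -30%: flagged
    {"line": "D", "budget": 0.0, "actual": 5.0},      # spend with no budget: flagged
]
alerts = flag_budget_anomalies(sample)
```

The zero-budget branch is a design choice worth stating in an interview: dividing by zero would crash the run, and silently skipping those rows would hide exactly the kind of spending an alerting model exists to catch.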

BungeWatch Kenya (NLP and PDF parsing)

Situation: Kenya's parliament publishes bills as PDFs on a poorly structured website with no machine-readable feed. I wanted to make legislative data searchable and summarized.

Task: Build a pipeline that scrapes parliamentary bills, parses the PDFs, extracts keywords, generates summaries, and surfaces foreign funding mentions in a searchable dashboard.

Action:

Playwright for dynamic site scraping (the parliament site uses JS rendering)
3-tier PDF parser: pdfplumber for text PDFs, PyMuPDF for complex layouts, Tesseract OCR for scanned image PDFs; 4 parallel workers
spaCy and YAKE for keyword extraction; TF-IDF for relevance ranking
Airflow 3.0 with an 11-task DAG; checkpoint pattern to resume after failures without re-downloading bills
dbt for transformation (8 models, 28 tests); Streamlit for the dashboard

Result: 319 bills processed; 223 successfully parsed. 2,230 keywords extracted. 223 AI-generated summaries. Identified 14 foreign funding mentions across 8 bills, data that is not easily discoverable by reading each bill manually.

What I would do differently: Store the raw PDFs in GCS or S3 before parsing. If parsing logic changes, I cannot re-run on the same documents without re-downloading them. Separating the raw storage from the processing layer would fix that.

Kenya Forex Intelligence API (data quality and FastAPI)

Situation: Forex data APIs for Kenyan currency pairs either cost money or return unreliable data. I wanted to build a self-hosted API with validated data and fast query response.

Task: Build a REST API backed by DuckDB that ingests forex rates daily through Airflow, runs a validation suite, and serves clean data with sub-10ms latency.

Action:

Airflow DAG handles daily ingestion; 120 rows per day (one per currency pair)
Great Expectations validation suite with 5 checks: nulls, expected currency codes, rate range bounds, timestamp freshness, duplicate detection
DuckDB as the query engine (in-process OLAP, no separate DB server)
FastAPI with 4 endpoints: current rates, historical series, pair comparison, and validation report

Result: Sub-10ms query latency on all endpoints. The 5-check validation suite caught 3 real data quality issues during development, including stale timestamps from the upstream source and duplicate records on weekend ingestion. All were fixed before they reached the API layer.
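
The five checks can be sketched in plain Python to show what each one asserts. This is not the Great Expectations API and not the project's actual suite; the pair names, rate bounds, and 24-hour freshness window are placeholder assumptions.

```python
from datetime import datetime, timedelta, timezone

# Placeholder assumptions: adjust to the pairs and limits your source actually has.
EXPECTED_PAIRS = {"USD/KES", "EUR/KES", "GBP/KES"}
MIN_RATE, MAX_RATE = 0.0001, 10_000.0
MAX_AGE = timedelta(hours=24)


def validate_rates(rows, now=None):
    """Return a list of failure messages; an empty list means the batch passed."""
    now = now or datetime.now(timezone.utc)
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        pair, rate, ts = row.get("pair"), row.get("rate"), row.get("timestamp")
        if pair is None or rate is None or ts is None:       # 1. nulls
            failures.append(f"row {i}: null in required field")
            continue
        if pair not in EXPECTED_PAIRS:                        # 2. expected currency codes
            failures.append(f"row {i}: unexpected pair {pair}")
        if not MIN_RATE <= rate <= MAX_RATE:                  # 3. rate range bounds
            failures.append(f"row {i}: rate {rate} out of bounds")
        if now - ts > MAX_AGE:                                # 4. timestamp freshness
            failures.append(f"row {i}: stale timestamp {ts.isoformat()}")
        if (pair, ts) in seen:                                # 5. duplicate detection
            failures.append(f"row {i}: duplicate of ({pair}, {ts.isoformat()})")
        seen.add((pair, ts))
    return failures
```

In an Airflow task you would raise an exception when the returned list is non-empty, so a bad batch fails the run instead of reaching the API layer.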

What I would do differently: Add Redis caching for high-traffic scenarios and API-key or OAuth2 authentication so the API can be deployed publicly without exposing it to abuse.

JobSense (mistake and recovery)

This one is for the "describe a mistake you made" question.

Situation: In the JobSense pipeline, I named a dbt model the same as a Python-managed PostgreSQL table that already existed in the same schema.

Task: When dbt ran, it dropped the Python-managed table and replaced it with its own version, which deleted all 604 pre-loaded job embeddings.

Action: Re-ran the embedding generation script (604 jobs times one embedding model call, approximately 45 minutes to regenerate). Then renamed the dbt model with a dbt_ prefix to distinguish it from Python-managed tables. Added a step to my project setup checklist: verify no naming conflicts between dbt models and raw tables before running dbt run for the first time.

Result: Pipeline fully recovered. The conflict pattern is now written down in my project notes, not just in my head, and it has not recurred in subsequent projects.

Part 3: Technical Questions That Actually Come Up

Interviewers test SQL, Python, orchestration, transformation, and system design. Here are the questions I have seen most often, with answers tied back to real work.

SQL

What is the difference between WHERE and HAVING?

WHERE filters rows before aggregation. HAVING filters groups after aggregation. WHERE cannot reference aggregate functions.

SELECT dept, COUNT(*) AS headcount
FROM employees
WHERE active = true            -- filters rows before grouping
GROUP BY dept
HAVING COUNT(*) > 10;          -- filters groups after aggregation

Explain window functions with a real use case.

Window functions compute values across rows related to the current row without collapsing them like GROUP BY does. From the NSE Stock Pipeline:

SELECT
    ticker,
    close_price,
    RANK() OVER (PARTITION BY sector ORDER BY close_price DESC) AS sector_rank,
    LAG(close_price, 1) OVER (PARTITION BY ticker ORDER BY trade_date) AS prev_close,
    close_price - LAG(close_price, 1) OVER (
        PARTITION BY ticker ORDER BY trade_date
    ) AS daily_change
FROM nse_prices;

I also used window functions in BigQuery on JKIA data to compute hourly flight counts relative to a 7-day rolling average to detect peak traffic windows.

Write a query for a rolling 7-day average.

SELECT
    trade_date,
    ticker,
    close_price,
    AVG(close_price) OVER (
        PARTITION BY ticker
        ORDER BY trade_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_avg
FROM nse_prices
ORDER BY ticker, trade_date;

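The ROWS BETWEEN 6 PRECEDING AND CURRENT ROW is the frame clause. Without it, the default frame runs from the start of the partition to the current row, so you get a running average, not a rolling one. The frame counts rows, not calendar days: with one row per trading day this is a 7-session average, and gaps such as weekends are skipped over. Leaving out the frame clause is one of the most common window function mistakes in interviews.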
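
To see the difference the frame clause makes, here is a small self-contained check using Python's built-in sqlite3 module (assuming a SQLite build with window function support, 3.25 or newer). The table and the DEMO ticker are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (trade_date TEXT, ticker TEXT, close_price REAL)")
# Eight consecutive sessions for a hypothetical ticker, closing at 1.0 through 8.0
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [(f"2025-01-{day:02d}", "DEMO", float(day)) for day in range(1, 9)],
)

rows = conn.execute(
    """
    SELECT
        trade_date,
        AVG(close_price) OVER (
            PARTITION BY ticker ORDER BY trade_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS rolling_7_rows,
        AVG(close_price) OVER (
            PARTITION BY ticker ORDER BY trade_date
        ) AS running_avg            -- no frame clause: start of partition to current row
    FROM prices
    ORDER BY trade_date
    """
).fetchall()

last_date, rolling_7_rows, running_avg = rows[-1]
# On the eighth row the rolling window covers closes 2.0..8.0,
# while the running average still includes the first close.
```

On the eighth session the two columns diverge (5.0 for the rolling window versus 4.5 for the running average), which is exactly the bug an interviewer is probing for.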

What is an SCD Type 2 and how do you implement it?

Type 2 tracks historical changes by adding new rows with effective_from and effective_to date columns rather than overwriting.

-- Close the current record
UPDATE dim_customer
SET effective_to = CURRENT_DATE, is_current = FALSE
WHERE customer_id = 123 AND is_current = TRUE;

-- Insert the new version
INSERT INTO dim_customer
    (customer_id, name, address, effective_from, effective_to, is_current)
VALUES
    (123, 'New Name', 'New Address', CURRENT_DATE, '9999-12-31', TRUE);

dbt snapshots automate this pattern. In the Kenya Real Estate pipeline I used snapshots to track price changes on listings, so each change has a dated record rather than just the current price.
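
The close-then-insert pair should run as one transaction, otherwise a failure between the two statements leaves the customer with no current row. A minimal sketch with Python's sqlite3 module follows; the helper name `apply_scd2_change` and the in-memory table are assumptions for the illustration, not how dbt snapshots work internally.

```python
import sqlite3

OPEN_END = "9999-12-31"  # sentinel for "still current", matching the SQL above


def apply_scd2_change(conn, customer_id, name, address, change_date):
    """Close the current row for customer_id and insert the new version atomically."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "UPDATE dim_customer SET effective_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (change_date, customer_id),
        )
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, name, address, effective_from, effective_to, is_current) "
            "VALUES (?, ?, ?, ?, ?, 1)",
            (customer_id, name, address, change_date, OPEN_END),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER, name TEXT, address TEXT, "
    "effective_from TEXT, effective_to TEXT, is_current INTEGER)"
)
apply_scd2_change(conn, 123, "Old Name", "Old Address", "2025-01-01")
apply_scd2_change(conn, 123, "New Name", "New Address", "2025-06-01")

history = conn.execute(
    "SELECT name, effective_from, effective_to, is_current "
    "FROM dim_customer ORDER BY effective_from"
).fetchall()
```

After the second call, `history` holds two rows: the old version closed on the change date and the new version open-ended and marked current.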

How do you optimize a slow query?

EXPLAIN ANALYZE to find sequential scans and nested loop joins
Add indexes on JOIN and WHERE columns
Avoid functions on indexed columns in WHERE predicates (prevents index use)
Filter early: push predicates into CTEs rather than filtering after a large join
For BigQuery: check bytes processed and add partitioning or clustering on the filter columns
Check for data skew in GROUP BY operations where some groups are much larger than others

Python

What is a generator and why use one in a pipeline?

A generator yields values lazily, one at a time, instead of loading the entire result into memory. For large files or API responses this is essential.

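import csv

def read_csv_chunks(path, chunk_size=10_000):
    """Yield lists of up to chunk_size rows so the whole file is never in memory."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:  # flush the final partial chunk
            yield chunk

# process_and_load stands in for your own transform-and-load function
for chunk in read_csv_chunks("large_file.csv"):
    process_and_load(chunk)  # never more than chunk_size rows in memory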

What is the GIL and what does it mean for pipelines?

The GIL (Global Interpreter Lock) prevents multiple CPython threads from executing Python bytecode simultaneously. For data pipelines:

I/O-bound work (API calls, DB reads): use threading or asyncio; the GIL releases during I/O wait
CPU-bound work (parsing, transformation): use multiprocessing; each process has its own GIL

In BungeWatch I used multiprocessing with 4 workers for OCR because it is CPU-bound. If I had used threading, all 4 workers would have shared one core.
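
The I/O-bound half of that rule is easy to demonstrate with the standard library. In this sketch `fake_api_call` is a hypothetical stand-in for a network request; `time.sleep` releases the GIL the same way a socket wait does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_api_call(n):
    """Stand-in for an I/O-bound call: the thread waits, it does not compute."""
    time.sleep(0.2)
    return n * 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_api_call, range(4)))  # map preserves input order
elapsed = time.perf_counter() - start
# Four 0.2 s waits overlap, so elapsed is close to 0.2 s rather than 0.8 s.
```

For CPU-bound work, `concurrent.futures.ProcessPoolExecutor` exposes the same `map` interface, so switching is mostly a one-line change. On platforms that spawn worker processes it needs the usual `if __name__ == "__main__":` guard around the pool.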

How do you implement retries with exponential backoff?

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    retry=retry_if_exception_type(requests.exceptions.RequestException),
    reraise=True
)
def fetch_data(url: str) -> dict:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

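I use this pattern in every API extraction script. With five attempts the waits stay well below the max=60 cap; the cap matters once you raise the attempt count, where uncapped doubling would otherwise stretch to minutes between retries if the upstream API is down completely.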
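
It helps to work out the wait schedule these parameters imply. The sketch below assumes each delay is multiplier * 2^(attempt - 1), clamped between min and max. That formula is my reading of how tenacity's wait_exponential behaves and may differ between versions, so treat the numbers as illustrative.

```python
def backoff_schedule(attempts, multiplier=1, min_wait=2, max_wait=60):
    """Waits in seconds between attempts, under the assumed clamped-doubling formula."""
    waits = []
    for n in range(1, attempts):  # there is no wait after the final attempt
        raw = multiplier * 2 ** (n - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


five_attempts = backoff_schedule(5)   # [2, 2, 4, 8]
ten_attempts = backoff_schedule(10)   # [2, 2, 4, 8, 16, 32, 60, 60, 60]
uncapped = backoff_schedule(10, max_wait=10**9)[-1]  # 256 seconds without a cap
```

Under this assumption, five attempts give up after about 16 seconds of total waiting. The cap only starts to bite from the seventh wait onward, which is where an uncapped schedule would climb past four minutes.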

How do you process a large CSV without running out of memory?

Three approaches ranked by simplicity:

import duckdb
import pandas as pd

# Option 1: pandas chunks
for chunk in pd.read_csv("large.csv", chunksize=10_000):
    process(chunk)

# Option 2: DuckDB scans the file from disk; only the filtered result becomes a DataFrame
conn = duckdb.connect()
result = conn.execute("SELECT col1, col2 FROM 'large.csv' WHERE amount > 1000").df()

# Option 3: generator (shown above)

For files above 5 GB, DuckDB is usually the fastest Python option because it scans the file without loading it whole, and for Parquet it reads only the columns the query references.

Apache Airflow

What is XCom and what are its limits?

XCom allows tasks to share data through the metadata database. The limit is size: XCom is for small metadata (IDs, row counts, file paths), not DataFrames. Storing a DataFrame in XCom bloats the metadata DB and makes it slow.

Best practice: write actual data to PostgreSQL, GCS, or S3, then pass the path via XCom. The task that receives the path reads the data from storage directly.

What is the difference between catchup and backfill?

Catchup is a DAG-level flag. If catchup=True and you deploy a DAG with a start_date two weeks ago, Airflow will run all missed intervals immediately. Set catchup=False for most pipelines to avoid this.

Backfill is intentional: you manually trigger historical re-runs. In Airflow 2.x CLI syntax:

airflow dags backfill my_dag --start-date 2025-01-01 --end-date 2025-01-31 --reset-dagruns

How do you handle task failures in production?

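from datetime import timedelta
from airflow.decorators import task

# alert_slack is your own callback; Airflow calls it with the task context on failure
@task(retries=3, retry_delay=timedelta(minutes=5), on_failure_callback=alert_slack)
def extract_nse_data():
    ...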

Beyond retries: design tasks to be idempotent so they are safe to re-run. Use ON CONFLICT DO NOTHING for inserts so a retry never creates duplicates. Set depends_on_past=True for critical pipelines where a failed run should block future runs until resolved.

dbt

What is the staging to mart layer pattern?

Staging: 1-to-1 with source tables. Light cleaning only: rename columns, cast types, drop obvious garbage. No business logic. Prefix: stg_.

Intermediate: business logic joins and transformations that combine staging models. Not exposed to end users. Prefix: int_.

Marts: final wide tables optimized for specific use cases. These are what analysts and BI tools query. Prefix: fct_ or dim_.

From the NSE pipeline: stg_prices to int_daily_changes to fct_stock_performance. Each layer is testable independently.

What is an incremental model and when do you need a lookback buffer?

{{ config(materialized='incremental', unique_key='flight_id') }}

SELECT * FROM {{ ref('stg_flights') }}
{% if is_incremental() %}
WHERE departure_time > (SELECT MAX(departure_time) FROM {{ this }})
{% endif %}

The buffer matters for late-arriving data. If records can arrive a few hours after the event time (which happens with API sources that update retroactively), use:

WHERE departure_time > (SELECT MAX(departure_time) FROM {{ this }}) - interval '4 hours'

Without the buffer, late records are silently dropped.
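
The effect of the buffer is easier to see outside SQL. This pure-Python sketch mimics the incremental filter plus a merge on the unique key; `incremental_merge` and the sample flights are hypothetical, and it only approximates what dbt does on a real warehouse.

```python
from datetime import datetime, timedelta


def incremental_merge(existing, incoming, lookback=timedelta(hours=4)):
    """Upsert incoming rows newer than (max departure_time - lookback), keyed by flight_id."""
    if not existing:  # first run: equivalent to a full load
        return {row["flight_id"]: row for row in incoming}
    high_water = max(row["departure_time"] for row in existing.values())
    cutoff = high_water - lookback
    merged = dict(existing)
    for row in incoming:
        if row["departure_time"] > cutoff:
            merged[row["flight_id"]] = row  # unique key: replace, never append
    return merged


loaded = {"KQ100": {"flight_id": "KQ100", "departure_time": datetime(2025, 1, 1, 12, 0)}}
# A record that arrives late: its event time is two hours before the high-water mark
late = [{"flight_id": "KQ200", "departure_time": datetime(2025, 1, 1, 10, 0)}]

with_buffer = incremental_merge(loaded, late)
without_buffer = incremental_merge(loaded, late, lookback=timedelta(0))
```

With the four-hour lookback the late flight is picked up; with no lookback it is filtered out and never loaded. The unique key is what makes the overlap safe, because rows inside the window that were already loaded are replaced rather than duplicated.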

How do you do CI for dbt?

Use dbt state to run only models changed since the last production run:

dbt build \
  --select state:modified+ \
  --defer \
  --state /artifacts/prod-manifest

state:modified+ runs changed models and all downstream dependents. --defer makes unchanged models reference the production results rather than rebuilding everything. This requires storing manifest.json from the last prod run as a CI artifact.

System Design

How do you design an end-to-end batch analytics pipeline?

Use the NSE Stock Pipeline as the example:

Source: NSE website after market close at 18:00 EAT
Extract: Python with exponential backoff retries, land raw data in PostgreSQL staging table with ON CONFLICT DO NOTHING
Validate: row count check (not zero, not 10x normal), null rate check on required fields, price range bounds
Transform: dbt staging model (cast types, rename), intermediate (compute daily change and sector rank), mart (aggregated performance)
Serve: Streamlit dashboard reads from the mart layer
Orchestrate: Airflow DAG with 3 retries per task, Slack alert on failure, SLA set to 30 minutes

When do you choose streaming over batch?

Batch when:

Data arrives in bulk at defined intervals (daily files, nightly exports)
Latency of minutes or hours is acceptable
Aggregations need complete data (stock prices only make sense after market close)

Streaming when:

Need to act within seconds (fraud detection, live dashboards)
Data arrives continuously and unboundedly
Need to capture deletes (batch only sees what exists, not what was removed)

I have built both. The NSE pipeline is batch. The JKIA Flight Monitor uses Kafka for streaming with Airflow as the outer orchestrator.

What is idempotency and how do you achieve it?

Idempotency means running the same task multiple times produces the same result as running it once. It is what makes retries safe.

Techniques:

INSERT ... ON CONFLICT DO NOTHING or ON CONFLICT DO UPDATE instead of plain INSERT
Truncate followed by a full reload (WRITE_TRUNCATE in BigQuery) for small dimension tables
dbt incremental models with unique_key so re-runs merge rather than append
For file-based outputs: write to a temporary path, then rename atomically when complete

Every extraction in my projects uses ON CONFLICT DO NOTHING. A re-run after a failure never creates duplicate rows.
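
A retry-safe insert can be demonstrated end to end with sqlite3, which supports the same ON CONFLICT clause in SQLite 3.24 and newer. The table name and sample rows are made up; the important part is the unique constraint, because ON CONFLICT has nothing to act on without one.

```python
import sqlite3


def load_rows(conn, rows):
    """Insert rows, silently skipping any that already exist for the same key."""
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT INTO raw_prices (ticker, trade_date, close_price) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT (ticker, trade_date) DO NOTHING",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_prices (ticker TEXT, trade_date TEXT, close_price REAL, "
    "PRIMARY KEY (ticker, trade_date))"
)

batch = [("DEMO", "2025-01-02", 10.5), ("DEMO", "2025-01-03", 10.7)]
load_rows(conn, batch)
load_rows(conn, batch)  # simulated retry of the same task

row_count = conn.execute("SELECT COUNT(*) FROM raw_prices").fetchone()[0]
```

The same shape works as a pytest idempotency test: run the load twice and assert the row count did not change.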

Distributed Systems and Database Foundations

These questions come up in technical screens at companies running any distributed infrastructure. They are foundational concepts, not Airflow-specific or dbt-specific.

What is the CAP theorem?

The CAP theorem states that a distributed system can only guarantee two of three properties simultaneously:

Consistency (C): every read receives the most recent write or an error
Availability (A): every request receives a response, though it may not be the most recent data
Partition Tolerance (P): the system continues operating even if network partitions cause nodes to lose communication

In practice, network partitions always happen. So the real trade-off is between consistency and availability when a partition occurs.

System	Trade-off	Example
CP (consistent, partition-tolerant)	May reject requests during partition	HBase, Zookeeper
AP (available, partition-tolerant)	May return stale data during partition	Cassandra, DynamoDB
CA (consistent, available)	No partition tolerance	Traditional RDBMS on single node

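For data engineering: a system's position often depends on configuration. Kafka with acks=all and min.insync.replicas set will reject writes rather than acknowledge data it cannot replicate safely, while looser settings keep accepting messages at the risk of losing them. Transactional databases like PostgreSQL favor consistency (reads always reflect committed writes). Knowing which trade-off a system makes tells you how to design around its failure modes.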

What are the main data serialization formats and when do you use each?

This question comes up whenever the role involves data lakes, Kafka, or any file-based pipeline.

Format	Type	Schema	Compression	Best for
JSON	Text, row-oriented	Schema-less	Poor	APIs, config, small data
CSV	Text, row-oriented	Schema-less	Poor	Simple exports, Excel users
Avro	Binary, row-oriented	Schema required (embedded)	Good	Kafka messages, schema evolution
Parquet	Binary, columnar	Schema embedded	Excellent	Data lakes, analytical queries
Protocol Buffers	Binary, row-oriented	Schema required (.proto file)	Excellent	gRPC, high-throughput streaming

The two you need to know deeply for data engineering interviews:

Parquet is the default format for data lake storage. It stores data column by column, so a query on 3 columns out of 50 only reads those 3 columns from disk. Combined with Snappy or ZSTD compression, Parquet files are typically 5 to 10 times smaller than the equivalent CSV. BigQuery, Athena, Spark, and DuckDB all read Parquet natively.

# Writing Parquet with pandas
df.to_parquet("output.parquet", engine="pyarrow", compression="snappy", index=False)

# DuckDB reading Parquet directly (no load step)
import duckdb
conn = duckdb.connect()
result = conn.execute("SELECT ministry, SUM(amount) FROM 'ledger.parquet' GROUP BY 1").df()

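Avro is the usual choice for Kafka because every record is written against an explicit schema and it supports schema evolution with compatibility rules (for example, adding a field with a default value). Avro files embed the schema; in Kafka the schema usually lives in a Schema Registry and each message carries only a schema ID. Without Avro or Protobuf in a Kafka pipeline, a schema change in the producer can silently break consumers.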

What is the difference between a data lake and a data warehouse?

Property	Data Lake	Data Warehouse
Data format	Raw, any format (CSV, JSON, Parquet, images, PDFs)	Structured, processed, schema-on-write
Schema	Schema-on-read (applied at query time)	Schema-on-write (enforced at load time)
Cost	Very cheap storage (S3, GCS)	More expensive (compute + storage)
Query speed	Slower (no indexes, scans files)	Fast (columnar, optimized)
Use case	Retain all raw data for future use	Analytical queries, dashboards, BI
Who queries it	Data engineers, data scientists	Analysts, BI tools

The modern answer is the data lakehouse: combine object storage cost with warehouse-level query performance using open table formats (Delta Lake, Apache Iceberg, Apache Hudi). They add ACID transactions, time travel, and schema enforcement on top of Parquet files in S3 or GCS.

In Kenya Economic Pulse I implemented a medallion lakehouse using Delta Lake on MinIO. The Bronze layer is raw Parquet files in object storage. The Silver layer applies Delta Lake ACID writes. The Gold layer runs dbt on top of it. You get the storage cost of a data lake and the reliability guarantees of a warehouse.

What is database normalization and when do you denormalize?

Normalization organizes a relational database to reduce redundancy and avoid update anomalies. The common normal forms:

1NF: each column holds atomic (indivisible) values; no repeating groups
2NF: all non-key columns depend on the entire primary key (removes partial dependencies)
3NF: all non-key columns depend only on the primary key, not on other non-key columns (removes transitive dependencies)

A fully normalized schema stores each fact in exactly one place. A customer's city is stored once in dim_customer, not repeated on every transaction row.

When to denormalize: in OLAP systems (warehouses, data marts), you denormalize intentionally. Analytical queries join many tables at query time, which is expensive. Pre-joining dimension data into wide fact tables eliminates those joins and speeds up BI tools. This is what a star schema does: it is deliberately denormalized for read performance at the cost of some write redundancy.

Rule of thumb: normalize for OLTP (transactional systems where you write frequently). Denormalize for OLAP (analytical systems where you read frequently across many rows).

What is sharding and when would you use it?

Sharding is horizontal partitioning of a database across multiple machines. Each shard holds a subset of the rows, usually split by a shard key (user ID range, geographic region, hash of a key).

Why it exists: a single database server has a physical ceiling on storage, CPU, and write throughput. Sharding distributes that load across multiple nodes so the system can scale beyond what any single machine supports.

Sharding strategies:

Range-based: rows where user_id between 1 and 1,000,000 go to shard 1, the next million to shard 2. Simple, but hot spots appear if new users cluster in one range.
Hash-based: shard = hash(user_id) % num_shards. Distributes evenly, but range queries touch all shards.
Directory-based: a lookup table maps each key to its shard. Flexible, but the directory itself becomes a bottleneck.
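
The first two strategies fit in a few lines each. This is a sketch under simple assumptions: a fixed shard count, integer user IDs for the range case, and a two-shard layout invented for the example.

```python
import hashlib
from bisect import bisect_left


def hash_shard(key, num_shards):
    """Hash-based routing using a stable digest.

    Python's built-in hash() is salted per process for strings, so it is
    unsuitable for routing that must agree across machines and restarts.
    """
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(user_id, upper_bounds):
    """Range-based routing: upper_bounds[i] is the largest id stored on shard i."""
    index = bisect_left(upper_bounds, user_id)
    if index == len(upper_bounds):
        raise ValueError(f"user_id {user_id} is beyond the last shard's range")
    return index


bounds = [1_000_000, 2_000_000]  # hypothetical two-shard layout
first = range_shard(1_000_000, bounds)    # last id on shard 0
second = range_shard(1_000_001, bounds)   # first id on shard 1
spread = {hash_shard(user_id, 4) for user_id in range(1000)}
```

Note the cost hidden in the modulo: changing `num_shards` remaps most keys, which is the resharding pain listed under the challenges.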

Sharding challenges:

Cross-shard joins are expensive or impossible; you must denormalize or use application-level joins
Resharding (adding or removing shards) requires moving data, which is disruptive
Transactions across shards require distributed transaction protocols

For data engineering specifically: you rarely shard the warehouse yourself (BigQuery and Snowflake handle distribution internally). Sharding knowledge matters most when the source system is a sharded operational database and you need to extract from all shards during ingestion.

How do you handle data security and PII in a pipeline?

This comes up in financial services, healthcare, NGO, and any role handling customer data. The key concepts:

PII identification and classification:
Personally Identifiable Information includes names, ID numbers, phone numbers, email addresses, financial account numbers, location data, and any combination that could identify an individual. Build a data catalog or lineage map that tags which columns contain PII.

Techniques for protecting PII in pipelines:

import hashlib
import os

# Pseudonymization: replace PII with a consistent token.
# The same ID always produces the same hash, so you can still join on it.
def pseudonymize(value: str, salt: str) -> str:
    # Truncating to 16 hex chars keeps 64 bits; keep more if collisions matter at your scale
    return hashlib.sha256(f"{salt}{value}".encode()).hexdigest()[:16]

salt = os.environ["PII_SALT"]  # read the secret once; never hardcode it
df["customer_id_hashed"] = df["national_id"].apply(
    lambda x: pseudonymize(x, salt=salt)
)
df = df.drop(columns=["national_id"])  # never land raw PII in the warehouse

Tokenization: replace PII with a random token; only the token vault maps tokens back to originals
Masking: replace PII with a format-preserving fake value (e.g., 07*****890 instead of the full number)
Encryption: encrypt columns at rest; only authorized roles can decrypt
Column-level access control: in BigQuery and Snowflake, use column-level security policies so analysts see masked values while engineers see the full column
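As an illustration of the masking bullet, the sketch below keeps the first two and last three characters and stars out the rest. The `mask_phone` helper and the sample number are made up for this example; real masking rules depend on the field and on your compliance requirements.

```python
def mask_phone(number: str, keep_start: int = 2, keep_end: int = 3) -> str:
    """Keep the first and last few characters and star out the middle."""
    if len(number) <= keep_start + keep_end:
        # Too short to reveal anything safely, so mask everything.
        return "*" * len(number)
    hidden = len(number) - keep_start - keep_end
    return number[:keep_start] + "*" * hidden + number[-keep_end:]


masked = mask_phone("0712345890")  # made-up number
```

Unlike pseudonymization, a masked value cannot be used as a join key, so it suits display and analyst-facing views rather than pipeline joins.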

Key principles:

Never land raw PII in the Bronze layer if you can avoid it. Pseudonymize at extraction.
Apply data minimization: only collect and store what you actually need.
GDPR and Kenya's Data Protection Act (2019) require that individuals can request deletion of their data. Design your pipeline so you can honor a deletion request: track which records belong to a given individual, and make sure you can delete or mask them without breaking referential integrity downstream.
Log access to sensitive tables; alert on unusual query patterns.

Part 4: Behavioral Questions

The trap with behavioral questions for fresh grads is giving vague, hypothetical answers ("I would communicate clearly..."). Be specific about real situations, even if the real situation was a solo project. Specificity is what interviewers remember.

"Describe a time you debugged a difficult problem."

In BungeWatch, the PDF parser was silently failing on roughly 30% of bills. The pipeline reported success but the bills came through with empty content. I added structured logging to capture the filename, file size, and which parser tier was used. This revealed the pattern: every failure was a file under 50 KB. Inspecting those manually showed they were scanned images disguised as PDFs, so pdfplumber and PyMuPDF both returned empty strings silently. I added a content-length check after each tier and a Tesseract OCR fallback for files returning less than 100 characters. Parser coverage went from 70% to 93%.

"Describe a time you worked with ambiguous requirements."

In LedgerSync, the Kenya BOOST dataset does not have official documentation for ministry codes. The field exists, but there is no lookup table. I cross-referenced the codes manually against Kenya Treasury Annual Reports and built a dbt seed file (a static CSV lookup) mapping code to ministry name. I flagged it as a source of uncertainty in the documentation and added a dbt test that alerts if a new code appears that is not in the lookup. When new data arrives with unknown ministry codes, the test catches it immediately rather than silently defaulting to NULL.

"Tell me about a time you made a mistake."

The dbt and Python naming conflict in JobSense (described above in the STAR section). Key point: I recovered, documented the root cause, added it to a setup checklist, and the pattern has not recurred across 10 subsequent projects.

"How do you handle technical disagreement?"

Be honest if your experience is limited: "Most of my projects are solo, so I have not had a real team disagreement yet. My approach would be to understand the other person's reasoning fully before presenting an alternative, propose a small prototype to test both approaches against a defined metric if we remain split, and defer to the senior engineer if the decision is above my level, while documenting my concern."

Honesty here is better than a fabricated story an interviewer might probe further.

"Where do you see yourself in 3 years?"

"I want to go from building everything independently, as I do now, to contributing to a production data platform and learning from engineers who operate at scale. In 3 years I would like to be a mid-level data engineer who can own a pipeline end-to-end, mentor newer team members, and make confident architectural decisions. I also want to deepen my understanding of areas I have only touched so far: data observability, Apache Flink for streaming at scale, and Azure."

Part 5: Questions to Ask the Interviewer

Asking good questions at the end is as important as answering them well. It signals engineering curiosity and helps you evaluate whether the role is actually a good fit.

About the tech stack:

What does your current data infrastructure look like?
Are there any migrations or re-architectures planned that I would be involved in?
How mature is your CI/CD pipeline for data work?
Do you use dbt? If not, how do you manage SQL transformations?

About the role:

What would success look like in the first 3 months?
What are the most common sources of pipeline failures on this team?
Is this a new role or a backfill?

About growth:

What learning resources does the company support?
Are there opportunities to work across domains like ML, analytics, and platform engineering?

Avoid asking about salary in the first round, asking questions that are answered in the JD, and asking what the company does (research beforehand).

Part 6: The Kenya Market

Most interview prep content is written for US or UK candidates. The Kenyan data engineering market has its own structure.

Who hires data engineers in Kenya:

Financial services: Equity Bank, KCB, Co-operative Bank, M-KOPA, Pezesha, Lipa Later
Telecoms: Safaricom (the largest data organization in Kenya), Airtel
Tech and startups: Twiga Foods, BURN Manufacturing, Kobo360, Apollo Agriculture, Wave
NGOs and research: CGIAR, ICRAF, PATH, Mercy Corps, AmeriCares
International organizations: World Bank, UNDP, IFC (often contractor roles)
Remote (global): Many Kenyan engineers work remotely for EU and US companies via Andela, direct contract, or platforms like Toptal

Realistic salary ranges for junior and associate roles (2025 to 2026):

Level	Monthly (KES)	Notes
Entry (0 to 1 year)	60,000 to 120,000	Portfolio projects count toward "experience"
Junior (1 to 2 years)	100,000 to 180,000	With strong GitHub + relevant domain
Remote (USD)	$1,500 to $3,500/month	Andela-connected or direct remote

NGO and international organization roles typically pay 20 to 40% above private sector market rates. Salary data for junior roles is sparse, so negotiate based on your portfolio strength rather than published ranges alone.

What Kenyan employers actually test for:

Can you work with real Kenyan data, not just textbook examples? Portfolio projects on NSE, KNBS, parliamentary data, and the M-Pesa ecosystem answer this directly.
Can you write SQL and Python at a practical level? Expect a coding round, not just theory.
Can you communicate technical concepts clearly? Practice explaining your projects to a non-technical person.
Are you reliable and will you show up? Consistent GitHub activity across 55+ projects signals this more than a CV claim.

Typical interview format:

Round 1: HR screen (15 to 30 min). Background, motivation, salary range.
Round 2: Technical screen (45 to 60 min). SQL problems, Python questions, pipeline design question.
Round 3: Technical deep-dive or take-home. Build a small pipeline on provided test data, usually 3 to 5 days.
Round 4: Panel with hiring manager and team member. Behavioral and culture fit.
Timeline: typically 2 to 4 weeks from first contact to offer.

For remote roles:

Mention your timezone and availability for overlap (East Africa Time = UTC+3). Specify your rate in dollars per month, not per hour or per year. Don't leave it open-ended; it signals you have not thought it through.

Part 7: The Last 5 Minutes Before the Interview

Re-read the JD. Pick 2 phrases you will mirror in your opener.
Re-read your "tell me about yourself" version (A, B, or C depending on the role).
Pick the 2 to 3 projects most relevant to the JD. Know the specific metrics for each.
Review your 2 to 3 questions to ask.
Stop preparing. Your portfolio is real. You built those pipelines. The metrics are yours. Trust it.

The single biggest mistake is going into the opener with vague framing like "I have worked on various data projects." Replace every "various" with a specific number. Replace every "complex pipeline" with the actual metric. "I processed 1.5 million rows" is better than "I have experience with large datasets" every time.

My full portfolio of 55 projects is on GitHub. Portfolio site at ian-mwendwa.vercel.app.


SQL Interview Problems for Data Engineers: 30 Patterns That Actually Get Asked

De' Clerke — Tue, 02 Jun 2026 21:52:23 +0000

Data engineering SQL interviews are not the same as software engineering SQL interviews. You will not be asked to find the second-highest salary or reverse a string. You will be asked to sessionize user events, detect data quality issues in a pipeline, build a cohort retention table, or find consecutive streaks in time-series data. The problems are more complex, the schemas are messier, and the expectation is that you have encountered these patterns in production, not just on LeetCode.

I built these 30 problems from the SQL patterns I use repeatedly across my pipeline projects: NSE equity data, Kenyan property listings, job postings, flight data, and financial transactions. Every problem here maps to something I have written in production.

The problems use six tables throughout. Memorise the schemas before any interview:

nse_trades    (trade_id, ticker, sector, trade_date, open_price,
               close_price, volume, market_cap)

flights       (flight_id, airline, origin, destination,
               departure_time, arrival_time, status, passengers)

listings      (listing_id, location, property_type, price,
               bedrooms, listed_date, sold_date, status)

transactions  (transaction_id, user_id, amount, category,
               created_at, status)  -- status: completed/failed/pending

employees     (employee_id, name, department, salary,
               hire_date, manager_id)

events        (event_id, user_id, event_type, created_at)

Easy: The Foundations

Problem 1: Top N Per Group

Question: Find the top 3 most traded stocks by total volume for each sector. Return sector, ticker, total_volume. Order by sector, then total_volume DESC.

WITH ranked AS (
    SELECT
        sector,
        ticker,
        SUM(volume)                                                      AS total_volume,
        RANK() OVER (PARTITION BY sector ORDER BY SUM(volume) DESC)      AS rnk
    FROM nse_trades
    GROUP BY sector, ticker
)
SELECT sector, ticker, total_volume
FROM ranked
WHERE rnk <= 3
ORDER BY sector, total_volume DESC;

What the interviewer is testing: Whether you know that window functions run after GROUP BY. You aggregate first to get total_volume per ticker, then rank within each sector. The common mistake is trying to apply the window function before grouping, which does not work because window functions are evaluated after GROUP BY in the SQL logical processing order.

Follow-up you will get: "What is the difference between RANK and DENSE_RANK?"

RANK leaves gaps after ties: 1, 1, 3, 4. DENSE_RANK does not: 1, 1, 2, 3. Use DENSE_RANK when you want the top 3 distinct volume levels, however many tickers share them. Use RANK when ties should consume positions, so that rnk <= 3 returns about three rows per sector (more only when tickers tie at the cutoff).
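The difference is easy to check on a toy table. This sketch uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (it assumes a SQLite build with window-function support, 3.25 or later); the `totals` table and its tickers are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (ticker TEXT, total_volume INTEGER)")
conn.executemany(
    "INSERT INTO totals VALUES (?, ?)",
    [("AAA", 900), ("BBB", 900), ("CCC", 700), ("DDD", 500)],  # AAA and BBB tie
)

rows = conn.execute(
    """
    SELECT ticker,
           RANK()       OVER (ORDER BY total_volume DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY total_volume DESC) AS dense_rnk
    FROM totals
    ORDER BY total_volume DESC, ticker
    """
).fetchall()

for row in rows:
    print(row)
```

With a filter of <= 3, the RANK column keeps three tickers here while the DENSE_RANK column keeps all four.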

Problem 2: HAVING vs WHERE

Question: Find sectors where the average closing price across all trades is above 50. Show the count of distinct tickers per sector. Return sector, avg_close, ticker_count. Order by avg_close DESC.

SELECT
    sector,
    ROUND(AVG(close_price), 2)  AS avg_close,
    COUNT(DISTINCT ticker)       AS ticker_count
FROM nse_trades
GROUP BY sector
HAVING AVG(close_price) > 50
ORDER BY avg_close DESC;

What the interviewer is testing: The difference between WHERE and HAVING. WHERE filters rows before aggregation. HAVING filters groups after aggregation. You cannot use WHERE close_price > 50 here because that filters individual rows, not sector averages. Any interviewer asking a problem with an aggregate filter condition is testing whether you reach for HAVING.

Problem 3: Self Join

Question: List every employee with their manager's name. Include employees with no manager (the CEO/top level). Return employee_name, department, salary, manager_name.

SELECT
    e.name       AS employee_name,
    e.department,
    e.salary,
    m.name       AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id
ORDER BY e.department, e.name;

What the interviewer is testing: Self joins and the critical choice of LEFT JOIN vs INNER JOIN. Top-level employees have NULL in manager_id. An INNER JOIN silently drops them. LEFT JOIN keeps them with NULL in manager_name, which is correct.

Follow-up you will get: "How do you find employees who earn more than their manager?"

SELECT e.name, e.salary, m.name AS manager, m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id
WHERE e.salary > m.salary;

Here you use INNER JOIN because you only want employees who have a manager to compare against.

Problem 4: NULL Counting

Question: The listings table has NULL in sold_date for unsold properties. Calculate total listings, sold listings, unsold listings, and the sell-through rate as a percentage. Return one row.

SELECT
    COUNT(*)                                                    AS total_listings,
    COUNT(sold_date)                                            AS sold_listings,
    COUNT(*) - COUNT(sold_date)                                 AS unsold_listings,
    ROUND(COUNT(sold_date)::NUMERIC / COUNT(*) * 100, 1)        AS sell_through_pct
FROM listings;

What the interviewer is testing: The fundamental difference between COUNT(*) and COUNT(column). COUNT(*) counts all rows including those with NULLs. COUNT(column) counts only non-NULL values in that specific column. This lets you use COUNT(sold_date) as a conditional count without writing CASE WHEN sold_date IS NOT NULL THEN 1 END. In the Kenya Real Estate project, this exact pattern was how I tracked listing status across 1,338 properties.

Problem 5: Conditional Aggregation (Pivot)

Question: For each user in the transactions table, show how much they spent in each category: food, transport, utilities. Return user_id, food_total, transport_total, utilities_total.

SELECT
    user_id,
    SUM(CASE WHEN category = 'food'      THEN amount ELSE 0 END) AS food_total,
    SUM(CASE WHEN category = 'transport' THEN amount ELSE 0 END) AS transport_total,
    SUM(CASE WHEN category = 'utilities' THEN amount ELSE 0 END) AS utilities_total
FROM transactions
WHERE status = 'completed'
GROUP BY user_id
ORDER BY user_id;

What the interviewer is testing: Conditional aggregation, which is the SQL pivot pattern. PostgreSQL does not have a native PIVOT keyword. SUM(CASE WHEN ...) is the standard approach. This pattern appears in nearly every analytics interview. Know it cold.

Problem 6: Date Arithmetic

Question: For each completed flight, calculate the duration in minutes and flag flights over 180 minutes as 'long_haul'. Return flight_id, airline, origin, destination, duration_minutes, flight_type.

SELECT
    flight_id,
    airline,
    origin,
    destination,
    EXTRACT(EPOCH FROM (arrival_time - departure_time)) / 60   AS duration_minutes,
    CASE
        WHEN EXTRACT(EPOCH FROM (arrival_time - departure_time)) / 60 > 180
        THEN 'long_haul' ELSE 'short_haul'
    END                                                         AS flight_type
FROM flights
WHERE status = 'completed';

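What the interviewer is testing: Timestamp arithmetic. Subtracting two TIMESTAMP values in PostgreSQL gives an INTERVAL. EXTRACT(EPOCH FROM interval) converts it to seconds. Divide by 60 for minutes. In BigQuery, you would use TIMESTAMP_DIFF(arrival_time, departure_time, MINUTE). In Snowflake, the equivalent is DATEDIFF(minute, departure_time, arrival_time), with the unit first and the earlier timestamp before the later one. Know the form for your target platform.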

Problem 7: Deduplication

Question: The nse_trades table has duplicates due to a loading bug. Keep only the row with the highest volume per ticker + trade_date. Write a DELETE.

WITH ranked AS (
    SELECT trade_id,
        ROW_NUMBER() OVER (
            PARTITION BY ticker, trade_date
            ORDER BY volume DESC
        ) AS rn
    FROM nse_trades
)
DELETE FROM nse_trades
WHERE trade_id IN (SELECT trade_id FROM ranked WHERE rn > 1);

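What the interviewer is testing: Using a CTE inside a DELETE to isolate the rows to remove. The ROW_NUMBER() ranking logic is portable across PostgreSQL, BigQuery, and Snowflake, although the WITH ... DELETE form shown here is PostgreSQL syntax and other engines may need the ranking moved into a subquery. PostgreSQL also supports DISTINCT ON for this. If two duplicates share the same volume, ROW_NUMBER() picks one arbitrarily, so add trade_id to the ORDER BY when you need a deterministic result. I used this exact pattern to clean the NSE trade dataset before loading it into the analytics schema.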

Medium: The Patterns That Separate Candidates

Problem 8: Running Total and LAG

Question: For ticker 'SCOM', show each trade date with the closing price, the daily change in price, and the cumulative volume from the start of the year.

SELECT
    trade_date,
    close_price,
    close_price - LAG(close_price) OVER (ORDER BY trade_date)    AS daily_change,
    SUM(volume) OVER (
        ORDER BY trade_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )                                                             AS cumulative_volume
FROM nse_trades
WHERE ticker = 'SCOM'
  AND EXTRACT(YEAR FROM trade_date) = 2025
ORDER BY trade_date;

What the interviewer is testing: Two window function patterns together. LAG(col) fetches the value from the previous row. The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame on SUM() creates a running total. The frame clause is explicit here because the default frame for ordered windows (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) can give unexpected results when there are ties in the ORDER BY column.
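The tie behaviour is worth seeing once. The sketch below again uses sqlite3 as a stand-in for PostgreSQL (both default to a RANGE frame for ordered windows); the `trades` table with a duplicated date is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (trade_date TEXT, volume INTEGER)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    # Two rows share 2025-01-02, so they are peers in the ORDER BY.
    [("2025-01-02", 10), ("2025-01-02", 20), ("2025-01-03", 30)],
)

# Default frame (RANGE): peers are summed together, so both tied rows get 30.
default_frame = [r[0] for r in conn.execute(
    "SELECT SUM(volume) OVER (ORDER BY trade_date) FROM trades ORDER BY 1"
)]

# Explicit ROWS frame: the total grows one physical row at a time.
rows_frame = [r[0] for r in conn.execute(
    """
    SELECT SUM(volume) OVER (
        ORDER BY trade_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )
    FROM trades
    ORDER BY 1
    """
)]
```

With the ROWS frame the order of the two tied rows is still up to the engine, so add a tiebreaker column to the ORDER BY if the intermediate values matter.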

Problem 9: Moving Average with Partial Window Filtering

Question: Calculate a 7-day moving average of closing price for each ticker. Only show rows where a full 7-day window exists. Return ticker, trade_date, close_price, ma_7d.

WITH with_ma AS (
    SELECT
        ticker,
        trade_date,
        close_price,
        AVG(close_price) OVER (
            PARTITION BY ticker
            ORDER BY trade_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        )                                           AS ma_7d,
        COUNT(*) OVER (
            PARTITION BY ticker
            ORDER BY trade_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        )                                           AS window_size
    FROM nse_trades
)
SELECT ticker, trade_date, close_price, ROUND(ma_7d, 2) AS ma_7d
FROM with_ma
WHERE window_size = 7
ORDER BY ticker, trade_date;

What the interviewer is testing: Handling partial windows at the start of a time series. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW includes the current row plus the 6 preceding. For the first 6 rows per ticker, the window has fewer than 7 rows, so the average is based on incomplete data. The COUNT(*) OVER trick with the same frame detects partial windows cleanly. Note that ROWS counts rows, not calendar days, so this is really a 7-trading-day average and it assumes one row per ticker per trade date.
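The same full-window rule can be sketched outside SQL. This is an illustrative Python equivalent, not the query itself: `moving_average` is a hypothetical helper and the closing prices are invented.

```python
from collections import deque


def moving_average(prices, window=7):
    """Return (row_index, average) pairs, skipping rows without a full window."""
    buf = deque(maxlen=window)  # holds at most the last `window` prices
    out = []
    for i, price in enumerate(prices):
        buf.append(price)
        if len(buf) == window:  # same role as WHERE window_size = 7
            out.append((i, sum(buf) / window))
    return out


closes = [10, 11, 12, 13, 14, 15, 16, 17, 18]  # made-up closing prices
result = moving_average(closes)
```

The first six rows produce no output, which mirrors what the window_size filter does in the query.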

Problem 10: Year-Over-Year Comparison

Question: For each ticker, compare the average monthly closing price in 2025 vs 2024. Show the absolute difference and percentage change.

WITH monthly AS (
    SELECT
        ticker,
        EXTRACT(MONTH FROM trade_date)    AS month,
        EXTRACT(YEAR  FROM trade_date)    AS year,
        ROUND(AVG(close_price), 2)        AS avg_close
    FROM nse_trades
    WHERE EXTRACT(YEAR FROM trade_date) IN (2024, 2025)
    GROUP BY ticker,
             EXTRACT(MONTH FROM trade_date),
             EXTRACT(YEAR  FROM trade_date)
)
SELECT
    a.ticker,
    a.month,
    a.avg_close                                          AS avg_2024,
    b.avg_close                                          AS avg_2025,
    ROUND(b.avg_close - a.avg_close, 2)                  AS abs_change,
    ROUND((b.avg_close - a.avg_close)
          / NULLIF(a.avg_close, 0) * 100, 1)             AS pct_change
FROM monthly a
JOIN monthly b ON a.ticker = b.ticker AND a.month = b.month
WHERE a.year = 2024 AND b.year = 2025
ORDER BY a.ticker, a.month;

What the interviewer is testing: Self-joining a CTE on different year values. NULLIF(a.avg_close, 0) prevents a division-by-zero error if any 2024 average is zero, returning NULL instead of crashing. Always use NULLIF when dividing by user data.

Problem 11: Cohort Retention

Question: For each user, identify their first transaction month as their cohort. Then for each subsequent month, show how many users from that cohort were still active.

WITH first_txn AS (
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(created_at)) AS cohort_month
    FROM transactions
    WHERE status = 'completed'
    GROUP BY user_id
),
monthly_activity AS (
    SELECT DISTINCT
        t.user_id,
        DATE_TRUNC('month', t.created_at) AS activity_month
    FROM transactions t
    WHERE t.status = 'completed'
)
SELECT
    f.cohort_month,
    EXTRACT(YEAR  FROM AGE(m.activity_month, f.cohort_month)) * 12
      + EXTRACT(MONTH FROM AGE(m.activity_month, f.cohort_month)) AS months_since_start,
    COUNT(DISTINCT m.user_id)                                  AS active_users
FROM first_txn f
JOIN monthly_activity m ON f.user_id = m.user_id
GROUP BY f.cohort_month, m.activity_month
ORDER BY f.cohort_month, months_since_start;

What the interviewer is testing: Cohort analysis. The pattern is always the same: find each user's first event, then join back to all subsequent activity. DATE_TRUNC('month', ...) collapses timestamps to month granularity. AGE(later, earlier) gives the interval between the two months. EXTRACT(MONTH FROM ...) on its own returns only the month component of that interval (0 to 11), so a user active 13 months after their first transaction would be counted under month 1; adding EXTRACT(YEAR FROM ...) * 12 gives the true offset. This is one of the most commonly asked product analytics problems in data engineering interviews.
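One detail worth checking in any cohort query is that the month offset counts whole years as well as months. A small arithmetic sketch, with a hypothetical `months_between` helper and made-up dates:

```python
from datetime import date


def months_between(cohort_month: date, activity_month: date) -> int:
    """Whole-month offset between two month-start dates, including years."""
    return (activity_month.year - cohort_month.year) * 12 + (
        activity_month.month - cohort_month.month
    )


cohort = date(2024, 1, 1)
activity = date(2025, 2, 1)

offset = months_between(cohort, activity)

# Looking only at the month component wraps around after a year.
month_part_only = (activity.month - cohort.month) % 12
```

Here the true offset is 13 months, while the month component alone reports 1, which would merge second-year activity into the first few retention buckets.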

Problem 12: Funnel Analysis

Question: The events table tracks 'page_view', 'add_to_cart', 'checkout', and 'purchase'. Find how many users reached each stage. A user counts at a stage only if they completed all prior stages.

WITH stages AS (
    SELECT user_id,
        MAX(CASE WHEN event_type = 'page_view'   THEN 1 ELSE 0 END) AS did_view,
        MAX(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS did_cart,
        MAX(CASE WHEN event_type = 'checkout'    THEN 1 ELSE 0 END) AS did_checkout,
        MAX(CASE WHEN event_type = 'purchase'    THEN 1 ELSE 0 END) AS did_purchase
    FROM events
    GROUP BY user_id
),
funnel AS (
    SELECT
        SUM(did_view)                                                    AS views,
        SUM(CASE WHEN did_view=1 AND did_cart=1 THEN 1 END)             AS carts,
        SUM(CASE WHEN did_view=1 AND did_cart=1
                  AND did_checkout=1 THEN 1 END)                        AS checkouts,
        SUM(CASE WHEN did_view=1 AND did_cart=1
                  AND did_checkout=1 AND did_purchase=1 THEN 1 END)     AS purchases
    FROM stages
)
SELECT
    stage,
    users_reached,
    ROUND(users_reached::NUMERIC / LAG(users_reached) OVER (ORDER BY step) * 100, 1)
        AS conversion_from_prev
FROM (
    SELECT 1 AS step, 'page_view'   AS stage, views     AS users_reached FROM funnel
    UNION ALL
    SELECT 2,         'add_to_cart',           carts    FROM funnel
    UNION ALL
    SELECT 3,         'checkout',              checkouts FROM funnel
    UNION ALL
    SELECT 4,         'purchase',              purchases FROM funnel
) f
ORDER BY step;

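What the interviewer is testing: Two things. First, MAX(CASE WHEN ...) as a flag-per-user pivot. Second, UNION ALL plus LAG() to compute step-to-step conversion rates. The funnel is an ordered set of stages, so LAG() applied to the ordered output gives conversion from the previous step. Be ready to point out that this checks whether each event type exists for a user, not whether the events happened in order; enforcing order needs a timestamp comparison per stage.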

Problem 13: Consecutive Streaks (The Islands Pattern)

Question: For each ticker, find the longest consecutive streak of days where the closing price increased.

WITH daily_change AS (
    SELECT
        ticker,
        trade_date,
        CASE WHEN close_price > LAG(close_price) OVER (PARTITION BY ticker ORDER BY trade_date)
             THEN 1 ELSE 0 END AS is_up
    FROM nse_trades
),
streak_groups AS (
    SELECT
        ticker,
        trade_date,
        is_up,
        ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY trade_date)
        - ROW_NUMBER() OVER (PARTITION BY ticker, is_up ORDER BY trade_date) AS grp
    FROM daily_change
)
SELECT
    ticker,
    MAX(streak_len) AS longest_streak
FROM (
    SELECT ticker, grp, COUNT(*) AS streak_len
    FROM streak_groups
    WHERE is_up = 1
    GROUP BY ticker, grp
) streaks
GROUP BY ticker
ORDER BY longest_streak DESC;

What the interviewer is testing: The islands technique. It is one of the most consistently asked hard SQL patterns. The trick is subtracting two ROW_NUMBER() values: one ordered overall, one ordered within each group. For consecutive equal values, this subtraction stays constant, creating a unique group ID for each run. When the value changes, the subtraction shifts. Memorise this pattern. It comes up for streaks, sessions, consecutive activity, and any "how long did this condition hold" question.
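If the subtraction trick feels opaque, it helps to trace it by hand. The sketch below reproduces the two row numbers in plain Python for a single ticker; `longest_run` and the `is_up` flags are illustrative only.

```python
from collections import Counter, defaultdict


def longest_run(flags, target=1):
    """Longest consecutive run of `target`, using the row-number difference."""
    per_value_rn = defaultdict(int)  # ROW_NUMBER() partitioned by the flag value
    run_sizes = Counter()
    for overall_rn, value in enumerate(flags, start=1):  # ROW_NUMBER() overall
        per_value_rn[value] += 1
        grp = overall_rn - per_value_rn[value]  # constant inside one run
        if value == target:
            run_sizes[grp] += 1
    return max(run_sizes.values(), default=0)


is_up = [0, 1, 1, 0, 1, 1, 1]  # made-up daily up/down flags
longest = longest_run(is_up)
```

For this input the up-days fall into two groups (differences 1 and 2) with sizes 2 and 3, so the longest streak is 3.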

Problem 14: LAST_VALUE Frame Clause Gotcha

Question: For each ticker, show the opening price on the first trading day of each month and the closing price on the last trading day of the month.

WITH monthly_bounds AS (
    SELECT
        ticker,
        DATE_TRUNC('month', trade_date)   AS year_month,
        FIRST_VALUE(open_price) OVER (
            PARTITION BY ticker, DATE_TRUNC('month', trade_date)
            ORDER BY trade_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        )                                  AS month_open,
        LAST_VALUE(close_price) OVER (
            PARTITION BY ticker, DATE_TRUNC('month', trade_date)
            ORDER BY trade_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        )                                  AS month_close
    FROM nse_trades
)
SELECT DISTINCT
    ticker,
    year_month,
    month_open,
    month_close,
    ROUND((month_close - month_open) / NULLIF(month_open, 0) * 100, 2) AS monthly_return_pct
FROM monthly_bounds
ORDER BY ticker, year_month;

What the interviewer is testing: The LAST_VALUE frame clause trap. Without ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, LAST_VALUE simply returns the current row's value (or that of its last peer when there are ties), because the default frame for an ordered window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is the single most common window function mistake I see. FIRST_VALUE works correctly with the default frame because the frame always starts at the beginning of the partition. LAST_VALUE does not, because the default frame stops at the current row, not the end of the partition.
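A quick way to convince yourself is to run both frames side by side. This sketch uses sqlite3 as a stand-in for PostgreSQL (it assumes a SQLite build with window functions); the `px` table and its prices are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE px (trade_date TEXT, close_price REAL)")
conn.executemany(
    "INSERT INTO px VALUES (?, ?)",
    [("2025-01-02", 10.0), ("2025-01-03", 20.0), ("2025-01-06", 30.0)],
)

rows = conn.execute(
    """
    SELECT trade_date,
           -- Default frame ends at the current row.
           LAST_VALUE(close_price) OVER (ORDER BY trade_date) AS default_frame,
           -- Explicit frame spans the whole partition.
           LAST_VALUE(close_price) OVER (
               ORDER BY trade_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS full_frame
    FROM px
    ORDER BY trade_date
    """
).fetchall()
```

The default_frame column just echoes each row's own close, while full_frame returns the final close on every row.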

Problem 15: Gap Detection in Time-Series

Question: Find all date gaps for ticker 'EQTY' in 2025. The table should have an entry for every weekday.

WITH trading_days AS (
    SELECT gs.dt::DATE AS expected_date
    FROM generate_series(
        '2025-01-01'::DATE,
        '2025-12-31'::DATE,
        '1 day'::INTERVAL
    ) AS gs(dt)
    WHERE EXTRACT(DOW FROM gs.dt) NOT IN (0, 6)
),
actual_days AS (
    SELECT DISTINCT trade_date FROM nse_trades WHERE ticker = 'EQTY'
)
SELECT t.expected_date AS missing_date
FROM trading_days t
LEFT JOIN actual_days a ON t.expected_date = a.trade_date
WHERE a.trade_date IS NULL
ORDER BY t.expected_date;

What the interviewer is testing: generate_series() for sequence generation combined with the LEFT JOIN + WHERE NULL pattern for gap detection. This is a core data quality pattern. In BigQuery, use GENERATE_DATE_ARRAY('2025-01-01', '2025-12-31', INTERVAL 1 DAY). In Snowflake, use GENERATOR(ROWCOUNT => 365) with DATEADD. Know how to generate a date spine on your target platform.
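The date-spine idea is the same in any language: generate every expected date, then subtract what you actually have. A minimal Python sketch, with a hypothetical `missing_weekdays` helper and made-up loaded dates:

```python
from datetime import date, timedelta


def missing_weekdays(start: date, end: date, present: set) -> list:
    """Return weekdays in [start, end] that are absent from `present`."""
    gaps = []
    day = start
    while day <= end:
        # Python: Monday=0 ... Sunday=6 (PostgreSQL DOW uses Sunday=0, Saturday=6).
        if day.weekday() < 5 and day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Week of Monday 2025-01-06; Wednesday the 8th was never loaded.
loaded = {date(2025, 1, 6), date(2025, 1, 7), date(2025, 1, 9), date(2025, 1, 10)}
gaps = missing_weekdays(date(2025, 1, 6), date(2025, 1, 12), loaded)
```

A weekday spine still flags public holidays as gaps, so in practice you would join against an exchange calendar if one is available.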

Problem 16: Sessionization

Question: Group user events into sessions. A session ends when there is more than 30 minutes of inactivity. For each session, calculate session start, end, duration in minutes, and event count.

WITH lagged AS (
    SELECT
        user_id,
        created_at,
        LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_event_time
    FROM events
),
session_flags AS (
    SELECT *,
        CASE
            WHEN prev_event_time IS NULL
              OR created_at - prev_event_time > INTERVAL '30 minutes'
            THEN 1 ELSE 0
        END AS is_session_start
    FROM lagged
),
sessions AS (
    SELECT *,
        SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY created_at) AS session_id
    FROM session_flags
)
SELECT
    user_id,
    session_id,
    MIN(created_at)                                                              AS session_start,
    MAX(created_at)                                                              AS session_end,
    ROUND(EXTRACT(EPOCH FROM MAX(created_at) - MIN(created_at)) / 60, 1)        AS duration_minutes,
    COUNT(*)                                                                     AS event_count
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_start;

What the interviewer is testing: The sessionization pattern, which is the same islands technique applied to time gaps instead of value changes. Detect the gap with LAG(), flag session starts with a CASE, then take a running SUM() of those flags to create a session ID. Events inside a session add 0 to the running sum, so they share the same session ID; each session start adds 1 and begins a new ID. This pattern appears in almost every streaming and product analytics interview.
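The gap rule can also be sketched for a single user in plain Python. The `sessionize` helper and the timestamps are illustrative, and the 30-minute threshold matches the question above.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Group one user's events into sessions split by inactivity gaps."""
    sessions = []
    for ts in sorted(timestamps):
        # A new session starts on the first event or after a long enough gap.
        if not sessions or ts - sessions[-1][-1] > gap:
            sessions.append([ts])
        else:
            sessions[-1].append(ts)
    # (session_start, session_end, duration_minutes, event_count)
    return [
        (s[0], s[-1], (s[-1] - s[0]).total_seconds() / 60, len(s))
        for s in sessions
    ]


events = [
    datetime(2025, 1, 1, 9, 0),
    datetime(2025, 1, 1, 9, 10),
    datetime(2025, 1, 1, 9, 45),  # 35 minutes after the previous event
    datetime(2025, 1, 1, 10, 0),
]
sessions = sessionize(events)
```

The third event arrives 35 minutes after the second, so it opens a second session; the SQL version reaches the same split through the running sum of flags.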

Hard: What Separates Senior Candidates

Problem 17: Recursive CTE for Hierarchies

Question: Find the full management chain from every employee up to the CEO. Return employee_id, name, level (0 = CEO), and path.

WITH RECURSIVE hierarchy AS (
    -- Base case: CEO has no manager
    SELECT
        employee_id, name, manager_id,
        0          AS level,
        name::TEXT AS path
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: employees who report into the hierarchy
    SELECT
        e.employee_id, e.name, e.manager_id,
        h.level + 1,
        h.path || ' > ' || e.name
    FROM employees e
    JOIN hierarchy h ON e.manager_id = h.employee_id
)
SELECT employee_id, name, level, path
FROM hierarchy
ORDER BY path;

What the interviewer is testing: Recursive CTEs. A recursive CTE has two parts joined by UNION ALL: the base case (anchor) that returns the starting rows, and the recursive case that references the CTE itself to add one more level per iteration. The recursion terminates when no rows match the JOIN, meaning there are no more employees to add. PostgreSQL, BigQuery, Snowflake, and Redshift all support WITH RECURSIVE.
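You can practise the anchor-plus-recursive-step structure locally. This sketch runs a trimmed version of the query in sqlite3, used here as a stand-in for PostgreSQL; the four employees are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ada", None), (2, "Ben", 1), (3, "Cy", 2), (4, "Dee", 1)],
)

rows = conn.execute(
    """
    WITH RECURSIVE hierarchy AS (
        -- Anchor: the top-level employee has no manager
        SELECT employee_id, name, 0 AS level, name AS path
        FROM employees
        WHERE manager_id IS NULL

        UNION ALL

        -- Recursive step: attach direct reports of rows found so far
        SELECT e.employee_id, e.name, h.level + 1, h.path || ' > ' || e.name
        FROM employees e
        JOIN hierarchy h ON e.manager_id = h.employee_id
    )
    SELECT name, level, path
    FROM hierarchy
    ORDER BY path
    """
).fetchall()
```

If the data could contain a cycle (an employee who is indirectly their own manager), the recursion would never terminate, so a depth limit on level is a sensible safeguard to mention.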

Problem 18: Volume-Weighted Average Price with Ranking

Question: Calculate VWAP per ticker per month. Rank tickers by VWAP within each sector per month.

WITH monthly_vwap AS (
    SELECT
        DATE_TRUNC('month', trade_date)     AS month,
        sector,
        ticker,
        SUM(close_price * volume)::NUMERIC
        / NULLIF(SUM(volume), 0)            AS vwap
    FROM nse_trades
    GROUP BY DATE_TRUNC('month', trade_date), sector, ticker
)
SELECT
    month,
    sector,
    ticker,
    ROUND(vwap, 4)                                                             AS vwap,
    RANK() OVER (PARTITION BY month, sector ORDER BY vwap DESC)               AS sector_rank
FROM monthly_vwap
ORDER BY month, sector, sector_rank;

What the interviewer is testing: Weighted averages and multi-level window partitioning. VWAP is SUM(price * volume) / SUM(volume). It gives the average price weighted by how much was traded at each price. The RANK() OVER (PARTITION BY month, sector ...) applies ranking within each month-sector combination. Interviewers in fintech, trading, and analytics roles ask VWAP variants constantly.
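A two-trade example shows why the weighting matters. The `vwap` helper and the trades below are made up for illustration.

```python
def vwap(trades):
    """Volume-weighted average price for a list of (price, volume) pairs."""
    total_volume = sum(volume for _, volume in trades)
    if total_volume == 0:
        return None  # plays the role of NULLIF(SUM(volume), 0)
    return sum(price * volume for price, volume in trades) / total_volume


trades = [(10.0, 100), (20.0, 300)]  # (price, volume), made-up values

weighted = vwap(trades)                           # (10*100 + 20*300) / 400
simple = sum(p for p, _ in trades) / len(trades)  # ignores volume
```

The simple average is 15.0, but three quarters of the volume traded at 20, which pulls the VWAP up to 17.5.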

Problem 19: Running Balance with Signed Transactions

Question: Calculate each user's running account balance. Completed transactions add. Completed reversals subtract. Pending transactions have no effect.

WITH adjusted AS (
    SELECT
        user_id,
        created_at,
        CASE
            WHEN status = 'completed' AND category != 'reversal' THEN  amount
            WHEN status = 'completed' AND category = 'reversal'  THEN -amount
            ELSE 0
        END AS net_amount
    FROM transactions
)
SELECT
    user_id,
    created_at,
    net_amount,
    SUM(net_amount) OVER (
        PARTITION BY user_id
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_balance
FROM adjusted
WHERE net_amount != 0
ORDER BY user_id, created_at;

What the interviewer is testing: Transform first, then window. Convert rows into signed amounts in a CTE before applying any running total. Mixing the sign logic inside the window function makes the query both unreadable and error-prone. This pattern underpins any financial ledger query. In LedgerSync, I applied this exact structure to 1.5 million BOOST transaction rows.
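The same two-step shape can be sketched outside SQL. In this Python version the rows, the net_amount helper, and the category names are all hypothetical; step 1 plays the role of the CTE and step 2 the role of the window.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical rows: (user_id, created_at, status, category, amount)
transactions = [
    (1, "2026-01-01", "completed", "payment", 100),
    (1, "2026-01-02", "pending", "payment", 40),
    (1, "2026-01-03", "completed", "reversal", 30),
    (2, "2026-01-01", "completed", "payment", 50),
]

def net_amount(status, category, amount):
    """Step 1 (the CTE): turn a row into a signed amount."""
    if status != "completed":
        return 0
    return -amount if category == "reversal" else amount

signed = [
    (user_id, created_at, net_amount(status, category, amount))
    for user_id, created_at, status, category, amount in transactions
]
# Drop no-effect rows (WHERE net_amount != 0), then order by user and time
signed = sorted((row for row in signed if row[2] != 0), key=itemgetter(0, 1))

# Step 2 (the window): running sum within each user's rows
running = []
for _, group in groupby(signed, key=itemgetter(0)):
    group = list(group)
    balances = accumulate(row[2] for row in group)
    running.extend((uid, ts, net, bal) for (uid, ts, net), bal in zip(group, balances))

for row in running:
    print(row)
```

Because the sign is settled before any summing happens, the running total itself stays a plain cumulative sum.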

Problem 20: SCD Type 2 Change Detection

Question: Given a daily snapshot table listing_snapshots (listing_id, price, status, snapshot_date), find which listings had their price change between consecutive snapshots.

WITH with_prev AS (
    SELECT
        listing_id,
        snapshot_date,
        price                                                AS new_price,
        LAG(price) OVER (
            PARTITION BY listing_id
            ORDER BY snapshot_date
        )                                                    AS old_price
    FROM listing_snapshots
)
SELECT
    listing_id,
    snapshot_date,
    old_price,
    new_price,
    new_price - old_price AS price_change
FROM with_prev
WHERE old_price IS NOT NULL
  AND new_price != old_price
ORDER BY listing_id, snapshot_date;

What the interviewer is testing: Change detection logic. LAG() within PARTITION BY listing_id compares each row to the previous snapshot for the same listing. WHERE old_price IS NOT NULL excludes the first snapshot for each listing, which has no previous value. AND new_price != old_price keeps only rows where the price actually changed; if price can be NULL, that comparison silently drops the affected rows, so consider IS DISTINCT FROM in that case. The same compare-to-previous-version idea underlies SCD Type 2 snapshot models in dbt, which need to detect which records to close.
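As a rough illustration of what LAG() is doing, the Python sketch below walks each listing's snapshots in date order and remembers the previous price. The snapshot rows are invented for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical snapshots: (listing_id, snapshot_date, price)
snapshots = [
    ("A", "2026-05-01", 100),
    ("A", "2026-05-02", 100),
    ("A", "2026-05-03", 120),
    ("B", "2026-05-01", 80),
    ("B", "2026-05-02", 75),
]

changes = []
ordered = sorted(snapshots, key=itemgetter(0, 1))
for listing_id, rows in groupby(ordered, key=itemgetter(0)):
    old_price = None  # LAG() yields NULL for the first row of each partition
    for _, snapshot_date, new_price in rows:
        if old_price is not None and new_price != old_price:
            changes.append(
                (listing_id, snapshot_date, old_price, new_price, new_price - old_price)
            )
        old_price = new_price

for change in changes:
    print(change)
```

Resetting old_price at the start of every listing is the Python counterpart of PARTITION BY listing_id: without it, the first snapshot of listing B would be compared against the last snapshot of listing A.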

Problem 21: Pipeline Quality Checks in SQL

Question: Your pipeline loads daily NSE data. Write a query that verifies: no duplicate trade_ids, row count between 50 and 500, no NULLs in ticker/close_price/volume, and all prices positive. Return one row per check with status (PASS/FAIL) and detail.

WITH today_data AS (
    SELECT * FROM nse_trades
    WHERE trade_date = CURRENT_DATE
)
SELECT check_name, status, detail
FROM (
    -- Check 1: No duplicate trade_ids
    SELECT
        'no_duplicate_trade_ids'                                       AS check_name,
        CASE WHEN COUNT(*) = COUNT(DISTINCT trade_id)
             THEN 'PASS' ELSE 'FAIL' END                               AS status,
        'Duplicates: ' || (COUNT(*) - COUNT(DISTINCT trade_id))::TEXT  AS detail
    FROM today_data

    UNION ALL

    -- Check 2: Row count in expected range
    SELECT
        'row_count_in_range',
        CASE WHEN COUNT(*) BETWEEN 50 AND 500 THEN 'PASS' ELSE 'FAIL' END,
        'Row count: ' || COUNT(*)::TEXT
    FROM today_data

    UNION ALL

    -- Check 3: No NULLs in critical fields
    SELECT
        'no_nulls_in_critical_fields',
        CASE WHEN COUNT(*) FILTER (
            WHERE ticker IS NULL OR close_price IS NULL OR volume IS NULL
        ) = 0 THEN 'PASS' ELSE 'FAIL' END,
        'Null rows: ' || COUNT(*) FILTER (
            WHERE ticker IS NULL OR close_price IS NULL OR volume IS NULL
        )::TEXT
    FROM today_data

    UNION ALL

    -- Check 4: All prices positive
    SELECT
        'all_prices_positive',
        CASE WHEN MIN(close_price) > 0 THEN 'PASS' ELSE 'FAIL' END,
        'Min close_price: ' || COALESCE(MIN(close_price)::TEXT, 'NULL')
    FROM today_data
) checks
ORDER BY check_name;

What the interviewer is testing: Whether you think like a data engineer, not just a SQL writer. This is the native SQL equivalent of a Great Expectations test suite. COUNT(*) FILTER (WHERE ...) is PostgreSQL syntax for conditional counting without CASE WHEN. The UNION ALL pattern assembles multiple independent checks into a single result set. A candidate who writes something like this in an interview signals production pipeline experience.
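For comparison, here is a small Python sketch of the same four checks, returning one (check_name, status, detail) tuple per check. The sample rows and the run_checks function are hypothetical, and the row-count bounds are parameters so the example can run on four rows.

```python
# Hypothetical rows for one daily load: (trade_id, ticker, close_price, volume)
sample_rows = [
    (1, "AAA", 10.5, 1000),
    (2, "BBB", 20.0, 500),
    (2, "BBB", 20.0, 500),  # duplicate trade_id
    (3, None, 15.0, 200),   # NULL ticker
]

def run_checks(rows, min_rows=50, max_rows=500):
    """Return one (check_name, status, detail) tuple per check, sorted by name."""
    def status(ok):
        return "PASS" if ok else "FAIL"

    duplicates = len(rows) - len({row[0] for row in rows})
    null_rows = sum(1 for row in rows if any(value is None for value in row[1:]))
    prices = [row[2] for row in rows if row[2] is not None]
    min_price = min(prices) if prices else None  # MIN() over no rows is NULL
    min_detail = min_price if min_price is not None else "NULL"

    checks = [
        ("no_duplicate_trade_ids", status(duplicates == 0), f"Duplicates: {duplicates}"),
        ("row_count_in_range", status(min_rows <= len(rows) <= max_rows), f"Row count: {len(rows)}"),
        ("no_nulls_in_critical_fields", status(null_rows == 0), f"Null rows: {null_rows}"),
        ("all_prices_positive", status(min_price is not None and min_price > 0), f"Min close_price: {min_detail}"),
    ]
    return sorted(checks)

results = run_checks(sample_rows, min_rows=1, max_rows=10)
for check in results:
    print(check)
```

As in the SQL version, an empty load fails the price check rather than passing it, because the minimum of no rows is NULL and NULL > 0 is not true.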

The Pattern Reference

Every problem above maps to one of these recurring patterns. Knowing the right tool for each type of question is what makes the difference in a live interview.

Pattern	Tool
Top N per group	`RANK() OVER (PARTITION BY ... ORDER BY ...)`
Running total	`SUM() OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING)`
Moving average	`AVG() OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`
Previous row value	`LAG(col) OVER (PARTITION BY ... ORDER BY ...)`
Next row value	`LEAD(col) OVER (PARTITION BY ... ORDER BY ...)`
First/last in group	`FIRST_VALUE` / `LAST_VALUE` (always specify the frame)
Consecutive streak	`ROW_NUMBER()` minus `ROW_NUMBER() OVER group` (islands)
Session detection	`LAG` + gap flag + running `SUM` of flags
Median	`PERCENTILE_CONT(0.5)` or `ROW_NUMBER` trick
Weighted average	`SUM(val * weight) / SUM(weight)`
Pivot (long to wide)	`SUM(CASE WHEN category = 'X' THEN val END)`
Unpivot (wide to long)	`CROSS JOIN LATERAL VALUES` or `UNPIVOT`
Hierarchy traversal	`WITH RECURSIVE` CTE
Gap detection	`generate_series` left join + `WHERE ... IS NULL`
Deduplication	`ROW_NUMBER() = 1` or `DISTINCT ON` (PostgreSQL)
Division safety	`NULLIF(denominator, 0)`
Conditional count	`COUNT(*) FILTER (WHERE ...)` or `SUM(CASE WHEN ...)`
Point-in-time (SCD2)	`valid_from <= target_date AND valid_to > target_date`
YoY comparison	Self-join on same CTE aliased by year
Cohort retention	`MIN()` per user + join back to activity

What Interviewers Actually Look For

Beyond the correct answer, these are the signals a good interviewer reads:

Do you explain your approach before writing? Say "I'll use a CTE to rank within each partition, then filter" before typing. Interviewers want to hear how you think.

Do you know the difference between RANK and DENSE_RANK? Any window function problem will get a follow-up on this. Have a one-sentence answer ready.

Do you consider edge cases? Division by zero with NULLIF. NULLs in COUNT. Partial windows in moving averages. Mentioning these unprompted signals production experience.

Can you write it two ways? The subquery approach and the window function approach for the same problem. Knowing that the window function version is usually cleaner signals depth.

Do you know which patterns are database-specific? DISTINCT ON and generate_series are PostgreSQL features that Snowflake and BigQuery do not support. PIVOT and UNPIVOT work in Snowflake and BigQuery but not standard PostgreSQL. QUALIFY for post-window filtering works in Snowflake and BigQuery but not PostgreSQL. Know what is portable and what is not.

These problems are drawn from real SQL patterns I've used building NSE stock pipelines, Kenyan property analytics, job market intelligence, and financial ledger reconciliation. All portfolio projects are on my GitHub.

Follow me on dev.to for more on data engineering, dbt, and Airflow.

Terraform for Data Engineers: Provisioning GCS, BigQuery, S3, and Lambda Without Clicking Through Consoles

De' Clerke — Tue, 02 Jun 2026 21:45:06 +0000

Every data pipeline eventually needs a bucket. Then a second bucket. Then a BigQuery dataset, a service account with the right permissions, and a Lambda function to handle alerts. If you set all of that up through the GCP and AWS consoles, you get something that works once, is impossible to reproduce exactly, and will be misconfigured in the next project because you forgot which checkboxes you ticked. Terraform solves this by treating infrastructure as code: version-controlled, reviewable, and repeatable.

This article covers the patterns a data engineer actually needs. Not VPCs and Kubernetes clusters. GCS buckets, BigQuery tables with partitioning, S3 data lakes with lifecycle rules, Lambda functions for lightweight processing, and the IAM wiring that makes service accounts work without over-permissioning.

All provider versions in this article are current as of June 2026: Terraform 1.15.5, Google provider 7.34.0, AWS provider 6.47.0.

The Mental Model: State, Plan, Apply

Terraform works by comparing three things: what you wrote in your .tf files, what it last recorded in the state file, and what actually exists in the cloud. The core workflow is three commands:

terraform init    # download providers and modules
terraform plan    # show what will change without touching anything
terraform apply   # make the changes

terraform plan is the command you run the most. It shows exactly what will be created, modified, or destroyed before anything happens. A plan that shows a resource being replaced (-/+) when you expected it to be modified (~) is a signal to stop and read the plan carefully. Replacement destroys and recreates the resource, which means downtime for anything depending on it.

The state file (terraform.tfstate) is how Terraform knows what it manages. It contains the IDs, attributes, and dependencies of every resource it has created. Never edit it manually and never delete it. If the state file is lost, Terraform loses track of what it owns and will try to create everything from scratch.
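As a mental model only, the comparison behind terraform plan can be sketched in a few lines of Python. This is a deliberate simplification and not how Terraform is implemented: it ignores the refresh against the real cloud, and plan_actions, the resource names, and the force_new set are all hypothetical.

```python
def plan_actions(desired, state, force_new=frozenset()):
    """Compare desired config with recorded state and label each resource.

    '+' create, '~' update in place, '-/+' replace, '-' destroy.
    force_new holds attribute names that cannot be changed in place.
    """
    actions = {}
    for name, attrs in desired.items():
        if name not in state:
            actions[name] = "+"
        elif attrs != state[name]:
            recorded = state[name]
            changed = {
                key
                for key in attrs.keys() | recorded.keys()
                if attrs.get(key) != recorded.get(key)
            }
            actions[name] = "-/+" if changed & force_new else "~"
    for name in state:
        if name not in desired:
            actions[name] = "-"
    return actions

desired = {
    "bucket": {"name": "lake", "storage_class": "STANDARD"},
    "dataset": {"id": "raw"},
}
state = {
    "bucket": {"name": "lake-old", "storage_class": "STANDARD"},
    "table": {"id": "flights"},
}
print(plan_actions(desired, state, force_new={"name"}))
```

The sketch also shows why a lost state file is so disruptive: with an empty state, every resource in the config is labeled as a create, even if it already exists in the cloud.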

File Structure

Split your Terraform config across five files. Every file has a specific responsibility:

project/
├── main.tf          # resource definitions
├── variables.tf     # input variable declarations
├── outputs.tf       # output values
├── providers.tf     # provider and version config
└── terraform.tfvars # actual variable values (gitignore this if it has secrets)

providers.tf is where you pin versions. This is not optional.

terraform {
  required_version = ">= 1.9.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 7.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 6.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

provider "aws" {
  region = "us-east-1"
}

The ~> 7.0 constraint allows patch and minor updates (7.1, 7.34) but blocks major version upgrades (8.0). Major version bumps in both providers have historically included breaking changes. Pinning to a major version means terraform init -upgrade will not silently change provider behavior.
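The rule behind ~> is that only the rightmost component written in the constraint may increase. A minimal Python sketch of that rule, assuming plain numeric versions with at least a major and minor part (the satisfies helper is written for this example and is not part of any Terraform tooling):

```python
def parse(version):
    """Turn '7.34.0' into (7, 34, 0)."""
    return tuple(int(part) for part in version.split("."))

def pad(version, width):
    """Right-pad a version tuple with zeros so tuples compare evenly."""
    return version + (0,) * (width - len(version))

def satisfies(constraint, candidate):
    """Check a candidate version against a '~> X.Y' or '~> X.Y.Z' constraint."""
    base = parse(constraint.replace("~>", "").strip())
    if len(base) < 2:
        raise ValueError("this sketch expects at least MAJOR.MINOR")
    # Upper bound: drop the last component and bump the one before it
    upper = base[:-2] + (base[-2] + 1,)
    cand = parse(candidate)
    width = max(len(base), len(cand))
    return pad(base, width) <= pad(cand, width) < pad(upper, width)

print(satisfies("~> 7.0", "7.34.0"))     # minor and patch updates are allowed
print(satisfies("~> 7.0", "8.0.0"))      # the next major version is blocked
print(satisfies("~> 7.34.0", "7.35.0"))  # a three-part constraint only allows patches
```

Writing the constraint with three parts (~> 7.34.0) is the stricter choice: it accepts patch releases only.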

variables.tf declares inputs with types and validation:

variable "project_id" {
  description = "GCP project ID"
  type        = string
}

variable "region" {
  description = "Default region"
  type        = string
  default     = "africa-south1"   # Johannesburg; BigQuery and GCS available
}

variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Must be dev, staging, or prod."
  }
}

variable "labels" {
  type    = map(string)
  default = {}
}

terraform.tfvars provides the actual values. Add it to .gitignore if it contains credentials or account IDs you do not want public:

project_id  = "my-gcp-project-123"
region      = "africa-south1"
environment = "dev"
labels = {
  project    = "kenya-data-pipeline"
  managed_by = "terraform"
}

Remote State: Stop Storing State Locally

By default, Terraform writes terraform.tfstate to your local working directory. This works for solo projects and breaks the moment anyone else touches the infrastructure. Remote state keeps the file in a shared location with locking so two people cannot run terraform apply simultaneously and corrupt the state.

For GCP projects, use a GCS bucket as the backend:

# backend.tf
terraform {
  backend "gcs" {
    bucket = "my-project-terraform-state"
    prefix = "terraform/state/pipeline"
  }
}

For AWS projects, use S3 with a DynamoDB table for locking:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "state/pipeline/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}

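The DynamoDB table needs a LockID string partition key. Terraform 1.10 and later can instead lock natively in S3 with use_lockfile = true, and DynamoDB-based locking is deprecated in newer releases, so check the S3 backend documentation for your version. If you do use DynamoDB, create the table manually once before initializing: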

aws dynamodb create-table \
  --table-name terraform-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

After adding a backend, run terraform init again. It will ask whether to migrate the existing local state to the remote backend.

GCS: The Data Lake Bucket

A data lake GCS bucket with versioning, lifecycle rules, and uniform access control:

resource "google_storage_bucket" "data_lake" {
  name          = "${var.project_id}-data-lake-${var.environment}"
  location      = "US"
  storage_class = "STANDARD"
  force_destroy = false

  versioning {
    enabled = true
  }

  lifecycle_rule {
    condition { age = 90 }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  lifecycle_rule {
    condition { age = 365 }
    action { type = "Delete" }
  }

  uniform_bucket_level_access = true
  labels = var.labels
}

force_destroy = false prevents Terraform from deleting the bucket if it contains objects. If terraform destroy encounters a non-empty bucket, it fails with an error instead of silently deleting your data. Leave this as false on anything that contains data you care about.

GCS bucket names are globally unique across all GCP accounts. Including the project ID in the name (${var.project_id}-data-lake) avoids the Error 409: The requested bucket name is not available error, which you will hit if you try to create a bucket with a generic name like data-lake.

BigQuery: Datasets and Partitioned Tables

resource "google_bigquery_dataset" "raw" {
  dataset_id    = "raw"
  friendly_name = "Raw Layer"
  location      = "US"
  labels        = var.labels

  delete_contents_on_destroy = false

  access {
    role          = "OWNER"
    special_group = "projectOwners"
  }
}

resource "google_bigquery_table" "flights" {
  dataset_id          = google_bigquery_dataset.raw.dataset_id
  table_id            = "flights"
  project             = var.project_id
  deletion_protection = false

  time_partitioning {
    type  = "DAY"
    field = "departure_time"
  }

  clustering = ["airline", "origin"]

  schema = file("${path.module}/schemas/flights.json")
  labels = var.labels
}

Two things worth explaining here.

First, deletion_protection = false on the table resource. google_bigquery_table defaults deletion_protection to true, which makes any terraform destroy or apply that would delete the table fail, and Google provider 6.0 extended similar defaults to more resources. For BigQuery tables in a data pipeline project you plan to rebuild frequently, set it to false explicitly or terraform destroy will error out on the table.

Second, the combination of time_partitioning and clustering. Partitioning by day on departure_time means BigQuery scans only the relevant day partitions when you filter by date, reducing bytes processed and cost. Clustering by airline and origin within each partition further reduces scan size when you filter by those columns. For a table that receives daily appends and is queried by date and airline, this setup can reduce query cost by 80% or more compared to an unpartitioned table.

The schema file is a standard BigQuery JSON schema:

[
  {"name": "flight_id",       "type": "STRING",    "mode": "REQUIRED"},
  {"name": "airline",         "type": "STRING",    "mode": "NULLABLE"},
  {"name": "origin",          "type": "STRING",    "mode": "NULLABLE"},
  {"name": "departure_time",  "type": "TIMESTAMP", "mode": "NULLABLE"}
]

GCP IAM: Service Accounts for Pipelines

Never run a pipeline with personal credentials or a broad role like roles/editor. Create a service account with exactly the permissions needed.

resource "google_service_account" "pipeline" {
  account_id   = "data-pipeline-sa"
  display_name = "Data Pipeline Service Account"
  project      = var.project_id
}

resource "google_project_iam_member" "pipeline_bq_editor" {
  project = var.project_id
  role    = "roles/bigquery.dataEditor"
  member  = "serviceAccount:${google_service_account.pipeline.email}"
}

resource "google_project_iam_member" "pipeline_bq_job" {
  project = var.project_id
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:${google_service_account.pipeline.email}"
}

resource "google_storage_bucket_iam_member" "pipeline_gcs" {
  bucket = google_storage_bucket.data_lake.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.pipeline.email}"
}

Note the two separate BigQuery roles. roles/bigquery.dataEditor lets the service account read and write table data. roles/bigquery.jobUser lets it run query jobs. You need both for a pipeline that reads from and writes to BigQuery. Without jobUser, queries fail with a 403 even though the service account has data access.

For local development, generate a key file and set the environment variable:

resource "google_service_account_key" "pipeline_key" {
  service_account_id = google_service_account.pipeline.name
}

output "sa_key" {
  value     = base64decode(google_service_account_key.pipeline_key.private_key)
  sensitive = true
}

terraform output -raw sa_key > sa-key.json
export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/sa-key.json

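Add sa-key.json to .gitignore immediately. The private key is also stored in the Terraform state, so anyone who can read the state can read the key. For production and CI/CD, use Workload Identity Federation instead of key files.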

S3: Data Lake with Encryption and Lifecycle

Since AWS provider 4.0, S3 bucket settings are split into separate resources for each concern, unlike the older monolithic aws_s3_bucket with nested blocks. In provider 6.x, each setting is its own resource:

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.project_name}-data-lake-${var.environment}"
  tags   = var.tags
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "data_lake" {
  bucket                  = aws_s3_bucket.data_lake.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
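    id     = "archive-and-expire"
    status = "Enabled"

    # An empty filter applies the rule to every object; recent provider
    # versions warn when a rule has neither filter nor prefix.
    filter {}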

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    expiration {
      days = 365
    }
  }
}

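If you have existing Terraform code using the old aws_s3_bucket_object resource, it was renamed to aws_s3_object in AWS provider 4.x. A moved block can update the state reference without destroying and recreating the object, but moving between resource types requires Terraform 1.8 or later and support in the provider. If terraform plan rejects the move, fall back to terraform state rm followed by terraform import. The moved block looks like this: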

moved {
  from = aws_s3_bucket_object.schema_file
  to   = aws_s3_object.schema_file
}

Lambda: Lightweight Processing and Alerts

Lambda is useful in data pipelines for things that do not belong inside the main DAG: webhook receivers, lightweight event-driven transforms, and alert dispatchers. Here is the full pattern for a scheduled Python Lambda:

data "archive_file" "lambda_zip" {
  type        = "zip"
  source_dir  = "${path.module}/lambda"
  output_path = "${path.module}/lambda.zip"
}

resource "aws_lambda_function" "alert" {
  filename         = data.archive_file.lambda_zip.output_path
  function_name    = "pipeline-alert-${var.environment}"
  role             = aws_iam_role.lambda.arn
  handler          = "handler.lambda_handler"
  runtime          = "python3.12"
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256
  timeout          = 30
  memory_size      = 256

  environment {
    variables = {
      SNS_TOPIC_ARN = aws_sns_topic.alerts.arn
      ENVIRONMENT   = var.environment
    }
  }

  tags = var.tags
}

source_code_hash is what tells Terraform the code changed. Without it, Terraform only updates the function when the .tf file changes, not when the Python code in /lambda changes. With output_base64sha256, a new zip hash triggers a redeployment automatically on terraform apply.
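The value behind output_base64sha256 is a base64-encoded SHA-256 digest of the zip file. The Python sketch below computes the same kind of digest over two invented byte strings standing in for an old and a new build of the archive; base64sha256 is a helper written for this example.

```python
import base64
import hashlib

def base64sha256(data: bytes) -> str:
    """Base64-encode the SHA-256 digest of some bytes."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")

# Hypothetical archive contents before and after a code change
old_zip = b"def lambda_handler(event, context): return 'v1'"
new_zip = b"def lambda_handler(event, context): return 'v2'"

old_hash = base64sha256(old_zip)
new_hash = base64sha256(new_zip)

print(old_hash)
print(new_hash)
print("redeploy needed:", old_hash != new_hash)
```

Terraform stores the last applied hash in state. When the freshly computed hash differs, the plan shows source_code_hash changing and the function code is uploaded again.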

The supported Python runtimes as of June 2026 are python3.12, python3.13, and python3.14. python3.12 is a stable, widely tested choice for production. python3.9 reached Python EOL in October 2025 and Lambda deprecated it in early 2026. Do not use it for new functions.

The Lambda needs an IAM role:

resource "aws_iam_role" "lambda" {
  name = "pipeline-lambda-role-${var.environment}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_iam_role_policy" "lambda_sns" {
  name = "lambda-sns-publish"
  role = aws_iam_role.lambda.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["sns:Publish"]
      Resource = aws_sns_topic.alerts.arn
    }]
  })
}

AWSLambdaBasicExecutionRole grants CloudWatch Logs write access, which is the minimum a Lambda needs to emit logs. Everything else (SNS, S3, DynamoDB) needs explicit policy attachments.

To schedule the Lambda, use EventBridge:

resource "aws_cloudwatch_event_rule" "hourly" {
  name                = "hourly-pipeline-check"
  schedule_expression = "rate(1 hour)"
}

resource "aws_cloudwatch_event_target" "lambda" {
  rule      = aws_cloudwatch_event_rule.hourly.name
  target_id = "PipelineAlertLambda"
  arn       = aws_lambda_function.alert.arn
}

resource "aws_lambda_permission" "cloudwatch" {
  statement_id  = "AllowCloudWatchInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.alert.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.hourly.arn
}

The aws_lambda_permission resource is easy to miss. Without it, EventBridge will attempt to invoke the Lambda and get an access denied error, even though the EventBridge rule and target are configured correctly. Lambda requires explicit permission grants for each invoking service.

Modules: Reusing Patterns Across Projects

Once you write a GCS bucket with lifecycle rules and IAM correctly once, you do not want to rewrite it for every project. Extract it into a module:

modules/
└── gcs_data_lake/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf

# modules/gcs_data_lake/main.tf
resource "google_storage_bucket" "this" {
  name                        = var.bucket_name
  location                    = var.location
  project                     = var.project_id
  storage_class               = "STANDARD"
  uniform_bucket_level_access = true
  force_destroy               = false
  labels                      = var.labels

  versioning { enabled = true }

  lifecycle_rule {
    condition { age = var.nearline_days }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }
}

# modules/gcs_data_lake/variables.tf
variable "bucket_name"   { type = string }
variable "project_id"    { type = string }
variable "location" {
  type    = string
  default = "US"
}
variable "labels" {
  type    = map(string)
  default = {}
}
variable "nearline_days" {
  type    = number
  default = 90
}

# modules/gcs_data_lake/outputs.tf
output "bucket_name" { value = google_storage_bucket.this.name }
output "bucket_url"  { value = google_storage_bucket.this.url }

Use it from the root module:

module "landing_zone" {
  source        = "./modules/gcs_data_lake"
  project_id    = var.project_id
  bucket_name   = "${var.project_id}-landing-${var.environment}"
  location      = "US"
  labels        = var.labels
  nearline_days = 60
}

output "landing_bucket" {
  value = module.landing_zone.bucket_name
}

Run terraform init after adding a module reference. Without it, Terraform does not know the module exists.

Common Errors and Actual Fixes

State lock error after a crashed run:

Error: Error acquiring the state lock
Lock Info:
  ID: abc-123-def

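A previous Terraform run exited without releasing the lock. First confirm that no other apply is still running, because force-unlocking a live run can corrupt the state. Then release the lock with the ID from the error message: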

terraform force-unlock abc-123-def

GCP 403 permission error:

Error: googleapi: Error 403: The caller does not have permission

The service account running Terraform is missing an IAM role. Fix it in the Terraform config with google_project_iam_member, or temporarily with:

gcloud projects add-iam-policy-binding MY_PROJECT \
  --member="serviceAccount:sa@project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

GCS bucket name conflict:

Error: Error creating Bucket: googleapi: Error 409: The requested bucket name is not available

GCS bucket names are globally unique. Another account (or a previous version of your own project) already has that name. Add var.project_id or a random suffix to the bucket name.

GCP credentials not found:

Error: No valid credential sources found

Terraform cannot find GCP credentials. Fix with one of:

gcloud auth application-default login
# or
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa-key.json

AWS Lambda default timeout:

Lambda's default timeout is 3 seconds. Any function doing API calls, database writes, or anything with network latency will time out. Set it explicitly in the resource:

resource "aws_lambda_function" "alert" {
  # ... other arguments as in the Lambda section above ...
  timeout     = 30  # seconds; the default is 3
  memory_size = 256 # MB
}

Maximum timeout is 900 seconds (15 minutes).

AWS provider 6.x boolean values:

If you have existing configs that use "0" or "1" for boolean attributes, provider 6.x rejects them. Update to true or false:

# Old (fails in provider 6.x)
versioning_enabled = "1"

# Correct
versioning_enabled = true

CI/CD: Running Terraform in GitHub Actions

# .github/workflows/terraform.yml
name: Terraform

on:
  push:
    branches: [main]
    paths: ['terraform/**']
  pull_request:
    branches: [main]
    paths: ['terraform/**']

jobs:
  terraform:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~> 1.9"

      - name: Terraform Init
        working-directory: terraform/
        run: terraform init
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GCP_SA_KEY }}

      - name: Terraform Validate
        working-directory: terraform/
        run: terraform validate

      - name: Terraform Plan
        working-directory: terraform/
        run: terraform plan -out=tfplan
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GCP_SA_KEY }}
          TF_VAR_project_id: ${{ secrets.GCP_PROJECT_ID }}
          TF_VAR_environment: prod

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        working-directory: terraform/
        run: terraform apply -auto-approve tfplan
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GCP_SA_KEY }}

Pass sensitive variables through environment variables prefixed with TF_VAR_. Terraform picks them up automatically, mapping TF_VAR_project_id to var.project_id. This avoids putting credentials or project IDs in .tfvars files that might be committed.
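The naming convention is mechanical: strip the TF_VAR_ prefix and what remains is the variable name. A small Python sketch of that mapping, using an invented environment dictionary (tf_vars_from_env is a hypothetical helper, not something Terraform ships):

```python
PREFIX = "TF_VAR_"

def tf_vars_from_env(environ):
    """Map TF_VAR_-prefixed environment variables to Terraform variable names."""
    return {
        key[len(PREFIX):]: value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }

# Hypothetical environment, similar to what the workflow above sets
example_env = {
    "TF_VAR_project_id": "my-gcp-project-123",
    "TF_VAR_environment": "prod",
    "GOOGLE_CREDENTIALS": "<not a Terraform variable>",
}

print(tf_vars_from_env(example_env))
```

Running the helper over os.environ on a CI runner is a quick way to confirm which input variables a job will actually see. The part after the prefix has to match the variable name exactly, including case.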

The if: condition on Apply means the plan runs on every pull request but apply only runs when merged to main. Pull request authors see the plan output in the workflow logs before anything changes.

Terraform vs OpenTofu

In August 2023, HashiCorp changed Terraform's license from MPL 2.0 to the Business Source License (BSL 1.1, listed as BUSL-1.1 in SPDX). The BSL prohibits using Terraform in products that compete with HashiCorp's own offerings. OpenTofu is an open-source fork under the Linux Foundation that continued under MPL 2.0.

As of June 2026, both tools use the same HCL syntax and are largely compatible. OpenTofu 1.11 introduced ephemeral values (temporary credentials that never land in state), and its state encryption feature from 1.7 has no direct Terraform equivalent. A January 2026 survey found 31% of platform engineering teams had migrated at least one environment to OpenTofu.

For a data engineer building pipelines, the practical difference is minimal today. If you are using Terraform Cloud or HCP Terraform for remote state and collaboration, stay on Terraform. If you want open-source-only tooling or are concerned about the license, OpenTofu is close to a drop-in replacement: swap the terraform binary for tofu and the rest of your workflow stays largely the same.

The Three Files to Start With

Every new pipeline project gets three Terraform files from the start. They are the minimum needed to provision a data lake bucket and keep the state in a remote backend:

providers.tf: provider versions pinned to major version ranges, remote backend configured
variables.tf: project ID, region, environment, labels
main.tf: GCS bucket or S3 bucket with versioning, encryption, lifecycle, and public access block

Run terraform plan before every terraform apply. Read the plan. A plan that shows destruction where you expected modification is telling you something about how the resource handles updates. Trust the plan more than you trust your memory of what you configured.

The Terraform patterns in this article are drawn from multiple data engineering projects using GCP and AWS. Infrastructure code for the Kenya Economic Pulse and BizPulse Kenya pipelines is on my GitHub.

Snowflake for Data Engineers: The Mental Model Shifts That Actually Matter

De' Clerke — Tue, 02 Jun 2026 21:33:14 +0000

When I moved to Snowflake for the BizPulse Kenya project, my first reaction was that it felt familiar. The SQL was standard. The dbt models were identical. The Airflow DAG looked the same. Then I started hitting unexpected costs, slow loads, and behaviors that did not match what I expected from PostgreSQL. The SQL was the same. The mental model was completely different.

This article covers the shifts that matter, written from the perspective of someone who uses dbt, Python, and Airflow to build pipelines, not someone managing a Snowflake account full-time.

Compute and Storage Are Separate

In PostgreSQL, the database server handles both storage and query execution. In Snowflake, they are fully decoupled. Your data lives in cloud object storage (S3 or similar). Compute is provided by virtual warehouses, which are independent clusters you spin up on demand.

CREATE WAREHOUSE COMPUTE_WH
  WAREHOUSE_SIZE = 'X-SMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

This separation has one critical implication: you only pay for compute when a warehouse is running. A warehouse that stays running 24 hours a day costs 24 times more than one that runs for one hour. Set AUTO_SUSPEND = 60 on every warehouse. 60 seconds of inactivity before suspend is aggressive but appropriate for a batch pipeline that runs once a day.

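Warehouse sizing affects query speed directly. Each size step doubles both the compute and the credit rate: an X-SMALL warehouse has 1 server and bills 1 credit per hour, while a LARGE has 8 servers and bills 8. For a dbt run that builds 30 models in sequence, increasing the warehouse size reduces wall clock time. The credits-per-hour rate also increases, but when the run time drops proportionally, the cost is roughly the same. That holds for large scans and joins; small queries do not get 8 times faster on a LARGE, so oversizing them costs more. Where sizing matters most is for heavy workloads: multiple dbt threads, multiple concurrent queries, or large joins that exceed available memory.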
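The cost trade-off is simple arithmetic. The sketch below uses the standard warehouse credit rates (1 credit per hour for X-Small, doubling with each size) and the 60-second minimum billed each time a warehouse resumes; the run times are hypothetical and run_credits is a helper written for this example.

```python
CREDITS_PER_HOUR = {"X-SMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8}

def run_credits(size, minutes):
    """Credits for one run: per-second billing with a 60-second minimum."""
    billed_seconds = max(minutes * 60, 60)
    return CREDITS_PER_HOUR[size] * billed_seconds / 3600

# Hypothetical dbt run: 40 minutes on X-SMALL
baseline = run_credits("X-SMALL", 40)

# If LARGE really is 8x faster, the bill is the same
perfect_scaling = run_credits("LARGE", 5)

# If LARGE only halves the run time, the bill is 4x higher
poor_scaling = run_credits("LARGE", 20)

print(round(baseline, 3), round(perfect_scaling, 3), round(poor_scaling, 3))
```

The cost only stays flat when the speedup keeps pace with the credit rate, so it is worth timing a run at two sizes before committing to the larger one.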

For a typical data engineering project with a daily batch pipeline:

Use X-SMALL for dbt development and testing
Use SMALL or MEDIUM for production dbt runs if models are slow
Always AUTO_SUSPEND = 60 and AUTO_RESUME = TRUE

ALTER WAREHOUSE COMPUTE_WH SUSPEND;
ALTER WAREHOUSE COMPUTE_WH RESUME;
ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = 'SMALL';

You can resize a warehouse without dropping it. Resize up before a heavy job, resize down after.

There Are No Indexes: Micro-Partitions Do the Work

PostgreSQL requires you to create indexes explicitly. Snowflake does not have user-managed indexes at all. Data is automatically divided into micro-partitions, each holding 50 to 500 MB of uncompressed data and stored compressed in columnar form, and Snowflake tracks the min and max value of every column in every micro-partition. When your query has a WHERE clause, Snowflake prunes partitions whose min/max range cannot contain the matching values and never reads them.

This works well for monotonically increasing columns like timestamps, where each partition naturally contains a distinct range. It works less well for high-cardinality random columns where values are scattered across many partitions.
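
A toy model makes the difference concrete. The sketch below is not how Snowflake stores data internally; it only assumes fixed-size partitions that each remember a min and max, and compares the same values loaded in time order against the same values loaded in random order.

```python
import random


def build_partitions(values, rows_per_partition):
    """Split values into fixed-size partitions and record (min, max) for each."""
    parts = []
    for i in range(0, len(values), rows_per_partition):
        chunk = values[i:i + rows_per_partition]
        parts.append((min(chunk), max(chunk)))
    return parts


def partitions_to_scan(parts, lo, hi):
    """Count partitions whose [min, max] range overlaps the filter [lo, hi]."""
    return sum(1 for pmin, pmax in parts if pmax >= lo and pmin <= hi)


timestamps = list(range(10_000))       # loaded in time order
shuffled = timestamps[:]
random.Random(42).shuffle(shuffled)    # same values, scattered across partitions

ordered_parts = build_partitions(timestamps, 100)
scattered_parts = build_partitions(shuffled, 100)

# WHERE ts BETWEEN 5000 AND 5099
ordered_scan = partitions_to_scan(ordered_parts, 5000, 5099)
scattered_scan = partitions_to_scan(scattered_parts, 5000, 5099)
print(ordered_scan, "of", len(ordered_parts))
print(scattered_scan, "of", len(scattered_parts))
```

With ordered data the filter touches a single partition. With scattered data almost every partition has a min/max range wide enough to overlap the filter, so almost nothing is pruned.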

When a table is large and queries on it are slow, check the clustering depth with:

SELECT SYSTEM$CLUSTERING_INFORMATION('ANALYTICS.MARTS.FCT_ORDERS', '(TO_DATE(CREATED_AT))');

If the average depth is high, the partitions are not well-organized around your query column. You can declare a clustering key to reorganize them:

ALTER TABLE ANALYTICS.MARTS.FCT_ORDERS
  CLUSTER BY (TO_DATE(created_at), customer_id);

Snowflake then automatically reclusters the table in the background. This costs credits, so use it only on large tables (hundreds of millions of rows) that you query repeatedly on the same columns. For normal pipeline tables of a few million rows, micro-partition pruning is sufficient without explicit clustering.

COPY INTO: The Right Way to Load Data

The PostgreSQL COPY command streams data directly to the database server. Snowflake's COPY INTO loads from a stage, which is a named location pointing to internal storage or an external cloud bucket (S3, GCS, Azure Blob).

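-- Examples below assume the session is in ANALYTICS.RAW
USE SCHEMA ANALYTICS.RAW;

-- Named file format, referenced by the COPY INTO statements
CREATE FILE FORMAT CSV_FORMAT
  TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1;

-- Internal named stage
CREATE STAGE MY_STAGE
  FILE_FORMAT = (FORMAT_NAME = 'CSV_FORMAT');

-- Upload a file (from SnowSQL)
PUT file:///path/to/orders.csv @MY_STAGE AUTO_COMPRESS=TRUE;

-- Load into table
COPY INTO ANALYTICS.RAW.ORDERS
FROM @MY_STAGE/orders.csv.gz
FILE_FORMAT = (FORMAT_NAME = 'CSV_FORMAT')
ON_ERROR = 'SKIP_FILE';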

One behavior that is easy to miss: COPY INTO does not re-load a file that it has already successfully loaded into that table. Snowflake keeps load metadata per table for 64 days. If you re-run the same COPY INTO command with the same unchanged file, nothing happens. This is the right default for idempotent pipelines but will bite you if you need to load the same file again, for example after deleting bad rows from the target. Use FORCE = TRUE to override:

COPY INTO ANALYTICS.RAW.ORDERS
FROM @MY_STAGE/orders.csv.gz
FILE_FORMAT = (FORMAT_NAME = 'CSV_FORMAT')
FORCE = TRUE;

Before loading production data, validate first:

COPY INTO ANALYTICS.RAW.ORDERS
FROM @MY_STAGE/orders.csv.gz
FILE_FORMAT = (FORMAT_NAME = 'CSV_FORMAT')
VALIDATION_MODE = 'RETURN_ERRORS';

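This returns a list of parsing errors without writing anything to the table. A validation run that comes back clean means the file parses against the target table, so the actual load should succeed. VALIDATION_MODE does not support COPY statements that transform data during the load.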

For Python pipelines that build DataFrames and write to Snowflake, write_pandas is the right call, not to_sql. Under the hood, write_pandas stages the data and uses COPY INTO. The performance difference is the same as the COPY vs insert difference in PostgreSQL.

from snowflake.connector.pandas_tools import write_pandas

success, nchunks, nrows, _ = write_pandas(
    conn,
    df,
    table_name="ORDERS",
    schema="RAW",
    database="ANALYTICS",
    auto_create_table=True,
    overwrite=False,
)
print(f"Loaded {nrows} rows in {nchunks} chunks")

to_sql via SQLAlchemy works, but it issues row-by-row inserts and is slow for any meaningful volume.

VARIANT: Semi-Structured Data Without the Pain

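Snowflake's VARIANT type stores semi-structured data such as JSON, Avro, or XML natively. Unlike PostgreSQL's JSONB, you do not need GIN indexes for path queries. At load time Snowflake extracts commonly occurring paths into its columnar format where it can, so path queries on those elements can be pruned and scanned much like regular columns.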

CREATE TABLE RAW.EVENTS (
    event_id  VARCHAR(36),
    payload   VARIANT,
    loaded_at TIMESTAMP_NTZ
);

-- Query VARIANT paths with colon-dot notation
SELECT
    payload:event_id::VARCHAR        AS event_id,
    payload:user.id::NUMBER          AS user_id,
    payload:tags[0]::VARCHAR         AS first_tag,
    (payload:amount)::FLOAT          AS amount
FROM RAW.EVENTS;

The :: cast is effectively mandatory. Without it, every extracted value comes back as a VARIANT, not a typed column: strings keep their JSON double quotes, and comparisons, joins, and aggregations can fail or return wrong results.

For arrays inside VARIANT, use LATERAL FLATTEN to expand them into rows:

SELECT
    e.event_id,
    f.value::VARCHAR AS tag
FROM RAW.EVENTS e,
LATERAL FLATTEN(input => e.payload:tags) f;

This is equivalent to PostgreSQL's jsonb_array_elements, but the syntax is distinct enough that it catches people who copy Postgres patterns directly.

The practical use case: land raw API responses as VARIANT, run dbt staging models that extract and type the fields you need. This pattern lets you change what you extract later without re-loading the raw data.

Time Travel and Zero-Copy Cloning

Two Snowflake features that have no direct PostgreSQL equivalent.

Time travel lets you query any table as it existed at any point in the past, up to the retention period (1 day on Standard edition, up to 90 days on Enterprise).

-- What did this table look like 30 minutes ago?
SELECT * FROM ANALYTICS.RAW.ORDERS AT (OFFSET => -60*30);

-- Restore an accidentally dropped table
UNDROP TABLE ANALYTICS.RAW.ORDERS;
UNDROP SCHEMA ANALYTICS.RAW;

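This is a safety net, not a backup strategy. I have used it exactly once, after a dbt model accidentally ran with the wrong filter and truncated a staging table. UNDROP TABLE restored it in seconds. Note that UNDROP only applies to objects that were dropped or replaced; when rows were truncated or deleted in place, recover them by cloning or selecting from the table AT or BEFORE the bad statement.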

Zero-copy cloning creates a copy of a table, schema, or entire database that shares the underlying storage until you start modifying rows. Cloning is a metadata operation: it adds no storage cost up front and typically finishes quickly even for large sources.

-- Clone a table (instant, zero cost until modified)
CREATE TABLE STAGING.ORDERS_BACKUP CLONE RAW.ORDERS;

-- Clone a whole database for dev/testing
CREATE DATABASE ANALYTICS_DEV CLONE ANALYTICS;

The practical use case: create a _DEV clone of production before running a risky migration or a new dbt model that touches fact tables. If something goes wrong, drop the clone and try again. No restore from backup, no waiting.

Streams: CDC Without Extra Infrastructure

A stream is a change-data-capture object that tracks inserts, updates, and deletes on a table since the last time the stream was consumed.

CREATE STREAM ANALYTICS.RAW.ORDERS_STREAM ON TABLE ANALYTICS.RAW.ORDERS;

-- Read what has changed since last consumption
SELECT * FROM ANALYTICS.RAW.ORDERS_STREAM;
-- Returns: METADATA$ACTION ('INSERT' or 'DELETE'), METADATA$ISUPDATE, METADATA$ROW_ID

-- Consume the stream in a MERGE (marks it as consumed)
MERGE INTO ANALYTICS.STAGING.STG_ORDERS AS target
USING (
    SELECT * FROM ANALYTICS.RAW.ORDERS_STREAM
    WHERE METADATA$ACTION = 'INSERT'
) AS src
ON target.order_id = src.order_id
WHEN MATCHED THEN
    UPDATE SET status = src.status, amount = src.amount
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, status, amount, created_at)
    VALUES (src.order_id, src.customer_id, src.status, src.amount, src.created_at);

The stream is consumed (its offset advances) when a DML statement that reads from it commits. If the DML fails, the stream is not consumed and still contains all the changes. This makes it naturally idempotent: re-running the MERGE after a failure replays the same changes.
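
The consume-on-success behavior can be modeled in a few lines. This is a toy illustration in plain Python, not Snowflake's API: ToyStream, consume, and failing_merge are hypothetical names, and the apply callback stands in for the MERGE.

```python
class ToyStream:
    """Toy model of a stream: a change log plus an offset that moves only on success."""

    def __init__(self):
        self.changes = []   # change records appended by the source table
        self.offset = 0     # position of the first unconsumed change

    def pending(self):
        return self.changes[self.offset:]

    def consume(self, apply):
        """Run apply(batch); advance the offset only if it does not raise."""
        batch = self.pending()
        apply(batch)                # stands in for the MERGE
        self.offset += len(batch)   # reached only when apply succeeded
        return batch


stream = ToyStream()
stream.changes += [{"order_id": 1}, {"order_id": 2}]


def failing_merge(batch):
    raise RuntimeError("simulated failure")


try:
    stream.consume(failing_merge)
except RuntimeError:
    pass

pending_after_failure = len(stream.pending())   # changes are still there

target = []
stream.consume(target.extend)                   # the retry sees the same batch
pending_after_success = len(stream.pending())
print(pending_after_failure, pending_after_success, len(target))
```

The failed attempt leaves both changes pending, and the retry receives exactly the same batch before the offset moves.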

In a dbt + Airflow pipeline, streams are useful when you want to process only new data since the last DAG run without maintaining your own watermark table. Run a SnowflakeOperator (or SQLExecuteQueryOperator on newer provider versions, where SnowflakeOperator is deprecated) to consume the stream into your staging table, then run dbt on top of it.

dbt on Snowflake: The Config Options That Matter

Most dbt behavior is identical between PostgreSQL and Snowflake. A few Snowflake-specific options are worth knowing.

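Transient tables have no Fail-safe storage layer (the 7-day retention beyond Time Travel), and their Time Travel retention is capped at 1 day. For staging tables that are rebuilt every run and do not need point-in-time recovery, transient tables are cheaper. dbt-snowflake creates tables as transient by default, so the config below only makes that explicit; set transient=false on models that need Fail-safe.

{{ config(
    materialized='table',
    transient=true
) }}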

Per-model warehouse lets you use a larger warehouse only for the models that need it.

{{ config(
    materialized='incremental',
    unique_key='order_id',
    snowflake_warehouse='LARGE_WH'
) }}

Use this for fact table builds that scan hundreds of millions of rows. Leave everything else on the default warehouse.

Clustering keys in dbt are configured at the model level:

{{ config(
    materialized='table',
    cluster_by=['TO_DATE(created_at)', 'customer_id']
) }}

copy_grants=true preserves grants on the table after a full rebuild. Without it, every dbt run that rebuilds a table model drops and recreates the table, which drops all grants. Any downstream role that had SELECT on the table loses access until grants are re-applied.

{{ config(
    materialized='table',
    copy_grants=true
) }}

Query result caching returns the same result for identical queries on unchanged data, at zero compute cost. This is excellent for dashboards but misleading when benchmarking dbt models. Disable it before timing a model:

ALTER SESSION SET USE_CACHED_RESULT = FALSE;

Python Connection Setup

The connection string format for SQLAlchemy is different from PostgreSQL:

from sqlalchemy import create_engine
import os

engine = create_engine(
    "snowflake://{user}:{password}@{account}/{database}/{schema}"
    "?warehouse={warehouse}&role={role}".format(
        user      = os.environ['SNOWFLAKE_USER'],
        password  = os.environ['SNOWFLAKE_PASSWORD'],
        account   = os.environ['SNOWFLAKE_ACCOUNT'],
        database  = 'ANALYTICS',
        schema    = 'RAW',
        warehouse = 'COMPUTE_WH',
        role      = 'TRANSFORMER',
    )
)

The account value is your Snowflake account identifier, not the URL. It looks like xy12345.eu-west-1 or myorg-myaccount depending on whether you're on the old or new account format. If you pass the full URL including .snowflakecomputing.com, the connection fails.
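
If the account value comes from configuration that might hold a full URL, a small guard can reduce it to the identifier before building the connection string. This is a sketch: account_identifier is a hypothetical helper, not part of the Snowflake connector, and it only handles the .snowflakecomputing.com suffix.

```python
from urllib.parse import urlparse


def account_identifier(value: str) -> str:
    """Reduce a Snowflake URL or hostname to the bare account identifier."""
    value = value.strip()
    # urlparse only fills in hostname when a scheme is present
    host = urlparse(value).hostname if "://" in value else value.split("/")[0]
    suffix = ".snowflakecomputing.com"
    if host.lower().endswith(suffix):
        host = host[: -len(suffix)]
    return host


print(account_identifier("https://xy12345.eu-west-1.snowflakecomputing.com/console"))
print(account_identifier("myorg-myaccount.snowflakecomputing.com"))
print(account_identifier("myorg-myaccount"))
```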

For Airflow, set the connection via environment variable in your Docker Compose file to avoid storing credentials in the Airflow metadata database:

environment:
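  # Keep the URI on one line: a folded scalar inserts a space at the line break.
  # The path is the schema; account, database, warehouse, and role go in the query string.
  AIRFLOW_CONN_SNOWFLAKE_DEFAULT: "snowflake://USER:PASS@/RAW?account=ACCOUNT&database=ANALYTICS&warehouse=COMPUTE_WH&role=TRANSFORMER"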

Then reference it in tasks:

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

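def load_to_snowflake(**context):
    hook = SnowflakeHook(snowflake_conn_id='snowflake_default')
    df   = hook.get_pandas_df("SELECT * FROM MARTS.FCT_ORDERS")
    # Return something small: the return value is pushed to XCom,
    # which is not meant to carry whole DataFrames
    return len(df)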

Monitoring Cost and Query Performance

Unlike PostgreSQL where you optimize for CPU and I/O, in Snowflake you optimize for credit consumption and bytes scanned. Both are visible in the query history.

-- Slow queries in the last 24 hours
-- (ACCOUNT_USAGE views can lag by up to 45 minutes)
SELECT query_text,
       total_elapsed_time / 1000             AS seconds,
       bytes_scanned / 1e9                   AS gb_scanned,
       percentage_scanned_from_cache * 100   AS cache_hit_pct
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 20;

-- Credit usage by warehouse this week
SELECT warehouse_name, SUM(credits_used) AS total_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 2 DESC;

bytes_scanned is the key metric. Snowflake bills for warehouse time rather than bytes, but bytes scanned is the best proxy for how long a query keeps the warehouse busy. A query that scans 500 GB is expensive. A query that scans 50 MB is not, even if it runs 100 times a day. Micro-partition pruning reduces bytes scanned. The result cache eliminates the scan entirely for repeated identical queries.

percentage_scanned_from_cache reports how much of the scanned data came from the warehouse's local disk cache. A high value means less reading from remote storage, but the warehouse still runs and still bills. Only a result cache hit skips the warehouse altogether: a dashboard query that runs 200 times a day against unchanged data and consistently hits the result cache costs about the same as running it once.

If a query is scanning more data than expected, check the clustering depth of the table on the columns you are filtering by. A full scan on a 1 TB table when you are querying one week of data is a sign that pruning is not working, usually because the table is poorly clustered on the filter column.

What Is Different at a Glance

For anyone coming from PostgreSQL specifically, this is the quick mental model reset:

PostgreSQL	Snowflake
Manage indexes explicitly	No indexes; use clustering keys for large tables
psycopg2 COPY for bulk loads	COPY INTO from a stage
JSONB with GIN index	VARIANT with automatic columnar extraction of common paths
VACUUM to reclaim dead rows	No manual maintenance required
Connection pool tuning	Warehouse auto-suspend and sizing
`pg_stat_statements` for slow queries	`QUERY_HISTORY` in `INFORMATION_SCHEMA` or `ACCOUNT_USAGE`
COPY loads the file again on every run	COPY INTO skips files it has already loaded (use FORCE=TRUE to reload)

The biggest adjustment is not technical. It is accepting that some things you control in PostgreSQL are handled automatically in Snowflake, and the levers you do have (warehouse size, clustering, transient vs permanent tables) exist for cost reasons, not correctness reasons.

BizPulse Kenya uses Snowflake as the warehouse layer with dbt and FinBERT for Kenyan news sentiment analysis. The code is on my GitHub.


PostgreSQL for Data Engineers: Indexes, Bulk Loads, and the Patterns That Actually Matter

De' Clerke — Tue, 02 Jun 2026 21:27:29 +0000

The LedgerSync pipeline was inserting 1.5 million rows into PostgreSQL using pandas.to_sql(). It took four minutes per run. I switched to psycopg2's COPY command and it dropped to 18 seconds. Same data, same schema, same machine. That is not an optimization tip. It is the difference between a pipeline that fits in an Airflow schedule and one that does not. This article is about patterns like that: the ones that matter when you are building pipelines that run on a schedule, not when you are writing ad-hoc queries.

Loading Data: to_sql vs execute_values vs COPY

There are three ways to write rows from Python into PostgreSQL, and the performance gap between them is significant.

pandas to_sql issues one INSERT statement per row by default, or a multi-row INSERT with method="multi". It is the easiest to write and the slowest for any serious volume.

psycopg2 execute_values batches many rows into a single multi-row INSERT VALUES statement. About 5x faster than to_sql for medium-sized loads.

psycopg2 COPY streams rows directly to PostgreSQL using its native bulk-load protocol. No statement parsing, no row-by-row overhead. For LedgerSync at 1.5M rows, this was the one that mattered.

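import io

import pandas as pd
import psycopg2

conn = psycopg2.connect("host=localhost dbname=proj_db user=proj_user password=proj_pass")

def bulk_copy(df: pd.DataFrame, table: str, columns: list[str]):
    # Serialize to CSV in memory; to_csv quotes fields that contain commas or quotes
    buf = io.StringIO()
    df[columns].to_csv(buf, index=False, header=False)
    buf.seek(0)
    # FORMAT csv parses quoted fields and reads empty fields as NULL.
    # copy_from uses the text format and would split on embedded commas.
    # table and columns are interpolated as-is, so pass trusted names only.
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    with conn.cursor() as cur:
        cur.copy_expert(sql, buf)
    conn.commit()
    print(f"Loaded {len(df)} rows into {table}")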

Use COPY for initial loads and large backfills. For incremental daily writes of a few thousand rows, execute_values is fine and gives you more control over conflict handling:

from psycopg2.extras import execute_values

def bulk_insert(rows: list[dict], table: str):
    if not rows:
        return
    columns = list(rows[0].keys())
    values = [tuple(r[c] for c in columns) for r in rows]
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES %s ON CONFLICT DO NOTHING"
    with conn.cursor() as cur:
        execute_values(cur, sql, values, page_size=1000)
    conn.commit()

The rule: COPY for bulk historical loads, execute_values for daily incremental writes, to_sql only for one-off exploration.

Upserts: Building Idempotent Pipelines

A pipeline that runs on a schedule will re-encounter rows it has already written. Either you track what you've loaded externally, or you make the database handle it. ON CONFLICT makes the database handle it.

-- Requires a UNIQUE constraint on the natural key first
ALTER TABLE stock_prices ADD CONSTRAINT uq_symbol_timestamp
    UNIQUE (symbol, timestamp);

-- UPDATE on conflict: keep the latest version of each row
INSERT INTO stock_prices (symbol, close_price, volume, timestamp)
VALUES ('SCOM', 12.50, 500000, '2025-01-01 10:00:00')
ON CONFLICT (symbol, timestamp)
DO UPDATE SET
    close_price = EXCLUDED.close_price,
    volume      = EXCLUDED.volume;

-- DO NOTHING: skip if already exists, no error
INSERT INTO stock_prices (symbol, close_price, volume, timestamp)
VALUES ('SCOM', 12.50, 500000, '2025-01-01 10:00:00')
ON CONFLICT (symbol, timestamp) DO NOTHING;

From Python with SQLAlchemy:

from sqlalchemy.dialects.postgresql import insert

rows = [{"symbol": "SCOM", "close_price": 12.5, "timestamp": "2025-01-01 10:00:00"}]
stmt = insert(StockPrice).values(rows)
stmt = stmt.on_conflict_do_update(
    index_elements=["symbol", "timestamp"],
    set_={"close_price": stmt.excluded.close_price}
)
with engine.connect() as conn:
    conn.execute(stmt)
    conn.commit()

There is one pattern worth knowing: skip the update if the value has not actually changed. This avoids unnecessary writes on tables where triggers or replication are watching for changes.

ON CONFLICT (symbol, timestamp)
DO UPDATE SET
    close_price = EXCLUDED.close_price
WHERE stock_prices.close_price IS DISTINCT FROM EXCLUDED.close_price;

IS DISTINCT FROM handles NULLs correctly, unlike !=.
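
The difference is easiest to see by modeling SQL's three-valued logic, with Python's None standing in for NULL. The two functions below are illustrative sketches, not part of any library.

```python
def sql_not_equal(a, b):
    """SQL `a != b`: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b


def is_distinct_from(a, b):
    """SQL `a IS DISTINCT FROM b`: always True or False; two NULLs are not distinct."""
    if a is None and b is None:
        return False
    if a is None or b is None:
        return True
    return a != b


# Stored close_price is NULL, incoming value is 12.5
print(sql_not_equal(None, 12.5))      # None: a WHERE clause treats this as false
print(is_distinct_from(None, 12.5))   # True: the update runs
print(is_distinct_from(None, None))   # False: nothing to update
```

With !=, a row whose stored price is NULL would never be updated, because the comparison yields NULL and the WHERE clause rejects it.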

Indexes: Which Type and When

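PostgreSQL ships six built-in index access methods (B-tree, Hash, GiST, SP-GiST, GIN, BRIN), and any of them can be combined with a partial or expression definition. The five choices below are the ones that matter for pipeline work. Using the wrong one does not give you an error. It just gives you a slow query.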

B-tree is the default and covers equality, range queries, sorting, and IS NULL. Use it for everything that is not in the list below.

CREATE INDEX idx_stock_symbol   ON stock_prices (symbol);
CREATE INDEX idx_stock_ts       ON stock_prices (timestamp DESC);
CREATE INDEX idx_stock_sym_ts   ON stock_prices (symbol, timestamp DESC);

Composite index column order matters. Put the equality column first. If your most common query is WHERE symbol = 'SCOM' AND timestamp >= ..., the index on (symbol, timestamp) lets Postgres jump to the SCOM entries and read one contiguous range. An index on (timestamp, symbol) is far less effective: the range condition on the leading column forces Postgres to walk every index entry in the time range, for all symbols, and check the symbol on each one.
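
A sorted list of tuples behaves enough like a B-tree to show why. The sketch below uses made-up data (four sample symbols, integer timestamps) and counts how many index entries each column order has to visit for the same query.

```python
from bisect import bisect_left, bisect_right

# Sample data: 4 symbols x 1000 integer timestamps
rows = [(sym, ts) for sym in ("ABSA", "EQTY", "KCB", "SCOM") for ts in range(1000)]

by_symbol_ts = sorted(rows)                            # like an index on (symbol, timestamp)
by_ts_symbol = sorted((ts, sym) for sym, ts in rows)   # like an index on (timestamp, symbol)

# Query: WHERE symbol = 'SCOM' AND timestamp >= 900

# (symbol, timestamp): the matching entries form one contiguous slice
start_a = bisect_left(by_symbol_ts, ("SCOM", 900))
end_a = bisect_right(by_symbol_ts, ("SCOM", float("inf")))
visited_a = end_a - start_a

# (timestamp, symbol): every entry with timestamp >= 900 must be visited,
# and the symbol is checked on each one
start_b = bisect_left(by_ts_symbol, (900,))
visited_b = len(by_ts_symbol) - start_b
matches_b = sum(1 for ts, sym in by_ts_symbol[start_b:] if sym == "SCOM")

print(visited_a, visited_b, matches_b)
```

Both orders find the same 100 rows, but the (timestamp, symbol) order visits four times as many entries to get them. The gap grows with the number of symbols.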

GIN for full-text search, JSONB containment queries, and arrays.

CREATE INDEX idx_headline_gin ON news_articles USING GIN (to_tsvector('english', headline));
CREATE INDEX idx_raw_gin      ON api_responses USING GIN (raw);

-- Full-text search query
SELECT * FROM news_articles
WHERE to_tsvector('english', headline) @@ to_tsquery('english', 'Kenya & economy');

BRIN for very large tables where the physical row order matches the query order. A timestamp column that grows monotonically is the textbook case. The index is tiny (a few kilobytes versus megabytes for B-tree) because it stores only the min and max value per block range (128 pages by default), not every value.

CREATE INDEX idx_stock_ts_brin ON stock_prices USING BRIN (timestamp);

Do not use BRIN if rows arrive out of order. It only works because new rows have later timestamps than older rows, so the ranges per page do not overlap.

Partial index indexes only a subset of rows. If most of your queries filter on status = 'active' and that is 10% of the table, there is no reason to index the other 90%.

CREATE INDEX idx_active_listings ON listings (location, listed_at)
WHERE status = 'active';

This index is smaller, faster to build, and more likely to stay in memory than a full index. It only works for queries that include the WHERE condition from the index definition.

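Expression index indexes the result of a function, not a raw column value. If you always filter on the calendar date of a timestamp, a plain index on (timestamp) does not help because the function wraps the column. An expression index does, as long as the query repeats the indexed expression exactly. On a TIMESTAMPTZ column, DATE(timestamp) depends on the session time zone and PostgreSQL rejects it in an index because it is not IMMUTABLE, so pin the zone in the expression.

-- Query with the same expression: WHERE (timestamp AT TIME ZONE 'UTC')::date = DATE '2025-01-01'
CREATE INDEX idx_stock_date  ON stock_prices (((timestamp AT TIME ZONE 'UTC')::date));
CREATE INDEX idx_jsonb_symbol ON api_responses ((raw->>'symbol'));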

Reading EXPLAIN ANALYZE

Running EXPLAIN ANALYZE on a slow query and reading the output is the one skill that actually teaches you why a query is slow. The format looks intimidating the first time. Once you know what to look for, it is direct.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM stock_prices
WHERE symbol = 'SCOM'
  AND timestamp >= NOW() - INTERVAL '7 days';

The important fields:

Node type: The operation being performed. Seq Scan on a large table means no index was used. Index Scan means an index was used but rows still need to be fetched from the heap. Index Only Scan means all needed columns were in the index itself (fastest). Bitmap Heap Scan batches many index lookups and is efficient when returning a large fraction of the table.
rows= estimate vs actual: The planner's row estimate versus what actually happened. If the planner estimated 10 rows and got 100,000, the statistics are stale. Run ANALYZE table_name and re-run the query.
Buffers shared hit vs read: Hit means the data was in memory. Read means it came from disk. More hits is better. A query with a lot of disk reads on a table that should be cached suggests the table is too large to fit in shared_buffers or the query is not using an index.
Filter rows removed: The number of rows read but discarded by a filter condition. A large number here with a Seq Scan means no usable index covers your WHERE clause.
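
The estimate-versus-actual check is easy to automate once the plan is captured as text. The sketch below is one possible approach: misestimates is a hypothetical helper, the sample plan is made up for illustration, and the regex only targets the default text format of EXPLAIN ANALYZE.

```python
import re

# Matches the "(cost=... rows=N width=N) (actual time=... rows=N loops=N)" part of a node
PLAN_LINE = re.compile(
    r"\(cost=[\d.]+\.\.[\d.]+ rows=(?P<est>\d+) width=\d+\) "
    r"\(actual time=[\d.]+\.\.[\d.]+ rows=(?P<actual>[\d.]+) loops=(?P<loops>\d+)\)"
)


def misestimates(plan_text: str, factor: float = 10.0):
    """Return (line, estimated, actual) for nodes whose estimate is off by `factor` or more."""
    flagged = []
    for line in plan_text.splitlines():
        m = PLAN_LINE.search(line)
        if not m:
            continue
        est = int(m["est"])
        actual = int(float(m["actual"]))   # some versions print fractional row counts
        ratio = max(est, actual) / max(min(est, actual), 1)
        if ratio >= factor:
            flagged.append((line.strip(), est, actual))
    return flagged


# Made-up plan output for illustration
sample_plan = """\
Seq Scan on stock_prices  (cost=0.00..35811.00 rows=10 width=32) (actual time=0.020..210.512 rows=100000 loops=1)
  Filter: ((symbol)::text = 'SCOM'::text)
  Rows Removed by Filter: 1400000
"""

flagged = misestimates(sample_plan)
for line, est, actual in flagged:
    print(f"estimated {est}, actual {actual}: {line[:40]}")
```

A node flagged this way is a candidate for ANALYZE on the underlying table before any index work.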

Red flags at a glance:

What you see	What it means
`Seq Scan` on a large table	Missing index
Estimated rows far off from actual	Stale stats; run `ANALYZE`
Large `Sort` node	Consider an index on the `ORDER BY` column
`Nested Loop` with high `loops=`	Consider restructuring the join
`Filter` removing many rows	Index not reaching the right column

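Finding slow queries in production without running them manually: pg_stat_statements tracks cumulative execution stats. The module must be listed in shared_preload_libraries (a server restart is required) before the extension collects anything, and the mean_exec_time and total_exec_time columns carry those names from PostgreSQL 13 onward.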

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT query, calls, mean_exec_time, total_exec_time, rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

This shows you the top 10 queries by average execution time across all runs since the last reset. It is the first place to look when a dashboard or API endpoint starts getting slow.

Window Functions for Time-Series

Most data engineering work involves time-series: prices, logs, metrics, events. Window functions are what make time-series analysis readable in SQL.

Latest record per group with DISTINCT ON:

SELECT DISTINCT ON (symbol)
    symbol, close_price, timestamp
FROM stock_prices
ORDER BY symbol, timestamp DESC;

This returns one row per symbol: the one with the latest timestamp. It is usually shorter, and often faster, than a GROUP BY with a MAX(timestamp) subquery, especially with an index on (symbol, timestamp DESC).

Day-over-day change with LAG:

SELECT
    symbol,
    timestamp,
    close_price,
    LAG(close_price) OVER (PARTITION BY symbol ORDER BY timestamp) AS prev_price,
    ROUND(
        (close_price - LAG(close_price) OVER (PARTITION BY symbol ORDER BY timestamp))
        / NULLIF(LAG(close_price) OVER (PARTITION BY symbol ORDER BY timestamp), 0) * 100,
    4) AS pct_change
FROM stock_prices
ORDER BY symbol, timestamp;

The NULLIF(..., 0) prevents a division-by-zero error if any previous price is zero.

Moving average with ROWS BETWEEN:

SELECT
    symbol,
    timestamp,
    close_price,
    ROUND(AVG(close_price) OVER (
        PARTITION BY symbol ORDER BY timestamp
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 4) AS sma_7,
    ROUND(AVG(close_price) OVER (
        PARTITION BY symbol ORDER BY timestamp
        ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
    ), 4) AS sma_20
FROM stock_prices;

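ROWS BETWEEN 6 PRECEDING AND CURRENT ROW means the current row plus the 6 rows before it, which gives a 7-period moving average. The first 6 rows of each symbol average over fewer rows because the window is cut off at the start of the partition.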
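
The frame semantics can be reproduced in plain Python, which helps when checking SQL output by hand. This is a minimal sketch for a single symbol with made-up prices; moving_average is an illustrative helper.

```python
def moving_average(values, preceding):
    """AVG(...) OVER (ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW)."""
    out = []
    for i in range(len(values)):
        # The frame is clipped at the start of the partition
        window = values[max(0, i - preceding): i + 1]
        out.append(sum(window) / len(window))
    return out


prices = [10, 11, 12, 13, 14, 15, 16, 17]   # made-up closing prices
sma_7 = moving_average(prices, 6)
print(sma_7)
```

The early rows average over fewer than 7 values, which is worth remembering before treating the first week of an sma_7 column as a true 7-period average.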

Ranking within groups:

SELECT
    symbol,
    total_volume,
    RANK()       OVER (ORDER BY total_volume DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY total_volume DESC) AS dense_rank,
    NTILE(4)     OVER (ORDER BY total_volume DESC) AS quartile
FROM daily_volumes;

RANK skips numbers on ties (1, 2, 2, 4). DENSE_RANK does not (1, 2, 2, 3). NTILE(4) divides the result into four buckets that are as equal in size as possible. Use RANK when you need to know that positions were tied. Use NTILE for percentile bucketing.
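
The three functions are simple enough to re-implement, which makes the tie handling explicit. The sketch below uses made-up volumes and illustrative helper names; the NTILE version follows PostgreSQL's rule that the larger buckets come first when rows do not divide evenly.

```python
def rank(values):
    """RANK() OVER (ORDER BY value DESC): ties share a rank, then numbers skip."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in ordered]


def dense_rank(values):
    """DENSE_RANK() OVER (ORDER BY value DESC): ties share a rank, no gaps."""
    ordered = sorted(values, reverse=True)
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in ordered]


def ntile(n_rows, buckets):
    """NTILE(buckets): bucket sizes differ by at most one, larger buckets first."""
    base, extra = divmod(n_rows, buckets)
    out = []
    for b in range(buckets):
        out += [b + 1] * (base + (1 if b < extra else 0))
    return out


volumes = [900, 700, 700, 400, 100]   # made-up daily volumes
print(rank(volumes))
print(dense_rank(volumes))
print(ntile(len(volumes), 4))
```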

CTEs: When to Use Them and When Not To

CTEs make complex queries readable by naming intermediate results. They are not free.

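In PostgreSQL 12+, a CTE is inlined into the outer query by default when it is non-recursive, has no side effects, and is referenced only once; add MATERIALIZED to force the old behavior. Inlining lets the planner optimize across the CTE boundary. Before PostgreSQL 12, every CTE was an optimization fence: its result was always computed in full first, and the planner could not push filters down into it.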

Use CTEs when the query genuinely has multiple logical stages that benefit from being named:

WITH daily_avg AS (
    SELECT symbol, AVG(close_price) AS avg_price
    FROM stock_prices
    WHERE timestamp >= NOW() - INTERVAL '30 days'
    GROUP BY symbol
),
today_prices AS (
    SELECT DISTINCT ON (symbol)
        symbol, close_price AS today_price
    FROM stock_prices
    WHERE DATE(timestamp) = CURRENT_DATE
    ORDER BY symbol, timestamp DESC
)
SELECT
    t.symbol,
    t.today_price,
    d.avg_price,
    ROUND((t.today_price - d.avg_price) / d.avg_price * 100, 2) AS pct_above_30d_avg
FROM today_prices t
JOIN daily_avg d USING (symbol)
ORDER BY pct_above_30d_avg DESC;

If you find yourself writing a CTE just to avoid repeating an expression, use a subquery instead. CTEs that are only referenced once and contain no GROUP BY are usually cleaner as subqueries.

JSONB: When It Earns Its Weight

Storing raw API responses as JSONB is legitimate when the schema varies between records or when you need to land data quickly and parse it later. I used this in the NSE pipeline for raw market data before the schema was stable.

-- Extract values
SELECT
    raw->>'symbol'               AS symbol,
    (raw->>'price')::NUMERIC     AS price,
    raw->'meta'->>'source'       AS source
FROM api_responses;

-- Filter
SELECT * FROM api_responses WHERE raw->>'market' = 'NSE';
SELECT * FROM api_responses WHERE raw @> '{"status": "active"}';

-- Build JSONB in a query
SELECT jsonb_build_object('symbol', symbol, 'price', close_price) AS row_json
FROM stock_prices LIMIT 3;

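JSONB is fast for reads when indexed properly. A GIN index with the default operator class covers containment (@>) and key-existence (?, ?|, ?&) lookups across the whole document, but not equality filters on an extracted value such as raw->>'market' = 'NSE':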

CREATE INDEX idx_raw_gin ON api_responses USING GIN (raw);

For queries on a specific field you query often, an expression index on that path is smaller and faster than a full GIN index:

CREATE INDEX idx_raw_symbol ON api_responses ((raw->>'symbol'));

The rule: use JSONB when the schema is genuinely unknown or variable. If you know the fields, use typed columns. Typed columns are faster, smaller, and easier to index precisely.

pgvector: Vector Search in PostgreSQL

For JobSense, I needed semantic search over 604 job embeddings. pgvector adds a vector column type and three distance operators directly to PostgreSQL. No separate vector database required.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE job_embeddings (
    id        BIGSERIAL PRIMARY KEY,
    job_id    INTEGER NOT NULL,
    title     TEXT,
    embedding vector(384)
);

The dimension must match your embedding model. all-MiniLM-L6-v2 from HuggingFace outputs 384 dimensions. OpenAI text-embedding-ada-002 outputs 1536.

Similarity search:

SELECT
    job_id,
    title,
    1 - (embedding <=> '[0.12, -0.34, ...]'::vector) AS cosine_similarity
FROM job_embeddings
ORDER BY embedding <=> '[0.12, -0.34, ...]'::vector
LIMIT 10;

<=> is cosine distance (lower is more similar). <-> is L2 distance. <#> is negative inner product. For normalized embeddings from sentence transformers, cosine distance gives the most intuitive results.
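
The three operators correspond to short formulas, written out below in plain Python for reference. This is a sketch for small vectors held as lists; pgvector computes the same quantities natively inside PostgreSQL.

```python
import math


def l2_distance(a, b):
    """Equivalent of <-> : Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """Equivalent of <#> : inner product, negated so that smaller is more similar."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Equivalent of <=> : 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm


a = [0.6, 0.8]   # unit-length vectors
b = [1.0, 0.0]
print(cosine_distance(a, b), l2_distance(a, b), negative_inner_product(a, b))
```

For unit-length vectors the three measures rank results identically, since the squared L2 distance equals twice the cosine distance. The choice of operator only changes the ordering when embeddings are not normalized.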

For tables under a million rows, HNSW gives better query performance than IVFFlat:

CREATE INDEX idx_job_emb_hnsw ON job_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

IVFFlat is faster to build but less accurate. HNSW takes longer to build but has better recall and faster queries at search time. For a 600-row table the difference does not matter. At 100k rows it does.

Hybrid search combines vector similarity with a keyword filter:

-- Assumes job_embeddings also carries a description TEXT column,
-- which the table definition above does not include
SELECT job_id, title,
    1 - (embedding <=> '[...]'::vector) AS semantic_score
FROM job_embeddings
WHERE to_tsvector('english', description) @@ to_tsquery('english', 'Kenya & Airflow')
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;

This finds jobs that both contain the keywords and are semantically similar to the query. In practice it is more accurate than either keyword or vector search alone for short queries.

Materialized Views

A materialized view stores the result of a query as a physical table. When your dashboard queries a pre-aggregated summary of 50 million rows, the dashboard reads 5,000 rows from the materialized view instead of computing the aggregation every time.

CREATE MATERIALIZED VIEW daily_stock_summary AS
SELECT
    symbol,
    DATE(timestamp) AS trade_date,
    MAX(close_price) AS high,
    MIN(close_price) AS low,
    SUM(volume)      AS total_volume,
    COUNT(*)         AS tick_count
FROM stock_prices
GROUP BY symbol, DATE(timestamp)
WITH DATA;

CREATE UNIQUE INDEX uq_mv_summary ON daily_stock_summary (symbol, trade_date);

REFRESH MATERIALIZED VIEW CONCURRENTLY daily_stock_summary;

CONCURRENTLY refreshes without locking the view for reads. It requires a unique index. Without CONCURRENTLY, the refresh takes an exclusive lock and blocks all reads for the duration of the refresh.

Schedule the refresh in an Airflow DAG after the upstream data loads:

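from airflow.decorators import task
from sqlalchemy import text

# engine is the SQLAlchemy engine for the PostgreSQL database
@task
def refresh_summary():
    with engine.connect() as conn:
        conn.execute(text("REFRESH MATERIALIZED VIEW CONCURRENTLY daily_stock_summary"))
        conn.commit()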

The trade-off: materialized views are stale by the refresh interval. If your dashboard can tolerate data that is 15 minutes old, refresh every 15 minutes. If it needs real-time data, do not use a materialized view. Use a regular view instead, and put the right indexes on the underlying table.

Table Partitioning for High-Volume Time-Series

For a table that will grow to tens of millions of rows over a year, partitioning by time lets you query only the relevant months instead of scanning everything. It also makes dropping old data instant: DROP TABLE stock_prices_2023_01 runs in milliseconds. A DELETE on millions of rows does not.

CREATE TABLE stock_prices_partitioned (
    id          BIGSERIAL,
    symbol      VARCHAR(20) NOT NULL,
    close_price NUMERIC(12, 4),
    volume      BIGINT,
    timestamp   TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (timestamp);

CREATE TABLE stock_prices_2025_01
    PARTITION OF stock_prices_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE TABLE stock_prices_2025_02
    PARTITION OF stock_prices_partitioned
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

CREATE INDEX ON stock_prices_2025_01 (symbol, timestamp DESC);
CREATE INDEX ON stock_prices_2025_02 (symbol, timestamp DESC);

Inserts route to the correct partition automatically. Queries with a WHERE timestamp BETWEEN ... condition scan only the relevant partitions (partition pruning). Queries without a time filter still scan all partitions.

Add new partitions before data arrives in that period. Inserts into a period with no partition fail with an error.
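
Generating the DDL from a scheduled task avoids hand-writing a statement each month. The helper below is a sketch: month_partition_ddl and its naming scheme are assumptions modeled on the examples above, and it only builds the SQL string, leaving execution to whatever connection the pipeline already uses.

```python
from datetime import date


def month_partition_ddl(parent: str, prefix: str, year: int, month: int) -> str:
    """Build CREATE TABLE ... PARTITION OF for one calendar month."""
    start = date(year, month, 1)
    # First day of the following month; the upper bound is exclusive
    end = date(year + (month == 12), month % 12 + 1, 1)
    return (
        f"CREATE TABLE IF NOT EXISTS {prefix}_{start:%Y_%m}\n"
        f"    PARTITION OF {parent}\n"
        f"    FOR VALUES FROM ('{start}') TO ('{end}');"
    )


ddl = month_partition_ddl("stock_prices_partitioned", "stock_prices", 2025, 12)
print(ddl)
```

Running this for the next month or two on every DAG run is harmless because of IF NOT EXISTS, and it keeps the December-to-January rollover from being a special case.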

Maintenance You Cannot Skip

Dead row accumulation. PostgreSQL's MVCC model does not delete old row versions immediately. They become dead rows and take up space. The autovacuum process handles this automatically on most tables, but on high-write pipeline tables it can fall behind.

-- Check dead rows
SELECT relname, n_live_tup, n_dead_tup,
       ROUND(n_dead_tup::NUMERIC / NULLIF(n_live_tup, 0) * 100, 2) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY dead_pct DESC;

-- Force cleanup
VACUUM ANALYZE stock_prices;

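VACUUM ANALYZE marks the space held by dead rows as reusable and updates table statistics at the same time. It does not normally shrink the file on disk; that takes VACUUM FULL, which locks the table while it runs. Run VACUUM ANALYZE manually after a large delete or after a bulk load that changed the data distribution significantly.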

Statistics staleness. The query planner estimates row counts using statistics. When estimates are far off from actuals in EXPLAIN ANALYZE, statistics are stale:

ANALYZE stock_prices;

This updates statistics without reclaiming space. Run it after loading a large initial dataset or after significant changes to the data distribution.

Monitoring long-running queries:

SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - INTERVAL '5 minutes'
ORDER BY duration DESC;

A query that has been running for 30 minutes in a data pipeline is almost certainly stuck on a lock or performing a sequential scan it should not be doing. Kill it with pg_cancel_backend(pid) (graceful) or pg_terminate_backend(pid) (hard).

Checking what is blocking what:

SELECT pid, locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted;

Ungranted locks mean a query is waiting. The relation column tells you which table is involved.

A Practical Schema for Pipeline Tables

This is the pattern I use for any table that receives scheduled writes from a pipeline:

CREATE TABLE IF NOT EXISTS stock_prices (
    id          BIGSERIAL PRIMARY KEY,
    symbol      VARCHAR(20)  NOT NULL,
    close_price NUMERIC(12, 4),
    volume      BIGINT,
    source      VARCHAR(50),
    timestamp   TIMESTAMPTZ  NOT NULL,
    created_at  TIMESTAMPTZ  DEFAULT NOW(),
    CONSTRAINT uq_symbol_timestamp UNIQUE (symbol, timestamp)
);

CREATE INDEX idx_stock_symbol ON stock_prices (symbol);
CREATE INDEX idx_stock_ts     ON stock_prices (timestamp DESC);
CREATE INDEX idx_stock_sym_ts ON stock_prices (symbol, timestamp DESC);

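BIGSERIAL over SERIAL for the primary key because pipeline tables grow. UNIQUE on the natural key to enable upserts. Three indexes: symbol alone for lookups by equity, timestamp for recency queries, composite for the common case of filtering by both. Keep in mind that the UNIQUE constraint already creates its own index on (symbol, timestamp), and the composite index can also serve symbol-only lookups, so on write-heavy tables it is worth checking with EXPLAIN whether the single-column symbol index is actually being used.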

The created_at column records when the row was written to the database, separate from the event timestamp. This is useful for debugging when something loaded late and for incremental refresh patterns.

Connecting It to Your Pipeline

From Python, the engine setup that handles connection pool exhaustion and stale connections:

from sqlalchemy import create_engine
import os

engine = create_engine(
    os.getenv("DATABASE_URL"),
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
    pool_pre_ping=True,
)

pool_pre_ping=True tests each connection before using it. Without it, a connection that was closed while it sat idle in the pool (for example by the server's idle_session_timeout, a firewall or proxy idle timeout, or a database restart) will fail on first use with a cryptic error. With it, the pool detects the stale connection and creates a new one automatically.

For Docker-to-Docker connections (Airflow container talking to Postgres container), use the service name as the host, not localhost:

DATABASE_URL=postgresql+psycopg2://proj_user:proj_pass@postgres:5432/proj_db

localhost inside a container resolves to the container itself, not the host machine. The service name resolves via Docker's internal DNS.

These patterns come from building and running data pipelines on top of PostgreSQL across a dozen projects. The code for LedgerSync, JobSense, and the NSE stock pipeline is on my GitHub.

Follow me on dev.to for more on data engineering, Airflow, and dbt.

Web Scraping Kenyan Data Sources: What's Available, What Fights Back, and the Patterns That Keep Pipelines Running

De' Clerke — Tue, 02 Jun 2026 21:13:25 +0000

I've scraped property listings from BuyRentKenya (1,338 of them), parliament bills from the National Assembly site (319 PDFs), job postings from six Kenyan job boards, forex rates from CBK, and news articles from Business Daily and The Standard. Each of those sources has its own quirks, and a few of them actively fight back. This article covers what I learned building five production scrapers that ran on schedule without getting blocked.

Before writing a single line of scraping code, check the Network tab in your browser's DevTools. Many sites that look like they need scraping are actually hitting an undocumented JSON API in the background. Interact with the page, watch the XHR/Fetch tab, and look for the request that returns the data. If you find one, you're writing a two-line requests.get() call instead of parsing HTML. The NSE equities page works exactly this way.

Pick Your Tool Before You Start

Three tools cover almost every case. Knowing which one to reach for first saves a lot of wasted effort.

requests + BeautifulSoup for static HTML. This is the right starting point for most Kenyan government and news sites. Fast, lightweight, no browser overhead.

Playwright when requests returns empty content or a "loading..." placeholder. The page requires JavaScript to render its content. You need a real browser.

Official API when one exists. data.go.ke runs CKAN, CBK publishes exchange rates, EIA has a full API. Always check before scraping.

The test that tells you which one applies:

import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

url = "https://example.com/listings"  # placeholder: replace with the page you are testing
r = requests.get(url, headers=HEADERS, timeout=15)

if len(r.text) < 500 or "Just a moment" in r.text or "cf-browser-verification" in r.text:
    print("Cloudflare detected. Use Playwright-stealth.")
else:
    print(f"Static HTML. Got {len(r.text)} characters.")

A response under 500 characters usually means you got a challenge page or an empty JavaScript shell, not the actual content. "Just a moment" is Cloudflare's interstitial screen. Either way, plain requests will not work here.

Kenyan Data Sources: What Works for Each

data.go.ke: CKAN API

The Kenya Open Data portal runs CKAN, which has a documented REST API. You can list datasets, fetch metadata, and download resources programmatically without touching HTML.

CKAN_BASE = "https://www.opendata.go.ke/api/3/action"

def list_datasets() -> list[str]:
    r = requests.get(f"{CKAN_BASE}/package_list", timeout=15)
    r.raise_for_status()
    return r.json()["result"]

def get_dataset(dataset_id: str) -> dict:
    r = requests.get(f"{CKAN_BASE}/package_show", params={"id": dataset_id}, timeout=15)
    r.raise_for_status()
    return r.json()["result"]

Each dataset record contains a resources list with download URLs. Datasets cover agriculture, health, education, and population. Most are downloadable as CSV directly. Use the API to discover what's available; use a plain requests.get() to download the file.
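To get from a dataset record to the files worth downloading, filter the resources list by format. This is a small sketch under the assumption that each resource carries the standard CKAN fields name, format, and url; csv_resources is a hypothetical helper name and the sample record is made up for illustration.

```python
def csv_resources(dataset: dict) -> list[dict]:
    """Return name/url pairs for the resources whose format is CSV."""
    found = []
    for res in dataset.get("resources", []):
        # The format field is free text on many portals, so normalise it first
        fmt = (res.get("format") or "").strip().lower()
        url = res.get("url")
        if fmt == "csv" and url:
            found.append({"name": res.get("name") or "", "url": url})
    return found


# Made-up record shaped like a package_show result
sample = {
    "resources": [
        {"name": "Population by county", "format": "CSV", "url": "https://example.org/pop.csv"},
        {"name": "Methodology", "format": "PDF", "url": "https://example.org/notes.pdf"},
    ]
}
print(csv_resources(sample))
```

In practice you would pass the result of get_dataset() straight into this helper and then fetch each url with requests.get().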

CBK Forex Rates

The Central Bank publishes daily exchange rates at centralbank.go.ke/forex/. The page renders as a standard HTML table. pandas handles it in four lines.

from io import StringIO

import pandas as pd
import requests

def scrape_cbk_forex() -> pd.DataFrame:
    url = "https://www.centralbank.go.ke/forex/"
    r = requests.get(url, headers=HEADERS, timeout=15)
    r.raise_for_status()
    # Newer pandas versions deprecate passing a raw HTML string, so wrap it in StringIO
    tables = pd.read_html(StringIO(r.text))
    df = tables[0]
    df.columns = df.columns.str.strip()
    df["fetched_at"] = pd.Timestamp.now()
    return df

pd.read_html() finds all HTML tables on the page and returns them as a list of DataFrames. The forex table is the first one. This is faster than parsing with BeautifulSoup when the data you want is already in a <table> tag.

NSE Equities

The NSE live prices page makes a background API call. Open the Network tab, click XHR/Fetch, reload the page, and you will see requests to endpoints under nse.co.ke/api/. The equities endpoint returns JSON.

NSE_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://www.nse.co.ke/",
}

def fetch_nse_prices() -> list[dict]:
    r = requests.get(
        "https://www.nse.co.ke/api/equity/prices",
        headers=NSE_HEADERS,
        timeout=10
    )
    r.raise_for_status()
    return r.json()

The X-Requested-With: XMLHttpRequest header is required. Without it, the endpoint returns a 403. This is a common pattern on Kenyan sites that use jQuery Ajax internally.

Historical price data is a different story. It is locked behind PDF market reports, which means PDF parsing. See the parliament bills section below for the pattern.

Kenya National Assembly Bills

The parliament site lists bills at parliament.go.ke/the-national-assembly/bills. Each bill links to a PDF hosted on the same domain. The scraping is two steps: parse the HTML table to get the list, then download and parse each PDF.

import os
import tempfile
from urllib.parse import urljoin

import pdfplumber
import requests
from bs4 import BeautifulSoup

def scrape_national_assembly_bills() -> list[dict]:
    url = "http://parliament.go.ke/the-national-assembly/bills"
    r = requests.get(url, headers=HEADERS, timeout=20)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    bills = []
    for row in soup.select("table.bills-table tbody tr"):
        cols = row.find_all("td")
        if len(cols) < 3:
            continue  # skip header or malformed rows
        link = cols[0].find("a")
        bills.append({
            "title":   cols[0].get_text(strip=True),
            "status":  cols[1].get_text(strip=True),
            "date":    cols[2].get_text(strip=True),
            # Resolve relative hrefs against the listing page URL
            "pdf_url": urljoin(url, link["href"]) if link and link.get("href") else None,
        })
    return bills

def download_and_parse_bill(pdf_url: str) -> str:
    r = requests.get(pdf_url, headers=HEADERS, timeout=30)
    r.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
        f.write(r.content)
        tmp_path = f.name
    try:
        full_text = []
        with pdfplumber.open(tmp_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    full_text.append(text)
        return "\n".join(full_text)
    finally:
        os.unlink(tmp_path)

Write to a tempfile, parse, then delete it. Do not save the PDFs to disk permanently unless you specifically need them. In BungeWatch, I parsed 223 bills this way and stored only the extracted text and keyword counts.

For PDF tables, pdfplumber has a page.extract_tables() method. For bulk text extraction where speed matters more than precision, PyMuPDF (fitz) is faster.

import fitz  # PyMuPDF

def extract_text_fast(pdf_path: str) -> str:
    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

Use pdfplumber when the PDF has structured tables. Use fitz when you just need the raw text quickly.

Kenyan Job Boards

I pulled from six sources for JobSense. The breakdown:

Source	Tool	Notes
BrighterMonday	requests + BS4	Static HTML, pagination via `?page=N`
JobWeb Kenya	requests + BS4	Static HTML, some JavaScript on detail pages
MyJobMag	requests + BS4	Standard pagination
LinkedIn	Playwright + login	Requires authentication, JS rendering
Indeed Kenya	Playwright	Dynamic content

For LinkedIn, Playwright with a logged-in session is the only reliable approach. Store the login cookies after the first session and reuse them. Without cookies, LinkedIn redirects you to the login page after a few pages.

For BrighterMonday and JobWeb, a standard session with headers handles pagination cleanly:

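from urllib.parse import quote_plus

def scrape_brightermonday(keyword: str, max_pages: int = 10) -> list[dict]:
    # URL-encode the keyword so multi-word searches like "data engineer" survive
    base_url = f"https://www.brightermonday.co.ke/jobs?q={quote_plus(keyword)}"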
    session = build_session()
    all_jobs = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}&page={page}"
        soup = get_page(url, session)
        jobs = parse_job_listings(soup)
        if not jobs:
            break
        all_jobs.extend(jobs)

    return all_jobs

BuyRentKenya Property Listings

This was the Kenya Real Estate Pipeline. BuyRentKenya uses ?page=N pagination, static HTML, and responds cleanly to a browser User-Agent. 67 pages at about 20 listings each gives you the full dataset.

def scrape_buyrentkenya(max_pages: int = 70) -> list[dict]:
    base_url = "https://www.buyrentkenya.com/property-for-sale"
    session = build_session()
    all_listings = []

    for page in range(1, max_pages + 1):
        soup = get_page(f"{base_url}?page={page}", session)
        listings = parse_listings(soup)
        if not listings:
            break
        all_listings.extend(listings)

    return all_listings

The final run produced 1,338 listings. The site does not block scrapers as long as you include a realistic User-Agent and keep delays between requests.

Kenyan News Sites

Check for RSS before scraping HTML. Business Daily, The Standard, and The Star all publish RSS feeds. RSS gives you structured data (title, link, published date, description) with no parsing required.

import feedparser

def fetch_rss(feed_url: str) -> list[dict]:
    feed = feedparser.parse(feed_url)
    return [
        {
            "title":     e.title,
            "link":      e.link,
            "published": e.get("published"),
            "summary":   e.get("summary"),
        }
        for e in feed.entries
    ]

articles = fetch_rss("https://businessdailyafrica.com/rss/feed")

RSS only gives you headlines and summaries. For full article text, follow the links and scrape the article pages with requests + BS4. Most Kenyan news sites are static HTML. article tags with class names like article-body or post-content are the usual targets.

The Session Builder Pattern

Building a session once and reusing it across all pages is more efficient than creating a new connection for every request. It also lets you configure retry behavior in one place.

import requests
import time
import random
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

ua = UserAgent()

FULL_HEADERS = {
    "Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Advertise "br" only if the brotli package is installed; requests cannot decode it otherwise
    "Accept-Encoding": "gzip, deflate",
    "Connection":      "keep-alive",
    "Referer":         "https://www.google.com/",
}

def build_session() -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=2,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        # Return the last response instead of raising once retries run out,
        # so get_page() can still apply its own 429 handling
        raise_on_status=False,
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://",  HTTPAdapter(max_retries=retry))
    return session

def get_page(url: str, session: requests.Session) -> BeautifulSoup:
    headers = {**FULL_HEADERS, "User-Agent": ua.random}
    r = session.get(url, headers=headers, timeout=20)
    if r.status_code == 429:
        # Still rate limited after the adapter's retries: wait once more, then try again
        wait = int(r.headers.get("Retry-After", 30))
        print(f"Rate limited. Waiting {wait}s...")
        time.sleep(wait)
        r = session.get(url, headers={**FULL_HEADERS, "User-Agent": ua.random}, timeout=20)
    r.raise_for_status()
    time.sleep(random.uniform(1.5, 3.5))  # polite delay between pages
    return BeautifulSoup(r.text, "lxml")

Rotating the User-Agent on every request with ua.random keeps your requests looking like they come from different browsers. The backoff_factor=2 on the retry adapter makes retries back off exponentially: in current urllib3 releases the first retry is sent immediately, and the following ones wait about 4 and then 8 seconds. That covers most transient server errors without hammering the site.
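One detail in get_page is worth hardening. The Retry-After header is allowed to carry either a number of seconds or an HTTP date, and int() raises on the date form. The helper below is a sketch of how both forms could be handled with the standard library; retry_after_seconds is a hypothetical name, and the 30-second default simply mirrors the fallback used above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=30.0, now=None):
    """Convert a Retry-After header value into a non-negative wait in seconds."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "120"
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("120"))
```

Swapping this in for the int(...) call means an unexpected header format falls back to the default wait instead of crashing the scrape.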

Checkpoint and Resume for Long Scrapes

A scrape that covers 70 pages and fails on page 45 should resume from page 45, not page 1. Save a checkpoint after every successful page.

import json
import os

CHECKPOINT_FILE = "scrape_checkpoint.json"

def load_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"last_page": 0, "scraped_ids": []}

def save_checkpoint(state: dict):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

def scrape_with_checkpoint(base_url: str, max_pages: int = 100) -> list[dict]:
    state = load_checkpoint()
    start_page = state.get("last_page", 0) + 1
    scraped_ids = set(state.get("scraped_ids", []))
    all_results = []
    session = build_session()

    for page in range(start_page, max_pages + 1):
        try:
            soup = get_page(f"{base_url}?page={page}", session)
            items = parse_listings(soup)
            if not items:
                break

            new_items = [i for i in items if i.get("url") not in scraped_ids]
            all_results.extend(new_items)
            scraped_ids.update(i["url"] for i in new_items)

            save_checkpoint({"last_page": page, "scraped_ids": list(scraped_ids)})
            print(f"Page {page}: +{len(new_items)} items")

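        except Exception as e:
            print(f"Page {page} failed: {e}")
            # Stop here so the checkpoint still points at the last good page
            # and the next run retries this page instead of skipping it
            break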

    return all_results

Delete the checkpoint file after the scrape completes. If you leave it in place, the next scheduled run will start from the last saved page instead of the beginning, which breaks incremental scraping.
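Two small helpers make that lifecycle safer. This is a sketch with hypothetical names, not code from the projects above: the first writes the checkpoint to a temporary file and swaps it into place, so a crash mid-write cannot leave half a JSON file behind, and the second removes the checkpoint once a run has finished.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(state: dict, path: str) -> None:
    """Write the checkpoint to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        # os.replace is an atomic rename when both paths are on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def clear_checkpoint(path: str) -> None:
    """Remove the checkpoint after a complete run; a missing file is fine."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

Call clear_checkpoint() only when the loop ended because a page came back empty or max_pages was reached, not when it stopped on an error.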

JavaScript-Rendered Pages with Playwright

When requests gives you empty HTML or generic loading content, the page depends on JavaScript to render. Playwright runs a real Chromium browser.

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_playwright(url: str) -> BeautifulSoup:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0",
            viewport={"width": 1920, "height": 1080},
        )
        page = context.new_page()
        page.goto(url, timeout=30000, wait_until="domcontentloaded")
        page.wait_for_selector("div.listings", timeout=10000)
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)
        content = page.content()
        browser.close()
    return BeautifulSoup(content, "lxml")

Wait for a specific element with wait_for_selector rather than using wait_for_load_state("networkidle"). Waiting for networkidle means waiting until no network requests have fired for 500ms. On pages with telemetry, ads, or analytics loading in the background, that can take 10+ seconds or never settle. Waiting for the element you actually need is faster and more reliable.

For pages with a "Load More" button, click through until the button disappears:

while page.locator("button.load-more").is_visible():
    page.click("button.load-more")
    page.wait_for_timeout(1500)

When debugging a Playwright scraper, take a screenshot at the failure point:

try:
    page.wait_for_selector("div.listings", timeout=5000)
except Exception:
    page.screenshot(path="error_screenshot.png", full_page=True)
    raise

The screenshot shows you exactly what the browser rendered, which is almost always more useful than any error message.

Network Interception: When the Site Calls Its Own API

Some sites build their frontend as a JavaScript app that calls an internal API. Playwright can intercept those calls and grab the JSON directly, skipping HTML parsing entirely.

api_responses = []

def handle_response(response):
    if "api/jobs" in response.url and response.status == 200:
        try:
            api_responses.append(response.json())
        except Exception:
            pass

page.on("response", handle_response)
page.goto(url)
page.wait_for_load_state("networkidle")
print(f"Captured {len(api_responses)} API responses")

I used this on LinkedIn during the JobSense project. The page renders job cards, but underneath it is calling api.linkedin.com/graphql with a structured query. Intercepting those calls gives you structured JSON with no HTML parsing required.

BeautifulSoup Gotchas

The hyphen-class problem. On SBT Japan, every field on a car listing card uses class names like card-mileage, card-year, card-price. You would expect card.find(class_="card-mileage") to work. In my scraper it did not. BeautifulSoup itself matches hyphenated class names as ordinary strings, so when a lookup like this comes back empty the usual cause is that the class in the HTML that requests received is not the exact string DevTools shows (a different prefix, or markup rewritten by JavaScript). soup.select() only raises a malformed selector error when a class name needs escaping, for example one that starts with a digit.

The fix that held up is to walk the card's descendants and match on the stable end of the class name:

def find_by_hyphen_class(container, suffix: str):
    # Return the first descendant with a class that ends with the given suffix
    for el in container.find_all(True):
        classes = el.get("class", [])
        if any(cls.endswith(suffix) for cls in classes):
            return el
    return None

mileage_el = find_by_hyphen_class(card, "-mileage")
year_el    = find_by_hyphen_class(card, "-year")

The dual-element selector problem. BE FORWARD renders listings as <div class="stocklist-row"> in desktop view and <tr class="stocklist-row"> in table view. A selector for one tag type misses the other. Pass a list to find_all:

cards = soup.find_all(["div", "tr"], class_="stocklist-row")

Finding the right selector. When DevTools gives you a long CSS path that does not work in BeautifulSoup, dump all classes on the page and look for the right one:

classes = set()
for tag in soup.find_all(True):
    if tag.get("class"):
        classes.add(f"{tag.name}.{'.'.join(tag['class'])}")
for c in sorted(classes):
    print(c)

Scan the output for the class names that match the element you are targeting. This is faster than guessing and re-running the scraper.

Storing Data: Idempotent Upserts

A scraper that runs on a schedule will re-encounter listings it has already stored. Your storage layer needs to handle duplicates without creating them. Use a UNIQUE constraint on the natural key and ON CONFLICT DO UPDATE on writes.

CREATE TABLE IF NOT EXISTS listings (
    id          SERIAL PRIMARY KEY,
    url         TEXT UNIQUE,
    title       TEXT,
    price       NUMERIC(14, 2),
    location    VARCHAR(200),
    source      VARCHAR(50),
    scraped_at  TIMESTAMPTZ DEFAULT NOW()
);

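from sqlalchemy.dialects.postgresql import insert as pg_insert

# Assumes `engine` from create_engine(...) and `Listing`, a SQLAlchemy model
# (or Table) mapped to the listings table above
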
def upsert_listings(records: list[dict]):
    if not records:
        return
    with engine.begin() as conn:
        stmt = pg_insert(Listing).values(records)
        stmt = stmt.on_conflict_do_update(
            index_elements=["url"],
            set_={"price": stmt.excluded.price, "scraped_at": stmt.excluded.scraped_at},
        )
        conn.execute(stmt)
    print(f"Upserted {len(records)} rows")

The ON CONFLICT DO UPDATE updates price and scraped_at when a URL is already in the database. This means your table always has the latest price for each listing, not a growing pile of duplicates.
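To see the behaviour without a PostgreSQL instance, the same clause can be tried against SQLite, which has supported ON CONFLICT ... DO UPDATE since version 3.24. This is a stand-in for illustration only: the table is trimmed to two columns, and upsert_demo is a hypothetical function, not part of the pipeline.

```python
import sqlite3


def upsert_demo(rows: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Insert (url, price) rows into an in-memory table, updating price on conflict."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE listings (url TEXT UNIQUE, price REAL)")
    conn.executemany(
        """
        INSERT INTO listings (url, price) VALUES (?, ?)
        ON CONFLICT (url) DO UPDATE SET price = excluded.price
        """,
        rows,
    )
    conn.commit()
    result = conn.execute("SELECT url, price FROM listings ORDER BY url").fetchall()
    conn.close()
    return result


# The same URL appears twice; only the latest price survives
print(upsert_demo([
    ("https://example.com/a", 100.0),
    ("https://example.com/b", 50.0),
    ("https://example.com/a", 120.0),
]))
```

Three inserts produce two rows, and the repeated URL keeps its most recent price. That is the property that makes a scheduled scraper safe to re-run.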

Scheduling with Airflow

Once the scraper works, wrapping it in an Airflow DAG gives you scheduled runs, retries, and failure alerts without hand-rolling any of that logic.

from airflow.decorators import dag, task
from datetime import datetime, timedelta

@dag(
    schedule="0 6 * * *",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    max_active_runs=1,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    tags=["scraping", "real-estate"],
)
def kenya_real_estate_dag():

    @task
    def scrape() -> list[dict]:
        from scraper import scrape_all_pages
        return scrape_all_pages(
            "https://www.buyrentkenya.com/property-for-sale",
            max_pages=70
        )

    @task
    def store(records: list[dict]) -> int:
        upsert_listings(records)
        return len(records)

    @task
    def report(count: int):
        print(f"Scrape complete: {count} listings loaded")

    records = scrape()
    count = store(records)
    report(count)

kenya_real_estate_dag()

Two things worth noting here. First, max_active_runs=1 prevents overlap if a run takes longer than the schedule interval. A scrape of 70 pages with polite delays can take 5 to 10 minutes. Without this setting, a second run can start before the first one finishes, both writing to the same table at the same time. Second, retries: 2 with a 10-minute retry delay covers transient network failures without hammering the site immediately.

What the Kenyan Ecosystem Actually Looks Like

After building scrapers for most of these sources, here is the honest picture:

What is well-structured: data.go.ke (CKAN API), CBK forex (clean HTML table), job boards (mostly static HTML with predictable pagination), news RSS feeds.

What requires more work: NSE historical data (PDF reports), parliament bills (HTML table plus PDF per bill), BuyRentKenya (static HTML but needs checkpoint for 70 pages), news article full text (follow links after RSS).

What actively resists scraping: LinkedIn (requires login, JS rendering), CarFromJapan and similar Cloudflare-protected sites (need Playwright-stealth or a proxy). For Cloudflare sites, playwright-stealth patches the browser context to remove the JavaScript signals that identify it as a headless browser.

The general principle that holds across all of them: check the Network tab before writing any code, respect robots.txt, use polite delays, and build idempotent storage from the start. A scraper that breaks after 30 minutes because of a duplicate key error is not a production scraper.
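Checking robots.txt does not need a third-party library. The sketch below uses urllib.robotparser from the standard library and parses the rules from a string so it can run offline; the rules and URLs are made-up examples, and in a real scraper you would fetch the site's robots.txt once per run and reuse the parser.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration
rules = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""
robots = build_robots(rules)
print(allowed(robots, "https://example.com/property-for-sale?page=2"))
print(allowed(robots, "https://example.com/admin/users"))
print(robots.crawl_delay("*"))
```

If the site publishes a Crawl-delay, it is a reasonable lower bound for the sleep between pages.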

These patterns come from five data engineering projects that scrape Kenyan sources on a schedule. The code for Kenya Real Estate, BungeWatch, JobSense, and BizPulse Kenya is on my GitHub.

Follow me on dev.to for more articles on data pipelines, dbt, and Airflow.

GitHub Actions for Data Pipelines: The Setup I Use Across All My Projects

De' Clerke — Tue, 02 Jun 2026 21:04:43 +0000

Every data engineering repo I push has a GitHub Actions workflow in it. Not because it is required, but because I got burned enough times pushing code that "worked on my machine" and broke in production. After setting up CI across 30+ pipeline projects, the same patterns come up every time and so do the same gotchas. This article covers the setup that actually works for Python pipelines, dbt, Docker, and Airflow DAG validation.

What GitHub Actions Is (In One Paragraph)

GitHub Actions is GitHub's built-in CI/CD system. You write a YAML file in .github/workflows/, push it, and GitHub runs it on a virtual machine whenever the trigger fires. The machine is ephemeral: it spins up, runs your steps, and disappears. You get 2,000 free minutes per month on private repos and unlimited minutes on public ones.

Four terms matter:

Workflow is the .yml file. You can have multiple.
Job is a group of steps that share a runner. Jobs run in parallel by default.
Step is a single command or a reusable action from the Marketplace.
Runner is the VM. ubuntu-latest is free and covers most use cases.

That is the mental model. Everything else builds on it.

The Base Workflow for a Python Data Pipeline

This is the starting point for every project. Tests run on push to main and on every pull request.

# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  PYTHON_VERSION: '3.11'

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'pip'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run tests
        run: pytest tests/ -v --tb=short

The cache: 'pip' on setup-python is not optional for data engineering projects. Your requirements.txt likely pulls in pandas, SQLAlchemy, Airflow providers, dbt, or all of the above. Without caching, you pay that install time on every single run. With caching, it restores from a cache keyed to requirements.txt and you skip most of the download.

Adding PostgreSQL as a Service

Most pipeline tests need a real database. GitHub Actions supports service containers, which spin up alongside your job and are accessible on localhost.

jobs:
  test:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - run: pip install -r requirements.txt

      - name: Run tests
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost:5432/testdb
        run: pytest tests/ -v --tb=short --junitxml=reports/junit.xml

The options: block with --health-cmd is not optional. Without it, your steps start running the moment the container starts, not when Postgres is ready to accept connections. The first time I ran a test suite without health checks, half the tests failed with connection refused errors that disappeared when I re-ran the same workflow 30 seconds later. GitHub does not wait for services automatically. You have to tell it what "ready" means.

Pass DATABASE_URL at the step level, not the workflow level. Secrets and sensitive env vars belong as close to the step that uses them as possible.

dbt CI: The Profiles Problem

dbt reads connection credentials from ~/.dbt/profiles.yml. That file is never committed to the repo. In CI, the home directory is empty, so dbt fails immediately on dbt debug. The fix is to generate the file in a step.

# .github/workflows/dbt.yml
name: dbt CI

on:
  pull_request:
    branches: [main]
    paths:
      - 'dbt/**'

jobs:
  dbt-ci:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: ${{ secrets.DBT_POSTGRES_USER }}
          POSTGRES_PASSWORD: ${{ secrets.DBT_POSTGRES_PASSWORD }}
          POSTGRES_DB: ${{ secrets.DBT_POSTGRES_DB }}
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dbt
        run: pip install dbt-core dbt-postgres

      - name: Write profiles.yml
        run: |
          mkdir -p ~/.dbt
          cat > ~/.dbt/profiles.yml << EOF
          my_project:
            target: ci
            outputs:
              ci:
                type: postgres
                host: localhost
                port: 5432
                user: ${{ secrets.DBT_POSTGRES_USER }}
                password: ${{ secrets.DBT_POSTGRES_PASSWORD }}
                dbname: ${{ secrets.DBT_POSTGRES_DB }}
                schema: public
                threads: 4
          EOF

      - name: dbt debug
        working-directory: ./dbt
        run: dbt debug

      - name: dbt deps
        working-directory: ./dbt
        run: dbt deps

      - name: dbt build
        working-directory: ./dbt
        run: dbt build --target ci

Two things to note here. First, the paths: filter on the trigger. If your repo has both Python code and dbt models, you do not want every Python change to trigger a full dbt build. Scoping the trigger to dbt/** means this workflow only runs when dbt files change.

Second, working-directory: ./dbt on each dbt step. dbt looks for dbt_project.yml in the current directory. If your dbt project lives in a subdirectory, every step needs working-directory set or dbt will not find the project.

Airflow DAG Validation Without Running Airflow

Running a full Airflow stack in CI is too heavy. What you actually want is to catch import errors, because that is how most broken DAG files fail when the scheduler parses them. You can do that by importing each DAG file with Python directly.

- name: Validate DAG imports
  run: |
    pip install apache-airflow
    python -c "
    import importlib.util, pathlib
    dag_files = list(pathlib.Path('dags').glob('*.py'))
    for f in dag_files:
        spec = importlib.util.spec_from_file_location(f.stem, f)
        mod  = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)
        print(f'OK: {f.name}')
    "

This catches syntax errors and import errors in your DAGs without spinning up a scheduler, webserver, or database. It will not catch runtime task failures, but it will catch the most common CI issue which is pushing a DAG with a broken import that takes down the whole DAG parser.
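
The inline one-liner stops at the first broken file. If you would rather see every broken DAG in a single CI run, the same idea can live in a small script. This is a minimal sketch, not part of the original workflow: the function names validate_dag_files and main, and the idea of a separate script file, are assumptions for illustration.

```python
import importlib.util
import pathlib
from typing import List, Tuple


def validate_dag_files(dag_dir: str) -> List[Tuple[str, str]]:
    """Import every .py file in dag_dir and return (filename, error) pairs."""
    failures = []
    for path in sorted(pathlib.Path(dag_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            # Executing the module surfaces syntax errors and broken imports.
            spec.loader.exec_module(module)
        except Exception as exc:
            failures.append((path.name, f"{type(exc).__name__}: {exc}"))
    return failures


def main(dag_dir: str = "dags") -> int:
    """Print a line per broken file and return a CI-friendly exit code."""
    failures = validate_dag_files(dag_dir)
    for name, error in failures:
        print(f"FAILED: {name}: {error}")
    return 1 if failures else 0
```

In a workflow step you would end the script with sys.exit(main()) so a non-zero exit code fails the job. As with the inline version, this only proves the files import cleanly; it says nothing about whether the tasks succeed at runtime.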

Docker Build with Layer Caching

Data engineering Docker images are large. The base Airflow image alone is several hundred megabytes. Without caching, a docker build on every push can take 5+ minutes. With BuildKit's GitHub Actions cache backend (type=gha), unchanged layers are restored from the cache instead of being rebuilt.

# .github/workflows/docker.yml
name: Build and Push Docker Image

on:
  push:
    branches: [main]
    tags: ['v*']

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Log in to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: myusername/my-pipeline
          tags: |
            type=ref,event=branch
            type=semver,pattern={{version}}
            type=sha,prefix=sha-

      # The gha cache backend needs a BuildKit builder, not the default docker driver
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

Use a Docker Hub access token in DOCKERHUB_TOKEN, not your account password. Access tokens can be scoped and revoked. Your account password cannot.

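The metadata-action handles tagging automatically. Push to main and you get a main tag. Push a v1.2.3 tag and pattern={{version}} gives you a 1.2.3 tag; to also publish the floating 1.2 and 1 tags, add type=semver,pattern={{major}}.{{minor}} and type=semver,pattern={{major}}. The type=sha,prefix=sha- entry adds a commit-SHA tag, so you always know exactly which commit is running in any environment.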

Multi-Job Pipelines with Dependencies

When you have test, build, and deploy as separate concerns, you want them chained. A failed test should stop the build. A failed build should stop the deploy.

name: Full Pipeline

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11', cache: 'pip'}
      - run: pip install -r requirements.txt && pytest tests/ -v

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: myuser/my-pipeline:latest

  notify:
    needs: [test, build]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - run: echo "Test=${{ needs.test.result }}, Build=${{ needs.build.result }}"

needs: test means the build job will not start unless test passes. needs: [test, build] means notify waits for both. The if: always() on notify means it runs regardless of whether the earlier jobs succeeded or failed. This is where you would put a Slack alert on failure.

Conditional Steps

Four patterns cover most real scenarios:

# Run only on pushes to main, not on PRs
- name: Deploy to production
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  run: ./deploy.sh

# Run only when something fails
- name: Alert on failure
  if: failure()
  env:
    SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
  run: curl -X POST -H 'Content-type: application/json' -d '{"text":"CI failed!"}' "$SLACK_WEBHOOK"

# Always run regardless of outcome
- name: Cleanup
  if: always()
  run: docker compose down

# Upload test results even if tests fail
- name: Upload test report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: test-results
    path: reports/

The if: always() on artifact upload matters in practice. If tests fail, you want the report. Without if: always(), the upload step is skipped because a previous step failed, and you have no output to debug from.

Caching Beyond pip

setup-python caches pip downloads when you set cache: 'pip'. Everything else needs an explicit cache step.

- name: Cache dbt packages
  uses: actions/cache@v4
  with:
    path: dbt/dbt_packages
    key: dbt-packages-${{ hashFiles('dbt/packages.yml') }}
    restore-keys: dbt-packages-

- name: Cache Evidence.dev node_modules
  uses: actions/cache@v4
  with:
    path: evidence-app/node_modules
    key: npm-${{ hashFiles('evidence-app/package-lock.json') }}

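The key is a hash of the dependency manifest (packages.yml) or lockfile (package-lock.json). When either file changes, the key changes, the old cache no longer matches exactly, and the packages are reinstalled and saved under the new key. When nothing changes, the cache restores and the install step has almost nothing to do. restore-keys is a prefix-match fallback: if there is no exact match, the most recent cache whose key starts with dbt-packages- is restored, so dbt deps only has to fetch what changed.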
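
To make the exact-match versus prefix-fallback behavior concrete, here is a toy model in Python. It is only a sketch of the lookup rules described above, not how actions/cache is implemented: make_key and find_cache are hypothetical names, and the real hashFiles() function computes its digest differently.

```python
import hashlib
from typing import List, Optional


def make_key(prefix: str, manifest: bytes) -> str:
    """Build a cache key from a prefix plus a digest of the manifest contents."""
    return prefix + hashlib.sha256(manifest).hexdigest()[:12]


def find_cache(key: str, restore_keys: List[str], saved_keys: List[str]) -> Optional[str]:
    """Return the saved key that would be restored, or None on a full miss.

    saved_keys is ordered newest first and stands in for the caches saved so far.
    """
    if key in saved_keys:
        return key  # exact hit: nothing to reinstall
    for prefix in restore_keys:
        for saved in saved_keys:
            if saved.startswith(prefix):
                return saved  # newest cache matching this prefix
    return None


old_key = make_key("dbt-packages-", b"dbt_utils: 1.1.0")
new_key = make_key("dbt-packages-", b"dbt_utils: 1.2.0")

# After editing packages.yml the exact key misses, but the prefix still matches.
print(find_cache(new_key, ["dbt-packages-"], [old_key]) == old_key)
# Without restore-keys (like the node_modules step above) the same change is a full miss.
print(find_cache(new_key, [], [old_key]))
```

The second call is the reason the node_modules cache above starts from scratch whenever package-lock.json changes: it has no restore-keys to fall back on.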

The Gotchas That Bite Data Engineers Specifically

Postgres health checks are not optional. Already covered above, but worth repeating: without --health-cmd pg_isready in options:, steps can start before Postgres is accepting connections, and your tests will fail intermittently. Always add the health check.

Schedule cron is UTC. schedule: - cron: '0 6 * * *' runs at 6 AM UTC, which is 9 AM EAT (Nairobi/Kampala/Dar es Salaam). If you are scheduling a data pull for East African business hours, account for the offset. Use crontab.guru to verify before you push.
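
If you prefer to check the offset in code rather than in your head, a few lines of Python will do it. This is a small sketch with assumed helper names (utc_cron_to_local, local_time_to_utc_cron); it hardcodes EAT as a fixed UTC+3 offset, which is safe because East Africa Time has no daylight saving. It only handles daily schedules: if the conversion crosses midnight, any day-of-week field in your cron expression shifts too.

```python
from datetime import datetime, timedelta, timezone

# East Africa Time is a fixed UTC+3 offset with no daylight saving.
EAT = timezone(timedelta(hours=3), "EAT")


def utc_cron_to_local(hour_utc: int, minute_utc: int = 0, tz: timezone = EAT) -> str:
    """Return the local wall-clock time at which a UTC cron schedule fires."""
    fires_at = datetime(2024, 1, 1, hour_utc, minute_utc, tzinfo=timezone.utc)
    return fires_at.astimezone(tz).strftime("%H:%M %Z")


def local_time_to_utc_cron(hour_local: int, minute_local: int = 0, tz: timezone = EAT) -> str:
    """Return a daily cron expression (in UTC) for a desired local time."""
    wanted = datetime(2024, 1, 1, hour_local, minute_local, tzinfo=tz)
    in_utc = wanted.astimezone(timezone.utc)
    return f"{in_utc.minute} {in_utc.hour} * * *"


print(utc_cron_to_local(6))       # 09:00 EAT
print(local_time_to_utc_cron(8))  # 0 5 * * *
```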

Secrets are not available in fork PRs. When someone forks your public repo and opens a pull request, the pull_request event does not have access to your secrets. Steps that need ${{ secrets.* }} will silently get empty strings. This is a security feature, not a bug. Design your CI so that secret-dependent steps only run on push events, not on pull_request from forks.

Do not write secrets to .env files. Some setups do echo "DB_PASSWORD=${{ secrets.DB_PASSWORD }}" >> .env and then read from .env. That writes the secret to the runner filesystem. Even though the runner is ephemeral, this is unnecessary exposure. Pass secrets as env: directly on the step that needs them.
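
On the Python side, a script run by that step can read the secret straight from the process environment that env: populates. A minimal sketch, assuming a hypothetical helper named require_env and the DB_PASSWORD variable from the example above. It fails fast on an empty value, which is also what a fork PR would hand you.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "")
    if not value:
        # Fork PRs receive empty strings for secrets, so treat empty as missing.
        raise RuntimeError(f"Required environment variable {name} is not set or is empty")
    return value
```

Calling require_env("DB_PASSWORD") at the top of the script turns a confusing authentication error deep in the run into a one-line failure at startup.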

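Step outputs use $GITHUB_OUTPUT, not set-output. The old ::set-output workflow command was deprecated in October 2022 and now triggers a deprecation warning. GitHub postponed its planned 2023 removal, but it should not be used in new workflows. If you are copying workflow snippets from older articles, update to the current pattern: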

- name: Get version
  id: version
  # git describe needs tag history, so check out with fetch-depth: 0
  run: echo "tag=$(git describe --tags)" >> "$GITHUB_OUTPUT"

- name: Use version
  run: echo "Version is ${{ steps.version.outputs.tag }}"
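
The same mechanism works from a Python step, since GITHUB_OUTPUT is just the path of a file that the runner reads after the step finishes. A minimal sketch, assuming a hypothetical helper named set_output. It only handles single-line values; multiline values need the delimiter syntax described in the GitHub documentation.

```python
import os


def set_output(name: str, value: str) -> None:
    """Append name=value to the file GitHub Actions exposes via GITHUB_OUTPUT."""
    if "\n" in value:
        raise ValueError("multiline values need the delimiter syntax, not name=value")
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as output_file:
        output_file.write(f"{name}={value}\n")
```

A step with id: version that calls set_output("tag", "v1.2.3") exposes the value to later steps as steps.version.outputs.tag, exactly like the shell version.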

Security Scanning (Required, Not Optional)

Every repo I push includes a security workflow. For Python, pip-audit checks your dependencies against public vulnerability advisories and covers most dependency vulnerabilities:

# .github/workflows/security.yml
name: Security Audit

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 8 * * 1'  # Weekly on Monday at 8 AM UTC

jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - run: pip install pip-audit
      - run: pip-audit -r requirements.txt

Do not use safety for this. Recent versions require an account and API key. pip-audit needs neither and is completely free: by default it queries the PyPI advisory database, and -s osv switches it to the public OSV database.

For Node.js projects (Evidence.dev, React frontends):

- name: npm audit
  run: npm audit --audit-level=moderate
  working-directory: ./evidence-app

Day-to-Day Management with gh CLI

Once workflows are running, the gh CLI is faster than the GitHub web UI for most operations:

# See what workflows exist
gh workflow list

# Trigger a workflow manually (requires workflow_dispatch trigger)
gh workflow run ci.yml
gh workflow run ci.yml --ref develop

# Watch a run in real time
gh run watch

# See recent runs
gh run list --workflow ci.yml

# View logs from a specific run
gh run view <run-id> --log

# Re-run only failed jobs
gh run rerun <run-id> --failed

# Download artifacts without opening the browser
gh run download <run-id> -n pytest-results

gh run watch is the one I use most during active development. It streams the current run to your terminal so you do not have to refresh the GitHub UI to see progress.

A Practical Starting Point

For a new data engineering project, I add three workflow files from day one:

ci.yml for pytest with Postgres service, triggered on push and pull request
security.yml for pip-audit, triggered on push and weekly on a schedule
docker.yml for building and pushing the image, triggered on push to main

That covers: tests passing, no known CVEs in dependencies, and a fresh image on every merge. Everything else (dbt CI, DAG validation, matrix builds) gets added when the project grows to need it.

The full patterns for all three are in this article. The only setup required in GitHub before any of this works is: go to your repo, open Settings, then Secrets and variables, then Actions, and add whatever credentials your tests need. For a PostgreSQL-backed pipeline, that is typically TEST_DATABASE_URL. For dbt, it is the three Postgres credentials. For Docker, it is DOCKERHUB_USERNAME and DOCKERHUB_TOKEN.

Automate the verification work. The pipeline should tell you what broke before a reviewer has to.

All the workflows in this article are patterns I use across my open data engineering projects. If you want to see them in context, the repos are linked on my GitHub profile.

Follow me on dev.to for more articles on Airflow, dbt, and building production-grade data pipelines.