I graduated in November 2025. My only formal work experience is a 6-month IT internship. I have never worked at a tech company, never contributed to a production pipeline with a team, and never had a senior data engineer review my code.
What I do have is 55 real data engineering projects on GitHub, each with specific metrics: row counts, test pass rates, pipeline success logs, and dashboard screenshots. Every project uses real data from real APIs. None of it is synthetic.
This article is the playbook I built around that situation. It covers the opener, the STAR answers drawn from project work, the technical questions interviewers actually ask, how to handle behavioral rounds honestly, and the Kenya market context that most interview guides skip entirely.
The Core Problem with Standard Interview Advice
Most interview guides assume you have 2 to 3 years of work experience to draw from. They say things like "describe a time when you had to scale a pipeline for a large dataset" or "tell me about a production incident you resolved." If you built your skills through self-directed projects, you can answer these questions just as well, but you need to translate your project work into the language interviewers expect.
The translation is not difficult. The key is precision. Interviewers are not checking whether you worked at a specific company. They are checking whether you understand what you built, why you made the decisions you did, and what you would do differently. A project where you processed 1.5 million rows, caught a date parsing bug with dbt tests before it corrupted 40,000 rows of fiscal year attribution, and generated 2,961 legitimate anomaly alerts is a stronger story than a vague answer about "working on a pipeline at a previous company."
Part 1: The Opener
"Tell me about yourself" is the first question in almost every interview. It is also the question most people underestimate.
The formula: who you are, what you have built, what you are targeting, why you are interested in this specific role.
Version A (general data engineering role):
"I'm a data engineer based in Nairobi. I graduated with a BSc in IT from Zetech University in late 2025, and since then I have been building a portfolio of production-style data pipelines, currently 55 projects on GitHub. My focus has been East African data contexts: I have built pipelines on Airflow, Kafka, dbt, and GCP, covering everything from real-time flight monitoring at JKIA to fiscal reconciliation of Kenya's budget data. I am looking for a junior data engineering role where I can contribute immediately to real pipelines and grow alongside an experienced team. I was drawn to this role specifically because [mirror 1 to 2 phrases from the JD]."
Version B (analytics or BI role):
"I am a data analyst and engineer based in Nairobi. I completed my BSCIT at Zetech University in November 2025, and I have spent the following months building an analytics portfolio covering SQL, Python, dbt, and BI tools including Power BI, Grafana, Streamlit, and Looker Studio. I have modeled Kenya economic data, NSE stock data, and Kenyan job market trends. My work is deliberately honest and measurable: every project has specific test pass rates and row counts I can speak to. I am looking for a role where analytical thinking and engineering overlap, like what your team does with [mention company context]."
Version C (technical deep-dive panel):
"I am a junior data engineer with a focused portfolio of 55 real-data pipelines. My stack centers on Apache Airflow for orchestration, dbt for transformation, Kafka for streaming, and PostgreSQL or BigQuery as the target. I also have experience with CDC via Debezium, medallion lakehouses on Delta Lake, and serverless pipelines on AWS. All projects use real public APIs or open datasets. I can walk you through any of them in detail, including the design decisions and what I would do differently now."
Execution notes:
Speak slowly. Most candidates rush this answer. Pause at the transition between each section. Have the JD in front of you the night before and mirror 1 to 2 phrases from it. End with a slight upward inflection on the last sentence, which signals you want them to respond rather than just waiting for the next question.
Part 2: STAR Answers from Real Projects
Behavioral and project questions follow STAR format: Situation, Task, Action, Result. The common prompts are "tell me about a challenging project," "describe your most complex data engineering work," and "give me an example of a pipeline you built." Here are five project stories ready to deliver.
JKIA Flight Traffic Monitor (streaming and GCP)
Situation: I wanted to build a real-time streaming pipeline over Kenya aviation data, something that would demonstrate Kafka, GCP, and dbt together in a single coherent project.
Task: Ingest live flight data from the OpenSky Network API, buffer it through Kafka, land it in Google Cloud Storage, transform it in BigQuery using dbt, and surface traffic patterns in Looker Studio.
Action:
- Built an Airflow DAG to orchestrate the full workflow end-to-end
- Configured Kafka with a Zookeeper cluster in Docker and used it as a buffer between the API poller and GCS writer; if GCS had a timeout, Kafka held the data and resumed delivery without loss
- Wrote dbt models with a staging to mart layer and 22 dbt tests to validate output integrity
- Used BigQuery window functions to detect peak traffic windows at JKIA
- Surfaced insights in a Looker Studio dashboard
Result: All 22 dbt tests passing. The Kafka buffer successfully handled GCS timeouts during testing without data loss. The dashboard showed peak-traffic windows that aligned with published JKIA schedule data, which validated the pipeline's correctness.
What I would do differently: Add Schema Registry with Avro. If the OpenSky API changes a field type, the current pipeline would silently corrupt data. Schema Registry with compatibility rules would catch that before it reaches the mart layer.
LedgerSync (scale and dbt testing)
Situation: Kenya's national budget data (the BOOST dataset) is publicly available but rarely analyzed computationally. I wanted to build a fiscal reconciliation pipeline at scale and understand what the data actually shows.
Task: Ingest 1.5 million rows of Kenya fiscal transaction data, reconcile budget versus actuals, detect anomalies, and surface alerts in a dashboard.
Action:
- Airflow 5-task DAG for orchestration with PostgreSQL as the warehouse
- 9 dbt models covering raw to staging to reconciliation to alerts
- 35 dbt tests; the suite caught a date parsing bug during development that would have corrupted approximately 40,000 rows of fiscal year attribution before it reached the mart layer
- Built an alerting model that flags transactions deviating more than 20% from expected budget allocation
Result: 1,528,492 BOOST rows processed. All 5 Airflow tasks succeeded. 35 of 35 dbt tests passing. 2,961 legitimate fiscal anomalies surfaced. The most interesting finding: several ministry categories showed consistent overspend in Q4, which aligns with public reporting on year-end budget rushing in Kenya.
What I would do differently: Add incremental loading. The current pipeline uses WRITE_TRUNCATE on every run. With a fiscal year partition key I could load only the current fiscal year's data on each run, which reduces cost and run time at scale.
BungeWatch Kenya (NLP and PDF parsing)
Situation: Kenya's parliament publishes bills as PDFs on a poorly structured website with no machine-readable feed. I wanted to make legislative data searchable and summarized.
Task: Build a pipeline that scrapes parliamentary bills, parses the PDFs, extracts keywords, generates summaries, and surfaces foreign funding mentions in a searchable dashboard.
Action:
- Playwright for dynamic site scraping (the parliament site uses JS rendering)
- 3-tier PDF parser: pdfplumber for text PDFs, PyMuPDF for complex layouts, Tesseract OCR for scanned image PDFs; 4 parallel workers
- spaCy and YAKE for keyword extraction; TF-IDF for relevance ranking
- Airflow 3.0 with an 11-task DAG; checkpoint pattern to resume after failures without re-downloading bills
- dbt for transformation (8 models, 28 tests); Streamlit for the dashboard
Result: 319 bills processed; 223 successfully parsed. 2,230 keywords extracted. 223 AI-generated summaries. Identified 14 foreign funding mentions across 8 bills, data that is not easily discoverable by reading each bill manually.
What I would do differently: Store the raw PDFs in GCS or S3 before parsing. If parsing logic changes, I cannot re-run on the same documents without re-downloading them. Separating the raw storage from the processing layer would fix that.
Kenya Forex Intelligence API (data quality and FastAPI)
Situation: Forex data APIs for Kenyan currency pairs either cost money or return unreliable data. I wanted to build a self-hosted API with validated data and fast query response.
Task: Build a REST API backed by DuckDB that ingests forex rates daily through Airflow, runs a validation suite, and serves clean data with sub-10ms latency.
Action:
- Airflow DAG handles daily ingestion; 120 rows per day (one per currency pair)
- Great Expectations validation suite with 5 checks: nulls, expected currency codes, rate range bounds, timestamp freshness, duplicate detection
- DuckDB as the query engine (in-process OLAP, no separate DB server)
- FastAPI with 4 endpoints: current rates, historical series, pair comparison, and validation report
Result: Sub-10ms query latency on all endpoints. The 5-check validation suite caught 3 real data quality issues during development: stale timestamps from the upstream source and duplicate records on weekend ingestion. Fixed before they reached the API layer.
What I would do differently: Add Redis caching for high-traffic scenarios and OpenAPI authentication so the API can be deployed publicly without exposing it to abuse.
JobSense (mistake and recovery)
This one is for the "describe a mistake you made" question.
Situation: In the JobSense pipeline, I named a dbt model the same as a Python-managed PostgreSQL table that already existed in the same schema.
Task: When dbt ran, it dropped the Python-managed table and replaced it with its own version, which deleted all 604 pre-loaded job embeddings.
Action: Re-ran the embedding generation script (604 jobs times one embedding model call, approximately 45 minutes to regenerate). Then renamed the dbt model with a dbt_ prefix to distinguish it from Python-managed tables. Added a step to my project setup checklist: verify no naming conflicts between dbt models and raw tables before running dbt run for the first time.
Result: Pipeline fully recovered. The conflict pattern is now in my notes and has not recurred in subsequent projects. The lesson is now documented in my memory system, not just in my head.
Part 3: Technical Questions That Actually Come Up
Interviewers test SQL, Python, orchestration, transformation, and system design. Here are the questions I have seen most often, with answers tied back to real work.
SQL
What is the difference between WHERE and HAVING?
WHERE filters rows before aggregation. HAVING filters groups after aggregation. WHERE cannot reference aggregate functions.
SELECT dept, COUNT(*) AS headcount
FROM employees
WHERE active = true -- filters rows before grouping
GROUP BY dept
HAVING COUNT(*) > 10; -- filters groups after aggregation
Explain window functions with a real use case.
Window functions compute values across rows related to the current row without collapsing them like GROUP BY does. From the NSE Stock Pipeline:
SELECT
ticker,
close_price,
RANK() OVER (PARTITION BY sector ORDER BY close_price DESC) AS sector_rank,
LAG(close_price, 1) OVER (PARTITION BY ticker ORDER BY trade_date) AS prev_close,
close_price - LAG(close_price, 1) OVER (
PARTITION BY ticker ORDER BY trade_date
) AS daily_change
FROM nse_prices;
I also used window functions in BigQuery on JKIA data to compute hourly flight counts relative to a 7-day rolling average to detect peak traffic windows.
Write a query for a rolling 7-day average.
SELECT
trade_date,
ticker,
close_price,
AVG(close_price) OVER (
PARTITION BY ticker
ORDER BY trade_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_avg
FROM nse_prices
ORDER BY ticker, trade_date;
The ROWS BETWEEN 6 PRECEDING AND CURRENT ROW is the frame clause. Without it you get a running average, not a rolling 7-day. This is one of the most common window function mistakes in interviews.
What is an SCD Type 2 and how do you implement it?
Type 2 tracks historical changes by adding new rows with effective_from and effective_to date columns rather than overwriting.
-- Close the current record
UPDATE dim_customer
SET effective_to = CURRENT_DATE, is_current = FALSE
WHERE customer_id = 123 AND is_current = TRUE;
-- Insert the new version
INSERT INTO dim_customer
(customer_id, name, address, effective_from, effective_to, is_current)
VALUES
(123, 'New Name', 'New Address', CURRENT_DATE, '9999-12-31', TRUE);
dbt snapshots automate this pattern. In the Kenya Real Estate pipeline I used snapshots to track price changes on listings, so each change has a dated record rather than just the current price.
How do you optimize a slow query?
-
EXPLAIN ANALYZEto find sequential scans and nested loop joins - Add indexes on JOIN and WHERE columns
- Avoid functions on indexed columns in WHERE predicates (prevents index use)
- Filter early: push predicates into CTEs rather than filtering after a large join
- For BigQuery: check bytes processed and add partitioning or clustering on the filter columns
- Check for data skew in GROUP BY operations where some groups are much larger than others
Python
What is a generator and why use one in a pipeline?
A generator yields values lazily, one at a time, instead of loading the entire result into memory. For large files or API responses this is essential.
def read_csv_chunks(path, chunk_size=10_000):
with open(path) as f:
reader = csv.DictReader(f)
chunk = []
for row in reader:
chunk.append(row)
if len(chunk) >= chunk_size:
yield chunk
chunk = []
if chunk:
yield chunk
for chunk in read_csv_chunks("large_file.csv"):
process_and_load(chunk) # never more than chunk_size rows in memory
What is the GIL and what does it mean for pipelines?
The GIL (Global Interpreter Lock) prevents multiple CPython threads from executing Python bytecode simultaneously. For data pipelines:
- I/O-bound work (API calls, DB reads): use threading or asyncio; the GIL releases during I/O wait
- CPU-bound work (parsing, transformation): use multiprocessing; each process has its own GIL
In BungeWatch I used multiprocessing with 4 workers for OCR because it is CPU-bound. If I had used threading, all 4 workers would have shared one core.
How do you implement retries with exponential backoff?
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests
@retry(
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=1, min=2, max=60),
retry=retry_if_exception_type(requests.exceptions.RequestException),
reraise=True
)
def fetch_data(url: str) -> dict:
response = requests.get(url, timeout=30)
response.raise_for_status()
return response.json()
I use this pattern in every API extraction script. The max=60 cap prevents waiting 4 minutes between retries if the upstream API is down completely.
How do you process a large CSV without running out of memory?
Three approaches ranked by simplicity:
# Option 1: pandas chunks
for chunk in pd.read_csv("large.csv", chunksize=10_000):
process(chunk)
# Option 2: DuckDB — streams from disk natively, never loads full file
import duckdb
conn = duckdb.connect()
result = conn.execute("SELECT col1, col2 FROM 'large.csv' WHERE amount > 1000").df()
# Option 3: generator (shown above)
For files above 5 GB, DuckDB is usually the fastest Python option because it reads Parquet and CSV with column pruning built in.
Apache Airflow
What is XCom and what are its limits?
XCom allows tasks to share data through the metadata database. The limit is size: XCom is for small metadata (IDs, row counts, file paths), not DataFrames. Storing a DataFrame in XCom bloats the metadata DB and makes it slow.
Best practice: write actual data to PostgreSQL, GCS, or S3, then pass the path via XCom. The task that receives the path reads the data from storage directly.
What is the difference between catchup and backfill?
Catchup is a DAG-level flag. If catchup=True and you deploy a DAG with a start_date two weeks ago, Airflow will run all missed intervals immediately. Set catchup=False for most pipelines to avoid this.
Backfill is intentional: you manually trigger historical re-runs.
airflow dags backfill my_dag --start-date 2025-01-01 --end-date 2025-01-31 --reset-dagruns
How do you handle task failures in production?
@task(retries=3, retry_delay=timedelta(minutes=5), on_failure_callback=alert_slack)
def extract_nse_data():
...
Beyond retries: design tasks to be idempotent so they are safe to re-run. Use ON CONFLICT DO NOTHING for inserts so a retry never creates duplicates. Set depends_on_past=True for critical pipelines where a failed run should block future runs until resolved.
dbt
What is the staging to mart layer pattern?
Staging: 1-to-1 with source tables. Light cleaning only: rename columns, cast types, drop obvious garbage. No business logic. Prefix: stg_.
Intermediate: business logic joins and transformations that combine staging models. Not exposed to end users. Prefix: int_.
Marts: final wide tables optimized for specific use cases. These are what analysts and BI tools query. Prefix: fct_ or dim_.
From the NSE pipeline: stg_prices to int_daily_changes to fct_stock_performance. Each layer is testable independently.
What is an incremental model and when do you need a lookback buffer?
{{ config(materialized='incremental', unique_key='flight_id') }}
SELECT * FROM {{ ref('stg_flights') }}
{% if is_incremental() %}
WHERE departure_time > (SELECT MAX(departure_time) FROM {{ this }})
{% endif %}
The buffer matters for late-arriving data. If records can arrive a few hours after the event time (which happens with API sources that update retroactively), use:
WHERE departure_time > (SELECT MAX(departure_time) FROM {{ this }}) - interval '4 hours'
Without the buffer, late records are silently dropped.
How do you do CI for dbt?
Use dbt state to run only models changed since the last production run:
dbt build \
--select state:modified+ \
--defer \
--state /artifacts/prod-manifest
state:modified+ runs changed models and all downstream dependents. --defer makes unchanged models reference the production results rather than rebuilding everything. This requires storing manifest.json from the last prod run as a CI artifact.
System Design
How do you design an end-to-end batch analytics pipeline?
Use the NSE Stock Pipeline as the example:
- Source: NSE website after market close at 18:00 EAT
- Extract: Python with exponential backoff retries, land raw data in PostgreSQL staging table with
ON CONFLICT DO NOTHING - Validate: row count check (not zero, not 10x normal), null rate check on required fields, price range bounds
- Transform: dbt staging model (cast types, rename), intermediate (compute daily change and sector rank), mart (aggregated performance)
- Serve: Streamlit dashboard reads from the mart layer
- Orchestrate: Airflow DAG with 3 retries per task, Slack alert on failure, SLA set to 30 minutes
When do you choose streaming over batch?
Batch when:
- Data arrives in bulk at defined intervals (daily files, nightly exports)
- Latency of minutes or hours is acceptable
- Aggregations need complete data (stock prices only make sense after market close)
Streaming when:
- Need to act within seconds (fraud detection, live dashboards)
- Data arrives continuously and unboundedly
- Need to capture deletes (batch only sees what exists, not what was removed)
I have built both. The NSE pipeline is batch. The JKIA Flight Monitor uses Kafka for streaming with Airflow as the outer orchestrator.
What is idempotency and how do you achieve it?
Idempotency means running the same task multiple times produces the same result as running it once. It is what makes retries safe.
Techniques:
-
INSERT ... ON CONFLICT DO NOTHINGorON CONFLICT DO UPDATEinstead of plain INSERT -
WRITE_TRUNCATEfollowed by a full reload for small dimension tables - dbt incremental models with
unique_keyso re-runs merge rather than append - For file-based outputs: write to a temporary path, then rename atomically when complete
Every extraction in my projects uses ON CONFLICT DO NOTHING. A re-run after a failure never creates duplicate rows.
Distributed Systems and Database Foundations
These questions come up in technical screens at companies running any distributed infrastructure. They are foundational concepts, not Airflow-specific or dbt-specific.
What is the CAP theorem?
The CAP theorem states that a distributed system can only guarantee two of three properties simultaneously:
- Consistency (C): every read receives the most recent write or an error
- Availability (A): every request receives a response, though it may not be the most recent data
- Partition Tolerance (P): the system continues operating even if network partitions cause nodes to lose communication
In practice, network partitions always happen. So the real trade-off is between consistency and availability when a partition occurs.
| System | Trade-off | Example |
|---|---|---|
| CP (consistent, partition-tolerant) | May reject requests during partition | HBase, Zookeeper |
| AP (available, partition-tolerant) | May return stale data during partition | Cassandra, DynamoDB |
| CA (consistent, available) | No partition tolerance | Traditional RDBMS on single node |
For data engineering: event streaming systems like Kafka favor availability (messages are always accepted). Transactional databases like PostgreSQL favor consistency (reads always reflect committed writes). Knowing which trade-off a system makes tells you how to design around its failure modes.
What are the main data serialization formats and when do you use each?
This question comes up whenever the role involves data lakes, Kafka, or any file-based pipeline.
| Format | Type | Schema | Compression | Best for |
|---|---|---|---|---|
| JSON | Text, row-oriented | Schema-less | Poor | APIs, config, small data |
| CSV | Text, row-oriented | Schema-less | Poor | Simple exports, Excel users |
| Avro | Binary, row-oriented | Schema required (embedded) | Good | Kafka messages, schema evolution |
| Parquet | Binary, columnar | Schema embedded | Excellent | Data lakes, analytical queries |
| Protocol Buffers | Binary, row-oriented | Schema required (.proto file) | Excellent | gRPC, high-throughput streaming |
The two you need to know deeply for data engineering interviews:
Parquet is the default format for data lake storage. It stores data column by column, so a query on 3 columns out of 50 only reads those 3 columns from disk. Combined with Snappy or ZSTD compression, Parquet files are typically 5 to 10 times smaller than the equivalent CSV. BigQuery, Athena, Spark, and DuckDB all read Parquet natively.
# Writing Parquet with pandas
df.to_parquet("output.parquet", engine="pyarrow", compression="snappy", index=False)
# DuckDB reading Parquet directly (no load step)
import duckdb
conn = duckdb.connect()
result = conn.execute("SELECT ministry, SUM(amount) FROM 'ledger.parquet' GROUP BY 1").df()
Avro is the standard for Kafka because it stores the schema alongside the data and supports schema evolution with compatibility rules (add a field with a default value, remove a field). Without Avro or Protobuf in a Kafka pipeline, a schema change in the producer silently breaks consumers.
What is the difference between a data lake and a data warehouse?
| Property | Data Lake | Data Warehouse |
|---|---|---|
| Data format | Raw, any format (CSV, JSON, Parquet, images, PDFs) | Structured, processed, schema-on-write |
| Schema | Schema-on-read (applied at query time) | Schema-on-write (enforced at load time) |
| Cost | Very cheap storage (S3, GCS) | More expensive (compute + storage) |
| Query speed | Slower (no indexes, scans files) | Fast (columnar, optimized) |
| Use case | Retain all raw data for future use | Analytical queries, dashboards, BI |
| Who queries it | Data engineers, data scientists | Analysts, BI tools |
The modern answer is the data lakehouse: combine object storage cost with warehouse-level query performance using open table formats (Delta Lake, Apache Iceberg, Apache Hudi). They add ACID transactions, time travel, and schema enforcement on top of Parquet files in S3 or GCS.
In Kenya Economic Pulse I implemented a medallion lakehouse using Delta Lake on MinIO. The Bronze layer is raw Parquet files in object storage. The Silver layer applies Delta Lake ACID writes. The Gold layer runs dbt on top of it. You get the storage cost of a data lake and the reliability guarantees of a warehouse.
What is database normalization and when do you denormalize?
Normalization organizes a relational database to reduce redundancy and avoid update anomalies. The common normal forms:
- 1NF: each column holds atomic (indivisible) values; no repeating groups
- 2NF: all non-key columns depend on the entire primary key (removes partial dependencies)
- 3NF: all non-key columns depend only on the primary key, not on other non-key columns (removes transitive dependencies)
A fully normalized schema stores each fact in exactly one place. A customer's city is stored once in dim_customer, not repeated on every transaction row.
When to denormalize: in OLAP systems (warehouses, data marts), you denormalize intentionally. Analytical queries join many tables at query time, which is expensive. Pre-joining dimension data into wide fact tables eliminates those joins and speeds up BI tools. This is what a star schema does: it is deliberately denormalized for read performance at the cost of some write redundancy.
Rule of thumb: normalize for OLTP (transactional systems where you write frequently). Denormalize for OLAP (analytical systems where you read frequently across many rows).
What is sharding and when would you use it?
Sharding is horizontal partitioning of a database across multiple machines. Each shard holds a subset of the rows, usually split by a shard key (user ID range, geographic region, hash of a key).
Why it exists: a single database server has a physical ceiling on storage, CPU, and write throughput. Sharding distributes that load across multiple nodes so the system can scale beyond what any single machine supports.
Sharding strategies:
-
Range-based: rows where
user_idbetween 1 and 1,000,000 go to shard 1, the next million to shard 2. Simple, but hot spots appear if new users cluster in one range. -
Hash-based:
shard = hash(user_id) % num_shards. Distributes evenly, but range queries touch all shards. - Directory-based: a lookup table maps each key to its shard. Flexible, but the directory itself becomes a bottleneck.
Sharding challenges:
- Cross-shard joins are expensive or impossible; you must denormalize or use application-level joins
- Resharding (adding or removing shards) requires moving data, which is disruptive
- Transactions across shards require distributed transaction protocols
For data engineering specifically: you rarely shard the warehouse yourself (BigQuery and Snowflake handle distribution internally). Sharding knowledge matters most when the source system is a sharded operational database and you need to extract from all shards during ingestion.
How do you handle data security and PII in a pipeline?
This comes up in financial services, healthcare, NGO, and any role handling customer data. The key concepts:
PII identification and classification:
Personally Identifiable Information includes names, ID numbers, phone numbers, email addresses, financial account numbers, location data, and any combination that could identify an individual. Build a data catalog or lineage map that tags which columns contain PII.
Techniques for protecting PII in pipelines:
import hashlib
# Pseudonymization — replace PII with a consistent token
# The same ID always produces the same hash, so you can join
def pseudonymize(value: str, salt: str) -> str:
return hashlib.sha256(f"{salt}{value}".encode()).hexdigest()[:16]
df["customer_id_hashed"] = df["national_id"].apply(
lambda x: pseudonymize(x, salt=os.environ["PII_SALT"])
)
df = df.drop(columns=["national_id"]) # never land raw PII in the warehouse
- Tokenization: replace PII with a random token; only the token vault maps tokens back to originals
-
Masking: replace PII with a format-preserving fake value (e.g.,
07*****890instead of the full number) - Encryption: encrypt columns at rest; only authorized roles can decrypt
- Column-level access control: in BigQuery and Snowflake, use column-level security policies so analysts see masked values while engineers see the full column
Key principles:
- Never land raw PII in the Bronze layer if you can avoid it. Pseudonymize at extraction.
- Apply data minimization: only collect and store what you actually need.
- GDPR and Kenya's Data Protection Act (2019) require that individuals can request deletion of their data. Design your pipeline so you can honor a deletion request: track which records belong to a given individual and can delete or mask them without breaking referential integrity downstream.
- Log access to sensitive tables; alert on unusual query patterns.
Part 4: Behavioral Questions
The trap with behavioral questions for fresh grads is giving vague, hypothetical answers ("I would communicate clearly..."). Be specific about real situations, even if the real situation was a solo project. Specificity is what interviewers remember.
"Describe a time you debugged a difficult problem."
In BungeWatch, the PDF parser was silently failing on roughly 30% of bills. The pipeline reported success but the bills came through with empty content. I added structured logging to capture the filename, file size, and which parser tier was used. This revealed the pattern: every failure was a file under 50 KB. Inspecting those manually showed they were scanned images disguised as PDFs, so pdfplumber and PyMuPDF both returned empty strings silently. I added a content-length check after each tier and a Tesseract OCR fallback for files returning less than 100 characters. Parser coverage went from 70% to 93%.
"Describe a time you worked with ambiguous requirements."
In LedgerSync, the Kenya BOOST dataset does not have official documentation for ministry codes. The field exists, but there is no lookup table. I cross-referenced the codes manually against Kenya Treasury Annual Reports and built a dbt seed file (a static CSV lookup) mapping code to ministry name. I flagged it as a source of uncertainty in the documentation and added a dbt test that alerts if a new code appears that is not in the lookup. When new data arrives with unknown ministry codes, the test catches it immediately rather than silently defaulting to NULL.
"Tell me about a time you made a mistake."
The dbt and Python naming conflict in JobSense (described above in the STAR section). Key point: I recovered, documented the root cause, added it to a setup checklist, and the pattern has not recurred across 10 subsequent projects.
"How do you handle technical disagreement?"
Be honest if your experience is limited: "Most of my projects are solo, so I have not had a real team disagreement yet. My approach would be to understand the other person's reasoning fully before presenting an alternative, propose a small prototype to test both approaches against a defined metric if we remain split, and defer to the senior engineer if the decision is above my level, while documenting my concern."
Honesty here is better than a fabricated story an interviewer might probe further.
"Where do you see yourself in 3 years?"
"I want to go from building everything independently, as I do now, to contributing to a production data platform and learning from engineers who operate at scale. In 3 years I would like to be a mid-level data engineer who can own a pipeline end-to-end, mentor newer team members, and make confident architectural decisions. I also want to deepen my understanding of areas I have only touched so far: data observability, Apache Flink for streaming at scale, and Azure."
Part 5: Questions to Ask the Interviewer
Asking good questions at the end is as important as answering them well. It signals engineering curiosity and helps you evaluate whether the role is actually a good fit.
About the tech stack:
- What does your current data infrastructure look like?
- Are there any migrations or re-architectures planned that I would be involved in?
- How mature is your CI/CD pipeline for data work?
- Do you use dbt? If not, how do you manage SQL transformations?
About the role:
- What would success look like in the first 3 months?
- What are the most common sources of pipeline failures on this team?
- Is this a new role or a backfill?
About growth:
- What learning resources does the company support?
- Are there opportunities to work across domains like ML, analytics, and platform engineering?
Avoid asking about salary in the first round, asking questions that are answered in the JD, and asking what the company does (research beforehand).
Part 6: The Kenya Market
Most interview prep content is written for US or UK candidates. The Kenyan data engineering market has its own structure.
Who hires data engineers in Kenya:
- Financial services: Equity Bank, KCB, Co-operative Bank, M-KOPA, Pezesha, Lipa Later
- Telecoms: Safaricom (the largest data organization in Kenya), Airtel
- Tech and startups: Twiga Foods, BURN Manufacturing, Kobo360, Apollo Agriculture, Wave
- NGOs and research: CGIAR, ICRAF, PATH, Mercy Corps, AmeriCares
- International organizations: World Bank, UNDP, IFC (often contractor roles)
- Remote (global): Many Kenyan engineers work remotely for EU and US companies via Andela, direct contract, or platforms like Toptal
Realistic salary ranges for junior and associate roles (2025 to 2026):
| Level | Monthly (KES) | Notes |
|---|---|---|
| Entry (0 to 1 year) | 60,000 to 120,000 | Portfolio projects count toward "experience" |
| Junior (1 to 2 years) | 100,000 to 180,000 | With strong GitHub + relevant domain |
| Remote (USD) | $1,500 to $3,500/month | Andela-connected or direct remote |
NGO and international organization roles typically pay 20 to 40% above private sector market rates. Salary data for junior roles is sparse, so negotiate based on your portfolio strength rather than published ranges alone.
What Kenyan employers actually test for:
- Can you work with real Kenyan data, not just textbook examples? Portfolio projects on NSE, KNBS, parliamentary data, and M-Pesa ecosystem answer this directly.
- Can you write SQL and Python at a practical level? Expect a coding round, not just theory.
- Can you communicate technical concepts clearly? Practice explaining your projects to a non-technical person.
- Are you reliable and will you show up? Consistent GitHub activity across 55+ projects signals this more than a CV claim.
Typical interview format:
- Round 1: HR screen (15 to 30 min). Background, motivation, salary range.
- Round 2: Technical screen (45 to 60 min). SQL problems, Python questions, pipeline design question.
- Round 3: Technical deep-dive or take-home. Build a small pipeline on provided test data, usually 3 to 5 days.
- Round 4: Panel with hiring manager and team member. Behavioral and culture fit.
- Timeline: typically 2 to 4 weeks from first contact to offer.
For remote roles:
Mention your timezone and availability for overlap (East Africa Time = UTC+3). Specify your rate in dollars per month, not per hour or per year. Don't leave it open-ended; it signals you have not thought it through.
Part 7: The Last 5 Minutes Before the Interview
- Re-read the JD. Pick 2 phrases you will mirror in your opener.
- Re-read your "tell me about yourself" version (A, B, or C depending on the role).
- Pick the 2 to 3 projects most relevant to the JD. Know the specific metrics for each.
- Review your 2 to 3 questions to ask.
- Stop preparing. Your portfolio is real. You built those pipelines. The metrics are yours. Trust it.
The single biggest mistake is going into the opener with vague framing like "I have worked on various data projects." Replace every "various" with a specific number. Replace every "complex pipeline" with the actual metric. "I processed 1.5 million rows" is better than "I have experience with large datasets" every time.
My full portfolio of 55 projects is on GitHub. Portfolio site at ian-mwendwa.vercel.app.
Follow me on dev.to for more on data engineering, dbt, and Airflow.
Top comments (0)