Aditya Kumar

# I curated 1,863 Data Engineering interview questions from 97+ companies: here's what I learned

Website: [dataengprep.tech](https://dataengprep.tech)

I spent months collecting and organizing real data engineering interview questions from 97+ companies including Amazon, Google, Databricks, Goldman Sachs, Walmart, and Meta.

The result: **1,863 questions** across 7 categories, each with a Senior/Principal-level answer.

Here's what I learned about what top companies actually ask.

## The 7 Categories (and their weight in real interviews)

| Category         | Questions | Interview Weight          |
| ---------------- | --------- | ------------------------- |
| SQL              | 487       | Every single interview    |
| Spark / Big Data | 452       | Critical for senior roles |
| System Design    | 179       | The make-or-break round   |
| Python / Coding  | 179       | Usually 1–2 rounds        |
| Cloud / Tools    | 179       | AWS, GCP, Airflow, dbt    |
| Behavioral       | 144       | Often underestimated      |
| Fundamentals     | 243       | Phone screen staples      |

## The Surprising Patterns

### 1. SQL is 90% of phone screens

Almost every company starts with SQL, but it's not just `SELECT * FROM`. The topics that came up most often in the questions I collected:

- **Window functions** (ROW_NUMBER, RANK, LAG/LEAD) — asked at 70%+ of companies
- **Self-joins and anti-joins** — Amazon's favorite
- **Query optimization** — "This query takes 45 minutes. Fix it."
- **Recursive CTEs** — Goldman Sachs asks these regularly
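
To make two of these patterns concrete, here is a minimal sketch of an anti-join and a `LAG` window function, run through Python's built-in `sqlite3` module. The `customers` and `orders` tables and their rows are invented for illustration, and the window function assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date TEXT,
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES
        (10, 1, '2024-01-05', 50.0),
        (11, 1, '2024-01-20', 80.0),
        (12, 2, '2024-02-01', 30.0);
""")

# Anti-join: customers that have no matching row in orders.
no_orders = conn.execute("""
    SELECT c.name
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    WHERE o.id IS NULL
""").fetchall()

# Window function: each order next to the same customer's previous amount.
with_previous = conn.execute("""
    SELECT customer_id, order_date, amount,
           LAG(amount) OVER (
               PARTITION BY customer_id ORDER BY order_date
           ) AS prev_amount
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()

print(no_orders)
print(with_previous)
```

The first order for each customer has no predecessor, so `prev_amount` comes back as `None` there. Interviewers often follow up by asking how you would handle that case.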

### 2. System Design separates Senior from Staff

The gap between a senior and a staff-level candidate isn't SQL knowledge; it's **system design thinking**. The top questions I found:

- "Design a real-time analytics pipeline for e-commerce"
- "How would you handle late-arriving data in a streaming pipeline?"
- "Design a data warehouse for a ride-sharing company"

What makes a great answer isn't the architecture — it's explaining **trade-offs**:
- Why Kafka over RabbitMQ for *this specific use case*?
- What's the CAP theorem trade-off you're making?
- What happens when this component fails, and how far does the failure spread (the blast radius)?
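
For the late-arriving data question, one common approach is a watermark plus an allowed-lateness threshold. The sketch below is a toy, single-process illustration of that idea, not how any particular streaming engine implements it. The window size, the lateness threshold, and the `process` helper are all assumptions chosen for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60     # tumbling window size (assumed for illustration)
ALLOWED_LATENESS = 120  # how long a closed window still accepts events (assumed)


def window_start(event_time: int) -> int:
    """Return the start of the tumbling window containing event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def process(events):
    """Aggregate (event_time, value) pairs given in arrival order.

    Returns (per-window sums, events dropped as too late).
    """
    sums = defaultdict(int)
    dropped = []
    watermark = 0  # highest event time seen so far
    for event_time, value in events:
        watermark = max(watermark, event_time)
        start = window_start(event_time)
        window_end = start + WINDOW_SECONDS
        if window_end + ALLOWED_LATENESS <= watermark:
            # Too late: the window is final. A real pipeline would usually
            # route this to a side output or backfill job, not discard it.
            dropped.append((event_time, value))
        else:
            sums[start] += value
    return dict(sums), dropped


# Arrival order differs from event-time order on purpose.
events = [(10, 1), (70, 1), (30, 1), (400, 1), (50, 1)]
sums, dropped = process(events)
print(sums, dropped)
```

The event at time 30 arrives out of order but within the lateness budget, so it still counts. The event at time 50 arrives after the watermark has moved to 400, so it is rejected. The trade-off to explain in an interview is that a larger lateness budget gives more complete results, at the cost of more state to keep and a longer wait before results are final.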

### 3. Behavioral rounds are pass/fail gates

I was surprised how many senior candidates get rejected in behavioral rounds. The pattern:

- **Amazon**: Entirely focused on Leadership Principles (LPs). Every answer needs to map to one.
- **Google**: "Tell me about a time you disagreed with a technical decision"
- **Meta**: Focus on impact metrics ("What was the business result?")

The STAR method (Situation, Task, Action, Result) works for all of them. But your Result needs **numbers**.

### 4. Company-specific patterns are real

After mapping questions to companies, clear patterns emerged:

- **Amazon**: Heavy on SQL optimization + Leadership Principles
- **Google**: System Design + coding fundamentals
- **Databricks**: Spark internals (shuffle, partitioning, Catalyst optimizer)
- **Goldman Sachs**: SQL edge cases + data quality/governance
- **Snowflake**: Their own architecture + query optimization

## What I Built

I turned this into [DataEngPrep.tech](https://dataengprep.tech), a platform where you can browse all 1,863 questions for free, with partial answer previews.

Every question page shows:
- The question text
- Which companies ask it
- Difficulty level and category
- A preview of the expert answer (first ~500 chars)
- Full answer behind a paywall

The full answers go deep — trade-offs, architecture diagrams for System Design, and a "Pro-Tip" on every question (either a common mistake to avoid or a technique that impresses interviewers).

## 5 Questions You Should Practice Right Now

If you have a data engineering interview coming up, practice these — they appear everywhere:

1. **"Explain the difference between a star schema and snowflake schema. When would you use each?"** — Tests data modeling fundamentals

2. **"How would you optimize a slow-running Spark job?"** — Tests production experience (hint: start with shuffle reduction, then partitioning)

3. **"Design a data pipeline that handles late-arriving events"** — Tests system design + real-world awareness

4. **"Write a SQL query to find the second-highest salary in each department"**: Tests window functions (the most-asked SQL pattern)

5. **"Tell me about a time you had to make a technical decision with incomplete information"** — Tests decision-making under uncertainty
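
As a starting point for question 4, here is one possible answer using `DENSE_RANK`, run through Python's built-in `sqlite3` module. The `employees` table and its rows are made up for the example, and the query assumes SQLite 3.25 or newer for window function support.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ana', 'Eng', 200), ('Bo', 'Eng', 200), ('Cy', 'Eng', 150),
        ('Di', 'Ops', 120), ('Ed', 'Ops', 100),
        ('Flo', 'HR', 90);
""")

# DENSE_RANK gives tied salaries the same rank without leaving gaps,
# so rank 2 is the second-highest distinct salary in each department.
second_highest = conn.execute("""
    WITH ranked AS (
        SELECT department, name, salary,
               DENSE_RANK() OVER (
                   PARTITION BY department ORDER BY salary DESC
               ) AS salary_rank
        FROM employees
    )
    SELECT department, name, salary
    FROM ranked
    WHERE salary_rank = 2
    ORDER BY department
""").fetchall()

print(second_highest)
```

The choice of ranking function is the part worth discussing out loud. With `ROW_NUMBER`, one of the two tied top earners in Eng would be reported as "second", which is usually not what the question means. A department with a single salary, like HR here, returns no row.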

---

If you're prepping for a DE interview, check out [DataEngPrep.tech](https://dataengprep.tech). All 1,863 question pages are free to browse.

What's the hardest interview question you've been asked? Drop it in the comments — I'll add it to the collection. 👇