<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DataDriven</title>
    <description>The latest articles on DEV Community by DataDriven (@datadriven).</description>
    <link>https://dev.to/datadriven</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864671%2F923e8540-fa96-491d-adb6-0e01c42ec26a.png</url>
      <title>DEV Community: DataDriven</title>
      <link>https://dev.to/datadriven</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datadriven"/>
    <language>en</language>
    <item>
      <title>Azure Lost 60% of DE Job Postings in One Year. Is Your Resume Wrong?</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:42:02 +0000</pubDate>
      <link>https://dev.to/datadriven/azure-lost-60-of-de-job-postings-in-one-year-is-your-resume-wrong-1k5i</link>
      <guid>https://dev.to/datadriven/azure-lost-60-of-de-job-postings-in-one-year-is-your-resume-wrong-1k5i</guid>
      <description>&lt;p&gt;Last year, I was reviewing resumes for a senior data engineering role on my team. Out of maybe 40 applicants, I'd estimate 30 of them led with Azure. Azure Data Factory, Azure Synapse, Azure Databricks, Azure everything. Made sense; that's where the jobs were. Fast forward twelve months and I'm looking at a market that's barely recognizable. If your resume looks like it did in 2025, you might be wondering why the phone stopped ringing.&lt;/p&gt;

&lt;p&gt;Here's the number: &lt;strong&gt;Azure&lt;/strong&gt; dropped from 75% of &lt;strong&gt;data engineering&lt;/strong&gt; job postings to 34% in a single year. Not a gradual decline. Not a rounding error. A 41-percentage-point collapse. And most working DEs I talk to haven't even registered it yet because they're heads-down maintaining pipelines, not refreshing job boards.&lt;/p&gt;

&lt;p&gt;This isn't a "the sky is falling" piece. Azure isn't dead. But if your &lt;strong&gt;resume&lt;/strong&gt; is a love letter to a single &lt;strong&gt;cloud&lt;/strong&gt; ecosystem, you need to read this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Don't Lie (But They Do Need Context)
&lt;/h2&gt;

&lt;p&gt;Let's be precise about what happened. Azure went from appearing in three out of four DE job postings to appearing in one out of three. Meanwhile, AWS holds roughly 32% of the data engineering market share, with AWS certifications showing up in 4.2% of listings versus Azure's 3.6%. GCP sits at 1.2%, which sounds small until you realize GCP climbed to 13% overall cloud market share in 2025 and is growing faster in percentage terms than anyone else.&lt;/p&gt;

&lt;p&gt;But here's the part most people miss: the absolute number of Azure jobs didn't evaporate. There are still 12,000+ Azure data engineer roles on LinkedIn right now, and around 42,000 Azure cloud engineer postings globally. The problem isn't that Azure jobs disappeared. The problem is that AWS and Databricks jobs multiplied so fast that Azure's share got swallowed. It's a market-share collapse, not a job-count collapse.&lt;/p&gt;

&lt;p&gt;Why does that distinction matter? Because it changes the advice. If Azure jobs were vanishing, you'd need to panic-retrain. Since they're being outpaced, you need to broaden. Different problem, different solution.&lt;/p&gt;

&lt;p&gt;The cause is partly architectural and partly Microsoft's own doing. Azure Synapse has entered maintenance mode. Microsoft launched a new Azure Databricks Data Engineer certification in March 2026. Read between the lines: Microsoft is conceding the traditional data warehouse space and pivoting toward platform-agnostic tooling. When the platform vendor itself is telling you to learn Databricks, maybe listen.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Resume Trap: How Single-Cloud Profiles Get Filtered
&lt;/h2&gt;

&lt;p&gt;I've been on hiring panels where we passed on strong candidates for the dumbest reasons. But this one isn't dumb; it's mechanical. When a recruiter sources candidates, they're typing "AWS + Airflow" or "GCP + Dataflow" into their search tools. If your profile says "Azure Data Factory + Azure Synapse + Azure Purview" and nothing else, you're invisible before any human ever reads your name.&lt;/p&gt;

&lt;p&gt;This isn't an ATS conspiracy theory. The folklore about 75% of resumes getting auto-rejected is largely unverified; a 2025 study of 25 US recruiters across 10+ ATS platforms found that 92% don't configure auto-rejection rules based on content. The real killer is simpler: keyword misalignment and recruiter search queries. You're not being rejected. You're not being found.&lt;/p&gt;

&lt;p&gt;Here's what makes it worse. 51% of resumes score below 50 out of 100 on ATS assessment before any optimization, mostly because the keywords don't match the job description. If the job says "AWS, Spark, Airflow" and your resume says "Azure, Synapse, Data Factory," you're speaking a different language even though the underlying concepts are identical.&lt;/p&gt;

&lt;p&gt;And look, I get it. You spent three years building production pipelines on Azure. That's real work. You shipped real things. Stop discounting that. But your resume isn't a list of tools. It's evidence that you solve problems that matter to the business. The shift isn't about abandoning what you know; it's about framing it in the language the market is actually searching for.&lt;/p&gt;

&lt;p&gt;Multi-cloud specialists are commanding an 18 to 25% salary premium right now. That's not a suggestion; that's a price signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who's Actually Winning (It's Not Who You Think)
&lt;/h2&gt;

&lt;p&gt;AWS has the volume with roughly 55,000 active cloud engineer postings globally. GCP has the momentum in AI and ML workloads. But the real winners? &lt;strong&gt;Snowflake&lt;/strong&gt; and &lt;strong&gt;Databricks&lt;/strong&gt;. Snowflake skills jumped 10 percentage points from 2025 to 2026. Databricks appears in 16.8% of postings. Apache Spark sits at 38.7%.&lt;/p&gt;

&lt;p&gt;These are platform-agnostic tools. They run on all three clouds. And that's the point.&lt;/p&gt;

&lt;p&gt;The industry isn't consolidating around a cloud provider. It's consolidating around tools that work everywhere. Delta Lake, Iceberg, and Hudi don't care whether your underlying infrastructure is AWS, Azure, or GCP. Airflow appears in 732+ job listings on Indeed alone, not because it's the best orchestrator, but because it's mature and cloud-portable. Hiring gravity around Airflow signals risk-averse enterprises hiring for known-good, not innovation.&lt;/p&gt;

&lt;p&gt;The practical implication: if you're an Azure engineer who knows Databricks, you're already 80% cloud-portable. The Delta Lake format is identical across Azure Databricks, AWS Databricks, and GCP Databricks. You're closer than you think.&lt;/p&gt;

&lt;p&gt;GCP deserves a mention here because it's playing a different game entirely. Smallest absolute share, but the fastest growth rate, and it's repositioning as the data and ML specialist cloud. If you're looking at where the &lt;strong&gt;career&lt;/strong&gt; trajectory leads in five years, GCP's AI infrastructure bet is worth watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Interviewers Are Testing Now (That They Weren't Last Year)
&lt;/h2&gt;

&lt;p&gt;The interview loop has gotten longer and weirder. We're now at 5 to 7 rounds for senior DE roles: recruiter screen, live SQL and Python coding, a take-home, then 4 to 5 onsites covering data modeling, system design, and behavioral. Enterprise hiring timelines have stretched to 60 to 90 days. I once did eight rounds at a single company, was told I passed, was told the offer was sent, the offer was never sent, then a new recruiter said I'd declined the offer I never saw, then I did four more rounds, passed again, and the headcount was closed. The process is not designed for candidates.&lt;/p&gt;

&lt;p&gt;But beyond the structural insanity, the content has shifted. Three things I'm seeing in 2026 that barely existed in 2024 loops:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost optimization is now a hiring separator.&lt;/strong&gt; Interviewers are asking candidates to optimize pipelines for cost, not just correctness. "How would you reduce the monthly spend on this pipeline by 40%?" If you've never thought about FinOps, start. The economics argument always wins; storage costs 2 cents per GB per month, but engineer time costs $75 per hour. Know when to optimize and when to throw money at the problem.&lt;/p&gt;
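&lt;p&gt;A quick back-of-the-envelope using the figures above (a toy sketch; the function name and the 500 GB scenario are mine, not from any real pipeline):&lt;/p&gt;

```python
# Break-even math for "optimize vs. throw money at it", using the
# article's numbers: storage at $0.02/GB-month, engineer time at $75/hr.
STORAGE_PER_GB_MONTH = 0.02
ENGINEER_PER_HOUR = 75.0

def payback_months(gb_saved, hours_spent):
    """Months until a storage optimization pays back the engineering time."""
    monthly_saving = gb_saved * STORAGE_PER_GB_MONTH
    return (hours_spent * ENGINEER_PER_HOUR) / monthly_saving

# Spending 4 engineer-hours ($300) to trim 500 GB saves only $10/month:
print(round(payback_months(500, 4), 1))  # 30.0 months before it pays off
```

&lt;p&gt;Thirty months to break even is the kind of answer interviewers want you to reach for before you start tuning anything.&lt;/p&gt;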

&lt;p&gt;&lt;strong&gt;Batch-versus-streaming is a false binary.&lt;/strong&gt; You're no longer asked to design one or the other. You're asked to design both. Lakehouse architectures with Kappa principles, Delta Lake or Iceberg as the storage format, streaming for operational use cases and batch for regulatory reporting. The free interactive SQL, Python, Data Modeling, and Pipeline Architecture practice at &lt;a href="https://www.datadriven.io" rel="noopener noreferrer"&gt;datadriven.io&lt;/a&gt; tags every problem by pattern, which helps when you need to practice the architectural thinking these loops are actually testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance and data quality are no longer afterthoughts.&lt;/strong&gt; 26% of data engineering job postings no longer mention education requirements, signaling a shift toward demonstrable skills. But what skills? Not just "can you write a DAG." Companies want engineers who can explain decisions, detect drift, and document compliance. Insurance, FinTech, and healthcare are hiring governance specialists faster than traditional pipeline engineers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Junior engineers worry about which tool to learn. Senior engineers worry about which problems to solve. Staff engineers worry about which problems to prevent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to Reposition Without Starting Over
&lt;/h2&gt;

&lt;p&gt;If you're sitting on an Azure-heavy resume right now, here's the play. And it's not "go get three new certifications."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reframe, don't rebuild.&lt;/strong&gt; Azure Data Factory is an orchestrator. Airflow is an orchestrator. Synapse is a warehouse. Snowflake is a warehouse. The concepts transfer; the syntax is the easy part. Your resume should lead with what you did (migrated 400 tables, built the pipeline finance depends on for board decks, reduced pipeline failures by 60%) and list tools second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add AWS or GCP keywords, but only if they're real.&lt;/strong&gt; Spin up a personal project on AWS. Build one pipeline. Use S3, Glue or Athena, and Airflow. That's enough to honestly list it. Don't lie; do reps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Target verticals where Azure still dominates.&lt;/strong&gt; Healthcare and financial services still prefer Azure due to compliance ecosystems and Microsoft's enterprise relationships. An Azure engineer targeting regulated industries faces less competition than the headline numbers suggest. The crash is real in aggregate but uneven by vertical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest in platform-agnostic skills.&lt;/strong&gt; Spark, SQL, Python, data modeling, Airflow, dbt. These appear in job postings regardless of cloud. 65% of hiring managers say it's harder to find skilled data engineers than a year ago. The scarcity isn't in single tools; it's in systems thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop treating the interview as the job.&lt;/strong&gt; Interviewing is a separate skill. The market wants architectural fluency across tools that change quarterly. Prep for pipeline architecture, not system design. DEs don't care about load balancers and reverse proxies.&lt;/p&gt;

&lt;p&gt;Average DE salaries compressed from $153k to $133k between 2025 and 2026, but experienced engineers are still commanding $200k+ total comp. The money is there. The question is whether your profile is positioned to capture it.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The Azure shift is real, it's significant, and it's worth adjusting for. But the engineers who thrive aren't the ones who pick the right cloud. They're the ones who understand that clouds are interchangeable and problems are forever.&lt;/p&gt;

&lt;p&gt;So: how many of you are sitting on an Azure-heavy resume right now, and what's your plan?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>interview</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data Engineers Don't Need DSA. So Why Do Interviews Still Test It?</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Mon, 27 Apr 2026 02:20:24 +0000</pubDate>
      <link>https://dev.to/datadriven/data-engineers-dont-need-dsa-so-why-do-interviews-still-test-it-bof</link>
      <guid>https://dev.to/datadriven/data-engineers-dont-need-dsa-so-why-do-interviews-still-test-it-bof</guid>
      <description>&lt;p&gt;I did somewhere around 20 interview loops during my last job search. Phone screens, take-homes, onsites, "culture chats" that were secretly technical screens. At one company I did eight rounds, was told I passed, was told the offer was sent, it was never sent, then a new recruiter said I'd declined the offer I never saw. I did four more rounds. Passed again. Headcount was closed.&lt;/p&gt;

&lt;p&gt;Through all of that, you know what never once came up on the actual job? Inverting a binary tree.&lt;/p&gt;

&lt;p&gt;This debate isn't new. Every six months, a Reddit thread blows up with senior &lt;strong&gt;data engineering&lt;/strong&gt; folks asking why they're being tested on dynamic programming when their actual job is debugging why a pipeline silently dropped 2M rows last Tuesday. But in 2026, with 80,000 tech layoffs in Q1 alone and companies banning AI tools in interviews while AI reshapes the job itself, the question isn't theoretical anymore. It's a breaking point.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DSA Debate That Won't Die
&lt;/h2&gt;

&lt;p&gt;Here's why this fight resurfaces like clockwork: nobody agrees on what a data engineer even is.&lt;/p&gt;

&lt;p&gt;I'm not being glib. The title means something completely different at Google than it does at a Series B startup than it does at a Fortune 500 retailer. When companies can't define the role, they fall back on the only standardized proxy they have: &lt;strong&gt;LeetCode&lt;/strong&gt;-style algorithm problems. Binary trees. Graph traversal. Backtracking. Problems that software engineers have been grinding for a decade, repurposed wholesale for a role that shares maybe 30% of the same skill set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DSA&lt;/strong&gt; is a mechanism to rank candidates, not an indicator of data engineering experience. I've said this before and I'll keep saying it. Accept it for the arbitrary IQ measuring stick that it is.&lt;/p&gt;

&lt;p&gt;But here's the thing that's changed. The market broke.&lt;/p&gt;

&lt;p&gt;80,000 people lost their jobs in the first quarter of 2026. Nearly half of those cuts were attributed to AI and automation. The people flooding the &lt;strong&gt;interview&lt;/strong&gt; pipeline aren't junior developers taking their first shot; they're experienced engineers with production systems under their belt, competing for fewer roles, facing 5 to 7 round loops that stretch 60 to 90 days. Karat's data across 600,000+ technical interviews confirms this is the norm now, not the exception.&lt;/p&gt;

&lt;p&gt;And fewer than 30% of companies have updated their assessment systems to reflect what data engineering actually requires.&lt;/p&gt;

&lt;p&gt;Let that sit. Seven out of ten companies are screening data engineers the same way they screened them in 2022. The tools changed. The job changed. AI changed everything about how we write and review code. The interview didn't change.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Senior DEs aren't walking away from interview loops because they can't code. They're walking away because the cost/benefit calculation broke. Twenty hours of prep for problems that have zero correlation with the job, in a market where the roles might disappear before the loop finishes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What DSA Actually Measures (and What It Doesn't)
&lt;/h2&gt;

&lt;p&gt;Let's be precise about this. LeetCode has 3,000+ problems. The vast majority test binary trees, dynamic programming, and graph algorithms. Skills that data engineers report using "never to rarely" in production.&lt;/p&gt;

&lt;p&gt;You know what I use daily? SQL window functions. CTEs. Deduplication logic. Understanding why a LEFT JOIN is silently inflating row counts because someone upstream changed a grain without telling anyone. Figuring out why a Spark job is spilling to disk. Debugging schema drift that broke a downstream dashboard the CFO reads every Monday.&lt;/p&gt;

&lt;p&gt;None of that is on LeetCode.&lt;/p&gt;
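&lt;p&gt;That LEFT JOIN row-inflation failure mode is trivially easy to reproduce. A toy pure-Python sketch (the tables and values are made up): an upstream table quietly changes grain from one row per order to one row per order-region, and every downstream sum is wrong.&lt;/p&gt;

```python
# Toy in-memory join showing silent row inflation after a grain change.
orders = [{"order_id": 1, "amount": 10},
          {"order_id": 2, "amount": 20},
          {"order_id": 3, "amount": 30}]

# 'customers' was supposed to be one row per order_id, but an upstream
# change made it one row per (order_id, region) without telling anyone.
customers = [{"order_id": 1, "region": "US"},
             {"order_id": 1, "region": "EU"},
             {"order_id": 2, "region": "US"},
             {"order_id": 3, "region": "APAC"}]

# Every order matches, so this equals the LEFT JOIN result.
joined = [{**o, **c} for o in orders
          for c in customers if c["order_id"] == o["order_id"]]

# The join silently added a row, and the revenue sum now double-counts:
print(len(orders), len(joined))           # 3 4
print(sum(r["amount"] for r in joined))   # 70, not the true 60
```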

&lt;p&gt;An empirical study from interviewing.io found that LeetCode rating has no correlation with interview performance percentile. What does correlate? Problem volume solved. Which isn't a signal of capability; it's a signal of free time. That's a selection bias, not a predictor of job performance.&lt;/p&gt;

&lt;p&gt;SQL appears in 61% of data engineering job postings. Data modeling skills have 122,000+ open US roles. Cloud cost optimization is now a top-5 interview category at companies tying bonus incentives to infrastructure savings. Yet the screening gate for all of these roles is still "solve this medium in 25 minutes."&lt;/p&gt;

&lt;p&gt;I've been on hiring panels where we passed on strong candidates for the dumbest reasons. "They got the optimal solution but took too long." Meanwhile, the candidate who speed-ran the binary search problem couldn't explain what idempotency means or why you'd want it in a pipeline. We hired the fast one. That pipeline broke in production within a month.&lt;/p&gt;

&lt;p&gt;50+ companies (Airtable, Buffer, Calendly, CircleCI, and others) have moved away from LeetCode-style assessments entirely, replacing them with take-home projects, code reviews, and system design discussions. The signal is there. The industry just hasn't followed it at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI-Banning Irony
&lt;/h2&gt;

&lt;p&gt;This is the part that makes me want to punch a hole in the wall.&lt;/p&gt;

&lt;p&gt;62% of organizations prohibit AI use in technical interviews. At the same time, 76% of data engineering work is now enhanced by AI tools, delivering 25% productivity improvements on average. Companies are telling candidates: "Don't use the thing you'll use every single day if we hire you."&lt;/p&gt;

&lt;p&gt;It's like banning calculators from a math test in a world where every math job involves using calculators.&lt;/p&gt;

&lt;p&gt;And it gets better. Karat's data says over half of candidates use AI anyway, despite being told not to. No company has disclosed a scalable detection method beyond "watch their eyes" and "screen recording." The enforcement is theater.&lt;/p&gt;

&lt;p&gt;Anthropic, the company that built Claude, initially banned candidates from using AI in interviews. Then reversed the policy in July 2025. If the company most invested in AI's credibility can't figure out a coherent policy, what chance does your average enterprise hiring committee have?&lt;/p&gt;

&lt;p&gt;Meanwhile, Meta went the opposite direction and piloted AI-enabled interviews where Claude, GPT, and Gemini are built into the coding environment. Amazon explicitly bans all GenAI with disqualification as the penalty. Google brought back in-person rounds because remote assessments were too easy to game.&lt;/p&gt;

&lt;p&gt;There's no consensus. There's no stable equilibrium. There's just companies reacting quarter by quarter while candidates try to figure out which rules apply at which company.&lt;/p&gt;

&lt;p&gt;Here's the contrarian take nobody wants to hear: if an AI can spit out a clean solution to a medium LC problem, what does asking that problem actually tell you about the candidate? That they memorized something a machine produces on demand? The signal was already thin. Now it's basically noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Predicts Whether Someone Can Do This Job
&lt;/h2&gt;

&lt;p&gt;The actual job of data engineering is less "write a DAG" and more "figure out why finance's board deck had wrong numbers for three months and nobody noticed." It's debugging. It's data modeling. It's understanding the business well enough to catch when something looks wrong before stakeholders do.&lt;/p&gt;

&lt;p&gt;Karat's own data from 400 engineering leaders confirms the baseline assessment focus should be SQL proficiency, window functions, CTEs, and Python fundamentals. Not graph algorithms. Not dynamic programming.&lt;/p&gt;

&lt;p&gt;The companies getting this right are testing for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data modeling fluency.&lt;/strong&gt; Can you design a schema that won't collapse when requirements change? Can you explain why you'd keep fact tables at a single, well-defined grain? This is the make-or-break round, and every practitioner knows it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline architecture.&lt;/strong&gt; Not system design in the SWE sense (I don't care about load balancers and reverse proxies). Can you design an ETL pipeline that handles late-arriving data, schema evolution, and failure recovery?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost reasoning.&lt;/strong&gt; Cloud cost optimization is now a top interview category. Can you explain why denormalizing that table saves $40K/year in compute even though it costs $200/year in storage? The economics argument wins every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident debugging.&lt;/strong&gt; What broke, why, and how do you make sure it never happens again? This is 60% of the actual job and maybe 5% of interview loops.&lt;/li&gt;
&lt;/ul&gt;
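&lt;p&gt;On the incident-debugging point, the cheapest insurance is a load that can be re-run safely. A minimal sketch of the idea (a dict stands in for a partitioned warehouse table; the names are mine): overwrite the partition instead of appending, so a retry leaves the same end state.&lt;/p&gt;

```python
# Idempotent daily load, sketched: re-running the same partition
# overwrites rather than appends, so retries never double-count.
def load_partition(store, partition_date, rows):
    """Replace the partition's contents (think delete-then-insert or MERGE)."""
    store[partition_date] = list(rows)  # overwrite, never append
    return len(store[partition_date])

store = {}
load_partition(store, "2026-04-01", [{"id": 1}, {"id": 2}])
load_partition(store, "2026-04-01", [{"id": 1}, {"id": 2}])  # retry after a failure
print(len(store["2026-04-01"]))  # 2, not 4
```

&lt;p&gt;The design choice being tested isn't the code; it's knowing that "what happens when this runs twice?" is the first question to ask of any pipeline.&lt;/p&gt;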

&lt;p&gt;35% year-over-year growth in data engineering demand tells you the &lt;strong&gt;career&lt;/strong&gt; isn't going anywhere. 2.9 million data-related roles remain open globally. The role is healthy. The hiring process is sick.&lt;/p&gt;

&lt;h2&gt;
  
  
  Play the Game, But Name It
&lt;/h2&gt;

&lt;p&gt;I'm not going to sit here and tell you to boycott DSA prep. That's bad advice from people who already have jobs. The game is the game. If you're interviewing at companies that screen on LeetCode, you grind LeetCode. Stick to mediums; do 50 and you'll be solid. Few companies ask hards consistently.&lt;/p&gt;

&lt;p&gt;But let's stop pretending this process is meritocratic. It's not. It's standardized and defensible, which is what legal departments and risk-averse hiring committees want. It has almost nothing to do with predicting whether you'll be good at maintaining the pipeline that finance depends on for board decks.&lt;/p&gt;

&lt;p&gt;Interviewing is a skill. It's separate from the actual job. Treat prep like a job. I'ma be super honest: I have a degree from a degree mill and don't feel particularly "skilled." Just a grind.&lt;/p&gt;

&lt;p&gt;The real fix isn't going to come from candidates complaining on Reddit. It's going to come from companies losing great engineers because those engineers did the math. Twenty hours of algorithm prep for a role where you'll never touch an algorithm, in a market where you might get ghosted anyway, while simultaneously being told you can't use the AI tools that define modern engineering work. At some point, the experienced people just stop showing up for that loop.&lt;/p&gt;

&lt;p&gt;Some companies have figured this out. The 50+ that ditched LeetCode. The ones testing pipeline architecture, data modeling, and cost optimization. They're getting better candidates because they're filtering for the right signal.&lt;/p&gt;

&lt;p&gt;The rest are going to keep wondering why their data pipelines break and their senior engineers leave.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal.&lt;/p&gt;

&lt;p&gt;The interview process should test for those eternal problems. Not for whether you memorized the optimal solution to "Minimum Window Substring."&lt;/p&gt;

&lt;p&gt;What's the worst interview loop you've been through, and did the questions have anything to do with the actual job?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>sql</category>
    </item>
    <item>
      <title>The 6 Python Data Engineering Interview Questions You Will Actually Be Asked in 2026</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:31:19 +0000</pubDate>
      <link>https://dev.to/datadriven/the-6-python-data-engineering-interview-questions-you-will-actually-be-asked-in-2026-1o14</link>
      <guid>https://dev.to/datadriven/the-6-python-data-engineering-interview-questions-you-will-actually-be-asked-in-2026-1o14</guid>
      <description>&lt;p&gt;Every data engineer preparing for interviews hits the same confused moment. You search for python interview questions, get a list of reverse-a-linked-list and two-sum problems, grind them for two weeks, walk into your first data engineering loop, and get asked to deduplicate a 10 million row event stream while preserving the latest record per composite key.&lt;/p&gt;

&lt;p&gt;None of those LeetCode problems prepared you for that question. And the Python round in a data engineering interview is not going to get easier until you realize the questions are a different species from the ones that show up in a software engineering loop.&lt;/p&gt;

&lt;p&gt;I have run over 250 interview loops at Google, Meta, LinkedIn, and Netflix. The Python portion of a data engineering loop does not look like a Python backend or frontend loop. It looks like a pipeline-correctness loop wearing a Python costume. Below is the full taxonomy of the six questions that actually get asked, how they differ from the SWE canon, and which patterns you have to internalize to pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Difference In One Sentence
&lt;/h2&gt;

&lt;p&gt;The SWE Python round tests whether you can write correct code on data that fits in memory. The data engineering Python round tests whether you can reason about data correctness, grain, idempotency, and scale on data that usually does not.&lt;/p&gt;

&lt;p&gt;That is the entire gap. Every other difference flows from it.&lt;/p&gt;

&lt;p&gt;A SWE Python problem gives you a list of integers and asks you to do something clever with it. The list has 10 elements. The test cases have 10 elements. The expected behavior is obvious. The skill tested is algorithms.&lt;/p&gt;

&lt;p&gt;A data engineering Python problem gives you an iterator over events that might have 10 million elements, might have duplicates from retries, might have late-arriving data, might have schema drift between rows, and asks you to produce a deduplicated, ordered, grouped output without loading it all into memory. The test cases will have 5 elements. The skill tested is production instinct.&lt;/p&gt;

&lt;p&gt;Candidates who prepped for SWE Python walk in confident and freeze the moment the input becomes an iterator instead of a list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 1: Streaming Aggregation Over an Iterator
&lt;/h2&gt;

&lt;p&gt;This is the single most common python question I have given and received in a data engineering loop across four FAANG companies.&lt;/p&gt;

&lt;p&gt;Setup: you are handed an iterator that yields event dictionaries one at a time. Each event has &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;event_type&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;, and a few other fields. Compute the count of each event type per user without loading the full iterator into a list.&lt;/p&gt;

&lt;p&gt;A SWE candidate types this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;events_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That solution works on the 5 test events the interviewer gave you. It also kills the pipeline when a real day's worth of events arrives. The &lt;code&gt;list(events)&lt;/code&gt; call materializes the whole iterator. In production that is 40 GB of dictionaries in memory for no reason.&lt;/p&gt;

&lt;p&gt;The data engineering answer never materializes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same logic, different relationship with memory. An interviewer running this round is watching for whether you call &lt;code&gt;list()&lt;/code&gt; on an iterator. If you do, you have told them you think like a SWE, not a data engineer. Half the battle in Python data engineering interviews is showing you know the difference between an iterator and a list and that you default to iterators.&lt;/p&gt;

&lt;p&gt;The follow-up is always: what if the iterator is too large to even hold the counts dict in memory? Now you are in sketch-aggregation territory (HyperLogLog, Count-Min Sketch) or you partition by a hash of the key. If you have never heard of those, the senior bar just evaporated.&lt;/p&gt;
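&lt;p&gt;The partition-by-hash answer is worth being able to sketch on the spot. A minimal illustration, assuming the source can be re-read (the &lt;code&gt;make_events&lt;/code&gt; callable, standing in for re-opening a file or re-scanning a table, is hypothetical):&lt;/p&gt;

```python
from collections import defaultdict

def count_events_partitioned(make_events, num_partitions=4):
    """Aggregate one hash partition at a time, so memory holds only
    roughly 1/num_partitions of the distinct keys at any moment.
    make_events is a callable returning a fresh iterator over events."""
    for p in range(num_partitions):
        counts = defaultdict(int)
        for e in make_events():
            key = (e["user_id"], e["event_type"])
            if hash(key) % num_partitions == p:
                counts[key] += 1
        yield from counts.items()
```

&lt;p&gt;Each pass holds only one partition's keys. In a real pipeline you would write each partition out and aggregate it separately rather than re-scan the source, but the interview point is the same: bound the working set by splitting the keyspace.&lt;/p&gt;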

&lt;h2&gt;
  
  
  Question 2: Deduplication With a Tiebreaker
&lt;/h2&gt;

&lt;p&gt;Every data engineering loop has a dedup question. The SWE version is "remove duplicates from a list." The DE version is "here is an event stream with retries. Each event has an &lt;code&gt;event_id&lt;/code&gt;, an &lt;code&gt;ingested_at&lt;/code&gt;, and an &lt;code&gt;updated_by&lt;/code&gt; that is sometimes null. Keep one row per &lt;code&gt;event_id&lt;/code&gt;, preferring the latest &lt;code&gt;ingested_at&lt;/code&gt;, breaking ties by preferring a non-null &lt;code&gt;updated_by&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;A SWE candidate reaches for a set. Sets do not have tiebreakers. Sets discard information you need.&lt;/p&gt;

&lt;p&gt;The DE answer iterates, keeps a dict keyed by &lt;code&gt;event_id&lt;/code&gt;, and compares against the current best:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What interviewers are testing here is not the algorithm. The algorithm is trivial. They are testing whether you ask "what defines a duplicate" before typing, whether you handle the null case in the tiebreaker explicitly, and whether you return an iterator-compatible view instead of a list.&lt;/p&gt;

&lt;p&gt;The follow-up is always: what if the input is sorted by &lt;code&gt;ingested_at&lt;/code&gt; already? Can you do this in constant additional memory? That is where a streaming groupby pattern comes in, and if you can sketch it on the fly you are clearing a senior bar.&lt;/p&gt;
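&lt;p&gt;The streaming groupby pattern is short enough to sketch live. This version assumes the stream arrives already grouped by &lt;code&gt;event_id&lt;/code&gt;, so extra memory is bounded by the largest run of duplicates, not by the number of distinct ids:&lt;/p&gt;

```python
from itertools import groupby

def dedupe_sorted(events):
    """Streaming dedup for a stream already grouped by event_id.
    Buffers one group at a time; yields the winner per group."""
    for _, group in groupby(events, key=lambda e: e["event_id"]):
        # Latest ingested_at wins; ties broken by preferring non-null updated_by.
        yield max(group, key=lambda e: (e["ingested_at"],
                                        e["updated_by"] is not None))
```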

&lt;h2&gt;
  
  
  Question 3: Schema-Tolerant Parsing
&lt;/h2&gt;

&lt;p&gt;This is the one SWE prep completely ignores and DE interviews lean on heavily.&lt;/p&gt;

&lt;p&gt;Setup: you are given a list of dictionaries representing events from a log. Some events are missing fields. Some have extra fields. Some have the right field names but wrong types. Write a function that produces a clean, typed output and a quarantine list for rows that cannot be parsed.&lt;/p&gt;

&lt;p&gt;SWE Python does not train this muscle. The LeetCode problem gives you a clean input every time. The DE interviewer is watching whether you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validate required fields explicitly before touching them&lt;/li&gt;
&lt;li&gt;Cast types with explicit &lt;code&gt;try/except&lt;/code&gt; around each cast&lt;/li&gt;
&lt;li&gt;Never let one bad row kill the whole batch&lt;/li&gt;
&lt;li&gt;Separate "valid" from "invalid" without discarding the invalid rows silently&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A reasonable answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarantine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing required field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;quarantine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarantine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trap interviewers plant is a row where &lt;code&gt;user_id&lt;/code&gt; is the string &lt;code&gt;"null"&lt;/code&gt; instead of the Python &lt;code&gt;None&lt;/code&gt;. &lt;code&gt;int("null")&lt;/code&gt; raises. A candidate who wraps only the final output in one big try/except loses one bad row and all subsequent rows, which is a pipeline bug.&lt;/p&gt;

&lt;p&gt;If you have never written parsing code in production, the instinct to quarantine bad rows instead of crashing on them is foreign. It is the single most senior-signaling habit in a DE Python interview.&lt;/p&gt;
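&lt;p&gt;One habit that signals production experience is a per-field cast helper that names the sentinel strings explicitly instead of letting &lt;code&gt;int("null")&lt;/code&gt; surprise you. A sketch (the sentinel list here is illustrative, not exhaustive):&lt;/p&gt;

```python
def to_int(value):
    """Cast to int, treating common string sentinels as missing.
    Raises ValueError for anything unparseable, so the row lands
    in quarantine instead of crashing the batch."""
    if value is None or value in ("", "null", "NULL", "None"):
        raise ValueError("missing value")
    return int(value)
```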

&lt;h2&gt;
  
  
  Question 4: Window and Session Logic Without SQL
&lt;/h2&gt;

&lt;p&gt;When the Python round asks a question that maps to a SQL window function, the SQL-only candidates freeze. The prompt sounds like: "Here is a sorted list of events per user. Group them into sessions where a session ends after 30 minutes of inactivity. Return a list of sessions with start_ts, end_ts, and event_count."&lt;/p&gt;

&lt;p&gt;This is the sessionization pattern the SQL round tests, ported to Python. The interviewer wants to see whether you can implement in Python what you would write as a &lt;code&gt;LAG&lt;/code&gt; + running &lt;code&gt;SUM&lt;/code&gt; in SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sessionize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gap_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1800&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;gap_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SWE candidate tries to use a library. The DE candidate writes the rolling-state loop. This is the clearest example of DE Python being closer to state-machine code than to algorithmic code.&lt;/p&gt;

&lt;p&gt;Follow-ups are always about edge cases. What if two events share a timestamp? What if the input is not sorted? What if the iterator is chunked across hour boundaries and you have to support resuming? Each one is a real pipeline concern, not an algorithmic concern.&lt;/p&gt;
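&lt;p&gt;For the "what if the input is not sorted" follow-up, a cheap answer is to make the assumption explicit with a guard generator that fails fast instead of silently emitting wrong sessions (a sketch; the timestamp field name is assumed):&lt;/p&gt;

```python
def assert_sorted(events, ts_field="ts"):
    """Pass events through unchanged, but fail fast the moment a
    timestamp goes backwards."""
    last = None
    for e in events:
        if last is not None and last > e[ts_field]:
            raise ValueError(f"out-of-order event at ts={e[ts_field]}")
        last = e[ts_field]
        yield e
```

&lt;p&gt;Wrapping the input as &lt;code&gt;sessionize(assert_sorted(events))&lt;/code&gt; turns a silent correctness bug into a loud pipeline failure, which is the trade a data engineer should articulate out loud.&lt;/p&gt;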

&lt;h2&gt;
  
  
  Question 5: Backfill-Safe Incremental Logic
&lt;/h2&gt;

&lt;p&gt;This is the question that most distinguishes senior data engineers from mid-level candidates.&lt;/p&gt;

&lt;p&gt;Setup: you have a function that processes yesterday's events. Rewrite it so that running it on today's date, on a date from last week, or on a date range over the last month produces correct output without double-counting or dropping data.&lt;/p&gt;

&lt;p&gt;The SWE candidate does not realize they were asked a design question. They write a function that filters by date and returns aggregates. The DE candidate writes code that is idempotent, deterministic on input date range, and safe against partial re-runs.&lt;/p&gt;

&lt;p&gt;The moves interviewers are watching for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter on &lt;code&gt;ingested_at&lt;/code&gt; (processing time), not on &lt;code&gt;event_ts&lt;/code&gt; (event time), when you want to catch late data&lt;/li&gt;
&lt;li&gt;Produce output keyed by (partition, primary_key) so re-running overwrites instead of appends&lt;/li&gt;
&lt;li&gt;Take the date range as an argument, not as &lt;code&gt;datetime.now()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Emit a result that is the same shape whether you process one day or thirty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events_iter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interviewer's follow-up is "what happens if this fails halfway through?" If you do not immediately say "the same inputs produce the same outputs, so re-running it is safe," you have not understood why you were asked this question. Idempotency is the senior signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 6: PySpark DataFrame Logic
&lt;/h2&gt;

&lt;p&gt;If the role is Spark-heavy, and many data engineering roles in 2026 are, the Python portion of the interview is really a PySpark portion. The questions are the same five patterns above, expressed in DataFrame API calls instead of pure Python.&lt;/p&gt;

&lt;p&gt;The specific patterns that recur:&lt;/p&gt;

&lt;p&gt;A join on multiple columns, phrased as "here are two DataFrames, produce one row per customer with their latest order and their primary payment method." Candidates who do not know the &lt;code&gt;join(other, on=["customer_id", "region"], how="left")&lt;/code&gt; syntax lose time to syntax, not logic.&lt;/p&gt;

&lt;p&gt;A window function in PySpark. Same sessionization prompt, but now you write &lt;code&gt;Window.partitionBy("user_id").orderBy("ts")&lt;/code&gt; and use &lt;code&gt;F.lag("ts").over(w)&lt;/code&gt; to compute gaps. This is the direct translation of the SQL pattern and the pure-Python pattern, and interviewers love it because it tests whether you have touched all three.&lt;/p&gt;

&lt;p&gt;A broadcast-join decision. Interviewers describe two tables of wildly different sizes and ask how you would join them. If you say "regular join," you fail. If you say "broadcast the small one," you pass the first level. If you say "broadcast the small one, but only if it fits in the executor memory budget, otherwise repartition and sort-merge," you pass the senior level.&lt;/p&gt;

&lt;p&gt;A partitioning and skew question. A DataFrame has 10 million rows, 90% of which share the same &lt;code&gt;user_id&lt;/code&gt; (a bot). The interviewer asks what happens when you &lt;code&gt;groupBy("user_id")&lt;/code&gt;. The answer involves salting, two-stage aggregation, or adaptive query execution. This is not a SWE question. It is a pipeline-performance question dressed as code.&lt;/p&gt;
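&lt;p&gt;The salting idea is easier to show than to describe. Here is the two-stage pattern in plain Python rather than Spark, purely to illustrate the mechanics: stage one spreads a hot key across salted sub-keys (the way a salted &lt;code&gt;groupBy&lt;/code&gt; spreads one key across many partitions), stage two merges the partials:&lt;/p&gt;

```python
import random
from collections import defaultdict

def salted_counts(events, num_salts=8):
    # Stage 1: count per (salt, key) so no single sub-key dominates.
    partial = defaultdict(int)
    for e in events:
        salt = random.randrange(num_salts)
        partial[(salt, e["user_id"])] += 1
    # Stage 2: a second, cheap aggregation merges partials per key.
    final = defaultdict(int)
    for (_salt, user_id), n in partial.items():
        final[user_id] += n
    return dict(final)
```

&lt;p&gt;In Spark the payoff is that stage one's shuffle is balanced across executors; the second aggregation is tiny because it sees at most &lt;code&gt;num_salts&lt;/code&gt; rows per key.&lt;/p&gt;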

&lt;h2&gt;
  
  
  What the SWE Round Never Asks That the DE Round Always Asks
&lt;/h2&gt;

&lt;p&gt;If you only prep SWE-style Python, you will never see these coming in a data engineering loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What is the memory footprint of this solution on 100 million rows?"&lt;/li&gt;
&lt;li&gt;"What happens to this code if the input is an iterator instead of a list?"&lt;/li&gt;
&lt;li&gt;"How do you make this idempotent?"&lt;/li&gt;
&lt;li&gt;"What does this return when a field is missing or null?"&lt;/li&gt;
&lt;li&gt;"How does this behave if you run it twice?"&lt;/li&gt;
&lt;li&gt;"If this was running in production and failed midway, what would the next run see?"&lt;/li&gt;
&lt;li&gt;"What is the grain of the output, and does that match what the downstream consumer expects?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those is a pipeline-correctness question, not an algorithm question. Every one of them is what separates a DE hire from a SWE hire in the same company.&lt;/p&gt;
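&lt;p&gt;The iterator question in that list has a concrete failure mode worth internalizing: an iterator can be consumed exactly once, and code written for lists fails silently, not loudly, when handed one:&lt;/p&gt;

```python
events = iter([{"user_id": 1}, {"user_id": 2}])

first_pass = sum(1 for _ in events)   # consumes the iterator
second_pass = sum(1 for _ in events)  # nothing left to count

assert first_pass == 2
assert second_pass == 0  # list-shaped code would silently see no data here
```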

&lt;h2&gt;
  
  
  How To Practice In Four Weeks
&lt;/h2&gt;

&lt;p&gt;Week one, iterator-first coding. Rewrite every LeetCode-style problem you have solved so that it accepts an iterator and returns an iterator. Use &lt;code&gt;itertools.groupby&lt;/code&gt;, &lt;code&gt;itertools.islice&lt;/code&gt;, generator expressions. Stop reaching for lists.&lt;/p&gt;
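&lt;p&gt;As a warm-up for that rewrite habit, compare a list-first and an iterator-first version of the same top-n task (the example is invented, not from a real interview):&lt;/p&gt;

```python
import heapq

def top_amounts_list(rows, n):
    # List habit: sorts the entire input in memory.
    return sorted((r["amount"] for r in rows), reverse=True)[:n]

def top_amounts_iter(rows, n):
    # Iterator habit: a bounded heap of n items, single pass.
    return heapq.nlargest(n, (r["amount"] for r in rows))
```

&lt;p&gt;&lt;code&gt;heapq.nlargest&lt;/code&gt; keeps a heap of at most n items, so the iterator version works on inputs that never fit in memory.&lt;/p&gt;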

&lt;p&gt;Week two, the five recurring patterns above. Streaming aggregation, dedup with tiebreakers, schema-tolerant parsing, sessionization, and backfill-safe incremental logic. Write each one on paper before running it.&lt;/p&gt;

&lt;p&gt;Week three, PySpark DataFrame fluency. Joins with multiple keys, window functions, broadcast hints, skew handling. Read one real PySpark job from an open-source repository end to end. The muscle memory for DataFrame syntax only comes from reading real jobs.&lt;/p&gt;

&lt;p&gt;Week four, edge cases. Null handling, duplicate keys, out-of-order inputs, idempotency, late-arriving data. Most DE interview rejections happen on the edge-case follow-up, not on the main question. Budget more time here than feels right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Skill
&lt;/h2&gt;

&lt;p&gt;None of the patterns above are syntactically hard. The hard part is that the DE Python interview is testing a worldview. You see the question "count events per user" and the SWE worldview asks "what data structure." The DE worldview asks "what is the grain, what is the scale, what is the recovery story."&lt;/p&gt;

&lt;p&gt;Candidates who cross over from SWE Python to DE Python successfully are the ones who rewire the question first. Before typing: what is the grain, what is the scale, is the input an iterator, how does this behave on a re-run, how does it handle nulls and missing fields. Once those questions are automatic, the syntax is trivial.&lt;/p&gt;

&lt;p&gt;The DE Python interview is closer to a code review than to a coding round. You are not being asked if you can write the code. You are being asked if you would approve this code at 3 AM on a pipeline you own.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Last Thing
&lt;/h2&gt;

&lt;p&gt;If you are preparing for a data engineering loop with a LeetCode-style practice routine, stop. The patterns are wrong. The input shapes are wrong. The follow-ups will catch you off guard, and you will lose offers you should have won.&lt;/p&gt;

&lt;p&gt;Practice the six patterns above. Think about grain, scale, and idempotency every time you type. Make your default input an iterator and your default concern production safety. Do that for four weeks and the Python round stops being scary.&lt;/p&gt;

&lt;p&gt;If you want to practice for your upcoming data engineer interview, &lt;a href="http://www.DataDriven.io" rel="noopener noreferrer"&gt;www.DataDriven.io&lt;/a&gt; is free. No trial, no credit card. Built because the gap between Python SWE prep and Python DE prep is costing good data engineers jobs that they would otherwise get.&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
    </item>
    <item>
      <title>78K Tech Layoffs, 47% AI-Blamed: Is Data Engineering Safe?</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:54:34 +0000</pubDate>
      <link>https://dev.to/datadriven/78k-tech-layoffs-47-ai-blamed-is-data-engineering-safe-4en0</link>
      <guid>https://dev.to/datadriven/78k-tech-layoffs-47-ai-blamed-is-data-engineering-safe-4en0</guid>
      <description>&lt;p&gt;I woke up on March 31st to a Slack message from a former colleague at Oracle. Six words: "Got the email. 6am. It's done." Thirty thousand people, notified by email before sunrise. Not because Oracle was struggling; the company had just posted a 95% net income jump to $6.13 billion. They cut 18% of their workforce to fund data centers.&lt;/p&gt;

&lt;p&gt;That's the 2026 &lt;strong&gt;layoffs&lt;/strong&gt; story in a single sentence. Companies aren't cutting because they're broke. They're cutting because Wall Street rewards headcount-to-capex conversion, and "AI" is the magic word that makes the stock go up.&lt;/p&gt;

&lt;p&gt;78,557 tech workers were laid off in Q1 2026. Nearly half of those cuts, 47.9%, were publicly attributed to &lt;strong&gt;AI&lt;/strong&gt;. Block slashed 40% of its workforce and explicitly blamed AI. Meta announced 8,000 more cuts on April 20th. And every data engineer I know has been asking the same question: am I next?&lt;/p&gt;

&lt;p&gt;I've been through three waves of "&lt;strong&gt;data engineering&lt;/strong&gt; is getting automated away." Still here. Still employed. Still debugging the same categories of problems. But this wave feels different, and it deserves an honest look.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 47.9% Number Is Half Real, Half Investor Theater
&lt;/h2&gt;

&lt;p&gt;Let's start with the headline stat, because it's doing a lot of heavy lifting. 47.9% of Q1 2026 tech layoffs were attributed to AI. That sounds terrifying. It's also misleading.&lt;/p&gt;

&lt;p&gt;Here's the thing nobody's unpacking: of 45,363 confirmed layoffs tracked through early March, only 20.4% were &lt;em&gt;explicitly&lt;/em&gt; attributed to AI by the companies themselves. The 47.9% figure comes from retrospective analysis that assigns AI blame more liberally than the companies did in real-time disclosures. That's a gap you could drive a truck through.&lt;/p&gt;

&lt;p&gt;Sam Altman said it plainly: "There's some AI washing where people are blaming AI for layoffs that they would otherwise do, and there's some real displacement by AI of different kinds of jobs." When the CEO of OpenAI is telling you the AI attribution is inflated, maybe listen.&lt;/p&gt;

&lt;p&gt;59% of hiring managers surveyed admitted their companies frame workforce reductions as "AI-driven" partly to appeal to stakeholders, even when automation played a minimal role. Think about that. More than half of these companies are saying "AI made us do it" because it sounds better on an earnings call than "we overhired in 2021 and our margins need work."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The 47.9% figure is a stock market narrative wearing a labor statistic's clothing. Some of it is real displacement. A lot of it is executives who discovered that saying "AI efficiency" gets a better reaction from analysts than "cost cutting."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This doesn't mean AI displacement isn't happening. It is. But treating 47.9% as gospel is lazy analysis, and lazy analysis leads to bad &lt;strong&gt;career&lt;/strong&gt; decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Oracle's $50 Billion Bet (Funded by 30,000 People)
&lt;/h2&gt;

&lt;p&gt;Oracle's March layoffs deserve their own section because they're the clearest example of what's actually happening. This isn't AI replacing workers. This is capital replacing labor.&lt;/p&gt;

&lt;p&gt;Oracle cut 30,000 people, 18% of its global workforce, to free up $8 to $10 billion in annual cash flow. That cash is going directly into AI data center infrastructure; roughly $50 billion in 2026 capex alone, a 136% increase over 2025. India bore the worst of it: 12,000 of Oracle's approximately 30,000 Indian employees were terminated.&lt;/p&gt;

&lt;p&gt;The company had $523 billion in remaining performance obligations, up 433% year over year. Contracted demand from hyperscalers like OpenAI, Meta, and xAI. Oracle wasn't shrinking. It was restructuring its entire business model from "employ people to build software" to "build infrastructure that other companies rent."&lt;/p&gt;

&lt;p&gt;Here's where it gets relevant for data engineers: Oracle ran 8-month internal pilot programs with AI agents automating database administration tasks. Maintenance, performance optimization, backup verification. The routine stuff. Entry-level data analyst roles fell 40% industry-wide during the same period.&lt;/p&gt;

&lt;p&gt;The pattern is clear. Routine infrastructure work is on the chopping block. Non-routine infrastructure work (the kind where you're debugging why a pipeline silently dropped 2M rows last Tuesday) is not. Oracle didn't cut its cloud architects. It cut the people doing work that could be codified into a runbook and handed to an agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Data Engineering &lt;strong&gt;Job Security&lt;/strong&gt; Isn't a Myth (Yet)
&lt;/h2&gt;

&lt;p&gt;Here's where I'll validate the anxiety and then redirect it, because both things are true: the market is tightening &lt;em&gt;and&lt;/em&gt; data engineers are structurally safer than most adjacent roles.&lt;/p&gt;

&lt;p&gt;The numbers tell the story. Data engineering roles saw only a 20.6% reduction in openings when Q3 2024 layoffs hit; the smallest decline among all data roles. Data scientists accounted for just 3% of Q1 2026 layoffs, while software engineers absorbed 22%. Companies are allocating 60 to 70% of data budgets to engineering (ingestion, transformation, orchestration, reliability). And 90% of AI and ML projects depend directly on data engineering pipelines for training data, feature delivery, and real-time inference.&lt;/p&gt;

&lt;p&gt;That last stat is the one that matters. If you cut pipeline builders, your AI initiatives die. Full stop. Oracle is spending $50 billion on AI infrastructure. Meta is spending $115 to $135 billion. That infrastructure needs data flowing through it, which means it needs people who know how to make data flow reliably. You can't automate the thing that the automation depends on; at least not yet.&lt;/p&gt;

&lt;p&gt;55% of data professionals now identify primarily as data engineers, up from 40% in 2021. That's not just new hiring. That's existing staff reclassifying because companies realized they need infrastructure builders more than they need dashboard makers.&lt;/p&gt;

&lt;p&gt;But, and this is the part nobody wants to hear, entry-level data engineering positions represent just 2% of openings. Roles requiring 6+ years of experience make up 20%. The market isn't shrinking for data engineers. It's bifurcating. Senior engineers who can architect systems are in high demand. Junior engineers who can write a basic DAG are competing with AI tools that can do the same thing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Junior engineers worry about which tool to learn. Senior engineers worry about which problems to solve. Staff engineers worry about which problems to prevent. The layoffs are targeting the first group.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Data and analytics postings are down 15.2% year over year, outpacing the overall tech decline of 8.5%. But that aggregate number masks a high-variance market. Data engineers at Series B/C startups and enterprise AI implementations are thriving. Legacy BI teams are hollowing out. The label "data engineer" covers everything from someone writing dbt models to someone designing real-time feature stores for ML inference. These are not the same job, and they don't have the same risk profile.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Skills That Actually Keep You Employed
&lt;/h2&gt;

&lt;p&gt;I've watched people with 10 YOE get laid off because their entire skillset was "I run Airflow DAGs and write SQL." That was a fine career in 2020. In 2026, it's a ceiling.&lt;/p&gt;

&lt;p&gt;Here's what the hiring data shows. AI job postings surged 92% in Q1 2026 versus Q1 2025. ML engineering and AI ops roles command 56% wage premiums. Streaming data engineer roles pay $114K to $245K annually. The real-time analytics market is growing at 23.8% CAGR through 2028. OpenAI and Instacart are actively hiring for data infrastructure roles requiring Kafka, Flink, Spark, and Terraform experience.&lt;/p&gt;

&lt;p&gt;The demand isn't for "data engineers." It's for data engineers who can do specific, hard things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data modeling at scale.&lt;/strong&gt; This has always been the core skill, and it's only getting more important. Getting the model wrong upstream means everything downstream is pain; including every AI training pipeline that depends on your tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline architecture for ML systems.&lt;/strong&gt; Not system design in the SWE sense. Nobody cares if you can whiteboard a load balancer. Can you design a feature pipeline that serves both batch training and real-time inference without duplicating logic?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming infrastructure.&lt;/strong&gt; I know, I know; I've said streaming is overrated. And for 90% of companies, it still is. But the 10% that need it are the ones paying $200K+ for Kafka and Flink expertise. If you want &lt;strong&gt;job security&lt;/strong&gt; in a tightening market, depth in an undersupplied niche beats breadth in an oversupplied generalist pool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-aware engineering.&lt;/strong&gt; Storage is 2 cents per GB per month. Compute is cheap. But "cheap" times a thousand pipelines times 365 days adds up. The engineer who can shave $400K off the annual cloud bill by rethinking a data model is worth more than the engineer who memorized the Spark API.&lt;/li&gt;
&lt;/ul&gt;
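&lt;p&gt;The arithmetic behind that last bullet, as a back-of-envelope sketch (the per-run compute cost is an assumed figure for illustration, not real billing data):&lt;/p&gt;

```python
# Back-of-envelope: "cheap" compute times a thousand pipelines adds up.
# All figures are illustrative assumptions, not real billing data.
storage_per_gb_month = 0.02   # $/GB/month, the 2-cent figure quoted above
pipelines = 1_000
runs_per_year = 365
compute_per_run = 0.50        # assumed $/run for a small daily job

annual_compute = pipelines * runs_per_year * compute_per_run
# 1,000 pipelines * 365 runs * $0.50/run = $182,500/year from "cheap" jobs
```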

&lt;p&gt;The pattern isn't complicated. Routine work is getting automated. Non-routine work is getting more valuable. If your job can be described as a series of steps that don't require judgment calls, you're exposed. If your job involves figuring out &lt;em&gt;why&lt;/em&gt; the pipeline broke, &lt;em&gt;how&lt;/em&gt; to model the data so downstream teams aren't constantly filing tickets, and &lt;em&gt;what&lt;/em&gt; infrastructure choices save the company money at scale, you're fine.&lt;/p&gt;

&lt;p&gt;66% of CEOs are freezing hiring through the rest of 2026. But data engineer ranks #7 in CEO hiring priorities at 23%, and the roles that &lt;em&gt;are&lt;/em&gt; opening carry premium compensation. The market is smaller but richer. Fewer seats, higher stakes, better pay for the people who get them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means for Your Career
&lt;/h2&gt;

&lt;p&gt;I'm not going to sugarcoat this: the &lt;strong&gt;layoffs&lt;/strong&gt; are real, the market is harder than it was in 2021, and "just learn SQL and Airflow" isn't a viable &lt;strong&gt;career&lt;/strong&gt; strategy anymore. But I've been through this cycle before. The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal.&lt;/p&gt;

&lt;p&gt;The 47.9% AI attribution number is mostly theater. The entry-level contraction is real. The senior-level demand is also real. And the data engineers who treat this moment as a reason to deepen their skills (not panic, not pivot to product management, not "learn AI" by taking a Coursera course) are going to come out of this cycle better compensated than they went in.&lt;/p&gt;

&lt;p&gt;I gave myself a week to feel anxious about the headlines. Then I went back to studying pipeline architecture patterns and brushing up on streaming fundamentals. Because that's always been the move: play the game, win the prize.&lt;/p&gt;

&lt;p&gt;What's the most in-demand skill in your corner of data engineering right now? Genuinely curious whether the streaming and ML infrastructure trend is as universal as the job postings suggest, or if it's concentrated in specific markets.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What Spark Interviews Actually Test (Based on 189 Real Interview Reports)</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:01:17 +0000</pubDate>
      <link>https://dev.to/datadriven/what-spark-interviews-actually-test-based-on-189-real-interview-reports-46ol</link>
      <guid>https://dev.to/datadriven/what-spark-interviews-actually-test-based-on-189-real-interview-reports-46ol</guid>
      <description>&lt;h1&gt;
  
  
  What Spark Interviews Actually Test (Based on 189 Real Interview Reports)
&lt;/h1&gt;

&lt;p&gt;We scraped thousands of data engineering interview reports from across the internet. 189 of them mentioned Spark. We tagged every question, tracked every outcome, and found patterns that contradicted most of the advice we see online.&lt;/p&gt;

&lt;p&gt;This is what the data says.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spark Shows Up Less Than You Think
&lt;/h2&gt;

&lt;p&gt;Across all the reports we collected, Spark appeared in 6.7% of them. SQL appeared in 22.8%. Python in 16%.&lt;/p&gt;

&lt;p&gt;That ratio matters. If you have 4 weeks to prep and you spend 2 of them grinding Spark internals, you've made a bad bet. SQL is 3.4x more likely to show up. Python is 2.4x more likely.&lt;/p&gt;

&lt;p&gt;But here's the catch: when Spark does show up, it shows up hard. It's rarely one question in a round. It tends to be the entire round. And the failure rate is brutal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question Changes Completely By Level
&lt;/h2&gt;

&lt;p&gt;Most people prep for Spark interviews as if there's one test. There isn't. The question changes shape depending on what level you're interviewing for, and the jump between levels is steeper than people expect.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;L3/L4&lt;/strong&gt;, interviewers test whether you can explain the basics. "What is a DAG?" "Why is a shuffle expensive?" "Tell me about your PySpark projects." One candidate interviewing at Nasdaq described the round as "Python, Pandas, PySpark, Databricks, Linux commands, my projects in Python." Conceptual. Vocabulary. Can you talk about this stuff without stumbling.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;L5&lt;/strong&gt;, the entire format flips. The interviewer hands you a Spark UI screenshot and says "this job was meeting SLA for six months and now it's 10x slower. Nothing in the code changed. Walk me through your diagnosis." A TikTok L5 round combined "complex SQL problems, Spark architecture, and performance optimization questions, including indexing strategies, partitioning, query tuning, and resource management in distributed data processing systems" into a single session. You're not explaining what Spark is. You're fixing something that broke at 3am.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;L6&lt;/strong&gt;, the scope widens again. One candidate at Booking.com was rejected because their system design choices were wrong: "feedback centered on tool choices (Flink vs Spark despite prompt asking for low latency; Redis vs Cassandra)." The question isn't "fix this job." It's "design the memory layout for a system that caches 100GB of reference data while running a 500GB sort-merge join." You're sizing executors, reasoning about GC pressure past 30GB of heap, deciding between &lt;code&gt;MEMORY_AND_DISK&lt;/code&gt; and recomputation.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;L7&lt;/strong&gt;, it's organizational. "How would you design a Spark application that processes 100+ PB across a shared multi-tenant cluster?" The bottleneck isn't compute anymore, it's resource isolation between 50 competing teams.&lt;/p&gt;

&lt;p&gt;Same topic at every level. Completely different test. One Databricks candidate went through a 7-round process over 60 days that included a take-home with 15 hands-on Spark questions, followed by a live grilling where a lead engineer dug into their optimization choices. They called the whole experience "disappointing after almost 2 months."&lt;/p&gt;

&lt;p&gt;The prep that gets you through an L3 round won't even register as relevant at L5.&lt;/p&gt;

&lt;h2&gt;
  
  
  68.8% of "Spark Interviews" Are Really SQL-at-Scale Interviews
&lt;/h2&gt;

&lt;p&gt;This one surprised me the most. We tagged every technical topic mentioned in the 189 Spark interview reports. The breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;% of Spark Interviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SQL optimization&lt;/td&gt;
&lt;td&gt;68.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance tuning&lt;/td&gt;
&lt;td&gt;11.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Window functions&lt;/td&gt;
&lt;td&gt;6.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joins&lt;/td&gt;
&lt;td&gt;5.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning&lt;/td&gt;
&lt;td&gt;3.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data skew&lt;/td&gt;
&lt;td&gt;2.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory management&lt;/td&gt;
&lt;td&gt;2.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nearly 7 out of 10 "Spark interviews" are really about running SQL efficiently at distributed scale. Not RDD transformations. Not Catalyst internals. SQL.&lt;/p&gt;

&lt;p&gt;The typical question sounds like this (from a real TikTok L5 interview): "...Discussed Spark architecture, and answered performance optimization questions, including indexing strategies, partitioning, query tuning, and resource management in distributed data processing systems."&lt;/p&gt;

&lt;p&gt;SQL is the entry point. Spark is the context. The question is whether you understand what happens to your SQL query after you hit enter on a 500-executor cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nobody Asks About RDDs Anymore
&lt;/h2&gt;

&lt;p&gt;Zero interviews in the dataset asked about RDDs as a primary topic. Zero asked about GC tuning directly.&lt;/p&gt;

&lt;p&gt;That doesn't mean these concepts are irrelevant. It means interviewers have stopped asking "what is an RDD" and started asking questions where RDD knowledge helps you reason about the answer. The question is "why is this job slow?" and the ability to think in terms of lineage, partitions, and shuffle boundaries is what separates a good answer from a textbook recitation.&lt;/p&gt;

&lt;p&gt;If you're spending prep time memorizing the difference between &lt;code&gt;map&lt;/code&gt; and &lt;code&gt;flatMap&lt;/code&gt; on RDDs, stop. That time is better spent learning to read a Spark UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Companies Actually Ask
&lt;/h2&gt;

&lt;p&gt;Here are real questions from real interviews, pulled directly from the reports:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks&lt;/strong&gt; (37 interviews, 46% rejection rate): "Length, breadth, height, depth on Spark core, DLT, Unity Catalog, code optimization, scenarios, your project issues and how they were resolved." Their process runs 7+ rounds over 50-60 days. One candidate reported being rejected after the presentation round "because of less Databricks knowledge" despite the hiring manager saying Databricks knowledge wouldn't be needed. The bar is inconsistent and the process is long.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finance companies&lt;/strong&gt; (multi-round, structured): "Explanation of Spark architecture in detail and different optimization techniques if any Spark job is taking long to run." These tend to be 4-round processes: PySpark coding, Spark optimization, system design, then a techno-managerial round.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TikTok&lt;/strong&gt; (L5, 25% rejection): Complex SQL + Spark architecture + performance optimization in a single round. They test breadth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BNSF Railway&lt;/strong&gt; (100% rejection in dataset): "Multi-round process with system design, SQL, PySpark, and a deep technical discussion with leadership. The interviews were much challenging and focused heavily on real-world trade-offs, especially around data architecture and streaming concepts." When a railroad company rejects every Spark candidate in your dataset, they're not messing around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QuantumBlack&lt;/strong&gt; (McKinsey's data arm, 71% rejection): "What makes PySpark great? How do you debug PySpark?" Then a 45-minute coding test with 3 problems solvable in either Pandas or PySpark.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Failure Patterns That Keep Showing Up
&lt;/h2&gt;

&lt;p&gt;After tagging all the technical content across interview reports, challenge databases, and company-specific prep guides, five production failure patterns dominate what senior interviews test:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Data skew on power-law keys.&lt;/strong&gt; One partition holds 320M rows while the others hold 3-4M. Task 199 runs for 7,140 seconds while the other 199 tasks finish in 22 seconds. The interviewer wants you to identify the skew from the Spark UI, explain why adding more executors won't help (the bottleneck is one partition, not total parallelism), and apply the right fix (broadcast the small table, or salt the key if both tables are large).&lt;/p&gt;
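&lt;p&gt;A toy sketch of the salting fix, in plain Python with a stand-in hash function (the real partitioner is Spark's, and the key name is illustrative):&lt;/p&gt;

```python
def partition(key: str, n: int) -> int:
    """Toy, deterministic stand-in for Spark's hash partitioner."""
    return sum(key.encode()) % n

N = 8                 # partitions
hot_key = "user_42"   # illustrative power-law key

# Without salting: every one of the hot key's rows lands on one partition.
plain = {partition(hot_key, N) for _ in range(1000)}

# With salting: append a salt bucket so the hot key's rows spread across
# partitions. The small side of the join must be exploded with every
# salt value so the salted keys still match.
SALTS = 8
salted = {partition(f"{hot_key}_{s}", N) for s in range(SALTS)}
```

&lt;p&gt;The trade-off to name in the interview: salting multiplies the small side by the number of salt buckets, so it only pays when the skew is severe enough that one straggler task dominates the stage.&lt;/p&gt;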

&lt;p&gt;&lt;strong&gt;2. Broadcast overflow.&lt;/strong&gt; A dimension table that was 8MB a year ago grew past the 10MB &lt;code&gt;autoBroadcastJoinThreshold&lt;/code&gt; silently. Spark switched from BroadcastHashJoin to SortMergeJoin without anyone noticing. Runtime went from 8 minutes to 2 hours. The fix is one line of code. The interview tests whether you can find that line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Shuffle explosion.&lt;/strong&gt; Someone added a &lt;code&gt;repartition()&lt;/code&gt; before a join, thinking more partitions would speed things up. It multiplied shuffle volume by 50x. Network saturated. The interviewer wants you to explain why repartition before a join is almost always wrong and what to do instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Executor OOM from cached data.&lt;/strong&gt; A 100GB reference table is cached. A 500GB sort-merge join needs execution memory. Both compete for the unified pool (60% of heap). Spark's unified memory model lets execution evict cached blocks, but at 100GB the eviction churn destroys throughput. The interview tests whether you understand &lt;code&gt;spark.memory.fraction&lt;/code&gt;, &lt;code&gt;spark.memory.storageFraction&lt;/code&gt;, and the tradeoff between cache hit rate and execution headroom.&lt;/p&gt;
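&lt;p&gt;The arithmetic behind that unified pool, as a worked sketch using Spark's documented defaults (the heap size here is illustrative):&lt;/p&gt;

```python
# Worked example of Spark's unified memory pool with default settings:
# spark.memory.fraction=0.6 splits usable heap between execution and
# storage; spark.memory.storageFraction=0.5 sets the portion of that
# pool protected from eviction. The 28GB heap is an illustrative size.
heap_gb = 28.0
reserved_gb = 0.3            # fixed ~300MB reserved memory
memory_fraction = 0.6        # spark.memory.fraction default
storage_fraction = 0.5       # spark.memory.storageFraction default

unified_pool = (heap_gb - reserved_gb) * memory_fraction   # ~16.6GB
protected_storage = unified_pool * storage_fraction        # ~8.3GB
# A 100GB cached table cannot fit in ~16.6GB: execution keeps evicting
# unprotected cached blocks, and the churn destroys throughput, which
# is the failure mode described above.
```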

&lt;p&gt;&lt;strong&gt;5. Catalyst plan regression from stale statistics.&lt;/strong&gt; CBO statistics went stale after a table doubled in size. Spark picked sort-merge instead of broadcast. Nobody changed any code. The job just got slower. The interviewer wants you to explain how Catalyst's cost-based optimizer works and why &lt;code&gt;ANALYZE TABLE ... COMPUTE STATISTICS&lt;/code&gt; matters.&lt;/p&gt;

&lt;p&gt;These five patterns cover what I'd estimate is 80%+ of production Spark incidents. They're also what separates L3 answers ("it's slow because of the data") from L6 answers ("task 199 is reading 15.8GB of shuffle data because the top 1% of user_ids hash to the same partition, and the executor is at 78% GC overhead because it's trying to sort 320M rows in 28GB of heap").&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Signal: Can You Read the Spark UI?
&lt;/h2&gt;

&lt;p&gt;Every pattern above comes down to one skill: reading the Spark UI and reasoning about what you see.&lt;/p&gt;

&lt;p&gt;Stages, tasks, shuffle read/write, GC time, executor memory. That's the entire diagnostic surface. If you can look at a Spark UI screenshot and say "task 199 has 100x the shuffle read of every other task, the executor is at 98% heap, and the physical plan shows SortMergeJoin when this should be a broadcast" then you pass. If you can't, you recite textbook answers and the interviewer can tell.&lt;/p&gt;

&lt;p&gt;This is the skill that most Spark prep resources skip entirely. They teach you what a broadcast join is. They don't teach you to recognize when a missing broadcast join is the reason your 3am pager went off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practice This Before Your Interview
&lt;/h2&gt;

&lt;p&gt;I built a free Spark mock interview that simulates exactly this. You get paged. You see real Spark UI evidence: task durations, shuffle sizes, GC overhead, executor memory, the physical plan. You diagnose, write the fix in PySpark or Scala, run your code in the browser, then an AI interviewer grills you on tradeoffs and edge cases.&lt;/p&gt;

&lt;p&gt;Four phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Think&lt;/strong&gt; (5 min): Read the Spark UI. Diagnose before you touch code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt; (15 min): Write and run your PySpark or Scala fix in a hosted IDE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discuss&lt;/strong&gt; (10 min): AI interviewer asks follow-ups one at a time. "What happens when the table doubles?" "Why not just add more executors?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verdict&lt;/strong&gt;: Scored across 5 dimensions (problem solving, technical execution, communication, verification, requirements understanding). Calibrated from L3 to L7.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No paywall, no trial, no credit card.&lt;/p&gt;

&lt;p&gt;Try it here: &lt;a href="https://www.datadriven.io/interview/spark_skew_broadcast_user_events" rel="noopener noreferrer"&gt;Spark Mock Interview&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Data sourced from thousands of interview reports scraped across the internet, covering 945+ companies.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>programming</category>
    </item>
    <item>
      <title>Everything You Need for Data Engineering Interview Prep, and Why It's Free</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Sat, 11 Apr 2026 16:14:22 +0000</pubDate>
      <link>https://dev.to/datadriven/everything-you-need-for-data-engineering-interview-prep-and-why-its-free-4kem</link>
      <guid>https://dev.to/datadriven/everything-you-need-for-data-engineering-interview-prep-and-why-its-free-4kem</guid>
      <description>&lt;p&gt;I have been on both sides of over 250 FAANG data engineering interview loops. As a candidate, I did about 20 loops in a single job search. As an interviewer, I have watched hundreds of candidates walk in prepared for the wrong things.&lt;/p&gt;

&lt;p&gt;The data engineering interview prep market in 2026 charges $5 to $15 a month for platforms that cover maybe two of the four rounds you will actually face. You end up stitching together three or four subscriptions, paying $50+ a month, and still walking into your onsite with a blind spot that gets you eliminated.&lt;/p&gt;

&lt;p&gt;I built DataDriven.io to cover every round in one place. It is free. Every feature, every problem, every company tag. No trial, no credit card, no paywall. Here is what that actually includes and why each piece matters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7rg9lj3l2epwy0ntvxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7rg9lj3l2epwy0ntvxj.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Code Execution, Not Multiple Choice&lt;/strong&gt;&lt;br&gt;
The single most important thing about interview prep is that it has to match the format of the interview. You will not get multiple choice questions in a DE onsite. You will get a blank editor and a prompt.&lt;/p&gt;

&lt;p&gt;DataDriven runs your SQL and Python in real execution environments. You write a query against actual tables, it executes, you see output, and you find out whether your logic holds up against edge cases. This is not "select the correct answer from four options." This is the same pressure you will feel in the interview: a blank screen, a problem, and a clock.&lt;/p&gt;

&lt;p&gt;Every problem is tagged by company and weighted toward what those companies actually ask. A Meta SQL round looks different from a Databricks SQL round. Practicing generic problems without knowing what your target company emphasizes is prep with a blindfold on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Mock Interviews That Simulate the Full Loop&lt;/strong&gt;&lt;br&gt;
The hardest thing to practice alone is the conversational pressure of a live interview. Solving a problem in silence is a fundamentally different skill than solving it while someone asks you to explain your approach, challenges your assumptions, and throws follow-ups at you.&lt;/p&gt;

&lt;p&gt;DataDriven's AI mock interviews simulate real technical and behavioral rounds. The AI asks follow-up questions based on your responses, evaluates your communication and technical depth, and gives you multi-dimensional feedback on where you lost clarity or missed an opportunity to demonstrate deeper understanding.&lt;/p&gt;

&lt;p&gt;This matters because the interview is not just about getting the right answer. I have watched candidates solve problems correctly and still get rejected because they could not articulate why they made the choices they did. The mock interview is where you build that muscle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Right Data Modeling Practice&lt;/strong&gt;&lt;br&gt;
55% of DE interview loops include a data modeling round. It is the round with the highest elimination rate because almost nobody practices it. The reason is simple: automating evaluation of schema design is genuinely hard. Most platforms skip it entirely.&lt;/p&gt;

&lt;p&gt;DataDriven does not skip it. You get a business scenario, you design a schema from scratch using an interactive schema designer, and you get evaluated on grain, dimensions, normalization trade-offs, and SCD strategies. This is the round that kills experienced engineers who have been writing pipelines for years but have never had to defend a schema design under time pressure.&lt;/p&gt;

&lt;p&gt;If you are prepping for interviews and not practicing data modeling, you are leaving the highest-leverage round completely to chance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured Courses Built Around Interviews, Not Textbooks&lt;/strong&gt;&lt;br&gt;
There is a difference between learning data engineering concepts and learning how those concepts get tested in interviews. A course that teaches you what a star schema is does not help you when the interviewer says "design the data model for a ride-sharing marketplace and defend your grain choices."&lt;/p&gt;

&lt;p&gt;DataDriven's courses are structured around interview patterns, not academic curricula. SQL, Python, data modeling, pipeline architecture, and Spark internals, all framed as "here is how this gets asked, here is what the interviewer is evaluating, here is what a strong answer looks like versus a weak one." The content covers the same topics you would find in a $200 course, except it is built by someone who has actually conducted hundreds of these interviews and knows what separates a hire from a no-hire.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive Difficulty That Targets Your Weak Spots&lt;/strong&gt;&lt;br&gt;
Solving 500 random problems feels productive. It is not. If your window functions are solid but your self-joins fall apart under time pressure, doing 50 more window function problems is wasted effort.&lt;/p&gt;

&lt;p&gt;DataDriven tracks your performance by round and by topic, identifies where you are weakest, and feeds you more of that. It also adjusts by company, because a Netflix prep track and an Amazon prep track emphasize different things. The readiness score tells you, per round, which ones you would pass today and which ones would cost you the offer.&lt;/p&gt;

&lt;p&gt;This is the information most candidates do not have until after they get rejected. Getting it before the interview is the difference between focused prep and aimless grinding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why All of This Is Free&lt;/strong&gt;&lt;br&gt;
The marginal cost of one more user on DataDriven is close to zero. The execution environments are containerized and ephemeral. Storage costs pennies. The expensive part was building it, not running it.&lt;/p&gt;

&lt;p&gt;I am a staff-level data engineer with a day job. I built this because the data engineering community gave me my career through free blog posts, open source tools, and people answering questions on Reddit at midnight. Charging $10 a month for prep that costs me almost nothing to serve, to people who are often between jobs, felt wrong.&lt;/p&gt;

&lt;p&gt;DataDriven.io. No account required to start. Open it, pick a challenge, and find out which round would cost you the offer before it actually does.&lt;/p&gt;

&lt;p&gt;Do the DataDriven 75 to prepare for your upcoming data engineering interview.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Engineering Interviews Are Broken (Here's Proof)</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 07 Apr 2026 02:01:55 +0000</pubDate>
      <link>https://dev.to/datadriven/data-engineering-interviews-are-broken-heres-proof-4b6c</link>
      <guid>https://dev.to/datadriven/data-engineering-interviews-are-broken-heres-proof-4b6c</guid>
      <description>&lt;p&gt;I did somewhere around 20 interview loops in a single job search. Phone screens, take-homes, onsites, "culture fits," system design rounds, and yes, the inevitable LeetCode gauntlet. Some went well. Some went laughably poorly. One company had me do eight rounds, told me I passed, said the offer was sent, never sent it, then a new recruiter said I'd declined the offer I never saw. I did four more rounds. Passed again. Headcount was closed.&lt;/p&gt;

&lt;p&gt;That was a few years ago. It's gotten worse.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;data engineering&lt;/strong&gt; &lt;strong&gt;interview&lt;/strong&gt; process in 2026 is broken in ways that would be funny if people's livelihoods weren't on the line. Experienced engineers are failing screens designed for new grads. Take-home projects have ballooned into unpaid consulting gigs. And the disconnect between what companies test for and what the job actually requires has never been wider. I'm not speculating; I've been on both sides of the table, and the view is ugly from everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LeetCode Gauntlet Has Nothing to Do With the Job
&lt;/h2&gt;

&lt;p&gt;Let me describe what I do on an average Tuesday: I debug a pipeline that's silently deduplicating records because an upstream team changed a column type without telling anyone. I trace lineage through four layers of transformations to figure out why finance's board deck numbers don't match the dashboard. I write SQL. I write more SQL. I model data. I argue with product about grain.&lt;/p&gt;

&lt;p&gt;You know what I don't do? Reverse a linked list. Implement a binary search tree. Solve dynamic programming problems on a whiteboard while someone watches me sweat.&lt;/p&gt;

&lt;p&gt;And yet, that's what most &lt;strong&gt;hiring&lt;/strong&gt; loops still test. Companies slap a LeetCode medium (or hard, if they're feeling spicy) in front of a data engineering candidate and call it a technical screen. It's a mechanism to rank candidates, not an indicator of data engineering experience. I've said this before and I'll keep saying it until the industry listens or I retire, whichever comes first.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DS&amp;amp;A is an arbitrary IQ measuring stick. Accept it for what it is. But let's stop pretending it tells you anything about whether someone can debug a Spark job that's been silently dropping 40% of records for six months.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The defense I always hear: "It tests problem-solving ability." Sure. So does figuring out why your Airflow DAG succeeded but wrote zero rows to the target table. One of those scenarios actually happens at work. The other is a party trick.&lt;/p&gt;

&lt;p&gt;And here's the part that really gets me: &lt;strong&gt;AI&lt;/strong&gt; can solve a medium LC problem faster than most humans now. If your screening mechanism can be defeated by a tool every candidate has access to, what exactly are you measuring? You're not testing engineering skill. You're testing whether someone spent 200 hours on a grinding site. That's not a &lt;strong&gt;career&lt;/strong&gt; signal; that's a compliance signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flooded Market Gave Companies Permission to Be Absurd
&lt;/h2&gt;

&lt;p&gt;The last couple of years hit tech hard. Layoffs stacked up. Bootcamps kept pumping out graduates. And suddenly, every mid-level data engineering role had hundreds of applicants. Companies that used to hire based on a SQL assessment and a system design conversation started adding rounds like they were collecting Infinity Stones.&lt;/p&gt;

&lt;p&gt;When you have 500 applicants for one role, you don't need a good filter. You need any filter. LeetCode hards become a convenient way to reject 490 people without anyone in HR having to think too hard about it. The bar didn't get raised because the work got harder. The bar got raised because companies could get away with it.&lt;/p&gt;

&lt;p&gt;This is the part that kills me: the actual job hasn't changed. You're still modeling data. Still writing pipelines. Still debugging why the daily load failed at 3am because someone deployed a schema change to prod on a Friday (don't deploy on Fridays, please, I'm begging). The problems are eternal: schema drift, late-arriving data, upstream teams breaking contracts without telling you. These haven't evolved. The &lt;strong&gt;interview&lt;/strong&gt; has just drifted further from reality.&lt;/p&gt;

&lt;p&gt;And &lt;strong&gt;salary&lt;/strong&gt; compression is making it worse. Companies know candidates are desperate. They know the market is flooded. So they add more hoops, offer less money, and frame it as "maintaining a high bar." Nah. You're exploiting leverage. There's a difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Senior Engineers Are Failing Junior Screens
&lt;/h2&gt;

&lt;p&gt;This one makes my blood pressure spike. I've watched engineers with 10+ years of experience, people who've built entire data platforms from scratch, get bounced in a first-round automated screen because they didn't solve a string manipulation problem in 20 minutes.&lt;/p&gt;

&lt;p&gt;Let me be clear about something: interviewing is a skill. It's separate from the actual job. I've always said that. But the gap between "interview skill" and "job skill" has become a canyon. A staff-level engineer who's maintained production systems serving hundreds of millions of rows per day shouldn't be failing a screen that a CS sophomore could pass after a weekend of cramming.&lt;/p&gt;

&lt;p&gt;The problem is structural. Automated filters don't care about your experience. They care about your score on a timed coding challenge. Recruiters scanning resumes don't know the difference between "built a real-time ingestion layer processing 2TB daily" and "experience with data pipelines." Both get the same keyword match. One of those people has actually done the thing; the other wrote a convincing sentence about it.&lt;/p&gt;

&lt;p&gt;I've been on hiring panels where we passed on strong candidates for the dumbest reasons. "They took too long on the coding challenge." The challenge was implementing a graph traversal. The role was building dbt models. Make it make sense.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The interview is a different skill than the job. That's always been true. But in 2026, it's not even the same sport.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Take-Homes Have Become Unpaid Consulting
&lt;/h2&gt;

&lt;p&gt;I want to talk about take-home projects because they've crossed a line. A few years ago, a take-home was a reasonable ask: build a small pipeline, write some SQL, show your thinking. Two, maybe three hours. Fair.&lt;/p&gt;

&lt;p&gt;Now? I'm hearing about take-homes that take 10, 15, 20 hours. Full pipeline implementations. Data modeling exercises with multiple data sources. Documentation requirements. Testing requirements. "Present your solution to the team" follow-ups. For a job you haven't been offered. At a company that might ghost you after you submit.&lt;/p&gt;

&lt;p&gt;That's not an interview. That's a free proof of concept.&lt;/p&gt;

&lt;p&gt;And the kicker: these companies often don't provide meaningful feedback when they reject you. You spent a weekend building something real, and you get a templated "we've decided to move forward with other candidates" email. No notes. No explanation. Just silence and your lost weekend.&lt;/p&gt;

&lt;p&gt;If you're a &lt;strong&gt;hiring&lt;/strong&gt; manager reading this: your take-home should be completable in under 4 hours. If it takes longer, you're either scoping the project poorly or trying to get free work. Neither is a good look. And for the love of everything, give feedback. The candidate spent hours on your assignment. You can spend 10 minutes writing a paragraph about why they didn't move forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works (It's Not Complicated)
&lt;/h2&gt;

&lt;p&gt;Good &lt;strong&gt;data engineering&lt;/strong&gt; interviews exist. I've been part of them. They share a few traits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL deep-dives that reflect real work.&lt;/strong&gt; Give candidates a messy dataset and ask them to answer business questions. Window functions, CTEs, handling nulls and dupes. This is what they'll do on day one. Test for it.&lt;/p&gt;
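&lt;p&gt;To make that concrete, here's a minimal sketch of the kind of question this round should ask, using an illustrative &lt;code&gt;orders&lt;/code&gt; table I made up (duplicate reloads, a NULL amount) and plain SQLite so it runs anywhere:&lt;/p&gt;

```python
import sqlite3

# Hypothetical messy table: a duplicate reload of order 1 and a NULL amount.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INT, customer TEXT, amount REAL, loaded_at TEXT);
    INSERT INTO orders VALUES
        (1, 'acme',   100.0, '2026-01-01'),
        (1, 'acme',   100.0, '2026-01-02'),  -- duplicate reload
        (2, 'globex', NULL,  '2026-01-01'),  -- amount not yet arrived
        (3, 'acme',   50.0,  '2026-01-01');
""")

# Business question: revenue per customer. Dedupe with a window function,
# keep the latest copy of each order, treat NULL amounts as zero.
rows = conn.execute("""
    WITH deduped AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id ORDER BY loaded_at DESC) AS rn
        FROM orders
    )
    SELECT customer, SUM(COALESCE(amount, 0.0)) AS revenue
    FROM deduped
    WHERE rn = 1
    GROUP BY customer
    ORDER BY customer
""").fetchall()
print(rows)  # → [('acme', 150.0), ('globex', 0.0)]
```

&lt;p&gt;Nothing exotic, but it exercises exactly the muscles the job uses: dedupe before you aggregate, decide explicitly what NULL means, and be able to say why.&lt;/p&gt;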

&lt;p&gt;&lt;strong&gt;Pipeline architecture discussions.&lt;/strong&gt; Not "design Twitter" system design. Pipeline architecture. "Here's a data source that updates irregularly with late-arriving records and schema changes. How do you get it into the warehouse reliably?" That's the job. That's the interview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data modeling exercises.&lt;/strong&gt; Hand someone a business domain and ask them to model it. Watch how they think about grain, how they handle slowly changing dimensions, whether they understand the tradeoffs between normalization and denormalization. This is the core skill. If you're not testing for it, you're testing for the wrong things.&lt;/p&gt;
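&lt;p&gt;For instance, one pattern worth being able to whiteboard cold is a Type 2 slowly changing dimension. Here's a deliberately tiny sketch (table and column names are my own invention, not any particular platform's) showing the mechanic: a change closes the current row's validity window and opens a new one:&lt;/p&gt;

```python
import sqlite3

# Hypothetical Type 2 SCD: each customer version carries a validity window.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key
        customer_id  TEXT,                 -- natural key
        city         TEXT,
        valid_from   TEXT,
        valid_to     TEXT,                 -- NULL means "current"
        is_current   INT
    );
    INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
    VALUES ('c1', 'Austin', '2025-01-01', NULL, 1);
""")

def apply_scd2(conn, customer_id, new_city, change_date):
    """Close the current row, then insert the new version."""
    conn.execute("""
        UPDATE dim_customer
        SET valid_to = ?, is_current = 0
        WHERE customer_id = ? AND is_current = 1
    """, (change_date, customer_id))
    conn.execute("""
        INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
        VALUES (?, ?, ?, NULL, 1)
    """, (customer_id, new_city, change_date))

apply_scd2(conn, "c1", "Denver", "2026-03-01")
history = conn.execute("""
    SELECT city, valid_from, valid_to, is_current
    FROM dim_customer WHERE customer_id = 'c1' ORDER BY valid_from
""").fetchall()
print(history)
# → [('Austin', '2025-01-01', '2026-03-01', 0), ('Denver', '2026-03-01', None, 1)]
```

&lt;p&gt;The interesting interview conversation isn't the mechanics, it's the trade-off: Type 2 preserves history at the cost of fact tables joining on the surrogate key as of the event date. A candidate who can articulate that is showing you the core skill.&lt;/p&gt;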

&lt;p&gt;&lt;strong&gt;Debugging scenarios.&lt;/strong&gt; Give them a broken pipeline. Let them trace the issue. The actual job is less "write a DAG" and more "figure out why this pipeline silently dropped 2M rows last Tuesday." Test for the thing you need.&lt;/p&gt;
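&lt;p&gt;One way a debugging round can probe this is with a reconciliation check: compare per-day row counts between the source extract and the warehouse table, and flag the partitions where rows went missing. The counts below are made up for illustration:&lt;/p&gt;

```python
# Illustrative per-partition row counts; in a real pipeline these would come
# from count queries against the source system and the warehouse table.
source_counts = {"2026-04-01": 1_000_000, "2026-04-02": 1_050_000, "2026-04-03": 980_000}
target_counts = {"2026-04-01": 1_000_000, "2026-04-02": 1_050_000, "2026-04-03": 978_000}

def find_dropped_partitions(source, target, tolerance=0.0):
    """Return (partition, missing_row_count) pairs where the target is short
    by more than `tolerance` (a fraction of the source count). Partitions
    absent from the target entirely count as fully dropped."""
    dropped = []
    for day, expected in sorted(source.items()):
        actual = target.get(day, 0)
        if expected and (expected - actual) / expected > tolerance:
            dropped.append((day, expected - actual))
    return dropped

dropped = find_dropped_partitions(source_counts, target_counts)
print(dropped)  # → [('2026-04-03', 2000)]
```

&lt;p&gt;A candidate who reaches for a check like this, instead of re-reading the DAG code line by line, is telling you they've actually been paged for a silent data loss before.&lt;/p&gt;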

&lt;p&gt;None of this requires LeetCode. None of this requires a 15-hour take-home. It requires interviewers who understand the role and have done the work themselves. Which, I realize, is a bigger ask than it should be.&lt;/p&gt;

&lt;p&gt;The tools change every 18 months. The problems don't change. Your interview process should test for understanding of the problems, not fluency in the tool of the moment. Concepts transfer across tools; tool knowledge doesn't transfer across concepts. If your loop is filtering for people who memorized Spark APIs instead of people who understand why data pipelines fail, you're building a team that'll look great on paper and struggle in production.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The role isn't going anywhere; it's evolving. But the &lt;strong&gt;hiring&lt;/strong&gt; process needs to evolve with it, and right now it's stuck in 2018 with a LeetCode subscription and a God complex.&lt;/p&gt;

&lt;p&gt;So here's my question for anyone who's been through the loop recently: what's the most absurd interview experience you've had for a data engineering role? Because I have a feeling the stories are going to be wild, and honestly, we could all use the catharsis.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>interview</category>
      <category>python</category>
    </item>
  </channel>
</rss>
