<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DataDriven</title>
    <description>The latest articles on DEV Community by DataDriven (@datadriven).</description>
    <link>https://dev.to/datadriven</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864671%2F923e8540-fa96-491d-adb6-0e01c42ec26a.png</url>
      <title>DEV Community: DataDriven</title>
      <link>https://dev.to/datadriven</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datadriven"/>
    <language>en</language>
    <item>
      <title>DSA Is Dying in DE Interviews. Nobody Agrees on What's Next.</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 14 May 2026 10:05:33 +0000</pubDate>
      <link>https://dev.to/datadriven/dsa-is-dying-in-de-interviews-nobody-agrees-on-whats-next-56kh</link>
      <guid>https://dev.to/datadriven/dsa-is-dying-in-de-interviews-nobody-agrees-on-whats-next-56kh</guid>
      <description>&lt;p&gt;I did somewhere around 20 &lt;strong&gt;interview&lt;/strong&gt; loops in a single job search. Some went well. Some went so poorly I still think about them in the shower. But here's the thing: at least I knew what I was prepping for. LeetCode mediums, maybe a SQL round, maybe a system design conversation. The format was predictable, even if it was stupid. That era is over, and what replaced it is somehow worse.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;data engineering&lt;/strong&gt; community has been screaming for years that &lt;strong&gt;DSA&lt;/strong&gt; doesn't belong in DE interviews. Binary tree traversals, dynamic programming, graph algorithms; none of this maps to the actual job. The actual job is debugging why a pipeline silently dropped 2M rows last Tuesday, not implementing Dijkstra's algorithm on a whiteboard. Reddit finally agreed. r/dataengineering blew up over it. The "NoMoreBigONotations" thread went viral. Companies listened. They dropped the algorithmic rounds.&lt;/p&gt;

&lt;p&gt;And then they replaced them with absolute chaos.&lt;/p&gt;

&lt;h2&gt;Why DSA Never Fit Data Engineering in the First Place&lt;/h2&gt;

&lt;p&gt;Let's be clear about something: &lt;strong&gt;LeetCode&lt;/strong&gt; was never a valid proxy for data engineering skill. It was a borrowed ritual from software engineering interviews that nobody bothered to adapt. Data engineers are rarely expected to write complex algorithms from scratch. We use pre-built libraries and frameworks. The daily work is SQL, pipeline architecture, data modeling, debugging, cost optimization, and dealing with upstream teams who break contracts without telling you.&lt;/p&gt;

&lt;p&gt;The best data engineers I've worked with would struggle on a LeetCode hard. And the engineers who ace competitive programming challenges? They frequently struggle with data modeling, pipeline design, and the kind of real-world optimization that actually matters. In my experience the two skills are close to inversely correlated, and that's been staring us in the face for years.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DSA is a mechanism to rank candidates, not an indicator of data engineering experience. Accept it for the arbitrary IQ measuring stick that it is.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;26% of data engineering job ads in 2026 don't even mention education requirements anymore. The industry is finally pivoting toward practical skill assessment. Hiring timelines now run 60 to 90 days or more for complex enterprise roles. Interview loops run 5 to 7 rounds. And yet, the most important question remains unanswered: what are we actually testing for?&lt;/p&gt;

&lt;p&gt;Most candidates don't fail data engineering interviews because of SQL or Python. They fail because they can't connect everything together under pressure and communicate it clearly. That's a completely different skill than reversing a linked list.&lt;/p&gt;

&lt;h2&gt;The Replacement: Three Interviews, Zero Consensus&lt;/h2&gt;

&lt;p&gt;Here's where it gets ugly. Companies dropped DSA and replaced it with whatever their hiring manager felt like that quarter. There is no standard. There is no consensus. There is barely even a pattern.&lt;/p&gt;

&lt;p&gt;Company A wants you to do a 60-minute Cursor-based live build where you implement a feature in a real codebase. Company B wants pure system design: vague, open-ended, no single correct answer, and every interviewer weights trade-offs differently. Company C sends you the interview rules 24 hours before the onsite, and those rules contradict what the recruiter told you two weeks ago. Company D gives you an 8-hour take-home that's definitely 15 hours of work and pays you nothing for it.&lt;/p&gt;

&lt;p&gt;If you're running parallel loops (and you should be; it's the only sane strategy), you are now simultaneously prepping for three completely different skill sets with zero overlap. One company allows Cursor, one bans it, one grades on "cleverness," one grades on "correctness." This isn't a hiring process. It's a lottery where you don't know which ticket you bought.&lt;/p&gt;

&lt;p&gt;Startups compress everything into 2 to 3 rounds focused on "can you ship on day one." Big Tech runs 4 to 6 standardized rounds emphasizing system design and scale. Mid-market companies? They interview data engineers like they're software engineers, because nobody told them not to. Candidates get blindsided. You prep like it's a data role and walk into SWE-level production-grade coding requirements with full test suites.&lt;/p&gt;

&lt;p&gt;For the architecture-style rounds, &lt;a href="https://www.datadriven.io" rel="noopener noreferrer"&gt;datadriven.io&lt;/a&gt; lets you work through the pipeline-design and data-modeling drills end-to-end instead of just reading about them. That matters, because system design is actually harder to prepare for than LeetCode. At least with DSA, there's consensus on what a good answer looks like. System design? No rubric. No "correct" answer. And every interviewer has a different opinion on whether you should optimize for cost, latency, or data freshness. You're training for a ghost target.&lt;/p&gt;

&lt;h2&gt;AI Made It Worse, Not Better&lt;/h2&gt;

&lt;p&gt;Here's the part nobody wants to say out loud: AI didn't lower the interview bar. It raised it invisibly.&lt;/p&gt;

&lt;p&gt;Canva replaced its "Computer Science Fundamentals" round with "AI-Assisted Coding" in mid-2025. Candidates now face vague, open-ended challenges like "design an aircraft takeoff and landing control system." 64% of companies still ban AI in interviews, but 80% of candidates use LLMs anyway on take-homes. Meanwhile, 67% of startups explicitly allow AI. Meta, Rippling, Google, Canva, and Shopify all permit AI use in live technical sessions. The policy landscape is a mess.&lt;/p&gt;

&lt;p&gt;One CTO told a candidate mid-interview to leave Cursor on. "We want to see how you solve this with AI." The problems got harder. When AI handles the boilerplate, the interviewer's expectations shift from "can you code?" to "can you architect while AI codes for you?" That's a completely different evaluation, and most candidates aren't ready for it.&lt;/p&gt;

&lt;p&gt;The goal has evolved: interviewers want to understand how you evaluate, modify, and trust AI-generated answers. Seniors use AI to compress tedious work while maintaining design control. Staff engineers direct AI through complex tasks while monitoring quality. But here's the problem: nobody tells you which version of this test you're walking into. One company wants to see you pair-program with Cursor like it's a junior engineer on your team. The next company will disqualify you for opening ChatGPT.&lt;/p&gt;

&lt;p&gt;Companies publicly mandate AI usage daily in production, then secretly ban it in interviews. That's not a hiring process. That's a credibility gap.&lt;/p&gt;

&lt;h2&gt;What Hiring Managers Say They Want (When They Bother to Say Anything)&lt;/h2&gt;

&lt;p&gt;I've been on &lt;strong&gt;hiring&lt;/strong&gt; panels where we passed on strong candidates for the dumbest reasons. So let me tell you what actually separates the hires from the passes, at least at companies that have thought about it for more than five minutes.&lt;/p&gt;

&lt;p&gt;They want problem-solving mindset over tool knowledge. If you walk into an architecture round and start listing tools instead of describing the problem you're solving, that's a concern. Concepts transfer across tools; tool knowledge doesn't transfer across concepts. This has always been true, and it's finally becoming the interview thesis at companies that are paying attention.&lt;/p&gt;

&lt;p&gt;They want business literacy. A query that runs in 3 seconds instead of 30 might save a downstream BI team hours of waiting. Does the candidate connect technical decisions to business outcomes? If your pipeline is technically perfect but ignores downstream consumers or compliance, you're not a hire. You're a liability.&lt;/p&gt;

&lt;p&gt;They want you to reason about boundaries. Don't propose a single-pattern solution. Describe the boundary between patterns and the contracts that flow across it. That's the senior signal. At staff level, they want to see you prevent problems, not just solve them.&lt;/p&gt;

&lt;p&gt;The irony is thick: these are all reasonable things to test for. But only about a third of interview loops include a dedicated data modeling round. A third. The single most important skill in data engineering, and two-thirds of companies don't even have a round for it. They'll spend 45 minutes on a LeetCode medium (or its chaotic replacement) and zero minutes on whether you understand grain, slowly changing dimensions, or why wide denormalized tables with complex types are eating star schema alive.&lt;/p&gt;

&lt;p&gt;Cloud cost efficiency is now one of the highest-scored interview categories. Companies are tying bonus incentives to cloud cost optimizations. This makes sense. Storage is 2 cents per GB per month. Engineer time is $100 an hour. The economics killed star schema, and now they're killing the interview formats that don't test for economic reasoning.&lt;/p&gt;

&lt;h2&gt;The Real Problem Nobody Wants to Admit&lt;/h2&gt;

&lt;p&gt;The inconsistency isn't accidental. It's evidence that the role itself transformed faster than hiring practices could keep up.&lt;/p&gt;

&lt;p&gt;Between 2023 and 2026, data engineering moved from "batch ETL plumber" to a role that combines real-time architecture, cloud cost optimization, metadata governance, platform engineering, and AI integration. Companies testing SQL plus system design plus Cursor builds aren't being random. They're testing for three different versions of the job simultaneously because they don't yet know which version matters most.&lt;/p&gt;

&lt;p&gt;That's not an excuse. It's a diagnosis.&lt;/p&gt;

&lt;p&gt;The community is furious not because DSA is gone, but because at least DSA was consistent. You could grind 50 mediums and be solid. Now? 97% of data engineers report burnout. 70% are likely to leave their jobs within 12 months. Hiring timelines stretch past 90 days. And at the end of that timeline, you might get an offer, be told it was sent, never receive it, do four more rounds, pass again, and have the headcount closed. I'm not making that up. That happened to me.&lt;/p&gt;

&lt;p&gt;The interview process isn't designed for candidates. It's designed for companies to feel thorough. The data engineering community won the argument against DSA, and the prize was chaos.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal. The interview formats will eventually stabilize around testing for these eternal problems.&lt;/p&gt;

&lt;p&gt;Until then? Treat prep like a job. Accept that every loop will be different. Ask recruiters what types of questions to expect, and if you don't get good answers, look online and at the job description. Prep for system design, SQL fluency, data modeling, and yes, basic Python. Cover the surface area because nobody else is going to narrow it down for you.&lt;/p&gt;

&lt;p&gt;What's the worst interview format you've encountered since companies started dropping DSA rounds? I genuinely want to know, because I thought my eight-round saga was bad, and I keep hearing stories that make it look quaint.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Junior Data Engineers Are Getting Wiped Out. Seniors Are Thriving.</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 12 May 2026 10:05:09 +0000</pubDate>
      <link>https://dev.to/datadriven/junior-data-engineers-are-getting-wiped-out-seniors-are-thriving-4j7d</link>
      <guid>https://dev.to/datadriven/junior-data-engineers-are-getting-wiped-out-seniors-are-thriving-4j7d</guid>
      <description>&lt;p&gt;Three years ago, a company I was at hired eight junior data engineers in a single quarter. Boilerplate ETL, basic SQL transforms, test scaffolding, docs. The standard apprenticeship pipeline. Last month, that same company posted two senior DE roles and zero junior ones. The eight seats are gone. Not frozen; gone. The work those engineers did still gets done. An LLM and two staff engineers handle it now.&lt;/p&gt;

&lt;p&gt;This isn't a hot take. It's Q1 2026 by the numbers: 52,050 tech &lt;strong&gt;layoffs&lt;/strong&gt; announced in the first three months of the year, a 40% jump over Q1 2025. Nearly half of those cuts were attributed to AI-driven automation. And the people getting cut aren't the ones designing pipeline architectures or negotiating data contracts with upstream teams. They're the ones writing the boilerplate that AI now generates on demand.&lt;/p&gt;

&lt;p&gt;The seniority bifurcation in &lt;strong&gt;data engineering&lt;/strong&gt; is real, it's accelerating, and if you're early in your &lt;strong&gt;career&lt;/strong&gt;, you need to understand the mechanics of it before you can do anything about it.&lt;/p&gt;

&lt;h2&gt;The Junior Toolkit Got Automated First&lt;/h2&gt;

&lt;p&gt;Here's what a typical &lt;strong&gt;junior&lt;/strong&gt; data engineer did two years ago: wrote basic ETL scripts, generated dbt models from specs, built simple Airflow DAGs, ran data quality checks, documented schemas. Useful work. Necessary work. Also, as it turns out, exactly the kind of work that LLMs are terrifyingly good at.&lt;/p&gt;

&lt;p&gt;The numbers are brutal. 70% of data quality checks are now automated. 65% of ETL/ELT pipeline design can be generated by AI code assistants. SQL generation tools hit 90% accuracy on first pass. Developers report 88% productivity increases with AI, spending 60% less time on boilerplate code, database schemas, and API creation.&lt;/p&gt;

&lt;p&gt;That's not "AI is coming for your job" fear-mongering. That's the specific, measurable erosion of the tasks that justified hiring someone at $72K to sit in a seat and learn.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The work isn't gone. The justification for hiring someone cheap to do it is.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Companies that used to bring on cohorts of 5 to 10 junior engineers now handle the same workload with 2 to 3 seniors plus AI tooling. Entry-level data engineer positions dropped 20 to 35% globally over the past 12 months. Recently hired workers (42%) and entry-level employees (41%) face disproportionate layoff risk compared to senior cohorts. The apprenticeship ladder that built every &lt;strong&gt;senior&lt;/strong&gt; engineer reading this article is being pulled up behind us.&lt;/p&gt;

&lt;p&gt;And here's the part that should make you uncomfortable if you're a senior who benefited from that ladder: this isn't a technology readiness problem. There's a fascinating gap in the data. Data engineers show 75% theoretical AI exposure but only 37% observed exposure. Companies &lt;em&gt;know&lt;/em&gt; AI can automate junior work. Many just haven't pulled the trigger yet because complex data systems break in unexpected ways and they'd rather keep a human in the loop than risk a silent pipeline failure from auto-generated code.&lt;/p&gt;

&lt;p&gt;That gap is closing. Fast.&lt;/p&gt;

&lt;h2&gt;Seniors Aren't Just Surviving; They're Getting Promoted&lt;/h2&gt;

&lt;p&gt;While junior roles contract, the senior market is doing something counterintuitive: growing. &lt;strong&gt;Senior&lt;/strong&gt; data engineer compensation is up 12 to 18% year over year. Base salaries hold at $147K to $179K nationally, with top talent in SF commanding $233K. Engineers with Databricks or Snowflake certifications see a 10 to 15% premium on top of that. Roles with demonstrated AI skills command another 15 to 30% salary premium.&lt;/p&gt;

&lt;p&gt;40% of data teams actually grew in 2025, up from 14% the year before, and budgets increased 30%. Read that again. Layoffs and growth are happening simultaneously. That's not contradictory; it's compositional. Companies are cutting junior headcount and reinvesting in senior hires who can own broader scope with AI leverage.&lt;/p&gt;

&lt;p&gt;The global data engineering market hit $105 billion in 2026 and is projected to reach $213 billion by 2031. The Bureau of Labor Statistics projects 36% job growth through 2034. Data engineering is not dying. It's not shrinking. It's getting more expensive and more senior.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal. AI doesn't fix them because they're not code problems; they're judgment problems, communication problems, business context problems. The kind of problems you can only solve after years of getting burned by them.&lt;/p&gt;

&lt;p&gt;The role is shifting from pipeline plumber to system architect. Senior DEs are moving up the stack while entry-level boilerplate gets consumed by tools. The engineers who thrive won't write the most SQL; they'll design the frameworks that let AI write SQL safely.&lt;/p&gt;

&lt;h2&gt;The Skills That Actually Matter Now&lt;/h2&gt;

&lt;p&gt;The bar for what counts as "data engineering skills" moved. A few years ago, you could be a strong DE focused mainly on batch ETL and warehousing. Now teams expect you to support ML workflows, real-time data needs, governance, and cost optimization, all under the same job title.&lt;/p&gt;

&lt;p&gt;Streaming infrastructure went from "nice to have" to competitive moat. Uber launched IngestionNext in March 2026, cutting data latency from hours to minutes and reducing compute costs 25% with Kafka, Flink, and Hudi. I still maintain that most companies don't need streaming (most of y'all don't), but the companies that &lt;em&gt;do&lt;/em&gt; need it are the ones paying $250K+ for the engineers who can build it.&lt;/p&gt;

&lt;p&gt;Cloud proficiency is non-negotiable; over 94% of enterprises have adopted cloud. AI skill requirements appear in 71% of U.S. tech job postings, up 181% year over year. And the real shortage isn't data engineers; it's governance experts wearing data engineer hats. Companies that used to treat governance as a separate function now embed it in every DE hire. If you can articulate data lineage, PII handling, and audit trails, you command a premium. If you can only write Spark jobs, you're becoming a commodity.&lt;/p&gt;

&lt;p&gt;The principle still holds: learn data modeling, query optimization, and why things break. Those transfer across every tool. But the floor has risen. The minimum viable senior DE in 2026 needs architecture thinking, AI fluency, governance awareness, and cloud-native platform skills. For the architecture and data modeling side of interview prep, &lt;a href="https://www.datadriven.io" rel="noopener noreferrer"&gt;datadriven.io&lt;/a&gt; lets you work through pipeline-design and modeling drills end-to-end instead of just reading about them; that kind of hands-on practice is what actually builds the muscle.&lt;/p&gt;

&lt;p&gt;Hiring timelines for senior roles have stretched to 60 to 90 days in enterprise settings. That's not bureaucracy; that's scarcity. Companies can't find enough people who combine architecture, AI integration, governance, and platform engineering in a single candidate. The 250,000-person shortage in AI/ML skillsets compounds everything.&lt;/p&gt;

&lt;h2&gt;Can Juniors Still Break In?&lt;/h2&gt;

&lt;p&gt;Yes. But not the way it used to work.&lt;/p&gt;

&lt;p&gt;The direct path into data engineering is mostly gone. "Data engineer" is not an entry-level position. It combines business context, analytics insight, infrastructure, software engineering, and SRE. The industry consensus now expects 2 to 6 years of prior experience, not a first career jump.&lt;/p&gt;

&lt;p&gt;The realistic path looks like this: start as a SQL-heavy data analyst, analytics engineer, DBA, or backend engineer. Spend 18 to 24 months building production experience and domain knowledge. Then transition to DE internally or through a targeted job search. This detour is becoming standard, not exceptional.&lt;/p&gt;

&lt;p&gt;If you're 3 years into an adjacent role running pipelines in production, that's not "close to being ready." You're doing the job. Stop discounting what you've already built.&lt;/p&gt;

&lt;p&gt;Portfolio projects help demonstrate skills but rarely replace production experience. That's the catch-22. You can't get production experience without the role, and you can't get the role without production experience. The way through is the adjacent role. Analyst to analytics engineer to data engineer. It's longer. It works.&lt;/p&gt;

&lt;p&gt;IBM tripled entry-level hiring in 2026, explicitly stating that AI still needs a human touch. That's an outlier, but it proves the path isn't completely closed. Some enterprises still see juniors as necessary friction-catchers. The BLS projects data engineering as one of the fastest-growing roles through 2030. The demand is there; it's just shifted upward in seniority.&lt;/p&gt;

&lt;p&gt;Here's what I'd tell anyone trying to break in right now: stop collecting tools. Learn concepts. Data modeling is the core skill. Getting the model wrong upstream means everything downstream is pain. Pick one orchestration tool and build something small that forces you to deal with failures, retries, and alerting. Then pick the next one. Treat the job search like a job. I did somewhere around 20 interview loops in a single search. Some went well. Some went laughably poorly. The grind is the strategy.&lt;/p&gt;

&lt;h2&gt;The Ladder Problem&lt;/h2&gt;

&lt;p&gt;The uncomfortable truth behind all of this is structural. AI creates more high-leverage work for seniors while erasing the stepping stones juniors traditionally used to become seniors. The boilerplate ETL, the basic SQL, the test generation; that was the apprenticeship. That was how you learned why pipelines break, how schemas drift, what happens when upstream teams push breaking changes at 2am. If AI handles all of that, where do future senior engineers come from?&lt;/p&gt;

&lt;p&gt;Nobody's talking about this enough. The industry is celebrating productivity gains without asking what the pipeline (the human one) looks like in five years. Junior engineers who never debug a failed DAG because AI handles it won't develop the foundational understanding necessary to debug complex systems when the AI fails. And AI will fail. It always does, usually at 2am, usually on the pipeline that finance depends on for board decks.&lt;/p&gt;

&lt;p&gt;The data engineering career isn't dying. It's bifurcating. Senior roles are growing, compensation is climbing, and the problems are getting harder and more strategic. Junior roles are contracting, the bar for entry is rising, and the old apprenticeship model is breaking down. Both of these things are true simultaneously.&lt;/p&gt;

&lt;p&gt;I'm not a doomer about this. The field is healthy, expanding, and full of hard problems worth solving. But the path in looks nothing like it did three years ago, and pretending otherwise is a disservice to every bootcamp grad refreshing LinkedIn right now.&lt;/p&gt;

&lt;p&gt;If you're senior: you're in a strong position. Use the leverage. Learn the AI tooling. Move up the stack.&lt;/p&gt;

&lt;p&gt;If you're junior: the path is longer and harder than it was. That's not your fault. It's the industry being the industry. Start adjacent, build real production experience, focus on concepts over tools, and grind.&lt;/p&gt;

&lt;p&gt;What's your read on the junior pipeline problem? Are we building a generation of seniors who never went through the apprenticeship, or will the path just look different? Genuinely curious what people on both sides are seeing.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>Data Engineer Salaries Are Splitting in Two. Which Side Are You On?</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 07 May 2026 10:09:51 +0000</pubDate>
      <link>https://dev.to/datadriven/data-engineer-salaries-are-splitting-in-two-which-side-are-you-on-508k</link>
      <guid>https://dev.to/datadriven/data-engineer-salaries-are-splitting-in-two-which-side-are-you-on-508k</guid>
      <description>&lt;p&gt;I sat on a hiring panel last month where we reviewed 340 applications for a single mid-level &lt;strong&gt;data engineer&lt;/strong&gt; role. SQL, Airflow, dbt, Snowflake. Every resume looked the same. I'm not exaggerating; I mean structurally identical. Same tools, same bullet points, same "built and maintained ELT pipelines" phrasing. We could've shuffled the names and nobody would've noticed.&lt;/p&gt;

&lt;p&gt;That same week, a colleague pinged me about a role on a different team. They were looking for someone who could build retrieval-augmented generation pipelines, tune embedding models for search, and wire vector databases into their existing warehouse infrastructure. They had four applicants. Four. The comp? 35% higher than the role with 340 candidates.&lt;/p&gt;

&lt;p&gt;That's the &lt;strong&gt;data engineer salary&lt;/strong&gt; market in 2026. It's not one market anymore. It's two.&lt;/p&gt;

&lt;h2&gt;The Split Is Already Here&lt;/h2&gt;

&lt;p&gt;I've been through enough hype cycles to know the difference between a trend and noise. This isn't noise. The data engineering career ladder has forked, and the two paths are diverging fast.&lt;/p&gt;

&lt;p&gt;On one side, you've got specialists. Engineers building AI data infrastructure, streaming systems, vector search pipelines, the plumbing that makes ML and GenAI products actually work in production. These roles are pulling 20 to 40% &lt;strong&gt;salary&lt;/strong&gt; premiums over their generalist counterparts. Google's L4 data engineer total comp is sitting at a $307K median. That's not a staff role. At Google, L4 is a mid-level rung, one step below senior, and it's being filled by people who can do more than write SQL and schedule DAGs.&lt;/p&gt;

&lt;p&gt;On the other side, you've got generalists. Solid engineers, many of them. People who've been running pipelines in production for years, doing real work. But their resumes are indistinguishable from 300 other resumes in the same pile. And when &lt;strong&gt;layoffs&lt;/strong&gt; hit (52,050 tech workers in Q1 2026 alone, with roughly 20% of those cuts explicitly citing AI automation), guess which group absorbs the damage?&lt;/p&gt;

&lt;p&gt;It's not the person building the RAG pipeline. It's the person whose entire job description can be replicated by a well-prompted AI agent and a managed orchestration service.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The tools change every 18 months. The problems don't. But right now, the market is paying a massive premium for people who understand the &lt;em&gt;new&lt;/em&gt; problems, not just the eternal ones.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Why Generalists Are Getting Crushed&lt;/h2&gt;

&lt;p&gt;Let me be clear: being a generalist isn't a character flaw. I was a generalist for years. I did the SQL, the Airflow DAGs, the warehouse migrations, the 3am on-call pages when the finance pipeline broke before the board deck was due. That work matters. It keeps companies running.&lt;/p&gt;

&lt;p&gt;But the economics have shifted under our feet.&lt;/p&gt;

&lt;p&gt;Three things happened at once. First, the tooling for standard batch ELT got really, really good. dbt, Fivetran, managed Airflow; these tools automated the middle of the stack. The work that used to require a mid-level DE now requires a config file and a credit card. Second, AI coding assistants made it possible for analytics engineers and even some analysts to write passable pipeline code. Not great code, but functional code. Good enough code. Third, companies started building AI products, and those products need data infrastructure that looks nothing like a traditional warehouse.&lt;/p&gt;

&lt;p&gt;The result? The demand for "build me a standard ELT pipeline" has flatlined while the demand for "build me the data layer for our AI product" has spiked. Supply and demand. The generalist side got flooded; the specialist side stayed scarce.&lt;/p&gt;

&lt;p&gt;I've been on hiring panels where we passed on strong candidates for the dumbest reasons. But this isn't that. This is structural. When 340 people apply for your role and they all have the same stack, you're not competing on skill anymore. You're competing on luck. That's not a &lt;strong&gt;career&lt;/strong&gt; strategy; that's a lottery ticket.&lt;/p&gt;

&lt;h2&gt;The $307K Number and What It Actually Means&lt;/h2&gt;

&lt;p&gt;Everyone screenshots the big comp numbers. $307K at Google L4. And yeah, it's real. But let's talk about what's behind it, because the number without context is just resume bait.&lt;/p&gt;

&lt;p&gt;Total comp at that level is base plus bonus plus stock. The base alone isn't making anyone faint; it's the equity that moves the needle, and equity is where companies show you what they actually value. When Google offers $307K total comp for a data engineer, they're not paying for someone who can write a GROUP BY. They're paying for someone who understands distributed systems, can design data pipelines that serve ML models at scale, and can debug the Spark job that's silently corrupting embeddings in production.&lt;/p&gt;

&lt;p&gt;That's the key distinction people miss. The premium isn't for knowing a specific tool. It's not "learn Pinecone and get a 40% raise." The premium is for understanding the &lt;em&gt;concepts&lt;/em&gt; underneath the tools. How vector similarity search actually works. Why your embedding pipeline needs different SLAs than your batch reporting pipeline. What happens when your feature store and your serving layer disagree on freshness.&lt;/p&gt;

&lt;p&gt;Concepts transfer across tools; tool knowledge doesn't transfer across concepts. I've been saying this for years and it's never been more true than right now. The engineers commanding top &lt;strong&gt;data engineer salary&lt;/strong&gt; offers aren't the ones who memorized the Kafka API. They're the ones who understand why you'd choose streaming over batch for a specific use case (and more importantly, why you usually wouldn't).&lt;/p&gt;

&lt;p&gt;If you want to sharpen the pipeline architecture and data modeling thinking that actually moves the needle in these interviews, &lt;a href="https://www.datadriven.io" rel="noopener noreferrer"&gt;datadriven.io&lt;/a&gt; lets you work through those design problems end-to-end with real feedback, which is closer to what these loops feel like than reading blog posts about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Specialist Trap (Yes, There Is One)
&lt;/h2&gt;

&lt;p&gt;Before you go update your LinkedIn headline to "AI Data Engineer" and start listing vector databases you've never used in production, let me pump the brakes.&lt;/p&gt;

&lt;p&gt;I've watched this movie before. Every hype cycle produces a wave of people who rebrand without reskilling. In 2019 it was "machine learning engineer" on every resume. In 2021 it was "data mesh architect." Now it's "AI/ML data engineer." Most of those people couldn't architect a RAG pipeline if you spotted them the retrieval layer.&lt;/p&gt;

&lt;p&gt;The market isn't stupid. Not forever, anyway. Hiring managers are already getting wise to inflated titles and keyword-stuffed resumes. I've interviewed candidates who listed "vector database experience" and couldn't explain what an embedding is. That's not specialization; that's decoration.&lt;/p&gt;

&lt;p&gt;Real specialization means you've built something. You've debugged something. You've been paged at 2am because the embedding pipeline drifted and the search results went haywire and you had to figure out why. The reps matter more than the resume line.&lt;/p&gt;

&lt;p&gt;Here's the uncomfortable truth about crossing from the generalist side to the specialist side: it requires doing the work before you get paid for it. Build a side project that uses a vector store. Contribute to an open-source streaming framework. Take your existing warehouse and bolt on a real-time feature serving layer, even if nobody asked you to. The engineers who are commanding premiums right now didn't wait for permission. They saw where the puck was going and started skating.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Compounds
&lt;/h2&gt;

&lt;p&gt;I'm not going to tell you &lt;strong&gt;data engineering&lt;/strong&gt; is dying. I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems, just with fancier tools.&lt;/p&gt;

&lt;p&gt;But I am going to tell you that the floor is dropping for people who haven't evolved their skill set in the last three years. The &lt;strong&gt;layoffs&lt;/strong&gt; aren't random. They're patterned. The roles getting cut are the roles that overlap most with what automation can handle. If your entire job is "move data from point A to point B on a schedule," you're competing with a SaaS product that costs $500/month.&lt;/p&gt;

&lt;p&gt;The skills that compound in 2026 are the same ones that have always compounded, just applied to new problem domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data modeling.&lt;/strong&gt; Still the core skill. Getting the model wrong upstream means everything downstream is pain. This is true whether you're modeling a star schema or an embedding index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Systems thinking.&lt;/strong&gt; Understanding how data flows through an entire architecture, not just your slice of it. The engineer who can trace a data quality issue from the serving layer back through the feature pipeline to the ingestion source is worth three engineers who can only see their own DAG.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging under pressure.&lt;/strong&gt; The actual job is less "write a DAG" and more "figure out why this pipeline silently dropped 2M rows last Tuesday." That skill doesn't get automated. It gets more valuable as systems get more complex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business context.&lt;/strong&gt; Knowing which pipeline matters to revenue and which one is a vanity dashboard that nobody checks. AI can't tell you that. Your CFO can.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Junior engineers worry about which tool to learn. Senior engineers worry about which problems to solve. Staff engineers worry about which problems to prevent. That hierarchy hasn't changed. The specific problems have.&lt;/p&gt;

&lt;p&gt;The gap between specialist and generalist &lt;strong&gt;salary&lt;/strong&gt; isn't permanent for any individual. It's a snapshot of where the market values your current skill set. Skills can be developed. Reps can be done. I went from a non-CS degree and a career outside tech to staff-level at companies you've heard of. It's possible; it just requires being strategic about which skills compound.&lt;/p&gt;

&lt;p&gt;But you do have to choose. Sitting in the middle, hoping the market comes back to rewarding the same stack you learned four years ago, is the one strategy I can guarantee won't work.&lt;/p&gt;

&lt;p&gt;So which side of the split are you building toward?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>beginners</category>
      <category>interview</category>
    </item>
    <item>
      <title>How to Think During a Data Engineering Interview in 2026</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Wed, 06 May 2026 17:09:48 +0000</pubDate>
      <link>https://dev.to/datadriven/how-to-think-during-a-data-engineering-interview-in-2026-45gi</link>
      <guid>https://dev.to/datadriven/how-to-think-during-a-data-engineering-interview-in-2026-45gi</guid>
      <description>&lt;p&gt;Most interview prep teaches you what to know. Not how to think.&lt;/p&gt;

&lt;p&gt;That's a problem, because data engineering interviews don't fail candidates on knowledge gaps as often as people assume. They fail candidates who know the answer but can't show their work.&lt;/p&gt;

&lt;p&gt;I watched "Data &amp;amp; AI Guy" &lt;a href="https://www.youtube.com/watch?v=uPIzRHfYNgU" rel="noopener noreferrer"&gt;solve five real interview questions&lt;/a&gt; live on camera using DataDriven.io. SQL, Python, Spark, data modeling, pipeline architecture. One problem per domain, full reasoning narrated out loud. It's a useful model for how to actually behave during an interview. Not just what to code, but how to move through the problem from prompt to solution to edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL&lt;/strong&gt;: Say the Why Out Loud&lt;br&gt;
The problem: return a deduplicated list of regions from an infrastructure nodes table.&lt;/p&gt;

&lt;p&gt;Answer is SELECT DISTINCT region FROM infra_nodes. One line. Correct.&lt;/p&gt;

&lt;p&gt;Most people write it and wait. He writes it and immediately explains why DISTINCT over GROUP BY. Both work here. But DISTINCT signals intent. You're not aggregating, you're deduplicating. That's what DISTINCT is for. GROUP BY is a more powerful tool being used as a weaker one.&lt;/p&gt;

&lt;p&gt;Then he raises the null edge case without being asked.&lt;/p&gt;

&lt;p&gt;If region is nullable, DISTINCT returns NULL as a value. Is that correct? Depends on the business context. He doesn't assume. He flags it and asks.&lt;/p&gt;

&lt;p&gt;That's the habit worth building. Not just solving the problem, but immediately asking: where does this solution make an assumption it shouldn't? The NULL case takes ten seconds to raise. It tells the interviewer you've worked with real production data, where nullable columns are the default, not the exception.&lt;/p&gt;

&lt;p&gt;The move: After every solution, ask yourself "what breaks here?" out loud. Interviewers are watching for that step.&lt;/p&gt;
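
&lt;p&gt;The NULL point is easy to check for yourself. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever engine the interview uses. The sample rows are made up; only the table and column names come from the prompt.&lt;/p&gt;

```python
import sqlite3

# Hypothetical sample data: two duplicate regions and one NULL region.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE infra_nodes (node_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO infra_nodes VALUES (?, ?)",
    [(1, "us-east"), (2, "us-east"), (3, "eu-west"), (4, None)],
)


def regions(sql):
    """Run a one-column query and return its values, NULL (None) first."""
    values = [row[0] for row in conn.execute(sql)]
    return sorted(values, key=lambda r: (r is not None, r or ""))


# DISTINCT and GROUP BY return the same rows, and both keep NULL as a value.
distinct = regions("SELECT DISTINCT region FROM infra_nodes")
grouped = regions("SELECT region FROM infra_nodes GROUP BY region")

# Dropping NULL is a business decision, so it has to be written explicitly.
non_null = regions(
    "SELECT DISTINCT region FROM infra_nodes WHERE region IS NOT NULL"
)

print(distinct)
print(grouped)
print(non_null)
```

&lt;p&gt;Whether the third query is the right one depends on what a missing region means to the business, which is exactly why it's worth asking.&lt;/p&gt;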

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;: Decompose Before You Code&lt;br&gt;
The problem: given a list of integers and container type names, group by distinct value, sort ascending, round-robin assign each group to a container, format output differently per container type (set gets deduped and sorted descending, list and tuple keep original order).&lt;/p&gt;

&lt;p&gt;It's tricky. Multi-step, several interacting requirements.&lt;/p&gt;

&lt;p&gt;Before touching code he restates the problem in his own words, slowly, and traces through a concrete example by hand.&lt;/p&gt;

&lt;p&gt;"I'm grouping every occurrence of each distinct value together. So the ones go in one bucket, threes in another. Then I order those groups by distinct value ascending. Then round-robin assign each group to a container type."&lt;/p&gt;

&lt;p&gt;This is not stalling. This is the move that prevents you from writing fifteen minutes of code that solves the wrong problem.&lt;/p&gt;

&lt;p&gt;Once the decomposition is right the code is almost mechanical. defaultdict for grouping, sorted for ordering, enumerate with modulo for round-robin, conditional formatting per container type.&lt;/p&gt;

&lt;p&gt;Then three edge cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Empty input: returns an empty dict, works correctly.&lt;/li&gt;
&lt;li&gt;Empty containers list: ZeroDivisionError from the modulo, worth flagging upfront.&lt;/li&gt;
&lt;li&gt;Unknown container name: falls through to the else branch and is silently treated like a list. If the interviewer wants strict validation, you'd add an assert or raise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The third one is subtle and most candidates don't get there. The interviewer who wants to stress test you will hand you exactly that input.&lt;/p&gt;

&lt;p&gt;The move: On any problem with multiple interacting requirements, restate out loud before coding. Trace one example by hand. List the edge cases you can already see. Do this even when the answer feels obvious.&lt;/p&gt;
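
&lt;p&gt;For reference, the decomposition can be sketched roughly like this. It is not the code from the video. The function name and the output shape (a dict keyed by distinct value) are my assumptions, and the set case is returned as a descending list because Python sets have no order.&lt;/p&gt;

```python
from collections import defaultdict


def assign_groups(values, containers):
    """Group equal values, order groups ascending, round-robin over containers.

    Hypothetical helper: the name and return shape are assumptions.
    """
    groups = defaultdict(list)
    for value in values:
        groups[value].append(value)

    result = {}
    for index, key in enumerate(sorted(groups)):
        # Raises ZeroDivisionError when there are groups but no containers.
        kind = containers[index % len(containers)]
        group = groups[key]
        if kind == "set":
            # Deduplicate, then sort descending.
            result[key] = sorted(set(group), reverse=True)
        elif kind == "tuple":
            result[key] = tuple(group)
        else:
            # "list" and any unknown name both land here, silently.
            result[key] = list(group)
    return result


print(assign_groups([3, 1, 3, 1, 2], ["list", "set", "tuple"]))
```

&lt;p&gt;All three edge cases are visible in the sketch: empty input never enters the loop, an empty containers list fails on the modulo, and an unknown name falls into the else branch.&lt;/p&gt;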

&lt;p&gt;&lt;strong&gt;Spark&lt;/strong&gt;: The Expected Output Is Telling You Something&lt;br&gt;
The problem: return authors who deployed to both dev and prod.&lt;/p&gt;

&lt;p&gt;Looks simple. Then he looks at the expected output.&lt;/p&gt;

&lt;p&gt;Alice and alice are separate rows. Different people as far as this query is concerned. But DEV and dev should match. Case sensitivity applies to authors, not environments.&lt;/p&gt;

&lt;p&gt;He catches this before writing a line.&lt;/p&gt;

&lt;p&gt;This is the Spark round in a nutshell. The prompt is often deliberately vague or ambiguous. The expected output encodes decisions the prompt doesn't state explicitly. Candidates who read the prompt and start coding miss it. Candidates who look at the expected output first and ask "what is this output telling me?" catch it.&lt;/p&gt;

&lt;p&gt;His pipeline: normalize environment name to lowercase in a new column, filter to dev/prod using the normalized column, group by author, count DISTINCT environments (not deploys), filter to count equals two, sort alphabetically.&lt;/p&gt;

&lt;p&gt;The DISTINCT matters. Without it, an author who deployed to dev twice and never to prod gets a count of two and passes the filter. You want distinct environments hit, not total deploys.&lt;/p&gt;

&lt;p&gt;He also shows an alternative self-join solution that another user submitted. That's worth noting. There's usually more than one way to solve these and being able to discuss tradeoffs between approaches is what senior loops are actually testing.&lt;/p&gt;

&lt;p&gt;The move: On any problem with expected output shown, read the output before the prompt. Ask what decisions the output is making that the prompt left unstated.&lt;/p&gt;
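
&lt;p&gt;The same logic can be sketched in SQL. I'm using Python's sqlite3 here as a stand-in for Spark so it runs anywhere. The table name, column names, and rows are hypothetical, and the second query swaps in a plain row count to show what the DISTINCT protects against.&lt;/p&gt;

```python
import sqlite3

# Hypothetical deploy log. bob deployed to dev twice and never to prod.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deploys (author TEXT, environment TEXT)")
conn.executemany(
    "INSERT INTO deploys VALUES (?, ?)",
    [
        ("Alice", "DEV"), ("Alice", "prod"),
        ("alice", "dev"), ("alice", "Prod"),
        ("bob", "dev"), ("bob", "dev"),
        ("carol", "prod"), ("carol", "staging"),
    ],
)

# Normalize the environment only; author stays case-sensitive.
CORRECT = """
SELECT author
FROM deploys
WHERE LOWER(environment) IN ('dev', 'prod')
GROUP BY author
HAVING COUNT(DISTINCT LOWER(environment)) = 2
ORDER BY author
"""

# Same query, but counting deploys instead of distinct environments.
NAIVE = CORRECT.replace("COUNT(DISTINCT LOWER(environment))", "COUNT(*)")

correct_authors = [row[0] for row in conn.execute(CORRECT)]
naive_authors = [row[0] for row in conn.execute(NAIVE)]

print(correct_authors)
print(naive_authors)
```

&lt;p&gt;With these rows the correct query keeps Alice and alice as separate authors, while the naive one also lets bob through.&lt;/p&gt;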

&lt;p&gt;&lt;strong&gt;Data Modeling&lt;/strong&gt;: Name the Grain First&lt;br&gt;
The problem: track employee application usage, flag anyone spending more than ten hours a day in a single app.&lt;/p&gt;

&lt;p&gt;His answer is a star schema. Two dims, one fact, daily grain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dim_employee: employee_id, full_name, city, department&lt;/li&gt;
&lt;li&gt;dim_application: application_id, app_name, category&lt;/li&gt;
&lt;li&gt;fact_application_usage: usage_id, employee_id (FK), application_id (FK), usage_date, hours_used, over_ten_hour_flag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;He names the grain explicitly before drawing anything. One row equals one employee using one application on one day. That grain is forced by the prompt. The ten-hour threshold is daily. The summaries are per employee per application. There's no other grain that fits.&lt;/p&gt;

&lt;p&gt;This is the skill most data modeling rounds are actually testing. Not whether you know what a star schema is. Whether you can derive the grain from the business requirements and then explain why that grain is the only one that works.&lt;/p&gt;

&lt;p&gt;The over_ten_hour_flag lives on the fact table rather than the BI layer. Reasonable call. If HR is querying it constantly, materializing it once is cleaner than recomputing it in every downstream report.&lt;/p&gt;

&lt;p&gt;The move: Before drawing any schema, answer three questions out loud: who is the actor, what is the event, what is the time granularity. The schema follows from those answers.&lt;/p&gt;
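
&lt;p&gt;One way to make the grain concrete is to let the database enforce it. This is a sketch in SQLite via Python, not the schema from the video. The column types, the UNIQUE constraint, the record_usage helper, and the sample rows are all my assumptions.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (
    employee_id INTEGER PRIMARY KEY, full_name TEXT, city TEXT, department TEXT
);
CREATE TABLE dim_application (
    application_id INTEGER PRIMARY KEY, app_name TEXT, category TEXT
);
CREATE TABLE fact_application_usage (
    usage_id INTEGER PRIMARY KEY,
    employee_id INTEGER NOT NULL REFERENCES dim_employee (employee_id),
    application_id INTEGER NOT NULL REFERENCES dim_application (application_id),
    usage_date TEXT NOT NULL,
    hours_used REAL NOT NULL,
    over_ten_hour_flag INTEGER NOT NULL,
    -- The grain: one employee, one application, one day.
    UNIQUE (employee_id, application_id, usage_date)
);
INSERT INTO dim_employee VALUES (1, 'Ada Example', 'Austin', 'Finance');
INSERT INTO dim_application VALUES (10, 'Spreadsheet', 'Productivity');
""")


def record_usage(employee_id, application_id, usage_date, hours_used):
    """Insert one daily usage row and materialize the flag at load time."""
    conn.execute(
        "INSERT INTO fact_application_usage "
        "(employee_id, application_id, usage_date, hours_used, over_ten_hour_flag) "
        "VALUES (?, ?, ?, ?, ?)",
        (employee_id, application_id, usage_date, hours_used, int(hours_used > 10)),
    )


record_usage(1, 10, "2026-05-04", 11.5)
record_usage(1, 10, "2026-05-05", 6.0)

flagged = conn.execute(
    "SELECT e.full_name, a.app_name, f.usage_date "
    "FROM fact_application_usage AS f "
    "JOIN dim_employee AS e ON e.employee_id = f.employee_id "
    "JOIN dim_application AS a ON a.application_id = f.application_id "
    "WHERE f.over_ten_hour_flag = 1"
).fetchall()

# A second row for the same employee, app, and day violates the grain.
try:
    record_usage(1, 10, "2026-05-04", 2.0)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(flagged, duplicate_rejected)
```

&lt;p&gt;If the source sends several sessions per day, they would need to be summed upstream before they reach this table. That is the kind of consequence naming the grain forces you to say out loud.&lt;/p&gt;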

&lt;p&gt;&lt;strong&gt;Pipeline Architecture&lt;/strong&gt;: Constraints Drive the Design&lt;br&gt;
The last problem. Greenfield build. Six sources. Three requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboards ready by 8am on weekdays&lt;/li&gt;
&lt;li&gt;One canonical definition of MRR and NRR&lt;/li&gt;
&lt;li&gt;Finance cannot see raw card data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;He builds a medallion architecture: sources into Kafka/Fivetran ingestion layers, into bronze raw delta tables, through dbt/DLT into silver cleaned tables, into gold star schema marts, with Unity Catalog governing access and Airflow orchestrating across all of it.&lt;/p&gt;

&lt;p&gt;The constraints are the answer.&lt;/p&gt;

&lt;p&gt;The 8am SLA means batch-first. Maybe one streaming path for event data. But you cannot promise 8am dashboard delivery on a streaming-first architecture without a lot of complexity that isn't justified here.&lt;/p&gt;

&lt;p&gt;The canonical MRR/NRR requirement means a semantic layer. Not just clean gold tables. If the metric definition lives in five BI tools separately, you have five definitions of MRR inside a year. The semantic layer is what makes "one canonical definition" actually mean something.&lt;/p&gt;

&lt;p&gt;Finance not seeing raw card data means catalog-level security. Table-level access controls fail the moment someone grants the wrong table permission. Column masking and dynamic views in Unity Catalog enforce the control at the platform level, not dependent on every access grant being correct.&lt;/p&gt;

&lt;p&gt;The candidates who fail architecture rounds can draw this diagram. They just can't explain why it's the right diagram for these constraints. The boxes are decoration. The constraints are the substance.&lt;/p&gt;

&lt;p&gt;The move: Before drawing any architecture, write down the constraints and what each one forces. The diagram should be a consequence of the constraints, not the starting point.&lt;/p&gt;
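
&lt;p&gt;To show the shape of the third constraint, here is a rough sketch of a masked view. It uses SQLite through Python purely as an illustration. SQLite has no grants, so the enforcement itself would have to come from the catalog. The table, view, and column names are hypothetical, and the card numbers are dummy test values.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bronze_payments (
    payment_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    card_number TEXT,
    amount_usd REAL
);
INSERT INTO bronze_payments VALUES (1, 100, '4111111111111111', 49.0);
INSERT INTO bronze_payments VALUES (2, 101, '5500000000000004', 99.0);

-- The only object finance would be granted: raw digits never leave the view.
CREATE VIEW finance_payments AS
SELECT
    payment_id,
    customer_id,
    '************' || substr(card_number, -4) AS card_number_masked,
    amount_usd
FROM bronze_payments;
""")

cursor = conn.execute("SELECT * FROM finance_payments ORDER BY payment_id")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns)
print(rows)
```

&lt;p&gt;The design point is that the mask lives in one governed object instead of in every dashboard query, so a single wrong grant on a downstream table can't leak the raw column.&lt;/p&gt;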

&lt;p&gt;&lt;strong&gt;The Meta-Skill Across All Five&lt;/strong&gt;&lt;br&gt;
Restate the problem. Trace an example. Narrate your reasoning. Raise the edge cases.&lt;/p&gt;

&lt;p&gt;None of those are syntax. They're production instincts. The interviewer is trying to figure out if you've actually shipped things that broke at 2am. The reasoning is how you show that you have.&lt;/p&gt;

&lt;p&gt;Most people cram to memorize answers. This video is a useful model for training the thing that gets you hired: intuition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to Practice&lt;/strong&gt;&lt;br&gt;
&lt;a href="//www.datadriven.io/yt"&gt;DataDriven.io&lt;/a&gt; is where he pulls all five problems in the video. It covers all six domains: SQL, Python, Spark, AI Coding, data modeling, and pipeline architecture. Most platforms stop at SQL.&lt;/p&gt;

&lt;p&gt;Multiple solutions per problem, community submissions, and it's free. No trial, no credit card, no pricing page.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>python</category>
    </item>
    <item>
      <title>AI Broke Data Engineering Interviews. Nobody Knows What's Next.</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 05 May 2026 10:05:37 +0000</pubDate>
      <link>https://dev.to/datadriven/ai-broke-data-engineering-interviews-nobody-knows-whats-next-1cmi</link>
      <guid>https://dev.to/datadriven/ai-broke-data-engineering-interviews-nobody-knows-whats-next-1cmi</guid>
      <description>&lt;p&gt;I've been on both sides of the &lt;strong&gt;data engineering&lt;/strong&gt; hiring table for years. I've written interview loops, failed interview loops, and watched candidates ace screens that told me absolutely nothing about whether they could debug a silent data loss bug at 2am. The signal was always thin. Now it's basically noise.&lt;/p&gt;

&lt;p&gt;Here's the situation in 2026: 64% of companies ban AI in interviews. Candidates use it anyway. One company measured 80% of candidates using LLMs on take-home tests despite explicit prohibition. AI cheating on take-homes more than doubled, from 15% to 35%, between June and December 2025, and it's still accelerating. The traditional code screen, the thing that was supposed to separate "can do the job" from "can't do the job," is dead. It just hasn't stopped twitching yet.&lt;/p&gt;

&lt;p&gt;So if an AI can spit out a clean solution to a medium LC problem, what does asking that problem actually tell me about you? That you memorized something a machine produces on demand? I've been interviewing FAANG data engineers for years. The &lt;strong&gt;interview&lt;/strong&gt; signal has always been questionable. Now it's gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  64% Ban AI, 100% Can't Stop It
&lt;/h2&gt;

&lt;p&gt;The Karat 2025-2026 AI Workforce Transformation Report surveyed 400 engineering leaders across the U.S., India, and China. The headline number: nearly two-thirds still prohibit AI use in interviews. But less than 30% have actually updated their assessments or retrained interviewers. That's not a policy. That's a legal compliance gesture stapled to a prayer.&lt;/p&gt;

&lt;p&gt;The enforcement paradox is brutal. Modern cheating tools solve take-homes in 5 minutes. Invisible overlay tools render answers in candidates' IDEs while screen-capture sees nothing. AI detection? Same 800-word essay tested on five different detectors returned scores of 4%, 91%, 12%, 67%, and 38%. That's not detection; that's a random number generator.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The skill being tested (AI-free coding) is not the job. Engineers use AI daily. Testing without it measures neither job performance nor authentic ability. It measures anxiety tolerance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's the hypocrisy that gets me: Amazon, Microsoft, Meta, and Google all require engineers to use AI daily in production code. Google has publicly acknowledged a significant portion of its codebase is AI-generated. Then they disqualify candidates for using the same tools in interviews. The Class of 2025 watched a generation get laid off, saw companies ship AI-generated code to production, and decided the "no AI" rule is a fiction they're not participating in. I can't say I blame them.&lt;/p&gt;

&lt;p&gt;The policy chaos is something else. Amazon will fully disqualify you for AI use. Goldman Sachs bans ChatGPT. Anthropic banned AI in May 2025, walked it back in July, now allows it for resumes only. Meanwhile Meta, Shopify, and Canva explicitly encourage AI in coding rounds. You can go through three interview loops in parallel and face completely opposite rules in each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Any LLM Can Pass Your Screen
&lt;/h2&gt;

&lt;p&gt;Traditional code tests had exactly one value proposition: differentiation at scale. LLMs dissolved that advantage.&lt;/p&gt;

&lt;p&gt;Codility achieves only a 0.47 correlation to job performance. A correlation that size accounts for roughly 22% of the variance in performance, which leaves most of it unexplained. Frontier models hit 95%+ on HumanEval, and the gap between top models is one point of meaningless noise. The benchmarks that companies used to calibrate difficulty are saturated. A medium LeetCode problem that used to filter out 60% of candidates now filters out nobody, because the candidates aren't solving it; their tools are.&lt;/p&gt;

&lt;p&gt;71% of engineering leaders say AI is making it harder to assess technical skills. That number was probably 20-30% two years ago. The &lt;strong&gt;hiring&lt;/strong&gt; process is in freefall and the people running it know it.&lt;/p&gt;

&lt;p&gt;The really insidious part is what one HN commenter called "vibe coders." Candidates who are phenomenal at prompting AI to generate boilerplate but completely freeze when architecture gets complex, things break, or AI subtly hallucinates. Traditional screens can't distinguish between a strong engineer using AI as leverage and a weak engineer hiding behind it. And 59% of surveyed SVPs and CTOs now say weak engineers deliver net zero or negative value in the AI era. The stakes for getting this wrong have never been higher.&lt;/p&gt;

&lt;p&gt;73% of those 400 engineering leaders say strong engineers are worth at least 3x their total compensation. So the ROI of correct &lt;strong&gt;hiring&lt;/strong&gt; decisions just tripled while the signal quality went to zero. That's not a mismatch; that's a crisis.&lt;/p&gt;

&lt;h2&gt;
  
  
  US vs. China: The AI Interview Gap
&lt;/h2&gt;

&lt;p&gt;While American companies debate whether to allow AI, Chinese tech firms have already integrated it into hiring workflows. The Karat data shows Chinese companies are nearly 2x more likely to allow AI in live interviews and significantly less reliant on take-home projects and automated testing.&lt;/p&gt;

&lt;p&gt;ByteDance is offering 5,000 positions with a 23% increase in R&amp;amp;D hiring. Alibaba posted 7,000+ roles, 60% AI-related. Baidu saw a 60% position increase with 90% of campus recruitment focused on AI. AI-related positions in China surged 12x year-on-year in early 2026. The US saw 78,000 tech layoffs in Q1 2026 while 275,000 AI job postings remained unfilled. That's not a skills gap; that's a structural mismatch dressed up as one.&lt;/p&gt;

&lt;p&gt;The speed differential isn't just volume; it's philosophy. Chinese firms stopped pretending AI doesn't exist in interviews and started measuring whether candidates can use it effectively. US firms are still arguing about whether adaptation is allowed. By the time American companies reach consensus, Chinese firms will have hired an entire generation of engineers calibrated for AI-native work.&lt;/p&gt;

&lt;p&gt;67% of startups already use AI in interviews while established companies cling to bans. If you're a &lt;strong&gt;data engineering&lt;/strong&gt; candidate right now, the rules you're prepping for depend entirely on whether the company you're targeting was founded before or after 2015.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Replaces the Code Test?
&lt;/h2&gt;

&lt;p&gt;Nobody knows. That's the honest answer.&lt;/p&gt;

&lt;p&gt;The industry is fragmenting. Some companies mandate live coding. Others double down on take-homes. Chinese firms embrace AI-in-session evaluation. Five major companies (Canva, Rippling, Meta, Shopify, Red Hat) now explicitly expect candidates to use Copilot, Cursor, and Claude during technical interviews. The shift isn't from "no AI" to "yes AI." It's from testing output to observing process. Can you prompt effectively? Do you critically evaluate AI suggestions? Do you know when the model is hallucinating?&lt;/p&gt;

&lt;p&gt;The data engineering &lt;strong&gt;interview&lt;/strong&gt; process was already broken before AI. Testing algorithms that engineers never use while ignoring skills they need daily. The actual job is debugging, not building. Less "write a DAG" and more "figure out why this pipeline silently dropped 2M rows last Tuesday." Nobody was interviewing for that skill anyway. AI just made the gap impossible to ignore.&lt;/p&gt;

&lt;p&gt;Live, conversational interviews with integrity verification have become the only reliable alternative. 95% of candidates prefer assessments that mirror actual job scenarios over abstract puzzles. The best-performing teams aren't inventing new interview types; they're using existing formats with tighter rubrics, calibrated interviewers, and outcome-based feedback loops. For the architecture-style rounds, &lt;a href="https://www.datadriven.io" rel="noopener noreferrer"&gt;datadriven.io&lt;/a&gt; lets you work through the pipeline-design and data-modeling drills end-to-end instead of just reading about them; that kind of realistic simulation is where the industry is heading anyway.&lt;/p&gt;

&lt;p&gt;The collaborative model is emerging: interviewers reading candidate cues, providing hints at the right time, jointly solving problems instead of input-output gotchas. It's more expensive. It requires trained interviewers. Less than 30% of firms have retrained their people to do this. The window to capture that signal advantage is narrow.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Prep When the Rules Keep Changing
&lt;/h2&gt;

&lt;p&gt;You're a &lt;strong&gt;data engineering&lt;/strong&gt; candidate in 2026. Company A wants classic dynamic programming with no AI. Company B wants you to build a feature using Cursor in 45 minutes. Company C hasn't decided yet and will tell you the rules 24 hours before your onsite. This isn't preparation guidance; it's a moving target.&lt;/p&gt;

&lt;p&gt;Here's what I'd tell you (and what I've told myself through 20+ interview loops):&lt;/p&gt;

&lt;p&gt;Concepts transfer across tools; tool knowledge doesn't transfer across concepts. That hasn't changed. Data modeling, query optimization, understanding why things break: these are tool-agnostic. They're also AI-resistant. An LLM can generate a Spark job. It can't tell you why your pipeline silently corrupted data for six months. It can't make the business context judgment calls that separate a senior from a staff engineer.&lt;/p&gt;

&lt;p&gt;The interview is still a separate skill from the job. That was true before AI and it's true now. Treat prep like a job. But focus your prep on the things AI can't fake: system design reasoning, architecture tradeoffs, debugging methodology, and the ability to articulate why you made the decisions you made.&lt;/p&gt;

&lt;p&gt;For live rounds where AI is allowed, practice using AI as a tool, not a crutch. The companies that permit it are watching how you use it, not whether you use it. Can you spot when it's wrong? Can you direct it toward the right solution? That's the new signal.&lt;/p&gt;

&lt;p&gt;For companies that ban it, the mediums are still enough. Do 50. You'll be solid. But accept it for the arbitrary measuring stick it is, play the game, and spend more energy on the system design and &lt;strong&gt;career&lt;/strong&gt; narrative rounds where AI provides zero advantage.&lt;/p&gt;

&lt;p&gt;The one thing I know for certain: this isn't settling down anytime soon. The 35% of candidates using AI on take-homes is heading past 50% by late 2026. When the majority cheats, the honest minority faces inverse selection pressure. Companies will either redesign their loops or watch their hiring signal collapse entirely.&lt;/p&gt;

&lt;p&gt;The tools change every 18 months. The problems don't. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal. Focus your prep there.&lt;/p&gt;

&lt;p&gt;What's the weirdest interview format you've encountered in 2026? I'm genuinely curious whether anyone's seen something that actually works.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>ai</category>
      <category>career</category>
    </item>
    <item>
      <title>Your Data Engineering Take-Home Is Unpaid Consulting. Refuse It.</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 30 Apr 2026 10:07:07 +0000</pubDate>
      <link>https://dev.to/datadriven/your-data-engineering-take-home-is-unpaid-consulting-refuse-it-3jo5</link>
      <guid>https://dev.to/datadriven/your-data-engineering-take-home-is-unpaid-consulting-refuse-it-3jo5</guid>
      <description>&lt;p&gt;I spent a full weekend building an end-to-end pipeline for a Series B startup. Ingestion from three sources, data modeling in their warehouse, dbt transformations, tests, documentation, and a 45-minute live presentation to their "data team" (two people). Monday morning I got a three-sentence rejection email. The subject line was "Update on your application." That was the update.&lt;/p&gt;

&lt;p&gt;I wish this were unusual. It isn't. This is the &lt;strong&gt;data engineering&lt;/strong&gt; hiring process in 2026, and it's broken in ways that should make every engineer angry.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Take-Home Ballooned While You Weren't Looking
&lt;/h2&gt;

&lt;p&gt;Here's how it used to work: a company sends you a &lt;strong&gt;take-home&lt;/strong&gt; assignment. Build a small ETL script, write some SQL, maybe model a couple tables. Two hours, tops. You submit it, they review it, you talk about it. Reasonable.&lt;/p&gt;

&lt;p&gt;Here's how it works now: a company sends you a take-home that says "should take 2-4 hours." You open the brief and find a multi-source ingestion problem, a data modeling exercise, a transformation layer, unit tests, documentation, a README explaining your design decisions, and oh yeah, a live presentation to the team next week. Recruiters claim 2-4 hours. Candidates actually spend 6-10 hours minimum. For roles they really want? 15-20 hours. I've seen reports of people spending entire weekends on a single application.&lt;/p&gt;

&lt;p&gt;62% of companies admit their take-homes are "too long." They keep using them anyway.&lt;/p&gt;

&lt;p&gt;The scope creep isn't accidental. It's structural. AI tools made short assignments trivial to complete, so companies stretched them longer to maintain "signal." Instead of fixing the format, they just demanded more of your time. The result is a 20-hour unpaid project that produces a deliverable artifact: a working pipeline, a documented data model, a presentation deck. That's not an assessment. That's &lt;strong&gt;consulting&lt;/strong&gt; work with zero compensation and a templated rejection email as your receipt.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If your take-home produces something the company could deploy, you're not interviewing. You're freelancing for free.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And the numbers back this up. 68% of companies now use take-home coding tests. Only 20% of candidates pass. That means 80% of the people who spend their weekends building pipelines for strangers get a canned "we've decided to move forward with other candidates" email. No feedback. No explanation. Just silence and a wasted weekend.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hiring Market Doesn't Care About Your Time
&lt;/h2&gt;

&lt;p&gt;Let's talk about why candidates put up with this. It's not complicated: &lt;strong&gt;100,443 tech workers&lt;/strong&gt; have been affected by layoffs in 2026. That's 837 job cuts per day through April. Q1 alone saw 52,050 tech-sector job cut announcements, up 40% from Q1 2025. Nearly half of those cuts were explicitly attributed to AI and automation.&lt;/p&gt;

&lt;p&gt;When you're eight weeks into a job search and the bills are piling up, you don't push back on a 20-hour take-home. You grind through it at 2am because the alternative is another month of nothing. Companies know this. Three forces collided to make extended unpaid trials feel acceptable to employers: application volume favors employers, AI tools favor employers, and a budget-tight year favors employers. The asymmetry isn't a bug; it's the business model.&lt;/p&gt;

&lt;p&gt;Entry-level &lt;strong&gt;hiring&lt;/strong&gt; fell 25% from 2023 to 2024 at the top 15 tech firms. 72% of tech leaders plan further reductions. Early-career engineers with less than three years of experience are the hardest hit. These are the same people being asked to prove themselves with the longest, most grueling take-homes. Fewer positions, longer assessments, and a candidate pool too desperate to say no.&lt;/p&gt;

&lt;p&gt;I've been on both sides of the &lt;strong&gt;interview&lt;/strong&gt; table enough times to know what this looks like from the inside. Hiring managers aren't sitting there rubbing their hands together plotting to steal your work. Most of them genuinely believe the take-home is a fair evaluation. They don't think about the fact that they're asking someone to do 15 hours of uncompensated labor with an 80% chance of a form rejection. They've never done the math on what that costs the candidate in aggregate across a job search with 10-15 active applications.&lt;/p&gt;

&lt;p&gt;I did somewhere around 20 interview loops in one search. If even half of those had included a 10-hour take-home, that's 100 hours of unpaid work. That's two and a half full work weeks. For one job search. At some point you have to ask: is this an evaluation or an extraction?&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Killed the Signal; Companies Doubled Down Anyway
&lt;/h2&gt;

&lt;p&gt;Here's the part that makes this whole thing absurd. The take-home format is collapsing under its own logic.&lt;/p&gt;

&lt;p&gt;AI-assisted cheating on take-home assignments surged from 15% to 35% in six months across nearly 20,000 interviews analyzed. One in three candidates completing take-homes is using AI assistance. And honestly? Good for them. If a company asks me to spend 20 hours on an unpaid project, I'm using every tool available. The assignment isn't measuring my engineering ability; it's measuring my tolerance for exploitation.&lt;/p&gt;

&lt;p&gt;But here's the loop that should concern everyone: AI makes short take-homes trivial, so companies make them longer. Longer take-homes produce more deliverable work product, which looks more like free consulting. Meanwhile, the "signal" companies wanted (can this person actually build things?) is completely destroyed because the output tells you nothing about whether the candidate wrote it, prompted it, or copy-pasted it.&lt;/p&gt;

&lt;p&gt;The rational response would be to abandon the format. Instead, companies are doing something worse: keeping the 15-hour take-home AND adding live coding rounds on top. You get the worst of both worlds. The total time burden for a single application now includes a recruiter screen, a take-home project, a live coding round, a system design round, and a behavioral panel. That's 25-30 hours per company. Multiply by the 5-10 active pipelines any serious job seeker maintains, and you're looking at a part-time job just applying for jobs.&lt;/p&gt;
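
&lt;p&gt;To put a number on "part-time job," here's a rough sketch of the multiplication using only the ranges quoted above. The 40-hour work week is my assumption, and the variable names are just for illustration.&lt;/p&gt;

```python
# Ranges quoted in the paragraph above; the 40-hour week is an assumption.
HOURS_PER_COMPANY = (25, 30)
ACTIVE_PIPELINES = (5, 10)
HOURS_PER_WORK_WEEK = 40

low = HOURS_PER_COMPANY[0] * ACTIVE_PIPELINES[0]
high = HOURS_PER_COMPANY[1] * ACTIVE_PIPELINES[1]

print(f"{low}-{high} hours per job search")
print(f"{low / HOURS_PER_WORK_WEEK:.1f}-{high / HOURS_PER_WORK_WEEK:.1f} full work weeks")
```

&lt;p&gt;Even the low end is more than three full work weeks of unpaid time.&lt;/p&gt;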

&lt;p&gt;In-person interview rounds increased from 24% in 2022 to 38% in 2025, driven entirely by AI cheating concerns. The take-home isn't going away; it's just getting a live coding bodyguard that doubles your time commitment.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Spot the Extraction
&lt;/h2&gt;

&lt;p&gt;Not every take-home is exploitative. Some are genuinely well-designed, 2-3 hour exercises that test real skills. Problems &lt;a href="https://www.datadriven.io/problems/two_hundred_million_redirects" rel="noopener noreferrer"&gt;like this one&lt;/a&gt; force you to reason about grain before you join, and that's a legitimate thing to evaluate. The difference between a fair assessment and free consulting is usually obvious if you know what to look for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scope test.&lt;/strong&gt; If the deliverable could ship as a real feature or inform a real business decision, it's consulting. A good take-home is abstract enough that the output has zero value to the company. The moment they ask you to use their actual data, model their actual domain, or solve their actual problem, you're doing their job for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The time test.&lt;/strong&gt; If the brief says "2-4 hours" but the requirements include ingestion, modeling, transformation, testing, documentation, and a presentation, they're either lying about the time estimate or delusional about what's involved. Either way, red flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The feedback test.&lt;/strong&gt; Ask upfront: "Will I receive detailed feedback on my submission regardless of the outcome?" If they can't commit to that, they're telling you the evaluation is one-directional. You give them 15 hours; they give you a three-sentence email. 65% of candidates never or rarely receive interview feedback. That's not an assessment process; it's a black hole.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The legal test.&lt;/strong&gt; Under the Fair Labor Standards Act, work that benefits an employer generally must be paid. A Nashville dental practice paid $50,000 in back wages after the DOL found that candidates had performed actual patient-facing work during unpaid trials. Take-homes that produce deployable work sit in the same legal gray area. Nobody's enforcing it yet, but the precedent exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Do Now
&lt;/h2&gt;

&lt;p&gt;I've stopped doing take-homes that exceed four hours of estimated work. Full stop. If a company sends me a brief that's clearly a weekend project, I respond with something like: "I'm happy to do a focused 2-3 hour exercise, or I'm happy to do a paid consulting engagement at my hourly rate. Which works for you?"&lt;/p&gt;

&lt;p&gt;Most companies ghost me after that. Some actually appreciate the directness. The ones that appreciate it are invariably better places to work.&lt;/p&gt;

&lt;p&gt;I've also started asking a question early in every process: "What does the full interview loop look like, and what's the estimated total time commitment?" If the answer exceeds 10 hours, I factor that into whether the role is worth pursuing. Not every opportunity justifies 20 hours of unpaid labor.&lt;/p&gt;

&lt;p&gt;The data engineering market is growing. Roles are projected to increase 18% in the coming years. This isn't a dying field; it's a field with a broken hiring process being exploited by companies that know candidates are desperate. The problem isn't scarcity of roles. It's scarcity of proper hiring infrastructure. Companies expanding rapidly are just copy-pasting broken assessment formats at speed.&lt;/p&gt;

&lt;p&gt;61% of job seekers have been ghosted after an interview. 80% won't reapply to companies that ghost them. Companies running 20-hour take-homes with no feedback aren't just burning individual candidates; they're torching their own recruiting pipeline.&lt;/p&gt;

&lt;p&gt;So here's my question for the hiring managers still running these loops: if 80% of your candidates fail the take-home, you ghost the failures, and AI has destroyed whatever signal you thought you were getting, what exactly are you paying for with all that candidate time? Because from where I'm sitting, it looks like the only thing you're selecting for is desperation.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Azure Lost 60% of DE Job Postings in One Year. Is Your Resume Wrong?</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:42:02 +0000</pubDate>
      <link>https://dev.to/datadriven/azure-lost-60-of-de-job-postings-in-one-year-is-your-resume-wrong-1k5i</link>
      <guid>https://dev.to/datadriven/azure-lost-60-of-de-job-postings-in-one-year-is-your-resume-wrong-1k5i</guid>
      <description>&lt;p&gt;Last year, I was reviewing resumes for a senior data engineering role on my team. Out of maybe 40 applicants, I'd estimate 30 of them led with Azure. Azure Data Factory, Azure Synapse, Azure Databricks, Azure everything. Made sense; that's where the jobs were. Fast forward twelve months and I'm looking at a market that's barely recognizable. If your resume looks like it did in 2025, you might be wondering why the phone stopped ringing.&lt;/p&gt;

&lt;p&gt;Here's the number: &lt;strong&gt;Azure&lt;/strong&gt; dropped from 75% of &lt;strong&gt;data engineering&lt;/strong&gt; job postings to 34% in a single year. Not a gradual decline. Not a rounding error. A 41-percentage-point collapse. And most working DEs I talk to haven't even registered it yet because they're heads-down maintaining pipelines, not refreshing job boards.&lt;/p&gt;

&lt;p&gt;This isn't a "the sky is falling" piece. Azure isn't dead. But if your &lt;strong&gt;resume&lt;/strong&gt; is a love letter to a single &lt;strong&gt;cloud&lt;/strong&gt; ecosystem, you need to read this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Don't Lie (But They Do Need Context)
&lt;/h2&gt;

&lt;p&gt;Let's be precise about what happened. Azure went from appearing in three out of four DE job postings to appearing in one out of three. Meanwhile, AWS holds roughly 32% of the data engineering market share, with AWS certifications showing up in 4.2% of listings versus Azure's 3.6%. GCP sits at 1.2%, which sounds small until you realize GCP climbed to 13% overall cloud market share in 2025 and is growing faster in percentage terms than anyone else.&lt;/p&gt;

&lt;p&gt;But here's the part most people miss: the absolute number of Azure jobs didn't evaporate. There are still 12,000+ Azure data engineer roles on LinkedIn right now, and around 42,000 Azure cloud engineer postings globally. The problem isn't that Azure jobs disappeared. The problem is that AWS and Databricks jobs multiplied so fast that Azure's share got swallowed. It's a market-share collapse, not a job-count collapse.&lt;/p&gt;

&lt;p&gt;Why does that distinction matter? Because it changes the advice. If Azure jobs were vanishing, you'd need to panic-retrain. Since they're being outpaced, you need to broaden. Different problem, different solution.&lt;/p&gt;

&lt;p&gt;The cause is partly architectural and partly Microsoft's own doing. Azure Synapse has entered maintenance mode. Microsoft launched a new Azure Databricks Data Engineer certification in March 2026. Read between the lines: Microsoft is conceding the traditional data warehouse space and pivoting toward platform-agnostic tooling. When the platform vendor itself is telling you to learn Databricks, maybe listen.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Resume Trap: How Single-Cloud Profiles Get Filtered
&lt;/h2&gt;

&lt;p&gt;I've been on hiring panels where we passed on strong candidates for the dumbest reasons. But this one isn't dumb; it's mechanical. When a recruiter sources candidates, they're typing "AWS + Airflow" or "GCP + Dataflow" into their search tools. If your profile says "Azure Data Factory + Azure Synapse + Azure Purview" and nothing else, you're invisible before any human ever reads your name.&lt;/p&gt;

&lt;p&gt;This isn't an ATS conspiracy theory. The folklore about 75% of resumes getting auto-rejected is largely unverified; a 2025 study of 25 US recruiters across 10+ ATS platforms found that 92% don't configure auto-rejection rules based on content. The real killer is simpler: keyword misalignment and recruiter search queries. You're not being rejected. You're not being found.&lt;/p&gt;

&lt;p&gt;Here's what makes it worse. 51% of resumes score below 50 out of 100 on ATS assessment before any optimization, mostly because the keywords don't match the job description. If the job says "AWS, Spark, Airflow" and your resume says "Azure, Synapse, Data Factory," you're speaking a different language even though the underlying concepts are identical.&lt;/p&gt;

&lt;p&gt;And look, I get it. You spent three years building production pipelines on Azure. That's real work. You shipped real things. Stop discounting that. But your resume isn't a list of tools. It's evidence that you solve problems that matter to the business. The shift isn't about abandoning what you know; it's about framing it in the language the market is actually searching for.&lt;/p&gt;

&lt;p&gt;Multi-cloud specialists are commanding an 18 to 25% salary premium right now. That's not a suggestion; that's a price signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who's Actually Winning (It's Not Who You Think)
&lt;/h2&gt;

&lt;p&gt;AWS has the volume with roughly 55,000 active cloud engineer postings globally. GCP has the momentum in AI and ML workloads. But the real winners? &lt;strong&gt;Snowflake&lt;/strong&gt; and &lt;strong&gt;Databricks&lt;/strong&gt;. Snowflake skills jumped 10 percentage points from 2025 to 2026. Databricks appears in 16.8% of postings. Apache Spark sits at 38.7%.&lt;/p&gt;

&lt;p&gt;These are platform-agnostic tools. They run on all three clouds. And that's the point.&lt;/p&gt;

&lt;p&gt;The industry isn't consolidating around a cloud provider. It's consolidating around tools that work everywhere. Delta Lake, Iceberg, and Hudi don't care whether your underlying infrastructure is AWS, Azure, or GCP. Airflow appears in 732+ job listings on Indeed alone, not because it's the best orchestrator, but because it's mature and cloud-portable. The hiring gravity around Airflow signals that risk-averse enterprises are hiring for known-good tooling, not innovation.&lt;/p&gt;

&lt;p&gt;The practical implication: if you're an Azure engineer who knows Databricks, you're already 80% cloud-portable. The Delta Lake format is identical across Azure Databricks, AWS Databricks, and GCP Databricks. You're closer than you think.&lt;/p&gt;

&lt;p&gt;GCP deserves a mention here because it's playing a different game entirely. Smallest absolute share, but the fastest growth rate, and it's repositioning as the data and ML specialist cloud. If you're looking at where the &lt;strong&gt;career&lt;/strong&gt; trajectory leads in five years, GCP's AI infrastructure bet is worth watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Interviewers Are Testing Now (That They Weren't Last Year)
&lt;/h2&gt;

&lt;p&gt;The interview loop has gotten longer and weirder. We're now at 5 to 7 rounds for senior DE roles: recruiter screen, live SQL and Python coding, a take-home, then 4 to 5 onsites covering data modeling, system design, and behavioral. Enterprise hiring timelines have stretched to 60 to 90 days. I once did eight rounds at a single company, was told I passed, was told the offer was sent, the offer was never sent, then a new recruiter said I'd declined the offer I never saw, then I did four more rounds, passed again, and the headcount was closed. The process is not designed for candidates.&lt;/p&gt;

&lt;p&gt;But beyond the structural insanity, the content has shifted. Three things I'm seeing in 2026 that barely existed in 2024 loops:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost optimization is now a hiring separator.&lt;/strong&gt; Interviewers are asking candidates to optimize pipelines for cost, not just correctness. "How would you reduce the monthly spend on this pipeline by 40%?" If you've never thought about FinOps, start. The economics argument always wins; storage costs 2 cents per GB per month, but engineer time costs $75 per hour. Know when to optimize and when to throw money at the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch-versus-streaming is a false binary.&lt;/strong&gt; You're no longer asked to design one or the other. You're asked to design both: lakehouse architectures with Kappa principles, Delta Lake or Iceberg as the storage format, streaming for operational use cases and batch for regulatory reporting. The free interactive SQL, Python, Data Modeling, and Pipeline Architecture practice at &lt;a href="https://www.datadriven.io" rel="noopener noreferrer"&gt;datadriven.io&lt;/a&gt; tags every problem by pattern, which is useful when you need to practice the architectural thinking these loops are actually testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance and data quality are no longer afterthoughts.&lt;/strong&gt; 26% of data engineering job postings no longer mention education requirements, signaling a shift toward demonstrable skills. But what skills? Not just "can you write a DAG." Companies want engineers who can explain decisions, detect drift, and document compliance. Insurance, FinTech, and healthcare are hiring governance specialists faster than traditional pipeline engineers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Junior engineers worry about which tool to learn. Senior engineers worry about which problems to solve. Staff engineers worry about which problems to prevent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to Reposition Without Starting Over
&lt;/h2&gt;

&lt;p&gt;If you're sitting on an Azure-heavy resume right now, here's the play. And it's not "go get three new certifications."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reframe, don't rebuild.&lt;/strong&gt; Azure Data Factory is an orchestrator. Airflow is an orchestrator. Synapse is a warehouse. Snowflake is a warehouse. The concepts transfer; the syntax is the easy part. Your resume should lead with what you did (migrated 400 tables, built the pipeline finance depends on for board decks, reduced pipeline failures by 60%) and list tools second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add AWS or GCP keywords, but only if they're real.&lt;/strong&gt; Spin up a personal project on AWS. Build one pipeline. Use S3, Glue or Athena, and Airflow. That's enough to honestly list it. Don't lie; do reps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Target verticals where Azure still dominates.&lt;/strong&gt; Healthcare and financial services still prefer Azure due to compliance ecosystems and Microsoft's enterprise relationships. An Azure engineer targeting regulated industries faces less competition than the headline numbers suggest. The crash is real in aggregate but uneven by vertical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest in platform-agnostic skills.&lt;/strong&gt; Spark, SQL, Python, data modeling, Airflow, dbt. These appear in job postings regardless of cloud. 65% of hiring managers say it's harder to find skilled data engineers than a year ago. The scarcity isn't in single tools; it's in systems thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop treating the interview as the job.&lt;/strong&gt; Interviewing is a separate skill. The market wants architectural fluency across tools that change quarterly. Prep for pipeline architecture, not system design. DEs don't care about load balancers and reverse proxies.&lt;/p&gt;

&lt;p&gt;Average DE salaries compressed from $153k to $133k between 2025 and 2026, but experienced engineers are still commanding $200k+ total comp. The money is there. The question is whether your profile is positioned to capture it.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The Azure shift is real, it's significant, and it's worth adjusting for. But the engineers who thrive aren't the ones who pick the right cloud. They're the ones who understand that clouds are interchangeable and problems are forever.&lt;/p&gt;

&lt;p&gt;So: how many of you are sitting on an Azure-heavy resume right now, and what's your plan?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>interview</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data Engineers Don't Need DSA. So Why Do Interviews Still Test It?</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Mon, 27 Apr 2026 02:20:24 +0000</pubDate>
      <link>https://dev.to/datadriven/data-engineers-dont-need-dsa-so-why-do-interviews-still-test-it-bof</link>
      <guid>https://dev.to/datadriven/data-engineers-dont-need-dsa-so-why-do-interviews-still-test-it-bof</guid>
      <description>&lt;p&gt;I did somewhere around 20 interview loops during my last job search. Phone screens, take-homes, onsites, "culture chats" that were secretly technical screens. At one company I did eight rounds, was told I passed, was told the offer was sent, it was never sent, then a new recruiter said I'd declined the offer I never saw. I did four more rounds. Passed again. Headcount was closed.&lt;/p&gt;

&lt;p&gt;Through all of that, you know what never once came up on the actual job? Inverting a binary tree.&lt;/p&gt;

&lt;p&gt;This debate isn't new. Every six months, a Reddit thread blows up with senior &lt;strong&gt;data engineering&lt;/strong&gt; folks asking why they're being tested on dynamic programming when their actual job is debugging why a pipeline silently dropped 2M rows last Tuesday. But in 2026, with 80,000 tech layoffs in Q1 alone and companies banning AI tools in interviews while AI reshapes the job itself, the question isn't theoretical anymore. It's a breaking point.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DSA Debate That Won't Die
&lt;/h2&gt;

&lt;p&gt;Here's why this fight resurfaces like clockwork: nobody agrees on what a data engineer even is.&lt;/p&gt;

&lt;p&gt;I'm not being glib. The title means something completely different at Google than it does at a Series B startup than it does at a Fortune 500 retailer. When companies can't define the role, they fall back on the only standardized proxy they have: &lt;strong&gt;LeetCode&lt;/strong&gt;-style algorithm problems. Binary trees. Graph traversal. Backtracking. Problems that software engineers have been grinding for a decade, repurposed wholesale for a role that shares maybe 30% of the same skill set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DSA&lt;/strong&gt; is a mechanism to rank candidates, not an indicator of data engineering experience. I've said this before and I'll keep saying it. Accept it for the arbitrary IQ measuring stick that it is.&lt;/p&gt;

&lt;p&gt;But here's the thing that's changed. The market broke.&lt;/p&gt;

&lt;p&gt;80,000 people lost their jobs in the first quarter of 2026. Nearly half of those cuts were attributed to AI and automation. The people flooding the &lt;strong&gt;interview&lt;/strong&gt; pipeline aren't junior developers taking their first shot; they're experienced engineers with production systems under their belt, competing for fewer roles, facing 5 to 7 round loops that stretch 60 to 90 days. Karat's data across 600,000+ technical interviews confirms this is the norm now, not the exception.&lt;/p&gt;

&lt;p&gt;And fewer than 30% of companies have updated their assessment systems to reflect what data engineering actually requires.&lt;/p&gt;

&lt;p&gt;Let that sit. Seven out of ten companies are screening data engineers the same way they screened them in 2022. The tools changed. The job changed. AI changed everything about how we write and review code. The interview didn't change.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Senior DEs aren't walking away from interview loops because they can't code. They're walking away because the cost/benefit calculation broke. Twenty hours of prep for problems that have zero correlation with the job, in a market where the roles might disappear before the loop finishes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What DSA Actually Measures (and What It Doesn't)
&lt;/h2&gt;

&lt;p&gt;Let's be precise about this. LeetCode has 3,000+ problems. The vast majority test binary trees, dynamic programming, and graph algorithms. Skills that data engineers report using "never to rarely" in production.&lt;/p&gt;

&lt;p&gt;You know what I use daily? SQL window functions. CTEs. Deduplication logic. Understanding why a LEFT JOIN is silently inflating row counts because someone upstream changed a grain without telling anyone. Figuring out why a Spark job is spilling to disk. Debugging schema drift that broke a downstream dashboard the CFO reads every Monday.&lt;/p&gt;
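
&lt;p&gt;If the LEFT JOIN problem sounds abstract, here's a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;orders&lt;/code&gt; and &lt;code&gt;customers&lt;/code&gt; tables and their rows are made up for illustration; the point is that one extra row on the right side of the join quietly inflates both the row count and the total.&lt;/p&gt;

```python
import sqlite3

# Hypothetical tables: customers is supposed to have one row per customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 11, 50.0);
    -- Upstream grain change: customer 10 now has one row per region.
    INSERT INTO customers VALUES (10, 'EU'), (10, 'US'), (11, 'EU');
""")

order_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# The join raises no error; it just returns more rows than there are orders.
joined_rows, joined_total = conn.execute("""
    SELECT COUNT(*), SUM(o.amount)
    FROM orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
""").fetchone()

# A cheap grain check: the join key should be unique on the right side.
duplicate_keys = conn.execute("""
    SELECT customer_id FROM customers
    GROUP BY customer_id HAVING COUNT(*) != 1
""").fetchall()

print(order_rows, joined_rows, joined_total, duplicate_keys)
```

&lt;p&gt;Two orders come out as three joined rows, and revenue goes from 150 to 250 without a single error. The uniqueness check on the join key is the kind of guard that catches it before the dashboard does.&lt;/p&gt;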

&lt;p&gt;None of that is on LeetCode.&lt;/p&gt;

&lt;p&gt;An empirical study from interviewing.io found that LeetCode rating has no correlation with interview performance percentile. What does correlate? Problem volume solved. Which isn't a signal of capability; it's a signal of free time. That's a selection bias, not a predictor of job performance.&lt;/p&gt;

&lt;p&gt;SQL appears in 61% of data engineering job postings. Data modeling skills have 122,000+ open US roles. Cloud cost optimization is now a top-5 interview category at companies tying bonus incentives to infrastructure savings. Yet the screening gate for all of these roles is still "solve this medium in 25 minutes."&lt;/p&gt;

&lt;p&gt;I've been on hiring panels where we passed on strong candidates for the dumbest reasons. "They got the optimal solution but took too long." Meanwhile, the candidate who speed-ran the binary search problem couldn't explain what idempotency means or why you'd want it in a pipeline. We hired the fast one. That pipeline broke in production within a month.&lt;/p&gt;

&lt;p&gt;50+ companies (Airtable, Buffer, Calendly, CircleCI, and others) have moved away from LeetCode-style assessments entirely, replacing them with take-home projects, code reviews, and system design discussions. The signal is there. The industry just hasn't followed it at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI-Banning Irony
&lt;/h2&gt;

&lt;p&gt;This is the part that makes me want to punch a hole in the wall.&lt;/p&gt;

&lt;p&gt;62% of organizations prohibit AI use in technical interviews. At the same time, 76% of data engineering work is now enhanced by AI tools, delivering 25% productivity improvements on average. Companies are telling candidates: "Don't use the thing you'll use every single day if we hire you."&lt;/p&gt;

&lt;p&gt;It's like banning calculators from a math test in a world where every math job involves using calculators.&lt;/p&gt;

&lt;p&gt;And it gets better. Karat's data says over half of candidates use AI anyway, despite being told not to. No company has disclosed a scalable detection method beyond "watch their eyes" and "screen recording." The enforcement is theater.&lt;/p&gt;

&lt;p&gt;Anthropic, the company that built Claude, initially banned candidates from using AI in interviews. Then reversed the policy in July 2025. If the company most invested in AI's credibility can't figure out a coherent policy, what chance does your average enterprise hiring committee have?&lt;/p&gt;

&lt;p&gt;Meanwhile, Meta went the opposite direction and piloted AI-enabled interviews where Claude, GPT, and Gemini are built into the coding environment. Amazon explicitly bans all GenAI with disqualification as the penalty. Google brought back in-person rounds because remote assessments were too easy to game.&lt;/p&gt;

&lt;p&gt;There's no consensus. There's no stable equilibrium. There's just companies reacting quarter by quarter while candidates try to figure out which rules apply at which company.&lt;/p&gt;

&lt;p&gt;Here's the contrarian take nobody wants to hear: if an AI can spit out a clean solution to a medium LC problem, what does asking that problem actually tell you about the candidate? That they memorized something a machine produces on demand? The signal was already thin. Now it's basically noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Predicts Whether Someone Can Do This Job
&lt;/h2&gt;

&lt;p&gt;The actual job of data engineering is less "write a DAG" and more "figure out why finance's board deck had wrong numbers for three months and nobody noticed." It's debugging. It's data modeling. It's understanding the business well enough to catch when something looks wrong before stakeholders do.&lt;/p&gt;

&lt;p&gt;Karat's own data from 400 engineering leaders confirms the baseline assessment focus should be SQL proficiency, window functions, CTEs, and Python fundamentals. Not graph algorithms. Not dynamic programming.&lt;/p&gt;

&lt;p&gt;The companies getting this right are testing for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data modeling fluency.&lt;/strong&gt; Can you design a schema that won't collapse when requirements change? Can you explain why you'd keep each fact table at one consistent grain? This is the make-or-break round, and every practitioner knows it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline architecture.&lt;/strong&gt; Not system design in the SWE sense (I don't care about load balancers and reverse proxies). Can you design an ETL pipeline that handles late-arriving data, schema evolution, and failure recovery?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost reasoning.&lt;/strong&gt; Cloud cost optimization is now a top interview category. Can you explain why denormalizing that table saves $40K/year in compute even though it costs $200/year in storage? The economics argument wins every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident debugging.&lt;/strong&gt; What broke, why, and how do you make sure it never happens again? This is 60% of the actual job and maybe 5% of interview loops.&lt;/li&gt;
&lt;/ul&gt;
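
&lt;p&gt;The cost-reasoning bullet comes down to arithmetic you should be able to do out loud. A rough sketch, using the figures from that bullet; the $75 hourly engineering rate is an assumption added for illustration, not a number from any particular company.&lt;/p&gt;

```python
# Figures from the cost-reasoning bullet above.
compute_savings_per_year = 40_000  # USD saved in compute by denormalizing
extra_storage_per_year = 200       # USD added in storage by the wider table

# Assumption for illustration only: a flat hourly cost for engineer time.
engineer_rate_per_hour = 75

net_savings_per_year = compute_savings_per_year - extra_storage_per_year
savings_to_cost_ratio = compute_savings_per_year / extra_storage_per_year

# Engineer hours the change can consume before it stops paying off in year one.
break_even_hours = net_savings_per_year / engineer_rate_per_hour

print(net_savings_per_year, savings_to_cost_ratio, round(break_even_hours))
```

&lt;p&gt;Under those assumptions the change pays for itself in year one as long as it takes fewer than roughly 530 engineer hours.&lt;/p&gt;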

&lt;p&gt;35% year-over-year growth in data engineering demand tells you the &lt;strong&gt;career&lt;/strong&gt; isn't going anywhere. 2.9 million data-related roles remain open globally. The role is healthy. The hiring process is sick.&lt;/p&gt;

&lt;h2&gt;
  
  
  Play the Game, But Name It
&lt;/h2&gt;

&lt;p&gt;I'm not going to sit here and tell you to boycott DSA prep. That's bad advice from people who already have jobs. The game is the game. If you're interviewing at companies that screen on LeetCode, you grind LeetCode. Stick to mediums; do 50 and you'll be solid. Few companies ask hards consistently.&lt;/p&gt;

&lt;p&gt;But let's stop pretending this process is meritocratic. It's not. It's standardized and defensible, which is what legal departments and risk-averse hiring committees want. It has almost nothing to do with predicting whether you'll be good at maintaining the pipeline that finance depends on for board decks.&lt;/p&gt;

&lt;p&gt;Interviewing is a skill. It's separate from the actual job. Treat prep like a job. I'ma be super honest: I have a degree from a degree mill and don't feel particularly "skilled." Just a grind.&lt;/p&gt;

&lt;p&gt;The real fix isn't going to come from candidates complaining on Reddit. It's going to come from companies losing great engineers because those engineers did the math. Twenty hours of algorithm prep for a role where you'll never touch an algorithm, in a market where you might get ghosted anyway, while simultaneously being told you can't use the AI tools that define modern engineering work. At some point, the experienced people just stop showing up for that loop.&lt;/p&gt;

&lt;p&gt;Some companies have figured this out. The 50+ that ditched LeetCode. The ones testing pipeline architecture, data modeling, and cost optimization. They're getting better candidates because they're filtering for the right signal.&lt;/p&gt;

&lt;p&gt;The rest are going to keep wondering why their data pipelines break and their senior engineers leave.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal.&lt;/p&gt;

&lt;p&gt;The interview process should test for those eternal problems. Not for whether you memorized the optimal solution to "Minimum Window Substring."&lt;/p&gt;

&lt;p&gt;What's the worst interview loop you've been through, and did the questions have anything to do with the actual job?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>sql</category>
    </item>
    <item>
      <title>The 6 Python Data Engineering Interview Questions You Will Actually Be Asked in 2026</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:31:19 +0000</pubDate>
      <link>https://dev.to/datadriven/the-6-python-data-engineering-interview-questions-you-will-actually-be-asked-in-2026-1o14</link>
      <guid>https://dev.to/datadriven/the-6-python-data-engineering-interview-questions-you-will-actually-be-asked-in-2026-1o14</guid>
      <description>&lt;p&gt;Every data engineer preparing for interviews hits the same confused moment. You search for python interview questions, get a list of reverse-a-linked-list and two-sum problems, grind them for two weeks, walk into your first data engineering loop, and get asked to deduplicate a 10 million row event stream while preserving the latest record per composite key.&lt;/p&gt;

&lt;p&gt;None of those LeetCode problems prepared you for that question. And the Python round in a data engineering interview is not going to get easier until you realize the questions are a different species from the ones that show up in a software engineering loop.&lt;/p&gt;

&lt;p&gt;I have run over 250 interview loops at Google, Meta, LinkedIn, and Netflix. The Python portion of a data engineering loop does not look like a Python backend or frontend loop. It looks like a pipeline-correctness loop wearing a Python costume. Below is the full taxonomy of the six questions that actually get asked, how they differ from the SWE canon, and which patterns you have to internalize to pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Difference In One Sentence
&lt;/h2&gt;

&lt;p&gt;The SWE Python round tests whether you can write correct code on data that fits in memory. The data engineering Python round tests whether you can reason about data correctness, grain, idempotency, and scale on data that usually does not.&lt;/p&gt;

&lt;p&gt;That is the entire gap. Every other difference flows from it.&lt;/p&gt;

&lt;p&gt;A SWE Python problem gives you a list of integers and asks you to do something clever with it. The list has 10 elements. The test cases have 10 elements. The expected behavior is obvious. The skill tested is algorithms.&lt;/p&gt;

&lt;p&gt;A data engineering Python problem gives you an iterator over events that might have 10 million elements, might have duplicates from retries, might have late-arriving data, might have schema drift between rows, and asks you to produce a deduplicated, ordered, grouped output without loading it all into memory. The test cases will have 5 elements. The skill tested is production instinct.&lt;/p&gt;

&lt;p&gt;Candidates who prepped for SWE Python walk in confident and freeze the moment the input becomes an iterator instead of a list.&lt;/p&gt;
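
&lt;p&gt;A minimal sketch of why that switch hurts, assuming a hypothetical &lt;code&gt;event_stream&lt;/code&gt; generator standing in for whatever source the interviewer hands you. A generator has no length and can be consumed exactly once, so code written against a list quietly breaks on it.&lt;/p&gt;

```python
def event_stream(n):
    """Hypothetical stand-in for an event source; yields one dict at a time."""
    for i in range(n):
        yield {"user_id": i % 3, "event_type": "click", "ts": 1000 + i}


events = event_stream(5)

# An iterator has no len(), so this raises TypeError.
try:
    len(events)
    has_len = True
except TypeError:
    has_len = False

# The first pass consumes every element...
first_pass = sum(1 for _ in events)
# ...so a second pass over the same iterator sees nothing.
second_pass = sum(1 for _ in events)

print(has_len, first_pass, second_pass)
```

&lt;p&gt;Any solution that needs two passes, an index, or a length has to say where that second copy of the data lives.&lt;/p&gt;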

&lt;h2&gt;
  
  
  Question 1: Streaming Aggregation Over an Iterator
&lt;/h2&gt;

&lt;p&gt;This is the single most common Python question I have given and received in a data engineering loop across four FAANG companies.&lt;/p&gt;

&lt;p&gt;Setup: you are handed an iterator that yields event dictionaries one at a time. Each event has &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;event_type&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;, and a few other fields. Compute the count of each event type per user without loading the full iterator into a list.&lt;/p&gt;

&lt;p&gt;A SWE candidate types this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;events_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That solution works on the 5 test events the interviewer gave you. It also kills the pipeline when a real day's worth of events arrives. The &lt;code&gt;list(events)&lt;/code&gt; call materializes the whole iterator. In production that is 40 GB of dictionaries in memory for no reason.&lt;/p&gt;

&lt;p&gt;The data engineering answer never materializes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same logic, different relationship with memory. An interviewer running this round is watching for whether you call &lt;code&gt;list()&lt;/code&gt; on an iterator. If you do, you have told them you think like a SWE, not a data engineer. Half the battle in Python data engineering interviews is showing you know the difference between an iterator and a list and that you default to iterators.&lt;/p&gt;

&lt;p&gt;The follow-up is always: what if the iterator is too large to even hold the counts dict in memory? Now you are in sketch territory (Count-Min Sketch for approximate per-key counts, HyperLogLog for approximate distinct counts) or you partition by a hash of the key and count each partition separately. If you have never heard of those, the senior bar just evaporated.&lt;/p&gt;
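
&lt;p&gt;For the approximate route, here is a rough Count-Min Sketch in plain Python. It is a sketch under assumptions, not a production implementation: the &lt;code&gt;CountMinSketch&lt;/code&gt; class, its &lt;code&gt;width&lt;/code&gt; and &lt;code&gt;depth&lt;/code&gt; defaults, and the choice of &lt;code&gt;hashlib.blake2b&lt;/code&gt; as the hash are all illustrative. Memory stays fixed no matter how many distinct keys arrive, at the cost of possible overestimates and of not being able to list the keys afterwards.&lt;/p&gt;

```python
import hashlib


class CountMinSketch:
    """Approximate counter: fixed memory, may overestimate, never underestimates."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One cell per row, chosen by a hash of the key salted with the row index.
        data = repr(key).encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions only inflate cells, so the smallest cell is the best estimate.
        return min(self.table[row][col] for row, col in self._cells(key))


def count_events_approx(events, sketch=None):
    """Single pass over the iterator; nothing is materialized."""
    if sketch is None:
        sketch = CountMinSketch()
    for e in events:
        sketch.add((e["user_id"], e["event_type"]))
    return sketch


sample = iter([
    {"user_id": 1, "event_type": "click"},
    {"user_id": 1, "event_type": "click"},
    {"user_id": 1, "event_type": "click"},
    {"user_id": 2, "event_type": "view"},
])
sketch = count_events_approx(sample)
clicks = sketch.estimate((1, "click"))
```

&lt;p&gt;The estimate is never below the true count and never above the total number of events added, which is the guarantee worth stating out loud in the interview.&lt;/p&gt;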

&lt;h2&gt;
  
  
  Question 2: Deduplication With a Tiebreaker
&lt;/h2&gt;

&lt;p&gt;Every data engineering loop has a dedup question. The SWE version is "remove duplicates from a list." The DE version is "here is an event stream with retries. Each event has an &lt;code&gt;event_id&lt;/code&gt;, an &lt;code&gt;ingested_at&lt;/code&gt;, and an &lt;code&gt;updated_by&lt;/code&gt; that is sometimes null. Keep one row per &lt;code&gt;event_id&lt;/code&gt;, preferring the latest &lt;code&gt;ingested_at&lt;/code&gt;, breaking ties by preferring a non-null &lt;code&gt;updated_by&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;A SWE candidate reaches for a set. Sets do not have tiebreakers. Sets discard information you need.&lt;/p&gt;

&lt;p&gt;The DE answer iterates, keeps a dict keyed by &lt;code&gt;event_id&lt;/code&gt;, and compares against the current best:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What interviewers are testing here is not the algorithm. The algorithm is trivial. They are testing whether you ask "what defines a duplicate" before typing, whether you handle the null case in the tiebreaker explicitly, and whether you return an iterator-compatible view instead of a list.&lt;/p&gt;

&lt;p&gt;The follow-up is always: what if the input already arrives grouped by &lt;code&gt;event_id&lt;/code&gt; and sorted by &lt;code&gt;ingested_at&lt;/code&gt; within each group? Can you do this in constant additional memory? That is where a streaming groupby pattern comes in, and if you can sketch it on the fly you are clearing a senior bar.&lt;/p&gt;
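
&lt;p&gt;One way that streaming groupby could look, using &lt;code&gt;itertools.groupby&lt;/code&gt;. Note the assumption: all rows for a given &lt;code&gt;event_id&lt;/code&gt; must be adjacent in the input (for example, sorted by &lt;code&gt;event_id&lt;/code&gt;). Sorting by &lt;code&gt;ingested_at&lt;/code&gt; alone would not be enough. The function name &lt;code&gt;dedupe_sorted&lt;/code&gt; is made up for this example.&lt;/p&gt;

```python
from itertools import groupby
from operator import itemgetter


def dedupe_sorted(events):
    """Yield one winner per event_id, assuming rows for each event_id are contiguous."""
    for _, group in groupby(events, key=itemgetter("event_id")):
        # Latest ingested_at wins; on a tie, a non-null updated_by wins.
        # max() walks the group lazily, so only the current best row is held.
        yield max(
            group,
            key=lambda e: (e["ingested_at"], e["updated_by"] is not None),
        )


rows = [
    {"event_id": "a", "ingested_at": 1, "updated_by": None},
    {"event_id": "a", "ingested_at": 2, "updated_by": None},
    {"event_id": "a", "ingested_at": 2, "updated_by": "svc"},
    {"event_id": "b", "ingested_at": 5, "updated_by": None},
]
winners = list(dedupe_sorted(iter(rows)))
```

&lt;p&gt;Because the function is a generator, the output is also streamed: nothing is buffered beyond the one row currently winning its group.&lt;/p&gt;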

&lt;h2&gt;
  
  
  Question 3: Schema-Tolerant Parsing
&lt;/h2&gt;

&lt;p&gt;This is the one SWE prep completely ignores and DE interviews lean on heavily.&lt;/p&gt;

&lt;p&gt;Setup: you are given a list of dictionaries representing events from a log. Some events are missing fields. Some have extra fields. Some have the right field names but wrong types. Write a function that produces a clean, typed output and a quarantine list for rows that cannot be parsed.&lt;/p&gt;

&lt;p&gt;SWE Python does not train this muscle. The LeetCode problem gives you a clean input every time. The DE interviewer is watching whether you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validate required fields explicitly before touching them&lt;/li&gt;
&lt;li&gt;Cast types with explicit &lt;code&gt;try/except&lt;/code&gt; around each cast&lt;/li&gt;
&lt;li&gt;Never let one bad row kill the whole batch&lt;/li&gt;
&lt;li&gt;Separate "valid" from "invalid" without discarding the invalid rows silently&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A reasonable answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarantine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing required field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;quarantine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarantine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trap interviewers plant is a row where &lt;code&gt;user_id&lt;/code&gt; is the string &lt;code&gt;"null"&lt;/code&gt; instead of the Python &lt;code&gt;None&lt;/code&gt;. &lt;code&gt;int("null")&lt;/code&gt; raises &lt;code&gt;ValueError&lt;/code&gt;. A candidate who wraps the whole loop in one big try/except, instead of wrapping each row, loses that bad row and every row after it, which is a pipeline bug.&lt;/p&gt;

&lt;p&gt;If you have never written parsing code in production, the instinct to quarantine bad rows instead of crashing on them is foreign. It is the single most senior-signaling habit in a DE Python interview.&lt;/p&gt;
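
&lt;p&gt;If the interviewer pushes on "a try/except around each cast," one possible variation is to drive the casts from a small schema table so the quarantine record names every field that failed. The &lt;code&gt;SCHEMA&lt;/code&gt; list and the function names below are assumptions made for the sake of the example.&lt;/p&gt;

```python
# Hypothetical schema: (field name, cast function, required?)
SCHEMA = [
    ("user_id", int, True),
    ("ts", int, True),
    ("amount", float, False),
]


def parse_row(row):
    """Return (parsed, errors); errors lists every field that failed."""
    parsed, errors = {}, []
    for name, cast, required in SCHEMA:
        if name not in row:
            if required:
                errors.append(f"{name}: missing")
            else:
                parsed[name] = None
            continue
        try:
            parsed[name] = cast(row[name])
        except (ValueError, TypeError) as err:
            errors.append(f"{name}: {err}")
    return parsed, errors


def parse_events_per_field(rows):
    """One bad row is quarantined; the rows after it are still processed."""
    valid, quarantine = [], []
    for row in rows:
        parsed, errors = parse_row(row)
        if errors:
            quarantine.append({"row": row, "errors": errors})
        else:
            valid.append(parsed)
    return valid, quarantine


raw = [
    {"user_id": "1", "ts": "10"},
    {"user_id": "null", "ts": "11"},  # the planted trap
    {"user_id": "3", "ts": 12, "amount": "4.5"},
]
valid, quarantine = parse_events_per_field(raw)
```

&lt;p&gt;The trap row lands in &lt;code&gt;quarantine&lt;/code&gt; with the failing field named, and the row after it still comes out clean.&lt;/p&gt;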

&lt;h2&gt;
  
  
  Question 4: Window and Session Logic Without SQL
&lt;/h2&gt;

&lt;p&gt;When the Python round asks a question that maps to a SQL window function, the SQL-only candidates freeze. The prompt sounds like: "Here is a sorted list of events per user. Group them into sessions where a session ends after 30 minutes of inactivity. Return a list of sessions with &lt;code&gt;start_ts&lt;/code&gt;, &lt;code&gt;end_ts&lt;/code&gt;, and &lt;code&gt;event_count&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;This is the sessionization pattern the SQL round tests, ported to Python. The interviewer wants to see whether you can implement in Python what you would write as a &lt;code&gt;LAG&lt;/code&gt; + running &lt;code&gt;SUM&lt;/code&gt; in SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sessionize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gap_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1800&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;gap_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SWE candidate tries to use a library. The DE candidate writes the rolling-state loop. This is the clearest example of DE Python being closer to state-machine code than to algorithmic code.&lt;/p&gt;

&lt;p&gt;Follow-ups are always about edge cases. What if two events share a timestamp? What if the input is not sorted? What if the iterator is chunked across hour boundaries and you have to support resuming? Each one is a real pipeline concern, not an algorithmic concern.&lt;/p&gt;
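
&lt;p&gt;For the resuming follow-up, one possible shape is to hand the still-open session back to the caller and accept it again on the next chunk. This is only a sketch: &lt;code&gt;sessionize_chunk&lt;/code&gt; and its &lt;code&gt;state&lt;/code&gt; argument are hypothetical names, and it assumes each chunk is sorted by &lt;code&gt;ts&lt;/code&gt; and belongs to a single user.&lt;/p&gt;

```python
def sessionize_chunk(events, state=None, gap_seconds=1800):
    """Process one chunk of sorted events for a single user.

    Returns (closed_sessions, state). state is the still-open session,
    or None, and should be passed into the call for the next chunk.
    """
    closed = []
    # Copy so the caller's saved state is not mutated if this chunk is retried.
    current = dict(state) if state is not None else None
    for e in events:
        if current is not None and e["ts"] - current["end_ts"] > gap_seconds:
            closed.append(current)
            current = None
        if current is None:
            current = {"start_ts": e["ts"], "end_ts": e["ts"], "event_count": 1}
        else:
            current["end_ts"] = e["ts"]
            current["event_count"] += 1
    return closed, current


# Two chunks split in the middle of a session.
closed_1, open_1 = sessionize_chunk([{"ts": 0}, {"ts": 600}])
closed_2, open_2 = sessionize_chunk([{"ts": 1200}, {"ts": 5000}], state=open_1)
```

&lt;p&gt;The session that straddles the chunk boundary comes out once, in &lt;code&gt;closed_2&lt;/code&gt;, with all three of its events counted. Events that share a timestamp have a gap of zero and stay in the same session.&lt;/p&gt;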

&lt;h2&gt;
  
  
  Question 5: Backfill-Safe Incremental Logic
&lt;/h2&gt;

&lt;p&gt;This is the question that most distinguishes senior data engineers from mid-level candidates.&lt;/p&gt;

&lt;p&gt;Setup: you have a function that processes yesterday's events. Rewrite it so that running it on today's date, on a date from last week, or on a date range over the last month all produce correct output without double-counting or dropping data.&lt;/p&gt;

&lt;p&gt;The SWE candidate does not realize they were asked a design question. They write a function that filters by date and returns aggregates. The DE candidate writes code that is idempotent, deterministic on input date range, and safe against partial re-runs.&lt;/p&gt;

&lt;p&gt;The moves interviewers are watching for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter on &lt;code&gt;ingested_at&lt;/code&gt; (processing time), not on &lt;code&gt;event_ts&lt;/code&gt; (event time), when you want to catch late data&lt;/li&gt;
&lt;li&gt;Produce output keyed by (partition, primary_key) so re-running overwrites instead of appends&lt;/li&gt;
&lt;li&gt;Take the date range as an argument, not as &lt;code&gt;datetime.now()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Emit a result that is the same shape whether you process one day or thirty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events_iter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interviewer's follow-up is "what happens if this fails halfway through?" If you do not immediately say "the same inputs produce the same outputs, so re-running it is safe," you have not understood why you were asked this question. Idempotency is the senior signal.&lt;/p&gt;
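&lt;p&gt;Here is a minimal sketch of what backs that claim up, assuming the aggregated rows for one date land in a single JSON file per partition. The &lt;code&gt;write_partition&lt;/code&gt; helper and the &lt;code&gt;date=...&lt;/code&gt; file layout are hypothetical, not taken from any particular framework. The point is that each run overwrites its partition in one atomic swap, so a crash leaves either the old file or the new one, never a half-written mix.&lt;/p&gt;

```python
import json
import os
import tempfile


def write_partition(out_dir, date, rows):
    """Overwrite the output for one date so re-runs never append duplicates."""
    os.makedirs(out_dir, exist_ok=True)
    final_path = os.path.join(out_dir, f"date={date}.json")
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            # Sort so the same input always produces byte-identical output.
            json.dump(sorted(rows, key=lambda r: r["user_id"]), f)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # A failed run removes its temp file and leaves the old output alone.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

&lt;p&gt;Run it twice with the same rows and the partition file is identical. That is the property the interviewer is fishing for: overwrite by partition, never append.&lt;/p&gt;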

&lt;h2&gt;
  
  
  Question 6: PySpark DataFrame Logic
&lt;/h2&gt;

&lt;p&gt;If the role is Spark-heavy, and many data engineering roles in 2026 are, the Python portion of the interview is really a PySpark portion. The questions are the same five patterns above, expressed in DataFrame API calls instead of pure Python.&lt;/p&gt;

&lt;p&gt;The specific patterns that recur:&lt;/p&gt;

&lt;p&gt;A join on multiple columns, phrased as "here are two DataFrames, produce one row per customer with their latest order and their primary payment method." Candidates who do not know the &lt;code&gt;join(other, on=["customer_id", "region"], how="left")&lt;/code&gt; syntax lose time to syntax, not logic.&lt;/p&gt;

&lt;p&gt;A window function in PySpark. Same sessionization prompt, but now you write &lt;code&gt;Window.partitionBy("user_id").orderBy("ts")&lt;/code&gt; and use &lt;code&gt;F.lag("ts").over(w)&lt;/code&gt; to compute gaps. This is the direct translation of the SQL pattern and the pure-Python pattern, and interviewers love it because it tests whether you have touched all three.&lt;/p&gt;

&lt;p&gt;A broadcast-join decision. Interviewers describe two tables of wildly different sizes and ask how you would join them. If you say "regular join," you fail. If you say "broadcast the small one," you pass the first level. If you say "broadcast the small one, but only if it fits in the executor memory budget, otherwise repartition and sort-merge," you pass the senior level.&lt;/p&gt;

&lt;p&gt;A partitioning and skew question. A DataFrame has 10 million rows, 90% of which share the same &lt;code&gt;user_id&lt;/code&gt; (a bot). The interviewer asks what happens when you &lt;code&gt;groupBy("user_id")&lt;/code&gt;. The answer involves salting, two-stage aggregation, or adaptive query execution. This is not a SWE question. It is a pipeline-performance question dressed as code.&lt;/p&gt;
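&lt;p&gt;The salting answer is easier to defend if you can show its shape. What follows is a pure-Python illustration of two-stage aggregation, not PySpark code, so it runs without a cluster. The function names and the round-robin salt are assumptions made for the sketch. In a real Spark job the salt is usually a random column, and each stage is a &lt;code&gt;groupBy&lt;/code&gt;.&lt;/p&gt;

```python
from collections import Counter


def salted_counts(rows, num_salts=8):
    """Stage 1: count per (user_id, salt) so a hot key spreads over num_salts buckets."""
    partial = Counter()
    for i, row in enumerate(rows):
        salt = i % num_salts  # round-robin here; Spark jobs often use a random salt
        partial[(row["user_id"], salt)] += 1
    return partial


def merge_salted(partial):
    """Stage 2: drop the salt and sum the partial counts per user_id."""
    totals = Counter()
    for (user_id, _salt), n in partial.items():
        totals[user_id] += n
    return totals
```

&lt;p&gt;Stage one turns the single giant bot group into &lt;code&gt;num_salts&lt;/code&gt; medium-sized groups that can be processed in parallel. Stage two only has to combine a handful of partial counts per user, which is cheap even for the hot key.&lt;/p&gt;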

&lt;h2&gt;
  
  
  What the SWE Round Never Asks That the DE Round Always Asks
&lt;/h2&gt;

&lt;p&gt;If you only prep SWE-style Python, you will never see these coming in a data engineering loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What is the memory footprint of this solution on 100 million rows?"&lt;/li&gt;
&lt;li&gt;"What happens to this code if the input is an iterator instead of a list?"&lt;/li&gt;
&lt;li&gt;"How do you make this idempotent?"&lt;/li&gt;
&lt;li&gt;"What does this return when a field is missing or null?"&lt;/li&gt;
&lt;li&gt;"How does this behave if you run it twice?"&lt;/li&gt;
&lt;li&gt;"If this were running in production and failed midway, what would the next run see?"&lt;/li&gt;
&lt;li&gt;"What is the grain of the output, and does that match what the downstream consumer expects?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those is a pipeline-correctness question, not an algorithm question. Every one of them is what separates a DE hire from a SWE hire in the same company.&lt;/p&gt;

&lt;h2&gt;
  
  
  How To Practice In Four Weeks
&lt;/h2&gt;

&lt;p&gt;Week one, iterator-first coding. Rewrite every LeetCode-style problem you have solved so that it accepts an iterator and returns an iterator. Use &lt;code&gt;itertools.groupby&lt;/code&gt;, &lt;code&gt;itertools.islice&lt;/code&gt;, generator expressions. Stop reaching for lists.&lt;/p&gt;
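&lt;p&gt;As a rough idea of what an iterator-in, iterator-out rewrite looks like, here are two small generators built on &lt;code&gt;itertools.islice&lt;/code&gt; and &lt;code&gt;itertools.groupby&lt;/code&gt;. The names &lt;code&gt;chunked&lt;/code&gt; and &lt;code&gt;run_lengths&lt;/code&gt; are hypothetical practice exercises, not something an interviewer will ask for by name.&lt;/p&gt;

```python
from itertools import groupby, islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_lengths(values):
    """Yield (value, run_length) for consecutive equal values, lazily."""
    for value, run in groupby(values):
        # Count the run by consuming it; nothing is stored beyond the counter.
        yield value, sum(1 for _ in run)
```

&lt;p&gt;Both functions work on an infinite generator as long as you only pull what you need, for example &lt;code&gt;next(chunked(itertools.count(), 3))&lt;/code&gt;. A list-based version would hang on that input.&lt;/p&gt;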

&lt;p&gt;Week two, the five recurring patterns above. Streaming aggregation, dedup with tiebreakers, schema-tolerant parsing, sessionization, and backfill-safe incremental logic. Write each one on paper before running it.&lt;/p&gt;

&lt;p&gt;Week three, PySpark DataFrame fluency. Joins with multiple keys, window functions, broadcast hints, skew handling. Read one real PySpark job from an open-source repository end to end. The muscle memory for DataFrame syntax only comes from reading real jobs.&lt;/p&gt;

&lt;p&gt;Week four, edge cases. Null handling, duplicate keys, out-of-order inputs, idempotency, late-arriving data. Most DE interview rejections happen on the edge-case follow-up, not on the main question. Budget more time here than feels right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Skill
&lt;/h2&gt;

&lt;p&gt;None of the patterns above are syntactically hard. The hard part is that the DE Python interview is testing a worldview. You see the question "count events per user" and the SWE worldview asks "what data structure." The DE worldview asks "what is the grain, what is the scale, what is the recovery story."&lt;/p&gt;

&lt;p&gt;Candidates who cross over from SWE Python to DE Python successfully are the ones who rewire the question first. Before typing: what is the grain, what is the scale, is the input an iterator, how does this behave on a re-run, how does it handle nulls and missing fields. Once those questions are automatic, the syntax is trivial.&lt;/p&gt;

&lt;p&gt;The DE Python interview is closer to a code review than to a coding round. You are not being asked if you can write the code. You are being asked if you would approve this code at 3 AM on a pipeline you own.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Last Thing
&lt;/h2&gt;

&lt;p&gt;If you are preparing for a data engineering loop with a LeetCode-style practice routine, stop. The patterns are wrong. The input shapes are wrong. The follow-ups will catch you off guard, and you will lose offers you should have won.&lt;/p&gt;

&lt;p&gt;Practice the six patterns above. Think about grain, scale, and idempotency every time you type. Make your default input an iterator and your default concern production safety. Do that for four weeks and the Python round stops being scary.&lt;/p&gt;

&lt;p&gt;If you want to practice for your upcoming data engineer interview, &lt;a href="http://www.DataDriven.io" rel="noopener noreferrer"&gt;www.DataDriven.io&lt;/a&gt; is free. No trial, no credit card. Built because the gap between Python SWE prep and Python DE prep is costing good data engineers jobs that they would otherwise get.&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
    </item>
    <item>
      <title>78K Tech Layoffs, 47% AI-Blamed: Is Data Engineering Safe?</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:54:34 +0000</pubDate>
      <link>https://dev.to/datadriven/78k-tech-layoffs-47-ai-blamed-is-data-engineering-safe-4en0</link>
      <guid>https://dev.to/datadriven/78k-tech-layoffs-47-ai-blamed-is-data-engineering-safe-4en0</guid>
      <description>&lt;p&gt;I woke up on March 31st to a Slack message from a former colleague at Oracle. Six words: "Got the email. 6am. It's done." Thirty thousand people, notified by email before sunrise. Not because Oracle was struggling; the company had just posted a 95% net income jump to $6.13 billion. They cut 18% of their workforce to fund data centers.&lt;/p&gt;

&lt;p&gt;That's the 2026 &lt;strong&gt;layoffs&lt;/strong&gt; story in a single sentence. Companies aren't cutting because they're broke. They're cutting because Wall Street rewards headcount-to-capex conversion, and "AI" is the magic word that makes the stock go up.&lt;/p&gt;

&lt;p&gt;78,557 tech workers were laid off in Q1 2026. Nearly half of those cuts, 47.9%, were publicly attributed to &lt;strong&gt;AI&lt;/strong&gt;. Block slashed 40% of its workforce and explicitly blamed AI. Meta announced 8,000 more cuts on April 20th. And every data engineer I know has been asking the same question: am I next?&lt;/p&gt;

&lt;p&gt;I've been through three waves of "&lt;strong&gt;data engineering&lt;/strong&gt; is getting automated away." Still here. Still employed. Still debugging the same categories of problems. But this wave feels different, and it deserves an honest look.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 47.9% Number Is Half Real, Half Investor Theater
&lt;/h2&gt;

&lt;p&gt;Let's start with the headline stat, because it's doing a lot of heavy lifting. 47.9% of Q1 2026 tech layoffs were attributed to AI. That sounds terrifying. It's also misleading.&lt;/p&gt;

&lt;p&gt;Here's the thing nobody's unpacking: of 45,363 confirmed layoffs tracked through early March, only 20.4% were &lt;em&gt;explicitly&lt;/em&gt; attributed to AI by the companies themselves. The 47.9% figure comes from retrospective analysis that assigns AI blame more liberally than the companies did in real-time disclosures. That's a gap you could drive a truck through.&lt;/p&gt;

&lt;p&gt;Sam Altman said it plainly: "There's some AI washing where people are blaming AI for layoffs that they would otherwise do, and there's some real displacement by AI of different kinds of jobs." When the CEO of OpenAI is telling you the AI attribution is inflated, maybe listen.&lt;/p&gt;

&lt;p&gt;59% of hiring managers surveyed admitted their companies frame workforce reductions as "AI-driven" partly to appeal to stakeholders, even when automation played a minimal role. Think about that. More than half of these companies are saying "AI made us do it" because it sounds better on an earnings call than "we overhired in 2021 and our margins need work."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The 47.9% figure is a stock market narrative wearing a labor statistic's clothing. Some of it is real displacement. A lot of it is executives who discovered that saying "AI efficiency" gets a better reaction from analysts than "cost cutting."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This doesn't mean AI displacement isn't happening. It is. But treating 47.9% as gospel is lazy analysis, and lazy analysis leads to bad &lt;strong&gt;career&lt;/strong&gt; decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Oracle's $50 Billion Bet (Funded by 30,000 People)
&lt;/h2&gt;

&lt;p&gt;Oracle's March layoffs deserve their own section because they're the clearest example of what's actually happening. This isn't AI replacing workers. This is capital replacing labor.&lt;/p&gt;

&lt;p&gt;Oracle cut 30,000 people, 18% of its global workforce, to free up $8 to $10 billion in annual cash flow. That cash is going directly into AI data center infrastructure; roughly $50 billion in 2026 capex alone, a 136% increase over 2025. India bore the worst of it: 12,000 of Oracle's approximately 30,000 Indian employees were terminated.&lt;/p&gt;

&lt;p&gt;The company had $523 billion in remaining performance obligations, up 433% year over year. Contracted demand from hyperscalers like OpenAI, Meta, and xAI. Oracle wasn't shrinking. It was restructuring its entire business model from "employ people to build software" to "build infrastructure that other companies rent."&lt;/p&gt;

&lt;p&gt;Here's where it gets relevant for data engineers: Oracle ran 8-month internal pilot programs with AI agents automating database administration tasks. Maintenance, performance optimization, backup verification. The routine stuff. Entry-level data analyst roles fell 40% industry-wide during the same period.&lt;/p&gt;

&lt;p&gt;The pattern is clear. Routine infrastructure work is on the chopping block. Non-routine infrastructure work (the kind where you're debugging why a pipeline silently dropped 2M rows last Tuesday) is not. Oracle didn't cut its cloud architects. It cut the people doing work that could be codified into a runbook and handed to an agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Data Engineering &lt;strong&gt;Job Security&lt;/strong&gt; Isn't a Myth (Yet)
&lt;/h2&gt;

&lt;p&gt;Here's where I'll validate the anxiety and then redirect it, because both things are true: the market is tightening &lt;em&gt;and&lt;/em&gt; data engineers are structurally safer than most adjacent roles.&lt;/p&gt;

&lt;p&gt;The numbers tell the story. Data engineering roles saw only a 20.6% reduction in openings when Q3 2024 layoffs hit; the smallest decline among all data roles. Data scientists accounted for just 3% of Q1 2026 layoffs, while software engineers absorbed 22%. Companies are allocating 60 to 70% of data budgets to engineering (ingestion, transformation, orchestration, reliability). And 90% of AI and ML projects depend directly on data engineering pipelines for training data, feature delivery, and real-time inference.&lt;/p&gt;

&lt;p&gt;That last stat is the one that matters. If you cut pipeline builders, your AI initiatives die. Full stop. Oracle is spending $50 billion on AI infrastructure. Meta is spending $115 to $135 billion. That infrastructure needs data flowing through it, which means it needs people who know how to make data flow reliably. You can't automate the thing that the automation depends on; at least not yet.&lt;/p&gt;

&lt;p&gt;55% of data professionals now identify primarily as data engineers, up from 40% in 2021. That's not just new hiring. That's existing staff reclassifying because companies realized they need infrastructure builders more than they need dashboard makers.&lt;/p&gt;

&lt;p&gt;But, and this is the part nobody wants to hear, entry-level data engineering positions represent just 2% of openings. Roles requiring 6+ years of experience make up 20%. The market isn't shrinking for data engineers. It's bifurcating. Senior engineers who can architect systems are in high demand. Junior engineers who can write a basic DAG are competing with AI tools that can do the same thing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Junior engineers worry about which tool to learn. Senior engineers worry about which problems to solve. Staff engineers worry about which problems to prevent. The layoffs are targeting the first group.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Data and analytics postings are down 15.2% year over year, outpacing the overall tech decline of 8.5%. But that aggregate number masks a high-variance market. Data engineers at Series B/C startups and enterprise AI implementations are thriving. Legacy BI teams are hollowing out. The label "data engineer" covers everything from someone writing dbt models to someone designing real-time feature stores for ML inference. These are not the same job, and they don't have the same risk profile.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Skills That Actually Keep You Employed
&lt;/h2&gt;

&lt;p&gt;I've watched people with 10 YOE get laid off because their entire skillset was "I run Airflow DAGs and write SQL." That was a fine career in 2020. In 2026, it's a ceiling.&lt;/p&gt;

&lt;p&gt;Here's what the hiring data shows. AI job postings surged 92% in Q1 2026 versus Q1 2025. ML engineering and AI ops roles command 56% wage premiums. Streaming data engineer roles pay $114K to $245K annually. The real-time analytics market is growing at 23.8% CAGR through 2028. OpenAI and Instacart are actively hiring for data infrastructure roles requiring Kafka, Flink, Spark, and Terraform experience.&lt;/p&gt;

&lt;p&gt;The demand isn't for "data engineers." It's for data engineers who can do specific, hard things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data modeling at scale.&lt;/strong&gt; This has always been the core skill, and it's only getting more important. Getting the model wrong upstream means everything downstream is pain; including every AI training pipeline that depends on your tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline architecture for ML systems.&lt;/strong&gt; Not system design in the SWE sense. Nobody cares if you can whiteboard a load balancer. Can you design a feature pipeline that serves both batch training and real-time inference without duplicating logic?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming infrastructure.&lt;/strong&gt; I know, I know; I've said streaming is overrated. And for 90% of companies, it still is. But the 10% that need it are the ones paying $200K+ for Kafka and Flink expertise. If you want &lt;strong&gt;job security&lt;/strong&gt; in a tightening market, depth in an undersupplied niche beats breadth in an oversupplied generalist pool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-aware engineering.&lt;/strong&gt; Storage is 2 cents per GB per month. Compute is cheap. But "cheap" times a thousand pipelines times 365 days adds up. The engineer who can shave $400K off the annual cloud bill by rethinking a data model is worth more than the engineer who memorized the Spark API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern isn't complicated. Routine work is getting automated. Non-routine work is getting more valuable. If your job can be described as a series of steps that don't require judgment calls, you're exposed. If your job involves figuring out &lt;em&gt;why&lt;/em&gt; the pipeline broke, &lt;em&gt;how&lt;/em&gt; to model the data so downstream teams aren't constantly filing tickets, and &lt;em&gt;what&lt;/em&gt; infrastructure choices save the company money at scale, you're fine.&lt;/p&gt;

&lt;p&gt;66% of CEOs are freezing hiring through the rest of 2026. But data engineer ranks #7 in CEO hiring priorities at 23%, and the roles that &lt;em&gt;are&lt;/em&gt; opening carry premium compensation. The market is smaller but richer. Fewer seats, higher stakes, better pay for the people who get them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means for Your Career
&lt;/h2&gt;

&lt;p&gt;I'm not going to sugarcoat this: the &lt;strong&gt;layoffs&lt;/strong&gt; are real, the market is harder than it was in 2021, and "just learn SQL and Airflow" isn't a viable &lt;strong&gt;career&lt;/strong&gt; strategy anymore. But I've been through this cycle before. The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal.&lt;/p&gt;

&lt;p&gt;The 47.9% AI attribution number is mostly theater. The entry-level contraction is real. The senior-level demand is also real. And the data engineers who treat this moment as a reason to deepen their skills (not panic, not pivot to product management, not "learn AI" by taking a Coursera course) are going to come out of this cycle better compensated than they went in.&lt;/p&gt;

&lt;p&gt;I gave myself a week to feel anxious about the headlines. Then I went back to studying pipeline architecture patterns and brushing up on streaming fundamentals. Because that's always been the move: play the game, win the prize.&lt;/p&gt;

&lt;p&gt;What's the most in-demand skill in your corner of data engineering right now? Genuinely curious whether the streaming and ML infrastructure trend is as universal as the job postings suggest, or if it's concentrated in specific markets.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The 6 Python Data Engineering Interview Questions You Will Actually Be Asked in 2026</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Sun, 19 Apr 2026 21:07:19 +0000</pubDate>
      <link>https://dev.to/datadriven/the-6-python-data-engineering-interview-questions-you-will-actually-be-asked-in-2026-504j</link>
      <guid>https://dev.to/datadriven/the-6-python-data-engineering-interview-questions-you-will-actually-be-asked-in-2026-504j</guid>
      <description>&lt;p&gt;Every data engineer preparing for interviews hits the same confused moment. You search for python interview questions, get a list of reverse-a-linked-list and two-sum problems, grind them for two weeks, walk into your first data engineering loop, and get asked to deduplicate a 10 million row event stream while preserving the latest record per composite key.&lt;/p&gt;

&lt;p&gt;None of those LeetCode problems prepared you for that question. And the Python round in a data engineering interview is not going to get easier until you realize the questions are a different species from the ones that show up in a software engineering loop.&lt;/p&gt;

&lt;p&gt;I have run over 250 interview loops at Google, Meta, LinkedIn, and Netflix. The Python portion of a data engineering loop does not look like a Python backend or frontend loop. It looks like a pipeline-correctness loop wearing a Python costume. Below is the full taxonomy of the six questions that actually get asked, how they differ from the SWE canon, and which patterns you have to internalize to pass.&lt;/p&gt;

&lt;p&gt;If you have already read the 5 SQL patterns that show up in every data engineering interview, this is the Python companion. Same framing, different language, overlapping patterns. The PySpark and pandas DataFrame patterns live in a sibling post and should be read together with this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Difference In One Sentence
&lt;/h2&gt;

&lt;p&gt;The SWE Python round tests whether you can write correct code on data that fits in memory. The data engineering Python round tests whether you can reason about data correctness, grain, idempotency, and scale on data that usually does not fit.&lt;/p&gt;

&lt;p&gt;That is the entire gap. Every other difference flows from it.&lt;/p&gt;

&lt;p&gt;A SWE Python problem gives you a list of integers and asks you to do something clever with it. The list has 10 elements. The test cases have 10 elements. The expected behavior is obvious. The skill tested is algorithms.&lt;/p&gt;

&lt;p&gt;A data engineering Python problem gives you an iterator over events that might have 10 million elements, might have duplicates from retries, might have late-arriving data, might have schema drift between rows, and asks you to produce a deduplicated, ordered, grouped output without loading it all into memory. The test cases will have 5 elements. The skill tested is production instinct.&lt;/p&gt;

&lt;p&gt;Candidates who prepped for SWE Python walk in confident and freeze the moment the input becomes an iterator instead of a list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 1: Streaming Aggregation Over an Iterator
&lt;/h2&gt;

&lt;p&gt;This is the single most common Python question I have given and received in a data engineering loop across four FAANG companies.&lt;/p&gt;

&lt;p&gt;Setup: you are handed an iterator that yields event dictionaries one at a time. Each event has &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;event_type&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;, and a few other fields. Compute the count of each event type per user without loading the full iterator into a list.&lt;/p&gt;

&lt;p&gt;A SWE candidate types this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;events_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That solution works on the 5 test events the interviewer gave you. It also kills the pipeline when a real day's worth of events arrives. The &lt;code&gt;list(events)&lt;/code&gt; call materializes the whole iterator. In production that is 40 GB of dictionaries in memory for no reason.&lt;/p&gt;

&lt;p&gt;The data engineering answer never materializes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same logic, different relationship with memory. An interviewer running this round is watching for whether you call &lt;code&gt;list()&lt;/code&gt; on an iterator. If you do, you have told them you think like a SWE, not a data engineer. Half the battle in Python data engineering interviews is showing you know the difference between an iterator and a list and that you default to iterators.&lt;/p&gt;

&lt;p&gt;The follow-up is always: what if the iterator is too large to even hold the counts dict in memory? Now you are in sketch-aggregation territory (HyperLogLog, Count-Min Sketch) or you partition by a hash of the key. If you have never heard of those, the senior bar just evaporated.&lt;/p&gt;
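&lt;p&gt;Of the two sketches named there, Count-Min Sketch is the one that answers per-key frequency; HyperLogLog estimates distinct counts instead. If you have never seen one, here is a minimal illustrative version. The class, the default &lt;code&gt;width&lt;/code&gt; and &lt;code&gt;depth&lt;/code&gt;, and the use of &lt;code&gt;repr(key)&lt;/code&gt; as the hashed bytes are all assumptions made for the example; production code would normally use a tested library and size the table from an error budget.&lt;/p&gt;

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter: estimates never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        data = repr(key).encode("utf-8")
        for row in range(self.depth):
            # A different salt per row acts as a different hash function.
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key, n=1):
        for row, col in self._columns(key):
            self.table[row][col] += n

    def estimate(self, key):
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._columns(key))


def sketch_events(events, width=2048, depth=4):
    """Consume an event iterator once, holding only the fixed-size table."""
    cms = CountMinSketch(width, depth)
    for e in events:
        cms.add((e["user_id"], e["event_type"]))
    return cms
```

&lt;p&gt;Memory stays at &lt;code&gt;width * depth&lt;/code&gt; counters no matter how many distinct keys arrive. The trade is that an estimate can be too high, never too low, which is the sentence to say out loud in the interview.&lt;/p&gt;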

&lt;h2&gt;
  
  
  Question 2: Deduplication With a Tiebreaker
&lt;/h2&gt;

&lt;p&gt;Every data engineering loop has a dedup question. The SWE version is "remove duplicates from a list." The DE version is "here is an event stream with retries. Each event has an &lt;code&gt;event_id&lt;/code&gt;, an &lt;code&gt;ingested_at&lt;/code&gt;, and an &lt;code&gt;updated_by&lt;/code&gt; that is sometimes null. Keep one row per &lt;code&gt;event_id&lt;/code&gt;, preferring the latest &lt;code&gt;ingested_at&lt;/code&gt;, breaking ties by preferring a non-null &lt;code&gt;updated_by&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;A SWE candidate reaches for a set. Sets do not have tiebreakers. Sets discard information you need.&lt;/p&gt;

&lt;p&gt;The DE answer iterates, keeps a dict keyed by &lt;code&gt;event_id&lt;/code&gt;, and compares against the current best:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What interviewers are testing here is not the algorithm. The algorithm is trivial. They are testing whether you ask "what defines a duplicate" before typing, whether you handle the null case in the tiebreaker explicitly, and whether you return an iterator-compatible view instead of a list.&lt;/p&gt;

&lt;p&gt;The follow-up is always: what if the input is already sorted by the dedup key, then by &lt;code&gt;ingested_at&lt;/code&gt;? Can you do this in constant additional memory? That is where a streaming groupby pattern comes in, and if you can sketch it on the fly you are clearing a senior bar.&lt;/p&gt;
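
&lt;p&gt;A minimal sketch of that streaming pattern, assuming events with the same key arrive next to each other (sorting by &lt;code&gt;ingested_at&lt;/code&gt; alone would not be enough for this version). The names &lt;code&gt;dedup_sorted&lt;/code&gt; and &lt;code&gt;key_fn&lt;/code&gt; are illustrative, and the tiebreaker mirrors the dict-based answer: latest &lt;code&gt;ingested_at&lt;/code&gt; wins, then a non-null &lt;code&gt;updated_by&lt;/code&gt;.&lt;/p&gt;

```python
from itertools import groupby


def dedup_sorted(events, key_fn):
    """Yield one winner per key from an iterable already grouped by key.

    Assumption: all events that share a key are adjacent in the input.
    Only the running winner of the current group is held in memory.
    """
    for _, group in groupby(events, key=key_fn):
        # Latest ingested_at wins; on a tie, prefer a non-null updated_by.
        # max() keeps the first of several equal candidates.
        yield max(
            group,
            key=lambda e: (e["ingested_at"], e["updated_by"] is not None),
        )
```

&lt;p&gt;Because it is a generator over &lt;code&gt;itertools.groupby&lt;/code&gt;, it never builds the per-key dictionary, so memory stays flat no matter how many distinct keys pass through.&lt;/p&gt;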

&lt;h2&gt;
  
  
  Question 3: Schema-Tolerant Parsing
&lt;/h2&gt;

&lt;p&gt;This is the one SWE prep completely ignores and DE interviews lean on heavily.&lt;/p&gt;

&lt;p&gt;Setup: you are given a list of dictionaries representing events from a log. Some events are missing fields. Some have extra fields. Some have the right field names but wrong types. Write a function that produces a clean, typed output and a quarantine list for rows that cannot be parsed.&lt;/p&gt;

&lt;p&gt;SWE Python does not train this muscle. The LeetCode problem gives you a clean input every time. The DE interviewer is watching whether you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validate required fields explicitly before touching them&lt;/li&gt;
&lt;li&gt;Cast types with explicit &lt;code&gt;try/except&lt;/code&gt; around each cast&lt;/li&gt;
&lt;li&gt;Never let one bad row kill the whole batch&lt;/li&gt;
&lt;li&gt;Separate "valid" from "invalid" without discarding the invalid rows silently&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A reasonable answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarantine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing required field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;quarantine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarantine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trap interviewers plant is a row where &lt;code&gt;user_id&lt;/code&gt; is the string &lt;code&gt;"null"&lt;/code&gt; instead of the Python &lt;code&gt;None&lt;/code&gt;. &lt;code&gt;int("null")&lt;/code&gt; raises &lt;code&gt;ValueError&lt;/code&gt; and &lt;code&gt;int(None)&lt;/code&gt; raises &lt;code&gt;TypeError&lt;/code&gt;, which is why the handler catches both. A candidate who wraps the whole loop in one big try/except loses the bad row and every row after it, which is a pipeline bug.&lt;/p&gt;

&lt;p&gt;If you have never written parsing code in production, the instinct to quarantine bad rows instead of crashing on them is foreign. It is the single most senior-signaling habit in a DE Python interview.&lt;/p&gt;
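
&lt;p&gt;The "explicit &lt;code&gt;try/except&lt;/code&gt; around each cast" point can be pushed one step further, so that the quarantine record says which field failed. The sketch below is one way to do that. The helper names &lt;code&gt;cast_field&lt;/code&gt; and &lt;code&gt;parse_event&lt;/code&gt; are hypothetical, and treating an explicit &lt;code&gt;None&lt;/code&gt; the same as a missing field is an assumption you would confirm with the interviewer.&lt;/p&gt;

```python
def cast_field(row, name, caster, required=True):
    """Cast one field; on failure raise a ValueError that names the field."""
    # Assumption: an explicit None is treated the same as a missing key.
    if name not in row or row[name] is None:
        if required:
            raise ValueError(f"missing required field: {name}")
        return None
    try:
        return caster(row[name])
    except (ValueError, TypeError) as err:
        raise ValueError(f"bad value for {name}: {row[name]!r}") from err


def parse_event(row):
    """Return a typed dict, or raise ValueError for the first bad field."""
    return {
        "user_id": cast_field(row, "user_id", int),
        "ts": cast_field(row, "ts", int),
        "amount": cast_field(row, "amount", float, required=False),
    }
```

&lt;p&gt;The caller still loops over rows and catches &lt;code&gt;ValueError&lt;/code&gt; per row, so one bad record ends up in quarantine with a message such as "bad value for user_id" instead of a bare cast error.&lt;/p&gt;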

&lt;h2&gt;
  
  
  Question 4: Window and Session Logic Without SQL
&lt;/h2&gt;

&lt;p&gt;When the Python round asks a question that maps to a SQL window function, the SQL-only candidates freeze. The prompt sounds like: "Here is a sorted list of events per user. Group them into sessions where a session ends after 30 minutes of inactivity. Return a list of sessions with start_ts, end_ts, and event_count."&lt;/p&gt;

&lt;p&gt;This is the sessionization pattern the SQL round tests, ported to Python. The interviewer wants to see whether you can implement in Python what you would write as a &lt;code&gt;LAG&lt;/code&gt; + running &lt;code&gt;SUM&lt;/code&gt; in SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sessionize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gap_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1800&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;gap_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SWE candidate tries to use a library. The DE candidate writes the rolling-state loop. This is the clearest example of DE Python being closer to state-machine code than to algorithmic code.&lt;/p&gt;

&lt;p&gt;Follow-ups are always about edge cases. What if two events share a timestamp? What if the input is not sorted? What if the iterator is chunked across hour boundaries and you have to support resuming? Each one is a real pipeline concern, not an algorithmic concern.&lt;/p&gt;
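
&lt;p&gt;For the "chunked input with resume" follow-up, one possible shape is to return the still-open session alongside the closed ones, so the caller can persist it and hand it back on the next chunk. This is a sketch under stated assumptions: events belong to a single user, timestamps are in seconds, and ordering holds both within and across chunks. &lt;code&gt;sessionize_chunk&lt;/code&gt; and its &lt;code&gt;state&lt;/code&gt; argument are illustrative names.&lt;/p&gt;

```python
def sessionize_chunk(events, state=None, gap_seconds=1800):
    """Sessionize one chunk of sorted events for a single user.

    `state` is the open session carried over from the previous chunk,
    or None. Returns (closed_sessions, open_session) so the caller can
    persist the open session and pass it to the next call.
    """
    closed = []
    # Copy so the caller's saved state is never mutated in place.
    current = dict(state) if state is not None else None
    for e in events:
        if current is not None and e["ts"] - current["end_ts"] > gap_seconds:
            closed.append(current)
            current = None
        if current is None:
            current = {"start_ts": e["ts"], "end_ts": e["ts"], "event_count": 0}
        current["end_ts"] = e["ts"]
        current["event_count"] += 1
    return closed, current
```

&lt;p&gt;Feeding the same events in one call or split across several calls should produce the same sessions, which is the property to state out loud when the interviewer asks about resuming.&lt;/p&gt;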

&lt;h2&gt;
  
  
  Question 5: Backfill-Safe Incremental Logic
&lt;/h2&gt;

&lt;p&gt;This is the question that most distinguishes senior data engineers from mid-level candidates.&lt;/p&gt;

&lt;p&gt;Setup: you have a function that processes yesterday's events. Rewrite it so that running it on today's date, on a date from last week, or on a date range over the last month all produce correct output without double-counting or dropping data.&lt;/p&gt;

&lt;p&gt;The SWE candidate does not realize they were asked a design question. They write a function that filters by date and returns aggregates. The DE candidate writes code that is idempotent, deterministic on input date range, and safe against partial re-runs.&lt;/p&gt;

&lt;p&gt;The moves interviewers are watching for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter on &lt;code&gt;ingested_at&lt;/code&gt; (processing time), not on &lt;code&gt;event_ts&lt;/code&gt; (event time), when you want to catch late data&lt;/li&gt;
&lt;li&gt;Produce output keyed by (partition, primary_key) so re-running overwrites instead of appends&lt;/li&gt;
&lt;li&gt;Take the date range as an argument, not as &lt;code&gt;datetime.now()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Emit a result that is the same shape whether you process one day or thirty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events_iter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interviewer's follow-up is "what happens if this fails halfway through?" If you do not immediately say "the same inputs produce the same outputs, so re-running it is safe," you have not understood why you were asked this question. Idempotency is the senior signal.&lt;/p&gt;
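
&lt;p&gt;The "overwrite instead of append" move is what makes that claim true on the write side. A small sketch, assuming a plain dict stands in for a date-partitioned table and that rows have the &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;n&lt;/code&gt; shape returned above. &lt;code&gt;write_partition&lt;/code&gt; and &lt;code&gt;run&lt;/code&gt; are hypothetical helpers, not a real storage API.&lt;/p&gt;

```python
def write_partition(sink, date, rows):
    """Replace everything stored for `date` with `rows` (overwrite, not append).

    `sink` is a plain dict standing in for a partitioned table:
    {date: {user_id: row}}.
    """
    sink[date] = {row["user_id"]: row for row in rows}


def run(sink, aggregates):
    """Group aggregate rows by date and overwrite each touched partition."""
    by_date = {}
    for row in aggregates:
        by_date.setdefault(row["date"], []).append(row)
    for date, rows in by_date.items():
        write_partition(sink, date, rows)
```

&lt;p&gt;Running &lt;code&gt;run&lt;/code&gt; twice with the same aggregates leaves the sink unchanged after the first run, and a run that died after writing some partitions is repaired by simply running it again.&lt;/p&gt;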

&lt;h2&gt;
  
  
  Question 6: PySpark DataFrame Logic
&lt;/h2&gt;

&lt;p&gt;If the role is Spark-heavy, and many data engineering roles in 2026 are, the Python portion of the interview is really a PySpark portion. The questions are the same five patterns above, expressed in DataFrame API calls instead of pure Python. For a deeper treatment of the DataFrame syntax itself, see the PySpark and pandas interview patterns post.&lt;/p&gt;

&lt;p&gt;The specific patterns that recur:&lt;/p&gt;

&lt;p&gt;A join on multiple columns, phrased as "here are two DataFrames, produce one row per customer with their latest order and their primary payment method." Candidates who do not know the &lt;code&gt;join(other, on=["customer_id", "region"], how="left")&lt;/code&gt; syntax lose time to syntax, not logic.&lt;/p&gt;

&lt;p&gt;A window function in PySpark. Same sessionization prompt, but now you write &lt;code&gt;Window.partitionBy("user_id").orderBy("ts")&lt;/code&gt; and use &lt;code&gt;F.lag("ts").over(w)&lt;/code&gt; to compute gaps. This is the direct translation of the SQL pattern and the pure-Python pattern, and interviewers love it because it tests whether you have touched all three.&lt;/p&gt;

&lt;p&gt;A broadcast-join decision. Interviewers describe two tables of wildly different sizes and ask how you would join them. If you say "regular join," you fail. If you say "broadcast the small one," you pass the first level. If you say "broadcast the small one, but only if it fits in the executor memory budget, otherwise repartition and sort-merge," you pass the senior level.&lt;/p&gt;

&lt;p&gt;A partitioning and skew question. A DataFrame has 10 million rows, 90% of which share the same &lt;code&gt;user_id&lt;/code&gt; (a bot). The interviewer asks what happens when you &lt;code&gt;groupBy("user_id")&lt;/code&gt;. The answer involves salting, two-stage aggregation, or adaptive query execution. This is not a SWE question. It is a pipeline-performance question dressed as code.&lt;/p&gt;
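
&lt;p&gt;The idea behind salting and two-stage aggregation can be shown without a cluster. The sketch below is plain Python, not PySpark, and uses a deterministic round-robin salt purely so the example is reproducible (a real job would more likely use a random or hashed salt). &lt;code&gt;salted_count&lt;/code&gt; and &lt;code&gt;num_salts&lt;/code&gt; are illustrative names.&lt;/p&gt;

```python
from collections import Counter


def salted_count(user_ids, num_salts=8):
    """Count rows per user in two stages, spreading hot keys over salts."""
    # Stage 1: partial counts keyed by (user_id, salt). In a distributed
    # engine each (user_id, salt) pair could land on a different worker,
    # so the hot key is split into up to num_salts smaller groups.
    partial = Counter()
    for i, user_id in enumerate(user_ids):
        partial[(user_id, i % num_salts)] += 1

    # Stage 2: drop the salt and merge the partials into final counts.
    final = Counter()
    for (user_id, _salt), n in partial.items():
        final[user_id] += n
    return dict(final)
```

&lt;p&gt;The second stage only sees at most &lt;code&gt;num_salts&lt;/code&gt; partial rows per user, which is why the pattern removes the single overloaded task that a direct group-by on the bot's &lt;code&gt;user_id&lt;/code&gt; would create.&lt;/p&gt;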

&lt;h2&gt;
  
  
  What the SWE Round Never Asks That the DE Round Always Asks
&lt;/h2&gt;

&lt;p&gt;If you only prep SWE-style Python, you will never see these coming in a data engineering loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What is the memory footprint of this solution on 100 million rows?"&lt;/li&gt;
&lt;li&gt;"What happens to this code if the input is an iterator instead of a list?"&lt;/li&gt;
&lt;li&gt;"How do you make this idempotent?"&lt;/li&gt;
&lt;li&gt;"What does this return when a field is missing or null?"&lt;/li&gt;
&lt;li&gt;"How does this behave if you run it twice?"&lt;/li&gt;
&lt;li&gt;"If this was running in production and failed midway, what would the next run see?"&lt;/li&gt;
&lt;li&gt;"What is the grain of the output, and does that match what the downstream consumer expects?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those is a pipeline-correctness question, not an algorithm question. Every one of them is what separates a DE hire from a SWE hire in the same company.&lt;/p&gt;

&lt;h2&gt;
  
  
  How To Practice In Four Weeks
&lt;/h2&gt;

&lt;p&gt;Week one, iterator-first coding. Rewrite every LeetCode-style problem you have solved so that it accepts an iterator and returns an iterator. Use &lt;code&gt;itertools.groupby&lt;/code&gt;, &lt;code&gt;itertools.islice&lt;/code&gt;, generator expressions. Stop reaching for lists.&lt;/p&gt;
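
&lt;p&gt;As a warm-up for that week, here is a small sketch of an iterator-in, iterator-out helper built on &lt;code&gt;itertools.islice&lt;/code&gt;. The name &lt;code&gt;chunked&lt;/code&gt; is illustrative. It batches any iterable without ever holding more than one batch in memory.&lt;/p&gt;

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

&lt;p&gt;It behaves the same whether it is handed a list, a generator, or a file handle, which is the habit the week-one exercise is meant to build.&lt;/p&gt;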

&lt;p&gt;Week two, the five recurring patterns above. Streaming aggregation, dedup with tiebreakers, schema-tolerant parsing, sessionization, and backfill-safe incremental logic. Write each one on paper before running it.&lt;/p&gt;

&lt;p&gt;Week three, PySpark DataFrame fluency. Joins with multiple keys, window functions, broadcast hints, skew handling. Read one real PySpark job from an open-source repository end to end. The muscle memory for DataFrame syntax only comes from reading real jobs.&lt;/p&gt;

&lt;p&gt;Week four, edge cases. Null handling, duplicate keys, out-of-order inputs, idempotency, late-arriving data. Most DE interview rejections happen on the edge-case follow-up, not on the main question. Budget more time here than feels right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Skill
&lt;/h2&gt;

&lt;p&gt;None of the patterns above are syntactically hard. The hard part is that the DE Python interview is testing a worldview. You see the question "count events per user" and the SWE worldview asks "what data structure." The DE worldview asks "what is the grain, what is the scale, what is the recovery story."&lt;/p&gt;

&lt;p&gt;Candidates who cross over from SWE Python to DE Python successfully are the ones who rewire the question first. Before typing: what is the grain, what is the scale, is the input an iterator, how does this behave on a re-run, how does it handle nulls and missing fields. Once those questions are automatic, the syntax is trivial.&lt;/p&gt;

&lt;p&gt;The DE Python interview is closer to a code review than to a coding round. You are not being asked if you can write the code. You are being asked if you would approve this code at 3 AM on a pipeline you own.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Last Thing
&lt;/h2&gt;

&lt;p&gt;If you are preparing for a data engineering loop with a LeetCode-style practice routine, stop. The patterns are wrong. The input shapes are wrong. The follow-ups will catch you off guard, and you will lose offers you should have won.&lt;/p&gt;

&lt;p&gt;Practice the six patterns above. Think about grain, scale, and idempotency every time you type. Make your default input an iterator and your default concern production safety. Do that for four weeks and the Python round stops being scary.&lt;/p&gt;

&lt;p&gt;For the SQL side of the same loop, the 5 SQL patterns every data engineering interview asks is the cross-training companion. For PySpark and pandas specifically, read the DataFrame patterns post next.&lt;/p&gt;

&lt;p&gt;(If you want a question bank where every Python problem is tagged by pattern, tagged by the company that asked a variant of it, and includes the iterator-vs-list and scale follow-ups the interviewer will ask next, DataDriven.io is free. No trial, no credit card. Built because the gap between Python SWE prep and Python DE prep is costing good engineers jobs they would otherwise get.)&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
    </item>
    <item>
      <title>What Spark Interviews Actually Test (Based on 189 Real Interview Reports)</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:01:17 +0000</pubDate>
      <link>https://dev.to/datadriven/what-spark-interviews-actually-test-based-on-189-real-interview-reports-46ol</link>
      <guid>https://dev.to/datadriven/what-spark-interviews-actually-test-based-on-189-real-interview-reports-46ol</guid>
      <description>&lt;h1&gt;
  
  
  What Spark Interviews Actually Test (Based on 189 Real Interview Reports)
&lt;/h1&gt;

&lt;p&gt;We scraped thousands of data engineering interview reports from across the internet. 189 of them mentioned Spark. We tagged every question, tracked every outcome, and found patterns that contradicted most of the advice we see online.&lt;/p&gt;

&lt;p&gt;This is what the data says.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spark Shows Up Less Than You Think
&lt;/h2&gt;

&lt;p&gt;Across all the reports we collected, Spark appeared in 6.7%. SQL appeared in 22.8%. Python in 16%.&lt;/p&gt;

&lt;p&gt;That ratio matters. If you have 4 weeks to prep and you spend 2 of them grinding Spark internals, you've made a bad bet. SQL is 3.4x more likely to show up. Python is 2.4x more likely.&lt;/p&gt;

&lt;p&gt;But here's the catch: when Spark does show up, it shows up hard. It's rarely one question in a round. It tends to be the entire round. And the failure rate is brutal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question Changes Completely By Level
&lt;/h2&gt;

&lt;p&gt;Most people prep for Spark interviews as if there's one test. There isn't. The question changes shape depending on what level you're interviewing for, and the jump between levels is steeper than people expect.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;L3/L4&lt;/strong&gt;, interviewers test whether you can explain the basics. "What is a DAG?" "Why is a shuffle expensive?" "Tell me about your PySpark projects." One candidate interviewing at Nasdaq described the round as "Python, Pandas, PySpark, Databricks, Linux commands, my projects in Python." Conceptual. Vocabulary. Can you talk about this stuff without stumbling.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;L5&lt;/strong&gt;, the entire format flips. The interviewer hands you a Spark UI screenshot and says "this job was meeting SLA for six months and now it's 10x slower. Nothing in the code changed. Walk me through your diagnosis." A TikTok L5 round combined "complex SQL problems, Spark architecture, and performance optimization questions, including indexing strategies, partitioning, query tuning, and resource management in distributed data processing systems" into a single session. You're not explaining what Spark is. You're fixing something that broke at 3am.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;L6&lt;/strong&gt;, the scope widens again. One candidate at Booking.com was rejected because their system design choices were wrong: "feedback centered on tool choices (Flink vs Spark despite prompt asking for low latency; Redis vs Cassandra)." The question isn't "fix this job." It's "design the memory layout for a system that caches 100GB of reference data while running a 500GB sort-merge join." You're sizing executors, reasoning about GC pressure past 30GB of heap, deciding between &lt;code&gt;MEMORY_AND_DISK&lt;/code&gt; and recomputation.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;L7&lt;/strong&gt;, it's organizational. "How would you design a Spark application that processes 100+ PB across a shared multi-tenant cluster?" The bottleneck isn't compute anymore, it's resource isolation between 50 competing teams.&lt;/p&gt;

&lt;p&gt;Same topic at every level. Completely different test. One Databricks candidate went through a 7-round process over 60 days that included a take-home with 15 hands-on Spark questions, followed by a live grilling where a lead engineer dug into their optimization choices. They called the whole experience "disappointing after almost 2 months."&lt;/p&gt;

&lt;p&gt;The prep that gets you through an L3 round won't even register as relevant at L5.&lt;/p&gt;

&lt;h2&gt;
  
  
  68.8% of "Spark Interviews" Are Really SQL-at-Scale Interviews
&lt;/h2&gt;

&lt;p&gt;This one surprised me the most. We tagged every technical topic mentioned in the 189 Spark interview reports. The breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;% of Spark Interviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SQL optimization&lt;/td&gt;
&lt;td&gt;68.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance tuning&lt;/td&gt;
&lt;td&gt;11.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Window functions&lt;/td&gt;
&lt;td&gt;6.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joins&lt;/td&gt;
&lt;td&gt;5.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning&lt;/td&gt;
&lt;td&gt;3.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data skew&lt;/td&gt;
&lt;td&gt;2.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory management&lt;/td&gt;
&lt;td&gt;2.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nearly 7 out of 10 "Spark interviews" are really about running SQL efficiently at distributed scale. Not RDD transformations. Not Catalyst internals. SQL.&lt;/p&gt;

&lt;p&gt;A typical report reads like this (from a real TikTok L5 interview): "...Discussed Spark architecture, and answered performance optimization questions, including indexing strategies, partitioning, query tuning, and resource management in distributed data processing systems."&lt;/p&gt;

&lt;p&gt;SQL is the entry point. Spark is the context. The question is whether you understand what happens to your SQL query after you hit enter on a 500-executor cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nobody Asks About RDDs Anymore
&lt;/h2&gt;

&lt;p&gt;Zero interviews in the dataset asked about RDDs as a primary topic. Zero asked about GC tuning directly.&lt;/p&gt;

&lt;p&gt;That doesn't mean these concepts are irrelevant. It means interviewers have stopped asking "what is an RDD" and started asking questions where RDD knowledge helps you reason about the answer. The question is "why is this job slow?" and the ability to think in terms of lineage, partitions, and shuffle boundaries is what separates a good answer from a textbook recitation.&lt;/p&gt;

&lt;p&gt;If you're spending prep time memorizing the difference between &lt;code&gt;map&lt;/code&gt; and &lt;code&gt;flatMap&lt;/code&gt; on RDDs, stop. That time is better spent learning to read a Spark UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Companies Actually Ask
&lt;/h2&gt;

&lt;p&gt;Here are real questions from real interviews, pulled directly from the reports:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks&lt;/strong&gt; (37 interviews, 46% rejection rate): "Length, breadth, height, depth on Spark core, DLT, Unity Catalog, code optimization, scenarios, your project issues and how they were resolved." Their process runs 7+ rounds over 50-60 days. One candidate reported being rejected after the presentation round "because of less Databricks knowledge" despite the hiring manager saying Databricks knowledge wouldn't be needed. The bar is inconsistent and the process is long.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finance companies&lt;/strong&gt; (multi-round, structured): "Explanation of Spark architecture in detail and different optimization techniques if any Spark job is taking long to run." These tend to be 4-round processes: PySpark coding, Spark optimization, system design, then a techno-managerial round.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TikTok&lt;/strong&gt; (L5, 25% rejection): Complex SQL + Spark architecture + performance optimization in a single round. They test breadth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BNSF Railway&lt;/strong&gt; (100% rejection in dataset): "Multi-round process with system design, SQL, PySpark, and a deep technical discussion with leadership. The interviews were much challenging and focused heavily on real-world trade-offs, especially around data architecture and streaming concepts." When a railroad company rejects every Spark candidate in the dataset, they're not messing around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QuantumBlack&lt;/strong&gt; (McKinsey's data arm, 71% rejection): "What makes PySpark great? How do you debug PySpark?" Then a 45-minute coding test with 3 problems solvable in either Pandas or PySpark.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Failure Patterns That Keep Showing Up
&lt;/h2&gt;

&lt;p&gt;After tagging all the technical content across interview reports, challenge databases, and company-specific prep guides, five production failure patterns dominate what senior interviews test:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Data skew on power-law keys.&lt;/strong&gt; One partition holds 320M rows while the others hold 3-4M. Task 199 runs for 7,140 seconds while the other 199 tasks finish in 22 seconds. The interviewer wants you to identify the skew from the Spark UI, explain why adding more executors won't help (the bottleneck is one partition, not total parallelism), and apply the right fix (broadcast the small table, or salt the key if both tables are large).&lt;/p&gt;
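
&lt;p&gt;To see why salting works, here's a minimal pure-Python sketch. It is not Spark code: &lt;code&gt;partition_of&lt;/code&gt; is a hypothetical stand-in for a hash partitioner, and the row counts, 200 partitions, and 32 salt buckets are assumptions picked for illustration.&lt;/p&gt;

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 200  # assumed shuffle partition count
SALT_BUCKETS = 32     # assumed number of salts; tune to the observed skew


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Stable stand-in for a hash partitioner (not Spark's actual hash)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def largest_partition(keys):
    """Return the row count of the most heavily loaded partition."""
    counts = Counter(partition_of(k) for k in keys)
    return max(counts.values())


# Synthetic skewed input: one hot user_id plus many small ones.
hot_rows = [0] * 100_000
cold_rows = [uid for uid in range(1, 1001) for _ in range(100)]
rows = hot_rows + cold_rows

# Plain keys: every row for user_id 0 lands in the same partition.
unsalted_max = largest_partition(rows)

# Salted keys: pair each key with a salt so the hot key spreads
# across up to SALT_BUCKETS different partitions.
salted = [(uid, i % SALT_BUCKETS) for i, uid in enumerate(rows)]
salted_max = largest_partition(salted)

print("largest partition, unsalted:", unsalted_max)
print("largest partition, salted:  ", salted_max)
```

&lt;p&gt;In a real join, the other table would have to be expanded with one copy of each row per salt value so the salted keys still match. That extra data is the price of salting, which is why broadcasting is the first thing to try when one side is small.&lt;/p&gt;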

&lt;p&gt;&lt;strong&gt;2. Broadcast overflow.&lt;/strong&gt; A dimension table that was 8MB a year ago silently grew past the 10MB default of &lt;code&gt;spark.sql.autoBroadcastJoinThreshold&lt;/code&gt;. Spark switched from BroadcastHashJoin to SortMergeJoin without anyone noticing. Runtime went from 8 minutes to 2 hours. The fix is one line of code. The interview tests whether you can find that line.&lt;/p&gt;
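
&lt;p&gt;The threshold behavior is easy to model. This is a deliberately simplified sketch, not Spark's planner: &lt;code&gt;planned_join&lt;/code&gt; is a hypothetical helper, the 12MB and 64MB figures are made up, and real planning also depends on join type, hints, and the statistics available.&lt;/p&gt;

```python
MB = 1024 * 1024
DEFAULT_THRESHOLD_BYTES = 10 * MB  # Spark's documented default for the threshold


def planned_join(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified model: broadcast only when the size estimate fits the threshold."""
    if threshold_bytes >= estimated_size_bytes:
        return "BroadcastHashJoin"
    return "SortMergeJoin"


print(planned_join(8 * MB))    # the dimension table a year ago
print(planned_join(12 * MB))   # hypothetical size after it grew
print(planned_join(12 * MB, threshold_bytes=64 * MB))  # after raising the threshold
```

&lt;p&gt;In PySpark, the two usual one-line candidates are raising the threshold with &lt;code&gt;spark.conf.set("spark.sql.autoBroadcastJoinThreshold", ...)&lt;/code&gt; or wrapping the dimension DataFrame in &lt;code&gt;broadcast()&lt;/code&gt; from &lt;code&gt;pyspark.sql.functions&lt;/code&gt;. Which one fits depends on how much executor memory the broadcast copy can safely take.&lt;/p&gt;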

&lt;p&gt;&lt;strong&gt;3. Shuffle explosion.&lt;/strong&gt; Someone added a &lt;code&gt;repartition()&lt;/code&gt; before a join, thinking more partitions would speed things up. Shuffle volume grew 50x and the network saturated. The interviewer wants you to explain why repartition before a join is almost always wrong (the join already shuffles both sides by key, so the extra repartition usually just adds another full shuffle) and what to do instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Executor OOM from cached data.&lt;/strong&gt; A 100GB reference table is cached. A 500GB sort-merge join needs execution memory. Both compete for the unified pool (60% of the heap by default, after a 300MB reserve). Spark's unified memory model lets execution evict cached blocks until storage shrinks to its protected share, but at 100GB the eviction churn destroys throughput. The interview tests whether you understand &lt;code&gt;spark.memory.fraction&lt;/code&gt;, &lt;code&gt;spark.memory.storageFraction&lt;/code&gt;, and the tradeoff between cache hit rate and execution headroom.&lt;/p&gt;
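
&lt;p&gt;The arithmetic behind that competition is worth doing once by hand. A rough sketch, assuming Spark's documented defaults (&lt;code&gt;spark.memory.fraction&lt;/code&gt; of 0.6, &lt;code&gt;spark.memory.storageFraction&lt;/code&gt; of 0.5, 300MiB reserved) and a hypothetical 28GB executor heap. The function name is mine, and it ignores off-heap memory and container overhead.&lt;/p&gt;

```python
import math

RESERVED_MIB = 300  # fixed reserve subtracted before memory.fraction applies


def unified_memory_mib(heap_gib, memory_fraction=0.6, storage_fraction=0.5):
    """Return (unified pool, eviction-protected storage) in MiB for one executor."""
    usable = heap_gib * 1024 - RESERVED_MIB
    unified = usable * memory_fraction
    protected_storage = unified * storage_fraction
    return unified, protected_storage


unified, protected = unified_memory_mib(heap_gib=28)  # assumed heap size

# Assumes the cached table occupies exactly 100GB in memory, which real
# deserialized data rarely does.
cache_mib = 100 * 1024
executors_needed = math.ceil(cache_mib / protected)

print(f"unified pool per executor:  {unified:.0f} MiB")
print(f"storage safe from eviction: {protected:.0f} MiB")
print(f"executors needed to keep the cache protected: {executors_needed}")
```

&lt;p&gt;Anything cached beyond the protected share can be evicted whenever execution needs room, which is where the churn comes from.&lt;/p&gt;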

&lt;p&gt;&lt;strong&gt;5. Catalyst plan regression from stale statistics.&lt;/strong&gt; CBO statistics went stale after a table doubled in size. Spark picked sort-merge instead of broadcast. Nobody changed any code. The job just got slower. The interviewer wants you to explain how Catalyst's cost-based optimizer works and why &lt;code&gt;ANALYZE TABLE ... COMPUTE STATISTICS&lt;/code&gt; matters.&lt;/p&gt;
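
&lt;p&gt;For reference, this is roughly what the refresh looks like. A small sketch that only builds the Spark SQL statements: the helper, the table name, and the column names are hypothetical, and in a real job each string would be passed to &lt;code&gt;spark.sql()&lt;/code&gt;.&lt;/p&gt;

```python
def analyze_statements(table, columns=()):
    """Build Spark SQL statements that refresh table-level and column-level stats."""
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        column_list = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return statements


# Hypothetical table and column names, for illustration only.
for stmt in analyze_statements("warehouse.user_events", ["user_id", "event_date"]):
    print(stmt)
```

&lt;p&gt;Column-level statistics only influence the plan when the cost-based optimizer is switched on through &lt;code&gt;spark.sql.cbo.enabled&lt;/code&gt;, so it's worth checking that setting before assuming fresh stats will change anything.&lt;/p&gt;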

&lt;p&gt;These five patterns cover what I'd estimate is 80%+ of production Spark incidents. They're also what separates L3 answers ("it's slow because of the data") from L6 answers ("task 199 is reading 15.8GB of shuffle data because the top 1% of user_ids hash to the same partition, and the executor is at 78% GC overhead because it's trying to sort 320M rows in 28GB of heap").&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Signal: Can You Read the Spark UI?
&lt;/h2&gt;

&lt;p&gt;Every pattern above comes down to one skill: reading the Spark UI and reasoning about what you see.&lt;/p&gt;

&lt;p&gt;Stages, tasks, shuffle read/write, GC time, executor memory. That's the entire diagnostic surface. If you can look at a Spark UI screenshot and say "task 199 has 100x the shuffle read of every other task, the executor is at 98% heap, and the physical plan shows SortMergeJoin when this should be a broadcast" then you pass. If you can't, you recite textbook answers and the interviewer can tell.&lt;/p&gt;

&lt;p&gt;This is the skill that most Spark prep resources skip entirely. They teach you what a broadcast join is. They don't teach you to recognize when a missing broadcast join is the reason your 3am pager went off.&lt;/p&gt;
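
&lt;p&gt;The core of that read is a comparison of the worst task against a typical one. Here's a minimal sketch of the check, using the task durations from the skew example. &lt;code&gt;find_stragglers&lt;/code&gt; and the 10x cutoff are my own assumptions, not a Spark API or an official threshold.&lt;/p&gt;

```python
from statistics import median


def find_stragglers(metric_by_task, ratio=10.0):
    """Return task ids whose metric is at least `ratio` times the median."""
    if not metric_by_task:
        return []
    typical = median(metric_by_task.values())
    return sorted(
        task_id
        for task_id, value in metric_by_task.items()
        if value >= ratio * typical and value > 0
    )


# Task durations in seconds: 199 tasks at 22s, one at 7,140s.
durations = {task_id: 22 for task_id in range(199)}
durations[199] = 7140

print(find_stragglers(durations))
```

&lt;p&gt;The stage detail page in the Spark UI already shows min, median, and max for duration and shuffle read, so in an interview you can do this comparison by eye. The same check works on shuffle read bytes or GC time.&lt;/p&gt;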

&lt;h2&gt;
  
  
  Practice This Before Your Interview
&lt;/h2&gt;

&lt;p&gt;I built a free Spark mock interview that simulates exactly this. You get paged. You see real Spark UI evidence: task durations, shuffle sizes, GC overhead, executor memory, the physical plan. You diagnose, write the fix in PySpark or Scala, run your code in the browser, then an AI interviewer grills you on tradeoffs and edge cases.&lt;/p&gt;

&lt;p&gt;Four phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Think&lt;/strong&gt; (5 min): Read the Spark UI. Diagnose before you touch code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt; (15 min): Write and run your PySpark or Scala fix in a hosted IDE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discuss&lt;/strong&gt; (10 min): AI interviewer asks follow-ups one at a time. "What happens when the table doubles?" "Why not just add more executors?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verdict&lt;/strong&gt;: Scored across 5 dimensions (problem solving, technical execution, communication, verification, requirements understanding). Calibrated from L3 to L7.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No paywall, no trial, no credit card.&lt;/p&gt;

&lt;p&gt;Try it here: &lt;a href="https://www.datadriven.io/interview/spark_skew_broadcast_user_events" rel="noopener noreferrer"&gt;Spark Mock Interview&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Data sourced from thousands of interview reports scraped across the internet, covering 945+ companies.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
