<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gowtham Potureddi</title>
    <description>The latest articles on DEV Community by Gowtham Potureddi (@gowthampotureddi).</description>
    <link>https://dev.to/gowthampotureddi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874592%2Fb901f929-0a60-4dd2-9dac-22ce22291bdc.png</url>
      <title>DEV Community: Gowtham Potureddi</title>
      <link>https://dev.to/gowthampotureddi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gowthampotureddi"/>
    <language>en</language>
    <item>
      <title>PostgreSQL SQL Data Types: Practical Column-Type Guide</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Mon, 11 May 2026 04:08:12 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/postgresql-sql-data-types-practical-column-type-guide-2l1b</link>
      <guid>https://dev.to/gowthampotureddi/postgresql-sql-data-types-practical-column-type-guide-2l1b</guid>
      <description>&lt;p&gt;Choosing the right &lt;strong&gt;SQL data types&lt;/strong&gt; is one of the quiet decisions that shapes &lt;strong&gt;storage&lt;/strong&gt;, &lt;strong&gt;correctness&lt;/strong&gt;, and &lt;strong&gt;query behavior&lt;/strong&gt; in PostgreSQL. In a tight SQL screen, interviewers often follow up on &lt;strong&gt;why&lt;/strong&gt; you picked a type—not only whether the query returns rows. This guide walks through the main families, common pitfalls (rounding, time zones, type mismatches), and how to reason about casts—using &lt;strong&gt;PostgreSQL&lt;/strong&gt; syntax, the same dialect PipeCode uses for practice.&lt;/p&gt;

&lt;p&gt;If you want &lt;strong&gt;hands-on reps&lt;/strong&gt; after you read, &lt;a href="https://dev.to/explore/practice"&gt;explore practice →&lt;/a&gt;, &lt;a href="https://dev.to/explore/practice/language/sql"&gt;drill SQL problems →&lt;/a&gt;, browse &lt;a href="https://dev.to/explore/practice/topic/sql"&gt;SQL by topic →&lt;/a&gt;, or open &lt;a href="https://dev.to/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang"&gt;Zero to FAANG SQL (full fundamentals) →&lt;/a&gt; for a structured path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9exalq0grwf5oxbok422.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9exalq0grwf5oxbok422.jpeg" alt="PipeCode blog header for a PostgreSQL SQL data types guide with bold title text and purple accents on a dark background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;On this page&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why column types matter&lt;/li&gt;
&lt;li&gt;Numeric types&lt;/li&gt;
&lt;li&gt;Text and binary&lt;/li&gt;
&lt;li&gt;Boolean and NULL&lt;/li&gt;
&lt;li&gt;Date and time&lt;/li&gt;
&lt;li&gt;Semi-structured and other types&lt;/li&gt;
&lt;li&gt;Casting and comparison rules&lt;/li&gt;
&lt;li&gt;Choosing types (checklist)&lt;/li&gt;
&lt;li&gt;Frequently asked questions&lt;/li&gt;
&lt;li&gt;Practice on PipeCode&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;1. Why column types matter&lt;/h2&gt;

&lt;h3&gt;Storage, comparisons, indexes, and the cost of silent coercion&lt;/h3&gt;

&lt;p&gt;"Why did you pick that type?" is the single most common SQL-screen follow-up — and the cleanest answer is that &lt;strong&gt;a column's type controls four downstream things at once: how the value is laid out on disk, which operators compare it correctly, which indexes the planner can actually use, and when PostgreSQL has to silently coerce data behind your back&lt;/strong&gt;. Get the type right and joins are fast, comparisons are unambiguous, and disk pages are dense. Get it wrong and you ship a schema that &lt;em&gt;runs&lt;/em&gt; but quietly returns the wrong answer or scans 10× more pages than it should.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hdj3do1wwqqijln9dzu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hdj3do1wwqqijln9dzu.jpeg" alt="Diagram linking SQL column types to storage, comparisons, indexes, and implicit casting with PipeCode purple and blue accents on a light card." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When you walk an interviewer through a &lt;code&gt;CREATE TABLE&lt;/code&gt;, say the &lt;strong&gt;grain&lt;/strong&gt; and the &lt;strong&gt;type&lt;/strong&gt; in the same breath: &lt;em&gt;"one row per order, &lt;code&gt;order_id&lt;/code&gt; is &lt;code&gt;BIGINT&lt;/code&gt;, &lt;code&gt;total&lt;/code&gt; is &lt;code&gt;NUMERIC(14,2)&lt;/code&gt;."&lt;/em&gt; That single habit signals to the interviewer that you think about column types as design decisions, not afterthoughts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;Storage footprint and on-disk layout&lt;/h4&gt;

&lt;p&gt;The storage invariant: &lt;strong&gt;fixed-width integer and timestamp types occupy a known number of bytes (4 or 8) and never expand; variable-width types (&lt;code&gt;TEXT&lt;/code&gt;, &lt;code&gt;NUMERIC&lt;/code&gt;, &lt;code&gt;JSONB&lt;/code&gt;) carry a length prefix and grow with the value; choosing a tighter type packs more rows per 8 KB page and improves cache locality on every read&lt;/strong&gt;. A wider type is rarely harmless: even when the extra bytes look cheap, planner statistics and TOAST thresholds shift with them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTEGER&lt;/code&gt;&lt;/strong&gt; — 4 bytes, range ±2.1 B; the default for counts and quantities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGINT&lt;/code&gt;&lt;/strong&gt; — 8 bytes; required when row counts cross ~2 B or for user-facing IDs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(p, s)&lt;/code&gt;&lt;/strong&gt; — variable (~2 bytes overhead + 2 bytes per 4 digits); cost grows with precision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt; / &lt;code&gt;VARCHAR(n)&lt;/code&gt;&lt;/strong&gt; — variable; &lt;strong&gt;no storage penalty&lt;/strong&gt; for &lt;code&gt;TEXT&lt;/code&gt; vs &lt;code&gt;VARCHAR&lt;/code&gt; with the same content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 100 M-row &lt;code&gt;events&lt;/code&gt; table sized two ways:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;per-row bytes&lt;/th&gt;
&lt;th&gt;total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;event_id BIGINT, ts TIMESTAMPTZ, user_id BIGINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;~2.4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;event_id BIGINT, ts TIMESTAMPTZ, user_id TEXT (avg 18 chars)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;24 + 20 = 44&lt;/td&gt;
&lt;td&gt;~4.4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fixed-width row (&lt;code&gt;BIGINT, TIMESTAMPTZ, BIGINT&lt;/code&gt;) is 24 bytes on the heap regardless of values.&lt;/li&gt;
&lt;li&gt;Replacing the integer &lt;code&gt;user_id&lt;/code&gt; with &lt;code&gt;TEXT&lt;/code&gt; for a UUID-shaped string adds a length header (1 byte for short values, 4 bytes for longer ones) plus the bytes of the text itself, roughly 20 bytes here.&lt;/li&gt;
&lt;li&gt;With ~100 M rows, the variable-width design adds ~2 GB to the table heap alone, before indexes.&lt;/li&gt;
&lt;li&gt;The wider rows also mean fewer rows fit per 8 KB page → fewer buffer-cache hits → more I/O per query.&lt;/li&gt;
&lt;li&gt;Net: same data, ~2× the disk and worse cache behavior.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Pick the tightest correct type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;      &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;        &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;          &lt;span class="c1"&gt;-- not TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a value is a count or an internal identifier, it is an integer; reach for &lt;code&gt;TEXT&lt;/code&gt; only when the value is a real human-readable string.&lt;/p&gt;
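&lt;p&gt;A quick way to check the width claims above is &lt;code&gt;pg_column_size()&lt;/code&gt;, which reports the bytes a value occupies (a psql sketch; variable-width results depend on header size and padding):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- fixed-width types: size never depends on the value
SELECT pg_column_size(42::BIGINT);        -- 8
SELECT pg_column_size(NOW());             -- 8 (TIMESTAMPTZ)

-- variable-width types: size grows with the content
SELECT pg_column_size('42'::TEXT);
SELECT pg_column_size(repeat('x', 100));
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;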

&lt;h4&gt;Equality and comparison semantics&lt;/h4&gt;

&lt;p&gt;The comparison invariant: &lt;strong&gt;PostgreSQL compares values &lt;em&gt;within&lt;/em&gt; a type cleanly, but mixing types forces an implicit cast that can produce surprises — string &lt;code&gt;'10'&lt;/code&gt; compares lexicographically (&lt;code&gt;'10' &amp;lt; '2'&lt;/code&gt;), numeric &lt;code&gt;10&lt;/code&gt; compares mathematically (&lt;code&gt;10 &amp;gt; 2&lt;/code&gt;), and timestamps compare instant-to-instant only if both sides are &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/strong&gt;. The right type makes &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;=&lt;/code&gt;, and &lt;code&gt;BETWEEN&lt;/code&gt; behave the way humans expect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;'10' &amp;lt; '2'&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt;&lt;/strong&gt; when both are &lt;code&gt;TEXT&lt;/code&gt; — string compare reads left-to-right.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;10 &amp;lt; 2&lt;/code&gt; is &lt;code&gt;FALSE&lt;/code&gt;&lt;/strong&gt; when both are &lt;code&gt;INTEGER&lt;/code&gt; — numeric compare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMP&lt;/code&gt; vs &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/strong&gt; — PostgreSQL will compare them only after coercing one side; the answer depends on the session time zone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collations on &lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;'abc' = 'ABC'&lt;/code&gt; is &lt;code&gt;FALSE&lt;/code&gt; under any deterministic collation (the default); it can be &lt;code&gt;TRUE&lt;/code&gt; only under a nondeterministic case-insensitive collation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A four-row table where the sort order flips based on whether &lt;code&gt;score&lt;/code&gt; is &lt;code&gt;TEXT&lt;/code&gt; or &lt;code&gt;INTEGER&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;th&gt;ascending position as TEXT&lt;/th&gt;
&lt;th&gt;ascending position as INT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"10"&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"100"&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"2"&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"9"&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stored as &lt;code&gt;TEXT&lt;/code&gt;: &lt;code&gt;ORDER BY score&lt;/code&gt; compares character-by-character; &lt;code&gt;'1'&lt;/code&gt; (0x31) sorts before &lt;code&gt;'9'&lt;/code&gt; (0x39), so &lt;code&gt;'100'&lt;/code&gt; sorts before &lt;code&gt;'2'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Stored as &lt;code&gt;INTEGER&lt;/code&gt;: &lt;code&gt;ORDER BY score&lt;/code&gt; compares the numeric value; &lt;code&gt;2 &amp;lt; 9 &amp;lt; 10 &amp;lt; 100&lt;/code&gt; — the human-expected order.&lt;/li&gt;
&lt;li&gt;The query is &lt;strong&gt;identical&lt;/strong&gt; in both cases; only the &lt;strong&gt;column type&lt;/strong&gt; changed the answer.&lt;/li&gt;
&lt;li&gt;The bug is invisible until someone audits the leaderboard and notices &lt;code&gt;"9"&lt;/code&gt; ranked above &lt;code&gt;"100"&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Always store ordinal-comparable values in a numeric type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;leaderboard&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;leaderboard&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you ever compare values with &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, or &lt;code&gt;BETWEEN&lt;/code&gt;, the type must support those operators &lt;em&gt;natively&lt;/em&gt; — never rely on string sort for numbers or dates.&lt;/p&gt;
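&lt;p&gt;The flip is easy to reproduce in psql: the same operator gives opposite answers depending only on the operand types.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT '10'::TEXT &amp;lt; '2'  AS text_cmp,  -- true: character-by-character compare
       10 &amp;lt; 2            AS int_cmp;   -- false: numeric compare
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;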

&lt;h4&gt;Index operator classes and planner statistics&lt;/h4&gt;

&lt;p&gt;The index invariant: &lt;strong&gt;a B-tree index is built against an &lt;em&gt;operator class&lt;/em&gt; tied to a specific type; when a query casts the indexed column itself to another type, the comparison no longer matches that operator class, so the planner must evaluate the cast on every row and scan instead of seek&lt;/strong&gt;. The right type matches the index; the wrong type silently disables it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CREATE INDEX … ON t (col)&lt;/code&gt;&lt;/strong&gt; — default B-tree, uses the type's default operator class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col = $1&lt;/code&gt; with matching type&lt;/strong&gt; — index seek.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col = $1::other_type&lt;/code&gt;&lt;/strong&gt; — index seek when the cast is on the &lt;strong&gt;literal&lt;/strong&gt; side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col::other_type = $1&lt;/code&gt;&lt;/strong&gt; — sequential scan; you cast the column, not the value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;user_id BIGINT&lt;/code&gt; column with a B-tree index, queried two ways.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;plan&lt;/th&gt;
&lt;th&gt;rows scanned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id = 42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Index Scan&lt;/td&gt;
&lt;td&gt;~1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id = '42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Index Scan (literal cast)&lt;/td&gt;
&lt;td&gt;~1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id::text = '42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seq Scan&lt;/td&gt;
&lt;td&gt;full table&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = 42&lt;/code&gt; — both sides are &lt;code&gt;BIGINT&lt;/code&gt;; planner uses the B-tree directly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = '42'&lt;/code&gt; — PostgreSQL coerces the string literal &lt;code&gt;'42'&lt;/code&gt; to &lt;code&gt;BIGINT&lt;/code&gt; (since &lt;code&gt;BIGINT&lt;/code&gt; is the indexed side); index still usable.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id::text = '42'&lt;/code&gt; — the cast is on the &lt;em&gt;column&lt;/em&gt;; PostgreSQL would have to apply the &lt;code&gt;::text&lt;/code&gt; function to every row to compare; the B-tree on &lt;code&gt;user_id&lt;/code&gt; cannot help.&lt;/li&gt;
&lt;li&gt;The third predicate triggers a full sequential scan even though an index "exists on &lt;code&gt;user_id&lt;/code&gt;."&lt;/li&gt;
&lt;li&gt;Diagnosis is an &lt;code&gt;EXPLAIN&lt;/code&gt; away: &lt;code&gt;Seq Scan on … Filter: ((user_id)::text = '42'::text)&lt;/code&gt; is the giveaway.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Keep casts on the literal side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- good: cast literal, index used&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;-- literal '42' coerced to BIGINT&lt;/span&gt;

&lt;span class="c1"&gt;-- bad: cast column, index killed&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you see a &lt;code&gt;::&lt;/code&gt; on a column inside a &lt;code&gt;WHERE&lt;/code&gt; or &lt;code&gt;JOIN&lt;/code&gt;, expect a seq scan and ask whether the underlying type should change.&lt;/p&gt;
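&lt;p&gt;The quickest diagnosis is to put &lt;code&gt;EXPLAIN&lt;/code&gt; in front of the suspect query (sketched here against the &lt;code&gt;events&lt;/code&gt; table from earlier; plan output abbreviated) and look for the cast in the filter line:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;EXPLAIN SELECT * FROM events WHERE user_id::text = '42';
-- Seq Scan on events
--   Filter: ((user_id)::text = '42'::text)
-- the column-side cast in the Filter line is the giveaway
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;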

&lt;h4&gt;Common beginner mistakes&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Declaring every text column as &lt;code&gt;VARCHAR(255)&lt;/code&gt; "just in case" — wastes nothing on storage but lies in the schema about the real constraint.&lt;/li&gt;
&lt;li&gt;Storing numeric IDs as &lt;code&gt;TEXT&lt;/code&gt; because the source CSV had quotes — every downstream comparison and index becomes a hazard.&lt;/li&gt;
&lt;li&gt;Mixing &lt;code&gt;TIMESTAMP&lt;/code&gt; and &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; in joins — comparison depends on the session time zone; you have written a query that returns different rows for different users.&lt;/li&gt;
&lt;li&gt;Treating implicit coercion as free — the planner often buries the cost in a seq scan that only &lt;code&gt;EXPLAIN&lt;/code&gt; reveals.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;CHECK&lt;/code&gt; constraints because "the application handles it" — types and constraints together are the only durable schema.&lt;/li&gt;
&lt;/ul&gt;
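&lt;p&gt;The &lt;code&gt;TIMESTAMP&lt;/code&gt;-vs-&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; trap from the list above takes four lines to reproduce; the same comparison flips when the session time zone changes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SET TIME ZONE 'UTC';
SELECT TIMESTAMP '2026-01-01 00:00' = TIMESTAMPTZ '2026-01-01 00:00+00';  -- true

SET TIME ZONE 'America/New_York';
SELECT TIMESTAMP '2026-01-01 00:00' = TIMESTAMPTZ '2026-01-01 00:00+00';  -- false
-- the naive TIMESTAMP is reinterpreted in whatever zone the session happens to use
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;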

&lt;h3&gt;SQL Interview Question on Picking Types for an Orders Schema&lt;/h3&gt;

&lt;p&gt;A junior teammate sends a &lt;code&gt;CREATE TABLE orders&lt;/code&gt; script: &lt;code&gt;order_id VARCHAR(255)&lt;/code&gt;, &lt;code&gt;total FLOAT&lt;/code&gt;, &lt;code&gt;customer_id TEXT&lt;/code&gt;, &lt;code&gt;placed_at TIMESTAMP&lt;/code&gt;. The orders application is global, has ~5 M orders per day, and is joined daily to &lt;code&gt;dim_customer (customer_id BIGINT, …)&lt;/code&gt;. &lt;strong&gt;Identify every type-level risk in this schema and rewrite it so reports stay correct, joins stay indexed, and storage doesn't bloat.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;Solution Using Tight Native Types + &lt;code&gt;NUMERIC&lt;/code&gt; + &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; + &lt;code&gt;CHECK&lt;/code&gt; Constraints&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;     &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;        &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt;       &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;placed_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;   &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placed_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the four problems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;original type&lt;/th&gt;
&lt;th&gt;risk&lt;/th&gt;
&lt;th&gt;fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order_id VARCHAR(255)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;lexicographic sort; wide rows; index mismatch&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BIGSERIAL&lt;/code&gt; / &lt;code&gt;BIGINT&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;total FLOAT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;binary rounding (0.1 + 0.2 ≠ 0.3); aggregates drift&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NUMERIC(14,2)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;customer_id TEXT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;cross-type join with &lt;code&gt;dim_customer.customer_id BIGINT&lt;/code&gt;; seq scan&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BIGINT&lt;/code&gt; + FK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;placed_at TIMESTAMP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;wall-clock semantics; reports differ per session TZ&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; a typed, constrained schema. The daily customer-join now uses a B-tree seek on &lt;code&gt;customer_id&lt;/code&gt;; revenue rollups are exact to the cent; "orders placed today" is unambiguous regardless of the analyst's session time zone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGSERIAL&lt;/code&gt; PK&lt;/strong&gt; — monotonic, 8-byte integer; supports range scans, packs tight, and matches every downstream join.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGINT customer_id&lt;/code&gt; with FK&lt;/strong&gt; — joins are type-identical, the index is usable, and orphan rows are rejected at write time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(14, 2)&lt;/code&gt; for money&lt;/strong&gt; — exact decimal arithmetic; aggregates over millions of rows produce the same total a calculator would.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; for placed_at&lt;/strong&gt; — every value is stored as a UTC instant; display converts to the session TZ; reports never silently shift by 24 h after a deploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CHECK (total &amp;gt;= 0)&lt;/code&gt;&lt;/strong&gt; — durable invariant; even a buggy ETL run cannot insert negative revenue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — a constant few bytes per row vs the original, and &lt;code&gt;O(log N)&lt;/code&gt; index seeks per join instead of the full scans caused by the type mismatch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; for type-fluency reps and the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregations" rel="noopener noreferrer"&gt;aggregation topic&lt;/a&gt; for grain-correct rollups.&lt;/p&gt;





&lt;h2&gt;2. Numeric types&lt;/h2&gt;

&lt;h3&gt;Integers for counts, NUMERIC for money, FLOAT for measurements&lt;/h3&gt;

&lt;p&gt;PostgreSQL splits numeric types into three families: &lt;strong&gt;exact integers&lt;/strong&gt; (&lt;code&gt;SMALLINT&lt;/code&gt;, &lt;code&gt;INTEGER&lt;/code&gt;, &lt;code&gt;BIGINT&lt;/code&gt;), &lt;strong&gt;arbitrary-precision exact decimals&lt;/strong&gt; (&lt;code&gt;NUMERIC(p, s)&lt;/code&gt; / &lt;code&gt;DECIMAL&lt;/code&gt;), and &lt;strong&gt;binary floating point&lt;/strong&gt; (&lt;code&gt;REAL&lt;/code&gt;, &lt;code&gt;DOUBLE PRECISION&lt;/code&gt;). The choice is rarely about precision in the abstract — it's about &lt;em&gt;which arithmetic errors are acceptable&lt;/em&gt;. Integers never lose precision; &lt;code&gt;NUMERIC&lt;/code&gt; is exact at a fixed scale; floats trade precision for speed and are the wrong default for currency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ai4fmsxhgogm1i0xrfu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ai4fmsxhgogm1i0xrfu.jpeg" alt="Side-by-side comparison of PostgreSQL-style integer, floating point, and numeric decimal types for counts versus money with a float rounding warning." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When asked "what type is &lt;code&gt;revenue&lt;/code&gt;?", say &lt;code&gt;NUMERIC(p, s)&lt;/code&gt; and name &lt;code&gt;p&lt;/code&gt; and &lt;code&gt;s&lt;/code&gt; out loud — &lt;code&gt;NUMERIC(14, 2)&lt;/code&gt; for cents up to just under $1 trillion, &lt;code&gt;NUMERIC(18, 4)&lt;/code&gt; for FX rates and basis points. Knowing the scale is what separates "I know decimals exist" from "I have shipped a ledger."&lt;/p&gt;
&lt;/blockquote&gt;
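&lt;p&gt;The rounding behavior that makes floats the wrong default for currency is visible in one query; binary floating point cannot represent 0.1 or 0.2 exactly, while &lt;code&gt;NUMERIC&lt;/code&gt; can:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 0.1::DOUBLE PRECISION + 0.2::DOUBLE PRECISION = 0.3::DOUBLE PRECISION AS float_eq,  -- false
       0.1::NUMERIC + 0.2::NUMERIC = 0.3::NUMERIC                             AS exact_eq;  -- true
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;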

&lt;h4&gt;&lt;code&gt;INTEGER&lt;/code&gt; / &lt;code&gt;BIGINT&lt;/code&gt; — surrogate keys and counts&lt;/h4&gt;

&lt;p&gt;The integer invariant: &lt;strong&gt;&lt;code&gt;INTEGER&lt;/code&gt; is 4 bytes (range ±2.1 B) and &lt;code&gt;BIGINT&lt;/code&gt; is 8 bytes (range ±9.2 quintillion); use &lt;code&gt;INTEGER&lt;/code&gt; for small/medium counts and &lt;code&gt;BIGINT&lt;/code&gt; for surrogate keys, monotonically increasing IDs, and anything that might ever cross 2 billion&lt;/strong&gt;. Overflow is silent in some languages but a hard error in PostgreSQL — once the range is exhausted, every subsequent insert fails with &lt;code&gt;integer out of range&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SMALLINT&lt;/code&gt;&lt;/strong&gt; — 2 bytes; rarely used outside tightly packed enum-like values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTEGER&lt;/code&gt;&lt;/strong&gt; — 4 bytes; default for row counts, scores, age, quantities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGINT&lt;/code&gt;&lt;/strong&gt; — 8 bytes; default for primary keys on growing tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGSERIAL&lt;/code&gt; / &lt;code&gt;GENERATED AS IDENTITY&lt;/code&gt;&lt;/strong&gt; — 8-byte auto-incrementing PK.&lt;/li&gt;
&lt;/ul&gt;
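&lt;p&gt;The boundaries are easy to verify directly in &lt;code&gt;psql&lt;/code&gt;. A quick sketch (the commented-out line is the one that errors):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 2147483647::INTEGER;      -- max INTEGER: OK
-- SELECT 2147483648::INTEGER;   -- ERROR: integer out of range
SELECT 2147483648::BIGINT;       -- same value fits comfortably in BIGINT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;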

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; An events table grows from 1 M to 3 B rows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;year&lt;/th&gt;
&lt;th&gt;events&lt;/th&gt;
&lt;th&gt;INTEGER PK?&lt;/th&gt;
&lt;th&gt;BIGINT PK?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;1 M&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025&lt;/td&gt;
&lt;td&gt;500 M&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;td&gt;2.5 B&lt;/td&gt;
&lt;td&gt;✗ overflow&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;event_id INTEGER&lt;/code&gt; — fits 2.1 B values.&lt;/li&gt;
&lt;li&gt;Daily growth at 5 M / day reaches 2.1 B by mid-2026.&lt;/li&gt;
&lt;li&gt;Next &lt;code&gt;INSERT&lt;/code&gt; fails: &lt;code&gt;ERROR: integer out of range&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Migration requires &lt;code&gt;ALTER TABLE … ALTER COLUMN event_id TYPE BIGINT;&lt;/code&gt; — rewrites the entire table; locks scale with table size.&lt;/li&gt;
&lt;li&gt;Doing this at 2.1 B rows means hours of downtime; doing it at table creation is free.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Use &lt;code&gt;BIGINT&lt;/code&gt; for any growing PK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;        &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every primary key on a table that "might be big someday" is &lt;code&gt;BIGINT&lt;/code&gt; from day one. The 4 extra bytes per row is the cheapest insurance you can buy.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;NUMERIC(p, s)&lt;/code&gt; — exact decimal for currency
&lt;/h4&gt;

&lt;p&gt;The decimal invariant: &lt;strong&gt;&lt;code&gt;NUMERIC(p, s)&lt;/code&gt; stores &lt;code&gt;p&lt;/code&gt; total digits with &lt;code&gt;s&lt;/code&gt; of them after the decimal point; arithmetic is exact at that scale; &lt;code&gt;SUM(NUMERIC)&lt;/code&gt; over millions of rows produces the byte-identical result a careful accountant would compute by hand&lt;/strong&gt;. The cost is performance — &lt;code&gt;NUMERIC&lt;/code&gt; math is slower than integer or float — but for currency the trade-off is settled: exact wins.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(14, 2)&lt;/code&gt;&lt;/strong&gt; — up to 12 digits before the decimal, 2 after; ~$1 T.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(18, 4)&lt;/code&gt;&lt;/strong&gt; — FX rates, fractional cents (interest, allocations).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(38, 6)&lt;/code&gt;&lt;/strong&gt; — analytics-warehouse scale; in the same family as the 38-digit decimals Snowflake and BigQuery use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt; — a few bytes of overhead + 2 bytes per 4 digits; a typical currency value costs ~10 bytes.&lt;/li&gt;
&lt;/ul&gt;
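&lt;p&gt;Casting to a &lt;code&gt;NUMERIC&lt;/code&gt; scale rounds (half away from zero in PostgreSQL), which is worth seeing once before an interview. A quick sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 1.005::NUMERIC(14, 2) AS rounded_up,   -- 1.01: half away from zero
       0.1::NUMERIC(14, 2) * 3 AS exact_sum;  -- 0.30: no binary drift
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;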

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Summing 1,000 invoice lines of &lt;code&gt;$0.10&lt;/code&gt; each:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;storage type&lt;/th&gt;
&lt;th&gt;&lt;code&gt;SUM(amount)&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DOUBLE PRECISION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;99.9999999999986&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NUMERIC(14, 2)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;100.00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;0.1&lt;/code&gt; cannot be represented exactly in binary floating point; the stored value is &lt;code&gt;0.1000000000000000055511…&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Adding 1,000 of these in &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; accumulates &lt;code&gt;1000 * tiny_error&lt;/code&gt;; the result drifts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NUMERIC(14, 2)&lt;/code&gt; stores &lt;code&gt;0.10&lt;/code&gt; literally and adds with decimal arithmetic; 1,000 × &lt;code&gt;0.10&lt;/code&gt; is exactly &lt;code&gt;100.00&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The float error is invisible until a finance lead notices a sub-microcent discrepancy on a reconciliation report.&lt;/li&gt;
&lt;li&gt;Once the column type is &lt;code&gt;NUMERIC&lt;/code&gt;, the drift is impossible by construction.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Currency columns always use &lt;code&gt;NUMERIC&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoice_lines&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;line_id&lt;/span&gt;    &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;      &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;   &lt;span class="nb"&gt;INTEGER&lt;/span&gt;        &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;unit_price&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;line_total&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;unit_price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; anything that touches money, tax, allocations, basis points, or a regulated ledger is &lt;code&gt;NUMERIC(p, s)&lt;/code&gt; — never &lt;code&gt;FLOAT&lt;/code&gt; or &lt;code&gt;DOUBLE PRECISION&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;REAL&lt;/code&gt; / &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; — binary floating point and rounding
&lt;/h4&gt;

&lt;p&gt;The float invariant: &lt;strong&gt;&lt;code&gt;REAL&lt;/code&gt; (4 bytes, ~7 decimal digits) and &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; (8 bytes, ~15 digits) follow IEEE 754; they're fast and compact but inexact at decimal fractions; their natural home is measurements where the underlying quantity is itself approximate (sensor reading, ML feature, scientific magnitude)&lt;/strong&gt;. Floats are not "lossy currency" — they are the right type for things that were never exact to begin with.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;REAL&lt;/code&gt;&lt;/strong&gt; — 4 bytes; ~7 decimal digits of precision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DOUBLE PRECISION&lt;/code&gt;&lt;/strong&gt; — 8 bytes; ~15 digits; PostgreSQL's default &lt;code&gt;FLOAT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;0.1 + 0.2&lt;/code&gt;&lt;/strong&gt; — not exactly &lt;code&gt;0.3&lt;/code&gt; in either; &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; yields &lt;code&gt;0.30000000000000004&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use cases&lt;/strong&gt; — physical measurements, geographic coordinates, ML scores, neural-net outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same 5 sensor readings stored two ways:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;reading&lt;/th&gt;
&lt;th&gt;&lt;code&gt;REAL&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;DOUBLE PRECISION&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;23.7&lt;/td&gt;
&lt;td&gt;23.7&lt;/td&gt;
&lt;td&gt;23.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.1 + 0.2&lt;/td&gt;
&lt;td&gt;0.3 (~0.30000001)&lt;/td&gt;
&lt;td&gt;0.30000000000000004&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3.14159265358979&lt;/td&gt;
&lt;td&gt;3.1415927&lt;/td&gt;
&lt;td&gt;3.141592653589793&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;REAL&lt;/code&gt; rounds aggressively after ~7 digits; fine for a temperature gauge, wrong for a price.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DOUBLE PRECISION&lt;/code&gt; keeps ~15 digits — enough for almost any measurement.&lt;/li&gt;
&lt;li&gt;Neither stores &lt;code&gt;0.1 + 0.2&lt;/code&gt; as exactly &lt;code&gt;0.3&lt;/code&gt; because base-2 cannot represent base-10 tenths.&lt;/li&gt;
&lt;li&gt;Equality (&lt;code&gt;=&lt;/code&gt;) on floats is unsafe; use a tolerance (&lt;code&gt;abs(a - b) &amp;lt; 1e-9&lt;/code&gt;) for "approximately equal."&lt;/li&gt;
&lt;li&gt;For currency, both are wrong — use &lt;code&gt;NUMERIC&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
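&lt;p&gt;The tolerance check from step 4 translates directly into SQL. A sketch, assuming two &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; columns &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; on a hypothetical &lt;code&gt;readings&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT *
FROM readings
WHERE ABS(a - b) &amp;lt; 1e-9;   -- "approximately equal"; never a = b on floats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;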

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Use floats for genuinely approximate measurements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sensor_readings&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;reading_id&lt;/span&gt;   &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;        &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt;    &lt;span class="nb"&gt;BIGINT&lt;/span&gt;           &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temp_celsius&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;           &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you would compare the value with &lt;code&gt;=&lt;/code&gt; and care about the result, it is not a float.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Defaulting all PKs to &lt;code&gt;SERIAL&lt;/code&gt; (32-bit) and discovering the overflow in production years later.&lt;/li&gt;
&lt;li&gt;Storing money in &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; because &lt;code&gt;NUMERIC&lt;/code&gt; "is slow" — the slowdown is invisible to humans; the rounding is not.&lt;/li&gt;
&lt;li&gt;Using bare &lt;code&gt;NUMERIC&lt;/code&gt; with no &lt;code&gt;(p, s)&lt;/code&gt; — works but skips the documentation value of stating the scale.&lt;/li&gt;
&lt;li&gt;Comparing floats with &lt;code&gt;=&lt;/code&gt; instead of a tolerance window.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;INTEGER&lt;/code&gt; for cents (&lt;code&gt;total_cents&lt;/code&gt;) instead of &lt;code&gt;NUMERIC(14, 2)&lt;/code&gt; — works but burdens every read with a &lt;code&gt;/100.0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
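&lt;p&gt;The float-equality mistake is cheap to demonstrate in one query. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 0.1::float8 + 0.2 = 0.3::float8            AS naive_equal,    -- false
       ABS((0.1::float8 + 0.2) - 0.3) &amp;lt; 1e-9    AS tolerant_equal; -- true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;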

&lt;h3&gt;
  
  
  SQL Interview Question on Reconciling a Drifting Invoice Total
&lt;/h3&gt;

&lt;p&gt;The CFO reports that the monthly invoice total in the dashboard disagrees with the source-of-truth ledger by &lt;code&gt;$0.0000034&lt;/code&gt; on average. The dashboard sums an &lt;code&gt;invoice_lines.amount&lt;/code&gt; column declared as &lt;code&gt;DOUBLE PRECISION&lt;/code&gt;. &lt;strong&gt;Identify the cause and propose a schema fix that makes the totals byte-identical to the ledger from now on.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;NUMERIC(14, 4)&lt;/code&gt; + a Generated &lt;code&gt;line_total&lt;/code&gt; Column
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoice_lines&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoice_lines&lt;/span&gt;
    &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;line_total&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- nightly reconciliation&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dash_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoice_lines&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;invoice_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the drift:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;th&gt;running sum (DOUBLE PRECISION)&lt;/th&gt;
&lt;th&gt;running sum (NUMERIC)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.30000000000000004&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;accumulating error&lt;/td&gt;
&lt;td&gt;exact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;99.9999999999986&lt;/td&gt;
&lt;td&gt;100.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; dashboard total per day now matches the ledger exactly at four decimal places (scale 4 covers fractional cents). No silent drift; finance closes the books without manual adjustment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(14, 4)&lt;/code&gt; exact decimal arithmetic&lt;/strong&gt; — every addition stays exact at four decimal places; no IEEE 754 representation error.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generated &lt;code&gt;line_total&lt;/code&gt; column&lt;/strong&gt; — eliminates a class of bugs where the application computes &lt;code&gt;qty * price&lt;/code&gt; and the database computes a slightly different number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;STORED&lt;/code&gt; not &lt;code&gt;VIRTUAL&lt;/code&gt;&lt;/strong&gt; — value is materialised once at write time; reads are plain column reads with no per-row recomputation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tolerance check on the ETL side&lt;/strong&gt; — even with &lt;code&gt;NUMERIC&lt;/code&gt;, reconciliation should compare against the source-of-truth ledger with a 0-tolerance gate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-time &lt;code&gt;ALTER TABLE … USING&lt;/code&gt;&lt;/strong&gt; — converts existing rows in place; from then on the type system makes drift impossible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — single rewrite at migration; per-row &lt;code&gt;NUMERIC&lt;/code&gt; math is ~3× slower than &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; but invisible compared to network and disk costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the structured currency-and-aggregation path see &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. Text and binary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CHAR vs VARCHAR vs TEXT, collations, and BYTEA
&lt;/h3&gt;

&lt;p&gt;PostgreSQL has three character types — &lt;strong&gt;&lt;code&gt;CHAR(n)&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;VARCHAR(n)&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; — and one binary type, &lt;strong&gt;&lt;code&gt;BYTEA&lt;/code&gt;&lt;/strong&gt;. The decision rule is short: use &lt;code&gt;TEXT&lt;/code&gt; unless you have a hard reason to enforce a length cap, and store files outside the database with a URL or object-store key in the column. Most "text" bugs are not about storage at all — they are about &lt;strong&gt;collations&lt;/strong&gt;, which control how text &lt;em&gt;compares&lt;/em&gt; and &lt;em&gt;sorts&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Two strings that look identical can compare unequal under a different collation. When a join "returns no rows" on string keys, your first check after &lt;code&gt;EXPLAIN&lt;/code&gt; is &lt;code&gt;SHOW lc_collate;&lt;/code&gt; and &lt;code&gt;SELECT pg_collation_for(col1)&lt;/code&gt; on both columns.&lt;/p&gt;
&lt;/blockquote&gt;
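&lt;p&gt;The two checks in the tip look like this in practice. A diagnostic sketch (the &lt;code&gt;email&lt;/code&gt; column name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SHOW lc_collate;                 -- database-wide default collation

SELECT table_name, column_name, collation_name
FROM information_schema.columns
WHERE column_name = 'email';     -- per-column override, if any
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;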

&lt;h4&gt;
  
  
  &lt;code&gt;CHAR&lt;/code&gt; vs &lt;code&gt;VARCHAR&lt;/code&gt; vs &lt;code&gt;TEXT&lt;/code&gt; — pick &lt;code&gt;TEXT&lt;/code&gt; unless you need fixed-width
&lt;/h4&gt;

&lt;p&gt;The text invariant: &lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt; and &lt;code&gt;VARCHAR(n)&lt;/code&gt; have the same on-disk representation in PostgreSQL — no padding, no length penalty; the only difference is the &lt;code&gt;(n)&lt;/code&gt; constraint that throws an error on overflow&lt;/strong&gt;. &lt;code&gt;CHAR(n)&lt;/code&gt; pads with spaces to length, costing both storage and surprise: trailing spaces are ignored in &lt;code&gt;CHAR&lt;/code&gt;-to-&lt;code&gt;CHAR&lt;/code&gt; comparisons but can resurface on casts and concatenation, so joins can still misbehave.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CHAR(n)&lt;/code&gt;&lt;/strong&gt; — fixed-width; pads with spaces; stores &lt;code&gt;n&lt;/code&gt; characters (plus a length header).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;VARCHAR(n)&lt;/code&gt;&lt;/strong&gt; — variable-width; rejects values longer than &lt;code&gt;n&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; — variable-width; no length limit (up to 1 GB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;citext&lt;/code&gt; extension&lt;/strong&gt; — case-insensitive text via the &lt;code&gt;citext&lt;/code&gt; type.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Storing &lt;code&gt;"abc"&lt;/code&gt; three ways:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;stored bytes&lt;/th&gt;
&lt;th&gt;trailing pad&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CHAR(5)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;abc&lt;/code&gt; (5 bytes)&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VARCHAR(5)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;abc&lt;/code&gt; (3 bytes)&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TEXT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;abc&lt;/code&gt; (3 bytes)&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;CHAR(5)&lt;/code&gt; stores &lt;code&gt;abc&lt;/code&gt; plus two trailing spaces (5 chars), padding to length.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VARCHAR(5)&lt;/code&gt; stores &lt;code&gt;abc&lt;/code&gt;; would reject &lt;code&gt;abcdef&lt;/code&gt; with a length-violation error.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TEXT&lt;/code&gt; stores &lt;code&gt;abc&lt;/code&gt;; would accept &lt;code&gt;abcdef&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Equality semantics differ: &lt;code&gt;CHAR(5) 'abc' = VARCHAR(5) 'abc'&lt;/code&gt; may be &lt;code&gt;TRUE&lt;/code&gt; but joining a &lt;code&gt;CHAR&lt;/code&gt; column to a &lt;code&gt;VARCHAR&lt;/code&gt; column from another table can still fail when one side preserved trailing whitespace.&lt;/li&gt;
&lt;li&gt;Default to &lt;code&gt;TEXT&lt;/code&gt; — it is the simplest and never accumulates these padding surprises.&lt;/li&gt;
&lt;/ol&gt;
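&lt;p&gt;The padding behaviour above can be observed directly (PostgreSQL strips trailing spaces when a &lt;code&gt;CHAR&lt;/code&gt; value is cast to another string type). A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 'abc'::CHAR(5) = 'abc'  AS char_cmp,  -- TRUE: pad stripped on compare
       'abc  '::TEXT  = 'abc'  AS text_cmp;  -- FALSE: TEXT keeps the spaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;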

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Schema for a free-form &lt;code&gt;bio&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;profiles&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;bio&lt;/span&gt;     &lt;span class="nb"&gt;TEXT&lt;/span&gt;   &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; use &lt;code&gt;VARCHAR(n)&lt;/code&gt; &lt;em&gt;only&lt;/em&gt; when you genuinely want the database to enforce a maximum length (e.g., regulator-imposed &lt;code&gt;description VARCHAR(280)&lt;/code&gt;); otherwise reach for &lt;code&gt;TEXT&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Collations and locale-aware equality
&lt;/h4&gt;

&lt;p&gt;The collation invariant: &lt;strong&gt;a collation is a tuple of (alphabet, sort order, case-sensitivity, accent-sensitivity) that the database applies to every text comparison; the default is usually &lt;code&gt;"C"&lt;/code&gt; (binary) or the OS locale; case-insensitive matching requires either an explicit &lt;code&gt;ICU&lt;/code&gt; collation or the &lt;code&gt;citext&lt;/code&gt; extension&lt;/strong&gt;. Two databases with different locales can disagree on whether &lt;code&gt;'café' = 'cafe'&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;C&lt;/code&gt; collation&lt;/strong&gt; — byte-by-byte; fastest; case- and accent-sensitive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;en_US.UTF-8&lt;/code&gt;&lt;/strong&gt; — locale-aware; sorts &lt;code&gt;'a' &amp;lt; 'B' &amp;lt; 'c'&lt;/code&gt; (case-insensitive primary).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;und-x-icu&lt;/code&gt;&lt;/strong&gt; — ICU root locale; consistent across platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;citext&lt;/code&gt;&lt;/strong&gt; — case-insensitive text type; &lt;code&gt;'ABC' = 'abc'&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; automatically.&lt;/li&gt;
&lt;/ul&gt;
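&lt;p&gt;The explicit ICU route (PostgreSQL 12+) looks like this. A sketch; the collation name is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE COLLATION IF NOT EXISTS ci_compare
    (provider = icu, locale = 'und-u-ks-level2', deterministic = false);

SELECT 'ABC' = 'abc' COLLATE ci_compare;   -- TRUE under this collation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;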

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Joining users by email under different collations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;left email&lt;/th&gt;
&lt;th&gt;right email&lt;/th&gt;
&lt;th&gt;join match (C)&lt;/th&gt;
&lt;th&gt;join match (citext)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Alice@X.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;alice@x.com &lt;/code&gt; (trailing space)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗ (whitespace, not case)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Default &lt;code&gt;C&lt;/code&gt; collation does a byte compare; &lt;code&gt;'A'&lt;/code&gt; (0x41) is not equal to &lt;code&gt;'a'&lt;/code&gt; (0x61).&lt;/li&gt;
&lt;li&gt;Same string with mixed case fails to join in &lt;code&gt;C&lt;/code&gt; even though humans see them as the same email.&lt;/li&gt;
&lt;li&gt;Switching the column type to &lt;code&gt;citext&lt;/code&gt; makes the database compare case-insensitively, and the second row matches.&lt;/li&gt;
&lt;li&gt;Whitespace differences still cause mismatches — &lt;code&gt;citext&lt;/code&gt; does not trim; that requires &lt;code&gt;BTRIM(col)&lt;/code&gt; in ETL.&lt;/li&gt;
&lt;li&gt;Pick one normalization rule (lowercase + trim at write time) and apply it consistently rather than relying on collation alone.&lt;/li&gt;
&lt;/ol&gt;
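&lt;p&gt;Step 5's write-time normalization can be enforced rather than remembered. A sketch, assuming a plain &lt;code&gt;TEXT&lt;/code&gt; email column on &lt;code&gt;users&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- one canonical form at write time
INSERT INTO users (user_id, email)
VALUES (42, LOWER(BTRIM('  Alice@X.com ')));

-- plus a uniqueness guarantee on that same canonical form
CREATE UNIQUE INDEX users_email_norm_idx
    ON users (LOWER(BTRIM(email)));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;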

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Use &lt;code&gt;citext&lt;/code&gt; for emails and usernames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;citext&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt;   &lt;span class="n"&gt;CITEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you ever want &lt;code&gt;'Foo' = 'foo'&lt;/code&gt; to be &lt;code&gt;TRUE&lt;/code&gt;, set that contract at the column type, not at every &lt;code&gt;LOWER(...)&lt;/code&gt; call site.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;BYTEA&lt;/code&gt; for binary blobs vs URL-in-SQL for files
&lt;/h4&gt;

&lt;p&gt;The binary invariant: &lt;strong&gt;&lt;code&gt;BYTEA&lt;/code&gt; stores raw bytes (hashes, signatures, compressed payloads, small binary tokens); large blobs (images, PDFs, ML model weights) belong in object storage (S3, GCS) with a &lt;code&gt;TEXT&lt;/code&gt; URL or key in SQL&lt;/strong&gt;. Databases are not file systems — every byte stored in &lt;code&gt;BYTEA&lt;/code&gt; slows backups and replication and bloats the buffer cache.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BYTEA&lt;/code&gt;&lt;/strong&gt; — variable-length binary; up to 1 GB but typically used for ≤ 10 KB tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SHA-256&lt;/code&gt; hash&lt;/strong&gt; — 32 bytes; perfect &lt;code&gt;BYTEA&lt;/code&gt; use case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large files&lt;/strong&gt; — store in S3; keep &lt;code&gt;s3_key TEXT&lt;/code&gt; in SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pg_largeobject&lt;/code&gt;&lt;/strong&gt; — legacy API; rarely worth the complexity vs object storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;documents&lt;/code&gt; table with two design choices:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;total storage&lt;/th&gt;
&lt;th&gt;backup time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;body BYTEA&lt;/code&gt; (10 MB PDFs, 1 M rows)&lt;/td&gt;
&lt;td&gt;≈ 10 TB (TOASTed in the table)
&lt;/td&gt;
&lt;td&gt;hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;s3_key TEXT&lt;/code&gt; (URL only, 1 M rows)&lt;/td&gt;
&lt;td&gt;&amp;lt; 100 MB&lt;/td&gt;
&lt;td&gt;seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storing 10 MB PDFs in &lt;code&gt;BYTEA&lt;/code&gt; puts all bytes in TOAST; the table grows to 10 TB.&lt;/li&gt;
&lt;li&gt;Every &lt;code&gt;pg_dump&lt;/code&gt; reads all 10 TB; backups stretch to hours, not minutes.&lt;/li&gt;
&lt;li&gt;Replication lag grows; HA failover slows.&lt;/li&gt;
&lt;li&gt;Object storage (S3) is purpose-built for large files; the database keeps only a 50-byte &lt;code&gt;s3_key&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Reads still feel "one query" — the application fetches the URL from SQL, then streams the file from S3.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Store files externally; keep the key in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;document_id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sha256&lt;/span&gt;      &lt;span class="n"&gt;BYTEA&lt;/span&gt;     &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;octet_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;s3_key&lt;/span&gt;      &lt;span class="nb"&gt;TEXT&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;uploaded_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the rough threshold is 100 KB — anything above that belongs in object storage; anything below is fine as &lt;code&gt;BYTEA&lt;/code&gt;.&lt;/p&gt;
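&lt;p&gt;As a sketch of the small-binary use case (assuming the &lt;code&gt;documents&lt;/code&gt; table above; the key and payload values are hypothetical, and &lt;code&gt;sha256()&lt;/code&gt; is built in from PostgreSQL 11 onward):&lt;/p&gt;

```sql
-- Hypothetical insert: a 32-byte digest in BYTEA, the file itself in S3.
INSERT INTO documents (user_id, sha256, s3_key)
VALUES (
    42,
    sha256('file contents here'::bytea),   -- exactly 32 raw bytes
    'uploads/2026/05/invoice-42.pdf'
);

-- The CHECK (octet_length(sha256) = 32) constraint rejects anything
-- that is not a full SHA-256 digest.
SELECT octet_length(sha256) FROM documents WHERE user_id = 42;  -- 32
```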

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Declaring text columns as &lt;code&gt;VARCHAR(255)&lt;/code&gt; everywhere — a habit inherited from old MySQL index limits; the 255 buys nothing in modern PostgreSQL, where &lt;code&gt;TEXT&lt;/code&gt; and &lt;code&gt;VARCHAR&lt;/code&gt; perform identically.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;CHAR(n)&lt;/code&gt; and being surprised that &lt;code&gt;'abc' = 'abc '&lt;/code&gt; returns &lt;code&gt;FALSE&lt;/code&gt; once one side is cast to &lt;code&gt;TEXT&lt;/code&gt; — the blank padding only compares equal between two &lt;code&gt;CHAR&lt;/code&gt; values.&lt;/li&gt;
&lt;li&gt;Storing emails as case-sensitive &lt;code&gt;TEXT&lt;/code&gt; and writing &lt;code&gt;LOWER(email) = LOWER($1)&lt;/code&gt; everywhere — set &lt;code&gt;citext&lt;/code&gt; once at the column.&lt;/li&gt;
&lt;li&gt;Putting megabyte payloads in &lt;code&gt;BYTEA&lt;/code&gt; and discovering the cost only when &lt;code&gt;pg_dump&lt;/code&gt; runs for six hours.&lt;/li&gt;
&lt;li&gt;Forgetting to trim whitespace at ingest — &lt;code&gt;'  alice@x.com'&lt;/code&gt; and &lt;code&gt;'alice@x.com'&lt;/code&gt; are different strings to the database.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Reconciling Case-Sensitive Email Joins
&lt;/h3&gt;

&lt;p&gt;A signup flow stores &lt;code&gt;users.email&lt;/code&gt; as &lt;code&gt;TEXT&lt;/code&gt;. The marketing dashboard joins &lt;code&gt;events.email&lt;/code&gt; (also &lt;code&gt;TEXT&lt;/code&gt;) to &lt;code&gt;users.email&lt;/code&gt; to count signed-up users. Roughly 8% of events fail to match even though the user definitely signed up. &lt;strong&gt;Diagnose the cause and propose a column-level fix that prevents recurrence.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;citext&lt;/code&gt; + Normalised Write-Time Email
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;citext&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;   &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;CITEXT&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;  &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;CITEXT&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;-- joins now match regardless of case; rejoin to verify&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the 8% miss:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event email&lt;/th&gt;
&lt;th&gt;user email&lt;/th&gt;
&lt;th&gt;TEXT join&lt;/th&gt;
&lt;th&gt;CITEXT join&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bob@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bob@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;' carol@x.com'&lt;/code&gt; (leading space)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'carol@x.com'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗ (whitespace)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the case-sensitivity portion of the miss disappears (≈ 7%); the remaining ≈ 1% is whitespace, fixed by &lt;code&gt;BTRIM&lt;/code&gt; in the &lt;code&gt;USING&lt;/code&gt; clause at migration and a &lt;code&gt;BEFORE INSERT&lt;/code&gt; trigger going forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CITEXT&lt;/code&gt; columns&lt;/strong&gt; — case-insensitive by construction; downstream queries never have to wrap &lt;code&gt;LOWER(...)&lt;/code&gt; and indexes still work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LOWER(BTRIM(email))&lt;/code&gt; in the &lt;code&gt;USING&lt;/code&gt; clause&lt;/strong&gt; — one-shot normalisation of existing rows during the type change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger or &lt;code&gt;CHECK&lt;/code&gt; enforcement going forward&lt;/strong&gt; — keeps future inserts canonical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No more &lt;code&gt;LOWER(...)&lt;/code&gt; at every query site&lt;/strong&gt; — every analyst joins safely without remembering the casing rule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existing indexes rebuild automatically&lt;/strong&gt; — &lt;code&gt;ALTER COLUMN TYPE&lt;/code&gt; rebuilds the index against the new operator class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — one table rewrite at migration; at query time &lt;code&gt;CITEXT&lt;/code&gt; comparisons pay a small lower-casing overhead versus &lt;code&gt;TEXT&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
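&lt;p&gt;The write-time enforcement mentioned above can be sketched as follows (hypothetical trigger and constraint names; assumes the &lt;code&gt;CITEXT&lt;/code&gt; migration has already run):&lt;/p&gt;

```sql
-- Hypothetical enforcement: keep future emails canonical (lowercase,
-- trimmed) even though CITEXT already compares case-insensitively.
CREATE OR REPLACE FUNCTION normalize_email() RETURNS trigger AS $$
BEGIN
    NEW.email := LOWER(BTRIM(NEW.email));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_normalize_email
    BEFORE INSERT OR UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION normalize_email();

-- Lighter-weight alternative: reject rather than repair.
ALTER TABLE users
    ADD CONSTRAINT email_is_trimmed CHECK (email = BTRIM(email));
```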

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the string-fluency syllabus see &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. Boolean and NULL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Three-valued logic and the &lt;code&gt;WHERE flag&lt;/code&gt; trap
&lt;/h3&gt;

&lt;p&gt;PostgreSQL has a real &lt;strong&gt;&lt;code&gt;BOOLEAN&lt;/code&gt;&lt;/strong&gt; type with three values: &lt;code&gt;TRUE&lt;/code&gt;, &lt;code&gt;FALSE&lt;/code&gt;, and &lt;code&gt;NULL&lt;/code&gt;. The third value is the source of nearly every "where did my rows go?" bug — &lt;code&gt;NULL&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; false; it is &lt;em&gt;unknown&lt;/em&gt;. Filters like &lt;code&gt;WHERE flag&lt;/code&gt; silently exclude &lt;code&gt;NULL&lt;/code&gt; rows, and &lt;code&gt;WHERE NOT flag&lt;/code&gt; excludes them too, so a "true-or-not-true" pair of queries can together miss rows entirely.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Whenever you write a boolean predicate, name the third bucket out loud. "Active users are &lt;code&gt;is_active = TRUE&lt;/code&gt;; bots are &lt;code&gt;is_bot = TRUE&lt;/code&gt;; unknown is &lt;code&gt;IS NULL&lt;/code&gt; and goes into the &lt;em&gt;needs-investigation&lt;/em&gt; drawer." That habit catches the silent-exclusion bug before it ships.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;BOOLEAN&lt;/code&gt; literals, &lt;code&gt;IS TRUE&lt;/code&gt; / &lt;code&gt;IS FALSE&lt;/code&gt; / &lt;code&gt;IS NULL&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The boolean invariant: &lt;strong&gt;&lt;code&gt;WHERE flag&lt;/code&gt; returns rows where the predicate is &lt;code&gt;TRUE&lt;/code&gt;; rows where &lt;code&gt;flag&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt; (unknown) are &lt;em&gt;also&lt;/em&gt; excluded; to include or exclude them deliberately you must use &lt;code&gt;IS NULL&lt;/code&gt; / &lt;code&gt;IS NOT NULL&lt;/code&gt; / &lt;code&gt;IS DISTINCT FROM&lt;/code&gt;&lt;/strong&gt;. Standard SQL three-valued logic treats &lt;code&gt;NULL = anything&lt;/code&gt; as &lt;code&gt;NULL&lt;/code&gt;, which is neither true nor false — and a &lt;code&gt;WHERE&lt;/code&gt; clause keeps only rows that evaluate to &lt;code&gt;TRUE&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TRUE&lt;/code&gt; / &lt;code&gt;FALSE&lt;/code&gt;&lt;/strong&gt; — the two non-null boolean values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; — unknown; not equal to anything (including itself).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IS TRUE&lt;/code&gt; / &lt;code&gt;IS FALSE&lt;/code&gt;&lt;/strong&gt; — three-valued aware; never returns NULL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IS DISTINCT FROM&lt;/code&gt;&lt;/strong&gt; — null-safe comparison that always returns &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;FALSE&lt;/code&gt;, with two NULLs counting as not distinct; use &lt;code&gt;IS NOT DISTINCT FROM&lt;/code&gt; to match nullable join keys.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 5-row &lt;code&gt;events&lt;/code&gt; table with a nullable &lt;code&gt;is_bot&lt;/code&gt; flag:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event_id&lt;/th&gt;
&lt;th&gt;is_bot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;rows kept&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1, 4 (only TRUE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE NOT is_bot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2 (only FALSE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot IS NOT TRUE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2, 3, 5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot IS NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3, 5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
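&lt;p&gt;The two tables above can be reproduced directly with a throwaway table:&lt;/p&gt;

```sql
-- Minimal reproduction of the 5-row example.
CREATE TEMP TABLE events (event_id BIGINT, is_bot BOOLEAN);
INSERT INTO events VALUES
    (1, TRUE), (2, FALSE), (3, NULL), (4, TRUE), (5, NULL);

SELECT event_id FROM events WHERE is_bot;              -- 1, 4
SELECT event_id FROM events WHERE NOT is_bot;          -- 2
SELECT event_id FROM events WHERE is_bot IS NOT TRUE;  -- 2, 3, 5
SELECT event_id FROM events WHERE is_bot IS NULL;      -- 3, 5
```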

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;WHERE is_bot&lt;/code&gt; keeps rows where the predicate is &lt;code&gt;TRUE&lt;/code&gt;; rows 3 and 5 (NULL) are silently dropped.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE NOT is_bot&lt;/code&gt; keeps rows where the negated predicate evaluates to &lt;code&gt;TRUE&lt;/code&gt;; &lt;code&gt;NOT&lt;/code&gt; applied to &lt;code&gt;NULL&lt;/code&gt; yields &lt;code&gt;NULL&lt;/code&gt;, so rows 3 and 5 are &lt;em&gt;still&lt;/em&gt; silently dropped.&lt;/li&gt;
&lt;li&gt;The dashboard "Bots vs non-bots" pair (&lt;code&gt;is_bot&lt;/code&gt; true / &lt;code&gt;NOT is_bot&lt;/code&gt;) sums to 3 rows, not 5 — two rows are missing in plain sight.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;IS NOT TRUE&lt;/code&gt; is three-valued aware: it returns &lt;code&gt;TRUE&lt;/code&gt; for rows 2, 3, 5 — both the false ones and the nulls.&lt;/li&gt;
&lt;li&gt;Pick the form that matches your intent and audit any dashboard that splits a column on a boolean.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Three-valued-aware predicates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- bots&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- non-bots, including unknown&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- only unknown&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never write &lt;code&gt;WHERE flag&lt;/code&gt; or &lt;code&gt;WHERE NOT flag&lt;/code&gt; on a nullable boolean column without consciously deciding what NULL means.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;NOT col&lt;/code&gt; vs &lt;code&gt;col = FALSE&lt;/code&gt; with NULLs
&lt;/h4&gt;

&lt;p&gt;The negation invariant: &lt;strong&gt;&lt;code&gt;col = FALSE&lt;/code&gt; and &lt;code&gt;NOT col&lt;/code&gt; are logically the same when &lt;code&gt;col&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;FALSE&lt;/code&gt;, but both evaluate to &lt;code&gt;NULL&lt;/code&gt; when &lt;code&gt;col IS NULL&lt;/code&gt; — and a &lt;code&gt;WHERE&lt;/code&gt; clause keeps only &lt;code&gt;TRUE&lt;/code&gt;, so both forms silently drop nulls&lt;/strong&gt;. The fix is &lt;code&gt;COALESCE(col, FALSE)&lt;/code&gt; or &lt;code&gt;IS NOT TRUE&lt;/code&gt;, which collapse NULL into a definite answer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE col = FALSE&lt;/code&gt;&lt;/strong&gt; — keeps rows where &lt;code&gt;col&lt;/code&gt; is literally &lt;code&gt;FALSE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE NOT col&lt;/code&gt;&lt;/strong&gt; — same; both drop NULL rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE COALESCE(col, FALSE) = FALSE&lt;/code&gt;&lt;/strong&gt; — treats NULL as FALSE; keeps both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE col IS NOT TRUE&lt;/code&gt;&lt;/strong&gt; — treats NULL as not-true; keeps both.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same &lt;code&gt;events&lt;/code&gt; table; analyst writes "all non-bot events":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;th&gt;comment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot = FALSE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;row 2 only — silent miss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE NOT is_bot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;identical; same bug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot IS NOT TRUE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;rows 2, 3, 5 — correct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE COALESCE(is_bot, FALSE) = FALSE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;also correct&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Marketing asks "how many non-bot events?"; analyst writes &lt;code&gt;WHERE NOT is_bot&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Result is 1; marketing thinks bots account for 4 of 5 events.&lt;/li&gt;
&lt;li&gt;A second analyst writes &lt;code&gt;WHERE is_bot IS NOT TRUE&lt;/code&gt; and gets 3; the difference is the NULL rows.&lt;/li&gt;
&lt;li&gt;The dashboard's "bot vs non-bot" pie chart silently undercounts by 40%.&lt;/li&gt;
&lt;li&gt;The fix is &lt;em&gt;either&lt;/em&gt; a &lt;code&gt;COALESCE&lt;/code&gt; at query time &lt;em&gt;or&lt;/em&gt; a &lt;code&gt;NOT NULL DEFAULT FALSE&lt;/code&gt; constraint at schema time — both make the NULL case explicit.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Default boolean columns to a known value at write time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- queries are now safe&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a boolean has no "unknown" business meaning, declare it &lt;code&gt;NOT NULL DEFAULT FALSE&lt;/code&gt; and remove the third bucket entirely.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;COALESCE&lt;/code&gt; and explicit NULL handling
&lt;/h4&gt;

&lt;p&gt;The COALESCE invariant: &lt;strong&gt;&lt;code&gt;COALESCE(a, b, c)&lt;/code&gt; returns the first non-NULL argument; it is the simplest way to replace NULL with a default in &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, and aggregations — but use it deliberately, because hiding NULL is the same as throwing away information&lt;/strong&gt;. The right pattern is to decide whether NULL means "no answer" or "definitely false," then code that intent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(col, default)&lt;/code&gt;&lt;/strong&gt; — first non-NULL argument.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULLIF(a, b)&lt;/code&gt;&lt;/strong&gt; — returns NULL when &lt;code&gt;a = b&lt;/code&gt;; useful for "treat empty string as NULL."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;a IS DISTINCT FROM b&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;TRUE&lt;/code&gt; when values differ, treating NULL as a real value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(col)&lt;/code&gt;&lt;/strong&gt; — ignores NULLs; &lt;code&gt;COUNT(col)&lt;/code&gt; ignores NULLs; &lt;code&gt;COUNT(*)&lt;/code&gt; includes them.&lt;/li&gt;
&lt;/ul&gt;
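&lt;p&gt;The &lt;code&gt;NULLIF&lt;/code&gt; pattern for "treat empty string as NULL" composes with &lt;code&gt;BTRIM&lt;/code&gt; and &lt;code&gt;COALESCE&lt;/code&gt;; a sketch (the inline &lt;code&gt;VALUES&lt;/code&gt; rows and column names are illustrative):&lt;/p&gt;

```sql
-- Collapse '' and whitespace-only strings to NULL at ingest,
-- then apply a default only where one makes business sense.
SELECT
    NULLIF(BTRIM(raw_email), '')                       AS email_or_null,
    COALESCE(NULLIF(BTRIM(raw_email), ''), 'unknown')  AS email_or_default
FROM (VALUES ('a@x.com'), ('   '), ('')) AS t(raw_email);
-- row 1: 'a@x.com' / 'a@x.com'
-- row 2: NULL      / 'unknown'
-- row 3: NULL      / 'unknown'
```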

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Summing &lt;code&gt;score&lt;/code&gt; where some rows are NULL:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;row&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;expression&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUM(score)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUM(COALESCE(score, 0))&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVG(score)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;15 (n=2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVG(COALESCE(score, 0))&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10 (n=3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;SUM&lt;/code&gt; ignores NULLs by SQL convention; you get the same answer with or without &lt;code&gt;COALESCE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVG&lt;/code&gt; divides by &lt;code&gt;COUNT(non-NULL)&lt;/code&gt;; ignoring NULL gives 15, treating NULL as zero gives 10.&lt;/li&gt;
&lt;li&gt;The "right" answer depends on what NULL means — &lt;em&gt;missing measurement&lt;/em&gt; (use 15) vs &lt;em&gt;zero score&lt;/em&gt; (use 10).&lt;/li&gt;
&lt;li&gt;Always make the choice explicit; do not let a downstream consumer guess.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;IS DISTINCT FROM&lt;/code&gt; is the safe way to compare keys that may be NULL: &lt;code&gt;a IS DISTINCT FROM b&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; when one is NULL and the other is not.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Choose the aggregation rule that matches the business question:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- "average of measurements we have"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- "average where missing means zero"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every &lt;code&gt;COALESCE&lt;/code&gt; should answer the question "what should the missing row contribute?" in one sentence — if you cannot answer, do not coalesce.&lt;/p&gt;
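&lt;p&gt;The null-safe comparison from step 5 can be seen in isolation:&lt;/p&gt;

```sql
-- '=' propagates NULL; IS [NOT] DISTINCT FROM always returns
-- TRUE or FALSE, treating NULL as a real, comparable value.
SELECT
    NULL = NULL,                     -- NULL (unknown)
    NULL IS NOT DISTINCT FROM NULL,  -- TRUE  (two NULLs count as equal)
    1    IS DISTINCT FROM NULL;      -- TRUE  (definitely different)
```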

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Writing &lt;code&gt;WHERE flag = FALSE&lt;/code&gt; and assuming it includes NULL rows.&lt;/li&gt;
&lt;li&gt;Pairing &lt;code&gt;WHERE flag&lt;/code&gt; with &lt;code&gt;WHERE NOT flag&lt;/code&gt; and expecting the row counts to sum to the table size.&lt;/li&gt;
&lt;li&gt;Storing booleans as &lt;code&gt;'Y'&lt;/code&gt; / &lt;code&gt;'N'&lt;/code&gt; strings — every comparison becomes a &lt;code&gt;LOWER(...)&lt;/code&gt; hazard; use real &lt;code&gt;BOOLEAN&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Forgetting that &lt;code&gt;NULL = NULL&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, not &lt;code&gt;TRUE&lt;/code&gt; — join keys with NULL need &lt;code&gt;IS NOT DISTINCT FROM&lt;/code&gt; (which treats two NULLs as equal) or pre-coalesced values.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;AVG&lt;/code&gt; over a nullable column without deciding whether missing means zero or excluded.&lt;/li&gt;
&lt;/ul&gt;
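&lt;p&gt;Fixing the &lt;code&gt;'Y'&lt;/code&gt; / &lt;code&gt;'N'&lt;/code&gt; anti-pattern is a one-time migration; a sketch with a hypothetical &lt;code&gt;accounts.is_active&lt;/code&gt; column:&lt;/p&gt;

```sql
-- Hypothetical migration from a 'Y'/'N' text flag to a real BOOLEAN.
-- The USING clause maps each legacy value explicitly; anything
-- unexpected becomes NULL and can be audited afterwards.
ALTER TABLE accounts
    ALTER COLUMN is_active TYPE BOOLEAN
    USING CASE UPPER(BTRIM(is_active))
              WHEN 'Y' THEN TRUE
              WHEN 'N' THEN FALSE
          END;
```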

&lt;h3&gt;
  
  
  SQL Interview Question on a Dashboard Missing 12% of Rows
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;events.is_bot BOOLEAN&lt;/code&gt; column is nullable. The dashboard splits "bots vs humans" with &lt;code&gt;WHERE is_bot&lt;/code&gt; and &lt;code&gt;WHERE NOT is_bot&lt;/code&gt;. The two row counts sum to 88% of the table; nobody can explain where the missing 12% went. &lt;strong&gt;Identify the cause and produce a single query pair that correctly partitions every row.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;IS TRUE&lt;/code&gt; / &lt;code&gt;IS NOT TRUE&lt;/code&gt; + a Schema-Level &lt;code&gt;NOT NULL&lt;/code&gt; Fix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- short-term query-side fix&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;bots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;humans_or_unknown&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- long-term schema fix&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE is_bot&lt;/code&gt; (old)&lt;/td&gt;
&lt;td&gt;12,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE NOT is_bot&lt;/code&gt; (old)&lt;/td&gt;
&lt;td&gt;76,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;sum&lt;/td&gt;
&lt;td&gt;88,000 of 100,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;missing&lt;/td&gt;
&lt;td&gt;12,000 rows where &lt;code&gt;is_bot IS NULL&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot IS NOT TRUE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;88,000 — both FALSE and NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bots + humans_or_unknown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100,000 ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the two-bucket dashboard sums to 100% of rows. Schema-level &lt;code&gt;NOT NULL DEFAULT FALSE&lt;/code&gt; makes future regression impossible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IS TRUE&lt;/code&gt; / &lt;code&gt;IS NOT TRUE&lt;/code&gt; are three-valued safe&lt;/strong&gt; — they never return NULL; the &lt;code&gt;WHERE&lt;/code&gt; clause keeps exactly the rows the analyst expects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*) FILTER (WHERE …)&lt;/code&gt;&lt;/strong&gt; — single-pass two-bucket aggregation; faster than running two queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UPDATE … WHERE is_bot IS NULL&lt;/code&gt; + &lt;code&gt;SET NOT NULL&lt;/code&gt;&lt;/strong&gt; — one-shot remediation of historical NULLs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DEFAULT FALSE&lt;/code&gt;&lt;/strong&gt; — guarantees new rows start in a definite state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No surprise on rerun&lt;/strong&gt; — the dashboard's "missing 12%" cannot reappear because the column constraints now rule it out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — one &lt;code&gt;UPDATE&lt;/code&gt;; the &lt;code&gt;FILTER&lt;/code&gt; form has the same cost as two separate &lt;code&gt;COUNT&lt;/code&gt;s combined into one scan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the safe-NULL drill set see &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — filtering&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL filtering problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/filtering" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — conditional aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Conditional-aggregation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/conditional-aggregation" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;All SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Date and time
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;DATE&lt;/code&gt;, &lt;code&gt;TIME&lt;/code&gt;, &lt;code&gt;TIMESTAMP&lt;/code&gt;, and &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; — instants vs wall clocks
&lt;/h3&gt;

&lt;p&gt;PostgreSQL splits time into &lt;strong&gt;calendar dates&lt;/strong&gt; (&lt;code&gt;DATE&lt;/code&gt;), &lt;strong&gt;local wall-clock times&lt;/strong&gt; (&lt;code&gt;TIME&lt;/code&gt;), &lt;strong&gt;wall-clock timestamps&lt;/strong&gt; (&lt;code&gt;TIMESTAMP WITHOUT TIME ZONE&lt;/code&gt;), and &lt;strong&gt;absolute instants&lt;/strong&gt; (&lt;code&gt;TIMESTAMP WITH TIME ZONE&lt;/code&gt;, abbreviated &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;). The two-row mental model: &lt;strong&gt;&lt;code&gt;TIMESTAMP&lt;/code&gt; is what a wall clock reads at a particular spot; &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; is a point on the global timeline&lt;/strong&gt;. Every cross-region bug comes from picking the first when you wanted the second.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftibdnbfqmjb4a0uj0xrr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftibdnbfqmjb4a0uj0xrr.jpeg" alt="Diagram contrasting PostgreSQL TIMESTAMP without time zone and TIMESTAMPTZ with UTC and local clock icons and a caution on wall-clock ambiguity." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Default every event-instant column to &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; and use &lt;code&gt;TIMESTAMP&lt;/code&gt; only when the time is intentionally &lt;em&gt;local&lt;/em&gt; (a "9:00 AM recurring meeting" in the user's locale). Reporting that crosses regions becomes obviously correct or obviously wrong, with no middle ground.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;TIMESTAMP&lt;/code&gt; without time zone — local wall-clock semantics
&lt;/h4&gt;

&lt;p&gt;The wall-clock invariant: &lt;strong&gt;&lt;code&gt;TIMESTAMP&lt;/code&gt; stores the literal datetime you gave it with no time-zone metadata; "2026-04-13 09:00:00" means 9:00 local &lt;em&gt;wherever you happen to be&lt;/em&gt;; comparing two &lt;code&gt;TIMESTAMP&lt;/code&gt; values is correct only if both came from the same time zone&lt;/strong&gt;. It is the right type for "9:00 morning meeting in the user's local time" — and the wrong type for "the moment the user clicked Pay."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMP&lt;/code&gt; storage&lt;/strong&gt; — 8 bytes; no time-zone info.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NOW()&lt;/code&gt; returns &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/strong&gt; — casting it to &lt;code&gt;TIMESTAMP&lt;/code&gt; silently strips the zone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparison&lt;/strong&gt; — two &lt;code&gt;TIMESTAMP&lt;/code&gt;s compare by literal value, regardless of zones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case&lt;/strong&gt; — "every Monday at 09:00 local time" recurring schedules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Storing a 09:00 morning meeting for two users in different zones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user&lt;/th&gt;
&lt;th&gt;wall-clock time&lt;/th&gt;
&lt;th&gt;TIMESTAMP value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice (NYC)&lt;/td&gt;
&lt;td&gt;9:00 AM EDT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 09:00:00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob (Tokyo)&lt;/td&gt;
&lt;td&gt;9:00 AM JST&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 09:00:00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Both rows look identical because the type carries no zone — the database just stores the digits the application sent.&lt;/li&gt;
&lt;li&gt;Both meetings happen at "9:00 AM local"; they are &lt;em&gt;not&lt;/em&gt; the same UTC instant (13 hours apart).&lt;/li&gt;
&lt;li&gt;A query like &lt;code&gt;SELECT * FROM recurring_meetings WHERE start_at = '2026-04-13 09:00:00'&lt;/code&gt; returns both rows; that is the right answer for a "9 AM morning meetings" report.&lt;/li&gt;
&lt;li&gt;If the same column had been &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;, the two values would have been stored as different UTC instants and the report would have returned one of them or neither, depending on session settings.&lt;/li&gt;
&lt;li&gt;Pick &lt;code&gt;TIMESTAMP&lt;/code&gt; only when the wall-clock semantics are the actual business rule.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Recurring local-time schedule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;recurring_meetings&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;meeting_id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;    &lt;span class="nb"&gt;BIGINT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;local_tz&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;-- 'America/New_York'&lt;/span&gt;
    &lt;span class="n"&gt;start_at&lt;/span&gt;   &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;                   &lt;span class="c1"&gt;-- intentional wall-clock&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your column answers the question "what should the clock on the wall read?", use &lt;code&gt;TIMESTAMP&lt;/code&gt;; otherwise use &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; — UTC instant, session display
&lt;/h4&gt;

&lt;p&gt;The instant invariant: &lt;strong&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; stores every value as a UTC instant internally (8 bytes), regardless of the time-zone literal in the &lt;code&gt;INSERT&lt;/code&gt;; output is converted to the session's &lt;code&gt;TimeZone&lt;/code&gt; at read time; comparison is always instant-to-instant&lt;/strong&gt;. Same data ships to every region and every report agrees on "when did this happen."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; storage&lt;/strong&gt; — 8 bytes; internal representation is UTC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INSERT … TIMESTAMPTZ '2026-04-13 09:00 EDT'&lt;/code&gt;&lt;/strong&gt; — stored as 13:00 UTC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SET TimeZone = 'Asia/Tokyo'&lt;/code&gt;&lt;/strong&gt; then &lt;code&gt;SELECT ts&lt;/code&gt; — outputs &lt;code&gt;2026-04-13 22:00:00+09&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AT TIME ZONE&lt;/code&gt;&lt;/strong&gt; — converts between zones in a query.&lt;/li&gt;
&lt;/ul&gt;
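
&lt;p&gt;The same round trip is easy to reproduce in &lt;code&gt;psql&lt;/code&gt; — a sketch assuming the &lt;code&gt;events&lt;/code&gt; table has a &lt;code&gt;ts TIMESTAMPTZ&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- one instant in; rendering follows the session's TimeZone
INSERT INTO events (ts) VALUES (TIMESTAMPTZ '2026-04-13 09:00:00 EDT');  -- stored as 13:00 UTC

SET TimeZone = 'UTC';
SELECT ts FROM events;   -- 2026-04-13 13:00:00+00

SET TimeZone = 'Asia/Tokyo';
SELECT ts FROM events;   -- 2026-04-13 22:00:00+09
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;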

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same UTC instant viewed from three zones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;session TimeZone&lt;/th&gt;
&lt;th&gt;what &lt;code&gt;SELECT ts FROM events WHERE id = 1&lt;/code&gt; shows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UTC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 13:00:00+00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;America/New_York&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 09:00:00-04&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Asia/Tokyo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 22:00:00+09&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The instant &lt;code&gt;2026-04-13 13:00:00 UTC&lt;/code&gt; was inserted once into the table.&lt;/li&gt;
&lt;li&gt;The on-disk representation is a single 8-byte number — UTC microseconds since the epoch.&lt;/li&gt;
&lt;li&gt;Each session reads the same row, but the display function converts that instant to the session's &lt;code&gt;TimeZone&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The underlying data is identical; the &lt;em&gt;rendering&lt;/em&gt; differs.&lt;/li&gt;
&lt;li&gt;Cross-region reports stay correct because every comparison happens on the stored UTC value, not the displayed string.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Event-instant column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;click_id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;     &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;       &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24 hours'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every "when did the event happen?" column is &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;; never &lt;code&gt;TIMESTAMP&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;AT TIME ZONE&lt;/code&gt; conversions and &lt;code&gt;DATE_TRUNC&lt;/code&gt; pitfalls
&lt;/h4&gt;

&lt;p&gt;The conversion invariant: &lt;strong&gt;&lt;code&gt;ts AT TIME ZONE 'America/New_York'&lt;/code&gt; converts a &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; to a wall-clock &lt;code&gt;TIMESTAMP&lt;/code&gt; in that zone, and the &lt;em&gt;reverse&lt;/em&gt; (&lt;code&gt;TIMESTAMP AT TIME ZONE 'America/New_York'&lt;/code&gt;) interprets the wall-clock value as local time in that zone and returns the corresponding &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; instant&lt;/strong&gt;; &lt;code&gt;DATE_TRUNC('day', ts)&lt;/code&gt; buckets by the session zone's midnight — UTC on most servers — unless you convert first. The pattern for "daily count in the user's local time" is &lt;strong&gt;&lt;code&gt;DATE_TRUNC('day', ts AT TIME ZONE 'America/New_York')&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMPTZ AT TIME ZONE 'zone'&lt;/code&gt;&lt;/strong&gt; → &lt;code&gt;TIMESTAMP&lt;/code&gt; (wall clock).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMP AT TIME ZONE 'zone'&lt;/code&gt;&lt;/strong&gt; → &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; (instant).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC('day', ts)&lt;/code&gt;&lt;/strong&gt; — truncates in the session's &lt;code&gt;TimeZone&lt;/code&gt; (UTC on most servers); usually not what regional reports want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC('day', ts AT TIME ZONE 'zone')&lt;/code&gt;&lt;/strong&gt; — uses local midnight.&lt;/li&gt;
&lt;/ul&gt;
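
&lt;p&gt;Both conversion directions can be checked without any tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- instant to wall clock: TIMESTAMPTZ -&amp;gt; TIMESTAMP
SELECT TIMESTAMPTZ '2026-04-13 13:00:00+00'
       AT TIME ZONE 'America/New_York';   -- 2026-04-13 09:00:00

-- wall clock to instant: TIMESTAMP -&amp;gt; TIMESTAMPTZ
SELECT TIMESTAMP '2026-04-13 09:00:00'
       AT TIME ZONE 'America/New_York';   -- 2026-04-13 13:00:00+00 in a UTC session
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;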

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Daily clicks for a US dashboard:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;click&lt;/th&gt;
&lt;th&gt;UTC &lt;code&gt;ts&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;UTC day&lt;/th&gt;
&lt;th&gt;NY day&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 03:00 UTC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2026-04-13&lt;/td&gt;
&lt;td&gt;2026-04-12 (still 23:00 prev day NY)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 14:00 UTC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2026-04-13&lt;/td&gt;
&lt;td&gt;2026-04-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-14 02:00 UTC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2026-04-14&lt;/td&gt;
&lt;td&gt;2026-04-13 (still 22:00 NY)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;DATE_TRUNC('day', ts)&lt;/code&gt; groups by UTC midnight; click A goes into UTC &lt;code&gt;2026-04-13&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;But the user in NY clicked at 11 PM on April 12; the dashboard credits the wrong calendar day.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ts AT TIME ZONE 'America/New_York'&lt;/code&gt; converts the instant to NY wall-clock: A becomes &lt;code&gt;2026-04-12 23:00&lt;/code&gt;, C becomes &lt;code&gt;2026-04-13 22:00&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DATE_TRUNC('day', ts AT TIME ZONE 'America/New_York')&lt;/code&gt; then buckets by NY midnight; A goes into April 12, B and C into April 13.&lt;/li&gt;
&lt;li&gt;Daily counts now match the user's perception of "yesterday."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Daily report in NY local time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="s1"&gt;'America/New_York'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_ny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a report says "daily" or "monthly," ask whose calendar — and then &lt;code&gt;AT TIME ZONE&lt;/code&gt; before truncating.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Defaulting to &lt;code&gt;TIMESTAMP&lt;/code&gt; "because it's shorter to type" — silently breaks cross-region comparisons after the first deploy abroad.&lt;/li&gt;
&lt;li&gt;Storing &lt;code&gt;TIMESTAMP&lt;/code&gt; and then "adding the time zone in the app" — the database loses the original zone the moment you stored.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DATE_TRUNC('day', ts)&lt;/code&gt; on UTC instants for a regional dashboard — daily counts shift by hours.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;NOW()&lt;/code&gt; interchangeably with &lt;code&gt;CURRENT_DATE&lt;/code&gt; — &lt;code&gt;NOW()&lt;/code&gt; is &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;, &lt;code&gt;CURRENT_DATE&lt;/code&gt; is &lt;code&gt;DATE&lt;/code&gt; in the session's zone.&lt;/li&gt;
&lt;li&gt;Forgetting daylight saving — &lt;code&gt;INTERVAL '24 hours'&lt;/code&gt; is not always "next day at the same wall-clock time."&lt;/li&gt;
&lt;/ul&gt;
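
&lt;p&gt;The daylight-saving bullet is worth seeing once: across a spring-forward boundary, &lt;code&gt;INTERVAL '1 day'&lt;/code&gt; keeps the wall-clock time while &lt;code&gt;INTERVAL '24 hours'&lt;/code&gt; adds exact elapsed time. A sketch (US clocks spring forward on 2026-03-08):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SET TimeZone = 'America/New_York';

SELECT TIMESTAMPTZ '2026-03-07 09:00:00' + INTERVAL '24 hours' AS plus_24_hours, -- 2026-03-08 10:00:00-04
       TIMESTAMPTZ '2026-03-07 09:00:00' + INTERVAL '1 day'    AS plus_1_day;    -- 2026-03-08 09:00:00-04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;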

&lt;h3&gt;
  
  
  SQL Interview Question on a Dashboard That Shifted 24 Hours After Deploy
&lt;/h3&gt;

&lt;p&gt;The team deploys their analytics pipeline to a new region; the next morning the "orders today" dashboard shows yesterday's total. Storage is &lt;code&gt;placed_at TIMESTAMP&lt;/code&gt; (without time zone). &lt;strong&gt;Diagnose the cause and propose a schema + query fix that survives any future deploy.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; + &lt;code&gt;AT TIME ZONE&lt;/code&gt; in the Reporting View
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;placed_at&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;placed_at&lt;/span&gt; &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="s1"&gt;'America/New_York'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;v_daily_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;placed_at&lt;/span&gt; &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="s1"&gt;'America/New_York'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                                              &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                                            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;original &lt;code&gt;placed_at TIMESTAMP&lt;/code&gt; — interpreted in the application's local zone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;redeploy moves the app to a server in &lt;code&gt;UTC&lt;/code&gt;; the wall clock captured from &lt;code&gt;NOW()&lt;/code&gt; now reads UTC, not NY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;rows inserted post-deploy carry wall-clock values 4 hours ahead of NY time, so evening orders land in the wrong NY-day bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ALTER COLUMN … TYPE TIMESTAMPTZ USING … AT TIME ZONE 'America/New_York'&lt;/code&gt; reinterprets all existing rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;new &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; column stores UTC; the view's &lt;code&gt;AT TIME ZONE&lt;/code&gt; reverses to NY for display&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;dashboard buckets daily counts by NY midnight; results are stable across redeploys&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; "orders today" matches the operations team's intuition regardless of where the application server lives. Future deploys cannot reintroduce the 24-hour shift because the column type now stores instants, not wall clocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; stores UTC&lt;/strong&gt; — the on-disk value is the same regardless of session or server zone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;USING … AT TIME ZONE 'America/New_York'&lt;/code&gt;&lt;/strong&gt; — one-shot reinterpretation of legacy rows during the type migration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AT TIME ZONE&lt;/code&gt; in the view, not the table&lt;/strong&gt; — every report stays explicit about whose calendar it uses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC&lt;/code&gt; on the local wall clock&lt;/strong&gt; — daily buckets align to the user's perception of "today."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable across redeploys&lt;/strong&gt; — server moves do not change the displayed daily count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — one table rewrite for the migration; per-row &lt;code&gt;AT TIME ZONE&lt;/code&gt; is essentially free (microseconds).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions" rel="noopener noreferrer"&gt;date-functions practice topic&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/topic/filtering" rel="noopener noreferrer"&gt;filtering practice topic&lt;/a&gt; for time-aware predicates.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — date functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Date-function problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/date-functions" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — window functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Window-function problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — SQL for DE&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Zero to FAANG SQL fundamentals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Semi-structured and other types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;JSONB&lt;/code&gt;, &lt;code&gt;UUID&lt;/code&gt;, and arrays for flexible attributes
&lt;/h3&gt;

&lt;p&gt;PostgreSQL is a "relational with side quests" database — it has first-class &lt;strong&gt;&lt;code&gt;JSONB&lt;/code&gt;&lt;/strong&gt; (binary, indexable JSON), &lt;strong&gt;&lt;code&gt;UUID&lt;/code&gt;&lt;/strong&gt; (opaque distributed IDs), and &lt;strong&gt;array types&lt;/strong&gt; (&lt;code&gt;INTEGER[]&lt;/code&gt;, &lt;code&gt;TEXT[]&lt;/code&gt;, &lt;code&gt;JSONB[]&lt;/code&gt;) that make schema-flexible patterns possible without giving up SQL. The discipline is to use them deliberately: &lt;code&gt;JSONB&lt;/code&gt; for &lt;em&gt;truly&lt;/em&gt; sparse attributes, &lt;code&gt;UUID&lt;/code&gt; for public/distributed identifiers, arrays for short bounded lists. Reach for them too often and the schema becomes hard to query; avoid them entirely and you end up writing more tables than you need.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Any column that becomes a frequent filter or join key belongs in a real typed column, not nested inside &lt;code&gt;JSONB&lt;/code&gt;. Use &lt;code&gt;JSONB&lt;/code&gt; as the "everything else" bucket for attributes that vary by row.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;JSON&lt;/code&gt; vs &lt;code&gt;JSONB&lt;/code&gt; — when binary indexing matters
&lt;/h4&gt;

&lt;p&gt;The JSONB invariant: &lt;strong&gt;&lt;code&gt;JSON&lt;/code&gt; stores the input text exactly (whitespace, key order, duplicate keys preserved) and reparses on every read; &lt;code&gt;JSONB&lt;/code&gt; stores a binary-decoded representation that is faster to query, supports &lt;code&gt;GIN&lt;/code&gt; indexes, and keeps only the last value for a duplicate key — pay the small write-time cost for read-time speed&lt;/strong&gt;. For event payloads, application config, and flexible user attributes, &lt;code&gt;JSONB&lt;/code&gt; is the default.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;JSON&lt;/code&gt;&lt;/strong&gt; — text-faithful; preserves whitespace and duplicate keys; reparsed on every access, so slower to query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;JSONB&lt;/code&gt;&lt;/strong&gt; — binary; faster reads; canonical (no whitespace, no duplicate keys).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-&amp;gt;&lt;/code&gt;&lt;/strong&gt; — returns &lt;code&gt;JSON&lt;/code&gt; / &lt;code&gt;JSONB&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt;&lt;/strong&gt; — returns &lt;code&gt;TEXT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@&amp;gt;&lt;/code&gt; containment&lt;/strong&gt; — &lt;code&gt;'{"a": 1}'::jsonb @&amp;gt; '{"a": 1}'::jsonb&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GIN&lt;/code&gt; index&lt;/strong&gt; — &lt;code&gt;CREATE INDEX … USING GIN (jsonb_col jsonb_path_ops)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
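&lt;p&gt;A scratch-pad sketch of the operators above, with results shown as &lt;code&gt;psql&lt;/code&gt; prints them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT '{"plan": "pro"}'::jsonb -&amp;gt; 'plan';    -- "pro"  (jsonb, quoted)
SELECT '{"plan": "pro"}'::jsonb -&amp;gt;&amp;gt; 'plan';   -- pro    (text, unquoted)
SELECT '{"plan": "pro", "seats": 5}'::jsonb @&amp;gt; '{"plan": "pro"}'::jsonb;  -- t
SELECT '{"a": 1, "a": 2}'::jsonb;              -- {"a": 2}: the last duplicate key wins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;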

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Searching event payloads for &lt;code&gt;{"plan": "pro"}&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;plan&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payload JSON&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;payload-&amp;gt;&amp;gt;'plan' = 'pro'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payload JSONB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;payload @&amp;gt; '{"plan":"pro"}'::jsonb&lt;/code&gt; (with &lt;code&gt;GIN&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Index Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;With plain &lt;code&gt;JSON&lt;/code&gt;, every row must be parsed at query time to extract the &lt;code&gt;plan&lt;/code&gt; key.&lt;/li&gt;
&lt;li&gt;The planner has no useful index to reach for: there is no &lt;code&gt;GIN&lt;/code&gt; operator class for plain &lt;code&gt;JSON&lt;/code&gt;, so the extraction runs per row.&lt;/li&gt;
&lt;li&gt;Switching the column to &lt;code&gt;JSONB&lt;/code&gt; lets you create a &lt;code&gt;GIN&lt;/code&gt; index on the document.&lt;/li&gt;
&lt;li&gt;The containment query &lt;code&gt;@&amp;gt;&lt;/code&gt; is index-eligible — PostgreSQL probes the GIN structure for documents that contain the requested subtree.&lt;/li&gt;
&lt;li&gt;On a 50 M-row table, the difference is full table scan vs sub-second seek.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Indexed JSONB column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;  &lt;span class="n"&gt;JSONB&lt;/span&gt;     &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;       &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"plan":"pro"}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; default to &lt;code&gt;JSONB&lt;/code&gt; for any "flexible attributes" column; default to a real typed column for any attribute you filter on more than a few times a week.&lt;/p&gt;
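&lt;p&gt;One way to follow that rule of thumb is to promote the hot key into a generated column. A sketch, assuming the &lt;code&gt;events&lt;/code&gt; table above and PostgreSQL 12+ for stored generated columns; the &lt;code&gt;plan&lt;/code&gt; column name is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- promote the frequently filtered key to a real typed column, kept in sync by PostgreSQL
ALTER TABLE events
    ADD COLUMN plan TEXT GENERATED ALWAYS AS (payload -&amp;gt;&amp;gt; 'plan') STORED;
CREATE INDEX events_plan_idx ON events (plan);

-- the hot-path predicate is now an ordinary B-tree lookup, no JSON operators involved
SELECT COUNT(*) FROM events WHERE plan = 'pro';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;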

&lt;h4&gt;
  
  
  &lt;code&gt;UUID&lt;/code&gt; — opaque IDs for distributed systems
&lt;/h4&gt;

&lt;p&gt;The UUID invariant: &lt;strong&gt;&lt;code&gt;UUID&lt;/code&gt; is a 16-byte fixed-width identifier that does not leak ordering or count; ideal for public IDs, multi-region writes, and any context where you don't want consumers inferring growth rate from the sequence; the trade-off vs &lt;code&gt;BIGINT&lt;/code&gt; is ~2× the storage and, for random versions, worse B-tree insert locality than a monotonic key&lt;/strong&gt;. Use UUIDs at the &lt;em&gt;boundary&lt;/em&gt; (URLs, foreign systems) and BIGINTs internally if performance is critical.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UUID&lt;/code&gt; storage&lt;/strong&gt; — 16 bytes; &lt;code&gt;gen_random_uuid()&lt;/code&gt; is built in since PostgreSQL 13 (from &lt;code&gt;pgcrypto&lt;/code&gt; before that).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v4 (random)&lt;/strong&gt; — uniform random; great privacy, bad B-tree locality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v7 (time-ordered)&lt;/strong&gt; — sortable by creation time; better cache behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UUID&lt;/code&gt; vs &lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; — always declare as &lt;code&gt;UUID&lt;/code&gt;; &lt;code&gt;TEXT&lt;/code&gt; UUIDs lose validation and index efficiency.&lt;/li&gt;
&lt;/ul&gt;
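&lt;p&gt;A quick sanity check in &lt;code&gt;psql&lt;/code&gt;, assuming PostgreSQL 13+ for the built-in &lt;code&gt;gen_random_uuid()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- v4 (random) UUID; on older versions, CREATE EXTENSION pgcrypto first
SELECT gen_random_uuid();

-- stored width is 16 bytes, regardless of the 36-character text form
SELECT pg_column_size('8c3b7e2a-0a60-4dd2-9dac-22ce22291bdc'::uuid);  -- 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;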

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two ways to model a public order ID:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;bytes&lt;/th&gt;
&lt;th&gt;URL example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order_id BIGINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/orders/12345678&lt;/code&gt; (leaks volume)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order_id UUID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/orders/8c3b7e2a-…&lt;/code&gt; (opaque)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;BIGINT&lt;/code&gt; is monotonic — scraping a few order URLs lets a competitor infer your daily volume.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UUID&lt;/code&gt; v4 is unguessable; &lt;code&gt;8c3b7e2a-…&lt;/code&gt; carries no information.&lt;/li&gt;
&lt;li&gt;Storage cost: 8 extra bytes per row × millions of rows is meaningful but rarely decisive.&lt;/li&gt;
&lt;li&gt;B-tree locality: random UUIDs spread inserts across the index; v7 (time-ordered) restores append-friendly behavior.&lt;/li&gt;
&lt;li&gt;For most "public ID" use cases, UUID v7 is the clean middle ground.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Internal &lt;code&gt;BIGINT&lt;/code&gt; + public &lt;code&gt;UUID&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;pgcrypto&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;public_id&lt;/span&gt;   &lt;span class="n"&gt;UUID&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; expose UUIDs at the API boundary; keep BIGINT joins inside the database.&lt;/p&gt;

&lt;h4&gt;
  
  
  Arrays — &lt;code&gt;INTEGER[]&lt;/code&gt;, &lt;code&gt;TEXT[]&lt;/code&gt;, and the &lt;code&gt;UNNEST&lt;/code&gt; pattern
&lt;/h4&gt;

&lt;p&gt;The array invariant: &lt;strong&gt;PostgreSQL arrays are first-class typed columns; common operations are &lt;code&gt;ANY (arr)&lt;/code&gt; for membership, &lt;code&gt;arr @&amp;gt; arr&lt;/code&gt; for containment, and &lt;code&gt;UNNEST(arr)&lt;/code&gt; to flatten an array column into rows — useful when the list is &lt;em&gt;short&lt;/em&gt; (≤ ~10 items) and &lt;em&gt;bounded by the row&lt;/em&gt;&lt;/strong&gt;. For unbounded or queried-often lists, a child table is the better design.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTEGER[]&lt;/code&gt;&lt;/strong&gt; — array of integers; literal &lt;code&gt;'{1,2,3}'::int[]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ANY (arr)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;x = ANY ('{1,2,3}'::int[])&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; if &lt;code&gt;x&lt;/code&gt; is in the array.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@&amp;gt;&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;'{1,2,3}'::int[] @&amp;gt; '{2}'&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UNNEST(arr)&lt;/code&gt;&lt;/strong&gt; — produces one row per array element; pivot a row of N elements into N rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;users.role_ids INTEGER[]&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;role_ids&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{10, 20}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{20, 30}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{10}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE 20 = ANY (role_ids)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE role_ids @&amp;gt; '{10, 20}'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT user_id, UNNEST(role_ids)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(1,10), (1,20), (2,20), (2,30), (3,10)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storing roles as &lt;code&gt;INTEGER[]&lt;/code&gt; keeps the user table compact — no separate &lt;code&gt;user_roles&lt;/code&gt; table for a small bounded set.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ANY&lt;/code&gt; is the array-side &lt;code&gt;IN&lt;/code&gt;: it tests membership of one value against the column.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@&amp;gt;&lt;/code&gt; tests whether the column array &lt;em&gt;contains&lt;/em&gt; every element of the right-hand array.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UNNEST&lt;/code&gt; flattens the column into rows; joining &lt;code&gt;UNNEST(role_ids)&lt;/code&gt; to &lt;code&gt;dim_role&lt;/code&gt; produces a per-role row.&lt;/li&gt;
&lt;li&gt;For unbounded role sets (10 K+) the array column gets slow and a child table wins; for typical "a user has 1-5 roles" cases, arrays are clean.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A small bounded list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;       &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;role_ids&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_role&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ANY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_ids&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; arrays for short, bounded, rarely-filtered lists; child tables for everything else.&lt;/p&gt;
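&lt;p&gt;For the "everything else" case, the child-table alternative looks like this; a sketch with hypothetical table and column names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE user_roles (
    user_id BIGINT  NOT NULL,
    role_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, role_id)
);
-- the PK covers user-side lookups; add a role-side index for the reverse direction
CREATE INDEX user_roles_role_idx ON user_roles (role_id);

-- membership test is an ordinary indexed lookup, no array operators needed
SELECT user_id FROM user_roles WHERE role_id = 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;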

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Storing everything as &lt;code&gt;JSONB&lt;/code&gt; because "schemas are hard" — you trade type safety and indexability for write-time convenience.&lt;/li&gt;
&lt;li&gt;Indexing &lt;code&gt;JSON&lt;/code&gt; instead of &lt;code&gt;JSONB&lt;/code&gt; — &lt;code&gt;JSON&lt;/code&gt; cannot use GIN; the index won't help.&lt;/li&gt;
&lt;li&gt;Picking UUID v4 PKs on a high-write table and watching B-tree fragmentation degrade write throughput.&lt;/li&gt;
&lt;li&gt;Treating &lt;code&gt;TEXT&lt;/code&gt; UUIDs the same as &lt;code&gt;UUID&lt;/code&gt; columns — same data, different operator class, broken indexes.&lt;/li&gt;
&lt;li&gt;Storing unbounded lists in arrays — once the array grows large, the value is TOASTed out of line and every read pays the detoasting cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Searching JSONB Payloads at 50 M-Row Scale
&lt;/h3&gt;

&lt;p&gt;A 50 M-row &lt;code&gt;events.payload JSONB&lt;/code&gt; column holds variable payloads. Marketing wants to count events where &lt;code&gt;{"plan": "pro"}&lt;/code&gt; appears in the payload, and the query takes 60 seconds. &lt;strong&gt;Make it return in under 100 ms without changing the storage shape.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a &lt;code&gt;GIN&lt;/code&gt; Index with &lt;code&gt;jsonb_path_ops&lt;/code&gt; + &lt;code&gt;@&amp;gt;&lt;/code&gt; Containment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;events_payload_gin_idx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"plan":"pro"}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;initial query &lt;code&gt;payload-&amp;gt;&amp;gt;'plan' = 'pro'&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;62 s, Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;switch predicate to &lt;code&gt;payload @&amp;gt; '{"plan":"pro"}'::jsonb&lt;/code&gt; (no index yet)&lt;/td&gt;
&lt;td&gt;60 s, still Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CREATE INDEX … USING GIN (payload jsonb_path_ops)&lt;/code&gt; (~5 min build)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;rerun containment query&lt;/td&gt;
&lt;td&gt;85 ms, GIN Index Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; confirms &lt;code&gt;Bitmap Heap Scan on events&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;one-pass&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; 60 s → 85 ms — roughly 700×, nearly three orders of magnitude — with no schema change, no application change, and no data rewrite. &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; shows the &lt;code&gt;GIN&lt;/code&gt; index handling the containment lookup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@&amp;gt;&lt;/code&gt; containment is index-eligible&lt;/strong&gt; — &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; text extraction is not; the operator choice unlocks the index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;jsonb_path_ops&lt;/code&gt;&lt;/strong&gt; — specialised GIN class for containment-only queries; smaller and faster than the default &lt;code&gt;jsonb_ops&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No row rewrite&lt;/strong&gt; — &lt;code&gt;CREATE INDEX&lt;/code&gt; builds a new index without touching the table heap; existing reads are uninterrupted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalises to other keys&lt;/strong&gt; — any future &lt;code&gt;payload @&amp;gt; '{"key":"val"}'&lt;/code&gt; query benefits; no per-key index needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off&lt;/strong&gt; — write throughput drops slightly (GIN updates are heavier than B-tree); usually invisible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — index build is &lt;code&gt;O(N)&lt;/code&gt; one-time; reads become &lt;code&gt;O(log N)&lt;/code&gt; per query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/topic/filtering" rel="noopener noreferrer"&gt;filtering practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; for JSON-flavoured patterns.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — filtering&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL filtering problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/filtering" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;All SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;





&lt;h2&gt;
  
  
  7. Casting and comparison rules
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Implicit coercion, explicit &lt;code&gt;CAST&lt;/code&gt;, and index-friendly predicates
&lt;/h3&gt;

&lt;p&gt;PostgreSQL silently coerces some type mixes (an untyped literal like &lt;code&gt;'42'&lt;/code&gt; is resolved as &lt;code&gt;INTEGER&lt;/code&gt; in an &lt;code&gt;=&lt;/code&gt; context), refuses others, and lets you make the conversion explicit with &lt;strong&gt;&lt;code&gt;CAST(x AS type)&lt;/code&gt;&lt;/strong&gt; or its shorthand &lt;strong&gt;&lt;code&gt;x::type&lt;/code&gt;&lt;/strong&gt;. The high-leverage rule is &lt;em&gt;where&lt;/em&gt; the cast lands: a cast on a &lt;em&gt;literal&lt;/em&gt; is free and index-friendly; a cast on a &lt;em&gt;column&lt;/em&gt; usually disables the index. Mixed-type joins are the canonical cause of "the query returns no rows" and "the query is suddenly 100× slower."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x57pp34luchaqxy70sl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x57pp34luchaqxy70sl.jpeg" alt="Flowchart showing mismatched column types breaking joins or filters until explicit cast or schema alignment with PipeCode brand colors." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When &lt;code&gt;EXPLAIN&lt;/code&gt; reveals &lt;code&gt;Seq Scan on …&lt;/code&gt; on a column you indexed, scan the &lt;code&gt;Filter:&lt;/code&gt; line for a &lt;code&gt;::type&lt;/code&gt; cast. The fix is usually to cast the &lt;em&gt;other&lt;/em&gt; side or — better — to change the source column's type so no cast is needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Implicit coercion — when PostgreSQL guesses
&lt;/h4&gt;

&lt;p&gt;The coercion invariant: &lt;strong&gt;PostgreSQL has a graph of allowed implicit casts (e.g., &lt;code&gt;INTEGER&lt;/code&gt; → &lt;code&gt;BIGINT&lt;/code&gt;, &lt;code&gt;INTEGER&lt;/code&gt; → &lt;code&gt;NUMERIC&lt;/code&gt;) plus untyped-literal resolution (&lt;code&gt;'42'&lt;/code&gt; becomes an &lt;code&gt;INTEGER&lt;/code&gt; next to an integer column) and applies them silently when one side of a binary operator differs from the other; when no implicit path exists, the query fails with an &lt;code&gt;operator does not exist&lt;/code&gt; error&lt;/strong&gt;. Implicit coercion is convenient until it produces a different answer than expected.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTEGER&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;BIGINT&lt;/code&gt;&lt;/strong&gt; — implicit widen; no surprise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;INTEGER&lt;/code&gt;&lt;/strong&gt; — works for literals (&lt;code&gt;WHERE id = '42'&lt;/code&gt;); fails for columns (&lt;code&gt;WHERE t.id = b.id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/strong&gt; — implicit widen via session zone; can shift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BOOLEAN&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;INTEGER&lt;/code&gt;&lt;/strong&gt; — &lt;em&gt;not&lt;/em&gt; allowed; you must cast.&lt;/li&gt;
&lt;/ul&gt;
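&lt;p&gt;Three of the four cases above as a &lt;code&gt;psql&lt;/code&gt; sketch (the &lt;code&gt;DATE&lt;/code&gt; vs &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; case depends on the session time zone, so it is omitted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 42::integer = 42::bigint;   -- t: implicit widen
SELECT 42 = '42';                  -- t: untyped literal resolved as integer
-- SELECT TRUE = 1;                -- ERROR: operator does not exist: boolean = integer
SELECT TRUE = 1::boolean;          -- t: the explicit cast is required
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;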

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Joining a &lt;code&gt;TEXT user_id&lt;/code&gt; to a &lt;code&gt;BIGINT user_id&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;left.user_id (TEXT)&lt;/th&gt;
&lt;th&gt;right.user_id (BIGINT)&lt;/th&gt;
&lt;th&gt;join works?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error / Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'042'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;mismatch (lexicographic ≠ numeric)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;' 42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;mismatch (whitespace)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PostgreSQL needs both sides of &lt;code&gt;=&lt;/code&gt; to be the same type; it tries to coerce.&lt;/li&gt;
&lt;li&gt;Coercing &lt;code&gt;TEXT&lt;/code&gt; → &lt;code&gt;BIGINT&lt;/code&gt; is possible per-value (&lt;code&gt;'42'::BIGINT&lt;/code&gt;), but the planner applies it on the &lt;em&gt;column&lt;/em&gt; — disabling the index.&lt;/li&gt;
&lt;li&gt;Leading zeros, whitespace, and non-digit characters cause the cast to fail mid-query.&lt;/li&gt;
&lt;li&gt;The result is either a hard error or a slow seq scan.&lt;/li&gt;
&lt;li&gt;The fix is &lt;em&gt;upstream&lt;/em&gt;: align the source column types so no cross-type compare is needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Avoid mixed-type joins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- if you must cast, cast at write time, not query time&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;staging_users&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never store an identifier as text on one side and as integer on the other side of a join. Pick one type at the warehouse contract level.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;CAST(x AS type)&lt;/code&gt; vs &lt;code&gt;x::type&lt;/code&gt; shorthand
&lt;/h4&gt;

&lt;p&gt;The CAST invariant: &lt;strong&gt;&lt;code&gt;CAST(x AS type)&lt;/code&gt; and &lt;code&gt;x::type&lt;/code&gt; produce identical output; the longhand is SQL-standard and self-documenting; the shorthand is PostgreSQL idiomatic and shorter in expression-heavy queries&lt;/strong&gt;. Both fail with a clear error when the conversion is illegal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CAST(x AS type)&lt;/code&gt;&lt;/strong&gt; — ANSI SQL; works in every dialect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;x::type&lt;/code&gt;&lt;/strong&gt; — PostgreSQL shorthand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure modes&lt;/strong&gt; — same for both: &lt;code&gt;invalid input syntax for type integer&lt;/code&gt; etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULLIF&lt;/code&gt; + &lt;code&gt;CAST&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;NULLIF(x, '')::INT&lt;/code&gt; collapses empty string to NULL before casting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two equivalent expressions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'42'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'not a number'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;-- ERROR&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;            &lt;span class="c1"&gt;-- NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Both &lt;code&gt;CAST&lt;/code&gt; and &lt;code&gt;::&lt;/code&gt; produce the same output type and the same value.&lt;/li&gt;
&lt;li&gt;Failing input (non-digit string) raises the same error in both forms.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NULLIF(x, '')::TYPE&lt;/code&gt; is the canonical "treat empty string as NULL" pattern.&lt;/li&gt;
&lt;li&gt;In multi-expression SELECTs, &lt;code&gt;::&lt;/code&gt; keeps lines short; in code-review-heavy contexts, &lt;code&gt;CAST&lt;/code&gt; is more legible.&lt;/li&gt;
&lt;li&gt;Use whichever your team's house style prefers; do not mix unnecessarily.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Safe cast for messy ETL data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;raw_payload&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;BTRIM&lt;/code&gt; + &lt;code&gt;NULLIF&lt;/code&gt; + &lt;code&gt;::type&lt;/code&gt; is the three-step safe-cast pattern for noisy inputs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Index-killing casts on indexed columns
&lt;/h4&gt;

&lt;p&gt;The index-killer invariant: &lt;strong&gt;a &lt;code&gt;WHERE&lt;/code&gt; predicate that wraps an indexed column in a function — including an implicit cast — usually forces a sequential scan; the B-tree stores the original column values, so the planner cannot use it to look up the transformed value and falls back to scanning every row&lt;/strong&gt;. The same query rewritten to cast the literal instead is index-eligible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col::type = $1&lt;/code&gt;&lt;/strong&gt; — bad; column cast disables index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col = $1::type&lt;/code&gt;&lt;/strong&gt; — good; literal cast, index used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LOWER(col) = $1&lt;/code&gt;&lt;/strong&gt; — bad unless you build a &lt;em&gt;functional&lt;/em&gt; index on &lt;code&gt;LOWER(col)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col = LOWER($1)&lt;/code&gt;&lt;/strong&gt; — index-eligible, but note the semantics differ: it matches only rows whose stored value is already lowercase.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;user_id BIGINT&lt;/code&gt; column indexed; two predicates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;plan&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id = 42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Index Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id::text = '42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id = '42'::bigint&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Index Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;user_id = 42&lt;/code&gt; matches the type of the indexed column directly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_id::text&lt;/code&gt; applies a function to every row; the B-tree on the original value cannot be used.&lt;/li&gt;
&lt;li&gt;Rewriting as &lt;code&gt;user_id = '42'::bigint&lt;/code&gt; casts the literal once and reuses the existing index.&lt;/li&gt;
&lt;li&gt;If you genuinely need to query &lt;em&gt;by&lt;/em&gt; the casted form, create a functional index: &lt;code&gt;CREATE INDEX ON users ((user_id::text))&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The cheapest fix is almost always to change the data type so no cast is needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Cast the literal, never the column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- good&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;-- literal coerced&lt;/span&gt;
&lt;span class="c1"&gt;-- bad&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- column cast kills the index&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every &lt;code&gt;::&lt;/code&gt; on the indexed side of a &lt;code&gt;WHERE&lt;/code&gt; or &lt;code&gt;JOIN&lt;/code&gt; is a code smell. Investigate before merging.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Joining a &lt;code&gt;TEXT user_id&lt;/code&gt; to a &lt;code&gt;BIGINT user_id&lt;/code&gt; and adding &lt;code&gt;::text&lt;/code&gt; on the BIGINT side — works but disables the index.&lt;/li&gt;
&lt;li&gt;Assuming &lt;code&gt;'042'&lt;/code&gt; and &lt;code&gt;42&lt;/code&gt; always compare equal — as TEXT the comparison is lexicographic and leading zeros are preserved; as INTEGER they are lost.&lt;/li&gt;
&lt;li&gt;Mixing &lt;code&gt;TIMESTAMP&lt;/code&gt; and &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; in joins — answers depend on session TZ.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;LIKE&lt;/code&gt; against a numeric column without realising it forces a &lt;code&gt;::text&lt;/code&gt; cast.&lt;/li&gt;
&lt;li&gt;Forgetting to handle empty strings before casting — &lt;code&gt;''::INT&lt;/code&gt; is a hard error; use &lt;code&gt;NULLIF&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
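
&lt;p&gt;The &lt;code&gt;TIMESTAMP&lt;/code&gt;/&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; mix is easy to demonstrate — the same comparison flips with the session time zone (a minimal sketch to run in psql):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- TIMESTAMP is interpreted in the session time zone when compared to TIMESTAMPTZ
SET TIME ZONE 'UTC';
SELECT TIMESTAMPTZ '2026-01-01 00:00+00' = TIMESTAMP '2026-01-01 00:00';  -- true
SET TIME ZONE 'America/New_York';
SELECT TIMESTAMPTZ '2026-01-01 00:00+00' = TIMESTAMP '2026-01-01 00:00';  -- false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;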

&lt;h3&gt;
  
  
  SQL Interview Question on a Cross-Type Join Returning Zero Rows
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;staging_users.user_id TEXT&lt;/code&gt; joined to &lt;code&gt;dim_users.user_id BIGINT&lt;/code&gt; returns 0 rows even though both tables contain &lt;code&gt;user_id = 42&lt;/code&gt;. The planner reports a &lt;code&gt;Seq Scan&lt;/code&gt; on &lt;code&gt;dim_users&lt;/code&gt;. &lt;strong&gt;Identify every contributing cause and propose a fix that produces a sound result &lt;em&gt;and&lt;/em&gt; keeps the dim's primary-key index usable.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a Single-Type Schema + Explicit Literal-Side Cast
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- short-term: cast the staging text to BIGINT (literal-side cast on TEXT)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_users&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_users&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- permanent fix: rewrite staging to BIGINT once&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;staging_users&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;symptom&lt;/th&gt;
&lt;th&gt;cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE d.user_id = s.user_id&lt;/code&gt; errors with operator-does-not-exist&lt;/td&gt;
&lt;td&gt;type mismatch (BIGINT vs TEXT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;analyst rewrites as &lt;code&gt;WHERE d.user_id::text = s.user_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;"fixes" the error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;query returns 0 rows&lt;/td&gt;
&lt;td&gt;leading whitespace in &lt;code&gt;s.user_id&lt;/code&gt; (&lt;code&gt;' 42'&lt;/code&gt;) breaks lexicographic compare&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EXPLAIN&lt;/code&gt; shows Seq Scan on &lt;code&gt;dim_users&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;column cast on &lt;code&gt;d.user_id&lt;/code&gt; killed the PK index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;rewrite with &lt;code&gt;BTRIM&lt;/code&gt; + &lt;code&gt;NULLIF&lt;/code&gt; + &lt;code&gt;::BIGINT&lt;/code&gt; on the &lt;em&gt;staging&lt;/em&gt; side&lt;/td&gt;
&lt;td&gt;index restored, whitespace tolerated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;row count matches &lt;code&gt;dim_users.user_id&lt;/code&gt; cardinality&lt;/td&gt;
&lt;td&gt;sound result&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; join now returns the expected rows, the dim's primary-key index is back in the plan, and the permanent &lt;code&gt;ALTER COLUMN&lt;/code&gt; removes the per-query cast for good.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-type schema&lt;/strong&gt; — after the &lt;code&gt;ALTER&lt;/code&gt;, both sides are &lt;code&gt;BIGINT&lt;/code&gt;; no cross-type compare ever runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staging-side &lt;code&gt;BTRIM&lt;/code&gt; + &lt;code&gt;NULLIF&lt;/code&gt; + &lt;code&gt;::BIGINT&lt;/code&gt;&lt;/strong&gt; — handles real-world dirty input without disabling the dim's index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index on &lt;code&gt;dim_users.user_id&lt;/code&gt; preserved&lt;/strong&gt; — because the cast is on the &lt;em&gt;staging&lt;/em&gt; side, not the dim side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whitespace-tolerant&lt;/strong&gt; — &lt;code&gt;BTRIM&lt;/code&gt; eliminates the silent zero-rows failure mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empty-string-safe&lt;/strong&gt; — &lt;code&gt;NULLIF(x, '')::BIGINT&lt;/code&gt; returns NULL instead of erroring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — one rewrite at the staging layer; per-query cost drops from a full table scan to an &lt;code&gt;O(log N)&lt;/code&gt; PK seek.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the join-fluency syllabus see the &lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;joins practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  Choosing types (checklist)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you are storing…&lt;/th&gt;
&lt;th&gt;Prefer…&lt;/th&gt;
&lt;th&gt;Watch out for…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Surrogate keys, row counts&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BIGINT&lt;/code&gt; / &lt;code&gt;INTEGER&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Overflow, unnecessary &lt;code&gt;BIGSERIAL&lt;/code&gt; everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Money, rates, basis points&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NUMERIC(p, s)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Float rounding in aggregates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Labels, names, free text&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;TEXT&lt;/code&gt; or &lt;code&gt;VARCHAR(n)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Collation, padding with &lt;code&gt;CHAR&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instants in distributed systems&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mixing with &lt;code&gt;TIMESTAMP&lt;/code&gt; in joins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nested / sparse attributes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;JSONB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Huge documents without indexes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public opaque IDs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UUID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stringly-typed UUIDs in joins&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When you explain a schema in a live screen, say the &lt;strong&gt;grain&lt;/strong&gt; and the &lt;strong&gt;type&lt;/strong&gt; together: "one row per order, &lt;code&gt;order_id&lt;/code&gt; is &lt;code&gt;BIGINT&lt;/code&gt;, &lt;code&gt;total&lt;/code&gt; is &lt;code&gt;NUMERIC(14,2)&lt;/code&gt;."&lt;/p&gt;
&lt;/blockquote&gt;
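
&lt;p&gt;The checklist rows compose into a schema like this (table and column names are illustrative, not from any real system):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- one row per order: each column type chosen from the checklist above
CREATE TABLE orders (
    order_id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    public_id  UUID NOT NULL,                                    -- opaque public ID
    total      NUMERIC(14, 2) NOT NULL,                          -- money: exact decimal
    note       TEXT,                                             -- free text
    attrs      JSONB,                                            -- sparse attributes
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()                -- instant in time
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;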




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Should I use &lt;code&gt;TEXT&lt;/code&gt; or &lt;code&gt;VARCHAR(255)&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;In PostgreSQL there is &lt;strong&gt;no storage penalty&lt;/strong&gt; for &lt;code&gt;TEXT&lt;/code&gt; vs &lt;code&gt;varchar&lt;/code&gt; with the same contents. Use &lt;strong&gt;&lt;code&gt;VARCHAR(n)&lt;/code&gt;&lt;/strong&gt; when you want the database to enforce a &lt;strong&gt;maximum length&lt;/strong&gt;; otherwise &lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; is simple and common.&lt;/p&gt;
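<p>A minimal illustration (hypothetical table):</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE labels (
    slug VARCHAR(64),  -- length enforced; inserting a longer value raises an error
    body TEXT          -- no length limit, same storage characteristics
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;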

&lt;h3&gt;
  
  
  Is &lt;code&gt;SERIAL&lt;/code&gt; still OK for primary keys?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;SERIAL&lt;/code&gt; / &lt;code&gt;BIGSERIAL&lt;/code&gt; are convenient; &lt;strong&gt;&lt;code&gt;GENERATED ... AS IDENTITY&lt;/code&gt;&lt;/strong&gt; is the standards-preferred spelling in modern PostgreSQL. Know both in interviews.&lt;/p&gt;
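<p>Both spellings side by side (table names illustrative):</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- legacy spelling, still widely seen
CREATE TABLE t_serial   (id BIGSERIAL PRIMARY KEY);
-- standards-preferred spelling (PostgreSQL 10+)
CREATE TABLE t_identity (id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;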

&lt;h3&gt;
  
  
  Why is my join returning no rows when the IDs "look the same"?
&lt;/h3&gt;

&lt;p&gt;Check &lt;strong&gt;types&lt;/strong&gt; and &lt;strong&gt;whitespace&lt;/strong&gt; on string keys. Compare plans with &lt;strong&gt;&lt;code&gt;EXPLAIN&lt;/code&gt;&lt;/strong&gt;: mismatched types can prevent &lt;strong&gt;index&lt;/strong&gt; use or change &lt;strong&gt;semantics&lt;/strong&gt; of comparison. Then rehearse on &lt;a href="https://dev.to/explore/practice/language/sql"&gt;SQL-tagged problems →&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  When must I use &lt;code&gt;NUMERIC&lt;/code&gt; instead of float?
&lt;/h3&gt;

&lt;p&gt;Whenever &lt;strong&gt;exact decimal&lt;/strong&gt; behavior is required—&lt;strong&gt;currency&lt;/strong&gt;, tax, allocations—or when you must match a &lt;strong&gt;ledger&lt;/strong&gt; or &lt;strong&gt;regulatory&lt;/strong&gt; rule. Floats are for measured magnitudes where error bounds are acceptable.&lt;/p&gt;
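<p>One line each shows the difference (output as printed by PostgreSQL 12+, where floats display their shortest exact representation):</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 0.1::float8  + 0.2::float8;   -- 0.30000000000000004
SELECT 0.1::numeric + 0.2::numeric;  -- 0.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;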




&lt;h2&gt;
  
  
  Practice on PipeCode
&lt;/h2&gt;

&lt;p&gt;PipeCode ships &lt;strong&gt;450+&lt;/strong&gt; data engineering practice problems—&lt;strong&gt;SQL&lt;/strong&gt; uses the &lt;strong&gt;PostgreSQL&lt;/strong&gt; dialect, with editorials and topics aligned to what strong companies ask. Start from &lt;a href="https://dev.to/explore/practice"&gt;Explore practice →&lt;/a&gt;, open &lt;a href="https://dev.to/explore/practice/language/sql"&gt;SQL practice →&lt;/a&gt;, filter by &lt;a href="https://dev.to/explore/practice/topic/joins"&gt;joins →&lt;/a&gt; or &lt;a href="https://dev.to/explore/practice/topic/aggregations"&gt;aggregations →&lt;/a&gt;, and &lt;a href="https://dev.to/subscribe"&gt;see plans →&lt;/a&gt; when you want the full library.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>interview</category>
    </item>
    <item>
      <title>PostgreSQL SQL Cheat Sheet — Clause Order, Joins, Aggregates, Windows</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Mon, 11 May 2026 03:52:46 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/postgresql-sql-cheat-sheet-clause-order-joins-aggregates-windows-3kim</link>
      <guid>https://dev.to/gowthampotureddi/postgresql-sql-cheat-sheet-clause-order-joins-aggregates-windows-3kim</guid>
      <description>&lt;p&gt;A &lt;strong&gt;PostgreSQL SQL cheat sheet&lt;/strong&gt; is only useful when every row in it maps to something you can drop straight into a query — not a wall of syntax with no operational explanation. This guide condenses real PostgreSQL fluency to four primitives: &lt;strong&gt;the logical clause order (&lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;), the six join shapes and the grain trap they create, &lt;code&gt;GROUP BY&lt;/code&gt; with &lt;code&gt;HAVING&lt;/code&gt; plus conditional aggregates for one-pass metrics, and window functions like &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, and &lt;code&gt;LEAD&lt;/code&gt; for ranking and lookback&lt;/strong&gt;. These four cover the bulk of analytical SQL — and the cheat-sheet style below is built so you can scan, copy a snippet, and tweak it for your own schema.&lt;/p&gt;

&lt;p&gt;Every section walks through a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;sub-topics with worked examples and runnable solutions&lt;/strong&gt;, &lt;strong&gt;common beginner mistakes&lt;/strong&gt;, and a &lt;strong&gt;worked interview-style scenario with a full traced answer&lt;/strong&gt;. PostgreSQL syntax throughout — the dialect that drives DataLemur, CoderPad, most product-analytics live screens, and the bulk of modern data-engineering SQL corpora.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp41gmcjpov3wjaj2quz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp41gmcjpov3wjaj2quz.webp" alt="Bold PipeCode blog header for the PostgreSQL SQL cheat sheet with the elephant mascot and colored SQL keywords SELECT, FROM, JOIN, WHERE, WINDOW on a dark gradient background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top PostgreSQL SQL cheat sheet topics
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; — one row per &lt;strong&gt;H2&lt;/strong&gt;, every row expanded into a full section with sub-topics, worked examples, a worked interview question, and a step-by-step traced solution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Why it shows up in a PostgreSQL cheat sheet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Logical clause order — &lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The single most useful PostgreSQL mental model: the order you write clauses is not the order the engine evaluates them; knowing the evaluation order eliminates 80% of parse errors and explains why &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregates or column aliases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Joins and grain — &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, &lt;code&gt;RIGHT&lt;/code&gt;, &lt;code&gt;FULL&lt;/code&gt;, &lt;code&gt;SELF&lt;/code&gt;, &lt;code&gt;CROSS&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Joins combine rows but they also change grain; a careless &lt;code&gt;1:N&lt;/code&gt; join inflates row counts silently, and the &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; anti-join is the canonical "find rows in A with no match in B" pattern (orphan customers, churned users).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;, and conditional aggregates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE&lt;/code&gt; filters rows before grouping; &lt;code&gt;HAVING&lt;/code&gt; filters groups after; &lt;code&gt;COUNT(*) FILTER (WHERE …)&lt;/code&gt; and &lt;code&gt;SUM(CASE WHEN …)&lt;/code&gt; express many metrics in one query — the universal duplicate finder &lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; lives here.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Window functions — &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-partition ranking without collapsing rows, top-N-per-group, second-highest salary, running totals with &lt;code&gt;SUM(...) OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt;, and month-over-month deltas via &lt;code&gt;LAG&lt;/code&gt;; the most-graded primitive in modern SQL screens.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beginner-friendly framing:&lt;/strong&gt; every analytical SQL question reduces to four steps — &lt;strong&gt;filter rows, join tables without changing grain by accident, aggregate or rank, then present the result&lt;/strong&gt;. Holding the clause-order diagram in your head (Section 1) lets you write SQL outside-in: pick the grain, then the joins, then the filters, then the projection. The cheat sheet below is organized in the same order you would write a real query.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. PostgreSQL Logical Clause Order — &lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The seven-stage evaluation order every PostgreSQL query follows
&lt;/h3&gt;

&lt;p&gt;"Why does &lt;code&gt;WHERE customer_count &amp;gt; 5&lt;/code&gt; give me a parse error when I'm clearly counting customers?" is the signature beginner question — and the answer is &lt;strong&gt;logical clause order&lt;/strong&gt;. The mental model: &lt;strong&gt;PostgreSQL evaluates clauses in a fixed order that is different from the order you write them; &lt;code&gt;FROM&lt;/code&gt;/&lt;code&gt;JOIN&lt;/code&gt; builds the row set, &lt;code&gt;WHERE&lt;/code&gt; filters rows, &lt;code&gt;GROUP BY&lt;/code&gt; collapses rows into groups, &lt;code&gt;HAVING&lt;/code&gt; filters groups, &lt;code&gt;SELECT&lt;/code&gt; projects columns, &lt;code&gt;ORDER BY&lt;/code&gt; sorts, &lt;code&gt;LIMIT&lt;/code&gt;/&lt;code&gt;OFFSET&lt;/code&gt; trims&lt;/strong&gt;. &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregate functions because aggregates do not exist until after &lt;code&gt;GROUP BY&lt;/code&gt;; column aliases declared in &lt;code&gt;SELECT&lt;/code&gt; cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; for the same reason.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F069ii8r51746vgu79rz5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F069ii8r51746vgu79rz5.webp" alt="Horizontal seven-step PostgreSQL clause-order diagram from FROM/JOIN through WHERE, GROUP BY, HAVING, SELECT, ORDER BY, to LIMIT/OFFSET with purple and orange brand icons for each stage and pipecode.ai attribution." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Memorize one sentence — "From-Where-Group-Having-Select-Order-Limit" — and you can decode any PostgreSQL parse error in under five seconds. The error &lt;code&gt;column "customer_count" does not exist&lt;/code&gt; almost always means the column is a &lt;code&gt;SELECT&lt;/code&gt;-level alias being referenced in &lt;code&gt;WHERE&lt;/code&gt;, which runs three stages earlier; lift the predicate into &lt;code&gt;HAVING&lt;/code&gt; (if it references an aggregate) or repeat the expression inline in &lt;code&gt;WHERE&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;FROM&lt;/code&gt; and &lt;code&gt;JOIN&lt;/code&gt; — build the working row set
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;FROM&lt;/code&gt;/&lt;code&gt;JOIN&lt;/code&gt; invariant: &lt;strong&gt;the first stage assembles a candidate row set by listing the tables (and how they join); every subsequent stage operates on this row set&lt;/strong&gt;. Subqueries in &lt;code&gt;FROM&lt;/code&gt; are also evaluated here, and &lt;code&gt;LATERAL&lt;/code&gt; joins let later subqueries reference earlier rows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single table&lt;/strong&gt; — &lt;code&gt;FROM orders&lt;/code&gt; produces one row per &lt;code&gt;orders&lt;/code&gt; row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joined tables&lt;/strong&gt; — &lt;code&gt;FROM orders o JOIN customers c ON c.id = o.customer_id&lt;/code&gt; produces one row per matching pair.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subquery in &lt;code&gt;FROM&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;FROM (SELECT ...) t&lt;/code&gt; materializes the inner result, then treats it as a table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LATERAL&lt;/code&gt; subquery&lt;/strong&gt; — &lt;code&gt;FROM orders o, LATERAL (SELECT ... WHERE x = o.id) s&lt;/code&gt; re-evaluates the inner subquery per outer row.&lt;/li&gt;
&lt;/ul&gt;
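
&lt;p&gt;The &lt;code&gt;LATERAL&lt;/code&gt; form in the last bullet looks like this in practice — a latest-order-per-customer sketch (the &lt;code&gt;created_at&lt;/code&gt; column is an assumption for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- LEFT JOIN LATERAL keeps customers with no orders; the subquery
-- re-runs per customer row and can reference c.id
SELECT c.name, recent.order_id, recent.amount
FROM customers c
LEFT JOIN LATERAL (
    SELECT o.order_id, o.amount
    FROM orders o
    WHERE o.customer_id = c.id
    ORDER BY o.created_at DESC
    LIMIT 1
) recent ON true;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;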

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;FROM&lt;/code&gt; with a &lt;code&gt;LEFT JOIN&lt;/code&gt; that produces the right row set before any filter runs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;output cardinality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;FROM customers&lt;/code&gt; alone&lt;/td&gt;
&lt;td&gt;3 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LEFT JOIN orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 rows (Alice has 2 orders, Bob 1, Carol 0 padded with NULLs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ready for &lt;code&gt;WHERE&lt;/code&gt; filtering&lt;/td&gt;
&lt;td&gt;4 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The engine reads &lt;code&gt;customers&lt;/code&gt; first, producing three rows (Alice, Bob, Carol).&lt;/li&gt;
&lt;li&gt;For each customer, it scans &lt;code&gt;orders&lt;/code&gt; for matching &lt;code&gt;customer_id&lt;/code&gt; rows; Alice matches 2 orders, Bob matches 1, Carol matches 0.&lt;/li&gt;
&lt;li&gt;Because the join is &lt;code&gt;LEFT&lt;/code&gt;, Carol's row is preserved with the right-side columns filled with &lt;code&gt;NULL&lt;/code&gt;s — total 4 rows.&lt;/li&gt;
&lt;li&gt;This 4-row stream is what &lt;code&gt;WHERE&lt;/code&gt; will see; no filtering has happened yet.&lt;/li&gt;
&lt;li&gt;Without understanding &lt;code&gt;FROM&lt;/code&gt; runs first, you can't reason about why a &lt;code&gt;WHERE&lt;/code&gt; predicate on the right side of a &lt;code&gt;LEFT JOIN&lt;/code&gt; silently converts the join into an &lt;code&gt;INNER JOIN&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your &lt;code&gt;LEFT JOIN&lt;/code&gt; is producing fewer rows than expected, check whether you have a &lt;code&gt;WHERE&lt;/code&gt; predicate that references the right-side table — that predicate runs after the join and discards the &lt;code&gt;NULL&lt;/code&gt;-padded rows.&lt;/p&gt;
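&lt;p&gt;A minimal sketch of the fix, assuming the same &lt;code&gt;customers&lt;/code&gt;/&lt;code&gt;orders&lt;/code&gt; tables from above: moving the right-side predicate from &lt;code&gt;WHERE&lt;/code&gt; into the &lt;code&gt;ON&lt;/code&gt; clause filters orders during the join, so unmatched customers keep their &lt;code&gt;NULL&lt;/code&gt;-padded rows.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Predicate in ON: the join stays outer; Carol still appears, NULL-padded.
SELECT c.name, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
 AND o.amount &amp;gt; 30;
-- The same predicate in WHERE would discard the NULL-padded rows,
-- silently turning the LEFT JOIN into an INNER JOIN.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;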

&lt;h4&gt;
  
  
  &lt;code&gt;WHERE&lt;/code&gt; — row-level predicates before grouping
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; filters individual rows from the &lt;code&gt;FROM&lt;/code&gt;/&lt;code&gt;JOIN&lt;/code&gt; output before &lt;code&gt;GROUP BY&lt;/code&gt; runs; it can reference any column from the joined row set, but cannot reference aggregate functions or &lt;code&gt;SELECT&lt;/code&gt;-level aliases&lt;/strong&gt;. This is the cheapest place to drop rows — push predicates here whenever possible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Row predicates&lt;/strong&gt; — &lt;code&gt;WHERE amount &amp;gt; 30&lt;/code&gt;, &lt;code&gt;WHERE order_date &amp;gt;= '2026-01-01'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IN&lt;/code&gt; / &lt;code&gt;EXISTS&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;WHERE customer_id IN (SELECT id FROM premium)&lt;/code&gt;, &lt;code&gt;WHERE EXISTS (...)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BETWEEN&lt;/code&gt;&lt;/strong&gt; — inclusive on both ends; &lt;code&gt;WHERE x BETWEEN 1 AND 10&lt;/code&gt; is &lt;code&gt;x &amp;gt;= 1 AND x &amp;lt;= 10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IS NULL&lt;/code&gt; / &lt;code&gt;IS NOT NULL&lt;/code&gt;&lt;/strong&gt; — the only way to check for &lt;code&gt;NULL&lt;/code&gt;; never &lt;code&gt;= NULL&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
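&lt;p&gt;The &lt;code&gt;IS NULL&lt;/code&gt; bullet is worth a two-line sketch on the same &lt;code&gt;orders&lt;/code&gt; table: &lt;code&gt;= NULL&lt;/code&gt; never matches, because the comparison evaluates to &lt;code&gt;NULL&lt;/code&gt; rather than &lt;code&gt;TRUE&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT * FROM orders WHERE amount = NULL;   -- always 0 rows: NULL, not TRUE
SELECT * FROM orders WHERE amount IS NULL;  -- the correct NULL check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;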

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Filter to one day of orders before grouping.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;filter&lt;/th&gt;
&lt;th&gt;rows surviving&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;no filter&lt;/td&gt;
&lt;td&gt;12,847 (all rows; this sample table holds a single day of orders)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE order_date = '2026-05-10'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12,847 (every row already matches today's date)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE order_date = '2026-05-10' AND amount &amp;gt; 100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4,290 (high-value only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;FROM orders&lt;/code&gt; returns the full row stream.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE order_date = '2026-05-10'&lt;/code&gt; is evaluated per row; rows with other dates are dropped.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AND amount &amp;gt; 100&lt;/code&gt; is evaluated next; this is a row predicate (not an aggregate), so it lives in &lt;code&gt;WHERE&lt;/code&gt; correctly.&lt;/li&gt;
&lt;li&gt;The surviving row set (4,290 rows) flows into &lt;code&gt;GROUP BY&lt;/code&gt; if one is present, otherwise into &lt;code&gt;SELECT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pushing the date filter into &lt;code&gt;WHERE&lt;/code&gt; rather than &lt;code&gt;HAVING&lt;/code&gt; is critical for index usage: a B-tree index on &lt;code&gt;order_date&lt;/code&gt; can prune 95% of the table before any grouping happens.&lt;/li&gt;
&lt;/ol&gt;
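&lt;p&gt;Step 5 assumes an index exists; a sketch (the index name is illustrative, not part of the schema used in this post):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical B-tree index supporting the date predicate in WHERE.
CREATE INDEX idx_orders_order_date ON orders (order_date);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;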

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-10'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the predicate uses only raw row columns, it belongs in &lt;code&gt;WHERE&lt;/code&gt;; if it uses &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;, it belongs in &lt;code&gt;HAVING&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The downstream invariant: &lt;strong&gt;after &lt;code&gt;WHERE&lt;/code&gt;, the engine evaluates &lt;code&gt;GROUP BY&lt;/code&gt; (collapsing rows into one row per distinct key combination), then &lt;code&gt;HAVING&lt;/code&gt; (filtering groups), then &lt;code&gt;SELECT&lt;/code&gt; (projecting columns and computing expressions), then &lt;code&gt;ORDER BY&lt;/code&gt; (sorting the final result), then &lt;code&gt;LIMIT&lt;/code&gt;/&lt;code&gt;OFFSET&lt;/code&gt; (trimming for pagination)&lt;/strong&gt;. &lt;code&gt;SELECT&lt;/code&gt;-level aliases become referenceable only in &lt;code&gt;ORDER BY&lt;/code&gt; and the outer query (in a subquery context).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY col1, col2&lt;/code&gt;&lt;/strong&gt; — one output row per distinct &lt;code&gt;(col1, col2)&lt;/code&gt; combination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING agg_pred&lt;/code&gt;&lt;/strong&gt; — filter groups; can reference &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(col)&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELECT col, agg(col2) AS x&lt;/code&gt;&lt;/strong&gt; — project columns; aggregates and aliases are computed here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY x DESC, col&lt;/code&gt;&lt;/strong&gt; — can reference &lt;code&gt;SELECT&lt;/code&gt; aliases; deterministic with a tiebreaker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT N OFFSET M&lt;/code&gt;&lt;/strong&gt; — page slicing; always pair with &lt;code&gt;ORDER BY&lt;/code&gt; for determinism.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Group by customer, filter to high-spend customers, sort descending, top 5.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;stage&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FROM orders WHERE order_date = '2026-05-10'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4,290&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GROUP BY customer_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1,720 (one row per customer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HAVING SUM(amount) &amp;gt; 500&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;312 (high-spend)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT customer_id, SUM(amount) AS spend&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;312 (projected)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ORDER BY spend DESC, customer_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;312 (sorted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LIMIT 5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;WHERE&lt;/code&gt; produces 4,290 rows for one day with &lt;code&gt;amount &amp;gt; 100&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GROUP BY customer_id&lt;/code&gt; collapses them into 1,720 buckets, one per customer.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HAVING SUM(amount) &amp;gt; 500&lt;/code&gt; keeps only the 312 buckets whose total spend exceeds $500.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SELECT&lt;/code&gt; computes the alias &lt;code&gt;spend = SUM(amount)&lt;/code&gt; and projects two columns.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ORDER BY spend DESC, customer_id&lt;/code&gt; sorts the 312 surviving rows by descending spend with a deterministic tiebreaker; &lt;code&gt;LIMIT 5&lt;/code&gt; returns just the top five.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;spend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-10'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;spend&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every clause has a fixed slot; if you find yourself wanting &lt;code&gt;WHERE&lt;/code&gt; to reference an aggregate, the predicate belongs in &lt;code&gt;HAVING&lt;/code&gt; instead — and if you want &lt;code&gt;ORDER BY&lt;/code&gt; to use a long expression, alias it in &lt;code&gt;SELECT&lt;/code&gt; and reference the alias.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt; — fails with "aggregate functions are not allowed in WHERE"; aggregates do not exist until after &lt;code&gt;GROUP BY&lt;/code&gt;. Use &lt;code&gt;HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Referencing a &lt;code&gt;SELECT&lt;/code&gt; alias in &lt;code&gt;WHERE&lt;/code&gt; — &lt;code&gt;WHERE spend &amp;gt; 100&lt;/code&gt; after &lt;code&gt;SELECT SUM(amount) AS spend&lt;/code&gt; fails; either repeat the expression or move to &lt;code&gt;HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Selecting a non-aggregated, non-&lt;code&gt;GROUP BY&lt;/code&gt; column — strict PostgreSQL errors out with "must appear in GROUP BY"; some other dialects pick an arbitrary row silently.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LIMIT 5&lt;/code&gt; without &lt;code&gt;ORDER BY&lt;/code&gt; — non-deterministic; two runs of the same query return different rows.&lt;/li&gt;
&lt;li&gt;Putting &lt;code&gt;HAVING&lt;/code&gt; before &lt;code&gt;GROUP BY&lt;/code&gt; — syntax error; the clause order is mandatory.&lt;/li&gt;
&lt;/ul&gt;
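&lt;p&gt;The first two mistakes share one fix; a before/after sketch on the &lt;code&gt;orders&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Wrong: aggregates and SELECT aliases do not exist yet in WHERE.
-- SELECT customer_id, SUM(amount) AS spend
-- FROM orders
-- WHERE COUNT(*) &amp;gt; 1 AND spend &amp;gt; 100;

-- Right: group first, then filter the groups in HAVING.
SELECT customer_id, SUM(amount) AS spend
FROM orders
GROUP BY customer_id
HAVING COUNT(*) &amp;gt; 1
   AND SUM(amount) &amp;gt; 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;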

&lt;h3&gt;
  
  
  PostgreSQL Interview Question on Clause Order
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;orders(order_id, customer_id, amount, order_date)&lt;/code&gt;, &lt;strong&gt;find every customer who placed more than 3 orders today with total spend above $500&lt;/strong&gt;. Return &lt;code&gt;customer_id&lt;/code&gt; and &lt;code&gt;total_spend&lt;/code&gt;, sorted by &lt;code&gt;total_spend&lt;/code&gt; descending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;WHERE&lt;/code&gt; + &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING&lt;/code&gt; in the Right Slots
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_spend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_spend&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;WHERE order_date = CURRENT_DATE&lt;/code&gt; filters to today's row set first (cheap, index-friendly); &lt;code&gt;GROUP BY customer_id&lt;/code&gt; collapses to one row per customer; &lt;code&gt;HAVING&lt;/code&gt; evaluates the two aggregate predicates together (more than 3 orders AND total &amp;gt; $500); &lt;code&gt;SELECT&lt;/code&gt; projects the alias &lt;code&gt;total_spend&lt;/code&gt;; &lt;code&gt;ORDER BY total_spend DESC, customer_id&lt;/code&gt; produces a deterministic ordering. Single pass over today's rows with hash aggregation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for sample data on 2026-05-10:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;orders today&lt;/th&gt;
&lt;th&gt;sum(amount)&lt;/th&gt;
&lt;th&gt;passes HAVING?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;410&lt;/td&gt;
&lt;td&gt;✗ (sum ≤ 500)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1,250&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;✗ (count ≤ 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;520&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three customers survive both predicates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;total_spend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;1250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;520&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; first&lt;/strong&gt; — &lt;code&gt;order_date = CURRENT_DATE&lt;/code&gt; is a row predicate using a non-aggregated column; pushing it into &lt;code&gt;WHERE&lt;/code&gt; shrinks the row set before grouping and lets the planner use a B-tree index on &lt;code&gt;order_date&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY customer_id&lt;/code&gt;&lt;/strong&gt; — collapses today's rows into one bucket per customer; every subsequent aggregate is computed inside this bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt; two-predicate AND&lt;/strong&gt; — &lt;code&gt;COUNT(*) &amp;gt; 3&lt;/code&gt; and &lt;code&gt;SUM(amount) &amp;gt; 500&lt;/code&gt; are both aggregate predicates; combining them with &lt;code&gt;AND&lt;/code&gt; in a single &lt;code&gt;HAVING&lt;/code&gt; is the canonical multi-condition group filter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELECT&lt;/code&gt; projection + alias&lt;/strong&gt; — &lt;code&gt;SUM(amount) AS total_spend&lt;/code&gt; is computed here; the alias becomes available to &lt;code&gt;ORDER BY&lt;/code&gt; (but not to &lt;code&gt;WHERE&lt;/code&gt; / &lt;code&gt;HAVING&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY total_spend DESC, customer_id&lt;/code&gt;&lt;/strong&gt; — descending sort on the metric with a deterministic tiebreaker via &lt;code&gt;customer_id&lt;/code&gt;; reviewers depend on stable ordering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|today's orders| + G log G)&lt;/code&gt; time&lt;/strong&gt; — single hash aggregation produces &lt;code&gt;G&lt;/code&gt; groups; final sort is &lt;code&gt;G log G&lt;/code&gt;. With an index on &lt;code&gt;(order_date, customer_id)&lt;/code&gt; the planner can stream rather than hash.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;SQL filtering practice page&lt;/a&gt; for &lt;code&gt;WHERE&lt;/code&gt; patterns and the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation practice page&lt;/a&gt; for &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING&lt;/code&gt; shapes.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. PostgreSQL Joins and Grain — &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, &lt;code&gt;RIGHT&lt;/code&gt;, &lt;code&gt;FULL&lt;/code&gt;, &lt;code&gt;SELF&lt;/code&gt;, &lt;code&gt;CROSS&lt;/code&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Joins, anti-joins, and the grain-inflation trap in PostgreSQL
&lt;/h3&gt;

&lt;p&gt;"Why is &lt;code&gt;SUM(amount)&lt;/code&gt; returning double what I expect after I add a &lt;code&gt;JOIN&lt;/code&gt;?" is the signature grain-inflation question — and the answer is that &lt;strong&gt;joins do not just combine columns; they change the row cardinality of the result&lt;/strong&gt;. The mental model: &lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt; keeps only matching pairs, &lt;code&gt;LEFT JOIN&lt;/code&gt; keeps every left row and pads the right side with &lt;code&gt;NULL&lt;/code&gt;s, &lt;code&gt;RIGHT JOIN&lt;/code&gt; is the mirror, &lt;code&gt;FULL OUTER JOIN&lt;/code&gt; keeps both sides' unmatched rows, &lt;code&gt;SELF JOIN&lt;/code&gt; joins a table to itself (for hierarchies and pair queries), &lt;code&gt;CROSS JOIN&lt;/code&gt; produces a Cartesian product (one row per &lt;code&gt;(left, right)&lt;/code&gt; pair)&lt;/strong&gt;. The cardinality of any join is bounded by &lt;code&gt;|left| × |right|&lt;/code&gt;, and a &lt;code&gt;1:N&lt;/code&gt; relationship inflates left rows by &lt;code&gt;N&lt;/code&gt; — the silent source of doubled metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcucep3sqz2qej2vf3ic.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcucep3sqz2qej2vf3ic.webp" alt="Venn diagrams for INNER JOIN (purple intersection of Table A and Table B) and LEFT JOIN (green Table A with a NULL pocket where Table B does not match) under a PostgreSQL SQL Cheat Sheet headline with a grain/cardinality footer label." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Before writing any join, ask "what is the grain of the result?" — orders, order lines, customer-day, or &lt;code&gt;(customer, product)&lt;/code&gt; pair. A &lt;code&gt;1:N&lt;/code&gt; join (e.g., &lt;code&gt;customers&lt;/code&gt; to &lt;code&gt;orders&lt;/code&gt;) inflates customer rows by the number of orders; &lt;code&gt;SUM(customer.lifetime_value)&lt;/code&gt; after that join returns lifetime value × order count, not lifetime value. Always state the grain out loud.&lt;/p&gt;
&lt;/blockquote&gt;
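&lt;p&gt;The pro tip can be sketched as a before/after; &lt;code&gt;lifetime_value&lt;/code&gt; is the hypothetical column from the tip, not part of the schema used elsewhere in this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Inflated: the 1:N join repeats lifetime_value once per order, so SUM over-counts.
-- SELECT SUM(c.lifetime_value)
-- FROM customers c JOIN orders o ON o.customer_id = c.id;

-- Fix: collapse orders to one row per customer before joining.
SELECT SUM(c.lifetime_value) AS total_ltv,
       SUM(o.order_total)    AS total_revenue
FROM customers c
LEFT JOIN (
  SELECT customer_id, SUM(amount) AS order_total
  FROM orders
  GROUP BY customer_id
) o ON o.customer_id = c.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;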

&lt;h4&gt;
  
  
  &lt;code&gt;INNER JOIN&lt;/code&gt; — keep only matching pairs (no padding)
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;INNER JOIN&lt;/code&gt; invariant: &lt;strong&gt;a left row is paired with a right row iff the join predicate is &lt;code&gt;TRUE&lt;/code&gt;; unmatched rows on either side are discarded; the result cardinality is the count of matching pairs&lt;/strong&gt;. This is the most common join and the fastest because the planner can short-circuit on no-match.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ON l.key = r.key&lt;/code&gt;&lt;/strong&gt; — single-column equi-join; the planner hashes the right table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-column&lt;/strong&gt; — &lt;code&gt;ON l.a = r.a AND l.b = r.b&lt;/code&gt; for composite keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-equi&lt;/strong&gt; — &lt;code&gt;ON l.range_start &amp;lt;= r.point AND l.range_end &amp;gt;= r.point&lt;/code&gt; (range join).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;USING (col)&lt;/code&gt;&lt;/strong&gt; — shorthand when both sides share the column name; merges the column.&lt;/li&gt;
&lt;/ul&gt;
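&lt;p&gt;The last two bullets deserve a sketch. &lt;code&gt;USING&lt;/code&gt; requires both sides to literally share the column name (here a hypothetical schema where both tables have &lt;code&gt;customer_id&lt;/code&gt;); the range join uses a hypothetical &lt;code&gt;tax_brackets&lt;/code&gt; table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- USING merges the shared key into a single output column.
SELECT customer_id, name, order_id
FROM customers
INNER JOIN orders USING (customer_id);

-- Non-equi range join: match each order to the bracket containing its amount.
SELECT o.order_id, b.rate
FROM orders o
INNER JOIN tax_brackets b
  ON o.amount &amp;gt;= b.min_amount
 AND o.amount &amp;lt;  b.max_amount;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;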

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two tables, three customers, two orders; one customer has no order.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer&lt;/th&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Carol (no orders) does not appear — &lt;code&gt;INNER JOIN&lt;/code&gt; dropped her.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The engine reads &lt;code&gt;customers&lt;/code&gt; (Alice, Bob, Carol) and &lt;code&gt;orders&lt;/code&gt; (101 for Alice, 102 for Bob).&lt;/li&gt;
&lt;li&gt;For each &lt;code&gt;customers&lt;/code&gt; row, it scans &lt;code&gt;orders&lt;/code&gt; for a matching &lt;code&gt;customer_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Alice matches &lt;code&gt;order_id = 101&lt;/code&gt;; Bob matches &lt;code&gt;order_id = 102&lt;/code&gt;; Carol has no match.&lt;/li&gt;
&lt;li&gt;Carol's row is silently discarded because the join is &lt;code&gt;INNER&lt;/code&gt; — no &lt;code&gt;NULL&lt;/code&gt;-padded row is produced.&lt;/li&gt;
&lt;li&gt;The output has two rows because there were two matching pairs; in general, an inner join's result cardinality satisfies &lt;code&gt;0 ≤ N ≤ |customers| × |orders|&lt;/code&gt; (anywhere from zero matches up to a full Cartesian product).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; reach for &lt;code&gt;INNER JOIN&lt;/code&gt; whenever the question is "rows where both sides exist"; it is the smallest, fastest, most common join.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;LEFT JOIN&lt;/code&gt; — keep every left row, pad the right with &lt;code&gt;NULL&lt;/code&gt;s (anti-join trick)
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;LEFT JOIN&lt;/code&gt; invariant: &lt;strong&gt;every row from the left table appears in the output; if no right row matches, the right columns are &lt;code&gt;NULL&lt;/code&gt;; &lt;code&gt;LEFT JOIN ... WHERE right.key IS NULL&lt;/code&gt; keeps exactly the left rows that had no match — the anti-join idiom&lt;/strong&gt;. &lt;code&gt;RIGHT JOIN&lt;/code&gt; is the mirror; flip the table order and use &lt;code&gt;LEFT&lt;/code&gt; for consistency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt;&lt;/strong&gt; — preserves every left row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right columns &lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; when no match — the key signal for anti-joins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; anti-join&lt;/strong&gt; — "find rows in A with no match in B".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RIGHT JOIN&lt;/code&gt;&lt;/strong&gt; — mirror image; rarely needed (just flip table order and use &lt;code&gt;LEFT&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same &lt;code&gt;customers&lt;/code&gt; + &lt;code&gt;orders&lt;/code&gt;; Carol is preserved with &lt;code&gt;NULL&lt;/code&gt; right-side columns.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer&lt;/th&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For each &lt;code&gt;customers&lt;/code&gt; row, scan &lt;code&gt;orders&lt;/code&gt; for a matching &lt;code&gt;customer_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Alice matches → row &lt;code&gt;(Alice, 101)&lt;/code&gt;; Bob matches → row &lt;code&gt;(Bob, 102)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Carol does not match → row &lt;code&gt;(Carol, NULL)&lt;/code&gt; is produced because the join is &lt;code&gt;LEFT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;To find Carol via the anti-join: add &lt;code&gt;WHERE o.order_id IS NULL&lt;/code&gt; after the &lt;code&gt;LEFT JOIN&lt;/code&gt;; only Carol's row passes the filter.&lt;/li&gt;
&lt;li&gt;Equivalent to &lt;code&gt;WHERE NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id)&lt;/code&gt; and (under &lt;code&gt;NOT NULL&lt;/code&gt; constraints) &lt;code&gt;WHERE c.id NOT IN (SELECT customer_id FROM orders)&lt;/code&gt; — but the anti-join is immune to the &lt;code&gt;NOT IN&lt;/code&gt; &lt;code&gt;NULL&lt;/code&gt;-swallowing bug.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "find X with no Y" → &lt;code&gt;LEFT JOIN ... WHERE Y.id IS NULL&lt;/code&gt;. Memorize this; it is the most-asked join shape in SQL interviews and the cleanest fix for the &lt;code&gt;NOT IN ... NULL&lt;/code&gt; trap.&lt;/p&gt;
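&lt;p&gt;The equivalent &lt;code&gt;NOT EXISTS&lt;/code&gt; spelling mentioned above, sketched against the same &lt;code&gt;customers&lt;/code&gt;/&lt;code&gt;orders&lt;/code&gt; tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Anti-join via NOT EXISTS: safe even when
-- orders.customer_id can contain NULLs.
SELECT c.name
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.id
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;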

&lt;h4&gt;
  
  
  &lt;code&gt;FULL OUTER&lt;/code&gt;, &lt;code&gt;SELF&lt;/code&gt;, and &lt;code&gt;CROSS&lt;/code&gt; joins — the rarer shapes
&lt;/h4&gt;

&lt;p&gt;The rarer-joins invariant: &lt;strong&gt;&lt;code&gt;FULL OUTER JOIN&lt;/code&gt; keeps every left row AND every right row (with &lt;code&gt;NULL&lt;/code&gt; padding on the unmatched side); &lt;code&gt;SELF JOIN&lt;/code&gt; joins a table to itself by aliasing it twice (employees-and-managers, parent-child, pair queries); &lt;code&gt;CROSS JOIN&lt;/code&gt; produces every &lt;code&gt;(left, right)&lt;/code&gt; combination — the Cartesian product&lt;/strong&gt;. Each has a narrow but important use.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FULL OUTER JOIN&lt;/code&gt;&lt;/strong&gt; — reconcile two sources; rows from either side without a match get padded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELF JOIN&lt;/code&gt;&lt;/strong&gt; — employee/manager lookups, parent-child pairs, pair queries; covers a fixed number of levels, while arbitrary-depth traversal needs a recursive CTE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CROSS JOIN&lt;/code&gt;&lt;/strong&gt; — generate every combination (small tables only) or paired with &lt;code&gt;LATERAL&lt;/code&gt; for top-N per row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implicit cross join&lt;/strong&gt; — comma-separated tables (&lt;code&gt;FROM a, b&lt;/code&gt;) without an &lt;code&gt;ON&lt;/code&gt; is a &lt;code&gt;CROSS JOIN&lt;/code&gt; — usually a bug.&lt;/li&gt;
&lt;/ul&gt;
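&lt;p&gt;A minimal reconciliation sketch for the &lt;code&gt;FULL OUTER JOIN&lt;/code&gt; case (the table names &lt;code&gt;source_a&lt;/code&gt; and &lt;code&gt;source_b&lt;/code&gt; here are illustrative, not part of the article's schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Rows present in only one of two sources surface
-- as NULL-padded rows; filter on either side's key.
SELECT a.id AS left_id, b.id AS right_id
FROM source_a a
FULL OUTER JOIN source_b b
  ON b.id = a.id
WHERE a.id IS NULL
   OR b.id IS NULL;  -- the mismatches
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;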

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Self-join &lt;code&gt;employees&lt;/code&gt; to itself to surface each person's manager.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;manager_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL (CEO)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Alias the same &lt;code&gt;employees&lt;/code&gt; table twice: &lt;code&gt;e&lt;/code&gt; (for employees) and &lt;code&gt;m&lt;/code&gt; (for managers).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LEFT JOIN&lt;/code&gt; on &lt;code&gt;e.manager_id = m.emp_id&lt;/code&gt; looks each employee up against the manager rows.&lt;/li&gt;
&lt;li&gt;Alice's &lt;code&gt;manager_id&lt;/code&gt; points to Carol → row &lt;code&gt;(Alice, Carol)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Bob's &lt;code&gt;manager_id&lt;/code&gt; points to Carol → row &lt;code&gt;(Bob, Carol)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Carol is the CEO so her &lt;code&gt;manager_id IS NULL&lt;/code&gt; → no match → row &lt;code&gt;(Carol, NULL)&lt;/code&gt; because the join is &lt;code&gt;LEFT&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;manager_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emp_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;SELF JOIN&lt;/code&gt; is one-level hierarchy; for arbitrary-depth recursion (org chart traversal, BOM tree), reach for &lt;code&gt;WITH RECURSIVE&lt;/code&gt; instead.&lt;/p&gt;
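&lt;p&gt;For the arbitrary-depth case, a minimal &lt;code&gt;WITH RECURSIVE&lt;/code&gt; sketch over the same &lt;code&gt;employees(emp_id, name, manager_id)&lt;/code&gt; table (the &lt;code&gt;depth&lt;/code&gt; column is added here for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH RECURSIVE chain AS (
  -- anchor: the CEO (no manager)
  SELECT emp_id, name, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL
  UNION ALL
  -- step: everyone reporting to someone already in the chain
  SELECT e.emp_id, e.name, chain.depth + 1
  FROM employees e
  JOIN chain ON chain.emp_id = e.manager_id
)
SELECT name, depth
FROM chain;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;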

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Forgetting that a &lt;code&gt;1:N&lt;/code&gt; &lt;code&gt;JOIN&lt;/code&gt; inflates the left side — &lt;code&gt;SUM(left.col)&lt;/code&gt; returns &lt;code&gt;left.col × N&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Filtering the right table inside &lt;code&gt;WHERE&lt;/code&gt; after a &lt;code&gt;LEFT JOIN&lt;/code&gt; (e.g., &lt;code&gt;WHERE o.amount &amp;gt; 0&lt;/code&gt;) — silently turns the &lt;code&gt;LEFT JOIN&lt;/code&gt; into an &lt;code&gt;INNER JOIN&lt;/code&gt; because &lt;code&gt;NULL &amp;gt; 0&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;NOT IN (subquery)&lt;/code&gt; when the subquery can contain &lt;code&gt;NULL&lt;/code&gt; — returns zero rows because &lt;code&gt;x NOT IN (..., NULL, ...)&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, which fails the predicate.&lt;/li&gt;
&lt;li&gt;Comma-separated &lt;code&gt;FROM a, b&lt;/code&gt; with no &lt;code&gt;ON&lt;/code&gt; clause — produces a Cartesian product (&lt;code&gt;CROSS JOIN&lt;/code&gt;); usually a bug.&lt;/li&gt;
&lt;li&gt;Joining on the wrong column (&lt;code&gt;o.id = c.id&lt;/code&gt; instead of &lt;code&gt;o.customer_id = c.id&lt;/code&gt;) — produces nonsense rows.&lt;/li&gt;
&lt;/ul&gt;
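&lt;p&gt;The &lt;code&gt;LEFT JOIN&lt;/code&gt;-turned-&lt;code&gt;INNER&lt;/code&gt; mistake has a one-line fix: move the right-table predicate into the &lt;code&gt;ON&lt;/code&gt; clause so unmatched left rows survive. A sketch against the &lt;code&gt;customers&lt;/code&gt;/&lt;code&gt;orders&lt;/code&gt; schema used above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Keeps every customer; only orders with amount &amp;gt; 0 participate.
SELECT c.name, o.amount
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
 AND o.amount &amp;gt; 0;  -- predicate in ON, not WHERE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;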

&lt;h3&gt;
  
  
  PostgreSQL Interview Question on Customers With No Orders
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;customers(id, name)&lt;/code&gt; and &lt;code&gt;orders(order_id, customer_id, amount)&lt;/code&gt;, &lt;strong&gt;return the names of customers who have never placed an order&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;LEFT JOIN ... WHERE orders.order_id IS NULL&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;LEFT JOIN&lt;/code&gt; preserves every customer row regardless of whether a matching order exists; for matched customers, &lt;code&gt;o.order_id&lt;/code&gt; carries a real value; for unmatched customers, the right-side columns are &lt;code&gt;NULL&lt;/code&gt; and the &lt;code&gt;WHERE o.order_id IS NULL&lt;/code&gt; predicate is &lt;code&gt;TRUE&lt;/code&gt;; the filter keeps only the unmatched customers — the anti-join. Single pass over &lt;code&gt;customers&lt;/code&gt;; one keyed lookup into &lt;code&gt;orders&lt;/code&gt; per customer; no subquery materialization needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customers.id&lt;/th&gt;
&lt;th&gt;customers.name&lt;/th&gt;
&lt;th&gt;LEFT JOIN orders.order_id&lt;/th&gt;
&lt;th&gt;IS NULL?&lt;/th&gt;
&lt;th&gt;survives?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Carol and Dan survive the filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; semantics&lt;/strong&gt; — keeps every left row; right side is &lt;code&gt;NULL&lt;/code&gt; when there is no match. This &lt;code&gt;NULL&lt;/code&gt; is the entire signal we filter on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE o.order_id IS NULL&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;o.order_id&lt;/code&gt; is the right-side primary key; it is &lt;code&gt;NULL&lt;/code&gt; only when the join produced a synthetic unmatched row. A real-&lt;code&gt;NULL&lt;/code&gt; order-id from the source table never happens because primary keys are &lt;code&gt;NOT NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-join semantics&lt;/strong&gt; — equivalent to &lt;code&gt;NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id)&lt;/code&gt;; performance of the two spellings is usually comparable in PostgreSQL, so check &lt;code&gt;EXPLAIN&lt;/code&gt; rather than assuming one form wins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;NULL&lt;/code&gt;-swallowing&lt;/strong&gt; — unlike &lt;code&gt;NOT IN&lt;/code&gt;, the predicate is &lt;code&gt;IS NULL&lt;/code&gt;, which is well-defined for &lt;code&gt;NULL&lt;/code&gt; values. There is no silent zero-row failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY c.name&lt;/code&gt;&lt;/strong&gt; — deterministic ordering for reviewer stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|customers| + |orders|)&lt;/code&gt; time&lt;/strong&gt; — hash-join build on &lt;code&gt;orders.customer_id&lt;/code&gt;, single probe per customer. With an index on &lt;code&gt;orders.customer_id&lt;/code&gt; this is near-linear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL joins practice page&lt;/a&gt; for &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, and anti-join shapes, and the &lt;a href="https://pipecode.ai/explore/practice/topic/null-handling/sql" rel="noopener noreferrer"&gt;SQL null-handling practice page&lt;/a&gt; for &lt;code&gt;NULL&lt;/code&gt;-aware predicates.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. PostgreSQL &lt;code&gt;GROUP BY&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;, and Conditional Aggregates
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;GROUP BY&lt;/code&gt; with &lt;code&gt;HAVING&lt;/code&gt;, &lt;code&gt;FILTER&lt;/code&gt;, and &lt;code&gt;CASE&lt;/code&gt; for one-pass metrics in PostgreSQL
&lt;/h3&gt;

&lt;p&gt;"Compute total revenue, refunded revenue, and the percentage refunded — in a single query" is the signature conditional-aggregate prompt — and the cleanest PostgreSQL answer is &lt;strong&gt;&lt;code&gt;SUM(... ) FILTER (WHERE …)&lt;/code&gt; clauses inside a single &lt;code&gt;SELECT&lt;/code&gt;&lt;/strong&gt;. The mental model: &lt;strong&gt;&lt;code&gt;GROUP BY col&lt;/code&gt; collapses rows into buckets; &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(...)&lt;/code&gt;, &lt;code&gt;AVG(...)&lt;/code&gt;, &lt;code&gt;MIN(...)&lt;/code&gt;, &lt;code&gt;MAX(...)&lt;/code&gt; summarize each bucket; &lt;code&gt;WHERE&lt;/code&gt; filters individual rows before grouping; &lt;code&gt;HAVING&lt;/code&gt; filters groups after grouping; &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; and &lt;code&gt;CASE WHEN …&lt;/code&gt; express conditional aggregates that count or sum only certain rows per group&lt;/strong&gt;. The duplicate-finder pattern &lt;code&gt;GROUP BY key HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; lives here too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo6xn6pursz7krqfnz0a.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo6xn6pursz7krqfnz0a.webp" alt="Two-panel PostgreSQL SQL Cheat Sheet diagram: left panel WHERE filters individual rows (orange funnel with rows flowing into filtered output); right panel HAVING filters groups (group boxes labeled kept and rejected separated by a purple filter bar) with pipecode.ai attribution." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; PostgreSQL supports the SQL standard &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; clause on every aggregate — &lt;code&gt;COUNT(*) FILTER (WHERE status = 'refunded')&lt;/code&gt;. It produces clearer queries than &lt;code&gt;SUM(CASE WHEN … THEN 1 ELSE 0 END)&lt;/code&gt; and is exactly what interviewers like to see. The &lt;code&gt;CASE&lt;/code&gt; variant still works for portability across dialects.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt; — &lt;code&gt;NULL&lt;/code&gt;-aware aggregates
&lt;/h4&gt;

&lt;p&gt;The aggregate-&lt;code&gt;NULL&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt; counts every row including ones with &lt;code&gt;NULL&lt;/code&gt; columns; &lt;code&gt;COUNT(col)&lt;/code&gt; counts only rows where &lt;code&gt;col&lt;/code&gt; is not &lt;code&gt;NULL&lt;/code&gt;; &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt; skip &lt;code&gt;NULL&lt;/code&gt; values entirely; if every value in a group is &lt;code&gt;NULL&lt;/code&gt;, the result is &lt;code&gt;NULL&lt;/code&gt; (not &lt;code&gt;0&lt;/code&gt;)&lt;/strong&gt;. The distinction between &lt;code&gt;COUNT(*)&lt;/code&gt; and &lt;code&gt;COUNT(col)&lt;/code&gt; is the #1 source of "my counts are off by 10%" bugs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — every row in the bucket, regardless of &lt;code&gt;NULL&lt;/code&gt;s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(col)&lt;/code&gt;&lt;/strong&gt; — non-&lt;code&gt;NULL&lt;/code&gt; values of &lt;code&gt;col&lt;/code&gt; only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt;&lt;/strong&gt; — unique non-&lt;code&gt;NULL&lt;/code&gt; values; essential after a &lt;code&gt;JOIN&lt;/code&gt; that may have inflated rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM&lt;/code&gt; / &lt;code&gt;AVG&lt;/code&gt;&lt;/strong&gt; — numeric only; &lt;code&gt;AVG&lt;/code&gt; is sum-of-non-null-divided-by-count-of-non-null, so &lt;code&gt;NULL&lt;/code&gt; does &lt;strong&gt;not&lt;/strong&gt; count as &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three rows in one customer's bucket: &lt;code&gt;amount&lt;/code&gt; = &lt;code&gt;10, NULL, 30&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;aggregate&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUM(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVG(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MIN(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MAX(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;COUNT(*)&lt;/code&gt; = 3 because every row in the bucket counts, regardless of &lt;code&gt;amount&lt;/code&gt;'s value.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;COUNT(amount)&lt;/code&gt; = 2 because the &lt;code&gt;NULL&lt;/code&gt; row is skipped; only &lt;code&gt;10&lt;/code&gt; and &lt;code&gt;30&lt;/code&gt; contribute.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SUM(amount)&lt;/code&gt; = 10 + 30 = 40; the &lt;code&gt;NULL&lt;/code&gt; is treated as missing, not as &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVG(amount)&lt;/code&gt; = (10 + 30) / 2 = 20; the denominator is &lt;code&gt;COUNT(amount) = 2&lt;/code&gt;, not &lt;code&gt;COUNT(*) = 3&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MIN&lt;/code&gt; and &lt;code&gt;MAX&lt;/code&gt; skip the &lt;code&gt;NULL&lt;/code&gt; and return the smallest/largest non-&lt;code&gt;NULL&lt;/code&gt; value.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_known&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the metric is "people who clicked" use &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt;; if it is "click events" use &lt;code&gt;COUNT(*)&lt;/code&gt;; if it is "rows with a known value" use &lt;code&gt;COUNT(col)&lt;/code&gt;.&lt;/p&gt;
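&lt;p&gt;The three &lt;code&gt;COUNT&lt;/code&gt; flavors from the rule of thumb, side by side in one query (the &lt;code&gt;events(user_id, clicked_at)&lt;/code&gt; table is hypothetical, for illustration only):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT COUNT(*)                AS click_events,   -- every row
       COUNT(clicked_at)       AS known_times,    -- non-NULL values only
       COUNT(DISTINCT user_id) AS people_clicked  -- unique non-NULL users
FROM events;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;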

&lt;h4&gt;
  
  
  &lt;code&gt;WHERE&lt;/code&gt; vs &lt;code&gt;HAVING&lt;/code&gt; — row filter vs group filter
&lt;/h4&gt;

&lt;p&gt;The two-clause invariant: &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; runs before &lt;code&gt;GROUP BY&lt;/code&gt; and references raw row columns; &lt;code&gt;HAVING&lt;/code&gt; runs after grouping and can reference aggregate functions; trying to use &lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt; is a parse error because aggregates do not exist until after grouping&lt;/strong&gt;. Both can appear in the same query.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; — filter rows; uses &lt;code&gt;col&lt;/code&gt;, &lt;code&gt;col2&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt;&lt;/strong&gt; — filter groups; uses &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(col)&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order of evaluation&lt;/strong&gt; — &lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; — push predicates into &lt;code&gt;WHERE&lt;/code&gt; whenever possible; &lt;code&gt;WHERE&lt;/code&gt; filters before the (often expensive) sort/hash step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Six employees across &lt;code&gt;eng&lt;/code&gt; and &lt;code&gt;sales&lt;/code&gt;; find departments whose average salary exceeds 50,000 across employees earning more than 30,000.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;40,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;25,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;WHERE salary &amp;gt; 30000&lt;/code&gt; drops the two rows below the threshold (eng 25,000 and sales 20,000) — 4 rows remain.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GROUP BY department&lt;/code&gt; collapses to two buckets: eng (40,000 + 70,000) and sales (60,000 + 60,000).&lt;/li&gt;
&lt;li&gt;The engine computes &lt;code&gt;AVG(salary)&lt;/code&gt; per bucket: eng = 55,000; sales = 60,000.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HAVING AVG(salary) &amp;gt; 50000&lt;/code&gt; keeps both buckets (both averages exceed 50,000).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;SELECT&lt;/code&gt; projects the department name and its average; final result is two rows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; aggregate predicate → &lt;code&gt;HAVING&lt;/code&gt;; row predicate → &lt;code&gt;WHERE&lt;/code&gt;. If the predicate uses &lt;code&gt;SUM&lt;/code&gt; / &lt;code&gt;COUNT&lt;/code&gt; / &lt;code&gt;AVG&lt;/code&gt; / &lt;code&gt;MIN&lt;/code&gt; / &lt;code&gt;MAX&lt;/code&gt;, it must live in &lt;code&gt;HAVING&lt;/code&gt;.&lt;/p&gt;
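&lt;p&gt;The duplicate-finder pattern mentioned earlier is the canonical &lt;code&gt;HAVING&lt;/code&gt; example — a sketch over a hypothetical &lt;code&gt;users(email)&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Emails that appear more than once.
SELECT email, COUNT(*) AS n
FROM users
GROUP BY email
HAVING COUNT(*) &amp;gt; 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;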

&lt;h4&gt;
  
  
  &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; and &lt;code&gt;CASE&lt;/code&gt; — conditional aggregates
&lt;/h4&gt;

&lt;p&gt;The conditional-aggregate invariant: &lt;strong&gt;&lt;code&gt;SUM(col) FILTER (WHERE pred)&lt;/code&gt; and &lt;code&gt;COUNT(*) FILTER (WHERE pred)&lt;/code&gt; apply the aggregate only to rows where the predicate is &lt;code&gt;TRUE&lt;/code&gt;; the portable alternative is &lt;code&gt;SUM(CASE WHEN pred THEN col ELSE 0 END)&lt;/code&gt; and &lt;code&gt;COUNT(CASE WHEN pred THEN 1 END)&lt;/code&gt;&lt;/strong&gt;. PostgreSQL supports both; pick &lt;code&gt;FILTER&lt;/code&gt; for clarity in PostgreSQL-only code, &lt;code&gt;CASE&lt;/code&gt; for cross-dialect portability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FILTER (WHERE …)&lt;/code&gt;&lt;/strong&gt; — PostgreSQL/SQL-standard syntax; applies per-aggregate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(CASE WHEN … THEN col ELSE 0 END)&lt;/code&gt;&lt;/strong&gt; — portable across dialects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(CASE WHEN … THEN 1 END)&lt;/code&gt;&lt;/strong&gt; — counts only matching rows; &lt;code&gt;NULL&lt;/code&gt;s in the &lt;code&gt;ELSE&lt;/code&gt; branch are skipped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple aggregates, one query&lt;/strong&gt; — combine many &lt;code&gt;FILTER&lt;/code&gt; clauses to compute several metrics in one pass.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; One pass over &lt;code&gt;orders&lt;/code&gt; to compute total revenue, refunded revenue, and the refund rate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;total_revenue&lt;/th&gt;
&lt;th&gt;refunded_revenue&lt;/th&gt;
&lt;th&gt;refund_pct&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;25.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;SUM(amount)&lt;/code&gt; aggregates every row in the bucket → &lt;code&gt;total_revenue&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SUM(amount) FILTER (WHERE status = 'refunded')&lt;/code&gt; aggregates only refunded rows → &lt;code&gt;refunded_revenue&lt;/code&gt;; when no rows match, the filtered &lt;code&gt;SUM&lt;/code&gt; yields &lt;code&gt;NULL&lt;/code&gt;, so wrap it in &lt;code&gt;COALESCE(..., 0)&lt;/code&gt; to report &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The refund percentage is &lt;code&gt;refunded_revenue / total_revenue * 100&lt;/code&gt;; cast one side to &lt;code&gt;NUMERIC&lt;/code&gt; to avoid integer division.&lt;/li&gt;
&lt;li&gt;PostgreSQL evaluates every &lt;code&gt;FILTER&lt;/code&gt; independently per row of input; one scan computes all metrics.&lt;/li&gt;
&lt;li&gt;The portable variant uses &lt;code&gt;SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END)&lt;/code&gt; — same result, slightly more verbose.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                              &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'refunded'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;refunded_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'refunded'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;NUMERIC&lt;/span&gt;
         &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;refund_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; whenever you find yourself running two queries with different &lt;code&gt;WHERE&lt;/code&gt; clauses against the same table and joining the results, refactor to a single query with two &lt;code&gt;FILTER&lt;/code&gt; clauses — same answer, half the cost.&lt;/p&gt;
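&lt;p&gt;&lt;em&gt;Quick check:&lt;/em&gt; a minimal sketch of the &lt;code&gt;FILTER&lt;/code&gt;-vs-&lt;code&gt;CASE&lt;/code&gt; comparison, run through Python's &lt;code&gt;sqlite3&lt;/code&gt; (SQLite 3.30+ also supports &lt;code&gt;FILTER&lt;/code&gt;). The tiny &lt;code&gt;orders&lt;/code&gt; table here is hypothetical; note the empty-group difference — &lt;code&gt;FILTER&lt;/code&gt; yields &lt;code&gt;NULL&lt;/code&gt; where the &lt;code&gt;CASE&lt;/code&gt; form yields &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;

```python
import sqlite3

# Hypothetical orders table; SQLite (3.30+) also supports the FILTER
# clause, so the two forms can be compared directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, status TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "paid", 100.0), (1, "refunded", 40.0), (1, "paid", 60.0),
    (2, "paid", 50.0),   # customer 2 has no refunds at all
])

filter_rows = conn.execute("""
    SELECT customer_id,
           SUM(amount) FILTER (WHERE status = 'refunded') AS refunded
    FROM orders GROUP BY customer_id ORDER BY customer_id
""").fetchall()

case_rows = conn.execute("""
    SELECT customer_id,
           SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refunded
    FROM orders GROUP BY customer_id ORDER BY customer_id
""").fetchall()

# For customer 2 (no refunds) FILTER gives NULL, the CASE form gives 0.
print(filter_rows)  # [(1, 40.0), (2, None)]
print(case_rows)    # [(1, 40.0), (2, 0)]
```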

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt; — PostgreSQL rejects this with &lt;code&gt;aggregate functions are not allowed in WHERE&lt;/code&gt;; aggregates do not exist until after &lt;code&gt;GROUP BY&lt;/code&gt;. Use &lt;code&gt;HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVG(col)&lt;/code&gt; and assuming &lt;code&gt;NULL&lt;/code&gt; rows count as &lt;code&gt;0&lt;/code&gt; — they are excluded from both numerator and denominator. Use &lt;code&gt;AVG(COALESCE(col, 0))&lt;/code&gt; only if "missing means 0" is the business rule.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt; after a &lt;code&gt;JOIN&lt;/code&gt; that duplicates rows — a plain &lt;code&gt;COUNT&lt;/code&gt; reports inflated totals.&lt;/li&gt;
&lt;li&gt;Integer division — &lt;code&gt;5 / 100 = 0&lt;/code&gt; in PostgreSQL. Cast one operand to &lt;code&gt;NUMERIC&lt;/code&gt; or &lt;code&gt;FLOAT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Division by zero — &lt;code&gt;NULLIF(denom, 0)&lt;/code&gt; converts &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;NULL&lt;/code&gt;, so the division returns &lt;code&gt;NULL&lt;/code&gt; instead of erroring.&lt;/li&gt;
&lt;/ul&gt;
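&lt;p&gt;A runnable sketch of the last two bullets, using Python's &lt;code&gt;sqlite3&lt;/code&gt; for a quick check (one caveat: SQLite returns &lt;code&gt;NULL&lt;/code&gt; on division by zero instead of erroring like PostgreSQL, so &lt;code&gt;NULLIF&lt;/code&gt; is shown here as the portable guard):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer division: both operands are integers, so the result truncates.
truncated = conn.execute("SELECT 5 / 100").fetchone()[0]

# Cast one operand to a non-integer type to get the intended ratio.
ratio = conn.execute("SELECT CAST(5 AS REAL) / 100").fetchone()[0]

# NULLIF(denom, 0) turns 0 into NULL, so the division yields NULL.
# (PostgreSQL would error on a bare division by zero; SQLite returns
# NULL either way, which is why NULLIF is the portable guard.)
safe = conn.execute("SELECT 5.0 / NULLIF(0, 0)").fetchone()[0]

print(truncated, ratio, safe)  # 0 0.05 None
```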

&lt;h3&gt;
  
  
  PostgreSQL Interview Question on Duplicate Emails
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;users(id, email)&lt;/code&gt;, &lt;strong&gt;return every email that appears more than once&lt;/strong&gt;, along with the number of copies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;GROUP BY email HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_copies&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_copies&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;GROUP BY email&lt;/code&gt; collapses every row with the same email into a single bucket; &lt;code&gt;COUNT(*)&lt;/code&gt; counts how many rows fell into each bucket; &lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; keeps only buckets with at least two rows; &lt;code&gt;ORDER BY n_copies DESC, email&lt;/code&gt; produces a deterministic, reviewer-friendly output. Single pass over &lt;code&gt;users&lt;/code&gt;; sort cost dominates only when email cardinality is huge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;email&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:carol@example.com"&gt;carol@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FROM users&lt;/code&gt;&lt;/strong&gt; — read all six rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; — every row passes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY email&lt;/code&gt;&lt;/strong&gt; — three buckets: alice (3 rows), bob (2 rows), carol (1 row).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — 3, 2, 1 respectively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — drops the carol bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY n_copies DESC, email&lt;/code&gt;&lt;/strong&gt; — alice (3), then bob (2).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;email&lt;/th&gt;
&lt;th&gt;n_copies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
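&lt;p&gt;The trace above can be reproduced end to end — a minimal sketch via Python's &lt;code&gt;sqlite3&lt;/code&gt;, whose &lt;code&gt;GROUP BY&lt;/code&gt;/&lt;code&gt;HAVING&lt;/code&gt; semantics match PostgreSQL for this query:&lt;/p&gt;

```python
import sqlite3

# Reproduce the six-row users table from the trace, then run the
# duplicate-email query unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    (1, "alice@example.com"), (2, "bob@example.com"),
    (3, "alice@example.com"), (4, "carol@example.com"),
    (5, "bob@example.com"),   (6, "alice@example.com"),
])

rows = conn.execute("""
    SELECT email, COUNT(*) AS n_copies
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY n_copies DESC, email
""").fetchall()

# carol's single-row bucket is dropped by HAVING.
print(rows)  # [('alice@example.com', 3), ('bob@example.com', 2)]
```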

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY email&lt;/code&gt;&lt;/strong&gt; — collapses to one bucket per distinct email; the bucket is the unit of all subsequent aggregates and group-level filters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — counts every row in the bucket, exactly what "how many copies" asks for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — group-level filter; the aggregate predicate must live here, not in &lt;code&gt;WHERE&lt;/code&gt;. This is the precise interview signal for duplicate detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY n_copies DESC, email&lt;/code&gt;&lt;/strong&gt; — deterministic ordering; ties broken by &lt;code&gt;email&lt;/code&gt; so the output is stable across runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|users| + G log G)&lt;/code&gt; time&lt;/strong&gt; — a single hash aggregation produces &lt;code&gt;G&lt;/code&gt; group rows; the final sort is &lt;code&gt;G log G&lt;/code&gt;. With an index on &lt;code&gt;email&lt;/code&gt;, the planner may choose a sorted &lt;code&gt;GroupAggregate&lt;/code&gt; and skip the hash step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation practice page&lt;/a&gt; for &lt;code&gt;GROUP BY&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt; shapes, and the &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;SQL filtering practice page&lt;/a&gt; for &lt;code&gt;WHERE&lt;/code&gt; vs &lt;code&gt;HAVING&lt;/code&gt; distinctions.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. PostgreSQL Window Functions — &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ranking, top-N-per-group, running totals, and lookback in PostgreSQL window functions
&lt;/h3&gt;

&lt;p&gt;"Find the second-highest distinct salary" and "compute a running total of daily revenue" are the two signature window-function prompts — and both reduce to a &lt;strong&gt;window function with &lt;code&gt;OVER (PARTITION BY … ORDER BY …)&lt;/code&gt;&lt;/strong&gt;. The mental model: &lt;strong&gt;a window function computes a value across a window of rows related to the current row without collapsing the rows like &lt;code&gt;GROUP BY&lt;/code&gt; does; &lt;code&gt;OVER (PARTITION BY col)&lt;/code&gt; defines the window boundary; &lt;code&gt;OVER (ORDER BY col)&lt;/code&gt; defines the order within the window&lt;/strong&gt;. &lt;code&gt;ROW_NUMBER&lt;/code&gt; assigns unique integers; &lt;code&gt;RANK&lt;/code&gt; skips after ties (&lt;code&gt;1, 2, 2, 4&lt;/code&gt;); &lt;code&gt;DENSE_RANK&lt;/code&gt; does not skip (&lt;code&gt;1, 2, 2, 3&lt;/code&gt;); &lt;code&gt;LAG&lt;/code&gt; looks back; &lt;code&gt;LEAD&lt;/code&gt; looks forward; &lt;code&gt;SUM/AVG/COUNT(...) OVER (...)&lt;/code&gt; compute running totals and moving averages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypy49fa51qwnlgekf61n.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypy49fa51qwnlgekf61n.webp" alt="Three-column comparison of PostgreSQL window functions on a salary ladder with tied rows: ROW_NUMBER yields unique 1-2-3-4, RANK yields 1-2-2-4 with a +2 skip, DENSE_RANK yields 1-2-2-3 with no gap; a caption explains DENSE_RANK equals N for the Nth distinct value, plus an inset showing a running total via SUM(amount) OVER (ORDER BY date) on a small sales table." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Window functions cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt; because they execute &lt;em&gt;after&lt;/em&gt; &lt;code&gt;WHERE&lt;/code&gt;. Wrap the window in a CTE or subquery, then filter on the alias. The error &lt;code&gt;column "rn" does not exist&lt;/code&gt; after writing &lt;code&gt;WHERE rn = 1&lt;/code&gt; almost always means you forgot this rule.&lt;/p&gt;
&lt;/blockquote&gt;
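&lt;p&gt;A minimal sketch of the wrap-then-filter rule from the pro tip — the window alias is computed in a CTE, so the outer &lt;code&gt;WHERE&lt;/code&gt; can reference it. The &lt;code&gt;scores&lt;/code&gt; table and data are hypothetical; SQLite (3.25+) shares this evaluation order:&lt;/p&gt;

```python
import sqlite3

# Hypothetical scores table: two rows per player, keep each player's best.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("ann", 10), ("ann", 30), ("ben", 20), ("ben", 15)])

rows = conn.execute("""
    WITH numbered AS (
        SELECT player, score,
               ROW_NUMBER() OVER (
                   PARTITION BY player ORDER BY score DESC
               ) AS rn
        FROM scores
    )
    SELECT player, score
    FROM numbered
    WHERE rn = 1          -- legal here: rn already exists one level up
    ORDER BY player
""").fetchall()

print(rows)  # [('ann', 30), ('ben', 20)]
```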

&lt;h4&gt;
  
  
  &lt;code&gt;ROW_NUMBER&lt;/code&gt; — unique sequential numbering per partition
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;ROW_NUMBER&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;ROW_NUMBER() OVER (PARTITION BY p ORDER BY o)&lt;/code&gt; assigns a unique integer &lt;code&gt;1, 2, 3, …&lt;/code&gt; to every row inside each partition &lt;code&gt;p&lt;/code&gt;, ordered by &lt;code&gt;o&lt;/code&gt;; ties in &lt;code&gt;o&lt;/code&gt; are broken arbitrarily by the planner&lt;/strong&gt;. Use it when you need a unique sequence per group regardless of tie semantics — most often for deduplication (keep &lt;code&gt;rn = 1&lt;/code&gt; per business key).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OVER (PARTITION BY …)&lt;/code&gt;&lt;/strong&gt; — bucket the rows; without this, the window is the whole result set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OVER (ORDER BY …)&lt;/code&gt;&lt;/strong&gt; — order within the bucket; required for &lt;code&gt;ROW_NUMBER&lt;/code&gt;/&lt;code&gt;RANK&lt;/code&gt;/&lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ties broken arbitrarily&lt;/strong&gt; — add a tiebreaker column to &lt;code&gt;ORDER BY&lt;/code&gt; for determinism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-N-per-group&lt;/strong&gt; — &lt;code&gt;WHERE rn &amp;lt;= N&lt;/code&gt; after &lt;code&gt;ROW_NUMBER&lt;/code&gt;; works only when ties at rank N can be ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;employees&lt;/code&gt; with three engineers; rank by salary descending.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;row_number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bob and Carol tie on salary; &lt;code&gt;ROW_NUMBER&lt;/code&gt; still gives them unique ranks (planner-chosen unless you add a tiebreaker).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;PARTITION BY department&lt;/code&gt; defines the boundary — only &lt;code&gt;eng&lt;/code&gt; rows are compared with each other; if there were a &lt;code&gt;sales&lt;/code&gt; partition it would have its own &lt;code&gt;1, 2, 3&lt;/code&gt; sequence.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ORDER BY salary DESC, name&lt;/code&gt; orders rows within the partition: Alice (90,000) first, then Bob and Carol (tied at 80,000) broken by name.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ROW_NUMBER()&lt;/code&gt; assigns &lt;code&gt;1, 2, 3&lt;/code&gt; sequentially regardless of ties; Bob gets &lt;code&gt;2&lt;/code&gt; and Carol gets &lt;code&gt;3&lt;/code&gt; because &lt;code&gt;name&lt;/code&gt; breaks the tie.&lt;/li&gt;
&lt;li&gt;Without the &lt;code&gt;, name&lt;/code&gt; tiebreaker, Bob/Carol order is undefined — two query runs could swap them.&lt;/li&gt;
&lt;li&gt;To deduplicate a table that has multiple rows per &lt;code&gt;(business_key, source_ts)&lt;/code&gt;, use &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1&lt;/code&gt; to keep the latest.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
         &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;ROW_NUMBER&lt;/code&gt; is the right tool for &lt;em&gt;deduplication&lt;/em&gt; (&lt;code&gt;WHERE rn = 1&lt;/code&gt;) and for ordered streams; reach for &lt;code&gt;RANK&lt;/code&gt; or &lt;code&gt;DENSE_RANK&lt;/code&gt; when ties must be honored.&lt;/p&gt;
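&lt;p&gt;The dedup pattern in the rule of thumb, sketched end to end — partition by the business key, order newest first, keep &lt;code&gt;rn = 1&lt;/code&gt;. Table and column names here are hypothetical:&lt;/p&gt;

```python
import sqlite3

# Hypothetical raw_events table with duplicate business keys; keep the
# newest row per key via ROW_NUMBER ... ORDER BY source_ts DESC.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_events (business_key TEXT, source_ts INTEGER, payload TEXT)"
)
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", [
    ("k1", 100, "old"), ("k1", 200, "new"),
    ("k2", 50, "only"),
])

rows = conn.execute("""
    WITH ranked AS (
        SELECT business_key, source_ts, payload,
               ROW_NUMBER() OVER (
                   PARTITION BY business_key
                   ORDER BY source_ts DESC
               ) AS rn
        FROM raw_events
    )
    SELECT business_key, payload
    FROM ranked
    WHERE rn = 1
    ORDER BY business_key
""").fetchall()

print(rows)  # [('k1', 'new'), ('k2', 'only')]
```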

&lt;h4&gt;
  
  
  &lt;code&gt;RANK&lt;/code&gt; vs &lt;code&gt;DENSE_RANK&lt;/code&gt; — tie semantics
&lt;/h4&gt;

&lt;p&gt;The rank-vs-dense-rank invariant: &lt;strong&gt;both assign the same rank to tied rows; &lt;code&gt;RANK&lt;/code&gt; then skips the next &lt;code&gt;k-1&lt;/code&gt; ranks (gap), while &lt;code&gt;DENSE_RANK&lt;/code&gt; continues without a gap&lt;/strong&gt;. For "find the Nth distinct value" questions, &lt;code&gt;DENSE_RANK = N&lt;/code&gt; is the correct filter; for "find the Nth row in skip-aware ranking order", &lt;code&gt;RANK = N&lt;/code&gt; is correct.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;1, 2, 2, 4&lt;/code&gt; — skips after ties.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;1, 2, 2, 3&lt;/code&gt; — no skip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;1, 2, 3, 4&lt;/code&gt; — never ties.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick by semantics&lt;/strong&gt; — "Nth highest distinct salary" → &lt;code&gt;DENSE_RANK = N&lt;/code&gt;; "Nth-ranked row in skip ordering" → &lt;code&gt;RANK = N&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Four employees; Bob and Carol tied at second-highest salary.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;rank&lt;/th&gt;
&lt;th&gt;dense_rank&lt;/th&gt;
&lt;th&gt;row_number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;RANK&lt;/code&gt; jumps &lt;code&gt;2 → 4&lt;/code&gt; (skipping &lt;code&gt;3&lt;/code&gt;); &lt;code&gt;DENSE_RANK&lt;/code&gt; continues &lt;code&gt;2 → 3&lt;/code&gt; (no skip).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All three window functions agree on Alice (rank 1) because she is alone at the top.&lt;/li&gt;
&lt;li&gt;Bob and Carol both get &lt;code&gt;rank = 2&lt;/code&gt; and &lt;code&gt;dense_rank = 2&lt;/code&gt; because they tie on salary; &lt;code&gt;row_number&lt;/code&gt; gives them distinct values 2 and 3.&lt;/li&gt;
&lt;li&gt;Dan is the next-lowest salary; &lt;code&gt;RANK&lt;/code&gt; skips ahead by the number of tied rows (2 tied → next rank is &lt;code&gt;2 + 2 = 4&lt;/code&gt;); &lt;code&gt;DENSE_RANK&lt;/code&gt; continues with no gap (&lt;code&gt;3&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;For "second highest distinct salary", &lt;code&gt;DENSE_RANK = 2&lt;/code&gt; correctly returns 80,000; &lt;code&gt;RANK = 2&lt;/code&gt; would also work here, but &lt;code&gt;RANK&lt;/code&gt; would &lt;em&gt;not&lt;/em&gt; return 80,000 if three people tied for first (it would skip to 4).&lt;/li&gt;
&lt;li&gt;For "top 3 distinct salaries", use &lt;code&gt;DENSE_RANK &amp;lt;= 3&lt;/code&gt; — it returns Alice, Bob, Carol, Dan (four rows because Bob/Carol both have &lt;code&gt;dr = 2&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rnk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "second highest salary" → &lt;code&gt;DENSE_RANK = 2&lt;/code&gt;; "top 3 distinct salaries" → &lt;code&gt;DENSE_RANK &amp;lt;= 3&lt;/code&gt;; never use &lt;code&gt;RANK&lt;/code&gt; for these unless the spec explicitly says ties should consume rank slots.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, and running totals — lookback, lookahead, and &lt;code&gt;SUM(...) OVER (...)&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The lookback-and-running invariant: &lt;strong&gt;&lt;code&gt;LAG(col, n)&lt;/code&gt; returns the value of &lt;code&gt;col&lt;/code&gt; &lt;code&gt;n&lt;/code&gt; rows back within the partition (default &lt;code&gt;n=1&lt;/code&gt;); &lt;code&gt;LEAD(col, n)&lt;/code&gt; is the symmetric forward; &lt;code&gt;SUM(col) OVER (PARTITION BY p ORDER BY o)&lt;/code&gt; produces a running total within each partition&lt;/strong&gt;. These three primitives drive month-over-month deltas, sessionization, running balances, and moving averages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(amount) OVER (ORDER BY date)&lt;/code&gt;&lt;/strong&gt; — previous day's amount.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEAD(amount) OVER (ORDER BY date)&lt;/code&gt;&lt;/strong&gt; — next day's amount.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;amount - LAG(amount) OVER (ORDER BY date)&lt;/code&gt;&lt;/strong&gt; — day-over-day delta.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(amount) OVER (ORDER BY date)&lt;/code&gt;&lt;/strong&gt; — running total from start of partition through current row.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three days of sales; compute previous-day amount, day-over-day delta, and running total.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sales_date&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;th&gt;prev_amount&lt;/th&gt;
&lt;th&gt;dod_delta&lt;/th&gt;
&lt;th&gt;running_total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-09&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-10&lt;/td&gt;
&lt;td&gt;130&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;230&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-11&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;td&gt;130&lt;/td&gt;
&lt;td&gt;-10&lt;/td&gt;
&lt;td&gt;350&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first day has &lt;code&gt;LAG = NULL&lt;/code&gt; because no prior row exists; consumers usually &lt;code&gt;COALESCE(delta, 0)&lt;/code&gt; for display.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;LAG(amount) OVER (ORDER BY sales_date)&lt;/code&gt; returns the previous row's amount, ordered by date.&lt;/li&gt;
&lt;li&gt;Day 1 (May 9): no previous row, so &lt;code&gt;LAG = NULL&lt;/code&gt;; &lt;code&gt;amount - LAG = NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Day 2 (May 10): &lt;code&gt;LAG = 100&lt;/code&gt;; &lt;code&gt;delta = 130 - 100 = 30&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Day 3 (May 11): &lt;code&gt;LAG = 130&lt;/code&gt;; &lt;code&gt;delta = 120 - 130 = -10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SUM(amount) OVER (ORDER BY sales_date)&lt;/code&gt; accumulates from the start of the partition through the current row: 100, 230, 350.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dod_delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;LAG&lt;/code&gt; for "compare this row to its predecessor" (delta, retention, gap); &lt;code&gt;LEAD&lt;/code&gt; for "what happens next" (sessionization, churn-from-here); &lt;code&gt;SUM(...) OVER (...)&lt;/code&gt; for running totals — always &lt;code&gt;PARTITION BY&lt;/code&gt; the entity if the table holds multiple series.&lt;/p&gt;
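&lt;p&gt;A sketch of the "&lt;code&gt;PARTITION BY&lt;/code&gt; the entity" advice: two products share one table, and both &lt;code&gt;LAG&lt;/code&gt; and the running &lt;code&gt;SUM&lt;/code&gt; reset at each product boundary. Data is hypothetical:&lt;/p&gt;

```python
import sqlite3

# Two independent series (products a and b) in one sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, sales_date TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("a", "2026-05-09", 100), ("a", "2026-05-10", 130),
    ("b", "2026-05-09", 50),  ("b", "2026-05-10", 70),
])

rows = conn.execute("""
    SELECT product, sales_date,
           amount - LAG(amount) OVER (
               PARTITION BY product ORDER BY sales_date
           ) AS delta,
           SUM(amount) OVER (
               PARTITION BY product ORDER BY sales_date
           ) AS running
    FROM sales
    ORDER BY product, sales_date
""").fetchall()

# Each product's first day has delta NULL and a fresh running total.
print(rows)
# [('a', '2026-05-09', None, 100), ('a', '2026-05-10', 30, 230),
#  ('b', '2026-05-09', None, 50),  ('b', '2026-05-10', 20, 120)]
```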

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;RANK&lt;/code&gt; when the question wants the Nth &lt;em&gt;distinct&lt;/em&gt; value — &lt;code&gt;RANK = 2&lt;/code&gt; skips entirely if two rows tie for first.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;PARTITION BY&lt;/code&gt; for a per-group ranking — produces a global ranking instead of per-department.&lt;/li&gt;
&lt;li&gt;Referencing the window-function alias in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt; — window functions execute after &lt;code&gt;WHERE&lt;/code&gt;; wrap in a CTE or subquery first.&lt;/li&gt;
&lt;li&gt;Confusing &lt;code&gt;LAG&lt;/code&gt; (previous) with &lt;code&gt;LEAD&lt;/code&gt; (next) — quietly produces inverted deltas.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;ORDER BY&lt;/code&gt; inside &lt;code&gt;OVER&lt;/code&gt; for &lt;code&gt;ROW_NUMBER&lt;/code&gt;/&lt;code&gt;RANK&lt;/code&gt;/&lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt; — PostgreSQL accepts the syntax, but the within-window order is then arbitrary, so the numbering and every &lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt; value is non-deterministic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  PostgreSQL Interview Question on Top 3 Salaries Per Department
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;employees(emp_id, name, department, salary)&lt;/code&gt;, &lt;strong&gt;return the top 3 distinct salaries per department&lt;/strong&gt;, with ties at rank 3 included. Output &lt;code&gt;department&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;salary&lt;/code&gt;, and the rank.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC)&lt;/code&gt; in a CTE
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the CTE &lt;code&gt;ranked&lt;/code&gt; materializes a per-department &lt;code&gt;DENSE_RANK&lt;/code&gt; keyed by salary descending: &lt;code&gt;dr = 1&lt;/code&gt; is the highest distinct salary in that department, &lt;code&gt;dr = 2&lt;/code&gt; the second-highest, and so on. The outer &lt;code&gt;WHERE dr &amp;lt;= 3&lt;/code&gt; keeps every row whose salary is among the top three distinct salaries of its department, including all ties at rank 3, and the final &lt;code&gt;ORDER BY&lt;/code&gt; produces a deterministic, reviewer-friendly output. We pick &lt;code&gt;DENSE_RANK&lt;/code&gt; over &lt;code&gt;RANK&lt;/code&gt; because the spec wants the top three &lt;em&gt;distinct&lt;/em&gt; salaries, and over &lt;code&gt;ROW_NUMBER&lt;/code&gt; because ties at rank 3 must be retained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;emp_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Grace&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Heidi&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CTE &lt;code&gt;ranked&lt;/code&gt;&lt;/strong&gt; — partition by &lt;code&gt;department&lt;/code&gt;; order by &lt;code&gt;salary DESC&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt; per partition&lt;/strong&gt; — eng: Alice → 1, Bob → 2, Carol → 2, Dan → 3, Eve → 4. sales: Frank → 1, Grace → 2, Heidi → 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outer &lt;code&gt;WHERE dr &amp;lt;= 3&lt;/code&gt;&lt;/strong&gt; — drops Eve (&lt;code&gt;dr = 4&lt;/code&gt;); keeps both Bob and Carol (tied at 2) and Dan (3).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY department, dr, name&lt;/code&gt;&lt;/strong&gt; — eng rows first, then sales; within department by &lt;code&gt;dr&lt;/code&gt;, then &lt;code&gt;name&lt;/code&gt; for tiebreak.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;dr&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;70000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;100000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;Grace&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;Heidi&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CTE &lt;code&gt;ranked&lt;/code&gt;&lt;/strong&gt; — names the intermediate ranked result; the outer query then filters it like a regular table. Far cleaner than a nested subquery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY department&lt;/code&gt;&lt;/strong&gt; — restarts the rank at each department boundary; without this, the rank is global and the answer is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY salary DESC&lt;/code&gt;&lt;/strong&gt; — defines "highest first" inside each partition; required for any deterministic ranking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt; over &lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; — the spec wants the top three &lt;em&gt;distinct&lt;/em&gt; salaries; &lt;code&gt;RANK&lt;/code&gt; would skip after ties and miss the third distinct salary if there is a two-way tie above it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE dr &amp;lt;= 3&lt;/code&gt; in the outer query&lt;/strong&gt; — window functions cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt;; the CTE provides the materialized column the outer query can filter on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N log N)&lt;/code&gt; time&lt;/strong&gt; — the sort within each partition dominates; with an index on &lt;code&gt;(department, salary DESC)&lt;/code&gt; the planner can stream rather than sort.&lt;/li&gt;
&lt;/ul&gt;
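&lt;p&gt;The whole query can be replayed against the sample table in a few lines. A minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in engine (SQLite 3.25+ accepts the same window-function syntax used here; &lt;code&gt;BETWEEN 1 AND 3&lt;/code&gt; expresses the rank-at-most-3 filter):&lt;/p&gt;

```python
# Verify the DENSE_RANK top-3 query against the article's sample data.
# sqlite3 (stdlib) stands in for PostgreSQL; the window syntax matches.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER, name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Alice', 'eng', 90000), (2, 'Bob', 'eng', 80000),
        (3, 'Carol', 'eng', 80000), (4, 'Dan', 'eng', 70000),
        (5, 'Eve', 'eng', 60000), (6, 'Frank', 'sales', 100000),
        (7, 'Grace', 'sales', 90000), (8, 'Heidi', 'sales', 80000);
""")
rows = conn.execute("""
    WITH ranked AS (
        SELECT department, name, salary,
               DENSE_RANK() OVER (
                   PARTITION BY department ORDER BY salary DESC
               ) AS dr
        FROM employees
    )
    SELECT department, name, salary, dr
    FROM ranked
    WHERE dr BETWEEN 1 AND 3      -- keeps ranks 1 through 3
    ORDER BY department, dr, name
""").fetchall()
names = [r[1] for r in rows]
print(names)  # Eve (dr = 4) is dropped; Bob and Carol tie at dr = 2
```

&lt;p&gt;Eve never appears because her &lt;code&gt;dr&lt;/code&gt; is 4, while Bob and Carol both survive the tie at rank 2.&lt;/p&gt;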

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window-function practice problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;SQL CTE practice problems&lt;/a&gt; on PipeCode.&lt;/p&gt;





&lt;h2&gt;
  
  
  Tips to use this PostgreSQL cheat sheet effectively
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hold the clause-order diagram in your head
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;. Memorize this sentence and 80% of "weird" PostgreSQL parse errors decode themselves in five seconds. The error &lt;code&gt;column "x" does not exist&lt;/code&gt; almost always means you referenced a &lt;code&gt;SELECT&lt;/code&gt; alias in &lt;code&gt;WHERE&lt;/code&gt;; the error &lt;code&gt;aggregate functions are not allowed in WHERE&lt;/code&gt; means you wanted &lt;code&gt;HAVING&lt;/code&gt; instead.&lt;/p&gt;
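&lt;p&gt;The aggregate-in-&lt;code&gt;WHERE&lt;/code&gt; error is easy to reproduce. A minimal sketch with Python's &lt;code&gt;sqlite3&lt;/code&gt;; the error text differs from PostgreSQL's, but the cause, an aggregate evaluated before &lt;code&gt;GROUP BY&lt;/code&gt; exists, is the same:&lt;/p&gt;

```python
# An aggregate in WHERE fails because WHERE runs before GROUP BY;
# the same filter in HAVING works. sqlite3 (stdlib) is the engine here,
# where the error reads "misuse of aggregate" rather than PostgreSQL's
# "aggregate functions are not allowed in WHERE".
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER);
    INSERT INTO orders VALUES (1), (1), (2);
""")

try:
    conn.execute("SELECT customer_id FROM orders WHERE COUNT(*) > 1")
    failed = False
except sqlite3.OperationalError:
    failed = True          # aggregate rejected in WHERE

# HAVING runs after GROUP BY, so the aggregate exists by then
repeat_customers = conn.execute("""
    SELECT customer_id FROM orders
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()
print(failed, repeat_customers)  # True [(1,)]
```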

&lt;h3&gt;
  
  
  State the grain before any &lt;code&gt;JOIN&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Before writing the &lt;code&gt;JOIN&lt;/code&gt;, name the grain you're producing: "this is order-line grain", "this is customer-day grain", "this is &lt;code&gt;(customer, product)&lt;/code&gt; grain". The single most common bug in analytical SQL is &lt;code&gt;SUM(left.col)&lt;/code&gt; after a &lt;code&gt;1:N&lt;/code&gt; join — the metric is silently multiplied by &lt;code&gt;N&lt;/code&gt;. If grain doubles, you'll spot it immediately.&lt;/p&gt;
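&lt;p&gt;The fan-out is concrete enough to demonstrate in a few lines. A hedged sketch with Python's &lt;code&gt;sqlite3&lt;/code&gt;; the table and column names are illustrative:&lt;/p&gt;

```python
# Reproduce the 1:N fan-out bug: SUM(orders.total) after joining to
# order_lines multiplies each order total by its line count.
# sqlite3 (stdlib) stands in for PostgreSQL; the arithmetic is identical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, total INTEGER);
    CREATE TABLE order_lines (order_id INTEGER, qty INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, 50);
    INSERT INTO order_lines VALUES (1, 1), (1, 2), (1, 3), (2, 1);
""")

true_total = conn.execute("SELECT SUM(total) FROM orders").fetchone()[0]
inflated = conn.execute("""
    SELECT SUM(o.total)
    FROM orders o JOIN order_lines l ON l.order_id = o.order_id
""").fetchone()[0]
print(true_total, inflated)  # 150 vs 350: order 1 is counted three times
```

&lt;p&gt;Order 1 has three lines, so its 100 becomes 300 after the join; naming the grain ("order grain" vs "order-line grain") before summing catches this immediately.&lt;/p&gt;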

&lt;h3&gt;
  
  
  Use &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; over &lt;code&gt;NOT IN&lt;/code&gt; for anti-joins
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;NOT IN (subquery)&lt;/code&gt; returns zero rows when the subquery contains a single &lt;code&gt;NULL&lt;/code&gt; because &lt;code&gt;x NOT IN (..., NULL, ...)&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, which fails the &lt;code&gt;WHERE&lt;/code&gt; predicate. &lt;code&gt;LEFT JOIN ... WHERE right.id IS NULL&lt;/code&gt; and &lt;code&gt;NOT EXISTS (...)&lt;/code&gt; are immune. Production engineers who have been burned once never write &lt;code&gt;NOT IN&lt;/code&gt; again.&lt;/p&gt;
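&lt;p&gt;The trap can be shown directly, since SQLite follows the same three-valued &lt;code&gt;NULL&lt;/code&gt; logic as PostgreSQL. A minimal sketch:&lt;/p&gt;

```python
# One NULL in the subquery makes the NOT IN predicate unknown for every
# row, so zero rows come back; NOT EXISTS is immune to the same NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (pk INTEGER);
    CREATE TABLE b (fk INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (NULL);   -- the NULL poisons NOT IN
""")

not_in = conn.execute(
    "SELECT pk FROM a WHERE pk NOT IN (SELECT fk FROM b)").fetchall()
not_exists = conn.execute("""
    SELECT pk FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.fk = a.pk)
""").fetchall()
print(not_in, not_exists)  # [] vs [(2,), (3,)]
```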

&lt;h3&gt;
  
  
  Pick &lt;code&gt;DENSE_RANK&lt;/code&gt; for "Nth distinct"; pick &lt;code&gt;ROW_NUMBER&lt;/code&gt; for deduplication
&lt;/h3&gt;

&lt;p&gt;The single most-graded ranking distinction: &lt;strong&gt;&lt;code&gt;DENSE_RANK = N&lt;/code&gt; is the Nth distinct value; &lt;code&gt;RANK = N&lt;/code&gt; is the Nth row in skip-aware ranking order; &lt;code&gt;ROW_NUMBER = N&lt;/code&gt; is the Nth row in arbitrary order&lt;/strong&gt;. For "second-highest distinct salary" → &lt;code&gt;DENSE_RANK = 2&lt;/code&gt;. For "remove duplicate rows keeping the canonical one" → &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker) = 1&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; for one-pass conditional metrics
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;SUM(amount) FILTER (WHERE status = 'refunded')&lt;/code&gt; is cleaner than &lt;code&gt;SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END)&lt;/code&gt; — PostgreSQL supports both. Use &lt;code&gt;FILTER&lt;/code&gt; in PostgreSQL-only code, &lt;code&gt;CASE&lt;/code&gt; for cross-dialect portability. One scan, many metrics, half the cost of two queries.&lt;/p&gt;
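&lt;p&gt;Both forms can be checked side by side. A minimal sketch with Python's &lt;code&gt;sqlite3&lt;/code&gt; (SQLite 3.30+ also supports the &lt;code&gt;FILTER&lt;/code&gt; clause):&lt;/p&gt;

```python
# FILTER (WHERE ...) and the portable CASE rewrite agree on the same
# rows, computed in a single scan of the table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (amount INTEGER, status TEXT);
    INSERT INTO payments VALUES
        (100, 'paid'), (40, 'refunded'), (60, 'paid'), (10, 'refunded');
""")

with_filter, with_case = conn.execute("""
    SELECT SUM(amount) FILTER (WHERE status = 'refunded'),
           SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END)
    FROM payments
""").fetchone()
print(with_filter, with_case)  # 50 50: one scan, both metrics
```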

&lt;h3&gt;
  
  
  Always &lt;code&gt;ORDER BY&lt;/code&gt; + tiebreaker; pair &lt;code&gt;LIMIT&lt;/code&gt; with &lt;code&gt;ORDER BY&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Window functions, &lt;code&gt;LIMIT N&lt;/code&gt;, and "top result" queries all require an &lt;code&gt;ORDER BY&lt;/code&gt; with a &lt;em&gt;deterministic&lt;/em&gt; tiebreaker (e.g., &lt;code&gt;ORDER BY salary DESC, name&lt;/code&gt;). Without one, two runs of the same query can return different rows in the tie band — silently wrong in production and visibly wrong in an interview if the reviewer's reference answer locks an ordering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use PostgreSQL-specific helpers — &lt;code&gt;EXTRACT&lt;/code&gt;, &lt;code&gt;DATE_TRUNC&lt;/code&gt;, &lt;code&gt;INTERVAL&lt;/code&gt;, &lt;code&gt;::DATE&lt;/code&gt; cast
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;EXTRACT(MONTH FROM ts)&lt;/code&gt;, &lt;code&gt;DATE_TRUNC('month', ts)&lt;/code&gt;, &lt;code&gt;ts - INTERVAL '1 month'&lt;/code&gt;, &lt;code&gt;ts::DATE&lt;/code&gt;. These four cover 95% of date arithmetic. Reach for &lt;code&gt;DATE_TRUNC&lt;/code&gt; whenever the spec says "by month" or "by week" — it groups timestamps to the bucket boundary deterministically.&lt;/p&gt;
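&lt;p&gt;The "snap to the bucket boundary" behavior is easy to demonstrate. SQLite has no &lt;code&gt;DATE_TRUNC&lt;/code&gt;, so the sketch below uses &lt;code&gt;strftime('%Y-%m-01', ts)&lt;/code&gt; as a stand-in for &lt;code&gt;DATE_TRUNC('month', ts)&lt;/code&gt;; in PostgreSQL you would write &lt;code&gt;DATE_TRUNC&lt;/code&gt; directly:&lt;/p&gt;

```python
# Month bucketing: every timestamp inside April groups to the same
# 2026-04-01 boundary, which is exactly what DATE_TRUNC('month', ts)
# gives you in PostgreSQL. strftime is the SQLite analogue only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (ts TEXT);
    INSERT INTO events VALUES
        ('2026-04-03 10:00:00'), ('2026-04-28 23:59:00'), ('2026-05-01 00:00:01');
""")

buckets = conn.execute("""
    SELECT strftime('%Y-%m-01', ts) AS month_start, COUNT(*)
    FROM events
    GROUP BY month_start
    ORDER BY month_start
""").fetchall()
print(buckets)  # [('2026-04-01', 2), ('2026-05-01', 1)]
```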

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice surface&lt;/a&gt; for the all-language SQL corpus. Drill the four-primitive pages: &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;SQL filtering&lt;/a&gt; for &lt;code&gt;WHERE&lt;/code&gt; patterns, &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL joins&lt;/a&gt; for join shapes, &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation&lt;/a&gt; for &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING&lt;/code&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window functions&lt;/a&gt; for ranking and lookback. Add adjacent topics: &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;SQL CTE&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/subqueries/sql" rel="noopener noreferrer"&gt;SQL subqueries&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/null-handling/sql" rel="noopener noreferrer"&gt;SQL null-handling&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions/sql" rel="noopener noreferrer"&gt;SQL date functions&lt;/a&gt;. The &lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;interview courses page&lt;/a&gt; bundles structured curricula — start with &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;. 
For broader coverage, &lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;browse by topic&lt;/a&gt; or read the related &lt;a href="https://pipecode.ai/blogs/sql-interview-questions-for-data-engineering" rel="noopener noreferrer"&gt;SQL interview questions for data engineering&lt;/a&gt; and &lt;a href="https://pipecode.ai/blogs/data-lake-architecture-data-engineering-interviews" rel="noopener noreferrer"&gt;data lake architecture for data engineering interviews&lt;/a&gt; blogs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the logical clause order in a PostgreSQL query?
&lt;/h3&gt;

&lt;p&gt;PostgreSQL evaluates clauses in the order &lt;strong&gt;&lt;code&gt;FROM&lt;/code&gt; / &lt;code&gt;JOIN&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt; / &lt;code&gt;OFFSET&lt;/code&gt;&lt;/strong&gt;, regardless of the order you write them. This is why &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregate functions (they don't exist until after &lt;code&gt;GROUP BY&lt;/code&gt;) and why &lt;code&gt;SELECT&lt;/code&gt;-level aliases cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; (they're computed in stage 5). Aliases become available only in &lt;code&gt;ORDER BY&lt;/code&gt; and the outer query in a nested context.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between &lt;code&gt;WHERE&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt; in PostgreSQL?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;WHERE&lt;/code&gt; filters individual rows &lt;strong&gt;before&lt;/strong&gt; the &lt;code&gt;GROUP BY&lt;/code&gt; step and can reference only raw row columns. &lt;code&gt;HAVING&lt;/code&gt; filters whole groups &lt;strong&gt;after&lt;/strong&gt; the &lt;code&gt;GROUP BY&lt;/code&gt; step and can reference aggregate functions like &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(col)&lt;/code&gt;, &lt;code&gt;AVG(col)&lt;/code&gt;. Trying to use an aggregate in &lt;code&gt;WHERE&lt;/code&gt; (e.g., &lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt;) is a parse error because the aggregate does not yet exist. Both clauses can appear in the same query.&lt;/p&gt;
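&lt;p&gt;Both clauses in one query, sketched with Python's &lt;code&gt;sqlite3&lt;/code&gt;: count only completed orders per customer, then keep customers with at least two. The table name and statuses are illustrative:&lt;/p&gt;

```python
# WHERE prunes rows before grouping; HAVING prunes whole groups after.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, status TEXT);
    INSERT INTO orders VALUES
        (1, 'done'), (1, 'done'), (1, 'cancelled'),
        (2, 'done'), (3, 'cancelled');
""")

rows = conn.execute("""
    SELECT customer_id, COUNT(*) AS done_orders
    FROM orders
    WHERE status = 'done'          -- row filter, before GROUP BY
    GROUP BY customer_id
    HAVING COUNT(*) >= 2           -- group filter, after GROUP BY
""").fetchall()
print(rows)  # [(1, 2)]: customer 2 has one done order, customer 3 none
```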

&lt;h3&gt;
  
  
  How do I find rows in table A that have no match in table B?
&lt;/h3&gt;

&lt;p&gt;The canonical PostgreSQL pattern is &lt;code&gt;SELECT a.* FROM a LEFT JOIN b ON b.fk = a.pk WHERE b.pk IS NULL&lt;/code&gt; — the &lt;code&gt;LEFT JOIN&lt;/code&gt; preserves every left row, and the &lt;code&gt;WHERE b.pk IS NULL&lt;/code&gt; filter keeps only the ones where no right-side match was found. This is the &lt;strong&gt;anti-join&lt;/strong&gt; pattern. Equivalent to &lt;code&gt;WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.fk = a.pk)&lt;/code&gt;. Both are safer than &lt;code&gt;NOT IN (subquery)&lt;/code&gt;, which returns zero rows if the subquery contains a single &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, and &lt;code&gt;ROW_NUMBER&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;All three assign integers within a window. &lt;code&gt;ROW_NUMBER&lt;/code&gt; gives every row a unique sequential integer (&lt;code&gt;1, 2, 3, 4&lt;/code&gt;), even on ties. &lt;code&gt;RANK&lt;/code&gt; gives tied rows the same rank but skips after them (&lt;code&gt;1, 2, 2, 4&lt;/code&gt;). &lt;code&gt;DENSE_RANK&lt;/code&gt; gives tied rows the same rank with no skip (&lt;code&gt;1, 2, 2, 3&lt;/code&gt;). For "Nth distinct value" use &lt;code&gt;DENSE_RANK = N&lt;/code&gt;; for "Nth row in skip-aware ranking order" use &lt;code&gt;RANK = N&lt;/code&gt;; for "Nth row in arbitrary order" or "deduplicate keeping one canonical row" use &lt;code&gt;ROW_NUMBER = 1&lt;/code&gt;.&lt;/p&gt;
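&lt;p&gt;The three sequences can be produced over a single tied column. A minimal sketch with Python's &lt;code&gt;sqlite3&lt;/code&gt;, whose ranking semantics match PostgreSQL's:&lt;/p&gt;

```python
# Salaries 90, 80, 80, 70 ranked three ways:
#   RANK       skips after the tie  -> 1, 2, 2, 4
#   DENSE_RANK does not skip        -> 1, 2, 2, 3
#   ROW_NUMBER is always unique     -> 1, 2, 3, 4 (tie broken arbitrarily)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (salary INTEGER);
    INSERT INTO t VALUES (90), (80), (80), (70);
""")

rows = conn.execute("""
    SELECT salary,
           RANK()       OVER (ORDER BY salary DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dr,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
    FROM t
    ORDER BY rn
""").fetchall()
for salary, rk, dr, rn in rows:
    print(salary, rk, dr, rn)
```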

&lt;h3&gt;
  
  
  What does &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; do in PostgreSQL aggregates?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;SUM(col) FILTER (WHERE pred)&lt;/code&gt; and &lt;code&gt;COUNT(*) FILTER (WHERE pred)&lt;/code&gt; apply the aggregate only to rows where the predicate is &lt;code&gt;TRUE&lt;/code&gt;; rows where the predicate is &lt;code&gt;FALSE&lt;/code&gt; or &lt;code&gt;NULL&lt;/code&gt; are skipped for &lt;em&gt;that aggregate&lt;/em&gt;, while other aggregates in the same &lt;code&gt;SELECT&lt;/code&gt; still see them. The portable cross-dialect equivalent is &lt;code&gt;SUM(CASE WHEN pred THEN col ELSE 0 END)&lt;/code&gt; and &lt;code&gt;COUNT(CASE WHEN pred THEN 1 END)&lt;/code&gt;. Use &lt;code&gt;FILTER&lt;/code&gt; for clarity in PostgreSQL-only code.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I compute a running total in PostgreSQL?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;SUM(col) OVER (PARTITION BY p ORDER BY o)&lt;/code&gt; — the window aggregate accumulates from the start of each partition through the current row in the order defined by &lt;code&gt;ORDER BY&lt;/code&gt;. Example: &lt;code&gt;SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date)&lt;/code&gt; gives a per-customer running total of order amounts ordered by date. Drop &lt;code&gt;PARTITION BY&lt;/code&gt; for a single global running total.&lt;/p&gt;
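&lt;p&gt;The same pattern, sketched with Python's &lt;code&gt;sqlite3&lt;/code&gt; (SQLite 3.25+ runs identical window syntax); the table and column names are illustrative:&lt;/p&gt;

```python
# Per-customer running total: the window aggregate accumulates from the
# start of each partition through the current row in order_date order.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        (1, '2026-01-01', 10), (1, '2026-01-02', 20), (1, '2026-01-03', 5),
        (2, '2026-01-01', 7),  (2, '2026-01-05', 3);
""")

rows = conn.execute("""
    SELECT customer_id, order_date, amount,
           SUM(amount) OVER (
               PARTITION BY customer_id ORDER BY order_date
           ) AS running_total
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()
print(rows)
# customer 1 accumulates 10, 30, 35; customer 2 accumulates 7, 10
```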

&lt;h3&gt;
  
  
  Why is &lt;code&gt;LIMIT 5&lt;/code&gt; returning different rows on different runs?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;LIMIT&lt;/code&gt; without &lt;code&gt;ORDER BY&lt;/code&gt; is non-deterministic — PostgreSQL returns whatever rows it sees first, which depends on the query plan, parallelism, and table physical layout. Always pair &lt;code&gt;LIMIT&lt;/code&gt; with &lt;code&gt;ORDER BY &amp;lt;col&amp;gt; DESC, &amp;lt;tiebreaker&amp;gt;&lt;/code&gt; so two runs return the same rows. Reviewers depend on stable ordering, and dashboards break silently when row order drifts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing PostgreSQL SQL problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Lake Architecture for Data Engineering Interviews</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Mon, 11 May 2026 03:20:43 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/data-lake-architecture-for-data-engineering-interviews-32e1</link>
      <guid>https://dev.to/gowthampotureddi/data-lake-architecture-for-data-engineering-interviews-32e1</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data lake architecture&lt;/strong&gt; questions in data-engineering interviews almost always reduce to four primitives: &lt;strong&gt;medallion zones (bronze → silver → gold) for progressive refinement, an ingestion → metadata catalog → compute flow on object storage, the lake vs cloud warehouse vs lakehouse decision driven by open table formats (Iceberg, Delta, Hudi), and a disciplined answer shape that covers grain, idempotency, lineage, and aggregate reconciliation&lt;/strong&gt;. Whether the prompt is "design our analytics lake from scratch", "how would you land CDC from Postgres into the lake", "when would you pick a lakehouse over a warehouse", or "why do counts drift between the lake and the source app", interviewers grade the same handful of mental models — and candidates who skip straight to vendor names without naming the primitives lose the round.&lt;/p&gt;

&lt;p&gt;This guide walks four topic clusters end-to-end, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, &lt;strong&gt;common beginner mistakes&lt;/strong&gt;, and an &lt;strong&gt;interview-style scenario with a full answer&lt;/strong&gt; that traces the design step by step. Every section ends with a concept-by-concept breakdown that explains why the design works, what it costs, and where beginners typically slip. Storage examples assume an S3-style object store on the cloud, but every primitive transfers to GCS, Azure Blob / ADLS, or any other modern object backend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogczu08vrn78ssklfoxg.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogczu08vrn78ssklfoxg.webp" alt="Bold blog header for data lake architecture and data engineering interviews with PipeCode branding, layered storage stack icon in purple, green, and orange, and pipecode.ai attribution on a dark gradient background." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top data lake architecture interview topics
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; — one row per &lt;strong&gt;H2&lt;/strong&gt;, every row expanded into a full section with sub-topics, a worked scenario, an interview-style design question, and a step-by-step solution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Why it shows up in DE interviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Bronze / silver / gold medallion zones&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Progressive refinement is the single biggest lake-architecture concept; interviewers grade whether you know which transformations belong in landing/bronze vs refined/silver vs curated/gold and how SLAs differ per layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ingestion → catalog → compute flow on object storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sources land into S3/GCS/ABS, register in a Hive/Glue/Unity catalog, and are queried by Spark, Trino, or warehouse external tables; the small-file problem, partition pruning, and schema evolution all live here.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Lake vs cloud warehouse vs lakehouse — and Iceberg / Delta / Hudi&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The pattern-selection question is canonical; open table formats are what turn a lake into a lakehouse and bring ACID, time travel, and partition evolution to object storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Interview answer shape — grain, idempotency, lineage, reconciliation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Even system-design rounds reduce to a five-step template: clarify grain, separate landing from conformed, make loads idempotent, attach lineage keys, and reconcile aggregates against the source.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beginner-friendly framing:&lt;/strong&gt; A data lake is &lt;strong&gt;cheap, durable object storage&lt;/strong&gt; plus &lt;strong&gt;conventions for layout, metadata, and processing&lt;/strong&gt;. The "lake vs warehouse" decision is rarely binary — most large organizations run a blend, with the lake handling flexible high-volume ingestion and ML feature stores while a warehouse or lakehouse handles curated SQL analytics. Interviews test whether you can place each workload on the right side of that line and explain the trade-offs without reaching for vendor names as a substitute for first principles.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Bronze / Silver / Gold Medallion Zones for Data Lake Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Progressive refinement through landing/bronze, refined/silver, and curated/gold zones
&lt;/h3&gt;

&lt;p&gt;"Walk me through how you would lay out an analytics lake from scratch" is the signature opening prompt — and the cleanest answer is &lt;strong&gt;medallion architecture&lt;/strong&gt; with three numbered zones. The mental model: &lt;strong&gt;landing/bronze is an append-only mirror of the source payloads with minimal transformation; refined/silver applies dedup, type coercion, and conformed business keys; curated/gold publishes subject-area tables and star-schema facts/dims that downstream applications and BI tools consume&lt;/strong&gt;. Each zone has a different SLA, different read/write permissions, and different retention. The names vary across vendors — Databricks coined "bronze/silver/gold", AWS uses "raw/curated/consumption", Microsoft uses "landing/refined/analytics" — but the three-tier shape is universal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgu6o9g0wg8ccvecdn3jt.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgu6o9g0wg8ccvecdn3jt.webp" alt="Medallion zone diagram showing landing/bronze (raw, append-only) flowing into refined/silver (dedupe, type) flowing into curated/gold (star and subject tables analytics-ready) on a dark PipeCode-branded card with green and purple accents." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When you whiteboard the medallion zones, label each box with &lt;strong&gt;who writes&lt;/strong&gt;, &lt;strong&gt;who reads&lt;/strong&gt;, and &lt;strong&gt;what breaks if the job reruns&lt;/strong&gt;. Idempotent writes and clear grain matter as much in a lake as they do in a warehouse — interviewers grade the candidate who naturally adds these annotations without prompting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Landing / bronze — append-only mirror of source payloads
&lt;/h4&gt;

&lt;p&gt;The landing-zone invariant: &lt;strong&gt;bronze is an append-only, immutable copy of the source payload with minimal transformation; the schema is captured but not enforced; partitioning is by &lt;code&gt;ingest_date&lt;/code&gt; (or &lt;code&gt;ingest_hour&lt;/code&gt; for high-frequency sources); replays are safe because writes never overwrite&lt;/strong&gt;. The zone optimizes for fidelity and replay, not query performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Append-only writes&lt;/strong&gt; — every batch produces a new file under a date-partitioned prefix; &lt;code&gt;MERGE&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt; are forbidden.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source-payload fidelity&lt;/strong&gt; — store the raw shape (JSON, Avro, CSV, Parquet snapshot) plus an &lt;code&gt;ingest_id&lt;/code&gt; and &lt;code&gt;source_ts&lt;/code&gt; per row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition by &lt;code&gt;ingest_date&lt;/code&gt;&lt;/strong&gt; — makes back-fill, replay, and audit trivially scoped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention&lt;/strong&gt; — keep bronze for at least 30 days, ideally 90; audits and reconciliations need historical bronze.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A Postgres CDC pipeline lands daily JSON snapshots into &lt;code&gt;s3://analytics-lake/bronze/orders/&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;prefix&lt;/th&gt;
&lt;th&gt;files&lt;/th&gt;
&lt;th&gt;purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze/orders/ingest_date=2026-04-11/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;part-00000.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apr 11 snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze/orders/ingest_date=2026-04-12/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;part-00000.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apr 12 snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze/orders/ingest_date=2026-04-13/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;part-00000.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apr 13 snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The source app emits one JSON snapshot per day at 02:00 UTC.&lt;/li&gt;
&lt;li&gt;The ingestion job lands each snapshot under a calendar-keyed prefix &lt;code&gt;bronze/orders/ingest_date=YYYY-MM-DD/&lt;/code&gt; so partition pruning works for any date filter downstream.&lt;/li&gt;
&lt;li&gt;Each batch is also stamped with a unique &lt;code&gt;ingest_id&lt;/code&gt; (timestamp + UUID) sub-prefix so retries write fresh files instead of overwriting a previous attempt.&lt;/li&gt;
&lt;li&gt;Files inside a partition are append-only &lt;code&gt;part-NNNNN.json&lt;/code&gt;; bronze never edits a written file — corrected payloads land as new files under a new &lt;code&gt;ingest_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;After three days you have three day-partitions; each is independently re-readable with &lt;code&gt;WHERE ingest_date = 'YYYY-MM-DD'&lt;/code&gt; and any single day can be replayed without touching the others.&lt;/li&gt;
&lt;/ol&gt;
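&lt;p&gt;The prefix scheme from steps 2-3 can be sketched as a small helper. This is an illustrative Python function, not a real SDK call; the bucket name follows the worked example:&lt;/p&gt;

```python
import uuid
from datetime import datetime, timezone

def bronze_prefix(table: str, ingest_date: str) -> str:
    """Build an append-only landing prefix: every batch (including a retry)
    gets a fresh ingest_id sub-prefix, so it can never overwrite a previous
    attempt inside the same ingest_date partition."""
    batch_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    ingest_id = f"{batch_ts}-{uuid.uuid4().hex[:8]}"
    return (
        f"s3://analytics-lake/bronze/{table}/"
        f"ingest_date={ingest_date}/ingest_id={ingest_id}/"
    )

first = bronze_prefix("orders", "2026-04-13")
retry = bronze_prefix("orders", "2026-04-13")
# Same partition, different ingest_id: the retry lands beside the original.
assert first != retry
assert first.startswith("s3://analytics-lake/bronze/orders/ingest_date=2026-04-13/")
```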

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A landing-zone object layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://analytics-lake/bronze/orders/
  ingest_date=2026-04-13/
    ingest_id=20260413T0200Z/
      part-00000.json
      part-00001.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never edit a bronze file. If a payload is wrong, drop a corrected file under a new &lt;code&gt;ingest_id&lt;/code&gt; and let the silver-layer dedup logic resolve it; never overwrite history.&lt;/p&gt;

&lt;h4&gt;
  
  
  Refined / silver — deduped, typed, conformed business keys
&lt;/h4&gt;

&lt;p&gt;The refined-zone invariant: &lt;strong&gt;silver applies dedup against natural or business keys, coerces types to a canonical schema, conforms key columns across sources, and may emit slowly-changing-dimension (SCD) history; the zone is the single source of truth for downstream application code and most analyst SQL&lt;/strong&gt;. Idempotency at the silver layer is non-negotiable — re-running a daily job must produce byte-identical output.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dedup on &lt;code&gt;(business_key, source_ts)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type coercion&lt;/strong&gt; — JSON strings → typed columns; epoch ms → &lt;code&gt;TIMESTAMP&lt;/code&gt;; cents → &lt;code&gt;DECIMAL(18,2)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformed dimensions&lt;/strong&gt; — &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;geo_id&lt;/code&gt; mapped to one canonical form across every source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SCD type 2&lt;/strong&gt; — emit &lt;code&gt;(valid_from, valid_to, is_current)&lt;/code&gt; columns when downstream consumers need point-in-time joins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Bronze &lt;code&gt;orders&lt;/code&gt; rows arrive twice for &lt;code&gt;order_id=448&lt;/code&gt; due to a CDC retry; silver dedup keeps the latest.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;source_ts&lt;/th&gt;
&lt;th&gt;bronze_rn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;2026-04-12 09:30:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;2026-04-12 09:30:15&lt;/td&gt;
&lt;td&gt;1 (kept)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;449&lt;/td&gt;
&lt;td&gt;2026-04-12 10:00:00&lt;/td&gt;
&lt;td&gt;1 (kept)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bronze contains three rows for &lt;code&gt;ingest_date = 2026-04-12&lt;/code&gt;: two for &lt;code&gt;order_id = 448&lt;/code&gt; (a CDC retry produced two payloads at 09:30:00 and 09:30:15) and one for &lt;code&gt;order_id = 449&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC)&lt;/code&gt; numbers rows independently inside each &lt;code&gt;order_id&lt;/code&gt; group, with the latest &lt;code&gt;source_ts&lt;/code&gt; getting &lt;code&gt;rn = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;order_id = 448&lt;/code&gt;: the row at 09:30:15 is later, so it gets &lt;code&gt;rn = 1&lt;/code&gt;; the 09:30:00 row gets &lt;code&gt;rn = 2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;order_id = 449&lt;/code&gt;: only one row, so it gets &lt;code&gt;rn = 1&lt;/code&gt; automatically.&lt;/li&gt;
&lt;li&gt;The outer &lt;code&gt;WHERE rn = 1&lt;/code&gt; keeps two rows — the latest &lt;code&gt;order_id = 448&lt;/code&gt; and the only &lt;code&gt;order_id = 449&lt;/code&gt; — and silently drops the duplicate, producing a deterministic single-row-per-business-key silver table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-12'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;order_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;source_ts&lt;/span&gt;                         &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;as_of_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;                 &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;silver_loaded_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the silver zone is where ETL bugs hide — invest in unit-tested dedup logic, schema-evolution tests, and aggregate reconciliation against bronze totals before promoting to gold.&lt;/p&gt;
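&lt;p&gt;A minimal Python sketch of the keep-latest dedup invariant worth unit-testing; plain dicts stand in for bronze rows, and production code would need an explicit tie-breaker for equal &lt;code&gt;source_ts&lt;/code&gt; values:&lt;/p&gt;

```python
def dedup_latest(rows, key="order_id", ts="source_ts"):
    """Keep exactly one row per business key: the one with the latest
    source_ts. Mirrors ROW_NUMBER() OVER (PARTITION BY key ORDER BY
    source_ts DESC) followed by WHERE rn = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])

bronze = [
    {"order_id": 448, "source_ts": "2026-04-12 09:30:00"},
    {"order_id": 448, "source_ts": "2026-04-12 09:30:15"},  # CDC retry, kept
    {"order_id": 449, "source_ts": "2026-04-12 10:00:00"},
]
silver = dedup_latest(bronze)
assert len(silver) == 2
assert silver[0]["source_ts"] == "2026-04-12 09:30:15"
# Idempotency check: re-running over the output changes nothing.
assert dedup_latest(silver) == silver
```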

&lt;h4&gt;
  
  
  Curated / gold — subject-area tables and star schemas
&lt;/h4&gt;

&lt;p&gt;The curated-zone invariant: &lt;strong&gt;gold publishes tables shaped for downstream consumption: dimensional models (fact tables + conformed dimensions), subject-area marts, or one-big-table (OBT) flattenings; SLAs are stricter, freshness is tracked, and consumer contracts are explicit&lt;/strong&gt;. Each gold table maps to exactly one consumer class — analysts, dashboards, ML feature pipelines, or reverse-ETL into operational systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star schema&lt;/strong&gt; — &lt;code&gt;fact_orders&lt;/code&gt; joined to &lt;code&gt;dim_customer&lt;/code&gt;, &lt;code&gt;dim_product&lt;/code&gt;, &lt;code&gt;dim_date&lt;/code&gt;; one row per business event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subject-area marts&lt;/strong&gt; — domain-scoped denormalized tables (e.g., &lt;code&gt;mart_marketing_attribution&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OBT flattening&lt;/strong&gt; — when consumers prefer one wide table over a join (Looker, Power BI dashboards).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer contracts&lt;/strong&gt; — column types, refresh cadence, breakage policy declared in &lt;code&gt;dbt&lt;/code&gt;-style metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A gold star schema for the orders subject area.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;grain&lt;/th&gt;
&lt;th&gt;example columns&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per order line&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;line_id&lt;/code&gt;, &lt;code&gt;customer_key&lt;/code&gt;, &lt;code&gt;product_key&lt;/code&gt;, &lt;code&gt;date_key&lt;/code&gt;, &lt;code&gt;qty&lt;/code&gt;, &lt;code&gt;revenue&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per customer (SCD2)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;customer_key&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;valid_from&lt;/code&gt;, &lt;code&gt;valid_to&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_product&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per product&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;product_key&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per calendar date&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;date_key&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;iso_week&lt;/code&gt;, &lt;code&gt;is_weekend&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;fact_orders&lt;/code&gt; is the central transactional table at order-line grain — one row per line item, with numeric measures (&lt;code&gt;qty&lt;/code&gt;, &lt;code&gt;revenue&lt;/code&gt;) and foreign-key columns to every dimension.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_customer&lt;/code&gt; is an SCD2 dimension: a single real-world customer can appear in multiple rows over time, each with &lt;code&gt;valid_from&lt;/code&gt; / &lt;code&gt;valid_to&lt;/code&gt; / &lt;code&gt;is_current&lt;/code&gt; columns to capture historical attribute changes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_product&lt;/code&gt; is a simpler Type-1 dimension: one row per product, current state only — overwrites on update with no history.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_date&lt;/code&gt; is the conformed date dimension: one row per calendar date with pre-computed week, month, quarter, year, and &lt;code&gt;is_weekend&lt;/code&gt; columns so dashboards never have to compute date math at query time.&lt;/li&gt;
&lt;li&gt;Joins from &lt;code&gt;fact_orders&lt;/code&gt; to each dimension use the surrogate keys (&lt;code&gt;customer_key&lt;/code&gt;, &lt;code&gt;product_key&lt;/code&gt;, &lt;code&gt;date_key&lt;/code&gt;) — never the natural business IDs — so SCD2 history is preserved when the same customer's row evolves over time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;line_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unit_price&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_lines&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_ts&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_from&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim_product&lt;/span&gt;  &lt;span class="n"&gt;dp&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim_date&lt;/span&gt;     &lt;span class="n"&gt;dd&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_ts&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; gold tables are the only zone customers should reference by name; if a dashboard reads silver directly, your contract is leaking. Use views or feature-flagged exposures rather than letting consumers couple to interim grains.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Treating bronze as a junk drawer with no &lt;code&gt;ingest_date&lt;/code&gt; partitioning — replay and audit become impossible.&lt;/li&gt;
&lt;li&gt;Doing dedup at gold instead of silver — every downstream job has to repeat the work and answers diverge.&lt;/li&gt;
&lt;li&gt;Letting consumers query silver directly — silver schemas can change without notice; gold contracts are explicit.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;ingest_id&lt;/code&gt; and &lt;code&gt;source_ts&lt;/code&gt; lineage columns — when counts drift, you have no way to reconstruct what landed when.&lt;/li&gt;
&lt;li&gt;Mixing batch and streaming writes into the same bronze prefix without a write-mode partition key — late arrivals overwrite eager batches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Lake Interview Question on Designing Layered Zones
&lt;/h3&gt;

&lt;p&gt;A team dumps daily JSON exports of &lt;code&gt;orders&lt;/code&gt; into a single S3 prefix. Analysts complain that order counts drift versus the source application by 0.5–2% on most days. &lt;strong&gt;Design a three-zone medallion layout that fixes the drift, makes the discrepancy investigable, and supports daily reruns without producing duplicates.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Bronze (append-only) + Silver (dedup) + Gold (star schema)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Move existing daily dumps into:
     s3://analytics-lake/bronze/orders/ingest_date=YYYY-MM-DD/ingest_id=&amp;lt;batch&amp;gt;/
   Append-only; never overwrite a date partition.

2. Build silver/orders as a daily MERGE that:
     - Dedups bronze rows by ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1
     - Coerces JSON fields to a typed schema
     - Joins against dim_customer / dim_product on conformed keys
     - Carries ingest_id + source_ts as lineage columns

3. Promote to gold/fact_orders only after a silver-vs-source aggregate-reconciliation job
   passes a tolerance threshold (e.g., |silver_count - source_count| / source_count &amp;lt; 0.001).

4. Surface a row-count + revenue-sum dashboard sourced from BOTH bronze and the source app's
   replica, so any future drift surfaces within one ingest cycle.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the append-only bronze layer makes the discrepancy &lt;em&gt;investigable&lt;/em&gt; — every historical payload is preserved with &lt;code&gt;ingest_id&lt;/code&gt; and &lt;code&gt;ingest_date&lt;/code&gt;, so analysts can replay any day's source state; the silver dedup converts CDC retries and late-arriving rows into a deterministic single row per &lt;code&gt;order_id&lt;/code&gt;; the gold layer is gated by an aggregate-reconciliation step that catches drift before it reaches dashboards; and the dual-source row-count dashboard surfaces residual drift immediately. The combination addresses both the &lt;em&gt;prevention&lt;/em&gt; (idempotent dedup) and &lt;em&gt;detection&lt;/em&gt; (reconciliation + dashboard) sides of the failure mode.&lt;/p&gt;
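&lt;p&gt;The tolerance check in step 3 is worth encoding as an explicit gate. A minimal Python sketch using the counts from the trace below; in a real pipeline the inputs would come from &lt;code&gt;COUNT(*)&lt;/code&gt; aggregates against each zone:&lt;/p&gt;

```python
def reconciliation_gate(silver_count: int, source_count: int,
                        tolerance: float = 0.001):
    """Gate promotion to gold: block when relative drift exceeds tolerance."""
    drift = abs(silver_count - source_count) / source_count
    return drift, drift < tolerance

# Clean day from the trace: counts match exactly, gate passes.
drift, ok = reconciliation_gate(12_835, 12_835)
assert ok and drift == 0.0

# A 0.5% drift (the analysts' complaint) fails the 0.1% gate.
drift, ok = reconciliation_gate(12_899, 12_835)
assert not ok
```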

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the drift scenario on 2026-04-12:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;bronze ingests &lt;code&gt;ingest_date=2026-04-12&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;12,847 raw rows including 12 CDC retries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;silver dedup keeps &lt;code&gt;rn = 1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;12,835 unique &lt;code&gt;order_id&lt;/code&gt;s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;source-app replica reports&lt;/td&gt;
&lt;td&gt;12,835 orders for 2026-04-12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;reconciliation passes&lt;/td&gt;
&lt;td&gt;drift = 0 / 12,835 = 0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;promote to gold/fact_orders&lt;/td&gt;
&lt;td&gt;12,835 fact rows; counts match dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the fixed-state contract per ingest day:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;bronze&lt;/th&gt;
&lt;th&gt;silver&lt;/th&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;gold&lt;/th&gt;
&lt;th&gt;drift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;row count&lt;/td&gt;
&lt;td&gt;12,847&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;total revenue&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Append-only bronze with &lt;code&gt;ingest_date&lt;/code&gt; partitioning&lt;/strong&gt; — every payload is preserved and addressable; replay is a &lt;code&gt;WHERE ingest_date = ...&lt;/code&gt; filter rather than a re-ingest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver dedup via &lt;code&gt;ROW_NUMBER&lt;/code&gt; over &lt;code&gt;(order_id ORDER BY source_ts DESC)&lt;/code&gt;&lt;/strong&gt; — collapses CDC retries to a deterministic single row per business key; idempotent on rerun.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lineage columns &lt;code&gt;ingest_id&lt;/code&gt; + &lt;code&gt;source_ts&lt;/code&gt;&lt;/strong&gt; — every silver row points back to a specific bronze file and source moment; forensic debugging is one join away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate reconciliation gate before gold&lt;/strong&gt; — drift cannot reach dashboards because gold is gated on &lt;code&gt;|silver - source| / source &amp;lt; threshold&lt;/code&gt;; failures page the on-call rather than silently corrupting the BI tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-source dashboard&lt;/strong&gt; — surfaces drift instantly even when reconciliation isn't perfect; the early-warning loop pays for itself at the first incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|bronze|)&lt;/code&gt; time per day&lt;/strong&gt; — a single linear scan plus window function for dedup; reconciliation adds one aggregate per zone, negligible compared to ingest cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for medallion-zone problems and the &lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;dimensional modeling practice page&lt;/a&gt; for star-schema patterns at the gold layer.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. Ingestion → Catalog → Compute Flow on Object Storage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sources to query engines through metadata catalogs in data lake architecture
&lt;/h3&gt;

&lt;p&gt;"How does data physically get from a Postgres source into a query engine like Spark or Trino on the lake?" is the signature design follow-up — and the cleanest answer is the &lt;strong&gt;ingest → register → query&lt;/strong&gt; flow with three distinct components. The mental model: &lt;strong&gt;sources (databases, APIs, streaming platforms, file feeds) ingest into object storage as files; a metadata catalog (Hive Metastore, AWS Glue, Unity Catalog, Polaris, Iceberg REST catalog) maps logical tables to physical file paths and column schemas; compute engines (Spark, Trino, Presto, DuckDB-in-the-cloud, Snowflake external tables) read the catalog to discover tables and read the object store to fetch data&lt;/strong&gt;. The decoupling is the entire value proposition — many engines can read the same footprint, and storage scales independently from compute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetes9pdzu83ptnq03zpd.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetes9pdzu83ptnq03zpd.webp" alt="Architecture flow diagram showing sources (DB, API, files, stream) ingesting into object storage lake, registering into metadata catalog, then compute and query engines (Spark, SQL) reading via curated-read paths in PipeCode brand styling." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When the interviewer asks "where does Spark get the schema from?", the answer is the &lt;strong&gt;catalog&lt;/strong&gt;, not the file. Files (Parquet, ORC, Avro) carry their own schema in the footer, but the catalog is what makes a logical table addressable across sessions and engines. State this distinction explicitly — it separates candidates who learned data lake architecture by reading docs from those who learned by debugging production.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Object storage as the storage layer — S3, GCS, ADLS
&lt;/h4&gt;

&lt;p&gt;The object-store invariant: &lt;strong&gt;modern lakes use cloud object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage / ADLS Gen2) rather than HDFS; storage is infinitely scalable, durable, and decoupled from compute, with eventual-consistency semantics that the table format is responsible for masking&lt;/strong&gt;. Files are typically Parquet (columnar, compressed) or ORC; Avro shows up in streaming pipelines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hive-style partitioning&lt;/strong&gt; — &lt;code&gt;s3://bucket/table/col=value/file.parquet&lt;/code&gt; for partition pruning at query time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File sizes&lt;/strong&gt; — target 128MB-1GB per file; smaller files trigger the small-file problem (excessive metadata, slow planning).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction&lt;/strong&gt; — periodic batch jobs that rewrite many small files into fewer large ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eventual consistency&lt;/strong&gt; — S3 was eventually consistent for many years; the table format handles the retry / commit semantics that mask this from queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A Hive-style partition layout for a daily-loaded &lt;code&gt;orders&lt;/code&gt; table in silver.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;prefix&lt;/th&gt;
&lt;th&gt;role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3://analytics-lake/silver/orders/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;table root&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;…/ingest_date=2026-04-13/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;partition value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;…/ingest_date=2026-04-13/part-00000.parquet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;data file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;…/_delta_log/&lt;/code&gt; or &lt;code&gt;…/metadata/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;table-format metadata (if Delta/Iceberg)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The table root &lt;code&gt;s3://analytics-lake/silver/orders/&lt;/code&gt; is the registered location in the catalog; everything under it belongs to one logical table.&lt;/li&gt;
&lt;li&gt;Each child prefix &lt;code&gt;ingest_date=YYYY-MM-DD/&lt;/code&gt; is one Hive partition value; the &lt;code&gt;key=value&lt;/code&gt; syntax is the convention every engine (Spark, Trino, Athena, Snowflake) recognizes.&lt;/li&gt;
&lt;li&gt;Inside each partition, multiple Parquet files (~180MB each) split the data so a Spark reader can fetch them in parallel; the file count is bounded by your micro-batch size.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;_delta_log/&lt;/code&gt; (Delta) or &lt;code&gt;metadata/&lt;/code&gt; (Iceberg) prefix holds the table-format commit log — a sequence of JSON files describing every transaction, which is what gives you ACID and time travel on top of plain object storage.&lt;/li&gt;
&lt;li&gt;A query with &lt;code&gt;WHERE ingest_date = '2026-04-13'&lt;/code&gt; triggers partition pruning: the planner reads only files under that one prefix, skipping every other day's files entirely — the difference between 200ms and 60s.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Object layout for a partitioned silver table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://analytics-lake/silver/orders/
  ingest_date=2026-04-13/
    part-00000.parquet  (180MB, 1.2M rows)
    part-00001.parquet  (165MB, 1.1M rows)
  ingest_date=2026-04-12/
    part-00000.parquet  (175MB)
  _delta_log/                              # Delta Lake commit log
    00000000000000000001.json
    00000000000000000002.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your average file size is below 50MB, schedule a daily compaction job; if it's above 1GB, your partitions are too coarse. Both extremes hurt query latency.&lt;/p&gt;
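&lt;p&gt;The rule of thumb above can be encoded as a quick heuristic; a minimal Python sketch where the 50MB and 1GB thresholds are the assumed cut-offs from this guidance:&lt;/p&gt;

```python
def compaction_advice(file_sizes_mb, small_mb=50, large_mb=1024):
    """Flag a partition for maintenance based on average file size:
    below small_mb -> schedule compaction; above large_mb -> the
    partitions are too coarse and should be split."""
    avg = sum(file_sizes_mb) / len(file_sizes_mb)
    if avg < small_mb:
        return "compact"
    if avg > large_mb:
        return "repartition"
    return "ok"

assert compaction_advice([180, 165, 175]) == "ok"        # worked-example sizes
assert compaction_advice([8, 12, 5, 9]) == "compact"     # small-file problem
assert compaction_advice([2048, 1900]) == "repartition"  # partitions too coarse
```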

&lt;h4&gt;
  
  
  Metadata catalog — Hive Metastore, AWS Glue, Unity Catalog
&lt;/h4&gt;

&lt;p&gt;The catalog invariant: &lt;strong&gt;a metadata catalog maps logical names (&lt;code&gt;silver.orders&lt;/code&gt;) to physical locations (&lt;code&gt;s3://analytics-lake/silver/orders&lt;/code&gt;), column schemas, partition definitions, and table properties; it is the single source of truth for "what tables exist" across every compute engine that reads the lake&lt;/strong&gt;. The catalog can be a long-running service (Hive Metastore, AWS Glue Data Catalog, Databricks Unity Catalog) or a REST API on top of files (Iceberg REST catalog, Polaris).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logical → physical mapping&lt;/strong&gt; — &lt;code&gt;silver.orders&lt;/code&gt; → &lt;code&gt;s3://...&lt;/code&gt;; column names, types, partition keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine-agnostic&lt;/strong&gt; — Spark, Trino, Presto, Snowflake external tables, Athena, DuckDB all read the same catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution&lt;/strong&gt; — add column, widen type, rename (with caveats); the catalog records the evolution history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permissions&lt;/strong&gt; — many catalogs (Unity, Glue with Lake Formation) carry table/column-level access policies.&lt;/li&gt;
&lt;/ul&gt;
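&lt;p&gt;The mapping the bullets describe can be sketched as a minimal in-memory catalog entry. Field names here are illustrative, not the Glue or Hive Metastore API:&lt;/p&gt;

```python
# Minimal in-memory sketch of what a metadata catalog stores per table.
# Field names are illustrative, not any specific Glue/Hive data structure.

catalog = {
    "silver.orders": {
        "location": "s3://analytics-lake/silver/orders/",
        "format": "parquet",
        "partition_keys": ["ingest_date"],
        "schema": {
            "order_id": "BIGINT",
            "customer_id": "BIGINT",
            "amount": "DECIMAL(18,2)",
            "source_ts": "TIMESTAMP",
        },
    }
}

def resolve(table_name):
    """Logical name -> physical location: what every engine does before planning."""
    entry = catalog[table_name]
    return entry["location"], entry["partition_keys"]

location, keys = resolve("silver.orders")
print(location)  # s3://analytics-lake/silver/orders/
```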

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Registering a partitioned &lt;code&gt;silver.orders&lt;/code&gt; table in Glue.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;field&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;logical name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;silver.orders&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;location&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://analytics-lake/silver/orders/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;input format&lt;/td&gt;
&lt;td&gt;&lt;code&gt;parquet&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;partition keys&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ingest_date STRING&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;schema&lt;/td&gt;
&lt;td&gt;&lt;code&gt;order_id BIGINT, customer_id BIGINT, amount DECIMAL(18,2), source_ts TIMESTAMP&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;CREATE EXTERNAL TABLE silver.orders&lt;/code&gt; declares a logical name in the catalog without copying or moving any data files.&lt;/li&gt;
&lt;li&gt;The column list (&lt;code&gt;order_id BIGINT&lt;/code&gt;, …) declares the schema the engine should expect; Parquet files store their own schema in the footer, but the catalog is the canonical answer the planner trusts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PARTITIONED BY (ingest_date STRING)&lt;/code&gt; declares the partition column; this column is &lt;em&gt;derived from the prefix path&lt;/em&gt;, not stored in the data files, which keeps each partition lean.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOCATION 's3://analytics-lake/silver/orders/'&lt;/code&gt; is the prefix the engine scans when reading; data files must already exist at this location.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MSCK REPAIR TABLE silver.orders&lt;/code&gt; walks the S3 prefix, discovers any partition values it doesn't yet know about, and registers them; without this command after a backfill, the planner returns zero rows for the new dates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;       &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;source_ts&lt;/span&gt;    &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ingest_id&lt;/span&gt;    &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://analytics-lake/silver/orders/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;MSCK&lt;/span&gt; &lt;span class="n"&gt;REPAIR&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always run &lt;code&gt;MSCK REPAIR TABLE&lt;/code&gt; (or the engine equivalent) after a backfill that adds new partition prefixes; otherwise the catalog won't know about them and the partition predicate will return zero rows.&lt;/p&gt;
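&lt;p&gt;Conceptually, &lt;code&gt;MSCK REPAIR TABLE&lt;/code&gt; diffs the partition prefixes found in storage against what the catalog already knows, then registers the missing ones. A minimal sketch of that diff, with prefix names following this section's layout and an illustrative function name:&lt;/p&gt;

```python
# Sketch of what MSCK REPAIR TABLE conceptually does: compare the partition
# prefixes present in object storage against those registered in the catalog,
# and report the values that still need registering.

def discover_partitions(storage_prefixes, registered):
    """Return partition values present in storage but unknown to the catalog."""
    found = {p.rstrip("/").split("=")[-1] for p in storage_prefixes}
    return sorted(found - set(registered))

storage = [
    "ingest_date=2026-04-12/",
    "ingest_date=2026-04-13/",
    "ingest_date=2026-04-14/",   # newly backfilled prefix
]
registered = ["2026-04-12", "2026-04-13"]

print(discover_partitions(storage, registered))  # ['2026-04-14']
```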

&lt;h4&gt;
  
  
  Compute engines — Spark, Trino, Presto, DuckDB
&lt;/h4&gt;

&lt;p&gt;The compute invariant: &lt;strong&gt;compute engines read the catalog to discover tables, plan queries with partition pruning and predicate pushdown, then read the relevant Parquet/ORC files from object storage; storage and compute scale independently and the same data can be queried by multiple engines simultaneously&lt;/strong&gt;. Spark dominates for batch + streaming pipelines; Trino/Presto dominate for interactive SQL; DuckDB is rising for single-node analytics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spark&lt;/strong&gt; — JVM, batch + streaming, rich ecosystem (Iceberg/Delta connectors, Spark SQL, MLlib, Structured Streaming).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trino / Presto&lt;/strong&gt; — interactive SQL across many catalogs; great for federated queries across lake + warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DuckDB&lt;/strong&gt; — single-node, embeddable, blazing fast for sub-TB analytics; popular for ad-hoc + notebooks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake / BigQuery / Redshift external tables&lt;/strong&gt; — read lake data from inside a managed warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A Spark SQL query against &lt;code&gt;silver.orders&lt;/code&gt; with partition pruning.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;data scanned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;catalog&lt;/td&gt;
&lt;td&gt;resolve &lt;code&gt;silver.orders&lt;/code&gt; → &lt;code&gt;s3://...&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;metadata only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;planner&lt;/td&gt;
&lt;td&gt;prune partitions for &lt;code&gt;ingest_date = '2026-04-13'&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;one partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark workers&lt;/td&gt;
&lt;td&gt;read Parquet column-block for &lt;code&gt;amount&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;~50MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;executor&lt;/td&gt;
&lt;td&gt;aggregate &lt;code&gt;SUM(amount)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spark resolves &lt;code&gt;silver.orders&lt;/code&gt; against the catalog — pure metadata fetch, zero data scanned, returns the location plus the partition schema.&lt;/li&gt;
&lt;li&gt;The planner sees &lt;code&gt;WHERE ingest_date = '2026-04-13'&lt;/code&gt; and prunes the partition list to a single value, so workers only need to list files under one S3 prefix instead of all of them.&lt;/li&gt;
&lt;li&gt;Workers issue an S3 &lt;code&gt;LIST&lt;/code&gt; for that single partition, fetching a list of ~one to ten Parquet file paths.&lt;/li&gt;
&lt;li&gt;Each Parquet reader uses footer metadata to skip every column except &lt;code&gt;amount&lt;/code&gt;, then streams just that column's chunks — roughly 50MB in total instead of the partition's full few hundred megabytes.&lt;/li&gt;
&lt;li&gt;Each task computes a partial &lt;code&gt;SUM(amount)&lt;/code&gt; locally; a final shuffle sums the partial values to one number — the entire query is &lt;code&gt;O(rows in one partition)&lt;/code&gt; and runs in sub-second time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;daily_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always include the partition key in your &lt;code&gt;WHERE&lt;/code&gt; clause to enable partition pruning; without it, the planner reads every partition (terabytes), and your query goes from 500ms to 50 seconds.&lt;/p&gt;
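&lt;p&gt;A back-of-envelope sketch of the pruning math, using illustrative per-day partition sizes rather than real numbers:&lt;/p&gt;

```python
# Why partition pruning matters: compare bytes scanned with and without the
# partition predicate. Sizes are illustrative (~350GB per daily partition).

def bytes_scanned(partitions_gb, pruned_to=None):
    """Total GB read: every partition, or only the pruned subset."""
    if pruned_to is None:
        return sum(partitions_gb.values())           # full-table scan
    return sum(partitions_gb[p] for p in pruned_to)  # pruned scan

table = {f"2026-04-{d:02d}": 350 for d in range(1, 14)}  # 13 daily partitions

full = bytes_scanned(table)
pruned = bytes_scanned(table, pruned_to=["2026-04-13"])
print(full, pruned)  # 4550 350
```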

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Skipping the catalog and reading raw S3 paths in every job — schemas drift, no central source of truth, no permissions.&lt;/li&gt;
&lt;li&gt;Ignoring file-size budgets — millions of 5KB files (the small-file problem) make Spark planning slower than the actual scan.&lt;/li&gt;
&lt;li&gt;Not declaring partition keys — full-table scans on every query, costs balloon by 100x.&lt;/li&gt;
&lt;li&gt;Mixing file formats inside one logical table (some Parquet, some JSON) — the planner can't push predicates and queries error out.&lt;/li&gt;
&lt;li&gt;Forgetting to refresh the catalog after a backfill — &lt;code&gt;MSCK REPAIR TABLE&lt;/code&gt; or &lt;code&gt;REFRESH TABLE&lt;/code&gt; is the single most-forgotten command.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Lake Interview Question on CDC Ingestion from Postgres
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Design a near-real-time ingestion pipeline that lands changes from a 10TB Postgres database into the lake, registers them in a catalog, and exposes them to Spark and Trino with sub-five-minute freshness.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Debezium → Kafka → Iceberg with Hive Metastore
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Postgres (with logical replication enabled)
      │
      ▼
Debezium connector (CDC reader, emits change events)
      │
      ▼
Kafka topic per table (key = primary key; value = before/after JSON or Avro)
      │
      ▼
Spark Structured Streaming job (1-minute trigger):
      - Reads Kafka topic
      - Writes to bronze.orders_cdc as append-only Iceberg files (partitioned by event_date)
      │
      ▼
Hive Metastore / Glue catalog:
      - bronze.orders_cdc registered with Iceberg metadata
      - silver.orders_current registered as a Spark MERGE-on-read view
      │
      ▼
Compute consumers:
      - Trino: SELECT * FROM silver.orders_current WHERE event_date = today
      - Spark batch: nightly compaction + table-maintenance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; Debezium reads the Postgres write-ahead log (WAL) directly via logical replication, so it captures every insert/update/delete with no impact on the source; Kafka decouples the producer from the consumer and absorbs traffic spikes; the Spark Structured Streaming job runs with a one-minute trigger, so the lake is at most one minute behind; Iceberg's ACID transactions make concurrent micro-batch writes safe; the Hive Metastore registers the table once, and both Trino and Spark see the same schema; partitioning by &lt;code&gt;event_date&lt;/code&gt; enables prune-friendly time-window queries; nightly compaction keeps file sizes in the 128MB-1GB sweet spot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for an order update at &lt;code&gt;09:30:00.000&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;component&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;09:30:00.000&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;UPDATE orders SET status='shipped' WHERE order_id=448&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:00.150&lt;/td&gt;
&lt;td&gt;Debezium&lt;/td&gt;
&lt;td&gt;reads WAL, emits change event to Kafka&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:00.300&lt;/td&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;persists change event to topic &lt;code&gt;orders.cdc&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:30.000&lt;/td&gt;
&lt;td&gt;Spark Streaming&lt;/td&gt;
&lt;td&gt;next 1-min trigger; reads change events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:35.000&lt;/td&gt;
&lt;td&gt;Spark Streaming&lt;/td&gt;
&lt;td&gt;writes Parquet to &lt;code&gt;bronze.orders_cdc/event_date=2026-04-13/&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:35.500&lt;/td&gt;
&lt;td&gt;Iceberg&lt;/td&gt;
&lt;td&gt;commits new snapshot; catalog updated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:40.000&lt;/td&gt;
&lt;td&gt;Trino&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SELECT … FROM silver.orders_current WHERE order_id=448&lt;/code&gt; returns updated row&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;End-to-end latency: ~40 seconds, well within the five-minute SLA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the consumer-visible contract per minute:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;target&lt;/th&gt;
&lt;th&gt;actual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;freshness (P50)&lt;/td&gt;
&lt;td&gt;&amp;lt; 5 min&lt;/td&gt;
&lt;td&gt;~40 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;freshness (P99)&lt;/td&gt;
&lt;td&gt;&amp;lt; 5 min&lt;/td&gt;
&lt;td&gt;~2 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dropped events&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;schema-drift incidents&lt;/td&gt;
&lt;td&gt;&amp;lt; 1/quarter&lt;/td&gt;
&lt;td&gt;0 last quarter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Postgres logical replication + Debezium&lt;/strong&gt; — captures every row change at the WAL layer; no impact on source query performance; no missed events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka as the decoupler&lt;/strong&gt; — handles backpressure, replays, and multiple downstream consumers; lake outages don't lose source events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark Structured Streaming with 1-minute trigger&lt;/strong&gt; — micro-batch sweet spot; latency vs throughput trade-off favors throughput here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg table format&lt;/strong&gt; — ACID commits make concurrent micro-batch writes safe; time travel makes "what did the table look like at 09:30?" a one-line query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hive Metastore as the unified catalog&lt;/strong&gt; — Spark and Trino see the same schema; no per-engine duplication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;event_date&lt;/code&gt; partitioning + nightly compaction&lt;/strong&gt; — bounds query scan size and keeps file count manageable; both maintenance jobs are idempotent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end latency ~40s&lt;/strong&gt; — well inside the 5-min SLA; the 4.5-min headroom absorbs Kafka rebalances and Spark micro-batch jitter without alerting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/streaming" rel="noopener noreferrer"&gt;streaming practice page&lt;/a&gt; for Kafka + micro-batch problems and the &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice page&lt;/a&gt; for PySpark Structured Streaming patterns. Course: &lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;PySpark Fundamentals&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. Lake vs Cloud Warehouse vs Lakehouse — Iceberg, Delta, Hudi
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern selection and open table formats in data lake architecture
&lt;/h3&gt;

&lt;p&gt;"When would you pick a lakehouse over a warehouse?" and "what is the difference between Iceberg, Delta Lake, and Hudi?" are the two signature pattern-selection prompts — and they share one mental model: &lt;strong&gt;a data lake is files + a catalog; a cloud warehouse is a managed ACID SQL system with proprietary storage; a lakehouse is a lake plus an open table format that adds ACID, time travel, partition evolution, and concurrent writers — bringing warehouse-like semantics to object storage&lt;/strong&gt;. Iceberg, Delta Lake, and Hudi are the three dominant open table formats, each with slightly different trade-offs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro52s5ukbr1vfkb80l56.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro52s5ukbr1vfkb80l56.webp" alt="Three-column comparison infographic of Data Lake, Cloud Warehouse, and Lakehouse storage architectures showing strengths (modern flexible files, structured SQL, hybrid design) and watch-outs (data quality challenges, limited unstructured support, increasing complexity) with PipeCode brand colors." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Most large organizations run a &lt;strong&gt;blend&lt;/strong&gt;: lake for flexible high-volume ingestion and ML feature stores, warehouse or lakehouse SQL for curated analytics. Don't propose a single-pattern solution to a system-design question — describe the boundary between the two and the contracts that flow across it. That's the senior signal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Data lake — files on object storage with a catalog
&lt;/h4&gt;

&lt;p&gt;The lake invariant: &lt;strong&gt;a data lake is object storage (S3/GCS/ADLS) plus a metadata catalog plus open file formats (Parquet, ORC, Avro) plus convention-based partitioning; reads are cheap and parallel, writes are eventual-consistent unless wrapped in a table format, and the cost model is storage + compute-at-query-time&lt;/strong&gt;. Lakes shine when data shapes are diverse and high-volume.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt; — accepts any data format; massive scale; cheap storage; many engines can read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch-outs&lt;/strong&gt; — no ACID without a table format; no time travel; concurrent writes can corrupt the table; small-file problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt; — ML feature stores, log archives, raw event data, ingestion landing zones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — storage ~$0.023/GB/month (S3 Standard); compute pay-per-query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 50TB clickstream feature store in S3 + Glue.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;attribute&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;storage&lt;/td&gt;
&lt;td&gt;S3 Standard, ~$1,150/month for 50TB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;catalog&lt;/td&gt;
&lt;td&gt;AWS Glue (free for first million objects)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;Athena, ~$5/TB scanned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;typical query&lt;/td&gt;
&lt;td&gt;scan 100GB → ~$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storage line: 50TB × 1024GB × $0.023/GB/month (S3 Standard pricing) ≈ $1,150/month — this is the floor regardless of query activity.&lt;/li&gt;
&lt;li&gt;Catalog line: AWS Glue is free for the first million metadata objects; a 50TB clickstream table partitioned by year/month/day fits comfortably under that limit.&lt;/li&gt;
&lt;li&gt;Compute line: Athena charges per TB scanned, not per query — write efficient SQL (use the partition predicate, project only needed columns) and you pay only for what you actually read.&lt;/li&gt;
&lt;li&gt;Typical query: a partition-pruned + column-projected scan touches ~100GB → 0.1 TB × $5/TB ≈ $0.50; an unpruned full-table scan would touch 50TB → $250 per query.&lt;/li&gt;
&lt;li&gt;Net at this scale: storage dominates the monthly bill (~$1,150) and compute scales linearly with query discipline — bad queries cost real money, good queries are nearly free.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A lake-first deployment for clickstream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://feature-lake/raw_events/year=2026/month=04/day=13/
  part-00000.parquet
  part-00001.parquet
   …
Glue catalog: feature_lake.raw_events
Athena query: SELECT user_id, COUNT(*) FROM feature_lake.raw_events WHERE year = '2026' AND month = '04' AND day = '13' GROUP BY user_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; a pure lake is the right answer when data is high-volume, schema-flexible, and primarily consumed by ML or batch analytics; reach for a lakehouse the moment you need ACID or concurrent writers.&lt;/p&gt;
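&lt;p&gt;The cost arithmetic from the steps above, as a sketch. The rates are this section's figures (S3 Standard ~$0.023/GB/month, Athena ~$5/TB scanned); check current pricing before relying on them:&lt;/p&gt;

```python
# Back-of-envelope lake costs, using the section's illustrative rates.
S3_PER_GB_MONTH = 0.023   # S3 Standard storage, USD per GB-month
ATHENA_PER_TB = 5.0       # Athena, USD per TB scanned

def monthly_storage(tb):
    """Monthly storage bill for tb terabytes at rest."""
    return tb * 1024 * S3_PER_GB_MONTH

def query_cost(tb_scanned):
    """Cost of a single query that scans tb_scanned terabytes."""
    return tb_scanned * ATHENA_PER_TB

print(round(monthly_storage(50)))  # ~1178 per month for 50TB
print(query_cost(0.1))             # 0.5 for a pruned 100GB scan
print(query_cost(50))              # 250.0 for an unpruned full scan
```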

&lt;h4&gt;
  
  
  Cloud warehouse — managed ACID SQL on proprietary storage
&lt;/h4&gt;

&lt;p&gt;The warehouse invariant: &lt;strong&gt;a cloud warehouse (Snowflake, BigQuery, Redshift, Synapse) is a managed system that owns both storage and compute, exposes SQL as the primary interface, provides ACID transactions out of the box, and handles indexing, statistics, and query optimization automatically&lt;/strong&gt;. Warehouses shine when data is structured and the primary consumer is analyst SQL.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt; — mature SQL; ACID; managed governance products (RBAC, masking); workload management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch-outs&lt;/strong&gt; — proprietary storage = vendor lock; cost at huge semi-structured scale; less flexible for non-tabular data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt; — curated analytics, BI dashboards, financial reporting, dimensional models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — ~$2-5 per credit-hour or per-TB-scanned; storage ~$0.02-0.04/GB/month (compressed).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 5TB curated finance mart in Snowflake.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;attribute&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;storage&lt;/td&gt;
&lt;td&gt;Snowflake, ~$200/month for 5TB compressed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;Small warehouse, ~$2/credit-hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;typical query&lt;/td&gt;
&lt;td&gt;dashboard refresh in ~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACID&lt;/td&gt;
&lt;td&gt;full transactions across multi-table updates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storage line: Snowflake bills on compressed bytes (typically 3-5x smaller than raw) at roughly $23-40/TB/month, so 5TB of compressed data lands around the table's ~$200/month figure.&lt;/li&gt;
&lt;li&gt;Compute line: a Small warehouse runs at ~$2/credit-hour; nightly ELT jobs plus business-hours dashboards consume ~50-200 credits/month for a finance mart of this size.&lt;/li&gt;
&lt;li&gt;Typical query: dashboard refresh hits a sub-30-second target because data is co-located with compute and the planner has full statistics.&lt;/li&gt;
&lt;li&gt;ACID guarantee: multi-table updates within a &lt;code&gt;BEGIN ... COMMIT&lt;/code&gt; block are atomic — the finance close cannot land half-updated, which is the whole reason finance reports run on a warehouse rather than a raw lake.&lt;/li&gt;
&lt;li&gt;Net at 5TB scale: the warehouse premium (~$200 storage) is small versus a lake's ~$115 equivalent; ergonomics, SQL-first BI integration, and ACID tilt the choice clearly toward warehouse.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A curated star schema in Snowflake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;finance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fact_revenue&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;date_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_lines&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; warehouses are the right answer when SQL ergonomics and ACID matter more than format flexibility; reach for a lakehouse when you need both &lt;em&gt;and&lt;/em&gt; the ability to query the same data from outside the warehouse.&lt;/p&gt;

&lt;h4&gt;
  
  
  Lakehouse with Iceberg / Delta / Hudi — ACID on object storage
&lt;/h4&gt;

&lt;p&gt;The lakehouse invariant: &lt;strong&gt;a lakehouse is an open-table-format layer (Apache Iceberg, Delta Lake, Apache Hudi) on top of object storage that adds ACID transactions, schema evolution, partition evolution, time travel, and safe concurrent writers; the data sits in standard Parquet files but is governed by a JSON/Avro commit log that any engine can read&lt;/strong&gt;. Lakehouse architectures combine lake scale with warehouse-like semantics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Iceberg&lt;/strong&gt; — table format invented at Netflix; broad engine support (Spark, Trino, Snowflake, BigQuery, Dremio); REST catalog spec.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake&lt;/strong&gt; — invented at Databricks; strong Spark integration; commit log in &lt;code&gt;_delta_log/&lt;/code&gt;; OSS Delta works across engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Hudi&lt;/strong&gt; — invented at Uber; optimized for upsert-heavy CDC workloads; merge-on-read and copy-on-write modes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All three&lt;/strong&gt; — provide ACID, time travel, schema evolution, and partition pruning; pick by ecosystem and team skill.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 50TB lakehouse on S3 + Iceberg + Spark/Trino.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;dimension&lt;/th&gt;
&lt;th&gt;data lake&lt;/th&gt;
&lt;th&gt;warehouse&lt;/th&gt;
&lt;th&gt;lakehouse (Iceberg)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;storage cost&lt;/td&gt;
&lt;td&gt;✓ cheap&lt;/td&gt;
&lt;td&gt;✗ expensive&lt;/td&gt;
&lt;td&gt;✓ cheap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACID transactions&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;concurrent writers&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time travel&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;depends&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;schema evolution&lt;/td&gt;
&lt;td&gt;manual&lt;/td&gt;
&lt;td&gt;managed&lt;/td&gt;
&lt;td&gt;managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor lock&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;td&gt;low (open standard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML / Python access&lt;/td&gt;
&lt;td&gt;direct&lt;/td&gt;
&lt;td&gt;via connector&lt;/td&gt;
&lt;td&gt;direct&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storage cost row: lake and lakehouse both win because data sits in cheap object storage; warehouse loses at scale because storage is bundled with managed compute.&lt;/li&gt;
&lt;li&gt;ACID + concurrent writers rows: warehouse and lakehouse both provide them out of the box; pure lake does not — concurrent writers can corrupt a lake table without an open table format on top.&lt;/li&gt;
&lt;li&gt;Time travel row: only the lakehouse exposes it natively via the Iceberg/Delta snapshot log; some warehouses offer it as a managed feature; pure lake has no concept.&lt;/li&gt;
&lt;li&gt;Schema evolution row: lakehouse and warehouse both manage adding/widening columns as a metadata commit; pure-lake users do it manually with file rewrites.&lt;/li&gt;
&lt;li&gt;Vendor lock + ML/Python rows: pure lake is open standard; lakehouse is open standard with a richer feature set; warehouse is proprietary and ML access usually requires connectors that copy data back out — which is why ML teams gravitate to lake/lakehouse for feature stores.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Creating an Iceberg table via Spark SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;lakehouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;       &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt;   &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;iceberg&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://lakehouse-bucket/orders/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;lakehouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;staging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders_delta&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the lakehouse pattern is the right answer when you need ACID + time travel + concurrent writers + the ability to query from multiple engines; pick Iceberg for the broadest engine support, Delta for tightest Databricks/Spark integration, Hudi for upsert-heavy CDC.&lt;/p&gt;
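&lt;p&gt;&lt;strong&gt;Optional sketch.&lt;/strong&gt; That rule of thumb can be encoded as a tiny helper (the function name, inputs, and branching are illustrative, not an official decision matrix):&lt;/p&gt;

```python
# Hypothetical helper encoding the format-selection rule of thumb above.
def pick_table_format(upsert_heavy_cdc: bool,
                      databricks_spark_shop: bool) -> str:
    if upsert_heavy_cdc:
        return "Apache Hudi"      # merge-on-read tuned for CDC upserts
    if databricks_spark_shop:
        return "Delta Lake"       # tightest Databricks/Spark integration
    return "Apache Iceberg"       # broadest multi-engine support

print(pick_table_format(upsert_heavy_cdc=False, databricks_spark_shop=False))
# → Apache Iceberg
```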

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Conflating "lake" with "Hadoop / HDFS" — modern lakes are object storage; HDFS is the legacy on-prem variant.&lt;/li&gt;
&lt;li&gt;Picking a lakehouse "because it's modern" without matching it to the workload — for pure curated SQL analytics, a warehouse is often simpler and cheaper.&lt;/li&gt;
&lt;li&gt;Treating Iceberg / Delta / Hudi as interchangeable — Hudi is upsert-tuned; Delta is Spark-tightest; Iceberg is most engine-agnostic. The choice has long-term implications.&lt;/li&gt;
&lt;li&gt;Forgetting that lakehouses still need governance — IAM, lineage, quality tests, contracts; the "open" part is the storage, not the operational discipline.&lt;/li&gt;
&lt;li&gt;Underestimating the operational cost of running an open lakehouse vs a managed warehouse — engineering time matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Lake Interview Question on Pattern Selection
&lt;/h3&gt;

&lt;p&gt;A retail company stores 200TB of clickstream events plus a 5TB curated finance mart and a 1TB ML feature store. &lt;strong&gt;Should they run on a pure data lake, a cloud warehouse, a lakehouse, or a hybrid? Walk through your decision.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a Hybrid — Lakehouse for Clickstream + Features, Warehouse for Finance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workload                  Volume    Pattern recommended    Why
────────────────────────  ────────  ─────────────────────  ─────────────────────────────────────────
Clickstream events        200 TB    Lakehouse (Iceberg)    Volume + schema flexibility + ML access
ML feature store           1 TB     Lakehouse (Iceberg)    Same engine, same catalog as clickstream
Curated finance mart       5 TB     Cloud warehouse        SQL ergonomics, ACID across many tables, BI tools
                                                            Snowflake / BigQuery / Redshift

Boundary contract:
  - Clickstream + features stay in S3 + Iceberg
  - Finance mart loads nightly from Iceberg via Snowflake external tables
  - Reverse-ETL syncs finance summaries back into the lakehouse for ML feature joins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; clickstream at 200TB is the workload that justifies the cheaper object-storage cost model, and the lakehouse table format adds the ACID and time travel the team will need for replays and audits. The ML feature store sits on the same engine and catalog, so feature engineers can &lt;code&gt;JOIN&lt;/code&gt; against clickstream without a cross-system data hop. The finance mart, at only 5TB, is small enough that warehouse storage cost is negligible; the team's BI tools and analyst SQL ergonomics dominate that decision. Finally, the boundary contract (Snowflake external tables) lets finance read curated lake tables without copying them, and reverse-ETL closes the loop for ML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;question&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Is volume &amp;gt; 50TB?&lt;/td&gt;
&lt;td&gt;yes (clickstream) → lake or lakehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Need ACID + concurrent writers + time travel?&lt;/td&gt;
&lt;td&gt;yes (CDC + ML feature recomputation) → lakehouse, not pure lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Pick a table format&lt;/td&gt;
&lt;td&gt;Iceberg (broadest engine support across Spark, Trino, Snowflake)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Is the curated SQL workload &amp;lt; 10TB?&lt;/td&gt;
&lt;td&gt;yes (finance, 5TB) → warehouse is fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Pick a warehouse&lt;/td&gt;
&lt;td&gt;Snowflake (ergonomics + multi-cloud + Iceberg external table support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Boundary contract&lt;/td&gt;
&lt;td&gt;Snowflake external tables on Iceberg; reverse-ETL nightly job&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
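&lt;p&gt;&lt;strong&gt;Optional sketch.&lt;/strong&gt; The six-step trace can be condensed into a small helper (the 50TB and 10TB thresholds are this article's heuristics, not universal constants, and the function is invented for illustration):&lt;/p&gt;

```python
# Hedged sketch of the decision trace above as a function.
def recommend_pattern(volume_tb: float, needs_acid: bool,
                      curated_sql_tb: float) -> dict:
    rec = {}
    # Steps 1-2: big-volume zone -> lake vs lakehouse vs warehouse
    if volume_tb > 50:
        rec["big_data_zone"] = ("lakehouse (Iceberg)" if needs_acid
                                else "data lake")
    else:
        rec["big_data_zone"] = "warehouse"
    # Step 4: small curated SQL workload -> warehouse is fine
    rec["curated_zone"] = ("warehouse" if curated_sql_tb < 10
                           else "lakehouse (Iceberg)")
    return rec

print(recommend_pattern(volume_tb=200, needs_acid=True, curated_sql_tb=5))
# → {'big_data_zone': 'lakehouse (Iceberg)', 'curated_zone': 'warehouse'}
```

Feeding in the retail company's numbers reproduces the hybrid answer: lakehouse for clickstream and features, warehouse for the 5TB finance mart.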

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the recommended architecture summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;zone&lt;/th&gt;
&lt;th&gt;technology&lt;/th&gt;
&lt;th&gt;volume&lt;/th&gt;
&lt;th&gt;primary consumer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clickstream lakehouse&lt;/td&gt;
&lt;td&gt;S3 + Iceberg + Spark/Trino&lt;/td&gt;
&lt;td&gt;200 TB&lt;/td&gt;
&lt;td&gt;ML pipelines, analyst SQL via Trino&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML feature store&lt;/td&gt;
&lt;td&gt;S3 + Iceberg + Spark&lt;/td&gt;
&lt;td&gt;1 TB&lt;/td&gt;
&lt;td&gt;ML training + serving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance warehouse&lt;/td&gt;
&lt;td&gt;Snowflake (managed)&lt;/td&gt;
&lt;td&gt;5 TB&lt;/td&gt;
&lt;td&gt;Finance analysts, BI dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boundary&lt;/td&gt;
&lt;td&gt;Snowflake external tables on Iceberg&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;finance reads curated lake data zero-copy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Volume-driven storage choice&lt;/strong&gt; — 200TB at warehouse storage pricing ($0.02-0.04/GB/month) runs ~$4-8K/month; the same data on S3 costs ~$4.6K/month &lt;em&gt;and&lt;/em&gt; is available to ML directly. The absolute cost gap widens with growth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse for ACID + time travel&lt;/strong&gt; — clickstream replays and ML feature recomputation need transactional snapshots; a pure lake without Iceberg cannot give you that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse for curated SQL&lt;/strong&gt; — finance analysts live in BI tools; warehouse SQL ergonomics plus ACID across multi-table updates dominate the cost-per-query argument at 5TB scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg as the open boundary&lt;/strong&gt; — Snowflake reads Iceberg tables natively via external tables; no nightly copy job, no schema drift between systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reverse-ETL closes the loop&lt;/strong&gt; — finance summaries flow back to the lakehouse so ML features can &lt;code&gt;JOIN&lt;/code&gt; against revenue without leaving the lake stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational cost trade-off&lt;/strong&gt; — running both a lakehouse and a warehouse is more engineering than a single managed warehouse; the cost is justified at this volume mix but would not be at 5TB total.&lt;/li&gt;
&lt;/ul&gt;
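&lt;p&gt;&lt;strong&gt;Optional sketch.&lt;/strong&gt; A quick back-of-envelope check of the storage math in the first bullet (per-GB prices are indicative list prices, not quotes; S3 Standard is roughly $0.023/GB/month):&lt;/p&gt;

```python
# Back-of-envelope storage cost check (decimal TB -> GB).
TB = 1_000  # GB per TB

def monthly_storage_cost(volume_tb: float, price_per_gb: float) -> float:
    return volume_tb * TB * price_per_gb

s3 = monthly_storage_cost(200, 0.023)     # S3 Standard, ~$0.023/GB/month
wh_low = monthly_storage_cost(200, 0.02)  # warehouse low end
wh_high = monthly_storage_cost(200, 0.04) # warehouse high end

print(f"S3: ${s3:,.0f}/mo, warehouse: ${wh_low:,.0f}-${wh_high:,.0f}/mo")
```

At 200TB the two are in the same ballpark in absolute dollars; the argument tips on compute bundling and direct ML access, and the absolute gap grows linearly with volume.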

&lt;p&gt;&lt;strong&gt;Practice next:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice problems&lt;/a&gt; for warehouse-style queries and &lt;a href="https://pipecode.ai/explore/practice/language/data-modeling" rel="noopener noreferrer"&gt;data modeling practice&lt;/a&gt; for star-schema and OBT patterns. Course: &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling for Data Engineering Interviews&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. Interview Answer Shape — Grain, Idempotency, Lineage, Reconciliation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A five-step template for data lake design rounds
&lt;/h3&gt;

&lt;p&gt;"Design our company's analytics data lake" is the canonical open-ended system-design prompt — and the cleanest answer is a &lt;strong&gt;five-step template&lt;/strong&gt; that walks the interviewer through the load-bearing decisions in a fixed order. The mental model: &lt;strong&gt;clarify grain → separate landing from conformed → make loads idempotent → attach lineage keys → reconcile aggregates against source&lt;/strong&gt;. Following this template demonstrates that you have shipped data pipelines before, and it gives the interviewer five concrete spots to drill deeper. Candidates who jump straight to vendor names or who skip the grain question lose the round, regardless of how many tools they can name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2gytmc0nfg8udalzf9f.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2gytmc0nfg8udalzf9f.webp" alt="Interview answer shape checklist for data lake design questions: clarify grain, separate landing vs conformed, idempotent loads, row lineage keys, aggregate reconciliation to source — with green checkmarks and PipeCode branding." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; State the template out loud at the start: "I'd answer this in five steps — first clarify grain, then separate landing from conformed, then make loads idempotent, then attach lineage keys, then explain how I'd reconcile aggregates against the source." This gives the interviewer a road map and makes it easy for them to interrupt at any step with "tell me more about X" — which is exactly the signal you want.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 1 — Clarify the grain and the metric definition
&lt;/h4&gt;

&lt;p&gt;The grain invariant: &lt;strong&gt;the grain of a fact table is the business event one row represents — orders, order lines, shipments, page views, user-day, user-session — and ambiguous grain is the single most common bug in data engineering&lt;/strong&gt;. Ask the interviewer "are we counting orders or order lines?" before drawing a box. The answer changes joins, group-bys, and reconciliation totals.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Order grain&lt;/strong&gt; — one row per order; &lt;code&gt;COUNT(*)&lt;/code&gt; = number of orders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order-line grain&lt;/strong&gt; — one row per line item; &lt;code&gt;COUNT(DISTINCT order_id)&lt;/code&gt; = number of orders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-day grain&lt;/strong&gt; — one row per user per day; &lt;code&gt;SUM(events)&lt;/code&gt; grouped by day = total events per day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session grain&lt;/strong&gt; — one row per session; rolling &lt;code&gt;LAG&lt;/code&gt; over events to define session boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; "How many orders did we ship last week?" against &lt;code&gt;fact_shipments&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;grain candidate&lt;/th&gt;
&lt;th&gt;implied metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;order grain&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COUNT(*) WHERE shipped_date BETWEEN ...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;order-line grain&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COUNT(DISTINCT order_id) WHERE shipped_date BETWEEN ...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shipment grain&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COUNT(DISTINCT order_id) WHERE shipment_event = 'shipped'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If the table is at &lt;em&gt;order grain&lt;/em&gt; (one row per order), &lt;code&gt;COUNT(*) WHERE shipped_date BETWEEN ...&lt;/code&gt; directly counts orders shipped — clean and simple.&lt;/li&gt;
&lt;li&gt;If the table is at &lt;em&gt;order-line grain&lt;/em&gt; (one row per item per order), &lt;code&gt;COUNT(*)&lt;/code&gt; over-counts every multi-item order; the right answer becomes &lt;code&gt;COUNT(DISTINCT order_id)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If the table is at &lt;em&gt;shipment grain&lt;/em&gt; (one row per shipment event per line, including partial shipments and cancellations), filter by &lt;code&gt;event_type = 'shipped'&lt;/code&gt; first and then &lt;code&gt;COUNT(DISTINCT order_id)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Without naming the grain, the same SQL can produce three different "right" numbers — and the analyst, dashboard, and source-of-truth Slack thread will each pick a different one.&lt;/li&gt;
&lt;li&gt;Stating the grain in the first sentence of every interview answer prevents this entire class of bug — and the same rule applies in production: every fact table should have its grain documented in the catalog comment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Always state grain explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"This fact_shipments table has shipment grain — one row per shipment_event per order_line.
For 'orders shipped last week' I'll do COUNT(DISTINCT order_id) where event_type = 'shipped'
and shipped_date BETWEEN start_of_week AND end_of_week."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the first sentence of every interview answer should name the grain. Even if the interviewer doesn't ask, declaring grain demonstrates senior intent.&lt;/p&gt;
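&lt;p&gt;&lt;strong&gt;Optional sketch.&lt;/strong&gt; The over-counting trap is easy to reproduce with a three-row table at order-line grain (a Python stand-in for the SQL, with invented rows):&lt;/p&gt;

```python
# Tiny in-memory table at ORDER-LINE grain: two orders, one with two items.
fact_order_lines = [
    {"order_id": 1, "line": 1, "shipped": True},
    {"order_id": 1, "line": 2, "shipped": True},  # same order, 2nd item
    {"order_id": 2, "line": 1, "shipped": True},
]

# COUNT(*) over shipped rows -- counts LINE ITEMS, not orders.
rows = sum(1 for r in fact_order_lines if r["shipped"])

# COUNT(DISTINCT order_id) -- the correct "orders shipped" number.
orders = len({r["order_id"] for r in fact_order_lines if r["shipped"]})

print(rows, orders)  # 3 vs 2: same table, two different "order counts"
```

The same query shape yields 3 or 2 depending on whether you respected the grain, which is exactly why the grain belongs in the first sentence of the answer.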

&lt;h4&gt;
  
  
  Step 2 — Separate landing from conformed (bronze vs silver)
&lt;/h4&gt;

&lt;p&gt;The separation invariant: &lt;strong&gt;landing is what the source sent; conformed is what the business agrees to call truth; never let analysts query landing directly because schemas change without notice&lt;/strong&gt;. The bronze/silver split is the architectural manifestation of this rule.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Landing / bronze&lt;/strong&gt; — append-only, source-fidelity, partitioned by &lt;code&gt;ingest_date&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformed / silver&lt;/strong&gt; — deduplicated, typed, with conformed business keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curated / gold&lt;/strong&gt; — subject-area marts and dimensional models for downstream consumption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundary&lt;/strong&gt; — only the silver and gold layers carry consumer contracts; bronze is for re-processors only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A pipeline lands daily JSON snapshots; without a separation layer, analysts join directly against &lt;code&gt;bronze.orders&lt;/code&gt; and break every time the source adds a column.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;who reads&lt;/th&gt;
&lt;th&gt;breakage tolerance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;bronze.orders&lt;/td&gt;
&lt;td&gt;re-processors only&lt;/td&gt;
&lt;td&gt;high (re-process on demand)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silver.orders&lt;/td&gt;
&lt;td&gt;analyst ad-hoc, ML&lt;/td&gt;
&lt;td&gt;low (contract change ≥ 30 days notice)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gold.fact_orders&lt;/td&gt;
&lt;td&gt;dashboards, BI&lt;/td&gt;
&lt;td&gt;zero (versioned column contracts)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bronze is owned by the re-processors only — no SLA, no consumer contract; analysts who query it get whatever the source app emitted today, including freshly-renamed columns and broken types.&lt;/li&gt;
&lt;li&gt;Silver is the contract layer — analyst ad-hoc SQL, ML feature pipelines, and reverse-ETL all read it; breakage requires ≥30-day notice so consumers can adapt.&lt;/li&gt;
&lt;li&gt;Gold has zero breakage tolerance — dashboards and BI tools couple to specific column names + types; any change requires explicit version bumping (&lt;code&gt;gold.fact_orders_v2&lt;/code&gt;) so old dashboards keep working.&lt;/li&gt;
&lt;li&gt;Without these boundaries, a source app's column rename cascades immediately into a broken executive dashboard, and the data team learns about it from a Slack screenshot.&lt;/li&gt;
&lt;li&gt;With these boundaries, the silver-layer owner absorbs the upstream change inside the dedup logic, gold contracts stay intact, and the dashboard never breaks — the architecture has done its job.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; State the layer boundaries explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I'd split the platform into three layers — bronze for raw landing, silver for conformed,
gold for analytics-ready. Bronze is for re-processors only; analysts and dashboards read
silver and gold. The boundary contract is documented and breakage requires 30-day notice
for silver and explicit version bumping for gold."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; any answer that allows analysts to query the landing zone has a hidden bug-factory; the bronze/silver split is what prevents source-schema chaos from cascading into BI.&lt;/p&gt;
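&lt;p&gt;&lt;strong&gt;Optional sketch.&lt;/strong&gt; A minimal example of the silver layer absorbing an upstream rename (the column names &lt;code&gt;amt&lt;/code&gt; and &lt;code&gt;amount_usd&lt;/code&gt; are invented for the demo):&lt;/p&gt;

```python
# Bronze rows arrive with whatever keys the source emits; the silver
# mapping keeps the contracted column name stable for downstream readers.
def to_silver(bronze_row: dict) -> dict:
    # Source renamed `amt` -> `amount_usd`; silver absorbs both spellings
    # so the silver contract column `amount` never breaks.
    amount = bronze_row.get("amount_usd", bronze_row.get("amt"))
    return {"order_id": bronze_row["order_id"], "amount": amount}

old = to_silver({"order_id": 1, "amt": 99.0})         # pre-rename payload
new = to_silver({"order_id": 2, "amount_usd": 42.0})  # post-rename payload

print(old, new)  # both expose the contracted `amount` column
```

The rename is absorbed inside one function owned by the silver layer; gold contracts and dashboards never see it.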

&lt;h4&gt;
  
  
  Step 3 — Idempotent loads — same input → same output, every time
&lt;/h4&gt;

&lt;p&gt;The idempotency invariant: &lt;strong&gt;a daily load is idempotent if re-running it (after any failure, manual intervention, or backfill) produces byte-identical output; without idempotency, retries cause duplicates and counts drift silently&lt;/strong&gt;. Idempotency is achieved through &lt;code&gt;MERGE&lt;/code&gt; instead of &lt;code&gt;INSERT&lt;/code&gt;, partition-overwrite semantics, or table-format ACID transactions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MERGE&lt;/code&gt; on a business key&lt;/strong&gt; — &lt;code&gt;WHEN MATCHED UPDATE SET *&lt;/code&gt; + &lt;code&gt;WHEN NOT MATCHED INSERT *&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition overwrite&lt;/strong&gt; — &lt;code&gt;INSERT OVERWRITE TABLE silver.orders PARTITION (ingest_date='2026-04-13')&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg / Delta &lt;code&gt;MERGE&lt;/code&gt;&lt;/strong&gt; — ACID transaction; safe for concurrent writers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional idempotency&lt;/strong&gt; — pure transformations whose output depends only on inputs, never on &lt;code&gt;NOW()&lt;/code&gt; or random values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A retry on a half-completed daily load should produce the same final state as the original successful run.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;run&lt;/th&gt;
&lt;th&gt;rows in silver before&lt;/th&gt;
&lt;th&gt;rows after&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;original&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;retry (after partial failure)&lt;/td&gt;
&lt;td&gt;12,401&lt;/td&gt;
&lt;td&gt;12,835 (no duplicates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;backfill 2026-04-12 a week later&lt;/td&gt;
&lt;td&gt;12,820&lt;/td&gt;
&lt;td&gt;12,820 (overwritten cleanly)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Original run: silver starts at 0 rows; the &lt;code&gt;MERGE&lt;/code&gt; writes 12,835 unique rows after dedup; final count = 12,835.&lt;/li&gt;
&lt;li&gt;Retry after a partial failure: silver already has 12,401 rows (the partial write that crashed); the &lt;code&gt;MERGE&lt;/code&gt; updates the existing rows and inserts only the missing 434; final count = 12,835 — no duplicates.&lt;/li&gt;
&lt;li&gt;Backfill 2026-04-12 a week later: partition-overwrite semantics drop the existing 12,820 rows for that date and replace them with the freshly recomputed 12,820; final count = 12,820 — clean.&lt;/li&gt;
&lt;li&gt;The key invariant: every rerun produces the same final state regardless of the starting state — that's what idempotency means.&lt;/li&gt;
&lt;li&gt;Without idempotency, the retry would have inserted 434 duplicate rows (12,835 - 12,401), and the backfill would have either errored on the unique constraint or silently created shadow data that broke the next dashboard refresh.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;
    &lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your interviewer asks "what happens if this job runs twice", and your answer involves any kind of cleanup script, you don't have idempotency — restructure.&lt;/p&gt;
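&lt;p&gt;&lt;strong&gt;Optional sketch.&lt;/strong&gt; The retry table above can be simulated with a toy &lt;code&gt;MERGE&lt;/code&gt; keyed on the business key (rows and the helper are invented for illustration):&lt;/p&gt;

```python
# Toy MERGE: upsert each batch row into the target keyed by order_id.
# Matched -> update, not matched -> insert; rerunning any batch leaves
# the same final state, which is the idempotency invariant.
def merge(target: dict, batch: list) -> dict:
    for row in batch:
        target[row["order_id"]] = row
    return target

batch = [{"order_id": i, "amount": i * 1.0} for i in range(5)]

silver = {}
merge(silver, batch)      # original run
merge(silver, batch[:3])  # retry of a partially completed batch
merge(silver, batch)      # full rerun (e.g., a backfill)

print(len(silver))  # still 5 rows: no duplicates after any rerun
```

A plain `INSERT`-based version of the same three runs would end with 13 rows instead of 5, which is the silent count drift the invariant prevents.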

&lt;h4&gt;
  
  
  Step 4 — Attach row-level lineage keys — &lt;code&gt;ingest_id&lt;/code&gt;, &lt;code&gt;source_ts&lt;/code&gt;, &lt;code&gt;pipeline_version&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The lineage invariant: &lt;strong&gt;every silver and gold row carries the columns that let you reconstruct &lt;em&gt;which source payload&lt;/em&gt; produced it and &lt;em&gt;which pipeline version&lt;/em&gt; transformed it; without lineage, debugging "why does this row look wrong" is forensic archaeology&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ingest_id&lt;/code&gt;&lt;/strong&gt; — unique identifier of the bronze batch (e.g., timestamp + UUID).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;source_ts&lt;/code&gt;&lt;/strong&gt; — timestamp from the source system (CDC) for ordering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pipeline_version&lt;/code&gt;&lt;/strong&gt; — git SHA or version tag of the transformation code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;silver_loaded_at&lt;/code&gt;&lt;/strong&gt; — when the row entered silver; useful for SLA metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Analysts notice that &lt;code&gt;revenue&lt;/code&gt; for &lt;code&gt;order_id=448&lt;/code&gt; is wrong; with lineage, they can trace it back to the exact bronze file and pipeline version that produced it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;field&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;order_id&lt;/td&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;revenue&lt;/td&gt;
&lt;td&gt;$99.00 (wrong; should be $999.00)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ingest_id&lt;/td&gt;
&lt;td&gt;&lt;code&gt;20260412T0200Z_a3f2&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;source_ts&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-12 09:30:15&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pipeline_version&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;v2.1.7&lt;/code&gt; (commit &lt;code&gt;b3a4d72&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silver_loaded_at&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-12 02:15:32&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An analyst notices &lt;code&gt;order_id = 448&lt;/code&gt; shows revenue $99 instead of the expected $999 in the BI dashboard.&lt;/li&gt;
&lt;li&gt;They look the row up in silver: &lt;code&gt;SELECT ingest_id, source_ts, pipeline_version FROM silver.orders WHERE order_id = 448&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The result tells them exactly which bronze batch produced this row (&lt;code&gt;ingest_id = '20260412T0200Z_a3f2'&lt;/code&gt;), the source moment (&lt;code&gt;source_ts = 2026-04-12 09:30:15&lt;/code&gt;), and the pipeline version that ran (&lt;code&gt;v2.1.7&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;They open the bronze file at that &lt;code&gt;ingest_id&lt;/code&gt;. If the source payload already shows $99, it's a source bug — file a ticket with the upstream team and replay from a known-good &lt;code&gt;source_ts&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If the bronze payload shows $999 but silver shows $99, the bug is in the pipeline. Run &lt;code&gt;git log -1 v2.1.7&lt;/code&gt; to find the exact commit the tag points to, fix the transformation, deploy &lt;code&gt;v2.1.8&lt;/code&gt;, and backfill the affected &lt;code&gt;ingest_date&lt;/code&gt; partition — total recovery time is ~30 minutes instead of multi-day forensic SQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Carry lineage in every silver row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;source_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ingest_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="s1"&gt;'v2.1.7'&lt;/span&gt;                          &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pipeline_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;                 &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;silver_loaded_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deduped&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the dashboard shows a wrong number and you can't answer "which source file produced this row?" in under five minutes, your lineage isn't strong enough.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 5 — Aggregate reconciliation against the source
&lt;/h4&gt;

&lt;p&gt;The reconciliation invariant: &lt;strong&gt;a daily job compares aggregate metrics (row counts, sums, distinct counts) between the lake and the source system, alerts on drift above a tolerance, and blocks promotion to gold until the drift is investigated&lt;/strong&gt;. Reconciliation is the difference between "we trust the lake" and "we hope the lake is right".&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Row count&lt;/strong&gt; — &lt;code&gt;COUNT(*)&lt;/code&gt; in lake vs source for the same time window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sum reconciliation&lt;/strong&gt; — &lt;code&gt;SUM(amount)&lt;/code&gt; in lake vs source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distinct count&lt;/strong&gt; — &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt; to catch dedup bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tolerance threshold&lt;/strong&gt; — typically 0.1% for high-volume facts, 0.01% for finance.&lt;/li&gt;
&lt;/ul&gt;
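&lt;p&gt;The drift-and-tolerance check reduces to a few lines of logic. A Python sketch, assuming the metric names from the example below and the 0.1% default tolerance (the &lt;code&gt;reconcile&lt;/code&gt; helper is illustrative):&lt;/p&gt;

```python
def drift(lake_value, source_value):
    """Relative drift between lake and source, e.g. 0.001 is 0.1%."""
    return abs(lake_value - source_value) / source_value

def reconcile(lake, source, tolerance=0.001):
    """Return (gate, per-metric drift); gate is 'PASS' only if every metric is within tolerance."""
    drifts = {m: drift(lake[m], source[m]) for m in source}
    gate = "FAIL" if any(d > tolerance for d in drifts.values()) else "PASS"
    return gate, drifts

lake   = {"rows": 12835, "sum_amount": 4128931, "distinct_users": 8712}
source = {"rows": 12835, "sum_amount": 4128931, "distinct_users": 8712}
gate, drifts = reconcile(lake, source)  # all metrics match, so the gate passes
```

&lt;p&gt;A stricter workload (e.g. finance) is just a smaller &lt;code&gt;tolerance&lt;/code&gt; argument.&lt;/p&gt;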

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Daily reconciliation between silver and source-app replica.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;silver&lt;/th&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;drift&lt;/th&gt;
&lt;th&gt;passes?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;row count&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sum(amount)&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;count(distinct user_id)&lt;/td&gt;
&lt;td&gt;8,712&lt;/td&gt;
&lt;td&gt;8,712&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The daily reconciliation job runs after the silver load completes for the prior day.&lt;/li&gt;
&lt;li&gt;It computes three metrics over &lt;code&gt;silver.orders&lt;/code&gt;: &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(amount)&lt;/code&gt;, and &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt; for the same date.&lt;/li&gt;
&lt;li&gt;It computes the same three metrics over &lt;code&gt;source_replica.orders&lt;/code&gt; (a read-only replica of the source-app database) for the same date.&lt;/li&gt;
&lt;li&gt;For each metric, drift is calculated as &lt;code&gt;ABS(silver - source) / source&lt;/code&gt;; the gate passes only if every metric is below the tolerance (0.001 = 0.1% for facts; 0.0001 = 0.01% for finance).&lt;/li&gt;
&lt;li&gt;If all three pass: silver promotes to gold and the dashboard refresh proceeds. If any fail: the gate blocks promotion, pages the on-call engineer, and emits the failing metric to a drift dashboard for investigation — the BI team keeps seeing yesterday's trusted numbers instead of a refreshed dashboard with wrong ones.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A reconciliation gate before gold promotion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;lake&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;source_replica&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_ts&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row_drift&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sum_drift&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;001&lt;/span&gt;
         &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;001&lt;/span&gt;
        &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'PASS'&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'FAIL'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never promote to gold without a reconciliation gate; the BI team will discover any drift the hard way otherwise, and trust takes years to rebuild.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Skipping Step 1 (grain) and going straight to architecture — every downstream answer is wrong if grain is wrong.&lt;/li&gt;
&lt;li&gt;Letting analysts query the bronze zone directly — schema drift cascades into BI dashboards.&lt;/li&gt;
&lt;li&gt;"Idempotent" loads that depend on &lt;code&gt;NOW()&lt;/code&gt; — re-runs produce different rows; not actually idempotent.&lt;/li&gt;
&lt;li&gt;Lineage limited to the pipeline level (not the row level) — debugging "this row is wrong" is a multi-day forensic effort.&lt;/li&gt;
&lt;li&gt;Reconciliation that only checks row counts but not sums — &lt;code&gt;COUNT&lt;/code&gt; can still match when dedup kept the wrong rows, so the sum drifts while the count looks fine.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Lake Interview Question on a Full System-Design Walkthrough
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Walk through your end-to-end answer to "design our company's analytics data lake" using the five-step template.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using the Five-Step Template
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. CLARIFY GRAIN
   "Before I draw any boxes — what's the canonical fact event? Orders, order lines, shipments?
    What's the metric we ultimately care about? Revenue, user counts, latency?"
   → assume: order grain; canonical metric = daily revenue per region.

2. SEPARATE LANDING FROM CONFORMED
   bronze.orders   ← S3 append-only daily JSON, partitioned by ingest_date
   silver.orders   ← deduped + typed + conformed customer_key/region_key
   gold.fact_orders ← star schema with dim_customer, dim_region, dim_date

3. IDEMPOTENT LOADS
   - bronze: append-only writes by ingest_id (never overwrite)
   - silver: MERGE on order_id with QUALIFY ROW_NUMBER() = 1 dedup
   - gold: INSERT OVERWRITE PARTITION (date_key) for the affected day(s)
   Re-runs produce byte-identical output.

4. ROW-LEVEL LINEAGE
   Carry ingest_id, source_ts, pipeline_version, silver_loaded_at on every silver row.
   Carry silver_loaded_at and pipeline_version on every gold row.
   Forensic queries: "show me every silver.orders row where pipeline_version='v2.1.6'."

5. AGGREGATE RECONCILIATION
   Daily SQL job: compare COUNT(*), SUM(amount), COUNT(DISTINCT user_id) between
   silver.orders and the source-app replica for the prior day. Drift &amp;gt; 0.1% blocks
   gold promotion and pages on-call. Drift dashboard surfaces history at a glance.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
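&lt;p&gt;The silver dedup named in step 3 — one row per &lt;code&gt;order_id&lt;/code&gt;, latest &lt;code&gt;source_ts&lt;/code&gt; wins — can be mirrored in plain Python. A sketch with illustrative column names:&lt;/p&gt;

```python
def dedupe_latest(rows, key="order_id", order_by="source_ts"):
    """Keep one row per key, preferring the greatest order_by value
    (the Python analogue of QUALIFY ROW_NUMBER() ... ORDER BY source_ts DESC = 1)."""
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[order_by] > best[k][order_by]:
            best[k] = row
    return list(best.values())

bronze = [
    {"order_id": 448, "source_ts": "2026-04-12 09:30:15", "amount": 99},
    {"order_id": 448, "source_ts": "2026-04-12 09:45:02", "amount": 999},  # later correction
    {"order_id": 449, "source_ts": "2026-04-12 10:01:00", "amount": 250},
]
silver = dedupe_latest(bronze)  # two rows; order 448 keeps the corrected amount
```

&lt;p&gt;Because the result depends only on the input rows, re-running the same batch yields the same silver rows — the idempotency property step 3 asks for.&lt;/p&gt;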



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the template gives the interviewer a clear road map (so they know where to drill) while demonstrating that the candidate has shipped this kind of pipeline before. Each step addresses a specific failure mode (grain ambiguity, schema drift, retry duplicates, debugging dead-ends, silent data corruption). The order is non-arbitrary — Step N depends on Step N-1, and skipping any step weakens the foundation. Every step has a concrete artifact (a layer, a SQL pattern, a column, a job), so the interviewer can ask "show me what that looks like" and get a specific answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; through a sample interview round:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time (min)&lt;/th&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;candidate output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0-2&lt;/td&gt;
&lt;td&gt;grain&lt;/td&gt;
&lt;td&gt;"Are we counting orders or order lines? Confirmed: orders."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2-7&lt;/td&gt;
&lt;td&gt;landing vs conformed&lt;/td&gt;
&lt;td&gt;drew bronze/silver/gold split with ownership boxes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7-12&lt;/td&gt;
&lt;td&gt;idempotency&lt;/td&gt;
&lt;td&gt;walked through silver MERGE; named QUALIFY ROW_NUMBER dedup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12-15&lt;/td&gt;
&lt;td&gt;lineage&lt;/td&gt;
&lt;td&gt;listed &lt;code&gt;ingest_id&lt;/code&gt;, &lt;code&gt;source_ts&lt;/code&gt;, &lt;code&gt;pipeline_version&lt;/code&gt; columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15-20&lt;/td&gt;
&lt;td&gt;reconciliation&lt;/td&gt;
&lt;td&gt;sketched daily-reconciliation SQL job + drift dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-25&lt;/td&gt;
&lt;td&gt;open questions&lt;/td&gt;
&lt;td&gt;streaming variant, schema evolution, multi-region replication&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the recommended interview-round shape:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;minutes&lt;/th&gt;
&lt;th&gt;failure mode addressed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 — grain&lt;/td&gt;
&lt;td&gt;0-2&lt;/td&gt;
&lt;td&gt;ambiguous metric → wrong joins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 — landing vs conformed&lt;/td&gt;
&lt;td&gt;2-7&lt;/td&gt;
&lt;td&gt;source-schema drift → BI breakage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 — idempotency&lt;/td&gt;
&lt;td&gt;7-12&lt;/td&gt;
&lt;td&gt;retries → duplicates → drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 — lineage&lt;/td&gt;
&lt;td&gt;12-15&lt;/td&gt;
&lt;td&gt;"why is this row wrong" → forensic dead-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 — reconciliation&lt;/td&gt;
&lt;td&gt;15-20&lt;/td&gt;
&lt;td&gt;silent corruption → trust loss&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1 anchors the conversation in business semantics&lt;/strong&gt; — grain is the foundation; getting it right makes Steps 2-5 simpler, getting it wrong makes them all moot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2 turns architecture into ownership&lt;/strong&gt; — naming the layer boundary makes it easy to talk about who reads what, who's allowed to break what, and what notice consumers get.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 3 prevents the most common production incident&lt;/strong&gt; — non-idempotent loads are the #1 source of "duplicate row" bug reports; demonstrating the &lt;code&gt;MERGE&lt;/code&gt; + &lt;code&gt;QUALIFY ROW_NUMBER&lt;/code&gt; pattern signals senior fluency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 4 turns debugging from hours to minutes&lt;/strong&gt; — lineage columns are the difference between "I can fix this in 10 min" and "I'll get back to you tomorrow."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 5 is the operational backstop&lt;/strong&gt; — even with steps 1-4 done well, you need reconciliation to catch the failures you didn't anticipate; the gate-before-promotion pattern blocks drift before consumers see it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The template's value compounds&lt;/strong&gt; — each step makes the next one easier, and skipping any step weakens the foundation that the later steps build on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice problems&lt;/a&gt; for end-to-end pipeline design and the &lt;a href="https://pipecode.ai/explore/practice/language/data-modeling" rel="noopener noreferrer"&gt;data modeling practice page&lt;/a&gt; for grain and dimensional patterns. Course: &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  Tips to crack data lake architecture interviews
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Master the four primitives — zones, ingest flow, pattern selection, answer template
&lt;/h3&gt;

&lt;p&gt;If you can draw the bronze/silver/gold zones with ownership labels, walk the ingest → catalog → compute flow without skipping the catalog, articulate when a lakehouse beats a warehouse and when it doesn't, and structure your answer using the five-step grain → zones → idempotency → lineage → reconciliation template — you can clear most data-engineering system-design rounds. The rest is dialect-specific (Spark vs Snowflake idioms, Iceberg vs Delta semantics) and behavioral.&lt;/p&gt;

&lt;h3&gt;
  
  
  Always state grain in the first sentence
&lt;/h3&gt;

&lt;p&gt;Before drawing any boxes, name the grain: "this is order-line grain" or "this is user-day grain". Every wrong answer in a system-design round can be traced back to a grain ambiguity that nobody named. Stating grain explicitly costs five seconds and saves the entire round.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pick Iceberg unless you have a reason not to
&lt;/h3&gt;

&lt;p&gt;Iceberg has the broadest engine support (Spark, Trino, Snowflake, BigQuery, Dremio, Athena) and is the most engine-agnostic of the three open table formats. Pick Delta if your stack is Databricks-centric and Spark-only. Pick Hudi if your workload is upsert-heavy CDC. State the choice and the reason out loud — "I'd pick Iceberg for engine portability" — interviewers grade the reasoning more than the choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Treat idempotency as table stakes, not advanced
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;MERGE&lt;/code&gt; instead of &lt;code&gt;INSERT&lt;/code&gt;, partition-overwrite for backfill, and pure transformations whose output depends only on inputs (never &lt;code&gt;NOW()&lt;/code&gt; or random) — these are baseline expectations, not advanced techniques. If you forget to mention idempotency in a system-design round, the interviewer will assume you have not shipped a production pipeline.&lt;/p&gt;
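&lt;p&gt;One way to keep a transformation pure, shown as a hedged Python sketch: derive the load timestamp from the batch's logical run date rather than the wall clock (the function names here are illustrative):&lt;/p&gt;

```python
from datetime import datetime, timezone

def transform_bad(row):
    """Not idempotent: output depends on wall-clock time, so re-runs produce different rows."""
    out = dict(row)
    out["loaded_at"] = datetime.now(timezone.utc).isoformat()
    return out

def transform_good(row, batch_ts):
    """Idempotent: output depends only on inputs (the row plus the batch's logical timestamp)."""
    out = dict(row)
    out["loaded_at"] = batch_ts
    return out

row = {"order_id": 448, "amount": 999}
a = transform_good(row, "2026-04-13T02:00:00Z")
b = transform_good(row, "2026-04-13T02:00:00Z")
# a == b: re-running the same batch produces identical rows
```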

&lt;h3&gt;
  
  
  Use Spark for batch, Trino for interactive, DuckDB for ad-hoc
&lt;/h3&gt;

&lt;p&gt;Spark dominates batch + streaming with the richest connector ecosystem; Trino dominates federated interactive SQL across many catalogs; DuckDB is rising fast for single-node ad-hoc analytics under 1TB. Naming the right tool for the workload (without over-explaining) signals breadth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reconciliation is what separates "we trust the lake" from "we hope the lake is right"
&lt;/h3&gt;

&lt;p&gt;Always include a reconciliation step that compares aggregate metrics between the lake and the source system, alerts on drift above a tolerance, and blocks gold promotion until drift is investigated. The five seconds it takes to mention reconciliation is the difference between a senior signal and a mid-level signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for medallion-zone and end-to-end pipeline problems. Drill the related topic pages: &lt;a href="https://pipecode.ai/explore/practice/topic/streaming" rel="noopener noreferrer"&gt;streaming&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;dimensional modeling&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/language/data-modeling" rel="noopener noreferrer"&gt;data modeling practice&lt;/a&gt;. The interview-first courses page bundles structured curricula — start with &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling for Data Engineering Interviews&lt;/a&gt;, or &lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;PySpark Fundamentals&lt;/a&gt;. For broader coverage, &lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;browse by topic&lt;/a&gt; or read the &lt;a href="https://pipecode.ai/blogs/sql-interview-questions-for-data-engineering" rel="noopener noreferrer"&gt;SQL interview questions for data engineering&lt;/a&gt; and &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top data engineering interview questions 2026&lt;/a&gt; blogs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is data lake architecture?
&lt;/h3&gt;

&lt;p&gt;Data lake architecture is the set of conventions — &lt;strong&gt;layered zones (bronze/silver/gold), an ingest → catalog → compute flow on object storage, an open table format for ACID semantics, and disciplined ownership and quality contracts&lt;/strong&gt; — that turn raw object storage into a trustworthy analytics platform. Without these conventions, a "data lake" devolves into a data swamp where nobody can trust the numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between a data lake, a data warehouse, and a lakehouse?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;data lake&lt;/strong&gt; is cheap, flexible object storage with file-based reads and no built-in ACID; a &lt;strong&gt;cloud warehouse&lt;/strong&gt; (Snowflake, BigQuery, Redshift) is a managed system with proprietary storage, full ACID, and SQL-first ergonomics; a &lt;strong&gt;lakehouse&lt;/strong&gt; is a lake plus an open table format (Iceberg, Delta Lake, Hudi) that adds ACID, time travel, schema evolution, and concurrent writers — bringing warehouse-like semantics to object storage. Most organizations run a hybrid: lake/lakehouse for high-volume + ML workloads, warehouse for curated SQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are bronze, silver, and gold layers?
&lt;/h3&gt;

&lt;p&gt;Bronze (or landing/raw) is an &lt;strong&gt;append-only mirror&lt;/strong&gt; of source payloads with minimal transformation. Silver (or refined/conformed) applies &lt;strong&gt;dedup, type coercion, and conformed business keys&lt;/strong&gt;; this is the source of truth for downstream applications. Gold (or curated/consumption) publishes &lt;strong&gt;subject-area marts and star-schema fact + dim tables&lt;/strong&gt; for analyst SQL and BI dashboards. The names vary across vendors but the three-tier shape is universal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need Iceberg, Delta Lake, or Hudi for every project?
&lt;/h3&gt;

&lt;p&gt;No. Small teams can start with well-partitioned &lt;strong&gt;Parquet&lt;/strong&gt; and strict naming conventions. Reach for an open table format when you need &lt;strong&gt;ACID transactions, concurrent writers, partition evolution, time travel, or simpler upserts and deletes&lt;/strong&gt;. Pick &lt;strong&gt;Iceberg&lt;/strong&gt; for the broadest engine support, &lt;strong&gt;Delta&lt;/strong&gt; for the tightest Databricks/Spark integration, and &lt;strong&gt;Hudi&lt;/strong&gt; for upsert-heavy CDC workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the small-file problem?
&lt;/h3&gt;

&lt;p&gt;When a lake table accumulates millions of small files (e.g., 5KB each from frequent micro-batch writes), query planning spends more time &lt;strong&gt;listing files in the catalog and metastore&lt;/strong&gt; than actually scanning data — a Spark or Trino query that should take 500ms can take 50 seconds. The fix is &lt;strong&gt;scheduled compaction jobs&lt;/strong&gt; that rewrite many small files into fewer 128MB-1GB files, plus targeting larger micro-batch sizes upstream.&lt;/p&gt;
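&lt;p&gt;Compaction is essentially bin-packing small files into target-size outputs. A simplified Python sketch of the planning step (sizes in bytes; the ~128MB target comes from the text above, while the greedy planner itself is an illustrative assumption — real engines like Spark or Iceberg ship their own compaction procedures):&lt;/p&gt;

```python
TARGET = 128 * 1024 * 1024  # aim for ~128MB output files

def plan_compaction(file_sizes, target=TARGET):
    """Greedily group small files into batches whose total size approaches the target.
    Each batch would then be rewritten as one larger file by the compaction job."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 10,000 micro-batch files of ~5KB (about 50MB total) collapse into a single output file
plan = plan_compaction([5 * 1024] * 10_000)
```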

&lt;h3&gt;
  
  
  How do I handle schema evolution in a data lake?
&lt;/h3&gt;

&lt;p&gt;Open table formats handle schema evolution gracefully — adding a column or widening a type is a single metadata commit. Without a table format, schema evolution requires &lt;strong&gt;rewriting partitions&lt;/strong&gt; or carrying a column-version field on every row. Either way, the silver layer should be the &lt;strong&gt;schema-stability boundary&lt;/strong&gt;: bronze accepts whatever the source sends, silver enforces a canonical schema, and changes to silver require explicit consumer notice (typically 30 days).&lt;/p&gt;
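&lt;p&gt;Carrying a version field and upgrading rows to the canonical silver schema can be sketched as a chain of per-version migrations. The field names and the specific migrations below are illustrative assumptions:&lt;/p&gt;

```python
def upgrade_v1_to_v2(row):
    """v2 renamed 'amt' to 'amount' (illustrative migration)."""
    row = dict(row)
    row["amount"] = row.pop("amt")
    row["schema_version"] = 2
    return row

def upgrade_v2_to_v3(row):
    """v3 added an optional 'currency' column, defaulted to USD (illustrative migration)."""
    row = dict(row)
    row.setdefault("currency", "USD")
    row["schema_version"] = 3
    return row

MIGRATIONS = {1: upgrade_v1_to_v2, 2: upgrade_v2_to_v3}

def to_canonical(row, target=3):
    """Apply migrations in order until the row reaches the canonical silver schema version."""
    while row.get("schema_version", 1) != target:
        row = MIGRATIONS[row.get("schema_version", 1)](row)
    return row

canonical = to_canonical({"order_id": 448, "amt": 999, "schema_version": 1})
```

&lt;p&gt;Bronze keeps the original payload untouched; only the silver writer runs this upgrade chain, which is what makes silver the schema-stability boundary.&lt;/p&gt;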

&lt;h3&gt;
  
  
  How does this connect to data engineering interviews on PipeCode?
&lt;/h3&gt;

&lt;p&gt;System-design questions still reduce to &lt;strong&gt;SQL queries, Python data transforms, and dimensional modeling decisions&lt;/strong&gt;. PipeCode focuses on those signals with &lt;strong&gt;450+&lt;/strong&gt; problems — drill SQL aggregations and joins, Python pipeline patterns, and dimensional models, then layer on system-design depth via the courses. Use &lt;a href="https://pipecode.ai/explore/practice" rel="noopener noreferrer"&gt;Practice&lt;/a&gt; once you can draw the medallion zones and the ingest → catalog → compute flow confidently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing data lake architecture problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Engineering Roadmap for Freshers (2026): A 13-Step Beginner's Guide from SQL to Your First Data Engineering Job</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Mon, 11 May 2026 03:17:57 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/data-engineering-roadmap-for-freshers-2026-a-13-step-beginners-guide-from-sql-to-your-first-4b51</link>
      <guid>https://dev.to/gowthampotureddi/data-engineering-roadmap-for-freshers-2026-a-13-step-beginners-guide-from-sql-to-your-first-4b51</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data engineering&lt;/strong&gt; is one of the fastest-growing tech careers in 2026. Companies collect huge amounts of data every day, and &lt;strong&gt;data engineers&lt;/strong&gt; build the systems that &lt;strong&gt;collect, clean, transform, store, and deliver&lt;/strong&gt; that data so analysts, scientists, and product teams can use it. If you're a fresher and confused about where to start, this &lt;strong&gt;data engineering roadmap for freshers&lt;/strong&gt; lays out a clear, ordered 13-step path — what to learn first, what to learn next, what to build, and how to prove the work to a recruiter.&lt;/p&gt;

&lt;p&gt;This guide is a beginner-first walkthrough for &lt;strong&gt;how to become a data engineer in 2026&lt;/strong&gt; without a CS degree, three certificates, or a Spark cluster on day one. The 13 steps are grouped into five learning blocks below, each with a tiny worked example you can run on your laptop. Most freshers fail because they jump to Spark too early, ignore SQL depth, avoid projects, or watch tutorials without practising — the roadmap below fixes all four. Examples use &lt;strong&gt;PostgreSQL&lt;/strong&gt; SQL (the dialect every coding-environment interview defaults to) and standard-library &lt;strong&gt;Python&lt;/strong&gt; so you can run everything on a laptop without setup overhead. Default plan: &lt;strong&gt;about 6–9 months at 10–15 hours per week&lt;/strong&gt; to be job-ready, &lt;strong&gt;9–12 months at 6–8 hours per week&lt;/strong&gt; for working learners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0w43zecdwk9uoxiax9k.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0w43zecdwk9uoxiax9k.jpeg" alt="Bold 2026 data engineering roadmap header for freshers — SQL, Python, modeling, ETL on a dark purple background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Step 1 — Master SQL: The Most Important Skill for a Data Engineer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SQL fundamentals, joins, aggregations, window functions, and the queries you'll write every day
&lt;/h3&gt;

&lt;p&gt;SQL is the &lt;strong&gt;foundation of data engineering&lt;/strong&gt; — you'll write it daily for querying, cleaning, transforming, joining datasets, building reports, and writing ETL logic. Master SQL first; everything else becomes easier.&lt;/p&gt;

&lt;p&gt;The five SQL skill clusters every fresher needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Basics&lt;/strong&gt; — &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, &lt;code&gt;LIMIT&lt;/code&gt;, &lt;code&gt;DISTINCT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregations&lt;/strong&gt; — &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;, &lt;code&gt;GROUP BY&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joins&lt;/strong&gt; — &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, &lt;code&gt;RIGHT&lt;/code&gt;, &lt;code&gt;FULL&lt;/code&gt;, &lt;code&gt;SELF&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window functions&lt;/strong&gt; — &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced&lt;/strong&gt; — CTEs, subqueries, &lt;code&gt;CASE&lt;/code&gt;, &lt;code&gt;NULL&lt;/code&gt; handling, date functions, indexes, query optimisation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqhu41ryew4f52c8oyn4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqhu41ryew4f52c8oyn4.jpeg" alt="Phase timeline table showing the four-phase data engineering roadmap for freshers — weeks 1-26 with one shippable proof per phase." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; SQL is non-negotiable. Drill it daily on a free practice platform (DataLemur, LeetCode SQL, StrataScratch, HackerRank SQL). Most fresher rejections at the SQL screen are not from missing syntax — they are from joining at the wrong grain or putting an aggregate in the wrong clause.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  SQL basics — &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, &lt;code&gt;LIMIT&lt;/code&gt;, &lt;code&gt;DISTINCT&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The bedrock SQL shape: &lt;code&gt;SELECT cols FROM table WHERE row_filter ORDER BY col DESC LIMIT N&lt;/code&gt;. That one query covers most "show me the top X by Y" prompts you'll ever write.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELECT cols FROM table&lt;/code&gt;&lt;/strong&gt; — pick the columns you actually need; never &lt;code&gt;SELECT *&lt;/code&gt; in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE filter&lt;/code&gt;&lt;/strong&gt; — row-level predicate; runs before grouping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY col DESC&lt;/code&gt;&lt;/strong&gt; — sort the result; &lt;code&gt;ASC&lt;/code&gt; is default, &lt;code&gt;DESC&lt;/code&gt; is biggest-first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT N&lt;/code&gt;&lt;/strong&gt; — keep only the top &lt;code&gt;N&lt;/code&gt; rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DISTINCT col&lt;/code&gt;&lt;/strong&gt; — drop duplicate values so each distinct value appears only once in the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A 4-row &lt;code&gt;employees&lt;/code&gt; table with &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;salary&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;70000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;45000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Return the names and salaries of employees who earn more than 50,000, sorted from highest to lowest salary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;WHERE salary &amp;gt; 50000&lt;/code&gt; runs first and drops Bob (45000). The remaining three rows are then sorted by salary in descending order, so Carol (highest) comes first, Alice second, Dan third. No &lt;code&gt;LIMIT&lt;/code&gt;, so all three qualifying rows are returned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;clause&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FROM employees&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;scan all 4 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE salary &amp;gt; 50000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;drop Bob (45000); 3 rows left&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ORDER BY salary DESC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;sort: Carol (90000) → Alice (70000) → Dan (55000)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT name, salary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;project the two named columns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;70000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always name the columns in the &lt;code&gt;SELECT&lt;/code&gt;; &lt;code&gt;SELECT *&lt;/code&gt; outside an exploratory REPL is a code smell.&lt;/p&gt;
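&lt;p&gt;The clause order is easy to verify on your laptop with Python's bundled &lt;code&gt;sqlite3&lt;/code&gt; module. This is a sketch, not production code: the rows are the toy table above, and the query text is the exact solution (this standard-SQL subset runs unchanged on SQLite):&lt;/p&gt;

```python
import sqlite3

# In-memory database seeded with the 4-row toy employees table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", 70000), ("Bob", 45000), ("Carol", 90000), ("Dan", 55000)],
)

# WHERE filters rows first, then ORDER BY sorts the survivors.
rows = conn.execute(
    "SELECT name, salary FROM employees "
    "WHERE salary > 50000 ORDER BY salary DESC"
).fetchall()
print(rows)  # [('Carol', 90000), ('Alice', 70000), ('Dan', 55000)]
```

&lt;p&gt;Changing the threshold or the sort direction and re-running is the fastest way to internalise which clause runs when.&lt;/p&gt;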

&lt;h4&gt;
  
  
  Aggregations — &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The aggregation shape: &lt;code&gt;SELECT dim, AGG(col) FROM table GROUP BY dim HAVING AGG_filter&lt;/code&gt;. &lt;code&gt;GROUP BY&lt;/code&gt; collapses many rows to one row per group; &lt;code&gt;HAVING&lt;/code&gt; filters the resulting groups (you cannot put an aggregate in &lt;code&gt;WHERE&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — number of rows per group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(col)&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;AVG(col)&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;MIN(col)&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;MAX(col)&lt;/code&gt;&lt;/strong&gt; — collapse a numeric column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY dim&lt;/code&gt;&lt;/strong&gt; — one output row per distinct value of &lt;code&gt;dim&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING AGG &amp;gt; N&lt;/code&gt;&lt;/strong&gt; — keep only groups whose aggregate exceeds &lt;code&gt;N&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A 6-row &lt;code&gt;employees&lt;/code&gt; table with &lt;code&gt;department&lt;/code&gt; and &lt;code&gt;salary&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;50000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;65000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;60000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Return the average salary per department, but only show departments whose average exceeds 60,000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;GROUP BY department&lt;/code&gt; collapses the six rows into three groups — Engineering, Sales, Marketing. &lt;code&gt;AVG(salary)&lt;/code&gt; computes the per-group average: Engineering 85000, Sales 52500, Marketing 62500. &lt;code&gt;HAVING AVG(salary) &amp;gt; 60000&lt;/code&gt; then drops Sales (52500 fails the threshold) and keeps the other two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;clause&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FROM employees&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;scan all 6 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GROUP BY department&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3 groups — Engineering (2 rows), Sales (2 rows), Marketing (2 rows)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AVG(salary)&lt;/code&gt; per group&lt;/td&gt;
&lt;td&gt;Engineering 85000, Sales 52500, Marketing 62500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HAVING AVG(salary) &amp;gt; 60000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;drop Sales (52500 fails); keep Engineering + Marketing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT department, AVG(salary) AS avg_salary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;project the 2 surviving rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;avg_salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;85000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;62500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; row predicates → &lt;code&gt;WHERE&lt;/code&gt;; aggregate predicates → &lt;code&gt;HAVING&lt;/code&gt;. Putting &lt;code&gt;AVG &amp;gt; X&lt;/code&gt; in &lt;code&gt;WHERE&lt;/code&gt; fails: PostgreSQL rejects it with "aggregate functions are not allowed in WHERE".&lt;/p&gt;
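&lt;p&gt;A runnable check with the same six toy rows, again via the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module (a sketch; SQLite's &lt;code&gt;AVG&lt;/code&gt; returns a float where PostgreSQL returns &lt;code&gt;numeric&lt;/code&gt;, but the grouping logic is identical):&lt;/p&gt;

```python
import sqlite3

# In-memory database seeded with the 6-row toy employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "Engineering", 90000), ("Bob", "Engineering", 80000),
     ("Carol", "Sales", 50000), ("Dan", "Sales", 55000),
     ("Eve", "Marketing", 65000), ("Frank", "Marketing", 60000)],
)

# GROUP BY collapses rows to one per department; HAVING filters the groups.
# ORDER BY is added here only to make the output deterministic.
rows = conn.execute(
    "SELECT department, AVG(salary) AS avg_salary "
    "FROM employees GROUP BY department "
    "HAVING AVG(salary) > 60000 ORDER BY department"
).fetchall()
print(rows)  # [('Engineering', 85000.0), ('Marketing', 62500.0)]
```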

&lt;h4&gt;
  
  
  Joins — connecting tables on a common key
&lt;/h4&gt;

&lt;p&gt;Joins combine columns from two tables on a matching key. The four every fresher needs: &lt;strong&gt;&lt;code&gt;INNER&lt;/code&gt;&lt;/strong&gt; (only matched rows survive), &lt;strong&gt;&lt;code&gt;LEFT&lt;/code&gt;&lt;/strong&gt; (all rows from the left table, even unmatched), &lt;strong&gt;&lt;code&gt;RIGHT&lt;/code&gt;&lt;/strong&gt; (mirror of LEFT, rarely used), &lt;strong&gt;&lt;code&gt;FULL&lt;/code&gt;&lt;/strong&gt; (all rows from both sides). &lt;code&gt;SELF JOIN&lt;/code&gt; joins a table to itself for hierarchies (manager / employee, parent / child).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt;&lt;/strong&gt; — strict match on both sides.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt;&lt;/strong&gt; — keep every left row; &lt;code&gt;NULL&lt;/code&gt; on the right when no match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RIGHT JOIN&lt;/code&gt;&lt;/strong&gt; — same as LEFT with sides swapped; usually rewrite as LEFT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FULL JOIN&lt;/code&gt;&lt;/strong&gt; — keep every row from both sides; useful for reconciliation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELF JOIN&lt;/code&gt;&lt;/strong&gt; — alias the same table twice (&lt;code&gt;employees a JOIN employees b ON a.manager_id = b.id&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; An &lt;code&gt;orders&lt;/code&gt; table and a &lt;code&gt;customers&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;customers&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Return one row per order showing &lt;code&gt;order_id&lt;/code&gt; and the matching &lt;code&gt;customer_name&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The &lt;code&gt;INNER JOIN&lt;/code&gt; (the default form when you just write &lt;code&gt;JOIN&lt;/code&gt;) matches each order to its customer using &lt;code&gt;customer_id&lt;/code&gt;. Order 101 → Alice, order 102 → Bob, order 103 → Alice. All three orders have a matching customer, so every order survives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;scan &lt;code&gt;orders&lt;/code&gt; (left side)&lt;/td&gt;
&lt;td&gt;3 rows: 101→C1, 102→C2, 103→C1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;for each row, look up &lt;code&gt;customer_id&lt;/code&gt; in &lt;code&gt;customers&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;C1→Alice (twice), C2→Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;INNER JOIN&lt;/code&gt; keeps only matched pairs&lt;/td&gt;
&lt;td&gt;all 3 orders matched, 0 dropped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT o.order_id, c.customer_name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;project the 2 named columns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always give every table a short alias (&lt;code&gt;o&lt;/code&gt;, &lt;code&gt;c&lt;/code&gt;) and prefix every column (&lt;code&gt;o.order_id&lt;/code&gt;, &lt;code&gt;c.customer_name&lt;/code&gt;) — the SQL becomes self-documenting.&lt;/p&gt;
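&lt;p&gt;To see the &lt;code&gt;INNER&lt;/code&gt; / &lt;code&gt;LEFT&lt;/code&gt; difference concretely, the sketch below re-creates the two toy tables and adds one extra, hypothetical order (&lt;code&gt;104 → C3&lt;/code&gt;) with no matching customer; it is not part of the example above and exists only to show dropped-versus-&lt;code&gt;NULL&lt;/code&gt; behaviour:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id TEXT);
CREATE TABLE customers (customer_id TEXT, customer_name TEXT);
INSERT INTO orders VALUES (101, 'C1'), (102, 'C2'), (103, 'C1'),
                          (104, 'C3');  -- hypothetical unmatched order
INSERT INTO customers VALUES ('C1', 'Alice'), ('C2', 'Bob');
""")

# INNER JOIN: only matched pairs survive, so order 104 disappears.
inner = conn.execute(
    "SELECT o.order_id, c.customer_name FROM orders o "
    "JOIN customers c ON o.customer_id = c.customer_id ORDER BY o.order_id"
).fetchall()

# LEFT JOIN: every left row survives; an unmatched right side becomes NULL.
left = conn.execute(
    "SELECT o.order_id, c.customer_name FROM orders o "
    "LEFT JOIN customers c ON o.customer_id = c.customer_id ORDER BY o.order_id"
).fetchall()

print(inner)  # [(101, 'Alice'), (102, 'Bob'), (103, 'Alice')]
print(left)   # [(101, 'Alice'), (102, 'Bob'), (103, 'Alice'), (104, None)]
```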

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;SELECT *&lt;/code&gt; everywhere — production queries always name the columns.&lt;/li&gt;
&lt;li&gt;Putting an aggregate in &lt;code&gt;WHERE&lt;/code&gt; instead of &lt;code&gt;HAVING&lt;/code&gt; — PostgreSQL rejects the query (aggregates are not allowed in &lt;code&gt;WHERE&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Joining at the wrong grain (one-to-many without thinking) — the #1 source of "the number is suddenly 3× too high" bugs.&lt;/li&gt;
&lt;li&gt;Memorising syntax without internalising &lt;strong&gt;which side keeps its rows&lt;/strong&gt; in a &lt;code&gt;LEFT JOIN&lt;/code&gt; — the part that breaks numbers.&lt;/li&gt;
&lt;li&gt;Skipping window functions because they "look hard" — interviewers love them; they take a week to learn.&lt;/li&gt;
&lt;/ul&gt;
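&lt;p&gt;The wrong-grain join mistake above is worth reproducing once so it sticks. The tables below are hypothetical, not from this article: &lt;code&gt;customers&lt;/code&gt; carries a one-per-customer amount, and joining it to many-per-customer &lt;code&gt;orders&lt;/code&gt; before summing silently inflates the total:&lt;/p&gt;

```python
import sqlite3

# Hypothetical tables to reproduce the wrong-grain bug: customers carry a
# one-per-customer amount, orders are many-per-customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT, credit_limit INTEGER);
CREATE TABLE orders (order_id INTEGER, customer_id TEXT);
INSERT INTO customers VALUES ('C1', 1000), ('C2', 2000);
INSERT INTO orders VALUES (101, 'C1'), (102, 'C1'), (103, 'C1'), (104, 'C2');
""")

correct = conn.execute("SELECT SUM(credit_limit) FROM customers").fetchone()[0]

# Joining to orders duplicates C1's customer row 3 times before the SUM runs.
inflated = conn.execute(
    "SELECT SUM(c.credit_limit) FROM customers c "
    "JOIN orders o ON o.customer_id = c.customer_id"
).fetchone()[0]

print(correct, inflated)  # 3000 5000
```

&lt;p&gt;The fix is to aggregate at the correct grain first (or sum before joining), then join the rolled-up result.&lt;/p&gt;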

&lt;h3&gt;
  
  
  Worked Problem on Ranking Top Earners per Department with Window Functions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A 6-row &lt;code&gt;employees&lt;/code&gt; table mixing departments and salaries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;50000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;65000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;60000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Rank each employee by salary &lt;strong&gt;within their department&lt;/strong&gt; (highest = rank 1) and return only the &lt;strong&gt;top earner per department&lt;/strong&gt;. Use a window function — pure &lt;code&gt;GROUP BY&lt;/code&gt; cannot keep both the rank and the row's other columns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
            &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)&lt;/code&gt; assigns a strict 1, 2, 3 sequence within each department, ordered by salary from highest to lowest. The outer &lt;code&gt;WHERE rank = 1&lt;/code&gt; keeps only the top-paid row per department. The wrapping subquery is needed because PostgreSQL evaluates window functions &lt;em&gt;after&lt;/em&gt; &lt;code&gt;WHERE&lt;/code&gt;, so we cannot filter &lt;code&gt;rank = 1&lt;/code&gt; in the same level where we compute it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;65000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the input rows above:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;65000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;60000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;50000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After &lt;code&gt;WHERE rank = 1&lt;/code&gt;: three rows — one per department, the top earner.&lt;/p&gt;
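&lt;p&gt;The whole query also runs locally. A sketch with the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module (window functions need SQLite 3.25+, which ships with Python 3.8+ on most platforms; the alias is &lt;code&gt;rn&lt;/code&gt; here instead of &lt;code&gt;rank&lt;/code&gt; purely to avoid keyword confusion):&lt;/p&gt;

```python
import sqlite3

# Seed the 6-row toy table, then rank within each department.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "Engineering", 90000), ("Bob", "Engineering", 80000),
     ("Carol", "Sales", 50000), ("Dan", "Sales", 55000),
     ("Eve", "Marketing", 65000), ("Frank", "Marketing", 60000)],
)

rows = conn.execute("""
    SELECT department, name, salary FROM (
        SELECT department, name, salary,
               ROW_NUMBER() OVER (
                   PARTITION BY department ORDER BY salary DESC
               ) AS rn
        FROM employees
    ) t
    WHERE rn = 1
    ORDER BY department
""").fetchall()
print(rows)
# [('Engineering', 'Alice', 90000), ('Marketing', 'Eve', 65000),
#  ('Sales', 'Dan', 55000)]
```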

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY department&lt;/code&gt;&lt;/strong&gt; — defines the group inside which the ranking happens; without it, the rank would be global across all employees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY salary DESC&lt;/code&gt;&lt;/strong&gt; — descending so rank 1 is the highest-paid; ascending would give the lowest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt; not &lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; — a strict 1, 2, 3 sequence even on ties, so there is exactly one rank-1 row (and exactly one output row) per department, which is what "top earner" demands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outer &lt;code&gt;WHERE rank = 1&lt;/code&gt; filter&lt;/strong&gt; — PostgreSQL cannot filter window-function output at the same query level where it is computed; the wrapping subquery is required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(N log N)&lt;/code&gt; from the partitioned sort; with an index on &lt;code&gt;(department, salary DESC)&lt;/code&gt; this becomes &lt;code&gt;O(N)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; for short curated reps; the structured path for fresher SQL is &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL — window functions: &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;SQL window-function problems →&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;SQL — joins: &lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;SQL join problems →&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;SQL — aggregation: &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation" rel="noopener noreferrer"&gt;SQL aggregation problems →&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Step 2 — Learn Python for Data Engineering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Python, file handling, Pandas, and the API requests every DE writes
&lt;/h3&gt;

&lt;p&gt;Python is the &lt;strong&gt;glue language&lt;/strong&gt; for everything outside the database — ETL scripts, automation, data pipelines, API integrations, transformations. You don't need to be a Python wizard; you need to be fluent at reading CSVs, calling APIs, transforming data with Pandas, and writing small testable functions.&lt;/p&gt;

&lt;p&gt;Three Python skill clusters every fresher needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core Python&lt;/strong&gt; — variables, loops, functions, lists / dicts / sets, classes, exception handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File handling&lt;/strong&gt; — read and write CSV, JSON, and Excel files using the standard library and Pandas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Libraries&lt;/strong&gt; — &lt;strong&gt;Pandas&lt;/strong&gt; for data transformation; &lt;strong&gt;Requests&lt;/strong&gt; for API calls; &lt;strong&gt;PySpark&lt;/strong&gt; later (Step 6) for big-data processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8oc72hmmtznwd591vwk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8oc72hmmtznwd591vwk.jpeg" alt="Diagram of what a data engineer actually does — sources, pipelines, warehouse, consumers — with the data engineer owning the middle two stages." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; the 10% of Python you actually use day-to-day is &lt;code&gt;csv&lt;/code&gt;, &lt;code&gt;json&lt;/code&gt;, &lt;code&gt;pathlib&lt;/code&gt;, &lt;code&gt;collections&lt;/code&gt;, &lt;code&gt;dataclasses&lt;/code&gt;, &lt;code&gt;typing&lt;/code&gt;, and &lt;code&gt;pandas&lt;/code&gt;. Skip metaclasses, descriptors, and async event loops on day one — they're irrelevant to fresher DE work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Core Python — loops, lists, and small functions
&lt;/h4&gt;

&lt;p&gt;The fresher Python invariant: write small, testable functions that loop over lists and dicts. Type hints (&lt;code&gt;def f(x: int) -&amp;gt; int:&lt;/code&gt;) make a 2-month-old script readable when you come back to it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variables and types&lt;/strong&gt; — &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;float&lt;/code&gt;, &lt;code&gt;str&lt;/code&gt;, &lt;code&gt;bool&lt;/code&gt;, &lt;code&gt;None&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lists, dicts, sets&lt;/strong&gt; — ordered, key-value, unique-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loops&lt;/strong&gt; — &lt;code&gt;for x in xs:&lt;/code&gt; over iterables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functions&lt;/strong&gt; — single-responsibility; takes inputs, returns outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exception handling&lt;/strong&gt; — &lt;code&gt;try / except FileNotFoundError&lt;/code&gt; for fragile I/O.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A Python list of three integers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Multiply each number by 2 and print the result. Show the canonical &lt;code&gt;for&lt;/code&gt; loop pattern that every other Python data-engineering script will mirror.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;for num in data:&lt;/code&gt; walks the list one element at a time, binding the current value to &lt;code&gt;num&lt;/code&gt;. Inside the loop body, &lt;code&gt;num * 2&lt;/code&gt; doubles the value and &lt;code&gt;print(...)&lt;/code&gt; writes it to stdout. The pattern generalises directly to "for every row in this CSV, do something" — replace &lt;code&gt;data&lt;/code&gt; with &lt;code&gt;csv.DictReader(f)&lt;/code&gt; and you have an ETL skeleton.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;iteration&lt;/th&gt;
&lt;th&gt;&lt;code&gt;num&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;num * 2&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;stdout&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;end&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;loop exits when list is exhausted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2
4
6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your Python script grows past 100 lines and has zero functions, it's a notebook draft, not a script — refactor before sharing it.&lt;/p&gt;
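&lt;p&gt;The refactor that rule of thumb asks for is usually just extracting a small typed function. A minimal sketch (the name &lt;code&gt;double_all&lt;/code&gt; is invented for illustration), wrapping the doubling loop from the example above:&lt;/p&gt;

```python
# Minimal sketch: the doubling loop from above, refactored into a
# small typed, testable function. The name `double_all` is invented.
def double_all(values: list[int]) -> list[int]:
    """Return a new list with every value doubled."""
    return [value * 2 for value in values]

print(double_all([1, 2, 3]))  # [2, 4, 6]
```

&lt;p&gt;Because the function takes inputs and returns outputs instead of printing inside the loop, a test can assert directly on &lt;code&gt;double_all([1, 2, 3]) == [2, 4, 6]&lt;/code&gt;.&lt;/p&gt;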

&lt;h4&gt;
  
  
  File handling — reading CSV and JSON
&lt;/h4&gt;

&lt;p&gt;Most data-engineering Python is reading a file, transforming the contents, and writing the result somewhere. The standard library has &lt;code&gt;csv&lt;/code&gt; and &lt;code&gt;json&lt;/code&gt; modules that cover 90% of fresher needs; for anything richer reach for Pandas.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;open(path, encoding='utf-8')&lt;/code&gt;&lt;/strong&gt; — open a text file safely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;csv.DictReader(f)&lt;/code&gt;&lt;/strong&gt; — iterate CSV rows as dictionaries (column-name access).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;json.load(f)&lt;/code&gt;&lt;/strong&gt; — parse a JSON file into a Python dict / list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pathlib.Path('file.csv')&lt;/code&gt;&lt;/strong&gt; — modern path object; works on Windows, macOS, Linux.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A &lt;code&gt;data.json&lt;/code&gt; file containing one JSON object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"salary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;70000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Open &lt;code&gt;data.json&lt;/code&gt;, parse it into a Python dict, and print the parsed result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;with open("data.json") as f:&lt;/code&gt; opens the file safely (the &lt;code&gt;with&lt;/code&gt; block guarantees the file is closed when the block exits, even on error). &lt;code&gt;json.load(f)&lt;/code&gt; parses the file's contents into a Python object — a dict here because the JSON started with &lt;code&gt;{&lt;/code&gt;. Printing the dict shows the parsed data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;with open("data.json") as f&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;file handle &lt;code&gt;f&lt;/code&gt; opens in text mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;json.load(f)&lt;/code&gt; reads bytes&lt;/td&gt;
&lt;td&gt;parses JSON object → Python &lt;code&gt;dict&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;bind result to &lt;code&gt;data&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;data = {"name": "Alice", "salary": 70000}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;exit &lt;code&gt;with&lt;/code&gt; block&lt;/td&gt;
&lt;td&gt;file auto-closed (even on error)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print(data)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;dict printed to stdout&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'name': 'Alice', 'salary': 70000}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always use &lt;code&gt;with open(...)&lt;/code&gt; rather than the bare &lt;code&gt;open()&lt;/code&gt; call — it auto-closes the file and handles exceptions cleanly.&lt;/p&gt;
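&lt;p&gt;The &lt;code&gt;csv.DictReader&lt;/code&gt; bullet above deserves the same treatment as JSON. A hedged sketch: &lt;code&gt;StringIO&lt;/code&gt; stands in for a real file so the example is self-contained; with an actual file you would use &lt;code&gt;with open("sales.csv", encoding="utf-8") as f:&lt;/code&gt; instead.&lt;/p&gt;

```python
import csv
from io import StringIO

# StringIO stands in for an open file handle; the columns mirror the
# sales.csv used later in this guide.
raw = "order_id,region,amount\n1,North,100\n2,North,150\n3,South,80\n"

rows = list(csv.DictReader(StringIO(raw)))
print(rows[0]["region"])                         # column-name access: North
total = sum(int(row["amount"]) for row in rows)  # CSV values arrive as strings
print(total)                                     # 330
```

&lt;p&gt;Note the &lt;code&gt;int(...)&lt;/code&gt; cast: &lt;code&gt;DictReader&lt;/code&gt; yields every value as a string, so numeric columns must be converted before you aggregate.&lt;/p&gt;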

&lt;h4&gt;
  
  
  Pandas for tabular data — &lt;code&gt;read_csv&lt;/code&gt;, &lt;code&gt;groupby&lt;/code&gt;, &lt;code&gt;sum&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Pandas&lt;/strong&gt; is the Python library every DE uses for transforming tabular data. The three operations you'll do hundreds of times: read a CSV into a &lt;code&gt;DataFrame&lt;/code&gt;, group by one or more columns, aggregate with &lt;code&gt;sum&lt;/code&gt; / &lt;code&gt;mean&lt;/code&gt; / &lt;code&gt;count&lt;/code&gt;. &lt;strong&gt;Requests&lt;/strong&gt; is the API-call counterpart.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pd.read_csv('file.csv')&lt;/code&gt;&lt;/strong&gt; — read a CSV into a &lt;code&gt;DataFrame&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;df.groupby('col')&lt;/code&gt;&lt;/strong&gt; — group rows by a column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.sum()&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;.mean()&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;.count()&lt;/code&gt;&lt;/strong&gt; — aggregate the groups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;requests.get(url).json()&lt;/code&gt;&lt;/strong&gt; — fetch a URL and parse the JSON response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A &lt;code&gt;sales.csv&lt;/code&gt; file with 5 rows across two regions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Read &lt;code&gt;sales.csv&lt;/code&gt; into a Pandas DataFrame, group by &lt;code&gt;region&lt;/code&gt;, and print the sum of &lt;code&gt;amount&lt;/code&gt; per region.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;pd.read_csv("sales.csv")&lt;/code&gt; loads the entire CSV into a &lt;code&gt;DataFrame&lt;/code&gt;, with the first row treated as column headers. &lt;code&gt;df.groupby("region")&lt;/code&gt; produces a grouped object that buckets rows by region. &lt;code&gt;.sum()&lt;/code&gt; aggregates every numeric column within each bucket — here that's &lt;code&gt;order_id&lt;/code&gt; (sum of IDs, usually meaningless) and &lt;code&gt;amount&lt;/code&gt; (the metric we care about).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pd.read_csv("sales.csv")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DataFrame with 5 rows × 3 columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;df.groupby("region")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bucket rows: North = {1, 2, 5}; South = {3, 4}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.sum()&lt;/code&gt; per bucket&lt;/td&gt;
&lt;td&gt;North: order_id sum = 8, amount = 320 (100+150+70); South: order_id sum = 7, amount = 200 (80+120)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print(...)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;the two-row grouped frame prints to stdout&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        order_id  amount
region
North          8     320
South          7     200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; when the data fits in memory and you don't need a database, Pandas is faster than writing the SQL — but for anything past a few million rows, push the work back into SQL or PySpark.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Skipping type hints — code becomes unreadable in 2 months.&lt;/li&gt;
&lt;li&gt;Reading huge CSVs into Pandas without &lt;code&gt;chunksize&lt;/code&gt; — your laptop runs out of RAM.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;requests&lt;/code&gt; without a timeout — a hung API call freezes your script forever (&lt;code&gt;requests.get(url, timeout=10)&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Not handling &lt;code&gt;None&lt;/code&gt; / missing values — &lt;code&gt;int(None)&lt;/code&gt; crashes with a &lt;code&gt;TypeError&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writing 200-line scripts as one big block — break into &lt;code&gt;def&lt;/code&gt;-defined functions.&lt;/li&gt;
&lt;/ul&gt;
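&lt;p&gt;The &lt;code&gt;None&lt;/code&gt; / missing-value mistake is worth one concrete sketch. The &lt;code&gt;raw_rows&lt;/code&gt; data and the treat-missing-as-zero policy here are invented for illustration; a real pipeline might instead drop or flag such rows.&lt;/p&gt;

```python
# Invented sample rows: one clean value, one None, one empty string.
raw_rows = [{"amount": "100"}, {"amount": None}, {"amount": ""}]

def parse_amount(value) -> int:
    """Return the amount as an int, treating None and '' as 0."""
    if value is None or value == "":
        return 0  # int(None) raises TypeError; int("") raises ValueError
    return int(value)

total = sum(parse_amount(row["amount"]) for row in raw_rows)
print(total)  # 100
```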

&lt;h3&gt;
  
  
  Worked Problem on Building a CSV-to-Summary Python ETL Script
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A &lt;code&gt;sales.csv&lt;/code&gt; file with 5 rows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Build a small Python ETL script that reads &lt;code&gt;sales.csv&lt;/code&gt;, sums &lt;code&gt;amount&lt;/code&gt; per &lt;code&gt;region&lt;/code&gt;, writes the result to &lt;code&gt;summary.csv&lt;/code&gt;, and prints the count of rows processed. This is the canonical Phase-1 portfolio script every fresher should ship to GitHub.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Pandas + a writeable summary path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarise_sales&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarise_sales&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The function takes two &lt;code&gt;Path&lt;/code&gt; objects so it's testable — you can call it from a test with mock paths instead of hardcoding filenames. &lt;code&gt;pd.read_csv(input_path)&lt;/code&gt; loads the CSV, &lt;code&gt;groupby("region", as_index=False)["amount"].sum()&lt;/code&gt; produces a clean two-column summary (&lt;code&gt;as_index=False&lt;/code&gt; keeps &lt;code&gt;region&lt;/code&gt; as a column rather than becoming the index), and &lt;code&gt;to_csv(output_path, index=False)&lt;/code&gt; writes the summary back out without Pandas' default integer index column. The function returns the row count so the caller can log a clean status line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;summary.csv&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;320&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;stdout: &lt;code&gt;processed 5 rows&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the input rows above:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pd.read_csv("sales.csv")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DataFrame with 5 rows × 3 columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;df.groupby("region", as_index=False)["amount"].sum()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2-row summary DataFrame&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;summary.to_csv("summary.csv", index=False)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;file written to disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;return len(df)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;returns &lt;code&gt;5&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print(f"processed {rows} rows")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;stdout: &lt;code&gt;processed 5 rows&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Path&lt;/code&gt; objects for testable I/O&lt;/strong&gt; — paths are inputs, not hardcoded constants, so the function works with any source / destination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;groupby(..., as_index=False)&lt;/code&gt;&lt;/strong&gt; — keeps &lt;code&gt;region&lt;/code&gt; as a regular column instead of the DataFrame index; the resulting CSV reads naturally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;["amount"].sum()&lt;/code&gt;&lt;/strong&gt; — selects the metric column before aggregation; otherwise Pandas would also sum &lt;code&gt;order_id&lt;/code&gt;, which is meaningless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;to_csv(..., index=False)&lt;/code&gt;&lt;/strong&gt; — suppresses Pandas' default integer index column; the CSV has only the two columns you actually want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;return + print separation&lt;/strong&gt; — the function returns a value (good for tests); the caller decides whether to print it (good for scripts vs imports).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(N)&lt;/code&gt; where &lt;code&gt;N&lt;/code&gt; is the input row count; fits in memory up to a few million rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for fresher Python reps see &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice page&lt;/a&gt;; the structured path is &lt;a href="https://pipecode.ai/explore/courses/python-for-data-engineering-interviews-the-complete-fundamentals" rel="noopener noreferrer"&gt;Python for Data Engineering Interviews — Complete Fundamentals&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. Steps 3-5 — Databases, Data Warehousing, and ETL/ELT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How data is stored, modeled, and moved through pipelines
&lt;/h3&gt;

&lt;p&gt;Three closely related steps share one section because they answer the same question: &lt;em&gt;where does the data live, and how does it get there?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 3 — Databases.&lt;/strong&gt; Relational (PostgreSQL, MySQL) for transactional workloads; NoSQL (MongoDB, Cassandra, Redis) for specialised cases. Learn keys, normalisation, transactions, indexing, ACID.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 4 — Data Warehousing.&lt;/strong&gt; Snowflake, BigQuery, Redshift store analytics-ready data in &lt;strong&gt;fact tables&lt;/strong&gt; + &lt;strong&gt;dimension tables&lt;/strong&gt;, organised as a &lt;strong&gt;star schema&lt;/strong&gt; (fact in the middle, dimensions hanging off). Heavily asked in interviews.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 5 — ETL / ELT.&lt;/strong&gt; &lt;strong&gt;ETL&lt;/strong&gt; = Extract → Transform → Load (transform before loading). &lt;strong&gt;ELT&lt;/strong&gt; = Extract → Load → Transform (load raw, then transform inside the warehouse). Plus batch vs streaming pipelines, incremental loads, and CDC (change data capture).&lt;/li&gt;
&lt;/ul&gt;
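&lt;p&gt;The batch-ETL shape above can be sketched end to end with the standard library. This is a hedged illustration, not a production pattern: in-memory SQLite stands in for the warehouse, the table names and rows are invented, and DELETE-then-INSERT is one simple way to make a rerun idempotent (the safe-rerun pattern in the diagram below).&lt;/p&gt;

```python
import sqlite3

# Extract: raw rows land in a staging table (values invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [("North", 100), ("North", 150), ("South", 80)])

# Transform + Load: aggregate into a curated table. DELETE before
# INSERT makes the load idempotent: running it twice gives one result.
conn.execute("CREATE TABLE IF NOT EXISTS region_summary (region TEXT, total INTEGER)")
conn.execute("DELETE FROM region_summary")
conn.execute("INSERT INTO region_summary "
             "SELECT region, SUM(amount) FROM staging GROUP BY region")

print(conn.execute("SELECT * FROM region_summary ORDER BY region").fetchall())
# [('North', 250), ('South', 80)]
```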

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs7y2cc51ouqr2i8xr90.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs7y2cc51ouqr2i8xr90.jpeg" alt="ETL flow diagram for freshers — source CSV through staging table to a curated warehouse table with the safe-rerun (idempotent) pattern." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; the three databases worth installing for practice: &lt;strong&gt;PostgreSQL&lt;/strong&gt; (covers 90% of relational SQL you'll see at work), &lt;strong&gt;SQLite&lt;/strong&gt; (zero-setup local dev), and one &lt;strong&gt;NoSQL&lt;/strong&gt; (MongoDB is the friendliest). Skip Redis until you genuinely need a cache; skip Cassandra until you genuinely have wide-column data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Relational databases — tables, keys, normalisation, ACID
&lt;/h4&gt;

&lt;p&gt;Relational databases store data in &lt;strong&gt;tables&lt;/strong&gt; with &lt;strong&gt;primary keys&lt;/strong&gt; (one column uniquely identifies each row) and &lt;strong&gt;foreign keys&lt;/strong&gt; (a column in one table references the primary key of another). &lt;strong&gt;Normalisation&lt;/strong&gt; splits data so each fact lives in exactly one place — no duplication, no inconsistency. &lt;strong&gt;ACID&lt;/strong&gt; properties (Atomicity, Consistency, Isolation, Durability) guarantee that transactions either fully succeed or fully roll back.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary key&lt;/strong&gt; — uniquely identifies a row (&lt;code&gt;customer_id&lt;/code&gt; in &lt;code&gt;customers&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foreign key&lt;/strong&gt; — points to another table's primary key (&lt;code&gt;orders.customer_id → customers.customer_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalisation&lt;/strong&gt; — &lt;code&gt;1NF&lt;/code&gt; / &lt;code&gt;2NF&lt;/code&gt; / &lt;code&gt;3NF&lt;/code&gt; — split tables until each fact lives once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing&lt;/strong&gt; — speeds up lookups; trade-off is slower writes and extra storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACID transactions&lt;/strong&gt; — &lt;code&gt;BEGIN; … COMMIT;&lt;/code&gt; (or &lt;code&gt;ROLLBACK;&lt;/code&gt; on failure).&lt;/li&gt;
&lt;/ul&gt;
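&lt;p&gt;&lt;strong&gt;Try it locally.&lt;/strong&gt; The foreign-key behaviour in the bullets can be reproduced without installing PostgreSQL — a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (note SQLite only enforces foreign keys after &lt;code&gt;PRAGMA foreign_keys = ON&lt;/code&gt;; the table names mirror the example):&lt;/p&gt;

```python
import sqlite3

# In-memory database; SQLite needs foreign keys switched on per connection.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE customers (customer_id TEXT PRIMARY KEY, customer_name TEXT NOT NULL)"
)
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id TEXT NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    )
""")

conn.execute("INSERT INTO customers VALUES ('C1', 'Alice')")
conn.execute("INSERT INTO orders VALUES (101, 'C1', 100.0)")  # parent exists -> accepted

try:
    conn.execute("INSERT INTO orders VALUES (999, 'ZZ', 10.0)")  # no such customer
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)  # FOREIGN KEY constraint failed
```

The orphan row never lands — exactly the "FK rejects orphan &lt;code&gt;customer_id&lt;/code&gt;" guarantee from the table above.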

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A two-table relational design — &lt;code&gt;orders&lt;/code&gt; references &lt;code&gt;customers&lt;/code&gt; via the foreign key &lt;code&gt;customer_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;customers&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;orders&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write the &lt;code&gt;CREATE TABLE&lt;/code&gt; statements for &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; with proper primary keys and a foreign key from &lt;code&gt;orders&lt;/code&gt; to &lt;code&gt;customers&lt;/code&gt;. Then write a transactional &lt;code&gt;INSERT&lt;/code&gt; that adds a new customer plus their first order &lt;strong&gt;atomically&lt;/strong&gt; — both rows commit or neither does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;      &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'C3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Carol'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;104&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'C3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The &lt;code&gt;customers&lt;/code&gt; table declares &lt;code&gt;customer_id&lt;/code&gt; as &lt;code&gt;PRIMARY KEY&lt;/code&gt; (uniqueness + index automatically created). The &lt;code&gt;orders&lt;/code&gt; table's &lt;code&gt;customer_id&lt;/code&gt; is &lt;code&gt;REFERENCES customers(customer_id)&lt;/code&gt; — a foreign key that prevents you from inserting an order for a non-existent customer. The &lt;code&gt;BEGIN; … COMMIT;&lt;/code&gt; block makes both inserts a single &lt;strong&gt;atomic transaction&lt;/strong&gt;: if the second insert fails for any reason, PostgreSQL aborts the transaction and the first insert is discarded when it ends — the database never keeps half of the write, such as an order pointing to a missing customer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;statement&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CREATE TABLE customers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;empty table, &lt;code&gt;customer_id&lt;/code&gt; enforced unique&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CREATE TABLE orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;empty table; FK rejects orphan &lt;code&gt;customer_id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BEGIN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;open a transaction — changes are invisible until commit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;INSERT INTO customers VALUES ('C3', 'Carol')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;row staged; FK in &lt;code&gt;orders&lt;/code&gt; will accept C3 later&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;INSERT INTO orders VALUES (104, 'C3', 75)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;row staged; FK satisfied because C3 exists in-tx&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COMMIT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;both rows persisted atomically; on error, both rolled back&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the transaction commits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;td&gt;C3&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every multi-row write that has to be "all or nothing" goes inside a &lt;code&gt;BEGIN; … COMMIT;&lt;/code&gt; block — that's the entire point of a relational database.&lt;/p&gt;
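&lt;p&gt;&lt;strong&gt;Try it locally.&lt;/strong&gt; The all-or-nothing guarantee is easy to see with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; — a small sketch in which a duplicate-key error stands in for "any failure mid-transaction":&lt;/p&gt;

```python
import sqlite3

# isolation_level=None -> autocommit mode, so we control BEGIN/COMMIT by hand.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE customers (customer_id TEXT PRIMARY KEY, customer_name TEXT NOT NULL)"
)

conn.execute("BEGIN")
conn.execute("INSERT INTO customers VALUES ('C3', 'Carol')")
try:
    # Duplicate primary key -> this statement fails.
    conn.execute("INSERT INTO customers VALUES ('C3', 'Carol again')")
    conn.execute("COMMIT")
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")  # undo the first insert too

print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 0 — all or nothing
```

Both inserts vanish together: the table is empty after the rollback, never left with a half-applied write.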

&lt;h4&gt;
  
  
  Data warehousing — fact tables, dimension tables, star schema
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; stores analytics-ready data optimised for fast &lt;code&gt;SELECT&lt;/code&gt; queries (not for high-volume &lt;code&gt;INSERT&lt;/code&gt; / &lt;code&gt;UPDATE&lt;/code&gt;). The canonical model is the &lt;strong&gt;star schema&lt;/strong&gt; — one &lt;strong&gt;fact table&lt;/strong&gt; in the middle that records events (sales, clicks, logins) surrounded by &lt;strong&gt;dimension tables&lt;/strong&gt; that describe context (customers, products, dates). This pattern is tested heavily in interviews.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact table&lt;/strong&gt; — measures events; mostly numeric columns + foreign keys to dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension table&lt;/strong&gt; — descriptive context; mostly text columns (customer name, product category).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Star schema&lt;/strong&gt; — one fact in the centre, dimensions hanging off as star points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake schema&lt;/strong&gt; — dimensions further normalised into sub-dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning / clustering&lt;/strong&gt; — physical layout choices that speed up filtered queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A star-schema design for an e-commerce sales fact with three dimensions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fact_sales&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sale_id&lt;/th&gt;
&lt;th&gt;date_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S1&lt;/td&gt;
&lt;td&gt;20260501&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S2&lt;/td&gt;
&lt;td&gt;20260501&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;P2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;dim_customer&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;dim_product&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;product_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;Book&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2&lt;/td&gt;
&lt;td&gt;Headphones&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;dim_date&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date_id&lt;/th&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;month&lt;/th&gt;
&lt;th&gt;year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;20260501&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write a query that joins the fact to all three dimensions and returns &lt;code&gt;customer_name&lt;/code&gt;, &lt;code&gt;product_name&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, and &lt;code&gt;amount&lt;/code&gt; for every sale. This is the canonical "fact + dim rollup" report every BI dashboard runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;  &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;     &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The fact table sits in the middle and is joined once to each dimension on the matching dimension key. Because each dimension has exactly one row per key, the joins do not multiply rows — the output has the same number of rows as &lt;code&gt;fact_sales&lt;/code&gt;. The &lt;code&gt;SELECT&lt;/code&gt; then pulls the descriptive columns from the dimensions plus the &lt;code&gt;amount&lt;/code&gt; from the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;scan &lt;code&gt;fact_sales&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;2 rows (S1, S2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;join &lt;code&gt;dim_customer&lt;/code&gt; on &lt;code&gt;customer_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;S1 → Alice, S2 → Bob; row count unchanged (1:1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;join &lt;code&gt;dim_product&lt;/code&gt; on &lt;code&gt;product_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;S1 → Book, S2 → Headphones; row count unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;join &lt;code&gt;dim_date&lt;/code&gt; on &lt;code&gt;date_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;both rows pick up month=5; row count unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SELECT&lt;/code&gt; 4 projected columns&lt;/td&gt;
&lt;td&gt;final 2-row report&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;th&gt;product_name&lt;/th&gt;
&lt;th&gt;month&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Book&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Headphones&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; fact tables hold the &lt;em&gt;measure&lt;/em&gt;; dimensions hold the &lt;em&gt;context&lt;/em&gt;. If you can't tell whether a column belongs in the fact or the dim, ask "is this a number we'll aggregate, or text we'll group by?"&lt;/p&gt;
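&lt;p&gt;&lt;strong&gt;Try it locally.&lt;/strong&gt; The same star join runs unchanged on Python's built-in &lt;code&gt;sqlite3&lt;/code&gt;; the property worth verifying is that the output row count equals the fact row count (an &lt;code&gt;ORDER BY&lt;/code&gt; is added only to make the output deterministic):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Build the star: one fact table, three single-row-per-key dimensions.
conn.executescript("""
    CREATE TABLE fact_sales   (sale_id TEXT, date_id INT, customer_id TEXT,
                               product_id TEXT, amount REAL);
    CREATE TABLE dim_customer (customer_id TEXT PRIMARY KEY, customer_name TEXT);
    CREATE TABLE dim_product  (product_id TEXT PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_date     (date_id INT PRIMARY KEY, day INT, month INT, year INT);

    INSERT INTO fact_sales   VALUES ('S1', 20260501, 'C1', 'P1', 100),
                                    ('S2', 20260501, 'C2', 'P2', 200);
    INSERT INTO dim_customer VALUES ('C1', 'Alice'), ('C2', 'Bob');
    INSERT INTO dim_product  VALUES ('P1', 'Book'), ('P2', 'Headphones');
    INSERT INTO dim_date     VALUES (20260501, 1, 5, 2026);
""")

rows = conn.execute("""
    SELECT c.customer_name, p.product_name, d.month, f.amount
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_id = f.customer_id
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_date     d ON d.date_id     = f.date_id
    ORDER BY f.sale_id
""").fetchall()

print(rows)  # [('Alice', 'Book', 5, 100.0), ('Bob', 'Headphones', 5, 200.0)]
```

Two fact rows in, two report rows out — the 1:1 dimension joins add context without multiplying rows.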

&lt;h4&gt;
  
  
  ETL vs ELT, batch vs streaming, and CDC in plain English
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; = &lt;em&gt;extract, transform, load&lt;/em&gt; — read source data, transform it in a separate engine (Spark, Python), then load the clean result into the warehouse. &lt;strong&gt;ELT&lt;/strong&gt; = &lt;em&gt;extract, load, transform&lt;/em&gt; — load the raw source straight into the warehouse, then transform with SQL. Modern cloud warehouses are powerful enough that ELT has become the default. &lt;strong&gt;Batch&lt;/strong&gt; processes data on a schedule (every hour / day); &lt;strong&gt;streaming&lt;/strong&gt; processes data as it arrives (sub-second). &lt;strong&gt;CDC&lt;/strong&gt; (change data capture) tracks &lt;code&gt;INSERT&lt;/code&gt; / &lt;code&gt;UPDATE&lt;/code&gt; / &lt;code&gt;DELETE&lt;/code&gt; events on a source so the warehouse stays in sync without re-loading the whole table.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ETL&lt;/strong&gt; — transform outside the warehouse (older pattern; Spark, Python, custom).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELT&lt;/strong&gt; — transform inside the warehouse with SQL (newer; dbt, Snowflake, BigQuery).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch&lt;/strong&gt; — scheduled jobs (hourly, daily); cheaper, simpler, slightly stale data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming&lt;/strong&gt; — event-by-event processing (Kafka, Flink); fresher, more expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC&lt;/strong&gt; — incremental change tracking; loads only what changed since last run.&lt;/li&gt;
&lt;/ul&gt;
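&lt;p&gt;&lt;strong&gt;Sketch.&lt;/strong&gt; The CDC idea — load only what changed since the last successful run — reduces to filtering on a stored watermark. A stdlib-only Python sketch (the &lt;code&gt;updated_on&lt;/code&gt; field and sample rows are illustrative, not a real source):&lt;/p&gt;

```python
from datetime import date

# Hypothetical source rows carrying a last-updated stamp.
source = [
    {"order_id": 1, "updated_on": date(2026, 5, 7), "amount": 100.0},
    {"order_id": 2, "updated_on": date(2026, 5, 8), "amount": 200.0},
    {"order_id": 3, "updated_on": date(2026, 5, 8), "amount": 50.0},
]

def incremental_extract(rows, watermark):
    """CDC-lite: keep only rows changed after the last successful run."""
    return [r for r in rows if r["updated_on"] > watermark]

last_run = date(2026, 5, 7)              # watermark persisted by the previous load
changed = incremental_extract(source, last_run)
print([r["order_id"] for r in changed])  # [2, 3] — full reload avoided
```

Real CDC tools (Debezium, warehouse-native connectors) read the database's change log instead of a timestamp column, but the contract is the same: ship the delta, not the whole table.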

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A daily-batch ETL skeleton in Python that loads yesterday's orders, transforms them, and writes a curated table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source: raw orders dropped to S3 daily under s3://orders/2026-05-08/orders.csv
target: warehouse table fact_orders, partitioned by order_date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Sketch the three-stage &lt;strong&gt;ETL&lt;/strong&gt; pipeline shape — &lt;code&gt;extract&lt;/code&gt; reads the CSV, &lt;code&gt;transform&lt;/code&gt; cleans / dedupes / casts types, &lt;code&gt;load&lt;/code&gt; writes to the warehouse. Use plain Python pseudocode; the goal is the &lt;em&gt;shape&lt;/em&gt;, not a runnable example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://orders/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partition_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# warehouse-specific: delete the partition (WHERE order_date = partition_date), then COPY / INSERT the new rows
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-08&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;extract&lt;/code&gt; is the only function that knows where the source is; &lt;code&gt;transform&lt;/code&gt; is pure (no I/O) and easy to unit-test; &lt;code&gt;load&lt;/code&gt; is the only function that writes to the warehouse. Splitting the pipeline into three named functions makes the script readable, testable, and easy to swap (you can replace &lt;code&gt;extract&lt;/code&gt; with a Postgres reader without touching &lt;code&gt;transform&lt;/code&gt;). The dedupe + type-cast inside &lt;code&gt;transform&lt;/code&gt; is the canonical "raw → curated" cleaning step.&lt;/p&gt;
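&lt;p&gt;&lt;strong&gt;Sketch.&lt;/strong&gt; The claim that a pure &lt;code&gt;transform&lt;/code&gt; is easy to unit-test can be shown without pandas — a stdlib-only stand-in that performs the same dedupe-by-&lt;code&gt;order_id&lt;/code&gt; and cast-&lt;code&gt;amount&lt;/code&gt; steps on plain dicts:&lt;/p&gt;

```python
# Stdlib-only stand-in for the pandas transform: dedupe by order_id, cast amount.
def transform(rows):
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:
            continue  # drop duplicate orders (same effect as drop_duplicates)
        seen.add(r["order_id"])
        out.append({**r, "amount": float(r["amount"])})
    return out

raw = [
    {"order_id": 1, "amount": "100"},
    {"order_id": 1, "amount": "100"},    # duplicate row from a retried upload
    {"order_id": 2, "amount": "200.5"},
]

clean = transform(raw)
assert len(clean) == 2                   # duplicate removed
assert clean[1]["amount"] == 200.5       # string cast to float
print("transform unit test passed")
```

No I/O, no mocks, no database — feed in a list, assert on the list that comes out. That is the payoff of keeping &lt;code&gt;transform&lt;/code&gt; pure.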

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;function&lt;/th&gt;
&lt;th&gt;what it does&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;extract("2026-05-08")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;read S3 path for that day&lt;/td&gt;
&lt;td&gt;raw DataFrame from CSV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;transform(df)&lt;/code&gt; step a&lt;/td&gt;
&lt;td&gt;&lt;code&gt;drop_duplicates(subset=["order_id"])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;duplicate orders removed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;transform(df)&lt;/code&gt; step b&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pd.to_datetime(...).dt.date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;order_date&lt;/code&gt; cast to date type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;transform(df)&lt;/code&gt; step c&lt;/td&gt;
&lt;td&gt;&lt;code&gt;astype(float)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;amount&lt;/code&gt; cast to float&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;load(df, date)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;warehouse &lt;code&gt;COPY&lt;/code&gt; / &lt;code&gt;INSERT&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;row count returned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print(...)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;stdout summary&lt;/td&gt;
&lt;td&gt;&lt;code&gt;loaded 5 rows for 2026-05-08&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loaded 5 rows for 2026-05-08
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always separate &lt;code&gt;extract&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, &lt;code&gt;load&lt;/code&gt; into three named functions — even when the pipeline is small. The shape is what reviewers look for.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Treating data warehouses like OLTP databases — running thousands of &lt;code&gt;UPDATE&lt;/code&gt;s per minute (warehouses optimise for &lt;code&gt;SELECT&lt;/code&gt;, not &lt;code&gt;UPDATE&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Modelling everything in one wide table — kills performance and makes joins impossible later.&lt;/li&gt;
&lt;li&gt;Confusing batch and streaming — batch is the default; pick streaming only when you genuinely need sub-second freshness.&lt;/li&gt;
&lt;li&gt;Forgetting CDC — re-loading the whole &lt;code&gt;customers&lt;/code&gt; table every night when only 100 rows changed wastes hours.&lt;/li&gt;
&lt;li&gt;Skipping the staging step — going source → curated directly means you can't reproduce yesterday's run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Worked Problem on Building an Idempotent Daily ETL with Quality Checks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A daily CSV &lt;code&gt;orders_2026-05-08.csv&lt;/code&gt; that lands in S3. The warehouse has a &lt;code&gt;fact_orders&lt;/code&gt; table partitioned by &lt;code&gt;order_date&lt;/code&gt;. The pipeline must be &lt;strong&gt;idempotent&lt;/strong&gt; — running it twice with the same input produces the same output.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;order_date&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write a Python script that loads the daily CSV, replaces today's partition (so a rerun does not double-count), and runs three data-quality checks (row count &amp;gt; 0, no &lt;code&gt;NULL&lt;/code&gt; order_ids, no duplicate order_ids). Fail loudly with a non-zero exit code if any check fails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;DELETE&lt;/code&gt; of today's partition + three quality checks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;

&lt;span class="n"&gt;LOAD_DATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-08&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;csv_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM fact_orders WHERE order_date = %s;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOAD_DATE&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO fact_orders (order_id, order_date, amount) VALUES (%s, %s, %s);&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM fact_orders WHERE order_date = %s;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOAD_DATE&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM fact_orders WHERE order_id IS NULL;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT COUNT(*) FROM (
              SELECT order_id, COUNT(*) c FROM fact_orders GROUP BY 1 HAVING COUNT(*) &amp;gt; 1
            ) d;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warehouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LOAD_DATE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;DELETE FROM fact_orders WHERE order_date = LOAD_DATE&lt;/code&gt; wipes today's partition before re-inserting — that's what makes the pipeline idempotent (a rerun overwrites today's slice instead of appending). The &lt;code&gt;INSERT&lt;/code&gt; loop loads every CSV row with explicit type casts so dates land as dates and amounts land as numbers. Three quality checks then verify the load worked — non-zero row count, no null primary keys, no duplicates. Any failure returns exit code &lt;code&gt;1&lt;/code&gt;, so the orchestrator (Airflow, cron) flags the run as failed and can alert whoever is on call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After a healthy run:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;order_date&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Exit code: &lt;code&gt;0&lt;/code&gt;. A second run of the same script produces an identical &lt;code&gt;fact_orders&lt;/code&gt; (idempotent).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for a clean 3-row CSV:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DELETE WHERE order_date = '2026-05-08'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;today's partition wiped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;INSERT&lt;/code&gt; 3 CSV rows&lt;/td&gt;
&lt;td&gt;3 rows in today's partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;row-count check&lt;/td&gt;
&lt;td&gt;3 &amp;gt; 0 → pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;null-PK check&lt;/td&gt;
&lt;td&gt;0 nulls → pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;duplicate-PK check&lt;/td&gt;
&lt;td&gt;0 dupes → pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;commit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;exit &lt;code&gt;0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DELETE&lt;/code&gt; of today's partition before insert&lt;/strong&gt; — makes the pipeline idempotent; rerun overwrites instead of appending.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;explicit type casts in the &lt;code&gt;INSERT&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;int()&lt;/code&gt;, &lt;code&gt;float()&lt;/code&gt;, ISO date strings make the warehouse see clean types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;three quality checks inside the same job&lt;/strong&gt; — checks live next to the load, not in a "we'll add monitoring later" backlog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;non-zero exit code on failure&lt;/strong&gt; — Airflow / cron / GitHub Actions detect the failure automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;conn.commit()&lt;/code&gt; only on success&lt;/strong&gt; — bad runs roll back; the warehouse is never left half-loaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cost&lt;/strong&gt; — &lt;code&gt;O(rows in today's CSV)&lt;/code&gt;; the historical &lt;code&gt;fact_orders&lt;/code&gt; is only scanned for the duplicate check.&lt;/li&gt;
&lt;/ul&gt;
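&lt;p&gt;The idempotency guarantee is easy to verify in miniature. Here a dict stands in for the partitioned table, and the delete-then-insert collapses into a wholesale partition replace (a sketch, not warehouse code):&lt;/p&gt;

```python
def load_partition(warehouse, partition_key, rows):
    # Delete + insert collapsed into one step: replace the whole partition.
    warehouse[partition_key] = list(rows)

wh = {}
rows = [{"order_id": 1, "amount": 100.0},
        {"order_id": 2, "amount": 200.0}]

load_partition(wh, "2026-05-08", rows)
snapshot = {k: list(v) for k, v in wh.items()}

load_partition(wh, "2026-05-08", rows)  # rerun with the same input
assert wh == snapshot  # idempotent: the rerun changed nothing
```

&lt;p&gt;An append-only loader would fail this assertion on the second run — exactly the double-counting bug the &lt;code&gt;DELETE&lt;/code&gt; step prevents.&lt;/p&gt;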

&lt;p&gt;&lt;strong&gt;Keep going:&lt;/strong&gt; for the structured ETL learning path see &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling course&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. Steps 6-9 — Apache Spark, Airflow, Cloud, and Data Modeling
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From single-machine SQL and Pandas to production-scale pipelines
&lt;/h3&gt;

&lt;p&gt;After SQL, Python, databases, and ETL fundamentals are solid, four scaling skills turn you from a script-writer into a production data engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 6 — Apache Spark.&lt;/strong&gt; The industry standard for large-scale processing; PySpark is its Python API. Learn DataFrames, transformations, actions, Spark SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 7 — Workflow orchestration.&lt;/strong&gt; Apache Airflow runs your pipelines on a schedule. Learn DAGs (directed acyclic graphs), tasks, operators, dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 8 — Cloud platforms.&lt;/strong&gt; Modern data engineering lives on AWS, Azure, or GCP. Pick &lt;strong&gt;AWS first&lt;/strong&gt; — it's the most asked. Learn S3, EC2, Lambda, Glue, Redshift, IAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 9 — Data modeling.&lt;/strong&gt; OLTP vs OLAP, normalisation vs denormalisation, slowly changing dimensions (SCDs), fact-vs-dim design. Read the &lt;strong&gt;Kimball Data Warehouse Toolkit&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; these are scale-and-production skills. Don't open them until SQL and Python are second nature. The most common fresher failure mode is &lt;em&gt;"I learned Spark but I can't write a &lt;code&gt;LEFT JOIN&lt;/code&gt; correctly under pressure."&lt;/em&gt; Master the foundations first.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Apache Spark + PySpark — the big-data engine
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt; processes data that doesn't fit on a single machine by splitting work across a cluster. &lt;strong&gt;PySpark&lt;/strong&gt; is its Python API — almost everything you do in Pandas has a PySpark equivalent, just distributed. The simplest entry point is &lt;code&gt;SparkSession.builder.getOrCreate()&lt;/code&gt; followed by &lt;code&gt;spark.read.csv(...)&lt;/code&gt; to load data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SparkSession&lt;/code&gt;&lt;/strong&gt; — the entry point; creates the cluster connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DataFrame&lt;/strong&gt; — the main abstraction; like a Pandas DataFrame but distributed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformations&lt;/strong&gt; — &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;groupBy&lt;/code&gt; — lazy, build a plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions&lt;/strong&gt; — &lt;code&gt;show&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt; — trigger actual execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark SQL&lt;/strong&gt; — register a DataFrame as a table and run SQL against it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A &lt;code&gt;sales.csv&lt;/code&gt; file similar to the Pandas example, but big enough that we want Spark to process it on a cluster.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write a minimal PySpark script that reads &lt;code&gt;sales.csv&lt;/code&gt; and shows the first few rows. Show the canonical &lt;code&gt;SparkSession&lt;/code&gt; setup that every PySpark script begins with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;SparkSession.builder.appName("demo").getOrCreate()&lt;/code&gt; either creates a new Spark session or attaches to an existing one — either way, you end up with a &lt;code&gt;spark&lt;/code&gt; object that knows how to talk to the cluster. &lt;code&gt;spark.read.csv("sales.csv", header=True, inferSchema=True)&lt;/code&gt; loads the file as a DataFrame, treating the first row as headers and inferring column types. &lt;code&gt;df.show()&lt;/code&gt; is an &lt;em&gt;action&lt;/em&gt; that triggers execution and prints the first 20 rows to stdout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;call&lt;/th&gt;
&lt;th&gt;kind&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SparkSession.builder.appName(...).getOrCreate()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;setup&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;spark&lt;/code&gt; session attached to a (local or cluster) executor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;spark.read.csv(..., header=True, inferSchema=True)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;mostly lazy&lt;/td&gt;
&lt;td&gt;DataFrame plan registered; &lt;code&gt;inferSchema=True&lt;/code&gt; does trigger one extra pass over the file to guess column types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;df.show()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;action&lt;/td&gt;
&lt;td&gt;plan executes: read CSV → infer types → render first 20 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;stdout&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;grid-formatted table printed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+------+------+
|order_id|region|amount|
+--------+------+------+
|       1| North|   100|
|       2| South|   200|
|       3| North|   150|
+--------+------+------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; in PySpark, transformations (&lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;groupBy&lt;/code&gt;) are &lt;em&gt;lazy&lt;/em&gt; — nothing runs until you call an action like &lt;code&gt;.show()&lt;/code&gt;, &lt;code&gt;.count()&lt;/code&gt;, or a write such as &lt;code&gt;.write.parquet(...)&lt;/code&gt;. That's why the same PySpark code can be reused for 1 GB and 1 TB datasets.&lt;/p&gt;
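&lt;p&gt;If you haven't touched Spark yet, plain Python generators give a close analogy for the lazy/eager split — this is an analogy, not PySpark code:&lt;/p&gt;

```python
data = [100, 200, 150]

# "Transformation": builds a plan; no element is touched yet
# (like filter/select/groupBy in Spark).
plan = (x * 2 for x in data if x > 120)

# "Action": consuming the generator finally executes the plan
# (like .show() or .count()).
result = list(plan)
print(result)  # → [400, 300]
```

&lt;p&gt;Until &lt;code&gt;list(plan)&lt;/code&gt; runs, no element of &lt;code&gt;data&lt;/code&gt; is filtered or doubled — the same deferred-execution idea Spark applies across a cluster.&lt;/p&gt;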

&lt;h4&gt;
  
  
  Apache Airflow — DAGs, tasks, scheduling
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Airflow&lt;/strong&gt; is the workflow orchestrator most data teams use. You write a &lt;strong&gt;DAG&lt;/strong&gt; (directed acyclic graph) of &lt;strong&gt;tasks&lt;/strong&gt;; Airflow runs them on a schedule, respects dependencies, retries failures, and surfaces alerts. The minimum viable DAG is two tasks chained with &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DAG&lt;/strong&gt; — the workflow; a Python file in the &lt;code&gt;dags/&lt;/code&gt; directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task&lt;/strong&gt; — a single unit of work (run a SQL query, call an API, run a PySpark job).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operator&lt;/strong&gt; — a reusable task type (&lt;code&gt;BashOperator&lt;/code&gt;, &lt;code&gt;PythonOperator&lt;/code&gt;, &lt;code&gt;SQLExecuteQueryOperator&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependencies&lt;/strong&gt; — &lt;code&gt;task1 &amp;gt;&amp;gt; task2&lt;/code&gt; means "run task1 then task2."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedule&lt;/strong&gt; — &lt;code&gt;schedule_interval='@daily'&lt;/code&gt;, &lt;code&gt;'0 3 * * *'&lt;/code&gt; (cron), or &lt;code&gt;None&lt;/code&gt; for manual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A simple two-stage daily ETL — extract data from an API, load it into a warehouse table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;task1: extract — call the API, write raw JSON to S3
task2: load — read S3 JSON, INSERT into fact_events
schedule: daily at 00:00 UTC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write a minimal Airflow DAG that defines two &lt;code&gt;PythonOperator&lt;/code&gt; tasks &lt;code&gt;extract_task&lt;/code&gt; and &lt;code&gt;load_task&lt;/code&gt;, and chains them so &lt;code&gt;load_task&lt;/code&gt; only runs after &lt;code&gt;extract_task&lt;/code&gt; succeeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# call the API, write raw JSON to S3
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# read S3 JSON, INSERT into fact_events
&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etl_pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;extract_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;load_task&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;load&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;extract_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;load_task&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The &lt;code&gt;with DAG(...) as dag:&lt;/code&gt; block defines the DAG metadata — its name, when it starts, how often it runs (&lt;code&gt;@daily&lt;/code&gt; is shorthand for "every day at midnight"), and whether to backfill missed runs (&lt;code&gt;catchup=False&lt;/code&gt; means "no, just run from now on"). Two &lt;code&gt;PythonOperator&lt;/code&gt; tasks wrap the actual Python functions. &lt;code&gt;extract_task &amp;gt;&amp;gt; load_task&lt;/code&gt; declares the dependency — Airflow will only run &lt;code&gt;load_task&lt;/code&gt; if &lt;code&gt;extract_task&lt;/code&gt; succeeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;when&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;parse time&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;DAG(...)&lt;/code&gt; instantiates&lt;/td&gt;
&lt;td&gt;DAG &lt;code&gt;etl_pipeline&lt;/code&gt; registered in Airflow metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;parse time&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PythonOperator(...)&lt;/code&gt; ×2&lt;/td&gt;
&lt;td&gt;two tasks attached to the DAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;parse time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;extract_task &amp;gt;&amp;gt; load_task&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;dependency edge added (extract → load)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;every day at midnight&lt;/td&gt;
&lt;td&gt;scheduler triggers a DAG run&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;extract&lt;/code&gt; task starts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;extract succeeds&lt;/td&gt;
&lt;td&gt;scheduler sees green upstream&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;load&lt;/code&gt; task starts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;load succeeds&lt;/td&gt;
&lt;td&gt;DAG run marked success&lt;/td&gt;
&lt;td&gt;green tick in calendar grid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6′&lt;/td&gt;
&lt;td&gt;extract fails&lt;/td&gt;
&lt;td&gt;downstream skipped&lt;/td&gt;
&lt;td&gt;red tick; alert fires&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Airflow UI, this DAG appears as two boxes connected by an arrow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[extract] → [load]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each daily run produces a tick in the calendar grid; failures are red, successes are green.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; one DAG = one logical workflow. If you find yourself writing 50 tasks in a single DAG, you probably want 5 DAGs of 10 tasks each — easier to debug, easier to retry.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cloud platforms — AWS first, then expand
&lt;/h4&gt;

&lt;p&gt;Modern data engineering is cloud-based. &lt;strong&gt;Pick one platform first&lt;/strong&gt; and learn its data services before branching out — most teams use AWS, so it's the highest-leverage starting point. Azure and GCP are equally valid second choices once you have one cloud under your belt.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt; — object storage; where raw data lands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EC2&lt;/strong&gt; — virtual machines; rarely touched directly anymore.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt; — serverless functions; great for small ETL triggers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue&lt;/strong&gt; — managed ETL service; runs Spark jobs without you managing the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redshift&lt;/strong&gt; — AWS data warehouse; SQL-compatible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM&lt;/strong&gt; — identity and access; &lt;em&gt;non-optional&lt;/em&gt; — every cloud bug eventually traces back to permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; You have a daily CSV that lands in S3 at &lt;code&gt;s3://my-bucket/orders/{date}/orders.csv&lt;/code&gt; and a Redshift table &lt;code&gt;fact_orders&lt;/code&gt; to load it into.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write the AWS CLI / SQL pseudocode that copies the CSV from S3 into Redshift on a schedule. (Don't worry about IAM details; the goal is the &lt;em&gt;shape&lt;/em&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Inside Redshift, run on a schedule from Airflow / cron&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/orders/2026-05-08/orders.csv'&lt;/span&gt;
&lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::ACCOUNT:role/RedshiftS3ReadRole'&lt;/span&gt;
&lt;span class="n"&gt;CSV&lt;/span&gt;
&lt;span class="n"&gt;IGNOREHEADER&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;COPY ... FROM 's3://...'&lt;/code&gt; is the Redshift-specific bulk-load command — it pulls a file directly from S3 into a table without needing an intermediate machine. &lt;code&gt;IAM_ROLE&lt;/code&gt; references an AWS IAM role that grants Redshift permission to read that S3 bucket — without this, the copy fails with a permission error. &lt;code&gt;CSV&lt;/code&gt; tells Redshift the file format; &lt;code&gt;IGNOREHEADER 1&lt;/code&gt; skips the column-header row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;actor&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Airflow / cron&lt;/td&gt;
&lt;td&gt;submits &lt;code&gt;COPY&lt;/code&gt; to Redshift&lt;/td&gt;
&lt;td&gt;command queued&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Redshift leader&lt;/td&gt;
&lt;td&gt;assumes &lt;code&gt;RedshiftS3ReadRole&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;temporary AWS credentials obtained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Redshift compute nodes&lt;/td&gt;
&lt;td&gt;parallel-fetch the S3 object&lt;/td&gt;
&lt;td&gt;bytes streamed directly to slices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;parser&lt;/td&gt;
&lt;td&gt;apply &lt;code&gt;CSV, IGNOREHEADER 1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;header row skipped; data rows parsed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;loader&lt;/td&gt;
&lt;td&gt;bulk-insert into &lt;code&gt;fact_orders&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;rows committed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;system catalogue&lt;/td&gt;
&lt;td&gt;log to &lt;code&gt;STL_LOAD_COMMITS&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;row count + reject count recorded&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The CSV's rows land in &lt;code&gt;fact_orders&lt;/code&gt;. A small status row is logged in &lt;code&gt;STL_LOAD_COMMITS&lt;/code&gt; showing how many rows were copied and whether any were rejected.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; for AWS, S3 + IAM are the two services you actually need to be fluent in. Everything else (Lambda, Glue, Redshift) layers on top.&lt;/p&gt;
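The COPY above hard-codes a single date, but a scheduled job needs to template the S3 key for each run. A minimal Python sketch of the wrapper an Airflow task or cron script might use (the bucket, table, and role ARN follow the example above; `build_copy_sql` is a hypothetical helper name):

```python
from datetime import date

def build_copy_sql(run_date: date) -> str:
    """Render the daily Redshift COPY statement for one day's orders file."""
    s3_key = f"s3://my-bucket/orders/{run_date.isoformat()}/orders.csv"
    return (
        f"COPY fact_orders\n"
        f"FROM '{s3_key}'\n"
        "IAM_ROLE 'arn:aws:iam::ACCOUNT:role/RedshiftS3ReadRole'\n"
        "CSV\n"
        "IGNOREHEADER 1;"
    )

# The scheduler calls this once per run with that run's logical date
print(build_copy_sql(date(2026, 5, 8)))
```

An Airflow task would pass the DAG run's logical date and hand the rendered string to a Redshift connection.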

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Learning Spark before your SQL is solid — Spark is just bigger SQL with more failure modes.&lt;/li&gt;
&lt;li&gt;Writing 1,000-line Airflow DAGs — split into smaller DAGs that each do one thing.&lt;/li&gt;
&lt;li&gt;Storing AWS credentials in code — always use IAM roles or environment variables, never hardcode.&lt;/li&gt;
&lt;li&gt;Ignoring data modeling because it "feels theoretical" — interviewers test it heavily.&lt;/li&gt;
&lt;li&gt;Trying to learn all three clouds at once — pick AWS first; the others are easier once you know one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Worked Problem on Designing a Slowly Changing Dimension (Type 2) for Customer Addresses
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; Customer &lt;code&gt;C1&lt;/code&gt; lives at "12 Old St" until 2026-03-15, then moves to "88 New Ave". The fact tables need to know which address was current at the time of each historical sale.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;12 Old St&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2026-03-14&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;88 New Ave&lt;/td&gt;
&lt;td&gt;2026-03-15&lt;/td&gt;
&lt;td&gt;(NULL)&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write the SQL to handle the address change as an &lt;strong&gt;SCD Type 2&lt;/strong&gt; update — close the old row by setting &lt;code&gt;valid_to&lt;/code&gt; and &lt;code&gt;is_current = FALSE&lt;/code&gt;, then insert a new row with the new address and &lt;code&gt;is_current = TRUE&lt;/code&gt;. This pattern preserves historical correctness without losing the past.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;UPDATE&lt;/code&gt; to close the old row + &lt;code&gt;INSERT&lt;/code&gt; for the new one
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Step 1: close the existing current row&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-14'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'C1'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 2: insert the new current row&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'C1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'88 New Ave'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; SCD Type 2 keeps full history by &lt;em&gt;adding&lt;/em&gt; new rows rather than overwriting old ones. The &lt;code&gt;UPDATE&lt;/code&gt; finds the row where &lt;code&gt;is_current = TRUE&lt;/code&gt; for the customer and closes it — sets &lt;code&gt;valid_to&lt;/code&gt; to the day before the change and &lt;code&gt;is_current&lt;/code&gt; to &lt;code&gt;FALSE&lt;/code&gt;. The &lt;code&gt;INSERT&lt;/code&gt; then adds the new row with &lt;code&gt;valid_from&lt;/code&gt; set to the change date, &lt;code&gt;valid_to&lt;/code&gt; left &lt;code&gt;NULL&lt;/code&gt; (still current), and &lt;code&gt;is_current = TRUE&lt;/code&gt;. Historical fact tables can join to this dim with a date predicate to find the address that was current at the time of each sale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the two statements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;12 Old St&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2026-03-14&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;88 New Ave&lt;/td&gt;
&lt;td&gt;2026-03-15&lt;/td&gt;
&lt;td&gt;(NULL)&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A query like &lt;code&gt;WHERE is_current = TRUE&lt;/code&gt; returns only the current address. A historical join uses &lt;code&gt;WHERE sale_date BETWEEN valid_from AND COALESCE(valid_to, DATE '9999-12-31')&lt;/code&gt; to pick the right address per sale.&lt;/p&gt;
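Both statements and the historical join can be exercised end to end in a few lines. A runnable sketch using SQLite (the same UPDATE/INSERT pattern applies; SQLite has no BOOLEAN, so 1/0 stand in for TRUE/FALSE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_id TEXT, address TEXT,
    valid_from TEXT, valid_to TEXT, is_current INTEGER)""")
# Initial state: one current row for C1
conn.execute("INSERT INTO dim_customer VALUES ('C1', '12 Old St', '2025-01-01', NULL, 1)")

# SCD Type 2 change on 2026-03-15: close the old row, insert the new one
conn.execute("""UPDATE dim_customer
    SET valid_to = '2026-03-14', is_current = 0
    WHERE customer_id = 'C1' AND is_current = 1""")
conn.execute("INSERT INTO dim_customer VALUES ('C1', '88 New Ave', '2026-03-15', NULL, 1)")

def address_on(sale_date: str) -> str:
    """Historical join predicate: which address was current on this sale date?"""
    row = conn.execute(
        """SELECT address FROM dim_customer
           WHERE customer_id = 'C1'
             AND ? BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31')""",
        (sale_date,),
    ).fetchone()
    return row[0]

print(address_on("2026-02-01"))  # 12 Old St
print(address_on("2026-04-01"))  # 88 New Ave
```

Dates stored as ISO-8601 strings compare correctly under BETWEEN, which is what makes the predicate work here without date functions.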

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the change on 2026-03-15:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UPDATE&lt;/code&gt; closes row 1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;valid_to = 2026-03-14&lt;/code&gt;, &lt;code&gt;is_current = FALSE&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;INSERT&lt;/code&gt; adds row 2&lt;/td&gt;
&lt;td&gt;new row with &lt;code&gt;valid_from = 2026-03-15&lt;/code&gt;, &lt;code&gt;is_current = TRUE&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;dimension now has 2 rows for C1&lt;/td&gt;
&lt;td&gt;one historical, one current&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SCD Type 2 keeps history&lt;/strong&gt; — old rows are not overwritten; both versions of the customer's address coexist with date ranges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;valid_from&lt;/code&gt; / &lt;code&gt;valid_to&lt;/code&gt; define the row's lifetime&lt;/strong&gt; — the date range during which this row was the truth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;is_current = TRUE&lt;/code&gt; flag&lt;/strong&gt; — shortcut for dashboards that always want the latest; saves an &lt;code&gt;ORDER BY ... LIMIT 1&lt;/code&gt; lookup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical joins use &lt;code&gt;BETWEEN&lt;/code&gt;&lt;/strong&gt; — pick the dim row whose date range contains the fact row's date.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(valid_to, '9999-12-31')&lt;/code&gt;&lt;/strong&gt; — handles the open-ended current row whose &lt;code&gt;valid_to&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — two row-level operations; constant time per dimension change.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the deeper modeling syllabus see &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling for Data Engineering Interviews&lt;/a&gt;; when you do start Spark, the gentle entry point is &lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;PySpark Fundamentals&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — PySpark&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;PySpark Fundamentals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — Spark internals&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Apache Spark Internals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/apache-spark-internals-for-data-engineering-interviews" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — slowly changing data&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SCD practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/slowly-changing-data" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Steps 10-13 — Streaming, Portfolio Projects, Git, and Interview Prep
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From skills to a job offer — proving the work and clearing the loop
&lt;/h3&gt;

&lt;p&gt;The last four steps turn your skills into a job offer. Streaming systems handle real-time data, portfolio projects prove you can ship, Git makes your code visible, and interview prep closes the deal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 10 — Streaming systems.&lt;/strong&gt; Kafka, event-driven architectures, message queues, real-time processing. Required for advanced roles; optional for first jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 11 — Build five portfolio projects.&lt;/strong&gt; SQL analytics, Python ETL, Airflow pipeline, PySpark large-data, cloud deployment. Put all on GitHub.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 12 — Master Git.&lt;/strong&gt; &lt;code&gt;clone&lt;/code&gt;, &lt;code&gt;add&lt;/code&gt;, &lt;code&gt;commit&lt;/code&gt;, &lt;code&gt;push&lt;/code&gt;, &lt;code&gt;branch&lt;/code&gt;, &lt;code&gt;merge&lt;/code&gt; — every company uses Git from day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 13 — Interview prep.&lt;/strong&gt; SQL questions (joins, windows, aggregations, ranking), Python questions (dicts, strings, lists, hashmaps), system-design basics (ETL architecture, lake vs warehouse, batch vs streaming, scalability).&lt;/li&gt;
&lt;/ul&gt;
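The Step 12 commands compose into one everyday loop — branch, commit, merge. A sketch you can run in a throwaway directory (the repo path, branch name, and commit messages are illustrative; assumes `git` is installed):

```shell
set -e
# Create a throwaway repo to practice the everyday loop
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b main
git config user.email "you@example.com"   # local identity for the demo only
git config user.name  "Demo User"

echo "# orders-batch-etl" > README.md
git add README.md
git commit -q -m "initial commit"

# Feature work happens on a branch, then merges back into main
git checkout -q -b feature/add-pipeline
echo "etl code" > pipeline.py
git add pipeline.py
git commit -q -m "add pipeline skeleton"

git checkout -q main
git merge -q feature/add-pipeline
git log --oneline
```

In a real project the loop ends with `git push` to GitHub; that is the half recruiters actually see.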

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79sm4v2y8s8t1qolijvi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79sm4v2y8s8t1qolijvi.jpeg" alt="Proof-by-phase checklist mapping each data engineering roadmap phase to a GitHub repo and a resume bullet a fresher can show recruiters." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; projects beat certificates. A GitHub repo with a clean README and a runnable pipeline outperforms a stack of certifications. Your top-of-funnel signal to recruiters is &lt;em&gt;"here's the URL to my orders-batch-etl project"&lt;/em&gt; — not your transcript.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Streaming systems — Kafka in plain English
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Kafka&lt;/strong&gt; is a distributed message queue that lets producers publish events to a "topic" and consumers read them in order. Event-driven architectures use Kafka as the spine — payment events flow in, multiple downstream consumers (fraud detection, analytics, notifications) read the same stream independently. &lt;strong&gt;Required for advanced / senior DE roles; optional for fresher first jobs.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producer&lt;/strong&gt; — writes events to a Kafka topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topic&lt;/strong&gt; — a named append-only log; events stay in order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer&lt;/strong&gt; — reads events from a topic; multiple consumers per topic are fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition&lt;/strong&gt; — topics are split into partitions for parallelism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case&lt;/strong&gt; — live payment events flowing into a fraud-detection model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A payment event payload that a producer wants to publish to the &lt;code&gt;payments&lt;/code&gt; topic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PAY-1001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;250.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;U42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-08T10:15:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Sketch the producer-side Python code that publishes this event to a Kafka topic called &lt;code&gt;payments&lt;/code&gt;. Use &lt;code&gt;kafka-python&lt;/code&gt;, a widely used client. Include just the producer setup + send call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;value_serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PAY-1001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;250.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;U42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-08T10:15:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;KafkaProducer(bootstrap_servers=[...])&lt;/code&gt; connects to one or more Kafka brokers. The &lt;code&gt;value_serializer&lt;/code&gt; lambda turns the Python dict into JSON bytes (Kafka stores raw bytes, not Python objects). &lt;code&gt;producer.send("payments", event)&lt;/code&gt; queues the event for delivery to the &lt;code&gt;payments&lt;/code&gt; topic; &lt;code&gt;producer.flush()&lt;/code&gt; blocks until the queued messages are actually sent. Downstream consumers (fraud detection, analytics) can read this event independently and in order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;KafkaProducer(bootstrap_servers=[...])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TCP connection to broker established; metadata fetched&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;value_serializer = json.dumps(...).encode("utf-8")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;every send will convert dict → JSON bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;producer.send("payments", event)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;record buffered in the producer's in-memory queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;producer's partitioner picks a partition (key hash or round-robin)&lt;/td&gt;
&lt;td&gt;record assigned to a partition log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;producer.flush()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;blocks until all buffered records are acknowledged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;subscribed consumers&lt;/td&gt;
&lt;td&gt;next &lt;code&gt;poll()&lt;/code&gt; returns the new event&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The event is appended to the &lt;code&gt;payments&lt;/code&gt; topic. Any consumer subscribed to &lt;code&gt;payments&lt;/code&gt; will receive it on its next &lt;code&gt;poll()&lt;/code&gt; call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"payment_id": "PAY-1001", "amount": 250.0, "currency": "USD", "user_id": "U42", "ts": "2026-05-08T10:15:00Z"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; fresher first jobs rarely need Kafka. Master batch (Step 5) before opening Step 10. Mention Kafka in interviews only if you've actually shipped a project that uses it.&lt;/p&gt;
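Step 4 of the trace — partition assignment — is what guarantees per-key ordering. Kafka's default partitioner hashes the record key with murmur2; a simplified Python sketch of the idea (CRC32 stands in for murmur2, and `NUM_PARTITIONS` is made up for the demo):

```python
import zlib

NUM_PARTITIONS = 3  # a real topic's partition count comes from broker metadata

def pick_partition(key: bytes) -> int:
    """Same key -> same partition, so all events for one user stay ordered."""
    # Kafka's default partitioner uses murmur2; CRC32 just illustrates the shape.
    return zlib.crc32(key) % NUM_PARTITIONS

# Every payment for user U42 lands on one partition, preserving its order
print(pick_partition(b"U42") == pick_partition(b"U42"))  # True
```

Keyless records fall back to round-robin (or sticky batching in newer clients), which spreads load but gives up per-key ordering.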

&lt;h4&gt;
  
  
  Five portfolio projects — what to build, in order
&lt;/h4&gt;

&lt;p&gt;Projects matter more than certificates. Build all five and put them on GitHub with a clean &lt;code&gt;README.md&lt;/code&gt; for each. The five build on each other — by the end you have a production-grade portfolio.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project 1 — SQL analytics.&lt;/strong&gt; E-commerce sales dashboard built entirely in SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 2 — Python ETL.&lt;/strong&gt; Extract API data → clean → store in PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 3 — Airflow pipeline.&lt;/strong&gt; Schedule the Python ETL as a daily DAG.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 4 — PySpark large-data pipeline.&lt;/strong&gt; Process millions of rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 5 — Cloud project.&lt;/strong&gt; Deploy the ETL pipeline on AWS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; Project 1 — an e-commerce dataset (&lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;products&lt;/code&gt;) for which you'll write the SQL behind a sales dashboard.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders&lt;/code&gt; (sample):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;order_date&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;2026-04-01&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;P2&lt;/td&gt;
&lt;td&gt;2026-04-15&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;2026-05-01&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; For Project 1, write the canonical "monthly revenue per product" SQL. This is the single query the entire dashboard hangs off — get this right and the rest of the dashboard is just &lt;code&gt;WHERE&lt;/code&gt; and &lt;code&gt;ORDER BY&lt;/code&gt; variations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;DATE_TRUNC('month', o.order_date)&lt;/code&gt; collapses every order date to the first day of its month, so all April orders aggregate together. The &lt;code&gt;JOIN&lt;/code&gt; brings in &lt;code&gt;product_name&lt;/code&gt; from &lt;code&gt;products&lt;/code&gt; so the dashboard can label rows. &lt;code&gt;GROUP BY&lt;/code&gt; collapses to one row per (product, month). &lt;code&gt;ORDER BY&lt;/code&gt; produces a chronologically-readable result. Wrap this in a saved view or a dbt model and the dashboard renders automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;clause&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FROM orders o&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;scan all order rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;JOIN products p ON p.product_id = o.product_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;each order picks up its &lt;code&gt;product_name&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE o.order_date &amp;gt;= DATE '2026-01-01'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;drop pre-2026 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DATE_TRUNC('month', o.order_date)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;every date snapped to month-start (e.g. 2026-04-15 → 2026-04-01)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GROUP BY product_name, month&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bucket by (product, month)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SUM(o.amount)&lt;/code&gt; per bucket&lt;/td&gt;
&lt;td&gt;revenue total per group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ORDER BY month, product_name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;chronological, then alphabetical&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;product_name&lt;/th&gt;
&lt;th&gt;month&lt;/th&gt;
&lt;th&gt;revenue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Book&lt;/td&gt;
&lt;td&gt;2026-04-01&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Headphones&lt;/td&gt;
&lt;td&gt;2026-04-01&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Book&lt;/td&gt;
&lt;td&gt;2026-05-01&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(With a realistic dataset you'd see many more (product, month) rows; this three-row sample just illustrates the shape of the result.)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every Project 1 SQL should be runnable on a free PostgreSQL sandbox with a 100-row sample dataset. Put both the SQL and the sample data in your GitHub repo so a recruiter can clone and run it in 60 seconds.&lt;/p&gt;
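&lt;p&gt;A minimal &lt;code&gt;schema.sql&lt;/code&gt; sketch that makes the query above runnable on a fresh PostgreSQL sandbox. The table shapes are assumptions inferred from the query's columns, so adjust them to your own dataset:&lt;/p&gt;

```sql
-- Assumed table shapes for the monthly-revenue query (illustrative only).
CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    product_name TEXT NOT NULL
);

CREATE TABLE orders (
    order_id   INT PRIMARY KEY,
    product_id INT REFERENCES products (product_id),
    order_date DATE NOT NULL,
    amount     NUMERIC(10, 2) NOT NULL
);

-- A few sample rows so the query returns data immediately.
INSERT INTO products VALUES (1, 'Book'), (2, 'Headphones');
INSERT INTO orders VALUES
    (101, 1, DATE '2026-04-02', 100),
    (102, 2, DATE '2026-04-15', 200),
    (103, 1, DATE '2026-05-07', 150);
```

&lt;p&gt;Load the schema, run the revenue query, and the three sample rows produce one row per (product, month) pair, matching the output above.&lt;/p&gt;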

&lt;h4&gt;
  
  
  Git, GitHub, and the resume bullet
&lt;/h4&gt;

&lt;p&gt;Git is &lt;strong&gt;non-optional infrastructure&lt;/strong&gt;. Every team's workflow assumes you can clone a repo, branch off, commit, and push. The bare minimum command set fits on a single screen.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git clone &amp;lt;url&amp;gt;&lt;/code&gt;&lt;/strong&gt; — copy a remote repo locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git checkout -b feature/x&lt;/code&gt;&lt;/strong&gt; — create + switch to a new branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git add file&lt;/code&gt;&lt;/strong&gt; — stage a change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git commit -m "..."&lt;/code&gt;&lt;/strong&gt; — record the staged changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git push origin &amp;lt;branch&amp;gt;&lt;/code&gt;&lt;/strong&gt; — push the branch to GitHub; open a pull request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git merge&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;git rebase&lt;/code&gt;&lt;/strong&gt; — combine branches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; You've finished Project 1 (the SQL analytics dashboard) on your laptop and want to push it to GitHub under your account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Show the canonical six-command workflow: clone an empty template repo, branch off, add the files you've written, commit with a descriptive message, push the branch, and open a pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/&amp;lt;you&amp;gt;/sql-sales-dashboard.git
&lt;span class="nb"&gt;cd &lt;/span&gt;sql-sales-dashboard
git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/initial-dashboard
&lt;span class="c"&gt;# (write README.md, schema.sql, queries.sql, sample-data/)&lt;/span&gt;
git add README.md schema.sql queries.sql sample-data/
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add Project 1: SQL sales dashboard with sample data"&lt;/span&gt;
git push origin feature/initial-dashboard
&lt;span class="c"&gt;# open a pull request on github.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;clone&lt;/code&gt; brings the empty repo to your laptop. &lt;code&gt;checkout -b&lt;/code&gt; creates a feature branch (never push to &lt;code&gt;main&lt;/code&gt; directly on a team repo, even your own — build the habit now). After writing the project files, &lt;code&gt;add&lt;/code&gt; stages them, &lt;code&gt;commit&lt;/code&gt; records the change with a one-line message that future-you can scan, and &lt;code&gt;push&lt;/code&gt; sends the branch to GitHub. The pull request is the artifact a recruiter or interviewer will actually look at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;command&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git clone &amp;lt;url&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;empty repo copied to laptop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cd sql-sales-dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;move into the working tree&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git checkout -b feature/initial-dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;new branch created and checked out&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;write &lt;code&gt;README.md&lt;/code&gt;, &lt;code&gt;schema.sql&lt;/code&gt;, &lt;code&gt;queries.sql&lt;/code&gt;, &lt;code&gt;sample-data/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;working tree now has 4 untracked items&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git add ...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;files staged for commit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git commit -m "..."&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;snapshot recorded with descriptive message&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git push origin feature/initial-dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;branch published to GitHub&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;open PR on github.com&lt;/td&gt;
&lt;td&gt;reviewable artifact link a recruiter can click&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A GitHub repo URL with a feature branch and a pull request — both visible to anyone you share the link with. The README renders directly on the repo home page, becoming your portfolio artifact.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you can't clone, branch, commit, and push within 60 seconds without looking commands up, Git is still on your to-do list. Practice it daily until it's muscle memory.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Trying to learn Kafka before mastering batch ETL — Kafka adds complexity without removing any.&lt;/li&gt;
&lt;li&gt;Building one giant project instead of five small ones — recruiters skim; five clear repos beat one tangled one.&lt;/li&gt;
&lt;li&gt;Pushing to &lt;code&gt;main&lt;/code&gt; directly — every commit becomes part of history with no review trail.&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;README.md&lt;/code&gt; per project — repos without READMEs are invisible.&lt;/li&gt;
&lt;li&gt;Skipping interview prep — solid skills + zero practice = solid skills wasted at the screen.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Worked Problem on Picking Project 2 and Writing the Resume Bullet
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; You've shipped Project 1 (SQL dashboard). Project 2 is the Python ETL — extract from an API, clean, store in PostgreSQL. The repo will be &lt;code&gt;python-api-etl&lt;/code&gt;. The recruiter call is in two weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Sketch the four-file layout for the Project 2 repo plus the one-line resume bullet you'll lead with on the recruiter call. The goal: a stranger should be able to read the repo, run it locally, and understand the work in 5 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: A Four-File Repo Layout and a Metric-Led Resume Bullet
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python-api-etl/
├── README.md           # 60-second pitch + how to run
├── etl.py              # extract / transform / load functions
├── tests/
│   └── test_etl.py     # one test per function
└── requirements.txt    # pinned dependencies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resume bullet (lead with the metric):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Built a Python ETL pipeline that ingests 10K daily API records into a PostgreSQL warehouse with row-level validation and CI-friendly exit codes.&lt;/strong&gt; &lt;em&gt;(github.com/&amp;lt;you&amp;gt;/python-api-etl)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; Four files is the floor — one for documentation, one for code, one for tests, one for dependencies. The &lt;code&gt;README&lt;/code&gt; is what a recruiter sees first; lead with the &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;how to run&lt;/em&gt;, then explain the &lt;em&gt;why&lt;/em&gt;. The resume bullet leads with a quantitative metric (&lt;code&gt;10K daily records&lt;/code&gt;) and ends with the GitHub URL — recruiters scan for both, in that order.&lt;/p&gt;
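&lt;p&gt;A hedged sketch of what &lt;code&gt;etl.py&lt;/code&gt; could look like. The function names, sample records, and the SQLite stand-in for PostgreSQL are all illustrative assumptions, not the actual project code:&lt;/p&gt;

```python
import sqlite3

# Illustrative extract / transform / load pipeline (names are assumptions).
# extract() would normally call the API; here it returns canned records.
def extract():
    return [
        {"id": 1, "email": "a@example.com", "amount": "19.99"},
        {"id": 2, "email": None, "amount": "5.00"},   # fails validation
        {"id": 3, "email": "c@example.com", "amount": "7.50"},
    ]

def transform(records):
    """Row-level validation: drop records with missing emails, cast amounts."""
    clean = []
    for r in records:
        if r["email"] is None:
            continue  # in a real pipeline, log and count the rejects
        clean.append((r["id"], r["email"], float(r["amount"])))
    return clean

def load(rows, conn):
    """Idempotent load keyed on id: INSERT OR REPLACE is SQLite's stand-in
    for PostgreSQL's INSERT ... ON CONFLICT upsert."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, email TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO records VALUES (?, ?, ?)", rows)
    conn.commit()

def run():
    conn = sqlite3.connect(":memory:")
    rows = transform(extract())
    load(rows, conn)
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

&lt;p&gt;Here &lt;code&gt;run()&lt;/code&gt; returns 2: three records extracted, one rejected by validation, two loaded. Each function maps to one test in &lt;code&gt;tests/test_etl.py&lt;/code&gt;.&lt;/p&gt;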

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your GitHub now has a runnable, documented ETL repo. The recruiter receives the link, clicks through, sees the README, and forwards your resume to the hiring manager. The bullet on the resume becomes the first sentence of the recruiter's pitch to the hiring manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of how a recruiter reads it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;recruiter action&lt;/th&gt;
&lt;th&gt;what they see&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;clicks the GitHub link in the resume&lt;/td&gt;
&lt;td&gt;repo home page with the README rendered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;scans the first paragraph of README&lt;/td&gt;
&lt;td&gt;"Daily API → PostgreSQL with validation"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;scrolls to "How to run"&lt;/td&gt;
&lt;td&gt;three commands (&lt;code&gt;git clone&lt;/code&gt;, &lt;code&gt;pip install&lt;/code&gt;, &lt;code&gt;python etl.py&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;clicks &lt;code&gt;etl.py&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;sees three named functions; reads in 30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;clicks &lt;code&gt;tests/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;tests exist; quality signal confirmed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;one repo per project&lt;/strong&gt; — recruiters skim; five clean repos beat one tangled monorepo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;README-first design&lt;/strong&gt; — the home page is the pitch; lead with what + how to run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tests in &lt;code&gt;tests/&lt;/code&gt;&lt;/strong&gt; — even one test per function is a quality signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pinned &lt;code&gt;requirements.txt&lt;/code&gt;&lt;/strong&gt; — anyone can clone and run; no "works on my machine" surprises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;metric-led resume bullet&lt;/strong&gt; — &lt;code&gt;10K daily records&lt;/code&gt; is concrete; "ETL pipeline" alone is generic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cost&lt;/strong&gt; — about a weekend of focused work for the project; 30 minutes for the bullet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for fresher interview reps see &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice page&lt;/a&gt;, and the canonical course path &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browse all data-engineering practice:&lt;/strong&gt; &lt;a href="https://pipecode.ai/explore/practice" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;SQL for Data Engineering Interviews:&lt;/strong&gt; &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Browse all DE courses:&lt;/strong&gt; &lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;View courses →&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips to master the data engineering roadmap (best learning order + timeline)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Follow the order — and the calendar
&lt;/h3&gt;

&lt;p&gt;The 13 steps above have a &lt;strong&gt;best learning order&lt;/strong&gt; that works for most freshers — skip ahead at your own risk. The order plus a realistic timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Order:&lt;/strong&gt; SQL → Python → Databases → Data Warehousing → ETL/ELT → Spark → Airflow → Cloud (AWS) → Data Modeling → Streaming (Kafka) → Projects → Git → Interview prep.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2-3 months&lt;/strong&gt; — SQL + Python basics solid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4-6 months&lt;/strong&gt; — intermediate DE (warehousing, ETL, modeling).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6-9 months&lt;/strong&gt; — job-ready (Airflow, cloud, projects shipped).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9-12 months&lt;/strong&gt; — strong fresher profile (Spark, streaming basics, polished portfolio).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Most freshers fail for the same four reasons — avoid them
&lt;/h3&gt;

&lt;p&gt;The failure modes are predictable. Watch for these in your own routine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jumping to Spark too early.&lt;/strong&gt; Spark is just bigger SQL with more failure modes; without solid SQL it's noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring SQL depth.&lt;/strong&gt; Beyond &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;JOIN&lt;/code&gt;, the bar at the screen is window functions + grain reasoning. Drill them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoiding projects.&lt;/strong&gt; Tutorials and certifications are signals; &lt;em&gt;shipped code on GitHub&lt;/em&gt; is proof.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watching tutorials without practice.&lt;/strong&gt; Watch the video → close it → rebuild the example without it. If you can't, you didn't learn it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The winning formula
&lt;/h3&gt;

&lt;p&gt;Every successful fresher career follows the same five-step loop: &lt;strong&gt;learn → practice → build → publish → interview&lt;/strong&gt;. Pick a topic, drill it in a coding environment, build a small artifact, push to GitHub, then interview for jobs that touch that topic. Repeat for each step in the roadmap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books worth buying
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Designing Data-Intensive Applications&lt;/strong&gt; (Martin Kleppmann) — the modern systems book; read once a quarter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Data Warehouse Toolkit&lt;/strong&gt; (Ralph Kimball and Margy Ross) — the canonical dimensional-modeling reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice page&lt;/a&gt;; the structured paths are &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/courses/python-for-data-engineering-interviews-the-complete-fundamentals" rel="noopener noreferrer"&gt;Python for Data Engineering Interviews — Complete Fundamentals&lt;/a&gt;. After SQL and Python land, drill &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;window functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;joins&lt;/a&gt;, and the deeper &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling course&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design course&lt;/a&gt;. Finish with the peer guides — the &lt;a href="https://pipecode.ai/blogs/airbnb-data-engineering-interview-questions-prep-guide" rel="noopener noreferrer"&gt;Airbnb DE interview guide&lt;/a&gt;, the &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top DE interview questions 2026&lt;/a&gt;, and the &lt;a href="https://pipecode.ai/blogs/sql-data-types-postgresql-guide" rel="noopener noreferrer"&gt;SQL data types Postgres guide&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How long does it really take to become a data engineer in 2026?
&lt;/h3&gt;

&lt;p&gt;If you're consistent, &lt;strong&gt;6-9 months&lt;/strong&gt; at 10-15 hours per week is enough to be &lt;strong&gt;job-ready&lt;/strong&gt; for junior / fresher data-engineering roles; &lt;strong&gt;9-12 months&lt;/strong&gt; produces a &lt;strong&gt;strong fresher profile&lt;/strong&gt; with Spark, streaming basics, and a polished portfolio. The 2-3-month mark is where SQL and Python basics click; 4-6 months gets you through warehousing, ETL, and modeling. The single biggest predictor of speed is &lt;strong&gt;consistency&lt;/strong&gt; — 10 hours a week for 6 months beats 40 hours a week for 6 weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to learn all 13 steps before applying for jobs?
&lt;/h3&gt;

&lt;p&gt;No — start applying as soon as &lt;strong&gt;Steps 1-5&lt;/strong&gt; are solid (SQL, Python, databases, warehousing, ETL/ELT). Roles you can target with the first five steps done: junior data engineer, junior analytics engineer, data engineer intern, ETL developer trainee. Steps 6-9 (Spark, Airflow, Cloud, Modeling) turn "hireable" into "competitive." Steps 10-13 (Streaming, Projects, Git, Interview prep) close the deal. Apply earlier than you think you should — interviewing is itself a skill that needs reps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I master one cloud or learn all three (AWS, Azure, GCP)?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pick one first&lt;/strong&gt; and master its core data services before touching the others. &lt;strong&gt;AWS&lt;/strong&gt; is the most asked at fresher interviews and the most widely deployed in industry — start there. The core AWS services for fresher DE work: &lt;strong&gt;S3&lt;/strong&gt; (object storage), &lt;strong&gt;IAM&lt;/strong&gt; (access control), &lt;strong&gt;Lambda&lt;/strong&gt; (serverless functions), &lt;strong&gt;Glue&lt;/strong&gt; (managed ETL), &lt;strong&gt;Redshift&lt;/strong&gt; (warehouse). Once you have one cloud under your belt, the other two are easy because the concepts (object storage, IAM, serverless, managed ETL, warehouse) are the same — only the names change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Apache Spark required for fresher data-engineering jobs?
&lt;/h3&gt;

&lt;p&gt;For most fresher first jobs, &lt;strong&gt;no&lt;/strong&gt; — but knowing &lt;em&gt;what Spark is&lt;/em&gt; and &lt;em&gt;when it appears&lt;/em&gt; is required. The honest fresher posture: &lt;em&gt;"I've shipped batch ETL with Python and SQL; I know Spark is the next step when data outgrows a single machine; I've done the PySpark Fundamentals tutorial and would learn the rest on the job."&lt;/em&gt; That's enough for 80% of fresher screens. Roles at Spark-heavy shops (Databricks customers, ad-tech, large e-commerce) will test deeper — for those, ship a PySpark project as part of your Step 11 portfolio.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does a data engineer actually do day-to-day?
&lt;/h3&gt;

&lt;p&gt;Day-to-day, a data engineer &lt;strong&gt;writes SQL queries, builds and maintains batch pipelines, models new tables, fixes data quality issues, and reviews other engineers' pipelines&lt;/strong&gt;. A typical week: Monday — investigate a Slack message about a wrong dashboard number (usually a grain or null-handling bug); Tuesday-Wednesday — model a new dimension table for a product launch; Thursday — code review on a teammate's Airflow DAG; Friday — add a quality check that would have caught Monday's bug. Spark, Kafka, and lakehouse architecture appear at scale-heavy companies; the day-to-day at most companies is SQL + modeling + pipelines.&lt;/p&gt;
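&lt;p&gt;Friday's "add a quality check" step usually means a guard query. A minimal sketch, assuming an order table named &lt;code&gt;cur_orders&lt;/code&gt;; each query should return zero rows on a healthy table:&lt;/p&gt;

```sql
-- Null-handling check: orders with no customer key silently vanish from joins.
SELECT order_id
FROM cur_orders
WHERE customer_id IS NULL;

-- Grain check: more than one row per order_id means an upstream fan-out.
SELECT order_id, COUNT(*) AS n
FROM cur_orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```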

&lt;h3&gt;
  
  
  What's the difference between a data engineer, data analyst, and data scientist?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Data engineers build the pipelines and tables&lt;/strong&gt;; analysts query them for business questions; scientists run experiments and ML models on top. In a typical e-commerce team: a DE owns the daily ETL that loads &lt;code&gt;cur_orders&lt;/code&gt;; an analyst writes the SQL behind the daily revenue dashboard; a scientist runs the A/B test that decides whether the new checkout flow ships. The roles overlap on SQL — every analytics person writes it — but only DEs own the &lt;em&gt;infrastructure&lt;/em&gt; that produces the tables everyone else queries. Salaries also follow this stack — DEs are typically paid more than analysts and on par with scientists at most companies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing data engineering interview problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>interview</category>
    </item>
    <item>
      <title>SQL Interview Questions for Data Engineering</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Sun, 10 May 2026 15:45:00 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/sql-interview-questions-for-data-engineering-1d1i</link>
      <guid>https://dev.to/gowthampotureddi/sql-interview-questions-for-data-engineering-1d1i</guid>
      <description>&lt;p&gt;&lt;strong&gt;SQL interview questions for data engineering&lt;/strong&gt; circle around the same four primitives in every loop, regardless of the company name on the JD: &lt;strong&gt;&lt;code&gt;JOIN&lt;/code&gt; semantics with &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; for anti-joins, &lt;code&gt;GROUP BY&lt;/code&gt; with &lt;code&gt;HAVING&lt;/code&gt; for aggregate filters and duplicate detection, window functions like &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, and &lt;code&gt;LEAD&lt;/code&gt; for ranking and lookback, and CTEs (including recursive CTEs) plus correlated subqueries for multi-step logic and top-N-per-group queries&lt;/strong&gt;. Whether the prompt is "find customers who never placed an order", "count duplicate emails", "second-highest salary", or "top 3 salaries per department", the same handful of mental models keeps showing up — interviewers grade fluency with these primitives over memorized syntax.&lt;/p&gt;

&lt;p&gt;This guide walks four topic clusters end-to-end, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, &lt;strong&gt;common beginner mistakes&lt;/strong&gt;, and an &lt;strong&gt;interview-style problem with a full solution&lt;/strong&gt; that traces the query step by step. Every example uses PostgreSQL-flavored syntax — the dialect that drives DataLemur, the live-coding environments at most product-analytics companies, and the bulk of public SQL interview corpora — and every solution ends with a concept-by-concept breakdown that explains why the query is correct, what the cost is, and where beginners typically slip.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxn6v9quqrehug1297dj0.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxn6v9quqrehug1297dj0.webp" alt="SQL interview questions for data engineering header with bold title, joins, GROUP BY, window-function, and CTE chips on a dark gradient, and pipecode.ai attribution." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top SQL data engineering interview topics
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; — one row per &lt;strong&gt;H2&lt;/strong&gt;, every row expanded into a full section with sub-topics, worked examples, an interview question, and a traced solution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Why it shows up in SQL DE interviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt; vs &lt;code&gt;LEFT JOIN&lt;/code&gt; and &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; anti-join&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Combine rows from multiple tables without inflating cardinality; the &lt;code&gt;IS NULL&lt;/code&gt; trick after a &lt;code&gt;LEFT JOIN&lt;/code&gt; is the canonical "find rows in A with no match in B" pattern (orphan customers, churned users).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt; for aggregates and duplicate detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE&lt;/code&gt; filters rows before grouping, &lt;code&gt;HAVING&lt;/code&gt; filters groups after; the &lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; shape is the universal duplicate-finder.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Window functions — &lt;code&gt;ROW_NUMBER&lt;/code&gt; vs &lt;code&gt;RANK&lt;/code&gt; vs &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-partition ranking, top-N-per-group, second-highest salary, running totals, lookback for month-over-month deltas — all without collapsing rows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CTEs, recursive CTEs, and correlated subqueries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WITH&lt;/code&gt; clauses make multi-step logic readable; recursive CTEs generate sequences and traverse hierarchies; correlated subqueries express row-by-row predicates against the same table.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beginner-friendly framing:&lt;/strong&gt; SQL engines process a query roughly in this order — &lt;strong&gt;&lt;code&gt;FROM&lt;/code&gt; / &lt;code&gt;JOIN&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; (row filter) → &lt;code&gt;GROUP BY&lt;/code&gt; → aggregates → &lt;code&gt;HAVING&lt;/code&gt; (group filter) → window functions → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;&lt;/strong&gt;. When in doubt, ask: "Am I filtering one row at a time (&lt;code&gt;WHERE&lt;/code&gt;) or a whole bucket after summing (&lt;code&gt;HAVING&lt;/code&gt;)?" That single question resolves more than half of the parse errors candidates hit on a live screen.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. SQL Joins for Data Engineering — &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, and Anti-Joins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;INNER JOIN&lt;/code&gt;, &lt;code&gt;LEFT JOIN&lt;/code&gt;, and the &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; anti-join in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Find customers who never placed an order" is the signature SQL join interview question — and the cleanest answer is not a &lt;code&gt;NOT IN&lt;/code&gt; subquery or a &lt;code&gt;NOT EXISTS&lt;/code&gt; correlated query but a &lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; with &lt;code&gt;WHERE right.key IS NULL&lt;/code&gt;&lt;/strong&gt;. The mental model: &lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt; keeps only matching pairs, &lt;code&gt;LEFT JOIN&lt;/code&gt; keeps every left row and pads the right side with &lt;code&gt;NULL&lt;/code&gt;s when there is no match, and filtering for &lt;code&gt;right.id IS NULL&lt;/code&gt; after the &lt;code&gt;LEFT JOIN&lt;/code&gt; isolates exactly the left rows that had no match — the anti-join&lt;/strong&gt;. The same primitive surfaces churned-user, never-bought, never-clicked, and never-converted queries across product-analytics interviews.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3v9y8omsq279748kj7m.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3v9y8omsq279748kj7m.webp" alt="Diagram of LEFT JOIN with IS NULL anti-join: a customers table joined to an orders table where unmatched rows surface, isolating orphan customer Carol, with WHERE o.order_id IS NULL highlighted in green." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; &lt;code&gt;LEFT JOIN ... WHERE right.id IS NULL&lt;/code&gt; is generally as fast as or faster than &lt;code&gt;NOT IN (subquery)&lt;/code&gt; — and, more importantly, it stays correct on &lt;code&gt;NULL&lt;/code&gt;-bearing data: &lt;code&gt;NOT IN&lt;/code&gt; evaluates to &lt;code&gt;NULL&lt;/code&gt; (not &lt;code&gt;FALSE&lt;/code&gt;) for any &lt;code&gt;NULL&lt;/code&gt; in the subquery and silently drops every outer row. State this gotcha out loud — interviewers grade the candidate who knows why &lt;code&gt;NOT IN&lt;/code&gt; can return zero rows when the data has a single &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
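&lt;p&gt;The gotcha above is easy to reproduce. The sketch below uses Python's &lt;code&gt;sqlite3&lt;/code&gt; as a runnable stand-in for PostgreSQL (the three-valued-logic behavior of &lt;code&gt;NOT IN&lt;/code&gt; is identical in both engines); the order row with a &lt;code&gt;NULL&lt;/code&gt; &lt;code&gt;customer_id&lt;/code&gt; is an invented trigger row, not part of the article's sample data.&lt;/p&gt;

```python
import sqlite3

# In-memory tables mirroring the article's customers/orders, plus one
# hypothetical order with a NULL customer_id (e.g. a guest checkout).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1), (102, NULL);
""")

# NOT IN against a subquery containing a NULL: every non-matching
# comparison evaluates to NULL rather than TRUE, so zero rows survive.
not_in = conn.execute(
    "SELECT name FROM customers "
    "WHERE id NOT IN (SELECT customer_id FROM orders)"
).fetchall()

# The anti-join is immune: IS NULL is well-defined for NULL values.
anti_join = conn.execute(
    "SELECT c.name FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id "
    "WHERE o.order_id IS NULL "
    "ORDER BY c.name"
).fetchall()

print(not_in)     # [] — the single NULL swallowed every row
print(anti_join)  # [('Bob',), ('Carol',)]
```

Both Bob and Carol lack orders here, yet `NOT IN` reports nobody — exactly the silent zero-row failure the pro tip warns about.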

&lt;h4&gt;
  
  
  &lt;code&gt;INNER JOIN&lt;/code&gt;: keep only matching rows
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;INNER JOIN&lt;/code&gt; invariant: &lt;strong&gt;a row from the left table is paired with a row from the right table iff the join predicate evaluates to &lt;code&gt;TRUE&lt;/code&gt;; rows on either side that have no match are discarded&lt;/strong&gt;. The cardinality of the result is at most &lt;code&gt;|left| × |right|&lt;/code&gt; but typically far less — bounded by the multiplicity of matching keys.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predicate&lt;/strong&gt; — &lt;code&gt;ON l.key = r.key&lt;/code&gt; is the common shape; multi-column predicates are fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No padding&lt;/strong&gt; — unmatched rows on either side are silently dropped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result schema&lt;/strong&gt; — every column from both tables (use aliases to disambiguate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cardinality risk&lt;/strong&gt; — &lt;code&gt;1:N&lt;/code&gt; on the right inflates left rows; &lt;code&gt;N:M&lt;/code&gt; is a Cartesian-by-key explosion.&lt;/li&gt;
&lt;/ul&gt;
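&lt;p&gt;The cardinality-risk bullet is worth seeing concretely. A minimal sketch (Python's &lt;code&gt;sqlite3&lt;/code&gt; standing in for PostgreSQL; one invented customer with three orders) shows the 1:N fan-out and the &lt;code&gt;COUNT(DISTINCT ...)&lt;/code&gt; correction:&lt;/p&gt;

```python
import sqlite3

# Hypothetical 1:N fan-out: a single customer with three orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (101, 1), (102, 1), (103, 1);
""")

# The join emits one output row per matching (left, right) pair,
# so the single customer appears three times.
joined = conn.execute(
    "SELECT c.name FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id"
).fetchall()

# COUNT(DISTINCT ...) recovers the true entity count after the join.
n_customers = conn.execute(
    "SELECT COUNT(DISTINCT c.id) FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id"
).fetchone()[0]

print(len(joined))  # 3 — inflated by the 1:N match
print(n_customers)  # 1
```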

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two tables, &lt;code&gt;customers(id, name)&lt;/code&gt; with rows &lt;code&gt;(1, Alice), (2, Bob), (3, Carol)&lt;/code&gt; and &lt;code&gt;orders(order_id, customer_id, amount)&lt;/code&gt; with rows &lt;code&gt;(101, 1, 50), (102, 1, 30), (103, 2, 80)&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer&lt;/th&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Carol has no row — &lt;code&gt;INNER JOIN&lt;/code&gt; drops her.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; reach for &lt;code&gt;INNER JOIN&lt;/code&gt; when the question is "rows where both sides exist"; it is the smallest, fastest, most common join.&lt;/p&gt;
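&lt;p&gt;The worked example can be checked end to end in a few lines. This sketch runs the same query through Python's &lt;code&gt;sqlite3&lt;/code&gt; (a stand-in for PostgreSQL; inner-join semantics are identical for this query):&lt;/p&gt;

```python
import sqlite3

# Recreate the article's sample tables in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 30), (103, 2, 80);
""")

# INNER JOIN keeps only matching pairs; Carol has no order, so she
# is dropped from the result entirely.
rows = conn.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY o.order_id
""").fetchall()

print(rows)
# [('Alice', 101, 50), ('Alice', 102, 30), ('Bob', 103, 80)]
```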

&lt;h4&gt;
  
  
  &lt;code&gt;LEFT JOIN&lt;/code&gt;: keep every left row, pad the right with &lt;code&gt;NULL&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;LEFT JOIN&lt;/code&gt; invariant: &lt;strong&gt;every row from the left table appears in the output; if the join predicate matches at least one right row, the right columns are filled in; otherwise the right columns are &lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt;. The result has at least &lt;code&gt;|left|&lt;/code&gt; rows and at most &lt;code&gt;|left| × max_right_match&lt;/code&gt; rows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All left rows preserved&lt;/strong&gt; — even unmatched ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right columns are &lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; when there is no match — the key signal for the anti-join trick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RIGHT JOIN&lt;/code&gt;&lt;/strong&gt; — the mirror image; rarely needed, since you can flip the table order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FULL OUTER JOIN&lt;/code&gt;&lt;/strong&gt; — keeps unmatched rows from both sides; less common in interviews.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt;. Carol stays in the output.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer&lt;/th&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;LEFT JOIN&lt;/code&gt; is the right answer whenever the question asks for "every X, with Y when it exists" — a churn report, a coverage report, a left-padded join for downstream pipelines.&lt;/p&gt;
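&lt;p&gt;Running the same worked example confirms the padding behavior — Carol survives with &lt;code&gt;NULL&lt;/code&gt;s on the right side (sketch via Python's &lt;code&gt;sqlite3&lt;/code&gt; as a PostgreSQL stand-in; &lt;code&gt;NULL&lt;/code&gt; surfaces as Python &lt;code&gt;None&lt;/code&gt;):&lt;/p&gt;

```python
import sqlite3

# The article's sample customers and orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 30), (103, 2, 80);
""")

# LEFT JOIN keeps every customer; Carol's right side is NULL-padded.
rows = conn.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.order_id
""").fetchall()

print(rows)
# [('Alice', 101, 50), ('Alice', 102, 30), ('Bob', 103, 80), ('Carol', None, None)]
```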

&lt;h4&gt;
  
  
  &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; anti-join: rows with no match
&lt;/h4&gt;

&lt;p&gt;The anti-join invariant: &lt;strong&gt;a &lt;code&gt;LEFT JOIN&lt;/code&gt; followed by &lt;code&gt;WHERE right.key IS NULL&lt;/code&gt; keeps exactly the left rows for which no right row matched&lt;/strong&gt;. Equivalent in result to &lt;code&gt;NOT EXISTS&lt;/code&gt; and (under &lt;code&gt;NOT NULL&lt;/code&gt; constraints) to &lt;code&gt;NOT IN&lt;/code&gt;, but typically faster than the latter and immune to the &lt;code&gt;NULL&lt;/code&gt;-swallowing bug.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt;&lt;/strong&gt; — preserves every left row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE right.&amp;lt;pk&amp;gt; IS NULL&lt;/code&gt;&lt;/strong&gt; — strips out every left row that &lt;em&gt;did&lt;/em&gt; match, leaving only the unmatched ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NOT EXISTS&lt;/code&gt; equivalent&lt;/strong&gt; — &lt;code&gt;WHERE NOT EXISTS (SELECT 1 FROM right WHERE right.fk = left.pk)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NOT IN&lt;/code&gt; pitfall&lt;/strong&gt; — returns zero rows if the subquery contains a single &lt;code&gt;NULL&lt;/code&gt;; avoid in production data.&lt;/li&gt;
&lt;/ul&gt;
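&lt;p&gt;The equivalence claimed in the bullets above can be verified directly — the anti-join and the &lt;code&gt;NOT EXISTS&lt;/code&gt; form return the same rows (sketch via Python's &lt;code&gt;sqlite3&lt;/code&gt; standing in for PostgreSQL):&lt;/p&gt;

```python
import sqlite3

# The article's sample customers and orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 30), (103, 2, 80);
""")

# Anti-join form: LEFT JOIN, then keep only the NULL-padded rows.
anti_join = conn.execute(
    "SELECT c.name FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id "
    "WHERE o.order_id IS NULL"
).fetchall()

# NOT EXISTS form: the same predicate as a correlated subquery.
not_exists = conn.execute(
    "SELECT c.name FROM customers c "
    "WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)"
).fetchall()

print(anti_join)   # [('Carol',)]
print(not_exists)  # [('Carol',)]
```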

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt;; Carol has no order.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer&lt;/th&gt;
&lt;th&gt;matched_order_id&lt;/th&gt;
&lt;th&gt;passes filter?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;✗ (matched)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;✗ (matched)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;✓ (anti-match)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "find X with no Y" → &lt;code&gt;LEFT JOIN ... WHERE Y.id IS NULL&lt;/code&gt;. Memorize this pattern; it is the single most-asked SQL join shape in data-engineering interviews.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Filtering the right table inside &lt;code&gt;WHERE&lt;/code&gt; after a &lt;code&gt;LEFT JOIN&lt;/code&gt; (e.g., &lt;code&gt;WHERE o.amount &amp;gt; 0&lt;/code&gt;) — silently turns the &lt;code&gt;LEFT JOIN&lt;/code&gt; back into an &lt;code&gt;INNER JOIN&lt;/code&gt; because &lt;code&gt;NULL &amp;gt; 0&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, which fails the filter; move the predicate into the &lt;code&gt;ON&lt;/code&gt; clause to keep unmatched left rows.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;NOT IN (subquery)&lt;/code&gt; when the subquery can return &lt;code&gt;NULL&lt;/code&gt; — drops every outer row.&lt;/li&gt;
&lt;li&gt;Forgetting to alias both sides of the join — &lt;code&gt;id&lt;/code&gt; is ambiguous when both tables have it.&lt;/li&gt;
&lt;li&gt;Joining on the wrong column (&lt;code&gt;o.id = c.id&lt;/code&gt; instead of &lt;code&gt;o.customer_id = c.id&lt;/code&gt;) — produces a Cartesian-flavored mess.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;LEFT JOIN&lt;/code&gt; when &lt;code&gt;INNER JOIN&lt;/code&gt; is correct — leaves spurious &lt;code&gt;NULL&lt;/code&gt; rows in the output.&lt;/li&gt;
&lt;/ul&gt;
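&lt;p&gt;The first mistake in the list — a right-table predicate in &lt;code&gt;WHERE&lt;/code&gt; — is worth watching happen. A sketch using Python's &lt;code&gt;sqlite3&lt;/code&gt; on the article's sample tables, contrasting the broken &lt;code&gt;WHERE&lt;/code&gt; placement with the &lt;code&gt;ON&lt;/code&gt;-clause fix:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 30), (103, 2, 80);
""")

# Mistake: right-table predicate in WHERE. Carol's padded row has
# o.amount = NULL; NULL > 0 is NULL, so she fails the filter and the
# LEFT JOIN silently degrades into an INNER JOIN.
in_where = conn.execute(
    "SELECT c.name FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id "
    "WHERE o.amount > 0"
).fetchall()

# Fix: move the predicate into ON, so unmatched left rows stay
# in the output, NULL-padded.
in_on = conn.execute(
    "SELECT DISTINCT c.name FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id AND o.amount > 0 "
    "ORDER BY c.name"
).fetchall()

print(sorted(in_where))  # [('Alice',), ('Alice',), ('Bob',)] — Carol gone
print(in_on)             # [('Alice',), ('Bob',), ('Carol',)]
```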

&lt;h3&gt;
  
  
  SQL Interview Question on Customers With No Orders
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;customers(id, name)&lt;/code&gt; and &lt;code&gt;orders(order_id, customer_id, amount)&lt;/code&gt;, return the &lt;strong&gt;names of customers who have never placed an order&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;LEFT JOIN ... WHERE orders.order_id IS NULL&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;LEFT JOIN&lt;/code&gt; preserves every customer row regardless of whether a matching order exists; for matched customers, &lt;code&gt;o.order_id&lt;/code&gt; carries a real value; for unmatched customers, the right-side columns are &lt;code&gt;NULL&lt;/code&gt; and &lt;code&gt;o.order_id IS NULL&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt;; filtering on that predicate keeps only the unmatched customers — the anti-join. Single pass over &lt;code&gt;customers&lt;/code&gt;; one keyed lookup into &lt;code&gt;orders&lt;/code&gt; per customer; no subquery materialization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customers.id&lt;/th&gt;
&lt;th&gt;customers.name&lt;/th&gt;
&lt;th&gt;LEFT JOIN orders.order_id&lt;/th&gt;
&lt;th&gt;IS NULL?&lt;/th&gt;
&lt;th&gt;survives?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only Carol's row survives the &lt;code&gt;WHERE&lt;/code&gt; filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; semantics&lt;/strong&gt; — keeps every left row; right side is &lt;code&gt;NULL&lt;/code&gt; when there is no match. This &lt;code&gt;NULL&lt;/code&gt; is the entire signal we filter on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE o.order_id IS NULL&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;o.order_id&lt;/code&gt; is the right-side primary key; it is &lt;code&gt;NULL&lt;/code&gt; only when the join produced a synthetic unmatched row. A real &lt;code&gt;NULL&lt;/code&gt; order id never occurs in the source table because primary keys are &lt;code&gt;NOT NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-join semantics&lt;/strong&gt; — equivalent to &lt;code&gt;NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id)&lt;/code&gt;; modern planners, PostgreSQL included, typically compile both forms to the same hash anti-join plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;NULL&lt;/code&gt;-swallowing&lt;/strong&gt; — unlike &lt;code&gt;NOT IN&lt;/code&gt;, the predicate is &lt;code&gt;IS NULL&lt;/code&gt;, which is well-defined for &lt;code&gt;NULL&lt;/code&gt; values. There is no silent zero-row failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|customers| + |orders|)&lt;/code&gt; time&lt;/strong&gt; — hash-join build on &lt;code&gt;orders.customer_id&lt;/code&gt;, single probe per customer. An index on &lt;code&gt;orders.customer_id&lt;/code&gt; makes this near-linear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL joins practice page&lt;/a&gt; for &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, and anti-join shapes, and the &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;SQL filtering practice page&lt;/a&gt; for the &lt;code&gt;WHERE&lt;/code&gt; vs &lt;code&gt;HAVING&lt;/code&gt; distinction.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. SQL Aggregations and &lt;code&gt;GROUP BY&lt;/code&gt; for Data Engineering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;GROUP BY&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;, and aggregate functions in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Find duplicate emails in the users table" and "find the department with the highest average salary" are the two signature aggregation prompts — and both reduce to &lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt; + an aggregate function + &lt;code&gt;HAVING&lt;/code&gt; filter&lt;/strong&gt;. The mental model: &lt;strong&gt;&lt;code&gt;GROUP BY col&lt;/code&gt; collapses rows that share the same &lt;code&gt;col&lt;/code&gt; value into a single output row; &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(...)&lt;/code&gt;, &lt;code&gt;AVG(...)&lt;/code&gt;, &lt;code&gt;MIN(...)&lt;/code&gt;, &lt;code&gt;MAX(...)&lt;/code&gt; summarize each bucket; &lt;code&gt;WHERE&lt;/code&gt; filters individual rows before grouping; &lt;code&gt;HAVING&lt;/code&gt; filters whole groups after grouping&lt;/strong&gt;. Aggregate predicates belong in &lt;code&gt;HAVING&lt;/code&gt;; row predicates belong in &lt;code&gt;WHERE&lt;/code&gt;. Mixing the two up is the most common parse-error in live coding rounds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomdv9xh4ovdcq5czciha.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomdv9xh4ovdcq5czciha.webp" alt="Diagram of GROUP BY with HAVING COUNT(*) &amp;gt; 1 duplicate detection on the users.email column, showing email buckets with count badges and a Duplicates result card surfacing alice@example.com (3) and bob@example.com (2)." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When the question is "find duplicates", the canonical shape is &lt;code&gt;SELECT key, COUNT(*) FROM t GROUP BY key HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;. State the &lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; half out loud while writing — it signals that you understand &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregates and that you can compose row-level and group-level filters in the right order.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt; — &lt;code&gt;NULL&lt;/code&gt;-aware aggregates
&lt;/h4&gt;

&lt;p&gt;The aggregate-&lt;code&gt;NULL&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt; counts every row including ones with &lt;code&gt;NULL&lt;/code&gt; columns; &lt;code&gt;COUNT(col)&lt;/code&gt; counts only rows where &lt;code&gt;col&lt;/code&gt; is not &lt;code&gt;NULL&lt;/code&gt;; &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt; skip &lt;code&gt;NULL&lt;/code&gt; values entirely; if every value in a group is &lt;code&gt;NULL&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, and &lt;code&gt;MAX&lt;/code&gt; return &lt;code&gt;NULL&lt;/code&gt; (not &lt;code&gt;0&lt;/code&gt;), while &lt;code&gt;COUNT(col)&lt;/code&gt; returns &lt;code&gt;0&lt;/code&gt;&lt;/strong&gt;. Beginners conflate &lt;code&gt;COUNT(*)&lt;/code&gt; and &lt;code&gt;COUNT(col)&lt;/code&gt; and silently report wrong totals.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — every row in the bucket, regardless of &lt;code&gt;NULL&lt;/code&gt;s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(col)&lt;/code&gt;&lt;/strong&gt; — non-&lt;code&gt;NULL&lt;/code&gt; values of &lt;code&gt;col&lt;/code&gt; only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt;&lt;/strong&gt; — unique non-&lt;code&gt;NULL&lt;/code&gt; values; essential after a &lt;code&gt;JOIN&lt;/code&gt; that may have inflated rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM&lt;/code&gt; / &lt;code&gt;AVG&lt;/code&gt;&lt;/strong&gt; — numeric only; &lt;code&gt;AVG&lt;/code&gt; is sum-of-non-null-divided-by-count-of-non-null, so &lt;code&gt;NULL&lt;/code&gt; does &lt;strong&gt;not&lt;/strong&gt; count as &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three rows in one group: &lt;code&gt;amount&lt;/code&gt; = &lt;code&gt;10, NULL, 30&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;aggregate&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUM(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVG(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MIN(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MAX(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_known&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the metric is "people who clicked" use &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt;; if it is "click events" use &lt;code&gt;COUNT(*)&lt;/code&gt;; never confuse the two.&lt;/p&gt;
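&lt;p&gt;All of the &lt;code&gt;NULL&lt;/code&gt; rules above can be verified in one query. A sketch via Python's &lt;code&gt;sqlite3&lt;/code&gt; (standing in for PostgreSQL; the all-&lt;code&gt;NULL&lt;/code&gt; second user is an invented group added to show &lt;code&gt;SUM&lt;/code&gt; returning &lt;code&gt;NULL&lt;/code&gt; rather than &lt;code&gt;0&lt;/code&gt;):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    -- user 1 matches the worked example: amounts 10, NULL, 30.
    INSERT INTO orders VALUES (1, 10), (1, NULL), (1, 30);
    -- user 2 is a hypothetical all-NULL group.
    INSERT INTO orders VALUES (2, NULL);
""")

rows = conn.execute("""
    SELECT user_id,
           COUNT(*)      AS n_rows,
           COUNT(amount) AS n_known,
           SUM(amount)   AS total,
           AVG(amount)   AS mean
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(rows[0])  # (1, 3, 2, 40, 20.0) — NULL skipped; AVG divides by 2, not 3
print(rows[1])  # (2, 1, 0, None, None) — all-NULL group: SUM is NULL, not 0
```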

&lt;h4&gt;
  
  
  &lt;code&gt;WHERE&lt;/code&gt; vs &lt;code&gt;HAVING&lt;/code&gt; — row filter vs group filter
&lt;/h4&gt;

&lt;p&gt;The two-clause invariant: &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; runs before &lt;code&gt;GROUP BY&lt;/code&gt; and references raw row columns only; &lt;code&gt;HAVING&lt;/code&gt; runs after grouping and can reference aggregate functions; trying to use &lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt; is a parse error because &lt;code&gt;COUNT(*)&lt;/code&gt; does not exist until after grouping&lt;/strong&gt;. They are not interchangeable — both can appear in the same query.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; — filter rows; uses &lt;code&gt;col&lt;/code&gt;, &lt;code&gt;col2&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt;&lt;/strong&gt; — filter groups; uses &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(col)&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order of evaluation&lt;/strong&gt; — &lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; — push predicates into &lt;code&gt;WHERE&lt;/code&gt; whenever possible; &lt;code&gt;WHERE&lt;/code&gt; filters before the (often expensive) sort/hash step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;employees(department, salary)&lt;/code&gt; with six rows; ask for departments whose average salary exceeds 50,000 across employees earning more than 30,000.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;40,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;25,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;WHERE salary &amp;gt; 30000&lt;/code&gt; drops the 25k and 20k rows; &lt;code&gt;GROUP BY department&lt;/code&gt; then computes &lt;code&gt;AVG&lt;/code&gt;; &lt;code&gt;HAVING AVG(salary) &amp;gt; 50000&lt;/code&gt; keeps only departments whose surviving rows average above 50k.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; aggregate predicate → &lt;code&gt;HAVING&lt;/code&gt;; row predicate → &lt;code&gt;WHERE&lt;/code&gt;. If the predicate uses &lt;code&gt;SUM&lt;/code&gt; / &lt;code&gt;COUNT&lt;/code&gt; / &lt;code&gt;AVG&lt;/code&gt; / &lt;code&gt;MIN&lt;/code&gt; / &lt;code&gt;MAX&lt;/code&gt;, it must live in &lt;code&gt;HAVING&lt;/code&gt;.&lt;/p&gt;
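&lt;p&gt;&lt;strong&gt;Runnable sketch.&lt;/strong&gt; A minimal end-to-end demo of the &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; pipeline on the worked example, using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; so it runs anywhere — the article's dialect is PostgreSQL, but this particular query is identical in both engines:&lt;/p&gt;

```python
import sqlite3

# In-memory table mirroring the six-row worked example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("eng", 40000), ("eng", 70000), ("eng", 25000),
     ("sales", 60000), ("sales", 60000), ("sales", 20000)],
)

# WHERE drops the 25k and 20k rows before grouping; HAVING then filters
# the per-department averages computed over the surviving rows.
rows = conn.execute("""
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary > 30000
    GROUP BY department
    HAVING AVG(salary) > 50000
    ORDER BY department
""").fetchall()
print(rows)  # [('eng', 55000.0), ('sales', 60000.0)]
```

&lt;p&gt;Both departments survive here: after the row filter, &lt;code&gt;eng&lt;/code&gt; averages 55,000 and &lt;code&gt;sales&lt;/code&gt; 60,000 — but move the 25k row above the &lt;code&gt;WHERE&lt;/code&gt; threshold and &lt;code&gt;eng&lt;/code&gt; drops out, which is exactly the order-of-evaluation point interviewers probe.&lt;/p&gt;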

&lt;h4&gt;
  
  
  &lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; — the universal duplicate finder
&lt;/h4&gt;

&lt;p&gt;The duplicate-detection invariant: &lt;strong&gt;&lt;code&gt;SELECT key, COUNT(*) FROM t GROUP BY key HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; returns every distinct &lt;code&gt;key&lt;/code&gt; value that appears more than once in &lt;code&gt;t&lt;/code&gt;, along with its multiplicity&lt;/strong&gt;. Replace &lt;code&gt;key&lt;/code&gt; with the column you want to dedupe on (email, user_id, order_id), and the same query surfaces every duplicate group.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY key&lt;/code&gt;&lt;/strong&gt; — one bucket per distinct value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — keeps only buckets with at least two rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*) = N&lt;/code&gt;&lt;/strong&gt; — variant for "exactly N copies"; rare but cleanly expressible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite keys&lt;/strong&gt; — &lt;code&gt;GROUP BY a, b, c HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; finds duplicate &lt;code&gt;(a, b, c)&lt;/code&gt; triples.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;users(id, email)&lt;/code&gt; with two &lt;code&gt;alice@example.com&lt;/code&gt; rows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;email&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Group by &lt;code&gt;email&lt;/code&gt;; the &lt;code&gt;alice@example.com&lt;/code&gt; group has &lt;code&gt;COUNT(*) = 2&lt;/code&gt;; the &lt;code&gt;bob@example.com&lt;/code&gt; group has &lt;code&gt;COUNT(*) = 1&lt;/code&gt; and is filtered out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_copies&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every "find duplicates" question reduces to &lt;code&gt;GROUP BY key HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;; reach for &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY key)&lt;/code&gt; only when you need to &lt;em&gt;delete&lt;/em&gt; the duplicates and keep one canonical row.&lt;/p&gt;
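&lt;p&gt;&lt;strong&gt;Runnable sketch.&lt;/strong&gt; The delete-and-keep-one pattern the rule of thumb points at, sketched with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; for a self-contained demo; the &lt;code&gt;id&lt;/code&gt;-based subquery form shown here should also work unchanged in PostgreSQL:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice@example.com"), (2, "bob@example.com"),
                  (3, "alice@example.com")])

# Number each row within its email group (lowest id first), then delete
# everything past rn = 1 so exactly one canonical row per email survives.
conn.execute("""
    DELETE FROM users WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
            FROM users
        ) d
        WHERE rn > 1
    )
""")
remaining = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
print(remaining)  # [(1, 'alice@example.com'), (2, 'bob@example.com')]
```

&lt;p&gt;&lt;code&gt;ORDER BY id&lt;/code&gt; inside the window is the policy knob: it decides which duplicate is canonical (here, the earliest-inserted row).&lt;/p&gt;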

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Writing &lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt; — PostgreSQL rejects it ("aggregate functions are not allowed in WHERE"); &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregates.&lt;/li&gt;
&lt;li&gt;Selecting a non-aggregated, non-&lt;code&gt;GROUP BY&lt;/code&gt; column — strict SQL rejects this; lax dialects pick an arbitrary value silently.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt; after a &lt;code&gt;JOIN&lt;/code&gt; that inflates rows — reports inflated user counts.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;AVG(col)&lt;/code&gt; and forgetting that &lt;code&gt;NULL&lt;/code&gt; rows are excluded from the denominator — wrong for "treat missing as 0" metrics; use &lt;code&gt;AVG(COALESCE(col, 0))&lt;/code&gt; only when the spec says so.&lt;/li&gt;
&lt;li&gt;Putting &lt;code&gt;HAVING&lt;/code&gt; before &lt;code&gt;GROUP BY&lt;/code&gt; — syntax error; the order is &lt;code&gt;WHERE → GROUP BY → HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Duplicate Emails
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;users(id, email)&lt;/code&gt;, return every &lt;strong&gt;email that appears more than once&lt;/strong&gt; in the table, along with the number of copies. Output &lt;code&gt;email&lt;/code&gt; and &lt;code&gt;n_copies&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;GROUP BY email HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_copies&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_copies&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;GROUP BY email&lt;/code&gt; collapses every row with the same email into a single bucket; &lt;code&gt;COUNT(*)&lt;/code&gt; counts how many rows fell into each bucket; &lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; keeps only buckets with at least two rows; &lt;code&gt;ORDER BY n_copies DESC, email&lt;/code&gt; produces a deterministic, reviewer-friendly output. Single pass over &lt;code&gt;users&lt;/code&gt;; sort cost dominates only when the email cardinality is huge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;email&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:carol@example.com"&gt;carol@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FROM users&lt;/code&gt;&lt;/strong&gt; — read all six rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; — every row passes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY email&lt;/code&gt;&lt;/strong&gt; — three buckets: &lt;code&gt;alice&lt;/code&gt; (3 rows), &lt;code&gt;bob&lt;/code&gt; (2 rows), &lt;code&gt;carol&lt;/code&gt; (1 row).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — 3, 2, 1 respectively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — drops the &lt;code&gt;carol&lt;/code&gt; bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY n_copies DESC, email&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;alice&lt;/code&gt; (3), then &lt;code&gt;bob&lt;/code&gt; (2).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;email&lt;/th&gt;
&lt;th&gt;n_copies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY email&lt;/code&gt;&lt;/strong&gt; — collapses to one bucket per distinct email; the bucket is the unit of all subsequent aggregates and group-level filters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — counts every row in the bucket, including ones with &lt;code&gt;NULL&lt;/code&gt; non-key columns; perfect for "how many copies".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — group-level filter; the aggregate predicate must live here, not in &lt;code&gt;WHERE&lt;/code&gt;. This is the precise interview signal for duplicate detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY n_copies DESC, email&lt;/code&gt;&lt;/strong&gt; — deterministic ordering; tie-broken by &lt;code&gt;email&lt;/code&gt; so the output is stable across runs and reviewers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|users| + G log G)&lt;/code&gt; time&lt;/strong&gt; — single hash-aggregation over &lt;code&gt;users&lt;/code&gt; produces &lt;code&gt;G&lt;/code&gt; group rows; the final sort is &lt;code&gt;G log G&lt;/code&gt;. With an index on &lt;code&gt;email&lt;/code&gt;, the planner may use stream aggregation and skip the hash step.&lt;/li&gt;
&lt;/ul&gt;
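&lt;p&gt;&lt;strong&gt;Runnable sketch.&lt;/strong&gt; The full interview query run over the six-row trace table, via Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (the SQL is plain enough to be identical in PostgreSQL); the result matches the output table above:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    (1, "alice@example.com"), (2, "bob@example.com"),
    (3, "alice@example.com"), (4, "carol@example.com"),
    (5, "bob@example.com"), (6, "alice@example.com"),
])

# carol's single-row bucket is filtered out by HAVING; the ORDER BY makes
# the two surviving groups come back in a deterministic order.
dupes = conn.execute("""
    SELECT email, COUNT(*) AS n_copies
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY n_copies DESC, email
""").fetchall()
print(dupes)  # [('alice@example.com', 3), ('bob@example.com', 2)]
```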

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation practice page&lt;/a&gt; for &lt;code&gt;GROUP BY&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt; shapes, and the &lt;a href="https://pipecode.ai/explore/practice/topic/null-handling/sql" rel="noopener noreferrer"&gt;SQL null-handling practice page&lt;/a&gt; for &lt;code&gt;NULL&lt;/code&gt;-aware aggregates.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. SQL Window Functions for Data Engineering — &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Window functions for ranking and lookback in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Find the second-highest salary" and "find the top 3 salaries per department" are the two signature window-function prompts — and both reduce to a &lt;strong&gt;&lt;code&gt;DENSE_RANK() OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt; filter&lt;/strong&gt;. The mental model: &lt;strong&gt;a window function computes a value across a set of rows ("the window") that are related to the current row, without collapsing the rows like &lt;code&gt;GROUP BY&lt;/code&gt; does; &lt;code&gt;OVER (PARTITION BY col)&lt;/code&gt; defines the window boundary; &lt;code&gt;OVER (ORDER BY col)&lt;/code&gt; defines the order within the window&lt;/strong&gt;. &lt;code&gt;ROW_NUMBER&lt;/code&gt; assigns unique sequential numbers; &lt;code&gt;RANK&lt;/code&gt; skips after ties (&lt;code&gt;1, 2, 2, 4&lt;/code&gt;); &lt;code&gt;DENSE_RANK&lt;/code&gt; does not skip (&lt;code&gt;1, 2, 2, 3&lt;/code&gt;). &lt;code&gt;LAG&lt;/code&gt; looks back; &lt;code&gt;LEAD&lt;/code&gt; looks forward. These five primitives drive almost every "ranking" or "lookback" SQL interview question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdubw1ag5050fsp1g98zg.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdubw1ag5050fsp1g98zg.webp" alt="Side-by-side comparison of ROW_NUMBER, RANK, and DENSE_RANK on a salary ladder with tied rows for Bob and Carol, showing 1-2-3-4 vs 1-2-2-4 vs 1-2-2-3 ranking outputs and a +2 skip annotation for RANK." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When the question is "second-highest salary", reach for &lt;code&gt;DENSE_RANK&lt;/code&gt; over &lt;code&gt;RANK&lt;/code&gt; — &lt;code&gt;DENSE_RANK = 2&lt;/code&gt; reliably means "second distinct salary" even when ties exist at the top, while &lt;code&gt;RANK = 2&lt;/code&gt; skips entirely if there are two rows tied for first. State this distinction; interviewers grade it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;ROW_NUMBER&lt;/code&gt; — unique sequential numbering per partition
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;ROW_NUMBER&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;ROW_NUMBER() OVER (PARTITION BY p ORDER BY o)&lt;/code&gt; assigns a unique integer &lt;code&gt;1, 2, 3, ...&lt;/code&gt; to every row inside each partition &lt;code&gt;p&lt;/code&gt;, ordered by &lt;code&gt;o&lt;/code&gt;; ties in &lt;code&gt;o&lt;/code&gt; are broken arbitrarily by the planner&lt;/strong&gt;. Use it when you need a unique sequence per group regardless of tie semantics — most often for deduplication.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OVER (PARTITION BY ...)&lt;/code&gt;&lt;/strong&gt; — bucket the rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OVER (ORDER BY ...)&lt;/code&gt;&lt;/strong&gt; — order within the bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ties broken arbitrarily&lt;/strong&gt; — add a tiebreaker column to &lt;code&gt;ORDER BY&lt;/code&gt; for determinism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-N-per-group&lt;/strong&gt; — &lt;code&gt;WHERE rn &amp;lt;= N&lt;/code&gt; after &lt;code&gt;ROW_NUMBER&lt;/code&gt;; works only when ties at rank N are not desired.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;employees(department, name, salary)&lt;/code&gt; with three engineers; rank by salary desc.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;row_number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bob and Carol tie on salary; &lt;code&gt;ROW_NUMBER&lt;/code&gt; still gives them unique ranks (planner-chosen).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;ROW_NUMBER&lt;/code&gt; is the right tool for &lt;em&gt;deduplication&lt;/em&gt; (&lt;code&gt;WHERE rn = 1&lt;/code&gt;) and for ordered streams; reach for &lt;code&gt;RANK&lt;/code&gt; or &lt;code&gt;DENSE_RANK&lt;/code&gt; when ties must be honored.&lt;/p&gt;
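&lt;p&gt;&lt;strong&gt;Runnable sketch.&lt;/strong&gt; The per-partition numbering above, demoed with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (window functions work the same way here as in PostgreSQL; the &lt;code&gt;name&lt;/code&gt; tiebreaker makes the Bob/Carol ordering deterministic instead of planner-chosen):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (department TEXT, name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("eng", "Alice", 90000), ("eng", "Bob", 80000),
                  ("eng", "Carol", 80000)])

# ROW_NUMBER restarts at 1 inside each department; salary DESC orders the
# bucket, and name breaks the Bob/Carol tie deterministically.
ranked = conn.execute("""
    SELECT department, name, salary,
           ROW_NUMBER() OVER (PARTITION BY department
                              ORDER BY salary DESC, name) AS rn
    FROM employees
    ORDER BY rn
""").fetchall()
print(ranked)
# [('eng', 'Alice', 90000, 1), ('eng', 'Bob', 80000, 2), ('eng', 'Carol', 80000, 3)]
```

&lt;p&gt;Filtering this result on &lt;code&gt;rn = 1&lt;/code&gt; (in a subquery or CTE) is the standard dedup-keep-one shape.&lt;/p&gt;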

&lt;h4&gt;
  
  
  &lt;code&gt;RANK&lt;/code&gt; vs &lt;code&gt;DENSE_RANK&lt;/code&gt; — tie semantics
&lt;/h4&gt;

&lt;p&gt;The rank-vs-dense-rank invariant: &lt;strong&gt;both assign the same rank to tied rows; &lt;code&gt;RANK&lt;/code&gt; then skips the next &lt;code&gt;k-1&lt;/code&gt; ranks (gap), while &lt;code&gt;DENSE_RANK&lt;/code&gt; continues without a gap&lt;/strong&gt;. For "find the Nth distinct value" questions, &lt;code&gt;DENSE_RANK = N&lt;/code&gt; is the correct filter; for "find the Nth row in order" questions, &lt;code&gt;ROW_NUMBER = N&lt;/code&gt; is correct.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;1, 2, 2, 4&lt;/code&gt; — skips after ties.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;1, 2, 2, 3&lt;/code&gt; — no skip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;1, 2, 3, 4&lt;/code&gt; — never ties.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick by question semantics&lt;/strong&gt; — "Nth highest distinct salary" → &lt;code&gt;DENSE_RANK = N&lt;/code&gt;; "Nth-highest-salaried row in ranking order with skips" → &lt;code&gt;RANK = N&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same &lt;code&gt;employees&lt;/code&gt;, plus a third tied row.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;rank&lt;/th&gt;
&lt;th&gt;dense_rank&lt;/th&gt;
&lt;th&gt;row_number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;RANK&lt;/code&gt; jumps &lt;code&gt;2 → 4&lt;/code&gt;; &lt;code&gt;DENSE_RANK&lt;/code&gt; continues &lt;code&gt;2 → 3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rnk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "second highest salary" → &lt;code&gt;DENSE_RANK = 2&lt;/code&gt;; "top 3 distinct salaries" → &lt;code&gt;DENSE_RANK &amp;lt;= 3&lt;/code&gt;; never use &lt;code&gt;RANK&lt;/code&gt; for these unless the spec explicitly says ties should consume rank slots.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;LAG&lt;/code&gt; and &lt;code&gt;LEAD&lt;/code&gt; — lookback and lookahead
&lt;/h4&gt;

&lt;p&gt;The lookback invariant: &lt;strong&gt;&lt;code&gt;LAG(col) OVER (PARTITION BY p ORDER BY o)&lt;/code&gt; returns the value of &lt;code&gt;col&lt;/code&gt; in the previous row within the partition (or &lt;code&gt;NULL&lt;/code&gt; for the first row); &lt;code&gt;LEAD&lt;/code&gt; returns the next row's value&lt;/strong&gt;. Both take an optional &lt;code&gt;offset&lt;/code&gt; (default 1) and an optional &lt;code&gt;default&lt;/code&gt; value to substitute for &lt;code&gt;NULL&lt;/code&gt;. They power running deltas, month-over-month growth, sessionization gap detection, and previous-event-aware analytics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(col, n, dflt)&lt;/code&gt;&lt;/strong&gt; — value &lt;code&gt;n&lt;/code&gt; rows back; &lt;code&gt;dflt&lt;/code&gt; (default &lt;code&gt;NULL&lt;/code&gt;) when out of range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEAD(col, n, dflt)&lt;/code&gt;&lt;/strong&gt; — symmetric forward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY user_id&lt;/code&gt;&lt;/strong&gt; — restart the lookback at each user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col - LAG(col)&lt;/code&gt;&lt;/strong&gt; — the delta-from-previous-row idiom.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;sales(sales_date, amount)&lt;/code&gt; with a contiguous month series; compute month-over-month delta.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sales_date&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;th&gt;lag_amount&lt;/th&gt;
&lt;th&gt;delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-01-01&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-02-01&lt;/td&gt;
&lt;td&gt;130&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-03-01&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;td&gt;130&lt;/td&gt;
&lt;td&gt;-10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first row's &lt;code&gt;LAG&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;; subtraction yields &lt;code&gt;NULL&lt;/code&gt;; consumers usually &lt;code&gt;COALESCE(delta, 0)&lt;/code&gt; for display.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mom_delta&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;LAG&lt;/code&gt; for "compare this row to its predecessor" (delta, retention, gap); &lt;code&gt;LEAD&lt;/code&gt; for "what happens next" (sessionization, churn-from-here). Always &lt;code&gt;PARTITION BY&lt;/code&gt; the entity if the table holds multiple series.&lt;/p&gt;
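&lt;p&gt;&lt;strong&gt;Runnable sketch.&lt;/strong&gt; The month-over-month delta idiom on the three-row &lt;code&gt;sales&lt;/code&gt; table, using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (same window syntax as PostgreSQL; dates are stored as text here, which is an sqlite-demo simplification):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sales_date TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2026-01-01", 100), ("2026-02-01", 130),
                  ("2026-03-01", 120)])

# LAG(amount) looks one row back in date order; the first row has no
# predecessor, so both prev_amount and the delta come back NULL (None).
deltas = conn.execute("""
    SELECT sales_date, amount,
           LAG(amount) OVER (ORDER BY sales_date) AS prev_amount,
           amount - LAG(amount) OVER (ORDER BY sales_date) AS mom_delta
    FROM sales
    ORDER BY sales_date
""").fetchall()
for row in deltas:
    print(row)
# ('2026-01-01', 100, None, None)
# ('2026-02-01', 130, 100, 30)
# ('2026-03-01', 120, 130, -10)
```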

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;RANK&lt;/code&gt; when the question wants the Nth &lt;em&gt;distinct&lt;/em&gt; value — &lt;code&gt;RANK = 2&lt;/code&gt; skips entirely if two rows tie for first.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;PARTITION BY&lt;/code&gt; for a per-group ranking — produces a global ranking instead of per-department.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;WHERE rn = 2&lt;/code&gt; directly without wrapping in a subquery or CTE — window functions cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt; (they run after &lt;code&gt;WHERE&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Confusing &lt;code&gt;LAG&lt;/code&gt; (previous) with &lt;code&gt;LEAD&lt;/code&gt; (next) — quietly produces inverted deltas.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;ORDER BY&lt;/code&gt; inside &lt;code&gt;OVER&lt;/code&gt; — required for &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;; the result is non-deterministic without it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Second-Highest Salary
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;employees(emp_id, name, salary)&lt;/code&gt;, return the &lt;strong&gt;second-highest distinct salary&lt;/strong&gt;. If there is no second-highest distinct salary (e.g., all employees earn the same), return &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;DENSE_RANK() OVER (ORDER BY salary DESC)&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;second_highest_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;DENSE_RANK() OVER (ORDER BY salary DESC)&lt;/code&gt; numbers each row by its salary rank, with no gaps after ties — &lt;code&gt;dr = 1&lt;/code&gt; is the highest distinct salary, &lt;code&gt;dr = 2&lt;/code&gt; is the second-highest distinct salary; the outer &lt;code&gt;WHERE dr = 2&lt;/code&gt; filters to that group; &lt;code&gt;MAX(salary)&lt;/code&gt; collapses to a single row and (critically) returns &lt;code&gt;NULL&lt;/code&gt; if no rows match — handling the "no second-highest" edge case gracefully, where a &lt;code&gt;LIMIT 1 OFFSET 1&lt;/code&gt; approach would return an empty result set instead of &lt;code&gt;NULL&lt;/code&gt; (and, without &lt;code&gt;DISTINCT&lt;/code&gt;, the wrong value on ties).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;emp_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inner &lt;code&gt;SELECT&lt;/code&gt;&lt;/strong&gt; reads all four rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK() OVER (ORDER BY salary DESC)&lt;/code&gt;&lt;/strong&gt; assigns: Alice → 1, Bob → 2, Carol → 2, Dan → 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outer &lt;code&gt;WHERE dr = 2&lt;/code&gt;&lt;/strong&gt; keeps Bob and Carol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX(salary)&lt;/code&gt;&lt;/strong&gt; collapses to a single row: 80,000.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-match path&lt;/strong&gt; — if every employee earned 90,000, the inner query has only &lt;code&gt;dr = 1&lt;/code&gt; rows; outer &lt;code&gt;WHERE dr = 2&lt;/code&gt; filters to zero rows; &lt;code&gt;MAX(salary)&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt; (the spec).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;second_highest_salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK()&lt;/code&gt; over &lt;code&gt;RANK()&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;DENSE_RANK = 2&lt;/code&gt; is the second distinct salary, regardless of how many people tie for first; &lt;code&gt;RANK = 2&lt;/code&gt; would not exist if two people tied for first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OVER (ORDER BY salary DESC)&lt;/code&gt;&lt;/strong&gt; — single global window, ordered by salary descending; no &lt;code&gt;PARTITION BY&lt;/code&gt; because the question is over the whole table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subquery wrapper&lt;/strong&gt; — required because window functions cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt;; the outer query reads the materialized rank.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX(salary)&lt;/code&gt; collapses ties&lt;/strong&gt; — when &lt;code&gt;dr = 2&lt;/code&gt; matches multiple rows (a tie), &lt;code&gt;MAX&lt;/code&gt; returns one value; when &lt;code&gt;dr = 2&lt;/code&gt; matches zero rows, &lt;code&gt;MAX&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt; — both edge cases handled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N log N)&lt;/code&gt; time&lt;/strong&gt; — sort by salary dominates; ranking is &lt;code&gt;O(N)&lt;/code&gt; over the sorted stream.&lt;/li&gt;
&lt;/ul&gt;
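&lt;p&gt;For contrast, the same &lt;code&gt;NULL&lt;/code&gt;-on-no-match behavior is available without window functions; a sketch of the classic nested-&lt;code&gt;MAX&lt;/code&gt; form interviewers sometimes ask for as a follow-up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Largest salary strictly below the maximum; MAX over zero rows is NULL,
-- so the "everyone earns the same" edge case is handled for free.
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary &amp;lt; (SELECT MAX(salary) FROM employees);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;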

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window-functions practice page&lt;/a&gt; for &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, and &lt;code&gt;LEAD&lt;/code&gt; shapes, and the &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions/sql" rel="noopener noreferrer"&gt;SQL date-functions practice page&lt;/a&gt; for time-series window patterns.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. SQL CTEs and Subqueries for Data Engineering — &lt;code&gt;WITH&lt;/code&gt;, Recursive CTEs, and Correlated Subqueries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CTE composition, recursive CTEs, and correlated subqueries in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Find the top 3 salaries per department" and "find employees earning above their department average" are the two signature CTE-and-subquery prompts — and they showcase the two complementary patterns. The mental model: &lt;strong&gt;a CTE (&lt;code&gt;WITH name AS (SELECT ...)&lt;/code&gt;) names an intermediate result you reference like a table; a recursive CTE (&lt;code&gt;WITH RECURSIVE&lt;/code&gt;) repeatedly evaluates a base case plus a recursive case, terminating when no new rows are added; a correlated subquery is a subquery whose &lt;code&gt;WHERE&lt;/code&gt; clause references the outer query's alias, re-evaluating per outer row&lt;/strong&gt;. CTEs win on readability for multi-step logic; correlated subqueries win on per-row predicates against the same table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy881v79sro8ehr101die.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy881v79sro8ehr101die.webp" alt="Two-panel diagram: left shows a WITH RECURSIVE CTE generating the integer sequence 1 through 5 with a stepped accumulator, right shows a per-department DENSE_RANK CTE filtering top-3 salaries per department with eng and sales partitions and an Eve row marked dr &amp;gt; 3 — filtered." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When the prompt is "top N per group", reach for the CTE-plus-&lt;code&gt;DENSE_RANK&lt;/code&gt; pattern, not &lt;code&gt;LIMIT N&lt;/code&gt; — &lt;code&gt;LIMIT&lt;/code&gt; does not respect partitions and only works on the global stream. The CTE makes the per-group ranking explicit and trivial to reason about; the alternative correlated-subquery &lt;code&gt;WHERE col &amp;gt;= (SELECT col FROM t2 ... LIMIT 1 OFFSET N - 1)&lt;/code&gt; is an interview red flag.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;WITH name AS (SELECT ...)&lt;/code&gt; — non-recursive CTEs for readability
&lt;/h4&gt;

&lt;p&gt;The CTE invariant: &lt;strong&gt;&lt;code&gt;WITH name AS (SELECT ...) SELECT ... FROM name&lt;/code&gt; defines an intermediate result (a "common table expression") that subsequent &lt;code&gt;SELECT&lt;/code&gt;s reference like a table; the engine may inline or materialize it depending on cost; multiple CTEs can be chained in a single &lt;code&gt;WITH&lt;/code&gt; clause separated by commas&lt;/strong&gt;. CTEs let you build up complex queries step by step without nesting subqueries five layers deep.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single CTE&lt;/strong&gt; — &lt;code&gt;WITH a AS (...) SELECT * FROM a&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chained CTEs&lt;/strong&gt; — &lt;code&gt;WITH a AS (...), b AS (SELECT * FROM a WHERE ...) SELECT * FROM b&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-reference&lt;/strong&gt; — a CTE can be referenced multiple times in the main query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No mandatory materialization&lt;/strong&gt; — modern planners often inline; PostgreSQL 12+ removed the implicit optimization fence.&lt;/li&gt;
&lt;/ul&gt;
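&lt;p&gt;The last two bullets can be sketched in one statement; &lt;code&gt;MATERIALIZED&lt;/code&gt; / &lt;code&gt;NOT MATERIALIZED&lt;/code&gt; are the PostgreSQL 12+ keywords for overriding the planner's inlining choice (the CTE names &lt;code&gt;dept_stats&lt;/code&gt; and &lt;code&gt;rich_depts&lt;/code&gt; are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH dept_stats AS MATERIALIZED (   -- force one evaluation, shared below
    SELECT department, AVG(salary) AS avg_sal
    FROM employees
    GROUP BY department
),
rich_depts AS (                     -- chained CTE referencing dept_stats
    SELECT department FROM dept_stats WHERE avg_sal &amp;gt; 50000
)
SELECT *                            -- dept_stats referenced a second time
FROM dept_stats
WHERE department IN (SELECT department FROM rich_depts);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;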

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Find departments where the average salary is above 50,000, then list every employee in those departments.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;output rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;high_paying&lt;/code&gt; CTE&lt;/td&gt;
&lt;td&gt;departments with &lt;code&gt;AVG(salary) &amp;gt; 50000&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;outer &lt;code&gt;SELECT&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;every employee whose department is in &lt;code&gt;high_paying&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;high_paying&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
    &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;high_paying&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you find yourself nesting a subquery three levels deep, refactor to a CTE — the engine produces the same plan but the human review takes a tenth of the time.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;WITH RECURSIVE&lt;/code&gt; — recursive CTEs for sequences and hierarchies
&lt;/h4&gt;

&lt;p&gt;The recursive-CTE invariant: &lt;strong&gt;a &lt;code&gt;WITH RECURSIVE&lt;/code&gt; CTE has two parts joined by &lt;code&gt;UNION ALL&lt;/code&gt;: an anchor query (base case, evaluated once) and a recursive query (refers to the CTE itself, evaluated repeatedly until it returns no new rows); the planner accumulates results across iterations until termination&lt;/strong&gt;. Use it for sequence generation (1..N, dates), hierarchy traversal (org charts, BOM trees, parent-child relationships), and graph-style reachability queries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anchor (base)&lt;/strong&gt; — &lt;code&gt;SELECT 1&lt;/code&gt; or &lt;code&gt;SELECT root_id FROM table WHERE parent IS NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UNION ALL&lt;/code&gt;&lt;/strong&gt; — combines anchor and recursive output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive query&lt;/strong&gt; — references the CTE name; must converge (return zero rows eventually).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Termination&lt;/strong&gt; — the engine stops when the recursive step produces no new rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Generate the integers 1 through 5.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;iteration&lt;/th&gt;
&lt;th&gt;new row&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;anchor&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;step 1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;step 2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;step 3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;step 4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;step 5&lt;/td&gt;
&lt;td&gt;(none — &lt;code&gt;n = 5&lt;/code&gt; fails the predicate)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;nums&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always include a termination predicate (&lt;code&gt;WHERE n &amp;lt; N&lt;/code&gt;); a missing or wrong predicate produces an infinite loop that runs until the server exhausts memory or a statement timeout kills it.&lt;/p&gt;
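&lt;p&gt;The same anchor-plus-step pattern generates a date series, the data-engineering workhorse mentioned above (the date range here is illustrative; in PostgreSQL, &lt;code&gt;DATE + integer&lt;/code&gt; advances by days):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH RECURSIVE days(d) AS (
    SELECT DATE '2024-01-01'             -- anchor: first day
    UNION ALL
    SELECT d + 1                         -- step: next day
    FROM days
    WHERE d &amp;lt; DATE '2024-01-07'          -- termination predicate
)
SELECT d FROM days;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;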

&lt;h4&gt;
  
  
  Correlated subqueries — per-row predicates against the same table
&lt;/h4&gt;

&lt;p&gt;The correlated-subquery invariant: &lt;strong&gt;a subquery whose &lt;code&gt;WHERE&lt;/code&gt; clause references a column of the outer query is re-evaluated for every outer row; this enables predicates like "salary above this row's department average" without a &lt;code&gt;JOIN&lt;/code&gt;&lt;/strong&gt;. Powerful but expensive; the planner sometimes rewrites them into joins or hash-aggregations automatically.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outer alias reference&lt;/strong&gt; — &lt;code&gt;WHERE inner.dept = outer.dept&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EXISTS&lt;/code&gt;&lt;/strong&gt; — short-circuit-friendly; stop on first match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NOT EXISTS&lt;/code&gt;&lt;/strong&gt; — anti-join equivalent of &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt;; &lt;code&gt;NULL&lt;/code&gt;-safe, unlike &lt;code&gt;NOT IN&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; — index the inner referenced column; otherwise the planner runs an inner scan per outer row.&lt;/li&gt;
&lt;/ul&gt;
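&lt;p&gt;The &lt;code&gt;EXISTS&lt;/code&gt; / &lt;code&gt;NOT EXISTS&lt;/code&gt; bullets in code, as a sketch assuming a hypothetical &lt;code&gt;orders(order_id, emp_id)&lt;/code&gt; table that is not part of the running schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Employees with at least one order; stops scanning at the first match.
SELECT name
FROM employees e
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.emp_id = e.emp_id);

-- Employees with no orders; a NULL-safe anti-join, unlike NOT IN.
SELECT name
FROM employees e
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.emp_id = e.emp_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;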

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Find employees whose salary is above their department's average.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;dept&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;dept_avg&lt;/th&gt;
&lt;th&gt;survives?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;td&gt;76,667&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;76,667&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;td&gt;76,667&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; prefer a CTE + window function (&lt;code&gt;AVG(salary) OVER (PARTITION BY department)&lt;/code&gt;) for performance on large tables; the correlated subquery is the &lt;em&gt;clearer&lt;/em&gt; but &lt;em&gt;slower&lt;/em&gt; form. Interviewers grade you on knowing both.&lt;/p&gt;
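&lt;p&gt;For reference, the CTE-plus-window rewrite the rule of thumb recommends: one pass computes every department average, then the outer query compares each row to it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH with_avg AS (
    SELECT name,
           department,
           salary,
           AVG(salary) OVER (PARTITION BY department) AS dept_avg
    FROM employees
)
SELECT name, department, salary
FROM with_avg
WHERE salary &amp;gt; dept_avg;   -- window column usable here, one level up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;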

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Forgetting &lt;code&gt;WITH RECURSIVE&lt;/code&gt; for a recursive CTE — non-&lt;code&gt;RECURSIVE&lt;/code&gt; CTEs cannot self-reference; parse error.&lt;/li&gt;
&lt;li&gt;Missing the termination predicate in &lt;code&gt;WITH RECURSIVE&lt;/code&gt; — infinite loop.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;LIMIT N&lt;/code&gt; for top-N-per-group — &lt;code&gt;LIMIT&lt;/code&gt; is global; it does not respect partitions.&lt;/li&gt;
&lt;li&gt;Writing a correlated subquery without an index on the inner column — quadratic blow-up on large tables.&lt;/li&gt;
&lt;li&gt;Re-materializing the same CTE under different names — the planner may run it twice; define one shared CTE and reference it from both places instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Top 3 Salaries Per Department
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;employees(emp_id, name, department, salary)&lt;/code&gt;, return the &lt;strong&gt;top 3 distinct salaries per department&lt;/strong&gt;, with ties at rank 3 included. Output &lt;code&gt;department&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;salary&lt;/code&gt;, and the rank.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC)&lt;/code&gt; in a CTE
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the CTE &lt;code&gt;ranked&lt;/code&gt; materializes a per-department &lt;code&gt;DENSE_RANK&lt;/code&gt; keyed by salary descending — &lt;code&gt;dr = 1&lt;/code&gt; is the highest distinct salary in that department, &lt;code&gt;dr = 2&lt;/code&gt; is the second-highest, and so on; the outer &lt;code&gt;WHERE dr &amp;lt;= 3&lt;/code&gt; keeps every row whose salary is in the top three distinct salaries of its department, including all ties at rank 3; the &lt;code&gt;ORDER BY&lt;/code&gt; produces a deterministic, reviewer-friendly output. &lt;code&gt;DENSE_RANK&lt;/code&gt; over &lt;code&gt;RANK&lt;/code&gt; because we want the top three &lt;em&gt;distinct&lt;/em&gt; salaries; &lt;code&gt;DENSE_RANK&lt;/code&gt; over &lt;code&gt;ROW_NUMBER&lt;/code&gt; because ties at rank 3 must be retained (the spec).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;emp_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Grace&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Heidi&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CTE &lt;code&gt;ranked&lt;/code&gt;&lt;/strong&gt; — partition by &lt;code&gt;department&lt;/code&gt;; order by &lt;code&gt;salary DESC&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt; per partition&lt;/strong&gt; — eng: Alice → 1, Bob → 2, Carol → 2, Dan → 3, Eve → 4. sales: Frank → 1, Grace → 2, Heidi → 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outer &lt;code&gt;WHERE dr &amp;lt;= 3&lt;/code&gt;&lt;/strong&gt; — drops Eve (&lt;code&gt;dr = 4&lt;/code&gt;); keeps both Bob and Carol (tied at 2) and Dan (3).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY department, dr, name&lt;/code&gt;&lt;/strong&gt; — eng rows first, then sales; within department by &lt;code&gt;dr&lt;/code&gt;, then &lt;code&gt;name&lt;/code&gt; for tiebreak.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;dr&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;70000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;100000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;Grace&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;Heidi&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CTE &lt;code&gt;ranked&lt;/code&gt;&lt;/strong&gt; — names the intermediate ranked result; the outer query then filters it like a regular table. Far cleaner than a nested subquery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY department&lt;/code&gt;&lt;/strong&gt; — restarts the rank at each department boundary; without this, the rank is global and the answer is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY salary DESC&lt;/code&gt;&lt;/strong&gt; — defines "highest first" inside each partition; required for any deterministic ranking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt; over &lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; — the spec wants the top three &lt;em&gt;distinct&lt;/em&gt; salaries; &lt;code&gt;RANK&lt;/code&gt; would skip after ties and miss the third distinct salary if there is a two-way tie above it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE dr &amp;lt;= 3&lt;/code&gt; in the outer query&lt;/strong&gt; — window functions cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt;; the CTE provides the materialized column the outer query can filter on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N log N)&lt;/code&gt; time&lt;/strong&gt; — sort within each partition dominates; with an index on &lt;code&gt;(department, salary DESC)&lt;/code&gt; the planner can stream rather than sort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;SQL CTE practice problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/subqueries/sql" rel="noopener noreferrer"&gt;SQL subquery practice problems&lt;/a&gt; on PipeCode.&lt;/p&gt;





&lt;h2&gt;
  
  
  Tips to crack SQL interviews for data engineering roles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Master the four primitives — joins, aggregates, windows, CTEs
&lt;/h3&gt;

&lt;p&gt;If you can write &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; for orphans, &lt;code&gt;GROUP BY ... HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; for duplicates, &lt;code&gt;DENSE_RANK() OVER (PARTITION BY ...)&lt;/code&gt; for top-N-per-group, and a &lt;code&gt;WITH RECURSIVE&lt;/code&gt; CTE for sequence generation without thinking — you can pass most fresher and mid-level data-engineering SQL rounds. These four primitives compose into 80% of the questions you will see; the remaining 20% is &lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt; lookback, &lt;code&gt;COALESCE&lt;/code&gt; null-safety, and date arithmetic.&lt;/p&gt;
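&lt;p&gt;The four primitives in miniature, sketched against a hypothetical &lt;code&gt;customers&lt;/code&gt; / &lt;code&gt;orders&lt;/code&gt; schema (table and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 1. Anti-join: customers with no orders
SELECT c.customer_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;

-- 2. Duplicate finder
SELECT email, COUNT(*)
FROM customers
GROUP BY email
HAVING COUNT(*) &amp;gt; 1;

-- 3. Top-3 orders per customer
SELECT *
FROM (SELECT o.*,
             DENSE_RANK() OVER (PARTITION BY customer_id
                                ORDER BY amount DESC) AS dr
      FROM orders o) ranked
WHERE dr &amp;lt;= 3;

-- 4. Sequence generation: every day of January 2026
WITH RECURSIVE days AS (
  SELECT DATE '2026-01-01' AS d
  UNION ALL
  SELECT d + 1 FROM days WHERE d &amp;lt; DATE '2026-01-31'
)
SELECT d FROM days;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;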

&lt;h3&gt;
  
  
  Know the difference between &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;, and the &lt;code&gt;OVER&lt;/code&gt; clause
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;WHERE&lt;/code&gt; filters individual rows before grouping; &lt;code&gt;HAVING&lt;/code&gt; filters groups after aggregation; &lt;code&gt;OVER (...)&lt;/code&gt; defines a window for a window function and runs after &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;GROUP BY&lt;/code&gt;, and &lt;code&gt;HAVING&lt;/code&gt; but before the final &lt;code&gt;SELECT&lt;/code&gt; output. Window functions cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt; — wrap them in a CTE or subquery first. State the order of evaluation (&lt;code&gt;FROM → WHERE → GROUP BY → HAVING → window → SELECT → ORDER BY → LIMIT&lt;/code&gt;) when an interviewer asks; it is graded as fundamental literacy.&lt;/p&gt;
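&lt;p&gt;The wrap-it-first rule in miniature, against a hypothetical &lt;code&gt;employees&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Wrong: the window alias does not exist yet in WHERE
-- SELECT name, RANK() OVER (ORDER BY salary DESC) AS r
-- FROM employees
-- WHERE r &amp;lt;= 3;

-- Right: materialize the window column in a CTE, filter in the outer query
WITH ranked AS (
  SELECT name,
         RANK() OVER (ORDER BY salary DESC) AS r
  FROM employees
)
SELECT name, r
FROM ranked
WHERE r &amp;lt;= 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;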

&lt;h3&gt;
  
  
  Pick &lt;code&gt;DENSE_RANK&lt;/code&gt; for "Nth distinct"; pick &lt;code&gt;ROW_NUMBER&lt;/code&gt; for deduplication
&lt;/h3&gt;

&lt;p&gt;The single most-graded ranking distinction: &lt;strong&gt;&lt;code&gt;DENSE_RANK = N&lt;/code&gt; is the Nth distinct value; &lt;code&gt;RANK = N&lt;/code&gt; is the Nth row in skip-aware ranking order; &lt;code&gt;ROW_NUMBER = N&lt;/code&gt; is the Nth row, with ties broken arbitrarily unless the &lt;code&gt;ORDER BY&lt;/code&gt; includes a tiebreaker&lt;/strong&gt;. For "second-highest distinct salary" → &lt;code&gt;DENSE_RANK = 2&lt;/code&gt;. For "remove duplicate rows, keeping the canonical one" → &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker) = 1&lt;/code&gt;. State which one you chose and why.&lt;/p&gt;
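&lt;p&gt;The dedup pattern, sketched on a hypothetical &lt;code&gt;users&lt;/code&gt; table where &lt;code&gt;email&lt;/code&gt; should be unique and the lowest &lt;code&gt;user_id&lt;/code&gt; is the canonical row:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH numbered AS (
  SELECT u.*,
         ROW_NUMBER() OVER (PARTITION BY email
                            ORDER BY user_id) AS rn  -- deterministic tiebreaker
  FROM users u
)
SELECT *
FROM numbered
WHERE rn = 1;  -- one canonical row per email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;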

&lt;h3&gt;
  
  
  Use &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; over &lt;code&gt;NOT IN&lt;/code&gt; for anti-joins
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;NOT IN (subquery)&lt;/code&gt; returns zero rows if the subquery contains a single &lt;code&gt;NULL&lt;/code&gt; because &lt;code&gt;x NOT IN (..., NULL, ...)&lt;/code&gt; evaluates to &lt;code&gt;NULL&lt;/code&gt;, which fails the &lt;code&gt;WHERE&lt;/code&gt; predicate. &lt;code&gt;LEFT JOIN ... WHERE right.id IS NULL&lt;/code&gt; and &lt;code&gt;NOT EXISTS (...)&lt;/code&gt; are both immune to this. Production data engineers who have been bitten by this once never write &lt;code&gt;NOT IN&lt;/code&gt; again. State the gotcha out loud.&lt;/p&gt;
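&lt;p&gt;A minimal reproduction of the gotcha, on hypothetical &lt;code&gt;customers&lt;/code&gt; / &lt;code&gt;orders&lt;/code&gt; tables where &lt;code&gt;orders.customer_id&lt;/code&gt; contains a &lt;code&gt;NULL&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Returns ZERO rows, even for customers with no orders:
SELECT c.customer_id
FROM customers c
WHERE c.customer_id NOT IN (SELECT o.customer_id FROM orders o);

-- NULL-safe anti-join, form 1:
SELECT c.customer_id
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o
                  WHERE o.customer_id = c.customer_id);

-- NULL-safe anti-join, form 2:
SELECT c.customer_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;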

&lt;h3&gt;
  
  
  Practice on PostgreSQL — it is the default dialect of most live coders
&lt;/h3&gt;

&lt;p&gt;DataLemur, CoderPad, HackerRank's SQL practice, most product-analytics live screens, and most public SQL interview corpora use PostgreSQL syntax. Drill &lt;code&gt;EXTRACT(...)&lt;/code&gt;, &lt;code&gt;INTERVAL '1 month'&lt;/code&gt;, &lt;code&gt;DATE_TRUNC&lt;/code&gt;, &lt;code&gt;::DATE&lt;/code&gt; casting, and &lt;code&gt;COALESCE&lt;/code&gt; until they are reflexive. MySQL-only quirks (back-tick identifiers, the &lt;code&gt;LIMIT offset, count&lt;/code&gt; comma form) and SQL-Server-only quirks (&lt;code&gt;TOP N&lt;/code&gt;, &lt;code&gt;OFFSET ... FETCH ...&lt;/code&gt;) are second priority — only drill them if you know the company uses that dialect.&lt;/p&gt;
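&lt;p&gt;All five reflexes in one hypothetical query over an &lt;code&gt;events&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT DATE_TRUNC('month', created_at) AS month_bucket,
       EXTRACT(DOW FROM created_at)    AS day_of_week,
       created_at::DATE                AS created_date,
       created_at + INTERVAL '1 month' AS renewal_at,
       COALESCE(country, 'unknown')    AS country
FROM events
WHERE created_at &amp;gt;= DATE '2026-01-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;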

&lt;h3&gt;
  
  
  Always &lt;code&gt;ORDER BY&lt;/code&gt; and add a tiebreaker for determinism
&lt;/h3&gt;

&lt;p&gt;Window functions, &lt;code&gt;LIMIT N&lt;/code&gt;, and "top result" queries all require an &lt;code&gt;ORDER BY&lt;/code&gt; with a &lt;em&gt;deterministic&lt;/em&gt; tiebreaker (e.g., &lt;code&gt;ORDER BY salary DESC, name&lt;/code&gt;). Without one, two runs of the same query can return different rows in the tie band — silently wrong in production and visibly wrong in an interview if the reviewer's reference answer locks an ordering. Always state your tiebreaker.&lt;/p&gt;
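&lt;p&gt;The difference a tiebreaker makes, on a hypothetical &lt;code&gt;employees&lt;/code&gt; table with tied salaries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Non-deterministic: rows tied on salary can swap between runs
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 3;

-- Deterministic: name breaks ties the same way every run
SELECT name, salary
FROM employees
ORDER BY salary DESC, name
LIMIT 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;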

&lt;h3&gt;
  
  
  Read the data shape before writing the query
&lt;/h3&gt;

&lt;p&gt;Before typing, ask: what is the grain of each table (one row per...)? What are the keys? Are there &lt;code&gt;NULL&lt;/code&gt;s in the join columns? Is &lt;code&gt;email&lt;/code&gt; indexed? The most common "almost passed" failure mode is correct happy-path SQL that breaks on the actual schema — duplicate primary keys, &lt;code&gt;NULL&lt;/code&gt; foreign keys, off-by-one cardinality. A 30-second schema sweep prevents it.&lt;/p&gt;
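&lt;p&gt;The sweep can be literal SQL — two hypothetical sanity checks on an &lt;code&gt;orders&lt;/code&gt; table with a supposed key &lt;code&gt;order_id&lt;/code&gt; and foreign key &lt;code&gt;customer_id&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Is the supposed primary key actually unique? (equal counts mean yes)
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT order_id) AS distinct_keys
FROM orders;

-- Are there NULLs in the join column?
SELECT COUNT(*) FILTER (WHERE customer_id IS NULL) AS null_fks
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;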

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice surface&lt;/a&gt; for the all-language SQL corpus. Drill the four-primitive pages: &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL joins&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;SQL CTE&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/subqueries/sql" rel="noopener noreferrer"&gt;SQL subqueries&lt;/a&gt;. Add adjacent topics: &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;SQL filtering&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/null-handling/sql" rel="noopener noreferrer"&gt;SQL null-handling&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions/sql" rel="noopener noreferrer"&gt;SQL date functions&lt;/a&gt;. The &lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;interview courses page&lt;/a&gt; bundles structured curricula. For broader coverage, &lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;browse by topic&lt;/a&gt; or read the &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top data engineering interview questions 2026&lt;/a&gt; blog and the &lt;a href="https://pipecode.ai/blogs/sql-data-types-postgresql-guide" rel="noopener noreferrer"&gt;SQL data types in PostgreSQL guide&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What SQL topics are most asked in data engineering interviews?
&lt;/h3&gt;

&lt;p&gt;Four primitives carry the loop: &lt;strong&gt;joins&lt;/strong&gt; (especially &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; for anti-joins), &lt;strong&gt;aggregations&lt;/strong&gt; (&lt;code&gt;GROUP BY&lt;/code&gt; plus &lt;code&gt;HAVING&lt;/code&gt;, with &lt;code&gt;COUNT(*) &amp;gt; 1&lt;/code&gt; as the universal duplicate finder), &lt;strong&gt;window functions&lt;/strong&gt; (&lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt; for ranking and lookback), and &lt;strong&gt;CTEs&lt;/strong&gt; (&lt;code&gt;WITH&lt;/code&gt; and &lt;code&gt;WITH RECURSIVE&lt;/code&gt; for multi-step logic and sequence generation). Adjacent shapes — &lt;code&gt;WHERE&lt;/code&gt; vs &lt;code&gt;HAVING&lt;/code&gt;, &lt;code&gt;NULL&lt;/code&gt; handling with &lt;code&gt;COALESCE&lt;/code&gt;, dedup via &lt;code&gt;ROW_NUMBER&lt;/code&gt;, and indexes — round out the typical fresher and mid-level loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, and &lt;code&gt;ROW_NUMBER&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;All three assign integers within a window. &lt;code&gt;ROW_NUMBER&lt;/code&gt; gives every row a unique sequential integer (&lt;code&gt;1, 2, 3, 4&lt;/code&gt;), even on ties. &lt;code&gt;RANK&lt;/code&gt; gives tied rows the same rank but skips after them (&lt;code&gt;1, 2, 2, 4&lt;/code&gt;). &lt;code&gt;DENSE_RANK&lt;/code&gt; gives tied rows the same rank with no skip (&lt;code&gt;1, 2, 2, 3&lt;/code&gt;). For "Nth distinct value" use &lt;code&gt;DENSE_RANK = N&lt;/code&gt;; for "Nth row in skip-aware ranking order" use &lt;code&gt;RANK = N&lt;/code&gt;; for "Nth row in arbitrary order" or "deduplicate keeping one canonical row" use &lt;code&gt;ROW_NUMBER = 1&lt;/code&gt;.&lt;/p&gt;
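&lt;p&gt;All three side by side, on a hypothetical &lt;code&gt;scores&lt;/code&gt; table holding the values 50, 40, 40, 30:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,    -- 1, 2, 3, 4
       RANK()       OVER (ORDER BY score DESC) AS rnk,        -- 1, 2, 2, 4
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk   -- 1, 2, 2, 3
FROM scores;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;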

&lt;h3&gt;
  
  
  What is the difference between &lt;code&gt;WHERE&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;WHERE&lt;/code&gt; filters individual rows &lt;strong&gt;before&lt;/strong&gt; the &lt;code&gt;GROUP BY&lt;/code&gt; step and can reference only raw row columns. &lt;code&gt;HAVING&lt;/code&gt; filters whole groups &lt;strong&gt;after&lt;/strong&gt; the &lt;code&gt;GROUP BY&lt;/code&gt; step and can reference aggregate functions like &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(col)&lt;/code&gt;, &lt;code&gt;AVG(col)&lt;/code&gt;. Trying to use an aggregate in &lt;code&gt;WHERE&lt;/code&gt; (e.g., &lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt;) is an error — PostgreSQL reports &lt;code&gt;aggregate functions are not allowed in WHERE&lt;/code&gt; — because the aggregate does not yet exist. Both clauses can appear in the same query; &lt;code&gt;WHERE&lt;/code&gt; runs first.&lt;/p&gt;
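&lt;p&gt;Both clauses in one hypothetical query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE status = 'completed'   -- row filter, runs before grouping
GROUP BY customer_id
HAVING COUNT(*) &amp;gt; 1;         -- group filter, runs after aggregation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;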

&lt;h3&gt;
  
  
  What is a CTE and when should I use one over a subquery?
&lt;/h3&gt;

&lt;p&gt;A CTE (Common Table Expression, written as &lt;code&gt;WITH name AS (SELECT ...)&lt;/code&gt;) is a named intermediate result you reference like a table in subsequent &lt;code&gt;SELECT&lt;/code&gt;s. Use a CTE when the same intermediate is referenced multiple times, when the multi-step logic is deeply nested as a subquery, or when you need recursion (&lt;code&gt;WITH RECURSIVE&lt;/code&gt;). Use a subquery when the intermediate is referenced exactly once and can be inlined cleanly. PostgreSQL 12 and later no longer materialize CTEs by default, so the performance gap has narrowed — pick by readability.&lt;/p&gt;
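&lt;p&gt;The same logic both ways, on a hypothetical &lt;code&gt;orders&lt;/code&gt; table — the CTE version wins once the intermediate step deserves a name or a second reference:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Subquery: fine when referenced exactly once
SELECT customer_id, total
FROM (SELECT customer_id, SUM(amount) AS total
      FROM orders
      GROUP BY customer_id) t
WHERE total &amp;gt; 1000;

-- CTE: same result, named step
WITH customer_totals AS (
  SELECT customer_id, SUM(amount) AS total
  FROM orders
  GROUP BY customer_id
)
SELECT customer_id, total
FROM customer_totals
WHERE total &amp;gt; 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;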

&lt;h3&gt;
  
  
  How do I find duplicate rows in SQL?
&lt;/h3&gt;

&lt;p&gt;The canonical pattern is &lt;code&gt;SELECT key, COUNT(*) FROM table GROUP BY key HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;, which returns every distinct &lt;code&gt;key&lt;/code&gt; value that appears more than once along with its multiplicity. To find duplicate &lt;code&gt;(a, b, c)&lt;/code&gt; triples, use &lt;code&gt;GROUP BY a, b, c&lt;/code&gt;. To &lt;em&gt;delete&lt;/em&gt; duplicates while keeping one canonical row per key, use &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker)&lt;/code&gt; inside a CTE and delete every row where &lt;code&gt;ROW_NUMBER &amp;gt; 1&lt;/code&gt;.&lt;/p&gt;
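&lt;p&gt;The delete form, sketched for PostgreSQL on a hypothetical &lt;code&gt;users&lt;/code&gt; table (&lt;code&gt;ctid&lt;/code&gt; is PostgreSQL's physical row identifier; with a real surrogate key, filter on that instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH ranked AS (
  SELECT ctid,
         ROW_NUMBER() OVER (PARTITION BY email
                            ORDER BY user_id) AS rn
  FROM users
)
DELETE FROM users
WHERE ctid IN (SELECT ctid FROM ranked WHERE rn &amp;gt; 1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;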

&lt;h3&gt;
  
  
  What is the best way to find the second-highest salary?
&lt;/h3&gt;

&lt;p&gt;The cleanest answer is &lt;code&gt;SELECT MAX(salary) FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS dr FROM employees) t WHERE dr = 2&lt;/code&gt;. &lt;code&gt;DENSE_RANK = 2&lt;/code&gt; reliably gives the second-distinct-salary even when ties exist at the top; the outer &lt;code&gt;MAX(salary)&lt;/code&gt; collapses ties at rank 2 into a single value and returns &lt;code&gt;NULL&lt;/code&gt; when no second-distinct-salary exists — both edge cases handled. Avoid &lt;code&gt;LIMIT 1 OFFSET 1&lt;/code&gt; — without &lt;code&gt;DISTINCT&lt;/code&gt; it does not handle ties correctly, and it returns zero rows rather than &lt;code&gt;NULL&lt;/code&gt; when no second salary exists.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I learn PostgreSQL or MySQL for SQL interviews?
&lt;/h3&gt;

&lt;p&gt;PostgreSQL is the default dialect of most modern data-engineering interview platforms (DataLemur, CoderPad, most product-analytics live screens) because of its strong window-function support, &lt;code&gt;EXTRACT&lt;/code&gt; / &lt;code&gt;INTERVAL&lt;/code&gt; date arithmetic, and &lt;code&gt;::TYPE&lt;/code&gt; casting syntax. Drill PostgreSQL first; learn MySQL-specific quirks (back-tick identifiers, the comma &lt;code&gt;LIMIT&lt;/code&gt; form, no &lt;code&gt;FULL OUTER JOIN&lt;/code&gt; at all — emulate it with a &lt;code&gt;UNION&lt;/code&gt; of &lt;code&gt;LEFT&lt;/code&gt; and &lt;code&gt;RIGHT&lt;/code&gt; joins) only if the company explicitly uses MySQL. Either way, the four primitives — joins, aggregates, windows, CTEs — are dialect-agnostic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing SQL data engineering problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>interview</category>
    </item>
    <item>
      <title>COALESCE in SQL — First Non-NULL, LEFT JOIN Defaults, and Interview Patterns</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Sun, 03 May 2026 15:52:58 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/coalesce-in-sql-first-non-null-left-join-defaults-and-interview-patterns-3df4</link>
      <guid>https://dev.to/gowthampotureddi/coalesce-in-sql-first-non-null-left-join-defaults-and-interview-patterns-3df4</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; in SQL&lt;/strong&gt; is the single most-asked NULL-handling primitive at every data-engineering interview that touches production analytics. The mental model: &lt;strong&gt;&lt;code&gt;COALESCE(expr1, expr2, ..., exprN)&lt;/code&gt; returns the first argument that is not &lt;code&gt;NULL&lt;/code&gt;, evaluated left-to-right; if every argument is &lt;code&gt;NULL&lt;/code&gt;, the result is &lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt;. That one rule covers a vast surface — fallback chains for "best available column," default values to neutralize &lt;code&gt;LEFT JOIN&lt;/code&gt; misses, sentinel labels like &lt;code&gt;'NONE'&lt;/code&gt; for missing dimensions, and zero-substitution before downstream math. Four sub-primitives carry the loop: left-to-right evaluation with short-circuit semantics, the &lt;code&gt;COALESCE(left_join_col, default)&lt;/code&gt; pattern that turns outer-join &lt;code&gt;NULL&lt;/code&gt;s into reportable defaults, the dialect-portability matrix versus &lt;code&gt;CASE&lt;/code&gt; / &lt;code&gt;ISNULL&lt;/code&gt; / &lt;code&gt;NVL&lt;/code&gt; / &lt;code&gt;IFNULL&lt;/code&gt;, and the pitfall set around &lt;code&gt;NULL&lt;/code&gt; ≠ &lt;code&gt;0&lt;/code&gt; semantics, type coercion, empty strings, and the &lt;code&gt;COALESCE(NULLIF(col, ''), 'default')&lt;/code&gt; composition.&lt;/p&gt;

&lt;p&gt;This guide walks four concept clusters end-to-end, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, and an &lt;strong&gt;interview-style worked problem with a full solution&lt;/strong&gt; that explains why it works. The mix matches the actual surface area &lt;code&gt;COALESCE&lt;/code&gt; covers in production SQL — analytics dashboards, ETL transforms, BI reports, and whiteboard interview questions. Examples use &lt;strong&gt;PostgreSQL-friendly&lt;/strong&gt; syntax; engine differences are called out where they bite (&lt;code&gt;ISNULL&lt;/code&gt; is SQL Server, &lt;code&gt;NVL&lt;/code&gt; is Oracle, &lt;code&gt;IFNULL&lt;/code&gt; is MySQL — all of them are two-argument cousins of the standard-SQL &lt;code&gt;COALESCE&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxar69ahe9i8gwgiludmf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxar69ahe9i8gwgiludmf.webp" alt="Blog header thumbnail for a PipeCode SQL guide to COALESCE with stylized query text and purple brand accents on a dark background." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top COALESCE concepts and SQL patterns
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;concept map&lt;/strong&gt; (one row per &lt;strong&gt;H2&lt;/strong&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic (sections &lt;strong&gt;1–4&lt;/strong&gt;)&lt;/th&gt;
&lt;th&gt;Why it matters in SQL data engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; evaluation order and basic fallback patterns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;COALESCE(a, b, c)&lt;/code&gt; is left-to-right, short-circuits on first non-&lt;code&gt;NULL&lt;/code&gt;, and follows engine-specific type-precedence rules; the bedrock primitive every other pattern composes on.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; with &lt;code&gt;LEFT JOIN&lt;/code&gt; for default values in analytics and BI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Outer joins to lookup tables (FX rates, dimensions, slowly-changing-dim surrogates) produce &lt;code&gt;NULL&lt;/code&gt; on misses; &lt;code&gt;COALESCE(fx.rate, 1)&lt;/code&gt; and &lt;code&gt;COALESCE(p.promo_code, 'NONE')&lt;/code&gt; turn that &lt;code&gt;NULL&lt;/code&gt; into a default that downstream metrics can consume.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; vs &lt;code&gt;CASE&lt;/code&gt;, &lt;code&gt;ISNULL&lt;/code&gt;, and &lt;code&gt;NVL&lt;/code&gt; — portability and when to pick which&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;COALESCE&lt;/code&gt; is standard SQL and portable across PostgreSQL / MySQL / SQL Server / Oracle / Snowflake / BigQuery; &lt;code&gt;ISNULL&lt;/code&gt; (SQL Server, 2-arg), &lt;code&gt;NVL&lt;/code&gt; (Oracle, 2-arg), &lt;code&gt;IFNULL&lt;/code&gt; (MySQL, 2-arg) are dialect cousins; &lt;code&gt;CASE&lt;/code&gt; is the right tool when the logic is not "first non-&lt;code&gt;NULL&lt;/code&gt;."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; pitfalls — NULL semantics, type coercion, empty strings, and &lt;code&gt;NULLIF&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;NULL&lt;/code&gt; ≠ &lt;code&gt;0&lt;/code&gt;; replacing unknown with zero changes the metric meaning. Empty string &lt;code&gt;''&lt;/code&gt; is not &lt;code&gt;NULL&lt;/code&gt; in PostgreSQL; combine with &lt;code&gt;NULLIF&lt;/code&gt; for "empty-or-null." Mixing types without &lt;code&gt;CAST&lt;/code&gt; raises errors or silently coerces.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Concept-based framing rule:&lt;/strong&gt; &lt;code&gt;COALESCE&lt;/code&gt; is the primitive; the four sections are the operating contexts where it shows up. Master the evaluation rule first (§1), then the &lt;code&gt;LEFT JOIN&lt;/code&gt; default pattern (§2), then the dialect comparison (§3), then the pitfalls (§4). State the left-to-right-first-non-&lt;code&gt;NULL&lt;/code&gt; rule out loud in any interview answer that touches null handling — interviewers grade the explicit verbalization, not just the correct query.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. &lt;code&gt;COALESCE&lt;/code&gt; Evaluation Order and Basic Fallback Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Left-to-right first-non-NULL evaluation in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Return the first non-&lt;code&gt;NULL&lt;/code&gt; value from a list of expressions" is the canonical &lt;code&gt;COALESCE&lt;/code&gt; semantics. The mental model: &lt;strong&gt;the engine evaluates arguments left-to-right; the first one whose value is not &lt;code&gt;NULL&lt;/code&gt; becomes the result; remaining arguments may not be evaluated at all (most engines short-circuit); if every argument is &lt;code&gt;NULL&lt;/code&gt;, the result itself is &lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt;. Same primitive powers any "priority-ordered fallback" pattern — pick the user's &lt;code&gt;work_email&lt;/code&gt;, else &lt;code&gt;personal_email&lt;/code&gt;, else a literal &lt;code&gt;'no-email@example.com'&lt;/code&gt;; pick the trade's &lt;code&gt;settled_price&lt;/code&gt;, else &lt;code&gt;mark_price&lt;/code&gt;, else &lt;code&gt;last_close&lt;/code&gt;; pick the column you trust, else the column you mostly trust, else a sentinel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6wa3veilayk8bdgu3kt.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6wa3veilayk8bdgu3kt.webp" alt="Diagram showing COALESCE evaluating SQL expressions from left to right and returning the first non-NULL value in PipeCode brand colors." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Even when the engine documents short-circuit evaluation, never put side-effecting expressions in later &lt;code&gt;COALESCE&lt;/code&gt; arguments — volatile functions, RAISE statements, or sequence calls. State this explicitly in interviews; it signals production fluency. The reader-friendly contract is "every argument is a pure expression; the engine picks the first non-&lt;code&gt;NULL&lt;/code&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Syntax: &lt;code&gt;COALESCE(expr1, expr2, ..., exprN)&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The syntax invariant: &lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; takes a variadic argument list (some engines, notably SQL Server and Oracle, require at least 2 arguments) and returns a single value of the unified result type&lt;/strong&gt;. The unified type is determined by precedence rules across all argument types — mixing &lt;code&gt;INT&lt;/code&gt; and &lt;code&gt;BIGINT&lt;/code&gt; is fine and returns &lt;code&gt;BIGINT&lt;/code&gt;; mixing &lt;code&gt;INT&lt;/code&gt; and &lt;code&gt;TEXT&lt;/code&gt; without an explicit &lt;code&gt;CAST&lt;/code&gt; raises a type error in PostgreSQL.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(a, b)&lt;/code&gt;&lt;/strong&gt; — minimum form; first argument &lt;code&gt;a&lt;/code&gt; if non-&lt;code&gt;NULL&lt;/code&gt;, else &lt;code&gt;b&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(a, b, c, d, e)&lt;/code&gt;&lt;/strong&gt; — N-ary form; ordered fallback chain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(x)&lt;/code&gt;&lt;/strong&gt; — single-argument form; degenerate (it just returns &lt;code&gt;x&lt;/code&gt;) — PostgreSQL accepts it, while SQL Server rejects fewer than two arguments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type unification&lt;/strong&gt; — all arguments must coerce to a common type; explicit &lt;code&gt;CAST(... AS ...)&lt;/code&gt; keeps intent obvious.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Mixed literal arguments; the engine returns &lt;code&gt;'third'&lt;/code&gt; because the first two are &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;arg 1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arg 2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arg 3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'third'&lt;/code&gt; (first non-&lt;code&gt;NULL&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'third'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'third'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'fourth'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;first_non_null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- first_non_null = 'third'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;COALESCE&lt;/code&gt; reads as "try this, else try this, else try this" — the order of arguments encodes business priority directly in SQL.&lt;/p&gt;

&lt;h4&gt;
  
  
  Short-circuit evaluation and side-effect safety
&lt;/h4&gt;

&lt;p&gt;The short-circuit invariant: &lt;strong&gt;most engines stop evaluating arguments once they find a non-&lt;code&gt;NULL&lt;/code&gt;, and PostgreSQL documents this behavior — but the SQL standard does not require it, and SQL Server explicitly warns that arguments containing subqueries can be evaluated more than once&lt;/strong&gt;. Treat short-circuit as a performance optimization, not a contract — never rely on it for correctness when later arguments have side effects.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(safe_col, expensive_lookup())&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;expensive_lookup()&lt;/code&gt; only runs when &lt;code&gt;safe_col&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(col, fn_with_side_effects())&lt;/code&gt;&lt;/strong&gt; — e.g. a plpgsql function that does &lt;code&gt;RAISE NOTICE&lt;/code&gt;; never rely on whether or when it runs — the optimizer may rewrite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volatile functions&lt;/strong&gt; — &lt;code&gt;random()&lt;/code&gt;, &lt;code&gt;now()&lt;/code&gt;, sequence calls can be evaluated even when later in the list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure functions only&lt;/strong&gt; — write the contract as "every argument is a pure expression."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Cheap column first, expensive subquery second.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input&lt;/th&gt;
&lt;th&gt;expensive_lookup called?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cached_value = 100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cached_value = NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;cached_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                           &lt;span class="c1"&gt;-- cheap, indexed&lt;/span&gt;
         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;email_lookup&lt;/span&gt;
          &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;-- expensive, only runs on NULL cached&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;effective_email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; put the cheapest reliably-non-&lt;code&gt;NULL&lt;/code&gt; column first; let short-circuit save the expensive lookups for rows that need them.&lt;/p&gt;

&lt;h4&gt;
  
  
  Type precedence and &lt;code&gt;CAST&lt;/code&gt; for unambiguous types
&lt;/h4&gt;

&lt;p&gt;The type-precedence invariant: &lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; returns a single unified type computed from the precedence rules of all argument types; mixing incompatible types (e.g. &lt;code&gt;INTEGER&lt;/code&gt; with &lt;code&gt;TEXT&lt;/code&gt;) is a type-resolution error in PostgreSQL (&lt;code&gt;COALESCE types integer and text cannot be matched&lt;/code&gt;) and a silent coercion in MySQL&lt;/strong&gt;. Use &lt;code&gt;CAST(... AS ...)&lt;/code&gt; to make intent explicit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(int_col, 0)&lt;/code&gt;&lt;/strong&gt; — both &lt;code&gt;INTEGER&lt;/code&gt;; result &lt;code&gt;INTEGER&lt;/code&gt;. ✓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(numeric_col, 0)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;NUMERIC&lt;/code&gt; and &lt;code&gt;INTEGER&lt;/code&gt; unify to &lt;code&gt;NUMERIC&lt;/code&gt;. ✓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(int_col, 'unknown')&lt;/code&gt;&lt;/strong&gt; — type error in PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(CAST(int_col AS TEXT), 'unknown')&lt;/code&gt;&lt;/strong&gt; — explicit cast resolves the conflict.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Avoid the type error by casting &lt;code&gt;INTEGER&lt;/code&gt; to &lt;code&gt;TEXT&lt;/code&gt; before mixing with a string default.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;expression&lt;/th&gt;
&lt;th&gt;result type&lt;/th&gt;
&lt;th&gt;safe?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COALESCE(qty, 'N/A')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COALESCE(CAST(qty AS TEXT), 'N/A')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEXT&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'N/A'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;qty_label&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; when you need a string sentinel for a numeric column, cast the numeric first; never trust implicit coercion across families.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Calling &lt;code&gt;COALESCE&lt;/code&gt; with one argument — the SQL standard requires at least two; some engines reject a single argument, and even where it parses it is a no-op.&lt;/li&gt;
&lt;li&gt;Putting volatile / side-effecting expressions in later positions — whether they execute depends on short-circuiting, so behavior varies by engine.&lt;/li&gt;
&lt;li&gt;Mixing types without &lt;code&gt;CAST&lt;/code&gt; — silent coercion in MySQL, a type-resolution error in PostgreSQL.&lt;/li&gt;
&lt;li&gt;Assuming all engines short-circuit — relying on it is a portability bug.&lt;/li&gt;
&lt;li&gt;Returning &lt;code&gt;NULL&lt;/code&gt; and surprising downstream — if every argument is &lt;code&gt;NULL&lt;/code&gt;, so is the result; add a guaranteed-non-&lt;code&gt;NULL&lt;/code&gt; literal at the end if the metric needs a default.&lt;/li&gt;
&lt;/ul&gt;
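&lt;p&gt;A quick way to see the all-&lt;code&gt;NULL&lt;/code&gt; pitfall outside PostgreSQL is Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; — &lt;code&gt;COALESCE&lt;/code&gt; follows the same standard semantics there (the table and column names below are invented for the demo; note SQLite's loose typing means the MySQL/PostgreSQL coercion differences will not reproduce):&lt;/p&gt;

```python
import sqlite3

# Minimal sketch of the "all arguments NULL" pitfall, using sqlite3 as a
# stand-in for PostgreSQL. Table "m" and columns "a"/"b" are invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (a TEXT, b TEXT)")
con.execute("INSERT INTO m VALUES (NULL, NULL)")

# Without a literal fallback, COALESCE over all-NULL arguments is still NULL.
row = con.execute("SELECT COALESCE(a, b) FROM m").fetchone()
print(row[0])  # None

# A trailing non-NULL literal guarantees a value.
row = con.execute("SELECT COALESCE(a, b, 'n/a') FROM m").fetchone()
print(row[0])  # n/a
```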

&lt;h3&gt;
  
  
  Worked Problem on COALESCE Evaluation Order
&lt;/h3&gt;

&lt;p&gt;Given a &lt;code&gt;contacts&lt;/code&gt; table with three nullable email columns — &lt;code&gt;work_email&lt;/code&gt;, &lt;code&gt;personal_email&lt;/code&gt;, &lt;code&gt;legacy_email&lt;/code&gt; — return one column &lt;code&gt;effective_email&lt;/code&gt; per row that is the first non-&lt;code&gt;NULL&lt;/code&gt; of the three, falling back to the literal &lt;code&gt;'no-email@example.com'&lt;/code&gt; when every column is &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;COALESCE&lt;/code&gt; with a literal final fallback
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;contact_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;work_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;personal_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;legacy_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;'no-email@example.com'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;effective_email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;contacts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the four-argument &lt;code&gt;COALESCE&lt;/code&gt; evaluates each column left-to-right; on the first non-&lt;code&gt;NULL&lt;/code&gt; it short-circuits and returns that value; the trailing literal &lt;code&gt;'no-email@example.com'&lt;/code&gt; guarantees a non-&lt;code&gt;NULL&lt;/code&gt; result for rows where all three columns are &lt;code&gt;NULL&lt;/code&gt;. One pass, no &lt;code&gt;CASE&lt;/code&gt;, no joins, and standard SQL that runs unchanged in PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for four sample rows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;contact_id&lt;/th&gt;
&lt;th&gt;work_email&lt;/th&gt;
&lt;th&gt;personal_email&lt;/th&gt;
&lt;th&gt;legacy_email&lt;/th&gt;
&lt;th&gt;effective_email&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@work.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@home.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@work.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bob@home.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bob@old.net&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bob@home.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;carol@old.net&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;carol@old.net&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;no-email@example.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Row 1&lt;/strong&gt; — &lt;code&gt;work_email&lt;/code&gt; is non-&lt;code&gt;NULL&lt;/code&gt; → return immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row 2&lt;/strong&gt; — &lt;code&gt;work_email&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;; &lt;code&gt;personal_email&lt;/code&gt; is non-&lt;code&gt;NULL&lt;/code&gt; → return.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row 3&lt;/strong&gt; — first two are &lt;code&gt;NULL&lt;/code&gt;; &lt;code&gt;legacy_email&lt;/code&gt; is non-&lt;code&gt;NULL&lt;/code&gt; → return.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row 4&lt;/strong&gt; — all three are &lt;code&gt;NULL&lt;/code&gt;; literal &lt;code&gt;'no-email@example.com'&lt;/code&gt; returned.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;contact_id&lt;/th&gt;
&lt;th&gt;effective_email&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@work.com"&gt;alice@work.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:bob@home.com"&gt;bob@home.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:carol@old.net"&gt;carol@old.net&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:no-email@example.com"&gt;no-email@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
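&lt;p&gt;The whole worked problem can be replayed outside PostgreSQL with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; — &lt;code&gt;COALESCE&lt;/code&gt; behaves identically here, and the data mirrors the trace table above:&lt;/p&gt;

```python
import sqlite3

# Replay of the contacts worked problem; sqlite3 is a stand-in for
# PostgreSQL, and the rows mirror the step-by-step trace table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE contacts (
        contact_id INTEGER, work_email TEXT,
        personal_email TEXT, legacy_email TEXT);
    INSERT INTO contacts VALUES
        (1, 'alice@work.com', 'alice@home.com', NULL),
        (2, NULL, 'bob@home.com', 'bob@old.net'),
        (3, NULL, NULL, 'carol@old.net'),
        (4, NULL, NULL, NULL);
""")
rows = con.execute("""
    SELECT contact_id,
           COALESCE(work_email, personal_email, legacy_email,
                    'no-email@example.com') AS effective_email
    FROM contacts
    ORDER BY contact_id
""").fetchall()
for contact_id, email in rows:
    print(contact_id, email)
# 1 alice@work.com
# 2 bob@home.com
# 3 carol@old.net
# 4 no-email@example.com
```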

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Left-to-right evaluation&lt;/strong&gt; — &lt;code&gt;COALESCE&lt;/code&gt; walks arguments in source order; the priority encoded in the SQL exactly matches the business rule "prefer work, then home, then legacy."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short-circuit on first non-&lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; — the engine stops as soon as it has a value; rows with a populated &lt;code&gt;work_email&lt;/code&gt; never touch &lt;code&gt;personal_email&lt;/code&gt; or &lt;code&gt;legacy_email&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Literal final fallback&lt;/strong&gt; — the trailing string literal guarantees &lt;code&gt;effective_email&lt;/code&gt; is never &lt;code&gt;NULL&lt;/code&gt;; downstream consumers (reports, joins, filters) can treat the column as &lt;code&gt;NOT NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-pass scan&lt;/strong&gt; — no &lt;code&gt;CASE&lt;/code&gt;, no &lt;code&gt;JOIN&lt;/code&gt;, no subquery; the &lt;code&gt;SELECT&lt;/code&gt; scans &lt;code&gt;contacts&lt;/code&gt; once and emits one output row per input row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N)&lt;/code&gt; time / &lt;code&gt;O(1)&lt;/code&gt; space&lt;/strong&gt; — &lt;code&gt;N&lt;/code&gt; rows scanned, no auxiliary structures; the per-row work is a constant four-argument evaluation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/null-handling/sql" rel="noopener noreferrer"&gt;SQL null-handling practice page&lt;/a&gt; for the full curated set of COALESCE-style problems.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. &lt;code&gt;COALESCE&lt;/code&gt; with &lt;code&gt;LEFT JOIN&lt;/code&gt; for Default Values in Analytics and BI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  COALESCE for outer-join NULL handling in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Replace &lt;code&gt;NULL&lt;/code&gt; from a &lt;code&gt;LEFT JOIN&lt;/code&gt; miss with a sensible default" is the most common production use of &lt;code&gt;COALESCE&lt;/code&gt;. The mental model: &lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; keeps every row from the left table; rows that have no match in the right table get &lt;code&gt;NULL&lt;/code&gt; for every right-table column; &lt;code&gt;COALESCE(right_col, default)&lt;/code&gt; neutralizes those &lt;code&gt;NULL&lt;/code&gt;s into business-meaningful defaults — &lt;code&gt;1&lt;/code&gt; for missing FX rates (treat as USD), &lt;code&gt;'NONE'&lt;/code&gt; for missing promo codes, &lt;code&gt;0&lt;/code&gt; for missing aggregate counts&lt;/strong&gt;. Same primitive powers any "fact + lookup" pipeline — sales fact joined to currency dim, click event joined to user dim, transaction joined to merchant category.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjz0fk8oadhutbjkqchz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjz0fk8oadhutbjkqchz.webp" alt="Flowchart of a transactions table left-joining FX rates and using COALESCE to apply a default rate when the join returns NULL." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Choose the default carefully. &lt;code&gt;COALESCE(fx.rate, 1)&lt;/code&gt; is correct only when the source amount is already in USD; if amounts are in unknown local currencies, defaulting to &lt;code&gt;1&lt;/code&gt; silently fabricates dollars. State the assumption in the SQL comment or the surrounding documentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;LEFT JOIN&lt;/code&gt; + &lt;code&gt;COALESCE(right_col, default)&lt;/code&gt; for fact-dim joins
&lt;/h4&gt;

&lt;p&gt;The fact-dim invariant: &lt;strong&gt;a fact table (transactions, events, orders) is &lt;code&gt;LEFT JOIN&lt;/code&gt;-ed to a dim table (currency rates, customer reference, product catalog); rows with no dim match return &lt;code&gt;NULL&lt;/code&gt; for every dim column; &lt;code&gt;COALESCE&lt;/code&gt; substitutes a default that downstream math can consume&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(dim_col, default)&lt;/code&gt;&lt;/strong&gt; — the canonical wrapping pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt;&lt;/strong&gt; — keep all fact rows; never drop unmatched.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-side filtering&lt;/strong&gt; — putting &lt;code&gt;WHERE dim.col = X&lt;/code&gt; after a &lt;code&gt;LEFT JOIN&lt;/code&gt; silently turns it into &lt;code&gt;INNER JOIN&lt;/code&gt; (the &lt;code&gt;WHERE&lt;/code&gt; filters out the &lt;code&gt;NULL&lt;/code&gt; rows); use &lt;code&gt;AND dim.col = X&lt;/code&gt; inside the &lt;code&gt;ON&lt;/code&gt; clause if you need to keep all fact rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default semantics&lt;/strong&gt; — pick a value that means "no match" without poisoning the metric.&lt;/li&gt;
&lt;/ul&gt;
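&lt;p&gt;The right-side-filtering bullet is easy to verify empirically; a minimal sketch via Python's &lt;code&gt;sqlite3&lt;/code&gt; (invented table names, same &lt;code&gt;LEFT JOIN&lt;/code&gt; semantics as PostgreSQL):&lt;/p&gt;

```python
import sqlite3

# Demo of the WHERE-vs-ON trap: a right-table predicate in WHERE silently
# downgrades a LEFT JOIN to an INNER JOIN. Tables "fact"/"dim" are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact (id INTEGER, dim_id INTEGER);
    CREATE TABLE dim  (id INTEGER, col TEXT);
    INSERT INTO fact VALUES (1, 10), (2, NULL);
    INSERT INTO dim  VALUES (10, 'X');
""")

# Predicate in WHERE: the unmatched fact row has dim.col = NULL, which the
# WHERE filters out -- only 1 row survives.
downgraded = con.execute(
    "SELECT f.id FROM fact f LEFT JOIN dim d ON d.id = f.dim_id "
    "WHERE d.col = 'X'").fetchall()

# Same predicate inside ON: all fact rows are kept -- 2 rows.
kept = con.execute(
    "SELECT f.id FROM fact f LEFT JOIN dim d "
    "ON d.id = f.dim_id AND d.col = 'X'").fetchall()

print(len(downgraded), len(kept))  # 1 2
```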

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three transactions in two currencies; FX table has rates only for &lt;code&gt;EUR&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;transaction&lt;/th&gt;
&lt;th&gt;currency&lt;/th&gt;
&lt;th&gt;fx.rate (after LEFT JOIN)&lt;/th&gt;
&lt;th&gt;coalesced rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;t1&lt;/td&gt;
&lt;td&gt;USD&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;t2&lt;/td&gt;
&lt;td&gt;EUR&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.08&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;t3&lt;/td&gt;
&lt;td&gt;USD&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transaction_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount_usd&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;fx_rates&lt;/span&gt; &lt;span class="n"&gt;fx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;LEFT JOIN ... COALESCE(right_col, default)&lt;/code&gt; is the canonical "fact + dim with safe defaults" pattern; it appears in nearly every BI dashboard SQL.&lt;/p&gt;

&lt;h4&gt;
  
  
  "First non-blank across sources" via chained &lt;code&gt;COALESCE&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The chained-coalesce invariant: &lt;strong&gt;multiple &lt;code&gt;LEFT JOIN&lt;/code&gt;s to supplemental tables (primary, secondary, fallback) return per-table columns that may all be &lt;code&gt;NULL&lt;/code&gt;; &lt;code&gt;COALESCE(a.col, b.col, c.col, default)&lt;/code&gt; picks the first source that actually has data&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary source&lt;/strong&gt; — most reliable column (e.g., production CRM).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secondary source&lt;/strong&gt; — supplemental (e.g., marketing CRM).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tertiary source&lt;/strong&gt; — historical (e.g., legacy import).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default literal&lt;/strong&gt; — guaranteed non-&lt;code&gt;NULL&lt;/code&gt; final fallback.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; User name from CRM, fall back to billing system, fall back to login email.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;crm.name&lt;/th&gt;
&lt;th&gt;billing.name&lt;/th&gt;
&lt;th&gt;login.email&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Alice Lee&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Alice L.&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Alice Lee&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Bob B.&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bob@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Bob B.&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;carol@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;carol@x.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Unknown&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;billing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Unknown'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;crm&lt;/span&gt;     &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;crm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;billing&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;billing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;   &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; chain &lt;code&gt;LEFT JOIN&lt;/code&gt;s for multiple sources, then &lt;code&gt;COALESCE&lt;/code&gt; across them in priority order; the SQL reads top-to-bottom exactly like the business rule.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;COALESCE(SUM(x), 0)&lt;/code&gt; for grouped aggregates with empty groups
&lt;/h4&gt;

&lt;p&gt;The aggregate-coalesce invariant: &lt;strong&gt;&lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt; over an empty group returns &lt;code&gt;NULL&lt;/code&gt;; wrapping the aggregate in &lt;code&gt;COALESCE(SUM(x), 0)&lt;/code&gt; returns a numeric default so the report row still appears with a zero metric instead of disappearing&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(x)&lt;/code&gt; over empty group&lt;/strong&gt; — returns &lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — returns &lt;code&gt;0&lt;/code&gt;, not &lt;code&gt;NULL&lt;/code&gt; (the exception).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(SUM(x), 0)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;0&lt;/code&gt; for empty groups; report row remains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(AVG(x), 0)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;0&lt;/code&gt; for empty groups, but weigh the semantics: an average of &lt;code&gt;0&lt;/code&gt; reads as "measured zero," not "no data."&lt;/li&gt;
&lt;/ul&gt;
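&lt;p&gt;The first three bullets can be checked in a few lines with Python's &lt;code&gt;sqlite3&lt;/code&gt;, which follows the same standard aggregate semantics (the empty table &lt;code&gt;t&lt;/code&gt; is invented for the demo):&lt;/p&gt;

```python
import sqlite3

# Aggregates over an empty input: SUM yields NULL, COUNT(*) yields 0,
# and COALESCE turns the NULL sum into a safe numeric default.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")  # deliberately left empty

sum_x = con.execute("SELECT SUM(x) FROM t").fetchone()[0]
count_all = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
safe_sum = con.execute("SELECT COALESCE(SUM(x), 0) FROM t").fetchone()[0]

print(sum_x, count_all, safe_sum)  # None 0 0
```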

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three regions; only two have orders; the third should show &lt;code&gt;0&lt;/code&gt;, not be omitted.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;SUM(amount)&lt;/th&gt;
&lt;th&gt;COALESCE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;1500&lt;/td&gt;
&lt;td&gt;1500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;West&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;regions&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always wrap &lt;code&gt;SUM&lt;/code&gt; / &lt;code&gt;AVG&lt;/code&gt; in &lt;code&gt;COALESCE(..., 0)&lt;/code&gt; when joining a dimension that should produce a row even with zero matching facts; without it, empty groups silently disappear.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Filtering &lt;code&gt;WHERE dim.col = X&lt;/code&gt; after a &lt;code&gt;LEFT JOIN&lt;/code&gt; — silently downgrades to &lt;code&gt;INNER JOIN&lt;/code&gt;; move to &lt;code&gt;AND dim.col = X&lt;/code&gt; in the &lt;code&gt;ON&lt;/code&gt; clause.&lt;/li&gt;
&lt;li&gt;Defaulting FX rate to &lt;code&gt;1&lt;/code&gt; when the source amount is in unknown currency — fabricates dollars.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;COALESCE(SUM(x), 0)&lt;/code&gt; for aggregate metrics — empty groups disappear from reports.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;COALESCE&lt;/code&gt; to skip rows — that's &lt;code&gt;WHERE&lt;/code&gt;'s job, not &lt;code&gt;COALESCE&lt;/code&gt;'s.&lt;/li&gt;
&lt;li&gt;Mismatched join keys — &lt;code&gt;COALESCE&lt;/code&gt; cannot fix a missing or wrong &lt;code&gt;ON&lt;/code&gt; predicate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Worked Problem on LEFT JOIN with COALESCE Default
&lt;/h3&gt;

&lt;p&gt;Given an &lt;code&gt;orders&lt;/code&gt; table and an optional &lt;code&gt;promos&lt;/code&gt; table, return a single column &lt;code&gt;promo_label&lt;/code&gt; per order that shows the promo code when matched and the literal &lt;code&gt;'NONE'&lt;/code&gt; when there is no matching promo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;LEFT JOIN&lt;/code&gt; + &lt;code&gt;COALESCE(p.promo_code, 'NONE')&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;promo_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'NONE'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;promo_label&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;promos&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;promo_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;promo_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;LEFT JOIN&lt;/code&gt; keeps every order, even those with no &lt;code&gt;promo_id&lt;/code&gt; or whose &lt;code&gt;promo_id&lt;/code&gt; doesn't match any row in &lt;code&gt;promos&lt;/code&gt;; the unmatched rows get &lt;code&gt;NULL&lt;/code&gt; for &lt;code&gt;p.promo_code&lt;/code&gt;; &lt;code&gt;COALESCE&lt;/code&gt; substitutes the literal &lt;code&gt;'NONE'&lt;/code&gt; so the output column is non-&lt;code&gt;NULL&lt;/code&gt; and dashboard-safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for sample data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;promo_id&lt;/th&gt;
&lt;th&gt;matched promo_code&lt;/th&gt;
&lt;th&gt;COALESCE result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1001&lt;/td&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SUMMER20&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SUMMER20&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1002&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;(no row)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NONE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1003&lt;/td&gt;
&lt;td&gt;P9&lt;/td&gt;
&lt;td&gt;(no row)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NONE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1004&lt;/td&gt;
&lt;td&gt;P2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WELCOME10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WELCOME10&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt;&lt;/strong&gt; — every order row survives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match&lt;/strong&gt; — order 1001 → P1 → &lt;code&gt;SUMMER20&lt;/code&gt;; order 1004 → P2 → &lt;code&gt;WELCOME10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-match&lt;/strong&gt; — orders 1002 (NULL key) and 1003 (key not in &lt;code&gt;promos&lt;/code&gt;) get &lt;code&gt;NULL&lt;/code&gt; for &lt;code&gt;p.promo_code&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;NULL&lt;/code&gt; rows become &lt;code&gt;'NONE'&lt;/code&gt;; matched rows pass through.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;promo_label&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1001&lt;/td&gt;
&lt;td&gt;SUMMER20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1002&lt;/td&gt;
&lt;td&gt;NONE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1003&lt;/td&gt;
&lt;td&gt;NONE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1004&lt;/td&gt;
&lt;td&gt;WELCOME10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
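&lt;p&gt;To sanity-check the trace, the same query can be replayed with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; — &lt;code&gt;LEFT JOIN&lt;/code&gt; and &lt;code&gt;COALESCE&lt;/code&gt; behave the same way here, and the data mirrors the tables above:&lt;/p&gt;

```python
import sqlite3

# Replay of the promo worked problem; sqlite3 stands in for PostgreSQL,
# and the rows mirror the sample-data trace table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, promo_id TEXT);
    CREATE TABLE promos (promo_id TEXT, promo_code TEXT);
    INSERT INTO orders VALUES
        (1001, 'P1'), (1002, NULL), (1003, 'P9'), (1004, 'P2');
    INSERT INTO promos VALUES ('P1', 'SUMMER20'), ('P2', 'WELCOME10');
""")
rows = con.execute("""
    SELECT o.order_id,
           COALESCE(p.promo_code, 'NONE') AS promo_label
    FROM orders o
    LEFT JOIN promos p ON p.promo_id = o.promo_id
    ORDER BY o.order_id
""").fetchall()
for order_id, label in rows:
    print(order_id, label)
# 1001 SUMMER20
# 1002 NONE
# 1003 NONE
# 1004 WELCOME10
```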

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; preserves left-side rows&lt;/strong&gt; — every order stays in the result; no order silently disappears because of a missing or invalid &lt;code&gt;promo_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-side &lt;code&gt;NULL&lt;/code&gt; on miss&lt;/strong&gt; — orders with no matching promo see &lt;code&gt;p.promo_code&lt;/code&gt; as &lt;code&gt;NULL&lt;/code&gt;; this is the &lt;code&gt;LEFT JOIN&lt;/code&gt; contract, not a bug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(p.promo_code, 'NONE')&lt;/code&gt;&lt;/strong&gt; — sentinel substitution; the output column becomes non-&lt;code&gt;NULL&lt;/code&gt; and downstream code can treat it as a finite enum (&lt;code&gt;SUMMER20&lt;/code&gt;, &lt;code&gt;WELCOME10&lt;/code&gt;, ..., &lt;code&gt;NONE&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentinel, not silent zero&lt;/strong&gt; — &lt;code&gt;'NONE'&lt;/code&gt; is a string sentinel that flags "no promo applied"; it is never confused with a real code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|orders| + |promos|)&lt;/code&gt; time&lt;/strong&gt; — hash-join cost; &lt;code&gt;O(|orders|)&lt;/code&gt; output rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Practice &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL join problems&lt;/a&gt; to drill the &lt;code&gt;LEFT JOIN&lt;/code&gt; + &lt;code&gt;COALESCE&lt;/code&gt; muscle.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. &lt;code&gt;COALESCE&lt;/code&gt; vs &lt;code&gt;CASE&lt;/code&gt;, &lt;code&gt;ISNULL&lt;/code&gt;, and &lt;code&gt;NVL&lt;/code&gt; — Portability and When to Pick Which
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Standard SQL portability versus dialect-specific NULL helpers in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Which NULL-handling construct should I use in this engine?" is a question every SQL data engineer answers daily. The mental model: &lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; is the standard-SQL N-argument primitive supported on every major engine; &lt;code&gt;ISNULL&lt;/code&gt;, &lt;code&gt;NVL&lt;/code&gt;, and &lt;code&gt;IFNULL&lt;/code&gt; are dialect-specific 2-argument cousins that exist for historical reasons; &lt;code&gt;CASE WHEN ... IS NOT NULL THEN ... ELSE ... END&lt;/code&gt; is the general-purpose conditional that handles any logic, not just NULL fallback&lt;/strong&gt;. Reach for &lt;code&gt;COALESCE&lt;/code&gt; by default; reach for &lt;code&gt;CASE&lt;/code&gt; when the branches are not "first non-&lt;code&gt;NULL&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkeowztqal7daxxc50ele.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkeowztqal7daxxc50ele.webp" alt="Side-by-side comparison graphic of COALESCE, CASE, ISNULL, and NVL for SQL null-handling with portability notes in brand colors." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; In multi-engine analytics codebases (e.g., dbt models targeting both Snowflake and BigQuery, or migrations from legacy SQL Server to PostgreSQL), &lt;code&gt;COALESCE&lt;/code&gt; is the only portable choice. Hard-coding &lt;code&gt;ISNULL&lt;/code&gt; or &lt;code&gt;NVL&lt;/code&gt; in shared code is a portability bug waiting for the migration ticket.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;COALESCE&lt;/code&gt; (standard SQL) — the portable default
&lt;/h4&gt;

&lt;p&gt;The portability invariant: &lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; is part of the SQL:1992 standard and is supported by PostgreSQL, MySQL, SQL Server, Oracle, Snowflake, BigQuery, Redshift, Databricks SQL, DuckDB, SQLite, and every other major engine you'll encounter&lt;/strong&gt;. It accepts 2 to N arguments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;N-argument support&lt;/strong&gt; — &lt;code&gt;COALESCE(a, b, c, d, ...)&lt;/code&gt; works everywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard-SQL identifier&lt;/strong&gt; — no version-specific syntax.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommended default&lt;/strong&gt; — choose &lt;code&gt;COALESCE&lt;/code&gt; unless you have a dialect-specific reason.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most engines short-circuit&lt;/strong&gt; — performance-equivalent to dialect cousins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same query runs unmodified on PostgreSQL, MySQL, SQL Server, Oracle, BigQuery.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;dialect&lt;/th&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;works?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT COALESCE(a, b, c) FROM t;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL Server&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oracle&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;personal_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'no-email@example.com'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;contact&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- runs identically across PostgreSQL / MySQL / SQL Server / Oracle / BigQuery / Snowflake&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if portability matters at all, &lt;code&gt;COALESCE&lt;/code&gt; is the right choice; it never locks you into a single dialect.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;ISNULL&lt;/code&gt; (SQL Server), &lt;code&gt;NVL&lt;/code&gt; (Oracle), &lt;code&gt;IFNULL&lt;/code&gt; (MySQL) — dialect 2-arg cousins
&lt;/h4&gt;

&lt;p&gt;The dialect-cousin invariant: &lt;strong&gt;all three are 2-argument NULL-replacement functions; they exist for historical reasons predating standard &lt;code&gt;COALESCE&lt;/code&gt; adoption; they have minor type-coercion quirks vs &lt;code&gt;COALESCE&lt;/code&gt; that occasionally bite (e.g., SQL Server &lt;code&gt;ISNULL&lt;/code&gt; truncates to the type of the first argument, while &lt;code&gt;COALESCE&lt;/code&gt; uses precedence)&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;function&lt;/th&gt;
&lt;th&gt;engine&lt;/th&gt;
&lt;th&gt;arity&lt;/th&gt;
&lt;th&gt;type rule&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ISNULL(a, b)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SQL Server&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;result type = type of &lt;code&gt;a&lt;/code&gt; (truncation risk)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NVL(a, b)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Oracle&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;result type = type of &lt;code&gt;a&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;IFNULL(a, b)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MySQL, BigQuery&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;result type = unified from both arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; SQL Server &lt;code&gt;ISNULL&lt;/code&gt; truncation gotcha — when the first argument is a &lt;code&gt;NULL&lt;/code&gt; typed &lt;code&gt;VARCHAR(3)&lt;/code&gt;, the replacement string is truncated to 3 characters.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;th&gt;gotcha&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ISNULL(CAST(NULL AS VARCHAR(3)), 'goodbye')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'goo'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;truncated to len-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COALESCE(CAST(NULL AS VARCHAR(3)), 'goodbye')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'goodbye'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no truncation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- SQL Server-only (avoid in portable code)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ISNULL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;personal_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Oracle-only&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;NVL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;personal_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- MySQL / BigQuery&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;IFNULL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;personal_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Standard SQL — runs everywhere&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;personal_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; unless you are maintaining an existing single-dialect codebase that already uses the cousin, prefer &lt;code&gt;COALESCE&lt;/code&gt;; the cousins offer no benefit and lock you into one engine.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;CASE WHEN ... IS NOT NULL THEN ... ELSE ... END&lt;/code&gt; — when logic is not "first non-NULL"
&lt;/h4&gt;

&lt;p&gt;The CASE invariant: &lt;strong&gt;&lt;code&gt;CASE&lt;/code&gt; is the general-purpose conditional in SQL; it handles any boolean predicate, not just &lt;code&gt;IS NOT NULL&lt;/code&gt;; reach for &lt;code&gt;CASE&lt;/code&gt; when the branches involve ranges, flags, transformations, or any logic richer than "pick the first non-&lt;code&gt;NULL&lt;/code&gt;"&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two-argument CASE&lt;/strong&gt; equivalent to &lt;code&gt;COALESCE(a, b)&lt;/code&gt; — verbose but explicit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Range branches&lt;/strong&gt; — &lt;code&gt;CASE WHEN amount &amp;gt; 1000 THEN 'large' WHEN amount &amp;gt; 100 THEN 'medium' ELSE 'small' END&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple-condition logic&lt;/strong&gt; — combine &lt;code&gt;IS NULL&lt;/code&gt;, comparisons, and string predicates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-branch transformations&lt;/strong&gt; — different SQL functions per branch.&lt;/li&gt;
&lt;/ul&gt;
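&lt;p&gt;The first bullet — the two-argument &lt;code&gt;CASE&lt;/code&gt; being equivalent to &lt;code&gt;COALESCE(a, b)&lt;/code&gt; — can be checked exhaustively over the four NULL/non-NULL combinations. A sketch using SQLite through Python's &lt;code&gt;sqlite3&lt;/code&gt;; PostgreSQL behaves the same way here:&lt;/p&gt;

```python
import sqlite3

# The two-argument CASE spelling and COALESCE(a, b) agree on every
# NULL/non-NULL combination of (a, b).
conn = sqlite3.connect(":memory:")
pairs = [("x", "y"), (None, "y"), ("x", None), (None, None)]
for a, b in pairs:
    case_val, coalesce_val = conn.execute(
        "SELECT CASE WHEN ? IS NOT NULL THEN ? ELSE ? END, COALESCE(?, ?)",
        (a, a, b, a, b),
    ).fetchone()
    assert case_val == coalesce_val
print("CASE and COALESCE agree on all four combinations")
```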

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Categorize order amounts; &lt;code&gt;COALESCE&lt;/code&gt; doesn't apply because the logic is range-based.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;th&gt;category&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1500&lt;/td&gt;
&lt;td&gt;large&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;td&gt;medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;unknown&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;CASE&lt;/span&gt;
         &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'unknown'&lt;/span&gt;
         &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'large'&lt;/span&gt;
         &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;  &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'medium'&lt;/span&gt;
         &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'small'&lt;/span&gt;
       &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount_category&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- COALESCE cannot express ranges; CASE is the right tool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your logic is "first non-&lt;code&gt;NULL&lt;/code&gt;," use &lt;code&gt;COALESCE&lt;/code&gt;; if your logic is anything else, use &lt;code&gt;CASE&lt;/code&gt;; if you find yourself nesting &lt;code&gt;COALESCE&lt;/code&gt; with &lt;code&gt;CASE&lt;/code&gt;, the &lt;code&gt;CASE&lt;/code&gt; alone is usually clearer.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;ISNULL&lt;/code&gt; in PostgreSQL — PostgreSQL has no two-argument &lt;code&gt;ISNULL&lt;/code&gt; function; &lt;code&gt;ISNULL&lt;/code&gt; is nonstandard operator syntax for &lt;code&gt;expr IS NULL&lt;/code&gt;, so &lt;code&gt;ISNULL(a, b)&lt;/code&gt; is a syntax error.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;NVL&lt;/code&gt; outside Oracle — non-portable; &lt;code&gt;COALESCE&lt;/code&gt; instead.&lt;/li&gt;
&lt;li&gt;Hitting the SQL Server &lt;code&gt;ISNULL&lt;/code&gt; truncation gotcha — use &lt;code&gt;COALESCE&lt;/code&gt; for safe type unification.&lt;/li&gt;
&lt;li&gt;Reaching for &lt;code&gt;CASE&lt;/code&gt; when &lt;code&gt;COALESCE&lt;/code&gt; is shorter — verbose and harder to review.&lt;/li&gt;
&lt;li&gt;Reaching for &lt;code&gt;COALESCE&lt;/code&gt; when the logic is not first-non-&lt;code&gt;NULL&lt;/code&gt; — pick &lt;code&gt;CASE&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
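&lt;p&gt;One of these mistakes is easy to demonstrate concretely: the dialect cousins are fixed at two arguments, while &lt;code&gt;COALESCE&lt;/code&gt; takes any number. A quick check — SQLite's &lt;code&gt;IFNULL&lt;/code&gt; mirrors the MySQL two-argument signature:&lt;/p&gt;

```python
import sqlite3

# COALESCE accepts N arguments; the 2-arg dialect cousins do not.
conn = sqlite3.connect(":memory:")
ok = conn.execute("SELECT COALESCE(NULL, NULL, 'third')").fetchone()[0]
print(ok)  # third

try:
    conn.execute("SELECT IFNULL(NULL, NULL, 'third')")  # 3 args: rejected
except sqlite3.OperationalError as exc:
    print("IFNULL is strictly 2-arg:", exc)
```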

&lt;h3&gt;
  
  
  Worked Problem on Picking the Right NULL Construct
&lt;/h3&gt;

&lt;p&gt;Given a &lt;code&gt;users&lt;/code&gt; table with three nullable columns — &lt;code&gt;work_email&lt;/code&gt;, &lt;code&gt;personal_email&lt;/code&gt;, and &lt;code&gt;phone&lt;/code&gt; — return a column &lt;code&gt;effective_contact&lt;/code&gt; that is the first non-&lt;code&gt;NULL&lt;/code&gt; of the three, prefixed by &lt;code&gt;'email: '&lt;/code&gt; for emails and &lt;code&gt;'phone: '&lt;/code&gt; for phone numbers, falling back to &lt;code&gt;'no contact'&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;COALESCE&lt;/code&gt; inside a &lt;code&gt;CASE&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;work_email&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;personal_email&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
            &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'email: '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;personal_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
            &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'phone: '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'no contact'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;effective_contact&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the outer &lt;code&gt;CASE&lt;/code&gt; handles the per-channel transformation (different prefix for email vs phone), which &lt;code&gt;COALESCE&lt;/code&gt; alone cannot express; the inner &lt;code&gt;COALESCE(work_email, personal_email)&lt;/code&gt; collapses the two email candidates with first-non-&lt;code&gt;NULL&lt;/code&gt; priority, then the prefix concatenation runs only when at least one email is present; the &lt;code&gt;WHEN phone IS NOT NULL&lt;/code&gt; branch handles the secondary channel; the &lt;code&gt;ELSE 'no contact'&lt;/code&gt; is the guaranteed fallback. This is the canonical "use &lt;code&gt;COALESCE&lt;/code&gt; for null pickers, &lt;code&gt;CASE&lt;/code&gt; for everything else" idiom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for four sample rows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;work_email&lt;/th&gt;
&lt;th&gt;personal_email&lt;/th&gt;
&lt;th&gt;phone&lt;/th&gt;
&lt;th&gt;effective_contact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;a@w.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;a@h.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5551111&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;email: a@w.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;b@h.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5552222&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;email: b@h.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5553333&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;phone: 5553333&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;no contact&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Row 1&lt;/strong&gt; — &lt;code&gt;work_email&lt;/code&gt; non-null → first &lt;code&gt;WHEN&lt;/code&gt; matches → &lt;code&gt;email: a@w.com&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row 2&lt;/strong&gt; — &lt;code&gt;work_email&lt;/code&gt; null but &lt;code&gt;personal_email&lt;/code&gt; non-null → first &lt;code&gt;WHEN&lt;/code&gt; matches → &lt;code&gt;COALESCE&lt;/code&gt; returns &lt;code&gt;b@h.com&lt;/code&gt; → &lt;code&gt;email: b@h.com&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row 3&lt;/strong&gt; — both emails null → second &lt;code&gt;WHEN&lt;/code&gt; matches → &lt;code&gt;phone: 5553333&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row 4&lt;/strong&gt; — all three null → &lt;code&gt;ELSE&lt;/code&gt; → &lt;code&gt;no contact&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;effective_contact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;email: a@w.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;email: b@h.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;phone: 5553333&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;no contact&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; for null-picker&lt;/strong&gt; — collapses &lt;code&gt;work_email&lt;/code&gt;/&lt;code&gt;personal_email&lt;/code&gt; into one value with first-non-&lt;code&gt;NULL&lt;/code&gt; priority; clean, standard SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CASE&lt;/code&gt; for per-branch transformation&lt;/strong&gt; — different string prefixes per channel; &lt;code&gt;COALESCE&lt;/code&gt; cannot express this because every branch needs a different formatting rule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composition&lt;/strong&gt; — &lt;code&gt;COALESCE&lt;/code&gt; nested inside &lt;code&gt;CASE&lt;/code&gt; is idiomatic; the outer &lt;code&gt;CASE&lt;/code&gt; switches behavior, the inner &lt;code&gt;COALESCE&lt;/code&gt; resolves null preference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guaranteed-non-&lt;code&gt;NULL&lt;/code&gt; &lt;code&gt;ELSE&lt;/code&gt;&lt;/strong&gt; — the &lt;code&gt;ELSE 'no contact'&lt;/code&gt; ensures the output column is never &lt;code&gt;NULL&lt;/code&gt; regardless of input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N)&lt;/code&gt; time / &lt;code&gt;O(1)&lt;/code&gt; space&lt;/strong&gt; — single linear pass; per-row constant work for both the &lt;code&gt;CASE&lt;/code&gt; and the &lt;code&gt;COALESCE&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Practice:&lt;/strong&gt; more &lt;a href="https://pipecode.ai/explore/practice/topic/case-when/sql" rel="noopener noreferrer"&gt;SQL CASE-when problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/conditional-logic/sql" rel="noopener noreferrer"&gt;conditional-logic problems&lt;/a&gt; for the dialect-portability muscle.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. &lt;code&gt;COALESCE&lt;/code&gt; Pitfalls — NULL Semantics, Type Coercion, Empty Strings, and &lt;code&gt;NULLIF&lt;/code&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Avoiding silent semantic bugs in COALESCE-heavy SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Where does &lt;code&gt;COALESCE&lt;/code&gt; quietly produce wrong answers?" is the question senior interviewers probe to separate fluent candidates from rote memorizers. The mental model: &lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; is purely mechanical — first non-&lt;code&gt;NULL&lt;/code&gt; — but the meaning of &lt;code&gt;NULL&lt;/code&gt; is not; defaulting &lt;code&gt;NULL&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt; for a financial KPI changes the metric semantics; empty string &lt;code&gt;''&lt;/code&gt; is not &lt;code&gt;NULL&lt;/code&gt; so &lt;code&gt;COALESCE&lt;/code&gt; skips it; type mixing without &lt;code&gt;CAST&lt;/code&gt; is a coercion landmine; combining &lt;code&gt;COALESCE&lt;/code&gt; with &lt;code&gt;NULLIF&lt;/code&gt; produces the "empty-or-null → default" idiom&lt;/strong&gt;. Drill the four pitfalls below and you'll never ship a &lt;code&gt;COALESCE&lt;/code&gt; bug to production.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; State the &lt;code&gt;NULL&lt;/code&gt; semantics out loud before writing the SQL. If &lt;code&gt;NULL&lt;/code&gt; means "unknown," replacing it with &lt;code&gt;0&lt;/code&gt; may turn a missing data point into a real-looking zero. The interviewer wants to hear "I'm coalescing to &lt;code&gt;0&lt;/code&gt; because the metric definition says missing rentals count as zero rentals" — not "I added &lt;code&gt;COALESCE&lt;/code&gt; to make the dashboard not show &lt;code&gt;NULL&lt;/code&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;NULL&lt;/code&gt; ≠ &lt;code&gt;0&lt;/code&gt; — semantic awareness for KPIs and metrics
&lt;/h4&gt;

&lt;p&gt;The semantic invariant: &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt; typically means "unknown"; &lt;code&gt;0&lt;/code&gt; is a concrete numeric fact; replacing &lt;code&gt;NULL&lt;/code&gt; with &lt;code&gt;0&lt;/code&gt; in display code is usually fine but in calculation code can fabricate facts&lt;/strong&gt;. Treat the substitution as a documented business decision, not a code-style preference.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Display&lt;/strong&gt; — &lt;code&gt;COALESCE(amount, 0)&lt;/code&gt; is fine in dashboards (the user sees &lt;code&gt;0&lt;/code&gt;, knows it's a zero).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate&lt;/strong&gt; — &lt;code&gt;AVG(COALESCE(amount, 0))&lt;/code&gt; changes the average semantically; &lt;code&gt;AVG&lt;/code&gt; ignores &lt;code&gt;NULL&lt;/code&gt; natively, so coalescing first grows the denominator and (for nonnegative amounts) lowers the average.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Math&lt;/strong&gt; — &lt;code&gt;COALESCE(qty, 0) * price&lt;/code&gt; may be wrong if &lt;code&gt;qty IS NULL&lt;/code&gt; means "we don't know how much was ordered."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documented sentinel&lt;/strong&gt; — for metrics, prefer carrying the &lt;code&gt;NULL&lt;/code&gt; and handling it at the report layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Average of &lt;code&gt;[100, 200, NULL]&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;approach&lt;/th&gt;
&lt;th&gt;computation&lt;/th&gt;
&lt;th&gt;average&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVG(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(100+200)/2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVG(COALESCE(amount, 0))&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(100+200+0)/3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Choose intentionally; both are valid for different metric definitions&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                       &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_known_only&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;          &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_with_nulls_as_zero&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                          &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;non_null_rows&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the &lt;code&gt;NULL&lt;/code&gt;-vs-&lt;code&gt;0&lt;/code&gt; decision is a metric-definition decision, not a coding style; document it inline.&lt;/p&gt;

&lt;h4&gt;
  
  
  Empty strings are not &lt;code&gt;NULL&lt;/code&gt; — combine with &lt;code&gt;NULLIF&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The empty-string invariant: &lt;strong&gt;in PostgreSQL and most engines, &lt;code&gt;''&lt;/code&gt; and &lt;code&gt;NULL&lt;/code&gt; are different; &lt;code&gt;COALESCE(col, 'default')&lt;/code&gt; returns &lt;code&gt;''&lt;/code&gt; when &lt;code&gt;col = ''&lt;/code&gt; (because &lt;code&gt;''&lt;/code&gt; is non-&lt;code&gt;NULL&lt;/code&gt;); to treat empty-as-null, wrap with &lt;code&gt;NULLIF(col, '')&lt;/code&gt; first&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULLIF(a, b)&lt;/code&gt;&lt;/strong&gt; — returns &lt;code&gt;NULL&lt;/code&gt; when &lt;code&gt;a = b&lt;/code&gt;, else returns &lt;code&gt;a&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(NULLIF(col, ''), 'default')&lt;/code&gt;&lt;/strong&gt; — the canonical "empty-or-null → default" idiom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oracle quirk&lt;/strong&gt; — Oracle's &lt;code&gt;VARCHAR2&lt;/code&gt; treats &lt;code&gt;''&lt;/code&gt; as &lt;code&gt;NULL&lt;/code&gt;; not portable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MySQL behavior&lt;/strong&gt; — &lt;code&gt;''&lt;/code&gt; is not &lt;code&gt;NULL&lt;/code&gt;; same as PostgreSQL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three rows: real value, empty string, real &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input&lt;/th&gt;
&lt;th&gt;&lt;code&gt;COALESCE(col, 'X')&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;COALESCE(NULLIF(col, ''), 'X')&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'hello'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'hello'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'hello'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'X'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'X'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'X'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;raw_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'X'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;naive_default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'X'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;empty_or_null_default&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; whenever a string column might be empty and &lt;code&gt;COALESCE&lt;/code&gt; is the right tool, wrap with &lt;code&gt;NULLIF(col, '')&lt;/code&gt; to handle both cases.&lt;/p&gt;

&lt;h4&gt;
  
  
  Type-coercion landmines and explicit &lt;code&gt;CAST&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The type-coercion invariant: &lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; returns a single unified type; mixing incompatible types is a type error in strict engines (PostgreSQL) and a silent coercion in lenient ones (MySQL); &lt;code&gt;CAST&lt;/code&gt; keeps intent explicit and review-friendly&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(int_col, 'unknown')&lt;/code&gt;&lt;/strong&gt; — error in PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(CAST(int_col AS TEXT), 'unknown')&lt;/code&gt;&lt;/strong&gt; — explicit, portable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(numeric_col, 0)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;NUMERIC&lt;/code&gt; and &lt;code&gt;INTEGER&lt;/code&gt; unify cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(date_col, '1900-01-01'::DATE)&lt;/code&gt;&lt;/strong&gt; — explicit &lt;code&gt;::DATE&lt;/code&gt; cast on the literal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Mixing &lt;code&gt;INTEGER&lt;/code&gt; and &lt;code&gt;TEXT&lt;/code&gt; cleanly via &lt;code&gt;CAST&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input&lt;/th&gt;
&lt;th&gt;naive&lt;/th&gt;
&lt;th&gt;with CAST&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qty = 5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'5'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qty = NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'unknown'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'unknown'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;qty_label&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; whenever &lt;code&gt;COALESCE&lt;/code&gt; mixes families (numeric + string, date + string), insert an explicit &lt;code&gt;CAST&lt;/code&gt; on whichever argument needs it; never trust implicit coercion to do the right thing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Defaulting &lt;code&gt;NULL&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt; without considering the metric semantics — fabricates zeros into "unknown" rows.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;COALESCE(col, 'default')&lt;/code&gt; for empty strings — &lt;code&gt;''&lt;/code&gt; is not &lt;code&gt;NULL&lt;/code&gt;; wrap with &lt;code&gt;NULLIF&lt;/code&gt; first.&lt;/li&gt;
&lt;li&gt;Mixing &lt;code&gt;INTEGER&lt;/code&gt; and &lt;code&gt;TEXT&lt;/code&gt; without &lt;code&gt;CAST&lt;/code&gt; — a type error in PostgreSQL.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;COALESCE&lt;/code&gt; for string concatenation — that's &lt;code&gt;STRING_AGG&lt;/code&gt; / &lt;code&gt;LISTAGG&lt;/code&gt;'s job.&lt;/li&gt;
&lt;li&gt;Forgetting that &lt;code&gt;AVG&lt;/code&gt; ignores &lt;code&gt;NULL&lt;/code&gt; natively — coalescing first changes the average.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Worked Problem on Empty-or-NULL Default with NULLIF
&lt;/h3&gt;

&lt;p&gt;Given a &lt;code&gt;customers&lt;/code&gt; table with a &lt;code&gt;display_name&lt;/code&gt; column that is sometimes &lt;code&gt;NULL&lt;/code&gt;, sometimes &lt;code&gt;''&lt;/code&gt; (empty string), and sometimes a real value, return a column &lt;code&gt;effective_name&lt;/code&gt; that is the real value when present, the literal &lt;code&gt;'Anonymous'&lt;/code&gt; when the input is &lt;code&gt;NULL&lt;/code&gt; or empty.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;COALESCE(NULLIF(display_name, ''), 'Anonymous')&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'Anonymous'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;effective_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;NULLIF(display_name, '')&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt; when &lt;code&gt;display_name&lt;/code&gt; equals &lt;code&gt;''&lt;/code&gt;, else returns &lt;code&gt;display_name&lt;/code&gt; unchanged; &lt;code&gt;COALESCE(..., 'Anonymous')&lt;/code&gt; then substitutes the literal &lt;code&gt;'Anonymous'&lt;/code&gt; for &lt;code&gt;NULL&lt;/code&gt; (which now covers both the original &lt;code&gt;NULL&lt;/code&gt; and the converted &lt;code&gt;''&lt;/code&gt;). The composition handles both edge cases in one expression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for four sample rows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;display_name&lt;/th&gt;
&lt;th&gt;NULLIF(display_name, '')&lt;/th&gt;
&lt;th&gt;COALESCE&lt;/th&gt;
&lt;th&gt;effective_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'Alice Lee'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'Alice Lee'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'Alice Lee'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Alice Lee&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'Anonymous'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Anonymous&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'Anonymous'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Anonymous&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'Bob B.'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'Bob B.'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'Bob B.'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Bob B.&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Row 1&lt;/strong&gt; — real value passes through &lt;code&gt;NULLIF&lt;/code&gt; unchanged → &lt;code&gt;COALESCE&lt;/code&gt; short-circuits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row 2&lt;/strong&gt; — empty string converted to &lt;code&gt;NULL&lt;/code&gt; by &lt;code&gt;NULLIF&lt;/code&gt; → &lt;code&gt;COALESCE&lt;/code&gt; substitutes &lt;code&gt;'Anonymous'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row 3&lt;/strong&gt; — original &lt;code&gt;NULL&lt;/code&gt; passes through &lt;code&gt;NULLIF&lt;/code&gt; (&lt;code&gt;NULL = ''&lt;/code&gt; evaluates to &lt;code&gt;NULL&lt;/code&gt;, not true) → &lt;code&gt;COALESCE&lt;/code&gt; substitutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row 4&lt;/strong&gt; — real value passes through unchanged.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;effective_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice Lee&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Anonymous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Anonymous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Bob B.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULLIF(a, b)&lt;/code&gt; semantics&lt;/strong&gt; — returns &lt;code&gt;NULL&lt;/code&gt; when &lt;code&gt;a = b&lt;/code&gt;, else returns &lt;code&gt;a&lt;/code&gt;; converts &lt;code&gt;''&lt;/code&gt; to &lt;code&gt;NULL&lt;/code&gt; so &lt;code&gt;COALESCE&lt;/code&gt; can treat both as "missing."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composition with &lt;code&gt;COALESCE&lt;/code&gt;&lt;/strong&gt; — the inner &lt;code&gt;NULLIF&lt;/code&gt; normalizes the input, the outer &lt;code&gt;COALESCE&lt;/code&gt; applies the default; reads exactly like the business rule "empty-or-null → 'Anonymous'."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;CASE&lt;/code&gt; needed&lt;/strong&gt; — &lt;code&gt;CASE WHEN display_name IS NULL OR display_name = '' THEN 'Anonymous' ELSE display_name END&lt;/code&gt; is verbose; the two-function composition is cleaner and equivalent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-column output&lt;/strong&gt; — &lt;code&gt;effective_name&lt;/code&gt; is non-&lt;code&gt;NULL&lt;/code&gt; and non-empty by construction; downstream code can treat it as a clean string.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N)&lt;/code&gt; time / &lt;code&gt;O(1)&lt;/code&gt; space&lt;/strong&gt; — one linear scan; per-row constant work for the &lt;code&gt;NULLIF&lt;/code&gt; + &lt;code&gt;COALESCE&lt;/code&gt; composition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill more &lt;a href="https://pipecode.ai/explore/practice/topic/null-handling/sql" rel="noopener noreferrer"&gt;SQL null-handling problems&lt;/a&gt; on PipeCode.&lt;/p&gt;





&lt;h2&gt;
  
  
  Tips to master COALESCE in SQL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  State the rule out loud before writing the query
&lt;/h3&gt;

&lt;p&gt;Every interview answer that touches &lt;code&gt;NULL&lt;/code&gt; should open with "left-to-right, first non-&lt;code&gt;NULL&lt;/code&gt;, returns &lt;code&gt;NULL&lt;/code&gt; only if every argument is &lt;code&gt;NULL&lt;/code&gt;." That single sentence demonstrates fluency. Candidates who write &lt;code&gt;COALESCE&lt;/code&gt; in the query without naming the rule are graded as memorizers; candidates who name the rule and then write the query are graded as fluent. The rule takes 7 seconds to say; it's worth a level on the interviewer's rubric.&lt;/p&gt;
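&lt;p&gt;The rule is small enough to model outside the database. A minimal Python sketch, with &lt;code&gt;None&lt;/code&gt; standing in for SQL &lt;code&gt;NULL&lt;/code&gt; — an analogy for the semantics, not how the engine implements it:&lt;/p&gt;

```python
def coalesce(*args):
    """Return the first argument that is not None (SQL NULL); None if all are."""
    for arg in args:
        if arg is not None:
            return arg
    return None

first = coalesce(None, "work@x.com", "personal@x.com")  # first non-NULL wins
all_null = coalesce(None, None)                         # every argument NULL -> NULL
zero = coalesce(0, 42)                                  # 0 is NOT NULL, so 0 wins
```

&lt;p&gt;The last call makes the &lt;code&gt;NULL&lt;/code&gt;-vs-&lt;code&gt;0&lt;/code&gt; distinction concrete: &lt;code&gt;0&lt;/code&gt; is a value, so &lt;code&gt;COALESCE&lt;/code&gt; returns it instead of falling through.&lt;/p&gt;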

&lt;h3&gt;
  
  
  Drill the four primitives
&lt;/h3&gt;

&lt;p&gt;The four primitives in this guide map directly to the surface area &lt;code&gt;COALESCE&lt;/code&gt; covers in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Left-to-right evaluation with short-circuit semantics&lt;/strong&gt; — the rule itself plus the side-effect-safety contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; + &lt;code&gt;COALESCE(right_col, default)&lt;/code&gt;&lt;/strong&gt; — the canonical "fact + dim with safe defaults" pattern that drives every BI dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt; vs &lt;code&gt;CASE&lt;/code&gt; / &lt;code&gt;ISNULL&lt;/code&gt; / &lt;code&gt;NVL&lt;/code&gt; / &lt;code&gt;IFNULL&lt;/code&gt;&lt;/strong&gt; — the dialect portability matrix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The pitfall set&lt;/strong&gt; — &lt;code&gt;NULL&lt;/code&gt; ≠ &lt;code&gt;0&lt;/code&gt;, empty string ≠ &lt;code&gt;NULL&lt;/code&gt;, type coercion, &lt;code&gt;NULLIF&lt;/code&gt; composition.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pick PostgreSQL-flavored answers in interviews
&lt;/h3&gt;

&lt;p&gt;Most coding-environment interviews — DataLemur, StrataScratch, LeetCode SQL, CoderPad live coding — default to PostgreSQL. Drill PostgreSQL syntax: &lt;code&gt;COALESCE&lt;/code&gt; (not &lt;code&gt;ISNULL&lt;/code&gt;), &lt;code&gt;EXTRACT(MONTH FROM ts)&lt;/code&gt; (not &lt;code&gt;MONTH(ts)&lt;/code&gt;), &lt;code&gt;INTERVAL '30 days'&lt;/code&gt; (not &lt;code&gt;DATEADD&lt;/code&gt;), &lt;code&gt;||&lt;/code&gt; for string concatenation (not &lt;code&gt;+&lt;/code&gt;), &lt;code&gt;::TYPE&lt;/code&gt; for casts (not &lt;code&gt;CONVERT(TYPE, ...)&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Combine &lt;code&gt;COALESCE&lt;/code&gt; with &lt;code&gt;NULLIF&lt;/code&gt; for empty-or-null
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;COALESCE(NULLIF(col, ''), 'default')&lt;/code&gt; idiom is the canonical "treat empty string as null and fall back to a default" pattern. Memorize it. PostgreSQL keeps &lt;code&gt;''&lt;/code&gt; and &lt;code&gt;NULL&lt;/code&gt; distinct; without the inner &lt;code&gt;NULLIF&lt;/code&gt;, &lt;code&gt;COALESCE&lt;/code&gt; returns the empty string unchanged. Naming this composition unprompted in interviews is a strong product-fluency signal.&lt;/p&gt;
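&lt;p&gt;The idiom is easy to sanity-check locally. A sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in — SQLite, like PostgreSQL, keeps &lt;code&gt;''&lt;/code&gt; and &lt;code&gt;NULL&lt;/code&gt; distinct, so the composition behaves the same for this example:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Real value passes through; empty string and NULL both fall back to 'X'.
for value, expected in [("hello", "hello"), ("", "X"), (None, "X")]:
    (result,) = conn.execute(
        "SELECT COALESCE(NULLIF(?, ''), 'X')", (value,)
    ).fetchone()
    assert result == expected
```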

&lt;h3&gt;
  
  
  Choose &lt;code&gt;CASE&lt;/code&gt; when the logic is not "first non-NULL"
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;COALESCE&lt;/code&gt; is for ordered fallbacks. &lt;code&gt;CASE&lt;/code&gt; is for everything else — ranges, flags, transformations, branch-specific formatting. If you find yourself writing nested &lt;code&gt;COALESCE&lt;/code&gt; calls or &lt;code&gt;COALESCE&lt;/code&gt; inside &lt;code&gt;CASE&lt;/code&gt; to express conditional logic, the pure &lt;code&gt;CASE&lt;/code&gt; is usually clearer. The decision tree: "first non-&lt;code&gt;NULL&lt;/code&gt;?" → &lt;code&gt;COALESCE&lt;/code&gt;; "anything richer?" → &lt;code&gt;CASE&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Default thoughtfully — &lt;code&gt;NULL&lt;/code&gt; carries information
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;NULL&lt;/code&gt; in a financial KPI is information ("we don't know"); replacing it with &lt;code&gt;0&lt;/code&gt; is a documented business decision. State the decision in the SQL comment or in the surrounding pull-request description. A senior interviewer wants to hear "I'm coalescing to &lt;code&gt;0&lt;/code&gt; because the metric definition says missing rentals count as zero" — not "I added &lt;code&gt;COALESCE&lt;/code&gt; so the dashboard wouldn't show &lt;code&gt;NULL&lt;/code&gt;."&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/topic/null-handling/sql" rel="noopener noreferrer"&gt;SQL null-handling practice page&lt;/a&gt; for the curated set of &lt;code&gt;COALESCE&lt;/code&gt;-style problems. After that, drill the matching topic pages: &lt;a href="https://pipecode.ai/explore/practice/topic/conditional-logic/sql" rel="noopener noreferrer"&gt;conditional logic&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/case-when/sql" rel="noopener noreferrer"&gt;CASE when&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;joins&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;aggregation&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;filtering&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;window functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/cte/sql" rel="noopener noreferrer"&gt;CTE&lt;/a&gt;. The &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for data engineering interviews course&lt;/a&gt; bundles structured curricula. 
For broader coverage, &lt;a href="https://pipecode.ai/explore/practice" rel="noopener noreferrer"&gt;browse all practice problems&lt;/a&gt; or pivot to peer guides — the &lt;a href="https://pipecode.ai/blogs/airbnb-data-engineering-interview-questions-prep-guide" rel="noopener noreferrer"&gt;Airbnb DE interview guide&lt;/a&gt;, the &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top DE interview questions 2026&lt;/a&gt; blog, and the &lt;a href="https://pipecode.ai/blogs/sql-data-types-postgresql-guide" rel="noopener noreferrer"&gt;SQL data types Postgres guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communication and approach under time pressure
&lt;/h3&gt;

&lt;p&gt;Talk through the rule first ("left-to-right, first non-&lt;code&gt;NULL&lt;/code&gt;"), the obvious application second ("here that means prefer &lt;code&gt;work_email&lt;/code&gt; then &lt;code&gt;personal_email&lt;/code&gt; then a literal default"), and the edge cases third ("if every column is &lt;code&gt;NULL&lt;/code&gt;, the literal kicks in; if &lt;code&gt;NULL&lt;/code&gt; could mean &lt;code&gt;0&lt;/code&gt; semantically, I'd think twice about defaulting to &lt;code&gt;0&lt;/code&gt;"). Interviewers grade &lt;strong&gt;process&lt;/strong&gt; as much as the final query. Leave 30 seconds for a sweep: empty string vs &lt;code&gt;NULL&lt;/code&gt;, type mixing, side-effect safety on later arguments, the &lt;code&gt;LEFT JOIN&lt;/code&gt;-becomes-&lt;code&gt;INNER&lt;/code&gt; trap if a &lt;code&gt;WHERE&lt;/code&gt; filters the right table.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is &lt;code&gt;COALESCE&lt;/code&gt; the same as &lt;code&gt;IFNULL&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;Not exactly. &lt;strong&gt;&lt;code&gt;IFNULL&lt;/code&gt;&lt;/strong&gt; (MySQL, BigQuery) and &lt;strong&gt;&lt;code&gt;ISNULL&lt;/code&gt;&lt;/strong&gt; (SQL Server) are usually two-argument forms. &lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt;&lt;/strong&gt; accepts many arguments and is part of the SQL:1992 standard, which is why teams prefer it in portable analytics code. PipeCode's SQL practice is PostgreSQL-oriented, so canonical solutions use &lt;code&gt;COALESCE&lt;/code&gt; everywhere. If you're working in a MySQL-only codebase that already uses &lt;code&gt;IFNULL&lt;/code&gt;, you can keep it; if you're writing portable code or migrating between engines, switch to &lt;code&gt;COALESCE&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does &lt;code&gt;COALESCE&lt;/code&gt; skip evaluating later arguments?
&lt;/h3&gt;

&lt;p&gt;Most major databases short-circuit once they find a non-&lt;code&gt;NULL&lt;/code&gt; value — PostgreSQL, MySQL, SQL Server, Oracle, and Snowflake all document this behavior. That's useful for performance, but the SQL standard does not require short-circuit, and you should not rely on side effects in later arguments (volatile functions, RAISE statements, sequence calls). In interviews, saying "typically short-circuits, but I would not depend on side effects" is a strong answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I nest &lt;code&gt;COALESCE&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;Yes — &lt;code&gt;COALESCE(a, b, COALESCE(c, d))&lt;/code&gt; works — but a flat list is easier to read and review. Prefer &lt;code&gt;COALESCE(a, b, c, d)&lt;/code&gt; when all fallbacks are simple columns or literals. The only legitimate reason to nest is when one fallback itself depends on another expression that needs its own fallback chain.&lt;/p&gt;
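&lt;p&gt;A quick equivalence check, sketched with Python's &lt;code&gt;sqlite3&lt;/code&gt;; the flattening property is standard &lt;code&gt;COALESCE&lt;/code&gt; behavior, not engine-specific:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
nested, flat = conn.execute(
    "SELECT COALESCE(NULL, NULL, COALESCE(NULL, 'd')), "
    "       COALESCE(NULL, NULL, NULL, 'd')"
).fetchone()
# Nesting adds nothing here; the flat argument list reads better in review.
assert nested == flat == "d"
```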

&lt;h3&gt;
  
  
  When should I use &lt;code&gt;CASE&lt;/code&gt; instead of &lt;code&gt;COALESCE&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;CASE&lt;/code&gt; when the logic is not "first non-&lt;code&gt;NULL&lt;/code&gt;" — for example, ranges (&lt;code&gt;amount &amp;gt; 1000&lt;/code&gt; → &lt;code&gt;'large'&lt;/code&gt;), flags (&lt;code&gt;is_premium&lt;/code&gt; → premium pricing), or different transformations per branch. &lt;code&gt;COALESCE&lt;/code&gt; stays the right default when you only need ordered fallbacks. For more conditional patterns, browse &lt;a href="https://pipecode.ai/explore/practice/topic/conditional-logic/sql" rel="noopener noreferrer"&gt;conditional-logic SQL on PipeCode&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/case-when/sql" rel="noopener noreferrer"&gt;CASE-when problems&lt;/a&gt;.&lt;/p&gt;
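&lt;p&gt;A side-by-side sketch of the two tools, with Python's &lt;code&gt;sqlite3&lt;/code&gt; standing in for PostgreSQL (both engines share these &lt;code&gt;CASE&lt;/code&gt; and &lt;code&gt;COALESCE&lt;/code&gt; semantics for this example):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    SELECT amount,
           CASE WHEN amount > 1000 THEN 'large'     -- branch logic: CASE territory
                WHEN amount IS NULL THEN 'unknown'
                ELSE 'small' END     AS size_label,
           COALESCE(amount, 0)       AS amount_or_zero  -- ordered fallback: COALESCE
    FROM (SELECT 1500 AS amount UNION ALL SELECT 200 UNION ALL SELECT NULL)
""").fetchall()
assert (1500, "large", 1500) in rows
assert (200, "small", 200) in rows
assert (None, "unknown", 0) in rows
```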

&lt;h3&gt;
  
  
  How do I handle empty strings vs &lt;code&gt;NULL&lt;/code&gt; together?
&lt;/h3&gt;

&lt;p&gt;Use the &lt;code&gt;COALESCE(NULLIF(col, ''), 'default')&lt;/code&gt; composition. &lt;code&gt;NULLIF(col, '')&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt; when &lt;code&gt;col&lt;/code&gt; is empty (and returns &lt;code&gt;col&lt;/code&gt; unchanged otherwise); the outer &lt;code&gt;COALESCE&lt;/code&gt; then treats both the original &lt;code&gt;NULL&lt;/code&gt; and the converted-empty as missing and applies the default. This is the canonical PostgreSQL idiom for "missing or blank → default" — drill it; it shows up at every interview that touches dirty user-input data.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the right default value for &lt;code&gt;COALESCE&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;The right default depends on the metric definition, not on coding style. For displays and dashboards, &lt;code&gt;0&lt;/code&gt; for numerics and &lt;code&gt;'NONE'&lt;/code&gt; / &lt;code&gt;'Unknown'&lt;/code&gt; for strings are common. For aggregates, &lt;code&gt;COALESCE(SUM(x), 0)&lt;/code&gt; keeps empty groups visible in reports. For financial or scientific metrics where &lt;code&gt;NULL&lt;/code&gt; means "unknown," carry the &lt;code&gt;NULL&lt;/code&gt; and handle it at the report layer instead of replacing it. Document the decision inline so the next reviewer knows it was intentional.&lt;/p&gt;
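&lt;p&gt;The empty-group behavior is easy to demonstrate — a sketch with Python's &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in; &lt;code&gt;SUM&lt;/code&gt; over zero rows returns &lt;code&gt;NULL&lt;/code&gt; in PostgreSQL as well:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount INTEGER)")  # deliberately left empty

raw, defaulted = conn.execute(
    "SELECT SUM(amount), COALESCE(SUM(amount), 0) FROM orders"
).fetchone()
assert raw is None      # SUM over zero rows is NULL, not 0
assert defaulted == 0   # COALESCE keeps the empty group visible as 0
```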




&lt;h2&gt;
  
  
  Start practicing SQL COALESCE problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Facebook Data Engineering Interview Questions &amp; Prep Guide</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Sun, 03 May 2026 05:43:11 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/facebook-data-engineering-interview-questions-prep-guide-1da6</link>
      <guid>https://dev.to/gowthampotureddi/facebook-data-engineering-interview-questions-prep-guide-1da6</guid>
      <description>&lt;p&gt;&lt;strong&gt;Facebook data engineering interview questions&lt;/strong&gt; are bilingual, product-analytics-flavored, and PostgreSQL-grounded on the SQL side. Facebook Inc. rebranded to &lt;strong&gt;Meta Platforms Inc.&lt;/strong&gt; in October 2021, but the data-engineering interview shape — and the question patterns that show up in the live phone screen and onsite — has not moved. The standard technical phone screen is &lt;strong&gt;5 minutes intro + 30 minutes SQL + 30 minutes Python + 5 minutes Q&amp;amp;A&lt;/strong&gt;, with the candidate choosing whether to start with SQL or Python. Four primitives carry the loop: &lt;code&gt;n*(n+1)/2 - sum(arr)&lt;/code&gt; arithmetic-series sum-formula (and its XOR self-cancellation alternative) for the missing-number array problem, character-by-character tokenization plus two-pass evaluation for arithmetic formula parsing, correlated &lt;code&gt;EXISTS&lt;/code&gt; subqueries with &lt;code&gt;EXTRACT(MONTH FROM ts - INTERVAL '1 month')&lt;/code&gt; for month-over-month MAU retention, and CTE composition plus self-joins for post-hiatus aggregation and friend-recommendation queries.&lt;/p&gt;

&lt;p&gt;This guide walks four topic clusters end-to-end, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, and an &lt;strong&gt;interview-style problem with a full solution&lt;/strong&gt; that explains why it works. The mix matches a curated 2-problem Facebook set (1 EASY Python array + 1 MEDIUM Python array+math+bit+string parser) plus two adjacent SQL primitives — &lt;code&gt;EXISTS&lt;/code&gt; month-over-month retention and CTE + self-join aggregation — that show up on every Meta SQL question list and at every product-analytics onsite. The interview is bilingual SQL + Python; candidates who prep only one language stutter on the half they avoided.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuo0y7qn1izz8zgxfocjm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuo0y7qn1izz8zgxfocjm.jpeg" alt="Facebook data engineering interview questions cover image with bold headline, Python and SQL chips, Meta-Facebook rebrand chip, faint code ghost, and pipecode.ai attribution." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top Facebook data engineering interview topics
&lt;/h2&gt;

&lt;p&gt;From the &lt;a href="https://pipecode.ai/explore/practice/company/facebook" rel="noopener noreferrer"&gt;Facebook data engineering practice set&lt;/a&gt;, the &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; (one row per &lt;strong&gt;H2&lt;/strong&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic (sections &lt;strong&gt;1–4&lt;/strong&gt;)&lt;/th&gt;
&lt;th&gt;Why it shows up at Facebook&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Python array missing number — sum formula and XOR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Missing Number (EASY) — &lt;code&gt;n*(n+1)/2 - sum(arr)&lt;/code&gt; arithmetic-series identity and its XOR self-cancellation alternative, the classic array primitive Meta phone-screens with.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Python arithmetic formula evaluator — array, math, bit, string parsing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Arithmetic Formula Evaluator (MEDIUM) — character-by-character tokenization plus two-pass evaluation (handle &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; first, then &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt;), the parser primitive that powers any "given a string formula, return the value" question.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL window functions and &lt;code&gt;EXISTS&lt;/code&gt; for monthly active user retention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Active User Retention — correlated &lt;code&gt;EXISTS&lt;/code&gt; subquery with &lt;code&gt;EXTRACT(MONTH FROM ts - INTERVAL '1 month')&lt;/code&gt; for month-over-month MAU, the SQL primitive that drives Meta product-analytics retention dashboards.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL CTE and self-join for post hiatus and friend recommendations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Average Post Hiatus + Friend Recommendations — &lt;code&gt;MIN/MAX(post_date)&lt;/code&gt; per user with &lt;code&gt;HAVING COUNT &amp;gt; 1&lt;/code&gt; and CTE-driven self-joins for pair-wise friend-rec queries, the SQL primitive that drives Meta social-graph analytics.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bilingual phone-screen framing rule:&lt;/strong&gt; Facebook / Meta data engineering phone screens run a strict &lt;strong&gt;5-30-30-5&lt;/strong&gt; format: 5 minutes intro, 30 minutes SQL, 30 minutes Python, 5 minutes Q&amp;amp;A — and the candidate chooses whether to open with SQL or Python. Drill both halves equally; over-indexing on one side sets up a stuttering second half. State your preferred opener at the start so the interviewer can plan accordingly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Python Array Missing Number — Sum Formula and XOR
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sum-formula and XOR self-cancellation in Python for data engineering
&lt;/h3&gt;

&lt;p&gt;"Given an array of &lt;code&gt;n&lt;/code&gt; distinct integers in the range &lt;code&gt;[0, n]&lt;/code&gt; with exactly one missing, return the missing number" is Facebook's signature EASY Python prompt (Missing Number). The mental model: &lt;strong&gt;the arithmetic-series identity &lt;code&gt;0 + 1 + … + n = n*(n+1)/2&lt;/code&gt; gives the expected sum; subtracting the actual sum reveals the missing number in &lt;code&gt;O(n)&lt;/code&gt; time and &lt;code&gt;O(1)&lt;/code&gt; space&lt;/strong&gt;. The mirror primitive is XOR self-cancellation — &lt;code&gt;a ^ a == 0&lt;/code&gt; and &lt;code&gt;a ^ 0 == a&lt;/code&gt; — which gives the same answer using bit operations and never overflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fveejywybtyv82fa5svxz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fveejywybtyv82fa5svxz.jpeg" alt="Diagram showing an input array with one missing integer, the arithmetic-series sum formula computing the expected total, the actual sum subtracted to reveal the missing number, and a parallel XOR chain showing how XOR self-cancellation isolates the missing value." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; State both approaches out loud before writing code. Interviewers grade the candidate's awareness that the sum-formula can overflow on huge &lt;code&gt;n&lt;/code&gt; (Python ints are arbitrary precision so it's fine in practice, but the answer in Java or C++ requires the XOR variant). Naming both demonstrates breadth.&lt;/p&gt;
&lt;/blockquote&gt;
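The Python half of the pro tip is easy to verify directly: Python ints are arbitrary precision, so the expected-sum formula stays exact even for enormous `n` (a minimal sketch):

```python
# Python ints never overflow: the closed-form sum is exact for huge n.
n = 10**9
expected = n * (n + 1) // 2
print(expected)  # 500000000500000000 — exact, no wraparound
```

In Java or C++ the same product `n * (n + 1)` would overflow a 32-bit int well before `n = 10**9`, which is exactly why the XOR variant is the safer answer there.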

&lt;h4&gt;
  
  
  Arithmetic-series sum formula: &lt;code&gt;n*(n+1)/2&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The sum-formula invariant: &lt;strong&gt;the sum of integers from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;n&lt;/code&gt; inclusive is &lt;code&gt;n*(n+1)/2&lt;/code&gt;&lt;/strong&gt;. Subtracting &lt;code&gt;sum(arr)&lt;/code&gt; from this expected total yields the missing element when exactly one integer is absent from the contiguous range.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;n*(n+1)//2&lt;/code&gt;&lt;/strong&gt; — integer division in Python; produces the expected total.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sum(nums)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;O(n)&lt;/code&gt; linear scan; produces the actual total.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;expected - actual&lt;/code&gt;&lt;/strong&gt; — the difference is the missing value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constant space&lt;/strong&gt; — no auxiliary structure required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;nums = [0, 1, 3, 4]&lt;/code&gt;, &lt;code&gt;n = 4&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n*(n+1)//2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sum(nums)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;missing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;missing_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the sum-formula is the cleanest one-liner in Python; reach for it whenever the prompt guarantees a contiguous integer range.&lt;/p&gt;
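Boundary cases are worth a quick sanity check before the interviewer asks: the same one-liner handles a missing `0`, a missing `n`, and a single-element input without any special-casing (a minimal sketch repeating the one-liner above):

```python
def missing_number(nums: list[int]) -> int:
    # Expected sum of 0..n minus actual sum isolates the absent value.
    n = len(nums)
    return n * (n + 1) // 2 - sum(nums)

# Missing element at each boundary of the range [0, n].
print(missing_number([1, 2, 3]))  # 0 is missing
print(missing_number([0, 1, 2]))  # 3 (== n) is missing
print(missing_number([0]))        # n = 1, so 1 is missing
```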

&lt;h4&gt;
  
  
  XOR self-cancellation: &lt;code&gt;a ^ a == 0&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The XOR invariant: &lt;strong&gt;&lt;code&gt;a ^ a == 0&lt;/code&gt; and &lt;code&gt;a ^ 0 == a&lt;/code&gt;; XOR is commutative and associative so the order does not matter&lt;/strong&gt;. XORing every element of &lt;code&gt;nums&lt;/code&gt; with every element of &lt;code&gt;range(n + 1)&lt;/code&gt; cancels every value that appears in both, leaving only the one missing from &lt;code&gt;nums&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;a ^ a == 0&lt;/code&gt;&lt;/strong&gt; — self-cancellation property.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;a ^ 0 == a&lt;/code&gt;&lt;/strong&gt; — identity property.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commutative + associative&lt;/strong&gt; — order-independent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No overflow&lt;/strong&gt; — XOR never produces a result wider than its operands, so even fixed-width integer types stay safe.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same &lt;code&gt;nums = [0, 1, 3, 4]&lt;/code&gt;, &lt;code&gt;n = 4&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;XOR chain&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;nums&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0 ^ 1 ^ 3 ^ 4&lt;/code&gt; = 6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;range&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0 ^ 1 ^ 2 ^ 3 ^ 4&lt;/code&gt; = 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;combined&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;6 ^ 4&lt;/code&gt; = 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nb"&gt;reduce&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xor&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;missing_number_xor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nums&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the XOR variant is the right answer when the prompt asks about overflow safety or when you want to demonstrate bit-manipulation fluency.&lt;/p&gt;
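One caveat on the `reduce` version above: concatenating `nums` with a materialized `range` list costs `O(n)` auxiliary space. An index-based loop keeps the XOR approach at true `O(1)` space, which matters if the interviewer presses on the space bound (a minimal sketch):

```python
def missing_number_xor_o1(nums: list[int]) -> int:
    # XOR every index with every value; matched pairs cancel,
    # leaving only the missing number. O(n) time, O(1) space.
    acc = len(nums)  # seed with n, the one index the loop never visits
    for i, v in enumerate(nums):
        acc ^= i ^ v
    return acc

print(missing_number_xor_o1([0, 1, 3, 4]))  # 2
```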

&lt;h4&gt;
  
  
  Set-difference fallback for non-contiguous ranges
&lt;/h4&gt;

&lt;p&gt;The set-difference invariant: &lt;strong&gt;when the input range is not contiguous (e.g., "find the missing element from &lt;code&gt;[10, 20, 30, 40, 60]&lt;/code&gt; knowing the full set should be &lt;code&gt;[10, 20, 30, 40, 50, 60]&lt;/code&gt;"), the sum and XOR shortcuts no longer apply; reach for &lt;code&gt;set(expected) - set(actual)&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set(expected) - set(actual)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;O(n)&lt;/code&gt; time, &lt;code&gt;O(n)&lt;/code&gt; space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;(set(expected) - set(actual)).pop()&lt;/code&gt;&lt;/strong&gt; — return the single missing element.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple missing elements&lt;/strong&gt; — the same set-difference returns all of them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costlier than sum / XOR&lt;/strong&gt; — same &lt;code&gt;O(n)&lt;/code&gt; time but &lt;code&gt;O(n)&lt;/code&gt; extra space; pick this only when the range is non-contiguous.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;actual = [10, 20, 30, 40, 60]&lt;/code&gt;, &lt;code&gt;expected = [10, 20, 30, 40, 50, 60]&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;set(expected) - set(actual)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{50}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;missing_from_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; contiguous range → sum-formula or XOR (&lt;code&gt;O(1)&lt;/code&gt; space); non-contiguous → set-difference (&lt;code&gt;O(n)&lt;/code&gt; space).&lt;/p&gt;
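When more than one element can be absent, the same set difference returns all of them; sorting makes the output deterministic (a minimal sketch, using a hypothetical `missing_many` helper):

```python
def missing_many(actual: list[int], expected: list[int]) -> list[int]:
    # Set difference finds every expected element absent from actual;
    # sorted() makes the multi-element result deterministic.
    return sorted(set(expected) - set(actual))

print(missing_many([10, 40, 60], [10, 20, 30, 40, 50, 60]))  # [20, 30, 50]
```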

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;n * (n + 1) / 2&lt;/code&gt; (float division) instead of &lt;code&gt;n * (n + 1) // 2&lt;/code&gt; (integer division) — produces a float result that fails int-typed tests.&lt;/li&gt;
&lt;li&gt;Computing &lt;code&gt;n&lt;/code&gt; as &lt;code&gt;max(nums)&lt;/code&gt; instead of &lt;code&gt;len(nums)&lt;/code&gt; — wrong by one when the missing element is the max itself.&lt;/li&gt;
&lt;li&gt;Forgetting that &lt;code&gt;range(n)&lt;/code&gt; excludes &lt;code&gt;n&lt;/code&gt; — use &lt;code&gt;range(n + 1)&lt;/code&gt; for inclusive &lt;code&gt;[0, n]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Sorting &lt;code&gt;nums&lt;/code&gt; first — &lt;code&gt;O(n log n)&lt;/code&gt; instead of &lt;code&gt;O(n)&lt;/code&gt;; signals algorithmic weakness.&lt;/li&gt;
&lt;li&gt;Returning the difference set instead of a single value — read the contract; the missing-number problem returns one int.&lt;/li&gt;
&lt;/ul&gt;
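The first mistake in the list above is easy to demonstrate: in Python 3, `/` always yields a `float`, so the result fails a strict `int`-typed check even when the value is numerically right.

```python
n = 4
print(n * (n + 1) / 2)   # 10.0 — float, fails an int-typed test
print(n * (n + 1) // 2)  # 10   — int, what the grader expects
```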

&lt;h3&gt;
  
  
  Python Interview Question on Missing Number
&lt;/h3&gt;

&lt;p&gt;Given an array &lt;code&gt;nums&lt;/code&gt; containing &lt;code&gt;n&lt;/code&gt; distinct integers in the range &lt;code&gt;[0, n]&lt;/code&gt;, return the &lt;strong&gt;single missing number&lt;/strong&gt; in &lt;code&gt;O(n)&lt;/code&gt; time and &lt;code&gt;O(1)&lt;/code&gt; space.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;missing_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# your code here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;n*(n+1)//2 - sum(nums)&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;missing_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;len(nums)&lt;/code&gt; gives &lt;code&gt;n&lt;/code&gt; because the array contains exactly one fewer element than the full range &lt;code&gt;[0, n]&lt;/code&gt;; the arithmetic-series identity &lt;code&gt;0 + 1 + … + n = n*(n+1)/2&lt;/code&gt; produces the expected total; subtracting &lt;code&gt;sum(nums)&lt;/code&gt; (the actual total) yields the missing value; the integer-division &lt;code&gt;//&lt;/code&gt; keeps the result an &lt;code&gt;int&lt;/code&gt;. &lt;code&gt;O(n)&lt;/code&gt; time for the single linear &lt;code&gt;sum&lt;/code&gt;, &lt;code&gt;O(1)&lt;/code&gt; extra space — no auxiliary structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for &lt;code&gt;nums = [3, 0, 1]&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n = len(nums)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;expected = &lt;code&gt;3 * 4 // 2&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;actual = &lt;code&gt;sum([3, 0, 1])&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;missing = &lt;code&gt;6 - 4&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input&lt;/th&gt;
&lt;th&gt;expected&lt;/th&gt;
&lt;th&gt;actual&lt;/th&gt;
&lt;th&gt;missing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[3, 0, 1]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Arithmetic-series identity&lt;/strong&gt; — &lt;code&gt;0 + 1 + ... + n = n*(n+1)/2&lt;/code&gt; is a closed-form expression; computing it is &lt;code&gt;O(1)&lt;/code&gt; regardless of &lt;code&gt;n&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear &lt;code&gt;sum(nums)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;O(n)&lt;/code&gt; scan of the array; the only data-touching cost in the algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integer division &lt;code&gt;//&lt;/code&gt;&lt;/strong&gt; — keeps the expected total as a Python &lt;code&gt;int&lt;/code&gt;; &lt;code&gt;n*(n+1)&lt;/code&gt; is always even so the division has no remainder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subtraction reveals the gap&lt;/strong&gt; — &lt;code&gt;expected - actual&lt;/code&gt; cancels every shared value and leaves the missing one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(n)&lt;/code&gt; time / &lt;code&gt;O(1)&lt;/code&gt; space&lt;/strong&gt; — single pass over &lt;code&gt;nums&lt;/code&gt;, constant extra storage, no recursion or auxiliary collection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/company/facebook/python" rel="noopener noreferrer"&gt;Facebook Python practice page&lt;/a&gt; for the curated EASY array problem and the &lt;a href="https://pipecode.ai/explore/practice/company/facebook/topic/array" rel="noopener noreferrer"&gt;Facebook array practice page&lt;/a&gt; for the combined company-and-topic problem set.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Facebook (Python)&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Facebook Python practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/facebook/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Facebook / array&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Facebook array problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/facebook/topic/array" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — array&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Python array problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/array" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Python Arithmetic Formula Evaluator — Array, Math, Bit, String Parsing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tokenization and two-pass evaluation in Python for data engineering
&lt;/h3&gt;

&lt;p&gt;"Given a string like &lt;code&gt;'3 + 5 * 2'&lt;/code&gt;, parse and evaluate it with standard precedence" is Facebook's signature MEDIUM Python prompt (Arithmetic Formula Evaluator). The mental model: &lt;strong&gt;a length-1 character-by-character scan tokenizes the string into a flat list of integers and operator characters; a first pass collapses every &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; left-to-right; a second pass collapses the remaining &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt; left-to-right; the final list has one element — the result&lt;/strong&gt;. Same primitive powers any "expression evaluator" pipeline — calculator apps, formula columns in spreadsheets, simple DSL interpreters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy311dr5169aiol4xjb1c.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy311dr5169aiol4xjb1c.jpeg" alt="Diagram showing the input string 3 + 5 * 2 tokenized into a list of numbers and operators, a first pass that collapses multiplication and division left-to-right, a second pass that collapses addition and subtraction left-to-right, and a final integer result." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Avoid &lt;code&gt;eval(s)&lt;/code&gt; even when the test prompt allows it. Production code never trusts arbitrary strings, and interviewers grade the candidate who writes a real parser. State the no-&lt;code&gt;eval&lt;/code&gt; rule out loud before coding; the senior signal is unmistakable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Tokenization: split a string into numbers and operators
&lt;/h4&gt;

&lt;p&gt;The tokenization invariant: &lt;strong&gt;scan the string character-by-character; accumulate digits into an integer buffer, flush the buffer when an operator is hit, append the operator as its own token; whitespace is skipped&lt;/strong&gt;. The output is a flat list alternating between integers and operator chars.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Digit accumulation&lt;/strong&gt; — &lt;code&gt;if c.isdigit(): num = num * 10 + int(c)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operator flush&lt;/strong&gt; — &lt;code&gt;elif c in '+-*/': tokens.append(num); tokens.append(c); num = 0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final flush&lt;/strong&gt; — append the trailing &lt;code&gt;num&lt;/code&gt; after the loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whitespace&lt;/strong&gt; — skip with &lt;code&gt;if c.isspace(): continue&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;s = '3 + 5 * 2'&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;start&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;+&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, '+']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, '+', 5]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, '+', 5, '*']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, '+', 5, '*', 2]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isspace&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isdigit&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;in_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;in_num&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;in_num&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always handle multi-digit numbers explicitly — &lt;code&gt;'12 + 3'&lt;/code&gt; is two operands, not three; the digit accumulator is what makes that work.&lt;/p&gt;
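To see the digit accumulator at work, the tokenizer from the solution above (repeated here so the sketch is self-contained) splits `'12 + 3'` into exactly three tokens:

```python
def tokenize(s: str) -> list:
    # Accumulate digit runs into one int; flush on operators; skip spaces.
    tokens, num, in_num = [], 0, False
    for c in s:
        if c.isspace():
            continue
        if c.isdigit():
            num = num * 10 + int(c)
            in_num = True
        else:
            if in_num:
                tokens.append(num)
                num, in_num = 0, False
            tokens.append(c)
    if in_num:
        tokens.append(num)
    return tokens

print(tokenize('12 + 3'))     # [12, '+', 3]
print(tokenize('3 + 5 * 2'))  # [3, '+', 5, '*', 2]
```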

&lt;h4&gt;
  
  
  Two-pass evaluation: handle &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; first, then &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The precedence invariant: &lt;strong&gt;the first pass walks the token list, finds every &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt;, and replaces the triple &lt;code&gt;(left, op, right)&lt;/code&gt; with &lt;code&gt;(left op right)&lt;/code&gt;; the second pass does the same for &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt;&lt;/strong&gt;. After the first pass, the token list contains only &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt; operators between integers — left-to-right evaluation produces the result.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First pass&lt;/strong&gt; — handle &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; (higher precedence).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second pass&lt;/strong&gt; — handle &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt; (lower precedence).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Left-to-right within precedence&lt;/strong&gt; — preserves the standard expression semantics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integer division&lt;/strong&gt; — &lt;code&gt;//&lt;/code&gt; for &lt;code&gt;/&lt;/code&gt; if the contract is integer arithmetic; &lt;code&gt;int(a / b)&lt;/code&gt; for truncate-toward-zero.&lt;/li&gt;
&lt;/ul&gt;
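The `//` versus `int(a / b)` distinction in the last bullet only matters for negative operands: Python's `//` floors toward negative infinity, while many interview contracts expect truncation toward zero. A quick sketch of the difference:

```python
# Floor division rounds toward negative infinity;
# int(a / b) truncates toward zero instead.
print(-7 // 2)      # -4 (floored)
print(int(-7 / 2))  # -3 (truncated toward zero)
print(7 // 2, int(7 / 2))  # 3 3 — identical for non-negative operands
```

Ask which behavior the problem wants before the first pass touches a `/`; picking the wrong one silently fails negative test cases.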

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Continuing &lt;code&gt;tokens = [3, '+', 5, '*', 2]&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pass&lt;/th&gt;
&lt;th&gt;tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;input&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, '+', 5, '*', 2]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pass 1 (&lt;code&gt;*&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, '+', 10]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pass 2 (&lt;code&gt;+&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[13]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;&lt;code&gt;13&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collapse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# '-'
&lt;/span&gt;                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
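&lt;p&gt;To show how the two passes compose end to end, here is a runnable sketch pairing the &lt;code&gt;collapse&lt;/code&gt; helper above with a minimal tokenizer; the names &lt;code&gt;tokenize&lt;/code&gt; and &lt;code&gt;evaluate&lt;/code&gt; are illustrative, not part of the original solution:&lt;/p&gt;

```python
def tokenize(s: str) -> list:
    """Accumulate digits into multi-digit operands; emit operators as-is."""
    tokens, num = [], 0
    for c in s.replace(' ', ''):
        if c.isdigit():
            num = num * 10 + int(c)
        else:
            tokens.append(num)
            tokens.append(c)
            num = 0
    tokens.append(num)  # trailing flush: the last operand
    return tokens

def collapse(tokens: list, ops: set) -> list:
    """Fold every (left, op, right) triple whose operator is in `ops`."""
    result = [tokens[0]]
    i = 1
    while i < len(tokens):
        op, right = tokens[i], tokens[i + 1]
        if op in ops:
            left = result.pop()
            if op == '*':
                result.append(left * right)
            elif op == '/':
                result.append(left // right)  # floor division, per the contract
            elif op == '+':
                result.append(left + right)
            else:  # '-'
                result.append(left - right)
        else:
            result.append(op)
            result.append(right)
        i += 2
    return result

def evaluate(s: str) -> int:
    tokens = tokenize(s)
    tokens = collapse(tokens, {'*', '/'})   # pass 1: higher precedence
    return collapse(tokens, {'+', '-'})[0]  # pass 2: lower precedence

assert evaluate('3+5*2') == 13
assert evaluate('12 + 3') == 15  # multi-digit operand survives tokenization
```

&lt;p&gt;Each pass allocates a fresh list, which is the extra cost the stack-based single-pass variant avoids.&lt;/p&gt;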



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; two passes is the simplest correct approach for this operator set; reach for a stack-based evaluator (or full shunting-yard) once parens or unary minus enter the spec.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stack-based evaluator for general expressions
&lt;/h4&gt;

&lt;p&gt;The stack invariant: &lt;strong&gt;maintain a stack of partial results; when scanning a token, the operator decides whether to update the top of the stack (&lt;code&gt;*&lt;/code&gt; or &lt;code&gt;/&lt;/code&gt;) or to push a new term (&lt;code&gt;+&lt;/code&gt; or &lt;code&gt;-&lt;/code&gt;)&lt;/strong&gt;. The final answer is &lt;code&gt;sum(stack)&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;+&lt;/code&gt;&lt;/strong&gt; — push the next operand onto the stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-&lt;/code&gt;&lt;/strong&gt; — push the negated next operand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;*&lt;/code&gt;&lt;/strong&gt; — multiply the top of stack by the next operand in place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/&lt;/code&gt;&lt;/strong&gt; — integer-divide the top of stack by the next operand in place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;s = '3+5*2'&lt;/code&gt;, single-pass with stack.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;token&lt;/th&gt;
&lt;th&gt;stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;+ 5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, 5]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;* 2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, 10]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;end&lt;/td&gt;
&lt;td&gt;sum = 13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isdigit&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+-*/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the stack-based single-pass evaluator is the production-grade answer; the two-pass variant is the easier explanation but allocates more.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;eval(s)&lt;/code&gt; — works on the test cases but is graded as a fail; production code never &lt;code&gt;eval&lt;/code&gt;s untrusted input.&lt;/li&gt;
&lt;li&gt;Forgetting multi-digit numbers — &lt;code&gt;'12 + 3'&lt;/code&gt; becomes &lt;code&gt;[1, 2, '+', 3]&lt;/code&gt; instead of &lt;code&gt;[12, '+', 3]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Doing left-to-right without precedence — &lt;code&gt;3 + 5 * 2&lt;/code&gt; becomes &lt;code&gt;(3 + 5) * 2 = 16&lt;/code&gt; instead of &lt;code&gt;3 + (5 * 2) = 13&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;int(a / b)&lt;/code&gt; when the contract says &lt;code&gt;//&lt;/code&gt; — produces wrong rounding for negative integers.&lt;/li&gt;
&lt;li&gt;Skipping the trailing-number flush — the last operand never enters the token list.&lt;/li&gt;
&lt;/ul&gt;
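&lt;p&gt;The division pitfall is easy to verify directly; the two forms agree on positives and diverge on negatives:&lt;/p&gt;

```python
# Floor division vs truncation: identical for positive operands,
# different for negative ones; know which one the contract wants.
assert 7 // 2 == int(7 / 2) == 3
assert -7 // 2 == -4       # // floors toward negative infinity
assert int(-7 / 2) == -3   # int(a / b) truncates toward zero
```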

&lt;h3&gt;
  
  
  Python Interview Question on Arithmetic Formula Evaluator
&lt;/h3&gt;

&lt;p&gt;Given a string &lt;code&gt;s&lt;/code&gt; containing integer operands and the operators &lt;code&gt;+ - * /&lt;/code&gt; (no parentheses, integer arithmetic, standard precedence), return the integer result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# your code here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Solution Using single-pass stack-based evaluator
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isdigit&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+-*/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# '/'
&lt;/span&gt;                &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the single pass tokenizes and evaluates simultaneously; &lt;code&gt;op&lt;/code&gt; tracks the operator that applies to the &lt;em&gt;just-finished&lt;/em&gt; number; on &lt;code&gt;+&lt;/code&gt;/&lt;code&gt;-&lt;/code&gt; we push a new term onto the stack; on &lt;code&gt;*&lt;/code&gt;/&lt;code&gt;/&lt;/code&gt; we update the top of stack in place — which is exactly the precedence-respecting behavior we need; the trailing flush (&lt;code&gt;i == len(s) - 1&lt;/code&gt;) handles the last number; &lt;code&gt;sum(stack)&lt;/code&gt; collapses the additive sub-results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for &lt;code&gt;s = '3+5*2'&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;i&lt;/th&gt;
&lt;th&gt;c&lt;/th&gt;
&lt;th&gt;num&lt;/th&gt;
&lt;th&gt;op_pending&lt;/th&gt;
&lt;th&gt;stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, 5]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, 5]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;end&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[3, 10]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;sum(stack)&lt;/code&gt; = &lt;code&gt;3 + 10&lt;/code&gt; = &lt;strong&gt;13&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'3+5*2'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'12-3*2'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'10/3'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-pass tokenize-and-evaluate&lt;/strong&gt; — one scan of the string drives both tokenization and stack updates; no separate token list materialized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;op&lt;/code&gt; tracks the &lt;em&gt;previous&lt;/em&gt; operator&lt;/strong&gt; — applies to the number that just finished accumulating, which is the key trick for the stack-update timing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack push for &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt;&lt;/strong&gt; — additive operations push new terms; the final &lt;code&gt;sum&lt;/code&gt; collapses them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-place stack update for &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt;&lt;/strong&gt; — multiplicative operations modify the top of stack so they bind tighter than additive ones; this is what gives precedence without a second pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trailing flush via &lt;code&gt;i == len(s) - 1&lt;/code&gt;&lt;/strong&gt; — the last operand has no following operator to trigger a flush; the index check forces the final stack update.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(n)&lt;/code&gt; time / &lt;code&gt;O(n)&lt;/code&gt; space&lt;/strong&gt; — one pass over the string; the stack holds at most one element per &lt;code&gt;+&lt;/code&gt; or &lt;code&gt;-&lt;/code&gt; token, which is &lt;code&gt;O(n)&lt;/code&gt; worst case.&lt;/li&gt;
&lt;/ul&gt;
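&lt;p&gt;A few quick checks for the points above (precedence, multi-digit operands, the trailing flush, truncating division), with the solution repeated so the sketch runs standalone:&lt;/p&gt;

```python
def calculate(s: str) -> int:
    # Single-pass stack evaluator, as in the solution above.
    s = s.replace(' ', '')
    stack, num, op = [], 0, '+'
    for i, c in enumerate(s):
        if c.isdigit():
            num = num * 10 + int(c)
        if c in '+-*/' or i == len(s) - 1:
            if op == '+':
                stack.append(num)
            elif op == '-':
                stack.append(-num)
            elif op == '*':
                stack.append(stack.pop() * num)
            else:  # '/'
                stack.append(int(stack.pop() / num))
            op, num = c, 0
    return sum(stack)

assert calculate('3+5*2') == 13     # precedence: not (3+5)*2 == 16
assert calculate('12 + 3') == 15    # multi-digit operand, whitespace stripped
assert calculate('42') == 42        # single operand: trailing flush fires
assert calculate('14-3/2') == 13    # int(-3 / 2) == -1: truncation, not floor
```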

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/string" rel="noopener noreferrer"&gt;Python string problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/array" rel="noopener noreferrer"&gt;array problems&lt;/a&gt; for breadth.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Facebook (Python)&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Facebook Python practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/facebook/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — string&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Python string problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/string" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — array&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Python array problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/array" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  3. SQL Window Functions and &lt;code&gt;EXISTS&lt;/code&gt; for Monthly Active User Retention
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Correlated EXISTS subquery for month-over-month MAU retention in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Return the number of monthly active users in July 2022 — users who were active in both July AND June" is Meta's signature SQL retention prompt (DataLemur Q4). The mental model: &lt;strong&gt;a user is a July-MAU iff they have at least one event in July AND at least one event in June; a correlated &lt;code&gt;EXISTS&lt;/code&gt; subquery checks the second condition row-by-row against the same &lt;code&gt;user_actions&lt;/code&gt; table; &lt;code&gt;EXTRACT(MONTH FROM curr_month.event_date - INTERVAL '1 month')&lt;/code&gt; shifts the comparison window&lt;/strong&gt;. Same primitive powers any "active in current period AND in previous period" retention metric — week-over-week active users, day-over-day session retention, cohort-N-day return.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9t0oxfkqqy4dec0xca2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9t0oxfkqqy4dec0xca2.jpeg" alt="Diagram showing a user_actions table with rows for two users in June and July 2022, a correlated EXISTS subquery checking each July user for a matching June row, and a green output card listing the MAU count for July 2022." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Correlated &lt;code&gt;EXISTS&lt;/code&gt; subqueries are graded as the right answer over self-joins for retention queries. Self-joins explode the row count when a user has many events; &lt;code&gt;EXISTS&lt;/code&gt; short-circuits on the first match per outer row. State this performance distinction to the interviewer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;EXTRACT(MONTH FROM ts)&lt;/code&gt; and &lt;code&gt;INTERVAL '1 month'&lt;/code&gt; arithmetic
&lt;/h4&gt;

&lt;p&gt;The date-arithmetic invariant: &lt;strong&gt;&lt;code&gt;EXTRACT(MONTH FROM ts)&lt;/code&gt; returns the month component of &lt;code&gt;ts&lt;/code&gt; as an integer 1-12; &lt;code&gt;ts - INTERVAL '1 month'&lt;/code&gt; shifts &lt;code&gt;ts&lt;/code&gt; back exactly one calendar month respecting end-of-month edges&lt;/strong&gt;. Combining both produces "month of one month ago" — the comparison key for retention.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EXTRACT(MONTH FROM ts)&lt;/code&gt;&lt;/strong&gt; — month number 1-12.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EXTRACT(YEAR FROM ts)&lt;/code&gt;&lt;/strong&gt; — year number; combine with month for unique periods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ts - INTERVAL '1 month'&lt;/code&gt;&lt;/strong&gt; — calendar shift, handles 31-day → 30-day automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC('month', ts)&lt;/code&gt;&lt;/strong&gt; — alternative; returns first day of the month as a &lt;code&gt;DATE&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;event_date = '2022-07-15'&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;expression&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXTRACT(MONTH FROM event_date)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;event_date - INTERVAL '1 month'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2022-06-15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXTRACT(MONTH FROM event_date - INTERVAL '1 month')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 month'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_month&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_actions&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2022-07-15'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; combine &lt;code&gt;EXTRACT(MONTH FROM ...)&lt;/code&gt; and &lt;code&gt;EXTRACT(YEAR FROM ...)&lt;/code&gt; when retention spans calendar years; month-only is wrong for Dec→Jan transitions.&lt;/p&gt;
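
&lt;p&gt;As a sketch of the &lt;code&gt;DATE_TRUNC&lt;/code&gt; alternative (assuming the &lt;code&gt;user_actions&lt;/code&gt; table above), one bucket comparison replaces the month + year pair of predicates and stays safe across Dec→Jan:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One cross-year-safe period key instead of two EXTRACT predicates.
SELECT user_id
FROM user_actions
WHERE DATE_TRUNC('month', event_date) = DATE '2022-07-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;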

&lt;h4&gt;
  
  
  Correlated &lt;code&gt;EXISTS&lt;/code&gt; subquery for previous-month presence
&lt;/h4&gt;

&lt;p&gt;The correlated-subquery invariant: &lt;strong&gt;&lt;code&gt;WHERE EXISTS (SELECT 1 FROM ... WHERE inner.col = outer.col)&lt;/code&gt; returns &lt;code&gt;TRUE&lt;/code&gt; for outer rows whose &lt;code&gt;col&lt;/code&gt; value has at least one matching row in the inner query; the inner query references the outer alias and re-evaluates per outer row&lt;/strong&gt;. Short-circuits on the first match.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EXISTS (SELECT 1 FROM ... WHERE ...)&lt;/code&gt;&lt;/strong&gt; — the canonical pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlation&lt;/strong&gt; — inner WHERE clause references outer alias.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short-circuit&lt;/strong&gt; — stops scanning inner rows on first match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NOT EXISTS&lt;/code&gt;&lt;/strong&gt; — mirror image; "no match found in inner."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two users; user 445 has June + July rows, user 742 has only July.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;event_date&lt;/th&gt;
&lt;th&gt;EXISTS June row?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;445&lt;/td&gt;
&lt;td&gt;2022-06-30&lt;/td&gt;
&lt;td&gt;(no — this IS June)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;445&lt;/td&gt;
&lt;td&gt;2022-07-05&lt;/td&gt;
&lt;td&gt;✓ (matches user 445's June row)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;742&lt;/td&gt;
&lt;td&gt;2022-07-03&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_actions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt;  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_actions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;last_month&lt;/span&gt;
      &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;last_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;last_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt;  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;last_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;EXISTS&lt;/code&gt; beats self-join for retention because it short-circuits per outer row and never explodes the cardinality.&lt;/p&gt;
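
&lt;p&gt;The &lt;code&gt;NOT EXISTS&lt;/code&gt; mirror from the bullet list finds July-active users with no June row (a sketch over the same &lt;code&gt;user_actions&lt;/code&gt; table):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT DISTINCT user_id
FROM user_actions AS curr_month
WHERE EXTRACT(MONTH FROM event_date) = 7
  AND EXTRACT(YEAR  FROM event_date) = 2022
  AND NOT EXISTS (
      SELECT 1
      FROM user_actions AS last_month
      WHERE last_month.user_id = curr_month.user_id
        AND EXTRACT(MONTH FROM last_month.event_date) = 6
        AND EXTRACT(YEAR  FROM last_month.event_date) = 2022
  );  -- for the sample rows this returns 742: active in July, absent in June
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;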

&lt;h4&gt;
  
  
  Alternative: window function with self-join over month buckets
&lt;/h4&gt;

&lt;p&gt;The window-alternative invariant: &lt;strong&gt;a self-join on &lt;code&gt;user_id&lt;/code&gt; between two pre-bucketed tables (or &lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt; over month-truncated dates inside a partition) achieves the same retention answer with different performance trade-offs&lt;/strong&gt;. Useful when the prompt asks for a continuous N-month-streak metric.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC('month', event_date)&lt;/code&gt;&lt;/strong&gt; — bucket per month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(month_bucket) OVER (PARTITION BY user_id ORDER BY month_bucket)&lt;/code&gt;&lt;/strong&gt; — previous month bucket per user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-join&lt;/strong&gt; — &lt;code&gt;JOIN user_actions u2 ON u2.user_id = u1.user_id AND u2.month_bucket = u1.month_bucket - INTERVAL '1 month'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window approach scales better&lt;/strong&gt; — single sort, no nested scan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same data; window approach.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;curr_month&lt;/th&gt;
&lt;th&gt;prev_month_bucket&lt;/th&gt;
&lt;th&gt;retained?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;445&lt;/td&gt;
&lt;td&gt;2022-07&lt;/td&gt;
&lt;td&gt;2022-06&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;742&lt;/td&gt;
&lt;td&gt;2022-07&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;per_user_months&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;month_bucket&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_actions&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;per_user_months&lt;/span&gt; &lt;span class="n"&gt;curr&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;per_user_months&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;curr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;curr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month_bucket&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 month'&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;curr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2022-07-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; one-month retention → &lt;code&gt;EXISTS&lt;/code&gt;; multi-month streak → window functions over &lt;code&gt;DISTINCT user_id, month_bucket&lt;/code&gt; rows.&lt;/p&gt;
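
&lt;p&gt;The &lt;code&gt;LAG&lt;/code&gt; variant from the bullet list, sketched over the same CTE (a user-month row counts as retained when the previous bucket for that user is exactly one month earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH per_user_months AS (
  SELECT DISTINCT user_id, DATE_TRUNC('month', event_date) AS month_bucket
  FROM user_actions
),
with_prev AS (
  SELECT user_id,
         month_bucket,
         LAG(month_bucket) OVER (PARTITION BY user_id ORDER BY month_bucket) AS prev_bucket
  FROM with_prev_source  -- hypothetical alias; read: FROM per_user_months
)
SELECT user_id
FROM with_prev
WHERE month_bucket = '2022-07-01'
  AND prev_bucket  = month_bucket - INTERVAL '1 month';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;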

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Comparing &lt;code&gt;EXTRACT(MONTH FROM ts) = 6&lt;/code&gt; without checking the year — June 2021 silently counts as June 2022.&lt;/li&gt;
&lt;li&gt;Using a self-join when &lt;code&gt;EXISTS&lt;/code&gt; is cleaner — explodes the cardinality.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;WHERE EXTRACT(MONTH FROM event_date) = 7&lt;/code&gt; on the outer query — counts all-time MAU, not July-specific.&lt;/li&gt;
&lt;li&gt;Comparing &lt;code&gt;event_date::date&lt;/code&gt; directly without &lt;code&gt;INTERVAL '1 month'&lt;/code&gt; — fails on calendar boundaries (30 vs 31 day months, leap years).&lt;/li&gt;
&lt;li&gt;Returning all rows instead of distinct user counts — &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt; is the metric.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Active User Retention
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;user_actions(user_id, event_id, event_type, event_date)&lt;/code&gt;, return the count of &lt;strong&gt;monthly active users&lt;/strong&gt; for July 2022 — users who had at least one event in July AND at least one event in June. Output &lt;code&gt;mth&lt;/code&gt; (the numeric month) and &lt;code&gt;monthly_active_users&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using correlated &lt;code&gt;EXISTS&lt;/code&gt; subquery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;monthly_active_users&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_actions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt;  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_actions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;last_month&lt;/span&gt;
      &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;last_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;last_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 month'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;last_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 month'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;curr_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the outer query restricts to July 2022 events; the correlated &lt;code&gt;EXISTS&lt;/code&gt; short-circuits on the first June-2022 event for the same &lt;code&gt;user_id&lt;/code&gt;; &lt;code&gt;INTERVAL '1 month'&lt;/code&gt; makes the comparison work across year boundaries (January retention against December of the prior year); &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt; collapses multiple July events per user to a single MAU contribution; the &lt;code&gt;GROUP BY EXTRACT(MONTH ...)&lt;/code&gt; is required because the SELECT list references that non-aggregate expression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the DataLemur sample:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;event_date&lt;/th&gt;
&lt;th&gt;event_type&lt;/th&gt;
&lt;th&gt;curr_month?&lt;/th&gt;
&lt;th&gt;EXISTS prev?&lt;/th&gt;
&lt;th&gt;counted?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;445&lt;/td&gt;
&lt;td&gt;2022-06-30&lt;/td&gt;
&lt;td&gt;sign-in&lt;/td&gt;
&lt;td&gt;✗ (June)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;742&lt;/td&gt;
&lt;td&gt;2022-07-03&lt;/td&gt;
&lt;td&gt;sign-in&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;445&lt;/td&gt;
&lt;td&gt;2022-07-05&lt;/td&gt;
&lt;td&gt;like&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;742&lt;/td&gt;
&lt;td&gt;2022-07-05&lt;/td&gt;
&lt;td&gt;comment&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;648&lt;/td&gt;
&lt;td&gt;2022-07-18&lt;/td&gt;
&lt;td&gt;like&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only user 445 satisfies both conditions → MAU = 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;mth&lt;/th&gt;
&lt;th&gt;monthly_active_users&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EXTRACT(MONTH FROM ts)&lt;/code&gt; + &lt;code&gt;EXTRACT(YEAR FROM ts)&lt;/code&gt;&lt;/strong&gt; — joint period key avoids cross-year false positives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlated &lt;code&gt;EXISTS&lt;/code&gt;&lt;/strong&gt; — short-circuits on the first match per outer row; &lt;code&gt;O(N · M)&lt;/code&gt; worst case but typically far less in practice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTERVAL '1 month'&lt;/code&gt; shift&lt;/strong&gt; — calendar-aware month subtraction; handles end-of-month edges and year transitions automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt;&lt;/strong&gt; — collapses multiple July events per user; the contract demands "users", not "events."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY EXTRACT(MONTH FROM ...)&lt;/code&gt;&lt;/strong&gt; — required because the SELECT references a non-aggregate expression; produces one output row per month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|user_actions| × log|user_actions|)&lt;/code&gt; time&lt;/strong&gt; — one index lookup per outer row inside the &lt;code&gt;EXISTS&lt;/code&gt;; with a &lt;code&gt;(user_id, event_date)&lt;/code&gt; index this is near-linear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window-function problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;aggregation problems&lt;/a&gt; on PipeCode.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Facebook&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Facebook SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/facebook" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;





&lt;h2&gt;
  
  
  4. SQL CTE and Self-Join for Post Hiatus and Friend Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CTE composition with MIN/MAX aggregates and self-joins in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"For each user who posted at least twice in 2024, return the days between their first and last post" is Meta's signature post-hiatus prompt (DataLemur Q1). The mental model: &lt;strong&gt;&lt;code&gt;MAX(post_date) - MIN(post_date)&lt;/code&gt; per user gives the hiatus; &lt;code&gt;WHERE EXTRACT(YEAR FROM post_date) = 2024&lt;/code&gt; filters to the year; &lt;code&gt;HAVING COUNT(post_id) &amp;gt; 1&lt;/code&gt; ensures the user actually posted multiple times&lt;/strong&gt;. The same CTE-and-self-join skeleton scales up to the friend-recommendation pattern (Q6) — a CTE captures &lt;code&gt;private_events&lt;/code&gt; and a self-join produces non-friend pairs who attended the same events.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Always alias computed columns (&lt;code&gt;AS days_between&lt;/code&gt; for the post-hiatus span, &lt;code&gt;AS user_pair&lt;/code&gt; for friend-rec pairs). In PostgreSQL, subtracting two &lt;code&gt;TIMESTAMP&lt;/code&gt; values returns an &lt;code&gt;INTERVAL&lt;/code&gt;, while subtracting two &lt;code&gt;DATE&lt;/code&gt; values returns a plain integer day count; cast with &lt;code&gt;::DATE&lt;/code&gt; before subtracting and state the cast in the SELECT.&lt;/p&gt;
&lt;/blockquote&gt;
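
&lt;p&gt;A quick check of the two subtraction behaviors (the literal dates here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT DATE '2024-07-29' - DATE '2024-07-08'           AS int_days,  -- 21, a plain integer
       TIMESTAMP '2024-07-29' - TIMESTAMP '2024-07-08' AS iv_days;   -- '21 days', an INTERVAL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;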

&lt;h4&gt;
  
  
  &lt;code&gt;MIN&lt;/code&gt; / &lt;code&gt;MAX&lt;/code&gt; aggregates per user with date subtraction
&lt;/h4&gt;

&lt;p&gt;The aggregate-date invariant: &lt;strong&gt;&lt;code&gt;MAX(post_date) - MIN(post_date)&lt;/code&gt; over a &lt;code&gt;GROUP BY user_id&lt;/code&gt; grouping returns the per-user span of activity; casting to &lt;code&gt;::DATE&lt;/code&gt; first ensures the subtraction returns a plain integer day count, not an &lt;code&gt;INTERVAL&lt;/code&gt;&lt;/strong&gt;. Filter to the year first via &lt;code&gt;WHERE&lt;/code&gt;, group second.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MIN(post_date::DATE)&lt;/code&gt;&lt;/strong&gt; — earliest post date per group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX(post_date::DATE)&lt;/code&gt;&lt;/strong&gt; — latest post date per group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX - MIN&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;INTERVAL&lt;/code&gt; if both are timestamps, &lt;code&gt;int&lt;/code&gt; if both are &lt;code&gt;DATE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EXTRACT(DAY FROM ...)&lt;/code&gt; cast&lt;/strong&gt; — alternative if working with timestamps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two users, six posts in 2024.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;post_dates&lt;/th&gt;
&lt;th&gt;min&lt;/th&gt;
&lt;th&gt;max&lt;/th&gt;
&lt;th&gt;days_between&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;151652&lt;/td&gt;
&lt;td&gt;07/10, 07/12&lt;/td&gt;
&lt;td&gt;07/10&lt;/td&gt;
&lt;td&gt;07/12&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;661093&lt;/td&gt;
&lt;td&gt;07/08, 07/29&lt;/td&gt;
&lt;td&gt;07/08&lt;/td&gt;
&lt;td&gt;07/29&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;days_between&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;post_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; when the prompt says "days between first and last", &lt;code&gt;MAX(::DATE) - MIN(::DATE)&lt;/code&gt; is one line; never compute it via window functions when GROUP BY suffices.&lt;/p&gt;
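
&lt;p&gt;When &lt;code&gt;post_date&lt;/code&gt; is a timestamp, the &lt;code&gt;EXTRACT(DAY FROM ...)&lt;/code&gt; alternative from the bullet list pulls the day count out of the resulting &lt;code&gt;INTERVAL&lt;/code&gt; (a sketch on the same &lt;code&gt;posts&lt;/code&gt; table):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Timestamp subtraction yields an INTERVAL; EXTRACT(DAY ...) reads its day field.
SELECT user_id,
       EXTRACT(DAY FROM MAX(post_date) - MIN(post_date)) AS days_between
FROM posts
WHERE EXTRACT(YEAR FROM post_date) = 2024
GROUP BY user_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;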

&lt;h4&gt;
  
  
  &lt;code&gt;HAVING COUNT() &amp;gt; 1&lt;/code&gt; for multi-event users
&lt;/h4&gt;

&lt;p&gt;The filter-on-aggregate invariant: &lt;strong&gt;&lt;code&gt;HAVING COUNT(post_id) &amp;gt; 1&lt;/code&gt; filters group rows after the GROUP BY to keep only users with at least 2 posts; &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregates and raises an error if you try&lt;/strong&gt;. The two clauses are not interchangeable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; — row-level filter (year predicate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt;&lt;/strong&gt; — group-level filter (count predicate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(post_id) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — strictly more than 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt;= 2&lt;/code&gt;&lt;/strong&gt; — equivalent here since &lt;code&gt;post_id&lt;/code&gt; is non-null.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three users; one posted only once in 2024.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;post_count_2024&lt;/th&gt;
&lt;th&gt;survives?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;151652&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;661093&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;004239&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;✗ (filtered)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;days_between&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;post_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; a single-row hiatus is &lt;code&gt;0&lt;/code&gt; days, which is misleading; the &lt;code&gt;HAVING COUNT &amp;gt; 1&lt;/code&gt; filter is non-negotiable for the post-hiatus prompt.&lt;/p&gt;

&lt;h4&gt;
  
  
  CTE composition for multi-step logic + self-join for friend-rec / pair queries
&lt;/h4&gt;

&lt;p&gt;The CTE-and-self-join invariant: &lt;strong&gt;&lt;code&gt;WITH cte AS (SELECT ...)&lt;/code&gt; names an intermediate result; subsequent SELECTs reference it like a table; for pair queries (e.g., friend recommendations), self-join the same CTE on a not-equal-id condition to generate every ordered pair&lt;/strong&gt;. Combine with friendship-status filtering to surface non-friend pairs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CTE definition&lt;/strong&gt; — &lt;code&gt;WITH private_events AS (SELECT user_id, event_id FROM event_rsvp WHERE attendance_status IN ('going', 'maybe') AND event_type = 'private')&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-join&lt;/strong&gt; — &lt;code&gt;JOIN private_events e2 ON e1.event_id = e2.event_id AND e1.user_id != e2.user_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Friendship-status filter&lt;/strong&gt; — &lt;code&gt;JOIN friendship_status fs ... WHERE fs.status = 'not_friends'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pair count&lt;/strong&gt; — &lt;code&gt;HAVING COUNT(*) &amp;gt;= 2&lt;/code&gt; for the "two-or-more shared events" requirement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Users 111, 222, and 333 all attended event 234, and 222 and 333 share one additional event. Pair (111, 222) are friends; the other two pairs are not.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pair&lt;/th&gt;
&lt;th&gt;shared_events&lt;/th&gt;
&lt;th&gt;friend?&lt;/th&gt;
&lt;th&gt;recommend?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(111, 222)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;friends&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(111, 333)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;not_friends&lt;/td&gt;
&lt;td&gt;✗ (only 1 shared)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(222, 333)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;not_friends&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;private_events&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_rsvp&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;attendance_status&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'going'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'maybe'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'private'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_a_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_b_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;private_events&lt;/span&gt; &lt;span class="n"&gt;e1&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;private_events&lt;/span&gt; &lt;span class="n"&gt;e2&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;e1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;e2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;friendship_status&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_a_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_b_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'not_friends'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_a_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_b_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; CTEs make multi-step logic readable; self-joins on the same CTE are the canonical shape for pair / recommendation queries.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Computing &lt;code&gt;MAX - MIN&lt;/code&gt; without the &lt;code&gt;::DATE&lt;/code&gt; cast — on a &lt;code&gt;TIMESTAMP&lt;/code&gt; column this returns an &lt;code&gt;INTERVAL&lt;/code&gt;, not an integer day count.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;WHERE COUNT(post_id) &amp;gt; 1&lt;/code&gt; — parse error; &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregates.&lt;/li&gt;
&lt;li&gt;Forgetting the year filter — counts all-time hiatus instead of 2024-only.&lt;/li&gt;
&lt;li&gt;Self-joining without &lt;code&gt;e1.user_id != e2.user_id&lt;/code&gt; — generates &lt;code&gt;(user, user)&lt;/code&gt; pairs that pollute the answer.&lt;/li&gt;
&lt;li&gt;Returning a single direction &lt;code&gt;(user_a, user_b)&lt;/code&gt; when the prompt says recommendations are bidirectional — add the mirror pair via &lt;code&gt;UNION ALL&lt;/code&gt; with the columns swapped.&lt;/li&gt;
&lt;/ul&gt;
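&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt;-vs-&lt;code&gt;HAVING&lt;/code&gt; mistake is engine-enforced, not stylistic — PostgreSQL raises "aggregate functions are not allowed in WHERE". A minimal sketch (table and data are hypothetical; SQLite stands in for PostgreSQL here, and it rejects the same query) shows the failure and the fix:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user_id INTEGER, post_id INTEGER)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [(1, 10), (1, 11), (2, 20)])

# Mistake: an aggregate in WHERE is rejected before the query even runs.
err = None
try:
    conn.execute("SELECT user_id FROM posts WHERE COUNT(post_id) > 1 GROUP BY user_id")
except sqlite3.OperationalError as e:
    err = e
print("rejected:", err)

# Fix: the group-level predicate lives in HAVING.
rows = conn.execute(
    "SELECT user_id FROM posts GROUP BY user_id HAVING COUNT(post_id) > 1"
).fetchall()
print(rows)  # [(1,)] — only user 1 posted more than once
```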

&lt;h3&gt;
  
  
  SQL Interview Question on Average Post Hiatus
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;posts(user_id, post_id, post_date)&lt;/code&gt;, for each user who posted at least twice in 2024, return the &lt;strong&gt;number of days between the user's first and last post in 2024&lt;/strong&gt;. Output &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;days_between&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;MAX - MIN&lt;/code&gt; per user with &lt;code&gt;HAVING COUNT &amp;gt; 1&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;days_between&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;post_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;WHERE&lt;/code&gt; clause restricts the row stream to 2024 posts before grouping; &lt;code&gt;GROUP BY user_id&lt;/code&gt; collapses to one row per user; &lt;code&gt;MAX(post_date::DATE) - MIN(post_date::DATE)&lt;/code&gt; produces the hiatus as an integer day count thanks to the &lt;code&gt;::DATE&lt;/code&gt; cast; &lt;code&gt;HAVING COUNT(post_id) &amp;gt; 1&lt;/code&gt; strips out users who posted only once. Single-pass aggregation; no self-join needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the DataLemur sample:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;post_id&lt;/th&gt;
&lt;th&gt;post_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;151652&lt;/td&gt;
&lt;td&gt;599415&lt;/td&gt;
&lt;td&gt;2024-07-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;661093&lt;/td&gt;
&lt;td&gt;624356&lt;/td&gt;
&lt;td&gt;2024-07-29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;004239&lt;/td&gt;
&lt;td&gt;784254&lt;/td&gt;
&lt;td&gt;2024-07-04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;661093&lt;/td&gt;
&lt;td&gt;442560&lt;/td&gt;
&lt;td&gt;2024-07-08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;151652&lt;/td&gt;
&lt;td&gt;111766&lt;/td&gt;
&lt;td&gt;2024-07-12&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;WHERE filter&lt;/strong&gt; — all 5 rows are in 2024; nothing dropped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GROUP BY user_id&lt;/strong&gt; — three groups: 151652 (2 posts), 661093 (2 posts), 004239 (1 post).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIN / MAX per group&lt;/strong&gt; — 151652 → (07-10, 07-12); 661093 → (07-08, 07-29); 004239 → (07-04, 07-04).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAX - MIN&lt;/strong&gt; — 2 days, 21 days, 0 days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HAVING COUNT &amp;gt; 1&lt;/strong&gt; — strips 004239 (only 1 post).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;days_between&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;151652&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;661093&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
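&lt;p&gt;The trace above can be reproduced locally. A SQLite sketch of the same query (SQLite stands in for PostgreSQL here: &lt;code&gt;julianday()&lt;/code&gt; subtraction replaces &lt;code&gt;::DATE&lt;/code&gt; arithmetic and &lt;code&gt;strftime('%Y', ...)&lt;/code&gt; replaces &lt;code&gt;EXTRACT(YEAR FROM ...)&lt;/code&gt;):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user_id INTEGER, post_id INTEGER, post_date TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", [
    (151652, 599415, "2024-07-10"),
    (661093, 624356, "2024-07-29"),
    (4239,   784254, "2024-07-04"),   # the single-post user (004239 in the sample)
    (661093, 442560, "2024-07-08"),
    (151652, 111766, "2024-07-12"),
])

rows = conn.execute("""
    SELECT user_id,
           CAST(julianday(MAX(post_date)) - julianday(MIN(post_date)) AS INTEGER)
               AS days_between
    FROM posts
    WHERE strftime('%Y', post_date) = '2024'
    GROUP BY user_id
    HAVING COUNT(post_id) > 1
    ORDER BY user_id
""").fetchall()
print(rows)  # [(151652, 2), (661093, 21)] — the single-post user is stripped
```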

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE EXTRACT(YEAR FROM post_date) = 2024&lt;/code&gt;&lt;/strong&gt; — row-level filter; runs before grouping for correctness and performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY user_id&lt;/code&gt;&lt;/strong&gt; — collapses to one row per user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX(post_date::DATE) - MIN(post_date::DATE)&lt;/code&gt;&lt;/strong&gt; — date subtraction returns an integer day count after the &lt;code&gt;::DATE&lt;/code&gt; cast; &lt;code&gt;INTERVAL&lt;/code&gt; would be the wrong shape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(post_id) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — group-level filter; an aggregate predicate must live in &lt;code&gt;HAVING&lt;/code&gt;, not &lt;code&gt;WHERE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|posts| + G log G)&lt;/code&gt; time&lt;/strong&gt; — single scan plus a sort of &lt;code&gt;G&lt;/code&gt; groups; no self-join.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;SQL CTE problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL join problems&lt;/a&gt; on PipeCode.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Facebook&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Facebook SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/facebook" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — CTE&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL CTE problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — joins&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL join problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips to crack Facebook data engineering interviews
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Facebook = Meta — the name changed but the SQL bar didn't move
&lt;/h3&gt;

&lt;p&gt;Facebook Inc. rebranded to &lt;strong&gt;Meta Platforms Inc.&lt;/strong&gt; in &lt;strong&gt;October 2021&lt;/strong&gt;. The legal entity is Meta; the consumer products (Facebook, Instagram, WhatsApp, Messenger, Threads) keep their original brands. The data-engineering interview loop, the SQL bar, and the question shapes did not change with the rebrand. Search for both "Facebook data engineer interview" and "Meta data engineer interview" — every external article you find under one name applies under the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  30+30 phone-screen format — pick your opener
&lt;/h3&gt;

&lt;p&gt;The standard Meta data-engineering technical phone screen is &lt;strong&gt;5 minutes intro + 30 minutes SQL + 30 minutes Python + 5 minutes Q&amp;amp;A&lt;/strong&gt;, with the candidate choosing whether to open with SQL or Python. Pick whichever you'd rather attack first when you're freshest; the second half always feels harder under fatigue, so save your stronger language for the closer if you can. State your preferred opener at the start so the interviewer can plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two role variants — pick the one that matches your loop
&lt;/h3&gt;

&lt;p&gt;Meta has two distinct DE loops: the &lt;strong&gt;standard "Data Engineer"&lt;/strong&gt; role (algo-heavy Python plus SQL) and the &lt;strong&gt;"Data Engineer — Product Analytics"&lt;/strong&gt; role (5 SQL questions plus 5 algo coding questions, more product-sense flavor). Confirm which loop you're in during the recruiter call; the prep mix shifts — Product Analytics candidates drill more SQL retention / cohort patterns; standard DE candidates drill more Python algorithms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drill the four primitives
&lt;/h3&gt;

&lt;p&gt;The four primitives in this guide map directly to the two curated PipeCode Python problems plus the two adjacent SQL primitives every Meta SQL list rotates through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;n*(n+1)/2 - sum(arr)&lt;/code&gt; and XOR self-cancellation&lt;/strong&gt; — the missing-number array problem (Python EASY, #83).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-pass tokenize-and-evaluate or single-pass stack-based evaluator&lt;/strong&gt; — arithmetic formula parsing (Python MEDIUM, #273).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlated &lt;code&gt;EXISTS&lt;/code&gt; with &lt;code&gt;INTERVAL '1 month'&lt;/code&gt;&lt;/strong&gt; — month-over-month MAU retention (DataLemur Q4).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTE composition + self-joins with &lt;code&gt;MIN/MAX&lt;/code&gt; aggregates and &lt;code&gt;HAVING COUNT &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — post hiatus and friend recommendations (DataLemur Q1 / Q6).&lt;/li&gt;
&lt;/ul&gt;
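&lt;p&gt;The missing-number primitive fits in a few lines; both variants below are standard (the function names are mine):&lt;/p&gt;

```python
def missing_number_sum(arr):
    """Missing value in an array holding 0..n with one element absent."""
    n = len(arr)  # the array of length n holds n of the n+1 values 0..n
    return n * (n + 1) // 2 - sum(arr)

def missing_number_xor(arr):
    """Same answer via XOR self-cancellation: x ^ x == 0."""
    result = len(arr)  # start with n, then fold in every index and value
    for i, value in enumerate(arr):
        result ^= i ^ value
    return result

print(missing_number_sum([3, 0, 1]), missing_number_xor([3, 0, 1]))  # 2 2
```

&lt;p&gt;Both run in &lt;code&gt;O(n)&lt;/code&gt; time and &lt;code&gt;O(1)&lt;/code&gt; space; the XOR variant also sidesteps any overflow concern in fixed-width-integer languages.&lt;/p&gt;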

&lt;h3&gt;
  
  
  Product-analytics SQL emphasis
&lt;/h3&gt;

&lt;p&gt;Meta's data-engineering SQL questions are heavily product-analytics-flavored — MAU retention, post hiatus, power users, click-through rates, friend recommendations. Drill the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Meta-style SQL practice surface&lt;/a&gt; with a focus on &lt;code&gt;EXISTS&lt;/code&gt; subqueries, CTE composition, &lt;code&gt;MIN/MAX&lt;/code&gt; per-user aggregates, and &lt;code&gt;CASE WHEN + ROUND&lt;/code&gt; for percentage metrics. Avoid generic "joins and group-by" prep; Meta's bar is higher.&lt;/p&gt;
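&lt;p&gt;The &lt;code&gt;CASE WHEN + ROUND&lt;/code&gt; percentage pattern can be sketched quickly. Table and column names are hypothetical; SQLite stands in for PostgreSQL here (the query text is dialect-neutral):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (app_id INTEGER, event_type TEXT)")
# App 1: 4 impressions, 2 clicks; app 2: 1 impression, no clicks.
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "impression")] * 4 + [(1, "click")] * 2 + [(2, "impression")])

rows = conn.execute("""
    SELECT app_id,
           ROUND(100.0 * SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END)
                       / SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 2)
               AS ctr
    FROM events
    GROUP BY app_id
    ORDER BY app_id
""").fetchall()
print(rows)  # [(1, 50.0), (2, 0.0)]
```

&lt;p&gt;The &lt;code&gt;100.0&lt;/code&gt; multiplier forces float division — with integer operands the ratio silently truncates to &lt;code&gt;0&lt;/code&gt;, a classic interview trap.&lt;/p&gt;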

&lt;h3&gt;
  
  
  IC4-IC6 leveling and behavioral expectations
&lt;/h3&gt;

&lt;p&gt;Meta levels DE roles from IC4 (Senior, ~5+ YoE) through IC5 (Staff, ~10+ YoE) to IC6 (Senior Staff, ~12+ YoE). Behavioral rounds at IC5+ probe ownership ("tell me about a time you owned a critical pipeline migration"), navigating ambiguity, and cross-functional collaboration with product / DS / ML teams. Have STAR-format stories ready for each. Compensation per Levels.fyi: IC4 ~$340K-$430K total, IC5 ~$430K-$580K, IC6 $600K+.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/company/facebook" rel="noopener noreferrer"&gt;Facebook practice page&lt;/a&gt; and the language-scoped &lt;a href="https://pipecode.ai/explore/practice/company/facebook/python" rel="noopener noreferrer"&gt;Facebook Python practice page&lt;/a&gt; for the curated two-problem set. Hit the company-topic &lt;a href="https://pipecode.ai/explore/practice/company/facebook/topic/array" rel="noopener noreferrer"&gt;Facebook — array page&lt;/a&gt; for the only Facebook-tagged topic surface available. After that, drill the matching topic pages: &lt;a href="https://pipecode.ai/explore/practice/topic/array" rel="noopener noreferrer"&gt;array&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/string" rel="noopener noreferrer"&gt;string&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;window functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;aggregation&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;CTE&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;joins&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions/sql" rel="noopener noreferrer"&gt;date functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;filtering&lt;/a&gt;. The &lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;interview courses page&lt;/a&gt; bundles structured curricula.
For broader coverage, &lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;browse by topic&lt;/a&gt;, or pivot to peer guides — the &lt;a href="https://pipecode.ai/blogs/airbnb-data-engineering-interview-questions-prep-guide" rel="noopener noreferrer"&gt;Airbnb DE interview guide&lt;/a&gt;, the &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top DE interview questions 2026&lt;/a&gt; blog, and the &lt;a href="https://pipecode.ai/blogs/sql-data-types-postgresql-guide" rel="noopener noreferrer"&gt;SQL data types Postgres guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communication and approach under time pressure
&lt;/h3&gt;

&lt;p&gt;Talk through the invariant first ("this is a missing-number problem with sum-formula and XOR alternatives"), the brute force second ("a sort-then-scan would also work but is &lt;code&gt;O(n log n)&lt;/code&gt;"), and the optimal third ("but the sum-formula gives &lt;code&gt;O(n)&lt;/code&gt; time and &lt;code&gt;O(1)&lt;/code&gt; space"). Interviewers grade &lt;strong&gt;process&lt;/strong&gt; as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, single-element array, zero-only array, year-boundary date arithmetic, NULL in a &lt;code&gt;user_id&lt;/code&gt; partition. The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the Facebook (Meta) data engineering interview process?
&lt;/h3&gt;

&lt;p&gt;The Meta data engineering interview opens with a recruiter screen (15-30 min), then a technical phone screen with the &lt;strong&gt;5 minute intro + 30 minute SQL + 30 minute Python + 5 minute Q&amp;amp;A&lt;/strong&gt; format (candidate-choice on opener), then a 4-5 round virtual onsite covering 2 SQL rounds, 1 Python algorithm round, 1 system design round, and 1 behavioral round. Senior roles (IC5+) add a data-architecture round. End-to-end the loop runs three to four weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Facebook the same as Meta? Which name should I search for?
&lt;/h3&gt;

&lt;p&gt;Yes — Facebook Inc. rebranded to &lt;strong&gt;Meta Platforms Inc.&lt;/strong&gt; in October 2021. The legal entity is Meta; the consumer brands (Facebook, Instagram, WhatsApp, Messenger, Threads) keep their original names. Search for both "Facebook data engineer interview" and "Meta data engineer interview" — every external article you find under one name applies under the other. The DataLemur SQL guide is titled "Facebook/Meta SQL Interview Questions" because both names index the same content.&lt;/p&gt;

&lt;h3&gt;
  
  
  What programming languages does Facebook test in data engineering interviews?
&lt;/h3&gt;

&lt;p&gt;Meta tests &lt;strong&gt;Python and SQL&lt;/strong&gt; in the technical phone screen and onsite — bilingual by design. The phone screen is exactly 30 minutes of each, with the candidate choosing the opener. Python emphasizes algorithms (array missing-number, string parsing, hash-table aggregation, sliding-window patterns). SQL emphasizes product-analytics queries (MAU retention, post hiatus, power users, CTR, friend recommendations) on PostgreSQL syntax. Python questions for the Meta data engineer interview lean medium-difficulty algorithm style, not data-pipeline scripting.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between the standard DE role and the Data Engineer — Product Analytics variant?
&lt;/h3&gt;

&lt;p&gt;Standard &lt;strong&gt;Data Engineer&lt;/strong&gt; at Meta = algo-heavier Python loops with bilingual SQL coverage (the curated PipeCode set's algorithm focus). &lt;strong&gt;Data Engineer — Product Analytics&lt;/strong&gt; = a product-sense-flavored variant where the technical phone screen is 5 SQL questions + 5 algo coding questions in one session, more weighted toward retention / cohort / funnel analytics SQL than pure algorithm fluency. The recruiter call confirms which loop you're in; ask explicitly if it isn't stated.&lt;/p&gt;

&lt;h3&gt;
  
  
  What SQL questions does Meta ask data engineers?
&lt;/h3&gt;

&lt;p&gt;Meta SQL interview questions concentrate on six product-analytics shapes: (1) post hiatus / first-and-last-event aggregations via &lt;code&gt;MIN/MAX(::DATE)&lt;/code&gt; with &lt;code&gt;HAVING COUNT &amp;gt; 1&lt;/code&gt;; (2) power-user identification via JOIN + 2-condition &lt;code&gt;HAVING&lt;/code&gt;; (3) MAU retention via correlated &lt;code&gt;EXISTS&lt;/code&gt; subquery with &lt;code&gt;INTERVAL '1 month'&lt;/code&gt; shifts; (4) friend recommendations via CTE + self-join over &lt;code&gt;event_rsvp&lt;/code&gt; with &lt;code&gt;friendship_status&lt;/code&gt; filtering; (5) average-shares-per-post via &lt;code&gt;LEFT JOIN + COALESCE&lt;/code&gt;; (6) ad click-through rate via &lt;code&gt;CASE WHEN + ROUND&lt;/code&gt; and 2022-year filters. Drill all six against the PostgreSQL dialect; CoderPad is the live coding environment.&lt;/p&gt;
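&lt;p&gt;Shape (3), the correlated-&lt;code&gt;EXISTS&lt;/code&gt; retention pattern, can be sketched as follows. Table and column names are hypothetical; SQLite stands in for PostgreSQL here, so &lt;code&gt;date(..., '-1 month')&lt;/code&gt; replaces the &lt;code&gt;INTERVAL '1 month'&lt;/code&gt; shift:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_actions (user_id INTEGER, event_date TEXT)")
conn.executemany("INSERT INTO user_actions VALUES (?, ?)", [
    (1, "2024-06-15"),  # active in June and July -> retained in July
    (1, "2024-07-10"),
    (2, "2024-07-12"),  # July only -> not retained
])

# A user counts as retained in a month if the EXISTS probe finds
# any activity by the same user in the immediately preceding month.
rows = conn.execute("""
    SELECT strftime('%Y-%m', cur.event_date) AS month,
           COUNT(DISTINCT cur.user_id) AS retained_users
    FROM user_actions cur
    WHERE EXISTS (
        SELECT 1 FROM user_actions prev
        WHERE prev.user_id = cur.user_id
          AND strftime('%Y-%m', prev.event_date) =
              strftime('%Y-%m', date(cur.event_date, '-1 month'))
    )
    GROUP BY month
    ORDER BY month
""").fetchall()
print(rows)  # [('2024-07', 1)] — only user 1 was also active the month before
```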

&lt;h3&gt;
  
  
  What is the Meta data engineer salary range?
&lt;/h3&gt;

&lt;p&gt;Meta data engineer total compensation per Levels.fyi: &lt;strong&gt;IC4 (Senior, ~5+ YoE)&lt;/strong&gt; $340K-$430K total comp ($180K-$220K base + RSUs + bonus); &lt;strong&gt;IC5 (Staff, ~10+ YoE)&lt;/strong&gt; $430K-$580K total ($210K-$260K base); &lt;strong&gt;IC6 (Senior Staff, ~12+ YoE)&lt;/strong&gt; $600K+ total. RSU refreshers are annual; equity vests on a 4-year schedule with a typical 25% cliff. Negotiation success rates run 10-20% with competing offers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing Facebook data engineering problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Square Data Engineering Interview Questions &amp; Prep Guide</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Sun, 03 May 2026 05:21:28 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/square-data-engineering-interview-questions-prep-guide-1efi</link>
      <guid>https://dev.to/gowthampotureddi/square-data-engineering-interview-questions-prep-guide-1efi</guid>
      <description>&lt;p&gt;&lt;strong&gt;Square data engineering interview questions&lt;/strong&gt; are SQL-heavy, fintech-flavored, and PostgreSQL-grounded. Square rebranded to &lt;strong&gt;Block Inc.&lt;/strong&gt; in December 2021, but the SQL bar — and the question shapes that show up in the live CoderPad pair-programming round — has not moved. Four primitives carry the loop: &lt;code&gt;GROUP BY sender_id + COUNT(*) + ORDER BY DESC LIMIT 10&lt;/code&gt; for top-N invoice senders, &lt;code&gt;DATEDIFF&lt;/code&gt; / &lt;code&gt;INTERVAL '30 days'&lt;/code&gt; cohort math for 30-day-post-signup activity windows, &lt;code&gt;AVG(stars) OVER (PARTITION BY product_id, month)&lt;/code&gt; window aggregates for monthly product analytics, and &lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt; over status-filtered payment rows for fintech-grade transaction analysis. The framings are everyday Block / Square / CashApp data engineering — invoice ranking, cohort retention, monthly review averages, payment success counts.&lt;/p&gt;

&lt;p&gt;This guide walks four SQL topic clusters end-to-end, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, and an &lt;strong&gt;interview-style problem with a full solution&lt;/strong&gt; that explains why it works. The mix matches a curated 2-problem set (1 EASY ranking+sorting+aggregation, 1 MEDIUM aggregation+date-functions+cohort-analysis) plus the two adjacent SQL primitives — window functions and payment-flow COUNT DISTINCT — that show up on every Block / Square / CashApp SQL question list. The interview is &lt;strong&gt;PostgreSQL on CoderPad&lt;/strong&gt;; candidates who prep in MySQL / Snowflake / BigQuery dialect stutter on &lt;code&gt;INTERVAL&lt;/code&gt;, &lt;code&gt;DATE_PART&lt;/code&gt;, and &lt;code&gt;EXTRACT&lt;/code&gt; syntax. Drill PostgreSQL-flavored answers from the start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvcrvinfsv4xpm1kot14.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvcrvinfsv4xpm1kot14.jpeg" alt="Square data engineering interview questions cover image with bold headline, SQL and Block-Square rebrand chips, faint code ghost, and pipecode.ai attribution." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top Square data engineering interview topics
&lt;/h2&gt;

&lt;p&gt;From the &lt;a href="https://pipecode.ai/explore/practice/company/square" rel="noopener noreferrer"&gt;Square data engineering practice set&lt;/a&gt;, the &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; (one row per &lt;strong&gt;H2&lt;/strong&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic (sections &lt;strong&gt;1–4&lt;/strong&gt;)&lt;/th&gt;
&lt;th&gt;Why it shows up at Square&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL ranking, sorting, and aggregation for top-N invoice senders&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Top 10 Invoice Senders (EASY) — &lt;code&gt;GROUP BY sender_id + COUNT(*) + ORDER BY DESC + LIMIT 10&lt;/code&gt;, the SQL primitive for any "top-N entities by event volume" question.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL aggregation, date functions, and cohort analysis for post-signup activity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Users with High Activity After 30 Days Signup (MEDIUM) — &lt;code&gt;DATEDIFF&lt;/code&gt; / &lt;code&gt;INTERVAL '30 days'&lt;/code&gt; + &lt;code&gt;GROUP BY user_id + HAVING COUNT(*) &amp;gt;= threshold&lt;/code&gt;, the canonical retention-cohort pattern.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL window functions for monthly averages and duplicate detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monthly product rating + duplicate detection — &lt;code&gt;AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))&lt;/code&gt; and &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY ...)&lt;/code&gt; for dedupe (DataLemur Block staples).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL &lt;code&gt;COUNT DISTINCT&lt;/code&gt; and status filters for payment-flow analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total unique successful transactions / senders / recipients — &lt;code&gt;COUNT(DISTINCT col) WHERE status = 'Success'&lt;/code&gt;, the fintech-grade pattern that drives Square / Block / CashApp payment analytics.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL-on-CoderPad framing rule:&lt;/strong&gt; Square interviews live SQL in CoderPad with PostgreSQL syntax. Use &lt;code&gt;EXTRACT(MONTH FROM ...)&lt;/code&gt;, &lt;code&gt;DATE_TRUNC('week', ...)&lt;/code&gt;, &lt;code&gt;INTERVAL '30 days'&lt;/code&gt;, &lt;code&gt;DATEDIFF&lt;/code&gt;-equivalent date math, and &lt;code&gt;COUNT(DISTINCT)&lt;/code&gt; natively. Snowflake / BigQuery / SQL Server idioms (&lt;code&gt;DATE_PART&lt;/code&gt; quirks, &lt;code&gt;QUALIFY&lt;/code&gt;, &lt;code&gt;TOP N&lt;/code&gt;) trip up candidates and signal weak SQL fluency. State your dialect upfront and drill PostgreSQL syntax.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. SQL Ranking, Sorting, and Aggregation for Top-N Invoice Senders
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Top-N entity ranking via GROUP BY + ORDER BY DESC + LIMIT in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Return the top 10 senders by completed invoice count" is Square's signature EASY SQL prompt (Top 10 Invoice Senders). The mental model: &lt;strong&gt;&lt;code&gt;GROUP BY sender_id&lt;/code&gt; collapses one row per sender; &lt;code&gt;COUNT(*)&lt;/code&gt; produces the per-sender invoice count; &lt;code&gt;ORDER BY count DESC&lt;/code&gt; ranks senders descending; &lt;code&gt;LIMIT 10&lt;/code&gt; returns only the top 10&lt;/strong&gt;. Same primitive powers any "top-N entities by event volume" pipeline — top-N customers by order count, top-N products by review volume, top-N regions by daily active users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpr508lij7xa446lyz43j.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpr508lij7xa446lyz43j.jpeg" alt="Diagram showing an invoices mini-table on the left, a horizontal bar chart of per-sender invoice counts sorted descending in the center with the top 10 cut highlighted in green, and a green output card on the right listing the top-10 sender_ids with their counts." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Always alias the count column (&lt;code&gt;AS invoice_count&lt;/code&gt;) and sort by the alias in &lt;code&gt;ORDER BY&lt;/code&gt;. Mixing literal &lt;code&gt;COUNT(*)&lt;/code&gt; and the alias is style noise that interviewers grade. Add a deterministic tiebreaker (&lt;code&gt;, sender_id ASC&lt;/code&gt;) when ties at the cut matter — Square's interviewers probe ties at the boundary frequently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  GROUP BY + COUNT(*) for per-sender aggregation
&lt;/h4&gt;

&lt;p&gt;The aggregation invariant: &lt;strong&gt;&lt;code&gt;GROUP BY sender_id&lt;/code&gt; collapses all invoice rows that share the same &lt;code&gt;sender_id&lt;/code&gt; into one output row; &lt;code&gt;COUNT(*)&lt;/code&gt; counts the number of rows in each group&lt;/strong&gt;. Every non-aggregate column in the &lt;code&gt;SELECT&lt;/code&gt; must appear in &lt;code&gt;GROUP BY&lt;/code&gt; (or be functionally dependent on it).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY sender_id&lt;/code&gt;&lt;/strong&gt; — collapses to one row per sender.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — total rows per group; equivalent to &lt;code&gt;COUNT(invoice_id)&lt;/code&gt; when &lt;code&gt;invoice_id&lt;/code&gt; is non-null.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt;&lt;/strong&gt; — unique values per group; useful when senders can repeat invoices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELECT sender_id, COUNT(*)&lt;/code&gt;&lt;/strong&gt; — only &lt;code&gt;sender_id&lt;/code&gt; is referenced ungrouped; the &lt;code&gt;GROUP BY&lt;/code&gt; makes this legal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Five invoices, three distinct senders.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;input&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[(101, s1), (102, s2), (103, s1), (104, s3), (105, s1)]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GROUP BY sender_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[s1: 3 rows, s2: 1 row, s3: 1 row]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(*) AS cnt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[(s1, 3), (s2, 1), (s3, 1)]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;GROUP BY&lt;/code&gt; on the dimension you want one-row-per-value, &lt;code&gt;COUNT(*)&lt;/code&gt; for the metric, alias the count immediately so subsequent clauses can reference it.&lt;/p&gt;

&lt;h4&gt;
  
  
  ORDER BY count DESC + LIMIT N for top-N rankings
&lt;/h4&gt;

&lt;p&gt;The top-N invariant: &lt;strong&gt;&lt;code&gt;ORDER BY &amp;lt;metric&amp;gt; DESC&lt;/code&gt; sorts groups by the metric in descending order; &lt;code&gt;LIMIT N&lt;/code&gt; returns only the first &lt;code&gt;N&lt;/code&gt;&lt;/strong&gt;. The order of ties at the cut is not guaranteed — PostgreSQL (Square's CoderPad default) returns whichever order the planner happens to produce unless you add an explicit tiebreaker.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY invoice_count DESC&lt;/code&gt;&lt;/strong&gt; — sorts groups by the aliased column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT 10&lt;/code&gt;&lt;/strong&gt; — first 10 groups in sort order; PostgreSQL syntax.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OFFSET 10 LIMIT 10&lt;/code&gt;&lt;/strong&gt; — pagination; ranks 11-20.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic tiebreak&lt;/strong&gt; — &lt;code&gt;, sender_id ASC&lt;/code&gt; makes ties stable across runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Sort 4 senders by &lt;code&gt;invoice_count&lt;/code&gt; and take top 3.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sender_id&lt;/th&gt;
&lt;th&gt;invoice_count&lt;/th&gt;
&lt;th&gt;rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s4&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;(cut)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always include a deterministic tiebreaker — fintech audits demand that ranked lists be reproducible across runs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tiebreaks via &lt;code&gt;RANK()&lt;/code&gt;, &lt;code&gt;DENSE_RANK()&lt;/code&gt;, and &lt;code&gt;ROW_NUMBER()&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The tiebreak invariant: &lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt; always gives a unique sequence (1, 2, 3, 4); &lt;code&gt;RANK&lt;/code&gt; gives the same number for ties and skips the next (1, 1, 3, 4); &lt;code&gt;DENSE_RANK&lt;/code&gt; gives the same number for ties without skipping (1, 1, 2, 3)&lt;/strong&gt;. Choose based on whether the prompt wants strict top-N row count, "all senders tied at rank N", or compact ranks without gaps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROW_NUMBER() OVER (ORDER BY count DESC)&lt;/code&gt;&lt;/strong&gt; — strict 1..N sequence; ties resolved by planner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RANK() OVER (ORDER BY count DESC)&lt;/code&gt;&lt;/strong&gt; — ties share rank, next rank skips.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK() OVER (ORDER BY count DESC)&lt;/code&gt;&lt;/strong&gt; — ties share rank, next rank does not skip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT 10&lt;/code&gt;&lt;/strong&gt; — chops to the top 10 rows after sort; it never widens for ties, so use a &lt;code&gt;RANK&lt;/code&gt; filter when every tied row must appear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three senders tied at 5 invoices, one at 3, one at 2.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sender_id&lt;/th&gt;
&lt;th&gt;invoice_count&lt;/th&gt;
&lt;th&gt;ROW_NUMBER&lt;/th&gt;
&lt;th&gt;RANK&lt;/th&gt;
&lt;th&gt;DENSE_RANK&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;drk&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "top 10" with strict cardinality → &lt;code&gt;LIMIT 10&lt;/code&gt;; "all senders tied at top 10" → wrap with &lt;code&gt;WHERE rk &amp;lt;= 10&lt;/code&gt;; "compact ranks 1..K with no gaps" → &lt;code&gt;DENSE_RANK&lt;/code&gt;.&lt;/p&gt;
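
&lt;p&gt;The "all senders tied at top 10" case from the rule of thumb can be sketched like this — a minimal sketch reusing the &lt;code&gt;invoices&lt;/code&gt; schema from the worked examples above; note the output may exceed 10 rows when the boundary is tied:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Returns every sender tied at or above rank 10, not a strict 10-row cut.
SELECT sender_id, invoice_count
FROM (
    SELECT sender_id,
           COUNT(*) AS invoice_count,
           RANK() OVER (ORDER BY COUNT(*) DESC) AS rk
    FROM invoices
    GROUP BY sender_id
) ranked
WHERE rk &amp;lt;= 10
ORDER BY invoice_count DESC, sender_id ASC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;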

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Forgetting &lt;code&gt;GROUP BY&lt;/code&gt; — &lt;code&gt;SELECT sender_id, COUNT(*)&lt;/code&gt; without grouping is rejected by PostgreSQL (&lt;code&gt;column "invoices.sender_id" must appear in the GROUP BY clause or be used in an aggregate function&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;ORDER BY ... DESC&lt;/code&gt; — top 10 silently becomes "first 10 by planner whim."&lt;/li&gt;
&lt;li&gt;Returning more than &lt;code&gt;N&lt;/code&gt; rows by skipping &lt;code&gt;LIMIT&lt;/code&gt; — graded as a wrong answer even when the top &lt;code&gt;N&lt;/code&gt; are correct.&lt;/li&gt;
&lt;li&gt;Hardcoding ties to a single row when the prompt says "all senders tied at rank 10" — use &lt;code&gt;RANK&lt;/code&gt; filter, not &lt;code&gt;LIMIT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Mixing &lt;code&gt;COUNT(*)&lt;/code&gt; and &lt;code&gt;COUNT(DISTINCT invoice_id)&lt;/code&gt; — non-equivalent when invoice_id can repeat (it usually can't, but state the assumption).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Top 10 Invoice Senders
&lt;/h3&gt;

&lt;p&gt;Given an &lt;code&gt;invoices(invoice_id, sender_id, recipient_id, sent_at, amount)&lt;/code&gt; table, write a query that returns the &lt;strong&gt;top 10 senders by completed invoice count&lt;/strong&gt; (the schema carries no status column, so treat every stored row as a completed invoice), ordered by count descending and breaking ties by &lt;code&gt;sender_id&lt;/code&gt; ascending. Output two columns: &lt;code&gt;sender_id&lt;/code&gt; and &lt;code&gt;invoice_count&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;COUNT(*)&lt;/code&gt; + &lt;code&gt;ORDER BY DESC&lt;/code&gt; + &lt;code&gt;LIMIT 10&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;invoice_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;GROUP BY sender_id&lt;/code&gt; collapses the row stream to one row per sender; &lt;code&gt;COUNT(*)&lt;/code&gt; produces the metric for each group; &lt;code&gt;ORDER BY invoice_count DESC&lt;/code&gt; ranks groups descending; the &lt;code&gt;, sender_id ASC&lt;/code&gt; tiebreak ensures stable output across runs (fintech-audit-grade); &lt;code&gt;LIMIT 10&lt;/code&gt; returns the first 10 ranked rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for an 8-row sample:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;invoice_id&lt;/th&gt;
&lt;th&gt;sender_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;td&gt;s3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;107&lt;/td&gt;
&lt;td&gt;s4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Group by sender_id&lt;/strong&gt; — four groups: s1 (4 rows), s2 (2 rows), s3 (1 row), s4 (1 row).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — produces &lt;code&gt;(s1, 4), (s2, 2), (s3, 1), (s4, 1)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order by &lt;code&gt;invoice_count DESC, sender_id ASC&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;(s1, 4), (s2, 2), (s3, 1), (s4, 1)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT 10&lt;/code&gt;&lt;/strong&gt; — only 4 senders exist; all four returned.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sender_id&lt;/th&gt;
&lt;th&gt;invoice_count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s4&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY sender_id&lt;/code&gt;&lt;/strong&gt; — collapses rows that share &lt;code&gt;sender_id&lt;/code&gt;; produces one output row per sender.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt; aggregate&lt;/strong&gt; — counts rows in each group; equivalent to &lt;code&gt;COUNT(invoice_id)&lt;/code&gt; when &lt;code&gt;invoice_id&lt;/code&gt; is non-null (always true since it's the primary key).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY invoice_count DESC&lt;/code&gt; ranking&lt;/strong&gt; — sorts groups descending by the metric; the &lt;code&gt;DESC&lt;/code&gt; keyword is the entire difference between top-N and bottom-N.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic tiebreak&lt;/strong&gt; — &lt;code&gt;, sender_id ASC&lt;/code&gt; ensures the output is reproducible across runs; fintech audits require this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT 10&lt;/code&gt; cut&lt;/strong&gt; — chops to the top 10 ranked rows; PostgreSQL syntax (CoderPad default).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|invoices| + G log G)&lt;/code&gt; time&lt;/strong&gt; — &lt;code&gt;|invoices|&lt;/code&gt; rows scanned for the &lt;code&gt;GROUP BY&lt;/code&gt;, then &lt;code&gt;O(G log G)&lt;/code&gt; to sort the group output where &lt;code&gt;G&lt;/code&gt; is the number of senders.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/company/square/sql" rel="noopener noreferrer"&gt;Square SQL practice page&lt;/a&gt; for the curated EASY problem and the &lt;a href="https://pipecode.ai/explore/practice/company/square/topic/aggregation" rel="noopener noreferrer"&gt;Square aggregation practice page&lt;/a&gt; for the only company-tagged topic surface available.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Square (SQL)&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Square SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/square/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Square / aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Square aggregation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/square/topic/aggregation" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — ranking&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL ranking problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/ranking/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  2. SQL Aggregation, Date Functions, and Cohort Analysis for Post-Signup Activity
&lt;/h2&gt;

&lt;h3&gt;
  
  
  30-day cohort retention via DATEDIFF + GROUP BY + HAVING in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Return every user who has at least 5 events in the 30 days after signup" is Square's signature MEDIUM SQL prompt (Users with High Activity After 30 Days Signup). The mental model: &lt;strong&gt;the cohort is defined by &lt;code&gt;signup_date&lt;/code&gt;; the activity window is &lt;code&gt;signup_date + INTERVAL '30 days'&lt;/code&gt;; events filtered to that window are aggregated &lt;code&gt;GROUP BY user_id&lt;/code&gt;; &lt;code&gt;HAVING COUNT(*) &amp;gt;= 5&lt;/code&gt; returns the active users&lt;/strong&gt;. Same primitive powers any retention or cohort analysis — N-day-post-signup activity, post-purchase repeat behavior, post-event engagement windows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquw722pn1lz08nki7nf8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquw722pn1lz08nki7nf8.jpeg" alt="Diagram showing a users table with signup_date markers, a horizontal timeline with 30-day forward windows tinted purple where in-window events are tinted green and out-of-window events stay slate, and a green output card listing active users meeting the threshold." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; PostgreSQL has multiple ways to express the 30-day boundary: &lt;code&gt;signup_date + INTERVAL '30 days'&lt;/code&gt;, &lt;code&gt;signup_date + INTERVAL '30 day'&lt;/code&gt;, &lt;code&gt;(event_at - signup_date) &amp;lt;= INTERVAL '30 days'&lt;/code&gt;, &lt;code&gt;(event_at::date - signup_date::date) &amp;lt;= 30&lt;/code&gt;. Pick one and stick with it. Mixing forms within the same query is a syntax-fluency red flag.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Date arithmetic: &lt;code&gt;DATEDIFF&lt;/code&gt;, &lt;code&gt;DATE_TRUNC&lt;/code&gt;, and &lt;code&gt;INTERVAL '30 days'&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The date-arithmetic invariant: &lt;strong&gt;adding an &lt;code&gt;INTERVAL&lt;/code&gt; to a &lt;code&gt;DATE&lt;/code&gt; yields a &lt;code&gt;TIMESTAMP&lt;/code&gt;; subtracting two &lt;code&gt;DATE&lt;/code&gt;s yields an integer day count; subtracting two &lt;code&gt;TIMESTAMP&lt;/code&gt;s yields an &lt;code&gt;INTERVAL&lt;/code&gt;&lt;/strong&gt; — and intervals are directly comparable to literal intervals. &lt;code&gt;DATEDIFF&lt;/code&gt; is not native PostgreSQL — use &lt;code&gt;(d1::date - d2::date)&lt;/code&gt; for a day count or &lt;code&gt;EXTRACT(DAY FROM (t1 - t2))&lt;/code&gt; for timestamps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;signup_date + INTERVAL '30 days'&lt;/code&gt;&lt;/strong&gt; — returns a timestamp 30 days after signup; cast with &lt;code&gt;::date&lt;/code&gt; if you need a plain date.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;(event_at - signup_date) &amp;lt;= INTERVAL '30 days'&lt;/code&gt;&lt;/strong&gt; — interval comparison; works for both &lt;code&gt;DATE&lt;/code&gt; and &lt;code&gt;TIMESTAMP&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC('week', event_at)&lt;/code&gt;&lt;/strong&gt; — snaps to the week boundary; useful for weekly cohorts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EXTRACT(DAY FROM (event_at - signup_date))&lt;/code&gt;&lt;/strong&gt; — pulls the day-count out as an integer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Signup on 2025-01-01; events at 2025-01-10 and 2025-02-15.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event_at&lt;/th&gt;
&lt;th&gt;signup_date&lt;/th&gt;
&lt;th&gt;diff&lt;/th&gt;
&lt;th&gt;within 30 days?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2025-01-10&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;9 days&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-02-15&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;45 days&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;event_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_at&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;signup_date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;days_since_signup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;signup_date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;within_window&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; prefer &lt;code&gt;event_at &amp;lt;= signup_date + INTERVAL '30 days'&lt;/code&gt; over &lt;code&gt;(event_at - signup_date) &amp;lt;= INTERVAL '30 days'&lt;/code&gt; — the additive form reads as "the event happened within the window" without a subtraction step.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cohort definition via &lt;code&gt;signup_date + INTERVAL '30 days'&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The cohort invariant: &lt;strong&gt;a cohort is a set of users defined by a shared signup date or signup-date-bucket; the cohort's activity window is that signup-date plus a fixed interval&lt;/strong&gt;. The cohort filter goes in &lt;code&gt;WHERE&lt;/code&gt;, the per-user aggregation happens in &lt;code&gt;GROUP BY&lt;/code&gt;, and the threshold check goes in &lt;code&gt;HAVING&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cohort by exact signup date&lt;/strong&gt; — &lt;code&gt;WHERE u.signup_date = '2025-01-01'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cohort by week&lt;/strong&gt; — &lt;code&gt;WHERE DATE_TRUNC('week', u.signup_date) = '2025-01-06'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cohort by all users&lt;/strong&gt; — no extra filter; aggregate per user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Activity window&lt;/strong&gt; — &lt;code&gt;WHERE e.event_at &amp;lt;= u.signup_date + INTERVAL '30 days'&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two users signed up on different dates; activity-window check per user.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;signup_date&lt;/th&gt;
&lt;th&gt;window_end&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;u1&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2025-01-31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u2&lt;/td&gt;
&lt;td&gt;2025-01-15&lt;/td&gt;
&lt;td&gt;2025-02-14&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signup_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signup_date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 days'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;window_end&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; compute the window boundary in &lt;code&gt;SELECT&lt;/code&gt; (or a CTE) when you need to reference it multiple times; recomputing &lt;code&gt;signup_date + INTERVAL '30 days'&lt;/code&gt; in three different clauses signals copy-paste.&lt;/p&gt;
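
&lt;p&gt;The CTE form of that rule can be sketched as follows — a minimal sketch reusing the &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;events&lt;/code&gt; names from the examples above, with the threshold of 5 mirroring the prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Compute the window boundary once in the CTE, then reference it by name.
WITH cohort AS (
    SELECT user_id,
           signup_date,
           signup_date + INTERVAL '30 days' AS window_end
    FROM users
)
SELECT c.user_id, COUNT(*) AS events_in_window
FROM cohort c
JOIN events e
  ON e.user_id = c.user_id
 AND e.event_at &amp;lt;= c.window_end
GROUP BY c.user_id
HAVING COUNT(*) &amp;gt;= 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;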

&lt;h4&gt;
  
  
  &lt;code&gt;COUNT(*) per user&lt;/code&gt; with &lt;code&gt;HAVING&lt;/code&gt; threshold
&lt;/h4&gt;

&lt;p&gt;The threshold invariant: &lt;strong&gt;&lt;code&gt;GROUP BY user_id&lt;/code&gt; collapses to one row per user; &lt;code&gt;COUNT(*)&lt;/code&gt; produces the per-user activity count; &lt;code&gt;HAVING COUNT(*) &amp;gt;= 5&lt;/code&gt; filters groups whose count meets the threshold&lt;/strong&gt;. &lt;code&gt;HAVING&lt;/code&gt; is "WHERE on aggregates" — it filters group rows, not source rows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt;= 5&lt;/code&gt;&lt;/strong&gt; — group-level filter; aggregate predicate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt;= 5 AND MAX(event_at) &amp;gt; signup_date + INTERVAL '14 days'&lt;/code&gt;&lt;/strong&gt; — compound threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid &lt;code&gt;WHERE COUNT(*) &amp;gt;= 5&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregates; PostgreSQL rejects it with &lt;code&gt;aggregate functions are not allowed in WHERE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid enforcing the post-signup window in &lt;code&gt;WHERE&lt;/code&gt; when zero-event users must appear&lt;/strong&gt; — put the window predicate in a &lt;code&gt;LEFT JOIN&lt;/code&gt;'s &lt;code&gt;ON&lt;/code&gt; clause and count &lt;code&gt;e.user_id&lt;/code&gt; instead.&lt;/li&gt;
&lt;/ul&gt;
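The "WHERE cannot reference aggregates" rule from the bullets above can be checked end to end. A minimal sketch using Python's `sqlite3` with made-up per-user event counts (6, 3, 5) — SQLite rejects the aggregate in `WHERE` at parse time just as PostgreSQL does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT)")
# Hypothetical data: u1 has 6 events, u2 has 3, u3 has 5.
conn.executemany("INSERT INTO events VALUES (?)",
                 [("u1",)] * 6 + [("u2",)] * 3 + [("u3",)] * 5)

# Aggregate predicate in WHERE: the parser rejects it ("misuse of aggregate").
try:
    conn.execute("SELECT user_id FROM events WHERE COUNT(*) >= 5 GROUP BY user_id")
except sqlite3.OperationalError as exc:
    print("WHERE rejected:", exc)

# Same predicate in HAVING: filters groups after aggregation.
passing = conn.execute(
    "SELECT user_id FROM events GROUP BY user_id HAVING COUNT(*) >= 5 ORDER BY user_id"
).fetchall()
print(passing)  # [('u1',), ('u3',)]
```

Only u1 and u3 clear the threshold; u2's group (3 rows) is dropped by `HAVING`, and the `WHERE` variant never runs at all.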

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three users; threshold 5 events.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;events_in_window&lt;/th&gt;
&lt;th&gt;passes?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;u1&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;events_in_window&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signup_date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 days'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; row-level predicates → &lt;code&gt;WHERE&lt;/code&gt;; aggregate predicates → &lt;code&gt;HAVING&lt;/code&gt;. Crossing them is a graded conceptual error.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;DATEDIFF(event_at, signup_date)&lt;/code&gt; in PostgreSQL — that's MySQL syntax; PostgreSQL uses &lt;code&gt;(d1 - d2)::int&lt;/code&gt; or &lt;code&gt;EXTRACT(DAY FROM ...)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Putting the aggregate threshold in &lt;code&gt;WHERE&lt;/code&gt; — parser rejects; use &lt;code&gt;HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Filtering &lt;code&gt;signup_date&lt;/code&gt; in &lt;code&gt;HAVING&lt;/code&gt; — that's a row predicate; should be in &lt;code&gt;WHERE&lt;/code&gt; for performance.&lt;/li&gt;
&lt;li&gt;Forgetting the &lt;code&gt;JOIN&lt;/code&gt; — querying &lt;code&gt;events&lt;/code&gt; alone misses the per-user signup_date anchor.&lt;/li&gt;
&lt;li&gt;Missing the &lt;code&gt;INTERVAL '30 days'&lt;/code&gt; boundary — counting all-time activity instead of the 30-day window inflates results.&lt;/li&gt;
&lt;li&gt;Timezone bugs — &lt;code&gt;event_at&lt;/code&gt; and &lt;code&gt;signup_date&lt;/code&gt; in different timezones produces off-by-one days; cast both to a single timezone when in doubt.&lt;/li&gt;
&lt;/ul&gt;
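On the first mistake in the list — `DATEDIFF` is MySQL, while PostgreSQL subtracts dates directly — a quick sanity check of the day arithmetic helps. A sketch via Python's `sqlite3` (SQLite has no date subtraction operator either; its idiom is `julianday(d1) - julianday(d2)`, analogous to PostgreSQL's `d1 - d2` on `DATE` columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Day difference between two ISO dates; PostgreSQL would write
# DATE '2025-01-31' - DATE '2025-01-01' and get the integer directly.
days = conn.execute(
    "SELECT CAST(julianday('2025-01-31') - julianday('2025-01-01') AS INTEGER)"
).fetchone()[0]
print(days)  # 30
```

Neither dialect accepts MySQL's `DATEDIFF(d1, d2)`; reaching for it in a PostgreSQL screen is an instant dialect flag.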

&lt;h3&gt;
  
  
  SQL Interview Question on 30-Day Post-Signup Activity
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;users(user_id, signup_date)&lt;/code&gt; and &lt;code&gt;events(event_id, user_id, event_at)&lt;/code&gt;, return every &lt;code&gt;user_id&lt;/code&gt; whose &lt;strong&gt;events in the 30 days after signup&lt;/strong&gt; are at least &lt;strong&gt;5&lt;/strong&gt;. Output &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;events_in_window&lt;/code&gt;, ordered by &lt;code&gt;events_in_window&lt;/code&gt; descending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;JOIN&lt;/code&gt; + &lt;code&gt;WHERE INTERVAL&lt;/code&gt; + &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING COUNT(*) &amp;gt;= 5&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;events_in_window&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signup_date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 days'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signup_date&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;events_in_window&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;JOIN&lt;/code&gt; pairs each event with its user's &lt;code&gt;signup_date&lt;/code&gt;; the &lt;code&gt;WHERE&lt;/code&gt; clause restricts events to the 30-day window post-signup (&lt;code&gt;&amp;gt;= signup_date AND &amp;lt;= signup_date + INTERVAL '30 days'&lt;/code&gt;); &lt;code&gt;GROUP BY u.user_id&lt;/code&gt; collapses to one row per user; &lt;code&gt;COUNT(*)&lt;/code&gt; produces the per-user in-window event count; &lt;code&gt;HAVING COUNT(*) &amp;gt;= 5&lt;/code&gt; filters users whose count crosses the threshold; &lt;code&gt;ORDER BY events_in_window DESC&lt;/code&gt; ranks active users by engagement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for three users:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;signup_date&lt;/th&gt;
&lt;th&gt;events_in_first_30d&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;u1&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u2&lt;/td&gt;
&lt;td&gt;2025-01-15&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u3&lt;/td&gt;
&lt;td&gt;2025-02-10&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;JOIN&lt;/strong&gt; — pairs each event row with its &lt;code&gt;signup_date&lt;/code&gt; from &lt;code&gt;users&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WHERE window filter&lt;/strong&gt; — keeps only events where &lt;code&gt;event_at&lt;/code&gt; is in &lt;code&gt;[signup_date, signup_date + 30 days]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GROUP BY user_id&lt;/strong&gt; — collapses to one row per user with the in-window event count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HAVING &amp;gt;= 5&lt;/strong&gt; — filters out u2 (3 events). u1 (6) and u3 (5) pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ORDER BY DESC&lt;/strong&gt; — u1 (6) ranks above u3 (5).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;events_in_window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;u1&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
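The trace and output above can be reproduced with a small runnable sketch. This uses Python's `sqlite3` with invented event rows matching the per-user counts (6, 3, 5); SQLite has no `INTERVAL` type, so the window boundary is spelled `date(signup_date, '+30 days')` instead of the PostgreSQL `+ INTERVAL '30 days'` in the article's query — the join/filter/group/having shape is otherwise identical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id TEXT, signup_date TEXT);
CREATE TABLE events (event_id INTEGER, user_id TEXT, event_at TEXT);
INSERT INTO users VALUES ('u1','2025-01-01'), ('u2','2025-01-15'), ('u3','2025-02-10');
""")
# Hypothetical events, all inside each user's 30-day window.
rows = (
    [(i, "u1", "2025-01-02") for i in range(1, 7)]      # u1: 6 events
    + [(i, "u2", "2025-01-16") for i in range(7, 10)]   # u2: 3 events (below threshold)
    + [(i, "u3", "2025-02-11") for i in range(10, 15)]  # u3: 5 events (meets threshold)
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

result = conn.execute("""
SELECT u.user_id, COUNT(*) AS events_in_window
FROM users u
JOIN events e ON e.user_id = u.user_id
WHERE e.event_at >= u.signup_date
  AND e.event_at <= date(u.signup_date, '+30 days')
GROUP BY u.user_id
HAVING COUNT(*) >= 5
ORDER BY events_in_window DESC
""").fetchall()
print(result)  # [('u1', 6), ('u3', 5)] -- u2 filtered by HAVING
```

The run confirms the table: u2 survives the `WHERE` window filter but is dropped at the `HAVING` stage.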

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;JOIN&lt;/code&gt; on &lt;code&gt;user_id&lt;/code&gt;&lt;/strong&gt; — pairs each event with its user's &lt;code&gt;signup_date&lt;/code&gt;; without this, the query has no anchor for the window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTERVAL '30 days'&lt;/code&gt; boundary&lt;/strong&gt; — PostgreSQL-native expression of the 30-day forward window; reads as "30 days from signup."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; row filter&lt;/strong&gt; — strips out-of-window events before grouping; deferring this filtering to after aggregation reads worse and runs slower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY u.user_id&lt;/code&gt;&lt;/strong&gt; — collapses the row stream to one row per user; the only non-aggregate column in &lt;code&gt;SELECT&lt;/code&gt; matches the &lt;code&gt;GROUP BY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt;= 5&lt;/code&gt;&lt;/strong&gt; — filters groups; aggregate predicates have to live here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roughly &lt;code&gt;O(|users| + |events| + G log G)&lt;/code&gt; time with a hash join&lt;/strong&gt; — the JOIN dominates; &lt;code&gt;G&lt;/code&gt; users with at least one in-window event get sorted at the end.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Practice &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions/sql" rel="noopener noreferrer"&gt;date-functions SQL problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/cohort-analysis" rel="noopener noreferrer"&gt;cohort-analysis problems&lt;/a&gt; on PipeCode.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Square (SQL)&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Square SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/square/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — date functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL date-function problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/date-functions/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — cohort analysis&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL cohort-analysis problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/cohort-analysis" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  3. SQL Window Functions for Monthly Averages and Duplicate Detection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Window AVG OVER + ROW_NUMBER OVER for analytics in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;"Calculate the monthly average rating for each product" is Block / Square's signature window-function SQL prompt (DataLemur Q1). The mental model: &lt;strong&gt;&lt;code&gt;AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))&lt;/code&gt; produces the monthly product average on every row, with no row-collapse&lt;/strong&gt;. The mirror primitive is &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt; for duplicate detection — flag rows whose &lt;code&gt;rn &amp;gt; 1&lt;/code&gt; per partition. Same primitive powers any "row-level value vs group-aggregate" or "first-occurrence-per-group" question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz45idizy67zen2wl87v2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz45idizy67zen2wl87v2.jpeg" alt="Diagram showing a reviews table with submit_date and stars columns, a partition box wrapping rows by product_id and month with AVG(stars) OVER per partition annotated, and a green output card listing month, product_id, avg_stars tuples." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Window functions and &lt;code&gt;GROUP BY&lt;/code&gt; answer different questions. &lt;code&gt;GROUP BY&lt;/code&gt; collapses rows; window aggregates compute on the group &lt;strong&gt;without collapsing&lt;/strong&gt;. Use &lt;code&gt;GROUP BY&lt;/code&gt; when you want one row per group; use a window when you want every original row plus a per-group aggregate column.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Window AVG: &lt;code&gt;AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The window-AVG invariant: &lt;strong&gt;&lt;code&gt;AVG(expr) OVER (PARTITION BY ...)&lt;/code&gt; returns the average of &lt;code&gt;expr&lt;/code&gt; across all rows in the same partition, attached to every row&lt;/strong&gt;. Unlike &lt;code&gt;GROUP BY AVG(...)&lt;/code&gt;, the row count is preserved.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY product_id&lt;/code&gt;&lt;/strong&gt; — one window per product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)&lt;/code&gt;&lt;/strong&gt; — one window per (product, month).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OVER ()&lt;/code&gt;&lt;/strong&gt; — empty parens = global window across all rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY submit_date ROWS UNBOUNDED PRECEDING&lt;/code&gt;&lt;/strong&gt; — running average up to current row.&lt;/li&gt;
&lt;/ul&gt;
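The partition variants above can be seen working on real rows. A minimal sketch using Python's `sqlite3` (SQLite supports the same `AVG(...) OVER (PARTITION BY ...)` form; its spelling of `EXTRACT(MONTH FROM d)` is `strftime('%m', d)`) with the two June reviews from the worked example:

```python
import sqlite3  # SQLite >= 3.25 ships window-function support

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (review_id INTEGER, product_id INTEGER, submit_date TEXT, stars INTEGER)"
)
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", [
    (6171, 50001, "2022-06-08", 4),
    (5293, 50001, "2022-06-18", 3),
])

# (product, month) partition: the monthly average lands on every row, no collapse.
rows = conn.execute("""
SELECT review_id, stars,
       AVG(stars) OVER (PARTITION BY product_id, strftime('%m', submit_date)) AS monthly_avg
FROM reviews
ORDER BY review_id
""").fetchall()
print(rows)  # [(5293, 3, 3.5), (6171, 4, 3.5)]
```

Both input rows come back, each carrying the shared 3.5 average — exactly the invariant the table above illustrates.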

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two reviews for product 50001 in June; window AVG returns 3.5 on both rows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;review_id&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;submit_date&lt;/th&gt;
&lt;th&gt;stars&lt;/th&gt;
&lt;th&gt;window_avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6171&lt;/td&gt;
&lt;td&gt;50001&lt;/td&gt;
&lt;td&gt;2022-06-08&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5293&lt;/td&gt;
&lt;td&gt;50001&lt;/td&gt;
&lt;td&gt;2022-06-18&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;review_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;submit_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;submit_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;monthly_avg&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "show me each row alongside the group average" → window; "show me one row per group" → &lt;code&gt;GROUP BY&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt; for duplicate detection
&lt;/h4&gt;

&lt;p&gt;The duplicate-detection invariant: &lt;strong&gt;&lt;code&gt;ROW_NUMBER() OVER (PARTITION BY &amp;lt;key_columns&amp;gt; ORDER BY &amp;lt;tiebreaker&amp;gt;)&lt;/code&gt; assigns 1, 2, 3, … to rows within each partition by the duplicate key; rows with &lt;code&gt;row_number &amp;gt; 1&lt;/code&gt; are duplicates&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY user_id, product_id&lt;/code&gt;&lt;/strong&gt; — define what makes a duplicate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY submit_date ASC&lt;/code&gt;&lt;/strong&gt; — tiebreaker; deterministic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter&lt;/strong&gt; — wrap in subquery / CTE: &lt;code&gt;WHERE rn &amp;gt; 1&lt;/code&gt; for duplicates only, &lt;code&gt;WHERE rn = 1&lt;/code&gt; for the first occurrence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alternative&lt;/strong&gt; — &lt;code&gt;COUNT(*) OVER (PARTITION BY ...)&lt;/code&gt; to count duplicates per row.&lt;/li&gt;
&lt;/ul&gt;
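The four bullets above compose into one runnable pattern. A sketch via Python's `sqlite3` (same `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` syntax as PostgreSQL) using the duplicate pair from the worked example plus one innocent row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (review_id INTEGER, user_id INTEGER, product_id INTEGER, submit_date TEXT)"
)
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", [
    (6171, 123, 50001, "2022-06-08"),
    (9999, 123, 50001, "2022-06-15"),  # same (user, product) as 6171: a duplicate
    (7802, 265, 69852, "2022-06-10"),  # unique (user, product): untouched
])

# Wrap in a subquery, then keep only rn > 1 -- the later copies per duplicate key.
dupes = conn.execute("""
SELECT review_id FROM (
  SELECT review_id,
         ROW_NUMBER() OVER (PARTITION BY user_id, product_id ORDER BY submit_date) AS rn
  FROM reviews
) t
WHERE rn > 1
""").fetchall()
print(dupes)  # [(9999,)]
```

Flipping the filter to `WHERE rn = 1` keeps exactly one row per `(user_id, product_id)` — the standard dedup-keep-first idiom.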

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two reviews from same user for same product → duplicate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;review_id&lt;/th&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;submit_date&lt;/th&gt;
&lt;th&gt;rn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6171&lt;/td&gt;
&lt;td&gt;123&lt;/td&gt;
&lt;td&gt;50001&lt;/td&gt;
&lt;td&gt;2022-06-08&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9999&lt;/td&gt;
&lt;td&gt;123&lt;/td&gt;
&lt;td&gt;50001&lt;/td&gt;
&lt;td&gt;2022-06-15&lt;/td&gt;
&lt;td&gt;2 (duplicate)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;review_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;submit_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
           &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;submit_date&lt;/span&gt;
         &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;ROW_NUMBER OVER (PARTITION BY &amp;lt;duplicate_key&amp;gt; ORDER BY &amp;lt;tiebreaker&amp;gt;)&lt;/code&gt; is the universal SQL idiom for "find duplicates" — no GROUP BY needed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Window vs &lt;code&gt;GROUP BY&lt;/code&gt;: row-level vs collapsed-row aggregations
&lt;/h4&gt;

&lt;p&gt;The distinction invariant: &lt;strong&gt;window functions preserve the input row count and add per-group aggregate columns; &lt;code&gt;GROUP BY&lt;/code&gt; collapses input rows to one row per group and replaces non-aggregate columns&lt;/strong&gt;. Many candidates conflate them; interviewers grade the distinction.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Window&lt;/strong&gt; — &lt;code&gt;SELECT review_id, AVG(stars) OVER (PARTITION BY product_id) FROM reviews;&lt;/code&gt; — row count preserved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;SELECT product_id, AVG(stars) FROM reviews GROUP BY product_id;&lt;/code&gt; — one row per product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed&lt;/strong&gt; — wrap the GROUP BY in a CTE, then JOIN to the original; messy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;QUALIFY&lt;/code&gt;&lt;/strong&gt; — filter on window results, available in dialects such as Snowflake, BigQuery, and DuckDB; PostgreSQL needs a wrapping subquery instead.&lt;/li&gt;
&lt;/ul&gt;
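The row-count contrast in the table below the bullets is easy to verify directly. A sketch via Python's `sqlite3` with five invented reviews across two products — the window query returns five rows, the `GROUP BY` query two:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (product_id INTEGER, stars INTEGER)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)",
                 [(50001, 4), (50001, 3), (69852, 4), (69852, 3), (69852, 2)])

# Window aggregate: input row count preserved.
windowed = conn.execute(
    "SELECT product_id, AVG(stars) OVER (PARTITION BY product_id) FROM reviews"
).fetchall()

# GROUP BY aggregate: one row per product.
grouped = conn.execute(
    "SELECT product_id, AVG(stars) FROM reviews GROUP BY product_id"
).fetchall()

print(len(windowed), len(grouped))  # 5 2
```

Same data, same averages, two different output shapes — which is the whole distinction interviewers probe.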

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same data, two different shapes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;approach&lt;/th&gt;
&lt;th&gt;rows returned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Window AVG&lt;/td&gt;
&lt;td&gt;5 (same as input)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GROUP BY AVG&lt;/td&gt;
&lt;td&gt;2 (one per product)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Window: keep all rows&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;review_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;product_avg&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- GROUP BY: collapse to one row per product&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;product_avg&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; read the prompt — "for each row, show…" = window; "summarize per product" = GROUP BY.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Forgetting &lt;code&gt;PARTITION BY&lt;/code&gt; — the window covers the whole table, producing the global average on every row.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;GROUP BY&lt;/code&gt; when the prompt says "for each review, show the monthly product average" — collapses rows incorrectly.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;ORDER BY&lt;/code&gt; inside &lt;code&gt;ROW_NUMBER OVER (...)&lt;/code&gt; — non-deterministic row numbering; the duplicate filter becomes random.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;RANK()&lt;/code&gt; when &lt;code&gt;ROW_NUMBER()&lt;/code&gt; is correct — &lt;code&gt;RANK&lt;/code&gt; ties become same number; &lt;code&gt;RANK &amp;gt; 1&lt;/code&gt; doesn't reliably flag duplicates.&lt;/li&gt;
&lt;li&gt;Filtering window results in &lt;code&gt;WHERE&lt;/code&gt; — Postgres requires a wrapping subquery / CTE because window functions evaluate after WHERE.&lt;/li&gt;
&lt;/ul&gt;
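The first mistake in the list — forgetting `PARTITION BY` — is worth seeing concretely. A sketch via Python's `sqlite3` with three invented reviews: `OVER ()` stamps the whole-table average on every row, while the partitioned form gives per-product values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (product_id INTEGER, stars INTEGER)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)",
                 [(50001, 4), (50001, 3), (69852, 5)])

# Empty OVER (): one global window, average of (4, 3, 5) on every row.
global_avg = conn.execute("SELECT AVG(stars) OVER () FROM reviews").fetchall()

# Partitioned: per-product averages.
per_product = conn.execute(
    "SELECT AVG(stars) OVER (PARTITION BY product_id) FROM reviews ORDER BY product_id"
).fetchall()

print(global_avg)   # [(4.0,), (4.0,), (4.0,)]
print(per_product)  # [(3.5,), (3.5,), (5.0,)]
```

A query that compiles cleanly and returns 4.0 everywhere is the classic silent version of this bug — no error, wrong analysis.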

&lt;h3&gt;
  
  
  SQL Interview Question on Monthly Average Product Ratings
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;reviews(review_id, user_id, submit_date, product_id, stars)&lt;/code&gt;, write a query that returns, for each row, &lt;code&gt;submit_date_month&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, and the &lt;strong&gt;monthly average rating&lt;/strong&gt; for that product (&lt;code&gt;avg_stars&lt;/code&gt;), sorted by month and product_id.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;AVG ... OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;submit_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stars&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;submit_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_stars&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;mth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;EXTRACT(MONTH FROM submit_date)&lt;/code&gt; pulls the month component from each row; &lt;code&gt;AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))&lt;/code&gt; partitions by (product, month) and averages stars within each partition; the cast to &lt;code&gt;decimal(10, 2)&lt;/code&gt; rounds to two decimals; &lt;code&gt;ORDER BY mth, product_id&lt;/code&gt; produces the canonical sort order. Each row keeps its identity — this is window, not GROUP BY.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the DataLemur sample:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;review_id&lt;/th&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;submit_date&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;stars&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6171&lt;/td&gt;
&lt;td&gt;123&lt;/td&gt;
&lt;td&gt;2022-06-08&lt;/td&gt;
&lt;td&gt;50001&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7802&lt;/td&gt;
&lt;td&gt;265&lt;/td&gt;
&lt;td&gt;2022-06-10&lt;/td&gt;
&lt;td&gt;69852&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5293&lt;/td&gt;
&lt;td&gt;362&lt;/td&gt;
&lt;td&gt;2022-06-18&lt;/td&gt;
&lt;td&gt;50001&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6352&lt;/td&gt;
&lt;td&gt;192&lt;/td&gt;
&lt;td&gt;2022-07-26&lt;/td&gt;
&lt;td&gt;69852&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4517&lt;/td&gt;
&lt;td&gt;981&lt;/td&gt;
&lt;td&gt;2022-07-05&lt;/td&gt;
&lt;td&gt;69852&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EXTRACT(MONTH ...)&lt;/strong&gt; — produces &lt;code&gt;[6, 6, 6, 7, 7]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition by (product_id, month)&lt;/strong&gt; — three partitions: (50001, 6) with 2 rows, (69852, 6) with 1 row, (69852, 7) with 2 rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVG within partition&lt;/strong&gt; — (50001, 6) → 3.5; (69852, 6) → 4.0; (69852, 7) → 2.5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cast + ORDER BY&lt;/strong&gt; — five rows emitted, sorted by (mth, product_id).&lt;/li&gt;
&lt;/ol&gt;
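The trace above corresponds to a query of this shape — a sketch only, assuming the source table is named `reviews` with the columns shown in the sample:

```sql
SELECT EXTRACT(MONTH FROM submit_date) AS mth,
       product_id,
       (AVG(stars) OVER (
          PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)
        ))::decimal(10, 2) AS avg_stars
FROM reviews
ORDER BY mth, product_id;
```

Every input row survives to the output — that is why the trace emits five rows rather than three grouped ones.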

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;mth&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;avg_stars&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;50001&lt;/td&gt;
&lt;td&gt;3.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;50001&lt;/td&gt;
&lt;td&gt;3.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;69852&lt;/td&gt;
&lt;td&gt;4.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;69852&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;69852&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EXTRACT(MONTH FROM submit_date)&lt;/code&gt;&lt;/strong&gt; — PostgreSQL date-component function; returns the month as a number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AVG(...) OVER (PARTITION BY ...)&lt;/code&gt;&lt;/strong&gt; — window aggregate; preserves row count and computes per-partition mean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite partition key&lt;/strong&gt; — &lt;code&gt;(product_id, EXTRACT(MONTH FROM submit_date))&lt;/code&gt; ensures monthly granularity per product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;::decimal(10, 2)&lt;/code&gt; cast&lt;/strong&gt; — rounds to 2 decimals; cleaner output than the default float representation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY mth, product_id&lt;/code&gt;&lt;/strong&gt; — canonical sort order for the audit-readable result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N log N)&lt;/code&gt; time&lt;/strong&gt; — the planner sorts rows by partition keys once; no GROUP BY collapse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window-function problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;aggregation problems&lt;/a&gt; on PipeCode.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Square (SQL)&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Square SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/square/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — window functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL window-function problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL aggregation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  4. SQL &lt;code&gt;COUNT DISTINCT&lt;/code&gt; and Status Filters for Payment-Flow Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Payment-flow SQL with COUNT DISTINCT and status filters in SQL for Square data engineering
&lt;/h3&gt;

&lt;p&gt;"Return the total unique successful transactions, unique senders, and unique recipients" is Square / CashApp's signature payment-flow SQL prompt (Medium / DataLemur Q6 staple). The mental model: &lt;strong&gt;filter rows with &lt;code&gt;WHERE status = 'Success'&lt;/code&gt; (idempotency boundary), then aggregate with &lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt; over the columns you care about — payment_id, sender_id, recipient_id&lt;/strong&gt;. Same primitive powers any "unique-X for filtered Y" pipeline — unique active users last week, unique trading symbols traded yesterday, unique ad creatives clicked today.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Always state your idempotency boundary out loud before writing the query: "I'm filtering to &lt;code&gt;status = 'Success'&lt;/code&gt; because failed/pending transactions don't count toward unique-success metrics." Square interviewers grade this phrasing — it's the difference between "candidate writes correct SQL" and "candidate thinks like a data engineer at a regulated payments company."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;COUNT&lt;/code&gt; vs &lt;code&gt;COUNT DISTINCT&lt;/code&gt;: row count vs unique-value count
&lt;/h4&gt;

&lt;p&gt;The distinct-count invariant: &lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt; counts rows; &lt;code&gt;COUNT(col)&lt;/code&gt; counts non-null values in &lt;code&gt;col&lt;/code&gt;; &lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt; counts distinct non-null values in &lt;code&gt;col&lt;/code&gt;&lt;/strong&gt;. The three are not interchangeable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — total rows in the result set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(payment_id)&lt;/code&gt;&lt;/strong&gt; — non-null payment IDs (equal to &lt;code&gt;COUNT(*)&lt;/code&gt; here, since &lt;code&gt;payment_id&lt;/code&gt; is a non-null primary key).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT sender_id)&lt;/code&gt;&lt;/strong&gt; — unique senders; collapses repeated sender_ids.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT (col1, col2))&lt;/code&gt;&lt;/strong&gt; — PostgreSQL counts distinct combinations via a row constructor; the bare comma form &lt;code&gt;COUNT(DISTINCT col1, col2)&lt;/code&gt; is MySQL syntax.&lt;/li&gt;
&lt;/ul&gt;
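One dialect detail worth drilling: counting distinct combinations of two columns. A minimal sketch against the `payments` table:

```sql
-- PostgreSQL wants a row constructor for multi-column distinct counts;
-- the bare comma form COUNT(DISTINCT col1, col2) is MySQL syntax and
-- raises a syntax error in PostgreSQL.
SELECT COUNT(DISTINCT (sender_id, recipient_id)) AS unique_pairs
FROM payments;
```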

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Five payments from 3 distinct senders.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;payment_id&lt;/th&gt;
&lt;th&gt;sender_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;p1&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p2&lt;/td&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p3&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p4&lt;/td&gt;
&lt;td&gt;s3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p5&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(payment_id)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(DISTINCT sender_id)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_payment_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_senders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;payments&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "how many rows" → &lt;code&gt;COUNT(*)&lt;/code&gt;; "how many unique entities" → &lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;WHERE status = 'Success'&lt;/code&gt; before aggregation
&lt;/h4&gt;

&lt;p&gt;The idempotency-filter invariant: &lt;strong&gt;payment-flow analytics almost always filter to &lt;code&gt;status = 'Success'&lt;/code&gt; (or whichever success state applies); failed and pending transactions do not contribute to success metrics&lt;/strong&gt;. The filter goes in &lt;code&gt;WHERE&lt;/code&gt;, before aggregation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE status = 'Success'&lt;/code&gt;&lt;/strong&gt; — strictly successful payments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE status IN ('Success', 'Settled')&lt;/code&gt;&lt;/strong&gt; — multiple success-equivalent statuses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE status NOT IN ('Failed', 'Cancelled')&lt;/code&gt;&lt;/strong&gt; — exclusionary; equivalent only when every status value is enumerated and non-null, since &lt;code&gt;NOT IN&lt;/code&gt; silently drops &lt;code&gt;NULL&lt;/code&gt; statuses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State the assumption&lt;/strong&gt; — interviewers expect you to clarify which statuses count.&lt;/li&gt;
&lt;/ul&gt;
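One NULL caveat on the exclusionary form, sketched here because it bites in practice: `NOT IN` never matches a `NULL` status, so NULL-status rows vanish from the result unless you opt them back in.

```sql
-- status NOT IN ('Failed', 'Cancelled') evaluates to NULL (not TRUE)
-- when status IS NULL, so those rows are silently filtered out.
-- Be explicit if NULL statuses should be kept:
SELECT COUNT(*) AS non_failed_payments
FROM payments
WHERE status NOT IN ('Failed', 'Cancelled')
   OR status IS NULL;
```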

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Six payments, three successful.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;payment_id&lt;/th&gt;
&lt;th&gt;sender_id&lt;/th&gt;
&lt;th&gt;status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;p1&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p2&lt;/td&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;td&gt;Failed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p3&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p4&lt;/td&gt;
&lt;td&gt;s3&lt;/td&gt;
&lt;td&gt;Pending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p5&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p6&lt;/td&gt;
&lt;td&gt;s4&lt;/td&gt;
&lt;td&gt;Failed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After &lt;code&gt;WHERE status = 'Success'&lt;/code&gt;: 3 rows (p1, p3, p5), unique senders = 1 (s1 only).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;successful_payments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_successful_senders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;payments&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Success'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the success-filter is non-negotiable for payment-flow questions; ask the interviewer which status values count if the prompt is ambiguous.&lt;/p&gt;

&lt;h4&gt;
  
  
  Multiple &lt;code&gt;COUNT DISTINCT&lt;/code&gt; in one &lt;code&gt;SELECT&lt;/code&gt;: sender / recipient splits
&lt;/h4&gt;

&lt;p&gt;The compound-aggregate invariant: &lt;strong&gt;a single &lt;code&gt;SELECT&lt;/code&gt; can compute multiple &lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt; aggregates over the same filtered row set; this is the common shape for payment-flow "uniques per role" questions&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three aggregates in one SELECT&lt;/strong&gt; — &lt;code&gt;COUNT(DISTINCT payment_id), COUNT(DISTINCT sender_id), COUNT(DISTINCT recipient_id)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same WHERE filter applies to all&lt;/strong&gt; — single source row set drives every aggregate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Senders ≠ recipients&lt;/strong&gt; — even though both columns reference users, the distinct counts can differ.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt;&lt;/strong&gt; — pool sender + recipient with &lt;code&gt;UNION ALL&lt;/code&gt; first if you want "unique users involved."&lt;/li&gt;
&lt;/ul&gt;
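The "unique users involved" variant from the last bullet can be sketched like this — pool both role columns with `UNION ALL`, then count distinct once:

```sql
SELECT COUNT(DISTINCT user_id) AS unique_users_involved
FROM (
  SELECT sender_id AS user_id
  FROM payments
  WHERE status = 'Success'
  UNION ALL
  SELECT recipient_id AS user_id
  FROM payments
  WHERE status = 'Success'
) AS participants;
```

`UNION ALL` keeps duplicates cheaply; the outer `COUNT(DISTINCT ...)` deduplicates once at the end.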

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three successful payments; two distinct senders, three distinct recipients.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;payment_id&lt;/th&gt;
&lt;th&gt;sender_id&lt;/th&gt;
&lt;th&gt;recipient_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;p1&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;r1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p3&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;r2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p5&lt;/td&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;td&gt;r3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(DISTINCT payment_id)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(DISTINCT sender_id)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(DISTINCT recipient_id)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;payment_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_transactions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_senders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;recipient_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_recipients&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;payments&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Success'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; one query, multiple aggregates is the cleanest shape; never write three separate queries for sender / recipient / transaction counts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;COUNT(*)&lt;/code&gt; when the prompt says "unique" — overcounts whenever the column holds duplicates, because rows are counted instead of distinct values.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;WHERE status = 'Success'&lt;/code&gt; — counts failed and pending transactions; the metric becomes meaningless.&lt;/li&gt;
&lt;li&gt;Putting the status filter in &lt;code&gt;HAVING&lt;/code&gt; instead of &lt;code&gt;WHERE&lt;/code&gt; — in PostgreSQL this usually fails outright (a non-grouped column cannot appear in &lt;code&gt;HAVING&lt;/code&gt;), and even where it is legal it filters after aggregation instead of before.&lt;/li&gt;
&lt;li&gt;Writing three separate queries for sender / recipient / transaction — composes into one SELECT with three aggregates.&lt;/li&gt;
&lt;li&gt;Confusing "unique senders" with "unique users" — senders and recipients can be the same individual; pool them only if the prompt asks for "unique users."&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Payment-Flow Unique Counts
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;payments(payment_id, sender_id, recipient_id, amount, status, created_at)&lt;/code&gt; where &lt;code&gt;status&lt;/code&gt; is one of &lt;code&gt;'Success'&lt;/code&gt;, &lt;code&gt;'Failed'&lt;/code&gt;, &lt;code&gt;'Pending'&lt;/code&gt;, write a query that returns three columns: &lt;code&gt;unique_transactions&lt;/code&gt;, &lt;code&gt;unique_senders&lt;/code&gt;, &lt;code&gt;unique_recipients&lt;/code&gt; — each a &lt;code&gt;COUNT(DISTINCT)&lt;/code&gt; over rows where &lt;code&gt;status = 'Success'&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;COUNT(DISTINCT)&lt;/code&gt; + &lt;code&gt;WHERE status = 'Success'&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;payment_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_transactions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_senders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;recipient_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_recipients&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;payments&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Success'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;WHERE status = 'Success'&lt;/code&gt; strips failed and pending payments before aggregation so every count reflects only successful transactions; three &lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt; aggregates over the filtered row set produce the three unique counts in a single pass; the result is one row with three columns — the canonical shape for "summary metrics for a filtered universe."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for a 6-row sample:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;payment_id&lt;/th&gt;
&lt;th&gt;sender_id&lt;/th&gt;
&lt;th&gt;recipient_id&lt;/th&gt;
&lt;th&gt;status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;p1&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;r1&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p2&lt;/td&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;td&gt;r2&lt;/td&gt;
&lt;td&gt;Failed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p3&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;r2&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p4&lt;/td&gt;
&lt;td&gt;s3&lt;/td&gt;
&lt;td&gt;r1&lt;/td&gt;
&lt;td&gt;Pending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p5&lt;/td&gt;
&lt;td&gt;s2&lt;/td&gt;
&lt;td&gt;r3&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p6&lt;/td&gt;
&lt;td&gt;s1&lt;/td&gt;
&lt;td&gt;r1&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;WHERE filter&lt;/strong&gt; — strips p2 (Failed) and p4 (Pending); 4 successful rows remain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COUNT(DISTINCT payment_id)&lt;/strong&gt; — &lt;code&gt;{p1, p3, p5, p6}&lt;/code&gt; = 4 (every payment_id is unique by definition).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COUNT(DISTINCT sender_id)&lt;/strong&gt; — &lt;code&gt;{s1, s2}&lt;/code&gt; = 2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COUNT(DISTINCT recipient_id)&lt;/strong&gt; — &lt;code&gt;{r1, r2, r3}&lt;/code&gt; = 3.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;unique_transactions&lt;/th&gt;
&lt;th&gt;unique_senders&lt;/th&gt;
&lt;th&gt;unique_recipients&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE status = 'Success'&lt;/code&gt;&lt;/strong&gt; — idempotency filter; payment-flow metrics never include failed or pending rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT payment_id)&lt;/code&gt;&lt;/strong&gt; — counts unique transaction IDs; equal to &lt;code&gt;COUNT(*)&lt;/code&gt; here since &lt;code&gt;payment_id&lt;/code&gt; is the primary key, but spelled out for clarity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT sender_id)&lt;/code&gt;&lt;/strong&gt; — counts unique senders; senders can repeat across payments, so this collapses repeats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT recipient_id)&lt;/code&gt;&lt;/strong&gt; — counts unique recipients; same logic, different column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-pass three-aggregate SELECT&lt;/strong&gt; — one scan of &lt;code&gt;payments&lt;/code&gt; filtered to &lt;code&gt;Success&lt;/code&gt;, three aggregates in parallel; no need for separate queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|payments|)&lt;/code&gt; time / &lt;code&gt;O(D)&lt;/code&gt; space&lt;/strong&gt; — one linear scan; &lt;code&gt;D&lt;/code&gt; = sum of distinct cardinalities held in three small hash structures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill more &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;SQL filtering problems&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/company/square/sql" rel="noopener noreferrer"&gt;Square SQL practice page&lt;/a&gt; for the curated 2-problem set.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Square (SQL)&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Square SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/square/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — filtering&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL filtering problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL aggregation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips to crack Square data engineering interviews
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Square = Block — both names refer to the same loop
&lt;/h3&gt;

&lt;p&gt;Square rebranded to &lt;strong&gt;Block Inc.&lt;/strong&gt; in &lt;strong&gt;December 2021&lt;/strong&gt;. The company name on offer letters is Block; the consumer products (Square seller tools, Cash App, Tidal, Spiral) keep their original brands. The data-engineering interview loop, the SQL bar, and the question shapes did not change with the rebrand. Search for both "Square data engineer interview" and "Block data engineer interview" — every external article you find under one name applies under the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drill the four SQL primitives
&lt;/h3&gt;

&lt;p&gt;The four primitives in this guide map directly onto the curated two-problem PipeCode SQL set plus the adjacent primitives every Block / Square / CashApp SQL list rotates through: &lt;code&gt;GROUP BY + COUNT + ORDER BY DESC + LIMIT N&lt;/code&gt; for top-N rankings (#45), &lt;code&gt;DATEDIFF / INTERVAL '30 days' + GROUP BY + HAVING&lt;/code&gt; for cohort-retention queries (#217), &lt;code&gt;AVG OVER PARTITION BY&lt;/code&gt; and &lt;code&gt;ROW_NUMBER OVER&lt;/code&gt; window functions for monthly aggregates and duplicate detection (DataLemur Q1 + Q3), and &lt;code&gt;COUNT(DISTINCT) + WHERE status = 'Success'&lt;/code&gt; payment-flow aggregations (Medium + DataLemur Q6).&lt;/p&gt;

&lt;h3&gt;
  
  
  CoderPad PostgreSQL is the live coding environment
&lt;/h3&gt;

&lt;p&gt;Square / Block / CashApp interviews run live SQL in &lt;strong&gt;CoderPad&lt;/strong&gt; with &lt;strong&gt;PostgreSQL&lt;/strong&gt; as the only available dialect. Drill PostgreSQL-flavored answers — &lt;code&gt;EXTRACT(MONTH FROM ...)&lt;/code&gt;, &lt;code&gt;DATE_TRUNC('week', ...)&lt;/code&gt;, &lt;code&gt;INTERVAL '30 days'&lt;/code&gt;, &lt;code&gt;(d1 - d2)::int&lt;/code&gt;, &lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt;, &lt;code&gt;LIMIT N OFFSET M&lt;/code&gt;. Avoid Snowflake's &lt;code&gt;QUALIFY&lt;/code&gt;, MySQL's &lt;code&gt;DATEDIFF(d1, d2)&lt;/code&gt;, BigQuery's &lt;code&gt;DATE_DIFF&lt;/code&gt;, SQL Server's &lt;code&gt;TOP N&lt;/code&gt; — these all parse-fail in PostgreSQL. Google search is allowed during the live interview, but stuttering on dialect signals weak SQL fluency.&lt;/p&gt;
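As a quick dialect-translation sketch (column names borrowed from the `payments` schema above), here is how the common non-PostgreSQL constructs rewrite:

```sql
-- MySQL  DATEDIFF(d1, d2)   ->  (d1 - d2)   -- integer days for DATE columns
-- T-SQL  SELECT TOP 3 ...   ->  ... LIMIT 3
-- Snowflake QUALIFY rn = 1  ->  wrap the window in a subquery and filter:
SELECT payment_id, sender_id
FROM (
  SELECT payment_id,
         sender_id,
         ROW_NUMBER() OVER (PARTITION BY sender_id
                            ORDER BY created_at DESC) AS rn
  FROM payments
) AS ranked
WHERE rn = 1;  -- latest payment per sender, no QUALIFY needed
```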

&lt;h3&gt;
  
  
  Medallion architecture (Bronze / Silver / Gold) is the system-design house style
&lt;/h3&gt;

&lt;p&gt;Square's data platform is structured around the &lt;strong&gt;Medallion architecture&lt;/strong&gt; — &lt;code&gt;Bronze&lt;/code&gt; for raw landed data, &lt;code&gt;Silver&lt;/code&gt; for cleaned and enriched data, &lt;code&gt;Gold&lt;/code&gt; for aggregated analytics-ready data. When you discuss pipeline design in the system-design or onsite panel rounds, frame the data flow as Bronze → Silver → Gold, name the orchestrator (Airflow or dbt), and mention data-quality gates (Great Expectations) between layers. This single framing carries weight at every Square data-engineering onsite.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data quality is graded heavily — Great Expectations and validation gates
&lt;/h3&gt;

&lt;p&gt;Square's interviewers probe data-quality concerns on every round. Mention specific tools you've used (Great Expectations, Soda, dbt tests), describe the validation gates between Bronze / Silver / Gold layers, and articulate the difference between "schema validation" (column types and nullability) and "semantic validation" (business-logic invariants like &lt;code&gt;transaction_amount &amp;gt; 0&lt;/code&gt; or &lt;code&gt;payment_status IN ('Success', 'Failed', 'Pending')&lt;/code&gt;).&lt;/p&gt;
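Those semantic invariants can also live in the schema itself — an illustrative sketch only, since real validation gates usually run in Great Expectations or dbt rather than as table constraints:

```sql
-- Illustrative: encoding the business-logic invariants as CHECK constraints
ALTER TABLE payments
  ADD CONSTRAINT positive_amount CHECK (amount > 0),
  ADD CONSTRAINT known_status
      CHECK (status IN ('Success', 'Failed', 'Pending'));
```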

&lt;h3&gt;
  
  
  4-stage interview process — HR, Hiring Manager, Technical, Onsite Panel (NDA)
&lt;/h3&gt;

&lt;p&gt;The Square data-engineering loop runs four stages: &lt;strong&gt;HR screen&lt;/strong&gt; (15-30 min recruiter), &lt;strong&gt;Hiring Manager interview&lt;/strong&gt; (~30 min, past experience and data-quality), &lt;strong&gt;Technical Screen&lt;/strong&gt; (~1 hour, Python + SQL on CoderPad with medium-level complexity), and &lt;strong&gt;Onsite Panel&lt;/strong&gt; (multiple 30-45 min rounds, NDA required, data modeling + analytical + cultural fit). Total comp data points are limited (only 2 reported), but the average base salary is &lt;strong&gt;$139,850&lt;/strong&gt;, range $101K-$185K.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/company/square" rel="noopener noreferrer"&gt;Square practice page&lt;/a&gt; and the language-scoped &lt;a href="https://pipecode.ai/explore/practice/company/square/sql" rel="noopener noreferrer"&gt;Square SQL practice page&lt;/a&gt; for the curated 2-problem set. Hit the company_topic &lt;a href="https://pipecode.ai/explore/practice/company/square/topic/aggregation" rel="noopener noreferrer"&gt;Square — aggregation page&lt;/a&gt; for the only Square-tagged topic surface available. After that, drill the matching topic pages: &lt;a href="https://pipecode.ai/explore/practice/topic/ranking/sql" rel="noopener noreferrer"&gt;ranking&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;aggregation&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions/sql" rel="noopener noreferrer"&gt;date functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/cohort-analysis" rel="noopener noreferrer"&gt;cohort analysis&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;window functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;filtering&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/group-by/sql" rel="noopener noreferrer"&gt;group by&lt;/a&gt;. The &lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;interview courses page&lt;/a&gt; bundles structured curricula. 
For broader coverage, &lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;browse by topic&lt;/a&gt;, or pivot to peer guides — the &lt;a href="https://pipecode.ai/blogs/airbnb-data-engineering-interview-questions-prep-guide" rel="noopener noreferrer"&gt;Airbnb DE interview guide&lt;/a&gt; and the &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top DE interview questions 2026&lt;/a&gt; blog. The &lt;a href="https://pipecode.ai/blogs/sql-data-types-postgresql-guide" rel="noopener noreferrer"&gt;SQL data types Postgres guide&lt;/a&gt; is a useful refresher because PostgreSQL is the CoderPad default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communication and approach under time pressure
&lt;/h3&gt;

&lt;p&gt;Talk through the invariant first ("this is a top-N ranking problem with a tiebreak requirement"), the brute force second ("a self-join would also work but is &lt;code&gt;O(n²)&lt;/code&gt;"), and the optimal third ("&lt;code&gt;GROUP BY + ORDER BY + LIMIT&lt;/code&gt; is &lt;code&gt;O(n log n)&lt;/code&gt; and idiomatic"). Interviewers grade &lt;strong&gt;process&lt;/strong&gt; as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, ties at the cut, NULL handling, timezone bugs in date arithmetic, status values you didn't expect (&lt;code&gt;Pending&lt;/code&gt;, &lt;code&gt;Cancelled&lt;/code&gt;). The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.&lt;/p&gt;
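The narration above, made concrete for the top-N case — a sketch with an explicit tiebreak so ties at the cut stay deterministic:

```sql
SELECT sender_id,
       COUNT(*) AS payments_sent
FROM payments
WHERE status = 'Success'
GROUP BY sender_id
ORDER BY payments_sent DESC,
         sender_id            -- tiebreak: deterministic order at the cut
LIMIT 3;
```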




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the Square data engineering interview process?
&lt;/h3&gt;

&lt;p&gt;The Square data engineering interview opens with a 15-30 minute recruiter HR screen, followed by a 30-minute hiring-manager interview focused on past experience and data-quality philosophy, a 1-hour technical screen on CoderPad covering Python and SQL at medium complexity, and finally an onsite panel of multiple 30-45 minute rounds (data modeling, analytical thinking, cultural fit). The onsite panel requires an NDA. End-to-end, the loop runs three to four weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Square the same as Block? Which name should I search for?
&lt;/h3&gt;

&lt;p&gt;Yes — Square rebranded to &lt;strong&gt;Block Inc.&lt;/strong&gt; in December 2021. The legal entity is Block; the consumer brands (Square seller tools, Cash App, Tidal, Spiral) keep their original names. Search for both "Square data engineer interview" and "Block data engineer interview" — every external article you find under one name applies under the other. The DataLemur SQL guide is titled "Block SQL Interview Questions" and the IQ guide is titled "Square Data Engineer Interview Questions"; both refer to the same loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  What programming languages does Square test in data engineering interviews?
&lt;/h3&gt;

&lt;p&gt;Square tests &lt;strong&gt;SQL and Python&lt;/strong&gt; in the technical screen and onsite. SQL is the heavier surface — the live coding round in CoderPad is PostgreSQL-only, with patterns like &lt;code&gt;GROUP BY + COUNT + ORDER BY DESC + LIMIT&lt;/code&gt;, &lt;code&gt;INTERVAL '30 days'&lt;/code&gt; cohort math, &lt;code&gt;AVG OVER PARTITION BY&lt;/code&gt; window aggregates, and &lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt; over status-filtered payment data. Python is the lighter surface — typically 30 minutes of medium-difficulty data-manipulation questions (Pandas / dict / list comprehensions). Spark / PySpark show up in the onsite panel for senior roles.&lt;/p&gt;

&lt;h3&gt;
  
  
  How difficult are Square data engineering interview questions?
&lt;/h3&gt;

&lt;p&gt;The curated Square practice set on PipeCode is &lt;strong&gt;1 EASY + 1 MEDIUM&lt;/strong&gt;, no hard. The EASY is a SQL ranking + aggregation problem (Top 10 Invoice Senders); the MEDIUM is a SQL aggregation + date-functions + cohort-analysis problem (Users with High Activity After 30 Days Signup). The DataLemur Block list adds 10 more SQL questions ranging from definition-style (joins, constraints, clustered vs non-clustered indexes) to scenario-grade (window AVG OVER, ROW_NUMBER for duplicates, click-through rate via LEFT JOIN + COUNT DISTINCT). Stuttering on the EASY is a stronger negative signal than struggling with the MEDIUM.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the Square data engineer salary range?
&lt;/h3&gt;

&lt;p&gt;Square data engineer base salary ranges from $101K to $185K, with an average of &lt;strong&gt;$139,850&lt;/strong&gt; (median $140K) across 21 reported data points (per Interview Query). Total compensation data is sparse (only 2 reported points), so the total-comp average ($72K) is under-sampled and likely inaccurate; expect actual total comp to be substantially higher once equity refreshers and bonuses are factored in. Negotiation is best supported by competing offers and verified levels.fyi entries.&lt;/p&gt;

&lt;h3&gt;
  
  
  What tech stack does Square's data engineering team use?
&lt;/h3&gt;

&lt;p&gt;Square's data engineering stack includes &lt;strong&gt;Python&lt;/strong&gt; (heavy — Pandas, PySpark for large-scale transforms), &lt;strong&gt;SQL&lt;/strong&gt; (PostgreSQL on CoderPad for interviews; production warehouses include Snowflake-equivalent platforms), &lt;strong&gt;Airflow + dbt&lt;/strong&gt; for ETL/ELT orchestration, &lt;strong&gt;AWS or GCP&lt;/strong&gt; for cloud infrastructure, &lt;strong&gt;Apache Spark&lt;/strong&gt; for distributed data processing, the &lt;strong&gt;Medallion architecture&lt;/strong&gt; (Bronze / Silver / Gold layers) for data organization, and &lt;strong&gt;Great Expectations&lt;/strong&gt; for data-quality validation. The cultural emphasis is on data-quality-first pipelines, cross-functional collaboration with product / DS teams, and continuous learning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing Square data engineering problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Snowflake Data Engineering Interview Questions &amp; Prep Guide</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Sun, 03 May 2026 05:20:29 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/snowflake-data-engineering-interview-questions-prep-guide-10ko</link>
      <guid>https://dev.to/gowthampotureddi/snowflake-data-engineering-interview-questions-prep-guide-10ko</guid>
      <description>&lt;p&gt;&lt;strong&gt;Snowflake data engineering interview questions&lt;/strong&gt; split into two distinct loops that share a name. The Snowflake-the-company SWE / DE interview is &lt;strong&gt;LeetCode-style Python&lt;/strong&gt;: array iteration with set-based validation logic (the classic SET card-game rule, &lt;code&gt;len(set(values)) in {1, 3}&lt;/code&gt; per attribute) and hash-table sliding-window counters over strings (&lt;code&gt;Counter(substrings)&lt;/code&gt; driven by a rolling character-frequency dict). The Snowflake-as-tool data-engineering interview — for any DE role at any company that runs on Snowflake — is &lt;strong&gt;product-knowledge plus SQL&lt;/strong&gt;: three-layer architecture, micro-partitions, clustering keys, Time Travel and Fail-safe, Zero-Copy Cloning, Snowpipe ingestion, and the &lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt;/&lt;code&gt;AVG ... OVER (PARTITION BY ...)&lt;/code&gt; window-function primitives that drive consecutive-streak detection, day-over-day deltas, and monthly-aggregate analytics.&lt;/p&gt;

&lt;p&gt;This guide walks four topic clusters end-to-end, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, and an &lt;strong&gt;interview-style problem with a full solution&lt;/strong&gt; that explains why it works. The mix matches a curated 2-problem set (1 EASY Python array + 1 MEDIUM Python hash-table-sliding-window) plus two adjacent primitives — SQL window functions and Snowflake architecture — that every Snowflake-flavored interview rotates through. Whether the candidate is interviewing AT Snowflake or for a data-engineering role at a company that builds on Snowflake, the four primitives below cover the bar.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8izoz6fh52v9ynvjc1x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8izoz6fh52v9ynvjc1x.jpeg" alt="Snowflake data engineering interview questions cover image with bold headline, Python and SQL chips, faint code ghost, and pipecode.ai attribution." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top Snowflake data engineering interview topics
&lt;/h2&gt;

&lt;p&gt;From the &lt;a href="https://pipecode.ai/explore/practice/company/snowflake" rel="noopener noreferrer"&gt;Snowflake data engineering practice set&lt;/a&gt;, the &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; (one row per &lt;strong&gt;H2&lt;/strong&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic (sections &lt;strong&gt;1–4&lt;/strong&gt;)&lt;/th&gt;
&lt;th&gt;Why it shows up at Snowflake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Python arrays and set validation for the SET card game&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SET Card Game Validation (EASY) — &lt;code&gt;zip(*cards)&lt;/code&gt; per-attribute iteration plus &lt;code&gt;len(set(vals)) in {1, 3}&lt;/code&gt;, the all-same-or-all-different invariant that powers any "validate a row across N records" pipeline.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Python hash tables and sliding window for maximum substring occurrences&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum Substring Occurrences (MEDIUM) — rolling character-frequency dict over a length-&lt;code&gt;k&lt;/code&gt; window plus a &lt;code&gt;Counter&lt;/code&gt; keyed on the substring itself, the canonical pattern for "count distinct windows under a constraint."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL window functions (&lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, &lt;code&gt;AVG OVER&lt;/code&gt;) for Snowflake analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Marketing-touch streak detection — &lt;code&gt;LAG(DATE_TRUNC('week', event_date)) OVER (PARTITION BY contact_id ORDER BY ...)&lt;/code&gt; plus CTE composition, the SQL primitive every Snowflake-flavored interview rotates through.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Snowflake architecture: micro-partitions, clustering, and Time Travel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Three-layer architecture (Storage / Query Processing / Cloud Services), micro-partition pruning via min/max metadata, clustering keys, scale-up vs scale-out, Time Travel + Fail-safe + Zero-Copy Cloning — the product-knowledge primitive every Snowflake-as-tool interviewer probes.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
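&lt;p&gt;Row 2's rolling-window pattern can be sketched in a few lines. The function name and parameters below are illustrative, not the exact pad signature: &lt;code&gt;k&lt;/code&gt; is the window length and &lt;code&gt;max_distinct&lt;/code&gt; caps the distinct characters a qualifying window may contain:&lt;/p&gt;

```python
from collections import Counter

def max_substring_occurrences(s: str, k: int, max_distinct: int) -> int:
    # Rolling character-frequency dict over a fixed length-k window,
    # plus a Counter keyed on the qualifying substring itself.
    freq = Counter()   # characters inside the current window
    seen = Counter()   # occurrence count per qualifying substring
    for i, ch in enumerate(s):
        freq[ch] += 1
        if i >= k:                 # slide: evict the char leaving the window
            left = s[i - k]
            freq[left] -= 1
            if freq[left] == 0:
                del freq[left]     # keep len(freq) == distinct chars in window
        if i >= k - 1 and len(freq) <= max_distinct:
            seen[s[i - k + 1:i + 1]] += 1
    return max(seen.values()) if seen else 0

print(max_substring_occurrences("aababcaab", 3, 2))  # 2: "aab" appears twice
```

&lt;p&gt;The design choice to track only the fixed minimum window length is the trick interviewers probe: any longer qualifying substring contains a qualifying substring of the minimum length at least as often, so a single window size suffices.&lt;/p&gt;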

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dual-audience framing rule:&lt;/strong&gt; Snowflake interview prompts split into two categories. If the candidate is interviewing AT Snowflake the company, sections 1–2 (Python algorithms) carry the loop. If the candidate is interviewing for a data-engineering role at a Snowflake-using company, sections 3–4 (SQL window functions + Snowflake architecture) carry the loop. State which loop you're prepping for and rebalance the four primitives accordingly.&lt;/p&gt;
&lt;/blockquote&gt;
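&lt;p&gt;Row 3's &lt;code&gt;LAG&lt;/code&gt; primitive can be exercised locally before touching a warehouse. The sketch below runs the window function through Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (requires SQLite 3.25+), with &lt;code&gt;strftime('%Y-%W', ...)&lt;/code&gt; standing in for Snowflake's &lt;code&gt;DATE_TRUNC('week', ...)&lt;/code&gt;; the &lt;code&gt;events&lt;/code&gt; table and its rows are illustrative:&lt;/p&gt;

```python
import sqlite3

# Illustrative marketing-touch events. LAG(...) OVER (PARTITION BY ... ORDER BY ...)
# pulls the previous row's week bucket, so consecutive-week streaks become visible
# by comparing wk with prev_wk; a NULL prev_wk marks the start of each contact's run.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (contact_id INTEGER, event_date TEXT);
    INSERT INTO events VALUES
        (1, '2026-01-05'), (1, '2026-01-12'), (1, '2026-01-26'),
        (2, '2026-01-06');
""")
rows = con.execute("""
    SELECT contact_id,
           strftime('%Y-%W', event_date) AS wk,
           LAG(strftime('%Y-%W', event_date)) OVER (
               PARTITION BY contact_id ORDER BY event_date
           ) AS prev_wk
    FROM events
    ORDER BY contact_id, event_date
""").fetchall()
print(rows)
```

&lt;p&gt;In Snowflake you would follow this CTE with a gap flag (&lt;code&gt;wk &amp;lt;&amp;gt; prev-week + 1&lt;/code&gt;) and a running sum to assign streak IDs; the &lt;code&gt;LAG&lt;/code&gt; step above is the primitive the rest composes on.&lt;/p&gt;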




&lt;h2&gt;
  
  
  1. Python Arrays and Set Validation for the SET Card Game
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Array iteration with set-based validation in Python for data engineering
&lt;/h3&gt;

&lt;p&gt;"Given three SET cards (each a 4-tuple of attributes from {color, shape, number, shading}), validate that across all three cards every attribute is either &lt;strong&gt;all the same&lt;/strong&gt; or &lt;strong&gt;all different&lt;/strong&gt;" is Snowflake's signature EASY Python prompt (SET Card Game Validation). The mental model: &lt;strong&gt;&lt;code&gt;zip(*cards)&lt;/code&gt; produces one tuple per attribute index across all three cards; &lt;code&gt;set(values)&lt;/code&gt; deduplicates that tuple; the SET game rule is satisfied when every attribute's set has cardinality 1 (all same) or 3 (all different) — never 2&lt;/strong&gt;. Same primitive powers any "validate a row across N records" pipeline — schema-conformance checks, all-same-or-all-different label validation in ML datasets, "every replica reports the same status" health monitors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftee764srqzqnlydf3jls.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftee764srqzqnlydf3jls.jpeg" alt="Diagram showing three SET cards with four attributes each (color, shape, number, shading) and a per-attribute validation column where len(set(values)) is 1 (all same, green) or 3 (all different, green) but never 2 (invalid, red), producing a final True or False output." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; &lt;code&gt;len(set(values))&lt;/code&gt; collapses three separate equality checks into a single primitive. Avoid the temptation to write &lt;code&gt;(a == b == c) or (a != b and b != c and a != c)&lt;/code&gt; — the set-cardinality test is shorter, faster, and idiomatic. State the cardinality invariant out loud before writing code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Iterating tuples by attribute index: &lt;code&gt;zip(*cards)&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The transpose-via-zip invariant: &lt;strong&gt;&lt;code&gt;zip(*iterables)&lt;/code&gt; interleaves elements at the same index across all iterables; &lt;code&gt;zip(*cards)&lt;/code&gt; where &lt;code&gt;cards&lt;/code&gt; is a list of 3 tuples produces 4 tuples, one per attribute, each with 3 values&lt;/strong&gt;. This is the standard Python idiom for column-wise iteration over a row-major data structure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;zip(*cards)&lt;/code&gt;&lt;/strong&gt; — yields per-attribute tuples; &lt;code&gt;*&lt;/code&gt; unpacks the outer list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;list(zip(*cards))&lt;/code&gt;&lt;/strong&gt; — materialize into a list if you need to iterate twice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Equivalent loop&lt;/strong&gt; — &lt;code&gt;for i in range(len(cards[0])): values = tuple(c[i] for c in cards)&lt;/code&gt; — verbose but explicit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same pattern for matrix transpose&lt;/strong&gt; — &lt;code&gt;zip(*matrix)&lt;/code&gt; is the one-line transpose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three cards as 4-tuples; transpose to 4 attribute tuples.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;input&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[('R', 'oval', 1, 'solid'), ('G', 'oval', 2, 'solid'), ('P', 'oval', 3, 'solid')]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;zip(*cards)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[('R','G','P'), ('oval','oval','oval'), (1,2,3), ('solid','solid','solid')]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;R&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oval&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;solid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;G&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oval&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;solid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;P&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oval&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;solid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;cards&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# [('R','G','P'), ('oval','oval','oval'), (1,2,3), ('solid','solid','solid')]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;zip(*rows)&lt;/code&gt; is the one-line Python transpose; reach for it whenever the question asks about "across all rows" properties.&lt;/p&gt;

&lt;h4&gt;
  
  
  All-same-or-all-different rule via &lt;code&gt;len(set(values)) in {1, 3}&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The cardinality invariant: &lt;strong&gt;&lt;code&gt;set(values)&lt;/code&gt; deduplicates a sequence; &lt;code&gt;len(set(values))&lt;/code&gt; returns the number of distinct elements; for a 3-element input, that count is 1 (all same), 2 (mixed — invalid), or 3 (all different)&lt;/strong&gt;. The SET game rule accepts cardinality 1 or 3 and rejects 2.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;len(set(vals)) == 1&lt;/code&gt;&lt;/strong&gt; — all values equal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;len(set(vals)) == 3&lt;/code&gt;&lt;/strong&gt; — all values pairwise distinct.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;len(set(vals)) == 2&lt;/code&gt;&lt;/strong&gt; — invalid mixed pattern (rejected by the rule).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;len(set(vals)) in {1, 3}&lt;/code&gt;&lt;/strong&gt; — combined acceptance test in one expression.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three attribute tuples; one valid (all same), one valid (all different), one invalid (mixed).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;values&lt;/th&gt;
&lt;th&gt;set(values)&lt;/th&gt;
&lt;th&gt;len&lt;/th&gt;
&lt;th&gt;rule&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;('oval','oval','oval')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'oval'}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;all same ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;('R','G','P')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'R','G','P'}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;all different ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;('R','R','G')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'R','G'}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;invalid ✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;attribute_ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;attribute_ok&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oval&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oval&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oval&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# True
&lt;/span&gt;&lt;span class="nf"&gt;attribute_ok&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;R&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;G&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;P&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;           &lt;span class="c1"&gt;# True
&lt;/span&gt;&lt;span class="nf"&gt;attribute_ok&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;R&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;R&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;G&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;           &lt;span class="c1"&gt;# False
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the cardinality test (&lt;code&gt;len(set(...)) in {1, 3}&lt;/code&gt;) is the one-line replacement for any "all equal OR all distinct" check; never expand it into a conjunction of pairwise comparisons.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;all()&lt;/code&gt; composition over per-attribute checks
&lt;/h4&gt;

&lt;p&gt;The composition invariant: &lt;strong&gt;&lt;code&gt;all(predicate(x) for x in iterable)&lt;/code&gt; short-circuits on the first &lt;code&gt;False&lt;/code&gt; and returns &lt;code&gt;True&lt;/code&gt; only if every element passes&lt;/strong&gt;. Combined with &lt;code&gt;zip(*cards)&lt;/code&gt;, the entire SET validation collapses to one expression: &lt;code&gt;all(len(set(v)) in {1, 3} for v in zip(*cards))&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;all(...)&lt;/code&gt;&lt;/strong&gt; — returns &lt;code&gt;True&lt;/code&gt; if every element is truthy; short-circuits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;any(...)&lt;/code&gt;&lt;/strong&gt; — mirror image; returns &lt;code&gt;True&lt;/code&gt; if any element is truthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generator expression&lt;/strong&gt; — &lt;code&gt;(predicate(x) for x in iter)&lt;/code&gt; is lazy; pairs perfectly with &lt;code&gt;all&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid &lt;code&gt;for&lt;/code&gt; + flag&lt;/strong&gt; — &lt;code&gt;ok = True; for ...: if not check: ok = False; return ok&lt;/code&gt; is verbose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Compose &lt;code&gt;attribute_ok&lt;/code&gt; over the four attribute tuples of a valid SET.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;attribute&lt;/th&gt;
&lt;th&gt;values&lt;/th&gt;
&lt;th&gt;ok&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;color&lt;/td&gt;
&lt;td&gt;&lt;code&gt;('R','G','P')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shape&lt;/td&gt;
&lt;td&gt;&lt;code&gt;('oval','oval','oval')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;number&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(1,2,3)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shading&lt;/td&gt;
&lt;td&gt;&lt;code&gt;('solid','solid','solid')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_valid_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cards&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;cards&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;cards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;R&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oval&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;solid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;G&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oval&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;solid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;P&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oval&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;solid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="nf"&gt;is_valid_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cards&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# True
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the entire problem is a one-liner once you compose &lt;code&gt;all&lt;/code&gt; + &lt;code&gt;set&lt;/code&gt; + &lt;code&gt;zip(*cards)&lt;/code&gt;; if your code needs more than three lines, you're missing a primitive.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Writing &lt;code&gt;(a == b == c) or (a != b and b != c and a != c)&lt;/code&gt; instead of &lt;code&gt;len(set(values)) in {1, 3}&lt;/code&gt; — slow, verbose, and easy to mistype.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;cards[0][i], cards[1][i], cards[2][i]&lt;/code&gt; indexing instead of &lt;code&gt;zip(*cards)&lt;/code&gt; — works but signals you don't know the idiom.&lt;/li&gt;
&lt;li&gt;Treating cardinality 2 as valid — SET's whole rule is "no 2-distinct attributes allowed"; missing this is graded as not understanding the prompt.&lt;/li&gt;
&lt;li&gt;Hardcoding 3 cards / 4 attributes instead of letting &lt;code&gt;zip&lt;/code&gt; and &lt;code&gt;all&lt;/code&gt; derive the dimensions — the same code should work for any N-card / M-attribute generalization.&lt;/li&gt;
&lt;li&gt;Returning &lt;code&gt;1&lt;/code&gt; or &lt;code&gt;0&lt;/code&gt; instead of &lt;code&gt;True&lt;/code&gt; / &lt;code&gt;False&lt;/code&gt; — the contract is a boolean.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Python Interview Question on SET Card Game Validation
&lt;/h3&gt;

&lt;p&gt;Given three SET cards as 4-tuples of attributes from &lt;code&gt;{color, shape, number, shading}&lt;/code&gt;, return &lt;code&gt;True&lt;/code&gt; if every attribute is either all-same or all-different across the three cards, otherwise &lt;code&gt;False&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_valid_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cards&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# your code here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;all&lt;/code&gt;, &lt;code&gt;set&lt;/code&gt;, and &lt;code&gt;zip(*cards)&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_valid_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cards&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;cards&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;zip(*cards)&lt;/code&gt; transposes the 3×4 row-major card list into 4 attribute tuples of length 3; &lt;code&gt;len(set(values))&lt;/code&gt; counts distinct values per attribute and is 1 (all same) or 3 (all different) iff the SET rule is satisfied; &lt;code&gt;len(...) in {1, 3}&lt;/code&gt; accepts both valid cardinalities and rejects the 2-distinct case; &lt;code&gt;all(...)&lt;/code&gt; short-circuits to &lt;code&gt;False&lt;/code&gt; on the first invalid attribute, returning &lt;code&gt;True&lt;/code&gt; only when every attribute passes. The whole solution is one line, branch-free, and &lt;code&gt;O(N · M)&lt;/code&gt; for N cards × M attributes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for &lt;code&gt;cards = [('R','oval',1,'solid'), ('G','oval',2,'solid'), ('P','oval',3,'solid')]&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transpose&lt;/strong&gt; — &lt;code&gt;zip(*cards)&lt;/code&gt; → &lt;code&gt;[('R','G','P'), ('oval','oval','oval'), (1,2,3), ('solid','solid','solid')]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-attribute &lt;code&gt;set&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;{'R','G','P'}&lt;/code&gt; (len 3), &lt;code&gt;{'oval'}&lt;/code&gt; (len 1), &lt;code&gt;{1,2,3}&lt;/code&gt; (len 3), &lt;code&gt;{'solid'}&lt;/code&gt; (len 1).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cardinality check&lt;/strong&gt; — 3 ∈ {1, 3} ✓; 1 ∈ {1, 3} ✓; 3 ∈ {1, 3} ✓; 1 ∈ {1, 3} ✓.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;all&lt;/code&gt;&lt;/strong&gt; — every check &lt;code&gt;True&lt;/code&gt; → returns &lt;code&gt;True&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
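The trace above can be reproduced directly. The sample cards below are illustrative stand-ins, not inputs from the original problem statement:

```python
def is_valid_set(cards):
    # Each attribute column must be all-same (1 distinct) or all-different (3 distinct).
    return all(len(set(values)) in {1, 3} for values in zip(*cards))

valid = [('R', 'oval', 1, 'solid'), ('G', 'oval', 2, 'solid'), ('P', 'oval', 3, 'solid')]
invalid = [('R', 'oval', 1, 'solid'), ('R', 'oval', 2, 'solid'), ('P', 'oval', 3, 'solid')]
print(is_valid_set(valid))    # True
print(is_valid_set(invalid))  # False: the color column has exactly 2 distinct values
```

The second triple fails only on color (`R, R, P`), which is enough for `all(...)` to short-circuit to `False`.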

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;attribute&lt;/th&gt;
&lt;th&gt;values&lt;/th&gt;
&lt;th&gt;len(set)&lt;/th&gt;
&lt;th&gt;in {1, 3}&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;color&lt;/td&gt;
&lt;td&gt;&lt;code&gt;R, G, P&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shape&lt;/td&gt;
&lt;td&gt;&lt;code&gt;oval, oval, oval&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;number&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1, 2, 3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shading&lt;/td&gt;
&lt;td&gt;&lt;code&gt;solid, solid, solid&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;result&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;True&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;zip(*cards)&lt;/code&gt; transpose&lt;/strong&gt; — turns row-major card data into column-major attribute tuples; the standard Python one-liner for "iterate by attribute across rows."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set(values)&lt;/code&gt; dedupe&lt;/strong&gt; — collapses duplicates so &lt;code&gt;len(...)&lt;/code&gt; returns the distinct count in O(N) hash-set inserts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;len(...) in {1, 3}&lt;/code&gt; cardinality test&lt;/strong&gt; — the SET game rule expressed as a single set-membership check; rejects the 2-distinct case implicitly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;all(...)&lt;/code&gt; short-circuit&lt;/strong&gt; — returns &lt;code&gt;False&lt;/code&gt; on the first invalid attribute without scanning the rest; pairs naturally with the generator expression.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;one-line composition&lt;/strong&gt; — the entire solution is a single expression, branch-free, with no temporary state or accumulator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N · M)&lt;/code&gt; time / &lt;code&gt;O(N)&lt;/code&gt; auxiliary space&lt;/strong&gt; — &lt;code&gt;N&lt;/code&gt; cards × &lt;code&gt;M&lt;/code&gt; attributes; the generator materializes one per-attribute set at a time, each holding at most &lt;code&gt;N&lt;/code&gt; items, so auxiliary space is &lt;code&gt;O(N)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/company/snowflake/python" rel="noopener noreferrer"&gt;Snowflake Python practice page&lt;/a&gt; for the curated array problem and the &lt;a href="https://pipecode.ai/explore/practice/topic/array" rel="noopener noreferrer"&gt;array practice page&lt;/a&gt; for breadth.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. Python Hash Tables and Sliding Window for Maximum Substring Occurrences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hash-table sliding-window for substring frequency in Python for data engineering
&lt;/h3&gt;

&lt;p&gt;"Given a string &lt;code&gt;s&lt;/code&gt; and integers &lt;code&gt;k&lt;/code&gt; (substring length) and &lt;code&gt;maxLetters&lt;/code&gt; (max distinct chars allowed), return the maximum number of times any length-&lt;code&gt;k&lt;/code&gt; substring with at most &lt;code&gt;maxLetters&lt;/code&gt; distinct characters appears in &lt;code&gt;s&lt;/code&gt;" is Snowflake's signature MEDIUM Python prompt (Maximum Substring Occurrences). The mental model: &lt;strong&gt;a length-&lt;code&gt;k&lt;/code&gt; window slides across &lt;code&gt;s&lt;/code&gt;; a &lt;code&gt;freq&lt;/code&gt; dict tracks per-character counts inside the window in &lt;code&gt;O(1)&lt;/code&gt; per shift; a &lt;code&gt;Counter&lt;/code&gt; tracks how many times each valid window-substring has been seen; the answer is the max value in that counter&lt;/strong&gt;. Same primitive powers any "find the most frequent fixed-length pattern under a constraint" pipeline — find the most-repeated DNA k-mer with at most M distinct nucleotides, the most-repeated user-event sequence under a uniqueness cap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rvwocg6jia4aebp340u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rvwocg6jia4aebp340u.jpeg" alt="Diagram showing a sliding length-3 window scanning the string 'abcabcab' with a per-window character-frequency dict and a final Counter that tallies each candidate substring, highlighting the maximum-occurrence substring." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Maintain the &lt;code&gt;freq&lt;/code&gt; dict incrementally — increment for the entering character, decrement for the leaving character, and delete keys whose count hits 0 so &lt;code&gt;len(freq)&lt;/code&gt; always equals the &lt;strong&gt;distinct&lt;/strong&gt; character count in the window. Rebuilding &lt;code&gt;freq&lt;/code&gt; from scratch each window is &lt;code&gt;O(k)&lt;/code&gt; per shift and turns the algorithm &lt;code&gt;O(n · k)&lt;/code&gt;; the incremental update is &lt;code&gt;O(1)&lt;/code&gt; per shift and the total runtime is &lt;code&gt;O(n)&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Char-frequency dict for the window: increment / decrement per character
&lt;/h4&gt;

&lt;p&gt;The window-state invariant: &lt;strong&gt;the &lt;code&gt;freq&lt;/code&gt; dict at any time reflects the multiset of characters in the current length-&lt;code&gt;k&lt;/code&gt; window; updating from window &lt;code&gt;[i..i+k-1]&lt;/code&gt; to &lt;code&gt;[i+1..i+k]&lt;/code&gt; requires one increment (for &lt;code&gt;s[i+k]&lt;/code&gt;) and one decrement (for &lt;code&gt;s[i]&lt;/code&gt;)&lt;/strong&gt;. Clean up the zero-count key so &lt;code&gt;len(freq)&lt;/code&gt; always equals the distinct character count.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increment&lt;/strong&gt; — &lt;code&gt;freq[c] = freq.get(c, 0) + 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decrement&lt;/strong&gt; — &lt;code&gt;freq[c] -= 1&lt;/code&gt;; then &lt;code&gt;if freq[c] == 0: del freq[c]&lt;/code&gt; (the cleanup step).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distinct-char count&lt;/strong&gt; — &lt;code&gt;len(freq)&lt;/code&gt; (with cleanup) gives the number of distinct chars in the window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;defaultdict(int)&lt;/code&gt;&lt;/strong&gt; — alternative; same outcome, slightly cleaner increment but you still need the zero-cleanup branch.&lt;/li&gt;
&lt;/ul&gt;
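The two update primitives above can be sketched as a pair of helpers (the helper names are illustrative, not part of the prompt):

```python
freq = {}

def add_char(c):
    # Entering character: O(1) increment, no membership check needed.
    freq[c] = freq.get(c, 0) + 1

def remove_char(c):
    # Leaving character: O(1) decrement plus zero-cleanup.
    freq[c] -= 1
    if freq[c] == 0:
        del freq[c]  # keeps len(freq) equal to the distinct-char count

for c in 'aab':
    add_char(c)
remove_char('a')
print(freq, len(freq))  # {'a': 1, 'b': 1} 2
```

Without the `del`, removing the last `'a'` would leave `{'a': 0, ...}` behind and `len(freq)` would overcount.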

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Slide a length-3 window over &lt;code&gt;'abca'&lt;/code&gt;; show &lt;code&gt;freq&lt;/code&gt; per position.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;window&lt;/th&gt;
&lt;th&gt;freq&lt;/th&gt;
&lt;th&gt;distinct&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;abc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{a:1, b:1, c:1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bca&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{b:1, c:1, a:1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abca&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="c1"&gt;# freq == {'a': 1, 'b': 1, 'c': 1}; len(freq) == 3
&lt;/span&gt;
&lt;span class="c1"&gt;# slide: out s[0]='a', in s[3]='a'
&lt;/span&gt;&lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="c1"&gt;# freq == {'b': 1, 'c': 1, 'a': 1}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always pair the decrement with the zero-cleanup; otherwise &lt;code&gt;len(freq)&lt;/code&gt; overcounts distinct characters and the constraint check breaks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Distinct-char count via &lt;code&gt;len(freq)&lt;/code&gt; and zero-cleanup
&lt;/h4&gt;

&lt;p&gt;The distinct-count invariant: &lt;strong&gt;after every increment / decrement + cleanup, &lt;code&gt;len(freq)&lt;/code&gt; equals the number of distinct characters currently in the window&lt;/strong&gt;. This is the per-window check against &lt;code&gt;maxLetters&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Valid window&lt;/strong&gt; — &lt;code&gt;len(freq) &amp;lt;= maxLetters&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleanup is non-negotiable&lt;/strong&gt; — without &lt;code&gt;if freq[c] == 0: del freq[c]&lt;/code&gt;, a key with count 0 still inflates &lt;code&gt;len(freq)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;set()&lt;/code&gt; rebuild&lt;/strong&gt; — &lt;code&gt;len(set(window))&lt;/code&gt; is &lt;code&gt;O(k)&lt;/code&gt; per check; the incremental dict gives &lt;code&gt;O(1)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot only on demand&lt;/strong&gt; — record the substring in the answer counter only when the validity check passes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Validate windows of &lt;code&gt;'abcabcab'&lt;/code&gt; with &lt;code&gt;maxLetters = 2&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;window&lt;/th&gt;
&lt;th&gt;distinct&lt;/th&gt;
&lt;th&gt;valid?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;abc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bca&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxLetters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abcabcab&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="c1"&gt;# every length-3 window has 3 distinct chars → no valid window → answer = 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the distinct-char check is one comparison (&lt;code&gt;len(freq) &amp;lt;= maxLetters&lt;/code&gt;); never recompute distinct count from scratch.&lt;/p&gt;
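A deliberately brute-force checker confirms the table: it rebuilds the distinct set per window (`O(n · k)`, exactly what the bullets warn against in the real solution), which is fine for verifying small examples:

```python
def count_valid_windows(s, k, max_letters):
    # Brute force on purpose: len(set(...)) per window is O(k) each,
    # O(n * k) total -- use only to sanity-check the incremental version.
    return sum(
        1
        for i in range(len(s) - k + 1)
        if len(set(s[i:i + k])) <= max_letters
    )

print(count_valid_windows('abcabcab', 3, 2))   # 0: every window has 3 distinct chars
print(count_valid_windows('aababcaab', 3, 2))  # 5: aab, aba, bab, caa, aab
```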

&lt;h4&gt;
  
  
  Substring counter for the answer: &lt;code&gt;Counter&lt;/code&gt; or &lt;code&gt;dict.get(s, 0) + 1&lt;/code&gt; over candidates
&lt;/h4&gt;

&lt;p&gt;The answer-aggregation invariant: &lt;strong&gt;a separate &lt;code&gt;Counter&lt;/code&gt; (or vanilla dict) keyed on the substring tracks how many times each valid window-substring has been seen; the final answer is the maximum value in that counter&lt;/strong&gt;. The constraint filter happens before the counter increment, so only valid substrings ever land in the answer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;from collections import Counter&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;answer[sub] += 1&lt;/code&gt; works without a membership check because missing keys default to &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vanilla dict&lt;/strong&gt; — &lt;code&gt;answer[sub] = answer.get(sub, 0) + 1&lt;/code&gt;; same outcome, no import.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final answer&lt;/strong&gt; — &lt;code&gt;max(answer.values(), default=0)&lt;/code&gt;; the &lt;code&gt;default=0&lt;/code&gt; handles the empty-counter case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Substring slice&lt;/strong&gt; — &lt;code&gt;s[i:i+k]&lt;/code&gt; is &lt;code&gt;O(k)&lt;/code&gt; to create; one slice per valid window is the cost.&lt;/li&gt;
&lt;/ul&gt;
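Both aggregation styles from the bullets produce identical tallies; a minimal side-by-side (the `subs` list is an illustrative sample, not problem input):

```python
from collections import Counter

subs = ['aab', 'aba', 'aab']  # valid window substrings, in scan order

answer = Counter()
for sub in subs:
    answer[sub] += 1  # Counter: missing keys default to 0

plain = {}
for sub in subs:
    plain[sub] = plain.get(sub, 0) + 1  # vanilla dict: same outcome, no import

print(max(answer.values(), default=0))  # 2
print(dict(answer) == plain)            # True
```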

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Count valid windows for &lt;code&gt;'aababcaab'&lt;/code&gt; with &lt;code&gt;k=3, maxLetters=2&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;window&lt;/th&gt;
&lt;th&gt;distinct&lt;/th&gt;
&lt;th&gt;counter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aba&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 1, aba: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 1, aba: 1, bab: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;abc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;(skipped)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bca&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;(skipped)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;caa&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 1, aba: 1, bab: 1, caa: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 2, aba: 1, bab: 1, caa: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxLetters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aababcaab&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# ... incremental loop ...
# answer == {'aab': 2, 'aba': 1, 'bab': 1, 'caa': 1}
# max(answer.values()) == 2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; keep two state structures — one for the rolling window (&lt;code&gt;freq&lt;/code&gt;), one for the answer (&lt;code&gt;Counter&lt;/code&gt;); never conflate them.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Rebuilding &lt;code&gt;freq&lt;/code&gt; from &lt;code&gt;s[i:i+k]&lt;/code&gt; each iteration — &lt;code&gt;O(n · k)&lt;/code&gt; instead of &lt;code&gt;O(n)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Forgetting the zero-cleanup — &lt;code&gt;len(freq)&lt;/code&gt; overcounts distinct chars and the validity check rejects valid windows.&lt;/li&gt;
&lt;li&gt;Counting every window (not just valid ones) — pollutes the answer counter with substrings that violate &lt;code&gt;maxLetters&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writing an explicit &lt;code&gt;if sub in answer: answer[sub] += 1&lt;/code&gt; / &lt;code&gt;else: answer[sub] = 1&lt;/code&gt; branch — works but reads worse than &lt;code&gt;Counter&lt;/code&gt; or &lt;code&gt;answer.get(sub, 0) + 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Returning the first valid substring's count instead of &lt;code&gt;max(answer.values())&lt;/code&gt; — misreads the prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Python Interview Question on Maximum Substring Occurrences
&lt;/h3&gt;

&lt;p&gt;Given a string &lt;code&gt;s&lt;/code&gt; and integers &lt;code&gt;k&lt;/code&gt; and &lt;code&gt;maxLetters&lt;/code&gt;, return the &lt;strong&gt;maximum number of occurrences&lt;/strong&gt; of any length-&lt;code&gt;k&lt;/code&gt; substring of &lt;code&gt;s&lt;/code&gt; whose distinct-character count is at most &lt;code&gt;maxLetters&lt;/code&gt;. If no such substring exists, return &lt;code&gt;0&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;maxFreq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxLetters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# your code here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Solution Using rolling &lt;code&gt;freq&lt;/code&gt; dict + &lt;code&gt;Counter&lt;/code&gt; over valid substrings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;maxFreq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxLetters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# initialize the first window
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;maxLetters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="c1"&gt;# slide
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;out_c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;out_c&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;out_c&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;out_c&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;in_c&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;maxLetters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;freq&lt;/code&gt; dict is updated incrementally — one increment (entering char) and one decrement-plus-cleanup (leaving char) per shift, total &lt;code&gt;O(1)&lt;/code&gt; per window — so the rolling distinct-character count &lt;code&gt;len(freq)&lt;/code&gt; is always correct in constant time; &lt;code&gt;len(freq) &amp;lt;= maxLetters&lt;/code&gt; filters out invalid windows before they touch the answer counter; valid window substrings are tallied in a &lt;code&gt;Counter&lt;/code&gt;; the final &lt;code&gt;max(answer.values(), default=0)&lt;/code&gt; returns the most frequent valid substring's count, defaulting to &lt;code&gt;0&lt;/code&gt; if nothing qualified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for &lt;code&gt;s = 'aababcaab', k = 3, maxLetters = 2&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;i&lt;/th&gt;
&lt;th&gt;window&lt;/th&gt;
&lt;th&gt;freq (after slide)&lt;/th&gt;
&lt;th&gt;distinct&lt;/th&gt;
&lt;th&gt;valid?&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{a:2, b:1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aba&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{a:2, b:1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 1, aba: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{a:1, b:2}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 1, aba: 1, bab: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;abc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{a:1, b:1, c:1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;(unchanged)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bca&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{a:1, b:1, c:1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;(unchanged)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;caa&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{a:2, c:1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 1, aba: 1, bab: 1, caa: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{a:2, b:1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{aab: 2, aba: 1, bab: 1, caa: 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Final: &lt;code&gt;max(answer.values()) == 2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;substring&lt;/th&gt;
&lt;th&gt;occurrences&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;aab&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;aba&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bab&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;caa&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;answer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rolling &lt;code&gt;freq&lt;/code&gt; dict&lt;/strong&gt; — the per-window character-frequency map is updated in &lt;code&gt;O(1)&lt;/code&gt; per shift via one increment and one decrement-plus-cleanup; the total scan is &lt;code&gt;O(n)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-cleanup (&lt;code&gt;if freq[c] == 0: del freq[c]&lt;/code&gt;)&lt;/strong&gt; — keeps &lt;code&gt;len(freq)&lt;/code&gt; equal to the &lt;strong&gt;distinct&lt;/strong&gt; character count; without it the constraint check would reject valid windows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;len(freq) &amp;lt;= maxLetters&lt;/code&gt; validity filter&lt;/strong&gt; — a single comparison; only valid windows enter the answer counter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Counter&lt;/code&gt; over substrings&lt;/strong&gt; — tallies how many times each valid substring has been seen; &lt;code&gt;Counter&lt;/code&gt; defaults missing keys to &lt;code&gt;0&lt;/code&gt;, so no explicit key check is needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;max(answer.values(), default=0)&lt;/code&gt;&lt;/strong&gt; — final aggregation; &lt;code&gt;default=0&lt;/code&gt; handles the case where no window qualified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(n · k)&lt;/code&gt; time / &lt;code&gt;O(n)&lt;/code&gt; space&lt;/strong&gt; — the loop is &lt;code&gt;O(n)&lt;/code&gt; shifts, each with an &lt;code&gt;O(k)&lt;/code&gt; substring slice for the answer key; the answer counter holds at most &lt;code&gt;O(n - k + 1)&lt;/code&gt; distinct substrings.&lt;/li&gt;
&lt;/ul&gt;
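A quick way to sanity-check the traced result is a brute-force counter over every fixed-size window. This is a hypothetical cross-check helper, not the rolling-dict solution above; it rebuilds the distinct-character set per window in O(n·k) and must agree with the sliding-window answer.

```python
from collections import Counter

def max_freq_brute(s: str, max_letters: int, k: int) -> int:
    # Tally every k-length window whose distinct-character count is
    # within max_letters, then return the highest tally (0 if none).
    answer = Counter(
        s[i:i + k]
        for i in range(len(s) - k + 1)
        if len(set(s[i:i + k])) <= max_letters
    )
    return max(answer.values(), default=0)

print(max_freq_brute('aababcaab', 2, 3))  # 2 — matches the trace above
```
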

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/sliding-window" rel="noopener noreferrer"&gt;sliding-window problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/hash-table" rel="noopener noreferrer"&gt;hash-table problems&lt;/a&gt; on PipeCode.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Snowflake (Python)&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Snowflake Python practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/snowflake/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — hash table&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Python hash-table problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/hash-table" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — sliding window&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Python sliding-window problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/sliding-window" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  3. SQL Window Functions (&lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, &lt;code&gt;AVG OVER&lt;/code&gt;) for Snowflake Analytics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Window functions and CTE composition in SQL for Snowflake data engineering
&lt;/h3&gt;

&lt;p&gt;"Find every contact who had a marketing touch in three or more consecutive weeks AND at least one &lt;code&gt;trial_request&lt;/code&gt;" is the canonical Snowflake-flavored SQL prompt — the DataLemur-staple marketing-touch streak question that surfaces in nearly every Snowflake-as-tool DE interview. The mental model: &lt;strong&gt;&lt;code&gt;DATE_TRUNC('week', event_date)&lt;/code&gt; snaps event timestamps to the week boundary; &lt;code&gt;LAG(week_trunc) OVER (PARTITION BY contact_id ORDER BY week_trunc)&lt;/code&gt; reaches one row back inside the per-contact partition; the streak condition is &lt;code&gt;lag_week = current_week - INTERVAL '1 week'&lt;/code&gt;&lt;/strong&gt;. Same primitive powers any "consecutive-period" or "day-over-day" analytic — consecutive-month subscription renewals, hour-over-hour traffic deltas, week-over-week feature-adoption ramps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3wbgdtx79woabcijvn6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3wbgdtx79woabcijvn6.jpeg" alt="Diagram showing a marketing_touches table with four rows for contact_id 1 across consecutive weeks, with arrows depicting LAG(week_trunc) OVER (PARTITION BY contact_id ORDER BY week_trunc) reaching one row back, a streak indicator highlighting three consecutive weeks, and a green output card listing the matched email." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Always ground your window function with both &lt;code&gt;PARTITION BY&lt;/code&gt; and &lt;code&gt;ORDER BY&lt;/code&gt;. Forgetting &lt;code&gt;PARTITION BY contact_id&lt;/code&gt; makes &lt;code&gt;LAG&lt;/code&gt; reach into the previous contact's events and produces meaningless deltas; forgetting &lt;code&gt;ORDER BY week_trunc&lt;/code&gt; leaves the row order non-deterministic and the answer flaky. State both clauses out loud before writing the SELECT.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Window basics: &lt;code&gt;PARTITION BY group + ORDER BY ordering&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The window-function invariant: &lt;strong&gt;&lt;code&gt;OVER (PARTITION BY &amp;lt;group&amp;gt; ORDER BY &amp;lt;ordering&amp;gt;)&lt;/code&gt; declares an independent ordered subset for every value of the group expression; the window function (&lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;ROW_NUMBER&lt;/code&gt;, …) runs inside that subset only&lt;/strong&gt;. &lt;code&gt;PARTITION BY&lt;/code&gt; is the equivalent of &lt;code&gt;GROUP BY&lt;/code&gt; for windows; &lt;code&gt;ORDER BY&lt;/code&gt; fixes the row order so offset functions (&lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt;) and ranking functions (&lt;code&gt;ROW_NUMBER&lt;/code&gt;/&lt;code&gt;RANK&lt;/code&gt;) are deterministic.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY&lt;/code&gt; is optional&lt;/strong&gt; — omitted, the whole table is one window; useful for global running totals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY&lt;/code&gt; is required for offset / ranking functions&lt;/strong&gt; — &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt; all need an explicit order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AVG(...) OVER (...)&lt;/code&gt;&lt;/strong&gt; — group-aggregate-on-row; returns the average across the partition for each row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frames&lt;/strong&gt; — &lt;code&gt;ROWS BETWEEN ... AND ...&lt;/code&gt; further bounds the visible rows for aggregates; offset functions like &lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt; ignore the frame, so the default is fine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two events for contact_id 1; &lt;code&gt;LAG(week_trunc)&lt;/code&gt; reaches one row back inside the partition.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event_id&lt;/th&gt;
&lt;th&gt;contact_id&lt;/th&gt;
&lt;th&gt;week_trunc&lt;/th&gt;
&lt;th&gt;LAG(week_trunc)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2022-04-11&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2022-04-18&lt;/td&gt;
&lt;td&gt;2022-04-11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;week_trunc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;week_trunc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;week_trunc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_week&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events_with_week&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always set &lt;code&gt;PARTITION BY&lt;/code&gt; to the entity that owns the time series (&lt;code&gt;contact_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;account_id&lt;/code&gt;); without it, day-1 of one entity leaks into day-N of another.&lt;/p&gt;
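The worked example can be replayed locally without a Postgres or Snowflake instance: assuming the SQLite build bundled with your Python supports window functions (SQLite 3.25+ does), the same `LAG ... OVER (PARTITION BY ... ORDER BY ...)` syntax runs as-is on the hypothetical two-row sample above.

```python
import sqlite3

# Rebuild the two-row worked example in an in-memory database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE events_with_week (event_id INT, contact_id INT, week_trunc TEXT)')
conn.executemany('INSERT INTO events_with_week VALUES (?, ?, ?)',
                 [(1, 1, '2022-04-11'), (2, 1, '2022-04-18')])

rows = conn.execute("""
    SELECT event_id, contact_id, week_trunc,
           LAG(week_trunc) OVER (PARTITION BY contact_id ORDER BY week_trunc) AS lag_week
    FROM events_with_week
    ORDER BY event_id
""").fetchall()
print(rows)
# [(1, 1, '2022-04-11', None), (2, 1, '2022-04-18', '2022-04-11')]
```

The first row of the partition has no prior row, so `LAG` yields `NULL` (Python `None`) — exactly the table above.
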

&lt;h4&gt;
  
  
  &lt;code&gt;LAG&lt;/code&gt; vs &lt;code&gt;LEAD&lt;/code&gt; vs &lt;code&gt;AVG OVER&lt;/code&gt;: row-N-before vs row-N-after vs group-aggregate-on-row
&lt;/h4&gt;

&lt;p&gt;The window-function-family invariant: &lt;strong&gt;&lt;code&gt;LAG(expr, n)&lt;/code&gt; returns &lt;code&gt;expr&lt;/code&gt; from &lt;code&gt;n&lt;/code&gt; rows before; &lt;code&gt;LEAD(expr, n)&lt;/code&gt; from &lt;code&gt;n&lt;/code&gt; rows after; &lt;code&gt;AVG(expr) OVER (...)&lt;/code&gt; returns the average of &lt;code&gt;expr&lt;/code&gt; across the entire partition when the &lt;code&gt;OVER&lt;/code&gt; clause has no &lt;code&gt;ORDER BY&lt;/code&gt; (adding one makes it a running average via the default frame)&lt;/strong&gt;. They share the &lt;code&gt;OVER&lt;/code&gt; syntax but answer different questions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(week_trunc)&lt;/code&gt;&lt;/strong&gt; — previous row's &lt;code&gt;week_trunc&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEAD(week_trunc)&lt;/code&gt;&lt;/strong&gt; — next row's &lt;code&gt;week_trunc&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))&lt;/code&gt;&lt;/strong&gt; — monthly average per product (DataLemur Q2 pattern).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt;&lt;/strong&gt; — sequential rank inside each partition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three rows for contact_id 1; both &lt;code&gt;LAG&lt;/code&gt; and &lt;code&gt;LEAD&lt;/code&gt; annotated.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event_id&lt;/th&gt;
&lt;th&gt;week_trunc&lt;/th&gt;
&lt;th&gt;LAG&lt;/th&gt;
&lt;th&gt;LEAD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2022-04-11&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;2022-04-18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2022-04-18&lt;/td&gt;
&lt;td&gt;2022-04-11&lt;/td&gt;
&lt;td&gt;2022-04-25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2022-04-25&lt;/td&gt;
&lt;td&gt;2022-04-18&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;week_trunc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;week_trunc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;week_trunc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;week_trunc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;week_trunc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lead_week&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events_with_week&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "delta vs prior" → &lt;code&gt;LAG&lt;/code&gt;; "delta vs next" or "next-event distance" → &lt;code&gt;LEAD&lt;/code&gt;; "row's value compared to group average" → &lt;code&gt;AVG ... OVER&lt;/code&gt;.&lt;/p&gt;
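The `AVG ... OVER` behavior is easy to verify the same way. With no `ORDER BY` inside the `OVER` clause, every row carries its whole partition's average; the `reviews` table here is a hypothetical three-row sample, and the SQLite window-function support (3.25+) is assumed as above.

```python
import sqlite3

# Each row gets its partition's average, not a running average,
# because the OVER clause has no ORDER BY.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE reviews (product_id INT, stars INT)')
conn.executemany('INSERT INTO reviews VALUES (?, ?)', [(1, 4), (1, 2), (2, 5)])

rows = conn.execute("""
    SELECT product_id, stars,
           AVG(stars) OVER (PARTITION BY product_id) AS product_avg
    FROM reviews
    ORDER BY product_id, stars
""").fetchall()
print(rows)
# [(1, 2, 3.0), (1, 4, 3.0), (2, 5, 5.0)]
```

Both product-1 rows see the same `3.0`, which is what "group-aggregate-on-row" means in practice.
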

&lt;h4&gt;
  
  
  CTE composition: &lt;code&gt;WITH consecutive_events AS (...)&lt;/code&gt; for multi-step logic
&lt;/h4&gt;

&lt;p&gt;The CTE composition invariant: &lt;strong&gt;a &lt;code&gt;WITH ... AS (...)&lt;/code&gt; Common Table Expression names an intermediate result that later CTEs and the final &lt;code&gt;SELECT&lt;/code&gt; in the same statement can reference; multi-step window logic is far more readable as a CTE pipeline than as a deeply nested subquery&lt;/strong&gt;. Snowflake supports CTEs natively, including recursive CTEs for hierarchical traversal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WITH cte_name AS (SELECT ...)&lt;/code&gt;&lt;/strong&gt; — defines the named intermediate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple CTEs&lt;/strong&gt; — &lt;code&gt;WITH a AS (...), b AS (...) SELECT ... FROM b JOIN a ...&lt;/code&gt; — chain steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive CTE&lt;/strong&gt; — &lt;code&gt;WITH RECURSIVE cte AS (base UNION ALL recursive)&lt;/code&gt;; rare in interview but Snowflake-supported.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Materialization&lt;/strong&gt; — Snowflake may inline or materialize CTEs; behavior depends on the query optimizer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Compute &lt;code&gt;lag_week&lt;/code&gt; and &lt;code&gt;lead_week&lt;/code&gt; once in a CTE, then filter for the streak condition.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;what&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;CTE&lt;/td&gt;
&lt;td&gt;rows with current/lag/lead weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;filter&lt;/td&gt;
&lt;td&gt;rows where lag = current - 1 week AND lead = current + 1 week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;join + filter&lt;/td&gt;
&lt;td&gt;contacts with trial_request AND streak&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;consecutive_events&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;current_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
              &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
              &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lead_week&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;marketing_touches&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;consecutive_events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;crm_contacts&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contact_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lag_week&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_week&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 week'&lt;/span&gt;
    &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lead_week&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_week&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 week'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;marketing_touches&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'trial_request'&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; use a CTE the moment your window logic needs more than one step; deeply nested subqueries with embedded &lt;code&gt;OVER (...)&lt;/code&gt; clauses are unreadable.&lt;/p&gt;
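For intuition, the core of the CTE pipeline — week truncation plus the middle-row adjacency test — can be mirrored in plain Python. The helper names here are hypothetical; the truncation snaps each date back to its Monday, which is how PostgreSQL's `DATE_TRUNC('week', ...)` behaves.

```python
from datetime import date, timedelta

def week_trunc(d: date) -> date:
    # Snap back to Monday, mirroring DATE_TRUNC('week', event_date).
    return d - timedelta(days=d.weekday())

def has_three_week_streak(event_dates: list[date]) -> bool:
    # Sorted distinct weeks; a middle week with BOTH neighbors exactly
    # one week away marks a run of three consecutive calendar weeks.
    weeks = sorted({week_trunc(d) for d in event_dates})
    one_week = timedelta(weeks=1)
    return any(weeks[i] - weeks[i - 1] == one_week and
               weeks[i + 1] - weeks[i] == one_week
               for i in range(1, len(weeks) - 1))

touches = [date(2022, 4, 17), date(2022, 4, 23), date(2022, 4, 30)]
print(has_three_week_streak(touches))  # True: weeks of Apr 11, 18, 25
```
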

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Forgetting &lt;code&gt;PARTITION BY contact_id&lt;/code&gt; — &lt;code&gt;LAG&lt;/code&gt; reaches into the previous contact's events and produces meaningless deltas.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;ORDER BY week_trunc&lt;/code&gt; — row order is non-deterministic; &lt;code&gt;LAG&lt;/code&gt; returns whichever row the planner happened to pick.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;WHERE&lt;/code&gt; to filter aggregates — &lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt; raises an error because aggregates are evaluated after &lt;code&gt;WHERE&lt;/code&gt;; use &lt;code&gt;HAVING&lt;/code&gt; or wrap the aggregate in a subquery / CTE.&lt;/li&gt;
&lt;li&gt;Counting calendar weeks via &lt;code&gt;(event_date - lag_event_date) / 7&lt;/code&gt; instead of &lt;code&gt;DATE_TRUNC('week', ...)&lt;/code&gt; — day arithmetic ignores calendar-week boundaries: a Sunday and the following Monday are one day apart yet sit in different weeks, while two events six days apart can share one.&lt;/li&gt;
&lt;li&gt;Hardcoding the offset (&lt;code&gt;LAG(volume, 1)&lt;/code&gt;) when the prompt says "vs 7 days ago" — read the spec and use &lt;code&gt;LAG(volume, 7)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Marketing-Touch Streak Detection
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;marketing_touches(event_id, contact_id, event_type, event_date)&lt;/code&gt; and &lt;code&gt;crm_contacts(contact_id, email)&lt;/code&gt;, return the email of every contact who had &lt;strong&gt;at least three consecutive-week marketing touches&lt;/strong&gt; AND at least one &lt;code&gt;event_type = 'trial_request'&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;LAG&lt;/code&gt; / &lt;code&gt;LEAD&lt;/code&gt; window functions inside a CTE
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;consecutive_events_cte&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;current_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
              &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
              &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lead_week&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;marketing_touches&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;contacts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;consecutive_events_cte&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;crm_contacts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;contacts&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;contacts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contact_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lag_week&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_week&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 week'&lt;/span&gt;
    &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lead_week&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_week&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 week'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;contact_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;marketing_touches&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'trial_request'&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the CTE truncates each event to its week boundary and uses &lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt; to surface the adjacent weeks inside each contact's partition; the streak condition keeps rows whose lag week and lead week are both exactly one week away, and such a row is the middle of a three-consecutive-week run, so any match proves the required streak; the &lt;code&gt;IN&lt;/code&gt; subquery enforces the "at least one &lt;code&gt;trial_request&lt;/code&gt;" requirement; the &lt;code&gt;INNER JOIN&lt;/code&gt; resolves emails; &lt;code&gt;SELECT DISTINCT&lt;/code&gt; collapses duplicates from contacts with multiple qualifying weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the DataLemur sample (contact_id 1 across April 17, 23, 30 + May 14):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event_id&lt;/th&gt;
&lt;th&gt;contact_id&lt;/th&gt;
&lt;th&gt;week_trunc&lt;/th&gt;
&lt;th&gt;lag_week&lt;/th&gt;
&lt;th&gt;lead_week&lt;/th&gt;
&lt;th&gt;streak?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2022-04-11&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;2022-04-18&lt;/td&gt;
&lt;td&gt;✓ (lead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2022-04-18&lt;/td&gt;
&lt;td&gt;2022-04-11&lt;/td&gt;
&lt;td&gt;2022-04-25&lt;/td&gt;
&lt;td&gt;✓ (both)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2022-04-25&lt;/td&gt;
&lt;td&gt;2022-04-18&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;✓ (lag)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CTE materializes&lt;/strong&gt; — three rows per contact_id with current/lag/lead weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streak filter&lt;/strong&gt; — every row of contact_id 1 passes (lag or lead = current ± 7 days).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;trial_request&lt;/code&gt; filter&lt;/strong&gt; — contact_id 1 has event_id 2 = &lt;code&gt;trial_request&lt;/code&gt; → passes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inner-join + DISTINCT&lt;/strong&gt; — emits &lt;code&gt;andy.markus@att.net&lt;/code&gt; once.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;email&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="mailto:andy.markus@att.net"&gt;andy.markus@att.net&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC('week', event_date)&lt;/code&gt;&lt;/strong&gt; — snaps timestamps to the week boundary so consecutive-week comparisons are exact; Snowflake and Postgres share this syntax (BigQuery's &lt;code&gt;DATE_TRUNC&lt;/code&gt; takes its arguments in the opposite order).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG&lt;/code&gt; / &lt;code&gt;LEAD&lt;/code&gt; window&lt;/strong&gt; — surfaces adjacent rows inside the per-contact partition without a self-join; &lt;code&gt;O(n log n)&lt;/code&gt; for one window sort instead of &lt;code&gt;O(n²)&lt;/code&gt; for a self-join.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTE composition&lt;/strong&gt; — keeps the window logic separate from the streak filter; makes the query readable and the optimizer plan stable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OR&lt;/code&gt; between lag and lead&lt;/strong&gt; — accepts a row whose neighbor on either side is exactly one week away, so every row inside a run of consecutive weeks survives the filter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IN&lt;/code&gt; subquery for &lt;code&gt;trial_request&lt;/code&gt;&lt;/strong&gt; — enforces the second condition without a separate JOIN; the optimizer typically inlines this as a semi-join.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — the &lt;code&gt;O(|events| · log|events|)&lt;/code&gt; window sort dominates; the hash join adds &lt;code&gt;O(|events| + |contacts|)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window-function problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;CTE problems&lt;/a&gt; on PipeCode.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Snowflake&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Snowflake SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/snowflake" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — window functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL window-function problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — CTE&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL CTE problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Snowflake Architecture: Micro-Partitions, Clustering, and Time Travel
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Snowflake's three-layer architecture and product-knowledge primitives
&lt;/h3&gt;

&lt;p&gt;"Design Snowflake clustering, virtual warehouse, and Time Travel retention for a 10TB events table queried frequently by date and region" is the canonical Snowflake architecture interview prompt. The mental model: &lt;strong&gt;Snowflake has three independent layers — Database Storage (immutable micro-partitions in S3 / Azure Blob / GCS), Query Processing (Virtual Warehouses, scale-up for compute, scale-out for concurrency), and Cloud Services (auth, metadata, optimizer, security)&lt;/strong&gt;. Performance comes from &lt;strong&gt;micro-partition pruning&lt;/strong&gt; (the optimizer skips partitions whose min/max metadata excludes the WHERE predicate) and &lt;strong&gt;clustering keys&lt;/strong&gt; that co-locate rows on chosen columns. Recovery comes from &lt;strong&gt;Time Travel&lt;/strong&gt; (1–90 days of point-in-time queries) plus &lt;strong&gt;Fail-safe&lt;/strong&gt; (7 additional days of Snowflake-managed recovery). Cloning comes from &lt;strong&gt;Zero-Copy Cloning&lt;/strong&gt; (instant snapshot via metadata pointers, no storage duplication).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Snowflake is &lt;strong&gt;OLAP, not OLTP&lt;/strong&gt; — designed for analytical workloads, not row-level UPDATE/DELETE traffic. It is also &lt;strong&gt;NOT an ETL tool&lt;/strong&gt; — pair it with Airflow / dbt / Matillion for orchestration. Stating both framings unprompted in the interview signals product fluency and cuts off the most common follow-up traps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Three-layer architecture: storage / query processing / cloud services
&lt;/h4&gt;

&lt;p&gt;The architecture invariant: &lt;strong&gt;Snowflake separates storage, compute, and services so each scales independently — store petabytes while compute is off, spin up massive compute for a 10-minute job, and let the services layer manage metadata and optimization&lt;/strong&gt;. Traditional warehouses (Teradata, Netezza) tightly couple storage and compute, forcing over-provisioning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Database Storage&lt;/strong&gt; — columnar micro-partitions in cloud storage; immutable; you don't manage files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Processing&lt;/strong&gt; — Virtual Warehouses; clusters of CPU/RAM/SSD; XS, S, M, L, XL, 2XL, 3XL, 4XL, 5XL, 6XL sizes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Services&lt;/strong&gt; — auth, RBAC, metadata, optimizer, security; the "brain"; serverless from your perspective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-cloud&lt;/strong&gt; — runs on AWS, Azure, GCP; cross-cloud replication for disaster recovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Sketch the layer responsibilities for a typical &lt;code&gt;events&lt;/code&gt; table workload.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;responsibility&lt;/th&gt;
&lt;th&gt;scaling lever&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;persist micro-partitions; immutable&lt;/td&gt;
&lt;td&gt;grow with data volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Processing&lt;/td&gt;
&lt;td&gt;execute SQL on Virtual Warehouses&lt;/td&gt;
&lt;td&gt;scale up (size) or out (clusters)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Services&lt;/td&gt;
&lt;td&gt;auth, metadata, optimizer&lt;/td&gt;
&lt;td&gt;serverless&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- size a warehouse for the workload&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;events_wh&lt;/span&gt;
  &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'LARGE'&lt;/span&gt;
  &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
  &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
  &lt;span class="n"&gt;MIN_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="n"&gt;MAX_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; size the warehouse to the heaviest query, set &lt;code&gt;AUTO_SUSPEND = 60&lt;/code&gt; for cost control, and use &lt;code&gt;MIN/MAX_CLUSTER_COUNT&lt;/code&gt; for concurrency rather than oversizing a single cluster.&lt;/p&gt;

&lt;h4&gt;
  
  
  Micro-partitions and pruning via min/max metadata
&lt;/h4&gt;

&lt;p&gt;The pruning invariant: &lt;strong&gt;Snowflake stores per-column min/max (and other) metadata for every micro-partition; on a &lt;code&gt;WHERE&lt;/code&gt; predicate, the optimizer reads metadata and skips partitions whose value range cannot satisfy the predicate&lt;/strong&gt;. This is "automatic indexing" — no &lt;code&gt;CREATE INDEX&lt;/code&gt; required.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Micro-partition&lt;/strong&gt; — 50–500 MB of uncompressed data each (far smaller on disk after columnar compression), columnar, immutable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Min/max metadata&lt;/strong&gt; — per-column, per-partition; enables &lt;code&gt;WHERE date &amp;gt;= '2023-01-01'&lt;/code&gt; to skip partitions whose max date &amp;lt; 2023-01-01.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt; — ensures min/max metadata is &lt;strong&gt;selective&lt;/strong&gt; by co-locating rows on the chosen columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search Optimization Service&lt;/strong&gt; — adds point-lookup acceleration on top of pruning for high-cardinality predicates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Pruning a query against a 1000-partition events table.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;partitions scanned&lt;/th&gt;
&lt;th&gt;pruned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE date = '2023-06-15'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~3 (only partitions covering that date)&lt;/td&gt;
&lt;td&gt;997&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE region = 'US-EAST'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~250 (with clustering on region)&lt;/td&gt;
&lt;td&gt;750&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE customer_id = 12345&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1000 (no clustering on customer_id)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- enable clustering for predictable pruning&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
  &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; cluster on the columns you filter / aggregate by most often (typically date + a low-cardinality dimension); never cluster on a high-cardinality random column like &lt;code&gt;user_id&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Clustering keys + scale-up vs scale-out + Time Travel + Zero-Copy Cloning
&lt;/h4&gt;

&lt;p&gt;The composition invariant: &lt;strong&gt;Snowflake's full performance + recovery story stacks four primitives — clustering keys for storage layout, warehouse sizing for query speed (scale up) and concurrency (scale out), Time Travel for point-in-time recovery, and Zero-Copy Cloning for instant dev/QA environments&lt;/strong&gt;. Each is independent; combine them per workload requirements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CLUSTER BY (col1, col2)&lt;/code&gt;&lt;/strong&gt; — co-locates rows on the chosen columns; benefits pruning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale up&lt;/strong&gt; — &lt;code&gt;WAREHOUSE_SIZE = 'XLARGE'&lt;/code&gt;; doubles compute per cluster; speeds up one big query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale out&lt;/strong&gt; — &lt;code&gt;MAX_CLUSTER_COUNT = 4&lt;/code&gt;; spins up additional clusters for concurrent queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time Travel&lt;/strong&gt; — &lt;code&gt;DATA_RETENTION_TIME_IN_DAYS = 7&lt;/code&gt;; query past data with &lt;code&gt;AT (TIMESTAMP =&amp;gt; '...')&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Copy Cloning&lt;/strong&gt; — &lt;code&gt;CREATE TABLE events_dev CLONE events;&lt;/code&gt; — instant, no storage duplication.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Warehouse + clustering + Time Travel + clone for the events table.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;concern&lt;/th&gt;
&lt;th&gt;setting&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;query speed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WAREHOUSE_SIZE = 'LARGE'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;concurrency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MAX_CLUSTER_COUNT = 4&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pruning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CLUSTER BY (DATE_TRUNC('day', event_date), region)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;recovery window&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DATA_RETENTION_TIME_IN_DAYS = 7&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dev environment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CREATE TABLE events_dev CLONE events;&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;DATA_RETENTION_TIME_IN_DAYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
      &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_dev&lt;/span&gt; &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- query a past state&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-30 09:00:00'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; set Time Travel retention to match your incident-response window (typically 7 days for production, 1 day for staging); the higher the retention, the higher the storage cost.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Calling Snowflake an ETL tool — it is a data warehouse; pair it with Airflow / dbt / Matillion / Airbyte.&lt;/li&gt;
&lt;li&gt;Calling Snowflake OLTP — singleton UPDATE/DELETE traffic is poorly served; Snowflake is OLAP-first.&lt;/li&gt;
&lt;li&gt;Asking "how do I create an index?" — Snowflake doesn't use traditional indexes; min/max metadata on micro-partitions handles pruning automatically.&lt;/li&gt;
&lt;li&gt;Suggesting &lt;code&gt;VACUUM&lt;/code&gt; / &lt;code&gt;ANALYZE&lt;/code&gt; — Snowflake is fully managed; no manual maintenance.&lt;/li&gt;
&lt;li&gt;Treating Time Travel as a backup substitute — Time Travel covers accidental drops within its retention window; Fail-safe is the 7-day Snowflake-managed safety net beyond Time Travel; for long-term backups use replicated databases or unloaded files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Snowflake Architecture Design
&lt;/h3&gt;

&lt;p&gt;Design the Snowflake configuration for a 10TB &lt;code&gt;events&lt;/code&gt; table that is queried daily by &lt;code&gt;date&lt;/code&gt; and &lt;code&gt;region&lt;/code&gt;, must support 4-day point-in-time recovery, and needs an isolated dev environment for testing schema changes. Specify the warehouse size, clustering keys, Time Travel retention, and clone strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using clustering + multi-cluster warehouse + 7-day Time Travel + Zero-Copy Clone
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- 1. Warehouse: scale up (LARGE) for query speed; scale out (max 4 clusters) for concurrency&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;events_wh&lt;/span&gt;
  &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'LARGE'&lt;/span&gt;
  &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
  &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
  &lt;span class="n"&gt;MIN_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="n"&gt;MAX_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 2. Clustering: co-locate rows by query predicates (date + region)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
  &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;DATA_RETENTION_TIME_IN_DAYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 3. Dev environment: instant zero-copy clone&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_dev&lt;/span&gt; &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 4. Verify pruning effectiveness on a sample query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-30'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'US-EAST'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the LARGE warehouse gives the pruned scans enough memory and local SSD to avoid spilling; auto-suspend at 60s + auto-resume eliminates idle compute cost; clustering on &lt;code&gt;(DATE_TRUNC('day', event_date), region)&lt;/code&gt; aligns the natural query predicates with the partition layout so pruning scans only ~3 partitions for a single-day single-region query (down from ~1000); 7-day Time Travel covers the 4-day recovery requirement with margin; Zero-Copy Cloning produces &lt;code&gt;events_dev&lt;/code&gt; instantly via metadata pointers — the dev table only consumes storage as it diverges from production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the design rationale:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Workload analysis&lt;/strong&gt; — 10TB table; daily queries by date + region; 4-day recovery requirement; dev environment needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute layer&lt;/strong&gt; — LARGE warehouse for the heaviest query; multi-cluster (1–4) for concurrency without oversize.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage layer&lt;/strong&gt; — cluster on (date, region) so min/max metadata aligns with predicates → pruning skips ~99.7% of partitions on point queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery layer&lt;/strong&gt; — 7-day Time Travel covers 4-day requirement with safety margin; Fail-safe gives an additional 7 days of Snowflake-managed recovery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev layer&lt;/strong&gt; — Zero-Copy Clone produces instant dev table; storage diverges only on writes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;setting&lt;/th&gt;
&lt;th&gt;rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WAREHOUSE_SIZE = 'LARGE'&lt;/code&gt;, &lt;code&gt;MAX_CLUSTER_COUNT = 4&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;scale up + scale out&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CLUSTER BY (DATE_TRUNC('day', event_date), region)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;maximize pruning on query predicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DATA_RETENTION_TIME_IN_DAYS = 7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;covers 4-day requirement with margin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dev&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CREATE TABLE events_dev CLONE events;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;instant zero-copy environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;auto-suspend = 60s&lt;/td&gt;
&lt;td&gt;idle compute paid only in seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three-layer separation&lt;/strong&gt; — independent scaling of storage / compute / services; the compute warehouse can be &lt;code&gt;LARGE&lt;/code&gt; today and &lt;code&gt;XSMALL&lt;/code&gt; tomorrow without touching data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Micro-partition pruning&lt;/strong&gt; — min/max metadata on each partition + clustering on predicate columns → the optimizer skips ~99.7% of partitions on a single-day single-region query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster warehouse&lt;/strong&gt; — &lt;code&gt;MAX_CLUSTER_COUNT = 4&lt;/code&gt; handles concurrency; instead of queueing 100 simultaneous queries on one cluster, additional clusters spin up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7-day Time Travel&lt;/strong&gt; — &lt;code&gt;AT (TIMESTAMP =&amp;gt; ...)&lt;/code&gt; queries any past state within retention; covers accidental drops, schema mistakes, and the stated 4-day recovery requirement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Copy Cloning&lt;/strong&gt; — &lt;code&gt;CREATE TABLE ... CLONE ...&lt;/code&gt; is instant; storage is metadata-only until divergence; well suited to dev/QA copies of production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost via auto-suspend&lt;/strong&gt; — &lt;code&gt;AUTO_SUSPEND = 60&lt;/code&gt; charges only for compute-active seconds; storage and services are billed separately and remain on while compute sleeps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pruning ratio &lt;code&gt;partitions_scanned / partitions_total&lt;/code&gt;&lt;/strong&gt; — clustering on &lt;code&gt;(date, region)&lt;/code&gt; aligns the natural query predicates with the partition layout, so a single-day single-region query scans ~3 of ~1000 partitions (≈ 0.3%); the remaining 99.7% are pruned via min/max metadata before any data is read.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;dimensional-modeling problems&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/company/snowflake" rel="noopener noreferrer"&gt;Snowflake practice page&lt;/a&gt; for the full curated set.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Snowflake&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Snowflake SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/snowflake" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;DATA-MODELING&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — dimensional modeling&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Dimensional-modeling problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — JSON / VARIANT&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;JSON / VARIANT SQL problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/json" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips to crack Snowflake data engineering interviews
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two audiences — pick yours and rebalance the four primitives
&lt;/h3&gt;

&lt;p&gt;The phrase "Snowflake interview" splits cleanly. &lt;strong&gt;Audience A&lt;/strong&gt; — interviewing AT Snowflake the company — the loop is &lt;strong&gt;LeetCode-style Python&lt;/strong&gt;: the two curated problems (#24 SET Card Game Validation, #161 Maximum Substring Occurrences) hint at the bar, and the algorithm fluency in §1 and §2 carries the round. &lt;strong&gt;Audience B&lt;/strong&gt; — interviewing for a data-engineering role at a company that uses Snowflake — the loop is &lt;strong&gt;SQL window functions plus product knowledge&lt;/strong&gt;, and §3 plus §4 carry the round. Pick yours upfront and rebalance prep time accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drill the four primitives
&lt;/h3&gt;

&lt;p&gt;The four primitives in this guide map directly to the two curated PipeCode Python problems plus the two adjacent primitives every Snowflake-flavored interview rotates through: Python &lt;code&gt;zip(*cards)&lt;/code&gt; + &lt;code&gt;set&lt;/code&gt; + &lt;code&gt;all&lt;/code&gt; for SET-style validation (Python EASY — #24); a Python rolling &lt;code&gt;freq&lt;/code&gt; dict + &lt;code&gt;Counter&lt;/code&gt; over substrings for windowed counting (Python MEDIUM — #161); SQL &lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt;/&lt;code&gt;AVG OVER PARTITION BY&lt;/code&gt; plus CTE composition for streak detection and analytical aggregations (SQL — DataLemur staple); and Snowflake architecture (3 layers, micro-partitions + clustering + Time Travel + Zero-Copy Cloning). Each maps to a specific toolkit: vanilla &lt;code&gt;dict&lt;/code&gt; and &lt;code&gt;collections.Counter&lt;/code&gt; for the Python primitives, &lt;code&gt;OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt; for windows, and the Snowflake DDL surface (&lt;code&gt;CLUSTER BY&lt;/code&gt;, &lt;code&gt;DATA_RETENTION_TIME_IN_DAYS&lt;/code&gt;, &lt;code&gt;CLONE&lt;/code&gt;, &lt;code&gt;WAREHOUSE_SIZE&lt;/code&gt;) for architecture.&lt;/p&gt;
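&lt;p&gt;The &lt;code&gt;zip(*cards)&lt;/code&gt; + &lt;code&gt;set&lt;/code&gt; + &lt;code&gt;all&lt;/code&gt; pattern can be sketched in a few lines — the attribute names and input shape below are illustrative, not the exact spec of problem #24:&lt;/p&gt;

```python
def is_valid_set(cards):
    """SET invariant: for every attribute position, the three cards
    must be all the same (set size 1) or all different (set size 3).

    `cards` is a list of three equal-length attribute tuples; the
    practice problem's actual input format may differ."""
    return all(len(set(attr)) in (1, 3) for attr in zip(*cards))


# same color everywhere, shapes and counts all different -> valid SET
print(is_valid_set([("red", "oval", 1),
                    ("red", "squiggle", 2),
                    ("red", "diamond", 3)]))    # True

# two cards share "oval" while the third differs -> invalid
print(is_valid_set([("red", "oval", 1),
                    ("green", "oval", 2),
                    ("purple", "diamond", 3)]))  # False
```

&lt;p&gt;&lt;code&gt;zip(*cards)&lt;/code&gt; transposes cards into per-attribute columns, so the whole invariant collapses to one &lt;code&gt;all(...)&lt;/code&gt; expression — the zero-hesitation idiom the interviewer is listening for.&lt;/p&gt;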

&lt;h3&gt;
  
  
  Memorize the OLAP-not-OLTP and NOT-an-ETL-tool framing
&lt;/h3&gt;

&lt;p&gt;The single most common Snowflake interview trap is treating Snowflake as a traditional RDBMS or as an ETL tool. State unprompted: "Snowflake is OLAP — analytical workloads, not row-level UPDATE/DELETE traffic; it pairs with Airflow / dbt / Matillion for orchestration; it doesn't use traditional indexes — micro-partition min/max metadata + clustering does the pruning automatically." This single framing flip cuts off the most common follow-up traps before they fire.&lt;/p&gt;

&lt;h3&gt;
  
  
  LeetCode-style fluency for Snowflake-the-company
&lt;/h3&gt;

&lt;p&gt;The two PipeCode problems hint at the actual Snowflake-the-company DE / SWE interview bar: &lt;strong&gt;algorithm fluency in Python&lt;/strong&gt;, not Snowflake product trivia. The interview tests &lt;code&gt;zip(*iter)&lt;/code&gt; + &lt;code&gt;set&lt;/code&gt; composition and rolling &lt;code&gt;freq&lt;/code&gt; dict + &lt;code&gt;Counter&lt;/code&gt; patterns, not "explain virtual warehouses." Drill the &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/difficulty/easy" rel="noopener noreferrer"&gt;easy practice page&lt;/a&gt; until the canonical EASY-tier code rolls off your fingers in under three minutes.&lt;/p&gt;
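&lt;p&gt;The rolling &lt;code&gt;freq&lt;/code&gt; dict + &lt;code&gt;Counter&lt;/code&gt; composition looks like this in miniature — the constraint shown (fixed window size, bounded distinct characters) is an assumed stand-in, not the actual statement of problem #161:&lt;/p&gt;

```python
from collections import Counter

def max_substring_occurrences(s, size, max_distinct):
    """Slide a fixed-size window over s, keep a rolling character-frequency
    dict for the current window, and use a Counter to track how often each
    qualifying substring appears. Illustrative constraints, not the exact
    problem spec."""
    freq = {}          # rolling character frequencies in the window
    seen = Counter()   # qualifying substring -> occurrence count
    for i, ch in enumerate(s):
        freq[ch] = freq.get(ch, 0) + 1
        if i >= size:                      # evict the char leaving the window
            left = s[i - size]
            freq[left] -= 1
            if freq[left] == 0:
                del freq[left]
        if i >= size - 1 and len(freq) <= max_distinct:
            seen[s[i - size + 1 : i + 1]] += 1
    return max(seen.values()) if seen else 0


# windows of size 2 over "ababab": "ab" occurs 3 times, "ba" twice
print(max_substring_occurrences("ababab", 2, 2))  # 3
```

&lt;p&gt;The key move is that the &lt;code&gt;freq&lt;/code&gt; dict is updated in &lt;code&gt;O(1)&lt;/code&gt; per step (add right char, evict left char) instead of being rebuilt per window, keeping the scan linear in &lt;code&gt;len(s)&lt;/code&gt;.&lt;/p&gt;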

&lt;h3&gt;
  
  
  Snowpipe + Time Travel + Zero-Copy Cloning are the signature features — name them out loud
&lt;/h3&gt;

&lt;p&gt;For Snowflake-as-tool interviews, the three signature product features that separate fluent candidates from bluffers are &lt;strong&gt;Snowpipe&lt;/strong&gt; (serverless continuous ingestion via cloud-event notifications), &lt;strong&gt;Time Travel&lt;/strong&gt; (1–90 day point-in-time recovery + 7-day Fail-safe beyond), and &lt;strong&gt;Zero-Copy Cloning&lt;/strong&gt; (instant dev/QA snapshots via metadata pointers). Name them unprompted when discussing ingestion, recovery, or dev environments — interviewers grade product fluency and these three are the highest-signal features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Easy-Medium discipline matters
&lt;/h3&gt;

&lt;p&gt;The curated Snowflake set on PipeCode is &lt;strong&gt;1 EASY + 1 MEDIUM&lt;/strong&gt; Python. EASY at Snowflake doesn't mean trivial — it means the interviewer expects zero hesitation, idiomatic Python, and an articulated invariant. Hesitation on the EASY (stuttering through &lt;code&gt;zip(*cards)&lt;/code&gt; or &lt;code&gt;len(set(...)) in {1, 3}&lt;/code&gt;) is penalized more heavily than the same hesitation on the MEDIUM. Drill the &lt;a href="https://pipecode.ai/explore/practice/difficulty/easy" rel="noopener noreferrer"&gt;Easy practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/difficulty/medium" rel="noopener noreferrer"&gt;Medium practice page&lt;/a&gt; until the canonical EASY-tier code rolls off your fingers in under three minutes.&lt;/p&gt;
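&lt;p&gt;For reference, the whole EASY reduces to one transposition plus one invariant. A minimal sketch, assuming a three-cards-of-tuples input shape (the real PipeCode prompt may differ):&lt;/p&gt;

```python
def is_valid_set(cards: list[tuple]) -> bool:
    # Assumed input shape: exactly three cards, each a tuple of attributes
    # (e.g. color, shape, number, shading). zip(*cards) transposes card
    # rows into per-attribute columns; a column is legal when its three
    # values are all the same (1 distinct) or all different (3 distinct).
    return all(len(set(attr)) in {1, 3} for attr in zip(*cards))
```

&lt;p&gt;Saying the invariant out loud — "each attribute column must be all-same or all-different" — is half the grade.&lt;/p&gt;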

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/company/snowflake" rel="noopener noreferrer"&gt;Snowflake practice page&lt;/a&gt; and the language-scoped &lt;a href="https://pipecode.ai/explore/practice/company/snowflake/python" rel="noopener noreferrer"&gt;Snowflake Python practice page&lt;/a&gt; for the curated 2-problem set. After that, drill the matching topic pages: &lt;a href="https://pipecode.ai/explore/practice/topic/array" rel="noopener noreferrer"&gt;array&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/hash-table" rel="noopener noreferrer"&gt;hash table&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/sliding-window" rel="noopener noreferrer"&gt;sliding window&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/string" rel="noopener noreferrer"&gt;string&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;window functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;CTE&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;aggregation&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;dimensional modeling&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/json" rel="noopener noreferrer"&gt;JSON&lt;/a&gt;. The &lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;interview courses page&lt;/a&gt; bundles structured curricula. 
For broader coverage, &lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;browse by topic&lt;/a&gt;, or pivot to peer interview guides — the &lt;a href="https://pipecode.ai/blogs/airbnb-data-engineering-interview-questions-prep-guide" rel="noopener noreferrer"&gt;Airbnb DE interview guide&lt;/a&gt; and the &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top DE interview questions 2026&lt;/a&gt; blog.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communication and approach under time pressure
&lt;/h3&gt;

&lt;p&gt;Talk through the invariant first ("this is a sliding-window plus hash-table problem"), the brute force second ("a nested loop over all length-&lt;code&gt;k&lt;/code&gt; substrings would also work"), and the optimal third ("but the rolling &lt;code&gt;freq&lt;/code&gt; dict gives &lt;code&gt;O(1)&lt;/code&gt; per shift instead of &lt;code&gt;O(k)&lt;/code&gt;"). Interviewers grade &lt;strong&gt;process&lt;/strong&gt; as much as the final answer. Leave a few minutes at the end for an edge-case sweep: empty input, &lt;code&gt;k &amp;gt; len(s)&lt;/code&gt;, single-character strings, &lt;code&gt;maxLetters = 0&lt;/code&gt;, &lt;code&gt;cards&lt;/code&gt; with 0 attributes, NULL in a window-function partition. The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases; a quick sweep prevents it.&lt;/p&gt;
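&lt;p&gt;The sweep is cheapest when the guards come first. A hypothetical helper (name and signature invented for illustration) showing the shape:&lt;/p&gt;

```python
def count_windows(s: str, k: int, max_letters: int) -> int:
    # Hypothetical helper (not a PipeCode problem): count the length-k
    # windows of s that use at most max_letters distinct characters.
    # Guard first: every degenerate input short-circuits to 0 before the
    # happy-path loop can crash or miscount.
    if not (s and len(s) >= k >= 1 and max_letters >= 1):
        return 0
    return sum(
        1
        for i in range(len(s) - k + 1)
        if max_letters >= len(set(s[i : i + k]))
    )
```

&lt;p&gt;One guard line covers four of the edge cases above; narrating it ("empty string, bad k, zero letter budget all return 0") is the 30-second version of the sweep.&lt;/p&gt;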




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the Snowflake data engineering interview process like?
&lt;/h3&gt;

&lt;p&gt;The Snowflake data engineering interview opens with a 30-minute recruiter screen, then a 60-minute technical phone screen with a live SQL or Python coding problem, then a virtual onsite of four to five rounds: one or two coding rounds (Python algorithms or SQL window functions, depending on whether the role is at Snowflake the company or at a Snowflake-using company), one system-design round (designing data pipelines, warehouse architecture, or trade reconciliation), one data-modeling discussion (star vs snowflake schema, dimensional modeling), and a behavioral round. End-to-end the loop runs three to four weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  What programming languages does Snowflake test in data engineering interviews?
&lt;/h3&gt;

&lt;p&gt;Snowflake data engineering interviews are bilingual — SQL and Python in roughly equal measure across the loop. Python concentrates on array iteration with set-based validation, hash-table and sliding-window patterns, and dict-counter idioms. SQL concentrates on window functions (&lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, &lt;code&gt;AVG OVER PARTITION BY&lt;/code&gt;), CTE composition, aggregation, and Snowflake's specific dialect features (&lt;code&gt;DATE_TRUNC&lt;/code&gt;, &lt;code&gt;DATEADD&lt;/code&gt;, &lt;code&gt;QUALIFY&lt;/code&gt;, &lt;code&gt;VARIANT&lt;/code&gt; / JSON queries). Snowpark (Python/Java/Scala on Snowflake compute) appears at senior backend-leaning DE roles but is rarely tested in the live coding rounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  How difficult are Snowflake data engineering interview questions?
&lt;/h3&gt;

&lt;p&gt;The curated Snowflake practice set on PipeCode is &lt;strong&gt;1 easy + 1 medium&lt;/strong&gt;, no hard. The EASY is a Python array + set validation problem (SET Card Game Validation); the MEDIUM is a Python hash-table + sliding-window problem (Maximum Substring Occurrences). At the onsite, system-design and architecture questions can reach L4-L5 level — designing the warehouse + clustering + Time Travel for a 10TB events table, choosing scale-up vs scale-out — but the live coding rounds stay in the EASY-MEDIUM zone. Stuttering on the EASY is a stronger negative signal than struggling with the MEDIUM.&lt;/p&gt;

&lt;h3&gt;
  
  
  How should I prepare for a Snowflake data engineer interview?
&lt;/h3&gt;

&lt;p&gt;Solve the 2 problems on the &lt;a href="https://pipecode.ai/explore/practice/company/snowflake" rel="noopener noreferrer"&gt;Snowflake practice page&lt;/a&gt; end-to-end — untimed first, then timed at 25 minutes per problem — and broaden to &lt;strong&gt;30 to 50 additional problems&lt;/strong&gt; across the matching topic pages: array, hash table, sliding window, string on the Python side, and window functions, CTE, aggregation, dimensional modeling on the SQL side. Spend a week with the architecture chapter of the official Snowflake documentation (three layers, micro-partitions, virtual warehouses, Time Travel, Snowpipe, Zero-Copy Cloning). Practice articulating idempotency and clustering choices when discussing warehouse design — those framings are graded heavily at any Snowflake-flavored interview, and they're the difference between strong "Snowflake interview questions and answers" material and bluff-level definitions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Snowflake-specific topics show up most often in interviews?
&lt;/h3&gt;

&lt;p&gt;Five Snowflake-specific topics dominate every interview list — &lt;strong&gt;Snowflake architecture&lt;/strong&gt; (the three-layer split between Database Storage, Query Processing via Virtual Warehouses, and Cloud Services); &lt;strong&gt;micro-partitions and pruning&lt;/strong&gt; (50–500MB columnar blocks with min/max metadata that the optimizer uses to skip irrelevant partitions); &lt;strong&gt;Snowflake window functions&lt;/strong&gt; (&lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, &lt;code&gt;AVG ... OVER (PARTITION BY ...)&lt;/code&gt; plus &lt;code&gt;QUALIFY&lt;/code&gt; for window-result filtering); &lt;strong&gt;Time Travel + Fail-safe&lt;/strong&gt; (1–90 day point-in-time recovery plus 7 days of Snowflake-managed safety net); and &lt;strong&gt;Zero-Copy Cloning&lt;/strong&gt; (instant snapshots for dev/QA via metadata pointers). Master these five and you can answer any Snowflake-product question on any interview list.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Snowflake an ETL tool, and how does that affect the interview?
&lt;/h3&gt;

&lt;p&gt;No — Snowflake is &lt;strong&gt;not an ETL tool&lt;/strong&gt;. It is a cloud data warehouse designed for storing and analyzing data; ETL/ELT orchestration runs in tools like Apache Airflow, dbt, Matillion, Coalesce.io, or Airbyte that load data into Snowflake. This distinction anchors most "Snowflake ETL interview questions" — the correct answer is to clarify the architecture (Snowflake as the destination warehouse, an external orchestrator for transformations) and walk through the typical loading patterns (&lt;code&gt;COPY INTO&lt;/code&gt; for bulk historical loads, Snowpipe for serverless continuous ingestion, External Tables for in-place queries against cloud storage). Stating "Snowflake is not an ETL tool" unprompted is one of the highest-signal product-fluency moves a candidate can make.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing Snowflake data engineering problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Robinhood Data Engineering Interview Questions &amp; Prep Guide</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Sun, 03 May 2026 05:19:18 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/robinhood-data-engineering-interview-questions-prep-guide-hn</link>
      <guid>https://dev.to/gowthampotureddi/robinhood-data-engineering-interview-questions-prep-guide-hn</guid>
      <description>&lt;p&gt;&lt;strong&gt;Robinhood data engineering interview questions&lt;/strong&gt; are bilingual — SQL and Python in roughly equal measure — with a fintech-correctness edge that most generic interview-prep posts miss. Four primitives carry the loop: &lt;code&gt;dict.get(s, 0) + 1&lt;/code&gt; hash-table counters that aggregate stock-purchase events by symbol, &lt;code&gt;INNER JOIN trades + users + GROUP BY + ORDER BY count DESC LIMIT N&lt;/code&gt; for top-N city / member-transfer rankings, &lt;code&gt;LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date)&lt;/code&gt; for day-over-day volume or balance change, and &lt;code&gt;GROUP BY user_id HAVING SUM(notional) &amp;gt; limit&lt;/code&gt; for end-of-day threshold and notional-cap checks. The framings are everyday brokerage data engineering — count purchases per ticker, surface the top cities by completed trades, compute a daily volume percentage change, flag any account whose option exposure crosses a regulatory limit.&lt;/p&gt;

&lt;p&gt;This guide walks through the four topic clusters Robinhood actually tests, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, and an &lt;strong&gt;interview-style problem with a full solution&lt;/strong&gt; that explains why it works. The mix matches a curated 2-problem Robinhood set (1 EASY Python hash-table + 1 MEDIUM SQL joins) plus the two SQL primitives — window &lt;code&gt;LAG&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt;-aggregate — that show up on every Robinhood SQL question list and at every L4/L5 onsite. Robinhood pipelines also demand penny-perfect correctness, idempotency, and audit-aware thinking; brokerage candidates who frame their answers in those terms separate themselves from the generic-DE pile.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe9579u837ybeaheewle.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe9579u837ybeaheewle.jpeg" alt="Robinhood data engineering interview questions cover image with bold headline, Python and SQL chips, faint code ghost, and pipecode.ai attribution." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top Robinhood data engineering interview topics
&lt;/h2&gt;

&lt;p&gt;From the &lt;a href="https://pipecode.ai/explore/practice/company/robinhood" rel="noopener noreferrer"&gt;Robinhood data engineering practice set&lt;/a&gt;, the &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; (one row per &lt;strong&gt;H2&lt;/strong&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic (sections &lt;strong&gt;1–4&lt;/strong&gt;)&lt;/th&gt;
&lt;th&gt;Why it shows up at Robinhood&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Python hash tables and dict counters for stock-purchase aggregation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stock Purchases Count (EASY) — &lt;code&gt;dict.get(symbol, 0) + 1&lt;/code&gt; and dict-of-sets, the Python primitive for counting events and tracking distinct buyers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL inner join and GROUP BY for top-N trade aggregations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Member Transfer Records (MEDIUM) — &lt;code&gt;INNER JOIN trades + users + WHERE status='Completed' + GROUP BY city + ORDER BY COUNT(*) DESC + LIMIT N&lt;/code&gt;, the SQL primitive for top-N rankings on completed trades.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL window functions and &lt;code&gt;LAG&lt;/code&gt; for daily volume change&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Volume change percentage — &lt;code&gt;LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date)&lt;/code&gt; for day-over-day deltas on partitioned time series.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL aggregation and &lt;code&gt;HAVING&lt;/code&gt; for threshold and notional-limit checks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Notional-limit check — &lt;code&gt;GROUP BY user_id HAVING SUM(contract_count * strike_price * 100) &amp;gt; limit&lt;/code&gt;, the SQL primitive for "WHERE on aggregates" that flags risk and compliance breaches.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bilingual framing rule:&lt;/strong&gt; Robinhood's prompts are everyday brokerage data engineering — count buyers per ticker, rank cities by completed orders, measure daily volume change, flag breaches of a notional limit. The interviewer is grading whether you map each business framing to the right primitive: count events → hash-table dict counter; rank by attribute → &lt;code&gt;INNER JOIN + GROUP BY + ORDER BY DESC + LIMIT&lt;/code&gt;; day-over-day delta → &lt;code&gt;LAG&lt;/code&gt; window function; flag aggregates over a threshold → &lt;code&gt;GROUP BY + HAVING&lt;/code&gt;. State the mapping out loud and the correctness will follow.&lt;/p&gt;
&lt;/blockquote&gt;
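
&lt;p&gt;Primitives 3 and 4 from the table can be rehearsed locally before touching a warehouse. A minimal sketch using Python's bundled SQLite (window functions require SQLite 3.25 or newer); the table and column names are illustrative, not Robinhood's schema:&lt;/p&gt;

```python
import sqlite3

# Tiny in-memory trades table standing in for a brokerage events feed.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE trades (stock_symbol TEXT, trade_date TEXT, volume INTEGER);
    INSERT INTO trades VALUES
      ('AAPL', '2026-05-01', 100), ('AAPL', '2026-05-02', 150),
      ('TSLA', '2026-05-01', 80),  ('TSLA', '2026-05-02', 40);
""")

# Primitive 3: LAG for day-over-day volume change per symbol.
rows = con.execute("""
    SELECT stock_symbol, trade_date,
           volume - LAG(volume) OVER (
               PARTITION BY stock_symbol ORDER BY trade_date) AS change
    FROM trades
""").fetchall()
# The first trade_date in each partition has no prior row, so change is NULL.

# Primitive 4: GROUP BY + HAVING to flag symbols whose total volume
# crosses a threshold ("WHERE on aggregates").
flagged = con.execute("""
    SELECT stock_symbol, SUM(volume) AS total
    FROM trades
    GROUP BY stock_symbol
    HAVING SUM(volume) > 200
""").fetchall()
```

&lt;p&gt;The NULL &lt;code&gt;change&lt;/code&gt; on each symbol's first &lt;code&gt;trade_date&lt;/code&gt; is exactly the &lt;code&gt;LAG&lt;/code&gt; edge case interviewers probe.&lt;/p&gt;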




&lt;h2&gt;
  
  
  1. Python Hash Tables and Dict Counters for Stock-Purchase Aggregation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hash-table dict counting for stock-purchase events in Python for data engineering
&lt;/h3&gt;

&lt;p&gt;"Given a list of stock-purchase events, count purchases per symbol and surface the set of distinct buyers per symbol" is Robinhood's signature EASY Python prompt (Stock Purchases Count). The mental model: &lt;strong&gt;a hash-table counter is a &lt;code&gt;dict&lt;/code&gt; keyed on the grouping attribute with a count value, updated in a single pass with &lt;code&gt;counts[k] = counts.get(k, 0) + 1&lt;/code&gt;; a parallel &lt;code&gt;dict[str, set]&lt;/code&gt; keyed the same way tracks distinct members per group&lt;/strong&gt;. Same primitive powers any "count by category" or "unique-elements per bucket" pipeline — count clicks per user, distinct sessions per page, purchases per ticker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5ptpacox0lpb9ayqm1m.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5ptpacox0lpb9ayqm1m.jpeg" alt="Diagram of stock-purchase events flowing into a Python dict counter that maps each symbol to a purchase count and a parallel dict-of-sets that maps each symbol to its set of distinct buyers." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; The &lt;code&gt;dict.get(k, 0) + 1&lt;/code&gt; idiom is the workhorse — but &lt;code&gt;collections.Counter&lt;/code&gt; is the idiomatic shortcut and &lt;code&gt;collections.defaultdict(int)&lt;/code&gt; is the common middle ground. Pick the one your interviewer can read fastest: &lt;code&gt;Counter&lt;/code&gt; for "count occurrences," &lt;code&gt;defaultdict(set)&lt;/code&gt; for "track unique members per group," vanilla &lt;code&gt;dict.get&lt;/code&gt; when you want zero imports. State which you chose and why.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Dict counter idiom: &lt;code&gt;dict.get(k, 0) + 1&lt;/code&gt; vs &lt;code&gt;collections.Counter&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The dict-counter invariant: &lt;strong&gt;a &lt;code&gt;dict&lt;/code&gt; keyed on the grouping attribute holds the running count for that group; on each event you read the current count (or 0 if absent), add one, and write it back&lt;/strong&gt;. &lt;code&gt;dict.get(k, default)&lt;/code&gt; returns &lt;code&gt;default&lt;/code&gt; when &lt;code&gt;k&lt;/code&gt; is missing, which avoids the &lt;code&gt;KeyError&lt;/code&gt; that bare &lt;code&gt;dict[k]&lt;/code&gt; would raise.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;counts[k] = counts.get(k, 0) + 1&lt;/code&gt;&lt;/strong&gt; — vanilla dict, no imports, the universal pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;counts = Counter(events)&lt;/code&gt;&lt;/strong&gt; — one-liner if &lt;code&gt;events&lt;/code&gt; is the iterable of keys themselves (no extra fields).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;counts = defaultdict(int); counts[k] += 1&lt;/code&gt;&lt;/strong&gt; — middle ground, handles missing keys via the factory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid &lt;code&gt;try/except KeyError&lt;/code&gt;&lt;/strong&gt; — works but is slower and harder to read than &lt;code&gt;get&lt;/code&gt; or &lt;code&gt;defaultdict&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Count three purchase events for &lt;code&gt;AAPL&lt;/code&gt;, &lt;code&gt;TSLA&lt;/code&gt;, &lt;code&gt;AAPL&lt;/code&gt; using the &lt;code&gt;dict.get&lt;/code&gt; idiom.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;counts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;start&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AAPL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TSLA&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 1, 'TSLA': 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AAPL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 2, 'TSLA': 1}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TSLA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="c1"&gt;# counts == {'AAPL': 2, 'TSLA': 1}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; when the count is the only signal, &lt;code&gt;Counter(iter)&lt;/code&gt; is shortest; when you also need other per-group state (a set of distinct buyers, a running max, a list of timestamps), reach for &lt;code&gt;defaultdict(set)&lt;/code&gt; or vanilla &lt;code&gt;dict.get&lt;/code&gt; with explicit logic.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set-of-buyers per stock: dict-of-sets pattern
&lt;/h4&gt;

&lt;p&gt;The dict-of-sets invariant: &lt;strong&gt;a &lt;code&gt;dict&lt;/code&gt; keyed on the grouping attribute holds a &lt;code&gt;set&lt;/code&gt; of distinct member ids; each event inserts the member id, and &lt;code&gt;set&lt;/code&gt; semantics dedupe automatically&lt;/strong&gt;. Inserting the same &lt;code&gt;(symbol, user)&lt;/code&gt; pair twice is a no-op — no special handling required, no &lt;code&gt;if member in s&lt;/code&gt; guard.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;buyers.setdefault(symbol, set()).add(user)&lt;/code&gt;&lt;/strong&gt; — one-liner that initializes the set on first touch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;buyers = defaultdict(set); buyers[symbol].add(user)&lt;/code&gt;&lt;/strong&gt; — same outcome, slightly cleaner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;len(buyers[symbol])&lt;/code&gt;&lt;/strong&gt; — distinct-buyer count for any symbol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set membership&lt;/strong&gt; — &lt;code&gt;O(1)&lt;/code&gt; average; never use a &lt;code&gt;list&lt;/code&gt; here, which would be &lt;code&gt;O(N)&lt;/code&gt; for &lt;code&gt;if x in lst&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Track distinct buyers for &lt;code&gt;AAPL&lt;/code&gt; across three events: &lt;code&gt;(u1, AAPL)&lt;/code&gt;, &lt;code&gt;(u2, AAPL)&lt;/code&gt;, &lt;code&gt;(u1, AAPL)&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;buyers['AAPL']&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;start&lt;/td&gt;
&lt;td&gt;&lt;code&gt;set()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;u1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'u1'}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;u2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'u1', 'u2'}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;u1&lt;/code&gt; (dup)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'u1', 'u2'}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="n"&gt;buyers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;u1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;u2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;u1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]:&lt;/span&gt;
    &lt;span class="n"&gt;buyers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# buyers == {'AAPL': {'u1', 'u2'}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;defaultdict(set)&lt;/code&gt; collapses the "first-time init" branch — never write &lt;code&gt;if k not in d: d[k] = set()&lt;/code&gt; when &lt;code&gt;defaultdict&lt;/code&gt; exists.&lt;/p&gt;

&lt;h4&gt;
  
  
  One-pass aggregation in a single for-loop
&lt;/h4&gt;

&lt;p&gt;The single-pass invariant: &lt;strong&gt;two parallel dicts updated in the same &lt;code&gt;for&lt;/code&gt; loop produce both aggregates in &lt;code&gt;O(N)&lt;/code&gt; total time and &lt;code&gt;O(K)&lt;/code&gt; space&lt;/strong&gt;, where &lt;code&gt;K&lt;/code&gt; is the number of distinct symbols. Iterating twice — once for counts, once for buyer-sets — works but doubles the work; one pass is idiomatic and what interviewers grade.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One loop, two updates&lt;/strong&gt; — &lt;code&gt;counts[s] = counts.get(s, 0) + 1&lt;/code&gt; and &lt;code&gt;buyers.setdefault(s, set()).add(u)&lt;/code&gt; per event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid two passes&lt;/strong&gt; — &lt;code&gt;O(2N) == O(N)&lt;/code&gt; asymptotically but reads as wasteful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tuple unpack the event&lt;/strong&gt; — &lt;code&gt;for user, symbol, qty in events:&lt;/code&gt; is cleaner than indexing &lt;code&gt;event[0]&lt;/code&gt;, &lt;code&gt;event[1]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip filters first&lt;/strong&gt; — if the prompt says "only Completed trades," guard with &lt;code&gt;if status != 'Completed': continue&lt;/code&gt; at the top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Aggregate the four-event stream into both dicts in one pass.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event&lt;/th&gt;
&lt;th&gt;counts&lt;/th&gt;
&lt;th&gt;buyers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;(u1, AAPL)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': {'u1'}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;(u2, TSLA)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 1, 'TSLA': 1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': {'u1'}, 'TSLA': {'u2'}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;(u1, AAPL)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 2, 'TSLA': 1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': {'u1'}, 'TSLA': {'u2'}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;(u3, AAPL)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 3, 'TSLA': 1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': {'u1', 'u3'}, 'TSLA': {'u2'}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;buyers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;u1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;u2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TSLA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;u1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;u3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]:&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;buyers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; one loop, two parallel structures, zero &lt;code&gt;if k in d&lt;/code&gt; branches — reach for &lt;code&gt;defaultdict&lt;/code&gt; and &lt;code&gt;dict.get&lt;/code&gt; first.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;counts[k] += 1&lt;/code&gt; on a missing key — raises &lt;code&gt;KeyError&lt;/code&gt;. Use &lt;code&gt;dict.get(k, 0) + 1&lt;/code&gt; or &lt;code&gt;defaultdict(int)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Using a &lt;code&gt;list&lt;/code&gt; instead of a &lt;code&gt;set&lt;/code&gt; for distinct buyers — &lt;code&gt;O(N)&lt;/code&gt; membership checks blow up the asymptotic cost.&lt;/li&gt;
&lt;li&gt;Iterating twice instead of once — wastes work and signals the candidate didn't think about cost.&lt;/li&gt;
&lt;li&gt;Forgetting to skip filtered events at the top of the loop — counts include cancelled / failed trades and produce the wrong answer.&lt;/li&gt;
&lt;li&gt;Returning &lt;code&gt;dict.values()&lt;/code&gt; directly when the test expects a plain list — wrap with &lt;code&gt;list(...)&lt;/code&gt; if the contract demands it.&lt;/li&gt;
&lt;/ul&gt;
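The first two pitfalls are easy to see in a few lines. A quick sketch (hypothetical three-event data, standard library only):

```python
from collections import defaultdict

events = [('u1', 'AAPL'), ('u2', 'TSLA'), ('u1', 'AAPL')]

# Pitfall: counts[k] += 1 on a missing key raises KeyError.
counts = {}
try:
    counts['AAPL'] += 1
except KeyError:
    pass  # first touch of 'AAPL' has no value to increment

# Fix 1: dict.get with a default of 0.
counts = {}
for _user, symbol in events:
    counts[symbol] = counts.get(symbol, 0) + 1

# Fix 2: defaultdict(int) supplies the 0 automatically.
dd_counts = defaultdict(int)
for _user, symbol in events:
    dd_counts[symbol] += 1

assert counts == {'AAPL': 2, 'TSLA': 1}
assert dict(dd_counts) == counts
```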

&lt;h3&gt;
  
  
  Python Interview Question on Stock-Purchase Aggregation
&lt;/h3&gt;

&lt;p&gt;Given a list of completed stock-purchase events &lt;code&gt;(user_id, symbol, quantity)&lt;/code&gt;, return a dict mapping each &lt;code&gt;symbol&lt;/code&gt; to a tuple of (a) the total purchase count and (b) the set of distinct &lt;code&gt;user_id&lt;/code&gt; values that bought it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stock_purchases&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# your code here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;dict.get&lt;/code&gt; and &lt;code&gt;defaultdict(set)&lt;/code&gt; in one pass
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stock_purchases&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;buyers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_qty&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;buyers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;buyers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;dict.get(symbol, 0) + 1&lt;/code&gt; idiom keeps the count loop branch-free; &lt;code&gt;defaultdict(set)&lt;/code&gt; keeps the buyer-set loop branch-free; one pass over &lt;code&gt;events&lt;/code&gt; builds both structures in &lt;code&gt;O(N)&lt;/code&gt; time. The final dict-comprehension stitches them together so each symbol maps to &lt;code&gt;(count, set_of_buyers)&lt;/code&gt; in a single return value.&lt;/p&gt;
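The claim is easy to check end to end: this sketch re-runs the solution function on the traced input and asserts the merged output (standard library only).

```python
from collections import defaultdict

def stock_purchases(events):
    counts: dict[str, int] = {}
    buyers: dict[str, set] = defaultdict(set)
    for user_id, symbol, _qty in events:
        counts[symbol] = counts.get(symbol, 0) + 1
        buyers[symbol].add(user_id)
    return {s: (counts[s], buyers[s]) for s in counts}

events = [('u1', 'AAPL', 50), ('u2', 'TSLA', 10),
          ('u1', 'AAPL', 25), ('u3', 'AAPL', 5), ('u2', 'GOOG', 8)]
result = stock_purchases(events)

# One pass, two parallel structures, merged by the dict-comprehension.
assert result['AAPL'] == (3, {'u1', 'u3'})
assert result['TSLA'] == (1, {'u2'})
assert result['GOOG'] == (1, {'u2'})
```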

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for &lt;code&gt;events = [('u1', 'AAPL', 50), ('u2', 'TSLA', 10), ('u1', 'AAPL', 25), ('u3', 'AAPL', 5), ('u2', 'GOOG', 8)]&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;event&lt;/th&gt;
&lt;th&gt;counts&lt;/th&gt;
&lt;th&gt;buyers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;start&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;defaultdict(set)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(u1, AAPL, 50)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': {'u1'}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(u2, TSLA, 10)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 1, 'TSLA': 1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': {'u1'}, 'TSLA': {'u2'}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(u1, AAPL, 25)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 2, 'TSLA': 1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': {'u1'}, 'TSLA': {'u2'}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(u3, AAPL, 5)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 3, 'TSLA': 1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': {'u1', 'u3'}, 'TSLA': {'u2'}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(u2, GOOG, 8)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': 3, 'TSLA': 1, 'GOOG': 1}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'AAPL': {'u1', 'u3'}, 'TSLA': {'u2'}, 'GOOG': {'u2'}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After the loop, the dict-comprehension &lt;code&gt;{s: (counts[s], buyers[s]) for s in counts}&lt;/code&gt; produces the merged output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;symbol&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;th&gt;distinct_buyers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'u1', 'u3'}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TSLA&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'u2'}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GOOG&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{'u2'}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hash-table dict counter&lt;/strong&gt; — &lt;code&gt;counts[symbol] = counts.get(symbol, 0) + 1&lt;/code&gt; updates each group in &lt;code&gt;O(1)&lt;/code&gt; average time; the &lt;code&gt;dict.get&lt;/code&gt; default of &lt;code&gt;0&lt;/code&gt; keeps the new-key case implicit, with no explicit branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;defaultdict(set)&lt;/code&gt; factory&lt;/strong&gt; — &lt;code&gt;buyers[symbol].add(user_id)&lt;/code&gt; initializes a fresh &lt;code&gt;set&lt;/code&gt; on first touch; subsequent &lt;code&gt;.add&lt;/code&gt; calls dedupe at insertion time, so distinct counting is free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-pass loop&lt;/strong&gt; — two parallel updates inside one &lt;code&gt;for&lt;/code&gt; keep total work at &lt;code&gt;O(N)&lt;/code&gt;; iterating &lt;code&gt;events&lt;/code&gt; twice would double the work for the same answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dict-comprehension merge&lt;/strong&gt; — &lt;code&gt;{s: (counts[s], buyers[s]) for s in counts}&lt;/code&gt; stitches the two parallel structures into the contract-shape return without an explicit accumulator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N)&lt;/code&gt; time / &lt;code&gt;O(K)&lt;/code&gt; space&lt;/strong&gt; — &lt;code&gt;N&lt;/code&gt; events, each touching two hash maps; &lt;code&gt;K&lt;/code&gt; distinct symbols hold the aggregate state. No sorting, no nested scans.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/company/robinhood" rel="noopener noreferrer"&gt;Robinhood Python practice page&lt;/a&gt; for the curated hash-table problem and the &lt;a href="https://pipecode.ai/explore/practice/topic/hash-table/python" rel="noopener noreferrer"&gt;hash-table Python practice page&lt;/a&gt; for breadth.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Company — Robinhood&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Robinhood data engineering problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/robinhood" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — hash table&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Python hash-table problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/hash-table/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — dictionary&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Python dictionary problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/dictionary/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  2. SQL Inner Join and GROUP BY for Top-N Trade Aggregations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Inner-join + group-by + top-N rankings in SQL for Robinhood data engineering
&lt;/h3&gt;

&lt;p&gt;"Given a &lt;code&gt;trades&lt;/code&gt; table and a &lt;code&gt;users&lt;/code&gt; (or &lt;code&gt;members&lt;/code&gt;) table, return the top three cities by completed trade orders" is Robinhood's signature MEDIUM SQL prompt — the same shape as PipeCode's #247 Member Transfer Records and DataLemur's #1 Cities With Completed Trades. The mental model: &lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt; the fact table to the dim table on the shared id, filter rows with &lt;code&gt;WHERE status = 'Completed'&lt;/code&gt;, group by the dim attribute, count or sum, then &lt;code&gt;ORDER BY count DESC LIMIT N&lt;/code&gt; for the top-N&lt;/strong&gt;. The same primitive powers any "rank entities by event volume" pipeline — top countries by signups, top sectors by trade flow, top symbols by buy-side volume.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Filter early with &lt;code&gt;WHERE&lt;/code&gt; (per-row), not late with &lt;code&gt;HAVING&lt;/code&gt; (per-group). &lt;code&gt;WHERE status = 'Completed'&lt;/code&gt; removes cancelled trades before the &lt;code&gt;GROUP BY&lt;/code&gt;, so the &lt;code&gt;COUNT(*)&lt;/code&gt; is correct. Doing it after the group with &lt;code&gt;HAVING&lt;/code&gt; requires an aggregate predicate and is slower; doing it not at all silently inflates the count with cancellations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;INNER JOIN&lt;/code&gt; cardinality: trades 1-to-many users
&lt;/h4&gt;

&lt;p&gt;The join-cardinality invariant: &lt;strong&gt;&lt;code&gt;INNER JOIN trades ON trades.user_id = users.user_id&lt;/code&gt; produces one output row per matching &lt;code&gt;(trade, user)&lt;/code&gt; pair&lt;/strong&gt;. Since each trade has exactly one user but a user can have many trades, the join expands the user table by trade count — the result has one row per trade. &lt;code&gt;INNER JOIN&lt;/code&gt; drops trades whose &lt;code&gt;user_id&lt;/code&gt; has no match in &lt;code&gt;users&lt;/code&gt; (orphans).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt;&lt;/strong&gt; — only matching pairs survive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt;&lt;/strong&gt; — keeps trades with no user match (&lt;code&gt;NULL&lt;/code&gt; city); use when you must report orphans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FULL OUTER JOIN&lt;/code&gt;&lt;/strong&gt; — also keeps users with zero trades; useful for reconciliation, not for top-N.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Join order&lt;/strong&gt; — modern planners reorder tables themselves, so "smallest table on the right" matters little; the explicit &lt;code&gt;INNER&lt;/code&gt; vs &lt;code&gt;LEFT&lt;/code&gt; choice is what signals intent.&lt;/li&gt;
&lt;/ul&gt;
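The cardinality rules can be made concrete with an in-memory SQLite database (a stand-in for Postgres here; the schema is trimmed to the columns the join needs, and the orphan trade for user 777 is invented purely to illustrate the INNER-vs-LEFT difference):

```python
import sqlite3

# In-memory stand-in for the trades/users schema.
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE users  (user_id INTEGER, city TEXT);
    CREATE TABLE trades (order_id INTEGER, user_id INTEGER, status TEXT);
    INSERT INTO users  VALUES (111, 'San Francisco'), (148, 'Boston');
    INSERT INTO trades VALUES
        (100101, 111, 'Completed'),
        (100259, 148, 'Completed'),
        (100999, 777, 'Completed');  -- orphan: no user 777
""")

# INNER JOIN: only matching (trade, user) pairs survive.
inner = con.execute("""
    SELECT t.order_id, u.city FROM trades t
    INNER JOIN users u ON t.user_id = u.user_id
""").fetchall()

# LEFT JOIN: the orphan trade survives with a NULL city.
left = con.execute("""
    SELECT t.order_id, u.city FROM trades t
    LEFT JOIN users u ON t.user_id = u.user_id
""").fetchall()

assert len(inner) == 2           # orphan dropped
assert len(left) == 3            # orphan kept
assert (100999, None) in left    # NULL city for the unmatched user_id
```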

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two-row &lt;code&gt;trades&lt;/code&gt;, two-row &lt;code&gt;users&lt;/code&gt;; &lt;code&gt;INNER JOIN&lt;/code&gt; on &lt;code&gt;user_id&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;trade&lt;/th&gt;
&lt;th&gt;user&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100101 (u=111)&lt;/td&gt;
&lt;td&gt;u=111&lt;/td&gt;
&lt;td&gt;San Francisco&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100259 (u=148)&lt;/td&gt;
&lt;td&gt;u=148&lt;/td&gt;
&lt;td&gt;Boston&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;INNER JOIN&lt;/code&gt; is the default for "this trade definitely has a user"; reach for &lt;code&gt;LEFT JOIN&lt;/code&gt; only when orphans are part of the answer.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;WHERE&lt;/code&gt; before &lt;code&gt;GROUP BY&lt;/code&gt;: filtering rows vs filtering groups
&lt;/h4&gt;

&lt;p&gt;The filter-order invariant: &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; runs per-row before grouping; &lt;code&gt;HAVING&lt;/code&gt; runs per-group after&lt;/strong&gt;. &lt;code&gt;WHERE status = 'Completed'&lt;/code&gt; strips out cancelled and pending trades so the subsequent &lt;code&gt;GROUP BY city&lt;/code&gt; only counts what the question asks about. Putting the same predicate in &lt;code&gt;HAVING&lt;/code&gt; requires &lt;code&gt;HAVING SUM(CASE WHEN status='Completed' THEN 1 ELSE 0 END) &amp;gt; 0&lt;/code&gt; — verbose, slower, and a "bad signal" in the round.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; clause&lt;/strong&gt; — row-level filter; runs first; uses &lt;code&gt;=&lt;/code&gt;, &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;IN&lt;/code&gt;, &lt;code&gt;LIKE&lt;/code&gt;, &lt;code&gt;IS NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt; clause&lt;/strong&gt; — group-level filter; runs after &lt;code&gt;GROUP BY&lt;/code&gt;; uses &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order in the query&lt;/strong&gt; — &lt;code&gt;SELECT ... FROM ... JOIN ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; — &lt;code&gt;WHERE&lt;/code&gt;-then-group reads less data from disk because the grouper sees fewer rows.&lt;/li&gt;
&lt;/ul&gt;
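The filter-placement rule can be verified with a small SQLite sketch (the table is denormalized to carry `city` directly, purely to keep the example short; the data is invented):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE trades (order_id INTEGER, city TEXT, status TEXT);
    INSERT INTO trades VALUES
        (1, 'SF', 'Completed'), (2, 'SF', 'Completed'), (3, 'SF', 'Cancelled'),
        (4, 'Boston', 'Completed'), (5, 'Boston', 'Cancelled');
""")

# Row-level WHERE runs before GROUP BY: only completed trades are counted.
filtered = dict(con.execute("""
    SELECT city, COUNT(*) FROM trades
    WHERE status = 'Completed'
    GROUP BY city
""").fetchall())

# No filter at all: cancelled trades silently inflate every group.
unfiltered = dict(con.execute(
    "SELECT city, COUNT(*) FROM trades GROUP BY city").fetchall())

assert filtered == {'SF': 2, 'Boston': 1}
assert unfiltered == {'SF': 3, 'Boston': 2}
```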

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Filter to &lt;code&gt;status = 'Completed'&lt;/code&gt; before grouping by city.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input rows&lt;/th&gt;
&lt;th&gt;after WHERE&lt;/th&gt;
&lt;th&gt;grouped city → count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;5 (2 dropped)&lt;/td&gt;
&lt;td&gt;SF: 3, Boston: 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Completed'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; push every row-level predicate into &lt;code&gt;WHERE&lt;/code&gt;; reserve &lt;code&gt;HAVING&lt;/code&gt; strictly for predicates that need an aggregate.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;ORDER BY count DESC + LIMIT&lt;/code&gt; for top-N rankings
&lt;/h4&gt;

&lt;p&gt;The top-N invariant: &lt;strong&gt;&lt;code&gt;ORDER BY &amp;lt;metric&amp;gt; DESC&lt;/code&gt; sorts groups by the metric in descending order; &lt;code&gt;LIMIT N&lt;/code&gt; returns only the first &lt;code&gt;N&lt;/code&gt;&lt;/strong&gt;. For ties at the cut, no mainstream dialect guarantees an order: Postgres returns whichever order the planner happens to produce, and Redshift and Snowflake behave the same way. If ties matter, add a deterministic tiebreaker like &lt;code&gt;, u.city ASC&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY total_orders DESC&lt;/code&gt;&lt;/strong&gt; — sorts groups by the aliased column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT 3&lt;/code&gt;&lt;/strong&gt; — first 3 groups in sort order; standard across MySQL, Postgres, Snowflake, and BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OFFSET&lt;/code&gt;&lt;/strong&gt; — pagination; &lt;code&gt;LIMIT 3 OFFSET 3&lt;/code&gt; returns ranks 4–6.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiebreakers&lt;/strong&gt; — &lt;code&gt;, u.city ASC&lt;/code&gt; makes ties deterministic and stable across runs.&lt;/li&gt;
&lt;/ul&gt;
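A minimal SQLite sketch of the tie at the cut (pre-aggregated counts, invented for illustration): with the alphabetical tiebreaker, Austin beats Denver for the last slot, so which count-1 city survives is a deliberate choice rather than planner luck.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE city_totals (city TEXT, total INTEGER);
    INSERT INTO city_totals VALUES
        ('SF', 3), ('Boston', 2), ('Denver', 1), ('Austin', 1);
""")

# Descending metric sort, deterministic alphabetical tiebreaker, top 3.
top3 = con.execute("""
    SELECT city, total FROM city_totals
    ORDER BY total DESC, city ASC
    LIMIT 3
""").fetchall()

assert top3 == [('SF', 3), ('Boston', 2), ('Austin', 1)]
```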

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Sort 4 cities by count and take the top 3.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;total&lt;/th&gt;
&lt;th&gt;rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SF&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boston&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Denver&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Austin&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;(cut)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Completed'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always alias the count column (&lt;code&gt;AS total&lt;/code&gt;) and use the alias in &lt;code&gt;ORDER BY&lt;/code&gt;; mixing literal &lt;code&gt;COUNT(*)&lt;/code&gt; and the alias is style noise.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;COUNT(*)&lt;/code&gt; when only completed trades should count — forgetting &lt;code&gt;WHERE status = 'Completed'&lt;/code&gt; silently inflates totals.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GROUP BY user_id&lt;/code&gt; instead of &lt;code&gt;GROUP BY city&lt;/code&gt; — confuses the dim level with the fact level; the question asks for cities.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LEFT JOIN&lt;/code&gt; when the question asks "completed trades" — orphan trades (no matching user) silently survive with &lt;code&gt;NULL&lt;/code&gt; city and pollute the rank.&lt;/li&gt;
&lt;li&gt;Forgetting the alias in &lt;code&gt;ORDER BY&lt;/code&gt; — &lt;code&gt;ORDER BY COUNT(trades.order_id) DESC&lt;/code&gt; works but reads worse than &lt;code&gt;ORDER BY total_orders DESC&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Returning more than &lt;code&gt;N&lt;/code&gt; rows by skipping &lt;code&gt;LIMIT&lt;/code&gt; — graded as a wrong answer even when the top &lt;code&gt;N&lt;/code&gt; are correct.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Top Cities by Completed Trades
&lt;/h3&gt;

&lt;p&gt;Given the tables &lt;code&gt;trades(order_id, user_id, quantity, status, date, price)&lt;/code&gt; and &lt;code&gt;users(user_id, city, email, signup_date)&lt;/code&gt;, write a query that returns the &lt;strong&gt;top 3 cities&lt;/strong&gt; by number of &lt;strong&gt;completed&lt;/strong&gt; trade orders, in descending order. Output two columns: &lt;code&gt;city&lt;/code&gt; and &lt;code&gt;total_orders&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;INNER JOIN + WHERE + GROUP BY + ORDER BY DESC + LIMIT&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Completed'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;INNER JOIN&lt;/code&gt; pairs each trade with its user (and drops orphan trades with no user match); &lt;code&gt;WHERE status = 'Completed'&lt;/code&gt; removes cancelled trades before grouping; &lt;code&gt;GROUP BY city&lt;/code&gt; collapses rows to one-per-city with &lt;code&gt;COUNT(t.order_id)&lt;/code&gt; as the metric; &lt;code&gt;ORDER BY total_orders DESC&lt;/code&gt; ranks cities; &lt;code&gt;LIMIT 3&lt;/code&gt; returns only the top three.&lt;/p&gt;
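The whole pipeline can be replayed against an in-memory SQLite database (a stand-in for Postgres; the query text is the solution above, and the rows are the sample data from the trace):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE users  (user_id INTEGER, city TEXT);
    CREATE TABLE trades (order_id INTEGER, user_id INTEGER, status TEXT);
    INSERT INTO users VALUES
        (111, 'San Francisco'), (148, 'Boston'), (300, 'San Francisco'),
        (178, 'San Francisco'), (265, 'Denver');
    INSERT INTO trades VALUES
        (100101, 111, 'Cancelled'), (100102, 111, 'Completed'),
        (100259, 148, 'Completed'), (100264, 148, 'Completed'),
        (100305, 300, 'Completed'), (100400, 178, 'Completed'),
        (100565, 265, 'Completed');
""")

# Join, filter, group, rank, cut: the full top-N shape in one statement.
rows = con.execute("""
    SELECT u.city, COUNT(t.order_id) AS total_orders
    FROM trades t
    INNER JOIN users u ON t.user_id = u.user_id
    WHERE t.status = 'Completed'
    GROUP BY u.city
    ORDER BY total_orders DESC
    LIMIT 3
""").fetchall()

assert rows == [('San Francisco', 3), ('Boston', 2), ('Denver', 1)]
```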

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the Robinhood-style sample data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;status&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100101&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;Cancelled&lt;/td&gt;
&lt;td&gt;San Francisco&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100102&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;Completed&lt;/td&gt;
&lt;td&gt;San Francisco&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100259&lt;/td&gt;
&lt;td&gt;148&lt;/td&gt;
&lt;td&gt;Completed&lt;/td&gt;
&lt;td&gt;Boston&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100264&lt;/td&gt;
&lt;td&gt;148&lt;/td&gt;
&lt;td&gt;Completed&lt;/td&gt;
&lt;td&gt;Boston&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100305&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;Completed&lt;/td&gt;
&lt;td&gt;San Francisco&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100400&lt;/td&gt;
&lt;td&gt;178&lt;/td&gt;
&lt;td&gt;Completed&lt;/td&gt;
&lt;td&gt;San Francisco&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100565&lt;/td&gt;
&lt;td&gt;265&lt;/td&gt;
&lt;td&gt;Completed&lt;/td&gt;
&lt;td&gt;Denver&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inner-join&lt;/strong&gt; — every trade's &lt;code&gt;user_id&lt;/code&gt; matches a row in &lt;code&gt;users&lt;/code&gt;; the join produces 7 rows with &lt;code&gt;(order_id, status, city)&lt;/code&gt; paired up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WHERE filter&lt;/strong&gt; — drop &lt;code&gt;100101&lt;/code&gt; (Cancelled). 6 rows remain: 3 SF + 2 Boston + 1 Denver.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group by city&lt;/strong&gt; — collapse to three groups: SF=3, Boston=2, Denver=1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order by count desc&lt;/strong&gt; — SF (3) &amp;gt; Boston (2) &amp;gt; Denver (1).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit 3&lt;/strong&gt; — returns all three (no cut needed).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;total_orders&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;San Francisco&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boston&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Denver&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt; cardinality&lt;/strong&gt; — each trade pairs with exactly one user; orphan trades (none in this dataset) would be dropped, which is the right semantic when the question asks for completed trades by city.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; before &lt;code&gt;GROUP BY&lt;/code&gt;&lt;/strong&gt; — the row-level filter eliminates cancelled trades before grouping, so the count reflects only completed orders; filtering after grouping would force &lt;code&gt;status&lt;/code&gt; into the &lt;code&gt;GROUP BY&lt;/code&gt; (or an aggregate) and would group rows you are about to discard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY u.city&lt;/code&gt;&lt;/strong&gt; — collapses the result to one row per city; the only non-aggregate column in &lt;code&gt;SELECT&lt;/code&gt; must appear here (or be functionally dependent on it).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(t.order_id)&lt;/code&gt;&lt;/strong&gt; — counts non-null &lt;code&gt;order_id&lt;/code&gt; values per group; equivalent to &lt;code&gt;COUNT(*)&lt;/code&gt; here because &lt;code&gt;order_id&lt;/code&gt; is the trades primary key (never null).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY total_orders DESC LIMIT 3&lt;/code&gt;&lt;/strong&gt; — sorts groups by the metric and slices the top three; the alias &lt;code&gt;total_orders&lt;/code&gt; keeps the &lt;code&gt;ORDER BY&lt;/code&gt; clean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|trades| + |users| + G log G)&lt;/code&gt; time&lt;/strong&gt; — &lt;code&gt;|trades|&lt;/code&gt; rows scanned for the join, &lt;code&gt;G&lt;/code&gt; groups sorted (&lt;code&gt;G ≤ |cities|&lt;/code&gt;); &lt;code&gt;O(G)&lt;/code&gt; space for the group hash table.&lt;/li&gt;
&lt;/ul&gt;
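
&lt;p&gt;To sanity-check the trace end to end, here is a minimal runnable sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for Postgres. The &lt;code&gt;trades&lt;/code&gt; and &lt;code&gt;users&lt;/code&gt; table definitions are assumptions reconstructed from the aliases above, not an official schema:&lt;/p&gt;

```python
import sqlite3

# In-memory reproduction of the top-cities query with the article's sample data.
# Schema (trades, users) is reconstructed from the query aliases; an assumption.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users  (user_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE trades (order_id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT);
INSERT INTO users VALUES (111,'San Francisco'),(148,'Boston'),(300,'San Francisco'),
                         (178,'San Francisco'),(265,'Denver');
INSERT INTO trades VALUES (100101,111,'Cancelled'),(100102,111,'Completed'),
                          (100259,148,'Completed'),(100264,148,'Completed'),
                          (100305,300,'Completed'),(100400,178,'Completed'),
                          (100565,265,'Completed');
""")

rows = con.execute("""
SELECT u.city, COUNT(t.order_id) AS total_orders
FROM trades t
INNER JOIN users u ON u.user_id = t.user_id
WHERE t.status = 'Completed'
GROUP BY u.city
ORDER BY total_orders DESC
LIMIT 3;
""").fetchall()
print(rows)  # [('San Francisco', 3), ('Boston', 2), ('Denver', 1)]
```

The printed list matches the output table above: the one cancelled trade is dropped before grouping, and the three city counts come out 3, 2, 1.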

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL join problems&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/company/robinhood" rel="noopener noreferrer"&gt;Robinhood SQL practice page&lt;/a&gt; for the curated MEDIUM problem.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. SQL Window Functions and &lt;code&gt;LAG&lt;/code&gt; for Daily Volume Change
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Window LAG for day-over-day deltas in SQL for Robinhood data engineering
&lt;/h3&gt;

&lt;p&gt;"For each day and each stock, compute the percentage change in trading volume vs the previous day for the same stock" is a Robinhood SQL staple — the same shape that scales up to "reconstruct an account balance from trade events" at the L4/L5 onsite. The mental model: &lt;strong&gt;&lt;code&gt;LAG(value) OVER (PARTITION BY group ORDER BY ts)&lt;/code&gt; looks one row back inside the same partition and returns the prior row's &lt;code&gt;value&lt;/code&gt;; partition keeps each stock's series independent, &lt;code&gt;ORDER BY&lt;/code&gt; fixes the row order, and the day-over-day delta is &lt;code&gt;(value − LAG(value)) / LAG(value)&lt;/code&gt;&lt;/strong&gt;. The same window primitive powers any "delta vs previous" query — running balance, running total via &lt;code&gt;SUM&lt;/code&gt;, rank within a partition via &lt;code&gt;ROW_NUMBER&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcsg51v73ixjfunrudy6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcsg51v73ixjfunrudy6.jpeg" alt="Diagram showing the LAG window function reaching one row back inside a per-stock_symbol partition over trade_date to compute daily percentage change of trading volume." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; The first row in each partition has no predecessor, so &lt;code&gt;LAG&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt;. Decide upfront: filter out the &lt;code&gt;NULL&lt;/code&gt; row with &lt;code&gt;WHERE prev IS NOT NULL&lt;/code&gt;, or pass through with &lt;code&gt;COALESCE(prev, value)&lt;/code&gt; to mark "no change." Stating this explicitly is the senior signal interviewers grade.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Window basics: &lt;code&gt;PARTITION BY group + ORDER BY ordering&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The window-function invariant: &lt;strong&gt;&lt;code&gt;OVER (PARTITION BY &amp;lt;group&amp;gt; ORDER BY &amp;lt;ordering&amp;gt;)&lt;/code&gt; declares an independent ordered subset for every value of the group expression; the function (&lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;ROW_NUMBER&lt;/code&gt;, …) is evaluated within that subset only&lt;/strong&gt;. &lt;code&gt;PARTITION BY stock_symbol&lt;/code&gt; builds one ordered series per ticker; &lt;code&gt;ORDER BY trade_date&lt;/code&gt; fixes the row order inside each.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY&lt;/code&gt; is optional&lt;/strong&gt; — omitted, the whole table is one window; useful for global running totals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY&lt;/code&gt; is required for offset functions&lt;/strong&gt; — &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt; all need a defined order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frames&lt;/strong&gt; — &lt;code&gt;ROWS UNBOUNDED PRECEDING&lt;/code&gt; / &lt;code&gt;BETWEEN ... AND ...&lt;/code&gt; further bound the visible rows for aggregate windows; &lt;code&gt;LAG&lt;/code&gt; and &lt;code&gt;LEAD&lt;/code&gt; ignore the frame, so the default is fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple windows in one query&lt;/strong&gt; — different functions can use different &lt;code&gt;OVER (...)&lt;/code&gt; clauses on the same row.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two &lt;code&gt;(date, symbol, volume)&lt;/code&gt; rows for AAPL; &lt;code&gt;LAG(volume) OVER (PARTITION BY symbol ORDER BY date)&lt;/code&gt; returns the previous day's volume.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;symbol&lt;/th&gt;
&lt;th&gt;volume&lt;/th&gt;
&lt;th&gt;LAG(volume)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-01&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-02&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;1500000&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_volume&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trading_volume&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always set &lt;code&gt;PARTITION BY&lt;/code&gt; to the entity that owns the time series (&lt;code&gt;stock_symbol&lt;/code&gt;, &lt;code&gt;account_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;); without it, day-1 of one symbol leaks across to compute deltas with a different symbol's day-N.&lt;/p&gt;
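
&lt;p&gt;The leakage is easy to demonstrate. A small sketch with Python's &lt;code&gt;sqlite3&lt;/code&gt; (any SQLite build with window functions, 3.25+; the two-symbol dataset is made up for illustration) runs &lt;code&gt;LAG&lt;/code&gt; with and without the partition:&lt;/p&gt;

```python
import sqlite3

# Two symbols, two days each; the partitioned window vs one global window.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tv (trade_date TEXT, symbol TEXT, volume INTEGER)")
con.executemany("INSERT INTO tv VALUES (?, ?, ?)", [
    ("2022-07-01", "AAPL", 1000000),
    ("2022-07-02", "AAPL", 1500000),
    ("2022-07-01", "TSLA", 900000),
    ("2022-07-02", "TSLA", 800000),
])

# Correct: each symbol is its own series, so both day-1 rows get NULL.
good = con.execute("""
SELECT symbol, trade_date,
       LAG(volume) OVER (PARTITION BY symbol ORDER BY trade_date) AS prev
FROM tv ORDER BY symbol, trade_date
""").fetchall()

# Broken: one global window, so TSLA's first row "inherits" AAPL's volume.
bad = con.execute("""
SELECT symbol, trade_date,
       LAG(volume) OVER (ORDER BY trade_date, symbol) AS prev
FROM tv ORDER BY trade_date, symbol
""").fetchall()

print(good)  # day-1 prev is None for BOTH symbols
print(bad)   # TSLA day-1 prev is 1000000 -- AAPL's volume leaked across
```

In the global-window variant every cross-symbol delta is meaningless, which is exactly the bug the rule of thumb above guards against.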

&lt;h4&gt;
  
  
  &lt;code&gt;LAG&lt;/code&gt; vs &lt;code&gt;LEAD&lt;/code&gt;: row-N-before vs row-N-after
&lt;/h4&gt;

&lt;p&gt;The offset-function invariant: &lt;strong&gt;&lt;code&gt;LAG(expr, n)&lt;/code&gt; returns &lt;code&gt;expr&lt;/code&gt; from the row &lt;code&gt;n&lt;/code&gt; positions before the current row inside the partition; &lt;code&gt;LEAD(expr, n)&lt;/code&gt; returns from &lt;code&gt;n&lt;/code&gt; positions after&lt;/strong&gt;. Default &lt;code&gt;n = 1&lt;/code&gt;. Both accept a third &lt;code&gt;default&lt;/code&gt; argument, returned instead of &lt;code&gt;NULL&lt;/code&gt; when the offset row falls outside the partition.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(volume)&lt;/code&gt;&lt;/strong&gt; — previous row's &lt;code&gt;volume&lt;/code&gt; (1 back).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(volume, 7)&lt;/code&gt;&lt;/strong&gt; — 7 rows back; useful for week-over-week deltas on daily data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEAD(volume)&lt;/code&gt;&lt;/strong&gt; — next row's &lt;code&gt;volume&lt;/code&gt; (1 ahead).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(volume, 1, volume)&lt;/code&gt;&lt;/strong&gt; — falls back to the current row's &lt;code&gt;volume&lt;/code&gt; on the first row, so the % change is &lt;code&gt;0&lt;/code&gt; instead of &lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
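
&lt;p&gt;The four offset variants above can be checked side by side on a toy series, again via Python's &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in (the one-column table is invented for illustration):&lt;/p&gt;

```python
import sqlite3

# Three rows, four offset variants in one SELECT.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE s (d INTEGER, v INTEGER)")
con.executemany("INSERT INTO s VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

rows = con.execute("""
SELECT d,
       LAG(v)       OVER (ORDER BY d) AS back1,          -- 1 row back
       LAG(v, 2)    OVER (ORDER BY d) AS back2,          -- 2 rows back
       LEAD(v)      OVER (ORDER BY d) AS ahead1,         -- 1 row ahead
       LAG(v, 1, v) OVER (ORDER BY d) AS back1_or_self   -- default at the boundary
FROM s
ORDER BY d
""").fetchall()
print(rows)  # [(1, None, None, 20, 10), (2, 10, None, 30, 10), (3, 20, 10, None, 20)]
```

Note how the three-argument form replaces the boundary &lt;code&gt;NULL&lt;/code&gt; with the current row's own value, while the plain forms return &lt;code&gt;None&lt;/code&gt; wherever the offset row does not exist.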

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three days of AAPL volume, with &lt;code&gt;LAG&lt;/code&gt; and &lt;code&gt;LEAD&lt;/code&gt; side by side.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;trade_date&lt;/th&gt;
&lt;th&gt;volume&lt;/th&gt;
&lt;th&gt;LAG(volume)&lt;/th&gt;
&lt;th&gt;LEAD(volume)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-01&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;1500000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-02&lt;/td&gt;
&lt;td&gt;1500000&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;td&gt;1800000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-03&lt;/td&gt;
&lt;td&gt;1800000&lt;/td&gt;
&lt;td&gt;1500000&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;stock_symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;stock_symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;next_volume&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trading_volume&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "delta vs yesterday" → &lt;code&gt;LAG&lt;/code&gt;; "delta vs tomorrow" or "next-event distance" → &lt;code&gt;LEAD&lt;/code&gt;. They are mirror images of the same primitive.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;NULL&lt;/code&gt; on the first row: handling the edge with &lt;code&gt;COALESCE&lt;/code&gt; or filtering
&lt;/h4&gt;

&lt;p&gt;The boundary invariant: &lt;strong&gt;the first row of each partition has no preceding row, so &lt;code&gt;LAG(expr)&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt;; arithmetic on &lt;code&gt;NULL&lt;/code&gt; propagates to &lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt;. The two clean strategies are: (a) filter out the &lt;code&gt;NULL&lt;/code&gt; row (&lt;code&gt;WHERE prev IS NOT NULL&lt;/code&gt;) so the result starts on day 2; (b) supply a default upstream with &lt;code&gt;LAG(expr, 1, default)&lt;/code&gt; or &lt;code&gt;COALESCE(LAG(expr), default)&lt;/code&gt; so day 1 has a defined % change (often &lt;code&gt;0&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE prev IS NOT NULL&lt;/code&gt;&lt;/strong&gt; — drops the first-day row entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(volume, 1, volume)&lt;/code&gt;&lt;/strong&gt; — first-day &lt;code&gt;prev_volume&lt;/code&gt; becomes the same row's &lt;code&gt;volume&lt;/code&gt;; % change becomes &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(LAG(volume), 0)&lt;/code&gt;&lt;/strong&gt; — first-day &lt;code&gt;prev_volume&lt;/code&gt; becomes &lt;code&gt;0&lt;/code&gt;, which makes the % change blow up (&lt;code&gt;/ 0&lt;/code&gt;); usually wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULLIF(LAG(volume), 0)&lt;/code&gt;&lt;/strong&gt; — guards against zero-volume divisions; complementary to the boundary fix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Drop the first-day NULL on a 3-row series.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;trade_date&lt;/th&gt;
&lt;th&gt;volume&lt;/th&gt;
&lt;th&gt;volume_change_pct&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-02&lt;/td&gt;
&lt;td&gt;1500000&lt;/td&gt;
&lt;td&gt;50.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-03&lt;/td&gt;
&lt;td&gt;1800000&lt;/td&gt;
&lt;td&gt;20.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;stock_symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;stock_symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;stock_symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
             &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;stock_symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
             &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;volume_change_pct&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trading_volume&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;prev_volume&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; state your boundary policy out loud — "I'll drop day 1" or "I'll show day 1 as &lt;code&gt;NULL&lt;/code&gt;" — before writing the query; interviewers grade the explicit choice.&lt;/p&gt;
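
&lt;p&gt;Both boundary strategies can be run side by side with Python's &lt;code&gt;sqlite3&lt;/code&gt; on the 3-day series (a stand-in sketch, not Postgres itself). Multiplying by &lt;code&gt;100.0&lt;/code&gt; before dividing matters in engines where integer division truncates, Postgres included:&lt;/p&gt;

```python
import sqlite3

# Strategy (a) filter-the-NULL-row vs strategy (b) default-to-self, on one series.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tv (trade_date TEXT, volume INTEGER)")
con.executemany("INSERT INTO tv VALUES (?, ?)",
                [("2022-07-01", 1000000), ("2022-07-02", 1500000), ("2022-07-03", 1800000)])

# (a) Drop the boundary row, so the output starts on day 2.
filtered = con.execute("""
SELECT trade_date, ROUND((volume - prev) * 100.0 / NULLIF(prev, 0), 2) AS pct
FROM (SELECT trade_date, volume,
             LAG(volume) OVER (ORDER BY trade_date) AS prev
      FROM tv) sub
WHERE prev IS NOT NULL
""").fetchall()

# (b) Default the boundary to the row's own volume, so day 1 reads 0%.
defaulted = con.execute("""
SELECT trade_date,
       ROUND((volume - LAG(volume, 1, volume) OVER (ORDER BY trade_date)) * 100.0
             / LAG(volume, 1, volume) OVER (ORDER BY trade_date), 2) AS pct
FROM tv
""").fetchall()

print(filtered)   # [('2022-07-02', 50.0), ('2022-07-03', 20.0)]
print(defaulted)  # [('2022-07-01', 0.0), ('2022-07-02', 50.0), ('2022-07-03', 20.0)]
```

Strategy (a) reproduces the worked-example table exactly; strategy (b) keeps day 1 in the output with a defined 0% change.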

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Forgetting &lt;code&gt;PARTITION BY stock_symbol&lt;/code&gt; — &lt;code&gt;LAG&lt;/code&gt; reaches into the previous symbol's series and computes a meaningless delta.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;ORDER BY trade_date&lt;/code&gt; — &lt;code&gt;LAG&lt;/code&gt; returns whichever row the planner happens to pick; the answer is non-deterministic.&lt;/li&gt;
&lt;li&gt;Dividing by &lt;code&gt;LAG(volume)&lt;/code&gt; when the prior volume could be &lt;code&gt;0&lt;/code&gt; — runtime divide-by-zero; use &lt;code&gt;NULLIF(LAG(volume), 0)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;LEAD&lt;/code&gt; when the prompt says "vs yesterday" — answers the wrong direction.&lt;/li&gt;
&lt;li&gt;Hardcoding the offset (&lt;code&gt;LAG(volume, 1)&lt;/code&gt;) when the spec is "vs 7 days ago" — read the spec and use &lt;code&gt;LAG(volume, 7)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Daily Volume Percentage Change
&lt;/h3&gt;

&lt;p&gt;Given a &lt;code&gt;trading_volume(trade_date, stock_symbol, volume)&lt;/code&gt; table with one row per (date, symbol), write a query that returns, for each row from day 2 onwards, the &lt;strong&gt;daily percentage change in volume vs the previous day for the same symbol&lt;/strong&gt;. Output &lt;code&gt;trade_date&lt;/code&gt;, &lt;code&gt;stock_symbol&lt;/code&gt;, and &lt;code&gt;volume_change_pct&lt;/code&gt; (rounded to 2 decimals).&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;LAG&lt;/code&gt; window function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;stock_symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;stock_symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
         &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;stock_symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="mi"&gt;2&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;volume_change_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trading_volume&lt;/span&gt;
&lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;stock_symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(In Postgres, replace &lt;code&gt;QUALIFY&lt;/code&gt; with a wrapping subquery and &lt;code&gt;WHERE prev_volume IS NOT NULL&lt;/code&gt; — Snowflake / BigQuery / Databricks support &lt;code&gt;QUALIFY&lt;/code&gt; directly.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;PARTITION BY stock_symbol&lt;/code&gt; keeps each ticker's daily series independent; &lt;code&gt;ORDER BY trade_date&lt;/code&gt; fixes ordering inside the partition; &lt;code&gt;LAG(volume)&lt;/code&gt; reaches one row back to fetch yesterday's volume; the formula &lt;code&gt;(today − yesterday) / yesterday × 100&lt;/code&gt; produces the % change; &lt;code&gt;NULLIF(..., 0)&lt;/code&gt; guards against divide-by-zero on zero-volume days; &lt;code&gt;QUALIFY ... IS NOT NULL&lt;/code&gt; filters the first-day boundary so output starts on day 2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for AAPL across 5 days:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;trade_date&lt;/th&gt;
&lt;th&gt;stock_symbol&lt;/th&gt;
&lt;th&gt;volume&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-01&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-02&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;1500000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-03&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;1800000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-04&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;1750000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-05&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;1800000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Partition&lt;/strong&gt; — single partition for AAPL; one ordered series of 5 rows by &lt;code&gt;trade_date&lt;/code&gt; ascending.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(volume)&lt;/code&gt;&lt;/strong&gt; — produces &lt;code&gt;[NULL, 1000000, 1500000, 1800000, 1750000]&lt;/code&gt; for the 5 rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta and division&lt;/strong&gt; — row 2: &lt;code&gt;(1500000 - 1000000) / 1000000 = 0.50&lt;/code&gt;; row 3: &lt;code&gt;(1800000 - 1500000) / 1500000 = 0.20&lt;/code&gt;; row 4: &lt;code&gt;(1750000 - 1800000) / 1800000 = -0.0278&lt;/code&gt;; row 5: &lt;code&gt;(1800000 - 1750000) / 1750000 = 0.0286&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;× 100&lt;/code&gt;&lt;/strong&gt; — convert to percent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROUND(..., 2)&lt;/code&gt;&lt;/strong&gt; — round to 2 decimals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;QUALIFY LAG IS NOT NULL&lt;/code&gt;&lt;/strong&gt; — drops row 1 (first day, no predecessor).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;trade_date&lt;/th&gt;
&lt;th&gt;stock_symbol&lt;/th&gt;
&lt;th&gt;volume_change_pct&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-02&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;50.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-03&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;20.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-04&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;-2.78&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-05&lt;/td&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;2.86&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY stock_symbol&lt;/code&gt;&lt;/strong&gt; — declares one independent window per ticker; without it, AAPL's day-1 would compute a delta against TSLA's day-N.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY trade_date&lt;/code&gt;&lt;/strong&gt; — fixes the row order inside each window; offset functions need an explicit ordering or the result is non-deterministic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(volume)&lt;/code&gt;&lt;/strong&gt; — retrieves the previous row's &lt;code&gt;volume&lt;/code&gt; inside the partition; &lt;code&gt;NULL&lt;/code&gt; on the first row by design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULLIF(LAG(volume), 0)&lt;/code&gt;&lt;/strong&gt; — guards against divide-by-zero on zero-volume days; &lt;code&gt;NULLIF(x, 0)&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt; when &lt;code&gt;x = 0&lt;/code&gt;, propagating cleanly through arithmetic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;QUALIFY LAG IS NOT NULL&lt;/code&gt;&lt;/strong&gt; — drops the first-day row of each partition so the output starts on day 2 with a defined % change; the Postgres equivalent is a subquery &lt;code&gt;WHERE prev_volume IS NOT NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N log N)&lt;/code&gt; time&lt;/strong&gt; — the planner sorts each partition once; &lt;code&gt;O(N)&lt;/code&gt; space for the running window state. For Robinhood-scale daily volumes this is cheap.&lt;/li&gt;
&lt;/ul&gt;
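
&lt;p&gt;The Postgres-style form of the solution (subquery in place of &lt;code&gt;QUALIFY&lt;/code&gt;) can be verified against the traced output with Python's &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in engine and the 5-day AAPL sample above:&lt;/p&gt;

```python
import sqlite3

# Subquery variant of the solution, run on the article's 5-day AAPL trace data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trading_volume (trade_date TEXT, stock_symbol TEXT, volume INTEGER)")
con.executemany("INSERT INTO trading_volume VALUES (?, ?, ?)", [
    ("2022-07-01", "AAPL", 1000000), ("2022-07-02", "AAPL", 1500000),
    ("2022-07-03", "AAPL", 1800000), ("2022-07-04", "AAPL", 1750000),
    ("2022-07-05", "AAPL", 1800000),
])

rows = con.execute("""
SELECT trade_date, stock_symbol,
       ROUND((volume - prev) * 100.0 / NULLIF(prev, 0), 2) AS volume_change_pct
FROM (SELECT trade_date, stock_symbol, volume,
             LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) AS prev
      FROM trading_volume) sub
WHERE prev IS NOT NULL
""").fetchall()

for r in rows:
    print(r)  # four rows: 50.0, 20.0, -2.78, 2.86 for days 2 through 5
```

Day 1 is dropped by the boundary filter, and the four remaining percentages match the output table above.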

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;window-function problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation problems&lt;/a&gt; for the breadth tier.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. SQL Aggregation and &lt;code&gt;HAVING&lt;/code&gt; for Threshold and Notional-Limit Checks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING&lt;/code&gt; for risk-threshold checks in SQL for Robinhood data engineering
&lt;/h3&gt;

&lt;p&gt;"Find every user whose end-of-day option position exceeds the notional limit" is the canonical Robinhood risk / compliance SQL prompt. The mental model: &lt;strong&gt;aggregate per user with &lt;code&gt;SUM(contract_count × strike_price × multiplier)&lt;/code&gt; to compute notional exposure, then filter the resulting groups with &lt;code&gt;HAVING SUM(...) &amp;gt; limit&lt;/code&gt;&lt;/strong&gt;. &lt;code&gt;HAVING&lt;/code&gt; is "WHERE on aggregates" — it filters group rows, not source rows. Same primitive powers any "flag entities whose aggregate metric crosses a policy line" pipeline — flag accounts whose daily trade count exceeds a velocity cap, sectors whose monthly volume exceeds an exposure ceiling, users whose 24-hour withdrawals breach AML thresholds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc89tv8ebxa8p61eo889.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc89tv8ebxa8p61eo889.jpeg" alt="Diagram of per-user notional-position aggregation showing each user's total notional as a horizontal bar with a threshold line, where bars exceeding the threshold are tinted red and flagged in the output table." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Brokerage data is decimal-precision-sensitive. Never store currency or notional as &lt;code&gt;FLOAT&lt;/code&gt; / &lt;code&gt;DOUBLE&lt;/code&gt; — the rounding errors compound and a 1¢ drift becomes a regulatory event. Use &lt;code&gt;NUMERIC(18, 4)&lt;/code&gt; or &lt;code&gt;DECIMAL(18, 4)&lt;/code&gt; for prices, quantities, and notionals. State this constraint when you write the schema; interviewers grade it.&lt;/p&gt;
&lt;/blockquote&gt;
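&lt;p&gt;The drift is easy to demonstrate outside the database. A minimal Python sketch using the standard-library &lt;code&gt;decimal&lt;/code&gt; module — the same exact-decimal semantics &lt;code&gt;NUMERIC&lt;/code&gt; gives you in PostgreSQL:&lt;/p&gt;

```python
from decimal import Decimal

# Ten dimes should make a dollar. Binary floats disagree; decimals don't.
float_total = sum(0.1 for _ in range(10))
decimal_total = sum(Decimal("0.1") for _ in range(10))

print(float_total)    # 0.9999999999999999 — already a sub-cent drift
print(decimal_total)  # 1.0 — exact, like NUMERIC in PostgreSQL
```

&lt;p&gt;Scale that drift across millions of positions and thousands of multiply-and-sum steps and the 1¢ regulatory event in the tip above stops being hypothetical.&lt;/p&gt;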

&lt;h4&gt;
  
  
  &lt;code&gt;WHERE&lt;/code&gt; vs &lt;code&gt;HAVING&lt;/code&gt;: rows vs groups
&lt;/h4&gt;

&lt;p&gt;The filter-stage invariant: &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; filters source rows before grouping; &lt;code&gt;HAVING&lt;/code&gt; filters group rows after grouping&lt;/strong&gt;. &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregates (&lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;); &lt;code&gt;HAVING&lt;/code&gt; can. The execution order is &lt;code&gt;FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE status = 'Open'&lt;/code&gt;&lt;/strong&gt; — row predicate; filters before grouping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING SUM(notional) &amp;gt; 100000&lt;/code&gt;&lt;/strong&gt; — group predicate; filters after grouping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Both in one query&lt;/strong&gt; — &lt;code&gt;WHERE&lt;/code&gt; strips closed positions, &lt;code&gt;HAVING&lt;/code&gt; keeps groups whose remaining sum breaches the limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid duplicating logic&lt;/strong&gt; — if a predicate doesn't need an aggregate, put it in &lt;code&gt;WHERE&lt;/code&gt; for performance; if it does, &lt;code&gt;HAVING&lt;/code&gt; is the only correct place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Filter open positions row-wise, then keep groups whose notional sum &amp;gt; $100K.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;th&gt;groups&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;after WHERE&lt;/td&gt;
&lt;td&gt;12 (open)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;after GROUP BY&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;5 users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;after HAVING &amp;gt; 100K&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;2 users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contract_count&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;strike_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;notional&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;positions&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Open'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contract_count&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;strike_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; aggregate predicates always belong in &lt;code&gt;HAVING&lt;/code&gt;; row predicates always belong in &lt;code&gt;WHERE&lt;/code&gt;. Mixing them up is graded as a conceptual error.&lt;/p&gt;

&lt;h4&gt;
  
  
  Notional aggregation: &lt;code&gt;SUM(contract_count × strike_price × multiplier)&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The notional invariant: &lt;strong&gt;for an equity option, notional = contract_count × strike_price × 100&lt;/strong&gt; (one US equity option contract represents 100 shares). For futures, the multiplier varies by product (ES = 50, NQ = 20). Always express the multiplier explicitly in the SQL — never bake "× 100" silently into an upstream view.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Equity options&lt;/strong&gt; — &lt;code&gt;contract_count × strike_price × 100&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index futures&lt;/strong&gt; — multiplier per contract spec; ES = 50.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Equities (long)&lt;/strong&gt; — &lt;code&gt;qty × price&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;NUMERIC&lt;/code&gt; columns&lt;/strong&gt; — preserve cents through the multiply chain.&lt;/li&gt;
&lt;/ul&gt;
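&lt;p&gt;The per-product multiplier rule can be made explicit in a few lines. A hedged Python sketch — the product keys and helper name are illustrative, with the multipliers taken from the contract specs above:&lt;/p&gt;

```python
from decimal import Decimal

# Multiplier per product family: equity options represent 100 shares per
# contract; ES and NQ use the futures multipliers from their contract specs.
MULTIPLIERS = {"equity_option": Decimal(100), "ES": Decimal(50), "NQ": Decimal(20)}

def notional(product, contract_count, price):
    """Notional = contracts x price x explicit product multiplier, in Decimal."""
    return Decimal(contract_count) * Decimal(str(price)) * MULTIPLIERS[product]

print(notional("equity_option", 200, 80))  # 1600000
```

&lt;p&gt;Keeping the multiplier in a named table rather than a bare &lt;code&gt;× 100&lt;/code&gt; is the same discipline as spelling it out in the SQL: the reviewer can read the units.&lt;/p&gt;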

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three positions for &lt;code&gt;u2&lt;/code&gt;: 200 contracts at $80, all open.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;contract_count&lt;/th&gt;
&lt;th&gt;strike&lt;/th&gt;
&lt;th&gt;notional&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;u2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;200 × 80 × 100 = 1,600,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contract_count&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;strike_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;notional&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;positions&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Open'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; spell out the multiplier every time and use a &lt;code&gt;NUMERIC&lt;/code&gt; column for &lt;code&gt;strike_price&lt;/code&gt; and &lt;code&gt;contract_count&lt;/code&gt;; floats here will quietly lose pennies and the audit log will catch you.&lt;/p&gt;

&lt;h4&gt;
  
  
  Combining &lt;code&gt;HAVING&lt;/code&gt; with &lt;code&gt;COUNT&lt;/code&gt; and &lt;code&gt;SUM&lt;/code&gt; thresholds
&lt;/h4&gt;

&lt;p&gt;The composition invariant: &lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt; accepts any boolean expression over aggregates and grouping columns&lt;/strong&gt;. You can combine &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;, and the grouping keys with &lt;code&gt;AND&lt;/code&gt; / &lt;code&gt;OR&lt;/code&gt;. Common compound predicates: "more than 10 trades AND notional &amp;gt; $100K", "MAX position older than 7 days", "average ticket size &amp;gt; $5K and at least 5 trades."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt; 10&lt;/code&gt;&lt;/strong&gt; — at least 11 trades in the group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING SUM(notional) &amp;gt; 100000 AND COUNT(*) &amp;gt;= 5&lt;/code&gt;&lt;/strong&gt; — both conditions must hold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING AVG(price) &amp;gt; 100 OR MAX(price) &amp;gt; 500&lt;/code&gt;&lt;/strong&gt; — either condition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING MAX(trade_date) &amp;lt; CURRENT_DATE - INTERVAL '7' DAY&lt;/code&gt;&lt;/strong&gt; — group has been quiet for a week.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Flag users with ≥5 trades AND notional &amp;gt; $100K.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user&lt;/th&gt;
&lt;th&gt;trades&lt;/th&gt;
&lt;th&gt;notional&lt;/th&gt;
&lt;th&gt;flagged&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;u1&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;td&gt;no (notional fail)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u2&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;1,600,000&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;200,000&lt;/td&gt;
&lt;td&gt;no (count fail)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;trades&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contract_count&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;strike_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;notional&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;positions&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Open'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contract_count&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;strike_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; compose the &lt;code&gt;HAVING&lt;/code&gt; predicate exactly as the policy or rule reads; don't pre-aggregate into a CTE just to put &lt;code&gt;WHERE&lt;/code&gt; on top — &lt;code&gt;HAVING&lt;/code&gt; is the right tool.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;WHERE SUM(notional) &amp;gt; 100000&lt;/code&gt; — &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregates; the parser rejects it. Use &lt;code&gt;HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Computing notional in a subquery and then re-aggregating outside — works but is the "long way around"; one &lt;code&gt;GROUP BY + HAVING&lt;/code&gt; is cleaner.&lt;/li&gt;
&lt;li&gt;Storing &lt;code&gt;strike_price&lt;/code&gt; as &lt;code&gt;FLOAT&lt;/code&gt; — accumulates rounding error; use &lt;code&gt;NUMERIC(18, 4)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Forgetting the &lt;code&gt;× 100&lt;/code&gt; multiplier on equity options — the notional comes out 100× too small.&lt;/li&gt;
&lt;li&gt;Including closed positions — forgetting &lt;code&gt;WHERE status = 'Open'&lt;/code&gt; inflates notional and produces false positives.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Notional-Limit Threshold Check
&lt;/h3&gt;

&lt;p&gt;Given a &lt;code&gt;positions(user_id, contract_count, strike_price, status)&lt;/code&gt; table where &lt;code&gt;status&lt;/code&gt; is &lt;code&gt;'Open'&lt;/code&gt; or &lt;code&gt;'Closed'&lt;/code&gt; and each row is one option position (equity-option multiplier = 100), write a query that returns every &lt;code&gt;user_id&lt;/code&gt; whose &lt;strong&gt;total open notional exceeds $100,000&lt;/strong&gt;. Output &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;notional&lt;/code&gt;, ordered by notional descending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING SUM(...) &amp;gt; limit&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contract_count&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;strike_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;notional&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;positions&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Open'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contract_count&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;strike_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;notional&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;WHERE status = 'Open'&lt;/code&gt; strips closed positions before grouping so the aggregate only reflects current exposure; &lt;code&gt;GROUP BY user_id&lt;/code&gt; collapses to one row per user; the equity-option multiplier &lt;code&gt;× 100&lt;/code&gt; produces real-dollar notional inside &lt;code&gt;SUM&lt;/code&gt;; &lt;code&gt;HAVING&lt;/code&gt; filters groups whose summed notional crosses the policy limit; &lt;code&gt;ORDER BY notional DESC&lt;/code&gt; ranks the breaches by severity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;contract_count&lt;/th&gt;
&lt;th&gt;strike_price&lt;/th&gt;
&lt;th&gt;status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;u1&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;td&gt;Open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;Open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u3&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;Open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u4&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;Open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u5&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u4&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;WHERE filter&lt;/strong&gt; — drops the closed &lt;code&gt;u4&lt;/code&gt; row. 5 open rows remain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-row notional&lt;/strong&gt; — &lt;code&gt;u1: 50×120×100 = 600,000&lt;/code&gt;, &lt;code&gt;u2: 200×80×100 = 1,600,000&lt;/code&gt;, &lt;code&gt;u3: 30×150×100 = 450,000&lt;/code&gt;, &lt;code&gt;u4: 80×90×100 = 720,000&lt;/code&gt;, &lt;code&gt;u5: 10×100×100 = 100,000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group by user_id&lt;/strong&gt; — already one open row per user; the &lt;code&gt;SUM&lt;/code&gt; collapses each one to itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HAVING &amp;gt; 100,000&lt;/strong&gt; — strict greater-than; &lt;code&gt;u5&lt;/code&gt; at exactly &lt;code&gt;100,000&lt;/code&gt; does not pass; &lt;code&gt;u1&lt;/code&gt; (&lt;code&gt;600K&lt;/code&gt;), &lt;code&gt;u2&lt;/code&gt; (&lt;code&gt;1.6M&lt;/code&gt;), &lt;code&gt;u3&lt;/code&gt; (&lt;code&gt;450K&lt;/code&gt;), &lt;code&gt;u4&lt;/code&gt; (&lt;code&gt;720K&lt;/code&gt;) do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order by notional desc&lt;/strong&gt; — &lt;code&gt;u2 &amp;gt; u4 &amp;gt; u1 &amp;gt; u3&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;notional&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;u2&lt;/td&gt;
&lt;td&gt;1,600,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u4&lt;/td&gt;
&lt;td&gt;720,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u1&lt;/td&gt;
&lt;td&gt;600,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u3&lt;/td&gt;
&lt;td&gt;450,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
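&lt;p&gt;The whole trace can be replayed in a few lines of Python — a sanity-check sketch over the sample rows above, not production code:&lt;/p&gt;

```python
# Sample rows from the trace: (user_id, contract_count, strike_price, status)
rows = [
    ("u1", 50, 120, "Open"), ("u2", 200, 80, "Open"), ("u3", 30, 150, "Open"),
    ("u4", 80, 90, "Open"), ("u5", 10, 100, "Open"), ("u4", 25, 50, "Closed"),
]

# WHERE status = 'Open'  →  GROUP BY user_id with SUM(count * strike * 100)
notional = {}
for user, contracts, strike, status in rows:
    if status == "Open":
        notional[user] = notional.get(user, 0) + contracts * strike * 100

# HAVING SUM(...) > 100000 (strict, so u5 at exactly 100,000 is excluded),
# then ORDER BY notional DESC
flagged = sorted(((u, n) for u, n in notional.items() if n > 100_000),
                 key=lambda t: -t[1])
print(flagged)  # [('u2', 1600000), ('u4', 720000), ('u1', 600000), ('u3', 450000)]
```

&lt;p&gt;The output matches the table row for row, including the strict-greater exclusion of &lt;code&gt;u5&lt;/code&gt;.&lt;/p&gt;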

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; row filter&lt;/strong&gt; — strips closed positions before grouping; &lt;code&gt;HAVING&lt;/code&gt; cannot do this work because the predicate is per-row, not per-group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(contract_count * strike_price * 100)&lt;/code&gt;&lt;/strong&gt; — computes equity-option notional with the &lt;code&gt;× 100&lt;/code&gt; multiplier baked into the aggregate; spelled out so the reviewer can read the units.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY user_id&lt;/code&gt;&lt;/strong&gt; — collapses each user's positions to one row with summed notional; the only non-aggregate column in &lt;code&gt;SELECT&lt;/code&gt; matches the &lt;code&gt;GROUP BY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING SUM(...) &amp;gt; 100000&lt;/code&gt;&lt;/strong&gt; — filters group rows; aggregate predicates have to live here. Strict-greater means &lt;code&gt;100000&lt;/code&gt; itself does not breach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY notional DESC&lt;/code&gt;&lt;/strong&gt; — ranks breaches by severity so the on-call engineer can triage worst-first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|positions| + G log G)&lt;/code&gt; time&lt;/strong&gt; — one scan for the join-free &lt;code&gt;GROUP BY&lt;/code&gt;, then &lt;code&gt;O(G log G)&lt;/code&gt; to sort the group output where &lt;code&gt;G&lt;/code&gt; is the number of users with open positions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Practice next:&lt;/strong&gt; more &lt;a href="https://pipecode.ai/explore/practice/topic/having-clause/sql" rel="noopener noreferrer"&gt;SQL HAVING-clause problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation problems&lt;/a&gt; for breadth.&lt;/p&gt;





&lt;h2&gt;
  
  
  Tips to crack Robinhood data engineering interviews
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bilingual profile — SQL and Python both, equally weighted
&lt;/h3&gt;

&lt;p&gt;Unlike Cisco's Python-only loop, Robinhood splits the coding rounds between SQL and Python. The curated 2-problem &lt;a href="https://pipecode.ai/explore/practice/company/robinhood" rel="noopener noreferrer"&gt;Robinhood practice set&lt;/a&gt; is &lt;strong&gt;1 EASY Python hash-table + 1 MEDIUM SQL joins&lt;/strong&gt; — bilingual by design. Spending all your prep time on Python means losing the SQL round; spending it all on SQL means stuttering on the dict-counter Python prompt. Allocate roughly half-half across the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice surface&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice surface&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drill the four primitives
&lt;/h3&gt;

&lt;p&gt;The four primitives in this guide map directly to the curated Robinhood set plus the two adjacent SQL patterns every Robinhood SQL question list rotates through: hash-table dict counter (Python, EASY — Stock Purchases Count), &lt;code&gt;INNER JOIN + GROUP BY + ORDER DESC + LIMIT&lt;/code&gt; (SQL, MEDIUM — Member Transfer Records / cities-completed-trades), &lt;code&gt;LAG&lt;/code&gt; window function (SQL — daily volume change), &lt;code&gt;GROUP BY + HAVING&lt;/code&gt; (SQL — notional-limit checks). Each maps to a specific module: vanilla &lt;code&gt;dict&lt;/code&gt; and &lt;code&gt;collections.defaultdict&lt;/code&gt; for the Python primitive, &lt;code&gt;INNER JOIN&lt;/code&gt; and &lt;code&gt;GROUP BY&lt;/code&gt; for the joins primitive, &lt;code&gt;OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt; for window LAG, &lt;code&gt;HAVING&lt;/code&gt; for the aggregate-threshold primitive.&lt;/p&gt;
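&lt;p&gt;For the Python hash-table primitive, the canonical shape is a single-pass dict counter. A hypothetical sketch of the pattern — the function name and event shape are illustrative, and the actual Stock Purchases Count prompt may differ:&lt;/p&gt;

```python
from collections import defaultdict

def purchases_per_user(events):
    """Count purchase events per user in one pass — the dict-counter primitive."""
    counts = defaultdict(int)
    for user_id, action in events:
        if action == "purchase":
            counts[user_id] += 1
    return dict(counts)

events = [("u1", "purchase"), ("u2", "view"), ("u1", "purchase"), ("u3", "purchase")]
print(purchases_per_user(events))  # {'u1': 2, 'u3': 1}
```

&lt;p&gt;O(N) time, O(users) space, one pass — state the invariant out loud before you type it.&lt;/p&gt;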

&lt;h3&gt;
  
  
  Penny-perfect correctness is the bar
&lt;/h3&gt;

&lt;p&gt;Robinhood is a regulated brokerage; every cent of every account balance is auditable. Float-typed currency, rounding-tolerant joins, and "approximately right" pipelines are downgrade signals in the round. Use &lt;code&gt;NUMERIC(18, 4)&lt;/code&gt; for prices and notionals; spell out the equity-option &lt;code&gt;× 100&lt;/code&gt; multiplier; volunteer "I would store this as decimal, never float" when you write the schema. State idempotency requirements when designing pipelines — every job can be re-run without double-counting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Event-sourced thinking — positions are folded from immutable trade events
&lt;/h3&gt;

&lt;p&gt;Robinhood's distinctive architecture pattern is event sourcing: account balances and positions are not mutable rows you &lt;code&gt;UPDATE&lt;/code&gt;; they are derived by folding an immutable log of trade and cash events. CRUD-style designs in interview answers underperform. Frame state as "I would write trades to an append-only Kafka log; the position table is a materialized fold of those events; corrections arrive as new events, not as &lt;code&gt;UPDATE&lt;/code&gt; statements." This single framing flip unlocks the senior signal interviewers grade.&lt;/p&gt;
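&lt;p&gt;The framing is easy to make concrete. A minimal Python fold over an append-only event list — field names are illustrative, not Robinhood's schema:&lt;/p&gt;

```python
# Each event is immutable: corrections arrive as new events, never as UPDATEs.
events = [
    {"user": "u1", "symbol": "AAPL", "qty": 10},   # buy 10
    {"user": "u1", "symbol": "AAPL", "qty": 5},    # buy 5 more
    {"user": "u1", "symbol": "AAPL", "qty": -3},   # correction arrives as a new event
]

def fold_positions(events):
    """Materialize positions by folding the event log. Re-running the fold is
    idempotent because state is derived from the log, never mutated in place."""
    positions = {}
    for e in events:
        key = (e["user"], e["symbol"])
        positions[key] = positions.get(key, 0) + e["qty"]
    return positions

print(fold_positions(events))  # {('u1', 'AAPL'): 12}
```

&lt;p&gt;Replace the in-memory list with a Kafka topic and the dict with a materialized table and you have the interview answer in one breath.&lt;/p&gt;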

&lt;h3&gt;
  
  
  T+1 settlement and brokerage glossary as table stakes
&lt;/h3&gt;

&lt;p&gt;Knowing the basics — settlement (T+1 since May 2024 in the US), corporate actions (splits, dividends, mergers), options assignment / exercise / expiration, FIFO / LIFO accounting, wash sales (same-security repurchase within 30 days disallows the loss) — gives a major edge. You don't need a finance degree; you do need to recognize the terms when the interviewer drops them. Spend a week with a brokerage glossary if your background is non-finance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Easy-Medium discipline matters
&lt;/h3&gt;

&lt;p&gt;The curated Robinhood set is &lt;strong&gt;1 EASY + 1 MEDIUM&lt;/strong&gt;. Easy at Robinhood doesn't mean trivial — it means the interviewer expects zero hesitation, idiomatic code, and an articulated invariant. A correct EASY answer with stuttering or a missing edge case is graded worse than a correct MEDIUM answer with the same flaw. Drill the &lt;a href="https://pipecode.ai/explore/practice/difficulty/easy" rel="noopener noreferrer"&gt;easy practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/difficulty/medium" rel="noopener noreferrer"&gt;medium practice page&lt;/a&gt; until the canonical EASY-tier code rolls off your fingers in under three minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/company/robinhood" rel="noopener noreferrer"&gt;Robinhood practice page&lt;/a&gt; for the curated 2-problem set. After that, drill the matching topic pages: &lt;a href="https://pipecode.ai/explore/practice/topic/hash-table/python" rel="noopener noreferrer"&gt;hash table&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/dictionary/python" rel="noopener noreferrer"&gt;dictionary&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;joins&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/group-by/sql" rel="noopener noreferrer"&gt;group by&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;aggregation&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;window functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/having-clause/sql" rel="noopener noreferrer"&gt;having clause&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;filtering&lt;/a&gt;. The &lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;interview courses page&lt;/a&gt; bundles structured curricula. For a broader set, &lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;browse by topic&lt;/a&gt; or pivot to peer fintech with the &lt;a href="https://pipecode.ai/blogs/airbnb-data-engineering-interview-questions-prep-guide" rel="noopener noreferrer"&gt;Airbnb DE interview guide&lt;/a&gt; and the &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top DE interview questions 2026&lt;/a&gt; blog.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communication and approach under time pressure
&lt;/h3&gt;

&lt;p&gt;Talk through the invariant first ("this is a &lt;code&gt;LAG&lt;/code&gt;-on-partitioned-time-series problem"), the brute force second ("a self-join on &lt;code&gt;date - 1&lt;/code&gt; would also work"), and the optimal third ("but &lt;code&gt;LAG&lt;/code&gt; is the idiomatic and faster move"). Interviewers grade &lt;strong&gt;process&lt;/strong&gt; as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, single-row partitions, duplicate trade events, divide-by-zero on prior-day volume, decimal precision on the notional sum. The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the Robinhood data engineering interview process like?
&lt;/h3&gt;

&lt;p&gt;The Robinhood data engineering interview opens with a 30-minute recruiter screen, then a 60-minute technical phone screen with one live SQL or Python coding problem, then a 4-round virtual onsite: a system-design round (commonly trade reconciliation, regulatory reporting, or position event-sourcing), a live coding round in the language you didn't see in the phone screen, a data-modeling discussion (often Type 2 SCD on financial dimensions), and a behavioral round. Robinhood interviewers grade integrity and compliance-mindedness heavily; bring a postmortem-style story about a hard call between speed and correctness. End-to-end the loop runs three to four weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  What programming languages does Robinhood test in data engineering interviews?
&lt;/h3&gt;

&lt;p&gt;Robinhood data engineering interviews are bilingual — SQL and Python in roughly equal measure across the loop. SQL questions concentrate on joins, aggregations with &lt;code&gt;GROUP BY&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt;, window functions (&lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, &lt;code&gt;SUM OVER&lt;/code&gt;), and financial-precision patterns (account balance reconstruction, FIFO P&amp;amp;L, trade reconciliation). Python questions concentrate on hash-table counters, dict-of-sets aggregation, event-stream dedup with TTL, and event sourcing folds. Go and Scala appear at backend-leaning DE roles but are not expected in the coding rounds.&lt;/p&gt;
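&lt;p&gt;&lt;strong&gt;Illustrative sketch.&lt;/strong&gt; The event-stream dedup-with-TTL pattern reduces to a dict from event id to last-seen timestamp. This is a generic sliding-TTL sketch rather than a specific PipeCode problem; the &lt;code&gt;(event_id, ts)&lt;/code&gt; shape is illustrative.&lt;/p&gt;

```python
def dedup_with_ttl(events, ttl):
    """events: iterable of (event_id, ts) pairs with ts non-decreasing.

    Emits an event unless the same id was seen within the last `ttl`
    seconds; duplicates refresh the timestamp (sliding-TTL variant).
    """
    last_seen = {}
    out = []
    for event_id, ts in events:
        prev = last_seen.get(event_id)
        # emit if never seen, or the last sighting is old enough that the TTL expired
        if prev is None or ts - prev >= ttl:
            out.append((event_id, ts))
        last_seen[event_id] = ts
    return out
```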

&lt;h3&gt;
  
  
  How difficult are Robinhood data engineering interview questions?
&lt;/h3&gt;

&lt;p&gt;The curated Robinhood practice set on PipeCode is &lt;strong&gt;1 easy and 1 medium&lt;/strong&gt;, no hard. The EASY is a Python hash-table dict-counter problem (Stock Purchases Count); the MEDIUM is a SQL joins + filtering problem (Member Transfer Records / top-3 cities by completed trades). At the onsite, system-design and modeling questions reach L4-L5 level — trade reconciliation pipelines, regulatory CAT reporting, position event-sourcing — but the live coding rounds stay in the EASY-MEDIUM zone for IC2-IC3 hires. Stuttering on the EASY is a stronger negative signal than struggling with the MEDIUM.&lt;/p&gt;
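&lt;p&gt;&lt;strong&gt;Illustrative sketch.&lt;/strong&gt; The dict-counter shape behind the EASY is standard &lt;code&gt;collections.Counter&lt;/code&gt; work. The exact Stock Purchases Count schema is not reproduced here, so the &lt;code&gt;(user_id, ticker)&lt;/code&gt; fields are illustrative.&lt;/p&gt;

```python
from collections import Counter

def purchases_per_user(purchases):
    """purchases: list of (user_id, ticker) events.

    Returns a Counter mapping user_id to purchase count.
    """
    return Counter(user_id for user_id, _ticker in purchases)
```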

&lt;h3&gt;
  
  
  How should I prepare for a Robinhood data engineer interview?
&lt;/h3&gt;

&lt;p&gt;Solve the 2 problems on the &lt;a href="https://pipecode.ai/explore/practice/company/robinhood" rel="noopener noreferrer"&gt;Robinhood practice page&lt;/a&gt; end-to-end — untimed first, then timed at 25 minutes per problem — and broaden to &lt;strong&gt;30 to 50 additional problems&lt;/strong&gt; across the matching topic pages: hash table and dictionary on the Python side, joins, group-by, aggregation, window functions, and having-clause on the SQL side. Read a brokerage glossary for a week (settlement, corporate actions, options assignment, FIFO accounting, wash sales). Practice articulating idempotency and audit-log requirements when discussing pipeline design — those framings are graded heavily at Robinhood.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the Robinhood data engineer salary range?
&lt;/h3&gt;

&lt;p&gt;Robinhood data engineer total compensation runs from roughly $170K (IC2, 2-4 years experience) to $620K (IC5, multi-org technical leadership). Senior Data Engineer (IC3) is the most common external hire at $240K–$370K total comp. Staff (IC4) sits at $330K–$500K. Average base salary across all levels lands around $163K with median $150K; total compensation averages around $221K when RSU refreshers and bonus are included. Negotiation success rates run 10–25% with competing offers per verified levels.fyi data.&lt;/p&gt;

&lt;h3&gt;
  
  
  What financial-domain knowledge do Robinhood interviewers expect?
&lt;/h3&gt;

&lt;p&gt;Robinhood interviewers expect candidates to recognize and use brokerage-domain terms without flinching: settlement (T+1), corporate actions (splits, dividends, mergers — they require historical-position re-multiplication), options (assignment, exercise, expiration, the &lt;code&gt;× 100&lt;/code&gt; equity-option multiplier), accounting methods (FIFO vs LIFO vs average cost — FIFO is the default for tax lots), wash sales (loss disallowed if same security repurchased within 30 days), regulatory pipelines (CAT — Consolidated Audit Trail — is FINRA-mandated). You don't need a finance background to land the role, but you do need to converse fluently in these terms; a one-week brokerage-glossary sprint is enough.&lt;/p&gt;
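&lt;p&gt;&lt;strong&gt;Illustrative sketch.&lt;/strong&gt; FIFO lot matching is worth being able to code from scratch. The toy model below computes realized profit and loss by consuming the oldest lots first; it deliberately ignores fees, wash-sale adjustments, and the options multiplier.&lt;/p&gt;

```python
from collections import deque

def fifo_realized_pnl(buys, sells):
    """buys: list of (qty, price) in purchase order; sells: list of (qty, price).

    Matches each sale against the oldest open lots first (FIFO) and
    returns total realized profit and loss.
    """
    lots = deque(buys)  # oldest lot sits at the left
    pnl = 0.0
    for sell_qty, sell_price in sells:
        remaining = sell_qty
        while remaining > 0:
            lot_qty, lot_price = lots[0]
            take = min(lot_qty, remaining)
            pnl += take * (sell_price - lot_price)
            remaining -= take
            if take == lot_qty:
                lots.popleft()  # lot fully consumed
            else:
                lots[0] = (lot_qty - take, lot_price)  # partial fill of the oldest lot
    return pnl
```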




&lt;h2&gt;
  
  
  Start practicing Robinhood data engineering problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Bloomberg Data Engineering Interview Questions: Full DE Prep Guide</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Sat, 02 May 2026 06:17:38 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/bloomberg-data-engineering-interview-questions-full-de-prep-guide-43</link>
      <guid>https://dev.to/gowthampotureddi/bloomberg-data-engineering-interview-questions-full-de-prep-guide-43</guid>
      <description>&lt;p&gt;&lt;strong&gt;Bloomberg data engineering interview questions&lt;/strong&gt; sit at the intersection of three narrow, production-grade patterns: &lt;strong&gt;Python two-pointer and string manipulation&lt;/strong&gt; that reverses words in a sentence using &lt;code&gt;s.split()&lt;/code&gt; plus &lt;code&gt;[::-1]&lt;/code&gt; plus &lt;code&gt;' '.join(...)&lt;/code&gt;, &lt;strong&gt;production-quality OOP and abstract classes&lt;/strong&gt; that subclass an &lt;code&gt;ABC&lt;/code&gt; base with &lt;code&gt;@abstractmethod&lt;/code&gt; &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, and &lt;code&gt;write&lt;/code&gt; to stream a chunked CSV into line-delimited JSON without loading the whole file into memory, and &lt;strong&gt;SQL window functions for time-series and overlap analysis&lt;/strong&gt; with &lt;code&gt;SUM(volume) OVER (PARTITION BY symbol ORDER BY trade_date)&lt;/code&gt; for rolling totals and &lt;code&gt;EXISTS&lt;/code&gt; subqueries with interval crosschecks for subscription-overlap detection. The schema you reason over feels like Bloomberg's own product (&lt;code&gt;ticks&lt;/code&gt;, &lt;code&gt;subscriptions&lt;/code&gt;, &lt;code&gt;provider_feeds&lt;/code&gt;, &lt;code&gt;corporate_actions&lt;/code&gt;), and the bar is fluency with &lt;strong&gt;two-pointer index arithmetic&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;from abc import ABC, abstractmethod&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;PARTITION BY ... ORDER BY ...&lt;/code&gt;&lt;/strong&gt; under tie-break and overlap edge cases.&lt;/p&gt;

&lt;p&gt;This guide walks through the four topic clusters Bloomberg actually tests, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, and an &lt;strong&gt;interview-style problem with a full solution&lt;/strong&gt; that explains why it works. The mix matches the curated 2-problem Bloomberg set (1 easy, 1 hard) plus one supplementary SQL section anchored on patterns Bloomberg's external interview reports surface heavily — a Python-and-SQL loop where stating the &lt;strong&gt;invariant&lt;/strong&gt; out loud is half the score and the other half is typing production-quality code on the first try. Strong &lt;strong&gt;data engineer interview questions&lt;/strong&gt; prep at Bloomberg is less about contest difficulty (the pass rate sits at 8% across recent samples) and more about clean class boundaries, deterministic ordering, streaming-friendly file I/O, and the kind of code review you would survive on day three of the job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn96as4ofkwndvft3ht6i.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn96as4ofkwndvft3ht6i.jpeg" alt="Bold dark thumbnail for the PipeCode guide to Bloomberg data engineering interview questions, with SQL window-function and Python OOP chips in purple, green, and blue accents." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top Bloomberg data engineering interview topics
&lt;/h2&gt;

&lt;p&gt;From the &lt;a href="https://pipecode.ai/explore/practice/company/bloomberg" rel="noopener noreferrer"&gt;Bloomberg data engineering practice set&lt;/a&gt;, the &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; (one row per &lt;strong&gt;H2&lt;/strong&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic (sections &lt;strong&gt;1–4&lt;/strong&gt;)&lt;/th&gt;
&lt;th&gt;Why it shows up at Bloomberg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;The Bloomberg data engineering interview process&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Four-stage funnel — phone screen (45-60 min) → online assessment (some roles) → onsite of 3-5 rounds → final / fit. ~30 days end to end.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Python two-pointer and string manipulation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reverse Words in String (EASY) — &lt;code&gt;s.split()&lt;/code&gt; + &lt;code&gt;[::-1]&lt;/code&gt; + &lt;code&gt;' '.join(...)&lt;/code&gt;, or two-pointer in-place reversal of a character array.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Abstract classes, OOP, and file I/O in Python&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chunked CSV to Line-Delimited JSON Processor (HARD) — subclass an abstract &lt;code&gt;RecordProcessor&lt;/code&gt;, implement &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;, and stream &lt;code&gt;csv.DictReader&lt;/code&gt; rows into &lt;code&gt;json.dumps(...) + '\n'&lt;/code&gt; without loading the whole file.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL window functions for time-series and overlap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rolling-sum aggregates and Subscription Overlap — &lt;code&gt;SUM(volume) OVER (PARTITION BY symbol ORDER BY trade_date)&lt;/code&gt; for rolling totals, plus &lt;code&gt;EXISTS&lt;/code&gt; interval crosscheck for overlap detection.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bloomberg-flavor framing rule:&lt;/strong&gt; Bloomberg's prompts model the company's own product — market-data ticks, subscription feeds, provider data with corporate actions, regulatory-grade entitlements. The interviewer is grading whether you map each business framing to the right primitive: reverse-and-rejoin → &lt;code&gt;split&lt;/code&gt; + &lt;code&gt;[::-1]&lt;/code&gt; + &lt;code&gt;join&lt;/code&gt;; processor with pluggable transforms → &lt;code&gt;ABC&lt;/code&gt; with &lt;code&gt;@abstractmethod&lt;/code&gt;; rolling aggregates → &lt;code&gt;SUM OVER (PARTITION BY symbol ORDER BY ts)&lt;/code&gt;; interval overlap → &lt;code&gt;EXISTS&lt;/code&gt; with &lt;code&gt;&amp;lt;&lt;/code&gt; and &lt;code&gt;&amp;gt;&lt;/code&gt; bound checks. State the mapping out loud, then type production-quality code.&lt;/p&gt;
&lt;/blockquote&gt;
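&lt;p&gt;&lt;strong&gt;Illustrative sketch.&lt;/strong&gt; The interval-overlap bound check in the last mapping is the one candidates most often fumble. Written in Python to sidestep dialect quibbles, and assuming half-open &lt;code&gt;[start, end)&lt;/code&gt; intervals:&lt;/p&gt;

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """True when half-open intervals [start_a, end_a) and [start_b, end_b)
    share at least one point; the same bound logic drives the SQL EXISTS
    crosscheck for subscription overlap.
    """
    # the intersection [max(starts), min(ends)) is non-empty iff min(ends) exceeds max(starts)
    return min(end_a, end_b) > max(start_a, start_b)
```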




&lt;h2&gt;
  
  
  1. The Bloomberg Data Engineering Interview Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Bloomberg DE interview funnel from phone screen to final round
&lt;/h3&gt;

&lt;p&gt;The Bloomberg data engineer interview process is a four-stage funnel that takes about thirty days end to end: a &lt;strong&gt;45-60 minute technical phone screen&lt;/strong&gt; (longer than the 20-30 minute non-engineering equivalent), an optional &lt;strong&gt;online assessment&lt;/strong&gt; for some roles, a &lt;strong&gt;virtual onsite of three to five rounds&lt;/strong&gt; covering coding plus system design plus behavioral plus fit, and a &lt;strong&gt;final decision&lt;/strong&gt; call. Recent samples report an &lt;strong&gt;8% pass rate&lt;/strong&gt; across thirteen Bloomberg DE candidates — making it the most selective DE loop of any company in this prep series.&lt;/p&gt;

&lt;p&gt;The two stages most candidates misread are the &lt;strong&gt;technical phone screen&lt;/strong&gt; (45-60 minutes is usually enough for only one strong problem plus follow-ups, so move fast on the boilerplate) and the &lt;strong&gt;behavioral round&lt;/strong&gt; (interviewers drill deep into past project technical details, not just culture-fit small talk).&lt;/p&gt;

&lt;h4&gt;
  
  
  Phone screen — 45 to 60 minutes of coding fluency
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Bloomberg's engineering phone screen is conducted in a CoderPad-style live editor and runs 45-60 minutes for technical roles. The format converges on one medium-difficulty algorithmic problem (trees, graphs, two-pointer, sliding window, dynamic programming) plus follow-ups on edge cases and complexity. Bloomberg specifically values &lt;strong&gt;production-quality code&lt;/strong&gt; — type hints, edge-case guards, and a few inline tests — over a clever-but-fragile one-liner. State your assumptions before typing: input bounds, null/empty handling, in-place vs return-new.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; When the interviewer hands you &lt;em&gt;"reverse the order of words in a sentence"&lt;/em&gt;, a clean opener is &lt;em&gt;"I'll assume the input is a non-null string with words separated by single spaces, no leading or trailing whitespace; if the prompt allows multiple spaces I'll collapse them with &lt;code&gt;split()&lt;/code&gt; since &lt;code&gt;split()&lt;/code&gt; with no args treats any run of whitespace as one delimiter."&lt;/em&gt; That single sentence earns the round.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; state assumptions, then type. Two minutes of clarification is worth twenty minutes of rework.&lt;/p&gt;

&lt;h4&gt;
  
  
  Online assessment and the take-home variant
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Some Bloomberg DE roles route through an online assessment (Byteboard-style or HackerRank-style) instead of, or in addition to, the live phone screen. Format varies by team: a 60-minute timed assessment with two SQL questions and one Python question, or a 90-minute take-home with a small data-pipeline implementation graded on code quality and testability. The signal is the same as the live screen — production-quality code, edge-case awareness, deterministic output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A take-home variant might be: &lt;em&gt;"Read &lt;code&gt;events.csv&lt;/code&gt;, normalize the timestamps to UTC, deduplicate on &lt;code&gt;(user_id, event_id)&lt;/code&gt; keeping the latest, and write the result as line-delimited JSON to &lt;code&gt;events.ndjson&lt;/code&gt;."&lt;/em&gt; Strong candidates ship a typed function with a &lt;code&gt;__main__&lt;/code&gt; guard, an inline unit test, and a &lt;code&gt;README.md&lt;/code&gt; with assumptions — even when the prompt does not explicitly ask for them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; on take-homes, ship a typed function plus one unit test plus a one-paragraph &lt;code&gt;README&lt;/code&gt; of assumptions. Bloomberg grades on production-readiness, not just correctness.&lt;/p&gt;
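&lt;p&gt;&lt;strong&gt;Illustrative sketch.&lt;/strong&gt; One way the take-home above could be outlined; this is a hedged sketch, not a known grading rubric. Column names follow the prompt, and the UTC normalization assumes ISO-8601 timestamps that carry an explicit offset.&lt;/p&gt;

```python
import csv
import io  # io.StringIO is handy for testing with in-memory files
import json
from datetime import datetime, timezone

def csv_to_ndjson(src, dst):
    """Read an events CSV from file-like `src`, normalize `ts` to UTC,
    dedupe on (user_id, event_id) keeping the latest row, and write
    line-delimited JSON to file-like `dst`.
    """
    latest = {}
    for row in csv.DictReader(src):
        # parse the ISO-8601 timestamp and convert to UTC
        ts_utc = datetime.fromisoformat(row["ts"]).astimezone(timezone.utc)
        row["ts"] = ts_utc.isoformat()
        key = (row["user_id"], row["event_id"])
        # keep whichever row has the greatest UTC-normalized timestamp
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    for row in latest.values():
        dst.write(json.dumps(row) + "\n")
```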

&lt;h4&gt;
  
  
  Onsite — coding, design, behavioral, fit (3-5 rounds)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; The onsite is a virtual loop of three to five rounds totaling four to five hours: a coding round (harder than the screen, often involving sliding-window aggregators or class-hierarchy refactors), a system design round (real-time market-data pipelines, dedup with sequence numbers, schema evolution, entitlements at read time), a behavioral round, and a fit round. Bloomberg's distinctive design moves include asking about &lt;strong&gt;exactly-once semantics&lt;/strong&gt; under sink failures, &lt;strong&gt;late-arriving data&lt;/strong&gt; with watermarks, and &lt;strong&gt;schema evolution&lt;/strong&gt; across pipeline stages without breaking downstream consumers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;em&gt;"Design a pipeline that ingests trades from twenty venues, dedupes, enriches with reference data, and serves to downstream analytics consumers with a freshness SLA of 30 seconds."&lt;/em&gt; The clean answer routes through Kafka with idempotent producers, a Flink stateful operator keyed on &lt;code&gt;(venue_id, sequence_number)&lt;/code&gt; for dedup, a side-input join against a slowly-changing reference table, and a CloudWatch alert on consumer lag. Naming idempotency, watermarks, and the SLA up front is what differentiates strong candidates.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every Bloomberg system-design answer should mention idempotency once, watermarks once, and the SLA / SLO once. Those three phrases earn the round.&lt;/p&gt;
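&lt;p&gt;&lt;strong&gt;Illustrative sketch.&lt;/strong&gt; The dedup operator in that design can be modeled as a keyed high-water mark. The single-process Python analogue below assumes per-venue sequence numbers are strictly increasing, so anything at or below the high-water mark is treated as a replay; a real Flink operator would keep this map in keyed state.&lt;/p&gt;

```python
def dedup_by_sequence(trades):
    """trades: iterable of dicts with 'venue_id' and 'sequence_number'.

    Emits a trade only when its sequence number exceeds the venue's
    high-water mark, dropping replays caused by producer retries.
    """
    high_water = {}  # venue_id -> highest sequence number seen
    out = []
    for trade in trades:
        venue, seq = trade["venue_id"], trade["sequence_number"]
        if seq > high_water.get(venue, -1):
            out.append(trade)
            high_water[venue] = seq
    return out
```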

&lt;h4&gt;
  
  
  Behavioral round — STAR plus deep-dive into past project tech
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Bloomberg's behavioral round is conventional STAR — Situation, Task, Action, Result — but interviewers go technical-deep on the projects you mention. Bring two real projects you owned end-to-end with crisp numbers: rows per day, peak QPS, latency p99, cost per query, post-incident retro. Generic "I led a migration" stories without numbers will not pass. Recent first-person reports (Taro, October 2024) explicitly note: &lt;em&gt;"The interviewer asked past project-related questions and went into depth about the technical aspects related to those projects."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; When asked &lt;em&gt;"tell me about a time you handled a data corruption incident,"&lt;/em&gt; a strong answer names the source (a downstream consumer reported &lt;code&gt;NaN&lt;/code&gt; totals on the daily revenue dashboard), the diagnosis (a corrupt parquet file from a vendor partition that passed the schema check but had a sentinel &lt;code&gt;-1&lt;/code&gt; instead of &lt;code&gt;NULL&lt;/code&gt; in a numeric column), the fix (added a Great Expectations contract test on the column's value distribution, plus a postmortem doc shared with the vendor), and the result (zero similar incidents over the next six months, two related vendors signed up for the contract).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common beginner mistakes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treating the phone screen as a sprint — Bloomberg grades production-quality code, not the cleverest hack.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;README.md&lt;/code&gt; and inline tests on take-homes.&lt;/li&gt;
&lt;li&gt;Naming idempotency without explaining the sink — &lt;em&gt;"my Kafka producer is idempotent"&lt;/em&gt; is not a complete answer; the sink decides exactly-once.&lt;/li&gt;
&lt;li&gt;Bringing one STAR story to the behavioral round (interviewers double-click and the story runs out fast).&lt;/li&gt;
&lt;li&gt;Generic past-project answers without numbers (rows / day, p99, cost).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42mnpc3hv72yclkl2m8z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42mnpc3hv72yclkl2m8z.jpeg" alt="Four-stage horizontal funnel diagram of the Bloomberg data engineering interview process from phone screen through online assessment and onsite to final decision, with average durations and PipeCode brand colors." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice: drill the Bloomberg DE panel before the live screen
&lt;/h3&gt;

&lt;p&gt;&lt;span&gt;COMPANY&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Bloomberg — all DE problems&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Bloomberg data engineering practice set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/bloomberg" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Bloomberg — Python only&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Bloomberg Python practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/bloomberg/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Python Two-Pointer and String Manipulation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two-pointer iteration, string reversal, and split-rejoin in Python for data engineering
&lt;/h3&gt;

&lt;p&gt;The first canonical Bloomberg Python pattern is &lt;strong&gt;string-level manipulation with two-pointer or split-rejoin primitives&lt;/strong&gt;. The headline interview problem on the Bloomberg practice set, &lt;strong&gt;Reverse Words in String&lt;/strong&gt;, is the textbook split-rejoin question: reverse the order of words in a sentence using only built-in string methods. The canonical answer is a single line — &lt;code&gt;' '.join(s.split()[::-1])&lt;/code&gt; — but the interviewer is reading whether you can articulate the &lt;strong&gt;three primitives&lt;/strong&gt; behind it (&lt;code&gt;split&lt;/code&gt;, &lt;code&gt;[::-1]&lt;/code&gt;, &lt;code&gt;join&lt;/code&gt;) and reach for the right tool when the constraints flip (immutable input, no built-ins, in-place character array).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Python's &lt;code&gt;s.split()&lt;/code&gt; (with no argument) is not the same as &lt;code&gt;s.split(' ')&lt;/code&gt;. The no-arg form &lt;strong&gt;collapses runs of whitespace&lt;/strong&gt; and ignores leading or trailing whitespace; the explicit &lt;code&gt;' '&lt;/code&gt; form preserves empty strings between consecutive spaces. For "reverse words" prompts you almost always want the no-arg form because it handles &lt;code&gt;'  hello  world  '&lt;/code&gt; cleanly.&lt;/p&gt;
&lt;/blockquote&gt;
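&lt;p&gt;&lt;strong&gt;Illustrative sketch.&lt;/strong&gt; Putting the three primitives together gives the canonical solution; the live prompt may add constraints (in-place, no built-ins) that push you toward the two-pointer variant instead.&lt;/p&gt;

```python
def reverse_words(s: str) -> str:
    """Reverse word order; split() with no args collapses whitespace runs
    and drops leading/trailing whitespace."""
    return ' '.join(s.split()[::-1])
```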

&lt;h4&gt;
  
  
  Splitting on whitespace — collapsing multiple spaces
&lt;/h4&gt;

&lt;p&gt;The split invariant: &lt;strong&gt;&lt;code&gt;s.split()&lt;/code&gt; with no argument splits on any run of whitespace and discards empty leading or trailing tokens; &lt;code&gt;s.split(' ')&lt;/code&gt; with an explicit single-space delimiter preserves every empty token between consecutive spaces.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;'a b c'.split()&lt;/code&gt;&lt;/strong&gt; → &lt;code&gt;['a', 'b', 'c']&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;'  a  b  '.split()&lt;/code&gt;&lt;/strong&gt; → &lt;code&gt;['a', 'b']&lt;/code&gt; (leading and trailing whitespace dropped, runs collapsed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;'  a  b  '.split(' ')&lt;/code&gt;&lt;/strong&gt; → &lt;code&gt;['', '', 'a', '', 'b', '', '']&lt;/code&gt; (every run preserved).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three quick splits on the same input.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input&lt;/th&gt;
&lt;th&gt;call&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'  hello  world  '&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.split()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['hello', 'world']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'  hello  world  '&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.split(' ')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['', '', 'hello', '', 'world', '', '']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'hello world'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.split(' ')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['hello', 'world']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;words_collapsed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; default to &lt;code&gt;s.split()&lt;/code&gt; (no arg) for natural-language input. Reach for &lt;code&gt;s.split(' ')&lt;/code&gt; only when the prompt explicitly says preserve empty tokens.&lt;/p&gt;

&lt;h4&gt;
  
  
  Reversing a list with &lt;code&gt;[::-1]&lt;/code&gt; vs &lt;code&gt;reversed()&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The reversal invariant: &lt;strong&gt;&lt;code&gt;[::-1]&lt;/code&gt; is a slice that returns a new reversed list; &lt;code&gt;reversed(seq)&lt;/code&gt; returns a lazy iterator; &lt;code&gt;list.reverse()&lt;/code&gt; mutates in place and returns &lt;code&gt;None&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;words[::-1]&lt;/code&gt;&lt;/strong&gt; — new list, original unchanged. &lt;code&gt;O(n)&lt;/code&gt; time, &lt;code&gt;O(n)&lt;/code&gt; extra memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;list(reversed(words))&lt;/code&gt;&lt;/strong&gt; — new list via iterator, same cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;words.reverse()&lt;/code&gt;&lt;/strong&gt; — in-place mutation, returns &lt;code&gt;None&lt;/code&gt;. &lt;code&gt;O(n)&lt;/code&gt; time, &lt;code&gt;O(1)&lt;/code&gt; extra memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three ways to reverse &lt;code&gt;['a', 'b', 'c']&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;call&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;th&gt;mutates original&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;words[::-1]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['c', 'b', 'a']&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list(reversed(words))&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['c', 'b', 'a']&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;words.reverse()&lt;/code&gt; (then read &lt;code&gt;words&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['c', 'b', 'a']&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reverse_new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reverse_in_place&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;[::-1]&lt;/code&gt; for "return a new reversed copy"; &lt;code&gt;list.reverse()&lt;/code&gt; for "minimize memory and the prompt allows mutation."&lt;/p&gt;

&lt;h4&gt;
  
  
  Two-pointer in-place reversal of a character array
&lt;/h4&gt;

&lt;p&gt;The two-pointer invariant: &lt;strong&gt;swap &lt;code&gt;arr[left]&lt;/code&gt; with &lt;code&gt;arr[right]&lt;/code&gt;, then move &lt;code&gt;left += 1&lt;/code&gt; and &lt;code&gt;right -= 1&lt;/code&gt;, until &lt;code&gt;left &amp;gt;= right&lt;/code&gt;.&lt;/strong&gt; This runs in &lt;code&gt;O(n)&lt;/code&gt; time and &lt;code&gt;O(1)&lt;/code&gt; extra memory and is the canonical follow-up when the interviewer says &lt;em&gt;"now do it without &lt;code&gt;[::-1]&lt;/code&gt; or &lt;code&gt;reversed()&lt;/code&gt;"&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initialization&lt;/strong&gt; — &lt;code&gt;left = 0&lt;/code&gt;, &lt;code&gt;right = len(arr) - 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swap&lt;/strong&gt; — &lt;code&gt;arr[left], arr[right] = arr[right], arr[left]&lt;/code&gt; (Python's tuple assignment makes this one line).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Termination&lt;/strong&gt; — &lt;code&gt;left &amp;lt; right&lt;/code&gt; keeps you safe at the middle; &lt;code&gt;left == right&lt;/code&gt; is a no-op on odd-length arrays.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Reverse &lt;code&gt;['a', 'b', 'c', 'd']&lt;/code&gt; in place.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;left&lt;/th&gt;
&lt;th&gt;right&lt;/th&gt;
&lt;th&gt;array&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;start&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['a','b','c','d']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;swap, advance&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['d','b','c','a']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;swap, advance&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['d','c','b','a']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stop (&lt;code&gt;left &amp;gt;= right&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['d','c','b','a']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reverse_in_place_chars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the &lt;code&gt;left &amp;lt; right&lt;/code&gt; condition (strictly less than) is the conventional termination. &lt;code&gt;left &amp;lt;= right&lt;/code&gt; only adds a harmless self-swap at the middle of odd-length arrays; the real over-swap bug is sweeping a single index across the &lt;em&gt;entire&lt;/em&gt; array in a &lt;code&gt;for&lt;/code&gt; loop, which swaps every pair twice and restores the original order.&lt;/p&gt;
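&lt;p&gt;The over-swap failure mode is easy to reproduce. A minimal sketch (the names &lt;code&gt;reverse_wrong&lt;/code&gt; and &lt;code&gt;reverse_right&lt;/code&gt; are illustrative, not library functions):&lt;/p&gt;

```python
def reverse_wrong(arr: list) -> None:
    # Bug: sweeping i across the FULL range swaps every pair twice,
    # which restores the original order.
    n = len(arr)
    for i in range(n):
        arr[i], arr[n - 1 - i] = arr[n - 1 - i], arr[i]

def reverse_right(arr: list) -> None:
    # Fix: stop at the midpoint so each pair is swapped exactly once.
    n = len(arr)
    for i in range(n // 2):
        arr[i], arr[n - 1 - i] = arr[n - 1 - i], arr[i]

a = ['a', 'b', 'c', 'd']
reverse_wrong(a)
print(a)   # ['a', 'b', 'c', 'd'] — unchanged: the double-swap bug

b = ['a', 'b', 'c', 'd']
reverse_right(b)
print(b)   # ['d', 'c', 'b', 'a']
```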

&lt;p&gt;&lt;strong&gt;Common beginner mistakes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;s.split(' ')&lt;/code&gt; on real-world input and getting empty-string tokens that confuse downstream logic.&lt;/li&gt;
&lt;li&gt;Calling &lt;code&gt;list.reverse()&lt;/code&gt; and then trying to use the return value (it returns &lt;code&gt;None&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Sweeping a single index across the whole array (&lt;code&gt;for i in range(len(arr))&lt;/code&gt;) while swapping &lt;code&gt;arr[i]&lt;/code&gt; with &lt;code&gt;arr[-1 - i]&lt;/code&gt; — every pair gets swapped twice, restoring the original order; stop at the midpoint, or use two pointers with &lt;code&gt;left &amp;lt; right&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Reaching for &lt;code&gt;re.split&lt;/code&gt; when &lt;code&gt;str.split&lt;/code&gt; is enough (no-libraries phone-screen variants reject &lt;code&gt;re&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
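&lt;p&gt;The first two mistakes are quick to see at a REPL; a short sketch:&lt;/p&gt;

```python
s = '  the quick   brown fox  '

# split(' ') treats EVERY single space as a delimiter, so runs of
# spaces and leading/trailing spaces produce empty-string tokens.
print(s.split(' '))
# ['', '', 'the', 'quick', '', '', 'brown', 'fox', '', '']

# No-arg split() collapses whitespace runs and drops empty tokens.
print(s.split())
# ['the', 'quick', 'brown', 'fox']

# list.reverse() mutates in place and returns None — a classic trap.
words = s.split()
result = words.reverse()
print(result)   # None
print(words)    # ['fox', 'brown', 'quick', 'the']
```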

&lt;p&gt;PipeCode's &lt;a href="https://pipecode.ai/explore/courses/python-for-data-engineering-interviews-the-complete-fundamentals" rel="noopener noreferrer"&gt;Python for data engineering interviews course&lt;/a&gt; drills these primitives across forty-plus problems, including the two-pointer and string-manipulation variants Bloomberg's screen reaches for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Interview Question on Reverse Words in String
&lt;/h3&gt;

&lt;p&gt;Write a function &lt;code&gt;reverse_words(s: str) -&amp;gt; str&lt;/code&gt; that returns &lt;code&gt;s&lt;/code&gt; with the order of words reversed. Words are separated by one or more spaces; treat any run of whitespace as a single delimiter. Leading and trailing whitespace in the input must not appear in the output, and the output must have exactly one space between words.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using split-reverse-join
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reverse_words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; (input &lt;code&gt;s = '  the quick   brown fox  '&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;expression&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s.split()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['the', 'quick', 'brown', 'fox']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s.split()[::-1]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['fox', 'brown', 'quick', 'the']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;' '.join(s.split()[::-1])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'fox brown quick the'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;s.split()&lt;/code&gt; with no argument&lt;/strong&gt; — splits on any run of whitespace, discards empty leading/trailing tokens. The two leading spaces, the three between &lt;code&gt;quick&lt;/code&gt; and &lt;code&gt;brown&lt;/code&gt;, and the two trailing spaces all collapse cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;[::-1]&lt;/code&gt; slice&lt;/strong&gt; — returns a new list with the words in reverse order; the original &lt;code&gt;s&lt;/code&gt; is unchanged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;' '.join(...)&lt;/code&gt;&lt;/strong&gt; — concatenates the reversed list into a single string with exactly one space between adjacent elements; returns &lt;code&gt;'fox brown quick the'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return&lt;/strong&gt; — the function returns the joined string directly; no intermediate variable is needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'the quick brown fox'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'fox brown quick the'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'  hello   world  '&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'world hello'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'one'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'one'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;''&lt;/code&gt; (empty input → empty output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'   '&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;''&lt;/code&gt; (whitespace-only input → empty output)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No-arg split collapses whitespace&lt;/strong&gt; — &lt;code&gt;s.split()&lt;/code&gt; (no argument) splits on any run of whitespace and silently drops empty leading or trailing tokens; this single call handles the "multiple spaces" and "leading / trailing whitespace" edge cases without any conditional logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slice reversal returns a new list&lt;/strong&gt; — &lt;code&gt;[::-1]&lt;/code&gt; builds a new list in &lt;code&gt;O(n)&lt;/code&gt; time without mutating the source; the function stays referentially transparent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Join with a single space&lt;/strong&gt; — &lt;code&gt;' '.join(...)&lt;/code&gt; concatenates with exactly one delimiter between elements and zero delimiters at the ends; the output cannot have leading or trailing spaces by construction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empty input is naturally handled&lt;/strong&gt; — &lt;code&gt;''.split()&lt;/code&gt; returns &lt;code&gt;[]&lt;/code&gt;, &lt;code&gt;[][::-1]&lt;/code&gt; returns &lt;code&gt;[]&lt;/code&gt;, and &lt;code&gt;' '.join([])&lt;/code&gt; returns &lt;code&gt;''&lt;/code&gt;; the empty-input case requires no special branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three primitives, one line&lt;/strong&gt; — chaining &lt;code&gt;split&lt;/code&gt;, slice, and &lt;code&gt;join&lt;/code&gt; is the canonical Pythonic answer; reaching for &lt;code&gt;re.split&lt;/code&gt; or manual character iteration is over-engineering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(n)&lt;/code&gt; time where &lt;code&gt;n&lt;/code&gt; is the input length, plus &lt;code&gt;O(n)&lt;/code&gt; extra memory for the new list and joined string; &lt;code&gt;n&lt;/code&gt; is constrained by the line length in any real-world prompt.&lt;/li&gt;
&lt;/ul&gt;
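&lt;p&gt;If an interviewer asks for an alternative to the slice, &lt;code&gt;reversed()&lt;/code&gt; is an equivalent variant worth knowing (same semantics, an iterator instead of a reversed copy):&lt;/p&gt;

```python
def reverse_words(s: str) -> str:
    # reversed() returns an iterator over the word list; join()
    # consumes it directly, so no second list is materialized.
    return ' '.join(reversed(s.split()))

print(reverse_words('  the quick   brown fox  '))   # 'fox brown quick the'
print(reverse_words(''))                            # ''
```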

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Bloomberg — two pointers / string&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Reverse Words in String (Bloomberg)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/2-reverse-words-in-string" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — two pointers&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Two-pointer Python problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/two-pointers/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Abstract Classes, OOP, and File I/O in Python
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ABC, abstract methods, and chunked file processing in Python for data engineering
&lt;/h3&gt;

&lt;p&gt;The second canonical Bloomberg Python pattern is &lt;strong&gt;production-quality OOP with abstract base classes and streaming file I/O&lt;/strong&gt;. The headline interview problem on the Bloomberg practice set, &lt;strong&gt;Chunked CSV to Line-Delimited JSON Processor&lt;/strong&gt;, is exactly this shape: subclass an abstract &lt;code&gt;RecordProcessor&lt;/code&gt; base, implement the &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, and &lt;code&gt;write&lt;/code&gt; abstract methods, and stream a CSV file row-by-row into a line-delimited JSON output without loading the whole file into memory. This pattern shows up in roughly half of recent Bloomberg DE coding rounds — confirmed by external interview reports flagging "abstract &lt;code&gt;DataProcessor&lt;/code&gt; framework" and "RotatingFileSink class hierarchy" prompts.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Python's &lt;code&gt;abc&lt;/code&gt; module enforces the abstract contract at instantiation time, not at class definition time. A &lt;code&gt;class Foo(ABC)&lt;/code&gt; whose &lt;code&gt;bar&lt;/code&gt; method is decorated with &lt;code&gt;@abstractmethod&lt;/code&gt; can be defined without ever implementing &lt;code&gt;bar&lt;/code&gt;, but &lt;code&gt;Foo()&lt;/code&gt; raises &lt;code&gt;TypeError: Can't instantiate abstract class Foo with abstract method bar&lt;/code&gt;. Subclasses that miss any abstract method fail the same way. This is the language feature Bloomberg interviewers are reading you for.&lt;/p&gt;
&lt;/blockquote&gt;
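&lt;p&gt;The instantiation-time check is easy to verify; &lt;code&gt;Foo&lt;/code&gt;, &lt;code&gt;Partial&lt;/code&gt;, and &lt;code&gt;Full&lt;/code&gt; below are throwaway illustration names:&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class Foo(ABC):
    @abstractmethod
    def bar(self): ...

class Partial(Foo):
    # Missing bar — Partial is still abstract.
    pass

class Full(Foo):
    def bar(self):
        return 42

for cls in (Foo, Partial):
    try:
        cls()
    except TypeError as exc:
        print(exc)   # TypeError message names the missing method

print(Full().bar())   # 42
```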

&lt;h4&gt;
  
  
  &lt;code&gt;from abc import ABC, abstractmethod&lt;/code&gt; — defining the contract
&lt;/h4&gt;

&lt;p&gt;The abstract-class invariant: &lt;strong&gt;a class that inherits from &lt;code&gt;ABC&lt;/code&gt; and decorates one or more methods with &lt;code&gt;@abstractmethod&lt;/code&gt; cannot be instantiated directly; only subclasses that implement every abstract method can be instantiated.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;from abc import ABC, abstractmethod&lt;/code&gt;&lt;/strong&gt; — the standard import. &lt;code&gt;ABC&lt;/code&gt; is a helper base class with &lt;code&gt;ABCMeta&lt;/code&gt; as its metaclass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@abstractmethod&lt;/code&gt;&lt;/strong&gt; — decorator that marks a method as abstract. Place it on the method directly inside the class body.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subclassing&lt;/strong&gt; — &lt;code&gt;class Concrete(Abstract):&lt;/code&gt; inherits all concrete methods and must override every &lt;code&gt;@abstractmethod&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Defining an abstract &lt;code&gt;RecordProcessor&lt;/code&gt; with three abstract methods.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;method&lt;/th&gt;
&lt;th&gt;purpose&lt;/th&gt;
&lt;th&gt;signature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load(path)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;read the input&lt;/td&gt;
&lt;td&gt;returns an iterable of records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transform(record)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;reshape one record&lt;/td&gt;
&lt;td&gt;returns the transformed record&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;write(record, out)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;write one record&lt;/td&gt;
&lt;td&gt;returns &lt;code&gt;None&lt;/code&gt; (side effect)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abstractmethod&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IO&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RecordProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IO&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; keep the abstract base small — three to five methods is the sweet spot. Abstract bases with twenty methods are a code smell.&lt;/p&gt;

&lt;h4&gt;
  
  
  Subclassing and implementing every abstract method
&lt;/h4&gt;

&lt;p&gt;The subclass invariant: &lt;strong&gt;a concrete subclass must implement every method decorated with &lt;code&gt;@abstractmethod&lt;/code&gt; in the parent (and every grandparent, transitively); missing even one keeps the subclass abstract.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;class CsvToLdjsonProcessor(RecordProcessor):&lt;/code&gt;&lt;/strong&gt; — declare inheritance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement each abstract&lt;/strong&gt; — &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt; all need concrete bodies (not &lt;code&gt;pass&lt;/code&gt; or &lt;code&gt;...&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concrete-only methods inherit unchanged&lt;/strong&gt; — non-abstract methods on the parent are usable directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;CsvToLdjsonProcessor&lt;/code&gt; that reads CSV and writes line-delimited JSON.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;abstract method&lt;/th&gt;
&lt;th&gt;concrete implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;csv.DictReader(open(path))&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transform&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;identity (return the row dict unchanged)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;write&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;out.write(json.dumps(record) + '\n')&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IO&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CsvToLdjsonProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RecordProcessor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DictReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IO&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your &lt;code&gt;transform&lt;/code&gt; is the identity, leave it as the identity — interviewers reward not over-engineering as much as they reward implementing correctly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Chunked CSV reading with &lt;code&gt;csv.DictReader&lt;/code&gt; and line-delimited JSON output
&lt;/h4&gt;

&lt;p&gt;The streaming invariant: &lt;strong&gt;&lt;code&gt;csv.DictReader&lt;/code&gt; is a lazy iterator; iterating it yields one row at a time without loading the whole file. Pair it with a &lt;code&gt;for&lt;/code&gt; loop that writes each transformed row to the output stream, and the program runs in &lt;code&gt;O(1)&lt;/code&gt; extra memory regardless of input size.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;csv.DictReader(f)&lt;/code&gt;&lt;/strong&gt; — yields one &lt;code&gt;dict&lt;/code&gt; per CSV row, using the first row as header keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;yield from&lt;/code&gt;&lt;/strong&gt; — the &lt;code&gt;load&lt;/code&gt; method delegates iteration to &lt;code&gt;DictReader&lt;/code&gt; without buffering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;for record in self.load(path):&lt;/code&gt;&lt;/strong&gt; — the orchestrator iterates lazily; &lt;code&gt;transform&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt; run per row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line-delimited JSON&lt;/strong&gt; — &lt;code&gt;json.dumps(record) + '\n'&lt;/code&gt; per row produces a file where each line is independently parseable.&lt;/li&gt;
&lt;/ul&gt;
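&lt;p&gt;The laziness claim can be checked against an in-memory stream (&lt;code&gt;io.StringIO&lt;/code&gt; stands in for a real file here):&lt;/p&gt;

```python
import csv
import io

rows = 'id,name\n1,alpha\n2,beta\n3,gamma\n'
reader = csv.DictReader(io.StringIO(rows))

# next() pulls exactly one parsed row; the rest are not parsed yet.
first = next(reader)
print(first)            # {'id': '1', 'name': 'alpha'}

# Remaining rows are only parsed as iteration continues.
rest = [r['name'] for r in reader]
print(rest)             # ['beta', 'gamma']
```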

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Streaming a 10-million-row CSV through the processor uses constant memory, not 10M-row memory.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;stage&lt;/th&gt;
&lt;th&gt;memory profile&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row at a time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transform&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row at a time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;write&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row at a time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;total extra memory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;O(1)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RecordProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never &lt;code&gt;for record in list(processor.load(path))&lt;/code&gt; — wrapping a generator in &lt;code&gt;list()&lt;/code&gt; materializes the whole file and defeats the streaming guarantee.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common beginner mistakes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;pandas.read_csv(path)&lt;/code&gt; instead of &lt;code&gt;csv.DictReader&lt;/code&gt; — loads the whole file, defeats the streaming invariant.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;newline=''&lt;/code&gt; on &lt;code&gt;open()&lt;/code&gt; — Python's csv module needs that to handle quoted multi-line fields correctly.&lt;/li&gt;
&lt;li&gt;Implementing &lt;code&gt;load&lt;/code&gt; to return a &lt;code&gt;list&lt;/code&gt; instead of yielding — same issue as the &lt;code&gt;pandas&lt;/code&gt; mistake, scaled.&lt;/li&gt;
&lt;li&gt;Writing JSON without the trailing &lt;code&gt;'\n'&lt;/code&gt; — produces concatenated objects on one line, not line-delimited.&lt;/li&gt;
&lt;li&gt;Skipping the &lt;code&gt;with&lt;/code&gt; block on the output file — the file may not flush on early termination.&lt;/li&gt;
&lt;/ul&gt;
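&lt;p&gt;The &lt;code&gt;newline=''&lt;/code&gt; pitfall matters because quoted CSV fields may contain embedded newlines; a sketch with an in-memory stream shows the parser treating such a field as one logical row:&lt;/p&gt;

```python
import csv
import io

# A quoted field with an embedded newline is ONE logical row.
data = 'id,comment\n1,"line one\nline two"\n2,plain\n'
rows = list(csv.DictReader(io.StringIO(data)))

print(len(rows))             # 2, not 3 — the quoted newline did not split the row
print(rows[0]['comment'])    # line one
                             # line two
```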

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueya0gveccn5bhv6kuop.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueya0gveccn5bhv6kuop.jpeg" alt="Two-panel diagram showing the abstract base class hierarchy on the left (RecordProcessor with abstract methods load, transform, write) and a concrete CsvToLdjsonProcessor subclass on the right, plus a horizontal data flow arrow indicating chunked CSV input being read, transformed, and written as line-delimited JSON output." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Interview Question on Chunked CSV to LDJSON Processor
&lt;/h3&gt;

&lt;p&gt;Implement a class hierarchy that reads a CSV file row-by-row and writes the rows as line-delimited JSON to an output file, &lt;strong&gt;without loading the whole CSV into memory&lt;/strong&gt;. The base class &lt;code&gt;RecordProcessor&lt;/code&gt; must be abstract with three &lt;code&gt;@abstractmethod&lt;/code&gt; methods: &lt;code&gt;load(path)&lt;/code&gt; (returns an iterable of &lt;code&gt;dict&lt;/code&gt;), &lt;code&gt;transform(record)&lt;/code&gt; (returns a &lt;code&gt;dict&lt;/code&gt;), and &lt;code&gt;write(record, out)&lt;/code&gt; (writes one record and returns &lt;code&gt;None&lt;/code&gt;). Implement a concrete subclass &lt;code&gt;CsvToLdjsonProcessor&lt;/code&gt; that satisfies the contract. Provide a top-level &lt;code&gt;run(processor, in_path, out_path)&lt;/code&gt; function that orchestrates the streaming pipeline and returns the row count.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a Concrete Subclass of an Abstract RecordProcessor
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abstractmethod&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RecordProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IO&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CsvToLdjsonProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RecordProcessor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DictReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IO&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RecordProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; (input &lt;code&gt;events.csv&lt;/code&gt; with 3 rows):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name,age,city
alice,30,NYC
bob,25,SF
carol,40,LON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;output side effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;processor.load('events.csv')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yields &lt;code&gt;{'name':'alice','age':'30','city':'NYC'}&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;transform(record)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;returns the same dict (identity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;write(record, out)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;writes &lt;code&gt;{"name":"alice","age":"30","city":"NYC"}\n&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;next iteration&lt;/td&gt;
&lt;td&gt;yields &lt;code&gt;{'name':'bob','age':'25','city':'SF'}&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;transform + write&lt;/td&gt;
&lt;td&gt;writes &lt;code&gt;{"name":"bob","age":"25","city":"SF"}\n&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;next iteration&lt;/td&gt;
&lt;td&gt;yields &lt;code&gt;{'name':'carol','age':'40','city':'LON'}&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;transform + write&lt;/td&gt;
&lt;td&gt;writes &lt;code&gt;{"name":"carol","age":"40","city":"LON"}\n&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;iterator exhausted&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;count = 3&lt;/code&gt; returned&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open the output file inside &lt;code&gt;with&lt;/code&gt;&lt;/strong&gt; — guarantees flush and close even on exceptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;processor.load(in_path)&lt;/code&gt;&lt;/strong&gt; — returns a generator; iterating it lazily reads one row at a time from the CSV.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;processor.transform(record)&lt;/code&gt;&lt;/strong&gt; — applies the per-record transformation (identity here).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;processor.write(transformed, out)&lt;/code&gt;&lt;/strong&gt; — writes one JSON line per record.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increment &lt;code&gt;count&lt;/code&gt;&lt;/strong&gt; — after each successful write.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return &lt;code&gt;count&lt;/code&gt;&lt;/strong&gt; — at end of iteration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"name": "alice", "age": "30", "city": "NYC"}
{"name": "bob", "age": "25", "city": "SF"}
{"name": "carol", "age": "40", "city": "LON"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus return value &lt;code&gt;3&lt;/code&gt;.&lt;/p&gt;
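&lt;p&gt;To make the trace verifiable, here is a compact, self-contained rerun of the same pipeline against a temporary directory. The &lt;code&gt;tempfile&lt;/code&gt; scaffolding and file names are illustrative, not part of the interview prompt:&lt;/p&gt;

```python
import csv, json, os, tempfile
from abc import ABC, abstractmethod

# Compact restatement of the pipeline above, run end to end on a temp CSV.
class RecordProcessor(ABC):
    @abstractmethod
    def load(self, path): ...
    @abstractmethod
    def transform(self, record): ...
    @abstractmethod
    def write(self, record, out): ...

class CsvToLdjsonProcessor(RecordProcessor):
    def load(self, path):
        with open(path, newline='') as f:
            yield from csv.DictReader(f)   # one row at a time

    def transform(self, record):
        return record                      # identity transform

    def write(self, record, out):
        out.write(json.dumps(record) + '\n')

def run(processor, in_path, out_path):
    count = 0
    with open(out_path, 'w') as out:
        for record in processor.load(in_path):
            processor.write(processor.transform(record), out)
            count += 1
    return count

with tempfile.TemporaryDirectory() as d:
    in_path = os.path.join(d, 'events.csv')
    out_path = os.path.join(d, 'events.ldjson')
    with open(in_path, 'w') as f:
        f.write('name,age,city\nalice,30,NYC\nbob,25,SF\ncarol,40,LON\n')
    n = run(CsvToLdjsonProcessor(), in_path, out_path)
    with open(out_path) as f:
        lines = f.read().splitlines()

print(n)                     # 3
print(json.loads(lines[0]))  # {'name': 'alice', 'age': '30', 'city': 'NYC'}
```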

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ABC + &lt;code&gt;@abstractmethod&lt;/code&gt; enforces the contract&lt;/strong&gt; — &lt;code&gt;RecordProcessor&lt;/code&gt; cannot be instantiated directly; any subclass missing &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, or &lt;code&gt;write&lt;/code&gt; raises &lt;code&gt;TypeError&lt;/code&gt; at construction. The interface is enforced by the language, not by convention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;csv.DictReader&lt;/code&gt; yields lazily&lt;/strong&gt; — it reads one row from the file each time it is advanced; pairing it with &lt;code&gt;yield from&lt;/code&gt; keeps the entire pipeline streaming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generator-based load&lt;/strong&gt; — &lt;code&gt;load&lt;/code&gt; returns an iterable, not a &lt;code&gt;list&lt;/code&gt;; the caller iterates with &lt;code&gt;for&lt;/code&gt; and never materializes the file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line-delimited JSON&lt;/strong&gt; — &lt;code&gt;json.dumps(record) + '\n'&lt;/code&gt; per row produces a file where each line is independently parseable, which is the canonical format for log streaming and incremental ingestion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;with&lt;/code&gt;-block on the output&lt;/strong&gt; — the output handle flushes and closes deterministically; an exception mid-stream does not leave a partial file open.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(1)&lt;/code&gt; extra memory regardless of input size (the generator holds at most one row at a time) and &lt;code&gt;O(n)&lt;/code&gt; time for &lt;code&gt;n&lt;/code&gt; input rows; there is no &lt;code&gt;O(n)&lt;/code&gt; memory spike at any stage.&lt;/li&gt;
&lt;/ul&gt;
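&lt;p&gt;The ABC guarantee is easy to verify directly: a subclass that omits even one abstract method stays abstract, and construction fails. The &lt;code&gt;Incomplete&lt;/code&gt; class below is a hypothetical example, not part of the prompt:&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class RecordProcessor(ABC):
    @abstractmethod
    def load(self, path): ...
    @abstractmethod
    def transform(self, record): ...
    @abstractmethod
    def write(self, record, out): ...

# Hypothetical subclass that forgets write(): it is still abstract.
class Incomplete(RecordProcessor):
    def load(self, path):
        return iter(())
    def transform(self, record):
        return record

try:
    Incomplete()
except TypeError as exc:
    print(exc)  # the message names the missing abstract method 'write'
```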

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Bloomberg — OOP / abstract&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Chunked CSV to LDJSON Processor (Bloomberg)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/chunked-csv-to-line-delimited-json-processor" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — OOP&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;OOP Python problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/oop/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  4. SQL Window Functions for Time-Series and Subscription Overlap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;PARTITION BY&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, and rolling aggregates in SQL for time-series and overlap analysis
&lt;/h3&gt;

&lt;p&gt;The third canonical Bloomberg interview pattern is &lt;strong&gt;SQL window functions on time-series data plus interval-overlap detection&lt;/strong&gt;. Bloomberg's external interview reports converge here — a recent DataLemur compilation lists eight Bloomberg SQL questions covering rolling aggregates, monthly per-product averages, and ranking-tie semantics, and Interview Query flags &lt;strong&gt;Subscription Overlap&lt;/strong&gt; as a flagship Bloomberg SQL question. The canonical primitives are &lt;strong&gt;&lt;code&gt;SUM(volume) OVER (PARTITION BY symbol ORDER BY trade_date)&lt;/code&gt;&lt;/strong&gt; for rolling totals, &lt;strong&gt;&lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt;&lt;/strong&gt; for adjacent-row comparisons, and &lt;strong&gt;&lt;code&gt;EXISTS&lt;/code&gt;&lt;/strong&gt; subqueries with interval crosschecks for overlap detection. PipeCode does not yet host a Bloomberg-tagged SQL problem, so this section is anchored on the topic-level &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;SQL window functions practice page&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; window functions evaluate &lt;strong&gt;after&lt;/strong&gt; &lt;code&gt;WHERE&lt;/code&gt; and &lt;strong&gt;before&lt;/strong&gt; &lt;code&gt;ORDER BY&lt;/code&gt; in the SQL pipeline. That is why you cannot filter &lt;code&gt;WHERE rolling_total &amp;gt; 1000&lt;/code&gt; in the same &lt;code&gt;SELECT&lt;/code&gt; as the window — wrap the window in a CTE first, then filter in the outer query. This trips up many candidates on their first onsite SQL prompt.&lt;/p&gt;
&lt;/blockquote&gt;
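&lt;p&gt;The CTE-then-filter pattern can be checked end to end. The sketch below uses Python's bundled &lt;code&gt;sqlite3&lt;/code&gt; (SQLite supports these window functions from version 3.25) purely so the result is runnable locally; the SQL itself is the same shape you would write in PostgreSQL:&lt;/p&gt;

```python
import sqlite3

# A window result cannot be referenced in the same query's WHERE clause;
# compute it in a CTE, then filter in the outer query.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE ticks(symbol TEXT, trade_date TEXT, volume INTEGER)")
con.executemany("INSERT INTO ticks VALUES (?, ?, ?)", [
    ('AAPL', '2026-04-01', 100), ('AAPL', '2026-04-02', 150),
    ('AAPL', '2026-04-03', 200),
])
rows = con.execute("""
    WITH w AS (
        SELECT symbol, trade_date, volume,
               SUM(volume) OVER (PARTITION BY symbol ORDER BY trade_date) AS running_volume
        FROM ticks
    )
    SELECT trade_date, running_volume
    FROM w
    WHERE running_volume > 200
    ORDER BY trade_date
""").fetchall()
print(rows)  # [('2026-04-02', 250), ('2026-04-03', 450)]
```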

&lt;h4&gt;
  
  
  &lt;code&gt;SUM&lt;/code&gt; / &lt;code&gt;AVG&lt;/code&gt; &lt;code&gt;OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt; for rolling totals
&lt;/h4&gt;

&lt;p&gt;The rolling-aggregate invariant: &lt;strong&gt;&lt;code&gt;SUM(metric) OVER (PARTITION BY entity ORDER BY ts)&lt;/code&gt; returns a per-row running total scoped to each entity, ordered by time, including all rows up to and including the current row by default.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY symbol&lt;/code&gt;&lt;/strong&gt; — resets the running aggregate at each symbol boundary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY trade_date&lt;/code&gt;&lt;/strong&gt; — defines the time direction of the cumulation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default frame&lt;/strong&gt; — &lt;code&gt;RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW&lt;/code&gt;, which also pulls in any rows tied with the current row on &lt;code&gt;trade_date&lt;/code&gt; (its peers). Override with &lt;code&gt;ROWS BETWEEN N PRECEDING AND CURRENT ROW&lt;/code&gt; for a fixed-window rolling sum.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;ticks(symbol, trade_date, volume)&lt;/code&gt; table; &lt;code&gt;running_volume&lt;/code&gt; is the expected output column.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;symbol&lt;/th&gt;
&lt;th&gt;trade_date&lt;/th&gt;
&lt;th&gt;volume&lt;/th&gt;
&lt;th&gt;running_volume&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;2026-04-01&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;2026-04-02&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;2026-04-03&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;450&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MSFT&lt;/td&gt;
&lt;td&gt;2026-04-01&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MSFT&lt;/td&gt;
&lt;td&gt;2026-04-02&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;
         &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running_volume&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ticks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the prompt says "running" / "cumulative" / "year-to-date" / "month-to-date" per entity, reach for &lt;code&gt;SUM OVER (PARTITION BY entity ORDER BY ts)&lt;/code&gt; before anything else.&lt;/p&gt;
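&lt;p&gt;The worked table above can be reproduced locally. As before, this sketch uses &lt;code&gt;sqlite3&lt;/code&gt; only as a convenient in-memory engine; the query is the same one shown in the solution:&lt;/p&gt;

```python
import sqlite3

# Reproduce the worked running_volume table with an in-memory database.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE ticks(symbol TEXT, trade_date TEXT, volume INTEGER)")
con.executemany("INSERT INTO ticks VALUES (?, ?, ?)", [
    ('AAPL', '2026-04-01', 100), ('AAPL', '2026-04-02', 150),
    ('AAPL', '2026-04-03', 200), ('MSFT', '2026-04-01', 80),
    ('MSFT', '2026-04-02', 120),
])
rows = con.execute("""
    SELECT symbol, trade_date, volume,
           SUM(volume) OVER (PARTITION BY symbol ORDER BY trade_date) AS running_volume
    FROM ticks
    ORDER BY symbol, trade_date
""").fetchall()
for row in rows:
    print(row)  # running_volume: 100, 250, 450 for AAPL; 80, 200 for MSFT
```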

&lt;h4&gt;
  
  
  &lt;code&gt;LAG&lt;/code&gt; / &lt;code&gt;LEAD&lt;/code&gt; for adjacent-row comparison
&lt;/h4&gt;

&lt;p&gt;The lag-lead invariant: &lt;strong&gt;&lt;code&gt;LAG(col, n) OVER (PARTITION BY entity ORDER BY ts)&lt;/code&gt; returns the value of &lt;code&gt;col&lt;/code&gt; from &lt;code&gt;n&lt;/code&gt; rows earlier within the same partition, ordered by &lt;code&gt;ts&lt;/code&gt;. &lt;code&gt;LEAD&lt;/code&gt; is the symmetric "n rows later" variant.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(volume) OVER (...)&lt;/code&gt;&lt;/strong&gt; — previous row's volume, or &lt;code&gt;NULL&lt;/code&gt; for the first row in the partition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(volume, 2)&lt;/code&gt;&lt;/strong&gt; — two rows back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(volume, 1, 0)&lt;/code&gt;&lt;/strong&gt; — default &lt;code&gt;0&lt;/code&gt; instead of &lt;code&gt;NULL&lt;/code&gt; for missing rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Day-over-day volume change per symbol; &lt;code&gt;prev_volume&lt;/code&gt; and &lt;code&gt;dod_change&lt;/code&gt; are the expected output columns.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;symbol&lt;/th&gt;
&lt;th&gt;trade_date&lt;/th&gt;
&lt;th&gt;volume&lt;/th&gt;
&lt;th&gt;prev_volume&lt;/th&gt;
&lt;th&gt;dod_change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;2026-04-01&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;2026-04-02&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;2026-04-03&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trade_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dod_change&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ticks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; day-over-day deltas, week-over-week resets, and "did this row break a streak?" all map to &lt;code&gt;LAG&lt;/code&gt;. Use &lt;code&gt;LEAD&lt;/code&gt; for "is the next event within X minutes?".&lt;/p&gt;
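&lt;p&gt;The &lt;code&gt;LAG&lt;/code&gt; worked table checks out the same way; note the &lt;code&gt;NULL&lt;/code&gt; (Python &lt;code&gt;None&lt;/code&gt;) on the first row of the partition:&lt;/p&gt;

```python
import sqlite3

# Reproduce the LAG worked table: previous volume and day-over-day change.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE ticks(symbol TEXT, trade_date TEXT, volume INTEGER)")
con.executemany("INSERT INTO ticks VALUES (?, ?, ?)", [
    ('AAPL', '2026-04-01', 100),
    ('AAPL', '2026-04-02', 150),
    ('AAPL', '2026-04-03', 200),
])
rows = con.execute("""
    SELECT trade_date,
           LAG(volume) OVER (PARTITION BY symbol ORDER BY trade_date) AS prev_volume,
           volume - LAG(volume) OVER (PARTITION BY symbol ORDER BY trade_date) AS dod_change
    FROM ticks
    ORDER BY trade_date
""").fetchall()
print(rows)
# [('2026-04-01', None, None), ('2026-04-02', 100, 50), ('2026-04-03', 150, 50)]
```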

&lt;h4&gt;
  
  
  Subscription overlap via interval crosscheck (&lt;code&gt;EXISTS&lt;/code&gt; subquery)
&lt;/h4&gt;

&lt;p&gt;The overlap invariant: &lt;strong&gt;two intervals &lt;code&gt;(start_a, end_a)&lt;/code&gt; and &lt;code&gt;(start_b, end_b)&lt;/code&gt; overlap if and only if &lt;code&gt;start_a &amp;lt;= end_b&lt;/code&gt; and &lt;code&gt;start_b &amp;lt;= end_a&lt;/code&gt;.&lt;/strong&gt; This is the canonical interval-overlap formula; everything else (partial overlap, full containment, identical intervals) collapses to this two-condition check.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EXISTS&lt;/code&gt; subquery&lt;/strong&gt; — for each row, check if any other row from the same table satisfies the overlap condition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;s1.user_id &amp;lt;&amp;gt; s2.user_id&lt;/code&gt;&lt;/strong&gt; — excludes the same user, so a subscription is never compared against itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-date &lt;code&gt;IS NOT NULL&lt;/code&gt;&lt;/strong&gt; — only completed subscriptions count for overlap detection in Bloomberg's framing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;subscriptions(user_id, start_date, end_date)&lt;/code&gt; table with four users.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;start_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;th&gt;overlaps?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2026-01-01&lt;/td&gt;
&lt;td&gt;2026-01-31&lt;/td&gt;
&lt;td&gt;✓ (with user 2 and user 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2026-01-15&lt;/td&gt;
&lt;td&gt;2026-01-17&lt;/td&gt;
&lt;td&gt;✓ (with user 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2026-01-29&lt;/td&gt;
&lt;td&gt;2026-02-04&lt;/td&gt;
&lt;td&gt;✓ (with user 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2026-02-05&lt;/td&gt;
&lt;td&gt;2026-02-10&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;subscriptions&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;
         &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
           &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
           &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt;
           &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;subscriptions&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the overlap formula is &lt;code&gt;start_a &amp;lt;= end_b AND start_b &amp;lt;= end_a&lt;/code&gt;. Memorize it cold — every Bloomberg interval-detection prompt collapses to it.&lt;/p&gt;
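&lt;p&gt;The formula and the four-user table can be sanity-checked in a few lines of Python; the &lt;code&gt;overlaps&lt;/code&gt; helper is an illustrative name, not part of the prompt:&lt;/p&gt;

```python
from datetime import date

def overlaps(start_a, end_a, start_b, end_b):
    # Canonical interval-overlap check: boundaries that touch count as overlap.
    return start_a <= end_b and start_b <= end_a

# The four completed subscriptions from the worked example.
subs = {
    1: (date(2026, 1, 1),  date(2026, 1, 31)),
    2: (date(2026, 1, 15), date(2026, 1, 17)),
    3: (date(2026, 1, 29), date(2026, 2, 4)),
    4: (date(2026, 2, 5),  date(2026, 2, 10)),
}
for uid, (s, e) in subs.items():
    partners = [o for o, (s2, e2) in subs.items()
                if o != uid and overlaps(s, e, s2, e2)]
    print(uid, partners)
# 1 [2, 3]
# 2 [1]
# 3 [1]
# 4 []
```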

&lt;p&gt;&lt;strong&gt;Common beginner mistakes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filtering &lt;code&gt;WHERE running_volume &amp;gt; 1000&lt;/code&gt; in the same &lt;code&gt;SELECT&lt;/code&gt; as the window — windows evaluate after &lt;code&gt;WHERE&lt;/code&gt;; wrap in a CTE first.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;PARTITION BY&lt;/code&gt; and getting one global running total across all symbols.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;LAG(volume)&lt;/code&gt; without an &lt;code&gt;ORDER BY&lt;/code&gt; inside the window — undefined ordering, undefined result.&lt;/li&gt;
&lt;li&gt;Writing the overlap check as &lt;code&gt;start_a &amp;lt; end_b AND start_b &amp;lt; end_a&lt;/code&gt; (strict) when the prompt allows touching boundaries — off-by-one on identical-day overlaps.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;s1.user_id &amp;lt;&amp;gt; s2.user_id&lt;/code&gt; — every row "overlaps with itself" and the answer is always &lt;code&gt;1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg3yinudo829k430vt5w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg3yinudo829k430vt5w.jpeg" alt="Worked example showing how SUM(volume) OVER (PARTITION BY symbol ORDER BY trade_date) produces a rolling cumulative volume per stock symbol on a small ticks input table, with PipeCode purple and green accents and an arrow flow from input table to window function to output table." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Detecting Subscription Overlap
&lt;/h3&gt;

&lt;p&gt;Given a table &lt;code&gt;subscriptions(user_id INT, start_date DATE, end_date DATE)&lt;/code&gt; where &lt;code&gt;end_date IS NULL&lt;/code&gt; indicates an active subscription, return one row per user with &lt;code&gt;overlap = 1&lt;/code&gt; if the user's completed subscription window overlaps any other user's completed subscription, and &lt;code&gt;overlap = 0&lt;/code&gt; otherwise. Consider only rows where &lt;code&gt;end_date IS NOT NULL&lt;/code&gt;. Use the canonical interval-overlap formula.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using EXISTS with Interval Crosscheck
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;subscriptions&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;   &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt;  &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;subscriptions&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; (input &lt;code&gt;subscriptions&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;start_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2026-01-01&lt;/td&gt;
&lt;td&gt;2026-01-31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2026-01-15&lt;/td&gt;
&lt;td&gt;2026-01-17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2026-01-29&lt;/td&gt;
&lt;td&gt;2026-02-04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2026-02-05&lt;/td&gt;
&lt;td&gt;2026-02-10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Filter to completed subscriptions&lt;/strong&gt; — &lt;code&gt;WHERE s1.end_date IS NOT NULL&lt;/code&gt; keeps all four rows here (every row has an &lt;code&gt;end_date&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For user 1&lt;/strong&gt; (&lt;code&gt;2026-01-01&lt;/code&gt; to &lt;code&gt;2026-01-31&lt;/code&gt;) — check &lt;code&gt;EXISTS&lt;/code&gt;. Compare against user 2 (&lt;code&gt;2026-01-15&lt;/code&gt; to &lt;code&gt;2026-01-17&lt;/code&gt;): &lt;code&gt;2026-01-01 &amp;lt;= 2026-01-17&lt;/code&gt; ✓ AND &lt;code&gt;2026-01-15 &amp;lt;= 2026-01-31&lt;/code&gt; ✓ → overlap. &lt;code&gt;EXISTS&lt;/code&gt; returns true, &lt;code&gt;overlap = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For user 2&lt;/strong&gt; (&lt;code&gt;2026-01-15&lt;/code&gt; to &lt;code&gt;2026-01-17&lt;/code&gt;) — compare against user 1: &lt;code&gt;2026-01-15 &amp;lt;= 2026-01-31&lt;/code&gt; ✓ AND &lt;code&gt;2026-01-01 &amp;lt;= 2026-01-17&lt;/code&gt; ✓ → overlap. &lt;code&gt;overlap = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For user 3&lt;/strong&gt; (&lt;code&gt;2026-01-29&lt;/code&gt; to &lt;code&gt;2026-02-04&lt;/code&gt;) — compare against user 1: &lt;code&gt;2026-01-29 &amp;lt;= 2026-01-31&lt;/code&gt; ✓ AND &lt;code&gt;2026-01-01 &amp;lt;= 2026-02-04&lt;/code&gt; ✓ → overlap. &lt;code&gt;overlap = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For user 4&lt;/strong&gt; (&lt;code&gt;2026-02-05&lt;/code&gt; to &lt;code&gt;2026-02-10&lt;/code&gt;) — compare against every other user. User 1: &lt;code&gt;2026-02-05 &amp;lt;= 2026-01-31&lt;/code&gt; ✗ (Feb 5 is after Jan 31). User 2: &lt;code&gt;2026-02-05 &amp;lt;= 2026-01-17&lt;/code&gt; ✗. User 3: &lt;code&gt;2026-02-05 &amp;lt;= 2026-02-04&lt;/code&gt; ✗. No overlap. &lt;code&gt;overlap = 0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY s1.user_id&lt;/code&gt;&lt;/strong&gt; for stability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;overlap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interval overlap formula&lt;/strong&gt; — two intervals &lt;code&gt;(a_start, a_end)&lt;/code&gt; and &lt;code&gt;(b_start, b_end)&lt;/code&gt; overlap iff &lt;code&gt;a_start &amp;lt;= b_end AND b_start &amp;lt;= a_end&lt;/code&gt;; this two-condition check captures partial overlap, full containment, and shared boundaries in one expression.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EXISTS short-circuits&lt;/strong&gt; — the subquery returns true the moment one matching row is found; for users with many overlapping peers, the engine does not evaluate every pair.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;s1.user_id &amp;lt;&amp;gt; s2.user_id&lt;/code&gt; excludes self-overlap&lt;/strong&gt; — without this guard, every row trivially overlaps itself and the answer is uniformly &lt;code&gt;1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;end_date IS NOT NULL&lt;/code&gt; filters incomplete rows&lt;/strong&gt; — Bloomberg's framing only counts completed subscriptions; active subscriptions (open intervals) are excluded by the prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CASE WHEN EXISTS ... THEN 1 ELSE 0&lt;/code&gt;&lt;/strong&gt; — produces a &lt;code&gt;0&lt;/code&gt;/&lt;code&gt;1&lt;/code&gt; flag column directly from the boolean predicate, with no separate aggregation needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(n²)&lt;/code&gt; worst case without an index (every row compared against every other); a B-tree index on &lt;code&gt;(start_date, end_date)&lt;/code&gt; lets the planner prune non-candidates, and &lt;code&gt;EXISTS&lt;/code&gt; stops at the first match, so the typical case is far cheaper in practice.&lt;/li&gt;
&lt;/ul&gt;
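&lt;p&gt;The overlap predicate can be sanity-checked outside the database. The sketch below is plain Python, not part of the SQL solution — the &lt;code&gt;overlaps&lt;/code&gt; helper name is illustrative — and it reproduces the &lt;code&gt;CASE WHEN EXISTS&lt;/code&gt; flag for the same four subscriptions as the trace table:&lt;/p&gt;

```python
from datetime import date

def overlaps(a_start, a_end, b_start, b_end):
    # Canonical inclusive interval-overlap check: two closed
    # intervals overlap iff each starts on or before the other ends.
    return a_start <= b_end and b_start <= a_end

# Same rows as the step-by-step trace above.
subs = {
    1: (date(2026, 1, 1), date(2026, 1, 31)),
    2: (date(2026, 1, 15), date(2026, 1, 17)),
    3: (date(2026, 1, 29), date(2026, 2, 4)),
    4: (date(2026, 2, 5), date(2026, 2, 10)),
}

# Flag each user 1/0 exactly like CASE WHEN EXISTS ... THEN 1 ELSE 0,
# skipping self-comparison just as s1.user_id <> s2.user_id does.
flags = {
    u: int(any(overlaps(*iv, *subs[v]) for v in subs if v != u))
    for u, iv in subs.items()
}
print(flags)  # {1: 1, 2: 1, 3: 1, 4: 0}
```

&lt;p&gt;The printed flags match the output table: users 1–3 overlap, user 4 does not.&lt;/p&gt;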

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — window functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Window function problems (all companies)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — intervals&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Interval / overlap problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/intervals" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips to crack Bloomberg data engineering interviews
&lt;/h2&gt;

&lt;p&gt;These are habits that move the needle in real Bloomberg DE loops — not a re-statement of the topics above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice with Bloomberg's data shape
&lt;/h3&gt;

&lt;p&gt;Bloomberg's interview prompts model market data and subscription feeds: &lt;code&gt;ticks&lt;/code&gt;, &lt;code&gt;quotes&lt;/code&gt;, &lt;code&gt;trades&lt;/code&gt;, &lt;code&gt;subscriptions&lt;/code&gt;, &lt;code&gt;provider_feeds&lt;/code&gt;, &lt;code&gt;corporate_actions&lt;/code&gt;. Drilling on order-line ecommerce schemas wastes prep time. Stick to event-shaped tables with a per-symbol-per-day grain, and pull problems from the &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;window-functions topic page&lt;/a&gt; for shapes that match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Master &lt;code&gt;from abc import ABC, abstractmethod&lt;/code&gt; cold
&lt;/h3&gt;

&lt;p&gt;Half of Bloomberg DE coding rounds reach for an abstract base class — confirmed by PipeCode #508 plus two of four recent external Bloomberg DE problems (&lt;code&gt;DataProcessor&lt;/code&gt;, &lt;code&gt;RotatingFileSink&lt;/code&gt;). Type the boilerplate until it is muscle memory: &lt;code&gt;class Base(ABC): @abstractmethod def method(self): ...&lt;/code&gt; plus a concrete subclass with every abstract method implemented. Drill on the &lt;a href="https://pipecode.ai/explore/practice/topic/oop/python" rel="noopener noreferrer"&gt;OOP topic page&lt;/a&gt;.&lt;/p&gt;
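&lt;p&gt;A minimal version of that boilerplate, borrowing the &lt;code&gt;DataProcessor&lt;/code&gt; framing from the reports above (the concrete &lt;code&gt;UpperCaseProcessor&lt;/code&gt; subclass and its methods are illustrative, not from a real prompt):&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class DataProcessor(ABC):
    """Abstract base: subclasses must implement every @abstractmethod."""

    @abstractmethod
    def transform(self, row: dict) -> dict:
        ...

    @abstractmethod
    def write(self, row: dict) -> str:
        ...

class UpperCaseProcessor(DataProcessor):
    # Concrete subclass: instantiable only once every abstract
    # method has an implementation.
    def transform(self, row: dict) -> dict:
        return {k: v.upper() for k, v in row.items()}

    def write(self, row: dict) -> str:
        return str(row)

p = UpperCaseProcessor()               # OK: fully concrete
print(p.transform({"symbol": "ibm"}))  # {'symbol': 'IBM'}

try:
    DataProcessor()  # abstract methods unimplemented
except TypeError as e:
    print("cannot instantiate:", e)
```

&lt;p&gt;The &lt;code&gt;TypeError&lt;/code&gt; on instantiating the base class is the behavior interviewers probe: &lt;code&gt;ABC&lt;/code&gt; enforces the contract at construction time, not at call time.&lt;/p&gt;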

&lt;h3&gt;
  
  
  Type your Python with &lt;code&gt;from __future__ import annotations&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Bloomberg grades production-quality code, and that includes type hints. Add &lt;code&gt;from __future__ import annotations&lt;/code&gt; at the top, use &lt;code&gt;list[dict]&lt;/code&gt; and &lt;code&gt;Iterable[dict]&lt;/code&gt; rather than &lt;code&gt;List[Dict]&lt;/code&gt;, and annotate every function signature. Interviewers notice — and the type-checker catches the off-by-one before the interviewer does.&lt;/p&gt;
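&lt;p&gt;A small sketch of that annotation style (the functions and tick rows are illustrative, not from a real Bloomberg prompt):&lt;/p&gt;

```python
from __future__ import annotations  # must be the first statement

from typing import Iterable

def total_volume(rows: Iterable[dict]) -> int:
    """Sum the 'volume' field across an iterable of tick rows."""
    return sum(row["volume"] for row in rows)

def top_symbols(rows: list[dict], n: int) -> list[str]:
    """Return the n symbols with the highest volume, descending."""
    ranked = sorted(rows, key=lambda r: r["volume"], reverse=True)
    return [r["symbol"] for r in ranked[:n]]

ticks = [
    {"symbol": "IBM", "volume": 300},
    {"symbol": "AAPL", "volume": 900},
    {"symbol": "MSFT", "volume": 500},
]
print(total_volume(ticks))    # 1700
print(top_symbols(ticks, 2))  # ['AAPL', 'MSFT']
```

&lt;p&gt;With the &lt;code&gt;__future__&lt;/code&gt; import, built-in generics like &lt;code&gt;list[dict]&lt;/code&gt; are valid in annotations even on older 3.x interpreters, so there is no reason to fall back to &lt;code&gt;List[Dict]&lt;/code&gt;.&lt;/p&gt;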

&lt;h3&gt;
  
  
  Drill window functions
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;PARTITION BY&lt;/code&gt; plus &lt;code&gt;ORDER BY&lt;/code&gt; plus optional &lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt; plus the &lt;code&gt;ROWS BETWEEN N PRECEDING AND CURRENT ROW&lt;/code&gt; frame override is the Bloomberg SQL toolkit. Practice with &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window-function problems&lt;/a&gt; until the syntax is reflex. PipeCode's &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for data engineering interviews course&lt;/a&gt; drills these primitives across forty-plus problems.&lt;/p&gt;
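&lt;p&gt;When a SQL console is not at hand, the frame semantics can be rehearsed in plain Python. This is a mental-model sketch of &lt;code&gt;SUM(volume) OVER (PARTITION BY symbol ORDER BY trade_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)&lt;/code&gt; — the helper name and sample rows are illustrative, not a replacement for the SQL:&lt;/p&gt;

```python
from collections import defaultdict

def rolling_sum(rows, partition_key, order_key, value_key, preceding):
    """Model SUM(value) OVER (PARTITION BY ... ORDER BY ...
    ROWS BETWEEN `preceding` PRECEDING AND CURRENT ROW)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)  # PARTITION BY
    out = []
    for part in groups.values():
        part.sort(key=lambda r: r[order_key])   # ORDER BY within partition
        for i, row in enumerate(part):
            # Frame: N PRECEDING .. CURRENT ROW, clipped at partition start.
            frame = part[max(0, i - preceding): i + 1]
            out.append({**row, "rolling": sum(r[value_key] for r in frame)})
    return out

ticks = [
    {"symbol": "IBM", "trade_date": d, "volume": v}
    for d, v in [("2026-01-01", 10), ("2026-01-02", 20),
                 ("2026-01-03", 30), ("2026-01-04", 40)]
]
for r in rolling_sum(ticks, "symbol", "trade_date", "volume", preceding=2):
    print(r["trade_date"], r["rolling"])
# 2026-01-01 10
# 2026-01-02 30
# 2026-01-03 60
# 2026-01-04 90
```

&lt;p&gt;Note how the first two rows use a shorter frame — the same clipping a SQL engine applies at the start of each partition.&lt;/p&gt;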

&lt;h3&gt;
  
  
  Bring two STAR stories per behavioral round
&lt;/h3&gt;

&lt;p&gt;Bloomberg's behavioral round goes technical-deep — interviewers drill past project details, not just culture-fit. Pick two real projects with crisp numbers (rows per day, p99, cost per query, post-incident retro) and rehearse the deep-dive Q&amp;amp;A. Generic teamwork stories will not pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill lane&lt;/th&gt;
&lt;th&gt;Practice path&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Curated Bloomberg practice set&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/company/bloomberg" rel="noopener noreferrer"&gt;/explore/practice/company/bloomberg&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bloomberg Python practice&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/company/bloomberg/python" rel="noopener noreferrer"&gt;/explore/practice/company/bloomberg/python&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Two-pointer + string in Python&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/two-pointers/python" rel="noopener noreferrer"&gt;/explore/practice/topic/two-pointers/python&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;String-manipulation in Python&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/string-manipulation/python" rel="noopener noreferrer"&gt;/explore/practice/topic/string-manipulation/python&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OOP + abstract classes&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/abstract-classes" rel="noopener noreferrer"&gt;/explore/practice/topic/abstract-classes&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File I/O + CSV parsing&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/file-io/python" rel="noopener noreferrer"&gt;/explore/practice/topic/file-io/python&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL window functions&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;/explore/practice/topic/window-functions/sql&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intervals / overlap&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/intervals" rel="noopener noreferrer"&gt;/explore/practice/topic/intervals&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All practice topics&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;/explore/practice/topics&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interview courses&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;/explore/courses&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Communication under time pressure
&lt;/h3&gt;

&lt;p&gt;State &lt;strong&gt;assumptions&lt;/strong&gt; before typing: &lt;em&gt;"I'll assume the input CSV has a header row, no embedded newlines in fields, and one consistent delimiter."&lt;/em&gt; State &lt;strong&gt;grain&lt;/strong&gt;: &lt;em&gt;"One JSON object per CSV row, in source order."&lt;/em&gt; State &lt;strong&gt;edge cases&lt;/strong&gt;: &lt;em&gt;"If a row has missing keys, my &lt;code&gt;transform&lt;/code&gt; returns the partial dict and &lt;code&gt;write&lt;/code&gt; serializes whatever keys are present."&lt;/em&gt; Interviewers grade clear reasoning above silent-and-perfect.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the Bloomberg data engineering interview process?
&lt;/h3&gt;

&lt;p&gt;The Bloomberg data engineer interview process is a four-stage funnel: a 45-60 minute technical phone screen in CoderPad covering a medium-difficulty algorithmic problem with follow-ups, an optional online assessment for some roles (Byteboard / HackerRank format with SQL plus Python), a virtual onsite of three to five rounds covering coding plus system design plus behavioral plus fit, and a final decision call. Total elapsed time is typically about thirty days. The recent reported pass rate is &lt;strong&gt;8% across thirteen samples&lt;/strong&gt; — making it the most selective DE loop tracked in our blog series.&lt;/p&gt;

&lt;h3&gt;
  
  
  What programming languages does Bloomberg test for data engineering?
&lt;/h3&gt;

&lt;p&gt;Bloomberg's data engineering interviews lean on &lt;strong&gt;Python&lt;/strong&gt; and &lt;strong&gt;SQL&lt;/strong&gt;, with &lt;strong&gt;Java&lt;/strong&gt; or &lt;strong&gt;C++&lt;/strong&gt; appearing for some platform / infra roles. Python is &lt;strong&gt;production-quality&lt;/strong&gt; — &lt;code&gt;from abc import ABC, abstractmethod&lt;/code&gt;, &lt;code&gt;csv.DictReader&lt;/code&gt;, type hints, generator-based streaming. SQL is at LeetCode-medium / DataLemur grade with a strong tilt toward window functions, rolling aggregates, and interval-overlap detection. PipeCode's &lt;a href="https://pipecode.ai/explore/practice/company/bloomberg/python" rel="noopener noreferrer"&gt;Bloomberg Python practice&lt;/a&gt; anchors the Python side; the SQL flavor is covered by topic-level pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  What SQL topics show up most in Bloomberg data engineering interviews?
&lt;/h3&gt;

&lt;p&gt;The topics are narrow and consistent: &lt;strong&gt;window functions&lt;/strong&gt; (&lt;code&gt;SUM&lt;/code&gt; / &lt;code&gt;AVG&lt;/code&gt; &lt;code&gt;OVER (PARTITION BY symbol ORDER BY trade_date)&lt;/code&gt; for rolling totals), &lt;strong&gt;&lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt;&lt;/strong&gt; for adjacent-row comparisons (day-over-day change, streak detection), &lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt; + &lt;code&gt;COUNT&lt;/code&gt;&lt;/strong&gt; for power-user filters, &lt;strong&gt;self-joins&lt;/strong&gt; for pairwise comparisons, and &lt;strong&gt;&lt;code&gt;EXISTS&lt;/code&gt; subqueries with interval crosschecks&lt;/strong&gt; for overlap detection. PipeCode's &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;window-function problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/intervals" rel="noopener noreferrer"&gt;interval problems&lt;/a&gt; cover these directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How difficult are Bloomberg data engineering interview questions?
&lt;/h3&gt;

&lt;p&gt;Bloomberg data engineering interview questions are calibrated above generic LeetCode-medium for the algorithm half — closer to a &lt;strong&gt;production-code bar&lt;/strong&gt; than a contest bar. The two-problem PipeCode set splits 1 EASY (Reverse Words in String) for warm-ups and 1 HARD (Chunked CSV to LDJSON Processor) for production-quality OOP. Reports describe the panel as friendly but rigorous; the &lt;strong&gt;8% pass rate&lt;/strong&gt; reflects code-review-grade judgment on edge cases, type hints, and streaming-friendly file I/O — not contest depth.&lt;/p&gt;

&lt;h3&gt;
  
  
  How should I prepare for a Bloomberg data engineering interview?
&lt;/h3&gt;

&lt;p&gt;Solve the &lt;a href="https://pipecode.ai/explore/practice/company/bloomberg" rel="noopener noreferrer"&gt;2-problem Bloomberg practice set&lt;/a&gt; end to end — that maps the exact Python pattern coverage. Then back-fill: 10+ two-pointer / string-manipulation problems for the EASY phone-screen flavor, 10+ OOP / abstract-class problems for the HARD onsite Python round, and 20+ SQL window-function and interval-overlap problems for the SQL round. Add Bloomberg-specific behavioral prep — two real STAR stories with crisp numbers — and one read-through of the streaming stack (Kafka, Flink, Airflow, Parquet).&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Bloomberg test object-oriented Python with abstract classes?
&lt;/h3&gt;

&lt;p&gt;Yes — heavily. Half of recent Bloomberg DE coding rounds use an abstract base class (&lt;code&gt;from abc import ABC, abstractmethod&lt;/code&gt;) and ask candidates to implement a concrete subclass. PipeCode's &lt;a href="https://pipecode.ai/explore/practice/chunked-csv-to-line-delimited-json-processor" rel="noopener noreferrer"&gt;Chunked CSV to LDJSON Processor&lt;/a&gt; is the canonical PipeCode-Bloomberg problem in this family, and external interview reports confirm the pattern repeats with &lt;code&gt;DataProcessor&lt;/code&gt; and &lt;code&gt;RotatingFileSink&lt;/code&gt; framings. Master the &lt;code&gt;ABC&lt;/code&gt; boilerplate and the streaming-generator pattern and this round becomes mechanical.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing Bloomberg data engineering problems
&lt;/h2&gt;

&lt;p&gt;Reading patterns is not the same as typing them under time pressure. PipeCode pairs &lt;strong&gt;company-tagged Bloomberg&lt;/strong&gt; problems with tests, AI feedback, and a coding environment so you can drill the exact Python OOP, two-pointer, and SQL window-function patterns Bloomberg asks — without the noise of generic algorithm prep.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/" rel="noopener noreferrer"&gt;Pipecode.ai&lt;/a&gt; is Leetcode for Data Engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/bloomberg" rel="noopener noreferrer"&gt;Browse Bloomberg practice →&lt;/a&gt;&lt;br&gt;
&lt;a href="https://pipecode.ai/explore/courses/python-for-data-engineering-interviews-the-complete-fundamentals" rel="noopener noreferrer"&gt;Python for DE interviews course →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Rivian Data Engineering Interview Questions: Full DE Prep Guide</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Sat, 02 May 2026 06:14:30 +0000</pubDate>
      <link>https://dev.to/gowthampotureddi/rivian-data-engineering-interview-questions-full-de-prep-guide-1j8o</link>
      <guid>https://dev.to/gowthampotureddi/rivian-data-engineering-interview-questions-full-de-prep-guide-1j8o</guid>
      <description>&lt;h2&gt;
  
  
  Rivian Data Engineering Interview Questions: Full DE Prep Guide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rivian data engineering interview questions&lt;/strong&gt; sit at the intersection of three narrow, fluency-graded patterns. The first is &lt;strong&gt;SQL aggregation&lt;/strong&gt; with &lt;code&gt;GROUP BY&lt;/code&gt; to summarize per-entity metrics like &lt;code&gt;MIN(marks)&lt;/code&gt; and &lt;code&gt;MAX(marks)&lt;/code&gt; per subject. The second is &lt;strong&gt;aggregation joins&lt;/strong&gt; that combine &lt;code&gt;LEFT JOIN&lt;/code&gt; with &lt;code&gt;SUM(fare)&lt;/code&gt; and &lt;code&gt;ORDER BY total ASC LIMIT N&lt;/code&gt; to surface the lowest-earning locations on a ride-hailing dataset. The third is &lt;strong&gt;vanilla Python string padding&lt;/strong&gt; that centers a string in a fixed-width line using &lt;code&gt;len()&lt;/code&gt;, integer division &lt;code&gt;(width - len(s)) // 2&lt;/code&gt;, and &lt;code&gt;' ' * pad + s + ' ' * pad&lt;/code&gt; — no &lt;code&gt;str.center()&lt;/code&gt; shortcut. The schema you reason over feels like Rivian's own product (&lt;code&gt;vehicles&lt;/code&gt;, &lt;code&gt;drivers&lt;/code&gt;, &lt;code&gt;rides&lt;/code&gt;, &lt;code&gt;locations&lt;/code&gt;, &lt;code&gt;telemetry_events&lt;/code&gt;), and the bar is fluency with &lt;strong&gt;&lt;code&gt;MIN&lt;/code&gt;/&lt;code&gt;MAX&lt;/code&gt;/&lt;code&gt;SUM&lt;/code&gt; per group&lt;/strong&gt;, &lt;strong&gt;JOIN-aggregate-order-limit composition&lt;/strong&gt;, and &lt;strong&gt;string-arithmetic primitives&lt;/strong&gt; — not contest-difficulty algorithms.&lt;/p&gt;

&lt;p&gt;This guide walks through the four topic clusters Rivian actually tests, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, and an &lt;strong&gt;interview-style problem with a full solution&lt;/strong&gt; that explains why it works. The mix matches the curated 3-problem Rivian set (3 easy, 0 medium, 0 hard) plus one process-and-prep section — a SQL-and-vanilla-Python loop where stating the &lt;strong&gt;invariant&lt;/strong&gt; out loud is half the score and the other half is typing the right primitive on the first try. Strong &lt;strong&gt;data engineer interview questions&lt;/strong&gt; prep at Rivian is less about contest depth and more about clean per-entity rollups, deterministic ordering, and arithmetic primitives that map cleanly onto the AWS pipeline stack the company runs on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu647un5an8uvoi7vzm43.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu647un5an8uvoi7vzm43.jpeg" alt="Bold dark thumbnail for the PipeCode guide to Rivian data engineering interview questions, with SQL aggregation and Python string padding chips in purple, green, and orange accents." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top Rivian data engineering interview topics
&lt;/h2&gt;

&lt;p&gt;From the &lt;a href="https://pipecode.ai/explore/practice/company/rivian" rel="noopener noreferrer"&gt;Rivian data engineering practice set&lt;/a&gt;, the &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; (one row per &lt;strong&gt;H2&lt;/strong&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic (sections &lt;strong&gt;1–4&lt;/strong&gt;)&lt;/th&gt;
&lt;th&gt;Why it shows up at Rivian&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;The Rivian data engineering interview process&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Four-stage funnel — recruiter → tech screen (SQL + basic Python, 60 min) → onsite panel (4-5 rounds) → optional hiring-manager sync.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SQL aggregation and &lt;code&gt;GROUP BY&lt;/code&gt; for per-entity stats&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Range of Marks Scored (EASY) — &lt;code&gt;SELECT subject, MIN(marks), MAX(marks) FROM marks GROUP BY subject&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Aggregation and joins for "lowest-N per dimension"&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Least Earning Locations for a Ride-Hailing Platform (EASY) — &lt;code&gt;locations LEFT JOIN rides&lt;/code&gt;, &lt;code&gt;SUM(fare)&lt;/code&gt; with &lt;code&gt;GROUP BY location&lt;/code&gt;, &lt;code&gt;ORDER BY total ASC LIMIT 3&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;String padding and centering in vanilla Python&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Centered Display Generator (EASY) — &lt;code&gt;(width - len(s)) // 2&lt;/code&gt; left pad, &lt;code&gt;' ' * pad + s + ' ' * pad&lt;/code&gt;, no &lt;code&gt;str.center()&lt;/code&gt; shortcut.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Rivian-flavor framing rule:&lt;/strong&gt; Rivian's prompts model the company's own product — fleet utilization, ride economics, vehicle telemetry, structured display lines for in-cab UIs. The interviewer is grading whether you map each business framing to the right primitive: per-entity summaries → &lt;code&gt;MIN&lt;/code&gt;/&lt;code&gt;MAX&lt;/code&gt;/&lt;code&gt;SUM&lt;/code&gt; + &lt;code&gt;GROUP BY&lt;/code&gt;; lowest-N per dimension → &lt;code&gt;JOIN&lt;/code&gt; + aggregate + &lt;code&gt;ORDER BY ASC LIMIT&lt;/code&gt;; pad-and-center a string → &lt;code&gt;len()&lt;/code&gt; + integer division + string multiplication. State the mapping out loud, then type.&lt;/p&gt;
&lt;/blockquote&gt;
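&lt;p&gt;For the padding lane, the mapping above can be sketched in a few lines of vanilla Python. This is an illustrative implementation, not the graded solution; note that when the gap is odd, one common convention puts the leftover space on the right — confirm that convention with the prompt before typing:&lt;/p&gt;

```python
def center_line(s: str, width: int) -> str:
    """Center s in a line of exactly `width` characters using only
    len(), integer division, and string multiplication (no str.center())."""
    if len(s) >= width:
        return s                          # nothing to pad
    pad = (width - len(s)) // 2           # left pad via floor division
    extra = (width - len(s)) % 2          # leftover char when the gap is odd
    return " " * pad + s + " " * (pad + extra)

print(repr(center_line("RIVIAN", 12)))  # '   RIVIAN   '
print(repr(center_line("R1T", 10)))     # '   R1T    '
```

&lt;p&gt;Saying the invariant out loud — &lt;em&gt;"left pad plus string plus right pad always totals &lt;code&gt;width&lt;/code&gt;, with the odd leftover on the right"&lt;/em&gt; — is the half of the score that typing alone does not earn.&lt;/p&gt;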




&lt;h2&gt;
  
  
  1. The Rivian Data Engineering Interview Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Rivian DE interview funnel from recruiter call to onsite panel
&lt;/h3&gt;

&lt;p&gt;The Rivian data engineer interview process is a four-stage funnel that takes about thirty days end to end: a recruiter screen, a technical phone screen that combines &lt;strong&gt;SQL&lt;/strong&gt; and &lt;strong&gt;basic Python&lt;/strong&gt; in a single 60-minute CoderPad session, a virtual onsite of four to five rounds covering coding plus system design plus behavioral, and an optional hiring-manager sync at the end. The technical bar is calibrated at LeetCode-medium for the algorithm half and "basic Python" for the data half — not contest difficulty.&lt;/p&gt;

&lt;p&gt;The two stages most candidates misread are the &lt;strong&gt;technical phone screen&lt;/strong&gt; (over-prepared on contest algorithms, under-prepared on SQL aggregation and cumulative-sum patterns) and the &lt;strong&gt;behavioral round&lt;/strong&gt; — Rivian takes the Compass values seriously enough that candidates have been rejected despite strong technical performance for arrogance or for lacking a collaborative mindset. Both are addressed below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recruiter and hiring-manager calls
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; The recruiter screen is a 30-minute fit-and-routing chat: background, motivation, why Rivian, what excites you about the EV / adventure-vehicle space. Rivian recruiters specifically ask &lt;em&gt;"Why Rivian?"&lt;/em&gt; and expect a real answer beyond generic enthusiasm — name a feature like Camp Mode, the R2 / R3 lineup, or the company's outdoor-mission stance. Hold your salary expectation until the offer stage; naming a number first usually leaves money on the table. The hiring-manager call (sometimes folded into the recruiter screen, sometimes its own block) is behavioral — past projects, problem-solving approach, how you collaborate cross-functionally with data scientists, ML engineers, and cloud-infrastructure teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; When the recruiter asks &lt;em&gt;"why Rivian over a generalist data role?"&lt;/em&gt;, a clean answer names the data shape — &lt;em&gt;"Rivian's data is event-shaped, fleet-scale, and tied to a physical product. I want to design pipelines where the same telemetry feeds both the in-cab UI and the long-tail charging analytics — that's a different problem than warehousing for ad spend."&lt;/em&gt; Specific beats generic.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; one Rivian-specific signal in the answer (Camp Mode, R2/R3, charging network, fleet telemetry) every minute of the recruiter call.&lt;/p&gt;

&lt;h4&gt;
  
  
  Technical phone screen — SQL plus basic Python
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Roughly 60 minutes in CoderPad. The flavor that interview reports converge on is &lt;strong&gt;SQL&lt;/strong&gt; (LeetCode-medium, often a cumulative-sum / rolling-average / aggregation question on a fleet-style schema) plus &lt;strong&gt;basic Python&lt;/strong&gt; (a small simulation or string-processing problem solvable with &lt;code&gt;len&lt;/code&gt;, slicing, dicts, and conditionals — no &lt;code&gt;pandas&lt;/code&gt;, no &lt;code&gt;re&lt;/code&gt;, no library shortcuts). Reports anchor the specifics: a Senior DE candidate in Canada (Nov 2023) was given &lt;em&gt;"a cumulative sum type question in SQL"&lt;/em&gt; and &lt;em&gt;"some SQL and basic Python"&lt;/em&gt; in the same hour. The signal Rivian is reading is &lt;strong&gt;decomposition + data-shape awareness&lt;/strong&gt;, not library knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A typical SQL prompt on Rivian-style data: &lt;em&gt;"Given a &lt;code&gt;rides(driver_id, fare, ride_at)&lt;/code&gt; table, return each driver's running total of fare, ordered by &lt;code&gt;ride_at&lt;/code&gt;."&lt;/em&gt; The right primitive is &lt;code&gt;SUM(fare) OVER (PARTITION BY driver_id ORDER BY ride_at)&lt;/code&gt; — a window-function cumulative sum. Saying &lt;em&gt;"this is a cumulative-sum-per-entity, so window function with &lt;code&gt;PARTITION BY driver_id ORDER BY ride_at&lt;/code&gt;"&lt;/em&gt; before typing earns the round.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the prompt says &lt;em&gt;"running total"&lt;/em&gt;, &lt;em&gt;"cumulative"&lt;/em&gt;, &lt;em&gt;"rolling X-day"&lt;/em&gt;, or &lt;em&gt;"as of each row"&lt;/em&gt;, reach for &lt;code&gt;SUM() OVER (PARTITION BY … ORDER BY …)&lt;/code&gt; before anything else.&lt;/p&gt;
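&lt;p&gt;The running-total pattern above can be checked against a tiny in-memory table. The sketch below uses Python's stdlib &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for the PostgreSQL dialect (window-function semantics match for this query; the &lt;code&gt;rides&lt;/code&gt; schema is the hypothetical one from the prompt, not a real Rivian table).&lt;/p&gt;

```python
import sqlite3  # stdlib; SQLite 3.25+ (bundled with modern Python) supports window functions

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (driver_id INT, fare REAL, ride_at TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01"),
        (1, 15.0, "2024-01-02"),
        (2, 20.0, "2024-01-01"),
        (1, 5.0,  "2024-01-03"),
    ],
)

# Cumulative sum per entity: PARTITION BY resets the running total per driver,
# ORDER BY ride_at defines the accumulation order within each partition.
rows = conn.execute(
    """
    SELECT driver_id, ride_at,
           SUM(fare) OVER (PARTITION BY driver_id ORDER BY ride_at) AS running_fare
    FROM rides
    ORDER BY driver_id, ride_at
    """
).fetchall()
for row in rows:
    print(row)  # driver 1 accumulates 10.0, 25.0, 30.0; driver 2 starts fresh at 20.0
```

&lt;p&gt;Driver 2's total restarting at 20.0 is the whole point of &lt;code&gt;PARTITION BY&lt;/code&gt;: without it, the window would accumulate across all drivers.&lt;/p&gt;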

&lt;h4&gt;
  
  
  Virtual onsite — coding, system design, behavioral
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; The onsite is four to five rounds over Zoom, totaling four to five hours: a coding round (CoderPad, similar to the screen but harder), one or two system-design rounds (vehicle telemetry pipelines, OTA update systems, charging-network backends), a behavioral round, and sometimes a domain deep-dive on a previous project. The Rivian-distinctive system-design move is asking &lt;em&gt;"what happens when a vehicle loses cellular connectivity?"&lt;/em&gt; — design with &lt;strong&gt;offline-first / graceful-degradation&lt;/strong&gt; in mind, not just cloud infrastructure. The AWS stack you should know cold: &lt;strong&gt;S3&lt;/strong&gt; for the data lake, &lt;strong&gt;EC2&lt;/strong&gt; for compute, &lt;strong&gt;Lambda + Kinesis&lt;/strong&gt; for streaming, &lt;strong&gt;Glue&lt;/strong&gt; for ETL, &lt;strong&gt;Airflow&lt;/strong&gt; for orchestration, &lt;strong&gt;Great Expectations + CloudWatch&lt;/strong&gt; for data quality and alerting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;em&gt;"Design a pipeline that ingests vehicle telemetry from a fleet of 100,000 trucks, surfaces per-vehicle daily summaries, and tolerates a vehicle being offline for up to 24 hours."&lt;/em&gt; The clean answer batches local telemetry on the vehicle side, syncs to &lt;strong&gt;S3&lt;/strong&gt; when the cellular connection returns, parses through &lt;strong&gt;Glue&lt;/strong&gt;, lands the daily summaries in &lt;strong&gt;Snowflake&lt;/strong&gt; or &lt;strong&gt;Redshift&lt;/strong&gt;, and uses &lt;strong&gt;CloudWatch&lt;/strong&gt; alerts on stale partitions. Naming offline-first up front is what differentiates strong candidates.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every Rivian system-design answer should say "offline-first" once and "graceful degradation" once — those two phrases earn the round.&lt;/p&gt;

&lt;h4&gt;
  
  
  Behavioral round — the Rivian Compass values
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Rivian uses the &lt;strong&gt;Rivian Compass&lt;/strong&gt; behavioral framework with three pillars: &lt;strong&gt;Stay Adventurous&lt;/strong&gt; (take calculated risks, push boundaries), &lt;strong&gt;Lead the Way&lt;/strong&gt; (drive outcomes, own the problem), and &lt;strong&gt;Bring People Together&lt;/strong&gt; (collaborate cross-functionally, raise the team). Bring two STAR stories per pillar — Situation, Task, Action, Result — each tied to a real Rivian-relevant skill: shipping under hard product deadlines, owning a pipeline through a vendor migration, mediating between data scientists and platform engineers. Generic teamwork stories will not pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; For &lt;em&gt;Bring People Together&lt;/em&gt;, a strong story names the friction (&lt;em&gt;"data scientists wanted hourly granularity; the platform team budgeted daily"&lt;/em&gt;), the mediation (&lt;em&gt;"I scoped a hybrid: hourly for the top three KPIs, daily for the rest"&lt;/em&gt;), and the result (&lt;em&gt;"shipped on time, hourly partitions surfaced two outage windows in the first month"&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common beginner mistakes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Naming a salary number on the recruiter call (you risk anchoring low; let them name the band first).&lt;/li&gt;
&lt;li&gt;Treating the technical phone screen as a contest-algorithm round (Rivian's screen is decomposition-grade, not LeetCode-hard).&lt;/li&gt;
&lt;li&gt;Skipping &lt;em&gt;"Why Rivian?"&lt;/em&gt; prep and answering with generic EV enthusiasm.&lt;/li&gt;
&lt;li&gt;Forgetting to name &lt;strong&gt;offline-first&lt;/strong&gt; in the system-design round on a vehicle-telemetry prompt.&lt;/li&gt;
&lt;li&gt;Bringing only one STAR story per Compass pillar — interviewers double-click and a single story runs out fast.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyujtioruj0nrw7zb0xg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyujtioruj0nrw7zb0xg.jpeg" alt="Four-stage horizontal funnel diagram of the Rivian data engineering interview process from recruiter screen through technical phone screen and virtual onsite to optional hiring-manager sync, with average durations and PipeCode brand colors." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice: drill the Rivian DE panel before the live screen
&lt;/h3&gt;

&lt;p&gt;&lt;span&gt;COMPANY&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Rivian — all DE problems&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Rivian data engineering practice set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/rivian" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Rivian — SQL only&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Rivian SQL practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/rivian/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  2. SQL Aggregation and GROUP BY for Per-Entity Stats
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Aggregation, GROUP BY, and per-entity MIN / MAX in SQL for data engineering
&lt;/h3&gt;

&lt;p&gt;The first canonical Rivian SQL pattern is the &lt;strong&gt;per-entity summary&lt;/strong&gt; — given a long event table, return one row per entity with one or more aggregates. The headline interview problem on the Rivian practice set, &lt;strong&gt;Range of Marks Scored&lt;/strong&gt;, is a textbook &lt;em&gt;"&lt;code&gt;MIN&lt;/code&gt; and &lt;code&gt;MAX&lt;/code&gt; per subject"&lt;/em&gt; prompt: collapse the per-student &lt;code&gt;marks&lt;/code&gt; table down to one row per subject with the lowest and highest mark in that subject. The canonical answer is &lt;strong&gt;&lt;code&gt;SELECT subject, MIN(marks) AS min_mark, MAX(marks) AS max_mark FROM marks GROUP BY subject&lt;/code&gt;&lt;/strong&gt; — three primitives in one line.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; &lt;code&gt;MIN&lt;/code&gt; and &lt;code&gt;MAX&lt;/code&gt; are &lt;code&gt;NULL&lt;/code&gt;-safe — aggregates ignore &lt;code&gt;NULL&lt;/code&gt; inputs, and a group of all-&lt;code&gt;NULL&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt;. That is exactly what you want when a subject has no graded marks yet — the result row carries &lt;code&gt;NULL&lt;/code&gt; through as "no data" instead of needing a separate &lt;code&gt;WHERE&lt;/code&gt; filter.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt; — the five core aggregates
&lt;/h4&gt;

&lt;p&gt;The aggregate invariant: &lt;strong&gt;each function reduces a group of input rows to a single scalar value, ignoring &lt;code&gt;NULL&lt;/code&gt; inputs by default&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MIN(col)&lt;/code&gt;&lt;/strong&gt; — smallest non-&lt;code&gt;NULL&lt;/code&gt; value in the group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX(col)&lt;/code&gt;&lt;/strong&gt; — largest non-&lt;code&gt;NULL&lt;/code&gt; value in the group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(col)&lt;/code&gt;&lt;/strong&gt; — sum of non-&lt;code&gt;NULL&lt;/code&gt; values; returns &lt;code&gt;NULL&lt;/code&gt; if every input is &lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AVG(col)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;SUM(col) / COUNT(col)&lt;/code&gt; over non-&lt;code&gt;NULL&lt;/code&gt; values; rounding depends on the engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(col)&lt;/code&gt;&lt;/strong&gt; — number of non-&lt;code&gt;NULL&lt;/code&gt; values; &lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; counts every row including all-&lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;marks(student_id, subject, marks)&lt;/code&gt; table.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;student_id&lt;/th&gt;
&lt;th&gt;subject&lt;/th&gt;
&lt;th&gt;marks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Apply each aggregate per subject: &lt;code&gt;MIN(Math) = 75&lt;/code&gt;, &lt;code&gt;MAX(Math) = 92&lt;/code&gt;, &lt;code&gt;SUM(Math) = 167&lt;/code&gt;, &lt;code&gt;AVG(Math) = 83.5&lt;/code&gt;, &lt;code&gt;COUNT(Math marks) = 2&lt;/code&gt; (NULL skipped), &lt;code&gt;COUNT(*) = 3&lt;/code&gt; (NULL counted).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_mark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_mark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_mark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;marks_recorded&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;marks&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; default to &lt;code&gt;COUNT(col)&lt;/code&gt; when "graded" / "recorded" / "non-&lt;code&gt;NULL&lt;/code&gt;" matters. Reach for &lt;code&gt;COUNT(*)&lt;/code&gt; only when "every row" is the literal intent.&lt;/p&gt;
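&lt;p&gt;The &lt;code&gt;NULL&lt;/code&gt;-skipping behavior of all five aggregates can be verified on the four-row &lt;code&gt;marks&lt;/code&gt; table above. This is an illustrative check using stdlib &lt;code&gt;sqlite3&lt;/code&gt;, whose aggregate &lt;code&gt;NULL&lt;/code&gt; semantics match PostgreSQL's for this query.&lt;/p&gt;

```python
import sqlite3  # stdlib; NULL-skipping aggregate semantics match PostgreSQL here

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (student_id INT, subject TEXT, marks INT)")
conn.executemany(
    "INSERT INTO marks VALUES (?, ?, ?)",
    [(1, "Math", 75), (2, "Math", 92), (3, "Math", None), (4, "English", 68)],
)

# All five aggregates over the Math group, which contains one NULL mark.
row = conn.execute(
    """
    SELECT MIN(marks), MAX(marks), SUM(marks), AVG(marks),
           COUNT(marks),   -- skips the NULL row
           COUNT(*)        -- counts every row, NULL included
    FROM marks
    WHERE subject = 'Math'
    """
).fetchone()
print(row)  # (75, 92, 167, 83.5, 2, 3)
```

&lt;p&gt;Note &lt;code&gt;AVG&lt;/code&gt; divides by 2, not 3: the divisor is &lt;code&gt;COUNT(col)&lt;/code&gt;, not &lt;code&gt;COUNT(*)&lt;/code&gt;, so the ungraded row does not drag the average down.&lt;/p&gt;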

&lt;h4&gt;
  
  
  &lt;code&gt;GROUP BY&lt;/code&gt; for per-entity rollups
&lt;/h4&gt;

&lt;p&gt;The grouping invariant: &lt;strong&gt;&lt;code&gt;GROUP BY entity&lt;/code&gt; partitions the table by entity and applies each aggregate inside each partition; the result has exactly one row per distinct entity value&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One column&lt;/strong&gt; — &lt;code&gt;GROUP BY subject&lt;/code&gt; collapses by subject.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple columns&lt;/strong&gt; — &lt;code&gt;GROUP BY subject, year&lt;/code&gt; collapses by subject-year pair (composite grain).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt; vs &lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;WHERE&lt;/code&gt; filters rows before grouping; &lt;code&gt;HAVING&lt;/code&gt; filters groups after aggregation. &lt;code&gt;WHERE marks &amp;gt; 50&lt;/code&gt; is fine; &lt;code&gt;WHERE MIN(marks) &amp;gt; 50&lt;/code&gt; is illegal — use &lt;code&gt;HAVING MIN(marks) &amp;gt; 50&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
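&lt;p&gt;The &lt;code&gt;HAVING&lt;/code&gt; vs &lt;code&gt;WHERE&lt;/code&gt; split in the last bullet can be demonstrated directly; the sketch below uses stdlib &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in (PostgreSQL rejects an aggregate in &lt;code&gt;WHERE&lt;/code&gt; the same way, with its own error message).&lt;/p&gt;

```python
import sqlite3  # stdlib; WHERE-before-grouping vs HAVING-after-grouping applies in PostgreSQL too

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (student_id INT, subject TEXT, marks INT)")
conn.executemany(
    "INSERT INTO marks VALUES (?, ?, ?)",
    [(1, "Math", 75), (2, "Math", 92), (3, "Math", 60), (4, "English", 68)],
)

# WHERE filters individual rows before grouping, so an aggregate there is illegal.
rejected = False
try:
    conn.execute("SELECT subject FROM marks WHERE MIN(marks) > 65 GROUP BY subject")
except sqlite3.OperationalError:
    rejected = True  # engine refuses the aggregate-in-WHERE query
print("aggregate in WHERE rejected:", rejected)

# HAVING filters whole groups after aggregation: Math (MIN 60) drops out.
rows = conn.execute(
    "SELECT subject, MIN(marks) FROM marks GROUP BY subject HAVING MIN(marks) > 65"
).fetchall()
print(rows)  # [('English', 68)]
```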

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; From the same &lt;code&gt;marks&lt;/code&gt; table, group by &lt;code&gt;subject&lt;/code&gt; and apply &lt;code&gt;MIN&lt;/code&gt; + &lt;code&gt;MAX&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;subject&lt;/th&gt;
&lt;th&gt;min_mark&lt;/th&gt;
&lt;th&gt;max_mark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_mark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_mark&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;marks&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every column in the &lt;code&gt;SELECT&lt;/code&gt; list that is not inside an aggregate must appear in &lt;code&gt;GROUP BY&lt;/code&gt;. Postgres errors loudly when you skip this; MySQL only errors when &lt;code&gt;ONLY_FULL_GROUP_BY&lt;/code&gt; is enabled (the default since 5.7), and otherwise silently picks an arbitrary value.&lt;/p&gt;

&lt;h4&gt;
  
  
  Including non-aggregated columns — the trap
&lt;/h4&gt;

&lt;p&gt;The selection invariant: &lt;strong&gt;non-aggregated columns must be in &lt;code&gt;GROUP BY&lt;/code&gt; or wrapped in an aggregate&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELECT subject, student_id, MIN(marks) ...&lt;/code&gt;&lt;/strong&gt; — illegal in standard SQL because &lt;code&gt;student_id&lt;/code&gt; is neither aggregated nor grouped. Postgres raises a "column must appear in the GROUP BY clause or be used in an aggregate function" error.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workaround 1&lt;/strong&gt; — put &lt;code&gt;student_id&lt;/code&gt; in &lt;code&gt;GROUP BY&lt;/code&gt; (changes the grain to per-subject-per-student).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workaround 2&lt;/strong&gt; — wrap it in an aggregate (&lt;code&gt;MIN(student_id)&lt;/code&gt; returns the smallest id per subject).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workaround 3&lt;/strong&gt; — switch to a window function (&lt;code&gt;ROW_NUMBER()&lt;/code&gt; plus &lt;code&gt;WHERE rn = 1&lt;/code&gt; carries the full row).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;em&gt;"For each subject, return the lowest mark and the student who got it."&lt;/em&gt; &lt;code&gt;MIN(marks)&lt;/code&gt; alone cannot pull &lt;code&gt;student_id&lt;/code&gt;. Either group by &lt;code&gt;(subject, student_id)&lt;/code&gt; (changes the grain) or use &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY subject ORDER BY marks ASC)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;marks&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;marks&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;marks&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;student_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lowest_scorer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marks&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lowest_mark&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "the smallest value per group" → &lt;code&gt;MIN&lt;/code&gt;. "The full row at the smallest value per group" → &lt;code&gt;ROW_NUMBER()&lt;/code&gt; + &lt;code&gt;PARTITION BY&lt;/code&gt;. Pick the cheap one when you only need the value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common beginner mistakes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Selecting non-aggregated columns alongside &lt;code&gt;MIN&lt;/code&gt;/&lt;code&gt;MAX&lt;/code&gt; — an error in Postgres, and nondeterministic in MySQL when &lt;code&gt;ONLY_FULL_GROUP_BY&lt;/code&gt; is off.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;WHERE&lt;/code&gt; to filter aggregates (use &lt;code&gt;HAVING&lt;/code&gt; instead).&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;NULL&lt;/code&gt; semantics — &lt;code&gt;COUNT(col)&lt;/code&gt; skips &lt;code&gt;NULL&lt;/code&gt;, &lt;code&gt;COUNT(*)&lt;/code&gt; does not.&lt;/li&gt;
&lt;li&gt;Treating &lt;code&gt;AVG&lt;/code&gt; as integer arithmetic — in PostgreSQL, &lt;code&gt;AVG&lt;/code&gt; over an integer column returns &lt;code&gt;numeric&lt;/code&gt;, not an integer; cast or &lt;code&gt;ROUND&lt;/code&gt; if the prompt asks for an integer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PipeCode's &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for data engineering interviews course&lt;/a&gt; drills these aggregation primitives across forty-plus problems, including the cumulative-sum and rolling-average variants Rivian's screen reaches for.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Range of Marks Scored
&lt;/h3&gt;

&lt;p&gt;Table &lt;code&gt;marks(student_id INT, subject VARCHAR, marks INT)&lt;/code&gt; records every student's mark per subject (&lt;code&gt;marks&lt;/code&gt; is nullable for ungraded entries). Return one row per subject with the &lt;strong&gt;lowest&lt;/strong&gt; and &lt;strong&gt;highest&lt;/strong&gt; recorded mark in that subject. Output &lt;code&gt;subject&lt;/code&gt;, &lt;code&gt;min_mark&lt;/code&gt;, &lt;code&gt;max_mark&lt;/code&gt;, ordered by &lt;code&gt;subject&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using GROUP BY with MIN and MAX
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_mark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_mark&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;marks&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; (input &lt;code&gt;marks&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;student_id&lt;/th&gt;
&lt;th&gt;subject&lt;/th&gt;
&lt;th&gt;marks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY subject&lt;/code&gt;&lt;/strong&gt; — partition the table into two groups: &lt;code&gt;Math&lt;/code&gt; (rows 1, 2, 3) and &lt;code&gt;English&lt;/code&gt; (rows 4, 5, 6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MIN(marks)&lt;/code&gt; per group&lt;/strong&gt; — &lt;code&gt;Math&lt;/code&gt; group: &lt;code&gt;MIN(75, 92, 60) = 60&lt;/code&gt;. &lt;code&gt;English&lt;/code&gt; group: &lt;code&gt;MIN(68, 81, NULL) = 68&lt;/code&gt; (&lt;code&gt;NULL&lt;/code&gt; is skipped, not treated as zero).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX(marks)&lt;/code&gt; per group&lt;/strong&gt; — &lt;code&gt;Math&lt;/code&gt; group: &lt;code&gt;MAX(75, 92, 60) = 92&lt;/code&gt;. &lt;code&gt;English&lt;/code&gt; group: &lt;code&gt;MAX(68, 81, NULL) = 81&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order final output&lt;/strong&gt; — &lt;code&gt;ORDER BY subject&lt;/code&gt; sorts the result alphabetically (&lt;code&gt;English&lt;/code&gt; before &lt;code&gt;Math&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;subject&lt;/th&gt;
&lt;th&gt;min_mark&lt;/th&gt;
&lt;th&gt;max_mark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GROUP BY subject&lt;/strong&gt; — partitions the table into one block per distinct &lt;code&gt;subject&lt;/code&gt; value; aggregates evaluate inside each block independently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIN and MAX are NULL-safe&lt;/strong&gt; — both functions ignore &lt;code&gt;NULL&lt;/code&gt; inputs by default, so the ungraded English row does not skew the result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One pass over the table&lt;/strong&gt; — the engine scans once, hashing rows by &lt;code&gt;subject&lt;/code&gt; and updating per-group &lt;code&gt;min&lt;/code&gt; / &lt;code&gt;max&lt;/code&gt; accumulators; no sort is required for the aggregation itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No window function needed&lt;/strong&gt; — the prompt asks only for the per-group extremes, not the full row that produced them; &lt;code&gt;MIN&lt;/code&gt;/&lt;code&gt;MAX&lt;/code&gt; is the right tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ORDER BY at the end&lt;/strong&gt; — the aggregate already produced one row per subject; &lt;code&gt;ORDER BY subject&lt;/code&gt; is purely for human readability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(n)&lt;/code&gt; for the single hash-aggregate pass plus &lt;code&gt;O(g log g)&lt;/code&gt; for the final sort, where &lt;code&gt;n&lt;/code&gt; is row count and &lt;code&gt;g&lt;/code&gt; is the number of distinct subjects (typically &lt;code&gt;g &amp;lt;&amp;lt; n&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Rivian — aggregations&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Range of Marks Scored (Rivian)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/range-of-marks-scored" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COMPANY&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Rivian — aggregations&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Rivian aggregation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/rivian/topic/aggregations" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Aggregation and Joins for "Lowest-N per Dimension"
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Aggregation, JOIN, and ORDER BY LIMIT in SQL for ride-hailing analytics
&lt;/h3&gt;

&lt;p&gt;The second canonical Rivian SQL pattern is the &lt;strong&gt;lowest-N (or top-N) per dimension&lt;/strong&gt; — combine &lt;code&gt;JOIN&lt;/code&gt; to bring the dimension's metadata in with &lt;code&gt;GROUP BY&lt;/code&gt; aggregation on the metric, then &lt;code&gt;ORDER BY metric ASC LIMIT N&lt;/code&gt;. The headline interview problem on the Rivian practice set, &lt;strong&gt;Least Earning Locations for a Ride-Hailing Platform&lt;/strong&gt;, is exactly this shape: join &lt;code&gt;rides&lt;/code&gt; to &lt;code&gt;locations&lt;/code&gt;, sum the &lt;code&gt;fare&lt;/code&gt; per &lt;code&gt;location&lt;/code&gt;, and return the three lowest-earning locations. The canonical composition is &lt;code&gt;SELECT loc.name, COALESCE(SUM(r.fare), 0) AS total FROM locations loc LEFT JOIN rides r ON r.location_id = loc.id GROUP BY loc.name ORDER BY total ASC LIMIT 3&lt;/code&gt;; the &lt;code&gt;COALESCE&lt;/code&gt; turns a zero-ride location's &lt;code&gt;NULL&lt;/code&gt; total into &lt;code&gt;0&lt;/code&gt; so it sorts first.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; the choice between &lt;code&gt;LEFT JOIN&lt;/code&gt; and &lt;code&gt;INNER JOIN&lt;/code&gt; here is &lt;strong&gt;not cosmetic&lt;/strong&gt;. &lt;code&gt;LEFT JOIN&lt;/code&gt; keeps locations with &lt;strong&gt;zero rides&lt;/strong&gt; — their &lt;code&gt;SUM(fare)&lt;/code&gt; comes back &lt;code&gt;NULL&lt;/code&gt;, so wrap it in &lt;code&gt;COALESCE(SUM(fare), 0)&lt;/code&gt;; PostgreSQL sorts &lt;code&gt;NULL&lt;/code&gt; last under &lt;code&gt;ORDER BY ... ASC&lt;/code&gt; by default, which would bury the zero-ride location at the bottom instead of the top. &lt;code&gt;INNER JOIN&lt;/code&gt; silently drops them — and a location with zero rides is, by definition, the lowest earner, so dropping it produces the wrong answer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;INNER JOIN&lt;/code&gt; vs &lt;code&gt;LEFT JOIN&lt;/code&gt; for "include zero-event entities"
&lt;/h4&gt;

&lt;p&gt;The join invariant: &lt;strong&gt;every row from the left table survives &lt;code&gt;LEFT JOIN&lt;/code&gt;; unmatched right-side columns come through as &lt;code&gt;NULL&lt;/code&gt;. &lt;code&gt;INNER JOIN&lt;/code&gt; keeps only matched pairs&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt;&lt;/strong&gt; — drops left rows with no right match (silent loss).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt;&lt;/strong&gt; — keeps left rows; right columns are &lt;code&gt;NULL&lt;/code&gt; when no match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For "lowest-N" prompts&lt;/strong&gt; — &lt;code&gt;LEFT JOIN&lt;/code&gt; is almost always the right call because zero-event entities &lt;strong&gt;are&lt;/strong&gt; the lowest by definition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;locations&lt;/code&gt; table with 3 rows and a &lt;code&gt;rides&lt;/code&gt; table with rides in only 2 of them.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;locations&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Downtown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Airport&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Suburb&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;rides&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;location_id&lt;/th&gt;
&lt;th&gt;fare&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;INNER JOIN&lt;/code&gt; returns only Downtown and Airport rows. &lt;code&gt;LEFT JOIN&lt;/code&gt; returns Downtown, Airport, and &lt;strong&gt;Suburb with &lt;code&gt;NULL&lt;/code&gt; ride fields&lt;/strong&gt; — Suburb is the location with zero rides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fare&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;rides&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the prompt says &lt;em&gt;"every location"&lt;/em&gt;, &lt;em&gt;"include locations with no rides"&lt;/em&gt;, or &lt;em&gt;"lowest-earning"&lt;/em&gt; / &lt;em&gt;"least"&lt;/em&gt;, you need &lt;code&gt;LEFT JOIN&lt;/code&gt;. &lt;code&gt;INNER JOIN&lt;/code&gt; is fine only when the prompt explicitly excludes empty entities.&lt;/p&gt;
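&lt;p&gt;The silent-loss behavior is easy to check locally. A minimal sketch using Python's &lt;code&gt;sqlite3&lt;/code&gt; as a lightweight stand-in engine (the &lt;code&gt;INNER&lt;/code&gt; vs &lt;code&gt;LEFT JOIN&lt;/code&gt; semantics shown are the same in PostgreSQL; the schema and rows mirror the worked example above):&lt;/p&gt;

```python
import sqlite3

# The join contrast, runnable locally; sqlite3 stands in for PostgreSQL
# here, and the JOIN semantics demonstrated are the same in both engines.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (id INT, name TEXT);
CREATE TABLE rides (id INT, location_id INT, fare INT);
INSERT INTO locations VALUES (1, 'Downtown'), (2, 'Airport'), (3, 'Suburb');
INSERT INTO rides VALUES (100, 1, 25), (101, 2, 60), (102, 1, 35);
""")

# INNER JOIN: Suburb has no rides, so it silently disappears.
inner = conn.execute("""
SELECT DISTINCT loc.name FROM locations loc
INNER JOIN rides r ON r.location_id = loc.id
ORDER BY loc.name;
""").fetchall()

# LEFT JOIN: Suburb survives, with NULL on every rides column.
left = conn.execute("""
SELECT DISTINCT loc.name FROM locations loc
LEFT JOIN rides r ON r.location_id = loc.id
ORDER BY loc.name;
""").fetchall()

print(inner)  # [('Airport',), ('Downtown',)]
print(left)   # [('Airport',), ('Downtown',), ('Suburb',)]
```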

&lt;h4&gt;
  
  
  &lt;code&gt;SUM&lt;/code&gt; aggregation per dimension
&lt;/h4&gt;

&lt;p&gt;The sum invariant: &lt;strong&gt;&lt;code&gt;SUM(metric) GROUP BY dimension&lt;/code&gt; collapses event rows to per-dimension totals; &lt;code&gt;NULL&lt;/code&gt; inputs are skipped, and a group of all-&lt;code&gt;NULL&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(fare) GROUP BY location_id&lt;/code&gt;&lt;/strong&gt; — total revenue per location.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(SUM(fare), 0)&lt;/code&gt;&lt;/strong&gt; — replace &lt;code&gt;NULL&lt;/code&gt; totals with zero when a location has zero rides (often desirable for ranking).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixing aggregates&lt;/strong&gt; — &lt;code&gt;SUM(fare)&lt;/code&gt; plus &lt;code&gt;COUNT(*)&lt;/code&gt; plus &lt;code&gt;AVG(fare)&lt;/code&gt; in one query is fine; all evaluate over the same group.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Apply &lt;code&gt;SUM(fare)&lt;/code&gt; to the &lt;code&gt;LEFT JOIN&lt;/code&gt; result above, grouped by location name.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;location&lt;/th&gt;
&lt;th&gt;total_fare&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Downtown&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Airport&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suburb&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;NULL&lt;/code&gt; for Suburb is the signal we want — that location is the lowest earner. Wrap in &lt;code&gt;COALESCE(SUM(fare), 0) AS total_fare&lt;/code&gt; if the downstream consumer prefers &lt;code&gt;0&lt;/code&gt; over &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fare&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_fare&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;rides&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the prompt asks to &lt;em&gt;"rank"&lt;/em&gt; or &lt;em&gt;"order"&lt;/em&gt; the totals, prefer &lt;code&gt;COALESCE(SUM(...), 0)&lt;/code&gt; so &lt;code&gt;NULL&lt;/code&gt; does not produce an engine-specific sort surprise.&lt;/p&gt;
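&lt;p&gt;That sort surprise can also be checked locally. A sketch using &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in engine: the &lt;code&gt;COALESCE(SUM(...), 0)&lt;/code&gt; pattern is plain SQL and runs unchanged in PostgreSQL, and SQLite happens to sort &lt;code&gt;NULL&lt;/code&gt; first on ascending order (like MySQL) while PostgreSQL sorts it last, which is exactly the engine-specific behavior the rule warns about:&lt;/p&gt;

```python
import sqlite3

# NULL-vs-COALESCE ordering check; sqlite3 stands in for a real engine.
# SQLite sorts NULL first on ASC, PostgreSQL sorts it last: the same
# query can therefore rank the zero-ride location differently per engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (id INT, name TEXT);
CREATE TABLE rides (id INT, location_id INT, fare INT);
INSERT INTO locations VALUES (1, 'Downtown'), (2, 'Airport'), (3, 'Suburb');
INSERT INTO rides VALUES (100, 1, 25), (101, 2, 60), (102, 1, 35);
""")

# Without COALESCE: Suburb's total is NULL; its sort slot is engine-specific.
raw = conn.execute("""
SELECT loc.name, SUM(r.fare) AS total_fare
FROM locations loc
LEFT JOIN rides r ON r.location_id = loc.id
GROUP BY loc.name
ORDER BY total_fare ASC, loc.name ASC;
""").fetchall()

# With COALESCE: the zero-ride location sorts first on every engine.
fixed = conn.execute("""
SELECT loc.name, COALESCE(SUM(r.fare), 0) AS total_fare
FROM locations loc
LEFT JOIN rides r ON r.location_id = loc.id
GROUP BY loc.name
ORDER BY total_fare ASC, loc.name ASC;
""").fetchall()

print(raw)    # Suburb's NULL total sorts first here, but last on Postgres
print(fixed)  # [('Suburb', 0), ('Airport', 60), ('Downtown', 60)]
```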

&lt;h4&gt;
  
  
  &lt;code&gt;ORDER BY ASC LIMIT&lt;/code&gt; for the lowest-N pattern
&lt;/h4&gt;

&lt;p&gt;The ordering invariant: &lt;strong&gt;&lt;code&gt;ORDER BY metric ASC LIMIT N&lt;/code&gt; returns the &lt;code&gt;N&lt;/code&gt; rows with the smallest metric value; &lt;code&gt;ORDER BY metric DESC LIMIT N&lt;/code&gt; returns the largest&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ASC LIMIT N&lt;/code&gt;&lt;/strong&gt; — bottom N (lowest-earning, slowest, smallest).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DESC LIMIT N&lt;/code&gt;&lt;/strong&gt; — top N (highest-earning, fastest, largest).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tie-break&lt;/strong&gt; — add a deterministic secondary key (&lt;code&gt;ORDER BY total_fare ASC, location ASC&lt;/code&gt;) so two locations with identical totals always sort the same way.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;ORDER BY total_fare ASC LIMIT 3&lt;/code&gt; on the totals above returns Suburb (&lt;code&gt;NULL&lt;/code&gt; / &lt;code&gt;0&lt;/code&gt;), then Airport and Downtown tied at &lt;code&gt;60&lt;/code&gt;; the alphabetical tie-break places Airport before Downtown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fare&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_fare&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;rides&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_fare&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always pair &lt;code&gt;LIMIT&lt;/code&gt; with a deterministic &lt;code&gt;ORDER BY&lt;/code&gt; chain. Without a tie-break, &lt;code&gt;LIMIT 3&lt;/code&gt; over a tied set is non-deterministic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common beginner mistakes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;INNER JOIN&lt;/code&gt; instead of &lt;code&gt;LEFT JOIN&lt;/code&gt; for "every location" / "lowest-earning" — silently drops zero-event entities.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;COALESCE&lt;/code&gt; and getting &lt;code&gt;NULL&lt;/code&gt; ordering surprises (in ascending order, Postgres sorts &lt;code&gt;NULL&lt;/code&gt; last by default; MySQL sorts it first).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ORDER BY total ASC LIMIT 3&lt;/code&gt; without a secondary tie-break — non-deterministic results across runs.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;WHERE SUM(fare) = 0&lt;/code&gt; instead of &lt;code&gt;HAVING SUM(fare) = 0&lt;/code&gt; — &lt;code&gt;WHERE&lt;/code&gt; filters before aggregation; &lt;code&gt;HAVING&lt;/code&gt; filters after.&lt;/li&gt;
&lt;/ul&gt;
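&lt;p&gt;The last mistake in that list is worth seeing fail. A sketch with &lt;code&gt;sqlite3&lt;/code&gt; standing in for a real engine (the &lt;code&gt;HAVING&lt;/code&gt; form is plain SQL and runs unchanged in PostgreSQL; the schema and rows match the worked example):&lt;/p&gt;

```python
import sqlite3

# Executable version of the WHERE-vs-HAVING pitfall; sqlite3 stands in
# for PostgreSQL, and both engines reject an aggregate inside WHERE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (id INT, name TEXT);
CREATE TABLE rides (id INT, location_id INT, fare INT);
INSERT INTO locations VALUES (1, 'Downtown'), (2, 'Airport'), (3, 'Suburb');
INSERT INTO rides VALUES (100, 1, 25), (101, 2, 60), (102, 1, 35);
""")

# WHERE is evaluated before aggregation, so an aggregate there is rejected.
where_failed = False
try:
    conn.execute("""
    SELECT loc.name FROM locations loc
    LEFT JOIN rides r ON r.location_id = loc.id
    WHERE SUM(r.fare) IS NULL
    GROUP BY loc.name;
    """)
except sqlite3.OperationalError:
    where_failed = True

# HAVING is evaluated after aggregation, so it can filter on group totals.
zero_ride = conn.execute("""
SELECT loc.name FROM locations loc
LEFT JOIN rides r ON r.location_id = loc.id
GROUP BY loc.name
HAVING COALESCE(SUM(r.fare), 0) = 0;
""").fetchall()

print(where_failed)  # True
print(zero_ride)     # [('Suburb',)]
```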

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ua1nsruzg748ybhzyfh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ua1nsruzg748ybhzyfh.jpeg" alt="Three-step horizontal walkthrough of the SQL pattern for lowest-N per dimension on a ride-hailing dataset: first JOIN locations to rides, then GROUP BY location with SUM of fare, then ORDER BY total ASC LIMIT 3 to surface the three lowest-earning locations, in PipeCode brand colors." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Least Earning Locations
&lt;/h3&gt;

&lt;p&gt;Tables &lt;code&gt;locations(id INT, name VARCHAR)&lt;/code&gt; and &lt;code&gt;rides(id INT, location_id INT, fare DECIMAL)&lt;/code&gt; log every ride on a ride-hailing platform. Return the &lt;strong&gt;three lowest-earning&lt;/strong&gt; locations by total fare, including locations with zero rides. Output &lt;code&gt;location&lt;/code&gt;, &lt;code&gt;total_fare&lt;/code&gt;, ordered by &lt;code&gt;total_fare ASC&lt;/code&gt; with a deterministic tie-break by &lt;code&gt;location ASC&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using LEFT JOIN with SUM and ORDER BY LIMIT
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fare&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_fare&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;rides&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_fare&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; (inputs):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;locations&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Downtown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Airport&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Suburb&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Marina&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;rides&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;location_id&lt;/th&gt;
&lt;th&gt;fare&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;540&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;LEFT JOIN locations to rides&lt;/strong&gt; — match every &lt;code&gt;locations&lt;/code&gt; row to its &lt;code&gt;rides&lt;/code&gt; rows. Downtown gets two ride rows, Airport gets one, Marina gets one, Suburb gets a single row with &lt;code&gt;NULL&lt;/code&gt; on every &lt;code&gt;rides&lt;/code&gt; column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GROUP BY &lt;code&gt;loc.name&lt;/code&gt;&lt;/strong&gt; — collapse to one row per location.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(r.fare)&lt;/code&gt; per group&lt;/strong&gt; — Downtown: &lt;code&gt;30 + 250 = 280&lt;/code&gt;. Airport: &lt;code&gt;540&lt;/code&gt;. Marina: &lt;code&gt;95&lt;/code&gt;. Suburb: &lt;code&gt;SUM(NULL) = NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(SUM, 0)&lt;/code&gt;&lt;/strong&gt; — Suburb's &lt;code&gt;NULL&lt;/code&gt; becomes &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY total_fare ASC, location ASC&lt;/code&gt;&lt;/strong&gt; — sort ascending: Suburb (0), Marina (95), Downtown (280), Airport (540). The secondary &lt;code&gt;location ASC&lt;/code&gt; would only matter if two totals tied.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT 3&lt;/code&gt;&lt;/strong&gt; — keep the first three rows: Suburb, Marina, Downtown.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;location&lt;/th&gt;
&lt;th&gt;total_fare&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Suburb&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marina&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Downtown&lt;/td&gt;
&lt;td&gt;280&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LEFT JOIN preserves the dimension&lt;/strong&gt; — every &lt;code&gt;locations&lt;/code&gt; row survives the join; Suburb (zero rides) survives with &lt;code&gt;NULL&lt;/code&gt; on the &lt;code&gt;rides&lt;/code&gt; columns and is correctly identified as a candidate for "lowest-earning."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SUM is NULL-safe&lt;/strong&gt; — aggregates ignore &lt;code&gt;NULL&lt;/code&gt; inputs; a group of all-&lt;code&gt;NULL&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt;, which is then coerced to &lt;code&gt;0&lt;/code&gt; by &lt;code&gt;COALESCE&lt;/code&gt; for clean ordering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GROUP BY on the dimension's name&lt;/strong&gt; — collapses event rows to one row per dimension; the per-group aggregate evaluates inside each block.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite ORDER BY&lt;/strong&gt; — the primary key &lt;code&gt;total_fare ASC&lt;/code&gt; answers the prompt; the secondary &lt;code&gt;location ASC&lt;/code&gt; is the deterministic tie-break that guarantees stable output across runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LIMIT 3 with a deterministic ORDER BY&lt;/strong&gt; — &lt;code&gt;LIMIT&lt;/code&gt; over a sorted set returns the bottom &lt;code&gt;N&lt;/code&gt; rows; without the tie-break, ties produce non-deterministic results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(rides + locations)&lt;/code&gt; for the hash-join, plus one aggregation pass and a final sort over the per-location aggregate (typically tiny relative to &lt;code&gt;rides&lt;/code&gt;); the &lt;code&gt;LIMIT&lt;/code&gt; does not change the asymptotics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Rivian — joins + aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Least Earning Locations (Rivian)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/least-earning-locations-for-a-ride-hailing-platform" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — joins&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Joins problems (all companies)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  4. String Padding and Centering in Vanilla Python
&lt;/h2&gt;

&lt;h3&gt;
  
  
  String length, slicing, and pad-and-center in Python for data engineering
&lt;/h3&gt;

&lt;p&gt;Half of the Rivian technical phone screen is &lt;strong&gt;basic Python&lt;/strong&gt;, and candidates consistently report one explicit constraint: no library shortcuts. Write the primitive yourself with &lt;code&gt;len()&lt;/code&gt;, integer division, slicing, and string multiplication. The headline problem on the Rivian practice set, &lt;strong&gt;Centered Display Generator&lt;/strong&gt;, is a textbook &lt;em&gt;"pad-and-center"&lt;/em&gt; prompt — given a string and a target width, return the string centered inside that width using only spaces. The canonical answer is &lt;code&gt;' ' * left_pad + s + ' ' * right_pad&lt;/code&gt;, where &lt;code&gt;left_pad = (width - len(s)) // 2&lt;/code&gt; and &lt;code&gt;right_pad = width - len(s) - left_pad&lt;/code&gt; (the right pad absorbs the leftover space on odd widths).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Python provides &lt;code&gt;str.center(width)&lt;/code&gt; as a built-in shortcut, but Rivian's "basic Python" framing rewards writing the pad arithmetic by hand. The interviewer is grading whether you understand &lt;code&gt;(width - len(s)) // 2&lt;/code&gt; for left pad and &lt;code&gt;width - len(s) - left_pad&lt;/code&gt; for right pad — the off-by-one logic on odd-width inputs is the test.&lt;/p&gt;
&lt;/blockquote&gt;
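&lt;p&gt;A quick, hedged check of that framing: the hand-rolled pads always produce a result of exactly &lt;code&gt;width&lt;/code&gt; characters, while which side &lt;code&gt;str.center&lt;/code&gt; gives the odd-width leftover to is a CPython implementation detail, so the built-in and the manual convention can disagree by one position on odd widths. The sketch below only asserts the width-safe facts:&lt;/p&gt;

```python
# Hand-rolled centering (interview convention: extra space on the right)
# next to the built-in str.center. Which side str.center favors on odd
# widths is an implementation detail, so only width-safe facts are checked.
def centered(s, width):
    total = width - len(s)
    left = total // 2
    return ' ' * left + s + ' ' * (total - left)

for width in (6, 7, 10, 11):
    manual = centered('HI', width)
    builtin = 'HI'.center(width)
    # Both fill the target width and keep the payload intact.
    assert len(manual) == width and len(builtin) == width
    assert manual.strip() == 'HI' and builtin.strip() == 'HI'

print(repr(centered('HI', 7)))  # '  HI   ' -- leftover space on the right
```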

&lt;h4&gt;
  
  
  &lt;code&gt;len()&lt;/code&gt;, integer division, and the centering math
&lt;/h4&gt;

&lt;p&gt;The math invariant: &lt;strong&gt;total padding is &lt;code&gt;width - len(s)&lt;/code&gt;; left padding is &lt;code&gt;total // 2&lt;/code&gt;; right padding is &lt;code&gt;total - left_pad&lt;/code&gt;. This formula handles odd-width inputs correctly because right pad absorbs the leftover character&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Even pad case&lt;/strong&gt; — &lt;code&gt;width=6, s='HI', total=4, left=2, right=2&lt;/code&gt; → &lt;code&gt;'  HI  '&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Odd pad case&lt;/strong&gt; — &lt;code&gt;width=7, s='HI', total=5, left=2, right=3&lt;/code&gt; → &lt;code&gt;'  HI   '&lt;/code&gt; (right side gets the extra space).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why right gets the leftover&lt;/strong&gt; — by convention, when a string cannot be perfectly centered, it sits one position left of true center (the extra space goes to the right), which keeps mixed-width output readable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Centering &lt;code&gt;'HI'&lt;/code&gt; in width 6.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;len(s)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;width - len(s)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;left_pad = 4 // 2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;right_pad = 4 - 2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'  HI  '&lt;/code&gt; (2 spaces + HI + 2 spaces)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;centering_pads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; compute total pad first, then split — never compute left and right independently with two &lt;code&gt;//&lt;/code&gt; divisions, because rounding can produce off-by-one drift.&lt;/p&gt;
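&lt;p&gt;The drift that rule warns about fits in a few lines: on an odd total, two independent &lt;code&gt;//&lt;/code&gt; divisions each discard the remainder, and the padding comes up one character short.&lt;/p&gt;

```python
# Why "compute total first, then split" matters: on an odd total,
# dividing twice throws away the remainder on both sides.
total = 5                    # e.g. width=7, len(s)=2

left = total // 2            # 2
right_split = total - left   # 3 -> padding sums back to 5
right_twice = total // 2     # 2 -> padding sums to 4, one space short

print(left + right_split)  # 5
print(left + right_twice)  # 4
```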

&lt;h4&gt;
  
  
  Multiplying strings — the &lt;code&gt;' ' * n&lt;/code&gt; primitive
&lt;/h4&gt;

&lt;p&gt;The multiplication invariant: &lt;strong&gt;&lt;code&gt;s * n&lt;/code&gt; repeats the string &lt;code&gt;s&lt;/code&gt; exactly &lt;code&gt;n&lt;/code&gt; times; &lt;code&gt;n = 0&lt;/code&gt; returns the empty string; &lt;code&gt;n &amp;lt; 0&lt;/code&gt; also returns the empty string (Python does not raise)&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;' ' * 3&lt;/code&gt;&lt;/strong&gt; — three spaces (&lt;code&gt;'   '&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;'-' * 5&lt;/code&gt;&lt;/strong&gt; — five dashes (&lt;code&gt;'-----'&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;'ab' * 2&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;'abab'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;' ' * 0&lt;/code&gt;&lt;/strong&gt; — empty string &lt;code&gt;''&lt;/code&gt; (useful when &lt;code&gt;len(s) == width&lt;/code&gt; exactly).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;' ' * -1&lt;/code&gt;&lt;/strong&gt; — empty string &lt;code&gt;''&lt;/code&gt; (Python does not raise on negative repeats).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Build the centered output by concatenation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;component&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;' ' * 2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'  '&lt;/code&gt; (two spaces)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'HI'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;' ' * 2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'  '&lt;/code&gt; (two spaces)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;concatenated&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'  HI  '&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;centered&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; when &lt;code&gt;total&lt;/code&gt; could be negative (input longer than width), &lt;code&gt;' ' * negative&lt;/code&gt; returns &lt;code&gt;''&lt;/code&gt; quietly — the function returns just &lt;code&gt;s&lt;/code&gt;, which is then truncated separately if the prompt requires it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Truncating when &lt;code&gt;len(s) &amp;gt; width&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The truncation invariant: &lt;strong&gt;when the input string is longer than the target width, slice it to &lt;code&gt;s[:width]&lt;/code&gt; to fit; the math for left/right pad then degrades to zero&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Slice notation&lt;/strong&gt; — &lt;code&gt;s[:width]&lt;/code&gt; returns the first &lt;code&gt;width&lt;/code&gt; characters; safe when &lt;code&gt;len(s) &amp;lt;= width&lt;/code&gt; (it returns the whole string).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pair with the pad math&lt;/strong&gt; — apply &lt;code&gt;s = s[:width]&lt;/code&gt; at the top, then run the standard pad-and-center logic; both branches collapse to the same formula.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why slicing is the canonical answer&lt;/strong&gt; — slicing is &lt;code&gt;O(width)&lt;/code&gt; and unambiguous; using &lt;code&gt;if len(s) &amp;gt; width: return s[:width]&lt;/code&gt; as a special case adds a branch you do not need.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;s = 'HELLO', width = 3&lt;/code&gt; → &lt;code&gt;s[:3] = 'HEL'&lt;/code&gt;, &lt;code&gt;len(s[:3]) = 3&lt;/code&gt;, &lt;code&gt;total = 0&lt;/code&gt;, &lt;code&gt;left = 0&lt;/code&gt;, &lt;code&gt;right = 0&lt;/code&gt;, result &lt;code&gt;'HEL'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;centered_safe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; slice first, pad second. The slice handles the over-long input case; the pad math handles the equal and short cases without a branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common beginner mistakes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reaching for &lt;code&gt;str.center(width)&lt;/code&gt; despite the "basic Python" rule — interviewers note it as a signal you did not internalize the constraint.&lt;/li&gt;
&lt;li&gt;Computing &lt;code&gt;right_pad = total // 2&lt;/code&gt; independently — drifts by one whenever the total padding is odd.&lt;/li&gt;
&lt;li&gt;Forgetting the &lt;code&gt;s[:width]&lt;/code&gt; truncation guard and producing output longer than &lt;code&gt;width&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;' '.join([s])&lt;/code&gt; instead of &lt;code&gt;' ' * n + s + ' ' * n&lt;/code&gt; — &lt;code&gt;join&lt;/code&gt; does not pad; it inserts a separator.&lt;/li&gt;
&lt;/ul&gt;
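&lt;p&gt;The second mistake is easy to demonstrate by running both pad formulas side by side — a throwaway sketch (the function names are ours, not from the prompt):&lt;/p&gt;

```python
def pads_correct(total: int) -> tuple:
    # Derive right from left so the two sides always sum to total.
    left = total // 2
    return left, total - left

def pads_drifting(total: int) -> tuple:
    # Computing both sides independently loses a character on odd totals.
    return total // 2, total // 2

# Even total: the two formulas agree.
assert pads_correct(4) == pads_drifting(4) == (2, 2)

# Odd total: the independent version sums to 4 — one column short of 5.
assert pads_correct(5) == (2, 3)
assert sum(pads_drifting(5)) == 4
```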

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmobdnvbqs5q73tri82i.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmobdnvbqs5q73tri82i.jpeg" alt="Step-by-step worked example showing how to center the string 'HI' inside a 6-character-wide line using vanilla Python: compute total padding 4, split into left pad 2 and right pad 2, then concatenate as two spaces plus 'HI' plus two spaces to produce '  HI  ', with PipeCode purple and green accents." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Interview Question on Centered Display Generator
&lt;/h3&gt;

&lt;p&gt;Write a function &lt;code&gt;center_display(s: str, width: int) -&amp;gt; str&lt;/code&gt; that returns &lt;code&gt;s&lt;/code&gt; centered inside a string of length &lt;code&gt;width&lt;/code&gt;, padded with spaces. If &lt;code&gt;len(s) &amp;gt; width&lt;/code&gt;, truncate &lt;code&gt;s&lt;/code&gt; to &lt;code&gt;width&lt;/code&gt; characters before centering. When the total padding is odd, the &lt;strong&gt;right side&lt;/strong&gt; receives the extra space. Do not use &lt;code&gt;str.center()&lt;/code&gt; or any standard-library helper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Pad-and-Center Without str.center
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;center_display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; (input &lt;code&gt;s = 'HI'&lt;/code&gt;, &lt;code&gt;width = 7&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;expression&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;s[:width]&lt;/code&gt; (no truncation needed)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'HI'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;len(s)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;width - len(s)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;left = total // 2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;right = total - left&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;' ' * left&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'  '&lt;/code&gt; (two spaces)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;' ' * right&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'   '&lt;/code&gt; (three spaces)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;concatenated&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'  HI   '&lt;/code&gt; (2 + 2 + 3 = 7 chars)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Truncate &lt;code&gt;s&lt;/code&gt; to &lt;code&gt;width&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;'HI'[:7]&lt;/code&gt; returns &lt;code&gt;'HI'&lt;/code&gt; unchanged; the slice is a no-op when &lt;code&gt;len(s) &amp;lt;= width&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute total padding&lt;/strong&gt; — &lt;code&gt;7 - 2 = 5&lt;/code&gt;. The total amount of space to distribute around &lt;code&gt;s&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute left pad&lt;/strong&gt; — &lt;code&gt;5 // 2 = 2&lt;/code&gt;. Floor division rounds down, so the left side gets the smaller half when the total padding is odd.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute right pad&lt;/strong&gt; — &lt;code&gt;5 - 2 = 3&lt;/code&gt;. The right side receives the leftover, satisfying the prompt's "right takes the extra" rule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build the output&lt;/strong&gt; — &lt;code&gt;' ' * 2 + 'HI' + ' ' * 3&lt;/code&gt; → &lt;code&gt;'  HI   '&lt;/code&gt;. Total length is &lt;code&gt;2 + 2 + 3 = 7&lt;/code&gt;, which matches &lt;code&gt;width&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input s&lt;/th&gt;
&lt;th&gt;input width&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'HI'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'  HI  '&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'HI'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'  HI   '&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'HELLO'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'HEL'&lt;/code&gt; (truncated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'    '&lt;/code&gt; (four spaces, len(s) = 0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'EXACT'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'EXACT'&lt;/code&gt; (no padding needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Slice first, pad second&lt;/strong&gt; — &lt;code&gt;s[:width]&lt;/code&gt; collapses the over-long input case to the equal-length case before any pad arithmetic runs; the rest of the function does not need a branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total then split&lt;/strong&gt; — computing &lt;code&gt;total&lt;/code&gt; once and deriving &lt;code&gt;left&lt;/code&gt; from &lt;code&gt;total // 2&lt;/code&gt; plus &lt;code&gt;right&lt;/code&gt; from &lt;code&gt;total - left&lt;/code&gt; guarantees &lt;code&gt;left + right == total&lt;/code&gt; exactly, with no rounding drift when the total padding is odd.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right absorbs the leftover&lt;/strong&gt; — the formula &lt;code&gt;right = total - left&lt;/code&gt; (rather than &lt;code&gt;right = total // 2&lt;/code&gt;) sends the extra character to the right side on odd-pad cases, matching the prompt's specification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;String multiplication is safe at zero&lt;/strong&gt; — &lt;code&gt;' ' * 0&lt;/code&gt; returns &lt;code&gt;''&lt;/code&gt; quietly when no padding is needed; no special case for &lt;code&gt;len(s) == width&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concatenation order matters&lt;/strong&gt; — &lt;code&gt;' ' * left + s + ' ' * right&lt;/code&gt; reads left-to-right as the visual output; reversing the order would mirror the result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(width)&lt;/code&gt; for the slice, &lt;code&gt;O(left + right) = O(width)&lt;/code&gt; for the multiplication and concatenation; total &lt;code&gt;O(width)&lt;/code&gt;, independent of the input string's original length.&lt;/li&gt;
&lt;/ul&gt;
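&lt;p&gt;A quick way to convince yourself of the width invariant is a property check over a small grid of inputs — a disposable sketch, not part of the interview answer:&lt;/p&gt;

```python
def center_display(s: str, width: int) -> str:
    s = s[:width]
    total = width - len(s)
    left = total // 2
    return ' ' * left + s + ' ' * (total - left)

# Every (string, width) pair must produce output of exactly `width`
# characters, with the extra space (if any) landing on the right.
for s in ('', 'A', 'HI', 'HELLO'):
    for width in range(8):
        out = center_display(s, width)
        assert len(out) == width
        left = len(out) - len(out.lstrip(' '))
        right = len(out) - len(out.rstrip(' '))
        assert right >= left  # right pad is never shorter than left
```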

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Rivian — string processing&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Centered Display Generator (Rivian)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/centered-display-generator" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — string processing&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;String processing problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/string-processing" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips to crack Rivian data engineering interviews
&lt;/h2&gt;

&lt;p&gt;These are habits that move the needle in real Rivian DE loops — not a re-statement of the topics above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice with Rivian's data shapes
&lt;/h3&gt;

&lt;p&gt;Rivian's interview prompts model a vehicle / fleet / ride-hailing world: &lt;code&gt;vehicles&lt;/code&gt;, &lt;code&gt;drivers&lt;/code&gt;, &lt;code&gt;rides&lt;/code&gt;, &lt;code&gt;locations&lt;/code&gt;, &lt;code&gt;charging_sessions&lt;/code&gt;, &lt;code&gt;telemetry_events&lt;/code&gt;. Drilling on order-line ecommerce schemas wastes prep time. Stick to event-shaped tables with a per-entity grain, and pull problems from the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation" rel="noopener noreferrer"&gt;aggregation topic page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;joins topic page&lt;/a&gt; for shapes that match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Master &lt;code&gt;GROUP BY&lt;/code&gt; and aggregation cold
&lt;/h3&gt;

&lt;p&gt;The 3-problem PipeCode set is aggregation-heavy. Type &lt;code&gt;SELECT entity, MIN(metric), MAX(metric), SUM(metric), AVG(metric), COUNT(metric) FROM events GROUP BY entity&lt;/code&gt; from memory until it is muscle memory. Layer in &lt;code&gt;HAVING&lt;/code&gt; for group-level filters, &lt;code&gt;COALESCE(SUM(metric), 0)&lt;/code&gt; for NULL-safe totals in ranking output, and &lt;code&gt;ORDER BY metric ASC LIMIT N&lt;/code&gt; for lowest-N prompts.&lt;/p&gt;
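&lt;p&gt;The template runs unchanged against an in-memory SQLite database, which is handy for offline reps — a sketch with invented table and column names, not a reported Rivian schema:&lt;/p&gt;

```python
import sqlite3

# Toy events table: one row per event, grain = (entity, metric).
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE events (entity TEXT, metric REAL);
    INSERT INTO events VALUES
        ('loc_a', 10), ('loc_a', 30), ('loc_b', 5), ('loc_b', 7), ('loc_c', 100);
""")

# The five core aggregates, one row per entity, lowest-N on the total.
rows = conn.execute("""
    SELECT entity, MIN(metric), MAX(metric), SUM(metric), AVG(metric), COUNT(metric)
    FROM events
    GROUP BY entity
    ORDER BY SUM(metric) ASC
    LIMIT 2
""").fetchall()

print(rows)  # loc_b (total 12) sorts first, then loc_a (total 40)
```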

&lt;h3&gt;
  
  
  Add cumulative-sum and rolling-average to your toolkit
&lt;/h3&gt;

&lt;p&gt;Recent Rivian DE candidates have been asked &lt;strong&gt;cumulative-sum&lt;/strong&gt; questions in SQL (Taro report, Nov 2023) and TechPrep names &lt;em&gt;"Rolling Average SQL Query"&lt;/em&gt; and &lt;em&gt;"Restaurant Growth"&lt;/em&gt; as recurring shapes. The primitive is &lt;code&gt;SUM(metric) OVER (PARTITION BY entity ORDER BY ts ROWS BETWEEN N PRECEDING AND CURRENT ROW)&lt;/code&gt;. Drill it on the &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;window-functions topic page&lt;/a&gt;.&lt;/p&gt;
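&lt;p&gt;SQLite (3.25+) supports the same window-function syntax, so the primitive can be sanity-checked locally — the schema below is a made-up stand-in for a telemetry table:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE telemetry (entity TEXT, ts INTEGER, metric REAL);
    INSERT INTO telemetry VALUES
        ('v1', 1, 10), ('v1', 2, 20), ('v1', 3, 30),
        ('v2', 1, 5),  ('v2', 2, 5);
""")

# Rolling 2-row sum per entity, ordered by timestamp.
rows = conn.execute("""
    SELECT entity, ts,
           SUM(metric) OVER (
               PARTITION BY entity
               ORDER BY ts
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS rolling_sum
    FROM telemetry
    ORDER BY entity, ts
""").fetchall()

for row in rows:
    print(row)  # e.g. v1 at ts=3 sums its ts=2 and ts=3 rows: 20 + 30 = 50
```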

&lt;h3&gt;
  
  
  Write Python without library shortcuts
&lt;/h3&gt;

&lt;p&gt;Rivian's "basic Python" framing means no &lt;code&gt;str.center&lt;/code&gt;, no &lt;code&gt;pandas&lt;/code&gt;, no &lt;code&gt;re&lt;/code&gt; for these screens. Train yourself to reach for &lt;code&gt;len&lt;/code&gt;, slicing, dicts, lists, and conditionals before any import. PipeCode's &lt;a href="https://pipecode.ai/explore/courses/python-for-data-engineering-interviews-the-complete-fundamentals" rel="noopener noreferrer"&gt;Python for data engineering interviews course&lt;/a&gt; drills the vanilla-Python primitives that match Rivian's expectations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Know the AWS DE stack
&lt;/h3&gt;

&lt;p&gt;Rivian's pipelines run on &lt;strong&gt;S3&lt;/strong&gt; (data lake), &lt;strong&gt;Lambda&lt;/strong&gt; + &lt;strong&gt;Kinesis&lt;/strong&gt; (streaming), &lt;strong&gt;Glue&lt;/strong&gt; (ETL), &lt;strong&gt;Airflow&lt;/strong&gt; (orchestration), and &lt;strong&gt;Great Expectations&lt;/strong&gt; + &lt;strong&gt;CloudWatch&lt;/strong&gt; (data quality and alerting). Even if the screen does not test these by name, the system-design round will. Be able to draw a vehicle-telemetry pipeline that ingests from a fleet of trucks, lands daily summaries in a warehouse, and tolerates a 24-hour offline window — in five minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Map STAR stories to the Rivian Compass
&lt;/h3&gt;

&lt;p&gt;The Compass is non-negotiable: &lt;strong&gt;Stay Adventurous&lt;/strong&gt;, &lt;strong&gt;Lead the Way&lt;/strong&gt;, &lt;strong&gt;Bring People Together&lt;/strong&gt;. Bring two real STAR stories per pillar, each tied to a Rivian-relevant skill (shipping under a hard deadline, owning a vendor migration, mediating between data scientists and platform engineers). Generic teamwork stories will not stand out. PipeCode's &lt;a href="https://pipecode.ai/explore/courses/behavior-interview-prep-for-data-engineering-interviews" rel="noopener noreferrer"&gt;behavioral interview prep course&lt;/a&gt; walks the STAR + values-mapping shape.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill lane&lt;/th&gt;
&lt;th&gt;Practice path&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Curated Rivian practice set&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/company/rivian" rel="noopener noreferrer"&gt;/explore/practice/company/rivian&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregation in SQL (Rivian-tagged)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/company/rivian/topic/aggregations" rel="noopener noreferrer"&gt;/explore/practice/company/rivian/topic/aggregations&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joins for relational analytics&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;/explore/practice/topic/joins&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cumulative-sum / window functions&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;/explore/practice/topic/window-functions&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;String processing in vanilla Python&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/string-processing" rel="noopener noreferrer"&gt;/explore/practice/topic/string-processing&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All practice topics&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;/explore/practice/topics&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interview courses&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;/explore/courses&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Communication under time pressure
&lt;/h3&gt;

&lt;p&gt;State &lt;strong&gt;assumptions&lt;/strong&gt; before typing: &lt;em&gt;"I'll assume &lt;code&gt;location_id&lt;/code&gt; is never &lt;code&gt;NULL&lt;/code&gt; in the &lt;code&gt;rides&lt;/code&gt; table and that we want zero-ride locations included in 'lowest earning.'"&lt;/em&gt; State &lt;strong&gt;grain&lt;/strong&gt;: &lt;em&gt;"One row per location after the aggregation."&lt;/em&gt; State &lt;strong&gt;edge cases&lt;/strong&gt;: &lt;em&gt;"If two locations tie on &lt;code&gt;total_fare&lt;/code&gt;, my secondary &lt;code&gt;ORDER BY location&lt;/code&gt; keeps the output stable."&lt;/em&gt; Interviewers grade clear reasoning above silent-and-perfect.&lt;/p&gt;
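&lt;p&gt;The zero-ride assumption stated above is exactly what a &lt;code&gt;LEFT JOIN&lt;/code&gt; plus &lt;code&gt;COALESCE&lt;/code&gt; encodes — a hedged sketch with invented tables, runnable against SQLite:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE locations (location TEXT PRIMARY KEY);
    CREATE TABLE rides (location TEXT, fare REAL);
    INSERT INTO locations VALUES ('airport'), ('downtown'), ('suburb');
    INSERT INTO rides VALUES ('airport', 40), ('airport', 35), ('downtown', 12);
    -- 'suburb' has zero rides on purpose.
""")

# LEFT JOIN keeps zero-ride locations; COALESCE turns their NULL total
# into 0 so they sort first; the secondary ORDER BY keeps ties stable.
rows = conn.execute("""
    SELECT l.location, COALESCE(SUM(r.fare), 0) AS total_fare
    FROM locations l
    LEFT JOIN rides r ON r.location = l.location
    GROUP BY l.location
    ORDER BY total_fare ASC, l.location
    LIMIT 2
""").fetchall()

print(rows)  # suburb (total 0) first, then downtown (total 12)
```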




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the Rivian data engineering interview process?
&lt;/h3&gt;

&lt;p&gt;The Rivian data engineer interview process is a four-stage funnel: a 30-minute recruiter screen, a 60-minute technical phone screen mixing SQL and basic Python in CoderPad, a virtual onsite of four to five rounds covering coding, system design, and behavioral, and an optional hiring-manager sync at the end. Total elapsed time is typically about thirty days. The &lt;a href="https://pipecode.ai/explore/practice/company/rivian" rel="noopener noreferrer"&gt;curated Rivian practice set&lt;/a&gt; on PipeCode mirrors the technical-screen flavor (SQL aggregations + vanilla Python).&lt;/p&gt;

&lt;h3&gt;
  
  
  What programming languages does Rivian test for data engineering?
&lt;/h3&gt;

&lt;p&gt;Rivian's data engineering interviews lean on &lt;strong&gt;SQL&lt;/strong&gt; and &lt;strong&gt;Python&lt;/strong&gt; for the screen, with &lt;strong&gt;AWS&lt;/strong&gt; + &lt;strong&gt;Airflow&lt;/strong&gt; + &lt;strong&gt;Kinesis&lt;/strong&gt; + &lt;strong&gt;Great Expectations&lt;/strong&gt; appearing in the system-design round. Python is "basic" — &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;dict&lt;/code&gt;, &lt;code&gt;str&lt;/code&gt;, &lt;code&gt;len&lt;/code&gt;, slicing, conditionals — without &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;re&lt;/code&gt;, or library shortcuts like &lt;code&gt;str.center&lt;/code&gt;. SQL is at LeetCode-medium / DataLemur grade with a strong tilt toward aggregation, cumulative-sum, and joins on event data. PipeCode's &lt;a href="https://pipecode.ai/explore/practice/topic/string-processing" rel="noopener noreferrer"&gt;string-processing topic page&lt;/a&gt; matches the Rivian Python flavor closely.&lt;/p&gt;

&lt;h3&gt;
  
  
  What SQL topics show up most in Rivian data engineering interviews?
&lt;/h3&gt;

&lt;p&gt;The topics are narrow and consistent: &lt;strong&gt;aggregation with &lt;code&gt;GROUP BY&lt;/code&gt; and &lt;code&gt;MIN&lt;/code&gt;/&lt;code&gt;MAX&lt;/code&gt;/&lt;code&gt;SUM&lt;/code&gt;&lt;/strong&gt; for per-entity rollups, &lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; plus aggregation plus &lt;code&gt;ORDER BY ASC LIMIT N&lt;/code&gt;&lt;/strong&gt; for lowest-N / top-N prompts, and &lt;strong&gt;cumulative-sum / rolling-average&lt;/strong&gt; patterns using window functions like &lt;code&gt;SUM() OVER (PARTITION BY entity ORDER BY ts)&lt;/code&gt;. PipeCode's &lt;a href="https://pipecode.ai/explore/practice/company/rivian/topic/aggregations" rel="noopener noreferrer"&gt;aggregation problems&lt;/a&gt; tagged to Rivian and the global &lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;joins problems&lt;/a&gt; cover these directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How difficult are Rivian data engineering interview questions?
&lt;/h3&gt;

&lt;p&gt;Rivian data engineering interview questions are calibrated at LeetCode-medium for the algorithm half of the screen and "basic" for the data half — the bar is &lt;strong&gt;fluency, not contest difficulty&lt;/strong&gt;. The Rivian PipeCode practice set is intentionally three EASY problems so candidates can build confidence on the exact patterns Rivian tests (aggregation, aggregation-with-joins, vanilla Python string processing). Reports describe the panel as friendly and conventional — no surprise loops, no whiteboard hazing.&lt;/p&gt;

&lt;h3&gt;
  
  
  How should I prepare for a Rivian data engineering interview?
&lt;/h3&gt;

&lt;p&gt;Solve the &lt;a href="https://pipecode.ai/explore/practice/company/rivian" rel="noopener noreferrer"&gt;3-problem Rivian practice set&lt;/a&gt; end to end — that maps the exact pattern coverage. Then back-fill: 20+ aggregation problems for &lt;code&gt;GROUP BY&lt;/code&gt; fluency, 10+ join-plus-aggregation problems for the lowest-N / top-N pattern, 10+ vanilla-Python string-processing problems, and a handful of cumulative-sum / rolling-average problems for the window-function half of the SQL screen. Add Rivian Compass behavioral prep — two STAR stories per pillar — and one read-through of the AWS DE stack (S3, Lambda, Kinesis, Glue, Airflow, Great Expectations).&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Rivian's data engineering interview cover AWS and pipeline design?
&lt;/h3&gt;

&lt;p&gt;Yes — the system-design round in the virtual onsite reaches deep into &lt;strong&gt;AWS&lt;/strong&gt; (S3, EC2, Lambda, Glue, Kinesis), &lt;strong&gt;Airflow&lt;/strong&gt; for orchestration, and &lt;strong&gt;Great Expectations&lt;/strong&gt; + &lt;strong&gt;CloudWatch&lt;/strong&gt; for data quality and alerting. Vehicle-telemetry pipelines, OTA update systems, and charging-network backends are the recurring framings. The Rivian-distinctive design move is &lt;strong&gt;offline-first&lt;/strong&gt; — interviewers ask &lt;em&gt;"what happens when a vehicle loses cellular connectivity?"&lt;/em&gt; and reward graceful-degradation strategies. PipeCode's &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL system design course&lt;/a&gt; walks the canonical pipeline architectures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing Rivian data engineering problems
&lt;/h2&gt;

&lt;p&gt;Reading patterns is not the same as typing them under time pressure. PipeCode pairs &lt;strong&gt;company-tagged Rivian&lt;/strong&gt; problems with tests, AI feedback, and a coding environment so you can drill the exact SQL aggregation, join, and Python string-processing patterns Rivian asks — without the noise of generic algorithm prep.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/" rel="noopener noreferrer"&gt;Pipecode.ai&lt;/a&gt; is Leetcode for Data Engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/company/rivian" rel="noopener noreferrer"&gt;Browse Rivian practice →&lt;/a&gt;&lt;br&gt;
&lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for DE interviews course →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
