Joud Awad

Posted on Jun 24

49/60 Days System Design Questions

#abotwrotethis #dataengineering #systemdesign #database

Your data team just opened a $4,200 BigQuery bill.

For a single month. One analyst. 12 queries.

The queries weren't wrong. They weren't inefficient SQL. They were reasonable analytics queries — "give me last 30 days of events for customer X." The problem was that every single one scanned the full 3.2 TB table. No partition pruning. No cost control. Just full scans, every time.

This is the most expensive silent bug in data engineering. You write a query. It looks fast. It returns results. And every run quietly eats through terabytes you're paying per-byte to scan.

The fix is partition strategy — but picking the wrong one doesn't just fail to help, it actively makes things worse.

Here's the setup:

You're running a 3.2 TB events table on BigQuery. 18 months of data. Ingested daily. Analyst queries almost always filter on two things: a date range ("last 30 days") and a customer_id ("for customer X").

Which partition + clustering strategy do you pick?

A) Partition by ingestion date — date range queries only scan the relevant day-partitions.

B) Partition by customer_id ranges — customer queries only scan the relevant ID bucket.

C) No partitioning — use Redshift DISTKEY on customer_id + SORTKEY on event_time.

D) Partition by ingestion date + cluster on customer_id — date pruning cuts the partition, clustering prunes within it.

Three of these reduce your bill. Only one fully prunes on both filters your analysts actually use.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments

Drop your answer

30DaysOfSystemDesign #SystemDesign #DataEngineering #BigQuery

Top comments (5)

Joud Awad • Jun 24

Why D wins:

Your analyst queries always filter on two columns: a date range and a customer_id. A single-key strategy protects one and ignores the other.

Here's what D does in BigQuery:

• Partition by date → BigQuery skips every partition outside the date range before scanning a single row. For a "last 30 days" query on 18 months of data, you immediately eliminate ~94% of the table.
• Cluster by customer_id → Within each scanned partition, BigQuery sorts the rows by customer_id into storage blocks. A filter on customer_id now skips 90–95% of the remaining rows.

Combined: a "last 30 days for customer X" query goes from scanning 3.2 TB → ~12 GB. Bill drops from $4,200 → ~$70/month. Same queries. Same data. Right strategy.

This pattern works across every major warehouse: BigQuery (partition + cluster), Snowflake (micro-partitioning + cluster keys), Redshift (partition key + SORTKEY compound).

Joud Awad • Jun 24

Why A is incomplete (date partition only):

Date partition alone is a real improvement — date range filters prune the irrelevant partitions immediately. Your bill probably drops 60–70%.

But your analyst queries also filter by customer_id. Inside each date partition, all customer rows are interleaved across storage blocks. BigQuery has to read the entire partition to find customer X's rows. You've cut the scan from 3.2 TB → ~500 GB (the last 30 days), but you're still scanning every customer inside that window.

A is the obvious first step. D is A taken all the way.

Joud Awad • Jun 24

Why B is the senior engineer trap (customer_id range partitioning):

This one gets smart people. Your analysts always filter by customer_id — why not partition on that?

Because real customer_id distributions are never uniform. Your enterprise clients (IDs 1–10,000) generate 80% of all events. Every analytics query against active customers hits the same partition. No pruning benefit — the entire interesting data is in one bucket.

Worse: your queries also filter by date. BigQuery now has to scan the full customer partition to find rows from the last 30 days. You've traded a full table scan for a full-partition scan on your hottest partition. Same cost, different framing.

Hot partitions on high-traffic ID ranges is the most common expensive mistake in data warehouse design. Looks right in theory. Never works in production.

Joud Awad • Jun 24

Why C fails at this scale (DISTKEY + SORTKEY, no partitioning):

DISTKEY distributes rows across Redshift nodes by customer_id hash — optimizes joins. SORTKEY on event_time sorts rows on disk within each slice — helps date range scans read contiguous blocks.

Both are real optimizations. Neither replaces partition pruning at 3.2 TB.

Without partitioning, every query triggers a full table scan across all nodes. If your DISTKEY distribution is skewed — which it will be when enterprise customers dominate — one Redshift node holds 60–70% of your data. Every expensive query hammers that node.

You've turned a cost problem into a latency + cost problem.

DISTKEY + SORTKEY is solid for join optimization and smaller tables. It's not a substitute for partition pruning at terabyte scale.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.