Arjun Krishna
How to Size a Spark Cluster. And How Not To.

Interviewer:

You need to process 1 TB of data in Spark. How do you size the cluster?

Most answers start with division.

1 TB

→ choose 128 MB partitions

→ calculate ~8,000 partitions

→ map to cores

→ decide number of nodes
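The division above can be sketched in a few lines. The cores-per-node and wave counts here are hypothetical placeholders, which is exactly the problem: nothing in this math reflects the workload.

```python
# The "textbook" sizing division -- a minimal sketch of the naive approach.
TB = 1024**4
MB = 1024**2

data_size = 1 * TB
partition_size = 128 * MB

partitions = data_size // partition_size   # 8192 partitions
cores_per_node = 64                        # hypothetical
waves = 4                                  # hypothetical task waves we accept
concurrent_tasks = partitions // waves     # 2048 tasks in flight
nodes = concurrent_tasks // cores_per_node # 32 nodes

print(partitions, concurrent_tasks, nodes) # 8192 2048 32
```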

It is clean. It is logical.

It is also incomplete.

Because cluster size is not derived from data size.

It is derived from workload behavior.

Here is how this question should be approached in real systems.


Step 1: Clarify Which “1 TB” We’re Talking About

When someone says “1 TB,” there are multiple meanings hiding inside that number.

Before sizing anything, it helps to separate at least five different sizes.


1. Stored Size on Disk

1 TB compressed Parquet in object storage tells very little about execution behavior.

This number reflects storage efficiency and file layout. It affects metadata overhead and file management, not necessarily runtime footprint.


2. Effective Scan Size After Pruning

The real question is: how much data will Spark actually read?

  • Partition pruning skips entire directories.
  • Predicate pushdown skips non-matching row groups.
  • Column pruning avoids reading unused columns.

A 1 TB table may result in only 200 to 300 GB scanned.

Cluster sizing must be based on actual scan size, not table size.
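A back-of-envelope model makes the gap concrete. All three pruning ratios below are hypothetical; in practice you read them from the Spark UI or query plan after a representative run.

```python
# Hedged estimate of effective scan size after pruning (ratios hypothetical).
table_size_gb = 1024          # 1 TB table on disk

partition_prune = 0.50        # e.g. a date predicate keeps half the partitions
rowgroup_prune = 0.80         # predicate pushdown keeps 80% of row groups
column_prune = 0.60           # the query touches 60% of column bytes

scan_gb = table_size_gb * partition_prune * rowgroup_prune * column_prune
print(round(scan_gb))         # ~246 GB actually read, not 1024
```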


3. In-Memory Expansion Size

Compressed columnar data expands during execution.

Parquet on disk is compressed and encoded.

In memory it is decompressed, decoded, and materialized into Spark’s internal row format.

A 1 TB compressed dataset can expand to 2 to 4 TB across executors during processing.

This directly affects:

  • Executor memory sizing
  • Spill probability
  • GC pressure
  • Memory overhead configuration

Disk size is rarely the memory anchor.
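A rough expansion estimate, assuming a hypothetical 3x decode factor and fleet size, shows why disk size misleads:

```python
# Back-of-envelope in-memory expansion (factor and fleet size hypothetical).
compressed_tb = 1.0
expansion_factor = 3.0        # decompressed + decoded + internal row format

in_memory_tb = compressed_tb * expansion_factor   # 3.0 TB across executors

executors = 50                # hypothetical fleet size
per_executor_gb = in_memory_tb * 1024 / executors
print(round(per_executor_gb, 1))   # ~61.4 GB of working set per executor
```

If each executor is configured with, say, 32 GB, this working set does not fit, and spill follows.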


4. Peak Intermediate Size

This is usually the real anchor.

Spark executes as a DAG of stages separated by shuffles.

A 1 TB job might:

  • Filter to 400 GB
  • Join and expand to 2.5 TB shuffle
  • Aggregate back to 50 GB

Spark does not care about input size.

It cares about the largest intermediate state it must shuffle, sort, or spill.

If a join explodes to 2.5 TB, that becomes the sizing baseline.
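The sizing anchor is the maximum stage footprint across the DAG, not the input. Using the example numbers above:

```python
# The anchor is the largest intermediate state, not the input size.
stage_sizes_gb = {
    "scan":      1024,   # 1 TB input
    "filter":     400,
    "join":      2560,   # exploded join output, 2.5 TB shuffle
    "aggregate":   50,
}

anchor = max(stage_sizes_gb.values())
print(anchor)   # 2560 -> size for the 2.5 TB shuffle, not the 1 TB input
```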


5. Input Variance Across Runs

Is 1 TB stable?

Or does it fluctuate:

  • 800 GB on normal days
  • 1.4 TB at quarter end

Production systems fail at the tail, not the mean.

Sizing must consider the 95th percentile load, not the average.


Before We Talk Math, Understand Spark’s Assumptions

Spark was built with specific assumptions:

  • Data can be evenly partitioned
  • Most transformations are narrow
  • Wide transformations require shuffle and are expensive
  • Network is slower than CPU
  • Memory is finite

When these assumptions hold, Spark scales predictably.

When they do not, adding nodes does not fix the root cause.

Cluster sizing is not about fighting Spark.

It is about aligning workload behavior with its design.

This discussion is primarily framed around batch data engineering workloads, where shuffle, intermediate state, and throughput dominate sizing decisions. The underlying framework, however, is universal. For ML, BI, or streaming workloads, the dominant constraint shifts. Memory, concurrency, or state may become primary. The systems thinking remains the same.


Step 2: What Type of Workload Is This?

Cluster sizing depends on bottleneck classification.

The first step is determining what constrains the job.


CPU Bound

Heavy UDFs, encryption, compression, complex transformations.

Signals

  • High CPU utilization
  • Low spill
  • Minimal shuffle wait

Action

Scale cores and use compute-optimized instances.


Memory Bound

Large joins, wide aggregations, caching.

Signals

  • Spill metrics in Spark UI
  • High GC time
  • Executor OOM events

Action

Increase executor memory or reduce per-task footprint.


IO Bound

Reading from object storage, small files, slow disks.

Signals

  • Low CPU utilization
  • High file open overhead
  • High task deserialization time

Action

Fix file layout and compaction before scaling compute.

Throwing more cores at small file chaos does not help.
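A quick model of the small-file case shows where the time goes. The per-file overheads below are hypothetical, but the shape is typical: fixed per-task cost dominates useful work, and that cost scales with file count, not core count.

```python
# Why more cores don't fix small files (per-file costs hypothetical).
files = 800_000
per_task_overhead_s = 0.1      # listing, open, scheduling per tiny file
read_time_s = 0.05             # actual data read per tiny file

total_overhead_h = files * per_task_overhead_s / 3600
total_read_h = files * read_time_s / 3600
print(round(total_overhead_h, 1), round(total_read_h, 1))
# ~22.2 core-hours of overhead vs ~11.1 of useful reading
```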


Network Bound

Shuffle heavy workload.

Signals

  • High shuffle read fetch wait time
  • Low CPU usage during reduce stage
  • Executors waiting on remote blocks

Network bandwidth per node is fixed.

Doubling cores on the same node does not double shuffle throughput.

Adding cores to a network saturated node rarely helps.
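The network ceiling is easy to compute. Assuming a hypothetical 10 Gb/s NIC per node and the 2.5 TB shuffle from earlier:

```python
# Shuffle fetch time is bounded by per-node NIC bandwidth (numbers hypothetical).
shuffle_gb = 2560
nodes = 32
nic_gbps = 10                            # 10 Gb/s per node

per_node_gb = shuffle_gb / nodes         # 80 GB fetched per node
seconds = per_node_gb * 8 / nic_gbps     # GB -> Gb, then divide by Gb/s
print(round(seconds))                    # ~64 s of pure fetch time per node
```

Doubling cores on those nodes leaves `seconds` unchanged; only more nodes (more aggregate NICs) or less shuffle moves it.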


Step 3: What Is the Shuffle Multiplier?

Does the job:

  • Mostly scan and filter?
  • Perform wide joins?
  • Perform groupBy on high-cardinality keys?

Shuffle volume can easily reach two to three times input size.

Shuffle determines:

  • Execution memory pressure
  • Disk spill volume
  • Network saturation

Sizing for input size while ignoring shuffle multiplier is a classic mistake.
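One way to catch this mistake early is to compare shuffle volume against total execution memory. All numbers below are hypothetical; the multiplier would come from the Spark UI (Shuffle Write divided by input) on a representative run, and 0.6 is Spark's default `spark.memory.fraction`.

```python
# Will the shuffle spill? Compare shuffle volume to execution memory (hypothetical).
input_gb = 1024
multiplier = 2.5               # observed Shuffle Write / input on a sample run
shuffle_gb = input_gb * multiplier              # 2560 GB

executors = 50
exec_mem_gb = 32
exec_fraction = 0.6                             # spark.memory.fraction default
available_gb = executors * exec_mem_gb * exec_fraction   # 960 GB

print(shuffle_gb > available_gb)   # True -> expect spill to local disk
```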


A 1 TB Job Can Fail Because of 1 Key

Even if total data is 1 TB, a single hot key can create a 200 GB partition.

That one executor becomes the bottleneck.

Parallelism collapses not because the cluster is small, but because the data is unevenly distributed.

In the Spark UI, this usually shows up as one task running far longer than the rest or consuming disproportionate shuffle data.

Skew violates Spark’s even distribution assumption.

This is no longer a cluster sizing problem, and no amount of cores fixes uneven data.
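The standard manual fix is key salting: split the hot key across N sub-keys so its rows spread over N tasks. Here is a pure-Python model of the idea (key names and salt count hypothetical); in a real join you would also replicate the other side's hot-key rows across all N salts so matches still occur.

```python
# Key salting sketch: spread one hot key across N sub-partitions.
import random
from collections import Counter

random.seed(0)
keys = ["hot"] * 1_000 + ["k%d" % i for i in range(1_000)]  # one key = half the rows

SALT = 16
salted = [f"{k}#{random.randrange(SALT)}" if k == "hot" else k for k in keys]

biggest = Counter(salted).most_common(1)[0][1]
print(biggest)   # largest partition shrinks from 1000 rows to roughly 1000/16
```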


Spill Turns Memory Problems Into Disk Problems

When execution memory fills during shuffle or sort, Spark spills to local disk.

Now disk throughput becomes the bottleneck.

If local disks are slow:

  • Task duration increases
  • Executor lifetime increases
  • GC pressure increases
  • Stage completion slows non-linearly

How to identify

  • High Spill metrics
  • Increasing task duration during shuffle stages
  • Elevated GC time

How to mitigate

  • Increase executor memory
  • Reduce per task partition size
  • Increase shuffle partitions
  • Use faster local disks
  • Reduce shuffle footprint upstream
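As a config fragment, hedged starting points for the first three mitigations might look like this. The values are illustrative, not recommendations; tune them against the spill metrics of your own job.

```python
# Illustrative spill-mitigation settings (values hypothetical, tune per workload).
spill_tuning = {
    "spark.executor.memory": "32g",            # more execution memory headroom
    "spark.memory.fraction": "0.6",            # the default; raise cautiously
    "spark.sql.shuffle.partitions": "4000",    # smaller per-task shuffle footprint
    "spark.executor.memoryOverhead": "4g",     # off-heap headroom
}
```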

Spill connects memory and disk.


Step 4: What Is the Storage Layout?

Where does the 1 TB live?

  • Five large Parquet files?
  • Eight hundred thousand small files?
  • Partitioned correctly?
  • Clustered on join keys?

Small files increase:

  • Task scheduling overhead
  • File listing latency
  • Driver pressure

Poor partitioning increases scan size.

Wrong clustering increases shuffle cost.

Sometimes the correct answer to:

How big should the cluster be?

Is:

Fix the data layout first.


Step 5: What Is the SLA?

Cluster sizing without SLA context is incomplete.

If SLA is two hours, sizing for twenty minute completion is unnecessary.

If SLA is thirty minutes, sizing must be calculated backwards:

Required throughput = peak data volume / SLA window.

Node count = required throughput / effective per-node throughput.

Cluster sizing becomes a throughput equation.

Not a storage equation.
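Worked through with the earlier numbers (the per-node throughput here is a hypothetical measured value, not a spec-sheet figure):

```python
# SLA-driven sizing: work backwards from the deadline.
import math

peak_gb = 2560                 # peak intermediate state, not raw input
sla_min = 30

required_gbps = peak_gb / (sla_min * 60)   # GB/s the cluster must sustain
node_gbps = 0.05               # effective GB/s per node, measured end to end

nodes = math.ceil(required_gbps / node_gbps)
print(round(required_gbps, 3), nodes)      # 1.422 GB/s -> 29 nodes
```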


Step 6: Is This Dedicated or Shared?

On shared clusters:

  • Full cores are not guaranteed
  • Full memory is not guaranteed
  • Shuffle service is shared
  • Concurrency affects availability

Cluster math in isolation becomes wrong in practice.


Then, and Only Then, Do the Math

Once the following are understood:

  • Peak intermediate size
  • Bottleneck type
  • Shuffle volume
  • Storage throughput
  • SLA target
  • Input variance
  • Isolation model

Then it makes sense to calculate:

  • Target partition size
  • Required partitions
  • Required concurrent tasks
  • Executors per node
  • Memory per executor
  • Node count

Now the math is grounded.

Without those questions, the math is guesswork.
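Pulling the pieces together, the grounded version of the math might look like this. Every input is a hypothetical stand-in for a measured value; the structure, anchoring on peak intermediate state with explicit headroom, is the point.

```python
# End-to-end sizing sketch, once the workload questions are answered.
import math

anchor_gb = 2560               # peak intermediate state (the 2.5 TB join)
partition_mb = 128             # target partition size

partitions = math.ceil(anchor_gb * 1024 / partition_mb)        # 20480
waves = 4                      # acceptable task waves per stage
concurrent_tasks = math.ceil(partitions / waves)               # 5120

cores_per_executor = 4
executors = math.ceil(concurrent_tasks / cores_per_executor)   # 1280

mem_per_task_gb = 0.5          # from the in-memory expansion estimate
executor_mem_gb = cores_per_executor * mem_per_task_gb * 2     # 2x headroom

executors_per_node = 4
nodes = math.ceil(executors / executors_per_node)              # 320
print(partitions, executors, executor_mem_gb, nodes)
```

Same arithmetic as the naive version, but every input now traces back to workload behavior instead of table size.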


The Real Answer

If someone asks:

How do you size a cluster for 1 TB?

The answer is simple.

Clusters should not be sized based on 1 TB.

They should be sized based on peak intermediate state, dominant bottleneck, and SLA constraints.

Data size is the starting point.

Workload behavior determines the cluster.


Databricks Perspective

If this is built on modern Databricks Runtime with Spark 4.x, the mindset shifts slightly.

The same physics still apply.

But platform abstractions are used first.

On Databricks:

  • Adaptive Query Execution is enabled by default and can coalesce shuffle partitions and mitigate moderate skew.
  • Photon can reduce CPU pressure for SQL and DataFrame workloads.
  • Delta Lake layout strategies help reduce scan inefficiency and small file overhead.

For example:

  • OPTIMIZE compacts small files.
  • ZORDER improves multi column data locality in traditional layouts.
  • Liquid Clustering replaces static partitioning and ZORDER with dynamic clustering.
  • Predictive Optimization automates compaction and maintenance.
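For concreteness, the first three levers correspond to Delta maintenance commands along these lines. The table and column names are hypothetical; on Databricks these run as SQL, shown here as strings you might pass to `spark.sql`.

```python
# Illustrative Delta maintenance commands (table/column names hypothetical).
maintenance = [
    "OPTIMIZE events",                          # compact small files
    "OPTIMIZE events ZORDER BY (user_id)",      # multi-column locality, traditional layout
    "ALTER TABLE events CLUSTER BY (user_id)",  # switch to Liquid Clustering instead
]
```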

These improve:

  • File compaction
  • Data skipping
  • Read efficiency
  • Metadata overhead

They reduce scan inefficiency before compute scaling.

But they do not eliminate:

  • Shuffle cost
  • Skew
  • Network ceilings
  • Spill behavior
  • Peak intermediate pressure

On Databricks, cluster sizing is often the last lever, not the first.

Abstraction does not remove distributed systems physics.

In the next post, we will look at what changes when the cluster itself disappears, and how serverless Spark shifts the surface area of responsibility without changing the underlying constraints.
