<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arjun Krishna</title>
    <description>The latest articles on DEV Community by Arjun Krishna (@therectoverse).</description>
    <link>https://dev.to/therectoverse</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3786451%2F70cfaa67-9d42-4735-a5a3-7d8b3e2a0880.jpg</url>
      <title>DEV Community: Arjun Krishna</title>
      <link>https://dev.to/therectoverse</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/therectoverse"/>
    <language>en</language>
    <item>
      <title>How to Choose Between Serverless and Dedicated Compute in Databricks</title>
      <dc:creator>Arjun Krishna</dc:creator>
      <pubDate>Fri, 06 Mar 2026 06:35:02 +0000</pubDate>
      <link>https://dev.to/therectoverse/how-to-choose-between-serverless-and-dedicated-compute-in-databricks-j64</link>
      <guid>https://dev.to/therectoverse/how-to-choose-between-serverless-and-dedicated-compute-in-databricks-j64</guid>
      <description>&lt;p&gt;I recently benchmarked &lt;strong&gt;Serverless vs Dedicated compute in Databricks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I expected one of them to clearly win.&lt;/p&gt;

&lt;p&gt;It didn’t.&lt;/p&gt;

&lt;p&gt;Execution time was &lt;strong&gt;almost identical&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Which led to a more useful realization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The decision between Serverless and Dedicated &lt;strong&gt;is not a performance question&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It’s a &lt;strong&gt;workload shape question&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Mental Model
&lt;/h2&gt;

&lt;p&gt;Dedicated wins &lt;strong&gt;when the cluster stays warm and busy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Serverless wins &lt;strong&gt;from the first byte of compute needed&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cost Model
&lt;/h2&gt;

&lt;p&gt;When evaluating compute options, comparing &lt;strong&gt;DBU rates in isolation&lt;/strong&gt; is misleading.&lt;/p&gt;

&lt;p&gt;Instead, look at &lt;strong&gt;total compute cost&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dedicated Compute
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost ≈ (DBUs × DBU rate)
      + Cloud VM cost
      + Cost of time clusters remain warm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Serverless
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost ≈ DBUs × Serverless rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Serverless DBU rates are higher because &lt;strong&gt;infrastructure is already bundled in&lt;/strong&gt;.&lt;/p&gt;
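&lt;p&gt;As a rough sketch, the two cost models can be put side by side. Every rate below is an illustrative assumption, not Databricks pricing:&lt;/p&gt;

```python
# Toy comparison of the two cost models above.
# All rates are made-up assumptions, not real Databricks pricing.

def dedicated_cost(dbus, dbu_rate=0.25, vm_cost_per_hour=2.0,
                   busy_hours=1.0, warm_idle_hours=0.5):
    # Classic compute pays for DBUs, cloud VMs, and warm-but-idle time.
    return dbus * dbu_rate + vm_cost_per_hour * (busy_hours + warm_idle_hours)

def serverless_cost(dbus, serverless_rate=0.5):
    # Serverless pays one bundled rate; no separate VM or idle line item.
    return dbus * serverless_rate

# A short, frequent job: warm idle time dominates the classic bill.
print(dedicated_cost(dbus=10))   # 5.5
print(serverless_cost(dbus=10))  # 5.0
```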

&lt;p&gt;But two cost categories disappear entirely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idle clusters&lt;/li&gt;
&lt;li&gt;Cloud VM infrastructure management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s also a third cost that rarely shows up in spreadsheets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Engineering Time
&lt;/h3&gt;

&lt;p&gt;Operating classic clusters requires ongoing platform work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cluster policies&lt;/li&gt;
&lt;li&gt;autoscaling tuning&lt;/li&gt;
&lt;li&gt;node sizing decisions&lt;/li&gt;
&lt;li&gt;runtime upgrades&lt;/li&gt;
&lt;li&gt;debugging cluster drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At scale, the &lt;strong&gt;engineering hours saved operating infrastructure often become the biggest cost reduction&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Workload Patterns I See Most Often
&lt;/h2&gt;

&lt;p&gt;Most data pipelines fall into a few common patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Short Pipelines
&lt;/h3&gt;

&lt;p&gt;Jobs that run for a few minutes but execute repeatedly throughout the day.&lt;/p&gt;

&lt;p&gt;Serverless works extremely well here because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compute appears instantly&lt;/li&gt;
&lt;li&gt;compute disappears immediately after execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Startup latency is also dramatically lower.&lt;/p&gt;

&lt;p&gt;Typical comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compute Type&lt;/th&gt;
&lt;th&gt;Startup Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classic job cluster&lt;/td&gt;
&lt;td&gt;~3–7 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless&lt;/td&gt;
&lt;td&gt;seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For short jobs, this difference significantly improves &lt;strong&gt;time-to-value&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Long-Running Pipelines
&lt;/h3&gt;

&lt;p&gt;Some pipelines run for hours and keep compute fully utilized.&lt;/p&gt;

&lt;p&gt;Here dedicated clusters often make more sense because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lower DBU rates&lt;/li&gt;
&lt;li&gt;executor configuration tuning&lt;/li&gt;
&lt;li&gt;controlled autoscaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a cluster stays &lt;strong&gt;warm and busy&lt;/strong&gt;, economics start favoring dedicated compute.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Burst Workloads
&lt;/h3&gt;

&lt;p&gt;Many platforms schedule large numbers of jobs at the same time.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 pipelines scheduled at 8:00 AM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With classic job clusters this can cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cluster provisioning storms&lt;/li&gt;
&lt;li&gt;workspace cluster quota limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ve seen job clusters &lt;strong&gt;hit workspace cluster quotas in real production environments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Serverless handles this much better.&lt;/p&gt;

&lt;p&gt;Because compute runs on a &lt;strong&gt;Databricks-managed fleet&lt;/strong&gt;, the platform can absorb burst concurrency without waiting for clusters to spin up.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Ad-hoc Exploration
&lt;/h3&gt;

&lt;p&gt;Platforms also support &lt;strong&gt;interactive debugging and analysis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Notebook sessions often look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Run query
Inspect result
Run another query later
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All-purpose clusters stay alive during the entire session.&lt;/p&gt;

&lt;p&gt;Serverless aligns better with this pattern because compute is allocated &lt;strong&gt;only when work actually runs&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  When the Pattern Isn't Clear
&lt;/h2&gt;

&lt;p&gt;Sometimes a pipeline doesn't clearly fit one of these patterns.&lt;/p&gt;

&lt;p&gt;That’s when benchmarking both options makes sense.&lt;/p&gt;

&lt;p&gt;A simple approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run tests during a quiet window&lt;/li&gt;
&lt;li&gt;Avoid cached reads when benchmarking I/O&lt;/li&gt;
&lt;li&gt;Use the same dataset for both runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Measure two metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Latency
DBUs consumed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DBU consumption per run can be pulled from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;system.billing.usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Estimated monthly cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly Cost ≈ DBUs per run × DBU rate × runs per month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add storage or egress costs if data leaves Databricks.&lt;/p&gt;
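&lt;p&gt;The estimate is a single multiplication. In this sketch the rate and run counts are made-up inputs; the per-run DBU number would come from &lt;code&gt;system.billing.usage&lt;/code&gt;:&lt;/p&gt;

```python
def monthly_cost(dbus_per_run, dbu_rate, runs_per_month, extra=0.0):
    # Monthly Cost ≈ DBUs per run × DBU rate × runs per month,
    # plus any storage/egress costs if data leaves Databricks.
    return dbus_per_run * dbu_rate * runs_per_month + extra

# dbus_per_run measured from system.billing.usage; rate is illustrative.
print(monthly_cost(dbus_per_run=12, dbu_rate=0.5, runs_per_month=300))  # 1800.0
```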




&lt;h2&gt;
  
  
  A Subtle Efficiency Difference
&lt;/h2&gt;

&lt;p&gt;Classic clusters are provisioned on the assumption that workloads are distributed.&lt;/p&gt;

&lt;p&gt;But many workloads aren’t.&lt;/p&gt;

&lt;p&gt;Example: a &lt;strong&gt;pandas-heavy notebook on a Spark cluster&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most computation happens on the &lt;strong&gt;driver node&lt;/strong&gt;, while workers remain underutilized.&lt;/p&gt;

&lt;p&gt;Serverless removes the need to provision a &lt;strong&gt;fixed cluster footprint upfront&lt;/strong&gt;, making it more efficient for smaller workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  Operational Stability
&lt;/h2&gt;

&lt;p&gt;Serverless environments are effectively &lt;strong&gt;versionless from the user perspective&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Teams don’t manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cluster images&lt;/li&gt;
&lt;li&gt;runtime upgrades&lt;/li&gt;
&lt;li&gt;runtime fragmentation across projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The platform manages the runtime lifecycle and continuously rolls improvements forward.&lt;/p&gt;

&lt;p&gt;This removes an entire category of platform maintenance work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hidden Cost Leaks I See Often
&lt;/h2&gt;

&lt;p&gt;Before optimizing compute type, check these first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-termination set too high&lt;/li&gt;
&lt;li&gt;Libraries installing during job startup&lt;/li&gt;
&lt;li&gt;Silent retries increasing DBU usage&lt;/li&gt;
&lt;li&gt;Oversized clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cluster policies help enforce guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;owner tags&lt;/li&gt;
&lt;li&gt;cost center tags&lt;/li&gt;
&lt;li&gt;environment tags&lt;/li&gt;
&lt;li&gt;worker limits by tier&lt;/li&gt;
&lt;li&gt;restrictions on expensive instance types&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Nuance About Scaling
&lt;/h2&gt;

&lt;p&gt;Serverless isn't infinite.&lt;/p&gt;

&lt;p&gt;There are still &lt;strong&gt;platform guardrails on scaling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But these are managed differently from classic clusters.&lt;/p&gt;

&lt;p&gt;Job clusters are constrained by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;workspace cluster quotas&lt;/li&gt;
&lt;li&gt;VM provisioning limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Serverless runs on a &lt;strong&gt;Databricks-managed fleet&lt;/strong&gt;, so those limits usually don't apply the same way.&lt;/p&gt;

&lt;p&gt;In practice this means burst workloads often scale &lt;strong&gt;more smoothly on Serverless&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Rule of Thumb
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Short pipelines        → Serverless
Ad-hoc exploration     → Serverless
Burst workloads        → Serverless

Long-running pipelines → Dedicated
Specialized workloads  → Dedicated
(GPUs, private networking, pinned environments)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most mature platforms end up running &lt;strong&gt;both models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal isn’t choosing a winner.&lt;/p&gt;

&lt;p&gt;It’s matching the &lt;strong&gt;compute model to the workload shape&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>databricks</category>
      <category>distributedsystems</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>The future of Data Engineering in Databricks - From Pipelines to Intent</title>
      <dc:creator>Arjun Krishna</dc:creator>
      <pubDate>Tue, 03 Mar 2026 05:51:31 +0000</pubDate>
      <link>https://dev.to/therectoverse/the-future-of-data-engineering-in-databricks-from-pipelines-to-intent-e1m</link>
      <guid>https://dev.to/therectoverse/the-future-of-data-engineering-in-databricks-from-pipelines-to-intent-e1m</guid>
      <description>&lt;p&gt;The analytics layer moved first.&lt;/p&gt;

&lt;p&gt;Natural language querying.&lt;br&gt;&lt;br&gt;
AI-assisted SQL.&lt;br&gt;&lt;br&gt;
Agent-style workflows over governed datasets.&lt;/p&gt;

&lt;p&gt;Now the real shift is coming for &lt;strong&gt;data engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And it’s bigger.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Layers of Data Engineering
&lt;/h2&gt;

&lt;p&gt;If we strip the role down to fundamentals, data engineering operates across three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mechanical execution
&lt;/li&gt;
&lt;li&gt;Architectural decisions
&lt;/li&gt;
&lt;li&gt;Accountability and governance
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AI will not impact all three equally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: Mechanical Execution
&lt;/h2&gt;

&lt;p&gt;This layer is already changing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing boilerplate transformations
&lt;/li&gt;
&lt;li&gt;Defining repetitive pipeline logic
&lt;/li&gt;
&lt;li&gt;Handling retries and failure loops
&lt;/li&gt;
&lt;li&gt;Manually tracing lineage during debugging
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Databricks, we’re seeing early signals of this shift.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lakeflow Declarative Pipelines&lt;/strong&gt; let engineers define &lt;em&gt;what&lt;/em&gt; the data should look like rather than coding &lt;em&gt;how&lt;/em&gt; it runs.
&lt;/li&gt;
&lt;li&gt;The platform handles orchestration, retries, expectations, and monitoring.
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Databricks Assistant&lt;/strong&gt; can generate SQL, explain query plans, and refactor transformations.
&lt;/li&gt;
&lt;/ul&gt;
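&lt;p&gt;The declarative idea can be sketched in plain Python. This is a toy illustration of the pattern, not the actual Lakeflow/DLT API:&lt;/p&gt;

```python
# Toy sketch of declarative expectations (NOT the real Lakeflow/DLT API):
# the engineer declares WHAT must hold; the framework decides HOW to enforce it.

def expect_or_drop(name, predicate):
    # Analogous in spirit to DLT's @dlt.expect_or_drop decorator.
    def decorator(table_fn):
        def wrapper():
            return [row for row in table_fn() if predicate(row)]
        wrapper.expectation = name
        return wrapper
    return decorator

@expect_or_drop("valid_id", lambda row: row.get("id") is not None)
def clean_orders():
    # Declares what the table should contain; orchestration, retries,
    # and monitoring would be the platform's responsibility.
    return [{"id": 1, "amount": 40}, {"id": None, "amount": 9}]

print(clean_orders())  # [{'id': 1, 'amount': 40}]
```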

&lt;p&gt;This is deterministic automation.&lt;/p&gt;

&lt;p&gt;Reliable.&lt;br&gt;&lt;br&gt;
Repeatable.&lt;br&gt;&lt;br&gt;
Rule-based.&lt;/p&gt;

&lt;p&gt;But deterministic automation is only step one.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Deterministic Automation to Bounded Remediation
&lt;/h2&gt;

&lt;p&gt;Today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipelines fail
&lt;/li&gt;
&lt;li&gt;Alerts trigger
&lt;/li&gt;
&lt;li&gt;Engineers investigate
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tomorrow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system diagnoses
&lt;/li&gt;
&lt;li&gt;The system proposes a fix
&lt;/li&gt;
&lt;li&gt;The system remediates within predefined guardrails
&lt;/li&gt;
&lt;li&gt;Humans review the audit trail
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not full autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bounded remediation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Systems that resolve predictable failures while respecting governance controls, lineage, and data contracts.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema drift handled within constraints
&lt;/li&gt;
&lt;li&gt;Downstream impact simulation before deployment
&lt;/li&gt;
&lt;li&gt;Suggested medallion restructuring based on query patterns
&lt;/li&gt;
&lt;li&gt;Automatic performance optimization grounded in workload telemetry
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where foundation models integrated inside the platform matter.&lt;/p&gt;

&lt;p&gt;Not as chatbots.&lt;/p&gt;

&lt;p&gt;As embedded reasoning layers inside the data system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shift From Writing Code to Defining Intent
&lt;/h2&gt;

&lt;p&gt;The next evolution of data engineering won’t be about writing every transformation manually.&lt;/p&gt;

&lt;p&gt;It will look like this:&lt;/p&gt;

&lt;p&gt;An engineer defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business intent
&lt;/li&gt;
&lt;li&gt;Data quality expectations
&lt;/li&gt;
&lt;li&gt;Constraints
&lt;/li&gt;
&lt;li&gt;SLAs
&lt;/li&gt;
&lt;li&gt;Governance policies
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An intelligent agent drafts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline structure
&lt;/li&gt;
&lt;li&gt;Transformation logic
&lt;/li&gt;
&lt;li&gt;Incremental strategies
&lt;/li&gt;
&lt;li&gt;Partitioning strategy
&lt;/li&gt;
&lt;li&gt;Optimization hints
&lt;/li&gt;
&lt;li&gt;Lineage impact analysis
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineer reviews, adjusts, approves.&lt;/p&gt;

&lt;p&gt;The center of gravity moves upward.&lt;/p&gt;

&lt;p&gt;From syntax to systems thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Remains Human
&lt;/h2&gt;

&lt;p&gt;Layer 3 does not disappear.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Governance
&lt;/li&gt;
&lt;li&gt;Risk ownership
&lt;/li&gt;
&lt;li&gt;Architectural accountability
&lt;/li&gt;
&lt;li&gt;Trade-off decisions
&lt;/li&gt;
&lt;li&gt;Cross-domain modeling strategy
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI can propose.&lt;br&gt;&lt;br&gt;
It cannot own.&lt;/p&gt;

&lt;p&gt;Enterprises will not delegate accountability to a model.&lt;/p&gt;

&lt;p&gt;Data engineering becomes less about moving columns and more about defining durable data systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters in Databricks
&lt;/h2&gt;

&lt;p&gt;Databricks already integrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage abstraction (Delta Lake)
&lt;/li&gt;
&lt;li&gt;Compute
&lt;/li&gt;
&lt;li&gt;Orchestration
&lt;/li&gt;
&lt;li&gt;Lineage
&lt;/li&gt;
&lt;li&gt;Governance
&lt;/li&gt;
&lt;li&gt;Observability
&lt;/li&gt;
&lt;li&gt;Model integration
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That vertical integration enables deep AI embedding.&lt;/p&gt;

&lt;p&gt;The differentiation won’t be access to frontier models.&lt;/p&gt;

&lt;p&gt;It will be how safely and deeply intelligence is embedded into enterprise-grade data systems.&lt;/p&gt;

&lt;p&gt;The platform that combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auditability
&lt;/li&gt;
&lt;li&gt;Guardrails
&lt;/li&gt;
&lt;li&gt;Data contracts
&lt;/li&gt;
&lt;li&gt;Governance enforcement
&lt;/li&gt;
&lt;li&gt;Embedded reasoning
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…will define the next phase of data engineering.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Outcome
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Less time debugging pipelines at 2 AM
&lt;/li&gt;
&lt;li&gt;Lower operational burden
&lt;/li&gt;
&lt;li&gt;Reduced repetitive troubleshooting
&lt;/li&gt;
&lt;li&gt;Higher architectural leverage
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data engineers shift from pipeline authors to system designers.&lt;/p&gt;

&lt;p&gt;From mechanics to strategists.&lt;/p&gt;

&lt;p&gt;That’s not a minor upgrade.&lt;/p&gt;

&lt;p&gt;That’s a role redefinition.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>databricks</category>
      <category>ai</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>How to Size a Spark Cluster. And How Not To.</title>
      <dc:creator>Arjun Krishna</dc:creator>
      <pubDate>Sun, 01 Mar 2026 19:44:20 +0000</pubDate>
      <link>https://dev.to/therectoverse/how-to-size-a-spark-cluster-and-how-not-to-2f46</link>
      <guid>https://dev.to/therectoverse/how-to-size-a-spark-cluster-and-how-not-to-2f46</guid>
      <description>&lt;p&gt;&lt;strong&gt;Interviewer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need to process 1 TB of data in Spark. How do you size the cluster?&lt;/p&gt;

&lt;p&gt;Most answers start with division.&lt;/p&gt;

&lt;p&gt;1 TB&lt;br&gt;&lt;br&gt;
→ choose 128 MB partitions&lt;br&gt;&lt;br&gt;
→ calculate ~8,000 partitions&lt;br&gt;&lt;br&gt;
→ map to cores&lt;br&gt;&lt;br&gt;
→ decide number of nodes&lt;/p&gt;
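&lt;p&gt;The arithmetic itself is trivial (the core count below is a hypothetical cluster):&lt;/p&gt;

```python
# The division-first answer, spelled out.
input_bytes = 1 * 1024**4          # 1 TB
partition_bytes = 128 * 1024**2    # 128 MB target partitions
partitions = input_bytes // partition_bytes
print(partitions)                  # 8192, i.e. ~8,000 partitions

total_cores = 64                   # hypothetical cluster-wide core count
waves = partitions / total_cores   # task "waves" on that cluster
print(waves)                       # 128.0
```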

&lt;p&gt;It is clean. It is logical.&lt;/p&gt;

&lt;p&gt;It is also incomplete.&lt;/p&gt;

&lt;p&gt;Because cluster size is not derived from data size.&lt;/p&gt;

&lt;p&gt;It is derived from workload behavior.&lt;/p&gt;

&lt;p&gt;Here is how this question should be approached in real systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Clarify Which “1 TB” We’re Talking About
&lt;/h2&gt;

&lt;p&gt;When someone says “1 TB,” there are multiple meanings hiding inside that number.&lt;/p&gt;

&lt;p&gt;Before sizing anything, it helps to separate at least five different sizes.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Stored Size on Disk
&lt;/h3&gt;

&lt;p&gt;1 TB compressed Parquet in object storage tells very little about execution behavior.&lt;/p&gt;

&lt;p&gt;This number reflects storage efficiency and file layout. It affects metadata overhead and file management, not necessarily runtime footprint.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Effective Scan Size After Pruning
&lt;/h3&gt;

&lt;p&gt;The real question is: how much data will Spark actually read?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partition pruning skips entire directories.
&lt;/li&gt;
&lt;li&gt;Predicate pushdown skips non-matching row groups.
&lt;/li&gt;
&lt;li&gt;Column pruning avoids reading unused columns.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 1 TB table may result in only 200 to 300 GB scanned.&lt;/p&gt;

&lt;p&gt;Cluster sizing must be based on actual scan size, not table size.&lt;/p&gt;




&lt;h3&gt;
  
  
  In-Memory Expansion Size
&lt;/h3&gt;

&lt;p&gt;Compressed columnar data expands during execution.&lt;/p&gt;

&lt;p&gt;Parquet on disk is compressed and encoded.&lt;/p&gt;

&lt;p&gt;In memory, it is decompressed, decoded, and materialized into Spark’s internal row format.&lt;/p&gt;

&lt;p&gt;A 1 TB compressed dataset can expand to 2 to 4 TB across executors during processing.&lt;/p&gt;

&lt;p&gt;This directly affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executor memory sizing
&lt;/li&gt;
&lt;li&gt;Spill probability
&lt;/li&gt;
&lt;li&gt;GC pressure
&lt;/li&gt;
&lt;li&gt;Memory overhead configuration
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disk size is rarely the memory anchor.&lt;/p&gt;
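&lt;p&gt;In numbers, where the expansion factor is an assumption somewhere in the 2–4× range:&lt;/p&gt;

```python
disk_tb = 1.0            # compressed Parquet on disk
expansion_factor = 3.0   # assumed decode + materialization multiplier (2-4x)
in_memory_tb = disk_tb * expansion_factor
print(in_memory_tb)      # 3.0 TB across executors, not 1 TB
```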




&lt;h3&gt;
  
  
  4. Peak Intermediate Size
&lt;/h3&gt;

&lt;p&gt;This is usually the real anchor.&lt;/p&gt;

&lt;p&gt;Spark executes as a DAG of stages separated by shuffles.&lt;/p&gt;

&lt;p&gt;A 1 TB job might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter to 400 GB
&lt;/li&gt;
&lt;li&gt;Join and expand to 2.5 TB shuffle
&lt;/li&gt;
&lt;li&gt;Aggregate back to 50 GB
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spark does not care about input size.&lt;/p&gt;

&lt;p&gt;It cares about the largest intermediate state it must shuffle, sort, or spill.&lt;/p&gt;

&lt;p&gt;If a join explodes to 2.5 TB, that becomes the sizing baseline.&lt;/p&gt;
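&lt;p&gt;That stage profile reduces to a one-line rule:&lt;/p&gt;

```python
# Peak intermediate state, not input size, sets the sizing baseline.
stage_gb = {"filtered_input": 400, "join_shuffle": 2500, "aggregated_output": 50}
sizing_baseline_gb = max(stage_gb.values())
print(sizing_baseline_gb)  # 2500
```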




&lt;h3&gt;
  
  
  5. Input Variance Across Runs
&lt;/h3&gt;

&lt;p&gt;Is 1 TB stable?&lt;/p&gt;

&lt;p&gt;Or does it fluctuate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;800 GB on normal days
&lt;/li&gt;
&lt;li&gt;1.4 TB on quarter end
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production systems fail at the tail, not the mean.&lt;/p&gt;

&lt;p&gt;Sizing must consider the 95th percentile load, not the average.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before We Talk Math, Understand Spark’s Assumptions
&lt;/h2&gt;

&lt;p&gt;Spark was built with specific assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data can be evenly partitioned
&lt;/li&gt;
&lt;li&gt;Most transformations are narrow
&lt;/li&gt;
&lt;li&gt;Wide transformations require shuffle and are expensive
&lt;/li&gt;
&lt;li&gt;Network is slower than CPU
&lt;/li&gt;
&lt;li&gt;Memory is finite
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When these assumptions hold, Spark scales predictably.&lt;/p&gt;

&lt;p&gt;When they do not, adding nodes does not fix the root cause.&lt;/p&gt;

&lt;p&gt;Cluster sizing is not about fighting Spark.&lt;/p&gt;

&lt;p&gt;It is about aligning workload behavior with its design.&lt;/p&gt;

&lt;p&gt;This discussion is primarily framed around batch data engineering workloads, where shuffle, intermediate state, and throughput dominate sizing decisions. The underlying framework, however, is universal. For ML, BI, or streaming workloads, the dominant constraint shifts. Memory, concurrency, or state may become primary. The systems thinking remains the same.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: What Type of Workload Is This?
&lt;/h2&gt;

&lt;p&gt;Cluster sizing depends on bottleneck classification.&lt;/p&gt;

&lt;p&gt;The first step is determining what constrains the job.&lt;/p&gt;




&lt;h3&gt;
  
  
  CPU Bound
&lt;/h3&gt;

&lt;p&gt;Heavy UDFs, encryption, compression, complex transformations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High CPU utilization
&lt;/li&gt;
&lt;li&gt;Low spill
&lt;/li&gt;
&lt;li&gt;Minimal shuffle wait
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scale out cores and prefer compute-optimized instances.&lt;/p&gt;




&lt;h3&gt;
  
  
  Memory Bound
&lt;/h3&gt;

&lt;p&gt;Large joins, wide aggregations, caching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spill metrics in Spark UI
&lt;/li&gt;
&lt;li&gt;High GC time
&lt;/li&gt;
&lt;li&gt;Executor OOM events
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Increase executor memory or reduce the per-task footprint.&lt;/p&gt;




&lt;h3&gt;
  
  
  IO Bound
&lt;/h3&gt;

&lt;p&gt;Reading from object storage, small files, slow disks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low CPU utilization
&lt;/li&gt;
&lt;li&gt;High file open overhead
&lt;/li&gt;
&lt;li&gt;High task deserialization time
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix file layout and compaction before scaling compute.&lt;/p&gt;

&lt;p&gt;Throwing more cores at small-file chaos does not help.&lt;/p&gt;




&lt;h3&gt;
  
  
  Network Bound
&lt;/h3&gt;

&lt;p&gt;Shuffle heavy workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High shuffle read fetch wait time
&lt;/li&gt;
&lt;li&gt;Low CPU usage during reduce stage
&lt;/li&gt;
&lt;li&gt;Executors waiting on remote blocks
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Network bandwidth per node is fixed.&lt;/p&gt;

&lt;p&gt;Doubling cores on the same node does not double shuffle throughput.&lt;/p&gt;

&lt;p&gt;Adding cores to a network saturated node rarely helps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: What Is the Shuffle Multiplier?
&lt;/h2&gt;

&lt;p&gt;Does the job:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mostly scan and filter?
&lt;/li&gt;
&lt;li&gt;Perform wide joins?
&lt;/li&gt;
&lt;li&gt;Perform groupBy on high cardinality keys?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Shuffle volume can easily reach two to three times input size.&lt;/p&gt;

&lt;p&gt;Shuffle determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execution memory pressure
&lt;/li&gt;
&lt;li&gt;Disk spill volume
&lt;/li&gt;
&lt;li&gt;Network saturation
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sizing for input size while ignoring shuffle multiplier is a classic mistake.&lt;/p&gt;




&lt;h2&gt;
  
  
  A 1 TB Job Can Fail Because of 1 Key
&lt;/h2&gt;

&lt;p&gt;Even if total data is 1 TB, a single hot key can create a 200 GB partition.&lt;/p&gt;

&lt;p&gt;That one executor becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;Parallelism collapses not because the cluster is small, but because the data is unevenly distributed.&lt;/p&gt;

&lt;p&gt;In the Spark UI, this usually shows up as one task running far longer than the rest or consuming disproportionate shuffle data.&lt;/p&gt;

&lt;p&gt;Skew violates Spark’s even distribution assumption.&lt;/p&gt;

&lt;p&gt;This is no longer a cluster sizing problem, and no amount of cores fixes uneven data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Spill Turns Memory Problems Into Disk Problems
&lt;/h2&gt;

&lt;p&gt;When execution memory fills during shuffle or sort, Spark spills to local disk.&lt;/p&gt;

&lt;p&gt;Now disk throughput becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;If local disks are slow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task duration increases
&lt;/li&gt;
&lt;li&gt;Executor lifetime increases
&lt;/li&gt;
&lt;li&gt;GC pressure increases
&lt;/li&gt;
&lt;li&gt;Stage completion slows non-linearly
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to identify&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High Spill metrics
&lt;/li&gt;
&lt;li&gt;Increasing task duration during shuffle stages
&lt;/li&gt;
&lt;li&gt;Elevated GC time
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to mitigate&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase executor memory
&lt;/li&gt;
&lt;li&gt;Reduce per-task partition size
&lt;/li&gt;
&lt;li&gt;Increase shuffle partitions
&lt;/li&gt;
&lt;li&gt;Use faster local disks
&lt;/li&gt;
&lt;li&gt;Reduce shuffle footprint upstream
&lt;/li&gt;
&lt;/ul&gt;
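&lt;p&gt;As a sketch, the mitigation list maps onto a handful of Spark settings. The values here are illustrative starting points to tune against Spill metrics, not recommendations:&lt;/p&gt;

```python
# Illustrative Spark confs targeting spill; tune against Spark UI spill metrics.
spill_mitigations = {
    "spark.executor.memory": "32g",                     # more execution memory
    "spark.sql.shuffle.partitions": "2000",             # smaller per-task state
    "spark.sql.files.maxPartitionBytes": str(128 * 1024 * 1024),  # 128 MB splits
}

for key, value in spill_mitigations.items():
    print(f"{key}={value}")
```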

&lt;p&gt;Spill connects memory and disk.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: What Is the Storage Layout?
&lt;/h2&gt;

&lt;p&gt;Where does the 1 TB live?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Five large Parquet files?
&lt;/li&gt;
&lt;li&gt;Eight hundred thousand small files?
&lt;/li&gt;
&lt;li&gt;Partitioned correctly?
&lt;/li&gt;
&lt;li&gt;Clustered on join keys?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Small files increase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task scheduling overhead
&lt;/li&gt;
&lt;li&gt;File listing latency
&lt;/li&gt;
&lt;li&gt;Driver pressure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor partitioning increases scan size.&lt;/p&gt;

&lt;p&gt;Wrong clustering increases shuffle cost.&lt;/p&gt;

&lt;p&gt;Sometimes the correct answer to:&lt;/p&gt;

&lt;p&gt;How big should the cluster be?&lt;/p&gt;

&lt;p&gt;Is:&lt;/p&gt;

&lt;p&gt;Fix the data layout first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: What Is the SLA?
&lt;/h2&gt;

&lt;p&gt;Cluster sizing without SLA context is incomplete.&lt;/p&gt;

&lt;p&gt;If SLA is two hours, sizing for twenty minute completion is unnecessary.&lt;/p&gt;

&lt;p&gt;If SLA is thirty minutes, sizing must be calculated backwards:&lt;/p&gt;

&lt;p&gt;Required throughput equals peak data volume divided by SLA.&lt;/p&gt;

&lt;p&gt;Required throughput divided by per node effective throughput gives node count.&lt;/p&gt;

&lt;p&gt;Cluster sizing becomes a throughput equation.&lt;/p&gt;

&lt;p&gt;Not a storage equation.&lt;/p&gt;
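&lt;p&gt;Written as a sketch, where per-node effective throughput is an assumption you would measure, not look up:&lt;/p&gt;

```python
import math

def nodes_for_sla(peak_gb, sla_minutes, node_gb_per_min):
    # Required throughput = peak data volume / SLA window;
    # node count = required throughput / per-node effective throughput.
    required_gb_per_min = peak_gb / sla_minutes
    return math.ceil(required_gb_per_min / node_gb_per_min)

# 1.4 TB at the p95 tail, a 30-minute SLA, ~2.5 GB/min effective per node.
print(nodes_for_sla(peak_gb=1400, sla_minutes=30, node_gb_per_min=2.5))  # 19
```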




&lt;h2&gt;
  
  
  Step 6: Is This Dedicated or Shared?
&lt;/h2&gt;

&lt;p&gt;On shared clusters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full cores are not guaranteed
&lt;/li&gt;
&lt;li&gt;Full memory is not guaranteed
&lt;/li&gt;
&lt;li&gt;Shuffle service is shared
&lt;/li&gt;
&lt;li&gt;Concurrency affects availability
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cluster math in isolation becomes wrong in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Then, and Only Then, Do the Math
&lt;/h2&gt;

&lt;p&gt;Once the following are understood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peak intermediate size
&lt;/li&gt;
&lt;li&gt;Bottleneck type
&lt;/li&gt;
&lt;li&gt;Shuffle volume
&lt;/li&gt;
&lt;li&gt;Storage throughput
&lt;/li&gt;
&lt;li&gt;SLA target
&lt;/li&gt;
&lt;li&gt;Input variance
&lt;/li&gt;
&lt;li&gt;Isolation model
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it makes sense to calculate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Target partition size
&lt;/li&gt;
&lt;li&gt;Required partitions
&lt;/li&gt;
&lt;li&gt;Required concurrent tasks
&lt;/li&gt;
&lt;li&gt;Executors per node
&lt;/li&gt;
&lt;li&gt;Memory per executor
&lt;/li&gt;
&lt;li&gt;Node count
&lt;/li&gt;
&lt;/ul&gt;
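&lt;p&gt;A grounded version of that calculation might look like this. Every input is an assumption to replace with numbers from your own workload profile:&lt;/p&gt;

```python
import math

# Workload profile (all assumptions, from the questions above).
peak_intermediate_gb = 2500   # largest shuffle stage, not the 1 TB input
target_partition_mb = 128
waves = 50                    # acceptable task waves for the widest stage
cores_per_node = 16
safety = 4                    # headroom for expansion, shuffle buffers, GC

partitions = math.ceil(peak_intermediate_gb * 1024 / target_partition_mb)
concurrent_tasks = math.ceil(partitions / waves)
nodes = math.ceil(concurrent_tasks / cores_per_node)
mem_per_executor_gb = math.ceil(target_partition_mb * cores_per_node * safety / 1024)

print(partitions, concurrent_tasks, nodes, mem_per_executor_gb)  # 20000 400 25 8
```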

&lt;p&gt;Now the math is grounded.&lt;/p&gt;

&lt;p&gt;Without those questions, the math is guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Answer
&lt;/h2&gt;

&lt;p&gt;If someone asks:&lt;/p&gt;

&lt;p&gt;How do you size a cluster for 1 TB?&lt;/p&gt;

&lt;p&gt;The answer is simple.&lt;/p&gt;

&lt;p&gt;Clusters should not be sized based on 1 TB.&lt;/p&gt;

&lt;p&gt;They should be sized based on peak intermediate state, dominant bottleneck, and SLA constraints.&lt;/p&gt;

&lt;p&gt;Data size is the starting point.&lt;/p&gt;

&lt;p&gt;Workload behavior determines the cluster.&lt;/p&gt;




&lt;h2&gt;
  
  
  Databricks Perspective
&lt;/h2&gt;

&lt;p&gt;If this is built on modern Databricks Runtime with Spark 4.x, the mindset shifts slightly.&lt;/p&gt;

&lt;p&gt;The same physics still apply.&lt;/p&gt;

&lt;p&gt;But platform abstractions are used first.&lt;/p&gt;

&lt;p&gt;On Databricks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adaptive Query Execution is enabled by default and can coalesce shuffle partitions and mitigate moderate skew.
&lt;/li&gt;
&lt;li&gt;Photon can reduce CPU pressure for SQL and DataFrame workloads.
&lt;/li&gt;
&lt;li&gt;Delta Lake layout strategies help reduce scan inefficiency and small file overhead.
&lt;/li&gt;
&lt;/ul&gt;
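&lt;p&gt;For orientation, the AQE behaviors above map to session settings like the following. These are real Spark 3.x+ configuration keys, shown only to make the knobs explicit; &lt;code&gt;spark&lt;/code&gt; is assumed to be the active SparkSession, the values shown are already the defaults on recent runtimes, and exact defaults should be verified against your runtime version:&lt;/p&gt;

```python
# Adaptive Query Execution is on by default in modern Spark/Databricks.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Coalesce many small post-shuffle partitions into fewer, larger ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Split oversized partitions detected in sort-merge joins (moderate skew).
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Photon, by contrast, is not a session conf: it is selected at
# cluster or SQL warehouse creation time.
```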

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OPTIMIZE compacts small files.
&lt;/li&gt;
&lt;li&gt;ZORDER improves multi-column data locality in traditional layouts.
&lt;/li&gt;
&lt;li&gt;Liquid Clustering replaces static partitioning and ZORDER with dynamic clustering.
&lt;/li&gt;
&lt;li&gt;Predictive Optimization automates compaction and maintenance.
&lt;/li&gt;
&lt;/ul&gt;
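&lt;p&gt;A minimal sketch of those maintenance commands, issued through &lt;code&gt;spark.sql&lt;/code&gt; on Databricks. The table and column names are hypothetical, and &lt;code&gt;spark&lt;/code&gt; is assumed to be the active SparkSession; Predictive Optimization, when enabled, runs the compaction step for you:&lt;/p&gt;

```python
# Compact small files into larger ones.
spark.sql("OPTIMIZE sales.events")

# Traditional layout: co-locate data on frequently filtered columns.
spark.sql("OPTIMIZE sales.events ZORDER BY (customer_id, event_date)")

# Liquid Clustering: declare clustering keys once and let the engine
# maintain layout incrementally, replacing static partitioning + ZORDER.
spark.sql("ALTER TABLE sales.events CLUSTER BY (customer_id, event_date)")
```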

&lt;p&gt;These improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File compaction
&lt;/li&gt;
&lt;li&gt;Data skipping
&lt;/li&gt;
&lt;li&gt;Read efficiency
&lt;/li&gt;
&lt;li&gt;Metadata efficiency
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They reduce scan inefficiency before compute scaling even enters the picture.&lt;/p&gt;

&lt;p&gt;But they do not eliminate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shuffle cost
&lt;/li&gt;
&lt;li&gt;Skew
&lt;/li&gt;
&lt;li&gt;Network ceilings
&lt;/li&gt;
&lt;li&gt;Spill behavior
&lt;/li&gt;
&lt;li&gt;Peak intermediate pressure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On Databricks, cluster sizing is often the last lever, not the first.&lt;/p&gt;

&lt;p&gt;Abstraction does not remove distributed systems physics.&lt;/p&gt;

&lt;p&gt;In the next post, we will look at what changes when the cluster itself disappears, and how serverless Spark shifts the surface area of responsibility without changing the underlying constraints.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Lakehouse Serving: Onehouse LakeBase vs Databricks Lakebase Postgres</title>
      <dc:creator>Arjun Krishna</dc:creator>
      <pubDate>Mon, 23 Feb 2026 14:59:54 +0000</pubDate>
      <link>https://dev.to/therectoverse/lakehouse-serving-onehouse-lakebase-vs-databricks-lakebase-postgres-di</link>
      <guid>https://dev.to/therectoverse/lakehouse-serving-onehouse-lakebase-vs-databricks-lakebase-postgres-di</guid>
      <description>&lt;p&gt;For years, the lakehouse unified storage and analytics.&lt;/p&gt;

&lt;p&gt;It did not unify serving.&lt;/p&gt;

&lt;p&gt;The architecture typically looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Lakehouse → analytics &amp;amp; ETL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operational database → low-latency applications&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reverse ETL → copy curated subsets between them&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That split worked when humans drove queries.&lt;/p&gt;

&lt;p&gt;AI agents changed the load profile. They issue iterative point lookups, selective filters, repeated joins, and parallel queries inside tight reasoning loops. That workload stresses both scan-optimized engines and traditional OLTP systems in different ways.&lt;/p&gt;

&lt;p&gt;Two architectural responses have emerged from Onehouse and Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Onehouse LakeBase: Database Primitives on Open Tables
&lt;/h2&gt;

&lt;p&gt;LakeBase is positioned as a low-latency serving layer built directly on open lakehouse tables, specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Apache Hudi&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Iceberg&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Storage remains object-store based. LakeBase introduces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Record-level and secondary indexing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Index joins that shift selective-workload cost toward O(K), where K is the size of the filtered working set&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transaction-aware distributed caching tied to table commits&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Autoscaled serving engines (Quanton-based execution)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Postgres-compatible endpoint for standard connectivity&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core bet: instead of maintaining a separate serving tier via reverse ETL, extend the lakehouse itself with database-style mechanics.&lt;/p&gt;

&lt;p&gt;Traditional distributed engines (Spark/Trino class) often execute joins with work proportional to O(N + M) because of scan and shuffle patterns. LakeBase’s index joins aim to reduce cost toward the filtered working set.&lt;/p&gt;
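&lt;p&gt;A toy illustration of that cost difference (this is not LakeBase’s internals, just a counting model): a scan-based hash join touches every row on both sides, while an index-backed lookup touches only the K probed keys:&lt;/p&gt;

```python
# Count "rows touched" for a scan-based hash join vs an index lookup.

def hash_join(left, right, key):
    touched = 0
    index = {}
    for row in right:            # build side: scan all M rows
        touched += 1
        index.setdefault(row[key], []).append(row)
    out = []
    for row in left:             # probe side: scan all N rows
        touched += 1
        for match in index.get(row[key], []):
            out.append((row, match))
    return out, touched          # work ~ O(N + M)

def index_lookup_join(probe_keys, index):
    touched = 0
    out = []
    for k in probe_keys:         # only the K selective keys are probed
        for row in index.get(k, []):
            touched += 1
            out.append(row)
    return out, touched          # work ~ O(K)

left = [{"id": i} for i in range(1000)]
right = [{"id": i, "v": i * 2} for i in range(1000)]
_, scan_work = hash_join(left, right, "id")

prebuilt = {r["id"]: [r] for r in right}
_, index_work = index_lookup_join([3, 42, 7], prebuilt)

print(scan_work, index_work)  # 2000 3
```

&lt;p&gt;The same three-row answer costs 2000 row touches without an index and 3 with one; that gap is the entire argument for index-backed serving on selective queries.&lt;/p&gt;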

&lt;p&gt;For narrow, high-selectivity queries, Onehouse reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;~95% latency reduction on 1 TB TPC-DS selective workloads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;~6x performance vs Databricks SQL Serverless (tested narrow queries)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;5–10x improvement vs AWS Athena in customer trace replays&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are vendor-reported benchmarks and workload-specific, but they illustrate the design intent: make the lakehouse viable for high-concurrency serving without duplicating data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Databricks Lakebase Postgres: Dedicated OLTP Integrated with the Lakehouse
&lt;/h2&gt;

&lt;p&gt;Databricks takes a different approach.&lt;/p&gt;

&lt;p&gt;Lakebase is a fully managed PostgreSQL-compatible OLTP engine integrated into the Databricks platform.&lt;/p&gt;

&lt;p&gt;Architecturally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transactional workloads run on a dedicated Postgres engine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strong OLTP semantics and isolation guarantees&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tight integration with Unity Catalog&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Federated access between OLTP and lakehouse analytics&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databricks is natively optimized around Delta Lake, with growing Iceberg interoperability.&lt;/p&gt;

&lt;p&gt;Lakebase Postgres does not modify the lakehouse storage layer. It complements it.&lt;/p&gt;

&lt;p&gt;The philosophy here is specialization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OLTP engine → optimized for transactional latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lakehouse (Delta / Iceberg) → optimized for distributed analytics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unified control plane → separate execution semantics&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architectural Contrast
&lt;/h2&gt;

&lt;p&gt;Both approaches aim to reduce brittle reverse ETL pipelines.&lt;/p&gt;

&lt;p&gt;The difference lies in where database behavior lives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Onehouse → Extend open lakehouse tables (Hudi/Iceberg) with indexing, caching, and serving semantics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Databricks → Introduce a dedicated PostgreSQL engine alongside a Delta-native lakehouse.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One converges inward.&lt;/p&gt;

&lt;p&gt;The other composes specialized systems under one platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Take
&lt;/h2&gt;

&lt;p&gt;If your workload is read-heavy, selective, and lake-centric, the indexing-first model is compelling.&lt;/p&gt;

&lt;p&gt;If you require mature transactional guarantees and explicit workload isolation, a managed PostgreSQL engine integrated with the lakehouse may be structurally cleaner.&lt;/p&gt;

&lt;p&gt;The real shift is not about formats.&lt;/p&gt;

&lt;p&gt;It is about whether serving becomes a native property of the lakehouse — or remains a specialized companion to it.&lt;/p&gt;

</description>
      <category>lakehouse</category>
      <category>databricks</category>
      <category>apacheiceberg</category>
      <category>onehouse</category>
    </item>
  </channel>
</rss>
