<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arjun Krishna</title>
    <description>The latest articles on DEV Community by Arjun Krishna (@therectoverse).</description>
    <link>https://dev.to/therectoverse</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3786451%2F70cfaa67-9d42-4735-a5a3-7d8b3e2a0880.jpg</url>
      <title>DEV Community: Arjun Krishna</title>
      <link>https://dev.to/therectoverse</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/therectoverse"/>
    <language>en</language>
    <item>
      <title>How to Choose Between Serverless and Dedicated Compute in Databricks</title>
      <dc:creator>Arjun Krishna</dc:creator>
      <pubDate>Fri, 06 Mar 2026 06:35:02 +0000</pubDate>
      <link>https://dev.to/therectoverse/how-to-choose-between-serverless-and-dedicated-compute-in-databricks-j64</link>
      <guid>https://dev.to/therectoverse/how-to-choose-between-serverless-and-dedicated-compute-in-databricks-j64</guid>
      <description>&lt;p&gt;I recently benchmarked &lt;strong&gt;Serverless vs Dedicated compute in Databricks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I expected one of them to clearly win.&lt;/p&gt;

&lt;p&gt;It didn’t.&lt;/p&gt;

&lt;p&gt;Execution time was &lt;strong&gt;almost identical&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Which led to a more useful realization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The decision between Serverless and Dedicated &lt;strong&gt;is not a performance question&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It’s a &lt;strong&gt;workload shape question&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Mental Model
&lt;/h2&gt;

&lt;p&gt;Dedicated wins &lt;strong&gt;when the cluster stays warm and busy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Serverless wins &lt;strong&gt;from the first byte of compute needed&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cost Model
&lt;/h2&gt;

&lt;p&gt;When evaluating compute options, comparing &lt;strong&gt;DBU rates in isolation&lt;/strong&gt; is misleading.&lt;/p&gt;

&lt;p&gt;Instead, look at &lt;strong&gt;total compute cost&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dedicated Compute
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost ≈ (DBUs × DBU rate)
      + Cloud VM cost
      + Cost of time clusters remain warm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Serverless
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost ≈ DBUs × Serverless rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Serverless DBU rates are higher because &lt;strong&gt;infrastructure is already bundled in&lt;/strong&gt;.&lt;/p&gt;
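&lt;p&gt;As a rough sketch, the two cost models can be put side by side. Every rate below is an illustrative assumption, not Databricks pricing:&lt;/p&gt;

```python
# Toy comparison of the two cost models above.
# All rates are made-up assumptions, not real Databricks pricing.

def dedicated_cost(dbus, dbu_rate=0.25, vm_cost_per_hour=2.0,
                   busy_hours=1.0, warm_idle_hours=0.5):
    # Classic compute pays for DBUs, cloud VMs, and warm-but-idle time.
    return dbus * dbu_rate + vm_cost_per_hour * (busy_hours + warm_idle_hours)

def serverless_cost(dbus, serverless_rate=0.5):
    # Serverless pays one bundled rate; no separate VM or idle line item.
    return dbus * serverless_rate

# A short, frequent job: warm idle time dominates the classic bill.
print(dedicated_cost(dbus=10))   # 5.5
print(serverless_cost(dbus=10))  # 5.0
```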

&lt;p&gt;But two cost categories disappear entirely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idle clusters&lt;/li&gt;
&lt;li&gt;Cloud VM infrastructure management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s also a third cost that rarely shows up in spreadsheets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Engineering Time
&lt;/h3&gt;

&lt;p&gt;Operating classic clusters requires ongoing platform work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cluster policies&lt;/li&gt;
&lt;li&gt;autoscaling tuning&lt;/li&gt;
&lt;li&gt;node sizing decisions&lt;/li&gt;
&lt;li&gt;runtime upgrades&lt;/li&gt;
&lt;li&gt;debugging cluster drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At scale, the &lt;strong&gt;engineering hours saved operating infrastructure often become the biggest cost reduction&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Workload Patterns I See Most Often
&lt;/h2&gt;

&lt;p&gt;Most data pipelines fall into a few common patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Short Pipelines
&lt;/h3&gt;

&lt;p&gt;Jobs that run for a few minutes but execute repeatedly throughout the day.&lt;/p&gt;

&lt;p&gt;Serverless works extremely well here because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compute appears instantly&lt;/li&gt;
&lt;li&gt;compute disappears immediately after execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Startup latency is also dramatically lower.&lt;/p&gt;

&lt;p&gt;Typical comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compute Type&lt;/th&gt;
&lt;th&gt;Startup Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classic job cluster&lt;/td&gt;
&lt;td&gt;~3–7 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless&lt;/td&gt;
&lt;td&gt;seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For short jobs, this difference significantly improves &lt;strong&gt;time-to-value&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Long-Running Pipelines
&lt;/h3&gt;

&lt;p&gt;Some pipelines run for hours and keep compute fully utilized.&lt;/p&gt;

&lt;p&gt;Here dedicated clusters often make more sense because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lower DBU rates&lt;/li&gt;
&lt;li&gt;executor configuration tuning&lt;/li&gt;
&lt;li&gt;controlled autoscaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a cluster stays &lt;strong&gt;warm and busy&lt;/strong&gt;, economics start favoring dedicated compute.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Burst Workloads
&lt;/h3&gt;

&lt;p&gt;Many platforms schedule large numbers of jobs at the same time.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 pipelines scheduled at 8:00 AM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With classic job clusters this can cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cluster provisioning storms&lt;/li&gt;
&lt;li&gt;workspace cluster quota limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ve seen job clusters &lt;strong&gt;hit workspace cluster quotas in real production environments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Serverless handles this much better.&lt;/p&gt;

&lt;p&gt;Because compute runs on a &lt;strong&gt;Databricks-managed fleet&lt;/strong&gt;, the platform can absorb burst concurrency without waiting for clusters to spin up.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Ad-hoc Exploration
&lt;/h3&gt;

&lt;p&gt;Platforms also support &lt;strong&gt;interactive debugging and analysis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Notebook sessions often look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Run query
Inspect result
Run another query later
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All-purpose clusters stay alive during the entire session.&lt;/p&gt;

&lt;p&gt;Serverless aligns better with this pattern because compute is allocated &lt;strong&gt;only when work actually runs&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  When the Pattern Isn't Clear
&lt;/h2&gt;

&lt;p&gt;Sometimes a pipeline doesn't clearly fit one of these patterns.&lt;/p&gt;

&lt;p&gt;That’s when benchmarking both options makes sense.&lt;/p&gt;

&lt;p&gt;A simple approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run tests during a quiet window&lt;/li&gt;
&lt;li&gt;Avoid cached reads when benchmarking I/O&lt;/li&gt;
&lt;li&gt;Use the same dataset for both runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Measure two metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Latency
DBUs consumed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DBU consumption per run can be pulled from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;system.billing.usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Estimated monthly cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly Cost ≈ DBUs per run × DBU rate × runs per month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add storage or egress costs if data leaves Databricks.&lt;/p&gt;
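&lt;p&gt;The estimate is a single multiplication. In this sketch the rate and run counts are made-up inputs; the per-run DBU number would come from &lt;code&gt;system.billing.usage&lt;/code&gt;:&lt;/p&gt;

```python
def monthly_cost(dbus_per_run, dbu_rate, runs_per_month, extra=0.0):
    # Monthly Cost ≈ DBUs per run × DBU rate × runs per month,
    # plus any storage/egress costs if data leaves Databricks.
    return dbus_per_run * dbu_rate * runs_per_month + extra

# dbus_per_run measured from system.billing.usage; rate is illustrative.
print(monthly_cost(dbus_per_run=12, dbu_rate=0.5, runs_per_month=300))  # 1800.0
```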




&lt;h2&gt;
  
  
  A Subtle Efficiency Difference
&lt;/h2&gt;

&lt;p&gt;Classic clusters are provisioned on the assumption that workloads are distributed.&lt;/p&gt;

&lt;p&gt;But many workloads aren’t.&lt;/p&gt;

&lt;p&gt;Example: a &lt;strong&gt;pandas-heavy notebook on a Spark cluster&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most computation happens on the &lt;strong&gt;driver node&lt;/strong&gt;, while workers remain underutilized.&lt;/p&gt;

&lt;p&gt;Serverless removes the need to provision a &lt;strong&gt;fixed cluster footprint upfront&lt;/strong&gt;, making it more efficient for smaller workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  Operational Stability
&lt;/h2&gt;

&lt;p&gt;Serverless environments are effectively &lt;strong&gt;versionless from the user perspective&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Teams don’t manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cluster images&lt;/li&gt;
&lt;li&gt;runtime upgrades&lt;/li&gt;
&lt;li&gt;runtime fragmentation across projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The platform manages the runtime lifecycle and continuously rolls improvements forward.&lt;/p&gt;

&lt;p&gt;This removes an entire category of platform maintenance work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hidden Cost Leaks I See Often
&lt;/h2&gt;

&lt;p&gt;Before optimizing compute type, check these first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-termination set too high&lt;/li&gt;
&lt;li&gt;Libraries installing during job startup&lt;/li&gt;
&lt;li&gt;Silent retries increasing DBU usage&lt;/li&gt;
&lt;li&gt;Oversized clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cluster policies help enforce guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;owner tags&lt;/li&gt;
&lt;li&gt;cost center tags&lt;/li&gt;
&lt;li&gt;environment tags&lt;/li&gt;
&lt;li&gt;worker limits by tier&lt;/li&gt;
&lt;li&gt;restrictions on expensive instance types&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Nuance About Scaling
&lt;/h2&gt;

&lt;p&gt;Serverless isn't infinite.&lt;/p&gt;

&lt;p&gt;There are still &lt;strong&gt;platform guardrails on scaling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But these are managed differently from classic clusters.&lt;/p&gt;

&lt;p&gt;Job clusters are constrained by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;workspace cluster quotas&lt;/li&gt;
&lt;li&gt;VM provisioning limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Serverless runs on a &lt;strong&gt;Databricks-managed fleet&lt;/strong&gt;, so those limits usually don't apply the same way.&lt;/p&gt;

&lt;p&gt;In practice this means burst workloads often scale &lt;strong&gt;more smoothly on Serverless&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Rule of Thumb
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Short pipelines        → Serverless
Ad-hoc exploration     → Serverless
Burst workloads        → Serverless

Long-running pipelines → Dedicated
Specialized workloads  → Dedicated
(GPUs, private networking, pinned environments)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most mature platforms end up running &lt;strong&gt;both models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal isn’t choosing a winner.&lt;/p&gt;

&lt;p&gt;It’s matching the &lt;strong&gt;compute model to the workload shape&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>databricks</category>
      <category>distributedsystems</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>The future of Data Engineering in Databricks - From Pipelines to Intent</title>
      <dc:creator>Arjun Krishna</dc:creator>
      <pubDate>Tue, 03 Mar 2026 05:51:31 +0000</pubDate>
      <link>https://dev.to/therectoverse/the-future-of-data-engineering-in-databricks-from-pipelines-to-intent-e1m</link>
      <guid>https://dev.to/therectoverse/the-future-of-data-engineering-in-databricks-from-pipelines-to-intent-e1m</guid>
      <description>&lt;p&gt;The analytics layer moved first.&lt;/p&gt;

&lt;p&gt;Natural language querying.&lt;br&gt;&lt;br&gt;
AI-assisted SQL.&lt;br&gt;&lt;br&gt;
Agent-style workflows over governed datasets.&lt;/p&gt;

&lt;p&gt;Now the real shift is coming for &lt;strong&gt;data engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And it’s bigger.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Layers of Data Engineering
&lt;/h2&gt;

&lt;p&gt;If we strip the role down to fundamentals, data engineering operates across three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mechanical execution
&lt;/li&gt;
&lt;li&gt;Architectural decisions
&lt;/li&gt;
&lt;li&gt;Accountability and governance
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AI will not impact all three equally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: Mechanical Execution
&lt;/h2&gt;

&lt;p&gt;This layer is already changing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing boilerplate transformations
&lt;/li&gt;
&lt;li&gt;Defining repetitive pipeline logic
&lt;/li&gt;
&lt;li&gt;Handling retries and failure loops
&lt;/li&gt;
&lt;li&gt;Manually tracing lineage during debugging
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Databricks, we’re seeing early signals of this shift.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lakeflow Declarative Pipelines&lt;/strong&gt; let engineers define &lt;em&gt;what&lt;/em&gt; the data should look like rather than coding &lt;em&gt;how&lt;/em&gt; it runs.
&lt;/li&gt;
&lt;li&gt;The platform handles orchestration, retries, expectations, and monitoring.
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Databricks Assistant&lt;/strong&gt; can generate SQL, explain query plans, and refactor transformations.
&lt;/li&gt;
&lt;/ul&gt;
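&lt;p&gt;The declarative idea can be sketched in plain Python. This is a toy illustration of the pattern, not the actual Lakeflow/DLT API:&lt;/p&gt;

```python
# Toy sketch of declarative expectations (NOT the real Lakeflow/DLT API):
# the engineer declares WHAT must hold; the framework decides HOW to enforce it.

def expect_or_drop(name, predicate):
    # Analogous in spirit to DLT's @dlt.expect_or_drop decorator.
    def decorator(table_fn):
        def wrapper():
            return [row for row in table_fn() if predicate(row)]
        wrapper.expectation = name
        return wrapper
    return decorator

@expect_or_drop("valid_id", lambda row: row.get("id") is not None)
def clean_orders():
    # Declares what the table should contain; orchestration, retries,
    # and monitoring would be the platform's responsibility.
    return [{"id": 1, "amount": 40}, {"id": None, "amount": 9}]

print(clean_orders())  # [{'id': 1, 'amount': 40}]
```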

&lt;p&gt;This is deterministic automation.&lt;/p&gt;

&lt;p&gt;Reliable.&lt;br&gt;&lt;br&gt;
Repeatable.&lt;br&gt;&lt;br&gt;
Rule-based.&lt;/p&gt;

&lt;p&gt;But deterministic automation is only step one.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Deterministic Automation to Bounded Remediation
&lt;/h2&gt;

&lt;p&gt;Today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipelines fail
&lt;/li&gt;
&lt;li&gt;Alerts trigger
&lt;/li&gt;
&lt;li&gt;Engineers investigate
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tomorrow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system diagnoses
&lt;/li&gt;
&lt;li&gt;The system proposes a fix
&lt;/li&gt;
&lt;li&gt;The system remediates within predefined guardrails
&lt;/li&gt;
&lt;li&gt;Humans review the audit trail
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not full autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bounded remediation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Systems that resolve predictable failures while respecting governance controls, lineage, and data contracts.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema drift handled within constraints
&lt;/li&gt;
&lt;li&gt;Downstream impact simulation before deployment
&lt;/li&gt;
&lt;li&gt;Suggested medallion restructuring based on query patterns
&lt;/li&gt;
&lt;li&gt;Automatic performance optimization grounded in workload telemetry
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where foundation models integrated inside the platform matter.&lt;/p&gt;

&lt;p&gt;Not as chatbots.&lt;/p&gt;

&lt;p&gt;As embedded reasoning layers inside the data system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shift From Writing Code to Defining Intent
&lt;/h2&gt;

&lt;p&gt;The next evolution of data engineering won’t be about writing every transformation manually.&lt;/p&gt;

&lt;p&gt;It will look like this:&lt;/p&gt;

&lt;p&gt;An engineer defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business intent
&lt;/li&gt;
&lt;li&gt;Data quality expectations
&lt;/li&gt;
&lt;li&gt;Constraints
&lt;/li&gt;
&lt;li&gt;SLAs
&lt;/li&gt;
&lt;li&gt;Governance policies
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An intelligent agent drafts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline structure
&lt;/li&gt;
&lt;li&gt;Transformation logic
&lt;/li&gt;
&lt;li&gt;Incremental strategies
&lt;/li&gt;
&lt;li&gt;Partitioning strategy
&lt;/li&gt;
&lt;li&gt;Optimization hints
&lt;/li&gt;
&lt;li&gt;Lineage impact analysis
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineer reviews, adjusts, approves.&lt;/p&gt;

&lt;p&gt;The center of gravity moves upward.&lt;/p&gt;

&lt;p&gt;From syntax to systems thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Remains Human
&lt;/h2&gt;

&lt;p&gt;Layer 3 does not disappear.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Governance
&lt;/li&gt;
&lt;li&gt;Risk ownership
&lt;/li&gt;
&lt;li&gt;Architectural accountability
&lt;/li&gt;
&lt;li&gt;Trade-off decisions
&lt;/li&gt;
&lt;li&gt;Cross-domain modeling strategy
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI can propose.&lt;br&gt;&lt;br&gt;
It cannot own.&lt;/p&gt;

&lt;p&gt;Enterprises will not delegate accountability to a model.&lt;/p&gt;

&lt;p&gt;Data engineering becomes less about moving columns and more about defining durable data systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters in Databricks
&lt;/h2&gt;

&lt;p&gt;Databricks already integrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage abstraction (Delta Lake)
&lt;/li&gt;
&lt;li&gt;Compute
&lt;/li&gt;
&lt;li&gt;Orchestration
&lt;/li&gt;
&lt;li&gt;Lineage
&lt;/li&gt;
&lt;li&gt;Governance
&lt;/li&gt;
&lt;li&gt;Observability
&lt;/li&gt;
&lt;li&gt;Model integration
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That vertical integration enables deep AI embedding.&lt;/p&gt;

&lt;p&gt;The differentiation won’t be access to frontier models.&lt;/p&gt;

&lt;p&gt;It will be how safely and deeply intelligence is embedded into enterprise-grade data systems.&lt;/p&gt;

&lt;p&gt;The platform that combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auditability
&lt;/li&gt;
&lt;li&gt;Guardrails
&lt;/li&gt;
&lt;li&gt;Data contracts
&lt;/li&gt;
&lt;li&gt;Governance enforcement
&lt;/li&gt;
&lt;li&gt;Embedded reasoning
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…will define the next phase of data engineering.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Outcome
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Less time debugging pipelines at 2 AM
&lt;/li&gt;
&lt;li&gt;Lower operational burden
&lt;/li&gt;
&lt;li&gt;Reduced repetitive troubleshooting
&lt;/li&gt;
&lt;li&gt;Higher architectural leverage
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data engineers shift from pipeline authors to system designers.&lt;/p&gt;

&lt;p&gt;From mechanics to strategists.&lt;/p&gt;

&lt;p&gt;That’s not a minor upgrade.&lt;/p&gt;

&lt;p&gt;That’s a role redefinition.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>databricks</category>
      <category>ai</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>How to Size a Spark Cluster. And How Not To.</title>
      <dc:creator>Arjun Krishna</dc:creator>
      <pubDate>Sun, 01 Mar 2026 19:44:20 +0000</pubDate>
      <link>https://dev.to/therectoverse/how-to-size-a-spark-cluster-and-how-not-to-2f46</link>
      <guid>https://dev.to/therectoverse/how-to-size-a-spark-cluster-and-how-not-to-2f46</guid>
      <description>&lt;p&gt;&lt;strong&gt;Interviewer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need to process 1 TB of data in Spark. How do you size the cluster?&lt;/p&gt;

&lt;p&gt;Most answers start with division.&lt;/p&gt;

&lt;p&gt;1 TB&lt;br&gt;&lt;br&gt;
→ choose 128 MB partitions&lt;br&gt;&lt;br&gt;
→ calculate ~8,000 partitions&lt;br&gt;&lt;br&gt;
→ map to cores&lt;br&gt;&lt;br&gt;
→ decide number of nodes&lt;/p&gt;
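&lt;p&gt;The arithmetic itself is trivial (the core count below is a hypothetical cluster):&lt;/p&gt;

```python
# The division-first answer, spelled out.
input_bytes = 1 * 1024**4          # 1 TB
partition_bytes = 128 * 1024**2    # 128 MB target partitions
partitions = input_bytes // partition_bytes
print(partitions)                  # 8192, i.e. ~8,000 partitions

total_cores = 64                   # hypothetical cluster-wide core count
waves = partitions / total_cores   # task "waves" on that cluster
print(waves)                       # 128.0
```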

&lt;p&gt;It is clean. It is logical.&lt;/p&gt;

&lt;p&gt;It is also incomplete.&lt;/p&gt;

&lt;p&gt;Because cluster size is not derived from data size.&lt;/p&gt;

&lt;p&gt;It is derived from workload behavior.&lt;/p&gt;

&lt;p&gt;Here is how this question should be approached in real systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Clarify Which “1 TB” We’re Talking About
&lt;/h2&gt;

&lt;p&gt;When someone says “1 TB,” there are multiple meanings hiding inside that number.&lt;/p&gt;

&lt;p&gt;Before sizing anything, it helps to separate at least five different sizes.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Stored Size on Disk
&lt;/h3&gt;

&lt;p&gt;1 TB compressed Parquet in object storage tells very little about execution behavior.&lt;/p&gt;

&lt;p&gt;This number reflects storage efficiency and file layout. It affects metadata overhead and file management, not necessarily runtime footprint.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Effective Scan Size After Pruning
&lt;/h3&gt;

&lt;p&gt;The real question is: how much data will Spark actually read?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partition pruning skips entire directories.
&lt;/li&gt;
&lt;li&gt;Predicate pushdown skips non-matching row groups.
&lt;/li&gt;
&lt;li&gt;Column pruning avoids reading unused columns.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 1 TB table may result in only 200 to 300 GB scanned.&lt;/p&gt;

&lt;p&gt;Cluster sizing must be based on actual scan size, not table size.&lt;/p&gt;




&lt;h3&gt;
  
  
  In-Memory Expansion Size
&lt;/h3&gt;

&lt;p&gt;Compressed columnar data expands during execution.&lt;/p&gt;

&lt;p&gt;Parquet on disk is compressed and encoded.&lt;/p&gt;

&lt;p&gt;In memory, it is decompressed, decoded, and materialized into Spark’s internal row format.&lt;/p&gt;

&lt;p&gt;A 1 TB compressed dataset can expand to 2 to 4 TB across executors during processing.&lt;/p&gt;

&lt;p&gt;This directly affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executor memory sizing
&lt;/li&gt;
&lt;li&gt;Spill probability
&lt;/li&gt;
&lt;li&gt;GC pressure
&lt;/li&gt;
&lt;li&gt;Memory overhead configuration
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disk size is rarely the memory anchor.&lt;/p&gt;
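&lt;p&gt;In numbers, where the expansion factor is an assumption somewhere in the 2–4× range:&lt;/p&gt;

```python
disk_tb = 1.0            # compressed Parquet on disk
expansion_factor = 3.0   # assumed decode + materialization multiplier (2-4x)
in_memory_tb = disk_tb * expansion_factor
print(in_memory_tb)      # 3.0 TB across executors, not 1 TB
```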




&lt;h3&gt;
  
  
  4. Peak Intermediate Size
&lt;/h3&gt;

&lt;p&gt;This is usually the real anchor.&lt;/p&gt;

&lt;p&gt;Spark executes as a DAG of stages separated by shuffles.&lt;/p&gt;

&lt;p&gt;A 1 TB job might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter to 400 GB
&lt;/li&gt;
&lt;li&gt;Join and expand to 2.5 TB shuffle
&lt;/li&gt;
&lt;li&gt;Aggregate back to 50 GB
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spark does not care about input size.&lt;/p&gt;

&lt;p&gt;It cares about the largest intermediate state it must shuffle, sort, or spill.&lt;/p&gt;

&lt;p&gt;If a join explodes to 2.5 TB, that becomes the sizing baseline.&lt;/p&gt;
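&lt;p&gt;That stage profile reduces to a one-line rule:&lt;/p&gt;

```python
# Peak intermediate state, not input size, sets the sizing baseline.
stage_gb = {"filtered_input": 400, "join_shuffle": 2500, "aggregated_output": 50}
sizing_baseline_gb = max(stage_gb.values())
print(sizing_baseline_gb)  # 2500
```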




&lt;h3&gt;
  
  
  5. Input Variance Across Runs
&lt;/h3&gt;

&lt;p&gt;Is 1 TB stable?&lt;/p&gt;

&lt;p&gt;Or does it fluctuate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;800 GB on normal days
&lt;/li&gt;
&lt;li&gt;1.4 TB on quarter end
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production systems fail at the tail, not the mean.&lt;/p&gt;

&lt;p&gt;Sizing must consider the 95th percentile load, not the average.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before We Talk Math, Understand Spark’s Assumptions
&lt;/h2&gt;

&lt;p&gt;Spark was built with specific assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data can be evenly partitioned
&lt;/li&gt;
&lt;li&gt;Most transformations are narrow
&lt;/li&gt;
&lt;li&gt;Wide transformations require shuffle and are expensive
&lt;/li&gt;
&lt;li&gt;Network is slower than CPU
&lt;/li&gt;
&lt;li&gt;Memory is finite
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When these assumptions hold, Spark scales predictably.&lt;/p&gt;

&lt;p&gt;When they do not, adding nodes does not fix the root cause.&lt;/p&gt;

&lt;p&gt;Cluster sizing is not about fighting Spark.&lt;/p&gt;

&lt;p&gt;It is about aligning workload behavior with its design.&lt;/p&gt;

&lt;p&gt;This discussion is primarily framed around batch data engineering workloads, where shuffle, intermediate state, and throughput dominate sizing decisions. The underlying framework, however, is universal. For ML, BI, or streaming workloads, the dominant constraint shifts. Memory, concurrency, or state may become primary. The systems thinking remains the same.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: What Type of Workload Is This?
&lt;/h2&gt;

&lt;p&gt;Cluster sizing depends on bottleneck classification.&lt;/p&gt;

&lt;p&gt;The first step is determining what constrains the job.&lt;/p&gt;




&lt;h3&gt;
  
  
  CPU Bound
&lt;/h3&gt;

&lt;p&gt;Heavy UDFs, encryption, compression, complex transformations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High CPU utilization
&lt;/li&gt;
&lt;li&gt;Low spill
&lt;/li&gt;
&lt;li&gt;Minimal shuffle wait
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scale out cores and prefer compute-optimized instances.&lt;/p&gt;




&lt;h3&gt;
  
  
  Memory Bound
&lt;/h3&gt;

&lt;p&gt;Large joins, wide aggregations, caching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spill metrics in Spark UI
&lt;/li&gt;
&lt;li&gt;High GC time
&lt;/li&gt;
&lt;li&gt;Executor OOM events
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Increase executor memory or reduce the per-task footprint.&lt;/p&gt;




&lt;h3&gt;
  
  
  IO Bound
&lt;/h3&gt;

&lt;p&gt;Reading from object storage, small files, slow disks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low CPU utilization
&lt;/li&gt;
&lt;li&gt;High file open overhead
&lt;/li&gt;
&lt;li&gt;High task deserialization time
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix file layout and compaction before scaling compute.&lt;/p&gt;

&lt;p&gt;Throwing more cores at small-file chaos does not help.&lt;/p&gt;




&lt;h3&gt;
  
  
  Network Bound
&lt;/h3&gt;

&lt;p&gt;Shuffle heavy workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High shuffle read fetch wait time
&lt;/li&gt;
&lt;li&gt;Low CPU usage during reduce stage
&lt;/li&gt;
&lt;li&gt;Executors waiting on remote blocks
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Network bandwidth per node is fixed.&lt;/p&gt;

&lt;p&gt;Doubling cores on the same node does not double shuffle throughput.&lt;/p&gt;

&lt;p&gt;Adding cores to a network saturated node rarely helps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: What Is the Shuffle Multiplier?
&lt;/h2&gt;

&lt;p&gt;Does the job:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mostly scan and filter?
&lt;/li&gt;
&lt;li&gt;Perform wide joins?
&lt;/li&gt;
&lt;li&gt;Perform groupBy on high cardinality keys?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Shuffle volume can easily reach two to three times input size.&lt;/p&gt;

&lt;p&gt;Shuffle determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execution memory pressure
&lt;/li&gt;
&lt;li&gt;Disk spill volume
&lt;/li&gt;
&lt;li&gt;Network saturation
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sizing for input size while ignoring shuffle multiplier is a classic mistake.&lt;/p&gt;




&lt;h2&gt;
  
  
  A 1 TB Job Can Fail Because of 1 Key
&lt;/h2&gt;

&lt;p&gt;Even if total data is 1 TB, a single hot key can create a 200 GB partition.&lt;/p&gt;

&lt;p&gt;That one executor becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;Parallelism collapses not because the cluster is small, but because the data is unevenly distributed.&lt;/p&gt;

&lt;p&gt;In the Spark UI, this usually shows up as one task running far longer than the rest or consuming disproportionate shuffle data.&lt;/p&gt;

&lt;p&gt;Skew violates Spark’s even distribution assumption.&lt;/p&gt;

&lt;p&gt;This is no longer a cluster sizing problem, and no amount of cores fixes uneven data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Spill Turns Memory Problems Into Disk Problems
&lt;/h2&gt;

&lt;p&gt;When execution memory fills during shuffle or sort, Spark spills to local disk.&lt;/p&gt;

&lt;p&gt;Now disk throughput becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;If local disks are slow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task duration increases
&lt;/li&gt;
&lt;li&gt;Executor lifetime increases
&lt;/li&gt;
&lt;li&gt;GC pressure increases
&lt;/li&gt;
&lt;li&gt;Stage completion slows non-linearly
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to identify&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High Spill metrics
&lt;/li&gt;
&lt;li&gt;Increasing task duration during shuffle stages
&lt;/li&gt;
&lt;li&gt;Elevated GC time
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to mitigate&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase executor memory
&lt;/li&gt;
&lt;li&gt;Reduce per-task partition size
&lt;/li&gt;
&lt;li&gt;Increase shuffle partitions
&lt;/li&gt;
&lt;li&gt;Use faster local disks
&lt;/li&gt;
&lt;li&gt;Reduce shuffle footprint upstream
&lt;/li&gt;
&lt;/ul&gt;
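&lt;p&gt;As a sketch, the mitigation list maps onto a handful of Spark settings. The values here are illustrative starting points to tune against Spill metrics, not recommendations:&lt;/p&gt;

```python
# Illustrative Spark confs targeting spill; tune against Spark UI spill metrics.
spill_mitigations = {
    "spark.executor.memory": "32g",                     # more execution memory
    "spark.sql.shuffle.partitions": "2000",             # smaller per-task state
    "spark.sql.files.maxPartitionBytes": str(128 * 1024 * 1024),  # 128 MB splits
}

for key, value in spill_mitigations.items():
    print(f"{key}={value}")
```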

&lt;p&gt;Spill connects memory and disk.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: What Is the Storage Layout?
&lt;/h2&gt;

&lt;p&gt;Where does the 1 TB live?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Five large Parquet files?
&lt;/li&gt;
&lt;li&gt;Eight hundred thousand small files?
&lt;/li&gt;
&lt;li&gt;Partitioned correctly?
&lt;/li&gt;
&lt;li&gt;Clustered on join keys?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Small files increase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task scheduling overhead
&lt;/li&gt;
&lt;li&gt;File listing latency
&lt;/li&gt;
&lt;li&gt;Driver pressure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor partitioning increases scan size.&lt;/p&gt;

&lt;p&gt;Wrong clustering increases shuffle cost.&lt;/p&gt;

&lt;p&gt;Sometimes the correct answer to:&lt;/p&gt;

&lt;p&gt;How big should the cluster be?&lt;/p&gt;

&lt;p&gt;Is:&lt;/p&gt;

&lt;p&gt;Fix the data layout first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: What Is the SLA?
&lt;/h2&gt;

&lt;p&gt;Cluster sizing without SLA context is incomplete.&lt;/p&gt;

&lt;p&gt;If SLA is two hours, sizing for twenty minute completion is unnecessary.&lt;/p&gt;

&lt;p&gt;If SLA is thirty minutes, sizing must be calculated backwards:&lt;/p&gt;

&lt;p&gt;Required throughput equals peak data volume divided by SLA.&lt;/p&gt;

&lt;p&gt;Required throughput divided by per node effective throughput gives node count.&lt;/p&gt;

&lt;p&gt;Cluster sizing becomes a throughput equation.&lt;/p&gt;

&lt;p&gt;Not a storage equation.&lt;/p&gt;
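&lt;p&gt;Written as a sketch, where per-node effective throughput is an assumption you would measure, not look up:&lt;/p&gt;

```python
import math

def nodes_for_sla(peak_gb, sla_minutes, node_gb_per_min):
    # Required throughput = peak data volume / SLA window;
    # node count = required throughput / per-node effective throughput.
    required_gb_per_min = peak_gb / sla_minutes
    return math.ceil(required_gb_per_min / node_gb_per_min)

# 1.4 TB at the p95 tail, a 30-minute SLA, ~2.5 GB/min effective per node.
print(nodes_for_sla(peak_gb=1400, sla_minutes=30, node_gb_per_min=2.5))  # 19
```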




&lt;h2&gt;
  
  
  Step 6: Is This Dedicated or Shared?
&lt;/h2&gt;

&lt;p&gt;On shared clusters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full cores are not guaranteed
&lt;/li&gt;
&lt;li&gt;Full memory is not guaranteed
&lt;/li&gt;
&lt;li&gt;Shuffle service is shared
&lt;/li&gt;
&lt;li&gt;Concurrency affects availability
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cluster math in isolation becomes wrong in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Then, and Only Then, Do the Math
&lt;/h2&gt;

&lt;p&gt;Once the following are understood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peak intermediate size
&lt;/li&gt;
&lt;li&gt;Bottleneck type
&lt;/li&gt;
&lt;li&gt;Shuffle volume
&lt;/li&gt;
&lt;li&gt;Storage throughput
&lt;/li&gt;
&lt;li&gt;SLA target
&lt;/li&gt;
&lt;li&gt;Input variance
&lt;/li&gt;
&lt;li&gt;Isolation model
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it makes sense to calculate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Target partition size
&lt;/li&gt;
&lt;li&gt;Required partitions
&lt;/li&gt;
&lt;li&gt;Required concurrent tasks
&lt;/li&gt;
&lt;li&gt;Executors per node
&lt;/li&gt;
&lt;li&gt;Memory per executor
&lt;/li&gt;
&lt;li&gt;Node count
&lt;/li&gt;
&lt;/ul&gt;
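&lt;p&gt;A grounded version of that calculation might look like this. Every input is an assumption to replace with numbers from your own workload profile:&lt;/p&gt;

```python
import math

# Workload profile (all assumptions, from the questions above).
peak_intermediate_gb = 2500   # largest shuffle stage, not the 1 TB input
target_partition_mb = 128
waves = 50                    # acceptable task waves for the widest stage
cores_per_node = 16
safety = 4                    # headroom for expansion, shuffle buffers, GC

partitions = math.ceil(peak_intermediate_gb * 1024 / target_partition_mb)
concurrent_tasks = math.ceil(partitions / waves)
nodes = math.ceil(concurrent_tasks / cores_per_node)
mem_per_executor_gb = math.ceil(target_partition_mb * cores_per_node * safety / 1024)

print(partitions, concurrent_tasks, nodes, mem_per_executor_gb)  # 20000 400 25 8
```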

&lt;p&gt;Now the math is grounded.&lt;/p&gt;

&lt;p&gt;Without those questions, the math is guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Answer
&lt;/h2&gt;

&lt;p&gt;If someone asks:&lt;/p&gt;

&lt;p&gt;How do you size a cluster for 1 TB?&lt;/p&gt;

&lt;p&gt;The answer is simple.&lt;/p&gt;

&lt;p&gt;Clusters should not be sized based on 1 TB.&lt;/p&gt;

&lt;p&gt;They should be sized based on peak intermediate state, dominant bottleneck, and SLA constraints.&lt;/p&gt;

&lt;p&gt;Data size is the starting point.&lt;/p&gt;

&lt;p&gt;Workload behavior determines the cluster.&lt;/p&gt;




&lt;h2&gt;
  
  
  Databricks Perspective
&lt;/h2&gt;

&lt;p&gt;If this is built on modern Databricks Runtime with Spark 4.x, the mindset shifts slightly.&lt;/p&gt;

&lt;p&gt;The same physics still apply.&lt;/p&gt;

&lt;p&gt;But platform abstractions are used first.&lt;/p&gt;

&lt;p&gt;On Databricks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adaptive Query Execution is enabled by default and can coalesce shuffle partitions and mitigate moderate skew.
&lt;/li&gt;
&lt;li&gt;Photon can reduce CPU pressure for SQL and DataFrame workloads.
&lt;/li&gt;
&lt;li&gt;Delta Lake layout strategies help reduce scan inefficiency and small file overhead.
&lt;/li&gt;
&lt;/ul&gt;
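&lt;p&gt;For orientation, the AQE behaviors above map to session settings like the following. These are real Spark 3.x+ configuration keys, shown only to make the knobs explicit; &lt;code&gt;spark&lt;/code&gt; is assumed to be the active SparkSession, the values shown are already the defaults on recent runtimes, and exact defaults should be verified against your runtime version:&lt;/p&gt;

```python
# Adaptive Query Execution is on by default in modern Spark/Databricks.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Coalesce many small post-shuffle partitions into fewer, larger ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Split oversized partitions detected in sort-merge joins (moderate skew).
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Photon, by contrast, is not a session conf: it is selected at
# cluster or SQL warehouse creation time.
```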

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OPTIMIZE compacts small files.
&lt;/li&gt;
&lt;li&gt;ZORDER improves multi-column data locality in traditional layouts.
&lt;/li&gt;
&lt;li&gt;Liquid Clustering replaces static partitioning and ZORDER with dynamic clustering.
&lt;/li&gt;
&lt;li&gt;Predictive Optimization automates compaction and maintenance.
&lt;/li&gt;
&lt;/ul&gt;
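&lt;p&gt;A minimal sketch of those maintenance commands, issued through &lt;code&gt;spark.sql&lt;/code&gt; on Databricks. The table and column names are hypothetical, and &lt;code&gt;spark&lt;/code&gt; is assumed to be the active SparkSession; Predictive Optimization, when enabled, runs the compaction step for you:&lt;/p&gt;

```python
# Compact small files into larger ones.
spark.sql("OPTIMIZE sales.events")

# Traditional layout: co-locate data on frequently filtered columns.
spark.sql("OPTIMIZE sales.events ZORDER BY (customer_id, event_date)")

# Liquid Clustering: declare clustering keys once and let the engine
# maintain layout incrementally, replacing static partitioning + ZORDER.
spark.sql("ALTER TABLE sales.events CLUSTER BY (customer_id, event_date)")
```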

&lt;p&gt;These improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File compaction
&lt;/li&gt;
&lt;li&gt;Data skipping
&lt;/li&gt;
&lt;li&gt;Read efficiency
&lt;/li&gt;
&lt;li&gt;Metadata efficiency
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They reduce scan inefficiency before compute scaling even enters the picture.&lt;/p&gt;

&lt;p&gt;But they do not eliminate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shuffle cost
&lt;/li&gt;
&lt;li&gt;Skew
&lt;/li&gt;
&lt;li&gt;Network ceilings
&lt;/li&gt;
&lt;li&gt;Spill behavior
&lt;/li&gt;
&lt;li&gt;Peak intermediate pressure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On Databricks, cluster sizing is often the last lever, not the first.&lt;/p&gt;

&lt;p&gt;Abstraction does not remove distributed systems physics.&lt;/p&gt;

&lt;p&gt;In the next post, we will look at what changes when the cluster itself disappears, and how serverless Spark shifts the surface area of responsibility without changing the underlying constraints.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Lakehouse Serving: Onehouse LakeBase vs Databricks Lakebase Postgres</title>
      <dc:creator>Arjun Krishna</dc:creator>
      <pubDate>Mon, 23 Feb 2026 14:59:54 +0000</pubDate>
      <link>https://dev.to/therectoverse/lakehouse-serving-onehouse-lakebase-vs-databricks-lakebase-postgres-di</link>
      <guid>https://dev.to/therectoverse/lakehouse-serving-onehouse-lakebase-vs-databricks-lakebase-postgres-di</guid>
      <description>&lt;p&gt;For years, the lakehouse unified storage and analytics.&lt;/p&gt;

&lt;p&gt;It did not unify serving.&lt;/p&gt;

&lt;p&gt;The architecture typically looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Lakehouse → analytics &amp;amp; ETL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operational database → low-latency applications&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reverse ETL → copy curated subsets between them&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That split worked when humans drove queries.&lt;/p&gt;

&lt;p&gt;AI agents changed the load profile. They issue iterative point lookups, selective filters, repeated joins, and parallel queries inside tight reasoning loops. That workload stresses both scan-optimized engines and traditional OLTP systems in different ways.&lt;/p&gt;

&lt;p&gt;Two architectural responses have emerged from Onehouse and Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Onehouse LakeBase: Database Primitives on Open Tables
&lt;/h2&gt;

&lt;p&gt;LakeBase is positioned as a low-latency serving layer built directly on open lakehouse tables, specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Apache Hudi&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Iceberg&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Storage remains object-store based. LakeBase introduces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Record-level and secondary indexing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Index joins that shift selective-workload cost toward O(K), where K is the size of the filtered working set&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transaction-aware distributed caching tied to table commits&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Autoscaled serving engines (Quanton-based execution)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Postgres-compatible endpoint for standard connectivity&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core bet: instead of maintaining a separate serving tier via reverse ETL, extend the lakehouse itself with database-style mechanics.&lt;/p&gt;

&lt;p&gt;Traditional distributed engines (Spark/Trino class) often execute joins with work proportional to O(N + M) because of scan and shuffle patterns. LakeBase’s index joins aim to reduce cost toward the filtered working set.&lt;/p&gt;
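&lt;p&gt;A toy illustration of that cost difference (this is not LakeBase’s internals, just a counting model): a scan-based hash join touches every row on both sides, while an index-backed lookup touches only the K probed keys:&lt;/p&gt;

```python
# Count "rows touched" for a scan-based hash join vs an index lookup.

def hash_join(left, right, key):
    touched = 0
    index = {}
    for row in right:            # build side: scan all M rows
        touched += 1
        index.setdefault(row[key], []).append(row)
    out = []
    for row in left:             # probe side: scan all N rows
        touched += 1
        for match in index.get(row[key], []):
            out.append((row, match))
    return out, touched          # work ~ O(N + M)

def index_lookup_join(probe_keys, index):
    touched = 0
    out = []
    for k in probe_keys:         # only the K selective keys are probed
        for row in index.get(k, []):
            touched += 1
            out.append(row)
    return out, touched          # work ~ O(K)

left = [{"id": i} for i in range(1000)]
right = [{"id": i, "v": i * 2} for i in range(1000)]
_, scan_work = hash_join(left, right, "id")

prebuilt = {r["id"]: [r] for r in right}
_, index_work = index_lookup_join([3, 42, 7], prebuilt)

print(scan_work, index_work)  # 2000 3
```

&lt;p&gt;The same three-row answer costs 2000 row touches without an index and 3 with one; that gap is the entire argument for index-backed serving on selective queries.&lt;/p&gt;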

&lt;p&gt;For narrow, high-selectivity queries, Onehouse reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;~95% latency reduction on 1 TB TPC-DS selective workloads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;~6x performance vs Databricks SQL Serverless (tested narrow queries)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;5–10x improvement vs AWS Athena in customer trace replays&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are vendor-reported benchmarks and workload-specific, but they illustrate the design intent: make the lakehouse viable for high-concurrency serving without duplicating data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Databricks Lakebase Postgres: Dedicated OLTP Integrated with the Lakehouse
&lt;/h2&gt;

&lt;p&gt;Databricks takes a different approach.&lt;/p&gt;

&lt;p&gt;Lakebase is a fully managed PostgreSQL-compatible OLTP engine integrated into the Databricks platform.&lt;/p&gt;

&lt;p&gt;Architecturally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transactional workloads run on a dedicated Postgres engine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strong OLTP semantics and isolation guarantees&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tight integration with Unity Catalog&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Federated access between OLTP and lakehouse analytics&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databricks is natively optimized around Delta Lake, with growing Iceberg interoperability.&lt;/p&gt;

&lt;p&gt;Lakebase Postgres does not modify the lakehouse storage layer. It complements it.&lt;/p&gt;

&lt;p&gt;The philosophy here is specialization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OLTP engine → optimized for transactional latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lakehouse (Delta / Iceberg) → optimized for distributed analytics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unified control plane → separate execution semantics&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architectural Contrast
&lt;/h2&gt;

&lt;p&gt;Both approaches aim to reduce brittle reverse ETL pipelines.&lt;/p&gt;

&lt;p&gt;The difference lies in where database behavior lives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Onehouse → Extend open lakehouse tables (Hudi/Iceberg) with indexing, caching, and serving semantics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Databricks → Introduce a dedicated PostgreSQL engine alongside a Delta-native lakehouse.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One converges inward.&lt;/p&gt;

&lt;p&gt;The other composes specialized systems under one platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Take
&lt;/h2&gt;

&lt;p&gt;If your workload is read-heavy, selective, and lake-centric, the indexing-first model is compelling.&lt;/p&gt;

&lt;p&gt;If you require mature transactional guarantees and explicit workload isolation, a managed PostgreSQL engine integrated with the lakehouse may be structurally cleaner.&lt;/p&gt;

&lt;p&gt;The real shift is not about formats.&lt;/p&gt;

&lt;p&gt;It is about whether serving becomes a native property of the lakehouse — or remains a specialized companion to it.&lt;/p&gt;

</description>
      <category>lakehouse</category>
      <category>databricks</category>
      <category>apacheiceberg</category>
      <category>onehouse</category>
    </item>
  </channel>
</rss>
