<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: RASMIN BHALLA</title>
    <description>The latest articles on DEV Community by RASMIN BHALLA (@rasminbhalla).</description>
    <link>https://dev.to/rasminbhalla</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1166153%2F93ba0630-2444-43c6-b60d-e36965c8127f.jpg</url>
      <title>DEV Community: RASMIN BHALLA</title>
      <link>https://dev.to/rasminbhalla</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rasminbhalla"/>
    <language>en</language>
    <item>
      <title>Understanding Join Strategies in PySpark (With Real-World Insights)</title>
      <dc:creator>RASMIN BHALLA</dc:creator>
      <pubDate>Sat, 11 Apr 2026 05:19:32 +0000</pubDate>
      <link>https://dev.to/rasminbhalla/understanding-join-strategies-in-pyspark-with-real-world-insights-1n05</link>
      <guid>https://dev.to/rasminbhalla/understanding-join-strategies-in-pyspark-with-real-world-insights-1n05</guid>
      <description>&lt;p&gt;When working with large-scale data in Spark, joins are often the biggest performance bottleneck. Choosing the right join strategy can drastically reduce execution time and cost.&lt;/p&gt;

&lt;h2&gt;Let’s break down the most important join strategies in PySpark.&lt;/h2&gt;

&lt;h2&gt;Why Join Strategy Matters&lt;/h2&gt;

&lt;p&gt;In distributed systems like Spark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is spread across nodes&lt;/li&gt;
&lt;li&gt;Joins may trigger &lt;strong&gt;shuffles (expensive!)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Poor strategy → massive performance degradation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Spark Join Strategy Overview&lt;/h2&gt;

&lt;p&gt;Spark automatically selects join strategies using the &lt;strong&gt;Catalyst Optimizer&lt;/strong&gt;, but understanding them helps you override when needed.&lt;/p&gt;




&lt;h2&gt;🔹 1. Broadcast Hash Join (Best for Small Tables)&lt;/h2&gt;

&lt;p&gt;👉 Used when one table is small enough to fit in each executor’s memory; Spark ships a copy of it to every node, so no shuffle is needed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import broadcast

# Explicitly broadcast the small table to every executor
df_large.join(broadcast(df_small), "id")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Pros:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No shuffle&lt;/li&gt;
&lt;li&gt;Fastest join&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Cons:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited by memory: the broadcast copy must fit on every executor&lt;/li&gt;
&lt;/ul&gt;
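&lt;p&gt;The mechanics can be sketched in plain Python (an illustrative toy, not Spark’s actual implementation): the small table becomes an in-memory hash map, and every row of the large table probes it locally.&lt;/p&gt;

```python
# Illustrative broadcast hash join in plain Python (not Spark's actual
# implementation): the small table becomes an in-memory hash map, and
# each row of the large table probes it locally, so no shuffle is needed.

small = [(1, "electronics"), (2, "grocery")]                # (id, category)
large = [(1, "tv"), (2, "rice"), (1, "radio"), (3, "pen")]  # (id, product)

lookup = dict(small)  # the "broadcast" copy, built once

joined = [
    (key, product, lookup[key])
    for key, product in large
    if key in lookup  # inner join: id 3 has no match and is dropped
]

print(joined)
```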




&lt;h2&gt;🔹 2. Sort Merge Join (Default for Large Tables)&lt;/h2&gt;

&lt;p&gt;👉 Used when both tables are large&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# With two large tables, Spark defaults to a sort merge join
df1.join(df2, "id")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;How it works:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Both tables are shuffled so matching keys land in the same partition&lt;/li&gt;
&lt;li&gt;Each partition is sorted on the join key&lt;/li&gt;
&lt;li&gt;The sorted sides are merged in a single pass&lt;/li&gt;
&lt;/ul&gt;
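&lt;p&gt;The steps above can be sketched as a toy single-machine version in plain Python (illustrative only; Spark performs this per partition after the shuffle):&lt;/p&gt;

```python
# Toy sort-merge join (assumes unique keys): sort both sides on the key,
# then advance two pointers in lockstep, emitting matches. Spark applies
# the same idea per partition after the shuffle co-locates equal keys.

left = [(3, "c"), (1, "a"), (2, "b")]
right = [(2, "x"), (3, "y"), (4, "z")]

left.sort(key=lambda r: r[0])
right.sort(key=lambda r: r[0])

i = j = 0
merged = []
while i != len(left) and j != len(right):
    lk, rk = left[i][0], right[j][0]
    if lk == rk:
        merged.append((lk, left[i][1], right[j][1]))
        i += 1
        j += 1
    elif lk > rk:  # left key is ahead, advance the right side
        j += 1
    else:          # right key is ahead, advance the left side
        i += 1

print(merged)
```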

&lt;h3&gt;Pros:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Scales well&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Cons:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Expensive due to shuffle + sort&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;🔹 3. Shuffle Hash Join&lt;/h2&gt;

&lt;p&gt;👉 Used when one table is moderately small&lt;/p&gt;

&lt;h3&gt;How it works:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Both tables are shuffled on the join key&lt;/li&gt;
&lt;li&gt;The smaller side of each partition is hashed and probed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Pros:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can beat sort merge by skipping the sort step&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Cons:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Memory sensitive: the hashed side of each partition must fit in memory&lt;/li&gt;
&lt;/ul&gt;
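&lt;p&gt;A rough plain-Python sketch of the idea (illustrative only): both sides are partitioned by key hash so matching keys land together, then each partition runs a local hash join:&lt;/p&gt;

```python
# Toy shuffle hash join: hash-partition both sides so equal keys land in
# the same "partition", then run a local hash join inside each one.

NUM_PARTITIONS = 2
left = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
right = [(2, "x"), (4, "y")]

def partition(rows):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        parts[row[0] % NUM_PARTITIONS].append(row)  # the "shuffle" step
    return parts

results = []
for lpart, rpart in zip(partition(left), partition(right)):
    table = dict(rpart)  # hash the smaller side of this partition
    for key, value in lpart:
        if key in table:
            results.append((key, value, table[key]))

print(sorted(results))
```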




&lt;h2&gt;🔹 4. Broadcast Nested Loop Join (Avoid!)&lt;/h2&gt;

&lt;p&gt;👉 Used when there is no equi-join condition (e.g., cross joins or non-equi joins)&lt;/p&gt;

&lt;h3&gt;Extremely expensive&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cross join behavior&lt;/li&gt;
&lt;li&gt;Should be avoided unless necessary&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;How Spark Chooses Join Strategy&lt;/h2&gt;

&lt;p&gt;Spark uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table size statistics&lt;/li&gt;
&lt;li&gt;&lt;code&gt;spark.sql.autoBroadcastJoinThreshold&lt;/code&gt; (10 MB by default)&lt;/li&gt;
&lt;li&gt;Cost-based optimizer&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Forcing Join Strategy (Advanced)&lt;/h2&gt;

&lt;p&gt;You can override Spark’s choice with join hints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df1.join(df2.hint("broadcast"), "id")     # force broadcast hash join
df1.join(df2.hint("merge"), "id")         # force sort merge join
df1.join(df2.hint("shuffle_hash"), "id")  # force shuffle hash join
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Real-World Optimization Tips&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✔ Broadcast dimension tables (e.g., supplier, class)&lt;/li&gt;
&lt;li&gt;✔ Avoid joins on heavily skewed keys&lt;/li&gt;
&lt;li&gt;✔ Repartition before joins if needed&lt;/li&gt;
&lt;li&gt;✔ Join on raw key columns (applying functions to join keys blocks optimizations)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;⚠️ Common Pitfall: Data Skew&lt;/h2&gt;

&lt;p&gt;If one key has too many records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One node gets overloaded&lt;/li&gt;
&lt;li&gt;Job slows down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Salting technique (append a random suffix to the hot key)&lt;/li&gt;
&lt;li&gt;Adaptive skew join optimization (&lt;code&gt;spark.sql.adaptive.skewJoin.enabled&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
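&lt;p&gt;Salting can be demonstrated without Spark; this plain-Python sketch (illustrative, with made-up keys) spreads a hot key across several salted variants and explodes the dimension side to match:&lt;/p&gt;

```python
import random
from collections import Counter

# Illustrative salting (plain Python, not Spark): the hot key "US" is
# rewritten to one of NUM_SALTS variants so its rows spread across
# several partitions instead of overloading a single one.

NUM_SALTS = 4
facts = [("US", i) for i in range(1000)] + [("CA", 1), ("MX", 2)]

salted = [(f"{key}_{random.randrange(NUM_SALTS)}", val) for key, val in facts]

# The dimension side is exploded, one copy per salt value, so every
# salted fact key still finds its match in the join.
dims = [("US", "United States"), ("CA", "Canada"), ("MX", "Mexico")]
salted_dims = [(f"{key}_{s}", name) for key, name in dims for s in range(NUM_SALTS)]

# Count fact rows per salted partition key:
load = Counter(key for key, _ in salted)
print(load.most_common(4))  # the "US" load is now split across salted keys
```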




&lt;h2&gt;Summary&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Broadcast Hash&lt;/td&gt;
&lt;td&gt;Small + Large&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sort Merge&lt;/td&gt;
&lt;td&gt;Large + Large&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shuffle Hash&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nested Loop&lt;/td&gt;
&lt;td&gt;No condition&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;👋 Let’s Connect&lt;/h2&gt;

&lt;p&gt;If you’re working on Spark performance or large-scale pipelines, I’d love to discuss strategies and real-world scenarios!&lt;/p&gt;

</description>
      <category>pyspark</category>
      <category>databricks</category>
      <category>sparkarchitecture</category>
      <category>spark</category>
    </item>
    <item>
      <title>Migrating Legacy ETL to Modern Data Stack: Matillion → dbt on Databricks</title>
      <dc:creator>RASMIN BHALLA</dc:creator>
      <pubDate>Sat, 11 Apr 2026 05:14:49 +0000</pubDate>
      <link>https://dev.to/rasminbhalla/migrating-legacy-etl-to-modern-data-stack-matillion-dbt-on-databricks-3lee</link>
      <guid>https://dev.to/rasminbhalla/migrating-legacy-etl-to-modern-data-stack-matillion-dbt-on-databricks-3lee</guid>
      <description>&lt;p&gt;Modern data engineering is shifting from tool-driven ETL to &lt;strong&gt;code-first, modular pipelines&lt;/strong&gt;. In this post, I’ll walk through how I migrated legacy Matillion workflows to a scalable architecture using dbt and Databricks.&lt;/p&gt;




&lt;h2&gt;🧩 Problem Statement&lt;/h2&gt;
&lt;p&gt;We had multiple Matillion mappings handling core business entities like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Company&lt;/li&gt;
&lt;li&gt; Department&lt;/li&gt;
&lt;li&gt; Group&lt;/li&gt;
&lt;li&gt; Class / Sub-Class&lt;/li&gt;
&lt;li&gt; Supplier / Supplier Site&lt;/li&gt;
&lt;li&gt; Barcode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tight coupling between jobs&lt;/li&gt;
&lt;li&gt;Limited reusability&lt;/li&gt;
&lt;li&gt;Difficult debugging and lineage tracking&lt;/li&gt;
&lt;li&gt;Inconsistent data quality validation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;🏗️ Target Architecture&lt;/h2&gt;

&lt;p&gt;We redesigned the system using a &lt;strong&gt;medallion architecture&lt;/strong&gt;, where data flows through multiple refinement layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bronze&lt;/strong&gt; → Raw ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver&lt;/strong&gt; → Cleaned &amp;amp; validated data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gold&lt;/strong&gt; → Business-ready datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layered approach progressively improves data quality as data moves downstream (see the Databricks medallion architecture documentation).&lt;/p&gt;




&lt;h2&gt;🔄 Migration Strategy&lt;/h2&gt;

&lt;h3&gt;1. Decomposing Matillion Mappings&lt;/h3&gt;

&lt;p&gt;Each Matillion job was broken down into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source extraction&lt;/li&gt;
&lt;li&gt;Joins &amp;amp; filters&lt;/li&gt;
&lt;li&gt;Aggregations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then rewritten as &lt;strong&gt;modular dbt models&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;🧱 Layered Modeling Approach&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Staging (&lt;code&gt;stg_*&lt;/code&gt;)&lt;/strong&gt; → Raw cleanup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intermediate (&lt;code&gt;int_*&lt;/code&gt;)&lt;/strong&gt; → Business logic reuse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marts (&lt;code&gt;dim_*&lt;/code&gt;, &lt;code&gt;fct_*&lt;/code&gt;)&lt;/strong&gt; → Analytics-ready tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stg_supplier → int_supplier_enriched → dim_supplier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;⚡ Incremental Processing&lt;/h2&gt;

&lt;p&gt;Instead of full refresh pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filtered on an &lt;code&gt;updated_at&lt;/code&gt; watermark&lt;/li&gt;
&lt;li&gt;Applied dbt incremental models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Result: Reduced compute cost and faster execution&lt;/p&gt;
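&lt;p&gt;The watermark idea behind &lt;code&gt;updated_at&lt;/code&gt; filtering, as a plain-Python sketch (illustrative; in dbt this is typically an incremental model filtering on the max previously loaded timestamp):&lt;/p&gt;

```python
# Illustrative watermark-based incremental load (plain Python): only rows
# newer than the last processed timestamp are picked up on each run.

source = [
    {"id": 1, "updated_at": "2026-04-01"},
    {"id": 2, "updated_at": "2026-04-05"},
    {"id": 3, "updated_at": "2026-04-10"},
]

last_watermark = "2026-04-05"  # high-water mark saved by the previous run

# ISO-8601 date strings compare correctly as plain strings
new_rows = [row for row in source if row["updated_at"] > last_watermark]
print(new_rows)  # only id 3 needs reprocessing
```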




&lt;h2&gt;🧪 Data Validation Strategy (Critical Step)&lt;/h2&gt;

&lt;p&gt;Ensuring parity with production was the most critical step.&lt;/p&gt;

&lt;h3&gt;✔️ Validation Techniques&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Row count validation&lt;/li&gt;
&lt;li&gt;Aggregation checks (SUM, COUNT)&lt;/li&gt;
&lt;li&gt;Sample-level validation&lt;/li&gt;
&lt;li&gt;Hash-based comparison&lt;/li&gt;
&lt;/ul&gt;
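&lt;p&gt;These checks can be sketched generically in plain Python (illustrative toy data; the real comparison ran against production tables):&lt;/p&gt;

```python
import hashlib

# Illustrative parity checks between a legacy output and its migrated
# copy (toy in-memory tables; column names and values are made up).
legacy = [(1, 100.0), (2, 250.5), (3, 75.25)]
migrated = [(3, 75.25), (1, 100.0), (2, 250.5)]  # same rows, new order

# 1. Row count validation
assert len(legacy) == len(migrated)

# 2. Aggregation check (SUM over the measure column)
assert sum(v for _, v in legacy) == sum(v for _, v in migrated)

# 3. Hash-based comparison: canonicalize (sort) rows, then hash
def table_hash(rows):
    canonical = "|".join(repr(r) for r in sorted(rows))
    return hashlib.md5(canonical.encode()).hexdigest()

assert table_hash(legacy) == table_hash(migrated)
print("parity checks passed")
```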




&lt;h2&gt;✅ Data Quality Framework in dbt&lt;/h2&gt;

&lt;p&gt;Implemented both standard and custom tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not Null&lt;/li&gt;
&lt;li&gt;Unique&lt;/li&gt;
&lt;li&gt;Relationships (FK integrity)&lt;/li&gt;
&lt;li&gt;Accepted Values&lt;/li&gt;
&lt;li&gt;Freshness checks&lt;/li&gt;
&lt;/ul&gt;
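&lt;p&gt;What these tests assert can be shown with plain-Python equivalents (illustrative only; dbt itself declares these tests in YAML schema files):&lt;/p&gt;

```python
# Plain-Python equivalents of three standard dbt tests, run against a
# toy supplier table. Illustrative only: dbt declares these in YAML.

rows = [
    {"supplier_id": 1, "status": "ACTIVE"},
    {"supplier_id": 2, "status": "INACTIVE"},
    {"supplier_id": 3, "status": "ACTIVE"},
]

# not_null: the key column has no missing values
assert all(r["supplier_id"] is not None for r in rows)

# unique: no duplicate keys
ids = [r["supplier_id"] for r in rows]
assert len(ids) == len(set(ids))

# accepted_values: status is restricted to a known domain
assert all(r["status"] in {"ACTIVE", "INACTIVE"} for r in rows)

print("all checks passed")
```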




&lt;h2&gt;⚡ Performance Optimization&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Incremental models for large tables&lt;/li&gt;
&lt;li&gt;Partitioning (Delta tables)&lt;/li&gt;
&lt;li&gt;Optimized joins&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;🔍 Key Challenges&lt;/h2&gt;

&lt;h3&gt;1. Hidden Dependencies&lt;/h3&gt;

&lt;p&gt;Solved by expressing dependencies with &lt;code&gt;ref()&lt;/code&gt;, letting dbt build the DAG automatically&lt;/p&gt;

&lt;h3&gt;2. Data Mismatch&lt;/h3&gt;

&lt;p&gt;Resolved via structured reconciliation&lt;/p&gt;

&lt;h3&gt;3. Job Variables&lt;/h3&gt;

&lt;p&gt;Converted into dbt macros&lt;/p&gt;

&lt;h2&gt;📊 Outcome&lt;/h2&gt;

&lt;p&gt;✔ Improved maintainability&lt;br&gt;
 ✔ Standardized SQL transformations&lt;br&gt;
 ✔ Strong data quality enforcement&lt;br&gt;
 ✔ Reduced runtime and cost&lt;br&gt;
 ✔ Clear lineage and traceability&lt;/p&gt;




&lt;h2&gt;💡 Key Takeaway&lt;/h2&gt;

&lt;p&gt;This migration wasn’t just a tool replacement; it was a shift to:&lt;/p&gt;

&lt;p&gt;👉 Modular data engineering&lt;br&gt;
👉 Version-controlled transformations&lt;br&gt;
👉 Reliable, testable pipelines&lt;/p&gt;




&lt;h2&gt;👋 Final Thoughts&lt;/h2&gt;

&lt;p&gt;If you're still using legacy ETL tools, moving to dbt can drastically improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Development speed&lt;/li&gt;
&lt;li&gt;Debugging&lt;/li&gt;
&lt;li&gt;Data trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to discuss dbt + Databricks architectures or migration strategies!&lt;/p&gt;

</description>
      <category>matillion</category>
      <category>dbt</category>
      <category>databricks</category>
      <category>dataenginee</category>
    </item>
  </channel>
</rss>
