<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sandeep</title>
    <description>The latest articles on DEV Community by Sandeep (@sandeepk27).</description>
    <link>https://dev.to/sandeepk27</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F674820%2Fc140f0fd-dd82-41ae-9de2-5b26c6369dfb.jpg</url>
      <title>DEV Community: Sandeep</title>
      <link>https://dev.to/sandeepk27</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sandeepk27"/>
    <language>en</language>
    <item>
      <title>Day 30: From Zero to Production-Ready Spark Data Engineer</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Tue, 30 Dec 2025 07:33:19 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-30-from-zero-to-production-ready-spark-data-engineer-f6f</link>
      <guid>https://dev.to/sandeepk27/day-30-from-zero-to-production-ready-spark-data-engineer-f6f</guid>
      <description>&lt;p&gt;Learning Spark is easy. Using Spark correctly in production is not.&lt;/p&gt;

&lt;p&gt;Over the last 30 days, I focused on learning how Spark actually works in real data platforms, not just writing transformations.&lt;/p&gt;

&lt;p&gt;This journey changed the way I think about data engineering.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Spark Is Not About Code - It’s About Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Early on, I realized that Spark problems are rarely syntax problems.&lt;br&gt;
They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture problems&lt;/li&gt;
&lt;li&gt;Performance problems&lt;/li&gt;
&lt;li&gt;Data quality problems&lt;/li&gt;
&lt;li&gt;State management problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why concepts like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bronze–Silver–Gold&lt;/li&gt;
&lt;li&gt;Delta Lake&lt;/li&gt;
&lt;li&gt;Watermarking&lt;/li&gt;
&lt;li&gt;Exactly-once semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;matter more than fancy transformations.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Batch and Streaming Are Not Separate Worlds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest learnings was this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Structured Streaming is just Spark SQL running continuously.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The same rules apply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce shuffle&lt;/li&gt;
&lt;li&gt;Filter early&lt;/li&gt;
&lt;li&gt;Avoid UDFs&lt;/li&gt;
&lt;li&gt;Partition wisely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming only adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State&lt;/li&gt;
&lt;li&gt;Time&lt;/li&gt;
&lt;li&gt;Failure recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once I understood this, streaming stopped feeling scary.&lt;/p&gt;
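
&lt;p&gt;The parity is easy to see in code. A minimal PySpark sketch (paths, schema, and column names are illustrative): the transformation is identical, only the read and write entry points change.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Batch: read once, transform, write once
batch_df = spark.read.schema(schema).json("/data/events")
counts = batch_df.filter("amount IS NOT NULL").groupBy("country").count()
counts.write.format("delta").mode("overwrite").save("/gold/counts")

# Streaming: the same transformation, executed continuously
stream_df = spark.readStream.schema(schema).json("/data/events")
counts = stream_df.filter("amount IS NOT NULL").groupBy("country").count()
(counts.writeStream.format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/chk/counts")
    .start("/gold/counts"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;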

&lt;p&gt;&lt;strong&gt;🌟 Delta Lake Changed Everything&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Delta Lake turned data lakes into reliable systems.&lt;/p&gt;

&lt;p&gt;Features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MERGE&lt;/li&gt;
&lt;li&gt;Time travel&lt;/li&gt;
&lt;li&gt;ACID transactions&lt;/li&gt;
&lt;li&gt;Schema evolution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;made it possible to build pipelines that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recoverable&lt;/li&gt;
&lt;li&gt;Auditable&lt;/li&gt;
&lt;li&gt;Scalable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Delta is no longer optional — it’s foundational.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Production Thinking Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The biggest shift was learning to think like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when data is bad?&lt;/li&gt;
&lt;li&gt;What happens when the job fails?&lt;/li&gt;
&lt;li&gt;How do I reprocess?&lt;/li&gt;
&lt;li&gt;How do I debug?&lt;/li&gt;
&lt;li&gt;How much does this cost?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mindset is what separates data engineers from Spark users.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;What I Can Build Now&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After 30 days, I can confidently build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch ETL pipelines&lt;/li&gt;
&lt;li&gt;Data quality frameworks&lt;/li&gt;
&lt;li&gt;CDC pipelines&lt;/li&gt;
&lt;li&gt;Real-time analytics systems&lt;/li&gt;
&lt;li&gt;Exactly-once streaming pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More importantly, I can explain why a design works.&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark is powerful — but only when used with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correct architecture&lt;/li&gt;
&lt;li&gt;Performance awareness&lt;/li&gt;
&lt;li&gt;Strong data discipline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re learning Spark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don’t rush syntax&lt;/li&gt;
&lt;li&gt;Learn internals&lt;/li&gt;
&lt;li&gt;Build real pipelines&lt;/li&gt;
&lt;li&gt;Focus on failure scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s how you become production-ready.&lt;/p&gt;

&lt;p&gt;Follow for more such content. Let me know if I missed anything. Thank you!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 29: Building a Production-Grade Real-Time ETL Pipeline with Spark &amp; Delta</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Mon, 29 Dec 2025 08:09:59 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-29-building-a-production-grade-real-time-etl-pipeline-with-spark-delta-4blc</link>
      <guid>https://dev.to/sandeepk27/day-29-building-a-production-grade-real-time-etl-pipeline-with-spark-delta-4blc</guid>
      <description>&lt;p&gt;Welcome to Day 29 of the Spark Mastery Series.&lt;br&gt;
Today we build a real-world streaming system — the kind used in e-commerce, fintech, and analytics platforms.&lt;/p&gt;

&lt;p&gt;This pipeline handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming ingestion&lt;/li&gt;
&lt;li&gt;CDC upserts&lt;/li&gt;
&lt;li&gt;Data quality&lt;/li&gt;
&lt;li&gt;Exactly-once guarantees&lt;/li&gt;
&lt;li&gt;Real-time KPIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;Why This Architecture Works&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bronze preserves raw truth&lt;/li&gt;
&lt;li&gt;Silver maintains latest state via MERGE&lt;/li&gt;
&lt;li&gt;Gold serves analytics with windows &amp;amp; watermarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failures are recoverable, data is trustworthy, and performance is stable.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Key Patterns Used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;foreachBatch + MERGE for CDC&lt;/li&gt;
&lt;li&gt;Delta Lake for ACID &amp;amp; idempotency&lt;/li&gt;
&lt;li&gt;Watermark to bound state&lt;/li&gt;
&lt;li&gt;Append/update output modes&lt;/li&gt;
&lt;li&gt;Separate checkpoints per query&lt;/li&gt;
&lt;/ul&gt;
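
&lt;p&gt;The foreachBatch + MERGE pattern can be sketched like this (a minimal PySpark example; the streaming DataFrame bronze_stream, the table silver_orders, and the key column are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from delta.tables import DeltaTable

def upsert_to_silver(batch_df, batch_id):
    silver = DeltaTable.forName(spark, "silver_orders")
    (silver.alias("t")
        .merge(batch_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(bronze_stream.writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "/chk/silver_orders")  # one checkpoint per query
    .start())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;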

&lt;p&gt;🌟 &lt;strong&gt;Interview Value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can now explain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exactly-once semantics&lt;/li&gt;
&lt;li&gt;CDC in streaming&lt;/li&gt;
&lt;li&gt;State management&lt;/li&gt;
&lt;li&gt;Watermarking&lt;/li&gt;
&lt;li&gt;Streaming performance tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We built:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A complete real-time ETL pipeline&lt;/li&gt;
&lt;li&gt;CDC upserts with Delta&lt;/li&gt;
&lt;li&gt;Streaming metrics with windows&lt;/li&gt;
&lt;li&gt;Fault-tolerant design&lt;/li&gt;
&lt;li&gt;Production best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more such content. Let me know if I missed anything. Thank you!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 28: Spark Streaming Performance Tuning</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Mon, 29 Dec 2025 08:09:41 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-28-spark-streaming-performance-tuning-37ig</link>
      <guid>https://dev.to/sandeepk27/day-28-spark-streaming-performance-tuning-37ig</guid>
      <description>&lt;p&gt;Welcome to Day 28 of the Spark Mastery Series.&lt;br&gt;
Today we tackle the biggest fear in streaming systems:&lt;/p&gt;

&lt;p&gt;Jobs that work fine initially… then crash after hours or days.&lt;/p&gt;

&lt;p&gt;This happens because of state mismanagement.&lt;/p&gt;

&lt;p&gt;Let’s fix it.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Why Streaming Is Harder Than Batch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Batch jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start&lt;/li&gt;
&lt;li&gt;Finish&lt;/li&gt;
&lt;li&gt;Release memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never stop&lt;/li&gt;
&lt;li&gt;Accumulate state&lt;/li&gt;
&lt;li&gt;Must self-clean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without cleanup → failure is guaranteed.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Watermark Is Your Lifeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Watermark controls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How late data is accepted&lt;/li&gt;
&lt;li&gt;When old state is removed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No watermark = infinite memory usage.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Choosing the Right Trigger&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Triggers define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Cost&lt;/li&gt;
&lt;li&gt;Stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Too fast → expensive&lt;br&gt;
Too slow → delayed insights&lt;/p&gt;

&lt;p&gt;Most production jobs use 10–30 seconds.&lt;/p&gt;
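
&lt;p&gt;Triggers are configured on the writer. A minimal sketch (interval and paths are illustrative; availableNow requires a recent Spark version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df.writeStream
    .format("delta")
    .trigger(processingTime="30 seconds")  # balance latency against cost
    .option("checkpointLocation", "/chk/metrics")
    .start("/gold/metrics"))

# Alternative: process everything available once, then stop
# .trigger(availableNow=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;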

&lt;p&gt;🌟 &lt;strong&gt;Output Mode Matters More Than You Think&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Complete mode rewrites the entire result table on every batch.&lt;/p&gt;

&lt;p&gt;This:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increases state&lt;/li&gt;
&lt;li&gt;Increases CPU&lt;/li&gt;
&lt;li&gt;Increases cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use append/update wherever possible.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Monitoring Is Mandatory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A streaming job without monitoring is a ticking bomb.&lt;/p&gt;

&lt;p&gt;Always monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State size&lt;/li&gt;
&lt;li&gt;Batch duration&lt;/li&gt;
&lt;li&gt;Input rate&lt;/li&gt;
&lt;li&gt;Processing rate&lt;/li&gt;
&lt;/ul&gt;
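
&lt;p&gt;These metrics are exposed on the streaming query handle. A minimal sketch (sink and paths are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query = (df.writeStream.format("delta")
    .option("checkpointLocation", "/chk/metrics")
    .start("/gold/metrics"))

progress = query.lastProgress  # metrics of the most recent micro-batch
if progress:
    print(progress["inputRowsPerSecond"])      # input rate
    print(progress["processedRowsPerSecond"])  # processing rate
    print(progress["durationMs"])              # batch duration breakdown
    print(progress.get("stateOperators"))      # state size, if the query is stateful
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;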

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What streaming state is&lt;/li&gt;
&lt;li&gt;Why state grows&lt;/li&gt;
&lt;li&gt;How watermark bounds state&lt;/li&gt;
&lt;li&gt;Trigger tuning&lt;/li&gt;
&lt;li&gt;Output mode impact&lt;/li&gt;
&lt;li&gt;Checkpoint best practices&lt;/li&gt;
&lt;li&gt;Monitoring strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more such content. Let me know if I missed anything. Thank you!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 27: Building Exactly-Once Streaming Pipelines with Spark &amp; Delta Lake</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Mon, 29 Dec 2025 08:09:02 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-27-building-exactly-once-streaming-pipelines-with-spark-delta-lake-243c</link>
      <guid>https://dev.to/sandeepk27/day-27-building-exactly-once-streaming-pipelines-with-spark-delta-lake-243c</guid>
      <description>&lt;p&gt;Welcome to Day 27 of the Spark Mastery Series.&lt;br&gt;
Today we combine Structured Streaming + Delta Lake to build enterprise-grade pipelines.&lt;/p&gt;

&lt;p&gt;This is how modern companies handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time ingestion&lt;/li&gt;
&lt;li&gt;Updates &amp;amp; deletes&lt;/li&gt;
&lt;li&gt;CDC pipelines&lt;/li&gt;
&lt;li&gt;Fault tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;Why Exactly-Once Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without exactly-once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics inflate&lt;/li&gt;
&lt;li&gt;Revenue gets double-counted&lt;/li&gt;
&lt;li&gt;ML models break&lt;/li&gt;
&lt;li&gt;Trust is lost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Delta Lake guarantees correctness even during failures.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;The ForeachBatch Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;foreachBatch is the secret weapon for streaming ETL.&lt;/p&gt;

&lt;p&gt;It allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MERGE INTO&lt;/li&gt;
&lt;li&gt;UPDATE / DELETE&lt;/li&gt;
&lt;li&gt;Complex batch logic&lt;/li&gt;
&lt;li&gt;Idempotent processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how CDC pipelines are built.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;CDC with MERGE - The Right Way&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full table overwrite&lt;/li&gt;
&lt;li&gt;Complex joins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MERGE INTO&lt;/li&gt;
&lt;li&gt;Transactional updates&lt;/li&gt;
&lt;li&gt;Efficient incremental processing&lt;/li&gt;
&lt;/ul&gt;
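
&lt;p&gt;In SQL form, the incremental upsert looks roughly like this (table and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MERGE INTO silver_customers AS t
USING updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;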

&lt;p&gt;🌟 &lt;strong&gt;Real-World Architecture&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kafka / Files
   ↓
Spark Structured Streaming
   ↓
Delta Bronze (append)
   ↓
Delta Silver (merge)
   ↓
Delta Gold (metrics)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This architecture:&lt;br&gt;
✔ Scales&lt;br&gt;
✔ Recovers from failure&lt;br&gt;
✔ Supports history &amp;amp; audit&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exactly-once semantics&lt;/li&gt;
&lt;li&gt;Streaming writes to Delta&lt;/li&gt;
&lt;li&gt;CDC pipelines with MERGE&lt;/li&gt;
&lt;li&gt;ForeachBatch pattern&lt;/li&gt;
&lt;li&gt;Handling deletes&lt;/li&gt;
&lt;li&gt;Streaming Bronze–Silver–Gold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more such content. Let me know if I missed anything. Thank you!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 26: Spark Streaming Joins</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Fri, 26 Dec 2025 12:08:16 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-26-spark-streaming-joins-2kdj</link>
      <guid>https://dev.to/sandeepk27/day-26-spark-streaming-joins-2kdj</guid>
      <description>&lt;p&gt;Welcome to Day 26 of the Spark Mastery Series. &lt;/p&gt;

&lt;p&gt;Today we tackle one of the hardest Spark topics: Streaming Joins.&lt;/p&gt;

&lt;p&gt;Many production streaming jobs fail because joins are misunderstood.&lt;br&gt;
Let’s fix that.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Stream-Static Joins (90% of Use Cases)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most common and safest pattern.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orders stream + customers table&lt;/li&gt;
&lt;li&gt;Click stream + product dimension&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static table doesn’t grow&lt;/li&gt;
&lt;li&gt;No extra state needed&lt;/li&gt;
&lt;li&gt;Easy to optimize&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the static table is small → broadcast it.&lt;/p&gt;
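
&lt;p&gt;A minimal stream–static join sketch (paths and join key are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import broadcast

orders_stream = spark.readStream.format("delta").load("/bronze/orders")
customers = spark.read.format("delta").load("/dim/customers")  # static dimension

# Broadcasting the small dimension avoids a shuffle on the streaming side
enriched = orders_stream.join(broadcast(customers), "customer_id")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;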

&lt;p&gt;🌟 &lt;strong&gt;Stream-Stream Joins (Advanced &amp;amp; Risky)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both inputs are live streams&lt;/li&gt;
&lt;li&gt;Events must be correlated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Login event + purchase event&lt;/li&gt;
&lt;li&gt;Click event + payment event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These joins require: &lt;br&gt;
✔ Event time&lt;br&gt;
✔ Watermarks&lt;br&gt;
✔ Time-bounded join condition&lt;/p&gt;

&lt;p&gt;Without these → memory explosion.&lt;/p&gt;
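
&lt;p&gt;A sketch of a time-bounded stream–stream join (sources, columns, and intervals are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import expr

logins = (spark.readStream.format("delta").load("/bronze/logins")
    .withWatermark("login_time", "30 minutes").alias("l"))
purchases = (spark.readStream.format("delta").load("/bronze/purchases")
    .withWatermark("purchase_time", "30 minutes").alias("p"))

# The time bound lets Spark drop buffered state once the watermark passes
joined = logins.join(purchases, expr(
    "l.user_id = p.user_id AND "
    "p.purchase_time BETWEEN l.login_time AND l.login_time + INTERVAL 1 HOUR"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;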

&lt;p&gt;🌟 &lt;strong&gt;How Spark Manages State&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For stream–stream joins, Spark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buffers events from both sides&lt;/li&gt;
&lt;li&gt;Matches based on time window&lt;/li&gt;
&lt;li&gt;Drops old state using watermark&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why watermarks are non-negotiable.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Real-World Recommendation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you can:&lt;br&gt;
&lt;code&gt;Convert one stream to static (Delta table)&lt;br&gt;
and use stream–static join.&lt;/code&gt;&lt;br&gt;
This is more stable and scalable.&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Types of streaming joins&lt;/li&gt;
&lt;li&gt;Stream-static joins (best practice)&lt;/li&gt;
&lt;li&gt;Stream-stream joins (advanced)&lt;/li&gt;
&lt;li&gt;Why watermarks are mandatory&lt;/li&gt;
&lt;li&gt;Performance &amp;amp; stability tips&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more such content. Let me know if I missed anything. Thank you!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 25: Streaming Aggregations in Spark</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Thu, 25 Dec 2025 17:18:07 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-25-streaming-aggregations-in-spark-169j</link>
      <guid>https://dev.to/sandeepk27/day-25-streaming-aggregations-in-spark-169j</guid>
      <description>&lt;p&gt;Welcome to Day 25 of the Spark Mastery Series. Today we move from “reading streams” to real-time analytics.&lt;br&gt;
This is where most streaming pipelines fail - not because of code, but because of state mismanagement.&lt;/p&gt;

&lt;p&gt;Let’s fix that.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Why Streaming Aggregations Are Hard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Streaming data never ends.&lt;br&gt;
If you aggregate without limits, Spark keeps data forever.&lt;/p&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Growing state&lt;/li&gt;
&lt;li&gt;Memory pressure&lt;/li&gt;
&lt;li&gt;Job crashes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;Event Time Is Mandatory&lt;/strong&gt;&lt;br&gt;
Always use event time, not processing time.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing time depends on delays&lt;/li&gt;
&lt;li&gt;Event time reflects real business time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correct analytics depend on event time.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Windows - Turning Infinite into Finite&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Windows slice infinite streams into manageable chunks.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sales every 10 minutes&lt;/li&gt;
&lt;li&gt;Clicks per hour&lt;/li&gt;
&lt;li&gt;Orders per day&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;Watermarking — Cleaning Old State&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Watermark tells Spark:&lt;br&gt;
'You can forget data older than X minutes.'&lt;/p&gt;

&lt;p&gt;This:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bounds memory usage&lt;/li&gt;
&lt;li&gt;Allows append mode&lt;/li&gt;
&lt;li&gt;Handles late data safely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;Real-World Example&lt;/strong&gt;&lt;br&gt;
E-commerce&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window: 5 minutes&lt;/li&gt;
&lt;li&gt;Watermark: 10 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept data late by 10 minutes&lt;/li&gt;
&lt;li&gt;Drop anything older&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This balances accuracy and performance.&lt;/p&gt;
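
&lt;p&gt;With those numbers, the aggregation can be sketched as follows (it assumes an events stream with event_time, country, and amount columns, all illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import functions as F

sales = (events
    .withWatermark("event_time", "10 minutes")  # forget state older than 10 minutes
    .groupBy(F.window("event_time", "5 minutes"), "country")
    .agg(F.sum("amount").alias("revenue")))

(sales.writeStream
    .outputMode("append")  # append mode is possible because of the watermark
    .format("delta")
    .option("checkpointLocation", "/chk/sales")
    .start("/gold/sales"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;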

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming aggregations&lt;/li&gt;
&lt;li&gt;Event time vs processing time&lt;/li&gt;
&lt;li&gt;Windowed analytics&lt;/li&gt;
&lt;li&gt;Tumbling &amp;amp; sliding windows&lt;/li&gt;
&lt;li&gt;Late data handling&lt;/li&gt;
&lt;li&gt;Watermarking&lt;/li&gt;
&lt;li&gt;Output modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more such content. Let me know if I missed anything. Thank you&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 24: Spark Structured Streaming</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Wed, 24 Dec 2025 12:05:58 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-24-spark-structured-streaming-5945</link>
      <guid>https://dev.to/sandeepk27/day-24-spark-structured-streaming-5945</guid>
      <description>&lt;p&gt;Welcome to Day 24 of the Spark Mastery Series.&lt;br&gt;
Today we enter the world of real-time data pipelines using Spark Structured Streaming.&lt;/p&gt;

&lt;p&gt;If you already know Spark batch, good news:&lt;br&gt;
&lt;code&gt;You already know 70% of streaming.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Let’s understand why.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Structured Streaming = Continuous Batch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark does NOT process events one by one.&lt;br&gt;
It processes small batches repeatedly. This gives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fault tolerance&lt;/li&gt;
&lt;li&gt;Exactly-once guarantees&lt;/li&gt;
&lt;li&gt;High throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;Why Structured Streaming Is Powerful&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike older Spark Streaming (DStreams):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses DataFrames&lt;/li&gt;
&lt;li&gt;Uses Catalyst optimizer&lt;/li&gt;
&lt;li&gt;Supports SQL&lt;/li&gt;
&lt;li&gt;Integrates with Delta Lake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it production-ready.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Sources &amp;amp; Sinks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Typical real-world flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kafka → Spark → Delta → BI / ML
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;File streams are useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IoT batch drops&lt;/li&gt;
&lt;li&gt;Landing zones&lt;/li&gt;
&lt;li&gt;Testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;Output Modes Explained Simply&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Append → only new rows&lt;/li&gt;
&lt;li&gt;Update → changed rows&lt;/li&gt;
&lt;li&gt;Complete → full table every time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most production pipelines use append or update.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Checkpointing = Safety Net&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Checkpointing stores progress so Spark can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resume after failure&lt;/li&gt;
&lt;li&gt;Avoid duplicates&lt;/li&gt;
&lt;li&gt;Maintain state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No checkpoint = broken pipeline.&lt;/p&gt;
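
&lt;p&gt;Putting the pieces together, a minimal first pipeline might look like this (schema, paths, and filter are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# File source: streaming reads need an explicit schema
stream = (spark.readStream
    .schema(event_schema)
    .json("/landing/events"))

(stream.filter("event_type IS NOT NULL")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/chk/bronze_events")  # the safety net
    .start("/bronze/events"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;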

&lt;p&gt;🌟 &lt;strong&gt;First Pipeline Mindset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Treat streaming as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;An infinite DataFrame processed every few seconds&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Same rules apply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter early&lt;/li&gt;
&lt;li&gt;Avoid shuffle&lt;/li&gt;
&lt;li&gt;Avoid UDFs&lt;/li&gt;
&lt;li&gt;Monitor performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Structured Streaming is&lt;/li&gt;
&lt;li&gt;Batch vs streaming model&lt;/li&gt;
&lt;li&gt;Sources &amp;amp; sinks&lt;/li&gt;
&lt;li&gt;Output modes&lt;/li&gt;
&lt;li&gt;Triggers&lt;/li&gt;
&lt;li&gt;Checkpointing&lt;/li&gt;
&lt;li&gt;First streaming pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more such content. Let me know if I missed anything. Thank you&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 23: Spark Shuffle Optimization</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Tue, 23 Dec 2025 09:46:46 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-23-spark-shuffle-optimization-32hm</link>
      <guid>https://dev.to/sandeepk27/day-23-spark-shuffle-optimization-32hm</guid>
      <description>&lt;p&gt;Welcome to Day 23 of the Spark Mastery Series. Yesterday we learned why shuffles are slow.&lt;br&gt;
Today we learn how to beat them.&lt;/p&gt;

&lt;p&gt;These techniques are used daily by senior data engineers.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;1. Broadcast Join — The Fastest Optimization&lt;/strong&gt;&lt;br&gt;
Broadcast join removes shuffle entirely.&lt;br&gt;
When used correctly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job runtime drops dramatically&lt;/li&gt;
&lt;li&gt;Cluster cost reduces&lt;/li&gt;
&lt;li&gt;Stability improves&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Golden rule:&lt;br&gt;
&lt;code&gt;Broadcast small, stable tables only.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;2. Salting - Fixing the “Last Task Problem”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If 99% of your Spark job’s tasks finish quickly while one task runs forever → data skew.&lt;br&gt;
Salting breaks big keys into smaller chunks so work is evenly distributed.&lt;/p&gt;

&lt;p&gt;This is common in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Country-level data&lt;/li&gt;
&lt;li&gt;Product category data&lt;/li&gt;
&lt;li&gt;Event-type aggregations&lt;/li&gt;
&lt;/ul&gt;
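
&lt;p&gt;A minimal salting sketch for a skewed aggregation (column names and bucket count are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import functions as F

N = 10  # number of salt buckets; tune to the observed skew

# Spread each hot key across N sub-keys so no single task gets all the work
salted = df.withColumn("salt", F.floor(F.rand() * N))

partial = (salted.groupBy("country", "salt")
    .agg(F.sum("amount").alias("partial_sum")))
result = (partial.groupBy("country")
    .agg(F.sum("partial_sum").alias("total")))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;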

&lt;p&gt;🌟 &lt;strong&gt;3. AQE - Let Spark Fix Itself&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Adaptive Query Execution allows Spark to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change join strategies&lt;/li&gt;
&lt;li&gt;Reduce partitions&lt;/li&gt;
&lt;li&gt;Fix skew at runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This removes the need for many manual optimizations. &lt;/p&gt;

&lt;p&gt;If AQE is ON, Spark becomes smarter.&lt;/p&gt;
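
&lt;p&gt;AQE is on by default in recent Spark releases; the relevant settings can be checked or set explicitly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # shrink post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;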

&lt;p&gt;🌟 &lt;strong&gt;4. Real-World Optimization Flow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Senior engineers always:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check explain plan&lt;/li&gt;
&lt;li&gt;Look for shuffle&lt;/li&gt;
&lt;li&gt;Broadcast where possible&lt;/li&gt;
&lt;li&gt;Aggregate early&lt;/li&gt;
&lt;li&gt;Let AQE optimize&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broadcast join internals&lt;/li&gt;
&lt;li&gt;When auto-broadcast works&lt;/li&gt;
&lt;li&gt;How salting fixes skew&lt;/li&gt;
&lt;li&gt;How AQE optimizes at runtime&lt;/li&gt;
&lt;li&gt;A real optimization strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more such content. Let me know if I missed anything. Thank you&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 22: Spark Shuffle Deep Dive</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Mon, 22 Dec 2025 10:00:40 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-22-spark-shuffle-deep-dive-3hhk</link>
      <guid>https://dev.to/sandeepk27/day-22-spark-shuffle-deep-dive-3hhk</guid>
      <description>&lt;p&gt;Welcome to Day 22 of the Spark Mastery Series.&lt;br&gt;
Today we open the black box that most Spark developers fear — Shuffles.&lt;/p&gt;

&lt;p&gt;If your Spark job is slow, unstable, or expensive, shuffle is the reason 90% of the time.&lt;/p&gt;

&lt;p&gt;Let’s understand why.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;What Exactly Is a Shuffle?&lt;/strong&gt;&lt;br&gt;
A shuffle happens when Spark must repartition data across executors based on a key.&lt;/p&gt;

&lt;p&gt;This is required for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;joins&lt;/li&gt;
&lt;li&gt;aggregations&lt;/li&gt;
&lt;li&gt;sorting&lt;/li&gt;
&lt;li&gt;ranking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it comes at a huge cost.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Why Shuffles Are Expensive&lt;/strong&gt;&lt;br&gt;
During shuffle Spark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writes intermediate data to disk&lt;/li&gt;
&lt;li&gt;Sends data over the network&lt;/li&gt;
&lt;li&gt;Sorts large datasets&lt;/li&gt;
&lt;li&gt;Creates new execution stages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes shuffle the slowest operation in Spark.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Reading Shuffle in Explain Plan&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.explain(True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exchange&lt;/li&gt;
&lt;li&gt;SortMergeJoin&lt;/li&gt;
&lt;li&gt;HashAggregate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These indicate shuffle boundaries.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Shuffle in Spark UI&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Key metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shuffle Read (bytes)&lt;/li&gt;
&lt;li&gt;Shuffle Write (bytes)&lt;/li&gt;
&lt;li&gt;Spill (memory/disk)&lt;/li&gt;
&lt;li&gt;Task skew (long tail tasks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One task running much longer → skew&lt;/li&gt;
&lt;li&gt;High shuffle read/write → optimization needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;Real Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bad pipeline&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.join(df2, "id").groupBy("id").count()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optimised&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df2_small = broadcast(df2)
df.join(df2_small, "id").groupBy("id").count()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shuffle reduced&lt;/li&gt;
&lt;li&gt;Runtime improved drastically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;How Senior Engineers Think&lt;/strong&gt;&lt;br&gt;
They ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this shuffle necessary?&lt;/li&gt;
&lt;li&gt;Can I broadcast?&lt;/li&gt;
&lt;li&gt;Can I aggregate earlier?&lt;/li&gt;
&lt;li&gt;Can I reduce data before shuffle?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What shuffle is&lt;/li&gt;
&lt;li&gt;What causes shuffle&lt;/li&gt;
&lt;li&gt;Why shuffle is slow&lt;/li&gt;
&lt;li&gt;How to identify shuffle&lt;/li&gt;
&lt;li&gt;How skew affects shuffle&lt;/li&gt;
&lt;li&gt;How to think like a senior engineer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more such content. Let me know if I missed anything. Thank you!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 21: Building a Production-Grade Data Quality Pipeline with Spark &amp; Delta</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Mon, 22 Dec 2025 09:50:36 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-21-building-a-production-grade-data-quality-pipeline-with-spark-delta-374o</link>
      <guid>https://dev.to/sandeepk27/day-21-building-a-production-grade-data-quality-pipeline-with-spark-delta-374o</guid>
      <description>&lt;p&gt;Welcome to Day 21 of the Spark Mastery Series.&lt;br&gt;
Today we stop talking about theory and build a real production data pipeline that handles bad data gracefully.&lt;/p&gt;

&lt;p&gt;This is the kind of work data engineers do every day.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Why Data Quality Pipelines Matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad data WILL arrive&lt;/li&gt;
&lt;li&gt;Pipelines MUST not fail&lt;/li&gt;
&lt;li&gt;Metrics MUST be trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good pipeline:&lt;br&gt;
✔ Captures bad data&lt;br&gt;
✔ Cleans valid data&lt;br&gt;
✔ Tracks metrics&lt;br&gt;
✔ Supports reprocessing&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Bronze → Silver → Gold in Action&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bronze keeps raw truth&lt;/li&gt;
&lt;li&gt;Silver enforces trust&lt;/li&gt;
&lt;li&gt;Gold delivers insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation is what makes systems scalable and debuggable.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Key Patterns Used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit schema&lt;/li&gt;
&lt;li&gt;badRecordsPath&lt;/li&gt;
&lt;li&gt;Deduplication using window functions&lt;/li&gt;
&lt;li&gt;Valid/invalid split&lt;/li&gt;
&lt;li&gt;Audit metrics table&lt;/li&gt;
&lt;li&gt;Delta Lake everywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 &lt;strong&gt;Why This Project is Interview-Ready&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We demonstrated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data quality handling&lt;/li&gt;
&lt;li&gt;Fault tolerance&lt;/li&gt;
&lt;li&gt;Real ETL architecture&lt;/li&gt;
&lt;li&gt;Delta Lake usage&lt;/li&gt;
&lt;li&gt;Production thinking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is senior-level Spark work.&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
We built:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-to-end data quality pipeline&lt;/li&gt;
&lt;li&gt;Bronze/Silver/Gold layers&lt;/li&gt;
&lt;li&gt;Bad record handling&lt;/li&gt;
&lt;li&gt;Audit metrics&lt;/li&gt;
&lt;li&gt;Business-ready data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more content like this. Let me know if I missed anything. Thank you&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 20: Handling Bad Records &amp; Data Quality in Spark</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Mon, 22 Dec 2025 09:44:56 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-20-handling-bad-records-data-quality-in-spark-20db</link>
      <guid>https://dev.to/sandeepk27/day-20-handling-bad-records-data-quality-in-spark-20db</guid>
      <description>&lt;p&gt;Welcome to Day 20 of the Spark Mastery Series. Today we address a harsh truth:&lt;br&gt;
&lt;code&gt;Real data is messy, incomplete, and unreliable.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If your Spark pipeline can’t handle bad data, it will fail in production. Let’s build pipelines that survive reality.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Why Data Quality Matters&lt;/strong&gt;&lt;br&gt;
Bad data leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong dashboards&lt;/li&gt;
&lt;li&gt;Broken ML models&lt;/li&gt;
&lt;li&gt;Financial losses&lt;/li&gt;
&lt;li&gt;Loss of trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data engineers are responsible for trustworthy data.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Enforce Schema Early&lt;/strong&gt;&lt;br&gt;
Always define schema explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster ingestion&lt;/li&gt;
&lt;li&gt;Early error detection&lt;/li&gt;
&lt;li&gt;Consistent downstream processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Never rely on inferSchema in production.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Capture Bad Records, Don’t Drop Them&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using badRecordsPath (a Databricks runtime option; open-source Spark offers PERMISSIVE mode with a corrupt-record column instead) ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline continues&lt;/li&gt;
&lt;li&gt;Bad data is quarantined&lt;/li&gt;
&lt;li&gt;Audits are possible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is mandatory in regulated industries.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Apply Business Rules in Silver Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Silver layer is where data becomes trusted.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove negative amounts&lt;/li&gt;
&lt;li&gt;Validate country codes&lt;/li&gt;
&lt;li&gt;Drop incomplete records&lt;/li&gt;
&lt;li&gt;Deduplicate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Never mix business rules into Bronze.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Observability &amp;amp; Metrics&lt;/strong&gt;&lt;br&gt;
Track record counts for every job.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: 1,000,000
Valid: 995,000
Invalid: 5,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If invalid spikes → alert immediately.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Delta Lake Safety Net&lt;/strong&gt;&lt;br&gt;
With Delta:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rollback bad writes&lt;/li&gt;
&lt;li&gt;Reprocess safely&lt;/li&gt;
&lt;li&gt;Audit changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why Delta is production-critical.&lt;/p&gt;
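
&lt;p&gt;As a sketch, these are the Delta SQL commands behind that safety net (the table name &lt;code&gt;events&lt;/code&gt; and the version number are invented; they require a Delta-enabled session):&lt;/p&gt;

```sql
-- Audit changes: every write is a numbered, inspectable table version
DESCRIBE HISTORY events;

-- Reprocess safely: read the table as it was at an earlier version
SELECT * FROM events VERSION AS OF 12;

-- Rollback bad writes: restore the table to a known-good version
RESTORE TABLE events TO VERSION AS OF 12;
```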

&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What bad records are&lt;/li&gt;
&lt;li&gt;How to enforce schema&lt;/li&gt;
&lt;li&gt;How to capture corrupt data&lt;/li&gt;
&lt;li&gt;How to apply data quality rules&lt;/li&gt;
&lt;li&gt;How to track metrics&lt;/li&gt;
&lt;li&gt;How Delta helps recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more content like this. Let me know if I missed anything. Thank you&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 19: Spark Broadcasting &amp; Caching</title>
      <dc:creator>Sandeep</dc:creator>
      <pubDate>Mon, 22 Dec 2025 08:06:01 +0000</pubDate>
      <link>https://dev.to/sandeepk27/day-19-spark-broadcasting-caching-29c6</link>
      <guid>https://dev.to/sandeepk27/day-19-spark-broadcasting-caching-29c6</guid>
      <description>&lt;p&gt;Welcome to Day 19 of the Spark Mastery Series.&lt;br&gt;
Today we focus on memory, the most common reason Spark jobs fail in production.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Most Spark failures are not logic bugs - they are memory mistakes.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Broadcasting — The Right Way to Join Small Tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Broadcast joins avoid shuffle and are extremely fast.&lt;br&gt;
But misuse leads to executor crashes.&lt;/p&gt;

&lt;p&gt;Golden rule:&lt;br&gt;
-&amp;gt; Broadcast only when the table is small and stable.&lt;/p&gt;

&lt;p&gt;Spark sometimes chooses a broadcast join automatically, based on its size estimate and spark.sql.autoBroadcastJoinThreshold, but an explicit broadcast gives you control.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Caching — Powerful but Dangerous&lt;/strong&gt;&lt;br&gt;
Caching improves performance only when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The same DataFrame is reused&lt;/li&gt;
&lt;li&gt;Computation before cache is heavy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad caching causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executor OOM&lt;/li&gt;
&lt;li&gt;GC thrashing&lt;/li&gt;
&lt;li&gt;Cluster instability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Always ask:&lt;br&gt;
-&amp;gt; Will this DataFrame be reused?&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Persist vs Cache — What to Use?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cache() → shorthand for persist() at the default level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames)&lt;/li&gt;
&lt;li&gt;persist(StorageLevel.MEMORY_AND_DISK) → explicit, production-safe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use persist() for ETL pipelines.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Spark Memory Internals&lt;/strong&gt;&lt;br&gt;
Spark prioritizes execution memory over cached data.&lt;/p&gt;

&lt;p&gt;If Spark needs memory for shuffle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It evicts cached blocks&lt;/li&gt;
&lt;li&gt;Recomputes them later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why caching doesn’t guarantee data stays in memory forever.&lt;/p&gt;

&lt;p&gt;🌟 &lt;strong&gt;Real-World Example&lt;/strong&gt;&lt;br&gt;
Bad practice&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df1.cache()
df2.cache()
df3.cache()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good practice&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_silver.persist(StorageLevel.MEMORY_AND_DISK)
df_silver.count()
# use df_silver multiple times
df_silver.unpersist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🚀 &lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How broadcast joins work internally&lt;/li&gt;
&lt;li&gt;When to use and avoid broadcast&lt;/li&gt;
&lt;li&gt;Cache vs persist&lt;/li&gt;
&lt;li&gt;Storage levels&lt;/li&gt;
&lt;li&gt;Spark memory model&lt;/li&gt;
&lt;li&gt;How to avoid OOM errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow for more content like this. Let me know if I missed anything. Thank you!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
  </channel>
</rss>
