🔥 Day 3: RDDs - The Foundation of Spark

Welcome to Day 3 of your Spark Mastery Journey.
Today, we explore RDDs (Resilient Distributed Datasets) - the backbone of Spark.

Even though we mostly use DataFrames today, companies still test RDD fundamentals in interviews because they reveal how Spark works internally.

Let's break it down simply.

🌟 What Exactly Is an RDD?

An RDD is:

A distributed collection of immutable data partitioned across the cluster.

Key properties:

  • Immutable
  • Lazily evaluated
  • Distributed
  • Fault-tolerant
  • Parallelized

RDDs were the first abstraction in Spark before DataFrames and Datasets existed.

🧠 Why Should You Learn RDDs?
Even though DataFrames are recommended now, RDDs are still crucial for:

  • Understanding execution plans
  • Debugging shuffles
  • Improving partition strategies
  • Designing performance-efficient pipelines
  • Handling unstructured data

⚡ How to Create RDDs

  1. From Python Lists
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
  2. From a File
rdd = spark.sparkContext.textFile("sales.txt")
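
Both snippets assume an existing SparkSession named spark. Here is a minimal setup sketch (the app name is just a placeholder):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; its sparkContext is the entry point for RDDs
spark = SparkSession.builder.appName("day3-rdds").getOrCreate()
sc = spark.sparkContext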

πŸ” RDD Transformations (Lazy)
Transformations don't execute anything; they only build up the DAG (the execution plan).

Common transformations:

rdd.map(lambda x: x * 2)             # transform each element
rdd.filter(lambda x: x > 10)         # keep only elements matching a predicate
rdd.flatMap(lambda x: x.split(","))  # one input element can yield many outputs

You can chain transformations; Spark still won't run anything until an Action is called.
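
A quick way to see this laziness for yourself (a minimal sketch; the numbers are arbitrary):

nums = spark.sparkContext.parallelize(range(1, 11))
doubled = nums.map(lambda x: x * 2).filter(lambda x: x > 10)  # no job runs yet
print(doubled.count())  # the action triggers the whole chain and prints 5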

🏁 RDD Actions (Execute Plan)

Actions trigger job execution:

rdd.collect()                  # bring all elements back to the driver
rdd.count()                    # number of elements in the RDD
rdd.take(5)                    # first 5 elements
rdd.saveAsTextFile("output")   # write each partition out as a text file
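
One caution: collect() pulls every record back to the driver, which can crash it on large datasets; prefer take() when you only need a sample:

big = spark.sparkContext.parallelize(range(1_000_000))
print(big.take(5))  # fetches only enough partitions to return 5 elements
# big.collect()     # would pull all 1,000,000 elements into driver memory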

🔥 Narrow vs Wide Transformations

This is the MOST important concept for performance.

🔸 Narrow (No Shuffle)

Each output partition depends on only one input partition, so no data moves across the network. Fast.

Examples:

  • map
  • filter
  • union

🔸 Wide (Shuffle Required)

Each output partition depends on data from multiple input partitions, so records must be redistributed across the network. Slow.

  • Creates a new stage boundary in the DAG

Examples:

  • groupByKey
  • join
  • reduceByKey

Shuffles are a major cause of slow Spark jobs.
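
A classic comparison: reduceByKey and groupByKey both shuffle, but reduceByKey combines values within each partition before the shuffle, so far less data crosses the network (a minimal sketch):

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1)])
# reduceByKey: map-side combine happens before the shuffle -> less network traffic
print(pairs.reduceByKey(lambda a, b: a + b).collect())  # e.g. [('a', 2), ('b', 1)]
# groupByKey: every record is shuffled first, values are summed only afterwards
print(pairs.groupByKey().mapValues(sum).collect())      # same result, more shuffle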

🔄 RDD Lineage: Fault Tolerance in Action

Each RDD tracks how it was created.

Example (building on the sales.txt RDD from earlier):

rdd1 = rdd.map(lambda line: (line.split(",")[0], 1))  # key by the first field
rdd2 = rdd1.filter(lambda pair: pair[0] != "")        # drop empty keys
rdd3 = rdd2.reduceByKey(lambda a, b: a + b)           # count per key (wide)

If a node dies, Spark recomputes only the lost partitions by replaying this lineage.
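
You can inspect the lineage Spark has recorded; toDebugString() prints the chain of RDDs and the shuffle (stage) boundaries:

print(rdd3.toDebugString().decode("utf-8"))  # shows the map/filter/reduceByKey chain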

📦 Persistence and Caching

If you reuse an RDD (again building on the sales.txt example):

processed = rdd.map(lambda line: line.upper())
processed.persist()   # mark for caching; materialized on the first action
processed.count()     # first action: computes the map and caches the result
processed.filter(lambda line: "," in line).count()  # second action: served from the cache

Spark will NOT recompute processed for the second action; it reads the cached partitions from memory.
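
persist() with no arguments caches in memory only (MEMORY_ONLY), and the level cannot be changed once set. If the data might not fit, pick a different level up front (a small sketch):

from pyspark import StorageLevel

# Instead of the bare persist() above, choose the level at the first persist:
processed.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if memory is tight
processed.unpersist()  # later: free the cached partitions when no longer needed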

🧭 Summary

Today we learned:

  • What RDDs are
  • Why they matter
  • Transformations vs actions
  • Narrow vs wide transformations
  • Lineage
  • Caching and persistence

These concepts form the foundation of Spark internals.

Follow for more such content, and let me know in the comments if I missed anything. Thank you!
