Welcome to Day 3 of your Spark Mastery Journey.
Today, we explore RDDs (Resilient Distributed Datasets) - the backbone of Spark.
Even though we mostly use DataFrames today, companies still test RDD fundamentals in interviews because they reveal how Spark works internally.
Let's break it down simply.
What Exactly Is an RDD?
An RDD is:
A distributed collection of immutable data partitioned across the cluster.
Key properties:
- Immutable
- Lazily evaluated
- Distributed
- Fault-tolerant
- Parallelized
RDDs were the first abstraction in Spark before DataFrames and Datasets existed.
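To make the first two properties concrete, here is a minimal sketch (assuming a local SparkSession named spark, which the later snippets reuse): a transformation returns a new RDD and leaves the original untouched, and nothing is computed until an action runs.
from pyspark.sql import SparkSession
# Local session used for the sketches in this post (assumed setup, not part of any real pipeline)
spark = SparkSession.builder.master("local[*]").appName("rdd-day3").getOrCreate()
numbers = spark.sparkContext.parallelize([1, 2, 3, 4])
doubled = numbers.map(lambda x: x * 2)   # returns a NEW RDD; 'numbers' is unchanged (immutable)
print(numbers.collect())  # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8] - computed only now, when the action is called (lazy)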
Why Should You Learn RDDs?
Even though DataFrames are recommended now, RDDs are still crucial for:
- Understanding execution plans
- Debugging shuffles
- Improving partition strategies
- Designing performance-efficient pipelines
- Handling non-structured data
How to Create RDDs
- From Python Lists
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
- From File
rdd = spark.sparkContext.textFile("sales.txt")
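Both snippets assume a SparkSession named spark (as in the sketch above) and, for the file example, a sales.txt in the working directory. You can also control the number of partitions at creation time:
sc = spark.sparkContext
# From a Python list, explicitly split into 4 partitions
rdd_list = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], numSlices=4)
# From a text file, asking Spark for at least 4 partitions
rdd_file = sc.textFile("sales.txt", minPartitions=4)
print(rdd_list.getNumPartitions())  # 4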
RDD Transformations (Lazy)
Transformations build the DAG.
Common transformations:
rdd.map(lambda x: x*2)
rdd.filter(lambda x: x > 10)
rdd.flatMap(lambda x: x.split(","))
You can chain transformations; Spark still won't run anything until an action is called.
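For example, here is a small chained pipeline (a sketch using a made-up RDD of comma-separated numbers). Defining it builds the DAG but runs nothing:
lines = spark.sparkContext.parallelize(["1,2,3", "10,20,30"])
pipeline = (lines
            .flatMap(lambda line: line.split(","))  # split each line into tokens
            .map(lambda token: int(token) * 2)      # convert to int and double
            .filter(lambda value: value > 10))      # keep only the large values
# Nothing has executed yet - 'pipeline' is just a recipe (the DAG)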
RDD Actions (Execute the Plan)
Actions trigger job execution:
rdd.collect()
rdd.count()
rdd.take(5)
rdd.saveAsTextFile("output")
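Continuing the sketch above, calling an action is what actually runs the DAG:
print(pipeline.count())   # triggers a job
print(pipeline.take(3))   # triggers another job; without caching, the lineage is re-run
pipeline.saveAsTextFile("output")  # writes one part-file per partition (fails if 'output' already exists)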
Narrow vs Wide Transformations
This is the MOST important concept for performance.
Narrow (No Shuffle)
Each output partition depends on only one input partition, so no data moves between partitions - fast (see the sketch after the examples below).
Examples:
- map
- filter
- union
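A quick way to see this (a sketch, reusing the spark session from earlier): narrow transformations keep the same number of partitions, because each output partition is built from exactly one input partition.
rdd = spark.sparkContext.parallelize(range(100), numSlices=8)
narrow = rdd.map(lambda x: x * 2).filter(lambda x: x > 50)
print(rdd.getNumPartitions())     # 8
print(narrow.getNumPartitions())  # still 8 - no data moved between partitions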
Wide (Shuffle Required)
Each output partition depends on data from multiple input partitions, so Spark must shuffle data across the network - slow
- Creates a new stage in the job
Examples:
- groupByKey
- join
- reduceByKey
Shuffles = major cause of slow Spark jobs.
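Here is a small word-count sketch showing a wide transformation: reduceByKey must shuffle the data so that all values for the same key land in the same partition, which starts a new stage.
words = spark.sparkContext.parallelize(["spark", "rdd", "spark", "shuffle", "rdd", "spark"])
pairs = words.map(lambda w: (w, 1))             # narrow: stays inside each partition
counts = pairs.reduceByKey(lambda a, b: a + b)  # wide: shuffles across partitions
print(counts.collect())  # e.g. [('spark', 3), ('rdd', 2), ('shuffle', 1)]
As a design note, reduceByKey is usually preferred over groupByKey because it combines values inside each partition before the shuffle, so far less data crosses the network.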
RDD Lineage: Fault Tolerance in Action
Each RDD tracks how it was created.
Example:
rdd1 = rdd.map(...)
rdd2 = rdd1.filter(...)
rdd3 = rdd2.reduceByKey(...)
If a node dies, Spark reconstructs data using lineage.
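You can inspect the lineage yourself with toDebugString (a sketch; in PySpark it returns bytes, so decode it before printing):
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
result = pairs.map(lambda kv: (kv[0], kv[1] * 10)).reduceByKey(lambda a, b: a + b)
print(result.toDebugString().decode())  # prints each parent RDD and the shuffle (stage) boundary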
Persistence and Caching
If you reuse an RDD:
processed = rdd.map(...)
processed.persist()
processed.count()
processed.filter(...)
Spark will NOT recompute it; the second action reads the cached partitions from memory.
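A fuller caching sketch (the explicit StorageLevel and the unpersist() call are optional, but good hygiene):
from pyspark import StorageLevel
raw = spark.sparkContext.parallelize(range(1_000_000))
processed = raw.map(lambda x: x * x)
processed.persist(StorageLevel.MEMORY_ONLY)  # only marks it; nothing is stored yet
print(processed.count())                     # first action computes AND caches the partitions
print(processed.filter(lambda x: x % 2 == 0).count())  # reuses the cached data, no recomputation
processed.unpersist()                        # release the memory when you are done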
Summary
Today we learned:
- What RDDs are
- Why they matter
- Transformations vs actions
- Narrow vs wide transformations
- Lineage
- Caching and persistence
These concepts form the foundation of Spark internals.
Follow for more such content. Let me know in the comments if I missed anything. Thank you!!