Welcome to Day 19 of the Spark Mastery Series.
Today we focus on memory, the most common reason Spark jobs fail in production.
Most Spark failures are not logic bugs - they are memory mistakes.
Broadcasting: The Right Way to Join Small Tables
Broadcast joins avoid the shuffle entirely and are extremely fast.
But broadcasting a table that is too large will crash your executors with out-of-memory errors.
Golden rule:
-> Broadcast only when the table is small and stable.
Spark broadcasts automatically when a table falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default), but an explicit broadcast hint gives you control.
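Here is a minimal sketch (the table names and paths are made up for illustration): an explicit broadcast() hint ships the small dimension table to every executor so the large table never has to shuffle.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

orders = spark.read.parquet("/data/orders")            # large fact table (hypothetical path)
dim_country = spark.read.parquet("/data/dim_country")  # small, stable dimension (a few MB)

# Explicit hint: copy dim_country to every executor and join without shuffling orders
joined = orders.join(broadcast(dim_country), on="country_id", how="left")

# Spark also broadcasts automatically below this threshold (default 10 MB); -1 disables it
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
```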
Caching: Powerful but Dangerous
Caching improves performance only when:
- The same DataFrame is reused
- The computation leading up to the cache point is expensive
Bad caching causes:
- Executor OOM
- GC thrashing
- Cluster instability
Always ask:
-> Will this DataFrame be reused?
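A small sketch of caching for the right reason (paths and column names are hypothetical, and a SparkSession named spark is assumed to exist): the DataFrame is expensive to compute and is reused by two downstream actions.

```python
from pyspark.sql import functions as F

# Expensive wide aggregation over a large input (hypothetical path)
daily_stats = (
    spark.read.parquet("/data/events")
         .groupBy("user_id", "event_date")
         .agg(F.count("*").alias("events"))
)

daily_stats.cache()
daily_stats.count()  # action that materializes the cache

# Two reuses: both read the cached result instead of recomputing the aggregation
active_users = daily_stats.filter("events > 10").count()
daily_totals = daily_stats.groupBy("event_date").agg(F.sum("events")).collect()
```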
Persist vs Cache: What to Use?
- cache(): shorthand for persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames)
- persist(MEMORY_AND_DISK): explicit storage level, production-safe
Use persist() for ETL pipelines.
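As a sketch (the input path is hypothetical and a SparkSession named spark is assumed), persist() with an explicit storage level keeps partitions in memory when possible and spills the rest to local disk instead of dropping and recomputing them:

```python
from pyspark import StorageLevel

df_bronze = spark.read.parquet("/data/bronze")  # hypothetical input

# persist() lets you choose the level explicitly; MEMORY_AND_DISK spills
# partitions that do not fit in memory to local disk
df_bronze.persist(StorageLevel.MEMORY_AND_DISK)
df_bronze.count()  # action to materialize the persisted partitions
```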
Spark Memory Internals
Spark prioritizes execution memory over cached data.
If Spark needs memory for shuffle:
- It evicts cached blocks
- Recomputes them later
This is why caching doesn't guarantee data stays in memory forever.
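This eviction behavior comes from Spark's unified memory manager: execution and storage share one region, and execution can evict cached blocks when it needs room. A sketch of the two settings that size this region (the values shown are the defaults):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("memory-demo")
        # Fraction of (heap - 300 MB reserved) shared by execution and storage
        .config("spark.memory.fraction", "0.6")
        # Part of that region where cached blocks are immune to eviction
        .config("spark.memory.storageFraction", "0.5")
        .getOrCreate()
)
```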
Real-World Example
Bad practice: caching everything "just in case" fills executor memory with data that may never be reused:

```python
df1.cache()   # none of these may be reused,
df2.cache()   # but they all compete for
df3.cache()   # executor storage memory
```

Good practice: persist a reused DataFrame with a disk fallback, then release it:

```python
from pyspark import StorageLevel

df_silver.persist(StorageLevel.MEMORY_AND_DISK)
df_silver.count()      # action to materialize the persisted data
# use df_silver multiple times
df_silver.unpersist()  # free the memory when finished
```
Summary
We learned:
- How broadcast joins work internally
- When to use and avoid broadcast
- Cache vs persist
- Storage levels
- Spark memory model
- How to avoid OOM errors
Follow for more such content. Let me know if I missed anything. Thank you!!