DEV Community

Discussion on: PySpark and Apache Spark Broadcast Mechanism

Shir Matishevski

Good one, thanks Adi! 👸
What's the benefit (if any) in using broadcasting over standard Spark caching?

Adi Polak

Thank you, Shir! 💖
Spark caching is built for caching Spark DataFrames or RDDs in memory.
Spark is a distributed computing framework that works on distributed data.
Each executor gets a chunk of the data to process: it loads the data into memory, processes it, and removes it from memory (unless there are optimizations or further known actions on that data).
We can ask Spark to explicitly cache that chunk of data in the executors' memory.
Caching means the data stays in dedicated memory until we call unpersist on it.

We can cache many RDDs or DataFrames, but in the executors one cached dataset has no direct access to another. If we exceed the available memory, the executor will spill the cached data to disk.
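
A minimal caching sketch, assuming an existing SparkSession named `spark`, a hypothetical transactions path, and a hypothetical `amount` column:

```python
# Assumes `spark` is an existing SparkSession; the path and column are hypothetical.
transactions = spark.read.parquet("/data/transactions")

transactions.cache()       # mark the DataFrame to be kept in executor memory
transactions.count()       # an action materializes the cache
transactions.filter(transactions.amount > 100).count()  # reuses the cached partitions

transactions.unpersist()   # release the dedicated memory when we're done
```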

With broadcast, we ship variables that we need, usually small ones such as a short list or a dictionary, that are used together with the DataFrames or RDDs in the computation.

For example:
A countries map and financial transactions: the countries data and locations do not change, which means static data. Transactions are new and arrive in streaming or batches. We can broadcast the countries map with the static data (assuming it fits into memory) and load the transactions into a DataFrame, either in batch or streaming.
In each transaction computation, we can enrich the record and make decisions based on the static data we have, the countries map.
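
A rough sketch of that pattern, assuming an existing SparkSession named `spark`, a hypothetical `country_code` column on the transactions, and a made-up countries dictionary:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical small, static lookup: country code -> region
countries = {"US": "North America", "DE": "Europe", "JP": "Asia"}

# Ship the dictionary once to every executor
countries_bc = spark.sparkContext.broadcast(countries)

@F.udf(returnType=StringType())
def region_for(country_code):
    # Each task reads the broadcast value locally, no shuffle needed
    return countries_bc.value.get(country_code, "Unknown")

# Batch load shown here; the same enrichment works with readStream
transactions = spark.read.parquet("/data/transactions")
enriched = transactions.withColumn("region", region_for("country_code"))
```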

With cache, we can cache the transaction data and re-process it, but not the static data, which is not an RDD or DataFrame. However, we might get blocked, all other transactions will take longer, or processing will be paused if we run out of memory.
We can turn the static data into an RDD or DataFrame if we would like to run actions on it specifically. Still, since it is relatively small data, it's better to do those calculations on the driver node and not 'pay' the overhead of distributed computing.
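
If the static data does get turned into a DataFrame, one option (just a sketch, reusing the same hypothetical names as above) is to mark the small side for a broadcast join, so each executor gets a full copy instead of shuffling the large transactions DataFrame:

```python
from pyspark.sql import functions as F

# Tiny DataFrame built from the same hypothetical static lookup
countries_df = spark.createDataFrame(
    [("US", "North America"), ("DE", "Europe"), ("JP", "Asia")],
    ["country_code", "region"],
)

# Broadcast join hint: copy the small side to every executor
enriched = transactions.join(
    F.broadcast(countries_df), on="country_code", how="left"
)
```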

Does that make sense?