
Tomoya Oda


Spark on AWS Glue: Performance Tuning 5 (Using Cache)

This post continues my series on Spark performance tuning on AWS Glue:

  1. Spark on AWS Glue: Performance Tuning 1 (CSV vs Parquet)

  2. Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)

  3. Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)

  4. Spark on AWS Glue: Performance Tuning 4 (Spark Join)

  5. Spark on AWS Glue: Performance Tuning 5 (Using Cache)

Using Cache

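Spark RDDs (and the DataFrames built on top of them) are re-computed from their lineage each time an action is performed on them. You can avoid this by using cache() or persist(), which keep the computed data around for reuse. cache() uses the default storage level, while persist() lets you choose one (memory, disk, or both).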

Comparison between using cache and no cache

Please note that cache() and persist() are transformations, not actions, so they are evaluated lazily.

https://kb.databricks.com/scala/best-practice-cache-count-take#:~:text=Since%20cache()%20is%20a,RDD%20in%20a%20single%20action

Since cache() is a transformation, the caching operation takes place only when a Spark action (for example, count(), show(), take(), or write()) is also used on the same DataFrame, Dataset, or RDD in a single action.

Let's try cache!

with timer('before cache'):
    part_df.select("backend_port").distinct().count()

part_df.cache()
part_df.count()  # count() is an action, so it triggers the lazy cache() and populates the cache
with timer('after cache'):
    part_df.select("backend_port").distinct().count()
[before cache] done in 4.5241 s
[after cache] done in 1.6293 s

It's faster with cache()!
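The timer helper used in the snippet above is not defined in this post. A minimal sketch of what it could look like, assuming it is a context manager that prints the elapsed wall-clock time in the same format as the output shown above:

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(name):
    """Print how long the body of the `with` block took to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the timed block raises, so the timing is always reported.
        elapsed = time.perf_counter() - start
        print(f"[{name}] done in {elapsed:.4f} s")


# Usage: wrap any block of code you want to time.
with timer("sleep"):
    time.sleep(0.1)
```

Because Spark is lazy, only actions such as count() do measurable work inside the timed block; wrapping a bare transformation would report almost no time.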

Summary

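  • RDDs are re-computed for each action, so caching makes repeated actions on the same data faster
  • Since cache() and persist() are transformations, they are evaluated lazily: run an action such as count() afterwards to actually populate the cache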
