DEV Community

Tomoya Oda
Tomoya Oda

Posted on • Edited on

Spark on AWS Glue: Performance Tuning 5 ( Using Cache)

This is a continuation of my previous posts as follows.

  1. Spark on AWS Glue: Performance Tuning 1 (CSV vs Parquet)

  2. Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)

  3. Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)

  4. Spark on AWS Glue: Performance Tuning 4 (Spark Join)

  5. Spark on AWS Glue: Performance Tuning 5 (Using Cache)

Using Cache

Spark RDDs are re-computed each time an action is performed on them. You can avoid this by using cache() or persist(), which keep the RDD in memory.

Comparison between using cache and no cache

Please note that cache() and persist() are transformations, not actions, so they are evaluated lazily.

https://kb.databricks.com/scala/best-practice-cache-count-take#:~:text=Since%20cache()%20is%20a,RDD%20in%20a%20single%20action

Since cache() is a transformation, the caching operation takes place only when a Spark action (for example, count(), show(), take(), or write()) is also used on the same DataFrame, Dataset, or RDD in a single action.

Let's try cache!

with timer('before cache'):
    part_df.select("backend_port").distinct().count()

part_df.cache()
part_df.count() # execute cache (cache is a transformation)
with timer('after cache'):
    part_df.select("backend_port").distinct().count()
Enter fullscreen mode Exit fullscreen mode
[before cache] done in 4.5241 s
[after cache] done in 1.6293 s
Enter fullscreen mode Exit fullscreen mode

It's faster with cache()!

Summary

  • RDDs are re-computed for each action, so caching makes them faster
  • Since cache() and persist() are transformations

Billboard image

Deploy and scale your apps on AWS and GCP with a world class developer experience

Coherence makes it easy to set up and maintain cloud infrastructure. Harness the extensibility, compliance and cost efficiency of the cloud.

Learn more

Top comments (0)

Billboard image

The Next Generation Developer Platform

Coherence is the first Platform-as-a-Service you can control. Unlike "black-box" platforms that are opinionated about the infra you can deploy, Coherence is powered by CNC, the open-source IaC framework, which offers limitless customization.

Learn more

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay