Spark on AWS Glue: Performance Tuning 5 (Using Cache)

#aws #glue #spark #performance

This post continues my series on performance tuning for Spark on AWS Glue.

Using Cache

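Spark RDDs are re-computed each time an action is performed on them. You can avoid this by using cache() or persist(), which keep the computed data around for reuse. For an RDD, cache() stores it in memory; persist() also accepts a storage level, for example one that spills to disk when memory runs out.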

Comparison between using cache and no cache

Please note that cache() and persist() are transformations, not actions, so they are evaluated lazily.

https://kb.databricks.com/scala/best-practice-cache-count-take#:~:text=Since%20cache()%20is%20a,RDD%20in%20a%20single%20action

Since cache() is a transformation, the caching operation takes place only when a Spark action (for example, count(), show(), take(), or write()) is also used on the same DataFrame, Dataset, or RDD in a single action.

Let's try cache!

with timer('before cache'):
    part_df.select("backend_port").distinct().count()

part_df.cache()  # lazy: only marks part_df for caching
part_df.count()  # action that actually populates the cache
with timer('after cache'):
    part_df.select("backend_port").distinct().count()

[before cache] done in 4.5241 s
[after cache] done in 1.6293 s

It's faster with cache()!
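The timer helper used in the snippet above is not defined in this post. A minimal sketch of such a helper, assuming it is a plain context manager that prints elapsed wall-clock time in the "[label] done in N s" format, could look like the following. This is a guess at its shape, not necessarily the implementation used for the measurements above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(name):
    """Print how long the enclosed block took to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"[{name}] done in {elapsed:.4f} s")


# Usage: wrap any block of code you want to time.
with timer("sleep"):
    time.sleep(0.1)
```

Because Spark is lazy, wrap an action such as count() in the timer; wrapping only a transformation would measure almost nothing.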

Summary

RDDs are re-computed for each action, so caching data that is reused makes later actions faster
Since cache() and persist() are transformations, the cache is only populated once an action runs on the data
