<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: anhcodes</title>
    <description>The latest articles on DEV Community by anhcodes (@anhcodes).</description>
    <link>https://dev.to/anhcodes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F789092%2F2532b51b-741a-4f3c-84e5-38a54915ce38.jpeg</url>
      <title>DEV Community: anhcodes</title>
      <link>https://dev.to/anhcodes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anhcodes"/>
    <language>en</language>
    <item>
      <title>Debug long running Spark job</title>
      <dc:creator>anhcodes</dc:creator>
      <pubDate>Wed, 31 May 2023 00:41:11 +0000</pubDate>
      <link>https://dev.to/anhcodes/debug-long-running-spark-job-5go4</link>
      <guid>https://dev.to/anhcodes/debug-long-running-spark-job-5go4</guid>
      <description>&lt;p&gt;You Spark job is running for a long time, what to do? Generally, long-running Spark jobs can be due to various factors. We like to call them the 5S - Spill, Skew, Shuffle, Storage, and Serialization. So, how do we identify the main culprit?&lt;/p&gt;

&lt;p&gt;🔎 Look for Skew: Are some of the tasks taking longer than others? Do you have a join operation?&lt;/p&gt;

&lt;p&gt;🔎 Look for Spill: Any out-of-memory errors? Do the executors have enough memory to finish their tasks? Got any disk spills?&lt;/p&gt;

&lt;p&gt;🔎 Look for Shuffle: Large amounts of data being shuffled across executors? Do you have a join operation?&lt;/p&gt;

&lt;p&gt;🔎 Look for Storage Issues: Do you have small files or highly nested directory structure?&lt;/p&gt;

&lt;p&gt;🔎 Look for inefficient Spark code, or serialization issues.&lt;/p&gt;

&lt;p&gt;Now, the solution may vary depending on the root cause. Most of the time, the root cause indicators will show up in the SparkUI, so it's important that you understand how to read it. &lt;/p&gt;

&lt;p&gt;In general, always try to cut down the time Spark takes to load data into memory, parallelize tasks across executors, and scale the memory according to data size. &lt;/p&gt;

&lt;h2&gt;
  
  
  Skew [imbalanced partitions]
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;TL;DR - Skew is caused by imbalanced partitions. To fix skew, try to repartition the data, or set a skew hint when reading data from disk into memory. If skew is caused by partition imbalance after the shuffle stage, enable the skew join option with Adaptive Query Execution and set the correct shuffle partition size. In most cases, turning on Adaptive Query Execution can help mitigate skew issues.&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;Let's talk about one of the most common issues you might encounter in Spark - "skew". Skew refers to an imbalance in partition sizes, which can also lead to a spill. When you read in data in Spark, it's typically divided into partitions of 128 MB, distributed evenly according to the &lt;code&gt;maxPartitionBytes&lt;/code&gt; setting. However, during transformations, Spark will need to shuffle your data, and some partitions may end up having significantly more data than others, creating skew in your data.&lt;/p&gt;

&lt;p&gt;When a partition is bigger than the others, the executor will take longer to process that partition and need more memory, which might eventually result in a spill to disk or Out Of Memory (OOM) errors. &lt;/p&gt;

&lt;h3&gt;
  
  
  How do you identify Skew in your SparkUI?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If you see long-running tasks and/or an uneven distribution of tasks in the Event Timeline of a long-running stage&lt;/li&gt;
&lt;li&gt;If you see uneven Shuffle Read/Write Size in a stage’s Summary Metrics&lt;/li&gt;
&lt;li&gt;Skew can cause Spill, so sometimes you will see Disk or Memory Spill as well. If the Spill is caused by Skew, you have to fix Skew as the root cause.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SL-3UqqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SL-3UqqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/1.png" alt="image" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6rzLNwAw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6rzLNwAw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/2.png" alt="image" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you fix Skew?
&lt;/h3&gt;

&lt;p&gt;If your disk spills or OOM errors are caused by skew, instead of solving for a RAM problem, solve for the uneven distribution of records across partitions.&lt;/p&gt;

&lt;p&gt;Option 1: If you run Spark on Databricks, use a skew hint (refer to &lt;a href="https://docs.databricks.com/optimizations/skew-join.html"&gt;Skew Join optimization&lt;/a&gt;). For example, assuming that you know the column used in the join is skewed, set the skew hint for that column. With this skew hint information, Spark can hopefully construct a better query plan for the join.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Set skew hint when loading the table
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trxPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"skew"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;join_column&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Option 2: If you use Spark 3.x, utilize Adaptive Query Execution (AQE). AQE can adapt query plans at runtime based on more accurate metrics, leading to more optimized execution. After a shuffle, AQE can automatically split skewed partitions into smaller ones, so that executors are not burdened by a few oversized partitions. This can lead to faster query execution and a more efficient use of cluster resources. &lt;strong&gt;Using AQE is highly recommended, and generally more effective than other options.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Turn AQE on
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'spark.sql.adaptive.enabled'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'spark.sql.adaptive.skewedJoin.enabled'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Option 3: Another option is to salt the skewed column with a random number to create more evenly distributed partitions, at the cost of extra processing. This is a more complex operation, which we can discuss in another post; a minimal sketch is shown below. &lt;/p&gt;
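
&lt;p&gt;Here is a rough salting sketch, assuming two existing DataFrames, a large skewed &lt;code&gt;transactions&lt;/code&gt; side and a small &lt;code&gt;customers&lt;/code&gt; side joined on &lt;code&gt;customer_id&lt;/code&gt;; the names and the salt factor are illustrative, not from the original example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import functions as F

SALT_BUCKETS = 8  # illustrative salt factor, tune to the observed skew

## Add a random salt to the large, skewed side
transactions_salted = transactions.withColumn(
    'salt', (F.rand() * SALT_BUCKETS).cast('int'))

## Explode the small side so every salt value has a matching row
customers_salted = customers.withColumn(
    'salt', F.explode(F.array(*[F.lit(i) for i in range(SALT_BUCKETS)])))

## Join on the original key plus the salt, then drop the helper column
joined = (transactions_salted
          .join(customers_salted, on=['customer_id', 'salt'], how='inner')
          .drop('salt'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;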

&lt;h2&gt;
  
  
  Spill [lack of memory]
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;TL;DR - Spill is caused by executors lacking memory to process their partitions. To fix spill, think about how you can add more memory to the executors or manage the partition sizes.&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;If your Spark executors don’t have enough local memory to process their allocated partitions, Spark has to spill the data to disk. Spill is Spark’s mechanism for moving data from local RAM to disk and then back into the executor’s RAM for further processing, with the goal of avoiding an out-of-memory (OOM) error. However, this can lead to expensive disk reads and writes, and significantly slow down the entire job. &lt;/p&gt;

&lt;p&gt;This process occurs when a partition becomes too large to fit into RAM, and it may be a result of skew in the data. Some potential causes of spill include setting &lt;code&gt;spark.sql.files.maxPartitionBytes&lt;/code&gt; too high, using &lt;code&gt;explode()&lt;/code&gt; on an array, performing a &lt;code&gt;join&lt;/code&gt; or &lt;code&gt;crossJoin&lt;/code&gt; of two tables, or aggregating results by a skewed column.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you identify Spill in your SparkUI?
&lt;/h3&gt;

&lt;p&gt;You can find Spill indicators in the SparkUI under each stage’s detail tab or in Aggregated Metrics by Executor. When data is spilled, both the size of the data in memory and on disk will be provided. Typically, the size on disk will be smaller due to compression that occurs when serializing data before it is written to disk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fE46srLK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fE46srLK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/3.png" alt="image" width="485" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RndX0pXo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RndX0pXo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/4.png" alt="image" width="800" height="92"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you fix Spill?
&lt;/h3&gt;

&lt;p&gt;To mitigate Spill issues, there are several actions you can take.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, make sure to fix the root cause of skew if this is the underlying issue behind the spill. To decrease partition sizes, increase the number of partitions or use explicit repartitioning.&lt;/li&gt;
&lt;li&gt;If the above doesn’t work, allocate a cluster with more memory per worker if each worker needs to process bigger partitions of data.&lt;/li&gt;
&lt;li&gt;Finally, you can adjust specific configuration settings using &lt;code&gt;spark.conf.set()&lt;/code&gt; to manage the size and number of partitions (see the sketch after this list).

&lt;ul&gt;
&lt;li&gt;manage &lt;code&gt;spark.conf.set('spark.sql.shuffle.partitions', {num_partitions})&lt;/code&gt; to reduce the amount of data in each partition shuffled across executors&lt;/li&gt;
&lt;li&gt;manage &lt;code&gt;spark.conf.set('spark.sql.files.maxPartitionBytes', {MB}*1024*1024)&lt;/code&gt; to reduce the size of partitions when Spark reads from disk into memory&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
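
&lt;p&gt;As a rough sketch of those configuration knobs (the partition counts, sizes, and column name below are illustrative, not prescriptive):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;## Increase the number of shuffle partitions so each one is smaller
spark.conf.set('spark.sql.shuffle.partitions', 400)

## Read smaller partitions from disk (64MB instead of the default 128MB)
spark.conf.set('spark.sql.files.maxPartitionBytes', 64 * 1024 * 1024)

## Or explicitly repartition a DataFrame before an expensive wide transformation
## ('join_key' is a hypothetical column name)
df = df.repartition(400, 'join_key')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;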

&lt;p&gt;It's worth noting that ignoring spill may not always be a good idea, as even a small percentage of spilling tasks can delay an entire job. Therefore, it's important to take note of spills and manage them proactively to enhance the performance of your Spark jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shuffles [data transfer]
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;TL;DR - Shuffle refers to the movement of data across executors, and it's inevitable with wide transformations. To tune shuffle operations, think about how you can reduce the amount of data that gets shuffled across the cluster network.&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X1-XAVg8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X1-XAVg8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/5.png" alt="image" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shuffle is the act of moving data between executors. If you have multiple data partitions in different executors on a Spark cluster, shuffle is necessary and inevitable. Most of the time, shuffle operations are actually quite fast. However, there are situations when shuffle can become the culprit slowing down your Spark job. Moving data across the network is slow, and the more data you have to move, the slower it gets. Moreover, incorrect shuffles can cause skew, and potentially spill. &lt;/p&gt;

&lt;h3&gt;
  
  
  How to identify Shuffle?
&lt;/h3&gt;

&lt;p&gt;If you use wide transformation (&lt;code&gt;distinct&lt;/code&gt;, &lt;code&gt;join&lt;/code&gt;, &lt;code&gt;groupBy/count&lt;/code&gt;, &lt;code&gt;orderBy&lt;/code&gt;, &lt;code&gt;crossJoin&lt;/code&gt;) in your Spark job and you have multiple executors in the Spark cluster, shuffle will most likely happen. &lt;/p&gt;

&lt;p&gt;In the SparkUI, you can find the Shuffle Read Size and Shuffle Write Size numbers in each stage's details.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to mitigate Shuffle?
&lt;/h3&gt;

&lt;p&gt;To reduce the impact of shuffle on your Spark job, try to reduce the amount of data you have to shuffle across the network.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Tune the Spark Cluster
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Use fewer and larger workers: You normally pay the same unit price for the same total number of cores and memory in your cluster, no matter the number of executors. So if you have jobs that require a lot of wide transformations, choose bigger instances and fewer workers. That way, fewer executors need to exchange data. &lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Limit shuffled data and tune the shuffle partitions
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Use predicate push down and/or narrow the columns to reduce the amount of data being shuffled&lt;/li&gt;
&lt;li&gt;Turn on Adaptive Query Execution (AQE) to dynamically coalesce shuffle partitions at runtime and avoid empty partitions
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sqSbMFk0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/debug-spark/aqe-shuffle.png" alt="image" width="800" height="504"&gt;
&lt;/li&gt;
&lt;li&gt;set &lt;code&gt;spark.conf.set('spark.sql.shuffle.partitions', {num_partitions})&lt;/code&gt; to control the number of partitions to be shuffled (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
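
&lt;p&gt;A minimal sketch of these ideas, assuming a hypothetical &lt;code&gt;events&lt;/code&gt; Delta table with &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, and &lt;code&gt;amount&lt;/code&gt; columns (the names and values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;## Select only the columns you need and filter early so the predicate can be
## pushed down to the file scan, reducing the data that reaches the shuffle
df = (spark.read.format('delta').load('path/to/events')
          .select('user_id', 'date', 'amount')
          .filter("date &amp;gt;= '2023-01-01'"))

## Let AQE coalesce small shuffle partitions at runtime
spark.conf.set('spark.sql.adaptive.enabled', True)
spark.conf.set('spark.sql.adaptive.coalescePartitions.enabled', True)

## Or set an explicit shuffle partition count yourself
spark.conf.set('spark.sql.shuffle.partitions', 200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;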

&lt;h4&gt;
  
  
  3. Try BroadcastHashJoin if possible
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Use BroadcastHashJoin if one table is less than 10MB. With BHJ, the smaller table will be broadcast to all executors, which can eliminate the shuffle of data (see the sketch after this list). 

&lt;ul&gt;
&lt;li&gt;Step 1: each executor will read in their assigned partitions from the bigger table&lt;/li&gt;
&lt;li&gt;Step 2: every partition of the broadcasted table is sent to the driver (therefore you want to make sure the broadcasted table is small enough to fit into the driver memory)&lt;/li&gt;
&lt;li&gt;Step 3: a copy of the entire broadcasted table is sent to each executor&lt;/li&gt;
&lt;li&gt;Step 4: each executor can do a standard join between the tables because it has a full copy of the broadcasted table, therefore the shuffle can be avoided in the join&lt;/li&gt;
&lt;li&gt;There are a few considerations when it comes to BroadcastHashJoin that you should be aware of, which we can discuss in another post&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;With Adaptive Query Execution in Spark 3.x, Spark will try to do a BHJ at runtime if one of the tables is small enough&lt;/li&gt;
&lt;/ul&gt;
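
&lt;p&gt;A minimal sketch of a broadcast join, assuming a large &lt;code&gt;transactions&lt;/code&gt; DataFrame and a small &lt;code&gt;countries&lt;/code&gt; dimension DataFrame (illustrative names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql.functions import broadcast

## Explicitly hint that the small table should be broadcast to every executor
joined = transactions.join(broadcast(countries), on='country_code', how='left')

## Spark can also broadcast automatically when a table is below this threshold
## (the default is 10MB)
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', 10 * 1024 * 1024)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;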

&lt;h4&gt;
  
  
  4. Bucketing
&lt;/h4&gt;

&lt;p&gt;Bucketing can be a useful technique to improve the efficiency of data processing by pre-sorting and aggregating data in the Join operation. The process involves dividing the data into N buckets, with both tables requiring the same number of buckets. This technique can be particularly effective for large datasets of several TBs, where data is queried and joined repeatedly.&lt;/p&gt;

&lt;p&gt;Bucketing should be performed by a skilled data engineer as part of the data preparation process. It's worth noting that bucketing is only worthwhile if the data is frequently queried and joined, and filtering does not improve the join operation.&lt;/p&gt;

&lt;p&gt;By using bucketing in the Join operation, the process of shuffling and exchanging data can be eliminated, resulting in a more efficient join. &lt;/p&gt;
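
&lt;p&gt;A minimal sketch of pre-bucketing two tables on the join key (the table and column names are illustrative; both sides must use the same number of buckets):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;## Write both tables bucketed (and optionally sorted) by the join key
(df_orders.write
    .bucketBy(64, 'customer_id')
    .sortBy('customer_id')
    .mode('overwrite')
    .saveAsTable('orders_bucketed'))

(df_customers.write
    .bucketBy(64, 'customer_id')
    .sortBy('customer_id')
    .mode('overwrite')
    .saveAsTable('customers_bucketed'))

## Later joins on customer_id can avoid the shuffle-and-sort step
joined = spark.table('orders_bucketed').join(
    spark.table('customers_bucketed'), 'customer_id')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;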

&lt;h2&gt;
  
  
  Storage [small files, scanning, inferring schema/schema evolution]
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;TL;DR - Storage issues are problems related to how data is stored on disk, which can lead to high overhead when ingesting data due to file open, read, and close operations. To fix storage issues, think about how you can reduce the number of files Spark has to open, read, and ingest from disk.&lt;/em&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  How to Identify storage issue?
&lt;/h3&gt;

&lt;p&gt;There are a few storage-related potential issues that can slow down operations in Spark data processing. One is the overhead of opening, reading, and closing many &lt;strong&gt;small files&lt;/strong&gt;. To address this, it's recommended to aim for files with a size of 1GB or larger, which can significantly reduce the time spent on these operations.&lt;/p&gt;

&lt;p&gt;Another is the &lt;strong&gt;directory scanning issue&lt;/strong&gt;, which can arise with &lt;strong&gt;highly partitioned&lt;/strong&gt; datasets that have multiple directories for each partition, requiring the driver to scan all of them on disk. For example, files in your storage are partitioned at a fine-grained level such as year-month-date-minute. &lt;/p&gt;

&lt;p&gt;A third issue involves &lt;strong&gt;schema operations&lt;/strong&gt;, such as inferring the schema for JSON and CSV files. This can be very costly, requiring a full read of the files to determine data types, even if you only need a subset of the data. By contrast, reading Parquet files typically only requires a one-time read of the schema, which is much faster. Therefore, it's recommended to use Parquet as the file storage format for Spark, since Parquet stores schema information in the file.&lt;/p&gt;

&lt;p&gt;However, schema evolution support, even with Parquet files, can also be expensive, as it requires reading all the files and merging the schema collectively. For this reason, starting from Spark 1.5, schema merging is turned off by default and needs to be turned on by setting the &lt;code&gt;spark.sql.parquet.mergeSchema&lt;/code&gt; option.&lt;/p&gt;
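
&lt;p&gt;For example, to turn schema merging back on, either per read or for the whole session (a small sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;## Merge schemas for a single read
df = spark.read.option('mergeSchema', 'true').parquet('path/to/parquet_files')

## Or enable schema merging globally for the session
spark.conf.set('spark.sql.parquet.mergeSchema', 'true')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;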

&lt;h3&gt;
  
  
  How to fix storage issue?
&lt;/h3&gt;

&lt;p&gt;To address the issue of tiny files in a storage location, if you are using Delta Lake, consider using &lt;code&gt;autoOptimize&lt;/code&gt; with the &lt;code&gt;optimizeWrite&lt;/code&gt; and &lt;code&gt;autoCompact&lt;/code&gt; features. These will automatically coalesce small files into larger ones; however, they have some subtle differences (see the sketch after the list below):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. Auto compaction only compacts files that haven’t been compacted previously.&lt;/li&gt;
&lt;li&gt;Optimized writes improve file size as data is written and benefit subsequent reads on the table. Optimized writes are most effective for partitioned tables, as they reduce the number of small files written to each partition.&lt;/li&gt;
&lt;/ul&gt;
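
&lt;p&gt;On Databricks, these features are typically enabled as Delta table properties; a small sketch (the table name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;## Enable optimized writes and auto compaction for an existing Delta table
spark.sql("""
  ALTER TABLE my_table SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  )
""")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;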

&lt;p&gt;To address the issue of slow performance when scanning directories, consider registering your data as tables in order to leverage the metastore to track the files in the storage location. This may have some initial overhead, but will benefit you in the long run by avoiding repeated directory scans.&lt;/p&gt;

&lt;p&gt;To address issues with schema operations, one option is to specify the schema when reading in non-Parquet file types, though this can be a time-consuming process. Alternatively, you can register tables so that the metastore can track the table schema. The best option is to use Delta Lake, which offers zero reads of the schema with a metastore and at most one read for schema evolution. &lt;a href="https://delta.io/"&gt;Delta Lake&lt;/a&gt; also provides other benefits such as ACID transactions, time travel, DML operations, schema evolution, etc. &lt;/p&gt;

&lt;h2&gt;
  
  
  Serialization [API, programming]
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;TL;DR - Bad code can also slow Spark down. Always try to use Spark SQL built-in functions and avoid UDFs when developing your Spark code. If UDFs are needed, try vectorized UDFs for Python and typed transformations for Scala.&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;Refer to my post about &lt;a href="https://anhcodes.dev/blog/spark-sql-programming/"&gt;SparkSQL Programming&lt;/a&gt; to understand the difference between built-in functions, UDFs and vectorized UDFs&lt;/p&gt;

&lt;p&gt;Slower Spark Jobs can sometimes occur as a result of suboptimal code. One example of this would be code segments that have not been reworked to support more efficient Spark operations.&lt;/p&gt;

&lt;p&gt;As a rule of thumb, always use &lt;code&gt;spark.sql.functions&lt;/code&gt; whenever possible, regardless of which language you're using. You can expect similar performance in both Python and Scala with these functions.&lt;/p&gt;

&lt;p&gt;In Scala, if you have to use user-defined functions that are not supported by the standard Spark functions, it's more efficient to use typed transformations instead of standard Scala UDFs. In general, Scala is more efficient than Python with UDFs and typed transformations.&lt;/p&gt;

&lt;p&gt;If you're working with Python, avoid Spark UDFs (even vectorized ones) if possible. But if you have to use a UDF, always use a &lt;a href="https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html"&gt;vectorized UDF&lt;/a&gt;. &lt;/p&gt;
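
&lt;p&gt;To illustrate the difference, here is a small sketch comparing a built-in expression with a vectorized (pandas) UDF for the same squaring operation, assuming a DataFrame &lt;code&gt;df&lt;/code&gt; with a numeric column &lt;code&gt;x&lt;/code&gt; (illustrative names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

## Preferred: a built-in expression, fully optimized by Catalyst
df_builtin = df.withColumn('x_squared', F.col('x') * F.col('x'))

## If a UDF is unavoidable, use a vectorized (pandas) UDF that operates
## on whole batches of rows instead of one row at a time
@pandas_udf('double')
def square(x: pd.Series) -&amp;gt; pd.Series:
    return x * x

df_vectorized = df.withColumn('x_squared', square(F.col('x')))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;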

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html"&gt;Spark SQL Performance Tuning&lt;/a&gt;&lt;/p&gt;

</description>
      <category>spark</category>
    </item>
    <item>
      <title>Spark working internals, and why should you care?</title>
      <dc:creator>anhcodes</dc:creator>
      <pubDate>Wed, 31 May 2023 00:40:06 +0000</pubDate>
      <link>https://dev.to/anhcodes/spark-working-internals-and-why-should-you-care-19bp</link>
      <guid>https://dev.to/anhcodes/spark-working-internals-and-why-should-you-care-19bp</guid>
      <description>&lt;p&gt;Most Big Data developers and Data Engineers start learning Spark by writing SparkSQL codes to perform ETL on DataFrame (I know I did). I also wrote a post about &lt;a href="https://anhcodes.dev/blog/spark-sql-programming/"&gt;SparkSQL Programming&lt;/a&gt;. However, we quickly learn that there’s more knowlege required to go from processing a few GBs of data to dealing with TBs and PBs of data, which is a challenge for big enterprises. Learning to write correct Spark codes is only a small part of the battle, you will need to understand the Spark Architecture and Spark working internals to correct tune Spark to handle true big data, and it’s the focus of this post. &lt;/p&gt;

&lt;h2&gt;
  
  
  Spark Architecture
&lt;/h2&gt;

&lt;p&gt;First, let this sink in: Spark is an &lt;strong&gt;in-memory&lt;/strong&gt;, &lt;strong&gt;parallel processing engine&lt;/strong&gt; that is very &lt;strong&gt;scalable.&lt;/strong&gt; The more data you have, the more powerful Spark becomes, which sets it apart from other processing engines. Spark is faster than the MapReduce paradigm because it processes data in memory, which means it can reduce the disk IO that normally slows down MapReduce jobs. Spark is also fast because of its ability to process data in parallel. &lt;/p&gt;

&lt;p&gt;Parallelism is a key enabler of Spark’s efficiency. The Spark architecture is designed so that you can add new computers to process a growing amount of big data in parallel. &lt;/p&gt;

&lt;h3&gt;
  
  
  Spark Cluster Components
&lt;/h3&gt;

&lt;p&gt;A Spark cluster has a driver and multiple workers (think computers). &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Spark driver (a JVM) is responsible for instantiating the Spark Session, turning Spark operations into DAGs, and scheduling and distributing tasks to the workers.&lt;/li&gt;
&lt;li&gt;Each worker has multiple cores (think threads) that can run multiple tasks. Each task is a single unit of work, each task maps to a single core and works on a single partition of data at a given time (1 task, 1 partition, 1 slot, 1 core)&lt;/li&gt;
&lt;li&gt;Besides these, we also have a &lt;strong&gt;cluster manager&lt;/strong&gt; and a &lt;strong&gt;Spark Session&lt;/strong&gt; that runs Spark applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XTO21sku--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XTO21sku--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/1.png" alt="image" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Spark Session
&lt;/h3&gt;

&lt;p&gt;SparkSession is the single point of entry to all DataFrame API functionality. SparkSession has been available since Spark 2.0; before that, SparkContext was used, with a limitation of only one SparkContext per JVM. SparkSession can unify numerous Spark contexts.&lt;/p&gt;

&lt;p&gt;A SparkSession is automatically created in a Databricks notebook as the variable &lt;code&gt;spark&lt;/code&gt;.&lt;/p&gt;
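
&lt;p&gt;Outside Databricks, you create it yourself; a minimal sketch (the application name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession

## Create (or reuse) a SparkSession as the entry point to the DataFrame API
spark = (SparkSession.builder
         .appName('my-app')
         .getOrCreate())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;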

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tcLvsfg4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tcLvsfg4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/2.png" alt="image" width="800" height="349"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In below code, the `spark` variable specifies a sparkSession
# spark.table reads a table to a dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;table&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a&amp;gt;1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# spark.read reads files to a dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'path/to/parquet'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# spark.sql execute sql queries on a table and save the result set to df
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'select * from &amp;lt;table&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Spark APIs
&lt;/h3&gt;

&lt;p&gt;The Spark ecosystem has 4 APIs: SparkSQL, Spark Structured Streaming, SparkML, and GraphX (I haven’t used this one before, so I’m not sure if it’s deprecated or not). Most Spark developers start with the SparkSQL API, doing ingestion and transformations on Spark DataFrames. However, Spark Structured Streaming and SparkML are pretty popular too, and we can discuss them in later posts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qAyhXcey--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qAyhXcey--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/1.png" alt="image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Computation in Spark
&lt;/h2&gt;

&lt;p&gt;Earlier I mentioned that the driver is responsible for turning operations into jobs or DAGs. DAGs are Directed Acyclic Graphs (a fancy term for graphs that have direction and no cycles). In a Spark execution plan, each &lt;strong&gt;job&lt;/strong&gt; is a DAG, each DAG can have one or multiple &lt;strong&gt;stages&lt;/strong&gt;, and each stage can have multiple &lt;strong&gt;tasks&lt;/strong&gt; (clear?)&lt;/p&gt;

&lt;p&gt;Spark parallelizes at 2 levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;splitting work among executors (or workers), which run the Spark code on the data partitions they hold&lt;/li&gt;
&lt;li&gt;each executor has a number of slots/cores, and each slot can execute a task on a data partition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another characteristic of Spark is lazy execution. When you specify transformations on a Spark DataFrame, Spark records the lineage and only starts the computation when an action is triggered (refer to my previous post about &lt;a href="https://anhcodes.dev/blog/spark-sql-programming/"&gt;SparkSQL programming&lt;/a&gt; for more information on transformations and actions)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZJtMtLfB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZJtMtLfB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/3.png" alt="image" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under the hood, SparkSQL uses the Spark Catalyst Optimizer to optimize query performance, similar to how a relational database or a data warehouse plans its queries. &lt;/p&gt;

&lt;h3&gt;
  
  
  Spark Catalyst Optimizer
&lt;/h3&gt;

&lt;p&gt;The Catalyst Optimizer is a component of Spark SQL that performs optimization on a query through 4 stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;analysis: create an abstract syntax tree of the query&lt;/li&gt;
&lt;li&gt;logical optimization: create logical plans and use a cost-based optimizer to assign costs to each plan&lt;/li&gt;
&lt;li&gt;physical planning: generate a physical plan based on the logical plan&lt;/li&gt;
&lt;li&gt;code generation: generate Java &lt;strong&gt;bytecode&lt;/strong&gt; to run on each machine; Spark SQL acts as a &lt;strong&gt;compiler&lt;/strong&gt;, and the Project Tungsten engine generates RDD code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Catalyst Optimizer is a rule-based engine that takes the Logical Plan and rewrites it as an optimized Physical Plan. The Physical Plan is developed BEFORE a query is executed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M0_hs1Mk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M0_hs1Mk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/4.png" alt="image" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To view the Catalyst Optimizer in action, use &lt;code&gt;df.explain(True)&lt;/code&gt; to view the Logical and Physical Execution plans of a query. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZRVvOn0H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZRVvOn0H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/5.png" alt="image" width="800" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Adaptive Query Execution
&lt;/h3&gt;

&lt;p&gt;In Spark 3.0, Adaptive Query Execution (AQE) was introduced. One difference between AQE and the Catalyst Optimizer is that AQE modifies the Physical Plan based on runtime statistics, so AQE can tune your queries further on the fly. You can think of AQE as complementary to the Catalyst Optimizer.&lt;/p&gt;

&lt;p&gt;For example, during runtime, based on new information that was not available during planning, AQE can decide to change your join strategy from Sort Merge Join to Broadcast Hash Join to reduce data shuffling. AQE can also coalesce your partitions to an optimal size during the shuffle stage, or help improve skew joins. &lt;/p&gt;

&lt;p&gt;This option is not turned on by default in Spark; you can enable it with &lt;code&gt;spark.conf.set('spark.sql.adaptive.enabled', True)&lt;/code&gt;, and it’s recommended to turn it on. However, if you run Spark on a later version of the Databricks Runtime, AQE is enabled by default. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--75XiwhOb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--75XiwhOb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/6.png" alt="image" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Shuffle, Partitioning and Caching in Spark
&lt;/h2&gt;

&lt;p&gt;We established that Spark processes data in parallel by splitting up data into &lt;strong&gt;partitions&lt;/strong&gt; and &lt;strong&gt;moving (shuffling)&lt;/strong&gt; them to executors so that each can run a task on a small subset of data in &lt;strong&gt;memory.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shuffling, partitioning,&lt;/strong&gt; and &lt;strong&gt;memory&lt;/strong&gt; can potentially dictate Spark performance. So if you understand these terms in depth, debugging Spark becomes much easier, which I explain further in another post about &lt;a href="https://anhcodes.dev/blog/debug-spark/"&gt;Debugging Long Running Spark Job&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To process your data, Spark will first have to ingest files from disk into memory, and by default it reads data into partitions of 128MB. If there’s any wide transformation on the DataFrame, Spark needs to repartition the data and move partitions to cores for processing. The implication of this is that each partition has to fit into the core’s memory, or you will have spill or OOM errors. If partitions are not evenly distributed, you can have skew (which means some executors have more work than others). Correctly tuning partitions at ingestion and at the shuffle stage can help improve your Spark jobs. &lt;/p&gt;

&lt;h3&gt;
  
  
  Partitioning
&lt;/h3&gt;

&lt;p&gt;There are 2 types of partitioning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spark Partition: partition in flight (in RAM)&lt;/li&gt;
&lt;li&gt;Disk Partition (Hive partition): partition at rest (on disk)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When Spark reads data from disk into memory (a dataframe), the initial partitioning of the dataframe (in MEMORY) is determined by the number of cores (the default level of parallelism), the dataset size, the &lt;code&gt;spark.sql.files.maxPartitionBytes&lt;/code&gt; config, and &lt;code&gt;spark.sql.files.openCostInBytes&lt;/code&gt; (default 4MB, the overhead of opening a file). Remember that this is the size of the partition in memory, unrelated to what it is on disk. &lt;/p&gt;

&lt;p&gt;Check the number of partitions in a DataFrame ingested from disk into memory with &lt;code&gt;df.rdd.getNumPartitions()&lt;/code&gt;. We can estimate the size of the dataframe in memory by multiplying the number of partitions in memory by the partition size. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By default, each partition has a size of 128MB, but you can change this with &lt;code&gt;spark.sql.files.maxPartitionBytes&lt;/code&gt;. A situation where setting this config can be beneficial is writing data to 1GB part files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r46OwSda--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r46OwSda--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/7.png" alt="image" width="800" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don’t allow partition size to increase beyond 200MB per 8GB of total core memory; if it does, increase the number of partitions. It’s better to have many small partitions than too few large partitions.&lt;/li&gt;
&lt;li&gt;It’s best to tune the number of partitions so it is at least a multiple of the number of cores in your cluster. This allows for better parallelism. Run &lt;code&gt;df.rdd.getNumPartitions()&lt;/code&gt; to check the number of partitions in memory.&lt;/li&gt;
&lt;li&gt;If you want to change the partition count at runtime, you can run &lt;code&gt;coalesce()&lt;/code&gt; and &lt;code&gt;repartition()&lt;/code&gt; (see the sketch after this list). Coalesce can only reduce the number of partitions and increase partition size, but as a &lt;strong&gt;narrow transformation&lt;/strong&gt; with no shuffling, coalesce is more efficient than repartition. Repartition returns a new DataFrame with exactly N partitions of even size. It can increase or decrease your partition count, but it requires expensive data shuffling&lt;/li&gt;
&lt;/ul&gt;
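
&lt;p&gt;A small sketch of both calls (the partition counts are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;## Check the current number of in-memory partitions
print(df.rdd.getNumPartitions())

## Reduce the partition count without a shuffle (narrow transformation)
df_small = df.coalesce(8)

## Increase (or evenly rebalance) the partition count with a full shuffle
df_even = df.repartition(64)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;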

&lt;h3&gt;
  
  
  Shuffle
&lt;/h3&gt;

&lt;p&gt;Shuffle is one of the most expensive operations in Spark. In every wide transformation (for example a &lt;code&gt;groupBy&lt;/code&gt;), a shuffle creates multiple stages: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first stage creates shuffle files (shuffle write)&lt;/li&gt;
&lt;li&gt;Subsequent stages reuse those shuffle files (shuffle read)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y5t-WtJf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y5t-WtJf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/8.png" alt="image" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If cache is used, the first stage can create shuffle files and cache the results, and later stages can read from the cache, which improves performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GpuvamEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GpuvamEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/9.png" alt="image" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BNk1lTCL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BNk1lTCL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/spark/10.png" alt="image" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The issues with shuffle partitions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too many partitions and you may have empty or very small partitions, which puts pressure on the driver. This issue can be solved by enabling Adaptive Query Execution (AQE) as explained above&lt;/li&gt;
&lt;li&gt;Too few partitions and big partitions can cause spill or OOM. Correctly setting &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt; based on the rule in the partitioning section can help. This setting indicates how many partitions Spark will create for the next stage, and it MUST be managed by the user for every job.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Besides these, there are a few techniques to mitigate excessive shuffles in my previous post &lt;a href="https://anhcodes.dev/blog/debug-spark/"&gt;Debugging Long Running Spark Job&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache in Spark
&lt;/h3&gt;

&lt;p&gt;By default, data in a DataFrame is only present on the Spark cluster while it is being processed during a query; it won’t be persisted on the cluster afterwards. However, you can explicitly request Spark to persist a DataFrame on the cluster by invoking &lt;code&gt;df.cache&lt;/code&gt;. Cache can store as many partitions of the dataframe as the cluster memory allows.&lt;/p&gt;

&lt;p&gt;Note that cache is another type of persist: &lt;code&gt;df.cache&lt;/code&gt; is &lt;code&gt;df.persist(StorageLevel.MEMORY_AND_DISK)&lt;/code&gt;. This stores partitions in memory and spills excess to disk.&lt;/p&gt;

&lt;p&gt;Cache should be used with care because caching consumes cluster resources that could otherwise be used for other executions, and it can prevent Spark from performing query optimization. You should only use cache in the situations below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DataFrames frequently used during Exploratory Data Analysis, iterative machine learning training in a Spark session&lt;/li&gt;
&lt;li&gt;DataFrames accessed commonly for doing frequent transformations during ETL or building data pipelines&lt;/li&gt;
&lt;li&gt;Don’t use when data is too big to fit in memory, or only need infrequent transformation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you use cache() or persist(), the DataFrame is not fully cached until you invoke an action that goes through every record (e.g., count()). If you use an action like take(1), only one partition will be cached, because Catalyst realizes that you do not need to compute all the partitions just to retrieve one record.&lt;/p&gt;

&lt;p&gt;Don’t forget to cleanup with &lt;code&gt;df.unpersist&lt;/code&gt; to evict the dataframe from cache when you no longer need it.&lt;/p&gt;
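
&lt;p&gt;Putting it together, a small sketch of the cache lifecycle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;## Mark the DataFrame for caching (lazy - nothing is stored yet)
df.cache()

## Trigger an action that touches every record so all partitions are materialized
df.count()

## ... reuse df in several queries or transformations ...

## Release the cached partitions when you no longer need them
df.unpersist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;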

</description>
      <category>spark</category>
    </item>
    <item>
      <title>Spark SQL Programming Primer</title>
      <dc:creator>anhcodes</dc:creator>
      <pubDate>Tue, 30 May 2023 19:30:38 +0000</pubDate>
      <link>https://dev.to/anhcodes/spark-sql-programming-primer-4k4h</link>
      <guid>https://dev.to/anhcodes/spark-sql-programming-primer-4k4h</guid>
      <description>&lt;p&gt;&lt;em&gt;TL,DR - SparkSQL is a huge component of Spark Programming. This post introduces programming in SparkSQL through Spark DataFrame API. It's important to be aware of Spark SQL built-in functions to be a more efficient Spark programmer&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is SparkSQL
&lt;/h2&gt;

&lt;p&gt;SparkSQL is one of the 4 APIs in the Spark ecosystem. SparkSQL provides structured data processing with interfaces such as SQL or the DataFrame API, using the Python, Scala, R, and Java programming languages&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qAyhXcey--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qAyhXcey--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/1.png" alt="image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same SparkSQL query can be expressed with SQL and with the DataFrame API. SQL queries, Python DataFrame queries, and Scala DataFrame queries are all executed on the same engine. The queries go through query plans, then RDDs, then execution. SparkSQL always optimizes the queries before execution using the &lt;strong&gt;Catalyst Optimizer&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- sql&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## python
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;table&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a&amp;gt;1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dTHgpiwP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dTHgpiwP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/2.png" alt="image" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  DataFrame API in SparkSQL
&lt;/h2&gt;

&lt;p&gt;A DataFrame is an immutable collection of data grouped into named columns. A schema defines the column names and data types of a DataFrame. &lt;/p&gt;

&lt;h3&gt;
  
  
  Read and write data with Spark DataFrame
&lt;/h3&gt;

&lt;p&gt;You can read data from almost all file formats, such as CSV, JSON, Parquet, Delta, etc., into a Spark DataFrame. &lt;/p&gt;

&lt;p&gt;You can either choose to infer the schema from the files (expensive with JSON and CSV), or specify the schema explicitly (more efficient)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Read data from parquet files to dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'path/to/parquet_files'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'inferSchema'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## read data from csv specifying separator, header and schema
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'path/to/csv_files'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Read data from json files to dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'path/to/json_files'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Option 1: read data from file with schema specified as StructType
&lt;/span&gt;&lt;span class="n"&gt;sparkSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'col1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'col2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'path/to/csv_files'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sparkSchema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Option 2: read data from file with schema specified as DDL syntax
&lt;/span&gt;&lt;span class="n"&gt;ddlSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"col1 string, col2 integer"&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'path/to/csv_files'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ddlSchema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Write dataframe to parquet files
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'compression'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'snappy'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'overwrite'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'path/to/storage'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Write data to table
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'overwrite'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;table_name&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## write dataframe to Delta, default Parquet format
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'delta'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'overwrite'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'outputPath'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Columns in DF
&lt;/h3&gt;

&lt;p&gt;There are many ways to reference a column in a DataFrame, depending on which language API you use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Multi ways of extracting columns from Spark DF 
## Python
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'columnName'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columnName&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;spark.sql.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'columnName'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'columnName.field'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;##nested column array
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Scala&lt;/span&gt;
&lt;span class="nf"&gt;df&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"columnName"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.functions.col&lt;/span&gt;
&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"columnName"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"columnName"&lt;/span&gt;
&lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"columnName.field"&lt;/span&gt; &lt;span class="c1"&gt;//nested column array&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Column Operators &amp;amp; Methods&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tnv7KZRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tnv7KZRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/3.png" alt="image" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LzsKMgZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LzsKMgZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/4.png" alt="image" width="800" height="430"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
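
&lt;p&gt;For example, here is a minimal sketch of a few of these operators and methods in use (the column names &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, and &lt;code&gt;device&lt;/code&gt; are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql.functions import col

## arithmetic and comparison operators build new Column expressions
df.withColumn("total", col("price") * col("quantity"))
df.filter(col("price") &amp;gt; 100)

## common Column methods: cast, alias, isin, contains
df.select(col("price").cast("double").alias("price_double"))
df.filter(col("device").isin("macOS", "iOS"))
df.filter(col("device").contains("mac"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;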

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.functions.col&lt;/span&gt;

&lt;span class="c1"&gt;## These are chained transformations
&lt;/span&gt;&lt;span class="n"&gt;new_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"colA"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"colB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"colA"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"colB"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;## transformations with selectExp
&lt;/span&gt;&lt;span class="n"&gt;appleDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eventsDf&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selectExpr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"device in ('macOS', 'iOS') as apple_user"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## transformation with regular PythonAPI
&lt;/span&gt;&lt;span class="n"&gt;appleDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eventsDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"apple_user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"device"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'macOS'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'iOS'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rows in DF
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EFltSKVP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EFltSKVP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/5.png" alt="image" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GTzLVUld--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GTzLVUld--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/6.png" alt="image" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Operations in Spark DataFrame
&lt;/h3&gt;

&lt;p&gt;There are 2 main types of operations you can perform on a Spark DataFrame. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;transformations&lt;/strong&gt; (&lt;code&gt;select&lt;/code&gt;, &lt;code&gt;where&lt;/code&gt;, &lt;code&gt;orderBy&lt;/code&gt;, &lt;code&gt;groupBy&lt;/code&gt;): Remember that a DataFrame is immutable, so each transformation creates a new DataFrame. Transformations are evaluated lazily: they are not executed immediately but recorded as lineage until an action is invoked or the data is touched. There are 2 types of Spark transformations, listed below.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* **narrow transformation**: single input partition computes single output partition (each column are computed separately), **without exchange of data** (such as `filter`, `contains`). 

* **wide transformation**: data from many partitions read, combined and written to disk (`groupBy`, `orderBy`, `count`), which causes **shuffle of data** across partitions 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;actions&lt;/strong&gt; (&lt;code&gt;show&lt;/code&gt;, &lt;code&gt;display&lt;/code&gt;, &lt;code&gt;take&lt;/code&gt;, &lt;code&gt;describe&lt;/code&gt;, &lt;code&gt;summary&lt;/code&gt;, &lt;code&gt;first&lt;/code&gt;, &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;, &lt;code&gt;collect&lt;/code&gt;): trigger the lazy evaluation of the recorded transformations&lt;/p&gt;

&lt;p&gt;&lt;code&gt;count&lt;/code&gt; vs &lt;code&gt;collect&lt;/code&gt;: &lt;code&gt;count&lt;/code&gt; returns a single number to the driver, while &lt;code&gt;collect&lt;/code&gt; returns a collection of Row objects to the driver (expensive and can cause out-of-memory errors)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QA80zycU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QA80zycU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/7.png" alt="image" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember that when you specify transformations, your Spark code is not executed until you call an action. Lazy evaluation also provides fault tolerance: Spark records the transformation lineage, so it can restart the job from that lineage if there is a failure.&lt;/p&gt;
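
&lt;p&gt;Here is a minimal sketch of that behavior (the column name is made up for illustration): the transformations below only build up the lineage, and nothing runs on the cluster until the action at the end is called.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql.functions import col

## transformations only: nothing is executed yet, Spark just records the lineage
filtered = df.filter(col("colA").isNotNull())    ## narrow transformation
grouped = filtered.groupBy("colA").count()       ## wide transformation (shuffle)

## inspect the recorded plan without executing it
grouped.explain()

## the action triggers execution of the whole recorded lineage
grouped.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;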

&lt;h2&gt;
  
  
  SparkSQL Built-in Functions
&lt;/h2&gt;

&lt;p&gt;You can use the built-in functions from &lt;code&gt;pyspark.sql.functions&lt;/code&gt; for Python and &lt;code&gt;org.apache.spark.sql.functions&lt;/code&gt; for Scala. Refer to the &lt;a href="https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html"&gt;spark sql built-in functions&lt;/a&gt; reference. Built-in functions are highly efficient and a best practice for Spark programming. It's highly recommended to use built-in functions before attempting to create your own UDFs (User Defined Functions).&lt;/p&gt;
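
&lt;p&gt;As a quick illustration of why (a minimal sketch; the &lt;code&gt;name&lt;/code&gt; column is made up), the same result can often be produced either with a Python UDF or with a built-in function, and the built-in version stays entirely inside the optimized engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pyspark.sql.functions as F
from pyspark.sql.functions import udf

## UDF version: the Python function runs row by row, outside the Catalyst Optimizer
upper_udf = udf(lambda s: s.upper() if s is not None else None, "string")
df.withColumn("name_upper", upper_udf(F.col("name")))

## built-in version: F.upper is optimized by Catalyst, no Python serialization needed
df.withColumn("name_upper", F.upper(F.col("name")))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;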

&lt;h3&gt;
  
  
  1. Aggregation functions
&lt;/h3&gt;

&lt;p&gt;All aggregation methods require a &lt;code&gt;groupBy&lt;/code&gt; call, which returns a GroupedData object&lt;/p&gt;

&lt;p&gt;Use the grouped data method &lt;code&gt;agg&lt;/code&gt; to apply these built-in aggregate functions&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dkO_D1da--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dkO_D1da--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/8.png" alt="image" width="762" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"col1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"col1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"col2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"val1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"val2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'col1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'val1'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'total1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
                &lt;span class="n"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'val2'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'average2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'col1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;sumDistinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'val1'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'total1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
                &lt;span class="n"&gt;approx_count_distinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'val2'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'count2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Datetime functions
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9y25tr9M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9y25tr9M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/9.png" alt="image" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reformat the timestamp column to string representation
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"date_string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"MMMM dd, yyyy"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"time_string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"HH:mm:ss.SSSSSS"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Extract date time parts from timestamp
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'year'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'timestamp'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'timestamp'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'dayofweek'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'timestamp'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'minute'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'timestamp'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'second'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'timestamp'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Convert timestamp to date
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Manipulate datetimes
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"add_2_day"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Complex Data Types functions
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--imS920ym--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--imS920ym--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/10.png" alt="image" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RTmpMKC2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RTmpMKC2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/anhhchu/anhcodes/master/static/images/inpost/sparksql/11.png" alt="image" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assume we have a DataFrame with an &lt;code&gt;items&lt;/code&gt; column that is a nested array. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## For example
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;
&lt;span class="c1"&gt;## explode the items field to create a new row for each element in the array
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;explode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;## split column item_name by " " to array
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"item_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="c1"&gt;## extract the element from details
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;element_at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
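
&lt;p&gt;For context, a DataFrame with a nested array column could be constructed like this minimal sketch (the schema and values are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;## hypothetical sample data with a nested "items" array column
sample = [
    ("order1", [("item A large", 10.0), ("item B small", 5.0)]),
    ("order2", [("item C medium", 7.5)]),
]

df = spark.createDataFrame(
    sample,
    "order_id string, items array&amp;lt;struct&amp;lt;item_name:string, price:double&amp;gt;&amp;gt;"
)
df.printSchema()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;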



&lt;h3&gt;
  
  
  4. Join functions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## inner join 
&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## inner join with 2 columns
&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'age'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;## specify join
&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'left'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'right'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'outer'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## specify explicit column expressiion
&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'customer_name'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'user_name'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;'left_outer'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  User Defined Functions (UDF) in Spark
&lt;/h2&gt;

&lt;p&gt;If built-in functions are not enough to cover your needs, you can write your own custom functions, at an efficiency cost. &lt;/p&gt;

&lt;p&gt;User-defined functions can't be optimized by the Catalyst Optimizer, and they must be serialized and sent to the executors. Moreover, row data is deserialized from Spark's binary format to be passed to the UDF, and the results are then serialized back into Spark's native format. For Python, UDFs also add overhead from the Python interpreter running on each worker node. &lt;/p&gt;

&lt;p&gt;Using UDFs can cause &lt;a href="https://anhcodes.dev/blog/mitigate-skew-spark/##serialization-api-coding-style"&gt;serialization&lt;/a&gt; issues and long-running Spark jobs. &lt;/p&gt;

&lt;p&gt;A way to mitigate this is to use &lt;a href="https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html"&gt;Pandas UDFs, aka Vectorized UDFs&lt;/a&gt;, which use Apache Arrow, in Spark 3.x. &lt;/p&gt;

&lt;p&gt;To create a UDF, you can follow the steps below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Step 1: Create a function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calProfit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;

&lt;span class="c1"&gt;## Step 2: Register function -&amp;gt; serialize the function and send to executors
&lt;/span&gt;&lt;span class="n"&gt;calProfitUDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calProfit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Step 3: Apply the udf to the dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"profit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calProfitUDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sales"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="c1"&gt;## Register UDF to use in SQL
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'sales'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;calProfitUDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;park&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sql_udf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calProfit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Use the UDF in sql&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;sq&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;sql_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;profit&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can use decorator syntax (only applicable in Python)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Use Decorator Syntax for Python
## Our input/output is float
&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"float"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calProfitUDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;

&lt;span class="c1"&gt;## use the UDF
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"profit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calProfitUDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sales"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's recommended to use Pandas/Vectorized UDFs; notice the difference in syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas_udf&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;pandas_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"float"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;vectorizedUDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;

&lt;span class="c1"&gt;## use the UDF
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"profit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vectorizedUDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sales"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="c1"&gt;## register the UDF for sql 
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sql_vectorized_udf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vectorizedUDF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;sql_vectorized_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;profit&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>spark</category>
    </item>
  </channel>
</rss>
