<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Labinot Vila</title>
    <description>The latest articles on DEV Community by Labinot Vila (@labinotvila).</description>
    <link>https://dev.to/labinotvila</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1350222%2F6d93f849-21ff-488c-a956-a3a0c2a0b985.png</url>
      <title>DEV Community: Labinot Vila</title>
      <link>https://dev.to/labinotvila</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/labinotvila"/>
    <language>en</language>
    <item>
      <title>Spark Associate Developer Certification Guide</title>
      <dc:creator>Labinot Vila</dc:creator>
      <pubDate>Tue, 19 Mar 2024 01:13:42 +0000</pubDate>
      <link>https://dev.to/labinotvila/spark-associate-developer-certification-guide-5fgj</link>
      <guid>https://dev.to/labinotvila/spark-associate-developer-certification-guide-5fgj</guid>
      <description>&lt;p&gt;This content is all about what is needed to pass the &lt;code&gt;Databricks: Spark Associate Developer&lt;/code&gt; exam.&lt;/p&gt;

&lt;h4&gt;Books&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/"&gt;Spark: The Definitive Guide&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/"&gt;Learning Spark: 2nd Edition&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.databricks.com/resources/ebook/the-data-engineers-guide-to-apache-spark-and-delta-lake"&gt;The Data Engineering's Guide to Apache Spark&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Lectures&lt;/h4&gt;

&lt;h5&gt;YouTube&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=7ooZ4S7Ay6Y&amp;amp;ab_channel=SparkSummit"&gt;Advanced Apache Spark Training - Sameer Farooqui (Databricks)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=daXEp4HmS-E&amp;amp;t=7s&amp;amp;ab_channel=Databricks"&gt;Apache Spark Core—Deep Dive&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;Udemy&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://www.udemy.com/course/apache-spark-3-beyond-basics/?couponCode=KEEPLEARNING"&gt;Apache Spark 3 - Beyond Basics&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.udemy.com/course/apache-spark-3-databricks-certified-associate-developer/?couponCode=KEEPLEARNING"&gt;Apache Spark 3 - Databricks Certified&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Exams&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.udemy.com/course/databricks-apache-spark-dev-certification-tests-scala/?couponCode=KEEPLEARNING"&gt;Databricks Apache Spark 3.0 Dev Certification - Tests(Scala)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.udemy.com/course/databricks-certified-apache-spark-3-tests-scala-python/?couponCode=KEEPLEARNING"&gt;Databricks Certified Apache Spark 3.0 TESTS (Scala &amp;amp; Python)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.udemy.com/course/databricks-certified-developer-for-apache-spark-30-practice-exams/?couponCode=KEEPLEARNING"&gt;Databricks Certified Developer for Spark 3.0 Practice Exams&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;PDF Exams&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://files.training.databricks.com/assessments/practice-exams/PracticeExam-DCADAS3-Python.pdf"&gt;Databricks Certified Developer for Spark 3.0 Practice Exams&lt;br&gt;
PDF Exams&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.dumpsbase.com/freedumps/2022-real-databricks-certified-associate-developer-for-apache-spark-3-0-exam-dumps-dumpsbase.html"&gt;More Demo Dumps&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Topics covered on the exam&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;When does a Spark application fail? (when executor fails, when driver fails, when data is not fully cached, etc.)&lt;/li&gt;
&lt;li&gt;What is the most granular unit in the Spark hierarchy? (jobs, stages, tasks, etc.)&lt;/li&gt;
&lt;li&gt;What does NOT help in optimizing a Spark application? (related to partitions, column merging, etc.)&lt;/li&gt;
&lt;li&gt;What happens if there are more slots than tasks to process in a worker node? (resources are not fully utilized, etc.)&lt;/li&gt;
&lt;li&gt;What is a task? (a unit of work that can fit into an executor, a unit of work that can fit into a machine, etc.)&lt;/li&gt;
&lt;li&gt;What is a job?&lt;/li&gt;
&lt;li&gt;What is the difference between actions and transformations?&lt;/li&gt;
&lt;li&gt;Which of the Dataset API methods is most likely to invoke a shuffle? (union, groupBy, filter, etc.)&lt;/li&gt;
&lt;li&gt;What fraction of the dataframe will the following code cache? (a .show() is called on a Scala range)&lt;/li&gt;
&lt;li&gt;How many jobs will the following code create? (a dataframe read with schema inference)&lt;/li&gt;
&lt;li&gt;A wide transformation exchanges data between which units? (partitions, executors, clusters, etc.)&lt;/li&gt;
&lt;li&gt;We want to generate 25 partitions after a join: which configuration should be used?&lt;/li&gt;
&lt;li&gt;What are valid Spark deployment modes? (YARN, Local, Standalone, etc.)&lt;/li&gt;
&lt;li&gt;Which of the options helps with garbage collection? (increasing Java heap space, serialization or deserialization, etc.)&lt;/li&gt;
&lt;li&gt;Dataset API Questions&lt;/li&gt;
&lt;li&gt;Split function&lt;/li&gt;
&lt;li&gt;Explode function&lt;/li&gt;
&lt;li&gt;Joins (inner, left, crossJoin and anti)&lt;/li&gt;
&lt;li&gt;Renaming column&lt;/li&gt;
&lt;li&gt;Overwriting column&lt;/li&gt;
&lt;li&gt;Filtering with multiple conditions&lt;/li&gt;
&lt;li&gt;The difference between using where and using filter&lt;/li&gt;
&lt;li&gt;Date and time manipulation (to and from unix, formatting, etc.)&lt;/li&gt;
&lt;li&gt;Sorting asc and desc with and without nulls&lt;/li&gt;
&lt;li&gt;Literals&lt;/li&gt;
&lt;li&gt;Repartition and coalesce (more than 2 questions)&lt;/li&gt;
&lt;li&gt;UDFs&lt;/li&gt;
&lt;li&gt;Window/ranking functions (rank and dense_rank)&lt;/li&gt;
&lt;li&gt;Printing schema&lt;/li&gt;
&lt;li&gt;Finding transformations and actions&lt;/li&gt;
&lt;li&gt;Collecting a dataset, extracting values and casting&lt;/li&gt;
&lt;li&gt;Casting columns of a dataset&lt;/li&gt;
&lt;li&gt;Dataset Reading and Writing&lt;/li&gt;
&lt;li&gt;Reading a raw CSV file&lt;/li&gt;
&lt;li&gt;Reading a CSV file with schema and with separators&lt;/li&gt;
&lt;li&gt;Read and write modes&lt;/li&gt;
&lt;li&gt;Writing and overwriting a parquet&lt;/li&gt;
&lt;li&gt;Partitioning by a column and writing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do not rely on online documentation!&lt;/p&gt;

</description>
      <category>spark</category>
      <category>certification</category>
      <category>databricks</category>
      <category>developer</category>
    </item>
  </channel>
</rss>
