MoRoth

Posted on Sep 28, 2020 • Edited on Oct 25, 2021

Spark Journey begins...

#spark #python #bigdata

After several years of working as an engineer on backend and frontend projects, big data feels like the natural next step.
In the big data world I expect to find computing, IO and scaling challenges that don't usually come up in ordinary, textbook architectures.

I decided that Spark is the best way to get started, and specifically the Databricks certification, which focuses on Spark programming and architecture.

My game plan to pass the Databricks Spark certification is to:

Read the book "Learning Spark: Lightning-Fast Big Data Analysis", work through all the examples, and summarise the important insights and lessons so I can review them later.
Go over the skeletons of the Databricks Developer course that I found on GitHub. They are from 15 months ago, so they should still be reasonably up to date: https://github.com/vivek-bombatkar/spark-training and https://github.com/vivek-bombatkar/Spark-with-Python---My-learning-notes-
Go through example questions.

If you can recommend any other preparation resources, please write them in the comments. It will help me.

I will update this post as I go, for others and for myself.

Learning Schedule

Theory

Thoroughly read the book "Learning Spark: Lightning-Fast...".
I think it's reasonable to go through 2 chapters per week.
This means reading, summarising and running the important code snippets on my own.

Week 1
Chapter 3
Chapter 4

Week 2
Chapter 5
Chapter 6

Week 3
Chapter 7
Chapter 8

Week 4
Chapter 9
Chapter 10

Week 5
Chapter 11 (a quick read, it's not that important)

Hands on coding

Basics (4 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-pyspark

https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-from-pandas-to-spark

https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-weather-df

https://github.com/vivek-bombatkar/spark-training/blob/master/spark-python/jupyter-weather-df/Weather%20Analysis%20Exercise.ipynb

Advanced topics (10 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced

Windows (4 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-windows

https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-windows

UDF (3 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-udf

Spark execution (1 notebook)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-execution

Caching (3 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-caching

Pivoting (1 notebook)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-pivoting

Total: 26 notebooks
I hope to do 3-4 notebooks per week (some will be easy and some harder, so I'm taking the average). That works out to about 8 weeks of going through the notebooks and learning what I'm missing.

All together, it should take about 3 months until I'm ready for the exam.
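
The estimate above can be checked with a few lines of Python. This is only a rough sketch: the notebook counts and paces are taken from the lists in this post, while the variable names are made up for illustration, and it assumes the theory weeks and the hands-on weeks run one after the other rather than in parallel.

```python
import math

# Notebook counts per topic, as listed in the "Hands on coding" section.
notebooks = {
    "Basics": 4,
    "Advanced topics": 10,
    "Windows": 4,
    "UDF": 3,
    "Spark execution": 1,
    "Caching": 3,
    "Pivoting": 1,
}
total_notebooks = sum(notebooks.values())

# Planned pace of 3-4 notebooks per week, using the average.
avg_pace = (3 + 4) / 2
notebook_weeks = math.ceil(total_notebooks / avg_pace)

# Theory: chapters 3 to 11 of the book, 2 chapters per week.
chapters = list(range(3, 12))
theory_weeks = math.ceil(len(chapters) / 2)

# Assumption: theory and hands-on coding are done sequentially.
total_weeks = theory_weeks + notebook_weeks

print(f"Notebooks: {total_notebooks}, hands-on weeks: {notebook_weeks}")
print(f"Theory weeks: {theory_weeks}, total weeks: {total_weeks}")
```

If the reading and the notebooks overlap instead, the total would be shorter than the sum of the two.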

Books PDFs

Learning Spark: Lightning-Fast Big Data Analysis
First Edition
https://b-ok.asia/book/2493162/9b8d4f?dsource=recommend
Second Edition
https://laptrinhx.com/learning-spark-lightning-fast-data-analytics-2nd-edition-436517903/

Spark: The Definitive Guide: Big Data Processing Made Simple
https://b-ok.asia/book/3505368/f04c83?regionChanged

Spark in Action
https://b-ok.asia/book/3502170/d3383b
