DEV Community

Ryan Giggs

Batch Processing with Apache Spark

Week 6 of the Data Engineering Zoomcamp by @DataTalksClub is complete!
I just finished Module 6, Batch Processing with Spark, and learned how to:

✅ Set up PySpark and create Spark sessions

✅ Read and process Parquet files at scale

✅ Repartition data for optimal performance

✅ Analyze millions of taxi trips with DataFrames

✅ Use Spark UI for monitoring jobs
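
For anyone curious what those steps look like in code, here is a minimal sketch and not my actual homework solution. It assumes a local Spark install, and the `partitions_for` helper, the 128 MB target size, the app name, and the `tpep_pickup_datetime` column (from the NYC yellow taxi schema) are all assumptions for illustration.

```python
def partitions_for(total_bytes: int, target_mb: int = 128) -> int:
    """Rough partition count so each partition is about target_mb in size.

    The 128 MB default is a common rule of thumb, not a Spark requirement.
    """
    target_bytes = target_mb * 1024 * 1024
    # Integer ceiling division, with a floor of one partition.
    return max(1, (total_bytes + target_bytes - 1) // target_bytes)


def run(input_path: str, output_path: str, num_partitions: int) -> None:
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Create (or reuse) a local Spark session using all available cores.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("taxi-batch-sketch")  # hypothetical app name
        .getOrCreate()
    )

    # Address of the Spark UI for monitoring jobs and stages.
    print("Spark UI:", spark.sparkContext.uiWebUrl)

    # Read the Parquet input into a DataFrame.
    df = spark.read.parquet(input_path)

    # Repartition and write the data back out as Parquet.
    df.repartition(num_partitions).write.mode("overwrite").parquet(output_path)

    # Count trips per pickup day (column name is an assumption).
    trips_per_day = (
        df.groupBy(F.to_date("tpep_pickup_datetime").alias("pickup_date"))
        .count()
        .orderBy("pickup_date")
    )
    trips_per_day.show(5)

    spark.stop()
```

Calling `run(...)` with your own input and output paths plus `partitions_for(size_in_bytes)` ties the pieces together; while the job runs, the printed Spark UI address shows the stages and tasks.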

Processing 4M+ taxi trips with Spark: distributed computing is powerful.

Here's my homework solution: https://github.com/Derrick-Ryan-Giggs/pyspark-homework

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
