Week 6 of Data Engineering Zoomcamp by @DataTalksClub complete
Just finished Module 6 - Batch Processing with Spark. Learned how to:
✅ Set up PySpark and create Spark sessions
✅ Read and process Parquet files at scale
✅ Repartition data to balance work across cores and executors
✅ Analyze millions of taxi trips with DataFrames
✅ Use Spark UI for monitoring jobs
Processing 4M+ taxi trips with Spark - distributed computing is powerful
Here's my homework solution: https://github.com/Derrick-Ryan-Giggs/pyspark-homework
Following along with this amazing free course - who else is learning data engineering?
You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/