Discussion on: How to use Spark and Pandas to prepare big data

Ralph Brooks

Four years ago, I was working with PySpark. Building on the work in this blog, I had a couple of thoughts:

  • Google Cloud BigQuery is MUCH more responsive than Spark for interactive queries. If your data lake connects to the cloud, take advantage of that (see the BigQuery sketch after this list).

  • Clean data is important, but MOST important is that your data is not corrupt. That means taking about 100 examples and verifying them by hand (phone calls, cross-checks against the source system, or whatnot) to confirm the data wasn't corrupted upstream. No amount of cleaning can fix deeply corrupt data (a sampling sketch follows the list).

  • Pandas is useful but cumbersome. I try to find some flavor of SQL (BigQuery, AWS Athena) to get a sense of the data as quickly as possible. pyspark.sql can work, but using it from inside code will slow you down (a quick example closes out the list below).
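
For the BigQuery point, here is a minimal sketch of the kind of exploratory query I mean, assuming the google-cloud-bigquery package is installed and credentials are configured; the project, dataset, and table names are placeholders:

```python
# Minimal sketch: an exploratory aggregate run straight against BigQuery,
# no Spark cluster involved. Project/dataset/table are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT status, COUNT(*) AS n
    FROM `my-project.my_dataset.orders`
    GROUP BY status
    ORDER BY n DESC
"""

# For typical exploratory queries this comes back in seconds.
for row in client.query(query).result():
    print(row.status, row.n)
```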
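
For the corruption check, a sketch of pulling roughly 100 records to verify by hand, assuming pandas and a parquet extract; the path is a placeholder:

```python
# Minimal sketch: sample ~100 rows and save them so they can be
# verified by hand against the upstream source. The path is
# hypothetical; reading from s3:// also needs s3fs installed.
import pandas as pd

df = pd.read_parquet("s3://my-bucket/raw/orders/")

df.sample(n=min(100, len(df)), random_state=42).to_csv(
    "manual_verification_sample.csv", index=False
)
```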
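
And when the SQL has to run inside Spark, a quick pyspark.sql look; the path, table, and column names are made up:

```python
# Minimal sketch: register a DataFrame as a temp view and poke at it
# with ad-hoc SQL. Path, table, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quick-look").getOrCreate()

df = spark.read.parquet("/data/orders")
df.createOrReplaceTempView("orders")

spark.sql("""
    SELECT country, COUNT(*) AS n, AVG(amount) AS avg_amount
    FROM orders
    GROUP BY country
    ORDER BY n DESC
    LIMIT 20
""").show()
```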

If there are other questions on this, reach out on Twitter at @whiteowled.

Mage

Thank you Ralph!