Four years ago, I was working with PySpark. Building on the work in this blog, I had a couple of thoughts:
Google Cloud BigQuery is MUCH more responsive than Spark. If your data lake connects to the cloud, take advantage of this.
Clean data is important, but MOST important is that your data not be corrupt. That means taking about 100 examples and manually verifying them (making phone calls or whatever it takes) to confirm the data wasn't corrupted upstream. No amount of cleaning can help with deeply corrupt data.
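A minimal sketch of that spot-check step, using pandas: pull a reproducible random sample of about 100 rows to verify by hand against the upstream source. The DataFrame, column names, and helper here are hypothetical stand-ins for your own data.

```python
import pandas as pd

def sample_for_verification(df: pd.DataFrame, n: int = 100, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible random sample to spot-check by hand."""
    # A fixed seed means everyone on the team checks the same rows.
    return df.sample(n=min(n, len(df)), random_state=seed)

# Hypothetical records table; swap in your real data.
records = pd.DataFrame({
    "customer_id": range(1000),
    "phone": [f"555-{i:04d}" for i in range(1000)],
})

to_check = sample_for_verification(records)
print(len(to_check))  # ~100 rows to verify by phone or against the source system
```

The verification itself stays manual; the code only picks which rows to call about.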
Pandas is useful but cumbersome. I try to find some form of SQL (BigQuery, AWS Athena) to get a sense of the data as quickly as possible. pyspark.sql can work, but using it in the context of code will slow you down.
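The kind of quick profiling query this points at can be sketched as follows. This uses Python's built-in sqlite3 as a local stand-in (the SQL itself is nearly identical on BigQuery or Athena), and the table and column names are hypothetical.

```python
import sqlite3

# In-memory database standing in for a cloud SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, status TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'paid', 10.0), (2, 'paid', 25.5),
                          (3, 'refund', -25.5), (4, NULL, 5.0);
""")

# Group counts and aggregates give a fast sense of the data,
# including surprises like NULL statuses or negative amounts.
for row in conn.execute("""
    SELECT status, COUNT(*) AS n, SUM(amount) AS total
    FROM orders
    GROUP BY status
    ORDER BY n DESC
"""):
    print(row)
```

A handful of queries like this often reveals more, faster, than loading everything into a DataFrame first.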
I'm on Twitter at @whiteowled if there are other questions on this.
Thank you Ralph!