I have been using PySpark for some time now, and I thought I would share how I began learning Spark: my experiences, the problems I ran into, and how I solved them! You are more than welcome to suggest and/or request code snippets in the comments section below or on Twitter at @siaterliskonsta. I will keep this post maintained and update it as I create more gists.
The original post is located at my blog.
From Spark's website, a DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
Here are some options for reading files with PySpark.
Now that you have some data in your DataFrame, you may want to select specific columns instead of the whole thing. This is how you do it.
In addition, you may want to filter the rows based on some conditions. This is how it works.
Now that you have learned the basic column and row manipulation, let's move on to some aggregations.
RDDs are fault-tolerant collections of elements that can be operated on in parallel. More information can be found in the official Spark documentation.
Let’s see how to read files into Spark RDDs (Resilient Distributed Datasets).