
PySpark Code Snippets

Siaterlis Konstantinos ・ 2 min read

I have been using PySpark for some time now, and I thought I would share with you how I began learning Spark, the experiences I had, the problems I encountered, and how I solved them! You are more than welcome to suggest and/or request code snippets in the comments section below or on Twitter at @siaterliskonsta. I will keep this post maintained and update it as I create more gists.

The original post is located at my blog.

DataFrames

From Spark's website, a DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

Reading Files

Here are some options for reading files with PySpark.

Selecting Columns

Now that you have some data in your DataFrame, you may need to select specific columns instead of the whole thing. Here is how to do it.

Filtering

In addition, you may want to filter rows based on some conditions. This is how it works.

GroupBy

Now that you have learned the basic column and row manipulation, let's move on to some aggregations.

RDDs - Resilient Distributed Datasets

RDDs are fault-tolerant collections of elements that can be operated on in parallel. More information can be found in the official Spark documentation.

Reading Files

Let’s see how to read files into Spark RDDs (Resilient Distributed Datasets)
