DEV Community

Adi Polak

Posted on • Originally published at github.com

Apache Spark Basics

This is a basic cheat sheet and glossary for getting started with Apache Spark. Whenever we share a new post with terms or code snippets, they will also appear here in a generic form.

If you work with Apache Spark and are looking for a cheat sheet, this is for you as well!

First things first:

-1- The workspace:

First, we need to create the workspace. We are using a Databricks workspace, and here is a tutorial for creating it.

-2- Basic Apache Spark vocabulary:

DataFrame

A DataFrame is a distributed collection of data organized into named columns. It provides operations to filter, group, and compute aggregates. DataFrame data is often distributed across multiple machines, and it can be held in memory or on disk.
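To make the filter, group, and aggregate operations concrete, here is a minimal Scala sketch. The sample rows, the column names (name, department, salary), and the local session settings are assumptions made up for illustration. In a Databricks notebook the `spark` session is already provided, so the builder line can be skipped there.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Assumption: a local session for experimentation.
// In a Databricks notebook, `spark` already exists and this line is not needed.
val spark = SparkSession.builder().appName("dataframe-basics").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data: (name, department, salary).
val employees = Seq(
  ("Ana", "engineering", 120.0),
  ("Ben", "engineering", 100.0),
  ("Cleo", "sales", 80.0)
).toDF("name", "department", "salary")

// Filter: keep only the rows that match a column predicate.
val wellPaid = employees.filter(col("salary") > 90.0)

// Group and aggregate: average salary per department.
val avgByDept = employees
  .groupBy("department")
  .agg(avg("salary").as("avg_salary"))

wellPaid.show()
avgByDept.show()
```

The named columns are what let `filter` and `groupBy` refer to data by name rather than by position.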

Dataset

A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row objects.
Datasets are "lazy": transformations only describe a computation, and the computation is triggered only when an action (such as count or collect) is invoked.
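The following Scala sketch illustrates both points: typed transformations on a Dataset, and the fact that nothing runs until an action is called. The `Person` case class and its sample values are hypothetical, and the local session settings are an assumption (in a Databricks notebook, `spark` is already available).

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain type used only for this illustration.
case class Person(name: String, age: Int)

// Assumption: a local session; skip this line where `spark` is predefined.
val spark = SparkSession.builder().appName("dataset-basics").master("local[*]").getOrCreate()
import spark.implicits._ // brings the encoders for case classes into scope

val people: Dataset[Person] = Seq(Person("Ana", 34), Person("Ben", 17)).toDS()

// Transformations: these only build up a plan, no job is executed yet.
val adults: Dataset[Person] = people.filter(p => p.age >= 18)
val names: Dataset[String] = adults.map(_.name)

// Moving between the typed Dataset and its untyped DataFrame view.
val untyped: DataFrame = people.toDF()
val typedAgain: Dataset[Person] = untyped.as[Person]

// Action: collect() triggers the actual computation.
val result: Array[String] = names.collect()
result.foreach(println)
```

Because the lambdas operate on `Person` objects, a typo such as `p.agee` fails at compile time instead of at run time.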

RelationalGroupedDataset

A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup.
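As a rough sketch of how this looks in Scala, the example below creates a RelationalGroupedDataset with groupBy and then aggregates it, followed by rollup and cube variants. The sales rows and the column names (region, product, amount) are invented for illustration, and the local session settings are an assumption.

```scala
import org.apache.spark.sql.{RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.{count, sum}

// Assumption: a local session; skip this line where `spark` is predefined.
val spark = SparkSession.builder().appName("grouping-basics").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data: (region, product, amount).
val sales = Seq(
  ("EU", "book", 10),
  ("EU", "pen", 5),
  ("US", "book", 7)
).toDF("region", "product", "amount")

// groupBy returns a RelationalGroupedDataset, not a DataFrame.
val grouped: RelationalGroupedDataset = sales.groupBy("region")

// Calling an aggregation on it yields a DataFrame again.
val totals = grouped.agg(sum("amount").as("total"), count("*").as("rows"))

// rollup adds hierarchical subtotals (per region, then a grand total).
val rolled = sales.rollup("region", "product").sum("amount")

// cube adds subtotals for every combination of the grouping columns.
val cubed = sales.cube("region", "product").sum("amount")

totals.show()
rolled.show()
cubed.show()
```

In the rollup and cube results, a null in a grouping column marks a subtotal row for that column.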

This is an evolving page; more terms, code snippets, and architecture designs will be added.
