Apache Spark Basics

#beginners #productivity #bitesize

This is a basic cheat sheet and glossary for getting started with Apache Spark. Every time we share a new post with terms or code snippets, they will also appear here in a generic form.

If you work with Apache Spark and are looking for a cheat sheet, this is for you as well!

First things first:

-1- The Workspace:

First, we need to create a workspace. We are using a Databricks workspace, and here is a tutorial for creating it.

-2- Basic Apache Spark Vocabulary:

DataFrame

A DataFrame is a distributed collection of data organized into named columns. It provides operations to filter, group, and compute aggregates. The data is often spread across multiple machines and can live in memory or on disk.
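
To make the filter, group, and aggregate operations concrete, here is a minimal Scala sketch. The sample rows, the column names (`category`, `country`, `amount`), and the `local[*]` master are assumptions for illustration only. In a Databricks notebook a session named `spark` already exists, so the builder and `spark.stop()` lines would not be needed there.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local run. On a cluster, drop .master(...) and let the platform set it.
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF on local collections

    // Hypothetical sample data with named columns.
    val sales = Seq(
      ("books", "US", 12.0),
      ("books", "DE", 8.5),
      ("games", "US", 30.0),
      ("games", "DE", 25.0)
    ).toDF("category", "country", "amount")

    val result = sales
      .filter(col("amount") > 10.0)            // filter rows
      .groupBy("category")                     // group by a named column
      .agg(avg("amount").alias("avg_amount"))  // compute an aggregate per group

    result.show() // prints the aggregated table to stdout

    spark.stop()
  }
}
```

Each call returns a new DataFrame, so the operations chain naturally, and the same code works whether the data sits on one machine or is spread across many.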

Dataset

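A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row objects. The typed Dataset API is available in Scala and Java.
A Dataset is "lazy": transformations only describe the computation, and the work is triggered when an action (such as count or collect) is invoked.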
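
The following Scala sketch illustrates both ideas: the typed API and lazy evaluation. The `Person` case class, the sample values, and the `local[*]` master are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type. It is defined at the top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local run. On Databricks, reuse the existing `spark` session instead.
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides encoders and .toDS

    val people: Dataset[Person] =
      Seq(Person("Ada", 36), Person("Linus", 17), Person("Grace", 45)).toDS()

    // Transformations: these lines only build a plan, no data is processed yet.
    val adultNames: Dataset[String] =
      people
        .filter(p => p.age >= 18) // typed lambda, checked at compile time
        .map(p => p.name)

    // Action: collect() triggers the actual computation.
    val names: Array[String] = adultNames.collect()
    println(names.mkString(", "))

    // The untyped view of the same data: a DataFrame, i.e. a Dataset of Row.
    val asDataFrame: DataFrame = people.toDF()
    asDataFrame.printSchema()

    spark.stop()
  }
}
```

Because `filter` and `map` take ordinary Scala functions over `Person`, a typo such as `p.agee` fails at compile time instead of at runtime, which is the main practical benefit of the typed view.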