Flo P
Live notetaking as I learn Spark

What is this?

I would love to get inside the minds of other developers. I want to see how they think, and I want to watch how they learn, live. So, this is an experiment.

I have a somewhat vague goal: learn the theoretical foundations of Spark so I can look at a program and optimize it.

I will put my notes here as I go. My hope is that I will start to glean some patterns from these notes. For example, so far I have noticed that I categorize the questions I have and define concepts I don't understand as I come across them, sometimes putting off defining them because I want to understand a larger concept first. This is valuable information for me because I want to learn in the most efficient manner! There is so much I am curious about. I want to live life to the fullest and explore topics that give me joy.

This experiment is inspired by Jessie Frazelle's blog post, Digging into RISC-V and how I learn new things. By inspired I mean that reading her post brought back the same excitement I felt when I first started learning to code. It brought that feeling of wonder and reverence back. Romance is not dead :p

If you have been feeling burned out or like your work does not matter, I urge you to think outside your immediate situation (whether a shitty job or overwhelming schoolwork) and learn about areas that you are curious about. See how other people are learning. Try new things. Find inspiration.


Fuzzy pink pen

Notes

Installation

From Spark: The Definitive Guide (recommended to me!)

  • If you want to download and run Spark locally, the first step is to make sure that you have Java installed on your machine (available as java), as well as a Python version if you would like to use Python. Next, visit the project’s official download page, select the package type of “Pre-built for Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR file, or tarball, that you will then need to extract.
    • I already installed the pre-built version for 2.9
    • I moved the uncompressed folder to my Applications folder but couldn't find the folder thru the terminal. What have I done?

Small snafu

There is the folder:
screenshot of Finder showing spark install in Applications folder

But I don't see any content in the folder thru the Terminal:

$ ls ~/Applications/
Chrome Apps.localized

So, I Googled and found that the Applications folder on Mac is at the root (/Applications) and I had been looking at my user's ~/Applications.

  • Spark can run locally without any distributed storage system, such as Apache Hadoop, so I won't install Hadoop since the last time I did, it was incredibly slow on my machine (although I was using a different machine with shittier specs).

"You can use Spark from Python, Java, Scala, R, or SQL. Spark itself is written in Scala, and runs on the Java Virtual Machine (JVM), so therefore to run Spark either on your laptop or a cluster, all you need is an installation of Java. If you want to use the Python API, you will also need a Python interpreter (version 2.7 or later)."

I'll be using PySpark. I use pyenv to manage installations of Python and create virtual environments. The book says, "In Spark 2.2, the developers also added the ability to install Spark for Python via pip install pyspark. This functionality came out as this book was being written, so we weren’t able to include all of the relevant instructions." So I know that I'll want a virtual environment, because I want to manage the version of pyspark per project instead of installing it globally. I also now know that I may need to do more Googling since this book doesn't have all the instructions I may need. I don't know exactly how many instructions are missing -- hope this doesn't derail me for too long. I'm anticipating it though, and I wonder if that anticipation gets in the way of me pushing thru on other projects.

This is how I'm installing pyspark:

$ mkdir spark-trial
$ cd spark-trial/
$ python --version
Python 2.7.10
$ pyenv local 3.6.5
$ python -m venv venv
$ source venv/bin/activate
$ python --version
Python 3.6.5
$ pip install pyspark
$ pip freeze > requirements.txt
$ cat requirements.txt 
py4j==0.10.7
pyspark==2.4.1

Now, I want to make sure I can run the thing. The book says to run the following from the home directory of the Spark installation: $ ./bin/pyspark.
Then, type “spark” and press Enter. You’ll see the SparkSession object printed.

So I went back to the directory where I installed Spark and ran those commands. This was the output:

Python 3.6.5 (default, Nov 20 2018, 15:26:21) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Applications/spark-2.4.1-bin-hadoop2.7/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/06 12:44:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Python version 3.6.5 (default, Nov 20 2018 15:26:21)
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x10e262ac8>

I don't know why it is running Python 3.6.5 even though the system default is 2.7. I know pyenv manages which version of Python is set for a project by creating a .python-version file, so I'm going to exit this shell and look for that file in this spark directory.

When I run python --version from the same directory that Spark is installed in, I see the version is 3.6.5 -- the same version as my virtual environment. I set up a bash theme that displays the virtual environment's name when one is activated:

○ flo at MacBook-Pro in .../-2.4.1-bin-hadoop2.7 using virtualenv: venv $ python --version
Python 3.6.5

I forgot that the virtual environment I created above is still activated. I thought pyenv worked by reading the .python-version file to determine which Python version is set for a project, but I had not put together how Python virtual environments work... what does the activate script that comes with virtual environments do? After looking at the script, I see that it sets the VIRTUAL_ENV environment variable to the path of this virtual environment, prepends $VIRTUAL_ENV/bin to the system PATH, and unsets PYTHONHOME if it is set, so that the first Python the system finds when it calls python is the activated virtual environment's Python. Cool!
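To double-check that mechanism, here is a minimal sketch (my own, not from the book) that can be run from any Python prompt inside the activated venv; it just inspects the environment that the activate script set up:

import os
import sys

# Which interpreter is actually running? Inside an activated venv this
# path points into the venv directory, not at the system Python.
print(sys.executable)

# The activate script exports VIRTUAL_ENV and puts its bin/ first on PATH,
# which is why plain `python` now resolves to the venv's interpreter.
print(os.environ.get("VIRTUAL_ENV"))
print(os.environ["PATH"].split(os.pathsep)[0])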

Spark Architecture

"Single machines do not have enough power and resources to perform computations on huge amounts of information (or the user probably does not have the time to wait for the computation to finish). A cluster, or group, of computers, pools the resources of many machines together, giving us the ability to use all the cumulative resources as if they were a single computer. Now, a group of machines alone is not powerful, you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers." Spark the definite guide

I have a textbook, Distributed Systems, by Andrew Tanenbaum (my favorite textbook author -- yes, I have a fav!). This morning, I read about different types of distributed computing, like cluster computing vs grid computing, so I am going to go back to the book to clarify whether "cluster" in the quote above has a connection to cluster computing or if it's an overloaded term.

There are many classes of distributed systems; one class of distributed systems is for high-performance computing tasks. There are two subgroups within this class:

  • cluster computing:
    • underlying hardware consists of a collection of similar workstations or PCs closely connected by means of a high-speed local-area network
    • each node runs the same operating system
  • grid computing:
    • often constructed as a federation of computer systems, where each system may fall under a different administrative domain, and may be very different when it comes to hardware, software, and deployed network technology

I have only come in contact with Spark within single companies, which I assume each have a single network??? So for now, I am going to assume that Spark falls under cluster computing.

The Distributed Systems textbook then goes into examples of cluster computers, for example the MOSIX system. MOSIX tries to provide a single-system image of a cluster, which means the cluster tries to appear as a single computer to a process. I've come across my first distributed systems gem: IT IS IMPOSSIBLE TO PROVIDE A SINGLE-SYSTEM IMAGE UNDER ALL CIRCUMSTANCES.

I am finally starting to connect the dots between Spark and distributed systems concepts. I am going back in the textbook to design goals of distributed systems to learn more about transparency and adding a section to my notes called Design Goals since I am now getting what these design goals are all about.

Back to Spark,

"The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to these cluster managers, which will grant resources to our application so that we can complete our work."

Cluster manager

The cluster manager controls physical machines and allocates resources to Spark Applications.

Spark Applications

Spark Applications consist of a driver process and a set of executor processes. The driver and executors are simply processes, which means that they can live on the same machine or different machines. In local mode, the driver and executors run (as threads) on your individual computer instead of a cluster.
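To make this concrete, here is a minimal sketch (mine, not the book's) of starting a Spark Application in local mode from a plain Python script, where the driver and executors run as threads on one machine:

from pyspark.sql import SparkSession

# Build (or reuse) the driver's SparkSession. "local[*]" means local mode:
# one executor thread per CPU core on this machine, no cluster manager.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-trial")
    .getOrCreate()
)

# Rough sense of how many executor threads/cores Spark will use locally.
print(spark.sparkContext.defaultParallelism)

spark.stop()  # shut the Spark Application down when done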

Driver

  • called the SparkSession
  • 1:1 SparkSession to Spark Application
  • runs main() function
  • sits on a node in the cluster
  • responsible for three things:
    • maintaining information about the Spark Application during the lifetime of the application
    • responding to a user’s program or input
    • analyzing, distributing, and scheduling work across the executors

Executors

  • responsible for only two things:
    • executing code assigned to it by the driver
    • reporting the state of the computation on that executor back to the driver node (see the sketch after this list)
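A tiny sketch (my own) of the driver/executor split: the driver schedules the work, the function below actually runs on the executors, and the results come back to the driver. In local mode the "executors" are just threads on this machine.

def where_am_i(_):
    # This code is shipped by the driver and executed on an executor.
    import socket
    return socket.gethostname()

# The driver distributes 8 elements across 4 partitions (4 tasks);
# the executors run where_am_i on them and report results back.
counts = spark.sparkContext.parallelize(range(8), 4).map(where_am_i).countByValue()
print(counts)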

I've done quite a bit of learning about concepts. I am itching to build before my attention wavers. There are still some parts in the text that I gleaned and seem relevant, so I'm going to keep going and hopefully will get to build soon. Otherwise, I'll pivot.

Spark APIs

  • language APIs and structured vs unstructured APIs
  • Each language API maintains the same core concepts described (driver, executors, etc.?).
    • There is a SparkSession object available to the user, which is the entrance point to running Spark code.
    • When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.

Starting the Spark shell is how I can send commands to Spark, which Spark then sends to the executors. So, starting a Spark shell creates an interactive Spark Application. By default the shell starts in local mode. I can also send standalone applications to Spark using the spark-submit process, whereby I submit a precompiled application to Spark.
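As a sketch of what such a standalone application might look like (the file name and the job are made up by me), you would save something like this and hand it to Spark with ./bin/spark-submit count_app.py:

# count_app.py -- a tiny, hypothetical standalone PySpark application
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # spark-submit supplies the master / cluster manager settings,
    # so the application only asks for a SparkSession.
    spark = SparkSession.builder.appName("count-app").getOrCreate()

    # One transformation chain (range -> filter) plus an action (count) = one job.
    evens = spark.range(1000).filter("id % 2 = 0")
    print(evens.count())

    spark.stop()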

Distributed collections of data

  • core data structures are immutable so they cannot change after they are created

DataFrames

  • a table of data with rows and columns
  • schema is list of columns and types
  • parts of the DataFrame can reside on different machines
    • A partition is a collection of rows that sit on one physical machine in your cluster. A DataFrame’s partitions represent how the data is physically distributed across the cluster of machines during execution.
  • Python/R DataFrames mostly exist on a single machine but can convert to Spark DataFrame
  • myRange = spark.range(1000).toDF("number") in Python creates a DataFrame with one column containing 1,000 rows with values from 0 to 999. This range of numbers represents a distributed collection. When run on a cluster, each part of this range of numbers exists on a different executor. This is a Spark DataFrame. (See the sketch after this list.)
  • most efficient and easiest to use
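A minimal sketch of that example in the PySpark shell (the variable name comes from the book's example, the rest is mine), plus one way to peek at how the rows are split into partitions:

# One column ("number") with 1,000 rows, values 0 to 999.
myRange = spark.range(1000).toDF("number")

# The DataFrame is backed by partitions; locally they are spread across
# executor threads, on a cluster across executors on different machines.
print(myRange.rdd.getNumPartitions())

# Peek at a few rows (this is an action, so it triggers execution).
myRange.show(3)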

Transformations

  • narrow: those for which each input partition will contribute to only one output partition
    • Spark will automatically perform an operation called pipelining, meaning that if we specify multiple filters on DataFrames, they’ll all be performed in-memory.
  • wide: a wide dependency (or wide transformation) has input partitions contributing to many output partitions; this is also known as a shuffle (see the sketch after this list)
    • when a shuffle is performed, Spark writes the results to disk
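A sketch of the difference (mine, reusing the myRange DataFrame from above; the bucketing key is invented for the example):

from pyspark.sql import functions as F

# Narrow: each output partition depends on one input partition. Two chained
# filters get pipelined and run in memory, partition by partition.
narrow = myRange.where("number % 2 = 0").where("number > 100")

# Wide: groupBy needs matching keys from every input partition, so Spark
# shuffles data between partitions and writes shuffle output to disk.
wide = narrow.groupBy((F.col("number") % 10).alias("bucket")).count()

# Transformations are lazy; this action is what actually triggers the job.
wide.show(5)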

Actions

Theoretical foundations

Design Goals

  • Making distribution transparent/invisible: hide that processes and resources are physically distributed across multiple computers
    • Access: hide differences in data representation and how a process or resource is accessed
    • Location: hide where a process or resource is located
    • Relocation: hide that a resource or process may be moved to another location while in use
    • Migration: hide that a resource or process may move to another location
    • Replication: hide that a resource or process is replicated
    • Concurrency: hide that a resource or process may be shared by several independent users
    • Failure: hide the failure and recovery of a resource or process

What are the main problems in distributed systems?

  • multiset
  • working set: the amount of memory that a process requires in a given time interval; typically the units of information in question are memory pages. This is suggested to be an approximation of the set of pages that the process will access in the future, and more specifically an indication of which pages ought to be kept in main memory to allow the most progress in the execution of that process.
  • cluster computing paradigms
  • distributed shared memory

Architectural foundations

  • RDD: resilient distributed dataset, a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way
  • The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged.

Cluster Computing Paradigm

Limitations of MapReduce cluster computing paradigm

  • Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs:
    • MapReduce programs read input data from disk
    • map a function across the data
    • reduce the results of the map
    • store reduction results on disk
  • Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
  • Spark facilitates the implementation of both iterative algorithms, which visit their data set multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to an Apache Hadoop MapReduce implementation (see the sketch after this list).
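A minimal sketch (my own example, using the interactive spark session) of why this matters for iterative work: caching keeps the dataset in memory across iterations instead of re-reading or recomputing it on every pass, which is the RDDs-as-working-set idea above.

from pyspark.sql import functions as F

# A dataset we will visit repeatedly (here just generated numbers).
data = spark.range(1000000).withColumn("value", F.col("id") * 2)

# Mark it for in-memory caching; it is materialized by the first action.
data.cache()

# Each later pass reuses the cached partitions instead of recomputing them
# (a MapReduce-style flow would go back to disk between passes).
for threshold in (10, 100, 1000):
    print(threshold, data.where(F.col("value") > threshold).count())

data.unpersist()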

Apache Spark requires:

  • a cluster manager: standalone (native Spark cluster), Hadoop YARN, Apache Mesos
  • a distributed storage system: Alluxio, HDFS, MapR-FS, Cassandra, OpenStack Swift, Amazon S3, Kudu, or a custom solution; alternatively, a pseudo-distributed local mode where distributed storage is not required and the local file system is used instead (Spark runs on a single machine with one executor per CPU core in this case)

Why use RDDs?
RDDs are lower-level abstractions because they reveal physical execution characteristics like partitions to end users. You might use RDDs to parallelize raw data stored in memory on the driver machine.

Scala RDDs are not equivalent to Python RDDs.
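A hedged sketch of that last use case (the data and column names are made up): parallelize a small in-memory Python list on the driver into an RDD, then wrap it in a DataFrame to get back to the structured API.

# Raw Python data sitting in the driver's memory.
rows = [("a", 1), ("b", 2), ("c", 3)]

# Distribute it across the executors as an RDD (the lower-level API).
rdd = spark.sparkContext.parallelize(rows)

# Wrap it in a DataFrame with a simple schema of column names.
df = spark.createDataFrame(rdd, ["letter", "count"])
df.show()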



Chapter 4: Structured API Overview

"Spark is a distributed programming model in which the user specifies transformations. Multiple transformations build up a directed acyclic graph of instructions. An action begins the process of executing that graph of instructions, as a single job, by breaking it down into stages and tasks to execute across the cluster. The logical structures that we manipulate with transformations and actions are DataFrames and Datasets. To create a new DataFrame or Dataset, you call a transformation. To start computation or convert to native language types, you call an action."

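To watch that graph-building happen, here is a small sketch (mine, reusing the interactive spark session): the transformations only build up the plan, explain() prints what Spark intends to do, and the action at the end submits the plan as a job.

# Transformations: nothing executes yet, Spark just builds the plan/DAG.
df = spark.range(1000).toDF("number")
filtered = df.where("number % 7 = 0")
summed = filtered.selectExpr("sum(number) as total")

# Inspect the plan Spark has built so far.
summed.explain()

# Action: the plan is broken into stages and tasks and run as a job.
print(summed.collect())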
Top comments (2)

Adi Polak

This is great! I have been working with Apache Spark for years and your notes are great and to the point!

I love that you compared cluster computing vs grid computing. In my opinion, Apache Spark can also perform as grid computing; however, I don't think it will get the best performance since physical distance matters.

If you look at grid as a distributed systems concept - a way to use computers distributed over a network to solve a problem - then Hadoop is a subset of grid computing. And Apache Spark is like a better version of the Hadoop MapReduce paradigm, since it does most of the calculations in memory.
The problems for which Apache Spark is most often used are problems that are better solved when the computation is brought to where the data lives. Typical "Big Data" problems like machine learning, data mining, etc. fall into this category. I personally never tried using Apache Spark as grid computing, but it could be an interesting test.

What are you planning next with Apache Spark?

Elizabeth Schafer

I love the format of this post! This is such a great idea, both for sharing with other people, but also just for personally having a reference to look back on.