PySpark & Apache Spark - Overview

PySpark is the Python API for Apache Spark. It enables us to perform real-time, large-scale data processing in a distributed environment using Python. It combines the power of Python programming with the power of Apache Spark to make data processing accessible to everyone who is familiar with Python.


Spark SQL and DataFrames:
Spark SQL is Apache Spark’s module for working with structured data. It allows you to seamlessly mix SQL queries with Spark programs. With PySpark DataFrames you can efficiently read, write, transform, and analyze data using Python and SQL. Whether you use Python or SQL, the same underlying execution engine is used so you will always leverage the full power of Spark.
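To make this concrete, here is a minimal sketch of mixing the DataFrame API with SQL on the same data. The file name "employees.csv" and the column names are hypothetical, chosen only for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Read a CSV file into a DataFrame, inferring the schema from the data
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# DataFrame API: filter and aggregate in Python
df.filter(df.salary > 50000).groupBy("department").count().show()

# Register the DataFrame as a temporary view and run the same query in SQL
df.createOrReplaceTempView("employees")
spark.sql("""
    SELECT department, COUNT(*) AS cnt
    FROM employees
    WHERE salary > 50000
    GROUP BY department
""").show()

spark.stop()
```

Both versions go through the same execution engine, so choosing Python or SQL is a matter of preference, not performance.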

I will discuss Spark SQL and DataFrames in more detail later in this blog.

Spark Core and RDD
Spark Core is the foundation of the platform. It is responsible for memory management, fault recovery, scheduling, distributing and monitoring jobs, and interacting with storage systems. Spark Core is exposed through application programming interfaces (APIs) built for Java, Scala, Python, and R. These APIs hide the complexity of distributed processing behind simple, high-level operators.

Apache Spark recommends using DataFrames instead of RDDs, as they allow you to express what you want more easily and let Spark automatically construct the most efficient query plan for you.
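The difference is easiest to see side by side. Below is a small sketch contrasting an RDD transformation with the equivalent DataFrame operation; the sample rows and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDvsDataFrame").getOrCreate()
sc = spark.sparkContext

rows = [("alice", 34), ("bob", 45), ("carol", 29)]

# RDD: you spell out *how* to compute the result, step by step
rdd = sc.parallelize(rows)
adults_rdd = rdd.filter(lambda r: r[1] >= 30).map(lambda r: r[0])
print(adults_rdd.collect())

# DataFrame: you declare *what* you want and Spark optimizes the plan
df = spark.createDataFrame(rows, ["name", "age"])
df.filter(df.age >= 30).select("name").show()

spark.stop()
```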

Apache Spark can run on a single machine or on a multi-node cluster. It was created to address the limitations of MapReduce by processing data in memory, and it reuses data through an in-memory cache to speed up machine learning algorithms that repeatedly call a function on the same dataset.

How Apache Spark Works:
Spark was created to address the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations. Where MapReduce reads from and writes back to disk between steps, Spark needs only one step: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution. Spark also reuses data via an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset. Data reuse is accomplished through the creation of DataFrames, an abstraction over the Resilient Distributed Dataset (RDD), which is a collection of objects cached in memory and reused across multiple Spark operations. This dramatically lowers latency, making Spark multiple times faster than MapReduce, especially for machine learning and interactive analytics.
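Here is a minimal sketch of that data-reuse idea in PySpark. The file "events.parquet" and the column name "event_type" are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingExample").getOrCreate()

events = spark.read.parquet("events.parquet")

# Cache the DataFrame so repeated actions reuse the in-memory copy
# instead of re-reading and re-parsing the source files.
events.cache()

# Both actions below run against the cached data after the first one
print(events.count())
events.groupBy("event_type").count().show()

events.unpersist()
spark.stop()
```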
The Spark framework includes:

  1. Spark Core as the foundation for the platform
  2. Spark SQL for interactive queries
  3. Spark Streaming for real-time analytics
  4. Spark MLlib for machine learning
  5. Spark GraphX for graph processing

One of the key benefits of Apache Spark:
Fast
Through in-memory caching and optimized query execution, Spark can run fast analytic queries against data of any size.

Architecture
PySpark works on a master-worker model. The master is the "driver" and the worker nodes are referred to as "workers". The application creates a SparkContext in the driver program, and the driver interacts with the workers to distribute the work.
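As a quick illustration, here is how a PySpark application creates its entry point. The "local[4]" master URL is an assumption for running driver and workers on one machine with 4 cores; in a real cluster this would be a cluster manager URL instead.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ArchitectureExample")
    .master("local[4]")  # assumption: local mode with 4 cores
    .getOrCreate()
)

# The SparkContext created by the session is the handle the driver uses
# to coordinate work across the workers.
print(spark.sparkContext.master)
print(spark.sparkContext.appName)

spark.stop()
```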


In the next post, we will start with simple examples of PySpark DataFrames.
