
πŸš€ Day 1: Introduction to Apache Spark

Welcome to Day 1 of the 60 Day Spark Mastery Series!

Let’s begin with the fundamentals.

🌟 What is Apache Spark?

Apache Spark is a lightning-fast distributed computing engine used for processing massive datasets.
It powers the data engineering pipelines of companies like Netflix, Uber, Amazon, Spotify, and Airbnb.

Spark’s superpower is simple:

It processes data in-memory, which can make it 10–100x faster than Hadoop MapReduce for many workloads.

⚑ Why Should Data Engineers Learn Spark?

Here are reasons Spark is the industry standard:

  • Works with huge datasets (TBs/PBs)
  • Built for batch + streaming + machine learning
  • Runs on GCP, AWS, Databricks, Kubernetes, Hadoop
  • Has easy APIs in Python (PySpark), SQL, Scala
  • Built-in optimizations from Spark’s Catalyst Optimizer

πŸ”₯ Spark Ecosystem Overview

Spark is not just a computation engine; it’s a full ecosystem:

1. Spark Core

Handles scheduling, memory management, and fault tolerance.

2. Spark SQL

Lets you run SQL queries and work with DataFrames (see the example after this list).

3. Structured Streaming

Real-time data pipelines (Kafka, sockets, event logs)

4. MLlib

Machine learning algorithms, built for scalable ML workloads.

5. GraphX

Graph processing engine (less used but powerful)
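
To make the Spark SQL piece concrete, here is a minimal sketch. The tiny in-memory dataset and its column names are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EcosystemDemo").getOrCreate()

# A tiny hypothetical dataset, just to have something to query
df = spark.createDataFrame(
    [("Alice", 1200.0), ("Bob", 800.0)],
    ["name", "amount"],
)

# Same query two ways: the DataFrame API...
df.filter(df.amount > 1000).show()

# ...and plain SQL over a temporary view
df.createOrReplaceTempView("sales")
spark.sql("SELECT name, amount FROM sales WHERE amount > 1000").show()

Both versions go through the same Catalyst Optimizer, so you can pick whichever style your team prefers.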

🧠 How Spark Executes Your Code Internally

Understanding Spark internals is key to becoming a senior-level engineer.

πŸ”Ή Step 1: Driver Program Starts

The driver analyzes your job and creates a logical plan.

πŸ”Ή Step 2: DAG (Directed Acyclic Graph) Creation

Spark breaks transformations into a DAG.

πŸ”Ή Step 3: DAG Scheduler β†’ Stages β†’ Tasks

Stages are based on shuffle boundaries.
Tasks run in parallel across executors.

πŸ”Ή Step 4: Executors Run Tasks

Executors are the worker processes that run the tasks, process the data, and store the results.

This architecture gives Spark its scalability and speed.
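
You can see these stage boundaries yourself: a wide transformation such as groupBy forces a shuffle, and explain() prints the resulting plan. A minimal sketch (the data here is just spark.range, nothing real):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DagDemo").getOrCreate()

# groupBy is a wide transformation, so it introduces a shuffle boundary,
# which the DAG scheduler turns into separate stages
agg = (
    spark.range(1_000_000)
         .groupBy((F.col("id") % 10).alias("bucket"))
         .count()
)

# Look for the "Exchange" operator in the printed plan –
# that is the shuffle (and therefore stage) boundary
agg.explain()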

⏳ Lazy Evaluation: Transformations don’t execute immediately.

Example:

# Transformations: Spark only records these in its query plan
df = spark.read.csv("sales.csv", header=True)
filtered = df.filter(df.amount > 1000)

Nothing runs until you call:

# Action: now Spark actually reads the file and applies the filter
filtered.show()

This helps Spark:

  1. Optimize the whole query
  2. Reduce stages
  3. Avoid unnecessary work
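
One way to see this for yourself is to ask Spark for the plan before calling any action; explain() prints what the Catalyst Optimizer produced without running the job (continuing the sales.csv sketch above):

# Prints the optimized query plan – still no job has run
filtered.explain()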

πŸ›  Create Your First SparkSession

from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession – the entry point to any Spark application
spark = SparkSession.builder \
    .appName("Day1IntroToSpark") \
    .getOrCreate()

# Build a simple DataFrame with the numbers 0–9 and print it
df = spark.range(10)
df.show()
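
If you run this as a quick local script, it’s good practice to stop the session when you’re done:

# Shut down the SparkSession and release its resources
spark.stop()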

Follow for more such content, and let me know in the comments if I missed anything. Thank you!
