Create a cluster with PySpark

Goal

Create a PySpark cluster with Python, based on this repo & on this video.

Here is the code repository

Description

We are going to create three environments:
1- Master: ports 8080 & 7077
2- Worker-1: port 8081
3- Worker-2: port 8082

All of them are going to share the same volume on /opt/workspace, and they are also going to have the same source files.
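
Later, once the cluster from step 4 is running, you can confirm the shared volume and its mount with a quick check like this (just a sketch; the volume name Pyspark comes from the docker-compose file shown below):

# Inspect the named volume declared in docker-compose.yml
docker volume inspect Pyspark
# List the shared workspace from inside the master container
docker exec spark-master ls /opt/workspace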

To check that everything works, we are going to test this script:

from pyspark.sql import SparkSession

# Connect to the standalone cluster through the master's URL
spark = SparkSession \
        .builder \
        .appName("TestCluster") \
        .master("spark://spark-master:7077") \
        .getOrCreate()

# Read the sample CSV: header row, inferred schema, ';' as delimiter
df = spark \
        .read.options(
                header='True'
                , inferSchema='True'
                , delimiter=';') \
        .csv("src/Travel Company New Clients.csv")
df.count()
df.printSchema()
df.select("FamilyMembers").distinct().show()

# Map every row to a (FamilyMembers, 1) pair and count rows per value
def myFunc(s):
    return [(s["FamilyMembers"], 1)]

lines = df.rdd.flatMap(myFunc).reduceByKey(lambda a, b: a + b)
famColumns = ["FamilyMembers", "Num_reg"]
dataColl = lines.toDF(famColumns)
dataColl.show(truncate=False)





Prerequisites

  • Docker version 23.0.5
  • Openjdk:8-jre-slim
  • Python: 3
  • Spark version: 3.4.0
  • Hadoop version: 3
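
Before starting, you can quickly confirm the Docker version on the host; the other versions are pinned inside the images through the .env file below.

# Docker 23.x (or newer) is assumed on the host
docker --version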

Steps:

Firstly, we should build the base Apache Spark image (with the common installation, files, ...). The other three images are going to be based on this first one.

For this reason, we are going to split our flow into four parts:
1- Create a bash script which is going to build the first image
2- Create a Dockerfile for the master based on (1)
3- Create a Dockerfile for the workers based on (1)
4- Create a docker-compose file to bring up (2 & 3)
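
A possible layout for all of these files could look like this (build.sh is only an assumed name for the bash script; the other paths appear in the snippets below):

.
├── .env                  # versions & directories (step 0)
├── Dockerfile            # spark-base image (step 1)
├── Dockerfile.master     # master image (step 2)
├── Dockerfile.worker     # worker image (step 3)
├── docker-compose.yml    # cluster definition (step 4)
├── build.sh              # build script for spark-base (assumed name)
├── script/
│   └── main.py           # the test script above
└── src/
    └── Travel Company New Clients.csv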

0 - Env variables

In future steps, these variables will be used to pin versions and to set the working directory.


# Version
## JDK
JDK_VERSION=8-jre-slim
## Hadoop
HAD_VERSION=3
## Spark
SPA_VERSION=3.4.0
# Directories
WORK_DIR=/opt/workspace


1 - Create Apache Spark

This could be the hardest step. First, we should know which tools we are going to use:

  • Openjdk: the basic framework to run PySpark.
  • Python3: we could use Scala, but in my case I prefer to use Python.
  • Curl: we need it to download Spark from the official Apache repository.
  • Vim: only in case we need to edit something.

Variables

  • jdk_version
  • spark_version
  • hadoop_version
  • shared_wk : where we are going to save our main files

Main Dockerfile

For this reason, our Dockerfile should be:

# Layer - Image
ARG jdk_version
FROM openjdk:${jdk_version}

# Layer - Arguments in dockerfile
ARG spark_version
ARG hadoop_version
ARG shared_wk
ARG py_cmd


# Layer - OS + Directories
RUN apt-get update -y
RUN mkdir -p ${shared_wk}/data
RUN mkdir -p /usr/share/man/man1

# Layer - Prerequisites
RUN apt-get -y install curl
RUN apt-get -y install vim
RUN apt-get -y install python3
RUN ln -s /usr/bin/python3 /usr/bin/python

# Layer - Download and install Spark from the official archive
RUN curl https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz
RUN tar -xvzf spark.tgz
RUN mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/
RUN echo "alias pyspark=/usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/bin/pyspark" >> ~/.bashrc
RUN echo "alias spark-shell=/usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/bin/spark-shell" >> ~/.bashrc
RUN mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs
RUN rm spark.tgz
# Sanity check: JAVA_HOME is provided by the openjdk base image
RUN echo "JAVA_HOME=${JAVA_HOME}"

# Layer - Environment
ENV SPARK_HOME /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}
ENV SPARK_MASTER_HOST spark-master
ENV SPARK_MASTER_PORT 7077
ENV PYSPARK_PYTHON python3

# Layer - Move files to execute
## Data of the execution
RUN mkdir -p /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/src
COPY ./src/*  /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/src/
## Script of the execution
RUN mkdir -p /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/script
COPY ./script/*  /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/script/

# Layer - Volume & workdir
VOLUME ${shared_wk}
WORKDIR ${SPARK_HOME}

Bash script

To build this image, we are going to use a bash script:

#!/bin/bash
# Load variables from .env 
set -o allexport;
source .env;
set +o allexport;


# Create Spark Base 
docker build \
        --build-arg jdk_version=${JDK_VERSION} \
        --build-arg hadoop_version=${HAD_VERSION} \
        --build-arg spark_version=${SPA_VERSION} \
        --build-arg shared_wk=${WORK_DIR} \
        -f Dockerfile \
        -t spark-base .;
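
Assuming the script above is saved as build.sh (the file name is my assumption), building the base image looks like this:

# Make the script executable and run it
chmod +x build.sh
./build.sh
# Check that the spark-base image exists
docker image ls spark-base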


2 - Create Apache Spark - Master

With this first image created, we can create the master's image:

FROM spark-base

CMD bin/spark-class org.apache.spark.deploy.master.Master >> logs/spark-master.out;bin/spark-submit --master spark://spark-master:7077 script/main.py

3 - Create Apache Spark - Worker

Then, we need to create another image for the workers:

FROM spark-base

# Layer - Environment
ENV SPARK_MASTER_HOST spark-master    
ENV SPARK_MASTER_PORT 7077   



CMD bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} >> logs/spark-worker.out


4 - Environment

We need to create a docker-compose file to tie the whole structure together. I'll use this one:


version: "3.9"
volumes:
  shared-workspace:
    name: "Pyspark"
    driver: local
services:
  spark-master:
    build:
      context: .
      dockerfile: Dockerfile.master
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    volumes:
      - shared-workspace:${WORK_DIR}

  spark-worker-1:
    build:
      context: .
      dockerfile: Dockerfile.worker
    container_name: spark-worker-1
    ports:
      - "8081:8081"
    volumes:
      - shared-workspace:${WORK_DIR}
    depends_on:
      - spark-master
  spark-worker-2:
    build:
      context: .
      dockerfile: Dockerfile.worker
    container_name: spark-worker-2
    ports:
      - "8082:8082"
    volumes:
      - shared-workspace:${WORK_DIR}
    depends_on:
      - spark-master
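
With all the files in place, the cluster can be built and started from the project directory. Note that docker compose reads the .env file automatically, so ${WORK_DIR} is resolved; build the spark-base image with the bash script from step 1 first, and use docker-compose instead of docker compose if you run the standalone binary.

# Build and start the master and both workers in the background
docker compose up -d --build
# Check that the three containers are running
docker compose ps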


5 - Test it

Once everything is up and running, I'll open a terminal inside the master container and run this command:


./bin/spark-submit --master spark://spark-master:7077 script/main.py


And it looks like it works great:

(Screenshot: test of the cluster)
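
You can also check from the host that both workers registered with the master by looking at the master web UI on port 8080 (the grep pattern below is an assumption about the page's HTML; opening http://localhost:8080 in a browser works just as well):

# The master UI reports the number of alive workers
curl -s http://localhost:8080 | grep -i "alive workers"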

Have a nice day:)!
