Create a cluster with PySpark

Goal

Create a PySpark cluster with Python, based on this repo & on this video.

Here is the code repository

Description

We are going to create three environments:
1- Master: ports 8080 & 7077
2- Worker-1: port 8081
3- Worker-2: port 8082

All of them are going to share the same volume on /opt/workspace, and they are also going to have the same source files.
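
Later, once the cluster from step 4 is running, you can confirm the shared volume and its mount with a quick check like this (just a sketch; the volume name Pyspark comes from the docker-compose file shown below):

# Inspect the named volume declared in docker-compose.yml
docker volume inspect Pyspark
# List the shared workspace from inside the master container
docker exec spark-master ls /opt/workspace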

To check that everything works, we are going to test this script:

from pyspark.sql import SparkSession

# Connect to the standalone cluster through the master's URL
spark = SparkSession \
        .builder \
        .appName("TestCluster") \
        .master("spark://spark-master:7077") \
        .getOrCreate()

# Read the sample CSV: header row, inferred schema, ';' as delimiter
df = spark \
        .read.options(
                header='True'
                , inferSchema='True'
                , delimiter=';') \
        .csv("src/Travel Company New Clients.csv")
df.count()
df.printSchema()
df.select("FamilyMembers").distinct().show()

# Map every row to a (FamilyMembers, 1) pair and count rows per value
def myFunc(s):
    return [(s["FamilyMembers"], 1)]

lines = df.rdd.flatMap(myFunc).reduceByKey(lambda a, b: a + b)
famColumns = ["FamilyMembers", "Num_reg"]
dataColl = lines.toDF(famColumns)
dataColl.show(truncate=False)





Prerequisites

  • Docker version 23.0.5
  • Openjdk:8-jre-slim
  • Python: 3
  • Spark version: 3.4.0
  • Hadoop version: 3
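
Before starting, you can quickly confirm the Docker version on the host; the other versions are pinned inside the images through the .env file below.

# Docker 23.x (or newer) is assumed on the host
docker --version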

Steps:

Firstly, we should build the base Apache Spark image (with the common installation, files, ...). The other three images are going to be based on this first one.

For this reason, we are going to split our flow into four parts:
1- Create a bash script which is going to build the first image
2- Create a Dockerfile for the master based on (1)
3- Create a Dockerfile for the workers based on (1)
4- Create a docker-compose file to bring up (2 & 3)
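
A possible layout for all of these files could look like this (build.sh is only an assumed name for the bash script; the other paths appear in the snippets below):

.
├── .env                  # versions & directories (step 0)
├── Dockerfile            # spark-base image (step 1)
├── Dockerfile.master     # master image (step 2)
├── Dockerfile.worker     # worker image (step 3)
├── docker-compose.yml    # cluster definition (step 4)
├── build.sh              # build script for spark-base (assumed name)
├── script/
│   └── main.py           # the test script above
└── src/
    └── Travel Company New Clients.csv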

0 - Env variables

In future steps, these variables will be used to pin versions and to set the working directory.


# Version
## JDK
JDK_VERSION=8-jre-slim
## Hadoop
HAD_VERSION=3
## Spark
SPA_VERSION=3.4.0
# Directories
WORK_DIR=/opt/workspace


1 - Create Apache Spark

This could be the hardest step. First, we should know which tools we are going to use:

  • Openjdk: the basic framework to run PySpark.
  • Python3: we could use Scala, but in my case I prefer to use Python.
  • Curl: we need it to download Spark from the official Apache repository.
  • Vim: only in case we need to edit something.

Variables

  • jdk_version
  • spark_version
  • hadoop_version
  • shared_wk : where we are going to save our main files

Main Dockerfile

For this reason, our Dockerfile should be:

# Layer - Image
ARG jdk_version
FROM openjdk:${jdk_version}

# Layer - Arguments in dockerfile
ARG spark_version
ARG hadoop_version
ARG shared_wk
ARG py_cmd


# Layer - OS + Directories
RUN apt-get update -y
RUN mkdir -p ${shared_wk}/data
RUN mkdir -p /usr/share/man/man1

# Layer - Prerequisites
RUN apt-get -y install curl
RUN apt-get -y install vim
RUN apt-get -y install python3
RUN ln -s /usr/bin/python3 /usr/bin/python

# Layer - Download and install Spark from the official archive
RUN curl https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz
RUN tar -xvzf spark.tgz
RUN mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/
RUN echo "alias pyspark=/usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/bin/pyspark" >> ~/.bashrc
RUN echo "alias spark-shell=/usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/bin/spark-shell" >> ~/.bashrc
RUN mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs
RUN rm spark.tgz
# Sanity check: JAVA_HOME is provided by the openjdk base image
RUN echo "JAVA_HOME=${JAVA_HOME}"

# Layer - Environment
ENV SPARK_HOME /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}
ENV SPARK_MASTER_HOST spark-master
ENV SPARK_MASTER_PORT 7077
ENV PYSPARK_PYTHON python3

# Layer - Move files to execute
## Data of the execution
RUN mkdir -p /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/src
COPY ./src/*  /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/src/
## Script of the execution
RUN mkdir -p /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/script
COPY ./script/*  /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/script/

# Layer - Volume & workdir
VOLUME ${shared_wk}
WORKDIR ${SPARK_HOME}

Bash script

To build this image, we are going to use a bash script:

#!/bin/bash
# Load variables from .env 
set -o allexport;
source .env;
set +o allexport;


# Create Spark Base 
docker build \
        --build-arg jdk_version=${JDK_VERSION} \
        --build-arg hadoop_version=${HAD_VERSION} \
        --build-arg spark_version=${SPA_VERSION} \
        --build-arg shared_wk=${WORK_DIR} \
        -f Dockerfile \
        -t spark-base .;
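
Assuming the script above is saved as build.sh (the file name is my assumption), building the base image looks like this:

# Make the script executable and run it
chmod +x build.sh
./build.sh
# Check that the spark-base image exists
docker image ls spark-base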


2 - Create Apache Spark - Master

With this first image created, we can create the master's image:

FROM spark-base

CMD bin/spark-class org.apache.spark.deploy.master.Master >> logs/spark-master.out;bin/spark-submit --master spark://spark-master:7077 script/main.py

3 - Create Apache Spark - Worker

Then, we need to create another image for the workers:

FROM spark-base

# Layer - Environment
ENV SPARK_MASTER_HOST spark-master    
ENV SPARK_MASTER_PORT 7077   



CMD bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} >> logs/spark-worker.out


4 - Environment

We need to create a docker-compose file to tie the whole structure together. I'll use this one:


version: "3.9"
volumes:
  shared-workspace:
    name: "Pyspark"
    driver: local
services:
  spark-master:
    build:
      context: .
      dockerfile: Dockerfile.master
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    volumes:
      - shared-workspace:${WORK_DIR}

  spark-worker-1:
    build:
      context: .
      dockerfile: Dockerfile.worker
    container_name: spark-worker-1
    ports:
      - "8081:8081"
    volumes:
      - shared-workspace:${WORK_DIR}
    depends_on:
      - spark-master
  spark-worker-2:
    build:
      context: .
      dockerfile: Dockerfile.worker
    container_name: spark-worker-2
    ports:
      - "8082:8082"
    volumes:
      - shared-workspace:${WORK_DIR}
    depends_on:
      - spark-master
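
With all the files in place, the cluster can be built and started from the project directory. Note that docker compose reads the .env file automatically, so ${WORK_DIR} is resolved; build the spark-base image with the bash script from step 1 first, and use docker-compose instead of docker compose if you run the standalone binary.

# Build and start the master and both workers in the background
docker compose up -d --build
# Check that the three containers are running
docker compose ps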


5 - Test it

Once everything is up and running, I'll open a terminal inside the master container and run this command:


./bin/spark-submit --master spark://spark-master:7077 script/main.py


And it looks like it works great:

(Screenshot: test of the cluster)
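
You can also check from the host that both workers registered with the master by looking at the master web UI on port 8080 (the grep pattern below is an assumption about the page's HTML; opening http://localhost:8080 in a browser works just as well):

# The master UI reports the number of alive workers
curl -s http://localhost:8080 | grep -i "alive workers"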

Have a nice day:)!
