DEV Community

Need help dockerizing Spark

kambala yashwanth

I have been working with Docker, where I have to run a Spark application.
I tried the Spark images from the Docker repository but ran into issues, so I built my own.

It works, but every run downloads Spark again, and I lose the logs of previously run jobs.

My requirements

  1. Is it possible to have a separate Spark image and supply app.jar to it?

  2. Instead of writing logs inside the container, can I direct them to the host file system?

Dockerfile

FROM alpine

ENV SPARK_VERSION=2.2.0
ENV HADOOP_VERSION=2.7

# Install the download and extraction tools
RUN apk add --no-cache tar aria2

# WORKDIR creates /spark if it does not exist
WORKDIR /spark

# Copy app.properties into the image
COPY app.properties .

# Copy the application jar
# (built at /home/exa9/SparkSubmit/App/target/App-0.0.1-SNAPSHOT.jar)
COPY target/App-0.0.1-SNAPSHOT.jar app.jar

# Download Apache Spark
RUN aria2c -x16 http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

# Install the Java runtime and extract Spark
RUN apk add --no-cache curl bash openjdk8-jre \
      && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

WORKDIR /spark/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}/bin
CMD ./spark-submit --class com.Spark.Test.SparkApp.App --master "local[*]" /spark/app.jar /spark/app.properties






Discussion (1)

Shawon Ashraf

You can mount a host directory as a volume in your container and store the logs there. That way the logs survive after the container is removed. As for the Spark re-download issue, you'll have to find another way to include the Spark binary. Since you're writing a Java application, using Maven or Gradle would make that a lot easier; it would be just a build script away!
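
Both suggestions can be sketched in one Dockerfile. This is only a rough sketch under some assumptions, not a tested setup: the image tag spark-base:2.2.0, the file name Dockerfile.spark, and the log path /spark/logs are made-up names for illustration. The idea is that the image holds only Spark and Java, so it is built once, and the jar, the properties file, and a log directory are bind-mounted at run time. In the Dockerfile from the question, the download step comes after the jar is copied in, so Docker's layer cache for the download is invalidated every time the jar changes.

```dockerfile
# Hypothetical base image, built once:
#   docker build -t spark-base:2.2.0 -f Dockerfile.spark .
FROM alpine

ENV SPARK_VERSION=2.2.0
ENV HADOOP_VERSION=2.7
ENV SPARK_HOME=/spark/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}

RUN apk add --no-cache tar aria2 bash openjdk8-jre

WORKDIR /spark

# Download, extract, and clean up in one layer; nothing app-specific
# precedes this step, so the layer stays cached between builds
RUN aria2c -x16 http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
      && tar -xzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
      && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

# /spark/app.jar, /spark/app.properties and /spark/logs are expected to be
# bind-mounted at run time (assumed paths); console output is also appended
# to a file in the mounted log directory
CMD ${SPARK_HOME}/bin/spark-submit --class com.Spark.Test.SparkApp.App --master "local[*]" /spark/app.jar /spark/app.properties 2>&1 | tee -a /spark/logs/spark-submit.log

# Run with the jar, properties, and a host log directory mounted:
#   docker run --rm \
#     -v "$PWD/target/App-0.0.1-SNAPSHOT.jar:/spark/app.jar" \
#     -v "$PWD/app.properties:/spark/app.properties" \
#     -v "$PWD/logs:/spark/logs" \
#     spark-base:2.2.0
```

With this layout a new jar needs no image rebuild at all, and the file under ./logs on the host keeps growing across runs. One caveat of the pipe through tee is that the container's exit code no longer reflects whether spark-submit succeeded.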