When working with Spark, we usually want to prototype locally first to see that everything works as expected, before we start up big machines.
I spent an afternoon googling and starting and stopping the Docker container to get a few lines of configuration right, so I want to share my basic local setup here in the hope that it saves someone some time.
When looking for a Docker image with Spark and Jupyter, we find the pyspark-notebook. In my case I need to access AWS, so I need some additional libraries in the Docker image. To add them, I created a new Dockerfile based on the pyspark-notebook image.
The additional libraries needed are boto3 for AWS and python-dotenv to access environment variables.
I decided to install boto3 with apt-get, as this installs it on the operating-system level. Make sure to add the -y flag, so that if the package manager asks something during the install process, we answer with yes.
python-dotenv is added via a requirements.txt, so it is installed via pip, the Python package manager.
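The requirements.txt referenced in the Dockerfile then only needs a single entry for this setup (a minimal sketch; any further pip packages you need would go here as well):
python-dotenv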
Normally the notebooks require a token, but when we develop locally, we want to access the Jupyter notebook quickly and stay on the same site, without having to look up the new token every time we change something.
So we need a custom configuration for that:
{
  "NotebookApp": {
    "allow_root": true,
    "token": ""
  }
}
In the Dockerfile we copy everything we need into the /home/jovyan/ directory. After some more googling I found out that the user jovyan stands for a Jupyter-like environment, just in case you were also wondering.
The final Dockerfile looks like this:
FROM jupyter/pyspark-notebook
USER root
# add needed packages
RUN apt-get update && apt-get install python3-boto3 -y
# Install Python requirements
COPY requirements.txt /home/jovyan/
RUN pip install -r /home/jovyan/requirements.txt
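# Copy the custom Jupyter configuration (disables the token)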
COPY jupyter_lab_config.json /home/jovyan/
In the docker-compose.yaml we
- need to map the ports,
- map the volumes to save the notebooks locally, otherwise everything would be lost once we shut down the container,
- tell Docker where the .env file is located,
- tell Docker to build the Dockerfile in the same folder, instead of using an image.
The final docker-compose.yaml
looks like this:
version: "3.7"
services:
  # jupyterlab with pyspark
  pyspark:
    #image: jupyter/pyspark-notebook
    build: .
    env_file:
      - .env
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - ./data:/home/jovyan/work

# docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
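The .env file sits in the same folder as the docker-compose.yaml and holds the values the container should see as environment variables. A minimal sketch, assuming you keep your AWS credentials there (placeholder values, do not commit this file):
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_DEFAULT_REGION=eu-central-1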
To start the container, use docker-compose up. If you changed something in the config, use docker-compose up --force-recreate --build to make sure the changes are built.
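Once the notebook is running on localhost:8888, a quick sanity check in a notebook cell can confirm that python-dotenv and boto3 are wired up. This is just a sketch and assumes the .env contains AWS credentials like the placeholders above:
import os
import boto3
from dotenv import load_dotenv

# Load variables from a .env file into the environment
# (a no-op for variables docker-compose already injected)
load_dotenv()

# Confirm the credentials are visible inside the container
print("Region:", os.environ.get("AWS_DEFAULT_REGION"))

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
s3 = boto3.client("s3")
print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])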
Have fun.
You can also find the code here.