Barbara

Jupyter notebooks for Spark with customised Docker containers

When we work with Spark, we usually want to prototype first to see if everything works as expected, before we start up big machines.
I spent an afternoon googling and starting and stopping the Docker container before I finally had the few lines of configuration right.
So I want to share my basic local setup here; maybe it will save someone some time.

When looking for a Docker image with Spark and Jupyter, we find the jupyter/pyspark-notebook image.

In my case I need to access AWS, so the Docker image needs some additional libraries.
To add them, I created a new Dockerfile based on pyspark-notebook.
The additional libraries are boto3 for AWS and python-dotenv for reading environment variables.
I decided to install boto3 with apt-get, as it will be installed at the operating-system level. Make sure to add -y, so that any question the package manager asks during the install is automatically answered with yes.
python-dotenv is added via a requirements.txt, so it will be installed via pip, the Python package manager.
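For completeness, the requirements.txt in this setup can be as small as a single line (pin a version if you want reproducible builds):

python-dotenv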

Normally you need a token to access the notebooks, but when we develop locally we want to reach the Jupyter notebook quickly and stay on the same site, without having to look up a new token every time we change something.
So we need a custom configuration for that:

{
    "NotebookApp": {
        "allow_root": true,
        "token": ""
    }
}

In the Dockerfile we copy everything we need into the /home/jovyan/ directory. After some more googling I found out that the user jovyan is the default user of the Jupyter Docker images, a play on "Jovian", an inhabitant of Jupiter. Just in case you were also wondering.

The final Dockerfile looks like this:

FROM jupyter/pyspark-notebook
USER root

# add needed packages
RUN apt-get update && apt-get install python3-boto3 -y

# Install Python requirements
COPY requirements.txt /home/jovyan/
RUN pip install -r /home/jovyan/requirements.txt

COPY jupyter_lab_config.json /home/jovyan/
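With the image built, the two extra libraries can be exercised from a notebook. Here is a minimal sketch, assuming the .env file defines the standard AWS credential variables; the bucket name is purely hypothetical:

import os

import boto3
from dotenv import load_dotenv

# Load variables from a .env file into os.environ (a no-op for
# variables that docker-compose's env_file already injected).
load_dotenv()

# boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
# environment automatically.
s3 = boto3.client("s3", region_name=os.environ.get("AWS_DEFAULT_REGION", "eu-central-1"))

# List a few objects from a hypothetical bucket to check the credentials.
response = s3.list_objects_v2(Bucket="my-example-bucket", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"])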

In the docker-compose.yaml we

  • map the ports,
  • map a volume to save the notebooks locally, otherwise everything would be lost once we shut down the container,
  • tell Docker where the .env file is located (a hypothetical example is sketched below the compose file),
  • tell Docker to build the Dockerfile in the same folder, instead of using an image.

The final docker-compose.yaml looks like this:

version: "3.7"

services:
  # jupyterlab with pyspark
  pyspark:
    #image: jupyter/pyspark-notebook
    build: .
    env_file: 
      - .env
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - ./data:/home/jovyan/work

# docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
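The .env file itself stays on your machine (make sure it is in .gitignore). A hypothetical example with placeholder values:

# .env — placeholder values, replace with your own credentials
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=eu-central-1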

To start the container, run docker-compose up. If you changed something in the configuration, run docker-compose up --force-recreate --build to make sure the changes are actually built.
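Once the container is up, JupyterLab is reachable at http://localhost:8888 and, thanks to the config above, no token is required. A minimal sketch to verify that Spark itself works inside a notebook:

from pyspark.sql import SparkSession

# Start a local Spark session inside the notebook container.
spark = SparkSession.builder.appName("smoke-test").getOrCreate()

# A tiny DataFrame is enough to confirm that Spark executes jobs.
spark.range(5).toDF("n").show()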

Have fun.

You can also find the code here.
