Olga Braginskaya

Posted on Sep 22, 2023 • Edited on Nov 17 • Originally published at datobra.com

The Data Engineering Docker-Compose Starter Kit

#docker #tutorial #development #dataengineering

We’ve all been there. Starting a new data engineering project and, instead of diving right into the fun stuff, we’re stuck setting up configurations. Again.

So, I’ve put together a Docker Compose config with a bunch of services you’ll often see in data engineering projects. Just pull it up on your computer and you’re all set to kick off your own thing. Now, instead of building from scratch, you can jump straight into the action.

Hint: If you don’t know what Docker and Docker Compose are, now’s the perfect time to go find out.

Why We’re Here and What’s the Deal

Here you can find my starter kit that includes the following services/tools:

RedPanda: Think of it as your data’s personal post office. It’s similar to Kafka, but many say it’s cooler. You send data in, and RedPanda makes sure it gets to where it’s supposed to go, all in real-time. Ideal for managing loads of data and crafting event-driven applications.
Python Consumer Service: This is a tool that’s always active, monitoring specific topic from RedPanda. It grabs incoming data and takes action. For now, I’ve set it to store messages in a table in MySQL, but you can customize it to do much more complex tasks.
MySQL: You’ve probably heard of it, super popular. Great for when we need to keep things organized and query data quickly.
S3 (Minio): It’s our digital storage room. Just like Amazon’s S3, but we can set it up on our local machines for free.
Mage.ai: Consider it the coordinator for all our data activities. It’s similar to Airflow but sports a fresher look and added flair. It sets the when and how for data tasks, making organization simple and stylish.
Jupyter Notebook: It’s like our digital lab notebook. We can write code, run it, and see the results all in one place. Makes experimenting and playing with data super fun and easy.

So, with all these tools combined, we’re basically setting up a state-of-the-art digital workshop to handle, play with, and make the most out of our data. Fun times!

Data’s Dance Across the Stage

Now, let’s transition and focus our attention on Docker.

Docker lets you create, ship, and run applications inside containers. These containers pack everything an application needs from libraries to dependencies. So, when you move your application to a different machine with Docker, it just works. No more “it works on my machine” headaches.

So, what happens when our application has a bunch of containers? Enter Docker Compose.

Docker Compose is like a coordinator for a multi-container orchestra. Instead of manually creating and managing individual containers, developers can define their whole setup in one file: docker-compose.yml. This file lists all the services, networks, and storage spaces our app needs. And the best part? One command, docker-compose up, brings the whole setup to life.

Now that we’ve talked about the power of Docker and the organizing genius of Docker Compose, you’re probably wondering how to get all this magic working on your machine. Good news: I’ve made it super easy for you. The next steps will guide you through running my ready-to-use docker-compose.yml file.

First, let’s create a file for passwords and secrets in the root folder of the project. This file has already been added to the .gitignore file, so it won't go public.

touch .secrets

Inside the .secrets file, we'll store the password for the MySQL database and save it.

MYSQL_ROOT_PASSWORD=very_strong_password

Time to get things rolling. Execute the command docker-compose up from terminal and see it all come to life.

In just a few moments, here’s what you’ll have ready and waiting:

RedPanda UI: Accessible at localhost:8080. You'll find a pre-created topic named "my-topic".
Mage: Head to localhost:6789 to see "my_project" already set up for you.
Jupyter: Launch it at localhost:8888 and you'll find a notebook titled "my-notebook.ipynb".
Minio: Visit localhost:9001 and inside you'll discover a bucket named "my-bucket" with a "data.csv" file.
MySQL: Available at localhost:3306, there's a database named "my_database" waiting for you.
Python Consumer Service: This is working behind the scenes and will create a “sessions” table in the “my_database”.

For illustration purposes I’ve set up two data flows to highlight how these services interact. You’re free to configure them as you wish.

Event-Driven App Flow: Stream events into the Python service and record them in MySQL.
ETL Flow: This involves a data pipeline in Mage. Here’s what it does:

Reads files from s3 (using Minio)
Aggregates data with pandas
Saves results to MySQL
And as a bonus, you can explore the data in Jupyter, ideal for data analytics and scientists.

First flow: RedPanda -> Python Consumer Service -> MySQL table

Head over to RedPanda at http://localhost:8080/topics. You’ll notice the “my-topic” is already set up for you.

Click on “my-topic”, then select “Publish message”.

When you publish a message, our Python consumer service will record it in the table defined in the code block below. You don’t need to provide an ‘id’ since it auto-increments. Also, ‘session_id’ is automatically generated. So, all you need to send is a JSON with the ‘user_id’ like this: {"user_id": }.

class Sessions(Base):
    __tablename__ = 'sessions'

    id = Column(Integer, primary_key=True, autoincrement=True)
    session_id = Column(String(255), default=uuid_gen, nullable=False)
    user_id = Column(Integer, nullable=False)

Go ahead and send a test message.

Great, as you can see the message is published.

If you have Docker Desktop installed, check the Python consumer service log. You should spot our recently published message there.

Alternatively, you can achieve the same using the command line with docker ps and docker logs commands.

Let’s inspect the database.

Install DBeaver community version on your computer and launch it. Your new connection settings should resemble the following:

Host: localhost
Port: 3306
Database: my_database
Username: root
Password: very_strong_password

Additionally, navigate to “Driver Settings” -> “Driver Properties” and input the settings as shown in the screenshot below:

If you’ve set up the connection correctly, this is what you should see:

You’ll likely observe that the “sessions” table already exists. Let’s run a query on it.

SELECT *
FROM my_database.sessions;

And the result:

Done! We can see our test message.

Second flow: S3(minio) -> Mage -> MySQL -> Jupyter Notebook

Access Minio at http://localhost:9001/login. Use both the username and password as “minioadmin”. Once logged in, you’ll find our pre-created bucket containing a file.

The “data.csv” file contains historical weather data from Tel Aviv, sourced from another article I wrote. If you wish, you can swap it with a public dataset, for example, from here.

Next, navigate to Mage at localhost:6789. Inside, you’ll discover a pre-established “my_project” containing a single pipeline named “my_pipeline.”

Upon inspecting the pipeline, you’ll notice its workflow: it goes from S3 load to a transformer (pandas) and finally to MySQL export.

By clicking “Edit” on “my_pipeline”, you’ll be presented with all the associated files and the code for each step.

You might wonder how Mage recognizes connections to S3 or MySQL. In the list of files on the left, there’s an io_config.yaml file detailing these connection settings. Within it, there's a MySQL section pointing to my_database and an AWS section for Minio.

version: 0.1.1
default:
  # Default profile created for data IO access.
  # Add your credentials for the source you use, and delete the rest.
  # AWS
  AWS_ACCESS_KEY_ID: ""
  AWS_SECRET_ACCESS_KEY: ""
  AWS_REGION: us-east-1
  AWS_ENDPOINT: http://minio:9000

  # MySQL
  MYSQL_DATABASE: my_database
  MYSQL_HOST: mysql-db
  MYSQL_PASSWORD: very_strong_password
  MYSQL_PORT: 3306
  MYSQL_USER: root

Before you run this pipeline, ensure you’ve set up your own AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in Minio.

Navigate to http://localhost:9001/access-keys and select “Create access key”.
Copy the provided values into your io_config.yaml under AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
Click the “Create” button and ensure you save the io_config.yaml file.

Next, let’s set up a new table in MySQL. Open DBeaver and execute the following query. While Mage can auto-generate tables, I want to ensure this table remains consistent for all readers.

CREATE TABLE `mytable` (
  `station` text,
  `_year` text,
  `tmin` decimal(10,2) DEFAULT NULL,
  `tmax` decimal(10,2) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Start the pipeline by visiting http://localhost:6789/pipelines/my_pipeline/triggers and click on “Run pipeline now.” You should see a green indicator, signaling success.

While you can look at the pipeline code for details, here’s a summary: It first fetched our data.csv from s3 during the s3_load step. Subsequently, it performed some transformations using pandas in the transformer step.

  # Filter out rows where TMIN or TMAX is NaN
  df = data.dropna(subset=['TMIN', 'TMAX'])

  # Create a 'YEAR' column
  df['YEAR'] = df['DATE'].str[:4]

  # Group by year and compute average TMIN and TMAX
  data = df.groupby(['STATION','YEAR']).agg({'TMIN': 'mean', 'TMAX': 'mean'}).round(2).reset_index()

  return data

Quick heads-up: What I’m doing in the code above from transformation step is setting up data for a graph we’ll create in a Jupyter notebook. Specifically, I’m calculating the yearly average for both minimum and maximum temperatures, grouped by station and year.

Finally, in the third step, these results saved to a MySQL table via the mysql_export step, using Mage's connector which you can observe in the code block below.

  with MySQL.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
      loader.export(
          df,
          None,
          table_name,
          index=False,  # Specifies whether to include index in exported table
          if_exists='append',  # Specify resolution policy if table name already exists
      )

Now, let’s verify the data in MySQL. In DBeaver, refresh my_database and then run a query on mytable.

SELECT *
FROM my_database.mytable;

This is our aggregation.

Now, let’s explore this data further. Launch Jupyter Notebook localhost:8888. Inside, you’ll find a pre-established notebook. Go ahead and click on it.

Inside the notebook, you’ll find some code. It essentially connects to the mytable table in MySQL, loads the data into a pandas dataframe, and then utilizes the plotly module to render an impressive graph.

Execute each cell sequentially, and by the end, you’ll see the result. If you’re curious about the graph’s details, check out the explanation here

Amazing, right?

Dressing Up Your Docker Config for a Perfect Fit

Ready for more? Here are additional ways you can leverage these services:

Alerting and Notification System:

RedPanda captures real-time metrics or logs from applications.
The Python listener checks for specific conditions or thresholds. If something looks off, it triggers an alert.
Mage could then be used to manage these alerts, maybe aggregating them, or pushing them to other systems, or even initiating automatic remedial tasks.

User Behavior Analytics:

RedPanda streams user activities from a website or application.
The Python listener processes and enriches this stream with additional user info from MySQL.
The enriched data lands in S3 (Minio), where analysts can use Jupyter to analyze user behaviors, build funnels, or segment users.

Real-Time Analytics Dashboard:

RedPanda captures real-time events from different sources, like user activities or sensor data.
Mage listens to this data, perhaps aggregating data or filtering specific events.
Processed data is stored in MySQL for structured, quick access.
A Jupyter notebook fetches this data for analysis and visualization, providing a real-time analytics dashboard for end-users.

Consider adding a Python producer service or producing messages directly from Mage. With Mage, you can even set a trigger to launch the pipeline for every new file on s3. Interested in running a dbt project? You can do it in the Mage pipeline on top of the MySQL database. If MySQL isn’t your favorite, swap it out for Postgres or Mongo. Essentially, there’s a wide range of options for adjusting and enhancing these services.

Also, consider exploring my article about running Argo Workflows locally and another one about partitioning outputs from RedPanda on s3.

Behind Docker’s Doors

Let’s take a look at my repository and look at each service individually.

In the repository, you’ll spot two primary files: docker-compose.yml and docker-compose-infra-services.yml. Executing docker-compose up typically looks for the default docker-compose.yml to kick things off. However, to distinguish third-party services (like Kafka or MySQL) from the actual development environment, I've placed those services in docker-compose-infra-services.yml. This structure lets you run external services independently and develop using pure Python, Go, etc., connecting directly via localhost.

docker-compose -f docker-compose-infra-services.yml up --build

Let’s look at our services one by one.

RedPanda

In the docker-compose-infra-services.yml file, RedPanda is represented by three interlinked services:

redpanda-0: This is the RedPanda broker itself. When communicating within the same network, you’d use the host redpanda-0:9092.
console: This serves as RedPanda’s UI and is accessible at http://localhost:8080/.

redpanda-init: This component waits for redpanda-0 to be ready and then initializes the 'my-topic' topic for us on redpanda-0:9092

 redpanda-init:
    image: docker.redpanda.com/redpandadata/redpanda:v23.2.6
    depends_on:
      redpanda-0:
        condition: service_healthy
    networks:
      - redpanda_network
    command: topic create -r 1 -p 1 my-topic --brokers redpanda-0:9092

MySQL

In the docker-compose file, MySQL is represented as a single service. However, it has its nuances. One of them is that it uses the .secrets file to pull the MYSQL_ROOT_PASSWORD variable.

mysql-db:
    build:
      context: .
      dockerfile: Dockerfile.mysql
    container_name: mysql-db
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      timeout: 20s
      retries: 10
    ports:
      - "3306:3306"
    environment:
      - MYSQL_DATABASE=my_database
    env_file:
       - '.secrets'
    volumes:
      - mysql-data:/var/lib/mysql
    networks:
      - redpanda_network

You may also observe that it utilizes a custom Dockerfile, which is shown in the following code block.

# Dockerfile.mysql
FROM mysql:8.1

COPY ./init.sql /docker-entrypoint-initdb.d/

The reason behind this is that I wanted to have a precreated database ready from the get-go. This way, you can set up anything you want in MySQL before launching your services, all within the init.sql file in the form of MySQL queries. Feel free to ingest data or make any necessary configurations. Let's take a look at the sql file:

-- init.sql
CREATE DATABASE IF NOT EXISTS my_database;

USE my_database;

Minio/s3

It comprises two linked services. The aws service takes care of creating a bucket and uploading the data.csv file from the repository on our behalf. However, you also have the flexibility to create your own buckets and upload data.

minio:
    image: minio/minio:RELEASE.2023-08-29T23-07-35Z
    container_name: s3
    ports:
      - "9000:9000"
      - "9001:9001"
    command: server --console-address ":9001" /data
    networks:
      - redpanda_network

  aws:
    image: amazon/aws-cli
    container_name: aws-cli
    command: -c "
            sleep 2
            && aws --endpoint-url http://minio:9000 s3 mb s3://my-bucket --region us-east-1
            && aws --endpoint-url http://minio:9000 s3 cp data.csv  s3://my-bucket/data.csv
            && exit 0"
    entrypoint: [/bin/bash]
    volumes:
      - "./local_data:/aws"
    environment:
      AWS_ACCESS_KEY_ID: "minioadmin"
      AWS_SECRET_ACCESS_KEY: "minioadmin"
    depends_on:
      - minio
    networks:
      - redpanda_network

Mage

This service is quite straightforward. It loads Python sources from mage_src as my_project. Any modifications you make locally in this folder will be instantly reflected in Mage.

 mage-ai:
    image: mageai/mageai:latest
    ports:
      - 6789:6789
    command:
      - /app/run_app.sh
      - mage
      - start
      - my_project
    volumes:
      - ./mage_src:/home/src
    networks:
      - redpanda_network

Jupyter

This service loads the precreated my-notebook.ipynb file into Jupyter notebooks. You won't be prompted for a password due to the empty NotebookApp.token and NotebookApp.password.

jupyter:
    image: jupyter/minimal-notebook:latest
    ports:
      - 8888:8888
    volumes:
      - ./jupyter_workspace:/home/jovyan
    # disable password for jupyter
    command:
      - jupyter
      - notebook
      - --no-browser
      - --NotebookApp.token=''
      - --NotebookApp.password=''
    networks:
      - redpanda_network

Python consumer service

my-service:
    build:
      context: .
      dockerfile: python_app/Dockerfile
    depends_on:
      mysql-db:
        condition: service_healthy 
      redpanda-init:
        condition: service_started
    networks:
      - redpanda_network
    environment:
      - BROKER_URL=redpanda-0:9092
      - BROKER_TOPIC=my-topic
      - MYSQL_DATABASE=my_database
      - MYSQL_URL=mysql-db:3306
    env_file:
       - '.secrets'
    #volumes:
    #  - .:/home/user/app
    restart: on-failure

The code for this service is located in the python_app folder. It utilizes confluent_kafka for handling RedPanda messages and sqlalchemy for writing to the MySQL table. Essentially, it continuously polls messages from Kafka and persists them in MySQL.

The service waits for both MySQL and RedPanda to be up and running. It then uses environment variables for establishing connections within the code. Additionally, it retrieves the MYSQL_ROOT_PASSWORD from the .secret file.

The Show’s Over, But the Journey’s Just Begun

That’s all there is to it. As you work on your data projects, keep this kit in your toolbox — it’s here to simplify your journey and make your data work smoother.

DEV Community