DEV Community

Cover image for Containerizing Python Data Processing Scripts with Docker: A Step-by-Step Guide
Matthew M Tarfa
Matthew M Tarfa

Posted on

Containerizing Python Data Processing Scripts with Docker: A Step-by-Step Guide

Containerization is a game changer in modern software development, providing a consistent runtime environment for applications. In this blog post, I’ll guide you through a project where I containerized a Python script that reads and processes a CSV file using Docker. This is an excellent project for beginners looking to build a foundational understanding of Docker.

Project Overview

Goal: Containerize a Python script that processes data from a CSV file to make it portable and executable in any environment.
Technologies Used: Docker, Python, pandas

Real-World Application

This type of data processing can be scaled up for more complex datasets, such as processing customer records, analyzing survey results, or even preliminary data cleaning for machine learning tasks. The ability to containerize this script ensures that it can run consistently across various environments, making it easier for teams to collaborate without worrying about dependency issues.

Step 1: Create a Sample Data File

For this project, we will use a sample CSV file named data.csv with six simple entries:

data.csv content:

Name
Oluchukwu
Matthew
John
Sola
Benson
Winston
Enter fullscreen mode Exit fullscreen mode

Step 2: Write the Python Script

Create a Python script named process_data.py to read and print a summary of the data.

process_data.py content:

import pandas as pd

# Creating a DataFrame from the CSV file
df = pd.read_csv('data.csv', header=None, names=['Name'])  # Assuming no header in the CSV file

# Generating the summary statistics of the 'Name' column
description = df['Name'].describe()

# Printing the description of the 'Name' column
print(description)

# Additional print to show unique value count explicitly
print("Unique values count:")
print(df['Name'].nunique())
Enter fullscreen mode Exit fullscreen mode

Explanation of the Code:

Reading the CSV: The header=None argument tells pandas that the CSV doesn't have a header row, and the names=['Name'] argument assigns the name Name to the single column in the CSV.

Generating Summary: df['Name'].describe() gives you the summary statistics for the Name column, including count, unique values, the most frequent value (top), and its frequency (freq).

Printing Unique Values Count: df['Name'].nunique() directly shows the number of unique names in the column.

Step 3: Create a requirements.txt File

The requirements.txt file specifies the Python dependencies required by the script:

pandas
Enter fullscreen mode Exit fullscreen mode

Explanation:
This ensures that the pandas library is installed when building the Docker image.

Step 4: Write the Dockerfile

The Dockerfile defines the environment and commands for building the container. Here’s what it should look like:

Dockerfile content:

# Use a minimal Python base image
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Copy the requirements file into the container
COPY requirements.txt .

# Install the required Python libraries
RUN pip install -r requirements.txt

# Copy the rest of the project files into the container
COPY . .

# Define the command to run the Python script
CMD ["python", "process_data.py"]
Enter fullscreen mode Exit fullscreen mode

Explanation:

  • FROM python:3.9-slim: Specifies a lightweight Python image as the base.
  • WORKDIR /app: Sets the working directory for the container.
  • COPY requirements.txt . : Copies requirements.txt to the container.
  • RUN pip install -r requirements.txt : Installs Python dependencies.
  • COPY . . : Copies the project files into the container.
  • CMD ["python", "process_data.py"]: Specifies the command to run the script.

Step 5: Build the Docker Image

Navigate to the project directory and build the Docker image with the following command:

docker build -t python-script .
Enter fullscreen mode Exit fullscreen mode

Explanation:

  • docker build: Command to build a Docker image.
  • -t python-script: Tags the image with the name python-script.
  • . (dot): Specifies the current directory as the build context.

Step 6: Run the Docker Container

Run the container and mount the data directory to ensure the script can access the CSV file:

docker run -v $(pwd)/data:/app/data python-script
Enter fullscreen mode Exit fullscreen mode

Explanation:

  • docker run: Runs a Docker container.
  • -v $(pwd)/data:/app/data: Mounts the data directory from your host machine to /app/data in the container.
  • python-script: The name of the Docker image to run.

Step 7: Check the Output

After running the Docker container, you should see an output that summarizes the contents of data.csv. Here’s the actual output from the project

Output:

            Name
count          6
unique         6
top       Oluchukwu
freq           1
Unique values count:
6
Enter fullscreen mode Exit fullscreen mode

Understanding the Output:

  • count: 6 – This shows that there are 6 values in the Name column. The count function counts the number of non-null entries, which is 6 in this case.
  • unique: 6 – This means that there are 6 unique names in the column. Since all 6 names provided (Oluchukwu, Matthew, John, Sola, Benson, Winston) are distinct, it counts all 6 as unique.
  • top: Oluchukwu – This tells that the most frequent name in the Name column. Since no name appears more than once in the file, pandas defaults to the first name in the column (in this case, Oluchukwu).
  • freq: 1 – This shows the frequency of the most frequent name, which is 1 because every name appears exactly once.
  • Unique values count: 6: This is showing that there are 6 unique names in the dataset, which matches the number of names in the CSV file.

Why Oluchukwu is listed as the top value:
In the dataset, no name is repeated, so pandas automatically picks the first name in the list, Oluchukwu, as the top value because it’s the most frequent by default.

6 unique names means there are 6 distinct entries, which is correct.
The top value of Oluchukwu is due to it being the first value in the list since there is no repetition.

Conclusion

By following these steps, you've successfully containerized a Python data processing script. This approach ensures that your script runs consistently across different environments, making it a valuable tool for data analysis tasks. You can further enhance this project by integrating it with cloud platforms or automating the container execution using CI/CD pipelines.

Next Steps:

Try expanding the script to include data cleaning, visualization, or exporting processed data. You can also explore deploying the container to a cloud service like AWS ECS or Kubernetes for a scalable data processing solution.

Ready to elevate your data processing projects? Start containerizing today and unlock the full potential of Docker!

Top comments (1)

Collapse
 
oluchiedeh profile image
Oluchukwu Edeh

Great explanation, Matthew!