As data scientists, we often stop at the Jupyter Notebook. We clean the data, train a model, get a high accuracy score, high-five ourselves, and move on. But a model isn't truly useful until it's accessible to the world.
In my latest project, I challenged myself to take a Machine Learning model all the way from a raw CSV file to a fully Dockerized REST API. Here is the complete workflow of how I built a Water Quality Prediction system.
1. The Problem & The Data
The goal was to classify water quality based on physicochemical properties. I utilized the WQD.xlsx dataset, which contains thousands of water samples with features like:
- pH & Temperature
- Turbidity (clarity)
- Dissolved Oxygen (DO)
- Pollutants: Ammonia, Nitrite, etc.
The target variable is a Water Quality Class (e.g., 0, 1, 2), making this a multi-class classification problem.
2. The Data Science Workflow
I started in a Jupyter Notebook to explore and clean the data.
Preprocessing
Real-world data is rarely clean. I found typos in column names (like pH having a trailing backtick) and missing values. I used Median Imputation to fill missing entries, as the median is more robust to outliers than the mean.
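As a minimal sketch of that imputation step (the column names here are illustrative stand-ins for the real dataset's features):

```python
import pandas as pd

# Toy frame standing in for the real dataset; column names are illustrative
df = pd.DataFrame({
    'pH': [7.1, None, 6.8, 7.4],
    'Turbidity': [3.2, 4.1, None, 2.9],
})

# Median imputation: each column's gaps are filled with that column's median,
# which is less sensitive to outliers than the mean
df = df.fillna(df.median(numeric_only=True))

print(df.isna().sum().sum())  # 0 missing values remain
```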
Model Selection
I experimented with Logistic Regression, Gradient Boosting, and Random Forest. The Random Forest Classifier emerged as the winner due to its ability to handle non-linear relationships between chemical factors (e.g., high temperature reduces dissolved oxygen).
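The comparison can be sketched with cross-validation on synthetic data (the real project used the water quality features; `make_classification` here is just a stand-in so the snippet runs anywhere):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the water quality features (3 classes)
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'GradientBoosting': GradientBoostingClassifier(),
    'RandomForest': RandomForestClassifier(),
}

# 5-fold cross-validation gives a fairer comparison than a single split
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```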
Hyperparameter Tuning
Instead of guessing parameters, I used GridSearchCV to find the optimal number of trees and max depth:
# Tuning the Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
grid_search.fit(X_train_scaled, y_train)
Serialization
This is the bridge between "Science" and "Engineering." I used joblib to save three critical artifacts:
- water_quality_model.pkl (the brain)
- scaler.pkl (the translator, to normalize new inputs)
- model_columns.pkl (the instructions, to ensure correct input order)
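A sketch of that serialization step, using the artifact names from this project (the tiny training frame is a stand-in for the real data):

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Tiny stand-in data; the real pipeline fits on the full training set
X_train = pd.DataFrame({'pH': [6.8, 7.1, 7.4, 6.5],
                        'Turbidity': [3.2, 4.1, 2.9, 5.0]})
y_train = [0, 1, 1, 0]

scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(n_estimators=10).fit(
    scaler.transform(X_train), y_train)

# The three artifacts the API will load at startup
joblib.dump(model, 'water_quality_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(list(X_train.columns), 'model_columns.pkl')
```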
3. Building the API with Flask
To serve predictions, I needed a web server. I chose Flask for its simplicity. The app.py script loads the saved model artifacts on startup.
The core logic happens in the /predict endpoint:
@app.route('/predict', methods=['POST'])
def predict():
    # 1. Get JSON data
    data = request.json
    # 2. Arrange data in the exact order the model expects
    input_data = np.array([data[col] for col in model_columns]).reshape(1, -1)
    # 3. Scale the data (crucial step!)
    scaled_data = scaler.transform(input_data)
    # 4. Predict
    prediction = model.predict(scaled_data)
    return jsonify({'water_quality_prediction': int(prediction[0])})
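The column-ordering step deserves emphasis. JSON arrives as a dict, so the key order in the request cannot be trusted; rebuilding the array from the saved column list guarantees the features line up with training. A small sketch (the column names are hypothetical):

```python
import numpy as np

# Hypothetical column order saved at training time (model_columns.pkl)
model_columns = ['pH', 'Temperature', 'Turbidity', 'DO']

# Incoming JSON dict with keys in an arbitrary order
payload = {'Turbidity': 3.2, 'DO': 6.5, 'pH': 7.1, 'Temperature': 22.0}

# Rebuild in the saved order, then reshape to a single-row 2D array
input_data = np.array([payload[col] for col in model_columns]).reshape(1, -1)
print(input_data.shape)  # (1, 4), values in training order
```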
4. The Deployment Challenge: Dockerization
This was the most educational part of the project. I wanted to containerize the app so it could run on any machine, regardless of the OS.
The "File Extension" Trap
One of my first hurdles was a simple file naming error. Windows hid the file extension, so my Dockerfile was actually named Dockerfile.txt. Docker refused to build until I removed the extension. Lesson learned: Always check your file types in the terminal!
Handling Network Timeouts
During the build process, pip install kept failing on heavy libraries like NumPy and Pandas due to network fluctuations. The build would crash halfway through, forcing a restart.
I solved this with two strategies:
- Increased Timeout: I added --default-timeout=1000 to the pip command.
- Layered Caching: I structured the Dockerfile to install libraries one by one. This acts like a "save point" in a video game. If Pandas fails, I don't have to re-download NumPy.
Here is the optimized Dockerfile I ended up with:
FROM python:3.9-slim
WORKDIR /app
# Install libraries individually to leverage Docker caching
RUN pip install --default-timeout=1000 --no-cache-dir Flask==3.0.3
RUN pip install --default-timeout=1000 --no-cache-dir numpy==1.26.4
RUN pip install --default-timeout=1000 --no-cache-dir scikit-learn==1.5.0
RUN pip install --default-timeout=1000 --no-cache-dir pandas==2.2.2
# ... other dependencies
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
5. The Result
With the container running, I can now generate predictions instantly via a curl command:
curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{...data...}'
Response: {"water_quality_prediction": 2}
Conclusion
This project reinforced that building the model is only 50% of the work. The other 50% is engineering—making that model robust, portable, and runnable.
By wrapping the model in Docker, I've transformed a static notebook into a portable microservice that could theoretically be deployed to AWS, Azure, or a Raspberry Pi monitoring a river in real-time.