DEV Community

Aniket Singh

🌸 Iris Classifier ML Pipeline — Complete Tutorial & Instructions Manual

Who this is for: Beginners and intermediate developers who want to understand how a real-world ML project is structured and run — from a cloned repository to a fully running system.

What you'll learn: Virtual environments, dependency management, project structure, MLflow experiment tracking, FastAPI inference servers, Docker containerisation, and automated CI/CD.

Prerequisites: Python installed, VS Code installed, internet connection. That's it.


📋 Table of Contents

  1. What This Project Does
  2. Understanding the Project Structure
  3. One-Time Setup: Install Required Tools
  4. Step 1 — Clone the Project
  5. Step 2 — Create a Python Virtual Environment
  6. Step 3 — Install Dependencies
  7. Step 4 — Configure Environment Variables
  8. Step 5 — Run the Training Pipeline
  9. Step 6 — Explore MLflow Experiment Tracking
  10. Step 7 — Start the FastAPI Inference Server
  11. Step 8 — Make Predictions via the API
  12. Step 9 — Run the Test Suite
  13. Step 10 — Run Everything with Docker
  14. Step 11 — Schedule Training with Cron
  15. Architecture Deep Dive
  16. How the Code Flows Together
  17. Common Errors & How to Fix Them
  18. Extending the Project

1. What This Project Does

This project simulates a production-grade machine learning system. Here is the high-level picture:

┌─────────────────────────────────────────────────────────────────────┐
│                        ML PIPELINE OVERVIEW                         │
│                                                                     │
│  [Cron / CLI]                                                       │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────┐    trains    ┌─────────────────┐                  │
│  │  Training   │ ──────────►  │  Saved Model    │                  │
│  │  Pipeline   │              │  (models/*.pkl) │                  │
│  └─────────────┘              └────────┬────────┘                  │
│       │                                │                            │
│       │ logs everything                │ loaded by                  │
│       ▼                                ▼                            │
│  ┌─────────────┐              ┌─────────────────┐                  │
│  │   MLflow    │              │   FastAPI       │◄── HTTP requests │
│  │  Tracking   │              │   Inference     │                  │
│  │     UI      │              │   Server        │──► predictions   │
│  └─────────────┘              └─────────────────┘                  │
│                                                                     │
│  All three services run together inside Docker Compose             │
└─────────────────────────────────────────────────────────────────────┘

The dataset: Iris — 150 flower measurements, 3 species (setosa, versicolor, virginica). A classic beginner dataset that's perfect for showcasing pipeline architecture without the model itself becoming the focus.

The model: RandomForestClassifier inside a sklearn Pipeline with StandardScaler. A GridSearchCV automatically searches for the best hyperparameters.
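As a taste of what the trainer does, here is a minimal, self-contained sketch of that setup (the real grid in src/training/pipeline.py likely searches more combinations than this illustrative one):

```python
# Sketch: StandardScaler + RandomForest in a Pipeline, tuned by GridSearchCV.
# The hyperparameter grid below is illustrative, not the project's exact grid.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# GridSearchCV cross-validates every parameter combination (5-fold here),
# then refits the best one on the full training set.
grid = GridSearchCV(
    pipeline,
    param_grid={
        "classifier__n_estimators": [50, 100],
        "classifier__max_depth": [None, 5],
    },
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(round(grid.best_estimator_.score(X_test, y_test), 4))
```

Note the `classifier__` prefix: inside a Pipeline, grid-search parameters are addressed as `<step name>__<parameter>`.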


2. Understanding the Project Structure

Before touching any code, read this section. Understanding why the project is structured this way is what separates a portfolio project from a "notebook dumped into a repo."

ml-pipeline/
│
├── src/                    ← The core Python library (business logic)
│   ├── config.py           ← ALL settings live here (paths, hyperparams, env vars)
│   ├── data/
│   │   └── loader.py       ← Loads Iris dataset, splits train/test
│   ├── training/
│   │   ├── pipeline.py     ← Builds the sklearn Pipeline object
│   │   └── trainer.py      ← Orchestrates training: GridSearchCV + MLflow logging
│   ├── evaluation/
│   │   └── metrics.py      ← Accuracy, F1, confusion matrix — pure functions
│   └── inference/
│       └── predictor.py    ← Loads saved model, exposes predict() method
│
├── api/                    ← The FastAPI web application
│   ├── main.py             ← Creates and configures the FastAPI app
│   ├── schemas.py          ← Defines the shape of API requests and responses
│   └── routers/
│       └── predict.py      ← The actual /predict HTTP endpoints
│
├── scripts/
│   ├── train.py            ← CLI entry point: `python scripts/train.py`
│   └── run_pipeline.sh     ← Bash wrapper used by cron
│
├── tests/                  ← Automated tests
│   ├── test_data_loader.py
│   ├── test_metrics.py
│   ├── test_predictor.py
│   └── test_api.py
│
├── docker/
│   ├── Dockerfile.train    ← Container for the training job
│   └── Dockerfile.api      ← Container for the FastAPI server
│
├── models/                 ← Created automatically — stores .pkl files
├── mlruns/                 ← Created automatically — stores MLflow data
├── logs/                   ← Created automatically — stores training logs
│
├── docker-compose.yml      ← Wires all Docker services together
├── requirements.txt        ← Python dependencies
├── Makefile                ← Shortcuts: `make train`, `make serve`, etc.
├── crontab.txt             ← Cron schedule definition
├── pyproject.toml          ← Tool config (pytest, ruff, mypy)
└── .env.example            ← Template for environment variables

Key design principle: src/ contains zero web framework code. api/ contains zero ML logic. They communicate through src/inference/predictor.py. This makes every layer independently testable.


3. One-Time Setup: Install Required Tools

You need three tools installed on your machine. Do this before anything else.

3.1 Verify Python is installed

Open a terminal (on Linux/Mac) or Command Prompt / PowerShell (on Windows):

python --version
# or on some systems:
python3 --version

You should see Python 3.10.x or higher. If not, download it from python.org.

On Linux/Ubuntu, you may need: sudo apt update && sudo apt install python3 python3-pip python3-venv

3.2 Install Git (to push to GitHub later)

git --version

If not installed:

  • Ubuntu/Debian: sudo apt install git
  • Mac: xcode-select --install
  • Windows: Download from git-scm.com

3.3 Install Docker Desktop (for the containerised workflow)

Docker lets you run the entire stack — API + MLflow UI + trainer — with a single command, without installing anything else on your machine.

  1. Go to docs.docker.com/get-docker
  2. Download and install Docker Desktop for your OS
  3. Open Docker Desktop and wait for it to show "Docker is running"
  4. Verify in the terminal:
docker --version
docker compose version

On Linux, after installing Docker Engine, add your user to the docker group so you don't need sudo:

sudo usermod -aG docker $USER
newgrp docker

4. Step 1 — Clone the Project

4.1 Clone the repository from GitHub

Clone the Iris-Classifier-ML-Pipeline to a location of your choice, for example ~/projects/.

git clone https://github.com/aniket-1177/Iris-Classifier-ML-Pipeline.git

4.2 Open in VS Code

cd Iris-Classifier-ML-Pipeline
code .

Or open VS Code manually → File → Open Folder → select Iris-Classifier-ML-Pipeline.

Install the recommended VS Code extension for Python: when VS Code prompts you, click Install. If it doesn't prompt, press Ctrl+Shift+X, search Python, and install the Microsoft extension.


5. Step 2 — Create a Python Virtual Environment

What is a virtual environment and why do we need one?

A virtual environment is an isolated Python installation just for this project. Without it, every project on your machine would share the same packages — which leads to version conflicts. With a venv, installing scikit-learn==1.4.0 here won't affect any other project.

Your machine
│
├── System Python (don't touch this)
│
└── projects/
    └── ml-pipeline/
        └── .venv/          ← A private Python just for this project
            ├── bin/python
            └── lib/
                ├── scikit-learn
                ├── fastapi
                ├── mlflow
                └── ...

Create the virtual environment

# Make sure you are inside the ml-pipeline directory
pwd
# Should print something like: /home/yourname/projects/ml-pipeline

# Create the venv (this creates a .venv folder)
python -m venv .venv

Activate the virtual environment

You must activate the venv every time you open a new terminal window.

# Linux / Mac:
source .venv/bin/activate

# Windows (Command Prompt):
.venv\Scripts\activate.bat

# Windows (PowerShell):
.venv\Scripts\Activate.ps1

After activation, your terminal prompt changes to show (.venv):

(.venv) username@os:~/projects/Iris-Classifier-ML-Pipeline$

VS Code tip: Press Ctrl+Shift+P → type "Python: Select Interpreter" → choose the one that says .venv. VS Code will now automatically activate the venv in all new integrated terminals.


6. Step 3 — Install Dependencies

With your venv activated, install all required packages:

pip install --upgrade pip
pip install -r requirements.txt

This will install approximately 15 packages. It may take 2–5 minutes on the first run.

What's being installed:

Package          Why
scikit-learn     Machine learning — our model, pipeline, and grid search
pandas / numpy   Data manipulation
mlflow           Experiment tracking and model registry
fastapi          The web framework for our inference API
uvicorn          The ASGI web server that runs FastAPI
pydantic         Data validation for API requests/responses

Verify installation

python -c "import sklearn, mlflow, fastapi; print('All good!')"
# Should print: All good!

7. Step 4 — Configure Environment Variables

Environment variables let you change settings (like which MLflow server to use) without editing code.

Create your .env file

cp .env.example .env

Open .env in VS Code. For local development, the defaults are fine:

MLFLOW_TRACKING_URI=file://./mlruns
MLFLOW_EXPERIMENT_NAME=iris-classifier
API_HOST=0.0.0.0
API_PORT=8000
LOG_LEVEL=INFO

What does file://./mlruns mean? It tells MLflow to store all experiment data in a local folder called mlruns/ instead of connecting to a remote server. Perfect for development.
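The code that consumes these variables follows one simple pattern: read each one with a sensible local default. A minimal sketch of that pattern (variable names taken from the .env above; the directory layout here is illustrative, the real logic lives in src/config.py):

```python
import os
from pathlib import Path

# Each setting reads an environment variable and falls back to a local
# default, so Docker / CI can override behaviour without code changes.
MLRUNS_DIR = Path.cwd() / "mlruns"

MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", f"file://{MLRUNS_DIR}")
API_HOST = os.getenv("API_HOST", "0.0.0.0")
API_PORT = int(os.getenv("API_PORT", "8000"))  # env vars are strings
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")

print(MLFLOW_TRACKING_URI)
```

Note the `int(...)` cast: environment variables always arrive as strings, so anything numeric must be converted explicitly.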


8. Step 5 — Run the Training Pipeline

This is the core of the project. Let's run it and understand what happens at each step.

python scripts/train.py

You will see output like this:

2024-06-01 10:23:15 | INFO     | __main__ | =======================================================
2024-06-01 10:23:15 | INFO     | __main__ |   ML Pipeline Training Run
2024-06-01 10:23:15 | INFO     | __main__ |   Experiment : iris-classifier
2024-06-01 10:23:15 | INFO     | __main__ | =======================================================
2024-06-01 10:23:15 | INFO     | src.data.loader | Loading Iris dataset...
2024-06-01 10:23:15 | INFO     | src.data.loader | Dataset loaded | samples=150 | features=4 | classes=['setosa', 'versicolor', 'virginica']
2024-06-01 10:23:15 | INFO     | src.data.loader | Data split | train=120 | test=30
2024-06-01 10:23:16 | INFO     | src.training.trainer | MLflow run started | run_id=abc123...
2024-06-01 10:23:18 | INFO     | src.training.trainer | Best params: {'classifier__max_depth': None, 'classifier__n_estimators': 100}
2024-06-01 10:23:18 | INFO     | src.training.trainer | ─────────────────────────────────────────────
2024-06-01 10:23:18 | INFO     | src.training.trainer | accuracy                       0.9667
2024-06-01 10:23:18 | INFO     | src.training.trainer | macro_f1                       0.9667
...
2024-06-01 10:23:19 | INFO     | __main__ | Training finished successfully.
2024-06-01 10:23:19 | INFO     | __main__ |   Accuracy   : 0.9667
2024-06-01 10:23:19 | INFO     | __main__ |   Model path : /home/.../models/iris_classifier.pkl

What just happened internally?

scripts/train.py
    └── calls run_training() in src/training/trainer.py
            │
            ├── 1. load_dataset()       → loads 150 Iris rows from scikit-learn
            ├── 2. split_data()         → 120 train, 30 test (stratified)
            ├── 3. build_pipeline()     → StandardScaler + RandomForestClassifier
            ├── 4. GridSearchCV.fit()   → tries 18 hyperparameter combinations (5-fold CV each)
            ├── 5. compute_metrics()    → accuracy, F1, etc. on held-out test set
            ├── 6. mlflow.log_*()       → saves params + metrics + model to mlruns/
            └── 7. pickle.dump()        → saves best model to models/iris_classifier.pkl
                                           saves label encoder to models/label_encoder.pkl

Verify the output artifacts

ls models/
# iris_classifier.pkl   label_encoder.pkl

ls mlruns/
# 0/   (experiment folder)

CLI flags

# Custom experiment name
python scripts/train.py --experiment my-experiment-v2

# Save results to a JSON file
python scripts/train.py --output-json results/run1.json

# See all options
python scripts/train.py --help

9. Step 6 — Explore MLflow Experiment Tracking

MLflow automatically captured everything about the training run. Let's view it.

Open a new terminal (keep your first terminal free for the API later). Activate the venv:

source .venv/bin/activate
mlflow ui --backend-store-uri mlruns --port 5000

Open your browser and go to http://localhost:5000

What you'll see in the MLflow UI

Experiments list: You'll see iris-classifier with one run logged.

Inside the run, explore:

  • Parameters tab — the hyperparameter values GridSearchCV chose as best:
  cv_folds              5
  test_size             0.2
  classifier__n_estimators    100
  classifier__max_depth       None
  classifier__min_samples_split  2
  • Metrics tab — all evaluation scores:
  cv_best_score         0.9583
  test_accuracy         0.9667
  test_macro_f1         0.9667
  test_macro_precision  0.9683
  test_macro_recall     0.9667
  f1_setosa             1.0000
  f1_versicolor         0.9333
  f1_virginica          0.9667
  • Artifacts tab — the saved model files and a preview of the input schema

  • Models tab (top menu) — the IrisClassifier registered model with version history

Why does this matter for a portfolio? In a real company, dozens of engineers run hundreds of experiments. MLflow lets you compare them all — which model was best? What were its settings? What data was it trained on? This is how ML teams avoid the "I don't know which model is in production" problem.

Run training a second time and compare

# Back in your first terminal:
python scripts/train.py --experiment iris-classifier

Now refresh the MLflow UI. You'll see two runs side by side. Click the checkboxes on both and hit Compare to see a diff of parameters and metrics.


10. Step 7 — Start the FastAPI Inference Server

The trained model is now saved to disk. Let's serve it as a REST API.

In your first terminal (with venv activated):

uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

The --reload flag means the server restarts automatically when you edit code — great for development.

You should see:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Starting Iris Classifier API v1.0.0
INFO:     Model ready | classes=['setosa', 'versicolor', 'virginica']
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000

Open your browser at http://localhost:8000/docs

The Interactive API Docs (Swagger UI)

FastAPI automatically generates an interactive documentation page from your code. You don't write this HTML — it's created from your Pydantic schemas and route definitions.

You'll see three endpoints:

  • POST /predict/ — single flower prediction
  • POST /predict/batch — multiple flowers at once
  • GET /health — is the model loaded?

Understanding the URL http://0.0.0.0:8000

http://  0.0.0.0  :  8000  /docs
  │         │          │       │
  │         │          │       └── Path (Swagger UI page)
  │         │          └────────── Port number
  │         └───────────────────── "All network interfaces" = accessible from anywhere on this machine
  └─────────────────────────────── Protocol

0.0.0.0 as the host means "listen on all network interfaces." When you open it in a browser, you use localhost or 127.0.0.1 instead.


11. Step 8 — Make Predictions via the API

You have four ways to call the API. Try them all — each is used in different real-world scenarios.

Method A — Swagger UI (browser)

  1. Go to http://localhost:8000/docs
  2. Click on POST /predict/
  3. Click Try it out
  4. Replace the request body with:
   {
     "sepal_length": 5.1,
     "sepal_width": 3.5,
     "petal_length": 1.4,
     "petal_width": 0.2
   }
  5. Click Execute
  6. Scroll down to see the response

Method B — curl (terminal)

Open a third terminal and run:

curl -X POST http://localhost:8000/predict/ \
  -H "Content-Type: application/json" \
  -d '{
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2
  }'

Expected response:

{
  "predicted_class": "setosa",
  "confidence": 0.98,
  "class_probabilities": {
    "setosa": 0.98,
    "versicolor": 0.01,
    "virginica": 0.01
  }
}

Method C — Python requests (script)

Create a quick test script:

# test_request.py  (create this in the project root)
import requests

url = "http://localhost:8000/predict/"

payload = {
    "sepal_length": 6.3,
    "sepal_width": 3.3,
    "petal_length": 6.0,
    "petal_width": 2.5,
}

response = requests.post(url, json=payload)
print(response.json())
pip install requests   # if not already installed
python test_request.py

Method D — Batch prediction

curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "samples": [
      {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
      {"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0, "petal_width": 2.5},
      {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4}
    ]
  }'

Understanding what happens on each request

HTTP POST /predict/
         │
         ▼ api/routers/predict.py
         │  Pydantic validates the JSON (correct types? in range?)
         │  If invalid → 422 Unprocessable Entity (automatic)
         │
         ▼ Depends(get_predictor)
         │  FastAPI calls get_predictor() to inject the Predictor object
         │  lru_cache means the model is NOT reloaded on every request
         │
         ▼ predictor.predict([5.1, 3.5, 1.4, 0.2])
         │  Builds a pandas DataFrame with the correct column names
         │  Runs pipeline.predict() — scaler transforms, then RF predicts
         │  Runs pipeline.predict_proba() — gets probability per class
         │
         ▼ Returns PredictResponse
            FastAPI serializes it to JSON and sends HTTP 200

Testing input validation

FastAPI + Pydantic automatically validates every request. Try sending a bad value:

curl -X POST http://localhost:8000/predict/ \
  -H "Content-Type: application/json" \
  -d '{"sepal_length": -5, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}'

You'll get a 422 Unprocessable Entity with a clear error message — no custom error handling code needed. This is the power of Pydantic.


12. Step 9 — Run the Test Suite

Stop the API server for now (Ctrl+C). Let's run the automated tests.

# Run all tests with verbose output
pytest tests/ -v

# Run with a coverage report
pytest tests/ -v --cov=src --cov=api --cov-report=term-missing

Understanding the test output

tests/test_data_loader.py::TestLoadDataset::test_returns_dataframe_and_series PASSED
tests/test_data_loader.py::TestLoadDataset::test_correct_shape PASSED
tests/test_data_loader.py::TestSplitData::test_split_sizes PASSED
...
tests/test_api.py::TestPredictEndpoint::test_valid_request_200 PASSED
tests/test_api.py::TestPredictEndpoint::test_missing_field_422 PASSED
tests/test_api.py::TestPredictEndpoint::test_negative_value_422 PASSED
...

---------- coverage: src ----------
src/config.py              28     3    89%
src/data/loader.py         32     2    94%
src/training/trainer.py    58    12    79%
src/inference/predictor.py 55     4    93%
...

What each test file covers

File                  What it tests                                    Key technique
test_data_loader.py   Shape, columns, no nulls, stratification         Direct assertions
test_metrics.py       Perfect vs imperfect predictions, rounding       Parametrized fixtures
test_predictor.py     Model loading, predict output, error cases       unittest.mock.patch to fake disk paths
test_api.py           HTTP status codes, response schema, validation   FastAPI TestClient — no real server needed

Why test_predictor.py uses mock patches:
The Predictor class loads .pkl files from disk. In tests, we don't want to depend on a pre-trained model existing. Instead, we use unittest.mock.patch to replace the file paths with a temp directory containing a freshly trained mini-model. This makes the tests fast, isolated, and runnable in CI.
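Here is a self-contained sketch of that technique. The Predictor class and MODEL_PATH below are toy stand-ins, not the project's real code, but the mock-and-temp-dir pattern is the same:

```python
# Sketch of the mocking technique: patch a module-level model path so the
# test loads a freshly created fake model from a temp directory instead
# of depending on a real pre-trained .pkl on disk.
import pickle
import sys
import tempfile
from pathlib import Path
from unittest import mock

MODEL_PATH = Path("/nonexistent/models/iris_classifier.pkl")  # "real" path

class Predictor:
    """Toy stand-in for src/inference/predictor.py."""
    def __init__(self):
        with open(MODEL_PATH, "rb") as f:  # reads the module-level path
            self.model = pickle.load(f)

    def predict(self, features):
        return self.model["always"]  # toy model: constant prediction

def test_predictor_with_fake_model():
    with tempfile.TemporaryDirectory() as tmp:
        fake_path = Path(tmp) / "iris_classifier.pkl"
        with open(fake_path, "wb") as f:
            pickle.dump({"always": "setosa"}, f)  # mini "model" made on the fly
        # Patch the path for the duration of the test only
        with mock.patch.object(sys.modules[__name__], "MODEL_PATH", fake_path):
            assert Predictor().predict([5.1, 3.5, 1.4, 0.2]) == "setosa"

test_predictor_with_fake_model()
print("test passed")
```

Because the patch is scoped to the `with` block, MODEL_PATH is automatically restored afterwards, which keeps tests isolated from each other.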

Run a single test file

pytest tests/test_api.py -v
pytest tests/test_data_loader.py -v

Run tests matching a pattern

pytest tests/ -k "test_valid_request" -v
pytest tests/ -k "batch" -v

13. Step 10 — Run Everything with Docker

So far we've been running services manually in separate terminals. Docker Compose lets you run the entire stack with one command and tear it all down just as easily.

Make sure Docker Desktop is running

Check the Docker Desktop taskbar icon — it should say "Docker Desktop is running."

Build and start all services

docker compose up --build

The first build takes 3–5 minutes (it downloads base images and installs packages). Subsequent starts are fast.

Watch the output — you'll see three services starting:

mlflow_server  | [INFO] Starting MLflow server...
mlflow_server  | [INFO] Listening on http://0.0.0.0:5000
ml_trainer     | [INFO] Loading Iris dataset...
ml_trainer     | [INFO] Training finished. Accuracy: 0.9667
ml_trainer     | [INFO] Model saved to /app/models/iris_classifier.pkl
ml_trainer exited with code 0       ← trainer exits after one run (this is normal)
ml_api         | [INFO] Model ready | classes=['setosa', 'versicolor', 'virginica']
ml_api         | [INFO] Uvicorn running on http://0.0.0.0:8000

Now open the API docs at http://localhost:8000/docs and the MLflow UI at http://localhost:5000.

Why the trainer exits

The trainer service is configured with restart: "no" — it runs the training job once and exits. This is intentional. In production, you'd trigger retraining on a schedule (via cron or a CI job), not keep a process running forever.

Re-run training inside Docker (without rebuilding)

docker compose run --rm trainer

This spins up a fresh trainer container, trains the model, saves it to the shared volume, and exits.

Stop all services

docker compose down

Understanding Docker volumes

The models/ directory is shared between the trainer and the API using a Docker named volume called models_vol:

┌────────────────┐         ┌──────────────────┐
│  trainer       │ writes  │  models_vol      │
│  container     │────────►│  (Docker volume) │
└────────────────┘         └────────┬─────────┘
                                    │ reads
                           ┌────────▼─────────┐
                           │  api             │
                           │  container       │
                           └──────────────────┘

This means you can retrain the model and the running API picks up the new model without rebuilding or redeploying the API image.

Useful Docker commands

# See running containers
docker ps

# See logs from the API container
docker logs ml_api -f

# Open a shell inside the API container (for debugging)
docker exec -it ml_api bash

# Remove everything including volumes (full reset)
docker compose down -v

14. Step 11 — Schedule Training with Cron

Cron is a Unix tool that runs commands on a schedule. We've included a crontab.txt that runs the training pipeline every Monday at 2 AM.

View the schedule

cat crontab.txt
# Retrain model every Monday at 02:00 AM
0 2 * * 1 cd /app && bash scripts/run_pipeline.sh >> /var/log/ml_pipeline_cron.log 2>&1

Understanding cron syntax

0   2   *   *   1
│   │   │   │   │
│   │   │   │   └── Day of week: 1 = Monday (0=Sun, 6=Sat)
│   │   │   └────── Month: * = every month
│   │   └────────── Day of month: * = every day
│   └────────────── Hour: 2 = 2 AM
└────────────────── Minute: 0 = on the hour
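To convince yourself how the five fields combine, here is a tiny illustrative matcher for simple entries like `0 2 * * 1` (numbers and `*` only; real cron also supports ranges, lists, and step values):

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """Check a simplified cron expression (plain numbers and '*' only)
    against a datetime. Not a full cron parser -- illustration only."""
    minute, hour, dom, month, dow = expr.split()
    fields = [
        (minute, dt.minute),
        (hour, dt.hour),
        (dom, dt.day),
        (month, dt.month),
        (dow, dt.isoweekday() % 7),  # cron convention: 0=Sunday ... 6=Saturday
    ]
    return all(f == "*" or int(f) == actual for f, actual in fields)

# 2024-06-03 was a Monday: 02:00 matches the schedule, 03:00 does not
print(cron_matches("0 2 * * 1", datetime(2024, 6, 3, 2, 0)))
print(cron_matches("0 2 * * 1", datetime(2024, 6, 3, 3, 0)))
```

The `isoweekday() % 7` trick converts Python's Monday=1..Sunday=7 numbering into cron's Sunday=0..Saturday=6.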

Install the crontab (Linux/Mac only)

crontab crontab.txt

# Verify it's installed
crontab -l

Test the pipeline script manually

bash scripts/run_pipeline.sh

This produces timestamped log files in logs/:

logs/
├── train_20240601_102315.log
└── results_20240601_102315.json

Cron inside Docker

To run cron inside the Docker trainer container instead of on the host machine, change the CMD in docker-compose.yml:

trainer:
  command: cron -f   # runs cron daemon in foreground (keeps container alive)

15. Architecture Deep Dive

This section explains the key architectural decisions — the "why" behind the code. This is exactly what interviewers and video viewers want to understand.

Why src/config.py is the single source of truth

# src/config.py
MODEL_PATH = MODELS_DIR / f"{MODEL_NAME}.pkl"
MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", f"file://{MLRUNS_DIR}")

Every path, every setting, every environment variable lives here. No other file hardcodes a path or reads an env var. If you need to change where models are stored, you change one line in config.py and it propagates everywhere.

Why the sklearn Pipeline prevents data leakage

# src/training/pipeline.py
Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier()),
])

Without a Pipeline, you might do this (which is wrong):

# ❌ WRONG — data leakage
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# GridSearchCV also scales training data, but it already "saw" the test fold stats

With a Pipeline inside GridSearchCV, the scaler is fit only on the training portion of each fold, never on the validation data. This gives you an honest estimate of real-world performance.
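For contrast, here is the safe version as a minimal sketch: because the scaler sits inside the Pipeline, cross-validation refits it on each fold's training rows only.

```python
# Leakage-free evaluation: the scaler is a pipeline step, so every CV
# fold fits it on that fold's training portion and never sees the
# validation rows when estimating the mean/std.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(round(scores.mean(), 4))
```

The same property is what makes it safe to hand the whole Pipeline to GridSearchCV, as the trainer does.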

Why lru_cache on get_predictor()

# src/inference/predictor.py
@lru_cache(maxsize=1)
def get_predictor() -> Predictor:
    return Predictor()

lru_cache memoises the function — after the first call, it returns the cached result without calling the function again.

Without it: Every HTTP request would load iris_classifier.pkl from disk → slow.
With it: The model loads once at startup, all requests share the same in-memory instance → fast.
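You can see the effect with a toy stand-in (the string below replaces the real, expensive pickle.load):

```python
from functools import lru_cache

load_count = 0  # counts how many times the "model" is loaded

@lru_cache(maxsize=1)
def get_predictor() -> str:
    global load_count
    load_count += 1              # stand-in for the expensive pickle.load()
    return "predictor-instance"  # stand-in for the real Predictor object

# Three "requests" -- the function body runs once, the rest hit the cache
for _ in range(3):
    get_predictor()
print(load_count)  # prints 1
```

Since `maxsize=1` and the function takes no arguments, the cache holds exactly one entry: the shared Predictor.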

Why the Predictor is separate from the API

src/inference/predictor.py has zero imports from fastapi. This means:

  • You can import and use Predictor in a Celery worker, a CLI script, or a Jupyter notebook without FastAPI
  • You can unit-test it with pytest without starting a web server
  • You could swap FastAPI for Flask or gRPC and the predictor code would be unchanged

Why Pydantic schemas are worth the boilerplate

# api/schemas.py
class PredictRequest(BaseModel):
    sepal_length: float = Field(..., ge=0.0, le=20.0)

ge=0.0 means "greater than or equal to 0." le=20.0 means "less than or equal to 20."

For free, you get:

  • Automatic HTTP 422 if a user sends "sepal_length": "hello"
  • Automatic HTTP 422 if a user sends "sepal_length": -5
  • Auto-generated OpenAPI documentation at /docs
  • Type hints that your IDE understands
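You can watch this validation happen outside FastAPI too. A minimal sketch, assuming Pydantic is installed (the schema mirrors the fields shown above; it is not the project's exact api/schemas.py):

```python
from pydantic import BaseModel, Field, ValidationError

class PredictRequest(BaseModel):
    sepal_length: float = Field(..., ge=0.0, le=20.0)
    sepal_width: float = Field(..., ge=0.0, le=20.0)
    petal_length: float = Field(..., ge=0.0, le=20.0)
    petal_width: float = Field(..., ge=0.0, le=20.0)

# A valid request parses cleanly...
ok = PredictRequest(sepal_length=5.1, sepal_width=3.5,
                    petal_length=1.4, petal_width=0.2)

# ...while an out-of-range value raises ValidationError, which FastAPI
# converts into the 422 response you saw with curl.
try:
    PredictRequest(sepal_length=-5, sepal_width=3.5,
                   petal_length=1.4, petal_width=0.2)
except ValidationError as e:
    print("rejected:", e.errors()[0]["loc"])
```

This is the whole validation layer: no if-statements, no manual error responses.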

16. How the Code Flows Together

Here is a complete trace of what happens when you run python scripts/train.py:

scripts/train.py
│
│  parse_args() — reads --experiment, --tracking-uri from CLI
│  sets os.environ for config.py to pick up
│
└── run_training()                               [src/training/trainer.py]
    │
    ├── _configure_mlflow()
    │     mlflow.set_tracking_uri(...)
    │     mlflow.set_experiment("iris-classifier")
    │
    ├── load_dataset()                           [src/data/loader.py]
    │     load_iris(as_frame=True)
    │     map integer targets → "setosa", "versicolor", "virginica"
    │     returns X: DataFrame(150×4), y: Series(150,)
    │
    ├── split_data(X, y)
    │     train_test_split(stratify=y, test_size=0.2)
    │     returns X_train(120×4), X_test(30×4), y_train, y_test
    │
    ├── get_label_encoder(y)
    │     LabelEncoder().fit(["setosa","versicolor","virginica"])
    │
    ├── build_pipeline()                         [src/training/pipeline.py]
    │     Pipeline([StandardScaler(), RandomForestClassifier()])
    │
    ├── GridSearchCV(pipeline, HYPERPARAMETER_GRID, cv=5)
    │
    ├── with mlflow.start_run():
    │     │
    │     ├── grid_search.fit(X_train, y_train)
    │     │     Tries 18 combinations × 5 folds = 90 model fits
    │     │     Retrains best params on full X_train
    │     │
    │     ├── mlflow.log_params(best_params)
    │     │
    │     ├── best_model.predict(X_test) → y_pred
    │     │
    │     ├── compute_metrics(y_test, y_pred)    [src/evaluation/metrics.py]
    │     │     accuracy_score, f1_score, precision_score, recall_score
    │     │
    │     ├── mlflow.log_metrics(metrics)
    │     │
    │     ├── mlflow.sklearn.log_model(best_model, registered_model_name="IrisClassifier")
    │     │     Saves model to mlruns/<experiment_id>/<run_id>/artifacts/model/
    │     │
    │     └── pickle.dump(best_model, "models/iris_classifier.pkl")
    │         pickle.dump(label_encoder, "models/label_encoder.pkl")
    │
    └── returns { run_id, best_params, metrics, model_path }

And when you call POST /predict/:

HTTP POST /predict/  {"sepal_length": 5.1, ...}
│
└── api/routers/predict.py: predict()
    │
    ├── Pydantic validates the request body
    │   PredictRequest(sepal_length=5.1, sepal_width=3.5, ...)
    │
    ├── Depends(get_predictor) → returns cached Predictor instance
    │
    ├── request.to_feature_list() → [5.1, 3.5, 1.4, 0.2]
    │
    └── predictor.predict([5.1, 3.5, 1.4, 0.2])
        │                                       [src/inference/predictor.py]
        ├── pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=FEATURE_NAMES)
        ├── pipeline.predict(X) → ["setosa"]
        ├── pipeline.predict_proba(X) → [[0.98, 0.01, 0.01]]
        └── return {
              "predicted_class": "setosa",
              "confidence": 0.98,
              "class_probabilities": {"setosa": 0.98, ...}
            }
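The response assembly in that flow can be approximated in a few lines. This is a self-contained sketch: the real Predictor loads the pickled pipeline from models/iris_classifier.pkl instead of training inline, and FEATURE_NAMES here are simply the raw scikit-learn column names:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X = iris.data  # columns: "sepal length (cm)", "sepal width (cm)", ...
y = iris.target.map(dict(enumerate(["setosa", "versicolor", "virginica"])))

# Stand-in for the pickled pipeline the real Predictor loads from disk
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
pipeline.fit(X, y)

def predict(features):
    """Mirror of Predictor.predict: wrap the features in a DataFrame, return class + probabilities."""
    X_new = pd.DataFrame([features], columns=X.columns)
    probs = pipeline.predict_proba(X_new)[0]
    return {
        "predicted_class": pipeline.predict(X_new)[0],
        "confidence": float(probs.max()),
        "class_probabilities": {c: float(p) for c, p in zip(pipeline.classes_, probs)},
    }

result = predict([5.1, 3.5, 1.4, 0.2])
print(result["predicted_class"], result["confidence"])
```

Wrapping the raw list in a DataFrame with the original training column names matters: scikit-learn pipelines fitted on a DataFrame warn (or misbehave) when predict-time input lacks matching feature names.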

17. Common Errors & How to Fix Them

ModuleNotFoundError: No module named 'src'

Cause: Running Python from the wrong directory, or the venv is not activated.

Fix:

# Make sure you are in the project root
cd /path/to/ml-pipeline

# Make sure the venv is activated (you should see (.venv) in your prompt)
source .venv/bin/activate

# Then run again
python scripts/train.py

ModelNotFoundError: Model not found at '.../models/iris_classifier.pkl'

Cause: You started the API before running training. The model doesn't exist yet.

Fix:

# Run training first
python scripts/train.py

# Then start the API
uvicorn api.main:app --reload

Address already in use (port 8000 or 5000)

Cause: Something else is already using that port (possibly a previous server you didn't stop).

Fix:

# Find what's using port 8000
lsof -i :8000        # Linux/Mac
netstat -ano | findstr :8000   # Windows

# Kill it (replace PID with the process ID from above)
kill -9 <PID>

# Or use a different port
uvicorn api.main:app --port 8001

docker: command not found

Cause: Docker is not installed, or the docker CLI is not on your PATH. (A different error — "Cannot connect to the Docker daemon" — means Docker is installed but Docker Desktop isn't running.)

Fix: Install Docker Desktop if you haven't, then open it and wait for it to say "Docker is running."


Permission denied when running run_pipeline.sh

Cause: The script is not marked as executable.

Fix:

# Make the script executable, then run it directly
chmod +x scripts/run_pipeline.sh
./scripts/run_pipeline.sh

# Or skip chmod and run it through bash
bash scripts/run_pipeline.sh

MLflow UI shows nothing / empty experiments

Cause: The mlruns/ folder doesn't exist yet (training hasn't been run), or you're pointing at the wrong URI.

Fix:

# Make sure you train first
python scripts/train.py

# Then start MLflow pointing at the right folder
mlflow ui --backend-store-uri mlruns --port 5000

422 Unprocessable Entity from the API

Cause: Your request body is missing a field or has an invalid value (e.g., a negative measurement).

Fix: Check the error response body — FastAPI tells you exactly which field is wrong:

{
  "detail": [
    {
      "loc": ["body", "sepal_length"],
      "msg": "ensure this value is greater than or equal to 0",
      "type": "value_error.number.not_ge"
    }
  ]
}
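You can reproduce that 422 outside FastAPI entirely, since it comes from the Pydantic model. A sketch of the validation at work, assuming the request schema uses non-negative field constraints as the error message suggests:

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical shape of the project's request schema (field names from the tutorial)
class PredictRequest(BaseModel):
    sepal_length: float = Field(ge=0)
    sepal_width: float = Field(ge=0)
    petal_length: float = Field(ge=0)
    petal_width: float = Field(ge=0)

# A negative measurement fails validation — FastAPI surfaces this as a 422 response
errors = None
try:
    PredictRequest(sepal_length=-1.0, sepal_width=3.5, petal_length=1.4, petal_width=0.2)
except ValidationError as exc:
    errors = exc.errors()

print(errors[0]["loc"], errors[0]["msg"])
```

The `loc` entry in each error is exactly what appears in the API's JSON response, which is why reading the 422 body pinpoints the offending field.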

18. Extending the Project

Once you're comfortable with the project, here are ways to make it even more impressive:

Swap in a different dataset

Replace the Iris loader in src/data/loader.py with any CSV:

def load_dataset():
    df = pd.read_csv("data/raw/your_dataset.csv")
    X = df.drop(columns=["target"])
    y = df["target"]
    return X, y

Everything else — training, MLflow logging, FastAPI — works unchanged.

Add a new model (XGBoost)

In src/training/pipeline.py:

from xgboost import XGBClassifier

def build_pipeline():
    return Pipeline([
        ("scaler", StandardScaler()),
        # use_label_encoder was deprecated and later removed from XGBoost, so don't pass it;
        # eval_metric="logloss" silences the default-metric warning.
        # Note: XGBClassifier expects integer class labels, so fit on label-encoded y
        # (the project's get_label_encoder already produces one).
        ("classifier", XGBClassifier(eval_metric="logloss")),
    ])

Update HYPERPARAMETER_GRID in src/config.py to match XGBoost params.
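For example, a replacement grid sized like the original 18-combination RandomForest grid might look like this (parameter values are illustrative, not tuned):

```python
# Hypothetical HYPERPARAMETER_GRID for src/config.py with XGBoost
# 3 × 3 × 2 = 18 combinations, matching the size of the original grid
HYPERPARAMETER_GRID = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [3, 5, 7],
    "classifier__learning_rate": [0.05, 0.1],
}
```

The `classifier__` prefix must match the step name in build_pipeline(), since GridSearchCV uses it to route parameters into the pipeline.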

Promote the best model in MLflow

# scripts/promote_best_model.py
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Find the run with the best test accuracy (experiment ID "1" assumed)
runs = client.search_runs("1", order_by=["metrics.test_accuracy DESC"], max_results=1)
best_run_id = runs[0].info.run_id

# Find the registered model version that was logged from that run
best_version = next(
    v for v in client.search_model_versions("name='IrisClassifier'")
    if v.run_id == best_run_id
)

client.transition_model_version_stage(
    name="IrisClassifier",
    version=best_version.version,
    stage="Production",
)

Add GitHub Actions CI

Push your project to GitHub — the .github/workflows/ci.yml file is already written. It will automatically run lint → tests → training smoke test → Docker build on every push.

git init
git add .
git commit -m "feat: initial ML pipeline"
git remote add origin https://github.com/YOUR_USERNAME/ml-pipeline.git
git push -u origin main

Quick Reference Card

┌─────────────────────────────────────────────────────────────────────┐
│                           QUICK REFERENCE                           │
├────────────────────────────┬────────────────────────────────────────┤
│  Setup                     │  source .venv/bin/activate             │
│  Train model               │  python scripts/train.py               │
│  Start API                 │  uvicorn api.main:app --reload         │
│  MLflow UI                 │  mlflow ui --backend-store-uri mlruns  │
│  Run tests                 │  pytest tests/ -v                      │
│  Run tests + coverage      │  pytest tests/ --cov=src --cov=api     │
│  Docker (all services)     │  docker compose up --build             │
│  Docker (retrain only)     │  docker compose run --rm trainer       │
│  Docker (stop)             │  docker compose down                   │
│  Install cron              │  crontab crontab.txt                   │
├────────────────────────────┼────────────────────────────────────────┤
│  API docs                  │  http://localhost:8000/docs            │
│  API health check          │  http://localhost:8000/health          │
│  MLflow UI                 │  http://localhost:5000                 │
└────────────────────────────┴────────────────────────────────────────┘

Manual version 1.0 — Iris Classifier ML Pipeline
