DEV Community

Aniket Singh

🌸 Iris Classifier ML Pipeline — Complete Tutorial & Instructions Manual

Who this is for: Beginners and intermediate developers who want to understand how a real-world ML project is structured and run — from a cloned repository to a fully running system.

What you'll learn: Virtual environments, dependency management, project structure, MLflow experiment tracking, FastAPI inference servers, Docker containerisation, and automated CI/CD.

Prerequisites: Python installed, VS Code installed, internet connection. That's it.


📋 Table of Contents

  1. What This Project Does
  2. Understanding the Project Structure
  3. One-Time Setup: Install Required Tools
  4. Step 1 — Clone the Project
  5. Step 2 — Create a Python Virtual Environment
  6. Step 3 — Install Dependencies
  7. Step 4 — Configure Environment Variables
  8. Step 5 — Run the Training Pipeline
  9. Step 6 — Explore MLflow Experiment Tracking
  10. Step 7 — Start the FastAPI Inference Server
  11. Step 8 — Make Predictions via the API
  12. Step 9 — Run the Test Suite
  13. Step 10 — Run Everything with Docker
  14. Step 11 — Schedule Training with Cron
  15. Architecture Deep Dive
  16. How the Code Flows Together
  17. Common Errors & How to Fix Them
  18. Extending the Project

1. What This Project Does

This project simulates a production-grade machine learning system. Here is the high-level picture:

┌─────────────────────────────────────────────────────────────────────┐
│                        ML PIPELINE OVERVIEW                         │
│                                                                     │
│  [Cron / CLI]                                                       │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────┐    trains    ┌─────────────────┐                  │
│  │  Training   │ ──────────►  │  Saved Model    │                  │
│  │  Pipeline   │              │  (models/*.pkl) │                  │
│  └─────────────┘              └────────┬────────┘                  │
│       │                                │                            │
│       │ logs everything                │ loaded by                  │
│       ▼                                ▼                            │
│  ┌─────────────┐              ┌─────────────────┐                  │
│  │   MLflow    │              │   FastAPI       │◄── HTTP requests │
│  │  Tracking   │              │   Inference     │                  │
│  │     UI      │              │   Server        │──► predictions   │
│  └─────────────┘              └─────────────────┘                  │
│                                                                     │
│  All three services run together inside Docker Compose             │
└─────────────────────────────────────────────────────────────────────┘

The dataset: Iris — 150 flower measurements, 3 species (setosa, versicolor, virginica). A classic beginner dataset that's perfect for showcasing pipeline architecture without the model itself becoming the focus.

The model: RandomForestClassifier inside a sklearn Pipeline with StandardScaler. A GridSearchCV automatically searches for the best hyperparameters.
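As a taste of what the trainer does, here is a minimal, self-contained sketch of that setup (the real grid in src/training/pipeline.py likely searches more combinations than this illustrative one):

```python
# Sketch: StandardScaler + RandomForest in a Pipeline, tuned by GridSearchCV.
# The hyperparameter grid below is illustrative, not the project's exact grid.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# GridSearchCV cross-validates every parameter combination (5-fold here),
# then refits the best one on the full training set.
grid = GridSearchCV(
    pipeline,
    param_grid={
        "classifier__n_estimators": [50, 100],
        "classifier__max_depth": [None, 5],
    },
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(round(grid.best_estimator_.score(X_test, y_test), 4))
```

Note the `classifier__` prefix: inside a Pipeline, grid-search parameters are addressed as `<step name>__<parameter>`.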


2. Understanding the Project Structure

Before touching any code, read this section. Understanding why the project is structured this way is what separates a portfolio project from a "notebook dumped into a repo."

ml-pipeline/
│
├── src/                    ← The core Python library (business logic)
│   ├── config.py           ← ALL settings live here (paths, hyperparams, env vars)
│   ├── data/
│   │   └── loader.py       ← Loads Iris dataset, splits train/test
│   ├── training/
│   │   ├── pipeline.py     ← Builds the sklearn Pipeline object
│   │   └── trainer.py      ← Orchestrates training: GridSearchCV + MLflow logging
│   ├── evaluation/
│   │   └── metrics.py      ← Accuracy, F1, confusion matrix — pure functions
│   └── inference/
│       └── predictor.py    ← Loads saved model, exposes predict() method
│
├── api/                    ← The FastAPI web application
│   ├── main.py             ← Creates and configures the FastAPI app
│   ├── schemas.py          ← Defines the shape of API requests and responses
│   └── routers/
│       └── predict.py      ← The actual /predict HTTP endpoints
│
├── scripts/
│   ├── train.py            ← CLI entry point: `python scripts/train.py`
│   └── run_pipeline.sh     ← Bash wrapper used by cron
│
├── tests/                  ← Automated tests
│   ├── test_data_loader.py
│   ├── test_metrics.py
│   ├── test_predictor.py
│   └── test_api.py
│
├── docker/
│   ├── Dockerfile.train    ← Container for the training job
│   └── Dockerfile.api      ← Container for the FastAPI server
│
├── models/                 ← Created automatically — stores .pkl files
├── mlruns/                 ← Created automatically — stores MLflow data
├── logs/                   ← Created automatically — stores training logs
│
├── docker-compose.yml      ← Wires all Docker services together
├── requirements.txt        ← Python dependencies
├── Makefile                ← Shortcuts: `make train`, `make serve`, etc.
├── crontab.txt             ← Cron schedule definition
├── pyproject.toml          ← Tool config (pytest, ruff, mypy)
└── .env.example            ← Template for environment variables

Key design principle: src/ contains zero web framework code. api/ contains zero ML logic. They communicate through src/inference/predictor.py. This makes every layer independently testable.


3. One-Time Setup: Install Required Tools

You need three tools installed on your machine. Do this before anything else.

3.1 Verify Python is installed

Open a terminal (on Linux/Mac) or Command Prompt / PowerShell (on Windows):

python --version
# or on some systems:
python3 --version

You should see Python 3.10.x or higher. If not, download it from python.org.

On Linux/Ubuntu, you may need: sudo apt update && sudo apt install python3 python3-pip python3-venv

3.2 Install Git (to push to GitHub later)

git --version

If not installed:

  • Ubuntu/Debian: sudo apt install git
  • Mac: xcode-select --install
  • Windows: Download from git-scm.com

3.3 Install Docker Desktop (for the containerised workflow)

Docker lets you run the entire stack — API + MLflow UI + trainer — with a single command, without installing anything else on your machine.

  1. Go to docs.docker.com/get-docker
  2. Download and install Docker Desktop for your OS
  3. Open Docker Desktop and wait for it to show "Docker is running"
  4. Verify in the terminal:
docker --version
docker compose version

On Linux, after installing Docker Engine, add your user to the docker group so you don't need sudo:

sudo usermod -aG docker $USER
newgrp docker

4. Step 1 — Clone the Project

4.1 Clone the repository from GitHub

Clone the Iris-Classifier-ML-Pipeline to a location of your choice, for example ~/projects/.

git clone https://github.com/aniket-1177/Iris-Classifier-ML-Pipeline.git

4.2 Open in VS Code

cd Iris-Classifier-ML-Pipeline
code .

Or open VS Code manually → File → Open Folder → select Iris-Classifier-ML-Pipeline.

Install the recommended VS Code extension for Python: when VS Code prompts you, click Install. If it doesn't prompt, press Ctrl+Shift+X, search Python, and install the Microsoft extension.


5. Step 2 — Create a Python Virtual Environment

What is a virtual environment and why do we need one?

A virtual environment is an isolated Python installation just for this project. Without it, every project on your machine would share the same packages — which leads to version conflicts. With a venv, installing scikit-learn==1.4.0 here won't affect any other project.

Your machine
│
├── System Python (don't touch this)
│
└── projects/
    └── ml-pipeline/
        └── .venv/          ← A private Python just for this project
            ├── bin/python
            └── lib/
                ├── scikit-learn
                ├── fastapi
                ├── mlflow
                └── ...

Create the virtual environment

# Make sure you are inside the ml-pipeline directory
pwd
# Should print something like: /home/yourname/projects/ml-pipeline

# Create the venv (this creates a .venv folder)
python -m venv .venv

Activate the virtual environment

You must activate the venv every time you open a new terminal window.

# Linux / Mac:
source .venv/bin/activate

# Windows (Command Prompt):
.venv\Scripts\activate.bat

# Windows (PowerShell):
.venv\Scripts\Activate.ps1

After activation, your terminal prompt changes to show (.venv):

(.venv) username@os:~/projects/Iris-Classifier-ML-Pipeline$

VS Code tip: Press Ctrl+Shift+P → type "Python: Select Interpreter" → choose the one that says .venv. VS Code will now automatically activate the venv in all new integrated terminals.


6. Step 3 — Install Dependencies

With your venv activated, install all required packages:

pip install --upgrade pip
pip install -r requirements.txt

This will install approximately 15 packages. It may take 2–5 minutes on the first run.

What's being installed:

Package          Why
scikit-learn     Machine learning — our model, pipeline, and grid search
pandas / numpy   Data manipulation
mlflow           Experiment tracking and model registry
fastapi          The web framework for our inference API
uvicorn          The ASGI web server that runs FastAPI
pydantic         Data validation for API requests/responses

Verify installation

python -c "import sklearn, mlflow, fastapi; print('All good!')"
# Should print: All good!

7. Step 4 — Configure Environment Variables

Environment variables let you change settings (like which MLflow server to use) without editing code.

Create your .env file

cp .env.example .env

Open .env in VS Code. For local development, the defaults are fine:

MLFLOW_TRACKING_URI=file://./mlruns
MLFLOW_EXPERIMENT_NAME=iris-classifier
API_HOST=0.0.0.0
API_PORT=8000
LOG_LEVEL=INFO

What does file://./mlruns mean? It tells MLflow to store all experiment data in a local folder called mlruns/ instead of connecting to a remote server. Perfect for development.
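The code that consumes these variables follows one simple pattern: read each one with a sensible local default. A minimal sketch of that pattern (variable names taken from the .env above; the directory layout here is illustrative, the real logic lives in src/config.py):

```python
import os
from pathlib import Path

# Each setting reads an environment variable and falls back to a local
# default, so Docker / CI can override behaviour without code changes.
MLRUNS_DIR = Path.cwd() / "mlruns"

MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", f"file://{MLRUNS_DIR}")
API_HOST = os.getenv("API_HOST", "0.0.0.0")
API_PORT = int(os.getenv("API_PORT", "8000"))  # env vars are strings
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")

print(MLFLOW_TRACKING_URI)
```

Note the `int(...)` cast: environment variables always arrive as strings, so anything numeric must be converted explicitly.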


8. Step 5 — Run the Training Pipeline

This is the core of the project. Let's run it and understand what happens at each step.

python scripts/train.py

You will see output like this:

2024-06-01 10:23:15 | INFO     | __main__ | =======================================================
2024-06-01 10:23:15 | INFO     | __main__ |   ML Pipeline Training Run
2024-06-01 10:23:15 | INFO     | __main__ |   Experiment : iris-classifier
2024-06-01 10:23:15 | INFO     | __main__ | =======================================================
2024-06-01 10:23:15 | INFO     | src.data.loader | Loading Iris dataset...
2024-06-01 10:23:15 | INFO     | src.data.loader | Dataset loaded | samples=150 | features=4 | classes=['setosa', 'versicolor', 'virginica']
2024-06-01 10:23:15 | INFO     | src.data.loader | Data split | train=120 | test=30
2024-06-01 10:23:16 | INFO     | src.training.trainer | MLflow run started | run_id=abc123...
2024-06-01 10:23:18 | INFO     | src.training.trainer | Best params: {'classifier__max_depth': None, 'classifier__n_estimators': 100}
2024-06-01 10:23:18 | INFO     | src.training.trainer | ─────────────────────────────────────────────
2024-06-01 10:23:18 | INFO     | src.training.trainer | accuracy                       0.9667
2024-06-01 10:23:18 | INFO     | src.training.trainer | macro_f1                       0.9667
...
2024-06-01 10:23:19 | INFO     | __main__ | Training finished successfully.
2024-06-01 10:23:19 | INFO     | __main__ |   Accuracy   : 0.9667
2024-06-01 10:23:19 | INFO     | __main__ |   Model path : /home/.../models/iris_classifier.pkl

What just happened internally?

scripts/train.py
    └── calls run_training() in src/training/trainer.py
            │
            ├── 1. load_dataset()       → loads 150 Iris rows from scikit-learn
            ├── 2. split_data()         → 120 train, 30 test (stratified)
            ├── 3. build_pipeline()     → StandardScaler + RandomForestClassifier
            ├── 4. GridSearchCV.fit()   → tries 18 hyperparameter combinations (5-fold CV each)
            ├── 5. compute_metrics()    → accuracy, F1, etc. on held-out test set
            ├── 6. mlflow.log_*()       → saves params + metrics + model to mlruns/
            └── 7. pickle.dump()        → saves best model to models/iris_classifier.pkl
                                           saves label encoder to models/label_encoder.pkl

Verify the output artifacts

ls models/
# iris_classifier.pkl   label_encoder.pkl

ls mlruns/
# 0/   (experiment folder)

CLI flags

# Custom experiment name
python scripts/train.py --experiment my-experiment-v2

# Save results to a JSON file
python scripts/train.py --output-json results/run1.json

# See all options
python scripts/train.py --help

9. Step 6 — Explore MLflow Experiment Tracking

MLflow automatically captured everything about the training run. Let's view it.

Open a new terminal (keep your first terminal free for the API later). Activate the venv:

source .venv/bin/activate
mlflow ui --backend-store-uri mlruns --port 5000

Open your browser and go to http://localhost:5000

What you'll see in the MLflow UI

Experiments list: You'll see iris-classifier with one run logged.

Inside the run, explore:

  • Parameters tab — the hyperparameter values GridSearchCV chose as best:
  cv_folds              5
  test_size             0.2
  classifier__n_estimators    100
  classifier__max_depth       None
  classifier__min_samples_split  2
  • Metrics tab — all evaluation scores:
  cv_best_score         0.9583
  test_accuracy         0.9667
  test_macro_f1         0.9667
  test_macro_precision  0.9683
  test_macro_recall     0.9667
  f1_setosa             1.0000
  f1_versicolor         0.9333
  f1_virginica          0.9667
  • Artifacts tab — the saved model files and a preview of the input schema

  • Models tab (top menu) — the IrisClassifier registered model with version history

Why does this matter for a portfolio? In a real company, dozens of engineers run hundreds of experiments. MLflow lets you compare them all — which model was best? What were its settings? What data was it trained on? This is how ML teams avoid the "I don't know which model is in production" problem.

Run training a second time and compare

# Back in your first terminal:
python scripts/train.py --experiment iris-classifier

Now refresh the MLflow UI. You'll see two runs side by side. Click the checkboxes on both and hit Compare to see a diff of parameters and metrics.


10. Step 7 — Start the FastAPI Inference Server

The trained model is now saved to disk. Let's serve it as a REST API.

In your first terminal (with venv activated):

uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

The --reload flag means the server restarts automatically when you edit code — great for development.

You should see:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Starting Iris Classifier API v1.0.0
INFO:     Model ready | classes=['setosa', 'versicolor', 'virginica']
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000

Open your browser at http://localhost:8000/docs

The Interactive API Docs (Swagger UI)

FastAPI automatically generates an interactive documentation page from your code. You don't write this HTML — it's created from your Pydantic schemas and route definitions.

You'll see three endpoints:

  • POST /predict/ — single flower prediction
  • POST /predict/batch — multiple flowers at once
  • GET /health — is the model loaded?

Understanding the URL http://0.0.0.0:8000

http://  0.0.0.0  :  8000  /docs
  │         │          │       │
  │         │          │       └── Path (Swagger UI page)
  │         │          └────────── Port number
  │         └───────────────────── "All network interfaces" = accessible from anywhere on this machine
  └─────────────────────────────── Protocol

0.0.0.0 as the host means "listen on all network interfaces." When you open it in a browser, you use localhost or 127.0.0.1 instead.


11. Step 8 — Make Predictions via the API

You have four ways to call the API. Try them all — each is used in different real-world scenarios.

Method A — Swagger UI (browser)

  1. Go to http://localhost:8000/docs
  2. Click on POST /predict/
  3. Click Try it out
  4. Replace the request body with:
   {
     "sepal_length": 5.1,
     "sepal_width": 3.5,
     "petal_length": 1.4,
     "petal_width": 0.2
   }
  5. Click Execute
  6. Scroll down to see the response

Method B — curl (terminal)

Open a third terminal and run:

curl -X POST http://localhost:8000/predict/ \
  -H "Content-Type: application/json" \
  -d '{
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2
  }'

Expected response:

{
  "predicted_class": "setosa",
  "confidence": 0.98,
  "class_probabilities": {
    "setosa": 0.98,
    "versicolor": 0.01,
    "virginica": 0.01
  }
}

Method C — Python requests (script)

Create a quick test script:

# test_request.py  (create this in the project root)
import requests

url = "http://localhost:8000/predict/"

payload = {
    "sepal_length": 6.3,
    "sepal_width": 3.3,
    "petal_length": 6.0,
    "petal_width": 2.5,
}

response = requests.post(url, json=payload)
print(response.json())
pip install requests   # if not already installed
python test_request.py

Method D — Batch prediction

curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "samples": [
      {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
      {"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0, "petal_width": 2.5},
      {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4}
    ]
  }'

Understanding what happens on each request

HTTP POST /predict/
         │
         ▼ api/routers/predict.py
         │  Pydantic validates the JSON (correct types? in range?)
         │  If invalid → 422 Unprocessable Entity (automatic)
         │
         ▼ Depends(get_predictor)
         │  FastAPI calls get_predictor() to inject the Predictor object
         │  lru_cache means the model is NOT reloaded on every request
         │
         ▼ predictor.predict([5.1, 3.5, 1.4, 0.2])
         │  Builds a pandas DataFrame with the correct column names
         │  Runs pipeline.predict() — scaler transforms, then RF predicts
         │  Runs pipeline.predict_proba() — gets probability per class
         │
         ▼ Returns PredictResponse
            FastAPI serializes it to JSON and sends HTTP 200

Testing input validation

FastAPI + Pydantic automatically validates every request. Try sending a bad value:

curl -X POST http://localhost:8000/predict/ \
  -H "Content-Type: application/json" \
  -d '{"sepal_length": -5, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}'

You'll get a 422 Unprocessable Entity with a clear error message — no custom error handling code needed. This is the power of Pydantic.


12. Step 9 — Run the Test Suite

Stop the API server for now (Ctrl+C). Let's run the automated tests.

# Run all tests with verbose output
pytest tests/ -v

# Run with a coverage report
pytest tests/ -v --cov=src --cov=api --cov-report=term-missing

Understanding the test output

tests/test_data_loader.py::TestLoadDataset::test_returns_dataframe_and_series PASSED
tests/test_data_loader.py::TestLoadDataset::test_correct_shape PASSED
tests/test_data_loader.py::TestSplitData::test_split_sizes PASSED
...
tests/test_api.py::TestPredictEndpoint::test_valid_request_200 PASSED
tests/test_api.py::TestPredictEndpoint::test_missing_field_422 PASSED
tests/test_api.py::TestPredictEndpoint::test_negative_value_422 PASSED
...

---------- coverage: src ----------
src/config.py              28     3    89%
src/data/loader.py         32     2    94%
src/training/trainer.py    58    12    79%
src/inference/predictor.py 55     4    93%
...

What each test file covers

File                  What it tests                                    Key technique
test_data_loader.py   Shape, columns, no nulls, stratification         Direct assertions
test_metrics.py       Perfect vs imperfect predictions, rounding       Parametrized fixtures
test_predictor.py     Model loading, predict output, error cases       unittest.mock.patch to fake disk paths
test_api.py           HTTP status codes, response schema, validation   FastAPI TestClient — no real server needed

Why test_predictor.py uses mock patches:
The Predictor class loads .pkl files from disk. In tests, we don't want to depend on a pre-trained model existing. Instead, we use unittest.mock.patch to replace the file paths with a temp directory containing a freshly trained mini-model. This makes the tests fast, isolated, and runnable in CI.
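Here is a self-contained sketch of that technique. The Predictor class and MODEL_PATH below are toy stand-ins, not the project's real code, but the mock-and-temp-dir pattern is the same:

```python
# Sketch of the mocking technique: patch a module-level model path so the
# test loads a freshly created fake model from a temp directory instead
# of depending on a real pre-trained .pkl on disk.
import pickle
import sys
import tempfile
from pathlib import Path
from unittest import mock

MODEL_PATH = Path("/nonexistent/models/iris_classifier.pkl")  # "real" path

class Predictor:
    """Toy stand-in for src/inference/predictor.py."""
    def __init__(self):
        with open(MODEL_PATH, "rb") as f:  # reads the module-level path
            self.model = pickle.load(f)

    def predict(self, features):
        return self.model["always"]  # toy model: constant prediction

def test_predictor_with_fake_model():
    with tempfile.TemporaryDirectory() as tmp:
        fake_path = Path(tmp) / "iris_classifier.pkl"
        with open(fake_path, "wb") as f:
            pickle.dump({"always": "setosa"}, f)  # mini "model" made on the fly
        # Patch the path for the duration of the test only
        with mock.patch.object(sys.modules[__name__], "MODEL_PATH", fake_path):
            assert Predictor().predict([5.1, 3.5, 1.4, 0.2]) == "setosa"

test_predictor_with_fake_model()
print("test passed")
```

Because the patch is scoped to the `with` block, MODEL_PATH is automatically restored afterwards, which keeps tests isolated from each other.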

Run a single test file

pytest tests/test_api.py -v
pytest tests/test_data_loader.py -v

Run tests matching a pattern

pytest tests/ -k "test_valid_request" -v
pytest tests/ -k "batch" -v

13. Step 10 — Run Everything with Docker

So far we've been running services manually in separate terminals. Docker Compose lets you run the entire stack with one command and tear it all down just as easily.

Make sure Docker Desktop is running

Check the Docker Desktop taskbar icon — it should say "Docker Desktop is running."

Build and start all services

docker compose up --build

The first build takes 3–5 minutes (it downloads base images and installs packages). Subsequent starts are fast.

Watch the output — you'll see three services starting:

mlflow_server  | [INFO] Starting MLflow server...
mlflow_server  | [INFO] Listening on http://0.0.0.0:5000
ml_trainer     | [INFO] Loading Iris dataset...
ml_trainer     | [INFO] Training finished. Accuracy: 0.9667
ml_trainer     | [INFO] Model saved to /app/models/iris_classifier.pkl
ml_trainer exited with code 0       ← trainer exits after one run (this is normal)
ml_api         | [INFO] Model ready | classes=['setosa', 'versicolor', 'virginica']
ml_api         | [INFO] Uvicorn running on http://0.0.0.0:8000

Now open the API docs at http://localhost:8000/docs and the MLflow UI at http://localhost:5000.

Why the trainer exits

The trainer service is configured with restart: "no" — it runs the training job once and exits. This is intentional. In production, you'd trigger retraining on a schedule (via cron or a CI job), not keep a process running forever.

Re-run training inside Docker (without rebuilding)

docker compose run --rm trainer

This spins up a fresh trainer container, trains the model, saves it to the shared volume, and exits.

Stop all services

docker compose down

Understanding Docker volumes

The models/ directory is shared between the trainer and the API using a Docker named volume called models_vol:

┌────────────────┐         ┌──────────────────┐
│  trainer       │ writes  │  models_vol      │
│  container     │────────►│  (Docker volume) │
└────────────────┘         └────────┬─────────┘
                                    │ reads
                           ┌────────▼─────────┐
                           │  api             │
                           │  container       │
                           └──────────────────┘

This means you can retrain the model and the running API picks up the new model without rebuilding or redeploying the API image.

Useful Docker commands

# See running containers
docker ps

# See logs from the API container
docker logs ml_api -f

# Open a shell inside the API container (for debugging)
docker exec -it ml_api bash

# Remove everything including volumes (full reset)
docker compose down -v

14. Step 11 — Schedule Training with Cron

Cron is a Unix tool that runs commands on a schedule. We've included a crontab.txt that runs the training pipeline every Monday at 2 AM.

View the schedule

cat crontab.txt
# Retrain model every Monday at 02:00 AM
0 2 * * 1 cd /app && bash scripts/run_pipeline.sh >> /var/log/ml_pipeline_cron.log 2>&1

Understanding cron syntax

0   2   *   *   1
│   │   │   │   │
│   │   │   │   └── Day of week: 1 = Monday (0=Sun, 6=Sat)
│   │   │   └────── Month: * = every month
│   │   └────────── Day of month: * = every day
│   └────────────── Hour: 2 = 2 AM
└────────────────── Minute: 0 = on the hour
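To convince yourself how the five fields combine, here is a tiny illustrative matcher for simple entries like `0 2 * * 1` (numbers and `*` only; real cron also supports ranges, lists, and step values):

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """Check a simplified cron expression (plain numbers and '*' only)
    against a datetime. Not a full cron parser -- illustration only."""
    minute, hour, dom, month, dow = expr.split()
    fields = [
        (minute, dt.minute),
        (hour, dt.hour),
        (dom, dt.day),
        (month, dt.month),
        (dow, dt.isoweekday() % 7),  # cron convention: 0=Sunday ... 6=Saturday
    ]
    return all(f == "*" or int(f) == actual for f, actual in fields)

# 2024-06-03 was a Monday: 02:00 matches the schedule, 03:00 does not
print(cron_matches("0 2 * * 1", datetime(2024, 6, 3, 2, 0)))
print(cron_matches("0 2 * * 1", datetime(2024, 6, 3, 3, 0)))
```

The `isoweekday() % 7` trick converts Python's Monday=1..Sunday=7 numbering into cron's Sunday=0..Saturday=6.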

Install the crontab (Linux/Mac only)

crontab crontab.txt

# Verify it's installed
crontab -l

Test the pipeline script manually

bash scripts/run_pipeline.sh

This produces timestamped log files in logs/:

logs/
├── train_20240601_102315.log
└── results_20240601_102315.json

Cron inside Docker

To run cron inside the Docker trainer container instead of on the host machine, change the CMD in docker-compose.yml:

trainer:
  command: cron -f   # runs cron daemon in foreground (keeps container alive)

15. Architecture Deep Dive

This section explains the key architectural decisions — the "why" behind the code. This is exactly what interviewers and video viewers want to understand.

Why src/config.py is the single source of truth

# src/config.py
MODEL_PATH = MODELS_DIR / f"{MODEL_NAME}.pkl"
MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", f"file://{MLRUNS_DIR}")

Every path, every setting, every environment variable lives here. No other file hardcodes a path or reads an env var. If you need to change where models are stored, you change one line in config.py and it propagates everywhere.

Why the sklearn Pipeline prevents data leakage

# src/training/pipeline.py
Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier()),
])

Without a Pipeline, you might do this (which is wrong):

# ❌ WRONG — data leakage
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# GridSearchCV also scales training data, but it already "saw" the test fold stats

With a Pipeline inside GridSearchCV, the scaler is fit only on the training portion of each fold, never on the validation data. This gives you an honest estimate of real-world performance.
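For contrast, here is the safe version as a minimal sketch: because the scaler sits inside the Pipeline, cross-validation refits it on each fold's training rows only.

```python
# Leakage-free evaluation: the scaler is a pipeline step, so every CV
# fold fits it on that fold's training portion and never sees the
# validation rows when estimating the mean/std.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(round(scores.mean(), 4))
```

The same property is what makes it safe to hand the whole Pipeline to GridSearchCV, as the trainer does.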

Why lru_cache on get_predictor()

# src/inference/predictor.py
@lru_cache(maxsize=1)
def get_predictor() -> Predictor:
    return Predictor()

lru_cache memoises the function — after the first call, it returns the cached result without calling the function again.

Without it: Every HTTP request would load iris_classifier.pkl from disk → slow.
With it: The model loads once at startup, all requests share the same in-memory instance → fast.
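You can see the effect with a toy stand-in (the string below replaces the real, expensive pickle.load):

```python
from functools import lru_cache

load_count = 0  # counts how many times the "model" is loaded

@lru_cache(maxsize=1)
def get_predictor() -> str:
    global load_count
    load_count += 1              # stand-in for the expensive pickle.load()
    return "predictor-instance"  # stand-in for the real Predictor object

# Three "requests" -- the function body runs once, the rest hit the cache
for _ in range(3):
    get_predictor()
print(load_count)  # prints 1
```

Since `maxsize=1` and the function takes no arguments, the cache holds exactly one entry: the shared Predictor.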

Why the Predictor is separate from the API

src/inference/predictor.py has zero imports from fastapi. This means:

  • You can import and use Predictor in a Celery worker, a CLI script, or a Jupyter notebook without FastAPI
  • You can unit-test it with pytest without starting a web server
  • You could swap FastAPI for Flask or gRPC and the predictor code would be unchanged

Why Pydantic schemas are worth the boilerplate

# api/schemas.py
class PredictRequest(BaseModel):
    sepal_length: float = Field(..., ge=0.0, le=20.0)

ge=0.0 means "greater than or equal to 0." le=20.0 means "less than or equal to 20."

For free, you get:

  • Automatic HTTP 422 if a user sends "sepal_length": "hello"
  • Automatic HTTP 422 if a user sends "sepal_length": -5
  • Auto-generated OpenAPI documentation at /docs
  • Type hints that your IDE understands
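You can watch this validation happen outside FastAPI too. A minimal sketch, assuming Pydantic is installed (the schema mirrors the fields shown above; it is not the project's exact api/schemas.py):

```python
from pydantic import BaseModel, Field, ValidationError

class PredictRequest(BaseModel):
    sepal_length: float = Field(..., ge=0.0, le=20.0)
    sepal_width: float = Field(..., ge=0.0, le=20.0)
    petal_length: float = Field(..., ge=0.0, le=20.0)
    petal_width: float = Field(..., ge=0.0, le=20.0)

# A valid request parses cleanly...
ok = PredictRequest(sepal_length=5.1, sepal_width=3.5,
                    petal_length=1.4, petal_width=0.2)

# ...while an out-of-range value raises ValidationError, which FastAPI
# converts into the 422 response you saw with curl.
try:
    PredictRequest(sepal_length=-5, sepal_width=3.5,
                   petal_length=1.4, petal_width=0.2)
except ValidationError as e:
    print("rejected:", e.errors()[0]["loc"])
```

This is the whole validation layer: no if-statements, no manual error responses.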

16. How the Code Flows Together

Here is a complete trace of what happens when you run python scripts/train.py:

scripts/train.py
│
│  parse_args() — reads --experiment, --tracking-uri from CLI
│  sets os.environ for config.py to pick up
│
└── run_training()                               [src/training/trainer.py]
    │
    ├── _configure_mlflow()
    │     mlflow.set_tracking_uri(...)
    │     mlflow.set_experiment("iris-classifier")
    │
    ├── load_dataset()                           [src/data/loader.py]
    │     load_iris(as_frame=True)
    │     map integer targets → "setosa", "versicolor", "virginica"
    │     returns X: DataFrame(150×4), y: Series(150,)
    │
    ├── split_data(X, y)
    │     train_test_split(stratify=y, test_size=0.2)
    │     returns X_train(120×4), X_test(30×4), y_train, y_test
    │
    ├── get_label_encoder(y)
    │     LabelEncoder().fit(["setosa","versicolor","virginica"])
    │
    ├── build_pipeline()                         [src/training/pipeline.py]
    │     Pipeline([StandardScaler(), RandomForestClassifier()])
    │
    ├── GridSearchCV(pipeline, HYPERPARAMETER_GRID, cv=5)
    │
    ├── with mlflow.start_run():
    │     │
    │     ├── grid_search.fit(X_train, y_train)
    │     │     Tries 18 combinations × 5 folds = 90 model fits
    │     │     Retrains best params on full X_train
    │     │
    │     ├── mlflow.log_params(best_params)
    │     │
    │     ├── best_model.predict(X_test) → y_pred
    │     │
    │     ├── compute_metrics(y_test, y_pred)    [src/evaluation/metrics.py]
    │     │     accuracy_score, f1_score, precision_score, recall_score
    │     │
    │     ├── mlflow.log_metrics(metrics)
    │     │
    │     ├── mlflow.sklearn.log_model(best_model, registered_model_name="IrisClassifier")
    │     │     Saves model to mlruns/<experiment_id>/<run_id>/artifacts/model/
    │     │
    │     └── pickle.dump(best_model, "models/iris_classifier.pkl")
    │         pickle.dump(label_encoder, "models/label_encoder.pkl")
    │
    └── returns { run_id, best_params, metrics, model_path }

And when you call POST /predict/:

HTTP POST /predict/  {"sepal_length": 5.1, ...}
│
└── api/routers/predict.py: predict()
    │
    ├── Pydantic validates the request body
    │   PredictRequest(sepal_length=5.1, sepal_width=3.5, ...)
    │
    ├── Depends(get_predictor) → returns cached Predictor instance
    │
    ├── request.to_feature_list() → [5.1, 3.5, 1.4, 0.2]
    │
    └── predictor.predict([5.1, 3.5, 1.4, 0.2])
        │                                       [src/inference/predictor.py]
        ├── pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=FEATURE_NAMES)
        ├── pipeline.predict(X) → ["setosa"]
        ├── pipeline.predict_proba(X) → [[0.98, 0.01, 0.01]]
        └── return {
              "predicted_class": "setosa",
              "confidence": 0.98,
              "class_probabilities": {"setosa": 0.98, ...}
            }
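The response assembly in that flow can be approximated in a few lines. This is a self-contained sketch: the real Predictor loads the pickled pipeline from models/iris_classifier.pkl instead of training inline, and FEATURE_NAMES here are simply the raw scikit-learn column names:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X = iris.data  # columns: "sepal length (cm)", "sepal width (cm)", ...
y = iris.target.map(dict(enumerate(["setosa", "versicolor", "virginica"])))

# Stand-in for the pickled pipeline the real Predictor loads from disk
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
pipeline.fit(X, y)

def predict(features):
    """Mirror of Predictor.predict: wrap the features in a DataFrame, return class + probabilities."""
    X_new = pd.DataFrame([features], columns=X.columns)
    probs = pipeline.predict_proba(X_new)[0]
    return {
        "predicted_class": pipeline.predict(X_new)[0],
        "confidence": float(probs.max()),
        "class_probabilities": {c: float(p) for c, p in zip(pipeline.classes_, probs)},
    }

result = predict([5.1, 3.5, 1.4, 0.2])
print(result["predicted_class"], result["confidence"])
```

Wrapping the raw list in a DataFrame with the original training column names matters: scikit-learn pipelines fitted on a DataFrame warn (or misbehave) when predict-time input lacks matching feature names.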

17. Common Errors & How to Fix Them

ModuleNotFoundError: No module named 'src'

Cause: Running Python from the wrong directory, or the venv is not activated.

Fix:

# Make sure you are in the project root
cd /path/to/ml-pipeline

# Make sure the venv is activated (you should see (.venv) in your prompt)
source .venv/bin/activate

# Then run again
python scripts/train.py

ModelNotFoundError: Model not found at '.../models/iris_classifier.pkl'

Cause: You started the API before running training. The model doesn't exist yet.

Fix:

# Run training first
python scripts/train.py

# Then start the API
uvicorn api.main:app --reload

Address already in use (port 8000 or 5000)

Cause: Something else is already using that port (possibly a previous server you didn't stop).

Fix:

# Find what's using port 8000
lsof -i :8000        # Linux/Mac
netstat -ano | findstr :8000   # Windows

# Kill it (replace PID with the process ID from above)
kill -9 <PID>

# Or use a different port
uvicorn api.main:app --port 8001

docker: command not found

Cause: Docker is not installed, or the docker CLI is not on your PATH. (A different error — "Cannot connect to the Docker daemon" — means Docker is installed but Docker Desktop isn't running.)

Fix: Install Docker Desktop if you haven't, then open it and wait for it to say "Docker is running."


Permission denied when running run_pipeline.sh

Cause: The script is not marked as executable.

Fix:

# Make the script executable, then run it directly
chmod +x scripts/run_pipeline.sh
./scripts/run_pipeline.sh

# Or skip chmod and run it through bash
bash scripts/run_pipeline.sh

MLflow UI shows nothing / empty experiments

Cause: The mlruns/ folder doesn't exist yet (training hasn't been run), or you're pointing at the wrong URI.

Fix:

# Make sure you train first
python scripts/train.py

# Then start MLflow pointing at the right folder
mlflow ui --backend-store-uri mlruns --port 5000

422 Unprocessable Entity from the API

Cause: Your request body is missing a field or has an invalid value (e.g., a negative measurement).

Fix: Check the error response body — FastAPI tells you exactly which field is wrong:

{
  "detail": [
    {
      "loc": ["body", "sepal_length"],
      "msg": "ensure this value is greater than or equal to 0",
      "type": "value_error.number.not_ge"
    }
  ]
}
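You can reproduce that 422 outside FastAPI entirely, since it comes from the Pydantic model. A sketch of the validation at work, assuming the request schema uses non-negative field constraints as the error message suggests:

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical shape of the project's request schema (field names from the tutorial)
class PredictRequest(BaseModel):
    sepal_length: float = Field(ge=0)
    sepal_width: float = Field(ge=0)
    petal_length: float = Field(ge=0)
    petal_width: float = Field(ge=0)

# A negative measurement fails validation — FastAPI surfaces this as a 422 response
errors = None
try:
    PredictRequest(sepal_length=-1.0, sepal_width=3.5, petal_length=1.4, petal_width=0.2)
except ValidationError as exc:
    errors = exc.errors()

print(errors[0]["loc"], errors[0]["msg"])
```

The `loc` entry in each error is exactly what appears in the API's JSON response, which is why reading the 422 body pinpoints the offending field.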

18. Extending the Project

Once you're comfortable with the project, here are ways to make it even more impressive:

Swap in a different dataset

Replace the Iris loader in src/data/loader.py with any CSV:

def load_dataset():
    df = pd.read_csv("data/raw/your_dataset.csv")
    X = df.drop(columns=["target"])
    y = df["target"]
    return X, y

Everything else — training, MLflow logging, FastAPI — works unchanged.

Add a new model (XGBoost)

In src/training/pipeline.py:

from xgboost import XGBClassifier

def build_pipeline():
    return Pipeline([
        ("scaler", StandardScaler()),
        # use_label_encoder was deprecated and later removed from XGBoost, so don't pass it;
        # eval_metric="logloss" silences the default-metric warning.
        # Note: XGBClassifier expects integer class labels, so fit on label-encoded y
        # (the project's get_label_encoder already produces one).
        ("classifier", XGBClassifier(eval_metric="logloss")),
    ])

Update HYPERPARAMETER_GRID in src/config.py to match XGBoost params.
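For example, a replacement grid sized like the original 18-combination RandomForest grid might look like this (parameter values are illustrative, not tuned):

```python
# Hypothetical HYPERPARAMETER_GRID for src/config.py with XGBoost
# 3 × 3 × 2 = 18 combinations, matching the size of the original grid
HYPERPARAMETER_GRID = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [3, 5, 7],
    "classifier__learning_rate": [0.05, 0.1],
}
```

The `classifier__` prefix must match the step name in build_pipeline(), since GridSearchCV uses it to route parameters into the pipeline.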

Promote the best model in MLflow

# scripts/promote_best_model.py
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Find the run with the best test accuracy (experiment ID "1" assumed)
runs = client.search_runs("1", order_by=["metrics.test_accuracy DESC"], max_results=1)
best_run_id = runs[0].info.run_id

# Find the registered model version that was logged from that run
best_version = next(
    v for v in client.search_model_versions("name='IrisClassifier'")
    if v.run_id == best_run_id
)

client.transition_model_version_stage(
    name="IrisClassifier",
    version=best_version.version,
    stage="Production",
)

Add GitHub Actions CI

Push your project to GitHub — the .github/workflows/ci.yml file is already written. It will automatically run lint → tests → training smoke test → Docker build on every push.

git init
git add .
git commit -m "feat: initial ML pipeline"
git remote add origin https://github.com/YOUR_USERNAME/ml-pipeline.git
git push -u origin main

Quick Reference Card

┌─────────────────────────────────────────────────────────────────────┐
│                           QUICK REFERENCE                           │
├────────────────────────────┬────────────────────────────────────────┤
│  Setup                     │  source .venv/bin/activate             │
│  Train model               │  python scripts/train.py               │
│  Start API                 │  uvicorn api.main:app --reload         │
│  MLflow UI                 │  mlflow ui --backend-store-uri mlruns  │
│  Run tests                 │  pytest tests/ -v                      │
│  Run tests + coverage      │  pytest tests/ --cov=src --cov=api     │
│  Docker (all services)     │  docker compose up --build             │
│  Docker (retrain only)     │  docker compose run --rm trainer       │
│  Docker (stop)             │  docker compose down                   │
│  Install cron              │  crontab crontab.txt                   │
├────────────────────────────┼────────────────────────────────────────┤
│  API docs                  │  http://localhost:8000/docs            │
│  API health check          │  http://localhost:8000/health          │
│  MLflow UI                 │  http://localhost:5000                 │
└────────────────────────────┴────────────────────────────────────────┘

Manual version 1.0 — Iris Classifier ML Pipeline
