Iris Classifier ML Pipeline — Complete Tutorial & Instructions Manual
Who this is for: Beginners and intermediate developers who want to understand how a real-world ML project is structured and run — from a cloned repository to a fully running system.
What you'll learn: Virtual environments, dependency management, project structure, MLflow experiment tracking, FastAPI inference servers, Docker containerisation, and automated CI/CD.
Prerequisites: Python installed, VS Code installed, internet connection. That's it.
📋 Table of Contents
- What This Project Does
- Understanding the Project Structure
- One-Time Setup: Install Required Tools
- Step 1 — Clone the Project
- Step 2 — Create a Python Virtual Environment
- Step 3 — Install Dependencies
- Step 4 — Configure Environment Variables
- Step 5 — Run the Training Pipeline
- Step 6 — Explore MLflow Experiment Tracking
- Step 7 — Start the FastAPI Inference Server
- Step 8 — Make Predictions via the API
- Step 9 — Run the Test Suite
- Step 10 — Run Everything with Docker
- Step 11 — Schedule Training with Cron
- Architecture Deep Dive
- How the Code Flows Together
- Common Errors & How to Fix Them
- Extending the Project
1. What This Project Does
This project simulates a production-grade machine learning system. Here is the high-level picture:
┌─────────────────────────────────────────────────────────────────────┐
│ ML PIPELINE OVERVIEW │
│ │
│ [Cron / CLI] │
│ │ │
│ ▼ │
│ ┌─────────────┐ trains ┌─────────────────┐ │
│ │ Training │ ──────────► │ Saved Model │ │
│ │ Pipeline │ │ (models/*.pkl) │ │
│ └─────────────┘ └────────┬────────┘ │
│ │ │ │
│ │ logs everything │ loaded by │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ MLflow │ │ FastAPI │◄── HTTP requests │
│ │ Tracking │ │ Inference │ │
│ │ UI │ │ Server │──► predictions │
│ └─────────────┘ └─────────────────┘ │
│ │
│ All three services run together inside Docker Compose │
└─────────────────────────────────────────────────────────────────────┘
The dataset: Iris — 150 flower measurements, 3 species (setosa, versicolor, virginica). A classic beginner dataset that's perfect for showcasing pipeline architecture without the model itself becoming the focus.
The model: RandomForestClassifier inside a sklearn Pipeline with StandardScaler. A GridSearchCV automatically searches for the best hyperparameters.
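The setup described above can be sketched with scikit-learn directly. This is a minimal sketch: the repo's actual hyperparameter grid lives in `src/config.py` and is presumably larger than the toy grid here.

```python
# Minimal sketch of the StandardScaler + RandomForest + GridSearchCV setup.
# The real grid is defined in src/config.py; this one is deliberately tiny.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

grid = GridSearchCV(
    pipeline,
    param_grid={
        "classifier__n_estimators": [50, 100],
        "classifier__max_depth": [None, 3],
    },
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
```

Note the `classifier__` prefix in the grid keys: that is how GridSearchCV addresses parameters of a named step inside a Pipeline.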
2. Understanding the Project Structure
Before touching any code, read this section. Understanding why the project is structured this way is what separates a portfolio project from a "notebook dumped into a repo."
ml-pipeline/
│
├── src/ ← The core Python library (business logic)
│ ├── config.py ← ALL settings live here (paths, hyperparams, env vars)
│ ├── data/
│ │ └── loader.py ← Loads Iris dataset, splits train/test
│ ├── training/
│ │ ├── pipeline.py ← Builds the sklearn Pipeline object
│ │ └── trainer.py ← Orchestrates training: GridSearchCV + MLflow logging
│ ├── evaluation/
│ │ └── metrics.py ← Accuracy, F1, confusion matrix — pure functions
│ └── inference/
│ └── predictor.py ← Loads saved model, exposes predict() method
│
├── api/ ← The FastAPI web application
│ ├── main.py ← Creates and configures the FastAPI app
│ ├── schemas.py ← Defines the shape of API requests and responses
│ └── routers/
│ └── predict.py ← The actual /predict HTTP endpoints
│
├── scripts/
│ ├── train.py ← CLI entry point: `python scripts/train.py`
│ └── run_pipeline.sh ← Bash wrapper used by cron
│
├── tests/ ← Automated tests
│ ├── test_data_loader.py
│ ├── test_metrics.py
│ ├── test_predictor.py
│ └── test_api.py
│
├── docker/
│ ├── Dockerfile.train ← Container for the training job
│ └── Dockerfile.api ← Container for the FastAPI server
│
├── models/ ← Created automatically — stores .pkl files
├── mlruns/ ← Created automatically — stores MLflow data
├── logs/ ← Created automatically — stores training logs
│
├── docker-compose.yml ← Wires all Docker services together
├── requirements.txt ← Python dependencies
├── Makefile ← Shortcuts: `make train`, `make serve`, etc.
├── crontab.txt ← Cron schedule definition
├── pyproject.toml ← Tool config (pytest, ruff, mypy)
└── .env.example ← Template for environment variables
Key design principle:
src/ contains zero web framework code. api/ contains zero ML logic. They communicate through src/inference/predictor.py. This makes every layer independently testable.
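A toy illustration of the principle (the class, the threshold, and the function names here are made up for demonstration, not the repo's actual code):

```python
# Toy illustration of the layering: the ML class has zero web imports,
# so any caller — CLI, web handler, notebook — can reuse it unchanged.
class Predictor:
    """Pure ML layer. No FastAPI, no HTTP — trivially unit-testable."""
    def predict(self, features):
        # made-up rule of thumb: short petals → setosa
        return "setosa" if features[2] < 2.5 else "versicolor"

def cli_main(args):
    """A CLI script reusing the same Predictor."""
    return Predictor().predict([float(a) for a in args])

def handle_request(payload: dict):
    """A framework-agnostic web handler: it only needs the Predictor."""
    features = [payload["sepal_length"], payload["sepal_width"],
                payload["petal_length"], payload["petal_width"]]
    return {"predicted_class": Predictor().predict(features)}
```

Swap FastAPI for Flask or gRPC and only `handle_request`'s wrapper would change; the `Predictor` stays put.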
3. One-Time Setup: Install Required Tools
You need three tools installed on your machine. Do this before anything else.
3.1 Verify Python is installed
Open a terminal (on Linux/Mac) or Command Prompt / PowerShell (on Windows):
python --version
# or on some systems:
python3 --version
You should see Python 3.10.x or higher. If not, download it from python.org.
On Linux/Ubuntu, you may need:
sudo apt update && sudo apt install python3 python3-pip python3-venv
3.2 Install Git (to push to GitHub later)
git --version
If not installed:
- Ubuntu/Debian: sudo apt install git
- Mac: xcode-select --install
- Windows: Download from git-scm.com
3.3 Install Docker Desktop (for the containerised workflow)
Docker lets you run the entire stack — API + MLflow UI + trainer — with a single command, without installing anything else on your machine.
- Go to docs.docker.com/get-docker
- Download and install Docker Desktop for your OS
- Open Docker Desktop and wait for it to show "Docker is running"
- Verify in the terminal:
docker --version
docker compose version
On Linux, after installing Docker Engine, add your user to the docker group so you don't need sudo:
sudo usermod -aG docker $USER
newgrp docker
4. Step 1 — Clone the Project
4.1 Clone the repository from GitHub
Clone the Iris-Classifier-ML-Pipeline to a location of your choice, for example ~/projects/.
git clone https://github.com/aniket-1177/Iris-Classifier-ML-Pipeline.git
cd Iris-Classifier-ML-Pipeline
4.2 Open in VS Code
code .
Or open VS Code manually → File → Open Folder → select Iris-Classifier-ML-Pipeline.
Install the recommended VS Code extension for Python: when VS Code prompts you, click Install. If it doesn't prompt, press Ctrl+Shift+X, search Python, and install the Microsoft extension.
5. Step 2 — Create a Python Virtual Environment
What is a virtual environment and why do we need one?
A virtual environment is an isolated Python installation just for this project. Without it, every project on your machine would share the same packages — which leads to version conflicts. With a venv, installing scikit-learn==1.4.0 here won't affect any other project.
Your machine
│
├── System Python (don't touch this)
│
└── projects/
└── ml-pipeline/
└── .venv/ ← A private Python just for this project
├── bin/python
└── lib/
├── scikit-learn
├── fastapi
├── mlflow
└── ...
Create the virtual environment
# Make sure you are inside the ml-pipeline directory
pwd
# Should print something like: /home/yourname/projects/ml-pipeline
# Create the venv (this creates a .venv folder)
python -m venv .venv
Activate the virtual environment
You must activate the venv every time you open a new terminal window.
# Linux / Mac:
source .venv/bin/activate
# Windows (Command Prompt):
.venv\Scripts\activate.bat
# Windows (PowerShell):
.venv\Scripts\Activate.ps1
After activation, your terminal prompt changes to show (.venv):
(.venv) username@os:~/projects/Iris-Classifier-ML-Pipeline$
VS Code tip: Press Ctrl+Shift+P → type "Python: Select Interpreter" → choose the one that says .venv. VS Code will now automatically activate the venv in all new integrated terminals.
6. Step 3 — Install Dependencies
With your venv activated, install all required packages:
pip install --upgrade pip
pip install -r requirements.txt
This will install approximately 15 packages. It may take 2–5 minutes on the first run.
What's being installed:
| Package | Why |
|---|---|
| scikit-learn | Machine learning — our model, pipeline, and grid search |
| pandas / numpy | Data manipulation |
| mlflow | Experiment tracking and model registry |
| fastapi | The web framework for our inference API |
| uvicorn | The ASGI web server that runs FastAPI |
| pydantic | Data validation for API requests/responses |
Verify installation
python -c "import sklearn, mlflow, fastapi; print('All good!')"
# Should print: All good!
7. Step 4 — Configure Environment Variables
Environment variables let you change settings (like which MLflow server to use) without editing code.
Create your .env file
cp .env.example .env
Open .env in VS Code. For local development, the defaults are fine:
MLFLOW_TRACKING_URI=file://./mlruns
MLFLOW_EXPERIMENT_NAME=iris-classifier
API_HOST=0.0.0.0
API_PORT=8000
LOG_LEVEL=INFO
What does file://./mlruns mean? It tells MLflow to store all experiment data in a local folder called mlruns/ instead of connecting to a remote server. Perfect for development.
8. Step 5 — Run the Training Pipeline
This is the core of the project. Let's run it and understand what happens at each step.
python scripts/train.py
You will see output like this:
2024-06-01 10:23:15 | INFO | __main__ | =======================================================
2024-06-01 10:23:15 | INFO | __main__ | ML Pipeline Training Run
2024-06-01 10:23:15 | INFO | __main__ | Experiment : iris-classifier
2024-06-01 10:23:15 | INFO | __main__ | =======================================================
2024-06-01 10:23:15 | INFO | src.data.loader | Loading Iris dataset...
2024-06-01 10:23:15 | INFO | src.data.loader | Dataset loaded | samples=150 | features=4 | classes=['setosa', 'versicolor', 'virginica']
2024-06-01 10:23:15 | INFO | src.data.loader | Data split | train=120 | test=30
2024-06-01 10:23:16 | INFO | src.training.trainer | MLflow run started | run_id=abc123...
2024-06-01 10:23:18 | INFO | src.training.trainer | Best params: {'classifier__max_depth': None, 'classifier__n_estimators': 100}
2024-06-01 10:23:18 | INFO | src.training.trainer | ─────────────────────────────────────────────
2024-06-01 10:23:18 | INFO | src.training.trainer | accuracy 0.9667
2024-06-01 10:23:18 | INFO | src.training.trainer | macro_f1 0.9667
...
2024-06-01 10:23:19 | INFO | __main__ | Training finished successfully.
2024-06-01 10:23:19 | INFO | __main__ | Accuracy : 0.9667
2024-06-01 10:23:19 | INFO | __main__ | Model path : /home/.../models/iris_classifier.pkl
What just happened internally?
scripts/train.py
└── calls run_training() in src/training/trainer.py
│
├── 1. load_dataset() → loads 150 Iris rows from scikit-learn
├── 2. split_data() → 120 train, 30 test (stratified)
├── 3. build_pipeline() → StandardScaler + RandomForestClassifier
├── 4. GridSearchCV.fit() → tries 18 hyperparameter combinations (5-fold CV each)
├── 5. compute_metrics() → accuracy, F1, etc. on held-out test set
├── 6. mlflow.log_*() → saves params + metrics + model to mlruns/
└── 7. pickle.dump() → saves best model to models/iris_classifier.pkl
saves label encoder to models/label_encoder.pkl
Verify the output artifacts
ls models/
# iris_classifier.pkl label_encoder.pkl
ls mlruns/
# 0/ (experiment folder)
CLI flags
# Custom experiment name
python scripts/train.py --experiment my-experiment-v2
# Save results to a JSON file
python scripts/train.py --output-json results/run1.json
# See all options
python scripts/train.py --help
9. Step 6 — Explore MLflow Experiment Tracking
MLflow automatically captured everything about the training run. Let's view it.
Open a new terminal (keep your first terminal free for the API later). Activate the venv:
source .venv/bin/activate
mlflow ui --backend-store-uri mlruns --port 5000
Open your browser and go to http://localhost:5000
What you'll see in the MLflow UI
Experiments list: You'll see iris-classifier with one run logged.
Inside the run, explore:
- Parameters tab — the hyperparameter values GridSearchCV chose as best:
cv_folds 5
test_size 0.2
classifier__n_estimators 100
classifier__max_depth None
classifier__min_samples_split 2
- Metrics tab — all evaluation scores:
cv_best_score 0.9583
test_accuracy 0.9667
test_macro_f1 0.9667
test_macro_precision 0.9683
test_macro_recall 0.9667
f1_setosa 1.0000
f1_versicolor 0.9333
f1_virginica 0.9667
- Artifacts tab — the saved model files and a preview of the input schema
- Models tab (top menu) — the IrisClassifier registered model with version history
Why does this matter for a portfolio? In a real company, dozens of engineers run hundreds of experiments. MLflow lets you compare them all — which model was best? What were its settings? What data was it trained on? This is how ML teams avoid the "I don't know which model is in production" problem.
Run training a second time and compare
# Back in your first terminal:
python scripts/train.py --experiment iris-classifier
Now refresh the MLflow UI. You'll see two runs side by side. Click the checkboxes on both and hit Compare to see a diff of parameters and metrics.
10. Step 7 — Start the FastAPI Inference Server
The trained model is now saved to disk. Let's serve it as a REST API.
In your first terminal (with venv activated):
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
The --reload flag means the server restarts automatically when you edit code — great for development.
You should see:
INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Starting Iris Classifier API v1.0.0
INFO: Model ready | classes=['setosa', 'versicolor', 'virginica']
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
Open your browser at http://localhost:8000/docs
The Interactive API Docs (Swagger UI)
FastAPI automatically generates an interactive documentation page from your code. You don't write this HTML — it's created from your Pydantic schemas and route definitions.
You'll see three endpoints:
- POST /predict/ — single flower prediction
- POST /predict/batch — multiple flowers at once
- GET /health — is the model loaded?
Understanding the URL http://0.0.0.0:8000
http:// 0.0.0.0 : 8000 /docs
│ │ │ │
│ │ │ └── Path (Swagger UI page)
│ │ └────────── Port number
│ └───────────────────── "All network interfaces" = accessible from anywhere on this machine
└─────────────────────────────── Protocol
0.0.0.0 as the host means "listen on all network interfaces." When you open it in a browser, you use localhost or 127.0.0.1 instead.
11. Step 8 — Make Predictions via the API
You have three ways to call the single-prediction endpoint, plus a batch variant. Try them all — each is used in different real-world scenarios.
Method A — Swagger UI (browser)
- Go to http://localhost:8000/docs
- Click on
POST /predict/ - Click Try it out
- Replace the request body with:
{
"sepal_length": 5.1,
"sepal_width": 3.5,
"petal_length": 1.4,
"petal_width": 0.2
}
- Click Execute
- Scroll down to see the response
Method B — curl (terminal)
Open a third terminal and run:
curl -X POST http://localhost:8000/predict/ \
-H "Content-Type: application/json" \
-d '{
"sepal_length": 5.1,
"sepal_width": 3.5,
"petal_length": 1.4,
"petal_width": 0.2
}'
Expected response:
{
"predicted_class": "setosa",
"confidence": 0.98,
"class_probabilities": {
"setosa": 0.98,
"versicolor": 0.01,
"virginica": 0.01
}
}
Method C — Python requests (script)
Create a quick test script:
# test_request.py (create this in the project root)
import requests
url = "http://localhost:8000/predict/"
payload = {
"sepal_length": 6.3,
"sepal_width": 3.3,
"petal_length": 6.0,
"petal_width": 2.5,
}
response = requests.post(url, json=payload)
print(response.json())
pip install requests # if not already installed
python test_request.py
Method D — Batch prediction
curl -X POST http://localhost:8000/predict/batch \
-H "Content-Type: application/json" \
-d '{
"samples": [
{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
{"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0, "petal_width": 2.5},
{"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4}
]
}'
Understanding what happens on each request
HTTP POST /predict/
│
▼ api/routers/predict.py
│ Pydantic validates the JSON (correct types? in range?)
│ If invalid → 422 Unprocessable Entity (automatic)
│
▼ Depends(get_predictor)
│ FastAPI calls get_predictor() to inject the Predictor object
│ lru_cache means the model is NOT reloaded on every request
│
▼ predictor.predict([5.1, 3.5, 1.4, 0.2])
│ Builds a pandas DataFrame with the correct column names
│ Runs pipeline.predict() — scaler transforms, then RF predicts
│ Runs pipeline.predict_proba() — gets probability per class
│
▼ Returns PredictResponse
FastAPI serializes it to JSON and sends HTTP 200
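The "builds a DataFrame with the correct column names" step matters because the pipeline was fit on named columns. A minimal sketch (FEATURE_NAMES here is an assumed stand-in for the constant in src/config.py):

```python
# Sketch of the DataFrame-building step inside predictor.predict().
# FEATURE_NAMES is assumed to match the names the pipeline was trained on.
import pandas as pd

FEATURE_NAMES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

def to_frame(features):
    # one row, named columns — mismatched names would make sklearn warn or fail
    return pd.DataFrame([features], columns=FEATURE_NAMES)

df = to_frame([5.1, 3.5, 1.4, 0.2])
print(df)
```

The resulting single-row frame is what gets passed to `pipeline.predict()` and `pipeline.predict_proba()`.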
Testing input validation
FastAPI + Pydantic automatically validates every request. Try sending a bad value:
curl -X POST http://localhost:8000/predict/ \
-H "Content-Type: application/json" \
-d '{"sepal_length": -5, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}'
You'll get a 422 Unprocessable Entity with a clear error message — no custom error handling code needed. This is the power of Pydantic.
12. Step 9 — Run the Test Suite
Stop the API server for now (Ctrl+C). Let's run the automated tests.
# Run all tests with verbose output
pytest tests/ -v
# Run with a coverage report
pytest tests/ -v --cov=src --cov=api --cov-report=term-missing
Understanding the test output
tests/test_data_loader.py::TestLoadDataset::test_returns_dataframe_and_series PASSED
tests/test_data_loader.py::TestLoadDataset::test_correct_shape PASSED
tests/test_data_loader.py::TestSplitData::test_split_sizes PASSED
...
tests/test_api.py::TestPredictEndpoint::test_valid_request_200 PASSED
tests/test_api.py::TestPredictEndpoint::test_missing_field_422 PASSED
tests/test_api.py::TestPredictEndpoint::test_negative_value_422 PASSED
...
---------- coverage: src ----------
src/config.py 28 3 89%
src/data/loader.py 32 2 94%
src/training/trainer.py 58 12 79%
src/inference/predictor.py 55 4 93%
...
What each test file covers
| File | What it tests | Key technique |
|---|---|---|
| test_data_loader.py | Shape, columns, no nulls, stratification | Direct assertion |
| test_metrics.py | Perfect vs imperfect predictions, rounding | Parametrized fixtures |
| test_predictor.py | Model loading, predict output, error cases | unittest.mock.patch to fake disk paths |
| test_api.py | HTTP status codes, response schema, validation | FastAPI TestClient — no real server needed |
Why test_predictor.py uses mock patches:
The Predictor class loads .pkl files from disk. In tests, we don't want to depend on a pre-trained model existing. Instead, we use unittest.mock.patch to replace the file paths with a temp directory containing a freshly trained mini-model. This makes the tests fast, isolated, and runnable in CI.
Run a single test file
pytest tests/test_api.py -v
pytest tests/test_data_loader.py -v
Run tests matching a pattern
pytest tests/ -k "test_valid_request" -v
pytest tests/ -k "batch" -v
13. Step 10 — Run Everything with Docker
So far we've been running services manually in separate terminals. Docker Compose lets you run the entire stack with one command and tear it all down just as easily.
Make sure Docker Desktop is running
Check the Docker Desktop taskbar icon — it should say "Docker Desktop is running."
Build and start all services
docker compose up --build
The first build takes 3–5 minutes (it downloads base images and installs packages). Subsequent starts are fast.
Watch the output — you'll see three services starting:
mlflow_server | [INFO] Starting MLflow server...
mlflow_server | [INFO] Listening on http://0.0.0.0:5000
ml_trainer | [INFO] Loading Iris dataset...
ml_trainer | [INFO] Training finished. Accuracy: 0.9667
ml_trainer | [INFO] Model saved to /app/models/iris_classifier.pkl
ml_trainer exited with code 0 ← trainer exits after one run (this is normal)
ml_api | [INFO] Model ready | classes=['setosa', 'versicolor', 'virginica']
ml_api | [INFO] Uvicorn running on http://0.0.0.0:8000
Now open:
- API docs: http://localhost:8000/docs
- MLflow UI: http://localhost:5000
Why the trainer exits
The trainer service is configured with restart: "no" — it runs the training job once and exits. This is intentional. In production, you'd trigger retraining on a schedule (via cron or a CI job), not keep a process running forever.
Re-run training inside Docker (without rebuilding)
docker compose run --rm trainer
This spins up a fresh trainer container, trains the model, saves it to the shared volume, and exits.
Stop all services
docker compose down
Understanding Docker volumes
The models/ directory is shared between the trainer and the API using a Docker named volume called models_vol:
┌────────────────┐ ┌──────────────────┐
│ trainer │ writes │ models_vol │
│ container │────────►│ (Docker volume) │
└────────────────┘ └────────┬─────────┘
│ reads
┌────────▼─────────┐
│ api │
│ container │
└──────────────────┘
This means you can retrain the model and the running API picks up the new model without rebuilding or redeploying the API image.
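The wiring above corresponds to a compose file along these lines. This is a sketch: the service and volume names follow the text, but the mount paths in the repo's actual docker-compose.yml may differ.

```yaml
# Sketch of the shared-volume wiring (paths are assumptions)
services:
  trainer:
    volumes:
      - models_vol:/app/models   # trainer writes the .pkl files here
  api:
    volumes:
      - models_vol:/app/models   # api reads the same files
volumes:
  models_vol:
```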
Useful Docker commands
# See running containers
docker ps
# See logs from the API container
docker logs ml_api -f
# Open a shell inside the API container (for debugging)
docker exec -it ml_api bash
# Remove everything including volumes (full reset)
docker compose down -v
14. Step 11 — Schedule Training with Cron
Cron is a Unix tool that runs commands on a schedule. We've included a crontab.txt that runs the training pipeline every Monday at 2 AM.
View the schedule
cat crontab.txt
# Retrain model every Monday at 02:00 AM
0 2 * * 1 cd /app && bash scripts/run_pipeline.sh >> /var/log/ml_pipeline_cron.log 2>&1
Understanding cron syntax
0 2 * * 1
│ │ │ │ │
│ │ │ │ └── Day of week: 1 = Monday (0=Sun, 6=Sat)
│ │ │ └────── Month: * = every month
│ │ └────────── Day of month: * = every day
│ └────────────── Hour: 2 = 2 AM
└────────────────── Minute: 0 = on the hour
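A tiny stdlib check makes the semantics concrete. This is illustrative only — real cron parsing also handles ranges, steps, and lists:

```python
# Illustrative: does a given datetime match the cron expression "0 2 * * 1"?
from datetime import datetime

def matches_0_2_star_star_1(dt: datetime) -> bool:
    # cron's day-of-week 1 = Monday; Python's dt.weekday() is 0 for Monday
    return dt.minute == 0 and dt.hour == 2 and dt.weekday() == 0

assert matches_0_2_star_star_1(datetime(2024, 6, 3, 2, 0))      # Monday 02:00
assert not matches_0_2_star_star_1(datetime(2024, 6, 4, 2, 0))  # Tuesday
```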
Install the crontab (Linux/Mac only)
crontab crontab.txt
# Verify it's installed
crontab -l
Test the pipeline script manually
bash scripts/run_pipeline.sh
This produces timestamped log files in logs/:
logs/
├── train_20240601_102315.log
└── results_20240601_102315.json
Cron inside Docker
To run cron inside the Docker trainer container instead of on the host machine, change the CMD in docker-compose.yml:
trainer:
command: cron -f # runs cron daemon in foreground (keeps container alive)
15. Architecture Deep Dive
This section explains the key architectural decisions — the "why" behind the code. This is exactly what interviewers and video viewers want to understand.
Why src/config.py is the single source of truth
# src/config.py
MODEL_PATH = MODELS_DIR / f"{MODEL_NAME}.pkl"
MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", f"file://{MLRUNS_DIR}")
Every path, every setting, every environment variable lives here. No other file hardcodes a path or reads an env var. If you need to change where models are stored, you change one line in config.py and it propagates everywhere.
Why the sklearn Pipeline prevents data leakage
# src/training/pipeline.py
Pipeline([
("scaler", StandardScaler()),
("classifier", RandomForestClassifier()),
])
Without a Pipeline, you might do this (which is wrong):
# ❌ WRONG — data leakage
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# if X_train_scaled is then handed to GridSearchCV, every CV validation fold
# has already influenced the scaler's mean and variance
With a Pipeline inside GridSearchCV, the scaler is fit only on the training portion of each fold, never on the validation data. This gives you an honest estimate of real-world performance.
Why lru_cache on get_predictor()
# src/inference/predictor.py
@lru_cache(maxsize=1)
def get_predictor() -> Predictor:
return Predictor()
lru_cache memoises the function — after the first call, it returns the cached result without calling the function again.
Without it: Every HTTP request would load iris_classifier.pkl from disk → slow.
With it: The model loads once at startup, all requests share the same in-memory instance → fast.
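A stand-alone demonstration of the caching behaviour (not the repo's code — the counter stands in for the expensive pickle load):

```python
# Demonstrates that lru_cache(maxsize=1) runs the body exactly once.
from functools import lru_cache

load_count = 0

@lru_cache(maxsize=1)
def get_predictor():
    global load_count
    load_count += 1           # stands in for the expensive pickle.load(...)
    return {"model": "fake"}  # stands in for the Predictor instance

first = get_predictor()
second = get_predictor()
assert first is second        # every caller shares the same object
assert load_count == 1        # the "model" was loaded exactly once
```

This is also why, in FastAPI's `Depends(get_predictor)`, every request receives the same in-memory instance.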
Why the Predictor is separate from the API
src/inference/predictor.py has zero imports from fastapi. This means:
- You can import and use Predictor in a Celery worker, a CLI script, or a Jupyter notebook without FastAPI
- You can unit-test it with pytest without starting a web server
- You could swap FastAPI for Flask or gRPC and the predictor code would be unchanged
Why Pydantic schemas are worth the boilerplate
# api/schemas.py
class PredictRequest(BaseModel):
sepal_length: float = Field(..., ge=0.0, le=20.0)
ge=0.0 means "greater than or equal to 0." le=20.0 means "less than or equal to 20."
For free, you get:
- Automatic HTTP 422 if a user sends "sepal_length": "hello"
- Automatic HTTP 422 if a user sends "sepal_length": -5
- Auto-generated OpenAPI documentation at /docs
- Type hints that your IDE understands
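A runnable sketch of the validation behaviour. The field names follow the API, but the exact constraints in api/schemas.py may differ from these:

```python
# Sketch of the request schema — constraints are assumed, see api/schemas.py.
from pydantic import BaseModel, Field, ValidationError

class PredictRequest(BaseModel):
    sepal_length: float = Field(..., ge=0.0, le=20.0)
    sepal_width: float = Field(..., ge=0.0, le=20.0)
    petal_length: float = Field(..., ge=0.0, le=20.0)
    petal_width: float = Field(..., ge=0.0, le=20.0)

# valid input parses cleanly
ok = PredictRequest(sepal_length=5.1, sepal_width=3.5,
                    petal_length=1.4, petal_width=0.2)

# out-of-range input raises ValidationError — FastAPI turns this into HTTP 422
try:
    PredictRequest(sepal_length=-5, sepal_width=3.5,
                   petal_length=1.4, petal_width=0.2)
except ValidationError as e:
    print("rejected field:", e.errors()[0]["loc"])
```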
16. How the Code Flows Together
Here is a complete trace of what happens when you run python scripts/train.py:
scripts/train.py
│
│ parse_args() — reads --experiment, --tracking-uri from CLI
│ sets os.environ for config.py to pick up
│
└── run_training() [src/training/trainer.py]
│
├── _configure_mlflow()
│ mlflow.set_tracking_uri(...)
│ mlflow.set_experiment("iris-classifier")
│
├── load_dataset() [src/data/loader.py]
│ load_iris(as_frame=True)
│ map integer targets → "setosa", "versicolor", "virginica"
│ returns X: DataFrame(150×4), y: Series(150,)
│
├── split_data(X, y)
│ train_test_split(stratify=y, test_size=0.2)
│ returns X_train(120×4), X_test(30×4), y_train, y_test
│
├── get_label_encoder(y)
│ LabelEncoder().fit(["setosa","versicolor","virginica"])
│
├── build_pipeline() [src/training/pipeline.py]
│ Pipeline([StandardScaler(), RandomForestClassifier()])
│
├── GridSearchCV(pipeline, HYPERPARAMETER_GRID, cv=5)
│
├── with mlflow.start_run():
│ │
│ ├── grid_search.fit(X_train, y_train)
│ │ Tries 18 combinations × 5 folds = 90 model fits
│ │ Retrains best params on full X_train
│ │
│ ├── mlflow.log_params(best_params)
│ │
│ ├── best_model.predict(X_test) → y_pred
│ │
│ ├── compute_metrics(y_test, y_pred) [src/evaluation/metrics.py]
│ │ accuracy_score, f1_score, precision_score, recall_score
│ │
│ ├── mlflow.log_metrics(metrics)
│ │
│ ├── mlflow.sklearn.log_model(best_model, registered_model_name="IrisClassifier")
│ │ Saves model to mlruns/<experiment_id>/<run_id>/artifacts/model/
│ │
│ └── pickle.dump(best_model, "models/iris_classifier.pkl")
│ pickle.dump(label_encoder, "models/label_encoder.pkl")
│
└── returns { run_id, best_params, metrics, model_path }
And when you call POST /predict/:
HTTP POST /predict/ {"sepal_length": 5.1, ...}
│
└── api/routers/predict.py: predict()
│
├── Pydantic validates the request body
│ PredictRequest(sepal_length=5.1, sepal_width=3.5, ...)
│
├── Depends(get_predictor) → returns cached Predictor instance
│
├── request.to_feature_list() → [5.1, 3.5, 1.4, 0.2]
│
└── predictor.predict([5.1, 3.5, 1.4, 0.2])
│ [src/inference/predictor.py]
├── pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=FEATURE_NAMES)
├── pipeline.predict(X) → ["setosa"]
├── pipeline.predict_proba(X) → [[0.98, 0.01, 0.01]]
└── return {
"predicted_class": "setosa",
"confidence": 0.98,
"class_probabilities": {"setosa": 0.98, ...}
}
17. Common Errors & How to Fix Them
❌ ModuleNotFoundError: No module named 'src'
Cause: Running Python from the wrong directory, or the venv is not activated.
Fix:
# Make sure you are in the project root
cd /path/to/ml-pipeline
# Make sure the venv is activated (you should see (.venv) in your prompt)
source .venv/bin/activate
# Then run again
python scripts/train.py
❌ ModelNotFoundError: Model not found at '.../models/iris_classifier.pkl'
Cause: You started the API before running training. The model doesn't exist yet.
Fix:
# Run training first
python scripts/train.py
# Then start the API
uvicorn api.main:app --reload
❌ Address already in use (port 8000 or 5000)
Cause: Something else is already using that port (possibly a previous server you didn't stop).
Fix:
# Find what's using port 8000
lsof -i :8000 # Linux/Mac
netstat -ano | findstr :8000 # Windows
# Kill it (replace PID with the process ID from above)
kill -9 <PID>
# Or use a different port
uvicorn api.main:app --port 8001
❌ docker: command not found
Cause: Docker is not installed, or Docker Desktop is not running.
Fix: Open Docker Desktop and wait for it to say "Docker is running."
❌ Permission denied when running run_pipeline.sh
Cause: The script is not marked as executable.
Fix:
chmod +x scripts/run_pipeline.sh
bash scripts/run_pipeline.sh
❌ MLflow UI shows nothing / empty experiments
Cause: The mlruns/ folder doesn't exist yet (training hasn't been run), or you're pointing at the wrong URI.
Fix:
# Make sure you train first
python scripts/train.py
# Then start MLflow pointing at the right folder
mlflow ui --backend-store-uri mlruns --port 5000
❌ 422 Unprocessable Entity from the API
Cause: Your request body is missing a field or has an invalid value (e.g., a negative measurement).
Fix: Check the error response body — FastAPI tells you exactly which field is wrong:
{
"detail": [
{
"loc": ["body", "sepal_length"],
"msg": "ensure this value is greater than or equal to 0",
"type": "value_error.number.not_ge"
}
]
}
18. Extending the Project
Once you're comfortable with the project, here are ways to make it even more impressive:
Swap in a different dataset
Replace the Iris loader in src/data/loader.py with any CSV:
def load_dataset():
df = pd.read_csv("data/raw/your_dataset.csv")
X = df.drop(columns=["target"])
y = df["target"]
return X, y
Everything else — training, MLflow logging, FastAPI — works unchanged.
Add a new model (XGBoost)
In src/training/pipeline.py:
from xgboost import XGBClassifier
def build_pipeline():
return Pipeline([
("scaler", StandardScaler()),
("classifier", XGBClassifier(use_label_encoder=False, eval_metric="logloss")),
])
Update HYPERPARAMETER_GRID in src/config.py to match XGBoost params.
Promote the best model in MLflow
# scripts/promote_best_model.py
from mlflow.tracking import MlflowClient

client = MlflowClient()

# find the run with the highest test accuracy in the experiment
runs = client.search_runs("1", order_by=["metrics.test_accuracy DESC"], max_results=1)
best_run_id = runs[0].info.run_id

# find the registered model version created by that run, then promote it
best_version = next(
    v for v in client.search_model_versions("name='IrisClassifier'")
    if v.run_id == best_run_id
)
client.transition_model_version_stage(
    name="IrisClassifier",
    version=best_version.version,
    stage="Production",
)
Add GitHub Actions CI
Push your project to GitHub — the .github/workflows/ci.yml file is already written. It will automatically run lint → tests → training smoke test → Docker build on every push.
git init
git add .
git commit -m "feat: initial ML pipeline"
git remote add origin https://github.com/YOUR_USERNAME/ml-pipeline.git
git push -u origin main
Quick Reference Card
┌─────────────────────────────────────────────────────────────────┐
│ QUICK REFERENCE │
├────────────────────────────┬────────────────────────────────────┤
│ Setup │ source .venv/bin/activate │
│ Train model │ python scripts/train.py │
│ Start API │ uvicorn api.main:app --reload │
│ MLflow UI │ mlflow ui --backend-store-uri mlruns│
│ Run tests │ pytest tests/ -v │
│ Run tests + coverage │ pytest tests/ --cov=src --cov=api │
│ Docker (all services) │ docker compose up --build │
│ Docker (retrain only) │ docker compose run --rm trainer │
│ Docker (stop) │ docker compose down │
│ Install cron │ crontab crontab.txt │
├────────────────────────────┼────────────────────────────────────┤
│ API docs │ http://localhost:8000/docs │
│ API health check │ http://localhost:8000/health │
│ MLflow UI │ http://localhost:5000 │
└────────────────────────────┴────────────────────────────────────┘
Manual version 1.0 — Iris Classifier ML Pipeline