Explaining the Full Workflow — From Preprocessing to CI/CD Pipelines
I assume you already know how to train a model.
Maybe you’ve built a few notebooks. Maybe you’ve tuned a couple of hyperparameters and watched your accuracy climb from 72% to 91% like a proud parent.
But here’s the uncomfortable truth most tutorials won’t tell you:
Training a model is maybe 10% of a real ML engineering project.
The other 90% is everything around it — data pipelines, reproducibility, versioning, deployment, monitoring, CI/CD, and making sure the whole thing doesn’t collapse at 3AM when your cron job runs.
The first time I built a real ML engineering project, it completely changed how I look at machine learning.
Let me walk you through the exact workflow I used — from raw data to an automated CI/CD pipeline.
No fluff. Just the pieces that actually matter.
The Real ML Project Structure (Not a Messy Notebook)
The biggest mistake beginners make?
They build everything inside a single Jupyter notebook.
It works… until it doesn’t.
A real ML project should look something like this:
```
ml-project/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── src/
│   ├── data_preprocessing.py
│   ├── train.py
│   └── predict.py
│
├── models/
│
├── tests/
│
├── Dockerfile
├── requirements.txt
├── pipeline.yaml
└── README.md
```
A simple rule I follow:
If your ML project can't run from the command line, it's not production ready.
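In practice, that rule means every script gets a proper entry point. Here's a minimal sketch of what a command-line interface for `train.py` might look like — the flag names and defaults are my own illustration, not a fixed convention:

```python
import argparse

def parse_args(argv=None):
    # Command-line interface so training can run from a terminal or CI job,
    # not just from inside a notebook
    parser = argparse.ArgumentParser(description="Train the model")
    parser.add_argument("--data-path", default="data/raw/dataset.csv",
                        help="Path to the raw CSV file")
    parser.add_argument("--model-out", default="models/model.pkl",
                        help="Where to save the serialized model")
    parser.add_argument("--test-size", type=float, default=0.2,
                        help="Fraction of data held out for evaluation")
    return parser.parse_args(argv)

if __name__ == "__main__":
    # Demo with explicit args; normally argparse reads sys.argv
    args = parse_args(["--test-size", "0.3"])
    print(args.data_path, args.test_size)  # -> data/raw/dataset.csv 0.3
```

Now `python src/train.py --test-size 0.3` works from any shell, any server, any CI runner.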
Step 1 — Data Preprocessing (The Most Underrated Skill in ML)
Every ML engineer learns this the hard way.
Bad data beats good algorithms every time.
I usually create a preprocessing pipeline that does three things:
- Clean missing values
- Encode categorical features
- Scale numerical features
Here’s a simple pipeline using scikit-learn.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data = pd.read_csv("data/raw/dataset.csv")
X = data.drop("target", axis=1)
y = data["target"]

num_features = X.select_dtypes(include=["int64", "float64"]).columns
cat_features = X.select_dtypes(include=["object"]).columns

num_pipeline = Pipeline([
    ("scaler", StandardScaler())
])

cat_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features)
])

# Split first, then fit the preprocessor on the training set only,
# so test-set statistics never leak into the transformations
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
```
Notice something important here.
The preprocessing itself becomes part of the pipeline.
That means when you deploy the model later, the same transformations will run automatically.
Most tutorials forget this.
Production systems cannot.
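One way to guarantee the same transformations run at serving time is to serialize the *fitted* preprocessor alongside the model. A minimal sketch with a toy DataFrame (the column names and file path here are illustrative):

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in dataset for illustration
X = pd.DataFrame({"age": [25, 32, 47], "city": ["NY", "SF", "NY"]})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
preprocessor.fit(X)

# Persist the *fitted* transformer, learned means and categories included
joblib.dump(preprocessor, "preprocessor.pkl")

# At serving time, reload it and get byte-identical transformations
restored = joblib.load("preprocessor.pkl")
assert np.allclose(preprocessor.transform(X), restored.transform(X))
```

The key point: the scaler's means and the encoder's learned categories travel with the artifact, so the API never has to re-derive them.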
Step 2 — Training the Model (But Properly)
Now we train.
But instead of saving random models everywhere like a digital hoarder, we track experiments.
A simple training script might look like this:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    random_state=42
)

model.fit(X_train, y_train)
preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print(f"Model Accuracy: {acc}")

joblib.dump(model, "models/model.pkl")
```
Small detail. Huge difference.
The model gets serialized.
Not printed. Not left floating in RAM.
Saved.
Because later, your API will load it.
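Since the API needs both the transformations and the model, one option worth considering is bundling them into a single scikit-learn `Pipeline` and serializing that as one artifact. A sketch with toy numeric data (the `model_bundle.pkl` path is hypothetical):

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Bundling preprocessing + model means the API loads exactly ONE file,
# and raw inputs get scaled automatically inside predict()
bundle = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
bundle.fit(X, y)
joblib.dump(bundle, "model_bundle.pkl")
```

With this design, there's no way for serving-time preprocessing to drift out of sync with training-time preprocessing — they're the same object.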
Step 3 — Build an Inference API
A machine learning model that nobody can access is just an expensive paperweight.
So I wrap the model with FastAPI.
```python
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("models/model.pkl")

@app.get("/")
def home():
    return {"status": "model running"}

@app.post("/predict")
def predict(data: list[float]):
    # data arrives as a JSON array of feature values
    prediction = model.predict([data])
    return {"prediction": int(prediction[0])}
```
Run it:
```
uvicorn app:app --reload
```
And suddenly your model becomes a web service.
Now any application can call it.
Step 4 — Containerize Everything (Docker)
If you want engineers to respect your ML project, learn Docker.
Because the sentence "It works on my machine" has destroyed more projects than bad algorithms.
Here’s the Dockerfile I used:
FROM python:3.10
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn","app:app","--host","0.0.0.0","--port","8000"]
Build the container:

```
docker build -t ml-api .
```

Run it:

```
docker run -p 8000:8000 ml-api
```
Now your entire ML system runs anywhere.
Laptop. Cloud. Kubernetes. Mars.
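One small addition I'd suggest is a `.dockerignore`, so large data files, notebooks, and git history don't get copied into the image by that broad `COPY . .` (the entries below are illustrative — keep `models/` in the image, since the API loads `models/model.pkl` at startup):

```
# .dockerignore — keeps builds fast and images small
data/
.git/
__pycache__/
*.ipynb
```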
Step 5 — CI/CD Pipeline (The Part That Got Me Noticed)
This was the step that changed everything.
Most ML portfolios stop at training a model.
But I added automation.
Every time I pushed code to GitHub, the pipeline would:
- Run unit tests
- Build the Docker image
- Train the model
- Deploy the API
Here’s a simplified GitHub Actions pipeline.
```yaml
name: ML Pipeline

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repo
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"  # quote it, or YAML parses 3.10 as 3.1

      - name: Install Dependencies
        run: pip install -r requirements.txt

      - name: Run Tests
        run: pytest

      - name: Build Docker
        run: docker build -t ml-project .
```
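The "Run Tests" step assumes there's actually something in `tests/` to run. A minimal sketch of what a preprocessing test might look like — the helper and column names here are hypothetical:

```python
# tests/test_preprocessing.py (illustrative)
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor():
    # Mirrors the ColumnTransformer from the preprocessing step
    return ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])

def test_preprocessor_output_shape():
    X = pd.DataFrame({"age": [25, 32, 47], "city": ["NY", "SF", "NY"]})
    out = build_preprocessor().fit_transform(X)
    # 1 scaled numeric column + 2 one-hot city columns = 3 columns
    assert out.shape == (3, 3)
```

Even one test like this means a broken preprocessing change fails the pipeline instead of silently corrupting the model.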
Every commit now triggers automated ML engineering infrastructure.
Recruiters love seeing this.
Because it proves you understand something critical:
Machine learning in production is software engineering.
The Hidden Skill Most ML Engineers Ignore
Here’s a surprising statistic.
According to Google’s ML engineering research:
Only 5–10% of ML code in production is model code.
The rest is infrastructure.
- Data validation
- Monitoring
- Deployment
- Pipelines
- Scaling
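To make the first of those concrete: a lightweight data-validation check can gate the pipeline before training ever starts. This is just a sketch with hypothetical column names and rules — real projects often reach for a schema library instead:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    # Collect every problem instead of failing on the first one,
    # so a pipeline failure log tells the whole story
    problems = []
    required = {"age", "city", "target"}
    missing = required - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "age" in df.columns and (df["age"] < 0).any():
        problems.append("negative ages found")
    if df.isna().any().any():
        problems.append("unhandled missing values")
    return problems

df = pd.DataFrame({"age": [25, -3], "city": ["NY", "SF"], "target": [0, 1]})
print(validate(df))  # -> ['negative ages found']
```

A training script would call this right after loading the CSV and abort if the list is non-empty.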
That’s where real ML engineers differentiate themselves.
What Actually Made My Project Stand Out
Not the accuracy.
Not the algorithm.
Not the dataset.
It was this:
A full machine learning lifecycle.
Data → Preprocessing → Training → Serialization → API → Docker → CI/CD
Most portfolios show a model.
Very few show a system.
If I Were Building My First ML Project Again
I would follow this exact stack:
- Python
- Scikit-learn
- FastAPI
- Docker
- GitHub Actions
- MLflow
Because this stack teaches you the real skill:
End-to-end machine learning engineering.
And once you understand that…
Training models suddenly becomes the easy part.
If you liked the article, don't forget to connect and leave a few claps. Thanks very much!