Explaining the Full Workflow — From Preprocessing to CI/CD Pipelines
I assume you already know how to train a model.
Maybe you’ve built a few notebooks. Maybe you’ve tuned a couple of hyperparameters and watched your accuracy climb from 72% to 91% like a proud parent.
But here’s the uncomfortable truth most tutorials won’t tell you:
Training a model is maybe 10% of a real ML engineering project.
The other 90% is everything around it — data pipelines, reproducibility, versioning, deployment, monitoring, CI/CD, and making sure the whole thing doesn’t collapse at 3AM when your cron job runs.
The first time I built a real ML engineering project, it completely changed how I look at machine learning.
Let me walk you through the exact workflow I used — from raw data to an automated CI/CD pipeline.
No fluff. Just the pieces that actually matter.
The Real ML Project Structure (Not a Messy Notebook)
The biggest mistake beginners make?
They build everything inside a single Jupyter notebook.
It works… until it doesn’t.
A real ML project should look something like this:
```
ml-project/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── src/
│   ├── data_preprocessing.py
│   ├── train.py
│   └── predict.py
│
├── models/
│
├── tests/
│
├── Dockerfile
├── requirements.txt
├── pipeline.yaml
└── README.md
```
A simple rule I follow:
If your ML project can't run from the command line, it's not production ready.
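In practice, that rule means every script gets a proper entry point. Here's a minimal sketch of what a command-line interface for `train.py` might look like — the flag names and defaults are my own illustration, not a fixed convention:

```python
import argparse

def parse_args(argv=None):
    # Command-line interface so training can run from a terminal or CI job,
    # not just from inside a notebook
    parser = argparse.ArgumentParser(description="Train the model")
    parser.add_argument("--data-path", default="data/raw/dataset.csv",
                        help="Path to the raw CSV file")
    parser.add_argument("--model-out", default="models/model.pkl",
                        help="Where to save the serialized model")
    parser.add_argument("--test-size", type=float, default=0.2,
                        help="Fraction of data held out for evaluation")
    return parser.parse_args(argv)

if __name__ == "__main__":
    # Demo with explicit args; normally argparse reads sys.argv
    args = parse_args(["--test-size", "0.3"])
    print(args.data_path, args.test_size)  # -> data/raw/dataset.csv 0.3
```

Now `python src/train.py --test-size 0.3` works from any shell, any server, any CI runner.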
Step 1 — Data Preprocessing (The Most Underrated Skill in ML)
Every ML engineer learns this the hard way.
Bad data beats good algorithms every time.
I usually create a preprocessing pipeline that does three things:
- Clean missing values
- Encode categorical features
- Scale numerical features
Here’s a simple pipeline using scikit-learn.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data = pd.read_csv("data/raw/dataset.csv")
X = data.drop("target", axis=1)
y = data["target"]

num_features = X.select_dtypes(include=["int64", "float64"]).columns
cat_features = X.select_dtypes(include=["object"]).columns

num_pipeline = Pipeline([
    ("scaler", StandardScaler())
])

cat_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features)
])

# Split first, then fit the preprocessor on the training set only,
# so test-set statistics never leak into the transformations
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
```
Notice something important here.
The preprocessing itself becomes part of the pipeline.
That means when you deploy the model later, the same transformations will run automatically.
Most tutorials forget this.
Production systems cannot.
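One way to guarantee the same transformations run at serving time is to serialize the *fitted* preprocessor alongside the model. A minimal sketch with a toy DataFrame (the column names and file path here are illustrative):

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in dataset for illustration
X = pd.DataFrame({"age": [25, 32, 47], "city": ["NY", "SF", "NY"]})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
preprocessor.fit(X)

# Persist the *fitted* transformer, learned means and categories included
joblib.dump(preprocessor, "preprocessor.pkl")

# At serving time, reload it and get byte-identical transformations
restored = joblib.load("preprocessor.pkl")
assert np.allclose(preprocessor.transform(X), restored.transform(X))
```

The key point: the scaler's means and the encoder's learned categories travel with the artifact, so the API never has to re-derive them.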
Step 2 — Training the Model (But Properly)
Now we train.
But instead of saving random models everywhere like a digital hoarder, we track experiments.
A simple training script might look like this:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    random_state=42
)

model.fit(X_train, y_train)
preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print(f"Model Accuracy: {acc}")

joblib.dump(model, "models/model.pkl")
```
Small detail. Huge difference.
The model gets serialized.
Not printed. Not left floating in RAM.
Saved.
Because later, your API will load it.
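Since the API needs both the transformations and the model, one option worth considering is bundling them into a single scikit-learn `Pipeline` and serializing that as one artifact. A sketch with toy numeric data (the `model_bundle.pkl` path is hypothetical):

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Bundling preprocessing + model means the API loads exactly ONE file,
# and raw inputs get scaled automatically inside predict()
bundle = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
bundle.fit(X, y)
joblib.dump(bundle, "model_bundle.pkl")
```

With this design, there's no way for serving-time preprocessing to drift out of sync with training-time preprocessing — they're the same object.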
Step 3 — Build an Inference API
A machine learning model that nobody can access is just an expensive paperweight.
So I wrap the model with FastAPI.
```python
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("models/model.pkl")

@app.get("/")
def home():
    return {"status": "model running"}

@app.post("/predict")
def predict(data: list[float]):
    # data arrives as a JSON array of feature values
    prediction = model.predict([data])
    return {"prediction": int(prediction[0])}
```
Run it:
```
uvicorn app:app --reload
```
And suddenly your model becomes a web service.
Now any application can call it.
Step 4 — Containerize Everything (Docker)
If you want engineers to respect your ML project, learn Docker.
Because the sentence "It works on my machine" has destroyed more projects than bad algorithms.
Here’s the Dockerfile I used:
FROM python:3.10
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn","app:app","--host","0.0.0.0","--port","8000"]
Build the container:

```
docker build -t ml-api .
```

Run it:

```
docker run -p 8000:8000 ml-api
```
Now your entire ML system runs anywhere.
Laptop. Cloud. Kubernetes. Mars.
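One small addition I'd suggest is a `.dockerignore`, so large data files, notebooks, and git history don't get copied into the image by that broad `COPY . .` (the entries below are illustrative — keep `models/` in the image, since the API loads `models/model.pkl` at startup):

```
# .dockerignore — keeps builds fast and images small
data/
.git/
__pycache__/
*.ipynb
```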
Step 5 — CI/CD Pipeline (The Part That Got Me Noticed)
This was the step that changed everything.
Most ML portfolios stop at training a model.
But I added automation.
Every time I pushed code to GitHub, the pipeline would:
- Run unit tests
- Build the Docker image
- Train the model
- Deploy the API
Here’s a simplified GitHub Actions pipeline.
```yaml
name: ML Pipeline

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repo
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"  # quote it, or YAML parses 3.10 as 3.1

      - name: Install Dependencies
        run: pip install -r requirements.txt

      - name: Run Tests
        run: pytest

      - name: Build Docker
        run: docker build -t ml-project .
```
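The "Run Tests" step assumes there's actually something in `tests/` to run. A minimal sketch of what a preprocessing test might look like — the helper and column names here are hypothetical:

```python
# tests/test_preprocessing.py (illustrative)
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor():
    # Mirrors the ColumnTransformer from the preprocessing step
    return ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])

def test_preprocessor_output_shape():
    X = pd.DataFrame({"age": [25, 32, 47], "city": ["NY", "SF", "NY"]})
    out = build_preprocessor().fit_transform(X)
    # 1 scaled numeric column + 2 one-hot city columns = 3 columns
    assert out.shape == (3, 3)
```

Even one test like this means a broken preprocessing change fails the pipeline instead of silently corrupting the model.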
Every commit now triggers automated ML engineering infrastructure.
Recruiters love seeing this.
Because it proves you understand something critical:
Machine learning in production is software engineering.
The Hidden Skill Most ML Engineers Ignore
Here’s a surprising statistic.
According to Google’s ML engineering research:
Only 5–10% of ML code in production is model code.
The rest is infrastructure.
- Data validation
- Monitoring
- Deployment
- Pipelines
- Scaling
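To make the first of those concrete: a lightweight data-validation check can gate the pipeline before training ever starts. This is just a sketch with hypothetical column names and rules — real projects often reach for a schema library instead:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    # Collect every problem instead of failing on the first one,
    # so a pipeline failure log tells the whole story
    problems = []
    required = {"age", "city", "target"}
    missing = required - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "age" in df.columns and (df["age"] < 0).any():
        problems.append("negative ages found")
    if df.isna().any().any():
        problems.append("unhandled missing values")
    return problems

df = pd.DataFrame({"age": [25, -3], "city": ["NY", "SF"], "target": [0, 1]})
print(validate(df))  # -> ['negative ages found']
```

A training script would call this right after loading the CSV and abort if the list is non-empty.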
That’s where real ML engineers differentiate themselves.
What Actually Made My Project Stand Out
Not the accuracy.
Not the algorithm.
Not the dataset.
It was this:
A full machine learning lifecycle.
Data → Preprocessing → Training → Serialization → API → Docker → CI/CD
Most portfolios show a model.
Very few show a system.
If I Were Building My First ML Project Again
I would follow this exact stack:
- Python
- Scikit-learn
- FastAPI
- Docker
- GitHub Actions
- MLflow
Because this stack teaches you the real skill:
End-to-end machine learning engineering.
And once you understand that…
Training models suddenly becomes the easy part.
If you liked the article, don't forget to connect and leave a few claps. Thanks very much!