Most ML project lists are built for data science students. This one is built for software engineers who already know how to ship production code and want to demonstrate ML competence to hiring teams, not just familiarity with Scikit-learn.
Every project here is chosen for one reason: it forces you to solve problems that show up in real ML engineering roles, not just in Kaggle notebooks. The stack choices are opinionated and current. The "what it actually demonstrates" notes are written from the perspective of what a hiring manager at a product company looks for, not what makes a clean tutorial.
Projects are ordered from foundational to advanced. Each builds on patterns from the one before it.
1. Text Classification Pipeline With Drift Monitoring
What you build: A sentiment or topic classifier trained on a public dataset (Amazon reviews, AG News), wrapped in a FastAPI endpoint, with a basic drift detection layer that flags when incoming text starts diverging from the training distribution.
Stack: Python, Scikit-learn or HuggingFace, FastAPI, Evidently AI, Docker
The production element most people skip: The drift monitor. Most engineers build the classifier and stop. Adding Evidently to track feature drift over time and log alerts when distribution shifts exceed a threshold is what turns this from a tutorial into an ML system.
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd

def check_drift(reference_data: pd.DataFrame, current_data: pd.DataFrame) -> dict:
    # Compare current traffic against the training-time reference sample.
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_data, current_data=current_data)
    result = report.as_dict()
    drift_detected = result["metrics"][0]["result"]["dataset_drift"]
    return {"drift_detected": drift_detected, "report": result}
```
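The alerting step on top of that report can be a stdlib-only helper. This sketch consumes the dict shape returned by `check_drift` above; `share_of_drifted_columns` is what Evidently's dataset drift metric reports, but treat the exact key names as an assumption to verify against your Evidently version:

```python
import logging

logger = logging.getLogger("drift_monitor")

def maybe_alert(drift_result: dict, share_threshold: float = 0.3) -> bool:
    """Decide whether to raise an alert from an Evidently-style drift result.

    Assumes the dict shape returned by check_drift above: the first metric
    result carries 'dataset_drift' (bool) and 'share_of_drifted_columns'.
    """
    metric = drift_result["report"]["metrics"][0]["result"]
    drifted = drift_result["drift_detected"]
    share = metric.get("share_of_drifted_columns", 0.0)
    if drifted or share >= share_threshold:
        # In production this would page or post to Slack, not just log.
        logger.warning("Drift alert: drifted=%s share=%.2f", drifted, share)
        return True
    return False
```

Wiring this into the FastAPI service as a periodic background task is what closes the monitoring loop.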
What it demonstrates: Model serving, containerization, and the monitoring mindset that separates MLEs from notebook practitioners.
2. Feature Store From Scratch
What you build: A lightweight feature store that computes, stores, and serves features for a tabular ML problem (churn prediction, loan default). Features are computed offline, stored in a database, and retrieved at inference time via a point-in-time correct query that prevents future leakage.
Stack: Python, PostgreSQL or Redis, Feast (or hand-rolled), FastAPI
The production element most people skip: Point-in-time correctness. Most engineers join features naively on entity ID, which leaks future data into training. A real feature store retrieves the feature value that existed at the time of the label, not the latest value.
```python
from datetime import datetime

def get_features_at_timestamp(
    entity_id: str,
    timestamp: datetime,
    feature_names: list[str],
    conn,
) -> dict:
    # Only consider feature values computed at or before the label timestamp,
    # newest first, so no future data leaks into the training row.
    query = """
        SELECT feature_name, feature_value
        FROM feature_store
        WHERE entity_id = %s
          AND feature_name = ANY(%s)
          AND computed_at <= %s
        ORDER BY computed_at DESC
    """
    rows = conn.execute(query, (entity_id, feature_names, timestamp)).fetchall()
    # Keep the first (most recent pre-timestamp) value per feature.
    seen = {}
    for name, value in rows:
        if name not in seen:
            seen[name] = value
    return seen
```
What it demonstrates: Understanding of training-serving skew, data leakage, and production feature pipelines — one of the most commonly tested concepts in MLE system design interviews.
3. Fine-Tuned LLM With Evaluation Harness
What you build: A domain-specific fine-tuned model using LoRA/QLoRA on a task like legal clause classification, medical note summarization, or code review comment generation. The evaluation harness runs the model against a golden test set on every training run and logs results to an experiment tracker.
Stack: Python, HuggingFace PEFT, QLoRA, Weights & Biases, PyTorch
The production element most people skip: The evaluation harness. Most engineers fine-tune, check loss curves, and call it done. Building a golden set of 50-100 human-labeled examples and writing automated evaluation that runs on every checkpoint is what makes this a system.
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def apply_lora(model_name: str, r: int = 8, lora_alpha: int = 16):
    # Load the base model quantized to 4-bit so it fits on a single GPU.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_4bit=True,
        device_map="auto",
    )
    # Inject low-rank adapters into the attention projections only.
    config = LoraConfig(
        r=r,
        lora_alpha=lora_alpha,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, config)
```
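The harness itself can start very small. This is a framework-free sketch: the `predict` callable and the golden-example schema are assumptions, and the returned metrics dict stands in for a `wandb.log` call per checkpoint:

```python
def evaluate_golden_set(predict, golden_examples: list[dict]) -> dict:
    """Run a predict(prompt) -> label callable over a golden test set.

    Each golden example is {"prompt": ..., "expected": ...}. In a real
    harness you would call wandb.log(metrics) after each checkpoint;
    here the metrics are simply returned.
    """
    correct = 0
    failures = []
    for ex in golden_examples:
        pred = predict(ex["prompt"])
        if pred == ex["expected"]:
            correct += 1
        else:
            # Keep every failure so regressions are inspectable, not just counted.
            failures.append(
                {"prompt": ex["prompt"], "expected": ex["expected"], "got": pred}
            )
    return {
        "golden_accuracy": correct / len(golden_examples),
        "num_failures": len(failures),
        "failures": failures,
    }
```

Running this on every checkpoint and logging the failure list, not just the accuracy number, is what makes regressions debuggable.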
What it demonstrates: Modern LLM adaptation techniques, experiment tracking discipline, and evaluation methodology — all directly relevant to applied ML roles in 2026.
4. Real-Time Fraud Detection System
What you build: A streaming fraud detection pipeline that consumes transaction events from Kafka, computes real-time features (time since last transaction, rolling spend deviation), runs a trained classifier, and logs decisions with confidence scores for auditing.
Stack: Python, Apache Kafka, Redis (for real-time feature retrieval), XGBoost or LightGBM, FastAPI
The production element most people skip: Handling class imbalance correctly in both training and threshold selection. Fraud datasets are typically 0.1-1% positive. Training without addressing this produces a model that predicts "not fraud" for everything and achieves 99% accuracy. The threshold for flagging fraud should be tuned on business cost, not F1.
```python
import numpy as np
import lightgbm as lgb
from sklearn.utils.class_weight import compute_sample_weight

def train_fraud_model(X_train, y_train):
    # Weight minority-class rows up. Use either sample weights or
    # class_weight='balanced' -- not both, which would double-count.
    weights = compute_sample_weight(class_weight="balanced", y=y_train)
    model = lgb.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=31,
    )
    model.fit(X_train, y_train, sample_weight=weights)
    return model

def select_threshold_by_cost(
    y_true: np.ndarray,
    y_proba: np.ndarray,
    cost_fn: float = 10,
    cost_fp: float = 1,
) -> float:
    # Sweep thresholds and pick the one minimizing expected business cost,
    # where a missed fraud (FN) costs more than a false alarm (FP).
    best_threshold, best_cost = 0.5, float("inf")
    for t in (i / 100 for i in range(1, 100)):
        preds = (y_proba >= t).astype(int)
        fn = ((preds == 0) & (y_true == 1)).sum()
        fp = ((preds == 1) & (y_true == 0)).sum()
        total_cost = fn * cost_fn + fp * cost_fp
        if total_cost < best_cost:
            best_cost, best_threshold = total_cost, t
    return best_threshold
```
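The real-time features themselves can be sketched without Kafka or Redis at all. In this illustrative sketch, an in-process dict stands in for the Redis feature cache, and the class name and feature schema are assumptions, not a fixed API:

```python
from collections import deque, defaultdict
from statistics import mean, pstdev

class RealTimeFeatures:
    """Per-card rolling features; a plain dict stands in for Redis here."""

    def __init__(self, window: int = 20):
        # Bounded deque keeps only the last `window` transaction amounts.
        self.amounts = defaultdict(lambda: deque(maxlen=window))
        self.last_seen = {}

    def update_and_get(self, card_id: str, amount: float, ts: float) -> dict:
        history = self.amounts[card_id]
        # First transaction for a card gets 0 seconds since last.
        seconds_since_last = ts - self.last_seen.get(card_id, ts)
        if len(history) >= 2:
            mu, sigma = mean(history), pstdev(history)
            # Z-score of the current amount against the card's recent history.
            spend_deviation = (amount - mu) / sigma if sigma > 0 else 0.0
        else:
            spend_deviation = 0.0
        history.append(amount)
        self.last_seen[card_id] = ts
        return {
            "seconds_since_last_txn": seconds_since_last,
            "rolling_spend_deviation": spend_deviation,
        }
```

In the Kafka consumer, each transaction event would call `update_and_get` and pass the resulting features to the classifier; swapping the dict for Redis makes the state survive restarts and scale across consumers.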
What it demonstrates: Streaming data pipelines, imbalanced classification, business-aware threshold tuning, and real-time serving — a complete production ML system.
5. RAG System With Retrieval Evaluation
What you build: A retrieval-augmented generation system over a document corpus (company docs, research papers, a Wikipedia subset). The system chunks documents, generates embeddings, stores them in a vector database, retrieves context at query time, and passes it to an LLM. Critically, it includes retrieval evaluation that measures whether the right chunks are being retrieved.
Stack: Python, LangChain or LlamaIndex, Pinecone or ChromaDB, OpenAI or open-source LLM, RAGAS
The production element most people skip: Retrieval evaluation. Most engineers build the RAG pipeline and eyeball a few outputs. RAGAS gives you automated metrics for context precision, context recall, and answer faithfulness. Without these, you have no way to know if chunking strategy or embedding model changes actually improved the system.
```python
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

def evaluate_rag_pipeline(
    questions: list[str],
    answers: list[str],
    contexts: list[list[str]],
    ground_truths: list[str],
) -> dict:
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })
    return evaluate(
        dataset,
        metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
    )
```
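Upstream of evaluation, the chunking step itself can start as a simple sliding character window. A minimal sketch; production chunkers usually split on tokens or sentence boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows.

    Overlap keeps sentences that straddle a boundary retrievable
    from at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Sweeping `chunk_size` and `overlap` and re-running the RAGAS evaluation for each configuration is exactly the experiment that eyeballing outputs cannot replace.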
What it demonstrates: The full LLM application stack, embedding and retrieval systems, and evaluation discipline for generative systems — directly aligned with what most AI product teams are hiring for in 2026.
6. ML Pipeline With Full CI/CD
What you build: A complete ML pipeline where every commit triggers automated tests, the model is retrained on new data if tests pass, evaluation metrics are compared against the currently deployed model, and deployment only proceeds if the new model wins on a held-out test set. No manual steps.
Stack: Python, GitHub Actions, DVC (data version control), MLflow, Docker, any cloud (AWS/GCP/Azure)
The production element most people skip: The model promotion gate. Most CI/CD tutorials for ML cover training automation but stop before the comparison step. A real pipeline only deploys if the challenger model outperforms the champion on the evaluation set.
```yaml
# .github/workflows/ml_pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'data/**'
      - 'src/**'
      - 'params.yaml'

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run data validation tests
        run: pytest tests/test_data.py
      - name: Train model
        run: python src/train.py
      - name: Evaluate and compare vs champion
        id: evaluate  # the deploy step below reads this step's output
        run: python src/evaluate.py --compare-champion
      - name: Deploy if challenger wins
        if: ${{ steps.evaluate.outputs.challenger_wins == 'true' }}
        run: python src/deploy.py
```
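The comparison inside the evaluate script might look like the sketch below. The function name and the `auc` metric key are hypothetical, but writing `challenger_wins` to the file named by `$GITHUB_OUTPUT` is the mechanism GitHub Actions uses for step outputs:

```python
import os

def promote_if_better(challenger_metrics: dict, champion_metrics: dict,
                      metric: str = "auc", min_gain: float = 0.0) -> bool:
    """Compare challenger vs champion and expose the verdict to later steps.

    Appends challenger_wins=true/false to the file in $GITHUB_OUTPUT so a
    subsequent workflow step can gate deployment on it.
    """
    wins = challenger_metrics[metric] > champion_metrics[metric] + min_gain
    output_path = os.environ.get("GITHUB_OUTPUT")
    if output_path:
        with open(output_path, "a") as f:
            f.write(f"challenger_wins={'true' if wins else 'false'}\n")
    return wins
```

Setting `min_gain` above zero prevents churn from deploying challengers that win by statistical noise.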
What it demonstrates: MLOps maturity, reproducible training pipelines, and automated model governance — the skills that hiring managers consistently say separate production-ready candidates from notebook-only practitioners.
7. Computer Vision Inference Service With Batching
What you build: An object detection or image classification model (fine-tuned on a custom dataset using a YOLO or EfficientNet backbone) served behind an API that supports dynamic request batching — grouping individual inference requests together and processing them as a batch to maximize GPU throughput.
Stack: Python, PyTorch, Ultralytics YOLO or timm, FastAPI, NVIDIA Triton Inference Server or custom batching logic
The production element most people skip: Dynamic batching. Most engineers serve one image per request, which leaves GPU utilization at 10-20% under real load. A batching layer collects requests over a short time window and processes them together, dramatically improving throughput at the cost of, at most, the wait window in added per-request latency.
```python
import asyncio
import torch

class DynamicBatcher:
    def __init__(self, model, max_batch_size: int = 32, max_wait_ms: float = 10.0):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: asyncio.Queue = asyncio.Queue()

    async def infer(self, image_tensor):
        # Enqueue the request and wait for the batch worker to fulfil it.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((image_tensor, future))
        return await future

    async def process_batches(self):
        # Background worker: collect requests until the batch is full or the
        # wait window expires, then run one forward pass for the whole batch.
        while True:
            batch, futures = [], []
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self.max_wait_ms / 1000
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    tensor, future = await asyncio.wait_for(
                        self.queue.get(), timeout=timeout
                    )
                    batch.append(tensor)
                    futures.append(future)
                except asyncio.TimeoutError:
                    break
            if batch:
                results = self.model(torch.stack(batch))
                for future, result in zip(futures, results):
                    future.set_result(result)
```
What it demonstrates: GPU-aware serving, latency vs throughput tradeoffs, and inference optimization — all of which appear in MLE system design interviews and on-the-job performance reviews.
8. End-to-End Recommendation System
What you build: A two-tower retrieval and ranking system. The retrieval tower generates user and item embeddings and uses approximate nearest neighbor search to retrieve candidates. A separate ranking model scores the candidates using additional features. Both stages are served via API and the full system logs impressions and clicks for future retraining.
Stack: Python, PyTorch, Faiss (ANN search), FastAPI, PostgreSQL (interaction logging), Airflow (retraining schedule)
The production element most people skip: The two-stage architecture itself. Most engineers build a single model that scores all items, which doesn't scale past a few thousand items. The retrieval-then-ranking split is how Netflix, Spotify, YouTube, and every serious recommendation system at scale actually works.
```python
import numpy as np
import torch
import torch.nn as nn
import faiss

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim: int, item_dim: int, embedding_dim: int = 64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 128),
            nn.ReLU(),
            nn.Linear(128, embedding_dim),
        )
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 128),
            nn.ReLU(),
            nn.Linear(128, embedding_dim),
        )

    def forward(self, user_features, item_features):
        user_emb = self.user_tower(user_features)
        item_emb = self.item_tower(item_features)
        # Dot-product score between user and item embeddings.
        return torch.sum(user_emb * item_emb, dim=1)

def build_faiss_index(item_embeddings: np.ndarray) -> faiss.Index:
    # Normalize so inner product equals cosine similarity.
    dim = item_embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    faiss.normalize_L2(item_embeddings)
    index.add(item_embeddings)
    return index

def retrieve_candidates(
    user_embedding: np.ndarray,  # shape (1, dim), float32
    index: faiss.Index,
    k: int = 100,
) -> np.ndarray:
    faiss.normalize_L2(user_embedding)
    distances, indices = index.search(user_embedding, k)
    return indices[0]
```
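The impression and click logging that feeds retraining can be as simple as append-only JSON lines. The schema below is illustrative, not prescriptive; logging the rank position shown to the user matters because later training can correct for position bias:

```python
import json
import time

def log_interaction(log_file, user_id: str, item_id: str,
                    event: str, position: int) -> dict:
    """Append one impression or click event as a JSON line.

    log_file is any writable text stream (a file handle in production).
    """
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "item_id": item_id,
        "event": event,       # "impression" or "click"
        "position": position, # rank at which the item was shown
    }
    log_file.write(json.dumps(record) + "\n")
    return record
```

An Airflow job can then batch these lines into training examples, closing the retraining loop the project description calls for.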
What it demonstrates: The most commonly asked ML system design question in interviews ("design a recommendation system"), implemented end-to-end with the architecture that actually scales — retrieval, ranking, logging, and retraining loop included.
What Separates These Projects From Tutorial Clones
Every project above has one thing in common: it includes the part that tutorials skip. Drift monitoring, point-in-time correct features, retrieval evaluation, model promotion gates, dynamic batching, two-stage retrieval. These are the elements that show up in production ML systems and almost never in beginner resources.
Building these projects also changes how you talk about your work in interviews. The difference between a candidate who says "I built a RAG system" and one who says "I built a RAG system and measured context recall and faithfulness using RAGAS across three chunking strategies" is not a difference in phrasing; it reflects a real gap in how the two approached the project.
The other thing worth knowing is that project selection matters as much as project execution. Engineers who are making the transition from software engineer to machine learning engineer often overbuild in one area and neglect others.
A portfolio with three NLP projects and nothing on serving or monitoring reads differently to a hiring team than one that covers the full ML lifecycle. For a detailed breakdown of how to structure and present these projects, this ML engineer portfolio guide covers what hiring managers actually look for beyond the GitHub link.