🧠 I Stopped Thinking in Machine Learning Models and Started Thinking in Systems (Here’s Why)
And it completely changed how I build AI for healthcare.
Most machine learning projects look impressive in isolation.
You train a model, get good metrics, maybe even build a notebook demo—and it feels like progress.
But when you try to turn that into something real, especially in healthcare, something breaks.
That’s what pushed me to rethink everything.
Instead of focusing on models, I started focusing on systems.
🏥 The problem I started exploring
In healthcare, a lot of valuable insights already exist in routine lab data—like blood counts and biochemical markers.
The challenge is not data availability.
It’s interpretation at scale.
So I explored a question:
What would it take to build an AI system that can process routine clinical data and generate meaningful early risk signals across multiple conditions?
Not for one disease.
But for multiple, in a unified system.
Here's what I discovered along the way.
⚙️ Where things started to change
At first, I approached it like a typical ML problem.
Train models. Optimize accuracy. Compare results.
But very quickly, I hit a limitation:
Good models don’t automatically become usable systems.
That realization shifted the entire direction of the project.
I stopped asking "How accurate is my model?" and started asking "What happens when this model meets reality?"
🧠 From models → system design
Instead of a single predictive pipeline, I moved toward a modular multi-system architecture:
- Separate inference pipelines per disease category
- A centralized feature engineering layer
- Parallel execution of multiple models
- Structured output aggregation
- Data validation based on clinical constraints
- A lightweight interface layer for interaction
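The components above compose into a thin orchestration layer. Here is a minimal sketch of that composition; the names (`PIPELINES`, `build_features`, `run_system`) and the stub pipelines are illustrative, not the actual project code:

```python
# Hypothetical sketch of the modular layout; names are illustrative.

def build_features(raw: dict) -> dict:
    """Centralized feature engineering layer (stub)."""
    nlr = raw["neutrophils"] / max(raw["lymphocytes"], 1e-6)
    return {**raw, "nlr": nlr}

def anemia_pipeline(features: dict) -> dict:
    return {"disease": "anemia", "prob": 0.12}    # stub inference

def diabetes_pipeline(features: dict) -> dict:
    return {"disease": "diabetes", "prob": 0.67}  # stub inference

# Separate inference pipeline per disease category
PIPELINES = {"anemia": anemia_pipeline, "diabetes": diabetes_pipeline}

def run_system(raw: dict) -> dict:
    """Engineer features once, fan out to every pipeline, aggregate."""
    features = build_features(raw)
    # Structured output aggregation: one dict keyed by disease
    return {name: fn(features) for name, fn in PIPELINES.items()}

report = run_system({"neutrophils": 4.2, "lymphocytes": 1.4})
print(sorted(report))  # ['anemia', 'diabetes']
```

The win is not any single function. It's that adding a sixth disease means adding one entry to the registry, not rewiring the system.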
The focus was no longer “what is the best model?”
It became:
“How do these components work together as a system?”
This shift unlocked everything.
🔬 What surprised me most
The hardest problems were not in machine learning.
They were in system design.
1. Data is more complex than models
Clinical data is noisy, inconsistent, and context-dependent. A model trained on clean CSV files has no idea what to do with a missing ferritin value or a hemoglobin of 2.0 (which would mean the patient is dead).
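One way to catch those impossible values is a validation layer keyed on clinical reference ranges. The ranges and field names below are illustrative examples, not the project's actual rules:

```python
# Illustrative clinical-constraint validator; ranges and field
# names are examples only, not the project's actual rules.

CLINICAL_RANGES = {
    # (plausible_min, plausible_max) for a living patient
    "hgb": (3.0, 25.0),         # hemoglobin, g/dL
    "ferritin": (1.0, 5000.0),  # ng/mL
}

def validate_labs(labs: dict) -> tuple:
    """Drop physiologically impossible values and flag missing ones."""
    clean, issues = {}, []
    for name, (lo, hi) in CLINICAL_RANGES.items():
        value = labs.get(name)
        if value is None:
            issues.append(f"{name}: missing")  # downstream models must handle this
        elif not lo <= value <= hi:
            issues.append(f"{name}: {value} outside plausible range")
        else:
            clean[name] = value
    return clean, issues

# Hgb 2.0 g/dL is physiologically implausible, and ferritin is missing
clean, issues = validate_labs({"hgb": 2.0})
print(issues)
```

Nothing clever here, and that's the point: the model never sees a value a clinician would immediately reject.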
2. Integration is the real bottleneck
Connecting pipelines, models, and outputs is harder than training anything. Getting five disease models to talk to each other without crashing took longer than building all five models combined.
3. Prediction alone is not enough
Outputs need structure, validation, and interpretability to be useful. A probability without context is just a number. A number without an action is noise.
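One way to enforce that is to make the output type itself carry context and an action alongside the probability. A minimal sketch, with a hypothetical `Prediction` type:

```python
# Sketch of a structured prediction output; the Prediction type and
# its fields are hypothetical, the pattern is what matters.
from dataclasses import dataclass, field

@dataclass
class Prediction:
    disease: str
    prob: float                                  # raw model probability
    drivers: list = field(default_factory=list)  # features behind the score
    action: str = ""                             # what a clinician does next

    def is_actionable(self) -> bool:
        # A number without an action is noise
        return bool(self.action)

p = Prediction(
    disease="ckd",
    prob=0.72,
    drivers=["creatinine", "egfr"],
    action="Repeat creatinine, refer to nephrology",
)
print(p.is_actionable())  # True
```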
4. Real systems behave differently than notebooks
Latency, consistency, and workflow integration matter more than accuracy scores. A 98% accurate model that takes 10 seconds to run is worse than a 90% accurate model that returns instantly.
⚙️ Technical direction I used
To explore this, I worked with:
- Python-based ML pipelines
- Ensemble learning methods (tree-based models, boosting techniques)
- Modular feature engineering design
- Parallel processing for inference efficiency
- Structured output formatting for interpretability
- Lightweight UI layer for interaction testing
But the key focus was never the tools.
It was how they fit together.
🧩 The main shift in thinking
This project changed how I understand machine learning:
ML is not a modeling problem. It’s a system design problem wrapped in data complexity.
A model might give you a prediction.
But a system determines whether that prediction can actually be used.
Think about that for a moment. You can have the best model in the world. If clinicians don't trust it, if it crashes on real data, if it takes too long to run—it's useless.
The system is what bridges the gap between mathematical optimization and clinical reality.
🧠 Questions I’m now focused on
This work led me to deeper questions:
- How do we design ML systems that behave reliably in real environments?
- How do we structure outputs so they are interpretable, not just accurate?
- How do multiple models interact inside a single system?
- What does safety mean when predictions influence decisions?
These questions matter more to me now than improving model metrics.
Because metrics measure models. But outcomes measure systems.
🚧 Where I’m going next
Right now, I’m focusing on:
- Improving system-level architecture for ML applications
- Strengthening feature engineering pipelines
- Making outputs more explainable and structured
- Exploring real-world deployment patterns for AI systems
- Moving from “experiments” to “usable systems”
The experiments are fun. The usable systems change lives.
🔁 Final thought
Machine learning becomes interesting when it leaves the notebook.
Not because models get better—but because systems get real.
And building those systems forces you to think differently:
Not in predictions, but in architecture, flow, and usability.
When you stop optimizing for Kaggle leaderboards and start optimizing for a nurse at 3 AM with a crashing laptop and a patient who needs answers—everything changes.
🤝 If you’re building in this space
If you’re working on:
- ML systems
- backend engineering for AI
- healthcare applications
- or applied data science
I’d be interested in how you approach system design vs model design.
What breaks first when you try to deploy? What's harder than you expected? What made you rethink everything?
Let's compare war stories.
📦 Simplified Code Examples
The system I built processes clinical data through multiple disease pipelines. Here’s what the architecture looks like in practice.
Core Inference Engine
The inference engine orchestrates all disease pipelines in parallel:
```python
# simplified inference
from concurrent.futures import ThreadPoolExecutor

def run_all_pipelines(master_dict: dict) -> dict:
    """Run all disease pipelines and return unified results."""
    results = {}
    # Load once, use many times
    models = load_all_models()
    # Parallel execution
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {
            "anemia": executor.submit(run_anemia, master_dict, models["anemia"]),
            "diabetes": executor.submit(run_diabetes, master_dict, models["diabetes"]),
            "ckd": executor.submit(run_ckd, master_dict, models["ckd"]),
        }
        for name, future in futures.items():
            results[name] = future.result()
    # Comorbidity analysis across all results
    results["comorbidity"] = analyze_comorbidities(results, master_dict)
    return results
```
This parallel pattern cut inference time from 2 seconds to under 300 milliseconds.
Modular Pipeline Structure
Each disease follows the same pattern:
```python
# anemia prediction, simplified structure
def run_prediction(master_dict: dict, models: dict) -> dict:
    # 1. Feature engineering
    features = build_feature_dict(master_dict)

    # 2. Stage 1: risk detection (binary)
    X1 = prepare_features(features, models["stage1"]["features"])
    prob1 = models["stage1"]["model"].predict_proba(X1)[0, 1]
    if prob1 < 0.5:
        return {"disease": "anemia", "prob": prob1, "status": "no_risk"}

    # 3. Stage 2: morphological subtype
    X2 = prepare_features(features, models["stage2"]["features"])
    subtype = models["stage2"]["model"].predict(X2)[0]

    # 4. Stage 3: specific diagnosis
    X3 = prepare_features(features, models[f"stage3_{subtype}"]["features"])
    diagnosis = models[f"stage3_{subtype}"]["model"].predict(X3)[0]

    return {
        "disease": "anemia",
        "prob": prob1,
        "subtype": subtype,
        "diagnosis": diagnosis,
    }
```
The multi-stage approach means we only run expensive models when necessary. Low risk? Stop early. High risk? Dig deeper.
Comorbidity Detection
The system automatically detects dangerous disease combinations:
```python
# simplified comorbidity engine
DANGEROUS_COMBOS = [
    {
        "pair": ("diabetes", "ckd"),
        "thresholds": {"diabetes": 0.55, "ckd": 0.45},
        "name": "Diabetic Nephropathy Cascade",
        "action": "Start ACEi/ARB, check urine ACR, consider SGLT-2 inhibitor",
    },
    {
        "pair": ("cardiovascular", "diabetes"),
        "thresholds": {"cardiovascular": 0.50, "diabetes": 0.55},
        "name": "Cardiometabolic Syndrome",
        "action": "Prioritize Metformin + Statin, target BP <130/80",
    },
]

def detect_dangerous_combos(results: dict) -> list:
    """Check for dangerous disease combinations."""
    fired = []
    for combo in DANGEROUS_COMBOS:
        # Fire only if every disease in the pair exceeds its threshold
        if all(results.get(d, {}).get("prob", 0) >= combo["thresholds"][d]
               for d in combo["pair"]):
            fired.append(combo)
    return fired
```
This turned out to be the clinicians' favorite feature. They don't want to connect dots themselves. They want the system to tell them what combinations are dangerous.
Natural Language Explanation
Raw probabilities become plain-language explanations for clinicians:
```python
# simplified natural language explainer
def build_nl_report(disease: str, result: dict, master_dict: dict) -> str:
    """Convert AI predictions to plain language."""
    prob = result["prob"]
    risk_level = "high" if prob > 0.7 else "moderate" if prob > 0.4 else "low"

    if disease == "diabetes":
        hba1c = master_dict.get("hba1c", 5.5)
        glucose = master_dict.get("blood_glucose_level", 100)
        return (
            f"This patient has a {risk_level} risk of diabetes ({prob:.0%}). "
            f"Key indicators: HbA1c {hba1c}%, fasting glucose {glucose} mg/dL."
        )

    if disease == "anemia":
        hgb = master_dict.get("hgb", 12.0)
        mcv = master_dict.get("mcv", 85)
        diagnosis = result.get("diagnosis", "unknown")
        return (
            f"Anaemia detected ({hgb} g/dL, MCV {mcv} fL). "
            f"Diagnosis: {diagnosis}. Recommend confirmatory testing."
        )

    return f"Risk assessment complete. {risk_level.capitalize()} risk ({prob:.0%})."
```
No SHAP values. No confusion matrices. Just plain language that any nurse can understand and act on.
Severity Score & Triage
A composite score drives clinical workflow:
```python
# simplified severity scorer
def compute_severity_score(results: dict, patient: dict) -> dict:
    """Compute INZIRA Severity Score (0–100) and triage tier."""
    weights = {
        "anemia": 0.20,
        "cardiovascular": 0.25,
        "ckd": 0.20,
        "diabetes": 0.20,
        "liver": 0.15,
    }
    # Weighted sum of probabilities, scaled to 0–100
    base_score = sum(
        results.get(d, {}).get("prob", 0) * w
        for d, w in weights.items()
    ) * 100

    # Penalty for dangerous combinations (capped at +30)
    combo_bonus = min(30, len(detect_dangerous_combos(results)) * 6)

    # Age modifier
    age = patient.get("age", 40)
    age_mod = 5 if age >= 60 else 0

    iss = min(100, base_score + combo_bonus + age_mod)

    # Determine triage tier
    if iss < 25:
        tier = ("GREEN", "Routine follow-up within 4–6 weeks")
    elif iss < 45:
        tier = ("YELLOW", "Priority review within 2 weeks")
    elif iss < 65:
        tier = ("ORANGE", "Same-day specialist review")
    else:
        tier = ("RED", "Immediate escalation, consider admission")

    return {"score": round(iss, 1), "tier": tier[0], "action": tier[1]}
```
This single number helps clinicians prioritize. In a busy district hospital with one doctor and 50 patients, knowing who to see first saves lives.
Rendering Results
The UI adapts to risk level and disease:
```python
# simplified results rendering (Streamlit)
import streamlit as st

def render_results(results: dict):
    """Display results with appropriate styling."""
    for disease, result in results.items():
        if disease == "comorbidity":
            continue
        prob = result.get("prob", 0)

        # Color based on risk
        if prob > 0.7:
            color, icon, level = "#E8526A", "🔴", "High Risk"
        elif prob > 0.4:
            color, icon, level = "#F5A623", "🟡", "Moderate Risk"
        else:
            color, icon, level = "#3DBE8A", "🟢", "Low Risk"

        # Risk card
        st.markdown(f"""
        <div style="background:{color}10; border-left:4px solid {color};
                    border-radius:8px; padding:1rem; margin-bottom:1rem">
          <span style="font-size:1.2rem">{icon}</span>
          <span style="font-weight:700; color:{color}">{disease.upper()}</span>
          <span style="float:right; font-family:monospace; font-weight:700; color:{color}">
            {prob:.0%}
          </span>
          <div style="margin-top:0.5rem; color:#DDE6F5">{level}</div>
        </div>
        """, unsafe_allow_html=True)

        # Natural-language explanation below the card
        st.markdown(build_nl_report(disease, result, get_master_dict()))
```
The visual design isn't cosmetic. It's functional. Red means act now. Green means monitor. No interpretation needed.
The code above is simplified, but the patterns are real. The system is currently under expert review in Rwanda, and the feedback so far has been constructive.
What would you build with a system like this?