enochlabs
Building AI Systems for Healthcare: My Journey into Applied Machine Learning and Software Engineering

🧠 I Stopped Thinking in Machine Learning Models and Started Thinking in Systems (Here’s Why)

And it completely changed how I build AI for healthcare.

Most machine learning projects look impressive in isolation.

You train a model, get good metrics, maybe even build a notebook demo—and it feels like progress.

But when you try to turn that into something real, especially in healthcare, something breaks.

That’s what pushed me to rethink everything.

Instead of focusing on models, I started focusing on systems.

🏥 The problem I started exploring

In healthcare, a lot of valuable insights already exist in routine lab data—like blood counts and biochemical markers.

The challenge is not data availability.

It’s interpretation at scale.

So I explored a question:

What would it take to build an AI system that can process routine clinical data and generate meaningful early risk signals across multiple conditions?

Not for one disease.

But for multiple, in a unified system.

Here's what I discovered along the way.

⚙️ Where things started to change

At first, I approached it like a typical ML problem.

Train models. Optimize accuracy. Compare results.

But very quickly, I hit a limitation:

Good models don’t automatically become usable systems.

That realization shifted the entire direction of the project.

I stopped asking "How accurate is my model?" and started asking "What happens when this model meets reality?"

🧠 From models → system design

Instead of a single predictive pipeline, I moved toward a modular multi-system architecture:

  • Separate inference pipelines per disease category
  • A centralized feature engineering layer
  • Parallel execution of multiple models
  • Structured output aggregation
  • Data validation based on clinical constraints
  • A lightweight interface layer for interaction

The focus was no longer “what is the best model?”

It became:

“How do these components work together as a system?”

This shift unlocked everything.
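Concretely, the boundary between components can be as simple as a shared function signature that every disease pipeline implements, so the orchestrator never needs disease-specific logic. A minimal sketch of that idea (the names and placeholder probabilities here are illustrative, not the project's actual code):

```python
from typing import Callable, Dict

# Hedged sketch: every disease pipeline conforms to the same signature,
# so the orchestrator can treat them interchangeably. Names are illustrative.
Pipeline = Callable[[dict], dict]

def make_registry() -> Dict[str, Pipeline]:
    """Register pipelines under disease names; the orchestrator just iterates this dict."""
    def anemia(master_dict: dict) -> dict:
        return {"disease": "anemia", "prob": 0.1}    # placeholder inference

    def diabetes(master_dict: dict) -> dict:
        return {"disease": "diabetes", "prob": 0.2}  # placeholder inference

    return {"anemia": anemia, "diabetes": diabetes}

def run_registry(registry: Dict[str, Pipeline], master_dict: dict) -> dict:
    """Structured output aggregation: one results dict keyed by disease."""
    return {name: pipeline(master_dict) for name, pipeline in registry.items()}
```

Adding a sixth disease then means registering one more function, not rewriting the orchestrator.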

🔬 What surprised me most

The hardest problems were not in machine learning.

They were in system design.

1. Data is more complex than models

Clinical data is noisy, inconsistent, and context-dependent. A model trained on clean CSV files has no idea what to do with a missing ferritin value or a hemoglobin of 2.0 (which would mean the patient is dead).
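One way to catch these cases is a validation layer that checks every value against a physiological range before any model sees it. A minimal sketch, assuming illustrative ranges and field names (not the project's actual constraints):

```python
# Hedged sketch: the ranges below are illustrative, not clinical reference values.
PLAUSIBLE_RANGES = {
    "hgb": (3.0, 22.0),         # hemoglobin, g/dL
    "ferritin": (5.0, 2000.0),  # ng/mL
    "glucose": (30.0, 700.0),   # mg/dL
}

def validate_labs(record: dict) -> dict:
    """Split a lab record into usable values, missing fields, and implausible values."""
    usable, missing, implausible = {}, [], []
    for field, (lo, hi) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:
            missing.append(field)      # downstream models must handle absence explicitly
        elif not (lo <= value <= hi):
            implausible.append(field)  # likely a unit error or data-entry mistake
        else:
            usable[field] = value
    return {"usable": usable, "missing": missing, "implausible": implausible}
```

A hemoglobin of 2.0 then gets flagged as implausible instead of silently producing a confident prediction.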

2. Integration is the real bottleneck

Connecting pipelines, models, and outputs is harder than training anything. Getting five disease models to talk to each other without crashing took longer than building all five models combined.

3. Prediction alone is not enough

Outputs need structure, validation, and interpretability to be useful. A probability without context is just a number. A number without an action is noise.

4. Real systems behave differently than notebooks

Latency, consistency, and workflow integration matter more than accuracy scores. A 98% accurate model that takes 10 seconds to run is worse than a 90% accurate model that returns instantly.

⚙️ Technical direction I used

To explore this, I worked with:

  • Python-based ML pipelines
  • Ensemble learning methods (tree-based models, boosting techniques)
  • Modular feature engineering design
  • Parallel processing for inference efficiency
  • Structured output formatting for interpretability
  • Lightweight UI layer for interaction testing

But the key focus was never the tools.

It was how they fit together.

🧩 The main shift in thinking

This project changed how I understand machine learning:

ML is not a modeling problem. It’s a system design problem wrapped in data complexity.

A model might give you a prediction.

But a system determines whether that prediction can actually be used.

Think about that for a moment. You can have the best model in the world. If clinicians don't trust it, if it crashes on real data, if it takes too long to run—it's useless.

The system is what bridges the gap between mathematical optimization and clinical reality.

🧠 Questions I’m now focused on

This work led me to deeper questions:

  • How do we design ML systems that behave reliably in real environments?
  • How do we structure outputs so they are interpretable, not just accurate?
  • How do multiple models interact inside a single system?
  • What does safety mean when predictions influence decisions?

These questions matter more to me now than improving model metrics.

Because metrics measure models. But outcomes measure systems.

🚧 Where I’m going next

Right now, I’m focusing on:

  • Improving system-level architecture for ML applications
  • Strengthening feature engineering pipelines
  • Making outputs more explainable and structured
  • Exploring real-world deployment patterns for AI systems
  • Moving from “experiments” to “usable systems”

The experiments are fun. The usable systems change lives.

🔭 Final thought

Machine learning becomes interesting when it leaves the notebook.

Not because models get better—but because systems get real.

And building those systems forces you to think differently:

Not in predictions, but in architecture, flow, and usability.

When you stop optimizing for Kaggle leaderboards and start optimizing for a nurse at 3 AM with a crashing laptop and a patient who needs answers—everything changes.

🤝 If you’re building in this space

If you’re working on:

  • ML systems
  • backend engineering for AI
  • healthcare applications
  • or applied data science

I’d be interested in how you approach system design vs model design.

What breaks first when you try to deploy? What's harder than you expected? What made you rethink everything?

Let's compare war stories.

📦 Simplified Code Examples

The system I built processes clinical data through multiple disease pipelines. Here’s what the architecture looks like in practice.

Core Inference Engine

The inference engine orchestrates all disease pipelines in parallel:

```python
# simplified inference
from concurrent.futures import ThreadPoolExecutor

def run_all_pipelines(master_dict: dict) -> dict:
    """Run all disease pipelines and return unified results."""
    results = {}

    # Load once, use many times
    models = load_all_models()

    # Parallel execution
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {
            "anemia": executor.submit(run_anemia, master_dict, models["anemia"]),
            "diabetes": executor.submit(run_diabetes, master_dict, models["diabetes"]),
            "ckd": executor.submit(run_ckd, master_dict, models["ckd"]),
        }
        for name, future in futures.items():
            results[name] = future.result()

    # Comorbidity analysis across all results
    results["comorbidity"] = analyze_comorbidities(results, master_dict)

    return results
```

This parallel pattern cut inference time from 2 seconds to under 300 milliseconds.
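The speedup comes from overlapping the per-pipeline waits instead of paying for them one after another. A toy benchmark with sleep stubs shows the pattern (the stubs and names are illustrative; only the 2 s → 300 ms figure is from the actual system):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_pipeline(name: str) -> str:
    time.sleep(0.1)  # stand-in for one pipeline's inference latency
    return name

names = ["anemia", "diabetes", "ckd"]

start = time.perf_counter()
sequential = [fake_pipeline(n) for n in names]  # waits stack: roughly 0.3 s
t_seq = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(names)) as ex:
    parallel = list(ex.map(fake_pipeline, names))  # waits overlap: roughly 0.1 s
t_par = time.perf_counter() - start

assert sequential == parallel  # same results, less wall-clock time
```

The gain depends on the pipelines spending their time waiting (I/O, model servers) rather than holding the GIL in pure-Python compute; for CPU-bound models a process pool is the equivalent move.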


Modular Pipeline Structure

Each disease follows the same pattern:

```python
# anemia prediction simplified structure
def run_prediction(master_dict: dict, models: dict) -> dict:
    # 1. Feature engineering
    features = build_feature_dict(master_dict)

    # 2. Stage 1: Risk detection (binary)
    X1 = prepare_features(features, models["stage1"]["features"])
    prob1 = models["stage1"]["model"].predict_proba(X1)[0, 1]

    if prob1 < 0.5:
        return {"disease": "anemia", "prob": prob1, "status": "no_risk"}

    # 3. Stage 2: Morphological subtype
    X2 = prepare_features(features, models["stage2"]["features"])
    subtype = models["stage2"]["model"].predict(X2)[0]

    # 4. Stage 3: Specific diagnosis
    X3 = prepare_features(features, models[f"stage3_{subtype}"]["features"])
    diagnosis = models[f"stage3_{subtype}"]["model"].predict(X3)[0]

    return {
        "disease": "anemia",
        "prob": prob1,
        "subtype": subtype,
        "diagnosis": diagnosis,
    }
```

The multi-stage approach means we only run expensive models when necessary. Low risk? Stop early. High risk? Dig deeper.

Comorbidity Detection

The system automatically detects dangerous disease combinations:

```python
# simplified comorbidity engine
DANGEROUS_COMBOS = [
    {
        "pair": ("diabetes", "ckd"),
        "thresholds": {"diabetes": 0.55, "ckd": 0.45},
        "name": "Diabetic Nephropathy Cascade",
        "action": "Start ACEi/ARB, check urine ACR, consider SGLT-2 inhibitor"
    },
    {
        "pair": ("cardiovascular", "diabetes"),
        "thresholds": {"cardiovascular": 0.50, "diabetes": 0.55},
        "name": "Cardiometabolic Syndrome",
        "action": "Prioritize Metformin + Statin, target BP <130/80"
    },
]

def detect_dangerous_combos(results: dict) -> list:
    """Check for dangerous disease combinations."""
    fired = []
    for combo in DANGEROUS_COMBOS:
        # Check if all diseases in the combo exceed thresholds
        if all(results.get(d, {}).get("prob", 0) >= combo["thresholds"][d]
               for d in combo["pair"]):
            fired.append(combo)
    return fired
```

This turned out to be the clinicians' favorite feature. They don't want to connect dots themselves. They want the system to tell them what combinations are dangerous.

Natural Language Explanation

Raw probabilities become plain-language explanations for clinicians:

```python
# simplified natural language explainer
def build_nl_report(disease: str, result: dict, master_dict: dict) -> str:
    """Convert AI predictions to plain language."""
    prob = result["prob"]
    risk_level = "high" if prob > 0.7 else "moderate" if prob > 0.4 else "low"

    if disease == "diabetes":
        hba1c = master_dict.get("hba1c", 5.5)
        glucose = master_dict.get("blood_glucose_level", 100)

        return (
            f"This patient has a {risk_level} risk of diabetes ({prob:.0%}). "
            f"Key indicators: HbA1c {hba1c}%, fasting glucose {glucose} mg/dL."
        )

    if disease == "anemia":
        hgb = master_dict.get("hgb", 12.0)
        mcv = master_dict.get("mcv", 85)
        diagnosis = result.get("diagnosis", "unknown")

        return (
            f"Anemia detected ({hgb} g/dL, MCV {mcv} fL). "
            f"Diagnosis: {diagnosis}. Recommend confirmatory testing."
        )

    return f"Risk assessment complete. {risk_level.capitalize()} risk ({prob:.0%})."
```

No SHAP values. No confusion matrices. Just plain language that any nurse can understand and act on.

Severity Score & Triage

A composite score drives clinical workflow:

```python
# simplified severity scorer
def compute_severity_score(results: dict, patient: dict) -> dict:
    """Compute INZIRA Severity Score (0–100) and triage tier."""
    weights = {
        "anemia": 0.20,
        "cardiovascular": 0.25,
        "ckd": 0.20,
        "diabetes": 0.20,
        "liver": 0.15,
    }

    # Weighted sum of probabilities
    base_score = sum(
        results.get(d, {}).get("prob", 0) * w
        for d, w in weights.items()
    ) * 100

    # Penalty for dangerous combinations
    combo_bonus = min(30, len(detect_dangerous_combos(results)) * 6)

    # Age modifier
    age = patient.get("age", 40)
    age_mod = 5 if age >= 60 else 0

    iss = min(100, base_score + combo_bonus + age_mod)

    # Determine triage tier
    if iss < 25:
        tier = "GREEN", "Routine follow-up within 4–6 weeks"
    elif iss < 45:
        tier = "YELLOW", "Priority review within 2 weeks"
    elif iss < 65:
        tier = "ORANGE", "Same-day specialist review"
    else:
        tier = "RED", "Immediate escalation, consider admission"

    return {"score": round(iss, 1), "tier": tier[0], "action": tier[1]}
```

This single number helps clinicians prioritize. In a busy district hospital with one doctor and 50 patients, knowing who to see first saves lives.

Rendering Results

The UI adapts to risk level and disease:

```python
# simplified results rendering
import streamlit as st

def render_results(results: dict):
    """Display results with appropriate styling."""
    for disease, result in results.items():
        if disease == "comorbidity":
            continue

        prob = result.get("prob", 0)

        # Color based on risk
        if prob > 0.7:
            color, icon, level = "#E8526A", "🔴", "High Risk"
        elif prob > 0.4:
            color, icon, level = "#F5A623", "🟡", "Moderate Risk"
        else:
            color, icon, level = "#3DBE8A", "🟢", "Low Risk"

        # Risk card
        st.markdown(f"""
        <div style="background:{color}10; border-left:4px solid {color};
                    border-radius:8px; padding:1rem; margin-bottom:1rem">
            <span style="font-size:1.2rem">{icon}</span>
            <span style="font-weight:700; color:{color}">{disease.upper()}</span>
            <span style="float:right; font-family:monospace; font-weight:700; color:{color}">
                {prob:.0%}
            </span>
            <div style="margin-top:0.5rem; color:#DDE6F5">{level}</div>
        </div>
        """, unsafe_allow_html=True)

        # Show natural language explanation
        st.markdown(build_nl_report(disease, result, get_master_dict()))
```

The visual design isn't cosmetic. It's functional. Red means act now. Green means monitor. No interpretation needed.

The code above is simplified, but the patterns are real. This system is under expert review in Rwanda right now, and the feedback has been constructive.

What would you build with a system like this?
