DEV Community

Nrk Raju Guthikonda

How I Built a Privacy-First Healthcare AI Agent Using MCP and Local LLMs

Most healthcare AI demos have a fatal flaw: they send patient data to the cloud. That's not just a bad practice — it's a regulatory minefield. HIPAA violations can cost $50,000 per incident, and "but our AI vendor said it was secure" isn't a defense.

I decided to build healthcare AI tools that solve this problem at the architecture level. No patient data ever leaves the machine. Zero cloud API calls. Complete HIPAA compliance by design, not by policy.

Here's how I built a suite of healthcare AI agents — including a patient intake summarizer, lab results interpreter, EHR de-identifier, and medical document assistant — all running locally with Gemma 4 via Ollama.


The Problem with Cloud-Based Healthcare AI

Every time a healthcare organization sends patient data to a cloud LLM API, they're creating:

  1. A HIPAA liability — PHI (Protected Health Information) transmitted to a third party requires a Business Associate Agreement, encryption in transit and at rest, and audit trails
  2. A single point of failure — API outages mean your clinical workflow stops
  3. A cost that scales linearly — every patient encounter means another API call, and token costs add up fast in healthcare where documents are long
  4. A trust problem — patients and providers increasingly ask "where does my data go?"

The solution isn't to avoid AI — it's to bring the AI to the data instead of sending the data to the AI.

Architecture: Local LLM + MCP Pattern

My architecture uses three core components:

┌─────────────────────────────────────────────┐
│           Clinical Application              │
│  (Streamlit UI / FastAPI / CLI)             │
├─────────────────────────────────────────────┤
│           MCP Server Layer                  │
│  (Tool definitions, prompt templates,       │
│   FHIR resource handlers)                   │
├─────────────────────────────────────────────┤
│           Ollama Runtime                    │
│  (Gemma 4 model, local inference,           │
│   zero network transmission)                │
└─────────────────────────────────────────────┘
         ↕ Everything stays on localhost

The Model Context Protocol (MCP) layer is what makes this modular. Instead of hardcoding LLM interactions, each healthcare capability is exposed as an MCP tool:

  • summarize_intake — processes patient intake forms into structured clinical summaries
  • interpret_lab_results — analyzes lab values against reference ranges with clinical context
  • deidentify_ehr — strips PHI from electronic health records while preserving clinical meaning
  • analyze_document — multi-agent document analysis for medical records
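The post doesn't show the server code itself, but the registration pattern behind these tools can be sketched with a small stdlib-only registry (the MCP SDK's decorators work along the same lines; `TOOLS`, `register`, and the placeholder tool bodies are illustrative, not the real implementation):

```python
from typing import Callable

# Hypothetical registry mirroring how MCP tools are declared;
# the real MCP SDK also handles schemas and transport.
TOOLS: dict[str, Callable[..., dict]] = {}

def register(name: str):
    """Expose a function as a named tool, like an MCP tool decorator."""
    def wrap(fn: Callable[..., dict]) -> Callable[..., dict]:
        TOOLS[name] = fn
        return fn
    return wrap

@register("deidentify_ehr")
def deidentify_ehr(record: str) -> dict:
    # Placeholder body; the real tool calls the local model.
    return {"text": record.replace("John Doe", "[NAME]")}

@register("summarize_intake")
def summarize_intake(text: str) -> dict:
    # Placeholder body; the real tool prompts Gemma via Ollama.
    return {"summary": text[:80]}

def run_chain(record: str) -> dict:
    """Composability in action: de-identify, then summarize."""
    clean = TOOLS["deidentify_ehr"](record)["text"]
    return TOOLS["summarize_intake"](clean)
```

Because every capability sits behind a name in the registry, chaining (de-identify → summarize → flag risks) is just sequential lookups, and each tool can be tested in isolation.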

Why MCP?

MCP provides a standardized interface between AI models and tools. For healthcare, this means:

  • Interoperability — any MCP-compatible client can use the healthcare tools
  • Composability — chain multiple tools (e.g., de-identify → summarize → flag risks)
  • Testability — each tool can be tested independently with known inputs/outputs
  • Audit trail — every tool invocation is logged with inputs and outputs

Building the Patient Intake Summarizer

Let me walk through one tool in detail. The Patient Intake Summarizer takes unstructured intake forms and produces structured clinical summaries.

The Challenge

Patient intake forms are messy. They contain free-text descriptions mixed with medical terminology, abbreviations, and varying formats. A typical intake might read:

"52F, presenting with lower back pain x 3 weeks, worse with sitting. PMH: DM2 on metformin 500mg BID, HTN on lisinopril 10mg daily. No known allergies. Family hx: mother had MI at 62."

A clinician can parse this instantly. An LLM needs structured prompting to extract the same information reliably.

The Solution

import ollama  # local client for the Ollama runtime


class IntakeSummarizer:
    def __init__(self, model="gemma4"):
        self.client = ollama.Client()
        self.model = model

    def summarize(self, intake_text: str, format: str = "structured") -> dict:
        prompt = self._build_prompt(intake_text, format)
        response = self.client.generate(
            model=self.model,
            prompt=prompt,
            options={"temperature": 0.1}  # Low temp for clinical accuracy
        )
        return self._parse_response(response["response"], format)

    def _build_prompt(self, text: str, format: str) -> str:
        return f"""You are a clinical documentation assistant. 
Summarize the following patient intake form into a {format} summary.

IMPORTANT: Extract ALL of the following categories:
- Demographics (age, sex, presenting complaint)
- Medical History (conditions, surgeries, hospitalizations)  
- Current Medications (drug, dose, frequency)
- Allergies (drug, food, environmental)
- Family History (conditions, relationships)
- Social History (occupation, habits, living situation)
- Risk Factors (clinical flags requiring attention)
- Missing Information (gaps that need follow-up)

Intake Form:
{text}

Provide the summary in structured JSON format."""

The key insight is temperature 0.1. For creative writing, you want high temperature. For clinical summarization, you want the model to be as deterministic and faithful to the source text as possible. Hallucinated medical information isn't creative — it's dangerous.
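`_parse_response` isn't shown above; a plausible sketch (an assumption, not the repo's code), tolerating the markdown fence that local models often wrap JSON in:

```python
import json
import re

def parse_response(raw: str, fmt: str = "structured") -> dict:
    """Parse the model's reply; tolerate a ```json fence around it."""
    if fmt != "structured":
        return {"summary": raw.strip()}
    # Grab the outermost JSON object, ignoring any fence or preamble.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return {"error": "no JSON object found", "raw": raw}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"error": "invalid JSON", "raw": raw}
```

Returning the raw text alongside an error flag, rather than raising, lets the caller surface parse failures to a human reviewer instead of silently dropping clinical content.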

Multi-Format Output

The summarizer supports three output formats:

  1. Brief — 2-3 sentence overview for quick triage
  2. Detailed — paragraph-form comprehensive summary
  3. Structured — JSON with categorized fields for EHR integration

The structured format is particularly valuable because it can be directly ingested by downstream systems — no manual re-entry, no copy-paste errors.
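For readers wiring this into downstream systems, the structured output's shape can be modeled with a typed dict. The field names below follow the categories listed in the prompt; they are illustrative, not a published schema:

```python
from typing import TypedDict

# Illustrative shape for the "structured" format; field names
# mirror the prompt's categories, not an actual repo schema.
class IntakeSummary(TypedDict):
    demographics: dict          # age, sex, presenting complaint
    medical_history: list[str]
    current_medications: list[dict]  # drug, dose, frequency
    allergies: list[str]
    family_history: list[str]
    social_history: dict
    risk_factors: list[str]
    missing_information: list[str]
```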

Lab Results Interpreter

The lab interpreter is more complex because it needs reference ranges and clinical context.

REFERENCE_RANGES = {
    "glucose_fasting": {"low": 70, "high": 100, "unit": "mg/dL", "critical_low": 50, "critical_high": 400},
    "hba1c": {"low": 4.0, "high": 5.6, "unit": "%", "critical_high": 14.0},
    "creatinine": {"low": 0.7, "high": 1.3, "unit": "mg/dL", "critical_high": 10.0},
    # ... 50+ lab values
}

def interpret(self, lab_name: str, value: float, patient_context: str = "") -> dict:
    ref = REFERENCE_RANGES.get(lab_name)
    if ref is None:
        raise ValueError(f"No reference range defined for {lab_name}")
    status = self._classify_value(value, ref)

    # Only call LLM for abnormal values or when context matters
    if status != "normal" or patient_context:
        interpretation = self._llm_interpret(lab_name, value, status, patient_context)
    else:
        interpretation = f"{lab_name} is within normal range."

    return {
        "lab": lab_name,
        "value": value,
        "reference_range": f"{ref['low']}-{ref['high']} {ref['unit']}",
        "status": status,
        "interpretation": interpretation
    }

Notice the optimization: we only call the LLM for abnormal values or when patient context might change the interpretation. A normal glucose in a diabetic patient means something different than in a healthy patient — that's when the LLM adds value. For straightforward normal results, a rule-based response is faster and just as accurate.
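`_classify_value` isn't shown; the rule-based check it implies is a few comparisons. A sketch, assuming critical bounds take precedence over the plain low/high range:

```python
# Sketch of the classification step; "critical" thresholds win
# over the ordinary low/high range when both apply.
def classify_value(value: float, ref: dict) -> str:
    if "critical_low" in ref and value <= ref["critical_low"]:
        return "critical_low"
    if "critical_high" in ref and value >= ref["critical_high"]:
        return "critical_high"
    if value < ref["low"]:
        return "low"
    if value > ref["high"]:
        return "high"
    return "normal"

# Example using the fasting-glucose range from above
glucose = {"low": 70, "high": 100, "critical_low": 50, "critical_high": 400}
```

Only the non-"normal" outcomes (or a non-empty patient context) then trigger an LLM call.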

EHR De-identification

De-identification is critical for research, training, and any scenario where clinical data needs to be shared without exposing patient identity.

The tool identifies and removes 18 HIPAA identifier categories:

  • Names, dates, phone numbers, emails
  • Social Security numbers, medical record numbers
  • Geographic data smaller than a state
  • Biometric identifiers, device identifiers
  • URLs, IP addresses, account numbers

The LLM approach has an advantage over regex-based de-identification: it understands context. "Dr. Smith recommended the Smith protocol" — the first "Smith" is PHI, the second is a medical protocol name. A regex would remove both; the LLM preserves the clinically meaningful reference.
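That limitation is easy to demonstrate in a few lines with plain `re` (a naive baseline, not the article's actual de-identifier):

```python
import re

text = "Dr. Smith recommended the Smith protocol"
# A name pattern can't tell a person from a protocol name,
# so both occurrences get redacted:
redacted = re.sub(r"\bSmith\b", "[NAME]", text)
print(redacted)  # Dr. [NAME] recommended the [NAME] protocol
```

A context-aware model can keep the second "Smith" because it reads the surrounding words, not just the token.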

Docker Deployment

Every tool ships with Docker Compose for one-command deployment:

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  app:
    build: .
    ports:
      - "8501:8501"  # Streamlit UI
      - "8000:8000"  # FastAPI API
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434

Run `docker compose up` and you have a fully functional healthcare AI tool running locally. No API keys, no cloud accounts, no data leaving your network.

Results and Impact

Across the four healthcare tools, the architecture delivers:

  • Zero data transmission — verified with network monitoring, no outbound connections during inference
  • Fast response times — Gemma 4 on a consumer GPU generates clinical summaries in 0.8-2 s
  • Consistent accuracy — low temperature + structured prompting produces reliable, reproducible outputs
  • Complete audit trail — every tool invocation logged with timestamp, input hash, and output
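A minimal sketch of such an audit entry (field names and the helper are assumptions, not the repo's code); note that hashing the input and output means the log itself never stores PHI:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(tool: str, input_text: str, output_text: str) -> str:
    """One JSON audit line per tool invocation: hashes instead of
    raw text, so the audit trail contains no patient data."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "input_sha256": hashlib.sha256(input_text.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output_text.encode()).hexdigest(),
    }
    return json.dumps(entry)
```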

What's Next

I'm currently exploring:

  1. FHIR R4 integration — mapping tool outputs to FHIR resources for EHR interoperability
  2. A2A (Agent-to-Agent) protocol — enabling healthcare agents to collaborate (e.g., intake summarizer triggers lab interpreter which triggers risk assessment)
  3. Federated evaluation — benchmarking accuracy across institutions without sharing data

The code is open source. If you're building healthcare AI that respects patient privacy, check out the repos:


*Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He has built 116+ open-source repositories, including a suite of privacy-first healthcare AI tools. Find his work on GitHub and dev.to.*

#healthcareai #python #security

Top comments (1)

Ali Muwwakkil

An often overlooked advantage of using local LLMs in healthcare AI is the reduction in latency. In our experience with enterprise teams, we've found that local processing can significantly speed up response times, which is crucial in time-sensitive scenarios like real-time diagnostics. It also allows for more robust integration with existing infrastructure, minimizing disruptions to workflows. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)