Chetan Gupta

Part 1: Why We Built an MCP Server — And What We Learned Before Writing a Single Line of Code

A three-part series on building our first Model Context Protocol server for healthcare interoperability.


The Problem That Wouldn't Go Away

If you've ever worked in healthcare tech, you know the feeling: someone asks an AI assistant — Claude, ChatGPT, Copilot, whatever — a question about FHIR (Fast Healthcare Interoperability Resources), and the answer is close but dangerously wrong. Maybe it hallucinates a field that doesn't exist in R4. Maybe it confuses a US Core profile with a base resource. Maybe it confidently describes an element that was removed two versions ago.

This isn't the AI's fault. FHIR is a vast, versioned specification. The core spec alone has hundreds of StructureDefinitions, ValueSets, and CodeSystems. Layer on Implementation Guides (IGs) like US Core, and you're dealing with thousands of artifacts across multiple versions (R4, R4B, R5). No language model has all of that committed to memory with version-level precision.

We kept running into this problem on our team. We'd be deep in implementation work — mapping clinical data, validating resources, reviewing profiles — and every time we turned to an AI for help, we had to mentally fact-check every response against the actual specification. It was exhausting.

So we asked ourselves: what if the AI could just look it up?

Not from a web search. Not from its training data. From the actual, versioned, canonical FHIR packages sitting right on our machine.

That's how fhir-mcp was born.


Why MCP? (And Why Not Just an API?)

Before we chose the Model Context Protocol, we considered the obvious alternatives:

Option 1: Fine-tune a model on FHIR specs

We dismissed this quickly. FHIR evolves. New IGs are published constantly. Fine-tuning is expensive, slow, and creates a snapshot in time. We needed something that could reflect the state of your local packages — whatever you've got downloaded today.

Option 2: RAG (Retrieval-Augmented Generation) pipeline

This was tempting. Embed all the JSON, throw it in a vector store, retrieve context at query time. But we realized two things:

  • FHIR resources are highly structured JSON, not prose. Embedding-based search over deeply nested JSON objects loses the structural relationships that matter most.
  • We didn't just want "related text chunks." We wanted the AI to be able to call specific, typed operations: "get me the Patient StructureDefinition from R4," "search across all indexed resources for 'blood pressure,'" "diff the Observation resource between R4 and R5."

Option 3: Build a REST API and tell the user to paste results

This works, but it breaks the flow. The whole point was to let the AI autonomously look things up during a conversation — not to make the human be the middleware.

Why MCP Won

MCP is purpose-built for exactly this: giving AI models structured access to external data and tools. Instead of building a generic API and hoping the AI figures out how to use it, MCP lets you declare:

  • Tools: Functions the AI can call with typed inputs. "Here's a function called fhir.search that takes a query string and optional filters and returns matching FHIR resources."
  • Resources: Read-only data the AI can access via URIs. "Here's fhir://R4/StructureDefinition/Patient — read it to get the Patient definition."
  • Prompts: Reusable prompt templates. "Here's a prompt called summarize_profile that guides you to explain a FHIR profile in plain language."

The AI doesn't need to know how we indexed the data, or where the SQLite database lives, or how the JSON was normalized. It just sees a clean interface of tools it can call.

And critically: MCP is transport-agnostic. The same server can talk to Claude Desktop over stdio, to Cursor over stdio, or to a web client over HTTP. We wouldn't have to rewrite anything when switching clients.
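For concreteness, here is roughly what a `tools/list` result looks like on the wire. The tool name and fields follow this article's `fhir.search` tool; the response shape follows the MCP spec's tool listing (lightly simplified), where `inputSchema` is a JSON Schema the client reads to learn how to call the tool:

```python
# Sketch of an MCP tools/list result. The tool and its fields mirror
# the fhir.search tool described in this article; the envelope is a
# simplified version of the MCP spec's shape.
list_tools_result = {
    "tools": [
        {
            "name": "fhir.search",
            "description": "Full-text search across indexed FHIR resources.",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "version": {"type": "string"},
                    "top_n": {"type": "integer", "default": 10},
                },
                "required": ["query"],
            },
        }
    ]
}

# The client discovers capabilities from this alone — the schema is
# the documentation.
tool = list_tools_result["tools"][0]
```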


The Architectural Decisions We Made on Day One

Before writing any code, we spent time on design decisions that would shape everything downstream. Here's what we chose and why.

Decision 1: Local-First, Read-Only

We made a hard rule: this server will never write data, and it will never call external APIs at runtime.

Why? Because this is a healthcare context. We're indexing StructureDefinitions, not patient data — but even so, the principle matters. If you're building developer tooling in health tech, you want to be able to say "this thing runs entirely on your machine with zero network calls" without an asterisk.

This also made the architecture simpler. No auth, no API keys, no rate limits, no network error handling in the hot path. The server boots, reads from a local SQLite database, and responds. That's it.

Decision 2: Index First, Serve Second

We realized early that "just read the JSON files at query time" wouldn't scale. A full FHIR R4 package has thousands of JSON files. Searching them by scanning the filesystem on every query would be unacceptably slow.

So we split the system into two phases:

  1. Index phase (offline): Read every FHIR package, extract metadata from each resource, and store it in a SQLite database with FTS5 (full-text search). This runs once, before the server starts.
  2. Serve phase (runtime): The MCP server only talks to the SQLite database. Fast, predictable, no filesystem scanning.

This was one of our best decisions. It meant:

  • The indexer could be ugly and slow — it only runs once.
  • The server could be fast and simple — it only does SQL queries.
  • We could later swap SQLite for PostgreSQL without touching the server code (and we did).
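The two phases above can be sketched in a few lines, assuming SQLite's bundled FTS5 extension. The table and column names here are illustrative, not the project's actual schema, and the real indexer extracts far more metadata:

```python
import sqlite3

# Index phase (offline): store FHIR resource metadata in an FTS5 table.
# Schema is illustrative, not the project's actual layout.
def build_index(resources, db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS fhir_fts "
        "USING fts5(name, resource_type, version, description)"
    )
    conn.executemany(
        "INSERT INTO fhir_fts VALUES (?, ?, ?, ?)",
        [
            (
                r.get("name", ""),
                r.get("resourceType", ""),
                r.get("fhirVersion", ""),
                r.get("description", ""),
            )
            for r in resources
        ],
    )
    conn.commit()
    return conn

# Serve phase (runtime): only SQL queries, no filesystem scanning.
def search(conn: sqlite3.Connection, query: str, top_n: int = 10):
    return conn.execute(
        "SELECT name, resource_type FROM fhir_fts "
        "WHERE fhir_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, top_n),
    ).fetchall()
```

The server never touches JSON files at query time; everything it needs is already in the index.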

Decision 3: One Handler Per Tool, Pydantic for Everything

We debated putting all tool logic in one big handler file. We're glad we didn't.

Each MCP tool gets its own file. Each file defines:

  • A Pydantic model for the tool's input
  • A handler function that takes the validated input and returns a result
  • A Tool object that bundles the name, input model, and handler together

Here's why this pattern matters:

Validation happens before logic. If an AI sends garbage input, Pydantic catches it and returns a structured error. The handler never sees invalid data. This is crucial when your caller is an AI — they will send unexpected inputs, and you need to fail cleanly.

Each tool is independently testable. You can unit test the search handler without spinning up the transport layer. You can test the diff handler without having any other tools registered.

Adding a new tool is mechanical. Create a file, define a Pydantic model, write the handler, register it in the tool registry. No touching the transport layer, no modifying the main server loop.

Here's a simplified example of what one handler looks like conceptually:

┌───────────────────────────────────────┐
│  FhirSearchInput (Pydantic Model)     │
│  ├── query: str                       │
│  ├── version: Optional[str]           │
│  ├── kind: Optional[str]              │
│  └── top_n: int = 10                  │
├───────────────────────────────────────┤
│  fhir_search_handler(input) -> list   │
│  └── Calls into SQLite FTS5 search    │
├───────────────────────────────────────┤
│  Tool("fhir.search", model, handler)  │
│  └── Registered in TOOL_REGISTRY      │
└───────────────────────────────────────┘
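In code, that pattern might look like the following sketch, assuming Pydantic v2. The handler body is a stub standing in for the real FTS5 query, and the `Tool` class is illustrative rather than the project's actual implementation:

```python
from typing import Callable, Optional
from pydantic import BaseModel, ValidationError

# Input model: field names, types, and defaults double as documentation
# for the AI caller.
class FhirSearchInput(BaseModel):
    query: str
    version: Optional[str] = None
    kind: Optional[str] = None
    top_n: int = 10

class Tool:
    """Bundles a name, input model, and handler together (illustrative)."""
    def __init__(self, name: str, model: type[BaseModel], handler: Callable):
        self.name, self.model, self.handler = name, model, handler

    def invoke(self, raw_input: dict) -> dict:
        # Validation happens before logic: garbage input becomes a
        # structured error and never reaches the handler.
        try:
            validated = self.model.model_validate(raw_input)
        except ValidationError as e:
            return {"error": e.errors()}
        return {"result": self.handler(validated)}

def fhir_search_handler(inp: FhirSearchInput) -> list:
    # Stub: the real handler calls into the SQLite FTS5 index.
    return [{"name": "Observation", "query": inp.query}][: inp.top_n]

fhir_search_tool = Tool("fhir.search", FhirSearchInput, fhir_search_handler)
```

Because the handler only ever sees a validated `FhirSearchInput`, it can be unit tested in isolation with plain model instances — no transport layer required.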

Decision 4: Registry Pattern for Discovery

MCP requires the server to respond to list_tools, list_resources, and list_prompts requests. The client needs to know what's available before it can call anything.

We used a simple dictionary registry:

TOOL_REGISTRY = {
    "fhir.get_definition": fhir_get_definition_tool,
    "fhir.search": fhir_search_tool,
    "ig.list": ig_list_tool,
    # ... remaining tools registered here
}

This is deliberately low-tech. No decorators, no metaclasses, no auto-discovery. Just a dictionary. When the transport layer receives list_tools, it returns the keys. When it receives invoke_tool, it looks up the tool by name and calls it.

Why not something fancier? Because we wanted to see the full list of tools in one place. When you're building an MCP server, the tool inventory is your API surface. Making it explicit and visible in a single file means any developer can open that one file and understand the entire capability set of the server.
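A self-contained sketch of that dispatch, with stub tool objects standing in for the real ones (the assumed `invoke` method is illustrative, not the project's actual interface):

```python
# Stub tool: echoes its name and arguments. Real tools validate input
# and query the index.
class StubTool:
    def __init__(self, name: str):
        self.name = name

    def invoke(self, arguments: dict) -> dict:
        return {"tool": self.name, "echo": arguments}

TOOL_REGISTRY = {
    "fhir.get_definition": StubTool("fhir.get_definition"),
    "fhir.search": StubTool("fhir.search"),
    "ig.list": StubTool("ig.list"),
}

def list_tools() -> list:
    # list_tools is just the registry's keys: the entire API surface
    # is visible in one place.
    return sorted(TOOL_REGISTRY)

def invoke_tool(name: str, arguments: dict) -> dict:
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    return tool.invoke(arguments)
```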

Decision 5: Transport as a Thin Layer

The transport layer (stdio, HTTP) should do as little as possible. Its job is:

  1. Read a JSON-RPC request from the wire (stdin or HTTP body).
  2. Route it to the right handler.
  3. Write the JSON-RPC response back.

All business logic lives in the handlers. All data access lives in the storage layer. The transport is just plumbing.

This was validated when we added HTTP transport for development. The handler code didn't change at all. We just wrote a new way to receive requests and send responses. The HTTP server even reuses the same tool registry and the same routing logic.
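The thin-layer idea can be sketched like this, assuming newline-delimited JSON-RPC over stdio and a `dispatch(method, params)` callable supplied by the registry layer (a hypothetical signature; error handling trimmed for brevity):

```python
import json
import sys

# One pure function turns a raw JSON-RPC request line into a response
# line. All business logic hides behind dispatch().
def handle_line(line: str, dispatch) -> str:
    request = json.loads(line)
    result = dispatch(request["method"], request.get("params", {}))
    response = {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
    return json.dumps(response)

def serve_stdio(dispatch):
    # The whole transport: read, route, write. Swapping in HTTP means
    # replacing only this loop, not handle_line or anything below it.
    for line in sys.stdin:
        sys.stdout.write(handle_line(line, dispatch) + "\n")
        sys.stdout.flush()
```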

The architecture looks like this:

┌─────────────────────────────────────────────────┐
│                  TRANSPORT LAYER                │
│  ┌───────────────┐    ┌──────────────────────┐  │
│  │  stdio (prod) │    │  HTTP (dev/testing)  │  │
│  └──────┬────────┘    └──────────┬───────────┘  │
│         │                        │              │
│         └───────────┬────────────┘              │
│                     ▼                           │
│           ┌─────────────────┐                   │
│           │  Request Router │                   │
│           └────────┬────────┘                   │
│                    │                            │
├────────────────────┼────────────────────────────┤
│              REGISTRY LAYER                     │
│  ┌──────────┬──────┴──────┬───────────┐         │
│  │  Tools   │  Resources  │  Prompts  │         │
│  └────┬─────┘             └───────────┘         │
│       │                                         │
├───────┼─────────────────────────────────────────┤
│       │          HANDLER LAYER                  │
│  ┌────┴─────────────────────────────────┐       │
│  │  fhir.get_definition                 │       │
│  │  fhir.search                         │       │
│  │  ig.list                             │       │
│  │  uscore.get_profile                  │       │
│  │  fhir.diff_versions                  │       │
│  │  validate.instance                   │       │
│  └────┬─────────────────────────────────┘       │
│       │                                         │
├───────┼─────────────────────────────────────────┤
│       │         PACKAGES LAYER                  │
│  ┌────┴──────────────────────────────────────┐  │
│  │  fhir_index (loaders, normalize, search,  │  │
│  │             storage)                      │  │
│  │  fhir_diff, fhir_validate, uri_scheme     │  │
│  │  shared (models, cache, schemas)          │  │
│  └────┬──────────────────────────────────────┘  │
│       │                                         │
│       ▼                                         │
│  ┌──────────────┐                               │
│  │  SQLite/PG   │                               │
│  │  (FTS index) │                               │
│  └──────────────┘                               │
└─────────────────────────────────────────────────┘

The Hardest Lesson: Designing for an AI Caller is Different

Here's something that surprised us. When you build a traditional API, your caller is a human developer who reads documentation, understands your mental model, and crafts requests thoughtfully.

When your caller is an AI, everything changes:

Tool naming matters enormously. We learned that names like fhir.get_definition and fhir.search aren't just organizational — they're what the AI uses to decide which tool to call. A vague name like lookup or query would lead to the AI guessing wrong. Namespaced, descriptive names (fhir.get_definition, uscore.get_profile, fhir.diff_versions) gave the AI clear signals about when to use each tool.

Input schemas are the AI's documentation. The Pydantic model for each tool isn't just for validation — it's what the AI reads to understand what inputs are expected. Field names, types, and defaults all serve as implicit documentation. We named fields like version, kind, name, top_n rather than abbreviations like v, k, n, limit because the AI interprets these names to understand their meaning.

Return shape consistency matters. Every tool returns a dict with predictable keys. The AI learns patterns quickly — if one tool returns {"meta": {...}} and another returns {"result": [...]}, it adapts. But inconsistency within a single tool across different call patterns (sometimes returning a list, sometimes a dict, sometimes a string) confuses it.

Truncation is a feature, not a bug. FHIR StructureDefinitions can be enormous — tens of thousands of characters of nested JSON. Sending the full thing back would blow the AI's context window. We learned to truncate payloads by default and only include the full JSON when explicitly requested (include_json: true), and even then, cap it at a reasonable size.
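A minimal sketch of that truncation policy: `include_json` is the flag described above, while the cap and the summary fields shown here are illustrative assumptions, not the project's actual shape:

```python
import json

# Return a compact summary by default; include serialized JSON only on
# request, capped at max_chars (an illustrative parameter name).
def render_resource(resource: dict, include_json: bool = False,
                    max_chars: int = 2000) -> dict:
    out = {
        "resourceType": resource.get("resourceType"),
        "id": resource.get("id"),
        "name": resource.get("name"),
    }
    if include_json:
        payload = json.dumps(resource)
        out["truncated"] = len(payload) > max_chars
        out["json"] = payload[:max_chars]
    return out
```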


What We Didn't Build (And Why)

Just as important as what we built is what we deliberately left out of v0.1:

  • No authentication. This is a local-first, single-user tool. Auth would add complexity for zero benefit.
  • No write operations. The AI can look things up, not modify them. This was a safety and simplicity choice.
  • No network calls at runtime. Packages are fetched and indexed offline. The running server is fully air-gapped.
  • No custom FHIR SDK. We considered using existing FHIR Python libraries but decided raw JSON + SQLite was simpler, faster, and gave us full control over what we indexed.
  • No schema validation at the FHIR level. We have a validate.instance tool, but it's deliberately a stub. Proper FHIR validation is an enormous problem (profiles, extensions, invariants, terminology binding). We wanted the tool to exist in the interface — to signal future intent — without pretending we'd solved it.

Setting Up: The Toolchain Choices

A few notes on tooling, because they shaped the developer experience:

Python 3.13+ with uv: We chose Python because FHIR is a data-heavy domain and Python's ecosystem for data manipulation is unmatched. We used uv for dependency management — it's fast, it respects pyproject.toml, and it doesn't fight you. No requirements.txt files, no virtualenv scripts. Just uv sync and go.

Pydantic v2: For input validation and data modeling. Pydantic v2 is significantly faster than v1 and integrates cleanly with pydantic-settings for environment-based configuration.

SQLite with FTS5: For the search index. SQLite is zero-config, ships with Python, and FTS5 gives us full-text search without standing up Elasticsearch. For a local-first tool, this is perfect.

orjson: For JSON serialization/deserialization. FHIR resources are large JSON objects, and orjson is measurably faster than the stdlib json module. In a server that's mostly reading and writing JSON, this matters.


Coming Up in Part 2

In the next post, we'll get into the actual implementation: how we built the indexer, designed the URI scheme, implemented the tool handlers, and wired everything together through the transport layer. We'll share the specific patterns that worked (and the ones we had to throw away).


This is Part 1 of a 3-part series.
Part 0: MCP — The Missing Layer Between AI and Your Application →
Part 2: Building the Engine — Tools, URIs, and the Art of Indexing FHIR →
