Chetan Gupta
Part 2: Building the Engine — Tools, URIs, and the Art of Indexing FHIR

Part 2 of a three-part series on building our first MCP server for healthcare interoperability.


Where We Left Off

In Part 1, we talked about why we built an MCP server for FHIR and the architectural decisions we made before writing code. Now we're going to get into the how — the implementation details, the patterns that emerged, and the places where the reality of FHIR made us rethink our approach.


Phase 1: Turning Thousands of JSON Files Into a Searchable Index

The Data Problem

FHIR packages are distributed as folders of JSON files. A single core FHIR package (say, hl7.fhir.r4.core) contains thousands of files: one for each StructureDefinition, ValueSet, CodeSystem, SearchParameter, OperationDefinition, and so on.

Each file looks something like this (simplified):

{
  "resourceType": "StructureDefinition",
  "id": "Patient",
  "url": "http://hl7.org/fhir/StructureDefinition/Patient",
  "name": "Patient",
  "title": "Patient Resource",
  "status": "active",
  "fhirVersion": "4.0.1",
  "kind": "resource",
  "type": "Patient",
  "description": "Demographics and other administrative information about an individual receiving care.",
  "differential": {
    "element": [
      {"id": "Patient", "path": "Patient", "min": 0, "max": "*"},
      {"id": "Patient.identifier", "path": "Patient.identifier", "min": 0, "max": "*"},
      {"id": "Patient.name", "path": "Patient.name", "min": 0, "max": "*"}
    ]
  }
}

The challenge: we need to be able to (a) look up a specific resource by name and version, and (b) do full-text search across all resources. Doing that by scanning the filesystem on every query would be far too slow.

Why SQLite + FTS5

We chose SQLite with FTS5 (Full-Text Search 5) for the index. Here's the reasoning:

Zero infrastructure. SQLite is a single file. No server process, no ports, no configuration. For a local-first tool, this is ideal — the entire database is just a file in data/index/fhir_index.sqlite.

Ships with Python. The sqlite3 module is in Python's standard library. No pip install, no binary dependencies.

FTS5 is surprisingly powerful. SQLite's FTS5 extension supports ranked full-text search with a single SQL query. You create a virtual table that mirrors your main table, and then you can MATCH against it:

SELECT name, title, fhir_version
FROM fhir_resources_fts
WHERE fhir_resources_fts MATCH 'Patient'

This gives you ranked results, and it's fast — milliseconds over thousands of resources.

Predictable performance. SQLite's performance characteristics are well understood. For read-heavy workloads (which is all we do at runtime), it's excellent. No connection pooling, no query planning surprises.

The Indexing Pipeline

The indexer runs as a standalone script, separate from the server. It does three things:

  1. Discover packages. Walk data/packages/ and data/fixtures/, find every directory with a package.json file.
  2. Extract metadata. For each JSON file in a package, read the resource, extract the key fields (canonical URL, name, title, type, version, description), and normalize them into a consistent shape.
  3. Write to SQLite. Insert every resource into the main table, then rebuild the FTS5 index.
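The three steps above can be sketched in a few lines of Python. This is illustrative, not the real indexer: the function names and the exact set of extracted fields are assumptions based on the description above.

```python
import json
from pathlib import Path


def discover_packages(*roots):
    """Step 1: yield every directory under the given roots that
    contains a package.json file."""
    for root in roots:
        for pkg_json in Path(root).rglob("package.json"):
            yield pkg_json.parent


def extract_metadata(resource: dict) -> dict:
    """Step 2: pull out only the fields needed for lookup and search,
    normalized into a consistent flat shape."""
    return {
        "canonical_url": resource.get("url"),
        "name": resource.get("name"),
        "title": resource.get("title"),
        "type": resource.get("resourceType"),
        "summary_text": resource.get("description", ""),
        "json_payload": json.dumps(resource),  # stored, not searched
    }


# Step 3 (sketch): walk the packages and insert each row into SQLite.
# for pkg_dir in discover_packages("data/packages", "data/fixtures"):
#     for f in pkg_dir.glob("*.json"):
#         row = extract_metadata(json.loads(f.read_text()))
#         ...INSERT row into fhir_resources...
```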

Here's the key insight about the extraction step: we don't index everything. A StructureDefinition can have hundreds of elements, extensions, constraints, slicing rules. We extract only the metadata needed for lookup and search:

canonical_url  →  "http://hl7.org/fhir/StructureDefinition/Patient"
name           →  "Patient"
title          →  "Patient Resource"
type           →  "StructureDefinition"   (the resourceType)
fhir_version   →  "R4"
package_name   →  "hl7.fhir.r4.core"
package_version → "4.0.1"
resource_type  →  "Patient"               (the FHIR type, like Patient, Observation)
summary_text   →  "Demographics and other administrative information..."
json_payload   →  (the full JSON, stored for retrieval)

The json_payload is stored but not included in search by default. It's there so we can return the full resource when requested, but we don't want FTS5 indexing the entire JSON blob — that would bloat the index and produce noisy search results.
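One way to get exactly that behavior is an external-content FTS5 table whose indexed columns simply omit json_payload. This is a minimal sketch, not our production schema — the table and column names follow the article, but the real schema has more columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in the real server: data/index/fhir_index.sqlite

# Main table holds everything, including the full JSON payload.
conn.execute("""
    CREATE TABLE fhir_resources (
        id INTEGER PRIMARY KEY,
        name TEXT, title TEXT, summary_text TEXT,
        json_payload TEXT
    )
""")

# External-content FTS5 table mirrors only the searchable columns;
# json_payload is deliberately left out, so the blob is never tokenized.
conn.execute("""
    CREATE VIRTUAL TABLE fhir_resources_fts USING fts5(
        name, title, summary_text,
        content='fhir_resources', content_rowid='id'
    )
""")

conn.execute(
    "INSERT INTO fhir_resources (name, title, summary_text, json_payload) "
    "VALUES (?, ?, ?, ?)",
    ("Patient", "Patient Resource",
     "Demographics and other administrative information...", "{...}"),
)
# Rebuild the FTS index from the content table after the bulk insert.
conn.execute("INSERT INTO fhir_resources_fts(fhir_resources_fts) VALUES('rebuild')")

rows = conn.execute(
    "SELECT name FROM fhir_resources_fts WHERE fhir_resources_fts MATCH 'demographics'"
).fetchall()
print(rows)  # → [('Patient',)]
```

Note that the match on 'demographics' hits the summary text even though the word only appears capitalized there — FTS5's default tokenizer is case-insensitive.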

Normalization: Why It Matters

FHIR resources aren't consistent in their metadata fields. A StructureDefinition has type and kind. A ValueSet has neither. A CodeSystem has a content field that's irrelevant to us. Different FHIR versions may organize fields slightly differently.

We wrote normalization functions for each resource type:

normalize_structure_definition(sd, package_name, version, ...)  flat dict
normalize_value_set(vs, package_name, version, ...)              flat dict
normalize_code_system(cs, package_name, version, ...)            flat dict

Each one extracts the same set of fields into the same shape, regardless of the resource type. This means the handlers don't need to know the difference between indexing a ValueSet and a StructureDefinition — they all look the same in the database.

This was a lesson in write boring normalization code early, save debugging time later. We initially skipped normalization and tried to query the raw JSON fields with SQLite JSON functions. It worked, but the queries were fragile, slow, and different for each resource type. Flat normalization was a much better investment.
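A hedged sketch of what two of these normalizers might look like — the real functions take more parameters and handle more edge cases, but the point is the identical output shape:

```python
def normalize_structure_definition(sd: dict, package_name: str,
                                   package_version: str) -> dict:
    """Flatten a StructureDefinition into the common indexed shape."""
    return {
        "canonical_url": sd.get("url"),
        "name": sd.get("name"),
        "title": sd.get("title", sd.get("name")),
        "type": "StructureDefinition",
        "resource_type": sd.get("type"),       # e.g. "Patient"
        "fhir_version": sd.get("fhirVersion"),
        "package_name": package_name,
        "package_version": package_version,
        "summary_text": sd.get("description", ""),
    }


def normalize_value_set(vs: dict, package_name: str,
                        package_version: str) -> dict:
    """ValueSets lack 'type' and 'kind' — the output shape is identical anyway."""
    return {
        "canonical_url": vs.get("url"),
        "name": vs.get("name"),
        "title": vs.get("title", vs.get("name")),
        "type": "ValueSet",
        "resource_type": None,                 # not applicable to ValueSets
        "fhir_version": None,                  # filled in from package metadata
        "package_name": package_name,
        "package_version": package_version,
        "summary_text": vs.get("description", ""),
    }
```

Because every normalizer emits the same keys, the insert statement downstream is a single parameterized query shared by all resource types.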

The "Rebuild the World" Pattern

Our indexer always starts by deleting all existing data and re-indexing from scratch:

conn.execute("DELETE FROM fhir_resources")
# ... re-index everything ...
conn.execute("INSERT INTO fhir_resources_fts(fhir_resources_fts) VALUES('rebuild')")

This is intentional. We're indexing static, versioned packages — not a stream of live data. The total data volume is small enough (seconds to minutes to index) that incremental updates aren't worth the complexity. "Delete everything and rebuild" is simple, correct, and fast enough.

The FTS5 'rebuild' command is important — it tells SQLite to reconstruct the full-text index from the content table. Without it, the FTS index would be stale after a bulk delete/insert.


Phase 2: Designing the URI Scheme

Why Custom URIs?

MCP has a concept of resources — read-only data items identified by URIs. The AI can "read" a resource by requesting its URI, similar to how a browser requests a URL.

We needed URIs that were:

  1. Human-readable — a developer should be able to look at a URI and know what it refers to.
  2. Parseable — the server needs to extract version, resource type, and name from the URI to do a lookup.
  3. Unambiguous — the same name can exist in different contexts (the Patient StructureDefinition in R4 vs R5, or in US Core vs base FHIR).

We designed three URI schemes:

fhir://R4/StructureDefinition/Patient
 │     │         │               │
 │     │         │               └── Resource name
 │     │         └── Resource kind
 │     └── FHIR version
 └── Scheme (core FHIR)

ig://hl7.fhir.us.core/5.0.1/StructureDefinition/us-core-patient
 │        │              │            │                │
 │        │              │            │                └── Profile name
 │        │              │            └── Resource kind
 │        │              └── IG version
 │        └── IG package name
 └── Scheme (Implementation Guide)

uscore://5.0.1/StructureDefinition/us-core-patient
  │       │           │                  │
  │       │           │                  └── Profile name
  │       │           └── Resource kind
  │       └── US Core version
  └── Scheme (convenience shorthand for US Core)

The uscore:// scheme is a convenience alias. US Core is by far the most commonly referenced IG in the US healthcare ecosystem, so it gets a shorthand.

Parsing and Formatting

We built a small uri_scheme package with two modules:

Parsing uses a regex to decompose a URI into its components:

"fhir://R4/StructureDefinition/Patient"
   { scheme: "fhir", version: "R4", name: "Patient" }

Formatting does the reverse — construct a URI from components:

format_fhir_uri("R4", "Patient")
   "fhir://R4/StructureDefinition/Patient"

A design decision we made here: StructureDefinition is hardcoded in the URI path. We debated making the resource type a variable, but in practice, 95%+ of the resources that AI assistants ask about are StructureDefinitions (or profiles, which are StructureDefinitions). ValueSets and CodeSystems are almost always accessed via search, not direct URI lookup. Hardcoding simplified the URI scheme and made the common case cleaner.

If we ever need to support fhir://R4/ValueSet/administrative-gender, we can extend the regex. But we haven't needed to yet, and premature generalization would have complicated the parser for no benefit.
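For illustration, a minimal parser/formatter pair for the fhir:// scheme might look like this. The regex is a sketch, not the production one from our uri_scheme package:

```python
import re

# StructureDefinition is hardcoded in the path, matching the design
# decision above. The version group allows forms like R4, R4B, R5.
FHIR_URI_RE = re.compile(
    r"^fhir://(?P<version>R\d+[A-Z]?)/StructureDefinition/(?P<name>[A-Za-z0-9\-]+)$"
)


def parse_fhir_uri(uri: str) -> dict:
    """Decompose a fhir:// URI into its components, or raise on bad input."""
    m = FHIR_URI_RE.match(uri)
    if m is None:
        raise ValueError(f"Not a valid fhir:// URI: {uri}")
    return {"scheme": "fhir", "version": m.group("version"), "name": m.group("name")}


def format_fhir_uri(version: str, name: str) -> str:
    """The reverse: construct a URI from components."""
    return f"fhir://{version}/StructureDefinition/{name}"


print(parse_fhir_uri("fhir://R4/StructureDefinition/Patient"))
# → {'scheme': 'fhir', 'version': 'R4', 'name': 'Patient'}
```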


Phase 3: Building the Tool Handlers

The Handler Pattern

Every tool in our server follows the exact same structure:

1. Define a Pydantic model for the input
2. Write a handler function that takes the model and returns a result
3. Wrap them in a Tool object
4. Register the Tool in the registry

This isn't accidental — we arrived at it after trying a few alternatives.

Attempt 1: Functions with `**kwargs`. We tried defining handlers as functions that accept keyword arguments directly. The problem: no validation, no type checking, no way for MCP to communicate the expected schema to the AI. The AI would send inputs in unexpected shapes and we'd get runtime KeyErrors.

Attempt 2: Decorated functions. We tried a decorator approach where you'd annotate a function and metadata would be extracted automatically. Clever, but opaque. When something went wrong, the stack trace pointed to decorator internals, not our code. And new team members couldn't understand how tools were registered without understanding the decorator machinery.

Attempt 3 (what we kept): Explicit Tool class. A simple class with three attributes: name, input_model, handler. No magic. No metaclasses. The registration is a dictionary assignment. The cost is a few extra lines per tool. The benefit is total clarity.

Here's the conceptual pattern:

┌──────────────────────────────────────────────────────┐
│  Handler File: fhir_search.py                        │
│                                                      │
│  1. Input Model (Pydantic)                           │
│     query: str                                       │
│     version: Optional[str]                           │
│     kind: Optional[str]                              │
│     top_n: int = 10                                  │
│                                                      │
│  2. Handler Function                                 │
│     Takes validated input → queries SQLite FTS        │
│     Returns list of matching resources               │
│                                                      │
│  3. Tool Object                                      │
│     name = "fhir.search"                             │
│     input_model = FhirSearchInput                    │
│     handler = fhir_search_handler                    │
└──────────────────────────────────────────────────────┘
          │
          ▼
┌──────────────────────────────────────────────────────┐
│  Registry: tools.py                                  │
│                                                      │
│  TOOL_REGISTRY = {                                   │
│      "fhir.search": fhir_search_tool,                │
│      "fhir.get_definition": fhir_get_definition_tool,│
│      ...                                             │
│  }                                                   │
└──────────────────────────────────────────────────────┘
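In code, the pattern above looks roughly like this. The real server uses Pydantic models; a dataclass stands in here to keep the sketch dependency-free, and the handler is a stub rather than a real FTS query:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FhirSearchInput:
    """Stand-in for the Pydantic input model: same fields, same defaults."""
    query: str
    version: Optional[str] = None
    kind: Optional[str] = None
    top_n: int = 10


@dataclass
class Tool:
    """No magic: a name, an input model, a handler.
    Registration is a dictionary assignment."""
    name: str
    input_model: type
    handler: Callable[[Any], Any]


def fhir_search_handler(params: FhirSearchInput) -> list[dict]:
    # Stub: the real handler queries the SQLite FTS index.
    return [{"name": "Patient", "query": params.query}]


fhir_search_tool = Tool("fhir.search", FhirSearchInput, fhir_search_handler)

TOOL_REGISTRY = {fhir_search_tool.name: fhir_search_tool}


def invoke(name: str, raw_input: dict):
    """Dispatch: look up the tool, validate input, call the handler."""
    tool = TOOL_REGISTRY[name]
    validated = tool.input_model(**raw_input)  # validation happens here
    return tool.handler(validated)
```

The dispatch path is one dictionary lookup and one constructor call — when something breaks, the stack trace points straight at this code, not at decorator machinery.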

Tool-by-Tool: The Thinking Behind Each One

Let's walk through each tool and the reasoning behind it.

fhir.get_definition — The Surgical Lookup

What it does: Given a FHIR version, resource kind, and name, returns the metadata (and optionally the full JSON) for that specific resource.

Why it exists: This is the most fundamental operation. When an AI is discussing the Patient resource, it needs to be able to say "let me look that up" and get the authoritative definition. Not a search result. Not a "maybe." The exact definition.

Design choices:

  • include_json defaults to false. Metadata (name, title, canonical URL, version, description) is usually enough for the AI to answer a question. The full JSON is huge and should only be retrieved when specifically needed.
  • When include_json is true, the payload is truncated to 10,000 characters. A full StructureDefinition can be 50KB+. Truncation keeps the response within reasonable context window limits while still providing useful structural information.
  • Returns (meta_dict, json_string) — separating metadata from the payload lets the AI decide what to use without parsing raw JSON.
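The return contract described above can be sketched like this — the function name, row shape, and truncation helper are illustrative, not the server's actual signatures:

```python
def get_definition(row: dict, include_json: bool = False,
                   max_chars: int = 10_000):
    """Return (meta_dict, json_string) for an indexed resource row.
    The payload is truncated to keep responses inside context limits."""
    meta = {k: v for k, v in row.items() if k != "json_payload"}
    payload = row["json_payload"][:max_chars] if include_json else None
    return meta, payload
```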

fhir.search — The Exploration Tool

What it does: Full-text search across all indexed resources, with optional filters for version, kind, and IG.

Why it exists: Sometimes the AI doesn't know the exact resource name. A user might ask "what FHIR resource handles allergies?" The AI needs to search, not just look up. This tool lets it query the index the same way a human would search a specification.

Design choices:

  • top_n defaults to 10. Returning too many results wastes context. 10 is enough for the AI to find what it needs.
  • Filters are all optional. You can search across everything (query: "allergy"), or narrow it down (query: "allergy", version: "R4", kind: "StructureDefinition").
  • Results include metadata only, not full JSON. If the AI finds what it's looking for, it can follow up with fhir.get_definition for the full payload.

ig.list — The Discovery Tool

What it does: Returns a list of all Implementation Guides that have been indexed.

Why it exists: Before the AI can query an IG, it needs to know what IGs are available. This tool answers the question "what IGs does this server know about?" It's the starting point for IG-related conversations.

Design choices:

  • Takes no input. It's purely a discovery mechanism.
  • Returns package name, version, and FHIR version for each IG.

uscore.get_profile — The Shortcut

What it does: Fetches a US Core profile by version and name.

Why it exists: US Core is the most commonly referenced IG in US healthcare development. Having a dedicated tool for it (instead of making the AI use fhir.get_definition with the right package name) reduces the number of parameters the AI needs to get right and makes the common case faster.

Design choices:

  • Separate from fhir.get_definition even though it queries the same database. The semantic distinction matters to the AI — "get a US Core profile" is a different intent than "get a FHIR definition."

fhir.diff_versions — The Migration Helper

What it does: Compares a StructureDefinition between two FHIR versions (e.g., R4 vs R5).

Why it exists: One of the most common questions in FHIR development is "what changed between versions?" When migrating from R4 to R5, developers need to know which elements were added, removed, or renamed.

Design choices:

  • Currently does a metadata-level diff only — comparing the top-level fields. A full element-path diff (comparing every element in the differential/snapshot) is complex and was deferred.
  • The tool exists with partial functionality rather than not existing at all. This is deliberate: the AI knows the capability exists and can provide partial answers ("the metadata changed in these ways, though a full element diff isn't available yet") rather than no answer.

validate.instance — The Placeholder

What it does: Nothing, currently. Returns a "not implemented" response.

Why it exists as a stub: We wanted the tool in the registry from day one, even though validation is hard. Why?

  • It signals intent. Other developers (and the AI itself) can see that validation is a planned capability.
  • It establishes the input contract early. The Pydantic model defines what validation will eventually accept.
  • It fails gracefully. If the AI tries to use it, it gets a clear "not implemented" message rather than a confusing error.

Phase 4: The Transport Layer — Less Is More

stdio: The Primary Transport

MCP's standard transport is JSON-RPC over stdio. The client (Claude Desktop, Cursor, etc.) spawns the server as a child process, sends JSON on stdin, and reads JSON from stdout. stderr is reserved for logging.

Our stdio transport is surprisingly simple. The core loop:

1. Read a line from stdin
2. Parse it as JSON
3. Route to the right handler based on the "method" field
4. Serialize the response as JSON
5. Write it to stdout + newline
6. Flush

A few things we learned:

Always flush stdout. If you don't explicitly flush after writing, the response may sit in a buffer and the client will hang waiting for it. This bit us during early testing — everything worked in manual testing (where stdout is line-buffered to a terminal) but hung in Claude Desktop (where stdout is fully buffered to a pipe).

Log to stderr, never stdout. Stdout is the protocol channel. Any print statement that goes to stdout will be interpreted as a JSON-RPC message and break the protocol. We learned to use print(..., file=sys.stderr) for all diagnostic output and configured Python's logging to write to stderr.

Catch and serialize all exceptions. If the handler throws, the transport catches it and returns a structured error response. If the transport itself throws (e.g., malformed JSON), it still writes a valid JSON error to stdout. The client should never see a raw traceback on the protocol channel.
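Putting those three lessons together, the core stdio loop looks roughly like this — a sketch under the assumptions above, not our production transport:

```python
import json
import sys


def serve_stdio(handlers: dict) -> None:
    """Minimal JSON-RPC-over-stdio loop. stdout carries only protocol
    messages; all diagnostics go to stderr."""
    for line in sys.stdin:
        try:
            request = json.loads(line)
            handler = handlers[request["method"]]
            response = {"id": request.get("id"),
                        "result": handler(request.get("params"))}
        except Exception as exc:
            # Never let a raw traceback reach the protocol channel:
            # serialize every failure as a structured error.
            response = {"id": None, "error": {"message": str(exc)}}
        sys.stdout.write(json.dumps(response) + "\n")
        sys.stdout.flush()  # without this, the client hangs on a buffered pipe
        print("handled one request", file=sys.stderr)  # logging: stderr only
```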

HTTP: The Development Convenience

We added a simple HTTP transport for development and testing. It runs the same handlers but accepts requests via HTTP POST instead of stdin.

Why? Because testing via stdin is painful. You have to pipe JSON into the process, read from stdout, and deal with buffering. With HTTP, you can use curl:

curl -X POST http://localhost:8080 \
  -H "Content-Type: application/json" \
  -d '{"method": "invoke_tool", "params": {"name": "fhir.search", "input": {"query": "Patient"}}}'

The HTTP server also exposes:

  • GET /health — for readiness probes (important for Tilt, which we'll cover in Part 3)
  • GET /tools — quick way to see what tools are available

We built this using Python's built-in http.server module — no Flask, no FastAPI, no additional dependencies. For a dev-only transport, stdlib is enough.
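A stripped-down version of that dev transport, using only http.server. The routes follow the article; the tool list and the POST echo are placeholders for the shared handler registry:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class DevHandler(BaseHTTPRequestHandler):
    """Dev-only HTTP transport: same handlers, POST body instead of stdin."""

    def do_GET(self):
        if self.path == "/health":
            self._reply(200, {"status": "ok"})          # readiness probe
        elif self.path == "/tools":
            # Placeholder: the real server lists TOOL_REGISTRY keys here.
            self._reply(200, {"tools": ["fhir.search", "fhir.get_definition"]})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder: route request["method"] to the shared handlers here.
        self._reply(200, {"echo": request})

    def _reply(self, status: int, payload: dict):
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


# To run it (blocking):
# HTTPServer(("localhost", 8080), DevHandler).serve_forever()
```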


The Glue: How Settings Hold It All Together

Configuration flows through a single Settings class built with Pydantic Settings:

Settings:
  data_dir:         "data"                              (base data directory)
  index_path:       "data/index/fhir_index.sqlite"      (SQLite index)
  packages_dir:     "data/packages"                     (FHIR packages)
  fixtures_dir:     "data/fixtures"                     (demo data)
  log_level:        "INFO"
  storage_backend:  "sqlite"                            (or "postgres")
  pg_host:          "localhost"                          (PostgreSQL config)
  pg_port:          5432
  pg_database:      "fhir_mcp"
  ...

Everything is configurable via environment variables with the FHIR_MCP_ prefix. So FHIR_MCP_INDEX_PATH=/custom/path.sqlite overrides the default index path.

Why Pydantic Settings instead of just os.environ.get()? Because:

  • Type coercion. pg_port is declared as int, so the string from the environment is automatically converted.
  • Defaults in one place. You can read the Settings class and see every configuration option with its default.
  • Validation at startup. If you set FHIR_MCP_PG_PORT=not_a_number, Pydantic catches it immediately rather than failing on first database connection.
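To show the behavior without the pydantic-settings dependency, here is a stdlib-only approximation of prefix-based overrides with type coercion. The real class uses Pydantic Settings, which does this (and proper validation) for every field type automatically; this sketch only coerces int fields:

```python
import os
from dataclasses import dataclass, fields


@dataclass
class Settings:
    """Every field can be overridden by an env var with the FHIR_MCP_ prefix."""
    data_dir: str = "data"
    index_path: str = "data/index/fhir_index.sqlite"
    log_level: str = "INFO"
    storage_backend: str = "sqlite"
    pg_host: str = "localhost"
    pg_port: int = 5432
    pg_database: str = "fhir_mcp"

    @classmethod
    def from_env(cls) -> "Settings":
        kwargs = {}
        for f in fields(cls):
            raw = os.environ.get("FHIR_MCP_" + f.name.upper())
            if raw is None:
                continue
            # Coerce int fields; a bad value (e.g. "not_a_number") raises
            # ValueError here, at startup, not on first database connection.
            kwargs[f.name] = int(raw) if f.type in (int, "int") else raw
        return cls(**kwargs)
```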

What Surprised Us About Building Tools for AI

Surprise 1: The AI prefers narrow tools over flexible ones

We initially tried to build a single "query" tool that could do lookups, search, and filtering all in one. The AI struggled with it — too many optional parameters, too many modes. When we split it into focused tools (get_definition for exact lookup, search for exploration, ig.list for discovery), the AI's tool selection accuracy improved dramatically.

Lesson: Build many focused tools, not few flexible ones.

Surprise 2: Optional fields need good defaults

When we had top_n as a required field on the search tool, the AI would sometimes send top_n: 100 or top_n: 1000. When we made it optional with a default of 10, the AI almost always omitted it (using the default) or sent a reasonable value.

Lesson: Defaults guide AI behavior. Choose them carefully.

Surprise 3: Error messages are consumed by the AI, not humans

When a tool returns an error, the AI reads it and decides what to do next. We initially returned generic errors like {"error": "Not found"}. The AI would then tell the user "the resource wasn't found" without any helpful context. When we improved errors to include specifics — {"error": "StructureDefinition 'Patientt' not found in R4. Did you mean 'Patient'?"} — the AI became much better at self-correcting.

Lesson: Write error messages for your AI caller, not for a log file.

Surprise 4: The storage backend swap validated the architecture

Halfway through development, we decided to add PostgreSQL as an alternative storage backend (for teams that wanted shared indexes or larger datasets). Because we'd built the storage layer as an interface — get_definition_by_name(), search_definitions(), list_igs() — we could add a Postgres implementation without touching a single handler or transport file.

The storage module uses a simple factory based on an environment variable:

FHIR_MCP_STORAGE_BACKEND=sqlite  →  uses sqlite_store
FHIR_MCP_STORAGE_BACKEND=postgres →  uses postgres_store

PostgreSQL uses tsvector/tsquery for full-text search instead of FTS5. The query interface is the same. The handlers don't know or care which backend is active.
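The factory itself can be as small as this. The class bodies are stubs — only the interface methods named above and the selection logic are the point:

```python
import os


class SqliteStore:
    """Sketch of the storage interface; real methods query SQLite FTS5."""
    def get_definition_by_name(self, version: str, name: str): ...
    def search_definitions(self, query: str, **filters): ...
    def list_igs(self): ...


class PostgresStore(SqliteStore):
    """Same interface; tsvector/tsquery instead of FTS5 under the hood."""
    pass


def get_store() -> SqliteStore:
    """Select the backend from the environment, defaulting to SQLite."""
    backend = os.environ.get("FHIR_MCP_STORAGE_BACKEND", "sqlite")
    if backend == "postgres":
        return PostgresStore()
    return SqliteStore()
```

Handlers call get_store() once at startup and only ever see the interface, which is what made the backend swap a no-touch change for the handler and transport layers.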

Lesson: Layer your architecture. The decision to separate storage from handlers paid for itself within weeks.


Coming Up in Part 3

In the final post, we'll cover the operational side: how we test an MCP server, the developer experience with Tilt and Docker, lessons learned about deploying to different clients (Claude Desktop vs Cursor), and what we'd do differently if we started over today.


This is Part 2 of a 3-part series.
← Part 1: Why We Built an MCP Server
Part 3: Testing, Deploying, and Lessons Learned → (Coming Soon)

If you'd like to connect, reach out on LinkedIn or drop me a message. I'd love to explore how I can help drive your data success!
