The final part of a three-part series on building our first MCP server for healthcare interoperability.
Where We Left Off
Part 1 covered the why — the problem space, the choice of MCP, and the architectural decisions. Part 2 covered the how — the indexer, URI scheme, tool handlers, and transport layer. This final post covers the operational reality: how we test an MCP server, the developer workflow, deploying to real AI clients, and the honest retrospective on what worked and what we'd change.
Testing an MCP Server: It's Weirder Than You Think
Testing a regular API is well-understood: spin up a server, send requests, assert on responses. Testing an MCP server adds a twist: your primary consumer is an AI, and you can't write assertions about AI behavior.
We developed a three-layer testing strategy:
Layer 1: Unit Tests for Handlers
Each handler is a pure function: it takes a Pydantic model and returns a dict. This makes unit testing straightforward.
The trick is the database. Our handlers query SQLite, so we needed a test database. We chose temporary databases per test module — each test file creates a fresh SQLite database in a temp directory, inserts known test data, and tears it down after.
The pattern looks like this conceptually:
┌─────────────────────────────────────────────────┐
│ Test Setup │
│ 1. Create temp SQLite file │
│ 2. Create schema (same as production) │
│ 3. Insert known test data (Patient, etc.) │
│ 4. Rebuild FTS index │
│ 5. Point FHIR_MCP_INDEX_PATH to temp file │
│ 6. Reload storage modules (pick up new path) │
├─────────────────────────────────────────────────┤
│ Test Execution │
│ - Import handler, create input model, call it │
│ - Assert on returned metadata and payload │
├─────────────────────────────────────────────────┤
│ Teardown │
│ - Delete temp file │
│ - Restore environment │
└─────────────────────────────────────────────────┘
A subtle issue we hit: module-level state. Our SQLite store reads DB_PATH from an environment variable at module load time. In tests, we need to set the environment variable before the module is imported, or reload the module after setting it. We solved this with importlib.reload() — ugly but effective.
If we were starting over, we'd inject the database path through the Settings object rather than reading environment variables at module scope. Lesson learned.
Here are the kinds of tests we found most valuable:
Happy path tests: "Give me Patient from R4 → returns metadata with name='Patient'." These catch regressions in the handler logic or the SQL queries.
Not-found tests: "Give me NonExistentResource from R4 → returns empty dict, not an exception." These are critical because the AI will inevitably ask for things that don't exist, and the server must handle that gracefully.
FTS tests: "Search for 'Patient' → returns at least one result. Search for 'xyznonexistent' → returns empty list." These verify that the full-text search index is working and that our FTS queries are correct.
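The three test kinds above can be sketched against a throwaway in-memory index. The table names, column names, and handler signatures here are illustrative assumptions, not the real schema:

```python
import sqlite3


def build_index(conn: sqlite3.Connection) -> None:
    """Create a minimal stand-in index with one known resource."""
    conn.execute("CREATE TABLE resources (version TEXT, name TEXT)")
    conn.execute("INSERT INTO resources VALUES ('R4', 'Patient')")
    conn.execute("CREATE VIRTUAL TABLE resources_fts USING fts5(name)")
    conn.execute("INSERT INTO resources_fts VALUES ('Patient')")


def get_definition(conn: sqlite3.Connection, version: str, name: str) -> dict:
    row = conn.execute(
        "SELECT name FROM resources WHERE version = ? AND name = ?", (version, name)
    ).fetchone()
    return {"name": row[0]} if row else {}  # empty dict, never an exception


def search(conn: sqlite3.Connection, query: str) -> list:
    rows = conn.execute(
        "SELECT name FROM resources_fts WHERE resources_fts MATCH ?", (query,)
    ).fetchall()
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
build_index(conn)
assert get_definition(conn, "R4", "Patient") == {"name": "Patient"}  # happy path
assert get_definition(conn, "R4", "NonExistentResource") == {}       # not found
assert search(conn, "Patient") == ["Patient"]                        # FTS hit
assert search(conn, "xyznonexistent") == []                          # FTS miss
```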
Layer 2: URI Scheme Tests
The URI parser and formatter are pure functions with no dependencies. Testing them is simple and satisfying:
Parse "fhir://R4/StructureDefinition/Patient"
→ { scheme: "fhir", version: "R4", name: "Patient" } ✓
Parse "ig://hl7.fhir.us.core/StructureDefinition/us-core-patient"
→ { scheme: "ig", version: "hl7.fhir.us.core", name: "us-core-patient" } ✓
Parse "not-a-valid-uri"
→ None ✓
Format fhir_uri("R4", "Patient")
→ "fhir://R4/StructureDefinition/Patient" ✓
We tested the round-trip: format a URI, parse it, verify the components match. This caught a few edge cases with dots in IG names and hyphens in profile names.
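A minimal sketch of the formatter, parser, and round-trip check — the regexes are illustrative assumptions, and the real parser may be stricter:

```python
import re

_FHIR = re.compile(r"^fhir://([^/]+)/StructureDefinition/([^/]+)$")
_IG = re.compile(r"^ig://([^/]+)/StructureDefinition/([^/]+)$")


def fhir_uri(version: str, name: str) -> str:
    return f"fhir://{version}/StructureDefinition/{name}"


def parse_uri(uri: str):
    for scheme, pattern in (("fhir", _FHIR), ("ig", _IG)):
        m = pattern.match(uri)
        if m:
            # For ig:// URIs the "version" slot carries the package id.
            return {"scheme": scheme, "version": m.group(1), "name": m.group(2)}
    return None  # invalid URIs parse to None, not an exception


# Round-trip: format, then parse; components must survive dots and hyphens.
assert parse_uri(fhir_uri("R4", "Patient")) == {
    "scheme": "fhir", "version": "R4", "name": "Patient"
}
assert parse_uri("ig://hl7.fhir.us.core/StructureDefinition/us-core-patient") == {
    "scheme": "ig", "version": "hl7.fhir.us.core", "name": "us-core-patient"
}
assert parse_uri("not-a-valid-uri") is None
```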
Layer 3: Smoke Tests
The smoke test script is our "does the whole thing work?" check. It:
- Verifies the SQLite index file exists.
- Queries for a known resource (Patient) by exact match.
- Runs an FTS search and verifies results come back.
This runs against the real index (not a test database) and is designed to catch "the build broke the index" or "the schema changed in a way that breaks queries."
We run smoke tests as part of our local dev workflow — Tilt triggers them after building the index, and they fail-fast if anything is wrong.
What We Didn't Test (And Should Have)
Integration tests against the transport layer. We tested handlers and storage independently but never tested the full flow: "send a JSON-RPC message on stdin → get a response on stdout." This meant that when we had the stdout buffering issue (mentioned in Part 2), we didn't catch it until manual testing with Claude Desktop.
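The test we should have written is small: spawn the server as a subprocess and speak line-delimited JSON-RPC over its pipes. Here a tiny inline stand-in plays the server so the sketch is self-contained; a real test would launch `python -m apps.mcp_server.main` instead:

```python
import json
import subprocess
import sys

# Stand-in server: reads one JSON-RPC request per line, answers, and flushes.
SERVER = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    resp = {"jsonrpc": "2.0", "id": req["id"], "result": {"tools": []}}
    sys.stdout.write(json.dumps(resp) + "\n")
    sys.stdout.flush()  # the bug we missed: without flush, the client hangs
"""

proc = subprocess.Popen(
    [sys.executable, "-c", SERVER],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write(
    json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list"}) + "\n"
)
proc.stdin.flush()
reply = json.loads(proc.stdout.readline())
assert reply["id"] == 1 and "result" in reply
proc.stdin.close()
proc.wait(timeout=5)
```

A test like this would have caught the stdout buffering bug automatically: remove the flush in the child and readline() blocks forever.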
Schema evolution tests. When we added PostgreSQL support, we had to ensure both backends returned the same shape of data. We should have written cross-backend tests from the start.
The Developer Experience: Tilt, Docker, and the Inner Loop
Why Tilt?
If you haven't used Tilt, it's a local development orchestrator. You define resources (build steps, services, health checks) in a Tiltfile, and Tilt manages the lifecycle: watching for file changes, rebuilding what's needed, restarting services, and showing you a dashboard of what's running.
For our project, Tilt orchestrates four steps:
┌──────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐
│ uv sync │───▶│ fetch │───▶│ build │───▶│ MCP server │
│ │ │ packages │ │ index │ │ (HTTP mode) │
└──────────┘ └─────────────┘ └─────────────┘ └──────────────┘
Step dependencies, as Tilt sees them:
- uv sync: watches pyproject.toml
- fetch packages: watches fetch_packages.py; runs after uv-sync
- build index: watches fixtures/ and packages/; runs after fetch-packages
- MCP server: runs after build-index; readiness check: GET /health
Each step declares its dependencies. If you change pyproject.toml, everything rebuilds. If you only change a handler file, only the server restarts. Tilt tracks file changes and does the minimum work needed.
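Conceptually, the wiring looks like this Tiltfile sketch (Tiltfiles are written in Starlark, a Python dialect). The port, script paths, and resource names are assumptions, not our exact file:

```python
# Tiltfile sketch: four local_resource steps chained by resource_deps.
local_resource("uv-sync", "uv sync", deps=["pyproject.toml"])

local_resource(
    "fetch-packages",
    "python scripts/fetch_packages.py",
    deps=["scripts/fetch_packages.py"],
    resource_deps=["uv-sync"],
)

local_resource(
    "build-index",
    "python scripts/build_index.py",
    deps=["fixtures/", "packages/"],
    resource_deps=["fetch-packages"],
)

local_resource(
    "mcp-server",
    serve_cmd="python -m apps.mcp_server.main --http",
    resource_deps=["build-index"],
    readiness_probe=probe(http_get=http_get_action(port=8000, path="/health")),
)
```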
Why not just a shell script? We had one initially:
uv sync
python scripts/fetch_packages.py
python scripts/build_index.py
python -m apps.mcp_server.main --http
The problem: when you change a handler, you have to Ctrl+C and rerun the whole thing. Tilt watches files and restarts only the server, keeping the index intact. It also gives you a dashboard showing the status of each step, and readiness probes that tell you when the server is actually ready (not just started).
The Tilt Configuration
Two key decisions in our Tilt setup:
Dual-backend support. The Tiltfile reads FHIR_MCP_STORAGE_BACKEND from the environment and configures either SQLite or PostgreSQL accordingly. For PostgreSQL, it uses docker-compose to spin up a Postgres container. For SQLite, everything is local files.
Health checks on the HTTP server. The MCP server in HTTP mode exposes GET /health which returns {"status": "ok"}. Tilt polls this endpoint to know when the server is ready. This prevents you from sending requests to a server that's still starting up.
Docker: The Deployment Story
Our Dockerfile follows a simple pattern:
FROM python:3.13-slim
→ Install dependencies with uv
→ Copy source code
→ Run fetch + build index at build time
→ CMD: start the MCP server (stdio mode)
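Fleshed out, that outline might look like the following Dockerfile sketch. The file names and uv invocation are assumptions; the real build likely differs in detail:

```dockerfile
FROM python:3.13-slim

# Install uv, then project dependencies (cached until pyproject.toml changes).
RUN pip install --no-cache-dir uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen

# Copy source, then bake the index into the image at build time.
COPY . .
RUN python scripts/fetch_packages.py && python scripts/build_index.py

# stdio transport: the MCP client talks to the container over stdin/stdout.
CMD ["python", "-m", "apps.mcp_server.main"]
```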
Building the index at image build time is deliberate. The Docker image ships with a pre-built index, so the container starts instantly at runtime. The tradeoff is that the image is larger (includes the SQLite database), but startup is fast and there are no runtime initialization steps.
The docker-compose.yml mounts the data directory as a volume. This means you can rebuild the index on the host and have the container pick it up without rebuilding the image.
A subtlety: the container runs in stdin_open: true and tty: true mode. This is necessary for stdio transport — Docker needs to keep stdin open for the MCP client to communicate with the server.
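The relevant part of the compose file is small. The service name and mount paths here are illustrative assumptions:

```yaml
services:
  fhir-mcp:
    build: .
    stdin_open: true   # keep stdin open for the MCP client
    tty: true
    volumes:
      - ./data:/app/data   # rebuild the index on the host; container picks it up
```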
Deploying to Real AI Clients
Claude Desktop
Claude Desktop supports MCP servers natively. Configuration is a JSON file:
{
  "mcpServers": {
    "fhir-mcp": {
      "command": "python",
      "args": ["-m", "apps.mcp_server.main"],
      "cwd": "/path/to/fhir-mcp"
    }
  }
}
Claude Desktop spawns the process, communicates over stdio, and presents the tools in its UI. The user can then ask questions like "What fields are in a FHIR R4 Patient resource?" and Claude will call fhir.get_definition behind the scenes.
Things we learned with Claude Desktop:
- The cwd must be the project root (where pyproject.toml lives), not the apps/ directory. Relative paths in settings (like data/index/fhir_index.sqlite) resolve from cwd.
- If the server crashes, Claude Desktop may not show a clear error. Check stderr output to diagnose issues.
- Claude is remarkably good at choosing the right tool. With descriptive tool names and typed inputs, it correctly uses fhir.search for exploration and fhir.get_definition for exact lookups.
Cursor
Cursor's MCP configuration is nearly identical:
{
  "mcpServers": {
    "fhir-mcp": {
      "command": "python",
      "args": ["-m", "apps.mcp_server.main"],
      "cwd": "/path/to/fhir-mcp"
    }
  }
}
Differences we noticed:
- Cursor tends to call tools in a coding context (while you're editing files), so the prompts and results are optimized for developer workflows.
- Response formatting matters more in Cursor because results appear inline with code.
Key Takeaway on Client Support
Because MCP standardizes the protocol, supporting multiple clients was trivial. We wrote zero client-specific code. The same server binary, the same tools, the same transport — just different JSON config files for each client.
This was one of MCP's biggest wins for us. We didn't have to build a Claude plugin and a Cursor extension and a VS Code integration. We built one MCP server, and it works everywhere MCP is supported.
Prompts: The Underappreciated Third Pillar
MCP has three primitives: tools, resources, and prompts. We spent most of our effort on tools, some on resources (URI scheme), and almost none on prompts initially. That was a mistake.
Our prompts are simple strings:
"summarize_profile" → "Summarize a FHIR profile in plain language."
"explain_constraint" → "Explain a constraint in a StructureDefinition."
"migration_notes" → "Describe migration notes between FHIR versions."
These seem trivial, but they serve an important purpose: they tell the AI how to use the tools' output. Without prompts, the AI might return raw JSON metadata to the user. With a prompt like "summarize this profile in plain language," the AI knows to translate the technical output into something human-readable.
If we were starting over, we'd invest more in prompts. Specifically:
- Parameterized prompts that include the tool name and expected output format.
- Chain prompts that guide the AI through multi-step workflows: "First call ig.list to see available IGs, then call fhir.search to find the relevant profile, then call fhir.get_definition to get the full definition, then summarize it."
- Domain-specific prompts for common healthcare developer questions: "Compare this resource between R4 and R5 and list breaking changes."
The Honest Retrospective: What Worked, What Didn't, What We'd Change
What Worked
1. The layered architecture. Transport → Registry → Handlers → Packages → Storage. Every layer has one job. Adding PostgreSQL support was a one-layer change. Adding HTTP transport was a one-layer change. Adding a new tool is a two-file change (handler + registry).
2. Pydantic everywhere. Input validation, settings, data models — Pydantic caught bugs early and served as living documentation. The type system paid for itself in the first week.
3. SQLite + FTS5 for local use. Zero-config, fast, reliable. For a single-user local tool, SQLite is hard to beat.
4. Explicit registries. Being able to open one file and see every tool, resource, and prompt in the system is invaluable for onboarding and debugging.
5. The stub pattern. Having validate.instance as a stub from day one meant the interface contract was established early. When we eventually implement it, the tool name, input schema, and registry entry already exist.
What Didn't Work
1. Module-level state. Reading environment variables at module load time (e.g., DB_PATH = os.environ.get(...)) made testing painful. We had to reload modules to pick up test configuration. Dependency injection through the Settings object would have been cleaner.
2. The Tool class is boilerplate-heavy. Every handler file defines the same Tool class with the same three attributes. We should have defined it once in a shared module. We resisted DRY initially because we valued independence between handlers, but the duplication became annoying.
3. No end-to-end transport tests. We tested handlers and storage in isolation but never tested "JSON on stdin → JSON on stdout." The stdout buffering bug could have been caught by an automated test.
4. Prompts were an afterthought. We treated them as static strings rather than the powerful interaction guides they could be. They deserve the same rigor as tool definitions.
5. No client-facing schema export. MCP clients can request the tool schemas (input models) to understand what each tool expects. We return tool names in list_tools but don't include the full JSON schema for each tool's input model. Adding this would make it easier for clients (and AIs) to understand the tool interface without documentation.
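Because every input is already a Pydantic model, the schema export is nearly free: Pydantic v2's model_json_schema() emits JSON Schema directly. The tool dict shape below is an illustrative assumption:

```python
from pydantic import BaseModel


class GetDefinitionInput(BaseModel):
    version: str
    name: str


def list_tools() -> list:
    """Return tool entries including each input model's JSON Schema."""
    return [{
        "name": "fhir.get_definition",
        "description": "Fetch a StructureDefinition by exact name.",
        # JSON Schema derived from the Pydantic model, for free.
        "inputSchema": GetDefinitionInput.model_json_schema(),
    }]


tools = list_tools()
assert tools[0]["inputSchema"]["properties"].keys() == {"version", "name"}
```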
What We'd Change in v2
1. Use a proper MCP SDK. We built the transport layer by hand (reading JSON-RPC from stdin, writing responses). There are now Python MCP SDKs that handle the protocol details. We'd use one of those instead of rolling our own.
2. Async handlers. Our handlers are synchronous. For a local SQLite-based server, this is fine. But with PostgreSQL or potential network-based data sources, async would allow concurrent tool calls. The MCP protocol supports this.
3. Streaming responses. For large payloads (like a full StructureDefinition), streaming would be better than loading the entire JSON into memory and truncating. MCP supports progressive responses, and we should use them.
4. Richer diff tool. The fhir.diff_versions tool currently only compares top-level metadata. A proper diff that compares element paths, cardinality changes, and type modifications would be dramatically more useful for migration work.
5. Package management in the server. Currently, packages are fetched and indexed offline by running scripts. Ideally, the server (or a companion tool) could fetch FHIR packages from a registry, index them, and make them available — all through MCP tools that the AI could invoke.
The Bigger Picture: What Building an MCP Server Taught Us
MCP changes how you think about AI integration
Before MCP, we thought about AI integration as "give the AI context and hope for the best." After building an MCP server, we think about it as "give the AI typed, validated tools and let it be an agent."
The difference is profound. With context stuffing, you're limited by the context window and the AI's ability to find the needle in the haystack. With MCP tools, the AI can make targeted, efficient queries — just like a developer would.
Healthcare needs more MCP servers
FHIR is just one specification. Healthcare interoperability involves CDA, HL7v2, SMART on FHIR, Bulk Data, DaVinci IGs, and dozens of other standards. Each of these could benefit from an MCP server that lets AI assistants look up specifications accurately instead of hallucinating.
The bar for building an MCP server is low
Our first working version was built in a few days. The core is ~500 lines of Python across the transport, registry, and handlers. The indexer is ~100 lines. The rest is data and configuration.
If you have a domain-specific data source that AI assistants get wrong, building an MCP server for it is probably easier than you think. The protocol is simple, the pattern is clear, and the payoff — AI that gives accurate, grounded answers about your domain — is immediate.
Quick-Start Mental Model
If you're thinking about building your own MCP server, here's the mental model we'd recommend:
┌─────────────────────────────────────────────────────────────────┐
│ YOUR MCP SERVER │
│ │
│ 1. DATA LAYER │
│ What data do you have? │
│ How will you store/index it? │
│ → SQLite for local, Postgres for shared │
│ │
│ 2. TOOLS │
│ What operations does the AI need? │
│ → One tool per distinct operation │
│ → Pydantic model for every input │
│ → Return structured data, not prose │
│ │
│ 3. RESOURCES │
│ What data should be directly addressable by URI? │
│ → Design URIs that are human-readable and parseable │
│ │
│ 4. PROMPTS │
│ How should the AI present results to users? │
│ → Guide the AI's interpretation of tool output │
│ │
│ 5. TRANSPORT │
│ stdio for AI clients, HTTP for dev/testing │
│ → Keep this layer as thin as possible │
│ │
│ 6. TEST │
│ Unit test handlers with mock data │
│ Smoke test the full pipeline │
│ → Test the transport layer end-to-end │
│ │
│ 7. DEPLOY │
│ JSON config for each AI client │
│ Docker for production │
│ Tilt for local dev │
└─────────────────────────────────────────────────────────────────┘
Final Thoughts
Building an MCP server was one of the most rewarding developer experience projects we've worked on. The feedback loop is immediate — you build a tool, restart the server, ask the AI a question, and watch it use your tool to give a better answer. It's like giving the AI a new superpower, one tool at a time.
If you work in a domain with complex, versioned, structured data — healthcare, legal, finance, infrastructure — and you're tired of AI assistants getting the details wrong, consider building an MCP server. Start small. One tool. One data source. See what happens when the AI can actually look things up instead of guessing.
You might be surprised how much better "AI-assisted" can be when the AI has access to ground truth.
This is Part 3 of a 3-part series.
- Part 0: MCP — The Missing Layer Between AI and Your Application
- Part 1: Why We Built an MCP Server — And What We Learned Before Writing a Single Line of Code
- Part 2: Building the Engine — Tools, URIs, and the Art of Indexing FHIR
- Part 3: Testing, Deploying, and Lessons Learned (this post)
If you'd like to connect, find me on LinkedIn or drop me a message. I'd love to explore how I can help drive your data success!